build.yml
A build.yml file specifies the structure, type, and names for package contents.
Below is the general syntax of the build.yml file:
Example build.yml:
Reserved words
file - required for leaf nodes; specifies where the source file lives on disk
transform - specifies how the file will be parsed
kwargs - these options are passed through to the parser (usually pandas.read_csv, so that users can skip lines, type columns, specify delimiters, and much more)
checks - experimental data unit tests
environments - experimental environments for checks
package - experimental source specifier; includes an existing package or sub-package in the build tree (see Package Composition)
*?[!] - any character in this group will initiate glob-style pattern matching
transform and kwargs can be provided at the group level, in which case they apply to all descendants until and unless overridden.
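The inheritance rule can be pictured with a short Python sketch. This is not Quilt's implementation: the helper name resolve_settings, the nested-dict layout, and the transform values shown are hypothetical, and the sketch assumes that a node's own kwargs replaces the inherited kwargs wholesale rather than being merged key by key.

```python
# Reserved words listed above; any other key is treated as a child node name.
RESERVED = {"file", "transform", "kwargs", "checks", "environments", "package"}


def resolve_settings(node, inherited=None, path=()):
    """Yield (node_path, transform, kwargs) for every leaf under node.

    Hypothetical helper: a node's own transform/kwargs override the
    inherited values for itself and for all of its descendants.
    """
    inherited = inherited or {"transform": None, "kwargs": {}}
    current = {
        "transform": node.get("transform", inherited["transform"]),
        "kwargs": node.get("kwargs", inherited["kwargs"]),
    }
    if "file" in node:  # leaf node: 'file' is required for leaves
        yield "/".join(path), current["transform"], current["kwargs"]
        return
    for name, child in node.items():
        if name not in RESERVED:
            yield from resolve_settings(child, current, path + (name,))


# Illustrative tree: group-level settings at the top, overridden further down.
tree = {
    "transform": "csv",
    "kwargs": {"sep": ","},
    "raw": {"file": "raw.csv"},
    "tabbed": {
        "kwargs": {"sep": "\t"},  # overrides kwargs for everything below
        "a": {"file": "a.tsv"},
        "b": {"file": "b.txt", "transform": "tsv"},  # overrides transform only
    },
}

resolved = {p: (t, k) for p, t, k in resolve_settings(tree)}
```

Here raw keeps the top-level settings, tabbed/a picks up the overridden kwargs but still inherits the top-level transform, and tabbed/b overrides the transform while inheriting kwargs from its parent group.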
Column types
By default, quilt build converts some file types (e.g., csv, tsv) to Pandas DataFrames using pandas.read_csv. Sometimes, usually due to columns of mixed types, pandas will throw an exception during quilt build. In such cases it's helpful to include column types in build.yml by adding a dtype parameter:
dtype takes a dict where keys are column names and values are valid Pandas column types:
int
bool
float
complex
str
unicode
buffer
See also dtypes.
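Because kwargs are passed through to the parser, a dtype dict ends up as the dtype argument of pandas.read_csv. The sketch below calls pandas.read_csv directly to show what such a dict changes; the column names and data are made up for illustration, and the snippet does not go through quilt build.

```python
import io

import pandas as pd

# Illustrative data: "code" holds zero-padded identifiers.
raw = "code,amount\n007,1.5\n042,2\n"

# Without dtype, pandas infers "code" as an integer column and the
# leading zeros are lost.
inferred = pd.read_csv(io.StringIO(raw))

# With dtype, each named column is read as the requested type.
typed = pd.read_csv(io.StringIO(raw), dtype={"code": str, "amount": float})

print(inferred["code"].tolist())
print(typed["code"].tolist())
```

Declaring types up front removes the guesswork pandas otherwise does per column, which is the usual source of trouble with mixed-type columns.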
Glob and wildcard matching
If a string containing wildcards is used as a node name, it will be matched against the build directory. The filename of any matching path, minus the extension, will be used as the node name. As when specifying a single data node, kwargs and transform may be used to specify how the file should be read.
The following standard wildcard strings are accepted:
** - Match current dir and all subdirs, recursively
* - Match any one or more characters
? - Match any single character
[X] (where X is any number of characters) - Match any one character contained in X
[!X] (where X is any number of characters) - Match any one character not contained in X
All wildcards except ** act in the current directory alone, so * does not match subdir/foo/file.ext, but subdir/foo/* and **/*.ext do.
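The directory-scoping rule can be tried out with Python's own pathlib globbing, which follows the same convention for * and **. This is a stand-in for illustration only: it assumes nothing about how quilt build performs matching internally, and the helper name match and the file top.ext are hypothetical.

```python
import tempfile
from pathlib import Path


def match(root, pattern):
    """Return matching files relative to root as sorted POSIX-style strings."""
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.glob(pattern)
        if p.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "subdir" / "foo").mkdir(parents=True)
    (root / "top.ext").write_text("a")
    (root / "subdir" / "foo" / "file.ext").write_text("b")

    star = match(root, "*")              # current directory only
    nested = match(root, "subdir/foo/*")  # explicit path to the subdirectory
    recursive = match(root, "**/*.ext")   # current directory and all subdirectories

print(star, nested, recursive)
```

Only the second and third patterns reach subdir/foo/file.ext; the bare * stops at the top-level directory.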
If provided, transform and any specific kwargs are used with each matched file. Otherwise, the parent (or default) transform and kwargs will be used.
Finally, if matching results in identical node names, the nodes are renamed in a consistent manner (paths are sorted lexicographically), and any duplicate names are numbered. So for files "foo.txt" and "subdir/foo.txt", the result is "foo" (from foo.txt) and "foo_2" (from "subdir/foo.txt"). The naming behavior is consistent across platforms.
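A minimal sketch of this naming rule, under stated assumptions: the helper node_names is hypothetical rather than Quilt's code, it takes the node name to be the file name minus its last extension, and it gives the first path in sorted order the bare name and later duplicates the suffixes _2, _3, and so on. It does not handle a real file whose name already collides with a numbered name.

```python
from pathlib import PurePosixPath


def node_names(paths):
    """Map each path to a node name, numbering duplicates in sorted order."""
    counts = {}
    names = {}
    for path in sorted(paths):  # lexicographic order keeps numbering stable
        base = PurePosixPath(path).stem  # file name without its extension
        counts[base] = counts.get(base, 0) + 1
        names[path] = base if counts[base] == 1 else f"{base}_{counts[base]}"
    return names


result = node_names(["subdir/foo.txt", "foo.txt", "bar.csv"])
print(result)
```

Sorting before numbering is what makes the outcome independent of the order in which the file system happens to list the files.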