build.yml
A build.yml file specifies the structure, type, and names for package contents.
Below is the general syntax of the build.yml file:
    contents:
      GROUP_NAME:
        # transform: optional. Applies recursively to child nodes. If 'transform'
        # is omitted, Quilt will select a transform by file extension, falling
        # back on the "id" transform, which stores the raw data.
        transform: {id | csv | tsv | ssv | xls | xlsx}
        # kwargs: optional. Applies recursively to child nodes.
        kwargs:
          # Each KEYWORD: VALUE pair is passed directly to the transform function.
          # For example, with csv this is pandas.read_csv.
          KEYWORD: VALUE  # keyword arguments for transform function/method
        DATA_NAME:
          # file: required. Relative path from base of package dir.
          file: FILE_PATH
          # transform: optional. If given, overrides GROUP_NAME's transform;
          # if omitted, the parent (or default) transform is used.
          transform: {id | csv | tsv | ssv | xls | xlsx}
          kwargs:  # optional. Overrides GROUP_NAME's kwargs for this node.
            KEYWORD: VALUE  # keyword arguments for transform function/method
        GLOB_GROUP:  # container node for nodes created from matching files
          QUOTED_GLOB_PATH:  # standard glob path, such as "*.csv"
            transform: csv  # set transform for all matched files (optional)
            kwargs:  # set kwargs for all matched files (optional)
              KEYWORD_ARG: VALUE
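The fallback rule described in the comments above (an explicit transform wins, otherwise the file extension decides, otherwise "id") can be sketched in a few lines of Python. This is only an illustration of the rule as stated on this page, not Quilt's actual code: the function name pick_transform and the assumption that each extension maps to the transform of the same name are hypothetical.

```python
from pathlib import PurePosixPath

# Transform names listed in the build.yml syntax above ("id" is the fallback).
TRANSFORMS = {"csv", "tsv", "ssv", "xls", "xlsx"}


def pick_transform(file_path, explicit=None):
    """Hypothetical helper: choose a transform name for one file.

    Assumes an extension maps to the transform of the same name; Quilt's
    real lookup may differ.
    """
    if explicit is not None:
        # An explicit 'transform:' key always takes precedence.
        return explicit
    ext = PurePosixPath(file_path).suffix.lstrip(".").lower()
    # Unknown or missing extensions fall back to "id" (store raw data).
    return ext if ext in TRANSFORMS else "id"


by_extension = pick_transform("data/foo.csv")
fallback = pick_transform("data/foo.txt")
overridden = pick_transform("data/foo.txt", explicit="csv")
```

Under these assumptions, data/foo.txt is stored raw unless a transform is set on the node or inherited from its group, which is why the example below sets transform: csv explicitly.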
Example build.yml:
    contents:
      data_example:  # create a node named 'data_example'
        transform: csv  # read files using the csv reader
        kwargs:  # optional
          header:  # optional: no header row
          sep: ","  # optional: set field separator
        child:
          file: data/foo.txt  # parsed as CSV, no header
        another_child:
          file: data/bar.txt  # parsed as CSV, no header
        child_from_elsewhere:
          # parsed as TSV (from kwargs set here), no header (from parent's kwargs)
          file: data2/bar.txt
          kwargs:
            sep: '\t'  # use tab as separator
        glob_example:  # create a node named 'glob_example'
          # assuming these files exist: 'somedir/foo.xls', 'somedir/subdir/bar.csv', 'somedir/baz.tsv'
          'somedir/**/*':  # create nodes 'foo', 'bar', and 'baz'
                           # matched files parsed as CSV, no header
        glob_example_2:  # create a node named 'glob_example_2'
          # assuming these files exist: 'chars/bella.txt', 'chars/edward.txt', 'chars/old/esme.txt'
          'chars/*.txt':
            transform: tsv  # create nodes 'bella' and 'edward'
                            # matched files parsed as TSV, no header

Reserved words

  • file - required for leaf nodes; specifies where source file lives on disk
  • transform - specifies how the file will be parsed
  • kwargs - these options are passed through to the parser (usually pandas.read_csv so that users can skip lines, type columns, specify delimiters, and much more)
  • checks - experimental data unit tests
  • environments - experimental environments for checks
  • package - experimental source specifier; includes an existing package or sub-package in the build tree (see Package Composition)
  • *?[!] - any character in this group will initiate glob-style pattern matching
transform and kwargs can be provided at the group level, in which case they apply to all descendants until and unless overridden.
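To make the inheritance rule concrete, here is a minimal Python sketch that walks a build tree (as it might look after the YAML is parsed into nested dicts) and computes the effective transform and kwargs for each leaf. The function resolve and the dotted node paths are hypothetical and only illustrate the rule; this is not Quilt's implementation. Child kwargs are merged key by key over the parent's, which matches the child_from_elsewhere node in the example build.yml above (tab separator from the child, no header from the parent).

```python
RESERVED = {"file", "transform", "kwargs"}


def resolve(node, name="", transform=None, kwargs=None):
    """Hypothetical helper: return {node_path: (file, transform, kwargs)}.

    Values set on a group apply to all descendants until overridden.
    """
    # A node's own transform replaces the inherited one.
    transform = node.get("transform", transform)
    # A node's own kwargs are merged over the inherited kwargs, key by key.
    kwargs = {**(kwargs or {}), **(node.get("kwargs") or {})}
    if "file" in node:
        # Leaf node: report the effective settings.
        return {name: (node["file"], transform, kwargs)}
    leaves = {}
    for child_name, child in node.items():
        if child_name not in RESERVED:
            child_path = f"{name}.{child_name}" if name else child_name
            leaves.update(resolve(child, child_path, transform, kwargs))
    return leaves


# The first part of the example build.yml, written as Python dicts.
contents = {
    "data_example": {
        "transform": "csv",
        "kwargs": {"header": None, "sep": ","},
        "child": {"file": "data/foo.txt"},
        "child_from_elsewhere": {
            "file": "data2/bar.txt",
            "kwargs": {"sep": "\t"},
        },
    }
}

leaves = resolve(contents)
```

Here leaves["data_example.child"] carries the group's csv transform and both group kwargs, while leaves["data_example.child_from_elsewhere"] keeps header from the group and replaces only sep.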

Column types

By default, quilt build converts some file types (e.g., csv, tsv) to Pandas DataFrames using pandas.read_csv. Sometimes, usually due to columns of mixed types, pandas will throw an exception during quilt build. In such cases it's helpful to include column types in build.yml by adding a dtype parameter:
    contents:
      iris:
        file: iris.data
        transform: csv
        kwargs:
          header:
          dtype:
            sepal_length: float
            sepal_width: float
            petal_length: float
            petal_width: float
            class: str
dtype takes a dict where keys are column names and values are valid Pandas column types:
  • int
  • bool
  • float
  • complex
  • str
  • unicode (Python 2 only)
  • buffer (Python 2 only)
See also dtypes.

Glob and wildcard matching

If a string containing wildcards is used as a node name, it will be matched against the build directory. The filename of any matching path, minus the extension, will be used as the node name. As when specifying a single data node, kwargs and transform may be used to specify how the file should be read.
The following standard wildcard strings are accepted:
  • ** - Match current dir and all subdirs, recursively
  • * - Match zero or more characters
  • ? - Match any single character
  • [X] (where X is any number of characters) - Match any one character contained in X
  • [!X] (where X is any number of characters) - Match any one character not contained in X
All wildcards except ** act in the current directory alone, so * does not match subdir/foo/file.ext, but subdir/foo/* and **/*.ext do.
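The scoping rule can be checked with Python's pathlib, whose glob follows the same conventions. This is only a sketch: it assumes Quilt's matching behaves like pathlib.Path.glob, and the helper name matches and the file names are made up for the demonstration.

```python
import tempfile
from pathlib import Path


def matches(base, pattern):
    """Hypothetical helper: sorted relative paths of files matching pattern."""
    return sorted(
        p.relative_to(base).as_posix()
        for p in Path(base).glob(pattern)
        if p.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    # Build a small throwaway directory tree to match against.
    for rel in ["top.ext", "subdir/foo/file.ext", "subdir/foo/notes.txt"]:
        path = base / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()

    star = matches(base, "*")               # current directory only
    nested = matches(base, "subdir/foo/*")  # one explicit directory
    recursive = matches(base, "**/*.ext")   # current directory and all subdirs
```

In this tree, "*" finds only top.ext, "subdir/foo/*" finds both files in that directory, and "**/*.ext" finds top.ext as well as subdir/foo/file.ext.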
If provided, transform and any specific kwargs are used with each matched file. Otherwise, the parent (or default) transform and kwargs will be used.
Finally, if matching results in identical node names, the nodes are renamed in a consistent manner (paths are sorted lexicographically), and any duplicate names are numbered. So for files "foo.txt" and "subdir/foo.txt", the result is "foo" (from foo.txt) and "foo_2" (from "subdir/foo.txt"). The naming behavior is consistent across platforms.
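The renaming rule can be sketched as follows. The function node_names is hypothetical and shows only the behavior described in this paragraph (sort the paths, then number repeated names starting at 2); it does not handle a file whose name already ends in a suffix like _2, and Quilt's own code may differ in such cases.

```python
from collections import Counter
from pathlib import PurePosixPath


def node_names(paths):
    """Hypothetical helper: map each matched path to a unique node name."""
    seen = Counter()
    names = {}
    # Sorting first makes the numbering independent of filesystem order.
    for path in sorted(paths):
        stem = PurePosixPath(path).stem  # filename minus extension
        seen[stem] += 1
        names[path] = stem if seen[stem] == 1 else f"{stem}_{seen[stem]}"
    return names


result = node_names(["subdir/foo.txt", "foo.txt", "bar.csv"])
```

With these inputs, foo.txt sorts before subdir/foo.txt, so it keeps the name foo and the file in subdir becomes foo_2.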