Workflows
New in Quilt 3.3

Workflows

A Quilt workflow is a quality gate that you set to ensure the quality of your data and metadata before it becomes a Quilt package. You can create as many workflows as you like to accommodate all of your data creation patterns.

On data quality

Under the hood, Quilt workflows use JSON Schema to check that package metadata have the right shape. Metadata shape determines which keys are defined, their values, and the types of the values.
Ensuring the quality of your data has long-lasting implications:
  1. Consistency - if labels and other metadata don't use a consistent, controlled vocabulary, reuse becomes difficult and trust in data declines
  2. Completeness - if your workflows do not require users to include files, documentation, labels, etc. then your data is on its way towards becoming mystery data and ultimately junk data that no one can use
  3. Context - data can only be reused if users know where it came from, what it means, who touched it, and what the related datasets are
From the standpoint of querying engines like Amazon Athena, data that lacks consistency and completeness is extremely difficult to query longitudinally and depreciates over time (as team members change, platforms change, and tribal knowledge is lost).

Use cases

  • Ensure that labels are correct and drawn from a controlled vocabulary (e.g. ensure that the only labels in a package of images are either "bird" or "not bird"; avoid data entry errors like "birb")
  • Ensure that users provide a README.md for every new package
  • Ensure that included files are non-empty
  • Ensure that every new package (or dataset) has enough labels so that it can be reused (e.g. Date, Creator, Type, etc.)
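As a rough illustration of the label-related use cases above, the following sketch builds a Draft 7 metadata schema as a Python dictionary and serializes it to JSON. The property names (label, Date, Creator, Type) and the variable names are assumptions chosen for this example, not names that Quilt requires.

```python
import json

# Hypothetical metadata schema for an image-labeling package.
# Property names are illustrative; pick the ones your team needs.
bird_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        # Controlled vocabulary: any other label (e.g. "birb") is rejected.
        "label": {"enum": ["bird", "not bird"]},
        "Date": {"type": "string", "format": "date"},
        "Creator": {"type": "string"},
        "Type": {"type": "string"},
    },
    # Every package must carry these labels so that it can be reused.
    "required": ["label", "Date", "Creator", "Type"],
}

schema_json = json.dumps(bird_schema, indent=2)
print(schema_json)
```

The resulting JSON document can be stored in S3 and referenced from the schemas section of config.yml, as described in the next section.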

Get started

To get started, create a configuration file in your Quilt S3 bucket at s3://BUCKET/.quilt/workflows/config.yml.
Here's an example:
version:
  base: "1"
  catalog: "1"
workflows:
  alpha:
    name: Search for aliens
    is_message_required: true
  beta:
    name: Studying superpowers
    metadata_schema: superheroes
  gamma:
    name: Nothing special
    description: TOP SECRET
    is_message_required: true
    metadata_schema: top-secret
    handle_pattern: ^(employee1|employee2)/(staging|production)$
    entries_schema: validate-secrets
    catalog:
      package_handle:
        files: <%= username %>/<%= directory %>
        packages: <%= username %>/production
schemas:
  superheroes:
    url: s3://quilt-sergey-dev-metadata/schemas/superheroes.schema.json
  top-secret:
    url: s3://quilt-sergey-dev-metadata/schemas/top-secret.schema.json
  validate-secrets:
    url: s3://quilt-sergey-dev-metadata/schemas/validate-secrets.schema.json
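In the example, every metadata_schema and entries_schema value names an entry under schemas. A small sanity check like the one below can catch dangling references before you upload the file. This is only a sketch: it assumes the YAML has already been loaded into a Python dictionary with a parser of your choice, and the function name find_missing_schemas is hypothetical, not part of quilt3.

```python
def find_missing_schemas(config):
    """Return (workflow, key, schema_id) triples whose schema_id is not defined under 'schemas'."""
    defined = set(config.get("schemas", {}))
    missing = []
    for workflow_id, workflow in config.get("workflows", {}).items():
        for key in ("metadata_schema", "entries_schema"):
            schema_id = workflow.get(key)
            if schema_id is not None and schema_id not in defined:
                missing.append((workflow_id, key, schema_id))
    return missing


# A trimmed-down copy of the example configuration, as a dictionary.
# 'validate-secrets' is deliberately left out of 'schemas' to show a failure.
trimmed_config = {
    "workflows": {
        "alpha": {"name": "Search for aliens", "is_message_required": True},
        "beta": {"name": "Studying superpowers", "metadata_schema": "superheroes"},
        "gamma": {
            "name": "Nothing special",
            "metadata_schema": "top-secret",
            "entries_schema": "validate-secrets",
        },
    },
    "schemas": {
        "superheroes": {"url": "s3://quilt-sergey-dev-metadata/schemas/superheroes.schema.json"},
        "top-secret": {"url": "s3://quilt-sergey-dev-metadata/schemas/top-secret.schema.json"},
    },
}

print(find_missing_schemas(trimmed_config))
```

For the full example configuration, which defines all three schemas, the function returns an empty list.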
With the above configuration, you must specify a workflow before you can push:
>>> import quilt3
>>> quilt3.Package().push('test/package', registry='s3://quilt-sergey-dev-metadata')

QuiltException: Workflow required, but none specified.
Let's try with the workflow= parameter:
>>> quilt3.Package().push('test/package', registry='s3://quilt-sergey-dev-metadata', workflow='alpha')

QuiltException: Commit message is required by workflow, but none was provided.
The above QuiltException is caused by is_message_required: true. Here's how we can pass the commit message:
>>> quilt3.Package().push(
...     'test/package',
...     registry='s3://quilt-sergey-dev-metadata',
...     message='added info about UFO',
...     workflow='alpha')

Package test/package@<hash> pushed to s3://quilt-sergey-dev-metadata
Now let's push with workflow='beta':
>>> quilt3.Package().push(
...     'test/package',
...     registry='s3://quilt-sergey-dev-metadata',
...     workflow='beta')

QuiltException: Metadata failed validation: 'superhero' is a required property.
We encountered another exception because the beta workflow specifies metadata_schema: superheroes. Therefore, the test/package metadata must validate against the JSON Schema at s3://quilt-sergey-dev-metadata/schemas/superheroes.schema.json:
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "http://example.com/superheroes.schema.json",
  "properties": {
    "superhero": {
      "enum": [
        "Spider-Man",
        "Superman",
        "Batman"
      ]
    }
  },
  "required": [
    "superhero"
  ]
}
Note that superhero is a required property:
>>> quilt3.Package().set_meta({'superhero': 'Batman'}).push(
...     'test/package',
...     registry='s3://quilt-sergey-dev-metadata',
...     workflow='beta')

Package test/package@<hash> pushed to s3://quilt-sergey-dev-metadata
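The two beta pushes differ only in their metadata. The sketch below mimics, in plain Python, the two keywords that the superheroes schema relies on (required and enum). It is a deliberately simplified stand-in for a real JSON Schema validator, and the function name check_metadata and its error strings are made up for illustration.

```python
superheroes_schema = {
    "properties": {"superhero": {"enum": ["Spider-Man", "Superman", "Batman"]}},
    "required": ["superhero"],
}


def check_metadata(meta, schema):
    """Return a list of problems; handles only 'required' and 'enum'."""
    problems = []
    for key in schema.get("required", []):
        if key not in meta:
            problems.append(f"{key!r} is a required property")
    for key, rules in schema.get("properties", {}).items():
        if key in meta and "enum" in rules and meta[key] not in rules["enum"]:
            problems.append(f"{meta[key]!r} is not one of {rules['enum']}")
    return problems


print(check_metadata({}, superheroes_schema))                       # missing key
print(check_metadata({"superhero": "Batman"}, superheroes_schema))  # no problems
print(check_metadata({"superhero": "Hulk"}, superheroes_schema))    # value outside the enum
```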
For the gamma workflow, both is_message_required: true and metadata_schema are set, so both message and package metadata are validated:
>>> quilt3.Package().push(
...     'test/package',
...     registry='s3://quilt-sergey-dev-metadata',
...     workflow='gamma')

QuiltException: Metadata failed validation: 'answer' is a required property.

>>> quilt3.Package().set_meta({'answer': 42}).push(
...     'test/package',
...     registry='s3://quilt-sergey-dev-metadata',
...     workflow='gamma')

QuiltException: Commit message is required by workflow, but none was provided.

>>> quilt3.Package().set_meta({'answer': 42}).push(
...     'test/package',
...     registry='s3://quilt-sergey-dev-metadata',
...     message='at last all is set up',
...     workflow='gamma')

Package test/package@<hash> pushed to s3://quilt-sergey-dev-metadata
If you wish for your users to be able to skip workflows altogether, you can make workflow validation optional with is_workflow_required: false in your config.yml, and specify workflow=None in the API:
>>> quilt3.Package().push(
...     'test/package',
...     registry='s3://quilt-sergey-dev-metadata',
...     workflow=None)

Package test/package@<hash> pushed to s3://quilt-sergey-dev-metadata
You can also set default_workflow in the config to specify which workflow is used when the workflow parameter is not provided.
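The interplay between the workflow parameter, is_workflow_required, and default_workflow can be summarized in a few lines. The function below is an illustrative model of the behavior described in this section, not quilt3's actual implementation; resolve_workflow and the NOT_PROVIDED sentinel are hypothetical names, and the sketch assumes that workflows are required unless the config opts out.

```python
NOT_PROVIDED = object()  # distinguishes "argument omitted" from an explicit workflow=None


def resolve_workflow(config, workflow=NOT_PROVIDED):
    """Return the workflow id to validate against, or None to skip validation."""
    if workflow is NOT_PROVIDED:
        # Fall back to the configured default, if there is one.
        workflow = config.get("default_workflow")
    if workflow is None:
        # Assumption: workflows are required unless the config opts out.
        if config.get("is_workflow_required", True):
            raise ValueError("Workflow required, but none specified.")
        return None
    if workflow not in config.get("workflows", {}):
        raise ValueError(f"No such workflow: {workflow!r}")
    return workflow


workflows = {"alpha": {}, "beta": {}}
print(resolve_workflow({"workflows": workflows, "default_workflow": "alpha"}))
print(resolve_workflow({"workflows": workflows, "is_workflow_required": False}, None))
```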

JSON Schema

Quilt workflows support JSON Schema Draft 7.

Default values

Quilt supports the default keyword.

Auto-fill dates

If you wish to pre-populate dates in the Quilt catalog, you can use the custom keyword dateformat in your schemas. For example:
{
  "type": "string",
  "format": "date",
  "dateformat": "yyyy-MM-dd"
}
The dateformat template follows Unicode Technical Standard #35.
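For readers more familiar with Python than with Unicode date patterns, the sketch below translates a small subset of pattern fields (yyyy, MM, dd) into strftime directives to show what the template yyyy-MM-dd produces. The mapping and the helper name format_uts35 are assumptions for illustration; the catalog does its own formatting, and the full standard defines many more fields.

```python
import re
from datetime import date

# Partial mapping from UTS #35 pattern fields to strftime directives.
FIELD_TO_STRFTIME = {"yyyy": "%Y", "MM": "%m", "dd": "%d"}


def format_uts35(value, pattern):
    """Format a date using only the yyyy, MM and dd fields."""
    strftime_pattern = re.sub(
        r"yyyy|MM|dd", lambda m: FIELD_TO_STRFTIME[m.group(0)], pattern
    )
    return value.strftime(strftime_pattern)


print(format_uts35(date(2021, 3, 5), "yyyy-MM-dd"))  # 2021-03-05
```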

Data quality controls

In addition to package-level metadata, Quilt workflows enable you to validate package names and basic file metadata.
You must include the following schema version at the root of your config.yml in order for any catalog-specific features to function:
version:
  base: "1"
  catalog: "1"

Package name defaults (Quilt catalog)

By default the Quilt catalog auto-fills the package handle prefix according to the following logic:
  • Packages tab: username (everything before the @ in your sign-in email). Equivalent to:
      catalog:
        package_handle:
          packages: <%= username %>
  • Files tab: parent directory name. Equivalent to:
      catalog:
        package_handle:
          files: <%= directory %>
You can customize the default prefix with the package_handle key in one or both of the following places:
  • Set catalog.package_handle.(files|packages) at the root of config.yml to affect all workflows
  • Set workflows.WORKFLOW.catalog.package_handle.(files|packages) to affect the tabs and workflow in question
Example
catalog:
  # default for all workflows for Packages tab
  package_handle:
    packages: analysis/
workflows:
  my-workflow:
    catalog:
      # defaults for my-workflow, different for each tab
      package_handle:
        files: <%= username %>/<%= directory %>
        packages: <%= username %>/production
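The <%= name %> placeholders are filled in by the catalog. To make the substitution concrete, here is a small Python imitation of it. The helper render_handle, the sample email, and the sample directory are hypothetical; only the username rule (everything before the @ in the sign-in email) comes from the text above.

```python
import re


def render_handle(template, **values):
    """Replace each <%= key %> placeholder with the matching keyword argument."""
    return re.sub(r"<%=\s*(\w+)\s*%>", lambda m: values[m.group(1)], template)


email = "jane.doe@example.com"      # hypothetical sign-in email
username = email.split("@", 1)[0]   # everything before the @
directory = "experiment-42"         # hypothetical parent directory name

files_handle = render_handle(
    "<%= username %>/<%= directory %>", username=username, directory=directory
)
packages_handle = render_handle("<%= username %>/production", username=username)

print(files_handle)
print(packages_handle)
```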

Package name validation

You can validate package names with WORKFLOW.handle_pattern, which accepts a JavaScript regular expression.
By default, patterns are not anchored. You can explicitly add start (^) and end ($) markers as needed.
Example
workflows:
  my-workflow:
    handle_pattern: ^(employee1|employee2)/(production|staging)$
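The effect of anchoring is easy to check interactively. The sketch below uses Python's re.search, which, like an unanchored JavaScript pattern, matches anywhere in the string. Python and JavaScript regular expression syntax differ in some details, so treat this as an approximation that holds for simple patterns like the one above; the sample handles are made up.

```python
import re

anchored = r"^(employee1|employee2)/(production|staging)$"
unanchored = r"(employee1|employee2)/(production|staging)"

handles = ["employee1/staging", "employee3/staging", "xemployee1/staging-old"]

# Map each handle to (matches anchored pattern, matches unanchored pattern).
results = {
    handle: (bool(re.search(anchored, handle)), bool(re.search(unanchored, handle)))
    for handle in handles
}

for handle, (is_anchored_match, is_unanchored_match) in results.items():
    print(handle, is_anchored_match, is_unanchored_match)
```

The last handle shows why anchors matter: without ^ and $, a handle that merely contains an allowed name passes validation.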

Package file validation

You can validate the names and sizes of files in the package with WORKFLOW.entries_schema. The provided schema runs against an array of objects known as package entries. Each package entry defines a logical_key (its relative path and name in the parent package) and a size (in bytes).
Example
workflows:
  myworkflow-1:
    entries_schema: must-contain-readme
  myworkflow-2:
    entries_schema: must-contain-readme-summarize-at-least-1byte
    description: Must contain non-empty README.md and quilt_summarize.json at package root; no more than 4 files
schemas:
  must-contain-readme:
    url: s3://bucket/must-contain-readme.json
  must-contain-readme-summarize-at-least-1byte:
    url: s3://bucket/must-contain-readme-summarize-at-least-1byte.json
s3://bucket/must-contain-readme.json
{
  "type": "array",
  "contains": {
    "type": "object",
    "properties": {
      "logical_key": {
        "type": "string",
        "pattern": "^README\\.md$"
      }
    }
  }
}
s3://bucket/must-contain-readme-summarize-at-least-1byte.json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "allOf": [
    {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "size": {
            "type": "number",
            "minimum": 1,
            "maximum": 100000
          }
        }
      },
      "minItems": 2,
      "maxItems": 4
    },
    {
      "type": "array",
      "contains": {
        "type": "object",
        "properties": {
          "logical_key": {
            "type": "string",
            "pattern": "^README\\.md$"
          }
        }
      }
    },
    {
      "type": "array",
      "contains": {
        "type": "object",
        "properties": {
          "logical_key": {
            "type": "string",
            "pattern": "^quilt_summarize\\.json$"
          }
        }
      }
    }
  ]
}
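To make the shape of the input concrete, the sketch below builds a package-entries array by hand and re-implements the checks from the second schema in plain Python. The entry values and the helper name check_entries are assumptions for illustration; Quilt performs the real validation with the JSON Schema itself.

```python
import re

# Hypothetical package entries: one object per file in the package.
entries = [
    {"logical_key": "README.md", "size": 512},
    {"logical_key": "quilt_summarize.json", "size": 64},
    {"logical_key": "data/table.csv", "size": 20480},
]


def check_entries(entries):
    """Mirror the constraints of must-contain-readme-summarize-at-least-1byte."""
    keys = [entry["logical_key"] for entry in entries]
    return (
        2 <= len(entries) <= 4                                       # minItems / maxItems
        and all(1 <= entry["size"] <= 100000 for entry in entries)  # size bounds
        and any(re.search(r"^README\.md$", key) for key in keys)
        and any(re.search(r"^quilt_summarize\.json$", key) for key in keys)
    )


print(check_entries(entries))      # True
print(check_entries(entries[1:]))  # False: README.md is missing
```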

Cross-bucket package push (Quilt catalog)

In Quilt, S3 buckets are like git branches but for data. With quilt3 you can browse any package and then push it to any bucket that you choose.
As a rule, cross-bucket pushes or "merges" reflect a change in a package's lifecycle. For example, you might push a package from my-staging-bucket to my-production-bucket as it matures and becomes trusted.
The catalog's Push to bucket feature can be enabled by adding a successors property to the config. A successor is a destination bucket.
successors:
  s3://bucket1:
    title: Staging
    copy_data: false
  s3://bucket2:
    title: Production
If copy_data is true (the default), all package entries will be copied to the destination bucket. If copy_data is false, all entries will remain in their current locations.
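A similar promotion can be scripted with quilt3 by browsing a package in one bucket and pushing it to another. The sketch below assumes quilt3 is installed and that you have credentials for both buckets. The function name promote, the commit message, and the workflow id in the example call are hypothetical, and the workflow argument is assumed to name a workflow defined in the destination bucket's config. quilt3 is imported inside the function so that defining it has no side effects.

```python
def promote(name, source_registry, dest_registry, message, workflow):
    """Browse a package in one bucket and push it to another."""
    import quilt3  # requires quilt3 to be installed and AWS credentials to be configured

    pkg = quilt3.Package.browse(name, registry=source_registry)
    return pkg.push(name, registry=dest_registry, message=message, workflow=workflow)


# Example call (not executed here):
# promote(
#     "test/package",
#     "s3://my-staging-bucket",
#     "s3://my-production-bucket",
#     message="promote to production",
#     workflow="alpha",
# )
```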

config.yml JSON Schema

Known limitations

  • Only Draft 7 JSON Schemas are supported
  • Schemas with $ref are not supported
  • Schemas must be in an S3 bucket for which the Quilt user has read permissions