Packaging Engine
This feature requires Quilt Platform version 1.58.0 or higher.
The Quilt Packaging Engine allows administrators and developers to automate the creation of Quilt packages from data stored in Amazon S3. It serves as a key component of Quilt's functionality as a Scientific Data Management System, enabling automated data ingestion and standardization. It currently consists of:
- Admin Settings GUI to enable package creation based on notifications from:
  - AWS HealthOmics
  - Nextflow workflows using the WRROC (Workflow Run RO-Crate) format from nf-prov
- SQS queue that will process package descriptions
- Documentation for creating custom EventBridge rules to invoke that queue
The simplest way to enable package creation is through the Admin Settings GUI, which supports the following built-in event sources:
When enabled for AWS HealthOmics, this will create a package from the `runOutputUri` provided in an `aws.omics` completion event. For example, if the `runOutputUri` is `s3://quilt-example/omics-quilt/3395667`, the package will be created in that same bucket with the name `omics-quilt/3395667`.
When enabled for RO-Crate, this will create a package from the enclosing folder when an `ro-crate-manifest.json` file is written to a bucket that is already part of the stack.
RO-Crate is a metadata standard for describing research data. The Workflow Run working group adds three additional profiles, which are supported in the latest versions of nf-prov. You will need to explicitly configure `nf-prov` to use `wrroc` with a `nextflow.config` file like the following sketch (configuration keys assumed from the nf-prov README; the agent details are placeholders to adapt):
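```groovy
plugins {
    id 'nf-prov'  // enable the provenance plugin
}

prov {
    enabled = true
    formats {
        // Emit a Workflow Run RO-Crate manifest into the pipeline's outdir,
        // using the filename the Packaging Engine watches for.
        wrroc {
            file = "${params.outdir}/ro-crate-manifest.json"
            overwrite = true
            agent {
                name = "Jane Researcher"                        // placeholder
                orcid = "https://orcid.org/0000-0000-0000-0000" // placeholder ORCID iD
            }
        }
    }
}
```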
Note that Research Objects identify people using an ORCID iD, which anyone can get for free at the ORCID website.
The package will be created in the same bucket as the `outdir`, with the package name inferred from the S3 key. For example, if the key is `my/s3/folder/ro-crate-manifest.json`, the package name will be `my_s3/folder` (Quilt package names contain exactly one `/`, so earlier slashes in the folder path become underscores).
The Quilt Packaging Engine is built on top of the existing packaging lambdas used by the Quilt Platform, including the ability to parallelize creation of S3 checksums for existing objects. We have exposed this functionality to customers via an SQS queue, which is invoked by the EventBridge rules created by the Admin Settings GUI.
You can also send messages directly to the SQS queue, which is part of the Quilt stack. The queue information will be listed as `PackagerQueueArn` and `PackagerQueueUrl` under the Outputs tab of the CloudFormation section of the AWS Console. The URL will be something like:
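```
https://sqs.REGION.amazonaws.com/ACCOUNT_ID/QUEUE_NAME
```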
Here REGION and ACCOUNT_ID will be the same as for the Quilt stack, and QUEUE_NAME is the generated name of the packager queue; in practice, copy the exact URL from the `PackagerQueueUrl` output.
The body of the message is the stringified JSON of a package description. There is only one required parameter: the S3 URI of the source data. This URI is assumed to be a folder if it ends in a `/`; otherwise, the last component of the path is removed to get the folder. The contents of the folder will be used to create a package in the same bucket as the source folder, with the package name inferred from the source URI.
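For example, a minimal message body might look like this; the `source_prefix` field name is an assumption for illustration, so confirm the exact name against your stack's packager documentation:

```json
{
    "source_prefix": "s3://quilt-example/my/s3/folder/"
}
```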
Optionally, you can control the package name, metadata, and other settings by explicitly specifying additional fields in the package description. The job will fail if you try to specify both `metadata` and `metadata_uri`.
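For instance, here is a sketch of a message that also pins the package name and attaches inline metadata (`package_name` is an assumed field name; `metadata` and `metadata_uri` are the fields named above):

```json
{
    "source_prefix": "s3://quilt-example/my/s3/folder/",
    "package_name": "my_s3/folder",
    "metadata": {"project": "example-project"}
}
```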
If you have appropriate IAM permissions and the SQS URL, you can send a message to the queue using the AWS SDK or the AWS CLI. Here is a sketch using the AWS CLI, reusing the assumed `source_prefix` field from above (substitute your queue URL and source URI):
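```sh
# Substitute the PackagerQueueUrl value from your stack outputs.
aws sqs send-message \
    --queue-url "https://sqs.REGION.amazonaws.com/ACCOUNT_ID/QUEUE_NAME" \
    --message-body '{"source_prefix": "s3://quilt-example/my/s3/folder/"}'
```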
EventBridge rules can be used to transform EventBridge events from any bus in your account into a conforming SQS message.
Event-Driven Packaging, currently in private preview, coalesces multiple S3 uploads into a single `package-objects-ready` event, which infers the appropriate top-level folder. When ready, it creates an event on its own EventBridge bus, shaped roughly like this (the field values below are illustrative assumptions; inspect the actual events on your bus to confirm):
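```json
{
    "detail-type": "package-objects-ready",
    "source": "com.quiltdata.edp",
    "detail": {
        "bucket": "quilt-example",
        "prefix": "my/s3/folder/"
    }
}
```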
The following Python sketch uses boto3 to create an EventBridge rule that targets the packager queue when matching that event (the bus name and event pattern follow the assumptions above):
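```python
import json

import boto3

events = boto3.client("events")

# From the CloudFormation stack outputs (PackagerQueueArn).
PACKAGER_QUEUE_ARN = "arn:aws:sqs:REGION:ACCOUNT_ID:QUEUE_NAME"
# Assumption: the name of the Event-Driven Packaging bus; confirm in your account.
EVENT_BUS_NAME = "quilt-edp-bus"

# Match package-objects-ready events; matching on detail-type alone keeps the
# pattern independent of the (assumed) source field.
events.put_rule(
    Name="quilt-package-objects-ready",
    EventBusName=EVENT_BUS_NAME,
    EventPattern=json.dumps({"detail-type": ["package-objects-ready"]}),
    State="ENABLED",
)

# Deliver matching events to the packager queue. The queue's resource policy
# must allow events.amazonaws.com to send messages; an input transformer can
# reshape the event detail into a package description if required.
events.put_targets(
    Rule="quilt-package-objects-ready",
    EventBusName=EVENT_BUS_NAME,
    Targets=[{"Id": "quilt-packager", "Arn": PACKAGER_QUEUE_ARN}],
)
```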
The package creation process is asynchronous, so you may need to wait a few minutes before the package is available (longer if the source data is large).
If you send the same message multiple times before the folder is updated, it will not actually create a new revision, since the content hash will be the same. However, that would still waste computational cycles, so you should avoid doing so.