# Git-like operations for datasets and Jupyter notebooks
`quilt3` provides a simple command-line interface for versioning large datasets and storing them in Amazon S3. There are only two commands you need to know:

* `push` creates a new package revision in an S3 bucket that you designate
* `install` downloads data from a remote package to disk
## Why not use Git?
In short, neither Git nor Git LFS has the capacity or performance to function as a repository for data. S3, on the other hand, is widely used, fast, supports versioning, and currently stores trillions of data objects.
Similar concerns apply when baking datasets into Docker containers: images bloat and slow container operations down.
## Pre-requisites
You will need either an AWS account, credentials, and an S3 bucket, OR a Quilt enterprise stack with at least one bucket. In order to read from and write to S3 with `quilt3`, you must first do one of the following:
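For the AWS-account route, a common setup is to install the AWS CLI and store your credentials locally so that `quilt3` (via boto3) can find them. This is a sketch; the values you enter at the prompts are your own:

```shell
# Install the AWS CLI, then store an access key, secret key, and default
# region in ~/.aws/ where quilt3 can pick them up
pip install awscli
aws configure
```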
OR, if and only if your company runs a Quilt enterprise stack, run the following:
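For an enterprise stack, point `quilt3` at your catalog and authenticate. A sketch, where the catalog URL is a placeholder for your company's domain:

```shell
# Point quilt3 at your company's Quilt catalog, then authenticate
quilt3 config https://your-catalog.yourcompany.com
quilt3 login
```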
## Install a package
A Quilt package contains any collection of data (usually as files), metadata, and documentation that you specify.
Let's get the data package `quilt-hurdat/data` from S3 and write it to disk.
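With the CLI, that looks something like the following. This is a sketch: `s3://your-bucket` is a placeholder for the registry bucket that actually holds the package.

```shell
# Download the latest revision of quilt-hurdat/data
# into the current working directory
quilt3 install quilt-hurdat/data --registry s3://your-bucket --dest .
```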
Now you've got data in the current working directory.
## Creating your first package
Now let's imagine that we've modified this data locally. We save our Jupyter notebook and push the results back to Quilt:
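A push might look like the following. This is a sketch; the registry bucket and commit message are placeholders:

```shell
# Create a new revision of quilt-hurdat/data from the current directory
quilt3 push quilt-hurdat/data --dir . \
    --registry s3://your-bucket \
    --message "Updated analysis notebook"
```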
Quilt then prints a confirmation for the new package revision, including its top hash.
## List the packages in a bucket
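The CLI can enumerate the packages stored in a registry. A sketch; the bucket name is a placeholder:

```shell
# Print the name of every package in the given registry bucket
quilt3 list-packages s3://your-bucket
```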
In the Quilt catalog, you will now see a new package revision, complete with a README, data grid preview, and an interactive visualization in Altair.
You can see an example of this package live here.
## Learn more
Those are the basics of reading and writing Quilt packages with the CLI. See the CLI reference for more.