Introduction
Quilt provides versioned, reusable building blocks for analysis in the form of data packages. A data package may contain data of any type or any size.
Quilt does for data what package managers do for code: provide a centralized store of record.
Reproducibility - Imagine source code without versions. Ouch. Why live with un-versioned data? Versioned data makes analysis reproducible by creating unambiguous references to potentially complex data dependencies.
Collaboration and transparency - Data likes to be shared. Quilt offers a unified catalog for finding and sharing data.
Auditing - the registry tracks all reads and writes so that admins know when data are accessed or changed.
Less data prep - the registry abstracts away network, storage, and file format so that users can focus on what they wish to do with the data.
De-duplication - Data are identified by their SHA-256 hash. Duplicate data are written to disk once, for each user. As a result, large, repeated data fragments consume less disk and network bandwidth.
Faster analysis - Serialized data loads 5 to 20 times faster than files. Moreover, specialized storage formats like Apache Parquet minimize I/O bottlenecks so that tools like Presto DB and Hive run faster.
A Quilt data package is a tree of data wrapped in a Python module. You can think of a package as a miniature, virtualized filesystem accessible to a variety of languages and platforms.
Each Quilt package has a unique handle of the form USER_NAME/PACKAGE_NAME.
Packages are stored in a server-side registry. The registry controls permissions and stores package metadata, such as the revision history. Each package has a web landing page for documentation, like this one for uciml/iris.
The data in a package are tracked in a hash tree. The tophash for the tree is the hash of all hashes of all data in the package. The combination of a package handle and a tophash forms a package instance. Package instances are immutable.
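The idea of a tophash can be sketched in a few lines of Python. This is a minimal illustration, not Quilt's actual hashing scheme: the nested-dict package layout, the canonical JSON encoding, and the helper names `fragment_hash`, `hash_tree`, and `tophash` are all assumptions made for the example.

```python
import hashlib
import json


def fragment_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of one leaf fragment."""
    return hashlib.sha256(data).hexdigest()


def hash_tree(node):
    """Replace every leaf (bytes) in a nested dict with its SHA-256 digest."""
    if isinstance(node, bytes):
        return fragment_hash(node)
    return {name: hash_tree(child) for name, child in sorted(node.items())}


def tophash(tree) -> str:
    """Hash a canonical JSON encoding of the tree of hashes (assumed encoding)."""
    canonical = json.dumps(hash_tree(tree), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two hypothetical revisions of the same package; only one value differs.
package_v1 = {"README": b"Iris data set", "tables": {"iris": b"5.1,3.5,1.4,0.2\n"}}
package_v2 = {"README": b"Iris data set", "tables": {"iris": b"5.1,3.5,1.4,0.3\n"}}

# A package instance pairs the handle with the tophash.
instance_v1 = ("uciml/iris", tophash(package_v1))
instance_v2 = ("uciml/iris", tophash(package_v2))
```

Because the tophash depends on every fragment hash, changing a single byte anywhere in the tree yields a different instance, which is what makes an instance a stable, unambiguous reference.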
Leaf nodes in the package tree are called fragments or objects. Installed fragments are de-duplicated and kept in a local object store.
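De-duplication in an object store follows from naming each fragment by its hash. The sketch below shows the general content-addressing technique; the `ObjectStore` class, its on-disk layout (one file per digest), and the sample data are hypothetical and do not describe how Quilt lays out its local store.

```python
import hashlib
import tempfile
from pathlib import Path


class ObjectStore:
    """Toy content-addressed store: one file per distinct SHA-256 digest."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, data: bytes) -> str:
        """Store a fragment and return its digest; identical content is written once."""
        digest = hashlib.sha256(data).hexdigest()
        path = self.root / digest
        if not path.exists():
            path.write_bytes(data)
        return digest

    def get(self, digest: str) -> bytes:
        """Read a fragment back by its digest."""
        return (self.root / digest).read_bytes()


store = ObjectStore(Path(tempfile.mkdtemp()) / "objs")
first = store.put(b"sepal_length,sepal_width\n5.1,3.5\n")
second = store.put(b"sepal_length,sepal_width\n5.1,3.5\n")  # duplicate fragment
third = store.put(b"a different fragment")
stored_files = sorted(p.name for p in store.root.iterdir())
```

Here `first` and `second` are the same digest and only two files end up on disk, so packages that share a fragment also share its storage.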
build creates a package
build hashes and serializes data. All data and metadata are tracked in a hash tree that specifies the structure of the package.
By default:
Unstructured and semi-structured data are copied "as is" (e.g. JSON, TXT)
You may override the above defaults, for example if you wish data to remain in its original format, with the transform: id setting in build.yml.
push stores a package in a server-side registry
Packages are registered against a Flask/MySQL endpoint that controls permissions and keeps track of where data lives in blob storage (S3 for the Free tier).
install downloads a package
After a permissions check the client receives a signed URL to download the package from blob storage.
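To make the notion of a signed URL concrete, here is a generic sketch of an expiring, HMAC-signed download link. It is not the scheme used by the Quilt registry or by S3, which apply their own signing protocols; the secret key, the `expires` and `signature` query parameters, the example URL, and the helpers `sign_url` and `verify_url` are all assumptions for illustration.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET_KEY = b"demo-secret"  # hypothetical key shared by the signer and the blob store


def _signature(base_url: str, expires_at: int, key: bytes) -> str:
    """HMAC-SHA256 over the URL and its expiry time."""
    message = f"{base_url}\n{expires_at}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def sign_url(base_url: str, expires_at: int, key: bytes = SECRET_KEY) -> str:
    """Issued after the permissions check: append expiry and signature to the URL."""
    query = urlencode({"expires": expires_at, "signature": _signature(base_url, expires_at, key)})
    return f"{base_url}?{query}"


def verify_url(url: str, now: int, key: bytes = SECRET_KEY) -> bool:
    """Checked by the storage side: signature must match and the link must not be expired."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    base_url = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
    try:
        expires_at = int(query["expires"][0])
        signature = query["signature"][0]
    except (KeyError, ValueError):
        return False
    expected = _signature(base_url, expires_at, key)
    return hmac.compare_digest(signature, expected) and now < expires_at


# Hypothetical object URL, signed to expire at t=2000 (seconds).
signed = sign_url("https://blobs.example.com/objects/abc123", expires_at=2000)
valid_now = verify_url(signed, now=1000)
valid_later = verify_url(signed, now=3000)
```

The point of the design is that the client never holds storage credentials: it holds only a short-lived link that the storage side can check without consulting the registry again.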
Installed packages are stored in a local quilt_modules folder. Type $ quilt ls to see where quilt_modules is located.
import exposes your package to code
Quilt data packages are wrapped in a Python module so that users can import data like code: from quilt.data.USER_NAME import PACKAGE_NAME.
Data import is lazy to minimize I/O. Data are only loaded from disk if and when the user references the data (usually by adding parentheses to a package path, pkg.foo.bar()).
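The lazy behavior can be pictured with a small stand-in for a package tree. This is a sketch of the general pattern only: the `LazyNode` and `Group` classes and the `read_bar` loader are hypothetical, and caching the loaded value after the first call is an assumption of the example rather than documented Quilt behavior.

```python
class LazyNode:
    """Leaf that defers loading until it is called, as in pkg.foo.bar()."""

    def __init__(self, loader):
        self._loader = loader
        self._loaded = False
        self._value = None

    def __call__(self):
        if not self._loaded:
            self._value = self._loader()  # the only place the data source is touched
            self._loaded = True
        return self._value


class Group:
    """Container whose children are reachable as attributes."""

    def __init__(self, **children):
        self.__dict__.update(children)


load_calls = []  # records every simulated disk read


def read_bar():
    """Stand-in for reading a fragment from the local object store."""
    load_calls.append("bar")
    return [1, 2, 3]


pkg = Group(foo=Group(bar=LazyNode(read_bar)))

node = pkg.foo.bar     # walking the path loads nothing
data = pkg.foo.bar()   # parentheses trigger the load
again = pkg.foo.bar()  # served from the cached value in this sketch
```

Navigating the tree stays cheap however large the package is, because the cost of reading data is paid only for the nodes that are actually called.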
Quilt is offered as a managed service at quiltdata.com. Alternatively, users can run their own registries (refer to the registry documentation).
Quilt consists of three components. See the contributing docs for further details.