quilt
Search…
Querying metadata with Athena
Quilt stores package data and metadata in S3. Metadata lives in a per-package manifest file in a each bucket's .quilt/ directory.
You can therefore query package metadata wth SQL engines like AWS Athena. Users can write SQL queries to select packages (or files from within packages) using predicates based on package or object-level metadata.

Defining package tables and views in Athena

The first step in configuring Athena to query the package contents and metadata is to define a set of tables that represent the package metadata fields as columns.

Manifests table

The following Athena DDL will build a table of all the manifests in a given bucket (all package-level and object-level metadata).
1
CREATE EXTERNAL TABLE `quilt_manifests_YOUR_BUCKET`(
2
`logical_key` string,
3
`physical_keys` array<string>,
4
`size` string,
5
`hash` struct<type:string,value:string>,
6
`meta` string,
7
`user_meta` string,
8
`message` string,
9
`version` string)
10
ROW FORMAT SERDE
11
'org.openx.data.jsonserde.JsonSerDe'
12
WITH SERDEPROPERTIES (
13
'ignore.malformed.json'='true')
14
STORED AS INPUTFORMAT
15
'org.apache.hadoop.mapred.TextInputFormat'
16
OUTPUTFORMAT
17
'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
18
LOCATION
19
's3://{bucket}/.quilt/packages'
20
TBLPROPERTIES (
21
'has_encrypted_data'='false',
22
'transient_lastDdlTime'='1605312102')
Copied!

Package metadata table

Package names and top hashes are not stored in the manifests. Rather they are stored in pointer files in the .quilt/named_packages folder. The following DDL creates a table from these pointer files to make package top hashes available in Athena.
1
CREATE EXTERNAL TABLE `quilt_named_packages_YOUR_BUCKET`(
2
`hash` string)
3
ROW FORMAT DELIMITED
4
FIELDS TERMINATED BY ',' STORED AS INPUTFORMAT
5
'org.apache.hadoop.mapred.TextInputFormat'
6
OUTPUTFORMAT
7
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
8
LOCATION
9
's3://{bucket}/.quilt/named_packages'
10
TBLPROPERTIES (
11
'has_encrypted_data'='false',
12
'transient_lastDdlTime'='1557626200')
Copied!

View of package-level metadata

The DDL below creates a view that contains package-level information including:
  • User
  • Package name
  • Tophash
  • Timestamp
  • Commit message
1
CREATE OR REPLACE VIEW "quilt_packages_{bucket}_view" AS
2
WITH
3
npv AS (
4
SELECT
5
regexp_extract("$path", '^s3:\/\/([^\\/]+)\/([^\/]+)\/([^\/]+)\/([^\/]+)', 4) as user,
6
regexp_extract("$path", '^s3:\/\/([^\\/]+)\/([^\/]+)\/([^\/]+)\/([^\/]+)\/([^\/]+)', 5) as name,
7
regexp_extract("$path", '[^/]+#x27;) as timestamp,
8
"quilt_named_packages_{bucket}"."hash"
9
FROM "quilt_named_packages_{bucket}"
10
),
11
mv AS (
12
SELECT
13
regexp_extract("$path", '[^/]+#x27;) as tophash,
14
manifest."meta",
15
manifest."message"
16
FROM
17
"quilt_manifests_{bucket}" as manifest
18
WHERE manifest."logical_key" IS NULL
19
)
20
SELECT
21
npv."user",
22
npv."name",
23
npv."hash",
24
npv."timestamp",
25
mv."message",
26
mv."meta"
27
FROM npv
28
JOIN
29
mv
30
ON
31
npv."hash" = mv."tophash"
Copied!

View of object-Level metadata

The DDL below creates a view that contains package contents, including:
  • logical_key
  • physical_keys
  • object hash
  • object metadata
1
CREATE OR REPLACE VIEW "quilt_package_objects_YOUR_BUCKET_view" AS
2
WITH
3
mv AS (
4
SELECT
5
regexp_extract("$path", '[^/]+#x27;) as tophash,
6
manifest."logical_key",
7
manifest."physical_keys",
8
manifest."size",
9
manifest."hash",
10
manifest."meta",
11
manifest."user_meta"
12
FROM
13
"quilt_manifests_YOUR_BUCKET" as manifest
14
WHERE manifest."logical_key" IS NOT NULL
15
)
16
SELECT
17
npv."user",
18
npv."name",
19
npv."timestamp",
20
mv."tophash",
21
mv."logical_key",
22
mv."physical_keys",
23
mv."hash",
24
mv."meta",
25
mv."user_meta"
26
FROM mv
27
JOIN
28
"quilt_packages_{bucket}_view" as npv
29
ON
30
npv."hash" = mv."tophash"
Copied!

Example: query package-level metadata

Suppose we wish to find all .tiff files produced by algorithm version 1.3 with a cell index of 5.
1
SELECT * FROM "quilt_package_objects_YOUR_BUCKET_view" AS
2
WHERE substr(logical_key, -5)='.tiff'
3
-- extract and query package-level metadata
4
AND json_extract_scalar(meta, '$.user_meta.nucmembsegmentationalgorithmversion') LIKE '1.3%'
5
AND json_array_contains(json_extract(meta, '$.user_meta.cellindex'), '5');
Copied!