Tabulator
Last updated
Last updated
NOTE: This feature requires Quilt stack version 1.55.0 or higher
Tabulator aggregates tabular data objects across multiple packages using AWS Athena. Admins define schemas and data sources for CSV, TSV, or Parquet files, enabling users to run SQL queries directly on the contents of Quilt packages. You can even use named capture groups to extract additional columns from the logical key and package name.
The configuration is written in YAML and managed using the quilt3.admin.tabulator
APIs or via the Quilt Admin UI:
Each Tabulator configuration is written in YAML, following the structure outlined below.
Schema: The schema defines the columns in the table. Each column must have a name and a type. The name must match the regular expression ^[a-z_][a-z0-9_]*$
. For CSV/TSVs, these names do not need to match the column names in the document. For Parquet, they must match except for case. However, if column names are present in a CSV/TSV, you must set header
to true
in the parser configuration.
Types: Must be uppercase and match the Apache Arrow Data Types used by Amazon Athena. Valid types are BOOLEAN, TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, STRING, BINARY, DATE, TIMESTAMP.
Source: The source defines the packages and objects to query. The type
must be quilt-packages
. The package_name
is a regular expression that matches the package names to include. The logical_key
is a regular expression that matches the keys of the objects to include. The regular expression may include named capture groups that will be added as columns to the table.
Parser: The parser defines how to read the files. The format
must be one of csv
or parquet
. The optional delimiter
(defaults to ',') is the character used to separate fields in the CSV file. The optional header
field (defaults to 'false') is a boolean that indicates whether the first row of the CSV file contains column names.
In addition to the columns defined in the schema, Tabulator will add:
any named capture groups from the logical key regular expression
$pkg_name
for the package name
$logical_key
for the object as referenced by the package
$physical_key
for the underlying S3 URI
$top_hash
for the revision of the package containing the object (currently we query only the latest
package revision)
Due to the way permissions are configured, Tabulator cannot be accessed from the AWS Console or Athena views. You must access Tabulator via the Quilt stack in order to query those tables. This can be done by users via the per-bucket "Queries" tab in the Quilt Catalog, or programmatically via quilt3
. See "Usage" below for more details.
Schema Consistency: All files in the package that match the logical key must have the same schema as defined in the configuration.
Memory Usage: Tabulator may fail on large files (> 10 GB), files with large rows (> 100 KB), and large numbers of files (> 10000). Additionally, Athena has a 16 MB limit per row.
Cost Management: Querying very large datasets can be expensive (approximately dollars per terabyte). Be sure to set up appropriate cost controls and monitoring.
Concurrency: Tabulator will attempt to process each file concurrently, but may be limited by the concurrency of Athena or the federation lambda in the region where the query is running. If you are experiencing slow performance, it may be because the concurrency is too low. You can increase the concurrency in that region's AWS Service Quotas console.
Athena VPC: If you are using a VPC endpoint for Athena, you must ensure it is accessible from the Quilt stack and Tabulator lambda.
Once the configuration is set, users can query the tables using the Athena tab from the Quilt Catalog. Note that because Tabulator runs with elevated permissions, it cannot be accessed from the AWS Console.
For example, to query the ccle_tsv
table from the appropriate workgroup in the quilt-tf-stable
stack, where the database (bucket name) is udp-spec
:
You can join this with any other Athena table, including the package and object tables automatically created by Quilt. For example, this is the package table:
We can then join on PKG_NAME to add the user_meta
field from the package metadata to the tabulated results:
To call Tabulator from outside the Queries tab, you must use quilt3
to authenticate against the stack using config()
and login()
, which opens a web page from which you must paste in the appropriate access token. Use get_boto3_session()
to get a session with the same permissions as your Quilt Catalog user, then use the boto3
Athena client to run queries.
Here is a complete example: