Checks are data integrity tests that are defined in `build.yml`. They are run at package build time to ensure that all consumers of the data package only receive data that comply with the given checks.
Checks can be used to prevent model drift and data deployment errors that result from using data that do not fit an expected profile.
- support data that are larger than a pandas data frame (1GB to 10GB)
- display progress bars during the checks process
- print the offending line number when a check fails
- allow package users (other than the owner) to see checks and `build.yml` source
- Checks are defined in a top-level dictionary called `checks:`
- `qc.data` is an automatic variable that contains the node's data in a pandas data frame
- The full pandas expression syntax is supported
- Standard Python can be inlined with YAML's `|` operator (see below)
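To make the idea of an automatic variable concrete, here is a minimal sketch of how a multi-line check body could be run with `qc` already bound. It is not the library's implementation: `FakeQC`, `CheckFailed`, and `run_check` are hypothetical names, a list of dicts stands in for the pandas data frame, and `exec` is used purely for demonstration.

```python
class CheckFailed(Exception):
    """Raised when a check expression is false (hypothetical name)."""


class FakeQC:
    """Hypothetical stand-in for the `qc` object; not the real class."""

    def __init__(self, data):
        # The real qc.data is a pandas data frame; a list of dicts is used here.
        self.data = data

    def check(self, expr):
        # Fail loudly when the expression evaluates to a false value.
        if not expr:
            raise CheckFailed("check failed")


def run_check(source, data):
    """Execute one check body with `qc` pre-bound in its namespace."""
    exec(source, {"qc": FakeQC(data)})


# A multi-line check body, as it might appear after YAML's | operator.
check_source = """
priorities = {row['Order Priority'] for row in qc.data}
qc.check(len(priorities) == 2)
"""

# Toy rows invented for this example.
rows = [
    {"Order Priority": "Low"},
    {"Order Priority": "High"},
    {"Order Priority": "Low"},
]

run_check(check_source, rows)  # passes silently: two distinct priorities
```

The check body never creates `qc` itself; the runner injects it, which is the behavior the list above describes for `qc.data`.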
| Signature | Description |
| --- | --- |
| | Check that |
| | Checks that all column values are in the list (and vice versa), or calls a lambda on the column |
| | Print line numbers of rows that match |
| | Check that column values fall within [ |
| | Check that all column values match |
| | Check that all column values contain substring |
| | Not yet supported. Check that all column datetimes conform to |
`COL_REGEX` is a string literal or regular expression that matches one or more columns; the corresponding check is applied to each matching column.
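The column-matching behavior can be sketched in plain Python. This is an illustration under assumptions, not the library's code: `matching_columns` and `check_each_column` are hypothetical helpers, a dict of lists stands in for a data frame, and the sketch assumes the pattern must match the whole column name (the actual matching rule is not stated here).

```python
import re


def matching_columns(col_regex, columns):
    """Return the column names matched by col_regex.

    Assumption: the whole name must match (re.fullmatch).
    """
    pattern = re.compile(col_regex)
    return [name for name in columns if pattern.fullmatch(name)]


def check_each_column(col_regex, table, predicate):
    """Apply predicate to every value of every matching column.

    Returns a dict mapping column name to the row numbers that failed.
    """
    failures = {}
    for name in matching_columns(col_regex, table):
        bad_rows = [i for i, value in enumerate(table[name]) if not predicate(value)]
        if bad_rows:
            failures[name] = bad_rows
    return failures


# Toy data invented for this example.
table = {
    "Unit Price": [2.5, 10.0, -1.0],
    "Shipping Price": [1.0, 3.5, 4.0],
    "Discount": [0.1, 0.0, 0.2],
}

# One pattern selects both price columns; the same check runs on each.
print(matching_columns(r".* Price", table))
print(check_each_column(r".* Price", table, lambda v: v >= 0))
```

A plain string such as `"Discount"` is also a valid pattern here, so a literal column name selects exactly that column.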
Source data: sales.xls from Tableau Community

    contents:
      transactions:
        file: sales.xls
        transform: xls
        checks: cardinality labels stats range price dates

    checks:
      cardinality: |
        # verify column cardinality
        symbols = qc.data['Order Priority'].nunique()
        qc.check(symbols == 5)
      labels: |
        qc.check_column_enum(r'Order Priority', ['Low', 'High', 'Medium', 'Not Specified', 'Critical'])
        qc.print_recnums("Critical orders", qc.data['Order Priority'] == 'Critical')
      stats: |
        # standard deviation
        stdev = qc.data['Sales'].std()
        qc.check(stdev < 3586)
      range: |
        # ensure average discount is no more than 20%
        qc.check_column_valrange('Discount', maxval=0.2, lambda_or_name='avg')
      price: |
        # check that prices are formatted properly
        qc.check_column_regexp('Unit Price','\d+\.\d+')
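To show what the checks in the example above compute, the following sketch reproduces the cardinality, stats, range, and price logic with the standard library only. The rows are invented for illustration and are not taken from sales.xls, a list of dicts stands in for the data frame, and the sketch assumes the price pattern must match the whole value.

```python
import re
import statistics

# Toy rows invented for illustration; they are not taken from sales.xls.
orders = [
    {"Order Priority": "Low", "Sales": 120.0, "Discount": 0.05, "Unit Price": "2.50"},
    {"Order Priority": "High", "Sales": 80.0, "Discount": 0.10, "Unit Price": "10.00"},
    {"Order Priority": "Critical", "Sales": 100.0, "Discount": 0.00, "Unit Price": "7.25"},
]

# cardinality: number of distinct values, comparable to Series.nunique()
distinct_priorities = len({row["Order Priority"] for row in orders})

# stats: sample standard deviation, which pandas Series.std() computes by default
sales_stdev = statistics.stdev([row["Sales"] for row in orders])

# range: the average discount must not exceed 20%
avg_discount = statistics.mean([row["Discount"] for row in orders])

# price: every unit price must look like digits, a dot, then digits
# (assumption: the whole value must match the pattern)
price_pattern = re.compile(r"\d+\.\d+")
prices_ok = all(price_pattern.fullmatch(row["Unit Price"]) for row in orders)

# print_recnums-style result: row numbers of the critical orders
critical_rows = [i for i, row in enumerate(orders) if row["Order Priority"] == "Critical"]

print("distinct priorities:", distinct_priorities)
print("sales stdev:", sales_stdev)
print("average discount within limit:", avg_discount <= 0.2)
print("prices well formed:", prices_ok)
print("critical order rows:", critical_rows)
```

Each value corresponds to the expression passed to a check in the YAML; in a real build the check fails the package build instead of printing.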