Checks
Checks are data integrity tests that are defined in build.yml. They run at package build time, so that consumers of the data package only receive data that comply with the given checks.
Checks can be used to prevent model drift and data deployment errors that result from using data that do not fit an expected profile.
Known issues
Syntax
- Checks are defined in a top-level dictionary called checks:
- qc.data is an automatic variable that contains the node's data as a pandas DataFrame
- The full pandas expression syntax is supported
- Standard Python can be inlined with YAML's | operator (see below)
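The points above can be tried outside of a package build. The following is a minimal sketch, assuming pandas is installed: `data` stands in for qc.data, `check` is a hypothetical stand-in for qc.check (not the library's implementation), and the column names are invented for illustration.

```python
import pandas as pd

# Stand-in for qc.data: during a real build this is the node's DataFrame.
data = pd.DataFrame({
    "region": ["East", "West", "East"],
    "sales": [120.0, 75.5, 310.0],
})


def check(cond):
    """Hypothetical stand-in for qc.check: fail unless cond is truthy."""
    if not bool(cond):
        raise AssertionError("check failed")
    return True


# Any pandas expression that reduces to a single boolean can serve as a condition.
check(len(data) > 0)                   # the frame has at least one row
check((data["sales"] >= 0).all())      # no negative sales values
check(data["region"].notna().all())    # no missing regions
```

Reducing with `.all()` or `.any()` matters: a bare comparison such as `data["sales"] >= 0` yields a Series, which has no single truth value.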
Functions (qc.*)
| Signature | Description |
| --- | --- |
| check(COND) | Check that COND is true |
| check_column_enum(COL_REGEX, LIST_OR_LAMBDA) | Check that all column values are in the list (and vice versa), or call a lambda on the column |
| print_recnums(COL_REGEX, EXPR) | Print line numbers of rows that match EXPR |
| check_column_valrange(COL_REGEX, minval=None, maxval=None, lambda_or_name=None) | Check that column values fall within [minval, maxval]. lambda_or_name is either a lambda expression applied to the matching column(s) or one of 'abs', 'count', 'mean', 'median', 'mode', 'stddev', or 'sum' |
| check_column_regexp(COL_REGEX, REGEX) | Check that all column values match REGEX |
| check_column_substr(COL_REGEX, SUBSTR) | Check that all column values contain substring SUBSTR |
| check_column_datetime(COL_REGEX, FORMAT) | |
COL_REGEX is a string literal or regular expression that matches one or more columns; the corresponding check is applied to each matching column.
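To make the COL_REGEX behavior concrete, here is a rough sketch of how a pattern could select columns and how a range or enum check could then be applied to each match. All function names below are hypothetical and this is not the library's code. Two assumptions are made: the pattern must match the whole column name, and "vice versa" in the enum check means every allowed value also appears in the column.

```python
import re

import pandas as pd


def matching_columns(df, col_regex):
    """Return the columns whose full name matches col_regex (assumed semantics)."""
    pattern = re.compile(col_regex)
    return [c for c in df.columns if pattern.fullmatch(str(c))]


def column_valrange_ok(df, col_regex, minval=None, maxval=None):
    """Sketch of a range check: all values in every matching column lie in [minval, maxval]."""
    cols = matching_columns(df, col_regex)
    if not cols:
        return False  # assumption: a pattern that matches no column counts as a failure
    for col in cols:
        values = df[col]
        if minval is not None and (values < minval).any():
            return False
        if maxval is not None and (values > maxval).any():
            return False
    return True


def column_enum_ok(df, col_regex, allowed):
    """Sketch of an enum check: distinct column values equal the allowed set."""
    cols = matching_columns(df, col_regex)
    if not cols:
        return False  # same assumption as above
    return all(set(df[col].dropna().unique()) == set(allowed) for col in cols)


# Invented sample data for illustration.
df = pd.DataFrame({
    "q1_sales": [10, 20, 30],
    "q2_sales": [15, 25, 400],
    "region": ["East", "West", "East"],
})

print(matching_columns(df, r"q\d_sales"))                   # both quarterly columns
print(column_valrange_ok(df, "q1_sales", 0, 100))           # True
print(column_valrange_ok(df, r"q\d_sales", 0, 100))         # False: 400 is out of range
print(column_enum_ok(df, "region", ["East", "West"]))       # True
```

A plain column name such as "q1_sales" is itself a valid pattern, which is why a string literal and a regular expression can share one parameter.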
Example
Source data: sales.xls from Tableau Community