Update the job arguments of your Glue job to include:
- Extra jars:
--extra-jars = s3://dataplatform-stg-glue-scripts/jars/deequ-1.0.3.jar
- Extra Python files:
--extra-py-files = s3://dataplatform-stg-glue-scripts/python-modules/pydeequ-1.0.1.zip
- Metrics repository S3 target location using the template format:
--deequ_metrics_location = s3://dataplatform-stg-EXAMPLE-zone/quality-metrics/department=EXAMPLE/dataset=EXAMPLE/deequ-metrics.json
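As a rough illustration of filling in the template above, the sketch below builds the metrics location from its three EXAMPLE placeholders. The function name, its parameters and the sample values are hypothetical and are not part of the platform's helpers.

```python
def deequ_metrics_location(zone: str, department: str, dataset: str,
                           bucket_prefix: str = "dataplatform-stg") -> str:
    """Fill in the EXAMPLE placeholders of the metrics location template."""
    return (
        f"s3://{bucket_prefix}-{zone}-zone/quality-metrics/"
        f"department={department}/dataset={dataset}/deequ-metrics.json"
    )


# Hypothetical values; substitute your own zone, department and dataset.
location = deequ_metrics_location("refined", "housing", "repairs")
print(f"--deequ_metrics_location = {location}")
```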
Here is an example of using deequ checks to validate a dataframe, and storing related metrics to S3.
The description_of_work column is checked to be complete, and the
work_priority_priority_code column is checked to be between 1 and 4 inclusive.
Each check can also take an optional hint message, which is displayed to the user
when the constraint fails to help diagnose the problem.
For example, the hasMin check has the hint message:
"The minimum(work_priority_priority_code) >= 1".
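The checks described above might be attached to a pydeequ Check along the following lines. This is a sketch rather than the original example: the function name is hypothetical, the range check is assumed to be a hasMin/hasMax pair on work_priority_priority_code, and the hasMax hint text is an assumption modelled on the hasMin hint.

```python
def add_example_checks(check):
    """Attach completeness and range constraints to a pydeequ Check.

    `check` is expected to be a pydeequ.checks.Check. Each constraint
    method returns the check itself, so the calls can be chained.
    """
    return (
        check
        .isComplete("description_of_work")
        .hasMin(
            "work_priority_priority_code",
            lambda minimum: minimum >= 1,
            "The minimum(work_priority_priority_code) >= 1",
        )
        .hasMax(
            "work_priority_priority_code",
            lambda maximum: maximum <= 4,
            # Assumed hint text, mirroring the hasMin hint.
            "The maximum(work_priority_priority_code) <= 4",
        )
    )


# Typical wiring inside a Glue job (not run here; needs Spark and pydeequ):
#   from pydeequ.checks import Check, CheckLevel
#   from pydeequ.verification import VerificationSuite
#   check = add_example_checks(Check(spark, CheckLevel.Error, "data quality checks"))
#   result = VerificationSuite(spark).onData(df).addCheck(check).run()
```

Passing the check in as an argument keeps the constraint definitions separate from the Spark session setup, which makes them easier to reuse across jobs.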
Here is a list of checks that are available to use.
Anomaly detection uses historic metrics to determine if a value is invalid.
You can only run an anomaly check if there are historic metrics results in the metrics repository you are using. If no historic metrics results exist, you will get the below error message:
To avoid this error, run the standard verification constraint checks first (see Example Check section above) and then add your anomaly constraint checks afterwards.
For example, we check whether the size of a dataframe has grown to more than twice its previous size.
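To make the logic of that check concrete, here is a plain Python sketch of a relative rate of change test on dataframe size. It is an illustration only, not pydeequ's implementation: the function name is hypothetical, "more than twice" is read as a ratio above 2.0, and the treatment of a previous size of zero is an assumption. In pydeequ the counterpart is likely RelativeRateOfChangeStrategy with maxRateIncrease=2.0 applied to the Size analyzer.

```python
from typing import Sequence


def size_is_anomalous(history: Sequence[float], current: float,
                      max_rate_increase: float = 2.0) -> bool:
    """Return True if `current` exceeds the last recorded size by more
    than `max_rate_increase` times."""
    if not history:
        # Mirrors the requirement that historic metrics must already exist.
        raise ValueError("no historic metrics: run the standard checks first")
    previous = history[-1]
    if previous == 0:
        # Assumption: any growth from an empty dataframe counts as anomalous.
        return current > 0
    return current / previous > max_rate_increase


print(size_is_anomalous([100], 250))  # ratio 2.5
print(size_is_anomalous([100], 200))  # ratio 2.0, not above the limit
```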
Here is a list of anomaly detection types that are available to use.
You can add tags to your verification metrics, which may be helpful when reviewing the metric
results for a particular job.
To do this, include a dictionary of key-value pairs in the
ResultKey, as shown in the example below:
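A minimal sketch of building such a tag dictionary follows. The helper name and the tag names are hypothetical, and converting every value to a string rests on the assumption that Deequ stores tags as string-to-string pairs.

```python
def build_result_tags(**tags) -> dict:
    """Return the given tags as a dict of string keys to string values."""
    return {str(key): str(value) for key, value in tags.items()}


# Hypothetical tag names and values.
tags = build_result_tags(dataset="repairs", job_run=42)
print(tags)

# With pydeequ available, the tags go in as the third ResultKey argument:
#   from pydeequ.repository import ResultKey
#   result_key = ResultKey(spark, ResultKey.current_milli_time(), tags)
```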
To ensure that your Glue job only outputs trusted data, it is important to assert that your constraints have been satisfied.
You can do this by including a helper function called
in your script (see helpers.py for more info).
You can see an example usage in the Example Anomaly Detection section.
When a constraint check fails, an error message will be provided which might look something like the below message:
Multiple constraint failures are delimited by a | character in the error message.
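The sketch below shows one way such an assertion could work. It is not the helper from helpers.py: all names here are hypothetical, and the message layout is an assumption. The row keys (constraint, constraint_status, constraint_message) follow the columns pydeequ returns from VerificationResult.checkResultsAsDataFrame.

```python
from typing import Iterable, List, Mapping


class ConstraintCheckError(Exception):
    """Raised when one or more constraints were not satisfied."""


def failed_constraint_messages(rows: Iterable[Mapping[str, str]]) -> List[str]:
    """Collect one message for every constraint that did not succeed."""
    return [
        f"{row['constraint']}: {row['constraint_message']}"
        for row in rows
        if row["constraint_status"] != "Success"
    ]


def assert_no_constraint_failures(rows: Iterable[Mapping[str, str]]) -> None:
    """Raise a single error listing all failures, delimited by ' | '."""
    failures = failed_constraint_messages(rows)
    if failures:
        raise ConstraintCheckError(" | ".join(failures))


def split_failures(error_message: str) -> List[str]:
    """Split a combined error message back into individual failures."""
    return [part.strip() for part in error_message.split("|")]


# In a Glue job the rows could come from (not run here; needs Spark and pydeequ):
#   from pydeequ.verification import VerificationResult
#   df = VerificationResult.checkResultsAsDataFrame(spark, result)
#   rows = [row.asDict() for row in df.collect()]
```

Splitting on | assumes that no individual message contains that character, so treat split_failures as a convenience for reading logs rather than a strict parser.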