
Guide to unit testing on the Data Platform


Historically, unit testing practices have helped:

  • Improve the speed at which PySpark scripts can be developed
  • Provide documentation for each script, with examples of the data it expects and the results it outputs
  • Increase the proportion of defects found before they reach the staging environment
  • Help those who come to maintain your script make changes without fear of breaking existing functionality

The most valuable code to test is code that is subject to change and has some complicated behaviour. We recommend that, unless your PySpark code is exceedingly simple, you write tests alongside the production code.

HackIT have produced a video describing unit testing principles for C# which are transferable to Spark scripts.

Organising your code so it can be tested

Writing tests around your Spark scripts requires them to be written in a way that makes them easy to test.

Using address cleaning as an example script, we have extracted all DataFrame manipulation code into a method, and then written tests against the behaviour of that method.

The "main" part of the ETL job which will run within the AWS Glue environment is then wrapped in a conditional which prevents that code from being run within the testing environment. All code, except import statements, that isn't in a method should be included in this conditional. The “main” part is also the part where AWS connections should happen. Make sure that your testable methods, outside the “main” part, don’t require AWS connections, or you won’t be able to test them locally.

Writing your own tests

  • Before starting, pull the latest Data Platform code from GitHub. You need to have git working and ready in your IDE. In VSCode, you may need to use the bash command line for git commands to be recognised.

  • Ensure that you run the existing tests and that they all pass before writing your own tests. See the instructions on running the tests for how to do this.

  • Create and check out a new branch to develop your tests. For example, if you are testing the individual cleaning scripts for Housing Repairs Google Sheets, you could use: git checkout -b testing_housing_repairs_cleaning_scripts.

  • The tests are stored in the scripts folder of the project, alongside the Glue scripts themselves.

  • There are 2 scenarios:

    • In the ideal scenario, you write your script and your tests at the same time. Start by creating your Glue job script as usual in the jobs/ folder, and a test file in the tests/ folder. Test files should be named with "test" at the beginning of the file name. If your job is department specific, then both the job script and the test file should go in that department's subfolders.
    • In the non-ideal scenario, you are writing tests a posteriori for a script that already exists. Pick this script and create the corresponding test file in the tests/ folder, again named with "test" at the beginning of the file name.
  • You will need to organise your script (or reorganise it, if it already exists) as described in "Organising your code so it can be tested" above. For existing scripts, this includes adding the line if __name__ == "__main__": just after the method definitions and just before the part that reads arguments from the AWS environment (look at existing scripts for examples). Indent everything under this new line.

  • Define a function within your script which takes in the input DataFrame(s) and returns the DataFrame to be written. (For existing scripts: move the processing code out of the main part and turn it into one or several methods.)

  • Within your test file, start creating tests. You can use test_spark_example as a template and address cleaning as an example of a fully tested script. If you do, rename the class to something relevant, for instance TestMechFireDpaCleaning. Test method names must start with "test", otherwise the tests won't be run. Ensure the name clearly describes what is being tested. For example, if you are testing the behaviour of adding a unique id column, then a suitable name may be along the lines of test_creates_unique_id_column.

  • There are some helper functions in the "helpers" folder which you can use in your tests. For example, dictionaryContains is a useful function that asserts that specific fields and values exist in an output, instead of asserting equality for a full data row. If you decide to use these functions, ensure you import them at the top of your test file; for example, add from helpers import assertions to access the methods in that file.

  • If your script uses a logger, you will need to pass it into your testable method. See the clean_addresses function in the address cleaning script for an example of how you can do this.

  • When you’ve written your first test, run it using one of the test commands. It is common practice to write a test in such a way that it initially fails, and then to make changes so that it passes. This practice is called test-driven development (TDD); for more information, this video talks through the principles of TDD.

  • If you need to add any packages in order to write your tests, you can add them to the requirements.test.txt file and they will be installed when you run your tests.

  • When you’re happy with your test, commit your changes locally on your branch.

  • To have your tests published on the Data Platform, push your commits to GitHub and create a Pull Request on your branch, so others can review and merge your code.

We use the following tools to help write and run tests against PySpark scripts.

  • The testing framework that we use for the Glue scripts is pytest.
  • unittest is a testing framework like pytest that has some helpful features we use in some of the tests.
  • Docker is used to run the tests locally. We use a Docker image provided by Amazon that replicates the environment our Glue jobs run in; it also has PySpark and pytest installed so we can easily run the tests. You will need Docker installed to run tests locally.

CI/CD Pipeline

The tests run each time a commit is pushed to a branch, including when branches are merged into main. The build will not run or deploy if there are failing tests. This is to prevent breaking changes from being deployed to the Data Platform staging environment.

This has been configured in the GitHub workflows configuration file.