- Have access to the Parking Liberator Raw zone.
- Have experience with writing SQL queries in AWS Athena already.
- Have experience running AWS Glue Crawlers.
- Know how to create a batch transformation job written in SparkSQL, hosted inside of AWS Glue Studio.
- Aware there are differences between the Presto SQL language, and SparkSQL lanugage.
As part of this workshop, we will be using Apache Spark to perform batch transformation on the Hackney parking Liberator dataset.
The Data Platform provides a managed installation of Apache Spark called AWS Glue Studio. We'll be using this tool to create an AWS Glue job which will aggregate permit renewal information.
This guide takes the reader through using SparkSQL, and assumes experience of writing and debugging SQL queries.
Once written, AWS Glue batch jobs can be scheduled to run unattended over very large datasets.
Throughout this documentation, whereever you see NAME, replace this with your name e.g. "adrian".
This is so that this guide can be performed independently by multiple people at the same time. When writing jobs outside of this guide, you don't need to follow this convention.
Imagine we want to perform analysis on the number and duration of parking permit renewals.
We first prototype our query inside of AWS Athena using its Presto SQL language.
Create an SQL query for AWS Athena which extracts, and aggregates permit renewals and produces a resultset with the below format.
The data platform provides source data within the table
You will want to convert VARCHAR columns to appropriate AWS Athena data types. Specifically, the time columns should have a TIMESTAMP type. Presto SQL provide a variety of date/time functions. Note that some of the renewal records have empty string values which may cause a cryptic error message to appear when passing an empty string into a time function.
Keep a copy of the query you create somewhere safe, for the next part of this guide.
|XXXYYYYYYY||2017-01-02 00:00:00||2018-01-02 00:00:00||2||2021||04||15|
Answer, reveal once you've attempted the above.
SELECT permit_reference, MAX(DATE_PARSE(NULLIF(renewal_end_date, ''), '%Y-%m-%d %H:%i:%S')) AS last_end_date, MIN(DATE_PARSE(NULLIF(renewal_start_date, ''), '%Y-%m-%d %H:%i:%S')) AS earliest_start_date, COUNT(*) AS number_of_renewals, import_year, import_month, import_day, import_dateFROM "dataplatform-stg-liberator-raw-zone"."liberator_permit_renewals"GROUP BY permit_reference, import_year, import_month, import_day, import_date
Once you have written, and seen a successful execution of your query in AWS Athena, we can move onto
creating an AWS Glue Job which will transform and copy the
Turning our first query into a Glue Job which transforms the data into a new dataset, will allow us to share this into a permit renewals dashboard on Google Data Studio.
We will first create a new AWS Glue Studio job by following a modified version of the guide creating a new Glue Job.
For the environment, we'll be using
For the Data source node, we'll select Data catalogue table for "S3 source type" under the "Data source properties" tab. Then choosing
liberator_permit_renewalsfor Database and Table respectively.
For the Data target node:
- Set the Format to "Glue Parquet"
- Specify the destination as
- For "Data Catalog update options" select "Create a table in the Data Catalog and on subsequent runs, update the schema and add new partitions".
- For "Database" select "dataplatform-stg-liberator-refined-zone" from the dropdown.
- In "Table name" write
- Under the parition keys, add in the following order:
For the Name of the job, specify
For the IAM Role of the job, specify
For the "Number of retries" under "Job details" specify 0.
For the "Security configuration" select "dataplatform-stg-config-to-refined".
AWS Glue enables a feature called Job Bookmarks by default.
Job bookmarks cause a glue job to perform its batch processing on data which has not already been processed previously by the same job.
This is useful to reduce the cost of processing, but when developing jobs you will want to temporarily disable this feature from the "Job Details" tab.
Once you have created and saved this job, we will replace the default transformation with our SQL created above.
- Switch to the "Visual" tab, and click on the "Transform - ApplyMapping" node. In the properties bar on the right, switch to the "Node Properties" tab and change the "Node Type" to "Spark SQL".
- Switch to the "Transform" tab paste in your SQL query from above into the "Code Block" box. The Spark SQL executor will only accept a single SQL query, and that query mustn't have a trailing semicolon.
- Change the value of "Spark SQL aliases" to
liberator_permit_renewals, and remove any usage of a database prefix
dataplatform-stg-liberator-raw-zonefrom the SQL query inside of the "Code Block". If your query joined multiple tables, each table would need a distinct "Data Source" linked to the "Spark SQL" node.
- Click the Save button, followed by the Run button.
- Click on the "Runs" tab, and follow the progress of your job.
- Once finished, the job might fail with an error message similar to this
AnalysisException: " Undefined function: 'date_parse'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'."
Switching back to the "Visual" tab, continue modifying, saving and running the SQL query
of your AWS Glue job until the "Run status" becomes "Succeeded".
The Spark SQL
to_timestamp function documentation might be useful.
Confirm your AWS Glue Job has worked as you expected by querying the newly created table
in AWS Athena under the database
dataplatform-stg-liberator-refined-zone, with the
Once you have finished the exercise
- Find and delete your job within AWS Glue Jobs list.
- From the S3 console, delete the folders your job created.
There will be a folder called
NAME_parking_permit_renewalsand a file called
- Delete the table created within the AWS Glue table interface.
You can find the table by searching for
NAME_parking_permit_renewalswithin the search box.