Using AWS Glue Studio to create ETL processes#

AWS Glue Studio is a graphical interface that makes it easy to create, run, and monitor extract, transform, and load (ETL) jobs in AWS Glue.

To use AWS Glue Studio, the job needs to be created manually; this guide includes the steps needed to create an editable job in Glue Studio. Templates may already have been created, and existing jobs can be duplicated to make it easier to get started with Glue Studio. Jobs whose scripts are created dynamically using Terraform (the infrastructure-as-code tool Hackney uses) cannot be edited in the visual editor.

Note: If a job has already been created manually, you can select it using the radio button and, in the job Actions menu, select Clone job.

Creating a new Glue job#

Note: The instructions below assume an S3 Data Source and Target Location.

  1. Log in to the AWS Management Console.
  2. Select the AWS Glue service from the services menu (or search field).
  3. From the AWS Glue menu ETL section, select AWS Glue Studio.
  4. From the AWS Glue Studio main page, choose the Create and manage jobs option.
  5. Within the Create Job section, choose Source and target added to the graph and select the desired Source and Target values.
  6. Click the Create button.
  7. In the Visual editor that is now displayed, click the Data Source - S3 box and, in the Data source properties - S3 tab, set the S3 source type to S3 location and set the S3 URL of the source data.
  8. Similarly, click the Data Target - S3 box and, in the Data target properties - S3 tab, set the S3 URL for the data target, usually your department folder in the Data Platform account (you can add additional folders for your new data within your department folder at this point). Both the data source and the data target (amongst other operations) must be set before the job can be saved. You can also apply transformations specific to your job via the Visual tab; see the AWS Glue Studio Documentation.
  9. To complete the set-up, select the Job details tab.
  10. Complete the Name and optional Description fields. You may use this job as a template for repeat use, so a generic name might be useful to begin with.
  11. Select dataplatform-{environment}-glue-role (where environment is either 'stg' or 'prod') as the IAM Role for the job.
  12. The default values of the remaining standard fields are usually fine to use. The programming language Glue Studio creates can be Scala or Python (a sketch of a generated Python script follows this list).
  13. In Advanced properties, check that the Script filename refers to the task being carried out.
  14. Set the Script path to the central scripts S3 bucket: s3://dataplatform-{environment}-glue-scripts/custom/ where environment is either 'stg' or 'prod' - you can create new folders or specify existing folders in the S3 bucket like this: s3://dataplatform-{environment}-glue-scripts/custom/YOUR_FOLDER_NAME/.
  15. Set the Temporary path to the central temp storage S3 bucket: s3://dataplatform-{environment}-glue-temp-storage/ where environment is either 'stg' or 'prod'.
  16. In Security configuration select the appropriate security configuration for your target bucket location (for example for the Raw Zone, use glue-job-security-configuration-to-raw).
  17. Ensure the Server-side encryption option is not checked, so that the job uses the bucket's default encryption configuration.
  18. In the Tags section, add the key PlatformDepartment and set the value as the name of your department. This should be the same name which was used to set up the department on the Data Platform.
  19. Add any further libraries or parameters you need in the Advanced properties section.
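For reference, below is a minimal sketch of the kind of Python script Glue Studio generates for a job like this. The source and target paths, the CSV source format and the Parquet target format are illustrative assumptions; the script generated for your job will differ.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Data Source - S3: read the source data from an S3 location (path assumed)
source = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://your-source-bucket/your-source-folder/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Any transformations applied in the Visual tab would appear here.

# Data Target - S3: write the result to your department folder (path assumed)
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://your-target-bucket/your-department-folder/"},
    format="parquet",
)

job.commit()
```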

Note: Exporting Data#

When exporting data from your Glue job, it is important that you follow the Data Platform's partitioning strategy. To do this, data must be exported with the partitions import_year, import_month, import_day and import_date at a minimum. To accomplish this using AWS Glue Studio, use the Add a partition key option at the bottom of the Data target node. Note that in order to add these attributes as partitions, they must exist within the dataset as fields, as in the sketch below.
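As an illustration, the sketch below (continuing from the generated-script sketch above, so glue_context and source are assumed to exist) adds the import_* fields from today's date and writes the target partitioned by them. The Map transform and the target path are illustrative assumptions, not the only way to achieve this.

```python
from datetime import date

from awsglue.transforms import Map

# Assumed helper: derive the required import_* fields from today's date and
# add them to each record, since partition keys must exist as fields.
def add_import_partitions(record, today=date.today()):
    record["import_year"] = str(today.year)
    record["import_month"] = str(today.month).zfill(2)
    record["import_day"] = str(today.day).zfill(2)
    record["import_date"] = today.strftime("%Y%m%d")
    return record

partitioned = Map.apply(frame=source, f=add_import_partitions)

# Write the target partitioned by the required keys (target path assumed)
glue_context.write_dynamic_frame.from_options(
    frame=partitioned,
    connection_type="s3",
    connection_options={
        "path": "s3://your-target-bucket/your-department-folder/",
        "partitionKeys": ["import_year", "import_month", "import_day", "import_date"],
    },
    format="parquet",
)
```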

Clone and edit an existing Glue job#

  1. In the Your jobs section, select the radio button of the job you would like to clone.
  2. In the Actions dropdown, select Clone job (a programmatic alternative using boto3 is sketched after this list). If an information banner appears with the message Your job was created outside of Glue Studio and has no DAG available, there will be no Visual tab, and therefore no interface for editing your job; instead you will need to edit the Python job script via the Script tab. If you prefer to use the interface, select another job which has a Visual tab, or create a new job as instructed in the Creating a new Glue job section above.
  3. In the Job details tab, update the Name of your new job and check the Creating a new Glue job section above to ensure the configuration suits your needs.
  4. To save your job, select the Save button.
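The console steps above are the documented route. Purely as an illustration, a job can also be cloned programmatically with boto3; the job names below are hypothetical, and the sketch copies only the core fields of the job definition.

```python
import boto3

glue = boto3.client("glue")

# Fetch the definition of the job to clone (job name is hypothetical)
existing = glue.get_job(JobName="existing-job-name")["Job"]

# Create a copy under a new name, reusing the original command and role
glue.create_job(
    Name="cloned-job-name",
    Role=existing["Role"],
    Command=existing["Command"],
    DefaultArguments=existing.get("DefaultArguments", {}),
    GlueVersion=existing.get("GlueVersion", "3.0"),
)
```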

Email notifications of failing Glue jobs#

Each time a Glue job fails, an email notification with details of the error message is sent to the respective department and its subscribed members.

The message will include:

  • Name of the Glue job
  • Error message
  • Time of failure
  • Job start time
  • Job end time
  • Job last modified time
  • A link to log in to Hackney SSO and view the Job run details

In order to receive email notifications, you will need to ensure that you are subscribed to receive emails from your department's Google group and that you have confirmed your subscription to receive AWS Notifications when prompted.

Important

Ensure the PlatformDepartment tag is correctly set in the Advanced properties section of the Glue job's Job details tab (see the Creating a new Glue job section above).