

We have developed a number of training modules to help users understand the end-to-end journey of data in the platform. The training modules should also build confidence with the core elements and processes of the Data Platform by allowing users to practise in a safe environment whilst using open (non-sensitive) data. Depending on what you intend to do in the Data Platform, you may not need to complete all the modules.

Training module: Ingesting data from Google Sheets

This module is intended for users who want to add/ingest data into the platform so they or others can use it in other tools (e.g. Qlik). After completing this module, you should be able to do the following:

  • Ingest Google Sheets into the Data Platform
  • Write a simple Terraform module
  • Submit a pull request in GitHub
  • Navigate the AWS console
  • Run a job in AWS Glue
  • View data objects in AWS S3
  • Query data in AWS Athena
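
To give a flavour of the final step, Athena lets you query the ingested data in S3 with ordinary SQL. As a local stand-in, you can run the same shape of query against SQLite from Python (Athena itself uses a Presto-based SQL engine, and the table, columns and values below are made up for illustration):

```python
import sqlite3

# Hypothetical table standing in for an ingested Google Sheet.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE street_trees (species TEXT, height_m REAL)")
conn.executemany(
    "INSERT INTO street_trees VALUES (?, ?)",
    [("oak", 12.5), ("plane", 18.0), ("birch", 7.2)],
)

# The same shape of query you would type into the Athena console:
rows = conn.execute(
    "SELECT species, height_m FROM street_trees "
    "WHERE height_m > 10 ORDER BY height_m DESC"
).fetchall()
print(rows)
```

In Athena the table definition comes from the Glue Data Catalog rather than a CREATE TABLE you run yourself, but the SELECT itself reads the same way.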

Training module: Transforming data to the refined zone using SageMaker

This module is intended for users who want to transform data within the Data Platform and make the result available for querying. After completing this module, you should be able to do the following:

  • Load one or more datasets from S3 into a 'notebook' coding environment on AWS (SageMaker)
  • Perform simple data transformations in PySpark
  • Write the transformed dataset into the refined zone by setting up job parameters
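
The kind of transformation covered by this module can be sketched as follows. It is shown in plain Python so it runs anywhere; the equivalent PySpark calls are noted in comments, and the dataset and column names are made up:

```python
# Hypothetical raw dataset, as it might arrive in the landing zone.
raw_rows = [
    {"ward": "Dalston", "count": 42},
    {"ward": "Hoxton", "count": None},   # missing value to clean out
    {"ward": "Homerton", "count": 17},
]

# Drop rows with missing values.
# PySpark equivalent: df = df.filter(df["count"].isNotNull())
cleaned = [r for r in raw_rows if r["count"] is not None]

# Derive a new column from the existing ones.
# PySpark equivalent: df = df.withColumn("count_pct", ...)
total = sum(r["count"] for r in cleaned)
refined = [
    {**r, "count_pct": round(100 * r["count"] / total, 1)}
    for r in cleaned
]
print(refined)
```

In the module itself you would do this in a SageMaker notebook on a Spark DataFrame loaded from S3, then write the result to the refined zone.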

Training module: Deploying a job in AWS Glue

This module is intended for users who want to schedule or automate data transformations. After completing this module, you should be able to do the following:

  • Add a Glue job within the AWS console to test the job in a staging environment
  • Write a Terraform module to deploy the Glue job into the Data Platform production environment, with a scheduler to automate the process
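
For context on the job itself: a Glue job receives its parameters as --KEY value command-line arguments, which Glue scripts normally read with getResolvedOptions from awsglue.utils. That library is only available inside Glue, so here is a minimal local stand-in showing the shape (the parameter names and values are illustrative):

```python
def resolve_options(argv, expected):
    """Minimal stand-in for awsglue.utils.getResolvedOptions:
    pull --KEY value pairs out of argv for the expected keys."""
    args = {}
    for key in expected:
        flag = f"--{key}"
        if flag in argv:
            args[key] = argv[argv.index(flag) + 1]
        else:
            raise KeyError(f"missing required job parameter {flag}")
    return args

# Glue invokes the script roughly like this (values are made up):
argv = ["job.py", "--JOB_NAME", "refine-trees",
        "--s3_target", "s3://refined-zone/trees/"]
opts = resolve_options(argv, ["JOB_NAME", "s3_target"])
print(opts["s3_target"])
```

The parameter values themselves (such as the target S3 path or the schedule) are what your Terraform module would set when deploying the job.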

Why we are moving towards using a Data Platform

Please read the Playbook Documentation

What are the pre-requisites for these training modules?

Introduction: Experience, skills, software and equipment

To get the most out of these training modules, you must be comfortable with the idea of coding. It is fine to be a beginner in Python and Terraform, but you should at least have a bit of experience with a programming language (such as SQL or developing formulas in Microsoft Excel and/or Google Sheets).

You’ll also need accounts and permissions to access some platforms; please get in touch to be given the necessary permissions.

You may have to contact the IT Support Desk to install some prerequisite software if you do not have the permissions to do it yourself.


What you'll need to set up before starting specific training modules:

Module                                                      GitHub   Notebook Environment   AWS Access
Ingesting data from Google Sheets                           X
Transforming data to refined zone using Amazon SageMaker    X        X                      X
Deploying a job in AWS Glue                                 X                               X

Tools used with the Data Platform

Amazon Web Services (AWS)

AWS (Amazon Web Services) is a comprehensive cloud platform offering over 200 services, including data storage, computing power, and tools such as Glue, Athena, SageMaker and Redshift, which we use in the Data Platform. AWS is the Council’s primary cloud supplier and the chosen environment in which we are building the Data Platform.

GitHub: Online code repository

GitHub is where the Data Platform code repository is stored. In particular, it holds the Terraform scripts: the code that controls and manages the Data Platform resources in AWS. We also store the code for the Playbook there. You can refer to the Getting Set Up with GitHub Playbook entry for more information.

Using GitHub means we have version control over our code. It also means we have a way to verify and approve code before it is deployed to the Data Platform.
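
That verify-and-approve step is the standard GitHub branch-and-pull-request workflow. Here is a sketch of it, run in a throwaway local repository so it is safe to try (the branch and file names are illustrative):

```shell
# Practise in a throwaway repo; in real use you would work
# in a clone of the Data Platform repository instead.
cd "$(mktemp -d)"
git init -q .
git config user.email "you@example.com"
git config user.name "Your Name"

# Make your change on its own branch.
git checkout -q -b feature/ingest-my-sheet
echo '# terraform module goes here' > my_module.tf
git add my_module.tf
git commit -q -m "Add Google Sheets ingestion module"

# In the real repository you would now publish the branch and
# open a pull request on GitHub for review:
#   git push -u origin feature/ingest-my-sheet
git rev-parse --abbrev-ref HEAD
```

Once a reviewer approves the pull request and it is merged, the change can be deployed to the Data Platform.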

The Jupyter notebook environment through Amazon SageMaker

Jupyter notebooks allow users to write snippets of Python code in 'cells', which makes it easier to develop scripts. Users of the Data Platform can now access Jupyter notebooks through Amazon SageMaker, which means that scripts can be developed within the browser and AWS environment rather than on local machines.

How to get set up


  • Instructions to set up
    • GitHub Playbook documentation
    • If you are part of the HackIT GitHub Group, you don’t need to set anything up
    • If you are not, make a request in the HackIT Dev Slack Channel
  • Common Problems / FAQ
    • I can’t find the repository for the Data Platform
      • It can be found here

Notebook Environment

  • Instructions to set up
  • Common Problems / FAQ
    • How can I tell if my Notebook Environment is working?
      • When you open the Jupyter instance, open template.ipynb and try to run the first cell. It should create a Spark instance; if it doesn’t, something might be wrong, and it is best to ask somebody for help with troubleshooting.