Confirm that the data catalog meets the minimum expectations for infrastructure and security so that it can be maintained long term.
The following criteria were to be investigated:
- Docker/container setup (via compose, ECS, Fargate etc)
- Role Based Access Control
Docker/container setup (via compose, ECS, Fargate etc)
Deploying this with Docker and ECS was not feasible due to the number of dependencies.
DataHub provides documentation on how to deploy using Kubernetes and AWS EKS. Due to a lack of Kubernetes expertise, this route was not explored further.
You can enable OIDC SSO in the React application (datahub-frontend).
This was not investigated in this spike due to lack of time.
Role Based Access Control
Limitation: not reviewed as part of the spike because we ran out of time.
- Could not get the DataHub CLI to work with the deployment process specified above
- Must ensure there is a process for regularly re-cloning the repo to pick up updates
- DataHub has various dependencies
- There are no obvious solutions for notifications or alerts
- Investigate DataHub SaaS as an option
Helpful Resources / Documentation
Deploying to EC2
- To deploy DataHub, provision an EC2 instance with at least 4 GB of memory so that it can run the DataHub Docker containers
- SSH into the instance and run the following:
sudo yum update -y
sudo amazon-linux-extras install docker
sudo service docker start
sudo usermod -a -G docker ec2-user
# log out and reconnect over SSH so the docker group membership takes effect
sudo yum install git -y
# install docker-compose
sudo curl -L https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s)-$(uname -m) -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
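# download datahub
git clone https://github.com/linkedin/datahub.git
# start datahub (assumes the quickstart compose file is under docker/quickstart in the cloned repo)
docker-compose -f datahub/docker/quickstart/docker-compose.quickstart.yml up -d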
vi glue_to_datahub.yml # add params
python3 -m pip install --upgrade pip wheel setuptools
pip install 'acryl-datahub[glue]'
datahub ingest -c glue_to_datahub.yml
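The memory requirement noted at the start of this section can be checked on the instance before starting the containers. The following is a minimal sketch, assuming a Linux host that exposes /proc/meminfo. The function names and the 10% slack factor are illustrative assumptions, not part of DataHub. The slack is there because an instance sold as 4 GB usually reports slightly less usable memory.

```python
import re

MIN_MEMORY_GB = 4   # threshold taken from the deployment note above
SLACK = 0.9         # assumption: tolerate the memory the kernel reserves for itself


def total_memory_gb(meminfo_text: str) -> float:
    """Return total RAM in GiB parsed from the text of /proc/meminfo."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal not found in meminfo text")
    # /proc/meminfo reports kB (1024-byte units), so divide by 1024 * 1024 for GiB
    return int(match.group(1)) / (1024 * 1024)


def has_enough_memory(meminfo_text: str, minimum_gb: float = MIN_MEMORY_GB) -> bool:
    """True if the reported memory is within SLACK of the minimum."""
    return total_memory_gb(meminfo_text) >= minimum_gb * SLACK


def check_host(path: str = "/proc/meminfo") -> bool:
    """Read the host's meminfo file and apply the check."""
    with open(path) as handle:
        return has_enough_memory(handle.read())
```

Calling check_host() on the instance returns False when the instance is too small for the quickstart containers.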
Deploying to EKS
- To deploy DataHub to EKS we used eksctl, helm and kubectl using the following guide
- When deploying to a production environment we would prefer to deploy using Terraform.
We have added helper files to the data platform project under the datahub-deployment directory. This directory includes:
- A makefile with commands for various common tasks
- A values.yaml file which contains a helm config for the DataHub chart
- A kubeconfig file with the details for connecting to the staging EKS cluster
- Note: The file does NOT contain credentials. Your own AWS credentials are used when connecting.
The makefile contains commands for the following tasks:
- Create or update the kubeconfig file for the staging cluster
- Get information about all the resources running in the cluster
- Update the DataHub images to the latest version
- Print historical revisions for a given release
- Pull the latest DataHub helm repository
- Git clone the DataHub project repo to your current location
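The makefile itself is not reproduced on this page. As a rough guide to what each task involves, the sketch below maps the six tasks to the aws, kubectl, helm and git commands that would typically perform them. The task names, cluster name, region, release name and file names are all hypothetical placeholders, and the project's makefile may well differ.

```python
import shlex

# Hypothetical placeholders: none of these values come from the project's makefile.
CLUSTER = "staging-cluster"
REGION = "eu-west-2"
RELEASE = "datahub"
KUBECONFIG = "kubeconfig"

# Assumes the chart repo was added under the name "datahub", giving the chart
# reference "datahub/datahub".
TASKS = {
    "update-kubeconfig": ["aws", "eks", "update-kubeconfig", "--name", CLUSTER,
                          "--region", REGION, "--kubeconfig", KUBECONFIG],
    "get-all": ["kubectl", "--kubeconfig", KUBECONFIG, "get", "all", "--all-namespaces"],
    "upgrade": ["helm", "--kubeconfig", KUBECONFIG, "upgrade", "--install",
                RELEASE, "datahub/datahub", "--values", "values.yaml"],
    "history": ["helm", "--kubeconfig", KUBECONFIG, "history", RELEASE],
    "repo-update": ["helm", "repo", "update"],
    "clone": ["git", "clone", "https://github.com/linkedin/datahub.git"],
}


def command_line(task: str) -> str:
    """Return the shell command for a task as a single, safely quoted string."""
    return shlex.join(TASKS[task])
```

Printing command_line(name) for each task gives a dry-run view of the commands without executing anything against the cluster.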
To ingest data into the data catalogue we followed this guide: Metadata Ingestion
This involved installing the PyPI prerequisites on an engineer's machine using the following commands:
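python3 -m venv venv
source venv/bin/activate # activate the virtual environment so packages install into it
pip3 install --upgrade pip wheel setuptools
pip3 install --upgrade acryl-datahub
datahub version # confirm the CLI is installed
# If you see "command not found", try running this instead: python3 -m datahub version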
Once the prerequisites had been installed, we needed to install the specific plugin required to ingest data from the data platform. In this case we installed the Glue plugin, which allowed us to ingest data from the Glue data catalogue.
pip3 install 'acryl-datahub[glue]' # install the required plugin
datahub check plugins
Once the prerequisites and the required plugin are installed, you can ingest data from the Glue catalogue by locating and updating the glue_to_datahub.yml example metadata ingestion recipe, which can be found here: /metadata-ingestion/examples/recipes/glue_to_datahub.yml
Its contents should look as below:
# see https://datahubproject.io/docs/metadata-ingestion/source_docs/glue for complete documentation
# see https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub for complete documentation
You will need to update the following properties:
source -> config -> aws_region
sink -> config -> server
Set both to values that reflect the staging environment. Once you have updated the recipe configuration, you can run the ingestion with the following command:
datahub ingest -c ./metadata-ingestion/examples/recipes/glue_to_datahub.yml
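To make the two properties concrete, the sketch below renders a small recipe with a Glue source and a DataHub REST sink, substituting aws_region and server. It is an illustration under assumptions, not a copy of the example recipe in the repo. The region and server values shown in the usage note are placeholders, and the real recipe may carry further options.

```python
# Minimal recipe shape: a Glue source and a DataHub REST sink.
# Only the two properties discussed above are parameterised.
RECIPE_TEMPLATE = """\
source:
  type: glue
  config:
    aws_region: "{aws_region}"

sink:
  type: "datahub-rest"
  config:
    server: "{server}"
"""


def render_recipe(aws_region: str, server: str) -> str:
    """Return recipe YAML text with the region and server filled in."""
    for value in (aws_region, server):
        # Keep the sketch simple: reject values that would break the quoting.
        if '"' in value or "\n" in value:
            raise ValueError(f"unsupported character in value: {value!r}")
    return RECIPE_TEMPLATE.format(aws_region=aws_region, server=server)


def write_recipe(path: str, aws_region: str, server: str) -> None:
    """Write the rendered recipe to path."""
    with open(path, "w") as handle:
        handle.write(render_recipe(aws_region, server))
```

For example, write_recipe("glue_to_datahub.yml", "eu-west-2", "http://localhost:8080") produces a file that can be passed to datahub ingest -c. Both argument values here are hypothetical and should be replaced with the staging region and the address of the staging DataHub server.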
When setting up the DataHub data catalogue site we required both a security certificate and a domain name that could be associated with the Kubernetes cluster.
To accomplish this we:
- Created a Name Server (NS) record on the Hackney domain for the data.hackney.gov.uk sub-domain. Pull Request
- This granted the production Data Platform account control over the sub-domain.
- Created a Name Server (NS) record on the data.hackney.gov.uk Hosted Zone for the stg.data.hackney.gov.uk sub-domain.
- This granted the staging Data Platform account control over the sub-domain.
- Created a CNAME record on the stg.data.hackney.gov.uk Hosted Zone for the datahub.stg.data.hackney.gov.uk sub-domain that points at the Kubernetes cluster.
- Using AWS Certificate Manager, created a security certificate for datahub.stg.data.hackney.gov.uk using DNS validation.
- A guide to creating a certificate is referenced by the datahub setup guide and can be found here: Create Certificate
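The delegation steps above form a chain: each parent zone holds an NS record handing a sub-domain to the next zone, and the last zone holds the CNAME for the site itself. The sketch below derives that chain from a host name. The function name is illustrative, and it assumes one hosted zone per label, which matches the setup described here but is not a general DNS rule.

```python
from typing import List, Tuple


def delegation_chain(fqdn: str, apex: str) -> List[Tuple[str, str, str]]:
    """List (hosted zone, record type, record name) from the apex zone down to fqdn."""
    if not fqdn.endswith("." + apex):
        raise ValueError(f"{fqdn} is not under {apex}")
    # Labels between the apex and the host, outermost first, e.g. data, stg, datahub
    labels = fqdn[: -len(apex) - 1].split(".")[::-1]
    chain = []
    zone = apex
    for index, label in enumerate(labels):
        name = f"{label}.{zone}"
        # Every step delegates with NS except the last, which is the site's CNAME
        record_type = "CNAME" if index == len(labels) - 1 else "NS"
        chain.append((zone, record_type, name))
        zone = name
    return chain
```

Calling delegation_chain("datahub.stg.data.hackney.gov.uk", "hackney.gov.uk") yields three entries: two NS records followed by the CNAME, in the same order as the steps listed above.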