DataHub deployment

Objective

Confirm that the data catalog meets the minimum expectations for infrastructure and security so that it can be maintained long term.

Considerations

The following criteria were to be investigated:

Docker/container setup (via compose, ECS, Fargate etc)
SSO
Role Based Access Control
Notifications/Alerts

Findings

Docker/container setup (via compose, ECS, Fargate etc)

Deploying DataHub with Docker on ECS was not feasible due to the number of dependencies DataHub has.

DataHub provides documentation on how to deploy using Kubernetes and AWS EKS. Due to a lack of Kubernetes expertise, this route was not explored further.

In the end, we deployed DataHub by provisioning an EC2 instance, connecting to the instance through SSH, cloning the DataHub repository onto the instance, and running DataHub via docker-compose.

SSO

OIDC SSO can be enabled in the React application (the DataHub frontend).

This was not investigated in this spike due to lack of time.

Role Based Access Control

Limitation: Not reviewed as part of the spike because we ran out of time.

Summary

Limitations

Could not get the DataHub CLI to work with the deployment process described above
A process is needed to regularly re-clone the repo to pick up updates
DataHub has various dependencies
There are no obvious solutions for notifications or alerts

Suggestions

Investigate DataHub SaaS as an option

Helpful Resources / Documentation

Deploying to EC2

To deploy DataHub, provision an EC2 instance with at least 4 GB of memory so that it can run the DataHub Docker containers.
SSH into the instance and run the following:

sudo yum update -y

sudo amazon-linux-extras install docker

sudo service docker start

sudo usermod -a -G docker ec2-user

# log out and reconnect over SSH so that the docker group membership takes effect

sudo yum install git -y
# install docker-compose (Compose v2 release assets use a lower-case OS name, hence the tr)
sudo curl -L https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s | tr '[:upper:]' '[:lower:]')-$(uname -m) -o /usr/local/bin/docker-compose

sudo chmod +x /usr/local/bin/docker-compose
# download datahub
git clone https://github.com/linkedin/datahub.git

cd datahub/docker/quickstart
# start datahub
docker-compose -f docker-compose.quickstart.yml up -d

cd ../../metadata-ingestion/examples/recipes/
vi glue_to_datahub.yml # add params

python3 -m pip install --upgrade pip wheel setuptools

pip install 'acryl-datahub[glue]'

datahub ingest -c glue_to_datahub.yml
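The 4 GB memory requirement can be checked on the instance before the containers are started. The following is a minimal sketch rather than part of the original spike: the function names has_min_memory and total_memory_kb and the default threshold of 3,500,000 kB are assumptions (the threshold sits a little below 4 GB because the kernel reports slightly less than the nominal instance memory).

```shell
# Hypothetical helper: succeeds when total memory (in kB) meets the minimum.
# The second argument is optional and defaults to an assumed 3500000 kB.
has_min_memory() {
  local total_kb="$1"
  local required_kb="${2:-3500000}"
  [ "$total_kb" -ge "$required_kb" ]
}

# Prints MemTotal from /proc/meminfo in kB (Linux only).
total_memory_kb() {
  awk '/^MemTotal:/ {print $2}' /proc/meminfo
}
```

On the instance this could be used as: has_min_memory "$(total_memory_kb)" || echo "Instance may be too small for the DataHub quickstart" >&2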

Deploying to EKS

To deploy DataHub to EKS we used eksctl, helm and kubectl, using the following guide.
For a production environment we would prefer to deploy using Terraform.

We have added helper files to the data platform project under the datahub-deployment directory. This directory includes:

A Makefile with commands for various common tasks
A values.yaml file which contains a helm config for the DataHub chart
A kubeconfig file with the details for connecting to the staging EKS cluster
- Note: The file does NOT contain credentials; your own AWS credentials are used

Makefile

The Makefile contains the following commands:

generate-kubeconfig-file: Creates or updates the kubeconfig file for the staging cluster

get-all: Gets information about all the resources running in the cluster

upgrade-datahub-release: Updates the DataHub images to the latest version

datahub-release-history: Prints historical revisions for a given release

pull-datahub-repository: Pulls the latest DataHub helm repository

datahub-source: Clones the DataHub project repo to your current location
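The Makefile itself is not reproduced here. As a rough sketch of the kind of commands such targets might wrap, assuming the standard aws, kubectl and helm CLIs and the public DataHub Helm repository, the functions below mirror the target names. The cluster name, release name and file paths are hypothetical, and the real Makefile may differ.

```shell
# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

CLUSTER_NAME="${CLUSTER_NAME:-staging-cluster}"  # hypothetical cluster name
RELEASE_NAME="${RELEASE_NAME:-datahub}"          # hypothetical helm release name

# Create or update ./kubeconfig using your own AWS credentials.
generate_kubeconfig_file() {
  run aws eks update-kubeconfig --name "$CLUSTER_NAME" --kubeconfig ./kubeconfig
}

# List the resources running in the cluster.
get_all() {
  run kubectl --kubeconfig ./kubeconfig get all --all-namespaces
}

# Add and refresh the DataHub helm repository.
pull_datahub_repository() {
  run helm repo add datahub https://helm.datahubproject.io/
  run helm repo update
}

# Install or upgrade the release using the values.yaml config.
upgrade_datahub_release() {
  run helm --kubeconfig ./kubeconfig upgrade --install "$RELEASE_NAME" datahub/datahub --values values.yaml
}

# Print the revision history of the release.
datahub_release_history() {
  run helm --kubeconfig ./kubeconfig history "$RELEASE_NAME"
}

# Clone the DataHub project repo into the current location.
datahub_source() {
  run git clone https://github.com/linkedin/datahub.git
}
```

Setting DRY_RUN=1 prints each command instead of running it, which is a cheap way to review what a target would do before touching the cluster.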

Data Ingestion

To ingest data into the data catalogue we followed this guide: Metadata Ingestion

This involved installing the PyPI prerequisites on an engineer's machine using the following commands:

python3 -m venv venv
source venv/bin/activate
pip3 install --upgrade pip wheel setuptools
pip3 install --upgrade acryl-datahub
datahub version
# If you see "command not found", try running this instead: python3 -m datahub version

Once the prerequisites had been installed, we needed to install the specific plugin required to ingest data from the data platform. In this case we installed the glue plugin, which allowed us to ingest data from the Glue data catalogue.

pip3 install 'acryl-datahub[glue]' # install the required plugin
datahub check plugins

Once the prerequisites and the required plugins are installed, you can ingest data from the Glue catalogue by locating and updating the glue_to_datahub.yml metadata ingestion example recipe, which can be found here: /metadata-ingestion/examples/recipes/glue_to_datahub.yml

Its contents should look like this:

# see https://datahubproject.io/docs/metadata-ingestion/source_docs/glue for complete documentation
source:
  type: glue
  config:
    aws_region: "us-west-2"
    extract_transforms: true

# see https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub for complete documentation
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"

You will need to update the following properties with values that reflect the staging environment:

source -> config -> aws_region
sink -> config -> server

Once you have updated the recipe configuration you can run the ingestion with the following command:

datahub ingest -c ./metadata-ingestion/examples/recipes/glue_to_datahub.yml
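Rather than editing the example recipe by hand, the two properties can be filled in from variables. This is a minimal sketch, not something done in the spike: write_glue_recipe is a hypothetical helper, and the region and server values in the usage line are placeholders, not the staging values.

```shell
# Hypothetical helper: print a Glue-to-DataHub recipe for the given
# AWS region ($1) and DataHub server URL ($2). The keys mirror the
# example recipe shown above.
write_glue_recipe() {
  local aws_region="$1"
  local server="$2"
  cat <<EOF
source:
  type: glue
  config:
    aws_region: "${aws_region}"
    extract_transforms: true

sink:
  type: "datahub-rest"
  config:
    server: "${server}"
EOF
}
```

For example, write_glue_recipe "eu-west-2" "http://localhost:8080" > glue_staging.yml writes a recipe that can then be passed to datahub ingest -c glue_staging.yml.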

DNS configuration

When setting up the DataHub data catalogue site we required both a security certificate and a domain name that could be associated with the Kubernetes cluster.

To accomplish this we:

Created a Name Server (NS) record on the Hackney domain for the data.hackney.gov.uk sub-domain. Pull Request
- This granted the production Data Platform account control over the sub-domain.
Created a Name Server (NS) record on the data.hackney.gov.uk Hosted Zone for the stg.data.hackney.gov.uk sub-domain.
- This granted the staging Data Platform account control over the sub-domain.
Created a CNAME record on the stg.data.hackney.gov.uk Hosted Zone for the datahub.stg.data.hackney.gov.uk sub-domain that points at the Kubernetes cluster.
Used AWS Certificate Manager to create a security certificate for datahub.stg.data.hackney.gov.uk, validated by DNS.
- A guide to creating a certificate is referenced by the datahub setup guide and can be found here: Create Certificate
