AWS Data Migration Service (DMS) is a service that allows us to migrate data between a source (in our case, on-premises database) and a target (in our case, Postgres database hosted in AWS).
Continuous replication (CDC)
- When we want to do a one-off migration of all data and then continuously capture new inserts, updates and deletes and reflect them in our target database
One-off data migration
- When the goal is to migrate all data from a source, and is expected that changes will not be captured and reflected
CDC is a SQL server feature, available only on Enterprise and Developer editions.
It allows for changes to be captured (inserts/updates/deletes).
When the source database does not have primary keys and you want to migrate data continuously.
MS Replication is a SQL server feature available on all editions.
It creates a “distribution” database and every time there is a change, it is captured and stored in the “distribution” database.
.MS will then read from that database to reflect the changes in the target database.
Note: The sql user created must have sysadmin permissions to set up replication
Additional notes: Configuration on the source database is required (please see below section). Additionally, SQL servers DO NOT come with MS replication features pre-installed, so the server might require a set up.
When you want to migrate data continuously
When the SQL server is not Enterprise/Developer edition
When the source database has tables, which make use of primary keys
No database configuration is required
The sql user must have at least db_owner permissions
The replication runs ones and migrates the data specified
There are no subsequent runs of the migration task, unless triggered with other means
When only a one-off migration is required
When the underlying source database is a reporting server and there are no possible ways to capture updates. In this scenario, we need to daily run a one-off migration, after the reporting server was updated with the latest data
Both DMS and Postgres can be created via Terraform.
Follow the example usage, which also demonstrates how to add table mappings (specifying which tables are to be replicated)
The source DB server should be specified with IP and not the server name
DMS instance should be in the VPC, where the VPN is set up to ensure communication to on-prem is possible
- Make sure your DMS instance’s subnet group has only private subnets in it!
- DMS does not support Postgres version 12, so use version 11 or older.
- Always store passwords in parameter store and do not hardcode them
- “Multi_az” should be true for production databases
- “subnet_ids” requires subnets in 2 different AZs. Make sure those are private subnets to ensure that the database is secure.
- Currently not terraformed: To enable traffic from DMS to your Postgres instance, ensure you add to the ingress rules of the database’s security group all traffic from DMS security group.
What is a data pipeline?
A data pipeline is an automated flow that gets data stored in one location (source) and uploads it to a target destination.
As of 26/06/2020, we have implemented one data pipeline.
The pipeline takes data uploaded in an S3 bucket in .csv format and uploads the data into a Postgres database.