By Anton Grbin, Staff Engineer
AWS Data Migration Service migrates data between various databases. Precisely as the name says—I **love** readable names! After a year of DMSing, the name is the only predictable and straightforward part of it.
We use DMS to replicate our data from application databases (MySQL, Postgres) to the data lake. Shoveling data around is a classic data engineering problem, and every data-driven company is doing some form of it.
We replicate relational databases to the data lake for analytics purposes. The data first lands on S3 in transaction logs (Change Data Capture) form. As the data flows into S3, we use batch transformations to reconstruct consistent snapshots, which are then loaded into Redshift for some more batch processing.
Consistent snapshot means that our Redshift warehouse contains data from our relational database with integrity guarantees preserved.
But wait, what’s the buzz? Isn’t replication a solved problem? We were optimistic at the beginning of this journey too, but along the way we hit many roadblocks. The experiences and best practices that came out of them are summarized here; we hope they help fellow engineers dealing with similar problems!
This blog post tackles the following questions:
- How do we reliably replicate relational data to the data lake?
- Why is this such a big deal?
- How do we manage DMS configuration when multiple databases are replicated?
- How do we reliably control the DMS state?
The post will be published in two episodes. This blog covers the replication system up to S3, while the follow-up post focuses on transforming transaction log data into a consistent snapshot ready for analysis.
Why use DMS? We quickly discarded the alternative of building our own tool that talks with relational databases and pushes the data to S3. The main difficulty lies in the input diversity: database vendors, table schemas, and data access patterns. When analyzing third-party vendors, we didn’t find a clear winner.
It Starts With User Requirements
All software engineering projects start with user requirements. At the beginning of this project, we identified the stakeholders and asked them about their key requirements. These are the answers they provided:
Data Analyst: I want to have the full company’s data continuously and consistently replicated to the data lake!
Software Engineer in data producing teams: I want to minimize the effort needed to register new tables for replication.
Data Engineer: I want to minimize the time spent on mitigating replication issues.
Then we included system architects (Mark Simon, Marco Tabini, and Viktor Braut) and asked about the high-level principles the system should adhere to. The next section describes those principles and gives lessons learned that motivated them!
With a Little Help From Our Infrastructure Principles
The principles are:
- Infrastructure and Configuration as Code—the infrastructure and configuration must be reviewable, auditable, and repeatable.
- Democratized Configs, Self-Service—each team that owns source databases must be able to make changes to their replication stack without fear of breaking other teams’ replication.
Infrastructure and Configuration as Code
In most online resources, changes to replication systems are described as a manual process, often requiring a graphical user interface (GUI). This may be because data replication infrastructure differs from serving infrastructure: data replication is stateful. Because of this, I initially questioned this principle for our use case. Why require a code change to pause or resume the process, or to add a newly replicated table?
Having joined only at the end of 2019, I was missing the team’s experience with data replication from earlier that year. Here are a few failure patterns I was presented with:
Failure Pattern One: “If only we tested this in a sandbox environment…”
An engineer executes a configuration change resulting in the replication system not being able to “catch up” with the incoming changes. For such changes, we can use infrastructure as code to spin up an environment in our test AWS account where we can simulate the traffic and test whether the new configuration will survive.
Failure Pattern Two: “If only I knew this would happen as a side effect…”
An engineer changes the replication system’s configuration, which causes some or all tables to be reloaded from scratch. For our biggest tables, this process takes more than 24 hours, and during that time the data is not fresh, which means business disruption. Automated infrastructure-as-code checks can inform the change author and the reviewer about such disruptive side effects!
Failure Pattern Three: “If only I knew what you did while I was sleeping…”
An outage may span more than a single work shift. In those situations, it’s helpful to see configuration code changes that happened in the prior shift to understand how we got to the current state.
At Noom, we have a strong Infrastructure as Code culture. We could bring up our entire infrastructure just by pointing our deployment system to a clean AWS account.
Democratized Configs and Self-Serve
We formed this principle after we observed a hard-to-resolve bottleneck forming in our infrastructure team. Historically, this team owned all Noom’s provisioned AWS resources. Owning a resource binds you to the following responsibilities:
- making changes to the resources
- owning the alerting channels and reacting to production issues
- keeping track of AWS resource utilization and costs
- advising other teams about the infrastructure
This system quickly collapsed when we started adding more product-engineering teams, each requiring new infrastructure.
The fix was to hand ownership to the stakeholder teams themselves. To pull this off successfully, the platform team must invest the effort to research all possible use cases and decide which ones to support. The key is to make the stakeholder-facing processes simple and proven.
The next step is to provide stakeholder teams with processes that cover both the infrastructure and the related monitoring.
And in the end, the horizontal team is left with only advisory responsibility. Looking back, the decision to move in this direction saved us a lot of engineering hours once the system landed. The teams are now doing code changes against replication configuration without any review from the data platform team—we fully trust automated checks and tests.
This post discusses the replication system up to S3. This includes solutions to the following problems:
- How do we reliably copy transaction logs from source databases to S3?
- How do we maintain DMS configuration files when multiple teams are replicating databases?
- How do we create a self-serve DMS environment where each team owns their own configs?
We use CloudFormation to deploy all DMS-related resources. The deployment is split into N+1 CloudFormation stacks, where N is the number of teams owning relational databases. The additional “base stack” deploys shared resources that rarely change and are unlikely to be a bottleneck. The base stack contains and exports the following:
- AWS::IAM::Role for DMS service role
- AWS::DMS::Endpoint to write to the S3 data lake with one-minute buffering
- AWS::DMS::Endpoint to write to the S3 data lake with ten-minute buffering
- AWS::EC2::SecurityGroup for DMS networking
- AWS::DMS::ReplicationSubnetGroup for DMS networking
To simplify permissions and infrastructure, we decided to put all of our replicated data into a single S3 bucket. The output location is determined in the team’s stack configs.
Each team stack contains resources to execute the actual replication:
- AWS::DMS::Endpoint to read from the source database
- AWS::DMS::ReplicationInstance(s) to perform the actual replication
- AWS::DMS::ReplicationTask(s) to define what to replicate
More about the stacks in “CloudFormation stacks.”
The AWS::DMS::ReplicationTask contains two fields, ReplicationTaskSettings and TableMappings, that are tricky to maintain in the CloudFormation template due to their size. Large inline values make the configuration hard to read and review, so we use the Jinja2 template engine to render the final CloudFormation template. The ReplicationTaskSettings field is configured by inlining a file containing the default settings, which no team has had to change so far.
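The rendering step can be sketched without Jinja2 itself. The snippet below is a minimal stand-in that fills `<<<name>>>` placeholders (the delimiter used in our team stack templates) and inlines a task-settings file as a JSON string; the settings keys shown are purely illustrative.

```python
import json

def render_template(template_text: str, values: dict) -> str:
    """Fill <<<name>>> placeholders in a CloudFormation template body."""
    for name, value in values.items():
        template_text = template_text.replace(f"<<<{name}>>>", value)
    return template_text

# Hypothetical fragment: the default task settings live in their own file
# and are inlined into the template as a JSON string.
default_settings = {"TargetMetadata": {"SupportLobs": True}}
fragment = '"ReplicationTaskSettings": <<<task_settings>>>'
rendered = render_template(
    fragment, {"task_settings": json.dumps(default_settings)}
)
```

This keeps the checked-in template small while the rendered output carries the full settings blob CloudFormation expects.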
The TableMappings field changes whenever a new table is added, removed, or reloaded. This is where we invested in an abstraction to make it as simple as possible; we separated TableMappings definitions into a more straightforward schema and a separate config file (more on this in “TableMappings Configuration Language”). Table reload as code is specifically important for us and discussed in “Table Reload.”
The deployment script relies on Boto3 to pause DMS tasks, execute CloudFormation deploy, and resume DMS tasks. The script also does a multitude of lint actions to verify configs are consistent and that the actual DMS state is ready to be updated with the new changes. We’ll cover this more in-depth in “Deployment Process.”
In this section, we share more details about our CloudFormation definitions for the base and team stacks.
This is an example of the base stack definition. Note the following:
- The “NetworkStackName” is out of scope for this article, but a follow-up on this topic is in the works!
- In “DatalakeAccessRole,” we don’t allow DMS to delete any files, which aligns with our data lake principles. This works only because of our special handling of table reloads, as explained in the “Table Reload” section.
- The configuration of the two AWS::DMS::Endpoints, which enables different replication-delay SLAs.
This is an example of the team stack definition. Note the following:
- Values that are imported from the base stack—look for “ImportValue.”
- Values that are populated from the template engine—look for `<<<`.
- The target endpoint is “S3DatalakeTargetEndpoint,” which flushes every 10 minutes.
TableMappings Configuration Language
Configuring DMS TableMappings is an involved process, as is visible from the official docs. Most importantly, this configuration defines which tables are replicated. To make it easier for client-team engineers to control DMS, we identified the common actions that need to be supported and created a smaller language that translates into the TableMappings value.
The common actions we wanted to support are:
- Defining which tables to replicate
- Reloading a specific table
- Pausing/resuming replication for a single table, or all tables in a task
We didn’t allow a config to request replication of all tables from the database; each table must be specifically called out. The complete state of what DMS is replicating is always reflected in the code.
Our smaller config is also specified as JSON, and it follows a strict contract.
The rendered TableMappings contain three rules per replicated table:
- “Selection,” requesting the replication of the given table,
- “rename schema,” requesting that the output goes into the folder “data/” in the S3 bucket, and
- “rename table,” requesting the specific folder name for this full load of the table.
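As a sketch, the expansion from our smaller per-table config into these three rules can be written as a pure function. The smaller config’s field names (`schema`, `table`, `version`) and the `data` rename target are illustrative of the scheme described above, not our exact schema; the rule layout follows the documented DMS TableMappings format.

```python
def render_table_mappings(entries):
    """Expand per-table config entries into DMS TableMappings rules:
    a selection rule, a schema rename to "data", and a table rename
    that appends the version so each reload gets a fresh S3 prefix."""
    rules = []
    next_id = 1
    for entry in entries:
        locator = {"schema-name": entry["schema"],
                   "table-name": entry["table"]}
        rules.append({
            "rule-type": "selection",
            "rule-id": str(next_id), "rule-name": str(next_id),
            "object-locator": locator, "rule-action": "include",
        })
        rules.append({
            "rule-type": "transformation", "rule-action": "rename",
            "rule-id": str(next_id + 1), "rule-name": str(next_id + 1),
            "rule-target": "schema",
            "object-locator": {"schema-name": entry["schema"]},
            "value": "data",
        })
        rules.append({
            "rule-type": "transformation", "rule-action": "rename",
            "rule-id": str(next_id + 2), "rule-name": str(next_id + 2),
            "rule-target": "table", "object-locator": dict(locator),
            "value": f"{entry['table']}_v{entry['version']}",
        })
        next_id += 3
    return {"rules": rules}

mappings = render_table_mappings(
    [{"schema": "app", "table": "barcodes", "version": 2}]
)
```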
Renaming the schema and table to control the output location in the S3 is hacky, but we couldn’t find another way to include the table version in the output path to allow for table reloads.
The pause/resume config values are handled by the deployment process, as explained below.
DMS replication consists of two phases, named FULL LOAD and CDC. Replication starts with the FULL LOAD phase and, upon completion, transitions into CDC. FULL LOAD is a simple SELECT * query running over the source database, while CDC uses logical replication to stream only the transactions (changes).
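For a concrete feel of the CDC output, here is a minimal sketch of parsing DMS change rows landed on S3. It assumes the default CSV output format, in which each change row’s first column is the operation (I, U, or D) followed by the table’s columns; the column names are supplied by the caller and the sample data is made up.

```python
import csv
import io

def parse_cdc_rows(csv_text, columns):
    """Yield (operation, row_dict) pairs from a DMS CDC CSV object."""
    op_names = {"I": "insert", "U": "update", "D": "delete"}
    for row in csv.reader(io.StringIO(csv_text)):
        yield op_names[row[0]], dict(zip(columns, row[1:]))

rows = list(parse_cdc_rows("I,1,abc\nU,1,abd\nD,1,abd\n", ["id", "code"]))
```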
Reloading a table is a DMS action which removes the output written so far and restarts the table to a FULL LOAD phase. One might ask—when is this needed? In our experience, reloading always happens due to not-so-fun events:
- A migration on the source database causes DMS to get stuck. We don’t have all the explanations here, but one hard data point is that a gh-ost migration reliably triggers it.
- A CDC phase experiences a hiccup in which a transaction is dropped or corrupted.
- A mis-designed downstream pipeline chokes on the size of the transaction log that accumulated over time.
- A hard-delete request (the data must not remain in the transaction logs).
In emergency events like the above, we want a reload to be reliable. In our experience, the reload can bring more trouble:
- Our biggest table takes 24 hours to reload. We need a way to initiate the reload while keeping replication of the previous version of the table running in parallel, to avoid a 24-hour SLA miss.
- Reloading a table deletes files in our S3 bucket; we don’t want to allow any automated tools to delete files on S3 since this can cause downstream consistency issues.
- FULL LOAD runs a heavy “SELECT *” over the source database. We had issues overloading a MySQL slave with a FULL LOAD, which broke all replications sourcing from that slave.
For the reasons above, we decided to control the reloading of a table with reviewed code changes. The code change required to initiate a table reload is a table version bump. For example, the following change starts a reload of the “barcodes” table:
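The original change isn’t reproduced here, but the shape of a version bump can be sketched as follows. The config field names and the path convention are illustrative, not our exact schema.

```python
# Illustrative smaller-config entry before and after the reload request.
before = {"schema": "app", "table": "barcodes", "version": 1}
after = {"schema": "app", "table": "barcodes", "version": 2}

def s3_prefix(entry):
    """The version is part of the output path, so a bumped table
    starts a fresh FULL LOAD under a brand-new prefix."""
    return f"data/{entry['table']}_v{entry['version']}/"

def is_valid_bump(old, new):
    """Deployment lint: table versions must only ever increase."""
    return new["version"] > old["version"]
```

Because the two versions render to different prefixes, both entries can coexist in the config during a parallel reload, each writing to its own location.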
Since we include the “version” value in the output S3 path, the previous data is not automatically removed nor overwritten. Even better, DMS doesn’t know that we are reloading a table; for the service, this looks like a new table.
To leave the table in the CDC phase while the FULL LOAD is happening, we duplicate the configuration record for that table and leave the two versions running in parallel.
Configuration Deployment Process
Since the configuration lives in a code repository, we invested in CI/CD to make the stakeholder process as effortless as possible.
The deployment runs in two main phases: integration test and deployment. The integration test output is a go/no-go decision and a detailed deployment plan the engineer can review.
Although the test phase runs RPCs against CloudFormation, S3, and DMS, all actions are strictly read-only. This enables us to automatically run the planning phase (dry run) on branch commits before the PR is merged to the master branch.
The responsibility of the integration test is to inspect local configs and the current state of the infrastructure and decide if the configuration deployment will succeed.
Simple checks performed:
- JSON configs are parsable as JSON and have the expected schema.
- CloudFormation templates are valid.
- The schema name/table name pairs in a single DMS task are unique.
- Output S3 paths are globally unique.
- The base stack is deployed.
- Table versions only ever increase.
More involved checks:
- A table that starts loading for the first time will not overwrite any S3 data.
- A FULL LOAD is never interrupted.
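One of the checks above lends itself to a small sketch: rendered output S3 paths must be globally unique. Because the version is part of the path, a table mid-reload may legitimately appear twice with two different versions. The field names and path convention are illustrative.

```python
from collections import Counter

def check_unique_output_paths(entries):
    """Return lint errors for any S3 output path claimed twice."""
    paths = Counter(
        f"data/{e['table']}_v{e['version']}/" for e in entries
    )
    return [f"duplicate output path: {p}" for p, n in paths.items() if n > 1]

# A table reloading in parallel (two versions) passes the check...
ok = check_unique_output_paths([
    {"table": "barcodes", "version": 1},
    {"table": "barcodes", "version": 2},
])
# ...while two entries rendering to the same path are rejected.
bad = check_unique_output_paths([
    {"table": "barcodes", "version": 2},
    {"table": "barcodes", "version": 2},
])
```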
Our deployment plans have two main flavors: single-step and two-step. Each step is a CloudFormation deploy operation.
The two-step deployment plan is used if a table needs to be reloaded. In the first step, we remove the table from the task, followed by a step that adds it back with a new name. We couldn’t find a more straightforward and safer way to request a table reload using CloudFormation.
Since a DMS task update fails if the task is currently running, each step is surrounded by “pause” and “resume” operations.
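Putting the pieces together, the ordering of deploy operations can be sketched as a pure function. The real script issues these operations as Boto3 and CloudFormation calls; the step labels here are only illustrative.

```python
def deployment_plan(task_names, reload_requested):
    """Return the ordered operations for a config deploy. A running DMS
    task can't be updated, so every CloudFormation step is wrapped in
    pause/resume; a table reload takes two steps (remove, then re-add)."""
    steps = (
        ["deploy: remove reloaded tables", "deploy: re-add with new version"]
        if reload_requested
        else ["deploy: apply changes"]
    )
    plan = []
    for step in steps:
        plan += [f"pause {task}" for task in task_names]
        plan.append(step)
        plan += [f"resume {task}" for task in task_names]
    return plan

plan = deployment_plan(["team-a-task"], reload_requested=True)
```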
Related Blog Posts From AWS
Load ongoing data lake changes with AWS DMS and AWS Glue. We rejected using DynamoDB to store the set of replicated tables in favor of storing them as code. Storing them as code allows us to review and audit the changes.
Create AWS CloudFormation templates for AWS DMS tasks using Microsoft Excel. Similarly, we rejected using Excel to store the set of replicated tables.
Automate AWS CloudFormation template creation for AWS DMS tasks. Our solution gets the basic idea from this post but pushes it to maximize reliability.
Configuring DMS and deploying configuration changes to it is not a simple task. At Noom, we invested in a more complex deployment process to unlock two important wins.
First, we achieved strict DMS configuration as code—the AWS console is **never** used for changes. This removes a big source of human error and gives us a clear trace of the previous states.
Second, with democratized configuration changes, the data engineering team no longer spends time setting up replication for new tables. At some cost in CI/CD complexity, we enabled engineers from product teams to add new databases and tables with confidence that the CI/CD pipeline will warn them if they misconfigure something.
Wow—you made it to the end! If you’re interested in this topic and would like to discuss it further you can email me at anton [at] noom [dot] com!