Migrating PostgreSQL Data Out of Heroku Link to heading
Heroku is still one of the best platforms for deploying applications. Its simple deployment process and convenience are hard to beat. However, sometimes, perhaps due to company policy requirements, you need to move to a different cloud provider or host your database elsewhere.
This was the case for me when I moved a 1.4TB PostgreSQL database from Heroku to AWS Aurora. Here’s how my team and I did it.
Requirements Link to heading
The primary directive was to move the database from Heroku to AWS. There was also a requirement to have as little impact on customers as possible during the migration.
The application was a SaaS Inventory Management System that managed customers' multi-channel sales and inventory. This meant that the application, and by extension the database, was always processing data. Minimal impact meant as short a downtime period as possible and no data loss.
- Move from Heroku to AWS Aurora
- Minimal downtime
- No loss of transactions or data during downtime
Limitations Link to heading
At the time of the migration (back in 2021), Heroku did not offer the ability to stream data directly out of its Postgres instances. Replication out of Heroku could only be done via WAL shipping.
On the other end, Aurora did not support following the Heroku Postgres instance, so we could not replicate directly from Heroku to Aurora. Aurora could make use of Postgres-native logical replication, but it could not act as a read-only follower of Heroku Postgres.
Replication was paramount as it was the only way to ensure that the data was moved with minimal customer impact.
The Plan Link to heading
The overall plan we settled on was a 2-failover process:
- Set up a Postgres instance on EC2 as a follower of the Heroku Postgres instance.
- Fail over to the EC2 instance with a short downtime of about 10 minutes.
- Replicate the data from the EC2 instance into Aurora.
- Fail over to Aurora with another short downtime of about 10 minutes.
Setting up EC2 Link to heading
Setting up an EC2 instance is simple, and running Postgres on it is equally straightforward. As long as the WAL files were being streamed, the Postgres instance running on EC2 would be able to replicate the changes. We restored a backup of the Heroku Postgres instance onto the EC2 instance, then set up WAL replication so the EC2 instance could catch up to the Heroku instance.
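For a Postgres version of that era (pre-12), the follower side of this boils down to a `recovery.conf` on the EC2 instance. The sketch below assumes the WAL archive is fetched with wal-e; the exact mechanism for pulling WAL out of Heroku is an assumption here, not something Heroku documented publicly:

```ini
# recovery.conf on the EC2 follower (Postgres < 12) -- a sketch.
# The restore_command assumes WAL segments are fetched from an S3
# archive with wal-e; adjust for whatever archiving tool is in use.
standby_mode = 'on'
restore_command = 'wal-e wal-fetch "%f" "%p"'
recovery_target_timeline = 'latest'
```

On Postgres 12 and later, the same effect is achieved with a `standby.signal` file and the equivalent settings in `postgresql.conf`.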
Failing Over to EC2 Link to heading
The process of failing over, that is, moving all the application writes to EC2, involved multiple steps that depend heavily on how the application writes to the database. In our case, we had the main application and multiple background jobs running, all of which both read from and wrote to the database.
As a general rule, stop all background jobs first. Then, stop the main application. Once the application is down, promote the EC2 instance to leader and perform any sanity checks you need on the data and the database. Finally, switch the database connection strings over to the new instance and start the application again. You probably also want to run a few smoke tests just to make sure everything works properly.
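Because the ordering of these steps matters, they are worth scripting. Here is a minimal sketch of that kind of run-book in Python; every step function is a hypothetical placeholder standing in for a real command (stopping workers, `pg_ctl promote`, deploy tooling, and so on), not our actual scripts:

```python
# Sketch of a failover run-book as an ordered, fail-fast script.
# Each lambda is a placeholder for a real command in the migration.

def run_failover(steps):
    """Run each (name, action) step in order, aborting on the first failure."""
    completed = []
    for name, action in steps:
        if not action():
            raise RuntimeError(f"failover aborted at step: {name}")
        completed.append(name)
    return completed

failover_steps = [
    ("stop background jobs",      lambda: True),  # scale workers to zero
    ("stop main application",     lambda: True),  # enter maintenance mode
    ("promote EC2 follower",      lambda: True),  # e.g. pg_ctl promote
    ("sanity-check data",         lambda: True),  # row counts, spot checks
    ("switch connection strings", lambda: True),  # point app at new leader
    ("start application",         lambda: True),
    ("run smoke tests",           lambda: True),
]

print(run_failover(failover_steps))
```

Encoding the steps as data makes the ordering explicit and reviewable, and the fail-fast behaviour means a broken step halts the failover instead of silently continuing.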
Replicating to Aurora Link to heading
For replicating the data to Aurora, we used AWS’s Database Migration Service (DMS). It works in a similar way to, say, dbt or even Postgres’ own logical replication: you set up a source and a target, and DMS replicates the data. We settled on DMS because we were going to have support from AWS technical account managers, so it made more sense to use an AWS service to replicate the data.
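For reference, a DMS replication task is configured with a source endpoint, a target endpoint, and a table-mapping document. A minimal mapping that includes every table in the `public` schema looks roughly like this (the schema name is illustrative):

```json
{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "include-public-schema",
      "object-locator": {
        "schema-name": "public",
        "table-name": "%"
      },
      "rule-action": "include"
    }
  ]
}
```

Running the task with the full-load-and-CDC migration type first copies the existing data and then keeps applying ongoing changes, which is what makes a low-downtime cutover possible.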
Failing Over to Aurora Link to heading
The failover from EC2 to Aurora follows the exact same process as failing over from Heroku to EC2.
The Migration Planning Process Link to heading
This migration took about a year to plan and execute. One of the guiding principles was to remember Murphy’s Law. At each step of the process, we asked what could go wrong, how likely it was that a specific thing would go wrong, what the impact would be if it did happen, and how we could mitigate that risk.
Examples of what could go wrong were:
- What if the EC2 instance crashed?
- What if there was a network issue while replicating?
- What if there was a sudden surge in traffic just before or during the failover process?
With questions like these, we developed mitigations. For example, we had a multi-AZ setup with multiple replicas of the EC2 instance. We would promote one instance and switch the other instances to replicate from the new leader on EC2. That way, if one EC2 instance crashed, we could fail over to another one. We scripted everything, as this reduced the likelihood of human error.
We had a plan to always fall back to the running system if there was any data loss or corruption. There were only 2 points where we could not roll back to the previous database instance: once we failed over to EC2, and once we failed over to Aurora. For each of those, we added checks to snapshot the data, and scripts to sample a percentage of the data to ensure no data loss and no corruption.
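The sampling checks can be sketched as follows. This is a simplified illustration rather than our production script: it compares digests of a deterministic sample of rows keyed by primary key, where in practice each side would be populated from queries against the old and new database:

```python
import hashlib
import random

def row_digest(row):
    """Stable digest of a row, independent of which database it came from."""
    return hashlib.sha256(repr(row).encode("utf-8")).hexdigest()

def sample_mismatches(old_rows, new_rows, sample_pct=0.01, seed=42):
    """Compare a deterministic sample of rows between two databases.

    old_rows / new_rows map primary key -> row tuple. Returns the list
    of sampled primary keys whose rows are missing or differ; an empty
    list means the sample is consistent.
    """
    rng = random.Random(seed)           # fixed seed -> reproducible sample
    keys = sorted(old_rows)
    sample_size = max(1, int(len(keys) * sample_pct))
    mismatches = []
    for pk in rng.sample(keys, sample_size):
        if pk not in new_rows or row_digest(old_rows[pk]) != row_digest(new_rows[pk]):
            mismatches.append(pk)
    return mismatches
```

Hashing each row instead of comparing values column by column keeps the check cheap, and the fixed seed means a failed run can be re-examined against the exact same sample.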
During planning, we had multiple dry-runs with the entire team and practiced the entire migration process from start to end. We also practiced rollback plans and failure scenarios. Running these practice migrations helped us identify issues that could arise. After each practice run, we would hold an after-action review and update our plans and scripts accordingly.
The Thought Process Link to heading
Several guiding principles helped us in this process.
- Is the database the actual source of truth for this particular piece of data? Can the data be replicated from another source?
- At which points does the data in each of the database instances diverge, and how would that affect customers?
Conclusion Link to heading
We successfully migrated the database in 2021. The actual migration went smoothly without any hiccups. We moved about 1.4TB of data between cloud providers with minimal downtime and with no loss of any customer data.
The success of the migration can be attributed to planning, practice, and a dedicated and well-motivated team.