We Migrated 30 Kubernetes Clusters to Terraform


Thibault JAMET

In this blog post, we will guide you through how we automated a complex infrastructure migration from a patchwork of Sceptre and CDK to Terraform in a few months, following a few simple principles:

  • Decide the architectural principles, and write them down using ADRs or similar
  • Invest and iterate on small tooling and automation for your needs
  • Plan for small iterations and continuous improvements
  • Don’t rely solely on Continuous Integration for development

We’ll share the practical automation and small iteration strategies that made this possible, the custom tooling decisions that paid off, and the hard-learned lessons about rollbacks, local development workflows, and knowledge transfer.

Whether you’re an SRE looking for proven migration techniques or a leader planning a similar transformation, this post covers both the technical nuts-and-bolts and the organisational choices that made our migration successful, including where we accepted risk, where we invested in custom solutions, and how we kept our team productive throughout the process.

Back in 2018, SCHIP was a small team with a big ambition: to provide a robust, scalable Kubernetes platform for Adevinta. We began with just one cluster and a small team of engineers. As the years rolled on, our ambitions — and our infrastructure — grew. By 2025, we were running 30 clusters in 12 accounts and four regions, and the team had doubled in size. But with growth came complexity.

Our early infrastructure was a patchwork of Sceptre, CloudFormation, and kube-aws. These tools initially served us well, but as we scaled, their limitations became apparent. Sceptre and CloudFormation were hard to test, slow to evolve, and increasingly brittle. Suggestions to use Terraform had been circulating since 2019, but at the time, we needed something that could help us move quickly and test our infrastructure as code. Enter AWS CDK.

With kube-aws reaching end of life, we needed a new way to manage clusters. We ran a proof of concept with EKS and debated between CDK and Terraform. CDK won out, largely because of its promise of testability and its alignment with our Go-based workflows. We built custom cluster operators and tooling, using Kubernetes jobs to deploy clusters. CDK became our primary deployment tool for clusters, while Sceptre remained in use for the surrounding infrastructure.

But as we scaled, the cracks began to show. CDK’s heavy reliance on Node.js, cryptic error messages, and slower-than-expected support for new Kubernetes features made life harder. Testability was better than CloudFormation, but not as easy as we’d hoped.

Around 2023, the winds shifted. The company's strategy called for simplification and standardisation.

Terraform had become the de facto standard, both in the community and within Adevinta. We needed to make our platform easier to maintain, hand over, and scale as team members transitioned to new opportunities.

The new vision was clear: infrastructure should be modular and built on open standards. We wanted to leverage the rich ecosystem of Terraform modules, make our infrastructure code easier to review and reason about, and ensure that anyone with SRE experience could quickly understand and contribute to it.

We started aligning the team by writing an ADR (Architectural Decision Record) that outlined the motivation and the target architecture of our Terraform modules. We took particular care in defining how components would be split into top-level Terraform modules and submodules, as well as how outputs would be communicated between top-level modules.

This meant reorganising our stacks, adopting best practices for state management, and automating as much as possible, while always keeping safety and auditability at the forefront.

We knew the migration wouldn’t be a one-shot deal. Instead, we embraced an iterative, risk-aware approach.

1. Automate: We invested heavily in automation. Every pull request triggered a complete terraform plan for all clusters and accounts, using GitHub Actions. We automated terraform imports in each pull request to spot any undesired changes and raise our confidence level.

2. Check: Every change was reviewed, and the plan output was summarised and posted as comments in each Pull Request. This required discipline and careful sequencing, but our CI pipeline helped us catch most issues before they reached production.

3. Polish: We iterated on our process, learning from each migration. When we missed a resource to import or inadvertently changed one, the plan would catch it, and we’d fix it within the PR before merging. We marked resources as DeletionPolicy: Retain in CloudFormation before deleting stacks, ensuring we didn’t delete critical resources that had already been migrated.

4. Repeat: Migration happened in waves. We started with low-risk, modular resources (IAM roles, Lambdas, monitoring), then tackled high-risk, stateful resources (VPCs, subnets, routes). Each wave built our confidence and improved our process.

Rollback was a real challenge. Once a resource was imported into Terraform, rolling the change back would eventually delete it, potentially causing an outage in our systems.

Our workaround was to integrate the 'terraform state rm' command into our import phase, so that any resource imported by a commit that was later rolled back was removed from the state rather than scheduled for deletion.

This way, a plain rollback of the commit didn’t risk deleting resources that were still in use in our infrastructure.
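
In practice, undoing an import boils down to removing the resource address from the state, so Terraform simply stops managing the real resource instead of destroying it. A minimal sketch, with a hypothetical resource address:

# Remove an imported resource from the state without touching the real AWS resource
# (the address below is hypothetical).
terragrunt state rm 'aws_iam_role.cluster_autoscaler'

# The next plan should show the resource as unmanaged rather than scheduled for deletion.
terragrunt plan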

Similarly, during the cleanup phase of the migration, any rollback would require a manual CloudFormation import — something we hadn’t automated. Our workaround was to keep the migration in discrete, reviewable steps:

  • Import the resources into Terraform and edit their DeletionPolicy to Retain in CloudFormation.
  • Then, start using the new Terraform-managed resources.
  • Finally, empty or delete the CloudFormation stack.
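
As a rough sketch of that final step: once every remaining resource carries DeletionPolicy: Retain, deleting the stack leaves the underlying resources in place (the stack name below is hypothetical):

# Only safe once all resources in the stack are marked DeletionPolicy: Retain,
# so deleting the stack does not delete the underlying resources.
aws cloudformation delete-stack --stack-name example-legacy-stack
aws cloudformation wait stack-delete-complete --stack-name example-legacy-stack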

This process wasn’t foolproof, but it was safe enough. If we missed something, the PR plan would catch it, and we could fix it before merging.

We accepted some risk and temporary complexity in exchange for progress.

Documentation was practical and focused on day-to-day tasks. The real knowledge transfer happened through hands-on pairing and code reviews. We found that pairing new team members with experienced engineers was more effective than relying solely on exhaustive documentation. The repo README included instructions for migrating CloudFormation stacks, but the real learning happened in the trenches.

The evolution of our migration automation was a story of iterative improvement, driven by the sheer scale of what we were migrating.

In the early iterations, our automation was straightforward: run terraform import commands for individual resources.

A simple Go package was born to handle resource imports. Each import was executed one by one, which worked fine for the initial handful of resources — IAM roles, Lambda functions, and monitoring components.

We also posted the whole terraform plan output directly in PR comments.

This provided engineers with immediate visibility into what would change before merging, eliminating the need to dig through CI logs or run plans locally. But this seemingly simple feature required careful engineering.

Original simple PR comment

The challenge was state safety. If we ran terraform plan against the remote state during CI and someone simultaneously ran terraform apply in production, we could end up with state conflicts or, worse, accidentally destroy resources if a terraform apply was run while a PR had imported resources into the state but had not yet been merged.

To solve this problem, we leveraged terragrunt and its templating capabilities to utilise local Terraform state files during CI runs, avoiding changes to the remote state until actual deployment.

Here’s how we made it work.

1. Download remote state: At the start of each PR job, we’d pull the current remote state using terragrunt state pull > terraform.tfstate

2. Simulate local operations: We’d “fool” terraform by removing .terraform/terraform.tfstate, making it think it was working with a local state

3. Run imports and plans locally: All import simulations and plan operations happened against this local copy

4. Generate actionable feedback: The plan output showed exactly what would happen when the PR was merged, without any risk to production

This approach gave us the best of both worlds:

Safety: No risk of interfering with production deployments or corrupting remote state

Accuracy: Plans reflected the true state of resources, including any imports that would happen

Speed: No state locking delays or conflicts with concurrent operations

remote_state {
  # In GitHub Actions, on pull requests, we run the plan locally and don't want to change the remote state.
  # Here we tweak terragrunt and terraform to switch the backend to local in those cases.
  backend = get_env("PLAN_LOCAL_ONLY", "false") == "true" ? "local" : "s3"

  generate = {
    path      = "backend.terragrunt.tf"
    if_exists = "overwrite"
  }

  config = get_env("PLAN_LOCAL_ONLY", "false") == "true" ? {} : {
    bucket         = "terraform-state-bucket-${local.account.accountID}"
    key            = "account/${local.account.name}/${local.account.region}/terraform.tfstate"
    region         = "${local.main.region}"
    dynamodb_table = "terraform-state-lock-${local.account.accountID}"
  }
}
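
Putting the pieces together, a PR job followed roughly this sequence (a condensed sketch; the account and region are assumed to be configured via the environment):

# 1. Pull the real state while still pointing at the S3 backend.
terragrunt init
terragrunt state pull > terraform.tfstate

# 2. Switch the backend to local and re-initialise against the pulled copy.
export PLAN_LOCAL_ONLY=true
rm -f .terraform/terraform.tfstate
terragrunt init

# 3. Plan (and simulate imports) without ever touching the remote state.
terragrunt plan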

However, as we expanded our scope to include VPCs, subnets, and network routes, this approach became increasingly slower. What took 30 seconds for a simple account ballooned to over 10 minutes for accounts with dozens of resources.

We needed a better way.

As the number of resources to import increased, the overall process began to slow down.

Each time we invoked Terraform, it would need to lock its state, download it from S3, run the imports, upload it again to S3, and unlock the state.

The breakthrough came with the realisation that we didn’t need to run a new Terraform command for each import, but rather to generate the import statements and let Terraform handle them efficiently. This changed everything:

Instead of running individual terraform import commands, we moved to generating import statements in Terraform’s declarative syntax (import blocks).

This brought CI runs that had ballooned to over 10 minutes back down to around 30 seconds.
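
To make the difference concrete, here is a sketch with hypothetical resource addresses and IDs; only the imports.generated.tf file name comes from our tooling:

# Before: one Terraform invocation per resource, each paying the full
# state-lock, download and upload cost.
terragrunt import 'aws_subnet.private["eu-west-1a"]' subnet-0123456789abcdef0
terragrunt import 'aws_route_table.private' rtb-0123456789abcdef0

# After: generate declarative import blocks once, then run a single plan.
cat > imports.generated.tf <<'EOF'
import {
  to = aws_subnet.private["eu-west-1a"]
  id = "subnet-0123456789abcdef0"
}

import {
  to = aws_route_table.private
  id = "rtb-0123456789abcdef0"
}
EOF

terragrunt plan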

As the number of managed and imported resources grew, the PR comments accompanying the Terraform plan became less and less readable.

Engineers didn’t want to scroll through hundreds of lines of “no change” resources.

We improved our tooling so engineers could easily spot:

  • Whether a change would delete, replace or update resources in place
  • Exactly which resources would be deleted, replaced or updated
Improved comment showing the most important changes of the PR
A subset of all the PR comments

If engineers need more details about the changes, they can always expand a comment to verify the actual changes.

The evolution from basic plan dumps to rich, organised feedback was crucial for adoption. Engineers could quickly scan a PR comment and understand the impact of their changes, making code reviews more effective and reducing the chance of surprises in production.
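
Our real summarisation logic lived in our Go tooling, but the underlying idea can be sketched with Terraform’s JSON plan output and jq (the file names here are illustrative):

# Produce a machine-readable plan.
terragrunt plan -out=tfplan
terragrunt show -json tfplan > plan.json

# Keep only the resources that would actually change, with their action.
jq -r '.resource_changes[]
       | select(.change.actions != ["no-op"])
       | "\(.change.actions | join(","))\t\(.address)"' plan.json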

While our CI automation was becoming increasingly sophisticated, we faced a classic developer productivity problem: slow feedback loops. Every time an engineer wanted to test a change — whether adding a new import, updating a resource configuration, or debugging a plan issue — they had to:

1. Commit their changes

2. Push to a branch

3. Wait for CI workers to be available

4. Wait for the full plan to complete across all accounts

5. Review the results and iterate

This cycle could take 10–15 minutes per iteration, which was painful when trying to debug complex import issues or validate configuration changes.

The breakthrough was realising that our CI planning process could be replicated exactly on local machines. After all, the magic wasn’t in the CI environment — it was in our approach to local state management and import simulation.

We introduced scripts that allow engineers to run terraform plan with imports from their local machine, simply and safely.

They replicated the same process as in CI and soon became essential tools for every engineer working on the migration.

Here is what it looked like.

#!/bin/bash

set -e
mkdir -p bin
export PATH="$(pwd)/bin:$PATH"

if [ $# -gt 1 ]; then
  echo "Usage: $0 [account-name/region]"
  exit 1
fi

if [ $# -eq 1 ]; then
  export AWS_PROFILE=${1%/*}
  export AWS_REGION=${1#*/}
fi

echo "Running plan for ${AWS_PROFILE}/${AWS_REGION}"

# Build the tool that will resolve the missing import statements for this specific account
go build -o bin/tool ./cmd/tool

# Remove any previously generated import statements and local state so we start from scratch
rm imports.generated.tf || true
rm .terraform/terraform.tfstate terraform.tfstate || true

(
  # Pull the real terraform state to ensure we are using the latest one
  unset PLAN_LOCAL_ONLY
  terragrunt init
  terragrunt state pull > terraform.tfstate
)

# Switch to the local backend and drop the backend cache so terraform uses the pulled local state only
export PLAN_LOCAL_ONLY=true
rm .terraform/terraform.tfstate

terragrunt init

# Generate the relevant import statements
tool terraform-import account --import-statements-file=imports.generated.tf --account-name ${AWS_PROFILE} --account-region ${AWS_REGION}
terragrunt plan
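
Engineers would then run it against a single account and region, roughly like this (the script name is hypothetical):

./local-plan.sh my-account/eu-west-1

# Or rely on AWS_PROFILE and AWS_REGION already exported in the shell:
./local-plan.sh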

The impact on developer productivity was immediate and significant. Instead of the 10–15 minute commit/push/CI cycle, engineers could:

  • Get plan results in under a minute locally
  • Test multiple configuration variations quickly
  • Catch issues before they ever reach CI

Perhaps most importantly, local planning built engineer confidence in the migration process. When you can see exactly what Terraform will do locally, including all the imports and state changes, before committing anything, the migration feels much safer and more predictable.

Engineers could experiment with different approaches, validate their understanding of resource dependencies, and ensure their changes would work correctly, all without any risk to production infrastructure or interference with other team members’ work.

The temptation with any large-scale infrastructure change is to plan everything and execute it in a single, massive, and coordinated effort.

We learned that this approach is both riskier and slower than it appears.

By starting with low-risk resources, such as IAM roles and monitoring, we built confidence in our process and tooling before tackling the critical VPCs and network routes.

Each wave taught us something new — whether it was an edge case in our import logic, a quirk in how certain AWS resources behave, or a gap in our automation. This iterative approach meant that by the time we reached the most critical resources, our process was battle-tested and our team was experienced.

Automation is the only path to sustainable scale, but it must include human oversight

Manual terraform imports don’t work when you’re dealing with dozens of accounts and hundreds of resources.

However, automation without human review is dangerous.

The CI pipelines and PR plan comments became our safety net, catching subtle issues that pure automation would miss.

The key insight is that automation should amplify human judgment, not replace it.

Custom tooling can beat off-the-shelf solutions for complex yet specific problems

Throughout the migration, we constantly faced the choice between adapting existing tools and building custom solutions.

Our Go-based import automation, local state manipulation scripts, and PR feedback systems were all custom-built, as we couldn’t find off-the-shelf migration tools that could handle our specific constraints: repeated invocations of imports, working across dozens of accounts, and transparent reporting of changes and risks taken.

Throughout the migration, we didn’t hesitate to switch between custom and off-the-shelf tools for the ones that best fit our needs.

Perfect rollback strategies are a myth

Design for early detection and fast recovery instead

We spent time trying to design foolproof rollback mechanisms, but ultimately accepted that exhaustively planning all future resources for import is complex and prone to error.

Instead, we invested heavily in early detection — comprehensive PR plans, local testing scripts, and careful sequencing of changes.

When issues arose, a rollback PR went through the same automation, which helped us catch and fix undesired deletions before they happened.

Hands-on learning scales better than documentation for complex technical migrations

We initially attempted to document every aspect of the migration process, but quickly realised that pairing engineers and learning through hands-on experience was far more effective.

Complex import scenarios, debugging terraform state issues, and understanding the nuances of our automation required tacit knowledge that was hard to capture in documentation. The time we saved by not over-documenting was better spent on improving our automation and directly supporting more migration work.

This doesn’t mean documentation is useless — the practical, day-to-day guides in our README were essential — but for the deep technical knowledge required during an active migration, pairing and hands-on experience were irreplaceable.

Migrating to Terraform has set SCHIP up for easier maintenance, better collaboration, and future growth. The journey wasn’t always smooth, but by automating, checking, refining, and repeating, we ultimately arrived. If you’re facing a similar migration, remember: it’s a marathon, not a sprint. Embrace the journey, and don’t be afraid to learn as you go.
