Techdecline's Blog

Fail Fast, Refactor Often - Escaping the Defensive Terraform Trap

Scenario

Your Terraform apply just failed. Again. For the third time this week, and you're not even sure why. Was it the Azure DevOps provider timing out? That managed identity that existed yesterday but returns 404 today? Or just one of the 247 API calls in your 40-minute deployment hitting a rate limit?

Most cloud platforms rely on multiple services running in integration to deliver value. Those services are usually configured with Infrastructure as Code tools like Terraform or Pulumi, which let engineers define desired infrastructure declaratively through providers that abstract the API calls required to determine and change state.

As these platforms grow, the configurations become brittle due to the sheer number of hidden API calls (some of which will eventually fail) and providers that may require updates or exhibit intermittent bugs.

You'll notice the creep: deployments that took 8 minutes now take 35. Success rates drop from 95% to 70%. Your team starts adding "retry the pipeline" to their daily routine.

A cognitive bias called present bias pushes us towards short-term improvements.

What we usually do

As engineers, our (short-term) goal is to make the pipeline go green again. To achieve that, we usually add quick fixes, escape hatches and band-aids to our configuration code.

Coming back to our example, we may add data sources combined with try() expressions to check whether an identity exists and only configure IAM roles if the requirements are met. Other similar "solutions" include a plethora of lifecycle attributes (most prominently ignore_changes).
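As a sketch of the ignore_changes band-aid (resource names and attributes here are illustrative, not from the original scenario):

```hcl
resource "azurerm_storage_account" "example" {
  name                     = "examplestorage"
  resource_group_name      = "example-rg"
  location                 = "westeurope"
  account_tier             = "Standard"
  account_replication_type = "LRS"

  lifecycle {
    # Band-aid: stop Terraform from "flapping" on attributes some other
    # process keeps rewriting, instead of fixing that process.
    ignore_changes = [tags, account_replication_type]
  }
}
```

Every attribute listed there is a piece of state Terraform no longer asserts, so the declared configuration and the real infrastructure quietly diverge.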

This is what I call "defensive" Infrastructure as Code: making our configuration more ambiguous to make up for a messy world. This will eventually "fix" some of the error conditions and slightly improve deployment success rates, but you're trading one problem for three worse ones.

...and why we shouldn't

As we add additional resources, hooks, try() functions and other utilities to our configuration code, we accidentally add even more API calls: more moving parts, more failure modes, more debugging nightmares. This is accidental complexity: code that exists to work around architectural problems, not to solve infrastructure challenges.

To make matters worse, we add non-determinism to our deployments. You cannot predict the actual state of the target infrastructure after deployment.

data "azuread_user" "example" {
  # try() masks a missing variable, but the lookup itself can still fail
  user_principal_name = try(var.user_email, null)
}

resource "azurerm_role_assignment" "example" {
  # the role assignment is silently skipped when the lookup yields nothing
  count = try(data.azuread_user.example.object_id, null) != null ? 1 : 0
  # ...
}

That data source might find the identity, or it might not. The IAM role might exist, or it might be skipped. Each run is a surprise.

This breaks the single most important property of declarative configuration management: idempotency through declared intent. Runtime dependencies undermine the foundational principle of IaC: bringing a system from any state into a known state, reliably, every time.

So we can now report improved deployment success rates and green pipelines, but we no longer know what intent was actually validated.

A Way out

"If something hurts, do it more often."

The Columbia space shuttle disaster in 2003 happened because NASA normalized deviance. Small foam strikes occurred on previous missions without catastrophe, so engineers learned to work around them rather than stop and fix the root cause. Each workaround said "this anomaly is acceptable." Gene Kim and Dr. Steven Spear dissect this pattern in Wiring the Winning Organization: when you carry errors forward, you're not managing risk; you're accumulating it.

Defensive Terraform does the same thing. Every try() block normalizes an anomaly. Every conditional data source says "this failure is acceptable, route around it." Your pipeline turns green, but you're carrying errors into production.

Instead, errors and deviations must be explicit.

Your deployment system must fail on errors. The engineering challenge isn't eliminating failures; it's making them less costly.

When you stop masking errors, they become signals. A stack that fails repeatedly tells you where to refactor. High failure rates tell you to break things down further.
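Applied to the earlier snippet, making errors explicit means dropping the escape hatches: the lookup either resolves or the plan fails loudly (same illustrative names as before):

```hcl
data "azuread_user" "example" {
  # No try(): if the user does not exist, terraform plan fails
  # immediately and tells us exactly which assumption broke.
  user_principal_name = var.user_email
}

resource "azurerm_role_assignment" "example" {
  # No count guard: the role assignment is unconditional, so the
  # applied state always matches the declared intent.
  principal_id = data.azuread_user.example.object_id
  # ...
}
```

The failure didn't disappear; it moved to plan time, where it is cheap to see and fix, instead of surfacing later as drift between intent and reality.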

Refactoring the Configuration

When refactoring, cut your configuration into smaller units, reducing coupling and failure rates. Your infrastructure code should also be as static as possible, which makes it more readable: engineers can see what will be deployed without mentally executing conditionals and loops.
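As a sketch of what "static" means here (subnet names and CIDRs are made up for illustration):

```hcl
# Dynamic: readers must evaluate the filter expression to know
# which subnets will actually exist after apply.
resource "azurerm_subnet" "dynamic" {
  for_each             = { for s in var.subnets : s.name => s if s.enabled }
  name                 = each.key
  address_prefixes     = [each.value.cidr]
  resource_group_name  = "example-rg"
  virtual_network_name = "example-vnet"
}

# Static: the deployed set is visible at a glance, one block per subnet.
resource "azurerm_subnet" "app" {
  name                 = "app"
  address_prefixes     = ["10.0.1.0/24"]
  resource_group_name  = "example-rg"
  virtual_network_name = "example-vnet"
}
```

The dynamic form is shorter, but the static form answers "what exists?" by reading, not by evaluating.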

It is crucial to organize the stacks around coherent, tightly-coupled resources. Kief Morris's Infrastructure as Code covers this in depth, but in practice I've found that a layered design (Governance -> Networking -> Compute -> Application) works better than creating a complete top-to-bottom stack per application. The layered approach reduces the number of providers each stack requires and the duplication that per-application stacks tend to accumulate.
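One common way to wire such layers together (a hedged sketch; the backend, state names and resource names are assumptions) is for a downstream stack to consume the outputs of the layer below via remote state:

```hcl
# In the Compute stack: read outputs published by the Networking stack.
data "terraform_remote_state" "networking" {
  backend = "azurerm"
  config = {
    resource_group_name  = "tfstate-rg"      # assumed shared state setup
    storage_account_name = "tfstateexample"  # assumed
    container_name       = "tfstate"
    key                  = "networking.tfstate"
  }
}

resource "azurerm_network_interface" "example" {
  name                = "example-nic"
  location            = "westeurope"
  resource_group_name = "example-rg"

  ip_configuration {
    name = "internal"
    # The subnet ID crosses the stack boundary as an explicit output,
    # keeping each stack's provider surface and blast radius small.
    subnet_id                     = data.terraform_remote_state.networking.outputs.app_subnet_id
    private_ip_address_allocation = "Dynamic"
  }
}
```

Each layer then fails (and gets refactored) independently, rather than one 40-minute monolith failing as a whole.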

A note on tooling: Domain-specific languages like HCL have their place, but when it comes to large-scale refactoring, general-purpose languages (TypeScript, Python, Go) have a clear advantage. You get real functions, real modules, mature testing frameworks, and type systems that catch errors before deployment. AI assistants handle these languages more effectively, offering better refactoring suggestions. The tooling ecosystem makes breaking down large configurations significantly less painful.

Refactoring the Deployment

The goal of Infrastructure as Code is to continuously keep the infrastructure in a known desired state. As of today, most Terraform configurations I come across are deployed using CI/CD pipelines in some form of: Code Change -> Terraform Plan -> Pull Request Review (Code Change and Plan Output) -> Terraform Apply.

This GitOps methodology is reasonable, but there's a gap: this approach only catches drift when you change code. Between deployments, your infrastructure drifts unchecked. Configuration drift creates the API inconsistencies and intermittent failures that pushed you toward defensive coding in the first place.

To break that cycle, separate integration from deployment.

This is the Controller Pattern, and it's a major reason Kubernetes became the dominant orchestration platform: continuous reconciliation means drift gets corrected automatically, not whenever someone remembers to deploy.

You can apply this pattern natively in Kubernetes using Crossplane or the Pulumi Kubernetes Operator. For Terraform, consider hosted solutions like Spacelift or Terraform Cloud, or implement scheduled reconciliation in your existing CI/CD system.

Conclusion

Platform engineers fight drift constantly. Time pressure and present bias (our tendency to favor quick fixes) push us to accept ambiguity in our configurations. This builds technical debt.

But failures in infrastructure deployments must be explicit. Masking them makes the system less reliable. Stuffing your configuration with escape hatches only makes matters worse.

Instead, treat failures as signals. Repeated deployment failures, increasing run times, and declining success rates tell you when and where to refactor. Break down large configurations. Implement continuous reconciliation. Make failures cheap enough to tolerate frequently.

This approach doesn't just fix today's deployments; it unlocks platform growth. When failures are cheap and localized, you can move faster. When drift is caught automatically, you can scale reliably. When your configuration signals what needs attention, you can evolve your platform deliberately rather than reactively.

Stop making your Terraform more defensive. Start making your platform more scalable.

#cicd #guides #iac #kubernetes #pulumi #research #terraform