Architecture for generative Terragrunt & Terraform infrastructure as code (IaC)

This article covers a specific scenario where, despite leveraging every DRY (don't repeat yourself) principle the underlying IaC (infrastructure as code) frameworks make available, we still needed to raise the abstraction another level to fully eliminate code duplication and gain larger economies of scale when deploying large platforms with IaC.

Evolution 1

Sometimes when we first get into DevOps automation using infrastructure as code (IaC) tools like Terraform, we start out with monolithic projects that define multiple logical environments (i.e. dev, test, prod) and end up with a lot of copy-pasted code between them: resource definitions are first created and tested in a lower environment and then, once working, get copy-pasted up the chain. It doesn't end up very DRY (don't repeat yourself).

For example, here we have a simple project for 3 environments. Each folder contains the same Terraform code defining N resources in main.tf; the code is identical in each folder, and what varies at runtime is only the values of the variables. The more environments we need, the more copied code we end up with, and each environment requires its own separate terraform plan/apply/destroy. Each environment also has its own provider, variables, outputs and backend state configuration, again basically the same with minor differences for the environment they target.

my-infra/
     dev/
        variables.tf
        main.tf 
        outputs.tf
        providers.tf
     test/
        variables.tf
        main.tf 
        outputs.tf
        providers.tf
     prod/
        variables.tf
        main.tf
        outputs.tf
        providers.tf
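
To make the duplication concrete, here is a minimal sketch of what each environment's main.tf might look like (the resources and variable names are purely illustrative, assuming an Azure provider); dev, test and prod each hold an identical copy, and only the variable values supplied to them differ:

# Hypothetical dev/main.tf -- test/main.tf and prod/main.tf are identical copies;
# only the values supplied for var.environment, var.location, var.vnet_cidr differ.
resource "azurerm_resource_group" "app" {
  name     = "rg-myapp-${var.environment}"
  location = var.location
}

resource "azurerm_virtual_network" "app" {
  name                = "vnet-myapp-${var.environment}"
  resource_group_name = azurerm_resource_group.app.name
  location            = azurerm_resource_group.app.location
  address_space       = [var.vnet_cidr]
}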

Evolution 2

The next level of evolution towards being DRY is to get rid of the copy-pasted resource definitions in main.tf and start using Terraform modules that live in a dedicated "modules" folder somewhere in the IaC project, or even remotely in a git repository. In this scenario you still have some copy-pasted code in each main.tf file, but it contains much less, as it now just invokes your Terraform modules via source references that abstract away the resource details while passing the required input variables down to the underlying module. Still, each time you add a new logical environment folder you end up copying the main.tf file, and the variables.tf, and the outputs.tf.

my-infra/
     modules/
        database/
          main.tf
          variables.tf
          outputs.tf
        vm/
          ...
        vnet/
          ...
        storage/
          ...
     dev/
        variables.tf
        main.tf (references modules/[N+] via source refs)
        outputs.tf
        providers.tf
     test/
        main.tf (references modules/[N+] via source refs)
        ...
     prod/
        main.tf (references modules/[N+] via source refs)
        ...
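
As a rough sketch (the module names and inputs here are illustrative), each environment's main.tf now shrinks to a handful of module blocks like the following, still repeated per environment but with far less code behind them:

# Hypothetical dev/main.tf -- just wires environment-specific variables into shared modules.
module "vnet" {
  source        = "../modules/vnet"
  environment   = var.environment
  address_space = var.vnet_cidr
}

module "database" {
  source      = "../modules/database"
  environment = var.environment
  sku_name    = var.database_sku
  subnet_id   = module.vnet.db_subnet_id  # assumes the vnet module exposes this output
}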

Evolution 3

What about storing our variables? tfvars files work, but even for those we end up with a fair amount of copy-pasting of variables that have standard values we want to keep configurable but that don't deviate much from environment to environment. In practice we always end up identifying variables that can be classified at a higher level: shared across N environments, or common to other characteristics such as location/region. In other words, we need global/common/shared variables at different levels of a hierarchy.

Once we really start to look at the problem, outside of minor differences in underlying resource capacity settings and a few feature flags, the majority of our resources are configured identically across environments (as they should be!). The same goes for how the resource components relate to one another at the network level; in other words, the general topology or "footprint" of each environment is often the same.

So what else is different between environments? Beyond the things described above, the major differences are really the target cloud account our infrastructure is deployed to and the logical "name/context" of the topology instance being deployed, i.e. the name we assign to everything and the convention behind it. This all sounds like it can be templated further, and at this point we start looking at other solutions, like Terragrunt.
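
One stop-gap with plain Terraform is to layer tfvars files, keeping truly shared values in a common file and only the real deviations per environment, then passing multiple -var-file flags (later files override earlier ones). The file and variable names below are illustrative:

# common.tfvars -- values shared by every environment
location     = "eastus"
vnet_cidr    = "10.0.0.0/16"
database_sku = "GP_Gen5_2"

# dev.tfvars -- only what genuinely differs in dev
environment  = "dev"
database_sku = "B_Gen5_1"   # override: smaller capacity for dev

# invoked per environment, e.g.:
#   terraform apply -var-file=common.tfvars -var-file=dev.tfvars

This still has to be wired up by hand for every environment and every project, which is what pushes us towards a tool that understands the hierarchy natively.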

Terragrunt

I'm not going to go into the full details of Terragrunt, but the key takeaway is that it lets you take your Terraform IaC to the next level of DRY. Terragrunt is a wrapper around Terraform that lets you fully isolate your Terraform code into modules that can be executed independently or as part of a larger footprint defined within a hierarchy. For example:

my-infra-modules/
       database/
          main.tf
          variables.tf
          outputs.tf
       vm/
         ...
       vnet/
         ...
       storage/
         ...

my-infra/
     terragrunt.hcl (common TF backend, provider, global vars declarations)
     useast/
       terragrunt.hcl (common vars for region)
       apphosting/
         terragrunt.hcl (common vars for apphosting footprint)
          dev/
            terragrunt.hcl (common vars for dev)
            database/
              terragrunt.hcl (git ref to my-infra-modules/database + input vars)
            vnet/
            storage/
            vm/
          test/
            ...
          prod/
            ...
        ...

There is a lot to Terragrunt, and it's a bit à la carte with regard to which features you use. The point is that to get to the next level of DRY and economies of scale with your Terraform code, you will need something like Terragrunt on top of it.
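
To give a feel for what those terragrunt.hcl files contain, here is a hedged sketch (the backend settings, repo URL, module path and input names are all illustrative): the root file declares the remote state backend once for the whole hierarchy, and each leaf just points at a versioned Terraform module and supplies its inputs.

# my-infra/terragrunt.hcl (root) -- illustrative Azure backend, declared once
remote_state {
  backend = "azurerm"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite"
  }
  config = {
    resource_group_name  = "rg-tfstate"
    storage_account_name = "mytfstate"
    container_name       = "tfstate"
    key                  = "${path_relative_to_include()}/terraform.tfstate"
  }
}

# my-infra/useast/apphosting/dev/database/terragrunt.hcl (leaf)
include {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://github.com/example-org/my-infra-modules.git//database?ref=v1.2.0"
}

inputs = {
  environment = "dev"
  sku_name    = "B_Gen5_1"
}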

Evolution 4

Awesome. Now that we are using Terragrunt, everything feels much cleaner and DRY. We have a place for common/shared variables in the hierarchy, and core Terraform constructs like the backend for our state and the provider are no longer repeated yet remain customizable. Things are pretty good. However, the one less-than-ideal thing is that we still have some code repetition within each footprint's environment folders for each sub-module being called (i.e. what is in each environment's vm, database, etc. terragrunt.hcl files). It's not too bad, and we can live with it since the other gains described above make up for it.

But now let's imagine you need to take the above and treat it as a "template" for a platform that can be deployed for N customers. For example, not only do we need 3 instances of our "apphosting" footprint (dev, test, prod), we now need to create N "apphosting" footprint instances for various clients. As a platform provider you would be faced with copy-pasting the Terragrunt "footprint" project for each of these instances, and each of those has 3 different environments: "apphosting-teamA-[env]", "apphosting-teamB-[env]", "apphosting-teamC-[env]" and so on.

This is the kind of scenario our team was faced with. What we started out building as a single services platform, along with all of its infrastructure, worked pretty well… but then we decided to turn this into a much larger platform with many instances of it, each with its own requirement of 3 environments and with their management delegated to other teams.

FOOTPRINTS Overview

At this point you've recognized a key term that I've thrown out there: footprints.

When you create a set of IaC that produces a set of infrastructure fulfilling some particular use-case/role, we call that a footprint. Regardless of the environment or "where" it is provisioned to, outside of some naming, feature flags and capacity-related settings, any given instance of a footprint is basically the same as any other, at least from an IaC-driven management perspective. Knowing this, we can start to think about making these footprints "templates" themselves, so that given an even more limited set of parameters they can be stamped out on demand and new ones easily created.

Platform overview

For the team in question, we boiled it down to 3 distinct footprints that were already codified in Terragrunt and Terraform and successfully deploying into 3 different environments: dev, test and production. The instances of these footprints happened to be deployed for one platform instance, but the footprint architecture was generic enough to serve many other use-cases… so we needed to open this up to be able to easily create N platform instances.

At a high level, the three footprints that together constituted a single "platform instance" were as follows:

  • control (CNTRL): a set of isolated cloud network infrastructure, VMs, DNS, storage, etc., plus a core Kubernetes cluster used to run key services (in k8s) that support all peer-level APP[N] footprints. Here we are talking about things that support telemetry, monitoring, DNS, CDN services, service discovery, etc.
  • application (APP[N]): a set of isolated cloud network infrastructure, VMs, DNS, storage, etc. and one or more dedicated Kubernetes cluster(s) used for running N application and service workloads. Certain elements in this footprint depended on services provided by the CNTRL footprint, and each APP[N] network was peered with the central CNTRL network (a minimal sketch of this peering follows the list). There could be N instances of an APP[N] footprint within a platform instance, depending on requirements.
  • admin (ADM): a set of isolated cloud network infrastructure, VMs, DNS, storage and CI/CD execution infrastructure intended for DevOps personnel to deploy and administer the CNTRL and APP[N] footprint instances under its governance.
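
As a minimal sketch of that peering relationship (assuming Azure; the variable names are hypothetical and would come from the footprint's own inputs), the APP[N] network module would contain something along these lines:

# Hypothetical peering from an APP[N] vnet back to the central CNTRL vnet.
resource "azurerm_virtual_network_peering" "app_to_cntrl" {
  name                         = "peer-${var.app_vnet_name}-to-cntrl"
  resource_group_name          = var.app_resource_group_name
  virtual_network_name         = var.app_vnet_name
  remote_virtual_network_id    = var.cntrl_vnet_id
  allow_virtual_network_access = true
}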

Each of these 3 footprints would have their own dev, test, and production instances.

For each of these 3 footprints constituting a single platform instance, the desire was to have N instances, and each logical instance had environment-specific instances for dev, test and prod. So given 4 logical platforms, there were actually 12 physical instances materialized in the cloud.

Generating Terragrunt projects

So how was this platform ultimately materialized? As stated earlier, Terragrunt could only take us so far; given the requirements, it was unmanageable to proceed with DevOps staff copy-pasting Terragrunt projects over and over. What if footprints evolved? How would we introduce new combinations of footprints?

Ultimately it was decided to do the following:

  • Take each Terragrunt "footprint" and templatize it using Jinja templating (a sketch of what such a template might look like follows this list). As stated earlier, when you really break it down, much of the actual deviation between environments boils down to a few dozen feature flag changes, capacity-related settings, target cloud accounts, backend state targets and, primarily, the naming of all resources managed under IaC control.
  • Consolidate and express all configurable variables that each footprint needs into a single simplified file per platform instance/environment combination. Express the configuration in YAML to make it much clearer for a human operator to curate and change.
  • Write a Python program, run via CLI invocation, that takes a target Terragrunt footprint template, marries it with a platform environment YAML file and outputs an executable Terragrunt project that depends on nothing other than the underlying Terraform modules it sources from versioned git URLs.
  • Once generated, each Terragrunt project could completely fork and live on its own, or ideally simply be updated as needed by the same Python program that initially generated it, applying any new changes that materialize from changes to the YAML configuration and/or the underlying Terragrunt footprint template projects.
  • Using all of the above in combination with solid Git workflow practices (branching, PRs and tagging), each platform instance could be fully customized, versioned and isolated from breaking changes over time.
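
As a rough illustration of the templating idea (the placeholder names and module URL below are made up, not the team's actual schema), a leaf terragrunt.hcl template inside a footprint might look something like this, with the Jinja placeholders resolved from the matching YAML configuration at generation time:

# Hypothetical footprint template: apphosting/<env>/database/terragrunt.hcl
include {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://github.com/example-org/terraform-modules.git//database?ref={{ terraform_modules_ref }}"
}

inputs = {
  name        = "{{ platform_name }}-{{ footprint_name }}-{{ environment }}-db"
  location    = "{{ location }}"
  sku_name    = "{{ database_sku }}"
  environment = "{{ environment }}"
}

The generator simply walks the footprint template tree, renders every file through Jinja with the values from the YAML config, and writes the result out as a standalone, executable Terragrunt project.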

The implementation broke down into the following separate projects:

  • terraform-modules: this project contained a library of pure Terraform modules that could be run independently or referenced by Terragrunt projects. By keeping the modules in their own git repository, specific tagged versions or branches could be referenced by different Terragrunt project versions.
  • terragrunt-footprints: this project contained the various components that make up the different Terragrunt project "footprints". Each footprint is a combination of directory-structure templates and footprint directories composed of all the individual components that constitute that footprint. Individual terragrunt.hcl files called out to individual modules from terraform-modules. The footprint configs also contained various DevOps support scripts and CI pipeline templates that would be generated into the final project. Any file within this project could reference any supported variable defined in the footprint-configs YAMLs via Jinja template syntax; all files were treated as Jinja templates.
  • footprint-configs: this project contained the simplified YAML config files, one per distinct platform-name/footprint-name/environment-name combination. The supported values for these files were dictated by the overall generative framework.
  • [platform-name]-[footprint]-[env]: the actual generated, executable Terragrunt projects. Once generated, these projects were fully de-coupled from everything except their references to terraform-modules. The owners could take a project in any direction they liked, but by default projects were typically updated by the same generative process that created them in the first place.

Separating the framework into these individual projects, as well as the executable Terragrunt project outputs themselves, permitted a high level of granularity when rolling out changes at any level, and allowed for numerous release lifecycles and version combinations. We didn't have to worry as much about "will this change break something": the separation of projects, each with its own release cycle, made for a very flexible and de-coupled change management process.

Final Thoughts

Overall this was a challenging project with a unique solution that ultimately met the requirements. By compartmentalizing and abstracting patterns of infrastructure into footprints that could then be templatized and stamped out as individual executable Terragrunt projects, we gained much larger economies of scale and the ability to create N platform instances.

Since the original inception of this project, newer features in Terraform itself and other third-party frameworks such as Terraspace have come along that are also worth considering if venturing down a path like this. Terragrunt itself is pretty nice; however, it does have some issues that we struggled with, particularly around Terragrunt's "dependencies" syntax, which was often problematic and unreliable.
