Testing in the wonderful world of infrastructure as code

I was planning to write about my home infrastructure (I have at least one post in mind already), but as it is fresh in my mind I decided to write about the state of IaC, or more specifically, testing IaC code (or the lack of it).

Testing is the foundation of most workflows

I have recently spent quite a bit of time setting up GitHub Actions-based workflows for both my personal and professional projects. Ultimately, that stuff works quite well:

- For open-source projects, GHA minutes are free (and there is a large ecosystem of add-ons for them, e.g. codecov is free for open-source projects)
- In my experience the availability of GHA seems acceptable; it is down sometimes, but so is everything (and workarounds are usually found fast when the problem is localized, as opposed to a rarer, larger-scale outage)
- It performs well enough, even with the ‘free’ runners
  - ‘free’ for open source
  - 50k minutes per month for GitHub Enterprise customers (which is quite a good deal for small startups even at list price)

Here is an example of a setup I recently worked on, with a few GH workflows interacting reasonably well (if any of the steps fails, later ones are not run):

- Containers are built if there are changes
  - Unit tests are run, coverage produced, container built
- System tests are then run
- If these also succeed for the ‘app’, tag the container versions in the container registry, and update the helm chart repository (main branch, sigh) accordingly

┌──────────────────────────────────────┐
│                   Per container build│            .─────────────.
│        ┌───────────┐                 │         ,─'               '─.
│        │Unit tests │                 │        ╱   git repository    ╲
│        └───────────┘                 │   ┌───(     (containers)      )
│              │                       │   │    `.                   ,'
│              │                       │   │      '─.             ,─'
│              │                       │   │         `───────────'
│              │                       │   │
│              │                       │   │
│              │                       │   │
│              ▼                       │   │
│         ┌─────────┐                  │   │
│         │Coverage │                  │◀──┘
│         └─────────┘                  │
│              │                       │
│              │                       │
│              │                       │
│              │                       │
│              ▼                       │
│     ┌────────────────┐               │        .─────────────────────.
│     │Build container │───────────────┼──────▶( container repository  )◀─┐
│     └────────────────┘               │        `─────────────────────'   │
│                                      │                   │              │
│                                      │                   │              │
│                                      │                   │              │
└──────────────────────────────────────┘                   │              │
                                                           │              │
                                                           │              │
  ┌──────────────────┐                                     │              │
  │   System tests   │                                     │              │
  │ (docker compose) │◀────────────────────────────────────┘              │
  └──────────────────┘                                                    │
            │                                                             │
            │                                                             │
            ▼                                                             │
            Λ                                                             │
           ╱ ╲                                                            │
          ╱   ╲                                                           │
         ╱     ╲        ┌──────────────┐                                  │
        ▕ main? ▏──y───▶│Tag container │──────────────────────────────────┘
         ╲     ╱        └──────────────┘
          ╲   ╱                 │
           ╲ ╱                  │
            V                   │
                                │
                                │
                                │
                                │
                                │
                                │
                                │                    .─────────────.
                                ▼                 ,─'               '─.
                      ┌───────────────────┐      ╱   git repository    ╲
                      │ Update helm chart │────▶(     (helm chart)      )
                      └───────────────────┘      `.                   ,'
                                                   '─.             ,─'
                                                      `───────────'

A notable part of this workflow is that most of it is also executed for pull requests. E.g. system test results tell you whether or not a change broke something globally. This is a key part of a good testing pipeline from my point of view, and it also makes code reviews more meaningful, as you are reviewing something which has at least some chance of actually working.
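The gating described above can be sketched in a few lines. This is Python rather than actual GHA YAML, the step names simply mirror the diagram, and the step functions are placeholders I made up for illustration, so treat it as a model of the control flow and nothing more.

```python
from typing import Callable, List, Tuple

# A step is a name plus a function that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_pipeline(checks: List[Step], publish: List[Step], branch: str) -> List[str]:
    """Run steps in order and stop at the first failure.

    The publish steps (tagging, chart update) run only on main, and only
    if every check succeeded. Returns the names of the steps that ran.
    """
    executed: List[str] = []
    for name, step in checks:
        executed.append(name)
        if not step():
            return executed  # later steps are skipped
    if branch != "main":
        return executed  # pull requests stop after the checks
    for name, step in publish:
        executed.append(name)
        if not step():
            return executed
    return executed


# Placeholder steps that always succeed; real ones would shell out.
CHECKS: List[Step] = [
    ("unit tests", lambda: True),
    ("coverage", lambda: True),
    ("build container", lambda: True),
    ("system tests", lambda: True),
]
PUBLISH: List[Step] = [
    ("tag container", lambda: True),
    ("update helm chart", lambda: True),
]
```

Calling `run_pipeline(CHECKS, PUBLISH, "my-feature")` runs only the four checks, while the same call with `"main"` also tags and updates the chart.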

As the output of this, we have a helm chart which refers to containers that are proven to work. All of this happens within GHA, without a Kubernetes cluster being used. This may not actually be a feature.

So, you have your helm chart.. what then?

This is where things unfortunately turn murky. Even with CD, unless you have manual or per-PR testing in place, it can get quite grim. Typically it looks something like this, based on a small sample of the few companies I have worked at recently:

   ┌────────────────┐                    ┌────────────────────┐
   │                │                    │     Terraform      │
   │  <new stuff>   │───────────────────▶│     lint/plan      │
   │                │                    └────────────────────┘
   └────────────────┘                               │
            │                                       │
            │                                       │
            │                                       ▼
            │                                .─────────────.
            │                             ,─'               '─.
            │                            ╱   git repository    ╲
            ▼                           (        (infra)        )
 ┌────────────────────┐                  `.                   ,'
 │        Helm        │                    '─.             ,─'
 │lint/template(/test)│                       `───────────'
 └────────────────────┘                             │
            │                                       │
            │                                       │
            │                                       │
            │                                       ▼
            ▼                            ┌─────────────────────┐
     .─────────────.                     │                     │
  ,─'               '─.                  │   Terraform apply   │
 ╱   git repository    ╲                 │                     │
(     (helm chart)      )                └─────────────────────┘
 `.                   ,'                            │
   '─.             ,─'                              │
      `───────────'                                 │
            │                                       │
            │                                       │
            │                                       │
            │                                       │
            │                                       │
            └───────────────────────────────────┐   │
                                                │   ▼
                                  ┌─────────────┼────────────────────┐
                                  │             │  Kubernetes cluster│
                                  │             │                    │
                                  │             │                    │
                                  │             ▼                    │
                                  │  ┌────────────────────┐          │
                                  │  │                    │          │
                                  │  │       ArgoCD /     │          │
                                  │  │        Flux        │          │
                                  │  │                    │          │
                                  │  └────────────────────┘          │
                                  │             │                    │
                                  │             │                    │
 .─────────────────────.          │             │                    │
( container repository  )─────────▶             ▼                    │
 `─────────────────────'          │  ┌────────────────────┐          │
                                  │  │                    │          │
                                  │  │        App         │          │
                                  │  │                    │          │
                                  │  │                    │          │
                                  │  └────────────────────┘          │
                                  │                                  │
                                  │                                  │
                                  └──────────────────────────────────┘

Both Terraform definitions and Helm charts can be linted and to some extent unit tested, but after that it is mostly a matter of faith, as far as automated workflows go. Pull requests are not particularly meaningful if the resulting system cannot be tested (‘oo, you added these definitions to the helm chart! how nice’).

So it all boils down to manual testing (if possible). This sounds frankly horrifying, especially if it is not possible.

In an ideal world..

ALL workflows should have system tests. Whatever they touch, it should be ensured that nothing within it breaks. This is quite tricky with Kubernetes though. The idea of the unit under test being a whole Kubernetes cluster (or e.g. full namespaces within one) seems to be relatively rare, at least based on what I see on the Internet these days. A lot of the effort around helm/terraform goes into linting/unit testing.

In the real world, now

This whole blog post was triggered by what I spent half of yesterday doing. I spent hours ‘developing’ by mostly pushing to main of the helm chart repo (note: all of these steps, each with one or more commits, passed helm lint, so they were syntactically correct at least to some extent, although a tool such as helm/chart-testing, a CLI tool for linting and testing Helm charts, might have helped):

1. I had two commits to start with, which added two new pods to our app (and configured other pods correctly, or so I thought); they went through code review and hit the cluster. Subsequent changes were not reviewed for obvious reasons (the dev cluster was unhappy, and the changes did not affect prod, as the new pods were not in use there).
2. I had forgotten an environment variable from one of the other pods
3. I passed ‘env’ values as ‘int’ types instead of ‘string’; the helm linter was fine with it, ArgoCD was not
4. I had forgotten 2 environment variables from one of the new pods
5. Another ‘int’ instead of ‘string’ in env
6. I wrote POSTGRESQL somewhere instead of POSTGRES
7. Another ‘int’ instead of ‘string’ in env
8. I tried to fix the uid/gid of a pod to be non-root, but I inserted the securityContext block in the wrong place
9. The password reference should have pointed to a secret, not a config map (brainfart)
10. I typoed secretKeyRef as secretMapKeyRef; fixed
11. Yet another environment variable missing from one of the new pods
12. Switched to envFrom+secretRef as I needed even more shared secrets for one of the pods
13. Fixed securityContext from step 8
14. I gave up on one of the new pods
15. Disabling it did not work as anticipated (the XXX_PORT variable derived automatically from the Service definition was implicitly passed to the pod, so using that for enabling was a bad idea)
16. .. more fallout from ^
17. .. more fallout from ^ (after this, the single added new pod actually worked within the app in the dev environment)
18. Re-enabled the second pod after fixing an issue in the container that did not occur in the docker-compose system test environment (Amazon RDS requires TLS, pg within docker compose did not)
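Two of the failure classes in the list above are mechanically checkable before anything hits a cluster: env values that are not strings, and app-level ‘enable’ variables that collide with the variables Kubernetes injects for Services. Below is a minimal sketch of both checks. It assumes the rendered manifests (e.g. the output of helm template) have already been parsed into plain Python dicts with a YAML library of your choice; the function names, the example Deployment and the metrics-exporter service are all made up for illustration, and only Pods and workloads with spec.template are handled.

```python
from typing import Any, Dict, Iterable, Iterator, List, Tuple


def iter_containers(manifest: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield container specs from a Pod, or from a workload with spec.template."""
    spec = manifest.get("spec", {})
    pod_spec = spec.get("template", {}).get("spec", spec)
    for key in ("initContainers", "containers"):
        yield from pod_spec.get(key, [])


def non_string_env_values(manifest: Dict[str, Any]) -> List[Tuple[str, str, Any]]:
    """Return (container, variable, value) for env values that are not strings."""
    problems = []
    for container in iter_containers(manifest):
        for entry in container.get("env", []):
            if "value" in entry and not isinstance(entry["value"], str):
                problems.append((container.get("name", "?"), entry.get("name", "?"), entry["value"]))
    return problems


def toggle_collisions(service_names: Iterable[str], toggle_env_names: Iterable[str]) -> List[str]:
    """Return toggle variables that share a name with a Service-injected variable.

    Only a subset of the injected names is modelled here:
    <SVC>_SERVICE_HOST, <SVC>_SERVICE_PORT and the Docker-link style <SVC>_PORT.
    """
    injected = set()
    for service in service_names:
        prefix = service.upper().replace("-", "_")
        injected.update({f"{prefix}_SERVICE_HOST", f"{prefix}_SERVICE_PORT", f"{prefix}_PORT"})
    return sorted(injected & set(toggle_env_names))


# Hypothetical rendered Deployment: an unquoted port in YAML parses as an int.
DEPLOYMENT = {
    "kind": "Deployment",
    "spec": {"template": {"spec": {"containers": [
        {"name": "worker", "env": [
            {"name": "POSTGRES_PORT", "value": 5432},
            {"name": "POSTGRES_HOST", "value": "db"},
        ]},
    ]}}},
}
```

Here `non_string_env_values(DEPLOYMENT)` flags POSTGRES_PORT, and `toggle_collisions(["metrics-exporter"], ["METRICS_EXPORTER_PORT"])` reports the collision, because a Service named metrics-exporter makes METRICS_EXPORTER_PORT show up in pods whether or not the chart sets it.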

How could this be fixed? Somewhere around step 7 (and definitely now) I started thinking that any helm chart you cannot deploy and test without pushing code anywhere is not one you should be using. I plan to address this particular mess ASAP, as yesterday was not a particularly useful use of time.
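One way to get part of that without pushing anything is to render the chart locally and hand the result to the API server as a dry run. The sketch below does that with helm template piped into kubectl apply --dry-run=server. It assumes helm and kubectl are installed and that the current kubectl context points at a dev or local cluster you are allowed to talk to; the release name, chart path and values file in the usage line are hypothetical. I would expect the API server to reject things like non-string env values at this point, but I have not verified that it catches every mistake in the list above.

```python
import subprocess
from typing import List, Optional

# Server-side dry run: the API server validates the objects but stores nothing.
KUBECTL_DRY_RUN = ["kubectl", "apply", "--dry-run=server", "-f", "-"]


def helm_template_cmd(release: str, chart_dir: str, values_file: Optional[str] = None) -> List[str]:
    """Build the helm command that renders the chart to stdout."""
    cmd = ["helm", "template", release, chart_dir]
    if values_file:
        cmd += ["--values", values_file]
    return cmd


def validate_chart(release: str, chart_dir: str, values_file: Optional[str] = None) -> bool:
    """Render the chart locally and ask the cluster whether it would accept it."""
    rendered = subprocess.run(
        helm_template_cmd(release, chart_dir, values_file),
        capture_output=True, text=True,
    )
    if rendered.returncode != 0:
        print(rendered.stderr)
        return False
    checked = subprocess.run(
        KUBECTL_DRY_RUN, input=rendered.stdout, capture_output=True, text=True,
    )
    if checked.returncode != 0:
        print(checked.stderr)
    return checked.returncode == 0
```

Usage would be something like `validate_chart("app", "./chart", "values-dev.yaml")` from a pre-push hook or a PR workflow.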

This same nightmare also applies to Terraform, although that you can apply manually, and as long as you have a staging setup, it might work out (but even then, the handling of various external secrets and configurations is most likely not uniform between your test and real environments, so ‘testing in production’ may be more common than one might want).

Key takeaway? Define helm charts so that

- you can easily deploy test copies of them (e.g. in pull requests), with some sanity checking
- you have CI in place for actual manual changes to them, so that you don’t have to do the (wrong kind of) ‘gitops’ to get your test cluster back to a sane state
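As a rough sketch of what ‘test copies in pull requests’ could look like: give every PR its own namespace, install the chart there, run its helm tests, and tear it all down afterwards. The naming scheme and function names below are my own assumptions, and the functions only build the command lists; a CI job would run them in order and stop at the first failure. It also assumes the chart actually defines helm test hooks and that the CI job has credentials for a test cluster.

```python
from typing import List


def pr_namespace(app: str, pr_number: int) -> str:
    """Per-PR namespace name (hypothetical scheme), e.g. myapp-pr-123."""
    return f"{app}-pr-{pr_number}"


def pr_deploy_plan(app: str, chart_dir: str, pr_number: int) -> List[List[str]]:
    """Commands that deploy and test one throwaway copy of the chart."""
    ns = pr_namespace(app, pr_number)
    return [
        ["helm", "lint", chart_dir],
        # --wait blocks until the resources are ready or the timeout is hit
        ["helm", "upgrade", "--install", app, chart_dir,
         "--namespace", ns, "--create-namespace", "--wait", "--timeout", "5m"],
        # runs the test hooks defined in the chart, if any
        ["helm", "test", app, "--namespace", ns],
    ]


def pr_cleanup_plan(app: str, pr_number: int) -> List[List[str]]:
    """Commands that remove the throwaway copy again."""
    ns = pr_namespace(app, pr_number)
    return [
        ["helm", "uninstall", app, "--namespace", ns],
        ["kubectl", "delete", "namespace", ns],
    ]
```

Keeping the plan as data makes it easy to print in a PR comment or to run the same steps by hand when CI is not available.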
