Learnings From Terraform

By Akshay Vadher, on Wednesday, September 28, 2022

Terraform is an infrastructure-as-code tool. I know it is a fancy term, but here is the meaning.

If you want to create one virtual machine in AWS, you go to the AWS console, click a few buttons, and the instance is created. But what if you need 15 such VMs, each with some security configuration? Doing it manually is not only tedious but error-prone as well.

Terraform provides a nicer way to manage infrastructure in code. The following example creates a virtual machine in AWS:

resource "aws_instance" "app_server" {
  ami           = var.ami_id # assumed variable holding an AMI ID valid in your region
  instance_type = "t2.micro"
}

Terraform creates, updates, or deletes resources using simple commands such as apply and destroy.

Goal

After using Terraform in multiple projects, here are my learnings from experience.

They can be used as best practices as well.

Feel free to add yours.

State in cloud

By default, Terraform stores its ‘state’ in a local file. The state is a record of the resources created using Terraform.

In the above example, Terraform makes an entry in the state file for aws_instance with the ID generated by AWS. So when we apply the configuration again, it won’t recreate resources that already exist.

Since the default way of storing the state is in a local file, most beginners don’t change that. That is highly risky.

We should always store the state in cloud file storage. For example, on AWS store the state in ‘S3’, and on Azure store it in ‘Azure Blob Storage’. The reason is that if your local machine crashes, the state is still stored in the cloud, somewhere safe.
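As a minimal sketch of remote state on AWS, the block below configures the s3 backend. The bucket name, key, and region are placeholder values assumed for illustration, and the bucket is assumed to exist before terraform init is run.

```hcl
terraform {
  backend "s3" {
    bucket  = "my-terraform-state"    # assumed bucket name, created beforehand
    key     = "app/terraform.tfstate" # path of the state file inside the bucket
    region  = "us-east-1"             # assumed region of the bucket
    encrypt = true                    # encrypt the state file at rest
  }
}
```

On Azure, the azurerm backend plays the same role, pointing at a storage account and container instead of a bucket.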

Store and run as a pipeline

Always run the Terraform commands from the pipeline and not from your local machine.

At the very least, run the state-changing commands from the pipeline: apply (to create/update) and destroy (to delete!).

The idea behind Terraform is to have ‘infrastructure as code’. Since it is ‘code’, it needs to be committed to a repository.

Version number is mandatory

Always specify a version number for any resource you create.

The following example creates one AWS RDS MySQL instance:

resource "aws_db_instance" "default" {
  db_name              = "mydb"
  engine               = "mysql"
  instance_class       = "db.t3.micro"
}

Here we are not defining a property called engine_version, so the version is chosen for us by default, typically a recent one. Suppose that version is 5.7 now, but after a few months it becomes 8. The version change might introduce breaking changes, and the application will start breaking.

The biggest issue is that nobody would know why the app is crashing without any code change.

MySQL might be a poor example because most people already treat its version number as important. However, things like Docker image versions and Helm chart versions are often left at the latest.
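Applied to the RDS example from this section, pinning could look like the sketch below. The variables db_username and db_password are hypothetical names introduced here so that credentials stay out of the code, and the storage size and version value are only illustrative.

```hcl
# Hypothetical input variables; not part of the original example.
variable "db_username" {
  type = string
}

variable "db_password" {
  type      = string
  sensitive = true
}

resource "aws_db_instance" "default" {
  allocated_storage = 10 # illustrative size in GiB
  db_name           = "mydb"
  engine            = "mysql"
  engine_version    = "8.0" # pinned; a major upgrade now requires a code change
  instance_class    = "db.t3.micro"
  username          = var.db_username
  password          = var.db_password
}
```

With a prefix such as "8.0", upgrades within that line are still governed by the separate auto_minor_version_upgrade argument, so a full version string is the stricter choice if even those should be explicit.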

No manual changes in managed services

We might face some important issues or requests that require immediate fixes. Or we might be playing around with resources to do a POC or to check capabilities. In these cases, instead of updating resources using Terraform, we often go to the web console and update the resources there.

This is a big no-no. Don’t manually change resources that are managed by Terraform.

For example, we had one ‘AWS Cognito’ user pool (AWS’s managed identity service), and someone manually added a custom parameter to it. Later, someone else tried to deploy an unrelated change. Since the Terraform configuration for Cognito didn’t have that parameter, Terraform ‘understood’ that the field needed to be deleted, so it tried to delete it.

The problem in this case was that Cognito does not support field deletion. So the whole pipeline failed, and we had to recreate the whole user pool.

If you want to play around, do it using Terraform or create a separate resource.

Think about the environments

When we get a requirement to create resources, we jump into the task, create a bunch of resources, and declare that the task is done. The developers start using those resources and are happy.

As soon as we have to create a ‘test’ or a ‘staging’ environment, we are stuck, because we only created resources for the ‘dev’ environment. We now have to replicate the whole code chunk to create the other environments.

Always create the resources in such a way that when we have to duplicate them, we only have to pass a few configuration values.

There are multiple ways to manage the environments like modules, variable files, or ENV variables. It might require a separate post to weigh the pros and cons. But think about the reusability at the start.
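As one possible sketch of the variable-file approach, the configuration below declares the per-environment inputs once and leaves the values to a separate file for each environment. The variable names and the staging.tfvars file are assumptions made for illustration.

```hcl
# variables.tf: inputs that differ between environments
variable "env" {
  description = "Environment name, for example dev, test, or staging"
  type        = string
}

variable "instance_class" {
  description = "RDS instance class for this environment"
  type        = string
  default     = "db.t3.micro"
}

# staging.tfvars: a separate file holding the values for one environment.
# Shown as comments because it lives outside this file.
#   env            = "staging"
#   instance_class = "db.t3.small"
```

The environment is then selected at run time with terraform apply -var-file="staging.tfvars", so creating another environment means adding one more small values file rather than copying resources.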

Think about duplication of resources

Let’s assume you have thought about environments and are now creating templatized resources.

For example, here is the code to create one role:

resource "aws_iam_role" "reader" {
  name = "reader"
}

The problem with this approach is that the name is static. If you create two environments in the same account, Terraform will try to create the role reader twice in that account, and IAM does not allow two roles with the same name.

Always define resource names dynamically, as below, so that each environment gets a uniquely identifiable resource:

resource "aws_iam_role" "reader" {
  # assume_role_policy is also required; omitted here for brevity
  name = "${var.clustername}-${var.env}-reader"
}
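A slightly fuller sketch of dynamic naming is shown below. It assumes clustername and env are declared as input variables, and it builds the prefix once in a local value (name_prefix is a name chosen here) so that every resource reuses it. The trust policy allowing EC2 to assume the role is only an illustrative choice.

```hcl
variable "clustername" {
  type = string
}

variable "env" {
  type = string
}

locals {
  # Shared prefix, for example "payments-staging"
  name_prefix = "${var.clustername}-${var.env}"
}

resource "aws_iam_role" "reader" {
  name = "${local.name_prefix}-reader"

  # Illustrative trust policy: lets EC2 instances assume this role
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}
```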

Think about enabling and disabling a feature

Always think about enabling or disabling a feature, in other words, most of the resources should be created optionally, preferably using variables.

It might be that one of the environments does not require an RDS MySQL instance. Here is the code to create one MySQL instance:

resource "aws_db_instance" "default" {
  db_name              = "mydb"
  engine               = "mysql"
  instance_class       = "db.t3.micro"
}

Since you now want to make this optional, you introduce a variable so that the resource is created only if the variable is true, for example as below:

resource "aws_db_instance" "default" {
  count          = var.is-mysql-required ? 1 : 0
  db_name        = "mydb"
  engine         = "mysql"
  instance_class = "db.t3.micro"
}

The problem with this transition is that introducing count changes the resource’s address from aws_db_instance.default to aws_db_instance.default[0]. Depending on the Terraform version and how the change is applied, the plan may destroy the old instance and create a new one, so always review the plan before applying.

That might not be a problem for cattle-type resources (resources that do not persist any state and can be destroyed and recreated at any time). However, resources like MySQL hold data, and if we recreate them, we might lose that data.
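One way to make the transition explicit, assuming Terraform 1.1 or later where moved blocks are available, is to declare that the existing instance is now index 0 of the counted resource. The variable declaration below is an assumption about how is-mysql-required might be defined.

```hcl
variable "is-mysql-required" {
  description = "Whether this environment needs the MySQL instance"
  type        = bool
  default     = true
}

# Record that the existing instance now lives at index 0 of the counted
# resource, so Terraform updates the state instead of planning a replacement.
moved {
  from = aws_db_instance.default
  to   = aws_db_instance.default[0]
}
```

Designing resources as optional from the first commit avoids the problem entirely, because the address never has to change.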

Think about the availability of managed services

This is not specifically for Terraform but can be applied to any cloud architecture.

When you create a resource that is a managed service of a cloud provider, always make sure the service is available in all the regions where you intend to create environments.

For example, we use the ‘HTTP API Gateway’ of AWS as the entry point to our Kubernetes cluster. However, when we tried to set up our third PROD environment, we found out that ‘HTTP API Gateway’ was not available in the Jakarta (Indonesia) region.

There are two lessons here. First, if we had not made the resources optional using variables, the whole pipeline would have failed.

Second, when choosing services, always think about alternatives and don’t rely too much on one specific service. (This is subjective, so use your judgement.)

Think about how to pass a value instead of creating a resource

This is loosely related to the previous two topics. If a resource is not created, think about whether there is a way to pass a value instead of creating that resource.

For example, AWS Cognito was also not available in the Jakarta (Indonesia) region. Since that resource is optional, we do not create it there. The problem is that other resources might require the issuer URL of the Cognito user pool. In our case, we chose to use the Cognito user pool from another environment. So a provision to supply the issuer URL directly, for cases where the resource is not created, is helpful.

Another example: if we don’t want Terraform to create a security group but want to reuse an existing one, we can provide the ID of the existing security group.
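The security group case could be sketched as below. The variable existing_security_group_id and the local security_group_id are hypothetical names: when an ID is passed in, nothing is created and the given ID is used; otherwise the group is created and its ID is used.

```hcl
variable "existing_security_group_id" {
  description = "ID of a security group to reuse; leave null to create one"
  type        = string
  default     = null
}

# Created only when no existing ID was supplied
resource "aws_security_group" "app" {
  count       = var.existing_security_group_id == null ? 1 : 0
  name        = "app"
  description = "Security group for the app"
}

locals {
  # Prefer the supplied ID; otherwise fall back to the created group.
  # one() returns null for an empty list, which coalesce() then skips.
  security_group_id = coalesce(
    var.existing_security_group_id,
    one(aws_security_group.app[*].id)
  )
}
```

Other resources then reference local.security_group_id and do not need to know which path was taken. The same shape works for a Cognito issuer URL passed in as a string variable.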

Common things

Don’t keep everything inside the template boundary.

For example, a one-time resource like ECR (a repository for Docker images) should be created only once, and all the environments need to access the same repository.

If that resource is inside the template, it will be created once per environment, which wastes money and causes unnecessary duplication.
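One way to keep such a resource outside the template, sketched under the assumption that the repository is created once in a separate shared configuration, is to look it up with a data source inside each environment. The repository name and the output name are placeholders.

```hcl
# In each environment's configuration: look up the shared repository
# instead of creating it again.
data "aws_ecr_repository" "shared" {
  name = "my-app" # assumed name of the repository created once elsewhere
}

# Expose the URL so deployments in this environment can pull images from it
output "image_repository_url" {
  value = data.aws_ecr_repository.shared.repository_url
}
```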

Think about account separation

This is again more about the cloud than about Terraform. However, when you create multiple environments, think about how you want to separate them.

Some options are:

Create them in the same account
Create them in different accounts
Create them in separate resource groups (Azure)

Cross access

Since some resources should be accessible from every environment, think about how you will provide cross-environment access.

For example, ECR (the container registry, a Docker image repository) needs to be accessible from any authorized environment.