When consuming resources from the cloud, reliable deployment times might not be the first thing you think of, but in a disaster recovery scenario, consistent, dependable, and rapid deployments are essential. With tools like Bicep and Terraform, these deployments can easily be automated, but how reliable and consistent are they? This research dives into that question.
Infrastructure as Code
While the phrase “infrastructure as code” was introduced around 2006, it gained popularity in the 2010s as it gained traction in DevOps circles and saw increasing adoption, particularly on cloud platforms. With the introduction of Terraform by HashiCorp in 2014, the approach quickly spread, and Terraform is now the de facto standard infrastructure-as-code tool for most companies.

The term infrastructure as code, or IaC in its abbreviated form, refers to a method of automating and managing IT infrastructure using code rather than setting it up manually through a user interface. In simple terms, it means writing scripts or configuration files to automatically create and manage servers, networks, and cloud resources. The primary benefit of IaC is that it produces consistent and repeatable environments. By defining the environment as code, organisations can ensure that environments are deployed and maintained consistently, which reduces errors and enables faster and more reliable deployments.
Solutions like Terraform made the shift from an imperative to a declarative approach. With an imperative approach, you write all the logic needed to create a resource yourself, for example by setting up an environment with PowerShell. This means you must account for every scenario and potential exception when creating that resource, and in most cases, executing the script again will create the same resources all over again. The imperative approach is about the ‘how’, defined step by step.
The declarative approach is about the ‘what’: you describe the desired state of the resource you want. An example would be to declare “I need one resource with these specifications”, and the tool itself contains all the logic for creating that resource and is aware of the resources that have already been created. Within Terraform, this is referred to as the state. Terraform uses the state to map real-world resources to your configuration and to keep track of relevant changes to the metadata. This way, you only have to focus on the definition of the resources, not on the logic and exceptions involved in creating them, which makes it a very efficient and effective way to provision infrastructure.
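As a minimal sketch (the resource name and region below are made up for this illustration), a declarative Terraform definition of a single Azure resource group could look like this; running it a second time changes nothing, because the desired state already matches reality:

```hcl
# Declarative definition: describe the desired end state, not the steps to get there.
# Terraform compares this definition with its state and only creates, updates,
# or removes what is needed to make reality match the configuration.
resource "azurerm_resource_group" "example" {
  name     = "rg-go-euc-example"
  location = "westeurope"
}
```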
Setup and methodology
As this research is about the reliability of deployments, the setup differs somewhat from other GO-EUC research. A standard and straightforward Terraform configuration, stored in a Git repository in Azure DevOps, is used to provision the infrastructure over and over again. The configuration contains the following components needed to provision the virtual machine resources in Azure (an abbreviated sketch follows the list):
- Resource group
- Virtual network
- Subnet
- Network security group
- Network security rules
- Windows Virtual Machine
- Virtual machine extension
- Network interface
- Public IP
- Random password
- Time sleep
- Ansible groups
- Ansible hosts
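To give an impression of what such a configuration looks like, the abbreviated sketch below shows a handful of the listed components. The resource names, image, and other values are illustrative and not necessarily the exact configuration used in this research:

```hcl
# Abbreviated, illustrative excerpt of this kind of configuration.
resource "azurerm_resource_group" "rg" {
  name     = "rg-go-euc-terraform"
  location = "westeurope"
}

resource "azurerm_virtual_network" "vnet" {
  name                = "vnet-go-euc"
  address_space       = ["10.10.0.0/16"]
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name
}

resource "azurerm_subnet" "subnet" {
  name                 = "snet-vms"
  resource_group_name  = azurerm_resource_group.rg.name
  virtual_network_name = azurerm_virtual_network.vnet.name
  address_prefixes     = ["10.10.1.0/24"]
}

# Generated admin password, so no secrets live in the repository.
resource "random_password" "vm" {
  length  = 24
  special = true
}

resource "azurerm_network_interface" "nic" {
  name                = "nic-go-vm-1"
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name

  ip_configuration {
    name                          = "internal"
    subnet_id                     = azurerm_subnet.subnet.id
    private_ip_address_allocation = "Dynamic"
  }
}

resource "azurerm_windows_virtual_machine" "vm" {
  name                  = "go-vm-1"
  resource_group_name   = azurerm_resource_group.rg.name
  location              = azurerm_resource_group.rg.location
  size                  = var.vm_size # the SKU under test, defined further below
  admin_username        = "goadmin"
  admin_password        = random_password.vm.result
  network_interface_ids = [azurerm_network_interface.nic.id]

  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "StandardSSD_LRS"
  }

  source_image_reference {
    publisher = "MicrosoftWindowsServer"
    offer     = "WindowsServer"
    sku       = "2022-datacenter-azure-edition"
    version   = "latest"
  }
}
```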
Furthermore, the VM SKU is a variable component, and for this research, four different SKUs from various series are used:
- Standard_B2s
- Standard_D2s_v5
- Standard_F4s_v2
- Standard_E4s_v5
With this variable, a single Terraform configuration can be used to deploy all four SKUs.
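A sketch of how such a SKU variable could be defined, assuming a variable name of `vm_size` (the actual name used in the research is not known), is shown below:

```hcl
# The VM SKU is the only variable part of the configuration; the pipeline
# passes one of the four SKUs below into each parallel deployment.
variable "vm_size" {
  description = "Azure VM SKU to deploy"
  type        = string
  default     = "Standard_B2s"

  validation {
    condition = contains([
      "Standard_B2s",
      "Standard_D2s_v5",
      "Standard_F4s_v2",
      "Standard_E4s_v5",
    ], var.vm_size)
    error_message = "The vm_size must be one of the four SKUs used in this research."
  }
}
```

Each deployment can then pass its own SKU, for example with `terraform apply -var="vm_size=Standard_E4s_v5"`.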
Terraform uses a state file to store information about what it has already deployed. In this setup, the state is stored in an on-premises PostgreSQL database, and by using workspaces, the states are separated per deployment, which in this case means per VM SKU.
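Terraform’s PostgreSQL backend (`pg`) supports exactly this combination of a shared database and workspaces; a minimal sketch, assuming the connection string is supplied at init time, looks like this:

```hcl
# Store the Terraform state in a PostgreSQL database instead of a local file.
# Each VM SKU gets its own workspace, so the states do not interfere.
terraform {
  backend "pg" {
    # The connection string is typically passed during init, e.g.:
    # terraform init -backend-config="conn_str=postgres://user:pass@host/terraform_backend"
  }
}
```

A workspace per SKU can then be created or selected with `terraform workspace new Standard_B2s` or `terraform workspace select Standard_B2s`, which corresponds to the workspace step in the pipeline described below.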
For the deployment, an Azure DevOps pipeline is used. The pipeline consists of two stages: a Terraform build stage and a Terraform destroy stage. These stages are executed in parallel for each defined SKU, meaning that all deployments run simultaneously. The following steps are defined in each stage:
- Terraform init
- Terraform select existing workspace or Terraform new workspace
- Terraform plan or plan destroy
- Terraform apply
The pipeline runs on a schedule every 30 minutes and is executed on private DevOps agents hosted in a Docker container on a dedicated machine.
Via the Azure DevOps API, the timings of the pipeline runs and stages are collected, and the total time in seconds per stage is calculated from the start and completion times. In the event of a failure, the data point is marked as failed.
Each deployment is run a minimum of 10 times over the course of multiple days, and the data is collected and stored.
Hypothesis and results
As the only difference is the SKU (the type of virtual machine), the deployment times are expected to be similar, although the day of the week might influence the total deployment times.
The first analysis is the difference between deploying the infrastructure (terraform apply) and removing the infrastructure (terraform destroy). This does not take the various SKUs and days into account.
As expected, creating the infrastructure takes more time than removing it. On average, it takes seven and a half minutes to deploy all the components listed in the setup section.
As multiple SKUs are deployed, the dataset contains the times of each SKU.
The deployment times vary between the SKUs by almost two minutes. A virtual machine is created on specific hardware depending on the SKU, so the deployment times are influenced by those underlying resources; a SKU with a faster CPU improves the deployment time. The reverse order on the destroy side was not expected: for some reason, the SKU with the longest apply time has the fastest destroy time, but it is unclear what is causing this. However, the difference is only 10 seconds, which is negligible at this scale.
Among the tested SKUs, Standard_E4s_v5 showed the most consistent and fastest average deployment times, making it a strong candidate for scenarios that require reliable provisioning, such as disaster recovery.
In contrast, Standard_B2s showed the opposite, with higher deployment times and the most failures. The B-series is a burstable series aimed at low-cost workloads, but this inconsistency makes it less suitable for production or time-sensitive workloads, despite its lower cost.
Standard_D2s_v5 and Standard_F4s_v2 provided a good middle ground, delivering consistent and predictable deployment times.
As this research ran over multiple days, the collected data makes it possible to look at the deployment times per day. Please note that, due to limited time, the test ran for a couple of days rather than an entire week.
There is a consistent drop-off in run time over the last couple of days. There is no direct indication of the cause, but it might be related to the day of the week. As the current dataset is too small, further investigation is required before drawing any conclusions. Please let us know in the comments below if you have experienced this or know the cause of this behaviour.
It can happen that something does not go as planned and results in a failure.
In the dataset used for this research, 121 failures were recorded during the deployment of the infrastructure. The leading cause was timeouts, with the following error:
```
OS Provisioning for VM 'go-vm-1' did not finish in the allotted time.
```
Although the infrastructure was (partially) created, the timeout resulted in incomplete state files, which in turn caused the planned stages to fail because the infrastructure was already in place. Once the infrastructure was manually removed from Azure, the pipeline recovered. As this test ran continuously, the failure occurred at random times, and on some occasions, it took a while to recover.
An interesting observation is that this error occurred once for Standard_F4s_v2, while all other occurrences were for Standard_B2s. This suggests that some SKUs are less reliable than others.
Another failure occurred while downloading the Terraform providers, which produced the following error:
```
Error while installing hashicorp/azurerm v4.37.0: read tcp 0.0.0.0:45376->0.0.0.0:443: read: connection reset by peer
```
As Terraform relies on internet connectivity to download its providers, a timeout or connection reset may occur during the download, breaking the pipeline.
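The provider in the error is declared in the configuration’s provider requirements and downloaded from the public registry by `terraform init` on every run; a sketch of such a block (the exact version constraints used in this research are an assumption) could look like this:

```hcl
terraform {
  required_providers {
    # Downloaded by `terraform init` from the public Terraform Registry,
    # which is why a network hiccup can break the pipeline at this step.
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 4.37"
    }
    random = {
      source  = "hashicorp/random"
      version = "~> 3.0"
    }
  }
}
```

Caching providers on the private agent, for example with Terraform’s plugin cache directory, is one way to reduce this dependency on the registry.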
Conclusion
While Infrastructure as Code (IaC) offers a powerful and consistent way to automate deployments, this research shows that deployment times are not always consistent. Even when using the same automation tooling and configurations, deployment duration can vary.
Overall, various factors can influence how long a deployment takes, and in most cases these factors are outside of the user’s control. This variability highlights that, although IaC improves repeatability, it does not guarantee identical execution times across runs. The data suggests that the day of the week affects the overall run time of the pipeline, which might be due to demand in the public cloud. It is important to note that the B-series stage was often broken due to an error, which could impact the overall results. Drawing a firm conclusion on this requires more investigation.
An essential factor to consider is the use of a private DevOps agent; this ensures you don’t have to wait for the hosted agent queue to clear and allows the pipeline to start immediately. It is therefore highly recommended to use a private DevOps agent when you require reliable deployment times.