What is it?

Terraform

Terraform is an infrastructure as code tool that lets you define both cloud and on-prem resources in human-readable configuration files that you can version, re-use, and share. You can then use a consistent workflow to provision and manage all your infrastructure throughout its life-cycle.

Why use it

Because your infrastructure is described in code, you can version it, review changes before applying them, re-use and share configurations, and manage the whole lifecycle with a single, consistent workflow.

How it works

You download and install Terraform on your machine. Through providers, Terraform communicates with the different services that host your infrastructure (e.g. the AWS provider, the GCP provider, etc.) and enacts your Terraform configuration files.

A provider is a plugin (code) that allows Terraform to manage resources on a given cloud service.

Key commands

The key commands are terraform init, terraform plan, terraform apply, and terraform destroy; they are covered in more detail below.

GCP

GCP is Google Cloud Platform, Google's cloud service. In order to let Terraform create and manage infrastructure there, we need to provide it with an authentication method.

On GCP, under IAM & Admin, head over to Service Accounts. Service accounts are like user accounts, but for automated services. You can grant them roles such as Storage Admin, along with a number of other restrictions. When you want a service to operate on a GCP project automatically (like Terraform does), you create a service account and then create a key for it.

For our exercise, let's create a service account and give it the Storage Admin and BigQuery Admin roles. After it's created, let's create a JSON key and store it in ./keys/my-credentials.json
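If you prefer the command line over the Console, roughly the same steps can be done with the gcloud CLI. This is only a sketch of that alternative: the service account name terraform-runner is made up, and <Your Project ID> is a placeholder.

# Create the service account (the name "terraform-runner" is just an example)
gcloud iam service-accounts create terraform-runner

# Grant it the Storage Admin and BigQuery Admin roles on your project
gcloud projects add-iam-policy-binding <Your Project ID> \
  --member="serviceAccount:terraform-runner@<Your Project ID>.iam.gserviceaccount.com" \
  --role="roles/storage.admin"
gcloud projects add-iam-policy-binding <Your Project ID> \
  --member="serviceAccount:terraform-runner@<Your Project ID>.iam.gserviceaccount.com" \
  --role="roles/bigquery.admin"

# Create a JSON key and save it where Terraform will look for it
gcloud iam service-accounts keys create ./keys/my-credentials.json \
  --iam-account=terraform-runner@<Your Project ID>.iam.gserviceaccount.com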

Terraform can authenticate via the environment variable export GOOGLE_APPLICATION_CREDENTIALS="<path/to/your/service-account-authkeys>.json", or you can hard-code the credentials path in the configuration.
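For example, with the key saved at ./keys/my-credentials.json as above, you would set:

export GOOGLE_APPLICATION_CREDENTIALS="./keys/my-credentials.json"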

main.tf

The main configuration file for a Terraform project is main.tf. Let's see an example:

terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "4.51.0"
    }
  }
}

provider "google" {
  # Credentials only need to be set if you do not have GOOGLE_APPLICATION_CREDENTIALS set
  #  credentials = 
  project = "<Your Project ID>"
  region  = "us-central1"
}

resource "google_storage_bucket" "data-lake-bucket" {
  name          = "<Your Unique Bucket Name>"
  location      = "US"

  # Optional, but recommended settings:
  storage_class = "STANDARD"
  uniform_bucket_level_access = true

  versioning {
    enabled     = true
  }

  lifecycle_rule {
    action {
      type = "Delete"
    }
    condition {
      age = 30  # days
    }
  }

  force_destroy = true
}

resource "google_bigquery_dataset" "dataset" {
  dataset_id = "<The Dataset Name You Want to Use>"
  project    = "<Your Project ID>"
  location   = "US"
}

This example will create a bucket with the specified settings, and will also create a BigQuery dataset. You specify resources using the resource keyword, providing a type (like google_bigquery_dataset) and a resource name that you can reference elsewhere in the file (as shown in the sketch below).
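As a minimal sketch of such a reference (the labels block and the label key source_bucket are just for illustration, not part of the original example), the dataset resource above could pick up the bucket's name through its resource address:

resource "google_bigquery_dataset" "dataset" {
  dataset_id = "<The Dataset Name You Want to Use>"
  project    = "<Your Project ID>"
  location   = "US"

  # Referencing <type>.<name>.<attribute> also makes Terraform create the bucket first
  labels = {
    source_bucket = google_storage_bucket.data-lake-bucket.name
  }
}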

To download the required provider, we run terraform init. This downloads the provider binary into the .terraform folder. To see the planned actions that Terraform will take to create our infrastructure, we use terraform plan. To carry out those actions, we run terraform apply. This creates a terraform.tfstate JSON file describing the current state of the infrastructure. Finally, if we want to take down what we created, we run terraform destroy.

You can visualize the state file in a nice format with the terraform show command.
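Putting it together, a typical session from the directory containing main.tf looks like this:

terraform init      # download the provider into .terraform/
terraform plan      # preview the actions Terraform would take
terraform apply     # create the infrastructure and write terraform.tfstate
terraform show      # inspect the current state
terraform destroy   # tear down everything managed by this configuration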

The main files of a Terraform configuration are main.tf, which declares the provider and the resources, and variables.tf, which declares input variables.

Let's go over the various elements of these files:

variables.tf

Let’s see how we can improve our main.tf script using variables:

main.tf

terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "5.6.0"
    }
  }
}

provider "google" {
  credentials = file(var.credentials)
  project     = var.project
  region      = var.region
}

resource "google_storage_bucket" "demo-bucket" {
  name          = var.gcs_bucket_name
  location      = var.location
  force_destroy = true

  lifecycle_rule {
    condition {
      age = 1
    }
    action {
      type = "AbortIncompleteMultipartUpload"
    }
  }
}

resource "google_bigquery_dataset" "demo_dataset" {
  dataset_id = var.bq_dataset_name
  location   = var.location
}

variables.tf

variable "credentials" {
  description = "My Credentials"
  default     = "<Path to your Service Account json file>"
  #ex: if you have a directory where this file is called keys with your service account json file
  #saved there as my-creds.json you could use default = "./keys/my-creds.json"
}

variable "project" {
  description = "Project"
  default     = "<Your Project ID>"
}

variable "region" {
  description = "Region"
  #Update the below to your desired region
  default     = "us-central1"
}

variable "location" {
  description = "Project Location"
  #Update the below to your desired location
  default     = "US"
}

variable "bq_dataset_name" {
  description = "My BigQuery Dataset Name"
  #Update the below to what you want your dataset to be called
  default     = "demo_dataset"
}

variable "gcs_bucket_name" {
  description = "My Storage Bucket Name"
  #Update the below to a unique bucket name
  default     = "terraform-demo-terra-bucket"
}

variable "gcs_storage_class" {
  description = "Bucket Storage Class"
  default     = "STANDARD"
}

So we can define variables using the variable keyword, and then access their values through the var namespace in main.tf.

If a default value is set, the variable is optional. Otherwise, the variable is required. If you run terraform plan now, Terraform will prompt you for the values for the variables without defaults.
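You can also supply a value on the command line with the -var flag (the value my_other_dataset below is just an example):

terraform plan -var="bq_dataset_name=my_other_dataset"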

You can populate variables using values from a file. Terraform automatically loads files called terraform.tfvars or matching *.auto.tfvars in the working directory when running operations.
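For example, a terraform.tfvars file in the same directory might override some of the defaults declared above (the values shown here are placeholders):

# terraform.tfvars -- loaded automatically on plan/apply
credentials     = "./keys/my-creds.json"
project         = "<Your Project ID>"
region          = "us-central1"
gcs_bucket_name = "<Your Unique Bucket Name>"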

Outputs

When building complex infrastructure, Terraform stores hundreds or thousands of attribute values for all your resources. As a user of Terraform, you may only be interested in a few values of importance. Outputs designate which data to display. This data is shown when apply is called, and can be queried at any time with the terraform output command.

As an example (taken from the Terraform Get Started tutorial, which provisions a compute instance named vm_instance; our configuration above does not define one), you can output the IP address of the instance that Terraform provisions. Create a file called outputs.tf with the following contents:

output "ip" {
  value = google_compute_instance.vm_instance.network_interface.0.network_ip
}

You must apply this configuration before you can use these output values. Apply your configuration now. Respond to the confirmation prompt with yes.

Now query the outputs with the terraform output command:

terraform output
# ip = "10.128.0.3"

You can use Terraform outputs to connect your Terraform projects with other parts of your infrastructure, or with other Terraform projects.
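Adapted to the configuration in this note, a sketch of outputs for the bucket and dataset (the output names bucket_name and dataset_id are arbitrary) could look like:

output "bucket_name" {
  value = google_storage_bucket.demo-bucket.name
}

output "dataset_id" {
  value = google_bigquery_dataset.demo_dataset.dataset_id
}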

For more, look at the documentation: Get Started

#data-engineering #study-plan #career-development #zoomcamp