
GCP Alerts and Monitoring using Terraform

Keeping an eye on GCP infrastructure resources is essential for your applications to work seamlessly. The DevOps team should be notified when applications or services go down or become inaccessible, for example when a compute instance crosses a defined threshold, Kubernetes pods crash, or the network goes down in a region. Alerts delivered to configured notification channels let DevOps teams act quickly to resolve issues and keep services up and running. This is where GCP monitoring and alerts come to the rescue, and Terraform allows us to manage these resources as code (IaC).

Vikrant Barde, Tech lead, Cloud & DevOps, Sela


 

GCP Monitoring and Alerts 

GCP monitoring provides a complete solution to collect and analyze the metrics of GCP resources and visualize them as dashboards. GCP also lets you monitor your applications' availability using uptime checks. We can set up alerts that fire when user-defined criteria for resource state or utilization are met, and notifications are sent to configured channels such as email, Slack, etc.

 

Why Terraform 

We can create infrastructure manually using the GCP console. If the infrastructure is small and limited to a single environment such as Dev or Prod, it is OK to create it manually.

 

Glossary

Dashboards 

Graphical visualizations of GCP resource metrics. 

Alerts  

Messages raised by GCP monitoring when certain criteria are met in resource metrics.

Notification Channel 

Communication channels to which GCP sends alert notifications, such as email, Slack, etc.

 

Uptime Check 

Checks application availability by verifying that the application responds to a specific API/health check call.

Infrastructure as Code 

Allows us to create and manage cloud infrastructure/resources using code.

 

Diagram 

 

Fig – Infrastructure creation using Terraform. 

 

Configuring monitoring and alerts for a few resources in a single environment may not take much time. But what if we need to configure them repeatedly across multiple environments such as dev, prod, staging, and qa? That consumes a large amount of time and energy. This is where Terraform comes into the picture: it automates infrastructure creation so we can create, modify, and destroy resources quickly.

 

Terraform In Short 

Terraform is an infrastructure as code (IaC) tool developed by HashiCorp. It allows us to provision and manage cloud infrastructure.
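As a rough illustration of what "infrastructure as code" looks like in practice, here is a minimal sketch of the provider setup that the snippets in this article would need. The file name, the region value, and the variable description are assumptions for illustration; only `project_id` appears in the original code.

```hcl
# versions.tf (hypothetical file name)
terraform {
  required_providers {
    google = {
      source = "hashicorp/google"
    }
  }
}

# project_id is supplied through a var file such as dev.tfvars.
variable "project_id" {
  type        = string
  description = "ID of the GCP project that holds the monitoring resources"
}

provider "google" {
  project = var.project_id
  region  = "us-central1" # assumed; matches the us-central1-c zone used later
}
```

With a setup like this, the usual workflow is `terraform init` once per working directory, then `terraform plan -var-file dev.tfvars` to preview changes and `terraform apply -var-file dev.tfvars` to create them.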

 

Terraform Code 

Below are Terraform code snippets for the main and variable files and for the modules for notification channels, uptime checks, and alerts on uptime checks. To have something to monitor and alert on, we have created a simple Python app using a startup script passed to the VM through Terraform, as seen below.

 

VM startup script

"sudo apt-get update; sudo apt-get install -yq build-essential python3-pip rsync; pip install flask; echo -e \"from flask import Flask \napp = Flask(__name__)\n@app.route('/')\ndef hello_cloud():\n\treturn 'Hello Cloud'\n\napp.run(host='0.0.0.0')\" > app.py; python3 app.py;" 

 
Python “Hello Cloud” App.
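The VM module itself is not shown in this article, although the dashboard later reads `module.flask_app_vm.instance_id`. The following is a minimal sketch of how such a module might attach the startup script to a VM and expose that output. The file path, resource name, machine type, and image are assumptions, not the author's actual code.

```hcl
# module/compute/vm/main.tf (hypothetical path)
variable "startup_script" {
  type        = string
  description = "Shell commands to run when the VM boots"
}

resource "google_compute_instance" "flask_app_vm" {
  name         = "flask-app-vm" # assumed name
  machine_type = "e2-micro"     # assumed size
  zone         = "us-central1-c"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-11" # assumed image
    }
  }

  network_interface {
    network = "default"
    access_config {} # assigns an ephemeral external IP
  }

  # The script shown above is passed in here and runs at boot.
  metadata_startup_script = var.startup_script
}

# Consumed by the dashboard as module.flask_app_vm.instance_id
output "instance_id" {
  value = google_compute_instance.flask_app_vm.instance_id
}
```

Flask listens on port 5000 by default, so a firewall rule allowing inbound TCP traffic on that port is also needed before the uptime checks can reach the app.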


Monitoring Dashboard 

module>monitoring>dashboard>dashboard.tf 
resource "google_monitoring_dashboard" "dashboard" {
  # Dashboard definition, passed in as a JSON string
  dashboard_json = var.dash_json
}
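The module reads `var.dash_json`, so it needs a matching variable declaration. A minimal sketch, assuming the declaration lives in a `variables.tf` file next to `dashboard.tf` (the file name and description are assumptions):

```hcl
# module/monitoring/dashboard/variables.tf (assumed file name)
variable "dash_json" {
  type        = string
  description = "Dashboard definition as a JSON string, for example the result of jsonencode()"
}
```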

 

 

environment>dev>main.tf 
module "flask_app_dashboard" {
  source = "../../module/monitoring/dashboard"

  dash_json = jsonencode({
    "displayName": "Flask App VM Dashboard",
    "dashboardFilters": [],
    "mosaicLayout": {
      "columns": 48,
      "tiles": [
        {
          "width": 24,
          "height": 16,
          "widget": {
            "title": "VM Instance - CPU utilization [MEAN]",
            "xyChart": {
              "chartOptions": {
                "mode": "COLOR"
              },
              "dataSets": [
                {
                  "breakdowns": [],
                  "dimensions": [],
                  "measures": [],
                  "minAlignmentPeriod": "60s",
                  "plotType": "LINE",
                  "targetAxis": "Y1",
                  "timeSeriesQuery": {
                    "timeSeriesFilter": {
                      "aggregation": {
                        "alignmentPeriod": "60s",
                        "perSeriesAligner": "ALIGN_MEAN"
                      },
                      "filter": "metric.type=\"compute.googleapis.com/instance/cpu/utilization\" resource.type=\"gce_instance\""
                    }
                  }
                }
              ],
              "thresholds": [],
              "yAxis": {
                "label": "",
                "scale": "LINEAR"
              }
            }
          }
        },
        {
          "xPos": 24,
          "width": 24,
          "height": 16,
          "widget": {
            "title": "Flask App logs panel",
            "logsPanel": {
              "filter": "resource.type=\"gce_instance\" resource.labels.instance_id=\"${module.flask_app_vm.instance_id}\" resource.labels.zone=\"us-central1-c\"\n",
              "resourceNames": ["projects/1055175960331"]
            }
          }
        }
      ]
    },
    "labels": {}
  })
}

 

terraform apply -target module.flask_app_dashboard.google_monitoring_dashboard.dashboard -var-file dev.tfvars 

Generated Dashboard in GCP Console 

 

Uptime Checks – TCP and HTTP 

 

module>monitoring>uptime-check>http>http-uptime-check.tf 

resource "google_monitoring_uptime_check_config" "http-uptime-check" {
  # One uptime check per configured host
  for_each = local.flat_hosts

  display_name     = "${each.value.hostname}-http-uptime-check"
  timeout          = "60s"
  selected_regions = ["ASIA_PACIFIC", "USA", "EUROPE"]

  http_check {
    path         = each.value.path
    port         = each.value.port
    use_ssl      = each.value.use_ssl
    validate_ssl = each.value.validate_ssl

    # Treat any 2xx response and the 301/302 redirects as healthy
    accepted_response_status_codes {
      status_class = "STATUS_CLASS_2XX"
    }
    accepted_response_status_codes {
      status_value = 301
    }
    accepted_response_status_codes {
      status_value = 302
    }
  }

  monitored_resource {
    type = "uptime_url"
    labels = {
      project_id = var.project_id
      host       = each.value.hostname
    }
  }
}
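The resource above depends on `local.flat_hosts`, `var.project_id`, and a module output named `http_uptime_check_ids`, none of which are shown in the article. Here is one possible way to define them, as a sketch only: the object type and the map key format are assumptions, and the author's real definitions may differ.

```hcl
# Supporting definitions for the HTTP uptime-check module (hypothetical)
variable "project_id" {
  type = string
}

variable "http_uptime_hosts" {
  type = list(object({
    hostname     = string
    path         = string
    port         = number
    use_ssl      = bool
    validate_ssl = bool
  }))
}

locals {
  # for_each needs a map or a set of strings, so key each host
  # by "hostname-port" (assumed key format).
  flat_hosts = {
    for h in var.http_uptime_hosts : "${h.hostname}-${h.port}" => h
  }
}

# Full resource names, e.g. projects/<project>/uptimeCheckConfigs/<check-id>.
# The alerts module later splits these on "/" to extract the check ID.
output "http_uptime_check_ids" {
  value = [for c in google_monitoring_uptime_check_config.http-uptime-check : c.name]
}
```

Keying the map by hostname and port keeps resource addresses stable when hosts are added or removed, which a plain list index would not.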

 

 

 

environment>dev>main.tf 

#TCP UPTIME CHECK
module "flask_app_tcp_uptime_check" {
  source           = "../../module/monitoring/uptime-check/tcp"
  tcp_uptime_hosts = var.flask_app_tcp_uptime_check_hosts
  project_id       = var.project_id
}

#HTTP UPTIME CHECK
module "flask_app_http_uptime_check" {
  source            = "../../module/monitoring/uptime-check/http"
  http_uptime_hosts = var.flask_app_http_uptime_check_hosts
  project_id        = var.project_id
}
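The TCP module referenced above is not listed in the article. A minimal sketch of what it might contain, mirroring the HTTP module, follows. The resource name `tcp-uptime-check` and the output name `tcp_uptime_check_ids` are taken from the commands and module references elsewhere in this article; the variable type and map key format are assumptions.

```hcl
# module/monitoring/uptime-check/tcp (hypothetical contents)
variable "project_id" {
  type = string
}

variable "tcp_uptime_hosts" {
  type = list(object({
    hostname = string
    port     = number
  }))
}

resource "google_monitoring_uptime_check_config" "tcp-uptime-check" {
  # One check per host, keyed by "hostname-port" (assumed key format)
  for_each = { for h in var.tcp_uptime_hosts : "${h.hostname}-${h.port}" => h }

  display_name     = "${each.value.hostname}-tcp-uptime-check"
  timeout          = "60s"
  selected_regions = ["ASIA_PACIFIC", "USA", "EUROPE"]

  # A TCP check only verifies that the port accepts connections
  tcp_check {
    port = each.value.port
  }

  monitored_resource {
    type = "uptime_url"
    labels = {
      project_id = var.project_id
      host       = each.value.hostname
    }
  }
}

output "tcp_uptime_check_ids" {
  value = [for c in google_monitoring_uptime_check_config.tcp-uptime-check : c.name]
}
```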



Environment variables 

Here we can set uptime checks for multiple targets: just add more targets/hosts to the list.

Generally, we don't need both TCP and HTTP uptime checks for the same application; here we have created both just as an example.

 

environment>dev>dev.tfvars 

#TCP UPTIME CHECK
flask_app_tcp_uptime_check_hosts = [{
  hostname = "35.209.69.34"
  port     = 5000
}]

#HTTP UPTIME CHECK
flask_app_http_uptime_check_hosts = [{
  hostname     = "35.209.69.34"
  path         = "/"
  port         = 5000
  use_ssl      = false
  validate_ssl = false
}]

 

terraform apply -target module.flask_app_http_uptime_check.google_monitoring_uptime_check_config.http-uptime-check -var-file dev.tfvars 

 

terraform apply -target module.flask_app_tcp_uptime_check.google_monitoring_uptime_check_config.tcp-uptime-check -var-file dev.tfvars 

 

Created uptime checks in GCP console 

 

Notification Channels 

module>monitoring>notification-channel>email.tf 

resource "google_monitoring_notification_channel" "email" {
  display_name = var.email_channel_display_name
  type         = "email"

  labels = {
    email_address = var.notification_email
  }
}
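The environment code later reads `module.notification_channel_email.notification_channel_id`, so the module has to declare its two inputs and export that output. A minimal sketch, assuming these definitions sit alongside `email.tf` (file layout and descriptions are assumptions):

```hcl
# Supporting definitions for the notification-channel module (hypothetical)
variable "notification_email" {
  type        = string
  description = "Address that receives alert emails"
}

variable "email_channel_display_name" {
  type        = string
  description = "Name shown for the channel in the GCP console"
}

# Full channel name, e.g. projects/<project>/notificationChannels/<channel-id>,
# which is the form an alert policy expects in notification_channels.
output "notification_channel_id" {
  value = google_monitoring_notification_channel.email.name
}
```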

 

 

environment>dev>main.tf 

#NOTIFICATION CHANNEL
module "notification_channel_email" {
  source                     = "../../module/monitoring/notification-channel"
  notification_email         = var.notification_email
  email_channel_display_name = var.notification_email_desc
}

 

 

environment>dev>dev.tfvars 

#NOTIFICATION CHANNEL
notification_email      = "alerts@yourdomain.com"
notification_email_desc = "Notification email to receive uptime check alerts"

 

 

terraform apply -target module.notification_channel_email.google_monitoring_notification_channel.email -var-file dev.tfvars 

 

Created Notification Channels in GCP console 

 

Alerts for Uptime checks 

 

module>monitoring>alerts>alert-policy-uptime-check.tf 

resource "google_monitoring_alert_policy" "alert-policy-uptime-check" {
  # One alert policy per uptime check
  count        = length(var.uptime_check_ids)
  project      = var.project_id
  enabled      = true
  display_name = "Uptime check alert policy for ${element(split("/", var.uptime_check_ids[count.index]), 3)}"

  documentation {
    content = "Uptime check failed for ${element(split("/", var.uptime_check_ids[count.index]), 3)}"
  }

  notification_channels = [var.notification_channel]
  combiner              = "OR"

  conditions {
    display_name = "Uptime check for ${element(split("/", var.uptime_check_ids[count.index]), 3)}"

    condition_threshold {
      # <<- strips the leading indentation from the filter text
      filter = <<-EOT
        metric.type="monitoring.googleapis.com/uptime_check/check_passed" AND metric.label.check_id="${element(split("/", var.uptime_check_ids[count.index]), 3)}" AND resource.type="uptime_url"
      EOT

      duration        = "0s"
      threshold_value = 1
      comparison      = "COMPARISON_GT"

      # Count the failed (false) check results per project and host
      aggregations {
        alignment_period     = "1200s"
        cross_series_reducer = "REDUCE_COUNT_FALSE"
        per_series_aligner   = "ALIGN_NEXT_OLDER"
        group_by_fields      = ["resource.label.project_id", "resource.label.host"]
      }

      trigger {
        count = 1
      }
    }
  }

  user_labels = {
    severity = "critical"
  }

  # Close an open incident automatically after 7 days (604800 seconds)
  alert_strategy {
    auto_close = "604800s"
  }
}
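The policy above reads three module variables and repeatedly extracts the check ID with `element(split("/", ...), 3)`. The sketch below shows declarations that would satisfy those references, plus a worked example of the split. The variable types and the sample resource name are assumptions for illustration.

```hcl
# Supporting definitions for the alerts module (hypothetical)
variable "project_id" {
  type = string
}

variable "uptime_check_ids" {
  type        = list(string)
  description = "Full uptime check names, projects/<project>/uptimeCheckConfigs/<check-id>"
}

variable "notification_channel" {
  type        = string
  description = "Full notification channel name to send alerts to"
}

locals {
  # Worked example of the ID extraction used in the policy.
  # split("/", ...) yields ["projects", "my-project", "uptimeCheckConfigs", "my-check"],
  # so element(..., 3) evaluates to "my-check".
  example_check_id = element(split("/", "projects/my-project/uptimeCheckConfigs/my-check"), 3)
}
```

Because `REDUCE_COUNT_FALSE` counts failing series and the comparison is `COMPARISON_GT` with a threshold of 1, the condition is met when more than one series in a group reports a failed check.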

 
 


environment>dev>main.tf 

#ALERTS - TCP
module "flask_app_tcp_email_alerts" {
  source               = "../../module/monitoring/alerts"
  depends_on           = [module.flask_app_tcp_uptime_check, module.notification_channel_email]
  project_id           = var.project_id # read by the alerts module as var.project_id
  uptime_check_ids     = module.flask_app_tcp_uptime_check.tcp_uptime_check_ids
  notification_channel = module.notification_channel_email.notification_channel_id
}

#ALERTS - HTTP
module "flask_app_http_email_alerts" {
  source               = "../../module/monitoring/alerts"
  depends_on           = [module.flask_app_http_uptime_check, module.notification_channel_email]
  project_id           = var.project_id # read by the alerts module as var.project_id
  uptime_check_ids     = module.flask_app_http_uptime_check.http_uptime_check_ids
  notification_channel = module.notification_channel_email.notification_channel_id
}

 

 

 

environment>dev>output.tf

output "tcp_uptime_check_ids" {
  value = module.flask_app_tcp_uptime_check.tcp_uptime_check_ids
}

output "http_uptime_check_ids" {
  value = module.flask_app_http_uptime_check.http_uptime_check_ids
}

output "notification_channel_id" {
  value = module.notification_channel_email.notification_channel_id
}



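terraform apply -target module.flask_app_tcp_email_alerts.google_monitoring_alert_policy.alert-policy-uptime-check -target module.flask_app_http_email_alerts.google_monitoring_alert_policy.alert-policy-uptime-check -var-file dev.tfvars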

 

 

 

Created Policies in GCP Console. 

 

Conclusion 

Smooth running of apps and services in cloud environments is critical to corporate success. GCP monitoring provides a robust solution for tracking resource metrics and application availability, and it helps the support team act quickly in case of any abnormality. Terraform (IaC) allows DevOps teams to efficiently manage and automate infrastructure, alerts, and monitoring across multiple environments with minimum time and effort.