EKS Monitoring with Amazon Managed Service for Prometheus

Oguzhan
Feb 7, 2023

I want to talk about Amazon Managed Service for Prometheus. We migrated some environments from AWS OpsWorks to EKS. We used the Sensu monitoring tool for our non-container (primitive) environments, but we now choose EKS for new environments and are migrating the existing primitive environments as well. It's a little tricky, but possible.

In any case, for EKS we generally chose cloud-native tools, and Prometheus is a CNCF graduated project.

Our Infrastructure

Here is our monitoring infrastructure:

AMP Infrastructure

I am skipping the Amazon Managed Service for Prometheus setup itself; instead, I will cover EKS integration and rule management.

Prometheus Config
When your AMP (Amazon Managed Service for Prometheus) workspace is ready and your EKS cluster is prepared, you need a Prometheus server and some custom deployments. On the EKS side, the Prometheus configuration looks like the one below.

We followed the Prometheus community Helm chart; the relevant values are:

server:
  remoteWrite:
    - url: ${prometheus_endpoint}
      sigv4:
        region: ${aws_region}
%{ if amp_account != "master" }
        role_arn: arn:aws:iam::${master_aws_account_id}:role/master-prod-amp-role
%{ endif }
      queue_config:
        max_samples_per_send: 1000
        max_shards: 200
        capacity: 2500

We have an AMP workspace, so we don't need to store metrics long-term in any cluster. Configuring the Prometheus Helm chart values as above is enough: when the in-cluster Prometheus server collects metrics, it forwards them to AMP via the remoteWrite configuration.
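
For the sigv4 block above to work without static credentials, the Prometheus server pod needs an AWS identity, typically via IAM Roles for Service Accounts (IRSA). Here is a minimal sketch of the corresponding Helm values, assuming a hypothetical amp-ingest-role that trusts the cluster's OIDC provider and has remote-write permission on the workspace (e.g. the AmazonPrometheusRemoteWriteAccess managed policy):

serviceAccounts:
  server:
    create: true
    annotations:
      # IRSA: lets the Prometheus server pod sign remote-write requests
      # with sigv4. Role name and account variable are examples only.
      eks.amazonaws.com/role-arn: arn:aws:iam::${aws_account_id}:role/amp-ingest-role

If you remote-write across accounts via role_arn as in the values above, this IRSA role additionally needs sts:AssumeRole on the central AMP role.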

Rules
AMP makes Prometheus rule management possible. Prometheus normally ships with default rules, but when you use AMP, rules are managed in the AMP workspace and it is empty by default. You can manage rules via the dashboard or any IaC tool.

The rule format looks like this:

groups:
  - name: kube-apiserver-slos
    rules:
      - alert: KubeAPIErrorBudgetBurn
        annotations:
          description: The API server is burning too much error budget.
          runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeapierrorbudgetburn
          summary: The API server is burning too much error budget.
        expr: |-
          sum(apiserver_request:burnrate1h) > (14.40 * 0.01000)
          and
          sum(apiserver_request:burnrate5m) > (14.40 * 0.01000)
        for: 2m
        labels:
          long: 1h
          severity: critical
          short: 5m
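
One thing to watch out for: the expr above uses apiserver_request:burnrate1h and apiserver_request:burnrate5m, which are recording rules rather than raw metrics. Because the AMP workspace starts out empty, those series only exist if the recording rules are either uploaded to AMP alongside the alerts or evaluated by the in-cluster Prometheus and remote-written. Recording rules use the same rule-group format; here is a minimal illustrative sketch (not the actual kubernetes-mixin definitions):

groups:
  - name: example-recording-rules
    rules:
      # Precompute per-instance CPU utilisation over 5m so that alerts
      # and dashboards can query a cheap, pre-aggregated series.
      - record: instance:node_cpu_utilisation:rate5m
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))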

Alert Management

AMP also makes alert management possible, for example with PagerDuty. Here is an example:

I chose SNS as the PagerDuty trigger. You need the SNS topic ARN if you use this block.

template_files:
  default_template: |
    {{ define "sns.default.subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]{{ end }}
    {{ define "__alertmanager" }}AlertManager{{ end }}
    {{ define "__alertmanagerURL" }}{{ .ExternalURL }}/#/alerts?receiver={{ .Receiver | urlquery }}{{ end }}
alertmanager_config: |
  global:
    resolve_timeout: 1m
  route:
    group_by: ["alertname"]
    group_wait: 1m
    group_interval: 5m
    repeat_interval: 3m
    receiver: "pagerduty"
    routes:
      - receiver: "pagerduty"
        group_wait: 1m
        match_re:
          severity: critical
        continue: true

  receivers:
    - name: pagerduty
      sns_configs:
        - send_resolved: true
          topic_arn: "[sns_topic_arn]"
          sigv4:
            region: "[region]"
          message: |
            routing_key: "[routing_key]"
            client_url: {{ .ExternalURL }}
            {{ range .Alerts -}}
            dedup_key: {{ .Labels.alertname }}
            severity: {{ .Labels.severity }}
            description: {{ .Annotations.summary }}
            details:
              details: {{ .Annotations.description }}
              {{ range .Labels.SortedPairs }}
              {{ .Name }}: {{ .Value }}
              {{ end }}
            {{ end }}

Here is an example of the triggered alert:

Pagerduty Prometheus Alert

Oguzhan

Solutions Architect, #AWS, food, chess, books and cheese lover. Philosophy 101.