For the past year or so, I’ve been working on DevOps at Tulip. It’s a fairly big change in direction but quite frankly, it’s been a refreshing experience!
One of the first projects was to build a monitoring system for a number of different components in our Kubernetes cluster: various microservices, the main monolith application, our ingress controller, and the health of the cluster itself. I thought Prometheus fit fairly well with what we wanted, so we went ahead with that, and I would say it has served us pretty well!
Prometheus
Some context about Prometheus: it is a pull-based monitoring system that is sometimes compared to Nagios. It’s not event-based, so applications do not report each individual event to Prometheus as it happens (unlike, say, SegmentIO). Also, since it’s pull-based, we have to define all of its scrape targets in advance. Contrary to certain arguments, I actually think this is a plus. It is more tedious and harder to set up, but it is also harder to get into a situation where the system becomes a black box and you have no idea what you’re dumping into it. It’s also easier to detect whether a target is actually down than with a push-based system.
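To make that concrete, here is a minimal sketch of what a hand-rolled scrape config with a statically defined target looks like (the job name and target address are made up; with the Operator setup described below, this config gets generated for you):
# prometheus.yml (illustrative sketch only)
scrape_configs:
  - job_name: 'my-service'                         # hypothetical service
    scrape_interval: 15s
    static_configs:
      - targets: ['my-service.default.svc:9090']   # targets are declared up-front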
Prometheus Operator
Prometheus Operator is an open-source tool that makes deploying a Prometheus stack (AlertManager, Grafana, Prometheus) so much easier than hand-crafting the entire stack. It generates a whole lot of the boilerplate and pretty much reduces the entire deployment down to native Kubernetes declarations and YAML.
If you’re familiar with Kubernetes, then you’ve probably heard of custom resource definitions, or CRDs for short. Think of them as definitions of objects, like pods, deployments or daemonsets, that the cluster can understand and act on if needed. For the purpose of deploying a monitoring stack, Prometheus Operator introduces 3 new CRDs - Prometheus, AlertManager and ServiceMonitor - and a controller that is in charge of deploying and configuring the respective services in the Kubernetes cluster.
For example: if a Prometheus CRD object like the one below is present in the cluster, the prometheus-operator controller would create a matching deployment of Prometheus in the Kubernetes cluster, which in this case would also link up with the AlertManager of that name in the monitoring namespace.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  ...
  name: clustermon-prometheus-oper-prometheus
  namespace: monitoring
spec:
  alerting:
    alertmanagers:
    - name: clustermon-prometheus-oper-alertmanager
      namespace: monitoring
      pathPrefix: /
      port: web
  baseImage: quay.io/prometheus/prometheus
  ...
kube-prometheus
kube-prometheus used to be a set of contrib helm charts that utilized the capabilities of the Prometheus Operator to deploy an entire monitoring stack (with some assumptions and defaults, of course). It has since been absorbed into the main helm charts and moved to the official stable chart repository.
There are various exporters included, such as kube-dns, kube-state-metrics, node-exporter and many others that are necessary to monitor the health of a Kubernetes cluster (and more). You can find the full list here. It also ships a simple set of kubernetes-mixin dashboards for Grafana (if you choose to install that).
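As far as I recall, most of these bundled components can also be toggled individually from the chart’s values, roughly along these lines (key names are from the 5.x chart as I remember them, so double-check against the chart’s own values.yaml):
# values.yaml (sketch)
nodeExporter:
  enabled: true
kubeStateMetrics:
  enabled: true
grafana:
  enabled: true
kubeDns:
  enabled: false   # enable kubeDns or coreDns depending on what the cluster runs
coreDns:
  enabled: true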
Overview
This section gives a general idea of the components involved.
An important implementation decision that I’d like to point out is that Grafana is (mostly) stateless. Any new dashboards or changes need to be committed to code; in general I think this conforms better to the infrastructure-as-code ideology, which makes it much easier to replicate the same infrastructure across multiple clouds / regions.
Custom Helm Chart
You can find a stripped-down version of the things I’ll talk about in this repository: k8s-prometheus-operator-helm-example
Note: The contents are all based on prometheus-operator helm chart 5.10.5.
# requirements.yaml
dependencies:
  - name: prometheus-operator
    version: 5.10.5
    repository: https://kubernetes-charts.storage.googleapis.com/
This part of the chart is responsible for loading *.json dashboard configurations exported from Grafana and creating them as individual configmaps in Kubernetes. The config-reloader in the Grafana pod then picks them up and reconfigures Grafana.
# templates/dashboards-configmap.yaml
{{- $files := .Files.Glob "dashboards/*.json" }}
apiVersion: v1
kind: ConfigMapList
items:
{{- range $path, $fileContents := $files }}
{{- $dashboardName := regexReplaceAll "(^.*/)(.*)\\.json$" $path "${2}" }}
- apiVersion: v1
  kind: ConfigMap
  metadata:
    name: {{ printf "%s-%s" (include "prometheus-operator.fullname" $) $dashboardName | trunc 63 | trimSuffix "-" }}
    labels:
      grafana_dashboard: "1"
      app: {{ template "prometheus-operator.name" $ }}-grafana
{{ include "prometheus-operator.labels" $ | indent 6 }}
  data:
    {{ $dashboardName }}.json: |-
{{ $.Files.Get $path | indent 6 }}
{{- end }}
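The grafana_dashboard: "1" label is what makes this work: the bundled Grafana chart runs a sidecar that watches for ConfigMaps carrying that label and loads them as dashboards. The relevant values look roughly like this (close to the chart defaults, shown here just for context):
# values.yaml (sketch; roughly the chart defaults)
grafana:
  sidecar:
    dashboards:
      enabled: true
      label: grafana_dashboard   # must match the label set on the ConfigMaps above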
Prometheus
Prometheus rocks a TSDB for data storage, so the instance that the pod runs on needs to have a huge volume attached to it. In my setup, I’ve chosen to run Prometheus on a node by itself, with no other pods scheduled on it. I do this by setting up taints on a particular node and having Prometheus selectively schedule to that node and tolerate those taints. Normal pods without that toleration would then simply refuse to schedule on it. (This is slightly different in the example app above.)
# values.yaml
prometheus:
  prometheusSpec:
    # So that Prometheus can schedule onto the node with this taint
    # Other pods will not have this toleration and won't schedule on it
    tolerations:
      - key: "dedicated"
        operator: "Exists"
        effect: "NoSchedule"
      - key: "dedicated"
        operator: "Exists"
        effect: "NoExecute"

    # The Prometheus node
    nodeSelector:
      node: prometheus

    # This PV is named in my case, but you can also just do a dynamic claim template like here:
    # https://github.com/aranair/k8s-prometheus-operator-helm-example/blob/master/clustermon/values.yaml#L181-L189
    storageSpec:
      volumeClaimTemplate:
        spec:
          volumeName: prometheus-pv
          selector:
            matchLabels:
              app: prometheus-pv
          resources:
            requests:
              storage: 1500Gi
        selector: {}
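For completeness, the node itself needs the matching label and taints. I won’t reproduce our exact node setup here, but as a sketch it amounts to something like this (the node name is a placeholder; you could equally apply the label and taints with kubectl):
# Node object (sketch) - the label satisfies the nodeSelector above, and the taints
# are what keep every pod without the matching tolerations off this node
apiVersion: v1
kind: Node
metadata:
  name: prometheus-node-1   # placeholder node name
  labels:
    node: prometheus
spec:
  taints:
    - key: dedicated
      value: prometheus
      effect: NoSchedule
    - key: dedicated
      value: prometheus
      effect: NoExecute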
If you take a look at the prometheus-operator’s default values.yaml file, you will find just about any configuration you can think of.
Monitoring Custom Services
The ServiceMonitor CRD from prometheus-operator is used to describe the set of targets to be monitored by Prometheus; the controller automatically generates the Prometheus config needed. For example, a ServiceMonitor for monitoring Traefik, our ingress controller, would look something like this:
additionalServiceMonitors:
  - name: traefik-monitor
    namespace: monitoring
    selector:
      matchLabels:
        app: traefik # this should be the selector for the Service
    namespaceSelector:
      matchNames:
        - kube-system # Which namespace to look for the Service in
    endpoints:
      - basicAuth: # Take creds from secret named traefik-monitor-metrics-auth
          password:
            name: traefik-monitor-metrics-auth
            key: password
          username:
            name: traefik-monitor-metrics-auth
            key: user
        port: metrics
        interval: 10s
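The basicAuth block above points at a plain Kubernetes Secret living in the same namespace as the ServiceMonitor; a minimal sketch of that Secret (the credentials are obviously placeholders) would be:
# Secret referenced by the ServiceMonitor's basicAuth section (sketch)
apiVersion: v1
kind: Secret
metadata:
  name: traefik-monitor-metrics-auth
  namespace: monitoring
type: Opaque
stringData:
  user: metrics-user   # placeholder; key name must match the ServiceMonitor's "user"
  password: changeme   # placeholder; key name must match "password"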
These would show up as targets in the Prometheus deployment, e.g.
You can then use PromQL to query things, like the average number of open connections per second over 5-minute lookback windows (then extrapolate to a per-5-minute count by multiplying by 300).
Charting isn’t the best in Prometheus but, to be fair, that’s not really its primary function. It can get you what you need eventually; it just takes way more effort than it should.
Grafana
Grafana fills that gap; with this setup, a Grafana instance is automatically set up with Prometheus targeted as a data source. So generally what I’ll do is experiment in Prometheus with PromQL, port the query over to a Grafana dashboard with proper variables and timeframes, then export it as JSON and check that into our git repository. Over time, we have developed quite a number of dashboards that monitor many of the services in our cluster (on top of the many good default mixins provided out of the box).
One example is shown below: it displays the total CPU/RAM usage, and we can click through to drill down to each individual pod.
This next one is a dashboard that I built to monitor the health of Traefik, looking at the number of times it’s had to hot-reload configurations, latencies, and other useful metrics. We also track the Apdex score for both entrypoints and backends, for example.
Prometheus Rules
Prometheus rules can be defined in PromQL; these are primarily alerts that you might want the system to flag. There are many built-in rules that come along with the default installation.
Like when the kube API servers’ error rate is high:
alert: KubeAPIErrorsHigh
expr: sum(rate(apiserver_request_count{code=~"^(?:5..)$",job="apiserver"}[5m]))
  / sum(rate(apiserver_request_count{job="apiserver"}[5m])) * 100 > 3
for: 10m
labels:
  severity: critical
annotations:
  message: API server is returning errors for {{ $value }}% of requests.
  runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorshigh
Or when there are pods stuck in CrashLoopBackOff:
alert: KubePodCrashLooping
expr: rate(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[15m])
  * 60 * 5 > 0
for: 1h
labels:
  severity: critical
annotations:
  message: Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }})
    is restarting {{ printf "%.2f" $value }} times / 5 minutes.
  runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepodcrashlooping
But you can also define your own; we have quite a number of custom rules. As an example, there is an alert that fires when there are more than 10 failed etcd proposals in the past 10 minutes, which might indicate stability issues with the etcd cluster.
additionalPrometheusRules:
  - name: custom-alerts
    groups:
      - name: generic.rules
        rules:
          - alert: EtcdFailedProposals
            expr: increase(etcd_server_proposal_failed_total[10m]) > 10
            labels:
              severity: warning
              group: tulip
            annotations:
              summary: "etcd failed proposals"
              description: "{{ $labels.instance }} failed etcd proposals over the past 10 minutes has increased. May signal etcd cluster instability"
Or when a specific pod has restarted X number of times:
...
- name: generic.rules
  rules:
    - alert: TraefikPodCrashLooping
      expr: round(increase(kube_pod_container_status_restarts_total{pod=~"traefik-.*"}[5m])) > 5
      labels:
        severity: critical
        group: tulip
      annotations:
        summary: "Traefik pod is restarting frequently"
        description: "Traefik pod {{$labels.pod}} has restarted {{$value}} times in the last 5 mins"
When these alerts fire, you can see them in Prometheus directly; they are also sent off to the AlertManager if one is linked up with Prometheus.
AlertManager
AlertManager can be configured to send to Slack, VictorOps, PagerDuty, or various other sorts of alerting systems.
alertmanager:
  config:
    global:
      smtp_auth_username: ''
      smtp_auth_password: ''
      victorops_api_key: ''
      victorops_api_url: ''
In our setup, I configured it to post to Slack whenever there is a warning level alert, and to VictorOps when there is a critical level alert.
route:
  group_by: ['job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'null'
  # This can be used to route specific types of alerts to specific teams.
  routes:
    - match:
        alertname: DeadMansSwitch
      receiver: 'null'
    - match:
        alertname: TargetDown
      receiver: 'null'
    - match:
        severity: warning
        group: custom
      group_by: ['namespace']
      receiver: 'slack'
    - match:
        severity: critical
        group: custom
      group_by: ['namespace']
      receiver: 'victorops'
receivers:
  - name: 'null'
  - name: 'sysadmins-email'
    email_configs:
      - to: 'sysadmin@example.com'
  - name: 'slack'
    slack_configs:
      - username: 'Prometheus'
        send_resolved: true
        api_url: ''
        title: '[{{ .Status | toUpper }}] Warning Alert'
        text: >-
          {{ template "slack.techops.text" . }}
  - name: 'victorops'
    victorops_configs:
      - routing_key: 'routing_key'
        message_type: '{{ .CommonLabels.severity }}'
        entity_display_name: '{{ .CommonAnnotations.summary }}'
        state_message: >-
          {{ template "slack.techops.text" . }}
        api_url: ''
        api_key: ''
Generally speaking, warning alerts indicate some level of degraded service that might self-recover, such as when a node goes down and pods auto-reschedule; or they could be non-time-critical situations that do not need immediate intervention. Critical alerts are reserved for mission-critical services or infrastructure, where failing to recover quickly can cause a cascade of issues. These page someone and are resolved as quickly as we can manage.
Example of an alert that has gone off in AlertManager:
Slack Alert:
From here, you can add inhibition rules that suppress other alerts while a given alert is firing, or silences that mute alerts based on their labels.
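As an illustration (not taken from our actual config), an inhibition rule that suppresses warning alerts for a namespace while a critical alert is already firing in that same namespace could look like this:
# alertmanager config fragment (sketch)
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['namespace']   # only inhibit when both alerts share the same namespace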
Wrap-up
Together, I think the three components form a rather well-rounded monitoring stack for Kubernetes infrastructure and service metrics. Down the road, I think the next big extension would be to spin up federated Prometheus instances to monitor different AWS regions and/or clusters.
PS: Here’s the repo that has a simplified version of everything I’ve talked about above: k8s-prometheus-operator-helm-example. And feel free to let me know in the comments section below if you have any questions or run into any issues playing with that example.