Introduction
While Prometheus' default architecture is pull-based scraping, there are good reasons to want to push metrics:
- from sources that are not reachable from Prometheus
- from sources that are short-lived, e.g. batch jobs
For such use cases Prometheus comes with a pushgateway. When using this architecture, be aware that the pushgateway is a single point of failure.
In this post we will look at pushing metrics to Prometheus from a backup job running on another node. For that we will use this community-maintained Prometheus client library for bash. At the time of writing, the library did not support https as a protocol, so I fixed this and submitted a pull request.
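To get a feeling for the mechanism before we script it: a metric can be pushed to a pushgateway with plain curl. This is only a minimal sketch; the address localhost:9091 and the names demo_job/demo_instance are assumptions, not part of the setup described below:
# push a single gauge in the Prometheus text exposition format;
# the trailing newline produced by echo is required by the pushgateway
echo "some_metric 42" | curl --data-binary @- http://localhost:9091/metrics/job/demo_job/instance/demo_instance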
Architecture
----------------------
|    (4) Grafana     |
----------------------
          |
          v
----------------------
|   (3) Prometheus   |
----------------------
          ^
          |
----------------------
|   (2) pushgateway  |
----------------------
          ^
          |
----------------------
|   (1) nfs server   |   Running the backup cronjob
----------------------
- The nfs server is a VM running a backup cronjob with Restic. Restic reports metrics such as the size of the backup, the total number of files and directories processed, the number of files and directories added, and the number of directories and files unchanged. These values will be our metrics.
- The pushgateway runs as a pod on Kubernetes, and is exposed using an ingress
- Prometheus runs as a pod on Kubernetes as well
- Grafana uses Prometheus as a datasource and can be used to visualize the metrics and to define alerts on them
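As a side note: for Prometheus to keep the job and instance labels attached by the push (instead of overwriting them with the pushgateway's own target labels), the scrape job needs honor_labels: true. A minimal sketch of such a scrape config, with an assumed job name and the ingress host used later in this post:
scrape_configs:
  - job_name: pushgateway      # assumed job name
    honor_labels: true         # keep the pushed job/instance labels
    scheme: https
    static_configs:
      - targets: ['my-gtw']    # the ingress exposing the pushgateway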
The backup job
I use Restic for backing up the NFS server:
- every four hours with the tag sub-daily; these backups are kept for two days
- daily with the tag daily; these backups are kept for seven days
- weekly with the tag weekly; these backups are kept for one month
- monthly with the tag monthly; these backups are kept for twelve months
The use of the tags makes it easier to prune old backups, e.g. monthly backups after twelve months:
restic -r <repo> forget --tag monthly --keep-last 12
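For illustration, the four schedules could be wired up in cron like this; the script path and the exact run times are assumptions:
# m h dom mon dow  command
0 */4 * * *  /usr/local/bin/backup.sh sub-daily
0 2   * * *  /usr/local/bin/backup.sh daily
0 3   * * 0  /usr/local/bin/backup.sh weekly
0 4   1 * *  /usr/local/bin/backup.sh monthly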
Backing up with Restic
First we need to source prometheus.bash:
#!/bin/bash
source prometheus.bash
Then we record the time before and after running the backup to determine its runtime:
start=$(date +%s)
# we limit the CPU usage to one core
GOMAXPROCS=1 restic --exclude-if-present '.nobackup' -r <repo> backup --json --tag "$1" <directory-we-backup> > output.txt
end=$(date +%s)
runtime=$((end-start))
You may note a few specifics in the way we use Restic:
- We limit its CPU usage to one core by setting GOMAXPROCS=1
- We automatically exclude directories containing a .nobackup file
- The tag ($1) is a parameter passed to the backup script
- We write the output of the Restic command to a file output.txt in --json format
Creating metrics
Now we can parse output.txt to extract all the metrics we are interested in. As the output is JSON, this is pretty simple with jq:
files=$(jq -r 'select(.message_type=="summary") | .total_files_processed' output.txt)
bytes=$(jq -r 'select(.message_type=="summary") | .total_bytes_processed' output.txt)
files_new=$(jq -r 'select(.message_type=="summary") | .files_new' output.txt)
files_changed=$(jq -r 'select(.message_type=="summary") | .files_changed' output.txt)
files_unchanged=$(jq -r 'select(.message_type=="summary") | .files_unmodified' output.txt)
dirs_new=$(jq -r 'select(.message_type=="summary") | .dirs_new' output.txt)
dirs_changed=$(jq -r 'select(.message_type=="summary") | .dirs_changed' output.txt)
dirs_unchanged=$(jq -r 'select(.message_type=="summary") | .dirs_unmodified' output.txt)
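As a design note, the eight jq invocations could be collapsed into a single pass over the file. A functionally equivalent sketch that reads the summary line once and splits its fields into the same variables:
read -r files bytes files_new files_changed files_unchanged dirs_new dirs_changed dirs_unchanged < <(
  jq -r 'select(.message_type=="summary") |
    "\(.total_files_processed) \(.total_bytes_processed) \(.files_new) \(.files_changed) \(.files_unmodified) \(.dirs_new) \(.dirs_changed) \(.dirs_unmodified)"' output.txt
)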
Creating a metric using prometheus.bash is pretty simple, thanks to the helper function io::prometheus::NewGauge:
- Create a metric
- Assign a value
In the script we create the following metrics:
- the time the backup took to run: restic_backup_time
- the number of files processed: restic_backup_processed_files
- the number of bytes processed: restic_backup_processed_bytes
- the number of new files: restic_backup_files_new
- the number of changed files: restic_backup_files_changed
- the number of unchanged files: restic_backup_files_unchanged
- the number of new directories: restic_backup_dirs_new
- the number of changed directories: restic_backup_dirs_changed
- the number of unchanged directories: restic_backup_dirs_unchanged
io::prometheus::NewGauge name=restic_backup_time help='How long the backup took'
restic_backup_time set "$runtime"
io::prometheus::NewGauge name=restic_backup_processed_files help='How many files were processed'
restic_backup_processed_files set "$files"
io::prometheus::NewGauge name=restic_backup_processed_bytes help='How many bytes were processed'
restic_backup_processed_bytes set "$bytes"
io::prometheus::NewGauge name=restic_backup_files_new help='How many new files were backed-up'
restic_backup_files_new set "$files_new"
io::prometheus::NewGauge name=restic_backup_files_changed help='How many files were changed'
restic_backup_files_changed set "$files_changed"
io::prometheus::NewGauge name=restic_backup_files_unchanged help='How many files did not change'
restic_backup_files_unchanged set "$files_unchanged"
io::prometheus::NewGauge name=restic_backup_dirs_new help='How many new dirs were backed-up'
restic_backup_dirs_new set "$dirs_new"
io::prometheus::NewGauge name=restic_backup_dirs_changed help='How many dirs were changed'
restic_backup_dirs_changed set "$dirs_changed"
io::prometheus::NewGauge name=restic_backup_dirs_unchanged help='How many dirs did not change'
restic_backup_dirs_unchanged set "$dirs_unchanged"
Pushing to the pushgateway
Again, using the function io::prometheus::PushAdd from the helper, this is pretty simple:
io::prometheus::PushAdd job=nfs_backup instance=$1 gateway="https://my-gtw:443"
Here again the type of the job (sub-daily, daily, weekly, monthly) is passed to the script and used as instance for the metric.
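To check that the push arrived, you can scrape the pushgateway yourself; it exposes everything it currently holds on its /metrics endpoint:
# list the gateway's current state and filter for our backup metrics
curl -s https://my-gtw/metrics | grep restic_backup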
Querying Prometheus
Now you can create PromQL queries on those metrics:
- Alert if no backup metrics were pushed in the last 24 hours:
absent_over_time(restic_backup_processed_files{job="nfs_backup"}[1d])
- Show the number of bytes processed by the sub-daily jobs:
restic_backup_processed_bytes{instance="sub-daily",job="nfs_backup"}
- The average runtime of the daily jobs:
avg_over_time(restic_backup_time{instance="daily",job="nfs_backup"}[1d])
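If you want Prometheus itself to alert rather than Grafana, the first query translates into a rule file like this sketch; the group and alert names are assumptions:
groups:
  - name: backup                 # assumed group name
    rules:
      - alert: NfsBackupMissing  # assumed alert name
        expr: absent_over_time(restic_backup_processed_files{job="nfs_backup"}[1d])
        for: 1h
        annotations:
          summary: No nfs_backup metrics were pushed in the last 24 hours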