Introduction
While Prometheus' default architecture is pull-based scraping, there are good reasons to want to push metrics:
- from sources that are not reachable from Prometheus
- from sources that are short-lived, e.g. batch jobs
For such use cases Prometheus comes with a pushgateway. When using this architecture, be aware that the pushgateway is a single point of failure.
In this post we will look at pushing metrics to Prometheus from a backup job running on another node. For that we will use this community-maintained Prometheus client library for bash. At the time of writing, the library did not support https as a protocol, so I fixed this and submitted a pull request.
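To get a feeling for the mechanism before we script it: a metric can be pushed to a pushgateway with plain curl. This is only a minimal sketch; the address localhost:9091 and the names demo_job/demo_instance are assumptions, not part of the setup described below:
# push a single gauge in the Prometheus text exposition format;
# the trailing newline produced by echo is required by the pushgateway
echo "some_metric 42" | curl --data-binary @- http://localhost:9091/metrics/job/demo_job/instance/demo_instance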
Architecture
----------------------
|    (4) Grafana     |
----------------------
          |
          v
----------------------
|   (3) Prometheus   |
----------------------
          ^
          |
----------------------
|   (2) pushgateway  |
----------------------
          ^
          |
----------------------
|   (1) nfs server   |   Running the backup cronjob
----------------------
- The nfs server is a VM running a backup cronjob with Restic. Restic reports metrics such as the size of the backup, the total number of files and directories processed, the number of files and directories added, and the number of directories and files unchanged. These values will be our metrics.
- The pushgateway runs as a pod on Kubernetes, and is exposed using an ingress
- Prometheus runs as a pod on Kubernetes as well
- Grafana uses Prometheus as a datasource and can be used to visualize the metrics and to define alerts on them
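As a side note: for Prometheus to keep the job and instance labels attached by the push (instead of overwriting them with the pushgateway's own target labels), the scrape job needs honor_labels: true. A minimal sketch of such a scrape config, with an assumed job name and the ingress host used later in this post:
scrape_configs:
  - job_name: pushgateway      # assumed job name
    honor_labels: true         # keep the pushed job/instance labels
    scheme: https
    static_configs:
      - targets: ['my-gtw']    # the ingress exposing the pushgateway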
The backup job
I use Restic for backing up the NFS server:
- every four hours with the tag sub-daily; these backups are kept for two days
- daily with the tag daily; these backups are kept for seven days
- weekly with the tag weekly; these backups are kept for one month
- monthly with the tag monthly; these backups are kept for twelve months
The use of the tags makes it easier to prune old backups, e.g. monthly backups after twelve months:
restic -r <repo> forget --tag monthly --keep-last 12
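For illustration, the four schedules could be wired up in cron like this; the script path and the exact run times are assumptions:
# m h dom mon dow  command
0 */4 * * *  /usr/local/bin/backup.sh sub-daily
0 2   * * *  /usr/local/bin/backup.sh daily
0 3   * * 0  /usr/local/bin/backup.sh weekly
0 4   1 * *  /usr/local/bin/backup.sh monthly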
Backing up with Restic
First we need to source prometheus.bash:
#!/bin/bash
source prometheus.bash
Then we record the time before and after running the backup to determine its runtime:
start=$(date +%s)
# we limit the CPU usage to one core
GOMAXPROCS=1 restic --exclude-if-present '.nobackup' -r <repo> backup --json --tag "$1" <directory-we-backup> > output.txt
end=$(date +%s)
runtime=$((end-start))
You may note a few specifics in the way we use Restic:
- We limit its CPU usage to one core by setting GOMAXPROCS=1
- We automatically exclude directories containing a .nobackup file
- The tag ($1) is a parameter passed to the backup script
- We write the output of the Restic command to a file output.txt in --json format
Creating metrics
Now we can parse output.txt to extract all the metrics we are interested in. As the output is JSON, this is pretty simple with jq:
files=$(jq -r 'select(.message_type=="summary") | .total_files_processed' output.txt)
bytes=$(jq -r 'select(.message_type=="summary") | .total_bytes_processed' output.txt)
files_new=$(jq -r 'select(.message_type=="summary") | .files_new' output.txt)
files_changed=$(jq -r 'select(.message_type=="summary") | .files_changed' output.txt)
files_unchanged=$(jq -r 'select(.message_type=="summary") | .files_unmodified' output.txt)
dirs_new=$(jq -r 'select(.message_type=="summary") | .dirs_new' output.txt)
dirs_changed=$(jq -r 'select(.message_type=="summary") | .dirs_changed' output.txt)
dirs_unchanged=$(jq -r 'select(.message_type=="summary") | .dirs_unmodified' output.txt)
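As a design note, the eight jq invocations could be collapsed into a single pass over the file. A functionally equivalent sketch that reads the summary line once and splits its fields into the same variables:
read -r files bytes files_new files_changed files_unchanged dirs_new dirs_changed dirs_unchanged < <(
  jq -r 'select(.message_type=="summary") |
    "\(.total_files_processed) \(.total_bytes_processed) \(.files_new) \(.files_changed) \(.files_unmodified) \(.dirs_new) \(.dirs_changed) \(.dirs_unmodified)"' output.txt
)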
Creating a metric using prometheus.bash is pretty simple, thanks to the helper function io::prometheus::NewGauge:
- Create a metric
- Assign a value
In the script we create the following metrics:
- the time the backup took to run: restic_backup_time
- the number of files processed: restic_backup_processed_files
- the number of bytes processed: restic_backup_processed_bytes
- the number of new files: restic_backup_files_new
- the number of changed files: restic_backup_files_changed
- the number of unchanged files: restic_backup_files_unchanged
- the number of new directories: restic_backup_dirs_new
- the number of changed directories: restic_backup_dirs_changed
- the number of unchanged directories: restic_backup_dirs_unchanged
io::prometheus::NewGauge name=restic_backup_time help='How long the backup took'
restic_backup_time set "$runtime"
io::prometheus::NewGauge name=restic_backup_processed_files help='How many files were processed'
restic_backup_processed_files set "$files"
io::prometheus::NewGauge name=restic_backup_processed_bytes help='How many bytes were processed'
restic_backup_processed_bytes set "$bytes"
io::prometheus::NewGauge name=restic_backup_files_new help='How many new files were backed-up'
restic_backup_files_new set "$files_new"
io::prometheus::NewGauge name=restic_backup_files_changed help='How many files were changed'
restic_backup_files_changed set "$files_changed"
io::prometheus::NewGauge name=restic_backup_files_unchanged help='How many files did not change'
restic_backup_files_unchanged set "$files_unchanged"
io::prometheus::NewGauge name=restic_backup_dirs_new help='How many new dirs were backed-up'
restic_backup_dirs_new set "$dirs_new"
io::prometheus::NewGauge name=restic_backup_dirs_changed help='How many dirs were changed'
restic_backup_dirs_changed set "$dirs_changed"
io::prometheus::NewGauge name=restic_backup_dirs_unchanged help='How many dirs did not change'
restic_backup_dirs_unchanged set "$dirs_unchanged"
Pushing to the pushgateway
Again, using the function io::prometheus::PushAdd from the helper, this is pretty simple:
io::prometheus::PushAdd job=nfs_backup instance=$1 gateway="https://my-gtw:443"
Here again the type of the job (sub-daily, daily, weekly, monthly) is passed to the script and used as instance for the metric.
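To check that the push arrived, you can scrape the pushgateway yourself; it exposes everything it currently holds on its /metrics endpoint:
# list the gateway's current state and filter for our backup metrics
curl -s https://my-gtw/metrics | grep restic_backup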
Querying Prometheus
Now you can create PromQL queries on those metrics:
- Alert if no backup metrics were pushed in the last 24 hours:
absent_over_time(restic_backup_processed_files{job="nfs_backup"}[1d])
- Show the number of bytes processed by the sub-daily jobs:
restic_backup_processed_bytes{instance="sub-daily",job="nfs_backup"}
- The average runtime of the daily jobs:
avg_over_time(restic_backup_time{instance="daily",job="nfs_backup"}[1d])
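If you want Prometheus itself to alert rather than Grafana, the first query translates into a rule file like this sketch; the group and alert names are assumptions:
groups:
  - name: backup                 # assumed group name
    rules:
      - alert: NfsBackupMissing  # assumed alert name
        expr: absent_over_time(restic_backup_processed_files{job="nfs_backup"}[1d])
        for: 1h
        annotations:
          summary: No nfs_backup metrics were pushed in the last 24 hours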