Introduction
I run my workloads (blog, various apps) on Kubernetes on my home-lab server (Proxmox), because I can. Lately I have been working on backups as well as automated provisioning of Azure Kubernetes Service (AKS), so I thought: why not put both together and automate a disaster recovery scenario?
Azure provisioning time varies depending on conditions, but based on several test runs the end-to-end process takes about 15 minutes.
TL;DR: if you want to get your hands dirty you can find the code here
Note: in this article I have opted for simplicity and we will not deal with state: all workloads covered are stateless, so they have no persistent volumes. Once I have gained some experience with handling state I will write another article covering that aspect, using the same concept.
Setup
This is what my setup looks like:
---------------------
| (5) Ephemeral k8s |
---------------------
          ^
          |
---------------------     ------------------
| (4) Azure Storage |     |  (2) DNS Zone  |
---------------------     ------------------
          ^
          |                      Azure Cloud
----------|----------------------------------
          |              On-prem Datacenter
---------------------
|    (3) Velero     |
---------------------
---------------------
|  (1) on-prem k8s  |
---------------------
- My workloads run on Kubernetes (1) on my local Proxmox cluster
- I permanently use Azure DNS Zones (2) to manage my DNS (the zone must be created in the same subscription the DR cluster will run in)
- I run Velero (3) to back up all my important namespaces; backups run daily and are pushed to Azure (4)
- The backups are written to Azure Storage with a retention of 7 days
- In the case of a disaster a DR k8s cluster (5) is created; during setup my important workloads are restored there
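The daily backup with a 7-day retention can be expressed as a single Velero schedule. A minimal sketch, assuming Velero is already installed; the schedule name and namespaces are illustrative, not my actual ones:

```shell
# Back up two example namespaces every night at 03:00 and keep each
# backup for 7 days (168h). Names are placeholders.
velero schedule create daily-dr \
  --schedule "0 3 * * *" \
  --include-namespaces blog,apps \
  --ttl 168h0m0s
```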
What happens in the disaster case?
- Create the AKS cluster
- Install all needed (system) components: Helm, Velero, Traefik
- Use Velero to restore the latest backups of my important namespaces
- Switch the production DNS to the DR cluster IP
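The restore step maps to a single Velero call: `--from-schedule` restores from the most recent successful backup a schedule produced, so I never have to look up backup names. The schedule name below is an assumption:

```shell
# Restore the latest successful backup of the (hypothetical) daily-dr
# schedule; Velero resolves "latest" itself. --wait blocks until done.
velero restore create --from-schedule daily-dr --wait
```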
Testing the disaster case
Obviously I had to do a lot of testing while writing the automation, so I adopted a strategy to make the code more testable:
- In order to test the DR cluster without taking the production cluster down, the DR cluster has two ingress controllers: one for accessing the cluster via a DNS name containing `dr`, and one for accessing the cluster via the production DNS (once DNS has been switched from the production to the DR site). The `dr` ingress controller manages all objects (ingresses) annotated with `kubernetes.io/ingress.class: traefik-dr` (mock ingresses), while the other ingress controller manages all ingresses annotated with `kubernetes.io/ingress.class: traefik` (as on the production site)
- Of course "real" tests are also performed, by switching the production DNS to the DR cluster; these happen less often though
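A mock ingress for the `dr` controller could look like the sketch below; the hostname, namespace, and backend service are invented for illustration. The snippet only writes the manifest to a file so it can be inspected before a `kubectl apply -f mock-ingress.yaml`:

```shell
# Generate a mock ingress bound to the DR Traefik instance via the
# traefik-dr ingress class annotation; host and backend are placeholders.
cat > mock-ingress.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: blog-mock
  namespace: blog
  annotations:
    kubernetes.io/ingress.class: traefik-dr
spec:
  rules:
    - host: blog.dr.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: blog
                port:
                  number: 80
EOF
```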
And how do I go back?
- Switch DNS back to the main IP
- Destroy the DR cluster
Note: remember I have no state, so there is no backup of the DR cluster and no data to move back
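Going back is the DNS switch in reverse plus a teardown. A hedged sketch; the resource group, zone, record, and IP are placeholders and the A-record layout is an assumption:

```shell
# Point the production record back at the on-prem ingress IP.
az network dns record-set a update \
  --resource-group dns-rg --zone-name example.com --name www \
  --set aRecords[0].ipv4Address=203.0.113.10

# Destroy the whole DR environment by deleting its resource group.
az group delete --name dr-rg --yes --no-wait
```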
And how does it work?
Backing up to the cloud
I use Velero (formerly known as Ark) to back up all Kubernetes objects from selected namespaces to Azure. This article covers that part.
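On the on-prem cluster, Velero is installed with the Azure object-store plugin. A sketch following the plugin's documented install shape; the container name, storage account, resource group, and plugin version are placeholders:

```shell
# Install Velero pointing at an Azure blob container; all values are
# placeholders — use your own storage account and credentials file.
velero install \
  --provider azure \
  --plugins velero/velero-plugin-for-microsoft-azure:v1.8.0 \
  --bucket velero-backups \
  --secret-file ./velero/credentials-velero \
  --backup-location-config resourceGroup=backup-rg,storageAccount=mybackupsa
```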
What happens when a disaster occurs
I did not want to provision a cold Kubernetes cluster just for the disaster case so I use automation to create the DR environment:
- Create an Azure resource group
- Create an Azure Key Vault: this one is optional but I like to have a secure place to stash all the credentials
- Create the AKS cluster
- Install Helm: used to install `traefik`
- Install Traefik instances: I prefer `traefik` to `nginx`, especially because of its simple integration with Let's Encrypt. We deploy two Traefik instances (see "Testing the disaster case" above)
- Install Velero: it gets pointed to the cloud backup
- Restore selected namespaces for the workloads that are important to me: remember, we do not restore state at the moment
- Deploy mock ingresses: the DR cluster needs to be testable while the main cluster is still online (see "Testing the disaster case" above)
- Switch my main DNS to the DR cluster: this step is only executed when doing "real" DR tests and in case the real site goes down
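The first few provisioning steps are plain `az` calls. A minimal sketch; the group, vault, and cluster names, the location, and the node size are all placeholders:

```shell
set -euo pipefail

# (1) Resource group that will hold everything DR-related
az group create --name dr-rg --location westeurope

# (2) Optional Key Vault for stashing credentials
az keyvault create --name dr-kv-example --resource-group dr-rg

# (3) The ephemeral AKS cluster itself
az aks create \
  --resource-group dr-rg \
  --name dr-aks \
  --node-count 2 \
  --node-vm-size Standard_B2ms \
  --generate-ssh-keys

# Fetch the kubeconfig for the following helm/velero steps
az aks get-credentials --resource-group dr-rg --name dr-aks
```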
Configuration
Aside from the scripts there are three files that need to be changed to suit your needs:
- `setenv`: contains all the values for accessing your Azure subscription, the VM size you want to provision, and info about DNS
- `params`: contains one variable with the name of the cluster
- `velero/credentials-velero`: contains the reference/credentials for the Azure storage container where the Velero backups are stored
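A minimal `setenv` sketch; the variable names below are my guesses at what such a file holds, not the repo's actual ones:

```shell
# Hypothetical setenv — all names and values are illustrative assumptions.
export AZURE_SUBSCRIPTION_ID="00000000-0000-0000-0000-000000000000"
export AZURE_LOCATION="westeurope"
export VM_SIZE="Standard_B2ms"       # node size for the DR cluster
export DNS_ZONE="example.com"        # Azure DNS zone to switch over
export DNS_RESOURCE_GROUP="dns-rg"   # resource group holding the zone
```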
Give me the code!
The code is on GitHub