June 20, 2019

kubernetes cloud disaster recovery

Introduction

I run my workloads (blog, various apps) on my home-lab server (Proxmox) and Kubernetes, because I can. I have been working on backups as well as automated provisioning of Azure Kubernetes Service (AKS) lately, so I thought: why not put the two together and automate a disaster recovery scenario?

The Azure provisioning time varies depending on conditions, but across several tests the end-to-end process took about 15 minutes.

TL;DR: if you want to get your hands dirty, you can find the code here

Note: for simplicity, this article does not deal with state: all workloads covered here are stateless, so they have no persistent volumes. Once I have gained some experience with handling state, I will write another article covering that aspect using the same concept.

Setup

This is what my setup looks like:


    ----------------------
   | (5)  Ephemeral k8s   |
    ---------------------- 
             ^
             |
             |
    ----------------------            ----------------------
   | (4)  Azure Storage   |          | (2)     DNS Zone     |
    ----------------------            ---------------------- 
             ^
             |                 Azure Cloud         
             |                ----------------------------------------------                 
             |                 On-prem Datacenter 
    ---------------------
   | (3)     Velero      |
    ---------------------
    ---------------------
   | (1) on-prem k8s     | 
    --------------------- 
  1. My workloads run on Kubernetes on my local Proxmox cluster
  2. I use Azure DNS Zones to manage my DNS permanently (the zone must be created in the same subscription the DR cluster will run in)
  3. I run Velero to back up all my important namespaces. Backups run daily and are pushed to Azure (4); a sketch of the schedule follows this list
  4. The backups are written to Azure storage with a retention of 7 days
  5. In the case of a disaster, a DR k8s cluster is created. During setup my important workloads are restored there
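
The daily schedule and the 7-day retention map directly onto a Velero schedule with a TTL. A minimal sketch (the schedule name, cron expression and namespace list are placeholders, not taken from my actual setup):

    # back up the important namespaces every night at 03:00
    # and keep each backup for 7 days (168h)
    velero schedule create daily-backup \
      --schedule "0 3 * * *" \
      --include-namespaces blog,apps \
      --ttl 168h0m0s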

What happens in the disaster case?

  1. Create the AKS cluster
  2. Install all needed (system) components: Helm, Velero, Traefik
  3. Use Velero to restore the latest backups of my important namespaces (see the sketch after this list)
  4. Switch the production DNS to the DR cluster IP
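
Step 3 boils down to finding the most recent backup and restoring it into the fresh cluster. A hedged sketch (the backup and namespace names are placeholders, and the exact flags may differ slightly between Velero versions):

    # list the backups Velero sees in the Azure storage container
    velero backup get

    # restore the latest daily backup of the important namespaces
    velero restore create dr-restore \
      --from-backup daily-backup-20190619030012 \
      --include-namespaces blog,apps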

Testing the disaster case

Obviously I had to do a lot of testing while writing the automation, so I adopted a strategy to make the code more testable:

  • In order to test the DR cluster without taking the production cluster down, the DR cluster has two ingress controllers: one for accessing the cluster via a DNS name containing dr, and one for accessing the cluster via the production DNS (once DNS has been switched from the production to the DR site). The dr ingress controller manages all ingresses annotated with kubernetes.io/ingress.class: traefik-dr (mock ingresses), while the other ingress controller manages all ingresses annotated with kubernetes.io/ingress.class: traefik (as on the production site); see the example after this list
  • Of course “real” tests are also performed, by switching the production DNS to the DR cluster; these happen less often though
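
To make the first point concrete: a mock ingress differs from its production counterpart only in the ingress.class annotation and the host name. A sketch of such an object (host, namespace and service names are invented for the example; the apiVersion is the extensions API commonly used with Traefik 1.x at the time), applied with kubectl apply -f:

    apiVersion: extensions/v1beta1
    kind: Ingress
    metadata:
      name: blog-dr-mock
      namespace: blog
      annotations:
        # picked up only by the dr Traefik instance
        kubernetes.io/ingress.class: traefik-dr
    spec:
      rules:
      - host: blog.dr.example.com   # test name containing "dr"
        http:
          paths:
          - backend:
              serviceName: blog
              servicePort: 80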

And how do I go back?

  1. Switch DNS back to the main IP
  2. Destroy the DR cluster

Note: remember, I have no state, so there is no backup of the DR cluster and no data to move back.
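
Both steps map onto a couple of Azure CLI calls. A rough sketch (zone, record set, resource group names and IP addresses are placeholders):

    # point the production A record back at the on-prem IP
    az network dns record-set a remove-record -g dns-rg -z example.com -n www \
      --ipv4-address 40.112.0.10     # the DR cluster IP
    az network dns record-set a add-record -g dns-rg -z example.com -n www \
      --ipv4-address 203.0.113.10    # the on-prem IP

    # the DR environment lives in its own resource group,
    # so tearing it down is a single call
    az group delete --name dr-rg --yes --no-wait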

And how does it work?

Backing up to the cloud

I use Velero (formerly known as Ark) to back up all Kubernetes objects from selected namespaces to Azure. This article covers that part.

What happens when a disaster occurs

I did not want to keep a cold Kubernetes cluster provisioned just for the disaster case, so I use automation to create the DR environment on demand (a sketch of the core provisioning commands follows the list):

  1. Create an Azure resource group
  2. Create an Azure Key Vault: this one is optional, but I like to have a secure place to stash all the credentials
  3. Create the AKS cluster
  4. Install Helm: used to install Traefik
  5. Install the Traefik instances: I prefer Traefik to nginx, especially because of its simple integration with Let's Encrypt. We deploy two Traefik instances (see “Testing the disaster case” above)
  6. Install Velero: it gets pointed to the cloud backup
  7. Restore the selected namespaces for the workloads that are important to me: remember, we do not restore state at the moment
  8. Deploy mock ingresses: the DR cluster needs to be testable while the main cluster is still online
  9. Switch my main DNS to the DR cluster: this step is only executed during “real” DR tests or when the production site actually goes down
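
Steps 1-5 translate roughly into the following Azure CLI and Helm calls. This is a condensed sketch under a few assumptions: resource names, location and VM size are placeholders, Helm 2 with the stable/traefik chart of the time is assumed, and the kubernetes.ingressClass value is what, as far as I can tell, lets each Traefik release watch its own ingress class:

    az group create --name dr-rg --location westeurope

    az aks create \
      --resource-group dr-rg \
      --name dr-cluster \
      --node-count 2 \
      --node-vm-size Standard_B2ms \
      --generate-ssh-keys

    az aks get-credentials --resource-group dr-rg --name dr-cluster

    # Helm 2 still needs Tiller in the cluster
    helm init --wait

    # one Traefik release per ingress class (production and dr)
    helm install stable/traefik --name traefik --namespace kube-system \
      --set kubernetes.ingressClass=traefik
    helm install stable/traefik --name traefik-dr --namespace kube-system \
      --set kubernetes.ingressClass=traefik-dr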

Configuration

Aside from the scripts there are three files that need to be changed to suit your needs:

  • setenv: contains all the values for accessing your Azure subscription, the VM size you want to provision, and info about DNS
  • params: contains one variable with the name of the cluster
  • velero/credentials-velero: this file contains the reference/credentials for the Azure storage container where the Velero backups are stored (see the example after this list)
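
For reference, the Velero credentials file for Azure is a plain key/value file along these lines (the exact set of keys depends on the Velero version; all values are placeholders):

    AZURE_SUBSCRIPTION_ID=<subscription-id>
    AZURE_TENANT_ID=<tenant-id>
    AZURE_CLIENT_ID=<service-principal-app-id>
    AZURE_CLIENT_SECRET=<service-principal-password>
    AZURE_RESOURCE_GROUP=<resource-group-of-the-storage-account>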

Give me the code!

The code is on GitHub.

Content licensed under CC BY 4.0