
Resizing EKS EBS Volumes Safely in Kubernetes Using Blue-Green Approach

In this post I want to share a simple, step-by-step way to resize a volume in Kubernetes using a blue-green approach.

Working with volumes in Kubernetes is tricky. Unlike pods, deployments, or ingresses, volumes behave in non-standard ways depending on the provisioner set in your StorageClass – certain features might work with AWS EBS but not with GCP or Azure (or vice versa).

For a full list of provisioners and their capabilities, check out the Kubernetes documentation. In this post, I focus on Amazon’s EBS (Elastic Block Store) in a standard EKS (Elastic Kubernetes Service) setup. In theory though, this approach should work with any other provisioner.

Kubernetes native volume expansion

Before we dive into the example, let’s first address Kubernetes’ native volume expansion capabilities. It’d be nice to trust the orchestrator to do the heavy lifting for us. In reality, it’s not that simple (yet).

If you’re on a newer version of Kubernetes and don’t mind using beta functionality, there’s support for automatic expansion of a PersistentVolumeClaim. If you want to go this route, a couple of notes: the StorageClass backing your claim must set allowVolumeExpansion: true, and depending on your cluster version the feature (and its related feature gates) may still be in beta.

Here’s a link to the Kubernetes documentation describing this in more detail.
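
For reference, here’s a minimal sketch of a StorageClass that permits expansion – the class name is made up for illustration; what matters is the allowVolumeExpansion field:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2-expandable  # hypothetical name for illustration
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
allowVolumeExpansion: true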

It’s ok, but…

I’ve decided not to use the above method for several reasons. Instead, I’ve settled on a simple approach without too many fireworks, where I can understand what is happening under the hood.

Below I describe how I performed a volume resize step by step, with little risk and with the ability to return to any previous step’s state at any point in time.

Step 1: Create a second volume

In this example, I’m using StorageClass automatically provisioned by EKS: gp2. If you’re using eksctl, chances are your setup is similar. This StorageClass’s provisioner automatically creates volumes to satisfy any new PersistentVolumeClaim resource’s requirements.

I’m using Helm to manage the Kubernetes resources for the application deployment. Let’s edit the PersistentVolumeClaim for my sample app.

Here’s my pvc.yml contents:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: "{{ .Values.volume.pvcName }}"
  labels:
    app: "{{ .Values.appName }}"
    component: "{{ .Values.appName }}"
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: "{{ .Values.volume.storage }}"

Now let’s add a second volume definition – volumeBlue – to the app’s values.yml file:

volume:
  storage: 50Gi
  pvcName: "media-files"

volumeBlue:
  enabled: false
  storage: 200Gi
  pvcName: "media-files-blue"

After adding these values, you can confidently deploy the changes – note that we haven’t changed the original volume in pvc.yml in any way. Also note the enabled: false flag in the volumeBlue: section – we’ll set everything up first and then enable it.
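
Throughout this post, “deploy” means a regular Helm release upgrade; the release and chart names below are placeholders for your own:

helm upgrade my-app ./chart -f values.yml -n <namespace>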

Now let’s add the second volume to pvc.yml:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: "{{ .Values.volume.pvcName }}"
  labels:
    app: "{{ .Values.appName }}"
    component: "{{ .Values.appName }}"
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: "{{ .Values.volume.storage }}"
{{- if .Values.volumeBlue.enabled }}
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: "{{ .Values.volumeBlue.pvcName }}"
  labels:
    app: "{{ .Values.appName }}"
    component: "{{ .Values.appName }}"
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: "{{ .Values.volumeBlue.storage }}"
{{- end }}

You can deploy the app again – these changes don’t rebuild your app’s container image and thus should be quick. The new volume still sits behind the enabled flag, so as of now we’re not adding the new resource to Kubernetes yet. Note that with Helm you can roll back any step in a matter of seconds (helm history .., then helm rollback ..).
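
For example, with a placeholder release name:

helm history my-app -n <namespace>
helm rollback my-app <revision> -n <namespace>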

Finally, let’s create the second volume:

In values.yml:

volumeBlue:
  enabled: true

Deploy again. Kubernetes should now schedule creation of a new EBS volume. To see what’s happening, you can check the claim’s status:
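
kubectl get pvc -n <namespace>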

You might notice the volume is in a Pending state. What happened? Kubernetes tries to be smart and won’t provision the EBS volume until there’s a pod binding to that volume (on EKS, the default gp2 StorageClass typically uses volumeBindingMode: WaitForFirstConsumer). Don’t worry about this. You can check the PVC details and events with kubectl describe pvc/media-files-blue -n <namespace>.

Step 2: Bind the volume to a pod

Let’s mount our new volume to the app. In deployment.yml:

volumes:
- name: media-files
  persistentVolumeClaim:
    claimName: {{ .Values.volume.pvcName }}
{{- if .Values.volumeBlue.enabled }}
- name: media-files-blue
  persistentVolumeClaim:
    claimName: {{ .Values.volumeBlue.pvcName }}
{{- end }}

Deploy and see what happens. The EBS volume should now be created, and you should see the new PVC’s state switch to Bound after a moment. This means Kubernetes has created the volume in AWS EBS and bound the claim.
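
To watch the claim flip from Pending to Bound as the pod comes up:

kubectl get pvc -n <namespace> -w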

But the volume is not used by our application yet. We need to add a volumeMount first.

Note: we’re hiding the new volume behind the enabled flag. This means that if anything goes wrong, we can run the deployment with enabled: false and all our changes will be reversed.

deployment.yml:

volumeMounts:
- name: media-files
  mountPath: /opt/app/media-files
{{- if .Values.volumeBlue.enabled }}
- name: media-files-blue
  mountPath: /opt/app/media-files-blue
{{- end }}

After releasing the above, the app should now have two directories, each backed by a separate volume.

You can check this by exec’ing into the running container via kubectl exec -it <pod_name> -c <container_name> -n <namespace> -- bash (or sh if the container doesn’t have bash).
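
Once inside, a quick way to confirm both mounts are present (assuming your image ships the usual coreutils):

df -h | grep media-files
ls /opt/app/media-files /opt/app/media-files-blue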

Step 3: Sync the files

Alright, we now have two volumes: green (old) and blue (new). To switch, we need to make sure they contain the same data.

To copy the files, exec into the running container and run cp -r /opt/app/media-files/* /opt/app/media-files-blue/. Note that the * glob skips hidden dot-files; if your data may include any, copy the directory contents directly instead, e.g. cp -a /opt/app/media-files/. /opt/app/media-files-blue/.

In addition to that, it’s useful to create a marker file on each volume to distinguish them more easily later on: add a __VOLUME_GREEN__ file to /opt/app/media-files/ and a __VOLUME_BLUE__ file to /opt/app/media-files-blue/.
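
For example, from inside the container:

touch /opt/app/media-files/__VOLUME_GREEN__
touch /opt/app/media-files-blue/__VOLUME_BLUE__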

NOTE 1: If the app you’re working on is deployed frequently, and/or there’s a lot of data to sync, it’s possible that a deployment will shut down the container you’re running the cp -r from. If you can’t guarantee a short non-deployment window, you can do it differently: add an additional deployment/pod in the same namespace as the app (or even in the same deployment) and bind to the volumes using the same claimNames. (EKS EBS note: if you’re creating a separate pod, make sure it lands on the same node as the app using nodeAffinity or podAffinity – you can’t mount an EBS volume to pods on two different nodes.)
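
Here’s a minimal sketch of such a helper pod, assuming the PVC names used in this post; the app label in podAffinity is a placeholder and must match your app pod’s actual labels:

apiVersion: v1
kind: Pod
metadata:
  name: volume-sync-helper
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: my-app  # placeholder: must match the app pod's labels
        topologyKey: kubernetes.io/hostname
  containers:
  - name: sync
    image: busybox
    command: ["sleep", "86400"]  # keep the pod alive while we copy
    volumeMounts:
    - name: green
      mountPath: /green
    - name: blue
      mountPath: /blue
  volumes:
  - name: green
    persistentVolumeClaim:
      claimName: media-files
  - name: blue
    persistentVolumeClaim:
      claimName: media-files-blue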

NOTE 2: If your app is writing to the volume a lot, the files added during the migration might not be copied to the new volume. To solve this issue, there are at least 2 options:

  1. Copy the files, then switch to the new volume, then copy the files again to account for the files created during the deployment switch (before deleting the old volume) – see the sketch after this list.
  2. Stop writing to the volume/put the app in maintenance mode for the duration of the switch.
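
A rough sketch of option 1, run from inside the container. It relies on cp -u (supported by GNU cp and most busybox builds) to copy only files that are newer at the source, so the second pass picks up writes that happened since the first:

# First pass, while the app still writes to the green volume:
cp -a /opt/app/media-files/. /opt/app/media-files-blue/
# ...perform the switch from Step 4; the mount paths swap claims...
# Second pass: the old green volume is now mounted at media-files-blue,
# so copy any stragglers into the new (blue) volume:
cp -au /opt/app/media-files-blue/. /opt/app/media-files/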

Step 4: Switch volumes

Now the fun part.

Once all the files have been copied, we can switch the volumes in deployment.yml:

volumes:
- name: media-files
  persistentVolumeClaim:
    claimName: {{ .Values.volumeBlue.pvcName }}
{{- if .Values.volumeBlue.enabled }}
- name: media-files-blue
  persistentVolumeClaim:
    claimName: {{ .Values.volume.pvcName }}
{{- end }}

Deploy, and you should see that the app is now using the new volume – media-files-blue – while the directory names inside the container did not change.

If you kubectl exec into the container, you can confirm the switch was done correctly by checking our flag files:
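
ls /opt/app/media-files       # should now show __VOLUME_BLUE__
ls /opt/app/media-files-blue  # should now show __VOLUME_GREEN__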

If the markers show up swapped like this, everything went correctly. The app now uses the new, bigger EBS volume, with all the files from the previous one in place.

Notice that every step we’ve performed was easy to roll back using Helm, including the final one. If anything goes wrong, we can switch the volumes back.

Step 5: Cleanup

Depending on your situation, you might want to wait a day or two before the cleanup. If everything works well though, it’s now time to remove the green volume and leave the blue one.

First, leave pvc.yml unchanged – but we’ll change values.yml:

volume:
  storage: 200Gi
  pvcName: "media-files-blue"

volumeBlue:
  enabled: true
  storage: 50Gi
  pvcName: "media-files"

We’ve switched the names above – this is an important step: our blue volume becomes the main one.

Now deploy the app yet again and observe that nothing changes. At this point volumeBlue refers to the old, green volume.

Then, let’s switch volumeBlue.enabled to false and deploy – this should trigger the deletion/release of the old green volume. (Whether the underlying EBS volume is deleted or merely released depends on the PersistentVolume’s reclaim policy; the default gp2 class uses Delete.)
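
To confirm what happened to the underlying volume, list the PersistentVolumes and check the reclaim policy (PV name is a placeholder):

kubectl get pv
kubectl get pv <pv_name> -o jsonpath='{.spec.persistentVolumeReclaimPolicy}'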

And as the final step, delete all the {{- if .Values.volumeBlue.enabled }} code from pvc.yml and deployment.yml.

Summary

There should be a single, resized volume connected to the deployment now. Other than that, all the .yml resources remain the same as they were before any of the changes.

There is one important change, though. The PersistentVolumeClaim now has a different name – media-files changed to media-files-blue. I picked that name for this article’s purposes. In a real-world scenario, however, you might want to pick a more useful name (e.g. media-files-resize-1).

If you need to keep the original PVC name, you can repeat the procedure one more time to get back to it.