RKE2 CSI Snapshotting on vSphere

Introduction

As of Velero v1.14, the velero-plugin-for-csi is included in Velero. This means you no longer need to install a separate velero-plugin-for-csi or the velero-plugin-for-vsphere. This guide covers the configuration required to enable Velero to use a vSphere CSI driver for volume snapshots of a UDS Core deployment.

Prerequisites

  • An RKE2 Kubernetes cluster (additional configuration may be required for other distributions)
  • Access to vSphere infrastructure
  • UDS Core deployment with Velero configured for S3-compatible object storage

Using a CSI driver in an RKE2 cluster

The following guidance is specific to RKE2 and assumes the bucket variables required for S3 object storage have already been set. It is a collection of configuration tips for the CSI driver, not a step-by-step procedure. To integrate Velero with the CSI driver, first install both rancher-vsphere-cpi and rancher-vsphere-csi. On RKE2, both are installed by setting cloud-provider-name: rancher-vsphere in RKE2’s config.yaml.
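
As a minimal sketch of the setting described above, the RKE2 configuration could contain the line below. The path /etc/rancher/rke2/config.yaml is RKE2's default location and is an assumption about your install; any other keys already in the file are omitted.

```yaml
# /etc/rancher/rke2/config.yaml (default RKE2 config path; adjust if yours differs)
# Tells RKE2 to deploy the bundled rancher-vsphere-cpi and rancher-vsphere-csi charts.
cloud-provider-name: rancher-vsphere
```

The HelmChartConfig overrides in the next section customize those two charts. On RKE2 such manifests are typically placed in /var/lib/rancher/rke2/server/manifests/ on a server node, though your provisioning tooling may handle this differently.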

CSI Driver Configuration

When using a vSphere CSI driver, a user must be created in vSphere with the appropriate permissions at the appropriate vSphere object levels. The required roles and privileges are documented in Broadcom vSphere Roles and Privileges. This user's credentials are referenced below as vsphere_csi_username and vsphere_csi_password. The CPI and CSI drivers use them to authenticate with the vCenter API to provision, manage, and snapshot volumes; Velero requests snapshots through the CSI driver rather than contacting vCenter directly.

At least three overrides are required in the vSphere CSI driver configuration: blockVolumeSnapshot, configTemplate, and global-max-snapshots-per-block-volume.

  • blockVolumeSnapshot must be enabled on the CSI driver so that the csi-snapshotter sidecar is deployed; the sidecar is required to create volume snapshots
  • configTemplate must be overridden in its entirety so that global-max-snapshots-per-block-volume can be set
  • global-max-snapshots-per-block-volume should be added within the configTemplate override to control how many snapshots are allowed per volume

Example rancher-vsphere-cpi and rancher-vsphere-csi overrides:

---
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rancher-vsphere-cpi
  namespace: kube-system
spec:
  valuesContent: |-
    vCenter:
      host: "{{ vsphere_server }}"
      port: 443
      insecureFlag: true
      datacenters: "<vsphere_datacenter_name>"
      username: "{{ vsphere_csi_username }}"
      password: "{{ vsphere_csi_password }}"
      credentialsSecret:
        name: "vsphere-cpi-creds"
        generate: true
---
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rancher-vsphere-csi
  namespace: kube-system
spec:
  valuesContent: |-
    vCenter:
      datacenters: "<vsphere_datacenter_name>"
      username: "{{ vsphere_csi_username }}"
      password: "{{ vsphere_csi_password }}"
      configSecret:
        configTemplate: |
          [Global]
          cluster-id = "{{ rke2_token }}"
          user = "{{ vsphere_csi_username }}"
          password = "{{ vsphere_csi_password }}"
          port = 443
          insecure-flag = "1"
          [VirtualCenter "{{ vsphere_server }}"]
          datacenters = "<vsphere_datacenter_name>"
          [Snapshot]
          global-max-snapshots-per-block-volume = 12
    csiNode:
      tolerations:
        - operator: "Exists"
          effect: "NoSchedule"
    blockVolumeSnapshot:
      enabled: true
    storageClass:
      reclaimPolicy: Retain

Snapshot Limit Configuration

The default snapshot limit (3) is insufficient for UDS Core’s 10-day backup retention policy.

  • Each UDS backup creates approximately 13 snapshots, spread across all volumes
  • In a cluster with 13 volumes, each nightly UDS backup therefore creates 1 snapshot per volume
  • After 3 nightly backups, the default global-max-snapshots-per-block-volume is reached and further backups fail
  • To retain 10 days of UDS backups (assuming 13 volumes), set global-max-snapshots-per-block-volume to at least 10
  • Consider a higher value to leave headroom for manual backups or restore testing (e.g., global-max-snapshots-per-block-volume=12)
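
The sizing above reduces to: limit ≥ (backups per day × retention days) + headroom. One nightly backup kept for 10 days plus 2 spare snapshots gives 12. The retention side of that calculation is the backup TTL. The following is a hedged sketch of pinning it in the uds-bundle.yaml overrides; it assumes the udsbackup schedule accepts a template.ttl value (Velero's backup template has a ttl field) and that 240h is the retention you want.

```yaml
overrides:
  velero:
    velero:
      values:
        # Assumed path: a sibling of the template.snapshotVolumes override used in this guide.
        # 240h = 10 days; with one backup per day each volume holds about 10 snapshots,
        # so global-max-snapshots-per-block-volume = 12 leaves 2 spare.
        - path: schedules.udsbackup.template.ttl
          value: "240h"
```

If the TTL or backup frequency is raised, the snapshot limit should be raised with it so the two stay consistent.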

If the following error appears when creating a backup, the snapshot limit has been reached and global-max-snapshots-per-block-volume must be raised:

name: /prometheus-kube-prometheus-stack-prometheus-0 message: /Error backing up item error: /error
executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=monitoring,
name=velero-prometheus-kube-prometheus-stack-prometheus-db-prom2n67g): rpc error: code = Unknown desc
= CSI got timed out with error: Failed to check and update snapshot content:\n failed to take snapshot
of the volume 6e908637-1c40-41ab-a65b-0460b403e364: "rpc error: code = FailedPrecondition desc =\n the
number of snapshots on the source volume 6e908637-1c40-41ab-a65b-0460b403e364 reaches the configured
maximum (3)"

Create a VolumeSnapshotClass

In addition to the above CSI driver overrides, a VolumeSnapshotClass must be defined to tell Velero how to create snapshots. This can be achieved by creating a velero-config Zarf package that contains the VolumeSnapshotClass manifest and having your uds-bundle.yaml deploy this package. The VolumeSnapshotClass names the CSI driver, which in the example below is the vSphere CSI driver (csi.vsphere.vmware.com). The velero.io/csi-volumesnapshot-class label marks it as the class Velero should use for that driver.

Example VolumeSnapshotClass deployment:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: vsphere-csi-snapshot-class
  labels:
    velero.io/csi-volumesnapshot-class: "true"
driver: csi.vsphere.vmware.com
deletionPolicy: Retain
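
One way to package this manifest is sketched below under stated assumptions: the file name volumesnapshotclass.yaml, the version 0.1.0, and the component name are illustrative, and only the package name velero-config comes from the text above.

```yaml
# zarf.yaml (sketch; file, component, and version names are hypothetical)
kind: ZarfPackageConfig
metadata:
  name: velero-config
  version: 0.1.0
components:
  - name: volume-snapshot-class
    required: true
    manifests:
      - name: volume-snapshot-class
        files:
          # The VolumeSnapshotClass manifest from this section, saved next to zarf.yaml
          - volumesnapshotclass.yaml
```

The built package would then be listed under packages in uds-bundle.yaml so that it is deployed with the rest of the bundle.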

Configure Velero for CSI Support

In the uds-bundle.yaml Velero overrides, you must enable the EnableCSI feature flag, set snapshotsEnabled to true, set the volumeSnapshotLocation provider to velero.io/csi, and set snapshotVolumes to true on the backup schedule.

Example uds-bundle.yaml core-backup-restore layer overrides:

overrides:
  velero:
    velero:
      values:
        - path: configuration.features
          value: EnableCSI
        - path: snapshotsEnabled
          value: true
        - path: configuration.volumeSnapshotLocation
          value:
            - name: default
              provider: velero.io/csi
        - path: schedules.udsbackup.template.snapshotVolumes
          value: true
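
To check the configuration end to end, a one-off backup can be requested. The manifest below is a sketch: the backup name, the velero namespace, the monitoring target namespace, and the 24h TTL are assumptions chosen for illustration.

```yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: csi-snapshot-test   # hypothetical name
  namespace: velero         # assumed namespace of the Velero deployment
spec:
  includedNamespaces:
    - monitoring            # any namespace with a PVC on the vSphere StorageClass
  snapshotVolumes: true
  ttl: 24h0m0s              # short TTL so the test snapshot is cleaned up soon
```

If CSI support is working, a VolumeSnapshotContent should appear for each backed-up PVC. Note that a test backup counts against the per-volume snapshot limit until it expires.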

Additional Tips

  • When restoring specific namespaces, always use the --include-namespaces flag to avoid creating unnecessary VolumeSnapshotContents:
    velero restore create --from-backup <backup-name> --include-namespaces <namespace>
  • Be cautious when deleting backups that have been used for restores, as this may attempt to delete VolumeSnapshotContents that are still in use by restored volumes.
  • Velero’s garbage collection runs hourly by default. Ensure your TTL settings allow enough time for cleanup before hitting snapshot limits.
  • The pyvmomi-community-samples repo contains several scripts that are useful for interacting with the vSphere client. In particular, the fcd_list_vdisk_snapshots script allows you to list snapshots stored in vSphere, even when they can’t be directly viewed in the vSphere UI. This comes in handy when snapshots and VolumeSnapshotContents get manually deleted from the cluster, but are not cleaned up appropriately in vSphere.
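
The namespace-scoped restore from the first tip can also be expressed declaratively. This sketch assumes Velero runs in the velero namespace; the restore name is hypothetical, and the placeholders match those in the command above.

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-single-namespace   # hypothetical name
  namespace: velero                # assumed namespace of the Velero deployment
spec:
  backupName: <backup-name>
  includedNamespaces:
    - <namespace>
```

Limiting includedNamespaces serves the same purpose as --include-namespaces: only the listed namespaces are restored.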

Resources

Velero CSI Snapshot Support

Kubernetes CSI Snapshot API

Rancher vSphere

Rancher vSphere Configuration Reference

global-max-snapshots-per-block-volume

How Velero Works