RKE2 CSI Snapshotting on vSphere
Introduction
As of Velero v1.14, the velero-plugin-for-csi is included in Velero. This means you no longer need to install a separate velero-plugin-for-csi or the velero-plugin-for-vsphere. This guide covers the configuration required to enable Velero to use a vSphere CSI driver for volume snapshots of a UDS Core deployment.
Prerequisites
- An RKE2 Kubernetes cluster (additional configuration may be required for other distributions)
- Access to vSphere infrastructure
- UDS Core deployment with Velero configured for S3-compatible object storage
Using a CSI driver in an RKE2 cluster
The following instructions are specific to an RKE2 cluster and assume that the bucket variables required for S3 object storage have already been set. They are not meant to be step-by-step instructions, but useful tips for configuring the CSI driver. To integrate Velero with a CSI driver, first install both rancher-vsphere-cpi and rancher-vsphere-csi. On RKE2, the vSphere CPI/CSI is installed by setting cloud-provider-name: rancher-vsphere in RKE2’s config.yaml.
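As a minimal sketch, the setting is a single key in the RKE2 server configuration file. The path in the comment is the RKE2 default location and is an assumption about your installation; adjust it if your nodes keep their configuration elsewhere.

```yaml
# /etc/rancher/rke2/config.yaml (default RKE2 config path; assumed for this sketch)
# Tells RKE2 to deploy the bundled rancher-vsphere-cpi and rancher-vsphere-csi charts.
cloud-provider-name: rancher-vsphere
```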
CSI Driver Configuration
When using a vSphere CSI driver, a user must be created in vSphere with the appropriate permissions at the appropriate vSphere object levels. These roles and privileges can be found at Broadcom vSphere Roles and Privileges. This user is referenced below as vsphere_csi_username and vsphere_csi_password and is used by Velero to authenticate with the vSphere vCenter API to provision, manage, and snapshot volumes.
At least three overrides must occur in the vSphere CSI driver configuration: blockVolumeSnapshot, configTemplate, and global-max-snapshots-per-block-volume.
- blockVolumeSnapshot must be enabled on the CSI driver to allow the deployment of the csi-snapshotter sidecar, which is required to create snapshots of volumes
- configTemplate must be completely overridden, to allow overriding of the global-max-snapshots-per-block-volume setting
- global-max-snapshots-per-block-volume should be added as an override within the configTemplate, to allow control of how many snapshots are allowed per volume
Example rancher-vsphere-cpi and rancher-vsphere-csi overrides:
---
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rancher-vsphere-cpi
  namespace: kube-system
spec:
  valuesContent: |-
    vCenter:
      host: "{{ vsphere_server }}"
      port: 443
      insecureFlag: true
      datacenters: "<vsphere_datacenter_name>"
      username: "{{ vsphere_csi_username }}"
      password: "{{ vsphere_csi_password }}"
      credentialsSecret:
        name: "vsphere-cpi-creds"
        generate: true
---
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rancher-vsphere-csi
  namespace: kube-system
spec:
  valuesContent: |-
    vCenter:
      datacenters: "<vsphere_datacenter_name>"
      username: "{{ vsphere_csi_username }}"
      password: "{{ vsphere_csi_password }}"
      configSecret:
        configTemplate: |
          [Global]
          cluster-id = "{{ rke2_token }}"
          user = "{{ vsphere_csi_username }}"
          password = "{{ vsphere_csi_password }}"
          port = 443
          insecure-flag = "1"
          [VirtualCenter "{{ vsphere_server }}"]
          datacenters = "<vsphere_datacenter_name>"
          [Snapshot]
          global-max-snapshots-per-block-volume = 12
    csiNode:
      tolerations:
        - operator: "Exists"
          effect: "NoSchedule"
    blockVolumeSnapshot:
      enabled: true
    storageClass:
      reclaimPolicy: Retain
Snapshot Limit Configuration
The default snapshot limit (3) is insufficient for UDS Core’s 10-day backup retention policy.
- Each UDS backup creates approximately 13 snapshots distributed across all volumes
- For a cluster that has 13 volumes, each nightly UDS backup will create 1 snapshot per volume
- After 3 days of backups, the default global-max-snapshots-per-block-volume will have been met, and further backups will fail
- To account for 10 days of UDS backups (assuming 13 volumes), set the global-max-snapshots-per-block-volume to a minimum of 10
- Consider setting a higher global-max-snapshots-per-block-volume to create a buffer that accommodates manual backups or restore testing (e.g., global-max-snapshots-per-block-volume=12)
If the following error is seen when creating a backup, the global-max-snapshots-per-block-volume needs to be adjusted:
name: /prometheus-kube-prometheus-stack-prometheus-0 message: /Error backing up item error: /error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=monitoring, name=velero-prometheus-kube-prometheus-stack-prometheus-db-prom2n67g): rpc error: code = Unknown desc= CSI got timed out with error: Failed to check and update snapshot content:\n failed to take snapshot of the volume 6e908637-1c40-41ab-a65b-0460b403e364: "rpc error: code = FailedPrecondition desc =\n the number of snapshots on the source volume 6e908637-1c40-41ab-a65b-0460b403e364 reaches the configured maximum (3)"
Create a VolumeSnapshotClass
In addition to the above CSI driver overrides, a VolumeSnapshotClass must be defined to tell Velero how to create snapshots. This can be achieved by creating a velero-config Zarf package that contains the VolumeSnapshotClass manifest, and having your uds-bundle.yaml deploy this package. The VolumeSnapshotClass defines the driver, which in the below example is vSphere.
Example VolumeSnapshotClass deployment:
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: vsphere-csi-snapshot-class
  labels:
    velero.io/csi-volumesnapshot-class: "true"
driver: csi.vsphere.vmware.com
deletionPolicy: Retain
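The velero-config Zarf package described above could look roughly like the following sketch. The package version, the component name, and the manifest file name (volumesnapshotclass.yaml, assumed to hold the VolumeSnapshotClass manifest) are hypothetical and chosen only for illustration.

```yaml
# zarf.yaml for a small package that ships only the VolumeSnapshotClass.
kind: ZarfPackageConfig
metadata:
  name: velero-config
  version: 0.1.0            # hypothetical version
components:
  - name: volume-snapshot-class   # hypothetical component name
    required: true
    manifests:
      - name: volume-snapshot-class
        files:
          # Hypothetical file name; contains the VolumeSnapshotClass manifest.
          # VolumeSnapshotClass is cluster-scoped, so no namespace is set here.
          - volumesnapshotclass.yaml
```

Once built, the package would be listed in your uds-bundle.yaml so that it is deployed with the rest of the bundle.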
Configure Velero for CSI Support
In the uds-bundle.yaml Velero overrides, you must enable the EnableCSI feature, set snapshotsEnabled to true, define the volumeSnapshotLocation as the CSI driver, and set snapshotVolumes to true.
Example uds-bundle.yaml core-backup-restore layer overrides:
overrides:
  velero:
    velero:
      values:
        - path: configuration.features
          value: EnableCSI
        - path: snapshotsEnabled
          value: true
        - path: configuration.volumeSnapshotLocation
          value:
            - name: default
              provider: velero.io/csi
        - path: schedules.udsbackup.template.snapshotVolumes
          value: true
Additional Tips
- When restoring specific namespaces, always use the --include-namespaces flag to avoid creating unnecessary VolumeSnapshotContents:
  velero restore create --from-backup <backup-name> --include-namespaces <namespace>
- Be cautious when deleting backups that have been used for restores, as this may attempt to delete VolumeSnapshotContents that are still in use by restored volumes.
- Velero’s garbage collection runs hourly by default. Ensure your TTL settings allow enough time for cleanup before hitting snapshot limits.
- The pyvmomi-community-samples repo contains several scripts that are useful for interacting with the vSphere client. In particular, the fcd_list_vdisk_snapshots script allows you to list snapshots stored in vSphere, even when they can’t be directly viewed in the vSphere UI. This comes in handy when snapshots and VolumeSnapshotContents get manually deleted from the cluster, but are not cleaned up appropriately in vSphere.
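To illustrate the TTL tip above, a backup TTL could be set through the same uds-bundle.yaml override mechanism. This is a sketch under assumptions: the path schedules.udsbackup.template.ttl is inferred from the schedules.udsbackup.template.snapshotVolumes path used earlier, and the 240h value simply expresses a 10-day retention. Verify both against your deployed Velero chart values before relying on them.

```yaml
overrides:
  velero:
    velero:
      values:
        # Assumed path, mirroring schedules.udsbackup.template.snapshotVolumes.
        - path: schedules.udsbackup.template.ttl
          # 10 days as a duration string; illustrative value only.
          value: "240h0m0s"
```

Because expired backups are only removed when garbage collection runs, a snapshot limit slightly above the number of retained backups leaves room for snapshots that have expired but not yet been cleaned up.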
Resources
Rancher vSphere Configuration Reference