General Business Continuity Plan for Managed OpenShift
| VSHN has its own BCM plan, which is not part of this documentation. |
Purpose
This document describes how OpenShift clusters operated by VSHN maintain service continuity in the event of a Cloud Service Provider (CSP) zone failure.
It outlines:
-
The shared responsibility model
-
The expected impact
-
The recovery strategy and procedures
This is a generalized scenario applicable across supported CSP environments.
Scope
This BC/DR scenario assumes:
-
A single CSP availability zone failure
-
Underlying CSP infrastructure in the affected zone is unavailable
-
Other zones/regions remain operational
-
Customer workloads are deployed according to recommended high-availability practices
Out of scope:
-
Single cluster failure (see: Disaster Recovery)
-
Full region or CSP failure
-
Application-level recovery guarantees
-
Customer-specific custom architectures
-
On-premises setups. If a customer’s data center fails, VSHN depends on the customer’s BCM for the underlying infrastructure.
Shared Responsibility Model
|
See the VSHN Managed OpenShift technical documentation for a general overview of the Shared Responsibility Model for Managed OpenShift. |
VSHN
-
Operates and maintains the OpenShift cluster
-
Monitors cluster health and infrastructure signals
-
Performs platform-level backup and recovery
-
Rebuilds clusters if required
-
Provides incident communication
|
If VSHN is managing applications for a customer, the respective VSHN team is responsible for the BC/DR of those applications and must enact their own procedures. |
Backup model
The platform includes:
-
Regular backups of cluster configuration (etcd and YAML representations of all Kubernetes objects in the cluster)
-
Encrypted storage of backups outside the primary zone
-
Object-level recovery capabilities
Important limitations:
-
Backups do not replace disaster recovery of infrastructure
-
Persistent volumes are not automatically backed up
-
Application data protection remains the responsibility of the customer
Platform behaviour
-
Multiple clusters become unavailable
-
No automatic recovery is possible
-
A manual recovery procedure is required
Recovery procedure
Overview
The recovery process follows a structured sequence:
-
Detection of failure
-
Incident declaration
-
Impact assessment
-
For each customer:
  -
  Cluster rebuild and restoration
  -
  Trigger customer BC/DR procedures
  -
  Application restore by the customer
  -
  Validation with the customer
-
Validation and closure
A visual representation of the recovery workflow is shown below.
Detection
-
Automated monitoring detects node and infrastructure failure
-
Alerts are triggered and evaluated
Impact assessment
-
Determine whether clusters are fully unavailable, and rule out a network outage as the cause
-
Verify control plane reachability
-
Assess data and backup availability
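As a rough illustration of the control plane reachability check, the following Python sketch attempts a TCP connection to a cluster's API endpoint. The function name, the timeout and the hostname in the usage comment are assumptions made for illustration and do not describe VSHN's actual tooling; port 6443 is the usual OpenShift API server port. A failed connection on its own does not distinguish a zone failure from a network outage, so the result is only one signal among several.

```python
import socket


def control_plane_reachable(host: str, port: int = 6443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the API endpoint can be opened.

    This only checks network-level reachability. It does not verify that
    the API server is healthy or that it can serve requests.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


# Hypothetical usage (the hostname is a placeholder):
#   control_plane_reachable("api.cluster.example.com")
```

Running the same check from more than one network location helps to tell an unreachable cluster apart from a connectivity problem on the observer's side.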
Recovery actions
In a CSP zone failure scenario, clusters are assumed to be unavailable.
Determine the priority of the clusters to be recovered. Generally, the order is:
-
Production
-
Test
-
Others
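The priority order above can be sketched as a simple sort. This is a minimal Python illustration, assuming each cluster carries an environment label; the `Cluster` type, the label values and the cluster names are hypothetical and not taken from VSHN's inventory.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Assumed ranks: lower values are recovered first.
# Any environment not listed here falls into "Others".
PRIORITY = {"production": 0, "test": 1}
OTHERS_RANK = 2


@dataclass(frozen=True)
class Cluster:
    name: str
    environment: str  # hypothetical label, e.g. "production", "test", "dev"


def recovery_order(clusters: Iterable[Cluster]) -> List[Cluster]:
    """Sort clusters into recovery order: production, then test, then others.

    sorted() is stable, so clusters with the same rank keep their input order.
    """
    return sorted(
        clusters,
        key=lambda c: PRIORITY.get(c.environment.lower(), OTHERS_RANK),
    )
```

For example, given clusters labelled `dev`, `test` and `production`, `recovery_order` returns the production cluster first, the test cluster second and the dev cluster last.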
Recovery is then performed for each affected cluster as follows, in the priority order determined above:
-
Provision a new cluster in a healthy zone or alternative CSP location
-
Restore cluster configuration from available backups
-
Reconfigure access and networking
-
Customer redeploys workloads
-
Customer restores application data
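The per-cluster steps above can also be written down as data, which makes the handover point between VSHN and the customer explicit. The following Python sketch is illustrative only: the types and the `handover_index` helper are hypothetical, and assigning the first three steps to VSHN is an assumption inferred from the Shared Responsibility Model section, since the list itself only names the owner of the last two steps.

```python
from enum import Enum
from typing import NamedTuple, Sequence


class Owner(Enum):
    VSHN = "VSHN"
    CUSTOMER = "Customer"


class Step(NamedTuple):
    owner: Owner
    description: str


# Ownership of the first three steps is an assumption (see lead-in).
RECOVERY_STEPS = (
    Step(Owner.VSHN, "Provision a new cluster in a healthy zone or alternative CSP location"),
    Step(Owner.VSHN, "Restore cluster configuration from available backups"),
    Step(Owner.VSHN, "Reconfigure access and networking"),
    Step(Owner.CUSTOMER, "Redeploy workloads"),
    Step(Owner.CUSTOMER, "Restore application data"),
)


def handover_index(steps: Sequence[Step]) -> int:
    """Return the index of the first customer-owned step, or len(steps) if none."""
    for index, step in enumerate(steps):
        if step.owner is Owner.CUSTOMER:
            return index
    return len(steps)
```

With the steps as listed, `handover_index(RECOVERY_STEPS)` points at the workload redeployment step, which is where the customer's own BC/DR procedures take over.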
Important for our customers:
-
Recovery is based on cluster rebuild, not failover
-
Backup restoration is limited to available and compatible data
-
Application data recovery is the responsibility of the customer
Recovery objectives
If the customer wishes to define recovery objectives for a zone failure, they should do so in cooperation with VSHN and their partners. Please reach out to your account manager for more information.
Key limitations
-
Infrastructure failures cannot be mitigated by backups alone
-
Single-zone deployments have no inherent resilience
-
Application data recovery depends entirely on customer design
-
Recovery time depends on CSP recovery and cluster rebuild times
-
Precise recovery time objectives depend on the maturity level of the chosen CSP