General Business Continuity Plan for Managed OpenShift

VSHN has its own BCM plan, which is not part of this documentation.

Purpose

This document describes how OpenShift clusters operated by VSHN maintain service continuity in the event of a Cloud Service Provider (CSP) zone failure.

It outlines:

  • The shared responsibility model

  • The expected impact

  • The recovery strategy and procedures

This is a generalized scenario applicable across supported CSP environments.

Scope

This BC/DR scenario assumes:

  • A single CSP availability zone failure

  • Underlying CSP infrastructure in the affected zone is unavailable

  • Other zones/regions remain operational

  • Customer workloads are deployed according to recommended high-availability practices

Out of scope:

  • Single cluster failure (see: Disaster Recovery)

  • Full region or CSP failure

  • Application-level recovery guarantees

  • Customer-specific custom architectures

  • On-Prem setups. In the event of a customer’s data center failure, VSHN depends on the customer’s BCM for the underlying infrastructure.

Shared Responsibility Model

See the VSHN Managed OpenShift technical documentation for a general overview of the Shared Responsibility Model for Managed OpenShift.

VSHN

  • Operates and maintains the OpenShift cluster

  • Monitors cluster health and infrastructure signals

  • Performs platform-level backup and recovery

  • Rebuilds clusters if required

  • Provides incident communication

If VSHN manages applications for a customer, the respective VSHN team is responsible for the BC/DR of those applications and must enact its own procedures.

Customer

  • Ensures application-level backup and restore procedures

  • Defines RTO and RPO requirements with VSHN and partners

  • Validates restoration success

Cloud Service Provider (CSP)

  • Provides compute, storage and network infrastructure

  • Restores failed zones

Backup model

The platform includes:

  • Regular backups of cluster configuration (etcd and YAML representations of all Kubernetes objects in the cluster)

  • Encrypted storage of backups outside the primary zone

  • Object-level recovery capabilities

Important limitations:

  • Backups do not replace disaster recovery of infrastructure

  • Persistent volumes are not automatically backed up

  • Application data protection remains the responsibility of the customer
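To make the object-level backup model concrete, the following is a minimal illustrative sketch of exporting Kubernetes object representations to per-namespace files. The function name, input shape, and JSON output format are assumptions for illustration only; an actual platform backup pulls objects from the API server, stores YAML representations, and takes etcd snapshots alongside them.

```python
import json
from pathlib import Path

def export_objects(objects, backup_dir):
    """Write each Kubernetes object to <backup_dir>/<namespace>/<kind>/<name>.json.

    `objects` is a list of dicts shaped like Kubernetes API objects
    (hypothetical input; a real backup would fetch these from the API
    server and store YAML next to an etcd snapshot).
    """
    backup_dir = Path(backup_dir)
    paths = []
    for obj in objects:
        meta = obj["metadata"]
        # Cluster-scoped objects have no namespace; group them separately.
        ns = meta.get("namespace", "_cluster")
        target = backup_dir / ns / obj["kind"] / f"{meta['name']}.json"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(json.dumps(obj, indent=2))
        paths.append(target)
    return paths
```

Storing one file per object is what enables the object-level recovery mentioned above: a single ConfigMap or Deployment can be restored without replaying the whole backup.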

Failure scenario

CSP zone outage

A CSP availability zone becomes unavailable.

Impact

  • Nodes in the affected zone are lost

  • Running workloads are terminated

  • Storage in the affected zone may be unavailable

  • Cluster capacity is lost

Platform behaviour

  • Multiple clusters become unavailable

  • No automatic recovery possible

  • Manual recovery procedure required

Recovery procedure

Overview

The recovery process follows a structured sequence:

  • Detection of failure

  • Incident declaration

  • Impact assessment

  • For each customer:

    • Cluster rebuild and restoration

    • Trigger customer BC/DR procedures

    • Application restore by the customer

    • Validation with the customer

  • Validation and closure

A visual representation of the recovery workflow is shown below.

Recovery Procedure Diagram

Detection

  • Automated monitoring detects node and infrastructure failure

  • Alerts are triggered and evaluated
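The detection step can be sketched as a simple threshold rule over node reachability per zone. The function, data shape, and threshold below are illustrative assumptions, not VSHN's actual alerting rules:

```python
def zone_failure_suspected(node_status, zone, threshold=1.0):
    """Return True if the share of unreachable nodes in `zone` meets the threshold.

    `node_status` maps node name -> (zone, reachable). A threshold of 1.0
    means every node in the zone must be down before a zone failure is
    suspected (illustrative default).
    """
    in_zone = [reachable for z, reachable in node_status.values() if z == zone]
    if not in_zone:
        return False  # no nodes in this zone, nothing to conclude
    down = sum(1 for reachable in in_zone if not reachable)
    return down / len(in_zone) >= threshold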

Response

  • Incident is declared

  • Impacted clusters are identified

  • Customers are notified

Impact assessment

  • Determine whether clusters are fully unavailable and rule out a network outage as the cause

  • Verify control plane reachability

  • Assess data and backup availability

Recovery actions

In a CSP zone failure scenario, clusters are assumed to be unavailable.

Determine the recovery priority of the affected clusters. Generally this means:

  1. Production

  2. Test

  3. Others
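This priority scheme can be expressed as a small ordering function. The environment labels and tuple shape are illustrative assumptions:

```python
# Lower value = recovered earlier; anything unlisted falls into "Others".
RECOVERY_PRIORITY = {"production": 1, "test": 2}

def recovery_order(clusters):
    """Order affected clusters: production first, then test, then others.

    `clusters` is a list of (name, environment) tuples; the environment
    labels are illustrative placeholders.
    """
    return sorted(clusters, key=lambda c: RECOVERY_PRIORITY.get(c[1], 3))
```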

Recovery is performed for each affected cluster as follows, in the order of priority determined above:

  • Provision a new cluster in a healthy zone or alternative CSP location

  • Restore cluster configuration from available backups

  • Reconfigure access and networking

  • Customer redeploys workloads

  • Customer restores application data

Important for our customers:

  • Recovery is based on cluster rebuild, not failover

  • Backup restoration is limited to available and compatible data

  • Application data recovery is the responsibility of the customer

Validation

For each customer:

  • Cluster health is verified

  • Core platform components are operational (API, Monitoring)

  • Customer workloads are confirmed running

  • Customer validation is requested
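The per-customer validation above can be modelled as a simple checklist that only passes when every required verification succeeds. The check names are illustrative placeholders for the items listed, not an actual VSHN tool:

```python
# Illustrative stand-ins for: cluster health, core components, workloads.
REQUIRED_CHECKS = ("api_reachable", "monitoring_up", "workloads_running")

def validation_passed(checks):
    """Return (ok, missing): ok is True only if every required check passed.

    `checks` maps check name -> bool; a missing key counts as a failed check.
    """
    missing = [name for name in REQUIRED_CHECKS if not checks.get(name, False)]
    return (not missing, missing)
```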

Closure

  • Incident is closed

  • Post-incident review is performed

  • Identified improvements are tracked

Important limitation:

  • Full recovery depends on backup availability and CSP capacity in alternative zones or providers

Recovery objectives

If the customer wishes to define recovery objectives for a zone-failure event, they should do so in cooperation with VSHN and their partners. Please reach out to your account manager for more information.

Communication

During incidents, VSHN provides:

  • Status updates

  • Impact assessment

  • Recovery progress

Key limitations

  • Infrastructure failures cannot be mitigated by backups alone

  • Single-zone deployments have no inherent resilience

  • Application data recovery depends entirely on customer design

  • Recovery time depends on CSP recovery and cluster rebuild times

  • Precise recovery time objectives depend on the maturity level of the chosen CSP