General Business Continuity Plan for Managed OpenShift

VSHN has its own BCM plan, which is not part of this documentation.

Purpose

This document describes how OpenShift clusters operated by VSHN maintain service continuity in the event of a Cloud Service Provider (CSP) zone failure.

It outlines:

  • The shared responsibility model

  • The expected impact

  • The recovery strategy and procedures

This is a generalized scenario applicable across supported CSP environments.

Scope

This BC/DR scenario assumes:

  • A single CSP availability zone failure

  • Underlying CSP infrastructure in the affected zone is unavailable

  • Other zones/regions remain operational

  • Customer workloads are deployed according to recommended high-availability practices

Out of scope:

  • Single cluster failure (see: Disaster Recovery)

  • Full region or CSP failure

  • Application-level recovery guarantees

  • Customer-specific custom architectures

  • On-Prem setups. In the event of a customer’s data center failure, VSHN depends on the customer’s BCM for the underlying infrastructure.

Shared Responsibility Model

See the VSHN Managed OpenShift technical documentation for a general overview of the Shared Responsibility Model for Managed OpenShift.

VSHN

  • Operates and maintains the OpenShift cluster

  • Monitors cluster health and infrastructure signals

  • Performs platform-level backup and recovery

  • Rebuilds clusters if required

  • Provides incident communication

If VSHN manages applications for a customer, the respective VSHN team is responsible for the BC/DR of those applications and must enact its own procedures.

Customer

  • Ensures application-level backup and restore procedures

  • Defines RTO and RPO requirements with VSHN and partners

  • Validates restoration success

Cloud Service Provider (CSP)

  • Provides compute, storage and network infrastructure

  • Restores failed zones

Backup model

The platform includes:

  • Regular backups of cluster configuration (etcd and YAML representations of all Kubernetes objects in the cluster)

  • Encrypted storage of backups outside the primary zone

  • Object-level recovery capabilities

Important limitations:

  • Backups do not replace disaster recovery of infrastructure

  • Persistent volumes are not automatically backed up

  • Application data protection remains the responsibility of the customer
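To make the object-level backup model concrete, the following is a minimal illustrative sketch of exporting Kubernetes object representations to per-namespace files. The function name, input shape, and JSON output format are assumptions for illustration only; an actual platform backup pulls objects from the API server, stores YAML representations, and takes etcd snapshots alongside them.

```python
import json
from pathlib import Path

def export_objects(objects, backup_dir):
    """Write each Kubernetes object to <backup_dir>/<namespace>/<kind>/<name>.json.

    `objects` is a list of dicts shaped like Kubernetes API objects
    (hypothetical input; a real backup would fetch these from the API
    server and store YAML next to an etcd snapshot).
    """
    backup_dir = Path(backup_dir)
    paths = []
    for obj in objects:
        meta = obj["metadata"]
        # Cluster-scoped objects have no namespace; group them separately.
        ns = meta.get("namespace", "_cluster")
        target = backup_dir / ns / obj["kind"] / f"{meta['name']}.json"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(json.dumps(obj, indent=2))
        paths.append(target)
    return paths
```

Storing one file per object is what enables the object-level recovery mentioned above: a single ConfigMap or Deployment can be restored without replaying the whole backup.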

Failure scenario

CSP zone outage

A CSP availability zone becomes unavailable.

Impact

  • Nodes in the affected zone are lost

  • Running workloads are terminated

  • Storage in the affected zone may be unavailable

  • Cluster capacity is lost

Platform behaviour

  • Multiple clusters become unavailable

  • No automatic recovery possible

  • Manual recovery procedure required

Recovery procedure

Overview

The recovery process follows a structured sequence:

  • Detection of failure

  • Incident declaration

  • Impact assessment

  • For each customer:

    • Cluster rebuild and restoration

    • Trigger customer BC/DR procedures

    • Application restore by the customer

    • Validation with the customer

  • Validation and closure

A visual representation of the recovery workflow is shown below.

Recovery Procedure Diagram

Detection

  • Automated monitoring detects node and infrastructure failure

  • Alerts are triggered and evaluated
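The detection step can be sketched as a simple threshold rule over node reachability per zone. The function, data shape, and threshold below are illustrative assumptions, not VSHN's actual alerting rules:

```python
def zone_failure_suspected(node_status, zone, threshold=1.0):
    """Return True if the share of unreachable nodes in `zone` meets the threshold.

    `node_status` maps node name -> (zone, reachable). A threshold of 1.0
    means every node in the zone must be down before a zone failure is
    suspected (illustrative default).
    """
    in_zone = [reachable for z, reachable in node_status.values() if z == zone]
    if not in_zone:
        return False  # no nodes in this zone, nothing to conclude
    down = sum(1 for reachable in in_zone if not reachable)
    return down / len(in_zone) >= threshold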

Response

  • Incident is declared

  • Impacted clusters are identified

  • Customers are notified

Impact assessment

  • Determine whether clusters are fully unavailable and rule out a network outage as the cause

  • Verify control plane reachability

  • Assess data and backup availability

Recovery actions

In a CSP zone failure scenario, clusters are assumed to be unavailable.

Determine the recovery priority of the affected clusters. Generally this means:

  1. Production

  2. Test

  3. Others
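This priority scheme can be expressed as a small ordering function. The environment labels and tuple shape are illustrative assumptions:

```python
# Lower value = recovered earlier; anything unlisted falls into "Others".
RECOVERY_PRIORITY = {"production": 1, "test": 2}

def recovery_order(clusters):
    """Order affected clusters: production first, then test, then others.

    `clusters` is a list of (name, environment) tuples; the environment
    labels are illustrative placeholders.
    """
    return sorted(clusters, key=lambda c: RECOVERY_PRIORITY.get(c[1], 3))
```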

Recovery is performed for each affected cluster as follows, in the order of priority determined above:

  • Provision a new cluster in a healthy zone or alternative CSP location

  • Restore cluster configuration from available backups

  • Reconfigure access and networking

  • Customer redeploys workloads

  • Customer restores application data

Important for our customers:

  • Recovery is based on cluster rebuild, not failover

  • Backup restoration is limited to available and compatible data

  • Application data recovery is the responsibility of the customer

Validation

For each customer:

  • Cluster health is verified

  • Core platform components are operational (API, Monitoring)

  • Customer workloads are confirmed running

  • Customer validation is requested
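The per-customer validation above can be modelled as a simple checklist that only passes when every required verification succeeds. The check names are illustrative placeholders for the items listed, not an actual VSHN tool:

```python
# Illustrative stand-ins for: cluster health, core components, workloads.
REQUIRED_CHECKS = ("api_reachable", "monitoring_up", "workloads_running")

def validation_passed(checks):
    """Return (ok, missing): ok is True only if every required check passed.

    `checks` maps check name -> bool; a missing key counts as a failed check.
    """
    missing = [name for name in REQUIRED_CHECKS if not checks.get(name, False)]
    return (not missing, missing)
```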

Closure

  • Incident is closed

  • Post-incident review is performed

  • Identified improvements are tracked

Important limitation:

  • Full recovery depends on backup availability and CSP capacity in alternative zones or providers

Recovery objectives

If the customer wishes to define recovery objectives for a zone-failure event, they should do so in cooperation with VSHN and their partners. Please reach out to your account manager for more information.

Communication

During incidents, VSHN provides:

  • Status updates

  • Impact assessment

  • Recovery progress

Key limitations

  • Infrastructure failures cannot be mitigated by backups alone

  • Single-zone deployments have no inherent resilience

  • Application data recovery depends entirely on customer design

  • Recovery time depends on CSP recovery and cluster rebuild times

  • Precise recovery time objectives depend on the maturity level of the chosen CSP