Business Continuity and Disaster Recovery (BCDR) Planning for IT Professionals

The creation and implementation of a fully tested BCDR plan that is ready for a failover event closely resembles any other IT implementation plan, as well as other disaster response plans.

It is wise to consult or even adapt existing IT project planning and risk management methodologies.

This section highlights activities and concerns that are relevant to cloud BCDR.

When organizations incorporate IT systems and cloud solutions on an ongoing basis, creating and reevaluating BCDR plans should be a defined and documented process.

The Scope of the BCDR Plan

The BCDR plan and its implementation are embedded in an information security strategy, which encompasses clearly defined roles, risk assessment, classification, policy, awareness, and training.

It makes sense to consider BCDR as an intrinsic part of the IT service that is regularly invoked, if only for testing purposes.

Gathering Requirements and Context

The requirements that serve as input for BCDR planning include the identification of critical business processes and their dependence on specific data and services.

The characteristics, descriptions, and service agreements (if any) of these services and systems are required in the analysis.

Input to the analysis and design of BCDR solutions also includes a list of risks and threats that can negatively affect any important business processes.

This threat model should include the failure of any CSPs.

Business strategy influences the acceptable RTO and RPO values.

Finally, requirements for BCDR may derive from company internal policies and procedures as well as from applicable legal, statutory, or regulatory compliance obligations.

Analysis of the Plan

The purpose of the analysis phase is to translate BCDR requirements into input to be used in the design phase.

The most important inputs for the design phase are scope, requirements, budget, and performance objectives.

Business requirements and the threat model should be analyzed for completeness and consistency and then translated into an identification of the assets at risk.

From there, the resources needed to mitigate those risks can be specified.

This includes the identification of all dependencies, including processes, applications, business partners, and third-party service providers.

For example, what are the technical components and underlying services of an application operated in-house that would need to be replicated in a BCDR facility?

Analysis should identify any opportunities for decoupling systems and services and breaking any common failure modes.

The capabilities of the current providers in delivering resources to the BCDR solution should be investigated.

Performance requirements such as bandwidth and offsite storage needs derive from the assets at risk.

Careful analysis and assessment should be undertaken with the objective of minimizing these performance requirements.

Risk Assessment

In the same way that any IT solution should be assessed for residual risk, BCDR solutions should be assessed for residual risks. Some risks have been elaborated on in earlier topics.

All scenarios involve evaluation of the CSP’s capability to deliver.

The typical challenges include the following:

The elasticity of the CSP: Can the CSP provide all the resources if BCDR is invoked?

Contractual issues: Will any new CSP address all contractual issues and SLA requirements?

Available network bandwidth for timely replication of data.

Available bandwidth between the impacted user base and the BCDR locations.

Legal and licensing risks: There may be legal or licensing constraints that prohibit the data or functionality from being present in the backup location.
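The bandwidth items in the list above lend themselves to a rough calculation. The following is a minimal sketch, not a sizing method taken from this text: the function names, the decimal unit conversion, and the default efficiency factor of 0.7 (an allowance for protocol overhead and link sharing) are all illustrative assumptions.

```python
def replication_hours(changed_gb: float, link_mbps: float, efficiency: float = 0.7) -> float:
    """Estimate the hours needed to copy `changed_gb` gigabytes over a link.

    `efficiency` is an assumed fraction of the nominal link speed that is
    actually usable for replication traffic.
    """
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_mbps must be positive and efficiency in (0, 1]")
    megabits = changed_gb * 8 * 1000  # decimal units: 1 GB = 8,000 megabits
    seconds = megabits / (link_mbps * efficiency)
    return seconds / 3600


def fits_rpo(changed_gb: float, link_mbps: float, rpo_hours: float,
             efficiency: float = 0.7) -> bool:
    """Return True if the changed data can be shipped within the RPO window."""
    return replication_hours(changed_gb, link_mbps, efficiency) <= rpo_hours
```

For instance, under these assumptions 90 GB of changed data over a fully usable 100 Mbps link takes about two hours, so it would fit a four-hour RPO but not a one-hour RPO. A real assessment would also need to account for peak change rates and competing traffic.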

Plan Design

The objective of the design phase is to establish and evaluate candidate architecture solutions.

The approaches and their components have been illustrated in earlier topics.

This design phase should not just result in technical alternatives but also flesh out procedures and workflow.

As with any IT service or system, the BCDR solution should have a clear owner with a clear role and mandate in the organization who is accountable for the correct setup and maintenance of the BCDR capability.

Following are additional BCDR-specific questions that should be addressed in the design phase:

How will the BCDR solution be invoked?

What is the manual or automated procedure for invoking the failover services?

How will the business use of the service be affected during the failover, if at all?

How will the BCDR be tested?

Other Plan Considerations

Once the design of the BCDR solution is ready, work will start on implementing the solution.

This is likely to require work both on the primary solution platform and on the DR platform.

On the primary platform, these activities are likely to include the implementation of functionality for enabling data replication on a regular or continuous schedule and functionality to automatically monitor for any contingency that might arise and raise a failover event.
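As a hedged illustration of the monitoring functionality described above, the sketch below raises a failover event only after several consecutive failed health checks, so that a single transient error does not trigger a switchover. The class name, the threshold of three, and the idea of feeding it boolean health-check results are assumptions made for this example, not details from the text.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Tracks health-check results and raises one failover event per outage."""
    threshold: int = 3              # assumed number of consecutive failures required
    consecutive_failures: int = 0
    failover_raised: bool = False

    def record(self, healthy: bool) -> bool:
        """Record one check; return True only when this check raises the event."""
        if healthy:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold and not self.failover_raised:
            self.failover_raised = True
            return True
        return False


monitor = FailoverMonitor(threshold=3)
events = [monitor.record(ok) for ok in [False, False, True, False, False, False, False]]
```

In this run only the sixth check raises the event: the healthy third check resets the counter, and the seventh check does not raise the event again. Whether the event starts an automated failover or pages an operator is a design decision for the questions listed in the design phase.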

On the DR platform, the required infrastructure and services need to be built up and brought into trial production mode.

Care must be taken not only to make the required infrastructure and services available but also to ensure that the DR platform tracks any relevant changes and functional updates made on the primary platform.
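One simple way to check that the DR platform is keeping up with the primary is to compare an inventory of component versions from each side. The sketch below assumes such inventories are available as plain dictionaries; the function name and the component names are hypothetical.

```python
from typing import Dict, Optional, Tuple


def find_drift(primary: Dict[str, str],
               dr: Dict[str, str]) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return components whose version differs between primary and DR.

    Each value is a (primary_version, dr_version) pair; None means the
    component is absent on that platform.
    """
    drift = {}
    for name in sorted(set(primary) | set(dr)):
        p_version, d_version = primary.get(name), dr.get(name)
        if p_version != d_version:
            drift[name] = (p_version, d_version)
    return drift


# Hypothetical inventories for illustration only.
primary_inventory = {"app": "2.4.1", "db": "14.2", "cache": "7.0"}
dr_inventory = {"app": "2.4.0", "db": "14.2"}
report = find_drift(primary_inventory, dr_inventory)
```

Here the report flags the application (running an older version on the DR platform) and the cache (missing on the DR platform), while the matching database version is not reported. Running a check like this as part of regular change management is one way to keep the two platforms aligned.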

Additionally, it is advisable to include all DR-related infrastructure and services in the regular IT services management.

Planning, Exercising, Assessing, and Maintaining the Plan

Once the plan has been completed and the recovery strategies have been fully implemented, it is important to test all parts of the plan to validate that it would work in a real event.

The testing policy should include enterprise-wide testing strategies that establish expectations for individual business lines.

Business lines include all internal and external supporting functions, such as IT and facilities management.

The testing strategy should include the following:

Expectations for business lines and support functions to demonstrate the achievement of business continuity test objectives consistent with the business impact analysis (BIA) and risk assessment

A description of the depth and breadth of testing to be accomplished

The involvement of staff, technology, and facilities

Expectations for testing internal and external interdependencies

An evaluation of the reasonableness of assumptions used in developing the testing strategy

Testing strategies should include the testing scope and objectives, which clearly define which functions, systems, or processes are going to be tested and what will constitute a successful test.

The objective of a testing program is to ensure that the business continuity planning (BCP) process is accurate, relevant, and viable under adverse conditions.

Therefore, the BCP process should be tested at least annually, with more frequent testing required when significant changes have occurred in business operations.

Testing should include applications and business functions that were identified during the BIA.

The BIA determines the recovery point objectives and recovery time objectives, which then help determine the appropriate recovery strategy. Validation of the RPOs and RTOs is important to ensure that they are attainable.
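Validating RPOs and RTOs amounts to comparing the objectives set by the BIA with what an exercise actually measured. The following is a minimal sketch under that reading; the class names, field names, and sample services are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass(frozen=True)
class RecoveryObjective:
    service: str
    rto: timedelta  # maximum tolerable time to restore the service
    rpo: timedelta  # maximum tolerable window of lost data


@dataclass(frozen=True)
class ExerciseResult:
    service: str
    recovery_time: timedelta     # measured time to restore during the exercise
    data_loss_window: timedelta  # measured age of the most recent recovered data


def unmet_objectives(objectives: List[RecoveryObjective],
                     results: List[ExerciseResult]) -> List[str]:
    """List every objective that the exercise did not demonstrate."""
    by_service = {r.service: r for r in results}
    findings = []
    for obj in objectives:
        result = by_service.get(obj.service)
        if result is None:
            findings.append(f"{obj.service}: not exercised")
            continue
        if result.recovery_time > obj.rto:
            findings.append(f"{obj.service}: RTO missed")
        if result.data_loss_window > obj.rpo:
            findings.append(f"{obj.service}: RPO missed")
    return findings


objectives = [
    RecoveryObjective("billing", rto=timedelta(hours=4), rpo=timedelta(minutes=15)),
    RecoveryObjective("email", rto=timedelta(hours=8), rpo=timedelta(hours=1)),
]
results = [ExerciseResult("billing", timedelta(hours=5), timedelta(minutes=10))]
findings = unmet_objectives(objectives, results)
```

With these sample figures, billing misses its RTO (five hours measured against a four-hour objective) but meets its RPO, and email is reported as not exercised. Findings like these indicate either that the recovery strategy needs more resources or that the objectives themselves are not attainable as stated.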

Testing objectives should start simply and gradually increase in complexity and scope.

The scope of individual tests can be continually expanded to eventually encompass enterprise-wide testing and testing with vendors and key market participants.

Achieving the following objectives provides progressive levels of assurance and confidence in the plan.

At a minimum, the testing scope and objectives should do the following:

Ensure support for normal business operations

Gradually increase the complexity, level of participation, functions, and physical locations involved

Demonstrate a variety of management and response proficiencies under simulated crisis conditions, progressively involving more resources and participants

Uncover inadequacies so that testing procedures can be revised

Consider deviating from the test script to interject unplanned events, such as the loss of key individuals or services

Involve a sufficient volume of all types of transactions to ensure adequate capacity and functionality of the recovery facility

The testing policy should also include test planning, which is based on the predefined testing scope and objectives established as part of management’s testing strategies.

Test planning includes test plan review procedures and the development of various testing scenarios and methods.

Management should evaluate the risks and merits of various types of testing scenarios and develop test plans based on identified recovery needs.

Test plans should identify quantifiable measurements of each test objective and should be reviewed before the test to ensure they can be implemented as designed.

Test scenarios should include a variety of threats, event types, and crisis management situations and should vary from isolated system failures to wide-scale disruptions.

Scenarios should also promote testing alternate facilities with the primary and alternate facilities of key counterparties and third-party service providers.

Comprehensive test scenarios focus attention on dependencies, both internal and external, between critical business functions, information systems, and networks.

Integrated testing moves beyond the testing of individual components to include testing with internal and external parties and the supporting systems, processes, and resources.

As such, test plans should include scenarios addressing local and wide-scale disruptions, as appropriate.

Business line management should develop scenarios to effectively test internal and external interdependencies, with the assistance of IT staff members who are knowledgeable about application data flows and other areas of vulnerability.

Organizations should periodically reassess and update their test scenarios to reflect changes in the organization’s business and operating environments.

Test plans should communicate the predefined test scope and objectives and give participants relevant information, such as the following:

A master test schedule that encompasses all test objectives

Specific descriptions of test objectives and methods

Roles and responsibilities for all test participants, including support staff

Designation of test participants

Test decision-makers and succession plans

Test locations

Test escalation conditions and test contact information

Test Plan Review

Management should prepare and review a script for each test before testing to identify weaknesses that could lead to unsatisfactory or invalid tests.

As part of the review process, the testing plan should be revised to account for any changes to key personnel, policies, procedures, facilities, equipment, outsourcing relationships, vendors, or other components that affect a critical business function.

In addition, as a preliminary step to the testing process, management should perform a thorough review of the BCP, known as a checklist review.

A checklist review involves distributing copies of the BCP to the managers of each critical business unit and requesting that they review portions of the plan applicable to their department to ensure that the procedures are comprehensive and complete.

It is often wise to stop using the word test for this and begin to use the word exercise.

The reason to call them exercises is that when the word test is used, people think pass or fail.

There is no way to fail a contingency test. If the security professionals knew that it worked, they would not bother to test it.

The reason to test is to find out what does not work so issues can be fixed before a disaster happens for real.

Testing methods can vary from simple to complex depending on the preparation and resources required.

Each has its own characteristics, objectives, and benefits.

The type or combination of testing methods employed by an organization should be determined by, among other things, the organization’s age and experience with BCP, size, complexity, and the nature of its business.

Testing methods include both business recovery and DR exercises.

Business recovery exercises primarily focus on testing business line operations, whereas DR exercises focus on testing the continuity of technology components, including systems, networks, applications, and data.

To test split processing configurations, in which two or more sites support part of a business line’s workload, tests should include the transfer of work among processing sites to demonstrate that alternate sites can effectively support customer-specific requirements and work volumes, and site-specific business processes.

A comprehensive test should involve processing a full day’s work at peak volumes to ensure that equipment capacity is available and that RTOs and RPOs can be achieved.

More rigorous testing methods and greater frequency of testing provide greater confidence in the continuity of business functions.

Although comprehensive tests do require greater investments of time, resources, and coordination to implement, detailed testing more accurately depicts a true disaster and assists management in assessing the actual responsiveness of the individuals involved in the recovery process.

Furthermore, comprehensive testing of all critical functions and applications allows management to identify potential problems; therefore, management should use one of the more thorough testing methods discussed in this section to ensure the viability of the BCP before a disaster occurs.

The security professional can conduct many different types of exercises.

Some take minutes, whereas others take hours or days.

The amount of exercise planning needed depends entirely on the type, length, and scope of the exercise the security professional plans to conduct.

The most common types of exercises are call exercises, walk-through exercises, simulated or actual exercises, and compact exercises.

Tabletop Exercise/Structured Walk-Through Test

A tabletop exercise/structured walk-through test is considered a preliminary step in the overall testing process and may be used as an effective training tool; however, it is not a preferred testing method.

Its primary objective is to ensure that critical personnel from all areas are familiar with the BCP and that the plan accurately reflects the organization’s ability to recover from a disaster.

This exercise/test is characterized by the following:

Attendance of business unit management representatives and employees who play a critical role in the BCP process

Discussion about each person’s responsibilities as defined by the BCP

Individual and team training, which includes a walk-through of the step-by-step procedures outlined in the BCP

Clarification and highlighting of critical plan elements, as well as problems noted during testing

Walk-Through Drill/Simulation Test

A walk-through drill/simulation test is somewhat more involved than a tabletop exercise/structured walk-through test because the participants choose a specific event scenario and apply the BCP to it. It includes the following:

Attendance by all operational and support personnel who are responsible for implementing the BCP procedures

Practice and validation of specific functional response capabilities

Focus on the demonstration of knowledge and skills, as well as team interaction and decision-making capabilities

Role-playing with the simulated response at alternate locations to act out critical steps, recognize difficulties, and resolve problems in a nonthreatening environment

Mobilization of all or some of the crisis management and response team to practice proper coordination without performing actual recovery processing

Varying degrees of actual, as opposed to simulated, notification and resource mobilization to reinforce the content and logic of the plan

Functional Drill/Parallel Test

A functional drill/parallel test is the first type that involves the actual mobilization of personnel to other sites in an attempt to establish communications and perform actual recovery processing as outlined in the BCP.

The goal is to determine whether critical systems can be recovered at the alternate processing site and whether employees can deploy the procedures defined in the BCP.

A functional drill/parallel test encompasses the following:

A full test of the BCP, which involves all employees

Demonstration of emergency management capabilities of several groups practicing a series of interactive functions, such as direction, control, assessment, operations, and planning

Testing medical response and warning procedures

Response(s) to alternate locations or facilities using actual communications capabilities

Mobilization of personnel and resources at varied geographical sites, including evacuation drills in which employees test the evacuation route and procedures for personnel accountability

Varying degrees of actual, as opposed to simulated, notification and resource mobilization in which parallel processing is performed and transactions are compared to production results
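The comparison of parallel-processing output with production results, mentioned in the last item above, can be sketched as a reconciliation of two result sets. The function name and the representation of results as dictionaries keyed by transaction ID are assumptions for this example.

```python
from typing import Dict, List


def compare_parallel_run(production: Dict[str, str],
                         recovery: Dict[str, str]) -> Dict[str, List[str]]:
    """Reconcile transaction results from the recovery site against production.

    Both inputs map a transaction ID to its processed result.
    """
    prod_ids, rec_ids = set(production), set(recovery)
    return {
        # processed in production but absent at the recovery site
        "missing": sorted(prod_ids - rec_ids),
        # present at the recovery site but never seen in production
        "unexpected": sorted(rec_ids - prod_ids),
        # processed at both sites with different results
        "mismatched": sorted(t for t in prod_ids & rec_ids
                             if production[t] != recovery[t]),
    }


# Hypothetical results for illustration only.
production_results = {"tx1": "approved", "tx2": "declined", "tx3": "approved"}
recovery_results = {"tx1": "approved", "tx2": "approved", "tx4": "approved"}
report = compare_parallel_run(production_results, recovery_results)
```

In this example tx3 is missing at the recovery site, tx4 is unexpected, and tx2 was processed with a different result. An empty report across a representative transaction volume is one quantifiable measure of a successful parallel test.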

Full-Interruption/Full-Scale Test

A full-interruption/full-scale test is the most comprehensive type of test. In a full-scale test, a real-life emergency is simulated as closely as possible.

Therefore, comprehensive planning should be a prerequisite to this type of test to ensure that business operations are not negatively affected.

The organization implements all or portions of its BCP by processing data and transactions using backup media at the recovery site.

This test involves the following:

Enterprise-wide participation and interaction of internal and external management response teams with full involvement of external organizations

Validation of crisis response functions

Demonstration of knowledge and skills as well as management response and decision-making capability

On-the-scene execution of coordination and decision-making roles

Actual, as opposed to simulated, notifications, mobilization of resources, and communication of decisions

Activities conducted at actual response locations or facilities

The actual processing of data using backup media

Exercises generally extending over a longer period to allow issues to fully evolve as they would in a crisis and to allow realistic role-playing of all the involved groups

After every exercise the security professional conducts, the results need to be published and action items identified to address the issues that were uncovered.

Action items should be tracked until they have been resolved and, where appropriate, the plan should be updated. It is unfortunate when an organization has the same issue in subsequent tests simply because someone did not update the plan.

Testing and Acceptance to Production

The BCP, like any other security incident response plan, is subject to testing at planned intervals or upon significant organizational or environmental changes, as discussed previously. Ideally, a test realizes a full switchover to the DR platform.

At the same time, it should be recognized that such a test represents a risk to the production user population. To give an idea of the level of realism that organizations can aspire to, consider the architecture of a well-known online video distribution service.

Its infrastructure is designed to operate without a single point of failure being allowed to affect production.

To test and ensure that this is and remains so, the video distribution service employs a so-called chaos monkey, which is a process that continuously triggers component failures in the production service.

For each of these components, an automatic failover mechanism is in place.
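The idea of deliberately triggering component failures can be illustrated with a toy simulation. This is only a sketch of the concept and not the tool used by the video distribution service: the Service class, the replica names, and the rule that a service is available while at least one replica is healthy are all assumptions.

```python
import random
from typing import Iterable, Tuple


class Service:
    """A simulated service backed by redundant replicas."""

    def __init__(self, replicas: Iterable[str]) -> None:
        self.healthy = {name: True for name in replicas}

    def fail(self, name: str) -> None:
        self.healthy[name] = False

    def recover(self, name: str) -> None:
        self.healthy[name] = True

    def is_available(self) -> bool:
        # Assumed rule: one healthy replica is enough to serve users.
        return any(self.healthy.values())


def chaos_round(service: Service, rng: random.Random) -> Tuple[str, bool]:
    """Fail one random replica, check availability, then restore it."""
    victim = rng.choice(sorted(service.healthy))
    service.fail(victim)
    available = service.is_available()
    service.recover(victim)  # stand-in for an automatic failover or repair step
    return victim, available


rng = random.Random(0)  # fixed seed so the simulation is repeatable
redundant = Service(["a", "b", "c"])
survived_all = all(chaos_round(redundant, rng)[1] for _ in range(20))
single = Service(["only"])
single_survived = chaos_round(single, rng)[1]
```

The redundant service stays available in every round, whereas the single-replica service does not, which is exactly the kind of single point of failure this style of testing is meant to expose.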