One-size-fits-all? Not when it comes to disaster recovery
When it comes to Disaster Recovery (DR) and Business Continuity, everyone has their own ideas, not only about the definitions but also about how to implement a solid solution that will actually work when an organisation is on its knees. This is the key: you never choose when to have a disaster; it can happen at any time. A Business Continuity Plan (BCP) should take all of this into account. Technology matters, but People and Process are just as important, if not more so.
In today’s DevOps world, where applications are evolving faster than ever, the processes and people managing them need to be just as flexible. An application that was in its pilot phase last week could now be in production and have been through three development cycles. Furthermore, it may be used by hundreds or thousands of users. Was a plan put in place for recovering this system in case of disaster? Who was consulted, and at what phase of the lifecycle? Have requirements changed since the last development cycle, or now that the application has become critical to a business function?
Planning for a DR event becomes more complex when you start to think about what you are protecting your systems from. Are you limiting the scope of your solution to natural and physical disasters? What about the increased threat of ransomware? These two types of disaster should be handled differently to limit business impact as much as possible. With ransomware, traditional data replication technologies will not help you avert disaster, because the compromised data is simply copied to the secondary locations.
DR events caused by physical problems can also manifest themselves in different ways. For example, does an internet connectivity failure impact a system if users only access that system locally? Maybe it does, if the system relies on an external system for some functionality. Knowing what the system dependencies are is critical to a successful BCP implementation.
Once the dependencies are known, you can start to build a matrix. From here, decisions can be made about which systems are critical, which are nice to have and which may not be required at all in a DR event. This needs to be made clear in your BCP. Resources during an event are finite, and they must be working on restoring the right system at the right time. How long each system can be unavailable is known as the Recovery Time Objective (RTO): in simple terms, how long can the organisation be without the system? Within this RTO, the system needs to be restored to a functioning state that users can access. If the RTO is not achieved, what is the Maximum Tolerable Outage (MTO)? This benchmark describes how long the business can sustain the loss of the system, and it can mean different things to different organisations. If the application is that critical to the organisation, do you also need a plan for when your DR site cannot be brought online? All these alternate plans need to be counted as part of the MTO.
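To make the matrix idea concrete, here is a minimal Python sketch of how a dependency matrix could drive a restore order and flag where an RTO or MTO would be missed. Everything in it is an assumption made for illustration: the system names, the tiers, the hour figures and the idea of a single recovery team restoring one system at a time. Treat it as a starting point for discussion, not a finished tool.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class System:
    name: str
    tier: str              # "critical", "nice-to-have" or "not-required"
    rto_hours: float       # Recovery Time Objective
    mto_hours: float       # Maximum Tolerable Outage
    restore_hours: float   # estimated effort to restore (an assumption)
    depends_on: list[str] = field(default_factory=list)


def restore_plan(systems):
    """Order critical systems (and what they depend on) for restoration.

    Assumes one recovery team working sequentially. Dependencies are always
    restored first; among systems that are ready, the tightest RTO goes first.
    """
    by_name = {s.name: s for s in systems}

    # Collect every critical system plus everything it depends on.
    needed, stack = set(), [s.name for s in systems if s.tier == "critical"]
    while stack:
        name = stack.pop()
        if name not in needed:
            needed.add(name)
            stack.extend(by_name[name].depends_on)

    sorter = TopologicalSorter({n: by_name[n].depends_on for n in needed})
    sorter.prepare()  # raises CycleError if dependencies are circular

    plan, ready, elapsed = [], [], 0.0
    while sorter.is_active():
        ready.extend(sorter.get_ready())
        ready.sort(key=lambda n: by_name[n].rto_hours)
        system = by_name[ready.pop(0)]
        elapsed += system.restore_hours
        if elapsed <= system.rto_hours:
            status = "within RTO"
        elif elapsed <= system.mto_hours:
            status = "within MTO"
        else:
            status = "exceeds MTO"
        plan.append((system.name, elapsed, status))
        sorter.done(system.name)
    return plan


# Hypothetical systems and figures, purely for illustration.
systems = [
    System("network", "critical", rto_hours=2, mto_hours=4, restore_hours=1),
    System("database", "critical", 4, 8, 2, depends_on=["network"]),
    System("order-portal", "critical", 4, 12, 2, depends_on=["database", "network"]),
    System("wiki", "nice-to-have", 48, 96, 3, depends_on=["network"]),
]

for name, hours, status in restore_plan(systems):
    print(f"{name}: restored after {hours:.1f}h ({status})")
```

With these made-up numbers, the portal comes back after five hours: past its four-hour RTO but inside its MTO. That is exactly the kind of gap the matrix should surface before an event rather than during one.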
Furthermore, don’t let preconceptions hold you back from exploring new ideas. Perhaps a third party could help with thinking outside the box. Does your current IT provider offer a consultative service for these challenges? More often than not, it helps to get fresh eyes and ideas on them. The challenges that arise may also fall outside your team’s core skillset, since the team is focused on business-as-usual operations. All of these factors can combine to create roadblocks to achieving a solid outcome for the organisation.
So that should be it, right? Nope, not even close. The question of how you get back to production after an event has passed is almost always overlooked by organisations while planning their BCP. Moving back to production can sometimes be even harder than failing over, because data may still be intact in the primary site. How you deal with this determines whether you can recover data that was thought to be lost in the synchronisation gap between the Primary and Secondary sites, the window your Recovery Point Objective (RPO) is meant to limit. Depending on how long the DR event lasted, this data could save users hundreds of hours of rework. Be sure to include getting back to production as part of your BCP. Trying to work it out after the event will only make the impact of the disaster worse.
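As a rough illustration of that synchronisation gap, the Python sketch below works out how far the Secondary site was behind when the outage began, and which writes on the Primary would need to be reconciled during failback. The timestamps, record IDs, the 30-minute RPO and the shape of the write log are all hypothetical. A real failback would depend on what your replication tooling actually records.

```python
from datetime import datetime, timedelta


def replication_gap(last_sync, outage_start):
    """Time between the last successful replication and the start of the outage."""
    return outage_start - last_sync


def writes_to_reconcile(primary_writes, last_sync, outage_start):
    """Writes that reached the Primary but were never copied to the Secondary."""
    return [
        w for w in primary_writes
        if last_sync < w["written_at"] <= outage_start
    ]


# Hypothetical values, purely for illustration.
rpo = timedelta(minutes=30)
last_sync = datetime(2024, 1, 10, 9, 45)
outage_start = datetime(2024, 1, 10, 10, 0)
primary_writes = [
    {"id": 101, "written_at": datetime(2024, 1, 10, 9, 40)},  # already replicated
    {"id": 102, "written_at": datetime(2024, 1, 10, 9, 50)},
    {"id": 103, "written_at": datetime(2024, 1, 10, 9, 58)},
]

gap = replication_gap(last_sync, outage_start)
stranded = writes_to_reconcile(primary_writes, last_sync, outage_start)

print(f"Replication gap: {gap} (RPO met: {gap <= rpo})")
print(f"Writes to reconcile on failback: {[w['id'] for w in stranded]}")
```

Even when the gap is within the RPO, as it is here, those stranded writes may still be sitting intact on the Primary. The failback plan needs to say whether they are merged, reviewed by hand or discarded.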
Have you gone through any of the issues discussed here? I’d love to hear about your experiences and what you did to address these or other challenges.