What if you have nothing left (part 2)?
This is part 2 of a series of articles about Business Continuity. In part 1 I showed that having to recover everything from scratch is not a theoretical scenario. In such a scenario, all of your colleagues are waiting for you to finish, so they can start working too. Where do you begin?
Setting the scene
When things do go wrong, they go really wrong. There is a lot of stress. The available runbooks turn out to be unreliable, and people have to work on experience and intuition. Only a very small group of people can do that, and an even smaller group is permitted to act.
As a result, recovery is a slow, error-prone and painstaking process. Everybody is jumping on the backs of those 5-10 people doing the work. And after a few days of 24x7 work, mistakes become common. If things go even more wrong, you may even need medical assistance to keep your crew alive. Literally.
You may think that it is possible to hire extra hands to support your staff. You probably cannot. They are just not there. And if you do manage to find people, they do not know your environment, so bringing them in will slow down recovery even more. I know this from experience: I once had to recover an ISP that had been down for two weeks due to a lack of knowledge and understanding of the underlying technology.
As a result, recovery will take weeks, even months, not hours. Can your business handle that? Probably not.
Keep this in mind
Separation of concerns
What your customers need is functioning processes, which in a lot of cases are embedded in applications. So, you need your applications up and running as fast as possible. To achieve this, the application teams need hands-on experience in bringing up their own environment.
They can only do this if they have an environment to deploy onto. If there is no on-prem anymore, they should be able to run in the cloud - any cloud, provided data regulation permits that.
To facilitate this, it helps to have an abstraction layer separating the functional requirements of the application team from the technical implementation of the platform provider.
Using such an abstraction layer, CI/CD pipelines are no longer dependent on a specific infrastructure provider and can be developed by the teams themselves.
The basic structure for this is straightforward: an application belongs to a tenant and has environments (prod, dev), each divided into compartments/security zones (Presentation, Application, Data).
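To make this concrete, here is a minimal sketch of what such a manifest could look like. The YAML layout and field names are my own illustration; the actual schema is whatever your abstraction layer defines.

```yaml
# Hypothetical application manifest, kept in the team's own git repository.
# Field names are illustrative; the schema is defined by the abstraction layer.
application: webshop
tenant: retail
environments:
  - name: prod
    compartments:
      - name: presentation        # security zone for the web tier
        workloads: [web-frontend]
      - name: application         # business logic tier
        workloads: [order-service]
      - name: data                # databases and storage
        workloads: [order-db]
  - name: dev
    compartments:
      - name: presentation
        workloads: [web-frontend]
      - name: application
        workloads: [order-service]
      - name: data
        workloads: [order-db]
```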
Every block can be mapped to an equivalent block on a specific platform: a tenant in this model translates to an MPLS VPN in Cisco ACI, a VPC in AWS or a Resource Group in Azure. Similarly, a compartment translates to a subnet, which can be mapped to a VNET in Azure and an EPG in your on-prem Cisco ACI.
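That translation itself is something the platform team can maintain centrally, for example as a small mapping file. A sketch, using the equivalences above (the file format and names are illustrative):

```yaml
# Hypothetical, centrally maintained mapping from the abstract building blocks
# to their platform-specific equivalents, following the model described above.
tenant:
  cisco_aci: mpls_vpn
  aws: vpc
  azure: resource_group
compartment:
  cisco_aci: epg
  aws: subnet
  azure: vnet
```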
This looks like a lot of complexity, and figuring it out initially is indeed quite hard. But the net effect is that application teams get a very easy way to define their platform: they only have to store their own manifest in their own git repository, as part of the code for their application environment. Everything else can be centrally maintained.
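As an illustration, a team pipeline in such a setup could look roughly like this (GitHub Actions syntax; the platform directory, the variable names and the TARGET_PLATFORM variable are my assumptions, not a prescribed layout):

```yaml
# Sketch of a team pipeline: the job only knows about the team's manifest,
# while the centrally maintained Terraform code under ./platform (hypothetical
# layout) decides how to realise it on the target platform.
name: deploy-environment
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Deploy the environment described by the manifest
        working-directory: platform
        # TARGET_PLATFORM is a hypothetical variable set by the platform team;
        # the application team never touches provider-specific details.
        run: |
          terraform init
          terraform apply -auto-approve \
            -var="manifest=../manifest.yaml" \
            -var="platform=${{ vars.TARGET_PLATFORM }}"
```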
Application installation
Once the platform can be deployed as code, it is time to install the application itself. I prefer to do this through Ansible, as more and more vendors provide and maintain Ansible playbooks for their products. The advantage is that you can focus on your own work, while the vendor does the heavy lifting.
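A minimal playbook sketch; vendor.product.install stands in for whichever vendor-maintained role or collection applies to your product, and the variables are purely illustrative:

```yaml
# site.yml - install the application through a vendor-maintained role.
# "vendor.product.install" and the variables below are placeholders.
- name: Install the application with the vendor-maintained role
  hosts: app_servers
  become: true
  roles:
    - role: vendor.product.install
      vars:
        product_version: "1.2.3"
        product_config_file: files/app.conf
```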
To provide Ansible with the inventory data needed to deploy the software (IP addresses, system names, etc.), there is a Terraform plugin that lets you define the inventory when Terraform runs, so that this data ends up in the Terraform state. Ansible can then use a Python script to extract that data when running the playbook. This way there is no longer a need to store and maintain that data with the application team.
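One way to wire this up (an assumption on my side about the specific tooling, as several options exist) is the ansible/ansible Terraform provider, which writes ansible_host and ansible_group resources into the state, combined with the inventory plugin from the cloud.terraform collection on the Ansible side. Option names may differ between versions:

```yaml
# inventory.yml - dynamic inventory read straight from the Terraform state.
# Assumes the cloud.terraform collection is installed and that the Terraform
# code defines ansible_host / ansible_group resources.
plugin: cloud.terraform.terraform_provider
# Hypothetical path to the initialised Terraform project whose state holds
# the host data (IP addresses, system names, group membership).
project_path: ../platform
```

Running ansible-playbook -i inventory.yml site.yml then resolves the hosts from the state, so no static inventory has to be kept by the application team.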
Everything as code
As you can see, deploying infrastructure as code in an environment where you have to be able to switch the deployment platform is not as hard as it seems. When you have the data, you can build a manifest describing what you need.
With this data in a central repository (a backup of which usually fits perfectly fine on a thumb drive), your application teams are able to deploy their own environments themselves, and in parallel.
A new dawn
In a DR scenario, the work being done should be the same work that is done on a daily basis; that is what prevents mistakes. After all, practice makes perfect. The team knows its risks, challenges and knowledge gaps beforehand and will most probably have fixed those before disaster strikes. Recovery is then limited only by the speed the application team manages to achieve in its deployment pipeline and the time needed to restore the data.
Firefighting, ad-hoc changes and improvising on the go should not be part of a team's normal way of working. In order to be able to do DR, everything should be planned and run through a normal pipeline. Only then can you keep your environment tested and hardened, and provide the post-mortem data required after a breach.
This may sound simple, but for most organizations this has a major impact on the way of working.
Using Agile/SAFe with PI planning sessions and release trains can provide the insight needed to assess what the impact of a change will be.
In the next part we’ll dive into the underlying technical architecture you want to have to make this work.