What if you have nothing left (part 2)?
Image generated by Pixlr with the prompt: "a datacenter after a ransomware attack"

This is part 2 of a series of articles about Business Continuity. In part 1 I showed that full recovery is not a theoretical scenario. In such a scenario, all of your colleagues are waiting for you to finish so they can start working too. Where do you begin?

Setting the scene

When things do go wrong, they go really wrong. There is a lot of stress. The available runbooks turn out to be unreliable, and people have to work on experience and intuition. Only a very small group of people can do that, and an even smaller group is permitted to act.

As a result, recovery is a slow, error-prone and painstaking process. Everybody is jumping on the backs of the 5-10 people doing the work, and after a few days of 24x7 work, mistakes become common. If things go even more wrong, you may need medical assistance to keep your crew alive. Literally.

You may think that it is possible to hire extra hands to support your staff. You probably cannot. They are just not there. And if you do manage to find people, they do not know your environment, so bringing them in will slow down recovery even more. I know this from experience: I once had to recover an ISP that had been down for two weeks because of a lack of knowledge and understanding of the underlying technology.

As a result, recovery will take weeks, even months, not hours. Can your business handle that? Probably not.

Keep this in mind

  • Many application teams are blissfully ignorant of how their application really works, which resources it really needs and how much of each is required to run normally.
  • Most servers are loaded at only 6% - 15%, so they are usually oversized. Especially when you have to order IaaS as a replacement, rightsizing can save a lot of time and money.
  • Application teams are generally in the dark about the network connectivity they need; firewall rules and load balancer configs can be hard to collect. Using flow data from the network can help the application teams get the insight needed to properly define their required resources (a minimal aggregation sketch follows this list). Measure long enough to catch the once-a-year processes too. As netflow/sflow is present in most if not all switching equipment, you can start collecting data today with negligible impact.
  • Most of the time you do not need all the storage you currently have operational to become functional again. Keeping a retention period of 1-3 months should be sufficient for most operations, and reducing the restore volume shortens recovery time significantly.
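To give an idea of what mining flow data looks like in practice, here is a minimal Python sketch that aggregates exported flow records into the distinct connections an application actually uses. The input file name and column names (src, dst, dport, proto) are assumptions for illustration only; your flow collector will have its own export format.

```python
import csv
from collections import Counter

# Aggregate exported flow records into the distinct (src, dst, port, proto)
# tuples an application actually uses. The CSV layout is an assumption for
# illustration; adapt it to whatever your flow collector exports.
def summarize_flows(path: str) -> Counter:
    connections = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            key = (row["src"], row["dst"], row["dport"], row["proto"])
            connections[key] += 1
    return connections

if __name__ == "__main__":
    # Hypothetical export file; measure long enough to catch rare jobs too.
    for (src, dst, dport, proto), count in summarize_flows("flows.csv").most_common(20):
        print(f"{src:>15} -> {dst:<15} {proto}/{dport:<5} seen {count} times")
```

Running something like this over a few months of exports gives the application team a concrete list of connections to turn into firewall rules and load balancer configuration.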

Separation of concerns

What your customers need is functioning processes, which in a lot of cases are embedded in applications. So, you need your applications up and running as fast as possible. To do this, the application teams need hands-on experience in bringing up their own environment.

They can only do this if they have an environment to deploy upon. If there is no on-prem anymore, they should be able to run in the cloud, any cloud, provided data regulations permit it.

To facilitate this, it helps to have an abstraction layer that separates the functional requirements of the application team from the technical implementation of the platform provider.

Using such an abstraction layer, CI/CD pipelines are no longer dependent on a specific infrastructure provider and can be developed by the teams themselves.

The basic structure for this is straightforward. An application is part of a tenant, has environments (prod, dev) and is divided into compartments/security zones (Presentation, Application, Data):

Example of a datacenter stack of generic building blocks

Every block can be mapped to an equivalent block on a specific platform: a tenant in this model translates to an MPLS VPN in Cisco ACI, a VPC in AWS or a Resource Group in Azure. Similarly, a compartment translates to a subnet, which can be mapped to a VNET in Azure and an EPG in your on-prem Cisco ACI.

This looks like a lot of complexity, and figuring it out initially is indeed quite hard. But the net effect is that the application teams get a very easy way to define their platform: they only have to store their own manifest in their own git repository, as part of the code for their application environment. All the rest can be maintained centrally.
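To make this concrete, below is a minimal sketch, written in Python for readability, of what such a manifest and the central mapping could look like. The manifest fields and the mapping table are illustrative assumptions, not a finished model; in practice the manifest would typically live as a small YAML or HCL file in the team's repository and be consumed by the central pipeline.

```python
# A hypothetical application manifest: the only thing the application team
# has to maintain in their own git repository.
manifest = {
    "tenant": "webshop",
    "environments": ["prod", "dev"],
    "compartments": ["presentation", "application", "data"],
}

# Central mapping of generic building blocks to platform-specific constructs,
# maintained once for every platform you may need to deploy on.
PLATFORM_MAP = {
    "aws":   {"tenant": "VPC",            "compartment": "subnet"},
    "azure": {"tenant": "Resource Group", "compartment": "VNET"},
    "aci":   {"tenant": "MPLS VPN",       "compartment": "EPG"},
}

def render(manifest: dict, platform: str) -> list:
    """Translate the functional manifest into the resources a platform needs."""
    blocks = PLATFORM_MAP[platform]
    resources = [f"{blocks['tenant']} for tenant '{manifest['tenant']}'"]
    for env in manifest["environments"]:
        for zone in manifest["compartments"]:
            resources.append(f"{blocks['compartment']} {manifest['tenant']}-{env}-{zone}")
    return resources

print("\n".join(render(manifest, "aws")))   # switch platform by changing one argument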

Application installation

Once the platform can be deployed as code, it is time to install the application itself. I prefer to do this through Ansible, as more and more vendors provide and maintain Ansible playbooks for their products. The advantage is that you can focus on your own work, while the vendor does the heavy lifting.

To provide Ansible with the inventory data needed to deploy the software (IP addresses, system names, etc.), there is a Terraform plugin that makes it possible to define the inventory when Terraform is run, so that this data ends up in the Terraform state. Ansible can then run a Python script to extract the data when running the playbook. This way there is no longer a need to store and maintain that data with the application team.
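The exact plugin and state layout differ per setup, so take the following as a minimal sketch only: a Python dynamic inventory script that reads a local terraform.tfstate file and prints the JSON structure Ansible expects. The resource type and attribute names ("aws_instance", "private_ip", "tags") are assumptions for illustration; adapt them to whatever your Terraform providers write into the state.

```python
#!/usr/bin/env python3
"""Minimal Ansible dynamic inventory sketch that reads a Terraform state file.

Assumes a local terraform.tfstate containing aws_instance resources; adjust
the resource type and attribute names to your own providers.
"""
import json
import sys

def build_inventory(state_path: str = "terraform.tfstate") -> dict:
    with open(state_path) as f:
        state = json.load(f)

    inventory = {"all": {"hosts": []}, "_meta": {"hostvars": {}}}
    for resource in state.get("resources", []):
        if resource.get("type") != "aws_instance":        # illustrative filter
            continue
        for instance in resource.get("instances", []):
            attrs = instance.get("attributes", {})
            name = attrs.get("tags", {}).get("Name", attrs.get("id", "unknown"))
            inventory["all"]["hosts"].append(name)
            inventory["_meta"]["hostvars"][name] = {
                "ansible_host": attrs.get("private_ip"),
            }
    return inventory

if __name__ == "__main__":
    # Ansible calls inventory scripts with --list and reads JSON from stdout.
    if "--list" in sys.argv:
        print(json.dumps(build_inventory()))
    else:
        print(json.dumps({}))
```

Pointing ansible-playbook -i at an executable script like this means the inventory is always derived from the Terraform state instead of being maintained by hand.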

Everything as code

As you can see, deploying infrastructure as code in an environment where you have to be able to switch the deployment platform is not as hard as it seems. When you have the data, you can build a manifest describing what you need.

With this data in a central repository (a backup of which usually fits perfectly fine on a thumb drive), your application teams are able to deploy by themselves, and in parallel.

A new dawn

In a DR scenario, the work being done should be the same work that is done on a daily basis; that is what prevents mistakes. After all, practice makes perfect. The team knows their risks, challenges and knowledge gaps beforehand and will most probably have fixed those before disaster strikes. Recovery is then limited only by the speed the application team achieves in their deployment pipeline and the time it takes to restore the data.

Firefighting, ad-hoc changes and improvising on the go should not be part of a team's normal way of working. To be able to do DR, everything should be planned and run through the normal pipeline. Only then can you keep your environment tested and hardened, and only then can you provide the post-mortem data required after a breach.

This may sound simple, but for most organizations this has a major impact on the way of working.

Using Agile/SAFe with PI sessions and release trains can provide the insight needed to assess what the impact of a change will be.

In the next part we’ll dive into the underlying technical architecture you want to have to make this work.

