Systems thinking & CTI: Scenario-Based Incident Response Playbooks


This article provides lessons learned to help teams cut approximately 4 hours off the initial response to a security incident. It is for threat & risk teams supporting incident response functions.

Alternatively, the contents help you ask managed service providers or vendors specific questions about how they would support you in scenario-based incident response. 

Cheers!


Why is this even new?

Incident response playbooks describe what you do if a given situation occurs. They provide guidance and predetermined lists of actions so that you don’t have to scramble when an incident occurs. 

Scenario-based incident response playbooks emphasize deeper integration between the threat model and the exact playbook. For example, there are interlinkages between different functions, using a common taxonomy or data model.

Mapping out how exactly to respond to a given situation predates modern technology. What has changed over time is the effectiveness of sharing & collaboration. Various templates and tips from incident response playbooks are readily available; just Google ‘cyber incident response playbook’ and you will be hit with thousands of pages of content.

The real question is why do teams still struggle with an initial response to certain events? 

Feedback we've gathered from colleagues, clients, and other practitioners suggests that the biggest struggles are:

  1. Maintaining a playbook process-wise is very hard with changing priorities;
  2. Keeping the individual contents up to date based on developments in the threat landscape is even harder. 

The most often mentioned reason for these struggles: there’s simply too little priority placed on keeping the versions current. This is just one of those items that is not considered important enough or urgent enough for teams to focus on directly, except when it’s time to use it.

To help chip away at this problem, I worked with my team at Venation to understand which areas need attention and which don’t. 

Full disclosure, we provide commercial offerings to mitigate these struggles but I still felt that we should provide folks the means to first evaluate this themselves.

This article contains key takeaways so that teams can do an honest self-assessment of their status.



Stakeholder integration & communication flow

First, how do team A and team B communicate? And if they communicate, what do they communicate?

Based on my personal experience, before we start thinking about playbooks we should be thinking about people. Who communicates with whom at specific points in the incident management process? Or, if your organization does not have an incident response process, what is your ‘who do I call’ list?

In complex environments I regularly support cyber threat intelligence (CTI) teams that leverage the concept of stakeholders. This logic is not just for CTI teams. It applies to entire organisations and is a crucial mechanism to understand who works together.

I often recommend starting with a Stakeholder Matrix, because I find that a lot of teams don’t use one. This is especially true for incident response teams, where the priority is on the daily grind.

Communication flow

I found that it is extremely practical to make it crystal clear what the exact flow of communication is. 

When I say crystal clear, I mean going above and beyond to make sure that no implicit aspects exist. When you review these playbooks long enough, the trend you will see is that they are highly likely to have been created based on internal knowledge that is never written down.

The key questions to ask to understand the current status of communication flow are:

  1. How do we currently document roles and responsibilities for upcoming incidents? If we already do so, who is responsible for maintaining the documentation? If we outsourced this to our service provider, what is the agreed maintenance window of these playbooks and who is the internal contact?
  2. Where is the latest version of the document stored? 
  3. When was the playbook last updated? Have the people and their contact information been confirmed as still accurate?
  4. What is currently documented in the playbook (see next paragraph)?

See an example of how you can think about a data flow below:

Example created using Draw.io

One key lesson learned is setting up relationships with law enforcement, especially the communication flow to and from law enforcement. This might sound like overdoing it, but knowing who to call in this case can be very beneficial in having a timely response. In addition, in some cases it might be needed due to compliance or legislative reasons.

For example, in the Netherlands, there is a comprehensive network with local and national law enforcement agencies. In addition, government agencies such as the Nationaal Cyber Security Centrum (NCSC-NL) and the Digital Trust Center provide guidance in these areas.

Key information needed before or during an incident

Once the communication flow is sorted out, then you explore exactly what information is being shared between teams. Exactly what this should be is something I cannot document here. This is unique to each company and the secret sauce of each team. 

I do want to share two observations: structure and level of detail.


Playbook Structure

The most effective structure we have found so far when building incident response playbooks includes the following:

  • Description of scenario: Relevant internal categorization mechanism
  • Initial Response: Predefined roles and responsibilities, communication channels (including a who-to-call list with up-to-date numbers), external support
  • Workflows to ensure a coordinated response: Overview of information flow with all activities in a given visualization
  • Checklists: Room for specific actions that you can take immediately
  • Incident-specific checklists for predefined immediate response actions: Investigation steps, process steps, activities, involved stakeholders
  • Maintenance: Responsibilities (RACI for bonus points), Creation date, Last updated date
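The structure above can be captured as a small data model, so that every playbook carries the same sections and gaps are easy to spot. What follows is a minimal Python sketch, not a standard schema: the class names (Contact, Playbook), the field names, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Contact:
    """One entry on the who-to-call list (hypothetical fields)."""
    name: str
    role: str
    phone: str


@dataclass
class Playbook:
    """Illustrative record mirroring the sections listed above."""
    scenario: str                # description of scenario
    category: str                # internal categorization mechanism
    contacts: list[Contact]      # initial response: who-to-call list
    external_support: list[str]  # initial response: external parties
    workflow_ref: str            # pointer to the workflow visualization
    checklist: list[str]         # actions you can take immediately
    owner: str                   # maintenance: who is responsible
    created: date                # maintenance: creation date
    last_updated: date           # maintenance: last updated date

    def missing_sections(self) -> list[str]:
        """Return the names of sections that are still empty."""
        sections = {
            "contacts": self.contacts,
            "external_support": self.external_support,
            "workflow_ref": self.workflow_ref,
            "checklist": self.checklist,
            "owner": self.owner,
        }
        return [name for name, value in sections.items() if not value]


# Sample values are invented for illustration only.
draft = Playbook(
    scenario="Ransomware with double extortion",
    category="Extortion",
    contacts=[Contact("Duty manager", "Incident lead", "+00 000 0000")],
    external_support=[],
    workflow_ref="",
    checklist=["Isolate affected hosts"],
    owner="CTI team",
    created=date(2024, 1, 15),
    last_updated=date(2024, 1, 15),
)
print(draft.missing_sections())
```

A check like this is only a starting point for the level-of-detail discussion: it tells you a section is empty, not whether the content is good enough.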

Level of detail & the scenario-based approach

Once all elements are included, you need to decide with your teams what the right level of detail is for the playbook. Are you already at the right level of detail or do you need to go one level deeper?

This is where the scenario-based approach provides peak value: when your team operates with a common set of threat or risk scenarios, you can easily use this as a template to start from, going deeper based on the exact demand of the Incident Response function.

Here’s an example of one of my team’s scenarios:

The example is available via

Your CTI function should be equipped to develop similar scenarios, which in turn changes discussions about level of detail into ‘this is the starting point, where exactly do you need more details and what is that level?’. Much more specific and actionable. 

I strongly believe this contributes to the actionable and timely nature of CTI functions. In addition, it just saves you loads of time for everything from the workflow to the exact investigative steps.

A few other lessons:

  • In the last 5 years, I have seen more cases where teams incorporate MITRE ATT&CK references into their playbooks, providing a central taxonomy that teams from different functions can contribute to. I for one believe that we should work with interoperable frameworks as much as we can. This is why I include references to it in all the scenarios we produce. It’s not perfect, but it’s a starting point. To me it is striking that even after last year, when the framework celebrated its 10th birthday, I still notice a lack of central implementation in organisations.
  • The discussion takes the most time. Getting a template is easy, filling it is more work. Getting everyone on the same page is where the rubber hits the road. Here you will also find out which playbooks are used often and which ones aren't. This reiterates the need for a process. You need someone to own it and make sure it’s 
  • Another lesson is about investigative checklists: it is crucial to know exactly how each step relates to the process. This might sound awfully obvious, but guess what: this does not happen every time. Most of the time we’re seeing a playbook consisting of the exact activities performed during the last IR. Mapping each step to the process, noting which stakeholders are involved, and recording any specific comments on the individual step helps a lot when looking back at the materials during your periodic review.
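The mapping described in the lessons above (step to process phase, stakeholders, and taxonomy reference) can be sketched in a few lines of Python. This is an illustrative assumption of how such a checklist might be stored, not a prescribed format: the ChecklistStep class, the helper functions, and the sample steps are hypothetical. The phase names follow the five stages summarised later in this article, and the technique IDs are MITRE ATT&CK identifiers (T1486 Data Encrypted for Impact, T1078 Valid Accounts).

```python
from dataclasses import dataclass

# The five stages used in this article's incident response primer.
PHASES = ("preparation", "identification", "containment", "eradication", "recovery")


@dataclass
class ChecklistStep:
    """One investigative or response step (hypothetical structure)."""
    action: str
    phase: str
    stakeholders: tuple[str, ...]
    attack_ids: tuple[str, ...] = ()
    comment: str = ""


def unmapped_steps(steps):
    """Return steps that lack a valid process phase or any stakeholder."""
    return [s for s in steps if s.phase not in PHASES or not s.stakeholders]


def steps_by_technique(steps):
    """Index step actions by ATT&CK technique ID for cross-team lookup."""
    index = {}
    for step in steps:
        for technique_id in step.attack_ids:
            index.setdefault(technique_id, []).append(step.action)
    return index


# Sample steps are invented for illustration only.
steps = [
    ChecklistStep("Isolate affected hosts", "containment",
                  ("SOC", "IT operations"), ("T1486",)),
    ChecklistStep("Reset compromised accounts", "eradication",
                  ("Identity team",), ("T1078",)),
    ChecklistStep("Collect memory image", "", ()),  # copied from the last IR, never mapped
]

for step in unmapped_steps(steps):
    print("Needs mapping:", step.action)
print(steps_by_technique(steps))
```

Running a check like this during the periodic review surfaces the steps that were copied over from a previous incident without being tied back to the process.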

Operational Security

I cannot end this segment without talking about operational security. Having worked directly with red teams most of my career, I found that crucial actionable intelligence is often derived from having access to the current playbooks from the security team. This same lesson applies to your adversaries.

You have to make sure access to these materials is need-to-know and that this access is periodically reviewed. If you make it available in a wider security function, then make sure that you are aware of what information is being disclosed in case something happens. 

  • When someone leaves the team or no longer requires access to the playbooks, also perform a manual confirmation of who has access to your playbooks.
  • Usually access restrictions are centrally managed, but there could be edge cases where the files are managed by this person. This is also a good exercise to spot tribal knowledge that has not been handed over.

This is often neither considered nor included in daily operations.
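The manual confirmation described above boils down to comparing two lists: who can open the playbooks and who is on the team today. Here is a minimal Python sketch under stated assumptions: the function name, the input shapes, and the sample names are hypothetical, and in practice the inputs would come from an export of your document platform and your team roster.

```python
def review_access(acl, roster, file_owners):
    """Compare playbook access with the current team roster.

    acl:         set of account names that can open the playbooks
    roster:      set of account names currently on the team
    file_owners: mapping of playbook file name to owning account
    Returns (accounts with stale access, files owned by non-members).
    """
    stale_access = sorted(acl - roster)
    orphaned_files = sorted(
        name for name, owner in file_owners.items() if owner not in roster
    )
    return stale_access, orphaned_files


# Sample data is invented for illustration only.
acl = {"ana", "ben", "chris"}
roster = {"ana", "ben"}
file_owners = {"ransomware.docx": "chris", "phishing.docx": "ana"}

stale, orphaned = review_access(acl, roster, file_owners)
print("Remove access for:", stale)
print("Hand over ownership of:", orphaned)
```

The second list is the edge case mentioned above: files personally managed by someone who has left, which is often where tribal knowledge sits.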


Yes. You need to have a process.

Most likely your teams already have a process. Good! Is it working? 

If you don’t, then consider that making up the process as you go along while dealing with the stress of an ongoing incident is like reading the car manual while trying to change the brakes on a car heading for a cliff. By having a ready-made and approved IR process, your organization avoids the chaos of trying to figure out who is responsible for what, what the lines of communication are, and how decisions get made. You will also need to have a back-up plan in place should everything be compromised, but we will get to that later in the article.

An incident response process is a structured approach to effectively managing and mitigating security incidents. It typically consists of five stages and provides a roadmap for handling incidents.

Two widely recognised frameworks for incident response are the NIST (National Institute of Standards and Technology) and SANS (SysAdmin, Audit, Network, Security) frameworks.

This article is not about incident management, but here’s a TLDR primer: During the preparation phase, organisations establish incident response plans, define roles and responsibilities, and implement necessary security measures. The identification stage involves detecting and validating incidents through monitoring systems and alerts. Once an incident is confirmed, containment measures are implemented to prevent further damage. The eradication phase focuses on removing the cause of the incident and restoring systems to a secure state. Finally, the recovery stage involves restoring normal operations, conducting post-incident analysis, and implementing improvements to prevent future incidents. 

Maintenance

This is the segment you don’t explicitly see in the aforementioned processes. Especially when talking about playbooks this might be one of the most important items. A playbook needs to be updated regularly! This means periodic review and signoff by a person with authority. Next, the playbook should be exercised and the results should be fed back into it.

If your incident management process does not explicitly include updating & maintaining the playbooks, I encourage you to add it.
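A periodic review is easier to enforce when overdue playbooks are flagged automatically. The Python sketch below assumes you keep a last updated date per playbook; the function name, the 90-day threshold, and the sample dates are assumptions for illustration, so pick a review interval that fits your own process.

```python
from datetime import date, timedelta


def stale_playbooks(last_updated, today, max_age_days=90):
    """Return names of playbooks not updated within max_age_days.

    The 90-day default is an assumed review interval, not a recommendation
    taken from any framework.
    """
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, updated in last_updated.items() if updated < cutoff)


# Sample data is invented for illustration only.
last_updated = {
    "ransomware": date(2024, 5, 10),
    "phishing": date(2023, 11, 2),
    "insider": date(2024, 2, 20),
}
overdue = stale_playbooks(last_updated, today=date(2024, 6, 1))
print("Overdue for review:", overdue)
```

The output of such a check can go straight to whoever owns the playbook process as the agenda for the next review and signoff.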

This is also the key area where you can integrate your CTI function with the Incident Response function. This is a highly tangible integration, as this provides specific Priority Intelligence Requirements for what to focus on. For example, if the Scenario is about Ransomware & Double Extortion, you can task your CTI function to investigate this further and propose updates to the playbooks based on current research. 

I specifically recommend that your CTI function contribute to:

  • Workflows to ensure a coordinated response 
  • Checklists
  • Incident-specific checklists for predefined immediate response actions. 

Added benefit: you can slap a ‘threat informed’ sticker on them as well!


Workflows

Visualisation plays a crucial role in incident response, providing a clear representation of roles, responsibilities, and activities within the incident playbook. A visual workflow allows stakeholders to quickly grasp the sequence of activities, dependencies, and decision points. This enhances communication, coordination, and overall efficiency during incident response. It might sound silly, but we remain a visually oriented species.

You can simply start by using Microsoft PowerPoint or Visio, but other simple charting solutions like draw.io will also do. Everybody loves a good workflow.

Here’s a beautiful workflow from Microsoft, get the high res version via:

Exercising

Can you still remember the last time you practiced at least one of your incident playbooks? 

Once a playbook is set up, it shouldn't collect dust on a shelf but should be vetted and tested on a regular basis to make sure that it is still effective and constitutes the optimal response.

When you explore a scenario-based approach, you will naturally develop a selection of relevant or urgent scenarios which need to be trained. This is perfect for training based on the playbook.

There are two key lessons learned to keep in mind:

Cadence:

  • Conduct exercises monthly: Rotating different team members on a monthly basis is an excellent starting point.
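A monthly rotation is simple to plan ahead. This short Python sketch pairs each month with one team member and one playbook by cycling through both lists; the function name and the sample names are hypothetical.

```python
from itertools import cycle


def exercise_rotation(members, playbooks, months):
    """Return (month, lead, playbook) tuples, cycling through both lists."""
    return list(zip(range(1, months + 1), cycle(members), cycle(playbooks)))


# Sample data is invented for illustration only.
schedule = exercise_rotation(
    members=["Ana", "Ben"],
    playbooks=["ransomware", "phishing", "insider"],
    months=4,
)
for month, lead, playbook in schedule:
    print(f"Month {month}: {lead} leads the {playbook} exercise")
```

Because the two lists have different lengths, each person ends up leading different playbooks over time instead of repeating the same one.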

Types:

  • Self study: Teams periodically click on a link to read through the latest version. You will learn more by actually practicing it, but reading up on things is still important.
  • Tabletop: If you seek to test your team's understanding of the process, then a tabletop exercise is a good starting point. Make sure the scenario is aligned with the exact playbook and you're good to go. Duration: 30-60 minutes.
  • Simulation: A more comprehensive exercise, sometimes including simulated activities (e.g. running Mimikatz in a specific area), that focuses both on processes and on actually going through the different steps.
  • Purple teaming: Taking it a step further, you might want to seek collaboration with offensive security professionals. In so-called purple team exercises, which exist in different shapes and formats, you have a red team performing activities and the blue team then works through their response to the situation. Sometimes they do this together in the same room. These exercises are extremely valuable: they are great learning opportunities and a good way to find gaps in your playbooks.

From personal experience, I often create overarching exercise scenarios that cover multiple playbooks. This forces the team to think on their feet and quickly go over the different playbooks. This also allows you to create a narrative that can be used in multiple exercises, for example using different exercises for different phases of the scenario.


Value measurement 

One area of research I believe we need to be more upfront about is measuring the value add of a given playbook. This brings us to a crucial discussion, as it certainly involves rating something done by someone. It’s not for all organizations. My key takeaway is that it’s better to think about this before someone asks you for it. This way, you can say with clear eyes what the exact value is. 

Some of the industry’s most commonly tracked metrics to create empirical evidence are:

  • MTBF (mean time between failures);
  • MTTR (mean time to recovery, repair, respond, or resolve): average exercise time, average incident time, comparison over time (3, 6, 12 months);
  • MTTF (mean time to failure);
  • MTTA (mean time to acknowledge).

These metrics are designed to help tech teams understand how often incidents occur and how quickly the team bounces back from those incidents. Whatever metrics you choose, I recommend measuring exercise performance against real incident performance and comparing the two. The difference provides you a lot of data points to work with.
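As a worked example of comparing exercises with real incidents, the Python sketch below computes MTTA and MTTR in minutes from timestamped records. The record fields (detected, acknowledged, resolved), the function name, and all sample timestamps are assumptions for illustration; here MTTR is taken as the mean time from detection to resolution, which is only one of the readings of the acronym listed above.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(records, start_key, end_key):
    """Average elapsed minutes between two timestamps across records."""
    return mean(
        (r[end_key] - r[start_key]).total_seconds() / 60 for r in records
    )


# Sample records are invented for illustration only.
exercises = [
    {"detected": datetime(2024, 3, 1, 9, 0),
     "acknowledged": datetime(2024, 3, 1, 9, 5),
     "resolved": datetime(2024, 3, 1, 10, 0)},
    {"detected": datetime(2024, 4, 1, 9, 0),
     "acknowledged": datetime(2024, 4, 1, 9, 15),
     "resolved": datetime(2024, 4, 1, 11, 0)},
]
incidents = [
    {"detected": datetime(2024, 5, 7, 14, 0),
     "acknowledged": datetime(2024, 5, 7, 14, 30),
     "resolved": datetime(2024, 5, 7, 18, 0)},
]

exercise_mtta = mean_minutes(exercises, "detected", "acknowledged")
exercise_mttr = mean_minutes(exercises, "detected", "resolved")
incident_mtta = mean_minutes(incidents, "detected", "acknowledged")
incident_mttr = mean_minutes(incidents, "detected", "resolved")

print(f"MTTA: exercises {exercise_mtta:.0f} min, incidents {incident_mtta:.0f} min")
print(f"MTTR: exercises {exercise_mttr:.0f} min, incidents {incident_mttr:.0f} min")
print(f"MTTR gap: {incident_mttr - exercise_mttr:.0f} min")
```

Tracking the gap over 3, 6, and 12 months shows whether exercises are actually translating into faster real-world response.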


Technology

Technology is mostly dependent on your own IT infrastructure. 

Many prime vendors provide onboard solutions that include incident response playbooks. They tend to be locked into their own format and require some development work to customise it to your environment if possible. 

From my experience, some playbooks exist in Confluence while others live in OneNote (or even on paper in a vault). This really depends on the organisation and the IT architectural stack you are using. 

My first key lesson learned is that you are best off creating them in the area also used for documentation in your security operations center or cyber fusion center, taking operational security into account, obviously.

Second, make sure to have a plan for an out-of-band solution in case both the incident management system and subsequent playbooks are compromised. This is one of those things that you won’t take seriously until it’s really too late. Because if something happens then we will ‘spin up an out-of-band Slack’, right? My friend Martyn Gill builds a tool for this exact use case for SMBs with their company ORNA. Full disclosure, Martyn helps out with Venation and I personally don’t get any benefits from marketing (maybe now I am!).

Some other good examples from different entities:


Role of LLM & AI: 

Key takeaway here: automate as much as possible. Just make sure to have a granular understanding of what is AI-supported and what is not. This means that your teams can track this over time. In addition, this allows you to monitor specific areas that have a higher chance of hallucinations.

We recommend using LLMs for:

  • Review & rephrasing generic text to improve readability;
  • Support structuring your playbook;
  • Quickly generating passages of explanatory text so you don’t have to.

We discourage using LLMs for:

  • Documenting any form of internal information, unless you have found an approach to leverage an internal LLM with the relevant privacy considerations addressed;
  • Reviewing & rephrasing texts that are specific to your process to improve readability.



Wrapping up

Adopting a scenario-based approach within the incident response process offers numerous benefits. Most notably, it can significantly reduce the stress on frontline defenders during an incident.

Four actions you should take today:

  1. Review your current approach to incident response playbooks and, if possible, move to scenario-based interaction, for example by leveraging MITRE ATT&CK as an integrated taxonomy for collaboration;
  2. Set up law enforcement contacts before something happens;
  3. Have CTI update playbooks based on current & forecasted behavior;
  4. Practice monthly, so that each playbook is vetted and tested on a regular basis and remains effective.

If you like this article, then you’ll love the curated threat scenario repository we have at Venation. 

Together with my Venation team I customize and build playbooks for teams, and train their incident response teams through tabletop exercises or hands-on-keyboard simulations.

Check out more information via www.venation.digital

Cheers!
