A new detection model for Azure Sentinel
Picking up where we left off in part 1, we know that time-series decomposition is not entirely suited to detecting cyberattacks in the Azure Activity logs produced by the plentiful SPNs operating in our subscriptions. Let's figure out what its limits are and how we could get around them in Azure Sentinel.
Current limitations
In the context of suspicious-operation detection, I think the three main grievances one might have against time-series decomposition are:
- Non-distributivity. As we discovered previously, anomalies(op1+op2) != anomalies(op1) + anomalies(op2). Likewise, anomalies(spn1+spn2) != anomalies(spn1) + anomalies(spn2). To perform detection at scale, with so many operations and SPNs to manage, it would be highly desirable for anomaly detection to be at least roughly distributive.
- No learning capability. An anomaly which triggers once will always trigger, even if it is a false positive (or a benign true positive). This is not sustainable in a context of automated DevSecOps.
- No time-orientation. While analyzing things in the right order might not be crucial for failure prediction and health monitoring, it is of key importance for cybersecurity: patching an image before publishing it to a registry is better than publishing first and patching afterwards. Time-orientation eliminates many false positives (though it could also ignore some true positives). We could take time-orientation for granted, because one can't imagine anything more chronological than a time series. But in fact, the process of decomposition destroys chronology: the only component that retains a flavor of time-orientation is the seasonality. Unfortunately, as we have seen previously, even automated tasks, when complex, can be unseasonal.
In our search for a successful replacement for time-series decomposition, we must strive to obtain those three properties: distributivity, memorization and chronology.
But above all, we must find the right balance between perfect and functional anomaly detection; this is really important if we want to go anywhere. In support of this argument, let me quote Mahmoud ElAssir, VP of Customer Experience at Google Cloud:
Complexity needs to be managed because it’s too complex to solve. What you want to do is manage complexity with better measurements, better prediction, and better accountability
Achieving better detection with Markov models
I propose to follow a classical approach in anomaly detection: evaluate the ebb and flow of SPN activity against a first-order hidden Markov model.
Such models are made of two parts: a "hidden state" and "observable outcomes". Here, the hidden state is captured by a transition matrix, which holds all acceptable transitions between two subsequent operations of the {OperationNameValue} set. It is a square matrix of order c, where c is the cardinality of {OperationNameValue}.
Observable outcomes are long sequences of legitimate operations taken from Azure Activity logs.
The construction of the transition matrix is straightforward: each time operation A is followed by operation B in a given time series, we increase a counter at coordinates (A,B). This counter simply tracks the number of A->B transitions in the series. Once we have ingested the whole data set, we normalize each row so that every cell represents a probability and the row sums to 1.0.
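To make the construction concrete, here is a minimal Python sketch; numpy and the function names are my own assumptions for illustration, not the actual ingestion pipeline:

```python
import numpy as np

def build_transition_matrix(operations):
    """Count A->B transitions in a time-ordered operation sequence, then row-normalize."""
    # Map each distinct operation name to a row/column index.
    index = {op: i for i, op in enumerate(sorted(set(operations)))}
    c = len(index)  # the cardinality of {OperationNameValue}
    counts = np.zeros((c, c))

    # Each consecutive pair (A, B) increments the counter at coordinates (A, B).
    for a, b in zip(operations, operations[1:]):
        counts[index[a], index[b]] += 1

    # Normalize every non-empty row so each cell is a probability
    # and the row sums to 1.0.
    row_sums = counts.sum(axis=1, keepdims=True)
    probs = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
    return probs, index
```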
Optimizations
To keep the matrix small, we may hash operation names with a modulus (at the expense of precision). Kusto's built-in hash(object, modulus) is good for that, but beware: the algorithm is subject to change by Microsoft without notice.
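In Python terms, the idea looks like the sketch below; I use a stable hashlib digest for illustration rather than Kusto's hash(), whose exact algorithm we should not depend on anyway:

```python
import hashlib

def bucket(operation_name: str, modulus: int = 256) -> int:
    """Map an operation name to one of `modulus` buckets with a stable hash.

    Distinct operations may collide in the same bucket: that is the
    precision we trade away for a smaller matrix.
    """
    digest = hashlib.sha1(operation_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % modulus
```

With modulus = 256, the matrix stays at 256x256 regardless of how many distinct operation names appear, at the price of occasional collisions between unrelated operations.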
To make the process less CPU-intensive, we may replace the transition matrix with a simpler object without losing any detection power: a logical matrix. That's not a problem because we do not want to know the likelihood of a given transition between two operations; we just want to know whether the transition is legitimate (probability > 0.0) or not (probability == 0.0).
In the logical matrix, the "ones" represent legitimate transitions, and the "zeroes" represent unexpected transitions. Hitting a zero during a routine evaluation is like setting off a canary or detonating a URL: we have found an anomaly which needs to be investigated.
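Continuing the earlier sketch, and still under the same illustrative assumptions, the logical matrix and the evaluation routine could look like this:

```python
import numpy as np

def to_logical(probs: np.ndarray) -> np.ndarray:
    """Keep only legitimacy: True wherever a transition was ever observed."""
    return probs > 0.0

def find_anomalies(operations, logical, index):
    """Return every consecutive pair of operations that hits a 'zero'."""
    anomalies = []
    for a, b in zip(operations, operations[1:]):
        # A name never seen during training is suspicious in itself,
        # and so is a known pair whose transition was never observed.
        if a not in index or b not in index or not logical[index[a], index[b]]:
            anomalies.append((a, b))
    return anomalies
```

As a bonus, boolean cells make the OR-based manipulations discussed below trivial.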
Model assessment
Distributivity
Distributivity should be "good enough" if we take care to group SPNs into families with similar semantics so as to reduce:
a) false positives caused by artefacts[*]
b) false positives in the symmetric difference[**]
This grouping is very business-dependent; it is not guaranteed to scale well with the number of SPNs, but when it does, it is not difficult to identify and to set up.
Without grouping we have:
markov(spn1 OR spn2) = (markov(spn1) AND markov(spn2)) OR artefacts(spn1,spn2) OR delta(spn1,spn2)
With proper grouping, we hope to have spn1 ~= spn2, hence: markov(spn1 OR spn2) ~= markov(spn1) OR markov(spn2).
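In logical-matrix terms, grouping then boils down to an element-wise OR. A sketch, assuming both matrices were built over a shared operation index as above (the random matrices are purely illustrative):

```python
import numpy as np

# Hypothetical logical matrices for two SPNs of the same family, built
# over the same operation index (8 hash buckets, for the sake of example).
rng = np.random.default_rng(0)
spn1_matrix = rng.random((8, 8)) > 0.7
spn2_matrix = rng.random((8, 8)) > 0.7

# markov(spn1) OR markov(spn2): the family-wide set of legitimate transitions.
family_matrix = spn1_matrix | spn2_matrix

# The symmetric difference measures how dissimilar the two SPNs are [**];
# the smaller it is, the safer the grouping.
delta = spn1_matrix ^ spn2_matrix
print(f"transitions unique to one SPN: {int(delta.sum())}")
```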
Memorization and chronology
The learning ability is straightforward: acknowledging a false positive, so that future evaluations forget about it, just means OR-ing the false positive's transition into the existing matrix.
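A minimal sketch, under the same assumptions as the previous snippets:

```python
def whitelist(logical, index, a: str, b: str) -> None:
    """Mark a reviewed false positive A->B as legitimate.

    This is the OR-ing step: the cell flips from 0 to 1, so the
    transition will never trigger again in future evaluations.
    """
    logical[index[a], index[b]] = True
```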
Time-orientation is ensured by design: the higher the order of the model, the more time-oriented it will be. In practice, however, memory constraints limit us to orders 1 and 2: with 256 hash buckets, an order-1 logical matrix holds 256^2 = 65,536 cells, while an order-2 model, which conditions on the previous two operations, already needs 256^3 (about 16.8 million).
Conclusion
A simplified Markov model looks like a good substitute for time-series decomposition when tackling the seemingly intractable problem of flagging outlying Azure activities for a given SPN:
- on one hand, three properties work in concert to limit false positives drastically: this is an important criterion for performing sustainable DevSecOps.
- on the other hand, keeping a record of transitions offers assurance that most true positives won't be missed. This is an equally important criterion, this time for cyberdefense.
The main current grey area is whether the model scales as the number of SPNs grows. If not, its use could be limited to business-critical SPNs.
In part 3, I will describe a case study to support the conclusions we have reached so far, and show how we can stitch this together with Azure Sentinel's superb native incident-management workflow.
In part 4, I will describe a pen-testing tool (yes, you read that right...) I use to probe this model against fraud.
Finally, let me quote the second part of Mahmoud ElAssir's point on complexity:
What you want to do is manage complexity with better measurements, better prediction, and better accountability. In other words, better data management and analytics.
Notes
[*]: artefacts are caused by artificial transitions across two SPNs: an operation triggered by SPN1 happens to be immediately followed, in the merged log, by an operation triggered by SPN2.
[**]: the more similar two SPNs are, the smaller the symmetric difference of their logical matrices.