AI Pipeline: Surfing from Concept to Product Reality — SolarWinds TechPod 084

In this conversation, hosts Sean Sebring and Chrystal Taylor talk with Derek Daly, Principal AIOps Product Manager at SolarWinds. He discusses his career journey and the role of AI and machine learning in the company's products. He shares insights into the process of introducing AI and machine learning into product features, the impact of AI on jobs, and the considerations for on-premises vs cloud deployment. © 2024 SolarWinds Worldwide, LLC. All rights reserved.
Sean Sebring

Host

Some people call him Mr. ITIL - actually, nobody calls him that - But everyone who works with Sean knows how crazy he is about…
Chrystal Taylor

Host | Head Geek

Chrystal Taylor is a dedicated technologist with nearly a decade of experience and has built her career by leveraging curiosity to solve problems, no matter…

Guest

We’re Geekbuilt.® Developed by network and systems engineers who know what it takes to manage today's dynamic IT environments, SolarWinds has a deep connection to…

Episode Transcript

Sean Sebring:

Hello, everyone and welcome to another enlightening episode of SolarWinds TechPod. I’m your host, Sean Sebring, joined by my brilliant co-host, Chrystal Taylor. For this episode we’ll discuss the strategy, implementation, and design of AIOps with our very own well-qualified Derek Daly. Derek is a Principal Product manager for AIOps, or at least he was. Can you tell us a little bit more about your career leading up to this and maybe what you’ve moved to now?

Derek Daly:

Yeah, thanks, Sean and Chrystal for having me on today. So yeah, I’m AIOps PM in SolarWinds for the last two years. Started off my career in SolarWinds as a sales engineer and then had multiple different iterations through the sales engineering organization or solutions engineering organization. Previous to SolarWinds, I was stuck in the network world as well, but in the telecom sphere. So I was working with their OSS and BSS, so essentially your Orion for telecoms.

Derek Daly:

As soon as I saw Orion and I played around with the demo for my actual interview, I was like, “Oh my God, I know how to talk about this immediately. It’s so easy to use, it looks so nice.” I just couldn’t get over compared to the beasts of software that I was used to in the telecoms world. So started as a network-focused engineer, but through a bit of graft and timing and luck, I got to get stuck into other sides of our portfolio, so was more or less able to demonstrate pretty much every product we had up until we have ITSM. That’s one I know absolutely nothing about.

Derek Daly:

And then I started studying machine learning. I was like, “What can I learn about this and how can I apply this to SolarWinds?” I was always looking at it from a lens of, “How can we use machine learning to improve what we have in Orion?” So anyway, I used to talk a lot to product leadership, talk, give feedback also from a coaching perspective and one of our leaders at the time, VP of Product, who I’ve gotten pretty close to was aware that I was keen on Product, but I mentioned in passing that I was studying machine learning. He was like, “Great, how are you finding it?” And I talked a bit about it.

Derek Daly:

About four or five months later my boss rings me and says, “There’s someone here called Cullen who wants to talk to you about a role in Product.” I said, “Look, I’ll have a conversation with this person.” They were eager to get someone to run the AIOps side of the Product side of things. And I spoke with this person and the vision I was sold was incredible, but also as an engineer, I really understand the customer needs and if I can get into a role in PM where I’m actually setting the strategy and deciding what features we build, I really think I can add a lot of value to customers.

Derek Daly:

Again, I was very happy in the old role, and I wasn’t trying to ditch it. I wasn’t trying to jump from it ’cause I really loved that role, but it was just too good an opportunity to say no to. So I said yes to it. I took my sabbatical in July, and I came back as a PM, so it was a good transition. So that’s how we got to where I am today.

Sean Sebring:

And here you are. I feel like it does make the most sense as a natural exchange from an SE over to PM that happens more often than I would’ve thought. Like you said, I love that you take that tangible, actionable use cases, feedback from all the customers and they’re so close ’cause as an SE still, solution engineer, still I have to build and maintain the closest relationships with my PM to feel like the product’s going where I know my customers want it to go. I have to explain to my leadership why I moved to PM next. It sounds like-

Chrystal Taylor:

I think that was a very clear example too of all of the things we’re always talking about how important networking is and how important it is to maintain your education as you’re continuing forward and all of those things. You’re such a good example of all that. It’s always exciting to talk to you because you’re also just so passionate about working with the products and with the customers and I don’t think that should be taken lightly. We hear it all the time. Trade shows, when you guys are SEs or were SEs and I’ve never been an SE, but I did work for a partner and I worked with lots of customers and listening to their feedback and what they’re using everything for and what they see as more important is always super interesting. So it’s nice to re-meet you.

Sean Sebring:

You have been with SolarWinds for how long, Derek?

Derek Daly:

11 years.

Sean Sebring:

11 years. That’s awesome. ‘Cause if it was just two years for you as far as taking over AIOps and the Product portfolio, how long has SolarWinds been adding AI/ML, and you can separate them if you want to answer that question into the portfolio.

Derek Daly:

Yeah. So it’s important to outline what the words mean and what the phrases mean. So AI is artificial intelligence and that is using computers to fulfill anything a human can do. So AI in essence is and has been around us for a long, long time. Machine learning is a subset of AI and that is where a machine using some form of mathematical equation or algorithm is identifying patterns and trends and then using those to make decisions in the future.

Derek Daly:

It’s technically learning, it’s identifying things and being able to use that to make decisions in the future. AIOps then is a subset of AI, but then some elements of it are not machine learning-based. So AIOps does not necessarily always equate to machine learning. When you’re building products and when you’re trying to deliver features to a customer the very base point is understanding a problem. What is the problem the customer is facing? Maybe it’s a problem that they don’t know they have right now.

Sean Sebring:

Spoken like a true solution engineer right there.

Derek Daly:

I can’t walk away from my heritage as an SE, but yeah, so it’s identifying the problem, be it something that they’re aware of or not aware of. Okay? Then when we try and come up with a solution, it’s always best to keep the solution as simple as possible. So if you can do something, if you can solve a problem with inbuilt logic or some very, very simple mathematics, then it’s absolutely the way to go. It’s faster. The cost to host it for a customer perspective or for a SaaS perspective is going to be lower. So your total cost of ownership is lower, the performance is typically better, and it solves the problem.

Derek Daly:

Now, if whatever problem you’re trying to solve cannot be easily solved with inbuilt logic, then that’s where you start to think about AIOps and in particular machine learning. An example of that would be something that is seasonal, for instance. So you can easily get an understanding of what the average metric value was for CPU for the last week, and you could very easily use that to forecast, somewhat accurately, for the next week or so. But that doesn’t take into account that every three weeks payroll is super busy, so all the systems that they use are going to have a spike. It doesn’t take into account the TechPod that’s running once a month. It doesn’t take into account the stuff that isn’t as daily or weekly.
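
The seasonality point above can be sketched in a few lines of Python. This is an illustration only, not SolarWinds code: the data, the 21-day payroll cycle, and the function names `average_forecast` and `seasonal_naive_forecast` are all assumptions made for the example. It compares a forecast built from last week's average with one that repeats the value seen one full cycle earlier.

```python
from statistics import mean

def average_forecast(history, horizon):
    """Forecast every future point as the mean of the last 7 observations."""
    level = mean(history[-7:])
    return [level] * horizon

def seasonal_naive_forecast(history, horizon, period):
    """Forecast each future point as the value observed one full period earlier."""
    forecast = []
    for step in range(horizon):
        # Position in history that sits exactly one period before the forecast point
        source = len(history) - period + (step % period)
        forecast.append(history[source])
    return forecast

# Hypothetical daily CPU %: flat at 30, with a payroll spike to 90 every 21st day
PERIOD = 21
history = [90.0 if day % PERIOD == PERIOD - 1 else 30.0 for day in range(70)]

avg = average_forecast(history, horizon=21)
seasonal = seasonal_naive_forecast(history, horizon=21, period=PERIOD)

# The weekly average never sees the spike; the seasonal forecast carries it forward
print(max(avg), max(seasonal))  # 30.0 90.0
```

The last seven days of this made-up history contain no payroll run, so the average-based forecast stays flat, while the period-aware forecast predicts the next spike on the day it is due.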

Derek Daly:

So that’s the things you need to take into consideration when you’re trying to solve a problem. AIOps, machine learning, is not always the best solution and to be fair to SolarWinds and something that I think is actually we should be proud of is that we haven’t just thrown machine learning all over the place just to stick a machine learning banner on it.

Derek Daly:

And also, we’ve been doing AIOps for 10 years. We’ve had machine learning in our products for 5, 6, 7 years, for longer than a lot of software vendors out there, but we haven’t just blown our trumpet about it and said, “Hello, look at us. We’re using machine learning. We’re great.” We’re out there to solve real problems and it doesn’t matter to the customer how we solve them. If we solve the problem, they’re happy. We don’t have to stick a big machine learning banner on it and say, “Look, this is really cool, it’s AI, yada yada yada.”

Chrystal Taylor:

Yeah, it sounds like it’s really important to make purposeful choices in where you use that. AI and machine learning are tools just like any other. So if you are using them in the wrong place, they’re not going to do everything that you hope for, and they might in fact be harmful to the end goal that you’re trying to achieve.

Chrystal Taylor:

You mentioned several things there and all I could think about was dynamic baselines in the Orion platform when you were describing the alerting. It doesn’t account for certain things because those are designed to use the last seven or 15 days to set a baseline based on performance over that time period. So it only takes into account things that happen during that time period and it’s a lot less intuitive and can’t learn over time because you have to reset a time window for it to use for the baseline if you want to reset that baseline.
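
A fixed-window baseline of the kind described here can be sketched as follows. This is a minimal illustration and not the Orion implementation: the mean-plus-standard-deviations rule, the sigma multipliers, the sample data, and the name `baseline_thresholds` are assumptions chosen for the example.

```python
from statistics import mean, stdev

def baseline_thresholds(window, warning_sigmas=2.0, critical_sigmas=3.0):
    """Derive warning and critical thresholds from a fixed window of samples."""
    center = mean(window)
    spread = stdev(window)
    return center + warning_sigmas * spread, center + critical_sigmas * spread

# Hypothetical daily average CPU % for the last 7 days (the baseline window)
last_week = [40.0, 42.0, 38.0, 41.0, 39.0, 40.0, 40.0]
warning, critical = baseline_thresholds(last_week)

# A payroll spike that recurs every three weeks was not inside the window,
# so the baseline flags it as abnormal even though it is expected.
payroll_day_cpu = 85.0
print(payroll_day_cpu > critical)
```

Because the thresholds only reflect what happened inside the window, anything that recurs on a longer cycle looks like an outlier until the window is reset to include it.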

Chrystal Taylor:

So I just remember when it came out and it was just mind-blowing before you had to have all manual baselines and tell it when things were not normal, and that was already a big step in the next direction of, “Let’s figure out how we can reduce the amount of time it takes to maintain all of this and figure out when things are going awry and what things are outside the normal operating procedures?”

Chrystal Taylor:

And I think that was really interesting that watching this kind of journey, I’ve been working with the SolarWinds products for about 13 years and the same thing. You watch the journey of it go from very manual and it was very easy to use. I agree with you on that. You were describing your journey earlier. I worked in retail and then I got into working with SolarWinds software and I had no problem.

Chrystal Taylor:

So I agree it’s super easy to use and that was super helpful, but watching it go from monitoring things on a very manual basis, you have to go and say, “These are the things that you’re doing. These are the things I care about, these specific things.” Instead of doing that, we moved into additional kind of ways, the dynamic baselines and then we added recommendations and certain things like the virtualization recommendation engine was added. And just watching it evolve over the years and you can clearly see where AI and machine learning have come into play a little bit as time went by and now, we’re just jumping in a little bit more in those kind of same spaces, actually.

Derek Daly:

Yeah. When the recommendations came out, “Wow.” As a feature, it was just incredible to demo. Customers absolutely loved it. It solved unbelievably complex problems in a real easy manner. Even today, I love the feature. I just love how it looks and it’s so powerful. One of the drawbacks to it, which we know, is that it doesn’t take into account seasonality. Other than that, it’s a phenomenal tool and it’s been there for many, many years. We didn’t charge any extra for it. I thought it was just a really, really neat feature.

Sean Sebring:

I have a very ignorant question for an appropriate audience, which I feel like I do on every episode, so maybe I’m just an ignorant person, but I like to get experts’ opinions on definitions, like the meaning of something. Of course, observability is one that’s been out for a while, but in your case, Derek, I was going to ask you to give us a definition of AIOps, and it doesn’t have to be SolarWinds definition. It can be your thought leadership and maybe SolarWinds happens to perfectly align with the way you would describe it, but what is AIOps? What does that mean to you?

Derek Daly:

What is AIOps? So if we just break it apart, it is artificial intelligence for IT operations. Observability in its own right is a form of AIOps because it’s computers who are consuming, ingesting all of this data and presenting it in a way that makes it easier for you to either find the root cause of an issue, but essentially to make a business decision. Your end goal is the business use case. Anything in my opinion that’s able to ingest different types of data over different periods of time and to give you a unified view on it, that is the type of insights that helps you run your operation.

Chrystal Taylor:

There’s so much data out there. I don’t remember the exact figure, but I do remember there’s a study done about how much data we produce as a whole and how it’s growing exponentially day-over-day at this point. Because there’s no possible way for any one human being to consume even a fraction of it. So the fact that we’re using computers to help process that data and help us make decisions is only really the next step because we can’t possibly do it ourselves.

Chrystal Taylor:

I’m excited about using the technology that we have, using the tools that we have to further our goals and to help people process things so they can make decisions more quickly. If you have to use a human to process all the data and make a decision every single time, it’s going to take a long time to make any decisions. Using the tools at your disposal is what you should do.

Derek Daly:

If you just take a Kubernetes cluster, for example. So Kubernetes cluster can generate millions of log lines in a couple of minutes, just one cluster. That’s just logs, not including traces, not including metrics, just logs. That’s one service for example, that’s been run by that Kubernetes cluster.

Chrystal Taylor:

This leads me to the next question, which is what is the process for introducing AI and machine learning into a product feature versus adding a feature that doesn’t require AI or machine learning?

Derek Daly:

Before, so when we introduced anomaly detection in DPA or the recommendation system in VMAN, they were solely focused on solving an individual problem. So I think at that stage it was an easier decision that, “Okay, we have a query response time. How do we find if a query response time is normal or not normal?” So they played around with a few different things with inbuilt logic. And at the time it was felt that maybe we need to start putting some form of machine learning algorithms into the product.

Derek Daly:

So SolarWinds created three or four algorithms that were specific to that use case to solving database query response time, a very focused approach to a very specific problem. Now, if we think about the current structure of our products and the SolarWinds platform, then we are approaching it from a different way. So we are pre-building machine learning services, so we have a machine learning team, or teams even, and they’re creating different services for us to utilize.

Derek Daly:

So for example, forecasting, anomaly detection, so on. So they’re creating them, they’re continuously refining them, they’re changing algorithms, adding algorithms, giving more flexibility in terms of the algorithms. They’re testing them for different metric types. For example, to forecast something like CPU isn’t as easy as forecasting something like disk space usage. For disk space usage, we can accurately forecast up to probably 95%, maybe four or five weeks ahead. Whereas for CPU, memory, and these more ephemeral entities, it’s harder to forecast further ahead, so maybe four or five, six days you get some form of accurate responses.
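
One reason disk usage is easier to forecast than CPU is that it tends to grow along a steady trend. A rough sketch of that idea, assuming a simple least-squares line is a good enough fit, is shown below. It is not the forecasting service discussed in the episode; `fit_line`, `days_until_full`, and the sample data are hypothetical.

```python
def fit_line(values):
    """Ordinary least-squares fit of y = slope * day + intercept over days 0..n-1."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    variance = sum((x - x_mean) ** 2 for x in range(n))
    slope = covariance / variance
    return slope, y_mean - slope * x_mean

def days_until_full(used_pct_history, capacity=100.0):
    """Extrapolate the fitted trend to estimate days remaining until capacity."""
    slope, intercept = fit_line(used_pct_history)
    if slope <= 0:
        return None  # usage is flat or shrinking, so no fill date can be projected
    day_full = (capacity - intercept) / slope
    return max(0.0, day_full - (len(used_pct_history) - 1))

# Hypothetical volume: 50% used a month ago, growing half a percent per day
disk_history = [50.0 + 0.5 * day for day in range(30)]
print(round(days_until_full(disk_history)))  # 71
```

A metric like CPU rarely follows a line this cleanly, which is consistent with the shorter useful forecast horizon mentioned above.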

Derek Daly:

So you have to understand what we’re trying to achieve, and then you have to do all of the machine learning testing, all of the experimentation on the algorithms and on the data science side. So that is a new field and that’s just a sole function that AIOps services, machine learning, data science. So they’re their own kind of unit and they’re continuously refining how they do things and the services they can provide.

Derek Daly:

Then as service users, we can connect with the team and say, “Look, this is a problem such as alert correlation. How do we use your service to give us a more refined correlation?” And then we start the conversation with those guys. Then they may do some separate streams of experimentation specifically for our use cases and figure out if we’re using the correct algorithm or if there’s a better approach. At the same time, from our engineering perspective, we’re trying to solve the problem in an inbuilt logic way. Once we finish our experimentation and figure out which gives us the best results because obviously, we’re SolarWinds, we have lots of really good data sets that we can use to test to get accuracy and to see which approach is better.

Derek Daly:

Once we understand which approach is better from a data science, from an experimentation perspective, then we decide to either go down the machine learning route or down the inbuilt logic route. So it’s pretty cool that having a platform approach means that any products, any feature can technically utilize the service. So for instance, let’s say I want to monitor websites and I’m like, “Oh, it’d be great if we could see response time anomalies for websites. Oh, great, there’s a service there for that. We just have to build an API to connect into it.” So that’s the new approach. You have the service there, it’s available to consume, and then you figure out if that’s appropriate for your use case or not.

Sean Sebring:

I want to make a quick tangent, I promise it’ll be quick, I promise, but it’s a callback to stuff we’ve talked about in the past and something that you said earlier, Derek, it’s hard to not see AI splashed everywhere lately. We’ve had a hard time having episodes that don’t include talking about AI, not for lack of trying to be creative and find things, but one of the questions that always comes up when we’re talking about AI and the ethics of AI is jobs.

Sean Sebring:

AI’s job is to quote unquote, “Replace the work that we’re doing.” And some people take that as replacing them, but what it’s actually doing is changing, and we’ve talked about this too, it’s changing jobs. You just mentioned teams, several different types of teams dedicated to creating, supplementing, enhancing the AI. In fact, we have an AI PM. It’s just a fun thing to note and someone who works directly in that field talking about the several different teams means that jobs are just changing. AI is not taking things away from us. It’s making the mundane go away in many ways, and then just adding fun complex job roles to create new AIs.

Derek Daly:

When the printing press was invented, you were able to produce a book in a day, so print a book in a day, whereas beforehand you had scribes who were, that was their lifelong dedication, and it took them years to complete a book. Some scribes lost their jobs, I assume, or had to train as something else or whatever, or maybe they just became more niche. When the mills were brought out, the Luddite movement was burning down mills because they’re like, “Oh, you’re going to ruin our jobs or take our jobs.” It didn’t. It just made things more efficient, and it created more jobs.

Derek Daly:

So AI, sure it’s going to take some jobs, it’s definitely going to take my job anyway as a product manager, but what it does is it allows you to be more efficient. So obviously working with engineers on a daily basis, something like Microsoft Copilot, these types of things. Instead of spending all that time and trying to figure out where your bugs are to accurately identify where you need to make your code more efficient, it’s doing it for you automatically.

Derek Daly:

So then you can actually focus on solving problems, not just having clean code and making sure it’s commented right and that you’re using the most appropriate methods and stuff like that. It just makes things more efficient. And there potentially are some roles, some positions that may be somewhat manual that could be replaced by AI. Maybe it’ll improve some customer service jobs, make them quicker for them to identify the solution or it is going to replace jobs, but for me, it will make certain jobs more meaningful and more creative.

Sean Sebring:

I think creative is a key word I like to take away from that, and I’m allowed to focus on innovation because it’s doing the boring work where a human error is possible that I would normally have to spend hours on, not spending hours on innovation.

Derek Daly:

Completely. From a solutions engineer perspective, a lot of the time is spent doing demos for customers. Imagine you had a very smart system that could do really, really smart tailored demos, not just the inbuilt logic ones that we’ve seen over the years where it’s really, really smart and intuitive, and then as the solution engineer helping with issues customers are facing, help them to interpret the data to identify issues, it just gives you so much possibilities. You know what I mean?

Derek Daly:

I’ve met a guy in the airport there last week, a friend of mine, and he’s a very niche radiologist and has to look for certain things in images a lot to identify certain illnesses, and for him as a medical professional, just being able to save hours and hours and hours a week where he’s just studying these images, you have AI which can understand it, identify patterns immediately, alert you to something very, very quickly, save time.

Derek Daly:

We know all healthcare systems around the world are probably slammed at the moment. There’s just not enough doctors, there’s not enough nurses, there’s not enough consultants, there’s not enough specialists, and this isn’t going to get rid of nurse jobs, it’s not going to get rid of specialist assistance, any of these roles. What it does is it just makes things more efficient. It shortens the queues that people have to wait to get medical attention. It’s just going to make things faster, more accurate, but again, it’s the human will always have to make that final decision. Sure, it’s going to be inaccurate here and there, but is it going to be as inaccurate as a human is? Probably not. I saw something yesterday about self-driving cars and it’s said that one of the barriers to actually real self-driving isn’t how accurate it is or isn’t. It’s the liability of the company that delivers it.

Derek Daly:

In the US I think last year there was 45,000 people killed on roads. So if everyone went to self-driving cars and there was only 10,000 deaths per year, that’s a humongous improvement, but it’s still 10,000 lives and there should not be any lives lost on the road. So the car companies could demonstrate, “Yeah, with our self-driving cars, everyone’s on them now. It’s all smarter.” But there’s always going to be that liability, so there’s always going to have to be a human that actually makes the final decision. They have to have a human to pin this on, can’t just be pinned on some anonymous algorithm.

Chrystal Taylor:

You reminded me of two different things that I read about semi-recently, one of which was yesterday I think, where someone was using the Apple Pro headset, and they were using it in their self-driving car, and they were being chased by police for miles and miles. The self-driving car thing brought that to my attention of they’re calling it the future of work and all of this, and it’s just like, “But no, you still need to be able to…” The proof was that you still needed to be a human making the decision following what you just said. You still need to be able to make that decision because the car wasn’t stopping on its own for all these police that were chasing this person. You still have to make a decision to stop or to do things, and they still make mistakes because as we know, all good AI is only as good as the data that it’s based upon, so it can only make good decisions based on what it’s using.

Chrystal Taylor:

The other story that I was remembering was from your healthcare conversation, which was very insightful and very interesting. I read a story about a robot that they had developed in a country in Asia, and I don’t remember which country, but it was originally developed to distinguish between bagels and donuts, and that was their whole goal was to… That’s all they were going to do. And it ended up being able to detect whether there were cancer cells in a person, and it was just like that wasn’t what their intention was, and this just proves technology is iterative.

Chrystal Taylor:

We’re always learning and the technology that we’ve gotten over the years, you mentioned the printing press and all of these things. I’ve used cameras as an example when we’ve talked about this in the past of they replaced paintings and we still have jobs, we still have to learn how to maintain these things. The assembly line is another great example of that. It replaced a lot of jobs, but it gave you different jobs instead. There’s still things that humans need to do. We still need to be able to do those things, and I like that you said allows us to do the creative things as well.

Chrystal Taylor:

We should be using it to process data and do all this mandating. I don’t want to stare at spreadsheets and databases and just flat data all day long. That sounds very monotonous and tedious and boring. And if I could get something that would trawl that data for me, gigabytes and terabytes and petabytes of data for me and tell me any insight into it, that then I could go, “Oh, I pinpointed this timeframe or something.” Now I can look at a subset of data instead of petabytes of data. That sounds like an excellent win for me.

Sean Sebring:

Okay, I promised it was a short tangent and we’ve enjoyed veering quite a bit, but it’s such a relevant topic and I don’t think that one should ever be ignored if it’s ever fully understood what the ethics are behind how AI is quote unquote, “Replacing things.” But I appreciate you both humoring me there, but maybe I can reel us back into the AIOps side of things and something, Derek, you and I had talked about that you wanted to express your opinion or thoughts on was the considerations that go into choosing when to add to on-prem versus cloud or how to make that decision?

Derek Daly:

So when I joined the team, most of my experience, be it in telecoms and then in SolarWinds, was in the on-prem world. So that’s self-hosted where the customer has the software installed on their own premises or it could be cloud, but they’re hosting it themselves, they’re paying for it. And when I came into the team and we were talking about anomaly detection for the on-prem world, the architecture, the solution was based on a cloud-based service.

Derek Daly:

So the first question I had was, “Why can’t we do this on-prem?” And it was a very obvious question, and there had been many, many months of debates about this and redesigns of what we now see as AIOps and SolarWinds. And then I slowly began to realize, as I understood machine learning a bit better, was the amount of effort and work required to keep your algorithms accurate on the on-prem world where you’re upgrading every quarter or maybe once a year, twice a year, it’s just not impossible obviously. Nothing’s impossible, but it’s just incredibly difficult to keep it accurate.

Derek Daly:

Now, that’s a strange comment. “How do we keep an algorithm accurate?” The algorithm is the algorithm. The algorithm is a mathematical equation that’s chewing on data and it’s spitting out results. The problem is machine learning is a process whereby it’s never static. They’re more informed. So machine learning is the machine is learning. The day you deploy an algorithm is the day it’s at its least accurate. It is always getting better, it’s always learning, it’s always getting more accurate. It’s only getting more accurate because we’re continuously training it. So it’s like you’re going into kung fu. The day you get your black belt, can you get better? You might think you’ve reached the top and you can’t go any further, but even with your 3rd degree black belt in karate, which is I believe the highest belt, you can still get better.

Derek Daly:

So everything can always continue to get better. It can continue to get more accurate. You have to treat it as a living thing. So you have to make sure it’s always getting appropriate data, appropriate to the use case because you can use one algorithm for hundreds, thousands, millions of different use cases. So if you can imagine observability or SolarWinds Orion or HCO or SolarWinds Observability (SWO), well, we have systems in place that actually monitor and maintain our algorithms. We manage those systems. We continuously monitor how an algorithm is working. We always take the results, we test them against training data, we test them against other data sets.

Derek Daly:

Obviously, a lot of this is done automatically, but we’re there, we’re sitting there. We’re continuously refining the algorithms. We can change the algorithms and we can update the algorithms. If you are an on-prem customer and you had some form of machine learning sitting there on-prem, and you update your system once a year, then there’s 12 months there that your model, your algorithm can become less and less accurate. It’s called model drift. So just the model doesn’t necessarily give you the appropriate responses based on the input. So we’re continuously refining it and sure you could have it on-prem, but the customer would potentially have some form of update on a daily, weekly basis.
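
Model drift as described here can be watched for by comparing a model's recent error with the error it showed when it was deployed. The sketch below is one simple way to do that, not a description of the monitoring SolarWinds runs; the mean-absolute-error metric, the 1.5x tolerance, and all names and numbers are assumptions for illustration.

```python
from statistics import mean

def mean_absolute_error(predicted, actual):
    """Average absolute gap between predictions and observed values."""
    return mean([abs(p - a) for p, a in zip(predicted, actual)])

def has_drifted(baseline_mae, recent_predicted, recent_actual, tolerance=1.5):
    """Flag drift when recent error exceeds the deployment-time error by a margin."""
    recent_mae = mean_absolute_error(recent_predicted, recent_actual)
    return recent_mae > tolerance * baseline_mae

# Hypothetical numbers: the model averaged 2.0 points of error when it shipped
baseline_mae = 2.0
predictions = [30.0, 30.0, 30.0, 30.0]

still_accurate = has_drifted(baseline_mae, predictions, [31.0, 29.0, 32.0, 30.0])
gone_stale = has_drifted(baseline_mae, predictions, [31.0, 36.0, 38.0, 35.0])
print(still_accurate, gone_stale)  # False True
```

A check like this only helps if someone can act on it promptly by retraining or replacing the model, which is the difficulty with a system that is upgraded once or twice a year.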

Derek Daly:

So there’s going to have to be some connectivity to the cloud. The fact that it’s cloud-based too means that we can not just optimize it, but we can improve it and we can just rip it out and replace. So in V1 of our anomaly detection on the HCO, we used a particular algorithm, and this took into account daily seasonality, very, very accurate for daily seasonality. We had been playing around with different algorithms and we found a newer algorithm that was better, and it could take into account almost a month’s worth of data. So that would give us what’s called weekly seasonality.

Derek Daly:

So we can take into account those occurrences every two weeks, every three weeks, every 25 days, and we were able to basically swap out the algorithm overnight to V2 and customers didn’t experience any difference. The accuracy of their existing anomaly detection didn’t change, it was all seamless. We’re in the process now of V3, which is going to be a step-up again on V2, and this will have a more appropriate algorithm, potentially longer periods of time.
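
The overnight swap from V1 to V2 works because callers talk to a service rather than to a specific algorithm. A small sketch of that arrangement follows. Everything in it is hypothetical: `AnomalyService`, `make_seasonal_detector`, the hourly data, and the tolerance are stand-ins, and the real algorithms are not described in the episode.

```python
from statistics import mean

def make_seasonal_detector(period, tolerance):
    """Build a detector that compares a new value with past values at the same phase."""
    def detect(history, value):
        # Earlier samples that share the new point's position within the cycle
        same_phase = history[len(history) % period::period]
        if not same_phase:
            return False  # not enough history to judge
        return abs(value - mean(same_phase)) > tolerance
    return detect

class AnomalyService:
    """Stable interface for callers; the detector behind it can be replaced."""
    def __init__(self, detector):
        self._detector = detector

    def swap_detector(self, detector):
        self._detector = detector

    def is_anomaly(self, history, value):
        return self._detector(history, value)

# Hypothetical hourly CPU %: 30 normally, 80 during a weekly backup (hour 146 of each week)
WEEK = 168
history = [80.0 if hour % WEEK == 146 else 30.0 for hour in range(482)]

service = AnomalyService(make_seasonal_detector(period=24, tolerance=20.0))
v1_result = service.is_anomaly(history, 80.0)  # the daily pattern flags the backup

service.swap_detector(make_seasonal_detector(period=WEEK, tolerance=20.0))
v2_result = service.is_anomaly(history, 80.0)  # the weekly pattern expects it
print(v1_result, v2_result)  # True False
```

The calling code is identical before and after the swap, which mirrors the point that customers did not need to change or update anything on their side.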

Derek Daly:

So all of this stuff can be done without having to update on the customer side. There’s no change required. We’re obviously consuming the cost of this as well. We’re not charging an extra license or there’s no extra line item on the SKU for machine learning or for anomaly detection or all the other features that are there or that will be there. And the cost of running a machine learning service is extremely expensive.

Derek Daly:

So one of the barriers to entry for some people from an observability perspective, especially with on-prem, is the total cost of ownership. So the hardware that you have to buy to run something. So take the most expensive component of an HCO deployment, which is typically the SQL Server. You have to pay for a license, and you have to give it a lot of RAM and it has to be really well managed. Imagine that twofold, threefold, maybe fourfold. That’s the type of system we’re talking about when we’re doing machine learning, obviously, depending on the amount of metrics we’re throwing at it and the number of nodes that you have as well, but it’s unbelievably expensive.

Derek Daly:

So by taking that cost away from the customer and then by us having control over the algorithm and making sure that it’s always as accurate as possible, and again, being able to rip and replace better algorithms, it just makes sense from a deployment perspective why we went with cloud only. Now we have on-prem anomaly detection for DPA, and it’s really smart, but it’s really use case-specific. It’s query response time. Okay. So that’s an easier thing to accurately predict and for you not to have model drift, but we also have a number of different algorithms that I can switch between there as well. It’s extremely accurate because it’s such a narrow, narrow use case, and because with query response time, the scale of data that’s sent to it is way lower than from a massive observability system.

Chrystal Taylor:

Right. If we’re talking about the anomaly-based alerts, for instance, it’s based off of five different metrics. I think it might only be less than that now, but I think that’s the goal is for the five golden metrics to all be included in that. So it would need to be by default, way more robust than that if it’s only doing query tuning.

Derek Daly:

Yeah. And while it’s golden metrics or a subset of metrics, it’s on every type of entity: servers, network devices, virtual machines. And then for virtual infrastructure, you’ve got extra metrics again, so it’s potentially millions of metrics per user. It’s large scale.

Sean Sebring:

Perfect segue, ignorant question again, anomaly-based alerts. I can make my own interpretation based on the name, like Derek’s brilliant explanation of what is AIOps? Well, I could say lots, probably alerts based on an anomaly, but the anomaly is the cool part here. Because the anomaly, is it actually using its AI and machine learning stuff to determine what’s an anomaly and then that’s my deduction. But if you would please, Derek, expand for us.

Derek Daly:

Sean and Chrystal, you’re definitely, definitely fully aware of this from talking to customers for years and years and years. There are probably three things that have been apparent since I started working with monitoring tools in general. One, how do we sift through the noise? Two, how do we tie things together better? And three, can you help me to find the issue and to fix it? So there are three things. The third one I always call the easy button. Do it for me. So that’s one that’s still a difficult one to do. It’s closer definitely, but I’ll leave that one off for now.

Derek Daly:

The alert storm one is probably the easier one. I could have 500,000 interfaces in my system. And I have an alert that says, “Alert me when interface utilization is higher than 80%.” I could get 50,000 alerts in a day because there is extra traffic. It could be an all-hands meeting where everyone’s online at the same time or something simple or everyone’s watching the Super Bowl replay at work at the same time from the same service provider or on our side of the world watching the World Cup or rugby or soccer, it doesn’t matter.

Derek Daly:

So there can be a big increase and it’s maybe not normal, but it’s not something we should worry about. So how do we identify alerts that should actually be alerts? So how do we just say, “Look, these are all triggering here. They’re all saying they’re high, they’re bad, we need to look at them.” How do we just say, “You know what? Just show me the ones I need to see. That was the problem.” And we obviously worked on it for years. We brought in certain things, dynamic thresholds on Orion, which definitely helped things, and then some other bits and bobs, which helped without a doubt, but didn’t go far enough.

Derek Daly:

So then we started to talk about anomaly detection and machine learning. How do we use machine learning to reduce the alert noise for customers? And machine learning was absolutely the right way to go. It was the only way to go. And what we’re doing essentially is we’re identifying normal operating range. So what is normal? That’s what we first thought about. We’re saying, “Look, this is normal.” And we’re sending that normal to HCO all the time, continuously being sent, and we have a three-hour sliding window, so it’s always being updated.

Derek Daly:

So we’re always looking ahead three hours for what is normal. It’s obviously changing based on real-time information, but it knows the last month’s worth of data for that particular metric. That’s a separate training stream just for that one metric for that month essentially. And it’s continuously saying, “Look, this is what we expect the next three hours to be.” And that’s always looking ahead.
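
As a rough illustration of a "normal operating range" that looks three hours ahead, here is a minimal Python sketch. It is not the algorithm SolarWinds runs: the transcript does not name one, so a simple per-hour-of-week mean plus or minus `k` standard deviations stands in for the trained model. The function name `expected_ranges` and the parameters `hours_ahead` and `k` are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean, stdev


def expected_ranges(history, now, hours_ahead=3, k=2.0):
    """Estimate a (low, high) band for each of the next `hours_ahead` hours.

    history: list of (timestamp, value) pairs for ONE metric, covering
             roughly the last month (an assumption taken from the talk).
    Stand-in model: mean +/- k * stdev of past samples that fell on the
    same weekday and hour. A real service would use a trained model.
    """
    # Group past samples by (weekday, hour) so weekly patterns are kept.
    buckets = {}
    for ts, value in history:
        buckets.setdefault((ts.weekday(), ts.hour), []).append(value)

    ranges = {}
    start = now.replace(minute=0, second=0, microsecond=0)
    for i in range(hours_ahead):
        slot = start + timedelta(hours=i)
        samples = buckets.get((slot.weekday(), slot.hour), [])
        if len(samples) < 2:
            continue  # not enough history to describe "normal" for this hour
        mu, sigma = mean(samples), stdev(samples)
        ranges[slot] = (mu - k * sigma, mu + k * sigma)
    return ranges
```

Calling this again as time moves forward gives the sliding-window behaviour described above: the band for the next three hours is always recomputed from the most recent month of data for that one metric.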

Derek Daly:

So we know what normal is, if anything outside of normal occurs. So if a metric that we’re ingesting is outside of normal, that’s an anomaly. Now, we don’t always want to be alerted on every single anomaly. If a CPU normal operating range for Monday at 10:00 till 12:00 is 10 to 30%, that’s the expected range for the next three hours. Then what if the metric is 31%? That’s an anomaly. It’s outside of the normal operating range. But do we care about that? Absolutely not. So with our anomaly-based alerts feature, it’s understanding what the normal operating range is for the next three hours. But we also have the ability to tie in a static threshold as well. So we’re saying if CPU is anomalous and it’s greater than 80, then trigger the alert.

Derek Daly:

Now, people say, “But then why don’t we just use static? Why don’t we just use greater than 80?” And I would say, “Yeah, we can do that. But think of it this way, if the expected range for CPU from 10 to 12 today for whatever device is 85 to 95%, if we just go back to the static threshold, it’s greater than 80, so that’s going to flag. But that’s not something that we need to worry about ’cause it’s expected.” So we’re saying if it’s anomalous and greater than the threshold, trigger the alert. Say the value comes in at 92%.

Derek Daly:

Most people will go, “Oh my God, we got to do something about this.” But we know that’s expected. So that’s why it’s so smart having a static and anomalous detection within the one alert. And then what happens is, let’s say for instance, the anomaly detection service is unavailable. Now it doesn’t matter. We have that three-hour buffer where we’ve looked ahead three hours, but let’s say there’s an outage where we have no inside connectivity. Then the alert, if we have set the static threshold over 80, will still trigger even if we don’t understand what the normal operating range is.
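
The rule described over the last few paragraphs (alert only when a value is both anomalous and over the static threshold, and fall back to the static threshold alone when no expected range is available) can be sketched in a few lines of Python. The function `should_alert` and its parameters are hypothetical names for illustration, not part of any SolarWinds API.

```python
from typing import Optional, Tuple


def should_alert(
    value: float,
    static_threshold: float,
    expected_range: Optional[Tuple[float, float]],
) -> bool:
    """Combine an anomaly check with a static threshold.

    expected_range is the (low, high) normal operating range for the
    current time, or None when no forecast is available (for example,
    the anomaly detection service cannot be reached and the look-ahead
    buffer has run out).
    """
    over_threshold = value > static_threshold
    if expected_range is None:
        # Fallback: with no idea of "normal", behave like a plain
        # static-threshold alert.
        return over_threshold
    low, high = expected_range
    is_anomalous = value < low or value > high
    return is_anomalous and over_threshold


# The examples from the conversation, with a static threshold of 80:
print(should_alert(31, 80, (10, 30)))   # anomalous but low: False
print(should_alert(92, 80, (85, 95)))   # high but expected: False
print(should_alert(92, 80, (10, 30)))   # anomalous and high: True
print(should_alert(92, 80, None))       # no forecast, static fallback: True
```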

Derek Daly:

So you have a fallback and a safety measure, and anomaly detection is fine, but from our experimentation, we found that anomaly detection… Everything is anomalous. So much stuff is anomalous because it’s not expected, and you could be reducing your alerts by 60%, but you’re still getting that extra 40% of alerts that you don’t need to see. By adding in that static threshold and tying it with the anomaly detection, then we’re getting those alerts down to 1%, sometimes more. So you’re really cutting out the noise. So that’s how we’ve come up and solved that problem of alert noise.

Chrystal Taylor:

From years and years of talking with customers and working in customer environments, we get an alert every single day in that 10 to 12 window because it was at 80% or 85% or whatever versus with the anomaly detection, then they can say, “Well, it’s not anomalous at this time every day, so I won’t get an alert.” Imagine people paying attention to your alerts again. That’s all I’m saying.

Derek Daly:

People live in alerts and it’s just so hard to clean it up, and I can’t leave unread emails in my inbox, so when I’m seeing 10,000 alerts, I’m like, “It is just driving me crazy.” And you can’t just mark all as read, delete all. It’s just not as simple as that. So we need to get appropriate alerts for stuff that we actually need to address. And this leads me on to the second major feature that I’ve been involved in on the HCO side of things. And that is AlertStack. AlertStack goes that step further to try and solve problem two, which is, “How do we pull things together and logically group them based on dependencies or some form of relationships?” You could do some very rudimentary relationship management or parent-child type of stuff, but we need something that goes a bit further than that, that’s a bit smarter.

Derek Daly:

We know that HCO has really smart topology information already that really understands the connectivity between different elements within an environment. So to build on top of anomalous alerts, we built an alert correlation tool. Now, the cool thing about this is that it does not require connectivity to the cloud. It is all done with inbuilt logic. So this is a decision we did. So we experimented with machine learning, we experimented with inbuilt logic, and this was a prime example of something that did not need the big flashy machine learning badge on it.

Derek Daly:

And what it does is it uses HCO topology to understand relationships. And if an alert is triggered, now that can be an anomaly-based alert by the way, so that integrates into this feature. So if any alert is triggered, the system looks into the entity or the node or whatever element owns that alert, goes back 30 minutes in time, has a sniff around the relationships to see if there’s anything that is somewhat connected or related to that, and if there is, it checks to see if there are any issues with it. It could be a change event, it could be an alert, an anomaly alert. And if there is more than one that are somewhat connected, it will create a group that we call a cluster.

Derek Daly:

This cluster is a living and breathing thing, and it’s continuously updated with the current status of those change events or alerts. It stays open then, and it listens for other alerts that come into the system that are somewhat related and it appends them or adds them to the group or to the cluster. So the cluster gives us this time series where we can see when things started, and you can see the history of how things have impacted other things all the way in a lovely little timeline as well.
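
The correlation flow just described can be sketched in Python as follows. This is only an illustration under stated assumptions: the topology is reduced to a plain dictionary of direct relationships, two events on the same entity are treated as related, and the names `Event`, `Cluster`, and `Correlator` are hypothetical rather than anything from AlertStack itself. Only the 30-minute look-back and the "create a cluster, then keep appending" behaviour come from the conversation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Event:
    entity: str
    kind: str  # e.g. "alert", "anomaly_alert", "change"
    at: datetime


@dataclass
class Cluster:
    events: list = field(default_factory=list)

    def entities(self):
        return {e.entity for e in self.events}


class Correlator:
    def __init__(self, topology, lookback=timedelta(minutes=30)):
        self.topology = topology  # entity -> set of directly related entities
        self.lookback = lookback
        self.recent = []          # events not yet part of any cluster
        self.clusters = []        # open clusters

    def _related(self, a, b):
        return (a == b
                or b in self.topology.get(a, set())
                or a in self.topology.get(b, set()))

    def ingest(self, event):
        """Return the cluster the event joined, or None if it stands alone."""
        # 1. An open cluster keeps listening for related events.
        for cluster in self.clusters:
            if any(self._related(event.entity, ent) for ent in cluster.entities()):
                cluster.events.append(event)
                return cluster
        # 2. Otherwise look back in time for related, ungrouped events.
        cutoff = event.at - self.lookback
        matches = [e for e in self.recent
                   if e.at >= cutoff and self._related(event.entity, e.entity)]
        if matches:
            cluster = Cluster(events=sorted(matches + [event], key=lambda e: e.at))
            self.recent = [e for e in self.recent if e not in matches]
            self.clusters.append(cluster)
            return cluster
        # 3. Nothing related yet: remember the event for later look-backs.
        self.recent.append(event)
        return None
```

Because each cluster keeps its events ordered by time, it can be rendered as the kind of timeline mentioned above, with the earliest event as a candidate for the potential root cause.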

Derek Daly:

The aim of it is to identify a potential root cause. It’s not going to say it’s root cause analysis, but it allows us to see the knock-on impact of a particular event, be it a change event or an alert. And then we can track changes throughout the lifespan of that cluster, but we can see it in one logical place instead of having 100 different alerts where we’re flicking between the alert, alert detail page, looking at the map, looking at the node details page. It gives us this one time series to view everything.

Derek Daly:

And then we go a couple of steps further where we’ve integrated it with ServiceNow and also SolarWinds Service Desk. Imagine that cluster and it has 50 different nodes, entities, servers, hosts, whatever, and it could have 250 different alerts. Imagine the ability to see all of that in one logical ticket in Service Desk or ServiceNow. That’s what we’ve done. We’ve enabled the customer to easily create that ticket on a one-to-many basis, so many contained in the cluster. So one cluster creates one incident and it’s a more logical place to manage it.

Derek Daly:

When we look at the cluster as well, one of the use cases, again, going back to my example a while ago, that I can’t leave unread emails in my inbox, it just drives me crazy. So having that use case of the customer, “Okay, now we’ve got AlertStack where we can see all the elements logically contained. How do we then clean up my alert view?” So we can close a cluster, or we can update a cluster and that can acknowledge all the alerts contained within that cluster, and you can add a comment in there just to tell the team, “This is part of cluster 1, 2, 3, 4, and we’ve solved the issue.”

Derek Daly:

So we’re just really, really trying to minimize the time that folks spend managing alerts, looking at alerts, and trying to figure out, “What’s connected to what.” So those are two of the major features that we’ve been involved in. There are more, I could talk for another two hours about other features and functions that are coming or that we’ve built. And again, the AlertStack doesn’t require any particular license. It’s not going to cost you anything extra. It’s all built into the product and the adoption has been phenomenal for AlertStack. People absolutely love the feature and what it does for them and how much time it saves for them.

Sean Sebring:

I think what it does for them, you just said that is it’s a perfect way to look at it because when you’re talking about these features, all it’s doing is… And we’ve said this throughout, making our lives easier. All of the data is still there. You could correlate it yourself. You could acknowledge, or as Chrystal said, “Just leave that email” because we know it’s an anomaly or we could just make it a smarter system to not put that in my inbox because I do know that it is an anomaly. So it’s really cool to see all these things as what we had said when we were talking about it. It’s there to free us up so that we can be creative, so that we can be innovative, pay attention to what matters.

Sean Sebring:

Okay. Before we wrap up, I wanted to of course get our rapid-fire questions in, Derek. So we’re going to shoot you some questions and just give us your answers. They can be super quick one-word answers or if you want to tell us a little bit about why that’s cool too. So I will ask one that I ask everybody, and it is, would you rather travel to the past or the future? If time travel was an option.

Derek Daly:

Going back into the past for your own lifespan would be for sentimental value. And I always feel that if you’ve been to a place or you’ve lived in a place and you have great memories of the place, it’s typically because of the people, so you’d be going back just to feel that fun or that thing again. And I think that’s a bad thing. I think memory should probably just stay as memories because I think they’re better as memories.

Derek Daly:

Okay, you could go back further and change the course of history. I would probably say, “Go to the future.” And is it to win the lottery? No, it’s not. It’s just to give me a bit of a heads-up of what’s coming down the track, and so I can start thinking that way because things change so fast nowadays, and you can be left in the past very quickly.

Chrystal Taylor:

If we were colonizing a planet and you had the opportunity to move there, would you?

Derek Daly:

No, ’cause you couldn’t surf or snowboard there.

Sean Sebring:

There we go. There’s a nice quick brief, rapid fire response. Like it. I’ll give you another rapid one, and I haven’t used this one in a while, but when it comes to taste, flavor, sweet or savory, what’s your preferred?

Derek Daly:

I mean I have a sweet tooth. So it’s got to be sweet. Anything cake-related is my cup of tea. And speaking of tea, I take honey in my tea, so everything’s sweet.

Sean Sebring:

All right. Let me ask you one more, and this one’s one of my favorites that someone else came up with. If you could give yourself any talent, something you wish you could do today, maybe it’s ’cause you don’t have the time or maybe it’s because you’ve tried and it’s just really hard, if you could give yourself a talent, what would it be?

Derek Daly:

I’m going to give you an answer that it’s going to come across as cheesy, but I’ll give you the history to it. When I was a kid, I was terrible at art, really bad drawer, but I really wanted to be a good drawer. So I used to buy these books, “How to draw cartoons” and well, I didn’t buy it, my parents did, and I used to just really try and force myself into being a good artist and to be able to draw, but I was just useless. So what I learned from that is I don’t need to be the best at something I’m not good at. I’ll give you an example. I love surfing and I’m a very average surfer and I’ve been surfing for a long time. If someone said to me, “Tomorrow, you could be the best surfer in the world, you could be as skillful as Kelly Slater.”

Derek Daly:

So I’m going to say, “No.” Because the enjoyment is the effort you put into something. It’s not just clicking your fingers and being amazing, because if I clicked my fingers and I was as good as Kelly Slater tomorrow, then I would go surfing. I’d go over to Hawaii and get stuck into some big, massive tubes and it would be great. But then how am I going to progress? It’s all about progress for me and getting better. I started playing golf last year. I always said I would never play it. I was like, “Nah, golf’s not for me.” And then from being a father on the sidelines of a football match, and me and a guy from school that I was very friendly with who talked me into coming in and playing golf with him and talking to my wife and us saying, “It’d be great if we could join those retirement groups that go play golf in Spain, this would be great. So let’s start playing golf.”

Derek Daly:

So there were a couple of different reasons why I started playing, but I started playing and I’m terrible, but I’m enjoying the process of getting better. Not that I have a huge amount of time to put into it, but I like trying to improve and I like trying to get better at something. So if I could be the best or better at something in the morning, let’s say golf, if I was as good as Tiger Woods or Rory McIlroy in the morning, then I think I would’ve lost years of enjoyment and the enjoyment of actually getting better and improving.

Sean Sebring:

A little cheesy, but I’d say noble.

Derek Daly:

I’ll take noble. Noble is better than cheesy anyway.

Sean Sebring:

Yeah, it’s about the journey.

Derek Daly:

Yeah, 100%.

Chrystal Taylor:

I think that attitude that you just expressed, that you want to do the learning, that the experience is more valuable than just having the skill, is why you’re succeeding in the tech industry. We have to constantly learn and iterate. That’s all part of being in tech. And when you get bored of something or when you need to learn something new, it’s great to have the attitude of, “I want to go learn.”

Derek Daly:

We’re lucky our generation and the generations that are coming. If I think about my parents, they were in jobs in the same company for 35 years doing more or less the same thing. And it’s not that they weren’t smart or that they weren’t high achievers or anything like that. It’s just that was the way things were, and you just stuck at one thing, and you did that for your life. Now, if it’s an art or a craft, like anything from being a painter to a baker to whatever, that’s different ’cause you’re continuously innovating.

Derek Daly:

But for me, technology is allowing us to stay in creative roles, that are allowing us to continuously innovate. And it could be something really, really simple like changing — If you work in sales and you’ve got a CRM and you find a better way of forecasting using your CRM, that’s innovating in your job, that’s being creative. So we are lucky to have the ability to be creative in our roles and to be creative every day.

Sean Sebring:

Well said. Well said, sir. Well, this has been really enlightening. We talk about AI constantly. It’s so fun to get to see it from different angles, and especially because I’m a part of SolarWinds, Chrystal’s part of SolarWinds. It’s cool to hear from someone inside how SolarWinds is doing this, right, and the thinking behind it. So, Derek, I really want to extend a big thank you to you for joining us today.

Derek Daly:

My pleasure. Thanks for having me. It’s been great.

Sean Sebring:

And thank you listeners for joining us on another episode of SolarWinds TechPod. I’m your host, Sean Sebring, joined by fellow host, Chrystal Taylor. If you haven’t yet, make sure to subscribe and follow for more TechPod content. Thanks for tuning in.
