Determining the Value of Proactive Monitoring

System downtime has a cost to any organization. If your organization uses Microsoft Teams for its collaboration and communications, when Teams is unavailable your business is impacted.

The challenge is separating headline-grabbing estimates, such as “… downtime can eclipse $5 million an hour in certain scenarios…” (Forbes Technology Council, April 10, 2024), which are often associated with only the largest organizations, from reasonable estimates for medium and large organizations of various sizes.

To address this challenge, Martello Technologies commissioned EnableUC to independently develop a model that estimates the impact of deploying Vantage DX proactive monitoring and enhanced diagnostic tools within a Teams environment, for organizations of different sizes and configurations.

Our research and model development occurred without any input or influence from Martello, and we shared the results only once they were complete.

In completing this project, we ended up building two models (we are overachievers 😊): one focused on the operational costs of supporting a Teams environment, and another that looked more broadly at operational, productivity, and revenue impacts.

This article discusses our first model, which calculates the difference that proactive monitoring and enhanced diagnostic tools can make to operational costs.

Key Takeaways

➡️ 60% of issues could potentially be mitigated with proactive monitoring.

➡️ For an organization with 1,000 users, proactive monitoring is likely to halve the IT support labor required.

➡️ An organization with over 10,000 users should expect to reduce required staffing by 70% if proactive monitoring and enhanced diagnostics are deployed.

Building a Model

To assess the advantages of an enhanced monitoring and issue diagnosing toolset, we developed an operational model loosely based on the Microsoft Operations Framework (MOF). While Microsoft has shifted its focus to a tool-based approach it calls the Microsoft Operations Management Suite (OMS), MOF provides a structured, lifecycle-based approach and serves as a good foundational model for IT service management.

We extended MOF using a series of “runbooks” we have developed over the years for various organizations that have implemented Microsoft Teams.

The result was a clearly defined series of daily, weekly, monthly and annual tasks required to successfully operate any Microsoft Teams environment.

Task Effort Estimates

Based on our collective expertise, online research, and discussions with Microsoft MVPs (Most Valuable Professionals) and IT professionals responsible for managing Teams environments, we assigned effort estimates to each of the identified Teams management tasks.

We then estimated the number of issues and tickets that would be generated, based on hands-on experience and research. Understanding the number of tickets generated is critical because a significant portion of daily IT time is typically allocated to addressing tickets.

We identified 11 categories of issues that create outages or service degradation. (Categories include core service issues, supporting service issues, hardware and software issues, human error, loss of power, etc. We will explore these categories in detail in a follow-up article.) Collectively, these issues degrade Teams service for one or more users 1.8% of the time. Unless you operate 24x7, not all of these outages will occur during your organization’s working hours; the model accounts for this.
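As a rough illustration of the working-hours adjustment, the sketch below converts the 1.8% degradation figure into hours per year and scales it by the share of the week an organization is open. It assumes degradation is spread evenly across the week, which is our simplification for this example rather than a detail of the full model. The function name and the 40-hour example are also illustrative.

```python
HOURS_PER_YEAR = 365 * 24   # 8,760 hours
HOURS_PER_WEEK = 7 * 24     # 168 hours
DEGRADED_FRACTION = 0.018   # from the article: service is degraded 1.8% of the time


def degraded_working_hours_per_year(working_hours_per_week: float) -> float:
    """Estimate yearly hours of degraded service that fall inside working hours.

    Assumption (not from the model): degradation is uniformly distributed
    across the week, so the working-hours share is simply hours open / 168.
    """
    if not 0 <= working_hours_per_week <= HOURS_PER_WEEK:
        raise ValueError("working_hours_per_week must be between 0 and 168")
    total_degraded_hours = HOURS_PER_YEAR * DEGRADED_FRACTION
    return total_degraded_hours * (working_hours_per_week / HOURS_PER_WEEK)


if __name__ == "__main__":
    # A 24x7 operation sees all of the degraded hours.
    print(round(degraded_working_hours_per_year(168), 1))
    # A hypothetical 40-hour work week sees only a fraction of them.
    print(round(degraded_working_hours_per_year(40), 1))
```

An organization that operates around the clock is exposed to every degraded hour, while one with a standard work week is exposed to roughly a quarter of them under this even-spread assumption.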

Additional assumptions built into the model (which can be configured) include:

  • Expect 1 incident per day for every 1,000 physical phones deployed.
  • Expect 1 incident per day for every 50 Microsoft Teams Rooms.
  • An issue or outage needs to last at least 10 minutes to potentially create a ticket. For instance, if a momentary “blip” occurs while trying to join a meeting, most users simply retry a few times.
  • On average, 16% of affected users raise a ticket when an incident or issue occurs.
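The assumptions above can be expressed as a small set of configurable parameters. The following is a minimal sketch, not the model itself: the class, function names, and example inputs are hypothetical, and the two functions are kept separate because this article does not describe how the model combines device incidents with user tickets.

```python
from dataclasses import dataclass


@dataclass
class TicketAssumptions:
    """Configurable defaults taken from the assumptions listed in the article."""
    phones_per_daily_incident: int = 1000  # 1 incident per day per 1,000 phones
    rooms_per_daily_incident: int = 50     # 1 incident per day per 50 Teams Rooms
    min_ticket_minutes: float = 10.0       # shorter "blips" rarely produce tickets
    ticket_rate: float = 0.16              # share of affected users who raise a ticket


def device_incidents_per_day(phones: int, rooms: int, a: TicketAssumptions) -> float:
    """Expected daily incidents from desk phones and room systems."""
    return phones / a.phones_per_daily_incident + rooms / a.rooms_per_daily_incident


def expected_tickets(affected_users: int, duration_minutes: float,
                     a: TicketAssumptions) -> float:
    """Expected tickets from one incident, given who is affected and for how long."""
    if duration_minutes < a.min_ticket_minutes:
        return 0.0  # users typically just retry after a momentary blip
    return affected_users * a.ticket_rate


if __name__ == "__main__":
    assumptions = TicketAssumptions()
    # Hypothetical deployment: 500 desk phones and 20 Teams Rooms.
    print(device_incidents_per_day(500, 20, assumptions))
    # Hypothetical incident: 200 users affected for 25 minutes.
    print(expected_tickets(200, 25, assumptions))
```

Keeping the parameters in one dataclass mirrors the point that these assumptions can be configured per organization.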

The Impact of Proactive Monitoring

Proactive monitoring reduces the number of user-impacting incidents because it allows IT teams to correct issues quickly, potentially before users are impacted. When an issue can’t be corrected quickly, it allows IT to communicate alternatives.

For example, if a network issue is impacting a location, users can be advised to work from home, a coffee shop, or another nearby location. If Teams, or a supporting service (e.g. authentication), is experiencing an issue, users can be alerted that they should use a backup UC solution, or their mobile phones for an upcoming meeting.

For each of the identified 11 issue categories, we estimated the percentage of issues that could be mitigated with proactive monitoring, ranging from 0% to 90% depending on the source of the issue.

In total, our model indicates that up to 60% of issues could potentially be mitigated with proactive monitoring.
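One plausible way to roll per-category estimates into a single figure is a weighted average, where each category is weighted by its share of all issues. The sketch below assumes that approach; the category names, shares, and mitigation fractions are placeholders chosen only to show the arithmetic and are not values from the model.

```python
from typing import Mapping, Tuple


def overall_mitigation(categories: Mapping[str, Tuple[float, float]]) -> float:
    """Weighted average of per-category mitigation fractions.

    `categories` maps a category name to (share of all issues, fraction of
    that category's issues that proactive monitoring could mitigate).
    """
    total_share = sum(share for share, _ in categories.values())
    if total_share <= 0:
        raise ValueError("category shares must sum to a positive number")
    mitigated = sum(share * fraction for share, fraction in categories.values())
    return mitigated / total_share


if __name__ == "__main__":
    # Placeholder values for illustration only (not taken from the model).
    example = {
        "category A": (0.50, 0.90),  # common and highly mitigable
        "category B": (0.25, 0.40),
        "category C": (0.25, 0.00),  # cannot be mitigated by monitoring
    }
    print(f"{overall_mitigation(example):.0%}")
```

A category that occurs often and is highly mitigable pulls the overall figure up far more than a rare one, which is why the total depends on the mix of issues as much as on the individual percentages.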

Implementing Proactive Monitoring

Synthetic transactions, run by agents or appliances, are the key tools for proactively monitoring a Microsoft Teams environment. Here’s a breakdown of how they work and their benefits:

Synthetic Transactions

Synthetic transactions simulate user activities to test and monitor the performance and availability of Microsoft Teams services. These transactions are pre-scripted actions that mimic real user interactions, such as:

  • Joining a Teams meeting
  • Sending a message
  • Sharing a file
  • Scheduling a meeting

By continuously running these synthetic transactions, IT teams can detect issues before they impact actual users. This proactive approach helps identify performance bottlenecks, service outages, and other problems early on.

Agents or Appliances

To execute synthetic transactions, organizations deploy agents or appliances at various locations. These agents can be software-based or hardware devices that perform the following functions:

  • Monitoring Performance: Agents simulate user activities and measure the response times and success rates of these actions.
  • Collecting Data: They gather detailed metrics on network performance, application responsiveness, and service availability.
  • Alerting and Reporting: When an issue is detected, agents can trigger alerts and generate reports, providing IT teams with actionable insights.
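To make the agent's role concrete, here is a minimal sketch of a probe loop that times a scripted action, records whether it succeeded, and decides whether to alert. It does not call any Teams API: `join_meeting_stub` is a hypothetical stand-in for a real scripted transaction, and the thresholds are illustrative assumptions rather than recommended values.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ProbeResult:
    name: str
    ok: bool
    seconds: float


def run_probe(name: str, action: Callable[[], None]) -> ProbeResult:
    """Run one synthetic transaction, timing it and recording success or failure."""
    start = time.perf_counter()
    try:
        action()
        ok = True
    except Exception:
        ok = False  # any exception counts as a failed transaction
    return ProbeResult(name, ok, time.perf_counter() - start)


def needs_alert(results: List[ProbeResult], min_success_rate: float,
                max_seconds: float) -> bool:
    """Alert when the success rate is too low or any run was too slow."""
    if not results:
        return False
    success_rate = sum(r.ok for r in results) / len(results)
    slowest = max(r.seconds for r in results)
    return success_rate < min_success_rate or slowest > max_seconds


if __name__ == "__main__":
    def join_meeting_stub() -> None:
        """Hypothetical stand-in for a scripted 'join a Teams meeting' step."""
        time.sleep(0.05)

    results = [run_probe("join_meeting", join_meeting_stub) for _ in range(3)]
    if needs_alert(results, min_success_rate=0.95, max_seconds=2.0):
        print("ALERT: synthetic join_meeting probe degraded")
    else:
        print("join_meeting probe healthy")
```

A production agent would run such probes on a schedule from each monitored location, which is what lets it surface a site-specific network issue before users at that site report it.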

Enhanced Diagnostics

Proactive monitoring can reduce issues, but it cannot eliminate every issue or the corresponding tickets that users raise.

As such, our model takes into account how enhanced diagnostics can reduce the time required to identify a root cause and address a particular issue.

Microsoft continues to improve its built-in diagnostic reports, most recently deprecating the Call Quality Dashboard (CQD) in favor of Power BI Quality of Experience Review (QER) report templates. However, both CQD and QER reports can be data rich and information poor: they provide lots of technical detail but overwhelm all but the most skilled IT professionals.

Additionally, the Microsoft reports don’t provide much detail outside the Microsoft environment. Local network and ISP details are not fully captured by the built-in reports. For organizations using Direct Routing, session border controller (SBC) and carrier SIP trunk details are incomplete. For customers using Operator Connect, key carrier or network service provider details are sparse.

We believe that enhanced third-party diagnostic tools can reduce the time taken to resolve a particular incident from an average of 30 minutes to 15 minutes. Put another way, a typical support engineer can handle an average of 20 tickets per day with the built-in tools and an average of 30 tickets per day with an enhanced set of tools. Note that these tickets-per-day averages assume some tickets are more straightforward moves, adds, or changes that do not require root cause analysis.
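The two effects can be combined in a back-of-the-envelope staffing comparison. This sketch assumes tickets fall in proportion to the share of issues mitigated (up to 60%, per our model) and uses the 20 and 30 tickets-per-day figures above. The function names and the 120-tickets-per-day input are hypothetical, and the full model considers more factors than this.

```python
import math

TICKETS_PER_ENGINEER_BUILT_IN = 20   # tickets per engineer per day, built-in tools
TICKETS_PER_ENGINEER_ENHANCED = 30   # tickets per engineer per day, enhanced tools
MITIGATION_RATE = 0.60               # upper-bound share of issues mitigated


def engineers_needed(tickets_per_day: float, tickets_per_engineer: int) -> int:
    """Whole engineers required, with a floor of one person."""
    return max(1, math.ceil(tickets_per_day / tickets_per_engineer))


def compare_staffing(baseline_tickets_per_day: float,
                     mitigation_rate: float = MITIGATION_RATE) -> tuple:
    """Return (engineers with built-in tools, engineers with monitoring + diagnostics).

    Assumption: tickets drop in proportion to the issues mitigated.
    """
    remaining = baseline_tickets_per_day * (1 - mitigation_rate)
    before = engineers_needed(baseline_tickets_per_day, TICKETS_PER_ENGINEER_BUILT_IN)
    after = engineers_needed(remaining, TICKETS_PER_ENGINEER_ENHANCED)
    return before, after


if __name__ == "__main__":
    # Hypothetical workload of 120 tickets per day.
    before, after = compare_staffing(120)
    print(f"built-in tools: {before} engineers; with monitoring and diagnostics: {after}")
```

The floor of one engineer reflects that even a very small environment still needs someone responsible for support, so the savings only appear once the ticket volume exceeds what one person can handle.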

Results

Taking into consideration all of the above, here is what the model indicates for several differently sized organizations.

For organizations smaller than approximately 200 users, you typically require at least one person whether or not proactive monitoring or enhanced diagnostic tools are deployed. Once you reach approximately 250 users, you can invest in more people or use better tools to reduce overall labor costs.

With 1,000 users working in the office 3 out of 5 days (a common hybrid arrangement), the potential labor savings are significant as proactive monitoring reduces the number of tickets that require investigating and speeds up the time to resolution for issues that can’t be mitigated.

Scenario: 1,000 users in 2 locations

As the number of users increases, proactive monitoring has a larger potential impact.

Scenario: 2,500 users in 5 locations
Scenario: 10,000 users in 20 locations

The complete model takes into consideration other factors including the number of desk phones and room systems deployed, the number of locations, the number of time zones operated in, etc.

Conclusion

Using reasonable assumptions about the operational management of a Microsoft Teams environment, our model indicates that for most organizations with 200 or more people, proactive monitoring and enhanced diagnostic tools can provide a significant return on investment by reducing the amount of support labor required. For organizations with over 1,000 users, proactive monitoring can halve the amount of IT support labor required. Larger organizations with over 10,000 users can expect proactive monitoring to reduce support labor by two-thirds.

This is only part of the story because outages also impact productivity and revenue generation for an organization. We will explore these broader impacts in a follow-up article that will dive into the details of the second model we developed as part of this project.
