Try a more organized approach to keep systems and missions running

Insight by PagerDuty

Federal Insights

More than simply monitoring enterprise applications, agencies need to have an organized and automated way to alert the people who can fix it the fastest.

Tom Temin (@tteminWFED)

December 2, 2024 4:44 pm

Despite agencies’ most diligent efforts, enterprise application crashes occasionally occur. Two trends have made preventing and addressing such crashes trickier: digital transformation, which sometimes brings separate applications together, and hosting applications in hybrid cloud environments.

“What agencies need to do is figure out how to give minutes back to the mission,” said Eric Forseter, vice president for public sector at PagerDuty.

The main question, he said, is obvious: “How do I ensure the resiliency of the approach that I’m taking, and how do I transform my environment so that I am more modern, more agile and improving the digital customer experience?”

But the challenge becomes innovating and transforming while minimizing risk, Forseter said on Federal Insights — Mastering IT Resilience: Strategies for Federal Continuity. He pointed to one study showing developers spend more than half their time fixing applications, “as opposed to building and creating that transformation.”

Keeping an eye on the action

Observability of the IT environment can help focus efforts and mitigate risk, Forseter said, but only if an organization has a system to sort out the important alerts from the noise.

“We need to automate,” he said. “We need applications to help us tune down the noise.”

More than that, he said, IT organizations should create environments that anticipate adverse events. Perhaps with a touch of artificial intelligence, such applications “will tell you when an alert happens and, in effect, ask whether you want a server rebooted while it generates a report on what happened,” Forseter said.

In the same vein, a management overlay could monitor developers, whether government employees or contractors, to better understand the source of a problem and reduce what Forseter called the “mean time to resolution.” The dev team also would gain more knowledge about how to prevent a repeat of a problem, he added.

Forseter said several agencies work with PagerDuty to better understand not only when peak website visits occur but also the characteristics of their systems, which lets them define possible courses of action.

“When an issue arises, the product is alerting different teams. We’re letting management know but also letting the operational folks know and giving them decision points,” he said. That might include recommended actions or how to configure systems to orchestrate what PagerDuty recommends.

“Maybe it’s spin up more servers for that application itself or do something cloud,” Forseter said. “Or maybe it’s, ‘Hey, we need to have a team come in right away to fix it.’ ”

Automation helps speed the response

But which team should receive the alert? For example, a customer service rep or call center staff member might become aware of a technology-induced issue. The rep, Forseter said, may hit the PagerDuty chat button to alert others, typically the development, operations and security teams, as well as people in the customer-facing group who may be seeing the same problem.

The internal groups have the knowledge to send a fix-by time back to the customer service team, Forseter said.

A practical issue can affect organizations, especially large ones. Given that members of DevSecOps teams work different days and hours, who precisely should be alerted when something goes awry?

That’s the working premise behind PagerDuty: Alert the right people based on all possible factors. After all, few people use pagers anymore.

Forseter said the company, founded well after the pager age, was the brainchild of a software developer who had worked on pager duty, hence the name. But a pager gives only limited information, say, a number to call back.

“He got this idea: ‘You know what, I don’t want to just get a call. I want someone to actually let me know and to alert me when things are happening,’ ” Forseter said. “And that’s what we’re built out of.”

Users configure the product with information about who does what in a company or agency, what their shift schedules look like and how to contact them, usually on their smartphones. It also can escalate alerts to others on the team if the designated contact fails to respond in a certain amount of time.
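The routing idea described above (roles, shift schedules, contact details and escalation when nobody responds) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not PagerDuty's actual API or algorithm: the `Responder` class, the `who_to_notify` function, the 15-minute acknowledgment window and the sample phone numbers are all hypothetical, and the sketch simplifies by checking shifts only at the moment the alert is raised.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta


@dataclass
class Responder:
    """Hypothetical record of who does what and when they work."""
    name: str
    contact: str        # e.g. a smartphone number (made up for illustration)
    shift_start: time
    shift_end: time

    def on_shift(self, now: datetime) -> bool:
        t = now.time()
        if self.shift_start <= self.shift_end:
            return self.shift_start <= t < self.shift_end
        # Shift wraps past midnight, e.g. 22:00 to 06:00.
        return t >= self.shift_start or t < self.shift_end


def who_to_notify(chain, raised_at, now, ack_timeout=timedelta(minutes=15)):
    """Return the responder who should hold an unacknowledged alert at `now`.

    `chain` lists responders in escalation order. Only people on shift when
    the alert was raised are considered (an assumption of this sketch). Each
    gets `ack_timeout` to respond before the alert moves to the next person;
    the last person in the chain keeps it.
    """
    on_call = [r for r in chain if r.on_shift(raised_at)]
    if not on_call:
        return None
    # Number of full acknowledgment windows that have elapsed so far.
    steps = max(0, int((now - raised_at) / ack_timeout))
    return on_call[min(steps, len(on_call) - 1)]
```

For example, with a day-shift developer, a night-shift operator and a team lead in the chain, an alert raised at 10 a.m. would go first to the developer and, if still unacknowledged 15 minutes later, to the team lead, skipping the off-shift operator.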

Forseter recounted working with one federal customer with a mission so complex it’s “beyond comprehension. They have documents upon documents saying this guy has this piece, but this lady has this piece.”

Before the organization began using PagerDuty, if something broke, employees would end up scrambling through numerous pieces of paper to figure out the right person to contact.

“If you can make that less manual, and then understand also who is that person’s boss or who are other team members on that person’s team, that’s a huge benefit right there,” Forseter said.

At another federal agency, knowing the roles and responsibilities helped solve recurring scrambles.

“We had an agency this summer where literally something went down, and there was finger pointing,” Forseter said. “ ‘No, it’s your responsibility. No, it’s your responsibility.’ And they didn’t even know where to start. It took them four or five days to even get close to resolution, not because they didn’t know how to fix it but because they didn’t know who the right people were.”

Because PagerDuty understands the nature and likelihood of an adverse event, and folds in the organizational information to deal with it, Forseter said, “now, all of a sudden, I’ve given time back to the mission. I can increase the pace of innovation. I can actually do that digital transformation.”

Tom Temin

Tom Temin is host of the Federal Drive and has been providing insight on federal technology and management issues for more than 30 years.
