Design for Observability - Role of Metrics Ep 2

In the previous article, I discussed best practices that organizations can use to leverage metrics effectively within their observability platforms to gain actionable insights, improve system reliability, and drive business value. In this article, let us look at some strategies for getting those metrics. Obtaining metrics from both applications and infrastructure involves instrumentation, data collection, aggregation, and analysis. The following are some generalized approaches to getting metrics from each:

Getting metrics from applications:

  1. Instrument your application code to emit metrics relevant to its performance, behavior, and business logic. Instrumentation can be done using libraries or frameworks specific to the programming language or platform in your application stack.
  2. Identify the types of metrics you need, such as counters, gauges, histograms, or summaries, based on the aspects of your application you want to monitor (e.g., request latency, error rates, throughput).
  3. Use standardized formats and protocols like the Prometheus exposition format, StatsD, or OpenTelemetry for emitting metrics, to ensure compatibility with a wide range of monitoring and observability tools.
  4. Integrate metrics instrumentation with logging and distributed tracing frameworks to provide comprehensive observability across your application stack.
  5. Choose between a push model, where applications actively push metrics to a central metrics collection system, and a pull model, where a centralized monitoring system periodically pulls (scrapes) metrics from application endpoints.
  6. Implement error handling mechanisms to gracefully handle failures in metric emission and ensure that errors don't impact application performance or stability.
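To make the application-side steps concrete, here is a minimal sketch in Python that uses only the standard library. It is not how any particular client library is implemented; the class names (`Counter`, `Histogram`), the helper `safe_emit`, and the metric names are hypothetical and chosen for illustration. It shows two metric types, output in the Prometheus text exposition format, and a wrapper that keeps a failed metric update from reaching the caller.

```python
import threading
from bisect import bisect_left


class Counter:
    """Monotonically increasing value, e.g. total requests served."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self._value = 0.0
        self._lock = threading.Lock()

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        with self._lock:
            self._value += amount

    def render(self):
        return [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
            f"{self.name} {self._value}",
        ]


class Histogram:
    """Sorts observations into buckets, e.g. request latency in seconds."""

    def __init__(self, name, help_text, buckets):
        self.name = name
        self.help_text = help_text
        self._bounds = sorted(buckets)
        self._counts = [0] * (len(self._bounds) + 1)  # last slot is +Inf
        self._sum = 0.0
        self._lock = threading.Lock()

    def observe(self, value):
        with self._lock:
            # bisect_left finds the first upper bound that is >= value.
            self._counts[bisect_left(self._bounds, value)] += 1
            self._sum += value

    def render(self):
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} histogram",
        ]
        running = 0  # exposition-format buckets are cumulative
        for bound, count in zip(self._bounds, self._counts):
            running += count
            lines.append(f'{self.name}_bucket{{le="{bound}"}} {running}')
        running += self._counts[-1]
        lines.append(f'{self.name}_bucket{{le="+Inf"}} {running}')
        lines.append(f"{self.name}_sum {self._sum}")
        lines.append(f"{self.name}_count {running}")
        return lines


def safe_emit(metric_call, *args):
    """Run a metric update; report failure instead of raising to the caller."""
    try:
        metric_call(*args)
        return True
    except Exception:
        return False


def render_metrics(metrics):
    """Join all metrics into one text payload for a scrape endpoint."""
    return "\n".join(line for m in metrics for line in m.render()) + "\n"


if __name__ == "__main__":
    requests = Counter("app_requests_total", "Total requests handled.")
    latency = Histogram("app_request_duration_seconds",
                        "Request latency in seconds.", [0.1, 0.5, 1.0])
    safe_emit(requests.inc)
    safe_emit(latency.observe, 0.23)
    print(render_metrics([requests, latency]))
```

In a pull model, the text returned by `render_metrics` would be served from an HTTP endpoint that the monitoring system scrapes; in a push model, the application would send the same values to a collector on a schedule. In practice, an established client library for your language is usually a better choice than hand-rolled classes like these.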

Getting metrics from infrastructure:

  1. Deploy monitoring agents on infrastructure components (e.g., servers, VMs, containers, orchestrators) to collect metrics related to resource utilization (CPU, memory, disk, network), system health, and workload performance.
  2. Define monitoring configurations alongside infrastructure definitions. Integrate metrics collection into your infrastructure provisioning and configuration management workflows using tools like Terraform, Ansible, or Chef.
  3. Utilize APIs provided by cloud service providers (e.g., AWS CloudWatch, Azure Monitor, Google Cloud Monitoring) to retrieve metrics related to cloud resources, services, and platforms.
  4. Leverage the tools and utilities built into the operating system to collect OS-level metrics such as CPU usage, memory utilization, disk I/O, and network traffic.
  5. Monitor network infrastructure components (e.g., routers, switches, load balancers) and collect metrics related to network throughput, latency, packet loss, and connectivity status.
  6. Extract relevant metrics from log data using log aggregation and parsing tools. Some metrics, such as HTTP response codes or database query execution times, can be derived from log entries.
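As a rough illustration of two of the infrastructure-side points, the Python sketch below derives metrics from log entries and reads one operating-system-level value using only the standard library. The log layout (`<method> <path> <status> <duration>ms`) and the function names `metrics_from_logs` and `disk_used_ratio` are assumptions made for this example; real log formats vary, and a log aggregation tool would normally do this parsing at scale.

```python
import re
import shutil
from collections import Counter

# Assumed (hypothetical) log layout: "GET /orders 200 12.5ms"
LOG_LINE = re.compile(
    r"^(?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3}) (?P<ms>\d+(?:\.\d+)?)ms$"
)


def metrics_from_logs(lines):
    """Derive response-code counts and mean duration from log lines."""
    status_classes = Counter()
    total_ms = 0.0
    parsed = 0
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # ignore lines that do not fit the assumed layout
        # Group codes by class: 200 -> "2xx", 503 -> "5xx".
        status_classes[match["status"][0] + "xx"] += 1
        total_ms += float(match["ms"])
        parsed += 1
    return {
        "responses_by_class": dict(status_classes),
        "mean_duration_ms": total_ms / parsed if parsed else 0.0,
    }


def disk_used_ratio(path="."):
    """Fraction of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total if usage.total else 0.0


if __name__ == "__main__":
    sample = [
        "GET /orders 200 12.5ms",
        "POST /orders 500 40ms",
        "GET /health 200 7.5ms",
    ]
    print(metrics_from_logs(sample))
    print(f"disk used: {disk_used_ratio():.1%}")
```

A monitoring agent would typically run collection like this on a schedule and forward the results to the central metrics system, rather than printing them.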


In addition, security plays an important role in the whole observability stack. Ensure that your metrics collection processes can extract security and vulnerability data from tools (e.g., AWS CloudTrail, image repository scanners, dependency trees, SAST, DAST), and that they adhere to security best practices such as encryption of data in transit, access controls, and compliance with relevant regulatory requirements.

By implementing these strategies, you can establish comprehensive metrics collection processes for both applications and infrastructure, enabling effective monitoring, troubleshooting, and optimization of your systems. In the next article, let's look at some of the challenges associated with getting and utilizing the metrics.

For more updates, subscribe to the Cloud Native Hero! Newsletter.
