- An Introduction to Metrics, Monitoring, and Alerting
This tutorial explains how metrics, monitoring, and alerting relate to one another; what types of information are important to track; the factors that affect what you choose to monitor; and the important qualities of a metrics, monitoring, and alerting system.
- Site Reliability Engineering, Chapter 6: Monitoring Distributed Systems
This chapter of Google's SRE book defines basic principles and best practices for building successful monitoring and alerting systems. It offers guidelines for which issues should interrupt a human via a page, and how to handle issues that aren’t serious enough to trigger a page.
- The RED Method: How to Instrument Your Services
Tom Wilkie discusses how the USE (Utilization, Saturation, Errors) instrumentation method is appropriate for monitoring hardware, while the RED (Rate, Errors, Duration) method is more appropriate for microservices. He also compares the RED method with Google's Four Golden Signals.
- Monitoring and Observability
Cindy Sridharan contrasts the terms "monitoring" and "observability": "monitoring" shows the overall health of systems, while "observability" provides highly granular insights into system behavior, along with rich context for debugging.