Actionable Alerting for Site Reliability Engineers (class SRE implements DevOps)

Published September 5, 2018, 1:29
Are you tired of getting paged in the middle of the night for noisy alerts or flapping systems, only to find that no action can be taken? In this episode, Liz and a very sleepy Seth discuss how to build actionable alerts from your SLOs and SLIs as a Site Reliability Engineer. Alerting on low-level metrics such as CPU usage or disk space doesn't show whether users are actually experiencing issues with our product or service. Instead, we should build our alerts from our SLOs. By tracking how our remaining error budget is consumed over time, we can see how outages or partial outages will affect our SLO. Liz discusses strategies for deciding when to alert, how to alert, and what to do with old alerts.
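
As a rough illustration of alerting on the error budget instead of on CPU or disk metrics, here is a minimal Python sketch. The function names, the 99.9% SLO target, and the paging threshold of 10 are assumptions chosen for this example; they are not taken from the episode or from Stackdriver.

```python
# Hypothetical helpers for SLO-based alerting; the names and thresholds
# are illustrative assumptions, not part of any monitoring product.

def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Return the fraction of the error budget still unspent.

    1.0 means nothing has been spent; values below 0 mean the SLO is blown.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0
    return 1.0 - failed_requests / allowed_failures


def burn_rate(slo_target, window_requests, window_failures):
    """Return how fast the budget is being spent in a recent window.

    1.0 means errors arrive exactly at the rate the SLO allows;
    2.0 means the budget would run out in half the SLO period.
    """
    if window_requests == 0:
        return 0.0
    observed_error_rate = window_failures / window_requests
    return observed_error_rate / (1.0 - slo_target)


def should_page(slo_target, window_requests, window_failures, threshold=10.0):
    """Page a human only when the budget is burning fast enough to matter."""
    return burn_rate(slo_target, window_requests, window_failures) >= threshold


if __name__ == "__main__":
    # Assumed 99.9% availability SLO over the SLO period.
    print(error_budget_remaining(0.999, 1_000_000, 250))
    # Last hour: 2% of requests failed, far above the 0.1% the SLO allows.
    print(should_page(0.999, 10_000, 200))
```

A brief spike that barely dents the budget stays below the threshold and does not wake anyone up, while a sustained outage that would exhaust the budget quickly does.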

Actionable alerts tie closely into the DevOps principles of expecting failure and creating a blameless culture. This is why we say "class SRE implements DevOps".

Reference Links:
Stackdriver Service Monitoring → bit.ly/2wJdVS7
Creating a Dashboard with Stackdriver SLI Monitoring Metrics → bit.ly/2wHwGWo

Have questions? Reach out to Liz and Seth on Twitter:
@sethvargo - twitter.com/sethvargo
@lizthegrey - twitter.com/lizthegrey

Watch more episodes here → bit.ly/2PPL6f0
Subscribe to the channel → bit.ly/GCloudPlatform