Incident Management (class SRE implements DevOps)


Google Cloud Platform


Published October 2, 2018, 17:02

In the previous video, Liz and Seth discussed how to make systems observable and how observability helps us diagnose failing systems, but they didn't cover what to do when an incident grows beyond what one person can handle. In this video, you will learn about the most important part of the incident management process: humans.

In the stressful moments of systems failure, it is important to define clear, concise roles for all the humans involved in an incident. With too few people, you can quickly become overloaded with work, but with too many people, work may be duplicated (i.e. too many hands on the keyboard). Learn how SREs effectively manage incidents with clearly defined roles and responsibilities such as the operations lead, planning lead, communications lead, logistics lead, and more. Seth and Liz also discuss techniques for managing long-running and exponentially complex incidents.

Reach out to Liz and Seth:
twitter.com/lizthegrey
twitter.com/sethvargo

Watch more episodes from the playlist here → bit.ly/2PPL6f0

Subscribe to the Google Cloud Platform channel for more Cloud content → bit.ly/GCloudPlatform
