Using Statistical Monitoring to Detect Failures in Internet Services [1/8]

Published September 8, 2016, 18:50
Today, we are increasingly building large and complex systems whose workings we do not understand, and this lack of understanding translates into systems that are hard to manage and have low availability. The problem is that there is a disconnect between our high-level goals for the system and the low-level visibility and control we have into and over it. To keep a system running, operators must wade through the minutiae of its low-level architecture and implementation. This is not unlike driving a car while looking through a magnifying glass: the driver is both overwhelmed by the details immediately in front of him and unable to focus on more important items on the horizon. A concrete example of this problem is fault detection in Internet services. Current surveys find that over 60) is the time required to simply realize that a service has failed. The challenge is that these Internet services are complex, poorly understood systems, and the correct operation of the application is only defined at a human layer (I