Using Statistical Monitoring to Detect Failures in Internet Services

Published September 8, 2016, 18:56
Today, we are increasingly building large and complex systems whose workings we do not understand, and this lack of understanding translates into systems that are hard to manage and have low availability. The problem is a disconnect between our high-level goals for the system and the low-level visibility and control we have over it. To keep a system running, operators must wade through the minutiae of its low-level architecture and implementation. This is not unlike driving a car while looking through a magnifying glass: the driver is both overwhelmed by the details immediately in front of him and unable to focus on more important items on the horizon. A concrete example of this problem is fault detection in Internet services. Current surveys find that over 60% [...] is the time required to simply realize that a service has failed. The challenge is that these Internet services are complex, poorly understood systems, and the correct operation of the application is defined only at the human level ("I know a problem when I see it"). In this talk, I present my work on statistical monitoring, which combines systems research with statistical analysis and machine learning tools to transform low-level behaviors that are easy to observe into high-level indicators of failure. Unlike other techniques that detect high-level failures, this approach requires no a priori application-specific information; it thus needs little maintenance as a service evolves and changes over time. I will discuss results from testbed experiments, where detection "miss rates" are reduced by 30-70%, as well as early experiences analyzing failures at a large Internet service.
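
To make the idea of turning easy-to-observe, low-level behavior into a high-level failure indicator more concrete, here is a minimal sketch in Python. It is not the method presented in the talk, which the abstract does not specify. It assumes a hypothetical setup in which several replicas of a service each report one low-level measurement, and it flags a replica whose value deviates strongly from its peers. The replica names, the metric, and the threshold are all illustrative assumptions.

```python
from statistics import median


def anomaly_scores(observations):
    """Score each replica by how far its measurement is from its peers.

    `observations` maps a replica name to one low-level measurement
    (hypothetical here: the fraction of requests touching some component).
    The score is the distance from the peer median, in units of the
    median absolute deviation (MAD). No application-specific rule is used.
    """
    values = list(observations.values())
    center = median(values)
    mad = median(abs(v - center) for v in values)
    # Guard against division by zero when all peers agree exactly.
    scale = mad if mad > 0 else 1e-9
    return {name: abs(v - center) / scale for name, v in observations.items()}


def suspected_failures(observations, threshold=5.0):
    """Return the replicas whose score exceeds an assumed threshold."""
    scores = anomaly_scores(observations)
    return sorted(name for name, score in scores.items() if score > threshold)


# Illustrative data only: replica-d behaves very differently from its peers.
observations = {
    "replica-a": 0.31,
    "replica-b": 0.29,
    "replica-c": 0.30,
    "replica-d": 0.02,
    "replica-e": 0.32,
}
print(suspected_failures(observations))  # prints ['replica-d']
```

The sketch compares peers against each other rather than against a hand-written definition of correct behavior, which is one way a detector could keep working as a service changes. A real detector would need many metrics, time windows, and a tuned threshold.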