Statistical Failure Diagnosis in Software and Systems

Published September 7, 2016, 16:59
As software and systems grow more complex, debugging them grows more difficult. Manual diagnosis can require sifting through millions of lines of code and output logs. Large systems also contain many components, each complex on its own, that often interact in unexpected ways. I present a case study illustrating how statistical machine learning algorithms, combined with appropriate system instrumentation, can aid in failure diagnosis. I propose a statistical software debugging framework that collects information from past successful and failed runs via fine-grained instrumentation of the program, then analyzes this information to locate suspicious program predicates. I discuss the algorithmic challenges of the approach and demonstrate a bi-clustering algorithm that is effective at simultaneously clustering failed runs and selecting useful predicates. Using this approach, a programmer took 20 minutes to find a long-standing bug in a real-world program that he had never seen before. This work was done in collaboration with Ben Liblit (U. Wisconsin, Madison), Michael Jordan (U.C. Berkeley), Alex Aiken and Mayur Naik (Stanford).
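To make the idea of a "suspicious predicate" concrete, here is a minimal sketch. It is not the bi-clustering algorithm from the talk. It only illustrates the simpler underlying intuition that a predicate is suspicious when runs in which it was observed true fail more often than runs overall. The function name `rank_predicates`, the run format, and the predicate strings are all hypothetical and chosen for illustration.

```python
from collections import defaultdict


def rank_predicates(runs):
    """Rank predicates by how strongly "observed true" is associated with failure.

    runs: list of (true_predicates, failed) pairs, where true_predicates is a
    set of predicate names observed true in that run and failed is a bool.
    This input format is an assumption made for this sketch.
    """
    total = len(runs)
    if total == 0:
        return []
    # Baseline: fraction of all runs that failed.
    baseline = sum(1 for _, failed in runs if failed) / total

    true_count = defaultdict(int)  # runs in which the predicate was true
    fail_count = defaultdict(int)  # failed runs in which the predicate was true
    for preds, failed in runs:
        for p in preds:
            true_count[p] += 1
            if failed:
                fail_count[p] += 1

    # Score: failure rate among runs where the predicate was true, minus baseline.
    scores = {p: fail_count[p] / true_count[p] - baseline for p in true_count}
    # Highest score first; ties broken by name for a deterministic order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up instrumentation reports from four runs (two failed, two succeeded).
runs = [
    ({"ptr == NULL", "len > 0"}, True),
    ({"ptr == NULL"}, True),
    ({"len > 0"}, False),
    ({"len > 0", "i < n"}, False),
]
ranked = rank_predicates(runs)
for name, score in ranked:
    print(f"{score:+.3f}  {name}")
```

In this toy data, `ptr == NULL` ranks first because every run in which it was true failed. A single score like this cannot separate several distinct bugs that are mixed together in the failure reports, which is one reason to cluster the failed runs and select predicates simultaneously.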