In-Situ Model Checking of MPI Parallel Programs

Published September 6, 2016, 16:41
MPI is the de facto standard for programming the cluster machines used in high-performance computing. Although MPI programs are not as error-prone as shared-memory thread programs, the use of wildcard communications, split operations (posting a request and testing it later), and the recently added weak shared-memory extensions to MPI can nevertheless render MPI programs quite buggy. Extracting models from MPI programs and feeding them to existing model checkers suffers from several problems: model extraction is not easy, and the assumption that the underlying MPI calls behave correctly can be unrealistic. In our ongoing work, we instead run the actual MPI programs, trapping MPI calls and allowing a custom-designed scheduler process to manifest process interleavings. We employ a dynamic partial order reduction (DPOR) algorithm to cut down the number of interleavings, and we take precautions so that the scheduler abides by the conventions of MPI's internal progress engine. Our resulting ISP (In-Situ Partial Order) tool has detected known deadlocks in a byte-range locking protocol that uses MPI's weak memory features, and has also found deadlocks in simple message-passing MPI programs. The talk will introduce some MPI commands and their semantics, the dependence relation identified for MPI, and the DPOR algorithm used, and will summarize the directions being pursued to make ISP work more efficiently. Other DPOR work in our group (e.g. a distributed DPOR algorithm for PThread programs) will also be summarized. In summary, the talk will show how DPOR can be implemented in two actual parallel programming contexts. This is joint work with the Gauss group members: cs.utah.edu/formal_verificatio...
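As an illustrative sketch (not ISP's actual implementation), the wildcard-receive nondeterminism mentioned above can be modeled in a few lines: a receiver posts a wildcard receive followed by a receive from a specific source, while two senders each send one message. Depending on which sender the wildcard matches, the specific receive may be left with no matching message, and the program deadlocks. The process names `P0`/`P1` and the `run` helper are hypothetical, chosen only for this toy model; a DPOR-style checker must explore both matchings to find the bad one:

```python
# Toy model (hypothetical, not ISP's code) of a wildcard-receive deadlock:
# a receiver does recv(ANY_SOURCE) followed by recv(source=P0), while
# senders P0 and P1 each send one message.

def run(wildcard_match):
    """Simulate the receiver, given which sender the wildcard recv matched.

    Returns True if the program completes, False if it deadlocks.
    """
    pending = {"P0", "P1"}           # one message in flight from each sender
    pending.discard(wildcard_match)  # recv(ANY_SOURCE) consumes that message
    # recv(source="P0") can only complete if P0's message is still pending.
    return "P0" in pending

# Explore every wildcard matching, as a DPOR-style scheduler would.
outcomes = {src: run(src) for src in ("P0", "P1")}
print(outcomes)  # matching the wildcard to P0 deadlocks the second recv
```

A naive tester that runs the program once would likely see only one matching and miss the deadlock; this is exactly why ISP's scheduler forces the alternative interleavings rather than trusting a single native execution.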