Not All Frames Are Created Equal: Temporal Sparsity for Robust and Efficient ASR

Published August 17, 2016, 22:04
Traditional frame-based speech recognition technologies build sequence models of temporally dense vector time series representations that account for the entirety of the speech signal. However, under non-stationary distortion, the burden of accounting for everything can propagate errors beyond the corrupted frames. I will advocate an alternative strategy where the speech signal is instead (i) transformed into a sparse set of temporal point patterns of the most salient acoustic events and (ii) decoded using explicit models of the temporal statistics of these patterns. Formalized under a point process model framework, the proposed sparse methods prove sufficient for clean speech recognition, provide a new avenue to improve noise robustness, and hold potential for significantly greater computational efficiency than their frame-based counterparts.
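The two steps above can be illustrated with a toy sketch. The abstract does not specify the event detectors or the exact point process model, so everything here is an assumption for illustration: `extract_events` (a hypothetical helper) keeps only the frames where a per-frame detector score is a local maximum above a threshold, and `poisson_log_likelihood` (also hypothetical) scores the resulting point pattern under a homogeneous Poisson process, the simplest point process model, whose log-likelihood for n events in a window of length T at rate λ is n log λ − λT.

```python
import math


def extract_events(scores, threshold=0.5):
    """Return frame indices where the detector score is a local peak above threshold.

    `scores` is assumed to be one per-frame detector output for a single
    acoustic event type; the threshold value is illustrative.
    """
    events = []
    for i in range(1, len(scores) - 1):
        s = scores[i]
        if s > threshold and s > scores[i - 1] and s >= scores[i + 1]:
            events.append(i)
    return events


def poisson_log_likelihood(num_events, rate, duration):
    """Log-likelihood of observing a pattern of `num_events` points in a window
    of length `duration` (seconds) under a homogeneous Poisson process with
    intensity `rate` (events per second)."""
    return num_events * math.log(rate) - rate * duration


# Ten dense frames of detector output collapse to two sparse events.
scores = [0.1, 0.2, 0.9, 0.3, 0.1, 0.1, 0.6, 0.8, 0.7, 0.2]
events = extract_events(scores)

# Assumed 10 ms frame shift, so the window lasts 0.1 s.
duration = len(scores) * 0.01

# Compare two hypothetical event-rate models; the better-matching rate scores higher.
ll_slow = poisson_log_likelihood(len(events), rate=20.0, duration=duration)
ll_fast = poisson_log_likelihood(len(events), rate=100.0, duration=duration)
```

In this toy setting, decoding amounts to comparing likelihoods of the same sparse pattern under competing models. A realistic system would use many detectors and time-varying (inhomogeneous) rates per word or phone, which this sketch does not attempt.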