Modeling Audio and Visual Cues for Real-world Event Detection

Audio-visual event detection aims to identify semantically defined events that reveal human activities. Most previous work has focused on a restricted set of highlight events and has depended on highly ad-hoc detectors for those events. This research emphasizes generalizable, robust modeling of single-microphone audio cues and/or single-camera visual cues for the detection of real-world events, requiring no expensive annotation other than the known timestamps of the training events. To model the audio cues for event detection, we propose leveraging statistical models proven effective in speech recognition. First, a tandem connectionist-HMM approach combines the sequence modeling capabilities of the hidden Markov model (HMM) with the context-dependent discriminative capabilities of an artificial neural network. Second, an SVM-GMM-supervector approach uses noise-robust kernels to approximate the Kullback-Leibler (KL) divergence between the feature distributions of different audio segments. The proposed methods outperform our top-ranked HMM-based acoustic event detection system from the CLEAR 2007 Evaluation, which detects twelve general meeting room events such as keyboard typing, coughing and chair moving. To model the visual cues, we propose the Gaussianized vector representation, constructed by adapting a set of Gaussian mixtures to the set of patch-based descriptors in an image or video clip, regularized by the global Gaussian mixture model. The proposed visual modeling approach achieves outstanding performance in a video event categorization task on ten LSCOM-defined events in TRECVID broadcast news data, such as exiting a car, running and people marching. Building on an efficient branch-and-bound search scheme, we propose an object localization approach for the Gaussianized vector representation, which can potentially be used to identify the regions of interest in a video that correspond to different events.
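The supervector idea shared by the audio and visual models above can be illustrated with a minimal sketch. It assumes a diagonal-covariance global GMM that has already been trained, adapts only the component means to one segment with a standard relevance-factor MAP update, and scales the stacked means so that a dot product equals a linear kernel commonly derived from an upper bound on the KL divergence. The function names, the relevance factor of 16, and the choice of this particular kernel are illustrative assumptions; the noise-robust kernels and the exact regularization used in this work are not specified here.

```python
import numpy as np

def map_adapt_means(X, weights, means, variances, relevance=16.0):
    """MAP-adapt the means of a diagonal-covariance GMM to the data X.

    X: (n, d) features of one segment (or patch descriptors of one clip).
    weights: (K,), means: (K, d), variances: (K, d) of the global GMM.
    `relevance` is an assumed relevance factor, not a value from the source.
    """
    # Log-density of every point under every component.
    diff = X[:, None, :] - means[None, :, :]                      # (n, K, d)
    log_prob = -0.5 * np.sum(
        diff ** 2 / variances + np.log(2 * np.pi * variances), axis=2
    )                                                             # (n, K)
    # Posterior responsibilities, computed stably in the log domain.
    log_post = np.log(weights) + log_prob
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

    n_k = post.sum(axis=0)                                        # soft counts
    data_mean = (post.T @ X) / np.maximum(n_k, 1e-10)[:, None]    # (K, d)
    # Components with little data stay close to the global (prior) mean.
    alpha = (n_k / (n_k + relevance))[:, None]
    return alpha * data_mean + (1.0 - alpha) * means

def supervector(adapted_means, weights, variances):
    """Stack scaled means so that sv_a @ sv_b equals
    sum_k w_k * mu_a_k^T Sigma_k^{-1} mu_b_k."""
    scaled = np.sqrt(weights)[:, None] * adapted_means / np.sqrt(variances)
    return scaled.ravel()

if __name__ == "__main__":
    # Toy global GMM with two components in two dimensions.
    w = np.array([0.5, 0.5])
    mu = np.array([[0.0, 0.0], [10.0, 10.0]])
    var = np.ones((2, 2))
    segment = np.ones((4, 2))        # four identical feature vectors
    sv = supervector(map_adapt_means(segment, w, mu, var), w, var)
    print(sv.shape, float(sv @ sv))
```

Supervectors produced this way have a fixed length regardless of segment duration, which is what allows them to be passed to a standard SVM with a linear kernel.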
We jointly model audio and visual cues for improved event detection using multi-stream HMMs and coupled HMMs (CHMMs). Spatial pyramid histograms of optical flow are proposed as a generalizable visual representation that does not require training on labeled video data. In a multimedia meeting room task of non-speech event detection, the proposed methods outperform previously reported systems that rely on ad-hoc visual object detectors and on sound localization information obtained from multiple microphones.
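A spatial pyramid histogram of optical flow can be sketched as follows. The sketch assumes a dense flow field has already been estimated for a frame pair and is given as an array of per-pixel displacements. The number of pyramid levels, the number of orientation bins, the magnitude weighting and the L1 normalization are illustrative assumptions, since the abstract does not state them, and the function name is hypothetical.

```python
import numpy as np

def flow_pyramid_histogram(flow, levels=3, bins=8):
    """Concatenate orientation histograms of optical flow over a spatial pyramid.

    flow: (H, W, 2) array of per-pixel displacements (dx, dy).
    Level l splits the frame into 2**l x 2**l cells; the frame is assumed to
    have at least 2**(levels - 1) pixels along each side.
    """
    h, w, _ = flow.shape
    magnitude = np.hypot(flow[..., 0], flow[..., 1])
    angle = np.arctan2(flow[..., 1], flow[..., 0])            # in [-pi, pi]
    # Quantize the flow direction into `bins` orientation bins.
    bin_idx = np.floor((angle + np.pi) / (2 * np.pi) * bins).astype(int) % bins

    features = []
    for level in range(levels):
        cells = 2 ** level
        row_edges = np.linspace(0, h, cells + 1).astype(int)
        col_edges = np.linspace(0, w, cells + 1).astype(int)
        for i in range(cells):
            for j in range(cells):
                rows = slice(row_edges[i], row_edges[i + 1])
                cols = slice(col_edges[j], col_edges[j + 1])
                # Magnitude-weighted orientation histogram of this cell.
                hist = np.bincount(
                    bin_idx[rows, cols].ravel(),
                    weights=magnitude[rows, cols].ravel(),
                    minlength=bins,
                )
                features.append(hist)

    vec = np.concatenate(features)
    total = vec.sum()
    return vec / total if total > 0 else vec

if __name__ == "__main__":
    # Uniform rightward motion on an 8 x 8 frame.
    flow = np.zeros((8, 8, 2))
    flow[..., 0] = 1.0
    feature = flow_pyramid_histogram(flow)
    print(feature.shape)
```

Because the descriptor is computed directly from the flow field, it needs no labeled video for training; one such vector per frame could serve as the visual observation stream of a multi-stream or coupled HMM.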