Large Linear Classification When Data Cannot Fit in Memory

Published July 28, 2016, 23:03
Linear classification is a useful tool for dealing with large-scale data in applications such as document classification and natural language processing. Recent developments in linear classification have shown that the training process can be conducted efficiently. However, when the data size exceeds the memory capacity, most training methods suffer from very slow convergence due to severe disk swapping. In the first part of the talk, we describe a block minimization framework for data larger than memory. Under this framework, a solver splits the data into blocks and stores them in separate files; at each step, it trains on a data block loaded from disk.

In the second part of the talk, we introduce a selective block minimization (SBM) algorithm, a block minimization method that makes use of selective sampling. At each step, SBM updates the model using data consisting of two parts: (1) new data loaded from disk and (2) a set of informative samples already in memory from previous steps. We prove that, by updating the linear model in the dual form, the proposed method fully utilizes the data in memory and converges to a globally optimal solution on the entire data set. Experiments show that the SBM algorithm dramatically reduces the number of blocks loaded from disk and consequently obtains an accurate and stable model quickly for both binary and multi-class classification.

Joint work with Hsiang-Fu Yu, Cho-Jui Hsieh, Chih-Jen Lin and Dan Roth.
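To make the block minimization framework concrete, here is a minimal sketch, not the authors' implementation. It assumes the data have already been split into files block_0.npz, block_1.npz, ... (a hypothetical naming), each small enough to fit in memory and holding a dense feature matrix X and labels y in {-1, +1}, and it uses a plain dual coordinate descent pass for an L2-regularized L1-loss SVM as the per-block solver.

```python
import numpy as np

def dual_cd_on_block(w, alpha, X, y, C, inner_iters=5):
    """A few dual coordinate descent passes over one in-memory block."""
    for _ in range(inner_iters):
        for i in np.random.permutation(len(y)):
            xi, yi = X[i], y[i]
            qii = xi.dot(xi)
            if qii == 0.0:
                continue
            grad = yi * w.dot(xi) - 1.0            # dual gradient for instance i
            new_a = min(max(alpha[i] - grad / qii, 0.0), C)
            w += (new_a - alpha[i]) * yi * xi      # maintain w = sum_i alpha_i y_i x_i
            alpha[i] = new_a
    return w, alpha

def block_minimization(block_paths, n_features, C=1.0, outer_passes=10):
    w = np.zeros(n_features)
    alphas = {}                                    # dual variables kept per block
    for _ in range(outer_passes):                  # each pass loads every block once
        for path in block_paths:
            with np.load(path) as d:               # only this block is in memory
                X, y = d["X"], d["y"]
            a = alphas.get(path, np.zeros(len(y)))
            w, alphas[path] = dual_cd_on_block(w, a, X, y, C)
    return w
```

In this sketch the dual variables of each block are warm-started between passes, so every per-block update can only improve the dual objective even though just one block is ever resident in memory.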
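The selective step can be sketched in the same spirit; again this is an illustration under assumptions rather than the authors' code. It assumes one dual variable per training instance fits in memory even when the feature data do not, that each block file also stores the global indices of its rows (a hypothetical idx field), and it uses "samples with the largest dual variables" as a stand-in for whatever informativeness criterion the method actually uses.

```python
import numpy as np

def dual_cd(w, alpha, idx, X, y, C, inner_iters=5):
    """Dual coordinate descent over a working set: rows of X, y with global indices idx."""
    for _ in range(inner_iters):
        for j in np.random.permutation(len(idx)):
            i, xi, yi = idx[j], X[j], y[j]
            qii = xi.dot(xi)
            if qii == 0.0:
                continue
            grad = yi * w.dot(xi) - 1.0
            new_a = min(max(alpha[i] - grad / qii, 0.0), C)
            w += (new_a - alpha[i]) * yi * xi
            alpha[i] = new_a                       # alpha is updated in place
    return w

def selective_block_minimization(block_paths, n_samples, n_features,
                                 C=1.0, cache_size=10000, passes=5):
    w = np.zeros(n_features)
    alpha = np.zeros(n_samples)                    # dual variables for all instances
    cache_idx = np.empty(0, dtype=int)             # informative samples kept in memory
    cache_X = np.empty((0, n_features))
    cache_y = np.empty(0)
    for _ in range(passes):
        for path in block_paths:
            with np.load(path) as d:
                Xb, yb, idxb = d["X"], d["y"], d["idx"]
            # Working set = new block from disk + cached informative samples.
            X = np.vstack([Xb, cache_X])
            y = np.concatenate([yb, cache_y])
            idx = np.concatenate([idxb, cache_idx]).astype(int)
            w = dual_cd(w, alpha, idx, X, y, C)
            # Retain the samples with the largest nonzero dual values for the next step.
            order = np.argsort(-alpha[idx])[:cache_size]
            keep = order[alpha[idx[order]] > 0]
            cache_idx, cache_X, cache_y = idx[keep], X[keep], y[keep]
    return w
```

Because every sample keeps its dual variable across steps, the cached samples are genuinely re-optimized together with each newly loaded block rather than merely replayed, which is the property the dual-form convergence claim above relies on.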