Friday, September 15, 2017

Random Subspace with Trees for Feature Selection Under Memory Constraints / Learning Mixture of Gaussians with Streaming Data


Probably the last image of Titan from the Cassini spacecraft. Taken: Sep. 12, 2017 9:26 PM. Received: Sep. 13, 2017 10:19 AM. Image Credit: NASA/JPL-Caltech/Space Science Institute


As our capability to produce features from data grows larger every day, we are now getting to the stage where we have to learn/infer under a streaming constraint: i.e., we get to see the data only once and then have to produce some inference. The first paper addresses this in the random forest setting, while the second looks at it when building a mixture of Gaussians (relevant: Compressive Statistical Learning with Random Feature Moments, Sketching for Large-Scale Learning of Mixture Models, SketchMLbox). Enjoy!



Dealing with datasets of very high dimension is a major challenge in machine learning. In this paper, we consider the problem of feature selection in applications where the memory is not large enough to contain all features. In this setting, we propose a novel tree-based feature selection approach that builds a sequence of randomized trees on small subsamples of variables mixing both variables already identified as relevant by previous models and variables randomly selected among the other variables. As our main contribution, we provide an in-depth theoretical analysis of this method in an infinite sample setting. In particular, we study its soundness with respect to common definitions of feature relevance and its convergence speed under various variable dependence scenarios. We also provide some preliminary empirical results highlighting the potential of the approach.
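
To get a concrete feel for the idea, here is a minimal Python sketch in the spirit of the abstract (not the authors' exact algorithm or analysis): repeatedly fit a randomized tree on a small subset of variables that mixes features already flagged as relevant with freshly sampled ones, and grow the relevant set from the tree's importances. The names and defaults below (subset_size, n_iterations, importance_threshold) are illustrative assumptions, not quantities from the paper.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_subspace_feature_selection(X, y, subset_size=20, n_iterations=200,
                                      importance_threshold=1e-3, seed=0):
    """Illustrative memory-constrained feature selection with randomized trees."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    relevant = set()  # variables identified as relevant so far

    for _ in range(n_iterations):
        # Mix variables already found relevant with randomly drawn candidates,
        # so only `subset_size` columns need to be held in memory at once.
        kept = list(relevant)[: subset_size // 2]
        pool = [j for j in range(n_features) if j not in kept]
        fresh = rng.choice(pool, size=subset_size - len(kept), replace=False)
        subset = np.concatenate([np.array(kept, dtype=int), fresh.astype(int)])

        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=int(rng.integers(1 << 31)))
        tree.fit(X[:, subset], y)

        # Flag the variables the randomized tree actually found useful.
        for local_idx, imp in enumerate(tree.feature_importances_):
            if imp > importance_threshold:
                relevant.add(int(subset[local_idx]))

    return sorted(relevant)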



In this paper, we study the problem of learning a mixture of Gaussians with streaming data: given a stream of $N$ points in $d$ dimensions generated by an unknown mixture of $k$ spherical Gaussians, the goal is to estimate the model parameters using a single pass over the data stream. We analyze a streaming version of the popular Lloyd's heuristic and show that the algorithm estimates all the unknown centers of the component Gaussians accurately if they are sufficiently separated. Assuming each pair of centers is $C\sigma$ distant with $C=\Omega((k\log k)^{1/4}\sigma)$ and where $\sigma^2$ is the maximum variance of any Gaussian component, we show that asymptotically the algorithm estimates the centers optimally (up to constants); our center separation requirement matches the best known result for spherical Gaussians (Vempala and Wang). For finite samples, we show that a bias term based on the initial estimate decreases at $O(1/{\rm poly}(N))$ rate while variance decreases at nearly optimal rate of $\sigma^2 d/N$.
Our analysis requires seeding the algorithm with a good initial estimate of the true cluster centers, for which we provide an online PCA based clustering algorithm. Indeed, the asymptotic per-step time complexity of our algorithm is the optimal $d\cdot k$, while the space complexity of our algorithm is $O(dk\log k)$.
In addition to the bias and variance terms which tend to $0$, the hard-thresholding-based updates of the streaming Lloyd's algorithm are agnostic to the data distribution and hence incur an approximation error that cannot be avoided. However, by using a streaming version of the classical (soft-thresholding-based) EM method that exploits the Gaussian distribution explicitly, we show that for a mixture of two Gaussians the true means can be estimated consistently, with estimation error decreasing at nearly optimal rate, and tending to $0$ as $N\rightarrow \infty$.
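
As a rough illustration of the hard-thresholding (streaming Lloyd) update described in the abstract, here is a short Python sketch, assuming the centers are already seeded by some initialization such as the online-PCA-based clustering the authors mention; the function and variable names are illustrative, not the paper's, and the toy usage at the end is only a sanity check on well-separated synthetic data.

import numpy as np

def streaming_lloyd(stream, initial_centers):
    """One-pass Lloyd-style updates: hard-assign each point to its nearest
    center and nudge that center with a 1/count (running-mean) step size."""
    centers = np.array(initial_centers, dtype=float)  # shape (k, d)
    counts = np.zeros(len(centers))                   # points assigned per center

    for x in stream:                                  # single pass over the data
        x = np.asarray(x, dtype=float)
        i = np.argmin(np.linalg.norm(centers - x, axis=1))  # hard assignment
        counts[i] += 1
        centers[i] += (x - centers[i]) / counts[i]          # running-mean update

    return centers

# Toy usage on a well-separated synthetic mixture (illustrative only):
rng = np.random.default_rng(0)
true_means = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
data = np.vstack([m + rng.normal(size=(1000, 2)) for m in true_means])
rng.shuffle(data)
init = true_means + rng.normal(scale=0.5, size=true_means.shape)
estimated = streaming_lloyd(iter(data), init)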



Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there!
Liked this entry? Subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on LinkedIn.
