Sparsity
Sparsity
Efficient Projections onto the L1 Ball for Learning in High Dimensions constrained
We describe efficient algorithms for projecting a vector onto the L1-ball. We present two methods for projection. The first performs exact projection in O(n) expected time, where n is the dimension of the space. The second works on vectors k of whose elements are perturbed outside the L1-ball, projecting in O(k log(n)) time. This setting is especially useful for online learning in sparse feature spaces such as text categorization applications. We demonstrate the merits and effectiveness of our algorithms in numerous batch and online learning tasks. We show that variants of stochastic gradient projection methods augmented with our efficient projection procedures outperform interior point methods, which are considered state-of-the-art optimization techniques. We also show that in online settings gradient updates with L1 projections outperform the exponentiated gradient algorithm while obtaining models with high degrees of sparsity.
This page contains various papers on learning sparse models. The papers are given in a chronological order with a short explanation. In the near future I plan to post a few demos, code, pointers, and a survey (yet to be written) on algorithms for learning sparse models. The focus of the work is on efficient algorithms for high-dimension data in the presence of large amounts of training examples.
Efficient Online and Batch Learning Using Forward Backward Splitting
We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as L1. We derive concrete and very simple algorithms for minimization of loss functions with L1, L2, L22, and Linf regularization. We also show how to construct efficient algorithms for mixed-norm L1/Lq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural datasets.
Boosting with Structural Sparsity
Despite popular belief, boosting algorithms and related coordinate descent methods are prone to overfitting. We derive modifications to AdaBoost and related gradient-based coordinate descent methods that incorporate, along their entire run, sparsity-promoting penalties for the norm of the predictor that is being learned. The end result is a family of coordinate descent algorithms that integrates forward feature induction and back-pruning through sparsity promoting regularization along with an automatic stopping criterion for feature induction. We study penalties based on the L1, L2, and Linf norm of the learned predictor and also introduce mixed-norm penalties that build upon the initial norm-based penalties. The mixed-norm penalties facilitate structural sparsity of the parameters of the predictor, which is a useful property in multiclass prediction and other related learning tasks. We report empirical results that demonstrate the power of our approach in building accurate and structurally sparse models from high dimensional data.
Group Sparse Coding
Bag-of-words document representations are often used in text, image and video processing. While it is relatively easy to determine a suitable word dictionary or text documents, there is no simple mapping from raw images or videos to dictio- nary terms. The classical approach builds a dictionary using vector quantization over a large set of useful visual descriptors extracted from a training set, and uses a nearest-neighbor algorithm to count the number of occurrences of each dictionary word in documents to be encoded. More robust approaches have been proposed recently that represent each visual descriptor as a sparse weighted combination of dictionary words. While favoring a sparse representation at the level of visual descriptors, those methods however do not ensure that images have sparse representation. In this work, we use mixed-norm regularization to achieve sparsity at the image level as well as a small overall dictionary. This approach can also be used to encourage using the same dictionary words for all the images in a class, providing a discriminative signal in the construction of image representations. Experimental results on a benchmark image classification dataset show that when compact image or dictionary representations are needed for computational efficiency, the proposed approach yields better mean average precision in classification.espite popular belief, boosting algorithms and related coordinate descent methods are prone to overfitting. We derive modifications to AdaBoost and related gradient-based coordinate descent methods that incorporate, along their entire run, sparsity-promoting penalties for the norm of the predictor that is being learned. The end result is a family of coordinate descent algorithms that integrates forward feature induction and back-pruning through sparsity promoting regularization along with an automatic stopping criterion for feature induction. We study penalties based on the L1, L2, and Linf norm of the learned predictor and also introduce mixed-norm penalties that build upon the initial norm-based penalties. The mixed-norm penalties facilitate structural sparsity of the parameters of the predictor, which is a useful property in multiclass prediction and other related learning tasks. We report empirical results that demonstrate the power of our approach in building accurate and structurally sparse models from high dimensional data.
Composite Objective Mirror Descent
We present a new method for regularized convex optimization and analyze it under both online and stochastic optimization settings. In addition to unifying previously known first-order algorithms, such as the projected gradient method, mirror descent, and forward-backward splitting, our method yields new analysis and algorithms. We also derive specific instantiations of our method for commonly used regularization functions, such as L1, mixed norm, and trace-norm.
This paper is so technical that:
• I cannot think of any image to put here
• Only Ambuj and John fully grasp it
Adaptive Subgradient Methods for Online Learning & Stochastic Optimization
We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. Metaphorically, the adaptation allows us to find needles in haystacks in the form of very predictive but rarely seen features. Our paradigm stems from recent advances in stochastic optimization and online learning which employ proximal functions to control the gradient steps of the algorithm. We describe & analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. We give several efficient algorithms for empirical risk minimization problems with common and important regularization functions and domain constraints. We experimentally study our theoretical analysis and show that adaptive subgradient methods significantly outperform state-of-the-art, yet non-adaptive, subgradient algorithms.
What’s next? I am currently working on different approaches for learning sparse representations which cope with non-convexity and/or go beyond the standard empirical risk minimization. I am kind of stuck (in a good sense) with various aspects. Send me an email message if you are interested in an update. The current open problems include:
Self-pruning decision trees
Entire regularization path for relaxed max-ent models
Confidence-weighted [not rated] boosting with structural sparsity
A Coordinate Descent Algorithm for Learning Compact Ranking Functions
Algorithms for learning to rank can be inefficient when they use cost functions that involve more structure than the basic all-pair approach. To achieve efficient ranking, we consider the domination loss, which is designed to rank a small number of positive examples above a large number of negative ones, and extends to several layers of such relationships. In that context, we propose an efficient coordinate descent approach that scales linearly with the number of examples. We then present a number of extensions to the basic algorithm, including regularization, layers of examples, and feature induction. Experiments performed on several benchmark datasets show that the proposed approach yields significantly more compact models than existing algorithms do for similar performance.