## Publications

### Refereed Conferences

- Conservative Bandits.
In International Conference on Machine Learning (ICML),
New York, NY, USA.
June 20–22,
**2016**. (25% acceptance)

#### Abstract

We study a novel multi-armed bandit problem that models the challenge faced by a company wishing to explore new strategies to maximize revenue whilst simultaneously maintaining their revenue above a fixed baseline, uniformly over time. While previous work addressed the problem under the weaker requirement of maintaining the revenue constraint only at a given fixed time in the future, the algorithms previously proposed are unsuitable due to their design under this weaker requirement. We consider both the stochastic and the adversarial settings, where we propose natural, yet novel, strategies and analyze the price for maintaining the constraints. Amongst other things, we prove both high-probability and expectation bounds on the regret, and we consider maintaining the constraints either with high probability or in expectation. For the adversarial setting, the price of maintaining the constraint appears to be higher, at least for the algorithm considered. A lower bound is given showing that the algorithm for the stochastic setting is almost optimal. Empirical results obtained in synthetic environments complement our theoretical findings.
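The uniform revenue constraint can be illustrated with a toy simulation. This is a hypothetical sketch, not the algorithm analyzed in the paper: an exploratory arm is pulled only when the revenue accumulated so far already meets a (1 − α) fraction of the baseline revenue at the current round, and otherwise the known baseline arm is played.

```python
def conservative_play(arms, baseline_mean, alpha=0.1, horizon=1000):
    """Toy conservative-exploration loop (illustration only, not the
    paper's algorithm). Arm 0 is the known safe baseline; exploratory
    arms are played only while realized revenue stays at or above
    (1 - alpha) times the baseline revenue accumulated so far."""
    total = 0.0
    counts = [0] * len(arms)
    for t in range(1, horizon + 1):
        # Safe to explore: rewards are nonnegative, so if the revenue
        # already meets the constraint at round t, it still will after
        # one more pull of any arm.
        if total >= (1 - alpha) * baseline_mean * t:
            i = min(range(len(arms)), key=lambda a: counts[a])  # least-tried arm
        else:
            i = 0  # fall back to the baseline arm
        total += arms[i]()
        counts[i] += 1
    return total, counts
```

With a deterministic baseline arm, the loop spends its early rounds building up a revenue buffer and then interleaves occasional exploratory pulls, which is the qualitative behavior the constraint forces.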

### Thesis

- Exploiting Symmetries to Construct Efficient MCMC
Algorithms With an Application to SLAM.
M.Sc. Thesis. University of Alberta, Edmonton, Canada.
April
**2015**. (Winner of the Department of Computing Science M.Sc. Outstanding Thesis Award.)

#### Abstract

Sampling from a given probability distribution is a key problem in many different disciplines. Markov chain Monte Carlo (MCMC) algorithms approach this problem by constructing a random walk governed by a specially constructed transition probability distribution. As the random walk progresses, the distribution of its states converges to the required target distribution. The Metropolis-Hastings (MH) algorithm is a generally applicable MCMC method which, given a proposal distribution, modifies it by adding an accept/reject step: it proposes a new state based on the proposal distribution and the existing state of the random walk, then either accepts or rejects it with a certain probability; if it is rejected, the old state is retained. The MH algorithm is most effective when the proposal distribution closely matches the target distribution: otherwise most proposals will be rejected and convergence to the target distribution will be slow. The proposal distribution should therefore be designed to take advantage of any known structure in the target distribution.
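The accept/reject step described above is short to state in code. A minimal sketch of textbook Metropolis-Hastings, assuming a symmetric proposal so the acceptance probability reduces to the ratio of target densities (the function names and the Gaussian example are illustrative, not from the thesis):

```python
import math
import random

def metropolis_hastings(log_target, proposal, x0, n_steps, rng=None):
    """Textbook MH with a symmetric proposal: propose x' from
    proposal(x), accept with probability min(1, p(x') / p(x));
    on rejection, retain the old state."""
    rng = rng or random.Random(0)
    x = x0
    samples = []
    for _ in range(n_steps):
        x_new = proposal(x, rng)
        # Work in log space to avoid numerical under/overflow.
        log_alpha = log_target(x_new) - log_target(x)
        if math.log(rng.random()) < log_alpha:
            x = x_new  # accept
        # else: reject and keep the current state
        samples.append(x)
    return samples

# Example: sample a standard normal via Gaussian random-walk proposals.
samples = metropolis_hastings(
    log_target=lambda x: -0.5 * x * x,           # log density up to a constant
    proposal=lambda x, rng: x + rng.gauss(0, 1.0),
    x0=0.0,
    n_steps=20000,
)
```

When the proposal matches the target poorly, `log_alpha` is usually very negative, most proposals are rejected, and the chain mixes slowly, which is exactly the failure mode the thesis is motivated by.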

A particular kind of structure that arises in some probabilistic inference problems is that of symmetry: the problem (or its sub-problems) remains unchanged under certain transformations. A simple kind of symmetry is the choice of a coordinate system in a geometric problem; translating and rotating the origin of a plane does not affect the relative positions of any points on it. The field of group theory has a rich and fertile history of being used to characterize such symmetries; in particular, topological group theory has been applied to the study of both continuous and discrete symmetries. Symmetries are described by a group that acts on the state space of a problem, transforming it in such a way that the problem remains unchanged. We consider problems in which the target distribution has factors, each of which has a symmetry group; each factor's value does not change when the space is transformed by an element of its corresponding symmetry group.

This thesis proposes a variation of the MH algorithm where each step first chooses a random transformation of the state space and then applies it to the current state; these transformations are elements of suitable symmetry groups. The main result of this thesis extends the acceptance probability formula of the textbook MH algorithm to this case under mild conditions, adding much-needed flexibility to the MH algorithm. The new algorithm is also demonstrated in the Simultaneous Localization and Mapping (SLAM) problem in robotics, in which a robot traverses an unknown environment, and its trajectory and a map of the environment must be recovered from sensor observations and known control signals. Here the group moves are chosen to exploit the SLAM problem's natural geometric symmetries, obtaining the first fully rigorous justification of a previous MCMC-based SLAM method. New experimental results comparing this method to existing state-of-the-art specialized methods on a standard range-only SLAM benchmark problem validate the strength of the approach.

### Workshops

- Discounted Reinforcement Learning Is Not an Optimization Problem.
In Optimization Foundations for Reinforcement Learning Workshop, Neural Information Processing
Systems (NeurIPS), Vancouver, BC,
Canada. December 14,
**2019**.

#### Abstract

Discounted reinforcement learning is fundamentally incompatible with function approximation for control in continuing tasks. It is not an optimization problem in its usual formulation, so when using function approximation there is no optimal policy. We substantiate these claims, then go on to address some misconceptions about discounting and its connection to the average reward formulation. We encourage researchers to adopt rigorous optimization approaches, such as maximizing average reward, for reinforcement learning in continuing tasks.

- A Value Function Basis for Nexting and Multi-Step Prediction.
In Reinforcement Learning & Decision Making (RLDM), Montréal, QC,
Canada. July 7–10,
**2019**.

#### Abstract

Humans and animals continuously make short-term cumulative predictions about their sensory-input stream, an ability referred to by psychologists as nexting. This ability has been recreated in a mobile robot using modern reinforcement learning approaches, but in practice there are limitations on how many predictions we can learn. In this paper, we investigate inferring new predictions from a minimal set of learned General Value Functions. We show that linearly weighting such a collection of value function predictions enables us to also make accurate multi-step predictions about future outcomes, and provide a closed-form solution to estimate this linear weighting. We also show that a similar approach can produce accurate estimates of value functions which we did not explicitly train to predict.
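The closed-form linear weighting can be illustrated with ordinary least squares on synthetic data. This is a generic sketch under stated assumptions, not the paper's construction: `Phi`, `true_w`, and the dimensions are hypothetical, with each row of `Phi` standing in for the basis of GVF predictions at one time step and `y` for a multi-step outcome we wish to infer from that basis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 time steps, a basis of 4 value-function
# predictions per step, and a noisy multi-step target.
n_steps, n_gvfs = 200, 4
Phi = rng.normal(size=(n_steps, n_gvfs))
true_w = np.array([0.5, -1.0, 2.0, 0.25])
y = Phi @ true_w + 0.01 * rng.normal(size=n_steps)

# Closed-form linear weighting: a standard least-squares solve.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# New predictions inferred from the learned basis, no extra training.
y_hat = Phi @ w
```

The point of the sketch is only that once a small basis of predictions is learned, additional predictions reduce to a cheap linear solve rather than further learning.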

- Conservative Bandits.
In Workshop on Machine Learning for eCommerce, Neural
Information Processing Systems (NIPS), Montréal, QC,
Canada. December 11,
**2015**.

#### Abstract

We study a novel multi-armed bandit problem that models the challenge faced by a company wishing to explore new strategies whilst simultaneously maintaining their revenue above a fixed baseline. We develop a new algorithm that satisfies the revenue constraint and also enjoys problem dependent regret guarantees with respect to the optimal action. Lower bounds are given showing that the new algorithm is almost optimal.