Results for "David Woodruff"

New Algorithms for Heavy Hitters in Data Streams (Mar 05 2016). An old and fundamental problem in databases and data streams is that of finding the heavy hitters, also known as the top-$k$, most popular items, frequent items, elephants, or iceberg queries. There are several variants of this problem, which quantify ...
Sketching as a Tool for Numerical Linear Algebra (Nov 17 2014; Feb 10 2015). This survey highlights the recent advances in algorithms for numerical linear algebra that have come from the technique of linear sketching, whereby given a matrix, one first compresses it to a much smaller matrix by multiplying it by a (usually) random ...
A Note on Experiments and Software For Multidimensional Order Statistics (Apr 25 2017). In this note we describe experiments on an implementation of two methods proposed in the literature for computing regions that correspond to a notion of order statistics for multidimensional data. Our implementation, which works for any dimension greater ...
Subspace Embeddings and $\ell_p$-Regression Using Exponential Random Variables (May 23 2013; Mar 17 2014). Oblivious low-distortion subspace embeddings are a crucial building block for numerical linear algebra problems. We show for any real $p, 1 \leq p < \infty$, given a matrix $M \in \mathbb{R}^{n \times d}$ with $n \gg d$, with constant probability we can ...
Lower Bounds for Adaptive Sparse Recovery (May 15 2012; Oct 21 2012). We give lower bounds for the problem of stable sparse recovery from adaptive linear measurements. In this problem, one would like to estimate a vector $x \in \mathbb{R}^n$ from $m$ linear measurements $A_1x, \ldots, A_mx$. One may choose each vector $A_i$ based ...
Optimal CUR Matrix Decompositions (May 30 2014; Jul 16 2014). The CUR decomposition of an $m \times n$ matrix $A$ finds an $m \times c$ matrix $C$ with a subset of $c < n$ columns of $A,$ together with an $r \times n$ matrix $R$ with a subset of $r < m$ rows of $A,$ as well as a $c \times r$ low-rank matrix $U$ ...
Towards Optimal Moment Estimation in Streaming and Distributed Models (Jul 12 2019). One of the oldest problems in the data stream model is to approximate the $p$-th moment $\|\mathcal{X}\|_p^p = \sum_{i=1}^n |\mathcal{X}_i|^p$ of an underlying vector $\mathcal{X} \in \mathbb{R}^n$, which is presented as a sequence of poly$(n)$ updates ...
When Distributed Computation is Communication Expensive (Apr 16 2013; Jul 26 2013). We consider a number of fundamental statistical and graph problems in the message-passing model, where we have $k$ machines (sites), each holding a piece of data, and the machines want to jointly solve a problem defined on the union of the $k$ data sets. ...
Perfect $L_p$ Sampling in a Data Stream (Aug 16 2018; Nov 28 2018). In this paper, we resolve the one-pass space complexity of $L_p$ sampling for $p \in (0,2)$. Given a stream of updates (insertions and deletions) to the coordinates of an underlying vector $f \in \mathbb{R}^n$, a perfect $L_p$ sampler must output an index ...
On Approximating Functions of the Singular Values in a Stream (Apr 29 2016). For any real number $p > 0$, we nearly completely characterize the space complexity of estimating $\|A\|_p^p = \sum_{i=1}^n \sigma_i^p$ for $n \times n$ matrices $A$ in which each row and each column has $O(1)$ non-zero entries and whose entries are presented ...
Sublinear Time Low-Rank Approximation of Distance Matrices (Sep 19 2018). Let $\mathbf{P}=\{ p_1, p_2, \ldots p_n \}$ and $\mathbf{Q} = \{ q_1, q_2 \ldots q_m \}$ be two point sets in an arbitrary metric space. Let $\mathbf{A}$ represent the $m\times n$ pairwise distance matrix with $\mathbf{A}_{i,j} = d(p_i, q_j)$. Such distance ...
Sublinear Time Low-Rank Approximation of Positive Semidefinite Matrices (Apr 11 2017; Jan 03 2019). We show how to compute a relative-error low-rank approximation to any positive semidefinite (PSD) matrix in sublinear time, i.e., for any $n \times n$ PSD matrix $A$, in $\tilde O(n \cdot poly(k/\epsilon))$ time we output a rank-$k$ matrix $B$, in factored ...
Embeddings of Schatten Norms with Applications to Data Streams (Feb 18 2017). Given an $n \times d$ matrix $A$, its Schatten-$p$ norm, $p \geq 1$, is defined as $\|A\|_p = \left (\sum_{i=1}^{\textrm{rank}(A)}\sigma_i(A)^p \right )^{1/p}$, where $\sigma_i(A)$ is the $i$-th largest singular value of $A$. These norms have been studied ...
Strong Coresets for k-Median and Subspace Approximation: Goodbye Dimension (Sep 09 2018). We obtain the first strong coresets for the $k$-median and subspace approximation problems with sum of distances objective function, on $n$ points in $d$ dimensions, with a number of weighted points that is independent of both $n$ and $d$; namely, our ...
Distributed Low Rank Approximation of Implicit Functions of a Matrix (Jan 28 2016). We study distributed low rank approximation in which the matrix to be approximated is only implicitly represented across the different servers. For example, each of $s$ servers may have an $n \times d$ matrix $A^t$, and we may be interested in computing ...
How Robust are Linear Sketches to Adaptive Inputs? (Nov 05 2012). Linear sketches are powerful algorithmic tools that turn an n-dimensional input into a concise lower-dimensional representation via a linear transformation. Such sketches have seen a wide range of applications including norm estimation over data streams, ...
Is Input Sparsity Time Possible for Kernel Low-Rank Approximation? (Nov 05 2017). Low-rank approximation is a common tool used to accelerate kernel methods: the $n \times n$ kernel matrix $K$ is approximated via a rank-$k$ matrix $\tilde K$ which can be stored in much less space and processed more quickly. In this work we study the ...
On Approximating Functions of the Singular Values in a Stream (Apr 29 2016; Mar 20 2017). For any real number $p > 0$, we nearly completely characterize the space complexity of estimating $\|A\|_p^p = \sum_{i=1}^n \sigma_i^p$ for $n \times n$ matrices $A$ in which each row and each column has $O(1)$ non-zero entries and whose entries are presented ...
Optimal Random Sampling from Distributed Streams Revisited (Mar 28 2019). We give an improved algorithm for drawing a random sample from a large data stream when the input elements are distributed across multiple sites which communicate via a central coordinator. At any point in time the set of elements held by the coordinator ...
Tight Bounds for Distributed Functional Monitoring (Dec 21 2011; Jun 12 2013). We resolve several fundamental questions in the area of distributed functional monitoring, initiated by Cormode, Muthukrishnan, and Yi (SODA, 2008). In this model there are $k$ sites each tracking their input and communicating with a central coordinator ...
A Near-Optimal Algorithm for L1-Difference (Apr 13 2009). We give the first L_1-sketching algorithm for integer vectors which produces nearly optimal sized sketches in nearly linear time. This answers the first open problem in the list of open problems from the 2006 IITK Workshop on Algorithms for Data Streams. ...
Tight Bounds for $\ell_p$ Oblivious Subspace Embeddings (Jan 13 2018; Apr 06 2018). An $\ell_p$ oblivious subspace embedding is a distribution over $r \times n$ matrices $\Pi$ such that for any fixed $n \times d$ matrix $A$, $$\Pr_{\Pi}[\textrm{for all }x, \ \|Ax\|_p \leq \|\Pi Ax\|_p \leq \kappa \|Ax\|_p] \geq 9/10,$$ where $r$ is the ...
Tight Bounds for $\ell_p$ Oblivious Subspace Embeddings (Jan 13 2018). An $\ell_p$ oblivious subspace embedding is a distribution over $r \times n$ matrices $\Pi$ such that for any fixed $n \times d$ matrix $A$, $$\Pr_{\Pi}[\textrm{for all }x, \ \|Ax\|_p \leq \|\Pi Ax\|_p \leq \kappa \|Ax\|_p] \geq 9/10,$$ where $r$ is the ...
The Round Complexity of Small Set Intersection (Apr 05 2013; Apr 09 2013). The set disjointness problem is one of the most fundamental and well-studied problems in communication complexity. In this problem Alice and Bob hold sets $S, T \subseteq [n]$, respectively, and the goal is to decide if $S \cap T = \emptyset$. Reductions ...
(1+eps)-approximate Sparse Recovery (Oct 19 2011; Dec 26 2011). The problem central to sparse recovery and compressive sensing is that of stable sparse recovery: we want a distribution of matrices $A \in \mathbb{R}^{m \times n}$ such that, for any $x \in \mathbb{R}^n$ and with probability at least 2/3 over $A$, there is an algorithm to recover ...
Separating k-Player from t-Player One-Way Communication, with Applications to Data Streams (May 17 2019). In a $k$-party communication problem, the $k$ players with inputs $x_1, x_2, \ldots, x_k$, respectively, want to evaluate a function $f(x_1, x_2, \ldots, x_k)$ using as little communication as possible. We consider the message-passing model, in which ...
High Probability Frequency Moment Sketches (May 28 2018). We consider the problem of sketching the $p$-th frequency moment of a vector, $p>2$, with multiplicative error at most $1\pm \epsilon$ and with high confidence $1-\delta$. Despite the long sequence of work on this problem, tight bounds on this ...
Data Streams with Bounded Deletions (Mar 23 2018). Two prevalent models in the data stream literature are the insertion-only and turnstile models. Unfortunately, many important streaming problems require a $\Theta(\log(n))$ multiplicative factor more space for turnstile streams than for insertion-only ...
Distributed Statistical Estimation of Matrix Products with Applications (Jul 02 2018). We consider statistical estimations of a matrix product over the integers in a distributed setting, where we have two parties Alice and Bob; Alice holds a matrix $A$ and Bob holds a matrix $B$, and they want to estimate statistics of $A \cdot B$. We focus ...
On Low-Risk Heavy Hitters and Sparse Recovery Schemes (Sep 09 2017; Jun 11 2018). We study the heavy hitters and related sparse recovery problems in the low-failure probability regime. This regime is not well-understood, and has only been studied for non-adaptive schemes. The main previous work is one on sparse recovery by Gilbert ...
Leveraging Well-Conditioned Bases: Streaming & Distributed Summaries in Minkowski $p$-Norms (Jul 06 2018). Work on approximate linear algebra has led to efficient distributed and streaming algorithms for problems such as approximate matrix multiplication, low rank approximation, and regression, primarily for the Euclidean norm $\ell_2$. We study other $\ell_p$ ...
Input Sparsity and Hardness for Robust Subspace Approximation (Oct 20 2015). In the subspace approximation problem, we seek a k-dimensional subspace F of R^d that minimizes the sum of p-th powers of Euclidean distances to a given set of n points a_1, ..., a_n in R^d, for p >= 1. More generally than minimizing sum_i dist(a_i,F)^p, we ...
Learning Two Layer Rectified Neural Networks in Polynomial Time (Nov 05 2018). Consider the following fundamental learning problem: given input examples $x \in \mathbb{R}^d$ and their vector-valued labels, as defined by an underlying generative neural network, recover the weight matrices of this network. We consider two-layer networks, ...
Approximation Algorithms for $\ell_0$-Low Rank Approximation (Oct 30 2017; Oct 01 2018). We study the $\ell_0$-Low Rank Approximation Problem, where the goal is, given an $m \times n$ matrix $A$, to output a rank-$k$ matrix $A'$ for which $\|A'-A\|_0$ is minimized. Here, for a matrix $B$, $\|B\|_0$ denotes the number of its non-zero entries. ...
Tight Bounds for the Subspace Sketch Problem with Applications (Apr 11 2019; Jul 10 2019). In the subspace sketch problem one is given an $n\times d$ matrix $A$ with $O(\log(nd))$ bit entries, and would like to compress it in an arbitrary way to build a small space data structure $Q_p$, so that for any given $x \in \mathbb{R}^d$, with probability ...
Applications of Uniform Sampling: Densest Subgraph and Beyond (Jun 15 2015; Jul 29 2015). Recently [Bhattacharya et al., STOC 2015] provide the first non-trivial algorithm for the densest subgraph problem in the streaming model with additions and deletions to its edges, i.e., for dynamic graph streams. They present a $(0.5-\epsilon)$-approximation ...
Low Rank Approximation and Regression in Input Sparsity Time (Jul 26 2012; Apr 05 2013). We design a new distribution over $\mathrm{poly}(r \epsilon^{-1}) \times n$ matrices $S$ so that for any fixed $n \times d$ matrix $A$ of rank $r$, with probability at least 9/10, $\|SAx\|_2 = (1 \pm \epsilon)\|Ax\|_2$ simultaneously for all $x \in \mathbb{R}^d$. ...
An Optimal Algorithm for l1-Heavy Hitters in Insertion Streams and Related Problems (Mar 01 2016). We give the first optimal bounds for returning the $\ell_1$-heavy hitters in a data stream of insertions, together with their approximate frequencies, closing a long line of work on this problem. For a stream of $m$ items in $\{1, 2, \dots, n\}$ and parameters ...
Towards a Zero-One Law for Entrywise Low Rank Approximation (Nov 04 2018). There are a number of approximation algorithms for NP-hard versions of low rank approximation, such as finding a rank-$k$ matrix $B$ minimizing the sum of absolute values of differences to a given matrix $A$, $\min_{\textrm{rank-}k~B}\|A-B\|_1$, or more ...
Weighted Maximum Independent Set of Geometric Objects in Turnstile Streams (Feb 27 2019). We study the Maximum Independent Set problem for geometric objects given in the data stream model. A set of geometric objects is said to be independent if the objects are pairwise disjoint. We consider geometric objects in one and two dimensions, i.e., ...
Optimal Principal Component Analysis in Distributed and Streaming Models (Apr 25 2015; Jul 12 2016). We study the Principal Component Analysis (PCA) problem in the distributed and streaming models of computation. Given a matrix $A \in R^{m \times n},$ a rank parameter $k < rank(A)$, and an accuracy parameter $0 < \epsilon < 1$, we want to output an $m ...
Low Rank Approximation with Entrywise $\ell_1$-Norm Error (Nov 03 2016). We study the $\ell_1$-low rank approximation problem, where for a given $n \times d$ matrix $A$ and approximation factor $\alpha \geq 1$, the goal is to output a rank-$k$ matrix $\widehat{A}$ for which $$\|A-\widehat{A}\|_1 \leq \alpha \cdot \min_{\textrm{rank-}k\textrm{ ...
On Deterministic Sketching and Streaming for Sparse Recovery and Norm Estimation (Jun 25 2012). We study classic streaming and sparse recovery problems using deterministic linear sketches, including l1/l1 and linf/l1 sparse recovery problems (the latter also being known as l1-heavy hitters), norm estimation, and approximate inner product. We focus ...
Tight Bounds for the Subspace Sketch Problem with Applications (Apr 11 2019). In the subspace sketch problem one is given an $n\times d$ matrix $A$ with $O(\log(nd))$ bit entries, and would like to compress it in an arbitrary way to build a small space data structure $Q_p$, so that for any given $x \in \mathbb{R}^d$, with probability ...
Near Optimal Sketching of Low-Rank Tensor Regression (Sep 20 2017). We study the least squares regression problem $$\min_{\Theta \in \mathcal{S}_{\odot D,R}} \|A\Theta-b\|_2,$$ where $\mathcal{S}_{\odot D,R}$ is the set of $\Theta$ for which $\Theta = \sum_{r=1}^{R} \theta_1^{(r)} \circ \cdots ...
On the Power of Adaptivity in Sparse Recovery (Oct 17 2011). The goal of (stable) sparse recovery is to recover a $k$-sparse approximation $x^*$ of a vector $x$ from linear measurements of $x$. Specifically, the goal is to recover $x^*$ such that $\|x-x^*\|_p \leq C \min_{k\text{-sparse } x'} \|x-x'\|_q$ for some constant $C$ ...
The Sketching Complexity of Graph Cuts (Mar 27 2014; Nov 10 2014). We study the problem of sketching an input graph, so that given the sketch, one can estimate the weight of any cut in the graph within factor $1+\epsilon$. We present lower and upper bounds on the size of a randomized sketch, focusing on the dependence ...
On Sketching the $q$ to $p$ norms (Jun 17 2018). We initiate the study of data dimensionality reduction, or sketching, for the $q\to p$ norms. Given an $n \times d$ matrix $A$, the $q\to p$ norm, denoted $\|A\|_{q \to p} = \sup_{x \in \mathbb{R}^d \backslash \vec{0}} \frac{\|Ax\|_p}{\|x\|_q}$, is a ...
Low-Rank Approximation from Communication Complexity (Apr 22 2019). In low-rank approximation with missing entries, given $A\in \mathbb{R}^{n\times n}$ and binary $W \in \{0,1\}^{n\times n}$, the goal is to find a rank-$k$ matrix $L$ for which: $$cost(L)=\sum_{i=1}^{n} \sum_{j=1}^{n}W_{i,j}\cdot (A_{i,j} - L_{i,j})^2\le ...
Fast Regression with an $\ell_\infty$ Guarantee (May 30 2017). Sketching has emerged as a powerful technique for speeding up problems in numerical linear algebra, such as regression. In the overconstrained regression problem, one is given an $n \times d$ matrix $A$, with $n \gg d$, as well as an $n \times 1$ vector ...
Relative Error Tensor Low Rank Approximation (Apr 26 2017; Mar 29 2018). We consider relative error low rank approximation of tensors with respect to the Frobenius norm: given an order-$q$ tensor $A \in \mathbb{R}^{\prod_{i=1}^q n_i}$, output a rank-$k$ tensor $B$ for which $\|A-B\|_F^2 \leq (1+\epsilon)$OPT, where OPT $= ...
Principal Component Analysis and Higher Correlations for Distributed Data (Apr 10 2013; Jun 29 2014). We consider algorithmic problems in the setting in which the input data has been partitioned arbitrarily on many servers. The goal is to compute a function of all the data, and the bottleneck is the communication used by the algorithm. We present algorithms ...
Tight Kernel Query Complexity of Kernel Ridge Regression and Kernel $k$-means Clustering (May 15 2019). We present tight lower bounds on the number of kernel evaluations required to approximately solve kernel ridge regression (KRR) and kernel $k$-means clustering (KKMC) on $n$ input points. For KRR, our bound for relative error approximation to the minimizer ...
Fast Moment Estimation in Data Streams in Optimal Space (Jul 23 2010). We give a space-optimal algorithm with update time O(log^2(1/eps)loglog(1/eps)) for (1+eps)-approximating the pth frequency moment, 0 < p < 2, of a length-n vector updated in a data stream. This provides a nearly exponential improvement in the update ...
Sharper Bounds for Regularized Data Fitting (Nov 10 2016; Jun 26 2017). We study matrix sketching methods for regularized variants of linear regression, low rank approximation, and canonical correlation analysis. Our main focus is on sketching techniques which preserve the objective function value for regularized problems, ...
Improved Distributed Principal Component Analysis (Aug 25 2014; Dec 23 2014). We study the distributed computing setting in which there are multiple servers, each holding a set of points, who wish to compute functions on the union of their point sets. A key task in this setting is Principal Component Analysis (PCA), in which the ...
Sublinear Optimization for Machine Learning (Oct 21 2010). We give sublinear-time approximation algorithms for some optimization problems arising in machine learning, such as training linear classifiers and finding minimum enclosing balls. Our algorithms can be extended to some kernelized versions of these problems, ...
Faster Algorithms for High-Dimensional Robust Covariance Estimation (Jun 11 2019). We study the problem of estimating the covariance matrix of a high-dimensional distribution when a small constant fraction of the samples can be arbitrarily corrupted. Recent work gave the first polynomial time algorithms for this problem with near-optimal ...
Dimensionality Reduction for Tukey Regression (May 14 2019). We give the first dimensionality reduction methods for the overconstrained Tukey regression problem. The Tukey loss function $\|y\|_M = \sum_i M(y_i)$ has $M(y_i) \approx |y_i|^p$ for residual errors $y_i$ smaller than a prescribed threshold $\tau$, but ...
Faster Kernel Ridge Regression Using Sketching and Preconditioning (Nov 10 2016). Random feature maps, such as random Fourier features, have recently emerged as a powerful technique for speeding up and scaling the training of kernel-based methods such as kernel ridge regression. However, random feature maps only provide crude approximations ...
Faster Kernel Ridge Regression Using Sketching and Preconditioning (Nov 10 2016; Jul 15 2017). Kernel Ridge Regression (KRR) is a simple yet powerful technique for non-parametric regression whose computation amounts to solving a linear system. This system is usually dense and highly ill-conditioned. In addition, the dimensions of the matrix are ...
Conditional Sparse $\ell_p$-norm Regression With Optimal Probability (Jun 26 2018). We consider the following conditional linear regression problem: the task is to identify both (i) a $k$-DNF condition $c$ and (ii) a linear rule $f$ such that the probability of $c$ is (approximately) at least some given bound $\mu$, and $f$ minimizes ...
Sample-Optimal Low-Rank Approximation of Distance Matrices (Jun 02 2019). A distance matrix $A \in \mathbb R^{n \times m}$ represents all pairwise distances, $A_{ij}=\mathrm{d}(x_i,y_j)$, between two point sets $x_1,...,x_n$ and $y_1,...,y_m$ in an arbitrary metric space $(\mathcal Z, \mathrm{d})$. Such matrices arise in various ...
Improved Algorithms for Adaptive Compressed Sensing (Apr 25 2018). In the problem of adaptive compressed sensing, one wants to estimate an approximately $k$-sparse vector $x\in\mathbb{R}^n$ from $m$ linear measurements $A_1 x, A_2 x,\ldots, A_m x$, where $A_i$ can be chosen based on the outcomes $A_1 x,\ldots, A_{i-1} ...
Revisiting Frequency Moment Estimation in Random Order Streams (Mar 06 2018). We revisit one of the classic problems in the data stream literature, namely, that of estimating the frequency moments $F_p$ for $0 < p < 2$ of an underlying $n$-dimensional vector presented as a sequence of additive updates in a stream. It is well-known ...
On Coresets for Logistic Regression (May 22 2018; Sep 13 2018). Coresets are one of the central methods to facilitate the analysis of large data sets. We continue a recent line of research applying the theory of coresets to logistic regression. First, we show a negative result, namely, that no strongly sublinear sized ...
Beating CountSketch for Heavy Hitters in Insertion Streams (Nov 02 2015). Given a stream $p_1, \ldots, p_m$ of items from a universe $\mathcal{U}$, which, without loss of generality we identify with the set of integers $\{1, 2, \ldots, n\}$, we consider the problem of returning all $\ell_2$-heavy hitters, i.e., those items ...
Optimal approximate matrix product in terms of stable rank (Jul 08 2015; Mar 02 2016). We prove, using the subspace embedding guarantee in a black box way, that one can achieve the spectral norm guarantee for approximate matrix multiplication with a dimensionality-reducing map having $m = O(\tilde{r}/\varepsilon^2)$ rows. Here $\tilde{r}$ ...
A Sketching Algorithm for Spectral Graph Sparsification (Dec 28 2014). We study the problem of compressing a weighted graph $G$ on $n$ vertices, building a "sketch" $H$ of $G$, so that given any vector $x \in \mathbb{R}^n$, the value $x^T L_G x$ can be approximated up to a multiplicative $1+\epsilon$ factor from only $H$ ...
Matrix Completion and Related Problems via Strong Duality (Apr 27 2017; Apr 25 2018). This work studies the strong duality of non-convex matrix factorization problems: we show that under certain dual conditions, these problems and their duals have the same optimum. This has been well understood for convex optimization, but little was known ...
How to Fake Multiply by a Gaussian Matrix (Jun 18 2016). Have you ever wanted to multiply an $n \times d$ matrix $X$, with $n \gg d$, on the left by an $m \times n$ matrix $\tilde G$ of i.i.d. Gaussian random variables, but could not afford to do it because it was too slow? In this work we propose a new randomized ...
Querying a Matrix through Matrix-Vector Products (Jun 13 2019). We consider algorithms with access to an unknown matrix $M\in\mathbb{F}^{n \times d}$ via matrix-vector products, namely, the algorithm chooses vectors $\mathbf{v}^1, \ldots, \mathbf{v}^q$, and observes $M\mathbf{v}^1,\ldots, M\mathbf{v}^q$. Here the ...
Weighted Reservoir Sampling from Distributed Streams (Apr 08 2019). We consider message-efficient continuous random sampling from a distributed stream, where the probability of inclusion of an item in the sample is proportional to a weight associated with the item. The unweighted version, where all weights are equal, ...
Frequent Directions: Simple and Deterministic Matrix Sketching (Jan 08 2015; Apr 21 2015). We describe a new algorithm called Frequent Directions for deterministic matrix sketching in the row-updates model. The algorithm is presented with an arbitrary input matrix $A \in R^{n \times d}$ one row at a time. It performs $O(d \times \ell)$ operations ...
Sketching for Kronecker Product Regression and P-splines (Dec 27 2017). TensorSketch is an oblivious linear sketch introduced in Pagh'13 and later used in Pham, Pagh'13 in the context of SVMs for polynomial kernels. It was shown in Avron, Nguyen, Woodruff'14 that TensorSketch provides a subspace embedding, and therefore can ...
Communication-Optimal Distributed Clustering (Feb 01 2017). Clustering large datasets is a fundamental problem with a number of applications in machine learning. Data is often collected on different sites and clustering needs to be performed in a distributed manner with low communication. We would like the quality ...
The Communication Complexity of Optimization (Jun 13 2019). We consider the communication complexity of a number of distributed optimization problems. We start with the problem of solving a linear system. Suppose there is a coordinator together with $s$ servers $P_1, \ldots, P_s$, the $i$-th of which holds a subset ...
Faster Kernel Ridge Regression Using Sketching and Preconditioning (Nov 10 2016; Nov 26 2016). Kernel Ridge Regression (KRR) is a simple yet powerful technique for non-parametric regression whose computation amounts to solving a linear system. This system is usually dense and highly ill-conditioned. In addition, the dimensions of the matrix are ...
Sharper Bounds for Regression and Low-Rank Approximation with Regularization (Nov 10 2016). The technique of matrix sketching, such as the use of random projections, has been shown in recent years to be a powerful tool for accelerating many important statistical learning techniques. Research has so far focused largely on using sketching for ...
Revisiting Norm Estimation in Data Streams (Nov 21 2008; Apr 09 2009). The problem of estimating the pth moment F_p (p nonnegative and real) in data streams is as follows. There is a vector x which starts at 0, and many updates of the form x_i <-- x_i + v come sequentially in a stream. The algorithm also receives an error ...
On The Communication Complexity of Linear Algebraic Problems in the Message Passing Model (Jul 17 2014). We study the communication complexity of linear algebraic problems over finite fields in the multi-player message passing model, proving a number of tight lower bounds. Specifically, for a matrix which is distributed among a number of players, we consider ...
Lower Bounds for Sparse Recovery (Jun 02 2011; Jun 03 2011). We consider the following k-sparse recovery problem: design an m x n matrix A, such that for any signal x, given Ax we can efficiently recover x' satisfying ||x-x'||_1 <= C min_{k-sparse x"} ||x-x"||_1. It is known that there exist matrices A with this ...
Testing Matrix Rank, Optimally (Oct 18 2018). We show that for the problem of testing if a matrix $A \in F^{n \times n}$ has rank at most $d$, or requires changing an $\epsilon$-fraction of entries to have rank at most $d$, there is a non-adaptive query algorithm making $\widetilde{O}(d^2/\epsilon)$ ...
Oblivious Sketching of High-Degree Polynomial Kernels (Sep 03 2019). Kernel methods are fundamental tools in machine learning that allow detection of non-linear dependencies between data without explicitly constructing feature vectors in high dimensional spaces. A major disadvantage of kernel methods is their poor scalability: ...
Mape_Maker: A Scenario Creator (Sep 04 2019). We describe algorithms for creating probabilistic scenarios for the situation when the underlying forecast methodology is modeled as being more (or less) accurate than it has been historically. Such scenarios can be used in studies that extend into the ...
Cell2Fire: A Cell Based Forest Fire Growth Model (May 22 2019). Cell2Fire is a new cell-based forest and wildland landscape fire growth simulator that is open-source and exploits parallelism to support the modelling of fire growth across large spatial and temporal scales in a timely manner. The fire environment is ...
EMFS: Repurposing SMTP and IMAP for Data Storage and Synchronization (Jan 29 2016). Cloud storage has become a massive and lucrative business, with companies like Apple, Microsoft, Google, and Dropbox providing hundreds of millions of clients with synchronized and redundant storage. These services often command price-to-storage ratios ...
Exploring nucleon spin structure through neutrino neutral-current interactions in MicroBooNE (Feb 02 2017). The net contribution of the strange quark spins to the proton spin, $\Delta s$, can be determined from neutral current elastic neutrino-proton interactions at low momentum transfer combined with data from electron-proton scattering. The probability of ...
On Sketching Quadratic Forms (Nov 19 2015). We undertake a systematic study of sketching a quadratic form: given an $n \times n$ matrix $A$, create a succinct sketch $\textbf{sk}(A)$ which can produce (without further access to $A$) a multiplicative $(1+\epsilon)$-approximation to $x^T A x$ for ...
Robust Communication-Optimal Distributed Clustering Algorithms (Mar 02 2017; Mar 06 2019). In this work, we study the $k$-median and $k$-means clustering problems when the data is distributed across many servers and can contain outliers. While there has been a lot of work on these problems for worst-case instances, we focus on gaining a finer ...
Studying Neutral Current Elastic Scattering and the Strange Axial Form Factor in MicroBooNE (Jan 13 2019). One of the least constrained contributions to the neutral current (NC) elastic neutrino-proton cross section is the strange axial form factor, which represents the strange quark spin contribution to the spin structure of the proton. This becomes the net ...
BPTree: an $\ell_2$ heavy hitters algorithm using constant memory (Mar 02 2016; Mar 08 2016). The task of finding heavy hitters is one of the best known and well studied problems in the area of data streams. In sub-polynomial space, the strongest guarantee available is the $\ell_2$ guarantee, which requires finding all items that occur at least ...
Streaming Space Complexity of Nearly All Functions of One Variable on Frequency Vectors (Jan 27 2016). A central problem in the theory of algorithms for data streams is to determine which functions on a stream can be approximated in sublinear, and especially sub-polynomial or poly-logarithmic, space. Given a function $g$, we study the space complexity ...
Communication Efficient Distributed Kernel Principal Component Analysis (Mar 23 2015; Feb 13 2016). Kernel Principal Component Analysis (KPCA) is a key machine learning algorithm for extracting nonlinear features from data. In the presence of a large volume of high dimensional data collected in a distributed fashion, it becomes very costly to communicate ...
Transitive-Closure Spanners (Aug 13 2008). Given a directed graph G = (V,E) and an integer k>=1, a k-transitive-closure-spanner (k-TC-spanner) of G is a directed graph H = (V, E_H) that has (1) the same transitive-closure as G and (2) diameter at most k. These spanners were implicitly studied ...
Spectrum Approximation Beyond Fast Matrix Multiplication: Algorithms and Hardness (Apr 13 2017; Jan 03 2019). Understanding the singular value spectrum of a matrix $A \in \mathbb{R}^{n \times n}$ is a fundamental task in countless applications. In matrix multiplication time, it is possible to perform a full SVD and directly compute the singular values $\sigma_1,...,\sigma_n$. ...
Matrix Norms in Data Streams: Faster, Multi-Pass and Row-Order (Sep 19 2016; Oct 24 2018). A central problem in data streams is to characterize which functions of an underlying frequency vector can be approximated efficiently. Recently there has been considerable effort in extending this problem to that of estimating functions of a matrix that ...
BPTree: an $\ell_2$ heavy hitters algorithm using constant memory (Mar 02 2016; Nov 09 2017). The task of finding heavy hitters is one of the best known and well studied problems in the area of data streams. One is given a list $i_1,i_2,\ldots,i_m\in[n]$ and the goal is to identify the items among $[n]$ that appear frequently in the list. In sub-polynomial ...
The Fast Cauchy Transform and Faster Robust Linear Regression (Jul 19 2012; Apr 05 2014). We provide fast algorithms for overconstrained $\ell_p$ regression and related problems: for an $n\times d$ input matrix $A$ and vector $b\in\mathbb{R}^n$, in $O(nd\log n)$ time we reduce the problem $\min_{x\in\mathbb{R}^d} \|Ax-b\|_p$ to the same problem ...
Optimal lower bounds for universal relation, and for samplers and finding duplicates in streams (Apr 03 2017). In the communication problem $\mathbf{UR}$ (universal relation) [KRW95], Alice and Bob respectively receive $x, y \in\{0,1\}^n$ with the promise that $x\neq y$. The last player to receive a message must output an index $i$ such that $x_i\neq y_i$. We ...