
A Probabilistic and RIPless Theory of Compressed Sensing

Emmanuel J. Candès¹ and Yaniv Plan²

¹Departments of Mathematics and of Statistics, Stanford University, Stanford, CA 94305
²Applied and Computational Mathematics, Caltech, Pasadena, CA 91125

November 2010; Revised June 2011

Abstract

This paper introduces a simple and very general theory of compressive sensing. In this theory, the sensing mechanism simply selects sensing vectors independently at random from a probability distribution F; it includes all standard models—e.g. Gaussian, frequency measurements—discussed in the literature, but also provides a framework for new measurement strategies as well. We prove that if the probability distribution F obeys a simple incoherence property and an isotropy property, one can faithfully recover approximately sparse signals from a minimal number of noisy measurements. The novelty is that our recovery results do not require the restricted isometry property (RIP) to hold near the sparsity level in question, nor a random model for the signal. As an example, the paper shows that a signal with s nonzero entries can be faithfully recovered from about s log n Fourier coefficients that are contaminated with noise.

Keywords: compressed sensing, ℓ1 minimization, the LASSO, the Dantzig selector, (weak) restricted isometries, random matrices, sparse regression, operator Bernstein inequalities, Gross' golfing scheme.

Dedicated to the memory of Jerrold E. Marsden.

1 Introduction

This paper develops a novel, simple and general theory of compressive sensing [12, 15, 20], a rapidly growing field of research that has developed protocols for acquiring certain types of signals with far fewer data bits than what is classically accepted.

1.1 A RIPless theory?

The early paper [12] triggered a massive amount of research by showing that it is possible to sample signals at a rate proportional to their information content rather than their bandwidth. For instance, in a discrete setting, this theory asserts that a digital signal x ∈ R^n (which can be viewed as Nyquist samples of a continuous-time signal over a time window of interest) can be recovered from a small random sample of its Fourier coefficients provided that x is sufficiently sparse. Formally, suppose that our signal x has at most s nonzero amplitudes at completely unknown locations—such a signal is called s-sparse—and that we are given the value of its discrete Fourier transform (DFT) at m frequencies selected uniformly at random (we think of m as being much smaller than n). Then [12] showed that one can recover x by solving an optimization problem which simply finds, among all candidate signals, that with the minimum ℓ1 norm; the number of samples we need must be on the order of s log n. In other words, if we think of s as a measure of the information content, we can sample nonadaptively nearly at the information rate without information loss. By swapping time and frequency, this also says that signals occupying a very large bandwidth but with a sparse spectrum can be sampled (at random time locations) at a rate far below the Shannon-Nyquist rate.

Despite considerable progress in the field, some important questions are still open. We discuss two that have both a theoretical and practical appeal.

Is it possible to faithfully recover a nearly sparse signal x ∈ R^n, one which is well approximated by its s largest entries, from about s log n of its Fourier coefficients? Is it still possible when these coefficients are further corrupted by noise?

These issues are paramount since in real-world applications, signals are never exactly sparse, and measurements are never perfect either. Now the traditional way of addressing these types of problems in the field is by means of the restricted isometry property (RIP) [14]. The trouble here is that it is unknown whether or not this property holds when the sample size m is on the order of s log n. In fact, answering this one way or the other is generally regarded as extremely difficult, and so the restricted isometry machinery does not directly apply in this setting.

This paper proves that the two questions formulated above have positive answers. In fact, we introduce recovery results which are—up to a logarithmic factor—as good as those one would get if the restricted isometry property were known to be true. To fix ideas, suppose we observe m noisy discrete Fourier coefficients about an s-sparse signal x,
\[
\tilde y_k = \sum_{t=0}^{n-1} e^{-i 2\pi \omega_k t}\, x[t] + \sigma z_k, \qquad k = 1, \ldots, m. \tag{1.1}
\]
Here, the frequencies ω_k are chosen uniformly at random in {0, 1/n, 2/n, ..., (n−1)/n} and z_k is white noise with unit variance. Then if the number of samples m is on the order of s log n, it is possible to get an estimate x̂ obeying
\[
\|\hat x - x\|_2^2 = \mathrm{polylog}(n)\, \frac{s}{m}\, \sigma^2 \tag{1.2}
\]
by solving a convex ℓ1-minimization program. (Note that when the noise vanishes, the recovery is exact.) Up to the logarithmic factor, which may sometimes be on the order of log n and at most a small power of this quantity, this is optimal. Now if the RIP held, one would get a squared error bounded by O(log n) (s/m) σ² [16, 5] and, therefore, the 'RIPless' theory developed in this paper roughly enjoys the same performance guarantees.

1.2 A general theory

The estimate we have just seen is not isolated and the real purpose of this paper is to develop a theory of compressive sensing which is both as simple and as general as possible.

At the heart of compressive sensing is the idea that randomness can be used as an effective sensing mechanism. We note that random measurements are not only crucial in the derivation of many theoretical results, but also generally seem to give better empirical results as well. Therefore, we propose a mechanism whereby sensing vectors are independently sampled from a population F. Mathematically, we observe
\[
\tilde y_k = \langle a_k, x\rangle + \sigma z_k, \qquad k = 1,\ldots,m, \tag{1.3}
\]
where x ∈ R^n, {z_k} is a noise sequence, and the sensing vectors a_k are i.i.d. samples from F (note that a_k is a "column vector"). For example, if F is the family of complex sinusoids, this is the Fourier sampling model introduced earlier. All we require from F is an isotropy property and an incoherence property.

Isotropy property: We say that F obeys the isotropy property if
\[
\mathbb{E}\, a a^* = I, \qquad a \sim F. \tag{1.4}
\]
If F has mean zero (we do not require this), then E aa* is the covariance matrix of F. In other words, the isotropy condition states that the components of a ~ F have unit variance and are uncorrelated. This assumption may be weakened a little, as we shall see later.

Incoherence property: We may take the coherence parameter μ(F) to be the smallest number such that with a = (a[1], ..., a[n]) ~ F,
\[
\max_{1 \le t \le n} |a[t]|^2 \le \mu(F) \tag{1.5}
\]
holds either deterministically or stochastically in the sense discussed below. The smaller μ(F), i.e. the more incoherent the sensing vectors, the fewer samples we need for accurate recovery. When a simple deterministic bound is not available, one can take the smallest scalar μ obeying
\[
\mathbb{E}\big[ n^{-1}\|a\|_2^2\, \mathbf{1}_{E^c} \big] \le \tfrac{1}{20}\, n^{-3/2} \quad \text{and} \quad \mathbb{P}(E^c) \le (nm)^{-1}, \tag{1.6}
\]
where E is the event {max_{1≤t≤n} |a[t]|² ≤ μ}.

Suppose for instance that the components are i.i.d. N(0, 1). Then a simple calculation we shall not detail shows that
\[
\mathbb{E}\big[ n^{-1}\|a\|_2^2\, \mathbf{1}_{E^c} \big] \le 2n\, \mathbb{P}(Z > \sqrt{\mu}) + 2\sqrt{\mu}\, \phi(\sqrt{\mu}), \qquad \mathbb{P}(E^c) \le 2n\, \mathbb{P}(Z \ge \sqrt{\mu}), \tag{1.7}
\]
where Z is standard normal and φ is its density function. The inequality P(Z > t) ≤ φ(t)/t shows that one can take μ(F) ≤ 6 log n as long as n ≥ 16 and m ≤ n. More generally, if the components of a are i.i.d. samples from a sub-Gaussian distribution, μ(F) is at most a constant times log n. If they are i.i.d. from a sub-exponential distribution, μ(F) is at most a constant times log² n. In what follows, however, it might be convenient for the reader to assume that the deterministic bound (1.5) holds.
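As a quick numerical check of the Gaussian case above (an illustration, not part of the paper's argument), one can verify that max_t |a[t]|² rarely exceeds 6 log n:

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 1024, 2000
mu = 6 * np.log(n)          # candidate coherence bound for i.i.d. N(0,1) entries

a = rng.standard_normal((trials, n))
exceed = np.mean(np.max(a**2, axis=1) > mu)
print(f"fraction of draws with max_t a[t]^2 > 6 log n: {exceed:.4f}")
```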

It follows from the isotropy property that E|a[t]|² = 1, and thus μ(F) ≥ 1. This lower bound is achievable by several distributions and one such example is obtained by sampling a row from the DFT matrix as before, so that
\[
a[t] = e^{i 2\pi k t / n},
\]
where k is chosen uniformly at random in {0, 1, ..., n−1}. Then another simple calculation shows that E aa* = I and μ(F) = 1 since |a[t]|² = 1 for all t. At the other extreme, suppose the measurement process reveals one entry of x selected uniformly at random so that a = √n e_i, where i is uniform in {1, ..., n}; the normalization ensures that E aa* = I. This is a lousy acquisition protocol because one would need to sample on the order of n log n times to recover even a 1-sparse vector (the logarithmic term comes from the coupon collector effect). Not surprisingly, this distribution is in fact highly coherent as μ(F) = n.

With the assumptions set, we now give a representative result of this paper: suppose x is an arbitrary but fixed s-sparse vector and that one collects information about this signal by means of the random sensing mechanism (1.3), where z is white noise. Then if the number of samples is on the order of μ(F) s log n, one can invoke ℓ1 minimization to get an estimator x̂ obeying
\[
\|\hat x - x\|_2^2 \le \mathrm{polylog}(n)\, \frac{s}{m}\, \sigma^2.
\]
This bound is sharp. It is not possible to substantially reduce the number of measurements and get a similar bound, no matter how intractable the recovery method might be. Further, with this many measurements, the upper bound is optimal up to logarithmic factors. Finally, we will see that when the signal is not exactly sparse, we just need to add an approximation error to the upper bound.

To summarize, this paper proves that one can faithfully recover approximately s-sparse signals from about s log n random incoherent measurements for which μ(F) = O(1).

1.3 Examples of incoherent measurements

We have seen through examples that sensing vectors with low coherence are global or spread out. Incoherence alone, however, is not a sufficient condition: if F were a constant distribution (sampling from F would always return the same vector), one would not learn anything new about the signal by taking more samples regardless of the level of incoherence. However, as we will see, the incoherence and isotropy properties together guarantee that sparse vectors lie away from the nullspace of the sensing matrix whose rows are the a_k*'s.

The role of the isotropy condition is to keep the measurement matrix from being rank deficient when sufficiently many measurements are taken (and similarly for subsets of columns of A). Specifically, one would hope to be able to recover any signal from an arbitrarily large number of measurements. However, if E aa* were rank deficient, there would be signals x ∈ R^n that would not be recoverable from an arbitrary number of samples; just take x ≠ 0 in the nullspace of E aa*. The nonnegative random variable x* aa* x has vanishing expectation, which implies a*x = 0 almost surely. (Put differently, all of the measurements would be zero almost surely.) In contrast, the isotropy condition implies that (1/m) ∑_{k=1}^m a_k a_k^* → I almost surely as m → ∞ and, therefore, with enough measurements, the sensing matrix is well conditioned and has a left-inverse.¹

We now provide examples of incoherent and isotropic measurements.

¹One could require 'near isotropy,' i.e. E aa* ≈ I. If the approximation were tight enough, our theoretical results would still follow with minimal changes to the proof.

Sensing vectors with independent components. Suppose the components of a ~ F are independently distributed with mean zero and unit variance. Then F is isotropic. In addition, if the distribution of each component is light-tailed, then the measurements are clearly incoherent.

A special case concerns the case where a ~ N(0, I), also known in the field as the Gaussian measurement ensemble, which is perhaps the most commonly studied. Here, one can take μ(F) = 6 log n as seen before.

Another special case is the binary measurement ensemble where the entries of a are symmetric Bernoulli variables taking on the values ±1. A shifted version of this distribution is the sensing mechanism underlying the single pixel camera [23].

Subsampled orthogonal transforms: Suppose we have an orthogonal matrix obeying U*U = n I. Then consider the sampling mechanism picking rows of U uniformly and independently at random. In the case where U is the DFT, this is the random frequency model introduced earlier. Clearly, this distribution is isotropic and μ(F) = max_{ij} |U_{ij}|². In the case where U is a Hadamard matrix, or a complex Fourier matrix, μ(F) = 1.
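A minimal sketch of this sampling mechanism for the DFT (scaled so that U*U = nI); the sizes are hypothetical and the snippet only illustrates the ensemble, not any recovery guarantee.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 256, 64

# U with U* U = n I: the unnormalized DFT matrix
U = np.exp(-2j * np.pi * np.outer(np.arange(n), np.arange(n)) / n)

rows = rng.integers(0, n, size=m)       # uniform, independent row picks
A = U[rows] / np.sqrt(m)                # normalization of Section 1.4

# coherence of this ensemble: max_{ij} |U_ij|^2 = 1 for the DFT
print(np.max(np.abs(U)**2))
```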

Random convolutions: Consider the circular convolution model y = Gx in which
\[
G = \begin{bmatrix}
g[0] & g[1] & g[2] & \cdots & g[n-1] \\
g[n-1] & g[0] & g[1] & \cdots & \\
\vdots & & \ddots & & \vdots \\
g[1] & \cdots & & g[n-1] & g[0]
\end{bmatrix}.
\]
Because a convolution is diagonal in the Fourier domain (we just multiply the Fourier components of x with those of g), G is an isometry if the Fourier components of g = (g[0], ..., g[n−1]) have the same magnitude. In this case, sampling a convolution product at randomly selected time locations is an isotropic and incoherent process provided g is spread out (μ(F) = max_t |g(t)|²). This example extends to higher dimensions; e.g. to spatial 3D convolutions. In fact, certain random convolutions may be used universally to efficiently sample a signal that is sparsely represented in any orthonormal basis while still giving strong theoretical guarantees for signal recovery [43]—an appealing feature due to the fast application of a convolution matrix. For a quick comparison, the theory in [43] demonstrates that about s log n measurements are sufficient for noiseless recovery and about s log⁵ n are sufficient for stable recovery. The theory in this paper may also be applied to the random convolution described in [43], demonstrating that about s log² n measurements are sufficient for stable recovery.
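The following sketch builds such a measurement process with the FFT: choose g whose Fourier components all have unit magnitude and random phases, convolve, and sample at random time locations. The √n scaling is an assumption made here so that the sampled rows are isotropic; all parameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 512, 128

# g with unit-magnitude Fourier components => circular convolution is an isometry
phases = np.exp(1j * 2 * np.pi * rng.random(n))
g = np.fft.ifft(phases)

x = rng.standard_normal(n)

# circular convolution G x computed in the Fourier domain
Gx = np.fft.ifft(np.fft.fft(g) * np.fft.fft(x))

# sample the convolution product at m random time locations
locs = rng.integers(0, n, size=m)
y = np.sqrt(n) * Gx[locs]    # sqrt(n) scaling so that E aa* = I (a sketch)
```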

Subsampled tight or continuous frames: We can generalize the example above by subsampling a tight frame or even a continuous frame. An important example might be the Fourier transform with a continuous frequency spectrum. Here,
\[
a(t) = e^{i 2\pi \omega t},
\]
where ω is chosen uniformly at random in [0, 1] (instead of being on an equispaced lattice as before). This distribution is isotropic and obeys μ(F) = 1. A situation where this arises is in magnetic resonance imaging (MRI) as frequency samples rarely fall on an equispaced Nyquist grid. By swapping time and frequency, this is equivalent to sampling a nearly sparse trigonometric polynomial at randomly selected time points in the unit interval [38]. More generally, as described in [41, Chapter 4], this model applies to random sampling of signals that have a sparse expansion in a bounded orthonormal system of functions.

These examples could of course be multiplied, and we hope we have made clear that our framework is general and encompasses many of the measurement models commonly discussed in compressive sensing—and perhaps many more.

Our setup simply consists in selecting each measurement vector independently from the others. For completeness, however, we describe other measurement ensembles discussed in the compressive sensing literature that do not assume this model. Whether the theory in this paper can be extended to these models is open to further research. A reversal of the random convolution example described above is convolution with a random vector at fixed locations (rather than convolution with a fixed vector at random locations). In fact, such a measurement mechanism has been studied in detail [40, 28, 41, 42], with strong theoretical results. Similarly to the theory in this paper, near optimal theoretical guarantees on sparse-signal recovery are non-uniform [40, 41], whereas uniform recovery through RIP machinery has been quite suboptimal. Similar results are available for random Gabor matrices [34].

1.4 Matrix notation

Before continuing, we pause to introduce some useful matrix notation. Divide both sides of (1.3) by √m, and rewrite our statistical model as
\[
y = A x + \sigma_m z; \tag{1.8}
\]
the kth entry of y is ỹ_k divided by √m, the kth row of A is a_k^* divided by √m, and σ_m is σ divided by √m. This normalization implies that the columns of A are approximately unit-normed, and is the one most used in the compressive sensing literature.
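In code, this normalization amounts to dividing the raw data and the stacked sensing vectors by √m. A sketch with the Gaussian ensemble standing in for F (all sizes hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, sigma = 256, 80, 0.1

x = np.zeros(n); x[rng.choice(n, 8, replace=False)] = 1.0

a = rng.standard_normal((m, n))                   # rows a_k* sampled i.i.d. from F
y_raw = a @ x + sigma * rng.standard_normal(m)    # model (1.3)

A = a / np.sqrt(m)                    # k-th row of A is a_k*/sqrt(m)
y = y_raw / np.sqrt(m)                # k-th entry of y is the raw y_k / sqrt(m)
sigma_m = sigma / np.sqrt(m)          # noise level in model (1.8)
```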

1.5 Incoherent sampling theorem

For pedagogical reasons, we introduce our results by first presenting a recovery result from noiseless data. The recovered signal is obtained by the standard ℓ1-minimization program
\[
\min_{\bar x \in \mathbb{R}^n} \|\bar x\|_1 \quad \text{subject to} \quad A\bar x = y. \tag{1.9}
\]
(Recall that the rows of A are normalized independent samples from F.)

Theorem 1.1 (Noiseless incoherent sampling) Let x be a fixed but otherwise arbitrary s-sparse vector in R^n and pick any scalar β > 0. Then with probability at least 1 − 5/n − e^{−β}, x is the unique minimizer to (1.9) with y = Ax provided that
\[
m \ge C_\beta \cdot \mu(F) \cdot s \cdot \log n.
\]
More precisely, C_β may be chosen as C_0(1 + β) for some positive numerical constant C_0.

Among other things, this theorem states that one can perfectly recover an arbitrary sparse signal from about s log n convolution samples, or a signal that happens to be sparse in the wavelet domain from about s log n randomly selected noiselet coefficients. It extends an earlier result [11], which assumed a subsampled orthogonal model, and strengthens it since that reference could only prove the claim for randomly signed vectors x. Here, x is arbitrary, and we do not make any distributional assumption about its support or its sign pattern.
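For intuition, (1.9) is a linear program once x̄ is split into positive and negative parts. The sketch below uses scipy.optimize.linprog with a real Gaussian sensing matrix (an assumption made to keep the example real-valued); it only illustrates the recovery step, not the theorem's proof.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(5)
n, m, s = 128, 60, 5

A = rng.standard_normal((m, n)) / np.sqrt(m)   # Gaussian ensemble, normalized
x = np.zeros(n); x[rng.choice(n, s, replace=False)] = rng.standard_normal(s)
y = A @ x                                       # noiseless data

# min ||xbar||_1  s.t.  A xbar = y, written as an LP in (u, v) with xbar = u - v
c = np.ones(2 * n)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
x_hat = res.x[:n] - res.x[n:]
print("recovery error:", np.linalg.norm(x_hat - x))
```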

This theorem is also about a fundamental information theoretic limit: the number of samples for perfect recovery has to be on the order of μ(F)·s·log n, and cannot possibly be much below this number. More precisely, suppose we are given a distribution F with coherence parameter μ(F). Then there exist s-sparse vectors that cannot be recovered with probability at least 1 − 1/n, say, from fewer than a constant times μ(F)·s·log n samples. When μ(F) = 1, this has already been established since [12] proves that some s-sparse signals cannot be recovered from fewer than a constant times s·log n random DFT samples. Our general claim follows from a modification of the argument in [12]. Assume, without loss of generality, that μ(F) is an integer and consider the isotropic process that samples rows from an n×n block diagonal matrix, each block being a DFT of a smaller size; that is, of size n/ℓ where μ(F) = ℓ. Then if m ≤ c_0·μ(F)·s·log n, one can construct s-sparse signals just as in [12] for which Ax = 0 with probability at least 1/n. We omit the details.

The important aspect, here, is the role played by the coherence parameter μ(F). In general, the minimal number of samples must be on the order of the coherence times the sparsity level s times a logarithmic factor. Put differently, the coherence completely determines the minimal sampling rate.

1.6 Main results

We assume for simplicity that we are undersampling so that m ≤ n. Our general result deals with 1) arbitrary signals which are not necessarily sparse (images are never exactly sparse even in a transformed domain) and 2) noise. To recover x from the data y and the model (1.8), we consider the unconstrained LASSO which solves the ℓ1-regularized least-squares problem
\[
\min_{\bar x \in \mathbb{R}^n} \tfrac12 \|A\bar x - y\|_2^2 + \lambda \sigma_m \|\bar x\|_1. \tag{1.10}
\]
We assume that z is Gaussian, z ~ N(0, I). However, the theorem below remains valid as long as ‖A^*z‖_∞ ≤ λ_n for some λ_n ≥ 0, and thus many other noise models would work as well. In what follows, x_s is the best s-sparse approximation of x or, equivalently, the vector consisting of the s largest entries of x in magnitude.

Theorem 1.2 Let x be an arbitrary fixed vector in R^n and pick any scalar β > 0. Then with probability at least 1 − 6/n − 6e^{−β}, the solution x̂ to (1.10) with λ = 10√(log n) obeys
\[
\|\hat x - x\|_2 \le \min_{1 \le s \le \bar s} C(1+\alpha) \left[ \frac{\|x - x_s\|_1}{\sqrt s} + \sigma \sqrt{\frac{s \log n}{m}} \right] \tag{1.11}
\]
provided that m ≥ C_β·μ(F)·s̄·log n. If one measures the error in the ℓ1 norm, then
\[
\|\hat x - x\|_1 \le \min_{1 \le s \le \bar s} C(1+\alpha) \left[ \|x - x_s\|_1 + s\sigma \sqrt{\frac{\log n}{m}} \right]. \tag{1.12}
\]
Above, C is a numerical constant, C_β can be chosen as before, and
\[
\alpha = \sqrt{\frac{(1+\beta)\, s\, \mu \log n \log m \log^2 s}{m}},
\]
which is never greater than log^{3/2} n in this setup.

Above, s̄ may be interpreted as an upper bound on allowable sparsity levels s that still lead to stable reconstruction (according to our proofs). We may take s̄ = m/(C_β·μ(F)·log n).
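As an illustration of how (1.10) might be solved numerically, here is a small iterative soft-thresholding (ISTA) sketch; it is a standard solver for this objective, not an algorithm analyzed in the paper, and the problem sizes are hypothetical.

```python
import numpy as np

def lasso_ista(A, y, lam, n_iter=5000):
    """Minimize 0.5*||A x - y||_2^2 + lam*||x||_1 by iterative soft-thresholding."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)
        z = x - step * grad
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft threshold
    return x

# hypothetical example; the penalty is lambda*sigma_m as in (1.10)
rng = np.random.default_rng(6)
n, m, sigma = 256, 100, 0.05
A = rng.standard_normal((m, n)) / np.sqrt(m)
x0 = np.zeros(n); x0[rng.choice(n, 8, replace=False)] = 1.0
sigma_m = sigma / np.sqrt(m)
y = A @ x0 + sigma_m * rng.standard_normal(m)
lam = 10 * np.sqrt(np.log(n)) * sigma_m
x_hat = lasso_ista(A, y, lam)
print("estimation error:", np.linalg.norm(x_hat - x0))
```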

These robust error bounds do not require either (1) a random model on the signal or (2) the RIP nor one of a few closely related strong conditions such as the RIP-1 [25], the restricted eigenvalue assumption [5] or the compatibility condition [49]—all conditions that would imply uniform recovery. In contrast, our arguments work through a new machinery which guarantees fixed sparse signal recovery with high probability. It is currently unknown whether our conditions are strong enough to imply uniform recovery. Further, the error bound is within at most a log^{3/2} n factor of what has been established using the RIP since a variation on the arguments in [16] would give an error bound proportional to the quantity inside the square brackets in (1.11). As a consequence, the error bound is within a polylogarithmic factor of what is achievable with the help of an oracle that would reveal the locations of the significant coordinates of the unknown signal [16]. In other words, it cannot be substantially improved.

Because much of the compressive sensing literature works with restricted isometry conditions—we shall discuss exceptions such as [22, 4] in Section 1.7—we pause here to discuss these conditions and to compare them to our own. We say that an m×n matrix A obeys the RIP with parameters s and δ if
\[
(1-\delta)\|v\|_2^2 \le \|Av\|_2^2 \le (1+\delta)\|v\|_2^2 \tag{1.13}
\]
for all s-sparse vectors v. In other words, all the submatrices of A with at most s columns are well conditioned. When the RIP holds with parameters 2s and δ < 0.414... [8] or even δ ≤ 0.465... [24], it is known that the error bound (1.11) holds (without the factor (1+α)). This δ is sometimes referred to as the restricted isometry constant.

Bounds on the restricted isometry constant have been established in [15] and in [46] for partial DFT matrices, and by extension, for partial subsampled orthogonal transforms. For instance, [46] proves that if A is a properly normalized partial DFT matrix, or a unitary matrix with bounded entries, then the RIP with δ = 1/4 holds with high probability if m ≥ C·s log⁴ n (C is some positive constant). Rauhut [41, Chapter 4, Chapter 8] extended these results to measurement matrices consisting of sampled bounded orthogonal systems. While such measurements are a special case of those considered in this paper, the theory in [41, Chapter 8] extends to the measurement model considered here. Small adjustments need to be made in the case of stochastic incoherence, but the end result is that the RIP holds with high probability when m ≥ C·μ(F)·s·log² s·log n·log m, i.e., about sμ log⁴ n measurements are sufficient. Thus, our result bridges the gap between the region where the RIP is known to hold and the region in which one has the minimum number of measurements needed to prove perfect recovery of exactly sparse signals from noisy data, which is on the order of μ(F)·s log n. The gap is bridged smoothly in the sense that our error bounds match those available with the RIP as soon as the RIP is known to hold, and grow slightly larger when there are fewer measurements.

The careful reader will no doubt remark that for very specific models such as the Gaussian measurement ensemble, it is known that on the order of s log(n/s) samples are sufficient for stable recovery while our result asserts that on the order of s log² n are sufficient (and s log n for the binary measurement ensemble). This slight loss is a small price to pay for a very simple general theory, which accommodates a wide array of sensing strategies. Having said this, the reader will also verify that specializing our proofs below gives an optimal result for the Gaussian ensemble; i.e. establishes a near-optimal error bound from about s log n observations (the log(n/s) factor, however, may still require a different approach).

Finally, another frequently discussed algorithm for sparse regression is the Dantzig selector [16]. Here, the estimator is given by the solution to the linear program
\[
\min_{\bar x \in \mathbb{R}^n} \|\bar x\|_1 \quad \text{subject to} \quad \|A^*(A\bar x - y)\|_\infty \le \lambda \sigma_m. \tag{1.14}
\]
We show that the Dantzig selector obeys nearly the same error bound.
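Like basis pursuit, (1.14) is a linear program; the sketch below shows one possible reformulation (x̄ = u − v with u, v ≥ 0), meant only as an illustration of how the estimator can be computed.

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(A, y, lam_sigma):
    """min ||x||_1  s.t.  ||A^T (A x - y)||_inf <= lam_sigma, as an LP in (u, v)."""
    m, n = A.shape
    M = A.T @ A
    b = A.T @ y
    c = np.ones(2 * n)
    # |M(u - v) - b| <= lam_sigma, split into two one-sided constraints
    A_ub = np.vstack([np.hstack([M, -M]), np.hstack([-M, M])])
    b_ub = np.concatenate([b + lam_sigma, -b + lam_sigma])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    return res.x[:n] - res.x[n:]
```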

Theorem 1.3 The Dantzig selector, with λ = 10√(log n) and everything else the same as in Theorem 1.2, obeys
\[
\|\hat x - x\|_2 \le \min_{s \le \bar s} C(1+\alpha^2) \left[ \frac{\|x - x_s\|_1}{\sqrt s} + \sigma \sqrt{\frac{s \log n}{m}} \right] \tag{1.15}
\]
\[
\|\hat x - x\|_1 \le \min_{s \le \bar s} C(1+\alpha^2) \left[ \|x - x_s\|_1 + s\sigma \sqrt{\frac{\log n}{m}} \right] \tag{1.16}
\]
with the same probabilities as before.

The only difference is α² instead of α in the right-hand sides.

1.7 Our contribution

The main contribution of this paper is to provide a simple framework which applies to all the standard compressive sensing models and some new ones as well. The results in this paper also reduce the minimal number of measurements theoretically required in some standard sensing models such as Fourier measurements or, more generally, sensing matrices obtained by sampling a few rows from an orthogonal matrix. We establish that the restricted isometry property is not necessarily needed to accurately recover fixed nearly sparse vectors from noisy compressive samples. This may be of interest because in many cases RIP-based results have needed strong requirements on the relationship between the number of measurements and the sparsity level. Thus our work is a significant departure from the majority of the literature, which establishes good noisy recovery properties via the RIP machinery. This literature is, of course, extremely large and we cannot cite all contributions, but a partial list would include [15, 21, 46, 13, 16, 5, 19, 50, 51, 3, 37, 29, 7, 2].

The reason why one can get strong error bounds, which are within a polylogarithmic factor of what is available with the aid of an 'oracle,' without the RIP is that our results do not imply universality. That is, we are not claiming that if A is randomly sampled and then fixed once and for all, then the error bounds from Section 1.6 hold for all signals x. What we are saying is that if we are given an arbitrary x, and then collect data by applying our random scheme, then the recovery of this x will be accurate.

If one wishes to establish universal results holding for all x simultaneously, then we would need the RIP or a property very close to it. As a consequence, we cannot possibly be in this setup and guarantee universality since we are not willing to assume that the RIP holds. To be sure, suppose we had available an oracle informing us about the support T of x. Then we would need the pseudo-inverse of the submatrix with columns in T to be bounded. In other words, the minimum singular value of this submatrix would have to be away from zero. For a universal result, this would need to be true for all subsets of cardinality s; that is, the minimum singular value of all submatrices with s columns would have to be away from zero. This essentially is the restricted isometry property.

To the best of our knowledge, only a few other papers have addressed non-universal stability (the literature grows so rapidly that an inadvertent omission is entirely possible). In an earlier work [9], the authors also considered weak conditions that allow stable recovery; in this case the authors assumed that the signal was sampled according to a random model, but in return the measurement matrix A could be deterministic. In the asymptotic case, stable signal recovery has been demonstrated for the Gaussian measurement ensemble in a regime in which the RIP does not necessarily hold [22, 4]. In fact, the authors of [22, 4] are able to give exact limits on the error rather than error bounds. Aside from these papers and the work in progress [10], it seems that the literature regarding stable recovery with conditions weak enough that they do not imply universality is extremely sparse.

Lastly, we would like to point to two important papers in the literature which together give a strong result for stable RIPless recovery with subsampled discrete Fourier transforms [27, 52]. The authors of [27] give a useful equivalence between the ℓ1 and ℓ2 norms of any vector which is the sum of a subset of the columns of a discrete Fourier transform. These results have important implications for Gelfand widths, and also may be used to characterize the null space of the DFT. This theory may be combined with null-space conditions of [52] to imply stable recovery of sparse signals by the (constrained) LASSO; the required number of measurements by this theory is
\[
m \ge O(1)\cdot s \log m \log^5\!\big[(n/m)\log m\big].
\]
In the case when m ≈ n, this leads to the necessity of only about s log m log⁵ log m measurements, which nearly matches the requirement in this paper of about s log m measurements. However, the error bounds are quite different from those presented in our paper. See [35, Section 2.2.3] for more details. Finally and to be complete, we would like to mention that earlier works have considered the recovery of perfectly sparse signals from subsampled orthogonal transforms [11], and of sparse trigonometric polynomials from random time samples [38].

1.8 Organization of the paper

The paper is organized as follows. In Section 2, we introduce several fundamental estimates which our arguments rely upon, but which also could be useful tools for other results in the field. In Section 3, we prove the noiseless recovery result, namely, Theorem 1.1. In Section 4, we prove our main results, Theorems 1.2 and 1.3. Now all these sections assume for simplicity of exposition that the coherence bound (1.5) holds deterministically. We extend the proof to distributions obeying the coherence property in the stochastic sense (1.6) in the Appendix. This Appendix also contains another important technical piece, namely, a difficult proof of an intermediate result (the weak RIP property). Finally, we conclude the main text with some final comments in Section 5.

1.9 Notation

We provide a brief summary of the notations used throughout the paper. For an m×n matrix A and a subset T ⊂ {1, ..., n}, A_T denotes the m×|T| matrix with column indices in T. Also, A_{\{i\}} is the i-th column of A. Likewise, for a vector v ∈ R^n, v_T is the restriction of v to indices in T. Thus, if v is supported on T, Av = A_T v_T. In particular, a_{k,T} is the vector a_k restricted to T. The operator norm of a matrix A is denoted ‖A‖; also, ‖A‖_{1,2} is the maximum column norm of A. The identity matrix, in any dimension, is denoted I. Further, e_i always refers to the i-th standard basis element, e.g., e_1 = (1, 0, ..., 0). For a scalar t, sgn(t) is the sign of t if t ≠ 0 and is zero otherwise.

For a vector x, sgn(x) applies the sign function componentwise. We shall also use μ as a shorthand for μ(F) whenever convenient. Throughout, C is a constant whose value may change from instance to instance.

2 Fundamental Estimates

Our proofs rely on several estimates, and we provide an interpretation of each whenever possible. The first estimates E1–E4 are used to prove the noiseless recovery result; when combined with the weak RIP, they imply stability and robustness. Throughout this section, δ is a parameter left to be fixed in later sections; it is always less than or equal to one.

2.1 Local isometry

Let T of cardinality s be the support of x in Theorem 1.1, or the support of the best s-sparse approximation of x in Theorem 1.2. We shall need that with high probability,
\[
\|A_T^* A_T - I\| \le \delta \tag{2.1}
\]
with δ ≤ 1/2 in the proof of Theorem 1.1 and δ ≤ 1/4 in that of Theorem 1.2. Put differently, the singular values of A_T must lie away from zero. This condition essentially prevents A_T from being singular as, otherwise, there would be no hope of recovering our sparse signal x. Indeed, letting h be any vector supported on T and in the null space of A, we would have Ax = A(x + h) and thus, recovery would be impossible even if one knew the support of x. The condition (2.1) is much weaker than the restricted isometry property because it does not need to hold uniformly over all sparse subsets—only on the support set.

Lemma 2.1 (E1: local isometry) Let T be a fixed set of cardinality s. Then for δ > 0,
\[
\mathbb{P}\big( \|A_T^* A_T - I\| \ge \delta \big) \le 2s \exp\left( -\frac{m}{\mu(F)\, s} \cdot \frac{\delta^2}{2(1+\delta/3)} \right). \tag{2.2}
\]
In particular, if m ≥ (56/3) μ(F)·s·log n, then
\[
\mathbb{P}\big( \|A_T^* A_T - I\| \ge 1/2 \big) \le 2/n.
\]

Note that ‖A_T^* A_T − I‖ ≤ δ implies that ‖(A_T^* A_T)^{-1}‖ ≤ 1/(1−δ), a fact that we will use several times.

In compressive sensing, the classical way of proving such estimates is via Rudelson's selection theorem [44]. Here, we use a more modern technique based on the matrix Bernstein inequality of Ahlswede and Winter [1], developed for this setting by Gross [26], and tightened in [48] by Tropp and in [32] by Oliveira. We present the version in [48].

Theorem 2.2 (Matrix Bernstein inequality) Let {X_k} ∈ R^{d×d} be a finite sequence of independent random self-adjoint matrices. Suppose that E X_k = 0 and ‖X_k‖ ≤ B a.s. and put
\[
\sigma^2 := \Big\| \sum_k \mathbb{E}\, X_k^2 \Big\|.
\]
Then for all t ≥ 0,
\[
\mathbb{P}\Big( \Big\| \sum_k X_k \Big\| \ge t \Big) \le 2d \exp\left( \frac{-t^2/2}{\sigma^2 + Bt/3} \right). \tag{2.3}
\]

Proof [Estimate E1] Decompose A_T^* A_T − I as
\[
A_T^* A_T - I = m^{-1} \sum_{k=1}^m \big( a_{k,T} a_{k,T}^* - I \big) = m^{-1} \sum_{k=1}^m X_k, \qquad X_k := a_{k,T} a_{k,T}^* - I.
\]
The isotropy condition implies E X_k = 0, and since ‖a_T‖_2² ≤ μ(F)·s, we have ‖X_k‖ = max(‖a_{k,T}‖_2² − 1, 1) ≤ μ(F)·s. Lastly,
\[
0 \preceq \mathbb{E}\, X_k^2 = \mathbb{E}\,(a_{k,T} a_{k,T}^*)^2 - I \preceq \mathbb{E}\,(a_{k,T} a_{k,T}^*)^2 = \mathbb{E}\, \|a_{k,T}\|_2^2\, a_{k,T} a_{k,T}^*.
\]
However,
\[
\mathbb{E}\, \|a_{k,T}\|_2^2\, a_{k,T} a_{k,T}^* \preceq \mu(F)\cdot s \cdot \mathbb{E}\, a_{k,T} a_{k,T}^* = \mu(F)\cdot s\cdot I
\]
and, therefore, ∑_k E X_k² ⪯ m·μ(F)·s·I so that σ² is bounded above by m·μ(F)·s. Plugging t = δm into (2.3) gives the lemma.
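Estimate E1 is easy to probe numerically. A small sketch (Gaussian ensemble, fixed support T, hypothetical sizes) shows the operator norm ‖A_T^* A_T − I‖ shrinking as m grows:

```python
import numpy as np

rng = np.random.default_rng(7)
n, s = 1000, 10
T = rng.choice(n, size=s, replace=False)   # fixed support of cardinality s

for m in (50, 200, 800, 3200):
    a = rng.standard_normal((m, n))        # rows a_k* ~ F (Gaussian ensemble)
    A_T = a[:, T] / np.sqrt(m)
    dev = np.linalg.norm(A_T.T @ A_T - np.eye(s), 2)
    print(f"m = {m:5d}   ||A_T* A_T - I|| = {dev:.3f}")
```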

Instead of having A act as a near isometry on all vectors supported on T, we could ask that it preserves the norm of an arbitrary fixed vector (with high probability), i.e. ‖Av‖_2 ≈ ‖v‖_2 for a fixed v supported on T. Not surprisingly, this can be proved with generally (slightly) weaker requirements.

Lemma 2.3 (E2: low-distortion) Let w be a fixed vector supported on a set T of cardinality at most s. Then for each t ≤ 1/2,
\[
\mathbb{P}\big( \|(A_T^* A_T - I) w_T\|_2 \ge t \|w\|_2 \big) \le \exp\left( -\frac14 \Big( t \sqrt{\frac{m}{\mu(F)\, s}} - 1 \Big)^2 \right).
\]

The estimate follows from the vector Bernstein inequality, proved by Gross [26, Theorem 11]. We use a slightly weaker version, which we find slightly more convenient.

Theorem 2.4 (Vector Bernstein inequality) Let {v_k} ∈ R^d be a finite sequence of independent random vectors. Suppose that E v_k = 0 and ‖v_k‖_2 ≤ B a.s. and put σ² ≥ ∑_k E‖v_k‖_2². Then for all 0 ≤ t ≤ σ²/B,
\[
\mathbb{P}\Big( \Big\| \sum_k v_k \Big\|_2 \ge t \Big) \le \exp\left( -\frac{(t/\sigma - 1)^2}{4} \right) \le \exp\left( -\frac{t^2}{8\sigma^2} + \frac14 \right). \tag{2.4}
\]

Note that the bound does not depend on the dimension d.

Proof [Estimate E2] We have
\[
(A_T^* A_T - I)\, w_T = \frac1m \sum_{k=1}^m \big( a_{k,T} \langle a_{k,T}, w_T\rangle - w_T \big) =: \frac1m \sum_{k=1}^m v_k,
\]
where v_k := a_{k,T} ⟨a_{k,T}, w_T⟩ − w_T. By isotropy, E v_k = 0, and by the Cauchy-Schwarz inequality together with the triangle inequality,
\[
\|v_k\|_2 \le \big( \|a_{k,T}\|_2^2 + 1 \big) \|w_T\|_2 \le (s\mu + 1)\, \|w_T\|_2 =: B.
\]
The second inequality follows from the coherence condition. For the variance σ²,
\[
\mathbb{E}\, \|v_k\|_2^2 = \mathbb{E}\, \|a_{k,T}\|_2^2 \langle a_{k,T}, w_T\rangle^2 - \|w_T\|_2^2 \le \mathbb{E}\, \|a_{k,T}\|_2^2 \langle a_{k,T}, w_T\rangle^2 \le s\mu\, \mathbb{E}\langle a_{k,T}, w_T\rangle^2 = s\mu\, \|w_T\|_2^2.
\]
The last equality follows by the isotropy condition. Thus σ² is bounded by m s μ ‖w_T‖_2². The proof now follows by plugging B and σ² into the vector Bernstein inequality.

2.2 Off-support incoherence

Lemma 2.5 (E3: off-support incoherence) Let v be supported on T with |T| = s. Then for each t > 0,
\[
\mathbb{P}\big( \|A_{T^c}^* A v\|_\infty \ge t \|v\|_2 \big) \le 2n \exp\left( -\frac{m}{2\mu(F)} \cdot \frac{t^2}{1 + \tfrac13 \sqrt{s}\, t} \right). \tag{2.5}
\]
This lemma says that if v = x, then max_{i∈T^c} |⟨A_{\{i\}}, Ax⟩| cannot be too large so that the off-support columns do not correlate too well with Ax. The proof of E3 is an application of Bernstein's inequality—the matrix Bernstein inequality with d = 1—together with the union bound.²

Proof We have ‖A_{T^c}^* A v‖_∞ = max_{i∈T^c} |⟨e_i, A^* A v⟩|. Assume without loss of generality that ‖v‖_2 = 1, fix i ∈ T^c and write
\[
\langle e_i, A^* A v\rangle = \frac1m \sum_k w_k, \qquad w_k := \langle e_i, a_k a_k^* v\rangle.
\]
Since i ∈ T^c, E w_k = 0 by the isotropy property. Next, the Cauchy-Schwarz inequality gives |w_k| = |⟨e_i, a_k⟩| |⟨a_k, v⟩| ≤ |⟨e_i, a_k⟩| ‖a_{k,T}‖_2. Since |⟨e_i, a_k⟩| ≤ √(μ(F)) and ‖a_{k,T}‖_2 ≤ √(μ(F) s), we have |w_k| ≤ μ(F)√s. Lastly, for the total variance, we have
\[
\mathbb{E}\, w_k^2 \le \mu(F)\, \mathbb{E}\langle a_{k,T}, v\rangle^2 = \mu(F),
\]
where the equality follows from the isotropy property. Hence, σ² ≤ m μ(F), and Bernstein's inequality gives
\[
\mathbb{P}\big( |\langle e_i, A^* A v\rangle| \ge t \big) \le 2 \exp\left( -\frac{m}{2\mu(F)} \cdot \frac{t^2}{1 + \tfrac13 \sqrt{s}\, t} \right).
\]
Combine this with the union bound over all i ∈ T^c to give the desired result.

We also require the following related bound:
\[
\max_{i \in T^c} \|A_T^* A_{\{i\}}\|_2 \le \delta.
\]
In other words, none of the column vectors of A outside of the support of x should be well approximated by any vector sharing the support of x.

Lemma 2.6 (E4: uniform off-support incoherence) Let T be a fixed set of cardinality s. For any 0 ≤ t ≤ √s,
\[
\mathbb{P}\Big( \max_{i\in T^c} \|A_T^* A_{\{i\}}\|_2 \ge t \Big) \le n \exp\left( -\frac{m t^2}{8\mu(F)\, s} + \frac14 \right).
\]
In particular, if m ≥ 8μ(F)·s·(2 log n + 1/4), then
\[
\mathbb{P}\Big( \max_{i\in T^c} \|A_T^* A_{\{i\}}\|_2 \ge 1 \Big) \le 1/n.
\]

²To extend this lemma to the complex case, one can split A_{T^c}^* A v into real and imaginary parts, treating each separately with Bernstein's inequality (similarly to [17, Lemma 8.2]).

Proof We apply the vector Bernstein inequality (Theorem 2.4). Fix i ∈ T^c and write
\[
A_T^* A_{\{i\}} = \frac1m \sum_{k=1}^m a_{k,T}\, \langle a_k, e_i\rangle =: \frac1m \sum_{k=1}^m v_k.
\]
As before, E v_k = E a_{k,T}⟨a_k, e_i⟩ = 0 since i ∈ T^c. Also, ‖v_k‖_2 = ‖a_{k,T}‖_2 |⟨a_k, e_i⟩| ≤ μ(F)√s. Lastly, we calculate the sum of expected squared norms,
\[
\sum_{k=1}^m \mathbb{E}\, \|v_k\|_2^2 = m\, \mathbb{E}\, \|v_1\|_2^2 \le m\, \mathbb{E}\big[ \|a_{1,T}\|_2^2\, \langle e_i, a_1\rangle^2 \big] \le m\, \mu(F)\, s\, \mathbb{E}\langle e_i, a_1\rangle^2 = m\, \mu(F)\, s.
\]
As before, the last equality follows from the isotropy property. Bernstein's inequality together with the union bound give the lemma.

2.3 Weak RIP

In the nonsparse and noisy setting, we shall make use of a variation on the restricted isometry property to control the size of the reconstruction error. This variation is as follows:

Theorem 2.7 (E5: weak RIP) Let T be a fixed set of cardinality s, fix δ > 0, let r be a natural number, and suppose that
\[
m \ge C_\delta \cdot \beta \cdot \mu(F) \cdot \max\big( s \log(s\mu),\ r \log n \log^2 r \log(r\mu \log n) \big).
\]
Then with probability at least 1 − 5e^{−β},
\[
(1-\delta)\|v\|_2^2 \le \|Av\|_2^2 \le (1+\delta)\|v\|_2^2 \tag{2.6}
\]
for all v supported on T ∪ R, where R is any set of cardinality |R| ≤ r. Here C_δ is a fixed numerical constant which only depends upon δ.

This theorem is proved in the Appendix using Talagrand's generic chaining³, and combines the framework and results of Rudelson and Vershynin in [46] and [44]. In the proof of Theorem 1.2, we take δ = 1/4.

The condition says that the column space of A_T should not be too close to that spanned by another small disjoint set R of columns. To see why a condition of this nature is necessary for any recovery algorithm, suppose that x has fixed support T and that there is a single column A_{\{i\}} which is a linear combination of columns in T, i.e., A_{T∪\{i\}} is singular. Let h ≠ 0 be supported on T ∪ {i} and in the null space of A. Then Ax = A(x + th) for any scalar t. Clearly, there are some values of t such that x + th is at least as sparse as x, and thus one should not expect to be able to recover x by any method. In general, if there were a vector v as above obeying ‖Av‖_2 ≪ ‖v‖_2, then one would have A_T v_T ≈ −A_R v_R. Thus, if the signal x were the restriction of v to T, it would be very difficult to distinguish it from that of −v to R under the presence of noise.

The weak RIP is a combination of the RIP and the local conditioning estimate E1. When r = 0, this is E1 whereas this is the restricted isometry property when s = 0. The point is that we do not need the RIP to hold for sparsity levels on the order of m/[μ(F) log n]. Instead we need the following property: consider an arbitrary submatrix formed by concatenating columns in T with r other columns from A selected in any way you like; then we would like this submatrix to be well conditioned. Because T is fixed, one can prove good conditioning when s is significantly larger than the maximum sparsity level considered in the standard RIP.

³We use the generic chaining, which provides a way of invoking the majorizing measure theorem without creating a measure. A simpler approach would be to go through Dudley's inequality—but according to our proof the majorizing measures theorem provides a tighter result by a logarithmic factor.

2.4 Implications

The careful reader may ask why we bothered to state estimates E1–E4 since they are all implied by the weak RIP! Our motivation is three-fold: (1) some of these estimates, e.g. E2, hold with better constants and weaker requirements than those implied by the weak RIP machinery; (2) the weak RIP requires an in-depth proof whereas the other estimates are simple applications of well-known theorems, and we believe that these theorems and the estimates should be independently useful tools to other researchers in the field; (3) the noiseless theorem does not require the weak RIP.

3 Noiseless and Sparse Recovery

For simplicity, we state our model and the proofs for the real case, but it is straightforward to generalize our results to complex signals, noise, and measurements, while only changing the constants involved in the proofs. The inner products involved in several of the bounds below would be replaced by their absolute value or their real value. In particular, Lemmas 4.4 and 4.3 would have a slightly different presentation.

This section proves the noiseless recovery theorem, namely, Theorem 1.1. Our proof essentially adapts the arguments of David Gross [26] from the low-rank matrix recovery problem.

3.1 Dual certificates

The standard method for establishing exact recovery is to exhibit a dual certificate; that is to say, a vector v obeying the two properties below.

Lemma 3.1 (Exact duality) Set T = supp(x) with x feasible for (1.9), and assume A_T has full column rank. Suppose there exists v ∈ R^n in the row space of A obeying
\[
v_T = \mathrm{sgn}(x_T) \quad \text{and} \quad \|v_{T^c}\|_\infty < 1. \tag{3.1}
\]
Then x is the unique ℓ1 minimizer to (1.9).

The proof is now standard, see [15]. Roughly, the existence of a dual vector implies that there is a subgradient of the ℓ1 norm at x that is perpendicular to the feasible set. This geometric property shows that x is the solution. Following Gross, we slightly modify this definition so as to make use of an 'inexact dual vector.'

Lemma 3.2 (Inexact duality) Set T = supp(x) where x is feasible, and assume that
\[
\|(A_T^* A_T)^{-1}\| \le 2 \quad \text{and} \quad \max_{i\in T^c} \|A_T^* A_{\{i\}}\|_2 \le 1. \tag{3.2}
\]
Suppose there exists v ∈ R^n in the row space of A obeying
\[
\|v_T - \mathrm{sgn}(x_T)\|_2 \le 1/4 \quad \text{and} \quad \|v_{T^c}\|_\infty \le 1/4. \tag{3.3}
\]
Then x is the unique ℓ1 minimizer to (1.9).

Proof Let x̂ = x + h be a solution to (1.9) and note that Ah = 0 since both x and x̂ are feasible. To prove the claim, it suffices to show that h = 0. We begin by observing that
\[
\|\hat x\|_1 = \|x_T + h_T\|_1 + \|h_{T^c}\|_1 \ge \|x_T\|_1 + \langle \mathrm{sgn}(x_T), h_T\rangle + \|h_{T^c}\|_1.
\]
Letting v = A^* w be our (inexact) dual vector, we have
\[
\langle \mathrm{sgn}(x_T), h_T\rangle = \langle \mathrm{sgn}(x_T) - v_T, h_T\rangle + \langle v_T, h_T\rangle = \langle \mathrm{sgn}(x_T) - v_T, h_T\rangle - \langle v_{T^c}, h_{T^c}\rangle,
\]
where we used ⟨v_T, h_T⟩ = ⟨v, h⟩ − ⟨v_{T^c}, h_{T^c}⟩ = −⟨v_{T^c}, h_{T^c}⟩ since ⟨v, h⟩ = ⟨w, Ah⟩ = 0. The Cauchy-Schwarz inequality together with the properties of v yield
\[
|\langle \mathrm{sgn}(x_T), h_T\rangle| \le \frac14\big( \|h_T\|_2 + \|h_{T^c}\|_1 \big)
\]
and, therefore,
\[
\|\hat x\|_1 \ge \|x\|_1 - \frac14 \|h_T\|_2 + \frac34 \|h_{T^c}\|_1.
\]
We now bound ‖h_T‖_2. First, it follows from
\[
h_T = (A_T^* A_T)^{-1} A_T^* A_T h_T = -(A_T^* A_T)^{-1} A_T^* A_{T^c} h_{T^c}
\]
that ‖h_T‖_2 ≤ 2 ‖A_T^* A_{T^c} h_{T^c}‖_2. Second,
\[
\|A_T^* A_{T^c} h_{T^c}\|_2 \le \sum_{i\in T^c} \|A_T^* A_{\{i\}}\|_2\, |h_i| \le \max_{i\in T^c}\|A_T^* A_{\{i\}}\|_2\, \|h_{T^c}\|_1 \le \|h_{T^c}\|_1.
\]
In conclusion, ‖h_T‖_2 ≤ 2‖h_{T^c}‖_1 and thus,
\[
\|\hat x\|_1 \ge \|x\|_1 + \frac14 \|h_{T^c}\|_1.
\]
This implies h_{T^c} = 0, which in turn implies h_T = 0 since we must have A_T h_T = Ah = 0 (and A_T has full rank).

Lemma 3.3 (Existence of a dual certificate) Under the hypotheses of Theorem 1.1, one can find v ∈ R^n obeying the conditions of Lemma 3.2 with probability at least 1 − e^{−β} − 1/n.

This lemma, which is proved next, implies Theorem 1.1. The reason is that we just need to verify conditions (3.2). However, by Lemmas 2.1 and 2.6, they jointly hold with probability at least 1 − 3/n provided that m ≥ μ·s·(19 log n + 2) (recall that μ is a shorthand for μ(F)).

3.2 Proof of Lemma 3.3

The proof uses the clever golfing scheme introduced in [26]. Partition A into row blocks so that from now on, A_1 are the first m_1 rows of the matrix A, A_2 the next m_2 rows, and so on. The ℓ matrices {A_i}_{i=1}^ℓ are independently distributed, and we have m_1 + m_2 + ... + m_ℓ = m. As before, A_{i,T} is the restriction of A_i to the columns in T.

The golfing scheme then starts with v_0 = 0, inductively defines
\[
v_i = \frac{m}{m_i}\, A_i^* A_{i,T}\big( \mathrm{sgn}(x_T) - v_{i-1,T} \big) + v_{i-1}
\]
for i = 1, ..., ℓ, and sets v = v_ℓ. Clearly v is in the row space of A. To simplify notation, let q_i = sgn(x_T) − v_{i,T}, and observe the two identities
\[
q_i = \Big( I - \frac{m}{m_i} A_{i,T}^* A_{i,T} \Big) q_{i-1} = \prod_{j=1}^{i} \Big( I - \frac{m}{m_j} A_{j,T}^* A_{j,T} \Big)\, \mathrm{sgn}(x_T) \tag{3.4}
\]
and
\[
v = \sum_{i=1}^{\ell} \frac{m}{m_i}\, A_i^* A_{i,T}\, q_{i-1}, \tag{3.5}
\]
which shall be used frequently. From (3.4) and the fact that I − (m/m_i) A_{i,T}^* A_{i,T} should be a contraction (local isometry E1), we see that the norm of q_i decreases geometrically fast—the terminology comes from this fact since each iteration brings us closer to the target just as each golf shot would bring us closer to the hole—so that v_T should be close to sgn(x_T). Hopefully, the process keeps the size of v_{T^c} under control as well.

To control the size of v_{T^c} and that of sgn(x_T) − v_T, we claim that the following inequalities hold for each i with high probability: first,
\[
\|q_i\|_2 \le c_i \|q_{i-1}\|_2 \tag{3.6}
\]
and, second,
\[
\Big\| \frac{m}{m_i}\, A_{i,T^c}^* A_{i,T}\, q_{i-1} \Big\|_\infty \le t_i \|q_{i-1}\|_2 \tag{3.7}
\]
(the values of the parameters t_i and c_i will be specified later). Let p_1(i) (resp. p_2(i)) be the probability that the bound (3.6) (resp. (3.7)) does not hold. Lemma 2.3 (Estimate E2) gives
\[
p_1(i) \le \exp\left( -\frac14\Big( c_i \sqrt{m_i/(s\mu)} - 1 \Big)^2 \right). \tag{3.8}
\]
Thus, if
\[
m_i \ge \frac{2 + 8(\beta + \log \alpha)}{c_i^2}\, s\mu, \tag{3.9}
\]
then p_1(i) ≤ (1/α) e^{−β}. Next, Lemma 2.5 (Estimate E3) gives
\[
p_2(i) \le 2n \exp\left( -\frac{3\, t_i^2\, m_i}{6\mu + 2\mu\sqrt{s}\, t_i} \right). \tag{3.10}
\]
Thus, if
\[
m_i \ge \Big( \frac{2}{t_i^2 s} + \frac{2}{3 t_i \sqrt{s}} \Big)\big( \beta + \log(2\alpha) + \log n \big)\, s\mu, \tag{3.11}
\]
then p_2(i) ≤ (1/α) e^{−β}.

It is now time to set the number of blocks ℓ, the block sizes m_i and the values of the parameters c_i and t_i. These are as follows:

  ℓ = ⌈(log₂ s)/2⌉ + 2;
  c_1 = c_2 = 1/[2√(log n)] and c_i = 1/2 for 3 ≤ i ≤ ℓ;
  t_1 = t_2 = 1/[8√s] and t_i = (log n)/[8√s] for 3 ≤ i ≤ ℓ;
  m_1, m_2 ≥ 35(1 + log 4 + β) sμ c_i^{−2} and m_i ≥ 35(1 + log 6 + β) sμ c_i^{−2} for 3 ≤ i ≤ ℓ.

Note that under the assumptions of the lemma, m ≥ ∑_i m_i (extra rows need not be used in the construction of the dual certificate). To see why v is a valid certificate, suppose first that for each i, (3.6) and (3.7) hold. Then (3.4) gives
\[
\|\mathrm{sgn}(x_T) - v_T\|_2 = \|q_\ell\|_2 \le \|\mathrm{sgn}(x_T)\|_2 \prod_{i=1}^{\ell} c_i \le \frac{\sqrt s}{2^{\ell}} \le \frac14
\]
as desired. Further, (3.5) yields
\[
\|v_{T^c}\|_\infty \le \sum_{i=1}^{\ell} \Big\| \frac{m}{m_i} A_{i,T^c}^* A_{i,T}\, q_{i-1} \Big\|_\infty \le \sum_{i=1}^{\ell} t_i \|q_{i-1}\|_2 \le \sqrt s\, \sum_{i=1}^{\ell} t_i \prod_{j=1}^{i-1} c_j.
\]
(For i = 1, we take ∏_{j=1}^{0} c_j = 1.) Now with our choice of parameters, the right-hand side is bounded above by
\[
\frac18\Big( 1 + \frac{1}{2\sqrt{\log n}} + \frac{\log n}{4\log n} + \cdots \Big) < \frac14,
\]
which is the desired conclusion.

which is the desired conclusion.

Now we must show that the bounds (3.6), (3.7) hold with probability at least 1 − e^{−β} − 1/n. It follows from (3.9) and (3.11) that p_1(i), p_2(i) ≤ (1/4)e^{−β} for i = 1, 2 and p_1(i), p_2(i) ≤ (1/6)e^{−β} ≤ 1/6 for i ≥ 3. Thus, p_1(1) + p_1(2) + p_2(1) + p_2(2) ≤ e^{−β} and p_1(i) + p_2(i) ≤ 1/3 for i ≥ 3. Now the union bound would never show that (3.6) and (3.7) hold with probability at least 1 − 1/n for all i ≥ 3 because of the weak bound on p_1(i) + p_2(i). However, using a clever idea in [26], it is not necessary for each subset of rows A_i to 'succeed' and give the desired bounds. Instead, one can sample a 'few' extra batches of rows, and throw out those that fail our requirements. We only need ℓ − 2 working batches after the first 2. In particular, pick ℓ′ + 2 > ℓ batches of rows, so that we require m ≥ 2⌈140(1 + log 4 + β)·μ·s·log n⌉ + ℓ′⌈140(1 + log 6 + β) sμ⌉ (note that we have made no attempt to optimize constants). Now as in [26], let N be the number of batches—after the first 2—obeying (3.6) and (3.7); this N is larger (probabilistically) than a binomial(ℓ′, 2/3) random variable. Then by Hoeffding's inequality [31, Theorem 2.3a]
\[
\mathbb{P}(N < \ell - 2) \le \exp\left( -\frac{2\big( \tfrac23 \ell' - \ell + 2 \big)^2}{\ell'} \right)
\]
and thus if we pick ℓ′ = 3⌈log n⌉ + 1, we have
\[
\mathbb{P}(N < \ell - 2) \le 1/n.
\]
In summary, from p_1(1) + p_2(1) + p_1(2) + p_2(2) ≤ e^{−β} and the calculation above, the dual certificate v obeys the required properties with probability at least 1 − 1/n − e^{−β}, provided that m ≥ C(1 + β)·μ·s·log n.

4 General Signal Recovery from Noisy Data

We prove the general recovery theorems from Section 1.6 under the assumption of Gaussian white noise but would like to emphasize that the same result would hold for other noise distributions. Specifically, suppose we have the noisy model
\[
y = Ax + z, \qquad \text{where } \|A^* z\|_\infty \le \lambda_n \tag{4.1}
\]
holds with high probability. Then the conclusions of Theorem 1.3 remain valid. In detail, the Dantzig selector with constraint ‖A^*(y − A x̄)‖_∞ ≤ 4λ_n obeys
\[
\|\hat x - x\|_2 \le C_1 (1+\alpha^2)\left[ \frac{\|x - x_s\|_1}{\sqrt s} + \lambda_n \sqrt s \right] \tag{4.2}
\]
with high probability. Hence, (1.15) is a special case corresponding to λ_n = 2.5 σ_m √(log n) = 2.5 σ √(log n / m). Likewise, the bound on the ℓ1 loss (1.16) with λ_n in place of σ√(log n / m) holds as well.

A similar generality applies to the LASSO as well, although in this case we need a second noise correlation bound, namely,
\[
\|A_{T^c}^*(I - P) z\|_\infty \le \lambda_n,
\]
where P := A_T (A_T^* A_T)^{-1} A_T^* is the projection onto the range of A_T.

Now when z ~ N(0, I) and A is a fixed matrix, we have
\[
\|A^* z\|_\infty \le 2\|A\|_{1,2}\sqrt{\log n} \tag{4.3}
\]
with probability at least 1 − 1/(2n); recall that ‖A‖_{1,2} is the maximum column norm of A. Indeed, the ith component of A^* z is distributed as N(0, ‖A_{\{i\}}‖_2²) and, therefore, the union bound gives
\[
\mathbb{P}\big( \|A^* z\|_\infty > 2\|A\|_{1,2}\sqrt{\log n} \big) \le n\, \mathbb{P}\big( |N(0,1)| > 2\sqrt{\log n} \big).
\]
The conclusion follows for n ≥ 2 from the well-known tail bound P(|N(0,1)| > t) ≤ 2φ(t)/t, where φ is the density of the standard normal distribution. The same steps demonstrate that
\[
\|A^*(I - P) z\|_\infty \le 2\|(I-P)A\|_{1,2}\sqrt{\log n} \le 2\|A\|_{1,2}\sqrt{\log n} \tag{4.4}
\]
with probability at least 1 − 1/(2n).

4.1 Proof of Theorem 1.2

We begin with a few simplifying assumptions. First, we assume in the proof that σ_m = 1 since the general result follows from a simple rescaling. Second, because we are interested in situations where m is much smaller than n, we assume for simplicity of presentation that m ≤ n, although our results extend with only a change to the numerical constants involved if m ≤ n^{O(1)}. In truth, they extend without any assumption on the relation between m and n, but the general presentation becomes a bit more complicated.

Fix s obeying s ≤ s̄, and let T = supp(x_s). We prove the error bounds of Theorem 1.2 with s fixed, and the final result follows by considering that s which minimizes either the ℓ2 (1.11) or ℓ1 (1.12) error bound. This is proper since the minimizing s has a deterministic value. With T as above, we assume in the rest of the proof that

(i) all of the requirements for noiseless recovery in Lemma 3.2 are met (and in particular, the inexact dual vector is constructed for x_T),

(ii) and that the inexact dual vector v of Section 3 is successfully constructed.

All of this occurs with probability at least 1 − 4/n − e^{−β}. Further, we assume that

(iii) the weak RIP holds with δ = 1/4, r = m/(C(1+β)·μ·log n·log m·log² s) ∨ s, and T is as above.

This occurs with probability at least 1 − 5e^{−β}, and implies the RIP at sparsity level r and restricted isometry constant δ = 1/4. Lastly, we assume

(iv) the noise correlation bound
\[
\|A^* z\|_\infty \le 2.5\sqrt{\log n}. \tag{4.5}
\]

Assuming the weak RIP above (Estimate E5), which implies ‖A‖_{1,2} ≤ 5/4, the conditional probability that this occurs is at least 1 − 1/(2n) because of (4.3). Because the weak RIP implies the local isometry condition E1 with δ = 1/4, all of these conditions together hold with probability at least 1 − 4/n − 6e^{−β}. All of the steps in the proof are now deterministic consequences of (i)–(iv); from now on, we will assume they hold.

With h = x̂ − x, our goal is to bound both the ℓ2 and ℓ1 norms of h. We will do this with a pair of lemmas. The first is frequently used (recall that λ is set to 10√(log n)).

Lemma 4.1 (Tube constraint) The error h obeys
\[
\|A^* A h\|_\infty \le \frac{5\lambda}{4}.
\]
Proof As shown in [9, Lemma 3.1], writing that the zero vector is a subgradient of the LASSO functional ½‖y − A x̄‖_2² + λ‖x̄‖_1 at x̄ = x̂ gives
\[
\|A^*(y - A\hat x)\|_\infty \le \lambda.
\]
Then it follows from the triangle inequality that
\[
\|A^* A h\|_\infty \le \|A^*(y - A\hat x)\|_\infty + \|A^* z\|_\infty \le \lambda + \|A^* z\|_\infty,
\]
where z is our noise term. The claim is a consequence of (4.5).

Lemma 4.2 The error h obeys
\[
\|h_{T^c}\|_1 \le C_0\big( s\lambda + \|x_{T^c}\|_1 \big) \tag{4.6}
\]
for some numerical constant C_0.

Before proving this lemma, we show that it gives Theorem 1.2. Some of the steps are taken from the proof of Theorem 1.1 in [16].

Proof [Theorem 1.2] Set r as in (iii) above. We begin by partitioning T^c: let T_1 be the indices of the r largest entries of h_{T^c}, T_2 those of the next r largest, and so on. We first bound ‖h_{T∪T_1}‖_2 and set T̄_1 = T ∪ T_1 for short. The weak RIP assumption (iii) gives
\[
\frac34 \|h_{\bar T_1}\|_2^2 \le \|A_{\bar T_1} h_{\bar T_1}\|_2^2 = \langle A_{\bar T_1} h_{\bar T_1}, Ah\rangle - \langle A_{\bar T_1} h_{\bar T_1}, A_{\bar T_1^c} h_{\bar T_1^c}\rangle. \tag{4.7}
\]

From Lemma 4.1, we have
\[
\langle A_{\bar T_1} h_{\bar T_1}, Ah\rangle = \langle h_{\bar T_1}, A_{\bar T_1}^* Ah\rangle \le \|h_{\bar T_1}\|_1\, \|A_{\bar T_1}^* Ah\|_\infty \le \frac54 \lambda\, \|h_{\bar T_1}\|_1.
\]
Since T̄_1 has cardinality at most 2s, the Cauchy-Schwarz inequality gives
\[
\langle A_{\bar T_1} h_{\bar T_1}, Ah\rangle \le \frac54 \lambda \sqrt{2s}\, \|h_{\bar T_1}\|_2. \tag{4.8}
\]

Next, we bound
\[
|\langle A_{\bar T_1} h_{\bar T_1}, A_{\bar T_1^c} h_{\bar T_1^c}\rangle| \le |\langle A_T h_T, A_{\bar T_1^c} h_{\bar T_1^c}\rangle| + |\langle A_{T_1} h_{T_1}, A_{\bar T_1^c} h_{\bar T_1^c}\rangle|.
\]
We have
\[
|\langle A_T h_T, A_{\bar T_1^c} h_{\bar T_1^c}\rangle| \le \sum_{j\ge 2} |\langle A_T h_T, A_{T_j} h_{T_j}\rangle|. \tag{4.9}
\]
As shown in [14, Lemma 1.2], the parallelogram identity together with the weak RIP imply that
\[
|\langle A_T h_T, A_{T_j} h_{T_j}\rangle| \le \frac14 \|h_T\|_2\, \|h_{T_j}\|_2
\]
and, therefore,
\[
|\langle A_T h_T, A_{\bar T_1^c} h_{\bar T_1^c}\rangle| \le \frac14 \|h_T\|_2 \sum_{j\ge2} \|h_{T_j}\|_2. \tag{4.10}
\]
To bound the summation, we use the now standard result [16, (3.10)]
\[
\sum_{j\ge2} \|h_{T_j}\|_2 \le r^{-1/2}\, \|h_{T^c}\|_1, \tag{4.11}
\]
which gives
\[
|\langle A_T h_T, A_{\bar T_1^c} h_{\bar T_1^c}\rangle| \le \frac14\, r^{-1/2}\, \|h_T\|_2\, \|h_{T^c}\|_1.
\]
The same analysis yields |⟨A_{T_1} h_{T_1}, A_{\bar T_1^c} h_{\bar T_1^c}⟩| ≤ (1/4) r^{-1/2} ‖h_{T_1}‖_2 ‖h_{T^c}‖_1 and thus,
\[
|\langle A_{\bar T_1} h_{\bar T_1}, A_{\bar T_1^c} h_{\bar T_1^c}\rangle| \le \frac{\sqrt 2}{4}\, r^{-1/2}\, \|h_{\bar T_1}\|_2\, \|h_{T^c}\|_1.
\]
Plugging these estimates into (4.7) gives
\[
\|h_{\bar T_1}\|_2 \le \frac53 \sqrt{2s}\,\lambda + \frac{\sqrt 2}{3}\, r^{-1/2}\, \|h_{T^c}\|_1. \tag{4.12}
\]

The conclusion is now one step away. Obviously,
\[
\|h\|_2 \le \|h_{\bar T_1}\|_2 + \sum_{j\ge2}\|h_{T_j}\|_2 \le \|h_{\bar T_1}\|_2 + r^{-1/2}\|h_{T^c}\|_1 \le \frac53\sqrt{2s}\,\lambda + \frac{3+\sqrt2}{3}\, r^{-1/2}\, \|h_{T^c}\|_1,
\]
where the last inequality follows from (4.12). Lemma 4.2 completes the proof for the ℓ2 error. For the ℓ1 error, note that by the Cauchy-Schwarz inequality
\[
\|h\|_1 = \|h_T\|_1 + \|h_{T^c}\|_1 \le \sqrt s\, \|h_T\|_2 + \|h_{T^c}\|_1 \le \sqrt s\, \|h_{\bar T_1}\|_2 + \|h_{T^c}\|_1.
\]
Combine this with (4.12) and Lemma 4.2.

(Note that if we had not set σ_m = 1 at the beginning of the proof, we would have λ = 10σ√(log n / m), leading to the general error bound.)
