Likelihood-based Data Squashing: A Modeling Approach to Instance Construction


David Madigan, Nandini Raghavan, & William DuMouchel (AT&T Labs - Research)
Martha Nason & Christian Posse (Talaria, Inc.)
Greg Ridgeway (University of Washington, greg@stat.washington.edu)

September 28, 1999

Abstract

Squashing is a lossy data compression technique that preserves statistical information. Specifically, squashing compresses a massive dataset to a much smaller one so that outputs from statistical analyses carried out on the smaller (squashed) dataset reproduce outputs from the same statistical analyses carried out on the original dataset. Likelihood-based data squashing (LDS) differs from a previously published squashing algorithm insofar as it uses a statistical model to squash the data. The results show that LDS provides excellent squashing performance even when the target statistical analysis departs from the model used to squash the data.

1 Introduction

Massive datasets containing millions or even billions of observations are increasingly common. Such data arise, for instance, in large-scale retailing, telecommunications, astronomy, computational biology, and internet logging. Statistical analyses of data on this scale present new computational and statistical challenges. The computational challenges derive in large part from the multiple passes through the data required by many statistical algorithms. When data are too large to fit in memory, this becomes especially pressing. A typical disk drive is a factor of 10^5 to 10^6 times slower in performing a random access than is the main memory of a computer system (Gibson et al., 1996). Furthermore, the costs associated with transmitting the data may be prohibitive. The statistical challenges are many: what constitutes "statistical significance" when there are 100 million observations? How do we deal with the dynamic nature of most massive datasets? How can we best visualize data on this scale?

Much of the current research on massive datasets concerns itself with scaling up existing algorithms - see, for example, Bradley et al. (1998) or Provost and Kolluri (1999). In this paper we focus on the alternative approach of scaling down the data. Most of the previous work in this direction has focused on sampling methods such as random sampling, stratified sampling, duplicate compaction (Catlett, 1991), and boundary sampling (Aha et al., 1991; Syed et al., 1999). Recently DuMouchel et al. (1999) [DVJCP] proposed an approach that instead constructs a reduced dataset. Specifically, their data squashing algorithm seeks to compress (or "squash") the data in such a way that a statistical analysis carried out on the squashed data provides the same outputs that would have resulted from analyzing the entire dataset. Success with respect to this goal would deal very effectively with the computational challenges mentioned above - the entire armory of statistical tools could then work with massive datasets in a routine fashion and using commonplace hardware.

DVJCP's approach to squashing is model-free and relies on moment-matching. The squashed dataset consists of a set of pseudo data points chosen to replicate the moments of the "mother-data" within subsets of a partition of the mother-data. DVJCP explore various approaches to partitioning and also experiment with the order of the moments. On a logistic regression example where the mother-data contains 750,000 observations, a squashed dataset of 8,443 points outperformed a simple random sample of 7,543 points by a factor of almost 500 in terms of mean square error with respect to the regression coefficients from the mother-data. DVJCP provide a theoretical justification of their method by considering a Taylor series expansion of an arbitrary likelihood function. Since this depends on the moments of the data, their method should work well for any application in which the likelihood is well-approximated by the first few terms of a Taylor series, at least within subsets of the partitioned data. The empirical evidence provided to date is limited to logistic regression.

In this paper we consider the following variant of the squashing idea: suppose we declare a statistical model in advance. That is, suppose we use a particular statistical model to squash the data. Can we thus improve squashing performance? Will this improvement extend to models other than that used for the squashing? We refer to this approach as "likelihood-based data squashing" or LDS. LDS is similar to DVJCP's original algorithm (or DS) insofar as it first partitions the dataset and then chooses pseudo data points corresponding to each subset of the partition. However, the two algorithms differ in how they create the partition and how they create the pseudo data points. For instance, in the context of logistic regression with two continuous predictors, Figure 1 shows the partitions of the two-dimensional predictor space generated by the two algorithms for a single value of the dichotomous response variable. The DS algorithm partitions the data along certain marginal quantiles, and then matches moments. The LDS algorithm partitions the data using a likelihood-based clustering and then selects pseudo data points so as to mimic the target sampling or posterior distribution. Section 2 describes the algorithm in detail. In what follows, we explore the application of LDS to logistic regression, variable selection for logistic regression, and neural networks. Note that both the DS and LDS algorithms produce pseudo data points with associated weights. Use of the squashed data requires software that can use these weights appropriately.
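Because the squashed output is a weighted pseudo dataset, any downstream fitting routine must accept case weights. As a minimal illustration (added here, not part of the original paper; the array names and sizes are hypothetical), the sketch below fits a logistic regression to a squashed dataset using frequency weights in statsmodels.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical squashed dataset: pseudo data points with integer weights.
# X_sq: (n0, p) design matrix, y_sq: (n0,) binary responses, w_sq: (n0,) weights.
rng = np.random.default_rng(0)
X_sq = sm.add_constant(rng.uniform(size=(100, 4)))
y_sq = rng.integers(0, 2, size=100)
w_sq = rng.integers(1, 20, size=100)            # cluster sizes from squashing

# Weighted fit: each pseudo point counts w_sq[i] times in the likelihood.
model = sm.GLM(y_sq, X_sq, family=sm.families.Binomial(), freq_weights=w_sq)
result = model.fit()
print(result.params)   # estimates that should mimic the mother-data fit
print(result.bse)      # standard errors reflect the total weight, not n0
```

Any package exposing case or frequency weights (sample_weight in scikit-learn, weights in R's glm) would serve the same purpose.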

2 The LDS Algorithm

We motivate the LDS algorithm from a Bayesian perspective. Suppose we are computing the distribution of some parameter θ posterior to three data points d1, d2, and d3.


[Figure 1: Partitions of the predictor space (axes X1 and X4) generated by the LDS algorithm (top) and the DS algorithm (bottom).]


Table 1: Simple example of squashing when Pr(d1 | θ) ≈ Pr(d2 | θ). LDS constructs the pseudo data point d* so that Pr(d1 | θ) Pr(d2 | θ) Pr(d3 | θ) ≈ (Pr(d* | θ))^2 Pr(d3 | θ).

Mother-data              Squashed-data
Instance   Weight        Instance   Weight
d1         1             d*         2
d2         1             d3         1
d3         1
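A quick numeric check of the idea behind Table 1 (an illustration added here, not from the paper): for a normal likelihood with known variance, replacing two observations with nearly equal likelihoods by their average, carried with weight 2, reproduces the joint log-likelihood up to an additive constant that does not depend on θ, which is all that matters for the resulting inference.

```python
import numpy as np

def loglik(theta, y):
    # log-likelihood of N(theta, 1) at a single observation y (up to constants)
    return -0.5 * (y - theta) ** 2

d1, d2, d3 = 1.02, 0.98, 3.0          # d1 and d2 have similar likelihoods
d_star = 0.5 * (d1 + d2)              # pseudo point replacing d1 and d2

thetas = np.linspace(-2, 4, 7)
full = loglik(thetas, d1) + loglik(thetas, d2) + loglik(thetas, d3)
squashed = 2 * loglik(thetas, d_star) + loglik(thetas, d3)

# The two log-likelihood profiles differ only by a constant offset,
# so they yield the same MLE and the same posterior shape.
print(np.ptp(full - squashed))        # ~0: the difference is constant in theta
```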

The clusters that LDS constructs can differ markedly from those that would result from a traditional clustering of the data points. Figure 1, for example, shows LDS constructing several clusters containing data points with disparate (x1, x2) coordinates. Figure 2 shows the LDS clusters in the context of simple linear regression through the origin (i.e., a model with a single parameter). In this case, the likelihood profiles for each data point di represent the likelihoods for di under a variety of lines defined by a set of slopes {θ1, ..., θk}. The left-hand panel shows mother-data generated from a bivariate normal distribution with zero correlation (i.e., noise), whereas the right-hand panel shows mother-data generated from a model with a true slope of 1. Both plots demonstrate substantial symmetries about the origin - the likelihood of any point (x, y) is the same as that of (−x, −y) for all θi. Both plots also have a cluster centered on the origin. Since all the lines pass through the origin, points near the origin should have similar likelihoods for all lines. The right-hand panel exhibits distinctive radial clusters, since likelihood in this context is a function of the distance from the data point to the line.
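To make the notion of a likelihood profile concrete for the regression-through-the-origin example, the sketch below (added for illustration; the grid of slopes, the unit error variance, and the k-means call are assumptions, not details from the paper) evaluates each point's log-likelihood at a handful of candidate slopes and groups points whose profiles are similar.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
y = 1.0 * x + rng.normal(size=n)                 # "signal" case: true slope 1

slopes = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])   # illustrative grid of theta_j

# Likelihood profile: log-likelihood of each point under each candidate slope,
# assuming unit error variance (a simplifying assumption for the sketch).
profiles = -0.5 * (y[:, None] - x[:, None] * slopes[None, :]) ** 2

# Group points with similar profiles; these clusters are radial, as in Figure 2.
labels = KMeans(n_clusters=50, n_init=10, random_state=0).fit_predict(profiles)
print(np.bincount(labels)[:10])                  # sizes of the first few clusters
```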

2.1 Detailed Description

Let observations y = (y1, ..., yn) be realized values of random variables Y = (Y1, ..., Yn). Suppose that the functional form of the probability density function f(y; θ) of Y is specified up to a finite number of unknown parameters θ = (θ1, ..., θp). Denote by l(θ; y) the log likelihood of θ, that is, l(θ; y) = log f(y; θ), and denote by θ̂ the value of θ that maximizes l(θ; y).


[Figure 2: LDS clusters for simple linear regression through the origin. Left panel: LDS (noise); right panel: LDS (signal). Axes: X (horizontal), Y (vertical).]


[Cluster] Cluster the Data. Each datapoint yi is assigned to the cluster c that minimizes:

    Σ_{j=1}^{k} ( l(θj; yi) − l̄c(θj) )^2

where l̄c(θj) denotes the average of the log likelihoods l(θj; yi) for those data points in cluster c.

[Construct] Construct the Pseudo Data. For each of the n0 clusters, construct a single pseudo datapoint. Consider a cluster containing m datapoints, (y_{i1}, ..., y_{im}). Let yi denote the corresponding pseudo datapoint. The algorithm initializes yi to (1/m) Σ_{k=1}^{m} y_{ik} and then optionally refines yi by numerically minimizing:

    Σ_{j=1}^{k} ( m · l(θj; yi) − Σ_{k=1}^{m} l(θj; y_{ik}) )^2

The results reported in this paper do not include this optional step.
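Pulling the Cluster and Construct steps together, the sketch below is one illustrative reading of the algorithm for logistic regression, not the authors' code: k-means on the profile vectors minimizes the same within-cluster sum of squared profile deviations displayed above, the clustering is done separately within each response value (as Figure 1 suggests), and the pseudo points are initialized at cluster means with weights equal to cluster sizes.

```python
import numpy as np
from sklearn.cluster import KMeans

def logistic_loglik(theta, X, y):
    # per-observation logistic log-likelihood l(theta; y_i)
    eta = X @ theta
    return y * eta - np.log1p(np.exp(eta))

def squash(X, y, thetas, n_clusters):
    """thetas: (k, p) array of design points theta_j; returns weighted pseudo data."""
    X_sq, y_sq, w_sq = [], [], []
    for label in (0, 1):                        # cluster within each response value
        Xg, yg = X[y == label], y[y == label]
        if len(yg) == 0:
            continue
        profiles = np.column_stack([logistic_loglik(t, Xg, yg) for t in thetas])
        km = KMeans(n_clusters=min(n_clusters, len(yg)), n_init=10, random_state=0)
        groups = km.fit_predict(profiles)
        for c in np.unique(groups):
            idx = groups == c
            X_sq.append(Xg[idx].mean(axis=0))   # pseudo point = cluster mean
            y_sq.append(label)
            w_sq.append(idx.sum())              # weight = cluster size
    return np.array(X_sq), np.array(y_sq), np.array(w_sq)
```

The optional refinement step described above would further adjust each pseudo point so that m times its profile matches the summed profiles of its cluster.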

Figure 3: Central composite design for three variables.

As described, the algorithm requires two passes over the mother-data: one to estimate the initial value θ̃ of θ, and one to evaluate the likelihood profiles and perform the clustering. The first pass can be omitted in favor of an estimate of θ̃ based on a random sample, although this can adversely affect squashing performance - see Section 6 below. There exist a variety of elaborations of the base algorithm, some of which we discuss in what follows. For large p, the central composite design will choose an unnecessarily large set of values of θ at the Select phase. The literature on experimental design (see, for example, Box et al., 1978) provides a rich array of fractional factorial designs that efficiently scale with p. The clustering algorithm in base-LDS can also be improved; Zhang et al. (1996) describe an alternative that could readily provide a replacement for the Cluster phase. Other elaborations include using alternative clustering metrics at the Cluster phase, varying both the number of pseudo points and the construction algorithm at the Construct phase, and iterating the entire LDS algorithm. Some but not all of these elaborations require extra passes over the mother-data.
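The design points θj at which the profiles are evaluated come from a central composite design around the initial estimate. A minimal construction is sketched below; it is illustrative only, and the per-coordinate scaling by standard errors is one reading of the dF and dS distances defined in Section 3.

```python
import numpy as np
from itertools import product

def central_composite(theta_center, se, d_f, d_s):
    """theta_center: (p,) initial estimate; se: (p,) standard errors.
    Returns the centre point, 2^p factorial points offset by d_f standard
    errors in every coordinate, and 2p star points offset by d_s standard
    errors in one coordinate at a time."""
    p = len(theta_center)
    points = [theta_center]
    for signs in product([-1.0, 1.0], repeat=p):          # factorial points
        points.append(theta_center + d_f * np.array(signs) * se)
    for i in range(p):                                     # star points
        for s in (-1.0, 1.0):
            star = theta_center.copy()
            star[i] += s * d_s * se[i]
            points.append(star)
    return np.array(points)                                # (2^p + 2p + 1, p)
```

For large p, a fractional factorial block would replace the full 2^p factorial loop, as discussed above.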

3 Evaluation: Logistic Regression

To evaluate the performance of LDS we conducted a variety of experiments with datasets of various sizes. In each case our primary goal was to compare the parameter estimates based on the mother-data with the corresponding estimates based on the squashed data. To provide a baseline we also computed estimates based on a simple random sample. We provide results both for simulated data and for the AT&T data from DVJCP. Following DVJCP we report results in the form of residuals from the mother-data parameter estimates, that is, (reduced-data parameter estimate − mother-data parameter estimate). The residuals are standardized by the standard errors estimated from the mother-data and are averaged over all the parameters in the pertinent model. Note that reproducing parameter estimates represents a more challenging target than reproducing predictions since the former requires that we obtain high quality estimates for all the parameters. Section 3.4 below shows that accurate parameter estimate replication does result in high quality prediction replication.

3.1 Small-Scale Simulations

Implementation of base-LDS requires an initial estimate θ̃ of θ̂ and a choice of locations for the k values of θ used in the central composite design. We carried out extensive experimentation with small-scale simulated mother-data in order to understand the effects of various possible choices on squashing performance. For the initial estimate θ̃ we considered three possibilities: θ̂SRS, θ̂ONE, and θ̂. θ̂SRS is a maximum likelihood estimator of θ based on a 10% random sample; θ̂ONE is an approximate maximum likelihood estimator of θ based on a single step of the standard logistic regression Newton-Raphson algorithm (this requires a single pass through the mother-data); and θ̂ is the maximum likelihood estimator of θ based on the mother-data. In the central composite design, let dF denote the distance of the 2^p "factorial points" from θ̃ and let dS denote the distance of the 2p "star points" from θ̃, both distances in standard error units. Here we considered dF ∈ {0.1, 0.5, 1, 3} and dS ∈ {0.1, 0.5, 1, 3}.

In each case, the mother-data consisted of 1000 observations generated from the following logistic regression model:

    log( Pr(Y = 1) / (1 − Pr(Y = 1)) ) = θ1 X1 + θ2 X2 + θ3 X3 + θ4 X4 + θ5 X5    (1)

with X1 ≡ 1, X2, X3, X4, X5 ~ U(0, 1) and θ1, ..., θ5 ~ U(0, 0.5). For each of 100 simulated mother-datasets from this model, LDS generated 48 squashed datasets corresponding to the 48 (3 × 4 × 4) design settings. Parameter estimates based on each of these, as well as on an SRS sample, were computed. The LDS and SRS datasets were of size 100. Figure 4 shows boxplots of the standardized residuals of the parameter estimates. The residuals are with respect to the parameter estimates from the mother-data, and are standardized by the standard errors of the estimates from the mother-data. Several features are immediately apparent:

- With appropriate choices for dF, LDS outperforms random sampling for all three settings of θ̃. Note that the results are shown on a log10 scale; for instance, for LDS-MLE with dS = 0.1 and dF = 0.1, LDS outperforms SRS by a factor of about 10^5.
- Squashing performance improves as the quality of θ̃ improves from θ̂SRS to θ̂ONE to θ̂.
- There is a dependence between the size of dF and the quality of θ̃. For θ̃ = θ̂SRS, dF = 3 is the optimal setting amongst the four choices. For θ̃ = θ̂ONE, several choices of dF yield equivalent performance. For θ̃ = θ̂, dF = 0.1 is the optimal setting amongst the four choices.
- The choice of dS has a relatively small effect on squashing performance.
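For concreteness, the following sketch (added here; the seed, the use of statsmodels, and the summary statistic are assumptions) generates one mother-dataset from model (1) and computes the standardized residuals of the coefficients for a 10% simple random sample, the evaluation measure used throughout this section.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 1000
X = np.column_stack([np.ones(n), rng.uniform(size=(n, 4))])   # X1 == 1, X2..X5 ~ U(0,1)
theta = rng.uniform(0, 0.5, size=5)                            # theta_1..theta_5 ~ U(0, 0.5)
prob = 1.0 / (1.0 + np.exp(-X @ theta))
y = rng.binomial(1, prob)

mother = sm.GLM(y, X, family=sm.families.Binomial()).fit()

idx = rng.choice(n, size=n // 10, replace=False)               # 10% SRS baseline
srs = sm.GLM(y[idx], X[idx], family=sm.families.Binomial()).fit()

# standardized residuals of the parameter estimates, as defined above
z = (srs.params - mother.params) / mother.bse
print(np.mean(z ** 2))                                         # average squared residual
```

The corresponding LDS figure would be obtained by fitting the same weighted model to a squashed dataset of the same size.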


[Figure 4: Boxplots of log(MSE(LDS)/MSE(SRS)) for dF ∈ {0.1, 0.5, 1, 3}, repeated for the three settings of θ̃.]


Since θ̃ defines the center of the design matrix where LDS evaluates the likelihood profiles, it is hardly surprising that performance degrades as θ̃ departs from θ̂. It is evidently more important to cluster datapoints that have similar likelihoods in the region of the maximum likelihood estimator (which with large datasets will be close to the posterior mean) than to cluster datapoints that have similar likelihoods in regions of negligible posterior mass. What is perhaps somewhat surprising is the extent to which the design points need to depart from θ̃ when θ̃ ≠ θ̂. In that case it is best to evaluate the likelihood profiles at a diffuse set of values of θ, most of which are far out in the tails of θ's posterior distribution. In fact, choosing dS and dF as large as 10 still gives acceptable performance when θ̃ ≠ θ̂. This implies that when LDS doesn't have a very good estimate of θ̂, it needs to ensure a very broad coverage of the likelihood surface.

3.2 Medium-Scale Simulations

Here we consider the performance of LDS in a somewhat larger-scale setting. In particular, we simulated mother-datasets of size 100,000 from the logistic regression model specified by (1), again with X1 ≡ 1, X2, X3, X4, X5 ~ U(0, 1) and θ1, ..., θ5 ~ U(0, 0.5). Figure 5 shows the results for different choices of θ̃. Clearly, setting θ̃ = θ̂SRS yields substantially poorer squashing performance than either θ̃ = θ̂ONE or θ̃ = θ̂. However, Section 6 below describes how this can be alleviated with an iterative version of LDS that achieves squashing performance comparable to that for θ̃ = θ̂, but starting with θ̃ = θ̂SRS.

Note that even with 100,000 observations the five parameters in the model specified by (1) are often not all significantly different from zero. Experiments with models in which either all of the parameters are indistinguishable from zero or all of the parameters are significantly different from zero yielded LDS performance results that are similar to those reported here. For simplicity we only report the results from model (1).
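The one-pass estimate θ̂ONE used above is a single Newton-Raphson step for logistic regression, which only requires streaming accumulation of the score and information matrix. A hedged sketch follows; the chunked reading loop and the zero starting value are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def theta_one(chunks, p):
    """One Newton-Raphson step for logistic regression starting at theta = 0,
    accumulated over an iterable of (X_chunk, y_chunk) pairs (one data pass)."""
    H = np.zeros((p, p))          # X' W X with W = diag(mu * (1 - mu))
    g = np.zeros(p)               # X' (y - mu)
    theta0 = np.zeros(p)
    for X, y in chunks:
        mu = 1.0 / (1.0 + np.exp(-X @ theta0))   # all 0.5 when theta0 = 0
        w = mu * (1.0 - mu)
        H += X.T @ (X * w[:, None])
        g += X.T @ (y - mu)
    return theta0 + np.linalg.solve(H, g)
```

The chunks argument can be a generator reading the mother-data from disk, so the whole dataset never needs to fit in memory.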


[Figure 5: MSE of the parameter estimates (log scale) for SRS, LDS-SRS, LDS-ONE, and LDS-MLE.]


Table 2: Performance of Base-LDS for the AT&T data. k is the number of evaluations of the likelihood per data point. SRS/LDS is the average MSE for simple random sampling (154.04 in this case) divided by the MSE for LDS (i.e., the improvement factor over simple random sampling). HypRect(1/2) shows the most comparable results from DVJCP (note that HypRect(1/2) uses 8,373 observations as compared with 7,450 observations in the other rows).

Method                  k     dF   dS   MSE      SRS/LDS
θ̂ONE                    85    5    5    0.023    6697
θ̂ONE                    149   5    5    0.019    8107
DS HypRect(1/2)         -     -    -    0.24     642
SRS (10 replications)   -     -    -    154.04   1

3.3 Larger-Scale Application: The AT&T Data

DVJCP describe a dataset of 744,963 customer records. The binary response variable identifies customers who have switched to another long-distance carrier. There are seven predictor variables. Five of these are continuous and two are 3-level categorical variables. Thus for logistic regression there are 10 parameters. As before, we consider 1% random and squashed samples. With 10 parameters, the central composite design requires 1,024 factorial points, 20 star points, and 1 central point, for a total of 1,045 points. This would incur a significant computational effort. In place of the fully factorial component of the central composite design, we evaluated two fractional factorial designs: a resolution V design requiring 128 factorial points and a resolution IV design requiring 64 points (Box et al., 1978, p. 410). In brief, a resolution V design does not confound main effects or two-factor interactions with each other, but does confound two-factor interactions with three-factor interactions, and so on. A resolution IV design does not confound main effects and two-factor interactions but does confound two-factor interactions with other two-factor interactions. Table 2 describes the results. LDS outperforms SRS by a wide margin and also provides better squashing performance than DS in this case.


Table 3: Comparison of predictions for the AT&T data using logistic regression with all 10 main effects. For each reduced dataset the N = 744,963 predictive residuals are defined as ((Probability based on reduced dataset) − (Probability based on the mother-data)) × 10,000. Each row of the table describes the distribution of the corresponding residuals for a given reduction method.

Method           Mean   StDev   Min    Max
Random Sample    -41    193     -870   679
HypRect(1/2)     -2     9       -37    34
LDS              0.4    2       -5     11

If the actual parameter estimates from the mother-data are used for θ̃ in the first step of the algorithm (i.e., setting θ̃ = θ̂), then it is possible to reduce the MSE to 0.01 (k = 149). At the other extreme, setting θ̃ = θ̂SRS increases the MSE to 1.04 (k = 149).

3.4 Prediction

Our primary goal so far has been to emulate the mother-data parameter estimates. A coarser goal is to see how well squashing emulates the mother-data predictions. Following DVJCP we consider the AT&T data, where each observation in the dataset is assigned a probability of being a Defector. We used the parameter estimates from a 1% random sample and from a 1% squashed dataset to assign this probability and then compared these with the "true" probability of being a Defector from the mother-data model. For each observation in the mother-data, we compute (Probability based on reduced dataset) − (Probability based on the mother-data), multiplied by 10,000 for descriptive purposes. Table 3 describes the results. LDS performs about two orders of magnitude better than simple random sampling and also outperforms the comparable model-free HypRect(1/2) method from DVJCP.
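The predictive comparison in Table 3 reduces to differencing fitted probabilities on the mother-data. A small helper (illustrative, with hypothetical argument names) that produces one row of such a table is shown below.

```python
import numpy as np

def predictive_residuals(prob_reduced, prob_mother, scale=10000):
    """Summarize (reduced-data probability - mother-data probability) * scale."""
    r = (np.asarray(prob_reduced) - np.asarray(prob_mother)) * scale
    return {"Mean": r.mean(), "StDev": r.std(), "Min": r.min(), "Max": r.max()}
```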


4 Evaluation: Variable Selection

The preceding results demonstrate that using a particular logistic regression model to squash a dataset allows one to accurately retrieve the parameter estimates for that model with a 1% squashed sample. However, the utility of the algorithm is enhanced by its ability to facilitate other analyses that an analyst might have performed on the mother-data. Since variable selection is a widely used modeling step in regression analysis, we consider the following question: would a variable selection algorithm applied to the squashed data select the same model that the algorithm would select when applied to the mother-data? In what follows we examine all possible subsets of the predictor variables ("all-subsets") and score the competing models using the Bayesian Information Criterion (BIC, Schwarz, 1978). BIC is a penalized log-likelihood evaluated at the MLE:

    BIC = −2 l(θ̂; y) + p log(n)

where n is the number of datapoints and p is the dimensionality of θ.

For the AT&T data, all-subsets applied to the mother-data, a 1% random sample, and a 1% squashed dataset all select the full model. However, the rank correlation between the BIC scores for the mother-data and the BIC scores for the squashed data is 0.9995, as opposed to 0.9922 for the mother-data-SRS comparison. For the simulated medium-scale mother-data with 100,000 datapoints and 5 predictors (see Section 3.2), a 1% LDS-squashed sample with θ̃ = θ̂ selected the correct model in each of 30 replications. By comparison, a 1% SRS selected the correct model in 10 of the 30 replications. Table 4 shows some results.

These results suggest that it is possible to achieve a 100-fold reduction in computational effort for variable selection for certain model classes. This would facilitate the application of expensive variable selection algorithms such as all-subsets or Bayesian model averaging to massive data. Furthermore, the costs associated with transmitting a dataset over a network could be greatly reduced if variable selection is the target activity. Note that for linear and certain non-linear regression models, Furnival and Wilson (1974) and Lawless and Singhal (1978) describe a highly efficient approach to variable selection that does not require maximum likelihood estimation for each individual model.
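All-subsets selection with BIC on a weighted squashed dataset can be written directly. The sketch below is an illustration, not the authors' code; taking n in the BIC penalty to be the total weight is an assumption, and the helper names are hypothetical.

```python
import numpy as np
import statsmodels.api as sm
from itertools import combinations

def all_subsets_bic(X, y, weights):
    """Score every non-empty subset of predictors (column 0 = intercept, always kept)."""
    n = weights.sum()                       # effective sample size = total weight
    scores = {}
    cols = range(1, X.shape[1])
    for k in range(1, X.shape[1]):
        for subset in combinations(cols, k):
            Xs = X[:, (0,) + subset]
            fit = sm.GLM(y, Xs, family=sm.families.Binomial(),
                         freq_weights=weights).fit()
            p = Xs.shape[1]
            scores[subset] = -2.0 * fit.llf + p * np.log(n)   # BIC as defined above
    best = min(scores, key=scores.get)
    return best, scores
```

Running the same routine on the mother-data (with unit weights) and on the squashed data allows the rank correlation of the BIC scores to be compared, as reported above.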


Table 4: LDS for logistic regression variable selection. "LDS Correct" shows the percentage of the n replications in which LDS selected the correct model (i.e., the model selected by the mother-data). "SRS Correct" shows the percentage of the n replications in which a simple random sample selected the correct model.

Model: logit(Y) = Σ θi Xi                            N        P   n    LDS Correct   SRS Correct
θ1 = 0.1, θ2 = 0.25, θ3 = 0.5, θ4 = 0.75, θ5 = 1.0   100,000  5   30   100%          33%
θi ~ unif(0, 1)                                      100,000  5   30   100%          27%
θi ~ unif(0, 0.5)                                    100,000  5   30   100%          23%

5 Evaluation: Neural Networks

The evaluations thus far have focused on logistic regression. Here we consider the application of LDS (still using a logistic regression model to perform the squashing) to neural networks. We simulated data from a feed-forward neural network with two input units, one hidden layer with three units, and a single dichotomous output unit (Venables and Ripley, 1997). The left-hand panel of Figure 6 compares the test-data misclassification rate using a neural network model based on the mother-data (10,000 points) with the test-data misclassification rate based on either a simple random sample of size 1,000 (black dots) or an LDS squashed dataset of size 1,000 (red dots). In either case, predictions are based on a holdout sample of 1,000 generated from the same neural network model that generated the mother-data. The results are for 30 replications. It is apparent that LDS consistently reproduces the misclassification rate of the mother-data. The right-hand panel of Figure 6 compares the predictive residuals (i.e., (Probability based on reduced dataset) − (Probability based on the mother-data)) for the two methods. Table 5 shows the results in a format comparable with Table 3. These predictive results are not as good as those for the logistic regression analysis of the AT&T data (Table 3), but here the application is to a different model class than that used for the squashing, and LDS substantially outperforms simple random sampling nonetheless.
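Because many neural-network implementations do not accept case weights directly, one way to reuse the squashed data (an assumption about the workflow, not a statement of what the authors did) is to replicate each pseudo point by its integer weight before fitting. The sketch uses scikit-learn's feed-forward network with one hidden layer of three units, matching the architecture described above.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def fit_nn_on_squashed(X_sq, y_sq, w_sq, random_state=0):
    # Expand weighted pseudo points into repeated rows (weights are cluster sizes,
    # hence integers); the repeated dataset then carries the squashing weights.
    X_rep = np.repeat(X_sq, w_sq, axis=0)
    y_rep = np.repeat(y_sq, w_sq)
    net = MLPClassifier(hidden_layer_sizes=(3,), max_iter=2000,
                        random_state=random_state)
    return net.fit(X_rep, y_rep)
```

Frameworks that expose per-example weights in their fitting routines could use the weights directly instead of replicating rows.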


[Figure 6: Left panel: reduced-data versus mother-data misclassification rates for SRS and LDS. Right panel: reduced-data predicted probabilities versus mother-data predicted probabilities.]


[Figure 7: Reduced-data predictions versus mother-data predictions for nine replications, with LDS predictions superimposed on SRS predictions.]


Table 5: Comparison of neural network predictions for random sampling and LDS. For each reduced dataset the 1,000 residuals from the hold-out data are defined as (Probability based on reduced dataset) − (Probability based on the mother-data). Each row of the table describes the distribution of the corresponding residuals for a given reduction method. The results are averaged over 30 replications.

Method           Mean     StDev   Min     Max
Random Sample    -0.005   0.08    -0.29   0.25
LDS              0.0002   0.02    -0.06   0.07

Figure 7 shows the individual predictions for nine of the replications with LDS predictions (red dots) superimposed on SRS predictions (black dots). Points on the diagonal line represent predictions where the reduced-data prediction and the mother-data prediction agree. The variability of the prediction from random sampling is apparent. Note that for both LDS and SRS, the back-propagation algorithm used to fit the neural network is itself a source of variability since convergence to local log-likelihood maxima frequently occurs.

6 Iterative LDS

Except where noted, the evaluations reported thus far utilize a single pass through the mother-data to compute θ̃. In the case of logistic regression, θ̃ is the output of the first step of the standard Newton-Raphson algorithm for estimating θ̂. In fact, this provides a remarkably accurate estimate of θ̂ and results in squashing performance close to that provided by setting θ̃ = θ̂. For those cases where there does not exist a high-quality, one-pass estimate of θ̂, and furthermore many passes through the data are required for an exact estimate of θ̂, iterative LDS (ILDS) provides an alternative approach. ILDS works as follows:

1. Set θ̃ = θ̂SRS, an estimate of θ̂ based on a simple random sample from the mother-data.

2. Squash the mother-data using LDS (this requires one pass through the mother-data).
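The remaining steps of the ILDS description are not present in this copy of the text. Under the assumption, consistent with the discussion in Section 3.2, that each iteration re-estimates θ from the current squashed dataset and then re-squashes, an iterative loop might look like the sketch below; the design_fn and squash_fn arguments are the hypothetical helpers from the earlier sketches, and none of this is the authors' own code.

```python
import numpy as np
import statsmodels.api as sm

def iterative_lds(X, y, theta_srs, design_fn, squash_fn, n_clusters, n_iter=3):
    """Hypothetical ILDS loop: start from a random-sample estimate and
    alternate squashing with re-estimation on the squashed data."""
    theta = theta_srs
    for _ in range(n_iter):
        thetas = design_fn(theta)                    # e.g. a central composite design
        X_sq, y_sq, w_sq = squash_fn(X, y, thetas, n_clusters)   # one pass over the data
        fit = sm.GLM(y_sq, X_sq, family=sm.families.Binomial(),
                     freq_weights=w_sq).fit()        # weighted fit on the squashed data
        theta = fit.params                           # refined estimate for the next round
    return theta, (X_sq, y_sq, w_sq)
```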
