Chinese Academy of Sciences Machine Learning Question Bank - new



Machine Learning Question Bank

一、 Maximum Likelihood

1、 ML estimation of exponential model (10)

A Gaussian distribution is often used to model data on the real line, but is sometimes inappropriate when the data are often close to zero but constrained to be nonnegative. In such cases one can fit an exponential distribution, whose probability density function is given by

$p(x) = \frac{1}{b}\, e^{-x/b}$

Given N observations $x_i$ drawn from such a distribution:

(a) Write down the likelihood as a function of the scale parameter b.
(b) Write down the derivative of the log likelihood.
(c) Give a simple expression for the ML estimate for b.
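A worked sketch of the solution, following directly from the density above:

$L(b) = \prod_{i=1}^{N} \frac{1}{b} e^{-x_i/b} = b^{-N} \exp\!\left(-\frac{1}{b}\sum_{i=1}^{N} x_i\right), \qquad \frac{\partial}{\partial b} \log L(b) = -\frac{N}{b} + \frac{1}{b^2}\sum_{i=1}^{N} x_i.$

Setting the derivative to zero gives the ML estimate $\hat{b} = \frac{1}{N}\sum_{i=1}^{N} x_i$, the sample mean.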

2、 Now suppose the data follow a Poisson distribution instead: $p(x \mid \lambda) = \frac{\lambda^{x}}{x!}\, e^{-\lambda}, \quad x = 0, 1, 2, \ldots$

$l(\lambda) = \sum_{i=1}^{N} \log p(x_i \mid \lambda) = \sum_{i=1}^{N}\left[x_i \log\lambda - \lambda - \log(x_i!)\right] = \left(\sum_{i=1}^{N} x_i\right)\log\lambda - N\lambda - \sum_{i=1}^{N}\log(x_i!)$

3、

二、 Bayes

1、 Suppose that on a multiple-choice exam a student knows the correct answer with probability p and guesses with probability 1 − p. Assume that a student who knows the answer answers correctly with probability 1, while a student who guesses answers correctly with probability 1/m, where m is the number of choices. Given that the student answered the question correctly, find the probability that the student knew the correct answer.

$p(\text{known} \mid \text{correct}) = \frac{p(\text{known}, \text{correct})}{p(\text{correct})} = \frac{p}{p + (1-p)\frac{1}{m}}$
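As an illustrative numeric check with assumed values (not from the original problem), take p = 0.5 and m = 4: the expression gives $\frac{0.5}{0.5 + 0.5 \cdot 0.25} = \frac{0.5}{0.625} = 0.8$, i.e., observing a correct answer raises the probability that the student knew it from 0.5 to 0.8.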

Conjugate priors

The readings for this week include discussion of conjugate priors. Given a likelihood $p(x \mid \theta)$ for a class of models with parameters θ, a conjugate prior is a distribution $p(\theta \mid \gamma)$ with hyperparameters γ, such that the posterior distribution

$p(\theta \mid X, \gamma) \propto p(X \mid \theta)\, p(\theta \mid \gamma) = p(\theta \mid \gamma')$

belongs to the same family of distributions as the prior.

(a) Suppose that the likelihood is given by the exponential distribution with rate parameter λ: $p(x \mid \lambda) = \lambda e^{-\lambda x}$

Show that the gamma distribution

$\mathrm{Gamma}(\lambda \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, \lambda^{\alpha-1} e^{-\beta\lambda}$

is a conjugate prior for the exponential. Derive the parameter update given observations $x_1, \ldots, x_N$ and the prediction distribution $p(x_{N+1} \mid x_1, \ldots, x_N)$.
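A sketch of the expected derivation (using the gamma parameterization written above):

$p(\lambda \mid x_{1:N}) \propto \left(\prod_{i=1}^{N} \lambda e^{-\lambda x_i}\right) \lambda^{\alpha-1} e^{-\beta\lambda} = \lambda^{\alpha+N-1}\, e^{-(\beta + \sum_i x_i)\lambda},$

so the posterior is $\mathrm{Gamma}(\alpha + N,\ \beta + \sum_i x_i)$ and the update is $\alpha \to \alpha + N$, $\beta \to \beta + \sum_i x_i$. Writing $\alpha', \beta'$ for the updated hyperparameters, the prediction distribution is

$p(x_{N+1} \mid x_{1:N}) = \int_0^\infty \lambda e^{-\lambda x_{N+1}}\, \mathrm{Gamma}(\lambda \mid \alpha', \beta')\, d\lambda = \frac{\alpha'\, \beta'^{\,\alpha'}}{(\beta' + x_{N+1})^{\alpha'+1}},$

a Lomax (Pareto type II) density in $x_{N+1}$.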

(b) Show that the beta distribution is a conjugate prior for the geometric distribution

$p(x = k \mid \theta) = (1-\theta)^{k-1}\,\theta$

which describes the number of times a coin is tossed until the first heads appears, when the probability of heads on each toss is θ. Derive the parameter update rule and the prediction distribution.
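A sketch of the expected update, writing the prior as $\mathrm{Beta}(\theta \mid a, b) \propto \theta^{a-1}(1-\theta)^{b-1}$:

$p(\theta \mid k_{1:N}) \propto \left(\prod_{i=1}^{N} \theta\, (1-\theta)^{k_i - 1}\right) \theta^{a-1}(1-\theta)^{b-1} = \theta^{a+N-1}(1-\theta)^{b + \sum_i k_i - N - 1},$

so the posterior is $\mathrm{Beta}(a + N,\ b + \sum_i k_i - N)$. Integrating the geometric likelihood against this posterior gives the prediction distribution

$p(k_{N+1} = k \mid k_{1:N}) = \frac{B(a' + 1,\ b' + k - 1)}{B(a', b')},$

where $a', b'$ are the updated hyperparameters and $B(\cdot,\cdot)$ is the beta function.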

(c) Suppose $p(\theta \mid \gamma)$ is a conjugate prior for the likelihood $p(x \mid \theta)$; show that the mixture prior

$p(\theta \mid \gamma_1, \ldots, \gamma_M) = \sum_{m=1}^{M} w_m\, p(\theta \mid \gamma_m)$

is also conjugate for the same likelihood, assuming the mixture weights $w_m$ sum to 1.
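A sketch of the argument: for a single observation x,

$p(\theta \mid x) \propto p(x \mid \theta) \sum_{m} w_m\, p(\theta \mid \gamma_m) = \sum_m w_m Z_m\, p(\theta \mid \gamma_m'),$

where $p(x \mid \theta)\, p(\theta \mid \gamma_m) = Z_m\, p(\theta \mid \gamma_m')$ uses the conjugacy of each component and $Z_m = \int p(x \mid \theta)\, p(\theta \mid \gamma_m)\, d\theta$ is the component evidence. Normalizing gives a mixture from the same family with updated weights $w_m' \propto w_m Z_m$.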

(d) Repeat part (c) for the case where the prior is a single distribution and the likelihood is a mixture, and the prior is conjugate for each mixture component of the likelihood.

Some priors can be conjugate for several different likelihoods; for example, the beta is conjugate for both the Bernoulli and the geometric distributions, and the gamma is conjugate for the exponential and for the gamma with fixed α.

(e) (Extra credit, 20) Explore the case where the likelihood is a mixture with fixed components and unknown weights; i.e., the weights are the parameters to be learned.

三、 True/False

(1) Given n data points, if half are used for training and half for testing, the difference between training error and test error decreases as n increases.

(2) The maximum likelihood estimator is unbiased and has the smallest variance among all unbiased estimators, so the maximum likelihood estimator has the smallest risk.

(3) For two regression functions A and B, if A is simpler than B, then A will almost certainly perform better than B on the test set.

(4) Global linear regression uses all of the sample points to predict the output for a new input, whereas locally weighted linear regression uses only the samples near the query point; therefore global linear regression is computationally more expensive than local linear regression.

(5) Boosting and Bagging both combine multiple classifiers by voting, and both determine each individual classifier's weight according to its accuracy.

(6) In the boosting iterations, the training error of each new decision stump and the training error of the combined classifier vary roughly in concert (F)

While the training error of the combined classifier typically decreases as a function of boosting iterations, the error of the individual decision stumps typically increases since the example weights become concentrated at the most difficult examples.

(7) One advantage of Boosting is that it does not overfit. (F)

(8) Support vector machines are resistant to outliers, i.e., very noisy examples drawn from a different distribution. (F)

(9) In regression analysis, best-subset selection can perform feature selection, but it is computationally expensive when the number of features is large; ridge regression and the Lasso are computationally cheaper, and the Lasso can also perform feature selection.

(10) Overfitting is more likely to occur when the training data are scarce.

(11) Gradient descent can get stuck in local minima, but the EM algorithm cannot.

(12) In kernel regression, the parameter that most affects the balance between overfitting and underfitting is the width of the kernel.

(13) In the AdaBoost algorithm, the weights on all the misclassified points will go up by the same multiplicative factor. (T)
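For reference on (13), the standard AdaBoost weight update can be written as

$D_{t+1}(i) = \frac{D_t(i)\, \exp\!\left(-\alpha_t\, y_i\, h_t(x_i)\right)}{Z_t},$

so every misclassified point ($y_i h_t(x_i) = -1$) is multiplied by the same factor $e^{\alpha_t}/Z_t$, and every correctly classified point by $e^{-\alpha_t}/Z_t$.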

(14) True/False: In a least-squares linear regression problem, adding an L2 regularization penalty cannot decrease the L2 error of the solution $\hat{w}$ on the training data. (T)

(15) True/False: In a least-squares linear regression problem, adding an L2 regularization penalty always decreases the expected L2 error of the solution $\hat{w}$ on unseen test data. (F)

(16) Besides the EM algorithm, gradient descent can also be used to estimate the parameters of a Gaussian mixture model. (T)

(20) Any decision boundary that we get from a generative model with class-conditional Gaussian distributions could in principle be reproduced with an SVM and a polynomial kernel.

True! In fact, since class-conditional Gaussians always yield quadratic decision boundaries, they can be reproduced with an SVM with kernel of degree less than or equal to two.
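To make the quadratic-boundary claim explicit, for class-conditional Gaussians $\mathcal{N}(\mu_k, \Sigma_k)$ with priors $\pi_k$ the decision boundary satisfies

$\log\frac{p(y=1\mid x)}{p(y=0\mid x)} = -\tfrac{1}{2}\, x^\top\!\left(\Sigma_1^{-1} - \Sigma_0^{-1}\right)x + \left(\Sigma_1^{-1}\mu_1 - \Sigma_0^{-1}\mu_0\right)^{\!\top} x + c = 0,$

where c collects the constant terms; this is at most quadratic in x, and with equal covariances the quadratic term cancels and the boundary is linear.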

(21) AdaBoost will eventually reach zero training error, regardless of the type of weak classifier it uses, provided enough weak classifiers have been combined.

False! If the data is not separable by a linear combination of the weak classifiers, AdaBoost can’t achieve zero training error.

(22) The L2 penalty in a ridge regression is equivalent to a Laplace prior on the weights. (F)

(23) The log-likelihood of the data will always increase through successive iterations of the expectation-maximization algorithm. (F)

(c) Define a class variable $y_i \in \{-1, +1\}$ which denotes the class of $x_i$, and let $w = (w_1, w_2, w_3)^\top$. The max-margin SVM classifier solves the following problem

Using the method of Lagrange multipliers, show that the solution is $w^* = (0, 0, -2)$, $b = 1$, and find the margin.

For optimization problems with inequality constraints such as the above, we should apply the KKT conditions, which are a generalization of Lagrange multipliers. However, this problem can be solved more easily by noting that we have three vectors in 3-dimensional space and all of them are support vectors, so all three constraints hold with equality. We can therefore apply the method of Lagrange multipliers to minimizing $\frac{1}{2} w^\top w$ subject to the equality constraints.
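A sketch of the resulting Lagrangian, with multipliers $\alpha_i$ for the three (now equality) constraints:

$\mathcal{L}(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{3} \alpha_i\left[y_i\left(w^\top x_i + b\right) - 1\right].$

Setting $\partial\mathcal{L}/\partial w = 0$ and $\partial\mathcal{L}/\partial b = 0$ gives $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$, which together with the three equality constraints determine $w$, $b$, and the $\alpha_i$.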

(e) Show that the solution remains the same if the constraints are changed to $y_i\left(w^\top x_i + b\right) \ge \gamma$ for any $\gamma \ge 1$.

(f) (Extra Credit) Is your answer to (d) also true for any dataset and any $\gamma \ge 1$? Provide a counterexample or give a short proof.

SVM

Suppose we only have four training examples in two dimensions (see figure above):

positive examples at x1 = [0, 0], x2 = [2, 2] and negative examples at x3 = [h, 1], x4 = [0, 3], where we treat 0 ≤ h ≤ 3 as a parameter.

(1). How large can h ≥ 0 be so that the training points are still linearly separable? Up to (excluding) h=1

(2). Does the orientation of the maximum margin decision boundary change as a function of h when the points are separable (Y/N)? No, because x1, x2, x3 remain the support vectors.

(3). What is the margin achieved by the maximum margin boundary as a function of h?

[Hint : It turns out that the margin as a function of h is a linear function.]
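A sketch of one possible derivation (assuming, as in (2), that x1, x2, x3 are the support vectors, and taking the margin to be the distance from the boundary to the closest point): with $w = (w_1, w_2)$ and active constraints $w^\top x_1 + b = 1$, $w^\top x_2 + b = 1$, $-(w^\top x_3 + b) = 1$, we get $b = 1$, $w_2 = -w_1$, and $w_1 = \frac{2}{1-h}$, so $\|w\| = \frac{2\sqrt{2}}{1-h}$ and the margin is $\frac{1}{\|w\|} = \frac{1-h}{2\sqrt{2}}$, which is indeed linear in h.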

(4). Assume that we can only observe the second component of the input vectors. Without the other component, the labeled training points reduce to (0,y = 1), (2,y = 1), (1,y = -1), and (3,y =-1). What is the lowest order p of polynomial kernel that would allow us to correctly classify these points?

The classes of the points on the x2-projected line follow the order +1, −1, +1, −1. Therefore, we need a cubic polynomial.
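As a quick sanity check of this claim, here is a minimal sketch assuming scikit-learn and NumPy are available; the specific `degree`, `coef0`, and `C` values are illustrative choices, not part of the original problem. It fits polynomial-kernel SVMs of increasing degree on the projected points:

```python
import numpy as np
from sklearn.svm import SVC

# Projected 1-D training points (second input component only) and their labels.
X = np.array([[0.0], [2.0], [1.0], [3.0]])
y = np.array([1, 1, -1, -1])

# Approximate a hard-margin fit with a large C. Degrees 1 and 2 cannot realize
# the +, -, +, - sign pattern along the line, while degree 3 can.
for degree in (1, 2, 3):
    clf = SVC(kernel="poly", degree=degree, coef0=1.0, C=1e6)
    clf.fit(X, y)
    print(f"degree {degree}: training accuracy = {clf.score(X, y):.2f}")
```

Only the degree-3 kernel is expected to reach 100% training accuracy, matching the cubic-polynomial argument above.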

3、 LDA

Using a set of 100 labeled training examples (two classes), we train the following models:

GaussI : A Gaussian mixture model (one Gaussian per class), where the covariance matrices are both set to I (identity matrix).

GaussX: A Gaussian mixture model (one Gaussian per class) without any restrictions on the covariance matrices.

LinLog: A logistic regression model with linear features.

QuadLog: A logistic regression model, using all linear and quadratic features.

(1) After training, we measure for each model the average log probability of labels given examples in the training set. Specify all the equalities or inequalities that must always hold between the models relative to this performance measure. We are looking for statements like “model 1 <= model 2” or “model 1 = model 2”. If no such statement holds, write “none”.

GaussI <= LinLog (both have logistic posteriors, and LinLog is the logistic model maximizing the average log probabilities)

GaussX <= QuadLog (both have logistic posteriors with quadratic features, and QuadLog is the model of this class maximizing the average log probabilities)

LinLog <= QuadLog (logistic regression models with linear features are a subclass of logistic regression models with quadratic features; the maximum from the superclass is at least as high as the maximum from the subclass)

GaussI <= QuadLog (follows from above inequalities)

(GaussX will have higher average log joint probabilities of examples and labels than will GaussI. But having higher average log joint probabilities does not necessarily translate to higher average log conditional probabilities.)

(2) Which equalities and inequalities must always hold if we instead use the mean classification error in the training set as the performance measure? Again use the format “model 1 <= model 2” or “model 1 = model 2”. Write “none” if no such statement holds.

None. Having higher average log conditional probabilities, or average log joint probabilities, does not necessarily translate to higher or lower classification error. Counterexamples can be constructed for all pairs in both directions.

Although no inequality always holds, it is commonly the case that GaussX <= GaussI and that QuadLog <= LinLog. Partial credit of up to two points was awarded for these inequalities.

5、We consider here generative and discriminative approaches for solving the classification problem illustrated in Figure 4.1. Specifically, we will use a mixture of Gaussians model and regularized logistic regression models.

Figure 4.1. Labeled training set, where “+” corresponds to class y = 1.

(1) We will first estimate a mixture of Gaussians model, one Gaussian per class, with the constraint that the covariance matrices are identity matrices. The mixing proportions (class frequencies) and the means of the two Gaussians are free parameters.

a) Plot the maximum likelihood estimates of the means of the two class conditional Gaussians in Figure 4.1. Mark the means as points “x” and label them “0” and “1” according to the class.

The means should be close to the centers of mass of the points in each class.

b) Draw the decision boundary in the same figure.

Since the two classes have the same number of points and the same covariance matrices, the decision boundary is a line and, moreover, should be drawn as the orthogonal bisector of the line segment connecting the class means.
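To see why, with shared covariance I and equal class priors the boundary is the set of points x equidistant from the two means:

$\|x - \mu_0\|^2 = \|x - \mu_1\|^2 \;\Longleftrightarrow\; (\mu_1 - \mu_0)^\top x = \tfrac{1}{2}\left(\|\mu_1\|^2 - \|\mu_0\|^2\right),$

which is exactly the perpendicular bisector of the segment joining $\mu_0$ and $\mu_1$.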

(2) We have also trained regularized linear logistic regression models for the same data. The regularization penalties, used in penalized conditional log-likelihood estimation, were $-C w_i^2$, where i = 0, 1, 2. In other words, only one of the parameters was regularized in each case. Based on the data in Figure 4.1, we generated three plots, one for each regularized parameter, of the number of misclassified training points as a function of C (Figure 4.2). The three plots are not identified with the corresponding parameters, however. Please assign the “top”, “middle”, and “bottom” plots to the correct parameter, w0, w1, or w2, i.e., the parameter that was regularized in the plot. Provide a brief justification for each assignment.

• “top” = (w1)

By strongly regularizing w1 we force the boundary to be horizontal in the figure. The logistic regression model tries to maximize the log-probability of classifying the data correctly. The highest penalty comes from the misclassified points, and thus the boundary will tend to balance the (worst) errors. In the figure this is roughly the line x2 = 1, resulting in 4 errors.

• “middle” = (w0)

If we regularize w0, then the boundary will eventually go through the origin (bias term set to zero). Based on the figure we can find a good linear boundary through the origin with only one error.

• “bottom” = (w2)

The training error is unaffected if we regularize w2 (constraining the boundary to be vertical); the value of w2 would be small already without regularization.

4、 midterm 2009 problem 4

6、Consider two classifiers: 1) an SVM with a quadratic (second order polynomial) kernel function and 2) an unconstrained mixture of two Gaussians model, one Gaussian per class label. These classifiers try to map examples in R2 to binary labels. We assume that the problem is separable, no slack penalties are added to the SVM classifier, and that we have sufficiently many training examples to estimate the covariance matrices of the two Gaussian components. (1) The two classifiers have the same VC-dimension. (T)

(2) Suppose we evaluated the structural risk minimization score for the two classifiers. The score is the bound on the expected loss of the classifier, when the classifier is estimated on the basis of n training examples. Which of the two classifiers might yield the better (lower) score? Provide a brief justification.

The SVM would probably get a better score. Both classifiers have the same complexity penalty but SVM would better optimize the training error resulting in a lower (or equal) overall score.
