CAS (中科院) Machine Learning Question Bank
Updated: 2024-04-22 20:09:01
Machine Learning Question Bank
I. Maximum Likelihood
1、 ML estimation of exponential model (10)
A Gaussian distribution is often used to model data on the real line, but is sometimes inappropriate when the data are often close to zero but constrained to be nonnegative. In such cases one can fit an exponential distribution, whose probability density function is given by
p(x) = (1/b) e^{−x/b}

Given N observations x_i drawn from such a distribution:
(a) Write down the likelihood as a function of the scale parameter b.
(b) Write down the derivative of the log likelihood.
(c) Give a simple expression for the ML estimate for b.
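For part (c), setting the derivative of the log likelihood −N/b + (Σ x_i)/b² to zero gives b̂ = (1/N) Σ x_i. A quick simulation check (the true scale 2.0 and the sample size are assumed values for this sketch):

```python
import numpy as np

# Recover the scale b of an exponential density p(x) = (1/b) exp(-x/b)
# from simulated data; b_true = 2.0 and N = 100_000 are assumed values.
rng = np.random.default_rng(0)
b_true = 2.0
x = rng.exponential(scale=b_true, size=100_000)

b_hat = x.mean()   # closed-form ML estimate: b_hat = (1/N) * sum(x_i)
print(b_hat)       # close to b_true
```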
2. The same question with a Poisson distribution: p(x|λ) = λ^x e^{−λ} / x!, x = 0, 1, 2, …

l(λ) = Σ_{i=1}^N log p(x_i|λ) = Σ_{i=1}^N [x_i log λ − λ − log(x_i!)] = (Σ_{i=1}^N x_i) log λ − Nλ − Σ_{i=1}^N log(x_i!)

3.

II. Bayes
1. Suppose that on a multiple-choice exam a student knows the correct answer with probability p and guesses with probability 1 − p. A student who knows the answer is correct with probability 1, while one who guesses picks the correct answer with probability 1/m, where m is the number of choices. Given that the student answered a question correctly, find the probability that he actually knew the answer.
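By Bayes' rule the answer is p / (p + (1 − p)/m). A quick numeric check (p = 0.6 and m = 4 are arbitrary assumed values):

```python
# Bayes' rule for P(known | correct); p = 0.6 and m = 4 are assumed values.
p, m = 0.6, 4
p_correct = p * 1.0 + (1 - p) * (1 / m)   # total probability of a correct answer
posterior = p / p_correct                 # P(known | correct)
print(posterior)
```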
p(known|correct) = p(known, correct) / p(correct) = p / (p + (1 − p)(1/m))

Conjugate priors
The readings for this week include a discussion of conjugate priors. Given a likelihood p(x|θ) for a class of models with parameters θ, a conjugate prior is a distribution p(θ|γ) with hyperparameters γ such that the posterior distribution

p(θ|X, γ) ∝ p(X|θ) p(θ|γ) = p(θ|γ′)

belongs to the same family of distributions as the prior.
(a) Suppose that the likelihood is given by the exponential distribution with rate parameter λ: p(x|λ) = λ e^{−λx}
Show that the gamma distribution

Gamma(λ|α, β) = (β^α / Γ(α)) λ^{α−1} e^{−βλ}

is a conjugate prior for the exponential. Derive the parameter update given observations x_1, …, x_N and the predictive distribution p(x_{N+1}|x_1, …, x_N).
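Working through (a) gives the update α_N = α + N, β_N = β + Σ x_i. A numerical sketch of that update (all concrete numbers below are assumptions), checking that prior × likelihood peaks at the Gamma(α_N, β_N) mode:

```python
import numpy as np

# Gamma(alpha, beta) prior (rate parameterization) on the exponential rate λ;
# the hyperparameters and data below are illustrative assumptions.
alpha, beta = 2.0, 1.0
x = np.array([0.5, 1.2, 0.3])

alpha_post = alpha + len(x)    # alpha_N = alpha + N
beta_post = beta + x.sum()     # beta_N = beta + sum(x_i)

# Sanity check: prior * likelihood should peak at the Gamma(alpha_N, beta_N)
# mode, (alpha_N - 1) / beta_N.
lam = np.linspace(1e-3, 10.0, 5000)
prior = lam ** (alpha - 1) * np.exp(-beta * lam)
lik = lam ** len(x) * np.exp(-lam * x.sum())
mode = lam[(prior * lik).argmax()]
print(mode, (alpha_post - 1) / beta_post)   # both about 4/3
```

Integrating the exponential likelihood against the Gamma(α_N, β_N) posterior gives the predictive p(x_{N+1}|x_1, …, x_N) = α_N β_N^{α_N} / (β_N + x_{N+1})^{α_N + 1}, a Lomax-type density.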
(b) Show that the beta distribution is a conjugate prior for the geometric distribution
p(x = k|θ) = θ (1 − θ)^{k−1}
which describes the number of times a coin is tossed until the first heads appears, when the probability of heads on each toss is θ. Derive the parameter update rule and the predictive distribution.
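Working through (b): with a Beta(a, b) prior, each geometric observation k_i contributes one success (the final heads) and k_i − 1 failures, so a_N = a + N and b_N = b + Σ(k_i − 1). A numerical sketch under assumed values:

```python
import numpy as np

# Beta(a, b) prior on the heads probability θ; observations k_i are the
# number of tosses until the first heads (illustrative numbers).
a, b = 1.0, 1.0
k = np.array([3, 1, 5, 2])

a_post = a + len(k)            # one heads per observation
b_post = b + (k - 1).sum()     # tails before each first heads

# Sanity check: prior * likelihood peaks at the Beta(a_N, b_N) mode.
theta = np.linspace(1e-3, 1 - 1e-3, 5000)
prior = theta ** (a - 1) * (1 - theta) ** (b - 1)
lik = theta ** len(k) * (1 - theta) ** (k - 1).sum()
mode = theta[(prior * lik).argmax()]
print(mode, (a_post - 1) / (a_post + b_post - 2))   # both about 4/11
```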
(c) Suppose p(θ|γ) is a conjugate prior for the likelihood p(x|θ); show that the mixture prior

p(θ|γ_1, …, γ_M) = Σ_{m=1}^M w_m p(θ|γ_m)

is also conjugate for the same likelihood, assuming the mixture weights w_m sum to 1.
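The mechanics of part (c), illustrated for a Beta-mixture prior with a binomial likelihood: each component updates as usual, and the mixture weights are reweighted by their marginal likelihoods. All concrete numbers below are assumptions of this sketch:

```python
import numpy as np
from scipy.special import betaln
from scipy.stats import beta as beta_dist

# Mixture of two Beta priors for a binomial likelihood (illustrative numbers).
w = np.array([0.7, 0.3])
a = np.array([2.0, 8.0])
b = np.array([5.0, 2.0])
k, n = 6, 10                   # observed heads, total tosses

# Each posterior component is Beta(a_m + k, b_m + n - k); the weights are
# reweighted by each component's marginal likelihood.
log_w = np.log(w) + betaln(a + k, b + n - k) - betaln(a, b)
w_post = np.exp(log_w - log_w.max())
w_post /= w_post.sum()

# Numerical check: the grid posterior built from prior * likelihood matches
# the analytic Beta mixture with the updated weights.
t = np.linspace(1e-4, 1 - 1e-4, 2000)
prior = w[0] * beta_dist.pdf(t, a[0], b[0]) + w[1] * beta_dist.pdf(t, a[1], b[1])
unnorm = prior * t ** k * (1 - t) ** (n - k)
post_grid = unnorm / (unnorm.sum() * (t[1] - t[0]))
post_mix = (w_post[0] * beta_dist.pdf(t, a[0] + k, b[0] + n - k)
            + w_post[1] * beta_dist.pdf(t, a[1] + k, b[1] + n - k))
print(np.max(np.abs(post_grid - post_mix)))   # ~0
```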
(d) Repeat part (c) for the case where the prior is a single distribution and the likelihood is a mixture, and the prior is conjugate for each mixture component of the likelihood.
Note that some priors can be conjugate for several different likelihoods; for example, the beta is conjugate for both the Bernoulli and the geometric distributions, and the gamma is conjugate for the exponential and for the gamma with fixed α.
(e) (Extra credit, 20) Explore the case where the likelihood is a mixture with fixed components and unknown weights; i.e., the weights are the parameters to be learned.
III. True/False Questions
(1) Given n data points, if half are used for training and the other half for testing, then the gap between the training error and the test error decreases as n increases.
(2) Maximum likelihood estimation is unbiased and has the smallest variance among all unbiased estimators, so the maximum likelihood estimator has the smallest risk.
(3) For regression functions A and B, if A is simpler than B, then A will almost certainly perform better than B on the test set.
(4) Global linear regression uses all of the sample points to predict the output for a new input, whereas locally weighted linear regression uses only the samples near the query point, so global linear regression is computationally more expensive than local linear regression.
(5) Boosting and Bagging both combine multiple classifiers by voting, and both determine the weight of each individual classifier according to its accuracy.
(6) In the boosting iterations, the training error of each new decision stump and the training error of the combined classifier vary roughly in concert (F)
While the training error of the combined classifier typically decreases as a function of boosting iterations, the error of the individual decision stumps typically increases since the example weights become concentrated at the most difficult examples.
(7) One advantage of Boosting is that it does not overfit. (F)
(8) Support vector machines are resistant to outliers, i.e., very noisy examples drawn from a different distribution. (F)
(9) In regression analysis, best-subset selection can perform feature selection, but it is computationally expensive when the number of features is large; ridge regression and the Lasso are computationally cheaper, and the Lasso can also perform feature selection.
(10) Overfitting is more likely to occur when there is little training data.
(11) Gradient descent can get trapped in local minima, but the EM algorithm cannot.
(12) In kernel regression, the parameter with the greatest influence on the balance between overfitting and underfitting is the width of the kernel.
(13) In the AdaBoost algorithm, the weights on all the misclassified points will go up by the same multiplicative factor. (T)
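The claim in (13) follows directly from the update D_{t+1}(i) ∝ D_t(i) e^{±α_t}: every misclassified point is scaled by the same factor e^{α_t}. A tiny numeric sketch (the weights and stump outcomes are made-up assumptions):

```python
import numpy as np

# One AdaBoost reweighting step on five equally weighted points; which points
# the stump gets right is an arbitrary assumption for the illustration.
w = np.full(5, 0.2)
correct = np.array([True, True, False, True, False])

eps = w[~correct].sum()                    # weighted training error of the stump
alpha = 0.5 * np.log((1 - eps) / eps)      # stump weight
w_new = w * np.exp(np.where(correct, -alpha, alpha))
w_new /= w_new.sum()                       # renormalize

print(w_new)   # the two misclassified points share one common, larger weight
```

After renormalization the misclassified points together carry exactly half of the total weight, which is what forces the next stump to attend to them.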
(14) True/False: In a least-squares linear regression problem, adding an L2 regularization penalty cannot decrease the L2 error of the solution w* on the training data. (T; the unregularized least-squares solution already minimizes the training L2 error, so the penalized solution can only match or increase it.)
(15) True/False: In a least-squares linear regression problem, adding an L2 regularization penalty always decreases the expected L2 error of the solution w* on unseen test data. (F)
(16) Besides the EM algorithm, gradient descent can also be used to fit the parameters of a Gaussian mixture model. (T)
(20) Any decision boundary that we get from a generative model with
class-conditional Gaussian distributions could in principle be reproduced with an SVM and a polynomial kernel.
True! In fact, since class-conditional Gaussians always yield quadratic (or degenerate linear) decision boundaries, they can be reproduced with an SVM with a polynomial kernel of degree at most two.
(21) AdaBoost will eventually reach zero training error, regardless of the type of weak
classifier it uses, provided enough weak classifiers have been combined.
False! If the data is not separable by a linear combination of the weak classifiers, AdaBoost can’t achieve zero training error.
(22) The L2 penalty in ridge regression is equivalent to a Laplace prior on the weights. (F; it corresponds to a Gaussian prior, while the L1 penalty corresponds to a Laplace prior.)
(23) The log-likelihood of the data will always increase through successive iterations of the expectation maximization algorithm. (F; it is non-decreasing, and can remain unchanged at a fixed point.)
(c) Define a class variable y_i ∈ {−1, +1} which denotes the class of x_i, and let w = (w1, w2, w3)^T. The max-margin SVM classifier solves the following problem:
Using the method of Lagrange multipliers, show that the solution is w* = (0, 0, −2), b = 1, and that the margin is 1/‖w*‖ = 1/2.
For optimization problems with inequality constraints such as the above, we should apply the KKT conditions, which generalize Lagrange multipliers. However, this problem can be solved more easily by noting that we have three vectors in 3-dimensional space and all of them are support vectors, so all three constraints hold with equality. We can therefore apply the method of Lagrange multipliers.
The objective is to minimize ½ wᵀw subject to the three equality constraints.
(e) Show that the solution remains the same if the constraints are changed to y_i(wᵀx_i + b) ≥ γ, for any γ ≥ 1.
(f) (Extra Credit) Is your answer to (d) also true for any dataset and γ ≥ 1? Provide a counterexample or give a short proof.
SVM
Suppose we only have four training examples in two dimensions (see figure above):
positive examples at x1 = [0, 0] , x2 = [2, 2] and negative examples at x3 = [h, 1] , x4 = [0, 3], where we treat 0 ≤ h ≤ 3 as a parameter
(1). How large can h ≥ 0 be so that the training points are still linearly separable? Up to (excluding) h=1
(2). Does the orientation of the maximum margin decision boundary change as a function of h when the points are separable (Y/N)? No, because x1, x2, x3 remain the support vectors.
(3). What is the margin achieved by the maximum margin boundary as a function of h?
[Hint : It turns out that the margin as a function of h is a linear function.]
(4). Assume that we can only observe the second component of the input vectors. Without the other component, the labeled training points reduce to (0,y = 1), (2,y = 1), (1,y = -1), and (3,y =-1). What is the lowest order p of polynomial kernel that would allow us to correctly classify these points?
The classes of the points on the x2-projected line observe the order 1,-1,1,-1. Therefore, we need a cubic polynomial.
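To see why a cubic suffices, note that the projected labels alternate +, −, +, − along the line, so any separating polynomial needs at least three sign changes. A quick check with an explicitly chosen cubic (the particular roots 0.5, 1.5, 2.5 are just one convenient assumed choice):

```python
import numpy as np

# 1-D projected training points and labels from part (4).
x = np.array([0.0, 2.0, 1.0, 3.0])
y = np.array([1, 1, -1, -1])

# A cubic with one root between each pair of consecutive points separates
# the classes; the specific roots below are an assumed convenient choice.
f = -(x - 0.5) * (x - 1.5) * (x - 2.5)
print(np.sign(f).astype(int))   # matches y
```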
3、 LDA
Using a set of 100 labeled training examples (two classes), we train the following models:
GaussI : A Gaussian mixture model (one Gaussian per class), where the covariance matrices are both set to I (identity matrix).
GaussX: A Gaussian mixture model (one Gaussian per class) without any restrictions on the covariance matrices.
LinLog: A logistic regression model with linear features.
QuadLog: A logistic regression model, using all linear and quadratic features.
(1) After training, we measure for each model the average log probability of labels given examples in
the training set. Specify all the equalities or inequalities that must always hold between the models relative to this performance measure. We are looking for statements like “model 1 <= model 2” or “model 1 = model 2”. If no such statement holds, write “none”.
GaussI <= LinLog (both have logistic posteriors, and LinLog is the logistic model maximizing the average log conditional probabilities)
GaussX <= QuadLog (both have logistic posteriors with quadratic features, and QuadLog is the model of this class maximizing the average log conditional probabilities)
LinLog <= QuadLog (logistic regression models with linear features are a subclass of logistic regression models with quadratic features; the maximum over the superclass is at least as high as the maximum over the subclass)
GaussI <= QuadLog (follows from the above inequalities)
(GaussX will have a higher average log joint probability of examples and labels than GaussI, but a higher average log joint probability does not necessarily translate into a higher average log conditional probability.)
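The GaussI <= LinLog inequality can be checked empirically: fit both models on the same data and compare the average log conditional probabilities. The synthetic data below, and the use of a large C to approximate unregularized logistic regression, are assumptions of this sketch:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic overlapping two-class data (all numbers here are assumptions).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.5, size=(50, 2)),
               rng.normal([2.0, 1.0], 1.5, size=(50, 2))])
y = np.repeat([0, 1], 50)

# GaussI: one Gaussian per class with identity covariance and equal priors
# induces a linear-logistic posterior with weights mu1 - mu0.
mu = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])
logit = X @ (mu[1] - mu[0]) + 0.5 * (mu[0] @ mu[0] - mu[1] @ mu[1])
p1 = 1.0 / (1.0 + np.exp(-logit))
gauss_avg = np.mean(np.where(y == 1, np.log(p1), np.log(1 - p1)))

# LinLog: (nearly) unregularized logistic regression directly maximizes the
# average log conditional probability over all linear-logistic models.
clf = LogisticRegression(C=1e6, max_iter=10_000).fit(X, y)
lin_avg = np.mean(np.log(clf.predict_proba(X)[np.arange(len(y)), y]))

print(gauss_avg, lin_avg)   # lin_avg >= gauss_avg (up to optimizer tolerance)
```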
(2) Which equalities and inequalities must always hold if we instead use the mean classification error
in the training set as the performance measure? Again use the format “model 1 <= model 2” or “model 1 = model 2”. Write “none” if no such statement holds.
None. Having higher average log conditional probabilities, or average log joint probabilities, does not necessarily translate to higher or lower classification error. Counterexamples can be constructed for all pairs in both directions.
Although no inequality always holds, it is commonly the case that GaussX <= GaussI and that QuadLog <= LinLog. Partial credit of up to two points was awarded for these inequalities.
5、We consider here generative and discriminative approaches for solving the classification problem illustrated in Figure 4.1. Specifically, we will use a mixture of Gaussians model and regularized logistic regression models.
Figure 4.1. Labeled training set, where “+” corresponds to class y = 1.
(1) We will first estimate a mixture of Gaussians model, one Gaussian per class, with the constraint
that the covariance matrices are identity matrices. The mixing proportions (class frequencies) and the means of the two Gaussians are free parameters.
a) Plot the maximum likelihood estimates of the means of the two class conditional Gaussians in Figure 4.1. Mark the means as points “x” and label them “0” and “1” according to the class.
The means should be close to the centers of mass of the two classes of points.
b) Draw the decision boundary in the same figure.
Since the two classes have the same number of points and the same covariance matrices, the decision boundary is a line; moreover, it should be drawn as the perpendicular bisector of the line segment connecting the class means.
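Concretely, with identity covariances and equal priors the boundary is {x : w·x + b = 0} with normal w = μ1 − μ0, passing through the midpoint of the means. A minimal sketch with made-up means:

```python
import numpy as np

# Equal priors, identity covariances: the decision boundary is the
# perpendicular bisector of the segment joining the class means.
mu0 = np.array([0.0, 0.0])     # assumed class-0 mean
mu1 = np.array([2.0, 2.0])     # assumed class-1 mean

w = mu1 - mu0                  # boundary normal
midpoint = (mu0 + mu1) / 2
b = -w @ midpoint              # boundary: w . x + b = 0

print(w @ midpoint + b)        # 0.0: the midpoint lies on the boundary
print(w @ mu0 + b, w @ mu1 + b)  # the two means fall on opposite sides
```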
(2) We have also trained regularized linear logistic regression models
for the same data. The regularization penalties, used in penalized conditional log-likelihood estimation, were −C·w_i², where i = 0, 1, 2. In other words, only one of the parameters was regularized in each case. Based on the data in Figure 4.1, we generated three plots, one for each regularized parameter, of the number of misclassified training points as a function of C (Figure 4.2). The three plots are not identified with the corresponding parameters, however. Please assign the “top”, “middle”, and “bottom” plots to the correct parameter, w0, w1, or w2, the parameter that was regularized in the plot. Provide a brief justification for each assignment.
• “top” = w1
By strongly regularizing w1 we force the boundary to be horizontal in the figure. The logistic regression model tries to maximize the log-probability of classifying the data correctly. The highest penalty comes from the misclassified points, so the boundary will tend to balance the (worst) errors. In the figure this is roughly the line x2 = 1, resulting in 4 errors.
• “middle” = w0
If we regularize w0, the boundary will eventually pass through the origin (bias term set to zero). Based on the figure we can find a good linear boundary through the origin with only one error.
• “bottom” = w2
The training error is unaffected if we regularize w2 (constraining the boundary to be vertical); the value of w2 would be small even without regularization.
4、 midterm 2009 problem 4
6. Consider two classifiers: 1) an SVM with a quadratic (second-order polynomial) kernel function, and 2) an unconstrained mixture of two Gaussians model, one Gaussian per class label. These classifiers map examples in R² to binary labels. We assume that the problem is separable, that no slack penalties are added to the SVM classifier, and that we have sufficiently many training examples to estimate the covariance matrices of the two Gaussian components.
(1) The two classifiers have the same VC-dimension. (T)
(2) Suppose we evaluated the structural risk minimization score for the two classifiers. The score is
the bound on the expected loss of the classifier, when the classifier is estimated on the basis of n training examples. Which of the two classifiers might yield the better (lower) score? Provide a brief justification.
The SVM would probably get a better score. Both classifiers have the same complexity penalty but SVM would better optimize the training error resulting in a lower (or equal) overall score.