
Machine Learning Problem Bank

I. Maximum Likelihood

1. ML estimation of an exponential model (10 points)

A Gaussian distribution is often used to model data on the real line, but is sometimes inappropriate when the data are often close to zero but constrained to be nonnegative. In such cases one can fit an exponential distribution, whose probability density function is given by

p(x) = \frac{1}{b} e^{-x/b}

Given N observations x_i drawn from such a distribution:

(a) Write down the likelihood as a function of the scale parameter b.
(b) Write down the derivative of the log likelihood.
(c) Give a simple expression for the ML estimate for b.
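A sketch of the solution (added here; not part of the original problem statement), assuming the density p(x) = (1/b) e^{-x/b} given above:

(a) \; L(b) = \prod_{i=1}^{N} \frac{1}{b} e^{-x_i/b} = b^{-N} \exp\!\Big(-\frac{1}{b}\sum_{i=1}^{N} x_i\Big),
\qquad
(b) \; \frac{d}{db}\log L(b) = -\frac{N}{b} + \frac{1}{b^2}\sum_{i=1}^{N} x_i,
\qquad
(c) \; \hat{b}_{\mathrm{ML}} = \frac{1}{N}\sum_{i=1}^{N} x_i .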

2. The same question with a Poisson distribution: p(x \mid \lambda) = \frac{\lambda^{x}}{x!} e^{-\lambda}, \quad x = 0, 1, 2, \ldots

\ell(\lambda) = \sum_{i=1}^{N} \log p(x_i \mid \lambda) = \sum_{i=1}^{N} \big[ x_i \log\lambda - \lambda - \log(x_i!) \big] = \Big(\sum_{i=1}^{N} x_i\Big)\log\lambda - N\lambda - \sum_{i=1}^{N}\log(x_i!)

3.

II. Bayes

Suppose that on a multiple-choice exam, a student knows the correct answer with probability p and guesses with probability 1 - p. Assume that a student who knows the answer is correct with probability 1, while a student who guesses is correct with probability 1/m, where m is the number of choices. Given that the student answered the question correctly, find the probability that the student knew the correct answer.

p(\text{known} \mid \text{correct}) = \frac{p(\text{known}, \text{correct})}{p(\text{correct})} = \frac{p \cdot 1}{p \cdot 1 + (1-p)\cdot \frac{1}{m}}

1. Conjugate priors

The readings for this week include discussion of conjugate priors. Given a likelihood p(x | θ) for a class of models with parameters θ, a conjugate prior is a distribution p(θ | γ) with hyperparameters γ, such that the posterior distribution

p(\theta \mid X, \gamma) \propto p(X \mid \theta)\, p(\theta \mid \gamma) = p(\theta \mid \gamma')

belongs to the same family of distributions as the prior.

(a) Suppose that the likelihood is given by the exponential distribution with rate parameter λ: p(x \mid \lambda) = \lambda e^{-\lambda x}

Show that the gamma distribution

\mathrm{Gamma}(\lambda \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, \lambda^{\alpha-1} e^{-\beta\lambda}

is a conjugate prior for the exponential. Derive the parameter update given observations x_1, ..., x_N and the prediction distribution p(x_{N+1} | x_1, ..., x_N).
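A sketch of the derivation (added here; not from the original answer key). Multiplying the exponential likelihood by the gamma prior gives

\prod_{i=1}^{N}\lambda e^{-\lambda x_i}\;\cdot\;\lambda^{\alpha-1}e^{-\beta\lambda} \;\propto\; \lambda^{\alpha+N-1}\,e^{-(\beta+\sum_i x_i)\lambda},

so the posterior is \mathrm{Gamma}(\lambda \mid \alpha', \beta') with updated parameters \alpha' = \alpha + N and \beta' = \beta + \sum_i x_i. The prediction distribution is

p(x_{N+1}\mid x_1,\ldots,x_N) = \int_0^{\infty} \lambda e^{-\lambda x_{N+1}}\,\mathrm{Gamma}(\lambda\mid\alpha',\beta')\,d\lambda = \frac{\alpha'\,(\beta')^{\alpha'}}{(\beta' + x_{N+1})^{\alpha'+1}}.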

(b) Show that the beta distribution is a conjugate prior for the geometric distribution

p(x = k \mid \theta) = \theta\,(1-\theta)^{k-1}

which describes the number of times a coin is tossed until the first head appears, when the probability of heads on each toss is θ. Derive the parameter update rule and the prediction distribution.
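A sketch of the update for (b) (added here; not from the original answer key), using a Beta(θ | a, b) prior and observations k_1, ..., k_N:

p(\theta \mid k_1,\ldots,k_N) \;\propto\; \theta^{a-1}(1-\theta)^{b-1}\prod_{i=1}^{N}\theta(1-\theta)^{k_i-1} \;=\; \theta^{a+N-1}\,(1-\theta)^{\,b+\sum_i k_i - N - 1},

so the posterior is \mathrm{Beta}(a', b') with a' = a + N and b' = b + \sum_i k_i - N. The prediction distribution is

p(k_{N+1} = k \mid k_1,\ldots,k_N) = \int_0^1 \theta(1-\theta)^{k-1}\,\mathrm{Beta}(\theta\mid a',b')\,d\theta = \frac{B(a'+1,\; b'+k-1)}{B(a', b')}.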

(c) Suppose p(θ | γ) is a conjugate prior for the likelihood p(x | θ); show that the mixture prior

p(\theta \mid \gamma_1, \ldots, \gamma_M) = \sum_{m=1}^{M} w_m\, p(\theta \mid \gamma_m)

is also conjugate for the same likelihood, assuming the mixture weights w_m sum to 1.
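A sketch of the argument for (c) (added here; not from the original answer key). Since each component posterior stays in its own family, the posterior under the mixture prior is again a mixture of the same form:

p(\theta \mid X, \gamma_1,\ldots,\gamma_M) \;\propto\; \sum_{m=1}^{M} w_m\, p(X\mid\theta)\, p(\theta\mid\gamma_m) \;=\; \sum_{m=1}^{M} w_m Z_m\, p(\theta\mid\gamma_m'),
\qquad Z_m = \int p(X\mid\theta)\, p(\theta\mid\gamma_m)\, d\theta,

so after normalization the posterior is \sum_m w_m'\, p(\theta \mid \gamma_m') with updated weights w_m' = w_m Z_m / \sum_{m'} w_{m'} Z_{m'}.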

(d) Repeat part (c) for the case where the prior is a single distribution and the likelihood is a mixture, and the prior is conjugate for each mixture component of the likelihood.

Note that some priors can be conjugate for several different likelihoods; for example, the beta is conjugate for the Bernoulli and the geometric distributions, and the gamma is conjugate for the exponential and for the gamma with fixed α.

(e) (Extra credit, 20) Explore the case where the likelihood is a mixture with fixed components and unknown weights; i.e., the weights are the parameters to be learned.

III. True/False

(1) Given n data points, if half are used for training and the other half for testing, the gap between the training error and the test error decreases as n increases.

(2) The maximum likelihood estimator is unbiased and has the smallest variance among all unbiased estimators, so the maximum likelihood estimator has the smallest risk.
(3) Given regression functions A and B, if A is simpler than B, then A will almost certainly perform better than B on the test set.
(4) Global linear regression uses all of the training points to predict the output for a new input, while local linear regression only uses the training points near the query point; therefore global linear regression is computationally more expensive than local linear regression.

(5) Boosting and Bagging both combine multiple classifiers by voting, and both determine the weight of each individual classifier according to its accuracy.

(6) In the boosting iterations, the training error of each new decision stump and the training error of the combined classifier vary roughly in concert (F)

While the training error of the combined classifier typically decreases as a function of boosting iterations, the error of the individual decision stumps typically increases since the example weights become concentrated at the most difficult examples.

(7) One advantage of Boosting is that it does not overfit. (F)

(8) Support vector machines are resistant to outliers, i.e., very noisy examples drawn from a different distribution. (F)

(9) In regression analysis, best-subset selection can perform feature selection but is computationally expensive when the number of features is large; ridge regression and the Lasso are computationally cheaper, and the Lasso can also perform feature selection.

(10) Overfitting is more likely to occur when the amount of training data is small.

(11) Gradient descent sometimes gets stuck in local minima, but the EM algorithm does not.

(12) In kernel regression, the parameter that most affects the balance between overfitting and underfitting is the width of the kernel.
(13) In the AdaBoost algorithm, the weights on all the misclassified points will go up by the same multiplicative factor. (T)

(14) True/False: In a least-squares linear regression problem, adding an L2 regularization penalty cannot decrease the L2 error of the solution ŵ on the training data. (T)

(15) True/False: In a least-squares linear regression problem, adding an L2 regularization penalty always decreases the expected L2 error of the solution ŵ on unseen test data. (F)
(16) Besides the EM algorithm, gradient descent can also be used to estimate the parameters of a Gaussian mixture model. (T)

(20) Any decision boundary that we get from a generative model with class-conditional Gaussian distributions could in principle be reproduced with an SVM and a polynomial kernel. (T)

True! In fact, since class-conditional Gaussians always yield quadratic decision boundaries, they can be reproduced with an SVM with kernel of degree less than or equal to two.

(21) AdaBoost will eventually reach zero training error, regardless of the type of weak classifier it uses, provided enough weak classifiers have been combined.

False! If the data is not separable by a linear combination of the weak classifiers, AdaBoost can’t achieve zero training error.

(22) The L2 penalty in a ridge regression is equivalent to a Laplace prior on the weights. (F)

(23) The log-likelihood of the data will always increase through successive iterations of the expectation maximization algorithm. (T)

(24) In training a logistic regression model by maximizing the likelihood of the labels given the inputs we have multiple locally optimal solutions. (F)

I. Regression

1. Consider a regularized regression problem. The figure below shows the log-likelihood (mean log-probability) on the training set and on the test set for different values of the regularization parameter C, when the penalty is a quadratic regularizer. (10 points)

(1) Is the statement "the training-set log-likelihood in Figure 2 never increases as C increases" correct? Explain why or why not.
(2) Explain why the test-set log-likelihood in Figure 2 decreases when C takes large values.

2. Consider the linear regression model y ~ N(w_0 + w_1 x, σ²) (10 points); the training data are shown in the figure below.

(1) Estimate the parameters by maximum likelihood, and sketch the resulting model in figure (a). (3 points)

(2) Estimate the parameters by regularized maximum likelihood, i.e., add the regularization penalty -\frac{C}{2} w_1^2 to the log-likelihood objective, and sketch in figure (b) the model obtained when the parameter C is very large. (3 points)

(3) After regularization, does the variance σ² of the Gaussian increase, decrease, or stay the same? (4 points)

[Figure (a)]  [Figure (b)]

3. Consider a regression problem on two-dimensional inputs x = (x_1, x_2)^T, where x_j ∈ [-1, 1], j = 1, 2, i.e., x lies in the unit square. Training and test samples are uniformly distributed in the unit square, and the output is generated by

y \sim N\big(x_1^3 x_2 - 10\,x_1 x_2 + 7\,x_1^2 - 5\,x_2 + 3,\; 1\big).

We use a linear regression model with polynomial features of degree 1 to 10 to learn the relationship between x and y (a higher-degree feature model contains all lower-degree features), with squared-error loss.

(1) We now train models with degree-1, degree-2, degree-8, and degree-10 features on n = 20 samples and test them on a large, independent test set. For each of the three columns below, choose the appropriate model(s) (there may be more than one choice), and explain why the model you chose for the third column has the smallest test error. (10 points)

Columns: smallest training error / largest training error / smallest test error
Linear model with degree-1 features: X
Linear model with degree-2 features: X
Linear model with degree-8 features: X
Linear model with degree-10 features: X

(2) We now train models with degree-1, degree-2, degree-8, and degree-10 features on n = 10 samples and test them on a large, independent test set. For each of the three columns below, choose the appropriate model(s) (there may be more than one choice), and explain why the model you chose for the third column has the smallest test error. (10 points)

Columns: smallest training error / largest training error / smallest test error
Linear model with degree-1 features: X
Linear model with degree-2 features:
Linear model with degree-8 features: X X
Linear model with degree-10 features: X

(3) The approximation error of a polynomial regression model depends on the number of training points. (T)

(4) The structural error of a polynomial regression model depends on the number of training points. (F)
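A small experiment along the lines of parts (1) and (2), added for illustration and not part of the original exam. It assumes the mean function reconstructed above and uses scikit-learn's PolynomialFeatures and LinearRegression; the exact numbers depend on the random seed.

```python
# Illustrative sketch: training and test MSE of polynomial regression models of
# degree 1, 2, 8, and 10 on data generated from the target function assumed above.
# Requires numpy and scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def target(x):
    # Mean function as reconstructed in the problem statement above.
    x1, x2 = x[:, 0], x[:, 1]
    return x1**3 * x2 - 10 * x1 * x2 + 7 * x1**2 - 5 * x2 + 3

def make_data(n):
    x = rng.uniform(-1.0, 1.0, size=(n, 2))        # uniform on the unit square
    y = target(x) + rng.normal(0.0, 1.0, size=n)   # unit-variance Gaussian noise
    return x, y

x_test, y_test = make_data(10000)
for n in (20, 10):
    x_train, y_train = make_data(n)
    for degree in (1, 2, 8, 10):
        phi = PolynomialFeatures(degree)           # all terms up to 'degree'
        model = LinearRegression().fit(phi.fit_transform(x_train), y_train)
        train_mse = np.mean((model.predict(phi.transform(x_train)) - y_train) ** 2)
        test_mse = np.mean((model.predict(phi.transform(x_test)) - y_test) ** 2)
        print(f"n={n:2d} degree={degree:2d} train MSE={train_mse:8.3f} test MSE={test_mse:10.3f}")
```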

4. We are trying to learn regression parameters for a dataset which we know was generated from a polynomial of a certain degree, but we do not know what this degree is. Assume the data was actually generated from a polynomial of degree 5 with some added Gaussian noise, that is,

y = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4 + w_5 x^5 + \epsilon, \quad \epsilon \sim N(0, 1).

For training we have 100 {x,y} pairs and for testing we are using an additional set of 100 {x,y} pairs. Since we do not know the degree of the polynomial we learn two models from the data. Model A learns parameters for a polynomial of degree 4 and model B learns parameters for a polynomial of degree 6. Which of these two models is likely to fit the test data better?

Answer: The degree-6 polynomial. Since the true model is a degree-5 polynomial and we have enough training data, the degree-6 model we learn will likely fit a very small coefficient for x^6. Thus, even though it is a degree-6 polynomial, it will behave very similarly to a degree-5 polynomial, which is the correct model, leading to a better fit to the data.
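A quick numerical check of this answer, added for illustration and not from the original solution. It assumes the generative model above with randomly drawn coefficients and uses numpy's polynomial fitting routines.

```python
# Illustrative sketch: fit degree-4 and degree-6 polynomials to data generated
# from a degree-5 polynomial with unit-variance Gaussian noise, and compare
# their mean squared error on held-out test data. Requires numpy.
import numpy as np

rng = np.random.default_rng(1)
w_true = rng.normal(size=6)                        # coefficients w0..w5 (assumed)

def sample(n):
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.polyval(w_true[::-1], x) + rng.normal(0.0, 1.0, size=n)
    return x, y

x_train, y_train = sample(100)                     # 100 training pairs, as in the problem
x_test, y_test = sample(100)                       # 100 test pairs

for degree in (4, 6):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: test MSE = {test_mse:.3f}")
```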

5、Input-dependent noise in regression

Ordinary least-squares regression is equivalent to assuming that each data point is generated according to a linear function of the input plus zero-mean, constant-variance Gaussian noise. In many systems, however, the noise variance is itself a positive linear function of the input (which is assumed to be non-negative, i.e., x >= 0).

a) Which of the following families of probability models correctly describes this situation in the univariate case? (Hint: only one of them does.)

(iii) is correct. In a Gaussian distribution over y, the variance is determined by the coefficient of y²; so by replacing σ² by xσ², we get a variance that increases linearly with x. (Note also the change to the normalization "constant.") (i) has a quadratic dependence on x; (ii) does not change the variance at all, it just renames w_1.

b) Circle the plots in Figure 1 that could plausibly have been generated by some instance of the model family(ies) you chose.

(ii) and (iii). (Note that (iii) works for σ² = 0.) (i) exhibits a large variance at x = 0, and the variance appears independent of x.

c) True/False: Regression with input-dependent noise gives the same solution as ordinary regression for an infinite data set generated according to the corresponding model.

True. In both cases the algorithm will recover the true underlying model.

d) For the model you chose in part (a), write down the derivative of the negative log likelihood with respect to w_1.
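A sketch of the requested derivative (added here; not from the original answer key), taking model (iii) to be y ~ N(w_0 + w_1 x, xσ²) as described above:

-\log L(w_0, w_1) = \sum_{i=1}^{N} \Big[ \frac{(y_i - w_0 - w_1 x_i)^2}{2\, x_i \sigma^2} + \tfrac{1}{2}\log\big(2\pi x_i \sigma^2\big) \Big],
\qquad
\frac{\partial(-\log L)}{\partial w_1} = -\sum_{i=1}^{N} \frac{y_i - w_0 - w_1 x_i}{\sigma^2}.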

II. Classification

1. Generative models vs. discriminative models

(a) [ points] Your billionaire friend needs your help. She needs to classify job applications into good/bad categories, and also to detect job applicants who lie in their applications, using density estimation to detect outliers. To meet these needs, do you recommend using a discriminative or generative classifier? Why?

A generative model, because we need to estimate the density p(x | y).

(b) [ points] Your billionaire friend also wants to classify software applications to detect bug-prone applications using features of the source code. This pilot project only has a few applications to be used as training data, though. To create the most accurate classifier, do you recommend using a discriminative or generative classifier? Why?

A discriminative model.

With only a few training samples, classifying directly with a discriminative model usually works better.

(d) [ points] Finally, your billionaire friend also wants to classify companies to decide which one to acquire. This project has lots of training data based on several decades of research. To create the most accurate classifier, do you recommend using a discriminative or generative classifier? Why?

A generative model.

With a large number of training samples, the correct generative model can be learned.

2. Logistic regression

Figure 2: Log-probability of labels as a function of regularization parameter C

Here we use a logistic regression model to solve a classification problem. In Figure 2, we have plotted the mean log-probability of labels in the training and test sets after having trained the classifier with quadratic regularization penalty and different values of the regularization parameter C.

(1) In training a logistic regression model by maximizing the likelihood of the labels given the inputs we have multiple locally optimal solutions. (F)

Answer: The log-probability of labels given examples implied by the logistic regression model is a concave (convex down) function with respect to the weights. The (only) locally optimal solution is also globally optimal.

(2) A stochastic gradient algorithm for training logistic regression models with a fixed learning rate will find the optimal setting of the weights exactly. (F)

Answer: A fixed learning rate means that we are always taking a finite step towards improving the log-probability of any single training example in the update equation. Unless the examples are somehow "aligned", we will continue jumping from side to side of the optimal solution, and will not be able to get arbitrarily close to it. The learning rate has to approach zero in the course of the updates for the weights to converge.

(3) The average log-probability of training labels as in Figure 2 can never increase as we increase C. (T)

Stronger regularization means more constraints on the solution and thus the (average) log-probability of the training examples can only get worse.

(4) Explain why in Figure 2 the test log-probability of labels decreases for large values of C.

As C increases, we give more weight to constraining the predictor, and thus give less flexibility to fitting the training set. The increased regularization guarantees that the test performance gets closer to the training performance, but as we over-constrain our allowed predictors, we are not able to fit the training set at all, and although the test performance is now very close to the training performance, both are low.
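A small experiment illustrating this behavior, added here and not part of the original exam. It uses synthetic data and scikit-learn's LogisticRegression; note that scikit-learn's C parameter is the inverse of the penalty weight C used in this problem, so a large penalty C corresponds to a small scikit-learn C.

```python
# Illustrative sketch: mean log-probability of labels on training and test sets
# for an L2-regularized logistic regression, as the regularization strength grows.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

def mean_log_prob(model, X, y):
    # Average log p(y_i | x_i) under the fitted model.
    probs = model.predict_proba(X)
    return np.mean(np.log(probs[np.arange(len(y)), y]))

for penalty_c in (0.01, 0.1, 1.0, 10.0, 100.0, 1000.0):
    model = LogisticRegression(C=1.0 / penalty_c, max_iter=1000).fit(X_tr, y_tr)
    print(f"penalty C={penalty_c:8.2f}  train={mean_log_prob(model, X_tr, y_tr):7.3f}  "
          f"test={mean_log_prob(model, X_te, y_te):7.3f}")
```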

(5) The log-probability of labels in the test set would decrease for large values of C even if we had a large number of training examples. (T)

The above argument still holds, but the value of C for which we will observe such a decrease will scale up with the number of examples.

(6) Adding a quadratic regularization penalty for the parameters when estimating a logistic regression model ensures that some of the parameters (weights associated with the components of the input vectors) vanish.

A regularization penalty suited for feature selection must have a non-zero derivative at zero. Otherwise, the regularization has no effect at zero, and the weights will tend to be slightly non-zero, even when this does not improve the log-probabilities by much.

3. Regularized logistic regression

In this problem we refer to the binary classification task depicted in Figure 1(a), which we attempt to solve with the simple linear logistic regression model (for simplicity we do not use the bias parameter w_0). The training data can be separated with zero training error; see line L1 in Figure 1(b) for instance.

(1) Consider a regularization approach where we try to maximize the penalized log-likelihood for large C. Note that only w_2 is penalized. We'd like to know which of the four lines in Figure 1(b) could arise as a result of such regularization. For each potential line L2, L3 or L4, determine whether it can result from regularizing w_2. If not, explain very briefly why not.

[Figure 1: (a) The 2-dimensional data set used in Problem 2. (b) The points can be separated by L1 (solid line); possible other decision boundaries are shown by L2, L3, L4.]

L2: No. When we regularize w_2, the resulting boundary can rely less on the value of x_2 and therefore becomes more vertical. L2 here seems to be more horizontal than the unregularized solution, so it cannot come as a result of penalizing w_2.

L3: Yes. Here w2^2 is small relative to w1^2 (as evidenced by high slope), and even though it would assign a rather low log-probability to the observed labels, it could be forced by a large regularization parameter C.

L4: No. For very large C, we get a boundary that is entirely vertical (the line x1 = 0, i.e., the x2 axis). L4 here is reflected across the x2 axis and represents a poorer solution than its counterpart on the other side. For moderate regularization we have to get the best solution that we can construct while keeping w2 small. L4 is not the best and thus cannot come as a result of regularizing w2.

(2) If we change the form of regularization to one-norm (absolute value) and also regularize w_1, we get the following penalized log-likelihood:

Consider again the problem in Figure 1(a) and the same linear logistic regression model. As we increase the regularization parameter C which of the following scenarios do you expect to observe (choose only one):

( x ) First w1 will become 0, then w2.
(   ) w1 and w2 will become zero simultaneously.
(   ) First w2 will become 0, then w1.
(   ) None of the weights will become exactly zero, only smaller as C increases.

The data can be classified with zero training error, and therefore also with high log-probability, by looking at the value of x2 alone, i.e., making w1 = 0. Initially we might prefer to have a non-zero value for w1, but it will go to zero rather quickly as we increase regularization. Note that we pay a regularization penalty for a non-zero value of w1, and if it doesn't help classification, why would we pay the penalty? The absolute-value regularization ensures that w1 will indeed go to exactly zero. As C increases further, even w2 will eventually become zero. We pay a higher and higher cost for setting w2 to a non-zero value. Eventually this cost overwhelms the gain from the log-probability of labels that we can achieve with a non-zero w2. Note that when w1 = w2 = 0, the log-probability of labels is the finite value n log(0.5).
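A small illustration of this effect (added here; not from the original solution). It uses synthetic data in which the label is determined by x2 alone, and scikit-learn's L1-penalized LogisticRegression; scikit-learn's C is again the inverse of the penalty weight C used in this problem.

```python
# Illustrative sketch: with an L1 penalty, the weight on the uninformative
# feature x1 hits exactly zero first, and for strong enough regularization
# w2 goes to zero as well.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
x2 = rng.normal(size=n)
y = (x2 > 0).astype(int)                 # the label is determined by x2 alone
x1 = rng.normal(size=n)                  # x1 is pure noise
X = np.column_stack([x1, x2])

for penalty_c in (0.1, 1.0, 10.0, 100.0):
    model = LogisticRegression(penalty="l1", solver="liblinear",
                               fit_intercept=False, C=1.0 / penalty_c).fit(X, y)
    w1, w2 = model.coef_[0]
    print(f"penalty C={penalty_c:6.1f}  w1={w1:8.4f}  w2={w2:8.4f}")
```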

1、 SVM

Figure 4: Training set, maximum margin linear separator, and the support vectors (in bold).

(1) What is the leave-one-out cross-validation error estimate for maximum margin separation in Figure 4? (We are asking for a number.) (0)

Based on the figure we can see that removing any single point would not change the resulting maximum margin separator. Since all the points are initially classified correctly, the leave-one-out error is zero.

(2) We would expect the support vectors to remain the same in general as we move from a linear kernel to higher-order polynomial kernels. (F)

There are no guarantees that the support vectors remain the same. The feature vectors corresponding to polynomial kernels are non-linear functions of the original input vectors and thus the support points for maximum margin separation in the feature space can be quite different.

(3) Structural risk minimization is guaranteed to find the model (among those considered) with the lowest expected loss. (F)

We are guaranteed to find only the model with the lowest upper bound on the expected loss.

(4) What is the VC-dimension of a mixture of two Gaussians model in the plane with equal covariance matrices? Why?

A mixture of two Gaussians with equal covariance matrices has a linear decision boundary. Linear separators in the plane have VC-dimension exactly 3.

4. SVM

Classify the following data points:

(a) Plot these six training points. Are the classes {+, -} linearly separable? Yes.

(b) Construct the weight vector of the maximum margin hyperplane by inspection and identify the support vectors.

The maximum margin hyperplane should have a slope of -1 and should pass through the point (x1, x2) = (3/2, 0). Therefore its equation is x1 + x2 = 3/2, and the weight vector is (1, 1)^T.

(c) If you remove one of the support vectors does the size of the optimal margin decrease, stay the same, or increase?

In this specific dataset the optimal margin increases when we remove the support vectors (1, 0) or (1, 1), and stays the same when we remove the other two.

(d) (Extra Credit) Is your answer to (c) also true for any dataset? Provide a counterexample or give a short proof.

When we drop some constraints in a constrained maximization problem, we get an optimal value which is at least as good as the previous one. This is because the set of candidates satisfying the original (larger, stronger) set of constraints is a subset of the candidates satisfying the new (smaller, weaker) set of constraints. So, for the weaker constraints, the old optimal solution is still available, and there may be additional solutions that are even better. In mathematical form:

Finally, note that in SVM problems we are maximizing the margin subject to the constraints given by training points. When we drop any of the constraints the margin can increase or stay the same, depending on the dataset. In general problems with realistic datasets it is expected that the margin increases when we drop support vectors. The data in this problem is constructed to demonstrate that when removing some constraints the margin can stay the same or increase depending on the geometry.

2. SVM

Classify the following set of three data points:

(a) Are the classes {+, -} linearly separable? No.

(b) Consider mapping each point to 3-D using the new feature vector φ(x) = (1, √2 x, x²)^T. Are the classes now linearly separable? If so, find a separating hyperplane.

The points are mapped to (1, 0, 0), (1, -√2, 1), and (1, √2, 1) respectively. The points are now separable in 3-dimensional space. A separating hyperplane is given by the weight vector (0, 0, 1) in the new space, as seen in the figure.

(c) Define a class variable y_i ∈ {-1, +1} which denotes the class of x_i, and let w = (w1, w2, w3)^T. The max-margin SVM classifier solves the following problem:

Using the method of Lagrange multipliers, show that the solution is ŵ = (0, 0, -2), b = 1, and that the margin is 1/‖ŵ‖ = 1/2.

For optimization problems with inequality constraints such as the above, we should apply the KKT conditions, which are a generalization of Lagrange multipliers. However, this problem can be solved more easily by noting that we have three vectors in the 3-dimensional space and all of them are support vectors. Hence all 3 constraints hold with equality. Therefore we can apply the method of Lagrange multipliers to the objective ½ wᵀw subject to these equality constraints.

(e) Show that the solution remains the same if the constraints are changed to y_i(wᵀφ(x_i) + b) ≥ γ for any γ ≥ 1.

(f) (Extra Credit) Is your answer to (d) also true for any dataset and any γ ≥ 1? Provide a counterexample or give a short proof.

SVM

Suppose we only have four training examples in two dimensions (see figure above): positive examples at x1 = [0, 0], x2 = [2, 2] and negative examples at x3 = [h, 1], x4 = [0, 3], where we treat 0 ≤ h ≤ 3 as a parameter.

(1). How large can h ≥ 0 be so that the training points are still linearly separable? Up to (excluding) h=1

(2). Does the orientation of the maximum margin decision boundary change as a function of h when the points are separable (Y/N)? No, because x1, x2, x3 remain the support vectors.

(3). What is the margin achieved by the maximum margin boundary as a function of h?

[Hint : It turns out that the margin as a function of h is a linear function.]
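A quick numerical check (added here; not part of the original exam): for separable values h < 1, we can fit a hard-margin linear SVM (approximated in scikit-learn by a very large C) and read off the margin 1/‖w‖ for a few values of h.

```python
# Illustrative sketch: margin of the maximum-margin separator as a function of h
# for the four points above, approximating a hard-margin SVM with a large C.
import numpy as np
from sklearn.svm import SVC

for h in (0.0, 0.25, 0.5, 0.75, 0.9):
    X = np.array([[0.0, 0.0], [2.0, 2.0], [h, 1.0], [0.0, 3.0]])
    y = np.array([1, 1, -1, -1])
    clf = SVC(kernel="linear", C=1e6).fit(X, y)     # near-hard-margin linear SVM
    w = clf.coef_[0]
    print(f"h={h:4.2f}  margin ~ {1.0 / np.linalg.norm(w):.4f}  "
          f"support vector indices: {clf.support_.tolist()}")
```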

(4). Assume that we can only observe the second component of the input vectors. Without the other component, the labeled training points reduce to (0,y = 1), (2,y = 1), (1,y = -1), and (3,y =-1). What is the lowest order p of polynomial kernel that would allow us to correctly classify these points?

The classes of the points along the projected x2 line follow the order 1, -1, 1, -1 (at x2 = 0, 1, 2, 3). Therefore, we need a cubic polynomial.
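A small check of this answer (added here; not from the original solution), using scikit-learn's SVC with a polynomial kernel; coef0=1 is used so that the kernel contains all lower-order terms.

```python
# Illustrative sketch: the 1-D points (0, +1), (2, +1), (1, -1), (3, -1) require
# a degree-3 polynomial kernel to be classified with zero training error.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0], [2.0], [1.0], [3.0]])
y = np.array([1, 1, -1, -1])

for degree in (1, 2, 3):
    clf = SVC(kernel="poly", degree=degree, coef0=1.0, C=1e6).fit(X, y)
    errors = int(np.sum(clf.predict(X) != y))
    print(f"degree {degree}: training errors = {errors}")
```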

3、 LDA

Using a set of 100 labeled training examples (two classes), we train the following models:

GaussI : A Gaussian mixture model (one Gaussian per class), where the covariance matrices are both set to I (identity matrix).

GaussX: A Gaussian mixture model (one Gaussian per class) without any restrictions on the covariance matrices.

LinLog: A logistic regression model with linear features.

QuadLog: A logistic regression model, using all linear and quadratic features.

(1) After training, we measure for each model the average log probability of labels given examples in the training set. Specify all the equalities or inequalities that must always hold between the models relative to this performance measure. We are looking for statements like “model 1 <= model 2” or “model 1 = model 2”. If no such statement holds, write “none”.

GaussI <= LinLog (both have logistic posteriors, and LinLog is the logistic model maximizing the average log probabilities)

GaussX <= QuadLog (both have logistic posteriors with quadratic features, and QuadLog is the model of this class maximizing the average log probabilities)

LinLog <= QuadLog (logistic regression models with linear features are a subclass of logistic regression models with quadratic features; the maximum over the superclass is at least as high as the maximum over the subclass)

GaussI <= QuadLog (follows from above inequalities)

(GaussX will have higher average log joint probabilities of examples and labels than GaussI will. But having higher average log joint probabilities does not necessarily translate to higher average log conditional probabilities.)

(2) Which equalities and inequalities must always hold if we instead use the mean classification error in the training set as the performance measure? Again use the format “model 1 <= model 2” or “model 1 = model 2”. Write “none” if no such statement holds.

None. Having higher average log conditional probabilities, or average log joint probabilities, does not necessarily translate to higher or lower classification error. Counterexamples can be constructed for all pairs in both directions.

Although no inequality always holds, it is commonly the case that GaussX <= GaussI and that QuadLog <= LinLog. Partial credit of up to two points was awarded for these inequalities.

5、We consider here generative and discriminative approaches for solving the classification problem illustrated in Figure 4.1. Specifically, we will use a mixture of Gaussians model and regularized logistic regression models.

Figure 4.1. Labeled training set, where “+” corresponds to class y = 1.

(1) We will first estimate a mixture of Gaussians model, one Gaussian per class, with the constraint that the covariance matrices are identity matrices. The mixing proportions (class frequencies) and the means of the two Gaussians are free parameters.

a) Plot the maximum likelihood estimates of the means of the two class conditional Gaussians in Figure 4.1. Mark the means as points “x” and label them “0” and “1” according to the class.

The means should be close to the center of mass of the points.

b) Draw the decision boundary in the same figure.

Since the two classes have the same number of points and the same covariance matrices, the decision boundary is a line and, moreover, should be drawn as the orthogonal bisector of the line segment connecting the class means.

(2) We have also trained regularized linear logistic regression models for the same data. The regularization penalties, used in penalized conditional log-likelihood estimation, were -C w_i², where i = 0, 1, 2. In other words, only one of the parameters was regularized in each case. Based on the data in Figure 4.1, we generated three plots, one for each regularized parameter, of the number of misclassified training points as a function of C (Figure 4.2). The three plots are not identified with the corresponding parameters, however. Please assign the “top”, “middle”, and “bottom” plots to the correct parameter, w0, w1, or w2, the parameter that was regularized in the plot. Provide a brief justification for each assignment.

• “top” = (w1). By strongly regularizing w1 we force the boundary to be horizontal in the figure. The logistic regression model tries to maximize the log-probability of classifying the data correctly. The highest penalty comes from the misclassified points, and thus the boundary will tend to balance the (worst) errors. In the figure, this is roughly speaking the x2 = 1 line, resulting in 4 errors.

• “middle” = (w0). If we regularize w0, then the boundary will eventually go through the origin (bias term set to zero). Based on the figure we can find a good linear boundary through the origin with only one error.

• “bottom” = (w2). The training error is unaffected if we regularize w2 (constrain the boundary to be vertical); the value of w2 would be small already without regularization.

4、 midterm 2009 problem 4

6. Consider two classifiers: 1) an SVM with a quadratic (second-order polynomial) kernel function and 2) an unconstrained mixture of two Gaussians model, one Gaussian per class label. These classifiers try to map examples in R² to binary labels. We assume that the problem is separable, no slack penalties are added to the SVM classifier, and that we have sufficiently many training examples to estimate the covariance matrices of the two Gaussian components.

(1) The two classifiers have the same VC-dimension. (T)

(2) Suppose we evaluated the structural risk minimization score for the two classifiers. The score is the bound on the expected loss of the classifier, when the classifier is estimated on the basis of n training examples. Which of the two classifiers might yield the better (lower) score? Provide a brief justification.

The SVM would probably get a better score. Both classifiers have the same complexity penalty but SVM would better optimize the training error resulting in a lower (or equal) overall score.
