Multi-view ensemble learning an optimal feature set

更新时间:2023-04-20 02:10:01 阅读量: 实用文档 文档下载

说明:文章内容仅供预览,部分内容可能不全。下载后的文档,内容与下面显示的完全一致。下载之前请确认下面内容是否您想要的,是否完整无缺。

Multi-view ensemble learning:an optimal feature set partitioning for high-dimensional data classi?cation

Vipin Kumar1·Sonajharia Minz1

Received:14June2014/Revised:2June2015/Accepted:20August2015

?Springer-Verlag London2015

Abstract Multi-view ensemble learning has the potential to address issues related to the high dimensionality of data.It attempts to utilize all the relevant only discarding the irrelevant features.The view of a dataset is the sub-table of the training data with respect to a subset of the feature set.The problem of discarding the irrelevant features and obtaining subsets of the relevant features is useful for dimension reduction and dealing with the problem of having fewer training examples than even the reduced set of relevant features.A feature set partitioning resulting in the blocks of relevant features may not yield multiple-view-based classi?ers with good classi?cation performance.In this work the optimal feature set partition approach has been proposed.Further,the ensemble learning from views aims to maximize the performance of the classi?er.The experiments study the performance of random feature set partitioning,attribute bagging,view generation using attribute clustering,view construction using genetic algorithm and OFSP proposed method.The blocks of relevant feature subsets are used to construct the multi-view classi?er ensemble using K-nearest neighbor,Na?ve Bayesian and support vector machine algorithm applied to sixteen high-dimensional data sets from UCI machine learning repository.The performance parameters considered for comparison are classi?cation accuracy,disagreement among the classi?ers,execution time and percentage reduction of attributes.

Keywords Classi?cation·Feature set partitioning·High dimensionality·Multi-view ensemble learning

1Introduction

The persity of the various views provides insight for learning from multiple views of the data.The perse views may be obtained from multiple data sources or different features B Vipin Kumar

rt.vipink@93d0cdd159eef8c75ebfb384

1School of Computer and Systems Sciences,Jawaharlal Nehru University,New Delhi110067,India

123

V.Kumar,S.Minz sets.The sub-table of the dataset with respect to a subset of features yields as view(vertical partitioning of the dataset).The subsets of the set of features may be formed based on some criteria such as natural subsets of features with respect to the data acquisition;webpage may have multiple partial descriptions based on their text contents or hyperlinks;hyperspectral imagery may be partially viewed in respect of sets of bands to describe the properties of the contents in an imagery,etc.Many real-world data sets have large number of features [1].While the dimensionality poses problems typically known as curse of dimensionality, the datasets also lack adequate number of training samples in comparison with the number of dimensions.Feature selection techniques help in dimension reduction for the analysis of the high-dimensional data.Some machine learning algorithms can discard the less relevant features in the learning process.Therefore,multi-view ensemble learning(MEL)has the potential to explore the use of features in multiple views of the data for the classi?cation task.It may exploit the multiple views constructed by separating relevant from irrelevant features and then forming subsets of relevant features for learning useful and interpretable patterns.

In multi-view learning approach,multiple subsets of the domain features are used to generate the corresponding views.Each view is considered to contain some information to learn the target concept.The approach introduces a function to model the learning from each view.Further a function is used to ensemble the inpidual models and optimized the outcome related to the target concept.The multi-view learning approach aims to exploit the multiple views of the same input data for an improved learning performance[2–5].A number of multi-view ensemble learning algorithms have been proposed in the literature that are classi?ed into three groups such as co-training,multiple kernel learning,and subspace learning[3].The objective of the co-training algorithms is to maximize mutual agreement between two views of the data.The multiple kernel learning methods identify the natural kernels corresponding the views.In subspace learning,the input views are based on the latent subspace.The MEL has been applied for supervised learning[3,6–9]semi-supervised learning[10],clustering [11],ensemble learning[12],active learning[13],dimensionality reduction[14],etc.

The performance of MEL corresponds to the learning performance of the inpidual clas-si?ers of the respective views.Therefore,view generation based on feature subsets is an important task in the multi-view learning.In the literature,many strategies have been pro-posed for obtaining the subset of the feature set based on random selection[15–17],reduct [18–20],collective performance[21,22],feature set partitioning[23]and rotation forest[24]. Random-based strategy employs a random selection of the features for constructing the sub-sets.Reduct-based approach attempt to?nd the smallest subset of the dataset that has same predictive power as the whole features set of the data.Rodriguez[24]proposed rotation for-est method to build perse classi?ers with improved accuracy.In this,they applied feature extraction method to select the subset of features for each classi?er.

There are two ways to partition the dataset into sub-tables namely horizontal partition, the subset of data with respect to the universe of objects,and vertical partition,the subset of the data with respect to the dimension.Speci?cally,vertical partitioning,called a feature set partitioning,is an approach for feature subset-based ensembles[25].The literature contains a number of methods to address the feature set partitioning problem of which RFSP is the most general method[3].However,this method does not ensure a satisfactory outcome of MEL. The feature set may be naturally partitioned according to their type such as nominal,numeric and text value feature[26,27].A number of view construction methods such as bagging[28], random feature set partitioning(RFSP)[29],attribute bagging(AB)[17],view construction using genetic algorithm(VC-GA)[30]and view generation using attribute clustering(VG-AC)[31],have also been proposed in the literature.The feature set decompositions approach 123

Multi-view ensemble learning:an optimal feature set...

offers the methods based on target classes[32]and pair-wise mutual information[33].Rokach [23,34]proposed GA-based feature set partitioning and a general framework of feature set partitioning.A novel feature decomposition method called Pseudo multi-view Co-training automatically pides original data set(single view)into two mutually exclusive subsets[6].

The partitioning methods partition the features set in order to achieve their objectives. However,the views generated corresponding a partition may not ensure the expected per-formance of the learning model.Therefore,optimal feature set partitioning(OFSP)method proposed in this paper aims to generate a?xed number of views such that the performance of the ensemble of all classi?ers induced from each of the views may be optimal.Therefore, the method aims to reduce the dimension,to optimize the accuracy of the ensemble of the view-based classi?ers,and to reduce the classi?cation time.In OFSP is a forward selection method combining the reduct-based approach for the optimal collective performance of a ?xed number of views.The method obtains the partition of the feature set that contains a block of irrelevant features and the block of relevant features in such a way that the perfor-mance of the multi-view ensemble classi?er is optimal.The block of irrelevant features is not used for view creation.Thus dimensionality reduction is achieved.To assess the novelty of the proposed method,experiments have been carried out on sixteen high-dimensional labeled data sets.To evaluate the effectiveness of the OFSP method,?ve state-of-art fea-ture set partition methods are used to partition the original set of features.To observe the robustness of the method,three classi?cation algorithms with different approaches have been employed to induce the view-wise classi?ers.Performance weighting method is applied to ensemble the classi?ers to meet the objective of collective performance.All the models are compared based on the dimension reduction,classi?cation accuracy,disagreement among the view-based classi?ers and execution time.

This paper is organized as follows:In Sect.2a review of the principles and related work on feature selection and multi-view ensemble learning are presented.Some notation,basic concepts,and proposed method are presented in the Sect.3.Details on the experiments are presented in the Sect.4including description of the datasets,experiment design,and the results.Analysis of the results and their statistical tests are described in the Sect.5.The conclusion and future work are outlined in Sect.6.

2Related concepts and work

In this section state of the art related to the concept of feature selection and multi-view learning approach for learning in high-dimensional data has been revisited.The section covers the following categories,

?Feature selection,feature relevance and optimal feature subset.

?Multi-view ensemble learning.

2.1Feature selection

Feature selection pertains to selecting the relevant features for ef?cient classi?cation.For a data set described by n features,there are2n candidate subsets of the feature set.In case of the high-dimensional data,n is signi?cantly large.The complexity gets enhanced due to the number of samples being comparatively lesser than the number of features.Therefore, selection of relevant features is a nontrivial process for high-dimensional data.The machine

123

V .Kumar,S.Minz

learning algorithms yield ef?cient classi?cation models from the training data after dimension reduction.

De?nition 1(Feature selection [35])Let the set of original features be X ={x 1,x 2,...,x n }and consider an evaluation criterion f de?ned as f :?(X )→R .A subset of features X ∈?(X )is said to be selected as candidate feature subset if for a threshold θ,θ≤f X ≤f (X ).

For a given high-dimensional data,feature selection transforms the problem to a searching in the space of feature subsets.A candidate feature subset with evaluation measure same as that of the whole feature set has been preferred for the task of ef?cient classi?cation.A minimum candidate set with maximum evaluation measure has been de?ned to be an optimal feature set.In the context of labeled training data for the classi?cation problem feature relevance and the optimal feature,a subset are de?ned in De?nitions 2and 3respectively.De?nition 2(Relevance to the target concept [7])A feature x i ∈X is relevant to a target concept C if there exist a pair of examples A and B in the instance space such that A and B differ only on their assignment to x i and C ,i.e.x i (A )=x i (B )and C (A )=C (B ).De?nition 3(Optimal feature subset [35])Let a dataset S from a distribution P over the labeled instance space be de?ned by a set of features X ={x 1,x 2,x 3,...,x n }and let the set of all subsets of X be represented by ?(X ).Consider f a classi?er inducer,i.e.C =f (S X ).An optimal feature subset X opt ∈?(X ),is a subset of the features such that |X opt |is minimal and the accuracy Acc (f S X opt )of the induced classi?er are maximal.

In the literature,independent and dependent evaluation criteria have been used for eval-uation of feature relevance or subsets relevance [36].An Independent criterion can be implemented as a ?lter model that does not involve a learning algorithm.It exploits the characteristics of the training data to evaluate the goodness of the feature subset.Measures of information or uncertainty [11],Distance measures [12],Probability of error measures

[13],Dependency measures [14],Interclass distance measures Consistency measures [37],etc are some of the independent feature evaluation criteria However,dependent criteria are implemented as wrapper models that require a learning algorithm and determine the rele-vance of feature subsets based on the performance of the algorithm.This is computationally expensive because every feature subset is estimated based on the accuracy of the induced classi?er [3,38].

2.2Multi-view learning

Multi-view ensemble learning (MEL)involves two-step learning procedure.To induce a learning model corresponding each view,a partition of the data in the ?rst step and the second step to ensemble the multiple models for an optimized the learning performance [3].A number of multi-view learning algorithms have been proposed for the supervised,semi-supervised and clustering tasks [39–42].View construction,view-wise model induction,ensemble model extraction,and performance estimation are the main processes in the MEL.However,the success of multi-view learning is ensured by two signi?cant principles namely consensus and complementary [3].

Principles of multi-view learning

?Consensus principle The aim of consensus principle is to maximize agreement among the models corresponding all the views of the dataset.The consensus of two views 123

Multi-view ensemble learning:an optimal feature set ...

hypothesis in Dasgupta et al.[43]on two views and their error rates and gave the inequality as P f 1=f 2 ≥max P err f 1 ,P err f 2 .Therefore,it can be inferred that the disagreement rate minimization rate of two hypotheses is minimized the error of each hypothesis.

?Complementary Principle An alternate perception related to multiple views of the dataset is that each view may contain some knowledge that other views do not contain [3].There-fore,it is bene?cial to employ multiple views comprehensively and accurately.Machine learning techniques have utilized the complementary information from the multiple views of the dataset to improve the performance.

2.2.1View construction

A view,in general,is either a row-wise or column-wise sub-table of the database.The row-wise views are popularly used in partition-based methods in case the dataset is too large.The work in this paper focuses on classi?cation problem in datasets with a large feature set and a comparatively smaller set of instances.Therefore,an attribute-oriented approach has been considered in this paper for view construction.Here,attribute-based column-wise sub-table has been referred to as the view throughout this paper.

In this work,only feature set partitioning-based multiple views of the data has been considered.The objective of multiple view construction is the acquisition of views with respect to disjoint subsets of features set where the joint of views must yield the dataset.Many view construction methods have been proposed for the better utilization of the views for learning purpose.View construction method can be pided into three major classes,namely random-based view construction,performance-based view construction and feature set partitioning-based view construction.

Random-based view construction The simplest way for view construction is to split the feature set randomly into a subset.Random view construction does not assure to be the acceptable difference in results,because it depends on the data domain and learning algorithm

[10].In [44],random subspace method (RSM)has been proposed to incorporate the bene?ts of aggregation and bootstrapping in the feature space.Tao et al.[38]proposed a random subspace method for small sets of features and solved problems related to over-?tting that reduces the discrepancy between the feature vector length and training data size.Attribute bagging (AB)[17]has been proposed that combine random subsets of features.Initially,it ?nds the appropriate size of the subset by random search,and then randomly selects the subsets of features of same size.

Performance-based view construction Several feature selection methods have been com-pared by Tsymbal et al.[45]to ?nd the best collection of features subsets using the persity measure.In [46],the feature subset search algorithm has been proposed to ?nd subsets that consider the performance of ensemble and persity of subsets of features.View Construction using Genetic Algorithm (VC-GA)proposed that obtain the feature subsets automatically

[30].In this,the selection of the i th bit is represented by 1else 0.Each candidate feature subset is considered as view of the data.Zenobi and Cunningham [22]proposed criteria of the error of feature subset while learning and disagreement among the ensemble members for ?nal feature subsets.Di and Crawford [47]proposed strategies such as compatibility,persity,and accuracy,to construct multiple views of the hyperspectral data.

Feature sets partition-based view construction There are two types of the partition of the data set namely horizontal partitioning and vertical partitioning namely.In horizontal parti-tioning,a set of samples are partitioned into many subsets with whole features of the dataset where Bagging [28]is the most common method of horizontal partitioning.In vertical par-

123

V.Kumar,S.Minz titioning,the original dataset is partitioned into many subsets(blocks)with whole instances of the original dataset.Vertical partitioning is also known as feature sets partitioning,which considers pair-wise disjoint subsets of features[25].Liao and Moody[33]are employed similar features in one group based on pair-wise mutual information statistically.In other research,features are grouped according to their types:numerical value features,nominal value features and text value features[26,27].According to the class,feature set decompo-sition has been employed to group features that have a high correlation with same class[32]. The genetic algorithm has been applied successfully to feature set partitioning.In this,the new encoding scheme is used and also new caching mechanism to avoid the recreation of the same classi?er and speed up the execution[48].

2.2.2View evaluation

View evaluation is required to ensure the effectiveness of MEL.Muslea[49]introduced?rst view validation algorithm that identify the suf?ciency of view capability for the prediction. Christoudias[50]de?ned conditional view entropy to?lter subsets and detect view disagree-ment among classi?ers that indicates the noise in?uence on MEL performance.In another research,Bayesian co-training method employed to cope with per-view noise[5].The latent variable and consensus are latent variable employed to model the agreement on different views.It is also assumed that each view is corrupted by the same amount of the noise. Extended work of Bayesian co-training has been presented as Heteroscedastic Bayesian Co-Training in[51]that identi?es the different level of views noise.In[52],view suf?ciency and view dependency of multi-view learning are de?ned as new con?dence measures namely intra-view and inter-view con?dence.It is known that the views are the subsets of the original feature set.Therefore,each view(subset)can also be evaluated by data mining algorithms to identify the prediction capability of the view.

The predictions of the ensemble classi?ers are the measure of performance of the MEL. The consensus and the disagreement between the learners are quantitative measures of the performance[52].The performance evaluation of the inpidual views is measured based on the outcome corresponding the validation set.For improved classi?cation by the MEL, the consensus among the learners corresponding to the inpidual views must be higher,or disagreement must be lower.In the literature,quantitative measures of persity among the learners are categorized into two types:pair-wise and non-pair wise.Q-statistics[53]and kappa statistics[54]are the pair-wise measures that calculate the average of all possible pairings of the members.The pair-wise persity measure of k-classi?ers is calculated as the

mean pair-wise measure of overall k(k?1)

2pairs of the classi?ers[55].The total disagreement

measure(DM total)is shown in Eq.1:

DM total=

2

k(k?1)

k

i=1

k

j=i+1

dis

f i,f j

(1)

where dis(.)is the disagreement function between the classi?ers f i and f j.It is the ratio between sum the of the instances on which the two classi?ers assign different class labels, and the total number of samples,de?ned as in Eq.2:

dis

f i,f j

=n i,j

(0,1)+n i,j(1,0)

n i,j(0,0)+n i,j(1,0)+n i,j(0,1)+n i,j(1,1)

(2)

n i,j(0,1)is the number of instances in the data which are assigned the class label0by the classi?er f i where as class label1by the classi?er f j.Similarly,n i,j(0,1)is the number of 123

Multi-view ensemble learning:an optimal feature set ...

instances for vice-versa.Thus the numerator of dis (.)is the sum of instances of disagreements for a pair of classi?ers.Therefore,for a dataset of size |S |disagreement of an ensemble can be written as in Eq.3:

DM total = 2|S |×k (k ?1) k i =1k j =i +1 n i ,j (0,1)+n i ,j (1,0) (3)

2.2.3Ensemble learning techniques

The objective ensemble learning is to utilize multiple classi?ers to obtain a better predictive performance.The effective ensemble learning system consists the inpidual performance

[56].In the literature,weighting and meta-learning methods are the main methods for com-bining base classi?ers.The Embedded Multi-view Ad boost algorithm (EMV-Adaboost)

[57]proposed that extend to Adaboost to multi-view learning.In weighting method,there are many methods proposed method like Majority voting,Performance weighting [58],Bayesian combination [59],etc.The strength of the classi?er’s classi?cation has proportional to the assigned weight.The weight can be determined dynamically.In meta-combination method,many techniques are proposed like staking [60],combiner trees [61],arbiter trees [62],etc.3Notations and proposed method

In the classi?cation problem,samples and features of the dataset must be described prop-erly.Let a data set S have attributes set X ={x 1,x 2,x 3,...,x n },where,dom (x i )= v i ,1,v i ,2,v i ,3,...,v i ,|dom (x i )| ,is ?nite cardinality of the domain.Instance space (set of all possible examples)of the dataset is the cartesian product of all the input attributes domain de?ned as in Eq.4:

S =dom (x 1)×dom (x 2)×dom (x 3)×...×dom (x n )

(4)

Therefore,labeled instance space of the data S L can be represented as in Eq.5:

S L =X ×dom (y )(5)where,dom (y )= y 1,y 2,y 3,...,y |dom (x i )| are the categorical labels.

Feature Set Partitioning

The partition of a set may be understood as a collection of subsets of a features set.The de?nition of feature set partition is de?ned in De?nition 4.Let,SetPart (X )is denoted as set of all possible partitions of a set X .The partition that consist all singletons of the form {x i }with x i ∈X ,denoted as δX and the partition of the set X itself,denoted as θX .Note that θX ≤πX ≤δX for every πX ∈SetPart (X ),where δX and θX are the partitions which have minimum and maximum number of blocks of partition.An example of the partition of the set is given in the Example 1.

De?nition 4(Feature set partition )Let a data set S have nonempty features set X ={x 1,x 2,x 3,...,x n }.A partition of X is a nonempty collection of the nonempty subsets of X ,πX ={X i |i ∈I }such that ∪{X i |i ∈I }=X and X i ∩X j =?,and i =j ,j ∈I ;where I is integer and each subset X i of πX is a block of partition.123

V .Kumar,S.Minz

De?nition 5(Optimal feature set partition (OFSP ))Let a classi?er f ,a training dataset S L with set of input features X ={x 1,x 2,x 3,...,x n }and target feature is y from distribution P over labeled instance space.A partition πopt of X having k -number of blocks of partition is an optimal feature set partition,if the accuracy of the ensemble of the classi?ers with respect to the block of πopt is optimal i.e.?πi of X having k -number of blocks of partition.De?nition 6(Relevant feature to the view )A feature x i is relevant to a view of the dataset,if difference of current classi?cation accuracy (after x i inclusion in view)and previous classi?cation accuracy of the classi?er of the same view is positive

De?nition 7(Candidate subset of the feature )Let input features set X ={x 1,x 2,x 3,...,x n }of the dataset S ,target feature is y from distribution P over labeled instance space,and partition πj ∈Set Part (X )where Set Part (X )is a set of partitions.A subset X i ∈πj is called candidate subset for x r ,if positive difference of classi?cation accuracy of classi-?er index,t =argmax q ∈k (f X q ∪{x r } ?f (X q ))>0is equal to view index i ,where x r ∈X ,X q ?X ,q =1,2,3,...,k and πj ={X 1,X 2,X 3,...,X k }is a set feature set partitioning.

Example 1if X ={x 1,x 2,x 3},then the complete list of partitions will be as follow:

π0={{x 1},{x 2},{x 3}},π1={{x 1,x 2},{x 3}},π2={{x 1},{x 2,x 3}},

π3={{x 1,x 3},{x 2}},π4={{x 1,x 2,x 3}},

where δX =π0and θX =π4.In general,total number of partitions of a set having n -elementss is equal to the Bell numberB n .The initial some Bell numbers are B 0=1,B 1=1,B 2=2,B 3=5,B 4=15,etc.The Bell numbers satisfy the recursion which is de?ned in Eq.6[28]:B n =n ?1 j =0

n ?1j B j (6)The block of partition X i ∈πX (πX ∈SetPart (X ))of feature set X

may be consid-ered as i th view of the dataset,where X i = x i ,1,x i ,2,x i ,3,...,x i ,p and dom (x i )= v i ,1,v i ,2,v i ,3,...,v i ,|dom (x i )| .The graphical representation of X i block of partition πX is shown Fig.1,where X i is the i th view of feature set X .Therefore,for the i th view of the dataset can be written as in Eq.7:X i =dom (x i ,1)×dom (x i ,2)×dom x i ,3 ×...×dom (x i ,p )(7)

Fig.1Graphical representation of vertical partition of the dataset S of i th view X i

123

Multi-view ensemble learning:an optimal feature set...

The features set X can be represented as set of views X={X1,X2,X3,...,X k},where k is the number of blocks(views)in partitionπX.Let,the inducer f is applied to learn with each i th view X i of the dataset,where i=1,2,3,...,k,which is de?ned as in Eq.8:

f i:X i→y(8)

Therefore,a set of classi?ers can be represented as Fπ

X ={f1(X1),f2(X2),f3(X3),...,

f k(X k)},which correspond to the views of the dataset forπX partition.The same classi?-cation algorithm can be applied for each view of the dataset.In MEL,multiple classi?ers of the views are ensemble for the combine prediction;therefore,the classi?cation accuracy for

the partitionπX of the dataset S L can be represented as A kπ

X (S L)=E k i=1(f i(X i)),where E

is an ensemble method.

Performance weighting ensemble and classi?cation

The majority of vote and performance weighting are the two widely used techniques for

an ensemble of classi?ers[29].The following description of ensemble method is based on

performance weighting.Let s t is the t th sample of the view X i and f i(X i)is the i th classi?er and their prediction is denoted as y t for sample s t.Performance weightingαi sets weight to

each i th classi?er based on classi?cation performance on the validation set.It is de?ned as

in Eq.9:

αi=

(1?e i)

k

j=1

1?e j

(9)

where,the normalization factor e i is based on the performance of the i th classi?er on the validation set.After the computing of the weightαi of corresponding classi?ers,the class which has maximum score is assigned to the sample which is de?ned as in Eq.10:

C(s t)=argmax y t∈dom(y)

k

i=1

(αi×f i(X i))

(10)

where a Zero-one loss function L(.)is de?ned to access the classi?cation accuracy as in

Eq.11:

L(y t,C(s t))=

0,C(s t)=y t

1,C(s t)=y t(11)

For each misclassi?cation,the loss function is assigned a single unit.Therefore,classi?cation accuracy can be de?ned for the feature set partitionπX as in Eq.12:

A kπ

X (S L)=1?min

S L

?

?1

|S L|

|S L|

t=1

L(y t,C(s t))

?

?(12)

where A kπ

X (.)shows the minimization of loss function that leads to maximization of classi-

?cation accuracy,where,|S L|is the number of samples.

Proposed method:optimal feature set partition(OFSP)

Multiple views can be represented in many ways which depend on the partition of the features set(De?nition4).The performance of the MEL depends on the partition of the features set. The proposed method,partition the features set in such way that each view(block of partition) has those features which are the most relevant(De?nition6)than other views,where the feature relevant is measured by classi?cation accuracy of data mining learning algorithm for each view such as KNN,NB,SVM,etc.Moreover,the features do not include in any

123

V.Kumar,S.Minz blocks the partition that is not suitable in any of the block.In other way,it can be understood that if the partitionπX∈SetPart(X)has(k+1)blocks of partition,then the k-number of blocks of partitionπX are those subsets of X which have most suitable features than the other subsets and(k+1)th block of partition has those feature which are not suitable in any other subsets.Therefore,for k-number of vertical partition,(k+1)blocks of partition are constructed,where most suitable k-number of blocks(views)are utilized for MEL.The partition of input features set are said to be optimal,if an inducer trained on each views and ensemble of generated classi?ers produce the optimal accuracy.A de?nition of optimal feature set partition is given in the De?nition5.The value of partition k of the dataset can be identi?ed rationally.It can be decided by the expert of the same domain of the dataset or system con?guration for parallel execution of the program.

Letπi∈SetPart(X)andπj∈SetPart(X)are the two partitions of the features set of the dataset,having k-number of blocks of partition,where0≤i,j≤B s.According to the

partitionsπi andπj,the accuracy can be represented as A kπ

i (S L)=E k q=1

f q

X q

and

A kπ

j

(S L)=E k r=1(f r(X r))respectively.The partitionπi is said to optimal partitionπopt if,

A kπ

i (S L)>A kπ

j

(S L)?1≤j≤B s and i=j,where k=1,2,3,...,B n;thenπopt=πi.

Therefore,partitionπopt∈SetPart(X)can be called optimal feature set partition for the MEL.

An optimal feature set partition framework for MEL is shown in Fig.2.The whole frame-work can be pided into two part namely,optimal feature set partitioning(form Step-1to Step-7)and MEL(Step-8and Step-9).In Step-1,set of input features of the dataset,X is inserted.Step-2has the union of a feature x i∈X in each view of the dataset,where initial subset evaluation measured values(classi?cation accuracy)are set as zero for each subset X i.The classi?ers are generated corresponding to the views in the Step-3.Then in Step-4, the candidate subset is obtained among the subsets using classi?cation accuracy measure. If feature x i does not improve the classi?cation accuracy of the candidate subset,then the process go from Step-5to Step-1for x i+1and add x i in the subset X k+1;otherwise,relevant feature(De?nition6)x i adds to the candidate subset(De?nition7)in Step-6.Step-7checks the condition that all features are not exhausted or not;if all features of the dataset are not exhausted then go to the Step-1,otherwise go for Step-8.The blocks of optimal partition of the feature set are used to build the classi?ers in the Step-8.The subset X k+1is not included in the MEL.In Step-9,the ensemble of classi?ers predicts the labels of unlabeled instances.

The algorithm of Optimal Feature Set Partitioning method is shown in Table1.Optimal feature set partitioning algorithm has three input variable namely,dataset S L,k-numbers of feature set partitioning and f i classi?er(an evaluation measure).Evaluation measure can be dependent or independent to evaluate the subsets.Initially,algorithm creates k-numbers of null subsets.For each subset X i,an attribute x attr added temporarily to create subset X temp and evaluated by subset classi?cation accuracy using classi?er f i.Evaluated value A temp is compared to the previous subset evaluation value A i by taking difference.The subset with maximum difference in previous and current measured value is the candidate subset X i for the selected feature.The selected feature is added to the candidate subset and repeat for next feature from the remaining.Dimensionality reduction is done by discarding those features are not suitable(not improving the classi?cation accuracy of any subset).The overall,objective of the algorithm is to construct the best suitable features 93d0cdd159eef8c75ebfb384er must follow the condition k≤|X|for selecting the feature set partitioning of the dataset,where it is assume that each feature of the dataset is suf?cient for learning.The proposed method assumes that each feature of the dataset is suf?cient for learning.Therefore,the initializing

123

Multi-view ensemble learning:an optimal feature set...

Fig.2An optimal feature set partition framework for multi-view ensemble learning

123

4Experiments

The comparative experiments are conducted on benchmark high-dimensional datasets to illustrate the potential of proposed OFSP method in the high-dimensional situation for the classi?cation task.

4.1Datasets

Sixteen high-dimensional dataset have been used for the experiments.Datasets are selected from UCI dataset repository[63],Kent Ridge Bio-medical Dataset[64]and NIPS2003Fea-ture Selection Challenges[65].All datasets are listed in the Table2with their number of samples and features that are binary labeled.

4.2Experimental setup

The parameters setting of the RFSP,Ba,AB,VG-AC and VC-GA methods are given below that are used for comparison.For all method,the values blocks of partition k= 1,2,3,4,...,10are considered.

123

Multi-view ensemble learning:an optimal feature set ...Table 2List of binary labeled high-dimensional datasets

S.N.Dataset name

Number of samples Number of attributes 1.All-AML Leukemia [64]3871302.Arcene [65]10099203.Breast cancer [63]

8671304.Central Nervous System [64]6071305.Colon Cancer [63]6220006.Color Tumor [64]6220007.DLBCL Tumor [64]7771308.DLBCL-NIH [64]16074009.DLBCL Stanford [64]47402710.Leukemia [64]

72707111.Lung Cancer Harvard2[64]3212,53412.Lung Cancer Michigan [64]93713013.Lung Cancer Ontario [64]39288114.Madelon [63]50050015.Prostate Tumor versus Normal [64]10212,60116.

SECOM [63]

200

468

?Random feature set partitioning (RFSP):The features set of the dataset X is partitioned such as k i =1X i =X ;and X i ∩X j =?;where i =j .Each X i learns and ensemble their prediction.

?Bagging (Ba):Bagging belongs to the sampling scheme.It generates k -training sets by randomly sampling k -times with replacement and generated training sets are used to build classi?ers,respectively.

?Attribute bagging (AB):This method has two phases;?rst phase is ?nding the appropriate size of the attribute subset and second phase is the ensemble of classi?ers for the prediction.In the ?rst phase,an appropriate feature subset size m is found by testing classi?cation accuracy of variously size random subsets of features.Then,k -views are constructed of size m and ensemble them for the prediction.

?View Construction using Genetic Algorithm (VC-GA):This method is used GA to construct the views of the dataset where chromosomes in the GA are the binary bits.The length of the chromosomes is the number of features.The each bit is associated with one feature,where if the i th bit is 1,the i th feature is selected;otherwise,this feature is ignored.In the experiment,initial populations of chromosomes are generated by randomly ?ipping each bit.Cross generational is used for inpidual selection strategy.Size of population is used 50,and the size of offspring is 100.Then select the 50best inpiduals from the combined parent-offspring population as the next generation.The mutation and crossover probability used are 0.03and 0.66respectively.The top chromosomes are used for k =1,2,3,...,10number of views.

?View Generation using Attribute Clustering (VG-AC):In this method,the features are partitioned using the clustering technique.The k-mean clustering technique is used for partition,where each resulting view corresponds to one cluster of features.All experiments are performed on server version of MATLAB-2012b (64-bit).For learning purpose,KNN,NB,and SVM (linear kernel)classi?ers of PRtools [66]are used for the

123

V.Kumar,S.Minz Fig.3Box plot description

classi?cation task.KNN,NB and SVM classi?ers are used to learn from each view of the k-partitions.Performance weighting ensemble technique is used to ensemble of the classi?ers. To obtain the generalized accuracy,5-fold cross-validation is performed10times.The training dataset is pided into ten disjoint instance subsets.Each instance subset is utilized nine times as training set and one time as a test set.Average classi?cation accuracies are noted for the comparison.For all datasets,the same cross-validation setups are implemented with KNN, NB and SVM classi?ers.Weighted performance voting is used to ensemble of the classi?ers. Disagreement among the classi?ers is calculated by the disagreement measure given in Eq.3 for each value of k.The execution time for each method is obtained based on CPU time.

4.3Results

A simple method to display summarization of distribution is box plot[67].A diagram of box plot is shown in the Fig.3.Top and bottom of the box are represented by upper and lower quartiles which are75and25%respectively.The difference between upper and lower quartile is called interquartile range(IQR).Maximum value is the largest observation which is less than or equal to the upper quartile plus1.5×I Q R.The minimum value is greater than or equal to lower quartile minus1.5×I Q R.If,any value falls outside of the range of the two adjacent values,it is called outlier.

The effectiveness of the OFSP method to enhance the MEL is measured by classi?ca-tion accuracy of MEL,disagreement among the classi?ers,execution time and dimension reduction for the views k=1,2,3,...10of the datasets.

?Classi?cation accuracy of the MEL:The classi?cation accuracies of MEL for each feature set partitioning have obtained using OFSP method that is shown in Table3,where the accuracies are obtained using KNN,NB,and SVM classi?ers.

The box plots of MEL classi?cation performances using feature set partitioning methods are shown in the Fig.4for each datasets.The box plots of MEL classi?cation accuracies using RFSP,Ba,AB,VG-AC,VC-GA and OFSP methods are shown from left to right to respective classi?ers.The highest classi?cation accuracies of the feature set partitioning methods on the basis of the median of box plot are circled for better visualization for respective classi?ers.The plus sign in the box plot indicates the outliers.?Disagreement among the classi?ers:The disagreements among the classi?ers for the views using feature set partitioning methods have calculated where,the minimum value of disagreement indicates the highest agreement.The Table4is shown the disagreement among the classi?ers using OFSP method.In the table,the disagreements among the classi?ers have obtained for two to ten views of the datasets.The box plots of disagreement 123

Multi-view ensemble learning:an optimal feature set...

among the classi?ers using RFSP,Ba,AB,VG-AC,VC-GA and OFSP methods are shown in Fig.5for sixteen datasets.For each dataset,box plots of disagreements for RFSP, Ba,AB,VG-AC,VC-GA and OFSP methods are shown from left to right in respective classi?ers KNN,NB,and SVM.The lowest values of disagreement on the basis of the median of the box plot are circled in each dataset,where a plus sign indicates the outliers.?Execution time of MEL:The execution times of KNN,NB,and SVM classi?ers have calculated for each view that has shown in the Table5.In the table,the execution times of the classi?ers have obtained for two to ten views of the datasets.The box plots of the execution time of MEL using RFSP,Ba,AB,VG-AC,VC-GA and OFSP methods for sixteen datasets are shown in the Fig.6.Box plots of execution time for each feature set partitioning methods are shown from left to right of respective classi?ers.The box plot having minimum medians of MEL execution time using feature set partitioning methods is circled for respective classi?ers.

?Dimension reduction:The percentage of features utilized by OFSP method for MEL using KNN,NB,and SVM classi?er are shown in Fig.7a–c respectively,where,x-axis and y-axis are represented number of views of the datasets and total features(in percentage), respectively.

In order to analyze the effectiveness of OFSP method,the1-vs-N non-parametric sta-tistical inferences have been performed on classi?cation accuracy,disagreement among the classi?ers and execution time.Many nonparametric tests have been proposed for perform-ing multiple comparisons such as Friedman test[68],Friedman Aligned Test[67],multiple Sign-test[69],Contrast estimation based on median[70]etc.In this research,Friedman Align test has been considered that has different ranking computation procedure and gives better results than Friedman test.Through normal approximations,the p value is computed[71]. The post hoc test is performed when null hypothesis is rejected[72].The p value provides information about“how signi?cant”the result is.The smaller value of p value shows the stronger evidence against the null hypothesis[72].Many p value adjustment procedure have been proposed such as Bonferroni–Dunn procedure[73],Holm procedure[74],Holland pro-cedure[75],Finner procedure[76]etc.In this research,Holm procedure is used to adjust the p value for post hoc test,where it rejects those hypotheses that have an unadjusted p value less thanα=0.05[77].Holm procedure adjusts the value ofαin step down manner. Friedman Aligned and Post Hoc Test of MEL classi?cation accuracies,disagreement among the classi?ers and execution time are shown in the Tables6,7and8.In the Tables,ranking, Aligned Friedman Statistic,p values obtained in by applying post hoc methods and Holm procedure’s are stated for each dataset corresponding KNN,NB and SVM classi?ers.

5Analysis

Effectiveness of MEL using OFSP method is analyzed in four aspects namely,classi?cation accuracy of MEL,disagreement among the classi?ers,MEL execution time and dimension-ality reduction.The Tables3,4and5are considered for the analysis of classi?cation accuracy of MEL,disagreement among the classi?ers and execution time using OFSP method,respec-tively.The ranks using Friedman Aligned test,p value obtained by post hoc method and Holm Procedures of the RFSP,Ba,AB,VG-AC,VC-GA and OFSP methods have been obtained for each datasets corresponding classi?cation accuracy of MEL,disagreement among the classi?ers and MEL execution time,which are shown in Tables6,7and8.The lowest value of rank indicates highest ranking of the method.The p value obtained by post hoc method and

123

V .Kumar,S.Minz

T a b l e 3C l a s s i ?c a t i o n a c c u r a c i e s o f M E L u s i n g O F S P m e t h o d k =1,2,3,...10v i e w s o f t h e d a t a s e t s

D a t a s e t s K N N c l a s s i ?e r

V 1V 2

V 3

V 4V 5V 6V 7V 8V 9V 10

A l l -A M L L e u k e m i a 100100

100

100100100100100100100

A r c e n e 94.0097.00

97.00

98.0098.0097.0099.0098.0098.00

99.00

B r e a s t

C a n c e r 100100

100

100100100100100100

100

C e n t r a l N e r v o u s S y s t e m 86.6786.67

85.00

91.6791.6795.0093.3395.0091.67

95.00

C o l o n C a n c e r 95.1695.16

95.16

96.7795.1693.5593.5593.5595.16

100.00

C o l o r T u m e r 74.0376.94

75.97

76.1377.2677.9075.1678.23

77.10

77.74

D L B C L S t a n f o r d 95.1695.16

95.16

96.7795.1693.5593.5593.55

95.16

100

D L B C L _N I H 97.87100

100

100100100100100

100

100

D L B C L T u m o r 74.3876.25

79.38

86.2586.8883.1385.6384.38

85.63

83.13

L e u k e m i a 10097.40

100

100100100100

100

100

100

L u n g C a n c e r O n t a r i o 100100

100

100100100100

100

100

100

L u n g C a n c e r H a r v a r d 2100100

100

94.8794.8794.8797.44

97.44

94.87

97.44

L u n g C a n c e r M i c h i g a n 100100

100

100100100100

100

100

100

M a d e l o n 100100

100

100100100

100

100

100

100

P r o s t a t e T .v e r s u s N o r m a l 75.2084.60

82.80

84.4084.4080.00

80.20

82.20

81.00

84.40

S E C O M 92.1694.12

94.12

95.1097.0695.10

97.06

96.08

96.08

96.08

86.5089.50

88.5091.5090.0092.00

93.00

90.50

92.50

93.00

123

Multi-view ensemble learning:an optimal feature set ...

T a b l e 3c o n t

i n u e d D a t a s e t s N B c l a s s i ?e r V 1V 2V 3V 4V 5V 6V 7V 8V 9V 10A l l -A M L L e u k e m i a 97.37100100100100100100100100100A r c e n e 52.3399.5399.7710010010010010099.88100B r e a s t C a n c e r 97.6710010010010010010097.67100100C e n t r a l N e r v o u s S y s t e m 10010095.0010095.0096.6796.6798.3396.67100C o l o

n C a n c

e r 95

.1693.5591.9495.1695.1695.1695.1693.5593.5595.16C o l

o r T u m e r 10010096.7710098.3998.3910096.7793.5598.39D L B C L S t a n f o r d 97.8710097.87100100100100100100100D L B C L _N I H 76.8878.7580.0078.1386.2580.0078.1378.7579.3885.00D L B C L T u m o r 10010096.1010010010010098.7010098.70L e u k e m i a 10010010010010010010010098.61100L u n g C a n c e r O n t a r i o 92.3110087.1889.7492.3189.7487.1889.7487.1887.18L u n g C a n c e r H a r v a r d 2100100100100100100100100100100L u n g C a n c e r M i c h i

g a n 1

00100100100100100100100100100M a d e l o n

64.8068.4072.4070.0070.4073.0069.4076.0074.2075.80P r o s

t a t e

T

.

v

e r s u

s

N

o

r

m a l

96.0898.0495.1098.0495.1093.1494.1296.0896.0897.06S

E C

O

M 88.0089.5091.5090.

5091.

5091.

0091.

5091.

0091.5091.5

09

7.3

71001

001

001

001

001

001

001

001

00123

V .Kumar,S.Minz

T a b l e 3c o n t i n u e d

D a t a s e t s S V M c l a s s i ?e r

V 1V 2

V 3

V 4V 5V 6V 7V 8V 9V 10

A l l -A M L L e u k e m i a 100100

100

100100100100100100100

A r c e n e 100100

96.67

10010098.3396.6796.6798.33100

B r e a s t

C a n c e r 100100

100

100100100100100100

100

C e n t r a l N e r v o u s S y s t e m 100100

98.33

98.3398.3310010010098.33

100

C o l o n C a n c e r 95.1698.39

96.77

98.3991.9496.7793.5593.5593.55

93.55

C o l o r T u m e r 100100

98.39

98.3998.3996.7796.7795.1695.16

95.16

D L B C L S t a n f o r d 100100

100

100100100100100

100

100

D L B C L _N I H 80.6385.00

78.75

85.0081.8885.0083.1388.75

86.25

88.13

D L B C L T u m o r 64.0066.00

64.60

64.8064.6067.2067.0066.20

69.00

69.80

L e u k e m i a 100100

100

100100100100

100

100

100

L u n g C a n c e r O n t a r i o 100100

97.44

97.4489.7494.87

97.44

94.87

94.87

94.87

L u n g C a n c e r H a r v a r d 2100100

100

100100100100

100

100

100

L u n g C a n c e r M i c h i g a n 100100

100

100100100100

100

100

100

M a d e l o n 64.0063.80

66.00

65.0066.0066.00

66.40

68.00

68.40

68.60

P r o s t a t e T .v e r s u s N o r m a l 100100

96.78

96.7898.39

96.78

95.16

96.77

91.94

96.78

S E C O M 100100

100

100100100

100

100

100

100

100100

100100100100

100

100

100

100

123

Multi-view ensemble learning:an optimal feature set...

Fig.4Box plots of MEL classi?cation accuracies of corresponding views1,2,3,...,10,for comparison of RFSP,Ba,AB,VG-AC,VC-GA and OFSP methods(from left to right plots for each dataset)using KNN,NB and SVM classi?ers respectively

Holm procedures have been analyzed forα=0.05,where p≤αrejects the null hypothesis. From the Figs.4,5and6,all the comparison of measure are done on the basis of median of performances,if medians are equal,then their maximum and minimum value are considered for comparison.

5.1Analysis of MEL classi?cation accuracy

According to the KNN,NB and SVM classi?ers,the analysis of MEL classi?cation accuracy using RFSP,Ba,AB,VG-AC,VC-GA and OFSP methods and their statistical analysis form the Table6are analyzed bellow:

123

V .Kumar,S.Minz

T a b l e 4D i s a g r e e m e n t s a m o n g t h e c l a s s i ?e r s o f c o r r e s p o n d i n g v i e w s u s i n g O F S P m e t h o d ,w h e r e k =1,2,3,...10v i e w s o f t h e d a t a s e t s

D a t a s e t s K N N c l a s s i ?e r

V 1V 2

V 3

V 4V 5V 6V 7V 8V 9

A l l -A M L L e u k e m i a 0.00000.0175

0.0132

0.01050.00000.01500.00660.01170.0000

A r c e n e 0.05000.0733

0.0733

0.08200.07330.07520.07110.0728

0.0780

B r e a s t

C a n c e r 0.00000.0000

0.0000

0.00000.00000.00000.00000.0000

0.0000

C e n t r a l N e r v o u s S y s t e m 0.18330.2222

0.1556

0.16670.19560.17460.19290.1861

0.2085

C o l o n C a n c e r 0.11290.0968

0.0968

0.04520.08170.08140.07140.0771

0.0853

C o l o r T u m e r 0.11290.0968

0.0968

0.04520.08170.08140.0714

0.0771

0.0853

D L B C L S t a n f o r d 0.30380.3088

0.3198

0.33540.32510.33610.3292

0.3332

0.3341

D L B C L _N I H 0.30630.2292

0.2708

0.32630.30420.29700.2964

0.2767

0.2821

D L B C L T u m o r 0.02600.0087

0.0325

0.02080.01730.05690.0380

0.0375

0.0505

L e u k e m i a 0.00000.0093

0.0139

0.00560.00000.00790.0035

0.0085

0.0105

L u n g C a n c e r O n t a r i o 0.12820.1026

0.1282

0.08720.10770.1319

0.1255

0.1068

0.1521

L u n g C a n c e r H a r v a r d 20.00000.0000

0.0000

0.00000.00000.0000

0.0000

0.0000

0.0000

L u n g C a n c e r M i c h i g a n 0.00000.0000

0.0000

0.00000.00350.0000

0.0000

0.0000

0.0000

M a d e l o n 0.18800.3240

0.3533

0.35120.37610.4036

0.4050

0.3907

0.4339

P r o s t a t e T .v e r s u s N o r m a l 0.11760.1046

0.0866

0.10590.0954

0.1102

0.0805

0.0893

0.1015

S E C O M 0.09500.0767

0.15670.18100.1670

0.1895

0.1725

0.1911

0.1988

123

Multi-view ensemble learning:an optimal feature set ...

T a b l e 4c o n t

i n u e d D a t a s e t s N B c l a s s i ?e r V 1V 2V 3V 4V 5V 6V 7V 8V 9A l l -A M L L e u k e m i a 0.00000.01750.01320.01050.01750.02760.01790.05260.0322A r c e n e 0.17000.12000.13170.12600.11670.11900.13360.12500.1289B r e a s t C a n c e r 0.00000.03100.01160.03260.05270.05090.03650.03040.0434C e n t r a l N e r v o u s S y s t e m 0.11670.16670.18890.19330.20560.20160.18570.20090.2189C o l o

n C a n c

e

r

0.0

8060.06450.06180.08390.06130.08600.08580.06900.0821C o l o r T u m e r 0.11290.11830.16130.15810.14730.19350.17400.18280.1910D L B C L S t a n f o r d 0.04260.01420.03190.02130.02700.01820.02130.01890.0170D L B C L _N I H 0.26880.23750.21560.28250.25880.26070.24200.26490.2894D L B C L T u m o r 0.03900.09520.03680.08570.10300.06060.07510.09740.0964L e u k e m i

a 0.00000.00000.00690.01110.02130.02250.01790.03090.0244L u

n

g C a n c e

r O n t a r i o 0.10260.11970.11540.11280.17950.12210.14930.13390.1231L u n g C a n c e r H a r v a r d 20.00000.00000.01560.00000.00000.00000.00000.00000.0000L u n g C a n c e r M i c h i

g a n 0.01040.00690.00000.00000.00000.00000.00520.00230.0000M a d e l o n 0.30000.36400.37400.36400.40230.40270.40090.42270.4353P r o s

t

a t e T

.

v e

r

s

u

s

N

o

r

m a l

0.03920.09150.05880.09220.07840.10270.10330.08770.1002S

E

C

O

M

.0

8

500.14330

.11420

.12000

.13970

.15330

.16500

.16500

.1709123

本文来源:https://www.bwwdw.com/article/2bmq.html

Top