Meta-classifier approach to reliable text classification
更新时间:2023-05-22 12:17:02 阅读量: 实用文档 文档下载
- metaclass推荐度:
- 相关推荐
A problem with automatic classifiers is that there is no way to know if a particular classification is just a guess or a certain answer. Reliable classification is the task of predicting whether a certain instance is correctly classified or not, i.e., a cl
Meta-Classi erApproachesto
ReliableTextClassi cation
A.M.Kaptein
Master’sThesisCS05-15
Thesissubmittedinpartialful lment
oftherequirementsforthedegreeof
MasterofScienceofKnowledgeEngineering
intheFacultyofGeneralSciences
oftheUniversiteitMaastricht
Thesiscommittee:
Prof.dr.H.J.vandenHerik
dr.E.N.Smirnov
Prof.dr.E.O.Postma
dr.W.J.G.Hacking
UniversiteitMaastricht
InstituteforKnowledgeandAgentTechnology
DepartmentofComputerScience
Maastricht,TheNetherlands
August2005
A problem with automatic classifiers is that there is no way to know if a particular classification is just a guess or a certain answer. Reliable classification is the task of predicting whether a certain instance is correctly classified or not, i.e., a cl
Abstract
Aproblemwithautomaticclassi ersisthatthereisnowaytoknowifaparticu-
larclassi cationisjustaguessoracertainanswer.Reliableclassi cationisthe
taskofpredictingwhetheracertaininstanceiscorrectlyclassi edornot,i.e.,a
classi cationisclassi edaseitherreliableorunreliable.Whentheclassi cation
isclassi edasunreliable,itislikesaying“Idonotknow”,andtheinstance
doesnotreceiveaclassi cation.
Givenabaseclassi er,themeta-classi erapproachistotrainameta-
classi erthatpredictsthecorrectnessofeachclassi cationofthebaseclassi er.
Theclassi cationruleofthemeta-classi erapproachistoassignaclasspre-
dictedbythebaseclassi ertoaninstanceifthemeta-classi erdecidesthatthe
baseclassi cationisreliable.
Themeta-classi erapproachisappliedontextclassi cationtasksprovided
bytheCBStoanswerthefollowingproblemstatement:
Doesthemeta-classi erapproachprovideapracticalsolutionto
reliabletextclassi cation?
The rstpartoftheresearchstudiestextclassi ers,andprovidesananswer
totheresearchquestion:
1.Whichtextclassi ersachievehighaccuracyandatthesametimehave
smallspaceandtimecomplexity?
ExperimentsontheCBSdatasetsshowthatthenearestneighbourandthe
na¨ veBayesalgorithmincombinationwiththetfidftextrepresentationare
acceptabletextclassi ers.
Thesecondpartoftheresearchstudiesthemeta-classi erapproachtopro-
videananswertothesecondandthirdresearchquestion.Oursecondresearch
questionis:
2.Whattypeofmetadatarepresentationisbestsuitedforreliabletextclas-
si cation?
Themeta-classi eristrainedonseveraltypesofmetadatarepresentations.The
usedmetadatarepresentationsincludetheoriginalinstances,theprobability
distributionofthebaseclassi erandasetofbasicstatisticsabouttheclassi -
cationofthebaseclassi er.Fortaskswithmanyclasses,theoriginalinstances
representationisbest.Fortaskswithasmallnumberofclasses,theoriginal
instancesrepresentationandtheprobabilitydistributionarebothgoodcandi-
dates.
i
A problem with automatic classifiers is that there is no way to know if a particular classification is just a guess or a certain answer. Reliable classification is the task of predicting whether a certain instance is correctly classified or not, i.e., a cl
Ourthirdresearchquestionis:
3.Shouldthemeta-classi erbelocalorglobal?
Inadditiontotheglobalapproach,thatlearnsonemeta-classi erforallclasses,
alocalapproachisapplied.Thelocalapproachlearnsonemeta-classi erfor
eachclassintheoriginaldata.Ourstudyshowstheglobalapproachisprefer-
able,itachievesthesameperformancewithlesscomputationale ort.
Themeta-classi erapproachprovidesasoundsolutiontothetaskofreliable
classi cation,iftheprecisionofthemeta-classi eronthereliableclassis100%.
Inpracticewedonotreachthislevelofaccuracy.
Ourconclusion,toanswertheproblemstatement,isthatthemeta-classi er
approachinpracticedoesnotprovideasoundsolutiontoreliabletextclas-
si cation.Themeta-classi erapproachdoesachieveaconsiderableincrease
inaccuracy.Itisane cientsolutionaslongasthebaseclassi ersandthe
meta-classi ersaree cient,andthepropermetadatarepresentationisused.
ii
A problem with automatic classifiers is that there is no way to know if a particular classification is just a guess or a certain answer. Reliable classification is the task of predicting whether a certain instance is correctly classified or not, i.e., a cl
Preface
Thisthesiscontainsadescriptionoftheresearchonreliabletextclassi cation
thatIhavebeendoingasmyM.Sc.project.Theresearchhasbeencarriedout
duringaninternshipattheCBSinHeerlenfromMarchtoAugust2005.
Theresearchwouldnothavebeenpossiblewithoutthesupportofthefol-
lowingpersons.IwouldliketothankProf.JaapvandenHerikandProf.Eric
Postmafortheiradviceandsupport.AttheCBSIwouldliketorecognize
mycolleaguesforhelpingmeoutwithmyoccasionalcomputerproblems,and
inparticularIwouldliketothankDr.WimHackingforsupervisingmeand
givingmeaviewonthepracticalsideoftheproblem.FinallyIwouldliketo
acknowledgedr.EvgueniSmirnovforalwaysbeingavailabletoanswerallmy
questions,formanyfruitfuldiscussionsandforbeinghonestatalltimes.
RianneKaptein
August2005
iii
A problem with automatic classifiers is that there is no way to know if a particular classification is just a guess or a certain answer. Reliable classification is the task of predicting whether a certain instance is correctly classified or not, i.e., a cl
Contents
Abstract
Preface
1Introduction
1.1TaskofReliableClassi cation....................
1.2RelatedWork.............................
1.2.1BayesianFramework.....................
1.2.2StatisticalFrameworks....................
1.2.3Version-SpaceSupport-VectorMachines..........
1.2.4Meta-Classi erApproach..................
1.3ProblemStatementandResearchQuestions............
1.4ThesisOutline............................
2TheCBSData
2.1Backgrond...............................
2.2DataDescription...........................
2.3DataPreparation...........................
2.3.1SelectionofInformativeFields...............
2.3.2Feature-Reduction......................
2.4ChapterSummary..........................
3TextClassi cation
3.1TextRepresentation.........................
3.2Classi cationAlgorithms......................
3.2.1Na¨ veBayes..........................
3.2.2NearestNeighbour......................
3.2.3HierarchicalClassi er....................
3.3ExperimentswiththeTextClassi ers...............
3.3.1ExperimentalSettingandMethodology..........
3.3.2ExperimentalResults....................
3.3.3EvaluationandDiscussionoftheResults.........
3.4ChapterSummary..........................
4Meta-Classi erApproachestoReliableClassi cation
4.1De nition...............................
4.2MetadataRepresentations......................
4.3GenerationofMetadata.......................
iviiiii11222344677899911121213141414151516192122222324
A problem with automatic classifiers is that there is no way to know if a particular classification is just a guess or a certain answer. Reliable classification is the task of predicting whether a certain instance is correctly classified or not, i.e., a cl
CONTENTS
4.4
4.5
4.6LocalMeta-Classi erApproach..........AccuracyandCoverage..............ExperimentswiththeMeta-Classi erApproach.
4.6.1ExperimentalSettingandMethodology.
4.6.2ExperimentalResults...........
4.6.3EvaluationandDiscussionoftheResults
ChapterSummary................................................................................242526262731324.7
5ConclusionandFutureResearch33
5.1Conclusion..............................33
5.2Futureresearch............................34
Appendix35
ATestResultsoftheMeta-Classi erApproach36
A.1TestResultsontheProfession2Dataset..............36
A.2TestResultsontheEducation2Dataset..............39
References44
v
A problem with automatic classifiers is that there is no way to know if a particular classification is just a guess or a certain answer. Reliable classification is the task of predicting whether a certain instance is correctly classified or not, i.e., a cl
Chapter1
Introduction
Thischapterisanintroductiontotheresearchtopicofreliableclassi cation.Wedescribethetaskofreliableclassi cationinsection1.1.Variousapproachestosolvingtheproblemofreliableclassi cationarediscussedinsection1.2.Insection1.3ourchoiceforthemeta-classi erapproachthatwewouldliketoin-vestigateismotivated.Moreover,theproblemstatementandresearchquestionsareformulated.Finally,insection1.4anoutlineoftheremainingpartofthethesisisgiven.
1.1TaskofReliableClassi cation
Knowingthelimitsofone’sownknowledgeisatypicalfeatureofhumanintel-ligence.Whenfacedwithadi cultproblem,onecancometotheconclusionthathisknowledgeisnotsu cienttosolvetheproblemandonecansay“Idonotknow”putersareabletosolvedi erentkindsofdi cultproblemssuchasplayingchessormakinghugecomputations.Butincontrasttohumanscomputersusuallydonothavethecapabilitytorecognizethelimitsoftheirknowledge.
Machinelearningisconcernedwithdevelopingclassi erstolearnfromex-perienceorextractknowledgefromexamplesinadatabase[Mitchell,1997].Theclassi ersthatwestudyinthisthesislearnfromlabelledinstancesandareabletoclassifyunseenandunlabelledinstances.However,theseclassi ersdousuallynotdistinguishbetweenluckyguessesandcertainanswers.Theyjustclassifyallnewinstances,sothatyoucanneverbereallysureifapar-ticularclassi cationiscorrect.Inotherwords:theclassi ersarenotawareofthelimitsoftheirknowledge.Thatisthereasonwhyclassi ersareoflimiteduseinmanyreal-worldapplications.Thecostsofamisclassi cationareoftenhigh,especiallyinsafety-criticaldomainssuchasmedicaldiagnosingandairtra ccontrolsystems.Classi ersareonlyusedinapplicationswhereerrorsdonothavefar-reachingconsequences,forexampleinternetapplicationssuchasautomaticdocumentindexing,spam- ltering,wordsensedisambiguation,andhierarchicalcategorizationofWebpages[Sebastiani,2002,Delanyetal.,2004].Inthisthesiswestudythetaskofreliableclassi cation.Wede nethistaskasataskofpredictingwhetheracertaininstancehasbeencorrectlyclassi ed,1Throughoutthethesisweuse‘he’or‘his’whenbothhe/sheandhis/herarepossible.
1
A problem with automatic classifiers is that there is no way to know if a particular classification is just a guess or a certain answer. Reliable classification is the task of predicting whether a certain instance is correctly classified or not, i.e., a cl
1.2.RELATEDWORK
i.e.,theclassi cationoftheinstanceisclassi edaseitherreliableorunreliable.
Whentheclassi cationisclassi edasreliable,itiscomparabletosaying“I
know”,andtheinstanceiscorrectlyclassi ed.Whentheclassi cationisclassi-
edasunreliable,itiscomparabletosaying“Idonotknow”,andtheinstance
isnotclassi ed.Solvingthetaskmeansthatallclassi cationsarecorrectand
theclassi erachievesanaccuracyof100%onthe“Iknow”instances.
Thereliabilityofaclassi cationcanbede nedinseveralways.Weadopt
thede nitionofKukarandKononenko[2002].Theyde nethereliabilityofa
classi cationastheprobabilitythattheclassi cationiscorrect.
Reliableclassi cationhastwoimportantapplicationareas:real-worldap-
plicationsandensemblesofclassi ers.Reliableclassi cationcanbeusedin
real-world,safety-criticaldomains,becauseitreducestheriskofmisclassi ca-
tion.Whenunreliableclassi cationsarediscarded,thecoverage(percentageof
recordsthatcanbeclassi ed)oftheclassi erdiminishes,butatthesametime
itsaccuracy(percentageofcorrectlyclassi edrecords)increases.
Reliableclassi cationisalreadybeingusedinanothercontextaswell,namely
inensemblesofclassi ers.Ensemblesclassifyaninstancebycombiningthe
classi cationsofmultipleclassi ers.The nalclassi cationcanbeobtainedin
severalways,e.g.,by(unanimous)votingoftheclassi ers.Reliableclassi ca-
tioncanbeusedtosifttheclassi cationsoftheclassi ers.Onlytheclassi ers
thatmakeareliableclassi cationwillbeusedtomakethe nalclassi cation
biningensemblesofclassi erswithreliableclassi cation
cansubstantiallyincreaseaccuracy[Seewald,2003,Ting,1996].
1.2RelatedWork
Fourdi erentapproachestoreliableclassi cationthathavebeendevelopedin
thelasttenyearsarediscussedinthissection:theBayesianframework,the
statisticalframeworks2,version-spacesupport-vectormachines,andthemeta-
classi erapproach.
1.2.1BayesianFramework
Classi ersintheBayesianframeworkusuallyproduceclassi cationsintheform
ofposteriorprobabilitydistributionsoverallpossibleclasses.Thesimplest
approachtoreliableclassi cationistousetheposteriorprobabilityofasingle
classi cationasameasureforreliability[Ting,1996].However,itisshown
thatposteriorprobabilitiesofclassi erslikena¨ veBayesarepoormeasuresfor
reliability,sincetheyarebasedonincorrectpriorassumptions[Melluishetal.,
2001,Delanyetal.,2004].Thus,theBayesianframeworkcanbemisleadingfor
reliableclassi cation,andisthereforeconsideredinappropriateforourpurposes.
1.2.2StatisticalFrameworks
Severalapproachestoreliableclassi cationarebasedonthealgorithmictheory
ofrandomness[Vovketal.,1999].Sincethereisnocomputable,universaltest
outputoftheclassi ersintheBayesianframeworkandthestatisticalframeworksis
areliabilityestimation,ratherthanareliableclassi cation.Byusingthresholds,reliability
estimationscaneasilybeconvertedtoreliableclassi cations.Whenthereliabilityestimation
isaboveacertaintresholdtheclassi cationisconsideredreliable.2The
2
A problem with automatic classifiers is that there is no way to know if a particular classification is just a guess or a certain answer. Reliable classification is the task of predicting whether a certain instance is correctly classified or not, i.e., a cl
1.2.RELATEDWORK
forrandomness,computableapproximationsofalgorithmictestsofrandomness
areused[Proedrouetal.,2001].Inthiscontextwedistinguishthetypicalness
frameworkandthetransductionframework.
Thetypicalnessframeworkprovidesreliabilityestimationsondatathatis
independentlyandidenticallydistributed(iid).Considerasequenceofinstances
togetherwithanewinstanceofanunknownclass.Thetypicalnessframeworkis
usedtogainameasureofreliabilityforallpossibleclassesofthisnewinstance
usingatypicalnessfunction.Foreachpossibleclassitisexaminedhowlikelyit
isthatallinstancesoftheextendedsequence,i.e.,thesequencewiththenew
instanceadded,aredrawnindependentlyfromthesamedistribution.Themore
typicalthesequenceis,thehigherthereliabilitymeasure.
Thetypicalnessfunctioncanbeconstructedbymeasuringthe“strangeness”
ofindividualinstances,usingindividualstrangenessfunctions[Kukar,2004].
Adrawbackofthisapproachisthatthestrangenessfunctiondependsonthe
classi cationalgorithmthatisused.Sofar,theonlysuccessfulapplications
useSupportVectorMachines[Vovketal.,1999]andthenearestneighbour
algorithm[Proedrouetal.,2001].Anotherdisadvantageofthisapproachisits
computationalcomplexity[KukarandKononenko,2002,Melluishetal.,2001].
Anotherstatisticalframeworkforreliableclassi cationisthetransduction
framework.Theframeworkisclassi er-independentanditisbasedonatrans-
ductivestepduringtheclassi cationprocess.First,aninstanceisclassi edbya
baseclassi er,andthentheinstanceisaddedtothetrainingset,togetherwith
theclassi cationitreceivedfromthebaseclassi er.Theclassi erisre-trained
andtheinstanceisclassi edagain.Thereliabilityoftheinstanceclassi ca-
tionismeasuredasthedi erencebetweenposteriorclassprobabilitiesthatthe
instancereceivesbeforeandafteritwasaddedtothetrainingset[Kukarand
Kononenko,2002].Smirnovetal.[2003b]showthatthisapproachisnotonly
computationallyine cient,butitalsorequireshighprecisionofthereal-number
representationwhenalargeamountofdataisusedwithmanyclasses.Onlyin
thecasethatthetrainingsetissmall,addinganinstancetothetrainingset
canchangethelevelofrandomnesssigni cantly.
Kukar[2004]presentsanalgorithmthatjoinsthetransductiveframework
andthetypicalnessframework,therebymakingthetransductivestepstatisti-
callysound.Experimentalresultsshowedthatreliabilitycanbeestimatedquite
accurately,butagainthehighcomputationalcostsposeaproblem.
1.2.3Version-SpaceSupport-VectorMachines
Adi erentapproachtoreliableinstanceclassi cationisbasedonversionspaces
[Smirnovetal.,2005].WerefertoMitchell[1997]foranintroductiontoversion
spaces.Themainideaistoconstructversionspacesthatcontainthehypothe-
sesofthetargetconceptstobelearnedortheircloseapproximations.The
unanimous-votingruleisimplementedbytestingversionspacesforcollapse.
Applyingthisrulemakesitimpossibletomisclassifyinstanceswhenthereisno
noiseinthedata.Althoughexperimentalresultsarepromising,thisapproach
iscomputationallyexpensiveaswell.
3
A problem with automatic classifiers is that there is no way to know if a particular classification is just a guess or a certain answer. Reliable classification is the task of predicting whether a certain instance is correctly classified or not, i.e., a cl
1.3.PROBLEMSTATEMENTANDRESEARCHQUESTIONS
1.2.4Meta-Classi erApproach
Themeta-classi erapproachtakesadi erentlinetoreliableclassi cation.
Givenabaseclassi er,theapproachistolearnameta-classi erthatpredicts
thecorrectnessofeachinstanceclassi cationofthebaseclassi er.Thebase
classi erplusthemetaclassi erformonecombinedclassi er.Theclassi cation
ruleofthecombinedclassi eristoassignaclasspredictedbythebaseclassi er
toaninstanceifthemeta-classi erclassi esthebaseclassi cationasReliable,
otherwisetheinstanceclassi cationisrejected[SeewaldandF¨urnkranz,2001].
Thecrucialstepforthemeta-classi erapproachisthegenerationofthe
metadatathatisusedtotrainthemeta-classi er.Themetadataarerepresented
byfourdi erentmetadatarepresentations.Allmetadatarepresentationshave
thesamebinarymetaclassattribute.Themetaclassofaninstanceindicates
thereliabilityofthebaseclassi cation.Iftheinstanceisclassi edcorrectly
bythebaseclassi erthemetaclassisReliable,otherwisethemetaclassis
Unreliable.Belowwelistfourmetadatarepresentations.
Originalinstancesrepresentation.Allattributesinthemetainstance,
excepttheclassattribute,arethesameasintheoriginalinstances.
Probabilitydistributionrepresentation.Ametainstancecontainsthepos-
teriorprobabilitiesforallclassesasgivenbythebaseclassi er.
Basicstatisticsrepresentation.Attributesofthemetainstancearedif-
ferentcharacteristicsofthebaseclassi cation,likethebaseclassandthe
posteriorprobabilityofthebaseclass.
Nearestneighbourdistancesrepresentation.Thisrepresentationcanonly
beusedincombinationwithanearestneighbourbaseclassi er.Attributes
ofthemetainstancearecalculateddistancesbetweennearest(un)like
neighboursofatargetinstance[CheethamandPrice,2004].
Ameta-classi ercanbetrainedondi erentclasslevels.Aglobalmeta-
classi erapproachlearnsonemeta-classi erforallclasses.Alocalmeta-
classi erapproachlearnsonemeta-classi erforeachbaseclass.Eachlocal
meta-classi erclassi esonlymetainstanceswithoneparticularbaseclass.
The rstapplicationsofmeta-classi ersareinthecontextofensembles.
Theestimatedreliabilityofaclassi cationisusedtochoosetheclassi er(s)
thatwillclassifytheinstance[SeewaldandF¨urnkranz,2001].Inthese rst
meta-classi erstheoriginalinstancesrepresentationisusedasmetadatarepre-
sentation.InalaterversionofSeewald[2003],alsotheprobabilitydistribution
representationisused.Real-worldapplicationsinwhichthemeta-classi erap-
proachisusedincludeautomatictextclassi cation[Smirnovetal.,2003a]and
spam- ltering[Delanyetal.,2004]edtheprobabilitydis-
tributionmetadatarepresentation,edthenearestneighbour
distancesrepresentation.
Themeta-classi erapproachisdiscussedindetailinthethirdchapter.
1.3ProblemStatementandResearchQuestions
Inthisthesiswelookforanapproachtoreliableclassi cationthatcanbeused
inreal-worldpracticalapplications,i.e.,welookforanapproachthatissta-
4
A problem with automatic classifiers is that there is no way to know if a particular classification is just a guess or a certain answer. Reliable classification is the task of predicting whether a certain instance is correctly classified or not, i.e., a cl
1.3.PROBLEMSTATEMENTANDRESEARCHQUESTIONS
tisticallysoundandcomputationallye cient.Thefourdi erentapproachesto
reliableclassi cationthatwediscussedinsection1.2allhaveshortcomings.The
Bayesianframeworkistoodependentonassumptionsaboutthepriordistribu-
tionwhenappliedinpracticeforreliableclassi cation.Thetypicalnessframe-
work,transductionframework,andVersion-SpaceSupport-Vectormachinesare
alltheoreticallysound,buttheyarecomputationallyexpensive.Themeta-
classi erapproachistheonlycandidateforourstudy.Byconstruction,
1.theapproachissoundaslongasthemeta-classi erisaccurate.The
accuracyofaclassi eristheproportionofcorrectlyclassi edinstances.
2.theapproachise cientaslongasthebaseclassi erandthemeta-classi er
aree cient.
Ourstudyispartlymotivatedbythefactthatthemeta-classi erapproach
wasnotexhaustivelystudied.Existingstudiesonmeta-classi ersfocusedon
theglobalapproachoronapplicationinthecontextofensemblesonly[Seewald,
2003,Smirnovetal.,2003a,Delanyetal.,2004].Thetheoreticalframeworkof
themeta-classi erapproachandthelocalmeta-classi erapproachhavenever
beenanalysedindetailforareal-worldapplication.Ouraimistoinvestigate
themeta-classi erapproachinthecontextoftextclassi cation.Hence,our
problemstatementreadsasfollows:
Doesthemeta-classi erapproachprovideapracticalsolutionto
reliabletextclassi cation?
Toanswerourproblemstatementweapplythemeta-classi erapproachon
realtextclassi cationtasksanddatasetsprovidedbytheCentraalBureauvoor
deStatistiek(CBS).Thedatasetsmatchthescaleoftheproblemwewouldlike
totackle:thedatasetsarelarge,ofhighdimensionality,andhavealarge
numberofclasses.
Inordertoaddresstheproblemstatement,threeresearchquestionshave
beenformulated.Tosolvethetextclassi cationtasksoftheCBSwe rsthave
totraintextclassi ers,andthenapplythemeta-classi erapproach.Therefore
ourresearchconsistsoftwoparts.
The rstpartoftheresearchstudiestheapplicabilityofdi erenttextclas-
si ersfortheCBSdata,leadingtothe rstresearchquestion:
1.Whichtextclassi ersachievehighaccuracyandatthesametimehave
smallspaceandtimecomplexity?
Thesecondandmainpartoftheresearchinvestigatesdi erentmeta-classi er
approachesappliedtothetextclassi erstrainedontheCBSdata.Thesecond
andthirdresearchquestionreadasfollows.
2.Whattypeofmetadatarepresentationisbestsuitedforreliabletextclas-
si cation?
3.Shouldthemeta-classi erbelocalorglobal?
Thesequestionsaimatstudyingtwomainfeatures,viz.thetypeofmetadata
representation,andthenature(i.e.,localorglobal).Themeta-classi ersare
evaluatedonthetextclassi cationtasksprovidedbytheCBS.Di erenttext
classi ersandmeta-classi ersarecombinedinordertoseewhichcombination
performsbest.
5
A problem with automatic classifiers is that there is no way to know if a particular classification is just a guess or a certain answer. Reliable classification is the task of predicting whether a certain instance is correctly classified or not, i.e., a cl
1.4.THESISOUTLINE
1.4ThesisOutline
Theremainingpartofthisthesisisstructuredasfollows.Inchapter2wede-
scribethetext-classi cationtasksandthedatasetsprovidedbytheCBS,and
wediscussthedata-preparationprocess.Chapter3studiestextclassi cation,
andelaboratesontextrepresentation,classi cationalgorithms,andexperiments
withtheCBSdata.Chapter4discussesthedetailsofthemeta-classi erap-
proach,anddescribesexperimentswiththeCBSdata.Finally,inchapter5we
drawaconclusion,andwegivesomerecommendationsforfutureresearch.
6
A problem with automatic classifiers is that there is no way to know if a particular classification is just a guess or a certain answer. Reliable classification is the task of predicting whether a certain instance is correctly classified or not, i.e., a cl
Chapter2
TheCBSData
TheCBShasprovidedthreedatasetsforthisresearch,whichareusedtoassesstheperformanceofthemeta-classi erapproach.Section2.1givessomeback-groundinformationaboutthetaskandthedatasets.Insection2.2wedescribethecontentsandthecharacteristicsofthesedatasets.Thedatapreparationprocess,whichincludesselectionofinformative eldsandfeaturereduction,isdiscussedinsection2.3.Finally,insection2.4wesummarizethechapter.
2.1Backgrond
TheCBSistheDutchnationalstatisticsinstitute.Amongotherthings,theCBScollectsdatabyinterviewersusingquestionnaires.Theprovideddatasetscontaintheanswersonopen-endedquestionsthatcanbefoundintheseques-tionnaires.Theyconcerntheeducationortheprofessionofindividuals.ThesedatatypicallyhaveatextualformatandareinDutch.SincetheCBSwishestoconvertthedataintousefulstatisticalinformation,thetextualdatahavetobetransformedintoclasses.Ourtaskistoassignasymboliccodefromaprede nedsetofsuchcodestothetextrecordscontainingtheanswersgiveninthequestionnaire.Thecodesforprofessionsandeducationsarede nedbytheCBSthroughso-calledSBCandSOI+codesrespectively[CBS,DivisieSocialeenRuimtelijkeStatistieken,SectorOntwikkelingenOndersteuning,2003,2001].Sofarmostofthedataismanuallyclassi ed,liketheprovideddatasets.Inpreviousresearchthepossibilitiesforautomatictextclassi cationhavebeenexplored[Smirnovetal.,2003a].Onthewhole,littleresearchhasbeencar-riedouttoclassifyanswerstoopen-endedquestionsinquestionnaires.Somedictionary-basedapproacheshavebeendeveloped.Theyusespecializeddic-tionariesthatassignatextfragmentautomaticallytoaspeci ccategoryifitcontainswordsmatchingthoseinthedictionaryrelevanttothecategory.Thetextfragmentscanbepreprocessedusingdi erentparsingtechniquessuchasdeletionoftrivialwords,su xremoval,andde nitionofsynonyms[MacchiaandD’Orazio,2000].Drawbacksofthedictionary-basedapproachesarethattheyarestaticandspecializeddictionarieshavetobedevelopedmanuallybyexpertsforeachcategory.Atext-classi cationapproachbasedonsupervisedmachinelearningisabettersolutioninthiscase,sincetherearenospecializeddictionariesavailablebutalargebodyofmanuallycodedtrainingdataison
7
A problem with automatic classifiers is that there is no way to know if a particular classification is just a guess or a certain answer. Reliable classification is the task of predicting whether a certain instance is correctly classified or not, i.e., a cl
2.2.DATADESCRIPTION#attributes
persample
~5,400
~3,300
~5,200
~2,900
~4,650
~2,900#classespersample~1,250~1,050~1,250~2,800~4,0508DatasetProfession1Profession2Education1Education2#instances192,46663,82349,09249,092#classes1,5631,5785,9918# elds551111samplesize~19,150~9,450~19,000~9,800~19,600~9,800
Table2.1:Descriptionofthedatasets.
hand[GiorgettiandSebastiani,2003].Adisadvantageofthetext-classi cation
approachisthatwhenthede nitionofthecodeschanges,thetrainingdatais
nolongerrelevant,andanewbodyofmanuallycodedtrainingdatahastobe
provided.
2.2DataDescription
TheCBShasprovidedthreedatasetsforthisresearch,Profession1,Profession2,
andEducation1.DatasetsProfession1andProfession2containtextrecordsde-
scribingprofessionsandhavesimilarstructures.Thetextrecordsintheprofes-
siondatasetscontain ve elds.Each eldcontainstheansweronaparticular
question,forexamplethecompanysomeoneworksfororsomeone’sdailyac-
tivities.ClassvaluesofthesedatasetsareSBCcodes(seeCBS,DivisieSociale
enRuimtelijkeStatistieken,SectorOntwikkelingenOndersteuning[2001]foradescriptionoftheSBCcodes).
Education1isthethirddataset,itcontainstextrecordsdescribingeduca-
tions.Thetextrecordsinthisdatasetconsistofeleven elds.These elds
containamongotherthingsthenameandlevelofeducation.Classvaluesof
thisdatasetareSOI+codes1.
AfourthdatasetisderivedfromtheEducation1dataset.Inallinstancesof
thisdatasettheoriginalclassvalueoftheEducation1datasethasbeenreplaced
withthe rstdigitofthisclassvalue,allother eldsremainthesame.The
resultistheEducation2dataset,whichhasonly8classvalues.
Intable2.1thebasicstatisticsofthedi erentdatasetscanbefound.
Allfourdatasetsarerandomlysampledfortestingtoreducethecomputa-
tionalrequirements.Fromthedatasets,samplesofapproximately10,000and
20,000instancesaretaken.Thenumberofattributesinthetableisthenumber
ofattributesafterthefeaturereductionthatwillbedescribedinthefollowing
SOI+code,consistsoffourparts,e.g.,603150-1-NVT-DOCTORAAL.The rstsix
digitsareaso-calledSOI-code.The rstdigitoftheSOI-coderepresentsthelevelofeduca-
tion,theseconddigitrepresentsthesublevelofeducation.Thethirddigitandfourthdigit
togetherrepresenttheeducation eld,andthe fthdigitandsixthdigittogetherrepresent
theeducationsub eld.ThesecondpartoftheSOI+codeisaserialnumberfollowingthe
SOI-code,designedtomaptheSOI-codesunambiguouslyontotheinternationalISCED(In-
ternationalStandardClassi cationofEducation)codes.Thethirdpartistheeducationtrack
andthefourthpartistheuniversityphase[CBS,DivisieSocialeenRuimtelijkeStatistieken,
SectorOntwikkelingenOndersteuning,2003].1The
8
A problem with automatic classifiers is that there is no way to know if a particular classification is just a guess or a certain answer. Reliable classification is the task of predicting whether a certain instance is correctly classified or not, i.e., a cl
2.3.DATAPREPARATION
section.
Thetextclassi cationtasksoftheCBSdi erfromacommontextclassi-
cationtaskintwoaspects.Firstly,thenumberofclassesisverylarge.The
Education1datasethasapproximately6,000classes,theprofessiondatasets
havealmost1,600classes.Secondly,thesizeoftheinstancesissmaller.While
mostclassi cationtasksconsiderwholedocuments,inthesetasksaninstance
consistsofonlyafewwords.Theinstancesdonotcontaincompletephrases.
Thesetwocharacteristicshaveconsiderableconsequencesontheperformance
ande ciencyofthetextclassi cationalgorithms.
2.3DataPreparation
Besidessamplingothertechniquesareusedtofurthereasethecomputational
complexity.Inthefollowingsubsectionstheselectionofinformative eldsand
di erentfeaturereductiontechniquesisdiscussed.
2.3.1SelectionofInformativeFields
RecordsintheCBSdatasetsconsistofmultiple eldsthatmightnotallbe
informative.Oneofthetasksinthepreviousresearchconsistedofassessingthe
contributionofthetext eldstotheclassi cationperformance.Inturneach
text eldoftheProfession1datasetwasleftouttodeterminethee ectonthe
accuracyofthetextclassi ers.Theconclusionofthistaskwasthatonlytwo
text eldsarerequiredfortheclassi cationofprofessions.Thesetext elds
containtheprofessionandthedailyactivitiesofanindividual[Smirnovetal.,
2003a].Weusethesametwotext eldstoclassifytherecordsintheProfession2
dataset,astheProfession2datasetcontainsthesame eldsastheProfession1
dataset.
Toidentifywhich eldsareinformativeintheeducationdatasets,eachtext
eldwas,alsointurn,leftouttodeterminethee ectontheaccuracy.Itturned
outthatleavingoutcertaintext eldsdidnotleadtoasigni cantimprovement
inclassi cationaccuracy.Hencealltext eldsareusedforclassi cationof
theEducation1andEducation2dataset.Anexplanationforthisresultisthat
onlyafew eldsare lledforeachrecordintheeducationdatasets,most elds
areemptyforthemajorityofrecords.Fieldscanbeemptybecausetheyare
irrelevantorunknownforacertainindividual.
2.3.2Feature-Reduction
Amajorproblemintextcategorizationisthehighdimensionalityofthefeature
space.Inourcasethefeaturespaceconsistsofalluniquewordsthatoccurin
thedatasets.Forexample,inasampleof20,000records,around15,000unique
wordsarepresent.Aninstanceisrepresentedbyallwordsthatoccurinthe
informative eldsofthetextrecord.Featurereductioncannotonlyreducetime
andspacecomplexity,butalsotheclassi cationperformancecanbeincreasedby
removingirrelevantanduninformativefeatures.Wewillusefeature-reduction
basedondocument-frequency,removalofstopwordsandstemmingtoreduce
thenumberoffeatures.
9
A problem with automatic classifiers is that there is no way to know if a particular classification is just a guess or a certain answer. Reliable classification is the task of predicting whether a certain instance is correctly classified or not, i.e., a cl
2.3.DATAPREPARATION
Document-frequencytresholdingandremovalofstopwordsareamongthe
simplesttechniquesforfeaturereduction.Thedocumentfrequencyofaword
isthenumberofdocumentsinwhichthewordoccurs.Eachwordthathasa
document-frequencylessthansomeprede nedthresholdisremovedfromthe
featurespace[YangandPedersen,1997].Inourresearchallwordsthatoccur
inonlyonetextrecordareremoved,whichleadstoanaveragereductioninthe
numberofattributesaround50%.
Theremovaloffunction-orstopwordsisanothercommontechniquethat
isapplied.Stopwordsaretopic-neutralwordssuchasarticles,prepositions,
conjunctions,andabbreviations,whichdonotcontributetoclassi cationper-
formance[Sebastiani,2002].Thelistofstopwordscontains89topic-neutral
words.Thereductioninthenumberoffeaturesisnotsobig,butthenumber
ofinstancesinwhichoneormorestopwordsareremoved,islarge.
Alessfrequentlyusedpre-processingtechniquethatweapply,isstemming.
Thegoalofstemmingistoimprovetheperformanceofinformation-retrieval
techniquessuchastextcategorizationbybringingunderoneheadingvariant
formsofawordwhichshareacommonmeaning.Stemmingcanreducethe
numberoffeaturesandimproveclassi cationperformanceatthesametime.In
generaltherearetwostemmingtechniques.The rsttechniqueisalgorithmic:
stemmingisdoneaccordingtocertainruleswhichstripsu xes,pre xes,or
in xes.Thesecondtechniqueisdictionary-basedstemming.Inthiscasethe
stemmedversionofawordcanbelookedupinaspecialdictionary.Since
dictionarybasedstemmingisslowerthanalgorithmicstemmingandaDutch
stemmingdictionaryisnotreadilyavailable,wewillusealgorithmicstemming
inthisresearch[Porter,2001].
KraaijandPohlmann[1995]developedaDutchversionofthewell-known
Porterstemmer.Asimpli edversionofthisalgorithmhasbeenimplemented
forthisproject.Thefollowingtworulesoftheoriginalalgorithmhavebeen
omitted.
Removalofthea x‘ge’whichoccursasapre xorin xinDutchpartici-
ples.Inthedataconcernedhereparticiplesarerare,attributabletothe
formulationofthequestions.Mostoftheverbsareintheircompleteform.
Removalofthea x‘ge’wouldbeinappropriateinmostofthecases.
Duplicationofavowelinaclosedsyllable.InDutchlongvowelsarespelled
singleinopensyllables,e.g.,kopen,anddoubleinclosedsyllables,e.g.,
koop.Afterremovalofsomea xes,e.g.,lop(-en)thestemvowelneedsto
bedoubledtorenderanorthographicallycorrectstem.Althoughthispro-
cedurecanleadtothestemmingofmorewordstothesamestem,itisnot
necessaryfortheautomaticclassi erthatthewordsareorthographically
correct.Inadditiontheproceduredoesnotalwaysworkcorrectly,whene
isthevowelinanunstressedsyllableitisneverdoubled.Forexamplethe
correctstemofkantelencanbeeitherkanteelorkantel.Withoutword
stressinformationitisimpossibletopredictconsistentlythestatusofe
correctly.
Theresultingalgorithmconsistsofaseriesofsu xstrippers,thepseudo
codeofthealgorithmcanbefoundin gure2.1.Thestringinthispseudocode
isthewordthatisstemmed.Besidesthesu xstrippersofKraaijandPohlman,
twostrippersspeci ctothisdomainareadded.Theystriptwofrequently
10
A problem with automatic classifiers is that there is no way to know if a particular classification is just a guess or a certain answer. Reliable classification is the task of predicting whether a certain instance is correctly classified or not, i.e., a cl
2.4.CHAPTERSUMMARY
Removesuffixes(-e,-en)
If(String.length>1)
Removesuffixes(-ig,-ing,-baar,-bar,-heid,-etj,-tj)
If(String.length>1)
Replace(-opleiding→-opl,administratief→adm)
If(String.length>1)
Replacesuffixes(-v→-f,-z→-s)
Replacesuffixes(-dd→-d,-ff→-f,-gg→-g,-kk→-k,
-ll→-l,-mm→-m,-nn→-n,-pp→-p,-rr→-r,
-ss→-s,-tt→-t)
Figure2.1:Pseudocodeofthestemmingalgorithm.
occurringterms(administratiefandopleiding)totheirabbreviations(adm.
andopl.),whichalsofrequentlyoccur.
Thereductionofthenumberofattributesasaresultofstemmingisgiven
intable2.2.TheEducation1datasetandEducation2datasetcontainthesame
attributesexceptfortheclassattribute.Thereforetheresultsofthefeature
reductionarethesameandthetableshowsonlyoneEducationcolumnthat
representsbothdatasets.Forbothprofession lesthereductionisaround10%,
fortheeducation lesthereductionis6.4%.Besidesthisreductionalsoan
increaseinclassi cationperformancehasbeenobserved.Theperformancein-
creasedependsonthealgorithmandthedatasetused,itliesbetween0.15%
and1.60%.Dataset
Sample1
Sample2
Sample3
AverageEducation6.40%6.40%6.40%6.40%Profession110.00%10.14%9.97%10.04%Profession210.57%10.14%10.99%10.57%
Table2.2:Reductioninthenumberoffeaturesasaresultofstemming.
2.4ChapterSummary
ThefourCBSdatasetsprovideuswithchallengingtextclassi cationtasks,due
tothelargeamountsofrecords,attributes,andclasses.Randomsamplesfrom
10,000and20,000recordsaretakenfromthedatasetstoreducethecomputa-
tionalcomplexity.Furthermore,onlytheinformative eldshavebeenselected
andfeature-reductiontechniqueshavebeenapplied.Thefeature-reductiontech-
niquesincludedocument-frequencytresholding,removalofstopwordsandstem-
ming.Togethertheyleadtoafeaturereductionofapproximately60%.The
resultsofthedatapreparationarefourdatasetsthatarereducedinsizeand
canbeusedforclassi cation.
11
A problem with automatic classifiers is that there is no way to know if a particular classification is just a guess or a certain answer. Reliable classification is the task of predicting whether a certain instance is correctly classified or not, i.e., a cl
Chapter3
TextClassi cation
Inthepreviouschapterwedescribedtheclassi cationtasksoftheCBSandthedatapreparationprocess.Intherestofthethesiswefocusonreliabletextclassi cationfortheCBS.The rststepinthemeta-classi erapproachistogenerateabaseclassi er.Thischapterstudiesthebasetextclassi ersderivedontheCBSdata.Weanswerour rstresearchquestion,thatisconcernedwith ndingane cientandaccuratetextclassi er.Section3.1elaboratesontextrepresentationandsection3.2discussesclassi cationalgorithms.Insection
3.3experimentswithtextclassi ersontheCBSdataaredescribed.Finally,insection3.4wegiveasummaryofthischapter.
3.1TextRepresentation
Textdocumentstypicallyconsistofstringsofcharactersthatcannotbedi-rectlyinterpretedbyaclassi er.Eachdocumentneedstobetransformedintoacompactrepresentationofitscontentappropriateforautomaticclassi cation.Thevectorrepresentationisthemostfrequentlyuseddocumentrepresenta-tion[Mladeni´c,1999].Adocumentwillberepresentedasavectorofweightsdj=<w1j,...,wmj>,whereMisthesetoffeaturesthatoccuratleastonceinatleastonedocumentinthedatasetandtheweights,0≤wkj≤1representhowmuchfeaturefkcontributestothesemanticsofdocumentdj.Di erentwaystode nefeaturesandtocomputefeatureweightsleadtodi erenttextrepresen-tationapproaches[Sebastiani,2002].Fourpossibilitiestode nefeaturesareasfollows.
rmationretrievalresearchsuggeststhatwordsworkwellasfeaturesandthattheirorderinginarecordisofminorimportanceformanytasks[Lewis,1992].Whentheweightsofthefeaturesarebinary,identifyingfeatureswithwordsiscalledthebag-of-wordsapproach.
ingphrasesasfeatureshasproventobenotverysuccessful.Di erentwaysofde ningphraseswereinvestigatedbyLewis[1992].
Wordn-grams.Afeatureisde nedforeachwordsequencecontaininguptonconsecutivewords.Thisrepresentationworksbestifthesequenceofwordsisimportantforthecontextandthedocumentsareshort[Mladeni´c,1998].
12
A problem with automatic classifiers is that there is no way to know if a particular classification is just a guess or a certain answer. Reliable classification is the task of predicting whether a certain instance is correctly classified or not, i.e., a cl
3.2.CLASSIFICATIONALGORITHMS
Charactern-grams.Featuresareformedbyallsequencesofncharacters
inatext.Acommonvaluefornis3.Thecharactern-gramrepresentation
wasestablishedquitesomeyearsago[Shannon,1951].Anadvantageof
thisrepresentationisthatitcopesbetterwithspellingmistakes.
Besidesthefeatures,alsothefeatureweightshavetobedetermined.When
afeatureoccursinacertaindocument,ameasurehastodeterminethevalue
ofthisfeatureforthedocument.Someearlyna¨ veBayesapproachesusea
binary-valuedvectorrepresentation[RobertsonandSparck-Jones,1976].The
featurevalueequals1ifthefeatureoccursinthedocument,and0otherwise.
Informationinherentinthefrequenciesoffeaturesislostwhenonlytheabsence
orpresenceoffeaturesisconsidered.Thereforenewapproachesthattakeinto
accountfrequencyoffeaturesinadocumenthavebeendeveloped[Lewis,1998].
Themoststraightforwardadaptationistoconsidertheweightofafeatureequal
tothenumberofoccurrencesofthefeatureinthedocument.Thisweighting
functionisreferredtoastermfrequency(tf)[Lewis,1992].Thetffunction
encodestheintuitionthatthemoreoftenafeatureoccursinadocumentthe
moreitisrepresentativeofitscontent.Thesecondpopularweightingfunction
isthetfidfweightingfunction.Itistheproductofthetermfrequencyandthe
inversedocumentfrequency(idf).Thedocumentfrequency(df)ofafeature
isthenumberofdocumentsinthedatasetinwhichthefeatureoccursatleast
once.Theinversedocumentfrequencycanbecalculatedfromthedocument
frequencyasfollows:
idf(fi)=log|N|
i,where
|N|isthetotalnumberofdocumentsinthedataset,anddf(fi)isthedocument
frequencyoffeature(fi).Theidffunctionencodestheintuitionthatthemore
documentsafeatureoccursin,thelessdiscriminatingitis[Sebastiani,2002,
Smirnovetal.,2003b].Variationsonthetfidfweightingfunctioncanbecre-
atedthroughapplyingdi erentlogarithms,normalizations,andothercorrection
factors.
Inthisresearchweusethefollowingtextrepresentation.Toconvertatext
documentintoaninstancesuitableforclassi cationthevectorofallwords
thatoccurinthedatasetisthefeaturespace.PreviousCBSresearchshows
thatwordsworkwellasfeaturesfortheProfession1dataset,thereforeinthis
researchweusewordsasfeatures[Smirnovetal.,2003a].Thefeatureweightsof
theinstancesarecomputedbyatforatfidfweightingfunction.Weonlyuse
theweightingfunctionsthattakeintoaccountfrequenciesoffeaturesinorder
nottoloseinformationinherentinthefrequenciesoffeatures.
3.2Classi cationAlgorithms
Whenthetextofadocumentistransformedintoacompactrepresentation,in
principleanymachine-learningalgorithmcanbeapplied.Somespecialtextclas-
si cationalgorithmsthatincorporatefeaturede nitionsandweightingfunctions
havebeenproposed,forexampletheRocchioclassi er[Rocchio,1971].Theclas-
si cationalgorithmsthatareexaminedinthisresearch,i.e.,na¨ veBayesand
nearestneighbour,canuseeitherthetforthetfidfweightingfunction.
13
A problem with automatic classifiers is that there is no way to know if a particular classification is just a guess or a certain answer. Reliable classification is the task of predicting whether a certain instance is correctly classified or not, i.e., a cl
3.2.CLASSIFICATIONALGORITHMS
Inthisresearchwelookforpracticalsolutionsthathavesmallspaceand
timecomplexity.Weonlyconsiderclassi cationalgorithmsthatmeetthese
requirements,e.g.,supportvectormachinesarenotconsideredalthoughitisa
promisingtextclassi cationtechnique[Joachims,1998].Inthefollowingthree
subsectionsthena¨ veBayesalgorithm,thenearestneighbouralgorithmanda
hierarchicalapproachtotextclassi cationarediscussed.
3.2.1Na¨ veBayes
Thena¨ veBayesclassi erassumesthatallattributesaremutuallyindependent.
Thereforetheparametersforeachattributecanbelearnedseparatelyandthis
simpli esthetaskconsiderably.Especiallyintextclassi cationtasksthattypi-
callyhavealargenumberofattributesthisisbene cial.Duringtrainingofthe
na¨ veBayesclassi er,thepriorprobabilitiesoftheclasses,andtheprobabili-
tiesofobservingattributevaluesgiventheclass,areestimatedbasedontheir
frequenciesoverthetrainingdata.Anewinstanceisclassi edbyusingBayes
theoremtoidentifythemostprobableclass[Mitchell,1997].
3.2.2NearestNeighbour
Thenearestneighbourclassi erdoesnotbuildanexplicit,declarativerepresen-
tationoftheclasses,butsimplystoresallthetraininginstances.Toclassifya
newinstanceitreliesontheclassvaluesattachedtothetraininginstancesthat
aresimilartothenewinstance.Theknearesttraininginstancestakeavote
tosettleontheclassvaluewhentheclassisnominal.Themaindrawbackof
thisapproachisitsine ciencyatclassi cationtime.Toclassifyanewinstance,
theentiretrainingsetneedstoberankedforsimilaritywiththenewinstance
[Mitchell,1997].
Inthenearestneighbourclassi erkistheonlyimportantparameterthat
needstobeset.ExperimentsontheCBSdatasetsshowthatthevaluek=
1performsbest,sointheremainingofthisthesiswewillusethe1-nearest
neighbourclassi erforthetextclassi ers.
3.2.3HierarchicalClassi er
Adi cultyofthedataprovidedbytheCBS,isthelargenumberofclasses.
TheEducation1datasethasalmost6,000classes.Thisisoneofthereasons
thatclassi cationisrathertimeandspaceconsuming.Toovercomethisprob-
lemwehaveconstructedahierarchicalclassi erthatiscomputationallymore
e cient.Theclassattributecanbedividedintomultiplesubclassesthatcan
beclassi edseparately.Eachofthesesubclasseshasasmallnumberofclass
values.Assumingthattheclassvaluecanbedividedintoksubclassesandeach
subclasshasndistinctclassvalues.Thehierarchicalclassi erclassi esktimes
nclasses,whileaplainclassi erclassi esnkclasses,whichformostclassi ers
iscomputationallymoredemanding.
Mosthierarchicalclassi ersrequireastrictconcepthierarchylikeChuang
etal.[2000].However,ourdatasetcontainsmorehierarchies,ratherthanone.
Theeducationlevelisamoregeneralconceptthantheeducationsublevel,and
likewisetheeducation eldismoregeneralthantheeducationsub eld.
14
A problem with automatic classifiers is that there is no way to know if a particular classification is just a guess or a certain answer. Reliable classification is the task of predicting whether a certain instance is correctly classified or not, i.e., a cl
3.3.EXPERIMENTSWITHTHETEXTCLASSIFIERSDigit
Subclass
Example1Level62Sublevel034SOIcodeField31Sub eld50567Serialnr.18TrackNVT9PhaseNVT
Table3.1:SubclassesintheeducationSOI+code.
Theclassattributecanbedividedintofourorsevensubclasses.The rst
foursubclassesofthesevensubclassestogetherareequaltothe rstsubclassof
thefoursubclasses.Intable3.1thedivisionoftheSOI+codeintosubclasses
isillustrated.The rstrowinthetablearethedigitsintheSOI+code.The
secondrowshowsthesubclasses,andthethirdrowshowsthedivisionofthe
code603150-1-NVT-NVTinsevensubclasses.Thevaluesofthesubclassescan
bederiveddirectlyfromtheoriginalclassvalue,i.e.,thesubclassvalueconsists
ofoneormoredigitsfromtheoriginalclassvalue.
Thehierarchicalclassi erclassi esaninstanceinmultiplesteps.Ineach
steponesubclassisclassi ed.Thevalueofthesubclassissubsequentlyadded
tothelistofattributesthatisusedtoclassifythenextsubclass.Thisprocess
continuesuntilallsubclassvalueshavebeenclassi ed.Allthesubclassvalues
arethenmergedtocreatethe nalclassvalue.
Thesubclassthatistheeasiesttopredict,willbepredicted rst.Inmost
hierarchicalclassi ersthisisthesubclassatthetopofthehierarchy,whichrep-
resentsthemostgeneralconcept.Inourcasewedonothaveastricthierarchy.
Todeterminewhichsubclassshouldbepredicted rst,wetrainaclassi erfor
eachpossiblesubclassandthesubclassthathasthehighestclassi cationaccu-
racyisclassi ed rst.Inthesamewaywedeterminetheclassi cationorderof
theremainingsubclasses.
3.3ExperimentswiththeTextClassi ers
Thissectiondescribestheexperimentsthathavebeencarriedoutinorderto
assesstheperformanceofthedi erentclassi ers.Besidestheexperimentsthat
measuretheclassi cationaccuracyofthedi erentclassi ers,someadditional
experimentshavebeencarriedouttoderivetheoptimalsettingofparameters
forthehierarchicalclassi er.
3.3.1ExperimentalSettingandMethodology
Thetextclassi ersaretestedontheProfession1,theProfession2,theEduca-
tion1,andtheEducation2datasets.Anearestneighbourclassi er(NN)and
ana¨ veBayesclassi er(NB)aretrainedoneachdatasetincombinationwith
thetfandtfidftextrepresentation.Thedatasetshavebeensampledwith-
outreplacementtoreducethecomputationalrequirements.Threesamplesof
roughly20,000instancesaretakenforeachdataset.Eachclassi eristrained
foreachsampleandthentheresultsareaveraged.Wemeasuretheclassication
accuracy,i.e.,thepercentageofcorrectlyclassi edrecords.
Totestthehierarchicalclassi erasamplefromtheEducation1datasetof
15
正在阅读:
Meta-classifier approach to reliable text classification05-22
机械制图试卷及答案05-06
2018年普通高等学校招生全国统一考试(泄露天机押题卷)历史(一)Word版缺答案10-06
2020届浙江省温州市高三第一学期十校联合体期末联考理综物理部分04-25
e4虚拟仪器的微弱信号检测处理技术研究07-17
201515年幼儿教师个人学期末总结05-13
小学数学二年级上册三说说课稿11-26
纤维素酶活力的测定10-19
学生临床检验科实习报告09-20
防洪抢险应急预案04-25
- 教学能力大赛决赛获奖-教学实施报告-(完整图文版)
- 互联网+数据中心行业分析报告
- 2017上海杨浦区高三一模数学试题及答案
- 招商部差旅接待管理制度(4-25)
- 学生游玩安全注意事项
- 学生信息管理系统(文档模板供参考)
- 叉车门架有限元分析及系统设计
- 2014帮助残疾人志愿者服务情况记录
- 叶绿体中色素的提取和分离实验
- 中国食物成分表2020年最新权威完整改进版
- 推动国土资源领域生态文明建设
- 给水管道冲洗和消毒记录
- 计算机软件专业自我评价
- 高中数学必修1-5知识点归纳
- 2018-2022年中国第五代移动通信技术(5G)产业深度分析及发展前景研究报告发展趋势(目录)
- 生产车间巡查制度
- 2018版中国光热发电行业深度研究报告目录
- (通用)2019年中考数学总复习 第一章 第四节 数的开方与二次根式课件
- 2017_2018学年高中语文第二单元第4课说数课件粤教版
- 上市新药Lumateperone(卢美哌隆)合成检索总结报告
- classification
- classifier
- approach
- reliable
- Meta
- text
- DFT RS-3016i-D30存储安装配置指导手册 V1.0B(正稿)
- 对新形势下加强机关党支部建设的探析与思考
- 伟尔沃曼渣浆泵系列综合样本
- 七年级下地理资源与评价答案
- 考英语书面表达十篇
- 压力管道设计设计人员考试卷2004
- 学习《准则》、《条例》的心得体会
- 【试题库】(通用版)(化学)(二轮复习)2015届【迈向名师】星级题库:酚二星题
- 2017中传动画艺术学考研整理笔记经验
- 中央广播电视大学《文秘管理与应用写作》复习资料
- 发展和弘扬中国传统医药的必由之路
- 大型体育赛事中的口译应用技巧研究——哈尔滨第24届世界大学生冬季运动会反思
- 2012 11 12质粒DNA的提取、定量、酶切与PCR鉴定
- 用凯特摆测量重力加速度实验报告
- 中国南方电网通信调度管理规定091230
- 中国移动LTE承载网方案汇报
- 第四讲 刑法犯罪构成(一)
- 接种率调查总结2013
- 浅析网络安全技术
- 包装系统的动态响应与模态参数识别方法