
Meta-Classifier Approaches to Reliable Text Classification

A.M. Kaptein

Master's Thesis CS 05-15

Thesis submitted in partial fulfilment of the requirements for the degree of Master of Science of Knowledge Engineering in the Faculty of General Sciences of the Universiteit Maastricht

Thesis committee:
Prof. dr. H.J. van den Herik
dr. E.N. Smirnov
Prof. dr. E.O. Postma
dr. W.J.G. Hacking

Universiteit Maastricht
Institute for Knowledge and Agent Technology
Department of Computer Science
Maastricht, The Netherlands
August 2005


Abstract

A problem with automatic classifiers is that there is no way to know if a particular classification is just a guess or a certain answer. Reliable classification is the task of predicting whether a certain instance is correctly classified or not, i.e., a classification is classified as either reliable or unreliable. When the classification is classified as unreliable, it is like saying "I do not know", and the instance does not receive a classification.

Given a base classifier, the meta-classifier approach is to train a meta-classifier that predicts the correctness of each classification of the base classifier. The classification rule of the meta-classifier approach is to assign a class predicted by the base classifier to an instance if the meta-classifier decides that the base classification is reliable.

The meta-classifier approach is applied on text classification tasks provided by the CBS to answer the following problem statement:

Does the meta-classifier approach provide a practical solution to reliable text classification?

The first part of the research studies text classifiers, and provides an answer to the research question:

1. Which text classifiers achieve high accuracy and at the same time have small space and time complexity?

Experiments on the CBS datasets show that the nearest neighbour and the naïve Bayes algorithm in combination with the tfidf text representation are acceptable text classifiers.

The second part of the research studies the meta-classifier approach to provide an answer to the second and third research question. Our second research question is:

2. What type of metadata representation is best suited for reliable text classification?

The meta-classifier is trained on several types of metadata representations. The used metadata representations include the original instances, the probability distribution of the base classifier, and a set of basic statistics about the classification of the base classifier. For tasks with many classes, the original instances representation is best. For tasks with a small number of classes, the original instances representation and the probability distribution are both good candidates.


Our third research question is:

3. Should the meta-classifier be local or global?

In addition to the global approach, which learns one meta-classifier for all classes, a local approach is applied. The local approach learns one meta-classifier for each class in the original data. Our study shows that the global approach is preferable: it achieves the same performance with less computational effort.

The meta-classifier approach provides a sound solution to the task of reliable classification if the precision of the meta-classifier on the reliable class is 100%. In practice we do not reach this level of accuracy.

Our conclusion, to answer the problem statement, is that the meta-classifier approach in practice does not provide a sound solution to reliable text classification. The meta-classifier approach does achieve a considerable increase in accuracy. It is an efficient solution as long as the base classifiers and the meta-classifiers are efficient, and the proper metadata representation is used.


Preface

This thesis contains a description of the research on reliable text classification that I have been doing as my M.Sc. project. The research has been carried out during an internship at the CBS in Heerlen from March to August 2005.

The research would not have been possible without the support of the following persons. I would like to thank Prof. Jaap van den Herik and Prof. Eric Postma for their advice and support. At the CBS I would like to recognize my colleagues for helping me out with my occasional computer problems, and in particular I would like to thank Dr. Wim Hacking for supervising me and giving me a view on the practical side of the problem. Finally I would like to acknowledge dr. Evgueni Smirnov for always being available to answer all my questions, for many fruitful discussions and for being honest at all times.

Rianne Kaptein
August 2005


Contents

Abstract

Preface

1 Introduction
  1.1 Task of Reliable Classification
  1.2 Related Work
    1.2.1 Bayesian Framework
    1.2.2 Statistical Frameworks
    1.2.3 Version-Space Support-Vector Machines
    1.2.4 Meta-Classifier Approach
  1.3 Problem Statement and Research Questions
  1.4 Thesis Outline

2 The CBS Data
  2.1 Background
  2.2 Data Description
  2.3 Data Preparation
    2.3.1 Selection of Informative Fields
    2.3.2 Feature Reduction
  2.4 Chapter Summary

3 Text Classification
  3.1 Text Representation
  3.2 Classification Algorithms
    3.2.1 Naïve Bayes
    3.2.2 Nearest Neighbour
    3.2.3 Hierarchical Classifier
  3.3 Experiments with the Text Classifiers
    3.3.1 Experimental Setting and Methodology
    3.3.2 Experimental Results
    3.3.3 Evaluation and Discussion of the Results
  3.4 Chapter Summary

4 Meta-Classifier Approaches to Reliable Classification
  4.1 Definition
  4.2 Metadata Representations
  4.3 Generation of Metadata


  4.4 Local Meta-Classifier Approach
  4.5 Accuracy and Coverage
  4.6 Experiments with the Meta-Classifier Approach
    4.6.1 Experimental Setting and Methodology
    4.6.2 Experimental Results
    4.6.3 Evaluation and Discussion of the Results
  4.7 Chapter Summary

5 Conclusion and Future Research
  5.1 Conclusion
  5.2 Future Research

Appendix

A Test Results of the Meta-Classifier Approach
  A.1 Test Results on the Profession2 Dataset
  A.2 Test Results on the Education2 Dataset

References


Chapter 1

Introduction

This chapter is an introduction to the research topic of reliable classification. We describe the task of reliable classification in section 1.1. Various approaches to solving the problem of reliable classification are discussed in section 1.2. In section 1.3 we motivate our choice for the meta-classifier approach that we would like to investigate; moreover, the problem statement and research questions are formulated. Finally, in section 1.4 an outline of the remaining part of the thesis is given.

1.1 Task of Reliable Classification

Knowing the limits of one's own knowledge is a typical feature of human intelligence. When faced with a difficult problem, one can come to the conclusion that his knowledge is not sufficient to solve the problem and one can say "I do not know".¹ Computers are able to solve different kinds of difficult problems such as playing chess or making huge computations. But in contrast to humans, computers usually do not have the capability to recognize the limits of their knowledge.

¹Throughout the thesis we use 'he' or 'his' when both he/she and his/her are possible.

Machine learning is concerned with developing classifiers to learn from experience or extract knowledge from examples in a database [Mitchell, 1997]. The classifiers that we study in this thesis learn from labelled instances and are able to classify unseen and unlabelled instances. However, these classifiers usually do not distinguish between lucky guesses and certain answers. They just classify all new instances, so that you can never be really sure whether a particular classification is correct. In other words: the classifiers are not aware of the limits of their knowledge. That is the reason why classifiers are of limited use in many real-world applications. The costs of a misclassification are often high, especially in safety-critical domains such as medical diagnosing and air traffic control systems. Classifiers are only used in applications where errors do not have far-reaching consequences, for example internet applications such as automatic document indexing, spam-filtering, word sense disambiguation, and hierarchical categorization of Web pages [Sebastiani, 2002, Delany et al., 2004]. In this thesis we study the task of reliable classification. We define this task as a task of predicting whether a certain instance has been correctly classified, i.e., the classification of the instance is classified as either reliable or unreliable.

When the classification is classified as reliable, it is comparable to saying "I know", and the instance is correctly classified. When the classification is classified as unreliable, it is comparable to saying "I do not know", and the instance is not classified. Solving the task means that all classifications are correct and the classifier achieves an accuracy of 100% on the "I know" instances.

The reliability of a classification can be defined in several ways. We adopt the definition of Kukar and Kononenko [2002]. They define the reliability of a classification as the probability that the classification is correct.

Reliable classification has two important application areas: real-world applications and ensembles of classifiers. Reliable classification can be used in real-world, safety-critical domains, because it reduces the risk of misclassification. When unreliable classifications are discarded, the coverage (percentage of records that can be classified) of the classifier diminishes, but at the same time its accuracy (percentage of correctly classified records) increases.
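The trade-off between coverage and accuracy can be made concrete in a few lines of code. The following is a minimal sketch (the function and variable names are illustrative, not taken from the thesis); it assumes the base classifier's predictions and the reliable/unreliable decisions are available as parallel lists.

def accuracy_and_coverage(predictions, true_labels, reliable_flags):
    """Accuracy and coverage when unreliable classifications are discarded.

    predictions    -- class predicted by the base classifier for each instance
    true_labels    -- the correct class for each instance
    reliable_flags -- True if the classification was judged reliable, else False
    """
    kept = [(p, t) for p, t, r in zip(predictions, true_labels, reliable_flags) if r]
    coverage = len(kept) / len(predictions)  # fraction of records still classified
    accuracy = sum(p == t for p, t in kept) / len(kept) if kept else 0.0
    return accuracy, coverage

# Discarding the two unreliable (and here incorrect) classifications raises the
# accuracy from 3/5 to 3/3, while the coverage drops to 3/5.
print(accuracy_and_coverage([1, 2, 2, 3, 1],
                            [1, 2, 1, 3, 2],
                            [True, True, False, True, False]))  # (1.0, 0.6)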

Reliable classification is already being used in another context as well, namely in ensembles of classifiers. Ensembles classify an instance by combining the classifications of multiple classifiers. The final classification can be obtained in several ways, e.g., by (unanimous) voting of the classifiers. Reliable classification can be used to sift the classifications of the classifiers. Only the classifiers that make a reliable classification will be used to make the final classification. Combining ensembles of classifiers with reliable classification can substantially increase accuracy [Seewald, 2003, Ting, 1996].

1.2 Related Work

Four different approaches to reliable classification that have been developed in the last ten years are discussed in this section: the Bayesian framework, the statistical frameworks², version-space support-vector machines, and the meta-classifier approach.

²The output of the classifiers in the Bayesian framework and the statistical frameworks is a reliability estimation, rather than a reliable classification. By using thresholds, reliability estimations can easily be converted to reliable classifications. When the reliability estimation is above a certain threshold the classification is considered reliable.

1.2.1 Bayesian Framework

Classifiers in the Bayesian framework usually produce classifications in the form of posterior probability distributions over all possible classes. The simplest approach to reliable classification is to use the posterior probability of a single classification as a measure for reliability [Ting, 1996]. However, it is shown that posterior probabilities of classifiers like naïve Bayes are poor measures for reliability, since they are based on incorrect prior assumptions [Melluish et al., 2001, Delany et al., 2004]. Thus, the Bayesian framework can be misleading for reliable classification, and is therefore considered inappropriate for our purposes.
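The thresholding idea mentioned in footnote 2 can be illustrated with a small sketch that turns a posterior probability distribution into a reliable/unreliable decision; the threshold value and the function name are illustrative choices, not details from the thesis.

def reliable_by_posterior(posterior, threshold=0.95):
    """Accept the classification only if the posterior probability of the
    winning class exceeds the threshold.

    posterior -- dict mapping class label to estimated probability
    """
    best_class = max(posterior, key=posterior.get)
    return best_class, posterior[best_class] >= threshold

# A peaked distribution is accepted, a flat one is rejected.
print(reliable_by_posterior({"A": 0.97, "B": 0.02, "C": 0.01}))  # ('A', True)
print(reliable_by_posterior({"A": 0.40, "B": 0.35, "C": 0.25}))  # ('A', False)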

1.2.2 Statistical Frameworks

Several approaches to reliable classification are based on the algorithmic theory of randomness [Vovk et al., 1999]. Since there is no computable, universal test for randomness, computable approximations of algorithmic tests of randomness are used [Proedrou et al., 2001].


In this context we distinguish the typicalness framework and the transduction framework.

The typicalness framework provides reliability estimations on data that is independently and identically distributed (iid). Consider a sequence of instances together with a new instance of an unknown class. The typicalness framework is used to gain a measure of reliability for all possible classes of this new instance using a typicalness function. For each possible class it is examined how likely it is that all instances of the extended sequence, i.e., the sequence with the new instance added, are drawn independently from the same distribution. The more typical the sequence is, the higher the reliability measure.

The typicalness function can be constructed by measuring the "strangeness" of individual instances, using individual strangeness functions [Kukar, 2004]. A drawback of this approach is that the strangeness function depends on the classification algorithm that is used. So far, the only successful applications use Support Vector Machines [Vovk et al., 1999] and the nearest neighbour algorithm [Proedrou et al., 2001]. Another disadvantage of this approach is its computational complexity [Kukar and Kononenko, 2002, Melluish et al., 2001].
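To make the typicalness idea more tangible, the sketch below computes a 1-nearest-neighbour strangeness value and the resulting typicalness (p-value) of a candidate label, in the spirit of Proedrou et al. [2001]. The exact strangeness function and the Euclidean distance are assumptions made for the sake of the example, not the definitions used in the cited work.

import math

def strangeness(index, points, labels):
    """1-NN strangeness: distance to the nearest instance of the same class
    divided by the distance to the nearest instance of a different class."""
    same, diff = [], []
    for i, (p, l) in enumerate(zip(points, labels)):
        if i == index:
            continue
        (same if l == labels[index] else diff).append(math.dist(points[index], p))
    return (min(same) if same else float("inf")) / (min(diff) if diff else float("inf"))

def typicalness(train_points, train_labels, new_point, candidate_label):
    """Fraction of instances in the extended sequence that are at least as
    strange as the new instance when it is labelled with candidate_label."""
    points = train_points + [new_point]
    labels = train_labels + [candidate_label]
    alphas = [strangeness(i, points, labels) for i in range(len(points))]
    return sum(a >= alphas[-1] for a in alphas) / len(alphas)

# The label that fits the neighbourhood of the new instance scores higher.
X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0)]
y = ["a", "a", "b", "b"]
print(typicalness(X, y, (0.05, 0.05), "a"))  # 0.6
print(typicalness(X, y, (0.05, 0.05), "b"))  # 0.2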

Another statistical framework for reliable classification is the transduction framework. The framework is classifier-independent and it is based on a transductive step during the classification process. First, an instance is classified by a base classifier, and then the instance is added to the training set, together with the classification it received from the base classifier. The classifier is re-trained and the instance is classified again. The reliability of the instance classification is measured as the difference between the posterior class probabilities that the instance receives before and after it was added to the training set [Kukar and Kononenko, 2002]. Smirnov et al. [2003b] show that this approach is not only computationally inefficient, but it also requires high precision of the real-number representation when a large amount of data is used with many classes. Only in the case that the training set is small can adding an instance to the training set change the level of randomness significantly.
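A rough sketch of the transductive step described above is given below. It assumes a retrainable probabilistic base classifier with hypothetical fit and predict_proba methods, and it simplifies the reliability measure to the change in the probability of the predicted class; the exact measure of Kukar and Kononenko [2002] compares the full posterior distributions.

def transductive_reliability(classifier, train_X, train_y, instance):
    """Reliability estimation via the transductive step: classify, add the
    instance with its predicted label to the training set, retrain, and compare
    the posterior of the predicted class before and after.

    `classifier` is assumed to expose fit(X, y) and predict_proba(x) -> dict;
    both are hypothetical conveniences for this sketch."""
    classifier.fit(train_X, train_y)
    before = classifier.predict_proba(instance)
    predicted = max(before, key=before.get)

    classifier.fit(train_X + [instance], train_y + [predicted])  # transductive step
    after = classifier.predict_proba(instance)

    change = abs(after[predicted] - before[predicted])
    return predicted, 1.0 - change  # a small change suggests a reliable classification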

Kukar [2004] presents an algorithm that joins the transductive framework and the typicalness framework, thereby making the transductive step statistically sound. Experimental results showed that reliability can be estimated quite accurately, but again the high computational costs pose a problem.

1.2.3 Version-Space Support-Vector Machines

A different approach to reliable instance classification is based on version spaces [Smirnov et al., 2005]. We refer to Mitchell [1997] for an introduction to version spaces. The main idea is to construct version spaces that contain the hypotheses of the target concepts to be learned or their close approximations. The unanimous-voting rule is implemented by testing version spaces for collapse. Applying this rule makes it impossible to misclassify instances when there is no noise in the data. Although experimental results are promising, this approach is computationally expensive as well.


1.2.4 Meta-Classifier Approach

The meta-classifier approach takes a different line to reliable classification. Given a base classifier, the approach is to learn a meta-classifier that predicts the correctness of each instance classification of the base classifier. The base classifier plus the meta-classifier form one combined classifier. The classification rule of the combined classifier is to assign a class predicted by the base classifier to an instance if the meta-classifier classifies the base classification as Reliable; otherwise the instance classification is rejected [Seewald and Fürnkranz, 2001].
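The classification rule of the combined classifier can be written down directly. Below is a minimal sketch; the predict functions are hypothetical placeholders, since the thesis does not prescribe a particular interface.

REJECT = None  # marker for "I do not know": the instance receives no class

def combined_classify(instance, base_predict, meta_predict, make_meta_instance):
    """Return the base classifier's class only if the meta-classifier judges
    that classification to be reliable; otherwise reject the classification.

    base_predict(instance)                   -> class predicted by the base classifier
    make_meta_instance(instance, base_class) -> metadata representation (see the list below)
    meta_predict(meta_instance)              -> 'Reliable' or 'Unreliable'
    """
    base_class = base_predict(instance)
    if meta_predict(make_meta_instance(instance, base_class)) == "Reliable":
        return base_class
    return REJECT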

The crucial step for the meta-classifier approach is the generation of the metadata that is used to train the meta-classifier. The metadata are represented by four different metadata representations. All metadata representations have the same binary meta class attribute. The meta class of an instance indicates the reliability of the base classification. If the instance is classified correctly by the base classifier the meta class is Reliable, otherwise the meta class is Unreliable. Below we list four metadata representations; a small sketch after the list illustrates how such meta-instances can be generated.

- Original instances representation. All attributes in the meta-instance, except the class attribute, are the same as in the original instances.

- Probability distribution representation. A meta-instance contains the posterior probabilities for all classes as given by the base classifier.

- Basic statistics representation. Attributes of the meta-instance are different characteristics of the base classification, like the base class and the posterior probability of the base class.

- Nearest neighbour distances representation. This representation can only be used in combination with a nearest neighbour base classifier. Attributes of the meta-instance are calculated distances between nearest (un)like neighbours of a target instance [Cheetham and Price, 2004].
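As announced above, the sketch below shows how meta-instances for the probability distribution and the basic statistics representations could be generated from the output of the base classifier; the particular statistics chosen here (base class, its posterior, and the margin to the runner-up) are an illustrative guess rather than the exact attribute set used in the thesis.

def probability_distribution_meta(posterior, classes, correct):
    """Meta-instance: the posterior probability of every class, plus the binary meta class."""
    features = [posterior.get(c, 0.0) for c in classes]
    return features, "Reliable" if correct else "Unreliable"

def basic_statistics_meta(posterior, base_class, correct):
    """Meta-instance: a few characteristics of the base classification."""
    ranked = sorted(posterior.values(), reverse=True)
    margin = ranked[0] - (ranked[1] if len(ranked) > 1 else 0.0)
    features = [base_class, posterior[base_class], margin]
    return features, "Reliable" if correct else "Unreliable"

# The base classifier predicts class 'B' with probability 0.6, and is wrong.
posterior = {"A": 0.3, "B": 0.6, "C": 0.1}
print(probability_distribution_meta(posterior, ["A", "B", "C"], correct=False))
print(basic_statistics_meta(posterior, "B", correct=False))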

A meta-classifier can be trained on different class levels. A global meta-classifier approach learns one meta-classifier for all classes. A local meta-classifier approach learns one meta-classifier for each base class. Each local meta-classifier classifies only meta-instances with one particular base class.
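The difference between the global and the local approach amounts to how the meta-training data is partitioned. A small sketch follows; the training function is a hypothetical placeholder.

from collections import defaultdict

def train_local_meta_classifiers(meta_instances, train_meta_classifier):
    """Local approach: group the meta-instances by the class predicted by the
    base classifier and train one meta-classifier per group. The global
    approach would call train_meta_classifier once on all meta-instances.

    meta_instances        -- iterable of (base_class, features, meta_class) triples
    train_meta_classifier -- hypothetical: list of (features, meta_class) -> model
    """
    per_class = defaultdict(list)
    for base_class, features, meta_class in meta_instances:
        per_class[base_class].append((features, meta_class))
    return {c: train_meta_classifier(rows) for c, rows in per_class.items()}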

The first applications of meta-classifiers are in the context of ensembles. The estimated reliability of a classification is used to choose the classifier(s) that will classify the instance [Seewald and Fürnkranz, 2001]. In these first meta-classifiers the original instances representation is used as metadata representation. In a later version of Seewald [2003], also the probability distribution representation is used. Real-world applications in which the meta-classifier approach is used include automatic text classification [Smirnov et al., 2003a] and spam-filtering [Delany et al., 2004]. Smirnov et al. [2003a] used the probability distribution metadata representation, and Delany et al. [2004] used the nearest neighbour distances representation.

The meta-classifier approach is discussed in detail in chapter 4.

1.3 Problem Statement and Research Questions

In this thesis we look for an approach to reliable classification that can be used in real-world practical applications, i.e., we look for an approach that is statistically sound and computationally efficient.


The four different approaches to reliable classification that we discussed in section 1.2 all have shortcomings. The Bayesian framework is too dependent on assumptions about the prior distribution when applied in practice for reliable classification. The typicalness framework, the transduction framework, and version-space support-vector machines are all theoretically sound, but they are computationally expensive. The meta-classifier approach is the only candidate for our study. By construction,

1. the approach is sound as long as the meta-classifier is accurate. The accuracy of a classifier is the proportion of correctly classified instances.

2. the approach is efficient as long as the base classifier and the meta-classifier are efficient.

Our study is partly motivated by the fact that the meta-classifier approach was not exhaustively studied. Existing studies on meta-classifiers focused on the global approach or on application in the context of ensembles only [Seewald, 2003, Smirnov et al., 2003a, Delany et al., 2004]. The theoretical framework of the meta-classifier approach and the local meta-classifier approach have never been analysed in detail for a real-world application. Our aim is to investigate the meta-classifier approach in the context of text classification. Hence, our problem statement reads as follows:

Does the meta-classifier approach provide a practical solution to reliable text classification?

To answer our problem statement we apply the meta-classifier approach on real text classification tasks and datasets provided by the Centraal Bureau voor de Statistiek (CBS). The datasets match the scale of the problem we would like to tackle: the datasets are large, of high dimensionality, and have a large number of classes.

In order to address the problem statement, three research questions have been formulated. To solve the text classification tasks of the CBS we first have to train text classifiers, and then apply the meta-classifier approach. Therefore our research consists of two parts.

The first part of the research studies the applicability of different text classifiers for the CBS data, leading to the first research question:

1. Which text classifiers achieve high accuracy and at the same time have small space and time complexity?

The second and main part of the research investigates different meta-classifier approaches applied to the text classifiers trained on the CBS data. The second and third research questions read as follows.

2. What type of metadata representation is best suited for reliable text classification?

3. Should the meta-classifier be local or global?

These questions aim at studying two main features, viz. the type of metadata representation and the nature (i.e., local or global) of the meta-classifier. The meta-classifiers are evaluated on the text classification tasks provided by the CBS. Different text classifiers and meta-classifiers are combined in order to see which combination performs best.


1.4 Thesis Outline

The remaining part of this thesis is structured as follows. In chapter 2 we describe the text-classification tasks and the datasets provided by the CBS, and we discuss the data-preparation process. Chapter 3 studies text classification, and elaborates on text representation, classification algorithms, and experiments with the CBS data. Chapter 4 discusses the details of the meta-classifier approach, and describes experiments with the CBS data. Finally, in chapter 5 we draw a conclusion, and we give some recommendations for future research.


Chapter 2

The CBS Data

The CBS has provided three datasets for this research, which are used to assess the performance of the meta-classifier approach. Section 2.1 gives some background information about the task and the datasets. In section 2.2 we describe the contents and the characteristics of these datasets. The data preparation process, which includes selection of informative fields and feature reduction, is discussed in section 2.3. Finally, in section 2.4 we summarize the chapter.

2.1 Background

The CBS is the Dutch national statistics institute. Among other things, the CBS collects data by interviewers using questionnaires. The provided datasets contain the answers to open-ended questions that can be found in these questionnaires. They concern the education or the profession of individuals. These data typically have a textual format and are in Dutch. Since the CBS wishes to convert the data into useful statistical information, the textual data have to be transformed into classes. Our task is to assign a symbolic code from a predefined set of such codes to the text records containing the answers given in the questionnaire. The codes for professions and educations are defined by the CBS through so-called SBC and SOI+ codes respectively [CBS, Divisie Sociale en Ruimtelijke Statistieken, Sector Ontwikkeling en Ondersteuning, 2003, 2001]. So far most of the data is manually classified, like the provided datasets. In previous research the possibilities for automatic text classification have been explored [Smirnov et al., 2003a]. On the whole, little research has been carried out to classify answers to open-ended questions in questionnaires. Some dictionary-based approaches have been developed. They use specialized dictionaries that assign a text fragment automatically to a specific category if it contains words matching those in the dictionary relevant to the category. The text fragments can be preprocessed using different parsing techniques such as deletion of trivial words, suffix removal, and definition of synonyms [Macchia and D'Orazio, 2000]. Drawbacks of the dictionary-based approaches are that they are static and that specialized dictionaries have to be developed manually by experts for each category. A text-classification approach based on supervised machine learning is a better solution in this case, since there are no specialized dictionaries available but a large body of manually coded training data is on hand [Giorgetti and Sebastiani, 2003]. A disadvantage of the text-classification approach is that when the definition of the codes changes, the training data is no longer relevant, and a new body of manually coded training data has to be provided.


Dataset       #instances   #classes   #fields   Sample size          #attributes per sample   #classes per sample
Profession1   192,466      1,563      5         ~19,150 / ~9,450     ~5,400 / ~3,300          ~1,250 / ~1,050
Profession2   63,823       1,578      5         ~19,000 / ~9,800     ~5,200 / ~2,900          ~1,250
Education1    49,092       5,991      11        ~19,600 / ~9,800     ~4,650 / ~2,900          ~4,050 / ~2,800
Education2    49,092       8          11        ~19,600 / ~9,800     ~4,650 / ~2,900          8

Table 2.1: Description of the datasets. Where two figures are given per cell, they refer to the ~20,000- and ~10,000-instance samples respectively; the Education1 and Education2 datasets share the same samples and attributes and differ only in the class attribute.

2.2 Data Description

The CBS has provided three datasets for this research: Profession1, Profession2, and Education1. Datasets Profession1 and Profession2 contain text records describing professions and have similar structures. The text records in the profession datasets contain five fields. Each field contains the answer to a particular question, for example the company someone works for or someone's daily activities. Class values of these datasets are SBC codes (see CBS, Divisie Sociale en Ruimtelijke Statistieken, Sector Ontwikkeling en Ondersteuning [2001] for a description of the SBC codes).

Education1 is the third dataset; it contains text records describing educations. The text records in this dataset consist of eleven fields. These fields contain among other things the name and level of the education. Class values of this dataset are SOI+ codes¹.

A fourth dataset is derived from the Education1 dataset. In all instances of this dataset the original class value of the Education1 dataset has been replaced with the first digit of this class value; all other fields remain the same. The result is the Education2 dataset, which has only 8 class values.

In table 2.1 the basic statistics of the different datasets can be found.

All four datasets are randomly sampled for testing to reduce the computational requirements. From the datasets, samples of approximately 10,000 and 20,000 instances are taken. The number of attributes in the table is the number of attributes after the feature reduction that will be described in the following section.

¹The SOI+ code consists of four parts, e.g., 603150-1-NVT-DOCTORAAL. The first six digits are a so-called SOI-code. The first digit of the SOI-code represents the level of education, and the second digit represents the sublevel of education. The third and fourth digit together represent the education field, and the fifth and sixth digit together represent the education subfield. The second part of the SOI+ code is a serial number following the SOI-code, designed to map the SOI-codes unambiguously onto the international ISCED (International Standard Classification of Education) codes. The third part is the education track and the fourth part is the university phase [CBS, Divisie Sociale en Ruimtelijke Statistieken, Sector Ontwikkeling en Ondersteuning, 2003].

The text classification tasks of the CBS differ from a common text classification task in two aspects. Firstly, the number of classes is very large. The Education1 dataset has approximately 6,000 classes, and the profession datasets have almost 1,600 classes. Secondly, the size of the instances is smaller. While most classification tasks consider whole documents, in these tasks an instance consists of only a few words. The instances do not contain complete phrases. These two characteristics have considerable consequences for the performance and efficiency of the text classification algorithms.

2.3 Data Preparation

Besides sampling, other techniques are used to further ease the computational complexity. In the following subsections the selection of informative fields and different feature-reduction techniques are discussed.

2.3.1 Selection of Informative Fields

Records in the CBS datasets consist of multiple fields that might not all be informative. One of the tasks in the previous research consisted of assessing the contribution of the text fields to the classification performance. In turn, each text field of the Profession1 dataset was left out to determine the effect on the accuracy of the text classifiers. The conclusion of this task was that only two text fields are required for the classification of professions. These text fields contain the profession and the daily activities of an individual [Smirnov et al., 2003a]. We use the same two text fields to classify the records in the Profession2 dataset, as the Profession2 dataset contains the same fields as the Profession1 dataset.

To identify which fields are informative in the education datasets, each text field was, also in turn, left out to determine the effect on the accuracy. It turned out that leaving out certain text fields did not lead to a significant improvement in classification accuracy. Hence all text fields are used for the classification of the Education1 and Education2 datasets. An explanation for this result is that only a few fields are filled for each record in the education datasets; most fields are empty for the majority of records. Fields can be empty because they are irrelevant or unknown for a certain individual.

2.3.2 Feature Reduction

A major problem in text categorization is the high dimensionality of the feature space. In our case the feature space consists of all unique words that occur in the datasets. For example, in a sample of 20,000 records, around 15,000 unique words are present. An instance is represented by all words that occur in the informative fields of the text record. Feature reduction can not only reduce time and space complexity, but also increase classification performance by removing irrelevant and uninformative features. We will use feature reduction based on document frequency, removal of stop words, and stemming to reduce the number of features.


Document-frequency thresholding and removal of stop words are among the simplest techniques for feature reduction. The document frequency of a word is the number of documents in which the word occurs. Each word that has a document frequency less than some predefined threshold is removed from the feature space [Yang and Pedersen, 1997]. In our research all words that occur in only one text record are removed, which leads to an average reduction in the number of attributes of around 50%.

The removal of function words or stop words is another common technique that is applied. Stop words are topic-neutral words such as articles, prepositions, conjunctions, and abbreviations, which do not contribute to classification performance [Sebastiani, 2002]. The list of stop words contains 89 topic-neutral words. The reduction in the number of features is not that big, but the number of instances in which one or more stop words are removed is large.
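Both reduction steps are straightforward to implement. A minimal sketch follows; the threshold of one record matches the choice described above, while the tiny stop-word list is only a placeholder for the 89-word list used in the thesis.

from collections import Counter

def reduce_features(records, stop_words, min_document_frequency=2):
    """Remove stop words and words whose document frequency is below the threshold.

    records -- list of text records, each given as a list of words
    """
    # Document frequency: in how many records does each word occur at least once?
    df = Counter(word for record in records for word in set(record))
    keep = {w for w, f in df.items() if f >= min_document_frequency and w not in stop_words}
    return [[w for w in record if w in keep] for record in records]

records = [["de", "timmerman", "bouwt", "kasten"],
           ["de", "lerares", "geeft", "les"],
           ["timmerman", "meubels"]]
print(reduce_features(records, stop_words={"de"}))
# [['timmerman'], [], ['timmerman']] : only 'timmerman' occurs in more than one record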

A less frequently used preprocessing technique that we apply is stemming. The goal of stemming is to improve the performance of information-retrieval techniques such as text categorization by bringing under one heading variant forms of a word which share a common meaning. Stemming can reduce the number of features and improve classification performance at the same time. In general there are two stemming techniques. The first technique is algorithmic: stemming is done according to certain rules which strip suffixes, prefixes, or infixes. The second technique is dictionary-based stemming. In this case the stemmed version of a word can be looked up in a special dictionary. Since dictionary-based stemming is slower than algorithmic stemming and a Dutch stemming dictionary is not readily available, we will use algorithmic stemming in this research [Porter, 2001].

Kraaij and Pohlmann [1995] developed a Dutch version of the well-known Porter stemmer. A simplified version of this algorithm has been implemented for this project. The following two rules of the original algorithm have been omitted.

- Removal of the affix 'ge', which occurs as a prefix or infix in Dutch participles. In the data concerned here participles are rare, attributable to the formulation of the questions. Most of the verbs are in their complete form. Removal of the affix 'ge' would be inappropriate in most of the cases.

- Duplication of a vowel in a closed syllable. In Dutch, long vowels are spelled single in open syllables, e.g., kopen, and double in closed syllables, e.g., koop. After removal of some affixes, e.g., lop(-en), the stem vowel needs to be doubled to render an orthographically correct stem. Although this procedure can lead to the stemming of more words to the same stem, it is not necessary for the automatic classifier that the words are orthographically correct. In addition, the procedure does not always work correctly: when e is the vowel in an unstressed syllable it is never doubled. For example, the correct stem of kantelen can be either kanteel or kantel. Without word stress information it is impossible to consistently predict the status of e correctly.

The resulting algorithm consists of a series of suffix strippers; the pseudo code of the algorithm can be found in figure 2.1. The string in this pseudo code is the word that is stemmed. Besides the suffix strippers of Kraaij and Pohlmann, two strippers specific to this domain are added. They strip two frequently occurring terms (administratief and opleiding) to their abbreviations (adm. and opl.), which also frequently occur.


Remove suffixes (-e, -en)
If (String.length > 1)
    Remove suffixes (-ig, -ing, -baar, -bar, -heid, -etj, -tj)
If (String.length > 1)
    Replace (-opleiding → -opl, administratief → adm)
If (String.length > 1)
    Replace suffixes (-v → -f, -z → -s)
    Replace suffixes (-dd → -d, -ff → -f, -gg → -g, -kk → -k,
                      -ll → -l, -mm → -m, -nn → -n, -pp → -p, -rr → -r,
                      -ss → -s, -tt → -t)

Figure 2.1: Pseudo code of the stemming algorithm.
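A direct Python transcription of the pseudo code in figure 2.1 might look as follows; where the pseudo code leaves details open (such as the order in which suffixes are tried and the exact length condition), the choices below are assumptions.

def strip_suffixes(word, suffixes):
    """Remove the first matching suffix, provided a non-empty stem remains."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[:-len(suffix)]
    return word

def stem(word):
    """Simplified Dutch stemmer following the pseudo code of figure 2.1."""
    word = strip_suffixes(word, ["e", "en"])
    if len(word) > 1:
        word = strip_suffixes(word, ["ig", "ing", "baar", "bar", "heid", "etj", "tj"])
    if len(word) > 1:
        # Domain-specific replacements for two frequently occurring terms.
        word = word.replace("opleiding", "opl").replace("administratief", "adm")
    if len(word) > 1:
        if word.endswith("v"):
            word = word[:-1] + "f"
        elif word.endswith("z"):
            word = word[:-1] + "s"
        for double, single in [("dd", "d"), ("ff", "f"), ("gg", "g"), ("kk", "k"),
                               ("ll", "l"), ("mm", "m"), ("nn", "n"), ("pp", "p"),
                               ("rr", "r"), ("ss", "s"), ("tt", "t")]:
            if word.endswith(double):
                word = word[:-2] + single
    return word

print(stem("kasten"), stem("lessen"))  # kast les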

The reduction of the number of attributes as a result of stemming is given in table 2.2. The Education1 dataset and the Education2 dataset contain the same attributes except for the class attribute. Therefore the results of the feature reduction are the same and the table shows only one Education column that represents both datasets. For both profession files the reduction is around 10%; for the education files the reduction is 6.4%. Besides this reduction, an increase in classification performance has also been observed. The performance increase depends on the algorithm and the dataset used; it lies between 0.15% and 1.60%.

            Education    Profession1    Profession2
Sample 1    6.40%        10.00%         10.57%
Sample 2    6.40%        10.14%         10.14%
Sample 3    6.40%        9.97%          10.99%
Average     6.40%        10.04%         10.57%

Table 2.2: Reduction in the number of features as a result of stemming.

2.4 Chapter Summary

The four CBS datasets provide us with challenging text classification tasks, due to the large numbers of records, attributes, and classes. Random samples of 10,000 and 20,000 records are taken from the datasets to reduce the computational complexity. Furthermore, only the informative fields have been selected and feature-reduction techniques have been applied. The feature-reduction techniques include document-frequency thresholding, removal of stop words, and stemming. Together they lead to a feature reduction of approximately 60%. The results of the data preparation are four datasets that are reduced in size and can be used for classification.


Chapter 3

Text Classification

In the previous chapter we described the classification tasks of the CBS and the data preparation process. In the rest of the thesis we focus on reliable text classification for the CBS. The first step in the meta-classifier approach is to generate a base classifier. This chapter studies the base text classifiers derived on the CBS data. We answer our first research question, which is concerned with finding an efficient and accurate text classifier. Section 3.1 elaborates on text representation and section 3.2 discusses classification algorithms. In section 3.3 experiments with the text classifiers on the CBS data are described. Finally, in section 3.4 we give a summary of this chapter.

3.1 Text Representation

Text documents typically consist of strings of characters that cannot be directly interpreted by a classifier. Each document needs to be transformed into a compact representation of its content appropriate for automatic classification. The vector representation is the most frequently used document representation [Mladenić, 1999]. A document is represented as a vector of weights d_j = <w_1j, ..., w_|M|j>, where M is the set of features that occur at least once in at least one document in the dataset, and each weight 0 ≤ w_kj ≤ 1 represents how much feature f_k contributes to the semantics of document d_j. Different ways to define features and to compute feature weights lead to different text representation approaches [Sebastiani, 2002]. Four possibilities to define features are as follows.

- Words. Information retrieval research suggests that words work well as features and that their ordering in a record is of minor importance for many tasks [Lewis, 1992]. When the weights of the features are binary, identifying features with words is called the bag-of-words approach.

- Phrases. Using phrases as features has proven to be not very successful. Different ways of defining phrases were investigated by Lewis [1992].

- Word n-grams. A feature is defined for each word sequence containing up to n consecutive words. This representation works best if the sequence of words is important for the context and the documents are short [Mladenić, 1998].


- Character n-grams. Features are formed by all sequences of n characters in a text. A common value for n is 3. The character n-gram representation was established quite some years ago [Shannon, 1951]. An advantage of this representation is that it copes better with spelling mistakes.

Besides the features, also the feature weights have to be determined. When a feature occurs in a certain document, a measure has to determine the value of this feature for the document. Some early naïve Bayes approaches use a binary-valued vector representation [Robertson and Sparck-Jones, 1976]. The feature value equals 1 if the feature occurs in the document, and 0 otherwise. Information inherent in the frequencies of features is lost when only the absence or presence of features is considered. Therefore new approaches that take into account the frequency of features in a document have been developed [Lewis, 1998]. The most straightforward adaptation is to consider the weight of a feature equal to the number of occurrences of the feature in the document. This weighting function is referred to as term frequency (tf) [Lewis, 1992]. The tf function encodes the intuition that the more often a feature occurs in a document, the more it is representative of its content. The second popular weighting function is the tfidf weighting function. It is the product of the term frequency and the inverse document frequency (idf). The document frequency (df) of a feature is the number of documents in the dataset in which the feature occurs at least once. The inverse document frequency can be calculated from the document frequency as follows:

idf(f_i) = log ( |N| / df(f_i) ),

where |N| is the total number of documents in the dataset, and df(f_i) is the document frequency of feature f_i. The idf function encodes the intuition that the more documents a feature occurs in, the less discriminating it is [Sebastiani, 2002, Smirnov et al., 2003b]. Variations on the tfidf weighting function can be created through applying different logarithms, normalizations, and other correction factors.
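The tf and tfidf weights can be computed directly from word counts. A minimal sketch follows, using the natural logarithm (the thesis leaves the exact logarithm and normalization open).

import math
from collections import Counter

def tfidf_vectors(documents):
    """tfidf weight of a word in a document: term frequency times log(|N| / df).

    documents -- list of documents, each given as a list of words
    """
    n = len(documents)
    df = Counter(word for doc in documents for word in set(doc))
    vectors = []
    for doc in documents:
        tf = Counter(doc)
        vectors.append({w: count * math.log(n / df[w]) for w, count in tf.items()})
    return vectors

docs = [["timmerman", "meubels", "maken"],
        ["meubels", "verkopen"],
        ["les", "geven", "basisschool"]]
print(tfidf_vectors(docs))
# 'meubels' occurs in two of the three documents and therefore gets a lower
# weight (log(3/2)) than the words that occur in only one document (log 3).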

In this research we use the following text representation. To convert a text document into an instance suitable for classification, the feature space is the vector of all words that occur in the dataset. Previous CBS research shows that words work well as features for the Profession1 dataset; therefore in this research we use words as features [Smirnov et al., 2003a]. The feature weights of the instances are computed by a tf or a tfidf weighting function. We only use the weighting functions that take into account frequencies of features, in order not to lose the information inherent in these frequencies.

3.2 Classification Algorithms

When the text of a document is transformed into a compact representation, in principle any machine-learning algorithm can be applied. Some special text classification algorithms that incorporate feature definitions and weighting functions have been proposed, for example the Rocchio classifier [Rocchio, 1971]. The classification algorithms that are examined in this research, i.e., naïve Bayes and nearest neighbour, can use either the tf or the tfidf weighting function.


In this research we look for practical solutions that have small space and time complexity. We only consider classification algorithms that meet these requirements; e.g., support vector machines are not considered although they are a promising text classification technique [Joachims, 1998]. In the following three subsections the naïve Bayes algorithm, the nearest neighbour algorithm, and a hierarchical approach to text classification are discussed.

3.2.1 Naïve Bayes

The naïve Bayes classifier assumes that all attributes are mutually independent. Therefore the parameters for each attribute can be learned separately, which simplifies the task considerably. This is especially beneficial in text classification tasks, which typically have a large number of attributes. During training of the naïve Bayes classifier, the prior probabilities of the classes, and the probabilities of observing attribute values given the class, are estimated based on their frequencies over the training data. A new instance is classified by using Bayes' theorem to identify the most probable class [Mitchell, 1997].
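A compact sketch of such a classifier is given below. The multinomial event model and the Laplace smoothing are common choices for naïve Bayes text classification, not details taken from the thesis.

import math
from collections import Counter, defaultdict

class NaiveBayesText:
    """Naive Bayes over word-count features: class priors and per-class word
    probabilities are estimated from frequencies in the training data."""

    def fit(self, documents, labels):
        self.vocabulary = {w for doc in documents for w in doc}
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)  # class -> word -> count
        for doc, label in zip(documents, labels):
            self.word_counts[label].update(doc)
        return self

    def predict(self, document):
        total_docs = sum(self.class_counts.values())
        best_class, best_score = None, -math.inf
        for c, count in self.class_counts.items():
            score = math.log(count / total_docs)            # log prior
            total_words = sum(self.word_counts[c].values())
            for word in document:
                if word in self.vocabulary:                 # Laplace-smoothed likelihood
                    score += math.log((self.word_counts[c][word] + 1)
                                      / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_class, best_score = c, score
        return best_class

nb = NaiveBayesText().fit([["timmerman", "meubels"], ["les", "geven"]],
                          ["carpenter", "teacher"])
print(nb.predict(["meubels", "maken"]))  # carpenter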

3.2.2 Nearest Neighbour

The nearest neighbour classifier does not build an explicit, declarative representation of the classes, but simply stores all the training instances. To classify a new instance it relies on the class values attached to the training instances that are similar to the new instance. The k nearest training instances take a vote to settle on the class value when the class is nominal. The main drawback of this approach is its inefficiency at classification time. To classify a new instance, the entire training set needs to be ranked for similarity with the new instance [Mitchell, 1997].

In the nearest neighbour classifier, k is the only important parameter that needs to be set. Experiments on the CBS datasets show that the value k = 1 performs best, so in the remainder of this thesis we will use the 1-nearest neighbour classifier for the text classifiers.
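A minimal 1-nearest-neighbour sketch over weighted word vectors is shown below; the cosine similarity is an assumed choice, since the thesis does not state here which similarity measure is used.

import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors given as word -> weight dicts."""
    dot = sum(w * b.get(word, 0.0) for word, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def one_nearest_neighbour(train_vectors, train_labels, new_vector):
    """Rank the entire training set by similarity to the new instance and
    return the class of the single most similar training instance (k = 1)."""
    best = max(range(len(train_vectors)),
               key=lambda i: cosine_similarity(train_vectors[i], new_vector))
    return train_labels[best]

train = [{"timmerman": 1.1, "meubels": 0.4}, {"les": 1.1, "geven": 1.1}]
labels = ["carpenter", "teacher"]
print(one_nearest_neighbour(train, labels, {"meubels": 0.4, "maken": 1.1}))  # carpenter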

3.2.3 Hierarchical Classifier

A difficulty of the data provided by the CBS is the large number of classes. The Education1 dataset has almost 6,000 classes. This is one of the reasons that classification is rather time and space consuming. To overcome this problem we have constructed a hierarchical classifier that is computationally more efficient. The class attribute can be divided into multiple subclasses that can be classified separately. Each of these subclasses has a small number of class values. Assuming that the class value can be divided into k subclasses and each subclass has n distinct class values, the hierarchical classifier classifies k times n classes, while a plain classifier classifies n^k classes, which for most classifiers is computationally more demanding.

Most hierarchical classifiers require a strict concept hierarchy, like Chuang et al. [2000]. However, our dataset contains several hierarchies rather than one. The education level is a more general concept than the education sublevel, and likewise the education field is more general than the education subfield.


Digit      1       2          3-4     5-6        7            8       9
Subclass   Level   Sublevel   Field   Subfield   Serial nr.   Track   Phase
Example    6       0          31      50         1            NVT     NVT

(Digits 1 to 6 together form the SOI-code.)

Table 3.1: Subclasses in the education SOI+ code.

The class attribute can be divided into four or seven subclasses. The first four subclasses of the seven subclasses together are equal to the first subclass of the four subclasses. In table 3.1 the division of the SOI+ code into subclasses is illustrated. The first row in the table shows the digits in the SOI+ code. The second row shows the subclasses, and the third row shows the division of the code 603150-1-NVT-NVT into seven subclasses. The values of the subclasses can be derived directly from the original class value, i.e., a subclass value consists of one or more digits from the original class value.

The hierarchical classifier classifies an instance in multiple steps. In each step one subclass is classified. The value of the subclass is subsequently added to the list of attributes that is used to classify the next subclass. This process continues until all subclass values have been classified. All the subclass values are then merged to create the final class value.

The subclass that is the easiest to predict will be predicted first. In most hierarchical classifiers this is the subclass at the top of the hierarchy, which represents the most general concept. In our case we do not have a strict hierarchy. To determine which subclass should be predicted first, we train a classifier for each possible subclass, and the subclass that has the highest classification accuracy is classified first. In the same way we determine the classification order of the remaining subclasses.
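The stepwise procedure can be sketched as follows. The subclass extraction assumes the string layout of table 3.1, and the per-subclass classifiers are hypothetical predict functions; neither reflects the actual implementation used in the thesis.

def soi_subclasses(code):
    """Split a SOI+ code such as '603150-1-NVT-DOCTORAAL' into its seven subclasses."""
    soi, serial, track, phase = code.split("-")
    return {"level": soi[0], "sublevel": soi[1], "field": soi[2:4],
            "subfield": soi[4:6], "serial": serial, "track": track, "phase": phase}

def hierarchical_classify(instance, subclass_order, classifiers):
    """Classify one subclass at a time; every predicted subclass value is added
    to the attributes used for the next step. `classifiers` maps a subclass name
    to a hypothetical predict(attributes) function, and `subclass_order` is the
    order determined by per-subclass accuracy (most accurate subclass first)."""
    attributes = list(instance)
    predicted = {}
    for name in subclass_order:
        value = classifiers[name](attributes)
        predicted[name] = value
        attributes.append(value)  # feed the prediction to the next subclass
    return predicted              # the values are merged into the final class afterwards

print(soi_subclasses("603150-1-NVT-DOCTORAAL"))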

3.3 Experiments with the Text Classifiers

This section describes the experiments that have been carried out in order to assess the performance of the different classifiers. Besides the experiments that measure the classification accuracy of the different classifiers, some additional experiments have been carried out to derive the optimal setting of parameters for the hierarchical classifier.

3.3.1 Experimental Setting and Methodology

The text classifiers are tested on the Profession1, the Profession2, the Education1, and the Education2 datasets. A nearest neighbour classifier (NN) and a naïve Bayes classifier (NB) are trained on each dataset in combination with the tf and tfidf text representations. The datasets have been sampled without replacement to reduce the computational requirements. Three samples of roughly 20,000 instances are taken for each dataset. Each classifier is trained for each sample and then the results are averaged. We measure the classification accuracy, i.e., the percentage of correctly classified records.
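The evaluation procedure (two classifiers, two text representations, three samples per dataset, averaged accuracy) could be organized along the following lines; the helper function is a hypothetical placeholder, not the tooling actually used in the thesis.

def average_accuracies(classifiers, representations, samples, train_and_test):
    """Average the classification accuracy of every classifier/representation
    combination over the samples of one dataset.

    train_and_test(classifier, representation, sample) -> accuracy on that sample
    (hypothetical: trains on the sample and returns the percentage of correctly
    classified records)."""
    results = {}
    for classifier in classifiers:              # e.g. "NN", "NB"
        for representation in representations:  # e.g. "tf", "tfidf"
            scores = [train_and_test(classifier, representation, s) for s in samples]
            results[(classifier, representation)] = sum(scores) / len(scores)
    return results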

To test the hierarchical classifier a sample from the Education1 dataset of
