Literature Review: English Literature Review Template


IEEE Standard Format

Text Recognition with Machine Learning based on Text Structure

Literature Review

Yifan Shi, Student ID: 27291944, Email: ys1n13@soton.ac.uk, MSc Artificial Intelligence

Faculty of Physical Sciences & Eng, University of Southampton

Abstract: The fast-developing Machine Learning algorithms introduced to the semantic area have brought a vast range of techniques to text recognition, classification, and processing. However, there is always a tension between accuracy and speed, as higher accuracy generally implies a more complicated system as well as a larger training database. In order to achieve a balance between fast speed and good accuracy, many brilliant designs are used in text processing. In this literature review, these efforts are introduced in three layers: Natural-Language Processing, Text Classification, and the IBM Watson system.

Keywords: Machine Learning, Natural-Language Processing, Text Classification, IBM Watson

I. INTRODUCTION

The growing popularity of the Internet has brought an increasing number of users online, with a vast amount of messages, blogs, articles, etc. to be dealt with. These texts, known as natural-language texts, contain potentially useful information but take a long time for humans to read, understand, and deal with. Despite the popularity of search engine technology in helping users find sources with keywords, semantic techniques are also needed by many companies to make their working environments more user-friendly. In this literature review, I will introduce several important semantic techniques, starting from the most basic, Natural-Language Processing, which concentrates on the meaning of words and sentences, followed by Text Classification, which focuses on paragraphs and articles. Then, I will introduce a landmark system named IBM Watson, which has DeepQA as its working pipeline. Finally, a conclusion will be included to give some comments on these techniques.

II. NATURAL LANGUAGE PROCESSING

In order to deal with human natural language, it is necessary to transform unstructured text into well-structured tables of explicit semantics (Ferrucci, 2012). According to Liddy (2001), Natural-Language Processing (NLP) is a series of computational techniques used to analyze and represent naturally organized text in order to achieve certain tasks and applications. Collobert and Weston (2008) categorized NLP tasks into six types: Part-Of-Speech Tagging, Chunking, Named Entity Recognition, Semantic Role Labeling, Language Models, and Semantically Related Words. In addition, they implemented Multitask Learning with Deep Neural Networks to build a successful unified architecture, trained with backpropagation, which avoids the traditionally large amount of empirical hand-designed features (Collobert et al., 2011).

III. TEXT CLASSIFICATION

One of the simplest ways to represent an article for a learning algorithm is to use the number of times that distinct words appear in the document (Joachims, 2005). However, due to the large number of possible words used in articles, this creates a very high-dimensional feature space. Joachims (1999) suggests Transductive Support Vector Machines for classification because of their effective learning ability even in a high-dimensional feature space. Rather than using a non-linear Support Vector Machine (SVM), Dumais et al. (1998) compared a linear SVM with four other learning algorithms, namely Find Similar, Decision Trees, Naive Bayes, and Bayes Nets; their results also support SVM in text classification because of its high accuracy, fast speed, and simple model. Sebastiani (2002) also recommends Neural Networks as a potential choice for text classification, in that their accuracy is only slightly lower than that of SVM.

The cross-document comparison of small pieces of text, using linguistic features such as noun phrases and synonyms, is introduced by Hatzivassiloglou et al. (1999). The similarity of two paragraphs is defined by the same action conducted on the same object by the same actor. Therefore, drawing features according to nouns and verbs generally reduces a paragraph to several primitive elements. In addition to the similar primitive elements, restrictions such as ordering, distance, and primitive (matching noun and verb pairs) are also implemented to exclude weakly related features.

Feature selection methods can effectively reduce the dimensionality of the dataset (Ikonomakis et al., 2005) while keeping classification performance. To decide which words are to be kept, an evaluation function was introduced by Soucy and Mineau (2003) to measure how much information we can get by classifying through a single word. Another improvement by Han et al. (2004) is to use Principal Component Analysis (PCA) to reduce the dimension through feature transformation. Nigam and McCallum (2000) combine Expectation-Maximization and a Naive Bayes classifier to train the classifier with a certain amount of labeled texts followed by a large amount of unlabeled documents, which realizes automatic training without a huge amount of hand-labeled training data.
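The word-count representation and single-word evaluation described in this section can be illustrated with a small sketch. It assumes information gain as the evaluation function, which is one common choice and not necessarily the one used by Soucy and Mineau (2003). The toy corpus and the names `bag_of_words`, `information_gain`, and `select_features` are hypothetical and introduced only for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Map a document to {word: count}, the word-count representation."""
    return Counter(text.lower().split())


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(word, docs, labels):
    """Entropy reduction from splitting the documents on the presence of `word`.

    Assumption: information gain stands in for the unspecified evaluation function.
    """
    with_word = [y for d, y in zip(docs, labels) if word in d]
    without_word = [y for d, y in zip(docs, labels) if word not in d]
    remainder = 0.0
    for part in (with_word, without_word):
        if part:  # skip an empty side of the split
            remainder += len(part) / len(labels) * entropy(part)
    return entropy(labels) - remainder


def select_features(docs, labels, k):
    """Keep the k words with the highest information gain (ties: alphabetical)."""
    vocabulary = sorted(set().union(*docs))
    ranked = sorted(vocabulary,
                    key=lambda w: information_gain(w, docs, labels),
                    reverse=True)
    return ranked[:k]


# Hypothetical toy corpus: two "sport" and two "finance" documents.
texts = [
    "the team scored a late goal",
    "the goal won the match",
    "the bank raised a key rate",
    "the rate cut moved the market",
]
labels = ["sport", "sport", "finance", "finance"]
docs = [bag_of_words(t) for t in texts]

print(len(set().union(*docs)), "distinct words in 4 short documents")
print(select_features(docs, labels, 2))
```

Even four short sentences produce a vocabulary far larger than the number of documents, which is the dimensionality problem noted above. On this toy corpus the class-specific words "goal" and "rate" rank highest, while words shared by both classes, such as "the", score zero and would be dropped.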

IV. IBM WATSON

The IBM Watson project has shown us that a computer system in open-domain question answering (QA) is able to beat human champions in Jeopardy. As Ferrucci (2012) mentioned, the structure of Watson is more complicated than any single agent, as it has hundreds of algorithms working together, in the way that Minsky (1988) introduced in Society of Mind. Generally, Watson consists of the following parts: DeepQA, Natural Language Processing (NLP), Machine Learning (ML), and Semantic Web and Cloud Computing (Gliozzo et al., 2013).

The DeepQA system analyzes the question with different algorithms, giving different interpretations of the question and forming queries for each one (Ferrucci, 2012). It provides all the possible answers to the question, with the evidence and the scores for each candidate, which generates a ranking of candidate answers by likelihood of correctness. The Machine Learning algorithms are used to train the weights in its evaluating and analyzing algorithms (Gliozzo et al., 2013). The clue that Watson uses in searching is named the lexical answer type (LAT), which tells Watson what the question is asking about and what kind of thing it needs to look for. Before searching, it generates prior knowledge of a type label, known as a ‘direction’, for each candidate answer and searches for evidence for and against this ‘type direction’ (Ferrucci, 2012).

DeepQA also places high demands on grammar-based and syntactic analysis techniques, for example, relation extraction techniques for getting possible relations between words, based on a rule-based approach. In addition, the ability to break the question down into sub-questions by logic also improved Watson's performance (Ferrucci, 2012), which enables Watson to find results for each smaller question and combine them together. Corresponding to the ability to break down questions, it can also generate the score for the original question based on the evidence for the sub-questions.

To simulate human knowledge, Watson also uses a self-contained database. However, this requirement has led to its great hardware cost. Watson also needs to do automatic text analysis and knowledge extraction to update its database, because of the enormous amount of work and the need to ensure the accuracy of input knowledge. However, the use of a self-contained database is so costly that only a few institutions can afford the hardware expense, which makes the application of Watson expensive. Another limitation is that the structured resource is relatively narrow compared with the vast unstructured natural-language texts. One possible improvement is to use online data and an ordinary online search engine to find possibly related articles and analyze them with PC clients. Despite the tradeoff between accuracy and cost, because of possibly unreal data and incorrect information online, this makes the technique more realizable in general.
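The idea of scoring each candidate answer from several pieces of evidence and then ranking candidates by likelihood of correctness can be sketched as follows. This is only an illustration under stated assumptions, not Watson's actual implementation: the scorer names (`type_match`, `passage_support`, `popularity`), the weights, the bias, and the logistic combination are all hypothetical, and in a DeepQA-style system the weights would be learned from training questions and not set by hand.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    answer: str
    evidence: dict  # hypothetical scorer name -> score in [0, 1]


# Hypothetical hand-set weights; a real system would learn these from data.
WEIGHTS = {"type_match": 2.0, "passage_support": 1.5, "popularity": 0.5}
BIAS = -2.0


def confidence(candidate, weights=WEIGHTS, bias=BIAS):
    """Combine evidence scores into a confidence in (0, 1) with a logistic function."""
    z = bias + sum(w * candidate.evidence.get(name, 0.0)
                   for name, w in weights.items())
    return 1.0 / (1.0 + math.exp(-z))


def rank(candidates):
    """Order candidate answers from most to least likely to be correct."""
    return sorted(candidates, key=confidence, reverse=True)


# "type_match" plays the role of agreement with the lexical answer type (LAT):
# for a question asking for a city, "France" does not match the expected type.
candidates = [
    Candidate("Paris", {"type_match": 1.0, "passage_support": 0.9, "popularity": 0.8}),
    Candidate("France", {"type_match": 0.0, "passage_support": 0.7, "popularity": 0.9}),
    Candidate("Lyon", {"type_match": 1.0, "passage_support": 0.2, "popularity": 0.3}),
]
ranked = rank(candidates)

for c in ranked:
    print(f"{c.answer}: {confidence(c):.3f}")
```

With these assumed weights, a candidate that matches the expected answer type outranks one with strong passage support but the wrong type, which mirrors the role the text gives to the LAT as a clue for what kind of thing to look for.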

V. CONCLUSION

As can be seen from the content above, most techniques used in text analysis are based on ‘word feature’ extraction, word types, and relations, which are all semantic techniques, while Watson also uses searching techniques to find the exact answer shown in the text. However, machines lack the ability to conclude the main idea of a paragraph, which is more related to abstract logical thinking. The way that humans read concerns not only vocabulary and meaning, but also the structure of the paragraph and the location of sentences; for example, the first sentence in a paragraph usually guides the following content, which helps tell the significance of the sentences and words. Therefore, using machine learning to analyze the structure of an article, combined with the meaning of every sentence, might generate the ability to conclude the main idea, which can be used in text scanning and classification.

REFERENCES

[1] S. Dumais, J. Platt, D. Heckerman, and M. Sahami, Inductive Learning Algorithms and Representations for Text Categorization, Proceedings of the Seventh International Conference on Information and Knowledge Management, pp. 148-155, 1998.

[2] T. Joachims, Text Categorization with Support Vector Machines: Learning with Many Relevant, ECML-98 Proceedings of the 10th European Conference on Machine Learning, pp. 137-142, 1998.

[3] T. Joachims, Transductive Inference for Text Classification using Support Vector Machines, International Conference on Machine Learning (ICML), pp. 200-209, 1999.

[4] V. Hatzivassiloglou, J. Klavans, and E. Eskin, Detecting Text Similarity Over Short Passages: Exploring Linguistic Feature Combinations Via Machine Learning, Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 2000.

[5] K. Nigam, Text Classification from Labeled and Unlabeled Documents using EM, Machine Learning, Volume 39, pp. 103-134, 2000.

[6] E. Liddy, Natural Language Processing, In Encyclopedia of Library and Information Science, 2nd Ed. NY. Marcel Dekker, Inc, 2001.

[7] S. Tong and D. Koller, Support Vector Machine Active Learning with Applications to Text Classification, Journal of Machine Learning Research, pp. 45-66, 2001.

[8] F. Sebastiani, Machine Learning in Automated Text Categorization, ACM Computing Surveys (CSUR), Issue 1, Volume 34, pp. 1-47, 2002.

[9] P. Soucy and G. Mineau, Feature Selection Strategies for Text Categorization, AI 2003, LNAI 2671, pp. 505-509, 2003.

[10] X. Han, G. Zu, W. Ohyama, T. Wakabayashi, and F. Kimura, Accuracy Improvement of Automatic Text Classification Based on Feature Transformation and Multi-classifier Combination, LNCS, Volume 3309, pp. 463-468, Jan 2004.

[11] M. Ikonomakis, S. Kotsiantis, and V. Tampakas, Text Classification using Machine Learning Techniques, WSEAS Transactions on Computers, Issue 8, Volume 4, pp. 966-974, 2005.

[12] R. Collobert and J. Weston, A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning, ICML '08 Proceedings of the 25th International Conference on Machine Learning, ACM New York, USA, pp. 160-167, 2008.

[13] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, Natural Language Processing (Almost) from Scratch, Journal of Machine Learning Research, Volume 12, pp. 2493-2537, 2011.

[14] A. Gliozzo, O. Biran, S. Patwardhan, and K. McKeown, Semantic Technologies in IBM Watson, The 10th International Semantic Web Conference, Bonn, Germany, 2011.

[15] D. Ferrucci, Introduction to “This is Watson”, IBM Journal of Research and Development, Volume 56, Number 3/4, pp. 1:1-1:15, May/July 2012.

[16] G. Tesauro, D. Gondek, J. Lenchner, J. Fan, and J. Prager, Simulation, Learning, and Optimization Techniques in Watson's Game Strategies, IBM Journal of Research and Development, Volume 56, Number 3/4, pp. 16:1-16:11, 2012.
