Collapsed Consonant and Vowel Models New Approaches for Engl


We propose a novel algorithm for English to Persian transliteration. Previous methods proposed for this language pair apply a word alignment tool for training. By contrast, we introduce an alignment algorithm particularly designed for transliteration. Our

Collapsed Consonant and Vowel Models: New Approaches for English-Persian Transliteration and Back-Transliteration

Sarvnaz Karimi, Falk Scholer, Andrew Turpin
School of Computer Science and Information Technology
RMIT University, GPO Box 2476V, Melbourne 3001, Australia

{sarvnaz,fscholer,aht}@cs.rmit.edu.au

Abstract

We propose a novel algorithm for English to Persian transliteration. Previous methods proposed for this language pair apply a word alignment tool for training. By contrast, we introduce an alignment algorithm particularly designed for transliteration. Our new model improves the English to Persian transliteration accuracy by 14% over an n-gram baseline. We also propose a novel back-transliteration method for this language pair, a previously unstudied problem. Experimental results demonstrate that our algorithm leads to an absolute improvement of 25% over standard transliteration approaches.

1 Introduction

Translation of a text from a source language to a target language requires dealing with technical terms and proper names. These occur in almost any text, but rarely appear in bilingual dictionaries. The solution is the transliteration of such out-of-dictionary terms: a word from the source language is transformed to a word in the target language, preserving its pronunciation. Recovering the original word from the transliterated target is called back-transliteration. Automatic transliteration is important for many different applications, including machine translation, cross-lingual information retrieval and cross-lingual question answering.

Transliteration methods can be categorized into grapheme-based (AbdulJaleel and Larkey, 2003; Li et al., 2004), phoneme-based (Knight and Graehl, 1998; Jung et al., 2000), and combined (Bilac and Tanaka, 2005) approaches. Grapheme-based methods perform a direct orthographical mapping between source and target words, while phoneme-based approaches use an intermediate phonetic representation. Both grapheme- and phoneme-based methods usually begin by breaking the source word into segments, and then use a source segment to target segment mapping to generate the target word. The rules of this mapping are obtained by aligning already available transliterated word pairs (training data); alternatively, such rules can be handcrafted. From this perspective, past work is roughly divided into those methods which apply a word alignment tool such as GIZA++ (Och and Ney, 2003), and approaches that combine the alignment step into their main transliteration process.

Transliteration is language dependent, and methods that are effective for one language pair may not work as well for another. In this paper, we investigate the English-Persian transliteration problem. Persian (Farsi) is an Indo-European language, written in Arabic script from right to left, but with an extended alphabet and different pronunciation from Arabic. Our previous approach to English-Persian transliteration introduced the grapheme-based collapsed-vowel method, employing GIZA++ for source to target alignment (Karimi et al., 2006). We propose a new transliteration approach that extends the collapsed-vowel method. To meet Persian language transliteration requirements, we also propose a novel alignment algorithm in our training stage, which makes use of statistical information of the corpus, transliteration specifications, and simple language properties. This approach handles possible consequences of elision (omission of sounds to make the word easier to read) and epenthesis (adding extra sounds to a word to make it fluent) in written target words that happen due to the change of language. Our method shows an absolute accuracy improvement of 14.2% over an n-gram baseline.

In addition, we investigate the problem of back-transliteration from Persian to English. To our knowledge, this is the first report of such a study. There are two challenges in Persian to English transliteration that make it particularly difficult. First, written Persian omits short vowels, while only long vowels appear in texts. Second, monophthongization (changing diphthongs to monophthongs) is popular among Persian speakers when adapting foreign words into their language. To take these into account, we propose a novel method to form transformation rules by changing the normal segmentation algorithm. We find that this method significantly improves the Persian to English transliteration effectiveness, demonstrating an absolute performance gain of 25.1% over standard transliteration approaches.

2 Background

In general, transliteration consists of a training stage (running on a bilingual training corpus), and a generation (also called testing) stage.

The training step of a transliteration system develops transformation rules mapping characters in the source to characters in the target language using knowledge of corresponding characters in transliterated pairs provided by an alignment. For example, for the source-target word pair (pat, ), an alignment may map “p” to “” and “a” to “”, and the training stage may develop the rule pa → , with “” as the transliteration of “a” in the context of “pa”. The generation stage applies these rules on a segmented source word, transforming it to a word in the target language.

Previous work on transliteration either employs a word alignment tool (usually GIZA++), or develops specific alignment strategies. Transliteration methods that use GIZA++ as their word pair aligner (AbdulJaleel and Larkey, 2003; Virga and Khudanpur, 2003; Karimi et al., 2006) have based their work on the assumption that the provided alignments are reliable. Gao et al. (2004) argue that precise alignment can improve transliteration effectiveness, experimenting on English-Chinese data and comparing IBM models (Brown et al., 1993) with phoneme-based alignments using direct probabilities.

Other transliteration systems focus on alignment for transliteration, for example the joint source-channel model suggested by Li et al. (2004). Their method outperforms the noisy channel model in direct orthographical mapping for English-Chinese transliteration. Li et al. also find that grapheme-based methods that use the joint source-channel model are more effective than phoneme-based methods due to removing the intermediate phonetic transformation step. Alignment has also been investigated for transliteration by adopting Covington’s algorithm on cognate identification (Covington, 1996); this is a character alignment algorithm based on matching or skipping of characters, with a manually assigned cost of association. Covington considers consonant to consonant and vowel to vowel correspondence more valid than consonant to vowel. Kang and Choi (2000) revise this method for transliteration where a skip is defined as inserting a null in the target string when two characters do not match based on their phonetic similarities or their consonant and vowel nature. Oh and Choi (2002) revise this method by introducing binding, in which many to many correspondences are allowed. However, all of these approaches rely on the manually assigned penalties that need to be defined for each possible matching.

In addition, some recent studies investigate discriminative transliteration methods (Klementiev and Roth, 2006; Zelenko and Aone, 2006) in which each segment of the source can be aligned to each segment of the target, where some restrictive conditions based on the distance of the segments and phonetic similarities are applied.

3 The Proposed Alignment Approach

We propose an alignment method based on segment occurrence frequencies, thereby avoiding predefined matching patterns and penalty assignments. We also apply the observed tendency of aligning consonants to consonants, and vowels to vowels, as a substitute for phonetic similarities. Many to many, one to many, one to null and many to one alignments can be generated.

3.1 Formulation

Our alignment approach consists of two steps: the first is based on the consonant and vowel nature of the word’s letters, while the second uses a frequency-based sequential search.

Definition 1 A bilingual corpus $B$ is the set $\{(S, T)\}$, where $S = s_1 \ldots s_l$, $T = t_1 \ldots t_m$, $s_i$ is a letter in the source language alphabet, and $t_j$ is a letter in the target language alphabet.

Definition 2 Given some word, $w$, the consonant-vowel sequence $p = (C|V)^+$ for $w$ is obtained by replacing each consonant with $C$ and each vowel with $V$.

Definition 3 Given some consonant-vowel sequence, $p$, a reduced consonant-vowel sequence $q$ replaces all runs of $C$’s with $C$, and all runs of $V$’s with $V$; hence $q = q' | q''$, $q' = V(CV)^*(C|\varepsilon)$ and $q'' = C(VC)^*(V|\varepsilon)$.
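Definitions 2 and 3 can be illustrated with a short sketch. The Python below is only an illustration under stated assumptions: the function names `cv_sequence` and `reduced_cv_sequence` are hypothetical, and the vowel set is a simplification for lowercase English that does not model semi-vowels such as “w” and “y”.

```python
import re

# Assumption: a plain English vowel set; semi-vowels are not handled here.
ENGLISH_VOWELS = set("aeiou")


def cv_sequence(word, vowels=ENGLISH_VOWELS):
    """Definition 2: replace each vowel with V and every other letter with C."""
    return "".join("V" if ch in vowels else "C" for ch in word.lower())


def reduced_cv_sequence(p):
    """Definition 3: collapse every run of C's to C and every run of V's to V."""
    return re.sub(r"(.)\1+", r"\1", p)


p = cv_sequence("patricia")      # "CVCCVCVV"
q = reduced_cv_sequence(p)       # "CVCVCV"
```

For “patricia” this gives p = CVCCVCVV and q = CVCVCV, the value of qS listed for that word in Figure 1.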

For each natural language word, we can determine the consonant-vowel sequence ($p$) from which the reduced consonant-vowel sequence ($q$) can be derived, giving a common notation between two different languages, no matter which script either of them uses. To simplify, semi-vowels and approximants (sounds intermediate between consonants and vowels, such as “w” and “y” in English) are treated according to their target language counterparts. In general, for all the word pairs $(S, T)$ in a corpus $B$, an alignment can be achieved using the function

$$f : B \to A; \quad (S, T) \mapsto (\hat{S}, \hat{T}, r).$$

The function $f$ maps the word pair $(S, T) \in B$ to the triple $(\hat{S}, \hat{T}, r) \in A$ where $\hat{S}$ and $\hat{T}$ are substrings of $S$ and $T$ respectively. The frequency of this correspondence is denoted by $r$. $A$ represents a set of substring alignments, and we use a per word alignment notation of ae2p when aligning English to Persian and ap2e for Persian to English.

3.2 Algorithm Details

Step 1 (Consonant-Vowel based)

For any word pair $(S, T) \in B$, the corresponding reduced consonant-vowel sequences, $q_S$ and $q_T$, are generated. If the sequences match, then the aligned consonant clusters and vowel sequences are added to the alignment set $A$. If $q_S$ does not match with $q_T$, the word pair remains unaligned in Step 1.

The assumption in this step is that transliteration of each vowel sequence of the source is a vowel sequence in the target language, and similarly for consonants. However, consonants do not always map to consonants, or vowels to vowels (for example, the English letter “s” may be written as “” in Persian which consists of one vowel and one consonant). Alternatively, they might be omitted altogether, which can be specified as the null string, ε. We therefore require a second step.
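Step 1 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the names `cv_runs` and `step1_align` are hypothetical, vowel sets are passed in explicitly, and the uppercase Latin letters in the usage example are stand-ins for target-script letters, not real Persian transliterations.

```python
from itertools import groupby


def cv_runs(word, vowels):
    """Split a word into maximal consonant runs and vowel runs, in order."""
    return [
        ("V" if is_vowel else "C", "".join(group))
        for is_vowel, group in groupby(word, key=lambda ch: ch in vowels)
    ]


def step1_align(source, target, source_vowels, target_vowels):
    """Pair up runs when the reduced C/V sequences match; otherwise return None."""
    source_runs = cv_runs(source, source_vowels)
    target_runs = cv_runs(target, target_vowels)
    if [kind for kind, _ in source_runs] != [kind for kind, _ in target_runs]:
        return None  # q_S != q_T: the pair is left for Step 2
    return [(s, t) for (_, s), (_, t) in zip(source_runs, target_runs)]


# Hypothetical stand-in target alphabet: uppercase letters, vowels "AEIOU".
pairs = step1_align("ici", "IKI", set("aeiou"), set("AEIOU"))
```

Here both strings reduce to VCV, so the three runs are paired position by position. A pair whose reduced sequences differ returns `None` and would be passed on to the frequency-based step.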

Our algorithm consists of two steps.

Step 2 (Frequency based)

For most natural languages, the maximum length of corresponding phonemes of each grapheme is a digraph (two letters) or at most a trigraph. Hence, alignment can be defined as a search problem that seeks units with a maximum length of two or three in both strings that need to be aligned. In our approach, we search based on statistical occurrence data available from Step 1.

In Step 2, only those words that remain unaligned at the end of Step 1 need to be considered. For each pair of words $(S, T)$, matching proceeds from left to right, examining one of the three possible options of transliteration: single letter to single letter, digraph to single letter and single letter to digraph. Trigraphs are unnecessary in alignment as they can be effectively captured during transliteration generation, as we explain below.

We define four different valid alignments for the source ($S = s_1 s_2 \ldots s_i \ldots s_l$) and target ($T = t_1 t_2 \ldots t_j \ldots t_m$) strings: $(s_i, t_j, r)$, $(s_i s_{i+1}, t_j, r)$, $(s_i, t_j t_{j+1}, r)$ and $(s_i, \varepsilon, r)$. These four options are considered as the only possible valid alignments, and the most frequently occurring alignment (highest $r$) is chosen. These frequencies are dynamically updated after successfully aligning a pair. For exceptional situations, where there is no character in the target string to match with the source character $s_i$, it is aligned with the empty string.

It is possible that none of the four valid alignment options have occurred previously (that is, $r = 0$ for each). This situation can arise in two ways: first, such a tuple may simply not have occurred in the training data; and, second, the previous alignment in the current string pair may have been incorrect. To account for this second possibility, a partial backtracking is considered. Most misalignments are derived from the simultaneous comparison of alignment possibilities, giving the highest priority to the most frequent. For example if S = bbc, T = and A = {(b, ,100), (bb, ,40), (c, ,60)}, starting from the initial positions $s_1$ and $t_1$, the first alignment choice is (b, ,101). However immediately after, we face the problem of aligning the second “b”. There are two solutions: inserting ε and adding the triple (b,ε,1), or backtracking the previous alignment and substituting that with the less frequent but possible alignment of (bb, ,41). The second solution is a better choice as it adds less ambiguous alignments containing ε. At the end, the alignment set is updated as A = {(b, ,100), (bb, ,41), (c, ,61)}. In case of equal frequencies, we check possible subsequent alignments to decide on which alignment should be chosen. For example, if (b, ,100) and (bb, ,100) both exist as possible options, we consider if choosing the former leads to a subsequent ε insertion. If so, we opt for the latter.
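The frequency-based search with backtracking can be sketched as below. This is a deliberately simplified illustration, not the algorithm as implemented by the authors: the name `step2_align` is hypothetical, only options already seen in the alignment set (r > 0) are tried, an unseen ε is never inserted, and neither the tie-breaking look-ahead nor backparsing is modelled. The target letters “x” and “y” are stand-ins for target-script letters.

```python
def step2_align(source, target, freq):
    """Left-to-right search over the four alignment shapes, most frequent first.

    freq maps (source segment, target segment) to its count r; an empty target
    segment "" plays the role of epsilon. Returns the chosen segment pairs, or
    None if no sequence of previously seen alignments covers both strings.
    """
    shapes = [(1, 1), (2, 1), (1, 2), (1, 0)]  # (source length, target length)

    def search(i, j):
        if i == len(source):
            return [] if j == len(target) else None
        candidates = []
        for src_len, tgt_len in shapes:
            s, t = source[i:i + src_len], target[j:j + tgt_len]
            if len(s) == src_len and len(t) == tgt_len and freq.get((s, t), 0) > 0:
                candidates.append((s, t))
        candidates.sort(key=lambda pair: freq[pair], reverse=True)
        for s, t in candidates:
            rest = search(i + len(s), j + len(t))
            if rest is not None:
                return [(s, t)] + rest
        return None  # dead end: the caller backtracks to its next candidate

    alignment = search(0, 0)
    if alignment is not None:
        for pair in alignment:
            freq[pair] += 1  # frequencies are updated only after a full success
    return alignment


freq = {("b", "x"): 100, ("bb", "x"): 40, ("c", "y"): 60}
result = step2_align("bbc", "xy", freq)
```

With these stand-in counts, the search first tries the more frequent (b, x), reaches a dead end at the second “b”, backtracks to (bb, x), and then aligns (c, y). The counts for (bb, x) and (c, y) end at 41 and 61 while (b, x) stays at 100, mirroring the final alignment set in the example above.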

At the end of a string, if just one character in the target string remains unaligned while the last alignment is an ε insertion, the remaining target character replaces ε in that final alignment. This usually happens when the alignment of final characters is not yet registered in the alignment set, mainly because Persian speakers tend to transliterate the final vowels to consonants to preserve their existence in the word. For example, in the word “Jose” the final “e” might be transliterated to “” which is a consonant (“h”) and therefore is not captured in Step 1.

Backparsing

The process of aligning words explained above can handle words with already known components in the alignment set A (the frequency of occurrence is greater than zero). However, when this is not the case, the system may repeatedly insert ε while part or all of the target characters are left intact (unsuccessful alignment). In such cases, processing the source and target backwards helps to find the problematic substrings: backparsing.

The poorly aligned substrings of the source and target are taken as new pairs of strings, which are then reintroduced into the system as new entries. Note that they themselves are not subject to backparsing. Most strings of repeating nulls can be broken up this way, and in the worst case will remain as one tuple in the alignment set.

To clarify, consider the example given in Figure 1 for the word pair (patricia, ), where an association between “c” and “” is not yet registered. Forward parsing, as shown in the figure, does not resolve all target characters; after the incorrect alignment of “c” with “ε”, subsequent characters are also aligned with null, and the substring “” remains intact. Backward parsing, shown in the next line of the figure, is also not successful. It is able to correctly align the last two characters of the string, before generating repeated null alignments. Therefore, the central region (substrings of the source and target which remained unaligned plus one extra aligned segment to the left and right) is entered as a new pair to the system (ici, ), as shown in the line labelled Input 2 in the figure. This new input meets Step 1 requirements, and is aligned successfully. The resulting tuples are then merged with the alignment set A.

An advantage of our backparsing strategy is that it takes care of casual transliterations happening due to elision and epenthesis (adding or removing extra sounds). It is not only in translation that people may add extra words to make fluent target text; for transliteration also, it is possible that spurious characters are introduced for fluency. However, this often follows patterns, such as adding vowels to the target form. These irregularities are consistently covered in the backparsing strategy, where they remain connected to their previous character.

4 Transliteration Method

Transliteration algorithms use aligned data (the output from the alignment process, ae2p or ap2e alignment tuples) for training to derive transformation rules. These rules are then used to generate a target word T given a new input source word S.


Input: (patricia, )   qS = CVCVCV   qT = CVCV
Step 1: qS ≠ qT
Forward alignment: (p, ,43), (a,ε,100), (t, ,52), (r, ,201), (i, ,61), (c,ε,1)
Backward alignment: , (i,ε,6), (r,ε,1), (t,ε,1), (a,ε,100), (p,ε,1)
Input 2: (ici, )   qS = VCV   qT = VCV
Step 1: (i, ,61), (c, ,1), (i, ,61)

Figure 1: A backparsing example. Note middle tuples in forward and backward parsings are not merged in A till the alignment is successfully completed.

[Figure 2: segmentation of the string “shelley” under CV-MODEL1, CV-MODEL2 and CV-MODEL3.]


consonants, as demonstrated in the example below. A special symbol is used to indicate the start and/or end of each word if the beginning and end of the word is a consonant respectively. Therefore, for the words starting or ending with consonants, the symbol “#” is added, which is treated as a consonant and therefore grouped in the consonant segment. An example of applying this technique is shown in Figure 2 for the string “shelley”. In this example, “sh” and “ll” are treated as two consonant segments, where the transliteration of individual characters inside a segment is dependent on the other members but not the surrounding segments. However, this is not the case for vowel sequences which incorporate a level of knowledge about any segment neighbours. Therefore, for the example “shelley”, the first segment is “sh” which belongs to C pattern. During transliteration, if “#sh” does not appear in any existing rules, a backoff splits the segment to smaller segments: “#” and “sh”, or “s” and “h”. The second segment contains the vowel “e”. Since this vowel is surrounded by consonants, the segment pattern is CVC. In this case, backoff only applies for vowels as consonants are supposed to be part of their own independent segments. That is, if search in the rules of pattern CVC was unsuccessful, it looks for “e” in V pattern. Similarly, segmentation for this word continues with “ll” in C pattern and “ey” in CV pattern (“y” is an approximant, and therefore considered as a vowel when transliterating English to Persian).
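The segmentation walked through above for “shelley” can be sketched in a few lines. This is an illustration under stated assumptions only: the function name `cv_model3_segments` is hypothetical, the vowel set (with “y” counted as a vowel, as in the example) is a simplification, and the backoff to smaller segments is not implemented.

```python
from itertools import groupby

# Assumption: "y" is treated as a vowel, as in the "shelley" example.
VOWELS = set("aeiouy")


def cv_model3_segments(word, vowels=VOWELS):
    """Split a word into consonant and vowel segments and label each pattern.

    Consonant runs get pattern C, with "#" attached at a word boundary.
    Vowel runs get V, CV, VC or CVC depending on neighbouring consonant runs.
    """
    runs = [
        ["V" if is_vowel else "C", "".join(group)]
        for is_vowel, group in groupby(word.lower(), key=lambda ch: ch in vowels)
    ]
    if runs and runs[0][0] == "C":
        runs[0][1] = "#" + runs[0][1]      # word starts with a consonant
    if runs and runs[-1][0] == "C":
        runs[-1][1] = runs[-1][1] + "#"    # word ends with a consonant

    segments = []
    for index, (kind, text) in enumerate(runs):
        if kind == "C":
            segments.append((text, "C"))
        else:
            left = "C" if index > 0 else ""
            right = "C" if index < len(runs) - 1 else ""
            segments.append((text, left + "V" + right))
    return segments


segments = cv_model3_segments("shelley")
```

For “shelley” the sketch yields the segments “#sh” (C), “e” (CVC), “ll” (C) and “ey” (CV), matching the patterns named in the paragraph above.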

4.3 Rules for Back-Transliteration

Written Persian ignores short vowels, and only long vowels appear in text. This causes most English vowels to disappear when transliterating from English to Persian; hence, these vowels must be restored during back-transliteration.

When the initial transliteration happens from English to Persian, the transliterator (whether human or machine) uses the rules of transliterating from English as the source language. Therefore, transliterating back to the original language should consider the original process, to avoid losing essential information. In terms of segmentation in collapsed-vowel models, different patterns define segment boundaries in which vowels are necessary clues. Although we do not have most of these vowels in the transliteration generation phase, it is possible to benefit from their existence in the training phase. For example, using CV-MODEL3, the pair ( , merkel) with qS = C and ap2e = (( ,me), ( ,r), ( ,ke), ( ,l)), produces just one transformation rule “ → merkel” based on a C pattern. That is, the Persian string contains no vowel characters. If, during the transliteration generation phase, a source word “” (S = ) is entered, there would be one and only one output of “merkel”, while an alternative such as “mercle” might be required instead. To avoid overfitting the system by long consonant clusters, we perform segmentation based on the English q sequence, but categorise the rules based on their Persian segment counterparts. That is, for the pair ( , merkel) with ae2p = ((m, ), (e,ε), (r, ), (k, ), (e,ε), (l, )), these rules are generated (with category patterns given in parenthesis): → m (C), → rk (C), → l (C), → merk (C), → rkel (C). We call the suggested training approach reverse segmentation.

Reverse segmentation avoids clustering all the consonants in one rule, since many English words might be transliterated to all-consonant Persian words.

4.4 Transliteration Generation and Ranking

In the transliteration generation stage, the source word is segmented following the same process of segmenting words in training stage, and a probability is computed for each generated target word:

$$P(T|S) = \prod_{k=1}^{|K|} P(\hat{T}_k | \hat{S}_k),$$

where $|K|$ is the number of distinct source segments. $P(\hat{T}_k | \hat{S}_k)$ is the probability of the $\hat{S}_k \to \hat{T}_k$ transformation rule, as obtained from the training stage:

$$P(\hat{T}_k | \hat{S}_k) = \frac{\text{frequency of } \hat{S}_k \to \hat{T}_k}{\text{frequency of } \hat{S}_k}.$$
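The scoring of a generated target word can be sketched as follows. This is a minimal illustration with hypothetical names (`rule_probability`, `word_probability`). It assumes rule frequencies are stored in a dictionary keyed by (source segment, target segment), and that the frequency of a source segment is the sum of the counts of all rules sharing that source segment; that normalisation is an assumption of the sketch. The uppercase target segments are stand-ins, not real Persian output.

```python
from math import prod


def rule_probability(rule_counts, source_segment, target_segment):
    """Relative frequency of one rule among all rules with the same source segment."""
    source_total = sum(
        count for (src, _), count in rule_counts.items() if src == source_segment
    )
    if source_total == 0:
        return 0.0  # unseen source segment: no rule can fire
    return rule_counts.get((source_segment, target_segment), 0) / source_total


def word_probability(rule_counts, segment_pairs):
    """Product of the rule probabilities over the aligned segments of one word."""
    return prod(rule_probability(rule_counts, s, t) for s, t in segment_pairs)


# Hypothetical counts for illustration only.
counts = {("sh", "X"): 3, ("sh", "Y"): 1, ("e", "Z"): 2}
score = word_probability(counts, [("sh", "X"), ("e", "Z")])  # 3/4 * 2/2
```

Candidate target words for the same source word could then be ranked by sorting on this score in descending order, which is one simple way to produce TOP-n lists.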


Small Corpus                          Large Corpus
TOP-1       TOP-5       TOP-10       TOP-1       TOP-5       TOP-10
58.0 (2.2)  85.6 (3.4)  89.4 (2.9)   47.2 (1.0)  77.6 (1.4)  83.3 (1.5)
61.7 (3.0)  80.9 (2.2)  82.0 (2.1)   50.6 (2.5)  79.8 (3.4)  84.9 (3.1)
60.0 (3.9)  86.0 (2.8)  91.2 (2.5)   47.4 (1.0)  79.2 (1.0)  87.0 (0.9)
67.4 (5.5)  90.9 (2.1)  93.8 (2.1)   55.3 (0.8)  84.5 (0.7)  89.5 (0.4)
72.2 (2.2)  92.9 (1.6)  93.5 (1.7)   59.8 (1.1)  85.4 (0.8)  92.6 (0.7)


GIZA++   New Alignment   Reverse

Table 2: Comparison of mean (standard deviation) word accuracy (%) for Persian to English transliteration.

6 Conclusions

We have presented a new algorithm for English to Persian transliteration, and a novel alignment algorithm applicable for transliteration. Our new transliteration method (CV-MODEL3) outperforms the previous approaches for English to Persian, increasing word accuracy by a relative 9.2% to 17.2% (TOP-1), when using GIZA++ for alignment in training. This method shows a further 7.1% to 8.1% increase in word accuracy (TOP-1) with our new alignment algorithm.

Persian to English back-transliteration is also investigated, with CV-MODEL3 significantly outperforming other methods. Enriching this model with a new reverse segmentation algorithm gives rise to further accuracy gains in comparison to directly applying English to Persian methods.

In future work we will investigate whether phonetic information can help refine our CV-MODEL3, and experiment with manually constructed rules as a baseline system.

Acknowledgments

This work was supported in part by the Australian government IPRS program (SK) and an ARC Discovery Project Grant (AT).

References

AbdulJaleel and Larkey. 2003. Statistical transliteration for English-Arabic cross language information retrieval. In Conference on Information and Knowledge Management, pages 139–146.

Slaven Bilac and Hozumi Tanaka. 2005. Direct combination of spelling and pronunciation information for robust back-transliteration. In Conferences on Computational Linguistics and Intelligent Text Processing, pages 413–424.

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation. Computational Linguistics, 19(2):263–311.

Covington. 1996. Computational Linguistics, 22(4):481–496.

Wei Gao, Kam-Fai Wong, and Wai Lam. 2004. Improving transliteration with precise alignment of phoneme chunks and using contextual features. In Asia Information Retrieval Symposium, pages 106–117.

Sung Young Jung, Sung Lim Hong, and Eunok Paek. 2000. An English to Korean transliteration model of extended Markov window. In Conference on Computational Linguistics, pages 383–389.

Byung-Ju Kang and Key-Sun Choi. 2000. Automatic transliteration and back-transliteration by decision tree learning. In Conference on Language Resources and Evaluation, pages 1135–1411.

Sarvnaz Karimi, Andrew Turpin, and Falk Scholer. 2006. English to Persian transliteration. In String Processing and Information Retrieval, pages 255–266.

Alexandre Klementiev and Dan Roth. 2006. Weakly supervised named entity transliteration and discovery from multilingual comparable corpora. In Association for Computational Linguistics, pages 817–

Knight and Graehl. 1998. Computational Linguistics, 24(4):599–612.

Haizhou Li, Min Zhang, and Jian Su. 2004. A joint source-channel model for machine transliteration. In Association for Computational Linguistics, pages 159–

Och and Ney. 2003. Computational Linguistics, 29(1):19–51.

Jong-Hoon Oh and Key-Sun Choi. 2002. An English-Korean transliteration model using pronunciation and contextual rules. In Conference on Computational Linguistics.

Paola Virga and Sanjeev Khudanpur. 2003. Transliteration of proper names in cross-language applications. In ACM SIGIR Conference on Research and Development on Information Retrieval, pages 365–366.

Dmitry Zelenko and Chinatsu Aone. 2006. Discriminative methods for transliteration. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 612–617.
