Efficient computation of optimal actions





Emanuel Todorov¹

Departments of Applied Mathematics and Computer Science & Engineering, University of Washington, Box 352420, Seattle, WA 98195. Edited by James L. McClelland, Stanford University, Stanford, CA, and approved April 28, 2009 (received for review November 16, 2007).

Optimal choice of actions is a fundamental problem relevant to fields as diverse as neuroscience, psychology, economics, computer science, and control engineering. Despite this broad relevance the abstract setting is similar: we have an agent choosing actions over time, an uncertain dynamical system whose state is affected by those actions, and a performance criterion that the agent seeks to optimize. Solving problems of this kind remains hard, in part, because of overly generic formulations. Here, we propose a more structured formulation that greatly simplifies the construction of optimal control laws in both discrete and continuous domains. An exhaustive search over actions is avoided and the problem becomes linear. This yields algorithms that outperform Dynamic Programming and Reinforcement Learning, and thereby solve traditional problems more efficiently. Our framework also enables computations that were not possible before: composing optimal control laws by mixing primitives, applying deterministic methods to stochastic systems, quantifying the benefits of error tolerance, and inferring goals from behavioral data via convex optimization. Development of a general class of easily solvable problems tends to accelerate progress, as linear systems theory has done, for example. Our framework may have similar impact in fields where optimal choice of actions is relevant.

action selection | cost function | linear Bellman equation | stochastic optimal control

If you are going to act, you might as well act in the best way possible. But which way is best? This is the general problem we consider here. Examples include a nervous system generating muscle activations to maximize movement performance (1), a foraging animal deciding which way to turn to maximize food (2), an internet router directing packets to minimize delays (3), an onboard computer controlling a jet engine to minimize fuel consumption (4), and an investor choosing transactions to maximize wealth (5). Such problems are often formalized as Markov decision processes (MDPs), with stochastic dynamics p(x'|x,u) specifying the transition probability from state x to state x' under action u, and immediate cost ℓ(x,u) for being in state x and choosing action u. The performance criterion that the agent seeks to optimize is some cumulative cost that can be formulated in multiple ways. Throughout the article we focus on one formulation (total cost with terminal/goal states) and summarize results for other formulations.

Optimal actions cannot be found by greedy optimization of the immediate cost, but instead must take into account all future costs. This is a daunting task because the number of possible futures grows exponentially with time. What makes the task doable is the optimal cost-to-go function v(x), defined as the expected cumulative cost for starting at state x and acting optimally thereafter. It compresses all relevant information about the future and thus enables greedy computation of optimal actions. v(x) equals the minimum (over actions u) of the immediate cost ℓ(x,u) plus the expected cost-to-go E[v(x')] at the next state x':

v(x) = \min_u \{ \ell(x,u) + E_{x' \sim p(\cdot|x,u)}[v(x')] \}.   [1]

The subscript indicates that the expectation is taken with respect to the transition probability distribution p(·|x,u) induced by action u. Eq. 1 is fundamental to optimal control theory and is called the Bellman equation. It gives rise to Dynamic Programming (3) and Reinforcement Learning (2) methods that are very general but can be inefficient. Indeed, Eq. 1 characterizes v(x) only implicitly, as the solution to an unsolved optimization problem, impeding both analytical and numerical approaches.
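To make the exhaustive minimization in Eq. 1 concrete, here is a minimal value-iteration sketch for a generic finite MDP (the array layout, stopping rule, and terminal-state handling are illustrative assumptions, not part of the paper):

```python
import numpy as np

def value_iteration(P, cost, terminal, v_term, iters=1000, tol=1e-10):
    """Solve the Bellman equation (Eq. 1) by exhaustive minimization over actions.

    P        : (A, N, N) array, P[a, x, y] = p(y | x, a)
    cost     : (N, A) array, cost[x, a] = immediate cost of action a in state x
    terminal : (N,) boolean mask of terminal/goal states
    v_term   : (N,) array; v is held fixed at v_term on the terminal states
    """
    v = np.where(terminal, v_term, 0.0)
    for _ in range(iters):
        # Q[x, a] = cost(x, a) + E_{y ~ p(.|x, a)}[v(y)], the bracketed term in Eq. 1
        Q = cost + np.einsum('axy,y->xa', P, v)
        v_new = np.where(terminal, v_term, Q.min(axis=1))  # exhaustive search over actions
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v
```

The minimum over actions has to be recomputed at every state on every sweep; this is exactly the computation that the formulation developed below avoids.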

Here, we show how the Bellman equation can be greatly simplified. We find an analytical solution for the optimal u given v, and then transform Eq. 1 into a linear equation. Short of solving the entire problem analytically, reducing optimal control to a linear equation is the best one can hope for. This simplification comes at a modest price: although we impose certain structure on the problem formulation, most control problems of practical interest can still be handled. In discrete domains our work has no precursors. In continuous domains there exists related prior work (6–8) that we build on here. Additional results can be found in our recent conference articles (9–11), online preprints (12–14), and supplementary notes [supporting information (SI) Appendix].

Results

Reducing Optimal Control to a Linear Problem. We aim to construct a general class of MDPs where the exhaustive search over actions is replaced with an analytical solution. Discrete optimization problems rarely have analytical solutions, thus our agenda calls for continuous actions. This may seem counterintuitive if one thinks of actions as symbols ("go left," "go right"). However, what gives meaning to such symbols are the underlying transition probabilities, which are continuous. The latter observation is key to the framework developed here. Instead of asking the agent to specify symbolic actions, which are then replaced with transition probabilities, we allow the agent to specify the transition probabilities u(x'|x) directly. Formally, we have p(x'|x,u) = u(x'|x). Thus, our agent has the power to reshape the dynamics in any way it wishes. However, it pays a price for too much reshaping, as follows. Let p(x'|x) denote the passive dynamics characterizing the behavior of the system in the absence of controls. The latter will usually be defined as a random walk in discrete domains and as a diffusion process in continuous domains. Note that the notion of passive dynamics is common in continuous domains but is rarely used in discrete domains. We can now quantify how "large" an action is by measuring the difference between u(·|x) and p(·|x). Differences between probability distributions are usually measured via Kullback–Leibler (KL) divergence, suggesting an immediate cost of the form

\ell(x,u) = q(x) + \mathrm{KL}(u(\cdot|x) \,\|\, p(\cdot|x)) = q(x) + E_{x' \sim u(\cdot|x)}\left[\log \frac{u(x'|x)}{p(x'|x)}\right].   [2]

The state cost q(x) can be an arbitrary function encoding how (un)desirable different states are. The passive dynamics p(x'|x) and controlled dynamics u(x'|x) can also be arbitrary, except that we require u(x'|x) = 0 whenever p(x'|x) = 0; this constraint is needed to make the KL divergence well-defined.

Author contributions: E.T. designed research, performed research, analyzed data, and wrote the paper.

The author declares no conflict of interest. This article is a PNAS Direct Submission.

Freely available online through the PNAS open access option. See Commentary on page 11429.

¹E-mail: todorov@cs.washington.edu.



Fig. 1. New problem formulation. (A) A coin-toss example where the action corresponds to biasing the coin. The optimal bias is obtained by minimizing the sum (black) of the KL divergence cost (red) and the expected state cost (blue). Note that the tosses are independent and thus the temporal dynamics are irrelevant in this example. (B) A stochastic shortest-path problem illustrating how our framework can capture the benefits of error tolerance. See main text.

This constraint has the added benefit of preventing the agent from jumping directly to goal states, and more generally from making state transitions that are physically impossible.

Fig. 1 illustrates the construction above with two simple examples. Fig. 1A is a coin-toss problem where q(Tails) = 0, q(Heads) = 1 and the passive dynamics correspond to an unbiased coin. The action u has the effect of biasing the coin. The optimal bias, which turns out to be u*(Heads) = 0.27, achieves a trade-off between keeping the action cost and the expected state cost small. Note that the controller could have made the coin deterministic by setting u(Heads) = 0, but this is suboptimal because the associated action cost is too large. In general, the optimal actions resulting from our framework are stochastic. Fig. 1B is a shortest-path problem where q = 0 for the goal state (gray) and q = 1 for all other states. The passive dynamics correspond to the random walk on the graph. At the green state it does not matter which path is taken, so the optimal action equals the passive dynamics, the action cost is 0, and the cost-to-go (shown inside the circle) equals the length of the deterministic shortest path. At the red state, however, the optimal action deviates from the passive dynamics to cause a transition up, incurring an action cost of 0.6 and making the red state worse than the green state. In general, v(x) is smaller when the task can be accomplished in multiple ways starting from x. This reflects a preference for error tolerance that is inherent in our framework. We now return to the theoretical development. The results take on a simpler form when expressed in terms of the desirability function

z(x) = \exp(-v(x)).   [3]

This terminology reflects the fact that z is large at states where the cost-to-go v is small. Substituting the cost from Eq. 2 into Eq. 1, the Bellman equation can be written in terms of z as

-\log z(x) = q(x) + \min_u E_{x' \sim u(\cdot|x)}\left[\log \frac{u(x'|x)}{p(x'|x)\, z(x')}\right].   [4]

The expression being minimized resembles the KL divergence between u and pz, except that pz is not normalized to sum to 1. Thus, to obtain a proper KL divergence (SI Appendix), we have to multiply and divide by the normalization term

G[z](x) = \sum_{x'} p(x'|x)\, z(x') = E_{x' \sim p(\cdot|x)}[z(x')].   [5]

Recall that KL divergence achieves its global minimum of zero when the two distributions are equal. Therefore, the optimal action u* is proportional to pz:


u^*(x'|x) = \frac{p(x'|x)\, z(x')}{G[z](x)}.   [6]

This represents the first general class of MDPs where the optimal actions can be found analytically given the optimal costs-to-go. Previously, such results were available only in continuous domains. The Bellman equation can now be simplified (SI Appendix) by substituting the optimal action, taking into account the normalization term, and exponentiating. The result is

z(x) = \exp(-q(x))\, G[z](x).   [7]

The expectation G[z] is a linear operator; thus, Eq. 7 is linear in z. It can be written more compactly in vector notation. Enumerate the states from 1 to n, represent z(x) and q(x) with the n-dimensional column vectors z and q, and p(x'|x) with the n-by-n matrix P, where the row index corresponds to x and the column index to x'. Then Eq. 7 becomes z = Mz, where M = diag(exp(−q)) P, exp is applied element-wise and diag transforms vectors into diagonal matrices. The latter equation looks like an eigenvector problem, and indeed it can be solved (9) by using the power iteration method z ← Mz (which we call Z iteration). However, the problem here is actually simpler because the eigenvalue is 1 and v(x) = q(x) at terminal states. If we define the index sets T and N of terminal and nonterminal states and partition z, q, and P accordingly, Eq. 7 becomes

(\mathrm{diag}(\exp(q_N)) - P_{NN})\, z_N = P_{NT} \exp(-q_T).   [8]

The unknown z_N is the vector of desirabilities at the nonterminal states. It can be computed via matrix factorization or by using an iterative linear solver.
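As a concrete illustration, here is a minimal numpy sketch (not from the paper) that builds a small MDP in this class, solves the linear system of Eq. 8 for z_N, and recovers the cost-to-go v = −log z and the optimal action of Eq. 6. The chain world and unit state costs are illustrative assumptions; the later sketches reuse the arrays defined here.

```python
import numpy as np

# Random-walk passive dynamics on a chain of n states; the last state is the goal.
n = 20
P = np.zeros((n, n))
for i in range(n):
    nbrs = [j for j in (i - 1, i, i + 1) if 0 <= j < n]
    P[i, nbrs] = 1.0 / len(nbrs)

q = np.ones(n)                 # state cost 1 everywhere ...
q[-1] = 0.0                    # ... except at the terminal/goal state
terminal = np.zeros(n, dtype=bool)
terminal[-1] = True
N, T = ~terminal, terminal     # index sets of nonterminal and terminal states

# Eq. 8: (diag(exp(q_N)) - P_NN) z_N = P_NT exp(-q_T)
A = np.diag(np.exp(q[N])) - P[np.ix_(N, N)]
b = P[np.ix_(N, T)] @ np.exp(-q[T])
z = np.empty(n)
z[N] = np.linalg.solve(A, b)
z[T] = np.exp(-q[T])

v = -np.log(z)                          # optimal cost-to-go
u_star = P * z[None, :]                 # Eq. 6: optimal action proportional to p(x'|x) z(x')
u_star /= u_star.sum(axis=1, keepdims=True)
```

On this chain v grows with distance from the goal, and a single linear solve replaces the repeated sweeps of value iteration.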

Let us now compare our result (Eq. 8) with policy iteration (3). We have to solve an equation of the form Az = b just once. In policy iteration one has to solve an equation of the form A(π)v = b(π) to evaluate the current policy π; then, the policy has to be improved and the process repeated. Therefore, solving an optimal control problem in our formulation is computationally equivalent to half a step of policy iteration.

Thus far, we have studied MDPs with discrete state spaces. There exists a family of continuous (in space and time) problems related to our MDPs. These problems have stochastic dynamics

dx = a(x)\,dt + B(x)(u\,dt + \sigma\,d\omega).   [9]

ω(t) is Brownian motion and σ is the noise amplitude. The cost rate is of the form

\ell(x,u) = q(x) + \frac{1}{2\sigma^2}\|u\|^2.   [10]

The functions q(x), a(x), and B(x) can be arbitrary. This problem formulation is fairly general and standard (but see Discussion). Consider, for example, a one-dimensional point with mass m, position x_p, and velocity x_v. Then, x = [x_p, x_v]ᵀ, a(x) = [x_v, 0]ᵀ, and B = [0, m⁻¹]ᵀ. The noise and control signals correspond to external forces applied to the point mass.

Unlike the discrete case where the agent could specify the transition probability distribution directly, here u is a vector that can shift the distribution given by the passive dynamics but cannot reshape it. Specifically, if we discretize the time axis in Eq. 9 with step h, the passive dynamics are Gaussian with mean x + h a(x) and covariance hσ²B(x)B(x)ᵀ, whereas the controlled dynamics are Gaussian with mean x + h a(x) + h B(x)u and the same covariance. Thus, the agent in the continuous setting has less freedom compared with the discrete setting. Yet the two settings share many similarities, as follows. First, the KL divergence between the above Gaussians can be shown to be h‖u‖²/(2σ²), which is just the quadratic energy cost accumulated over time interval h.




Table 1. Summary of results for all performance criteria

Criterion      Discrete                       Continuous
Finite         exp(q) z_t = G[z_{t+1}]        q z = L[z] + z_t
Total          exp(q) z = G[z]                q z = L[z]
Average        exp(q - c) z̃ = G[z̃]            (q - c) z̃ = L[z̃]
Discounted     exp(q) z = G[z^α]              q z = L[z] - z log(z^α)

Second, given the optimal cost-to-go v(x), the optimal control law can be computed analytically (4):

u^*(x) = -\sigma^2 B(x)^T v_x(x).   [11]

Here, subscripts denote partial derivatives. Third, the Hamilton–Jacobi–Bellman equation characterizing v(x) becomes linear (6, 7) (SI Appendix) when written in terms of the desirability function z(x):

q(x)\, z(x) = L[z](x).   [12]

The second-order linear differential operator L is defined as

L[z](x) = a(x)^T z_x(x) + \frac{\sigma^2}{2}\, \mathrm{trace}\big(B(x) B(x)^T z_{xx}(x)\big).   [13]

Additional similarities between the discrete and continuous settings will be described below. They reflect the fact that the continuous problem is a special case of the discrete problem. Indeed, we can show (SI Appendix) that, for certain MDPs in our class, Eq. 7 reduces to Eq. 12 in the limit h → 0.
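For intuition about the continuous case, here is a small finite-difference sketch (an illustrative construction, not the paper's code) that solves the linear HJB Eq. 12 on a one-dimensional grid, assuming zero drift a(x), constant B, a constant interior state cost, and goal states at both ends of the interval where z is pinned to the exponentiated negative terminal cost:

```python
import numpy as np

# Discretize q(x) z = a(x) z' + (sigma^2/2) B^2 z'' (Eq. 12) on a 1-D grid,
# with z fixed to exp(-terminal cost) at the two boundary (goal) states.
n, sigma, B = 101, 1.0, 1.0
x = np.linspace(-1.0, 1.0, n)
dx = x[1] - x[0]
a = np.zeros(n)                 # passive drift, assumed zero here
q = np.full(n, 0.5)             # constant state cost in the interior
g = np.array([0.0, 1.0])        # terminal costs at the left and right boundaries

A = np.zeros((n, n))
rhs = np.zeros(n)
for i in range(1, n - 1):
    A[i, i - 1] = -a[i] / (2 * dx) + (sigma**2 * B**2) / (2 * dx**2)
    A[i, i + 1] = a[i] / (2 * dx) + (sigma**2 * B**2) / (2 * dx**2)
    A[i, i] = -(sigma**2 * B**2) / dx**2 - q[i]
A[0, 0] = A[-1, -1] = 1.0
rhs[0], rhs[-1] = np.exp(-g[0]), np.exp(-g[1])

z = np.linalg.solve(A, rhs)     # desirability on the grid
v = -np.log(z)                  # optimal cost-to-go; the optimal control then follows from Eq. 11
```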

The linear equations 7 and 12 were derived by minimizing total cost in the presence of terminal/goal states. Similar results can be obtained (SI Appendix) for all other performance criteria used in practice, in both discrete and continuous settings. They are summarized in Table 1. The constant c is the (unknown) average cost, computed as the principal eigenvalue of the corresponding equation. z̃ is the exponent of the differential cost-to-go function (3) (SI Appendix). The constant α is the exponential discount factor. Note that all equations are linear except in the discounted case. The finite-horizon and total-cost results in continuous settings were previously known (6, 7); the remaining six equations were derived in our work.

Applications. The first application is an algorithm for finding shortest paths in graphs (recall Fig. 1B). Let s(x) denote the length of the shortest path from state x to the nearest terminal state. The passive dynamics p correspond to the random walk on the graph. The state costs are q = ρ > 0 for nonterminal states and q = 0 for terminal states. Let v_ρ(x) denote the optimal cost-to-go function for given ρ. If the action costs were 0, the shortest path lengths would be s(x) = (1/ρ) v_ρ(x) for all ρ. Here, the action costs are not 0, but nevertheless they are bounded. Because we are free to choose ρ arbitrarily large, we can make the state costs dominate the cost-to-go function, and so

s(x) = \lim_{\rho\to\infty} \frac{v_\rho(x)}{\rho}.   [14]

The construction above involves a limit and does not yield the shortest paths directly. However, we can obtain arbitrarily accurate (modulo numerical errors) approximations by setting ρ large enough. The method is illustrated in Fig. 2 on the graph of internet routers. There is a range of values of ρ for which all shortest path lengths are exactly recovered. Although a thorough comparison with dedicated shortest-path algorithms remains to be done, we suspect that they will not be as efficient as linear solvers. Problems with weighted edges can be handled with the general embedding method presented next.
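A minimal sketch of this construction (the six-node graph is an illustrative stand-in for the router data): build the random-walk passive dynamics, set q = ρ at nonterminal nodes, solve Eq. 8, and round v_ρ/ρ down to integer path lengths.

```python
import numpy as np

# Undirected example graph as an adjacency list; node 0 is the terminal/goal node.
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 4], 3: [1, 4], 4: [2, 3, 5], 5: [4]}
n, rho = len(adj), 50.0

P = np.zeros((n, n))                       # random-walk passive dynamics
for i, nbrs in adj.items():
    P[i, nbrs] = 1.0 / len(nbrs)

q = np.full(n, rho)
q[0] = 0.0
N = np.arange(1, n)                        # nonterminal nodes
A = np.diag(np.exp(q[N])) - P[np.ix_(N, N)]
b = P[N, 0] * np.exp(-q[0])                # Eq. 8 right-hand side (single goal node)
v = np.zeros(n)
v[N] = -np.log(np.linalg.solve(A, b))

s = np.floor(v / rho).astype(int)          # Eq. 14: shortest path lengths, rounded down
print(s)                                   # expected: [0 1 1 2 2 3]
```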


Fig. 2. … The graph has 190,914 nodes and 609,066 undirected edges. The shortest path length from each node to a specified destination node was computed exactly by using dynamic programming adapted to this problem, and also approximated by using our algorithm with ρ = 40. Our algorithm was ≈5 times faster, although both implementations were equally optimized. The exact shortest path lengths were integers between 1 and 15. (A) The approximate values were binned in 15 bins according to the corresponding correct value. The range of approximate values in each bin is shown with the blue symbols. The diagonal red line is the exact solution. (Inset) Histogram of all values in one bin, with the correct value subtracted. Note that all errors are between 0 and 1, thus rounding down to the nearest integer recovers the exact solution. This was the case for all bins. (B) To assess the effects of the free parameter ρ, we solved the above problem 500 times for each of 10 values of ρ between 25 and 70. In each instance of the problem, the set of destination nodes was generated randomly and had between 1 and 5 elements. The approximate shortest path lengths found by our algorithm were rounded down to the nearest integer and compared with the exact solution. The number of mismatches, expressed as a percent of the number of nodes and averaged over the 500 repetitions, is plotted in black. For large values of ρ the approximation becomes exact, as expected from Eq. 14. However, ρ cannot be set too large, because our algorithm multiplies by exp(−ρ), thus some elements of z may become numerically zero. The percentage of such numerical errors is plotted in green. There is a comfortable range of ρ where neither type of error is observed.

The second application is a method for continuous embedding of traditional MDPs with symbolic actions. It is reminiscent of linear programming relaxation in integer programming. Denote the symbolic actions in the traditional MDP with a, the transition probabilities with p̃(x'|x,a), and the immediate costs with ℓ(x,a). We seek an MDP in our class such that for each (x,a) the action p̃(·|x,a) has cost ℓ(x,a):

q(x) + \mathrm{KL}(\tilde p(\cdot|x,a) \,\|\, p(\cdot|x)) = \ell(x,a).   [15]

For each x this yields a system of linear equations in q and log p, which has a unique solution under a mild nondegeneracy condition (SI Appendix). If we keep the number of symbolic actions per state fixed while increasing the number of states (for example, by making a grid world larger and larger), the amount of computation needed to construct the embedding scales linearly with the number of states. Once the embedding is constructed, the optimal actions are computed and replaced with the nearest (in the sense of transition probabilities) symbolic actions. This yields an approximately optimal policy for the traditional MDP. We do not yet have theoretical error bounds but have found the approximation to be very accurate in practice. This method is illustrated in Fig. 3 on a machine repair problem adapted from ref. 3. The R² between the correct and approximate cost-to-go is 0.993.
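One way to satisfy Eq. 15 numerically for a single state (a sketch under the assumption that the state has exactly as many symbolic actions as possible next states, so the per-state system is square; this is an illustration and not necessarily the exact procedure of SI Appendix): with y = −log p, the action equations are linear in (q, y), and because every row of the action transition matrix sums to 1, y is determined up to the additive constant q, which is then fixed by requiring p to normalize.

```python
import numpy as np

def embed_state(p_tilde, ell):
    """Find q(x) and passive dynamics p(.|x) satisfying Eq. 15 for one state x.

    p_tilde : (A, N) array, p_tilde[a, x'] = transition probabilities of symbolic action a
    ell     : (A,) array of immediate costs of the symbolic actions
    Assumes A == N and linearly independent rows (the nondegeneracy condition).
    """
    # Per action a:  q + sum_x' p_tilde[a,x'] y[x'] = ell[a] - sum_x' p_tilde[a,x'] log p_tilde[a,x']
    with np.errstate(divide='ignore', invalid='ignore'):
        neg_ent = np.where(p_tilde > 0, p_tilde * np.log(p_tilde), 0.0).sum(axis=1)
    b = ell - neg_ent
    y0 = np.linalg.solve(p_tilde, b)      # solution of the linear part with q = 0
    q = -np.log(np.exp(-y0).sum())        # shift chosen so that p sums to 1
    p = np.exp(-(y0 - q))
    return q, p

# Tiny check: two symbolic actions over two next states.
p_tilde = np.array([[0.9, 0.1], [0.2, 0.8]])
ell = np.array([1.0, 0.3])
q_x, p_x = embed_state(p_tilde, ell)
for a in range(2):
    kl = np.sum(p_tilde[a] * np.log(p_tilde[a] / p_x))
    assert abs(q_x + kl - ell[a]) < 1e-9   # Eq. 15 holds for every symbolic action
```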

The third application is a Monte Carlo method for learning z from experience in the absence of a model of the passive dynamics p and state costs q. Unlike the previous two applications, where we started with traditional MDPs and approximated them with MDPs in our class, here we assume that the problem is already in our class. The linear Bellman Eq. 7 can be unfolded recursively by replacing z(x') with the expectation over the state following x',



Fig. 3. … larger x_t means that the machine is more broken and more costly to operate: we incur an operation cost of 0.02 x_t. There are 50 time steps (this is a finite-horizon problem). The state of the machine has a tendency to deteriorate over time: if we do nothing, x_{t+1} has a probability 0.9 of being one of x_t … x_t + 8 (chosen uniformly) and probability 0.1 of being one of x_t − 9 … x_t − 1. When x_t is near the edges we clip this distribution and renormalize it to sum to 1. The symbolic actions u_t are integers between 0 and 9 corresponding to how much we invest in repairing the machine in each time step. The repair cost is 0.1 u_t. The effect of the repairs is to circularly left-shift the above transition probability distribution by u_t positions. Thus, u_t = 0 corresponds to doing nothing; larger u_t causes larger expected improvement in the state of the machine. (A) The traditional MDP described above was embedded within our class, as outlined in the main text and described in detail in SI Appendix. Then, z_t(x) was computed and the approximation −log(z) to the optimal cost-to-go was plotted. Blue is small, red is large. (B) The traditional MDP was also solved by using dynamic programming and the optimal cost-to-go v_t(x) was plotted in the same format as in A. (C) Scatter plot of the optimal versus approximate costs-to-go at 5,000 space-time points (blue). The R² between the two is 0.993, that is, the optimal values account for 99.3% of the variance in the approximate values. The red diagonal line corresponds to the ideal solution. We also computed (via extensive sampling) the performance of the optimal policy found by dynamic programming, the approximately optimal policy derived from our embedding, and a random policy. The performance of our approximation was 0.9% worse than optimal, whereas the performance of the random policy was 64% worse than optimal.

and so on, and then pushing all state costs inside the expectation operators. This yields the path-integral representation

z(x) = E_{x_{t+1} \sim p(\cdot|x_t)}\left[\exp\left(-\sum_{t=0}^{t_f} q(x_t)\right)\right].   [16]

(x_0, x_1, …, x_{t_f}) are trajectories initialized at x_0 = x and sampled from the passive dynamics, and t_f is the time when a goal state is first reached. A similar result holds in continuous settings, where the Feynman–Kac theorem states that the unique positive solution to Eq. 12 has a path-integral representation. The use of Monte Carlo methods for solving continuous optimal control problems was pioneered in ref. 7. Our result (Eq. 16) makes it possible to apply such methods to discrete problems with non-Gaussian noise.
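A minimal Monte Carlo sketch of Eq. 16, reusing the P, q, terminal arrays from the first sketch (the sample count is an arbitrary illustrative choice, and the passive dynamics are assumed to reach a goal state with probability 1):

```python
import numpy as np

rng = np.random.default_rng(0)

def z_monte_carlo(P, q, terminal, x0, n_samples=5000):
    """Estimate the desirability z(x0) from the path-integral representation (Eq. 16)."""
    total = 0.0
    for _ in range(n_samples):
        x, cost = x0, 0.0
        while not terminal[x]:
            cost += q[x]                       # accumulate state costs along the trajectory
            x = rng.choice(len(q), p=P[x])     # sample the passive dynamics; no model of u* is needed
        cost += q[x]                           # state cost of the goal state reached at time t_f
        total += np.exp(-cost)
    return total / n_samples
```

As the number of samples grows, the estimate approaches the z obtained from the direct linear solve above.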

The fourth application is related to the path-integral approach above but aims to achieve faster convergence. It is motivated by Reinforcement Learning (2), where faster convergence is often observed by using Temporal Difference (TD) methods. A TD-like method for our MDPs can be obtained from Eq. 7. It constructs an approximation ẑ using triplets (x_t, q_t, x_{t+1}). Note that measuring the action costs is not necessary. ẑ is updated online as follows:

\hat z(x_t) \leftarrow (1-\eta_t)\,\hat z(x_t) + \eta_t \exp(-q_t)\,\hat z(x_{t+1}).   [17]

η_t is a learning rate which decreases over time. We call this algorithm Z learning. Despite the resemblance to TD learning, there is an important difference. Our method learns the optimal cost-to-go directly, whereas TD methods are limited to learning the cost-to-go of a specific policy, which then needs to be improved, the cost-to-go relearned, and so on. Indeed, Z learning is an off-policy method, meaning that it learns the cost-to-go for the optimal policy while sampling according to the passive dynamics. The only Reinforcement Learning method capable of doing this is Q learning (15). However, Q learning has the disadvantage of operating in the product space of states and actions, and is therefore less efficient. This is illustrated in Fig. 4 on a navigation problem.
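A minimal sketch of the Z-learning update of Eq. 17, again reusing P, q, terminal from the first sketch; the learning-rate schedule and the random resets after reaching the goal are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

def z_learning(P, q, terminal, n_steps=200_000, c=5_000.0):
    """Online approximation of z from observed triplets (x_t, q_t, x_{t+1}) via Eq. 17,
    sampling state transitions from the passive dynamics."""
    n = len(q)
    z_hat = np.where(terminal, np.exp(-q), 0.0)   # terminal values are known; the rest start at 0
    nonterm = np.flatnonzero(~terminal)
    x = rng.choice(nonterm)
    for t in range(n_steps):
        eta = c / (c + t)                         # decaying learning rate
        x_next = rng.choice(n, p=P[x])
        z_hat[x] = (1 - eta) * z_hat[x] + eta * np.exp(-q[x]) * z_hat[x_next]
        x = rng.choice(nonterm) if terminal[x_next] else x_next   # reset after reaching the goal
    return z_hat
```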


Fig. 4. Z learning and Q learning. Z learning and Q learning are compared in the context of a grid-world MDP. The goal state is marked "X." State transitions are allowed to all neighbors (including diagonal neighbors) and to the current state. Thus, there are at most 9 possible next states, fewer when the current state is adjacent to the obstacles shown in black or to the edges of the grid. p corresponds to a random walk. All state costs are q = 1 except for the goal state where q = 0. This MDP is within our class so Z learning can be applied directly. To apply Q learning we first need to construct a corresponding traditional MDP with symbolic actions. This is done as follows. For each state x we define a symbolic action with transition probability distribution matching the optimal u*(·|x). We also define N(x) − 1 other symbolic actions, where N(x) is the number of possible next states following x. Their transition probability distributions are obtained from u*(·|x) by circular shifting; thus, they have the same shape as u* but peak at different next states. All these symbolic actions incur cost q(x) + KL(u*(·|x)||p(·|x)), matching the cost in our MDP. The resulting traditional MDP is guaranteed to have the same optimal cost-to-go as our MDP. (A) The grayscale image shows the optimal cost-to-go v = −log z, where z is computed with our model-based method. Darker colors correspond to smaller values. Both Z learning and Q learning aim to approximate this function. (B) Error is defined as the average absolute difference between each approximation and the optimal costs-to-go, normalized by the average optimal cost-to-go. All approximations to z are initialized at 0. The learning rate decays as η_t = c/(c + t), where c = 5,000 for Z learning and c = 40,000 for Q learning. Each simulation is repeated 10 times. The state is reset randomly whenever the goal state is reached. Each learning algorithm is tested by using a random policy corresponding to p, as well as a greedy policy. Q learning requires exploration; thus, we use an ε-greedy policy with ε = 0.1. The values of c and ε are optimized manually for each algorithm. Z learning implicitly contains exploration so we can directly use the greedy policy, i.e., the policy û(x'|x) which appears optimal given the current approximation ẑ. Greedy Z learning requires importance sampling: the last term in Eq. 17 must be weighted by p(x_{t+1}|x_t)/û(x_{t+1}|x_t). Such weighting requires access to p. (C) Empirical performance of the policies resulting from each approximation method at each iteration. Z learning outperforms Q learning, and greedy methods outperform random sampling.

The fifth application accelerates MDP approximations to continuous problems (16). Such MDPs are obtained via discretization and tend to be very large, calling for efficient numerical methods. Because continuous problems of the form Eqs. 9 and 10 are limits of MDPs in our class, they can be approximated with MDPs in our class, which, in turn, are reduced to linear equations and solved efficiently. The same continuous problems can also be approximated with traditional MDPs and solved via dynamic programming. Both approximations converge to the same solution in the limit of infinitely fine discretization, and turn out to be equally accurate away from the limit, but our approximation is faster to compute. This is shown in Fig. 5 on a car-on-a-hill problem, where our method is ≈10 times faster than policy iteration and 100 times faster than value iteration.

The remaining applications have been developed recently. Below we summarize the key results and refer the reader to online preprints (12–14). Combining Eqs. 6 and 7, the optimal control law can be written as u*(x'|x) = exp(−q(x)) p(x'|x) z(x')/z(x).




Fig. 5. Comparison of our MDP approximation and a traditional MDP approximation on a continuous car-on-a-hill problem. (A) The car moves along a curved road in the presence of gravity. The control signal is tangential acceleration. The goal is to reach the parking lot with small velocity. (B) Z iteration (blue), policy iteration (red), and value iteration (black) converge to control laws with identical performance; however, Z iteration is 10 times faster than policy iteration and 100 times faster than value iteration. Note the log scale on the horizontal axis. (C) The optimal cost-to-go for our approximation. Blue is small, red is large. The two black curves are stochastic trajectories resulting from the optimal control law. The thick magenta curve is the most likely trajectory of the optimally controlled stochastic system. It is computed by solving the deterministic optimal control problem described in the main text. (D) The optimal cost-to-go is inferred from observed state transitions by using our algorithm for inverse optimal control. Brown pixels correspond to states where we did not have data (i.e., no state transitions landed there); thus, the cost-to-go could not be inferred. The details are given in SI Appendix.

Given a fixed initial state x_0 and final time T, the probability that the optimal control law u* generates trajectory x_1, x_2, …, x_T is

\prod_{t=0}^{T-1} u^*(x_{t+1}|x_t) = \frac{z(x_T)}{z(x_0)} \prod_{t=0}^{T-1} \exp(-q(x_t))\, p(x_{t+1}|x_t).   [18]

Omitting z(x_0), which is fixed, and noting that z(x_T) = exp(−q(x_T)), the negative log of Eq. 18 can be interpreted as the cumulative cost for a deterministic optimal control problem with immediate cost q(x) − log(p(x'|x)). A related result is obtained in continuous settings when B is constant: the corresponding deterministic problem has dynamics ẋ = a(x) + Bu and cost ℓ(x,u) + ½ div(a(x)). These results are important because optimal trajectories for deterministic problems can be found with efficient numerical methods (17) that avoid the curse of dimensionality. Our framework makes it possible to apply deterministic methods to stochastic control for the first time. The most likely trajectory in the car-on-a-hill problem is shown in Fig. 5C in magenta. See ref. 12 for details.
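In the discrete total-cost setting this deterministic problem is a shortest-path problem and can be solved, for example, with Dijkstra's algorithm. A sketch reusing the chain arrays from the earlier examples (the first-exit formulation is an assumption that matches the terminal-state setting):

```python
import heapq
import numpy as np

def most_likely_trajectory(P, q, terminal, x0):
    """Most likely trajectory under the optimal control law: minimize the negative log of
    Eq. 18, i.e. sum_t [q(x_t) - log p(x_{t+1}|x_t)] + q(x_T), as a deterministic
    shortest-path problem."""
    n = len(q)
    dist = np.full(n, np.inf)
    prev = np.full(n, -1)
    dist[x0] = 0.0
    heap = [(0.0, x0)]
    while heap:
        d, x = heapq.heappop(heap)
        if d > dist[x] or terminal[x]:
            continue                               # goal states have no outgoing steps
        for x2 in np.flatnonzero(P[x] > 0):
            step = q[x] - np.log(P[x, x2])         # per-step cost of the deterministic problem
            if terminal[x2]:
                step += q[x2]                      # terminal cost q(x_T)
            if d + step < dist[x2]:
                dist[x2] = d + step
                prev[x2] = x
                heapq.heappush(heap, (dist[x2], x2))
    goal = min(np.flatnonzero(terminal), key=lambda g: dist[g])
    path = [int(goal)]
    while path[-1] != x0:                          # walk back along the predecessor links
        path.append(int(prev[path[-1]]))
    return path[::-1]
```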

The seventh application exploits the duality between Bayesian estimation and stochastic optimal control. This duality is well known for linear systems (4) and was recently generalized (8, 10) to nonlinear continuous systems of the form Eqs. 9 and 10. Duality also holds for our MDPs. Indeed, Eq. 18 can be interpreted as the posterior probability of a trajectory in a Bayesian smoothing problem, where p(x_{t+1}|x_t) are the prior probabilities of the state transitions and exp(−q(x_t)) are the likelihoods of some unspecified measurements. The desirability function in the control problem can be shown to be proportional to the backward filtering density in the dual estimation problem. Thus, stochastic optimal control problems can be solved by using Bayesian inference. See refs. 10 and 12 for details.


The eighth application concerns inverse optimal control, where the goal is to infer q(x) given a dataset {x_k, x'_k}_{k=1…K} of state transitions generated by the optimal controller. Existing methods rely on guessing the cost function, solving the forward problem, comparing the solution with data, improving the guess, and iterating (18, 19). This indirect procedure can be inefficient when the forward problem is hard to solve. For our MDPs the inverse problem can be solved directly, by minimizing the convex function

L(v) = d^T v + c^T \log(P \exp(-v)),   [19]

where v is the vector of (unknown) optimal costs-to-go, P is the passive dynamics matrix defined earlier, and d and c are the histograms of x'_k and x_k (i.e., d_i is the number of times that x'_k = i). It can be shown that L(v) is the negative log-likelihood of the dataset. We have found empirically that its Hessian tends to be diagonally dominant, which motivates an efficient quasi-Newton method using a diagonal approximation to the Hessian. Once the minimum is found, we can compute z and then find q from the linear Bellman equation. Fig. 5D illustrates this inverse optimal control method on the car-on-a-hill problem. See ref. 14 for details.
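A minimal sketch of this inverse procedure on synthetic data, reusing P, q, and the optimal policy u_star from the first sketch (using scipy's generic L-BFGS instead of the diagonal quasi-Newton method mentioned in the text is an illustrative substitution):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)

def neg_log_likelihood(v, P, d, c):
    """Eq. 19: L(v) = d^T v + c^T log(P exp(-v))."""
    return d @ v + c @ np.log(P @ np.exp(-v))

# Observed transitions (x_k, x'_k) generated by the optimal controller u*.
K = 20_000
xs = rng.integers(0, len(q), size=K)
xps = np.array([rng.choice(len(q), p=u_star[x]) for x in xs])
c_hist = np.bincount(xs, minlength=len(q)).astype(float)   # histogram of x_k
d_hist = np.bincount(xps, minlength=len(q)).astype(float)  # histogram of x'_k

res = minimize(neg_log_likelihood, np.zeros(len(q)), args=(P, d_hist, c_hist), method='L-BFGS-B')
z_inf = np.exp(-res.x)
q_inferred = -np.log(z_inf / (P @ z_inf))   # recover q from the linear Bellman equation (Eq. 7)
```

Because L(v) is invariant to adding a constant to v, the recovered cost-to-go is determined only up to a constant; the recovered q is unaffected by that shift.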

The ninth application is a method for constructing optimal control laws as combinations of certain primitives. This can be done in finite-horizon as well as total-cost problems with terminal/goal states, where the final cost plays the role of a boundary condition. Suppose we have a collection of component final costs g_i(x) for which we have somehow computed the desirability functions z_i(x). Linearity implies that, if the composite final cost g(x) satisfies

\exp(-g(x)) = \sum_i w_i \exp(-g_i(x))   [20]

for some set of weights w_i, then the composite desirability function is z(x) = Σ_i w_i z_i(x). This approach is particularly useful when the component problems can be solved analytically, as in the linear-quadratic-Gaussian (LQG) framework (4). In that case the component desirabilities z_i(x) are Gaussians. By mixing them linearly, we can obtain analytical solutions to finite-horizon problems of the form Eqs. 9 and 10 where a(x) is linear, B is constant, and q(x) is quadratic; however, the final cost g(x) is not constrained to be quadratic. Instead g(x) can equal the negative log of any Gaussian mixture. This is a nontrivial extension to the widely used LQG framework. See ref. 13 for details.
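A quick numerical check of this composition in the discrete total-cost setting, reusing P, q, N, T from the first sketch (the two component terminal costs and the weights are arbitrary illustrative choices):

```python
import numpy as np

def solve_zN(P, q_nonterm, g_terminal, N, T):
    """Desirability at nonterminal states for a given terminal cost (Eq. 8)."""
    A = np.diag(np.exp(q_nonterm)) - P[np.ix_(N, N)]
    return np.linalg.solve(A, P[np.ix_(N, T)] @ np.exp(-g_terminal))

# Two component terminal costs on the chain's single goal state, combined via Eq. 20.
g1, g2, w1, w2 = np.array([0.0]), np.array([2.0]), 0.3, 0.7
g_comp = -np.log(w1 * np.exp(-g1) + w2 * np.exp(-g2))

z1 = solve_zN(P, q[N], g1, N, T)
z2 = solve_zN(P, q[N], g2, N, T)
z_comp = solve_zN(P, q[N], g_comp, N, T)
assert np.allclose(z_comp, w1 * z1 + w2 * z2)   # composite desirability is the weighted sum
```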

Discussion

We formulated the problem of stochastic optimal control in a way which is rather general and yet affords substantial simplifications. Exhaustive search over actions was replaced with an analytical solution, and the computation of the optimal cost-to-go function was reduced to a linear problem. This gave rise to efficient new algorithms speeding up the construction of optimal control laws. Furthermore, our framework enabled a number of computations that were not possible previously: solving problems with nonquadratic final costs by mixing LQG primitives, finding the most likely trajectories of optimally controlled stochastic systems via deterministic methods, solving inverse optimal control problems via convex optimization, quantifying the benefits of error tolerance, and applying off-policy learning in the state space as opposed to the state-action space.

These advances were made possible by imposing a certain structure on the problem formulation. First, the control cost must be a KL divergence, which reduces to the familiar quadratic energy cost in continuous settings. This is a sensible way to measure control energy and is not particularly restrictive. Second, the controls and the noise must be able to cause the same state transitions; the analog in continuous settings is that both the controls and the noise act in the subspace spanned by the columns of B(x).



This is a more significant limitation. It prevents us from modeling systems subject to disturbances outside the actuation space. We are now pursuing an extension that aims to relax this limitation while preserving many of the appealing properties of our framework. Third, the noise amplitude and the control costs are coupled, and, in particular, the control costs are large when the noise amplitude is small. This can be compensated to some extent by increasing the state costs while ensuring that exp(−q(x)) does not become numerically zero. Fourth, with regard to Z learning, following a policy other than the passive dynamics p requires importance-sampling correction based on a model of p. Such a model could presumably be learned online; however, this extension remains to be developed.

This framework has many potential applications and we hope that the list of examples will grow as other researchers begin to use it. Our current focus is on high-dimensional continuous problems such as those arising in biomechanics and robotics, where the discrete-time continuous-state MDP approximation is particularly promising. It leads to linear integral equations rather than differential equations, resulting in robust numerical methods. Furthermore, it is well suited to handle discontinuities due to rigid-body collisions. Initial results using mixture-of-Gaussian approximations to the desirability function are encouraging (11), yet a lot more work remains.

ACKNOWLEDGMENTS. We thank Surya Ganguli, Javier Movellan, and Yuval Tassa for discussions and comments on the manuscript. This work was supported by the National Science Foundation.

1. Todorov E (2004) Optimality principles in sensorimotor control. Nat Neurosci 7(9):907–915.

2. Sutton R, Barto A (1998) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).

3. Bertsekas D (2001) Dynamic Programming and Optimal Control (Athena Scientific, Belmont, MA), 2nd Ed.

4. Stengel R (1994) Optimal Control and Estimation (Dover, New York).

5. Korn R (1997) Optimal Portfolios: Stochastic Models for Optimal Investment and Risk Management in Continuous Time (World Scientific, Teaneck, NJ).

6. Fleming W, Mitter S (1982) Optimal control and nonlinear filtering for nondegenerate diffusion processes. Stochastics 8:226–261.

7. Kappen HJ (2005) Linear theory for control of nonlinear stochastic systems. Phys Rev Lett 95:200201.

8. Mitter S, Newton N (2003) A variational approach to nonlinear estimation. SIAM J Control Opt 42:1813–1833.

9. Todorov E (2006) Linearly-solvable Markov decision problems. Adv Neural Information Proc Syst 19:1369–1376.

10. Todorov E (2008) General duality between optimal control and estimation. IEEE Conf Decision Control 47:4286–4292.

11. Todorov E (2009) Eigen-function approximation methods for linearly-solvable optimal control problems. IEEE Int Symp Adapt Dynam Programm Reinforce Learning 2:161–168.

12. Todorov E (2009) Classic maximum principles and estimation-control dualities for nonlinear stochastic systems, preprint.

13. Todorov E (2009) Compositionality of optimal control laws, preprint.

14. Todorov E (2009) Efficient algorithms for inverse optimal control, preprint.

15. Watkins C, Dayan P (1992) Q-learning. Mach Learn 8:279–292.

16. Kushner H, Dupuis P (2001) Numerical Methods for Stochastic Optimal Control Problems in Continuous Time (Springer, New York).

17. Bryson A, Ho Y (1969) Applied Optimal Control (Blaisdell Publishing Company, Waltham, MA).

18. Ng A, Russell S (2000) Algorithms for inverse reinforcement learning. Int Conf Mach Learn 17:663–670.

19. Liu K, Hertzmann A, Popovic Z (2005) Learning physics-based motion style with nonlinear inverse optimization. ACM Trans Graphics 24:1071–1081.



