Efficient computation of optimal actions
Emanuel Todorov¹
Departments of Applied Mathematics and Computer Science & Engineering, University of Washington, Box 352420, Seattle, WA 98195. Edited by James L. McClelland, Stanford University, Stanford, CA, and approved April 28, 2009 (received for review November 16, 2007).
Optimal choice of actions is a fundamental problem relevant to fields as diverse as neuroscience, psychology, economics, computer science, and control engineering. Despite this broad relevance the abstract setting is similar: we have an agent choosing actions over time, an uncertain dynamical system whose state is affected by those actions, and a performance criterion that the agent seeks to optimize. Solving problems of this kind remains hard, in part, because of overly generic formulations. Here, we propose a more structured formulation that greatly simplifies the construction of optimal control laws in both discrete and continuous domains. An exhaustive search over actions is avoided and the problem becomes linear. This yields algorithms that outperform Dynamic Programming and Reinforcement Learning, and thereby solve traditional problems more efficiently. Our framework also enables computations that were not possible before: composing optimal control laws by mixing primitives, applying deterministic methods to stochastic systems, quantifying the benefits of error tolerance, and inferring goals from behavioral data via convex optimization. Development of a general class of easily solvable problems tends to accelerate progress, as linear systems theory has done, for example. Our framework may have similar impact in fields where optimal choice of actions is relevant.
action selection | cost function | linear Bellman equation | stochastic optimal control
If you are going to act, you might as well act in the best way possible. But which way is best? This is the general problem we consider here. Examples include a nervous system generating muscle activations to maximize movement performance (1), a foraging animal deciding which way to turn to maximize food (2), an internet router directing packets to minimize delays (3), an onboard computer controlling a jet engine to minimize fuel consumption (4), and an investor choosing transactions to maximize wealth (5). Such problems are often formalized as Markov decision processes (MDPs), with stochastic dynamics p(x′|x, u) specifying the transition probability from state x to state x′ under action u, and immediate cost ℓ(x, u) for being in state x and choosing action u. The performance criterion that the agent seeks to optimize is some cumulative cost that can be formulated in multiple ways. Throughout the article we focus on one formulation (total cost with terminal/goal states) and summarize results for other formulations.
Optimal actions cannot be found by greedy optimization of the immediate cost, but instead must take into account all future costs. This is a daunting task because the number of possible futures grows exponentially with time. What makes the task doable is the optimal cost-to-go function v(x), defined as the expected cumulative cost for starting at state x and acting optimally thereafter. It compresses all relevant information about the future and thus enables greedy computation of optimal actions. v(x) equals the minimum (over actions u) of the immediate cost ℓ(x, u) plus the expected cost-to-go E[v(x′)] at the next state x′:
v(x) = min_u { ℓ(x, u) + E_{x′~p(·|x,u)}[v(x′)] }.   [1]
The subscript indicates that the expectation is taken with respect to the transition probability distribution p(·|x, u) induced by action u. Eq. 1 is fundamental to optimal control theory and is called the Bellman equation. It gives rise to Dynamic Programming (3) and
Reinforcement Learning (2) methods that are very general but can be inefficient. Indeed, Eq. 1 characterizes v(x) only implicitly, as the solution to an unsolved optimization problem, impeding both analytical and numerical approaches.
Here, we show how the Bellman equation can be greatly simplified. We find an analytical solution for the optimal u given v, and then transform Eq. 1 into a linear equation. Short of solving the entire problem analytically, reducing optimal control to a linear equation is the best one can hope for. This simplification comes at a modest price: although we impose certain structure on the problem formulation, most control problems of practical interest can still be handled. In discrete domains our work has no precursors. In continuous domains there exists related prior work (6–8) that we build on here. Additional results can be found in our recent conference articles (9–11), online preprints (12–14), and supplementary notes [supporting information (SI) Appendix].

Results
Reducing Optimal Control to a Linear Problem. We aim to construct a general class of MDPs where the exhaustive search over actions is replaced with an analytical solution. Discrete optimization problems rarely have analytical solutions, thus our agenda calls for continuous actions. This may seem counterintuitive if one thinks of actions as symbols ("go left," "go right"). However, what gives meaning to such symbols are the underlying transition probabilities, which are continuous. The latter observation is key to the framework developed here. Instead of asking the agent to specify symbolic actions, which are then replaced with transition probabilities, we allow the agent to specify transition probabilities directly. Formally, we have p(x′|x, u) = u(x′|x). Thus, our agent has the power to reshape the dynamics in any way it wishes. However, it pays a price for too much reshaping, as follows. Let p(x′|x) denote the passive dynamics characterizing the behavior of the system in the absence of controls. The latter will usually be defined as a random walk in discrete domains and as a diffusion process in continuous domains. Note that the notion of passive dynamics is common in continuous domains but is rarely used in discrete domains. We can now quantify how "large" an action is by measuring the difference between u(·|x) and p(·|x). Differences between probability distributions are usually measured via Kullback–Leibler (KL) divergence, suggesting an immediate cost of the form
ℓ(x, u) = q(x) + KL(u(·|x) ‖ p(·|x)) = q(x) + E_{x′~u(·|x)}[ log ( u(x′|x) / p(x′|x) ) ].   [2]

The state cost q(x) can be an arbitrary function encoding how (un)desirable different states are.
Author contributions: E.T. designed research, performed research, analyzed data, and wrote the paper.
The author declares no conflict of interest. This article is a PNAS Direct Submission.
Freely available online through the PNAS open access option. See Commentary on page 11429.
¹E-mail: todorov@cs.washington.edu.
/cgi/content/full/0710743106/DCSupplemental.
Fig. 1. New problem formulation. (A) A coin-toss example where the action corresponds to biasing the coin. The optimal bias is obtained by minimizing the sum (black) of the KL divergence cost (red) and the expected state cost (blue). Note that the tosses are independent and thus the temporal dynamics are irrelevant in this example. (B) A stochastic shortest-path problem illustrating how our framework can capture the benefits of error tolerance. See main text.
The passive dynamics p(x′|x) and controlled dynamics u(x′|x) can also be arbitrary, except that we require u(x′|x) = 0 whenever p(x′|x) = 0. This constraint is needed to make the KL divergence well-defined. It has the added benefit of preventing the agent from jumping directly to goal states, and more generally from making state transitions that are physically impossible.
Fig. 1 illustrates the construction above with two simple examples. Fig. 1A is a coin-toss problem where q(Tails) = 0, q(Heads) = 1 and the passive dynamics correspond to an unbiased coin. The action u has the effect of biasing the coin. The optimal bias, which turns out to be u*(Heads) = 0.27, achieves a trade-off between keeping the action cost and the expected state cost small. Note that the controller could have made the coin deterministic by setting u(Heads) = 0, but this is suboptimal because the associated action cost is too large. In general, the optimal actions resulting from our framework are stochastic. Fig. 1B is a shortest-path problem where q = 0 for the goal state (gray) and q = 1 for all other states. The passive dynamics correspond to the random walk on the graph. At the green state it does not matter which path is taken, so the optimal action equals the passive dynamics, the action cost is 0, and the cost-to-go (shown inside the circle) equals the length of the deterministic shortest path. At the red state, however, the optimal action deviates from the passive dynamics to cause a transition up, incurring an action cost of 0.6 and making the red state worse than the green state. In general, v(x) is smaller when the task can be accomplished in multiple ways starting from x. This reflects a preference for error tolerance that is inherent in our framework.
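To make the coin-toss example concrete, the following NumPy sketch (our own illustration; the brute-force search over the bias is not part of the article) recovers the optimal bias both by direct minimization of the action-plus-state cost and from Eq. 6, treating the two outcomes as terminal states with z = exp(−q):

```python
import numpy as np

p = np.array([0.5, 0.5])          # passive dynamics: unbiased coin [Heads, Tails]
q = np.array([1.0, 0.0])          # state costs: q(Heads) = 1, q(Tails) = 0

def total_cost(b):
    """KL(u || p) plus expected state cost when the coin lands Heads with probability b."""
    u = np.array([b, 1.0 - b])
    return np.sum(u * np.log(u / p)) + np.dot(u, q)

# brute-force minimization over the bias
bias = np.linspace(1e-6, 1 - 1e-6, 100001)
b_star = bias[np.argmin([total_cost(b) for b in bias])]

# closed form from Eq. 6: u* proportional to p(x') z(x'), with z(x') = exp(-q(x')) here
u_closed = p * np.exp(-q)
u_closed /= u_closed.sum()

print(b_star, u_closed[0])        # both approximately 0.27
```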
We now return to the theoretical development. The results take on a simpler form when expressed in terms of the desirability function

z(x) = exp(−v(x)).   [3]
This terminology reflects the fact that z is large at states where the cost-to-go v is small. Substituting Eq. 2 in Eq. 1, the Bellman equation can be written in terms of z as

−log(z(x)) = q(x) + min_u E_{x′~u(·|x)}[ log ( u(x′|x) / (p(x′|x) z(x′)) ) ].   [4]
The expression being minimized resembles the KL divergence between u and p z, except that p z is not normalized to sum to 1. Thus, to obtain a proper KL divergence (SI Appendix), we have to multiply and divide by the normalization term

G[z](x) = Σ_{x′} p(x′|x) z(x′) = E_{x′~p(·|x)}[z(x′)].   [5]

Recall that KL divergence achieves its global minimum of zero when the two distributions are equal. Therefore, the optimal action u* is proportional to p z:
u*(x′|x) = p(x′|x) z(x′) / G[z](x).   [6]
This represents the first general class of MDPs where the optimal actions can be found analytically given the optimal costs-to-go. Previously, such results were available only in continuous domains. The Bellman equation can now be simplified (SI Appendix) by substituting the optimal action, taking into account the normalization term and exponentiating. The result is

z(x) = exp(−q(x)) G[z](x).   [7]
The expectation G[z] is a linear operator; thus, Eq. 7 is linear in z. It can be written more compactly in vector notation. Enumerate the states from 1 to n, represent z(x) and q(x) with the n-dimensional column vectors z and q, and p(x′|x) with the n-by-n matrix P, where the row index corresponds to x and the column index to x′. Then Eq. 7 becomes z = M z, where M = diag(exp(−q)) P, exp is applied element-wise, and diag transforms vectors into diagonal matrices. The latter equation looks like an eigenvector problem, and indeed it can be solved (9) by using the power iteration method z ← M z (which we call Z iteration). However, the problem here is actually simpler because the eigenvalue is 1 and v(x) = q(x) at terminal states. If we define the index sets T and N of terminal and nonterminal states and partition z, q, and P accordingly, Eq. 7 becomes
(diag(exp(q_N)) − P_NN) z_N = P_NT exp(−q_T).   [8]
The unknown z_N is the vector of desirabilities at the nonterminal states. It can be computed via matrix factorization or by using an iterative linear solver.
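For a small MDP given explicitly by its passive transition matrix, the computation can be sketched as follows (our own NumPy illustration; the function names and the terminal-state clamping in the power iteration are our choices):

```python
import numpy as np

def solve_desirability(P, q, terminal):
    """Direct solve of Eq. 8 for the desirability z; assumes at least one terminal state.

    P        : (n, n) passive transition matrix, rows sum to 1
    q        : (n,) state costs
    terminal : (n,) boolean mask of terminal/goal states
    """
    N, T = ~terminal, terminal
    z = np.empty(len(q))
    z[T] = np.exp(-q[T])                          # boundary condition v = q at terminal states
    A = np.diag(np.exp(q[N])) - P[np.ix_(N, N)]   # diag(exp(q_N)) - P_NN
    b = P[np.ix_(N, T)] @ np.exp(-q[T])           # P_NT exp(-q_T)
    z[N] = np.linalg.solve(A, b)
    return z

def z_iteration(P, q, terminal, iters=1000):
    """Z iteration: power method z <- diag(exp(-q)) P z, with z held fixed at terminal states."""
    z = np.ones(len(q))
    z[terminal] = np.exp(-q[terminal])
    M = np.diag(np.exp(-q)) @ P
    for _ in range(iters):
        z = M @ z
        z[terminal] = np.exp(-q[terminal])
    return z
```

The direct solve implements Eq. 8 exactly; the power iteration corresponds to the Z iteration mentioned above and is preferable when P is large and sparse. In either case v = −log z and Eq. 6 then gives the optimal actions.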
Let us now compare our result (Eq. 8) with policy iteration (3). We have to solve an equation of the form A z = b just once. In policy iteration one has to solve an equation of the form A(π) v = b(π) to evaluate the current policy π; then, the policy has to be improved and the process repeated. Therefore, solving an optimal control problem in our formulation is computationally equivalent to half a step of policy iteration.
Thus far, we have studied MDPs with discrete state spaces. There exists a family of continuous (in space and time) problems related to our MDPs. These problems have stochastic dynamics
dx = a(x) dt + B(x)(u dt + σ dω).   [9]

ω(t) is Brownian motion and σ is the noise amplitude. The cost rate is of the form

ℓ(x, u) = q(x) + ‖u‖² / (2σ²).   [10]
The functions q(x), a(x), and B(x) can be arbitrary. This problem formulation is fairly general and standard (but see Discussion). Consider, for example, a one-dimensional point with mass m, position x_p, and velocity x_v. Then, x = [x_p, x_v]ᵀ, a(x) = [x_v, 0]ᵀ, and B = [0, m⁻¹]ᵀ. The noise and control signals correspond to external forces applied to the point mass.
Unlike the discrete case where the agent could specify the transition probability distribution directly, here, u is a vector that can shift the distribution given by the passive dynamics but cannot reshape it. Specifically, if we discretize the time axis in Eq. 9 with step h, the passive dynamics are Gaussian with mean x + h a(x) and covariance h σ² B(x)B(x)ᵀ, whereas the controlled dynamics are Gaussian with mean x + h a(x) + h B(x) u and the same covariance. Thus, the agent in the continuous setting has less freedom compared with the discrete setting. Yet the two settings share many similarities, as follows. First, the KL divergence between the above Gaussians can be shown to be h ‖u‖² / (2σ²), which is just the quadratic energy cost accumulated over time interval h. Second, given the
Table 1. Summary of results for all performance criteria

Criterion      Discrete                      Continuous
Finite         exp(q) z_t = G[z_{t+1}]       q z = L[z] + ∂z/∂t
Total          exp(q) z = G[z]               q z = L[z]
Average        exp(q − c) z̃ = G[z̃]           (q − c) z̃ = L[z̃]
Discounted     exp(q) z = G[z^α]             q z = L[z] − z log(z^α)
optimal cost-to-go v(x), the optimal control law can be computed analytically (4):

u*(x) = −σ² B(x)ᵀ v_x(x).   [11]
Here, subscripts denote partial derivatives. Third, the Hamilton–Jacobi–Bellman equation characterizing v(x) becomes linear (6, 7) (SI Appendix) when written in terms of the desirability function z(x):

q(x) z(x) = L[z](x).   [12]

The second-order linear differential operator L is defined as

L[z](x) = a(x)ᵀ z_x(x) + (σ²/2) trace(B(x) B(x)ᵀ z_xx(x)).   [13]
Additional similarities between the discrete and continuous settings will be described below. They reflect the fact that the continuous problem is a special case of the discrete problem. Indeed, we can show (SI Appendix) that, for certain MDPs in our class, Eq. 7 reduces to Eq. 12 in the limit h → 0.
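As an illustration of Eq. 12, the following sketch solves a toy one-dimensional problem with B = 1 by finite differences; the drift, state cost, grid, and boundary treatment are our own choices rather than anything from the article:

```python
import numpy as np

# Toy instance of the linear HJB (Eq. 12): q(x) z = a(x) z_x + (sigma^2/2) z_xx,
# with goal states (zero final cost) at the two ends of the interval.
n, sigma = 201, 1.0
x = np.linspace(-2.0, 2.0, n)
h = x[1] - x[0]
a = -x                          # passive drift pulling toward the origin
q = 0.5 * np.ones(n)            # constant running state cost in the interior
q[0] = q[-1] = 0.0              # zero final cost at the goal states

A = np.zeros((n, n))
b = np.zeros(n)
A[0, 0] = A[-1, -1] = 1.0       # boundary condition z = exp(-q) at the goals
b[0] = np.exp(-q[0])
b[-1] = np.exp(-q[-1])
for i in range(1, n - 1):       # central differences for z_x and z_xx
    A[i, i - 1] = a[i] / (2 * h) - sigma**2 / (2 * h**2)
    A[i, i]     = q[i] + sigma**2 / h**2
    A[i, i + 1] = -a[i] / (2 * h) - sigma**2 / (2 * h**2)

z = np.linalg.solve(A, b)
v = -np.log(z)                                 # optimal cost-to-go
u_star = sigma**2 * np.gradient(z, x) / z      # Eq. 11 with B = 1: u* = -sigma^2 v_x
```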
The linear equations 7 and 12 were derived by minimizing total cost in the presence of terminal/goal states. Similar results can be obtained (SI Appendix) for all other performance criteria used in practice, in both discrete and continuous settings. They are summarized in Table 1. The constant c is the (unknown) average cost computed as the principal eigenvalue of the corresponding equation. z̃ is the exponent of the differential cost-to-go function (3) (SI Appendix). The constant α is the exponential discount factor. Note that all equations are linear except in the discounted case. The finite-horizon and total-cost results in continuous settings were previously known (6, 7); the remaining six equations were derived in our work.
Applications. The first application is an algorithm for finding shortest paths in graphs (recall Fig. 1B). Let s(x) denote the length of the shortest path from state x to the nearest terminal state. The passive dynamics p correspond to the random walk on the graph. The state costs are q = ρ > 0 for nonterminal states and q = 0 for terminal states. Let v_ρ(x) denote the optimal cost-to-go function for given ρ. If the action costs were 0, the shortest path lengths would be s(x) = (1/ρ) v_ρ(x) for all ρ. Here, the action costs are not 0, but nevertheless they are bounded. Because we are free to choose ρ arbitrarily large, we can make the state costs dominate the cost-to-go function, and so
s(x) = lim_{ρ→∞} v_ρ(x) / ρ.   [14]
The construction above involves a limit and does not yield the shortest paths directly. However, we can obtain arbitrarily accurate (modulo numerical errors) approximations by setting ρ large enough. The method is illustrated in Fig. 2 on the graph of internet routers. There is a range of values of ρ for which all shortest path lengths are exactly recovered. Although a thorough comparison with dedicated shortest-path algorithms remains to be done, we suspect that they will not be as efficient as linear solvers. Problems with weighted edges can be handled with the general embedding method presented next.
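A minimal sketch of this shortest-path construction, reusing the linear solve of Eq. 8 (the helper below and the toy chain graph are our own illustration):

```python
import numpy as np

def shortest_path_lengths(adj, goals, rho=40.0):
    """Approximate shortest path lengths to the nearest goal node via Eq. 14.

    adj   : (n, n) symmetric 0/1 adjacency matrix of a connected undirected graph
    goals : list of goal (terminal) node indices
    rho   : nonterminal state cost; too large a value drives exp(-v) to numerical zero
    """
    n = adj.shape[0]
    P = adj / adj.sum(axis=1, keepdims=True)      # passive dynamics: random walk
    q = np.full(n, rho)
    q[list(goals)] = 0.0
    terminal = np.zeros(n, dtype=bool)
    terminal[list(goals)] = True
    N, T = ~terminal, terminal

    A = np.diag(np.exp(q[N])) - P[np.ix_(N, N)]   # Eq. 8
    b = P[np.ix_(N, T)] @ np.exp(-q[T])
    z = np.ones(n)
    z[N] = np.linalg.solve(A, b)

    v = -np.log(z)                                # optimal cost-to-go
    return np.floor(v / rho).astype(int)          # s(x) ~ v_rho(x)/rho, rounded down (cf. Fig. 2)

# 5-node chain 0-1-2-3-4 with the goal at node 0; distances from each node to node 0
adj = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
print(shortest_path_lengths(adj, goals=[0]))      # [0 1 2 3 4]
```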
Fig. 2. The graph has 190,914 nodes and 609,066 undirected edges. The shortest path length from each node to a specified destination node was computed exactly by using dynamic programming adapted to this problem, and also approximated by using our algorithm with ρ = 40. Our algorithm was ≈5 times faster, although both implementations were equally optimized. The exact shortest path lengths were integers between 1 and 15. (A) The approximate values were binned in 15 bins according to the corresponding correct value. The range of approximate values in each bin is shown with the blue symbols. The diagonal red line is the exact solution. (Inset) Histogram of all values in one bin, with the correct value subtracted. Note that all errors are between 0 and 1, thus rounding down to the nearest integer recovers the exact solution. This was the case for all bins. (B) To assess the effects of the free parameter ρ, we solved the above problem 500 times for each of 10 values of ρ between 25 and 70. In each instance of the problem, the set of destination nodes was generated randomly and had between 1 and 5 elements. The approximate shortest path lengths found by our algorithm were rounded down to the nearest integer and compared with the exact solution. The number of mismatches, expressed as a percent of the number of nodes and averaged over the 500 repetitions, is plotted in black. For large values of ρ the approximation becomes exact, as expected from Eq. 14. However, ρ cannot be set too large, because our algorithm multiplies by exp(−ρ), thus some elements of z may become numerically zero. The percentage of such numerical errors is plotted in green. There is a comfortable range of ρ where neither type of error is observed.
The second application is a method for continuous embedding of traditional MDPs with symbolic actions. It is reminiscent of linear programming relaxation in integer programming. Denote the symbolic actions in the traditional MDP with a, the transition probabilities with p̃(x′|x, a), and the immediate costs with ℓ̃(x, a). We seek an MDP in our class such that for each (x, a) the action p̃(·|x, a) has cost ℓ̃(x, a):

q(x) + KL(p̃(·|x, a) ‖ p(·|x)) = ℓ̃(x, a).   [15]
For each x this yields a system of linear equations in q and log p, which has a unique solution under a mild nondegeneracy condition (SI Appendix). If we keep the number of symbolic actions per state fixed while increasing the number of states (for example, by making a grid world larger and larger), the amount of computation needed to construct the embedding scales linearly with the number of states. Once the embedding is constructed, the optimal actions are computed and replaced with the nearest (in the sense of transition probabilities) symbolic actions. This yields an approximately optimal policy for the traditional MDP. We do not yet have theoretical error bounds but have found the approximation to be very accurate in practice. This method is illustrated in Fig. 3 on a machine repair problem adapted from ref. 3. The R² between the correct and approximate cost-to-go is 0.993.
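One plausible way to compute such an embedding for a single state is sketched below; the actual construction is given in the SI Appendix, so the least-squares formulation and the gauge fixing used here are our own assumptions:

```python
import numpy as np
from scipy.special import logsumexp

def embed_state(p_tilde, l_tilde):
    """Sketch of the embedding in Eq. 15 for one state x (see SI Appendix for the real construction).

    p_tilde : (n_actions, n_next) transition probabilities p~(x'|x, a) of the symbolic actions
    l_tilde : (n_actions,) immediate costs l~(x, a)
    Returns q(x) and the passive dynamics p(.|x) over the n_next possible next states.
    """
    n_a, _ = p_tilde.shape
    # One linear equation per symbolic action, unknowns [q(x), log p(x'|x)]:
    #   q(x) - sum_x' p~(x'|x,a) log p(x'|x) = l~(x,a) - sum_x' p~(x'|x,a) log p~(x'|x,a)
    A = np.hstack([np.ones((n_a, 1)), -p_tilde])
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p_tilde > 0, p_tilde * np.log(p_tilde), 0.0)
    b = l_tilde - plogp.sum(axis=1)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)   # least-squares solve of the linear system
    q, log_p = sol[0], sol[1:]
    # Fix the additive gauge (q, log p) -> (q + delta, log p + delta) so p(.|x) sums to 1.
    c = logsumexp(log_p)
    return q - c, np.exp(log_p - c)
```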
The third application is a Monte Carlo method for learning z from experience in the absence of a model of the passive dynamics p and state costs q. Unlike the previous two applications, where we started with traditional MDPs and approximated them with MDPs in our class, here we assume that the problem is already in our class. The linear Bellman Eq. 7 can be unfolded recursively by replacing z(x′) with the expectation over the state following x′,
Fig. 3. Machine repair problem (see main text). Larger x_t means that the machine is more broken and more costly to operate: we incur an operation cost of 0.02 x_t. There are 50 time steps (this is a finite-horizon problem). The state of the machine has a tendency to deteriorate over time: if we do nothing, x_{t+1} has a probability 0.9 of being one of x_t ··· x_t+8 (chosen uniformly) and probability 0.1 of being one of x_t−9 ··· x_t−1. When x_t is near the edges we clip this distribution and renormalize it to sum to 1. The symbolic actions u_t are integers between 0 and 9 corresponding to how much we invest in repairing the machine in each time step. The repair cost is 0.1 u_t. The effect of the repairs is to circularly left-shift the above transition probability distribution by u_t positions. Thus, u_t = 0 corresponds to doing nothing; larger u_t causes larger expected improvement in the state of the machine. (A) The traditional MDP described above was embedded within our class, as outlined in the main text and described in detail in SI Appendix. Then, z_t(x) was computed and the approximation −log(z) to the optimal cost-to-go was plotted. Blue is small, red is large. (B) The traditional MDP was also solved by using dynamic programming and the optimal cost-to-go v_t(x) was plotted in the same format as in A. (C) Scatter plot of the optimal versus approximate costs-to-go at 5,000 space-time points (blue). The R² between the two is 0.993, that is, the optimal values account for 99.3% of the variance in the approximate values. The red diagonal line corresponds to the ideal solution. We also computed (via extensive sampling) the performance of the optimal policy found by dynamic programming, the approximately optimal policy derived from our embedding, and a random policy. The performance of our approximation was 0.9% worse than optimal, whereas the performance of the random policy was 64% worse than optimal.
and so on, and then pushing all state costs inside the expectation operators. This yields the path-integral representation

z(x) = E_{x_{t+1}~p(·|x_t)}[ exp( −Σ_{t=0}^{t_f} q(x_t) ) ].   [16]

(x_0, x_1, ···, x_{t_f}) are trajectories initialized at x_0 = x and sampled from the passive dynamics, and t_f is the time when a goal state is first reached. A similar result holds in continuous settings, where the Feynman–Kac theorem states that the unique positive solution to Eq. 12 has a path-integral representation. The use of Monte Carlo methods for solving continuous optimal control problems was pioneered in ref. 7. Our result (Eq. 16) makes it possible to apply such methods to discrete problems with non-Gaussian noise.
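A minimal Monte Carlo sketch of the estimator in Eq. 16; the environment interface (the step, q, and is_goal functions) is our own assumption:

```python
import numpy as np

def z_monte_carlo(x0, step, q, is_goal, n_samples=10000, max_steps=10000, rng=None):
    """Estimate the desirability z(x0) from Eq. 16 by sampling the passive dynamics.

    step(x, rng) -> next state sampled from p(.|x)
    q(x)         -> state cost
    is_goal(x)   -> True at terminal/goal states
    """
    rng = rng or np.random.default_rng()
    total = 0.0
    for _ in range(n_samples):
        x, cost = x0, 0.0
        for _ in range(max_steps):
            if is_goal(x):
                break
            cost += q(x)          # accumulate state costs along the passive trajectory
            x = step(x, rng)
        cost += q(x)              # include the cost of the goal state reached at t_f
        total += np.exp(-cost)
    return total / n_samples
```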
The fourth application is related to the path-integral approach above but aims to achieve faster convergence. It is motivated by Reinforcement Learning (2) where faster convergence is often observed by using Temporal Difference (TD) methods. A TD-like method for our MDPs can be obtained from Eq. 7. It constructs an approximation ẑ using triplets (x_t, q_t, x_{t+1}). Note that measuring the action costs is not necessary. ẑ is updated online as follows:

ẑ(x_t) ← (1 − η_t) ẑ(x_t) + η_t exp(−q_t) ẑ(x_{t+1}).   [17]

η_t is a learning rate which decreases over time. We call this algorithm Z learning. Despite the resemblance to TD learning, there is an important difference. Our method learns the optimal cost-to-go directly, whereas TD methods are limited to learning the cost-to-go of a specific policy, which then needs to be improved, the cost-to-go relearned, and so on. Indeed, Z learning is an off-policy method, meaning that it learns the cost-to-go for the optimal policy while sampling according to the passive dynamics.
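A sketch of tabular Z learning under the passive dynamics (Eq. 17); the initialization, episode structure, and learning-rate constant are our own choices, loosely following Fig. 4:

```python
import numpy as np

def z_learning(n_states, q, terminal, sample_passive, episodes=1000, c=5000.0, rng=None):
    """Model-free estimate of the desirability z from transitions sampled under the passive dynamics.

    q              : (n_states,) state costs
    terminal       : (n_states,) boolean mask of goal states
    sample_passive : function(x, rng) -> next state drawn from p(.|x)
    """
    rng = rng or np.random.default_rng()
    z = np.ones(n_states)
    z[terminal] = np.exp(-q[terminal])         # desirability is known at goal states
    t = 0
    for _ in range(episodes):
        x = rng.integers(n_states)             # reset to a random state, as in Fig. 4
        while not terminal[x]:
            eta = c / (c + t)                  # decaying learning rate
            x_next = sample_passive(x, rng)
            z[x] = (1 - eta) * z[x] + eta * np.exp(-q[x]) * z[x_next]   # Eq. 17
            x, t = x_next, t + 1
    return -np.log(z)                          # approximate optimal cost-to-go
```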
Fig. 4. Z learning and Q learning. Z learning and Q learning are compared in the context of a grid-world MDP. The goal state is marked "X." State transitions are allowed to all neighbors (including diagonal neighbors) and to the current state. Thus, there are at most 9 possible next states, fewer when the current state is adjacent to the obstacles shown in black or to the edges of the grid. p corresponds to a random walk. All state costs are q = 1 except for the goal state where q = 0. This MDP is within our class so Z learning can be applied directly. To apply Q learning we first need to construct a corresponding traditional MDP with symbolic actions. This is done as follows. For each state x we define a symbolic action with transition probability distribution matching the optimal u*(·|x). We also define N(x) − 1 other symbolic actions, where N(x) is the number of possible next states following x. Their transition probability distributions are obtained from u*(·|x) by circular shifting; thus, they have the same shape as u* but peak at different next states. All these symbolic actions incur cost q(x) + KL(u*(·|x) ‖ p(·|x)) matching the cost in our MDP. The resulting traditional MDP is guaranteed to have the same optimal cost-to-go as our MDP. (A) The grayscale image shows the optimal cost-to-go v = −log z where z is computed with our model-based method. Darker colors correspond to smaller values. Both Z learning and Q learning aim to approximate this function. (B) Error is defined as the average absolute difference between each approximation and the optimal costs-to-go, normalized by the average optimal cost-to-go. All approximations to z are initialized at 0. The learning rate decays as η_t = c/(c + t), where c = 5,000 for Z learning and c = 40,000 for Q learning. Each simulation is repeated 10 times. The state is reset randomly whenever the goal state is reached. Each learning algorithm is tested by using a random policy corresponding to p, as well as a greedy policy. Q learning requires exploration; thus, we use an ε-greedy policy with ε = 0.1. The values of c and ε are optimized manually for each algorithm. Z learning implicitly contains exploration so we can directly use the greedy policy, i.e., the policy û(x′|x) which appears optimal given the current approximation ẑ. Greedy Z learning requires importance sampling: the last term in Eq. 17 must be weighted by p(x_{t+1}|x_t)/û(x_{t+1}|x_t). Such weighting requires access to p. (C) Empirical performance of the policies resulting from each approximation method at each iteration. Z learning outperforms Q learning, and greedy methods outperform random sampling.
The only Reinforcement Learning method capable of doing this is Q learning (15). However, Q learning has the disadvantage of operating in the product space of states and actions, and is therefore less efficient. This is illustrated in Fig. 4 on a navigation problem.

The fifth application accelerates MDP approximations to continuous problems (16). Such MDPs are obtained via discretization and tend to be very large, calling for efficient numerical methods. Because continuous problems of the form Eqs. 9 and 10 are limits of MDPs in our class, they can be approximated with MDPs in our class, which, in turn, are reduced to linear equations and solved efficiently. The same continuous problems can also be approximated with traditional MDPs and solved via dynamic programming. Both approximations converge to the same solution in the limit of infinitely fine discretization, and turn out to be equally accurate away from the limit, but our approximation is faster to compute. This is shown in Fig. 5 on a car-on-a-hill problem, where our method is ≈10 times faster than policy iteration and 100 times faster than value iteration.
The remaining applications have been developed recently. Below we summarize the key results and refer the reader to online preprints (12–14). Combining Eqs. 6 and 7, the optimal control law can be written as u*(x′|x) = exp(−q(x)) p(x′|x) z(x′) / z(x). Given a fixed
Fig. 5. Comparison of our MDP approximation and a traditional MDP approximation on a continuous car-on-a-hill problem. (A) The car moves along a curved road in the presence of gravity. The control signal is tangential acceleration. The goal is to reach the parking lot with small velocity. (B) Z iteration (blue), policy iteration (red), and value iteration (black) converge to control laws with identical performance; however, Z iteration is 10 times faster than policy iteration and 100 times faster than value iteration. Note the log scale on the horizontal axis. (C) The optimal cost-to-go for our approximation. Blue is small, red is large. The two black curves are stochastic trajectories resulting from the optimal control law. The thick magenta curve is the most likely trajectory of the optimally controlled stochastic system. It is computed by solving the deterministic optimal control problem described in the main text. (D) The optimal cost-to-go is inferred from observed state transitions by using our algorithm for inverse optimal control. Brown pixels correspond to states where we did not have data (i.e., no state transitions landed there); thus, the cost-to-go could not be inferred. The details are given in SI Appendix.
initial state x_0 and final time T, the probability that the optimal control law u* generates trajectory x_1, x_2, ···, x_T is

∏_{t=0}^{T−1} u*(x_{t+1}|x_t) = ( z(x_T) / z(x_0) ) ∏_{t=0}^{T−1} exp(−q(x_t)) p(x_{t+1}|x_t).   [18]
Omitting z(x_0), which is fixed, and noting that z(x_T) = exp(−q(x_T)), the negative log of Eq. 18 can be interpreted as the cumulative cost for a deterministic optimal control problem with immediate cost q(x) − log(p(x′|x)). A related result is obtained in continuous settings when B is constant: the corresponding deterministic problem has dynamics ẋ = a(x) + B u and cost ℓ(x, u) + (1/2) div(a(x)). These results are important because optimal trajectories for deterministic problems can be found with efficient numerical methods (17) that avoid the curse of dimensionality. Our framework makes it possible to apply deterministic methods to stochastic control for the first time. The most likely trajectory in the car-on-a-hill problem is shown in Fig. 5C in magenta. See ref. 12 for details.
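For reference, Eq. 18 can be evaluated directly once z has been computed; the small helper below (our own sketch for the discrete case) returns the log-probability of a state trajectory under the optimal control law, whose negative is, up to the fixed z(x_0) term, the cumulative cost q(x) − log p(x′|x) of the deterministic problem just described:

```python
import numpy as np

def log_prob_trajectory(traj, P, q, z):
    """Log-probability of a trajectory x_0, ..., x_T under the optimal control law (Eq. 18).

    traj : sequence of state indices
    P    : passive transition matrix
    q    : state costs
    z    : desirability function (e.g., from the linear solve of Eq. 8)
    """
    x = np.asarray(traj)
    xt, xn = x[:-1], x[1:]
    # log prod_t u*(x_{t+1}|x_t) = log z(x_T) - log z(x_0) + sum_t [ -q(x_t) + log p(x_{t+1}|x_t) ]
    return (np.log(z[x[-1]]) - np.log(z[x[0]])
            - q[xt].sum() + np.log(P[xt, xn]).sum())
```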
The seventh application exploits the duality between Bayesian estimation and stochastic optimal control. This duality is well-known for linear systems (4) and was recently generalized (8, 10) to nonlinear continuous systems of the form Eqs. 9 and 10. Duality also holds for our MDPs. Indeed, Eq. 18 can be interpreted as the posterior probability of a trajectory in a Bayesian smoothing problem, where p(x_{t+1}|x_t) are the prior probabilities of the state transitions and exp(−q(x_t)) are the likelihoods of some unspecified measurements. The desirability function in the control problem can be shown to be proportional to the backward filtering density in the dual estimation problem. Thus, stochastic optimal control problems can be solved by using Bayesian inference. See refs. 10 and 12 for details.
The eighth application concerns inverse optimal control, where the goal is to infer q(x) given a dataset {x_k, x_k′}_{k=1···K} of state transitions generated by the optimal controller. Existing methods rely on guessing the cost function, solving the forward problem, comparing the solution with data, then improving the guess and iterating (18, 19). This indirect procedure can be inefficient when the forward problem is hard to solve. For our MDPs the inverse problem can be solved directly, by minimizing the convex function
L(v) = dᵀ v + cᵀ log(P exp(−v)),   [19]
where v is the vector of (unknown) optimal costs-to-go, P is the passive dynamics matrix defined earlier, and d and c are the histograms of x_k′ and x_k (i.e., d_i is the number of times that x_k′ = i). It can be shown that L(v) is the negative log-likelihood of the dataset. We have found empirically that its Hessian tends to be diagonally dominant, which motivates an efficient quasi-Newton method by using a diagonal approximation to the Hessian. Once the minimum is found, we can compute z and then find q from the linear Bellman equation. Fig. 5D illustrates this inverse optimal control method on the car-on-a-hill problem. See ref. 14 for details.
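A minimal sketch of this inverse optimal control step; we substitute a generic L-BFGS-B solver for the diagonal quasi-Newton method mentioned in the text, and the data interface is our own assumption:

```python
import numpy as np
from scipy.optimize import minimize

def infer_costs_to_go(P, transitions, n_states):
    """Minimize Eq. 19 to infer the optimal costs-to-go from observed transitions.

    P           : assumed passive dynamics matrix
    transitions : list of (x_k, x_k_next) state index pairs generated by the optimal controller
    """
    x_from, x_to = np.array(transitions).T
    c = np.bincount(x_from, minlength=n_states)    # histogram of x_k
    d = np.bincount(x_to, minlength=n_states)      # histogram of x_k'

    def neg_log_likelihood(v):
        return d @ v + c @ np.log(P @ np.exp(-v))  # L(v) from Eq. 19

    res = minimize(neg_log_likelihood, x0=np.zeros(n_states), method="L-BFGS-B")
    return res.x                                   # estimated optimal costs-to-go
```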
The ninth application is a method for constructing optimal control laws as combinations of certain primitives. This can be done in finite-horizon as well as total-cost problems with terminal/goal states, where the final cost plays the role of a boundary condition. Suppose we have a collection of component final costs g_i(x) for which we have somehow computed the desirability functions z_i(x). Linearity implies that, if the composite final cost g(x) satisfies

exp(−g(x)) = Σ_i w_i exp(−g_i(x))   [20]

for some set of w_i, then the composite desirability function is z(x) = Σ_i w_i z_i(x). This approach is particularly useful when the component problems can be solved analytically, as in the linear-quadratic-Gaussian (LQG) framework (4). In that case the component desirabilities z_i(x) are Gaussians. By mixing them linearly, we can obtain analytical solutions to finite-horizon problems of the form Eqs. 9 and 10 where a(x) is linear, B is constant, and q(x) is quadratic; however, the final cost g(x) is not constrained to be quadratic. Instead g(x) can equal the negative log of any Gaussian mixture. This is a nontrivial extension to the widely used LQG framework. See ref. 13 for details.
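The composition rule itself is a one-liner; the sketch below (our own illustration) forms the composite desirability and the corresponding composite final cost for given weights:

```python
import numpy as np

def composite_desirability(z_components, w):
    """Eq. 20: if exp(-g(x)) = sum_i w_i exp(-g_i(x)), then z(x) = sum_i w_i z_i(x).

    z_components : list of component desirability functions z_i evaluated on a common
                   set of states (arrays of equal shape)
    w            : mixture weights
    """
    return sum(wi * zi for wi, zi in zip(w, z_components))

def composite_final_cost(g_components, w):
    """Composite final cost: the negative log of the mixture of exp(-g_i)."""
    return -np.log(sum(wi * np.exp(-gi) for wi, gi in zip(w, g_components)))
```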
Discussion

We formulated the problem of stochastic optimal control in a way which is rather general and yet affords substantial simplifications. Exhaustive search over actions was replaced with an analytical solution, and the computation of the optimal cost-to-go function was reduced to a linear problem. This gave rise to efficient new algorithms speeding up the construction of optimal control laws. Furthermore, our framework enabled a number of computations that were not possible previously: solving problems with nonquadratic final costs by mixing LQG primitives, finding the most likely trajectories of optimally controlled stochastic systems via deterministic methods, solving inverse optimal control problems via convex optimization, quantifying the benefits of error tolerance, and applying off-policy learning in the state space as opposed to the state-action space.
These advances were made possible by imposing a certain structure on the problem formulation. First, the control cost must be a KL divergence, which reduces to the familiar quadratic energy cost in continuous settings. This is a sensible way to measure control energy and is not particularly restrictive. Second, the controls and the noise must be able to cause the same state transitions; the analog in continuous settings is that both the controls and the
noise act in the subspace spanned by the columns of B(x). This is a more significant limitation. It prevents us from modeling systems subject to disturbances outside the actuation space. We are now pursuing an extension that aims to relax this limitation while preserving many of the appealing properties of our framework. Third, the noise amplitude and the control costs are coupled, and, in particular, the control costs are large when the noise amplitude is small. This can be compensated to some extent by increasing the state costs while ensuring that exp(−q(x)) does not become numerically zero. Fourth, with regard to Z learning, following a policy other than the passive dynamics p requires importance-sampling correction based on a model of p. Such a model could presumably be learned online; however, this extension remains to be developed.

This framework has many potential applications and we hope that the list of examples will grow as other researchers begin to use it. Our current focus is on high-dimensional continuous problems such as those arising in biomechanics and robotics, where the discrete-time continuous-state MDP approximation is particularly promising. It leads to linear integral equations rather than differential equations, resulting in robust numerical methods. Furthermore, it is well suited to handle discontinuities due to rigid-body collisions. Initial results using mixture-of-Gaussian approximations to the desirability function are encouraging (11), yet a lot more work remains.

ACKNOWLEDGMENTS. We thank Surya Ganguli, Javier Movellan, and Yuval Tassa for discussions and comments on the manuscript. This work was supported by the National Science Foundation.
1. Todorov E (2004) Optimality principles in sensorimotor control. Nat Neurosci 7(9):907–915.
2. Sutton R, Barto A (1998) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).
3. Bertsekas D (2001) Dynamic Programming and Optimal Control (Athena Scientific, Belmont, MA), 2nd Ed.
4. Stengel R (1994) Optimal Control and Estimation (Dover, New York).
5. Korn R (1997) Optimal Portfolios: Stochastic Models for Optimal Investment and Risk Management in Continuous Time (World Scientific, Teaneck, NJ).
6. Fleming W, Mitter S (1982) Optimal control and nonlinear filtering for nondegenerate diffusion processes. Stochastics 8:226–261.
7. Kappen HJ (2005) Linear theory for control of nonlinear stochastic systems. Phys Rev Lett 95:200201.
8. Mitter S, Newton N (2003) A variational approach to nonlinear estimation. SIAM J Control Opt 42:1813–1833.
9. Todorov E (2006) Linearly-solvable Markov decision problems. Adv Neural Information Proc Syst 19:1369–1376.
10. Todorov E (2008) General duality between optimal control and estimation. IEEE Conf Decision Control 47:4286–4292.
11. Todorov E (2009) Eigen-function approximation methods for linearly-solvable optimal control problems. IEEE Int Symp Adapt Dynam Programm Reinforce Learning 2:161–168.
12. Todorov E (2009) Classic maximum principles and estimation-control dualities for nonlinear stochastic systems, preprint.
13. Todorov E (2009) Compositionality of optimal control laws, preprint.
14. Todorov E (2009) Efficient algorithms for inverse optimal control, preprint.
15. Watkins C, Dayan P (1992) Q-learning. Mach Learn 8:279–292.
16. Kushner H, Dupuis P (2001) Numerical Methods for Stochastic Optimal Control Problems in Continuous Time (Springer, New York).
17. Bryson A, Ho Y (1969) Applied Optimal Control (Blaisdell Publishing Company, Waltham, MA).
18. Ng A, Russell S (2000) Algorithms for inverse reinforcement learning. Int Conf Mach Learn 17:663–670.
19. Liu K, Hertzmann A, Popovic Z (2005) Learning physics-based motion style with nonlinear inverse optimization. ACM Trans Graphics 24:1071–1081.