Discovering Hierarchy in Reinforcement Learning with HEXQ




Bernhard Hengst, Computer Science and Engineering, University of New South Wales, UNSW Sydney 2052, Australia

Abstract

An open problem in reinforcement learning is discovering hierarchical structure. HEXQ, an algorithm which automatically attempts to decompose and solve a model-free factored MDP hierarchically, is described. By searching for aliased Markov sub-space regions based on the state variables, the algorithm uses temporal and state abstraction to construct a hierarchy of interlinked smaller MDPs.

1. Introduction

Bellman (1961) stated that sheer enumeration would not solve problems of any significance. In reinforcement learning the size of the state space scales exponentially with the number of variables. Designers try to manually decompose more complex problems to make them tractable. Finding good decompositions is usually an art-form. Many researchers have either ignored where decompositions come from or pointed to the desirability of automating this task (Boutilier et al., 1999; Hauskrecht et al., 1998; Dean & Lin, 1995). More recently, Dietterich (2000b) concluded that the biggest open problem in reinforcement learning is to discover hierarchical structure.

It was recognised by Ashby (1956) that learning is worthwhile only when the environment shows constraint. One type of constraint present in many environments is the repetition of sub-structures. Ashby stated that repetition is of considerable practical importance in the regulation of very large systems. Repetitions are commonplace. They are evident, for example, at the molecular level, in daily routines, in office layouts or even in just walking. One reason that reinforcement learning scales poorly is that sub-policies, such as walking, need to be relearnt in every context. It makes more sense to learn how to walk only once and then reuse this skill wherever it is required. A reinforcement learning agent that can find and learn reusable sub-tasks and in turn employ them to learn higher level skills would be more efficient.

In the rest of this paper we will describe the operation of a hierarchical reinforcement learning algorithm, HEXQ, which attempts to solve model-free MDPs more efficiently by finding and exploiting repeatable sub-structures in the environment. The algorithm is designed to automatically discover state and temporal abstractions, find appropriate sub-goals and construct a hierarchical representation to solve the overall MDP. As a running example we will use the taxi task (Dietterich, 2000a) to illustrate how the algorithm works. We also show results for a noisy Tower of Hanoi puzzle.

2. Representation and Assumptions

We start with the usual formulation of a finite MDP with discrete time steps, states and actions (Sutton & Barto, 1998). The objective is to find an optimal policy by maximising the expected value of future discounted rewards represented by the action-value function, Q (Watkins & Dayan, 1992). We also employ semi-MDP theory (Puterman, 1994) which generalizes MDPs to models with variable time between decisions. We assume that the state is defined by a vector of d state variables, x = (x_1, x_2, ..., x_d). Large MDPs are naturally described in this factored form. In this paper we consider only negative reward non-discounted finite horizon MDPs (stochastic shortest path problems), but the algorithm has been extended to handle general finite MDPs. The issue of solving MDPs efficiently is largely orthogonal and complementary to the decomposition techniques discussed here. We have used simple one-step backup Q-learning throughout.

HEXQ attempts to decompose a MDP by dividing the state space into nested sub-MDP regions. Decomposition is possible when (1) some of the variables in the state vector represent features in the environment that change at less frequent time intervals, (2) variables that change value more frequently retain their transition properties in the context of the more persistent variables and (3) the interface between regions can be controlled. For example, if a robot navigates around four equally sized rooms with interconnecting doorways (Parr, 1998) the state space can be represented by the two variables, room-identifier and position-in-room. The room changes less frequently than the position. The position in each room needs to be represented consistently to allow generalisation across rooms, for example, by numbering cells from top to bottom, left to right in each room. Most representations naturally label repeated sub-structures in this way. Finally we need to be able to find sub-policies to exit through each doorway with certainty. This will become clearer in the next sections. In the absence of these conditions, or when they are only partially present, HEXQ will nevertheless solve the MDP, discovering abstractions where it can. In the worst case it has to solve the 'flat' problem.
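For reference, the one-step backup Q-learning mentioned above is the standard update of Watkins and Dayan (1992); the learning rate α and discount γ are the usual notation rather than symbols introduced in this paper, and the undiscounted stochastic shortest path setting used here corresponds to γ = 1:

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]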

3. The Taxi Domain

Dietterich (2000a) created the taxi task (Figure 1) to demonstrate MAXQ hierarchical reinforcement learning. For MAXQ the structure of the hierarchy is specified by the user. We will use the same domain to illustrate how hierarchical decomposition can be automated. We will keep our description of HEXQ general, but use the taxi domain to illustrate the basic concepts. We start by reviewing the taxi problem.

In the taxi domain, a taxi, started at a random location, navigates around a 5-by-5 grid world to pick up and then put down a passenger. There are four possible source and destination locations, designated R, G, Y and B. We encode these 1, 2, 3, 4 respectively. They are called taxi ranks. The objective of the taxi agent is to go to the source rank, pick up the passenger, then navigate with the passenger in the taxi to the destination rank and put down the passenger. The source and destination ranks are also chosen at random for each new trial. At each step the taxi can perform one of six primitive actions: move one square to the north, south, east or west, pickup or putdown the passenger.

A move into a wall or barrier leaves the taxi location unchanged. For a successful passenger delivery the reward is 20. If the taxi executes a pickup action at a location without the passenger or a putdown action at the wrong destination it receives a reward of -10. For all other steps the reward is -1. The trial terminates following a successful delivery.
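To make the reward and termination logic above concrete, here is a minimal Python sketch of the taxi step function. It is not the exact benchmark environment: the internal barriers of Dietterich's grid are not modelled, and the rank-to-cell mapping is an assumption (chosen to be consistent with the exit locations 0, 4, 20, 23 that appear later in section 4.2).

```python
# Assumed rank label -> taxi-location index, consistent with the exits listed in section 4.2.
RANK_CELLS = {1: 0, 2: 4, 3: 20, 4: 23}
ACTIONS = ["north", "south", "east", "west", "pickup", "putdown"]

def step(taxi_loc, passenger_loc, destination, action):
    """One step of the taxi task: returns (new_taxi_loc, new_passenger_loc, reward, done).

    passenger_loc: 0 means 'in the taxi', 1-4 are ranks; destination: 1-4.
    Internal barriers are omitted; a move into a grid edge leaves the location unchanged.
    """
    reward, done = -1, False
    if action in ("north", "south", "east", "west"):
        row, col = divmod(taxi_loc, 5)          # cells numbered left-to-right, top-to-bottom
        if action == "north": row = max(row - 1, 0)
        if action == "south": row = min(row + 1, 4)
        if action == "west":  col = max(col - 1, 0)
        if action == "east":  col = min(col + 1, 4)
        taxi_loc = 5 * row + col
    elif action == "pickup":
        if passenger_loc != 0 and taxi_loc == RANK_CELLS[passenger_loc]:
            passenger_loc = 0                   # passenger now in the taxi
        else:
            reward = -10                        # pickup at a location without the passenger
    elif action == "putdown":
        if passenger_loc == 0 and taxi_loc == RANK_CELLS[destination]:
            reward, done = 20, True             # successful delivery ends the trial
        else:
            reward = -10                        # putdown at the wrong location
    return taxi_loc, passenger_loc, reward, done
```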

The taxi problem can be formulated as an episodic MDP with the 3 state variables: the location of the taxi (values 0-24), the passenger location including in the taxi (values 0-4, 0 means in the taxi) and the destination location (values 1-4). Deterministic actions will make the illustrations simpler. Results in section 7 are based on the stochastic version of the task.

The navigation policy can be the same whether the taxi intends to pick up or put down the passenger. The usual 'flat' formulation of the MDP will solve the navigation sub-task as many times as it reoccurs in the different contexts. Dietterich has demonstrated how the problem can be solved more efficiently with sub-task reuse by designing a MAXQ hierarchy. We will now show how a hierarchy can be generated automatically to solve the problem.

4. Automatic Hierarchical Decomposition

HEXQ uses the state variables to construct the hierarchy. The maximum number of levels in the hierarchy is the same as the number of state variables. For the taxi domain there are three levels. The hierarchy is constructed, and levels are numbered, from the bottom up. The bottom level, level 1, is associated with the variable that changes value most frequently. The rationale is that sub-tasks that are used most often appear at the lower levels and need to be learnt first. The first level is the only level that interacts with the environment using primitive actions.

We start by observing one of the state variables. We choose this variable on the basis that it changes value most frequently. We now partition the states represented by the values of this variable into Markov regions. The boundaries between regions are identified by 'unpredictable' (see subsection 4.2) transitions which we call region exits. We then define sub-MDPs over these regions and learn separate policies to leave each region via its various exits.

Whole regions are then abstracted and combined with the next most frequently changing state variable to form abstract states at the next level in the hierarchy. The exit policies just learnt become abstract actions at this next level.

Table 1. Frequency of change for taxi domain variables over 2000 random steps.

Variable             Frequency   Order
Passenger location       4         2
Taxi location          846         1
Destination              0         3

At this stage we have a semi-MDP that has one less variable in the state description and only abstract actions. We now repeat the above process on the reduced problem, forming at most one hierarchical level for each state variable. If this abstraction does not reduce the number of Q-values otherwise required we can simply take the cartesian product of the current level state with the next variable and continue the construction of the hierarchy with this combined state. The top level will have one sub-MDP which is solved by recursively calling other sub-MDP policies as its actions. Let us now describe this process in more detail.

4.1 Variable Ordering Heuristic

Just as lines in the inner loop of programs are executed more frequently, variables that change value more frequently are associated with lower levels of the hierarchy. Conversely, variables that change less frequently set the context for the more frequently changing ones. The hierarchy is constructed from the bottom up, starting with the variable that changes most frequently. To order the variables, we allow our agent to explore its environment at random for a set period of time and keep statistics on the frequency of change of each of the state variables. We then sort the variables based on their frequency of change. For the taxi, Table 1 shows the frequency and order of each variable following a 2000 random action exploration run.
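A sketch of this ordering heuristic, assuming the environment exposes a factored state tuple through a reset function and a step function; these function names and signatures are illustrative, not part of HEXQ's actual interface.

```python
import random
from collections import Counter

def order_variables(env_reset, env_step, actions, num_steps=2000):
    """Run random actions and count how often each state variable changes value.

    env_reset() -> state tuple; env_step(state, action) -> (next_state, done).
    Returns variable indices sorted most-frequently-changing first (level 1 first).
    """
    change_counts = Counter()
    state = env_reset()
    for _ in range(num_steps):
        next_state, done = env_step(state, random.choice(actions))
        for i, (old, new) in enumerate(zip(state, next_state)):
            if old != new:
                change_counts[i] += 1
        state = env_reset() if done else next_state
    return sorted(range(len(state)), key=lambda i: -change_counts[i])
```

For the taxi this would return the ordering of Table 1: taxi location first, then passenger location, then destination.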

4.2 Discovering Repeatable Regions

HEXQ starts by projecting the entire state space onto values of the variable that changes most frequently, limiting the state space to these values. Obviously, this makes the agent's perception of the environment highly aliased. Nevertheless, HEXQ now proceeds to attempt to model the state transitions and rewards by exploring the environment, taking random actions. It models the state transitions as a directed graph (DG) in which the vertices are the state values and the edges are the transitions for each primitive action. Following a set period of exploration in this manner, transitions that are unpredictable (called exits) are eliminated from the graph.

Definition 1. An exit is a state-action pair (s^e, a) signifying that taking action a from state s^e causes an unpredictable transition. Transitions are unpredictable when (1) the state transition or reward function is not a stationary probability distribution, or (2) another variable may change value (or the task terminates).

State s^e in definition 1 is a level e state (see subsection 4.3) and is referred to as an exit state. Action a is primitive for level 1 and abstract for all higher levels. An entry state is the state that is reached following an exit. For the taxi location variable, all states are entries because the taxi agent is restarted at any location at random following the episodic trial exit.

We are left with a DG whose connections represent predictable transitions between values of the first level state variable. For the taxi example the directed graph for the taxi location variable is shown in Figure 2. The transitions labelled exits are not counted as edges. For example, taking action pickup or putdown in state 23 is not predictable because it may change the passenger location variable or reach the goal, respectively. However, we can predict the transitions anywhere else in the environment. State 7, for example, always transitions to state 2 on action north, 12 on south, 8 on east, 7 on west, 7 on pickup and 7 on putdown. We have only drawn one edge to represent multiple transitions between the same states in Figure 2 to avoid cluttering the diagram.
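A sketch of how level-1 exits and the predictable DG might be extracted in the deterministic case, assuming access to logged experience tuples; the data layout is illustrative rather than HEXQ's actual implementation (the stochastic case needs the statistical tests of section 6.1 instead of the uniqueness check below).

```python
def find_exits(experience, var=0):
    """Detect level-1 exits and predictable edges in the deterministic case.

    experience: iterable of (s, a, r, s2, done) where s and s2 are state tuples.
    A (value, action) pair is an exit if its projected outcome is not unique
    (unpredictable transition or reward), if another variable changes value,
    or if the trial terminates.
    """
    outcomes = {}          # (s[var], a) -> set of observed (s2[var], reward)
    exits = set()
    for s, a, r, s2, done in experience:
        key = (s[var], a)
        if done or any(s[i] != s2[i] for i in range(len(s)) if i != var):
            exits.add(key)                      # another variable changed, or task ended
        outcomes.setdefault(key, set()).add((s2[var], r))
        if len(outcomes[key]) > 1:
            exits.add(key)                      # transition or reward not deterministic
    edges = {key: next(iter(v)) for key, v in outcomes.items() if key not in exits}
    return exits, edges    # edges form the predictable directed graph (DG)
```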

The DG is now partitioned into regions in a way that guarantees a valid hierarchical state abstraction.

First we decompose the DG into strongly connected components (SCCs). An efficient linear time algorithm for this procedure can be found in Cormen et al. (1999). The SCCs, as abstracted nodes, form a directed acyclic graph (DAG). It is possible that there are many connected SCCs in the DAG. Because we only require that when we enter an abstract state we can leave via any exit, we can combine some SCCs to form regions. Whole regions will later be abstracted to form higher level states. Our objective is to maximise the size of these regions so as to minimise the number of abstract states. Following the coalition of SCCs into regions we may still have regions connected by edges from the underlying DAG. We break these by forming additional exits and entries associated with their respective regions and repeat the entire procedure until no additional regions are formed. The regions can be labelled arbitrarily. We use consecutive integers for convenience (see equation 1).

A region, therefore, is a combination of SCCs such that any exit state in a region can be reached from any entry with probability 1. Regions are generally aliased in the environment. All the instances of these generic regions partition the total state space.
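One way to start forming regions from the predictable DG is sketched below with networkx (an assumed dependency). This shows only the first pass; HEXQ's full procedure then merges connected SCCs into larger regions, adds exits where region-to-region DAG edges are broken, and repeats until no additional regions form.

```python
import networkx as nx

def first_pass_regions(edges):
    """Group DG vertices into candidate regions via strongly connected components.

    edges: {(state_value, action): (next_state_value, reward)} containing only
    predictable transitions (exits already removed). Returns a list of vertex sets.
    """
    g = nx.DiGraph()
    for (s, _a), (s2, _r) in edges.items():
        g.add_edge(s, s2)
    dag = nx.condensation(g)                    # SCCs become the nodes of a DAG
    # In HEXQ, connected SCCs are further coalesced so that every exit state
    # remains reachable from every entry with probability 1 (not shown here).
    return [dag.nodes[n]["members"] for n in dag.nodes]
```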

Each region has a set of states, actions and predictable (i.e. Markov) transition and reward functions. We can therefore define a MDP over the region with an exit (or sub-goal) as a transition to an absorbing state. The solution to this 'sub-MDP' is a policy over the region that will move the agent out of an exit starting from any entry. We proceed to construct multiple sub-MDPs, one for each unique hierarchical exit state (s^1, s^2, ..., s^e) in each region. Sub-MDP policies in HEXQ are learnt on-line, but a form of hierarchical dynamic programming could be used directly as the sub-task models have already been uncovered.

In the taxi example, the above procedure finds one hierarchical level 1 region as reflected in Figure 2. This region has 8 exits. They are:

(s^1 = 0, a = pickup), (s^1 = 0, a = putdown),
(s^1 = 4, a = pickup), (s^1 = 4, a = putdown),
(s^1 = 20, a = pickup), (s^1 = 20, a = putdown),
(s^1 = 23, a = pickup), (s^1 = 23, a = putdown).

As there are 4 hierarchical exit states we create 4 sub-MDPs at level 1 and solve them.

4.3 State and Action Abstraction

We are now in the position to tackle the second level in the hierarchy. The process is similar to the first level in that we will be searching for repeatable regions, except that now the states and actions are based on abstractions from the first level.

We define abstract states at the second level as the cartesian product of the region labels and values of the next state variable in the frequency ordering. A convenient numerical method for generating abstract state values is as follows:

s^{e+1} = |r^e| \, x_j + r^e        (1)

where
  s^{e+1} = abstract state value at level e+1
  |r^e|   = number of regions at level e
  x_j     = value of the next most frequent state variable
  r^e     = region label from level e

The abstract actions available in each of these abstract states are the policies leading to region exits at the level below. Abstract actions are similar to composite actions in Dietterich (2000a). The effect of taking an abstract action in state s^e is to invoke and execute the associated level e-1 sub-MDP policy.

For the taxi example there is only one region at level 1, which we label 0. When we apply equation (1), the level 2 hierarchy states, s^2, simply correspond to the values of the passenger location. We therefore generate 5 states at level 2. There are 8 abstract actions, which are the policies that lead to the exits listed previously. It is generally the case that different abstract states have different abstract actions. Also, because abstract actions can take varying primitive time periods to execute, we now have a semi-Markov decision problem. This semi-MDP has one less variable than the original MDP and uses only abstract actions.
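A small illustration of equation (1) with the taxi numbers above (one level-1 region labelled 0, so |r^1| = 1, and five passenger-location values); the function name is ours, introduced only for the example.

```python
def abstract_state(num_regions, variable_value, region_label):
    """Equation (1): s^{e+1} = |r^e| * x_j + r^e."""
    return num_regions * variable_value + region_label

# Taxi at level 2: one region (label 0), passenger location values 0..4.
level2_states = [abstract_state(1, x, 0) for x in range(5)]
print(level2_states)   # [0, 1, 2, 3, 4] -- level 2 states coincide with the passenger location
```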

For the taxi, the region and exits at level 2 are shown in Figure 3. Note that the actions are abstract and labelled using the exit notation from the first level. There are 4 exits at level 2. They are also shown in Figure 3. For example, exit (s^2 = 0, (s^1 = 23, a = putdown)) means: with the passenger in the taxi, navigate to location s^1 = 23 and putdown the passenger. This is an exit because it may or may not lead to the goal, depending on the destination location. We generate 4 sub-MDPs for level 2 given the 4 unique hierarchical exit states.

In this way we generate one level of hierarchy for each variable in the original MDP. The only change to the predictability criteria is that we only test state variables for change from the ones we have not yet processed. When we reach the last state variable we solve the top level sub-MDP, represented by the final abstract states and actions, which solves the overall MDP.

To illustrate the execution of a competent taxi agent on the hierarchically decomposed problem, let us assume the taxi is initially located randomly at cell 5, the passenger is on rank 4 and wants to go to rank 3. In the top level sub-MDP, the taxi agent perceives the passenger destination as 3 and takes abstract action (s^2 = 0, (s^1 = 20, a = putdown)). This sets the sub-goal state at level 2 to s^2 = 0 or, in English, pick up the passenger first. At level 2, the taxi agent perceives the passenger location as 4, and therefore executes abstract action (s^1 = 23, a = pickup). This abstract action sets the sub-goal state at level 1 to taxi location s^1 = 23, i.e. rank 4. The level 1 policy is now executed using primitive actions to move the taxi from location s^1 = 5 to the pickup location s^1 = 23 and the pickup action is executed on exit. Level 1 returns control to level 2 where the state has transitioned to s^2 = 0. Level 2 now completes its instruction and takes abstract action (s^1 = 20, a = putdown). This again invokes level 1 primitive actions to move the taxi from location s^1 = 23 to s^1 = 20 and then putdown to exit. Control is returned back up the hierarchy and the trial ends with the passenger delivered correctly.

5. Hierarchical Value Function

HEXQ is similar to MAXQ in its approach to hierarchical execution using a decomposed value function, but there are important differences.

The motivation for these new decomposition equations is that they automatically solve the hierarchical credit assignment problem (Dietterich, 2000a) by relegating non-repeatable sub-task rewards further up the hierarchy where they can be 'explained'. The equations also reduce nicely to the usual Q and value functions for 'flat' MDPs if there is only one variable in the state vector.

Definition 2. The recursively optimal hierarchical exit Q-function Q^*_{em}(s^e, a) at level e in sub-MDP m is the expected value after completing the execution of (abstract) action a starting in (abstract) state s^e and following the optimal hierarchical policy thereafter.

Q^*_{em}(s^e, a) = \sum_{s'} T^a_{s^e s'} \left[ R^a_{s^e} + V^*_{em}(s') \right]        (2)

where
  T^a_{s^e s'} = Pr{s' | s^e, a}
  R^a_{s^e}    = E{primitive reward after a | s^e, a}
  s'           = hierarchical next state = (s^{1'}, ..., s^{e'})

Note that Q includes the expected primitive reward immediately after sub-task exit, but does not include any rewards accumulated while executing the sub-tasks. The name HEXQ is derived from this function. The recursively optimal hierarchical value function decomposition is given by:

V^*_{em}(s) = \max_a \left[ V^*_{e-1, m_{e-1}(a)}(s) + Q^*_{em}(s^e, a) \right]        (3)

where
  m_{e-1}(a) = the sub-MDP implementing action a

The manner in which regions are formed has ensured that no action will exit a sub-MDP unintentionally.

For our special case of stochastic shortest path problems we avoid the introduction of the pseudo-reward functions (Dietterich, 2000a). We can therefore update all region sub-policies concurrently using off-policy backups, similarly to all goals updating (Kaelbling, 1993).
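To make the recursion in equation (3) concrete, here is a sketch of how the decomposed value might be evaluated at execution time. The data layout (per-level Q tables indexed by abstract state and abstract action, and the helper functions `sub_mdp_of` and `project`) is an assumption made for illustration, not the paper's implementation.

```python
def hierarchical_value(level, m, s, Q, sub_mdp_of, project):
    """Return (best abstract action, value of equation (3)) for sub-MDP m at `level`.

    Q[(level, m)][s_e][a]   -- learnt exit Q-values of equation (2)
    sub_mdp_of[(level, a)]  -- the level-(e-1) sub-MDP implementing abstract action a
    project(level, s)       -- projects the hierarchical state s onto level `level`
    """
    s_e = project(level, s)
    best_a, best_v = None, float("-inf")
    for a, q in Q[(level, m)][s_e].items():
        if level == 1:
            inner = 0.0          # primitive actions: their reward is already in Q, so V* below is 0
        else:
            _, inner = hierarchical_value(level - 1, sub_mdp_of[(level, a)], s,
                                          Q, sub_mdp_of, project)
        if inner + q > best_v:   # V*(s) = max_a [ V*_{e-1,m_{e-1}(a)}(s) + Q*_{em}(s^e, a) ]
            best_a, best_v = a, inner + q
    return best_a, best_v
```

A hierarchical greedy policy (section 6.2) would simply re-run this evaluation at every level after every step.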

6. Stochastic Considerations

HEXQ handles stochastic actions and rewards. The aspects that need attention are explained in the following two sections.

6.1 Detecting Non-Markov Transitions

Exits can also occur (see definition 1) when a state transition or reward is not predictable and a higher level state variable does not change value. In the deterministic case it is easy to determine these exits. We only need to find a transition to two different next states or reward values for the same action to trigger an exit condition. In the stochastic case we need to record statistics and test the hypothesis that the reward and state transitions (R and T functions in equation 2) come from different probability distributions in some contexts. In theory it is possible to determine this to any degree of accuracy given that we can test all the individual contexts represented by the higher level variables. In practice this is intractable, because the combinations of higher level variables can grow exponentially.

Instead, we keep the transition statistics over a shorter period of time and compare these to their long term average. The objective is to explicitly test whether the probability distribution is stationary. We use a binomial distribution based on average probabilities to test when a temporally close sample of each transition is outside an upper confidence limit. When this happens we declare an exit. Similarly, to detect reward function non-stationarity we use the Kolmogorov-Smirnov test.
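A sketch of the kind of tests described above, using SciPy; the significance level and the shape of the recent-window statistics are illustrative choices, not those used in the paper.

```python
from scipy import stats

def reward_nonstationary(recent_rewards, long_term_rewards, alpha=0.01):
    """Kolmogorov-Smirnov test: do the recent rewards for a (state, action) pair
    appear to come from a different distribution than the long-term sample?"""
    _, p_value = stats.ks_2samp(recent_rewards, long_term_rewards)
    return p_value < alpha

def transition_nonstationary(recent_hits, recent_n, long_term_p, alpha=0.01):
    """Binomial check: is the recent count of a particular next-state outcome
    above an upper confidence limit implied by its long-term probability?"""
    upper = stats.binom.ppf(1.0 - alpha, recent_n, long_term_p)
    return recent_hits > upper
```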

6.2 Hierarchical Greedy Policy

A sub-MDP will stubbornly attempt to exit a region by the exit determined from the level above even though the agent may have slipped closer to another exit that is now more optimal (Hauskrecht et al., 1998). The solution is to re-evaluate the optimum policy at every level after every step. This is only possible after the uninterrupted exit policies have been learnt and is referred to as a hierarchical greedy policy by Dietterich (2000a).

7. Results

Figure 5 compares the performance of HEXQ against MAXQ and a 'flat' learner on a stochastic taxi task, with each of the four navigation actions performing as intended 80% of the time and 20% of the time moving the taxi randomly to the left or right of the intended action.

Figure 5. Performance of HEXQ vs a 'flat' learner and MAXQ for the stochastic taxi.

The graph shows the number of primitive time steps required to complete each successive trial averaged over 100 runs. For all experiments the Q-learning rate was set at 0.25. The initial Q-values were set to 0 and all actions were greedy, except for those during hierarchy construction as described previously. Exploration is implicitly forced by the initial Q-values and the stochastic actions.

HEXQ performance improves in distinct stages as each level of the hierarchy is constructed. After about 41 trials it surpasses the performance of the 'flat' learner. While the 'flat' learner can start improving performance during the first trial, HEXQ must first order the variables and find exits at the first level before it can improve its performance. HEXQ then learns more rapidly as it transfers sub-task skills. The investment in hierarchy construction 'breaks even' at 220 trials, where the cumulative number of time steps for both the 'flat' learner and HEXQ are equal. With its additional background knowledge MAXQ learns very rapidly. In our version of MAXQ the slower convergence evident is caused by learning the Q-values at all levels simultaneously. Higher level Q-values are initially learnt with inflated costs from the lower level partially learnt policies.

In terms of storage requirements for the value function, a 'flat' learner uses a table of 500 states and 6 actions = 3000 values. HEXQ requires 4 MDPs at level 1 with 25 states and 6 actions = 600 values, 4 MDPs at level 2 with 5 states and 8 actions = 160 values, and 1 MDP at level 3 with 4 states and 4 actions = 16 values: a total of 776 values. MAXQ by comparison requires only 632 values. HEXQ, of course, is not told which actions to apply at different levels and must discover these for itself.
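The table sizes quoted above follow directly from the states times actions products; a quick check:

```python
flat = 500 * 6                                     # 25 cells x 5 passenger x 4 destinations = 500 states, 6 actions
hexq = 4 * (25 * 6) + 4 * (5 * 8) + 1 * (4 * 4)    # level 1 + level 2 + level 3 sub-MDPs
print(flat, hexq)                                  # 3000 776
```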

As a second example the performance of HEXQ is compared to a 'flat' learner for a noisy 3 pin Tower of Hanoi puzzle (ToH) with 7 disks. The state variables are the disks. Their values represent the pin positions. Each disk can be moved to another pin, but only if no other disk is moved in the process and the disks on each pin remain ordered in size with the smallest on top. Actions are defined by disk and target pin. An example action is move-disk3-to-pin2. There are three such actions per disk, a total of 21. The actions are stochastic in the sense that having picked up a disk there is an 80% probability that an intended legal move will succeed and a 20% probability that the disk will attempt to slip randomly to another pin. For illegal moves the disk remains in situ. The reward is -1 per move. The goal is to move all the disks to pin 3. Disks are randomly (but legally) placed on the pins at the start of each trial. The learners are model-free.
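A sketch of the legality check implied by the description above, under an assumed encoding where the state is a tuple of pin positions indexed from the smallest disk upward.

```python
def legal(state, disk, target_pin):
    """A move of `disk` to `target_pin` is legal if the disk is the smallest on its
    current pin and no smaller disk already sits on the target pin.

    state[d] is the pin (1, 2 or 3) holding disk d; disk 0 is the smallest.
    """
    if state[disk] == target_pin:
        return False
    smaller = state[:disk]                    # pins of all disks smaller than `disk`
    return state[disk] not in smaller and target_pin not in smaller

# Example: with 3 disks all on pin 1, only the smallest disk may move.
print(legal((1, 1, 1), 0, 3), legal((1, 1, 1), 2, 3))   # True False
```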

Figure 6. Performance of HEXQ vs a 'flat' learner for a stochastic Tower of Hanoi puzzle with 7 disks, averaged over 10 runs and smoothed.

From figure 6 we see that HEXQ learns the 7 disk ToH in a fraction of the number of trials of the 'flat' learner. HEXQ takes about half the number of steps to converge compared to the 'flat' learner. The recursive value function implements a depth first search which becomes expensive with 7 levels. This issue is shared with MAXQ. A reasonable approach is to only evaluate the function to a certain depth. By analogy, in planning a trip from New York to Sydney we do not take into consideration which side of the bed to get out of on our way to the bathroom on the day of departure. In this example we have limited the search to a depth of only 1. This is possible in the ToH because the recursive constraints are such that the values below level 1 cannot change the outcome. The ToH has alternative action representations. Move-fromPin-toPin would require only 6 actions. With this representation HEXQ can quickly learn the deterministic n disk ToH in time complexity of O(n^2) and space complexity of O(n). With general stochastic actions this representation fails to decompose because every action for every disk state is an exit.

8. Limitations and Future Work

While HEXQ will not perform any worse than a 'flat' learner, it relies on certain constraints in the problem to allow it to find decompositions. If a subset of the variables can form sub-MDPs and we can find policies to reach their exits with probability 1 then HEXQ can find a decomposition. To find sub-MDPs it is necessary that some variables change on a longer timescale. The requirement to be able to exit with certainty ensures that learnt sub-tasks do not present any uncontrollable surprises to higher level tasks. This means that some benign problems, such as navigating around a multi-room domain with stochastic actions that slip in all directions, will not decompose under HEXQ. We leave to future work automatically finding conditions under which the exit-with-certainty constraint can be relaxed.

Problem characteristics may result in the discovery of a large number of exits. For example, if a doorway is modelled with three exit states, indicating that it is possible to exit either left, right or centre, HEXQ will generate and solve 3 sub-MDPs. A designer can choose to combine exits and only generate one sub-MDP. An improvement would be to extend HEXQ to automatically combine exits leading to the same next state. Exit combination heuristics based on general transition properties, such as entering the same next region or having small Q-value differences, suggest themselves. Exit combining will make some problems decomposable as a composite exit may now be reached with probability one. We leave to further study the tradeoff between intra-region value improvement and inter-region value deterioration as exits are combined.

The two heuristics employed by HEXQ are (1) variable ordering and (2) finding non-stationary transitions. If the frequency order of variables is changed HEXQ may still be able to partially decompose the MDP, but in most cases less efficiently. Decompositions under different variable combinations and orderings are further discussed in Hengst (2000). An independent slowly changing random variable would be sorted to the top level in the hierarchy by this heuristic and the MDP will fail to decompose, as it is necessary to explore the variable's potential impact. Further research may be better directed towards selecting relevant variables for the task at hand rather than finding better sorting heuristics.

For the second heuristic, the penalty for recognising extra exits is simply to generate some additional overhead for HEXQ. These exits will require new policies to be learnt and they in turn will need to be explored as new abstract actions in the next level up the hierarchy. This does not detract from the quality of the solution in terms of optimality. If exits are missed the solution may not be affected at all. Alternatively, it may be more circuitous or, in the worst case, fail altogether. It would be possible to incorporate recovery procedures in HEXQ when new exits are belatedly discovered. At this stage we rely on HEXQ to find all relevant exits.

For deterministic shortest path problems HEXQ will find a globally optimal policy. With stochastic actions HEXQ, like MAXQ, is recursively optimal.

9. Conclusion

HEXQ has tackled the problem of discovering hierarchical structure in stochastic shortest path factored MDPs. HEXQ will decompose MDPs if it can find nested sub-MDPs where there are policies to reach any exit with certainty. Generally it will perform well on problems where it can identify large regions, few exits and many contexts in which the sub-tasks are required. HEXQ has automated work that is usually required to be performed by a designer. This includes defining sub-tasks, finding sets of actions that can be performed in each sub-task, finding sub-task termination conditions or sub-goals, finding usable state and temporal abstractions and assigning credit hierarchically. Work is ongoing to generalise HEXQ, building on the key idea of discovering Markov sub-spaces, to handle hidden state and selective perception. We believe the discovery and manipulation of hierarchical representations will prove essential for lifelong learning in autonomous agents.

References

Ashby, R. (1956). Introduction to cybernetics. London: Chapman & Hall.

Bellman, R. (1961). Adaptive control processes: A guided tour. Princeton, NJ: Princeton University Press.

Boutilier, C., Dean, T., & Hanks, S. (1999). Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11, 1-94.

Cormen, T. H., Leiserson, C. E., & Rivest, R. L. (1999). Introduction to algorithms. Cambridge, Massachusetts: MIT Press.

Dean, T., & Lin, S. H. (1995). Decomposition techniques for planning in stochastic domains (Technical Report CS-95-10). Department of Computer Science, Brown University.

Dietterich, T. G. (2000a). Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13, 227-303.

Dietterich, T. G. (2000b). An overview of MAXQ hierarchical reinforcement learning. SARA (pp. 26-44).

Hauskrecht, M., Meuleau, N., Kaelbling, L. P., Dean, T., & Boutilier, C. (1998). Hierarchical solution of Markov decision processes using macro-actions. Fourteenth Annual Conference on Uncertainty in Artificial Intelligence (pp. 220-229).

Hengst, B. (2000). Generating hierarchical structure in reinforcement learning from state variables. PRICAI 2000 Topics in Artificial Intelligence (pp. 533-543). San Francisco: Springer.

Kaelbling, L. P. (1993). Hierarchical learning in stochastic domains: Preliminary results. Machine Learning: Proceedings of the Tenth International Conference (pp. 167-173). San Mateo, CA: Morgan Kaufmann.

Parr, R. E. (1998). Hierarchical control and learning for Markov decision processes. Doctoral dissertation, University of California at Berkeley.

Puterman, M. L. (1994). Markov decision processes: Discrete stochastic dynamic programming. New York, NY: John Wiley & Sons, Inc.

Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, Massachusetts: MIT Press.

Watkins, C. J. C. H., & Dayan, P. (1992). Technical note: Q-learning. Machine Learning, 8, 279-292.

