MIT, Principles of Autonomy and Decision Making (16.410): Planning to Maximize Reward: Markov Decision Processes

Slide 1: Planning to Maximize Reward: Markov Decision Processes
Brian C. Williams, 16.410, December 8th, 2003
Slides adapted from: Manuela Veloso, Reid Simmons, and Tom Mitchell, CMU

Slide 2: Reading and Assignments
• Markov Decision Processes: read AIMA Chapter 17, Sections 1–3.
• This lecture is based on the development in "Machine Learning" by Tom Mitchell, Chapter 13: Reinforcement Learning.

Slide 3: How Might a Mouse Search a Maze for Cheese?
• State space search?
• As a constraint satisfaction problem?
• Goal-directed planning?
• As a rule or production system?
• What is missing?

Slide 4: Ideas in this Lecture
• The objective is to accumulate rewards, rather than to reach goal states.
• The task is to generate policies for how to act in all situations, rather than a plan for a single starting situation.
• Policies fall out of value functions, which describe the greatest lifetime reward achievable at every state.
• Value functions are iteratively approximated.

Slide 5: MDP Examples: TD-Gammon [Tesauro, 1995], Learning Through Reinforcement
• Learns to play Backgammon.
• States: board configurations (about 10^20).
• Actions: moves.
• Rewards: +100 if win, -100 if lose, 0 for all other states.
• Trained by playing 1.5 million games against itself.
• Currently roughly equal to the best human player.

Slide 6: MDP Examples: Aerial Robotics [Feron et al.], Computing a Solution from a Continuous Model
(Figures courtesy of Eric Feron.)
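The reward structure on Slide 5 is sparse enough to write down directly. The sketch below only illustrates that structure; the `winner` argument is a hypothetical summary of the terminal board state, not Tesauro's actual board encoding.

```python
# Illustrative sketch of the sparse TD-Gammon reward structure on Slide 5.
# The `winner` argument is a hypothetical summary of a terminal board state,
# not the board encoding Tesauro actually used.
def reward(winner):
    if winner == "self":
        return 100.0      # +100 if win
    if winner == "opponent":
        return -100.0     # -100 if lose
    return 0.0            # 0 for all other states
```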

Slide 7: Markov Decision Processes (Outline)
• Motivation
• What are Markov Decision Processes (MDPs)?
  • Models
  • Lifetime Reward
  • Policies
• Computing Policies From a Model
• Summary

Slide 8: MDP Problem
The agent and the environment interact in a loop: the agent observes state s_t, chooses action a_t, and receives reward r_t, producing the sequence s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, s_3, ...
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward.

Slide 9: MDP Problem: Model
(Agent-environment interaction loop as on Slide 8: state, reward, action.)
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward.

Slide 10: Markov Decision Processes (MDPs)
Model:
• A finite set of states, S.
• A finite set of actions, A.
• (Probabilistic) state transitions, δ(s, a).
• A reward for each state and action, R(s, a).
Process:
• Observe state s_t in S.
• Choose action a_t in A.
• Receive immediate reward r_t.
• State changes to s_{t+1}.
Example: a small grid world with goal state G; legal transitions are shown, transitions into G receive reward 10, and the reward on unlabeled transitions is 0.
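As a minimal sketch, the model and process on Slide 10 can be written out for a tiny deterministic MDP. The three-state chain below is a hypothetical stand-in for the slide's grid example; its specific transitions and rewards are assumptions, but the structure (states S, actions A, transition function δ, reward R, and the observe/act/reward loop) follows the slide.

```python
# A minimal sketch of the deterministic MDP model and process on Slide 10.
# The three-state chain is a hypothetical stand-in for the slide's grid.

S = ["s1", "s2", "G"]           # finite set of states
A = ["left", "right"]           # finite set of actions

# Deterministic state transitions: delta[(s, a)] -> next state
delta = {
    ("s1", "right"): "s2",
    ("s2", "right"): "G",
    ("s2", "left"):  "s1",
    ("G",  "left"):  "s2",
}

def R(s, a):
    """Reward for each state and action: 10 for entering G, 0 otherwise."""
    return 10.0 if delta.get((s, a)) == "G" else 0.0

# The process: observe s_t, choose a_t, receive r_t, then the state changes.
s = "s1"
for a in ["right", "right"]:
    r = R(s, a)
    s = delta[(s, a)]
    print(f"a={a}, r={r}, next state={s}")
```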

Slide 11: MDP Environment Assumptions
• Markov assumption: the next state and reward are a function only of the current state and action:
  s_{t+1} = δ(s_t, a_t)
  r_t = r(s_t, a_t)
• Uncertain and unknown environment: δ and r may be nondeterministic and unknown.
• Today we only consider the deterministic case.

Slide 12: MDP Nondeterministic Example
• States: S1 Unemployed, S2 Industry, S3 Grad School, S4 Academia.
• Actions: R (Research), D (Development).
• (Transition diagram: each action moves between these states with probabilities 0.1, 0.9, or 1.0.)
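In the nondeterministic case, δ(s, a) gives a distribution over next states rather than a single successor. The sketch below uses the states and actions of Slide 12, but the specific probabilities are illustrative assumptions, since the slide's transition diagram is not reproduced here.

```python
import random

# Sketch of a nondeterministic transition function in the spirit of Slide 12.
# States and actions come from the slide; the probabilities are assumed.
# delta_dist[(state, action)] -> list of (next_state, probability) pairs
delta_dist = {
    ("Grad School", "R"): [("Academia", 0.9), ("Grad School", 0.1)],
    ("Grad School", "D"): [("Industry", 1.0)],
    ("Unemployed", "D"):  [("Industry", 0.9), ("Unemployed", 0.1)],
}

def sample_next_state(s, a):
    """Sample s_{t+1} from the distribution delta(s, a)."""
    next_states, probs = zip(*delta_dist[(s, a)])
    return random.choices(next_states, weights=probs, k=1)[0]

print(sample_next_state("Grad School", "R"))   # usually "Academia"
```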

Slide 13: MDP Problem: Model
(Agent-environment interaction loop: state, reward, action.)
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward.

Slide 14: MDP Problem: Lifetime Reward
(Agent-environment interaction loop: state, reward, action.)
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward.

Slide 15: Lifetime Reward
• Finite horizon: rewards accumulate for a fixed period.
  $100K + $100K + $100K = $300K
• Infinite horizon: assume reward accumulates forever.
  $100K + $100K + ... = infinity
• Discounting: future rewards are not worth as much (a bird in the hand ...). Introduce a discount factor γ < 1:
  $100K + γ $100K + γ² $100K + ... converges.
  This will make the math work.

Slide 16: MDP Problem: Lifetime Reward
V = r_0 + γ r_1 + γ² r_2 + ...
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward.
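To see why discounting makes the infinite sum converge: the geometric series 100K · (1 + γ + γ² + ...) equals 100K / (1 - γ). A quick numerical check, assuming a discount factor of γ = 0.9 (the slide does not fix a value):

```python
# Numerical check that the discounted lifetime reward on Slide 15 converges.
# gamma = 0.9 is an assumed discount factor; the slide does not specify one.
gamma = 0.9
reward = 100_000                                  # $100K per step

partial_sum = sum(reward * gamma**t for t in range(1000))
closed_form = reward / (1 - gamma)

print(partial_sum)    # ~1,000,000 (the series has effectively converged)
print(closed_form)    # 1,000,000.0
```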

Slide 17: MDP Problem: Policy
V = r_0 + γ r_1 + γ² r_2 + ...
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward.

Slide 18: Policies
Assume a deterministic world.
• A policy π : S → A selects an action for each state.
• An optimal policy π* : S → A selects, for each state, the action that maximizes lifetime reward.
(Grid-world figure: each policy is drawn as one arrow out of each state; transitions into the goal G receive reward 10.)
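In a deterministic world a policy is just a lookup table from states to actions, and its lifetime reward can be accumulated by following it. The sketch below uses a hypothetical two-step chain rather than the slide's grid, with an assumed discount factor of 0.9.

```python
# Sketch of a policy pi : S -> A and its discounted lifetime reward in a
# deterministic world (Slide 18). States, rewards, and gamma are assumptions.
gamma = 0.9

delta = {("s1", "right"): "s2", ("s2", "right"): "G", ("G", "stay"): "G"}
R     = {("s1", "right"): 0.0,  ("s2", "right"): 10.0, ("G", "stay"): 0.0}

# A policy selects one action for each state.
pi = {"s1": "right", "s2": "right", "G": "stay"}

def lifetime_reward(s, pi, horizon=100):
    """V = r_0 + gamma*r_1 + gamma^2*r_2 + ... following pi from s (truncated)."""
    total = 0.0
    for t in range(horizon):
        a = pi[s]
        total += gamma**t * R[(s, a)]
        s = delta[(s, a)]
    return total

print(lifetime_reward("s1", pi))   # 0 + 0.9*10 + 0 + ... = 9.0
```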

Slide 19: Policies
• There are many policies; not all are necessarily optimal.
• There may be several optimal policies.
(Three grid-world figures showing different policies over the same rewards.)

Slide 20: Markov Decision Processes (Outline)
• Motivation
• Markov Decision Processes
• Computing Policies From a Model
  • Value Functions
  • Mapping Value Functions to Policies
  • Computing Value Functions through Value Iteration
  • An Alternative: Policy Iteration (appendix)
• Summary
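The value iteration named in the Slide 20 outline is not developed in this excerpt. As a preview only, a deterministic value-iteration sketch might look like the following; the tiny chain, rewards, and discount factor are assumptions, and the update shown is the standard Bellman backup rather than anything specific to these slides.

```python
# Preview sketch of value iteration for a deterministic MDP (named on Slide 20;
# not developed in this excerpt). The model and gamma are assumptions.
gamma = 0.9
delta = {("s1", "right"): "s2", ("s2", "right"): "G", ("G", "stay"): "G"}
R     = {("s1", "right"): 0.0,  ("s2", "right"): 10.0, ("G", "stay"): 0.0}
states = {"s1", "s2", "G"}

def actions(s):
    return [a for (s_, a) in delta if s_ == s]

# Iteratively approximate the value function (Slide 4: "value functions are
# iteratively approximated"): V(s) <- max_a [ R(s,a) + gamma * V(delta(s,a)) ]
V = {s: 0.0 for s in states}
for _ in range(100):
    V = {s: max(R[(s, a)] + gamma * V[delta[(s, a)]] for a in actions(s))
         for s in states}

# A greedy policy then falls out of the value function.
pi = {s: max(actions(s), key=lambda a, s=s: R[(s, a)] + gamma * V[delta[(s, a)]])
      for s in states}
print(V)    # {'s1': 9.0, 's2': 10.0, 'G': 0.0} (dict order may vary)
print(pi)   # {'s1': 'right', 's2': 'right', 'G': 'stay'}
```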