MIT, Principles of Autonomy and Decision Making (English edition): Learning to Act Optimally (Reinforcement Learning)

Learning to Act Optimally: Reinforcement Learning
Brian C. Williams
16.410, December 8th, 2003
Slides adapted from: Manuela Veloso, Reid Simmons, & Tom Mitchell, CMU

TD-Gammon [Tesauro, 1995]: Learning Through Reinforcement
Learns to play Backgammon.
States:
• Board configurations (~10^20)
Actions:
• Moves
Rewards:
• +100 if win
• -100 if lose
• 0 for all other states
Trained by playing 1.5 million games against itself.
Currently roughly equal to the best human player.

Markov Decision Processes and Reinforcement Learning
• Motivation
• Learning policies through reinforcement
  • Q values
  • Q learning
• Summary
• Appendix: Nondeterministic MDPs

Reinforcement Learning Problem
Given: repeatedly…
• an executed action
• the observed state
• the observed reward
Goal: learn an action policy π: S → A that maximizes the life reward
  r_0 + γ r_1 + γ^2 r_2 + …
from any start state, with discount factor 0 ≤ γ < 1.
[Figure: the agent–environment loop: at each step the agent observes state s_t and reward r_t, then emits action a_t to the environment.]
Note:
• Unsupervised learning
• Delayed reward
• Model not known
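As a concrete illustration of the objective above, here is a minimal Python sketch that computes the discounted life reward r_0 + γ r_1 + γ^2 r_2 + … for a finite reward sequence; the reward stream and the value of γ are made up for the example.

    def discounted_return(rewards, gamma=0.9):
        """Return the sum of gamma**t * r_t over a finite reward sequence."""
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    # Illustrative stream: three zero-reward steps, then a +100 terminal reward.
    print(discounted_return([0, 0, 0, 100], gamma=0.9))   # 0.9**3 * 100 = 72.9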

How About Learning Value Functions?
1. Have the agent learn the optimal value function V*, denoted V.
2. The agent selects the optimal action by one-step lookahead on V:
   π*(s) = argmax_a [r(s, a) + γ V(δ(s, a))]
Problem: this works well if the agent knows the environment model:
• δ: S × A → S
• r: S × A → ℝ
With no model, the agent can't choose an action from V.

Eliminating the Model with Q Functions
π*(s) = argmax_a [r(s, a) + γ V(δ(s, a))]
Key idea:
• Define a function, like V, that encapsulates δ and r:
  Q(s, a) = r(s, a) + γ V(δ(s, a))
• Then, if the agent learns Q, it can choose an optimal action without knowing δ or r:
  π*(s) = argmax_a Q(s, a)
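To make the key idea concrete, here is a minimal Python sketch (with a made-up Q table) of how an agent can pick π*(s) = argmax_a Q(s, a) using only the learned table, with no access to δ or r.

    # Hypothetical Q table, purely for illustration; keys are (state, action) pairs.
    Q = {
        ("s1", "right"): 72.0,
        ("s1", "down"):  81.0,
        ("s2", "left"):  63.0,
        ("s2", "right"): 100.0,
    }

    def greedy_action(Q, state, actions):
        """pi*(s) = argmax_a Q(s, a): pick the action with the largest table entry."""
        return max(actions, key=lambda a: Q.get((state, a), 0.0))

    print(greedy_action(Q, "s1", ["right", "down"]))   # -> "down" (81 > 72)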

How Do We Learn Q?
Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t))
We need to eliminate V* in the update rule.
Note that Q and V* are closely related:
  V*(s) = max_a' Q(s, a')
Substituting Q for V*:
  Q(s_t, a_t) = r(s_t, a_t) + γ max_a' Q(s_{t+1}, a')

Example – Q-Learning Update (γ = 0.9)
Q(s_1, a_right) ← r(s_1, a_right) + γ max_a' Q(s_2, a')
              ← 0 + 0.9 × max{63, 81, 100}
              = 90
[Figure: grid-world snapshots before and after the update, with Q values 63, 72, 81, and 100 on the arrows; the updated arrow's value becomes 90.]
Note: if rewards are non-negative, then
• for all s, a, n: Q_n(s, a) ≤ Q_{n+1}(s, a)
• for all s, a, n: 0 ≤ Q_n(s, a) ≤ Q(s, a)
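The arithmetic of the worked update can be checked directly; the snippet below simply reproduces the slide's numbers (immediate reward 0, successor Q values 63, 81, 100, and γ = 0.9).

    gamma = 0.9
    r = 0                               # immediate reward for moving right from s1
    successor_q_values = [63, 81, 100]  # current Q values available in s2

    q_s1_right = r + gamma * max(successor_q_values)
    print(q_s1_right)                   # 90.0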

Q-Learning Iterations
The agent starts at the top-left corner and moves clockwise around the perimeter.
Initially all values in the Q table are zero; γ = 0.8.
Update rule: Q(s, a) ← r + γ max_a' Q(s', a')
[Figure: a six-state grid world (s1–s6) with reward 10 on the transition into the absorbing goal state s5.]

Pass 1: Q(s4,W) ← r + γ max{Q(s5,loop)} = 10 + 0.8 × 0 = 10;  Q(s1,E) = Q(s2,E) = Q(s3,S) = 0
Pass 2: Q(s3,S) ← r + γ max{Q(s4,W), Q(s4,N)} = 0 + 0.8 × max{10, 0} = 8;  Q(s4,W) = 10
Pass 3: Q(s2,E) ← r + γ max{Q(s3,W), Q(s3,S)} = 0 + 0.8 × max{0, 8} = 6.4;  Q(s3,S) = 8, Q(s4,W) = 10

Crib Sheet: Q-Learning for Deterministic Worlds
Let Q̂ denote the current approximation to Q.
Initially:
• For each s, a, initialize the table entry Q̂(s, a) ← 0
• Observe the current state s
Do forever:
• Select an action a and execute it
• Receive the immediate reward r
• Observe the new state s'
• Update the table entry for Q̂(s, a) as follows: Q̂(s, a) ← r + γ max_a' Q̂(s', a')
• s ← s'
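Below is a minimal Python sketch of the crib-sheet loop. The tiny deterministic environment is a made-up stand-in (a short chain of states with a single rewarded transition into an absorbing state), not the exact grid pictured on the slide, but it produces the same kind of backed-up values (10, 8, 6.4, …).

    import random

    # Made-up deterministic environment: a stand-in, not the slide's exact grid.
    TRANSITIONS = {                      # delta(s, a) -> s'
        ("s1", "E"): "s2",
        ("s2", "E"): "s3",
        ("s3", "S"): "s4",
        ("s4", "W"): "s5",               # s5 is absorbing
    }
    REWARDS = {key: 0.0 for key in TRANSITIONS}
    REWARDS[("s4", "W")] = 10.0          # only the move into s5 is rewarded
    GAMMA = 0.8

    Q = {key: 0.0 for key in TRANSITIONS}

    def actions_in(state):
        return [a for (s, a) in TRANSITIONS if s == state]

    for episode in range(20):
        s = "s1"                                      # observe current state s
        while s != "s5":
            a = random.choice(actions_in(s))          # select an action a and execute it
            r = REWARDS[(s, a)]                       # receive immediate reward r
            s_next = TRANSITIONS[(s, a)]              # observe the new state s'
            best_next = max((Q[(s_next, a2)] for a2 in actions_in(s_next)), default=0.0)
            Q[(s, a)] = r + GAMMA * best_next         # Q(s,a) <- r + gamma * max_a' Q(s',a')
            s = s_next                                # s <- s'

    print(Q)   # converges to Q(s4,W)=10, Q(s3,S)=8, Q(s2,E)=6.4, Q(s1,E)=5.12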

Discussion
• How should the learning agent use the intermediate Q values?
  • Exploration vs. exploitation
• Scaling up in the size of the state space
  • Function approximators (a neural net instead of a table)
  • Reuse, use of macros

Markov Decision Processes and Reinforcement Learning
• Motivation
• Learning policies through reinforcement
  • Q values
  • Q learning
• Summary
• Appendix: Nondeterministic MDPs
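One standard (though not slide-prescribed) answer to the exploration-versus-exploitation question is ε-greedy selection over the intermediate Q values; the sketch below assumes a dictionary-style Q table as in the earlier examples, and the value of ε is made up.

    import random

    def epsilon_greedy(Q, state, actions, epsilon=0.1):
        """With probability epsilon explore uniformly; otherwise exploit the current Q table."""
        if random.random() < epsilon:
            return random.choice(actions)                              # explore
        return max(actions, key=lambda a: Q.get((state, a), 0.0))      # exploit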

Nondeterministic Example
[Figure: a four-state MDP (s1 Unemployed, s2 Industry, s3 Grad School, s4 Academia) with actions R (Research) and D (Development); transitions are stochastic, with probabilities of 0.1, 0.9, and 1.0 on the edges.]

Nondeterministic Case
We redefine V and Q by taking expected values:
  V^π(s_t) = E[r_t + γ r_{t+1} + γ^2 r_{t+2} + …]
  V^π(s_t) = E[Σ_i γ^i r_{t+i}]
  Q(s_t, a_t) = E[r(s_t, a_t) + γ V*(δ(s_t, a_t))]
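Operationally, "taking expected values" can be approximated by averaging discounted returns over sampled episodes; the sketch below assumes a hypothetical sample_episode() helper that rolls out the policy in the stochastic environment and returns its reward sequence.

    def monte_carlo_value(sample_episode, n_episodes=1000, gamma=0.9):
        """Estimate V_pi(s_t) = E[sum_i gamma**i * r_{t+i}] by sample averaging."""
        total = 0.0
        for _ in range(n_episodes):
            rewards = sample_episode()   # hypothetical: one rollout's rewards r_t, r_{t+1}, ...
            total += sum((gamma ** i) * r for i, r in enumerate(rewards))
        return total / n_episodes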

Nondeterministic Case
Alter the training rule to:
  Q_n(s, a) ← (1 − α_n) Q_{n−1}(s, a) + α_n [r + γ max_a' Q_{n−1}(s', a')]
where α_n = 1 / (1 + visits_n(s, a)) and s' = δ(s, a).
We can still prove convergence of Q̂ to Q [Watkins and Dayan, 1992].

Ongoing Research
• Handling the case where the state is only partially observable
• Designing optimal exploration strategies
• Extending to continuous actions and states
• Learning and using δ: S × A → S
• Relationship to dynamic programming
• Multiple learners – multi-agent reinforcement learning
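Here is a minimal sketch of the altered training rule with the visit-count learning rate α_n = 1/(1 + visits_n(s, a)); how s', r, and the successor's action set are obtained from the environment is left abstract.

    from collections import defaultdict

    Q = defaultdict(float)        # Q values, keyed by (state, action)
    visits = defaultdict(int)     # visit counts, keyed by (state, action)
    GAMMA = 0.9

    def nondeterministic_q_update(s, a, r, s_next, next_actions):
        """Q_n(s,a) <- (1 - alpha_n) Q_{n-1}(s,a) + alpha_n [r + gamma * max_a' Q_{n-1}(s',a')]."""
        visits[(s, a)] += 1
        alpha = 1.0 / (1 + visits[(s, a)])
        target = r + GAMMA * max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target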