Published: Aug 31, 2024
Duration: 01:05:53
Category: Science & Technology
hi everybody, I'm Emma Brunskill. I'm an assistant professor in computer science, and welcome to CS234, which is a reinforcement learning class designed to be an entry-level master's or PhD introduction to reinforcement learning. What we're going to do today is: I'm going to start with a really short overview of what reinforcement learning is, and then we're going to go through course logistics, and when I go through course logistics I'll also pause and ask for any questions about logistics. The website is now live, and that and Piazza will be the best sources of information about the class. So I'll stop at that point to ask if there's anything I don't go over that you have questions about, and if you have questions about the waitlist or anything relating to your own circumstances, feel free to come up to me at the end. The third part of the class is where we start to get into the technical content: an introduction to sequential decision-making under uncertainty.

Just so I have a sense before we get started: who here has taken a machine learning class? All right. Who here has taken AI? Okay, a little bit less, but most people. All right, great. So probably everybody here has seen a little bit about reinforcement learning; it varies a little depending on where you've been. We will be covering material starting from the beginning, as if you don't know any reinforcement learning, but then we'll rapidly be getting to other content that's beyond anything covered in at least other Stanford-related classes.

So reinforcement learning is concerned with this really foundational issue of how an intelligent agent can learn to make a good sequence of decisions. That's a single sentence that summarizes what reinforcement learning is doing and what we'll be covering during this class, but it encodes a lot of really important ideas. The first is that we're really concerned with sequences of decisions. In contrast to a lot of what is covered in machine learning, we're going to be thinking about intelligent agents in general, which might or might not be human or biological, and how they can make not just one decision but a whole sequence of decisions. The second thing is that we're concerned with goodness: how do we learn to make good decisions? What we mean by good here is some notion of optimality; we have some utility measure over the decisions being made. And the final critical aspect of reinforcement learning is the learning: the agent doesn't know in advance how its decisions are going to affect the world, or which decisions might be associated with good outcomes, and instead it has to acquire that information through experience.

When we think about this, it's really something we do all the time; we've done it since we were babies, trying to figure out how to achieve high reward in the world, and there's a lot of exciting work going on in neuroscience and psychology that's trying to think about the same fundamental issue from the perspective of human intelligent agents. So I think that if we want to solve AI, or make significant progress on it, we have to be able to make significant progress in creating agents that do reinforcement learning.
So where does this come up? There's a nice example from Yael Niv, an amazing psychology and neuroscience researcher over at Princeton, where she gives the example of a primitive creature which develops as follows during its lifetime: when it's a baby it has a primitive brain and one eye, and it swims around and attaches to a rock, and then when it's an adult it digests its brain and just sits there. So maybe this is some indication that the point of intelligence, or the point of having a brain, is at least in part to help guide decisions, and once all the decisions in the agent's life have been completed, maybe it no longer needs a brain. This is just one example of a biological creature, but I think it's a useful reminder to ask why an agent would need to be intelligent, and whether that's somehow fundamentally related to the fact that it has to make decisions.

Now, of course, there's been a real paradigm shift in reinforcement learning. Around 2015, at the NeurIPS conference, which is one of the main machine learning conferences, David Silver came to a workshop and presented these incredible results of using reinforcement learning to directly control Atari games. These are important whether you like video games or not: video games are a really interesting example of complex tasks that often take human players a while to learn; we don't know how to do them in advance, it takes us at least a little bit of experience. And the really incredible thing about this example, which is Breakout, is that the agent learns to play directly from pixel input. From the agent's perspective it's just seeing these colored pixels coming in, and it has to learn what the right decisions are in order to play the game well, and in fact even better than people. It was really incredible that this was possible. When I first started doing reinforcement learning, a lot of the work was focused on very artificial toy problems; many of the foundations were there, but these sorts of larger-scale applications were really lacking. So in the last five years we've seen a huge improvement in the types of techniques in reinforcement learning and in the scale of the problems that can be tackled.

And it's not just in video game playing, it's also in things like robotics. In particular, some of my colleagues up at UC Berkeley have been doing incredible work on robotics, using reinforcement learning in these types of scenarios to have agents do grasping, fold clothes, things like that. Those examples, if you've looked at reinforcement learning before, are probably the ones you've heard about: things like video games or robotics. But one of the things I think is really exciting is that reinforcement learning is actually applicable to a huge number of domains, which is both an opportunity and a responsibility. In particular, I direct the AI for Human Impact lab here at Stanford, and one of the things we're really interested in is how to use artificial intelligence to help amplify human potential. One way you can imagine doing that is through something like educational games, where the goal is to figure out how to quickly and effectively teach people material such as fractions.
Another really important application area is healthcare. This is a snapshot of some work on seizures done by Joelle Pineau up at McGill University, and I think there's also a lot of excitement right now about how we can use AI, and in particular reinforcement learning, to interact with things like electronic medical record systems and use them to inform patient treatment. There's also a lot of recent excitement about using reinforcement learning in many other applications, as a kind of optimization technique for problems that are otherwise really hard to solve, and this is arising in things like natural language processing, vision, and a number of other areas.

So if we have to think about the key aspects of reinforcement learning, they probably boil down to the following four, and these are the things that distinguish it from other areas of AI and machine learning. Reinforcement learning, from my sentence about learning to make good decisions under uncertainty, fundamentally involves optimization, delayed consequences, exploration, and generalization. Optimization naturally comes up because we're interested in good decisions: there's some notion of relative quality among the different decisions we could make, and we want decisions that are good. The second issue is delayed consequences. This is the challenge that you might not realize whether a decision made now was a good one until much later. You eat the chocolate sundae now, and you don't realize until an hour later that eating all of that ice cream was a bad idea; or, in the case of video games like Montezuma's Revenge, you have to pick up a key and only much later realize that was helpful; or you study really hard on a Friday night and in three weeks you do well on the midterm. One of the challenges here is that because you don't necessarily receive immediate outcome feedback, you face what is known as the credit assignment problem: how do you figure out the causal relationship between the decisions you made in the past and the outcomes in the future? That's a really different problem than what we tend to see in most of machine learning.

One of the things that comes up when we start to think about this is how we do exploration. The agent is fundamentally trying to figure out how the world works through experience, and in much of reinforcement learning we think about the agent as being a kind of scientist, trying things out in the world, like an agent that tries to ride a bicycle and learns how physics and riding a balanced bike work by falling. One of the really big challenges here is that the data is censored, and what we mean by censoring in this case is that you only get to learn about what you actually try. All of you are here at Stanford, so clearly that was the optimal choice, but you don't actually get to find out what it would have been like if you had gone to MIT. It's possible that would have been a good choice as well, but you can't experience it, because you only get to live one life, and so you only see the outcome of the particular choice made at this particular time.

So one question you might wonder about relates to policies: we're going to talk a lot about policies, and a decision policy is going to be some mapping from experiences to a decision.
You might ask why this mapping needs to be learned. If we think about something like DeepMind's Atari-playing agent, it was learning from pixels; it was essentially learning from the space of images what to do next. If you wanted to write that down as a program, a series of if-then statements, it would be absolutely enormous; it's not tractable. This is why we need some form of generalization, and why it may be much better to learn from data directly, as well as to have some high-level representation of the task, so that even if we run into a particular configuration of pixels we've never seen before, our agent still knows what to do.

So these are the four things that really make up reinforcement learning, at least online reinforcement learning. Why are they different from other types of AI and machine learning? Another thing that comes up a lot in artificial intelligence is planning. For example, the game of Go can be thought of as a planning problem. What does planning involve? It involves optimization, often generalization, and delayed consequences: you might take a move in Go early, and it might not be immediately obvious whether it was a good move until many steps later. But it doesn't involve exploration. The idea in planning is that you're given a model of how the world works: you're given the rules of the game, for example, and you know what the reward is, and the hard part is computing what you should do given that model of the world. So it doesn't require exploration.

Supervised machine learning, compared to reinforcement learning, often involves optimization and generalization, but frequently it involves neither exploration nor delayed consequences. It doesn't tend to involve exploration because typically in supervised learning you're given a dataset; your agent isn't collecting its own experience or data about the world. Instead it's given experience, and then it has to use that to, say, infer whether an image contains a face or not. Similarly, it's typically making essentially one decision, like whether this image is a face or not, instead of making decisions now and only learning later whether those were the right decisions. Unsupervised machine learning also involves optimization and generalization, but generally does not involve exploration or delayed consequences, and typically you have no labels about the world. In supervised learning you often get the exact label, like whether this image really contains a face; in unsupervised learning you normally get no labels; and in RL you typically get something halfway in between, where you get a utility for the label you output. For example, you might decide there's a face in an image, and the feedback might say, okay, we'll give you partial credit for that, because maybe there's something that looks sort of like a face, but you don't get the true label of the world. Or maybe you decide to go to Stanford, and then you think, okay, that was a really great experience, but I don't know if it was, quote unquote, the right choice.

Imitation learning, which is something we'll probably touch on briefly in this class and which is becoming very important, is similar but a little bit different. It involves optimization, generalization, and often delayed consequences, but the idea is that we're going to be learning from the experience of others.
So instead of our intelligent agent getting to take its own actions in the world and make its own decisions, it might watch another intelligent agent, which might be a person, make decisions, observe the outcomes, and then use that experience to figure out how it wants to act. There can be a lot of benefits to doing this, but it's a little bit different because the agent doesn't have to directly think about the exploration problem.

I just want to spend a little more time on imitation learning, because it has become increasingly important. To my knowledge it was first really popularized by Andrew Ng, a former professor here, through some of his helicopter work, where he looked at expert flights together with Pieter Abbeel, a professor over at Berkeley, to see how you could very quickly imitate experts flying toy helicopters. That was one of the first major application successes of imitation learning. It can be very effective, but there can be challenges: essentially, if you get to observe one trajectory, let's imagine it's a helicopter flying in a circle, and your agent learns something that isn't exactly the same as what the expert was doing, you can start to drift off that path and venture into territory where you really don't know what the right thing to do is. So there's been a lot of work that combines imitation learning and reinforcement learning, which ends up being very promising.

In terms of how we think about doing reinforcement learning, we can build on a lot of these different types of techniques, and then also think about the challenges that are unique to reinforcement learning, which involves all four of these issues. These RL agents really need to explore the world and then use that exploration to guide their future decisions. We'll talk more about this throughout the course. A really important question that comes up is where these rewards come from: where is the information that the agents are using to judge whether their decisions are good, who is providing it, and what happens if it's wrong? We'll talk a lot more about that. We won't talk very much about multi-agent reinforcement learning systems, but that's also a really important case, as is thinking about game-theoretic aspects.

All right, so that's a really short overview of some of the aspects of reinforcement learning and why it's different from some of the other classes you might have taken. Now we're going to go briefly through course logistics and then start on more of the content, and I'll pause after course logistics to ask for any questions. In terms of prerequisites, we expect that everybody here has taken either an AI class or a machine learning class, either here at Stanford or the equivalent at another institution. If you're not sure whether you have the right background for the class, feel free to reach out to us on Piazza and we'll respond; if you've done extensive work on related things it will probably be sufficient. In general we expect that you have basic Python proficiency and that you're familiar with probability, statistics, and multivariable calculus; things like gradient descent, losses, and derivatives should all be very familiar to you. I expect most people have probably heard of MDPs before, but it's not totally critical.
This is a long list, but I'll go through it slowly because I think it's pretty important: these are the goals and learning objectives for the class. These are the things we expect you to be able to do by the time you finish this class, and it's our role to help you get there. The first is to be able to define the key features of reinforcement learning that distinguish it from other types of AI and machine learning problem framings; that's some of what I was doing so far in this class, figuring out how RL distinguishes itself from other types of problems. Related to that, most of you will probably not end up being academics; most of you will go into industry, and one of the big challenges there, when you're faced with a particular problem from your boss, or when you're giving a problem to one of your supervisees, is to think about whether it should be framed as a reinforcement learning problem and which techniques are applicable. So I think it's very important that by the end of this class, if you're given a real-world problem like web advertising, patient treatment, or a robotics problem, you have a sense of whether it's useful to formulate it as a reinforcement learning problem, how to write it down in that framework, and which algorithms are relevant. During the class we'll also introduce you to a number of reinforcement learning algorithms, and you will have the chance to implement them in code, including deep reinforcement learning problems. Another really important aspect, if you're trying to decide what tools to use for a particular robotics or healthcare problem, is understanding which of the algorithms is likely to be beneficial and why. In addition to empirical performance, I think it's really important to understand how we evaluate algorithms in general: can we use theoretical tools like regret and sample complexity, as well as computational complexity, to decide which algorithms are suitable for particular tasks? And the final thing: one really important aspect of reinforcement learning is exploration versus exploitation, the issue that arises because agents have to figure out what decisions to make and what they will learn about the environment by making those decisions. By the end of the class you should be able to compare different types of techniques for exploration versus exploitation and describe their strengths and limitations. Does anybody have any questions about these learning objectives?

Okay. We'll have three main assignments for the class, we'll also have a midterm, we'll have a quiz at the end of the class, as well as a final project. The quiz is a little bit unusual, so I just want to spend a little time talking about it now. The quiz is done both individually and in groups. The reason we do this is that we want a low-stakes way for people to practice with the material they learn in the second half of the course, in a way that's fun and engaging, really gets you to think about it, and lets you learn from your peers. We did it last year, and I think a number of people were a little nervous about how it would go beforehand, and then ended up really enjoying it.
So the way the quiz works is that it's a multiple-choice quiz. At the beginning everybody does it by themselves, and then after everybody has submitted their answers, we do it again in groups that are pre-assigned by us. The goal is that you have to get everyone in the group to agree on what the right answer is before you scratch off and see the correct answer, and we grade it according to whether you scratched off the right answer first or not. You can't do worse than your individual grade, so doing it in a group can only help you. SCPD students don't do it in groups; they just write down justifications for their answers. Again, it's a pretty lightweight way to do assessment: the goal is that you have to be able to articulate why you believe the answers are what they are, discuss them in small groups, and use that to figure out what the correct answer is.

The final project is probably pretty similar to projects you've done in other classes. It's an open-ended project, a chance to reason about and think about reinforcement learning in more depth. We will also be offering a default project, which will be announced over the next couple of weeks, before the first milestone is due. If you choose the default project, your grade breakdown, because you will not need to do a proposal or milestone, will be based on the project presentation and your write-up. Since we believe you are all each other's best resource, we use Piazza, and that should be used for pretty much all class communication unless it's something of a private or sensitive nature, in which case of course feel free to reach out to the course staff directly; for things like lectures, homeworks, and project questions, pretty much all of that should go through Piazza. For the late-day policy, we have six late days; for details see the web page, and for the collaboration policy please also see the web page. Before we go on to the next part, does anyone have any questions about logistics for the class?

Okay, let's get started. We're now going to do an introduction to sequential decision-making under uncertainty. A number of you will have seen some of this content before; we will be going into it in probably more depth than you've seen, including some theory (not today, but in other lectures), and then later in the class we'll move on to content that will be new to all of you. In sequential decision-making under uncertainty, the fundamental thing we think about is an interactive, closed-loop process where we have some agent, an intelligent agent hopefully, that is taking actions that affect the state of the world, and then it gets back an observation and a reward. The key goal is that the agent is trying to maximize the total expected future reward. This expected aspect is going to be important, because sometimes the world itself will be stochastic, and so the agent is maximizing things in expectation. That may not always be the right criterion; it's what the majority of reinforcement learning has focused on, but there's now some interest in distributional RL and some other formulations. One of the key challenges here is that it can require balancing between immediate and long-term rewards.
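A minimal sketch of that closed-loop interaction, assuming a made-up toy environment and a placeholder random agent just to show the shape of the loop (all class and method names here are hypothetical, not part of any particular library):

```python
import random

class RandomAgent:
    """Placeholder agent: picks among a fixed set of actions uniformly at random."""
    def __init__(self, actions):
        self.actions = actions
    def act(self, obs):
        return random.choice(self.actions)
    def observe(self, action, obs, reward):
        pass  # a learning agent would update itself from this experience

class ToyEnv:
    """Placeholder environment: reward 1 when the action matches a hidden target."""
    def __init__(self):
        self.target = 1
    def reset(self):
        return 0  # initial observation
    def step(self, action):
        reward = 1.0 if action == self.target else 0.0
        done = False
        return 0, reward, done  # next observation, reward, episode-finished flag

def run_episode(agent, env, gamma=0.99, max_steps=100):
    obs = env.reset()
    total_return, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = agent.act(obs)                # agent decides based on what it has seen
        obs, reward, done = env.step(action)   # world responds with an observation and reward
        agent.observe(action, obs, reward)     # agent can learn from the experience
        total_return += discount * reward      # accumulate the discounted sum of rewards
        discount *= gamma
        if done:
            break
    return total_return

print(run_episode(RandomAgent([0, 1]), ToyEnv()))
```

The quantity accumulated in `total_return` is exactly the discounted sum of rewards that the agent is trying to maximize in expectation.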
It might also require strategic behavior in order to achieve those high rewards, meaning you might have to sacrifice higher immediate rewards in order to achieve better rewards over the long term. As an example, in something like web advertising you might have an agent running a website that has to choose which ad to show a customer. The customer gives you back an observation, such as how long they spent on the web page, and you also get some information about whether they clicked on an ad. The goal might be to have people click on ads as much as possible, so you have to pick which ad to show people so that they're going to click. Another example is a robot that's unloading a dishwasher. In this case the action space of the agent might be joint movements, the information the agent gets back might be a camera image of the kitchen, and it might get a plus-one reward if there are no dishes on the counter. Here the reward would generally be delayed: for a long time there are going to be dishes on the counter, unless the robot just sweeps all of them off and has them crash onto the floor, which may or may not be the intended goal of the person who wrote the system, and so it may have to make a sequence of decisions without getting any reward for a long time. Another example is something like blood pressure control, where the actions might be things like prescribing exercise or prescribing medication, the observation we get back is the individual's blood pressure, and the reward might be plus one if the blood pressure is in a healthy range, maybe a small negative reward if medication is prescribed, due to side effects, and maybe zero otherwise.

Okay, so let's think about another case, like some of the cases I think about in my lab: an artificial tutor. You could have a teaching agent, and what it gets to do is pick an activity, a teaching activity. Let's say it only has two types of teaching activities to give: it either gives an addition activity or a subtraction activity. It gives this to a student, and the student gets the problem either right or wrong. Let's say the student initially does not know addition or subtraction; it's a kindergartner, the student doesn't know anything about math, and we're trying to figure out how to teach the student math. The reward structure for the teaching agent is that it gets a plus one every time the student gets something right, and a minus one if the student gets it wrong. So I'd like you to take a minute, turn to somebody nearby, and describe what you think an agent that's trying to maximize its expected rewards would do in this type of case: what type of problems it would give to the student, and whether that is doing the right thing.

Let me just clarify here: let's assume that for most students addition is easier than subtraction, so even though the student doesn't know either of these things, the skill of addition is simpler for a new student to learn than subtraction. What might happen under those conditions? Does anybody want to raise their hand and tell me what they and somebody nearby them were thinking might happen for an agent in this scenario?
The agent would give them really easy addition problems. That's correct, and that's exactly what happened. There's a nice paper from approximately 2000 with Beverly Woolf, one of the earliest that I know of, where they used reinforcement learning to create an intelligent tutoring system, and the reward was for the agent to give problems to the student and have them get them correct, because if the student is getting things correct then presumably they've learned them. The problem with that reward specification is that what the agent learns to do is give really easy problems: maybe the student doesn't know how to do those initially, but they quickly learn how, and then there's no incentive to give hard problems. So this is a small example of what is known as reward hacking, which is that your agent is going to learn to do exactly what you tell it to do through the reward function you specify, and yet in reinforcement learning we often spend very little of our time thinking carefully about what that reward function is. Whenever you get out into the real world this is the really critical part: normally it is the designer who gets to pick the reward function, the agent does not have an intrinsic internal reward, and depending on how you specify it, the agent will learn to do different things. Yeah, was there a question in the back?

The question: in this case it seems like the student would also be an RL agent, and in real life a student who does well would likely ask for harder questions and get rewarded in that sense; are there techniques to approach that, or is it okay to ignore it? So the question points out that we also think people are probably reinforcement learning agents, and that's exactly correct; maybe they would start to say, hey, I need to get harder questions, or otherwise be interactive in this process. For most of this class we're going to ignore the fact that the world we interact with might itself also be an RL agent. In reality it's really critical. Sometimes this is considered in an adversarial way, as in game theory; I think one of the most exciting things is when we think about it in a cooperative way. Who here has heard of the subdiscipline of machine teaching? Nobody yet. It's a really interesting area that's been around for maybe five to ten years, some of it a bit longer, and one of the ideas there is: what happens if you have two intelligent agents interacting with each other, where each knows the other is trying to help it? There's a really nice classic example; sorry for those of you who aren't so familiar with machine learning, but imagine you're trying to learn a classifier to decide where along a line things switch from positive to negative. In general you're going to need some number of samples, call it N, which is roughly the number of points on the line where you have to get positive or negative labels. If you're in an active learning setting, you can generally reduce that to roughly log N by being strategic about which points on the line you ask people to label. One of the really cool things about machine teaching is that if I know you are trying to teach me where to divide this line, you only need one point, or at most two points, essentially a constant: because if I'm trying to teach you, there's no way I'm just going to randomly label things; I'm going to give you a single plus and a minus, and that's going to tell you exactly where the line goes.
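As a rough way to write down that scaling for learning a one-dimensional threshold to accuracy epsilon (an informal summary under standard assumptions, with constants and confidence terms omitted; the lecture only states it in terms of "N points on the line"):

```latex
% Informal sample-complexity comparison for learning a 1-D threshold classifier:
\text{passive supervised learning:}\quad O\!\left(1/\epsilon\right) \text{ labeled points}
\text{active learning (binary search):}\quad O\!\left(\log(1/\epsilon)\right) \text{ label queries}
\text{machine teaching (helpful teacher):}\quad O(1) \text{ examples, one on each side of the boundary}
```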
So that's one of the reasons why, if an agent knows that another agent is trying to teach it something, it can actually be enormously more efficient than what we normally think of for learning. I think there's a lot of potential for machine teaching to be really effective, but all that said, we're going to ignore most of that for the course; if it's something you want to explore in your project, you're very welcome to, and there are a lot of connections with reinforcement learning.

Okay, so if we think about this process in general, if we think of a sequential decision-making process, we have this agent, and we're going to think almost always about there being discrete time. The agent makes a decision, it affects the world in some way, the world gives back some new observation and a reward, and the agent receives those and uses them to make another decision. In this setting, when we talk about a history, what we mean is simply the sequence of previous actions the agent took and the observations and rewards it received. The second thing that's really important is to define a state space. Often when this is first discussed it's treated as some immutable thing, but whenever you're in a real application this is exactly what you have to define: how to write down the representation of the world. What we're going to assume in this class is that the state is a function of the history. There might be other sensory information the agent would like to have access to in order to make its decision, but it's going to be constrained to the observations it has received so far, the actions it has taken, and the rewards it has observed. Now, there's also going to be some real world state; that's the real world, and the agent doesn't necessarily have access to it, only to a small subset of it. For example, as a human right now I have eyes that let me look forward roughly 180 degrees, but I can't see behind my head; behind my head is still part of the world state. So the world state is the real world, and the agent has its own state space it uses to try to make decisions, which in general we'll assume is some function of the history.

Now, one assumption we're going to use a lot in this class, which you've probably seen before, is the Markov assumption. The Markov assumption says that the state used by the agent is a sufficient statistic of the history: in order to predict the future, you only need to know the current state of the environment. This basically says that the future is independent of the past given the present, provided that the present is the right aggregate statistic.
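Written out in symbols (one common notation, matching the informal statement above):

```latex
% History at time t, and the agent's state as a function of it:
h_t = (a_1, o_1, r_1, \ldots, a_t, o_t, r_t), \qquad s_t = f(h_t)
% The state s_t is Markov (a sufficient statistic of the history) if
p(s_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid h_t, a_t)
```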
As a couple of examples of this... yes, question? Can you explain, maybe with an example, the difference between the state and the history? I'm having trouble. Yeah. So for the state, if we think about something like a robot, let's say you have a robot that is walking down a long corridor; say there are two long corridors. Your robot starts here, it tries to go right, and then it goes down, down, down. And let's say its sensors only let it observe whether there is a wall on any of its sides: the observation space of the robot is simply whether there is a wall on each of its four sides. Sorry, that's probably a little small in the back, but the agent basically has some sort of local sensor, a laser range finder or something like that, so it knows whether there's a wall immediately around its current square and nothing else. In this case, what the agent would see is that initially the walls look like this, then like this, then like this, then like this. The history would include all of that, but its local state is just the current observation. That starts to matter when you're going down the corridor, because there are many places that look identical; if you keep track of the whole history the agent can figure out where it is, but if it only keeps track of its local observation, a lot of aliasing can occur.

I put up a couple of examples here. In something like hypertension control, you can imagine the state is just the current blood pressure, and your action is whether to take medication or not; current blood pressure meaning, say, what your blood pressure is every second. Do you think this sort of system is Markov? I see some people shaking their heads: almost definitely not. There are other features that matter, like whether you're exercising, whether you just ate a meal, whether it's hot outside, whether you just got off an airplane; all of these probably affect whether your next blood pressure reading is going to be high or low, particularly in response to some medication. Similarly, in something like website shopping, you can imagine the state is just the product you're looking at right now; say I'm on Amazon looking at some computer, that's what's up on my web page, and the action is which other products to recommend. Do you think that system is Markov? Question: do you mean the system generally is not Markov, or that the assumption is Markov and this representation just doesn't fit? So the question is whether the system in general is Markov and the assumption just doesn't fit this choice. To give some more detail: what I mean here is, is this particular choice of representation Markov? There's the real world going on, and then there's the model of the world that the agent can use, and what I'm arguing is that these particular models of the world are not Markov. There might be other models of the world that are, but if we choose this particular observation, say just the current blood pressure, as our state, that is probably not really a Markov state. That doesn't mean we can't use algorithms that treat it as if it is; it's just that we should be aware we might be violating some of those assumptions. Yeah? I'm wondering, if you include enough history into the state, can you make any problem Markov? It's a great question: why is this assumption so popular, and can you always make something Markov? Generally yes: if you include the whole history, then you can always make the system Markov. In practice you can often get away with just using the most recent observation, or maybe the last four observations, as a reasonably sufficient statistic; it depends a lot on the domain.
There are certainly domains, like the navigation world I put up there, where it's really important either to use the whole history as the state or to think about the partial observability, and other cases where the most recent observation is completely sufficient. One of the challenges is that you might not want to use the whole history, because that's a lot of information and you have to keep track of it over time, so it's much nicer to have a compact sufficient statistic. Of course, some of these things are changing a little with LSTMs and related models, so some of our prior assumptions about how things scale with the size of the state space are shifting a bit with deep learning, but historically there have certainly been advantages to having a smaller state space, with implications for computational complexity, the data required, and the resulting performance. To give some intuition for why: if you made your state everything that has ever happened in your life, that would give you a really rich representation, but you would only have one data point for every state; nothing would ever repeat, so it would be really hard to learn, because all states are different. In general, if we want to learn how to do something, we're going to need some form of generalization, or some form of clustering or aggregation, so that we can compare experiences and learn from prior similar experience.

If we assume the observation is the state, so we treat the most recent observation the agent gets as the state, then the agent is modeling the world as a Markov decision process: it takes an action, gets an observation and reward, and sets the state it uses to be that observation. If instead the agent treats the world as partially observable, then the agent state is not the same as the world state, and the agent uses things like the history, or beliefs about the world state, to aggregate the sequence of previous actions taken and observations received, and uses that to make its decisions. For example, in something like poker, you get to see your own cards; other players have cards that are clearly affecting the course of the game, but you don't necessarily know what those are. You can see which cards are discarded, so it's naturally partially observable, and you can maintain a belief state over what the other players' cards are and use that information to make your decisions. Similarly, in healthcare there are a whole bunch of really complicated physiological processes going on, but you can monitor parts of them, things like blood pressure or temperature, and use those to make decisions.

In terms of types of sequential decision-making processes, one of them is bandits; we'll talk more about these later. A bandit is a really simple version of a Markov decision process, in the sense that the actions that are taken have no influence over the next observation.
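One informal way to write that distinction down (my notation, under the simple assumption that the next observation or state depends only on the current one and the action):

```latex
% Bandit: the action taken now does not influence the next observation/context.
p(o_{t+1} \mid o_t, a_t) = p(o_{t+1} \mid o_t)
% MDP / POMDP: the action does influence the next state (and hence future observations).
p(s_{t+1} \mid s_t, a_t) \neq p(s_{t+1} \mid s_t) \quad \text{in general}
```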
So when might this be reasonable? Imagine you have a series of customers coming to your website, and you show each of them an ad, and they either click on it or not, and then another customer logs into your website. In this case the ad you show to customer one generally doesn't affect which customer two comes along. It could, in really complicated ways; maybe customer one goes to Facebook and says I really love this ad, you should go watch it; but most of the time whatever ad you show to customer one does not at all affect who logs into your website next. So the decisions you make only affect the first customer, and customer two is essentially independent. Bandits have been really important for at least fifty years; people have thought about them for things like how to allocate people to clinical trials, for websites, and for a whole bunch of other applications. MDPs and POMDPs say: no, wait, the actions you take can affect the state of the world; they often affect the next observation you get, as well as the reward, and you have to think about this closed-loop system in which the actions you take change the state of the world. The product I recommend to my customer might affect the customer's opinion on the next time step; in fact you hope it will. So in these cases we think about the actions actually affecting the state of the world.

Another important question is how the world changes. One idea is that it changes deterministically: when you take an action in a particular state, you go to a different state, but the state you go to is deterministic, there's only one. This is a pretty common assumption in a lot of robotics and controls. I remember Tomás Lozano-Pérez, a professor over at MIT, once suggesting to me that if you flip a coin it's actually a deterministic process; we just model it as stochastic because we don't have good enough models. So there are many processes that, if you could write down a sufficiently perfect model of the world, would actually look deterministic, but in many cases it's hard to write down those models, and so we approximate them as stochastic: the idea is that when we take an action, there are many possible outcomes. You can show an ad to someone and they may or may not click on it, and we may just want to represent that with a stochastic model.

So let's think about a particular example: a Mars rover. When we deploy rovers or robots on really far-off planets, it's hard to do communication back and forth, so it would be nice to make these robots more autonomous. Let's imagine we have a very simple Mars rover in a seven-state system: it has just landed, it's got a particular location, and it can either try to go left or try to go right. I write down "try left" or "try right", meaning that's what it attempts to do, but it may succeed or fail. Let's imagine there are different sorts of scientific information to be discovered: over in S1 there's a little bit of useful scientific information, over in S7 there's an incredibly rich site where there might be water, and there's zero in all other states. We'll go through that as a running example as I talk about the common components of an RL agent.
One common component is a model. A model is simply a representation the agent has for how the world changes as it takes its actions and what rewards it might get. In the case of a Markov decision process, it's a model that says: if I start in this state and take this action a, what is the distribution over next states I might reach? It also has a reward model that predicts the expected reward of taking an action in a certain state. In this case, let's imagine the agent's reward model thinks there's zero reward everywhere, and let's imagine it thinks its motor control is very bad, so it estimates that whenever it tries to move, with 50% probability it stays in the same place and with 50% probability it actually moves. Now, the model can be wrong: if you remember what I put up before, the actual reward is that in state S1 you get plus one, in state S7 you get ten, and everywhere else you get zero, whereas the reward model I just wrote down is zero everywhere. This is a totally reasonable reward model for the agent to have; it just happens to be wrong, and in many cases the model will be wrong, but it can often still be used by the agent in useful ways.

The next important component, which is always needed by an RL agent, is a policy. The policy, or decision policy, is simply how we make decisions. Because we're thinking about Markov decision processes here, we'll think of policies as mappings from states to actions. A deterministic policy means there's one action per state, and a stochastic policy means you have a distribution over the actions you might take: maybe every time you drive to the airport you flip a coin to decide whether to take the back roads or the highway. As a quick check: imagine that in every single state we take the action "try right". Is this a deterministic policy or a stochastic policy? Deterministic, right. We'll talk more shortly about when deterministic policies are useful and when stochastic policies are useful.

Now, the value function is the expected discounted sum of future rewards under a particular policy. It's a weighting: it says how much reward I think I'm going to get, both now and in the future, weighted by how much I care about immediate versus long-term rewards. The discount factor gamma is going to be between zero and one, and the value function then allows us to say how good or bad different states are. Again, in the case of the Mars rover, let's imagine our discount factor is zero, our policy is to try to go right, and say this is our value function: the value of being in state S1 is plus one, the value of every other state is zero, and the value of being in S7 is ten. Again, this may or may not be the correct value function; that depends also on the true dynamics model, but it's a value function the agent could have for this policy. It simply tells us the expected discounted sum of rewards you'd get if you follow this policy starting in that state, where you weight each reward by gamma raised to the number of time steps at which you reach it.
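Putting those pieces into symbols (standard notation, using the R(s, a) reward convention the lecture adopts):

```latex
% Transition model and reward model of an MDP:
P(s' \mid s, a) = \Pr(s_{t+1} = s' \mid s_t = s, \; a_t = a),
\qquad
R(s, a) = \mathbb{E}\left[ r_t \mid s_t = s, \; a_t = a \right]
% Policy, deterministic or stochastic:
\pi(s) = a
\qquad \text{or} \qquad
\pi(a \mid s) = \Pr(a_t = a \mid s_t = s)
% Value of policy pi with discount factor 0 <= gamma <= 1:
V^{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} \, r_t \;\middle|\; s_0 = s \right]
```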
Yeah? If we wanted to extend the discount factor in this example, would there be an increasing or decreasing value to the reward depending on how far it went? Yes, the question was what happens if gamma is not zero here. Gamma being zero indicates that we essentially only care about immediate rewards; the question, if I understood correctly, is whether we'd start to see the rewards spread to other states, and the answer is yes. We'll see more of that next time, but if the discount factor is nonzero, then it basically says you care not just about the immediate reward you get, you're not just myopic, but also about the reward you're going to get in the future.

In terms of common types of reinforcement learning agents, some of them are model-based, which means they maintain in their representation a direct model of how the world works, like a transition model and a reward model, and they may or may not have an explicit policy or value function; they always have to compute a policy, they have to figure out what to do, but they may not have an explicit representation of what they would do in any state. Model-free approaches have an explicit value function and policy and no model. Yeah? Going back to the earlier slide, I'm confused about when the value function is evaluated, the one with the seven: why is it not S6 that has a value of ten, because if you try right at S6 you get to S7? So the question is, when do we think of the rewards as happening? We'll talk more about that next time. There are many different conventions: some people think of the reward as depending on the current state you're in, some think of it as depending on the state you're in and the action you take, and another common definition is R(s, a, s'), meaning you don't see what reward you get until you transition. With the particular definition I'm using here, we're assuming the reward happens when you're in that state. All of them are basically isomorphic, but we'll try to be careful about which one we're using; the most common one we'll use in the class is R(s, a), which says that when you're in a state and you choose a particular action, you get a reward, and then you transition to your next state. Great question.

Okay, so when we think about reinforcement learning agents and whether they maintain these models, values, and policies, we get a lot of intersection. I really like this figure from David Silver, where he thinks about RL algorithms or agents as mostly falling into three classes, according to whether they have a model, an explicit policy, or an explicit value function, and then there's a whole bunch of algorithms in the intersections of these. What do I mean by explicit? I mean that if you give the agent a state, it could tell you the value immediately, or tell you the policy immediately, without additional computation. Things like actor-critic, for example, combine value functions and policies. There are a lot of algorithms in the intersection of all of these, and often in practice it's very helpful to maintain many of them; they have different strengths and weaknesses. For those of you interested in the theoretical aspects of learning theory, there's some really cool recent work that explicitly looks at the formal, foundational differences between model-based and model-free RL; it just came out of MSR, Microsoft Research in New York, and it indicates there may be a fundamental gap between model-based and model-free methods, which on the deep learning side has been very unclear. Feel free to come ask me about that.

So what are the challenges in learning to make good decisions in this sort of framework?
One is this issue of planning that we talked about a little bit before: even once I've got a model of how the world works, I have to use it to figure out what decisions to make so that I achieve high reward. In this case, if you're given a model, you can do this planning without any interaction with the real world: if someone hands you the transition model and the reward model, you can go off and do a bunch of computation on your computer, or on paper, decide what the optimal action is, and then go back to the real world and take that action; it doesn't require any additional experience. But in reinforcement learning we have an additional issue: we might want to think not just about what seems best given the information I have so far, but about how I should act so that I can get the information I need to make good decisions in the future. It's like moving to a new town where there's only one restaurant: you go there the first day and they have five different dishes, you're going to be there for a long time, and you want to find the best dish. So maybe the first day you try dish one, the second day dish two, the third day dish three, and so on, so that you can try everything and use that to figure out which one is best, and over the long term you pick something that's really delicious. The agent has to think explicitly about what decisions to make so it can get the information it needs to make good decisions in the future.

On the planning side, and the fact that this is already a hard problem: if you think about something like solitaire, you could already know the rules of the game (this is also true for things like Go or chess and many other scenarios), and you could know, if you take an action, what the probability distribution over the next state would be, and you can use that to compute a potential score. Using things like tree search or dynamic programming, and we'll talk a lot more about these, particularly the dynamic programming aspect, you can decide, given a model of the world, what the right decision is. Reinforcement learning itself is a little bit more like solitaire without a rule book: you're just playing, observing what happens, and trying to get large reward, and you might use your experience to explicitly compute a model and then plan in that model, or you might not, and instead directly compute a policy or a value function.
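As a tiny illustration of planning once you're handed a model, here is a hedged sketch of value iteration, one of the dynamic programming methods just mentioned, run on a made-up model of the seven-state Mars rover; the 50% move-success probability, the discount factor of 0.9, and the rewards are just the hypothetical numbers from the example, and the code never has to interact with a real environment:

```python
import numpy as np

# Hypothetical model of the 7-state Mars rover: reward +1 in S1, +10 in S7, 0 elsewhere;
# actions are "try left" / "try right"; assume (as in the example model) that a move
# succeeds with probability 0.5 and otherwise the rover stays where it is.
n_states, gamma = 7, 0.9
rewards = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 10.0])  # R(s), same for both actions
actions = (-1, +1)  # -1 = try left, +1 = try right

def intended_next(s, a):
    # where the rover ends up if the move succeeds (clipped at the corridor ends)
    return min(max(s + a, 0), n_states - 1)

def q_value(s, a, V):
    # one-step lookahead under the model: R(s, a) + gamma * E[V(s')]
    return rewards[s] + gamma * (0.5 * V[intended_next(s, a)] + 0.5 * V[s])

# Value iteration: repeated backups using only the model, no real-world experience.
V = np.zeros(n_states)
for _ in range(2000):
    new_V = np.array([max(q_value(s, a, V) for a in actions) for s in range(n_states)])
    if np.max(np.abs(new_V - V)) < 1e-8:
        V = new_V
        break
    V = new_V

policy = ["try right" if q_value(s, +1, V) >= q_value(s, -1, V) else "try left"
          for s in range(n_states)]
print(np.round(V, 2), policy)
```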
Now I just want to re-emphasize this issue of exploration and exploitation. In the case of the Mars rover, it's only going to learn about how the world works for the actions it tries. So in state S2, if it tries to go left, it can see what happens there, and then from there it can decide the right next action. This is obvious, but it can lead to a dilemma, because the agent has to balance between things that seem like they might be good based on its prior experience and things that might be good in the future but where perhaps it got unlucky before. In exploration we're often interested in trying things we've never tried before, or things that so far might have looked bad but that we think might be good in the future, whereas in exploitation we're trying things that are expected to be good given past experience. Here are three examples. In the case of movies, exploitation is watching your favorite movie, and exploration is watching a new movie that might be good or might be awful. In advertising, exploitation is showing the ad that has yielded the highest click-through rate so far, and exploration is showing a different ad. In driving, exploitation is taking the fastest route given your prior experience, and exploration is driving a different route.

Yeah, in the back. I had a question on this tradeoff: going back to the example of going to a restaurant for five days, if you're optimizing a policy for a finite horizon, where you know there are only five days and you can only live that period once, does that often use the same type of policy as an infinite horizon problem, where you can repeat the experience over and over again? Great question: let's imagine, for that example I gave, that you're only going to be in town for five days. Would the policy you'd compute in that finite horizon setting be the same as or different from one where you know you're going to live there for all of infinite time? We'll talk a little more about this next time, but they are different. In particular, the optimal policy when you only have a finite horizon is normally non-stationary, which means the decision you make depends on the time step as well as the state. In the infinite horizon case, the assumption is that the optimal policy in the Markov setting is stationary, which means that if you're in the same state, whether you're there on time step three or time step 3000, you will always do the same thing. In the finite horizon case that's not true, and here's a critical example of why. Why do we explore? We explore in order to learn information we can use in the future. So if you're in a finite horizon setting and it's the last day, say your last day in Hollywood and you're trying to decide what to do, you're not going to explore, because there's no benefit from exploration for the future: you're not making any more decisions. In that case you will always exploit; it's always optimal to exploit. So in the finite horizon case, the decisions you make have to depend on the value of the information you gain for changing your decisions in the remaining horizon, and this often comes up in real cases.

Yeah? How much more complicated is it if there's a finite horizon but you don't know where it is? It's something I remember from game theory as tending to be very complicated; how does it look here? So the question is about what I would call indefinite horizon problems, where there is a finite horizon but you don't know what it is. That can get very tricky. One way to model it is as an infinite horizon problem with termination states: there are some states which are essentially sink states, and once you get there the process ends. This often happens in games, where you don't know when the game will end but it's going to be finite, and that's one way to put it into the formalism. But it is tricky; in those cases we tend to model it as infinite horizon and look at the probability of reaching different termination states.
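To make the non-stationarity point concrete, here is a minimal sketch of finite-horizon backward induction; the tiny deterministic MDP and the horizon of five are made-up assumptions, but they show how the optimal action can depend on how many steps remain.

```python
# A minimal sketch of finite-horizon dynamic programming, illustrating why the
# optimal policy can be non-stationary (it depends on the time step, not just the
# state). The MDP and horizon are made-up assumptions.

H = 5
states, actions = ["A", "B"], ["stay", "switch"]
P = {("A", "stay"): "A", ("A", "switch"): "B", ("B", "stay"): "B", ("B", "switch"): "A"}
R = {("A", "stay"): 0.0, ("A", "switch"): 0.0, ("B", "stay"): 1.0, ("B", "switch"): 0.0}

V = {s: 0.0 for s in states}   # value with zero steps remaining
policy = {}                    # policy[(t, s)] -> action to take at time step t
for t in reversed(range(H)):   # backward induction from the final step
    newV = {}
    for s in states:
        qs = {a: R[(s, a)] + V[P[(s, a)]] for a in actions}
        best = max(qs, key=qs.get)
        policy[(t, s)] = best
        newV[s] = qs[best]
    V = newV

# On the last step, switching out of A earns nothing, so the agent stays;
# with more steps remaining, switching to B pays off: same state, different action.
print(policy[(H - 1, "A")], policy[(0, "A")])  # "stay" "switch"
```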
Yeah? Assuming that you mix exploration and exploitation, potentially in sub-problems, particularly for driving: it seems like it would be better to exploit paths you know are really good and maybe explore on sub-paths you don't know are as good, rather than trying a completely brand new route. So the question is about how this mix of exploration and exploitation happens, and whether, in the case of cars, you would not try things totally randomly but instead want some evidence that they might be good. It's a great question. Generally it is better to intermix exploration and exploitation; in some cases it is optimal, or at least equivalent, to do all your exploration early and then gain from all of that information later, but it depends on the decision process. We'll spend a significant chunk of the course after the midterm thinking about exploration and exploitation. It's definitely a really critical part of reinforcement learning, particularly in high-stakes domains. What do I mean by high-stakes domains? I mean domains that affect people, whether it's customers or patients or students; that's where the decisions we make actually affect real people, so we want to learn as quickly as possible and make good decisions as quickly as we can. Any other questions about this?

Yeah? If you're in a state that you haven't seen before, do you have any better option than just taking a random action to get out of there, or can you use your previous experience even though you've never been there before? The question is: if you're in a new state you've never been in before, what do you do? Can you do anything better than random, or can you somehow use your prior experience? One of the really great things about generalization is that we use state features, either learned by deep learning or some other representation, to try to share information, so that even though the state might not be one you've ever exactly visited before, you can use prior information to inform what might be a good action to take. Of course, if you share in the wrong direction you can make the wrong decision: if you overgeneralize, you can overfit to your prior experience when in fact there's a better action to take in the new scenario. Other questions on this?
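Here is a minimal sketch of that generalization idea: a linear action-value estimate built from state features lets a never-visited state borrow from similar visited states. The feature choices, the fitting rule (simple gradient steps toward observed returns), and all of the numbers are illustrative assumptions, not a method prescribed in the lecture.

```python
# A minimal sketch of sharing information across states via features.

import numpy as np

def features(state, action):
    """Hypothetical feature vector for a (state, action) pair."""
    distance_to_goal, battery = state
    return np.array([1.0, distance_to_goal, battery, 1.0 if action == "right" else 0.0])

w = np.zeros(4)   # weights of the linear estimate Q(s, a) ~ w . phi(s, a)
ALPHA = 0.05      # learning rate

# A few made-up (state, action, observed return) samples from past experience.
experience = [((3.0, 0.8), "right", 7.0), ((1.0, 0.9), "right", 9.5), ((5.0, 0.2), "left", 1.0)]
for s, a, g in experience:
    phi = features(s, a)
    w += ALPHA * (g - w @ phi) * phi   # gradient step toward the observed return

# A state we have never visited still gets an estimate, because it shares features
# with states we have visited; of course, if the features generalize badly, so will this.
new_state = (2.0, 0.85)
print({a: float(w @ features(new_state, a)) for a in ["left", "right"]})
```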
Okay, so one of the things we're going to be talking about over the next few lectures is two really fundamental problems: evaluation and control. Evaluation is the problem of saying, if someone gives you a policy, saying this is what you should do, this is how your agent or your robot should act in the world, evaluate how good it is. We often want to be able to figure this out: your manager says, I think this is the right way we should show ads to customers, can you tell me how good it is, what's the click-through rate? So one really important question is evaluation, and you might not have a model of the world, so you might still have to go out and gather data to evaluate this policy, but you just want to know how good it is; you're not trying to make a new policy, at least not yet, you're just trying to see how good the current one is. The control problem is optimization: it's saying, let's try to find a really good policy. This typically involves evaluation as a subcomponent, because we often need to know what best means; best means a really good policy, and to know how good a policy is we need to do evaluation. Now, one of the really cool aspects of reinforcement learning is that often we can do this evaluation off-policy, which means we can use data gathered from other policies to evaluate the counterfactual of what different policies might do. This is really helpful because it means we don't have to try out all policies exhaustively.

In terms of what these questions look like, if we go back to our Mars rover example, for policy evaluation it would be: someone says your policy is, in all of your states, the action you should take is "try right", and this is the discount factor I care about; please compute for me, or evaluate for me, the value of this policy. In the control case they would say: I don't know what the policy should be, I just want you to give me whatever policy has the highest expected discounted sum of rewards. And there's actually a key question here, which is: expected discounted sum of rewards from what? They might care about a particular starting state, say figure out the best policy assuming I'm starting from S4, or they might say compute the best policy from all starting states, or from some average over them.

So in terms of the rest of the course, what we're going to do is... yeah? I was just wondering if it's possible to learn the optimal policy and the reward function simultaneously. In your example, if I had some belief about what the reward, or the click-through rate, would be for some action from a given state, and that turned out to be wrong, would I have to start over in training to find the optimal policy, or could I use what I've learned so far, together with some updated belief over the rewards? Great question: let's say I have a policy to start with, I'm evaluating it, I don't know what the reward function is, I don't know what the optimal policy is, and it turns out this current policy isn't very good. Do I need to restart, or can I use that prior experience to inform the next policy I try, or perhaps a whole suite of different policies? In general, you can use the prior experience to inform the next policy, or the next suite of policies, that you try. There's a little bit of a caveat there, which is that you need some stochasticity in the actions you take: if you only ever take the same single action in a state, you can't really learn about any other actions you might take there. So you need some sort of generalization or some sort of stochasticity in your policy in order for that information to be useful for evaluating other policies. This is a really important issue, the issue of counterfactual reasoning: how do we use our old data to figure out how we should act in the future, when the old policies may not be the optimal ones? In general we can, and we'll talk a lot about it; it's a really important issue.
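Here is a minimal sketch of that counterfactual, off-policy idea: re-weighting returns collected under one stochastic behavior policy to estimate the value of a different target policy, using ordinary importance sampling over complete episodes. The policies, trajectories, and returns are made-up assumptions for illustration; note the behavior policy must give nonzero probability to the target policy's actions, which is exactly the stochasticity caveat above.

```python
# A minimal sketch of off-policy evaluation with ordinary importance sampling.
# All policies, episodes, and numbers are made-up for illustration.

def prob(policy, state, action):
    return policy[state].get(action, 0.0)

behavior = {"S4": {"left": 0.5, "right": 0.5}}   # policy that actually collected the data
target   = {"S4": {"left": 0.1, "right": 0.9}}   # policy we want to evaluate counterfactually

# Each episode: a list of (state, action) pairs plus the observed episode return.
episodes = [([("S4", "right")], 10.0), ([("S4", "left")], 1.0), ([("S4", "right")], 10.0)]

total, n = 0.0, 0
for traj, ret in episodes:
    rho = 1.0
    for s, a in traj:
        rho *= prob(target, s, a) / prob(behavior, s, a)  # importance weight for this episode
    total += rho * ret
    n += 1

print(total / n)   # estimate of the target policy's value, using only off-policy data
```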
So, for the rest of the course: we're first going to start off talking about Markov decision processes and planning, and about how we do this evaluation, both when we know how the world works, meaning we're given a transition model and a reward model, and when we're not. Then we'll talk about model-free policy evaluation and then model-free control. We're going to spend some time on deep reinforcement learning, and reinforcement learning with function approximation in general, which is a hugely growing area right now; I thought about making a plot of how many papers are coming out in this area, it's pretty incredible. Then we're going to talk about policy search, which I think in practice, particularly in robotics, is one of the most influential methods right now, and then we're going to spend quite a lot of time on exploration, as well as cover a few advanced topics.

Just to summarize, what we've done today is talk a little bit about reinforcement learning and how it differs from other areas of AI and machine learning, go through course logistics, and start to talk about sequential decision-making under uncertainty. As a quick note for next time, we will try to post the lecture slides two days in advance, or by the evening two days before, so that you can print them out if you want to use them in class, and I'll see you guys on Wednesday.