[Korean subtitles] Stanford CS234: Reinforcement Learning | Lecture 11: Fast Reinforcement Learning

Published: Aug 31, 2024 Duration: 01:18:39 Category: Science & Technology

Today we're going to start talking about fast reinforcement learning. In terms of where we are in the class, we've just finished policy search; you're working on policy gradient right now for your homework, that will be the last homework, and the rest of the term will be on projects. Now we're going to start on fast reinforcement learning, which is something we haven't talked about very much yet.

The things we've discussed a lot so far this term are optimization, generalization, and delayed consequences: how we do planning in Markov decision processes, how we scale up to really large state spaces using things like deep neural networks, and how we do that optimization. That works really well in cases where we have good simulators or where data is cheap. But a lot of the work in my own lab is about how we teach computers to help us, which naturally involves reinforcement learning, because we're teaching computers to make decisions that help people. There are many applications like this, such as education, healthcare, and consumer marketing, and in each of these we can think of the problem as reinforcement learning: some agent, our computer, makes decisions as it interacts with a person and tries to optimize some reward, like helping someone learn something, treating a patient, or increasing revenue by getting consumers to click on ads. In all of those cases the data comes from people, and that raises at least two big challenges. First, people are finite; there isn't an infinite number of them. Second, it's expensive and costly to gather data by interacting with people. That raises the concern of sample efficiency.

In general we would love reinforcement learning algorithms that are both computationally efficient and sample efficient, but most of the techniques we've looked at so far, particularly the Q-learning-style techniques, were really inspired by the need for computational efficiency. Think back to the beginning of the class when we compared dynamic programming with Q-learning: in dynamic programming we had to sum over all next states, while in TD learning we sampled that. TD learning has a roughly constant cost per update, whereas dynamic programming has something like an S-squared-times-A cost, so it is much more expensive per step than TD learning. A lot of the techniques developed in reinforcement learning have really been driven by this computational efficiency issue, and there are plenty of settings where computation matters: if you're driving a car at 60 miles per hour and it takes your computer one second to decide how to turn the wheel, then in that one second you've traveled many feet. So in a lot of cases you really do have real-time constraints on the computation you can do.
In many situations, such as some robotics problems and especially when we're learning in simulation, we really want computational efficiency because we need the simulator to be fast so the agent can learn quickly. In contrast, there are settings where sample efficiency is extremely important and computation matters less: whenever experience is costly or hard to gather. That is particularly true for anything involving people, such as students, patients, or customers, because the way the agent learns about the world is by making decisions, and that data affects real people. It might be perfectly reasonable to spend several days of computation if we could figure out a better way to treat cancer, because we don't want to randomly experiment on people; we want to use the data as well as we can and be really, really sample efficient. Whereas for something like Atari, we want to be computationally efficient because we can run many, many simulations; it's fine, no one is getting hurt, but we do eventually need to learn to play the game well.

So a natural question is: now that we care about sample efficiency, rather than mostly computational efficiency, are the algorithms we've already discussed already sample efficient? Does anyone remember, roughly to an order of magnitude, how many steps it took DQN to learn a good policy for Pong? Someone said 5 million, and that's about right; it varies, but my guess is somewhere between 2 and 10 million. That's a lot of data to learn to play Pong. So I would argue the techniques we've seen so far are not going to address this issue. It's not reasonable to need 2 to 10 million customers before we figure out a good way to target ads, or 2 to 10 million patients before we figure out the right treatment decisions. The techniques we've seen so far are formally not sample efficient, and they're also empirically not sample efficient, so we're going to need new kinds of techniques.

When we start to think about this, we come back to the general question of what it means for an algorithm to be good. We've talked about computational efficiency, and I've mentioned sample efficiency, but in general what we care about is how good our reinforcement learning algorithm is, and we're going to start to quantify that in terms of sample efficiency. Of course, you could have an algorithm that is "sample efficient" in the sense that it only uses the first 10 data points and then never updates its policy: it doesn't need much data to find a policy, but the policy is bad. So when we talk about sample efficiency we want both things: we don't need very much data, and we make good decisions. We still want good performance; we just don't want to need much experience to get there. So when we talk about what it means for an algorithm to be good, one
possibility is to ask whether it converges at all, meaning whether the value function or the policy is stable at some point, asymptotically, as the number of time steps goes to infinity. We talked about how, with value function approximation, we sometimes don't even have that: things can oscillate. Something stronger you might want is that asymptotically, as t goes to infinity, we converge to the optimal policy, and we talked about algorithms that do that under various assumptions. But what we haven't talked about much is how quickly we get there; "asymptotically" is a very long time. If we have two algorithms, and one of them reaches the optimal policy here (think of performance on one axis and time on the other) while the other gets there much faster, then intuitively the faster one is better, even though both reach the optimal policy eventually. So we'd like to be able to account for things like how many mistakes our algorithm makes, or its performance over time relative to the optimal policy. Today we'll start to talk about other measures of how good an algorithm is.

In this lecture and the next couple, we're going to look at how good these reinforcement learning algorithms are and think about algorithms with much better performance guarantees. We'll start with tabular settings; in fact, today we'll only talk about simple bandits, and then today and next lecture we'll cover tabular settings more generally, and hopefully also get to function approximation plus sample efficiency. We'll organize this in terms of settings, frameworks, and approaches. The settings we'll cover today and next time are bandits (who here is doing the default project? A number of you, so you're already starting to think about this for the project) and then MDPs. Frameworks are evaluation criteria for formally assessing the quality of a reinforcement learning algorithm; they're a tool you can use to evaluate many different algorithms, and algorithms will either satisfy a given criterion or not, or have different properties under different frameworks. Approaches are classes of algorithms for achieving these evaluation criteria under these frameworks in the different settings, whether the bandit setting or the MDP setting. What we'll shortly see is that there are a couple of main styles of approaches that turn out to be applicable to bandit settings, MDP settings, and actually function approximation as well, and that have some really nice formal properties: a couple of big conceptual ideas about how we might do fast reinforcement learning.

The plan for today is that we'll first have an introduction to multi-armed bandits, then we'll define regret in a formal, mathematical sense, then we'll talk about optimism under uncertainty, and then, time permitting,
we'll talk about Bayesian regret, probability matching, and Thompson sampling. I'm curious, who here has seen this sort of material before? A couple of people; most of you haven't, and I don't think it's covered in the AI class. So for some of you this will be a review, but for most of you it will be new.

We can think of multi-armed bandits as a subset of reinforcement learning. There is a set of m arms, which are equivalent to what we've been calling actions: instead of m different actions, we'll often call them arms. Each arm has a distribution over the rewards you could get. We haven't talked much about uncertainty over rewards; we mostly just talked about the expected reward. For multi-armed bandits today we're going to explicitly think about the fact that rewards are sampled from a stochastic distribution that we don't know, conditioned on the arm. For example, arm one's reward distribution might have a higher average reward than arm two's, and they might have different variances; and it doesn't have to be Gaussian, you could have lots of different distributions. Essentially, what we're trying to capture is that whenever you take a particular action, which we also refer to as pulling an arm, the reward you get might vary even if you pull the same arm twice: pull arm one once and you might get one reward, pull it a second time and you might get a different one.

The setup is similar to MDPs except that there is no transition function: there is no state, or equivalently there is only a single state. When you pull an arm you stay in the same state, the same m actions are always available, and on each step you pick an action and observe a reward sampled from the unknown probability distribution associated with that arm. Just like in reinforcement learning, we don't know those reward distributions in advance, and our goal is to maximize cumulative reward. If someone told you the distributions in advance, you would know exactly which arm to pull: whichever one has the highest expected reward.

We're going to use notation that is pretty similar to what we had for the reinforcement learning case, but let me know if anything is confusing. We define the action value Q(a) as the mean reward for a particular action; this is unknown to the agent. The optimal value V* is Q(a*), the value of the best action. The regret is the opportunity loss for one time step: if you could have gotten Q(a*) but instead you took action a_t, how much, in expectation, did you lose by taking the suboptimal arm? That is how we're going to mathematically
define regret in this setting. If you selected the optimal arm, your expected regret for that time step would be zero, but for any other arm there will be some loss. The total regret is the total opportunity loss summed over all the time steps the agent acts: you compare the expected reward of each action the agent took with the expected reward of the optimal action. To be clear, these quantities are not known to the agent. Just to check understanding: why is that second quantity unknown to the agent? Someone said it's because you don't know the probability distribution, and that's correct. You don't know what Q is; you only get to observe a reward r sampled from the reward distribution given the action you took. You don't get to observe the true expected value of the optimal arm, nor the true expected value of the arm you selected. So regret isn't something we can normally evaluate unless we're in a simulated domain, but we're going to talk about ways to bound it, and about algorithms that try to minimize it.

Another, equivalent way to quantify this: let N_t(a) be the number of times you've selected action a, and define the gap Δ_a as the difference between the optimal arm's value and the value of arm a, which is how much you lose by picking that arm instead of the optimal one. The gap of the optimal arm is zero, because you lose nothing by taking the optimal arm, and for all other arms it's positive. Then the total regret is equal to the sum, over arms, of the expected number of times you select each arm times its gap. What we'd like is an algorithm that adjusts how often it pulls arms with large gaps: if there is a really bad action with very low reward, you'd like to take it much less often than the arms that are close to optimal.

One approach we've seen before is greedy. In the bandit case the greedy algorithm is very simple: for each arm we average the rewards we've observed on the time steps when we pulled it, which gives an estimate Q̂ of its value, and then greedy always selects the action with the highest estimated value, which can mean taking the same arm forever. It's probably clear that because the rewards are sampled from a stochastic distribution, if you're unlucky and get samples that are misrepresentative, you can lock into the wrong action and stay there forever.
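To collect the definitions from the last few minutes in one place (this just restates what was said above; N_T(a) is the number of times arm a has been pulled by time T and Δ_a is the gap):

$$ Q(a) = \mathbb{E}[r \mid a], \qquad V^* = Q(a^*) = \max_a Q(a), \qquad \Delta_a = V^* - Q(a), $$

$$ l_t = \mathbb{E}\big[ V^* - Q(a_t) \big] \quad \text{(per-step regret)}, \qquad L_T = \mathbb{E}\Big[ \sum_{t=1}^{T} \big( V^* - Q(a_t) \big) \Big] = \sum_{a} \mathbb{E}\big[ N_T(a) \big]\, \Delta_a \quad \text{(total regret)}, $$

with Δ_{a*} = 0, so only pulls of suboptimal arms contribute to the regret.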
To see this, suppose you sample from A1 and, with some nonzero probability, the first sample you get is, say, 0.2, while the first sample you get from A2 is 0.5. The true mean of A2 is lower than the mean of A1, but if you sampled each arm once and then acted greedily with respect to those estimates, you would take the wrong action forever.

A question came up: is the idea that our policy influences which samples we get in the future, or is there some set of samples gathered independently of the greedy policy to begin with? Because otherwise, if an arm gives nonzero reward, you'd just take that one forever. Great question. For most of these algorithms, we're going to assume they start by selecting each arm once, at least when there is a finite set of arms; equivalently, if you have no data for an arm you treat all arms equivalently. So until you have data for every arm you do a round robin, sampling each one once, and after that you can be greedy or do something else; there is a pre-initialization phase. We'll also assume for now that ties are split with equal probability: if two arms both have the maximal estimated value, you split your pulls between them until the estimates change. So in this example, you'd first sample A1 once, then A2 once, and because there's a nonzero probability that those samples make A1 look worse than A2, you could lock into the wrong action forever.

An epsilon-greedy algorithm, which we've seen before in class, does something very similar except that with probability 1 − ε it selects the greedy action, and with probability ε it spreads its choice uniformly across all the arms. That gives us more robustness, because we keep sampling the other actions. The cost is that we make a suboptimal decision roughly an ε fraction of the time; actually slightly less than that, because when you pick uniformly at random you still select the optimal arm with probability 1/m. A minimal sketch of both selection rules is below.
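As a concrete reference point, here is a minimal sketch of the greedy and epsilon-greedy selection rules just described, with the round-robin initialization and random tie-breaking. This is not code from the course; the class name and interface are just illustrative choices.

```python
import numpy as np

class BanditAgent:
    """Tracks empirical mean rewards Q_hat and pull counts N for m arms."""

    def __init__(self, m, epsilon=0.0, seed=0):
        self.m = m                       # number of arms
        self.epsilon = epsilon           # 0.0 -> pure greedy
        self.q_hat = np.zeros(m)         # empirical mean reward per arm
        self.n = np.zeros(m, dtype=int)  # number of pulls per arm
        self.rng = np.random.default_rng(seed)

    def select_arm(self):
        # Round-robin initialization: pull every arm once before anything else.
        unpulled = np.where(self.n == 0)[0]
        if len(unpulled) > 0:
            return unpulled[0]
        # With probability epsilon, explore uniformly over all arms.
        if self.rng.random() < self.epsilon:
            return self.rng.integers(self.m)
        # Otherwise act greedily, breaking ties uniformly at random.
        best = np.flatnonzero(self.q_hat == self.q_hat.max())
        return self.rng.choice(best)

    def update(self, arm, reward):
        # Incremental update of the empirical mean for the pulled arm.
        self.n[arm] += 1
        self.q_hat[arm] += (reward - self.q_hat[arm]) / self.n[arm]
```

With epsilon=0.0 this is the pure greedy rule that can lock onto the wrong arm; with a fixed epsilon > 0 it keeps exploring forever, which is exactly the behavior whose regret is discussed next.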
Let's see these algorithms in practice. Suppose we're trying to figure out how to treat broken toes; this is not a real medical example. Imagine there are three options: one is surgery, one is buddy taping the broken toe to another toe, which is what you might commonly be told to do, and the third is to do nothing. The outcome measure is a binary variable: whether or not the toe has healed after six weeks, as assessed by an x-ray. We'll model this as a multi-armed bandit with three arms, where each arm has an unknown parameter governing the reward outcome.

Take a minute: what does a pull of an arm correspond to here, and why is it reasonable to model this as a bandit instead of an MDP? One answer: MDPs usually involve an agent moving through a world with many different states, whereas here there is just one state, a toe is broken, and several actions; that's right. And what does it mean to pull an arm, or take an action, in the real world? It corresponds to a new patient coming in and us making a decision about their care: doing one of the three treatment options. Each pull is a different patient, and how we treat patient one doesn't generally affect patient two; whether the first patient healed doesn't change the next patient who comes in. So all the patients are IID, and what we want to figure out is which of these treatments is most effective on average.

Now let's pick a particular set of values. These are all Bernoulli reward variables, because the toe is either healed or not healed after six weeks. It turns out that in this particular fake example surgery is best: if you do surgery, with probability 0.95 the toe is healed after six weeks; buddy taping heals it with probability 0.90, and doing nothing with probability 0.1. What would happen if we ran something like the greedy algorithm?

Before that, a question came up: can we incorporate other factors, for example that surgery is taxing, or that buddy taping might be more cost-effective? Great question. There are a couple of ways you could imagine putting that information in. One is to change the reward outcome: maybe surgery is more effective but also costly, and you need some way to combine outcomes with cost. Another thing you might care about is the distribution of outcomes. Here all three arms are Bernoulli, but in some cases the reward distributions are more complicated: maybe for most people surgery is really good, but for some people it's really awful, say because they react badly to anesthesia, so its mean is still better but there's a bad tail of risk. In those cases you might not want to optimize expected outcomes; you might want to look at risk sensitivity, and in fact one of the things we're doing in my group is safe reinforcement learning, including safe bandits, and thinking about how to optimize risk-sensitive criteria.
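Before moving on, here is the three-arm toe example written as a tiny simulated environment that the agent sketch above could be run against. The healing probabilities are the made-up numbers from the lecture; the class and its interface are illustrative, not from the course materials.

```python
import numpy as np

class BernoulliBandit:
    """Bernoulli m-armed bandit: pulling arm a returns 1 with probability probs[a]."""

    def __init__(self, probs, seed=0):
        self.probs = np.asarray(probs, dtype=float)
        self.rng = np.random.default_rng(seed)
        self.optimal_value = self.probs.max()   # V* = Q(a*)

    def pull(self, arm):
        # Reward is 1 (healed at six weeks) with probability probs[arm], else 0.
        return float(self.rng.random() < self.probs[arm])

# The broken-toe example: surgery, buddy taping, do nothing.
toe_bandit = BernoulliBandit(probs=[0.95, 0.90, 0.10])
```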
Another thing we're not going to talk about today, but which you might also want here, is that patients are not all identical: you might want to incorporate contextual features about the patient when deciding between surgery and buddy taping. Those of you doing the default project will definitely think about this, and we'll probably get to it in a couple of lectures; in general we often have a rich contextual state that also affects the outcomes.

So: we have these three potential interventions and we run the greedy algorithm. As I said before, we first sample each arm once and compute an empirical average. Say we sample action A1 and get a +1, so our empirical estimate of A1's expected reward is 1; then we take A2 and also get a 1; then we take A3 and get a 0. At this point, what is the probability of greedy selecting each arm, assuming ties are split randomly? Someone gave the epsilon-greedy answer, roughly one half minus a little for each of the first two and ε over three for the third, which is jumping ahead a bit but is exactly right for epsilon-greedy. For plain greedy it's just 50/50: P(A1) = P(A2) = 1/2.

Now imagine we run this for a few time steps and track the regret we incur along the way, always comparing against the reward we could have gotten under the optimal action. The per-step regret is Q(a*) minus Q of the action we took. On the first step it's zero, because we took the optimal action; on the second it's 0.95 − 0.9 = 0.05; on the third it's 0.95 − 0.1 = 0.85; then it's zero on the fourth step and 0.05 again on the fifth.

In this situation, will greedy ever select A3 again, given the values we've seen so far? No, and here's why. The rewards we observed were 1, 1, 0, so our current estimate for A3 is zero. We know the rewards are bounded between zero and one, so no estimate can ever drop below zero, and we already have a +1 for each of the other two actions, which means their averages can never fall to zero. So we're never going to take A3 again. Here that isn't actually a problem, because A3 really is a bad arm with much lower expected reward; in other cases A3 might just have gotten unlucky, and then never revisiting it could mean we never take the optimal action.

A quick notation question came up: is Q(a*), which appears in the regret, the same as V*? Yes, they're the same, and I'll go back and forth between the two notations, so just ask if it's unclear. So under greedy we will never select A3 again; the regret tally for these first few pulls is written out below.
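Writing out the numbers from that walkthrough (observed rewards 1, 1, 0 during initialization, then one possible realization of greedy splitting ties between A1 and A2, matching the numbers in the lecture):

$$ \begin{array}{c|l|c}
t & a_t & V^* - Q(a_t) \\ \hline
1 & A_1 \ (\text{surgery}) & 0.95 - 0.95 = 0 \\
2 & A_2 \ (\text{buddy taping}) & 0.95 - 0.90 = 0.05 \\
3 & A_3 \ (\text{do nothing}) & 0.95 - 0.10 = 0.85 \\
4 & A_1 & 0 \\
5 & A_2 & 0.05
\end{array} $$

The expected regret accumulated over these five pulls is 0.95, and from step 4 onward only A1 and A2 are ever pulled.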
Note that in the greedy case, with slightly different values you might select A3 again later. For example, if the rewards weren't Bernoulli but Gaussian, the estimates for other arms could drop below A3's later on and you would switch, so greedy doesn't necessarily stick forever with whichever arm looked best at the beginning. But with these particular outcomes, you're never going to select A3 again.

Now let's do epsilon-greedy, assuming we got exactly the same outcomes for the first few pulls. Splitting ties randomly again, with probability (1 − ε)/2 each we take A1 or A2, and with probability ε/3 each we take A1, A2, or A3. It looks almost identical, except we still have some probability of taking A3. So epsilon-greedy will select A3 again. If ε is fixed, not decayed, how many times will it select A3: a finite or an infinite number of times, and what does that mean for the regret? Take a second with a neighbor. Okay, let's vote: most of you said an infinite number of times, and that's right. What does that mean for regret, good or bad? It's bad: the regret is unbounded. In general the total regret of these algorithms unfortunately grows without bound, but the rate at which it grows can be much smaller depending on the algorithm. In particular, if an arm has a large gap, as A3 does here, and we select it an infinite number of times, then epsilon-greedy has large regret.

I like this plot; it comes from David Silver's slides. If you explore forever, say by acting purely at random (which we didn't discuss, but you could), you get linear total regret: your regret scales linearly with the number of time steps T. It grows unboundedly and linearly, generally with some constant in front, and it's on the same order as the worst you could do, since always selecting the worst arm also gives regret that grows linearly. That's pretty bad. If you never explore, which is greedy, the regret can also be linear, and epsilon-greedy with a fixed ε is also linear. So essentially all of the algorithms we've been using so far can have really bad performance in bad cases, and the critical question is whether it's possible to do better: can we get what's often called sublinear regret? For an algorithm to be considered good in terms of performance and sample efficiency, we're going to want its regret to grow sublinearly in T. (A quick back-of-the-envelope argument for why fixed-ε exploration is linear is written out below.)
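This calculation isn't spelled out in the lecture, but the standard one-line argument behind the claim is: with a fixed ε, every step explores uniformly with probability ε, so the expected per-step regret is bounded below by a constant,

$$ L_T \;\ge\; \sum_{t=1}^{T} \frac{\varepsilon}{m} \sum_{a} \Delta_a \;=\; \Big( \frac{\varepsilon}{m} \sum_{a} \Delta_a \Big)\, T, $$

which grows linearly in T whenever any gap Δ_a is positive.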
When we think about this, we're generally going to think about whether the performance bounds we prove are problem-independent or problem-dependent. For MDPs, most of the bounds we can get are problem-independent; for bandits, there are a lot of problem-dependent bounds. A problem-dependent bound in the bandit case means the amount of regret is a function of those gaps, and that should be intuitive. Imagine we have just two arms, A1 and A2: if their expected rewards are 1 and 0.1, it should be easier to figure out that arm one is better than arm two than if they were 0.53 and 0.525. In one case it's really hard to tell the difference between the means of the two arms, and in the other case the means are really far apart. So, somewhat intuitively, if the gap is really large it should be easier to learn, and if it's really small it should be harder.

A question: if the optimal reward is deterministic, does that mean we have zero regret? If you know that all the arms' rewards are deterministic, then you just need to pull each arm once, you can make a good decision, and you're done. In general these algorithms aren't going to know that information, and even if the rewards were deterministic, you'd still have these other forms of bounds. What about the greedy case if rewards really are deterministic? Then you'd pull all the arms once and from that point on make no mistakes: essentially zero regret after some initial constant, so the regret would be independent of T after that.

Another question: is the regret also a function of the variance of each arm, not just the mean? Great question; we're not going to cover that, but beyond problem-dependent bounds you can also think about parametric assumptions. If you have parametric knowledge of the reward distributions, say that they're Gaussian, you can exploit it, and in general if you have information about the moments you should be able to exploit it; most of the results I've seen use the mean and the variance. Throughout a lot of this material we will assume the rewards are bounded, which is going to be important for most of the proofs we do, without making any other parametric assumptions.

The other version is problem-independent bounds, which ask: regardless of the domain you're in, regardless of the gaps, regardless of any structure of the problem, can we still ensure the regret grows sublinearly? That's what we'll mostly focus on today. And lower theoretical bounds are helpful for understanding how hard a problem is.
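For reference, the result being alluded to here appears to be the classic Lai and Robbins style asymptotic lower bound, stated this way, for example, on David Silver's bandit slides (the precise constants and conditions are in the original paper):

$$ \lim_{T \to \infty} L_T \;\ge\; \log T \sum_{a \,:\, \Delta_a > 0} \frac{\Delta_a}{\mathrm{KL}\!\left( \mathcal{R}^{a} \,\middle\|\, \mathcal{R}^{a^*} \right)}, $$

where $\mathcal{R}^a$ denotes the reward distribution of arm a. In words: no algorithm can have total regret that grows slower than logarithmically in T, with a constant that depends on the gaps and on how similar each arm's reward distribution is to the optimal arm's.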
I asked whether it's possible for the regret to be sublinear, and there has been previous work on how much the regret has to grow. They prove a lower bound on the growth of the regret in terms of the gaps and the similarity of the reward distributions, measured by the KL divergence. This is where the unfortunate fact that regret grows unboundedly shows up: if you don't make any other assumptions about the reward distributions, the regret will in general grow without bound, and it does so in terms of these gaps and the KL divergence. But it's still sublinear; it grows logarithmically with T, where T is the number of time steps. That's encouraging: the lower bound suggests there is room for traction. There is no formal result saying the regret has to be linear, so we should be able to make it grow much more slowly.

So how could we do this? We've now talked about a framework, which is regret, and a setting, which is bandits; now let's talk about an approach: optimism in the face of uncertainty. The idea is simply to choose actions that might have high value. Why should we do this?

Before that, there was a question about the lower bound on the previous slide: does it hold at every t, or only in the limit, because as written isn't it just a statement about the limit? Good question. I'd have to check exactly how the original paper states it; there are also constants involved. Really the statement is about what happens as T gets large: there are probably additional transitory terms that are independent of the time step but can be large early on, which is common in regret bounds, particularly in the MDP case, and this is the dominant term as T grows. That's my guess, but great question.

Okay, so optimism in the face of uncertainty says we should choose actions that might have a high value. Why? There are two possible outcomes when we pick something we think might be good. One: it really is good. If we took action A1 because we thought it might have high reward and it actually does, then we incur small regret; that's good. The other outcome: A1 does not have high reward, and when we sample it we get a low reward. Well, then we've learned something. It's like going back to a restaurant we thought was great, finding it's horrible the second time, and updating our estimate of how good that restaurant is; we no longer think that action has as high a value as we did before. So essentially either the world really is great, in which case we get high reward and low regret, or the world is not that great, and then we learn something. Either way, acting optimistically gives us
information about the world or lets us achieve high reward, and it turns out to be a really nice principle. It has been around since at least Leslie Kaelbling's work in 1993, which introduced the idea of interval estimation, and since then there has been a lot of analysis of these optimism-under-uncertainty techniques.

How could we make this more formal, more precise about what it means for an action to possibly have high value? Let's estimate an upper confidence bound U_t(a) for each action value, such that the real value Q(a) is less than or equal to U_t(a) with high probability. These upper confidence bounds will generally depend on how many times we've selected that particular action: if we've selected an action a lot, its upper confidence bound should be pretty close to Q(a), and if we haven't selected it much, we stay quite optimistic about it. Then we can define an upper confidence bound bandit algorithm by simply selecting whichever action currently has the highest upper confidence bound: for every action we maintain an upper confidence bound, we select the max, and then we update the bounds after we act.

So a UCB algorithm looks like this. First there is an initialization phase where we pull each arm once and compute U_t(a) for all a. Then, for t = 1, 2, ..., we select a_t = argmax_a U_t(a), we observe a reward sampled from the true reward distribution for that arm, and we update U_t, not just for the arm we took but for all the other arms as well. You don't strictly have to update all of them, but for the theory we generally do, in order to account for the high-probability bounds, and we'll see why in a moment. So every time you get a reward, you update the upper confidence bounds of all your arms, select the next action, and repeat.

How are we going to define these U_t? We're going to use Hoeffding's inequality. As a refresher: Hoeffding's inequality applies to a set of IID random variables, which we'll assume are bounded between 0 and 1, and we define the sample mean to be the average of those variables. Hoeffding's inequality then says that the probability that the true mean exceeds the empirical mean by more than some constant u is at most exp(−2 n u²), where n is the number of samples. We can also invert this: if we want the statement "the empirical mean plus u is at least the true mean" to hold with a certain probability, we can solve for the u that achieves it. In particular, let's ask for the probability that the empirical mean plus u is still below the true mean to be δ/t²; we'll see shortly why we might want that particular choice. The inequality and the inversion target are written out below.
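Writing out the Hoeffding step from above: for IID random variables X_1, ..., X_n bounded in [0, 1], with sample mean $\bar{X}_n$ and true mean $\mathbb{E}[X]$,

$$ P\big( \mathbb{E}[X] > \bar{X}_n + u \big) \;\le\; e^{-2 n u^2}, $$

and the inversion described above is to choose u so that the right-hand side equals the target failure probability:

$$ e^{-2 n u^2} = \frac{\delta}{t^2}. $$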
Assuming that's the probability we want, we can solve for u: setting exp(−2 n u²) = δ/t² gives u = sqrt( (1 / 2n) · log(t² / δ) ). What does that tell us? It says that if we define our bound as the empirical mean plus that particular u, keeping the Hoeffding notation for a moment, then it is at least the true expected value with probability at least 1 − δ/t². That gives us a way to define an upper confidence bound for an arm: those X's are just the rewards from the pulls of that arm, so we take the empirical average of the rewards we've seen for the arm and add this bonus term, which depends on the number of times we've pulled it. Note that t here is the total number of time steps, that is, pulls of any arm, while n is the number of times we've pulled this particular arm; they are not the same thing. So inside this confidence bound there is a term that shrinks as we pull this particular arm more, because we get a better estimate of its reward, and a log term that slowly grows with the total number of time steps. That growing log term is exactly why, after each time step, we update the upper confidence bounds of all the arms, not just the one we pulled; there are two competing rates going on.

So one way to define our upper confidence bounds is U_t(a) = Q̂_t(a) + sqrt( (1 / (2 N_t(a))) · log(t² / δ) ), where Q̂_t(a) is the empirical average reward of arm a and N_t(a) is the number of times we've pulled it. The next question is: having done that, how does it help us show that the regret of an optimistic algorithm is actually sublinear? We'll do this next part on the board.

A question came up about this derivation: it seems like t was first introduced as basically a constant that we moved around, but later it's the time step that updates every step; how can we do that? You're right, and I'm being slightly imprecise here. If you know in advance the time horizon you'll be acting over, you can set t to that maximum T, plug it in, and then the log term is fixed. In online settings where you don't know the horizon, you can instead keep updating it with the current time step. Good question. A small code sketch of this UCB rule is below.
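Here is a minimal sketch of that UCB rule, in the same style as the earlier agent sketch. It is not the course's reference implementation; in particular, recomputing every arm's bonus with the current t at each step follows the online variant just described, and delta is the failure-probability parameter from the Hoeffding bound.

```python
import numpy as np

class UCBAgent:
    """UCB bandit agent: select argmax_a  Q_hat(a) + sqrt(log(t^2/delta) / (2 N(a)))."""

    def __init__(self, m, delta=0.05, seed=0):
        self.m = m
        self.delta = delta
        self.q_hat = np.zeros(m)         # empirical mean reward per arm
        self.n = np.zeros(m, dtype=int)  # pulls per arm
        self.t = 0                       # total number of pulls so far
        self.rng = np.random.default_rng(seed)

    def select_arm(self):
        # Initialization phase: pull every arm once first.
        unpulled = np.where(self.n == 0)[0]
        if len(unpulled) > 0:
            return unpulled[0]
        # Hoeffding-based bonus: shrinks with N(a), grows slowly with log t.
        bonus = np.sqrt(np.log(self.t**2 / self.delta) / (2 * self.n))
        ucb = self.q_hat + bonus
        best = np.flatnonzero(ucb == ucb.max())
        return self.rng.choice(best)     # break ties at random

    def update(self, arm, reward):
        self.t += 1
        self.n[arm] += 1
        self.q_hat[arm] += (reward - self.q_hat[arm]) / self.n[arm]

# Example usage with the broken-toe bandit defined earlier:
# agent = UCBAgent(m=3)
# for _ in range(1000):
#     a = agent.select_arm()
#     agent.update(a, toe_bandit.pull(a))
```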
Another question: how is δ decided, and what is δ? Good question; I didn't say. Here, δ specifies the probability with which this inequality holds. Later we're going to prove a high-probability regret bound, a statement of the form "with probability at least 1 minus some function of δ, your regret will be sublinear." You can also get expected regret bounds, and one of the original UCB papers provides an expected bound, but the high-probability version is a bit easier to do in class. One more clarification that came up: how do we use regret to update our estimate of the action values? We don't. Regret is just a tool for analyzing the algorithm; it tells us whether an algorithm is good or bad in terms of how fast its regret grows, but it is not used inside the algorithm. The algorithm never computes its regret and never uses it in its updates.

Okay, let's do the proof. What we want is to prove something about the regret of the upper confidence bound algorithm and how quickly it grows. But before proving that, let's look at the probability that these confidence bounds fail. We've defined the upper confidence bounds as the empirical mean of an arm plus a term that depends on the number of times we've pulled it. What I want to work out is: what is the probability that on some step the confidence bounds fail to hold, and why is that bad? We want to bound the probability of any failure, because if all the confidence bounds hold, we can guarantee some nice properties.

In particular, note that if all the confidence bounds hold on every step, then U_t(a_t), the upper confidence bound of the arm we actually selected, is greater than or equal to Q(a*), the real value of the optimal arm. Why is this true? There are two cases: either a_t equals a*, or it doesn't. Take a minute, maybe with a neighbor: given that the confidence bounds really do hold, meaning each arm's upper confidence bound is at least that arm's true mean, convince yourself that this inequality holds at every time step.
So there are two cases: either the arm we selected is a*, or it isn't, and in both cases the claim holds if the confidence bounds are correct. Let me be clear about what I mean by the confidence bounds holding: it means that the upper confidence bound of each arm, defined as above, is greater than or equal to that arm's true expected value. A clarifying question came up: am I saying that the optimal Q value is less than the confidence bounds of all actions? No, the claim is about the arm you selected: whichever arm the algorithm picked, the upper confidence bound of that arm, the quantity used to choose it, is higher than the true value of the optimal arm, provided the confidence bounds hold on all time steps. They might not, because these are only high-probability bounds, but suppose they do.

Let's work through it. If a_t = a*, then the claim is that U_t(a*) ≥ Q(a*), and that holds by definition if the confidence bounds hold: each arm's upper confidence bound is at least its mean. The other case is a_t ≠ a*. Then U_t(a_t) ≥ U_t(a*), because otherwise we would have selected a*; some other arm must have had a higher upper confidence bound than the optimal action. And U_t(a*) ≥ Q(a*) by the confidence bound for a*. So if the confidence bounds hold, then at every single time step the upper confidence bound of the arm we selected is at least the true mean of the optimal arm. Another question: is this true for epsilon-greedy as well? No; that first inequality only holds because we're picking the argmax of U_t(a). There may be other algorithms for which something similar holds, but it holds in particular for the upper confidence bound algorithm. Great question. The two cases are written out below.
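Compactly, the case analysis is: assuming U_t(a) ≥ Q(a) for every arm a (the confidence bounds hold),

$$ a_t = a^*: \quad U_t(a_t) = U_t(a^*) \;\ge\; Q(a^*), $$
$$ a_t \ne a^*: \quad U_t(a_t) = \max_a U_t(a) \;\ge\; U_t(a^*) \;\ge\; Q(a^*), $$

so in either case U_t(a_t) ≥ Q(a*).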
So this says that if the confidence bounds hold, then the upper confidence bound of the arm we select is at least the value of the optimal arm. The reason we want that is that later, when we do the regret bound, we do not want to deal with quantities we don't observe, namely the value of the optimal arm. We don't know Q(a*); we don't even know which arm a* is. The regret is written in terms of Q(a*), so we need some way to get rid of that quantity, and we're going to use these upper bounds, which requires the fact that the upper bound of the arm we select is at least Q(a*).

That property holds if the confidence bounds hold, so what is the probability that they do? We take a union bound over events: over all time steps and over all m arms, we bound the probability that a confidence bound fails, that is, that the empirical mean plus the bonus is not actually above the true mean. By the way we chose u, each individual failure has probability at most δ/t². Summing over all arms gives a factor of m, and summing over all time steps, note that the sum from t = 1 to infinity of 1/t² is π²/6, which is less than two, so the total failure probability is at most 2 m δ. In other words, with probability at least 1 − 2 m δ, the upper confidence bounds hold on all time steps, and what we'll end up with is a high-probability regret bound: with probability at least 1 − 2 m δ, the regret is small. A question: is this for the infinite-horizon case? Yes; by summing to infinity the bounds hold forever, and we'll state the regret in terms of T, the number of time steps so far. The union-bound computation is summarized below.
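In symbols, the failure-probability computation is:

$$ P\big( \exists\, t,\ \exists\, a:\ Q(a) > U_t(a) \big) \;\le\; \sum_{t=1}^{\infty} \sum_{a=1}^{m} \frac{\delta}{t^2} \;=\; m\, \delta \sum_{t=1}^{\infty} \frac{1}{t^2} \;=\; \frac{\pi^2}{6}\, m\, \delta \;<\; 2\, m\, \delta, $$

so with probability at least 1 − 2mδ, every arm's confidence bound holds at every time step.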
Alright, so why is this useful? Do you guys want me to leave this up, or can we move this board? Has everyone who wants it written it down? Okay, let's see if this can go up... or not. Okay, so why is this useful? We're now going to define our regret. This part of the board just says that we've made it so these upper confidence bounds hold with high probability; now we're going to try to show what our regret is going to be. Oh good, okay, thank you. Alright, so what's our regret? The regret of our UCB algorithm after T time steps is just the sum over those time steps, t = 1 to T, of Q(a*) minus Q(a_t). Remember, we don't know either of these things: we don't know the real mean of any arm we pick, and we don't know the real mean of the optimal arm, so we need to turn this into quantities that we do know. What we're going to do is one of our favorite tricks in reinforcement learning, which is to add and subtract the same thing: we write the sum from t = 1 to T of U_t(a_t) minus Q(a_t), plus Q(a*) minus U_t(a_t). I just added and subtracted the same thing, the upper confidence bound of the arm that we selected at each time point. The important thing is that what we showed over there is that if all of our confidence bounds hold, then the upper confidence bound of the arm we selected is at least Q(a*). That means the term Q(a*) minus U_t(a_t) has to be less than or equal to zero, because the upper confidence bound of whatever arm we selected, as we proved over there, is at least the real mean of the optimal arm. So the second part of this expression is less than or equal to zero, which means we can upper bound our regret by dropping that second term. That's nice, because now there is no a* anymore: we only look at the actions we actually took at each time step, and we compare the upper confidence bound at that time step to the real mean. But remember the way we've defined our upper confidence bound over here: U_t(a) is exactly the empirical mean Q̂_t(a) plus the square root of (1/(2 N_t(a))) log(t²/δ). And we said that this bonus is, from Hoeffding, a bound on the difference between Q(a) and Q̂(a): the probability that the difference exceeds this quantity, which I called U, is small. So now, since we're assuming that all of our confidence bounds hold, we know that the difference between the empirical mean and the true mean of this arm is bounded by that quantity U.
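In symbols, the decomposition we just did, with the second term dropped, is:
\[
\text{Regret}_T
= \sum_{t=1}^{T}\bigl(Q(a^*) - Q(a_t)\bigr)
= \sum_{t=1}^{T}\Bigl[\bigl(U_t(a_t) - Q(a_t)\bigr) + \bigl(Q(a^*) - U_t(a_t)\bigr)\Bigr]
\;\le\; \sum_{t=1}^{T}\bigl(U_t(a_t) - Q(a_t)\bigr),
\]
since $Q(a^*) - U_t(a_t) \le 0$ whenever the confidence bounds hold.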
[Student] Going back to the leftmost bottom panel, sorry, it's a little hard to see; two questions. First, you have a union over i = 1 to the number of arms, and I don't see where that index actually factors in. Second, could you go over the third line, the δ/t² summation, and how we derive it? Sure. Are you asking about going from the second line to the third line? [Student] Yeah. So what we did there is we said we want to make sure that, on each of the time steps, all of the upper confidence bounds hold, and that's where we get the additional sum over all the arms. This is conservative: you could imagine only requiring it for the arm that's selected, but we don't know in advance which arm will be selected, so this is a looser upper bound that says this is sufficient. We're saying that if you want to guarantee the upper confidence bound of the selected arm is at least Q(a*), it is sufficient to ensure your upper confidence bounds are valid at all time points, by the reasoning up here. So this is the probability that your upper confidence bounds are correct at all time points: for every single time step and every single arm, the upper confidence bound has to hold. And then the probability that, on a particular time step, the upper confidence bound fails for a particular arm is at most δ/t²; that's how we defined that U term, so that by Hoeffding's inequality the bound would fail with probability at most δ/t². And then I made a side note, which some of you might have seen but I certainly wouldn't expect everyone to have: it turns out that the sum over t = 1 to infinity of 1/t² is π²/6, which is less than 2. Fun fact. So you can plug that in: you get a 2 here, the sum over all arms gives a factor of m, and you have a δ, so this lets us take that infinite sum. Notice also, and this goes back to the earlier question, that this holds for the infinite horizon, because when we did the summing we made sure our confidence bounds hold forever. Okay, great. So we're doing all of this part under the assumption that our confidence bounds hold, and the confidence bounds holding means that the difference between our empirical mean and the true mean of the same arm is bounded by U with high probability, where U is that bonus term; that's what Hoeffding's inequality allowed us to say. So take that quantity, our U, and plug it in here; this is looking exactly at the difference between our upper confidence bound and Q, so the regret is bounded by the sum over t = 1 to T of U of a_t. It's a little confusing in terms of notation, so I'll just plug in the exact expression: the square root of 1/(2 N_t(a_t)) times log(t²/δ). We've just plugged in that the difference between the empirical mean and the true mean is bounded by the quantity U. Alright, so then what we can do is split this up according to the different arms that we pulled. This is a sum over all time steps; note that if we upper bound the log term by replacing t with big T, this is less than or equal to the square root of (1/2) log(T²/δ) times a sum over all time steps, and we can split that sum up according to which arms we selected. That is the same as asking, for each arm, how many times we pulled it: a sum over i = 1 to m of a sum over n = 1 to N_T(i) of 1 over the square root of n, where i indexes the arms and N_T(i) is the total number of times we selected arm i. And then note that if you sum 1 over the square root of n from n = 1 to N, that is less than or equal to 2 times the square root of N; you can use an integral argument for that, and I'm happy to talk about it offline. [Student, in the back] Should there be a two there? Thank you, yes, I can put a two here. I'll be a little loose with constants, so definitely catch me; most of these bounds will end up being about whether the regret is sublinear or linear, but it's good to be precise.
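Collecting the steps from the board (and staying loose with constants, as in the lecture), the chain of inequalities is, with N_T(i) the number of times arm i is pulled by time T:
\[
\sum_{t=1}^{T}\sqrt{\frac{1}{2N_t(a_t)}\log\frac{t^2}{\delta}}
\;\le\; \sqrt{\tfrac{1}{2}\log\tfrac{T^2}{\delta}}\;\sum_{i=1}^{m}\sum_{n=1}^{N_T(i)}\frac{1}{\sqrt{n}}
\;\le\; \sqrt{\tfrac{1}{2}\log\tfrac{T^2}{\delta}}\;\sum_{i=1}^{m} 2\sqrt{N_T(i)},
\]
using $\sum_{n=1}^{N} 1/\sqrt{n} \le \int_0^{N} x^{-1/2}\,dx = 2\sqrt{N}$.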
So we have this quantity, the double sum of 1 over the square root of n; when is it maximized? It's maximized if we pulled all the arms an equal number of times. Why? Because 1 over the square root of n is decreasing in n, so the sum is largest if you split your pulls across all the arms equally: you replace the later, smaller terms from a heavily pulled arm with earlier, larger terms on another arm. So if we go back up to here and call this quantity A, then A is less than or equal to the square root of (1/2) log(T²/δ) times the sum over i = 1 to m of the sum over n = 1 to T/m of 1 over the square root of n, as if we had split all of our pulls equally across the arms. Then we can use the same integral bound: this is at most the square root of (1/2) log(T²/δ) times the sum over i = 1 to m of 2 times the square root of T/m, and summing that over the m arms gives 2 times the square root of Tm, so overall the regret is at most the square root of 2 T m log(T²/δ), up to the constants we were loose with. And now we're done. What has this shown? It has shown that if we use upper confidence bounds, the rate at which our regret grows is sublinear: a square root in T times a log factor, where T is the number of time steps. So if we use upper confidence bounds to make our decisions, the regret grows much more slowly. This is a problem-independent bound: it doesn't depend on the gaps between the arms. There are much nicer, tighter bounds that do depend on the gaps, but this indicates why optimism is a fundamentally sound thing to do in the case of bandits: it gives us much better performance in terms of regret than the epsilon-greedy case. [Student] Could you explain one more time the last line on the top board, where you went from the summation over t = 1 to T and just pulled out the t² term as big T? What about the terms where t is 1 up through T minus 1? Great question. The log term, log(t²/δ), with t ranging from 1 to T, is maximized when t equals big T, so we're upper bounding that log term for every term in the sum; it then becomes a constant and we can pull it out. So the cool thing here is that the regret is sublinear; that's really the main point. We'll go through an example and more of this next time: we'll go through the broken-toe toy domain and see what these upper confidence bounds look like in that case, and what the algorithm will do in those scenarios. So this is one class of techniques, the optimism-under-uncertainty approach, which says we act based on a combination of the empirical rewards we've seen so far plus an upper confidence bound over them, and use that to make decisions. The next thing we'll see is the case where we are Bayesian about the world: we instead maintain a prior, update that prior, and use it to figure out how to act. We'll go through that next week, and I'll see you then.
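For readers who want to see the rule in code, the following is a minimal sketch of a UCB strategy of the kind analyzed above, run on a simulated Bernoulli bandit. It is not code from the lecture: the function name, the example arm means, and the horizon are made up for illustration, and the bonus term uses the same sqrt(log(t²/δ) / (2 N_t(a))) form as in the analysis.

```python
import math
import random


def ucb_bandit(arm_means, T, delta=0.05):
    """Minimal sketch of a UCB rule on a simulated Bernoulli bandit.
    arm_means are the true means, unknown to the agent; they are used
    only to generate rewards and to measure regret."""
    m = len(arm_means)
    counts = [0] * m            # N_t(a): number of pulls of each arm so far
    sums = [0.0] * m            # running sum of observed rewards per arm
    best_mean = max(arm_means)  # Q(a*), used only for bookkeeping
    regret = 0.0                # accumulates Q(a*) - Q(a_t) over time

    for t in range(1, T + 1):
        def upper_bound(a):
            # Empirical mean plus a Hoeffding-style bonus
            # sqrt(log(t^2 / delta) / (2 N_t(a))); arms never pulled get an
            # infinite bound so that each arm is tried at least once.
            if counts[a] == 0:
                return float("inf")
            q_hat = sums[a] / counts[a]
            bonus = math.sqrt(math.log(t * t / delta) / (2 * counts[a]))
            return q_hat + bonus

        a_t = max(range(m), key=upper_bound)   # argmax over a of U_t(a)
        reward = 1.0 if random.random() < arm_means[a_t] else 0.0
        counts[a_t] += 1
        sums[a_t] += reward
        regret += best_mean - arm_means[a_t]

    return counts, regret


# Hypothetical usage: the accumulated regret should grow roughly like
# sqrt(T) rather than linearly in T.
# counts, regret = ucb_bandit([0.3, 0.5, 0.7], T=10_000)
```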
