013 Q-Learning Intuition (English subtitles)

Hello and welcome back to the course on artificial intelligence. Today we are finally talking about Q-learning.

All right, so we've already got this equation, the Bellman equation, to which we've added lots of components. We've got the reward here, which can come not just at the very end but at any given step. We've got the discount factor. We've got the probability, because now we're looking at Markov decision processes. And here we've got the possibility of ending up in different states regardless of what action we take, or rather, given the action we take, there can be multiple states that we can end up in. And then we've got the value of the next state, because it's kind of like a recursive function, and so on.

But you probably still have one question: where in all of this is the letter Q? Why is it all called Q-learning? So where's the Q? That's the question we're going to answer today. So far we've been dealing with values, the value of being in a certain state, and now we're going to look at how Q fits into all of that as well.

So here we've got two examples. On the left is what we have been doing so far. Our agent has been analyzing: OK, I'm over here.
This is a Markov decision process, so it doesn't matter how I got here. The rest of the environment doesn't care about the steps it took me to get here. From now on, I have to make the optimal decision about where to go, here, here, or here, based on the current state and all the future states that come from here, but not on the past. And so he can see that there are three options: state one, state two, state three. Based on his experience he has calculated the values of these states, and now he's going to use the Bellman equation. Even though this is a stochastic process, where he knows he intends to go here but there's a chance that he will go left or right and so on, he is going to make a decision based on these values. That's what we have done so far, and it is a totally legitimate approach.

But now we're going to modify it a little bit. We're going to take the same exact concept, the same exact problem, but instead of looking at the value of each state that he can end up in, we're going to look at the value of each action. So we're not going to use the letter V anymore, because that stands for the value of the state; we're going to use Q. You might ask: why the letter Q? Well, some people speculate about that. I read this, I think, on Quora.
Somebody mentioned that Q stands for quality, but I couldn't find any other references to that, so it might not be the reason. It might just be that this was the letter used at the time, and now it has become super popular because the whole method is called Q-learning. So I know of no exact reason why it's called Q. Nevertheless, it at least helps us distinguish between V and Q.

So Q here represents, rather than the value of the state, let's go with quality: the quality of the action. OK, so I've got four actions. What are the different qualities of these actions? What is the value of an action, or the quality of an action? Which action is more lucrative? I need a metric that tells me how to quantify each action so that I can compare them, and that is exactly what Q is.

And so he's got four possible actions, as always: go up, right, left, or down. For each action there's going to be a formula that gives us the quantifiable value of that action, which we're calling the Q-value of that action. So let's have a look at how we're going to derive this formula for Q and how it actually relates to the state values. Because, as you can imagine, actions lead to states, so there has to be some sort of link between the two.
Right, we've already determined how to calculate V, and we're pretty good at it. We know how to use the Bellman equation in very different environments with lots of different complications. So let's leverage that knowledge to understand how we can now calculate Q in order to make the same predictions. As you can imagine, the environment doesn't change depending on which approach we use; it is going to be the same regardless. Therefore this approach and this approach should always give the same result, and that's another reason why the two should be linked.

So let's have a look. Here is our V approach, where we just look at the value of any given state, this state or any other state. We're using the letter s here because that's the current state, so the terminology will be the same in both equations. And here we're using Q, and Q is a function of the state s and the action a, because the action is "up", but in which state do we perform that action? We perform it in the state s.

OK, so now we're going to write out the Bellman equation for the first approach. As you can see here, V(s), the value of any given state s, is a maximum taken over the actions you have. In this case you have four actions, so it is the maximum, over all the possible actions, of this part, which we've discussed many times: the reward that we get from performing that action in that state, plus the discount factor multiplied by the expected value of the new state that we're going to be in. It's an expected value because this is a stochastic process. We don't know for sure that we're going to end up over here; we might end up on the left or on the right with some probability. That's why these probabilities are in here.

All right, so that's our value. Now let's look at Q. We're going to use this to define Q. Let's say the agent, from this location, from this state, performs the action "up". What is the Q-value going to be equal to? Well, first of all, let's see what he will get in return for performing this action. The first thing he'll get is a reward, no doubt about it. There's going to be some sort of reward. It might be zero, but we know that the way this reinforcement learning process works is that sometimes performing a certain action from a given state gives a reward. So I'm going to add that in here. And then what are we going to add? Well, let's think about it.
What is the next thing that happens after he goes there? The next thing that happens is that the agent is now in a certain state. He could end up here with 80 percent probability, or some other probability, or he could end up here or here. But wherever he ends up, we already have a quantified metric for the state he's in, and that is the value of that state. But because he can end up in several different states, one of three possible states, we have to look at the expected value of the state that he'll be in. So we're going to add that in. We're going to add, of course, the discount factor, as we had previously, because that state is somewhere in the future. And then we're going to add the sum, across all possible states that he could end up in by taking this action, of the value of each state times its probability. So what we're saying here is: by performing an action you're going to get a reward, which is a quantified metric, and then you end up in a state. We don't know which one; it could be here, it could be here, it could be here. But here is the expected value of the state that you're going to end up in, and we multiply it by the discount factor because that state is one move away.
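Written out, the two expressions described so far can be sketched as follows. The notation is an assumption, since the slides are not part of the transcript: R(s, a) stands for the reward for taking action a in state s, γ for the discount factor, and P(s, a, s') for the probability of landing in state s' after taking action a in state s.

```latex
V(s) = \max_{a} \Big( R(s, a) + \gamma \sum_{s'} P(s, a, s') \, V(s') \Big)

Q(s, a) = R(s, a) + \gamma \sum_{s'} P(s, a, s') \, V(s')
```

The right-hand side of the second line is the expression inside the brackets of the first line, evaluated for one particular action.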
So that is our Q-value for performing this action. What you will notice right away is that the Q-value is exactly identical to what's inside these brackets over here. Why is that? Well, if you think about it, here we're taking the maximum, across all possible actions, of the result we'll get by taking each of those actions; we've got four actions. And in Q we're defining precisely that thing: what we will get by taking a certain action. So it makes sense that the value of a state, for instance this state, is the maximum of all of the possible Q-values. By being in this state the agent has one Q-value, two Q-values, three Q-values, four Q-values, so four possible Q-values, and it makes sense that the value of the state is the maximum of those four Q-values. That is exactly what we can see here, and it's a good confirmation of the new formula we derived. If that weren't the case, if it didn't match up, then we would have questions. We would ask why it doesn't match up, if the Q-value is a quantified metric of performing an action and V is the maximum of the possible results of the four actions that he can perform. So that makes sense, and it confirms the formula that we've just derived.

And now we're going to make it even more interesting. We're going to get rid of V entirely. As you can see here, V is a recursive function: you've got V, and then V, and then V, and so on, so you can express this V through all of the following V's, the optimal V's that come after it. Here, on the other hand, we're expressing Q as a function of the next V, and then you'd have to plug in this V, and we're back to V again. So what we're going to do is take this V and replace it with Q. Let's have a look at that.

We're going to take this V of the next state and plug it into the formula over here. As you can see, this part doesn't change, and this probability doesn't change. But as we just discussed, V(s) is the maximum over all actions of Q(s, a), right over here. So that's what we're going to substitute in. The maximum is now taken over a', the new action, the action we're going to take next, because here we've got V(s'). So now we've got the maximum over a'. These are the actions that we're going to take from this state, or from whichever other state we end up in, and the maximum is over all the Q-values that are available to us in that new state: Q(s', a'). So there are going to be another four Q-values there.

Let's go through it again. From what we derived, just through logic and intuition, we can see that V(s) and Q(s, a) are linked: V(s) is the maximum across all actions of Q(s, a). You can see it right here; this part is identical to this part. Then we leverage that and replace this bit with V(s) from here. Not this exact formula, though: we take the internal part and write it in terms of Q. So we plug that in here, and this part becomes the maximum over all a' of Q(s', a'). And now we have our formula.
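The substitution can be sketched in symbols. The notation is assumed for illustration rather than copied from the slides: R(s, a) is the reward, γ the discount factor, P(s, a, s') the transition probability, and s' and a' the next state and the action taken from it.

```latex
V(s) = \max_{a} Q(s, a)

Q(s, a) = R(s, a) + \gamma \sum_{s'} P(s, a, s') \, \max_{a'} Q(s', a')
```

The second line no longer mentions V at all: the value of the next state has been replaced by the best Q-value available in that state.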
So now we have a recursive formula for the Q-value, and the agent can think: what's the value of this action, what's the quality of this action, what's the Q-value of this action? Well, it depends on the reward I get in the immediate step, plus the discount factor times the maximum of all the possible Q-values in the next state. But I don't know which state I'm going to get to, so I also need to look at the other states I could end up in, and that's why we have this expected value over here: some probability times the maximum, summed up, which is an expected value. So it's a very similar formula, as you can see, but this time we're expressing things through the Q-values.

And that's why this whole algorithm is called Q-learning: the Q-values are what the agents actually use. They don't look at the states; they look at their possible actions, and based on the Q-values of those actions they decide which action to take. The agent just looks for the maximum Q-value. In a given state it has four actions; which is the best one to take? Instead of comparing the different states it can end up in, it compares the possible actions it currently has. Having found the optimal one, it takes that action, and then the agent repeats that process again and again.
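To make the recursive formula concrete, here is a minimal Python sketch that repeatedly applies it to a tiny made-up problem and then picks the action with the largest Q-value. Everything in it is an assumption for illustration: the states "A", "B" and "end", the action names, the rewards, the probabilities, and the discount factor of 0.9 do not come from the lecture. It also assumes the transition probabilities are known and simply sweeps the formula until the numbers settle, which is not the sample-based learning procedure an agent would use when it has to discover the environment by acting in it.

```python
GAMMA = 0.9  # discount factor (assumed value)

# Hypothetical toy environment.
# TRANSITIONS[state][action] is a list of (probability, next_state) pairs.
TRANSITIONS = {
    "A": {"safe": [(1.0, "B")], "risky": [(0.5, "end"), (0.5, "A")]},
    "B": {"go": [(1.0, "end")]},
    "end": {},  # terminal state: no actions available
}
# REWARDS[(state, action)] is the immediate reward R(s, a).
REWARDS = {("A", "safe"): 0.0, ("A", "risky"): 0.5, ("B", "go"): 2.0}


def state_value(q, state):
    """V(s) = max over a of Q(s, a); a terminal state is worth 0."""
    return max((q[(state, a)] for a in TRANSITIONS[state]), default=0.0)


def q_value_iteration(sweeps=50):
    """Apply Q(s,a) = R(s,a) + gamma * sum_s' P(s,a,s') * max_a' Q(s',a') repeatedly."""
    q = {(s, a): 0.0 for s, actions in TRANSITIONS.items() for a in actions}
    for _ in range(sweeps):
        new_q = {}
        for (s, a) in q:
            # Expected value of the best Q-value in the next state.
            expected_next = sum(p * state_value(q, s2) for p, s2 in TRANSITIONS[s][a])
            new_q[(s, a)] = REWARDS[(s, a)] + GAMMA * expected_next
        q = new_q
    return q


def greedy_action(q, state):
    """Pick the action with the largest Q-value in the given state."""
    return max(TRANSITIONS[state], key=lambda a: q[(state, a)])


if __name__ == "__main__":
    q = q_value_iteration()
    for key in sorted(q):
        print(key, round(q[key], 3))
    print("best action in A:", greedy_action(q, "A"))
```

In this toy problem the agent in state "A" compares its two Q-values directly, without looking up the values of the states it might land in, which is the point made in the lecture.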
So now you can see how all of this comes together: how the reward, the discount factor, the stochastic Markov decision processes, the values, and the Q-values all combine to create this one super powerful Bellman equation for Q-values, which we can now apply to let our agents learn how to beat the environment.

And so that is an intuitive explanation of what's going on. I know we went through the formulas, but it is necessary because this is the formula we've been building up through this whole chapter, and I think it's a good transition from V to Q. It illustrates how the two are linked.

If you'd like a more rigorous, mathematical approach, if you'd like to see the mathematics behind it and learn a bit more about Q-values and how they work, then we've got some additional reading for you. The paper is called "Markov Decision Processes: Concepts and Algorithms" by Martijn van Otterlo (2009). You've got the link here as always, and there you can read in more detail to understand all the nitty-gritty behind Q-values and so on. Now that we've discussed all of these things relating to the Bellman equation, we are ready to look at something more complex, such as this paper, if we want to get some additional information and a deeper understanding.
But even if you don't read the paper, you should already have a good working knowledge of what Q-learning is all about and how agents come up with the actions that they need to take in a certain environment. So I hope you enjoyed today's tutorial, and I look forward to seeing you next time. Until then, enjoy AI.