018 Deep Q-Learning Intuition - Learning (English subtitle transcript)

Hello, and welcome back to the course on artificial intelligence. And finally we're on to the fun stuff: we're on to deep Q-learning.

All right, so let's have a look. Previously we spoke about Q-learning and what it's all about. We learned about agents and environments, and how the agent will look at the state he or she is in, take an action, get a reward and enter a new state, and how, based on that feedback loop, they will continue taking actions, learn from them, and understand what the best actions to take are. We looked at this basic example of a maze and understood that as the agent explores the environment, it learns what the values of the states are. Then we moved on from dealing with the values of states to dealing with the values of actions, the Q-values, and based on that we understood how plans work in deterministic (non-stochastic) environments and how policies work in stochastic environments, and this is an example of a policy. So that's a quick recap of everything we discussed in basic Q-learning. Now let's have a look at how this can be taken to the next level by adding deep learning.

OK, so this is our environment. Instead of just doing basic calculations in this matrix that we have, which is pretty simple, what we're going to do now is add two axes, an x and a y axis, or we'll call them x1 and x2 just to make things even more general. Here we number the columns 1, 2, 3, 4, and here we number the rows 1 to 3. So now every single state can be described by a pair of two values, x1 and x2: any one of these squares the agent can possibly be in can be described by x1 and x2. For instance, right now he's in the square with x1 equal to 1 and x2 equal to 2. That's how we can describe any square, meaning we can describe any state. Of course, this is a very simplified way of describing the states of an environment, but nevertheless it works in this case, and it means that we can now feed these states into a neural network. By the way, I'd just like to mention that at the end of the course we've got Annex number 1 and Annex number 2, and in order to proceed successfully with this section it's highly advisable that you check out Annex number 1, which is on artificial neural networks, so you understand how they work. That way we don't have to delve into it here and can simply use that knowledge of how artificial neural networks work.
So we feed this information about the state into a neural network, and it will process this information, the x1 and x2. Depending on the structure of the neural network it might have multiple hidden layers and so on; that's something you'll work out in the practical tutorials. But at the end we will structure it in such a way that it spits out four values, and these four values are actually going to be our Q-values, the values which dictate which action we need to take. Further on in this tutorial we'll see exactly how these Q-values are used to decide which action is taken. The main point here is that we no longer look at this maze purely from a Q-learning perspective: we're now taking the states of the maze and feeding them into a deep neural network in order to get these Q-values. And at the end of the day we're still going to come up with an action, we're still going to understand what action we need to take, and we'll discuss all of this in more detail.
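As a rough illustration of what was just described, a state vector of two coordinates going in and four Q-values coming out, here is a minimal sketch. It uses PyTorch, and the layer sizes and the two hidden layers are assumptions made for this example, not necessarily the architecture built in the practical tutorials.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector (x1, x2) to one Q-value per action (e.g. up, right, down, left)."""
    def __init__(self, state_size=2, n_actions=4, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),   # outputs Q1, Q2, Q3, Q4
        )

    def forward(self, state):
        return self.net(state)

# Example: the agent standing in the square x1 = 1, x2 = 2
q_net = QNetwork()
state = torch.tensor([[1.0, 2.0]])       # the state described as the pair (x1, x2)
q_values = q_net(state)                  # shape (1, 4): one predicted Q-value per action
action = int(torch.argmax(q_values))     # pick the action with the highest predicted Q-value
```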
But the question right now is: why? Why are we doing all of this? Why are we making things so much more complicated when that initial approach of Q-learning was already working well? The reason is that Q-learning was working in this very simplistic environment, and we're continuing to use this very simplistic environment for now in order to better understand the concepts. But simple Q-learning will no longer work in more complex environments. We're talking about, for instance, the self-driving car which we'll be creating, an artificial intelligence playing Doom or other Atari games like Breakout, and more advanced reinforcement learning applications such as robots walking around and performing actions. In all those cases basic Q-learning is simply not powerful enough to master the challenge. And just like we've seen in the deep learning course, if you've taken it, or if you've done the annex sections, Annex number 1 and Annex number 2, you'll know that deep learning is by far superior to any other type of machine learning, let alone simple Q-learning. That's why we're leveraging the power of deep learning here: we're feeding the information about the environment, as a vector of values (in this case just two values), into a deep neural network, and then we're using that to decide which actions our agent is going to take. So that's a high-level overview of why we're doing this.

Now let's have a look, in a bit more detail, at what happens to the concept of Q-learning when we make the transition from simple Q-learning to deep Q-learning. As you saw in the previous intuition tutorials, we had a slide like this, which is the foundation of temporal difference learning. This is the formula for temporal difference, so let's go through it. We had an agent in the state over here, indicated by the blue arrow, and we were looking at how temporal difference works for the value of, for instance, going up. What we saw (this is in simple Q-learning, not deep Q-learning) was that, before acting, the agent had a certain Q-value that he had learned for this action of going up. He decides to take the action and go up, and right after taking it he gets a reward for taking this action in this state. That reward, plus the value of the state he's now in, which is the maximum over all the new actions a' of the Q-values Q(s', a') in the new state s', multiplied by the discount factor gamma, is essentially the new, empirical Q-value he has just observed for taking that action. Ideally these two should be the same: the Q-value he had in memory for this action in this state should equal the actual reward plus gamma times the value of the state he ended up in. And that's how we calculate the temporal difference: we take what he actually got and subtract what he had in mind, what he was expecting; that difference is the temporal difference. Then you use your learning rate alpha to adjust your Q-value by the temporal difference, with alpha as the coefficient. So that is the essence of simple Q-learning.
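For reference, the temporal difference update just described can be written down in a few lines for the tabular (simple Q-learning) case. This is only an illustrative sketch: the dictionary-based Q-table and the default values of alpha and gamma are assumptions, not anything fixed by the lecture.

```python
# Simple (tabular) Q-learning: no neural network yet.
# Q is a dict mapping (state, action) -> value; alpha is the learning rate, gamma the discount factor.
def td_update(Q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    old_q = Q.get((state, action), 0.0)                             # what the agent expected
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)   # value of the state he ended up in
    td = reward + gamma * best_next - old_q                         # temporal difference: observed minus expected
    Q[(state, action)] = old_q + alpha * td                         # nudge the stored Q-value by alpha times TD
    return Q
```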
Now let's have a look at how this changes in deep Q-learning. We're still going to work with this slide, but we're going to see exactly what's happening. In deep Q-learning the neural network will predict four values, as we saw just now and as we'll see further on; it might predict more values if there are more possible actions in a given state, but in this case we know there are only four actions: up, right, left and down. So the neural network will predict these four values, and what's important in the deep Q-learning situation is that there is no "before" and "after"; we'll get to know this a bit better in a moment. The network predicts these four values and compares them not to what happens afterwards, but to the value that was calculated the previous time the agent was in this exact square. So let's say some time ago the agent was in this exact square as well and calculated this value; the agent stored that value for the future, and now the future has come. He's in the square again, he's got these predicted Q-values, one of which is for going up, and he's going to compare the predicted value of Q to the value he recorded on that previous visit, when he was in this square assessing the situation and actually performed this action. We'll understand exactly why this is important shortly; what matters here is that there's no before and after: in this specific square, at this specific time, we take the Q-value predicted by the neural network now and compare it to the value saved from the previous time the agent was in this square. So there we go.
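To make the "value recorded on a previous visit" concrete, here is one hedged way to picture it, reusing the QNetwork sketch above. The caching dictionary, the function name and the numbers are purely illustrative; the lecture does not pin down this mechanism, and later implementations may handle targets differently.

```python
import torch

gamma = 0.9              # discount factor (illustrative value)

def q_target(reward, next_state):
    """Empirical target for the action just taken: r + gamma * max_a' Q(s', a')."""
    with torch.no_grad():                               # the target is treated as a fixed number, not trained through
        return reward + gamma * q_net(next_state).max()

# The lecture's framing: when the agent acts in a square, it records this number,
# and compares its fresh prediction against it the next time it finds itself in that square.
stored_targets = {}
stored_targets[((1, 2), "up")] = q_target(reward=0.0, next_state=torch.tensor([[1.0, 3.0]]))  # numbers are made up
```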
Now let's have a look at how this all works out in the neural network, and why. I know it sounds a bit complicated right now, but we'll break it down into simple terms in a second. So this is our neural network: we feed the state of the environment into it, it goes through the hidden layers, and it comes out with these outputs Q1, Q2, Q3, Q4. In that specific state, these are the Q-values the neural network is predicting for the possible actions. Then we compare them to targets, and these targets are exactly what we just described: if we go back, this is the predicted value, and we also have a target from the last time we were in this square, a target for this same action, up for instance. So we've got targets, and we compare Q1 against its target, Q2 against its target (the target we had from before), Q3 against its target, and Q4 against its target. This is the part where the neural network, that is, the agent, is now learning through deep learning. The key point is that we're still applying Q-learning, but the concepts have to be adapted: in simple Q-learning you learn through temporal differences, which are pretty straightforward, which we've already discussed and which we know quite well by now. Neural networks, on the other hand, learn by adjusting their weights. So we have to adapt the concepts of reinforcement learning, the concepts of simple Q-learning, to the way neural networks actually work, and that is through updating their weights. That's what we're trying to figure out here: how do we adapt the concept of temporal difference to a neural network so that we can leverage the full power of neural networks? So far we've got this: we enter our environment state as a vector, it goes through the neural network, we get predictions of Q-values, and from the previous time the agent was in that state we have the Q-targets, targets 1, 2, 3 and 4 for each of the respective actions. So now we compare each one with its target. From here it becomes pretty straightforward if you're up to speed with neural networks (once again, that's Annex number 1): we calculate a loss by taking Q-target (this one) minus Q (this one), squaring that, and doing the same for each pair. So we take the sum of the squared differences between these values and their targets, and that's going to be our loss.
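The loss just described, the sum of squared differences between each predicted Q-value and its target, might look like this. It continues the sketches above, and the target numbers are made up purely for illustration.

```python
import torch

q_pred = q_net(state).squeeze(0)                    # the four predictions Q1..Q4 for the current state
q_targets = torch.tensor([0.8, 0.1, -0.2, 0.3])     # illustrative targets recorded earlier, one per action
loss = torch.sum((q_targets - q_pred) ** 2)         # L = sum over actions of (Q_target - Q)^2
```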
Ideally, just as in temporal difference learning, and if we go back for a second, remember we said we want this to be equal to this, so we want the temporal difference to be zero. That means the agent is predicting exactly correctly: the Q-values the agent predicts, or has in memory, describe the environment exactly, and therefore the agent knows the environment pretty well; there are no surprises. If the temporal difference is highly positive or highly negative, then we've got surprises; but if the temporal difference is zero, the agent knows the environment so well that he can predict what's going on, his policy is going to be very good, and he's going to be able to navigate. Here it's the same thing: we want this loss to be as close to zero, as small, as possible. And this is the part where we leverage the real, true power of neural networks: we take this loss and we use backpropagation, with stochastic gradient descent, to pass it back through the network and update the weights of all of the synapses in the network, so that the next time we go through this network, the weights are already a bit more descriptive of the environment. That's exactly what happens: the loss is calculated, it gets backpropagated through the network, the weights are updated; then the next time we get here it happens again, and again, and it keeps happening. And that's how this agent learns: the neural network, which is the brain of the agent, becomes more and more descriptive of the environment, and therefore the agent is able to navigate the environment.
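The backpropagation and weight update described here is something the deep learning library takes care of. A minimal sketch of a single update, again assuming the PyTorch objects from the earlier snippets (the optimizer choice and learning rate are illustrative):

```python
import torch.optim as optim

optimizer = optim.SGD(q_net.parameters(), lr=0.001)   # stochastic gradient descent over the network's weights

optimizer.zero_grad()   # clear gradients left over from the previous update
loss.backward()         # backpropagate the loss through the network
optimizer.step()        # adjust the weights (the "synapses") so predictions describe the environment a bit better
```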
When we say "descriptive of the environment", it basically means that when we put in the states this agent finds itself in, the Q-values we get out are closer and closer to the actual Q-values, and that happens because the Q-targets are empirically derived. How does the agent find these Q-targets? He actually observes: once I take this step, what reward do I get, and what is the value of the state I end up in? It's the same thing we saw previously in the simple Q-learning intuition. So he learns this through trial and error, and he adjusts his network, or rather its weights, in such a way that the predicted values come closer and closer to approximating those target Q-values. It's very similar to the concept we discussed for temporal difference learning in the simple Q-learning algorithm. So there you go: that's how the agent learns. So we're up to here, and that's the learning part.
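Putting the pieces of this section together, the repeated trial-and-error loop could be sketched roughly as below. Everything here reuses the objects defined in the earlier snippets; `env` is a hypothetical maze environment (not defined in this lecture), and details that real implementations add, such as an exploration strategy and experience replay, are left out of this sketch.

```python
for episode in range(1000):                                  # it keeps happening, again and again
    state = env.reset()                                      # hypothetical: returns the state as a (1, 2) tensor
    done = False
    while not done:
        with torch.no_grad():
            action = int(torch.argmax(q_net(state)))         # act on the current predictions (the "trial")
        next_state, reward, done = env.step(action)          # observe what actually happens
        target = q_target(reward, next_state)                # empirical target: r + gamma * max_a' Q(s', a')
        loss = (target - q_net(state)[0, action]) ** 2       # squared difference between target and prediction
        optimizer.zero_grad()
        loss.backward()                                      # backpropagate the "surprise"
        optimizer.step()                                     # weights become a bit more descriptive of the environment
        state = next_state
```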
