014 Temporal Difference (English subtitles)

Hello and welcome back to the course on artificial intelligence. Today we're talking about the temporal difference. Now, it's a very important tutorial because temporal difference is the heart and soul of the Q-learning algorithm. This is actually how everything we've learned so far comes together inside Q-learning. So let's have a look.

Remember the time when we talked about deterministic versus nondeterministic search? And remember how we said that in the deterministic case, when the agent wants to go up, he goes up, and in the nondeterministic case, when he wants to go up, there's a 10 percent chance he'll go left, a 10 percent chance he'll go right and an 80 percent chance he'll go straight up. Well, these numbers are of course arbitrary and can be different. And this whole concept could be different in different problems, so it doesn't have to concern which way he's moving, just that there's some randomness, something that's out of the control of the agent, happening inside this environment.

And the effect that had, as you remember, was that in the deterministic example it was very easy to calculate the V values. Well, not necessarily always very easy.
But in our case we could simply calculate them by using the Bellman equation, and we had the exact values. And then, as you remember, I very carefully mentioned that the values for the nondeterministic search example are off the top of my head. They are not calculated, you know. Last time I said we would not calculate them ourselves because it's very complex, but the computer can do it, and we just went along with values that I made up. But they did get the job done. They helped us understand the concept.

Well, now we're going to return to that a little bit and understand what exactly is going on here. Why is it so much harder to calculate these values in the nondeterministic example, or generally speaking in these problems, in these environments with the agent going through them? Why can it be so hard to calculate these values?

Well, when you think about it, it's because when the agent moves, for instance from here to the right, he doesn't necessarily always move that way. Sometimes there's a chance that he will go another way instead of going straight. So let's call the directions north, east, south and west. Instead of going west the agent might sometimes go south, and for instance from here, instead of going north, he might sometimes go east. So, sorry.
So here, instead of going east he might sometimes go south; instead of going north he might sometimes go east or west; and here, instead of going north he might sometimes go west or east, and so on. Therefore, in order to calculate the value of this state you would need to know the value of that neighboring state, but the interesting thing is that in order to calculate that value you need to know what this one is. So there's a lot of recursion happening here, and therefore you cannot just decide to define what these values are.

And on top of that, this recursion is not deterministic. Sometimes, instead of going up, he'll go right; sometimes, instead of going up, he'll go left; sometimes, when he wants to go up, he will go up. So it is subject to chance. Maybe many times the agent will go through this path and he'll go up, up, up, up, and you'll think that from here he always goes up and the value of the state will be good, and then all of a sudden he'll drop into the pit and this value will go down.

So you can see how there is some stochastic randomness to this whole calculation of these values, because they're all interlinked. Plus, on top of that, you've got the randomness that is inherent in the environment, because this is a Markov decision process.
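The slip behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 80/10/10 split is the lecture's arbitrary example, and the names `PERPENDICULAR` and `sample_move` are hypothetical, not part of the course material.

```python
import random

# Assumed slip model: the agent moves in the intended direction with
# probability p_intended, otherwise it slips to one of the two
# perpendicular directions with equal probability.
PERPENDICULAR = {
    "up": ("left", "right"),
    "down": ("left", "right"),
    "left": ("up", "down"),
    "right": ("up", "down"),
}


def sample_move(intended, rng, p_intended=0.8):
    """Return the direction the agent actually moves in."""
    side_a, side_b = PERPENDICULAR[intended]
    p_side = (1.0 - p_intended) / 2.0
    return rng.choices(
        [intended, side_a, side_b],
        weights=[p_intended, p_side, p_side],
    )[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
moves = [sample_move("up", rng) for _ in range(10_000)]
frac_up = moves.count("up") / len(moves)
print(f"fraction of 'up' moves: {frac_up:.3f}")
```

Over many samples the fraction of "up" moves settles near 0.8, but any single step can still go sideways, which is why a value learned from a handful of walks through the maze can suddenly change.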
So that's where all this comes together, and that's where we're going to introduce the concept of the temporal difference, which will allow the agent to calculate these values. Here we were dealing with the V values. Since then we've already moved on to Q values, so that's what we're going to be working with. We're going to be looking at Q values.

So, as you recall, this is our Bellman equation for Q values. A Q value, or the value of performing a certain action a in state s, is equal to the reward that you get immediately after performing that action, plus gamma times the sum, across all the possible states you could end up in, of the probability of ending up there times the maximum Q value in that state. So you kind of get the expected value of the state that you will end up in.

As you recall, that was the formula for the Bellman equation, and now, just for simplicity's sake, we're going to rewrite it in the old-fashioned way, the way we used to talk about the Bellman equation before we knew about the stochastic case. So remember, this was our Bellman equation in the sense of a deterministic search example, because here you don't have that expected value, you don't have the sum across all probabilities. You just have it as if it's determined which state you're going to end up in, and then you take the max in that one state.
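For readers without the slides, the two forms of the Bellman equation for Q values described above can be written out as follows. The notation is an assumption based on common convention: R(s, a) is the immediate reward, γ is the discount factor, P(s, a, s') is the probability of landing in state s' after taking action a in state s, and a' ranges over the actions available in s'.

```latex
% Stochastic (expected-value) form
Q(s, a) = R(s, a) + \gamma \sum_{s'} P(s, a, s') \max_{a'} Q(s', a')

% Simplified notation: written as if the next state s' were determined
Q(s, a) = R(s, a) + \gamma \max_{a'} Q(s', a')
```

The second line is only shorthand. In a nondeterministic environment it should still be read as the expectation in the first line.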
And the reason we're rewriting it, the only reason, is that it is just easier to write and it'll be easier to follow along with the formula. So we're just going to remember that we replaced this part with this part. Also, you'll find this notation in a lot of literature, so it'll be easier for you to follow along with other sources if you're studying those. But do remember that in fact what we mean is this probabilistic approach here; this notation is just easier for us to operate with and to understand what's going on. I just like the equations to look so that they're not too cluttered, but once again, remember that in fact what we mean is this probabilistic approach here.

And so, now that we're in the know, let's have a look at what's going on. Here is our blank state of the maze. We don't have any Q values, let's say. Well, we may, but let's just keep it blank for now. Let's look at one of the states, or one of the cells. This one specifically. And here we have, for instance, for the action of going up, a Q value that we've calculated. So it's not that we don't have any Q values yet; we do. But we're just not illustrating anything. We're keeping it blank for simplicity's sake.
But the agent has been walking around for some time and, let's say hypothetically, somehow he's calculated this Q value of going up, or north, from this state, from this specific cell, and the value is Q(s, a). So he is currently where this blue arrow is pointing; the agent is sitting in this cell. And now he needs to make a choice: where is he going to go? He knows the value of this action, going north, and that is Q(s, a).

Here I'm saying "before", and the reason for that is that this is before he takes the action. He hasn't taken the action yet, so he's still in the cell, and before he's taken the action the value here is Q(s, a). And now he actually takes the action. Let's say he decides it is the best one. He takes the action and he moves up to this cell.

Well, now comes "after". After he's taken the action, we can measure what this value is. Let's just calculate it: the reward for taking that action, plus gamma times the maximum Q value of this new state that he's just gotten into, s prime, so the maximum across all possible actions in s prime. And so what we have here is the value of that action before, and then we've calculated this metric afterwards.
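The "before" and "after" quantities can be made concrete with a tiny sketch. Everything numeric here is assumed for illustration: the discount factor, the reward, and the contents of the hypothetical `q_table` are not taken from the lecture.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical Q-table: state -> action -> stored Q value.
q_table = {
    "s": {"up": 0.5, "down": 0.1, "left": 0.2, "right": 0.3},
    "s_prime": {"up": 0.8, "down": 0.2, "left": 0.4, "right": 0.6},
}


def after_value(q, reward, next_state, gamma=GAMMA):
    """Reward actually received plus the discounted best Q value in the new state."""
    return reward + gamma * max(q[next_state].values())


# "Before": the number stored in memory for going up from s.
q_before = q_table["s"]["up"]

# "After": recomputed once the move has been made and s_prime is known.
q_after = after_value(q_table, reward=0.0, next_state="s_prime")

print(f"before = {q_before:.2f}, after = {q_after:.2f}")
```

With these made-up numbers the stored value is 0.50 and the recomputed one is 0.72, so the two do not agree.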
But as you can recall from the previous formula, if we go back to it very quickly, what we just calculated is indeed how Q(s, a) is calculated. So this right-hand part we've just calculated separately, but after we've taken the action.

So again: before, we knew a Q(s, a) value, something that we've calculated through our previous iterations. It's a value that's stored in our memory, just a number that we know. And now, after the action has been performed, we know what reward the agent actually got, and we can calculate this new value. So in essence we're recalculating this value, but now with new information. The new information is the reward that we got, plus what state we ended up in and what the maximum Q value in that state is, so what the value of being in that state is. So basically it is Q(s, a), but given new information.

And now the temporal difference is defined as TD(a, s), the difference between these two. The first element is your "after" value, so kind of like Q(s, a) but calculated afterwards, and from it you subtract the previous Q(s, a), which you had stored in your memory.
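Written as an equation, the definition just described looks like this. The symbols follow the simplified Bellman notation used in this tutorial; the numbers in the worked line are assumed purely for illustration.

```latex
TD(a, s) = \underbrace{R(s, a) + \gamma \max_{a'} Q(s', a')}_{\text{calculated after the action}}
           - \underbrace{Q(s, a)}_{\text{stored before the action}}

% Worked example with assumed values:
% Q(s, a) = 0.5, R(s, a) = 0, \gamma = 0.9, \max_{a'} Q(s', a') = 0.8
TD(a, s) = 0 + 0.9 \times 0.8 - 0.5 = 0.22
```

A positive temporal difference means the move turned out better than the stored estimate suggested; a negative one means it turned out worse.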
And so the question is: are they different? Ideally they should be the same. Ideally the "after" value should equal the stored value, simply because the first is the formula for calculating the second. But the thing is that the stored value is not something we calculated just now. It is something that we have from empirical evidence, from going through the maze many times and calculating. It's something we've come up with so far. It's not related to the current iteration. It's something that we came up with previously, maybe a long, long time ago, in one of our previous iterations of going through the maze. Whereas the "after" value is something we've calculated just now, and there is no guarantee that they're going to be the same, because of the randomness that exists in the maze: one could have been calculated when certain random events were triggered, and the other when different random events were triggered.

So now that we've written down our temporal difference, let's just move it up there. How do we use this? The question is: OK, we have this temporal difference; how do we use it? And why is it called the temporal difference? Well, the reason it's called the temporal difference is that you're basically calculating the same thing. You're calculating Q(s, a), the Q value of that action.
You're calculating it here and you're calculating it here. But the difference is time. This is the Q(s, a) from previously; this is your Q(s, a) now, your new Q(s, a). And the question is: has there been a difference? Has there been a shift between them in time? And how can we use this to our advantage if there indeed has been a shift in time?

Well, one thing we could do is say: OK, this new value doesn't equal the old one, so we are going to get rid of the old one, forget about it, and just use this new value. But that would not be smart. The reason is that in our environments random events can sometimes happen. What if our old Q(s, a) was something that consistently happens, something that represented what happens 80 percent of the time, and this new one is just what happened due to randomness? In that case we're going to throw away the one that is responsible for the bulk of the situations and replace it with something that happens only 10 or 20 percent of the time. That wouldn't be the best approach, and that's exactly why we don't want to completely change our Q values. We want to change them step by step, a little bit at a time.
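A small experiment illustrates why overwriting is fragile. This is a sketch under assumptions: the stream of "after" values below is invented (four normal outcomes followed by one slip, repeated), and `step_size` is a hypothetical name for how far the estimate moves toward each new value.

```python
# Assumed stream of "after" values for a single (state, action) pair:
# 4 times out of 5 things go as intended (1.0), 1 time in 5 the agent slips (-1.0).
targets = [1.0, 1.0, 1.0, 1.0, -1.0] * 20


def run(values, step_size):
    """Move the estimate toward each new value by a fraction `step_size`."""
    estimate = 0.0
    for value in values:
        estimate += step_size * (value - estimate)
    return estimate


overwrite = run(targets, step_size=1.0)  # throw the old value away every time
gradual = run(targets, step_size=0.1)    # change it a little bit at a time

print(f"overwrite: {overwrite:.2f}, gradual: {gradual:.2f}")
```

Because the last sample happens to be a slip, the overwriting estimate ends at -1.0, the rare outcome. The gradual estimate ends in the neighbourhood of the long-run average of 0.6 (a little below it, since the most recent sample was a slip).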
And that's why we're going to use this temporal difference in a specific way. So here's a formula. We're going to take our Q(s, a) and update it in the following way: we take the old value of Q(s, a) and we add alpha times the temporal difference. Alpha is going to be our learning rate. That's a new parameter that we're introducing. It's how quickly the algorithm is learning. So basically we're taking this difference, whatever it is, multiplying it by alpha and adding it onto our previous Q(s, a).

Now, this formula probably doesn't make sense just by looking at it, because you've got Q(s, a) here and Q(s, a) here. It's the same thing, so they should probably negate each other. But we have to rewrite this in a slightly different way. So I'm going to show you again; I'm just adding time to these formulas. Here is Q at t minus one, the previous Q. Here is Q at t, the new one. There should be a circle here and a circle here as well, but never mind. And here you get alpha times the temporal difference at t, the new, current temporal difference. So you can see what we're doing. We're saying: OK, let's take our current Q.
The current Q is going to be equal to our previous Q plus whatever temporal difference we found, times alpha. This formula here is the heart and soul of the Q-learning algorithm. This is how the Q values are updated. And it's good that we've already learned what Q values are, what gamma is, what the reward is and what all this stuff is.

Now all we need to see is that you have a previous Q value. Yes, that's good. And then what happens is that when the agent actually does take the action, he'll get a reward and he'll end up in a state. Based on that he can calculate: aha, OK, so what should have been the Q value of that move that I made? That is this part of the equation. Subtract the old Q value and you get the temporal difference, and now you take alpha times the temporal difference, and that is what you adjust your Q value by.

And now, just to finish off. This is sufficient to understand what's going on, but just to clarify things even more, or perhaps confuse things even more, what we need to do is take this temporal difference over here and plug it into this formula.
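The update rule described above, Q_t(s, a) = Q_{t-1}(s, a) + α · TD_t(a, s), can be sketched as a single function. This is a minimal illustration, not the course's implementation: the dictionary layout, the name `td_update` and the default values of `alpha` and `gamma` are all assumptions.

```python
def td_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one temporal-difference update in place and return the TD value.

    q is a dict of the form {state: {action: value}} (assumed layout).
    """
    # What the Q value "should have been", given what actually happened.
    target = reward + gamma * max(q[next_state].values())
    # Temporal difference: the new estimate minus the stored one.
    td = target - q[state][action]
    # Move the stored value a fraction alpha of the way toward the target.
    q[state][action] += alpha * td
    return td


# Hypothetical two-state Q-table with made-up values.
q = {"s": {"up": 0.5, "right": 0.3}, "s_prime": {"up": 0.8, "right": 0.6}}
td = td_update(q, "s", "up", reward=0.0, next_state="s_prime")
print(round(td, 3), round(q["s"]["up"], 3))  # prints: 0.22 0.522
```

The stored value moves from 0.5 to 0.522, only a tenth of the way toward the recomputed 0.72, which is the step-by-step behaviour the learning rate is meant to provide.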
So we're going to take all of this part, plug it into this formula and end up with a huge equation. Here we go. There's our equation. This is the full equation with the temporal difference written out completely.

The reason I wrote it out is, first of all, that you'll probably find this in other literature if you study it. The second thing is that it makes some things a bit more complex, since the formula is longer, but it also makes some things a bit clearer. For instance, you can see here the role alpha plays. You can see it better because, look at this: here is Q at t minus one, and here you've got Q at t minus one with a negative sign. So if you plug in alpha equal to 1, if you put a 1 in here, then this will negate this. They'll cancel each other out and all you'll have left is this part.

What that means is exactly the situation where we said: all right, you've got a new value, what it should have been, so let's update our Q value with the new value and forget about whatever we had previously. And as we discussed, that isn't the best approach, because there are random events here and we want to update things step by step.
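The expanded equation referred to above can be reconstructed from the definitions given earlier in this tutorial. The time subscripts follow the spoken description; the max term is written without a subscript and is assumed to be read from the currently stored Q values.

```latex
% Update rule with the temporal difference written out in full
Q_t(s, a) = Q_{t-1}(s, a)
          + \alpha \left( R(s, a) + \gamma \max_{a'} Q(s', a') - Q_{t-1}(s, a) \right)

% Collecting the Q_{t-1}(s, a) terms gives an equivalent weighted-average form
Q_t(s, a) = (1 - \alpha) \, Q_{t-1}(s, a)
          + \alpha \left( R(s, a) + \gamma \max_{a'} Q(s', a') \right)

% Extreme settings of the learning rate
\alpha = 1: \quad Q_t(s, a) = R(s, a) + \gamma \max_{a'} Q(s', a')
\alpha = 0: \quad Q_t(s, a) = Q_{t-1}(s, a)
```

The weighted-average form makes the role of α explicit: it is the share of the new estimate that is blended into the old one on every step.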
On the other hand, if you set alpha equal to zero, what happens is that you completely forget about this whole part, and your Q at t, the new or current one, is always going to be equal to the previous one, so you're not going to be learning anything. That means whatever is happening in the maze doesn't matter, because you decided on your Q value a long time ago and you're just going to keep it. So that's why alpha shouldn't be 0 and shouldn't be 1; it should be somewhere in between.

That is going to allow you to learn slowly, step by step. As the agent goes through the maze it's going to get the temporal difference, and slowly but surely this value is going to get updated and updated. What will happen eventually is that at some point, hopefully, the algorithm will converge. What that means is that this temporal difference will start getting closer and closer to zero, and eventually will be very close to zero, or even exactly zero. And that means that every single time, your newly calculated value, what it should have been, so not this one but what it hypothetically should be after you take the step, will be just equal to your previous Q value.
And then the difference is zero, and when your temporal difference is zero, that means your algorithm has converged and it's not really necessary to continue updating your Q values.

The caveat here is that one of the only times when you would still want to continue this whole updating of Q values is if the environment is constantly changing. Not just that it has some random, stochastic events in it, but that the environment itself is modifying, morphing, changing with time. Then you continuously need to learn, because it's not possible for you to learn everything and come up with the optimal policy, because the optimal policy also changes with the environment all the time. In that case you will need to continue calculating the temporal difference and the Q values. But that's kind of an extra complication.

Other than that, this is how Q values are updated. This is the main formula of the Q-learning algorithm, and this is the expanded version of it. And now it should all come together and make sense: why we have the Bellman equation, and not only what the Q values represent but also how the agent goes about updating its values and finding out exactly what is going on in the environment so it can come up with the optimal policy.
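To see the temporal difference shrink toward zero, here is a small self-contained sketch. It makes several simplifying assumptions that are not from the lecture: the environment is a hypothetical four-cell corridor with deterministic moves and a reward of 1 for reaching the last cell, and instead of an agent wandering around, every state-action pair is updated once per sweep. The constants `GAMMA`, `ALPHA` and `TOL` are arbitrary.

```python
GAMMA, ALPHA, TOL = 0.9, 0.5, 1e-6
N_STATES = 4                      # cells 0..3; cell 3 is the terminal goal
ACTIONS = ("left", "right")


def step(state, action):
    """Deterministic corridor dynamics: return (next_state, reward)."""
    nxt = max(state - 1, 0) if action == "left" else state + 1
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward


# All Q values start at zero; the terminal cell is never updated.
q = {s: {a: 0.0 for a in ACTIONS} for s in range(N_STATES)}

sweeps_used = 0
for sweep in range(1, 10_001):    # safety cap on the number of sweeps
    largest_td = 0.0
    for s in range(N_STATES - 1):
        for a in ACTIONS:
            nxt, reward = step(s, a)
            td = reward + GAMMA * max(q[nxt].values()) - q[s][a]
            q[s][a] += ALPHA * td
            largest_td = max(largest_td, abs(td))
    sweeps_used = sweep
    if largest_td < TOL:          # every temporal difference is ~0: converged
        break

print(f"converged after {sweeps_used} sweeps")
for s in range(N_STATES - 1):
    print(s, {a: round(v, 3) for a, v in q[s].items()})
```

Once the largest temporal difference in a sweep falls below the tolerance, further updates change nothing noticeable. In this corridor the values for moving right settle at 0.81, 0.9 and 1.0 from left to right, each one being the discount factor times the best value in the next cell. With random transitions and a fixed learning rate, the temporal difference would typically keep fluctuating around zero rather than reaching it exactly.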
So I know that's quite a lot to take in, but hopefully you enjoyed this tutorial and were able to take away the underlying concepts and intuition behind Q values, what the whole notion of temporal difference is and why it's important: why it helps us slowly train our agents and get them to understand the environments that they're operating in.

If you'd like to learn a bit more about temporal differences, a very popular paper is "Learning to Predict by the Methods of Temporal Differences" by Richard Sutton, from 1988. We've already had a reference by Richard Sutton, but this is another one, and he actually has a book, so if you get into his writing style and his style of communication, then check out his book as well. It's kind of a more expanded version of all of these things. I haven't read the book, but that's what I'm imagining. At the same time, coming back to the paper, you can learn a bit more, or probably a lot more, about temporal differences there.

And I hope you enjoyed it as well. We'll see you next time. Until then, enjoy AI.
