008 The Bellman Equation (English subtitle transcript)

Hello and welcome back to the course on artificial intelligence. Today we're going to talk about the Bellman equation. It's quite a complex topic, and we're going to introduce it in a step-by-step manner throughout this whole section of the course. I'm not going to jump straight into the most complex version of the Bellman equation right away; instead we'll build it up slowly so that we gradually understand how it works. I hope you're on board with that approach. Let's get straight into it.

We're going to be working with a few key concepts. S stands for state: the state our agent is in, or any other possible state it could be in. A represents an action the agent can take. An agent has access to a certain list of actions, and actions are most meaningful when looked at in combination with a state: once you know which state you're in, it starts to make sense what the result of those actions will be, whereas an action by itself, or a state by itself, doesn't tell you much, because you don't know where you are or where you could end up. R stands for reward: the reward the agent gets for entering a certain state. And gamma is the discount factor. We'll talk about the discount factor in a moment and it will all make sense shortly; for now, just make a mental note that we'll be working with this letter gamma later on.

The person behind the Bellman equation is Richard Ernest Bellman. He was an applied mathematician who came up with the concept of dynamic programming, which is the foundation of what we now call reinforcement learning and of the Bellman equation. He introduced that concept in 1953, and that's when the Bellman equation came to be.

So let's have a look at how this all works. There's our lovely agent in the bottom left corner, and it's in a maze, quite a classic maze: the white blocks are squares the agent can step into, and the grey block is not accessible; it acts like a wall in the maze.
The green square is where the agent should be aiming to end up; that's where we want the agent to go, that's the finish. And the red square is a fire pit: if the agent falls into the fire pit, it loses the game. So for the fire pit the reward R is minus 1. That's our way of telling the agent that this is not something we want it to do. Remember the example of training dogs: we say "bad dog" when it's not doing what we want, and it's the same thing here. We want to tell the agent that it shouldn't be ending up in that square, so every time it does end up there it gets a minus one reward; it gets punished with a minus one reward. On the other hand, if it ends up in the green square it gets a plus one reward, meaning that that is what we wanted it to do. Those are the two rewards the agent can possibly get.

And how does it learn to operate in this maze? Just like in that example of the robot dogs that learned to walk: we're not going to tell it anything, we'll just tell it which actions it can take. It can go up, right, left or down; those are the four possible actions, and that's it. Have a play around with that and see what you can come up with. So the agent might go to the right, then two more squares to the right, then back to the left, just randomly pressing buttons to see what happens. It goes back, goes up, up, down, up, right. So far it hasn't learnt anything; nothing has happened yet. Then it goes right and, bam, it ends up in the green square. It realizes: wow, I just got a plus one reward. As soon as it stepped into the green square it got a plus one reward, and that triggers the algorithm to say: OK, that's really cool. I am rewarded for ending up in this square, so I want to end up in this square.
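Before following the agent's reasoning, here is a minimal sketch of this maze as a tiny gridworld in Python. The exact layout is an assumption on my part (a 3-by-4 grid with the goal in the top-right corner, the fire pit just below it, and a single wall square), and the names WALLS, REWARDS and step are purely illustrative rather than anything from the course code:

```python
# Minimal gridworld sketch of the maze described above (assumed 3x4 layout).
WALLS = {(1, 1)}                      # the grey square the agent cannot enter
REWARDS = {(0, 3): +1.0,              # green square: the finish, +1 reward
           (1, 3): -1.0}              # red square: the fire pit, -1 reward
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
ROWS, COLS = 3, 4                     # the agent starts in the bottom-left corner, (2, 0)

def step(state, action):
    """Move deterministically; bumping into a wall or the edge leaves the state unchanged."""
    r, c = state
    dr, dc = ACTIONS[action]
    nxt = (r + dr, c + dc)
    if nxt in WALLS or not (0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS):
        nxt = state
    return nxt, REWARDS.get(nxt, 0.0)  # the reward is earned for entering the new state
```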
So what does discovering that reward mean for the agent? It means it starts to ask the question: how did I get to this square? What was the preceding state I was in, and what action did I take to get to this square? It looks back and says: OK, the preceding state was this one, the one marked with the red arrow, and it turns out that being in that state is valuable, because from that state I'm just one step away from getting the maximum reward I could possibly dream of, plus one, like a biscuit for a dog. If I'm ever in that square marked with the red arrow, all I have to do is press right.

So how do I remember that that state is valuable? Well, to the agent there's practically no difference between being in the green square and being in the white square right next to it: in the green square I get the reward of one, so I'm going to mark for myself that the white square has a value of 1, because it leads directly to a reward of one. As soon as I'm in the white square I know I'll just take one more action, I'll be in the green square, and I'll get a reward of one. So the value of this square is equal to one, because it leads directly to the reward without any subtractions; as soon as I'm here I know my reward will be one. That's the perceived value of being in that state.

Next, the agent asks: OK, so how do I get into this square? It might walk around again, end up in that square again, and ask: how did I get into this square? The way I got into this square was from this square. Interesting. OK, so as soon as I get into this square, I know that all I have to do is go right, and then from there I already know I'm going to win; I know exactly how everything will unravel from there, and I know the value of being in that state is equal to one. And since nothing is stopping me from going from here to there, the perceived value of being here is equal to one as well, because as soon as I'm here I know I'll be there, and then there, pretty quickly, so I'm going to win. And then how did I get into this square before that? Well, I got into it from this square, so by the same approach the value of being here is also equal to one, and so on: the value of being here is equal to one, and the value of being here is equal to one, because each one of them leads to the next one and finally to the finish line.

So that's all pretty logical at this stage. This is us pretty much designing the Bellman equation right now.
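Written out, the naive rule applied so far simply copies the value of the best next square backwards with nothing lost along the way, something like

V(s) = max_a V(s')

so every white square from which the finish can be reached ends up with the same value of 1. That is exactly what causes the problem discussed next.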
So this is how we could possibly think about designing an equation that helps an agent get through the maze: look at the reward, then give the preceding state a value equal to that reward, then the state before that, and so on; that kind of creates a pathway. That's all great and well, but the problem is: what happens if our agent for some reason starts in this state, instead of starting down here and taking these actions? How does it know, how does it remember, which action to take? Should it go right, or down, or maybe left, or up? How does it remember what the next continuation is from here, if the only values it has are these values equal to one? It kind of cannot see what's further away; it can only see what it has on this side and what it has on that side. How does it know which way to go? Well, at this stage it doesn't: both ways look pretty much identical to the agent. And that's why this approach doesn't really work. This is a very simplistic explanation, and of course there's much more to it, but intuitively that's why we cannot just carry this value backwards like that: once the agent is between two equal values, which way is it going to go? It can get confused like that.

So how do we solve this problem? What are we going to do? This is where we start introducing the Bellman equation in its actual form, slowly, step by step. The Bellman equation looks something like this. We've already talked about the value of being in a certain state; s is your current state, or any given state, and we've already met R, the reward, as well. s' (s prime) is the following state: the state you end up in after state s by taking a certain action. But we know there are many actions an agent can take, and that's why we've got this max over here. So let's say we're in state s and we take an action a. What happens is that we instantly get a reward for getting into a new state. Remember that reward can be plus one or minus one if it's at the end of the game, or it can be zero throughout the game; in this case our reward throughout the game is zero.
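For reference, the equation being described verbally here, in the deterministic form used in this section, is

V(s) = max_a ( R(s, a) + γ · V(s') )

where s' is the state you land in after taking action a in state s, R(s, a) is the reward collected for that move, and γ (gamma) is the discount factor.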
So that's the reward, plus we end up in a new state which has a value, V(s'); that's the value of the new state. And gamma we'll talk about in a second. The point I'm raising here is that there are many different actions we can take, and that's why we've got the maximum. By taking an action we get a reward plus we end up in a new state, and so for every one of the possible actions, in our case four, we're going to have an expression like this, and it will have a different value for every one of the four actions. We're going to look only at the maximum, because of course the agent wants to take the optimal action. So if it's in state s, it's going to look at these values, find the maximum over the actions, and take the action that leads to the maximum of these values. Hopefully that makes sense: that's why we're taking the maximum here.

Then, once we've got the reward and the value, why do we have this gamma parameter here? Well, it's there exactly to solve that problem where the agent doesn't know which way to go, because it's comparing the values of two states on both sides and they're the same. That's why gamma is called the discounting factor, so let's have a look at it to understand it better. Let's take the formula, I'll put it here in the top right, and now we'll analyze what the values of the different states are. Every state here is a square now: any one of these white squares is a state, and we're going to calculate the value of being in that state. So let's start with this square: what is the value of being in this state? Well, we need to take the maximum of this expression across all actions, and we know that this expression is maximized as we get closer to the finish line; that's how it is constructed, and you can see it just by looking at it, because here it's got the reward and here it's got the discounting factor multiplied by the value of the next state. So it makes sense that from here the maximum will be achieved if we move to the right.
So that's how we calculate the values: the value of this state equals the maximum, and the maximum equals this expression if we move to the right, if we take the action of moving to the right. So what will this expression be? Well, the reward for moving to the right is equal to 1, and regardless of what gamma is, we don't have a value for the next state, because that is already the best state possible: it's the final state, it doesn't have a value of its own, we just get a reward there and that's the end of the game. So the maximum will be equal to 1, and that's why the value of the state s here is equal to 1.

Now things get interesting when we move to the left, when we move backwards a bit. Let's calculate the value of being in this state, and for that we're going to need gamma. Let's say our discounting factor is 0.9; it will make sense what a discounting factor is once we calculate this. From here, just based on our intuition and because we know how this works, we know that the best possible action is to go to the right, because from here we go there. So the maximum will be achieved in this state if you go to the right. Let's see what happens if we plug it in: if you go from here to there, your reward will be zero, but then you'll get gamma, 0.9, times the value of the new state, which is one. So in this case the whole result is 0 plus 0.9 times 1, which equals 0.9. That's the value here.

If we keep calculating, you'll see that from here we know, just by looking at the maze, because as humans we understand how this equation works (of course an AI agent would have to experiment with these things, but we have a crystal ball, a bird's-eye view of the whole maze), that the best action is to go to the right. So if we plug it all in here, it'll be zero reward plus 0.9 times the value of the next state, which is 0.9, giving 0.81, and so on. So here it'll be 0.73 and here it'll be 0.66. You can see that the way the discount factor works is that it discounts the value of the state the further away you are.
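As a rough sketch of how these numbers can be reproduced, here is a small value-iteration loop that repeatedly applies the equation above to the gridworld sketched earlier (it reuses that block's assumed layout and hypothetical step helper). With gamma set to 0.9 it settles on exactly the values being read out here: 1 next to the goal, then 0.9, 0.81, 0.73 and 0.66 as you move further away.

```python
GAMMA = 0.9
TERMINALS = set(REWARDS)       # the green (+1) and red (-1) squares end the game

# Every non-wall square is a state; terminal squares keep value 0 because the
# game ends there (their +1/-1 shows up as the reward for stepping into them).
states = [(r, c) for r in range(ROWS) for c in range(COLS) if (r, c) not in WALLS]
V = {s: 0.0 for s in states}

for _ in range(50):            # sweep the Bellman update until the values settle
    for s in states:
        if s in TERMINALS:
            continue
        V[s] = max(reward + GAMMA * V[nxt]
                   for nxt, reward in (step(s, a) for a in ACTIONS))

print(V[(0, 2)])               # square right next to the goal -> 1.0
print(V[(0, 1)], V[(0, 0)])    # 0.9 and 0.81, one and two steps further back
print(V[(1, 0)], V[(2, 0)])    # about 0.73 and 0.66 down the left-hand column
```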
If you're familiar with finance theory, this is something similar to the time value of money. Think about it this way: what would you prefer, $5 today or $5 ten days from now? If somebody gave you the choice, "I'll give you five dollars today, or five dollars ten days from now", of course you would choose $5 today. Why is that? Because you can take that $5 and invest it at a certain interest rate, which is very similar to gamma, and in 10 days your $5 will actually have grown into maybe 5 dollars and 73 cents or something like that. That's how the time value of money works, and it's a very similar concept here.

The important thing to understand is that this is just one theory, one way of doing reinforcement learning: Richard Bellman came up with this equation, and from then on that's how we've used it. You could go ahead and come up with a different equation; it doesn't have to have gamma, it might have some other factor, or no factor at all. But this approach works, and that's why we're using it, and this is what it looks like: the further away you are, the less value there is in being in that state. In terms of time and money, if I asked you where you would rather be, here or here, you'd say you would rather be here. So we're creating that same phenomenon as the time value of money; we're artificially creating it through gamma in order to incentivize agents, inspire agents, to be closer to the finish line. If an agent were asked whether it would rather be here or here, because of the way this equation works it would choose to be here. There's nothing more to it and nothing less; it's not that the world works this way, it's just something that we're artificially creating so that our agents understand that all of these states are good, but this one is better than this one, and this one is better than this one, and this one is better than this one. That way the agent can see in which direction it needs to go. Remember the problem we had: if it's standing here, does it go down, or if it's suddenly here, does it go up or does it go down?
Well, now that's not a problem anymore, because it can see that it's actually better to go up, because the value is bigger up here. Then from here it's got to go right, because the value is bigger here than here; then from here it's better to go right, because the value here is bigger than there; and from here it already knows it needs to go right, because it'll get the reward of one. So that's how this whole approach works.

Now let's have a quick look at the rest of the squares. How do we calculate the value in this square? Well, here is where things get a bit tricky: from here you might not actually want to go left, you might actually want to go right, so we can't just keep propagating the values backwards like before, because it might actually be shorter to go this other way. So what we're going to do is calculate the value in this square first, because obviously from here the best way to go is up. Again, that's because we have the crystal ball and can see everything; further down in this section you'll see how the agent actually explores and comes to understand this on its own through experimentation. But we know that it's better to go this way, and that's why we're going to calculate the value in this square first. Here we have three possible actions; in reality we actually have four, because we could also go left: the agent could hypothetically press left, bump into the wall and stay where it is. But for simplicity's sake we're only going to show the actions that, knowing what we know and having the crystal ball, actually lead somewhere other than the same state again. So from here, again just because we have a crystal ball, we know that the best way to go is this way; an agent of course would have to experiment to find the best way, and further down in the section you'll see how an agent walks around and experiments trying to find these values. But we know it's that way. So here, if we plug everything in, the maximum, the best outcome, is when you go up: a reward of zero plus 0.9 times 1, so you put that in and you get 0.9. OK, so we've calculated that one; now let's calculate this one, same approach. Here you have three ways you can go.
Actually four for the agent, but for us we can see it's only three. So that's 0.81, and from here you have 0.73, and it actually ties in nicely with this value, because if you discount it again you get 0.66; and here you have 0.73, because this is the optimal route. So there you go: those are the values of all of these states. And now you can see that, because we've created this equation, because we've synthetically created this whole concept that the closer you are to the finish line the more valuable the state is, it's now pretty obvious for the agent which way it should go. We'll talk more about that in the tutorials to come.

I hope you enjoyed today's session. I know it might sound quite basic at this stage, but as we go through this section we will add more complexity to it. At the same time, if you can't wait and you want to jump straight into it, there is a paper you can look at: the original paper by Richard Bellman, called "The Theory of Dynamic Programming", from 1954, and you can find it at this link. So you can jump straight in and read from the author of the Bellman equation himself, but just bear in mind that it's quite a mathematically heavy paper. And on that note, I look forward to seeing you next time. Until then, enjoy AI.
