1
00:00:01,060 --> 00:00:04,460
Hello and welcome back to the course on artificial intelligence.
2
00:00:04,460 --> 00:00:07,630
Today we're going to talk about the Bellman equation.
3
00:00:07,630 --> 00:00:12,580
It's quite a complex topic, and we're going to introduce it in a step-by-step manner throughout this
4
00:00:12,580 --> 00:00:17,110
whole section of the course. So I'm not going to jump straight into the most complex version of
5
00:00:17,110 --> 00:00:21,730
the Bellman equation right away; instead we're going to introduce it slowly, in order to gradually
6
00:00:21,730 --> 00:00:23,250
understand how it works.
7
00:00:23,410 --> 00:00:28,480
And I hope you're on board with that approach. Alright, let's get straight into it.
8
00:00:28,690 --> 00:00:33,820
So we're going to have a couple of key concepts that we'll be operating with, and these concepts
9
00:00:33,820 --> 00:00:34,430
are:
10
00:00:34,600 --> 00:00:41,110
S stands for state: the state in which our agent is, or any other possible state in which it can be.
11
00:00:41,740 --> 00:00:45,490
A represents an action that an agent can take.
12
00:00:45,490 --> 00:00:50,680
So an agent has access to a certain list of actions, and actions are very important when they're
13
00:00:50,680 --> 00:00:53,610
looked at in combination with a state.
14
00:00:53,620 --> 00:00:57,880
So when you're in a given state and you look at the actions, it starts to make sense what's going
15
00:00:57,880 --> 00:01:01,870
to be the result of those actions, because an action by itself, or a state by itself, doesn't really
16
00:01:01,870 --> 00:01:07,390
make sense; you don't know where you are or where you can possibly end up. And then we'll
17
00:01:07,390 --> 00:01:13,240
have our R, which stands for reward, and that's the reward the agent gets for entering into a certain
18
00:01:13,240 --> 00:01:16,980
state. And gamma is the discount factor.
19
00:01:16,990 --> 00:01:21,510
And we'll talk about the discount factor in a second; it'll all make sense soon. For now, just take
20
00:01:21,510 --> 00:01:21,810
note.
21
00:01:21,820 --> 00:01:26,300
Make a mental note that we're going to have this letter gamma that we'll be operating with later on.
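Before we move on, here is a minimal sketch of those four concepts in Python (the grid coordinates and numbers are illustrative, not from the slide):

# S: the set of states the agent can be in (here, the cells of a 3-by-4 grid)
states = [(row, col) for row in range(3) for col in range(4)]

# A: the actions available to the agent
actions = ["up", "down", "left", "right"]

# R: the reward for entering a state (+1 at the goal, -1 at the fire pit, 0 elsewhere)
rewards = {(0, 3): 1.0, (1, 3): -1.0}

# gamma: the discount factor we will come back to shortly
gamma = 0.9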
22
00:01:26,620 --> 00:01:31,230
So the person behind the Bellman equation is Richard Ernest Bellman.
23
00:01:31,360 --> 00:01:39,400
He was an applied mathematician and came up with the concept of dynamic programming, which
24
00:01:39,400 --> 00:01:43,790
we now connect with reinforcement learning, or what we now call the Bellman equation.
25
00:01:44,110 --> 00:01:45,490
Well, that's what it's called now.
26
00:01:45,490 --> 00:01:52,350
And in 1953 he came up with that concept, and that's when the Bellman equation came to be.
27
00:01:52,630 --> 00:01:56,530
So let's have a look at how this all works.
28
00:01:56,540 --> 00:02:02,410
There's our lovely agent in the bottom left corner, and he is in a maze. This is quite a classical
29
00:02:02,500 --> 00:02:08,680
maze where you've got some blocks: the white blocks are ones the agent can step into; the gray
30
00:02:08,680 --> 00:02:13,800
block is the one that is just not accessible; it's like a wall in this maze.
31
00:02:13,900 --> 00:02:20,150
The green is where the agent should be aiming to end up; that's where we want the agent to go, that's
32
00:02:20,150 --> 00:02:20,910
the finish.
33
00:02:21,220 --> 00:02:25,050
And the red is a fire pit; if the agent falls into the fire pit,
34
00:02:25,060 --> 00:02:26,660
he will lose the game.
35
00:02:26,950 --> 00:02:31,330
So in the fire pit, the reward R is minus 1.
36
00:02:31,330 --> 00:02:36,330
So that's our way of telling the agent that's not something we want you to do.
37
00:02:36,430 --> 00:02:41,320
Remember the example of when we're training dogs: we want to tell them "bad dog" if it's not
38
00:02:41,320 --> 00:02:46,030
doing the right thing we wanted it to do. Same thing here: we want to tell the agent that this is not something
39
00:02:46,030 --> 00:02:49,480
it should be doing; it shouldn't be ending up in that square. So every time it ends up in that
40
00:02:49,480 --> 00:02:53,300
square, it will get a minus one reward; it'll be punished with a minus one reward.
41
00:02:53,530 --> 00:02:57,610
On the other hand, if it ends up in the green square, it will get a plus one reward, meaning that that
42
00:02:57,610 --> 00:02:59,330
is what we wanted it to do.
43
00:02:59,590 --> 00:03:02,470
So those are the two rewards that the agent can possibly get.
44
00:03:02,470 --> 00:03:06,210
And how does it learn to operate in this maze?
45
00:03:06,370 --> 00:03:10,750
Just like in that example of the robot dogs that learned to walk, we're not going to tell it what to do; we'll
46
00:03:10,750 --> 00:03:12,490
just tell it: here are the actions you can do.
47
00:03:12,490 --> 00:03:18,360
You can go up, right, left or down; those are the four possible actions that you can take, and that's it.
48
00:03:18,360 --> 00:03:21,430
Have a play around with that, see what you can come up with.
49
00:03:21,430 --> 00:03:26,320
So the agent might go to the right, then they might go two more to the right, they might go back to the
50
00:03:26,320 --> 00:03:31,160
left, just randomly pressing buttons and trying to see what happens, and they go back here.
51
00:03:31,180 --> 00:03:34,660
They go up, go up, go down, go up, go right.
52
00:03:34,660 --> 00:03:38,450
So far they haven't learnt anything; so far nothing's happened.
53
00:03:38,470 --> 00:03:41,790
They go right, and then bam, they end up in the green square.
54
00:03:41,830 --> 00:03:48,150
So they realize: wow, I just got a plus one reward. So as soon as they stepped into the green square, they got
55
00:03:48,150 --> 00:03:49,040
a plus one reward.
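Here is a small sketch of this trial-and-error stage in Python. The 3-by-4 layout, the wall position and the coordinates are my guess at the maze on the slide, so treat them as assumptions:

import random

ROWS, COLS = 3, 4
WALL = {(1, 1)}                          # the gray, inaccessible block
TERMINAL = {(0, 3): 1.0, (1, 3): -1.0}   # green square (+1) and fire pit (-1)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    # Apply a move; bumping into a wall or the edge leaves the agent in place.
    r, c = state[0] + MOVES[action][0], state[1] + MOVES[action][1]
    return (r, c) if 0 <= r < ROWS and 0 <= c < COLS and (r, c) not in WALL else state

state = (2, 0)                           # the bottom-left corner
while state not in TERMINAL:
    state = step(state, random.choice(list(MOVES)))   # just pressing buttons randomly
print("landed on", state, "with reward", TERMINAL[state])

Run it a few times and the random walk will sometimes land in the fire pit instead, with a reward of minus one.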
56
00:03:49,090 --> 00:03:53,560
And that triggers the algorithm to say: OK, that's really cool.
57
00:03:53,830 --> 00:03:58,920
I am rewarded for ending up in this square, so I want to end up in this square.
58
00:03:58,930 --> 00:04:00,650
So what does that mean for the agent?
59
00:04:00,910 --> 00:04:04,310
That means it starts to ask the question: how did I get to this square?
60
00:04:04,300 --> 00:04:10,690
What was the preceding state I was in, and what action did I take to get to this square? Then it looks back
61
00:04:10,690 --> 00:04:14,810
and says: OK, so the preceding state was this one.
62
00:04:14,950 --> 00:04:17,400
It turns out it's valuable to be in that state.
63
00:04:17,410 --> 00:04:19,240
The one that's marked with the red arrow.
64
00:04:19,270 --> 00:04:26,230
Because from that state I'm just one step away from getting the maximum reward I can possibly
65
00:04:26,230 --> 00:04:33,210
dream of: plus one, like a biscuit for a dog. So now I know: if I am ever in that state,
66
00:04:33,250 --> 00:04:35,150
that square marked with the red arrow,
67
00:04:35,200 --> 00:04:36,740
all I have to do is press right.
68
00:04:37,030 --> 00:04:41,440
So how do I tell myself to remember that that state is valuable?
69
00:04:41,440 --> 00:04:45,170
Well, to me as the agent, there's actually no difference.
70
00:04:45,170 --> 00:04:50,380
There's no difference in whether I am in the green square or in the white square right before it. In the green
71
00:04:50,380 --> 00:04:51,610
square I get the reward of one.
72
00:04:51,610 --> 00:04:58,810
So I'm going to mark for myself that the white square is good; for me it has a value of 1, because it leads
73
00:04:58,810 --> 00:05:03,280
directly to a reward of one. As soon as I'm in the white square, I know I'll just take one more action,
74
00:05:03,350 --> 00:05:08,180
I'll be in the green square, and I'll get a reward of one. So that's why I'm going to say that the value
75
00:05:08,180 --> 00:05:14,690
of this square is equal to one: because it leads directly to it, without any sort of subtraction. As soon as
76
00:05:14,690 --> 00:05:18,890
I'm in here, I know my reward will be one, so I'm going to mark this square as equal to one. That's
77
00:05:18,890 --> 00:05:22,430
the value, that's the perceived value of being in that state.
78
00:05:22,430 --> 00:05:24,740
Next, the agent's going to be like: OK,
79
00:05:24,800 --> 00:05:26,930
so how do I get into this square?
80
00:05:27,050 --> 00:05:29,990
And you know he might walk around again and so on.
81
00:05:29,990 --> 00:05:33,800
And end up in the square again and be like: OK, how did I get into this square before that?
82
00:05:33,800 --> 00:05:36,860
Well, the way I got into this square was from this square.
83
00:05:36,860 --> 00:05:37,530
Interesting.
84
00:05:37,550 --> 00:05:42,980
OK, so as soon as I get into this square, I know that all I have to do is go right.
85
00:05:42,980 --> 00:05:45,640
And then from here I already know that I'm going to win.
86
00:05:45,650 --> 00:05:49,970
I know exactly how everything is going to unravel from here, and I know the value of being in this state
87
00:05:49,970 --> 00:05:50,970
is equal to one.
88
00:05:51,020 --> 00:05:58,340
And since nothing is stopping me from going from here to here, the value of this state is going
89
00:05:58,340 --> 00:06:03,920
to be one as well; the perceived value I'm giving to being in here is equal to one, because as soon as I'm
90
00:06:03,920 --> 00:06:04,640
here, I know:
91
00:06:04,650 --> 00:06:06,660
I'll be here, and then I'll be here pretty quickly.
92
00:06:06,740 --> 00:06:07,980
So I'm going to win.
93
00:06:08,180 --> 00:06:10,490
And then how do I get into this square before that?
94
00:06:10,490 --> 00:06:12,940
Well I got into this square from this square.
95
00:06:13,070 --> 00:06:19,670
Well, similar approach: the value of being here is also equal to one, and so on. The value
96
00:06:19,670 --> 00:06:23,690
of being here is equal to one, the value of being here is equal to one, because each one of them leads to
97
00:06:23,690 --> 00:06:25,710
the next one, and these lead to the finish line.
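As a sketch, this naive bookkeeping just copies the value 1 backwards along the discovered path (my own illustration of the idea being described here, not the final algorithm):

# The path the agent found, from the start up to the square next to the goal.
path = [(2, 0), (1, 0), (0, 0), (0, 1), (0, 2)]
value = {}

# Each square "leads directly" to a square worth 1, so mark it as worth 1 too.
for square in path:
    value[square] = 1.0

print(value)   # every square on the path is now 1.0, so they all look the same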
98
00:06:26,240 --> 00:06:29,850
So that's all pretty logical at this stage.
99
00:06:29,960 --> 00:06:33,410
This is us pretty much designing the Bellman equation right now.
100
00:06:33,410 --> 00:06:40,460
So this is how we could possibly think about designing an equation that helps an agent go through the maze.
101
00:06:40,490 --> 00:06:45,840
So: look at the reward, then give the preceding state a value equal to the reward, and the one preceding that, and
102
00:06:45,840 --> 00:06:51,920
so on; that kind of creates a pathway. It's all well and good, but the problem here is: OK, what happens
103
00:06:52,010 --> 00:06:58,790
if our agent for some reason starts in this state, instead of starting here and taking these actions?
104
00:06:58,880 --> 00:07:00,480
What if it actually starts in this state?
105
00:07:00,650 --> 00:07:06,980
How does it know, how does it remember which action to take? Should it go right, or should it go down, or
106
00:07:06,980 --> 00:07:08,540
should it maybe go left, or should it go up?
107
00:07:08,540 --> 00:07:13,220
How does it remember what the continuation is from here?
108
00:07:13,220 --> 00:07:18,660
If the only values it has are these values equal to one, it kind of cannot see what's further away.
109
00:07:18,660 --> 00:07:19,700
It can only see:
110
00:07:19,700 --> 00:07:20,030
all right,
111
00:07:20,030 --> 00:07:21,940
what I have here and what I have here.
112
00:07:21,980 --> 00:07:23,530
How does it know which way to go?
113
00:07:23,660 --> 00:07:27,920
Well, at this stage it doesn't; it looks pretty much identical to the agent which way to go.
114
00:07:27,960 --> 00:07:30,770
And so that's why this approach doesn't really work.
115
00:07:30,790 --> 00:07:32,930
It's a very simplistic explanation.
116
00:07:32,930 --> 00:07:34,500
Of course there's much more to it.
117
00:07:34,520 --> 00:07:40,550
But in an intuitive way, that's why we cannot just carry this value backwards like that.
118
00:07:40,790 --> 00:07:46,210
Because one of the reasons is: once the agent is in between these two values, which way is it going to go?
119
00:07:46,210 --> 00:07:48,560
It doesn't know; it can get confused like that.
120
00:07:48,620 --> 00:07:52,350
And so how do we solve this problem? What are we going to do?
121
00:07:52,400 --> 00:07:57,860
And this is where we're going to start introducing the Bellman equation in its actual form, slowly, step
122
00:07:57,860 --> 00:07:58,640
by step.
123
00:07:58,670 --> 00:08:01,510
So the Bellman equation looks something like this.
124
00:08:01,640 --> 00:08:07,100
So we've already talked about the value of being in a certain state: s is your current state, or any
125
00:08:07,100 --> 00:08:10,250
given state, and there's a in there as well.
126
00:08:10,370 --> 00:08:17,270
And s prime is the following state, the state that you will end up in after state s
127
00:08:17,270 --> 00:08:18,990
by taking a certain action.
128
00:08:19,000 --> 00:08:24,160
But we know that there are many actions an agent can take, and that's why we've got this max over here.
129
00:08:24,260 --> 00:08:30,020
So by taking an action, what will happen to an agent? So let's say we're in state s:
130
00:08:30,050 --> 00:08:32,700
in state s, we take an action.
131
00:08:32,780 --> 00:08:36,690
What will happen is we'll instantly get a reward for getting into a new state.
132
00:08:36,770 --> 00:08:41,960
And remember, that reward can be plus one or minus one if it's at the end of the game, or it can
133
00:08:41,960 --> 00:08:46,240
be zero throughout the game; in this case, our reward throughout the game is zero.
134
00:08:46,280 --> 00:08:55,160
So that's the reward, plus we will get into a new state, which has the value V of s prime.
135
00:08:55,160 --> 00:08:57,820
So that's the value of the new state. And gamma,
136
00:08:57,820 --> 00:08:58,820
we'll talk about it in a second.
137
00:08:58,820 --> 00:09:03,560
But the point I'm trying to make here is that you've got many different
138
00:09:03,560 --> 00:09:05,810
actions that you can take, and that's why we've got the maximum.
139
00:09:05,810 --> 00:09:09,630
So by taking an action we get a reward, plus we end up in a new state.
140
00:09:09,740 --> 00:09:14,660
And so for every move, in our case for every one of the four possible
141
00:09:14,660 --> 00:09:17,810
actions, we're going to have an expression like this.
142
00:09:17,810 --> 00:09:22,980
So this will have a different value for every one of the four actions,
143
00:09:23,480 --> 00:09:28,750
and we're going to look only at the maximum, because of course the agent wants to take the optimal action.
144
00:09:28,760 --> 00:09:33,860
So if it's in state s, it's going to look at these values, it's going to find the maximum over the
145
00:09:33,860 --> 00:09:37,500
actions, and it's going to take the action that leads to the maximum of these values.
146
00:09:37,640 --> 00:09:41,480
So hopefully that makes sense why we're taking the maximum here.
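Written out, the equation on the slide is V(s) = max over a of [R(s, a) + gamma * V(s')]. Here is a minimal sketch of that one-step calculation, reusing the hypothetical MOVES, step and TERMINAL grid helpers from the earlier sketch:

def bellman_value(state, V, gamma=0.9):
    # V(s) = max over actions a of [ R(s, a) + gamma * V(s') ]
    best = float("-inf")
    for action in MOVES:
        s_prime = step(state, action)              # the state this action leads to
        reward = TERMINAL.get(s_prime, 0.0)        # +1 / -1 at the end, 0 mid-game
        v_next = 0.0 if s_prime in TERMINAL else V.get(s_prime, 0.0)
        best = max(best, reward + gamma * v_next)  # keep the best action's total
    return best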
147
00:09:41,660 --> 00:09:45,400
Then, once we've got the reward and the value: that said, why do we have this gamma parameter here?
148
00:09:45,650 --> 00:09:52,220
Well, it's there exactly to solve that problem where the agent doesn't know which way to go, because
149
00:09:52,220 --> 00:09:52,850
it cannot decide:
150
00:09:52,950 --> 00:09:56,600
it's comparing the values of two states on both sides, and they're the same.
151
00:09:56,810 --> 00:10:00,890
That's why gamma is called the discounting factor. So we're going to have a look at that and better
152
00:10:00,890 --> 00:10:02,050
understand it.
153
00:10:02,060 --> 00:10:04,680
So let's take the formula; I'll put it here on the top right.
154
00:10:04,760 --> 00:10:09,100
And now we will analyze what the values of the different states are.
155
00:10:09,140 --> 00:10:11,470
And every state here is a square
156
00:10:11,470 --> 00:10:11,820
now.
157
00:10:11,840 --> 00:10:16,610
So any one of these white squares is a state, and we're going to calculate the value
158
00:10:16,610 --> 00:10:18,290
of being in that state.
159
00:10:18,290 --> 00:10:19,770
So let's start with this square.
160
00:10:19,790 --> 00:10:21,610
What is the value of being in this state?
161
00:10:21,860 --> 00:10:25,830
Well we need to take the maximum of this value across all actions.
162
00:10:26,120 --> 00:10:31,440
And we know that this value is maximized as we get closer to the finish line; that's how
163
00:10:31,440 --> 00:10:36,440
it is constructed, and just by looking at it you can see why: here you've got the reward, and here you've got
164
00:10:36,590 --> 00:10:40,900
the discounting factor multiplied by the value of the next state.
165
00:10:41,060 --> 00:10:46,670
And it just makes sense that that's how we would construct this equation. So it makes sense that from
166
00:10:46,670 --> 00:10:50,350
here, the maximum of this value will be achieved if we move to the right.
167
00:10:50,360 --> 00:10:56,120
So that's how we calculate the values: the value of this state is equal to the maximum, which is equal
168
00:10:56,300 --> 00:10:57,470
to this value.
169
00:10:57,500 --> 00:11:01,000
if we move to the right, if we take the action of moving to the right.
170
00:11:01,010 --> 00:11:02,330
So what will this value be?
171
00:11:02,360 --> 00:11:04,850
Well the reward of moving to the right is equal to 1.
172
00:11:05,090 --> 00:11:10,490
And regardless of what gamma is, we don't have a value for the next state, because we're already in the
173
00:11:10,490 --> 00:11:11,720
best state possible.
174
00:11:11,720 --> 00:11:12,880
So this is the final state.
175
00:11:12,890 --> 00:11:16,280
It won't have a value; we just get a reward here, and that's the end of the game.
176
00:11:16,280 --> 00:11:20,300
So the value of this maximum will be equal to 1.
177
00:11:20,510 --> 00:11:23,870
And that's why the value of the state here is equal to 1.
178
00:11:23,870 --> 00:11:27,970
Now things get interesting when we move to the left, when we move backwards a bit.
179
00:11:28,010 --> 00:11:34,060
So now let's calculate the value of being in this state, and for that we're going to need gamma.
180
00:11:34,070 --> 00:11:39,920
So let's say our discounting factor is 0.9, and it'll make sense what a discounting factor
181
00:11:39,920 --> 00:11:40,960
is once we calculate that.
182
00:11:40,960 --> 00:11:47,410
So from here, just based on our intuition, and because we know how this works:
183
00:11:47,450 --> 00:11:51,340
we know that the best possible action is to go to the right, because from here we go here.
184
00:11:51,530 --> 00:11:56,120
So that means the maximum will be achieved in this state if you go to the right.
185
00:11:56,270 --> 00:11:58,970
And so let's see what happens if we plug it in here.
186
00:11:58,970 --> 00:12:02,650
So if you go from here to here, you don't get a reward; your reward will be zero.
187
00:12:02,720 --> 00:12:07,440
But then you'll get gamma: you'll get 0.9 times the value of the new state, which is one.
188
00:12:07,640 --> 00:12:14,030
So in this case the whole result of this is 0 plus 0.9 times 1, which equals 0.9.
189
00:12:14,030 --> 00:12:15,890
So that's our value here.
190
00:12:16,250 --> 00:12:18,570
So if we calculate this now, you'll see that from here,
191
00:12:18,620 --> 00:12:23,990
we know just by looking at the maze, because we as humans understand how this
192
00:12:23,990 --> 00:12:28,450
equation works; of course, an AI agent would have to experiment with these things.
193
00:12:28,460 --> 00:12:32,180
But because we have like a crystal ball we can see this whole maze.
194
00:12:32,180 --> 00:12:33,860
We have like the bird's eye view right now.
195
00:12:33,860 --> 00:12:36,170
We know that the best action is to go to the right.
196
00:12:36,320 --> 00:12:42,230
So if we plug it all in here, it'll be zero, no reward, plus 0.9 times the value of the next state:
197
00:12:42,230 --> 00:12:45,530
0.9 times 0.9 is 0.81, and so on.
198
00:12:45,530 --> 00:12:50,420
So here it'll be 0.73, and here it'll be 0.66.
199
00:12:50,420 --> 00:12:57,590
So you can see that the way the discounting factor works is that it discounts the value of the state as you
200
00:12:57,590 --> 00:12:58,610
move further away.
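To double-check those numbers, here is the chain written out: the square next to the goal is worth 1, and every extra step back multiplies that by gamma.

gamma = 0.9
for steps_back in range(5):
    print(steps_back, "steps back ->", round(gamma ** steps_back, 2))
# prints 1.0, 0.9, 0.81, 0.73, 0.66, matching the values in the maze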
201
00:12:58,610 --> 00:13:05,810
So if you are familiar with finance theory, then it's something similar to the time value of money.
202
00:13:05,810 --> 00:13:12,990
Think about it this way: what would you prefer, to have $5 today or $5 in 10 days from now?
203
00:13:13,050 --> 00:13:17,840
If somebody was to give you a choice: I will give you five dollars today, or $5 in 10 days from
204
00:13:17,840 --> 00:13:18,280
now.
205
00:13:18,390 --> 00:13:20,300
Of course you would choose $5 today.
206
00:13:20,300 --> 00:13:20,850
Why is that?
207
00:13:20,870 --> 00:13:26,750
Well, because you can take that $5 and invest it at a certain interest rate, which is very similar
208
00:13:26,750 --> 00:13:27,470
to gamma.
209
00:13:27,680 --> 00:13:33,950
And your $5 in 10 days will actually grow into maybe 5 dollars and 73 cents or something like that.
210
00:13:34,070 --> 00:13:36,410
And that's how time value of money works.
211
00:13:36,410 --> 00:13:38,310
And it's a very similar concept here.
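As a quick numeric aside (the lecture doesn't state a rate, so the 1.37% per day here is just picked to roughly match the figure):

daily_rate = 0.0137
print(round(5 * (1 + daily_rate) ** 10, 2))   # -> 5.73, the "5 dollars and 73 cents"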
212
00:13:38,330 --> 00:13:43,250
And the important thing to understand here is that this is just a theory, one way of doing reinforcement learning.
213
00:13:43,260 --> 00:13:45,850
So Richard Bellman came up with this equation,
214
00:13:46,190 --> 00:13:48,880
and from then on, that's how we use it.
215
00:13:48,880 --> 00:13:51,430
So you could go ahead and come up with a different equation.
216
00:13:51,430 --> 00:13:54,820
It doesn't have to have gamma; it might have some other factor, or might not, you know, have a factor at all.
217
00:13:54,950 --> 00:14:01,550
But this approach works, and that's why we're using it. And this is what it looks like: the further away
218
00:14:01,550 --> 00:14:06,670
you are, the less value there is in being in that state. And in terms of time and money,
219
00:14:06,680 --> 00:14:09,850
if I were to ask you where you would rather be: would you rather be here,
220
00:14:09,950 --> 00:14:11,200
or would you rather be here?
221
00:14:11,350 --> 00:14:12,920
You'd say: I would rather be here.
222
00:14:12,920 --> 00:14:18,770
So we're creating that same phenomenon as the time value of money; we're artificially creating it through
223
00:14:18,770 --> 00:14:24,680
gamma, in order to incentivize agents, or inspire agents, to be closer to the finish line.
224
00:14:24,680 --> 00:14:29,720
So if an agent were to be asked, would you rather be here or here, then because of the way this equation works,
225
00:14:29,930 --> 00:14:31,590
it would choose to be here.
226
00:14:31,640 --> 00:14:33,380
There's nothing more to it, nothing less.
227
00:14:33,380 --> 00:14:35,810
It's not that the world inherently works this way.
228
00:14:35,810 --> 00:14:42,630
No, it's just something that we're artificially creating in order for our agents to understand: this
229
00:14:42,750 --> 00:14:48,140
is good, this is good, this is good, all good, but this one is better than this one, and this one
230
00:14:48,140 --> 00:14:50,030
is better than this one, and this one is better than this one.
231
00:14:50,120 --> 00:14:54,790
And that way the agent can see in which direction it needs to go.
232
00:14:54,800 --> 00:15:00,270
So it can see which way to go. If it's standing here, remember that problem that we had when it was standing here, so
233
00:15:00,270 --> 00:15:05,130
if it's standing here, does it go down; or if it's here, does it go up or does it go down?
234
00:15:05,250 --> 00:15:10,080
Well, now that's not a problem anymore, because it can see that it's actually better to go up, because
235
00:15:10,080 --> 00:15:11,480
the values are higher here.
236
00:15:11,550 --> 00:15:14,490
And then from here it's got to go right, because the value is bigger here than here.
237
00:15:14,550 --> 00:15:17,480
And then from here it's better to go right, because the value here is bigger than here.
238
00:15:17,670 --> 00:15:22,620
And from here it already knows that it needs to go right, because it'll get a reward of one here.
239
00:15:22,680 --> 00:15:24,960
So that's how this whole approach works.
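In code terms, once every square has a value, "which way to go" becomes a simple greedy choice. A sketch on top of the same hypothetical grid helpers as before:

def greedy_action(state, V, gamma=0.9):
    # Pick the action whose reward plus discounted next-state value is largest.
    def score(action):
        s_prime = step(state, action)
        reward = TERMINAL.get(s_prime, 0.0)
        v_next = 0.0 if s_prime in TERMINAL else V.get(s_prime, 0.0)
        return reward + gamma * v_next
    return max(MOVES, key=score)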
240
00:15:24,960 --> 00:15:27,600
Now let's have a quick look at the rest of the squares.
241
00:15:27,600 --> 00:15:29,800
So how do we calculate the value of this square?
242
00:15:30,030 --> 00:15:32,450
Well here is where things get tricky.
243
00:15:32,460 --> 00:15:38,400
So from here you might not actually go left, right? You might actually go right. So we can't just keep going
244
00:15:38,400 --> 00:15:41,360
like that because it might actually be shorter to go this way.
245
00:15:41,520 --> 00:15:44,720
So what we're going to do is calculate the value of this square first.
246
00:15:45,000 --> 00:15:48,200
And that's because, obviously, from here the best way to go is up.
247
00:15:48,240 --> 00:15:52,740
Again, that's because we have the crystal ball, we can see things, and you'll see further
248
00:15:52,740 --> 00:15:57,060
down in the section how the agent actually explores this and understands this on its own,
249
00:15:57,060 --> 00:15:58,030
through experimentation.
250
00:15:58,080 --> 00:16:02,580
But we know that it's better to go this way, so we're going to calculate the value here, and that's
251
00:16:02,580 --> 00:16:06,410
why we're going to calculate the value in this square first.
252
00:16:06,420 --> 00:16:09,230
So here we have three possible actions.
253
00:16:09,270 --> 00:16:11,590
In reality we actually have four; we can also go left.
254
00:16:11,610 --> 00:16:15,330
The agent could hypothetically press left and bump into the wall and stay here.
255
00:16:15,420 --> 00:16:21,030
But for simplicity's sake, we're only going to show the actions that, knowing what we know and having the
256
00:16:21,030 --> 00:16:25,920
crystal ball, we know actually lead to something other than the same state
257
00:16:25,920 --> 00:16:26,780
again.
258
00:16:26,850 --> 00:16:32,010
And so from here, again just because we have the crystal ball, we know that the best way
259
00:16:32,010 --> 00:16:36,840
to go is this way; an agent of course would have to experiment and find the best way, and you'll see how
260
00:16:36,840 --> 00:16:37,500
that happens.
261
00:16:37,560 --> 00:16:42,270
Further down in the section you'll see how an agent actually walks around and how it would experiment,
262
00:16:42,360 --> 00:16:43,610
trying to find these values.
263
00:16:43,620 --> 00:16:45,190
But for us we know it's that way.
264
00:16:45,360 --> 00:16:50,420
So here, if we plug everything in: the maximum, the best outcome, is when you go up.
265
00:16:50,510 --> 00:16:53,820
And here the reward is 0 and gamma is 0.9, so you put that in.
266
00:16:53,820 --> 00:16:55,870
You get zero point nine.
267
00:16:56,220 --> 00:16:58,730
OK, so we've calculated that one; now let's calculate this one.
268
00:16:58,770 --> 00:16:59,810
Same approach.
269
00:16:59,820 --> 00:17:02,070
Here you have three ways you can go.
270
00:17:02,070 --> 00:17:05,580
Actually four for the agent, but for us, we can see only three matter.
271
00:17:05,880 --> 00:17:10,780
So 0.81; from here you have 0.73.
272
00:17:11,130 --> 00:17:16,410
And it actually ties in nicely with this value, because if you discount again you get 0.66, and here
273
00:17:16,890 --> 00:17:20,120
you have 0.73, because this is the optimal route.
274
00:17:20,130 --> 00:17:21,190
So there you go.
275
00:17:21,210 --> 00:17:23,750
Those are the values of all of these states.
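Putting the whole walkthrough together, here is a compact sketch that sweeps the bellman_value update from earlier over every square until the values settle; this is value iteration on the same hypothetical grid:

V = {}
for _ in range(100):                       # plenty of sweeps for this tiny maze
    new_V = {(r, c): bellman_value((r, c), V)
             for r in range(ROWS) for c in range(COLS)
             if (r, c) not in WALL and (r, c) not in TERMINAL}
    if new_V == V:                         # nothing changed: we have converged
        break
    V = new_V
print(V)   # 1.0 next to the goal, then 0.9, 0.81, 0.73, 0.66 further away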
276
00:17:23,760 --> 00:17:29,700
And now you can see that, because we've created this equation, or we've synthetically created this whole
277
00:17:29,730 --> 00:17:37,890
concept that the closer you are to the finish line, the more valuable that state is. And thanks to
278
00:17:37,890 --> 00:17:41,840
that, it's now pretty obvious for the agent which way it should go.
279
00:17:41,970 --> 00:17:44,230
And we'll talk more about that in the coming tutorials.
280
00:17:44,910 --> 00:17:52,290
I hope you enjoyed today's session. I know it might sound a bit basic at this stage,
281
00:17:52,320 --> 00:17:56,590
but as we go through this section we will add a bit more complexity to it.
282
00:17:56,700 --> 00:18:01,500
At the same time, if you cannot wait, if you want to jump into it, then there is a paper which you can
283
00:18:01,500 --> 00:18:04,290
look at, and it is the original paper by Richard Bellman.
284
00:18:04,290 --> 00:18:08,130
It's called The Theory of Dynamic Programming, from 1954.
285
00:18:08,370 --> 00:18:10,200
And you can find it at this link.
286
00:18:10,320 --> 00:18:16,490
And there you go, so you can jump straight into it and read from the author of the Bellman equation.
287
00:18:16,620 --> 00:18:20,860
But just bear in mind that this is quite a mathematically heavy paper.
288
00:18:20,970 --> 00:18:22,820
And on that note, I look forward to seeing you next time.
289
00:18:22,850 --> 00:18:24,590
And until then enjoy AI.