1
00:00:01,160 --> 00:00:04,720
Hello and welcome back to the course on artificial intelligence.
2
00:00:04,740 --> 00:00:07,950
Today we're talking about the temporal difference.
3
00:00:08,100 --> 00:00:14,310
Now, this is a very important tutorial because temporal difference is the heart and soul of the Q-learning
4
00:00:14,340 --> 00:00:15,100
algorithm.
5
00:00:15,120 --> 00:00:22,410
This is actually how everything we've learned so far comes together inside Q-learning.
6
00:00:22,410 --> 00:00:23,880
So let's have a look.
7
00:00:23,910 --> 00:00:28,040
Remember the time when we talked about deterministic versus non-deterministic search?
8
00:00:28,410 --> 00:00:34,960
And remember how we said that in this case, when the agent wants to go up, he goes up, whereas
9
00:00:35,070 --> 00:00:38,740
in this case, when he wants to go up, there's a 10 percent chance he'll go left, a 10 percent chance
10
00:00:38,730 --> 00:00:41,390
he'll go right, and an 80 percent chance he'll
11
00:00:41,400 --> 00:00:42,390
go straight up.
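To make the 80/10/10 split concrete, here is a minimal Python sketch of how such a non-deterministic move could be sampled. The function name `sample_move`, the `SIDE_MOVES` table and the choice of which sideways directions pair with each intended direction are assumptions for illustration; the lecture only gives the probabilities for the "up" case.

```python
import random

# Hypothetical mapping from an intended direction to its two sideways slips.
SIDE_MOVES = {
    "up": ("left", "right"),
    "down": ("right", "left"),
    "left": ("down", "up"),
    "right": ("up", "down"),
}


def sample_move(intended, rng):
    """Return the direction actually taken: 80% intended, 10% each sideways slip."""
    roll = rng.random()
    if roll < 0.8:
        return intended
    first_side, second_side = SIDE_MOVES[intended]
    return first_side if roll < 0.9 else second_side


# Count what happens when the agent tries to go up many times.
rng = random.Random(0)
counts = {}
for _ in range(10_000):
    move = sample_move("up", rng)
    counts[move] = counts.get(move, 0) + 1
print(counts)
```

The printed counts should land close to an 8000 / 1000 / 1000 split, which is the randomness "out of the control of the agent" that the rest of the lecture has to deal with.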
12
00:00:42,450 --> 00:00:46,410
Well, these numbers are of course arbitrary and can be different.
13
00:00:46,410 --> 00:00:52,260
And this whole concept could be different in different problems, so it doesn't have to concern
14
00:00:52,320 --> 00:00:57,090
which way he's moving, just that there's some randomness, something that's out of the control of the agent,
15
00:00:57,300 --> 00:00:59,930
happening inside this environment.
16
00:01:00,060 --> 00:01:07,470
And the effect that had, as you remember, was that in the deterministic example it was very easy to
17
00:01:07,470 --> 00:01:11,030
calculate the V values. Well, not necessarily always very easy.
18
00:01:11,040 --> 00:01:16,530
But in our case we could simply calculate them by using the Bellman equation, and we had the exact
19
00:01:16,530 --> 00:01:17,120
values.
20
00:01:17,370 --> 00:01:24,810
And then, as you remember, I very carefully mentioned that these values for the non-deterministic search
21
00:01:24,810 --> 00:01:27,810
example are off the top of my head.
22
00:01:27,840 --> 00:01:29,220
They are not calculated, you know.
23
00:01:29,270 --> 00:01:33,090
Last time I said we're not going to calculate them because it's very complex.
24
00:01:33,090 --> 00:01:39,390
But the computer can do it, and we just went along with these values, which are just values that I made
25
00:01:39,390 --> 00:01:39,600
up.
26
00:01:39,600 --> 00:01:41,310
But they did get the job done.
27
00:01:41,310 --> 00:01:43,030
They helped us understand the concept.
28
00:01:43,290 --> 00:01:47,790
Well now we're going to return to that a little bit and understand what exactly is going on here.
29
00:01:47,790 --> 00:01:55,420
Why is it so much harder to calculate these values in the non-deterministic example or, generally speaking,
30
00:01:55,420 --> 00:01:59,570
in these problems, in these environments with the agent going through them?
31
00:01:59,580 --> 00:02:00,400
Why is it?
32
00:02:00,510 --> 00:02:03,030
Why can it be so hard to calculate these values?
33
00:02:03,030 --> 00:02:09,010
Well, when you think about it, it's because when the agent moves, for instance, from here to the right, he
34
00:02:09,090 --> 00:02:15,270
doesn't necessarily always move that way. Sometimes there's a chance that he will go down instead of going
35
00:02:15,450 --> 00:02:22,290
straight. So let's call these north, east, south, west. So instead of going west,
36
00:02:22,470 --> 00:02:27,360
the agent might sometimes go south, and for instance from here, instead of going north,
37
00:02:27,360 --> 00:02:29,220
he might sometimes go east.
38
00:02:29,460 --> 00:02:30,240
So sorry.
39
00:02:30,240 --> 00:02:34,680
So here, instead of going east, he might sometimes go south, and instead of going north,
40
00:02:34,710 --> 00:02:40,200
he might sometimes go east or west, and here, instead of going north, he might sometimes go west or east,
41
00:02:40,200 --> 00:02:41,160
and so on.
42
00:02:41,160 --> 00:02:47,010
And therefore, in order to calculate this value, you would need to know what this value is. But the
43
00:02:47,010 --> 00:02:51,110
interesting thing is that in order to calculate this value you need to know what this value is.
44
00:02:51,120 --> 00:02:56,790
So there's a lot of recursion happening here, and therefore you cannot just decide to define what these
45
00:02:56,790 --> 00:02:57,340
values are.
46
00:02:57,360 --> 00:03:01,140
And on top of that this recursion is not deterministic.
47
00:03:01,140 --> 00:03:06,000
Sometimes it happens this way: sometimes, instead of going up, he'll go right; sometimes, instead of going
48
00:03:06,000 --> 00:03:08,250
up, he'll go left; and sometimes,
49
00:03:08,730 --> 00:03:09,540
when he wants to go up,
50
00:03:09,540 --> 00:03:10,520
he will go up.
51
00:03:10,560 --> 00:03:17,460
So it is subject to chance. Maybe many times the agent will go through this path and he'll go up, up,
52
00:03:17,460 --> 00:03:22,050
up, up, up, and you'll think that from here he always goes up, and the value of the state will
53
00:03:22,050 --> 00:03:27,370
be good, and then all of a sudden he'll drop into the pit and this value will go down.
54
00:03:27,620 --> 00:03:33,600
And so you can see how there is some stochastic randomness to this whole calculation of these
55
00:03:33,600 --> 00:03:35,370
values because they're all interlinked.
56
00:03:35,370 --> 00:03:40,920
Plus, on top, you've got the randomness that is inherent in the environment because this is a Markov
57
00:03:40,920 --> 00:03:42,320
decision process.
58
00:03:42,540 --> 00:03:47,790
So that's where all this comes together and that's where we're going to introduce the concept of the
59
00:03:47,790 --> 00:03:52,370
temporal difference which will allow the agent to calculate these values.
60
00:03:52,530 --> 00:03:55,560
And here we were dealing with the V values.
61
00:03:55,560 --> 00:03:59,390
And since then we've already moved on to Q-values, so that's what we're going to be working with.
62
00:03:59,400 --> 00:04:01,980
We're going to be looking at Q-values.
63
00:04:02,010 --> 00:04:06,090
So, as you recall, this is our Bellman equation for Q-values.
64
00:04:06,180 --> 00:04:15,090
So a Q-value, or the value of performing a certain action a in state s, is equal to the reward that you
65
00:04:15,090 --> 00:04:22,770
get immediately after performing that action, plus
66
00:04:22,770 --> 00:04:26,720
gamma times the sum across all the possible next states.
67
00:04:26,910 --> 00:04:31,680
So you kind of get the expected value of the state that you will end up in.
68
00:04:31,680 --> 00:04:37,710
So, as you recall, that was the formula for the Bellman equation, and now, just for simplicity's sake, we're going
69
00:04:37,710 --> 00:04:43,670
to rewrite it in the old-fashioned way, in the way that we used to talk about the Bellman equation
70
00:04:43,680 --> 00:04:45,850
before we knew about stochasticity.
71
00:04:45,880 --> 00:04:53,100
So remember, this was our Bellman equation in the sense of a deterministic search example, because here
72
00:04:53,100 --> 00:04:57,600
you don't have that expected value; you don't have the sum across all probabilities.
73
00:04:57,750 --> 00:05:03,110
You just have it as if it's determined what state you're going to end up in, and
74
00:05:03,110 --> 00:05:05,450
then you take the max in that one state.
75
00:05:05,570 --> 00:05:12,170
And the only reason we're rewriting it is that it is just easier to write it
76
00:05:12,200 --> 00:05:14,550
and it'll be easier to follow along with the formula.
77
00:05:14,550 --> 00:05:19,340
So we're just going to remember that we replaced this part with this part.
78
00:05:19,430 --> 00:05:25,400
And also you'll find this notation in a lot of literature so it'll be easier for you to follow along
79
00:05:25,400 --> 00:05:28,310
with other sources if you're studying those.
80
00:05:28,370 --> 00:05:35,390
But do remember that, in fact, what we mean is this probabilistic approach here. This notation
81
00:05:35,500 --> 00:05:39,130
is just easier for us to operate with and understand what's going on.
82
00:05:39,140 --> 00:05:44,180
I just like to keep the equations so that they're not too cluttered, but once again, just remember
83
00:05:44,180 --> 00:05:48,050
that in fact what we mean is this probabilistic approach here.
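For reference, the two forms being contrasted can be sketched as follows. This is a reconstruction from the spoken description, not a copy of the slide: the symbols R(s, a) for the reward, γ for the discount factor and P(s, a, s') for the transition probability are assumed, commonly used notation.

```latex
% Probabilistic (non-deterministic) form: expected value over next states s'
\[
Q(s, a) = R(s, a) + \gamma \sum_{s'} P(s, a, s') \max_{a'} Q(s', a')
\]

% Simplified notation used from here on, written as if s' were determined
\[
Q(s, a) = R(s, a) + \gamma \max_{a'} Q(s', a')
\]
```

The second line is only shorthand; the expectation in the first line is what is actually meant.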
84
00:05:48,290 --> 00:05:52,130
And so now let's have a look at what's going on.
85
00:05:52,190 --> 00:06:00,350
So here is our blank state of the maze. We don't have any Q-values yet, or we may, but let's
86
00:06:00,500 --> 00:06:05,510
just keep it blank for now. Let's just look at one of the states, or one of the cells.
87
00:06:05,570 --> 00:06:07,280
This one specifically.
88
00:06:07,820 --> 00:06:11,240
And here, for instance, for the action of going up,
89
00:06:11,240 --> 00:06:14,290
we have a Q-value that we calculated.
90
00:06:14,290 --> 00:06:18,070
So it's not that we don't have any Q-values yet; we do.
91
00:06:18,080 --> 00:06:19,930
But we're just not illustrating anything.
92
00:06:19,930 --> 00:06:22,520
We're just keeping it blank for simplicity's sake.
93
00:06:22,610 --> 00:06:28,570
But the agent has been walking around for some time and, let's say, hypothetically, somehow he's
94
00:06:28,580 --> 00:06:36,560
calculated this Q-value of going up, or north, from this state, from this specific cell, and the value is
95
00:06:36,560 --> 00:06:40,240
Q(s, a). And so now, here's what we have.
96
00:06:40,240 --> 00:06:45,070
So he is currently where this blue arrow is pointing; the agent is sitting in this cell.
97
00:06:45,590 --> 00:06:48,560
And now he needs to make a choice: where is he going to go?
98
00:06:48,590 --> 00:06:57,290
And he knows the value of this action, going north, and that is Q(s, a). And here I'm saying "before", and
99
00:06:57,290 --> 00:07:01,940
the reason for that is that this is before he takes the action. He hasn't taken the action yet, so he's
100
00:07:01,940 --> 00:07:10,760
still in the cell, and before he's taken the action the value here is Q(s, a). And now he actually takes
101
00:07:10,760 --> 00:07:11,370
the action.
102
00:07:11,390 --> 00:07:13,670
So let's say he decides this is the best one.
103
00:07:13,670 --> 00:07:16,440
He takes the action and he moves up to this cell.
104
00:07:16,730 --> 00:07:24,320
Well, now comes the "after". So after he's taken the action, we can measure what this value is.
105
00:07:24,350 --> 00:07:30,650
Let's just calculate this value: the reward for taking that action plus gamma times the
106
00:07:30,650 --> 00:07:35,640
maximum of this new state that he's just gotten into, s prime.
107
00:07:35,640 --> 00:07:39,030
So, the maximum across all possible actions in s prime.
108
00:07:39,080 --> 00:07:44,770
And so what we have here is the value of that action before.
109
00:07:44,810 --> 00:07:47,650
And then we've calculated this metric afterwards.
110
00:07:47,660 --> 00:07:54,860
But as you can recall from the previous formula, if we go back very quickly, from the previous formula,
111
00:07:55,630 --> 00:08:02,180
what we just calculated is indeed the value; that is how Q(s, a) is calculated.
112
00:08:02,210 --> 00:08:07,930
So this right-hand part we've just calculated separately, but after we've taken the action.
113
00:08:08,330 --> 00:08:15,470
So again, before, we knew a Q(s, a) value, something that we'd calculated through our iterations
114
00:08:15,470 --> 00:08:16,860
previously.
115
00:08:17,000 --> 00:08:19,990
So a value that's stored in our memory.
116
00:08:20,000 --> 00:08:26,990
So it's just a number that we know. And now, after the action has been performed, we know what reward he
117
00:08:27,050 --> 00:08:30,270
actually got, what reward the agent actually got.
118
00:08:30,440 --> 00:08:33,320
And we can calculate this new value.
119
00:08:33,320 --> 00:08:39,690
So in essence we're kind of recalculating this value, but now with new information. The new information
120
00:08:39,690 --> 00:08:41,120
is the reward that we got.
121
00:08:41,600 --> 00:08:47,330
Plus what state we ended up in and what the maximum across that state is, what this new value
122
00:08:47,420 --> 00:08:50,540
is for that specific state.
123
00:08:50,570 --> 00:08:54,480
So, what's the value of being in that state?
124
00:08:54,500 --> 00:09:02,060
So basically it's the Q(s, a), but given new information. And now the temporal difference is defined
125
00:09:02,150 --> 00:09:07,700
as TD(a, s), the difference between these two.
126
00:09:07,700 --> 00:09:11,770
So here the first element is your "after" value.
127
00:09:11,780 --> 00:09:16,250
So it's kind of like the Q(s, a), but calculated afterwards.
128
00:09:16,550 --> 00:09:21,880
And the second is the previous Q(s, a), which you had stored in your memory.
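Written out, the definition just described looks roughly like this. It is a sketch in the simplified deterministic-style notation the lecture adopts; the "after" and "before" labels are added here for readability and are not necessarily on the slide.

```latex
\[
TD(a, s) =
\underbrace{R(s, a) + \gamma \max_{a'} Q(s', a')}_{\text{after: recalculated with new information}}
\; - \;
\underbrace{Q(s, a)}_{\text{before: value stored in memory}}
\]
```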
129
00:09:22,070 --> 00:09:24,170
And so the question is: are they different?
130
00:09:24,290 --> 00:09:26,240
So ideally they should be the same.
131
00:09:26,240 --> 00:09:31,750
Ideally this should be the same as this simply because this is the formula for calculating this.
132
00:09:31,790 --> 00:09:38,060
But the thing is that this is not something we calculated just now; this is something that we have from empirical evidence,
133
00:09:38,060 --> 00:09:41,320
something that we have from just going through the maze many times and calculating.
134
00:09:41,320 --> 00:09:44,330
So this is something we've come up with so far.
135
00:09:44,360 --> 00:09:46,820
It's not related to the current iteration.
136
00:09:46,820 --> 00:09:52,070
It's something that we came up with previously, maybe a long, long time ago, in one of our previous iterations
137
00:09:52,070 --> 00:09:53,180
going through the maze.
138
00:09:53,510 --> 00:09:57,740
Whereas this is something we've calculated just now, and there is no guarantee that they're going to
139
00:09:57,740 --> 00:10:04,720
be the same, because of the randomness that exists in the maze: this could have been calculated
140
00:10:04,750 --> 00:10:10,260
when certain random events were triggered, and this could have been calculated when different random events
141
00:10:10,300 --> 00:10:11,290
were triggered.
142
00:10:11,740 --> 00:10:15,680
And so now that we've written that down, let's just move it up there.
143
00:10:15,700 --> 00:10:16,900
So how do we use this?
144
00:10:16,900 --> 00:10:20,470
The question is OK so we have this temporal difference.
145
00:10:20,470 --> 00:10:21,340
How do we use this?
146
00:10:21,400 --> 00:10:23,450
And why is it called the temporal difference?
147
00:10:23,590 --> 00:10:28,960
Well, the reason it's called the temporal difference is that you're basically calculating the same thing:
148
00:10:28,990 --> 00:10:33,460
you're calculating Q(s, a), so the Q-value of that action.
149
00:10:33,640 --> 00:10:36,140
You're calculating it here and you're calculating it here.
150
00:10:36,340 --> 00:10:38,310
But the difference is time.
151
00:10:38,320 --> 00:10:44,140
This is your Q(s, a) previously; this is your Q(s, a)
152
00:10:44,140 --> 00:10:49,090
now, your new Q(s, a). And the question is: has there been a difference?
153
00:10:49,090 --> 00:10:51,700
Has there been a shift between them in time?
154
00:10:52,060 --> 00:10:56,830
And how can we use this to our advantage if there has indeed been a shift in time?
155
00:10:57,040 --> 00:11:02,790
Well, one thing we could do is say: OK, you know, our Q(s, a),
156
00:11:02,830 --> 00:11:07,490
this new value, doesn't equal the old one, so we are going to get rid of the old, forget about the old, and we'll
157
00:11:07,510 --> 00:11:09,610
just use this as our new value.
158
00:11:09,970 --> 00:11:11,920
But that would not be smart.
159
00:11:11,950 --> 00:11:17,960
And the reason for that is that in our environments random events can sometimes happen.
160
00:11:18,140 --> 00:11:25,500
And what if our old Q(s, a) was something that consistently happens, like 80 percent of the time?
161
00:11:25,780 --> 00:11:28,750
So it was represented by what happens 80 percent of the time.
162
00:11:28,750 --> 00:11:33,280
And then this new one is just what happened due to randomness.
163
00:11:33,280 --> 00:11:39,610
In that case we're going to throw away the one that is responsible for the bulk of the situations
164
00:11:39,760 --> 00:11:43,900
and we're going to replace it with something that happens only 10 or 20 percent of the time.
165
00:11:43,900 --> 00:11:50,650
That wouldn't be the best approach, and that's exactly why we don't want to completely
166
00:11:50,650 --> 00:11:51,990
change our Q-values.
167
00:11:52,060 --> 00:11:56,890
We want to change them step by step, little by little.
168
00:11:56,890 --> 00:12:01,980
And that's why we're going to use this temporal difference in a specific way. So we're going to say: here's
169
00:12:02,020 --> 00:12:05,080
a formula. We're going to take our Q(s, a).
170
00:12:05,560 --> 00:12:07,120
And we're going to update it in the following way.
171
00:12:07,120 --> 00:12:12,450
We're going to take the old value of Q(s, a) and we are going to add alpha times the temporal
172
00:12:12,460 --> 00:12:13,380
difference.
173
00:12:13,420 --> 00:12:15,730
So alpha is going to be our learning rate.
174
00:12:15,730 --> 00:12:17,410
That's a new parameter that we're introducing.
175
00:12:17,410 --> 00:12:20,070
That's how quickly the algorithm is learning.
176
00:12:20,080 --> 00:12:26,390
So basically we're taking this difference and, whatever it is, we're adding it on to our previous
177
00:12:26,480 --> 00:12:27,210
Q(s, a).
178
00:12:27,220 --> 00:12:31,970
Now, this formula probably doesn't make sense just by looking at it, because
179
00:12:31,970 --> 00:12:34,040
you've got Q(s, a) here and Q(s, a) here.
180
00:12:34,060 --> 00:12:39,460
It's the same thing, so they should probably cancel each other out. But we have to rewrite this in a bit of a different
181
00:12:39,460 --> 00:12:40,090
way.
182
00:12:40,390 --> 00:12:44,080
So I'm going to show you again so I'm just adding time to these formulas.
183
00:12:44,090 --> 00:12:48,070
So here is Q_{t-1}, the previous Q.
184
00:12:48,070 --> 00:12:49,780
Q_{t-1}, the previous Q.
185
00:12:49,780 --> 00:12:56,080
Q_t, the new one. There should be a circle here and a circle here as well, but never mind. And here you've got alpha times the temporal
186
00:12:56,080 --> 00:12:56,750
difference.
187
00:12:56,810 --> 00:12:58,750
That's the new, the current temporal difference.
188
00:12:58,750 --> 00:13:01,190
So you can see what we're doing. We're saying:
189
00:13:01,220 --> 00:13:04,200
OK, let's say our current
190
00:13:04,240 --> 00:13:10,880
Q is going to be equal to our previous Q plus whatever temporal difference we found, times alpha.
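The update described here can be sketched in a few lines of Python. This is an illustrative sketch, not the course's own code: the function name `q_learning_step`, the dictionary-based Q-table, the state labels "A" and "B" and the numbers are all assumptions made up for the example.

```python
from collections import defaultdict


def q_learning_step(q, state, action, reward, next_state, actions, alpha, gamma):
    """Apply one temporal-difference update to q[(state, action)] and return the TD."""
    # "After" value: reward actually received plus discounted best value in s'.
    best_next = max(q[(next_state, a)] for a in actions)
    td = reward + gamma * best_next - q[(state, action)]
    # New Q = previous Q + alpha * temporal difference.
    q[(state, action)] += alpha * td
    return td


# Hypothetical two-cell example: the agent moves up from cell "A" into cell "B".
actions = ["up", "down", "left", "right"]
q = defaultdict(float)      # every unseen (state, action) pair starts at 0.0
q[("B", "up")] = 1.0        # pretend this was learned on an earlier pass

td = q_learning_step(q, "A", "up", reward=0.0, next_state="B",
                     actions=actions, alpha=0.5, gamma=0.9)
print(td, q[("A", "up")])
```

Here the stored value for ("A", "up") only moves part of the way toward the newly calculated value, which is the "step by step" behaviour the lecture argues for.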
191
00:13:11,150 --> 00:13:16,330
This formula here is the heart and soul of the Q-learning algorithm.
192
00:13:16,330 --> 00:13:18,250
This is how the Q-values are updated.
193
00:13:18,280 --> 00:13:24,460
And it's good that we've already learned what Q-values are, what gamma is, and what all this stuff
194
00:13:24,460 --> 00:13:25,300
is.
195
00:13:25,420 --> 00:13:31,740
And now all we need to see is that you have a previous Q-value. Yes, that's good.
196
00:13:31,990 --> 00:13:37,870
And then what happens is that when you actually do take the action, when the agent
197
00:13:37,870 --> 00:13:42,530
takes the action, you know, he'll get a reward and he'll end up in a state.
198
00:13:42,610 --> 00:13:46,400
And so based on that he can calculate: aha,
199
00:13:46,420 --> 00:13:53,220
OK, so what should have been the Q-value of that move that I made?
200
00:13:53,530 --> 00:13:56,390
And that is this part of the equation.
201
00:13:56,470 --> 00:14:02,870
Subtracting the old Q-value gets you the temporal difference, and now you need to take alpha times the temporal
202
00:14:02,920 --> 00:14:05,410
difference, and that's how you adjust
203
00:14:05,430 --> 00:14:06,370
your Q-value. That's how you
204
00:14:06,370 --> 00:14:10,240
adjust your Q-value. And now, just to finish this off.
205
00:14:10,240 --> 00:14:14,890
This is sufficient to understand what's going on, but just to clarify things even
206
00:14:14,890 --> 00:14:18,370
more, or perhaps confuse things even more,
207
00:14:18,460 --> 00:14:23,320
what we need to do is take this temporal difference, this temporal difference over here, and plug
208
00:14:23,320 --> 00:14:24,180
it into this formula.
209
00:14:24,190 --> 00:14:29,840
So we're going to take all of this part and plug it into this formula and end up with a huge equation.
210
00:14:29,920 --> 00:14:31,490
So here we go.
211
00:14:31,660 --> 00:14:32,590
There's our equation.
212
00:14:32,590 --> 00:14:38,470
So this is the full equation with the temporal difference written out completely.
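As a sketch of what that expanded equation looks like when written out (reconstructed from the spoken description, using the same assumed symbols R, γ and α as above; the max term is read from whatever Q estimates are currently stored):

```latex
\[
Q_t(s, a) = Q_{t-1}(s, a)
  + \alpha \Bigl( R(s, a) + \gamma \max_{a'} Q(s', a') - Q_{t-1}(s, a) \Bigr)
\]
```

Grouping the terms gives the equivalent form $Q_t(s, a) = (1 - \alpha)\, Q_{t-1}(s, a) + \alpha \bigl( R(s, a) + \gamma \max_{a'} Q(s', a') \bigr)$, which makes the role of α as a blend between the old and the newly calculated value explicit.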
213
00:14:38,560 --> 00:14:43,690
And the reason I wrote it out is, first of all, you'll probably find this in other literature if
214
00:14:43,690 --> 00:14:45,560
you study it.
215
00:14:45,730 --> 00:14:50,810
And the second thing is that it makes some things a bit more complex, as the formula is longer, but it also makes
216
00:14:50,810 --> 00:14:52,300
some things a bit clearer.
217
00:14:52,300 --> 00:14:55,940
So for instance you can see here the role Alpha plays.
218
00:14:55,960 --> 00:14:58,310
You can see it better because look at this.
219
00:14:58,320 --> 00:14:58,860
Here.
220
00:14:58,900 --> 00:15:01,410
Q_{t-1}, and here you've got
221
00:15:01,420 --> 00:15:03,760
Q_{t-1} with a negative sign.
222
00:15:03,760 --> 00:15:12,170
So if you plug in alpha equal to 1, if you put a 1 in here, then this will cancel this.
223
00:15:12,190 --> 00:15:16,170
So they'll destroy each other and all you'll have left is this part.
224
00:15:16,480 --> 00:15:23,080
And what that means is exactly that situation where we said: all right, so you've got a new value, what
225
00:15:23,140 --> 00:15:24,750
it should have been.
226
00:15:24,850 --> 00:15:29,570
Let's update our Q value with the new value and forget about whatever we had previously.
227
00:15:29,710 --> 00:15:35,470
And as we discussed, that isn't the best approach because there are random events here and we want to update
228
00:15:35,470 --> 00:15:36,820
things step by step.
229
00:15:37,530 --> 00:15:43,590
And on the other hand, if you set alpha equal to zero, what happens is that you completely forget about
230
00:15:43,590 --> 00:15:48,960
this whole part, and your Q_t, the new one or the current one, is always going to be equal to the
231
00:15:48,960 --> 00:15:51,720
previous one so you're not going to be learning anything.
232
00:15:51,720 --> 00:15:56,730
And that means whatever is happening in the maze doesn't matter because you decided on your Q
233
00:15:56,730 --> 00:15:58,940
value a long time ago and you're just going to keep it.
234
00:15:59,230 --> 00:16:03,200
So that's why alpha shouldn't be 0 and shouldn't be 1; it should be somewhere in between.
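The two extremes are easy to check numerically. The snippet below is a small sketch with made-up numbers; `updated_q`, `q_previous` and `target` are hypothetical names, where `target` stands for the newly calculated "after" value (reward plus gamma times the max in the new state).

```python
def updated_q(q_previous, target, alpha):
    """Q_t = Q_{t-1} + alpha * (target - Q_{t-1})."""
    return q_previous + alpha * (target - q_previous)


q_previous = 2.0   # illustrative value stored in memory
target = 5.0       # illustrative newly calculated value

results = {alpha: updated_q(q_previous, target, alpha) for alpha in (0.0, 0.1, 1.0)}
for alpha, value in results.items():
    print(f"alpha={alpha}: new Q = {value}")
```

With alpha at 0 the stored value never moves, with alpha at 1 it is overwritten by the latest sample, and with a value in between it shifts only part of the way.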
235
00:16:03,240 --> 00:16:09,330
And it's going to allow you to learn slowly, step by step. The agent,
236
00:16:09,360 --> 00:16:12,720
as it goes through the maze, is going to get the temporal difference.
237
00:16:12,960 --> 00:16:19,530
And slowly but surely this value is going to get updated and updated, and what will happen eventually
238
00:16:19,680 --> 00:16:25,440
is that at some point hopefully the algorithm will converge.
239
00:16:25,710 --> 00:16:30,960
And what that means is that this temporal difference will start becoming closer and closer to zero and
240
00:16:30,960 --> 00:16:37,860
eventually will be, well, very close to zero or even exactly zero. And what that means is that every single
241
00:16:37,860 --> 00:16:43,050
time, your new Q-value, or your newly calculated value,
242
00:16:43,350 --> 00:16:44,430
what it should have been,
243
00:16:44,440 --> 00:16:49,950
so not this one, but what it hypothetically should be after you take the step, will be just equal to your
244
00:16:49,950 --> 00:16:51,030
previous Q-value.
245
00:16:51,030 --> 00:16:55,650
And then this is zero, and when your temporal difference is zero, that means your algorithm
246
00:16:56,070 --> 00:17:02,720
has converged and it's not really necessary to continue updating.
247
00:17:02,720 --> 00:17:06,270
It's not necessary to continue updating your Q-values.
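A stopping rule based on this idea can be sketched as follows. To keep the temporal difference able to reach (almost) zero, this toy uses a deterministic corridor with a single action, which is an assumption made for the example and not the maze from the lecture; with random transitions and a fixed alpha the temporal difference would keep fluctuating rather than settle exactly at zero. All names and constants are hypothetical.

```python
GAMMA = 0.9
ALPHA = 0.5
TOLERANCE = 1e-6

# Hypothetical corridor: states 0..3, state 3 is terminal, the only action is "right".
N_STATES = 4
q = [0.0] * N_STATES   # q[s] holds Q(s, "right"); the terminal entry stays at 0


def sweep(q):
    """Update every non-terminal state once and return the largest |TD| seen."""
    largest_td = 0.0
    for s in range(N_STATES - 1):
        s_next = s + 1
        reward = 1.0 if s_next == N_STATES - 1 else 0.0
        # Only one action exists, so the max over actions in s' is just q[s_next].
        td = reward + GAMMA * q[s_next] - q[s]
        q[s] += ALPHA * td
        largest_td = max(largest_td, abs(td))
    return largest_td


# Keep updating until the temporal difference is close to zero (capped for safety).
for sweeps in range(1, 1001):
    if sweep(q) < TOLERANCE:
        break

print(sweeps, q)
```

Once the loop stops, the stored values are close to 0.81, 0.9 and 1.0 for states 0, 1 and 2, and further updates would barely change them.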
248
00:17:06,270 --> 00:17:12,780
The caveat here is that one of the only times when you would still want to continue
249
00:17:12,810 --> 00:17:19,140
performing this whole, you know, updating of Q-values is if the environment is constantly changing.
250
00:17:19,170 --> 00:17:23,100
That is, if it doesn't just have some random, stochastic events in it,
251
00:17:23,220 --> 00:17:28,750
but the environment itself is modifying, is morphing, is changing with time.
252
00:17:29,040 --> 00:17:34,260
So you continuously need to learn because it's not possible for you to learn everything and come up
253
00:17:34,260 --> 00:17:39,210
with the optimal policy, because the optimal policy also changes with the environment all the time.
254
00:17:39,240 --> 00:17:44,730
In that case you will need to continue calculating the temporal difference and calculating the Q-values.
255
00:17:44,730 --> 00:17:46,830
But that's kind of like an extra complication.
256
00:17:46,830 --> 00:17:53,370
Other than that, this is how Q-values are updated. So this is the main formula of the Q-learning algorithm,
257
00:17:54,090 --> 00:17:59,490
and this is kind of like the expanded version of that. And now it should all come together and make sense
258
00:17:59,490 --> 00:18:05,250
why we have the Bellman equation, and not only what it represents, the Q-values, but also how the agent goes
259
00:18:05,250 --> 00:18:12,870
about updating its values and finding exactly what is going on in that environment so it can come up
260
00:18:12,870 --> 00:18:14,620
with the optimal policy.
261
00:18:14,640 --> 00:18:21,570
So I know that's quite a lot to take in, but hopefully you enjoyed this tutorial and hopefully you're able to take
262
00:18:21,570 --> 00:18:28,680
away the underlying concepts and intuition behind Q-values, and what the whole notion of temporal
263
00:18:28,680 --> 00:18:36,990
difference is, and why it's important, why it helps us slowly train our agents and get them to understand
264
00:18:37,050 --> 00:18:39,230
the environments that they're operating in.
265
00:18:39,270 --> 00:18:45,540
And if you'd like to learn a bit more about temporal differences, then a very popular paper is "Learning
266
00:18:45,540 --> 00:18:52,470
to Predict by the Methods of Temporal Differences" by Richard Sutton, from 1988.
267
00:18:52,620 --> 00:18:57,060
We've already had a reference by Richard Sutton as well, but this is another one, and he actually has
268
00:18:57,060 --> 00:19:04,620
a book so if you get into you know his writing style and his style of communication then check out his
269
00:19:04,620 --> 00:19:05,660
book as well.
270
00:19:05,810 --> 00:19:08,630
It's kind of like a more expanded version of all of these things.
271
00:19:08,640 --> 00:19:12,820
I haven't read the book, but that's what I'm imagining. At the same time,
272
00:19:12,960 --> 00:19:19,530
this is going to be the paper, and you can learn a bit more, or probably a lot more, about temporal
273
00:19:19,530 --> 00:19:21,050
differences there.
274
00:19:21,300 --> 00:19:22,950
And I hope you enjoyed it as well.
275
00:19:23,060 --> 00:19:24,270
We'll see you next time.
276
00:19:24,270 --> 00:19:26,250
Until then enjoy AI.