Hello, and welcome back to the course on artificial intelligence. And finally we're on to the fun stuff: we're on to deep learning. All right, so let's have a look.
Previously we spoke about Q-learning and what it's all about. We learned about the agent and the environment, and how the agent will look at the state he or she is in, take an action, get a reward, and enter a new state; based on that feedback loop they will continue taking actions and they will learn from that, coming to understand what the best actions to take are. And so we looked at this basic example of a maze, and we understood that as the agent explores the environment, it learns what the values of the states are. Then we moved on from dealing with the values of the states to dealing with the values of the actions, the Q-values, and on that basis we understood how plans work in non-stochastic (deterministic) environments and how policies work in stochastic environments; this is an example of a policy. So that's a quick recap of everything we discussed on basic Q-learning.
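For reference, that move from state values to action values rests on the standard relationship between the two (consistent with the earlier Bellman equation tutorials): the value of a state is the value of the best action available in it,

$$V(s) = \max_a Q(s, a)$$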
And now let's have a look at how this can be taken to the next level through deep learning, through adding deep learning. OK.
So this is our environment. And what we're going to do now, instead of just doing basic calculations in this matrix that we have, which is pretty simple, is add two axes: an x and a y axis, or we'll call them x1 and x2, just to make things even more general. Here we've numbered the columns 1, 2, 3, 4, and the rows 1 to 3. And so now every single state can be described by a pair of two values, x1 and x2: any one of these squares in which the agent can possibly be can be described by (x1, x2). So for instance, right now he's in the square with x1 equal to 1 and x2 equal to 2, and in the same way we can describe any square, meaning we can describe any state. Of course, this is a very simplified version of an environment and of describing states, but nevertheless it works in this case.
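For readers who want to see this encoding in code, here is a minimal sketch in Python; the function name encode_state is our own illustration, not from the course's code:

```python
import numpy as np

# Describe a square of the grid by its pair of coordinates (x1, x2).
# This is the vector that will later be fed into the neural network.
def encode_state(x1: int, x2: int) -> np.ndarray:
    return np.array([x1, x2], dtype=np.float32)

state = encode_state(1, 2)  # the square the agent is in right now
print(state)                # [1. 2.]
```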
And that means that now we can feed these states into a neural network. And by the way, here I would just like to mention that at the end of the course we've got annexes: we've got Annex number 1 and Annex number 2. In order to proceed successfully with this section, it's highly advisable that you check out Annex number 1, which is on artificial neural networks, so you understand how they work. That way we don't have to delve into that here, and we can just use the benefit of knowing how artificial neural networks work. And so we feed this information about the state, the x1 and x2, into a neural network, and it will process this information; depending on the structure of the neural network, it might have multiple hidden layers and so on.
That's something that you'll figure out in the practical tutorials, but at the end we will structure it in such a way that it spits out four values, and these four values are actually going to be our Q-values: the values which dictate which action we need to take. Further on in this tutorial we'll see exactly how these Q-values are used to decide which action is taken.
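As a rough illustration of such a network, here is a minimal sketch assuming PyTorch; the hidden-layer size and structure are placeholders, not the architecture used in the practical tutorials:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state (x1, x2) to one Q-value per possible action."""
    def __init__(self, n_inputs: int = 2, n_actions: int = 4, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_inputs, hidden),   # the state vector goes in
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one Q-value per action comes out
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

q_net = QNetwork()
q_values = q_net(torch.tensor([[1.0, 2.0]]))  # four Q-values for this state
```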
But the main point here is that we no longer look at this maze just from a Q-learning perspective. We're now taking the states of the maze and we're feeding them into a deep neural network in order to get these Q-values. And at the end of the day we're still going to come up with an action; we're still going to work out which action we need to take, and we'll discuss all of this in more detail. But the question right now is: why? Why are we doing all of this? Why are we making things so much more complicated when that initial approach of Q-learning was already working well?
The reason for that is that Q-learning was working in this very simplistic environment, and we're continuing to deal with this very simplistic environment for now in order to better understand the concepts. But at the same time, that simple Q-learning will no longer work in more complex environments. We're talking about, for instance, the self-driving cars which we'll be creating, or playing Doom, when the artificial intelligence is playing Doom or other Atari games like Breakout, and more advanced reinforcement learning applications, such as robots walking around and performing actions. In all those cases, basic Q-learning is insufficient; it's not strong enough, not powerful enough, to be able to master those challenges.
And just as we've seen in the deep learning course, if you've done it, or if you've done the annex sections, Annex number 1 and Annex number 2, you will know that deep learning is far more powerful than the simpler machine learning approaches, let alone simple Q-learning.
And that's why we're leveraging the power of deep learning here: we're feeding the information about the environment, in this case just two values, as a vector into a deep neural network, and then we're using that to decide which actions our agent is going to take. So that's a high-level overview of why we're doing this.
And now let's have a look, in a bit more detail, at what happens to the concept of Q-learning when we make the transition from simple Q-learning into deep Q-learning. As you saw in the previous intuition tutorials, we had a slide like this, which is the foundation of temporal difference learning.
This is the formula for temporal difference, so let's go through it. Basically, we had an agent who was in this state over here, indicated by the blue arrow, and we were looking at how temporal difference works for the value of, for instance, going up. What we saw here (this is in simple Q-learning, not deep Q-learning) was that beforehand the agent had a certain Q-value that he had learned for this action of going up, and so he decided to take this action and go up. And right after he takes this action, he gets a reward for taking this action in this state. That is the reward; plus, now he can evaluate the value of the current state he's in, which is the maximum over all of the Q-values of the new actions a' he can take in the new state s', multiplied by the discount factor gamma. So that is essentially the new Q-value, or kind of like the empirical Q-value, that he has just received for taking that action. And ideally these two should be the same: the Q-value that he had in his memory about this action in this state should equate to the actual reward plus gamma times the value of the state that he ended up in. And therefore that's how we calculate the temporal difference: we take what he got after the action, minus what he had in mind, what he was expecting, and subtract one from the other. That's the temporal difference. And then you use your learning rate alpha to adjust your Q-value: your new Q-value is moved by the temporal difference, but with a coefficient of alpha. So that is the essence of simple Q-learning.
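Putting that into symbols, as on the slide from the temporal difference tutorial:

$$TD_t(a, s) = R(s, a) + \gamma \max_{a'} Q(s', a') - Q_{t-1}(s, a)$$

$$Q_t(s, a) = Q_{t-1}(s, a) + \alpha \, TD_t(a, s)$$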
Now let's have a look at how this changes in deep Q-learning. We're still going to work with this slide, but we're going to see exactly what's happening. So in deep Q-learning, the neural network will predict four values, as we saw previously and as we'll see further on. (The neural network might predict more values if there are more possible actions in a given state, but in this case we know that there are only four actions: up, right, left, and down.) So the neural network will predict four of these values, and what's important in a deep Q-learning situation is that there is no "before" and "after".
And this is how we'll get to know this a bit better. The neural network will predict four of these values, and it will compare them not to what happens after, but to this exact value: the value which was calculated in the previous step, the previous time the agent was in this exact square. So let's say, I don't know, some time ago the agent was in this exact square as well, and it calculated this value back then; the agent stored this value for the future, and now the future has come. So now he's in the square again, and now he's got these Q-values which he has predicted, and one of them is for going up. So now what he's going to do is compare the predicted value of Q to this value which he had recorded from the previous step. We'll understand exactly why this is important in just a moment; what's important to understand here is that there is no "before" and "after" for this specific square at this specific time. We're taking the Q-value that he has predicted using the neural network this time, and we're comparing it to the value which he had from the previous time he was in this square assessing the whole situation, the previous time he actually performed this action. So there we go.
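A deliberately simplified sketch of that bookkeeping, in Python (this is our own illustration of the narration, not the course's implementation; all names here are hypothetical):

```python
# Targets remembered from the previous visit to each (state, action) pair.
stored_targets = {}

def on_action_taken(state, action, reward, gamma, next_q_values):
    # Record the empirical target observed right after acting:
    # the reward plus the discounted value of the best next action.
    stored_targets[(state, action)] = reward + gamma * max(next_q_values)

def target_for(state, action, predicted_q):
    # Compare the fresh prediction against the stored target; fall back
    # to the prediction itself if the agent has never acted here before.
    return stored_targets.get((state, action), predicted_q)
```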
Now let's have a look at how this all works out in the neural network, and why. I know it sounds a bit complicated right now, but we'll break it down into simple terms in just a second. So this is our neural network: we're feeding the states of the environment into the neural network, it's going through the hidden layers, and it's coming out with these outputs, Q1, Q2, Q3, Q4. These are the Q-values that the neural network is predicting for the possible actions in that specific state. Those are the Q-values.
So then we're comparing to targets, and these targets exist exactly as we said: if we go back here, this is the target, and this is the value that was predicted. But we also know that we have a target from the last time we were in this square; we have a target for this same action, which is "up", for instance. So here we've got targets, and we're comparing: Q1 versus its target, Q2 versus its target (the target that we had from previously), Q3 versus its target, Q4 versus its target. And so this is the part where the neural network, or the agent, is now learning, through deep learning, how to better navigate the maze.
And the key point here is that we're still applying Q-learning, and in simple Q-learning the concept is simple: you learn through temporal differences, which are pretty straightforward, which we've already discussed, and which we know quite well by now. But at the same time, in deep learning, how do neural networks learn? Neural networks learn through adjusting their weights. So we have to adapt the concepts of reinforcement learning, the concepts of simple Q-learning, to the way neural networks actually work, and that is through updating their weights. And so this is what we're trying to figure out here: how do we adapt that concept of temporal difference to a neural network, so that we can leverage the full power of neural networks?
So far we've gotten to this: we enter our environment's state here as a vector, it goes through the neural network, and we get predictions of Q-values; and then, from the previous time the agent was in that state, we have these Q-targets, Target 1, 2, 3 and 4, one for each of the respective actions. And so now we're up to: OK, let's compare each one with its target. From here it becomes pretty straightforward if you're up to speed with neural networks; once again, that's Annex number 1. We're going to calculate a loss, which is here: we take the Q-target, this one, minus Q, this one, and we square that, so we get the squared difference for each of these, and then we sum them. So we take the sum of the squared differences between these values and their targets, and that's going to be our loss.
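Written out, the loss just described is:

$$L = \sum_i \left( Q\text{-Target}_i - Q_i \right)^2$$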
And so ideally, just as we had in temporal difference learning, if we go back for a second, remember we said ideally we want this to be equal to this: we want the temporal difference to be zero. That basically means the agent is predicting exactly correctly: the Q-values that the agent is predicting, or that he has in memory, are exactly descriptive of the environment, and therefore the agent can navigate the environment pretty well, right? There are no surprises. As long as the temporal difference is highly positive or highly negative, then we've got some surprises. But if the temporal difference is zero, then he knows the environment so well that he can predict what's going on, and therefore his policy is going to be very good and he's going to be able to navigate.
So here, same thing: we want this loss to be as close to zero as possible, or as small as possible. And that's why this is the part where we're going to leverage the real, true power of neural networks: we're going to take this loss and use backpropagation, or stochastic gradient descent. We take this loss, pass it back through the network, back-propagate it, and through stochastic gradient descent update the weights of all of these synapses in the network, so that next time we go through this network, the weights are already a bit more descriptive of the environment. And that's exactly what happens.
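Here is a minimal sketch of that update step, again assuming PyTorch and reusing the QNetwork sketch from earlier; state_batch and q_targets are illustrative placeholders:

```python
import torch

state_batch = torch.tensor([[1.0, 2.0]])          # one state, (x1, x2)
q_targets = torch.tensor([[0.9, 0.1, 0.2, 0.4]])  # targets from the previous visit

optimizer = torch.optim.SGD(q_net.parameters(), lr=0.01)  # stochastic gradient descent

q_values = q_net(state_batch)               # predicted Q1..Q4
loss = ((q_targets - q_values) ** 2).sum()  # sum of squared differences

optimizer.zero_grad()
loss.backward()    # backpropagate the loss through the network
optimizer.step()   # update the weights (the "synapses")
```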
So here you have it: if you go back, this is where the loss is calculated and then back-propagated through the network, and the weights are updated. Then the next time we get here, this happens again; and again here, this happens again, and so on, and it keeps happening. And that's how this agent learns. Basically, the neural network, which is the brain of the agent, is becoming more and more descriptive of the environment, and therefore the agent is able to navigate the environment.
When we say "descriptive of the environment", it basically means that when we put in the states of the environment that the agent is in, we're more likely to get closer and closer to the actual Q-values, the Q-values that let us find the right action. And that happens because these Q-targets are actually empirically derived. How does he find these Q-targets? He actually observes them: OK, so once I do take this step, what's the reward I get? And then, what's the value of this state? Same thing as we saw previously in Q-learning, in the simple Q-learning intuition. So he learns this through trial and error, and then he constructs his network, or rather its weights, in such a way that the predicted values get closer and closer to approximating those target Q-values. So it's very similar to the concept we discussed here in the simple temporal difference learning of the simple Q-learning algorithm.
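In symbols, each empirically derived target is exactly the observed quantity from the temporal difference formula:

$$Q\text{-Target} = R(s, a) + \gamma \max_{a'} Q(s', a')$$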
So there you go, that's how the agent learns. So we're up to here, and that's the learning part.