1
00:00:05,340 --> 00:00:11,390
Welcome back, everyone, to this lecture on back propagation. The last theory topic we're going to cover
2
00:00:11,540 --> 00:00:16,670
is back propagation and we're going to start by trying to build an intuition behind back propagation
3
00:00:17,060 --> 00:00:20,810
and then we'll dive into the calculus and notation of back propagation.
4
00:00:20,810 --> 00:00:26,870
I want to point out that back propagation is probably the hardest part of the entire theoretical deep
5
00:00:26,870 --> 00:00:32,360
learning process because of the calculus and the notation involved in that calculus, especially when we
6
00:00:32,360 --> 00:00:39,160
start talking about back propagation in dealing with a matrix of weights and another matrix of biases.
7
00:00:39,200 --> 00:00:43,730
So keep that in mind: this is gonna be pretty difficult, especially if you're rusty on your calculus notation.
8
00:00:44,270 --> 00:00:48,830
With that in mind, though, if you understand the basic intuition that you basically just move backwards
9
00:00:48,860 --> 00:00:53,720
through a network to update the weights and biases, then that's actually enough to continue on to the rest
10
00:00:53,720 --> 00:00:54,640
of the course.
11
00:00:54,650 --> 00:00:59,210
So if you're fuzzy on the calculus portion of this lecture, don't worry too much about that, because it's
12
00:00:59,210 --> 00:01:02,360
not like you're gonna be needing to compute the gradient yourself.
13
00:01:02,450 --> 00:01:11,000
The actual code is going to do that for us. So fundamentally, we want to know how the cost function results
14
00:01:11,120 --> 00:01:14,160
change with respect to the weights in the network.
15
00:01:14,210 --> 00:01:18,350
That way we can update the weights to minimize the cost function, and we already talked a little bit about
16
00:01:18,350 --> 00:01:24,780
that when talking about things like gradient descent and how it approaches the cost function. So let's
17
00:01:24,780 --> 00:01:28,960
go ahead and begin with a very simple network in order to understand back propagation.
18
00:01:29,010 --> 00:01:33,510
So this is a super simple network: essentially, each layer only has one neuron.
19
00:01:33,510 --> 00:01:39,240
So we'll see how back propagation works with just a network of a couple of neurons and then we can easily
20
00:01:39,240 --> 00:01:47,410
expand this to networks with multiple neurons per layer. So as we already know, each input basically receives
21
00:01:47,440 --> 00:01:48,840
a weight and a bias.
22
00:01:48,970 --> 00:01:54,190
So there's an incoming weight attached to that edge, and then each node, or that is to say each neuron, has
23
00:01:54,190 --> 00:01:54,970
its own bias.
24
00:01:54,970 --> 00:01:59,950
So we get the sort of formula: weight 1 plus bias 1, weight 2 plus bias 2, weight 3 plus bias 3, and
25
00:01:59,950 --> 00:02:07,010
so on. So this means that we have some sort of cost function that's dependent on those weights and biases,
26
00:02:09,130 --> 00:02:12,270
and we've already seen how this process propagates forward.
27
00:02:12,370 --> 00:02:19,920
So let's go ahead and start at the end to learn about back propagation. So we already noted that the
28
00:02:19,920 --> 00:02:25,050
way we notate layers is by having the very last layer be called L.
29
00:02:25,260 --> 00:02:30,630
So then our notation becomes that the neuron all the way on the right is in layer L, the neuron to the
30
00:02:30,630 --> 00:02:37,480
left of it is in layer L minus 1, then L minus 2, and so on for L minus n layers.
31
00:02:37,490 --> 00:02:42,950
Now let's go ahead and just focus on the very last two layers of our network because back propagation
32
00:02:42,950 --> 00:02:48,590
starts all the way at the end of the network, layer L, once we've gone through our feed forward process.
33
00:02:49,030 --> 00:02:53,750
And when focusing on these two layers, I want to remind ourselves of the notation we've been using so
34
00:02:53,750 --> 00:03:02,570
far. We've been defining z as w times x plus b. And recall that x, that notation of x, is really only
35
00:03:02,570 --> 00:03:09,080
valid at the very first layer, because x stands for the actual raw feature inputs. As you keep going from
36
00:03:09,080 --> 00:03:15,290
neuron to neuron further into the network, x technically becomes the output of the previous neuron, which
37
00:03:15,380 --> 00:03:22,100
is defined as a, because remember, after we apply an activation function to z, such as sigmoid of z, we label
38
00:03:22,100 --> 00:03:23,440
that as a.
39
00:03:23,510 --> 00:03:29,630
So as you go further and further along into these layers, z would actually be equal to w times a plus
40
00:03:29,630 --> 00:03:35,330
b because X is technically only valid at that very first layer as the raw feature input.
41
00:03:35,330 --> 00:03:39,950
Once you actually pass that into a neuron, you're technically not dealing with the raw features anymore.
42
00:03:39,950 --> 00:03:45,020
Instead, you're dealing with the output of the previous layer's neuron, which is better stated as a, the
43
00:03:45,020 --> 00:03:49,340
sigmoid of Z or whatever activation function you choose.
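As a rough illustration of the notation just described, here is a minimal sketch of two single-neuron layers chained together. The function names `sigmoid` and `neuron_forward` and all the numeric values are made up for this example; they are not from the course code.

```python
import math


def sigmoid(z):
    """Logistic activation: squashes any real z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_forward(w, a_prev, b):
    """Return (z, a) for one neuron: z = w * a_prev + b, then a = sigmoid(z)."""
    z = w * a_prev + b
    return z, sigmoid(z)


# Hypothetical values, chosen only for illustration.
x = 0.5  # raw feature input; "x" is only the right name at the first layer

# First layer: the input really is x.
z1, a1 = neuron_forward(w=0.8, a_prev=x, b=0.1)

# Second layer: the input is now a1, the output of the previous neuron.
z2, a2 = neuron_forward(w=-1.2, a_prev=a1, b=0.3)

print("layer 1:", z1, a1)
print("layer 2:", z2, a2)
```

The only thing that changes between the two calls is what gets passed as `a_prev`: the raw feature at the first layer, and the previous activation everywhere after that.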
44
00:03:49,520 --> 00:03:53,890
OK, so what does that actually mean when we take it into account for that very last layer?
45
00:03:53,900 --> 00:04:01,590
Well that means that z at the last layer is going to be equal to those weights at the last layer times
46
00:04:01,870 --> 00:04:03,600
a of L minus one.
47
00:04:03,620 --> 00:04:12,710
So what is a of L minus 1? Well, a of L minus 1 is simply the output of the previous layer's neuron. So a of L minus
48
00:04:12,710 --> 00:04:19,650
1, plus b of L, the biases at that very last layer. So again, a of L,
49
00:04:19,740 --> 00:04:21,470
so the activation function
50
00:04:21,510 --> 00:04:27,350
output at the very last layer, is equal to sigmoid, or the activation function, of z of L.
51
00:04:27,360 --> 00:04:33,330
So notice that a of L is defined by the weights and biases at that layer L.
52
00:04:33,390 --> 00:04:36,160
And then it's defined by the output of the previous layer.
53
00:04:36,300 --> 00:04:42,330
So hopefully you can kind of make these connections. And that means that the cost function is going
54
00:04:42,330 --> 00:04:50,000
to be equal to a of L minus y, where y is the actual true output, squared.
55
00:04:50,050 --> 00:04:56,890
So what we actually want to understand is how sensitive the cost function is to changes in w, and this
56
00:04:56,890 --> 00:05:01,420
is where partial derivatives come into play because we want to figure out the relationship between that
57
00:05:01,420 --> 00:05:04,840
final cost function and the weights.
58
00:05:04,840 --> 00:05:09,910
In this case at layer L. So we're going to take the partial derivative of that cost function
59
00:05:10,000 --> 00:05:18,050
with respect to the weights in layer L. And if you know some calculus, then you know that there's a chain
60
00:05:18,050 --> 00:05:18,710
rule.
61
00:05:18,860 --> 00:05:24,860
And so if you were to take the formulas we just saw here and apply the chain rule for them in order
62
00:05:24,860 --> 00:05:29,030
to solve for this partial derivative we mentioned here because we want to understand the relationship
63
00:05:29,030 --> 00:05:34,130
between that cost function and the weights in the network then you end up calculating this formula.
64
00:05:34,130 --> 00:05:38,960
So this is just the chain rule that basically allows you to take the derivative of a function within
65
00:05:38,960 --> 00:05:40,300
a function.
66
00:05:40,430 --> 00:05:43,160
So here, using some calculus and the chain rule,
67
00:05:43,160 --> 00:05:48,500
we can determine that the partial derivative of that cost function with respect to the weights is
68
00:05:48,500 --> 00:05:54,200
equal to the partial derivative of z with respect to the weights, times the partial derivative of
69
00:05:54,200 --> 00:05:54,640
a
70
00:05:54,650 --> 00:06:00,560
with respect to z, times the partial derivative of the cost function with respect to a. So the chain rule allows
71
00:06:00,560 --> 00:06:06,650
us to pull apart these functions within functions because we saw from these previous three formulas
72
00:06:07,070 --> 00:06:13,640
that the cost function is defined by a of L, a of L is then defined by z of L, and z of L is defined by w of
73
00:06:13,640 --> 00:06:20,180
L and b of L. Now recall that the cost function is not just a function of the weights, but it's also a
74
00:06:20,180 --> 00:06:21,750
function of the biases.
75
00:06:21,860 --> 00:06:26,720
So we want to be able to understand the relationship of the cost function to changes in not only the weights
76
00:06:26,960 --> 00:06:33,110
but the biases along the network as well. So we can then calculate the same partial derivative, so the
77
00:06:33,110 --> 00:06:37,600
partial derivative of the cost function with respect to those bias terms in the same way.
78
00:06:37,640 --> 00:06:42,400
Essentially just kind of swapping out that weight for the bias.
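To make the chain rule concrete, here is a minimal sketch for a single output neuron with a sigmoid activation and the cost (a minus y) squared, as stated in this lecture. The three factors are multiplied together for the weight and for the bias, then compared against a numerical finite-difference estimate. The function names and the numeric values are assumptions made up for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cost(w, b, a_prev, y):
    """Quadratic cost C = (a - y)^2 for one output neuron."""
    return (sigmoid(w * a_prev + b) - y) ** 2


def last_layer_gradients(w, b, a_prev, y):
    """Chain rule: dC/dw = dz/dw * da/dz * dC/da, and the same with b."""
    z = w * a_prev + b
    a = sigmoid(z)
    dC_da = 2.0 * (a - y)   # derivative of (a - y)^2 with respect to a
    da_dz = a * (1.0 - a)   # derivative of sigmoid(z) with respect to z
    dz_dw = a_prev          # derivative of w * a_prev + b with respect to w
    dz_db = 1.0             # derivative of w * a_prev + b with respect to b
    return dz_dw * da_dz * dC_da, dz_db * da_dz * dC_da


# Hypothetical values, chosen only for illustration.
w, b, a_prev, y = 0.7, -0.2, 0.6, 1.0
dC_dw, dC_db = last_layer_gradients(w, b, a_prev, y)

# Numerical check with central differences.
eps = 1e-6
num_dw = (cost(w + eps, b, a_prev, y) - cost(w - eps, b, a_prev, y)) / (2 * eps)
num_db = (cost(w, b + eps, a_prev, y) - cost(w, b - eps, a_prev, y)) / (2 * eps)

print("dC/dw:", dC_dw, "numerical:", num_dw)
print("dC/db:", dC_db, "numerical:", num_db)
```

Swapping the weight for the bias only changes the first factor: dz/dw is the previous activation, while dz/db is 1.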
79
00:06:42,450 --> 00:06:47,850
Now the main idea here is that we can use the gradient to go back through the network and adjust our
80
00:06:47,850 --> 00:06:53,520
weights and biases to minimize the output of the error vector on the last output layer.
81
00:06:53,520 --> 00:07:01,340
And recall that the gradient is essentially that derivative when you're dealing with n dimensions. So
82
00:07:01,340 --> 00:07:07,070
using some calculus notation, we can expand this idea to networks with multiple neurons per layer, and
83
00:07:07,070 --> 00:07:11,240
there's gonna be some notation you'll see in just a little bit which again if you're a little rusty
84
00:07:11,240 --> 00:07:15,710
on linear algebra or calculus, it's called the Hadamard product, and it's actually a product that you're
85
00:07:15,710 --> 00:07:21,020
already familiar with, because it's kind of the default with NumPy and these different deep learning
86
00:07:21,060 --> 00:07:26,480
libraries, where you're actually performing element by element multiplication.
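A quick sketch of the difference between element by element multiplication and the ordinary matrix product, assuming NumPy is installed. The arrays `A` and `B` are made-up values for illustration.

```python
import numpy as np

# Two arrays of the exact same shape (hypothetical values).
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[10.0, 20.0],
              [30.0, 40.0]])

# Hadamard product: the * operator on NumPy arrays multiplies
# matching entries, so the result has the same shape as A and B.
hadamard = A * B

# Ordinary matrix product, for contrast: rows of A times columns of B.
matmul = A @ B

print(hadamard)
print(matmul)
```

Here `hadamard[0, 1]` is just `A[0, 1] * B[0, 1]`, while `matmul[0, 1]` sums over a whole row of `A` and a whole column of `B`.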
87
00:07:26,480 --> 00:07:31,190
So again, the Hadamard product, that little dot notation that kind of looks a little bit like a hydrogen
88
00:07:31,190 --> 00:07:32,090
symbol.
89
00:07:32,090 --> 00:07:37,280
Basically what it means is just do an element by element multiplication and that means that the two
90
00:07:37,280 --> 00:07:43,100
matrices should be the exact same size which makes sense because that's gonna match up for things like
91
00:07:43,550 --> 00:07:50,480
weights and biases. So given this notation and back propagation, we just have a few main steps to train
92
00:07:50,480 --> 00:07:55,640
neural networks. Now, again, you don't need to fully understand these intricate details on calculus
93
00:07:55,670 --> 00:08:02,270
or the notation to continue on to the coding portions of this course. Let's review now the actual learning
94
00:08:02,270 --> 00:08:08,720
process for a network. We start off with just a very basic feed forward process that we're already familiar
95
00:08:08,720 --> 00:08:09,250
with.
96
00:08:09,470 --> 00:08:15,500
In step 1, using the input x, that is the original features, we set the activation a for the
97
00:08:15,500 --> 00:08:22,460
input layer. So that very first input layer then means that we have z is equal to w x plus b, and then
98
00:08:22,640 --> 00:08:28,730
a, that output essentially coming out of the input layer, is gonna be equal to your activation function
99
00:08:28,790 --> 00:08:29,290
of Z.
100
00:08:29,290 --> 00:08:32,330
In this case represented as sigmoid of Z.
101
00:08:32,330 --> 00:08:35,290
So this resulting a then feeds into the next layer.
102
00:08:35,390 --> 00:08:38,460
So then you have the next layer taking in a,
103
00:08:38,510 --> 00:08:43,770
which means its z is going to be w times a of the previous layer, plus b.
104
00:08:43,790 --> 00:08:49,280
So then you go into the next layer, and then you'd take the output of that, a, stick it into z, and so
105
00:08:49,280 --> 00:08:53,930
on and so on. So we think about this for each layer.
106
00:08:53,930 --> 00:08:58,970
All we're doing is computing those z's and those a's, since a is based off z.
107
00:08:58,970 --> 00:09:05,690
So if I'm at some layer L then what I'm going to be doing is setting Z of my current layer l equal to
108
00:09:06,170 --> 00:09:13,280
the weights at L times the output from the previous layer, that is a of L minus 1, plus the biases of my
109
00:09:13,280 --> 00:09:14,660
current layer, b of L.
110
00:09:14,900 --> 00:09:18,440
Once I have my z of L, I'm going to pass that through my activation function.
111
00:09:18,530 --> 00:09:24,230
In this case sigmoid, and then I get the a of my current layer L, and then I can pass that to the next
112
00:09:24,230 --> 00:09:26,770
layer of L plus 1 and so on and so on.
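The feed forward process just described can be sketched as a loop over layers. This is only an illustration assuming NumPy; the layer sizes, the random weights, and the name `feed_forward` are made up for this example and are not the course's code.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def feed_forward(x, weights, biases):
    """Compute z and a for every layer.

    activations[0] is the raw input x; after that, each layer's z is
    W @ (previous a) + b, and its a is sigmoid(z).
    """
    activations = [x]
    zs = []
    for W, b in zip(weights, biases):
        z = W @ activations[-1] + b
        zs.append(z)
        activations.append(sigmoid(z))
    return zs, activations


# Hypothetical network: 3 inputs, 4 hidden neurons, 2 outputs.
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((n_out, 1)) for n_out in sizes[1:]]

x = rng.standard_normal((3, 1))  # one example as a column vector
zs, activations = feed_forward(x, weights, biases)

print("output activation:", activations[-1].ravel())
```

Keeping every z and a around, rather than only the final output, is a deliberate choice here: the backward pass needs them.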
113
00:09:27,660 --> 00:09:29,700
And then we come to step 3.
114
00:09:29,760 --> 00:09:35,070
And here we've written it out in the full calculus notation of computing our error vector.
115
00:09:35,070 --> 00:09:41,160
But essentially, what we want to do is take a look and focus on just that very first term. All it's
116
00:09:41,160 --> 00:09:47,550
doing is essentially expressing the rate of change of that cost function with respect to the output
117
00:09:47,640 --> 00:09:53,820
activations. And in the case of the quadratic cost function, then that's essentially the same thing as
118
00:09:53,820 --> 00:10:01,240
saying the activation of the last output layer minus Y which was the true value.
119
00:10:01,410 --> 00:10:07,740
And essentially what we want to do is be able to calculate this error vector and back propagate it, essentially
120
00:10:07,740 --> 00:10:13,050
calculate the error back through every other single layer. That way we can adjust the weights and biases
121
00:10:13,080 --> 00:10:21,030
for that error. So replacing that first term with a of capital L minus y, we get the following formula.
122
00:10:21,070 --> 00:10:24,180
And again, we see that Hadamard product here.
123
00:10:24,700 --> 00:10:31,210
So what I want to do is I want to write a generalized error vector formula and I'm gonna write it out
124
00:10:31,480 --> 00:10:36,970
in terms of the error in the next layer which makes a lot of sense because all we're doing is we're
125
00:10:36,970 --> 00:10:43,930
moving backwards. And a real quick note here: it's a little tricky to find a font where a lowercase
126
00:10:44,080 --> 00:10:46,620
l looks different than the number one.
127
00:10:46,690 --> 00:10:49,850
So I have two little bullet points there to show you what I'm talking about.
128
00:10:49,900 --> 00:10:55,630
So a lowercase l essentially looks like a straight line; the number one has a little dash on top.
129
00:10:55,690 --> 00:10:58,540
So keep that in mind as you see what's happening next.
130
00:10:58,540 --> 00:11:04,060
The reason I'm not gonna be using a capital L is because in general we should be using capital L to
131
00:11:04,060 --> 00:11:06,310
denote the very last output layer.
132
00:11:06,460 --> 00:11:12,130
And what I want to do is show you the formula for the error vector and those calculations for any layer
133
00:11:12,220 --> 00:11:17,070
l, lowercase l, that is inside of the network.
134
00:11:17,080 --> 00:11:23,500
So what I'm going to do is for this back propagation step for every single layer starting at the very
135
00:11:23,500 --> 00:11:24,010
last layer,
136
00:11:24,010 --> 00:11:31,780
capital L, then moving to capital L minus one, capital L minus two, etc., all the way for all these layers,
137
00:11:31,780 --> 00:11:40,060
the generalized error term, so that error term, that delta of little l or lowercase l, is going to
138
00:11:40,060 --> 00:11:44,350
be equal to the weight matrix of l plus 1,
139
00:11:44,350 --> 00:11:51,670
that's the lowercase l, then transpose that. So that T, that term right there, that's a transpose of the
140
00:11:51,670 --> 00:11:59,830
weight matrix in the next layer on the right hand side, so l plus 1, that's the lowercase l. And then
141
00:11:59,830 --> 00:12:03,730
we have that multiplied by the error term at that
142
00:12:03,730 --> 00:12:11,230
also next layer. And then we take the Hadamard product again with the z of l passed into the derivative of the activation
143
00:12:11,230 --> 00:12:13,780
function.
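Here is a minimal sketch of that generalized error term for one step backward, assuming NumPy and a sigmoid activation. The function name `backpropagate_error` and the numbers are made up for illustration. Note that the z of the current layer goes through the derivative of the activation function.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_prime(z):
    """Derivative of the sigmoid with respect to z."""
    s = sigmoid(z)
    return s * (1.0 - s)


def backpropagate_error(W_next, delta_next, z_l):
    """delta_l = (W_next^T @ delta_next) Hadamard sigmoid_prime(z_l).

    W_next is the weight matrix of layer l + 1, delta_next is the error
    at layer l + 1, and z_l is the weighted input of layer l.
    """
    return (W_next.T @ delta_next) * sigmoid_prime(z_l)


# Hypothetical sizes: layer l has 3 neurons, layer l + 1 has 2 neurons.
W_next = np.array([[0.1, -0.2, 0.3],
                   [0.4, 0.5, -0.6]])   # shape (2, 3)
delta_next = np.array([[0.5],
                       [-1.0]])         # error at layer l + 1, shape (2, 1)
z_l = np.zeros((3, 1))                  # sigmoid_prime(0) is 0.25

delta_l = backpropagate_error(W_next, delta_next, z_l)
print(delta_l.ravel())
```

The transpose is what makes the shapes line up: a (3, 2) matrix times a (2, 1) error gives a (3, 1) vector, one entry per neuron in layer l, which can then be multiplied element by element with `sigmoid_prime(z_l)`.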
144
00:12:14,020 --> 00:12:16,950
So again all we're doing there is back propagating the error.
145
00:12:17,380 --> 00:12:25,630
And here we have the generalized error for any layer lowercase l. So when we apply that actual transposed
146
00:12:25,630 --> 00:12:32,590
weight matrix, that weight matrix of l plus 1 transposed, we can think intuitively of this as moving the
147
00:12:32,590 --> 00:12:38,410
error backward through the network, giving us some sort of measure of the error at the output of that l-th
148
00:12:38,530 --> 00:12:49,070
layer. We then take the Hadamard product of that, that Hadamard product there, with the z at that
149
00:12:49,070 --> 00:12:55,070
layer passed into the derivative of the activation function. And what this does is move the error backward through
150
00:12:55,070 --> 00:13:02,020
the activation function in layer l, giving us the error, delta of l, in the weighted input to layer l.
151
00:13:02,120 --> 00:13:09,770
So again, that's a generalized term, which is why you see me using that lowercase l. And then we can understand
152
00:13:10,130 --> 00:13:16,550
that the gradient of the cost function is given by these two formulas. And for each layer, L minus 1,
153
00:13:16,610 --> 00:13:18,280
L minus 2 and so on.
154
00:13:18,350 --> 00:13:24,290
All we're really doing is computing that partial derivative of the cost function with respect to the
155
00:13:24,290 --> 00:13:31,300
weights and the biases there. And j and k, that's just notation for the actual neurons themselves.
156
00:13:34,020 --> 00:13:38,800
This then allows us to adjust the weights and biases to help minimize that cost function.
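Putting the steps together, here is a rough end-to-end sketch, assuming NumPy, sigmoid activations, and the quadratic cost with a factor of one half, so that its derivative with respect to the output is a of L minus y. The network sizes, the learning rate, and the function names are all assumptions made up for this example. One gradient entry is checked against a numerical estimate.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def cost(x, y, weights, biases):
    """Quadratic cost 0.5 * ||a_L - y||^2 (the 0.5 is an assumption)."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return 0.5 * float(np.sum((a - y) ** 2))


def backprop(x, y, weights, biases):
    """Return gradients of the cost for every weight matrix and bias."""
    # Feed forward, storing every z and a.
    activations, zs = [x], []
    for W, b in zip(weights, biases):
        zs.append(W @ activations[-1] + b)
        activations.append(sigmoid(zs[-1]))

    # Error at the last layer: (a_L - y) Hadamard sigmoid'(z_L).
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])

    grad_W = [None] * len(weights)
    grad_b = [None] * len(biases)
    for l in range(len(weights) - 1, -1, -1):
        grad_b[l] = delta                      # dC/db_j = delta_j
        grad_W[l] = delta @ activations[l].T   # dC/dw_jk = a_k * delta_j
        if l > 0:
            # Move the error one layer back.
            delta = (weights[l].T @ delta) * sigmoid_prime(zs[l - 1])
    return grad_W, grad_b


# Hypothetical network: 2 inputs, 3 hidden neurons, 1 output.
rng = np.random.default_rng(1)
sizes = [2, 3, 1]
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((n_out, 1)) for n_out in sizes[1:]]
x = np.array([[0.5], [-0.3]])
y = np.array([[1.0]])

grad_W, grad_b = backprop(x, y, weights, biases)

# Numerical check of one weight gradient with central differences.
eps = 1e-6
w_plus = [W.copy() for W in weights]
w_minus = [W.copy() for W in weights]
w_plus[0][1, 0] += eps
w_minus[0][1, 0] -= eps
numeric = (cost(x, y, w_plus, biases) - cost(x, y, w_minus, biases)) / (2 * eps)
print("backprop:", grad_W[0][1, 0], "numerical:", numeric)

# One gradient descent step to adjust the weights and biases.
learning_rate = 0.1
new_weights = [W - learning_rate * gW for W, gW in zip(weights, grad_W)]
new_biases = [b - learning_rate * gb for b, gb in zip(biases, grad_b)]
print("cost before:", cost(x, y, weights, biases))
print("cost after: ", cost(x, y, new_weights, new_biases))
```

In practice a deep learning library computes these gradients for you; the sketch is only meant to show that the formulas amount to one forward loop and one backward loop.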
157
00:13:39,060 --> 00:13:45,990
So I know this was quite difficult to understand in terms of the notation of the calculus and don't
158
00:13:45,990 --> 00:13:47,480
worry if you didn't get it right away.
159
00:13:47,490 --> 00:13:51,600
This usually takes at least for me it took definitely a couple of hours to understand.
160
00:13:51,600 --> 00:13:53,820
Just trying to write it out with paper and pencil.
161
00:13:53,820 --> 00:13:59,040
So what I've done is I've linked for you in this lecture some external links. You should see it as a little
162
00:13:59,040 --> 00:14:04,050
folder pop up as you're watching this lecture, or right next to the lecture title you should
163
00:14:04,050 --> 00:14:08,670
see a little dropdown folder you can click. There are some external links there that actually go and essentially
164
00:14:08,670 --> 00:14:09,960
derive, step by step,
165
00:14:10,020 --> 00:14:11,380
all these equations.
166
00:14:11,400 --> 00:14:16,200
So if this overview wasn't enough for you and you want to see every step and the whole derivation of
167
00:14:16,200 --> 00:14:22,260
this process, plus the proof of those four fundamental back propagation equations, check
168
00:14:22,260 --> 00:14:27,810
out the external links for lots of details on that. But if you have the general understanding that you're
169
00:14:27,810 --> 00:14:33,750
basically calculating the error term at the very last layer and then going backwards through the network
170
00:14:33,750 --> 00:14:37,980
to calculate all those errors and then adjusting the weights and biases accordingly to minimize that
171
00:14:37,980 --> 00:14:39,000
cost function,
172
00:14:39,000 --> 00:14:44,660
if you understand that general intuition, you know enough to go ahead and continue with the course.
173
00:14:44,730 --> 00:14:45,450
All right.
174
00:14:45,450 --> 00:14:47,010
Thanks and I'll see you at the next lecture.