1
00:00:00,680 --> 00:00:05,570
Hello and welcome back to the course on deep learning. In today's tutorial we're talking about gradient
2
00:00:05,600 --> 00:00:06,600
descent.
3
00:00:06,890 --> 00:00:13,610
What we learned previously was that in order for a neural network to learn, what needs to happen is back
4
00:00:13,610 --> 00:00:21,140
propagation, and that is when the error, the difference or the sum of squared differences between y hat
5
00:00:21,170 --> 00:00:28,300
and y, is backpropagated through the neural network and the weights are adjusted accordingly.
6
00:00:28,520 --> 00:00:34,220
So we saw that, and today we're going to learn exactly how these weights are adjusted.
7
00:00:34,400 --> 00:00:35,930
So let's have a look.
8
00:00:36,080 --> 00:00:44,030
This is our very simple version of a neural network, a perceptron, or a single-layer feedforward neural
9
00:00:44,030 --> 00:00:52,280
network, and what we can see here is this whole process in action, where we've got some input value, then
10
00:00:52,280 --> 00:00:57,000
we've got a weight, then an activation function is applied.
11
00:00:56,990 --> 00:01:01,850
We get y hat, then we compare it to the actual value and calculate the cost function.
12
00:01:01,850 --> 00:01:05,420
So how can we minimize the cost function?
13
00:01:05,420 --> 00:01:07,370
What can we do about it?
14
00:01:07,370 --> 00:01:14,750
Well, one approach is a brute-force approach, where we just take lots of different possible
15
00:01:14,750 --> 00:01:20,990
weights, look at them, and see which one looks best. For instance, we would try out,
16
00:01:21,080 --> 00:01:26,240
let's say, for example, a thousand weights, and we would get something like this
17
00:01:26,810 --> 00:01:32,900
for the cost function. This is a chart with the cost function on the vertical axis and, on
18
00:01:32,900 --> 00:01:34,770
the horizontal axis, y hat.
19
00:01:34,860 --> 00:01:39,200
And as you can see, the formula is y hat minus y, squared.
20
00:01:39,230 --> 00:01:42,470
This is roughly what the cost function would look like.
21
00:01:42,670 --> 00:01:47,830
And basically you'd find that the best one is over here.
22
00:01:47,950 --> 00:01:50,980
So, a very simple, very intuitive approach.
23
00:01:50,980 --> 00:01:53,200
Why not use this brute-force method?
24
00:01:53,200 --> 00:02:01,630
Why not just try out the cost for a thousand different values of the weights
25
00:02:01,690 --> 00:02:03,030
and see which one works best?
26
00:02:03,030 --> 00:02:04,230
You'll find the best one that way.
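For a single weight, the brute-force idea described above can be sketched as follows. This is only an illustration under assumptions that are not in the lecture: the neuron is taken to be `y_hat = w * x` with no activation function, the input `x = 2.0` and target `y = 3.0` are made-up values, and the search range of -5 to 5 is arbitrary.

```python
# Hypothetical single-weight neuron: y_hat = w * x (activation omitted for simplicity).
def cost(w, x=2.0, y=3.0):
    """Squared difference between the prediction y_hat and the actual value y."""
    y_hat = w * x
    return (y_hat - y) ** 2


# Try 1,000 evenly spaced candidate weights in the assumed range [-5, 5].
candidates = [-5.0 + 10.0 * i / 999 for i in range(1000)]

# Brute force: evaluate the cost for every candidate and keep the smallest.
best_w = min(candidates, key=cost)

print("best weight found:", best_w)
print("cost at that weight:", cost(best_w))
```

With these made-up numbers the exact minimum is at `w = 1.5`, so the grid search lands close to it. That is fine for one weight; the trouble starts when the number of weights grows.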
27
00:02:04,420 --> 00:02:10,270
Well, if you have just one weight to optimize, this might work. But as you increase the number of weights
28
00:02:10,480 --> 00:02:16,630
and increase the number of synapses in your network, you have to face the curse of dimensionality.
29
00:02:16,630 --> 00:02:19,370
And so what is the curse of dimensionality?
30
00:02:19,450 --> 00:02:24,510
The best way to describe this or explain it is to just look at a practical example.
31
00:02:24,640 --> 00:02:30,610
So remember this example we had when we were talking about how neural networks actually work, where we
32
00:02:30,610 --> 00:02:37,120
were building, or running, a neural network for property valuation.
33
00:02:37,120 --> 00:02:43,030
So this is what it looked like when it was already trained. Well, when it's not trained, before it's
34
00:02:43,030 --> 00:02:45,290
trained, before we know what the weights are.
35
00:02:45,550 --> 00:02:47,640
The actual neural network looks like this.
36
00:02:47,730 --> 00:02:54,860
Right, because we have all these different possible synapses and we still have to train up the weights,
37
00:02:55,280 --> 00:03:01,190
and here we have a total of 25 weights: four times five at the start, plus five more from the hidden
38
00:03:01,310 --> 00:03:03,430
layer, 25 weights total.
39
00:03:03,680 --> 00:03:09,060
And let's see how we could possibly brute-force 25 weights.
40
00:03:09,070 --> 00:03:12,610
This is a very simple neural network right here.
41
00:03:12,620 --> 00:03:20,630
Very simple, just one hidden layer. And how could we brute-force our way through a neural network of this
42
00:03:20,630 --> 00:03:21,320
size?
43
00:03:21,320 --> 00:03:24,370
Well, here are some simple mathematical calculations.
44
00:03:24,410 --> 00:03:25,890
We have 25 weights.
45
00:03:25,910 --> 00:03:30,410
So that means if we have a thousand values that we're going to try for every weight, the total
46
00:03:30,410 --> 00:03:37,790
number of combinations is 1,000 to the power of 25, or 10 to the power of 75 different combinations.
47
00:03:37,790 --> 00:03:48,260
Now let's see how Sunway TaihuLight, the world's fastest supercomputer as of June 2016, how
48
00:03:48,260 --> 00:03:49,700
would it approach this problem?
49
00:03:49,700 --> 00:03:52,390
Right, so Sunway TaihuLight.
50
00:03:52,680 --> 00:04:00,980
It looks like this. It's pretty much a whole huge building for this one supercomputer, and it got the Guinness
51
00:04:01,310 --> 00:04:04,940
World Record for being the fastest supercomputer.
52
00:04:05,210 --> 00:04:12,620
Right now it is the fastest supercomputer in the world, and Sunway TaihuLight can operate at a speed
53
00:04:12,620 --> 00:04:15,420
of 93 petaflops.
54
00:04:15,510 --> 00:04:19,900
FLOPS stands for floating-point operations per second.
55
00:04:19,970 --> 00:04:23,310
So it can do 93
56
00:04:23,340 --> 00:04:28,010
times 10 to the power of 15 floating-point operations per second.
57
00:04:28,100 --> 00:04:32,340
That's how quick it is. In comparison,
58
00:04:32,450 --> 00:04:38,210
average computers right now do something like several gigaflops or so.
59
00:04:38,210 --> 00:04:41,320
So they're kind of in those ranges,
60
00:04:41,450 --> 00:04:44,290
much less than Sunway TaihuLight.
61
00:04:44,390 --> 00:04:47,950
So Sunway TaihuLight is at the forefront of technology.
62
00:04:48,360 --> 00:04:57,920
And let's say, hypothetically, that it can do one test, one combination for our neural network,
63
00:04:58,010 --> 00:05:04,220
in one flop, in one floating-point operation. That is not possible, that is not practical, because you
64
00:05:04,220 --> 00:05:09,470
need multiple floating-point operations to test out a single weight in your neural network.
65
00:05:09,480 --> 00:05:11,270
But even so, let's give it a head start.
66
00:05:11,270 --> 00:05:17,990
Let's say that in an ideal world it can do that in one floating-point operation; it can do one
67
00:05:18,290 --> 00:05:19,900
test per floating-point operation.
68
00:05:20,120 --> 00:05:23,970
That means it will still require 10 to the power of 75
69
00:05:24,080 --> 00:05:33,080
divided by 93 times 10 to the power of 15 seconds to run all of those tests, to brute-force
70
00:05:33,080 --> 00:05:34,120
through that network.
71
00:05:34,130 --> 00:05:39,860
So that means approximately 10 to the power of 58 seconds, and that is on the order of 10 to the power
72
00:05:39,860 --> 00:05:42,120
of 50 years.
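The arithmetic above can be checked with a few lines of Python. This is just a back-of-the-envelope sketch using the numbers from the lecture (25 weights, 1,000 candidate values per weight, 93 petaflops, and the generous assumption of one test per floating-point operation). The year length of 365.25 days is an assumption added here for the conversion.

```python
# Weights in the property valuation network: 4 inputs x 5 hidden neurons,
# plus 5 connections from the hidden layer to the output.
n_weights = 4 * 5 + 5

# Brute force: 1,000 candidate values for every weight.
values_per_weight = 1000
combinations = values_per_weight ** n_weights  # exact integer, equal to 10**75

# Sunway TaihuLight: 93 petaflops = 93 * 10**15 floating-point operations per second.
flops = 93 * 10**15

# Idealized assumption from the lecture: one combination tested per operation.
seconds = combinations / flops

# Assumed year length of 365.25 days for the conversion.
seconds_per_year = 365.25 * 24 * 3600
years = seconds / seconds_per_year

print(f"weights:      {n_weights}")
print(f"combinations: {combinations:.2e}")
print(f"seconds:      {seconds:.2e}")
print(f"years:        {years:.2e}")
```

The result is roughly 1.08 times 10 to the power of 58 seconds, or about 3.4 times 10 to the power of 50 years, which matches the order of magnitude quoted in the lecture.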
73
00:05:42,170 --> 00:05:49,910
That is a huge number. It is far longer than the universe has existed, and that is definitely not going
74
00:05:49,910 --> 00:05:59,150
to work. This number is simply so huge that it's just not going to work for us at all in our optimization.
75
00:05:59,150 --> 00:06:00,020
So there we go.
76
00:06:00,140 --> 00:06:01,220
This is a no-no.
77
00:06:01,220 --> 00:06:05,450
Even on the world's fastest supercomputer, Sunway TaihuLight.
78
00:06:05,450 --> 00:06:10,140
So we have to come up with a different approach. How are we going to find the optimal weights?
79
00:06:10,310 --> 00:06:15,890
By the way, our neural network was very simple. What if the neural network looks something
80
00:06:15,890 --> 00:06:22,740
like this, or is even bigger than that? Then, yeah, it's just not going to happen at all, ever.
81
00:06:22,760 --> 00:06:28,490
So the method we're going to be looking at is called gradient descent, and you may have heard of it already.
82
00:06:28,580 --> 00:06:30,770
If not, we will find out what it is right now.
83
00:06:30,840 --> 00:06:41,780
So there's our cost function, and now we're going to see a faster way to find
84
00:06:41,840 --> 00:06:43,190
the best option.
85
00:06:43,190 --> 00:06:45,920
So let's say we start somewhere; you've got to start somewhere.
86
00:06:45,920 --> 00:06:47,390
So we start over there.
87
00:06:47,390 --> 00:06:56,990
And from that point in the top left, what we're going to do is look at the angle of our
88
00:06:56,990 --> 00:07:00,800
cost function at that point. Basically, that's what's called the gradient, because you
89
00:07:00,800 --> 00:07:02,090
have to differentiate.
90
00:07:02,150 --> 00:07:04,190
We're not going to look at the mathematical equations.
91
00:07:04,250 --> 00:07:09,370
We will provide some tips on additional reading at the end of the next lecture.
92
00:07:09,740 --> 00:07:17,150
But basically you just need to differentiate, find out what the slope is at that specific point, and find
93
00:07:17,150 --> 00:07:19,330
out if the slope is positive or negative.
94
00:07:19,450 --> 00:07:25,640
If the slope is negative, like in this case, it means that you're going downhill, so to the right is
95
00:07:25,640 --> 00:07:27,350
downhill and to the left is uphill.
96
00:07:27,350 --> 00:07:29,780
And from there it means you need to go right.
97
00:07:29,780 --> 00:07:31,510
Basically you need to go downhill.
98
00:07:31,670 --> 00:07:33,070
And that's what we're going to do.
99
00:07:33,090 --> 00:07:35,510
Boom, it takes a step to the right.
100
00:07:35,510 --> 00:07:37,450
The ball rolls down again.
101
00:07:37,460 --> 00:07:38,300
Same thing.
102
00:07:38,390 --> 00:07:44,120
You calculate the slope, and this time the slope is positive, meaning right is uphill and left is downhill, and you need
103
00:07:44,120 --> 00:07:46,560
to go left, and you roll the ball down.
104
00:07:46,790 --> 00:07:54,900
And again you calculate the slope and you roll the ball. There you go. So that's how you find,
105
00:07:55,040 --> 00:08:04,520
in simple terms, the best weights, the best situation that minimizes your cost function.
106
00:08:04,590 --> 00:08:08,970
Of course, it's not going to be like a ball rolling; it's going to be a very zigzag type of approach, but
107
00:08:09,210 --> 00:08:14,970
it's easier to remember, and kind of more fun, to look at it as a ball rolling.
108
00:08:14,970 --> 00:08:19,980
But in reality, yes, it's going to be a step-by-step approach; it's going to be a zigzag type
109
00:08:19,980 --> 00:08:21,920
of method.
110
00:08:22,050 --> 00:08:25,020
Yeah, and also there are lots of other elements to it.
111
00:08:25,050 --> 00:08:35,190
There are things like, for instance, why does it go down? Why does it not go way over the line? So
112
00:08:35,190 --> 00:08:40,740
it could have jumped out of this and gone upwards instead of downwards, and things like that. So there are
113
00:08:40,740 --> 00:08:41,950
parameters that you can tweak.
114
00:08:41,970 --> 00:08:45,570
And again we will mention where you can find out more on that.
115
00:08:45,580 --> 00:08:51,090
Plus, we'll see this in the practical application. But in the simplest, intuitive terms, this is what
116
00:08:51,090 --> 00:08:51,770
is happening.
117
00:08:51,780 --> 00:08:56,670
We are getting to the bottom by just understanding which way we need to go.
118
00:08:56,700 --> 00:09:01,890
Instead of brute-forcing through thousands and thousands and millions and billions and quadrillions
119
00:09:01,890 --> 00:09:02,920
of combinations.
120
00:09:03,030 --> 00:09:09,920
We can simply have a look, every time, at which way it is sloping. So, like,
121
00:09:09,910 --> 00:09:11,690
imagine you're standing on a hill.
122
00:09:11,700 --> 00:09:15,870
Which way does it feel like it's going downwards? Whichever way it is going down, you just keep
123
00:09:15,870 --> 00:09:20,760
walking that way. You take, like, 50 steps that way, and then you assess again: OK, which way is it going downwards?
124
00:09:21,090 --> 00:09:21,470
This way.
125
00:09:21,500 --> 00:09:24,620
OK, now I'll take 50 steps, or fewer, say 40 steps, that way.
126
00:09:24,690 --> 00:09:28,160
So it gets less and less and less as you get closer.
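The downhill walk described above can be sketched in code for a single weight. This is a minimal illustration, not the course's implementation: the neuron `y_hat = w * x`, the values `x = 2.0` and `y = 3.0`, the starting weight, and the step-size parameter (usually called the learning rate, one of the tweakable parameters mentioned earlier) are all assumptions made for this example.

```python
# Hypothetical single-weight setup: y_hat = w * x, cost = (y_hat - y) ** 2.
def cost(w, x=2.0, y=3.0):
    return (w * x - y) ** 2


def slope(w, x=2.0, y=3.0):
    """Derivative of the cost with respect to w, obtained by differentiating."""
    return 2.0 * x * (w * x - y)


def gradient_descent(w_start, learning_rate=0.05, n_steps=50):
    """Repeatedly step against the slope; returns every weight visited."""
    w = w_start
    history = [w]
    for _ in range(n_steps):
        # Negative slope -> move right; positive slope -> move left.
        w = w - learning_rate * slope(w)
        history.append(w)
    return history


history = gradient_descent(w_start=-4.0)
step_sizes = [abs(b - a) for a, b in zip(history, history[1:])]

print("final weight:", history[-1])
print("first three step sizes:", step_sizes[:3])
```

Because each step is proportional to the slope, the steps shrink on their own as the weight approaches the minimum at `w = 1.5`. If the learning rate is set too high for this cost function (for example 0.3 here), each step jumps across the minimum and lands further away, which is the "gone upwards instead of downwards" situation.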
127
00:09:28,530 --> 00:09:32,720
So here's an example of gradient descent applied in a two-dimensional space.
128
00:09:32,720 --> 00:09:36,450
So that was a one-dimensional example.
129
00:09:36,570 --> 00:09:41,880
Here we have a two-dimensional space for the gradient descent. As you can see, it's getting closer to
130
00:09:41,970 --> 00:09:48,450
the minimum, and that's also why it's called gradient descent: because you're descending into the minimum of the
131
00:09:48,480 --> 00:09:53,430
cost function. And finally, here's gradient descent applied in three dimensions.
132
00:09:53,430 --> 00:09:58,740
This is what it looks like. If you project it onto two dimensions, you can see it zigzagging its way into
133
00:09:58,740 --> 00:09:59,600
the minimum.
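The zigzag can be reproduced with two weights. The sketch below is an assumption-laden toy, not the surface shown in the lecture: the cost `w1**2 + 10 * w2**2` is a made-up bowl that is much steeper along `w2` than along `w1`, and the starting point and learning rate are chosen only so that the zigzag is visible.

```python
# Made-up cost surface over two weights: a bowl that is steeper along w2.
def cost(w1, w2):
    return w1 ** 2 + 10.0 * w2 ** 2


def gradient(w1, w2):
    """Partial derivatives of the cost with respect to w1 and w2."""
    return 2.0 * w1, 20.0 * w2


def descend(w1, w2, learning_rate=0.09, n_steps=100):
    """Plain gradient descent in two dimensions; returns the path taken."""
    path = [(w1, w2)]
    for _ in range(n_steps):
        g1, g2 = gradient(w1, w2)
        w1 -= learning_rate * g1
        w2 -= learning_rate * g2
        path.append((w1, w2))
    return path


path = descend(4.0, 1.0)

# The w2 coordinate flips sign on every step (the zigzag),
# while w1 slides steadily towards zero.
for w1, w2 in path[:5]:
    print(f"w1 = {w1:.3f}, w2 = {w2:.3f}, cost = {cost(w1, w2):.3f}")
print("final point:", path[-1])
```

Plotting the `(w1, w2)` pairs in `path` would show the projected, two-dimensional view: the point bounces from one side of the valley to the other while closing in on the minimum at the origin.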
134
00:09:59,700 --> 00:10:03,810
So there you go, that was gradient descent. In the next tutorial we'll talk about stochastic
135
00:10:03,810 --> 00:10:06,850
gradient descent, which is really a continuation of this tutorial.
136
00:10:07,020 --> 00:10:08,720
And I look forward to seeing you there.
137
00:10:08,740 --> 00:10:10,610
And until next time, enjoy deep learning.