Hello and welcome back to the course on deep learning. This is an additional tutorial to talk about the softmax and cross-entropy functions. It is not 100 percent necessary for you to go through all of the parts we've covered in the main part of this section, where we talk about convolutional neural networks, but at the same time I thought it would be a good addition to your bag of knowledge and skill set. So let's go ahead and dig into these functions.
To start off with, what we have here is the conclusion of a neural network that we built in the main part of the section. At the end it pops out some probabilities: 0.95 for dog and 0.05, or 5 percent, for cat, given the photo on the left as an input. This is after training has been conducted; the network is running and classifying a certain image. And so the question here is: how come these two values add up to one? Because, as far as we know from everything we learned about artificial neural networks, there is nothing to say that these two final neurons are connected to each other. So how would each of them know what the value of the other one is, and how would they know to make their values add up to one? Well, the answer is they wouldn't, in the classic version of an artificial neural network. The only way that they do is because we introduce a special function called the softmax function to help us out of this situation.
So normally what would happen is that the dog and cat neurons could take on any kind of real values; they don't have to add up to one. But then we apply the softmax function, which is written up at the top, and that brings these values to be between 0 and 1 and makes them add up to 1.
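Just so we have it on the page, the formula being described here is the standard softmax, where z is the vector of raw values coming out of the final layer and K is the number of classes:

```latex
f(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \qquad j = 1, \dots, K
```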
The softmax function, or the normalized exponential function, is a generalization of the logistic function that quote-unquote "squashes" a K-dimensional vector of arbitrary real values to a K-dimensional vector of real values in the range of zero to one that add up to 1. So basically it does exactly what we want: it brings these values to be between 0 and 1 and makes sure that they add up to 1. And the way this is possible is that at the bottom you can see there is a summation: it takes the exponential of each z value, adds those exponentials up across all of your classes, and divides by that sum. That's your normalization happening right there.
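As a minimal sketch of that normalization step, assuming plain NumPy (the function name and the example numbers are just for illustration, they are not from the course code):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; it cancels out in the ratio.
    exps = np.exp(z - np.max(z))
    # Divide each exponential by the sum over all classes -- the normalization step.
    return exps / np.sum(exps)

# Hypothetical raw values for the two final neurons (dog, cat)
raw_outputs = np.array([2.0, -0.9])
probs = softmax(raw_outputs)
print(probs)        # roughly [0.95, 0.05]
print(probs.sum())  # 1.0 -- the two values now add up to one
```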
So that's how the softmax function works, and it makes sense to introduce the softmax function into convolutional neural networks, because how strange would it be if you had two possible classes, a dog and a cat, and for the dog class you had a probability of 80 percent while for the cat class you had 45 percent? It just doesn't make sense like that, and therefore it's much better when you introduce the softmax function, and that's what you will find happening most of the time in convolutional neural networks.
Now, the other thing is that the softmax function comes hand-in-hand with something called the cross-entropy function, and it's a very handy thing for us. So let's first look at the formula: this is what the cross-entropy function looks like. We're actually going to be using a different calculation, this other representation of the cross-entropy, but the results are basically the same; this one is just easier to calculate.
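To have both versions written down, the two formulas being referred to here are, most likely, the general cross-entropy between a true distribution p and a predicted distribution q, and the simpler per-example form used in practice, where y_i is the one-hot label for class i and ŷ_i is the predicted probability:

```latex
H(p, q) = -\sum_{x} p(x)\,\log q(x)
\qquad\qquad
L = -\sum_{i} y_i \log \hat{y}_i
```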
I know this might sound very unrelated to anything right now, just formulas on your screen, but there will be some additional recommended reading at the end of this section, so don't worry if you're not picking up on the math, even though we haven't explained it in detail yet.
The point here is: what is the cross-entropy function? Remember how previously, in artificial neural networks, we had a function called the mean squared error function, which we used as the cost function for assessing our network's performance, and our goal was to minimize the MSE in order to optimize the network's performance. Well, that was our cost function there, and in convolutional neural networks we can still use the MSE, but a better option, after you apply the softmax function, turns out to be the cross-entropy function. And in convolutional neural networks, when you apply the cross-entropy function, it's not called the cost function anymore; it's called the loss function. They are very similar; there are just some small terminological differences and slight differences in what they mean, but for all purposes it's pretty much the same thing. And what happens is that the loss function is, again, something that we want to minimize in order to maximize the performance of our network.
So let's have a look at a quick example of how this function can be applied. Let's say we put an image of a dog into our network. The predicted value for dog is 0.9, and this is during training, so we know the label: it is a dog. So the predicted value for dog is 0.9, the predicted value for cat is 0.1, and on the right we have the label, which we know because this is training: 1 for dog, 0 for cat. And so in this case you need to plug these numbers into your formula for the cross-entropy. The way you do it is: the values on the left, the predictions, go into the variable q, the one that is under the logarithm on the right-hand side, and the values on the right, the labels, go into p. It's important to remember which one goes where, because if you get them mixed up you can end up taking the logarithm of a zero value. So you just want to plug them in, make sure you plug them into the correct places, and then you basically add that up.
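As a quick sanity check of that plugging-in step, here is a small sketch with those exact numbers (natural logarithm; the variable names are mine, not from the slide):

```python
import numpy as np

q = np.array([0.9, 0.1])  # predicted probabilities for (dog, cat) -- these go under the log
p = np.array([1.0, 0.0])  # one-hot label -- the image really is a dog

# Cross-entropy: -sum over classes of p * log(q); only the true class contributes.
loss = -np.sum(p * np.log(q))
print(loss)  # about 0.105
```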
So that's how the cross-entropy works. Right now we're going to look at a specific, step-by-step example of applying this function, and it'll make more sense what cross-entropy is. My goal in this tutorial is to make you more comfortable with cross-entropy, because it can sound very convoluted, no pun intended. Like convolutional neural networks, it can sound very complex and scary, but it's not; that's the point. So let's go ahead and apply it, just so we know that it's not scary.
So here's our setup, and this will also explain why we're looking into different cost functions. Let's say we have two neural networks, and we pass in an image of a dog, and we know that this is a dog and not a cat. Then we have another image, a cat this time, and it's a cat, not a dog. And here we have a third image, which is in fact a dog, not a cat, if you look very closely. So we want to see what our neural networks will predict. In the first case, neural network 1 says 90 percent dog, 10 percent cat: correct. Neural network 2 says 60 percent dog, 40 percent cat: still correct, worse, but correct. For the second image, the first neural network says 10 percent dog, 90 percent cat: correct. Neural network 2 says 30 percent dog, 70 percent cat: worse, but still correct. And then finally, for the third image, neural network 1 says 40 percent dog, 60 percent cat: incorrect. Neural network 2 says 10 percent dog and 90 percent cat: incorrect, and worse. So the key here is that even though both networks got the last one wrong, across all three images neural network 1 was outperforming neural network 2. Even in the last case it gave dog a 40 percent chance, as opposed to neural network 2 only giving dog a 10 percent chance. So neural network 1 is outperforming neural network 2 across the board.
And so now we're going to look at the functions that can measure this performance, which we've already touched on. Let's put these results into a table. For neural network 1 you have the row number, which is the image number, and then for image one you have what it predicted: 90 percent dog, 10 percent cat. Those are the predicted values, the y-hats, and then you have the actual values: dog correct, cat incorrect, so 1 and 0. Same thing for image number two, the same for image number three, and the same for neural network number 2: dog 60 percent, cat 40 percent for the first image, that's what it predicted, whereas the ground truth was dog, not cat, and so on.
And so now let's see what errors we can actually calculate to estimate and monitor the performance of our networks. One type of error is called the classification error, and that is basically just asking: did you get it right or not, regardless of the probabilities. In this case both neural networks got one out of three wrong, so that's a 33 percent error rate for neural network 1 and a 33 percent error rate for neural network 2. From this standpoint both neural networks perform at the same level, but we know that's not true; we know that neural network 1 is outperforming neural network 2. That's why the classification error is not a good measure, especially for the purposes of backpropagation.
The mean squared error is different. By the way, I did these calculations in Excel; I just didn't want to bore you with them, but you can totally sit down and do them on paper or in Excel. They are very straightforward: you basically take the sum of squared errors and then take the average across your observations, and that's pretty much it. So for neural network 1 you get 25 percent, and for neural network 2 you get 71 percent. As you can see, this measure is more informative: it's telling us that neural network 1 has a much lower error rate than neural network 2.
And then cross-entropy: again, we've seen the formula, and you can also calculate this; it's actually even easier to calculate than the mean squared error. Cross-entropy gives you 0.38 for neural network 1 and 1.06 for neural network 2, so you can see the results are a bit different.
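If you'd like to reproduce those numbers without Excel, here is a sketch, assuming natural logarithms and averaging over the three images (the array layout is my own; it simply mirrors the table described above):

```python
import numpy as np

# Rows = images, columns = (dog, cat) probabilities.
labels = np.array([[1., 0.], [0., 1.], [1., 0.]])        # dog, cat, dog
nn1    = np.array([[0.9, 0.1], [0.1, 0.9], [0.4, 0.6]])
nn2    = np.array([[0.6, 0.4], [0.3, 0.7], [0.1, 0.9]])

def classification_error(pred, y):
    # Fraction of images where the highest-probability class is the wrong one.
    return np.mean(np.argmax(pred, axis=1) != np.argmax(y, axis=1))

def mean_squared_error(pred, y):
    # Sum of squared errors per image, averaged over the images.
    return np.mean(np.sum((pred - y) ** 2, axis=1))

def cross_entropy(pred, y):
    # -sum(label * log(prediction)) per image, averaged over the images.
    return np.mean(-np.sum(y * np.log(pred), axis=1))

for name, pred in [("NN1", nn1), ("NN2", nn2)]:
    print(name,
          round(classification_error(pred, labels), 2),  # 0.33 for both
          round(mean_squared_error(pred, labels), 2),    # ~0.25 vs ~0.71
          round(cross_entropy(pred, labels), 2))         # ~0.38 vs ~1.06
```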
When you look at them like that, at the mean squared error and the cross-entropy, the question of why you would use cross-entropy over mean squared error isn't just about the numbers they produce. All of these calculations were just to show you that this is all doable; you can do it on paper. It is not very intense mathematics; these are pretty simple, straightforward things. But the question of why you would use cross-entropy over mean squared error is a very good question to ask, and I'm glad you asked. The answer is that there are several advantages of cross-entropy over mean squared error which are not obvious, so I'll mention a couple, and then I'll let you know where you can find out more.
One of them is that if, for instance, at the very start of your backpropagation, your output value is very, very tiny, much smaller than the actual value that you want, then at the very start the gradient in your gradient descent will be very low, and it will be very hard for the neural network to actually start doing something, start adjusting those weights, and start moving in the right direction. Whereas when you use something like cross-entropy, because it's got that logarithm in it, it actually helps the network assess even a small error like that and do something about it.
Here's how to think about it. Again, this is a very intuitive approach; there's going to be a link to the mathematics, and you can derive these things in more detail there, but here is the intuition. Let's say the outcome that you want is 1, and right now you are at one millionth, 0.000001. Then next time you improve your outcome from one millionth to one thousandth. If you calculate the squared error, you're just subtracting one from the other and squaring it in each case, and you'll see that the squared error, when you compare one case to the other, didn't change that much; you didn't improve your network by much when you're looking at the mean squared error. But if you're looking at the cross-entropy, because you're taking a logarithm and effectively comparing the two values by dividing one by the other, you will see that you have actually improved your network significantly. That jump from one millionth to one thousandth, in mean squared error terms, will be very small; it will be insignificant, and it won't guide your gradient descent process, your backpropagation, in the right direction. Well, it will guide it in the right direction, but it will be very slow guidance; it won't have enough power. Whereas if you use cross-entropy, cross-entropy will understand that even though these are very small adjustments, making only a tiny change in absolute terms, in relative terms it's a huge improvement, and we are definitely going in the right direction.
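Here is that intuition with made-up numbers: the target is 1.0 and the predicted probability for the true class improves from one millionth to one thousandth:

```python
import numpy as np

target = 1.0
before, after = 1e-6, 1e-3  # predicted probability for the true class

# The squared error barely moves...
print((target - before) ** 2, (target - after) ** 2)  # ~0.999998 vs ~0.998001

# ...while the cross-entropy term -log(prediction) drops by half.
print(-np.log(before), -np.log(after))                # ~13.8 vs ~6.9
```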
So cross-entropy will help your neural network get to an optimal state; it's a better way for the network to get there. But bear in mind that cross-entropy is the preferred method only for classification. If you're talking about things like regression, which we had in artificial neural networks, then you would rather go with the mean squared error, whereas cross-entropy is better for classification, and again that has to do with the fact that we're using the softmax function. So that's a kind of intuitive explanation. A good place to learn more, if you're really interested in why we use cross-entropy versus mean squared error, is to Google a video by Geoffrey Hinton called "The softmax output function". He explains it very well, and, being the godfather of deep learning, who could explain it better? And by the way, any video by Geoffrey Hinton is golden; he's just got a huge talent for explaining things.
So that's softmax versus cross-entropy, and I hope that gives you an intuitive understanding of what's going on here, but more importantly that you're not put off by the term cross-entropy, because Hadelin will mention it in the practical tutorials and I wanted to make sure that you're prepared for that. It's just another way of calculating your loss function, and another way of optimizing your network, which is specifically tailored to classification problems, and therefore to convolutional neural networks, and comes hand-in-hand with the softmax function.
So, additional reading: if you'd like a light introduction to cross-entropy, if you're interested in this concept a bit more, a good article to check out is called "A Friendly Introduction to Cross-Entropy Loss" by Rob DiPietro (2016); the link is below. It's very gentle, no super complex math, with good analogies and good examples: it uses analogies of cars and talks about information, bits, and how you would encode and decode things. So it's a good article to have a look at, and it'll give you a good overview of cross-entropy from an introductory standpoint.
If you want to dig into the heavy math, like what you see here, then check out a blog post called "How to Implement a Neural Network: Intermezzo 2". An intermezzo is an intermediary thing, like when you go to the theatre and you have a break between the first part and the second part; the author is going through all of these steps and then says, "I have to explain this first", and so that's why it's called an intermezzo, no other reason as far as I understand. The article is by Peter Roelants, also from 2016, so both are quite recent. Check this one out if you'd like to dig into the mathematics behind cross-entropy, and behind softmax as well, which is actually covered in this article.
So there we go; that's all there is to these two functions. Hopefully I was able to add some additional clarity. Good luck with the practical tutorials; they're going to be fun, so enjoy them. I'll see you next time, and until then, enjoy deep learning.