Hello and welcome back to the course on deep learning. This is an additional tutorial to talk about the softmax and cross-entropy functions. It is not 100 percent necessary for you to go through all of the parts we've covered in the main part of this section, where we talk about convolutional neural networks, but at the same time I thought it would be a good addition to your bag of knowledge and skill set. So let's go ahead and dig into these functions.
To start off with, what we have here is the conclusion of a neural network that we built in the main part of the section. At the end it pops out some probabilities: 0.95 for dog and 0.05, or 5 percent, for cat, given the photo on the left as an input. This is after training has been conducted; the network is running and classifying a certain image. And so the question here is: how come these two values add up to one? Because, as far as we know from everything we learned about artificial neural networks, there is nothing to say that these two final neurons are connected to each other. So how would each of them know what the value of the other one is, and how would they know to make their values add up to one? Well, the answer is they wouldn't, in the classic version of an artificial neural network. The only way that they do is because we introduce a special function called the softmax function to help us out of this situation.
So normally what would happen is that the dog and cat neurons could take on any kind of real values; they don't have to add up to one. But then we apply the softmax function, which is written up at the top, and that brings these values to be between 0 and 1 and makes them add up to 1.
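Just so we have it on the page, the formula being described here is the standard softmax, where z is the vector of raw values coming out of the final layer and K is the number of classes:

```latex
f(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \qquad j = 1, \dots, K
```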
The softmax function, or the normalized exponential function, is a generalization of the logistic function that quote-unquote "squashes" a K-dimensional vector of arbitrary real values to a K-dimensional vector of real values in the range of zero to one that add up to 1. So basically it does exactly what we want: it brings these values to be between 0 and 1 and makes sure that they add up to 1. And the way this is possible is that at the bottom you can see there is a summation: it takes the exponential of each z value, adds those exponentials up across all of your classes, and divides by that sum. That's your normalization happening right there.
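As a minimal sketch of that normalization step, assuming plain NumPy (the function name and the example numbers are just for illustration, they are not from the course code):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; it cancels out in the ratio.
    exps = np.exp(z - np.max(z))
    # Divide each exponential by the sum over all classes -- the normalization step.
    return exps / np.sum(exps)

# Hypothetical raw values for the two final neurons (dog, cat)
raw_outputs = np.array([2.0, -0.9])
probs = softmax(raw_outputs)
print(probs)        # roughly [0.95, 0.05]
print(probs.sum())  # 1.0 -- the two values now add up to one
```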
So that's how the softmax function works, and it makes sense to introduce the softmax function into convolutional neural networks, because how strange would it be if you had two possible classes, a dog and a cat, and for the dog class you had a probability of 80 percent while for the cat class you had 45 percent? It just doesn't make sense like that, and therefore it's much better when you introduce the softmax function, and that's what you will find happening most of the time in convolutional neural networks.
Now, the other thing is that the softmax function comes hand-in-hand with something called the cross-entropy function, and it's a very handy thing for us. So let's first look at the formula: this is what the cross-entropy function looks like. We're actually going to be using a different calculation, this other representation of the cross-entropy, but the results are basically the same; this one is just easier to calculate.
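To have both versions written down, the two formulas being referred to here are, most likely, the general cross-entropy between a true distribution p and a predicted distribution q, and the simpler per-example form used in practice, where y_i is the one-hot label for class i and ŷ_i is the predicted probability:

```latex
H(p, q) = -\sum_{x} p(x)\,\log q(x)
\qquad\qquad
L = -\sum_{i} y_i \log \hat{y}_i
```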
I know this might sound very unrelated to anything right now, just formulas on your screen, but there will be some additional recommended reading at the end of this section, so don't worry if you're not picking up on the math, even though we haven't explained it in detail yet.
The point here is: what is the cross-entropy function? Remember how previously, in artificial neural networks, we had a function called the mean squared error function, which we used as the cost function for assessing our network's performance, and our goal was to minimize the MSE in order to optimize the network's performance. Well, that was our cost function there, and in convolutional neural networks we can still use the MSE, but a better option, after you apply the softmax function, turns out to be the cross-entropy function. And in convolutional neural networks, when you apply the cross-entropy function, it's not called the cost function anymore; it's called the loss function. They are very similar; there are just some small terminological differences and slight differences in what they mean, but for all purposes it's pretty much the same thing. And what happens is that the loss function is, again, something that we want to minimize in order to maximize the performance of our network.
So let's have a look at a quick example of how this function can be applied. Let's say we put an image of a dog into our network. The predicted value for dog is 0.9, and this is during training, so we know the label: it is a dog. So the predicted value for dog is 0.9, the predicted value for cat is 0.1, and on the right we have the label, which we know because this is training: 1 for dog, 0 for cat. And so in this case you need to plug these numbers into your formula for the cross-entropy. The way you do it is: the values on the left, the predictions, go into the variable q, the one that is under the logarithm on the right-hand side, and the values on the right, the labels, go into p. It's important to remember which one goes where, because if you get them mixed up you can end up taking the logarithm of a zero value. So you just want to plug them in, make sure you plug them into the correct places, and then you basically add that up.
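As a quick sanity check of that plugging-in step, here is a small sketch with those exact numbers (natural logarithm; the variable names are mine, not from the slide):

```python
import numpy as np

q = np.array([0.9, 0.1])  # predicted probabilities for (dog, cat) -- these go under the log
p = np.array([1.0, 0.0])  # one-hot label -- the image really is a dog

# Cross-entropy: -sum over classes of p * log(q); only the true class contributes.
loss = -np.sum(p * np.log(q))
print(loss)  # about 0.105
```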
So that's how the cross-entropy works. Right now we're going to look at a specific, step-by-step example of applying this function, and it'll make more sense what cross-entropy is. My goal in this tutorial is to make you more comfortable with cross-entropy, because it can sound very convoluted, no pun intended. Like convolutional neural networks, it can sound very complex and scary, but it's not; that's the point. So let's go ahead and apply it, just so we know that it's not scary.
So here's our setup, and this will also explain why we're looking into different cost functions. Let's say we have two neural networks, and we pass in an image of a dog, and we know that this is a dog and not a cat. Then we have another image, a cat this time, and it's a cat, not a dog. And here we have a third image, which is in fact a dog, not a cat, if you look very closely. So we want to see what our neural networks will predict. In the first case, neural network 1 says 90 percent dog, 10 percent cat: correct. Neural network 2 says 60 percent dog, 40 percent cat: still correct, worse, but correct. For the second image, the first neural network says 10 percent dog, 90 percent cat: correct. Neural network 2 says 30 percent dog, 70 percent cat: worse, but still correct. And then finally, for the third image, neural network 1 says 40 percent dog, 60 percent cat: incorrect. Neural network 2 says 10 percent dog and 90 percent cat: incorrect, and worse. So the key here is that even though both networks got the last one wrong, across all three images neural network 1 was outperforming neural network 2. Even in the last case it gave dog a 40 percent chance, as opposed to neural network 2 only giving dog a 10 percent chance. So neural network 1 is outperforming neural network 2 across the board.
And so now we're going to look at the functions that can measure this performance, which we've already touched on. Let's put these results into a table. For neural network 1 you have the row number, which is the image number, and then for image one you have what it predicted: 90 percent dog, 10 percent cat. Those are the predicted values, the y-hats, and then you have the actual values: dog correct, cat incorrect, so 1 and 0. Same thing for image number two, the same for image number three, and the same for neural network number 2: dog 60 percent, cat 40 percent for the first image, that's what it predicted, whereas the ground truth was dog, not cat, and so on.
And so now let's see what errors we can actually calculate to estimate and monitor the performance of our networks. One type of error is called the classification error, and that is basically just asking: did you get it right or not, regardless of the probabilities. In this case both neural networks got one out of three wrong, so that's a 33 percent error rate for neural network 1 and a 33 percent error rate for neural network 2. From this standpoint both neural networks perform at the same level, but we know that's not true; we know that neural network 1 is outperforming neural network 2. That's why the classification error is not a good measure, especially for the purposes of backpropagation.
The mean squared error is different. By the way, I did these calculations in Excel; I just didn't want to bore you with them, but you can totally sit down and do them on paper or in Excel. They are very straightforward: you basically take the sum of squared errors and then take the average across your observations, and that's pretty much it. So for neural network 1 you get 25 percent, and for neural network 2 you get 71 percent. As you can see, this measure is more informative: it's telling us that neural network 1 has a much lower error rate than neural network 2.
And then cross-entropy: again, we've seen the formula, and you can also calculate this; it's actually even easier to calculate than the mean squared error. Cross-entropy gives you 0.38 for neural network 1 and 1.06 for neural network 2, so you can see the results are a bit different.
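If you'd like to reproduce those numbers without Excel, here is a sketch, assuming natural logarithms and averaging over the three images (the array layout is my own; it simply mirrors the table described above):

```python
import numpy as np

# Rows = images, columns = (dog, cat) probabilities.
labels = np.array([[1., 0.], [0., 1.], [1., 0.]])        # dog, cat, dog
nn1    = np.array([[0.9, 0.1], [0.1, 0.9], [0.4, 0.6]])
nn2    = np.array([[0.6, 0.4], [0.3, 0.7], [0.1, 0.9]])

def classification_error(pred, y):
    # Fraction of images where the highest-probability class is the wrong one.
    return np.mean(np.argmax(pred, axis=1) != np.argmax(y, axis=1))

def mean_squared_error(pred, y):
    # Sum of squared errors per image, averaged over the images.
    return np.mean(np.sum((pred - y) ** 2, axis=1))

def cross_entropy(pred, y):
    # -sum(label * log(prediction)) per image, averaged over the images.
    return np.mean(-np.sum(y * np.log(pred), axis=1))

for name, pred in [("NN1", nn1), ("NN2", nn2)]:
    print(name,
          round(classification_error(pred, labels), 2),  # 0.33 for both
          round(mean_squared_error(pred, labels), 2),    # ~0.25 vs ~0.71
          round(cross_entropy(pred, labels), 2))         # ~0.38 vs ~1.06
```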
When you look at them like that, at the mean squared error and the cross-entropy, the question of why you would use cross-entropy over mean squared error isn't just about the numbers they produce. All of these calculations were just to show you that this is all doable; you can do it on paper. It is not very intense mathematics; these are pretty simple, straightforward things. But the question of why you would use cross-entropy over mean squared error is a very good question to ask, and I'm glad you asked. The answer is that there are several advantages of cross-entropy over mean squared error which are not obvious, so I'll mention a couple, and then I'll let you know where you can find out more.
One of them is that if, for instance, at the very start of your backpropagation, your output value is very, very tiny, much smaller than the actual value that you want, then at the very start the gradient in your gradient descent will be very low, and it will be very hard for the neural network to actually start doing something, start adjusting those weights, and start moving in the right direction. Whereas when you use something like cross-entropy, because it's got that logarithm in it, it actually helps the network assess even a small error like that and do something about it.
Here's how to think about it. Again, this is a very intuitive approach; there's going to be a link to the mathematics, and you can derive these things in more detail there, but here is the intuition. Let's say the outcome that you want is 1, and right now you are at one millionth, 0.000001. Then next time you improve your outcome from one millionth to one thousandth. If you calculate the squared error, you're just subtracting one from the other and squaring it in each case, and you'll see that the squared error, when you compare one case to the other, didn't change that much; you didn't improve your network by much when you're looking at the mean squared error. But if you're looking at the cross-entropy, because you're taking a logarithm and effectively comparing the two values by dividing one by the other, you will see that you have actually improved your network significantly. That jump from one millionth to one thousandth, in mean squared error terms, will be very small; it will be insignificant, and it won't guide your gradient descent process, your backpropagation, in the right direction. Well, it will guide it in the right direction, but it will be very slow guidance; it won't have enough power. Whereas if you use cross-entropy, cross-entropy will understand that even though these are very small adjustments, making only a tiny change in absolute terms, in relative terms it's a huge improvement, and we are definitely going in the right direction.
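Here is that intuition with made-up numbers: the target is 1.0 and the predicted probability for the true class improves from one millionth to one thousandth:

```python
import numpy as np

target = 1.0
before, after = 1e-6, 1e-3  # predicted probability for the true class

# The squared error barely moves...
print((target - before) ** 2, (target - after) ** 2)  # ~0.999998 vs ~0.998001

# ...while the cross-entropy term -log(prediction) drops by half.
print(-np.log(before), -np.log(after))                # ~13.8 vs ~6.9
```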
So cross-entropy will help your neural network get to an optimal state; it's a better way for the network to get there. But bear in mind that cross-entropy is the preferred method only for classification. If you're talking about things like regression, which we had in artificial neural networks, then you would rather go with the mean squared error, whereas cross-entropy is better for classification, and again that has to do with the fact that we're using the softmax function. So that's a kind of intuitive explanation. A good place to learn more, if you're really interested in why we use cross-entropy versus mean squared error, is to Google a video by Geoffrey Hinton called "The softmax output function". He explains it very well, and, being the godfather of deep learning, who could explain it better? And by the way, any video by Geoffrey Hinton is golden; he's just got a huge talent for explaining things.
So that's softmax versus cross-entropy, and I hope that gives you an intuitive understanding of what's going on here, but more importantly that you're not put off by the term cross-entropy, because Hadelin will mention it in the practical tutorials and I wanted to make sure that you're prepared for that. It's just another way of calculating your loss function, and another way of optimizing your network, which is specifically tailored to classification problems, and therefore to convolutional neural networks, and comes hand-in-hand with the softmax function.
So, additional reading: if you'd like a light introduction to cross-entropy, if you're interested in this concept a bit more, a good article to check out is called "A Friendly Introduction to Cross-Entropy Loss" by Rob DiPietro (2016); the link is below. It's very gentle, no super complex math, with good analogies and good examples: it uses analogies of cars and talks about information, bits, and how you would encode and decode things. So it's a good article to have a look at, and it'll give you a good overview of cross-entropy from an introductory standpoint.
If you want to dig into the heavy math, like what you see here, then check out a blog post called "How to Implement a Neural Network: Intermezzo 2". An intermezzo is an intermediary thing, like when you go to the theatre and you have a break between the first part and the second part; the author is going through all of these steps and then says, "I have to explain this first", and so that's why it's called an intermezzo, no other reason as far as I understand. The article is by Peter Roelants, also from 2016, so both are quite recent. Check this one out if you'd like to dig into the mathematics behind cross-entropy, and behind softmax as well, which is actually covered in this article.
So there we go; that's all there is to these two functions. Hopefully I was able to add some additional clarity. Good luck with the practical tutorials; they're going to be fun, so enjoy them. I'll see you next time, and until then, enjoy deep learning.