subtitlecat.com

All language subtitles for 11. Backpropagation

Afrikaans

Albanian

Amharic

Arabic Download

Armenian

Azerbaijani

Basque

Belarusian

Bengali

Bosnian

Bulgarian

Catalan

Cebuano

Chichewa

Chinese (Simplified)

Chinese (Traditional)

Corsican

Croatian

Czech

Danish

Dutch

English

Esperanto

Estonian

Filipino

Finnish

French

Frisian

Galician

Georgian

German

Greek

Gujarati

Haitian Creole

Hausa

Hawaiian

Hebrew

Hindi

Hmong

Hungarian

Icelandic

Igbo

Indonesian

Irish

Italian

Japanese

Javanese

Kannada

Kazakh

Khmer

Korean

Kurdish (Kurmanji)

Kyrgyz

Lao

Latin

Latvian

Lithuanian

Luxembourgish

Macedonian

Malagasy

Malay

Malayalam

Maltese

Maori

Marathi

Mongolian

Myanmar (Burmese)

Nepali

Norwegian

Pashto

Persian

Polish

Portuguese

Punjabi

Romanian

Russian

Samoan

Scots Gaelic

Serbian

Sesotho

Shona

Sindhi

Sinhala

Slovak

Slovenian

Somali

Spanish

Sundanese

Swahili

Swedish

Tajik

Tamil

Telugu

Thai

Turkish

Ukrainian

Urdu

Uzbek

Vietnamese

Welsh

Xhosa

Yiddish

Yoruba

Zulu

Odia (Oriya)

Kinyarwanda

Turkmen

Tatar

Uyghur

Would you like to inspect the original subtitles? These are the user uploaded subtitles that are being translated: 1 00:00:05,340 --> 00:00:11,390 Welcome back everyone to this lecture on back propagation the last theory topic we're going to cover 2 00:00:11,540 --> 00:00:16,670 is back propagation and we're going to start by trying to build an intuition behind back propagation 3 00:00:17,060 --> 00:00:20,810 and then we'll dive into the calculus and notation of back propagation. 4 00:00:20,810 --> 00:00:26,870 I want to point out that back propagation is probably the hardest part of the entire theoretical deep 5 00:00:26,870 --> 00:00:32,360 learning process because of the calculus and the notation involved that calculus especially when we 6 00:00:32,360 --> 00:00:39,160 start talking about back propagation in dealing with a matrix of weights and another matrix of biases. 7 00:00:39,200 --> 00:00:43,730 So keep that in mind this is gonna be pretty difficult especially if you're rusty on your calculus notation 8 00:00:44,270 --> 00:00:48,830 with that in mind though if you understand the basic intuition that you basically just move backwards 9 00:00:48,860 --> 00:00:53,720 through a network to update the weights and biases then that's actually enough to continue on the rest 10 00:00:53,720 --> 00:00:54,640 of the course. 11 00:00:54,650 --> 00:00:59,210 So if you're fuzzy on the calculus portion of this lecture don't worry too much about that because it's 12 00:00:59,210 --> 00:01:02,360 not like you're gonna be needing to compute the gradient herself. 13 00:01:02,450 --> 00:01:11,000 The actual code is going to do that for us so fundamentally we want to know how the cost function results 14 00:01:11,120 --> 00:01:14,160 change with respect to the weights in the network. 15 00:01:14,210 --> 00:01:18,350 That way we can update the weights to minimize the cost function and we really talked a little bit about 16 00:01:18,350 --> 00:01:24,780 that when talking about things like gradient descent and how it approaches the cost function so let's 17 00:01:24,780 --> 00:01:28,960 go to begin with a very simple network in order to understand it back propagation. 18 00:01:29,010 --> 00:01:33,510 So this is a super simple network essentially each layer only has one neuron. 19 00:01:33,510 --> 00:01:39,240 So we'll see how back propagation works with just a network of a couple of neurons and then we can easily 20 00:01:39,240 --> 00:01:47,410 expand this to networks with multiple neurons per layer so as we already know each input basically receives 21 00:01:47,440 --> 00:01:48,840 a weight and a bias. 22 00:01:48,970 --> 00:01:54,190 So there's an incoming weight attached that edge and then each node or that is to say each neuron has 23 00:01:54,190 --> 00:01:54,970 its own bias. 24 00:01:54,970 --> 00:01:59,950 So we get the sort of formula weight 1 plus bias one way to plus bias to weight three plus bias 3 and 25 00:01:59,950 --> 00:02:07,010 so on so this means that we have some sort of cost function that's dependent on those weights and biases 26 00:02:09,130 --> 00:02:12,270 and we've already seen how this process propagates forward. 27 00:02:12,370 --> 00:02:19,920 So let's go ahead and start at the end to learn about back propagation so we already noted that the 28 00:02:19,920 --> 00:02:25,050 way we notate layers is by having the very last layer be called. 29 00:02:25,260 --> 00:02:30,630 So then our notation becomes that the neuron all the way on the right is in layer l the neuron to the 30 00:02:30,630 --> 00:02:37,480 left of it is L minus 1 L minus 2 and so on for L minus n layers. 31 00:02:37,490 --> 00:02:42,950 Now let's go ahead and just focus on the very last two layers of our network because back propagation 32 00:02:42,950 --> 00:02:48,590 starts all the way at the end of the network layer l once we've gone through our feed forward process 33 00:02:49,030 --> 00:02:53,750 and when focusing on these two layers I want to remind ourselves of the notation we've been using so 34 00:02:53,750 --> 00:03:02,570 far we've been defining Z as W. times x plus B and X recall X that notation of X that's really only 35 00:03:02,570 --> 00:03:09,080 valid at the very first layer because X stands for the actual raw feature inputs as you keep going from 36 00:03:09,080 --> 00:03:15,290 neuron to neuron further into the layer X technically becomes the output of the previous neuron which 37 00:03:15,380 --> 00:03:22,100 is defined as a because remember after we apply an activation function to Z such as sigmoid Zee we label 38 00:03:22,100 --> 00:03:23,440 that as a. 39 00:03:23,510 --> 00:03:29,630 So as you go further and further along into these layers z would actually be Z equal to w times a plus 40 00:03:29,630 --> 00:03:35,330 b because X is technically only valid at that very first layer as the raw feature input. 41 00:03:35,330 --> 00:03:39,950 Once you actually pass that into a neuron you technically not dealing with the raw features anymore. 42 00:03:39,950 --> 00:03:45,020 Instead you're dealing with the output of the previous neurons layer which is better stated as a the 43 00:03:45,020 --> 00:03:49,340 sigmoid of Z or whatever activation function you choose. 44 00:03:49,520 --> 00:03:53,890 OK so what does that actually mean when we take it into account for that very last layer. 45 00:03:53,900 --> 00:04:01,590 Well that means that z at the last layer is going to be equal to those weights at the last layer times 46 00:04:01,870 --> 00:04:03,600 a of L minus one. 47 00:04:03,620 --> 00:04:12,710 So what is a L minus 1 will AFL minus 1 is simply the output of the previous layers neuron so a L minus 48 00:04:12,710 --> 00:04:19,650 one plus B level the biases at that very last layer so again a of L.. 49 00:04:19,740 --> 00:04:21,470 So the activation function. 50 00:04:21,510 --> 00:04:27,350 Output at the very last layer is equal to sigmoid or the activation function of zero. 51 00:04:27,360 --> 00:04:33,330 So notice Isaiah L is defined by the weights and biases at that layer l. 52 00:04:33,390 --> 00:04:36,160 And then it's defined by the output of the previous zero. 53 00:04:36,300 --> 00:04:42,330 So hopefully can kind of make these connections and that means that then the cost function is going 54 00:04:42,330 --> 00:04:50,000 to be equal to a L minus Y Y is the actual true output squared. 55 00:04:50,050 --> 00:04:56,890 So what we actually want to understand is how sensitive is the cost function to changes in W and this 56 00:04:56,890 --> 00:05:01,420 is where partial derivatives come into play because we want to figure out the relationship between that 57 00:05:01,420 --> 00:05:04,840 final cost function and the weights. 58 00:05:04,840 --> 00:05:09,910 In this case at the layer L so we're going to say take the partial derivative of that cost function 59 00:05:10,000 --> 00:05:18,050 with respect to weights and layer L and if you know some calculus then you know that there's a chain 60 00:05:18,050 --> 00:05:18,710 rule. 61 00:05:18,860 --> 00:05:24,860 And so if you were to take the formulas we just saw here and apply the chain rule for them in order 62 00:05:24,860 --> 00:05:29,030 to solve for this partial derivative we mentioned here because we want to understand the relationship 63 00:05:29,030 --> 00:05:34,130 between that cost function and the weights in the network then you end up calculating this formula. 64 00:05:34,130 --> 00:05:38,960 So this is just the chain rule that basically allows you to take the derivative of a function within 65 00:05:38,960 --> 00:05:40,300 a function. 66 00:05:40,430 --> 00:05:43,160 So here easy some calculus for the chain rule. 67 00:05:43,160 --> 00:05:48,500 We can determine that the parts of the relative of that cost function with respect to that weights is 68 00:05:48,500 --> 00:05:54,200 equal to the partial derivative of the Z with respect to the weights times the partial derivative of 69 00:05:54,200 --> 00:05:54,640 the A. 70 00:05:54,650 --> 00:06:00,560 With respect to z times the past third of the cost function with respect to a such as chain rule allows 71 00:06:00,560 --> 00:06:06,650 us to pull apart these functions within functions because we saw from these previous three formulas 72 00:06:07,070 --> 00:06:13,640 that the cost function is defined by AFL a value then defined by Z of Alan and Z as defined by W of 73 00:06:13,640 --> 00:06:20,180 Allenby of L now recall that the cost function is not just a function of the weights but it's also a 74 00:06:20,180 --> 00:06:21,750 function of the biases. 75 00:06:21,860 --> 00:06:26,720 So we want to be able to understand the relationship of the cost function changing not only the weights 76 00:06:26,960 --> 00:06:33,110 with the bias along the network as well so we can then calculate the same partial derivative so the 77 00:06:33,110 --> 00:06:37,600 partial derivative of the cost function with respect to those biased terms in the same way. 78 00:06:37,640 --> 00:06:42,400 Essentially just kind of swapping out that weight for the bias. 79 00:06:42,450 --> 00:06:47,850 Now the main idea here is that we can use the gradient to go back through the network and adjust our 80 00:06:47,850 --> 00:06:53,520 weights and biases to minimize the output of the error vector on the last output layer. 81 00:06:53,520 --> 00:07:01,340 And recall that the gradient is essentially that derivative when you're dealing with n dimensions so 82 00:07:01,340 --> 00:07:07,070 using some calculus notation we can expand this idea to networks with multiple neurons per layer and 83 00:07:07,070 --> 00:07:11,240 there's gonna be some notation you'll see in just a little bit which again if you're a little rusty 84 00:07:11,240 --> 00:07:15,710 on linear algebra or calculus it's called the Hatem art product and it's actually a product that you're 85 00:07:15,710 --> 00:07:21,020 already familiar with because it's kind of the default with NUM pi and these different deep learning 86 00:07:21,060 --> 00:07:26,480 libraries where you're actually performing elements by elements multiplication. 87 00:07:26,480 --> 00:07:31,190 So again the head smart product that little dot notation that kind of looks a little bit like a hydrogen 88 00:07:31,190 --> 00:07:32,090 symbol. 89 00:07:32,090 --> 00:07:37,280 Basically what it means is just do an element by element multiplication and that means that the two 90 00:07:37,280 --> 00:07:43,100 matrices should be the exact same size which makes sense because that's gonna match up for things like 91 00:07:43,550 --> 00:07:50,480 weights and biases so given this notation and back propagation we just have a few main steps to train 92 00:07:50,480 --> 00:07:55,640 you're on networks now No again you don't need to fully understand these intricate details on calculus 93 00:07:55,670 --> 00:08:02,270 or the notation to continue on the coding portions of this course let's review now the actual learning 94 00:08:02,270 --> 00:08:08,720 process for a network we start off with just a very basic feed forward process that we're already familiar 95 00:08:08,720 --> 00:08:09,250 with. 96 00:08:09,470 --> 00:08:15,500 And step 1 using the input x that is the original features we set the activation function a for the 97 00:08:15,500 --> 00:08:22,460 input layer so that very first input layer then means that we have z is equal to w x plus B and then 98 00:08:22,640 --> 00:08:28,730 a that output essentially coming out of the input layer is gonna be equal to your activation function 99 00:08:28,790 --> 00:08:29,290 of Z. 100 00:08:29,290 --> 00:08:32,330 In this case represented as sigmoid of Z. 101 00:08:32,330 --> 00:08:35,290 So this resulting a then feeds into the next layer. 102 00:08:35,390 --> 00:08:38,460 So then you have the next layer taking an A. 103 00:08:38,510 --> 00:08:43,770 Which means it's Z is going to be w times a plus b of the previous layer. 104 00:08:43,790 --> 00:08:49,280 So then you go into the next layer and then you'd take the output of that a stick it into C then so 105 00:08:49,280 --> 00:08:53,930 on and so on so we think about this for each layer. 106 00:08:53,930 --> 00:08:58,970 All we're doing is for computing those zis and those A's since A's based top Z. 107 00:08:58,970 --> 00:09:05,690 So if I'm at some layer L then what I'm going to be doing is setting Z of my current layer l equal to 108 00:09:06,170 --> 00:09:13,280 the weights at all times the output from the previous layer that is a L minus 1 plus the biases of my 109 00:09:13,280 --> 00:09:14,660 current layer bevel. 110 00:09:14,900 --> 00:09:18,440 Once I have my sea avail I'm going to pass that through my activation function. 111 00:09:18,530 --> 00:09:24,230 In this case sigmoid and then I get the a of my current layer L and then I can pass that to the next 112 00:09:24,230 --> 00:09:26,770 layer of L plus 1 and so on and so on. 113 00:09:27,660 --> 00:09:29,700 And then we come to step 3. 114 00:09:29,760 --> 00:09:35,070 And here we've written it out in the full calculus notation of computing our error vector. 115 00:09:35,070 --> 00:09:41,160 But essentially what we want to do is we take a look and focus on just that very first term alter is 116 00:09:41,160 --> 00:09:47,550 doing is essentially expressing the rate of change of that cost function with respect to the output 117 00:09:47,640 --> 00:09:53,820 activations and in the case of the quadratic cost function then that's essentially the same thing as 118 00:09:53,820 --> 00:10:01,240 saying the activation of the last output layer minus Y which was the true value. 119 00:10:01,410 --> 00:10:07,740 And essentially what we want to do is be able to calculate this error vector and back propagate it essentially 120 00:10:07,740 --> 00:10:13,050 calculate the error back through every other single error that way we can adjust the weights and biases 121 00:10:13,080 --> 00:10:21,030 for that error so replacing that first term with a capital L minus Y we get the following formula. 122 00:10:21,070 --> 00:10:24,180 And again reason that had a smart product here. 123 00:10:24,700 --> 00:10:31,210 So what I want to do is I want to write a generalized error vector formula and I'm gonna write it out 124 00:10:31,480 --> 00:10:36,970 in terms of the error in the next layer which makes a lot of sense because all we're doing is we're 125 00:10:36,970 --> 00:10:43,930 moving backwards and a real quick note here is it's a little tricky to find a font where a lower case 126 00:10:44,080 --> 00:10:46,620 l. looks different than the number one. 127 00:10:46,690 --> 00:10:49,850 So I have two little bullet points there to show you what I'm talking about. 128 00:10:49,900 --> 00:10:55,630 So a lower case l. essentially looks like a straight line the number one has a little dash on top. 129 00:10:55,690 --> 00:10:58,540 So keep that in mind as you see what's happening next. 130 00:10:58,540 --> 00:11:04,060 The reason I'm not gonna be using a capital L is because in general we should be using capital L to 131 00:11:04,060 --> 00:11:06,310 denote the very last output layer. 132 00:11:06,460 --> 00:11:12,130 And when I want to do is show you the formula for the error vector and those calculations for any layer 133 00:11:12,220 --> 00:11:17,070 l lowercase L that is inside of the network. 134 00:11:17,080 --> 00:11:23,500 So what I'm going to do is for this back propagation step for every single layer starting at the very 135 00:11:23,500 --> 00:11:24,010 last layer. 136 00:11:24,010 --> 00:11:31,780 Capital L the moving to capital L minus one capital L minus two etc. all the way for all these layers. 137 00:11:31,780 --> 00:11:40,060 The generalized than error term so that error term that Delta the little L or lowercase L is going to 138 00:11:40,060 --> 00:11:44,350 be equal to the weight matrix L plus 1. 139 00:11:44,350 --> 00:11:51,670 That's the lowercase L then transpose that so that t that term right there that's a transpose of the 140 00:11:51,670 --> 00:11:59,830 weight matrix in the next layer on the right hand side of the L plus 1 that's the lowercase L and then 141 00:11:59,830 --> 00:12:03,730 we have their multiplied by the error term at that. 142 00:12:03,730 --> 00:12:11,230 Also next layer and then we take the head of our product again with the Z of L pass into the activation 143 00:12:11,230 --> 00:12:13,780 function. 144 00:12:14,020 --> 00:12:16,950 So again all we're doing there is back propagating the error. 145 00:12:17,380 --> 00:12:25,630 And here we have the generalized error for any layer lowercase L so when we apply that actual transpose 146 00:12:25,630 --> 00:12:32,590 weight matrix that weight matrix of L plus 1 transposed we can think intuitively of this as moving the 147 00:12:32,590 --> 00:12:38,410 air backward through the network giving us some sort of measure of the air at the output of that elf 148 00:12:38,530 --> 00:12:49,070 layer we then take the Hatem out product of that times they had a marked Product there of the Z at that 149 00:12:49,070 --> 00:12:55,070 layer pass into the activation function and what does those is then this moves the air backward through 150 00:12:55,070 --> 00:13:02,020 the activation function in layer l giving us the error an l in the weighted input to layer l. 151 00:13:02,120 --> 00:13:09,770 So again that's a generalized term which is why you see me using that lowercase L and then we can understand 152 00:13:10,130 --> 00:13:16,550 that the gradient of the cost function is given by these two formulas send for each layer L minus 1 153 00:13:16,610 --> 00:13:18,280 L minus 2 and so on. 154 00:13:18,350 --> 00:13:24,290 All we're really doing is computing that partial derivative of the cost function with respect to the 155 00:13:24,290 --> 00:13:31,300 weights and the biases there and Jane K. That's just notation for the actual neurons themselves. 156 00:13:34,020 --> 00:13:38,800 This then allows us to adjust the weights and biases to help minimize that cost function. 157 00:13:39,060 --> 00:13:45,990 So I know this was quite difficult to understand in terms of the notation of the calculus and don't 158 00:13:45,990 --> 00:13:47,480 worry if you didn't get it right away. 159 00:13:47,490 --> 00:13:51,600 This usually takes at least for me it took definitely a couple of hours to understand. 160 00:13:51,600 --> 00:13:53,820 Just trying to write it out with paper and pencil. 161 00:13:53,820 --> 00:13:59,040 So what I've done is I've linked to you in this lecture some external links you should see it as a little 162 00:13:59,040 --> 00:14:04,050 folder pop up as you're kind of watching this lecture or right next to the lecture title you should 163 00:14:04,050 --> 00:14:08,670 see a little dropdown folder you can click there's some external links there that actually go and essentially 164 00:14:08,670 --> 00:14:09,960 derive step by step. 165 00:14:10,020 --> 00:14:11,380 All these equations. 166 00:14:11,400 --> 00:14:16,200 So if this overview wasn't enough for you and you want to see every step and the whole derivation of 167 00:14:16,200 --> 00:14:22,260 this process plus the proof of those kind of four step fundamental back propagation equations check 168 00:14:22,260 --> 00:14:27,810 out the external links for lots of details on that but if you have the general understanding that you're 169 00:14:27,810 --> 00:14:33,750 basically calculating the error term at the very last layer and then going backwards through the network 170 00:14:33,750 --> 00:14:37,980 to calculate all those errors and then adjusting the weights and biases accordingly to minimize that 171 00:14:37,980 --> 00:14:39,000 cost function. 172 00:14:39,000 --> 00:14:44,660 If you understand that general intuition you know enough to go ahead and continue with the course. 173 00:14:44,730 --> 00:14:45,450 All right. 174 00:14:45,450 --> 00:14:47,010 Thanks and I'll see you at the next lecture. 20783