1
00:00:00,650 --> 00:00:02,940
In this video, we'll figure out
2
00:00:02,940 --> 00:00:04,470
how to get gradient descent to
3
00:00:04,470 --> 00:00:08,090
work with regularized linear
regression. Let's jump in.
4
00:00:08,090 --> 00:00:10,860
Here is the cost function
we came up with in
5
00:00:10,860 --> 00:00:14,130
the last video for regularized
linear regression.
6
00:00:14,130 --> 00:00:18,285
The first part is the usual
squared error cost function,
7
00:00:18,285 --> 00:00:21,810
and now you have this
additional regularization term,
8
00:00:21,810 --> 00:00:24,945
where Lambda is the
regularization parameter,
9
00:00:24,945 --> 00:00:28,020
and you'd like to
find parameters w and
10
00:00:28,020 --> 00:00:32,250
b that minimize the
regularized cost function.
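For reference, a minimal NumPy sketch of this regularized cost function follows. It is not code from the video; the name regularized_cost and the argument layout are assumptions, with f taken to be the linear model w dot x plus b.

```python
import numpy as np

def regularized_cost(X, y, w, b, lambda_):
    """Squared error cost plus the (lambda / 2m) * sum of w_j squared term."""
    m = X.shape[0]
    predictions = X @ w + b                                  # f(x) = w . x + b for every example
    squared_error = np.sum((predictions - y) ** 2) / (2 * m)
    regularization = (lambda_ / (2 * m)) * np.sum(w ** 2)    # only the weights w_j are penalized
    return squared_error + regularization
```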
11
00:00:32,250 --> 00:00:34,590
Previously we were using
12
00:00:34,590 --> 00:00:37,905
gradient descent for the
original cost function,
13
00:00:37,905 --> 00:00:39,570
just the first term before we
14
00:00:39,570 --> 00:00:42,450
added that second
regularization term,
15
00:00:42,450 --> 00:00:45,290
and previously, we had
16
00:00:45,290 --> 00:00:47,735
the following gradient
descent algorithm,
17
00:00:47,735 --> 00:00:51,230
which is that we repeatedly
update the parameters w_j
18
00:00:51,230 --> 00:00:54,260
and b for j equals 1 through n
19
00:00:54,260 --> 00:00:55,820
according to this formula
20
00:00:55,820 --> 00:00:58,675
and b is also updated similarly.
21
00:00:58,675 --> 00:01:00,350
Again, Alpha is
22
00:01:00,350 --> 00:01:03,800
a very small positive number
called the learning rate.
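For reference, the unregularized update rule being described can be written out as follows; this is a reconstruction from the surrounding narration, since the slide itself is not in the transcript.

$$
\begin{aligned}
w_j &:= w_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\bigl(f_{w,b}(x^{(i)}) - y^{(i)}\bigr)x_j^{(i)}, \quad j = 1,\dots,n \\
b &:= b - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\bigl(f_{w,b}(x^{(i)}) - y^{(i)}\bigr)
\end{aligned}
$$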
23
00:01:03,800 --> 00:01:05,960
In fact, the updates for
24
00:01:05,960 --> 00:01:08,720
regularized linear regression
look exactly the same,
25
00:01:08,720 --> 00:01:10,700
except that now the cost,
26
00:01:10,700 --> 00:01:13,500
J, is defined a bit differently.
27
00:01:13,500 --> 00:01:16,940
Previously the derivative
of J with respect
28
00:01:16,940 --> 00:01:20,765
to w_j was given by this
expression over here,
29
00:01:20,765 --> 00:01:23,465
and the derivative
with respect to b was
30
00:01:23,465 --> 00:01:26,560
given by this
expression over here.
31
00:01:26,560 --> 00:01:30,605
Now that we've added this
additional regularization term,
32
00:01:30,605 --> 00:01:33,770
the only thing that changes
is that the expression for
33
00:01:33,770 --> 00:01:35,060
the derivative with respect to
34
00:01:35,060 --> 00:01:38,555
w_j ends up with one
additional term,
35
00:01:38,555 --> 00:01:43,475
this plus Lambda
over m times w_j.
36
00:01:43,475 --> 00:01:45,230
And in particular for
37
00:01:45,230 --> 00:01:47,645
the new definition of
the cost function J,
38
00:01:47,645 --> 00:01:50,285
these two expressions over here,
39
00:01:50,285 --> 00:01:53,990
these are the new derivatives
of J with respect
40
00:01:53,990 --> 00:01:58,630
to w_j and the derivative
of J with respect to b.
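Written out, the two derivative expressions being pointed to are, reconstructed from the narration (the slide itself is not in the transcript):

$$
\frac{\partial J}{\partial w_j} = \frac{1}{m}\sum_{i=1}^{m}\bigl(f_{w,b}(x^{(i)}) - y^{(i)}\bigr)x_j^{(i)} + \frac{\lambda}{m} w_j,
\qquad
\frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}\bigl(f_{w,b}(x^{(i)}) - y^{(i)}\bigr)
$$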
41
00:01:58,630 --> 00:02:01,995
Recall that we
don't regularize b,
42
00:02:01,995 --> 00:02:04,230
so we're not trying to shrink b.
43
00:02:04,230 --> 00:02:07,925
That's why the update for b
remains the same as before,
44
00:02:07,925 --> 00:02:10,760
whereas the update for
w_j changes because
45
00:02:10,760 --> 00:02:15,740
the regularization term causes
us to try to shrink w_j.
46
00:02:16,010 --> 00:02:18,620
Let's take these definitions for
47
00:02:18,620 --> 00:02:21,140
the derivatives and
put them back into
48
00:02:21,140 --> 00:02:23,780
the expression on the
left to write out
49
00:02:23,780 --> 00:02:25,565
the gradient descent algorithm
50
00:02:25,565 --> 00:02:28,675
for regularized
linear regression.
51
00:02:28,675 --> 00:02:31,460
To implement gradient descent
52
00:02:31,460 --> 00:02:33,545
for regularized
linear regression,
53
00:02:33,545 --> 00:02:36,940
this is what you would
have your code do.
54
00:02:36,940 --> 00:02:39,110
Here is the update for w_j,
55
00:02:39,110 --> 00:02:40,640
for j equals 1 through n,
56
00:02:40,640 --> 00:02:42,935
and here's the update for b.
57
00:02:42,935 --> 00:02:45,950
As usual, please
remember to carry out
58
00:02:45,950 --> 00:02:49,564
simultaneous updates for
all of these parameters.
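One way you might code these updates in NumPy is sketched below. This is not the course's lab code; the names compute_gradients, gradient_descent, and num_iters are assumptions. The simultaneous update is handled by computing both gradients before changing either parameter.

```python
import numpy as np

def compute_gradients(X, y, w, b, lambda_):
    """Gradients of the regularized cost with respect to w and b."""
    m = X.shape[0]
    error = X @ w + b - y                          # f(x^(i)) - y^(i) for every example
    dj_dw = (X.T @ error) / m + (lambda_ / m) * w  # extra lambda/m * w_j term from regularization
    dj_db = np.sum(error) / m                      # b is not regularized, so no extra term
    return dj_dw, dj_db

def gradient_descent(X, y, w, b, alpha, lambda_, num_iters):
    """Repeat the regularized update rules for w_j (j = 1..n) and b."""
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradients(X, y, w, b, lambda_)
        w = w - alpha * dj_dw   # both gradients were computed above,
        b = b - alpha * dj_db   # so w and b are updated simultaneously
    return w, b
```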
59
00:02:49,564 --> 00:02:53,285
Now, in order for you to
get this algorithm to work,
60
00:02:53,285 --> 00:02:55,375
this is all you need to know.
61
00:02:55,375 --> 00:02:56,780
But what I'd like to do in
62
00:02:56,780 --> 00:02:58,730
the remainder of this
video is to go over
63
00:02:58,730 --> 00:03:01,100
some optional material to convey
64
00:03:01,100 --> 00:03:02,900
a slightly deeper
intuition about
65
00:03:02,900 --> 00:03:05,480
what this formula
is actually doing,
66
00:03:05,480 --> 00:03:07,460
as well as chat
briefly about how
67
00:03:07,460 --> 00:03:09,685
these derivatives are derived.
68
00:03:09,685 --> 00:03:12,450
The rest of this video
is completely optional.
69
00:03:12,450 --> 00:03:16,160
It's completely okay if you
skip the rest of this video,
70
00:03:16,160 --> 00:03:18,470
but if you have a
strong interest
71
00:03:18,470 --> 00:03:19,670
in math, then stick with me.
72
00:03:19,670 --> 00:03:21,730
It is always nice to
hang out with you here,
73
00:03:21,730 --> 00:03:23,885
and through these equations,
74
00:03:23,885 --> 00:03:26,255
perhaps you can build
a deeper intuition
75
00:03:26,255 --> 00:03:27,620
about what the math and what
76
00:03:27,620 --> 00:03:29,600
the derivatives
are doing as well.
77
00:03:29,600 --> 00:03:32,720
Let's take a look. Let's look at
78
00:03:32,720 --> 00:03:37,040
the update rule for w_j and
rewrite it in another way.
79
00:03:37,040 --> 00:03:42,125
We're updating w_j as 1 times
80
00:03:42,125 --> 00:03:49,035
w_j minus Alpha times
Lambda over m times w_j.
81
00:03:49,035 --> 00:03:52,710
I've moved the term from
the end to the front here.
82
00:03:52,710 --> 00:03:57,155
Then minus Alpha times 1 over m,
83
00:03:57,155 --> 00:04:00,535
and then the rest of
that term over there.
84
00:04:00,535 --> 00:04:04,185
We just rearranged the
terms a little bit.
85
00:04:04,185 --> 00:04:08,660
If we simplify, then we're
saying that w_j is updated
86
00:04:08,660 --> 00:04:14,905
as w_j times 1 minus Alpha
times Lambda over m,
87
00:04:14,905 --> 00:04:19,610
minus Alpha times this
other term over here.
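Spelled out, the rearrangement being described is just algebra on the regularized update rule:

$$
w_j := w_j - \alpha\Bigl[\frac{1}{m}\sum_{i=1}^{m}\bigl(f_{w,b}(x^{(i)}) - y^{(i)}\bigr)x_j^{(i)} + \frac{\lambda}{m} w_j\Bigr]
= w_j\Bigl(1 - \alpha\frac{\lambda}{m}\Bigr) - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\bigl(f_{w,b}(x^{(i)}) - y^{(i)}\bigr)x_j^{(i)}
$$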
88
00:04:19,610 --> 00:04:22,610
You might recognize
the second term as
89
00:04:22,610 --> 00:04:25,040
the usual gradient
descent update
90
00:04:25,040 --> 00:04:27,680
for unregularized
linear regression.
91
00:04:27,680 --> 00:04:29,150
This is the update for
92
00:04:29,150 --> 00:04:32,420
linear regression before
we had regularization,
93
00:04:32,420 --> 00:04:37,130
and this is the term we saw
in Week 2 of this course.
94
00:04:37,130 --> 00:04:39,050
The only change when we add
95
00:04:39,050 --> 00:04:42,610
regularization is that
instead of w_j being set
96
00:04:42,610 --> 00:04:46,850
to be equal to w_j
minus Alpha times
97
00:04:46,850 --> 00:04:48,470
this term, it is now
98
00:04:48,470 --> 00:04:52,875
w_j times this number
minus the usual update.
99
00:04:52,875 --> 00:04:56,460
This is what we had in
Week 1 of this course.
100
00:04:56,460 --> 00:04:59,505
What is this first
term over here?
101
00:04:59,505 --> 00:05:05,115
Well, Alpha is a very small
positive number, say 0.01.
102
00:05:05,115 --> 00:05:07,625
Lambda is usually
a small number,
103
00:05:07,625 --> 00:05:10,625
say 1 or maybe 10.
104
00:05:10,625 --> 00:05:13,905
Let's say lambda is
1 for this example
105
00:05:13,905 --> 00:05:17,925
and m is the training
set size, say 50.
106
00:05:17,925 --> 00:05:22,470
When you multiply
Alpha Lambda over m,
107
00:05:22,470 --> 00:05:27,810
say 0.01 times 1 divided by 50,
108
00:05:27,810 --> 00:05:30,990
this term ends up being
a small positive number
109
00:05:30,990 --> 00:05:34,395
say 0.0002,
110
00:05:34,395 --> 00:05:37,370
and thus, 1 minus Alpha Lambda
111
00:05:37,370 --> 00:05:38,660
over m is going to be
112
00:05:38,660 --> 00:05:40,595
a number just
slightly less than 1,
113
00:05:40,595 --> 00:05:43,700
in this case, 0.9998.
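The arithmetic in this example is easy to check; the values 0.01, 1, and 50 are just the illustrative numbers quoted above, not values fixed by the algorithm.

```python
alpha, lambda_, m = 0.01, 1, 50
shrink = alpha * lambda_ / m   # 0.01 * 1 / 50 = 0.0002
print(shrink, 1 - shrink)      # 0.0002 0.9998
```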
114
00:05:43,700 --> 00:05:46,205
The effect of this term is that
115
00:05:46,205 --> 00:05:49,220
on every single iteration
of gradient descent,
116
00:05:49,220 --> 00:05:54,515
you're taking w_j and
multiplying it by 0.9998,
117
00:05:54,515 --> 00:05:56,720
that is, by some number
slightly less than
118
00:05:56,720 --> 00:05:59,785
one, and then carrying
out the usual update.
119
00:05:59,785 --> 00:06:03,260
What regularization is doing
on every single iteration
120
00:06:03,260 --> 00:06:05,420
is that you're multiplying w_j
121
00:06:05,420 --> 00:06:07,580
by a number slightly
less than 1,
122
00:06:07,580 --> 00:06:09,770
and that has effect of shrinking
123
00:06:09,770 --> 00:06:13,075
the value of w_j
just a little bit.
124
00:06:13,075 --> 00:06:15,530
This gives us
another view on why
125
00:06:15,530 --> 00:06:17,690
regularization has the effect of
126
00:06:17,690 --> 00:06:19,760
shrinking the parameters w_j
127
00:06:19,760 --> 00:06:21,980
a little bit on every iteration,
128
00:06:21,980 --> 00:06:24,985
and so that's how
regularization works.
129
00:06:24,985 --> 00:06:26,960
If you're curious about how
130
00:06:26,960 --> 00:06:29,030
these derivative
terms were computed,
131
00:06:29,030 --> 00:06:32,330
I have just one last optional
slide that goes through
132
00:06:32,330 --> 00:06:33,440
just a little bit of
133
00:06:33,440 --> 00:06:35,815
the calculation of the
derivative term.
134
00:06:35,815 --> 00:06:37,940
Again, this slide
and the rest of
135
00:06:37,940 --> 00:06:40,085
this video are
completely optional,
136
00:06:40,085 --> 00:06:41,990
meaning you won't
need any of this to
137
00:06:41,990 --> 00:06:44,795
do the practice labs
and the quizzes.
138
00:06:44,795 --> 00:06:48,740
Let's quickly step through
the derivative calculation.
139
00:06:48,740 --> 00:06:55,830
The derivative of J with
respect to w_j looks like this.
140
00:07:05,630 --> 00:07:10,300
Recall that f of x for linear
regression is defined as
141
00:07:10,300 --> 00:07:16,290
w dot x plus b, that is, the
dot product of w and x, plus b.
142
00:07:16,290 --> 00:07:19,540
It turns out that by
the rules of calculus,
143
00:07:19,540 --> 00:07:21,700
the derivative looks like this:
144
00:07:21,700 --> 00:07:29,335
it is 1 over 2m times the sum
i equals 1 through m of
145
00:07:29,335 --> 00:07:33,655
w dot x plus b minus y
146
00:07:33,655 --> 00:07:37,585
times 2x_j plus the derivative
147
00:07:37,585 --> 00:07:39,205
of the regularization term,
148
00:07:39,205 --> 00:07:45,800
which is Lambda over
2m times 2 w_j.
149
00:07:45,800 --> 00:07:49,340
Notice that the second
term does not have
150
00:07:49,340 --> 00:07:53,495
the summation term from j
equals 1 through n anymore.
151
00:07:53,495 --> 00:07:56,675
The 2's cancel out
here and here,
152
00:07:56,675 --> 00:07:59,065
and also here and here.
153
00:07:59,065 --> 00:08:04,570
It simplifies to this
expression over here.
154
00:08:07,040 --> 00:08:12,735
Finally, remember that
w dot x plus b is f of x,
155
00:08:12,735 --> 00:08:17,350
and so you can rewrite it as
this expression down here.
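Putting the spoken derivation into symbols (a transcription of the algebra being described, since the slide is not in the transcript):

$$
\frac{\partial J}{\partial w_j}
= \frac{1}{2m}\sum_{i=1}^{m}\bigl(w\cdot x^{(i)} + b - y^{(i)}\bigr)\,2x_j^{(i)} + \frac{\lambda}{2m}\,2w_j
= \frac{1}{m}\sum_{i=1}^{m}\bigl(f_{w,b}(x^{(i)}) - y^{(i)}\bigr)x_j^{(i)} + \frac{\lambda}{m} w_j
$$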
156
00:08:17,350 --> 00:08:20,510
This is why this
expression is used to
157
00:08:20,510 --> 00:08:24,790
compute the gradient in
regularized linear regression.
158
00:08:24,790 --> 00:08:26,750
You now know how to
159
00:08:26,750 --> 00:08:29,045
implement regularized
linear regression.
160
00:08:29,045 --> 00:08:31,910
Using this, you can really
reduce overfitting when you
161
00:08:31,910 --> 00:08:33,080
have a lot of features and
162
00:08:33,080 --> 00:08:35,095
a relatively small training set.
163
00:08:35,095 --> 00:08:37,700
This should let you get
linear regression to
164
00:08:37,700 --> 00:08:40,390
work much better
on many problems.
165
00:08:40,390 --> 00:08:41,945
In the next video,
166
00:08:41,945 --> 00:08:44,840
we'll take this regularization
idea and apply it to
167
00:08:44,840 --> 00:08:46,745
logistic regression to avoid
168
00:08:46,745 --> 00:08:49,010
overfitting for logistic
regression as well.
169
00:08:49,010 --> 00:08:52,110
Let's take a look at
that in the next video.