You've learned about gradient descent, about multiple linear regression, and also about vectorization. Let's put it all together to implement gradient descent for multiple linear regression with vectorization. This will be cool.
Let's quickly review what multiple linear regression looks like. Using our previous notation, let's see how you can write it more succinctly using vector notation. We have parameters w_1 to w_n, as well as b.
But instead of thinking of w_1 to w_n as separate numbers, that is, separate parameters, let's collect all of the w's into a vector w, so that now w is a vector of length n. We're just going to think of the parameters of this model as a vector w, as well as b, where b is still a number, the same as before.
Whereas before we defined multiple linear regression like this, now using vector notation we can write the model as f_{w,b}(x) = w · x + b, the vector w dot product with the vector x, plus b. Remember that this dot here means dot product.
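To make the vectorized form concrete, here is a minimal NumPy sketch of that prediction (the function name `predict` is just for this illustration, not the optional lab's code):

```python
import numpy as np

def predict(x, w, b):
    """Vectorized model f_wb(x) = w . x + b for a single example.

    x : (n,) feature vector, w : (n,) parameter vector, b : scalar bias.
    """
    return np.dot(w, x) + b
```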
Our cost function can be defined as J of w_1 through w_n, b. But instead of just thinking of J as a function of these n different parameters w_j as well as b, we're going to write J as a function of the parameter vector w and the number b. This w_1 through w_n is replaced by the vector w, and J now takes as input a vector w and a number b and returns a number.
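For concreteness, here is a sketch of how that cost can be computed in NumPy, assuming the usual squared-error cost with a 1/(2m) factor and with the m training examples stacked as rows of a matrix `X` (the names here are illustrative, not the lab's):

```python
def compute_cost(X, y, w, b):
    """Squared-error cost J(w, b) = (1 / (2m)) * sum_i (f_wb(x^(i)) - y^(i))^2.

    X : (m, n) matrix of training examples, y : (m,) vector of targets.
    """
    m = X.shape[0]
    errors = X @ w + b - y          # f_wb(x^(i)) - y^(i) for every example i
    return np.sum(errors ** 2) / (2 * m)
```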
Here's what gradient descent looks like. We're going to repeatedly update each parameter w_j to be w_j minus alpha times the derivative of the cost J, where J has parameters w_1 through w_n and b. Once again, we just write this as J of vector w and number b.
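Written out, the updates being described are, repeated until convergence and applied simultaneously for every parameter:

$$w_j \leftarrow w_j - \alpha \frac{\partial}{\partial w_j} J(\vec{w}, b) \quad \text{for } j = 1, \ldots, n, \qquad b \leftarrow b - \alpha \frac{\partial}{\partial b} J(\vec{w}, b)$$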
Let's see what this looks like when you implement gradient descent and, in particular, let's take a look at the derivative term. We'll see that gradient descent becomes just a little bit different with multiple features compared to just one feature.
Here's what we had when we had gradient descent with one feature. We had an update rule for w and a separate update rule for b. Hopefully, these look familiar to you. This term here is the derivative of the cost function J with respect to the parameter w. Similarly, we have an update rule for the parameter b. With univariate regression, we had only one feature. We called that feature x^(i), without any subscript.
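For reference, the one-feature updates being recalled here, assuming the squared-error cost written above, are:

$$w \leftarrow w - \alpha \frac{1}{m}\sum_{i=1}^{m}\big(f_{w,b}(x^{(i)}) - y^{(i)}\big)\,x^{(i)}, \qquad b \leftarrow b - \alpha \frac{1}{m}\sum_{i=1}^{m}\big(f_{w,b}(x^{(i)}) - y^{(i)}\big)$$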
Now, here's the new notation for when we have n features, where n is two or more. We get this update rule for gradient descent: update w_1 to be w_1 minus alpha times this expression here, and this formula is actually the derivative of the cost J with respect to w_1. The formula for the derivative of J with respect to w_1 on the right looks very similar to the case of one feature on the left. The error term still takes a prediction f(x) minus the target y. One difference is that w and x are now vectors, and just as w on the left has become w_1 here on the right, x^(i) here on the left is now x_1^(i) here on the right, and this is just for j equals 1. For multiple linear regression, we have j ranging from 1 through n, and so we'll update the parameters w_1, w_2, all the way up to w_n, and then, as before, we'll update b.
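Spelled out, the full set of updates for n features, again assuming the squared-error cost, is:

$$w_j \leftarrow w_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\big(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\big)\,x_j^{(i)} \quad \text{for } j = 1, \ldots, n, \qquad b \leftarrow b - \alpha \frac{1}{m}\sum_{i=1}^{m}\big(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\big)$$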
If you implement this, you get gradient descent for multiple regression. That's it for gradient descent for multiple regression.
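Here is a minimal NumPy sketch of what that implementation can look like (illustrative only; the names `gradient_descent`, `alpha`, and `num_iters` are chosen for this example and are not necessarily the optional lab's):

```python
def gradient_descent(X, y, w, b, alpha, num_iters):
    """Batch gradient descent for multiple linear regression (a sketch).

    X : (m, n) training examples, y : (m,) targets,
    w : (n,) initial weights, b : initial bias (scalar),
    alpha : learning rate, num_iters : number of update steps.
    """
    m = X.shape[0]
    for _ in range(num_iters):
        errors = X @ w + b - y        # f_wb(x^(i)) - y^(i) for all i
        dj_dw = (X.T @ errors) / m    # partial derivatives w.r.t. each w_j
        dj_db = np.sum(errors) / m    # partial derivative w.r.t. b
        w = w - alpha * dj_dw         # update all w_j at once...
        b = b - alpha * dj_db         # ...and b, using the same errors vector
    return w, b
```

Because every partial derivative is computed from the same `errors` vector before any parameter changes, all n + 1 parameters are updated simultaneously, which is what the update rule calls for.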
Before moving on from this video, I want to make a quick aside, or a quick side note, on an alternative way of finding w and b for linear regression. This method is called the normal equation. Whereas gradient descent is a great method for minimizing the cost function J to find w and b, there is one other algorithm that works only for linear regression, and for pretty much none of the other algorithms you'll see in this specialization, for solving for w and b, and this other method does not need an iterative gradient descent algorithm. Called the normal equation method, it turns out to be possible to use an advanced linear algebra library to just solve for w and b all in one go, without iterations.
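For the curious, the "solve it in one go" idea can be sketched as follows; this is an illustration under my own choices (appending a column of ones so b is solved alongside w, and using NumPy's least-squares solver rather than an explicit matrix inverse), not the course's implementation:

```python
def normal_equation_fit(X, y):
    """Closed-form least-squares fit of w and b, with no iterations.

    X : (m, n) training examples, y : (m,) targets.
    """
    m = X.shape[0]
    X_aug = np.hstack([X, np.ones((m, 1))])            # extra column absorbs b
    theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)  # solve min ||X_aug @ theta - y||
    w, b = theta[:-1], theta[-1]
    return w, b
```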
Some disadvantages of the normal equation method are: first, unlike gradient descent, it does not generalize to other learning algorithms, such as the logistic regression algorithm that you'll learn about next week, or the neural networks or other algorithms you'll see later in this specialization. The normal equation method is also quite slow if the number of features, n, is large. Almost no machine learning practitioners should implement the normal equation method themselves, but if you're using a mature machine learning library and call linear regression, there is a chance that, on the back end, it'll be using this to solve for w and b.
If you're ever in a job interview and hear the term normal equation, that's what it refers to. Don't worry about the details of how the normal equation works. Just be aware that some machine learning libraries may use this complicated method in the back end to solve for w and b. But for most learning algorithms, including how you implement linear regression yourself, gradient descent offers a better way to get the job done.
In the optional lab that follows this video, you'll see how to define a multiple regression model in code, and also how to calculate the prediction f of x. You'll also see how to calculate the cost and implement gradient descent for a multiple linear regression model. This will be using Python's NumPy library. If any of the code looks very new, that's okay, but you should also feel free to take a look at the previous optional lab that introduces NumPy and vectorization, for a refresher on NumPy functions and how to implement them in code.
That's it. You now know multiple linear regression. This is probably the single most widely used learning algorithm in the world today. But there's more. With just a few tricks, such as picking and scaling features appropriately and also choosing the learning rate alpha appropriately, you can really make this work much better. Just a few more videos to go for this week. Let's go on to the next video to see those little tricks that will help you make multiple linear regression work much better.