You've learned about gradient descent, about multiple linear regression, and also about vectorization. Let's put it all together to implement gradient descent for multiple linear regression with vectorization. This will be cool.
Let's quickly review what multiple linear regression looks like. Using our previous notation, let's see how you can write it more succinctly using vector notation. We have parameters w_1 to w_n, as well as b.
But instead of thinking of w_1 to w_n as separate numbers, that is, separate parameters, let's collect all of the w's into a vector w, so that now w is a vector of length n. We're just going to think of the parameters of this model as a vector w, as well as b, where b is still a number, the same as before.
Whereas before we defined multiple linear regression like this, now using vector notation we can write the model as f_{w,b}(x) = w · x + b, the vector w dot product with the vector x, plus b. Remember that this dot here means dot product.
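To make the vectorized form concrete, here is a minimal NumPy sketch of that prediction (the function name `predict` is just for this illustration, not the optional lab's code):

```python
import numpy as np

def predict(x, w, b):
    """Vectorized model f_wb(x) = w . x + b for a single example.

    x : (n,) feature vector, w : (n,) parameter vector, b : scalar bias.
    """
    return np.dot(w, x) + b
```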
Our cost function can be defined as J of w_1 through w_n, b. But instead of just thinking of J as a function of these n different parameters w_j as well as b, we're going to write J as a function of the parameter vector w and the number b. This w_1 through w_n is replaced by the vector w, and J now takes as input a vector w and a number b and returns a number.
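For concreteness, here is a sketch of how that cost can be computed in NumPy, assuming the usual squared-error cost with a 1/(2m) factor and with the m training examples stacked as rows of a matrix `X` (the names here are illustrative, not the lab's):

```python
def compute_cost(X, y, w, b):
    """Squared-error cost J(w, b) = (1 / (2m)) * sum_i (f_wb(x^(i)) - y^(i))^2.

    X : (m, n) matrix of training examples, y : (m,) vector of targets.
    """
    m = X.shape[0]
    errors = X @ w + b - y          # f_wb(x^(i)) - y^(i) for every example i
    return np.sum(errors ** 2) / (2 * m)
```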
Here's what gradient descent looks like. We're going to repeatedly update each parameter w_j to be w_j minus alpha times the derivative of the cost J, where J has parameters w_1 through w_n and b. Once again, we just write this as J of vector w and number b.
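Written out, the updates being described are, repeated until convergence and applied simultaneously for every parameter:

$$w_j \leftarrow w_j - \alpha \frac{\partial}{\partial w_j} J(\vec{w}, b) \quad \text{for } j = 1, \ldots, n, \qquad b \leftarrow b - \alpha \frac{\partial}{\partial b} J(\vec{w}, b)$$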
Let's see what this looks like when you implement gradient descent and, in particular, let's take a look at the derivative term. We'll see that gradient descent becomes just a little bit different with multiple features compared to just one feature.
Here's what we had when we had gradient descent with one feature. We had an update rule for w and a separate update rule for b. Hopefully, these look familiar to you. This term here is the derivative of the cost function J with respect to the parameter w. Similarly, we have an update rule for the parameter b. With univariate regression, we had only one feature. We called that feature x^(i), without any subscript.
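For reference, the one-feature updates being recalled here, assuming the squared-error cost written above, are:

$$w \leftarrow w - \alpha \frac{1}{m}\sum_{i=1}^{m}\big(f_{w,b}(x^{(i)}) - y^{(i)}\big)\,x^{(i)}, \qquad b \leftarrow b - \alpha \frac{1}{m}\sum_{i=1}^{m}\big(f_{w,b}(x^{(i)}) - y^{(i)}\big)$$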
Now, here's the new notation for when we have n features, where n is two or more. We get this update rule for gradient descent: update w_1 to be w_1 minus alpha times this expression here, and this formula is actually the derivative of the cost J with respect to w_1. The formula for the derivative of J with respect to w_1 on the right looks very similar to the case of one feature on the left. The error term still takes a prediction f(x) minus the target y. One difference is that w and x are now vectors, and just as w on the left has become w_1 here on the right, x^(i) here on the left is now x_1^(i) here on the right, and this is just for j equals 1. For multiple linear regression, we have j ranging from 1 through n, and so we'll update the parameters w_1, w_2, all the way up to w_n, and then, as before, we'll update b.
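Spelled out, the full set of updates for n features, again assuming the squared-error cost, is:

$$w_j \leftarrow w_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\big(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\big)\,x_j^{(i)} \quad \text{for } j = 1, \ldots, n, \qquad b \leftarrow b - \alpha \frac{1}{m}\sum_{i=1}^{m}\big(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\big)$$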
If you implement this, you get gradient descent for multiple regression. That's it for gradient descent for multiple regression.
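Here is a minimal NumPy sketch of what that implementation can look like (illustrative only; the names `gradient_descent`, `alpha`, and `num_iters` are chosen for this example and are not necessarily the optional lab's):

```python
def gradient_descent(X, y, w, b, alpha, num_iters):
    """Batch gradient descent for multiple linear regression (a sketch).

    X : (m, n) training examples, y : (m,) targets,
    w : (n,) initial weights, b : initial bias (scalar),
    alpha : learning rate, num_iters : number of update steps.
    """
    m = X.shape[0]
    for _ in range(num_iters):
        errors = X @ w + b - y        # f_wb(x^(i)) - y^(i) for all i
        dj_dw = (X.T @ errors) / m    # partial derivatives w.r.t. each w_j
        dj_db = np.sum(errors) / m    # partial derivative w.r.t. b
        w = w - alpha * dj_dw         # update all w_j at once...
        b = b - alpha * dj_db         # ...and b, using the same errors vector
    return w, b
```

Because every partial derivative is computed from the same `errors` vector before any parameter changes, all n + 1 parameters are updated simultaneously, which is what the update rule calls for.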
Before moving on from this video, I want to make a quick aside, or a quick side note, on an alternative way of finding w and b for linear regression. This method is called the normal equation. Whereas gradient descent is a great method for minimizing the cost function J to find w and b, there is one other algorithm that works only for linear regression, and for pretty much none of the other algorithms you'll see in this specialization, for solving for w and b, and this other method does not need an iterative gradient descent algorithm. Called the normal equation method, it turns out to be possible to use an advanced linear algebra library to just solve for w and b all in one go, without iterations.
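For the curious, the "solve it in one go" idea can be sketched as follows; this is an illustration under my own choices (appending a column of ones so b is solved alongside w, and using NumPy's least-squares solver rather than an explicit matrix inverse), not the course's implementation:

```python
def normal_equation_fit(X, y):
    """Closed-form least-squares fit of w and b, with no iterations.

    X : (m, n) training examples, y : (m,) targets.
    """
    m = X.shape[0]
    X_aug = np.hstack([X, np.ones((m, 1))])            # extra column absorbs b
    theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)  # solve min ||X_aug @ theta - y||
    w, b = theta[:-1], theta[-1]
    return w, b
```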
Some disadvantages of the normal equation method are: first, unlike gradient descent, it does not generalize to other learning algorithms, such as the logistic regression algorithm that you'll learn about next week, or the neural networks or other algorithms you'll see later in this specialization. The normal equation method is also quite slow if the number of features, n, is large. Almost no machine learning practitioners should implement the normal equation method themselves, but if you're using a mature machine learning library and call linear regression, there is a chance that, on the back end, it'll be using this to solve for w and b.
If you're ever in a job interview and hear the term normal equation, that's what it refers to. Don't worry about the details of how the normal equation works. Just be aware that some machine learning libraries may use this complicated method in the back end to solve for w and b. But for most learning algorithms, including how you implement linear regression yourself, gradient descent offers a better way to get the job done.
In the optional lab that follows this video, you'll see how to define a multiple regression model in code, and also how to calculate the prediction f of x. You'll also see how to calculate the cost and implement gradient descent for a multiple linear regression model. This will be using Python's NumPy library. If any of the code looks very new, that's okay, but you should also feel free to take a look at the previous optional lab that introduces NumPy and vectorization, for a refresher on NumPy functions and how to implement them in code.
That's it. You now know multiple linear regression. This is probably the single most widely used learning algorithm in the world today. But there's more. With just a few tricks, such as picking and scaling features appropriately and also choosing the learning rate alpha appropriately, you can really make this work much better. Just a few more videos to go for this week. Let's go on to the next video to see those little tricks that will help you make multiple linear regression work much better.