1
00:00:00,650 --> 00:00:02,940
In this video, we'll figure out
2
00:00:02,940 --> 00:00:04,470
how to get gradient descent to
3
00:00:04,470 --> 00:00:08,090
work with regularized linear
regression. Let's jump in.
4
00:00:08,090 --> 00:00:10,860
Here is the cost function
we came up with in
5
00:00:10,860 --> 00:00:14,130
the last video for regularized
linear regression.
6
00:00:14,130 --> 00:00:18,285
The first part is the usual
squared error cost function,
7
00:00:18,285 --> 00:00:21,810
and now you have this
additional regularization term,
8
00:00:21,810 --> 00:00:24,945
where Lambda is the
regularization parameter,
9
00:00:24,945 --> 00:00:28,020
and you'd like to
find parameters w and
10
00:00:28,020 --> 00:00:32,250
b that minimize the
regularized cost function.
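For reference, a minimal NumPy sketch of this regularized cost function follows. It is not code from the video; the name regularized_cost and the argument layout are assumptions, with f taken to be the linear model w dot x plus b.

```python
import numpy as np

def regularized_cost(X, y, w, b, lambda_):
    """Squared error cost plus the (lambda / 2m) * sum of w_j squared term."""
    m = X.shape[0]
    predictions = X @ w + b                                  # f(x) = w . x + b for every example
    squared_error = np.sum((predictions - y) ** 2) / (2 * m)
    regularization = (lambda_ / (2 * m)) * np.sum(w ** 2)    # only the weights w_j are penalized
    return squared_error + regularization
```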
11
00:00:32,250 --> 00:00:34,590
Previously we were using
12
00:00:34,590 --> 00:00:37,905
gradient descent for the
original cost function,
13
00:00:37,905 --> 00:00:39,570
just the first term before we
14
00:00:39,570 --> 00:00:42,450
added that second
regularization term,
15
00:00:42,450 --> 00:00:45,290
and previously, we had
16
00:00:45,290 --> 00:00:47,735
the following gradient
descent algorithm,
17
00:00:47,735 --> 00:00:51,230
which is that we repeatedly
update the parameters w_j
18
00:00:51,230 --> 00:00:54,260
and b for j equals 1 through n
19
00:00:54,260 --> 00:00:55,820
according to this formula
20
00:00:55,820 --> 00:00:58,675
and b is also updated similarly.
21
00:00:58,675 --> 00:01:00,350
Again, Alpha is
22
00:01:00,350 --> 00:01:03,800
a very small positive number
called the learning rate.
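For reference, the unregularized update rule being described can be written out as follows; this is a reconstruction from the surrounding narration, since the slide itself is not in the transcript.

$$
\begin{aligned}
w_j &:= w_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\bigl(f_{w,b}(x^{(i)}) - y^{(i)}\bigr)x_j^{(i)}, \quad j = 1,\dots,n \\
b &:= b - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\bigl(f_{w,b}(x^{(i)}) - y^{(i)}\bigr)
\end{aligned}
$$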
23
00:01:03,800 --> 00:01:05,960
In fact, the updates for
24
00:01:05,960 --> 00:01:08,720
regularized linear regression
look exactly the same,
25
00:01:08,720 --> 00:01:10,700
except that now the cost,
26
00:01:10,700 --> 00:01:13,500
J, is defined a bit differently.
27
00:01:13,500 --> 00:01:16,940
Previously the derivative
of J with respect
28
00:01:16,940 --> 00:01:20,765
to w_j was given by this
expression over here,
29
00:01:20,765 --> 00:01:23,465
and the derivative
with respect to b was
30
00:01:23,465 --> 00:01:26,560
given by this
expression over here.
31
00:01:26,560 --> 00:01:30,605
Now that we've added this
additional regularization term,
32
00:01:30,605 --> 00:01:33,770
the only thing that changes
is that the expression for
33
00:01:33,770 --> 00:01:35,060
the derivative with respect to
34
00:01:35,060 --> 00:01:38,555
w_j ends up with one
additional term,
35
00:01:38,555 --> 00:01:43,475
this plus Lambda
over m times w_j.
36
00:01:43,475 --> 00:01:45,230
And in particular for
37
00:01:45,230 --> 00:01:47,645
the new definition of
the cost function J,
38
00:01:47,645 --> 00:01:50,285
these two expressions over here,
39
00:01:50,285 --> 00:01:53,990
these are the new derivatives
of J with respect
40
00:01:53,990 --> 00:01:58,630
to w_j and the derivative
of J with respect to b.
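Written out, the two derivative expressions being pointed to are, reconstructed from the narration (the slide itself is not in the transcript):

$$
\frac{\partial J}{\partial w_j} = \frac{1}{m}\sum_{i=1}^{m}\bigl(f_{w,b}(x^{(i)}) - y^{(i)}\bigr)x_j^{(i)} + \frac{\lambda}{m} w_j,
\qquad
\frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}\bigl(f_{w,b}(x^{(i)}) - y^{(i)}\bigr)
$$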
41
00:01:58,630 --> 00:02:01,995
Recall that we
don't regularize b,
42
00:02:01,995 --> 00:02:04,230
so we're not trying to shrink b.
43
00:02:04,230 --> 00:02:07,925
That's why the update for b
remains the same as before,
44
00:02:07,925 --> 00:02:10,760
whereas the update for
w_j changes because
45
00:02:10,760 --> 00:02:15,740
the regularization term causes
us to try to shrink w_j.
46
00:02:16,010 --> 00:02:18,620
Let's take these definitions for
47
00:02:18,620 --> 00:02:21,140
the derivatives and
put them back into
48
00:02:21,140 --> 00:02:23,780
the expression on the
left to write out
49
00:02:23,780 --> 00:02:25,565
the gradient descent algorithm
50
00:02:25,565 --> 00:02:28,675
for regularized
linear regression.
51
00:02:28,675 --> 00:02:31,460
To implement gradient descent
52
00:02:31,460 --> 00:02:33,545
for regularized
linear regression,
53
00:02:33,545 --> 00:02:36,940
this is what you would
have your code do.
54
00:02:36,940 --> 00:02:39,110
Here is the update for w_j,
55
00:02:39,110 --> 00:02:40,640
for j equals 1 through n,
56
00:02:40,640 --> 00:02:42,935
and here's the update for b.
57
00:02:42,935 --> 00:02:45,950
As usual, please
remember to carry out
58
00:02:45,950 --> 00:02:49,564
simultaneous updates for
all of these parameters.
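One way you might code these updates in NumPy is sketched below. This is not the course's lab code; the names compute_gradients, gradient_descent, and num_iters are assumptions. The simultaneous update is handled by computing both gradients before changing either parameter.

```python
import numpy as np

def compute_gradients(X, y, w, b, lambda_):
    """Gradients of the regularized cost with respect to w and b."""
    m = X.shape[0]
    error = X @ w + b - y                          # f(x^(i)) - y^(i) for every example
    dj_dw = (X.T @ error) / m + (lambda_ / m) * w  # extra lambda/m * w_j term from regularization
    dj_db = np.sum(error) / m                      # b is not regularized, so no extra term
    return dj_dw, dj_db

def gradient_descent(X, y, w, b, alpha, lambda_, num_iters):
    """Repeat the regularized update rules for w_j (j = 1..n) and b."""
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradients(X, y, w, b, lambda_)
        w = w - alpha * dj_dw   # both gradients were computed above,
        b = b - alpha * dj_db   # so w and b are updated simultaneously
    return w, b
```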
59
00:02:49,564 --> 00:02:53,285
Now, in order for you to
get this algorithm to work,
60
00:02:53,285 --> 00:02:55,375
this is all you need to know.
61
00:02:55,375 --> 00:02:56,780
But what I'd like to do in
62
00:02:56,780 --> 00:02:58,730
the remainder of this
video is to go over
63
00:02:58,730 --> 00:03:01,100
some optional material to convey
64
00:03:01,100 --> 00:03:02,900
a slightly deeper
intuition about
65
00:03:02,900 --> 00:03:05,480
what this formula
is actually doing,
66
00:03:05,480 --> 00:03:07,460
as well as chat
briefly about how
67
00:03:07,460 --> 00:03:09,685
these derivatives are derived.
68
00:03:09,685 --> 00:03:12,450
The rest of this video
is completely optional.
69
00:03:12,450 --> 00:03:16,160
It's completely okay if you
skip the rest of this video,
70
00:03:16,160 --> 00:03:18,470
but if you have a
strong interest
71
00:03:18,470 --> 00:03:19,670
in math, then stick with me.
72
00:03:19,670 --> 00:03:21,730
It is always nice to
hang out with you here,
73
00:03:21,730 --> 00:03:23,885
and through these equations,
74
00:03:23,885 --> 00:03:26,255
perhaps you can build
a deeper intuition
75
00:03:26,255 --> 00:03:27,620
about what the math and what
76
00:03:27,620 --> 00:03:29,600
the derivatives
are doing as well.
77
00:03:29,600 --> 00:03:32,720
Let's take a look. Let's look at
78
00:03:32,720 --> 00:03:37,040
the update rule for w_j and
rewrite it in another way.
79
00:03:37,040 --> 00:03:42,125
We're updating w_j as 1 times
80
00:03:42,125 --> 00:03:49,035
w_j minus Alpha times
Lambda over m times w_j.
81
00:03:49,035 --> 00:03:52,710
I've moved the term from
the end to the front here.
82
00:03:52,710 --> 00:03:57,155
Then minus Alpha times 1 over m,
83
00:03:57,155 --> 00:04:00,535
and then the rest of
that term over there.
84
00:04:00,535 --> 00:04:04,185
We just rearranged the
terms a little bit.
85
00:04:04,185 --> 00:04:08,660
If we simplify, then we're
saying that w_j is updated
86
00:04:08,660 --> 00:04:14,905
as w_j times 1 minus Alpha
times Lambda over m,
87
00:04:14,905 --> 00:04:19,610
minus Alpha times this
other term over here.
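Spelled out, the rearrangement being described is just algebra on the regularized update rule:

$$
w_j := w_j - \alpha\Bigl[\frac{1}{m}\sum_{i=1}^{m}\bigl(f_{w,b}(x^{(i)}) - y^{(i)}\bigr)x_j^{(i)} + \frac{\lambda}{m} w_j\Bigr]
= w_j\Bigl(1 - \alpha\frac{\lambda}{m}\Bigr) - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\bigl(f_{w,b}(x^{(i)}) - y^{(i)}\bigr)x_j^{(i)}
$$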
88
00:04:19,610 --> 00:04:22,610
You might recognize
the second term as
89
00:04:22,610 --> 00:04:25,040
the usual gradient
descent update
90
00:04:25,040 --> 00:04:27,680
for unregularized
linear regression.
91
00:04:27,680 --> 00:04:29,150
This is the update for
92
00:04:29,150 --> 00:04:32,420
linear regression before
we had regularization,
93
00:04:32,420 --> 00:04:37,130
and this is the term we saw
in Week 2 of this course.
94
00:04:37,130 --> 00:04:39,050
The only change when we add
95
00:04:39,050 --> 00:04:42,610
regularization is that
instead of w_j being set
96
00:04:42,610 --> 00:04:46,850
to be equal to w_j
minus Alpha times
97
00:04:46,850 --> 00:04:48,470
this term, it is now
98
00:04:48,470 --> 00:04:52,875
w_j times this number
minus the usual update.
99
00:04:52,875 --> 00:04:56,460
This is what we had in
Week 1 of this course.
100
00:04:56,460 --> 00:04:59,505
What is this first
term over here?
101
00:04:59,505 --> 00:05:05,115
Well, Alpha is a very small
positive number, say 0.01.
102
00:05:05,115 --> 00:05:07,625
Lambda is usually
a small number,
103
00:05:07,625 --> 00:05:10,625
say 1 or maybe 10.
104
00:05:10,625 --> 00:05:13,905
Let's say lambda is
1 for this example
105
00:05:13,905 --> 00:05:17,925
and m is the training
set size, say 50.
106
00:05:17,925 --> 00:05:22,470
When you multiply
Alpha Lambda over m,
107
00:05:22,470 --> 00:05:27,810
say 0.01 times 1 divided by 50,
108
00:05:27,810 --> 00:05:30,990
this term ends up being
a small positive number
109
00:05:30,990 --> 00:05:34,395
say 0.0002,
110
00:05:34,395 --> 00:05:37,370
and thus, 1 minus Alpha Lambda
111
00:05:37,370 --> 00:05:38,660
over m is going to be
112
00:05:38,660 --> 00:05:40,595
a number just
slightly less than 1,
113
00:05:40,595 --> 00:05:43,700
in this case, 0.9998.
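The arithmetic in this example is easy to check; the values 0.01, 1, and 50 are just the illustrative numbers quoted above, not values fixed by the algorithm.

```python
alpha, lambda_, m = 0.01, 1, 50
shrink = alpha * lambda_ / m   # 0.01 * 1 / 50 = 0.0002
print(shrink, 1 - shrink)      # 0.0002 0.9998
```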
114
00:05:43,700 --> 00:05:46,205
The effect of this term is that
115
00:05:46,205 --> 00:05:49,220
on every single iteration
of gradient descent,
116
00:05:49,220 --> 00:05:54,515
you're taking w_j and
multiplying it by 0.9998,
117
00:05:54,515 --> 00:05:56,720
that is, by some number
slightly less than
118
00:05:56,720 --> 00:05:59,785
one, and then carrying
out the usual update.
119
00:05:59,785 --> 00:06:03,260
What regularization is doing
on every single iteration
120
00:06:03,260 --> 00:06:05,420
is that you're multiplying w_j
121
00:06:05,420 --> 00:06:07,580
by a number slightly
less than 1,
122
00:06:07,580 --> 00:06:09,770
and that has effect of shrinking
123
00:06:09,770 --> 00:06:13,075
the value of w_j
just a little bit.
124
00:06:13,075 --> 00:06:15,530
This gives us
another view on why
125
00:06:15,530 --> 00:06:17,690
regularization has the effect of
126
00:06:17,690 --> 00:06:19,760
shrinking the parameters w_j
127
00:06:19,760 --> 00:06:21,980
a little bit on every iteration,
128
00:06:21,980 --> 00:06:24,985
and so that's how
regularization works.
129
00:06:24,985 --> 00:06:26,960
If you're curious about how
130
00:06:26,960 --> 00:06:29,030
these derivative
terms were computed,
131
00:06:29,030 --> 00:06:32,330
I have just one last optional
slide that goes through
132
00:06:32,330 --> 00:06:33,440
just a little bit of
133
00:06:33,440 --> 00:06:35,815
the calculation of the
derivative term.
134
00:06:35,815 --> 00:06:37,940
Again, this slide
and the rest of
135
00:06:37,940 --> 00:06:40,085
this video are
completely optional,
136
00:06:40,085 --> 00:06:41,990
meaning you won't
need any of this to
137
00:06:41,990 --> 00:06:44,795
do the practice labs
and the quizzes.
138
00:06:44,795 --> 00:06:48,740
Let's quickly step through
the derivative calculation.
139
00:06:48,740 --> 00:06:55,830
The derivative of J with
respect to w_j looks like this.
140
00:07:05,630 --> 00:07:10,300
Recall that f of x for linear
regression is defined as
141
00:07:10,300 --> 00:07:16,290
w dot x plus b, that is, the
dot product of w and x, plus b.
142
00:07:16,290 --> 00:07:19,540
It turns out that by
the rules of calculus,
143
00:07:19,540 --> 00:07:21,700
the derivative looks like this:
144
00:07:21,700 --> 00:07:29,335
it is 1 over 2m times the sum
i equals 1 through m of
145
00:07:29,335 --> 00:07:33,655
w dot x plus b minus y
146
00:07:33,655 --> 00:07:37,585
times 2x_j plus the derivative
147
00:07:37,585 --> 00:07:39,205
of the regularization term,
148
00:07:39,205 --> 00:07:45,800
which is Lambda over
2m times 2 w_j.
149
00:07:45,800 --> 00:07:49,340
Notice that the second
term does not have
150
00:07:49,340 --> 00:07:53,495
the summation term from j
equals 1 through n anymore.
151
00:07:53,495 --> 00:07:56,675
The 2's cancel out
here and here,
152
00:07:56,675 --> 00:07:59,065
and also here and here.
153
00:07:59,065 --> 00:08:04,570
It simplifies to this
expression over here.
154
00:08:07,040 --> 00:08:12,735
Finally, remember that
w dot x plus b is f of x,
155
00:08:12,735 --> 00:08:17,350
and so you can rewrite it as
this expression down here.
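Putting the spoken derivation into symbols (a transcription of the algebra being described, since the slide is not in the transcript):

$$
\frac{\partial J}{\partial w_j}
= \frac{1}{2m}\sum_{i=1}^{m}\bigl(w\cdot x^{(i)} + b - y^{(i)}\bigr)\,2x_j^{(i)} + \frac{\lambda}{2m}\,2w_j
= \frac{1}{m}\sum_{i=1}^{m}\bigl(f_{w,b}(x^{(i)}) - y^{(i)}\bigr)x_j^{(i)} + \frac{\lambda}{m} w_j
$$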
156
00:08:17,350 --> 00:08:20,510
This is why this
expression is used to
157
00:08:20,510 --> 00:08:24,790
compute the gradient in
regularized linear regression.
158
00:08:24,790 --> 00:08:26,750
You now know how to
159
00:08:26,750 --> 00:08:29,045
implement regularized
linear regression.
160
00:08:29,045 --> 00:08:31,910
Using this, you can really
reduce overfitting when you
161
00:08:31,910 --> 00:08:33,080
have a lot of features and
162
00:08:33,080 --> 00:08:35,095
a relatively small training set.
163
00:08:35,095 --> 00:08:37,700
This should let you get
linear regression to
164
00:08:37,700 --> 00:08:40,390
work much better
on many problems.
165
00:08:40,390 --> 00:08:41,945
In the next video,
166
00:08:41,945 --> 00:08:44,840
we'll take this regularization
idea and apply it to
167
00:08:44,840 --> 00:08:46,745
logistic regression to avoid
168
00:08:46,745 --> 00:08:49,010
overfitting for logistic
regression as well.
169
00:08:49,010 --> 00:08:52,110
Let's take a look at
that in the next video.