Kylie Ying has worked at many interesting places such as MIT, CERN, and freeCodeCamp. She's a physicist, engineer, and basically a genius, and now she's going to teach you about machine learning in a way that is accessible to absolute beginners.
What's up, you guys! So welcome to Machine Learning for Everyone. If you are someone who is interested in machine learning, and you think you are considered "everyone," then this video is for you.
In this video we'll talk about supervised and unsupervised learning models, we'll go through a little bit of the logic and math behind them, and then we'll also see how we can program them in Google Colab.
If there are certain things that I could have done better, and you're somebody with more experience than me, please feel free to correct me in the comments, and we can all, as a community, learn from this together.
So with that, without wasting any time, let's just dive straight into the code, and I will be teaching you the concepts as we go.
So this here is the UCI Machine Learning Repository. Basically, they have a ton of datasets that we can access, and I found this really cool one called the MAGIC Gamma Telescope dataset.
So, in this dataset, if you don't want to read all this information, to summarize what I think is going on: there's this gamma telescope, and we have all these high-energy particles hitting the telescope. Now, there's a camera, a detector, that records certain patterns of how this light hits the camera, and we can use properties of those patterns to predict what type of particle caused that radiation: whether it was a gamma particle or some other particle, like a hadron.
Down here are all of the attributes of those patterns that we collect in the camera. You can see that there's, you know, some length, width, size, asymmetry, etc. Now, we're going to use all these properties to help us discriminate between the patterns: whether or not they came from a gamma particle or a hadron.
So, in order to do this, we're going to come up here, go to the Data Folder, click this magic04.data file, and download it.
Now, over here I have a Colab notebook open. You go to colab.research.google.com and start a new notebook. I'm just going to call this, um, the MAGIC dataset; so actually, I'm going to call it "fcc magic example." Okay.
So with that, I'm going to first start with some imports. I always import NumPy, I always import pandas, and I always import Matplotlib; and then we'll import other things as we go.
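Those three standard imports, under their usual aliases, would look something like this (a minimal sketch of the cell described above):

```python
import numpy as np               # numerical arrays and math
import pandas as pd              # DataFrames for tabular data
import matplotlib.pyplot as plt  # plotting
```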
So we run that. In order to run a cell, you can either click the play button here, or, on my computer, it's just Shift+Enter, and that will run the cell.
And here I'm just going to let you guys know where I found the dataset. Um, so I've copied and pasted the link, but this is just where I found the dataset.
And in order to import that downloaded file that we got onto the computer, we're going to go over here to this folder icon, and I am literally just going to drag and drop that file into here. Okay.
So, in order to take a look at what this file consists of (do we have the labels, do we not?), I mean, we could open it on our computer, but we can also just call pandas' read_csv and pass in the name of this file, and see what it returns. It doesn't seem like we have the labels, so let's go back to the dataset page.
I'm just going to make the column labels all of these attribute names over here, so I'm just going to take these values and make them the column names. All right, how do I do that?
So basically, I will come back here and I will create a list called cols, and I will type in all of those attribute names: fLength, fWidth, fSize, fConc, fConc1, fAsym, fM3Long, fM3Trans, fAlpha, fDist, and class. Okay, great.
Now, in order to assign those as the column labels down here in our DataFrame: so basically, this command here just reads in some CSV file that you pass it (CSV stands for comma-separated values) and turns it into a pandas DataFrame object. So now, if I pass in a names argument here, it basically assigns these labels to the columns of this dataset. So I'm going to set this DataFrame equal to df, and then if we call df.head(), which just means "give me the first five entries," you'll see that we now have labels for all of these.
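Put together, the cell might look like the sketch below. The column names come from the UCI attribute list; since the downloaded `magic04.data` isn't available here, a two-row in-memory stand-in with illustrative values takes its place so the snippet runs on its own:

```python
import io
import pandas as pd

# Column names from the UCI attribute list; magic04.data itself has no header row.
cols = ["fLength", "fWidth", "fSize", "fConc", "fConc1",
        "fAsym", "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]

# In the notebook this would be: df = pd.read_csv("magic04.data", names=cols)
# Here, a two-row stand-in (illustrative values) replaces the real file:
sample = io.StringIO(
    "28.7967,16.0021,2.6449,0.3918,0.1982,27.7004,22.011,-8.2027,40.092,81.8828,g\n"
    "31.6036,11.7235,2.5185,0.5303,0.3773,26.2722,23.8238,-9.9574,6.3609,205.261,h\n"
)
df = pd.read_csv(sample, names=cols)
print(df.head())  # now every column has a label
```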
Okay, all right, great. So one thing that you might notice is that over here, the class labels are g and h. If I actually go down here and call df["class"].unique(), you'll see that I have either g's or h's, and these stand for gammas or hadrons.
Um, and our computer is not so good at understanding letters, right? Our computer's really good at understanding numbers. So what we're going to do is convert this column to ones and zeros. So here I'm going to set this column equal to whether or not it equals g, and then I'm just going to call .astype(int). What this should do is convert the entire column: if a value equals g, then the comparison is True, so that becomes 1, and if it's h, it's False, so that becomes 0. I'm just converting g and h to ones and zeros; it doesn't really matter whether g is 1 and h is 0 or vice versa.
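That conversion is a one-liner; here's a sketch with a toy three-row column standing in for the real one:

```python
import pandas as pd

df = pd.DataFrame({"class": ["g", "h", "g"]})  # toy stand-in for the real column

# The comparison yields booleans; .astype(int) turns True into 1 and False into 0,
# so g becomes 1 and h becomes 0 (which class gets the 1 doesn't matter here).
df["class"] = (df["class"] == "g").astype(int)
print(df["class"].tolist())  # [1, 0, 1]
```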
Let me just take a step back right now and talk about this dataset. So here I have some DataFrame, and I have all of these different values for each entry. Now, each of these rows is one sample; it's one example, it's one item in our dataset, it's one data point. All these terms mean roughly the same thing, so keep that in mind when I mention "this is one example" or "this is one sample" or whatever.
Now, each of these samples has one value for each of these labels up here, and then it has the class. What we're going to do in this specific example is try to predict, for future samples, whether the class is g for gamma or h for hadron, and that is something known as classification. Now, all of these columns up here are known as our features, and features are just things that we're going to pass into our model in order to help us predict the label, which in this case is the class column. So for, you know, sample 0, I have 10 different features: 10 different values that I can pass into some model which spits out, you know, the class, the label. And I know the true label here is g, so this is actually supervised learning.
All right. So before I move on, let me just give you a quick little crash course on what I just said. This is machine learning for everyone. Well, the first question is: what is machine learning? Well, machine learning is a sub-domain of computer science that focuses on certain algorithms which might help a computer learn from data, without a programmer being there telling the computer exactly what to do (that approach is what we call explicit programming).
So you might have heard of AI and ML and data science. What is the difference between all of these? So, AI is artificial intelligence, and that's an area of computer science where the goal is to enable computers and machines to perform human-like tasks and simulate human behavior. Now, machine learning is a subset of AI that tries to solve one specific problem and make predictions using certain data. And data science is a field that attempts to find patterns and draw insights from data, and that might mean we're using machine learning. So all of these fields kind of overlap, and all of them might use machine learning.
So there are a few types of machine learning. The first one is supervised learning, and in supervised learning we're using labeled inputs. This means that for whatever input we get, we have a corresponding output label, which we can use to train models and to learn the outputs of different new inputs that we might feed our model. So, for example, I might have these pictures. Okay, to a computer, all these pictures are pixels; they're pixels with a certain color. Now, in supervised learning, all of these inputs have a label associated with them; this is the output that we might want the computer to be able to predict. So, for example, over here this picture is a cat, this picture is a dog, and this picture is a lizard.
Now, there's also unsupervised learning, and in unsupervised learning we use unlabeled data to learn about patterns in the data. So here are my input data points again; they're just images, they're just pixels. Well, okay, let's say I have a bunch of these different pictures, and what I can do is feed all of these to my computer. Now, my computer's not going to be able to say "oh, this is a cat, a dog, and a lizard" in terms of, you know, the output, but it might be able to cluster all these pictures. It might say, hey, all of these have something in common, all of those have something in common, and then these down here have something in common. That's finding some sort of structure in our unlabeled data.
And finally, we have reinforcement learning. In reinforcement learning, there's usually an agent that is learning in some sort of interactive environment, based on rewards and penalties. So let's think of a dog. We can train our dog, but there's not necessarily, you know, any wrong or right output at any given moment, right? Well, let's pretend that dog is a computer. Essentially, what we're doing is giving rewards to our computer and telling our computer, hey, this is probably something good that you want to keep doing. (Well, computer, agent: same idea, it's just terminology.)
But in this class today, we'll be focusing on supervised learning and unsupervised learning, and learning different models for each of those.
All right, so let's talk about supervised learning first. So this is kind of what a machine learning model looks like: you have a bunch of inputs that are going into some model, and then the model is spitting out an output, which is our prediction. All these inputs together are what we call the feature vector.
Now, there are different types of features that we can have. We might have qualitative features, and qualitative means categorical data: there's either a finite number of categories or groups. So one example of a qualitative feature might be gender, and in this case there are only two here; it's for the sake of the example, and I know this might be a little bit outdated. Here we have a girl and a boy: there are two genders, two different categories. That's a piece of qualitative data. Another example might be, okay, we have, you know, a bunch of different nationalities; maybe a nationality, or a nation, or a location might also be an example of categorical data. Now, in both of these there's no inherent order. It's not like, you know, we can rate the US as one, France as two, Japan as three, et cetera, right? There's not really any inherent order built into either of these categorical datasets, and that's why we call this nominal data.
Now, for nominal data, the way that we want to feed it into our computer is using something called one-hot encoding. So let's say that, you know, I have a dataset, and some of the items in our data, some of the inputs, might be from the US, some might be from India, then Canada, then France. Now, how do we get our computer to recognize that? We have to do something called one-hot encoding, and basically one-hot encoding is saying: okay, well, if it matches some category, make that a one, and if it doesn't, just make that a zero. So, for example, if your input were from the US, you might have [1, 0, 0, 0]; India, you know, [0, 1, 0, 0]; Canada, okay, well, the item representing Canada is one; and then France, the item representing France is one. And then you can see that the rest are zeros. That's one-hot encoding.
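In pandas, one-hot encoding is exactly what `get_dummies` does; here's a small sketch using the four countries from the example (the `dtype=int` argument just asks for 1/0 instead of True/False):

```python
import pandas as pd

countries = pd.Series(["US", "India", "Canada", "France"])

# One indicator column per category: 1 where the row matches that category, else 0.
one_hot = pd.get_dummies(countries, dtype=int)
print(one_hot)
```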
Now, there's also a different type of qualitative feature. So here on the left there are different age groups: there's babies, toddlers, teenagers, young adults, adults, and so on, right? And on the right-hand side we might have different ratings: so maybe bad, not so good, mediocre, good, and then, like, great. Now, these are known as ordinal pieces of data, because they have some sort of inherent order. Right? Like, being a toddler is a lot closer to being a baby than to being an elderly person, right? Or good is closer to great than it is to really bad. So these have some sort of inherent ordering system, and so for these types of datasets we can actually just mark them from, you know, one to five; we can just say, hey, for each of these, let's give it a number.
00:15:57,680 --> 00:16:00,160
and this makes sense because
407
00:16:00,160 --> 00:16:01,440
like
408
00:16:01,440 --> 00:16:03,360
for example the thing that i just said
409
00:16:03,360 --> 00:16:05,759
how good is closer to great
410
00:16:05,759 --> 00:16:08,800
then good is close to not good at all
411
00:16:08,800 --> 00:16:10,560
well four is closer to five then four is
412
00:16:10,560 --> 00:16:12,399
close to one so this actually kind of
413
00:16:12,399 --> 00:16:14,480
makes sense and it'll make sense for the
414
00:16:14,480 --> 00:16:17,360
computer as well
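For ordinal data like these ratings, a plain mapping from label to number preserves that order; here's a sketch (the rating labels are paraphrased from the example, and the 1 to 5 scale is hand-picked):

```python
import pandas as pd

ratings = pd.Series(["bad", "mediocre", "great", "good"])

# Hand-picked 1-5 scale; distances between the codes mirror the inherent order.
scale = {"bad": 1, "not so good": 2, "mediocre": 3, "good": 4, "great": 5}
encoded = ratings.map(scale)
print(encoded.tolist())  # [1, 3, 5, 4]
```

Notice that, with this encoding, "good" (4) really does sit closer to "great" (5) than to "bad" (1), which is the whole point the paragraph above makes.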
All right, there are also quantitative pieces of data, and quantitative pieces of data are numerical-valued pieces of data. So this could be discrete, which means, you know, they might be integers, or it could be continuous, which means all real numbers. So, for example, the length of something is a quantitative piece of data; it's a quantitative feature. The temperature of something is a quantitative feature. And then maybe how many Easter eggs I collected in my basket this Easter egg hunt: that is an example of a discrete quantitative feature. Okay, so length and temperature are continuous, and the egg count over here is discrete.
00:17:01,680 --> 00:17:04,959
so those are the things that go into our
435
00:17:04,959 --> 00:17:07,439
feature vector those are our features
436
00:17:07,439 --> 00:17:09,520
that we're feeding this model because
437
00:17:09,520 --> 00:17:12,319
our computers are really really good
438
00:17:12,319 --> 00:17:14,559
at understanding math right at
439
00:17:14,559 --> 00:17:16,880
understanding numbers they're not so
440
00:17:16,880 --> 00:17:19,280
good at understanding things that humans
441
00:17:19,280 --> 00:17:22,640
might be able to understand
Well, what are the types of predictions that our model can output? So, in supervised learning, there are some different tasks. There's one: classification. And basically classification is just saying, okay, predict discrete classes. That might mean, you know, this is a hot dog, this is pizza, and this is ice cream. Okay, so there are three distinct classes, and any other pictures of hot dogs, pizza, or ice cream I can put under these labels: hot dog, pizza, ice cream. This is something known as multi-class classification. But there's also binary classification, and in binary classification you might have hot dog or not hot dog. So there are only two categories that you're working with: something that is the thing, and something that isn't. That's binary classification.
Okay, so, yeah, other examples. So if something has positive or negative sentiment, that's binary classification. Maybe you're predicting, for pictures, whether they're cats or dogs: that's binary classification. Maybe, you know, you are writing an email filter and you're trying to figure out if an email is spam or not spam: so that's also binary classification. Now, for multi-class classification, you might have, you know, cat, dog, lizard, dolphin, shark, rabbit, etc. Um, we might have different types of fruits, so like orange, apple, pear, etc., and then maybe different plant species. But multi-class classification just means more than two, okay, and binary means we're predicting between two things.
There's also something called regression, when we talk about supervised learning, and this just means we're trying to predict continuous values. So instead of just trying to predict different categories, we're trying to come up with a number that, you know, is on some sort of scale. So some examples might be the price of Ethereum tomorrow, or it might be, okay, what is the temperature going to be, or it might be, what is the price of this house? Right? So these things don't really fit into discrete classes; we're trying to predict a number that's as close to the true value as possible, using different features of our dataset.
510
00:19:48,960 --> 00:19:51,039
so that's exactly what our model looks
511
00:19:51,039 --> 00:19:53,840
like in supervised learning
512
00:19:53,840 --> 00:19:57,200
now let's talk about the model itself
513
00:19:57,200 --> 00:19:59,760
how do we make this model learn
514
00:19:59,760 --> 00:20:02,160
or how can we tell whether or not it's
515
00:20:02,160 --> 00:20:03,360
even learning
516
00:20:03,360 --> 00:20:05,600
so before we talk about the models
517
00:20:05,600 --> 00:20:07,679
let's talk about how can we actually
518
00:20:07,679 --> 00:20:09,919
like evaluate these models or how can we
519
00:20:09,919 --> 00:20:11,600
tell whether something is a good model
520
00:20:11,600 --> 00:20:14,400
or a bad model
521
00:20:14,640 --> 00:20:15,360
so
522
00:20:15,360 --> 00:20:18,400
let's take a look at this data set so
523
00:20:18,400 --> 00:20:21,280
this data set is from
524
00:20:21,280 --> 00:20:24,320
the Pima Indian diabetes data
525
00:20:24,320 --> 00:20:25,120
set
526
00:20:25,120 --> 00:20:26,880
and here we have different number of
527
00:20:26,880 --> 00:20:28,960
pregnancies different glucose levels
528
00:20:28,960 --> 00:20:31,440
blood pressure skin thickness
529
00:20:31,440 --> 00:20:33,919
insulin bmi age and then the outcome
530
00:20:33,919 --> 00:20:35,679
whether or not they have diabetes one
531
00:20:35,679 --> 00:20:38,640
for they do zero for they don't
532
00:20:38,640 --> 00:20:39,679
so here
533
00:20:39,679 --> 00:20:42,720
um all of these are
534
00:20:42,720 --> 00:20:44,159
quantitative
535
00:20:44,159 --> 00:20:45,840
features right because they're all on
536
00:20:45,840 --> 00:20:48,640
some scale
537
00:20:48,640 --> 00:20:52,240
so each row is a different sample in the
538
00:20:52,240 --> 00:20:54,240
data so it's a different example it's
539
00:20:54,240 --> 00:20:57,039
one person's data and each row
540
00:20:57,039 --> 00:21:00,640
represents one person in this data set
541
00:21:00,640 --> 00:21:02,640
now this column
542
00:21:02,640 --> 00:21:04,720
each column represents a different
543
00:21:04,720 --> 00:21:07,600
feature so this one here is some measure
544
00:21:07,600 --> 00:21:10,960
of blood pressure levels
545
00:21:10,960 --> 00:21:12,799
and this one over here as we mentioned
546
00:21:12,799 --> 00:21:14,799
is the output label so this one is
547
00:21:14,799 --> 00:21:18,960
whether or not they have diabetes
548
00:21:18,960 --> 00:21:20,880
and as i mentioned this is what we would
549
00:21:20,880 --> 00:21:23,200
call a feature vector because these are
550
00:21:23,200 --> 00:21:24,880
all of our features
551
00:21:24,880 --> 00:21:28,000
in one sample
552
00:21:28,000 --> 00:21:30,960
and this is what's known as the target
553
00:21:30,960 --> 00:21:34,000
or the output for that feature vector
554
00:21:34,000 --> 00:21:37,120
that's what we're trying to predict
555
00:21:37,120 --> 00:21:39,280
and all of these together is our
556
00:21:39,280 --> 00:21:40,960
features matrix
557
00:21:40,960 --> 00:21:42,559
x
558
00:21:42,559 --> 00:21:45,760
and over here this is our labels or
559
00:21:45,760 --> 00:21:49,520
targets vector y
560
00:21:49,520 --> 00:21:51,760
so i've condensed this to a chocolate
561
00:21:51,760 --> 00:21:53,280
bar to kind of
562
00:21:53,280 --> 00:21:55,760
talk about some of the other concepts in
563
00:21:55,760 --> 00:21:57,039
machine learning
564
00:21:57,039 --> 00:22:01,039
so over here we have our x our features
565
00:22:01,039 --> 00:22:05,440
matrix and over here this is our label y
566
00:22:05,440 --> 00:22:06,240
so
567
00:22:06,240 --> 00:22:09,520
each row of this will be fed into our
568
00:22:09,520 --> 00:22:10,960
model right
569
00:22:10,960 --> 00:22:12,799
and our model will make some sort of
570
00:22:12,799 --> 00:22:14,240
prediction
571
00:22:14,240 --> 00:22:15,919
and what we do is we compare that
572
00:22:15,919 --> 00:22:18,559
prediction to the actual
573
00:22:18,559 --> 00:22:21,120
value of y that we have in our labeled
574
00:22:21,120 --> 00:22:22,960
data set because that's the whole point
575
00:22:22,960 --> 00:22:24,720
of supervised learning is we can compare
576
00:22:24,720 --> 00:22:27,280
what our model is outputting to oh what
577
00:22:27,280 --> 00:22:29,440
is the truth actually and then we can go
578
00:22:29,440 --> 00:22:31,440
back and we can adjust some things so
579
00:22:31,440 --> 00:22:33,840
the next iteration we get closer
580
00:22:33,840 --> 00:22:34,720
to
581
00:22:34,720 --> 00:22:37,520
what the true value is
582
00:22:37,520 --> 00:22:38,720
so that
583
00:22:38,720 --> 00:22:41,039
whole process here the tinkering the
584
00:22:41,039 --> 00:22:42,720
okay what's the difference where did we
585
00:22:42,720 --> 00:22:44,159
go wrong
586
00:22:44,159 --> 00:22:45,919
that's what's known as training the
587
00:22:45,919 --> 00:22:47,520
model
588
00:22:47,520 --> 00:22:49,520
all right so take this whole
589
00:22:49,520 --> 00:22:51,919
you know chunk right here do we want to
590
00:22:51,919 --> 00:22:53,440
really put
591
00:22:53,440 --> 00:22:55,679
our entire chocolate bar into the model
592
00:22:55,679 --> 00:22:58,799
to train our model
593
00:22:58,880 --> 00:23:01,520
not really right because if we did that
594
00:23:01,520 --> 00:23:05,039
then how do we know that our model can
595
00:23:05,039 --> 00:23:08,559
do well on new data that we haven't seen
596
00:23:08,559 --> 00:23:11,440
like if i were to create a model
597
00:23:11,440 --> 00:23:12,240
to
598
00:23:12,240 --> 00:23:14,240
predict whether or not someone has
599
00:23:14,240 --> 00:23:16,159
diabetes
600
00:23:16,159 --> 00:23:18,799
let's say that i just train all my data
601
00:23:18,799 --> 00:23:20,400
and i see that on my training data it
602
00:23:20,400 --> 00:23:22,640
does well i go to some hospital i'm like
603
00:23:22,640 --> 00:23:24,000
here's my model
604
00:23:24,000 --> 00:23:25,440
i think you can use this to predict if
605
00:23:25,440 --> 00:23:27,679
somebody has diabetes
606
00:23:27,679 --> 00:23:29,760
do we think that would be effective or
607
00:23:29,760 --> 00:23:32,000
not
608
00:23:32,159 --> 00:23:36,400
probably not right because
609
00:23:36,400 --> 00:23:37,600
we haven't
610
00:23:37,600 --> 00:23:39,039
assessed
611
00:23:39,039 --> 00:23:42,480
how well our model can generalize
612
00:23:42,480 --> 00:23:44,960
okay it might do well after you know our
613
00:23:44,960 --> 00:23:46,640
model has seen this data over and over
614
00:23:46,640 --> 00:23:48,720
and over again but what about new data
615
00:23:48,720 --> 00:23:51,840
can our model handle new data
616
00:23:51,840 --> 00:23:53,520
well
617
00:23:53,520 --> 00:23:55,280
how do we how do we get our model to
618
00:23:55,280 --> 00:23:57,039
assess that
619
00:23:57,039 --> 00:23:59,280
so we actually break up our whole data
620
00:23:59,280 --> 00:24:00,960
set that we have
621
00:24:00,960 --> 00:24:03,200
into three different types of data sets
622
00:24:03,200 --> 00:24:05,360
we call it the training data set the
623
00:24:05,360 --> 00:24:07,679
validation data set and the testing data
624
00:24:07,679 --> 00:24:08,640
set
625
00:24:08,640 --> 00:24:10,159
and you know you might have sixty
626
00:24:10,159 --> 00:24:12,080
percent here twenty percent and twenty
627
00:24:12,080 --> 00:24:14,960
percent or eighty ten and ten um it
628
00:24:14,960 --> 00:24:16,400
really depends on how much data
629
00:24:16,400 --> 00:24:17,919
you have i think either of those would
630
00:24:17,919 --> 00:24:20,400
be acceptable
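(The 60/20/20 split described here can be sketched like this; `train_test_split` from scikit-learn is an assumption on my part — the video builds the split with `numpy.split` later — but the proportions come out the same.)

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix X (100 samples, 3 features) and labels y.
X = np.random.rand(100, 3)
y = np.random.randint(0, 2, size=100)

# First split off 20% for the test set, then 25% of the
# remainder for validation: 60% / 20% / 20% overall.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=0)

print(len(X_train), len(X_valid), len(X_test))  # 60 20 20
```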
631
00:24:20,400 --> 00:24:22,080
so what we do is then we feed the
632
00:24:22,080 --> 00:24:24,720
training data set into our model
633
00:24:24,720 --> 00:24:27,200
we come up with you know this might be a
634
00:24:27,200 --> 00:24:29,760
vector of predictions corresponding with
635
00:24:29,760 --> 00:24:31,039
each
636
00:24:31,039 --> 00:24:34,080
sample that we put into our model
637
00:24:34,080 --> 00:24:36,000
we figure out okay what's the difference
638
00:24:36,000 --> 00:24:38,960
between our prediction and the true
639
00:24:38,960 --> 00:24:42,080
values this is something known as loss
640
00:24:42,080 --> 00:24:43,440
loss is you know what's the difference
641
00:24:43,440 --> 00:24:44,240
here
642
00:24:44,240 --> 00:24:48,080
in some numerical quantity of course
643
00:24:48,320 --> 00:24:50,400
and then we make adjustments and that's
644
00:24:50,400 --> 00:24:52,240
what we call training
645
00:24:52,240 --> 00:24:54,320
okay
646
00:24:54,320 --> 00:24:55,520
so then
647
00:24:55,520 --> 00:24:56,960
once you know we've made a bunch of
648
00:24:56,960 --> 00:24:58,400
adjustments
649
00:24:58,400 --> 00:25:00,960
we can put our validation set through
650
00:25:00,960 --> 00:25:02,559
this model
651
00:25:02,559 --> 00:25:04,960
and the validation set is kind of used
652
00:25:04,960 --> 00:25:06,799
as a reality check
653
00:25:06,799 --> 00:25:10,559
during or after training to ensure that
654
00:25:10,559 --> 00:25:14,000
the model can handle unseen data still
655
00:25:14,000 --> 00:25:15,679
so every single time after we train one
656
00:25:15,679 --> 00:25:17,200
iteration we might
657
00:25:17,200 --> 00:25:19,279
stick the validation set in and see hey
658
00:25:19,279 --> 00:25:21,039
what's the loss there
659
00:25:21,039 --> 00:25:22,960
and then after our training is over we
660
00:25:22,960 --> 00:25:25,600
can assess the validation set and ask
661
00:25:25,600 --> 00:25:27,440
hey what's the loss there
662
00:25:27,440 --> 00:25:29,919
but one key difference here is that we
663
00:25:29,919 --> 00:25:33,039
don't have that training step this
664
00:25:33,039 --> 00:25:35,520
loss never gets fed back into the model
665
00:25:35,520 --> 00:25:38,720
right that feedback loop is not closed
666
00:25:38,720 --> 00:25:40,880
all right so let's talk about loss
667
00:25:40,880 --> 00:25:43,120
really quickly
668
00:25:43,120 --> 00:25:45,279
so here i have four different types of
669
00:25:45,279 --> 00:25:47,360
models i have some sort of data that's
670
00:25:47,360 --> 00:25:49,520
being fed into the model and then some
671
00:25:49,520 --> 00:25:51,120
output
672
00:25:51,120 --> 00:25:52,159
okay so
673
00:25:52,159 --> 00:25:55,520
this output here is pretty far from you
674
00:25:55,520 --> 00:25:56,400
know this
675
00:25:56,400 --> 00:25:58,400
truth that we want
676
00:25:58,400 --> 00:25:59,440
and so
677
00:25:59,440 --> 00:26:02,400
this loss is going to be high
678
00:26:02,400 --> 00:26:03,679
in model b
679
00:26:03,679 --> 00:26:06,000
again this is pretty far from what we
680
00:26:06,000 --> 00:26:07,520
want so this loss is also going to be
681
00:26:07,520 --> 00:26:10,480
high let's give it 1.5
682
00:26:10,480 --> 00:26:13,440
now this one here it's pretty close i
683
00:26:13,440 --> 00:26:15,600
mean maybe not almost but pretty close
684
00:26:15,600 --> 00:26:16,720
to this one
685
00:26:16,720 --> 00:26:19,679
so that might have a loss of 0.5
686
00:26:19,679 --> 00:26:21,600
and then this one here is
687
00:26:21,600 --> 00:26:24,720
maybe further than this but still better
688
00:26:24,720 --> 00:26:28,880
than these two so that loss might be 0.9
689
00:26:28,880 --> 00:26:30,720
okay so which of these model performs
690
00:26:30,720 --> 00:26:32,240
the best
691
00:26:32,240 --> 00:26:33,279
well
692
00:26:33,279 --> 00:26:35,520
model c has the smallest loss so it's
693
00:26:35,520 --> 00:26:39,200
probably model c
694
00:26:39,200 --> 00:26:41,600
okay now let's take model c after you
695
00:26:41,600 --> 00:26:43,679
know we've come up with these all these
696
00:26:43,679 --> 00:26:45,919
models and we've seen okay model c is
697
00:26:45,919 --> 00:26:48,640
probably the best model
698
00:26:48,640 --> 00:26:51,120
we take model c and we run our test set
699
00:26:51,120 --> 00:26:52,400
through this model
700
00:26:52,400 --> 00:26:54,799
and this test set is used as a final
701
00:26:54,799 --> 00:26:57,520
check to see how generalizable
702
00:26:57,520 --> 00:27:00,880
that chosen model is so if i you know
703
00:27:00,880 --> 00:27:03,360
finished training my diabetes data set
704
00:27:03,360 --> 00:27:05,360
then i could run it through some chunk
705
00:27:05,360 --> 00:27:07,520
of the data and i can say oh like this
706
00:27:07,520 --> 00:27:10,240
is how it performs on data that it's
707
00:27:10,240 --> 00:27:12,320
never seen before at any point during
708
00:27:12,320 --> 00:27:15,440
the training process okay
709
00:27:15,440 --> 00:27:18,640
and that loss that's the final reported
710
00:27:18,640 --> 00:27:20,480
performance of
711
00:27:20,480 --> 00:27:22,320
my test set or
712
00:27:22,320 --> 00:27:23,840
this would be the final reported
713
00:27:23,840 --> 00:27:26,799
performance of my model
714
00:27:26,799 --> 00:27:29,039
okay
715
00:27:29,200 --> 00:27:31,679
so let's talk about this thing called
716
00:27:31,679 --> 00:27:33,440
loss because i think i kind of just
717
00:27:33,440 --> 00:27:35,679
glossed over it right
718
00:27:35,679 --> 00:27:37,840
so loss is the difference between your
719
00:27:37,840 --> 00:27:40,799
prediction and the actual
720
00:27:40,799 --> 00:27:43,120
like label
721
00:27:43,120 --> 00:27:44,880
so this would give a slightly higher
722
00:27:44,880 --> 00:27:47,200
loss than this
723
00:27:47,200 --> 00:27:50,559
and this would even give a higher loss
724
00:27:50,559 --> 00:27:53,440
because it's even more off
725
00:27:53,440 --> 00:27:55,120
in computer science we like formulas
726
00:27:55,120 --> 00:27:57,520
right we like formulaic ways
727
00:27:57,520 --> 00:27:59,520
of describing things
728
00:27:59,520 --> 00:28:01,600
so here are some examples of loss
729
00:28:01,600 --> 00:28:03,200
functions and how we can actually come
730
00:28:03,200 --> 00:28:05,279
up with numbers
731
00:28:05,279 --> 00:28:08,000
this here is known as l1 loss
732
00:28:08,000 --> 00:28:10,080
and basically l1 loss just takes the
733
00:28:10,080 --> 00:28:11,520
absolute value
734
00:28:11,520 --> 00:28:12,480
of
735
00:28:12,480 --> 00:28:14,240
whatever your
736
00:28:14,240 --> 00:28:15,679
you know real
737
00:28:15,679 --> 00:28:17,840
value is whatever the real output label
738
00:28:17,840 --> 00:28:18,559
is
739
00:28:18,559 --> 00:28:21,279
subtracts the predicted value
740
00:28:21,279 --> 00:28:23,679
and takes the absolute value of that
741
00:28:23,679 --> 00:28:24,640
okay
742
00:28:24,640 --> 00:28:25,440
so
743
00:28:25,440 --> 00:28:28,320
the absolute value is
744
00:28:28,320 --> 00:28:29,520
a function that looks something like
745
00:28:29,520 --> 00:28:30,480
this
746
00:28:30,480 --> 00:28:33,200
so the further off you are the greater
747
00:28:33,200 --> 00:28:35,440
your losses
748
00:28:35,440 --> 00:28:38,880
right in either direction so
749
00:28:38,880 --> 00:28:41,120
if your real value is off from your
750
00:28:41,120 --> 00:28:44,000
predicted value by 10 then your loss for
751
00:28:44,000 --> 00:28:46,559
that point would be 10 and then this sum
752
00:28:46,559 --> 00:28:48,399
here just means hey we're taking all the
753
00:28:48,399 --> 00:28:50,960
points in our data set and we're trying
754
00:28:50,960 --> 00:28:52,960
to figure out the sum of how far
755
00:28:52,960 --> 00:28:55,679
everything is
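(A minimal sketch of the L1 loss just described — the sum of absolute differences over all points; the variable names are illustrative.)

```python
import numpy as np

def l1_loss(y_true, y_pred):
    # Sum of |actual - predicted| over every data point.
    return np.sum(np.abs(y_true - y_pred))

y_true = np.array([3.0, -0.5, 2.0])
y_pred = np.array([2.5, 0.0, 2.0])
print(l1_loss(y_true, y_pred))  # 1.0
```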
756
00:28:56,080 --> 00:28:58,320
now we also have something called l2
757
00:28:58,320 --> 00:28:59,600
loss so
758
00:28:59,600 --> 00:29:01,679
this loss function is quadratic which
759
00:29:01,679 --> 00:29:04,640
means that if it's close the penalty is
760
00:29:04,640 --> 00:29:08,480
very minimal and if it's off by a lot
761
00:29:08,480 --> 00:29:11,840
then the penalty is much much higher
762
00:29:11,840 --> 00:29:12,799
okay
763
00:29:12,799 --> 00:29:14,880
and this instead of the absolute value
764
00:29:14,880 --> 00:29:16,640
we just square
765
00:29:16,640 --> 00:29:18,320
the um
766
00:29:18,320 --> 00:29:21,600
the difference between the two
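(And the L2 loss, where the difference is squared instead of taking the absolute value, so large errors are penalized much more heavily than small ones; again a sketch with illustrative names.)

```python
import numpy as np

def l2_loss(y_true, y_pred):
    # Squared differences: small errors contribute little,
    # large errors contribute quadratically more.
    return np.sum((y_true - y_pred) ** 2)

y_true = np.array([3.0, -0.5, 2.0])
y_pred = np.array([2.5, 0.0, 2.0])
print(l2_loss(y_true, y_pred))  # 0.5
```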
767
00:29:22,880 --> 00:29:24,720
now there's also something called binary
768
00:29:24,720 --> 00:29:26,880
cross entropy loss
769
00:29:26,880 --> 00:29:29,520
it looks something like this and this is
770
00:29:29,520 --> 00:29:32,240
for uh binary classification this this
771
00:29:32,240 --> 00:29:34,399
might be the loss that we use
772
00:29:34,399 --> 00:29:35,760
so this loss
773
00:29:35,760 --> 00:29:37,840
you know i'm not going to really go
774
00:29:37,840 --> 00:29:40,159
through it too much but you just need to
775
00:29:40,159 --> 00:29:42,559
know that loss decreases as the
776
00:29:42,559 --> 00:29:45,919
performance gets better
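(Binary cross-entropy, sketched from the standard formula; the small epsilon clipping is an added assumption for numerical safety, so we never take log(0). The key behavior is what she states: the loss shrinks as predictions improve.)

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    # y_true holds 0/1 labels, p_pred the predicted probability
    # of class 1. Clip probabilities away from 0 and 1.
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1, 0, 1, 1])
# Mostly-correct, confident predictions give a small loss...
good = binary_cross_entropy(y_true, np.array([0.9, 0.1, 0.8, 0.7]))
# ...and confidently wrong ones give a much larger loss.
bad = binary_cross_entropy(y_true, np.array([0.1, 0.9, 0.2, 0.3]))
print(good, bad)
```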
777
00:29:47,039 --> 00:29:49,159
so there are some other measures of
778
00:29:49,159 --> 00:29:52,720
accuracy as well so for example accuracy
779
00:29:52,720 --> 00:29:55,360
what is accuracy
780
00:29:55,360 --> 00:29:57,360
so let's say that these are pictures
781
00:29:57,360 --> 00:30:00,960
that i'm feeding my model okay
782
00:30:00,960 --> 00:30:03,760
and these predictions might be
783
00:30:03,760 --> 00:30:06,559
apple orange orange apple
784
00:30:06,559 --> 00:30:09,200
okay but the actual is apple orange
785
00:30:09,200 --> 00:30:11,120
apple apple
786
00:30:11,120 --> 00:30:12,159
so
787
00:30:12,159 --> 00:30:14,399
three of them were correct and one of
788
00:30:14,399 --> 00:30:16,640
them was incorrect so the accuracy of
789
00:30:16,640 --> 00:30:18,880
this model is three quarters or 75
790
00:30:18,880 --> 00:30:20,480
percent
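(The apple/orange accuracy computation, as a one-liner: the fraction of predictions that match the true labels.)

```python
import numpy as np

predictions = np.array(["apple", "orange", "orange", "apple"])
actual      = np.array(["apple", "orange", "apple",  "apple"])

# Accuracy = fraction of predictions matching the true labels.
accuracy = np.mean(predictions == actual)
print(accuracy)  # 0.75
```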
791
00:30:20,480 --> 00:30:23,120
all right coming back to our
792
00:30:23,120 --> 00:30:25,200
Colab notebook i'm going to close this
793
00:30:25,200 --> 00:30:26,399
a little bit
794
00:30:26,399 --> 00:30:29,520
again we've imported stuff up here um
795
00:30:29,520 --> 00:30:32,240
and we've already created our data frame
796
00:30:32,240 --> 00:30:34,080
right here and this is this is all of
797
00:30:34,080 --> 00:30:35,600
our data this is what we're going to use
798
00:30:35,600 --> 00:30:38,480
to train our models
799
00:30:38,480 --> 00:30:40,480
so down here
800
00:30:40,480 --> 00:30:43,760
again if we now take a look at our data
801
00:30:43,760 --> 00:30:45,919
set
802
00:30:45,919 --> 00:30:47,679
you'll see that our classes are now
803
00:30:47,679 --> 00:30:49,679
zeros and ones so now this is all
804
00:30:49,679 --> 00:30:51,200
numerical which is good because our
805
00:30:51,200 --> 00:30:54,240
computer can now understand that
806
00:30:54,240 --> 00:30:55,760
okay
807
00:30:55,760 --> 00:30:57,279
and you know it would probably be a good
808
00:30:57,279 --> 00:30:58,640
idea to maybe
809
00:30:58,640 --> 00:31:00,960
kind of plot hey do these things have
810
00:31:00,960 --> 00:31:02,559
anything to do
811
00:31:02,559 --> 00:31:04,559
with the class
812
00:31:04,559 --> 00:31:05,360
so
813
00:31:05,360 --> 00:31:06,240
here
814
00:31:06,240 --> 00:31:08,559
i'm going to go through all the labels
815
00:31:08,559 --> 00:31:10,960
so for label in
816
00:31:10,960 --> 00:31:13,120
the columns of this data frame so this
817
00:31:13,120 --> 00:31:15,360
just gets me the list actually we have
818
00:31:15,360 --> 00:31:16,880
the list right it's called so let's just
819
00:31:16,880 --> 00:31:19,440
use that it might be less confusing
820
00:31:19,440 --> 00:31:20,960
of everything up till the last thing
821
00:31:20,960 --> 00:31:22,880
which is the class so i'm going to take
822
00:31:22,880 --> 00:31:25,360
all these 10 different features
823
00:31:25,360 --> 00:31:27,919
and i'm going to plot them
824
00:31:27,919 --> 00:31:30,840
as a histogram
825
00:31:30,840 --> 00:31:32,399
um
826
00:31:32,399 --> 00:31:34,240
so
827
00:31:34,240 --> 00:31:35,440
and now i'm gonna plot them as a
828
00:31:35,440 --> 00:31:37,360
histogram so basically if i take that
829
00:31:37,360 --> 00:31:41,120
data frame and i say okay for everything
830
00:31:41,120 --> 00:31:42,000
where
831
00:31:42,000 --> 00:31:43,279
the class
832
00:31:43,279 --> 00:31:45,840
is equal to one so these are all of our
833
00:31:45,840 --> 00:31:48,559
gammas remember
834
00:31:48,559 --> 00:31:49,440
now
835
00:31:49,440 --> 00:31:53,200
for that portion of the data frame if i
836
00:31:53,200 --> 00:31:56,240
look at this label so now these
837
00:31:56,240 --> 00:31:59,120
okay what this part here is saying
838
00:31:59,120 --> 00:32:00,960
is
839
00:32:00,960 --> 00:32:03,039
inside the data frame get me everything
840
00:32:03,039 --> 00:32:05,039
where the class is equal to one so
841
00:32:05,039 --> 00:32:07,360
that's all all of these would fit into
842
00:32:07,360 --> 00:32:09,039
that category right
843
00:32:09,039 --> 00:32:10,720
and now let's just look at the label
844
00:32:10,720 --> 00:32:11,919
column so
845
00:32:11,919 --> 00:32:13,840
the first label would be f length which
846
00:32:13,840 --> 00:32:15,600
would be this column
847
00:32:15,600 --> 00:32:17,600
so this command here is getting me
848
00:32:17,600 --> 00:32:19,519
all the different values that belong to
849
00:32:19,519 --> 00:32:23,120
class 1 for this specific label
850
00:32:23,120 --> 00:32:25,200
and that's exactly what i'm going to put
851
00:32:25,200 --> 00:32:26,640
into the histogram
852
00:32:26,640 --> 00:32:28,399
and now i'm just going to tell you know
853
00:32:28,399 --> 00:32:31,679
matplotlib make the color blue make
854
00:32:31,679 --> 00:32:32,799
oops
855
00:32:32,799 --> 00:32:35,760
label this as you know gamma
856
00:32:35,760 --> 00:32:36,960
um
857
00:32:36,960 --> 00:32:40,320
set alpha why do i keep doing that alpha
858
00:32:40,320 --> 00:32:42,399
equal to 0.7 so that's just like the
859
00:32:42,399 --> 00:32:44,559
transparency and then i'm going to set
860
00:32:44,559 --> 00:32:47,519
density equal to true so that when we
861
00:32:47,519 --> 00:32:50,399
compare it to
862
00:32:50,399 --> 00:32:52,640
the hadrons here
863
00:32:52,640 --> 00:32:54,960
we'll have a baseline for comparing them
864
00:32:54,960 --> 00:32:57,600
okay so the density being true just
865
00:32:57,600 --> 00:33:00,720
basically normalizes these distributions
866
00:33:00,720 --> 00:33:03,360
so you know if you have
867
00:33:03,360 --> 00:33:04,880
200
868
00:33:04,880 --> 00:33:08,000
of one type and then 50 of another type
869
00:33:08,000 --> 00:33:10,720
well if you drew the histograms it would
870
00:33:10,720 --> 00:33:13,039
be hard to compare because one of them
871
00:33:13,039 --> 00:33:14,960
would be a lot bigger than the other
872
00:33:14,960 --> 00:33:17,360
right but by normalizing them we kind of
873
00:33:17,360 --> 00:33:18,960
are distributing them over how many
874
00:33:18,960 --> 00:33:21,200
samples there are
875
00:33:21,200 --> 00:33:23,360
all right and then i'm just going to put
876
00:33:23,360 --> 00:33:27,200
a title on here make that the label
877
00:33:27,200 --> 00:33:29,039
uh the y label
878
00:33:29,039 --> 00:33:30,960
so because it's density the y label is
879
00:33:30,960 --> 00:33:32,720
probability
880
00:33:32,720 --> 00:33:34,799
and the x label
881
00:33:34,799 --> 00:33:37,600
is just going to be the label
882
00:33:37,600 --> 00:33:39,600
um
883
00:33:39,600 --> 00:33:41,600
what is going on
884
00:33:41,600 --> 00:33:43,200
and i'm going to
885
00:33:43,200 --> 00:33:46,080
include a legend and
886
00:33:46,080 --> 00:33:47,919
plt.show just means okay display the
887
00:33:47,919 --> 00:33:48,880
plot
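(The plotting loop she walks through can be sketched like this; the toy data frame and the single `fLength` column are stand-ins for the real MAGIC telescope data, where class 1 is gamma and class 0 is hadron.)

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Toy stand-in for the data frame: one feature column plus
# a 0/1 "class" column (1 = gamma, 0 = hadron).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "fLength": np.concatenate([rng.normal(20, 5, 200),
                               rng.normal(40, 10, 50)]),
    "class": np.concatenate([np.ones(200, dtype=int),
                             np.zeros(50, dtype=int)]),
})

for label in df.columns[:-1]:  # every column except the class label
    plt.hist(df[df["class"] == 1][label], color="blue", label="gamma",
             alpha=0.7, density=True)
    plt.hist(df[df["class"] == 0][label], color="red", label="hadron",
             alpha=0.7, density=True)
    plt.title(label)
    plt.ylabel("Probability")
    plt.xlabel(label)
    plt.legend()
    plt.show()
```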
888
00:33:48,880 --> 00:33:53,120
so if i run that
889
00:33:53,120 --> 00:33:54,559
just be
890
00:33:54,559 --> 00:33:56,559
up to the last item so we want a list
891
00:33:56,559 --> 00:33:59,679
right not just the last item
892
00:33:59,679 --> 00:34:01,200
and now we can see that we're plotting
893
00:34:01,200 --> 00:34:04,480
all of these so here we have the length
894
00:34:04,480 --> 00:34:08,239
oh and i made this gamma so
895
00:34:08,239 --> 00:34:11,440
uh this should be hadron
896
00:34:11,440 --> 00:34:12,239
okay
897
00:34:12,239 --> 00:34:13,918
so the gamma's in blue the hadrons are
898
00:34:13,918 --> 00:34:16,320
in red so here we can already see that
899
00:34:16,320 --> 00:34:18,960
you know maybe if the length is smaller
900
00:34:18,960 --> 00:34:21,040
it's probably more likely to be gamma
901
00:34:21,040 --> 00:34:22,639
right
902
00:34:22,639 --> 00:34:24,079
and we can kind of you know these all
903
00:34:24,079 --> 00:34:26,639
look somewhat similar
904
00:34:26,639 --> 00:34:29,119
but here okay clearly if there's more
905
00:34:29,119 --> 00:34:30,879
asymmetry
906
00:34:30,879 --> 00:34:32,719
or if you know this
907
00:34:32,719 --> 00:34:36,800
asymmetry measure is larger
908
00:34:36,800 --> 00:34:41,040
um then it's probably a hadron
909
00:34:41,280 --> 00:34:43,760
okay oh this one's a good one so
910
00:34:43,760 --> 00:34:45,040
f alpha
911
00:34:45,040 --> 00:34:46,560
seems like
912
00:34:46,560 --> 00:34:48,320
hadrons are pretty evenly distributed
913
00:34:48,320 --> 00:34:50,239
whereas if this is smaller it
914
00:34:50,239 --> 00:34:52,000
looks like there's more gammas in that
915
00:34:52,000 --> 00:34:54,399
area
916
00:34:55,280 --> 00:34:58,160
okay so this is kind of the data
917
00:34:58,160 --> 00:34:59,680
that we're working with we can kind of
918
00:34:59,680 --> 00:35:02,240
see what's going on
919
00:35:02,240 --> 00:35:03,599
okay so the next thing that we're going
920
00:35:03,599 --> 00:35:05,119
to do here
921
00:35:05,119 --> 00:35:07,839
is we are going to create our
922
00:35:07,839 --> 00:35:09,359
train
923
00:35:09,359 --> 00:35:11,200
our validation
924
00:35:11,200 --> 00:35:13,680
and our test data sets
925
00:35:13,680 --> 00:35:18,000
i'm going to set train valid and test
926
00:35:18,000 --> 00:35:20,079
to be equal to
927
00:35:20,079 --> 00:35:21,440
this
928
00:35:21,440 --> 00:35:23,440
so numpy.split i'm just splitting up the
929
00:35:23,440 --> 00:35:24,720
data frame
930
00:35:24,720 --> 00:35:27,760
and if i do this sample where i'm
931
00:35:27,760 --> 00:35:29,520
sampling everything this will basically
932
00:35:29,520 --> 00:35:31,440
shuffle my data
933
00:35:31,440 --> 00:35:34,079
um now if i
934
00:35:34,079 --> 00:35:37,760
i want to pass in where exactly i'm
935
00:35:37,760 --> 00:35:40,000
splitting my data set so
936
00:35:40,000 --> 00:35:42,079
the first split is going to be maybe at
937
00:35:42,079 --> 00:35:43,440
60 percent
938
00:35:43,440 --> 00:35:45,839
so i'm just going to say 0.6 times the
939
00:35:45,839 --> 00:35:47,599
length of this data frame
940
00:35:47,599 --> 00:35:50,160
so and then cast as an integer
941
00:35:50,160 --> 00:35:52,079
that's going to be the first place where
942
00:35:52,079 --> 00:35:53,440
you know i cut it off and that will be
943
00:35:53,440 --> 00:35:55,280
my training data
944
00:35:55,280 --> 00:35:59,119
now if i then go to 0.8
945
00:35:59,119 --> 00:36:01,040
this basically means everything between
946
00:36:01,040 --> 00:36:03,520
60 and 80 of the length of the data set
947
00:36:03,520 --> 00:36:05,680
will go towards validation
948
00:36:05,680 --> 00:36:06,800
and then
949
00:36:06,800 --> 00:36:08,720
like everything from 80 to 100 is going
950
00:36:08,720 --> 00:36:10,720
to be my test data
951
00:36:10,720 --> 00:36:12,880
so i can run that
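(Her `numpy.split` call, sketched on a tiny data frame so the 60% and 80% cut points are easy to see: rows before the first cut go to train, between the cuts to validation, after the second to test.)

```python
import numpy as np
import pandas as pd

# Toy data frame (10 rows) standing in for the real one.
df = pd.DataFrame({"x": range(10), "class": [0, 1] * 5})

# sample(frac=1) shuffles; np.split cuts at 60% and 80%.
train, valid, test = np.split(
    df.sample(frac=1, random_state=0),
    [int(0.6 * len(df)), int(0.8 * len(df))],
)
print(len(train), len(valid), len(test))  # 6 2 2
```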
952
00:36:12,880 --> 00:36:14,400
and
953
00:36:14,400 --> 00:36:15,200
now
954
00:36:15,200 --> 00:36:17,119
if we go up here and we inspect this
955
00:36:17,119 --> 00:36:19,839
data we'll see that these columns seem
956
00:36:19,839 --> 00:36:22,480
to have values in like the 100s whereas
957
00:36:22,480 --> 00:36:23,520
this one
958
00:36:23,520 --> 00:36:26,000
is 0.03
959
00:36:26,000 --> 00:36:28,240
right so the scale of all these numbers
960
00:36:28,240 --> 00:36:30,000
is way off
961
00:36:30,000 --> 00:36:32,720
and sometimes that will affect our
962
00:36:32,720 --> 00:36:33,839
results
963
00:36:33,839 --> 00:36:36,480
so one thing that we would want to do is
964
00:36:36,480 --> 00:36:37,760
scale these
965
00:36:37,760 --> 00:36:41,599
so that they are you know
966
00:36:41,599 --> 00:36:45,040
so that it's now relative to maybe
967
00:36:45,040 --> 00:36:47,760
the mean and the standard deviation of
968
00:36:47,760 --> 00:36:50,720
that specific column
969
00:36:51,359 --> 00:36:53,359
i'm going to create a function called
970
00:36:53,359 --> 00:36:55,440
scale data set
971
00:36:55,440 --> 00:36:58,240
and i'm going to pass in the data frame
972
00:36:58,240 --> 00:37:00,560
um
973
00:37:00,560 --> 00:37:02,960
and that's what i'll do for now okay so
974
00:37:02,960 --> 00:37:04,160
the x
975
00:37:04,160 --> 00:37:06,240
values are going to be you know i take
976
00:37:06,240 --> 00:37:07,680
the data frame
977
00:37:07,680 --> 00:37:08,560
and
978
00:37:08,560 --> 00:37:10,640
let's assume that
979
00:37:10,640 --> 00:37:12,400
the columns
980
00:37:12,400 --> 00:37:13,520
um
981
00:37:13,520 --> 00:37:15,520
are going to be you know that the label
982
00:37:15,520 --> 00:37:17,599
will always be the last thing in the
983
00:37:17,599 --> 00:37:21,280
data frame so what i can do is say
984
00:37:21,280 --> 00:37:22,960
dataframe dot columns all the way up to the
985
00:37:22,960 --> 00:37:23,920
last
986
00:37:23,920 --> 00:37:24,880
item
987
00:37:24,880 --> 00:37:27,280
and get those values
988
00:37:27,280 --> 00:37:29,920
now for my y
989
00:37:29,920 --> 00:37:31,760
well it's the last column so i can just
990
00:37:31,760 --> 00:37:33,760
do this i can just index into that last
991
00:37:33,760 --> 00:37:34,720
column
992
00:37:34,720 --> 00:37:37,839
and then get those values
993
00:37:38,560 --> 00:37:40,400
now
994
00:37:40,400 --> 00:37:41,520
in
995
00:37:41,520 --> 00:37:42,560
so
996
00:37:42,560 --> 00:37:45,760
i'm actually going to import something
997
00:37:45,760 --> 00:37:48,560
known as the standard scaler from
998
00:37:48,560 --> 00:37:52,880
sklearn so if i come up here i can go to
999
00:37:52,880 --> 00:37:55,880
sklearn.preprocessing
1000
00:37:56,000 --> 00:37:59,520
and i'm going to import um standard
1001
00:37:59,520 --> 00:38:01,119
scaler
1002
00:38:01,119 --> 00:38:03,119
i have to run that cell
1003
00:38:03,119 --> 00:38:04,880
i'm going to come back down here
1004
00:38:04,880 --> 00:38:07,760
and now i'm going to create a scaler and
1005
00:38:07,760 --> 00:38:10,320
use that scaler so standard
1006
00:38:10,320 --> 00:38:12,800
scaler
1007
00:38:13,440 --> 00:38:15,920
and with the scaler what i can do is
1008
00:38:15,920 --> 00:38:19,520
actually just fit and transform x
1009
00:38:19,520 --> 00:38:22,560
so here i can say x is equal to
1010
00:38:22,560 --> 00:38:25,520
scaler dot fit
1011
00:38:25,520 --> 00:38:27,839
fit transform
1012
00:38:27,839 --> 00:38:30,079
x so what that's doing is saying okay
1013
00:38:30,079 --> 00:38:33,920
take x and fit the standard scaler to x
1014
00:38:33,920 --> 00:38:35,520
and then transform all those values and
1015
00:38:35,520 --> 00:38:37,119
what would it be and that's going to be
1016
00:38:37,119 --> 00:38:38,560
our new x
1017
00:38:38,560 --> 00:38:40,160
all right
1018
00:38:40,160 --> 00:38:42,880
and then i'm also going to just create
1019
00:38:42,880 --> 00:38:46,720
you know the whole data as one huge
1020
00:38:46,720 --> 00:38:48,480
2d numpy array
1021
00:38:48,480 --> 00:38:50,800
and in order to do that i'm going to
1022
00:38:50,800 --> 00:38:51,680
call
1023
00:38:51,680 --> 00:38:54,400
hstack so hstack is saying okay take an
1024
00:38:54,400 --> 00:38:57,280
array and another array and horizontally
1025
00:38:57,280 --> 00:38:58,640
stack them together that's what the h
1026
00:38:58,640 --> 00:38:59,599
stands for
1027
00:38:59,599 --> 00:39:01,599
so horizontally stacking them together
1028
00:39:01,599 --> 00:39:03,839
just like put them side by side okay not
1029
00:39:03,839 --> 00:39:05,920
on top of each other
1030
00:39:05,920 --> 00:39:08,160
so what am i stacking well i have to
1031
00:39:08,160 --> 00:39:09,920
pass in something
1032
00:39:09,920 --> 00:39:11,839
so that it can stack
1033
00:39:11,839 --> 00:39:14,079
x and y
1034
00:39:14,079 --> 00:39:16,480
and now
1035
00:39:16,880 --> 00:39:19,359
okay so numpy is very particular about
1036
00:39:19,359 --> 00:39:22,320
dimensions right so in this specific
1037
00:39:22,320 --> 00:39:25,280
case our x is a two-dimensional object but
1038
00:39:25,280 --> 00:39:27,359
y is only a one-dimensional thing it's
1039
00:39:27,359 --> 00:39:28,400
only a vector
1040
00:39:28,400 --> 00:39:29,760
of values
1041
00:39:29,760 --> 00:39:33,200
so in order to now reshape it into a 2d
1042
00:39:33,200 --> 00:39:38,359
item we have to call numpy.reshape
1043
00:39:38,640 --> 00:39:41,680
and we can pass in the dimensions of its
1044
00:39:41,680 --> 00:39:42,880
reshape
1045
00:39:42,880 --> 00:39:43,920
so
1046
00:39:43,920 --> 00:39:46,480
if i pass a negative 1 comma 1 that just
1047
00:39:46,480 --> 00:39:48,880
means okay make this a 2d array where
1048
00:39:48,880 --> 00:39:52,000
the negative 1 just means infer what
1049
00:39:52,000 --> 00:39:54,560
this dimension value would be which
1050
00:39:54,560 --> 00:39:56,400
ends up being the length of y this would
1051
00:39:56,400 --> 00:39:58,880
be the same as literally doing this
1052
00:39:58,880 --> 00:40:00,320
but the negative one is easier because
1053
00:40:00,320 --> 00:40:01,839
we're making the computer do the hard
1054
00:40:01,839 --> 00:40:03,280
work
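The reshape step described above can be sketched on its own like this (the toy vector here is made up for illustration):

```python
import numpy as np

y = np.array([0, 1, 1, 0])      # 1D vector, shape (4,)
col = np.reshape(y, (-1, 1))    # -1 lets numpy infer that dimension
print(col.shape)                # → (4, 1)

# The -1 is equivalent to spelling the length out explicitly
assert (col == np.reshape(y, (len(y), 1))).all()
```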
1055
00:40:03,280 --> 00:40:05,680
so if i stack that i'm going to then
1056
00:40:05,680 --> 00:40:08,839
return the data x and
1057
00:40:08,839 --> 00:40:10,720
y
1058
00:40:10,720 --> 00:40:12,240
okay
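Putting the steps above together, a minimal sketch of this scale_dataset helper (the toy dataframe is a made-up stand-in for the MAGIC data; the oversampling branch comes later):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def scale_dataset(dataframe):
    # Label is always the last column; everything before it is a feature
    X = dataframe[dataframe.columns[:-1]].values
    y = dataframe[dataframe.columns[-1]].values

    # Fit the standard scaler to X, then transform all those values
    scaler = StandardScaler()
    X = scaler.fit_transform(X)

    # Horizontally stack features and the label reshaped into a column
    data = np.hstack((X, np.reshape(y, (-1, 1))))
    return data, X, y

# Hypothetical toy frame standing in for the real dataset
df = pd.DataFrame({"f1": [1.0, 2.0, 3.0],
                   "f2": [10.0, 20.0, 30.0],
                   "class": [1, 0, 1]})
data, X, y = scale_dataset(df)
```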
1059
00:40:12,240 --> 00:40:14,960
so one more thing is that if we go into
1060
00:40:14,960 --> 00:40:16,640
our training data set
1061
00:40:16,640 --> 00:40:19,119
okay again this is our training data set
1062
00:40:19,119 --> 00:40:21,520
and we get the length of the training
1063
00:40:21,520 --> 00:40:22,839
data set
1064
00:40:22,839 --> 00:40:26,160
but where the training data set's
1065
00:40:26,160 --> 00:40:27,680
class
1066
00:40:27,680 --> 00:40:28,800
is one
1067
00:40:28,800 --> 00:40:32,560
so remember that this is the gammas
1068
00:40:32,560 --> 00:40:36,240
and then if we print
1069
00:40:36,839 --> 00:40:40,640
that and we do the same thing but zero
1070
00:40:40,640 --> 00:40:42,160
we'll see that
1071
00:40:42,160 --> 00:40:43,839
you know there's around
1072
00:40:43,839 --> 00:40:46,720
seven thousand of the gammas but only
1073
00:40:46,720 --> 00:40:49,599
around four thousand of the hadrons so
1074
00:40:49,599 --> 00:40:52,720
that might actually become an issue
1075
00:40:52,720 --> 00:40:55,200
and instead what we want to do
1076
00:40:55,200 --> 00:40:58,319
is we want to over sample our
1077
00:40:58,319 --> 00:41:00,079
training data set
1078
00:41:00,079 --> 00:41:02,960
so that means that we want to increase
1079
00:41:02,960 --> 00:41:05,200
the number of
1080
00:41:05,200 --> 00:41:06,640
these values
1081
00:41:06,640 --> 00:41:09,839
so that these kind of match better
1082
00:41:09,839 --> 00:41:13,040
and surprise surprise there is something
1083
00:41:13,040 --> 00:41:14,800
that we can import that will help us do
1084
00:41:14,800 --> 00:41:16,400
that so
1085
00:41:16,400 --> 00:41:20,920
i'm going to go to from imblearn.over_sampling
1086
00:41:22,720 --> 00:41:24,960
and i'm going to import this random over
1087
00:41:24,960 --> 00:41:26,480
sampler
1088
00:41:26,480 --> 00:41:28,079
run that cell
1089
00:41:28,079 --> 00:41:30,480
and come back down here
1090
00:41:30,480 --> 00:41:33,040
so i will actually add in this parameter
1091
00:41:33,040 --> 00:41:35,280
called oversample
1092
00:41:35,280 --> 00:41:37,680
and set that to false
1093
00:41:37,680 --> 00:41:39,520
by default
1094
00:41:39,520 --> 00:41:41,599
um
1095
00:41:41,599 --> 00:41:44,640
and if i do want to oversample then
1096
00:41:44,640 --> 00:41:48,240
here's what i'm going to do
1097
00:41:48,240 --> 00:41:50,480
so if i do want to
1098
00:41:50,480 --> 00:41:51,760
oversample
1099
00:41:51,760 --> 00:41:55,119
then i'm going to create this ros and
1100
00:41:55,119 --> 00:41:58,240
set it equal to this random over sampler
1101
00:41:58,240 --> 00:42:00,400
and then for x and y i'm just going to
1102
00:42:00,400 --> 00:42:02,400
say okay just fit
1103
00:42:02,400 --> 00:42:04,400
and resample
1104
00:42:04,400 --> 00:42:06,480
x and y and what that's doing is saying
1105
00:42:06,480 --> 00:42:09,200
okay take more of the smaller
1106
00:42:09,200 --> 00:42:10,160
class
1107
00:42:10,160 --> 00:42:12,319
so take the smaller class and keep
1108
00:42:12,319 --> 00:42:13,839
sampling from there
1109
00:42:13,839 --> 00:42:16,480
to increase the size of our data set of
1110
00:42:16,480 --> 00:42:18,400
that smaller class so that they now
1111
00:42:18,400 --> 00:42:20,720
match
1112
00:42:20,880 --> 00:42:25,200
so if i do this and i call scale
1113
00:42:25,200 --> 00:42:26,720
data set
1114
00:42:26,720 --> 00:42:28,480
and i pass in the training data set
1115
00:42:28,480 --> 00:42:32,000
where over sample is true
1116
00:42:32,000 --> 00:42:34,480
so this let's say this is train and then
1117
00:42:34,480 --> 00:42:36,079
x train
1118
00:42:36,079 --> 00:42:38,880
y train
1119
00:42:40,640 --> 00:42:42,000
oops
1120
00:42:42,000 --> 00:42:43,520
what's going on
1121
00:42:43,520 --> 00:42:47,200
oh these should be columns
1122
00:42:47,680 --> 00:42:48,960
so basically
1123
00:42:48,960 --> 00:42:50,319
what i'm doing now is i'm just saying
1124
00:42:50,319 --> 00:42:54,240
okay what is the length of y train
1125
00:42:54,240 --> 00:42:57,599
okay now it's 14 800 whatever and now
1126
00:42:57,599 --> 00:42:59,839
let's take a look at
1127
00:42:59,839 --> 00:43:03,920
um how many of these are type one
1128
00:43:04,560 --> 00:43:07,200
so actually we can just
1129
00:43:07,200 --> 00:43:09,280
sum that up
1130
00:43:09,280 --> 00:43:10,880
and then we'll also see that if we
1131
00:43:10,880 --> 00:43:12,560
instead switch the label and ask how
1132
00:43:12,560 --> 00:43:14,560
many of them are the other type
1133
00:43:14,560 --> 00:43:16,880
it's the same value so now these have
1134
00:43:16,880 --> 00:43:18,480
been
1135
00:43:18,480 --> 00:43:22,319
evenly you know rebalanced
1136
00:43:22,319 --> 00:43:24,400
okay well
1137
00:43:24,400 --> 00:43:25,359
okay
1138
00:43:25,359 --> 00:43:27,520
so here i'm just going to make
1139
00:43:27,520 --> 00:43:31,359
this the validation data set
1140
00:43:31,760 --> 00:43:34,800
and then the next one
1141
00:43:34,800 --> 00:43:37,680
uh i'm going to make this the test data
1142
00:43:37,680 --> 00:43:38,640
set
1143
00:43:38,640 --> 00:43:39,680
all right and we're actually going to
1144
00:43:39,680 --> 00:43:42,960
switch over sample here to false
1145
00:43:42,960 --> 00:43:44,400
now the reason why i'm switching that to
1146
00:43:44,400 --> 00:43:45,920
false is because
1147
00:43:45,920 --> 00:43:48,640
my validation and my test sets are
1148
00:43:48,640 --> 00:43:50,079
for the purpose of you know if i have
1149
00:43:50,079 --> 00:43:52,839
data that i haven't seen yet how does my
1150
00:43:52,839 --> 00:43:55,839
model perform on those
1151
00:43:55,839 --> 00:43:58,960
and i don't want to over sample for that
1152
00:43:58,960 --> 00:44:01,040
right now like i i don't care about
1153
00:44:01,040 --> 00:44:03,680
balancing those i want to know if i
1154
00:44:03,680 --> 00:44:06,560
have a random set of data that's
1155
00:44:06,560 --> 00:44:11,280
unlabeled can i trust my model right
1156
00:44:11,280 --> 00:44:13,760
so that's why i'm not oversampling
1157
00:44:13,760 --> 00:44:16,319
i run that and
1158
00:44:16,319 --> 00:44:17,280
again
1159
00:44:17,280 --> 00:44:19,040
what is going on oh it's because we
1160
00:44:19,040 --> 00:44:21,040
already have this train
1161
00:44:21,040 --> 00:44:22,880
so i have to go come up here and split
1162
00:44:22,880 --> 00:44:24,560
that data frame again
1163
00:44:24,560 --> 00:44:26,880
and now let's run these
1164
00:44:26,880 --> 00:44:29,119
okay
1165
00:44:29,520 --> 00:44:31,520
so now we have our data properly
1166
00:44:31,520 --> 00:44:34,480
formatted and we're going to move on to
1167
00:44:34,480 --> 00:44:35,760
different models now and i'm going to
1168
00:44:35,760 --> 00:44:37,359
tell you guys a little bit about each of
1169
00:44:37,359 --> 00:44:39,119
these models and then i'm going to show
1170
00:44:39,119 --> 00:44:42,400
you how we can do that in our code
1171
00:44:42,400 --> 00:44:43,920
so the first model that we're going to
1172
00:44:43,920 --> 00:44:46,560
learn about is knn or k-nearest
1173
00:44:46,560 --> 00:44:47,839
neighbors
1174
00:44:47,839 --> 00:44:48,640
okay
1175
00:44:48,640 --> 00:44:51,760
so here i've already drawn a plot on the
1176
00:44:51,760 --> 00:44:55,839
y-axis i have the number of kids
1177
00:44:55,839 --> 00:44:58,960
that a family might have and then on
1178
00:44:58,960 --> 00:45:01,599
the x-axis i have their income in terms
1179
00:45:01,599 --> 00:45:05,520
of thousands per year so
1180
00:45:05,520 --> 00:45:07,920
you know if someone's making 40
1181
00:45:07,920 --> 00:45:10,000
000 a year that's where this would be
1182
00:45:10,000 --> 00:45:12,079
and if somebody's making 320 that's where
1183
00:45:12,079 --> 00:45:13,920
that would be somebody has zero kids
1184
00:45:13,920 --> 00:45:16,400
it'd be somewhere along this axis
1185
00:45:16,400 --> 00:45:18,319
somebody has five it'd be somewhere over
1186
00:45:18,319 --> 00:45:20,560
here okay
1187
00:45:20,560 --> 00:45:24,319
and now i have
1188
00:45:24,319 --> 00:45:26,640
these plus signs and these minus signs
1189
00:45:26,640 --> 00:45:27,760
on here
1190
00:45:27,760 --> 00:45:30,319
so what i'm going to represent here is
1191
00:45:30,319 --> 00:45:33,280
the plus sign
1192
00:45:33,280 --> 00:45:35,599
means that they
1193
00:45:35,599 --> 00:45:38,480
own a car
1194
00:45:39,599 --> 00:45:41,920
and the minus sign
1195
00:45:41,920 --> 00:45:44,400
is going to represent no car
1196
00:45:44,400 --> 00:45:45,599
okay
1197
00:45:45,599 --> 00:45:46,560
so
1198
00:45:46,560 --> 00:45:48,720
your initial thought should be okay i
1199
00:45:48,720 --> 00:45:50,560
think this is binary classification
1200
00:45:50,560 --> 00:45:52,720
because all of our
1201
00:45:52,720 --> 00:45:55,760
points all of our samples
1202
00:45:55,760 --> 00:45:58,839
have labels so this is a
1203
00:45:58,839 --> 00:46:01,200
sample with
1204
00:46:01,200 --> 00:46:03,839
the plus label
1205
00:46:03,839 --> 00:46:07,839
and this here is another sample
1206
00:46:07,839 --> 00:46:09,119
with
1207
00:46:09,119 --> 00:46:11,520
the minus label
1208
00:46:11,520 --> 00:46:13,200
this is an abbreviation for with that
1209
00:46:13,200 --> 00:46:15,599
i'll use
1210
00:46:15,599 --> 00:46:16,800
all right
1211
00:46:16,800 --> 00:46:19,359
so we have this entire data set and
1212
00:46:19,359 --> 00:46:21,680
maybe around half the people own a car
1213
00:46:21,680 --> 00:46:23,280
and maybe
1214
00:46:23,280 --> 00:46:26,240
around half the people don't own a car
1215
00:46:26,240 --> 00:46:29,680
okay well what if i had some new point
1216
00:46:29,680 --> 00:46:32,079
let me choose a different color i'll
1217
00:46:32,079 --> 00:46:34,000
use this nice green
1218
00:46:34,000 --> 00:46:36,079
well what if i have a new point over
1219
00:46:36,079 --> 00:46:38,720
here so let's say that somebody makes 40
1220
00:46:38,720 --> 00:46:40,839
000 a year and has two
1221
00:46:40,839 --> 00:46:45,599
kids what do we think that would be
1222
00:46:47,760 --> 00:46:50,319
well just logically looking at this plot
1223
00:46:50,319 --> 00:46:53,440
you might think okay it seems like
1224
00:46:53,440 --> 00:46:55,119
they wouldn't have a car right because
1225
00:46:55,119 --> 00:46:56,640
that kind of matches the pattern of
1226
00:46:56,640 --> 00:46:59,119
everybody else around them
1227
00:46:59,119 --> 00:47:01,359
so that's the whole concept of this
1228
00:47:01,359 --> 00:47:05,119
nearest neighbors is you look at okay
1229
00:47:05,119 --> 00:47:06,720
what's around you
1230
00:47:06,720 --> 00:47:08,160
and then you're basically like okay i'm
1231
00:47:08,160 --> 00:47:10,400
going to take the label of the majority
1232
00:47:10,400 --> 00:47:12,720
that's around me
1233
00:47:12,720 --> 00:47:14,000
so the first thing we have to do is we
1234
00:47:14,000 --> 00:47:16,400
have to define a distance function
1235
00:47:16,400 --> 00:47:19,920
and a lot of times in you know 2d plots
1236
00:47:19,920 --> 00:47:21,680
like this our distance function is
1237
00:47:21,680 --> 00:47:23,599
something known as
1238
00:47:23,599 --> 00:47:27,200
euclidean distance
1239
00:47:31,839 --> 00:47:33,920
and euclidean distance
1240
00:47:33,920 --> 00:47:35,680
is basically just
1241
00:47:35,680 --> 00:47:39,839
this straight line distance like this
1242
00:47:40,960 --> 00:47:43,280
okay
1243
00:47:44,960 --> 00:47:47,200
so this would be the euclidean distance
1244
00:47:47,200 --> 00:47:48,960
it seems like
1245
00:47:48,960 --> 00:47:51,599
there's this point
1246
00:47:51,599 --> 00:47:53,599
there's this point
1247
00:47:53,599 --> 00:47:55,200
there's that point
1248
00:47:55,200 --> 00:47:57,839
etcetera so the length of this line this
1249
00:47:57,839 --> 00:48:00,000
green line that i just drew that is
1250
00:48:00,000 --> 00:48:03,119
what's known as euclidean distance
1251
00:48:03,119 --> 00:48:05,440
if we want to get technical with that
1252
00:48:05,440 --> 00:48:08,000
this exact formula
1253
00:48:08,000 --> 00:48:11,680
is the distance here let me zoom in
1254
00:48:11,680 --> 00:48:15,359
the distance is equal to the square root
1255
00:48:15,359 --> 00:48:17,280
of one point's
1256
00:48:17,280 --> 00:48:18,319
x
1257
00:48:18,319 --> 00:48:20,880
minus the other point's x
1258
00:48:20,880 --> 00:48:22,079
squared
1259
00:48:22,079 --> 00:48:24,960
plus extend that square root
1260
00:48:24,960 --> 00:48:27,920
the same thing for y so y one of one
1261
00:48:27,920 --> 00:48:30,000
minus y two of the other
1262
00:48:30,000 --> 00:48:31,280
squared
1263
00:48:31,280 --> 00:48:32,880
okay so we're basically
1264
00:48:32,880 --> 00:48:35,280
trying to find the differences
1266
00:48:35,280 --> 00:48:37,599
in x and in y
1266
00:48:37,599 --> 00:48:39,119
and then
1267
00:48:39,119 --> 00:48:41,359
square each of those sum it up and take
1268
00:48:41,359 --> 00:48:43,040
the square root
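That formula, written out as a small function (using Python's standard library only; the points in the example are made up):

```python
import math

def euclidean_distance(p1, p2):
    # Straight-line distance: square each per-dimension difference,
    # sum them up, and take the square root. Works for any number
    # of dimensions, not just 2.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))

print(euclidean_distance((0, 0), (3, 4)))  # → 5.0
```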
1269
00:48:43,040 --> 00:48:44,559
okay so i'm going to erase this so it
1270
00:48:44,559 --> 00:48:46,960
doesn't clutter
1271
00:48:46,960 --> 00:48:49,839
my drawing
1272
00:48:50,960 --> 00:48:53,280
but anyways now going back to this plot
1273
00:48:53,280 --> 00:48:54,880
so here
1274
00:48:54,880 --> 00:48:57,119
in the nearest neighbor algorithm we see
1275
00:48:57,119 --> 00:48:59,359
that there is a
1276
00:48:59,359 --> 00:49:00,640
k
1277
00:49:00,640 --> 00:49:02,160
right
1278
00:49:02,160 --> 00:49:04,319
and this k is basically telling us okay
1279
00:49:04,319 --> 00:49:06,800
how many neighbors do we use in order to
1280
00:49:06,800 --> 00:49:08,880
judge what the label is
1281
00:49:08,880 --> 00:49:11,920
so usually we use a k of maybe you know
1282
00:49:11,920 --> 00:49:14,160
three or five
1283
00:49:14,160 --> 00:49:16,079
depends on how big our data set is but
1284
00:49:16,079 --> 00:49:17,440
here i would say
1285
00:49:17,440 --> 00:49:18,319
maybe
1286
00:49:18,319 --> 00:49:20,800
a logical number would be three or five
1287
00:49:20,800 --> 00:49:24,559
so let's say that we take k
1288
00:49:24,559 --> 00:49:26,559
to be equal to three
1289
00:49:26,559 --> 00:49:29,839
okay well of this data point that i drew
1290
00:49:29,839 --> 00:49:32,480
over here
1291
00:49:32,480 --> 00:49:35,040
let me use green to highlight this okay
1292
00:49:35,040 --> 00:49:36,880
so of this data point that i drew over
1293
00:49:36,880 --> 00:49:39,359
here looks like the three closest points
1294
00:49:39,359 --> 00:49:41,359
are definitely this one
1295
00:49:41,359 --> 00:49:42,559
this one
1296
00:49:42,559 --> 00:49:46,400
and then this one has a length of four
1297
00:49:46,400 --> 00:49:49,280
and this one
1298
00:49:49,280 --> 00:49:50,720
seems like it'd be a little bit further
1299
00:49:50,720 --> 00:49:53,200
than four so actually this
1300
00:49:53,200 --> 00:49:54,800
would be our these would be our three
1301
00:49:54,800 --> 00:49:56,319
points
1302
00:49:56,319 --> 00:49:59,040
well all those points are blue
1303
00:49:59,040 --> 00:50:01,440
so chances are
1304
00:50:01,440 --> 00:50:04,000
my prediction for this point is going to
1305
00:50:04,000 --> 00:50:05,200
be blue
1306
00:50:05,200 --> 00:50:06,319
it's going to be
1307
00:50:06,319 --> 00:50:08,480
probably don't have a car
1308
00:50:08,480 --> 00:50:11,119
all right now what if my point is
1309
00:50:11,119 --> 00:50:13,680
somewhere
1310
00:50:13,680 --> 00:50:16,800
what if my point is somewhere over here
1311
00:50:16,800 --> 00:50:18,640
let's say that
1312
00:50:18,640 --> 00:50:21,520
a couple has
1313
00:50:21,520 --> 00:50:25,599
four kids and they make 240 000 a year
1314
00:50:25,599 --> 00:50:27,920
all right well now my closest points are
1315
00:50:27,920 --> 00:50:30,319
this one
1316
00:50:30,319 --> 00:50:33,280
probably a little bit over that one
1317
00:50:33,280 --> 00:50:36,160
and then this one right okay still all
1318
00:50:36,160 --> 00:50:37,359
pluses
1319
00:50:37,359 --> 00:50:38,559
well
1320
00:50:38,559 --> 00:50:42,720
this one is more than likely to be
1321
00:50:42,720 --> 00:50:44,559
a plus
1322
00:50:44,559 --> 00:50:46,640
right now let me get rid of some of
1323
00:50:46,640 --> 00:50:49,680
these just so that it looks a little bit
1324
00:50:49,680 --> 00:50:52,920
more clear
1325
00:50:54,800 --> 00:50:56,800
all right let's go through
1326
00:50:56,800 --> 00:50:58,559
one more
1327
00:50:58,559 --> 00:51:02,079
what about a point that might be
1328
00:51:02,079 --> 00:51:04,079
right
1329
00:51:04,079 --> 00:51:05,599
here
1330
00:51:05,599 --> 00:51:08,000
okay let's see well definitely this is
1331
00:51:08,000 --> 00:51:09,440
the closest
1332
00:51:09,440 --> 00:51:12,000
right this one's also closest
1333
00:51:12,000 --> 00:51:12,430
and then
1334
00:51:12,430 --> 00:51:14,319
[Music]
1335
00:51:14,319 --> 00:51:16,640
it's really close between the two of
1336
00:51:16,640 --> 00:51:18,559
these
1337
00:51:18,559 --> 00:51:20,480
but if we actually do the mathematics it
1338
00:51:20,480 --> 00:51:21,760
seems like
1339
00:51:21,760 --> 00:51:23,760
if we zoom in
1340
00:51:23,760 --> 00:51:26,240
this one is right here and this one is
1341
00:51:26,240 --> 00:51:28,720
in between these two so
1342
00:51:28,720 --> 00:51:31,119
this one here is actually shorter than
1343
00:51:31,119 --> 00:51:32,880
this one
1344
00:51:32,880 --> 00:51:35,520
and that means that that top one is the
1345
00:51:35,520 --> 00:51:37,599
one that we're going to take
1346
00:51:37,599 --> 00:51:39,680
now what is the majority of the points
1347
00:51:39,680 --> 00:51:41,520
that are close by
1348
00:51:41,520 --> 00:51:44,640
well we have one plus here we have one
1349
00:51:44,640 --> 00:51:47,200
plus here and we have one minus here
1350
00:51:47,200 --> 00:51:49,280
which means that the pluses
1351
00:51:49,280 --> 00:51:51,040
are the majority
1352
00:51:51,040 --> 00:51:53,920
and that means that this
1353
00:51:53,920 --> 00:51:56,319
label
1354
00:51:56,559 --> 00:51:59,520
is probably somebody with a car
1355
00:51:59,520 --> 00:52:01,280
okay
1356
00:52:01,280 --> 00:52:04,400
so this is how k nearest neighbors would
1357
00:52:04,400 --> 00:52:06,240
work
1358
00:52:06,240 --> 00:52:08,160
it's that simple
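The whole idea can be sketched from scratch in a few lines; the points and labels below are a hypothetical version of the income/kids plot above:

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    # Sort training points by Euclidean distance to the query,
    # then take the majority label among the k nearest neighbors.
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    nearest = [train_labels[i] for i in order[:k]]
    return Counter(nearest).most_common(1)[0][0]

# (income in $1000s/yr, number of kids); '+' owns a car, '-' does not
points = [(40, 0), (60, 1), (50, 2), (240, 4), (300, 3), (280, 5)]
labels = ['-', '-', '-', '+', '+', '+']
print(knn_predict(points, labels, (40, 2)))   # → '-'
print(knn_predict(points, labels, (240, 4)))  # → '+'
```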
1359
00:52:08,160 --> 00:52:11,040
and this can be extended to higher
1360
00:52:11,040 --> 00:52:13,599
dimensions you know
1361
00:52:13,599 --> 00:52:14,880
if you have
1362
00:52:14,880 --> 00:52:16,800
here we have two different features we
1363
00:52:16,800 --> 00:52:18,640
have the income
1364
00:52:18,640 --> 00:52:20,880
and then we have the number of kids
1365
00:52:20,880 --> 00:52:21,760
but
1366
00:52:21,760 --> 00:52:23,839
let's say we have 10 different features
1367
00:52:23,839 --> 00:52:26,960
we can expand our distance function so
1368
00:52:26,960 --> 00:52:29,119
that it includes all 10 of those
1369
00:52:29,119 --> 00:52:31,040
dimensions we take the square root of
1370
00:52:31,040 --> 00:52:32,559
everything and then we figure out which
1371
00:52:32,559 --> 00:52:34,800
one is the closest to the point that we
1372
00:52:34,800 --> 00:52:35,839
desire
1373
00:52:35,839 --> 00:52:39,119
to classify okay
1374
00:52:39,119 --> 00:52:41,040
so that's k-nearest neighbors
1375
00:52:41,040 --> 00:52:42,800
so now we've learned about uh k-nearest
1376
00:52:42,800 --> 00:52:43,760
neighbors
1377
00:52:43,760 --> 00:52:45,599
let's see how we would be able to do
1378
00:52:45,599 --> 00:52:47,520
that within our code
1379
00:52:47,520 --> 00:52:50,160
so here i'm going to label the section k
1380
00:52:50,160 --> 00:52:53,040
nearest neighbors
1381
00:52:53,119 --> 00:52:54,720
and we're actually going to use a
1382
00:52:54,720 --> 00:52:58,160
package from sklearn so the reason why
1383
00:52:58,160 --> 00:53:00,240
we use these packages is so that
1384
00:53:00,240 --> 00:53:02,640
we don't have to manually code all these
1385
00:53:02,640 --> 00:53:04,559
things ourself because it would be
1386
00:53:04,559 --> 00:53:06,240
really difficult and chances are the way
1387
00:53:06,240 --> 00:53:07,839
that we would code it either would have
1388
00:53:07,839 --> 00:53:10,160
bugs or it'd be really slow or i don't
1389
00:53:10,160 --> 00:53:11,839
know a whole bunch of issues
1390
00:53:11,839 --> 00:53:13,280
so what we're going to do is hand it off
1391
00:53:13,280 --> 00:53:14,960
to the pros
1392
00:53:14,960 --> 00:53:16,880
from here i can say
1393
00:53:16,880 --> 00:53:20,000
okay from sklearn which is this package
1394
00:53:20,000 --> 00:53:22,079
dot neighbors
1395
00:53:22,079 --> 00:53:24,319
i'm going to import k
1396
00:53:24,319 --> 00:53:26,079
neighbors classifier because we're
1397
00:53:26,079 --> 00:53:27,440
classifying
1398
00:53:27,440 --> 00:53:28,800
okay
1399
00:53:28,800 --> 00:53:30,160
so i run that
1400
00:53:30,160 --> 00:53:34,800
and our knn model is going to be this k
1401
00:53:34,800 --> 00:53:38,000
neighbors classifier and we can pass in
1402
00:53:38,000 --> 00:53:40,160
a parameter of how many neighbors you
1403
00:53:40,160 --> 00:53:43,119
know we want to use so first let's see
1404
00:53:43,119 --> 00:53:45,839
what happens if we just use one
1405
00:53:45,839 --> 00:53:47,839
so now if i do knn
1406
00:53:47,839 --> 00:53:49,920
and then model dot fit
1407
00:53:49,920 --> 00:53:52,720
i can pass in my x training set and my
1408
00:53:52,720 --> 00:53:54,240
y train
1409
00:53:54,240 --> 00:53:58,480
data okay so that effectively fits this
1410
00:53:58,480 --> 00:54:00,400
model
1411
00:54:00,400 --> 00:54:04,480
and let's get all the predictions so y
1412
00:54:04,480 --> 00:54:06,960
knn or i guess yeah let's do y
1413
00:54:06,960 --> 00:54:08,319
predictions
1414
00:54:08,319 --> 00:54:11,359
and my y predictions are going to be knn
1415
00:54:11,359 --> 00:54:14,000
model dot predict
1416
00:54:14,000 --> 00:54:15,599
um
1417
00:54:15,599 --> 00:54:19,200
so let's use the test set x test
1418
00:54:19,200 --> 00:54:21,040
okay
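The sklearn calls above, sketched end to end; the synthetic data here is a hypothetical stand-in for the scaled MAGIC gamma/hadron split:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary-classification data standing in for the real features
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn_model = KNeighborsClassifier(n_neighbors=1)
knn_model.fit(X_train, y_train)
y_pred = knn_model.predict(X_test)
print(y_pred[:6], y_test[:6])  # eyeball a few predictions vs. truth
```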
1419
00:54:21,040 --> 00:54:23,680
all right so if i
1420
00:54:23,680 --> 00:54:25,280
call y predict you'll see that we have
1421
00:54:25,280 --> 00:54:27,680
those but if i get my truth values for
1422
00:54:27,680 --> 00:54:29,359
that test set you'll see that this is
1423
00:54:29,359 --> 00:54:31,040
what we actually got so just looking at
1424
00:54:31,040 --> 00:54:32,880
this we got five out of six of them okay
1425
00:54:32,880 --> 00:54:35,119
great so let's actually take a look at
1426
00:54:35,119 --> 00:54:36,240
something
1427
00:54:36,240 --> 00:54:38,319
called the classification report that's
1428
00:54:38,319 --> 00:54:41,960
offered by sklearn so if i go to from
1429
00:54:41,960 --> 00:54:44,720
sklearn.metrics import
1430
00:54:44,720 --> 00:54:47,920
classification report
1431
00:54:47,920 --> 00:54:51,200
what i can actually do is say hey print
1432
00:54:51,200 --> 00:54:53,200
out this classification
1433
00:54:53,200 --> 00:54:54,640
report for me
1434
00:54:54,640 --> 00:54:56,960
and let's check you know
1435
00:54:56,960 --> 00:54:58,960
i'm giving you the y test and the y
1436
00:54:58,960 --> 00:55:01,040
prediction
1437
00:55:01,040 --> 00:55:02,880
we run this and we see we get this whole
1438
00:55:02,880 --> 00:55:04,160
entire chart so i'm going to tell you
1439
00:55:04,160 --> 00:55:07,359
guys a few things on this chart
1440
00:55:07,359 --> 00:55:10,319
all right this accuracy is 82 which is
1441
00:55:10,319 --> 00:55:11,920
actually pretty good that's just saying
1442
00:55:11,920 --> 00:55:14,319
hey if we just look at you know what
1443
00:55:14,319 --> 00:55:15,839
each of these new points what it's
1444
00:55:15,839 --> 00:55:17,280
closest to
1445
00:55:17,280 --> 00:55:19,520
then we actually get an 82 percent
1446
00:55:19,520 --> 00:55:21,119
accuracy
1447
00:55:21,119 --> 00:55:22,240
which means
1448
00:55:22,240 --> 00:55:24,160
how many do we get right versus how many
1449
00:55:24,160 --> 00:55:26,800
total are there
1450
00:55:26,800 --> 00:55:29,040
now precision is saying okay you might
1451
00:55:29,040 --> 00:55:31,280
see that we have it for class one or
1452
00:55:31,280 --> 00:55:33,440
class zero and
1453
00:55:33,440 --> 00:55:35,200
what precision is saying is let's go to
1454
00:55:35,200 --> 00:55:36,960
this wikipedia diagram over here because
1455
00:55:36,960 --> 00:55:39,920
i actually kind of like this diagram
1456
00:55:39,920 --> 00:55:41,359
so here
1457
00:55:41,359 --> 00:55:43,680
this is our entire data set and on the
1458
00:55:43,680 --> 00:55:45,760
left over here we have everything that
1459
00:55:45,760 --> 00:55:47,920
we know is positive so everything that
1460
00:55:47,920 --> 00:55:50,640
is actually truly positive that we've
1461
00:55:50,640 --> 00:55:52,480
labeled positive in our original data
1462
00:55:52,480 --> 00:55:53,280
set
1463
00:55:53,280 --> 00:55:54,960
and over here this is everything that's
1464
00:55:54,960 --> 00:55:56,640
truly negative
1465
00:55:56,640 --> 00:55:59,359
now in the circle we have
1466
00:55:59,359 --> 00:56:01,680
things that were labeled
1467
00:56:01,680 --> 00:56:05,440
positive by our model
1468
00:56:05,440 --> 00:56:07,440
on the left here we have things that are
1469
00:56:07,440 --> 00:56:09,440
truly positive because you know this
1470
00:56:09,440 --> 00:56:10,400
side is
1471
00:56:10,400 --> 00:56:12,000
the positive side and this side is the
1472
00:56:12,000 --> 00:56:13,359
negative side so these are truly
1473
00:56:13,359 --> 00:56:14,559
positive
1474
00:56:14,559 --> 00:56:17,040
whereas all these ones out here well
1475
00:56:17,040 --> 00:56:18,640
they should have been positive but they
1476
00:56:18,640 --> 00:56:20,720
are labeled as negative
1477
00:56:20,720 --> 00:56:23,040
and in here these are the ones that
1478
00:56:23,040 --> 00:56:24,400
we've labeled positive but they're
1479
00:56:24,400 --> 00:56:26,960
actually negative and out here these are
1480
00:56:26,960 --> 00:56:28,960
truly negative
1481
00:56:28,960 --> 00:56:32,319
so precision is saying okay
1482
00:56:32,319 --> 00:56:34,720
out of all the ones we've labeled as
1483
00:56:34,720 --> 00:56:37,040
positive how many of them are true
1484
00:56:37,040 --> 00:56:38,799
positives
1485
00:56:38,799 --> 00:56:41,839
and recall is saying okay out of all the
1486
00:56:41,839 --> 00:56:43,839
ones that we know are truly positive how
1487
00:56:43,839 --> 00:56:46,720
many did we actually get right
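Those two definitions made concrete with sklearn's metrics; the truth and prediction vectors here are toy values chosen so the arithmetic is easy to check by hand:

```python
from sklearn.metrics import classification_report, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

# Precision: of everything we labeled positive, how many truly are
#   → 2 true positives out of 3 predicted positives = 2/3
# Recall: of everything truly positive, how many did we label positive
#   → 2 true positives out of 3 actual positives = 2/3
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(classification_report(y_true, y_pred))
```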
1488
00:56:46,720 --> 00:56:48,319
okay
1489
00:56:48,319 --> 00:56:50,640
so going back to this over here our
1490
00:56:50,640 --> 00:56:53,440
precision score so
1491
00:56:53,440 --> 00:56:55,440
again precision out of all the ones that
1492
00:56:55,440 --> 00:56:57,119
we've labeled as
1493
00:56:57,119 --> 00:56:59,440
the specific class how many of them are
1494
00:56:59,440 --> 00:57:03,520
actually that class it's 77 and 84
1495
00:57:03,520 --> 00:57:05,839
now recall out of all the ones that
1496
00:57:05,839 --> 00:57:08,000
are actually this class how many of
1497
00:57:08,000 --> 00:57:10,799
those did we get this is 68
1498
00:57:10,799 --> 00:57:12,799
and eighty nine percent
1499
00:57:12,799 --> 00:57:13,839
all right
1500
00:57:13,839 --> 00:57:15,920
so not too shabby we can clearly see
1501
00:57:15,920 --> 00:57:18,559
that this recall and precision for
1502
00:57:18,559 --> 00:57:20,720
class zero is worse than for
1503
00:57:20,720 --> 00:57:22,079
class one
1504
00:57:22,079 --> 00:57:23,520
right so that means it's
1505
00:57:23,520 --> 00:57:26,319
worse for hadrons than for our gammas
1506
00:57:26,319 --> 00:57:28,319
this f1 score over here is kind of a
1507
00:57:28,319 --> 00:57:30,319
combination of the precision and recall
1508
00:57:30,319 --> 00:57:31,920
score so
1509
00:57:31,920 --> 00:57:33,200
we're actually going to mostly look at
1510
00:57:33,200 --> 00:57:35,760
this one because we have an unbalanced
1511
00:57:35,760 --> 00:57:37,280
test data set
1512
00:57:37,280 --> 00:57:40,720
so here we have a measure of 72 and 87
1513
00:57:40,720 --> 00:57:46,400
or 0.72 and 0.87 which is not too shabby
1514
00:57:46,400 --> 00:57:48,400
all right
1515
00:57:48,400 --> 00:57:53,960
well what if we you know made this three
1516
00:57:54,480 --> 00:57:56,400
so we actually see that
1517
00:57:56,400 --> 00:58:00,240
okay so what was it originally with one
1518
00:58:00,640 --> 00:58:03,200
we see that our f1 score
1519
00:58:03,200 --> 00:58:05,520
you know is now it was 0.72 and then
1520
00:58:05,520 --> 00:58:07,280
0.87
1521
00:58:07,280 --> 00:58:09,680
and then our accuracy was 82 so if i
1522
00:58:09,680 --> 00:58:12,559
change that to three
1523
00:58:12,559 --> 00:58:13,920
all right so
1524
00:58:13,920 --> 00:58:17,119
we've kind of increased class zero at the cost
1525
00:58:17,119 --> 00:58:19,680
of class one and then our overall average
1526
00:58:19,680 --> 00:58:22,640
accuracy is 81. so let's actually just
1527
00:58:22,640 --> 00:58:24,960
make this five
1528
00:58:24,960 --> 00:58:27,680
all right so you know again very similar
1529
00:58:27,680 --> 00:58:29,680
numbers we have 82 percent accuracy
1530
00:58:29,680 --> 00:58:31,520
which is pretty decent for a model
1531
00:58:31,520 --> 00:58:32,880
that's
1532
00:58:32,880 --> 00:58:35,599
relatively simple
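The k-sweep described above can be sketched in a few lines. This is a guess at the notebook cell, with hypothetical names: the real tutorial uses the MAGIC gamma-telescope data, so a tiny synthetic set stands in here just to make the loop runnable.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the MAGIC data (assumption, not the real set)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy labels for illustration
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Try k = 1, 3, 5 as in the video and compare test accuracy
for k in (1, 3, 5):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    print(k, round(knn.score(X_test, y_test), 2))
```

On the real data the accuracies stay close together (around 81-82 percent), which matches the observation that changing k barely moves the overall score.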
1533
00:58:35,599 --> 00:58:37,920
the next type of model that we're going
1534
00:58:37,920 --> 00:58:40,240
to talk about is something known as
1535
00:58:40,240 --> 00:58:42,480
naive bayes
1536
00:58:42,480 --> 00:58:45,040
now in order to understand the concepts
1537
00:58:45,040 --> 00:58:47,680
behind naive bayes we have to be able to
1538
00:58:47,680 --> 00:58:50,079
understand conditional probability and
1539
00:58:50,079 --> 00:58:51,520
bayes rule
1540
00:58:51,520 --> 00:58:52,240
so
1541
00:58:52,240 --> 00:58:54,720
let's say i have some sort of data set
1542
00:58:54,720 --> 00:58:57,280
that's shown in this table right here
1543
00:58:57,280 --> 00:59:00,319
people who have covid are over here in
1544
00:59:00,319 --> 00:59:03,839
this red row and people who do not have
1545
00:59:03,839 --> 00:59:06,880
covid are down here in this green row
1546
00:59:06,880 --> 00:59:08,559
now what about the covid test well people
1547
00:59:08,559 --> 00:59:11,200
who have tested positive
1548
00:59:11,200 --> 00:59:15,040
are over here in this column
1549
00:59:15,040 --> 00:59:17,359
and people who have tested negative are
1550
00:59:17,359 --> 00:59:19,280
over here in this
1551
00:59:19,280 --> 00:59:20,559
column
1552
00:59:20,559 --> 00:59:22,079
okay
1553
00:59:22,079 --> 00:59:23,839
yeah so basically our categories are
1554
00:59:23,839 --> 00:59:26,480
people who have covid and test positive
1555
00:59:26,480 --> 00:59:28,960
people who don't have covid but test
1556
00:59:28,960 --> 00:59:31,680
positive so a false positive
1557
00:59:31,680 --> 00:59:33,520
people who have covid and test negative
1558
00:59:33,520 --> 00:59:35,359
which is a false negative
1559
00:59:35,359 --> 00:59:38,000
and people who don't have covid and test
1560
00:59:38,000 --> 00:59:39,359
negative which
1561
00:59:39,359 --> 00:59:41,440
is good means you don't have covid
1562
00:59:41,440 --> 00:59:43,440
okay so let's make this slightly more
1563
00:59:43,440 --> 00:59:46,440
legible
1564
00:59:47,040 --> 00:59:50,240
and here in the margins i've written
1565
00:59:50,240 --> 00:59:51,839
down the sums
1566
00:59:51,839 --> 00:59:52,960
of
1567
00:59:52,960 --> 00:59:54,880
whatever it's referring to so this here
1568
00:59:54,880 --> 00:59:58,160
is the sum of this entire row
1569
00:59:58,160 --> 01:00:00,799
and this here might be the sum of this
1570
01:00:00,799 --> 01:00:03,440
column over here
1571
01:00:03,440 --> 01:00:04,720
okay
1572
01:00:04,720 --> 01:00:07,599
so the first question that i have is
1573
01:00:07,599 --> 01:00:10,079
what is the probability of having covid
1574
01:00:10,079 --> 01:00:12,240
given that you have a positive test
1575
01:00:12,240 --> 01:00:14,640
and in probability we write that out
1576
01:00:14,640 --> 01:00:16,880
like this so the probability
1577
01:00:16,880 --> 01:00:19,839
of covid
1578
01:00:20,400 --> 01:00:23,440
given so this line that vertical line
1579
01:00:23,440 --> 01:00:24,319
means
1580
01:00:24,319 --> 01:00:26,559
given that you know some condition
1581
01:00:26,559 --> 01:00:30,799
so given a positive test
1582
01:00:30,799 --> 01:00:31,920
okay
1583
01:00:31,920 --> 01:00:33,520
so
1584
01:00:33,520 --> 01:00:35,839
what is the probability of having covid
1585
01:00:35,839 --> 01:00:37,760
given a positive test
1586
01:00:37,760 --> 01:00:40,000
so what this is asking is saying okay
1587
01:00:40,000 --> 01:00:41,839
let's go into
1588
01:00:41,839 --> 01:00:44,640
this condition so the condition of
1589
01:00:44,640 --> 01:00:47,920
having a positive test that is
1590
01:00:47,920 --> 01:00:50,720
this slice of the data right that means
1591
01:00:50,720 --> 01:00:52,000
if you're in this slice of data you have
1592
01:00:52,000 --> 01:00:54,480
a positive test so given that we have a
1593
01:00:54,480 --> 01:00:57,119
positive test given in this condition in
1594
01:00:57,119 --> 01:00:58,880
this circumstance we have a positive
1595
01:00:58,880 --> 01:00:59,839
test
1596
01:00:59,839 --> 01:01:01,520
so what's the probability that we have
1597
01:01:01,520 --> 01:01:03,119
covid
1598
01:01:03,119 --> 01:01:04,000
well
1599
01:01:04,000 --> 01:01:05,440
if we're just using this data the number
1600
01:01:05,440 --> 01:01:09,040
of people that have covid is 531.
1601
01:01:09,040 --> 01:01:12,480
so i'm going to say that there's 531
1602
01:01:12,480 --> 01:01:14,880
people that have covid
1603
01:01:14,880 --> 01:01:17,040
and then now we divide that by the total
1604
01:01:17,040 --> 01:01:18,880
number of people that have a positive
1605
01:01:18,880 --> 01:01:22,520
test which is 551
1606
01:01:24,160 --> 01:01:26,799
okay so that's the probability and
1607
01:01:26,799 --> 01:01:28,160
doing a quick
1608
01:01:28,160 --> 01:01:29,599
division
1609
01:01:29,599 --> 01:01:33,200
we get that this is equal
1610
01:01:33,839 --> 01:01:35,799
to around
1611
01:01:35,799 --> 01:01:38,400
96.4 percent
1612
01:01:38,400 --> 01:01:40,720
so according to this data set which is
1613
01:01:40,720 --> 01:01:42,400
data that i made up off the top of my
1614
01:01:42,400 --> 01:01:45,680
head so it's not actually real covid data
1615
01:01:45,680 --> 01:01:48,720
but according to this data
1616
01:01:48,720 --> 01:01:50,880
uh the probability of covid given that
1617
01:01:50,880 --> 01:01:52,880
you tested positive
1618
01:01:52,880 --> 01:01:54,079
is
1619
01:01:54,079 --> 01:01:57,079
96.4
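The arithmetic here is just the count in the intersection divided by the count in the condition, using the made-up table from the video:

```python
# P(covid | positive test) = (covid AND positive) / (all positive)
covid_and_positive = 531
total_positive = 551
p = covid_and_positive / total_positive
print(round(p * 100, 1))  # about 96.4 percent
```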
1620
01:01:57,920 --> 01:02:00,559
all right now with that let's talk about
1621
01:02:00,559 --> 01:02:01,920
bayes rule
1622
01:02:01,920 --> 01:02:04,400
which is this section here
1623
01:02:04,400 --> 01:02:07,920
let's ignore this bottom part for now
1624
01:02:07,920 --> 01:02:08,640
so
1625
01:02:08,640 --> 01:02:10,400
base rule is asking okay what is the
1626
01:02:10,400 --> 01:02:13,440
probability of some event a happening
1627
01:02:13,440 --> 01:02:16,960
given that b happened so this
1628
01:02:16,960 --> 01:02:18,559
we already know has happened this is our
1629
01:02:18,559 --> 01:02:21,839
condition right
1630
01:02:22,319 --> 01:02:25,119
well what if we don't have data for that
1631
01:02:25,119 --> 01:02:26,880
right like what if we don't know what
1632
01:02:26,880 --> 01:02:29,359
the probability of a given b is
1633
01:02:29,359 --> 01:02:31,200
well bayes rule is saying okay well you
1634
01:02:31,200 --> 01:02:33,520
can actually go and calculate it as long
1635
01:02:33,520 --> 01:02:36,000
as you have the probability of b given a
1636
01:02:36,000 --> 01:02:38,000
the probability of a and the probability
1637
01:02:38,000 --> 01:02:39,280
of b
1638
01:02:39,280 --> 01:02:41,599
okay and this is just a mathematical
1639
01:02:41,599 --> 01:02:43,440
formula for that
1640
01:02:43,440 --> 01:02:46,799
all right so here we have bayes rule
1641
01:02:46,799 --> 01:02:49,200
and let's actually see bayes rule in
1642
01:02:49,200 --> 01:02:51,680
action let's use it on an example
1643
01:02:51,680 --> 01:02:54,079
so here let's say that we have some
1644
01:02:54,079 --> 01:02:56,960
um disease statistics okay
1645
01:02:56,960 --> 01:03:00,160
so not covid a different disease
1646
01:03:00,160 --> 01:03:01,920
and we know that the probability of
1647
01:03:01,920 --> 01:03:04,640
obtaining a false positive is 0.05
1648
01:03:04,640 --> 01:03:06,079
probability of obtaining a false
1649
01:03:06,079 --> 01:03:09,040
negative is 0.01 and the probability of
1650
01:03:09,040 --> 01:03:11,359
the disease is 0.1
1651
01:03:11,359 --> 01:03:12,799
okay what is the probability of the
1652
01:03:12,799 --> 01:03:14,559
disease given that
1653
01:03:14,559 --> 01:03:17,520
we got a positive test
1654
01:03:17,520 --> 01:03:20,319
how do we even go about solving this
1655
01:03:20,319 --> 01:03:22,880
so what what do i mean by false positive
1656
01:03:22,880 --> 01:03:25,280
what's a different way to rewrite that
1657
01:03:25,280 --> 01:03:27,920
a false positive is when you test
1658
01:03:27,920 --> 01:03:30,400
positive but you don't actually have the
1659
01:03:30,400 --> 01:03:33,359
disease so this here is a probability
1660
01:03:33,359 --> 01:03:37,039
that you'd have a positive test given
1661
01:03:37,039 --> 01:03:39,200
no disease
1662
01:03:39,200 --> 01:03:40,400
right
1663
01:03:40,400 --> 01:03:42,240
and similarly for the false negative
1664
01:03:42,240 --> 01:03:43,599
it's the probability that you test
1665
01:03:43,599 --> 01:03:46,240
negative given that you actually have
1666
01:03:46,240 --> 01:03:48,640
the disease so if i put that into a
1667
01:03:48,640 --> 01:03:50,160
chart
1668
01:03:50,160 --> 01:03:52,799
for example
1669
01:03:54,480 --> 01:03:56,319
and this might be my positive and
1670
01:03:56,319 --> 01:03:58,960
negative tests and this might be
1671
01:03:58,960 --> 01:04:00,480
my diseases
1672
01:04:00,480 --> 01:04:04,079
disease and no disease
1673
01:04:04,079 --> 01:04:06,160
well the probability that i test
1674
01:04:06,160 --> 01:04:07,680
positive but actually have no disease
1675
01:04:07,680 --> 01:04:10,480
okay that's 0.05 over here
1676
01:04:10,480 --> 01:04:12,400
and then the false negative is up here
1677
01:04:12,400 --> 01:04:14,720
for 0.01 so i'm testing negative but i
1678
01:04:14,720 --> 01:04:17,839
actually do have the disease
1679
01:04:18,000 --> 01:04:20,799
this so the probability that you test
1680
01:04:20,799 --> 01:04:23,039
positive given that you don't have the disease
1681
01:04:23,039 --> 01:04:24,400
plus the probability that you test
1682
01:04:24,400 --> 01:04:25,839
negative given that you don't have the
1683
01:04:25,839 --> 01:04:27,119
disease
1684
01:04:27,119 --> 01:04:29,280
that should sum up to one
1685
01:04:29,280 --> 01:04:30,400
okay because if you don't have the
1686
01:04:30,400 --> 01:04:32,000
disease then you should have some
1687
01:04:32,000 --> 01:04:33,359
probability that you're testing positive
1688
01:04:33,359 --> 01:04:34,480
and some probably that you're testing
1689
01:04:34,480 --> 01:04:36,799
negative but that probability
1690
01:04:36,799 --> 01:04:39,680
in total should be one
1691
01:04:39,680 --> 01:04:42,160
so that means that
1692
01:04:42,160 --> 01:04:43,839
the probability of negative and no
1693
01:04:43,839 --> 01:04:45,680
disease this should be the complement
1694
01:04:45,680 --> 01:04:47,039
this would be the opposite so it should
1695
01:04:47,039 --> 01:04:48,200
be
1696
01:04:48,200 --> 01:04:51,119
0.95 because it's 1 minus whatever this
1697
01:04:51,119 --> 01:04:54,640
probability is
1698
01:04:54,640 --> 01:04:56,960
and then similarly
1699
01:04:56,960 --> 01:04:59,280
oops
1700
01:04:59,599 --> 01:05:03,440
up here this should be 0.99 because
1701
01:05:03,440 --> 01:05:05,599
the probability that we
1702
01:05:05,599 --> 01:05:07,520
you know test negative and have the
1703
01:05:07,520 --> 01:05:08,960
disease plus the probability that we
1704
01:05:08,960 --> 01:05:10,480
test positive and have the disease
1705
01:05:10,480 --> 01:05:12,160
should equal one
1706
01:05:12,160 --> 01:05:14,640
so this is our probability chart and now
1707
01:05:14,640 --> 01:05:16,720
this probability of disease
1708
01:05:16,720 --> 01:05:18,640
being 0.1 just means i
1709
01:05:18,640 --> 01:05:19,920
have a ten percent probability of
1710
01:05:19,920 --> 01:05:21,599
actually of having the disease right
1711
01:05:21,599 --> 01:05:23,119
like
1712
01:05:23,119 --> 01:05:24,799
in the general population the
1713
01:05:24,799 --> 01:05:26,400
probability that i have the disease is
1714
01:05:26,400 --> 01:05:28,960
0.1
1715
01:05:28,960 --> 01:05:30,559
okay so what is the probability that i
1716
01:05:30,559 --> 01:05:32,480
have the disease given that i got a
1717
01:05:32,480 --> 01:05:34,839
positive
1718
01:05:34,839 --> 01:05:37,359
test well remember that we can write
1719
01:05:37,359 --> 01:05:39,839
this out in terms of bayes rule right so
1720
01:05:39,839 --> 01:05:42,400
if i use this rule up here
1721
01:05:42,400 --> 01:05:44,319
this is the probability of a positive
1722
01:05:44,319 --> 01:05:48,240
test given that i have the disease
1723
01:05:48,240 --> 01:05:50,319
times the probability
1724
01:05:50,319 --> 01:05:52,799
of the disease
1725
01:05:52,799 --> 01:05:55,599
divided by the probability of the
1726
01:05:55,599 --> 01:05:59,920
evidence which is my positive test
1727
01:05:59,920 --> 01:06:01,599
all right now let's plug in some numbers
1728
01:06:01,599 --> 01:06:04,480
for that the probability of having a
1729
01:06:04,480 --> 01:06:06,640
positive test given that i have disease
1730
01:06:06,640 --> 01:06:07,400
is
1731
01:06:07,400 --> 01:06:09,200
0.99
1732
01:06:09,200 --> 01:06:11,039
and then the probability that i have the
1733
01:06:11,039 --> 01:06:12,480
disease
1734
01:06:12,480 --> 01:06:14,960
is this value over here
1735
01:06:14,960 --> 01:06:17,960
0.1
1736
01:06:18,480 --> 01:06:20,240
okay
1737
01:06:20,240 --> 01:06:21,359
and then
1738
01:06:21,359 --> 01:06:23,039
the probability that i have a positive
1739
01:06:23,039 --> 01:06:24,720
test at all
1740
01:06:24,720 --> 01:06:26,880
should be okay what is the probability
1741
01:06:26,880 --> 01:06:28,559
that i have a positive test given that i
1742
01:06:28,559 --> 01:06:30,720
actually have the disease
1743
01:06:30,720 --> 01:06:34,319
and then having having the disease
1744
01:06:34,319 --> 01:06:36,160
and then the other case
1745
01:06:36,160 --> 01:06:38,160
where the probability of me having a
1746
01:06:38,160 --> 01:06:40,240
negative test given or sorry positive
1747
01:06:40,240 --> 01:06:43,359
test given no disease
1748
01:06:43,839 --> 01:06:46,319
times the probability of not actually
1749
01:06:46,319 --> 01:06:48,559
having a disease
1750
01:06:48,559 --> 01:06:50,880
okay so i can expand that
1751
01:06:50,880 --> 01:06:52,480
probability of having positive tests out
1752
01:06:52,480 --> 01:06:54,480
into these two different cases
1753
01:06:54,480 --> 01:06:56,720
i have a disease and then i don't
1754
01:06:56,720 --> 01:06:58,079
and then
1755
01:06:58,079 --> 01:06:59,359
what's the probability of having
1756
01:06:59,359 --> 01:07:00,880
positive tests in either one of those
1757
01:07:00,880 --> 01:07:02,480
cases
1758
01:07:02,480 --> 01:07:05,280
so that expression would become
1759
01:07:05,280 --> 01:07:06,960
0.99
1760
01:07:06,960 --> 01:07:09,440
times 0.1
1761
01:07:09,440 --> 01:07:11,400
plus
1762
01:07:11,400 --> 01:07:13,839
0.05 so that's the probability that i'm
1763
01:07:13,839 --> 01:07:15,599
testing positive but don't have the
1764
01:07:15,599 --> 01:07:16,880
disease
1765
01:07:16,880 --> 01:07:18,000
and the times the probability that i
1766
01:07:18,000 --> 01:07:19,440
don't actually have the disease so
1767
01:07:19,440 --> 01:07:22,079
that's 1 minus 0.1 the probability that
1768
01:07:22,079 --> 01:07:23,599
the population doesn't have the disease
1769
01:07:23,599 --> 01:07:25,280
is 90 percent
1770
01:07:25,280 --> 01:07:26,839
so
1771
01:07:26,839 --> 01:07:32,160
0.9 and let's do that multiplication
1772
01:07:32,160 --> 01:07:33,920
and i get an answer
1773
01:07:33,920 --> 01:07:37,240
of 0.6875
1774
01:07:38,000 --> 01:07:39,960
or
1775
01:07:39,960 --> 01:07:43,839
68.75 percent
1776
01:07:44,400 --> 01:07:46,640
okay
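Plugging the quoted numbers into Bayes' rule, with the evidence term expanded over the two cases exactly as in the video:

```python
# P(disease | +) = P(+ | disease) P(disease) / P(+)
p_pos_given_disease = 0.99      # 1 minus the false-negative rate 0.01
p_disease = 0.1                 # prior prevalence in the population
p_pos_given_no_disease = 0.05   # false-positive rate

# Expand P(+) over the disease / no-disease cases
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_no_disease * (1 - p_disease))

p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)  # 0.6875, i.e. 68.75 percent
```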
1777
01:07:46,720 --> 01:07:49,680
all right so we can actually expand that
1778
01:07:49,680 --> 01:07:52,400
we can expand bayes rule
1779
01:07:52,400 --> 01:07:55,119
and apply it to classification
1780
01:07:55,119 --> 01:07:57,440
and this is what we call naive bayes
1781
01:07:57,440 --> 01:07:59,920
so first a little terminology so the
1782
01:07:59,920 --> 01:08:02,160
posterior
1783
01:08:02,160 --> 01:08:04,559
is this over here because it's asking
1784
01:08:04,559 --> 01:08:06,799
hey what is the probability
1785
01:08:06,799 --> 01:08:11,680
of some class ck so by ck i just mean
1786
01:08:11,680 --> 01:08:13,440
you know the different categories so c
1787
01:08:13,440 --> 01:08:15,839
for category or class or whatever so
1788
01:08:15,839 --> 01:08:19,279
category one might be cats category two
1789
01:08:19,279 --> 01:08:22,799
dogs category three lizards
1790
01:08:22,799 --> 01:08:24,799
all the way we have k categories k is
1791
01:08:24,799 --> 01:08:26,319
just some number
1792
01:08:26,319 --> 01:08:27,439
okay
1793
01:08:27,439 --> 01:08:28,399
so
1794
01:08:28,399 --> 01:08:30,640
what is the probability of having
1795
01:08:30,640 --> 01:08:32,238
of
1796
01:08:32,238 --> 01:08:35,279
this specific sample x so this is our
1797
01:08:35,279 --> 01:08:37,279
feature vector
1798
01:08:37,279 --> 01:08:39,679
of this one sample
1799
01:08:39,679 --> 01:08:41,839
what is the probability of x fitting
1800
01:08:41,839 --> 01:08:43,920
into category one two three four
1801
01:08:43,920 --> 01:08:45,839
whatever right so that that's what this
1802
01:08:45,839 --> 01:08:48,158
is asking what is the probability that
1803
01:08:48,158 --> 01:08:50,880
you know it's actually from this class
1804
01:08:50,880 --> 01:08:53,679
given all this evidence that we see the
1805
01:08:53,679 --> 01:08:56,080
x's
1806
01:08:56,960 --> 01:08:58,158
so
1807
01:08:58,158 --> 01:08:59,679
the likelihood
1808
01:08:59,679 --> 01:09:01,520
is this quantity over here it's saying
1809
01:09:01,520 --> 01:09:02,319
okay
1810
01:09:02,319 --> 01:09:04,960
well assume that
1811
01:05:04,960 --> 01:05:06,479
we are
1812
01:05:06,479 --> 01:05:09,520
in class ck okay
1813
01:09:09,520 --> 01:09:12,080
assume that this is a category
1814
01:09:12,080 --> 01:09:14,000
well what is the likelihood of actually
1815
01:09:14,000 --> 01:09:15,359
seeing x
1816
01:09:15,359 --> 01:09:16,880
all these different features from that
1817
01:09:16,880 --> 01:09:19,440
category
1818
01:09:19,759 --> 01:09:22,319
and then this here is the prior so like
1819
01:09:22,319 --> 01:09:24,880
in the entire population of things
1820
01:09:24,880 --> 01:09:26,399
what what is what are the probabilities
1821
01:09:26,399 --> 01:09:28,000
what is the probability of this class in
1822
01:09:28,000 --> 01:09:30,000
general like if i have
1823
01:09:30,000 --> 01:09:32,238
you know in my entire data set what is
1824
01:09:32,238 --> 01:09:34,960
the percentage what is the chance that
1825
01:09:34,960 --> 01:09:37,279
this image is a cat how many cats do i
1826
01:09:37,279 --> 01:09:38,399
have
1827
01:09:38,399 --> 01:09:39,439
right
1828
01:09:39,439 --> 01:09:41,198
and then this down here is called the
1829
01:09:41,198 --> 01:09:43,120
evidence because
1830
01:09:43,120 --> 01:09:45,120
what we're trying to do
1831
01:09:45,120 --> 01:09:47,839
is we're changing our prior we're
1832
01:09:47,839 --> 01:09:51,198
creating this new posterior probability
1833
01:09:51,198 --> 01:09:53,600
built upon the prior by using some sort
1834
01:09:53,600 --> 01:09:55,360
of evidence right and that evidence is
1835
01:09:55,360 --> 01:09:57,679
the probability of x
1836
01:09:57,679 --> 01:09:59,760
so that's some vocab
1837
01:09:59,760 --> 01:10:00,880
and
1838
01:10:00,880 --> 01:10:04,159
this here
1839
01:10:05,360 --> 01:10:09,040
is a rule for naive bayes
1840
01:10:09,040 --> 01:10:11,360
whoa okay let's digest
1841
01:10:11,360 --> 01:10:12,239
that
1842
01:10:12,239 --> 01:10:14,159
a little bit okay
1843
01:10:14,159 --> 01:10:16,640
so what is let me use a different color
1844
01:10:16,640 --> 01:10:18,560
what is this
1845
01:10:18,560 --> 01:10:21,040
side of the equation asking
1846
01:10:21,040 --> 01:10:22,320
it's asking
1847
01:10:22,320 --> 01:10:24,000
what is the probability that we are in
1848
01:10:24,000 --> 01:10:26,320
some class ck
1849
01:10:26,320 --> 01:10:28,800
given that you know this is my first
1850
01:10:28,800 --> 01:10:30,800
input this is my second input this is
1851
01:10:30,800 --> 01:10:32,640
you know my third fourth this is my nth
1852
01:10:32,640 --> 01:10:33,679
input
1853
01:10:33,679 --> 01:10:37,199
so let's say that our classification is
1854
01:10:37,199 --> 01:10:40,000
do we play soccer today or not
1855
01:10:40,000 --> 01:10:42,640
okay and let's say our x's are okay is
1856
01:10:42,640 --> 01:10:45,520
it how much wind is there how much
1857
01:10:45,520 --> 01:10:47,920
uh rain is there and what day of the
1858
01:10:47,920 --> 01:10:50,480
week is it so let's say that it's
1859
01:10:50,480 --> 01:10:52,159
raining it's not windy but it's
1860
01:10:52,159 --> 01:10:56,000
wednesday do we play soccer do we not
1861
01:10:56,000 --> 01:10:59,280
so let's use bayes rule on this so this
1862
01:10:59,280 --> 01:11:01,600
here
1863
01:11:06,000 --> 01:11:09,520
is equal to the probability of x1
1864
01:11:09,520 --> 01:11:13,120
x2 all these joint probabilities given
1865
01:11:13,120 --> 01:11:14,640
class k
1866
01:11:14,640 --> 01:11:18,080
times the probability of that class
1867
01:11:18,080 --> 01:11:20,239
all over the probability of this
1868
01:11:20,239 --> 01:11:22,719
evidence
1869
01:11:24,840 --> 01:11:26,640
okay
1870
01:11:26,640 --> 01:11:28,640
so what is this fancy symbol over here
1871
01:11:28,640 --> 01:11:31,520
this means proportional
1872
01:11:31,520 --> 01:11:33,520
to
1873
01:11:33,520 --> 01:11:35,360
so how our equal sign means it's equal
1874
01:11:35,360 --> 01:11:38,000
to this like little squiggly sign means
1875
01:11:38,000 --> 01:11:40,960
that this is proportional to
1876
01:11:40,960 --> 01:11:42,159
okay
1877
01:11:42,159 --> 01:11:45,679
and this denominator over here
1878
01:11:45,679 --> 01:11:47,920
you might notice that it has no impact
1879
01:11:47,920 --> 01:11:50,640
on the class like this that number
1880
01:11:50,640 --> 01:11:52,159
doesn't depend on the class right so
1881
01:11:52,159 --> 01:11:54,320
this is going to be constant for all of
1882
01:11:54,320 --> 01:11:56,080
our different classes
1883
01:11:56,080 --> 01:11:57,920
so what i'm going to do is make things
1884
01:11:57,920 --> 01:12:00,400
simpler so i'm just going to say that
1885
01:12:00,400 --> 01:12:02,960
this probability
1886
01:12:02,960 --> 01:12:06,480
x1 x2 all the way to xn
1887
01:12:06,480 --> 01:12:08,080
this is going to be proportional to the
1888
01:12:08,080 --> 01:12:09,199
numerator i don't care about the
1889
01:12:09,199 --> 01:12:10,480
denominator because it's the same for
1890
01:12:10,480 --> 01:12:11,920
every single class
1891
01:12:11,920 --> 01:12:15,920
so this is proportional to x1 x2
1892
01:12:15,920 --> 01:12:17,040
xn
1893
01:12:17,040 --> 01:12:18,239
given
1894
01:12:18,239 --> 01:12:20,960
class k times the probability of that
1895
01:12:20,960 --> 01:12:22,400
class
1896
01:12:22,400 --> 01:12:24,080
okay
1897
01:12:24,080 --> 01:12:25,360
all right
1898
01:12:25,360 --> 01:12:28,080
so in naive bayes the
1899
01:12:28,080 --> 01:12:30,080
point of it being naive
1900
01:12:30,080 --> 01:12:32,880
is that for
1901
01:12:32,880 --> 01:12:34,239
this joint probability we're just
1902
01:12:34,239 --> 01:12:35,840
assuming that all of these different
1903
01:12:35,840 --> 01:12:36,800
things
1904
01:12:36,800 --> 01:12:39,280
are all independent so in my soccer
1905
01:12:39,280 --> 01:12:40,719
example
1906
01:12:40,719 --> 01:12:41,600
you know
1907
01:12:41,600 --> 01:12:42,719
the probability that we're playing
1908
01:12:42,719 --> 01:12:44,239
soccer
1909
01:12:44,239 --> 01:12:45,199
um
1910
01:12:45,199 --> 01:12:47,440
or the probability that you know it's
1911
01:12:47,440 --> 01:12:50,239
windy and it's rainy and and it's
1912
01:12:50,239 --> 01:12:51,679
wednesday all these things are
1913
01:12:51,679 --> 01:12:53,600
independent we're assuming that they're
1914
01:12:53,600 --> 01:12:55,520
independent
1915
01:12:55,520 --> 01:12:58,400
so that means that i can actually write
1916
01:12:58,400 --> 01:13:01,040
this part of the equation here
1917
01:13:01,040 --> 01:13:02,800
as
1918
01:13:02,800 --> 01:13:05,440
this so each term in here
1919
01:13:05,440 --> 01:13:07,520
i can just multiply
1920
01:13:07,520 --> 01:13:10,719
all them together so the probability of
1921
01:13:10,719 --> 01:13:11,920
the first
1922
01:13:11,920 --> 01:13:15,280
feature given that it's class k
1923
01:13:15,280 --> 01:13:17,120
times the probability
1924
01:13:17,120 --> 01:13:18,480
of the second feature given that's
1925
01:13:18,480 --> 01:13:20,640
probably like class k all the way up
1926
01:13:20,640 --> 01:13:21,600
until
1927
01:13:21,600 --> 01:13:23,440
you know the nth
1928
01:13:23,440 --> 01:13:24,800
feature
1929
01:13:24,800 --> 01:13:26,400
of
1930
01:13:26,400 --> 01:13:29,520
given that it's class k
1931
01:13:29,520 --> 01:13:33,040
so this expands to all of this
1932
01:13:33,040 --> 01:13:36,000
all right which means that this here is
1933
01:13:36,000 --> 01:13:37,600
now proportional
1934
01:13:37,600 --> 01:13:40,320
to the thing that we just expanded
1935
01:13:40,320 --> 01:13:42,640
times this
1936
01:13:42,640 --> 01:13:44,320
so i'm going to write
1937
01:13:44,320 --> 01:13:46,640
that out so the probability
1938
01:13:46,640 --> 01:13:48,560
of that class
1939
01:13:48,560 --> 01:13:51,679
and i'm actually going to use this
1940
01:13:51,679 --> 01:13:54,800
symbol so what this means is it's a huge
1941
01:13:54,800 --> 01:13:56,800
multiplication it means multiply
1942
01:13:56,800 --> 01:13:58,239
everything
1943
01:13:58,239 --> 01:14:01,280
to the right of this so this probability
1944
01:14:01,280 --> 01:14:02,640
x
1945
01:14:02,640 --> 01:14:03,840
given
1946
01:14:03,840 --> 01:14:05,600
some class k
1947
01:14:05,600 --> 01:14:08,800
but do it for all the i so i
1948
01:14:08,800 --> 01:14:11,440
what is i okay we're going to go from
1949
01:14:11,440 --> 01:14:13,120
the first
1950
01:14:13,120 --> 01:14:16,159
x i all the way to the end so that means
1951
01:14:16,159 --> 01:14:17,760
for every single i we're just
1952
01:14:17,760 --> 01:14:19,280
multiplying
1953
01:14:19,280 --> 01:14:21,679
these probabilities together
1954
01:14:21,679 --> 01:14:23,280
and that's where
1955
01:14:23,280 --> 01:14:25,840
this up here comes from
1956
01:14:25,840 --> 01:14:26,560
so
1957
01:14:26,560 --> 01:14:28,480
to wrap this up oops this should be a
1958
01:14:28,480 --> 01:14:30,480
line to wrap this up in plain english
1959
01:14:30,480 --> 01:14:31,840
basically what this is saying is the
1960
01:14:31,840 --> 01:14:34,800
probability that you know we're in some
1961
01:14:34,800 --> 01:14:37,199
category given that we have all these
1962
01:14:37,199 --> 01:14:38,560
different features
1963
01:14:38,560 --> 01:14:41,520
is proportional to the probability of
1964
01:14:41,520 --> 01:14:43,520
that class in general
1965
01:14:43,520 --> 01:14:45,360
times the probability of each of those
1966
01:14:45,360 --> 01:14:46,480
features
1967
01:14:46,480 --> 01:14:48,560
given that we're in this one class that
1968
01:14:48,560 --> 01:14:50,080
we're testing
1969
01:14:50,080 --> 01:14:51,600
so the probability
1970
01:14:51,600 --> 01:14:54,080
of it you know of us playing soccer
1971
01:14:54,080 --> 01:14:57,199
today given that it's rainy not windy
1972
01:14:57,199 --> 01:14:58,320
and
1973
01:14:58,320 --> 01:15:00,719
um and it's wednesday
1974
01:15:00,719 --> 01:15:03,199
is proportional to okay well what is
1975
01:15:03,199 --> 01:15:04,480
what is the probability that we play
1976
01:15:04,480 --> 01:15:06,239
soccer anyways
1977
01:15:06,239 --> 01:15:08,239
and then times the probability that it's
1978
01:15:08,239 --> 01:15:10,880
rainy given that we're playing soccer
1979
01:15:10,880 --> 01:15:12,560
times the probability that it's not
1980
01:15:12,560 --> 01:15:14,800
windy given that we're playing soccer so
1981
01:15:14,800 --> 01:15:16,000
how many times are we playing soccer
1982
01:15:16,000 --> 01:15:18,000
when it's windy you know
1983
01:15:18,000 --> 01:15:20,640
and then how many times or what's the
1984
01:15:20,640 --> 01:15:22,960
probability that's wednesday given that
1985
01:15:22,960 --> 01:15:25,120
we're playing soccer
1986
01:15:25,120 --> 01:15:27,360
okay
1987
01:15:27,360 --> 01:15:30,320
so how do we use this in order to make a
1988
01:15:30,320 --> 01:15:32,239
classification
1989
01:15:32,239 --> 01:15:35,520
so that's where this comes in our y hat
1990
01:15:35,520 --> 01:15:37,440
our predicted y
1991
01:15:37,440 --> 01:15:39,520
is going to be equal to something called
1992
01:15:39,520 --> 01:15:42,480
the arg max
1993
01:15:42,480 --> 01:15:44,640
and then this expression over here
1994
01:15:44,640 --> 01:15:46,960
because we want to take the arg max well
1995
01:15:46,960 --> 01:15:48,480
we want
1996
01:15:48,480 --> 01:15:50,719
so okay if i
1997
01:15:50,719 --> 01:15:53,280
write out this
1998
01:15:53,280 --> 01:15:55,120
again this means the probability of
1999
01:15:55,120 --> 01:15:58,080
being in some class c k given all of our
2000
01:15:58,080 --> 01:16:00,560
evidence
2001
01:16:01,760 --> 01:16:04,239
well we're going to take the k
2002
01:16:04,239 --> 01:16:06,560
that maximizes
2003
01:16:06,560 --> 01:16:09,840
this expression on the right
2004
01:16:09,840 --> 01:16:12,880
that's what arg max means so if k is in
2005
01:16:12,880 --> 01:16:14,640
zero oops
2006
01:16:14,640 --> 01:16:16,560
one through
2007
01:16:16,560 --> 01:16:18,159
k so this is how many categories there
2008
01:16:18,159 --> 01:16:20,400
are we're going to go through each k
2009
01:16:20,400 --> 01:16:22,880
and we're going to solve
2010
01:16:22,880 --> 01:16:25,840
this expression over here and find the k
2011
01:16:25,840 --> 01:16:28,960
that makes that the largest
2012
01:16:28,960 --> 01:16:30,000
okay
2013
01:16:30,000 --> 01:16:31,520
and
2014
01:16:31,520 --> 01:16:34,080
remember that instead of writing this we
2015
01:16:34,080 --> 01:16:38,320
have now a formula thanks to bayes rule
2016
01:16:38,320 --> 01:16:40,480
for helping us
2017
01:16:40,480 --> 01:16:42,800
approximate that right and something
2018
01:16:42,800 --> 01:16:45,040
that maybe we can
2019
01:16:45,040 --> 01:16:47,040
we maybe we have like the evidence for
2020
01:16:47,040 --> 01:16:48,800
that we have the answers for that based
2021
01:16:48,800 --> 01:16:51,840
on our training set
2022
01:16:52,000 --> 01:16:54,320
so this principle of going through each
2023
01:16:54,320 --> 01:16:56,480
of these and finding whatever class
2024
01:16:56,480 --> 01:16:59,040
whatever category maximizes
2025
01:16:59,040 --> 01:17:00,800
this expression on the right this is
2026
01:17:00,800 --> 01:17:02,960
something known as m
2027
01:17:02,960 --> 01:17:04,880
ap for short
2028
01:17:04,880 --> 01:17:08,000
or maximum
2029
01:17:08,719 --> 01:17:11,040
a
2030
01:17:11,040 --> 01:17:14,040
posteriori
2031
01:17:14,159 --> 01:17:16,640
uh pick the hypothesis so pick the k
2032
01:17:16,640 --> 01:17:18,640
that is the most probable so that we
2033
01:17:18,640 --> 01:17:20,239
minimize the probability of
2034
01:17:20,239 --> 01:17:22,719
misclassification
2035
01:17:22,719 --> 01:17:24,000
all right
2036
01:17:24,000 --> 01:17:28,159
so that is m ap that is
2037
01:17:28,159 --> 01:17:29,679
naive bayes
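The MAP rule just described can be written out directly: score each class by its prior times the product of the per-feature likelihoods, then take the arg max. The soccer-example numbers below are invented purely for illustration, and the conditional-independence assumption is exactly the "naive" step.

```python
# Made-up priors and per-feature likelihoods for the soccer example
priors = {"play": 0.6, "no_play": 0.4}
likelihoods = {
    "play":    {"rainy": 0.2, "not_windy": 0.7, "wednesday": 0.15},
    "no_play": {"rainy": 0.6, "not_windy": 0.4, "wednesday": 0.14},
}

def map_class(evidence):
    # arg max over classes k of P(Ck) * prod_i P(x_i | Ck)
    scores = {}
    for c, prior in priors.items():
        score = prior
        for x in evidence:
            score *= likelihoods[c][x]
        scores[c] = score
    return max(scores, key=scores.get)

print(map_class(["rainy", "not_windy", "wednesday"]))  # no_play here
```

With these particular numbers no_play wins (0.4 x 0.6 x 0.4 x 0.14 beats 0.6 x 0.2 x 0.7 x 0.15); different made-up likelihoods would flip the answer.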
2038
01:17:29,679 --> 01:17:31,679
back to the notebook so
2039
01:17:31,679 --> 01:17:34,000
just like how i imported k-nearest
2040
01:17:34,000 --> 01:17:35,440
uh k
2041
01:17:35,440 --> 01:17:37,920
neighbors classifier up here for naive
2042
01:17:37,920 --> 01:17:41,520
bayes i can go to sklearn.naive
2043
01:17:41,520 --> 01:17:42,560
bayes
2044
01:17:42,560 --> 01:17:44,880
and i can import gaussian
2045
01:17:44,880 --> 01:17:46,719
naive bayes
2046
01:17:46,719 --> 01:17:47,840
all right
2047
01:17:47,840 --> 01:17:49,760
and here i'm going to say my naive bayes
2048
01:17:49,760 --> 01:17:52,320
model is equal this is very similar to
2049
01:17:52,320 --> 01:17:55,199
what we had above
2050
01:17:55,440 --> 01:17:56,840
and i'm just going to
2051
01:17:56,840 --> 01:18:00,640
say with this model
2052
01:18:01,040 --> 01:18:03,280
we are going to fit
2053
01:18:03,280 --> 01:18:05,120
x train
2054
01:18:05,120 --> 01:18:08,159
and y train
2055
01:18:08,159 --> 01:18:11,960
all right just like above
2056
01:18:13,199 --> 01:18:15,440
so this i might actually have to
2057
01:18:15,440 --> 01:18:17,440
so i'm going to set that
2058
01:18:17,440 --> 01:18:19,920
and uh
2059
01:18:19,920 --> 01:18:21,840
exactly just like above i'm going to
2060
01:18:21,840 --> 01:18:24,880
make my prediction
2061
01:18:24,880 --> 01:18:26,719
so here i'm going to instead use my
2062
01:18:26,719 --> 01:18:29,199
naive bayes model
2063
01:18:29,199 --> 01:18:32,960
and of course i'm going to run
2064
01:18:32,960 --> 01:18:35,440
the classification report again
2065
01:18:35,440 --> 01:18:36,640
so i'm actually just going to put these
2066
01:18:36,640 --> 01:18:38,320
in the same cell
2067
01:18:38,320 --> 01:18:39,920
but here we have the y the new y
2068
01:18:39,920 --> 01:18:42,560
prediction and then y test is still our
2069
01:18:42,560 --> 01:18:44,880
original test data set
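The cell being written here might look like the following sketch. The notebook's actual X_train / y_train split isn't shown in this excerpt, so synthetic two-class data stands in for it to keep the snippet self-contained.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Stand-in data for the notebook's split: class 0 centered at 0,
# class 1 centered at 2, in three features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3)) + np.repeat([[0], [2]], 100, axis=0)
y_train = np.repeat([0, 1], 100)
X_test = rng.normal(size=(40, 3)) + np.repeat([[0], [2]], 20, axis=0)
y_test = np.repeat([0, 1], 20)

nb_model = GaussianNB()
nb_model.fit(X_train, y_train)       # same fit pattern as the KNN cell above

y_pred = nb_model.predict(X_test)
print(classification_report(y_test, y_pred))
```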
2070
01:18:44,880 --> 01:18:47,040
so if i run this
2071
01:18:47,040 --> 01:18:48,800
you'll see that
2072
01:18:48,800 --> 01:18:51,840
okay what's going on here we get worse
2073
01:18:51,840 --> 01:18:54,960
scores right our precision
2074
01:18:54,960 --> 01:18:57,760
for all of them they look slightly worse
2075
01:18:57,760 --> 01:18:58,800
and
2076
01:18:58,800 --> 01:19:00,960
our um
2077
01:19:00,960 --> 01:19:02,560
you know for
2078
01:19:02,560 --> 01:19:04,800
our precision our recall our f1 score
2079
01:19:04,800 --> 01:19:06,320
they look slightly worse for all the
2080
01:19:06,320 --> 01:19:08,239
different categories and our total
2081
01:19:08,239 --> 01:19:11,199
accuracy i mean it's still 72 percent which is
2082
01:19:11,199 --> 01:19:13,040
not too shabby
2083
01:19:13,040 --> 01:19:15,440
but it's still 72
2084
01:19:15,440 --> 01:19:16,560
okay
2085
01:19:16,560 --> 01:19:18,320
um
2086
01:19:18,320 --> 01:19:21,679
which you know is not that great
2087
01:19:21,679 --> 01:19:24,000
okay so let's move on to logistic
2088
01:19:24,000 --> 01:19:25,440
regression
2089
01:19:25,440 --> 01:19:28,480
here i've drawn a plot i have y so this
2090
01:19:28,480 --> 01:19:30,080
is my label
2091
01:19:30,080 --> 01:19:32,640
on one axis and then this is maybe one
2092
01:19:32,640 --> 01:19:34,080
of my features so let's just say i only
2093
01:19:34,080 --> 01:19:35,760
have one feature in this case
2094
01:19:35,760 --> 01:19:38,880
x0 right
2095
01:19:38,880 --> 01:19:39,679
well
2096
01:19:39,679 --> 01:19:41,679
we see that
2097
01:19:41,679 --> 01:19:44,239
you know i have a few of one class type
2098
01:19:44,239 --> 01:19:45,600
down here
2099
01:19:45,600 --> 01:19:47,199
and we know it's one class type because
2100
01:19:47,199 --> 01:19:49,280
it's zero and then we have our other
2101
01:19:49,280 --> 01:19:51,760
class type one up here
2102
01:19:51,760 --> 01:19:53,520
okay
2103
01:19:53,520 --> 01:19:55,920
so many of you guys are familiar with
2104
01:19:55,920 --> 01:19:58,080
regression so let's start there
2105
01:19:58,080 --> 01:20:00,239
if i were to draw a regression line
2106
01:20:00,239 --> 01:20:01,520
through this
2107
01:20:01,520 --> 01:20:03,600
it might look something
2108
01:20:03,600 --> 01:20:05,520
like
2109
01:20:05,520 --> 01:20:08,000
like this
2110
01:20:08,080 --> 01:20:09,600
right
2111
01:20:09,600 --> 01:20:12,239
well this doesn't seem to be a very good
2112
01:20:12,239 --> 01:20:14,560
model like why would we use this
2113
01:20:14,560 --> 01:20:17,440
specific line to predict y
2114
01:20:17,440 --> 01:20:19,360
right
2115
01:20:19,360 --> 01:20:20,640
it's it's
2116
01:20:20,640 --> 01:20:21,679
iffy
2117
01:20:21,679 --> 01:20:23,280
okay
2118
01:20:23,280 --> 01:20:26,080
um for example we might say
2119
01:20:26,080 --> 01:20:27,760
okay well it seems like you know
2120
01:20:27,760 --> 01:20:30,080
everything from here downwards would be
2121
01:20:30,080 --> 01:20:32,320
one class type in here upwards would be
2122
01:20:32,320 --> 01:20:34,560
another class type
2123
01:20:34,560 --> 01:20:36,080
but when you look at this you can
2124
01:20:36,080 --> 01:20:37,760
just
2125
01:20:37,760 --> 01:20:40,400
visually tell okay like
2126
01:20:40,400 --> 01:20:42,640
that line doesn't make sense things are
2127
01:20:42,640 --> 01:20:45,040
not those dots are not along that line
2128
01:20:45,040 --> 01:20:46,800
and the reason is because we are doing
2129
01:20:46,800 --> 01:20:50,400
classification not regression
2130
01:20:50,400 --> 01:20:52,159
okay
2131
01:20:52,159 --> 01:20:54,400
well first of all let's start here we
2132
01:20:54,400 --> 01:20:56,719
know that this
2133
01:20:56,719 --> 01:20:57,600
model
2134
01:20:57,600 --> 01:21:00,880
if we just use this line it equals m x
2135
01:21:00,880 --> 01:21:03,840
so whatever this axis is let's just say it's x
2136
01:21:03,840 --> 01:21:05,760
plus b which is the y intercept right
2137
01:21:05,760 --> 01:21:07,520
and m is the slope
2138
01:21:07,520 --> 01:21:10,080
but when we use a linear regression is
2139
01:21:10,080 --> 01:21:11,600
it actually y hat
2140
01:21:11,600 --> 01:21:14,640
no it's not right so when we're working
2141
01:21:14,640 --> 01:21:16,080
with logistic regression what we're
2142
01:21:16,080 --> 01:21:18,080
actually estimating in our model
2143
01:21:18,080 --> 01:21:19,840
is a probability what's the probability
2144
01:21:19,840 --> 01:21:23,120
between 0 and 1 that is class 0 or class
2145
01:21:23,120 --> 01:21:24,800
1.
2146
01:21:24,800 --> 01:21:27,280
so here let's rewrite this
2147
01:21:27,280 --> 01:21:29,600
as p equals mx
2148
01:21:29,600 --> 01:21:32,159
plus b
2149
01:21:32,560 --> 01:21:37,360
okay well mx plus b that can range
2150
01:21:37,360 --> 01:21:38,800
you know from negative infinity to
2151
01:21:38,800 --> 01:21:41,040
infinity right for any for any value of
2152
01:21:41,040 --> 01:21:42,640
x it goes from negative infinity to
2153
01:21:42,640 --> 01:21:44,080
infinity
2154
01:21:44,080 --> 01:21:45,679
but we know one of
2155
01:21:45,679 --> 01:21:47,280
the rules of probability is that
2156
01:21:47,280 --> 01:21:50,400
probability has to stay between zero and
2157
01:21:50,400 --> 01:21:52,000
one
2158
01:21:52,000 --> 01:21:53,840
so how do we fix this
2159
01:21:53,840 --> 01:21:55,679
well maybe instead of
2160
01:21:55,679 --> 01:21:57,360
just setting the probability equal to
2161
01:21:57,360 --> 01:21:59,760
that we can set the odds
2162
01:21:59,760 --> 01:22:02,000
equal to this so by that i mean okay
2163
01:22:02,000 --> 01:22:04,880
let's do probability divided by 1 minus
2164
01:22:04,880 --> 01:22:06,880
the probability okay so now it becomes
2165
01:22:06,880 --> 01:22:08,560
this ratio
2166
01:22:08,560 --> 01:22:10,639
now this ratio is allowed to take on
2167
01:22:10,639 --> 01:22:13,120
infinite values
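A quick numeric illustration of why the odds help here: probabilities live in (0, 1), but their odds p / (1 - p) can grow without bound, matching the unbounded right-hand side.

```python
# Odds transformation: p in (0, 1) maps to odds in (0, infinity).
def odds(p):
    return p / (1 - p)

for p in [0.5, 0.8, 0.99]:
    print(p, round(odds(p), 6))  # 1.0, 4.0, 99.0 -- odds blow up near p = 1
```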
2168
01:22:13,120 --> 01:22:15,920
but there's still one issue here
2169
01:22:15,920 --> 01:22:18,000
let me move this over a bit
2170
01:22:18,000 --> 01:22:21,440
the one issue here is that
2171
01:22:21,440 --> 01:22:23,280
mx plus b that can still be negative
2172
01:22:23,280 --> 01:22:24,960
right like if you know i have a negative
2173
01:22:24,960 --> 01:22:26,960
slope if i have a negative b if i have
2174
01:22:26,960 --> 01:22:28,719
some negative x's in there i don't know
2175
01:22:28,719 --> 01:22:30,320
but that can be that's allowed to be
2176
01:22:30,320 --> 01:22:32,239
negative
2177
01:22:32,239 --> 01:22:34,320
so how do we fix that
2178
01:22:34,320 --> 01:22:38,480
we do that by actually taking the log
2179
01:22:38,480 --> 01:22:40,800
of the odds
2180
01:22:40,800 --> 01:22:43,040
okay
2181
01:22:43,280 --> 01:22:46,159
so now i have the log of you know some
2182
01:22:46,159 --> 01:22:47,760
probability divided by 1 minus the
2183
01:22:47,760 --> 01:22:50,880
probability and now that is on a range
2184
01:22:50,880 --> 01:22:53,760
of negative infinity to infinity which
2185
01:22:53,760 --> 01:22:56,239
is good because the range of log should
2186
01:22:56,239 --> 01:22:58,719
be negative infinity to infinity
2187
01:22:58,719 --> 01:23:02,560
now how do i solve for p the probability
2188
01:23:02,560 --> 01:23:04,800
well the first thing i can do is take
2189
01:23:04,800 --> 01:23:06,480
you know
2190
01:23:06,480 --> 01:23:08,719
i can remove the log by taking
2191
01:23:08,719 --> 01:23:10,639
e to the
2192
01:23:10,639 --> 01:23:14,080
power of what's on both sides
2193
01:23:14,080 --> 01:23:18,080
so that gives me the probability
2194
01:23:18,080 --> 01:23:20,719
over the one minus the probability
2195
01:23:20,719 --> 01:23:26,239
is now equal to e to the m x plus b
2196
01:23:26,239 --> 01:23:27,920
okay
2197
01:23:27,920 --> 01:23:29,920
so let's multiply that out so the
2198
01:23:29,920 --> 01:23:32,880
probability is equal to
2199
01:23:32,880 --> 01:23:37,120
one minus the probability times e to the m x plus
2200
01:23:37,120 --> 01:23:38,320
b
2201
01:23:38,320 --> 01:23:42,639
so p is equal to e to the m x plus b
2202
01:23:42,639 --> 01:23:44,320
minus p
2203
01:23:44,320 --> 01:23:48,400
times e to the m x plus b
2204
01:23:48,400 --> 01:23:50,480
and now we have we can move like terms
2205
01:23:50,480 --> 01:23:53,600
to one side so if i do p
2206
01:23:53,600 --> 01:23:56,080
uh so basically i'm moving this over so
2207
01:23:56,080 --> 01:23:59,920
i'm adding p so now p times one plus
2208
01:23:59,920 --> 01:24:02,800
e to the m x plus b
2209
01:24:02,800 --> 01:24:05,760
is equal to
2210
01:24:05,760 --> 01:24:09,760
e to the m x plus b and let me change
2211
01:24:09,760 --> 01:24:13,520
this parenthesis make it a little bigger
2212
01:24:13,520 --> 01:24:16,800
so now my probability can be e to the mx
2213
01:24:16,800 --> 01:24:18,239
plus b
2214
01:24:18,239 --> 01:24:22,000
divided by 1 plus e to the mx
2215
01:24:22,000 --> 01:24:24,560
plus b
2216
01:24:25,440 --> 01:24:26,719
okay
2217
01:24:26,719 --> 01:24:28,400
well
2218
01:24:28,400 --> 01:24:30,560
let me just rewrite this really quickly
2219
01:24:30,560 --> 01:24:33,760
i want a numerator of one on top
2220
01:24:33,760 --> 01:24:35,360
okay so what i'm going to do is i'm
2221
01:24:35,360 --> 01:24:37,360
going to multiply
2222
01:24:37,360 --> 01:24:40,719
this by e to the negative mx plus b
2223
01:24:40,719 --> 01:24:43,199
and then also the bottom by e to the
2224
01:24:43,199 --> 01:24:44,639
negative mx plus b and i'm allowed to do that
2225
01:24:44,639 --> 01:24:46,000
because
2226
01:24:46,000 --> 01:24:48,639
this over this is 1.
2227
01:24:48,639 --> 01:24:54,560
so now my probability is equal to 1 over
2228
01:24:54,560 --> 01:24:56,320
1 plus
2229
01:24:56,320 --> 01:25:00,320
e to the negative mx plus b and now why
2230
01:25:00,320 --> 01:25:02,239
do i rewrite it like that it's because
2231
01:25:02,239 --> 01:25:04,880
this is actually a form of a special
2232
01:25:04,880 --> 01:25:06,000
function
2233
01:25:06,000 --> 01:25:09,520
which is called the sigmoid
2234
01:25:10,400 --> 01:25:13,120
function
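Collected in one place, the derivation on the board is:

```latex
\log\frac{p}{1-p} = mx + b
\;\Rightarrow\;
\frac{p}{1-p} = e^{mx+b}
\;\Rightarrow\;
p = \frac{e^{mx+b}}{1 + e^{mx+b}}
  = \frac{1}{1 + e^{-(mx+b)}} = S(mx + b)
```

where S is the sigmoid function, S(y) = 1 / (1 + e^(-y)).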
2235
01:25:13,120 --> 01:25:15,120
and for the sigmoid function
2236
01:25:15,120 --> 01:25:18,080
it looks something like this so s of x
2237
01:25:18,080 --> 01:25:21,600
sigmoid you know of some x
2238
01:25:21,600 --> 01:25:24,400
is equal to 1 over
2239
01:25:24,400 --> 01:25:27,280
1 plus e to the negative
2240
01:25:27,280 --> 01:25:28,480
x
2241
01:25:28,480 --> 01:25:31,679
so essentially what i just did up here
2242
01:25:31,679 --> 01:25:33,440
is rewrite this
2243
01:25:33,440 --> 01:25:35,840
in some sigmoid function
2244
01:25:35,840 --> 01:25:40,000
where the x value is actually mx plus b
2245
01:25:40,000 --> 01:25:42,159
so maybe i'll change this to y just to
2246
01:25:42,159 --> 01:25:43,440
make that a bit more clear it doesn't
2247
01:25:43,440 --> 01:25:46,000
matter what the variable name is
2248
01:25:46,000 --> 01:25:48,719
but this is our sigmoid function
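A tiny check of that algebra, using nothing beyond the standard library: the sigmoid of mx + b matches the e^(mx+b) / (1 + e^(mx+b)) form derived above, and always lands strictly between 0 and 1. The slope and intercept here are arbitrary.

```python
import math

def sigmoid(y):
    # S(y) = 1 / (1 + e^(-y)), the form the probability derivation produced
    return 1 / (1 + math.exp(-y))

m, b = 2.0, -1.0  # arbitrary slope and intercept for the check
for x in [-3.0, 0.0, 3.0]:
    p = sigmoid(m * x + b)
    odds_form = math.exp(m * x + b) / (1 + math.exp(m * x + b))
    assert abs(p - odds_form) < 1e-12  # the two forms agree
    assert 0 < p < 1                   # always a valid probability
```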
2249
01:25:48,719 --> 01:25:49,760
and
2250
01:25:49,760 --> 01:25:51,679
visually what our sigmoid function looks
2251
01:25:51,679 --> 01:25:53,280
like
2252
01:25:53,280 --> 01:25:56,400
is it goes from zero so this here is
2253
01:25:56,400 --> 01:25:57,679
zero
2254
01:25:57,679 --> 01:25:58,800
to one
2255
01:25:58,800 --> 01:26:00,800
and it looks something
2256
01:26:00,800 --> 01:26:03,280
like this curved s which i didn't draw
2257
01:26:03,280 --> 01:26:05,760
too well let me try that again
2258
01:26:05,760 --> 01:26:09,400
this is hard to draw
2259
01:26:10,880 --> 01:26:14,960
something if i can draw this right
2260
01:26:14,960 --> 01:26:16,000
like
2261
01:26:16,000 --> 01:26:17,120
that
2262
01:26:17,120 --> 01:26:20,560
okay so it goes in between zero and one
2263
01:26:20,560 --> 01:26:24,320
and you might notice that this form
2264
01:26:24,320 --> 01:26:25,840
fits our
2265
01:26:25,840 --> 01:26:29,120
shape up here
2266
01:26:30,960 --> 01:26:32,159
oops
2267
01:26:32,159 --> 01:26:33,040
let's
2268
01:26:33,040 --> 01:26:36,080
draw it sharper but it fits our shape up
2269
01:26:36,080 --> 01:26:38,719
there a lot better right
2270
01:26:38,719 --> 01:26:41,360
all right so that is
2271
01:26:41,360 --> 01:26:42,719
what we call
2272
01:26:42,719 --> 01:26:44,480
logistic regression we're basically
2273
01:26:44,480 --> 01:26:46,320
trying to fit our data
2274
01:26:46,320 --> 01:26:48,639
to the sigmoid function
2275
01:26:48,639 --> 01:26:50,239
okay
2276
01:26:50,239 --> 01:26:54,960
and when we only have you know one
2277
01:26:54,960 --> 01:26:56,080
um
2278
01:26:56,080 --> 01:26:58,560
data point so if we only have one
2279
01:26:58,560 --> 01:27:01,600
feature x then that's what we call
2280
01:27:01,600 --> 01:27:03,760
simple
2281
01:27:03,760 --> 01:27:06,400
logistic regression
2282
01:27:06,400 --> 01:27:08,480
but then if we have you know so that's
2283
01:27:08,480 --> 01:27:12,159
only x0 but then if we have x0 x1
2284
01:27:12,159 --> 01:27:13,760
all the way to xn
2285
01:27:13,760 --> 01:27:16,239
we call this multiple
2286
01:27:16,239 --> 01:27:18,560
logistic regression because there are
2287
01:27:18,560 --> 01:27:21,199
multiple features that we're considering
2288
01:27:21,199 --> 01:27:23,520
when we're building our model
2289
01:27:23,520 --> 01:27:26,000
logistic regression
2290
01:27:26,000 --> 01:27:29,520
so i'm going to put that here and again
2291
01:27:29,520 --> 01:27:32,239
from sklearn
2292
01:27:32,239 --> 01:27:35,679
this linear model we can import logistic
2293
01:27:35,679 --> 01:27:37,280
regression
2294
01:27:37,280 --> 01:27:38,960
right
2295
01:27:38,960 --> 01:27:42,000
and just like how we did above we can
2296
01:27:42,000 --> 01:27:45,360
repeat all of this so here instead of nb
2297
01:27:45,360 --> 01:27:46,880
i'm going to call this
2298
01:27:46,880 --> 01:27:52,400
the log model or lg logistic regression
2299
01:27:52,400 --> 01:27:55,440
i'm going to change this to logistic
2300
01:27:55,440 --> 01:27:57,040
regression
2301
01:27:57,040 --> 01:27:58,960
so i'm just going to use the default
2302
01:27:58,960 --> 01:28:00,639
logistic regression
2303
01:28:00,639 --> 01:28:02,320
but actually if you look here you see
2304
01:28:02,320 --> 01:28:04,320
that you can use different penalties so
2305
01:28:04,320 --> 01:28:05,920
right now we're using
2306
01:28:05,920 --> 01:28:08,400
um an l2 penalty
2307
01:28:08,400 --> 01:28:09,280
but
2308
01:28:09,280 --> 01:28:12,159
l2 is a quadratic penalty okay so that
2309
01:28:12,159 --> 01:28:15,120
means that for you know outliers it
2310
01:28:15,120 --> 01:28:16,639
would really
2311
01:28:16,639 --> 01:28:18,880
penalize that
2312
01:28:18,880 --> 01:28:20,719
for all these other things you know you
2313
01:28:20,719 --> 01:28:22,639
can toggle
2314
01:28:22,639 --> 01:28:23,520
these
2315
01:28:23,520 --> 01:28:25,280
different parameters and you might get
2316
01:28:25,280 --> 01:28:27,280
slightly different results if i were
2317
01:28:27,280 --> 01:28:29,280
building a production level logistic
2318
01:28:29,280 --> 01:28:31,280
regression model then i would want to go
2319
01:28:31,280 --> 01:28:32,880
and i would want to figure out you know
2320
01:28:32,880 --> 01:28:34,560
what are the best
2321
01:28:34,560 --> 01:28:37,120
parameters to pass into here based on my
2322
01:28:37,120 --> 01:28:39,040
validation data
2323
01:28:39,040 --> 01:28:40,719
but for now we'll just we'll just use
2324
01:28:40,719 --> 01:28:42,639
this out of the box
2325
01:28:42,639 --> 01:28:44,639
so again i'm going to fit
2326
01:28:44,639 --> 01:28:47,360
the x train and the y train
2327
01:28:47,360 --> 01:28:49,920
and i'm just going to predict again so i
2328
01:28:49,920 --> 01:28:51,920
can just call this again
2329
01:28:51,920 --> 01:28:55,120
and instead of nb i'm going to
2330
01:28:55,120 --> 01:28:57,840
use lg so here this is decent precision
2331
01:28:57,840 --> 01:28:59,040
65
2332
01:28:59,040 --> 01:29:00,719
recall 71
2333
01:29:00,719 --> 01:29:02,880
f1 68
2334
01:29:02,880 --> 01:29:04,239
or 82
2335
01:29:04,239 --> 01:29:06,480
uh total accuracy of 77 okay so it
2336
01:29:06,480 --> 01:29:08,960
performs slightly better than naive bayes
2337
01:29:08,960 --> 01:29:12,639
but it's still not as good as knn
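A sketch of the logistic regression cell; stand-in one-feature data replaces the notebook's split, and predict_proba shows the sigmoid-shaped probabilities just derived.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data: class 0 centered at 0, class 1 centered at 3.
rng = np.random.default_rng(1)
X_train = np.concatenate([rng.normal(0, 1, (100, 1)),
                          rng.normal(3, 1, (100, 1))])
y_train = np.repeat([0, 1], 100)

lg_model = LogisticRegression()  # defaults out of the box: l2 penalty
lg_model.fit(X_train, y_train)

# P(class 1) follows the sigmoid: low for small x, high for large x.
probs = lg_model.predict_proba([[-2.0], [1.5], [5.0]])[:, 1]
print(probs.round(3))
```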
2338
01:29:12,639 --> 01:29:15,199
all right so the last model for
2339
01:29:15,199 --> 01:29:16,880
classification that i wanted to talk
2340
01:29:16,880 --> 01:29:18,560
about is something called
2341
01:29:18,560 --> 01:29:20,639
support vector machines
2342
01:29:20,639 --> 01:29:25,199
or svms for short
2343
01:29:25,199 --> 01:29:26,000
so
2344
01:29:26,000 --> 01:29:27,600
what exactly
2345
01:29:27,600 --> 01:29:29,199
is an svm
2346
01:29:29,199 --> 01:29:32,159
model i have two different features x0
2347
01:29:32,159 --> 01:29:34,480
and x1 on the axis
2348
01:29:34,480 --> 01:29:37,440
and then i've told you if it's you know
2349
01:29:37,440 --> 01:29:40,560
class 0 or class 1 based on the blue and
2350
01:29:40,560 --> 01:29:41,760
the red
2351
01:29:41,760 --> 01:29:43,040
labels
2352
01:29:43,040 --> 01:29:46,639
my goal is to find some sort of
2353
01:29:46,639 --> 01:29:47,679
line
2354
01:29:47,679 --> 01:29:51,360
between these two labels that best
2355
01:29:51,360 --> 01:29:53,840
divides the data
2356
01:29:53,840 --> 01:29:58,400
all right so this line is our svm model
2357
01:29:58,400 --> 01:30:00,800
so i call it a line here because in 2d
2358
01:30:00,800 --> 01:30:03,120
it's a line but in 3d it would be a
2359
01:30:03,120 --> 01:30:04,880
plane and then you can also have more
2360
01:30:04,880 --> 01:30:07,199
and more dimensions so the proper term
2361
01:30:07,199 --> 01:30:08,719
is actually i want to find the
2362
01:30:08,719 --> 01:30:10,080
hyperplane
2363
01:30:10,080 --> 01:30:12,159
that best differentiates these two
2364
01:30:12,159 --> 01:30:14,239
classes
2365
01:30:14,239 --> 01:30:15,679
let's
2366
01:30:15,679 --> 01:30:18,840
see a few examples okay so
2367
01:30:18,840 --> 01:30:20,719
first
2368
01:30:20,719 --> 01:30:23,679
between these
2369
01:30:23,679 --> 01:30:24,639
three
2370
01:30:24,639 --> 01:30:27,120
lines
2371
01:30:27,600 --> 01:30:29,280
let's say a
2372
01:30:29,280 --> 01:30:30,000
b
2373
01:30:30,000 --> 01:30:32,400
and c
2374
01:30:32,400 --> 01:30:34,800
which one is the best divider of the
2375
01:30:34,800 --> 01:30:36,960
data which one has you know all the data
2376
01:30:36,960 --> 01:30:39,520
on one side or the other or at least if
2377
01:30:39,520 --> 01:30:42,000
it doesn't which one divides it the most
2378
01:30:42,000 --> 01:30:43,920
right like which one has the most
2379
01:30:43,920 --> 01:30:46,400
defined boundary between the two
2380
01:30:46,400 --> 01:30:49,719
different groups
2381
01:30:50,719 --> 01:30:52,080
so
2382
01:30:52,080 --> 01:30:54,080
this this question should be pretty
2383
01:30:54,080 --> 01:30:56,159
straightforward
2384
01:30:56,159 --> 01:30:57,600
it should be a
2385
01:30:57,600 --> 01:31:00,159
right because a has that clear distinct
2386
01:31:00,159 --> 01:31:01,120
line
2387
01:31:01,120 --> 01:31:03,760
between where you know everything on
2388
01:31:03,760 --> 01:31:06,719
this side of a is one label it's
2389
01:31:06,719 --> 01:31:08,639
negative and everything on this side of
2390
01:31:08,639 --> 01:31:12,480
a is the other label it's positive
2391
01:31:12,560 --> 01:31:15,040
so what if i have a but then what if i
2392
01:31:15,040 --> 01:31:18,320
had drawn my b
2393
01:31:18,719 --> 01:31:20,400
like this
2394
01:31:20,400 --> 01:31:23,600
and my c
2395
01:31:23,600 --> 01:31:25,520
maybe like this
2396
01:31:25,520 --> 01:31:26,880
sorry the labels are
2397
01:31:26,880 --> 01:31:29,440
kind of close together
2398
01:31:29,440 --> 01:31:33,679
but now which one is the best
2399
01:31:34,560 --> 01:31:38,400
so i would argue that it's still a
2400
01:31:38,400 --> 01:31:40,719
right and why is it still a
2401
01:31:40,719 --> 01:31:42,840
because in these other
2402
01:31:42,840 --> 01:31:44,639
two
2403
01:31:44,639 --> 01:31:47,920
look at how close this is to
2404
01:31:47,920 --> 01:31:50,560
these points
2405
01:31:50,800 --> 01:31:55,120
right so if i had some new point
2406
01:31:55,120 --> 01:31:57,280
that i wanted to estimate okay say i
2407
01:31:57,280 --> 01:31:59,280
didn't have a or b
2408
01:31:59,280 --> 01:32:01,440
so let's say we're just working with c
2409
01:32:01,440 --> 01:32:03,199
let's say i have some new point that's
2410
01:32:03,199 --> 01:32:05,520
right here
2411
01:32:05,520 --> 01:32:08,639
or maybe a new point that's right there
2412
01:32:08,639 --> 01:32:10,800
well it seems like just logically
2413
01:32:10,800 --> 01:32:13,760
looking at this i mean without the
2414
01:32:13,760 --> 01:32:15,920
boundary that
2415
01:32:15,920 --> 01:32:19,440
would probably go under the positives
2416
01:32:19,440 --> 01:32:20,560
right
2417
01:32:20,560 --> 01:32:22,000
i mean it's pretty close to that other
2418
01:32:22,000 --> 01:32:23,760
positive
2419
01:32:23,760 --> 01:32:27,360
so one thing that we care about in svms
2420
01:32:27,360 --> 01:32:30,880
is something known as the margin
2421
01:32:30,880 --> 01:32:31,920
okay
2422
01:32:31,920 --> 01:32:32,800
so
2423
01:32:32,800 --> 01:32:35,280
not only do we want to separate the two
2424
01:32:35,280 --> 01:32:38,480
classes really well we also care about
2425
01:32:38,480 --> 01:32:40,639
the boundary in between
2426
01:32:40,639 --> 01:32:42,400
where the points in those classes in our
2427
01:32:42,400 --> 01:32:43,760
data set are
2428
01:32:43,760 --> 01:32:47,760
and the line that we're drawing so
2429
01:32:47,760 --> 01:32:51,120
in a line like this
2430
01:32:51,120 --> 01:32:54,719
the closest values to this line
2431
01:32:54,719 --> 01:32:55,840
might be
2432
01:32:55,840 --> 01:32:58,560
like here
2433
01:33:00,159 --> 01:33:04,159
i'm trying to draw these perpendicular
2434
01:33:06,960 --> 01:33:10,880
right and so this effectively
2435
01:33:10,880 --> 01:33:15,280
if i switch over to these dotted lines
2436
01:33:17,280 --> 01:33:20,480
if i can draw this right
2437
01:33:21,600 --> 01:33:24,480
so these effectively are what's known as
2438
01:33:24,480 --> 01:33:27,360
the margins
2439
01:33:30,480 --> 01:33:31,760
okay
2440
01:33:31,760 --> 01:33:34,320
so these both here
2441
01:33:34,320 --> 01:33:36,800
these are our margins
2442
01:33:36,800 --> 01:33:38,400
in our svms
2443
01:33:38,400 --> 01:33:40,320
and our goal is to maximize those
2444
01:33:40,320 --> 01:33:42,159
margins so not only do we want the line
2445
01:33:42,159 --> 01:33:43,280
that best separates the two different
2446
01:33:43,280 --> 01:33:46,400
classes we want the line that has the
2447
01:33:46,400 --> 01:33:48,719
largest margin
2448
01:33:48,719 --> 01:33:52,480
and the data points that lie on
2449
01:33:52,480 --> 01:33:55,280
the margin lines the data so basically
2450
01:33:55,280 --> 01:33:56,639
these are the data points that are helping
2451
01:33:56,639 --> 01:33:58,800
us define our divider
2452
01:33:58,800 --> 01:34:01,360
these are what we call support
2453
01:34:01,360 --> 01:34:04,360
vectors
2454
01:34:04,639 --> 01:34:08,159
hence the name support vector machines
2455
01:34:08,159 --> 01:34:11,120
okay so the issue with svm sometimes is
2456
01:34:11,120 --> 01:34:13,520
that they're not so robust
2457
01:34:13,520 --> 01:34:15,280
to outliers
2458
01:34:15,280 --> 01:34:17,679
right so for example if i had
2459
01:34:17,679 --> 01:34:19,920
one outlier
2460
01:34:19,920 --> 01:34:22,000
like this up here
2461
01:34:22,000 --> 01:34:24,400
that would totally change where i want
2462
01:34:24,400 --> 01:34:25,360
my
2463
01:34:25,360 --> 01:34:26,960
support vector to be
2464
01:34:26,960 --> 01:34:28,719
even though that might be my only
2465
01:34:28,719 --> 01:34:30,639
outlier okay
2466
01:34:30,639 --> 01:34:32,960
so that's just something to keep in mind
2467
01:34:32,960 --> 01:34:36,080
as you know you're working with svms is
2468
01:34:36,080 --> 01:34:38,159
it might not be the best model if there
2469
01:34:38,159 --> 01:34:40,320
are outliers in your data set
2470
01:34:40,320 --> 01:34:43,600
okay so another example of svms
2471
01:34:43,600 --> 01:34:45,920
might be let's say that we have data
2472
01:34:45,920 --> 01:34:47,600
like this i'm just going to use a one
2473
01:34:47,600 --> 01:34:50,239
dimensional data set for this example
2474
01:34:50,239 --> 01:34:51,600
let's say we have a data set that looks
2475
01:34:51,600 --> 01:34:53,840
like this
2476
01:34:53,840 --> 01:34:56,719
well our you know separator should be
2477
01:34:56,719 --> 01:34:59,199
perpendicular to this line
2478
01:34:59,199 --> 01:35:00,560
but it should be somewhere along this
2479
01:35:00,560 --> 01:35:02,320
line so it could be
2480
01:35:02,320 --> 01:35:04,719
anywhere like this
2481
01:35:04,719 --> 01:35:07,040
you might argue okay well there's one
2482
01:35:07,040 --> 01:35:09,280
here and then you could also just draw
2483
01:35:09,280 --> 01:35:10,800
another one over here
2484
01:35:10,800 --> 01:35:12,159
right and then maybe you can have two
2485
01:35:12,159 --> 01:35:15,280
svms but that's not really how svms work
2486
01:35:15,280 --> 01:35:17,360
but one thing that we can do is we can
2487
01:35:17,360 --> 01:35:20,320
create some sort of projection
2488
01:35:20,320 --> 01:35:23,280
so i realized here that one thing
2489
01:35:23,280 --> 01:35:24,880
i forgot to do
2490
01:35:24,880 --> 01:35:27,360
was to label where zero was so let's
2491
01:35:27,360 --> 01:35:28,880
just say zero
2492
01:35:28,880 --> 01:35:31,360
is here
2493
01:35:31,920 --> 01:35:33,440
now what i'm going to do is i'm going to
2494
01:35:33,440 --> 01:35:34,560
say okay
2495
01:35:34,560 --> 01:35:36,639
i'm going to have x and then i'm going
2496
01:35:36,639 --> 01:35:38,639
to have x
2497
01:35:38,639 --> 01:35:41,679
sorry x0 and x1 so x0 is just going to
2498
01:35:41,679 --> 01:35:43,440
be my original x
2499
01:35:43,440 --> 01:35:46,239
but i'm going to make x1 equal
2500
01:35:46,239 --> 01:35:49,040
to let's say
2501
01:35:49,040 --> 01:35:49,920
x
2502
01:35:49,920 --> 01:35:53,120
squared so whatever is this squared
2503
01:35:53,120 --> 01:35:55,280
right so now
2504
01:35:55,280 --> 01:35:57,600
my negatives would be you know maybe
2505
01:35:57,600 --> 01:36:00,000
somewhere here
2506
01:36:00,000 --> 01:36:01,040
here
2507
01:36:01,040 --> 01:36:04,800
just pretend that it's somewhere up here
2508
01:36:04,800 --> 01:36:07,280
right and now my pluses might be
2509
01:36:07,280 --> 01:36:11,000
something like
2510
01:36:11,840 --> 01:36:13,600
that
2511
01:36:13,600 --> 01:36:15,199
and i'm going to run out of space over
2512
01:36:15,199 --> 01:36:16,960
here so i'm just going to draw these
2513
01:36:16,960 --> 01:36:20,960
together use your imagination
2514
01:36:21,600 --> 01:36:26,320
but once i draw it like this
2515
01:36:26,560 --> 01:36:28,560
well it's a lot easier to apply a
2516
01:36:28,560 --> 01:36:31,280
boundary right now our svm could be
2517
01:36:31,280 --> 01:36:33,199
maybe something like this
2518
01:36:33,199 --> 01:36:35,520
this
2519
01:36:35,600 --> 01:36:37,440
and now you see that we've divided our
2520
01:36:37,440 --> 01:36:39,520
data set now it's separable where one
2521
01:36:39,520 --> 01:36:41,040
class is this way
2522
01:36:41,040 --> 01:36:42,560
and the other class
2523
01:36:42,560 --> 01:36:44,400
is that way
2524
01:36:44,400 --> 01:36:47,520
okay so that's known as svms
2525
01:36:47,520 --> 01:36:50,080
um i do highly suggest that you know any
2526
01:36:50,080 --> 01:36:51,440
of these models that we just mentioned
2527
01:36:51,440 --> 01:36:53,600
if you're interested in them do go more
2528
01:36:53,600 --> 01:36:55,920
in depth mathematically into them like
2529
01:36:55,920 --> 01:36:59,119
how do we find this hyperplane
2530
01:36:59,119 --> 01:37:00,960
right i'm not going to go over that in
2531
01:37:00,960 --> 01:37:02,960
this specific course because you're just
2532
01:37:02,960 --> 01:37:04,880
learning what an svm is
2533
01:37:04,880 --> 01:37:07,119
but it's a good idea to know oh okay
2534
01:37:07,119 --> 01:37:09,920
this is the technique behind finding
2535
01:37:09,920 --> 01:37:12,320
you know what exactly
2536
01:37:12,320 --> 01:37:15,040
how do you define the hyperplane that
2537
01:37:15,040 --> 01:37:16,880
we're going to use
2538
01:37:16,880 --> 01:37:19,199
so anyways this transformation that we
2539
01:37:19,199 --> 01:37:20,560
did down here
2540
01:37:20,560 --> 01:37:25,040
this is known as the kernel trick
2541
01:37:26,960 --> 01:37:30,000
so when we go from x to some coordinate
2542
01:37:30,000 --> 01:37:32,159
x and then x squared
2543
01:37:32,159 --> 01:37:34,320
what we're doing is we are applying a
2544
01:37:34,320 --> 01:37:35,840
kernel so that's why it's called the
2545
01:37:35,840 --> 01:37:38,400
kernel trick
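The x to (x, x squared) projection she draws can be sketched with numpy. (Strictly speaking, the kernel trick computes the relevant inner products implicitly rather than building the new features, but the explicit feature map is what's drawn here.) The points and labels below are invented to match the picture: pluses at the extremes, minuses near zero.

```python
import numpy as np

# Lift 1-D points x into the 2-D space (x, x^2).
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
labels = np.array([1, 1, 0, 0, 0, 1, 1])  # pluses (class 1) at the extremes

features = np.column_stack([x, x ** 2])   # explicit feature map

# No single threshold on x separates the classes, but in the lifted
# space a horizontal line (x^2 > 2) does.
predictions = (features[:, 1] > 2).astype(int)
print(np.array_equal(predictions, labels))  # True
```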
2546
01:37:38,400 --> 01:37:40,400
so svms are actually really powerful and
2547
01:37:40,400 --> 01:37:42,320
you'll see that here so
2548
01:37:42,320 --> 01:37:43,719
from
2549
01:37:43,719 --> 01:37:46,840
sklearn.svm we are going to import
2550
01:37:46,840 --> 01:37:50,639
svc and svc is our support vector
2551
01:37:50,639 --> 01:37:53,280
classifier
2552
01:37:53,280 --> 01:37:55,600
so with this
2553
01:37:55,600 --> 01:37:58,320
so with our svm model
2554
01:37:58,320 --> 01:38:01,040
we are going to you know create an svc
2555
01:38:01,040 --> 01:38:02,560
model
2556
01:38:02,560 --> 01:38:04,960
and we are going to
2557
01:38:04,960 --> 01:38:06,400
uh
2558
01:38:06,400 --> 01:38:08,639
again fit this to x train i could have
2559
01:38:08,639 --> 01:38:10,080
just copy and pasted this i should have
2560
01:38:10,080 --> 01:38:12,800
probably done that
2561
01:38:13,119 --> 01:38:15,360
okay
2562
01:38:15,360 --> 01:38:17,360
taking a bit longer
2563
01:38:17,360 --> 01:38:19,679
all right
2564
01:38:20,480 --> 01:38:22,719
let's predict using our svm model and
2565
01:38:22,719 --> 01:38:23,600
here
2566
01:38:23,600 --> 01:38:26,080
let's see if i can hover over this
2567
01:38:26,080 --> 01:38:27,679
all right so again you see a lot of
2568
01:38:27,679 --> 01:38:31,520
these different parameters here that you
2569
01:38:31,520 --> 01:38:33,840
can go back and change if you were
2570
01:38:33,840 --> 01:38:36,639
creating a production level model
2571
01:38:36,639 --> 01:38:38,239
okay but
2572
01:38:38,239 --> 01:38:40,000
in this specific case
2573
01:38:40,000 --> 01:38:44,239
we'll just use it out of the box again
2574
01:38:44,239 --> 01:38:45,119
so
2575
01:38:45,119 --> 01:38:47,360
if i make predictions you'll note that
2576
01:38:47,360 --> 01:38:50,159
wow the accuracy actually jumps to 87 percent
2577
01:38:50,159 --> 01:38:51,600
with the svm
2578
01:38:51,600 --> 01:38:53,679
and even with class 0 there's nothing
2579
01:38:53,679 --> 01:38:57,520
less than you know 0.8 which is great
2580
01:38:57,520 --> 01:38:59,840
and for class one i mean everything's at
2581
01:38:59,840 --> 01:39:01,840
0.9 which is higher than anything that
2582
01:39:01,840 --> 01:39:05,280
we had seen to this point
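A sketch of the SVC usage shown here; the video's diabetes data isn't reproduced, so a synthetic stand-in dataset is used:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# synthetic stand-in for the scaled feature matrix used in the video
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 8))
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)  # a non-linear boundary

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# out-of-the-box support vector classifier, as in the video (default RBF kernel)
svm_model = SVC()
svm_model.fit(X_train, y_train)
y_pred = svm_model.predict(X_test)

acc = svm_model.score(X_test, y_test)
print(acc)
```

Because the boundary is curved, the default RBF kernel tends to beat a plain linear model here, mirroring the jump in accuracy seen in the video.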
2583
01:39:06,560 --> 01:39:09,199
so so far we've gone over four different
2584
01:39:09,199 --> 01:39:11,280
classification models we've done svms
2585
01:39:11,280 --> 01:39:14,880
logistic regression naive bayes and knn
2586
01:39:14,880 --> 01:39:16,639
and these are just simple ways on how to
2587
01:39:16,639 --> 01:39:18,639
implement them each of these they have
2588
01:39:18,639 --> 01:39:20,400
different you know
2589
01:39:20,400 --> 01:39:23,199
they have different hyper parameters
2590
01:39:23,199 --> 01:39:25,600
that you can go and you can toggle and
2591
01:39:25,600 --> 01:39:27,119
you can try to see
2592
01:39:27,119 --> 01:39:30,000
if that helps later on or not
2593
01:39:30,000 --> 01:39:30,880
but
2594
01:39:30,880 --> 01:39:33,520
for the most part they perform
2595
01:39:33,520 --> 01:39:36,239
they give us around 70 to 80 percent
2596
01:39:36,239 --> 01:39:37,520
accuracy
2597
01:39:37,520 --> 01:39:40,960
okay with svm being the best now let's
2598
01:39:40,960 --> 01:39:43,199
see if we can actually beat that using a
2599
01:39:43,199 --> 01:39:45,199
neural net now the final type of model
2600
01:39:45,199 --> 01:39:47,360
that i wanted to talk about is known as
2601
01:39:47,360 --> 01:39:50,000
a neural net or neural network
2602
01:39:50,000 --> 01:39:53,280
and neural nets look something like this
2603
01:39:53,280 --> 01:39:55,360
so you have an input layer this is where
2604
01:39:55,360 --> 01:39:57,199
all your features would go
2605
01:39:57,199 --> 01:39:58,000
and
2606
01:39:58,000 --> 01:39:59,679
they have all these arrows pointing to
2607
01:39:59,679 --> 01:40:01,520
some sort of hidden layer
2608
01:40:01,520 --> 01:40:03,119
and then all these arrows point to some
2609
01:40:03,119 --> 01:40:05,199
sort of output layer
2610
01:40:05,199 --> 01:40:08,159
so what does all this mean each
2611
01:40:08,159 --> 01:40:10,480
of these nodes in here this is
2612
01:40:10,480 --> 01:40:13,280
something known as a neuron
2613
01:40:13,280 --> 01:40:15,840
okay so that's a neuron
2614
01:40:15,840 --> 01:40:17,280
in a neural net
2615
01:40:17,280 --> 01:40:19,119
these are all of our features that we're
2616
01:40:19,119 --> 01:40:20,960
inputting into the neural net so that
2617
01:40:20,960 --> 01:40:23,760
might be x0 x1 all the way through
2618
01:40:23,760 --> 01:40:25,119
xn
2619
01:40:25,119 --> 01:40:26,800
right and these are the features that we
2620
01:40:26,800 --> 01:40:28,639
talked about there they might be you
2621
01:40:28,639 --> 01:40:31,920
know the pregnancy the bmi the
2622
01:40:31,920 --> 01:40:34,239
age etc
2623
01:40:34,239 --> 01:40:37,199
and now all these get weighted by some
2624
01:40:37,199 --> 01:40:38,239
value
2625
01:40:38,239 --> 01:40:40,719
so they get multiplied by some w number
2626
01:40:40,719 --> 01:40:42,800
that applies to that one specific
2627
01:40:42,800 --> 01:40:45,119
category that one specific feature so
2628
01:40:45,119 --> 01:40:46,960
these two get multiplied
2629
01:40:46,960 --> 01:40:49,920
and the sum of all of these goes into
2630
01:40:49,920 --> 01:40:51,520
that neuron
2631
01:40:51,520 --> 01:40:54,960
okay so basically i'm taking w0 times x0
2632
01:40:54,960 --> 01:40:58,239
and then i'm adding x1 times w1 and then
2633
01:40:58,239 --> 01:41:01,119
i'm adding you know x2 times w2 etc all
2634
01:41:01,119 --> 01:41:03,280
the way to xn times
2635
01:41:03,280 --> 01:41:05,280
wn and that's getting
2636
01:41:05,280 --> 01:41:07,440
input into the neuron
2637
01:41:07,440 --> 01:41:10,000
now i'm also adding this bias term which
2638
01:41:10,000 --> 01:41:11,520
just means okay i might want to shift
2639
01:41:11,520 --> 01:41:14,639
this by a little bit so i might add 5 or
2640
01:41:14,639 --> 01:41:17,119
i might add 0.1 or i might subtract 100
2641
01:41:17,119 --> 01:41:19,199
i don't know but we're going to add this
2642
01:41:19,199 --> 01:41:21,440
bias term
2643
01:41:21,440 --> 01:41:24,880
and the output of all these things so
2644
01:41:24,880 --> 01:41:27,840
the sum of this this this and this
2645
01:41:27,840 --> 01:41:30,480
go into something known as an activation
2646
01:41:30,480 --> 01:41:32,400
function okay
2647
01:41:32,400 --> 01:41:34,719
and then after applying this activation
2648
01:41:34,719 --> 01:41:37,440
function we get an output
2649
01:41:37,440 --> 01:41:39,840
and this is what a neuron would look
2650
01:41:39,840 --> 01:41:41,920
like
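The single neuron just described, the weighted sum w0·x0 + w1·x1 + … + wn·xn plus a bias, passed through an activation, can be sketched in a few lines; the numbers here are arbitrary:

```python
import numpy as np

def neuron(x, w, b, activation):
    # weighted sum of inputs plus the bias term, then the activation function
    return activation(np.dot(w, x) + b)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([1.0, 2.0, 3.0])    # features x0..x2 (e.g. pregnancies, bmi, age)
w = np.array([0.5, -0.25, 0.1])  # weights w0..w2
b = 0.2                          # bias shifts the weighted sum

out = neuron(x, w, b, sigmoid)
print(out)  # sigmoid(0.5), about 0.62
```

A whole layer is just many of these neurons sharing the same inputs, and a network stacks layers so each layer's outputs become the next layer's inputs.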
2651
01:41:41,920 --> 01:41:43,440
now a whole network of them would look
2652
01:41:43,440 --> 01:41:44,840
something like
2653
01:41:44,840 --> 01:41:47,440
this so i kind of gloss over this
2654
01:41:47,440 --> 01:41:49,199
activation function
2655
01:41:49,199 --> 01:41:51,920
what exactly is that
2656
01:41:51,920 --> 01:41:54,800
this is what a neural net looks like if
2657
01:41:54,800 --> 01:41:56,480
we have all our inputs here and let's
2658
01:41:56,480 --> 01:41:58,480
say all of these arrows represent some
2659
01:41:58,480 --> 01:42:00,800
sort of addition
2660
01:42:00,800 --> 01:42:01,760
right
2661
01:42:01,760 --> 01:42:04,320
then what's going on is we're just
2662
01:42:04,320 --> 01:42:06,639
adding a bunch of times
2663
01:42:06,639 --> 01:42:09,040
right we're adding some sort of
2664
01:42:09,040 --> 01:42:11,280
weight times these inputs
2665
01:42:11,280 --> 01:42:13,360
a bunch of times and then if we were to
2666
01:42:13,360 --> 01:42:16,400
go back and factor that all out
2667
01:42:16,400 --> 01:42:19,360
then this entire neural net
2668
01:42:19,360 --> 01:42:21,600
is just a linear combination of these
2669
01:42:21,600 --> 01:42:23,679
input layers
2670
01:42:23,679 --> 01:42:25,679
which i don't know about you but that
2671
01:42:25,679 --> 01:42:27,440
just seems kind of useless right because
2672
01:42:27,440 --> 01:42:29,199
we could literally just write that out
2673
01:42:29,199 --> 01:42:31,600
in a formula why would we need to set up
2674
01:42:31,600 --> 01:42:34,080
this entire neural network we
2675
01:42:34,080 --> 01:42:35,840
wouldn't
2676
01:42:35,840 --> 01:42:38,560
so the activation function is introduced
2677
01:42:38,560 --> 01:42:40,880
right so without an activation function
2678
01:42:40,880 --> 01:42:44,400
this just becomes a linear model
2679
01:42:44,400 --> 01:42:46,239
an activation function might look
2680
01:42:46,239 --> 01:42:48,320
something like this and as you can tell
2681
01:42:48,320 --> 01:42:50,639
these are not linear and the reason why
2682
01:42:50,639 --> 01:42:52,159
we introduce these
2683
01:42:52,159 --> 01:42:53,920
is so that our entire model doesn't
2684
01:42:53,920 --> 01:42:55,520
collapse on itself and become a linear
2685
01:42:55,520 --> 01:42:57,440
model
2686
01:42:57,440 --> 01:42:59,280
so over here this is something known as
2687
01:42:59,280 --> 01:43:01,280
a sigmoid function it runs between zero
2688
01:43:01,280 --> 01:43:02,480
and one
2689
01:43:02,480 --> 01:43:04,560
tanh runs between negative one all the
2690
01:43:04,560 --> 01:43:05,679
way to one
2691
01:43:05,679 --> 01:43:08,639
and this is relu which anything less
2692
01:43:08,639 --> 01:43:10,719
than zero is zero and that anything
2693
01:43:10,719 --> 01:43:14,480
greater than zero is linear
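The three activation functions just named can be written directly in numpy:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))  # squashes any input into (0, 1)

def tanh(z):
    return np.tanh(z)            # squashes any input into (-1, 1)

def relu(z):
    return np.maximum(0, z)      # zero below zero, linear above zero

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))
```

Each one is non-linear, which is exactly what stops the stacked layers from collapsing into a single linear combination.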
2694
01:43:14,480 --> 01:43:16,800
so with these activation functions
2695
01:43:16,800 --> 01:43:19,360
every single output of a neuron
2696
01:43:19,360 --> 01:43:21,600
is no longer just the linear combination
2697
01:43:21,600 --> 01:43:23,600
of these it's some sort of altered
2698
01:43:23,600 --> 01:43:25,920
linear state which means that the input
2699
01:43:25,920 --> 01:43:28,560
into the next neuron
2700
01:43:28,560 --> 01:43:29,760
is
2701
01:43:29,760 --> 01:43:30,880
you know
2702
01:43:30,880 --> 01:43:32,880
it doesn't collapse on itself
2703
01:43:32,880 --> 01:43:34,800
it doesn't become linear because we've
2704
01:43:34,800 --> 01:43:38,639
introduced all these non-linearities
2705
01:43:38,639 --> 01:43:41,440
so this is the training set the model
2706
01:43:41,440 --> 01:43:44,000
the loss right and then we do this thing
2707
01:43:44,000 --> 01:43:45,600
called training where we have to feed
2708
01:43:45,600 --> 01:43:47,520
the loss back into the model
2709
01:43:47,520 --> 01:43:49,679
and make certain adjustments the model
2710
01:43:49,679 --> 01:43:51,760
to improve
2711
01:43:51,760 --> 01:43:55,119
this predicted output
2712
01:43:55,119 --> 01:43:56,639
let's talk a little bit about the
2713
01:43:56,639 --> 01:43:58,719
training what exactly goes on during
2714
01:43:58,719 --> 01:44:00,719
that step
2715
01:44:00,719 --> 01:44:03,520
let's go back and take a look at our l2
2716
01:44:03,520 --> 01:44:05,360
loss function
2717
01:44:05,360 --> 01:44:07,760
this is what our l2 loss function looks
2718
01:44:07,760 --> 01:44:12,080
like it's a quadratic formula right
2719
01:44:12,080 --> 01:44:14,960
well up here the error is really really
2720
01:44:14,960 --> 01:44:17,520
really really large
2721
01:44:17,520 --> 01:44:18,320
and
2722
01:44:18,320 --> 01:44:20,639
our goal is to get somewhere down here
2723
01:44:20,639 --> 01:44:22,719
where the loss is decreased right
2724
01:44:22,719 --> 01:44:24,960
because that means that our predicted
2725
01:44:24,960 --> 01:44:29,199
value is closer to our true value
2726
01:44:29,199 --> 01:44:31,600
so that means that we want to go
2727
01:44:31,600 --> 01:44:33,679
this way
2728
01:44:33,679 --> 01:44:35,199
okay
2729
01:44:35,199 --> 01:44:37,840
and thanks to a lot of properties of
2730
01:44:37,840 --> 01:44:40,159
math something that we can do is called
2731
01:44:40,159 --> 01:44:41,760
gradient descent
2732
01:44:41,760 --> 01:44:44,960
in order to follow this
2733
01:44:44,960 --> 01:44:46,000
slope
2734
01:44:46,000 --> 01:44:48,639
down this way
2735
01:44:48,639 --> 01:44:50,719
this
2736
01:44:50,719 --> 01:44:52,320
quadratic
2737
01:44:52,320 --> 01:44:55,600
has different
2738
01:44:55,600 --> 01:44:59,280
slopes with respect to some value
2739
01:44:59,280 --> 01:45:00,400
okay so
2740
01:45:00,400 --> 01:45:03,040
the loss with respect to some weight
2741
01:45:03,040 --> 01:45:04,560
w0
2742
01:45:04,560 --> 01:45:08,000
versus w1 versus wn
2743
01:45:08,000 --> 01:45:10,320
they might all be different
2744
01:45:10,320 --> 01:45:11,520
right so
2745
01:45:11,520 --> 01:45:13,040
some way that i kind of think about it
2746
01:45:13,040 --> 01:45:15,600
is to what extent is this value
2747
01:45:15,600 --> 01:45:18,000
contributing to our loss and we can
2748
01:45:18,000 --> 01:45:19,920
actually figure that out through some
2749
01:45:19,920 --> 01:45:21,920
calculus which we're not going to touch
2750
01:45:21,920 --> 01:45:25,520
up on in this specific course but
2751
01:45:25,520 --> 01:45:26,800
if you want to learn more about neural
2752
01:45:26,800 --> 01:45:28,719
nets you should probably also learn some
2753
01:45:28,719 --> 01:45:31,040
calculus and figure out what exactly
2754
01:45:31,040 --> 01:45:33,520
backpropagation is doing in order to
2755
01:45:33,520 --> 01:45:35,840
actually calculate you know how much do
2756
01:45:35,840 --> 01:45:38,400
we have to backstep by
2757
01:45:38,400 --> 01:45:40,320
so the thing is here you might notice
2758
01:45:40,320 --> 01:45:42,800
that this follows this curve at all
2759
01:45:42,800 --> 01:45:45,679
these different points and the closer we
2760
01:45:45,679 --> 01:45:49,119
get to the bottom the smaller this step
2761
01:45:49,119 --> 01:45:50,800
becomes
2762
01:45:50,800 --> 01:45:52,880
now stick with me here
2763
01:45:52,880 --> 01:45:54,320
so
2764
01:45:54,320 --> 01:45:56,960
my new value this is what we call a
2765
01:45:56,960 --> 01:46:00,000
weight update i'm going to take w0
2766
01:46:00,000 --> 01:46:02,560
and i'm going to set some new value for
2767
01:46:02,560 --> 01:46:04,000
w0
2768
01:46:04,000 --> 01:46:05,840
and what i'm going to set for that is
2769
01:46:05,840 --> 01:46:08,320
the old value of w0
2770
01:46:08,320 --> 01:46:09,360
plus
2771
01:46:09,360 --> 01:46:10,239
some
2772
01:46:10,239 --> 01:46:12,480
factor which i'll just call alpha for
2773
01:46:12,480 --> 01:46:13,600
now
2774
01:46:13,600 --> 01:46:14,719
times
2775
01:46:14,719 --> 01:46:17,280
whatever this arrow is so that's
2776
01:46:17,280 --> 01:46:20,480
basically saying okay take our old
2777
01:46:20,480 --> 01:46:22,960
w0 our old weight
2778
01:46:22,960 --> 01:46:25,679
and just decrease it
2779
01:46:25,679 --> 01:46:28,080
this way so i guess increase it in this
2780
01:46:28,080 --> 01:46:30,080
direction right like take a step in this
2781
01:46:30,080 --> 01:46:32,239
direction but this alpha here is telling
2782
01:46:32,239 --> 01:46:34,320
us okay don't don't take a huge step
2783
01:46:34,320 --> 01:46:36,000
right just in case we're wrong take a
2784
01:46:36,000 --> 01:46:37,520
small step take a small step in that
2785
01:46:37,520 --> 01:46:40,800
direction see if we get any closer
2786
01:46:40,800 --> 01:46:43,440
and for those of you who you know do
2787
01:46:43,440 --> 01:46:45,040
want to look more into the mathematics
2788
01:46:45,040 --> 01:46:46,960
of things the reason why i use a plus
2789
01:46:46,960 --> 01:46:48,320
here is because
2790
01:46:48,320 --> 01:46:50,639
this here is the negative gradient right
2791
01:46:50,639 --> 01:46:53,040
if this were just the if you were to use
2792
01:46:53,040 --> 01:46:54,320
the actual gradient this should be a
2793
01:46:54,320 --> 01:46:56,639
minus
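The weight update just described, new w0 = old w0 minus the learning rate alpha times the gradient, can be sketched on a toy quadratic loss where the gradient is easy to write by hand:

```python
# gradient descent on the toy quadratic loss L(w) = (w - 3)^2,
# whose gradient is dL/dw = 2 * (w - 3)
alpha = 0.1  # learning rate: how big a step we take each update
w = 0.0      # starting weight

for _ in range(100):
    grad = 2 * (w - 3)
    w = w - alpha * grad  # step against the gradient, scaled by alpha

print(w)  # converges toward the minimum at w = 3
```

Each step shrinks as the slope flattens near the bottom, which is the behavior pointed out on the curve; set alpha too large and the steps overshoot and the loop can diverge instead.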
2794
01:46:56,800 --> 01:46:59,040
now this alpha is something that we call
2795
01:46:59,040 --> 01:47:00,480
the learning rate
2796
01:47:00,480 --> 01:47:03,440
okay and that adjusts how quickly we're
2797
01:47:03,440 --> 01:47:04,960
taking steps
2798
01:47:04,960 --> 01:47:08,480
and you know
2799
01:47:08,480 --> 01:47:10,320
that will ultimately control
2800
01:47:10,320 --> 01:47:12,400
how long it takes for our neural net to
2801
01:47:12,400 --> 01:47:13,600
converge
2802
01:47:13,600 --> 01:47:15,119
or sometimes if you set it too high it
2803
01:47:15,119 --> 01:47:17,360
might even diverge
2804
01:47:17,360 --> 01:47:19,440
but with all of these weights so here i
2805
01:47:19,440 --> 01:47:23,280
have w0 w1 and then wn
2806
01:47:23,280 --> 01:47:25,360
we make the same update to all of them
2807
01:47:25,360 --> 01:47:27,600
after we calculate
2808
01:47:27,600 --> 01:47:28,480
the
2809
01:47:28,480 --> 01:47:29,360
loss
2810
01:47:29,360 --> 01:47:31,840
the gradient of the loss with respect to
2811
01:47:31,840 --> 01:47:33,600
that weight
2812
01:47:33,600 --> 01:47:37,119
so that's how backpropagation works
2813
01:47:37,119 --> 01:47:39,440
and that is everything that's going on
2814
01:47:39,440 --> 01:47:41,280
here after we calculate the loss we're
2815
01:47:41,280 --> 01:47:42,960
calculating gradients
2816
01:47:42,960 --> 01:47:44,880
making adjustments in the model so we're
2817
01:47:44,880 --> 01:47:47,040
setting all the all the weights to
2818
01:47:47,040 --> 01:47:50,480
something adjusted slightly
2819
01:47:50,480 --> 01:47:51,679
and then
2820
01:47:51,679 --> 01:47:53,119
we're saying okay let's take the
2821
01:47:53,119 --> 01:47:54,159
training set and run it through the
2822
01:47:54,159 --> 01:47:56,080
model again and go through this loop all
2823
01:47:56,080 --> 01:47:59,760
over again so for machine learning we
2824
01:47:59,760 --> 01:48:01,840
already have seen some libraries that we
2825
01:48:01,840 --> 01:48:05,199
use right we've already seen sklearn
2826
01:48:05,199 --> 01:48:06,159
but
2827
01:48:06,159 --> 01:48:10,800
when we start going into neural networks
2828
01:48:11,360 --> 01:48:12,800
this is kind of what we're trying to
2829
01:48:12,800 --> 01:48:14,400
program
2830
01:48:14,400 --> 01:48:15,600
and
2831
01:48:15,600 --> 01:48:16,840
it's not
2832
01:48:16,840 --> 01:48:18,639
very fun
2833
01:48:18,639 --> 01:48:20,480
to try to program this from scratch
2834
01:48:20,480 --> 01:48:21,760
because
2835
01:48:21,760 --> 01:48:24,080
not only will we probably have a lot of
2836
01:48:24,080 --> 01:48:26,080
bugs but also it's probably not going to
2837
01:48:26,080 --> 01:48:27,600
be fast enough
2838
01:48:27,600 --> 01:48:28,400
right
2839
01:48:28,400 --> 01:48:29,840
wouldn't it be great if there are some
2840
01:48:29,840 --> 01:48:30,719
you know
2841
01:48:30,719 --> 01:48:32,400
full-time professionals that are
2842
01:48:32,400 --> 01:48:34,480
dedicated to solving this problem and
2843
01:48:34,480 --> 01:48:36,639
they could literally just give us their
2844
01:48:36,639 --> 01:48:40,320
code that's already running really fast
2845
01:48:40,320 --> 01:48:44,639
well the answer is yes that exists
2846
01:48:44,639 --> 01:48:46,239
and that's why we use tensorflow so
2847
01:48:46,239 --> 01:48:48,239
tensorflow makes it really easy to
2848
01:48:48,239 --> 01:48:50,000
define these models
2849
01:48:50,000 --> 01:48:52,560
but we also have enough control
2850
01:48:52,560 --> 01:48:54,320
over what exactly we're feeding into
2851
01:48:54,320 --> 01:48:55,360
this model
2852
01:48:55,360 --> 01:48:57,920
so for example this line here is
2853
01:48:57,920 --> 01:49:00,800
basically saying okay let's create
2854
01:49:00,800 --> 01:49:02,639
a sequential neural net
2855
01:49:02,639 --> 01:49:04,159
so sequential is just you know what
2856
01:49:04,159 --> 01:49:05,920
we've seen here it just goes one layer
2857
01:49:05,920 --> 01:49:07,520
to the next
2858
01:49:07,520 --> 01:49:09,600
and a dense layer means that all of them
2859
01:49:09,600 --> 01:49:11,840
are interconnected so here this is
2860
01:49:11,840 --> 01:49:13,760
interconnected with all of these nodes
2861
01:49:13,760 --> 01:49:15,600
and this one's all these and then this
2862
01:49:15,600 --> 01:49:17,760
one gets connected to all of
2863
01:49:17,760 --> 01:49:20,639
the next ones and so on so we're going
2864
01:49:20,639 --> 01:49:22,159
to create 16
2865
01:49:22,159 --> 01:49:23,920
dense nodes
2866
01:49:23,920 --> 01:49:26,480
with relu activation functions and then
2867
01:49:26,480 --> 01:49:28,320
we're going to create another layer of
2868
01:49:28,320 --> 01:49:29,679
16
2869
01:49:29,679 --> 01:49:32,800
dense nodes with relu activation and
2870
01:49:32,800 --> 01:49:34,320
then our output layer is going to be
2871
01:49:34,320 --> 01:49:37,440
just one node okay
2872
01:49:37,440 --> 01:49:38,880
and that's how easy it is to define
2873
01:49:38,880 --> 01:49:41,679
something in tensorflow
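The slide's architecture, two dense layers of 16 relu units and a single output node, looks like this in keras; the 10-feature input shape is taken from the notebook code that follows:

```python
import tensorflow as tf

# the video passes input_shape=(10,) to the first Dense layer;
# an explicit Input object is the equivalent, warning-free form
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),  # single output node, as on the slide
])

model.summary()
```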
2874
01:49:41,679 --> 01:49:44,800
so tensorflow is an open source library
2875
01:49:44,800 --> 01:49:47,760
that helps you develop and train your ml
2876
01:49:47,760 --> 01:49:49,440
models
2877
01:49:49,440 --> 01:49:52,000
let's implement this for a neural net so
2878
01:49:52,000 --> 01:49:53,119
we're using a neural net for
2879
01:49:53,119 --> 01:49:54,400
classification
2880
01:49:54,400 --> 01:49:55,440
now
2881
01:49:55,440 --> 01:49:58,159
so our neural net model
2882
01:49:58,159 --> 01:50:00,080
we are going to use tensorflow and i
2883
01:50:00,080 --> 01:50:02,800
don't think i imported that up here so
2884
01:50:02,800 --> 01:50:05,840
we are going to import that down here
2885
01:50:05,840 --> 01:50:07,679
so i'm going to import
2886
01:50:07,679 --> 01:50:10,719
tensorflow as tf
2887
01:50:10,719 --> 01:50:12,320
and enter
2888
01:50:12,320 --> 01:50:14,159
cool
2889
01:50:14,159 --> 01:50:16,719
so my
2890
01:50:17,360 --> 01:50:19,199
neural net model
2891
01:50:19,199 --> 01:50:20,639
is going to be
2892
01:50:20,639 --> 01:50:23,199
i'm going to use this
2893
01:50:23,199 --> 01:50:23,900
um
2894
01:50:23,900 --> 01:50:25,360
[Music]
2895
01:50:25,360 --> 01:50:27,199
so essentially this is saying
2896
01:50:27,199 --> 01:50:28,800
layer all these things that i'm about to
2897
01:50:28,800 --> 01:50:30,000
pass in
2898
01:50:30,000 --> 01:50:31,199
so yeah
2899
01:50:31,199 --> 01:50:33,920
layer them linear stack of layers
2900
01:50:33,920 --> 01:50:35,679
layer them as a model
2901
01:50:35,679 --> 01:50:38,719
and what that means nope not that so
2902
01:50:38,719 --> 01:50:42,080
what that means is i can pass in
2903
01:50:42,080 --> 01:50:44,400
um some sort of layer
2904
01:50:44,400 --> 01:50:47,440
and i'm just going to use a dense layer
2905
01:50:47,440 --> 01:50:50,719
uh oops dot dense
2906
01:50:50,719 --> 01:50:53,520
and let's say we have 32
2907
01:50:53,520 --> 01:50:54,719
units
2908
01:50:54,719 --> 01:50:55,600
okay
2909
01:50:55,600 --> 01:50:58,480
i will also
2910
01:50:58,639 --> 01:51:01,040
um
2911
01:51:01,199 --> 01:51:04,320
set the activation as relu
2912
01:51:04,320 --> 01:51:06,719
and at first we have to specify the
2913
01:51:06,719 --> 01:51:07,920
input shape
2914
01:51:07,920 --> 01:51:11,119
so here we have 10 comma
2915
01:51:11,119 --> 01:51:13,440
all right
2916
01:51:16,000 --> 01:51:18,239
all right so that's our first layer now
2917
01:51:18,239 --> 01:51:19,600
our next layer i'm just going to have
2918
01:51:19,600 --> 01:51:20,960
another
2919
01:51:20,960 --> 01:51:24,719
a dense layer of 32 units all using relu
2920
01:51:24,719 --> 01:51:25,840
and
2921
01:51:25,840 --> 01:51:28,800
that's it so for the final layer this is
2922
01:51:28,800 --> 01:51:31,280
just going to be my output layer
2923
01:51:31,280 --> 01:51:33,679
it's going to just be one node
2924
01:51:33,679 --> 01:51:36,239
and the activation is going to be
2925
01:51:36,239 --> 01:51:37,599
sigmoid
2926
01:51:37,599 --> 01:51:38,639
so
2927
01:51:38,639 --> 01:51:40,719
if you recall from our logistic
2928
01:51:40,719 --> 01:51:42,719
regression what happened there was when
2929
01:51:42,719 --> 01:51:44,560
we had a sigmoid it looks something like
2930
01:51:44,560 --> 01:51:45,440
this
2931
01:51:45,440 --> 01:51:47,599
right so by creating a sigmoid
2932
01:51:47,599 --> 01:51:49,679
activation on our last layer we're
2933
01:51:49,679 --> 01:51:52,400
essentially projecting our predictions
2934
01:51:52,400 --> 01:51:54,639
to be between zero and one
2935
01:51:54,639 --> 01:51:56,159
just like in logistic
2936
01:51:56,159 --> 01:51:57,360
regression
2937
01:51:57,360 --> 01:51:59,199
and that's going to help us
2938
01:51:59,199 --> 01:52:01,199
you know we can just round to zero or
2939
01:52:01,199 --> 01:52:05,040
one and classify that way
2940
01:52:05,040 --> 01:52:07,840
so this is my neural net model and i'm
2941
01:52:07,840 --> 01:52:09,679
going to
2942
01:52:09,679 --> 01:52:12,239
compile this so in tensorflow we have to
2943
01:52:12,239 --> 01:52:13,760
compile it
2944
01:52:13,760 --> 01:52:15,440
it's really cool because i can just
2945
01:52:15,440 --> 01:52:17,440
literally pass in what type of optimizer
2946
01:52:17,440 --> 01:52:19,679
i want and it'll do it
2947
01:52:19,679 --> 01:52:22,639
um so here if i go to optimizers i'm
2948
01:52:22,639 --> 01:52:24,560
actually going to use adam
2949
01:52:24,560 --> 01:52:26,239
and you'll see that you know the
2950
01:52:26,239 --> 01:52:27,719
learning rate is
2951
01:52:27,719 --> 01:52:30,560
0.001 so i'm just going to use that
2952
01:52:30,560 --> 01:52:33,280
default so 0.001
2953
01:52:33,280 --> 01:52:37,199
and my loss is going to be
2954
01:52:37,280 --> 01:52:39,199
binary cross
2955
01:52:39,199 --> 01:52:41,679
entropy
2956
01:52:41,679 --> 01:52:42,480
and
2957
01:52:42,480 --> 01:52:44,719
the metrics that i'm also going to
2958
01:52:44,719 --> 01:52:46,480
include on here so it already will
2959
01:52:46,480 --> 01:52:48,480
consider loss but i'm i'm also going to
2960
01:52:48,480 --> 01:52:50,960
tack on accuracy so we can actually see
2961
01:52:50,960 --> 01:52:53,840
that in a plot later on
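Putting the definition and the compile step together gives something like this, with the adam optimizer at its 0.001 default learning rate, binary cross-entropy loss, and accuracy tacked on as an extra metric:

```python
import tensorflow as tf

nn_model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),  # equivalent to input_shape=(10,) in the video
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# adam with the default 0.001 learning rate; accuracy is tracked alongside
# the binary cross-entropy loss so it can be plotted after training
nn_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                 loss="binary_crossentropy",
                 metrics=["accuracy"])
```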
2962
01:52:53,840 --> 01:52:56,480
all right so i'm going to run this
2963
01:52:56,480 --> 01:52:58,960
um
2964
01:52:58,960 --> 01:53:00,400
and
2965
01:53:00,400 --> 01:53:02,400
one thing that i'm going to also do is
2966
01:53:02,400 --> 01:53:04,159
i'm going to define these plot
2967
01:53:04,159 --> 01:53:05,760
definitions so i'm actually copying
2968
01:53:05,760 --> 01:53:06,880
pasting this
2969
01:53:06,880 --> 01:53:09,119
i got these from tensorflow so if you go
2970
01:53:09,119 --> 01:53:10,960
on to some tensorflow tutorial they
2971
01:53:10,960 --> 01:53:13,199
actually have these this like
2972
01:53:13,199 --> 01:53:14,320
defined
2973
01:53:14,320 --> 01:53:16,159
uh and that's exactly what i'm doing
2974
01:53:16,159 --> 01:53:17,440
here so i'm actually going to move this
2975
01:53:17,440 --> 01:53:19,119
cell up
2976
01:53:19,119 --> 01:53:21,119
run that so we're basically plotting the
2977
01:53:21,119 --> 01:53:23,040
loss over all the different epochs
2978
01:53:23,040 --> 01:53:25,280
epochs means like training cycles and
2979
01:53:25,280 --> 01:53:26,800
we're going to plot the accuracy over
2980
01:53:26,800 --> 01:53:28,880
all the epochs
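The two plotting helpers mentioned here, adapted from the tensorflow tutorials, look roughly like this; they take the history object that fit returns:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; notebooks display inline instead
import matplotlib.pyplot as plt

def plot_loss(history):
    # loss over all the epochs, for both the training and validation splits
    plt.figure()
    plt.plot(history.history["loss"], label="loss")
    plt.plot(history.history["val_loss"], label="val_loss")
    plt.xlabel("Epoch")
    plt.ylabel("Binary crossentropy")
    plt.legend()
    plt.grid(True)

def plot_accuracy(history):
    # accuracy over all the epochs, training and validation
    plt.figure()
    plt.plot(history.history["accuracy"], label="accuracy")
    plt.plot(history.history["val_accuracy"], label="val_accuracy")
    plt.xlabel("Epoch")
    plt.ylabel("Accuracy")
    plt.legend()
    plt.grid(True)
```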
2981
01:53:28,880 --> 01:53:31,280
all right so we have our model
2982
01:53:31,280 --> 01:53:32,480
and now
2983
01:53:32,480 --> 01:53:37,119
all that's left is let's train it okay
2984
01:53:37,119 --> 01:53:39,520
so i'm going to say history so
2985
01:53:39,520 --> 01:53:41,360
tensorflow is great because it keeps
2986
01:53:41,360 --> 01:53:43,280
track of the history of the training
2987
01:53:43,280 --> 01:53:45,199
which is why we can go and plot it later
2988
01:53:45,199 --> 01:53:46,239
on
2989
01:53:46,239 --> 01:53:47,840
now i'm going to set that equal to this
2990
01:53:47,840 --> 01:53:49,679
neural net model
2991
01:53:49,679 --> 01:53:51,440
and fit that
2992
01:53:51,440 --> 01:53:53,280
with x train
2993
01:53:53,280 --> 01:53:55,280
y train
2994
01:53:55,280 --> 01:53:57,920
uh i'm going to
2995
01:53:57,920 --> 01:53:59,840
make the number of epochs equal to let's
2996
01:53:59,840 --> 01:54:02,800
say just let's just use 100 for now
2997
01:54:02,800 --> 01:54:05,040
and the batch size i'm going to set
2998
01:54:05,040 --> 01:54:09,040
equal to let's say 32.
2999
01:54:09,040 --> 01:54:11,199
all right um
3000
01:54:11,199 --> 01:54:14,560
and the validation split
3001
01:54:14,800 --> 01:54:17,520
so what the validation split does if
3002
01:54:17,520 --> 01:54:20,800
it's down here somewhere
3003
01:54:20,800 --> 01:54:22,400
okay so yeah this validation split is
3004
01:54:22,400 --> 01:54:23,920
just the fraction the training data to
3005
01:54:23,920 --> 01:54:25,920
be used as validation data
3006
01:54:25,920 --> 01:54:28,800
so essentially every single epoch what's
3007
01:54:28,800 --> 01:54:30,239
going on is
3008
01:54:30,239 --> 01:54:33,440
uh tensorflow saying leave certain if if
3009
01:54:33,440 --> 01:54:35,440
this is 0.2 then leave 20 percent
3010
01:54:35,440 --> 01:54:37,119
out and we're going to test how the
3011
01:54:37,119 --> 01:54:39,280
model performs on that 20 percent that we've
3012
01:54:39,280 --> 01:54:40,320
left out
3013
01:54:40,320 --> 01:54:41,679
okay so it's basically like our
3014
01:54:41,679 --> 01:54:44,000
validation data set but
3015
01:54:44,000 --> 01:54:45,599
um tensorflow does it on our training
3016
01:54:45,599 --> 01:54:47,520
data set during the training so we have
3017
01:54:47,520 --> 01:54:49,520
now a measure outside of just our
3018
01:54:49,520 --> 01:54:51,840
validation data set to see you know
3019
01:54:51,840 --> 01:54:53,520
what's going on
3020
01:54:53,520 --> 01:54:54,960
so validation split i'm going to make
3021
01:54:54,960 --> 01:54:57,520
that 0.2
3022
01:54:57,520 --> 01:55:02,239
and we can run this so if i run that
3023
01:55:03,199 --> 01:55:07,920
all right and i'm actually going to
3024
01:55:08,560 --> 01:55:10,080
set verbose
3025
01:55:10,080 --> 01:55:11,840
equal to zero which means okay don't
3026
01:55:11,840 --> 01:55:13,360
print anything because printing
3027
01:55:13,360 --> 01:55:14,880
something for 100 epochs might get kind
3028
01:55:14,880 --> 01:55:16,320
of annoying
3029
01:55:16,320 --> 01:55:18,480
so i'm just gonna let it run
3030
01:55:18,480 --> 01:55:20,560
let it train and then we'll see what
3031
01:55:20,560 --> 01:55:23,040
happens
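The full training call being described, fit with 100 epochs, batch size 32, a 0.2 validation split, and verbose=0, can be sketched end to end; synthetic data stands in for the video's dataset, and the epoch count is trimmed so the sketch runs quickly:

```python
import numpy as np
import tensorflow as tf

# synthetic stand-in for the scaled training data in the video
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 10)).astype("float32")
y_train = (X_train[:, 0] > 0).astype("float32")

nn_model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
nn_model.compile(optimizer=tf.keras.optimizers.Adam(0.001),
                 loss="binary_crossentropy", metrics=["accuracy"])

# train quietly, holding out 20% of the training data as validation each epoch
# (the video uses epochs=100; 10 keeps this sketch fast)
history = nn_model.fit(X_train, y_train,
                       epochs=10, batch_size=32,
                       validation_split=0.2, verbose=0)
print(sorted(history.history.keys()))
```

The history object records loss and accuracy for both splits per epoch, which is what the plotting helpers consume.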
3032
01:55:27,920 --> 01:55:30,159
cool so it finished training and now
3033
01:55:30,159 --> 01:55:31,760
what i can do is because you know i've
3034
01:55:31,760 --> 01:55:34,480
already defined these two functions
3035
01:55:34,480 --> 01:55:36,880
i can go ahead and i can plot the loss
3036
01:55:36,880 --> 01:55:38,000
oops
3037
01:55:38,000 --> 01:55:41,040
plot loss of that history
3038
01:55:41,040 --> 01:55:43,520
and i can also plot the accuracy
3039
01:55:43,520 --> 01:55:44,560
throughout
3040
01:55:44,560 --> 01:55:47,520
the training
3041
01:55:47,520 --> 01:55:48,480
so
3042
01:55:48,480 --> 01:55:51,119
this is a little bit ish what we're
3043
01:55:51,119 --> 01:55:53,599
looking for we definitely are looking
3044
01:55:53,599 --> 01:55:56,239
for a steadily decreasing loss
3045
01:55:56,239 --> 01:55:59,520
and an increasing accuracy so here we do
3046
01:55:59,520 --> 01:56:01,520
see that you know our validation
3047
01:56:01,520 --> 01:56:03,760
accuracy improves from
3048
01:56:03,760 --> 01:56:06,639
uh around point seven
3049
01:56:06,639 --> 01:56:08,560
seven or something all the way up to
3050
01:56:08,560 --> 01:56:12,000
somewhere around point maybe eight one
3051
01:56:12,000 --> 01:56:13,760
and our loss is decreasing so this is
3052
01:56:13,760 --> 01:56:14,560
good
3053
01:56:14,560 --> 01:56:17,040
it is expected that the validation loss
3054
01:56:17,040 --> 01:56:20,000
and accuracy are performing worse than
3055
01:56:20,000 --> 01:56:22,400
the training loss or accuracy and that's
3056
01:56:22,400 --> 01:56:24,560
because our model is training on that
3057
01:56:24,560 --> 01:56:26,800
data so it's adapting to that data
3058
01:56:26,800 --> 01:56:28,719
whereas the validation stuff is you know
3059
01:56:28,719 --> 01:56:31,520
stuff that it hasn't seen yet so
3060
01:56:31,520 --> 01:56:33,360
so that's why
3061
01:56:33,360 --> 01:56:35,679
so in machine learning as we saw above
3062
01:56:35,679 --> 01:56:36,800
we could change a bunch of the
3063
01:56:36,800 --> 01:56:38,159
parameters right like i could change
3064
01:56:38,159 --> 01:56:41,119
this to 64. so now it'd be a row of 64
3065
01:56:41,119 --> 01:56:44,800
nodes and then 32 and then one
3066
01:56:44,800 --> 01:56:47,599
so i can change some of these parameters
3067
01:56:47,599 --> 01:56:48,880
and
3068
01:56:48,880 --> 01:56:50,239
a lot of machine learning is trying to
3069
01:56:50,239 --> 01:56:52,080
find hey what do we set these hyper
3070
01:56:52,080 --> 01:56:54,320
parameters to
3071
01:56:54,320 --> 01:56:57,679
so what i'm actually going to do is i'm
3072
01:56:57,679 --> 01:57:00,639
going to rewrite this so that
3073
01:57:00,639 --> 01:57:02,480
we can do something what's known as a
3074
01:57:02,480 --> 01:57:04,719
grid search so we can search through an
3075
01:57:04,719 --> 01:57:07,599
entire space of hey what happens if
3076
01:57:07,599 --> 01:57:08,880
you know we
3077
01:57:08,880 --> 01:57:13,360
have 64 nodes and 64 nodes or 16 nodes
3078
01:57:13,360 --> 01:57:14,880
and 16 nodes
3079
01:57:14,880 --> 01:57:17,520
and so on
3080
01:57:17,520 --> 01:57:19,440
um and then on top of all that we can
3081
01:57:19,440 --> 01:57:21,440
you know we can change
3082
01:57:21,440 --> 01:57:22,960
this uh
3083
01:57:22,960 --> 01:57:25,119
learning rate we can change how many
3084
01:57:25,119 --> 01:57:27,360
epochs we can change
3085
01:57:27,360 --> 01:57:29,520
you know the batch size all these things
3086
01:57:29,520 --> 01:57:31,599
might affect our training
3087
01:57:31,599 --> 01:57:33,119
and
3088
01:57:33,119 --> 01:57:34,880
just for kicks i'm also going to add
3089
01:57:34,880 --> 01:57:40,480
what's known as a dropout layer in here
3090
01:57:41,360 --> 01:57:43,599
and what dropout is doing is saying hey
3091
01:57:43,599 --> 01:57:46,320
randomly choose
3092
01:57:46,320 --> 01:57:49,199
at this rate certain nodes
3093
01:57:49,199 --> 01:57:51,119
and don't train them
3094
01:57:51,119 --> 01:57:53,440
in you know a certain iteration so this
3095
01:57:53,440 --> 01:57:56,000
helps prevent overfitting
3096
01:57:56,000 --> 01:57:56,960
okay
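For anyone following along, the dropout idea being described can be sketched framework-free with NumPy — `dropout_forward` here is a hypothetical helper for illustration, not the actual Keras `Dropout` layer:

```python
import numpy as np

def dropout_forward(activations, drop_prob, rng):
    # During one training iteration, each node is switched off (set to
    # zero) with probability drop_prob; the surviving nodes are scaled
    # by 1 / (1 - drop_prob) so the expected activation is unchanged
    # ("inverted dropout", which is what Keras does at train time).
    mask = rng.random(activations.shape) >= drop_prob
    return activations * mask / (1.0 - drop_prob)

rng = np.random.default_rng(0)
acts = np.ones((4, 32))            # a batch of 4 rows, 32 nodes each
dropped = dropout_forward(acts, 0.2, rng)
```

Because a different random subset of nodes is zeroed each iteration, no single node can be relied on too heavily, which is how dropout helps prevent overfitting.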
3097
01:57:56,960 --> 01:57:58,000
so
3098
01:57:58,000 --> 01:57:59,679
i'm actually going to
3099
01:57:59,679 --> 01:58:01,760
define this
3100
01:58:01,760 --> 01:58:04,080
as a function called train model we're
3101
01:58:04,080 --> 01:58:07,360
going to pass an x train y train
3102
01:58:07,360 --> 01:58:09,679
um the number of
3103
01:58:09,679 --> 01:58:11,360
nodes
3104
01:58:11,360 --> 01:58:13,840
uh the drop out
3105
01:58:13,840 --> 01:58:15,119
you know the probability that we just
3106
01:58:15,119 --> 01:58:16,639
talked about
3107
01:58:16,639 --> 01:58:17,599
um
3108
01:58:17,599 --> 01:58:18,960
learning rate
3109
01:58:18,960 --> 01:58:21,119
so i'm actually going to say lr
3110
01:58:21,119 --> 01:58:23,599
batch size
3111
01:58:23,599 --> 01:58:25,440
and
3112
01:58:25,440 --> 01:58:27,119
we can also pass the number of epochs
3113
01:58:27,119 --> 01:58:30,480
right i mentioned that as a parameter
3114
01:58:30,480 --> 01:58:33,360
so indent this so it goes under here and
3115
01:58:33,360 --> 01:58:35,119
with these two i'm going to set this
3116
01:58:35,119 --> 01:58:38,560
equal to number of nodes
3117
01:58:38,560 --> 01:58:40,560
and now with the two dropout layers i'm
3118
01:58:40,560 --> 01:58:43,840
going to set dropout prob so now you
3119
01:58:43,840 --> 01:58:46,239
know the probability of
3120
01:58:46,239 --> 01:58:48,639
turning off a node during the training
3121
01:58:48,639 --> 01:58:50,639
is equal to dropout prob
3122
01:58:50,639 --> 01:58:52,400
um and i'm going to keep the output
3123
01:58:52,400 --> 01:58:54,080
layer the same
3124
01:58:54,080 --> 01:58:56,080
now i'm compiling it but this here is
3125
01:58:56,080 --> 01:58:58,080
now going to be my learning rate
3126
01:58:58,080 --> 01:58:59,840
and i still want binary cross entropy
3127
01:58:59,840 --> 01:59:02,560
and accuracy
3128
01:59:02,560 --> 01:59:05,520
we are actually going to train
3129
01:59:05,520 --> 01:59:08,080
our model inside of
3130
01:59:08,080 --> 01:59:09,199
this
3131
01:59:09,199 --> 01:59:10,639
function
3132
01:59:10,639 --> 01:59:13,920
but here we can do the epochs equals
3133
01:59:13,920 --> 01:59:16,320
epochs and this is equal to whatever you
3134
01:59:16,320 --> 01:59:18,159
know we're passing in
3135
01:59:18,159 --> 01:59:21,040
uh x train y train belong right here
3136
01:59:21,040 --> 01:59:22,960
okay so those are getting passed in as
3137
01:59:22,960 --> 01:59:23,920
well
3138
01:59:23,920 --> 01:59:26,080
and finally at the end i'm going to
3139
01:59:26,080 --> 01:59:29,679
return this model and the history of
3140
01:59:29,679 --> 01:59:32,239
that model
3141
01:59:32,639 --> 01:59:34,880
okay
3142
01:59:34,880 --> 01:59:37,199
so
3143
01:59:37,280 --> 01:59:40,080
now what i'll do
3144
01:59:40,320 --> 01:59:42,560
is let's just go through all of these so
3145
01:59:42,560 --> 01:59:45,440
let's say let's keep the epochs at 100.
3146
01:59:45,440 --> 01:59:47,280
and now what i can do is i can say hey
3147
01:59:47,280 --> 01:59:49,760
for a number of nodes in
3148
01:59:49,760 --> 01:59:52,960
let's say let's do 16 32 and 64 to see
3149
01:59:52,960 --> 01:59:55,040
what happens
3150
01:59:55,040 --> 01:59:56,840
for the different dropout
3151
01:59:56,840 --> 01:59:59,360
probabilities in
3152
01:59:59,360 --> 02:00:01,679
i mean zero would be nothing let's use
3153
02:00:01,679 --> 02:00:05,440
0.2 also to see what happens
3154
02:00:05,440 --> 02:00:07,840
um
3155
02:00:07,920 --> 02:00:10,719
you know for the learning rate in
3156
02:00:10,719 --> 02:00:11,880
uh
3157
02:00:11,880 --> 02:00:13,679
0.005
3158
02:00:13,679 --> 02:00:16,480
0.001
3159
02:00:16,480 --> 02:00:18,080
and you know maybe we want to throw on
3160
02:00:18,080 --> 02:00:22,080
0.1 in there as well
3161
02:00:22,239 --> 02:00:26,159
and then for the batch size uh let's do
3162
02:00:26,159 --> 02:00:29,520
16 32 64 as well actually and let's also
3163
02:00:29,520 --> 02:00:31,920
throw in 128. actually let's get rid of
3164
02:00:31,920 --> 02:00:33,520
16. sorry
3165
02:00:33,520 --> 02:00:36,880
let's throw 128 in there
3166
02:00:37,199 --> 02:00:39,360
that should be zero one
3167
02:00:39,360 --> 02:00:41,920
i'm going to
3168
02:00:41,920 --> 02:00:44,800
record the model in history using this
3169
02:00:44,800 --> 02:00:47,599
train model here
3170
02:00:47,599 --> 02:00:48,560
so
3171
02:00:48,560 --> 02:00:51,679
we're going to do x train y
3172
02:00:51,679 --> 02:00:52,639
train
3173
02:00:52,639 --> 02:00:54,719
the number of nodes is going to be you
3174
02:00:54,719 --> 02:00:55,920
know the number of nodes that we've
3175
02:00:55,920 --> 02:00:57,520
defined here
3176
02:00:57,520 --> 02:00:59,040
dropout
3177
02:00:59,040 --> 02:01:01,040
prob lr
3178
02:01:01,040 --> 02:01:02,960
batch size
3179
02:01:02,960 --> 02:01:05,119
and epochs okay
3180
02:01:05,119 --> 02:01:07,040
and then now we have both the model and
3181
02:01:07,040 --> 02:01:10,000
the history and what i'm going to do is
3182
02:01:10,000 --> 02:01:12,800
again i want to plot
3183
02:01:12,800 --> 02:01:14,800
the loss
3184
02:01:14,800 --> 02:01:17,119
for the history i'm also going to plot
3185
02:01:17,119 --> 02:01:19,760
the accuracy
3186
02:01:19,760 --> 02:01:20,960
probably should have done them side by
3187
02:01:20,960 --> 02:01:22,080
side that probably would have been
3188
02:01:22,080 --> 02:01:24,480
easier
3189
02:01:26,159 --> 02:01:27,520
okay so
3190
02:01:27,520 --> 02:01:30,159
what i'm going to do is
3191
02:01:30,159 --> 02:01:33,280
split this up
3192
02:01:33,280 --> 02:01:35,520
and that will be
3193
02:01:35,520 --> 02:01:38,320
subplots so now this is just saying okay
3194
02:01:38,320 --> 02:01:40,400
i want one row and two columns in that
3195
02:01:40,400 --> 02:01:43,040
row for my plots
3196
02:01:43,040 --> 02:01:43,920
okay
3197
02:01:43,920 --> 02:01:45,040
so
3198
02:01:45,040 --> 02:01:49,679
i'm going to plot on my axis one
3199
02:01:49,679 --> 02:01:52,400
the loss
3200
02:01:54,719 --> 02:01:57,199
i don't actually know if this is gonna work
3201
02:01:57,199 --> 02:01:59,040
okay we don't care about the grid uh
3202
02:01:59,040 --> 02:02:01,119
yeah let's keep the grid
3203
02:02:01,119 --> 02:02:05,320
and then now on my other
3204
02:02:09,119 --> 02:02:12,320
so now on here i'm going to plot all the
3205
02:02:12,320 --> 02:02:17,320
accuracies on the second plot
3206
02:02:20,080 --> 02:02:23,760
i might have to debug this a bit
3207
02:02:24,239 --> 02:02:26,639
but we should be able to get rid of that
3208
02:02:26,639 --> 02:02:29,280
if we run this we already have history
3209
02:02:29,280 --> 02:02:33,119
saved as a variable in here so if i just
3210
02:02:33,119 --> 02:02:36,159
run it on this okay it has no
3211
02:02:36,159 --> 02:02:38,880
attribute xlabel
3212
02:02:38,880 --> 02:02:41,040
oh i think it's because it's like set x
3213
02:02:41,040 --> 02:02:44,080
label or something
3214
02:02:45,679 --> 02:02:48,560
okay yeah so it's set_xlabel instead of
3215
02:02:48,560 --> 02:02:50,960
just xlabel and likewise for the y label
3216
02:02:50,960 --> 02:02:52,960
so let's see if that works
3217
02:02:52,960 --> 02:02:55,040
all right cool
3218
02:02:55,040 --> 02:02:57,280
um and let's actually make this a bit
3219
02:02:57,280 --> 02:02:58,800
larger
3220
02:02:58,800 --> 02:03:00,239
okay so we can actually change the
3221
02:03:00,239 --> 02:03:02,320
figure size and i'm gonna set
3222
02:03:02,320 --> 02:03:04,960
let's see what happens if i set that to
3223
02:03:04,960 --> 02:03:09,119
oh that's not the way i wanted it um
3224
02:03:09,920 --> 02:03:14,119
okay so that looks reasonable
3225
02:03:15,360 --> 02:03:16,719
and that's just going to be my plot
3226
02:03:16,719 --> 02:03:18,560
history function so now i can plot them
3227
02:03:18,560 --> 02:03:20,800
side by side
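A minimal sketch of the side-by-side plot-history function being built here, with a plain dict of metric lists standing in for the Keras `History.history` object (note `set_xlabel`/`set_ylabel` on an Axes, not `plt.xlabel` — the very fix made in the video):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

def plot_history(history_dict):
    # One row, two columns: loss on the left, accuracy on the right.
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(history_dict["loss"], label="loss")
    ax1.plot(history_dict["val_loss"], label="val_loss")
    ax1.set_xlabel("Epoch")                   # Axes objects use set_xlabel,
    ax1.set_ylabel("Binary cross-entropy")    # not the plt.xlabel shortcut
    ax1.legend(); ax1.grid(True)
    ax2.plot(history_dict["accuracy"], label="accuracy")
    ax2.plot(history_dict["val_accuracy"], label="val_accuracy")
    ax2.set_xlabel("Epoch"); ax2.set_ylabel("Accuracy")
    ax2.legend(); ax2.grid(True)
    return fig

fig = plot_history({"loss": [0.6, 0.4], "val_loss": [0.65, 0.5],
                    "accuracy": [0.7, 0.8], "val_accuracy": [0.68, 0.75]})
```

In the actual notebook you would pass `history.history` from `model.fit` instead of the made-up dict above.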
3228
02:03:20,800 --> 02:03:24,320
here i'm going to plot the history
3229
02:03:24,320 --> 02:03:27,760
and what i'm actually going to do is i
3230
02:03:27,760 --> 02:03:29,760
so here first i'm going to print out all
3231
02:03:29,760 --> 02:03:31,199
these parameters so i'm going to print
3232
02:03:31,199 --> 02:03:33,199
out
3233
02:03:33,199 --> 02:03:35,599
use the f string to print out uh all of
3234
02:03:35,599 --> 02:03:37,360
this stuff
3235
02:03:37,360 --> 02:03:39,679
so here i'm printing out how many nodes
3236
02:03:39,679 --> 02:03:41,199
um
3237
02:03:41,199 --> 02:03:44,560
the dropout probability
3238
02:03:45,440 --> 02:03:48,719
uh the learning rate
3239
02:03:55,040 --> 02:03:56,239
and we already know how many epochs so
3240
02:03:56,239 --> 02:03:59,440
i'm not even gonna bother with that
3241
02:03:59,760 --> 02:04:00,560
so
3242
02:04:00,560 --> 02:04:02,800
once we plot
3243
02:04:02,800 --> 02:04:03,760
this
3244
02:04:03,760 --> 02:04:06,960
uh let's actually also
3245
02:04:06,960 --> 02:04:09,280
figure out what the um
3246
02:04:09,280 --> 02:04:11,520
what the validation loss is on our
3247
02:04:11,520 --> 02:04:13,840
validation set that we have
3248
02:04:13,840 --> 02:04:16,639
that we created all the way back up here
3249
02:04:16,639 --> 02:04:18,080
all right so remember we created three
3250
02:04:18,080 --> 02:04:19,679
data sets
3251
02:04:19,679 --> 02:04:23,040
let's call our model and evaluate
3252
02:04:23,040 --> 02:04:25,599
what the
3253
02:04:25,599 --> 02:04:26,880
validation
3254
02:04:26,880 --> 02:04:29,360
data set's loss
3255
02:04:29,360 --> 02:04:30,719
would be
3256
02:04:30,719 --> 02:04:33,440
and i actually want to record
3257
02:04:33,440 --> 02:04:35,119
let's say i want to record
3258
02:04:35,119 --> 02:04:37,280
whatever model has the least validation
3259
02:04:37,280 --> 02:04:40,080
loss so
3260
02:04:40,560 --> 02:04:42,400
first i'm going to initialize that to
3261
02:04:42,400 --> 02:04:44,480
infinity so that you know any model will
3262
02:04:44,480 --> 02:04:46,159
beat that score
3263
02:04:46,159 --> 02:04:49,119
so if i do float infinity that will set
3264
02:04:49,119 --> 02:04:50,840
that to infinity
3265
02:04:50,840 --> 02:04:54,560
and um maybe i'll keep track of the
3266
02:04:54,560 --> 02:04:56,480
parameters actually it doesn't really
3267
02:04:56,480 --> 02:04:57,599
matter
3268
02:04:57,599 --> 02:04:58,639
i'm just going to keep track of the
3269
02:04:58,639 --> 02:05:00,080
model
3270
02:05:00,080 --> 02:05:02,239
and i'm going to set that to none
3271
02:05:02,239 --> 02:05:03,920
so now down here
3272
02:05:03,920 --> 02:05:06,960
if the validation loss is ever less than
3273
02:05:06,960 --> 02:05:10,159
the least validation loss
3274
02:05:10,159 --> 02:05:12,960
then i am going to simply come down here
3275
02:05:12,960 --> 02:05:14,800
and say hey
3276
02:05:14,800 --> 02:05:16,560
this validation
3277
02:05:16,560 --> 02:05:18,800
or this least validation loss is now
3278
02:05:18,800 --> 02:05:21,520
equal to the validation loss
3279
02:05:21,520 --> 02:05:22,400
and
3280
02:05:22,400 --> 02:05:23,599
the least
3281
02:05:23,599 --> 02:05:25,040
loss model
3282
02:05:25,040 --> 02:05:27,599
is whatever this model
3283
02:05:27,599 --> 02:05:30,159
is that just earned that validation loss
3284
02:05:30,159 --> 02:05:31,760
okay
3285
02:05:31,760 --> 02:05:33,360
so
3286
02:05:33,360 --> 02:05:35,599
we are actually just going to let this
3287
02:05:35,599 --> 02:05:37,199
run
3288
02:05:37,199 --> 02:05:39,440
um for a while and then we're going to
3289
02:05:39,440 --> 02:05:40,880
get our least
3290
02:05:40,880 --> 02:05:44,320
loss model after that
3291
02:05:44,320 --> 02:05:46,480
so
3292
02:05:47,119 --> 02:05:49,840
let's just run
3293
02:05:50,000 --> 02:05:54,440
all right and now we wait
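The grid-search-and-bookkeeping logic described above can be sketched like this — `train_model_stub` is a hypothetical stand-in for the real Keras `train_model` from the video, returning a made-up validation loss so that only the loop structure and least-loss tracking are shown:

```python
import itertools

def train_model_stub(num_nodes, dropout_prob, lr, batch_size):
    # Stand-in for the real training function; it returns a fake
    # validation loss so the search logic itself can be followed
    # without actually training a network.
    return abs(lr - 0.001) + dropout_prob + num_nodes / 1000

least_val_loss = float("inf")   # any real model beats infinity
least_loss_params = None

# Same hyperparameter grid as in the video: nodes, dropout
# probability, learning rate, and batch size.
for num_nodes, dropout_prob, lr, batch_size in itertools.product(
        [16, 32, 64], [0.0, 0.2], [0.005, 0.001, 0.1], [32, 64, 128]):
    val_loss = train_model_stub(num_nodes, dropout_prob, lr, batch_size)
    if val_loss < least_val_loss:
        least_val_loss = val_loss
        least_loss_params = (num_nodes, dropout_prob, lr, batch_size)
```

In the real version, each iteration would call `train_model`, evaluate the returned model on the validation set, and keep the model (not just the parameters) with the lowest validation loss.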
3294
02:06:06,000 --> 02:06:08,159
all right so we have finally finished
3295
02:06:08,159 --> 02:06:09,440
training
3296
02:06:09,440 --> 02:06:11,520
and you'll notice that okay down here
3297
02:06:11,520 --> 02:06:14,480
the loss actually gets to like 0.29
3298
02:06:14,480 --> 02:06:16,800
the accuracy is around 88% which is
3299
02:06:16,800 --> 02:06:18,000
pretty good
3300
02:06:18,000 --> 02:06:19,920
so you might be wondering okay why is
3301
02:06:19,920 --> 02:06:22,239
this accuracy different from this one
3302
02:06:22,239 --> 02:06:24,560
like aren't these both the validation
3303
02:06:24,560 --> 02:06:26,400
so this accuracy here is on the
3304
02:06:26,400 --> 02:06:28,000
validation data set that we've defined
3305
02:06:28,000 --> 02:06:30,239
at the beginning right and this one here
3306
02:06:30,239 --> 02:06:32,800
this is actually taking 20% of
3307
02:06:32,800 --> 02:06:34,880
our training set every time during the
3308
02:06:34,880 --> 02:06:37,280
training and saying okay how much of it
3309
02:06:37,280 --> 02:06:39,199
do i get right now
3310
02:06:39,199 --> 02:06:40,800
you know after this one step where i
3311
02:06:40,800 --> 02:06:43,040
didn't train with any of that
3312
02:06:43,040 --> 02:06:45,440
so they're slightly different and
3313
02:06:45,440 --> 02:06:47,119
actually i realized later on that i
3314
02:06:47,119 --> 02:06:48,560
probably you know probably what i should
3315
02:06:48,560 --> 02:06:49,599
have done
3316
02:06:49,599 --> 02:06:54,560
is over here when we were defining
3317
02:06:54,560 --> 02:06:57,440
the model fit instead of the validation
3318
02:06:57,440 --> 02:07:00,400
split you can define the validation data
3319
02:07:00,400 --> 02:07:02,800
and you can pass in the validation data
3320
02:07:02,800 --> 02:07:03,920
i don't know if this is the proper
3321
02:07:03,920 --> 02:07:05,360
syntax but
3322
02:07:05,360 --> 02:07:07,199
that's probably what i should have done
3323
02:07:07,199 --> 02:07:08,880
but instead you know we'll just stick
3324
02:07:08,880 --> 02:07:11,440
with what we have here
3325
02:07:11,440 --> 02:07:13,679
so you'll see at the end you know with
3326
02:07:13,679 --> 02:07:16,400
the 64 nodes it seems like this is our
3327
02:07:16,400 --> 02:07:18,719
best performance 64 nodes with a dropout
3328
02:07:18,719 --> 02:07:21,560
of 0.2 a learning rate of
3329
02:07:21,560 --> 02:07:25,440
0.001 and a batch size of 64.
3330
02:07:25,440 --> 02:07:28,480
and it does seem like yes the validation
3331
02:07:28,480 --> 02:07:30,719
you know the fake validation but the
3332
02:07:30,719 --> 02:07:33,920
validation um
3333
02:07:33,920 --> 02:07:36,639
loss is decreasing and then the accuracy
3334
02:07:36,639 --> 02:07:40,159
is increasing which is a good sign okay
3335
02:07:40,159 --> 02:07:41,520
so finally
3336
02:07:41,520 --> 02:07:43,119
what i'm going to do is i'm actually
3337
02:07:43,119 --> 02:07:44,719
just going to predict so i'm going to
3338
02:07:44,719 --> 02:07:45,599
take
3339
02:07:45,599 --> 02:07:48,320
this model which we've called our least
3340
02:07:48,320 --> 02:07:50,079
loss model
3341
02:07:50,079 --> 02:07:51,599
i'm going to take this model and i'm
3342
02:07:51,599 --> 02:07:53,280
going to predict
3343
02:07:53,280 --> 02:07:54,560
x test
3344
02:07:54,560 --> 02:07:56,159
on that
3345
02:07:56,159 --> 02:07:58,079
and you'll see that it gives me some
3346
02:07:58,079 --> 02:07:59,440
values that are really close to zero and
3347
02:07:59,440 --> 02:08:00,960
some that are really close to one and
3348
02:08:00,960 --> 02:08:03,040
that's because we have a sigmoid output
3349
02:08:03,040 --> 02:08:04,000
so
3350
02:08:04,000 --> 02:08:05,360
if i
3351
02:08:05,360 --> 02:08:07,840
do this
3352
02:08:07,840 --> 02:08:10,480
what i can do is i can cast them
3353
02:08:10,480 --> 02:08:12,320
so i'm going to say anything that's
3354
02:08:12,320 --> 02:08:15,679
greater than 0.5
3355
02:08:15,679 --> 02:08:18,159
set that to 1. so if i
3356
02:08:18,159 --> 02:08:20,159
actually i think what happens if i do
3357
02:08:20,159 --> 02:08:22,719
this
3358
02:08:22,719 --> 02:08:25,920
oh okay so i have to cast that as type
3359
02:08:25,920 --> 02:08:28,480
and so now you'll see that it's ones and
3360
02:08:28,480 --> 02:08:29,520
zeros
3361
02:08:29,520 --> 02:08:31,920
and i'm actually going to transform this
3362
02:08:31,920 --> 02:08:34,400
into a column as well
3363
02:08:34,400 --> 02:08:37,840
so here i'm going to
3364
02:08:38,800 --> 02:08:41,679
oh oops uh i didn't mean to do that okay
3365
02:08:41,679 --> 02:08:45,280
no i wanted to just reshape it to
3366
02:08:45,280 --> 02:08:46,590
that so now
3368
02:08:47,679 --> 02:08:50,480
it's one dimensional okay
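The thresholding and reshaping step just described can be sketched with NumPy alone — the probabilities below are made up for illustration; in the video they come from `least_loss_model.predict(X_test)`:

```python
import numpy as np

# Hypothetical sigmoid outputs: one probability per test sample,
# in a column just like model.predict returns them.
y_prob = np.array([[0.03], [0.91], [0.47], [0.66]])

# Anything greater than 0.5 becomes class 1, the rest class 0;
# .astype(int) does the cast and .reshape(-1) flattens the column
# into the one-dimensional array the classification report expects.
y_pred = (y_prob > 0.5).astype(int).reshape(-1)
```

The resulting `y_pred` can then be passed, together with the true labels, to scikit-learn's `classification_report`.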
3369
02:08:50,480 --> 02:08:51,599
and
3370
02:08:51,599 --> 02:08:53,840
using that we can actually
3371
02:08:53,840 --> 02:08:56,719
just rerun the classification report
3372
02:08:56,719 --> 02:08:59,679
based on this neural net output
3373
02:08:59,679 --> 02:09:02,400
and you'll see that okay the f1 or
3374
02:09:02,400 --> 02:09:04,480
the accuracy gives us 87%
3375
02:09:04,480 --> 02:09:06,800
so it seems like what happened here is
3376
02:09:06,800 --> 02:09:08,159
the precision
3377
02:09:08,159 --> 02:09:11,840
on uh class 0 so the hadrons has
3378
02:09:11,840 --> 02:09:14,159
increased a bit but the recall decreased
3379
02:09:14,159 --> 02:09:17,199
but the f1 score is still at a good 0.81
3380
02:09:17,199 --> 02:09:19,119
and um
3381
02:09:19,119 --> 02:09:20,719
for the other class it looked like the
3382
02:09:20,719 --> 02:09:22,400
precision decreased a bit the recall
3383
02:09:22,400 --> 02:09:24,960
increased for an overall f1 score
3384
02:09:24,960 --> 02:09:28,000
that's also been increased
3385
02:09:28,000 --> 02:09:30,320
i think i interpreted that properly i
3386
02:09:30,320 --> 02:09:31,920
mean we went through all this work and
3387
02:09:31,920 --> 02:09:34,239
we got a model that performs actually
3388
02:09:34,239 --> 02:09:37,360
very very similarly to the svm model
3389
02:09:37,360 --> 02:09:39,520
that we had earlier
3390
02:09:39,520 --> 02:09:41,119
and the whole point of this exercise was
3391
02:09:41,119 --> 02:09:42,800
to demonstrate okay these are how you
3392
02:09:42,800 --> 02:09:45,040
can define your models but it's also to
3393
02:09:45,040 --> 02:09:47,040
say hey maybe
3394
02:09:47,040 --> 02:09:48,639
you know neural nets are very very
3395
02:09:48,639 --> 02:09:50,880
powerful as you can tell
3396
02:09:50,880 --> 02:09:53,520
but sometimes you know an svm or some
3397
02:09:53,520 --> 02:09:54,639
other model
3398
02:09:54,639 --> 02:09:57,679
might actually be more appropriate
3399
02:09:57,679 --> 02:09:59,199
but in this case i guess it didn't
3400
02:09:59,199 --> 02:10:00,800
really matter which one we used at the
3401
02:10:00,800 --> 02:10:02,639
end um
3402
02:10:02,639 --> 02:10:05,280
an 87 percent accuracy score is
3403
02:10:05,280 --> 02:10:07,119
still pretty good
3404
02:10:07,119 --> 02:10:11,760
so yeah let's now move on to regression
3405
02:10:11,760 --> 02:10:13,360
we just saw a bunch of different
3406
02:10:13,360 --> 02:10:15,679
classification models now let's shift
3407
02:10:15,679 --> 02:10:17,840
gears into regression the other type of
3408
02:10:17,840 --> 02:10:19,599
supervised learning
3409
02:10:19,599 --> 02:10:22,079
if we look at this plot over here we see
3410
02:10:22,079 --> 02:10:24,400
a bunch of scattered data points
3411
02:10:24,400 --> 02:10:27,119
and here we have our x
3412
02:10:27,119 --> 02:10:29,360
value for those data points and then we
3413
02:10:29,360 --> 02:10:32,079
have the corresponding y value which is
3414
02:10:32,079 --> 02:10:34,560
now our label
3415
02:10:34,560 --> 02:10:37,440
and when we look at this plot
3416
02:10:37,440 --> 02:10:40,000
well our goal in regression is to find
3417
02:10:40,000 --> 02:10:43,760
the line of best fit that best models
3418
02:10:43,760 --> 02:10:45,520
this data
3419
02:10:45,520 --> 02:10:47,440
essentially we're trying to let's say
3420
02:10:47,440 --> 02:10:50,239
we're given some new value of x that we
3421
02:10:50,239 --> 02:10:52,400
don't have in our sample we're trying to
3422
02:10:52,400 --> 02:10:56,239
say okay what would my prediction for y
3423
02:10:56,239 --> 02:10:57,119
be
3424
02:10:57,119 --> 02:10:59,599
for that given x value so that you know
3425
02:10:59,599 --> 02:11:03,119
might be somewhere around there
3426
02:11:03,119 --> 02:11:05,119
i don't know
3427
02:11:05,119 --> 02:11:07,360
but remember in regression that you know
3428
02:11:07,360 --> 02:11:08,719
given certain features we're trying to
3429
02:11:08,719 --> 02:11:11,520
predict some continuous numerical value
3430
02:11:11,520 --> 02:11:14,079
for y
3431
02:11:14,159 --> 02:11:16,639
in linear regression
3432
02:11:16,639 --> 02:11:19,520
we want to take our data and fit a
3433
02:11:19,520 --> 02:11:22,800
linear model to this data so in this
3434
02:11:22,800 --> 02:11:24,560
case our linear model might look
3435
02:11:24,560 --> 02:11:29,520
something along the lines of here
3436
02:11:29,520 --> 02:11:30,400
right
3437
02:11:30,400 --> 02:11:33,199
so this here would be considered as
3438
02:11:33,199 --> 02:11:36,239
maybe our line of
3439
02:11:36,239 --> 02:11:37,679
best
3440
02:11:37,679 --> 02:11:39,199
fit
3441
02:11:39,199 --> 02:11:40,719
and this line
3442
02:11:40,719 --> 02:11:43,040
is modeled by the equation i'm going to
3443
02:11:43,040 --> 02:11:44,480
write it down here
3444
02:11:44,480 --> 02:11:45,280
y
3445
02:11:45,280 --> 02:11:46,639
equals
3446
02:11:46,639 --> 02:11:48,239
b 0
3447
02:11:48,239 --> 02:11:50,880
plus b 1 x
3448
02:11:50,880 --> 02:11:53,199
now b0 just means it's this y-intercept
3449
02:11:53,199 --> 02:11:55,679
so if we extend this y down here
3450
02:11:55,679 --> 02:11:59,119
this value here is b0 and then b1
3451
02:11:59,119 --> 02:12:02,400
defines the slope
3452
02:12:04,159 --> 02:12:05,840
of this line
3453
02:12:05,840 --> 02:12:06,800
okay
3454
02:12:06,800 --> 02:12:08,239
all right so that's the
3455
02:12:08,239 --> 02:12:09,599
formula
3456
02:12:09,599 --> 02:12:12,639
for linear regression
3457
02:12:12,639 --> 02:12:15,199
and how exactly do we come up with that
3458
02:12:15,199 --> 02:12:17,199
formula what are we trying to do with
3459
02:12:17,199 --> 02:12:18,880
this linear regression
3460
02:12:18,880 --> 02:12:21,280
you know we could just eyeball
3461
02:12:21,280 --> 02:12:23,199
where should the line be but humans are
3462
02:12:23,199 --> 02:12:25,360
not very good at eyeballing certain
3463
02:12:25,360 --> 02:12:28,159
things like that i mean we can get close
3464
02:12:28,159 --> 02:12:30,320
but a computer is better at giving us a
3465
02:12:30,320 --> 02:12:35,840
precise value for b0 and b1
3466
02:12:35,840 --> 02:12:37,360
well let's introduce the concept of
3467
02:12:37,360 --> 02:12:40,400
something known as a residual
3468
02:12:40,400 --> 02:12:43,040
okay so
3469
02:12:43,040 --> 02:12:44,400
residual
3470
02:12:44,400 --> 02:12:46,480
you might also hear this being called
3471
02:12:46,480 --> 02:12:48,400
the error
3472
02:12:48,400 --> 02:12:51,040
and what that means is let's take some
3473
02:12:51,040 --> 02:12:53,760
data point in our data set
3474
02:12:53,760 --> 02:12:56,480
and we're going to evaluate how far off
3475
02:12:56,480 --> 02:12:58,320
is our prediction
3476
02:12:58,320 --> 02:12:59,119
from
3477
02:12:59,119 --> 02:13:01,440
a data point that we already have
3478
02:13:01,440 --> 02:13:04,800
so this here is our y let's say
3479
02:13:04,800 --> 02:13:09,520
this is one two three four five six
3480
02:13:09,520 --> 02:13:10,560
seven
3481
02:13:10,560 --> 02:13:13,440
eight so this is y eight let's call it
3482
02:13:13,440 --> 02:13:15,840
you'll see that i use this y i in order
3483
02:13:15,840 --> 02:13:18,079
to represent hey just one of these
3484
02:13:18,079 --> 02:13:19,199
points
3485
02:13:19,199 --> 02:13:20,960
okay
3486
02:13:20,960 --> 02:13:23,520
so this here is y and this here would be
3487
02:13:23,520 --> 02:13:25,920
the prediction oops this here would be
3488
02:13:25,920 --> 02:13:28,400
the prediction for y
3489
02:13:28,400 --> 02:13:29,280
8
3490
02:13:29,280 --> 02:13:31,920
which i've labeled with this hat okay if
3491
02:13:31,920 --> 02:13:33,440
it has a hat on it that means hey this
3492
02:13:33,440 --> 02:13:35,280
is what this is my guess this is my
3493
02:13:35,280 --> 02:13:36,560
prediction
3494
02:13:36,560 --> 02:13:37,360
for
3495
02:13:37,360 --> 02:13:40,000
you know this specific
3496
02:13:40,000 --> 02:13:42,840
value of x
3497
02:13:42,840 --> 02:13:45,840
okay now the residual
3498
02:13:45,840 --> 02:13:47,040
would be
3499
02:13:47,040 --> 02:13:49,199
this distance here
3500
02:13:49,199 --> 02:13:50,639
between y
3501
02:13:50,639 --> 02:13:53,599
eight and y hat eight so
3502
02:13:53,599 --> 02:13:54,960
y eight
3503
02:13:54,960 --> 02:13:57,760
minus y hat eight
3504
02:13:57,760 --> 02:13:59,040
all right because that would give us
3505
02:13:59,040 --> 02:14:00,000
this
3506
02:14:00,000 --> 02:14:01,679
here and i'm just going to take the
3507
02:14:01,679 --> 02:14:03,760
absolute value of this because what if
3508
02:14:03,760 --> 02:14:05,679
it's below the line
3509
02:14:05,679 --> 02:14:06,800
right then you would get a negative
3510
02:14:06,800 --> 02:14:08,639
value but distance can't be negative so
3511
02:14:08,639 --> 02:14:11,040
we're just going to put a little
3512
02:14:11,040 --> 02:14:12,239
absolute
3513
02:14:12,239 --> 02:14:15,199
value around this quantity
3514
02:14:15,199 --> 02:14:17,280
and that gives us
3515
02:14:17,280 --> 02:14:19,520
the residual or the error
3516
02:14:19,520 --> 02:14:21,760
so let me rewrite that
3517
02:14:21,760 --> 02:14:24,000
and you know to generalize to all the
3518
02:14:24,000 --> 02:14:26,159
points i'm going to say the residual can
3519
02:14:26,159 --> 02:14:27,679
be calculated
3520
02:14:27,679 --> 02:14:29,199
as y i
3521
02:14:29,199 --> 02:14:31,520
minus y hat
3522
02:14:31,520 --> 02:14:32,560
of i
3523
02:14:32,560 --> 02:14:33,840
okay
3524
02:14:33,840 --> 02:14:35,520
so this just means the distance between
3525
02:14:35,520 --> 02:14:37,119
some given point
3526
02:14:37,119 --> 02:14:39,280
and its prediction its corresponding
3527
02:14:39,280 --> 02:14:41,280
prediction on the line
3528
02:14:41,280 --> 02:14:42,239
so now
3529
02:14:42,239 --> 02:14:44,159
with this residual
3530
02:14:44,159 --> 02:14:46,239
this line of best fit
3531
02:14:46,239 --> 02:14:48,400
is generally trying to decrease these
3532
02:14:48,400 --> 02:14:51,440
residuals as much as possible
3533
02:14:51,440 --> 02:14:52,800
so
3534
02:14:52,800 --> 02:14:55,280
now that we have some value for the
3535
02:14:55,280 --> 02:14:57,119
error our line of best fit is trying to
3536
02:14:57,119 --> 02:14:59,280
decrease the error as much as possible
3537
02:14:59,280 --> 02:15:02,079
for all of the different data points
3538
02:15:02,079 --> 02:15:05,599
and that might mean you know minimizing
3539
02:15:05,599 --> 02:15:07,679
the sum of all the residuals so this
3540
02:15:07,679 --> 02:15:09,520
here this is the sum
3541
02:15:09,520 --> 02:15:12,960
symbol and if i just stick the residual
3542
02:15:12,960 --> 02:15:14,320
calculation
3543
02:15:14,320 --> 02:15:16,560
in there
3544
02:15:16,560 --> 02:15:18,639
it looks something like that right and
3545
02:15:18,639 --> 02:15:20,239
i'm just going to say okay for all of
3546
02:15:20,239 --> 02:15:22,320
the i's in our data set so for all the
3547
02:15:22,320 --> 02:15:24,239
different points we're going to sum up
3548
02:15:24,239 --> 02:15:26,320
all the residuals
3549
02:15:26,320 --> 02:15:29,040
and i'm going to try to decrease that
3550
02:15:29,040 --> 02:15:30,800
with my line of best fit so i'm going to
3551
02:15:30,800 --> 02:15:33,520
find the b0 and b1 which gives me the
3552
02:15:33,520 --> 02:15:36,639
lowest value of this
3553
02:15:36,639 --> 02:15:37,920
okay
3554
02:15:37,920 --> 02:15:40,639
now you know sometimes in
3555
02:15:40,639 --> 02:15:42,960
different circumstances we might
3556
02:15:42,960 --> 02:15:45,040
attach a squared to that so we're trying
3557
02:15:45,040 --> 02:15:48,320
to decrease the sum of the squared
3558
02:15:48,320 --> 02:15:51,320
residuals
3559
02:15:56,960 --> 02:15:59,119
and what that does is it just
3560
02:15:59,119 --> 02:16:01,760
you know it adds a higher penalty for
3561
02:16:01,760 --> 02:16:04,480
how far off we are from you know points
3562
02:16:04,480 --> 02:16:06,639
that are further off so that is linear
3563
02:16:06,639 --> 02:16:08,560
regression we're trying to find
3564
02:16:08,560 --> 02:16:11,440
this equation some line of best fit
3565
02:16:11,440 --> 02:16:14,079
that will help us decrease
3566
02:16:14,079 --> 02:16:16,320
this measure of error
3567
02:16:16,320 --> 02:16:18,239
with respect to all the data points that
3568
02:16:18,239 --> 02:16:20,560
we have in our data set and try to come
3569
02:16:20,560 --> 02:16:22,000
up with the best prediction for all of
3570
02:16:22,000 --> 02:16:22,800
them
3571
02:16:22,800 --> 02:16:26,159
this is known as simple
3572
02:16:26,159 --> 02:16:28,079
linear
3573
02:16:28,079 --> 02:16:31,079
regression
3574
02:16:32,000 --> 02:16:34,558
basically that means you know our
3575
02:16:34,558 --> 02:16:37,200
equation looks something
3576
02:16:37,200 --> 02:16:39,280
like this
3577
02:16:39,280 --> 02:16:43,200
now there's also multiple
3578
02:16:44,959 --> 02:16:47,839
linear regression
3579
02:16:48,398 --> 02:16:50,000
which just means that hey if we have
3580
02:16:50,000 --> 02:16:50,959
more
3581
02:16:50,959 --> 02:16:53,840
than one value for x so like think of
3582
02:16:53,840 --> 02:16:55,359
our feature vectors we have multiple
3583
02:16:55,359 --> 02:16:58,240
values in our x vector
3584
02:16:58,240 --> 02:17:01,439
then our predictor might look something
3585
02:17:01,439 --> 02:17:02,638
more
3586
02:17:02,638 --> 02:17:05,358
like this
3587
02:17:06,879 --> 02:17:09,519
actually i'm just going to say etc plus
3588
02:17:09,519 --> 02:17:13,200
b n x n so now i'm coming up with some
3589
02:17:13,200 --> 02:17:14,398
coefficient
3590
02:17:14,398 --> 02:17:17,040
for all of the different
3591
02:17:17,040 --> 02:17:19,840
x values that i have in my vector
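A small sketch of fitting a multiple linear regression by minimizing the sum of squared residuals, using NumPy's least-squares solver — the data set and its "true" coefficients below are assumptions of the example, not anything from the video:

```python
import numpy as np

# Synthetic data under an assumed true model y = 2 + 3*x1 - 1*x2,
# standing in for a real multi-feature data set.
rng = np.random.default_rng(0)
X = rng.random((50, 2))
y = 2 + 3 * X[:, 0] - 1 * X[:, 1] + rng.normal(0.0, 0.01, 50)

# Prepend a column of ones so the intercept b0 is fit alongside
# b1 and b2; lstsq finds the coefficient vector that minimizes the
# sum of squared residuals.
A = np.column_stack([np.ones(len(X)), X])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
b0, b1, b2 = coeffs
```

With only a little noise added, the recovered `b0`, `b1`, `b2` land very close to the assumed 2, 3, and -1.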
3592
02:17:19,840 --> 02:17:21,439
you guys might have noticed that i have
3593
02:17:21,439 --> 02:17:23,200
some assumptions over here and you might
3594
02:17:23,200 --> 02:17:24,799
be asking okay kylie what in the world
3595
02:17:24,799 --> 02:17:26,959
do these assumptions mean so let's go
3596
02:17:26,959 --> 02:17:28,959
over them
3597
02:17:28,959 --> 02:17:33,039
the first one is linearity
3598
02:17:33,679 --> 02:17:35,599
and what that means is let's say i have
3599
02:17:35,599 --> 02:17:38,000
a data set
3600
02:17:38,000 --> 02:17:41,000
okay
3601
02:17:43,599 --> 02:17:45,760
linearity just means okay does my
3602
02:17:45,760 --> 02:17:49,280
data follow a linear pattern does
3603
02:17:49,280 --> 02:17:52,000
y increase as x increases or does y
3604
02:17:52,000 --> 02:17:56,879
decrease as x increases so if y
3605
02:17:56,879 --> 02:17:59,120
increases or decreases at a constant
3606
02:17:59,120 --> 02:18:01,359
rate as x increases
3607
02:18:01,359 --> 02:18:02,398
then you're probably looking at
3608
02:18:02,398 --> 02:18:04,558
something linear so what's an example of
3609
02:18:04,558 --> 02:18:07,040
a non-linear data set let's say i had
3610
02:18:07,040 --> 02:18:10,879
data that might look something like that
3611
02:18:10,879 --> 02:18:12,160
okay
3612
02:18:12,160 --> 02:18:15,040
so now just visually judging this you
3613
02:18:15,040 --> 02:18:16,879
might say okay seems like the line of
3614
02:18:16,879 --> 02:18:19,519
best fit might actually be some curve
3615
02:18:19,519 --> 02:18:21,599
like this
3616
02:18:21,599 --> 02:18:22,718
right
3617
02:18:22,718 --> 02:18:25,120
and in this case we don't satisfy that
3618
02:18:25,120 --> 02:18:27,439
linearity
3619
02:18:27,439 --> 02:18:29,599
assumption anymore
3620
02:18:29,599 --> 02:18:30,478
so
3621
02:18:30,478 --> 02:18:32,478
with linearity we basically just want
3622
02:18:32,478 --> 02:18:34,718
our data set to follow some sort of
3623
02:18:34,718 --> 02:18:36,240
linear
3624
02:18:36,240 --> 02:18:39,200
trajectory
3625
02:18:39,200 --> 02:18:41,439
and independence
3626
02:18:41,439 --> 02:18:44,240
our second assumption
3627
02:18:44,240 --> 02:18:46,240
just means
3628
02:18:46,240 --> 02:18:48,160
this point over here
3629
02:18:48,160 --> 02:18:50,080
it should have no influence on this
3630
02:18:50,080 --> 02:18:52,080
point over here or this point over here
3631
02:18:52,080 --> 02:18:54,318
or this point over here so in other
3632
02:18:54,318 --> 02:18:57,040
words all the points
3633
02:18:57,040 --> 02:19:01,120
all the samples in our data set
3634
02:19:01,120 --> 02:19:02,959
should be independent
3635
02:19:02,959 --> 02:19:04,959
okay they should not rely on one another
3636
02:19:04,959 --> 02:19:08,478
they should not affect one another
3637
02:19:14,318 --> 02:19:16,478
okay now
3638
02:19:16,478 --> 02:19:18,840
normality and
3639
02:19:18,840 --> 02:19:21,200
homoscedasticity those are concepts
3640
02:19:21,200 --> 02:19:24,959
which use this residual okay
3641
02:19:24,959 --> 02:19:28,959
so if i have a plot
3642
02:19:28,959 --> 02:19:31,120
that looks
3643
02:19:31,120 --> 02:19:32,398
something
3644
02:19:32,398 --> 02:19:35,398
like
3645
02:19:35,439 --> 02:19:37,920
this
3646
02:19:39,840 --> 02:19:42,160
and my line of best fit
3647
02:19:42,160 --> 02:19:43,519
is somewhere
3648
02:19:43,519 --> 02:19:44,558
here
3649
02:19:44,558 --> 02:19:47,120
maybe it's something like that
3650
02:19:47,120 --> 02:19:49,439
in order to look at these normality and
3651
02:19:49,439 --> 02:19:50,960
homoscedasticity
3652
02:19:50,960 --> 02:19:53,040
assumptions let's look at the residual
3653
02:19:53,040 --> 02:19:56,000
plot okay
3654
02:19:57,280 --> 02:19:59,760
and what that means is i'm going to keep
3655
02:19:59,760 --> 02:20:02,560
my same x axis
3656
02:20:02,560 --> 02:20:04,560
but instead of plotting now where they
3657
02:20:04,560 --> 02:20:07,439
are relative to this y i'm going to plot
3658
02:20:07,439 --> 02:20:11,120
these errors so now i'm going to plot y
3659
02:20:11,120 --> 02:20:14,240
minus y hat
3660
02:20:14,240 --> 02:20:15,439
like this
3661
02:20:15,439 --> 02:20:17,120
okay
3662
02:20:17,120 --> 02:20:18,800
and now you know this one is slightly
3663
02:20:18,800 --> 02:20:20,399
positive so it might be here this one
3664
02:20:20,399 --> 02:20:22,319
down here is negative it might be here
3665
02:20:22,319 --> 02:20:23,520
so our
3666
02:20:23,520 --> 02:20:25,760
residual plot
3667
02:20:25,760 --> 02:20:27,280
it's literally just a plot of how you
3668
02:20:27,280 --> 02:20:29,040
know the values are distributed around
3669
02:20:29,040 --> 02:20:31,040
our line of best fit
3670
02:20:31,040 --> 02:20:33,120
so it looks like
3671
02:20:33,120 --> 02:20:37,280
it might you know look something
3672
02:20:38,640 --> 02:20:39,920
like this
3673
02:20:39,920 --> 02:20:42,000
okay
3674
02:20:42,000 --> 02:20:45,120
so this might be our residual plot and
3675
02:20:45,120 --> 02:20:48,240
what normality means so our assumptions
3676
02:20:48,240 --> 02:20:51,200
are normality
3677
02:20:51,600 --> 02:20:53,920
and homo
3678
02:20:53,920 --> 02:20:55,439
scedasticity
3680
02:20:59,120 --> 02:21:00,479
i might have butchered that spelling i
3681
02:21:00,479 --> 02:21:01,840
don't really know
3682
02:21:01,840 --> 02:21:02,880
but
3683
02:21:02,880 --> 02:21:03,840
what
3684
02:21:03,840 --> 02:21:06,319
normality is saying is that these
3685
02:21:06,319 --> 02:21:09,760
residuals should be normally distributed
3686
02:21:09,760 --> 02:21:11,200
okay
3687
02:21:11,200 --> 02:21:13,280
around this line of best fit it should
3688
02:21:13,280 --> 02:21:16,640
follow a normal distribution
3689
02:21:16,640 --> 02:21:17,960
and now what
3690
02:21:17,960 --> 02:21:21,520
homoscedasticity says okay our variance
3691
02:21:21,520 --> 02:21:22,960
of these points
3692
02:21:22,960 --> 02:21:24,720
should remain constant
3693
02:21:24,720 --> 02:21:27,600
throughout so this spread here should be
3694
02:21:27,600 --> 02:21:29,200
approximately the same as this spread
3695
02:21:29,200 --> 02:21:30,960
over here
3696
02:21:30,960 --> 02:21:32,640
now what's an example of where you know
3697
02:21:32,640 --> 02:21:33,520
homo
3698
02:21:33,520 --> 02:21:36,560
scedasticity does not hold
3699
02:21:36,560 --> 02:21:38,800
well let's say that our
3700
02:21:38,800 --> 02:21:41,200
original plot actually looks something
3701
02:21:41,200 --> 02:21:43,520
like
3702
02:21:43,520 --> 02:21:45,840
this
3703
02:21:46,399 --> 02:21:48,240
okay so now if we looked at the
3704
02:21:48,240 --> 02:21:50,240
residuals for that
3705
02:21:50,240 --> 02:21:54,040
it might look something
3706
02:21:54,479 --> 02:21:55,840
like that
3707
02:21:55,840 --> 02:21:58,720
and now if we look at this spread of
3708
02:21:58,720 --> 02:22:00,399
the points
3709
02:22:00,399 --> 02:22:03,680
it decreases right so now the spread is
3710
02:22:03,680 --> 02:22:04,960
not constant which means that
3711
02:22:04,960 --> 02:22:06,840
homoscedasticity
3712
02:22:06,840 --> 02:22:08,720
this
3713
02:22:08,720 --> 02:22:09,840
assumption
3714
02:22:09,840 --> 02:22:11,760
would not be
3715
02:22:11,760 --> 02:22:12,880
fulfilled and it might not be
3716
02:22:12,880 --> 02:22:16,080
appropriate to use linear regression
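The normality and homoscedasticity checks described above are usually done by eyeballing a residual plot. A minimal sketch on synthetic data (my own example, not the course's code) would be:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=(100, 1))
y = 3 * x[:, 0] + rng.normal(0, 1, size=100)

model = LinearRegression().fit(x, y)
residuals = y - model.predict(x)  # y - y_hat for every point

# residual plot: same x axis, but plot the errors instead of y
plt.scatter(x[:, 0], residuals)
plt.axhline(0)
plt.xlabel("x")
plt.ylabel("y - y_hat")
plt.savefig("residuals.png")

# roughly: residuals centered on 0 (normality around the line),
# with a spread that stays constant across x (homoscedasticity)
print(residuals.mean(), residuals.std())
```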
3717
02:22:16,080 --> 02:22:17,680
so that's just linear regression
3718
02:22:17,680 --> 02:22:19,920
basically we have a bunch of data points
3719
02:22:19,920 --> 02:22:22,479
we want to predict some y value
3720
02:22:22,479 --> 02:22:24,399
for those
3721
02:22:24,399 --> 02:22:26,240
and we're trying to come up with this
3722
02:22:26,240 --> 02:22:29,280
line of best fit that best describes hey
3723
02:22:29,280 --> 02:22:31,520
given some value x
3724
02:22:31,520 --> 02:22:35,680
what would be my best guess of what y is
3725
02:22:35,680 --> 02:22:38,960
so let's move on to how do we evaluate a
3726
02:22:38,960 --> 02:22:42,160
linear regression model
3727
02:22:42,240 --> 02:22:44,640
so the first
3728
02:22:44,640 --> 02:22:46,319
measure that i'm going to talk about is
3729
02:22:46,319 --> 02:22:51,520
known as mean absolute error or m-a-e
3730
02:22:52,000 --> 02:22:53,120
for short
3731
02:22:53,120 --> 02:22:53,920
okay
3732
02:22:53,920 --> 02:22:56,319
and mean absolute error
3733
02:22:56,319 --> 02:22:58,960
is basically saying all right let's take
3734
02:22:58,960 --> 02:23:01,280
all the errors so all these residuals
3735
02:23:01,280 --> 02:23:03,439
that we talked about
3736
02:23:03,439 --> 02:23:05,120
let's sum up
3737
02:23:05,120 --> 02:23:06,960
the distance for all of them and then
3738
02:23:06,960 --> 02:23:09,040
take the average and then that can
3739
02:23:09,040 --> 02:23:12,319
describe you know how far off are we
3740
02:23:12,319 --> 02:23:13,359
so the
3741
02:23:13,359 --> 02:23:16,000
mathematical formula for that would be
3742
02:23:16,000 --> 02:23:20,240
okay let's take all the residuals
3743
02:23:21,600 --> 02:23:24,160
all right so this is the distance
3744
02:23:24,160 --> 02:23:27,120
actually let me redraw a plot down here
3745
02:23:27,120 --> 02:23:28,399
so
3746
02:23:28,399 --> 02:23:32,960
suppose i have a data set that looks like this
3747
02:23:32,960 --> 02:23:35,120
and
3748
02:23:35,600 --> 02:23:38,240
here are all of my
3749
02:23:38,240 --> 02:23:41,359
data points right
3750
02:23:41,359 --> 02:23:43,040
and now let's say my line looks
3751
02:23:43,040 --> 02:23:45,120
something like
3752
02:23:45,120 --> 02:23:47,840
that
3753
02:23:47,840 --> 02:23:48,800
so
3754
02:23:48,800 --> 02:23:51,200
my mean absolute error would be summing
3755
02:23:51,200 --> 02:23:52,080
up
3756
02:23:52,080 --> 02:23:55,439
all of these values
3757
02:23:55,920 --> 02:23:58,399
this was a mistake
3758
02:23:58,399 --> 02:24:00,479
so summing up all of these
3759
02:24:00,479 --> 02:24:02,000
and then dividing by how many data
3760
02:24:02,000 --> 02:24:03,680
points i have
3761
02:24:03,680 --> 02:24:05,760
so what would be all the residuals it
3762
02:24:05,760 --> 02:24:09,520
would be y i right so every single point
3763
02:24:09,520 --> 02:24:12,880
minus y hat i so the prediction for that
3764
02:24:12,880 --> 02:24:14,800
on here
3765
02:24:14,800 --> 02:24:16,880
and then we're going to sum over all of
3766
02:24:16,880 --> 02:24:19,200
the different i's in our data set
3767
02:24:19,200 --> 02:24:20,880
right so
3768
02:24:20,880 --> 02:24:21,760
i
3769
02:24:21,760 --> 02:24:24,000
and then we divide by the number of
3770
02:24:24,000 --> 02:24:25,280
points we have so actually i'm going to
3771
02:24:25,280 --> 02:24:27,680
rewrite this to make it a little clearer
3772
02:24:27,680 --> 02:24:30,000
so i is equal to whatever the first data
3773
02:24:30,000 --> 02:24:31,600
point is all the way through the nth
3774
02:24:31,600 --> 02:24:32,720
data point
3775
02:24:32,720 --> 02:24:35,040
and then we divide it by n which is how
3776
02:24:35,040 --> 02:24:36,640
many points there are
3777
02:24:36,640 --> 02:24:41,439
okay so this is our measure of m a e
3778
02:24:41,439 --> 02:24:43,840
and this is basically telling us okay
3779
02:24:43,840 --> 02:24:45,439
on average
3780
02:24:45,439 --> 02:24:47,200
this is the distance
3781
02:24:47,200 --> 02:24:48,399
between
3782
02:24:48,399 --> 02:24:51,680
our predicted value and the actual value
3783
02:24:51,680 --> 02:24:54,479
in our training set
3784
02:24:54,479 --> 02:24:56,160
okay
3785
02:24:56,160 --> 02:25:00,640
and mae is good because it allows us to
3786
02:25:00,640 --> 02:25:03,600
you know when we get this value here we
3787
02:25:03,600 --> 02:25:05,600
can literally directly compare it to
3788
02:25:05,600 --> 02:25:06,640
whatever
3789
02:25:06,640 --> 02:25:09,920
units the y value is in so let's say y
3790
02:25:09,920 --> 02:25:11,040
is
3791
02:25:11,040 --> 02:25:14,720
we're talking you know the prediction of
3792
02:25:14,720 --> 02:25:16,880
the price of a house
3793
02:25:16,880 --> 02:25:17,680
right
3794
02:25:17,680 --> 02:25:19,280
in dollars
3795
02:25:19,280 --> 02:25:21,439
once we have once we calculate the mae
3796
02:25:21,439 --> 02:25:24,080
we can literally say oh the average you
3797
02:25:24,080 --> 02:25:25,680
know price
3798
02:25:25,680 --> 02:25:27,040
the average
3799
02:25:27,040 --> 02:25:29,840
um how much we're off by
3800
02:25:29,840 --> 02:25:32,560
is literally this many dollars
3801
02:25:32,560 --> 02:25:33,760
okay
3802
02:25:33,760 --> 02:25:35,920
so that's the mean absolute error
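The MAE formula described above, sum of |y_i - y_hat_i| over all n points divided by n, is a few lines of code. The house-price numbers here are made up for illustration:

```python
import numpy as np

def mae(y, y_hat):
    """Mean absolute error: average of |y_i - y_hat_i| over all n points."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.abs(y - y_hat).mean()

# e.g. house prices in dollars: the MAE is also in dollars,
# so it's directly interpretable
actual = [200_000, 310_000, 150_000]
predicted = [195_000, 320_000, 160_000]
print(mae(actual, predicted))  # (5000 + 10000 + 10000) / 3
```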
3803
02:25:35,920 --> 02:25:37,520
an evaluation technique that's also
3804
02:25:37,520 --> 02:25:39,200
closely related to that
3805
02:25:39,200 --> 02:25:41,760
is called the mean squared error
3806
02:25:41,760 --> 02:25:44,160
and this is mse
3807
02:25:44,160 --> 02:25:45,680
for short
3808
02:25:45,680 --> 02:25:47,040
okay
3809
02:25:47,040 --> 02:25:47,840
now
3810
02:25:47,840 --> 02:25:51,600
if i take this plot again
3811
02:25:51,680 --> 02:25:55,840
and i duplicate it and move it down here
3812
02:25:55,840 --> 02:25:58,240
well the gist of mean squared error is
3813
02:25:58,240 --> 02:25:59,439
kind of the same but instead of the
3814
02:25:59,439 --> 02:26:01,680
absolute value we're going to square
3815
02:26:01,680 --> 02:26:04,319
so now the mse
3816
02:26:04,319 --> 02:26:06,479
is something along the lines of okay
3817
02:26:06,479 --> 02:26:08,479
let's sum up
3818
02:26:08,479 --> 02:26:10,800
something right so we're going to sum up
3819
02:26:10,800 --> 02:26:13,200
all of our errors
3820
02:26:13,200 --> 02:26:17,120
so now i'm going to do y i minus y hat i
3821
02:26:17,120 --> 02:26:19,200
but instead of absolute valuing them i'm
3822
02:26:19,200 --> 02:26:21,280
going to square them all and then i'm
3823
02:26:21,280 --> 02:26:24,000
going to divide by n in order to find
3824
02:26:24,000 --> 02:26:25,040
the mean
3825
02:26:25,040 --> 02:26:27,840
so basically now i'm taking
3826
02:26:27,840 --> 02:26:29,520
all of these
3827
02:26:29,520 --> 02:26:31,760
different values and i'm squaring them
3828
02:26:31,760 --> 02:26:35,840
first before i add them to one another
3829
02:26:36,080 --> 02:26:39,439
and then i divide by n
3830
02:26:39,439 --> 02:26:41,040
and the reason why we like using mean
3831
02:26:41,040 --> 02:26:43,120
squared error is that
3832
02:26:43,120 --> 02:26:46,000
it helps us punish large errors in the
3833
02:26:46,000 --> 02:26:47,200
prediction
3834
02:26:47,200 --> 02:26:49,600
and later on mse might be important
3835
02:26:49,600 --> 02:26:53,200
because of differentiability right so a
3836
02:26:53,200 --> 02:26:55,600
quadratic equation is differentiable you
3837
02:26:55,600 --> 02:26:57,280
know if you're familiar with calculus a
3838
02:26:57,280 --> 02:26:58,640
quadratic equation is
3839
02:26:58,640 --> 02:27:00,640
differentiable whereas the absolute
3840
02:27:00,640 --> 02:27:02,000
value function is not totally
3841
02:27:02,000 --> 02:27:03,840
differentiable everywhere
3842
02:27:03,840 --> 02:27:05,359
but if you don't understand that don't
3843
02:27:05,359 --> 02:27:07,600
worry about it you won't really need it
3844
02:27:07,600 --> 02:27:08,960
right now
3845
02:27:08,960 --> 02:27:10,479
and now one downside of mean squared
3846
02:27:10,479 --> 02:27:12,640
error is that once i calculate the mean
3847
02:27:12,640 --> 02:27:14,560
squared error over here
3848
02:27:14,560 --> 02:27:16,399
and i go back over to y and i want to
3849
02:27:16,399 --> 02:27:19,200
compare the values
3850
02:27:19,200 --> 02:27:21,520
well it gets a little bit trickier to do
3851
02:27:21,520 --> 02:27:23,760
that because
3852
02:27:23,760 --> 02:27:27,200
now my mean squared error is in terms of
3853
02:27:27,200 --> 02:27:29,359
y squared right it's
3854
02:27:29,359 --> 02:27:32,240
this is now squared so instead of just
3855
02:27:32,240 --> 02:27:34,160
dollars how you know how many dollars
3856
02:27:34,160 --> 02:27:36,240
off am i i'm talking how many dollars
3857
02:27:36,240 --> 02:27:37,359
squared
3858
02:27:37,359 --> 02:27:38,240
off
3859
02:27:38,240 --> 02:27:39,280
am i
3860
02:27:39,280 --> 02:27:41,680
and that you know to humans it doesn't
3861
02:27:41,680 --> 02:27:44,160
really make that much sense which is why
3862
02:27:44,160 --> 02:27:46,240
we have created something known as the
3863
02:27:46,240 --> 02:27:49,280
root mean square error
3864
02:27:49,280 --> 02:27:50,479
and
3865
02:27:50,479 --> 02:27:52,240
i'm just going to copy
3866
02:27:52,240 --> 02:27:54,399
this diagram over here because it's very
3867
02:27:54,399 --> 02:27:55,920
very similar
3868
02:27:55,920 --> 02:27:58,479
to mean squared error
3869
02:27:58,479 --> 02:28:00,080
except
3870
02:28:00,080 --> 02:28:03,200
now we take a big square root
3871
02:28:03,200 --> 02:28:05,280
okay so this is rmse and we take the
3872
02:28:05,280 --> 02:28:06,720
square root
3873
02:28:06,720 --> 02:28:08,240
of that
3874
02:28:08,240 --> 02:28:10,800
mean squared error and so now the term
3875
02:28:10,800 --> 02:28:13,120
in which you know we're defining
3876
02:28:13,120 --> 02:28:14,479
our error
3877
02:28:14,479 --> 02:28:16,880
is now in terms of that dollar sign
3878
02:28:16,880 --> 02:28:19,120
symbol again so that's a pro of root mean
3879
02:28:19,120 --> 02:28:21,520
squared error is that now we can say
3880
02:28:21,520 --> 02:28:24,640
okay our error according to this metric
3881
02:28:24,640 --> 02:28:26,720
is this many dollars off from our
3882
02:28:26,720 --> 02:28:28,319
predictor
3883
02:28:28,319 --> 02:28:30,560
okay so it's in the same unit which is
3884
02:28:30,560 --> 02:28:32,640
one of the pros of root mean squared
3885
02:28:32,640 --> 02:28:34,880
error
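MSE and RMSE as described above, square the residuals instead of absolute-valuing them, then for RMSE take the square root to get back to y's units. A small sketch with made-up numbers:

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: average of (y_i - y_hat_i)^2 — punishes large errors."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return ((y - y_hat) ** 2).mean()

def rmse(y, y_hat):
    """Root mean squared error: sqrt(MSE), back in the same units as y."""
    return np.sqrt(mse(y, y_hat))

actual = [3.0, 5.0, 2.5]
predicted = [2.5, 5.0, 4.0]
print(mse(actual, predicted))   # (0.25 + 0 + 2.25) / 3
print(rmse(actual, predicted))
```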
3886
02:28:34,880 --> 02:28:37,600
and now finally there is the coefficient
3887
02:28:37,600 --> 02:28:40,800
of determination or r squared
3888
02:28:40,800 --> 02:28:42,399
and this is the formula for r squared so
3889
02:28:42,399 --> 02:28:45,520
r squared is equal to 1 minus rss
3890
02:28:45,520 --> 02:28:46,479
over
3891
02:28:46,479 --> 02:28:48,240
tss
3892
02:28:48,240 --> 02:28:51,200
okay so what does that mean
3893
02:28:51,200 --> 02:28:53,600
basically rss
3894
02:28:53,600 --> 02:28:56,560
stands for the sum
3895
02:28:56,560 --> 02:28:59,920
of the squared
3896
02:28:59,920 --> 02:29:01,840
residuals
3897
02:29:01,840 --> 02:29:05,200
so maybe it should be ssr instead but
3898
02:29:05,200 --> 02:29:07,920
rss sum of the squared residuals and
3899
02:29:07,920 --> 02:29:09,359
this
3900
02:29:09,359 --> 02:29:11,359
is equal
3901
02:29:11,359 --> 02:29:12,560
to
3902
02:29:12,560 --> 02:29:15,760
if i take the sum of all the values
3903
02:29:15,760 --> 02:29:19,920
and i take y i minus y hat
3904
02:29:19,920 --> 02:29:20,880
i
3905
02:29:20,880 --> 02:29:22,720
and square that
3906
02:29:22,720 --> 02:29:25,359
that is my rss right it's the sum of the
3907
02:29:25,359 --> 02:29:28,000
squared residuals
3908
02:29:28,000 --> 02:29:30,479
now tss let me actually use a different
3909
02:29:30,479 --> 02:29:33,120
color for that
3910
02:29:33,120 --> 02:29:38,000
so tss is the total
3911
02:29:38,000 --> 02:29:39,920
sum
3912
02:29:39,920 --> 02:29:42,800
of squares
3913
02:29:43,600 --> 02:29:46,000
and what that means is that instead of
3914
02:29:46,000 --> 02:29:48,319
being with respect to
3915
02:29:48,319 --> 02:29:51,560
this prediction
3916
02:29:51,600 --> 02:29:53,600
we are instead
3917
02:29:53,600 --> 02:29:56,319
going to
3918
02:29:58,160 --> 02:30:01,920
take each y value and just subtract
3919
02:30:01,920 --> 02:30:04,640
the mean of all the y values
3920
02:30:04,640 --> 02:30:06,880
and square that
3921
02:30:06,880 --> 02:30:11,960
okay so if i drew this out
3922
02:30:19,520 --> 02:30:21,300
and if this were my
3924
02:30:22,720 --> 02:30:25,040
actually let's use a different color
3925
02:30:25,040 --> 02:30:27,920
let's use green
3926
02:30:28,240 --> 02:30:31,439
if this were my predictor
3927
02:30:31,439 --> 02:30:35,120
so rss is giving me this measure
3928
02:30:35,120 --> 02:30:36,080
here
3929
02:30:36,080 --> 02:30:38,080
right it's giving me some estimate of
3930
02:30:38,080 --> 02:30:40,800
how far off we are from our regressor
3931
02:30:40,800 --> 02:30:42,240
that we predicted
3932
02:30:42,240 --> 02:30:44,640
actually
3933
02:30:45,439 --> 02:30:48,399
i'm going to use red for that
3934
02:30:48,399 --> 02:30:50,080
well
3935
02:30:50,080 --> 02:30:52,800
tss on the other hand is saying okay how
3936
02:30:52,800 --> 02:30:55,920
far off are these values from the mean
3937
02:30:55,920 --> 02:30:57,840
so if we literally didn't do any
3938
02:30:57,840 --> 02:30:59,760
calculations for the line of best fit if
3939
02:30:59,760 --> 02:31:00,399
we
3940
02:31:00,399 --> 02:31:02,880
just took all the y values and averaged
3941
02:31:02,880 --> 02:31:03,920
all of them
3942
02:31:03,920 --> 02:31:05,920
and said hey this is the average value
3943
02:31:05,920 --> 02:31:08,160
for every single x value
3944
02:31:08,160 --> 02:31:09,680
i'm just going to predict that average
3945
02:31:09,680 --> 02:31:11,760
value instead
3946
02:31:11,760 --> 02:31:13,680
then it's asking okay how far off are
3947
02:31:13,680 --> 02:31:17,920
all these points from that line
3948
02:31:19,040 --> 02:31:21,600
okay and remember that this square means
3949
02:31:21,600 --> 02:31:23,200
that we're punishing
3950
02:31:23,200 --> 02:31:24,800
larger errors
3951
02:31:24,800 --> 02:31:26,880
right so even if they look somewhat
3952
02:31:26,880 --> 02:31:28,479
close in terms of distance
3953
02:31:28,479 --> 02:31:31,680
the further a few data points are
3954
02:31:31,680 --> 02:31:34,240
the larger our total
3955
02:31:34,240 --> 02:31:36,800
sum of squares is going to be
3956
02:31:36,800 --> 02:31:38,479
sorry that was my dog
3957
02:31:38,479 --> 02:31:40,479
so the total sum of squares is taking
3958
02:31:40,479 --> 02:31:42,560
all of these values and saying okay what
3959
02:31:42,560 --> 02:31:45,040
is the sum of squares if i didn't do any
3960
02:31:45,040 --> 02:31:46,479
regressor and i literally just
3961
02:31:46,479 --> 02:31:48,800
calculated the average
3962
02:31:48,800 --> 02:31:51,200
of all the y values in my data set and
3963
02:31:51,200 --> 02:31:52,479
for every single x value i'm just going
3964
02:31:52,479 --> 02:31:54,240
to predict that average
3965
02:31:54,240 --> 02:31:56,319
which means that okay like that means
3966
02:31:56,319 --> 02:31:58,800
that maybe y and x aren't associated
3967
02:31:58,800 --> 02:32:01,120
with each other at all like the best
3968
02:32:01,120 --> 02:32:03,280
thing that i can do for any new x value
3969
02:32:03,280 --> 02:32:04,800
just predict hey this is the average of
3970
02:32:04,800 --> 02:32:06,080
my data set
3971
02:32:06,080 --> 02:32:08,880
and this total sum of squares is saying
3972
02:32:08,880 --> 02:32:12,160
okay well with respect to that average
3973
02:32:12,160 --> 02:32:14,960
what is our error
3974
02:32:14,960 --> 02:32:17,280
right so up here the sum of the squared
3975
02:32:17,280 --> 02:32:18,640
residuals
3976
02:32:18,640 --> 02:32:20,960
this is telling us what
3977
02:32:20,960 --> 02:32:23,280
is our error with respect to
3978
02:32:23,280 --> 02:32:26,160
this line of best fit while our total
3979
02:32:26,160 --> 02:32:27,359
sum of square is saying what is the
3980
02:32:27,359 --> 02:32:29,280
error with respect to you know just the
3981
02:32:29,280 --> 02:32:31,840
average y value
3982
02:32:31,840 --> 02:32:33,280
and
3983
02:32:33,280 --> 02:32:36,640
if our line of best fit is a better fit
3984
02:32:36,640 --> 02:32:37,600
then
3985
02:32:37,600 --> 02:32:40,479
this total sum of squares
3986
02:32:40,479 --> 02:32:43,840
that means that you know this
3987
02:32:43,840 --> 02:32:46,000
numerator
3988
02:32:46,000 --> 02:32:48,319
that means that this numerator is going
3989
02:32:48,319 --> 02:32:51,200
to be smaller than this denominator
3990
02:32:51,200 --> 02:32:55,359
right and if our errors in our
3991
02:32:55,359 --> 02:32:57,920
line of best fit are much smaller
3992
02:32:57,920 --> 02:32:59,760
then that means that this ratio of the
3993
02:32:59,760 --> 02:33:03,359
rss over tss is going to be very small
3994
02:33:03,359 --> 02:33:06,160
which means that r squared is going to
3995
02:33:06,160 --> 02:33:08,800
go towards one
3996
02:33:08,800 --> 02:33:11,439
and now when r squared is towards one
3997
02:33:11,439 --> 02:33:13,439
that means that that's usually a sign
3998
02:33:13,439 --> 02:33:15,359
that we have a good
3999
02:33:15,359 --> 02:33:18,000
predictor
4000
02:33:19,600 --> 02:33:22,399
it's one of the signs not the only one
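The R² = 1 - RSS/TSS formula above translates directly to code; sklearn's `r2_score` computes the same thing. The numbers here are made up for illustration:

```python
import numpy as np
from sklearn.metrics import r2_score

def r_squared(y, y_hat):
    """R^2 = 1 - RSS/TSS, where TSS is the error of just predicting the mean."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    rss = ((y - y_hat) ** 2).sum()     # sum of squared residuals
    tss = ((y - y.mean()) ** 2).sum()  # total sum of squares
    return 1 - rss / tss

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.9])
print(r_squared(y, y_hat))
print(r2_score(y, y_hat))  # sklearn agrees
```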
4001
02:33:22,399 --> 02:33:24,880
so over here i also have you know that
4002
02:33:24,880 --> 02:33:26,800
there's this adjusted r squared and what
4003
02:33:26,800 --> 02:33:28,640
that does it just adjusts for the number
4004
02:33:28,640 --> 02:33:33,280
of terms so x1 x2 x3 etc it adjusts for
4005
02:33:33,280 --> 02:33:35,040
how many extra terms we add because
4006
02:33:35,040 --> 02:33:36,640
usually when we
4007
02:33:36,640 --> 02:33:38,319
um you know
4008
02:33:38,319 --> 02:33:40,240
add an extra term the r squared value
4009
02:33:40,240 --> 02:33:41,840
will increase because that'll help us
4010
02:33:41,840 --> 02:33:44,720
predict y some more
4011
02:33:44,720 --> 02:33:47,200
but the value for the adjusted r squared
4012
02:33:47,200 --> 02:33:48,800
increases if the new term actually
4013
02:33:48,800 --> 02:33:50,479
improves this model fit more than
4014
02:33:50,479 --> 02:33:53,280
expected you know by chance so that's
4015
02:33:53,280 --> 02:33:55,280
what adjusted r squared is i'm not you
4016
02:33:55,280 --> 02:33:56,880
know it's out of the scope of this one
4017
02:33:56,880 --> 02:33:59,520
specific course and now that's linear
4018
02:33:59,520 --> 02:34:01,520
regression basically
4019
02:34:01,520 --> 02:34:04,000
i've covered the concept of residuals or
4020
02:34:04,000 --> 02:34:05,200
errors
4021
02:34:05,200 --> 02:34:06,720
and
4022
02:34:06,720 --> 02:34:08,399
you know how do we use that in order to
4023
02:34:08,399 --> 02:34:10,479
find the line of best fit
4024
02:34:10,479 --> 02:34:11,920
and you know our computer can do all the
4025
02:34:11,920 --> 02:34:14,240
calculations for us which is nice but
4026
02:34:14,240 --> 02:34:15,600
behind the scenes it's trying to
4027
02:34:15,600 --> 02:34:18,080
minimize that error right
4028
02:34:18,080 --> 02:34:19,600
and then we've gone through all the
4029
02:34:19,600 --> 02:34:22,240
different ways of actually evaluating a
4030
02:34:22,240 --> 02:34:24,479
linear regression model and the pros and
4031
02:34:24,479 --> 02:34:26,479
cons of each one
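On the "behind the scenes it's trying to minimize that error" point: for ordinary least squares the minimizing coefficients even have a closed form, the normal equation b = (XᵀX)⁻¹Xᵀy. This is a sketch of that idea on made-up data (an illustration of the math, not necessarily how sklearn solves it internally):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 5, size=(40, 1))
y = 4 * x[:, 0] - 2 + rng.normal(0, 0.3, size=40)

# add a column of ones so the intercept b0 is just another coefficient
X = np.hstack([np.ones((40, 1)), x])

# normal equation: solve (X^T X) b = X^T y for the coefficients
# that minimize the sum of squared residuals
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b)  # [b0, b1], close to [-2, 4]
```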
4032
02:34:26,479 --> 02:34:28,640
so now let's look at an example so we're
4033
02:34:28,640 --> 02:34:30,960
still on supervised learning but now
4034
02:34:30,960 --> 02:34:32,640
we're just going to talk about
4035
02:34:32,640 --> 02:34:34,240
regression so what happens when you
4036
02:34:34,240 --> 02:34:35,840
don't just want to predict you know type
4037
02:34:35,840 --> 02:34:37,760
one two three what happens if you
4038
02:34:37,760 --> 02:34:40,880
actually wanna predict a certain value
4039
02:34:40,880 --> 02:34:44,080
so again i'm on the uci machine learning
4040
02:34:44,080 --> 02:34:45,600
repository
4041
02:34:45,600 --> 02:34:46,960
and
4042
02:34:46,960 --> 02:34:50,399
here i found this data set about
4043
02:34:50,399 --> 02:34:53,439
bike sharing in seoul
4044
02:34:53,439 --> 02:34:54,960
south korea
4045
02:34:54,960 --> 02:34:58,479
so this data set is predicting rental
4046
02:34:58,479 --> 02:35:00,479
bike count and here it's the count of
4047
02:35:00,479 --> 02:35:03,520
bikes rented at each hour
4048
02:35:03,520 --> 02:35:05,920
so what we're going to do again you're
4049
02:35:05,920 --> 02:35:07,600
going to go into the data folder and
4050
02:35:07,600 --> 02:35:09,040
you're going to
4051
02:35:09,040 --> 02:35:13,720
download this csv file
4052
02:35:14,720 --> 02:35:16,880
and we're going to move over to colab
4053
02:35:16,880 --> 02:35:18,479
again
4054
02:35:18,479 --> 02:35:21,520
and here i'm going to name this fcc
4055
02:35:21,520 --> 02:35:22,880
bikes
4056
02:35:22,880 --> 02:35:26,479
and regression
4057
02:35:26,479 --> 02:35:27,760
i don't remember what i called the last
4058
02:35:27,760 --> 02:35:31,600
one but yeah fcc bikes regression
4059
02:35:31,600 --> 02:35:34,560
now i'm going to import a bunch of the
4060
02:35:34,560 --> 02:35:36,880
same things that i did earlier
4061
02:35:36,880 --> 02:35:38,720
um
4062
02:35:38,720 --> 02:35:41,120
and you know i'm gonna also continue to
4063
02:35:41,120 --> 02:35:43,040
import the oversampler
4064
02:35:43,040 --> 02:35:44,800
and the standard scaler
4065
02:35:44,800 --> 02:35:48,479
and then i'm actually also just going to
4066
02:35:48,479 --> 02:35:49,840
let you guys know that i have a few more
4067
02:35:49,840 --> 02:35:51,760
things i wanted to import
4068
02:35:51,760 --> 02:35:54,000
so this is a library that lets us copy
4069
02:35:54,000 --> 02:35:56,319
things uh seaborn is a wrapper over
4070
02:35:56,319 --> 02:35:58,319
matplotlib so
4071
02:35:58,319 --> 02:36:00,160
it also allows us to plot certain things
4072
02:36:00,160 --> 02:36:01,680
and then just letting you know that
4073
02:36:01,680 --> 02:36:04,720
we're also going to be using tensorflow
4074
02:36:04,720 --> 02:36:06,240
okay so one more thing that we're also
4075
02:36:06,240 --> 02:36:07,840
going to be using we're going to use the
4076
02:36:07,840 --> 02:36:10,640
sklearn linear model library and
4077
02:36:10,640 --> 02:36:12,080
actually let me make my screen a little
4078
02:36:12,080 --> 02:36:13,280
bit bigger
4079
02:36:13,280 --> 02:36:15,520
so yeah
4080
02:36:15,520 --> 02:36:17,840
awesome
4081
02:36:17,840 --> 02:36:20,080
run this and
4082
02:36:20,080 --> 02:36:21,600
that'll import all the things that we
4083
02:36:21,600 --> 02:36:22,399
need
4084
02:36:22,399 --> 02:36:25,840
so again i'm just going to you know give
4085
02:36:25,840 --> 02:36:27,520
some credit to where we got this data
4086
02:36:27,520 --> 02:36:28,560
set
4087
02:36:28,560 --> 02:36:32,000
so let me copy and paste um
4088
02:36:32,000 --> 02:36:34,560
this uci
4089
02:36:34,560 --> 02:36:37,560
thing
4090
02:36:38,000 --> 02:36:41,840
and i will also give credit to this
4091
02:36:41,840 --> 02:36:44,080
here
4092
02:36:46,479 --> 02:36:48,800
okay
4093
02:36:48,840 --> 02:36:50,880
cool all right cool
4094
02:36:50,880 --> 02:36:53,200
so this is our data set and again it
4095
02:36:53,200 --> 02:36:54,800
tells us all the different attributes
4096
02:36:54,800 --> 02:36:57,359
that we have right here so i'm actually
4097
02:36:57,359 --> 02:36:59,520
going to go ahead
4098
02:36:59,520 --> 02:37:03,359
and paste this in here
4099
02:37:03,359 --> 02:37:05,200
um
4100
02:37:05,200 --> 02:37:07,040
feel free to copy and paste this if you
4101
02:37:07,040 --> 02:37:08,399
want me to read it out loud so you can
4102
02:37:08,399 --> 02:37:10,720
type it it's bike count
4103
02:37:10,720 --> 02:37:11,920
hour
4104
02:37:11,920 --> 02:37:12,800
temp
4105
02:37:12,800 --> 02:37:14,000
humidity
4106
02:37:14,000 --> 02:37:17,200
wind visibility dew point temp
4107
02:37:17,200 --> 02:37:20,240
radiation rain snow
4108
02:37:20,240 --> 02:37:21,359
and
4109
02:37:21,359 --> 02:37:24,399
functional whatever that means
4110
02:37:24,399 --> 02:37:26,319
okay so i'm going to come over here and
4111
02:37:26,319 --> 02:37:30,319
import my data by dragging and dropping
4112
02:37:30,319 --> 02:37:32,319
all right
4113
02:37:32,319 --> 02:37:33,840
now one thing that you guys might
4114
02:37:33,840 --> 02:37:35,040
actually need to do is you might
4115
02:37:35,040 --> 02:37:37,200
actually have to open up the csv because
4116
02:37:37,200 --> 02:37:38,240
there were
4117
02:37:38,240 --> 02:37:41,280
at first a few um like forbidden
4118
02:37:41,280 --> 02:37:43,120
characters in mine at least
4119
02:37:43,120 --> 02:37:45,040
so you might have to get rid of like i
4120
02:37:45,040 --> 02:37:46,640
think there was a degree here but my
4121
02:37:46,640 --> 02:37:48,240
computer wasn't recognizing it so i got
4122
02:37:48,240 --> 02:37:50,160
rid of that so you might have to go
4123
02:37:50,160 --> 02:37:52,479
through and get rid of some of those
4124
02:37:52,479 --> 02:37:55,680
labels that are incorrect
4125
02:37:55,680 --> 02:37:56,960
i'm gonna
4126
02:37:56,960 --> 02:37:58,399
do this okay
4127
02:37:58,399 --> 02:37:59,520
but
4128
02:37:59,520 --> 02:38:01,359
after we've done that we've imported in
4129
02:38:01,359 --> 02:38:04,240
here i'm going to
4130
02:38:04,240 --> 02:38:06,960
create a data a data frame from that so
4131
02:38:06,960 --> 02:38:09,520
all right so now what i can do is i can
4132
02:38:09,520 --> 02:38:11,520
read that csv file and i can get the
4133
02:38:11,520 --> 02:38:13,280
data into here
4134
02:38:13,280 --> 02:38:17,399
so SeoulBikeData.csv
4135
02:38:17,439 --> 02:38:20,160
okay so now if i call data.head
4136
02:38:20,160 --> 02:38:21,680
you'll see that i have all the various
4137
02:38:21,680 --> 02:38:24,960
labels right and then i have the data in
4138
02:38:24,960 --> 02:38:26,840
there
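The step described here — reading the CSV into a pandas DataFrame and peeking at it with `.head()` — can be sketched like this. The two inline rows are made up; in the video the file is the UCI "Seoul Bike Sharing Demand" download (`SeoulBikeData.csv`):

```python
import io
import pandas as pd

# A couple of synthetic rows standing in for SeoulBikeData.csv
# (column names follow the UCI dataset; the values are invented)
csv_text = """Rented Bike Count,Hour,Temperature(C),Humidity(%)
254,0,-5.2,37
204,1,-5.5,38
"""

# In the video this is pd.read_csv("SeoulBikeData.csv")
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())  # .head() shows the first five rows
```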
4139
02:38:26,840 --> 02:38:31,520
so i'm going to from here um
4140
02:38:31,520 --> 02:38:33,760
i'm actually going to get rid of some of
4141
02:38:33,760 --> 02:38:35,600
these columns that you know i don't
4142
02:38:35,600 --> 02:38:39,120
really care about so here i'm going to
4143
02:38:39,120 --> 02:38:40,960
when i when i type this in i'm going to
4144
02:38:40,960 --> 02:38:43,120
drop maybe the date
4145
02:38:43,120 --> 02:38:45,439
whether or not it's a holiday
4146
02:38:45,439 --> 02:38:48,479
and the various seasons
4147
02:38:48,479 --> 02:38:49,680
so i'm just
4148
02:38:49,680 --> 02:38:52,080
not going to care about these things
4149
02:38:52,080 --> 02:38:54,160
axis equals one means drop it from the
4150
02:38:54,160 --> 02:38:56,319
columns
4151
02:38:56,319 --> 02:38:58,160
so now you'll see that okay we still
4152
02:38:58,160 --> 02:38:59,840
have i mean i guess you don't really
4153
02:38:59,840 --> 02:39:01,040
notice it but
4154
02:39:01,040 --> 02:39:03,680
if i set the data frames columns equal
4155
02:39:03,680 --> 02:39:06,399
to data set calls
4156
02:39:06,399 --> 02:39:08,080
and i
4157
02:39:08,080 --> 02:39:09,760
look at you know the first five things
4158
02:39:09,760 --> 02:39:11,280
then you'll see that this is now our
4159
02:39:11,280 --> 02:39:14,399
data set it's a lot easier to read so
4160
02:39:14,399 --> 02:39:19,040
another thing is i'm actually going to
4161
02:39:19,040 --> 02:39:20,880
df functional
4162
02:39:20,880 --> 02:39:23,280
and we're going to create this so
4163
02:39:23,280 --> 02:39:24,800
remember that our computers are not very
4164
02:39:24,800 --> 02:39:27,040
good at language we want it to be in
4165
02:39:27,040 --> 02:39:30,560
zeros and ones so here i will convert
4166
02:39:30,560 --> 02:39:32,800
that
4167
02:39:33,439 --> 02:39:37,280
well if this is equal to yes
4168
02:39:37,280 --> 02:39:38,399
then that
4169
02:39:38,399 --> 02:39:41,840
that gets mapped as one so then set type
4170
02:39:41,840 --> 02:39:43,040
integer
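The yes/no-to-integer conversion described here — comparing the "functional" column to "yes" and casting the resulting booleans to ints — is roughly this (toy column, since the full dataset isn't reproduced here):

```python
import pandas as pd

df = pd.DataFrame({"functional": ["Yes", "Yes", "No"]})

# Compare to "yes" (lowercased first), then cast True/False to 1/0
df["functional"] = (df["functional"].str.lower() == "yes").astype(int)
print(df["functional"].tolist())  # → [1, 1, 0]
```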
4171
02:39:43,040 --> 02:39:44,640
all right
4172
02:39:44,640 --> 02:39:46,880
great
4173
02:39:46,880 --> 02:39:49,680
cool so the thing is right now these
4174
02:39:49,680 --> 02:39:52,640
bike counts are for whatever hour so to
4175
02:39:52,640 --> 02:39:54,479
make this example simpler i'm just going
4176
02:39:54,479 --> 02:39:56,080
to index on an hour and i'm going to say
4177
02:39:56,080 --> 02:39:58,319
okay we're only going to use that
4178
02:39:58,319 --> 02:40:00,000
specific hour
4179
02:40:00,000 --> 02:40:00,880
so
4180
02:40:00,880 --> 02:40:04,479
here let's say um
4181
02:40:04,479 --> 02:40:06,319
so this data frame is only going to be
4182
02:40:06,319 --> 02:40:09,760
data frame where the hour
4183
02:40:09,840 --> 02:40:14,880
let's say it equals 12 okay so it's noon
4184
02:40:14,960 --> 02:40:15,840
all right
4185
02:40:15,840 --> 02:40:17,520
so now you'll see that all the hours are
4186
02:40:17,520 --> 02:40:19,600
equal to 12 and i'm actually going to
4187
02:40:19,600 --> 02:40:22,880
now drop that column
4188
02:40:25,760 --> 02:40:26,960
our
4189
02:40:26,960 --> 02:40:30,160
axis equals one
4190
02:40:30,720 --> 02:40:33,920
all right so if we run this cell okay so
4191
02:40:33,920 --> 02:40:36,960
now we got rid of the hour in here
4192
02:40:36,960 --> 02:40:38,479
and we just have the bike count the
4193
02:40:38,479 --> 02:40:41,600
temperature humidity wind visibility and
4194
02:40:41,600 --> 02:40:43,359
yada yada yada
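Indexing on noon and then dropping the now-constant hour column, as narrated, looks like this on a toy frame (`axis=1` is what makes `drop` remove a column rather than a row):

```python
import pandas as pd

df = pd.DataFrame({
    "bike_count": [254, 204, 980],
    "hour":       [11, 12, 12],
    "temp":       [-5.2, 0.5, 3.1],
})

# Keep only rows recorded at noon, then drop the column that is now all 12s
df = df[df["hour"] == 12]
df = df.drop(["hour"], axis=1)
print(df)
```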
4195
02:40:43,359 --> 02:40:46,240
all right so what i want to do is i'm
4196
02:40:46,240 --> 02:40:49,040
going to actually plot all of these so
4197
02:40:49,040 --> 02:40:50,080
for
4198
02:40:50,080 --> 02:40:53,600
i and all the columns so the range
4199
02:40:53,600 --> 02:40:56,640
length of uh whatever this data frame is
4200
02:40:56,640 --> 02:40:58,240
and all the columns because i don't have
4201
02:40:58,240 --> 02:41:00,080
bike count as
4202
02:41:00,080 --> 02:41:03,040
actually it's my first thing so what i'm
4203
02:41:03,040 --> 02:41:05,439
going to do is say for a label and data
4204
02:41:05,439 --> 02:41:06,479
frame
4205
02:41:06,479 --> 02:41:08,479
columns everything after the first thing
4206
02:41:08,479 --> 02:41:09,920
so that would give me the temperature
4207
02:41:09,920 --> 02:41:12,319
and onwards so these are all my features
4208
02:41:12,319 --> 02:41:13,680
right
4209
02:41:13,680 --> 02:41:16,800
uh i'm going to just scatter
4210
02:41:16,800 --> 02:41:18,000
so
4211
02:41:18,000 --> 02:41:21,120
i want to see how that label how that
4212
02:41:21,120 --> 02:41:22,880
specific data
4213
02:41:22,880 --> 02:41:27,120
um how that affects the bike count so
4214
02:41:27,120 --> 02:41:29,760
i'm going to plot the byte count on the
4215
02:41:29,760 --> 02:41:31,200
y-axis
4216
02:41:31,200 --> 02:41:33,280
and i'm going to plot you know whatever
4217
02:41:33,280 --> 02:41:36,640
the specific label is on the x-axis
4218
02:41:36,640 --> 02:41:39,280
and i'm going to title this
4219
02:41:39,280 --> 02:41:41,840
uh whatever the label is
4220
02:41:41,840 --> 02:41:42,720
and
4221
02:41:42,720 --> 02:41:47,359
you know make my y label the bike count
4222
02:41:47,359 --> 02:41:49,680
at noon
4223
02:41:49,680 --> 02:41:51,840
and the x label
4224
02:41:51,840 --> 02:41:54,880
as just the label
4225
02:41:55,439 --> 02:41:57,280
okay now
4226
02:41:57,280 --> 02:41:59,359
i guess we don't even need the legend
4227
02:41:59,359 --> 02:42:02,640
so just show that plot
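The plotting loop being typed here — one scatter of each feature against the bike count, with title and axis labels — can be sketched as follows (synthetic frame; the Agg backend line just makes it safe to run without a display):

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe; drop this when running interactively
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "bike_count": [254, 204, 980, 150],
    "temp":       [-5.2, 0.5, 18.0, -2.0],
    "humidity":   [37, 38, 55, 40],
})

# Everything after the first column (bike_count) is a feature
for label in df.columns[1:]:
    plt.scatter(df[label], df["bike_count"])
    plt.title(label)
    plt.ylabel("Bike Count at Noon")
    plt.xlabel(label)
    plt.show()
```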
4228
02:42:06,160 --> 02:42:08,080
all right
4229
02:42:08,080 --> 02:42:10,080
so it seems like functional is not
4230
02:42:10,080 --> 02:42:12,640
really uh
4231
02:42:12,640 --> 02:42:16,319
doesn't really give us any utility
4232
02:42:16,319 --> 02:42:18,479
so then snow
4233
02:42:18,479 --> 02:42:19,520
rain
4234
02:42:19,520 --> 02:42:20,479
um
4235
02:42:20,479 --> 02:42:23,040
seems like this radiation
4236
02:42:23,040 --> 02:42:25,359
you know is fairly linear
4237
02:42:25,359 --> 02:42:26,880
dew point temperature
4238
02:42:26,880 --> 02:42:28,720
visibility
4239
02:42:28,720 --> 02:42:30,960
uh wind doesn't really seem like it does
4240
02:42:30,960 --> 02:42:32,000
much
4241
02:42:32,000 --> 02:42:34,399
humidity kind of maybe like an inverse
4242
02:42:34,399 --> 02:42:36,000
relationship
4243
02:42:36,000 --> 02:42:37,280
but the temperature definitely looks
4244
02:42:37,280 --> 02:42:39,040
like there's a relationship between that
4245
02:42:39,040 --> 02:42:41,280
and the number of bikes right so what
4246
02:42:41,280 --> 02:42:42,479
i'm actually going to do is i'm going to
4247
02:42:42,479 --> 02:42:44,560
drop some of the ones that don't don't
4248
02:42:44,560 --> 02:42:46,479
seem like they really matter so
4249
02:42:46,479 --> 02:42:49,040
maybe wind
4250
02:42:49,040 --> 02:42:52,439
you know visibility
4251
02:42:54,240 --> 02:42:55,600
yeah so i'm going to get rid of wind
4252
02:42:55,600 --> 02:42:58,800
visibility and functional
4253
02:42:59,200 --> 02:43:01,200
so
4254
02:43:01,200 --> 02:43:03,920
let me now data frame
4255
02:43:03,920 --> 02:43:06,840
and i'm going to drop
4256
02:43:06,840 --> 02:43:09,520
wind visibility
4257
02:43:09,520 --> 02:43:11,760
and functional
4258
02:43:11,760 --> 02:43:13,200
all right
4259
02:43:13,200 --> 02:43:15,359
and the axis again is the column so
4260
02:43:15,359 --> 02:43:16,880
that's one
4261
02:43:16,880 --> 02:43:20,240
so if i look at my data set now
4262
02:43:20,240 --> 02:43:22,240
i have just the temperature the humidity
4263
02:43:22,240 --> 02:43:24,720
the dew point temperature radiation rain
4264
02:43:24,720 --> 02:43:26,080
and snow
4265
02:43:26,080 --> 02:43:28,800
so again what i want to do is i want to
4266
02:43:28,800 --> 02:43:30,880
split this into my training
4267
02:43:30,880 --> 02:43:34,240
my validation and my test data set
4268
02:43:34,240 --> 02:43:37,439
just as we talked before
4269
02:43:37,439 --> 02:43:38,399
here
4270
02:43:38,399 --> 02:43:40,319
uh we can use the exact same thing that
4271
02:43:40,319 --> 02:43:44,479
we just did and we can say numpy.split
4272
02:43:44,479 --> 02:43:47,120
and sample you know that the whole
4273
02:43:47,120 --> 02:43:48,160
sample
4274
02:43:48,160 --> 02:43:53,279
um and then create our splits
4275
02:43:53,920 --> 02:43:57,439
of the data frame
4276
02:43:57,840 --> 02:43:59,840
and we're going to do that but now set
4277
02:43:59,840 --> 02:44:02,240
this to 8.
4278
02:44:02,240 --> 02:44:04,560
okay
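The three-way split narrated here — shuffle with `sample(frac=1)`, then cut with `np.split` at the 60% and 80% marks (the "8" spoken above is presumably 0.8) — looks like this on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"bike_count": range(10), "temp": range(10)})

# Shuffle all rows, then split at 60% and 80% of the length:
# 60% train, 20% validation, 20% test
train, val, test = np.split(
    df.sample(frac=1, random_state=0),
    [int(0.6 * len(df)), int(0.8 * len(df))],
)
print(len(train), len(val), len(test))  # → 6 2 2
```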
4279
02:44:04,560 --> 02:44:06,319
so i don't really care about you know
4280
02:44:06,319 --> 02:44:07,680
the the full
4281
02:44:07,680 --> 02:44:10,080
grid the full array so i'm just gonna
4282
02:44:10,080 --> 02:44:10,880
use
4283
02:44:10,880 --> 02:44:13,120
an underscore for that variable
4284
02:44:13,120 --> 02:44:16,560
but i will get my training
4285
02:44:16,560 --> 02:44:22,399
x and y's and actually i don't have a um
4286
02:44:22,399 --> 02:44:23,439
function
4287
02:44:23,439 --> 02:44:27,520
for getting the x and y's so here
4288
02:44:27,520 --> 02:44:30,080
i'm going to write a function to find
4289
02:44:30,080 --> 02:44:31,600
get x y
4290
02:44:31,600 --> 02:44:32,479
and
4291
02:44:32,479 --> 02:44:33,279
uh
4292
02:44:33,279 --> 02:44:35,120
i'm going to pass in the data frame and
4293
02:44:35,120 --> 02:44:36,560
i'm actually going to pass in what the
4294
02:44:36,560 --> 02:44:39,120
name of the y label is and
4295
02:44:39,120 --> 02:44:42,160
what the x what specific x labels i want
4296
02:44:42,160 --> 02:44:43,279
to
4297
02:44:43,279 --> 02:44:44,319
look at
4298
02:44:44,319 --> 02:44:45,120
so
4299
02:44:45,120 --> 02:44:47,920
here if that's none then i'm just not
4300
02:44:47,920 --> 02:44:49,520
like i'm only going to i'm going to get
4301
02:44:49,520 --> 02:44:51,520
everything from the data set that's not
4302
02:44:51,520 --> 02:44:53,040
the y label
4303
02:44:53,040 --> 02:44:55,840
so here i'm actually going to
4304
02:44:55,840 --> 02:44:59,520
make first a deep copy
4305
02:44:59,520 --> 02:45:01,760
of my data frame
4306
02:45:01,760 --> 02:45:02,960
and
4307
02:45:02,960 --> 02:45:04,479
that basically means i'm just copying
4308
02:45:04,479 --> 02:45:05,840
everything over
4309
02:45:05,840 --> 02:45:06,800
if
4310
02:45:06,800 --> 02:45:10,399
uh if like x labels is none so if not x
4311
02:45:10,399 --> 02:45:11,439
labels
4312
02:45:11,439 --> 02:45:13,279
then all i'm going to do is say all
4313
02:45:13,279 --> 02:45:15,359
right x is going to be whatever this
4314
02:45:15,359 --> 02:45:17,040
data frame is
4315
02:45:17,040 --> 02:45:18,319
and i'm just going to take all the
4316
02:45:18,319 --> 02:45:21,120
columns so c for c and
4317
02:45:21,120 --> 02:45:22,479
data frame
4318
02:45:22,479 --> 02:45:23,920
dot columns
4319
02:45:23,920 --> 02:45:27,439
if c does not equal the y label
4320
02:45:27,439 --> 02:45:29,520
all right and i'm gonna get the values
4321
02:45:29,520 --> 02:45:31,200
from that
4322
02:45:31,200 --> 02:45:33,040
but if there is
4323
02:45:33,040 --> 02:45:34,960
the x labels
4324
02:45:34,960 --> 02:45:36,640
well okay so
4325
02:45:36,640 --> 02:45:38,880
in order to index only one thing so like
4326
02:45:38,880 --> 02:45:40,560
let's say i pass in only one thing in
4327
02:45:40,560 --> 02:45:43,200
here um
4328
02:45:43,200 --> 02:45:46,319
then my data frame is
4329
02:45:46,319 --> 02:45:47,040
so
4330
02:45:47,040 --> 02:45:49,279
let me make a case for that so if the
4331
02:45:49,279 --> 02:45:52,000
length of x labels is equal to one
4332
02:45:52,000 --> 02:45:54,560
then what i'm going to do is just say
4333
02:45:54,560 --> 02:45:55,359
that
4334
02:45:55,359 --> 02:45:56,960
this
4335
02:45:56,960 --> 02:45:58,960
is going to be
4336
02:45:58,960 --> 02:46:02,640
uh x labels and add that just that label
4337
02:46:02,640 --> 02:46:04,319
um
4338
02:46:04,319 --> 02:46:06,720
values and i actually need to reshape to
4339
02:46:06,720 --> 02:46:08,080
make this 2d
4340
02:46:08,080 --> 02:46:09,760
so i'm going to pass a negative 1 comma
4341
02:46:09,760 --> 02:46:10,479
1
4342
02:46:10,479 --> 02:46:11,600
there
4343
02:46:11,600 --> 02:46:14,880
now otherwise if i have like a list of
4344
02:46:14,880 --> 02:46:17,520
specific x labels that i want to use
4345
02:46:17,520 --> 02:46:19,359
then i'm actually just going to say x is
4346
02:46:19,359 --> 02:46:22,479
equal to data frame of those x labels
4347
02:46:22,479 --> 02:46:27,279
dot values and that should suffice
4348
02:46:27,279 --> 02:46:28,640
all right so now that's just me
4349
02:46:28,640 --> 02:46:30,319
extracting x
4350
02:46:30,319 --> 02:46:32,640
and in order to get my y i'm going to do
4351
02:46:32,640 --> 02:46:34,800
y equals data frame
4352
02:46:34,800 --> 02:46:37,760
and then pass in the y label
4353
02:46:37,760 --> 02:46:39,359
and at the very end i'm going to say
4354
02:46:39,359 --> 02:46:42,080
data equals np
4355
02:46:42,080 --> 02:46:44,720
dot h stack so i'm stacking them
4356
02:46:44,720 --> 02:46:47,600
horizontally one next to each other
4357
02:46:47,600 --> 02:46:49,840
and i'll take x and y
4358
02:46:49,840 --> 02:46:53,200
and return that oh but
4359
02:46:53,200 --> 02:46:55,120
uh this needs to be values and i'm
4360
02:46:55,120 --> 02:46:56,640
actually going to reshape this to make
4361
02:46:56,640 --> 02:46:58,800
it 2d as well so that we can do this h
4362
02:46:58,800 --> 02:46:59,920
stack
4363
02:46:59,920 --> 02:47:04,319
and i will return data x y
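Assembled, the helper being written across these steps comes out roughly like this (names follow the narration; the demo frame at the bottom is an assumption):

```python
import copy
import numpy as np
import pandas as pd

def get_xy(dataframe, y_label, x_labels=None):
    df = copy.deepcopy(dataframe)  # work on a copy, not the caller's frame
    if x_labels is None:
        # Use every column except the label column
        X = df[[c for c in df.columns if c != y_label]].values
    elif len(x_labels) == 1:
        # A single feature comes out 1-D, so reshape it into a column vector
        X = df[x_labels[0]].values.reshape(-1, 1)
    else:
        X = df[x_labels].values

    y = df[y_label].values.reshape(-1, 1)
    data = np.hstack((X, y))  # features and label stacked side by side
    return data, X, y

df = pd.DataFrame({"bike_count": [1, 2, 3], "temp": [10.0, 12.0, 15.0]})
data, X, y = get_xy(df, "bike_count", x_labels=["temp"])
print(X.shape, y.shape)  # → (3, 1) (3, 1)
```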
4364
02:47:04,960 --> 02:47:06,720
so now i should be able to say okay get
4365
02:47:06,720 --> 02:47:08,080
x y
4366
02:47:08,080 --> 02:47:09,359
and
4367
02:47:09,359 --> 02:47:11,120
take that data frame
4368
02:47:11,120 --> 02:47:14,160
and the y label so our y label is bike
4369
02:47:14,160 --> 02:47:15,520
count
4370
02:47:15,520 --> 02:47:18,080
and actually so for the x label i'm
4371
02:47:18,080 --> 02:47:19,600
actually going to
4372
02:47:19,600 --> 02:47:21,439
let's just do like one dimension right
4373
02:47:21,439 --> 02:47:24,000
now and earlier i got rid of the plots
4374
02:47:24,000 --> 02:47:26,720
but we had seen that maybe you know the
4375
02:47:26,720 --> 02:47:29,600
temperature dimension does really well
4376
02:47:29,600 --> 02:47:31,439
and we might be able to use that to
4377
02:47:31,439 --> 02:47:33,200
predict why
4378
02:47:33,200 --> 02:47:35,840
so
4379
02:47:35,840 --> 02:47:37,760
i'm going to label this also that you
4380
02:47:37,760 --> 02:47:41,040
know it's just using the temperature
4381
02:47:41,040 --> 02:47:44,240
and i am also going to do this again
4382
02:47:44,240 --> 02:47:45,600
for
4383
02:47:45,600 --> 02:47:47,840
oh this should be
4384
02:47:47,840 --> 02:47:49,600
and this should be a validation and
4385
02:47:49,600 --> 02:47:52,080
there should be a test
4386
02:47:52,080 --> 02:47:55,680
um because oh that's val
4387
02:47:55,680 --> 02:47:57,439
all right
4388
02:47:57,439 --> 02:47:59,279
but here
4389
02:47:59,279 --> 02:48:00,840
it should be
4390
02:48:00,840 --> 02:48:04,000
val this should be test
4391
02:48:04,000 --> 02:48:06,080
all right so we run this and now we have
4392
02:48:06,080 --> 02:48:08,560
our training validation and test
4393
02:48:08,560 --> 02:48:11,120
data sets for just the temperature so if
4394
02:48:11,120 --> 02:48:13,840
i look at x train
4395
02:48:13,840 --> 02:48:17,279
temp it's literally just the temperature
4396
02:48:17,279 --> 02:48:18,560
okay and i'm doing this first to show
4397
02:48:18,560 --> 02:48:21,439
you simple linear regression
4398
02:48:21,439 --> 02:48:23,359
all right so right now i can create a
4399
02:48:23,359 --> 02:48:24,720
regressor
4400
02:48:24,720 --> 02:48:26,160
so i can say
4401
02:48:26,160 --> 02:48:28,479
the temp regressor here
4402
02:48:28,479 --> 02:48:30,720
and then i'm going to you know make a
4403
02:48:30,720 --> 02:48:32,399
linear regression model and just like
4404
02:48:32,399 --> 02:48:34,479
before i can
4405
02:48:34,479 --> 02:48:35,840
simply
4406
02:48:35,840 --> 02:48:39,600
fit my x train temp y train
4407
02:48:39,600 --> 02:48:42,080
temp in order to train this linear
4408
02:48:42,080 --> 02:48:44,479
regression model
4409
02:48:44,479 --> 02:48:47,279
all right and then i can also
4410
02:48:47,279 --> 02:48:48,880
i can print
4411
02:48:48,880 --> 02:48:51,040
this
4412
02:48:51,040 --> 02:48:53,680
regressor's coefficients
4413
02:48:53,680 --> 02:48:54,560
and
4414
02:48:54,560 --> 02:48:58,359
the intercept so
4415
02:48:58,640 --> 02:49:00,000
if i do that
4416
02:49:00,000 --> 02:49:02,319
okay this is the coefficient for
4417
02:49:02,319 --> 02:49:04,399
whatever the temperature is and then the
4418
02:49:04,399 --> 02:49:06,560
the x-intercept
4419
02:49:06,560 --> 02:49:08,000
okay
4420
02:49:08,000 --> 02:49:10,800
or the y-intercept sorry
4421
02:49:10,800 --> 02:49:12,399
all right
4422
02:49:12,399 --> 02:49:16,479
and i can you know score so i can get
4423
02:49:16,479 --> 02:49:18,640
the um
4424
02:49:18,640 --> 02:49:20,560
the r squared
4425
02:49:20,560 --> 02:49:21,760
score
4426
02:49:21,760 --> 02:49:24,000
so i can score
4427
02:49:24,000 --> 02:49:25,439
x
4428
02:49:25,439 --> 02:49:27,840
test
4429
02:49:28,240 --> 02:49:31,600
and y test
4430
02:49:32,640 --> 02:49:35,040
all right so it's an r squared of around
4431
02:49:35,040 --> 02:49:37,279
0.38 which is better than zero which
4432
02:49:37,279 --> 02:49:38,800
would mean hey there's absolutely no
4433
02:49:38,800 --> 02:49:41,680
association but it's also not you know
4434
02:49:41,680 --> 02:49:42,479
like
4435
02:49:42,479 --> 02:49:44,000
a
4436
02:49:44,000 --> 02:49:46,479
good it depends on the context but
4437
02:49:46,479 --> 02:49:48,000
you know the higher that number it means
4438
02:49:48,000 --> 02:49:49,680
the higher that the two variables would
4439
02:49:49,680 --> 02:49:52,160
be correlated right which
4440
02:49:52,160 --> 02:49:53,279
here it's
4441
02:49:53,279 --> 02:49:55,120
all right it just means there's maybe
4442
02:49:55,120 --> 02:49:58,399
some association between the two
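The scikit-learn calls narrated in this stretch — fit a `LinearRegression`, inspect `coef_` and `intercept_`, then `score` (R²) — on a toy 1-D dataset standing in for the temperature feature:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: y ≈ 3x + 5 with a little noise, over the temperature range
rng = np.random.default_rng(0)
X_train = rng.uniform(-20, 40, size=(100, 1))
y_train = 3 * X_train[:, 0] + 5 + rng.normal(0, 2, size=100)

temp_reg = LinearRegression()
temp_reg.fit(X_train, y_train)

print(temp_reg.coef_, temp_reg.intercept_)  # slope and y-intercept
print(temp_reg.score(X_train, y_train))     # R^2: 1 = perfect, 0 = no linear fit
```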
4443
02:49:58,399 --> 02:50:00,240
but uh the reason why i wanted to do
4444
02:50:00,240 --> 02:50:02,399
this one d was to show you
4445
02:50:02,399 --> 02:50:04,479
you know if we plotted this this is what
4446
02:50:04,479 --> 02:50:06,720
it would look like so if i
4447
02:50:06,720 --> 02:50:08,800
uh create a scatter plot
4448
02:50:08,800 --> 02:50:09,680
and
4449
02:50:09,680 --> 02:50:12,880
let's take the training
4450
02:50:15,520 --> 02:50:17,920
um
4451
02:50:17,920 --> 02:50:20,560
so this is our data and then let's make
4452
02:50:20,560 --> 02:50:22,800
it blue
4453
02:50:22,800 --> 02:50:26,880
and then if i also plotted so
4454
02:50:26,880 --> 02:50:28,640
something that i can do is say you know
4455
02:50:28,640 --> 02:50:31,279
the x range that i'm going to plot it
4456
02:50:31,279 --> 02:50:32,240
is
4457
02:50:32,240 --> 02:50:34,399
linspace and this goes from
4458
02:50:34,399 --> 02:50:37,279
negative 20 to 40 this piece of data so
4459
02:50:37,279 --> 02:50:39,520
i'm going to say let's take 100
4460
02:50:39,520 --> 02:50:41,840
things from there
4461
02:50:41,840 --> 02:50:43,840
so i'm going to
4462
02:50:43,840 --> 02:50:48,160
plot x and i'm going to take this
4463
02:50:48,160 --> 02:50:51,840
temp this like regressor and predict
4464
02:50:51,840 --> 02:50:52,720
x
4465
02:50:52,720 --> 02:50:53,760
with that
4466
02:50:53,760 --> 02:50:55,520
okay and this label
4467
02:50:55,520 --> 02:50:57,359
i'm going to label that
4468
02:50:57,359 --> 02:50:58,319
um
4469
02:50:58,319 --> 02:50:59,520
the
4470
02:50:59,520 --> 02:51:01,760
fit
4471
02:51:02,080 --> 02:51:05,439
and this color let's make this red
4472
02:51:05,439 --> 02:51:06,160
and let's actually
4473
02:51:06,160 --> 02:51:07,680
[Music]
4474
02:51:07,680 --> 02:51:10,399
set the line width so i can change
4475
02:51:10,399 --> 02:51:11,840
how thick
4476
02:51:11,840 --> 02:51:14,080
that value is
4477
02:51:14,080 --> 02:51:15,920
okay
4478
02:51:15,920 --> 02:51:18,800
now at the very end uh let's create a
4479
02:51:18,800 --> 02:51:20,720
legend
4480
02:51:20,720 --> 02:51:23,040
and let's
4481
02:51:23,040 --> 02:51:24,479
all right let's also create you know
4482
02:51:24,479 --> 02:51:26,640
title
4483
02:51:26,640 --> 02:51:29,600
all these things that matter
4484
02:51:29,600 --> 02:51:31,040
in some sense
4485
02:51:31,040 --> 02:51:33,840
so here let's just say um
4486
02:51:33,840 --> 02:51:36,000
this would be the bikes
4487
02:51:36,000 --> 02:51:38,000
versus the temperature
4488
02:51:38,000 --> 02:51:41,600
right and the y label would be
4489
02:51:41,600 --> 02:51:43,760
number of bikes
4490
02:51:43,760 --> 02:51:48,080
and the x label would be the temperature
4491
02:51:48,080 --> 02:51:50,160
so i actually think that this might
4492
02:51:50,160 --> 02:51:52,560
cause an error yeah
4493
02:51:52,560 --> 02:51:54,880
so it's expecting a 2d array so we
4494
02:51:54,880 --> 02:51:56,479
actually have to
4495
02:51:56,479 --> 02:51:58,240
reshape
4496
02:51:58,240 --> 02:51:59,439
this
4497
02:51:59,439 --> 02:52:01,840
let's
4498
02:52:03,800 --> 02:52:04,060
[Applause]
4499
02:52:04,060 --> 02:52:07,160
[Music]
4500
02:52:08,640 --> 02:52:10,160
okay there we go
4501
02:52:10,160 --> 02:52:11,760
so i just had to make this an array and
4502
02:52:11,760 --> 02:52:15,439
then reshape it so it was 2d now we see
4503
02:52:15,439 --> 02:52:17,680
that all right this
4504
02:52:17,680 --> 02:52:19,840
increases but again remember those
4505
02:52:19,840 --> 02:52:21,200
assumptions that we had about linear
4506
02:52:21,200 --> 02:52:22,960
regression like this i don't really know
4507
02:52:22,960 --> 02:52:24,640
if this
4508
02:52:24,640 --> 02:52:26,319
fits those assumptions
4509
02:52:26,319 --> 02:52:27,600
right i just wanted to show you guys
4510
02:52:27,600 --> 02:52:29,359
though that like
4511
02:52:29,359 --> 02:52:31,120
all right this is what a line of best
4512
02:52:31,120 --> 02:52:35,040
fit through this data would look like
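The fit-line plot built here — scatter the training data in blue, then draw the model's predictions over an `np.linspace` grid reshaped to 2-D (the reshape that fixed the error in the video) in red — sketched on the same toy data:

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe; drop this when running interactively
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X_train = rng.uniform(-20, 40, size=(50, 1))
y_train = 3 * X_train[:, 0] + 5 + rng.normal(0, 2, size=50)

reg = LinearRegression().fit(X_train, y_train)

plt.scatter(X_train, y_train, label="Data", color="blue")
x = np.linspace(-20, 40, 100)               # the range the temperatures span
plt.plot(x, reg.predict(x.reshape(-1, 1)),  # predict() expects a 2-D array
         label="Fit", color="red", linewidth=3)
plt.legend()
plt.title("Bikes vs Temp")
plt.ylabel("Number of bikes")
plt.xlabel("Temp")
plt.show()
```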
4513
02:52:35,520 --> 02:52:37,760
okay
4514
02:52:37,840 --> 02:52:39,120
now
4515
02:52:39,120 --> 02:52:42,240
we can do multiple linear regression
4516
02:52:42,240 --> 02:52:44,560
right
4517
02:52:45,200 --> 02:52:47,200
so i'm going to go ahead and do that as
4518
02:52:47,200 --> 02:52:48,720
well
4519
02:52:48,720 --> 02:52:49,520
now
4520
02:52:49,520 --> 02:52:52,319
if i take
4521
02:52:52,399 --> 02:52:54,560
my data set
4522
02:52:54,560 --> 02:52:57,279
and instead of the labels
4523
02:52:57,279 --> 02:52:59,279
so actually what's my current data set
4524
02:52:59,279 --> 02:53:01,760
right now
4525
02:53:06,319 --> 02:53:08,160
all right so let's just use all of these
4526
02:53:08,160 --> 02:53:10,800
except for the bike count right so i'm
4527
02:53:10,800 --> 02:53:14,640
going to just say for the x labels
4528
02:53:15,279 --> 02:53:17,279
let's just take the data frames columns
4529
02:53:17,279 --> 02:53:18,080
and
4530
02:53:18,080 --> 02:53:20,000
just remove the bike count
4531
02:53:20,000 --> 02:53:24,120
so does that work
4532
02:53:24,800 --> 02:53:27,680
so this part should be if x labels
4533
02:53:27,680 --> 02:53:30,080
is none
4534
02:53:30,080 --> 02:53:32,800
and then this should work now
4535
02:53:32,800 --> 02:53:34,240
oops sorry
4536
02:53:34,240 --> 02:53:36,000
okay so i have
4537
02:53:36,000 --> 02:53:38,720
oh but this here because it's not just
4538
02:53:38,720 --> 02:53:40,479
the temperature
4539
02:53:40,479 --> 02:53:41,840
anymore
4540
02:53:41,840 --> 02:53:44,800
we should actually do this um let's say
4541
02:53:44,800 --> 02:53:45,920
all
4542
02:53:45,920 --> 02:53:48,800
right so i'm just going to quickly rerun
4543
02:53:48,800 --> 02:53:50,000
this
4544
02:53:50,000 --> 02:53:51,359
piece here so that we have our
4545
02:53:51,359 --> 02:53:53,359
temperature only data set and now we
4546
02:53:53,359 --> 02:53:55,840
have our all data set
4547
02:53:55,840 --> 02:53:56,640
okay
4548
02:53:56,640 --> 02:53:59,279
and this regressor i can do the same
4549
02:53:59,279 --> 02:54:02,560
thing so i can do the all regressor
4550
02:54:02,560 --> 02:54:04,840
and i'm going to make this the linear
4551
02:54:04,840 --> 02:54:06,560
regression
4552
02:54:06,560 --> 02:54:08,880
and
4553
02:54:08,880 --> 02:54:12,560
i'm going to fit this to x train all and
4554
02:54:12,560 --> 02:54:13,680
y
4555
02:54:13,680 --> 02:54:16,080
train all
4556
02:54:16,080 --> 02:54:16,880
okay
4557
02:54:16,880 --> 02:54:18,319
all right so let's go ahead and also
4558
02:54:18,319 --> 02:54:20,640
score this regressor and let's see how
4559
02:54:20,640 --> 02:54:23,279
the r squared performs now so if i test
4560
02:54:23,279 --> 02:54:24,319
this
4561
02:54:24,319 --> 02:54:28,160
on the test data set what happens
4562
02:54:29,279 --> 02:54:30,720
all right so our r squared seems to
4563
02:54:30,720 --> 02:54:34,640
improve it went from 0.4 to 0.5 which is
4564
02:54:34,640 --> 02:54:36,880
a good sign
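Multiple linear regression, as described, is the same `fit`/`score` call with every feature column instead of just the temperature. A sketch on synthetic multi-feature data (the six columns stand in for temp, humidity, dew point, radiation, rain, snow):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X_all = rng.normal(size=(200, 6))  # six features, like the trimmed data frame
true_w = np.array([3.0, -1.0, 0.5, 2.0, -0.2, 0.1])
y_all = X_all @ true_w + rng.normal(0, 0.5, size=200)

all_reg = LinearRegression().fit(X_all, y_all)
# With more informative features, R^2 should beat the single-feature fit
print(all_reg.score(X_all, y_all))
```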
4565
02:54:36,880 --> 02:54:38,240
okay
4566
02:54:38,240 --> 02:54:41,680
and i can't necessarily plot you know
4567
02:54:41,680 --> 02:54:43,920
every single dimension but this just
4568
02:54:43,920 --> 02:54:45,840
this is just to say okay this has this
4569
02:54:45,840 --> 02:54:48,800
is improved right all right so one cool
4570
02:54:48,800 --> 02:54:50,240
thing that you can do with tensorflow is
4571
02:54:50,240 --> 02:54:51,359
you can actually
4572
02:54:51,359 --> 02:54:52,880
do regression
4573
02:54:52,880 --> 02:54:55,920
but with a neural net
4574
02:54:56,960 --> 02:54:58,640
so
4575
02:54:58,640 --> 02:55:00,840
here i'm going
4576
02:55:00,840 --> 02:55:04,160
to um
4577
02:55:04,160 --> 02:55:06,319
we already have our our training data
4578
02:55:06,319 --> 02:55:08,160
for just the temperature and just you
4579
02:55:08,160 --> 02:55:10,000
know for all the different columns so
4580
02:55:10,000 --> 02:55:11,520
i'm not going to bother with splitting
4581
02:55:11,520 --> 02:55:13,200
up the data again
4582
02:55:13,200 --> 02:55:14,240
i'm just going to go ahead and start
4583
02:55:14,240 --> 02:55:16,880
building the model so
4584
02:55:16,880 --> 02:55:18,720
in this linear regression model uh
4585
02:55:18,720 --> 02:55:19,920
typically
4586
02:55:19,920 --> 02:55:23,600
you know it does help if we normalize it
4587
02:55:23,600 --> 02:55:25,680
so that's very easy to do with
4588
02:55:25,680 --> 02:55:28,000
tensorflow i can just create some
4589
02:55:28,000 --> 02:55:32,160
normalizer layer so i'm going to do
4590
02:55:32,160 --> 02:55:34,479
tensorflow keras layers
4591
02:55:34,479 --> 02:55:36,319
and get the normalization
4592
02:55:36,319 --> 02:55:37,359
layer
4593
02:55:37,359 --> 02:55:39,279
and the input shape
4594
02:55:39,279 --> 02:55:40,479
for that
4595
02:55:40,479 --> 02:55:42,319
will just be one because let's just do
4596
02:55:42,319 --> 02:55:44,960
it again on just the temperature
4597
02:55:44,960 --> 02:55:47,279
and the axis i will
4598
02:55:47,279 --> 02:55:49,520
make none
4599
02:55:49,520 --> 02:55:52,240
now for this temp normalizer
4600
02:55:52,240 --> 02:55:54,080
and i should have had an equal sign
4601
02:55:54,080 --> 02:55:54,960
there
4602
02:55:54,960 --> 02:55:58,800
um i'm going to adapt this to x
4603
02:55:58,800 --> 02:56:00,080
train
4604
02:56:00,080 --> 02:56:01,200
temp
4605
02:56:01,200 --> 02:56:02,000
and
4606
02:56:02,000 --> 02:56:06,399
reshape this to just a single vector
4607
02:56:06,399 --> 02:56:10,160
so that should work great now with this
4608
02:56:10,160 --> 02:56:13,600
model so temp neural net model what i
4609
02:56:13,600 --> 02:56:15,439
can do is i can do
4610
02:56:15,439 --> 02:56:17,359
you know tf.keras
4611
02:56:17,359 --> 02:56:19,760
that's sequential
4612
02:56:19,760 --> 02:56:22,479
and i'm going to pass in this normalizer
4613
02:56:22,479 --> 02:56:23,359
layer
4614
02:56:23,359 --> 02:56:25,200
and then i'm going to say hey just give
4615
02:56:25,200 --> 02:56:27,840
me one single dense layer with one
4616
02:56:27,840 --> 02:56:30,000
single unit and what that's doing is
4617
02:56:30,000 --> 02:56:32,080
saying all right
4618
02:56:32,080 --> 02:56:34,880
well one single node just means that
4619
02:56:34,880 --> 02:56:37,040
it's linear and if you don't add any
4620
02:56:37,040 --> 02:56:38,720
sort of activation function to it the
4621
02:56:38,720 --> 02:56:40,800
output is also linear
4622
02:56:40,800 --> 02:56:43,279
so here i'm going to have tensorflow
4623
02:56:43,279 --> 02:56:44,720
keras
4624
02:56:44,720 --> 02:56:46,720
layers.dense
4625
02:56:46,720 --> 02:56:48,160
and i'm just
4626
02:56:48,160 --> 02:56:50,160
gonna have one unit
4627
02:56:50,160 --> 02:56:52,640
and that's going to be my model
4628
02:56:52,640 --> 02:56:54,399
okay
4629
02:56:54,399 --> 02:56:55,520
so
4630
02:56:55,520 --> 02:56:58,560
uh with this
4631
02:56:59,120 --> 02:57:00,319
model
4632
02:57:00,319 --> 02:57:02,880
let's compile
4633
02:57:02,880 --> 02:57:06,840
and for our optimizer um let's
4634
02:57:06,840 --> 02:57:08,640
use
4635
02:57:08,640 --> 02:57:11,359
let's use adam again
4636
02:57:11,359 --> 02:57:12,640
optimizers
4637
02:57:12,640 --> 02:57:14,800
dot adam and we have to pass in the
4638
02:57:14,800 --> 02:57:16,800
learning rate
4639
02:57:16,800 --> 02:57:19,680
so learning rate and our learning rate
4640
02:57:19,680 --> 02:57:22,319
let's do 0.01
4641
02:57:22,319 --> 02:57:23,279
and now
4642
02:57:23,279 --> 02:57:25,200
the loss we
4643
02:57:25,200 --> 02:57:28,720
actually let's give this one 0.1 and the
4644
02:57:28,720 --> 02:57:32,479
loss i'm going to do mean squared error
4645
02:57:32,479 --> 02:57:34,880
okay so we run that we've compiled it
4646
02:57:34,880 --> 02:57:36,800
okay great
4647
02:57:36,800 --> 02:57:40,560
and just like before we can call history
4648
02:57:40,560 --> 02:57:43,680
and i'm going to fit this model so
4649
02:57:43,680 --> 02:57:46,080
here if i call fit
4650
02:57:46,080 --> 02:57:47,920
i can just fit it and i'm going to take
4651
02:57:47,920 --> 02:57:48,960
the
4652
02:57:48,960 --> 02:57:51,920
uh x train with the temperature
4653
02:57:51,920 --> 02:57:54,399
but reshape it
4654
02:57:54,399 --> 02:57:57,680
um y train for the temperature
4655
02:57:57,680 --> 02:58:00,240
and i'm going to set verbose equal to
4656
02:58:00,240 --> 02:58:02,479
zero so that it doesn't you know display
4657
02:58:02,479 --> 02:58:03,359
stuff
4658
02:58:03,359 --> 02:58:05,200
i'm actually going to set epochs equal
4659
02:58:05,200 --> 02:58:07,840
to let's do 1000
4660
02:58:07,840 --> 02:58:09,760
um
4661
02:58:09,760 --> 02:58:12,479
and the validation
4662
02:58:12,479 --> 02:58:14,479
data should be let's pass in the
4663
02:58:14,479 --> 02:58:18,560
validation data set here
4664
02:58:18,560 --> 02:58:20,240
as a tuple
4665
02:58:20,240 --> 02:58:23,200
and i know i spelled that wrong
4666
02:58:23,200 --> 02:58:26,640
so let's just run this
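[editor's note: what she typed here, a single Dense unit compiled with Adam and mean-squared-error loss, then fit against the reshaped temperature data with the validation data passed as a tuple, comes out to roughly this sketch; synthetic arrays stand in for the notebook's x train / y train temperature data]

```python
import numpy as np
import tensorflow as tf

# synthetic stand-ins for the video's temperature / bike-count arrays
rng = np.random.default_rng(0)
x_train = rng.uniform(-10, 35, 200)
y_train = 20 * x_train + 50 + rng.normal(0, 30, 200)
x_val, y_val = x_train[:40], y_train[:40]

# one Dense layer with one unit and no activation -> a linear model
temp_nn_model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, input_shape=(1,)),
])
temp_nn_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
    loss="mean_squared_error",
)

history = temp_nn_model.fit(
    x_train.reshape(-1, 1), y_train,
    validation_data=(x_val.reshape(-1, 1), y_val),
    verbose=0,
    epochs=100,  # the video uses 1000; fewer here to keep the sketch quick
)
```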
4667
02:58:27,040 --> 02:58:28,720
and up here i've copy and pasted the
4668
02:58:28,720 --> 02:58:31,040
plot loss from our previous but change
4669
02:58:31,040 --> 02:58:33,920
the y label to mse because now we're
4670
02:58:33,920 --> 02:58:35,279
talking we're dealing with mean squared
4671
02:58:35,279 --> 02:58:36,479
error
4672
02:58:36,479 --> 02:58:37,520
and
4673
02:58:37,520 --> 02:58:39,120
i'm going to plot the loss of this
4674
02:58:39,120 --> 02:58:41,279
history after it's done so let's just
4675
02:58:41,279 --> 02:58:43,040
wait for this to finish training and
4676
02:58:43,040 --> 02:58:46,439
then to plot
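[editor's note: the plot loss helper she copied down, with the y-label changed to MSE, is presumably something like the following; called as plot_loss(history) once training finishes]

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

def plot_loss(history):
    # training and validation loss curves from a Keras History object
    plt.plot(history.history["loss"], label="loss")
    plt.plot(history.history["val_loss"], label="val_loss")
    plt.xlabel("Epoch")
    plt.ylabel("MSE")
    plt.legend()
    plt.grid(True)
```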
4677
02:58:49,359 --> 02:58:51,520
okay so this actually looks pretty good
4678
02:58:51,520 --> 02:58:54,640
we see that the values are converging
4679
02:58:54,640 --> 02:58:57,840
so now what i can do is i'm going to
4680
02:58:57,840 --> 02:59:02,399
go back up and take this plot
4681
02:59:02,880 --> 02:59:04,960
and we are going to just run that plot
4682
02:59:04,960 --> 02:59:07,120
again so
4683
02:59:07,120 --> 02:59:08,160
here
4684
02:59:08,160 --> 02:59:09,120
um
4685
02:59:09,120 --> 02:59:10,960
instead of
4686
02:59:10,960 --> 02:59:12,720
this temperature regressor i'm going to
4687
02:59:12,720 --> 02:59:16,160
use the neural net regressor
4688
02:59:16,160 --> 02:59:19,279
this neural net model
4689
02:59:19,840 --> 02:59:22,880
and if i run that i can see that you
4690
02:59:22,880 --> 02:59:24,640
know this also gives me a linear
4691
02:59:24,640 --> 02:59:26,319
regressor
4692
02:59:26,319 --> 02:59:28,479
you'll notice that this this fit is not
4693
02:59:28,479 --> 02:59:31,040
entirely the same as the one
4694
02:59:31,040 --> 02:59:33,920
up here and that's due to the training
4695
02:59:33,920 --> 02:59:35,120
process
4696
02:59:35,120 --> 02:59:36,319
of
4697
02:59:36,319 --> 02:59:38,720
you know of this neural net so just two
4698
02:59:38,720 --> 02:59:40,960
different ways to try to find
4699
02:59:40,960 --> 02:59:42,960
the best linear regressor
4700
02:59:42,960 --> 02:59:45,200
okay but here we're using back
4701
02:59:45,200 --> 02:59:47,520
propagation to train a neural net node
4702
02:59:47,520 --> 02:59:49,680
whereas in the other one they probably
4703
02:59:49,680 --> 02:59:51,920
are not doing that okay they're probably
4704
02:59:51,920 --> 02:59:54,160
just trying to actually compute
4705
02:59:54,160 --> 02:59:57,359
the line of best fit so
4706
02:59:57,359 --> 02:59:59,439
okay given this
4707
02:59:59,439 --> 03:00:01,680
well we can repeat the exact same
4708
03:00:01,680 --> 03:00:03,040
exercise
4709
03:00:03,040 --> 03:00:04,479
with our
4710
03:00:04,479 --> 03:00:06,000
um
4711
03:00:06,000 --> 03:00:08,080
with our multiple linear regressions
4712
03:00:08,080 --> 03:00:09,200
okay
4713
03:00:09,200 --> 03:00:11,439
but i'm actually going to skip that part
4714
03:00:11,439 --> 03:00:13,359
i will leave that as an exercise to the
4715
03:00:13,359 --> 03:00:15,600
viewer okay so now what would happen if
4716
03:00:15,600 --> 03:00:18,000
we use a neural net a real neural net
4717
03:00:18,000 --> 03:00:19,920
instead of just you know one single node
4718
03:00:19,920 --> 03:00:22,720
in order to predict this so
4719
03:00:22,720 --> 03:00:24,720
let's start on that code we already have
4720
03:00:24,720 --> 03:00:26,240
our normalizer so i'm actually going to
4721
03:00:26,240 --> 03:00:27,520
take the same
4722
03:00:27,520 --> 03:00:30,880
uh setup here but instead of you know
4723
03:00:30,880 --> 03:00:32,800
this one dense layer i'm going to set
4724
03:00:32,800 --> 03:00:36,000
this equal to 32 units and for my
4725
03:00:36,000 --> 03:00:39,840
activation i'm going to use relu
4726
03:00:39,840 --> 03:00:41,760
and now let's
4727
03:00:41,760 --> 03:00:43,200
duplicate that
4728
03:00:43,200 --> 03:00:45,439
and for the final output i just want one
4729
03:00:45,439 --> 03:00:47,439
answer so i just want one unit
4730
03:00:47,439 --> 03:00:49,600
and this activation is also going to be
4731
03:00:49,600 --> 03:00:52,000
relu because i can't ever have less than
4732
03:00:52,000 --> 03:00:53,439
zero bikes so i'm just going to set that
4733
03:00:53,439 --> 03:00:54,800
as relu
4734
03:00:54,800 --> 03:00:55,920
i'm just going to name this the neural
4735
03:00:55,920 --> 03:00:58,160
net model okay
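[editor's note: her "real" network, two Dense layers of 32 ReLU units plus a single ReLU output so the predicted bike count can never go negative, would read roughly like this; the temperature normalizer is rebuilt on synthetic data so the sketch stands alone]

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
x_train = rng.uniform(-10, 35, 200)

# scalar normalizer for the single temperature feature
temp_normalizer = tf.keras.layers.Normalization(input_shape=(1,), axis=None)
temp_normalizer.adapt(x_train)

nn_model = tf.keras.Sequential([
    temp_normalizer,
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="relu"),  # ReLU output: never below zero bikes
])
nn_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="mean_squared_error",
)
preds = nn_model.predict(x_train.reshape(-1, 1), verbose=0)
```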
4736
03:00:58,160 --> 03:01:00,960
and at the bottom i'm going to have this
4737
03:01:00,960 --> 03:01:03,840
um neural net model
4738
03:01:03,840 --> 03:01:05,680
i'm going to have to know that model i'm
4739
03:01:05,680 --> 03:01:08,399
going to compile
4740
03:01:08,399 --> 03:01:08,840
um
4741
03:01:08,840 --> 03:01:10,640
[Music]
4742
03:01:10,640 --> 03:01:13,200
and i will actually use the same
4743
03:01:13,200 --> 03:01:14,960
compiler here
4744
03:01:14,960 --> 03:01:18,240
but instead of
4745
03:01:18,560 --> 03:01:20,800
instead of a learning rate of 0.01 i'll
4746
03:01:20,800 --> 03:01:23,040
use 0.001
4747
03:01:23,040 --> 03:01:24,800
okay
4748
03:01:24,800 --> 03:01:27,520
and i'm going to train this here so the
4749
03:01:27,520 --> 03:01:28,880
history
4750
03:01:28,880 --> 03:01:31,359
is this
4751
03:01:31,359 --> 03:01:35,680
neural net model um and i'm going to fit
4752
03:01:35,680 --> 03:01:41,120
that against x train temp y train temp
4753
03:01:41,120 --> 03:01:42,640
and
4754
03:01:42,640 --> 03:01:44,640
valid
4755
03:01:44,640 --> 03:01:46,640
validation
4756
03:01:46,640 --> 03:01:48,960
data i'm going to set this again equal
4757
03:01:48,960 --> 03:01:50,080
to
4758
03:01:50,080 --> 03:01:51,359
xval
4759
03:01:51,359 --> 03:01:53,120
temp and y
4760
03:01:53,120 --> 03:01:54,080
val
4761
03:01:54,080 --> 03:01:56,000
temp
4762
03:01:56,000 --> 03:01:56,960
now
4763
03:01:56,960 --> 03:01:58,479
for the verbose i'm going to set that
4764
03:01:58,479 --> 03:01:59,680
equal to zero
4765
03:01:59,680 --> 03:02:03,359
epochs let's do 100 and
4766
03:02:03,359 --> 03:02:05,600
here for the batch size actually let's
4767
03:02:05,600 --> 03:02:07,760
just not do a batch size right now
4768
03:02:07,760 --> 03:02:09,600
let's just try let's see what happens
4769
03:02:09,600 --> 03:02:11,840
here
4770
03:02:13,279 --> 03:02:15,600
and again we can plot the loss of this
4771
03:02:15,600 --> 03:02:18,479
history after it's done training
4772
03:02:18,479 --> 03:02:21,439
so let's just run this
4773
03:02:21,439 --> 03:02:22,720
and that's not what we're supposed to
4774
03:02:22,720 --> 03:02:26,479
get so what is going on
4775
03:02:26,479 --> 03:02:29,120
here is sequential we have our
4776
03:02:29,120 --> 03:02:30,479
temperature
4777
03:02:30,479 --> 03:02:32,319
normalizer
4778
03:02:32,319 --> 03:02:36,560
which i'm wondering now if we have to
4779
03:02:38,840 --> 03:02:42,920
re-do that
4780
03:02:45,600 --> 03:02:47,279
okay so we do see this
4781
03:02:47,279 --> 03:02:49,520
decline it's an interesting curve but we
4782
03:02:49,520 --> 03:02:51,760
do we do see it eventually
4783
03:02:51,760 --> 03:02:53,279
um
4784
03:02:53,279 --> 03:02:56,080
so this is our loss which all right it's
4785
03:02:56,080 --> 03:02:58,160
decreasing that's a good sign and
4786
03:02:58,160 --> 03:02:59,760
actually what's interesting is let's
4787
03:02:59,760 --> 03:03:02,640
just let's plot this model again
4788
03:03:02,640 --> 03:03:06,000
so here instead of that
4789
03:03:06,800 --> 03:03:08,160
and you'll see that we actually have
4790
03:03:08,160 --> 03:03:10,000
this like
4791
03:03:10,000 --> 03:03:11,920
curve that looks something like this so
4792
03:03:11,920 --> 03:03:13,600
actually what if i got rid of this
4793
03:03:13,600 --> 03:03:16,319
activation
4794
03:03:16,399 --> 03:03:20,319
let's train this again
4795
03:03:21,120 --> 03:03:23,840
and see what happens
4796
03:03:23,840 --> 03:03:25,600
all right so even even when i got rid of
4797
03:03:25,600 --> 03:03:27,840
that relu at the end
4798
03:03:27,840 --> 03:03:31,520
it kind of knows hey you know if
4799
03:03:31,520 --> 03:03:33,279
it's not the best model
4800
03:03:33,279 --> 03:03:37,279
if we had maybe one more layer in here
4801
03:03:39,200 --> 03:03:40,479
these are just things that you have to
4802
03:03:40,479 --> 03:03:41,840
play around with
4803
03:03:41,840 --> 03:03:43,439
when you're you know working with
4804
03:03:43,439 --> 03:03:45,279
machine learning it's like you don't
4805
03:03:45,279 --> 03:03:46,880
really know
4806
03:03:46,880 --> 03:03:49,439
what the best model is going to be
4807
03:03:49,439 --> 03:03:52,960
um for example this also is not
4808
03:03:52,960 --> 03:03:55,840
brilliant
4809
03:03:56,080 --> 03:03:56,880
but
4810
03:03:56,880 --> 03:03:59,600
i guess it's okay so my point is though
4811
03:03:59,600 --> 03:04:02,399
that with a neural net
4812
03:04:02,399 --> 03:04:04,319
i mean this is not brilliant but also
4813
03:04:04,319 --> 03:04:06,640
there's like no data down here right so
4814
03:04:06,640 --> 03:04:08,000
it's kind of hard for our model to
4815
03:04:08,000 --> 03:04:09,600
predict in fact we probably should have
4816
03:04:09,600 --> 03:04:10,960
started the prediction somewhere around
4817
03:04:10,960 --> 03:04:12,560
here
4818
03:04:12,560 --> 03:04:14,000
my point though is that with this neural
4819
03:04:14,000 --> 03:04:15,439
net model you can see that this is no
4820
03:04:15,439 --> 03:04:18,319
longer a linear predictor but yet we
4821
03:04:18,319 --> 03:04:21,359
still get an estimate of the value
4822
03:04:21,359 --> 03:04:23,120
right and we can repeat this exact same
4823
03:04:23,120 --> 03:04:24,800
exercise
4824
03:04:24,800 --> 03:04:27,120
with the multiple
4825
03:04:27,120 --> 03:04:28,479
uh
4826
03:04:28,479 --> 03:04:29,920
inputs
4827
03:04:29,920 --> 03:04:32,560
so here
4828
03:04:33,439 --> 03:04:37,040
if i now pass in all the data
4829
03:04:37,040 --> 03:04:38,000
so
4830
03:04:38,000 --> 03:04:39,840
this is my all
4831
03:04:39,840 --> 03:04:41,760
normalizer
4832
03:04:41,760 --> 03:04:44,080
and i should just be able to pass in
4833
03:04:44,080 --> 03:04:46,319
that
4834
03:04:46,319 --> 03:04:47,439
so
4835
03:04:47,439 --> 03:04:51,279
let's move this to the next cell
4836
03:04:51,279 --> 03:04:53,520
um
4837
03:04:53,920 --> 03:04:55,760
here i'm going to pass in my all
4838
03:04:55,760 --> 03:04:57,279
normalizer
4839
03:04:57,279 --> 03:04:59,920
and let's compile it yeah those
4840
03:04:59,920 --> 03:05:02,880
parameters look good
4841
03:05:02,880 --> 03:05:05,359
great so
4842
03:05:05,359 --> 03:05:06,960
here with the history when we're trying
4843
03:05:06,960 --> 03:05:08,160
to
4844
03:05:08,160 --> 03:05:10,640
fit this model instead of temp we're
4845
03:05:10,640 --> 03:05:13,279
going to use
4846
03:05:13,279 --> 03:05:14,960
our larger data set with all the
4847
03:05:14,960 --> 03:05:16,560
features
4848
03:05:16,560 --> 03:05:19,840
and let's just train that
4849
03:05:21,920 --> 03:05:25,600
and of course we want to plot the loss
4850
03:05:31,439 --> 03:05:34,080
okay so that's what our loss looks like
4851
03:05:34,080 --> 03:05:35,600
um
4852
03:05:35,600 --> 03:05:37,040
it's an interesting curve but it's
4853
03:05:37,040 --> 03:05:39,359
decreasing
4854
03:05:39,359 --> 03:05:41,279
so before we saw that our r squared
4855
03:05:41,279 --> 03:05:44,080
score was around 0.52 well we don't
4856
03:05:44,080 --> 03:05:45,439
really have that with a neural net
4857
03:05:45,439 --> 03:05:47,200
anymore but one thing that we can
4858
03:05:47,200 --> 03:05:49,600
measure is hey what is the mean squared
4859
03:05:49,600 --> 03:05:50,560
error
4860
03:05:50,560 --> 03:05:53,600
right so if i come down here
4861
03:05:53,600 --> 03:05:55,840
um
4862
03:05:56,319 --> 03:05:58,960
and i compare the two mean squared
4863
03:05:58,960 --> 03:06:01,600
errors so
4864
03:06:01,600 --> 03:06:04,880
so i can predict
4865
03:06:05,120 --> 03:06:06,880
x test
4866
03:06:06,880 --> 03:06:09,760
all right
4867
03:06:10,479 --> 03:06:12,319
so these are my predictions using that
4868
03:06:12,319 --> 03:06:14,960
linear regressor well the multiple
4869
03:06:14,960 --> 03:06:17,040
linear regressor so these are
4870
03:06:17,040 --> 03:06:19,279
my predictions
4871
03:06:19,279 --> 03:06:21,040
linear regression
4872
03:06:21,040 --> 03:06:23,359
okay
4873
03:06:24,479 --> 03:06:26,399
i'm actually going to do
4874
03:06:26,399 --> 03:06:29,439
that at the bottom so
4875
03:06:29,439 --> 03:06:31,600
let me just copy and paste that cell
4876
03:06:31,600 --> 03:06:33,600
and bring it down here so now i'm going
4877
03:06:33,600 --> 03:06:36,800
to calculate the mean squared error
4878
03:06:36,800 --> 03:06:38,880
for both
4879
03:06:38,880 --> 03:06:40,960
um the linear
4880
03:06:40,960 --> 03:06:43,359
regressor and the neural net okay so
4881
03:06:43,359 --> 03:06:44,479
this is
4882
03:06:44,479 --> 03:06:45,439
my
4883
03:06:45,439 --> 03:06:50,000
linear and this is my neural net so
4884
03:06:50,000 --> 03:06:52,000
if i use my neural net model and i
4885
03:06:52,000 --> 03:06:53,920
predict
4886
03:06:53,920 --> 03:06:55,760
x test all
4887
03:06:55,760 --> 03:06:58,319
i get my two you know different y
4888
03:06:58,319 --> 03:07:00,080
predictions
4889
03:07:00,080 --> 03:07:01,359
and
4890
03:07:01,359 --> 03:07:02,720
um
4891
03:07:02,720 --> 03:07:04,880
i can calculate the mean squared error
4892
03:07:04,880 --> 03:07:06,319
right so
4893
03:07:06,319 --> 03:07:08,000
if i want to get the mean squared error
4894
03:07:08,000 --> 03:07:10,479
and i have y
4895
03:07:10,479 --> 03:07:12,160
prediction and y
4896
03:07:12,160 --> 03:07:13,359
real
4897
03:07:13,359 --> 03:07:14,560
i can do
4898
03:07:14,560 --> 03:07:16,560
numpy dot square and then i would need
4899
03:07:16,560 --> 03:07:19,840
the y prediction minus you know the real
4900
03:07:19,840 --> 03:07:21,600
so this this is basically squaring
4901
03:07:21,600 --> 03:07:22,880
everything
4902
03:07:22,880 --> 03:07:26,160
um and
4903
03:07:26,160 --> 03:07:28,399
this should be
4904
03:07:28,399 --> 03:07:30,720
a vector so if i
4905
03:07:30,720 --> 03:07:33,359
just take this entire thing and take the
4906
03:07:33,359 --> 03:07:34,560
mean
4907
03:07:34,560 --> 03:07:35,840
of that
4908
03:07:35,840 --> 03:07:38,240
that should give me the mse so let's
4909
03:07:38,240 --> 03:07:41,200
let's just try that out
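[editor's note: her mean squared error recipe, square the element-wise differences and then take the mean of that vector, is just]

```python
import numpy as np

def mse(y_pred, y_real):
    # square every element of the difference vector, then average
    return np.square(y_pred - y_real).mean()

# quick sanity check on known values: (0 + 0 + 4) / 3
result = mse(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 5.0]))
```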
4910
03:07:44,880 --> 03:07:48,640
and the y real is y test
4911
03:07:48,640 --> 03:07:49,680
all
4912
03:07:49,680 --> 03:07:51,359
right so that's my mean squared error
4913
03:07:51,359 --> 03:07:53,279
for the linear regressor
4914
03:07:53,279 --> 03:07:57,359
and this is my mean squared error for
4915
03:07:57,359 --> 03:08:00,080
the neural net
4916
03:08:02,000 --> 03:08:04,720
so that's interesting uh i will debug
4917
03:08:04,720 --> 03:08:06,880
this live i guess
4918
03:08:06,880 --> 03:08:08,560
so my guess is that it's probably coming
4919
03:08:08,560 --> 03:08:09,359
from
4920
03:08:09,359 --> 03:08:11,439
this normalization
4921
03:08:11,439 --> 03:08:12,479
layer
4922
03:08:12,479 --> 03:08:14,160
um
4923
03:08:14,160 --> 03:08:16,160
because this input shape is
4924
03:08:16,160 --> 03:08:17,359
probably just
4925
03:08:17,359 --> 03:08:20,359
six
4926
03:08:25,600 --> 03:08:27,520
and
4927
03:08:27,520 --> 03:08:28,880
okay so
4928
03:08:28,880 --> 03:08:30,960
that works now
4929
03:08:30,960 --> 03:08:33,600
and the reason why is because
4930
03:08:33,600 --> 03:08:35,840
like my inputs are only
4931
03:08:35,840 --> 03:08:37,200
for every vector it's only a one
4932
03:08:37,200 --> 03:08:39,520
dimensional vector of length six so i
4933
03:08:39,520 --> 03:08:42,080
should have i should have just had six
4934
03:08:42,080 --> 03:08:44,479
comma which is a tuple of size six from
4935
03:08:44,479 --> 03:08:46,399
the start or it's a tuple
4936
03:08:46,399 --> 03:08:50,080
containing one element which is a six
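[editor's note: the fix she lands on, giving the normalization layer an input shape of (6,), a one-element tuple containing the six, looks something like this; random data stands in for the six bike-sharing features]

```python
import numpy as np
import tensorflow as tf

X = np.random.default_rng(0).uniform(0, 100, size=(50, 6))  # 6 features per sample

# each input vector is one-dimensional with length six -> input_shape=(6,)
all_normalizer = tf.keras.layers.Normalization(input_shape=(6,), axis=-1)
all_normalizer.adapt(X)

# after adapting, each feature column comes out with roughly zero mean
out = all_normalizer(X).numpy()
```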
4937
03:08:50,080 --> 03:08:52,160
okay so it's actually interesting that
4938
03:08:52,160 --> 03:08:53,200
my
4939
03:08:53,200 --> 03:08:56,640
uh neural net results seem like they
4940
03:08:56,640 --> 03:08:58,720
they have a larger mean squared error
4941
03:08:58,720 --> 03:09:01,200
than my linear regressor
4942
03:09:01,200 --> 03:09:06,240
um one thing that we can look at is
4943
03:09:06,240 --> 03:09:09,680
we can actually plot the real versus you
4944
03:09:09,680 --> 03:09:13,279
know the actual results versus
4945
03:09:13,279 --> 03:09:15,680
what the predictions are so
4946
03:09:15,680 --> 03:09:18,080
if i say
4947
03:09:18,080 --> 03:09:20,240
some axis and i use
4948
03:09:20,240 --> 03:09:22,080
plt.axes
4949
03:09:22,080 --> 03:09:23,439
and make these
4950
03:09:23,439 --> 03:09:25,279
equal
4951
03:09:25,279 --> 03:09:27,520
then i can scatter
4952
03:09:27,520 --> 03:09:28,960
the the y
4953
03:09:28,960 --> 03:09:31,120
you know the test so what the actual
4954
03:09:31,120 --> 03:09:33,120
values are on the x-axis and then what
4955
03:09:33,120 --> 03:09:34,640
the predictions
4956
03:09:34,640 --> 03:09:36,000
are on the
4957
03:09:36,000 --> 03:09:37,760
y-axis
4958
03:09:37,760 --> 03:09:38,800
okay
4959
03:09:38,800 --> 03:09:42,319
uh and i can label this as the linear
4960
03:09:42,319 --> 03:09:46,080
regression predictions
4961
03:09:47,520 --> 03:09:49,520
okay so then let me just label my axes
4962
03:09:49,520 --> 03:09:51,840
so the x-axis i'm going to say is the
4963
03:09:51,840 --> 03:09:54,560
true values
4964
03:09:54,560 --> 03:09:57,920
the y-axis is going to be my linear
4965
03:09:57,920 --> 03:10:01,279
regression predictions
4966
03:10:04,240 --> 03:10:07,600
or actually let's plot
4967
03:10:07,600 --> 03:10:11,200
let's just make this predictions
4968
03:10:11,359 --> 03:10:14,560
and then at the end
4969
03:10:14,560 --> 03:10:17,680
i'm going to plot
4970
03:10:17,680 --> 03:10:20,160
oh let's set some limits
4971
03:10:20,160 --> 03:10:22,399
uh
4972
03:10:22,800 --> 03:10:24,399
because i think that's like
4973
03:10:24,399 --> 03:10:28,160
approximately the max number of bikes
4974
03:10:28,560 --> 03:10:29,359
so
4975
03:10:29,359 --> 03:10:32,560
i'm going to set my x limit to this and
4976
03:10:32,560 --> 03:10:34,640
my y limit
4977
03:10:34,640 --> 03:10:36,240
to this
4978
03:10:36,240 --> 03:10:38,160
so here i'm going to pass that in here
4979
03:10:38,160 --> 03:10:40,640
too
4980
03:10:40,640 --> 03:10:43,040
and
4981
03:10:43,120 --> 03:10:44,160
all right
4982
03:10:44,160 --> 03:10:47,359
this is what we actually get for our
4983
03:10:47,359 --> 03:10:49,920
linear regressor
4984
03:10:49,920 --> 03:10:52,319
you see that actually they align quite
4985
03:10:52,319 --> 03:10:55,520
well i mean to some extent so 2000 is
4986
03:10:55,520 --> 03:10:57,200
probably too much
4987
03:10:57,200 --> 03:10:59,200
2500 i mean
4988
03:10:59,200 --> 03:11:00,319
looks like
4989
03:11:00,319 --> 03:11:03,359
maybe like 1800 would be enough here for
4990
03:11:03,359 --> 03:11:04,720
our limits
4991
03:11:04,720 --> 03:11:06,720
um
4992
03:11:06,720 --> 03:11:09,359
and i'm actually going to
4993
03:11:09,359 --> 03:11:11,279
label something else the neural net
4994
03:11:11,279 --> 03:11:14,000
predictions
4995
03:11:15,760 --> 03:11:18,479
let's add a legend
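[editor's note: the true-versus-predicted scatter she assembles, equal axes, shared limits around the maximum bike count, one series per model, might be sketched like this; the y_test and prediction arrays are hypothetical stand-ins for the notebook's]

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
y_test = rng.uniform(0, 1800, 100)            # stand-in true bike counts
y_pred_lr = y_test + rng.normal(0, 100, 100)  # stand-in linear regression predictions
y_pred_nn = y_test + rng.normal(0, 150, 100)  # stand-in neural net predictions

ax = plt.axes(aspect="equal")  # equal axes so a perfect model lies on y = x
ax.scatter(y_test, y_pred_lr, label="Lin Reg Preds")
ax.scatter(y_test, y_pred_nn, label="NN Preds")
ax.set_xlabel("True Values")
ax.set_ylabel("Predictions")
lims = [0, 1800]  # roughly the max bike count seen in the data
ax.set_xlim(lims)
ax.set_ylim(lims)
ax.plot(lims, lims, c="red")  # the y = x reference line
ax.legend()
```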
4996
03:11:18,479 --> 03:11:21,439
so you you can see that our neural net
4997
03:11:21,439 --> 03:11:23,040
for the
4998
03:11:23,040 --> 03:11:25,120
larger values it seems like it's a
4999
03:11:25,120 --> 03:11:27,680
little bit more spread out and it seems
5000
03:11:27,680 --> 03:11:28,800
like
5001
03:11:28,800 --> 03:11:31,040
we tend to underestimate a little bit
5002
03:11:31,040 --> 03:11:34,319
down here in this area
5003
03:11:34,840 --> 03:11:36,720
okay
5004
03:11:36,720 --> 03:11:38,960
and for some reason these are way off as
5005
03:11:38,960 --> 03:11:40,800
well
5006
03:11:40,800 --> 03:11:43,200
but yeah so we've basically used a
5007
03:11:43,200 --> 03:11:46,479
linear regressor and a neural net um
5008
03:11:46,479 --> 03:11:48,399
honestly there are some times where a
5009
03:11:48,399 --> 03:11:49,840
neural net is more appropriate and a
5010
03:11:49,840 --> 03:11:52,080
linear regressor is more appropriate
5011
03:11:52,080 --> 03:11:54,720
i think that it just comes with time and
5012
03:11:54,720 --> 03:11:57,040
trying to figure out you know and just
5013
03:11:57,040 --> 03:11:58,640
literally seeing like hey what works
5014
03:11:58,640 --> 03:12:00,720
better like here a multiple
5015
03:12:00,720 --> 03:12:02,319
linear regressor might actually work
5016
03:12:02,319 --> 03:12:04,960
better than a neural net but for example
5017
03:12:04,960 --> 03:12:07,279
with the one-dimensional case
5018
03:12:07,279 --> 03:12:09,359
a linear regressor would never be able
5019
03:12:09,359 --> 03:12:11,200
to see this curve
5020
03:12:11,200 --> 03:12:13,040
okay
5021
03:12:13,040 --> 03:12:14,720
i mean i'm not saying this is a great
5022
03:12:14,720 --> 03:12:16,840
model either but i'm just saying
5023
03:12:16,840 --> 03:12:19,520
like hey you know
5024
03:12:19,520 --> 03:12:21,120
sometimes it might be more appropriate
5025
03:12:21,120 --> 03:12:25,040
to use something that's not linear
5026
03:12:25,279 --> 03:12:29,760
so yeah i will leave regression at that
5027
03:12:29,760 --> 03:12:31,920
okay so we just talked about supervised
5028
03:12:31,920 --> 03:12:33,040
learning
5029
03:12:33,040 --> 03:12:35,439
and in supervised learning we have data
5030
03:12:35,439 --> 03:12:37,680
we have some a bunch of features and for
5031
03:12:37,680 --> 03:12:39,760
a bunch of different samples but each of
5032
03:12:39,760 --> 03:12:41,600
those samples has some sort of label on
5033
03:12:41,600 --> 03:12:44,240
it whether that's a number a category a
5034
03:12:44,240 --> 03:12:47,760
class etc right we were able to use that
5035
03:12:47,760 --> 03:12:49,760
label in order to try to predict new
5036
03:12:49,760 --> 03:12:52,160
labels of other points that we haven't
5037
03:12:52,160 --> 03:12:54,000
seen yet
5038
03:12:54,000 --> 03:12:56,880
well now let's move on to unsupervised
5039
03:12:56,880 --> 03:12:57,920
learning
5040
03:12:57,920 --> 03:13:00,160
so with unsupervised learning we have a
5041
03:13:00,160 --> 03:13:02,960
bunch of unlabeled data
5042
03:13:02,960 --> 03:13:04,640
and what can we do with that you know
5043
03:13:04,640 --> 03:13:09,439
can we learn anything from this data
5044
03:13:09,600 --> 03:13:10,800
so the first algorithm that we're going
5045
03:13:10,800 --> 03:13:12,960
to discuss is known as k means
5046
03:13:12,960 --> 03:13:15,520
clustering what k means clustering is
5047
03:13:15,520 --> 03:13:19,439
trying to do is it's trying to compute
5048
03:13:19,439 --> 03:13:20,720
k
5049
03:13:20,720 --> 03:13:21,920
clusters
5050
03:13:21,920 --> 03:13:24,640
from the data
5051
03:13:25,680 --> 03:13:28,640
so in this example below i have a bunch
5052
03:13:28,640 --> 03:13:30,720
of scattered points and you'll see that
5053
03:13:30,720 --> 03:13:32,319
this is
5054
03:13:32,319 --> 03:13:35,520
x0 and x1 on the two axes which means
5055
03:13:35,520 --> 03:13:37,520
i'm actually plotting two different
5056
03:13:37,520 --> 03:13:38,560
features
5057
03:13:38,560 --> 03:13:40,560
right of each point but we don't know
5058
03:13:40,560 --> 03:13:43,680
what the y label is for those points
5059
03:13:43,680 --> 03:13:46,880
and now just looking at these scattered
5060
03:13:46,880 --> 03:13:48,399
points
5061
03:13:48,399 --> 03:13:50,080
we can kind of see how there are
5062
03:13:50,080 --> 03:13:53,040
different clusters in the data set right
5063
03:13:53,040 --> 03:13:55,359
so depending on what we pick for k we
5064
03:13:55,359 --> 03:13:58,560
might have different clusters
5065
03:13:58,560 --> 03:14:01,760
let's say k equals two right then we
5066
03:14:01,760 --> 03:14:03,760
might pick okay this seems like it could
5067
03:14:03,760 --> 03:14:05,920
be one cluster but this here is also
5068
03:14:05,920 --> 03:14:07,600
another cluster so those might be our
5069
03:14:07,600 --> 03:14:10,000
two different clusters
5070
03:14:10,000 --> 03:14:13,040
if we have k equals three
5071
03:14:13,040 --> 03:14:15,200
for example then okay this seems like it
5072
03:14:15,200 --> 03:14:16,880
could be a cluster
5073
03:14:16,880 --> 03:14:18,479
this seems like it could be a cluster
5074
03:14:18,479 --> 03:14:20,640
and maybe this could be a cluster right
5075
03:14:20,640 --> 03:14:22,239
so we could have three different
5076
03:14:22,239 --> 03:14:25,279
clusters in the data set
5077
03:14:25,279 --> 03:14:29,200
now this k here is predefined
5078
03:14:29,200 --> 03:14:32,239
if i can spell that correctly
5079
03:14:32,239 --> 03:14:35,359
by the person who's running the model so
5080
03:14:35,359 --> 03:14:38,560
that would be you
5081
03:14:38,560 --> 03:14:39,680
all right
5082
03:14:39,680 --> 03:14:41,520
and let's discuss how you know the
5083
03:14:41,520 --> 03:14:43,200
computer actually goes through and
5084
03:14:43,200 --> 03:14:44,479
computes
5085
03:14:44,479 --> 03:14:47,120
the k clusters
5086
03:14:47,120 --> 03:14:48,880
so i'm going to write those steps down
5087
03:14:48,880 --> 03:14:51,880
here
5088
03:14:52,560 --> 03:14:55,680
now the first step that happens is we
5089
03:14:55,680 --> 03:14:56,720
actually
5090
03:14:56,720 --> 03:14:59,040
choose well the computer
5091
03:14:59,040 --> 03:15:02,479
chooses three random points
5092
03:15:02,479 --> 03:15:04,640
on this plot
5093
03:15:04,640 --> 03:15:07,840
to be the centroids
5094
03:15:08,319 --> 03:15:10,319
and by centroids i just mean the center
5095
03:15:10,319 --> 03:15:13,120
of the clusters okay
5096
03:15:13,120 --> 03:15:15,040
so three random points let's say we're
5097
03:15:15,040 --> 03:15:16,560
doing k equals three so we're choosing
5098
03:15:16,560 --> 03:15:18,640
three random points to be the centroids
5099
03:15:18,640 --> 03:15:20,399
of the three clusters if it were two
5100
03:15:20,399 --> 03:15:22,880
we'd be choosing two random points
5101
03:15:22,880 --> 03:15:24,319
okay
5102
03:15:24,319 --> 03:15:26,080
so maybe the three random points i'm
5103
03:15:26,080 --> 03:15:29,680
choosing might be here
5104
03:15:29,760 --> 03:15:30,240
here
5105
03:15:30,240 --> 03:15:32,560
[Music]
5106
03:15:32,560 --> 03:15:35,279
and here
5107
03:15:35,279 --> 03:15:36,720
all right
5108
03:15:36,720 --> 03:15:37,680
so we have
5109
03:15:37,680 --> 03:15:39,600
three different points
5110
03:15:39,600 --> 03:15:43,680
and the second thing that we do
5111
03:15:44,319 --> 03:15:48,080
is we actually calculate
5112
03:15:48,080 --> 03:15:50,640
the distance
5113
03:15:50,640 --> 03:15:52,160
for each point
5114
03:15:52,160 --> 03:15:55,359
to those centroids
5115
03:15:55,359 --> 03:15:57,680
so between all the points
5116
03:15:57,680 --> 03:16:00,560
and the centroid
5117
03:16:01,279 --> 03:16:03,359
so basically i'm saying all right this
5118
03:16:03,359 --> 03:16:05,200
is this distance this is this distance
5119
03:16:05,200 --> 03:16:07,520
this is this distance
5120
03:16:07,520 --> 03:16:09,600
all of these distances i'm computing
5121
03:16:09,600 --> 03:16:12,000
between oops not those two
5122
03:16:12,000 --> 03:16:13,600
between the points not the centroids
5123
03:16:13,600 --> 03:16:14,640
themselves
5124
03:16:14,640 --> 03:16:16,800
so i'm computing the distances for all
5125
03:16:16,800 --> 03:16:20,000
of these plots to each of the centroids
5126
03:16:20,000 --> 03:16:21,279
okay
5127
03:16:21,279 --> 03:16:23,840
and that
5128
03:16:24,000 --> 03:16:26,000
comes with also
5129
03:16:26,000 --> 03:16:28,479
assigning
5130
03:16:28,479 --> 03:16:32,560
those points to the closest centroid
5131
03:16:34,720 --> 03:16:37,439
what do i mean by that
5132
03:16:37,439 --> 03:16:38,160
so
5133
03:16:38,160 --> 03:16:39,680
let's take
5134
03:16:39,680 --> 03:16:41,359
this point here for example so i'm
5135
03:16:41,359 --> 03:16:44,080
computing this distance this distance
5136
03:16:44,080 --> 03:16:45,840
and this distance and i'm saying okay it
5137
03:16:45,840 --> 03:16:48,399
seems like the red one is the closest so
5138
03:16:48,399 --> 03:16:50,560
i'm actually going to put this into the
5139
03:16:50,560 --> 03:16:51,840
red
5140
03:16:51,840 --> 03:16:53,120
centroid
5141
03:16:53,120 --> 03:16:58,040
so if i do that for all of these points
5142
03:16:59,760 --> 03:17:01,520
this one seems slightly closer to red
5143
03:17:01,520 --> 03:17:02,960
and this one seems slightly closer to
5144
03:17:02,960 --> 03:17:04,000
red
5145
03:17:04,000 --> 03:17:05,520
right
5146
03:17:05,520 --> 03:17:07,200
now for the blue
5147
03:17:07,200 --> 03:17:09,359
i actually wouldn't
5148
03:17:09,359 --> 03:17:11,680
put any blue ones in here but
5149
03:17:11,680 --> 03:17:14,319
we would probably actually that first
5150
03:17:14,319 --> 03:17:17,359
one is closer to red
5151
03:17:17,840 --> 03:17:20,560
and now it seems like the rest of them
5152
03:17:20,560 --> 03:17:23,279
are probably closer to green
5153
03:17:23,279 --> 03:17:25,359
so let's just put all of these into
5154
03:17:25,359 --> 03:17:27,520
green here
5155
03:17:27,520 --> 03:17:29,279
like that
5156
03:17:29,279 --> 03:17:30,560
and
5157
03:17:30,560 --> 03:17:32,319
cool so now we have
5158
03:17:32,319 --> 03:17:34,720
you know our two three technically
5159
03:17:34,720 --> 03:17:37,760
centroids so there's this group here
5160
03:17:37,760 --> 03:17:40,479
there's this group here
5161
03:17:40,479 --> 03:17:43,200
and then blue is kind of just this group
5162
03:17:43,200 --> 03:17:45,040
here it hasn't really touched any of the
5163
03:17:45,040 --> 03:17:47,600
points yet
5164
03:17:47,600 --> 03:17:49,359
so the next step
5165
03:17:49,359 --> 03:17:51,120
three that we do
5166
03:17:51,120 --> 03:17:54,479
is we actually go and we recalculate the
5167
03:17:54,479 --> 03:17:56,960
centroid so we compute
5168
03:17:56,960 --> 03:17:59,920
new centroids
5169
03:18:00,160 --> 03:18:01,920
based on the points that we have in all
5170
03:18:01,920 --> 03:18:03,920
the centroids
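[editor's note: the steps she writes out, pick k random data points as centroids, assign every point to its closest centroid, recompute each centroid as the mean of its assigned points, and repeat until nothing changes, can be sketched in plain numpy]

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # step 2: distance from every point to every centroid,
        # then assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: assignments no longer move the centroids
        centroids = new_centroids
    return centroids, labels

# two well-separated blobs should be recovered with k = 2
X = np.vstack([np.random.default_rng(1).normal(0, 0.5, (50, 2)),
               np.random.default_rng(2).normal(10, 0.5, (50, 2))])
centroids, labels = k_means(X, k=2)
```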
5171
03:18:03,920 --> 03:18:07,279
and by that i just mean okay well
5172
03:18:07,279 --> 03:18:08,880
let's take the average of all these
5173
03:18:08,880 --> 03:18:11,520
points and where is that new centroid
5174
03:18:11,520 --> 03:18:12,960
that's probably going to be somewhere
5175
03:18:12,960 --> 03:18:15,520
around here right the blue one we don't
5176
03:18:15,520 --> 03:18:16,560
have any points in there so we won't
5177
03:18:16,560 --> 03:18:19,040
touch and then the green one
5178
03:18:19,040 --> 03:18:20,800
we can put that
5179
03:18:20,800 --> 03:18:25,200
hmm probably somewhere over here oops
5180
03:18:25,200 --> 03:18:28,560
somewhere over here
5181
03:18:28,560 --> 03:18:32,319
right so now if i erase
5182
03:18:32,960 --> 03:18:34,560
all of the
5183
03:18:34,560 --> 03:18:38,160
previously computed centroids
5184
03:18:38,160 --> 03:18:41,840
i can go and i can actually redo step
5185
03:18:41,840 --> 03:18:42,800
two
5186
03:18:42,800 --> 03:18:45,200
over here this calculation
5187
03:18:45,200 --> 03:18:46,720
all right so i'm going to go back and
5188
03:18:46,720 --> 03:18:48,080
i'm going to iterate through everything
5189
03:18:48,080 --> 03:18:49,680
again and i'm going to recompute my
5190
03:18:49,680 --> 03:18:52,720
three centroids so
5191
03:18:52,880 --> 03:18:54,720
let's see we're going to take this red
5192
03:18:54,720 --> 03:18:58,160
point these are definitely all red right
5193
03:18:58,160 --> 03:19:00,239
this one still looks a bit
5194
03:19:00,239 --> 03:19:01,439
red
5195
03:19:01,439 --> 03:19:03,600
now
5196
03:19:03,600 --> 03:19:05,439
this part we actually start getting
5197
03:19:05,439 --> 03:19:08,080
closer to the blues
5198
03:19:08,080 --> 03:19:10,479
so this one still seems closer to a blue
5199
03:19:10,479 --> 03:19:12,000
than a green
5200
03:19:12,000 --> 03:19:14,560
this one as well
5201
03:19:14,560 --> 03:19:15,760
and
5202
03:19:15,760 --> 03:19:19,840
i think the rest would belong to green
5203
03:19:21,920 --> 03:19:25,040
okay so now our three centroids or three
5204
03:19:25,040 --> 03:19:28,720
sorry our three clusters would be this
5205
03:19:28,720 --> 03:19:30,840
is
5206
03:19:30,840 --> 03:19:32,960
this and then
5207
03:19:32,960 --> 03:19:33,840
this
5208
03:19:33,840 --> 03:19:35,279
right
5209
03:19:35,279 --> 03:19:38,560
those are our three centroids
5210
03:19:38,720 --> 03:19:40,800
and so now we go back and we compute the
5211
03:19:40,800 --> 03:19:42,319
new sorry those would be the three
5212
03:19:42,319 --> 03:19:43,600
clusters so now we go back and we
5213
03:19:43,600 --> 03:19:46,080
compute the three centroids so i'm going
5214
03:19:46,080 --> 03:19:48,880
to get rid of this this and this
5215
03:19:48,880 --> 03:19:51,520
and now where would this red be centered
5216
03:19:51,520 --> 03:19:54,160
probably closer you know to this point
5217
03:19:54,160 --> 03:19:55,120
here
5218
03:19:55,120 --> 03:19:59,439
this blue might be closer to up here
5219
03:19:59,439 --> 03:20:02,720
and then this green would probably be
5220
03:20:02,720 --> 03:20:04,160
somewhere
5221
03:20:04,160 --> 03:20:05,439
it's pretty similar to what we had
5222
03:20:05,439 --> 03:20:06,479
before
5223
03:20:06,479 --> 03:20:08,239
but it seems like it'd be pulled down a
5224
03:20:08,239 --> 03:20:10,239
bit so probably somewhere around there
5225
03:20:10,239 --> 03:20:12,399
for green
5226
03:20:12,399 --> 03:20:17,600
all right and now again we go back and
5227
03:20:17,600 --> 03:20:19,920
we compute
5228
03:20:19,920 --> 03:20:21,760
the distance between all the points and
5229
03:20:21,760 --> 03:20:23,359
the centroids and then we assign them to
5230
03:20:23,359 --> 03:20:25,200
the closest centroid okay
5231
03:20:25,200 --> 03:20:26,560
so
5232
03:20:26,560 --> 03:20:30,239
the reds are all here it's very clear
5233
03:20:30,239 --> 03:20:33,520
actually let me just circle that
5234
03:20:33,520 --> 03:20:34,880
and this
5235
03:20:34,880 --> 03:20:36,000
um
5236
03:20:36,000 --> 03:20:37,200
it actually seems like this point is
5237
03:20:37,200 --> 03:20:39,120
closer to this blue now
5238
03:20:39,120 --> 03:20:40,000
so
5239
03:20:40,000 --> 03:20:40,840
the
5240
03:20:40,840 --> 03:20:45,600
blues seem like they would be maybe
5241
03:20:45,600 --> 03:20:47,760
this point looks like it'd be blue so
5242
03:20:47,760 --> 03:20:49,120
all these look like they would be blue
5243
03:20:49,120 --> 03:20:50,080
now
5244
03:20:50,080 --> 03:20:52,479
and the greens would probably be this
5245
03:20:52,479 --> 03:20:54,239
cluster right here
5246
03:20:54,239 --> 03:20:56,399
so we go back we compute
5247
03:20:56,399 --> 03:20:59,120
the uh centroids
5248
03:20:59,120 --> 03:21:01,439
bam
5249
03:21:01,680 --> 03:21:02,720
this one
5250
03:21:02,720 --> 03:21:05,359
probably like almost here bam
5251
03:21:05,359 --> 03:21:06,800
and then the green
5252
03:21:06,800 --> 03:21:10,479
looks like it would be probably
5253
03:21:10,479 --> 03:21:13,439
um here-ish
5254
03:21:13,600 --> 03:21:14,800
okay
5255
03:21:14,800 --> 03:21:17,920
and now we go back and we compute
5256
03:21:17,920 --> 03:21:18,960
the
5257
03:21:18,960 --> 03:21:19,920
we
5258
03:21:19,920 --> 03:21:22,479
compute the clusters again
5259
03:21:22,479 --> 03:21:23,359
so
5260
03:21:23,359 --> 03:21:26,000
red still this
5261
03:21:26,000 --> 03:21:27,120
blue
5262
03:21:27,120 --> 03:21:30,960
i would argue is now this cluster here
5263
03:21:30,960 --> 03:21:32,160
and green
5264
03:21:32,160 --> 03:21:35,279
is this cluster here okay so we go and
5265
03:21:35,279 --> 03:21:38,479
we recompute
5266
03:21:38,479 --> 03:21:42,000
the centroids bam
5267
03:21:42,560 --> 03:21:44,720
bam
5268
03:21:44,720 --> 03:21:45,920
and
5269
03:21:45,920 --> 03:21:47,760
you know bam
5270
03:21:47,760 --> 03:21:49,600
and now if i were to go
5271
03:21:49,600 --> 03:21:51,200
and assign all the points to clusters
5272
03:21:51,200 --> 03:21:54,399
again i would get the exact same thing
5273
03:21:54,399 --> 03:21:56,880
right and so that's when we know that we
5274
03:21:56,880 --> 03:21:59,279
can stop iterating between steps two and
5275
03:21:59,279 --> 03:22:02,080
three is when we've converged on some
5276
03:22:02,080 --> 03:22:04,880
solution when we've reached some stable
5277
03:22:04,880 --> 03:22:07,439
point and so now because none of these
5278
03:22:07,439 --> 03:22:08,720
points are really changing out of their
5279
03:22:08,720 --> 03:22:10,800
clusters anymore we can go back to the
5280
03:22:10,800 --> 03:22:12,880
user and say hey these are our three
5281
03:22:12,880 --> 03:22:14,319
clusters
5282
03:22:14,319 --> 03:22:18,319
okay and this process
5283
03:22:18,319 --> 03:22:20,640
is something known as
5284
03:22:20,640 --> 03:22:24,479
expectation maximization
5285
03:22:30,080 --> 03:22:31,840
this part where we're assigning the
5286
03:22:31,840 --> 03:22:33,520
points to the closest centroid this is
5287
03:22:33,520 --> 03:22:35,680
our
5288
03:22:35,680 --> 03:22:37,200
expectation
5289
03:22:37,200 --> 03:22:39,520
step
5290
03:22:39,760 --> 03:22:41,439
and this part where we're computing the
5291
03:22:41,439 --> 03:22:43,120
new centroids
5292
03:22:43,120 --> 03:22:45,439
this is our
5293
03:22:45,439 --> 03:22:48,439
maximization
5294
03:22:49,520 --> 03:22:50,880
step
5295
03:22:50,880 --> 03:22:52,560
okay so that's
5296
03:22:52,560 --> 03:22:54,960
expectation maximization
5297
03:22:54,960 --> 03:22:57,279
and we use this in order to
5298
03:22:57,279 --> 03:22:58,880
compute
5299
03:22:58,880 --> 03:23:01,680
the centroids assign all the points to
5300
03:23:01,680 --> 03:23:04,479
clusters according to those centroids
5301
03:23:04,479 --> 03:23:06,399
and then we're recomputing all that over
5302
03:23:06,399 --> 03:23:08,560
again until we reach some stable point
5303
03:23:08,560 --> 03:23:11,359
where nothing is changing anymore
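The loop just described — expectation (assign every point to its closest centroid), maximization (recompute each centroid as the mean of its points), repeat until nothing changes — can be sketched in a few lines of NumPy. This is a minimal illustration, not the video's code; the function name and the optional `init` argument are my own:

```python
import numpy as np

def kmeans(X, k, init=None, max_iter=100):
    """Toy k-means via expectation-maximization on an (n, d) array."""
    rng = np.random.default_rng(0)
    centroids = (np.asarray(init, dtype=float) if init is not None
                 else X[rng.choice(len(X), k, replace=False)])
    for _ in range(max_iter):
        # expectation step: assign every point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # maximization step: recompute each centroid as the mean of its points
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        # converged: the centroids (and hence the assignments) stopped changing
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

On three well-separated pairs of points this converges in a couple of iterations, returning one centroid per pair — the same stable point the drawing above reaches.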
5304
03:23:11,359 --> 03:23:14,800
all right so that's our first example of
5305
03:23:14,800 --> 03:23:16,880
unsupervised learning and basically what
5306
03:23:16,880 --> 03:23:18,560
this is doing is trying to find some
5307
03:23:18,560 --> 03:23:21,439
structure some pattern in the data so if
5308
03:23:21,439 --> 03:23:24,000
i came up with another point
5309
03:23:24,000 --> 03:23:25,600
you know might be somewhere here i can
5310
03:23:25,600 --> 03:23:28,319
say oh looks like that's closer to
5311
03:23:28,319 --> 03:23:29,600
um
5312
03:23:29,600 --> 03:23:31,920
if this is a b c it looks like that's
5313
03:23:31,920 --> 03:23:34,160
closest to cluster b and so i would
5314
03:23:34,160 --> 03:23:36,479
probably put it in cluster b
5315
03:23:36,479 --> 03:23:38,239
okay so we can find some structure in
5316
03:23:38,239 --> 03:23:41,200
the data based on just how
5317
03:23:41,200 --> 03:23:43,040
how the points are
5318
03:23:43,040 --> 03:23:45,840
scattered relative to one another
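Assigning a fresh point to an existing cluster is just a nearest-centroid lookup. A small sketch with made-up centroids standing in for the hypothetical clusters a, b, c from the drawing:

```python
import numpy as np

def closest_cluster(point, centroids):
    """Index of the centroid nearest to the point (Euclidean distance)."""
    return int(np.linalg.norm(centroids - point, axis=1).argmin())

centroids = np.array([[0.0, 0.0],    # cluster a
                      [5.0, 5.0],    # cluster b
                      [10.0, 0.0]])  # cluster c
closest_cluster(np.array([4.0, 6.0]), centroids)  # → 1, i.e. cluster b
```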
5319
03:23:45,840 --> 03:23:47,680
now the second unsupervised learning
5320
03:23:47,680 --> 03:23:49,040
technique that i'm going to discuss with
5321
03:23:49,040 --> 03:23:50,960
you guys something known as principal
5322
03:23:50,960 --> 03:23:53,200
component analysis
5323
03:23:53,200 --> 03:23:54,640
and the point of principal component
5324
03:23:54,640 --> 03:23:56,560
analysis is
5325
03:23:56,560 --> 03:23:59,279
very often it's used as a dimensionality
5326
03:23:59,279 --> 03:24:01,120
reduction technique
5327
03:24:01,120 --> 03:24:02,399
so
5328
03:24:02,399 --> 03:24:05,120
let me write that down
5329
03:24:05,120 --> 03:24:10,000
it's used for dimensionality reduction
5330
03:24:10,319 --> 03:24:11,920
and what do i mean by dimensionality
5331
03:24:11,920 --> 03:24:14,479
reduction is if i have a bunch of
5332
03:24:14,479 --> 03:24:19,200
features like x1 x2 x3 x4 etc can i just
5333
03:24:19,200 --> 03:24:20,960
reduce that down to one
5334
03:24:20,960 --> 03:24:22,800
dimension that gives me the most
5335
03:24:22,800 --> 03:24:24,880
information about how all these points
5336
03:24:24,880 --> 03:24:26,960
are spread relative to one another
5337
03:24:26,960 --> 03:24:29,439
and that's what pca is for so pca
5338
03:24:29,439 --> 03:24:32,720
principal component analysis
5339
03:24:32,880 --> 03:24:35,760
let's say i have
5340
03:24:35,760 --> 03:24:38,000
some points
5341
03:24:38,000 --> 03:24:39,520
in
5342
03:24:39,520 --> 03:24:43,600
the x0 and x1 feature space
5343
03:24:43,600 --> 03:24:45,439
okay so
5344
03:24:45,439 --> 03:24:48,319
uh these points might be spread you know
5345
03:24:48,319 --> 03:24:50,880
something like
5346
03:24:50,880 --> 03:24:53,880
this
5347
03:24:59,600 --> 03:25:01,920
okay
5348
03:25:02,080 --> 03:25:02,800
so
5349
03:25:02,800 --> 03:25:04,800
for example if this were
5350
03:25:04,800 --> 03:25:06,399
um
5351
03:25:06,399 --> 03:25:08,640
something to do with housing prices
5352
03:25:08,640 --> 03:25:10,080
right
5353
03:25:10,080 --> 03:25:13,840
this here might be x0 might be hey uh
5354
03:25:13,840 --> 03:25:16,160
years
5355
03:25:16,160 --> 03:25:17,439
since
5356
03:25:17,439 --> 03:25:19,520
built right since the house was built
5357
03:25:19,520 --> 03:25:24,560
and x1 might be square footage
5358
03:25:25,439 --> 03:25:28,000
of the house
5359
03:25:28,000 --> 03:25:29,359
all right so like years since built i
5360
03:25:29,359 --> 03:25:31,680
mean like right now it's been you know
5361
03:25:31,680 --> 03:25:35,520
22 years since a house in 2000 was built
5362
03:25:35,520 --> 03:25:37,680
now principal component analysis is just
5363
03:25:37,680 --> 03:25:39,120
saying all right let's say we want to
5364
03:25:39,120 --> 03:25:40,720
build a model or let's say we want to
5365
03:25:40,720 --> 03:25:43,920
you know display something about
5366
03:25:43,920 --> 03:25:47,279
our data but we don't we don't have two
5367
03:25:47,279 --> 03:25:49,439
axes to show it on
5368
03:25:49,439 --> 03:25:52,319
how do we display you know
5369
03:25:52,319 --> 03:25:54,000
how do we how do we demonstrate that
5370
03:25:54,000 --> 03:25:56,479
this point is a further away from this
5371
03:25:56,479 --> 03:25:59,520
point than this point
5372
03:26:00,640 --> 03:26:02,720
and we can do that using principal
5373
03:26:02,720 --> 03:26:06,160
component analysis so
5374
03:26:06,319 --> 03:26:07,439
take what you know about linear
5375
03:26:07,439 --> 03:26:09,040
regression and just forget about it for
5376
03:26:09,040 --> 03:26:10,239
a second otherwise you might get
5377
03:26:10,239 --> 03:26:15,120
confused pca is a way of trying to
5378
03:26:15,120 --> 03:26:17,600
find the direction in the space
5379
03:26:17,600 --> 03:26:20,399
with the largest variance so this
5380
03:26:20,399 --> 03:26:23,520
principal component what that means
5381
03:26:23,520 --> 03:26:27,760
is basically the component
5382
03:26:29,040 --> 03:26:31,359
so some direction
5383
03:26:31,359 --> 03:26:33,840
in this space
5384
03:26:35,760 --> 03:26:38,880
with the largest
5385
03:26:38,880 --> 03:26:40,239
variance
5386
03:26:40,239 --> 03:26:41,439
okay
5387
03:26:41,439 --> 03:26:44,000
it tells us the most about
5388
03:26:44,000 --> 03:26:46,160
our data set without the two different
5389
03:26:46,160 --> 03:26:47,520
dimensions like let's say we have these
5390
03:26:47,520 --> 03:26:49,359
two different dimensions and somebody's
5391
03:26:49,359 --> 03:26:50,479
telling us hey you only get one
5392
03:26:50,479 --> 03:26:53,439
dimension in order to show your data set
5393
03:26:53,439 --> 03:26:56,080
what dimension like what do we do we
5394
03:26:56,080 --> 03:26:58,319
want to project our data onto a single
5395
03:26:58,319 --> 03:27:00,000
dimension
5396
03:27:00,000 --> 03:27:01,520
all right
5397
03:27:01,520 --> 03:27:03,439
so that in this case might be a
5398
03:27:03,439 --> 03:27:06,319
dimension that looks something like
5399
03:27:06,319 --> 03:27:08,399
this and you might say okay
5400
03:27:08,399 --> 03:27:09,680
we're not going to talk about linear
5401
03:27:09,680 --> 03:27:11,520
regression okay
5402
03:27:11,520 --> 03:27:13,439
we don't have a y value so in linear
5403
03:27:13,439 --> 03:27:16,160
regression this would be y this is not y
5404
03:27:16,160 --> 03:27:18,479
okay we don't have a label for that
5405
03:27:18,479 --> 03:27:20,960
instead what we're doing is we're taking
5406
03:27:20,960 --> 03:27:23,359
the right angle projection so all of
5407
03:27:23,359 --> 03:27:26,640
these take that's not very visible
5408
03:27:26,640 --> 03:27:29,920
but take this right angle projection
5409
03:27:29,920 --> 03:27:32,800
onto this line
5410
03:27:32,960 --> 03:27:36,319
and what pca is doing is saying okay map
5411
03:27:36,319 --> 03:27:37,760
all of these points onto this
5412
03:27:37,760 --> 03:27:39,439
one-dimensional space
5413
03:27:39,439 --> 03:27:40,840
so the
5414
03:27:40,840 --> 03:27:45,920
transformed data set would be here
5415
03:27:51,680 --> 03:27:53,840
this one's already on the
5416
03:27:53,840 --> 03:27:56,399
line so we just put that there
5417
03:27:56,399 --> 03:27:58,800
but now this would be our new
5418
03:27:58,800 --> 03:28:00,960
one-dimensional data set
5419
03:28:00,960 --> 03:28:03,680
okay it's not our prediction or anything
5420
03:28:03,680 --> 03:28:06,239
this is our new data set if somebody
5421
03:28:06,239 --> 03:28:08,000
came to us said you only get one
5422
03:28:08,000 --> 03:28:10,319
dimension you only get one number to
5423
03:28:10,319 --> 03:28:12,880
represent each of these 2d points
5424
03:28:12,880 --> 03:28:14,880
what number would you give me
5425
03:28:14,880 --> 03:28:16,239
this
5426
03:28:16,239 --> 03:28:18,239
would be the number
5427
03:28:18,239 --> 03:28:19,920
that we gave
5428
03:28:19,920 --> 03:28:21,200
okay
5429
03:28:21,200 --> 03:28:22,160
this
5430
03:28:22,160 --> 03:28:24,960
in this direction this is where our
5431
03:28:24,960 --> 03:28:27,840
points are the most spread out
5432
03:28:27,840 --> 03:28:30,960
right if i took this plot
5433
03:28:30,960 --> 03:28:33,200
and let me actually duplicate this so i
5434
03:28:33,200 --> 03:28:35,279
don't have to
5435
03:28:35,279 --> 03:28:36,840
rewrite
5436
03:28:36,840 --> 03:28:39,120
anything so i don't have to erase and
5437
03:28:39,120 --> 03:28:41,279
then redraw anything
5438
03:28:41,279 --> 03:28:45,760
um let me get rid of some of this stuff
5439
03:28:47,359 --> 03:28:48,960
and i just got rid of a point there too
5440
03:28:48,960 --> 03:28:52,760
so let me draw that back
5441
03:28:54,080 --> 03:28:55,200
all right
5442
03:28:55,200 --> 03:28:57,359
so if this were my original data point
5443
03:28:57,359 --> 03:28:59,760
what if i had taken you know
5444
03:28:59,760 --> 03:29:00,640
this
5445
03:29:00,640 --> 03:29:01,920
to be
5446
03:29:01,920 --> 03:29:04,319
the pca dimension
5447
03:29:04,319 --> 03:29:05,439
okay
5448
03:29:05,439 --> 03:29:06,720
well
5449
03:29:06,720 --> 03:29:07,920
i
5450
03:29:07,920 --> 03:29:10,640
then would have points
5451
03:29:10,640 --> 03:29:12,640
that
5452
03:29:12,640 --> 03:29:13,760
let me actually do that in different
5453
03:29:13,760 --> 03:29:15,439
color
5454
03:29:15,439 --> 03:29:17,520
so if i were to draw a right angle to
5455
03:29:17,520 --> 03:29:18,560
this
5456
03:29:18,560 --> 03:29:21,840
for every point
5457
03:29:23,359 --> 03:29:28,239
my points would look something like this
5458
03:29:33,359 --> 03:29:35,920
and so just intuitively looking at these
5459
03:29:35,920 --> 03:29:38,479
two different plots this top one and
5460
03:29:38,479 --> 03:29:40,880
this one we can see that the points are
5461
03:29:40,880 --> 03:29:43,359
squished a little bit closer together
5462
03:29:43,359 --> 03:29:45,680
right which means that the variance
5463
03:29:45,680 --> 03:29:47,279
that's not the space with the largest
5464
03:29:47,279 --> 03:29:48,800
variance
5465
03:29:48,800 --> 03:29:52,399
the thing about the largest variance
5466
03:29:52,479 --> 03:29:55,359
is that this will give us the most
5467
03:29:55,359 --> 03:29:57,439
discrimination between all of these
5468
03:29:57,439 --> 03:29:58,479
points
5469
03:29:58,479 --> 03:29:59,920
the larger the variance the further
5470
03:29:59,920 --> 03:30:02,399
spread out these points will likely be
5471
03:30:02,399 --> 03:30:03,279
now
5472
03:30:03,279 --> 03:30:05,120
and so that's the that's the dimension
5473
03:30:05,120 --> 03:30:07,760
that we should project it on
5474
03:30:07,760 --> 03:30:09,840
a different way to actually look at that
5475
03:30:09,840 --> 03:30:11,600
like what is the dimension with the
5476
03:30:11,600 --> 03:30:13,920
largest variance it's actually it also
5477
03:30:13,920 --> 03:30:16,880
happens to be the dimension that
5478
03:30:16,880 --> 03:30:19,279
decreases
5479
03:30:19,279 --> 03:30:21,040
that minimizes
5480
03:30:21,040 --> 03:30:23,680
the residuals so
5481
03:30:23,680 --> 03:30:26,080
if we take all the points and we take
5482
03:30:26,080 --> 03:30:29,040
the residual from that the xy residual
5483
03:30:29,040 --> 03:30:32,319
so in linear regression
5484
03:30:32,399 --> 03:30:34,080
in linear regression we were looking
5485
03:30:34,080 --> 03:30:35,920
only at this residual the differences
5486
03:30:35,920 --> 03:30:37,920
between the predictions right between y
5487
03:30:37,920 --> 03:30:41,040
and y hat it's not that
5488
03:30:41,040 --> 03:30:43,200
here in principal component analysis
5489
03:30:43,200 --> 03:30:46,720
we're taking the difference from
5490
03:30:46,720 --> 03:30:48,560
our current point in two-dimensional
5491
03:30:48,560 --> 03:30:49,520
space
5492
03:30:49,520 --> 03:30:51,680
and then its projected point
5493
03:30:51,680 --> 03:30:53,600
okay so we're taking that
5494
03:30:53,600 --> 03:30:54,880
dimension
5495
03:30:54,880 --> 03:30:56,479
and we're saying
5496
03:30:56,479 --> 03:30:58,399
all right how much
5497
03:30:58,399 --> 03:30:59,359
you know
5498
03:30:59,359 --> 03:31:01,760
how much distance is there between
5499
03:31:01,760 --> 03:31:04,000
that projection residual and we're
5500
03:31:04,000 --> 03:31:06,720
trying to minimize that for all of these
5501
03:31:06,720 --> 03:31:08,239
points
5502
03:31:08,239 --> 03:31:11,359
so that actually equates to
5503
03:31:11,359 --> 03:31:14,960
this largest variance dimension
5504
03:31:14,960 --> 03:31:18,399
this dimension here
5505
03:31:19,680 --> 03:31:22,880
the pca dimension
5506
03:31:22,880 --> 03:31:27,720
you can either look at it as minimizing
5507
03:31:27,760 --> 03:31:30,720
minimize
5508
03:31:31,120 --> 03:31:34,399
let me get rid of this
5509
03:31:34,479 --> 03:31:37,359
the projection residuals so that's the
5510
03:31:37,359 --> 03:31:40,840
stuff in orange
5511
03:31:42,000 --> 03:31:44,720
or as
5512
03:31:44,720 --> 03:31:47,359
maximizing the variance
5513
03:31:47,359 --> 03:31:50,160
between the points
5514
03:31:50,160 --> 03:31:51,680
okay
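Those two views really are the same criterion: for centered data, each point's squared distance from the center splits, by the right-angle projection, into the part along the chosen direction plus the squared projection residual, so maximizing one term minimizes the other. A quick numerical check on made-up data (the random cloud here is my own example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 1.0], [0.0, 0.5]])  # tilted cloud
Xc = X - X.mean(axis=0)               # center the data
total = (Xc ** 2).sum(axis=1).mean()  # total spread around the center

for theta in np.linspace(0.0, np.pi, 9):
    d = np.array([np.cos(theta), np.sin(theta)])  # candidate unit direction
    along = Xc @ d                                 # projected 1-d coordinates
    resid = Xc - np.outer(along, d)                # right-angle projection residuals
    # projected variance + mean squared residual is the same total in every direction
    assert np.isclose(along.var() + (resid ** 2).sum(axis=1).mean(), total)
```

Because the total is fixed, the direction with the largest projected variance is exactly the one with the smallest projection residuals.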
5515
03:31:51,680 --> 03:31:54,319
and we're not really going to talk about
5516
03:31:54,319 --> 03:31:56,160
you know the method that we need in
5517
03:31:56,160 --> 03:31:58,720
order to calculate out the principal
5518
03:31:58,720 --> 03:32:00,880
components or like what that projection
5519
03:32:00,880 --> 03:32:01,920
would be
5520
03:32:01,920 --> 03:32:03,439
because you will need to understand
5521
03:32:03,439 --> 03:32:05,840
linear algebra for that especially
5522
03:32:05,840 --> 03:32:08,479
um eigenvectors and eigenvalues which
5523
03:32:08,479 --> 03:32:10,319
i'm not going to cover in this class
5524
03:32:10,319 --> 03:32:11,760
but that's how you would find the
5525
03:32:11,760 --> 03:32:14,239
principal components okay
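For the curious, the linear-algebra step being skipped here boils down to: center the data, form the covariance matrix, and take the eigenvector with the largest eigenvalue as the principal component. A hedged NumPy sketch (the function name is mine, not from the video):

```python
import numpy as np

def first_principal_component(X):
    """Direction of largest variance and the 1-d projection onto it."""
    Xc = X - X.mean(axis=0)                 # PCA always centers the data first
    cov = np.cov(Xc, rowvar=False)          # feature-by-feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: symmetric input, ascending order
    pc = eigvecs[:, -1]                     # eigenvector with the largest eigenvalue
    return pc, Xc @ pc                      # unit direction, projected coordinates
```

On points lying exactly along y = 2x the recovered direction is proportional to (1, 2), and the variance of the 1-d projection equals the top eigenvalue — all of the spread survives the projection.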
5526
03:32:14,239 --> 03:32:16,640
now with this two-dimensional data set
5527
03:32:16,640 --> 03:32:18,640
here sorry this one-dimensional data set
5528
03:32:18,640 --> 03:32:21,120
we started from a 2d data set and we
5529
03:32:21,120 --> 03:32:23,520
now boil it down to one dimension well
5530
03:32:23,520 --> 03:32:25,040
we can go and take that dimension and we
5531
03:32:25,040 --> 03:32:27,279
can do other things with it
5532
03:32:27,279 --> 03:32:29,600
right we can like if there were a y
5533
03:32:29,600 --> 03:32:32,399
label then we can now show x versus y
5534
03:32:32,399 --> 03:32:35,279
rather than x 0 and x 1
5535
03:32:35,279 --> 03:32:37,600
in different plots with that y now we
5536
03:32:37,600 --> 03:32:38,880
can just say oh this is a principal
5537
03:32:38,880 --> 03:32:40,720
component and we're going to plot that
5538
03:32:40,720 --> 03:32:43,439
with the y or for example if there were
5539
03:32:43,439 --> 03:32:45,359
a hundred different dimensions and you
5540
03:32:45,359 --> 03:32:46,800
only wanted to take
5541
03:32:46,800 --> 03:32:48,800
five of them well you could go and you
5542
03:32:48,800 --> 03:32:52,000
could find the top five pca dimensions
5543
03:32:52,000 --> 03:32:53,040
and
5544
03:32:53,040 --> 03:32:54,720
that might be a lot more useful to you
5545
03:32:54,720 --> 03:32:58,560
than 100 different feature vector values
5546
03:32:58,560 --> 03:32:59,920
right
5547
03:32:59,920 --> 03:33:01,439
so that's principal component analysis
5548
03:33:01,439 --> 03:33:03,120
again we're taking
5549
03:33:03,120 --> 03:33:05,840
you know certain data that's unlabeled
5550
03:33:05,840 --> 03:33:07,680
and we're trying to
5551
03:33:07,680 --> 03:33:09,840
make some sort of estimation
5552
03:33:09,840 --> 03:33:13,120
like some guess about its structure
5553
03:33:13,120 --> 03:33:14,319
from
5554
03:33:14,319 --> 03:33:16,800
that original data set if we wanted to
5555
03:33:16,800 --> 03:33:19,120
take you know a 3d thing so like a
5556
03:33:19,120 --> 03:33:20,319
sphere
5557
03:33:20,319 --> 03:33:22,720
but we only have a 2d surface to draw it
5558
03:33:22,720 --> 03:33:23,760
on
5559
03:33:23,760 --> 03:33:25,520
well what's the best approximation that
5560
03:33:25,520 --> 03:33:28,080
we can make oh it's a circle right pca
5561
03:33:28,080 --> 03:33:29,680
is kind of the same thing it's saying if
5562
03:33:29,680 --> 03:33:31,040
we have something with all these
5563
03:33:31,040 --> 03:33:33,040
different dimensions but we can't show
5564
03:33:33,040 --> 03:33:35,120
all of them how do we boil it down to
5565
03:33:35,120 --> 03:33:38,000
just one dimension how do we extract the
5566
03:33:38,000 --> 03:33:39,760
most information
5567
03:33:39,760 --> 03:33:42,399
from that multiple dimensions
5568
03:33:42,399 --> 03:33:44,960
and that is exactly either you minimize
5569
03:33:44,960 --> 03:33:47,600
the projection residuals or you maximize
5570
03:33:47,600 --> 03:33:51,120
the variance and that is pca so we'll go
5571
03:33:51,120 --> 03:33:53,760
through an example of that now finally
5572
03:33:53,760 --> 03:33:56,720
let's move on to implementing the
5573
03:33:56,720 --> 03:34:00,160
unsupervised learning part of this class
5574
03:34:00,160 --> 03:34:02,239
here again i'm on the uci machine
5575
03:34:02,239 --> 03:34:05,279
learning repository and i have a seeds
5576
03:34:05,279 --> 03:34:07,120
data set where
5577
03:34:07,120 --> 03:34:09,439
you know i have a bunch of kernels that
5578
03:34:09,439 --> 03:34:11,520
belong to three different types of wheat
5579
03:34:11,520 --> 03:34:14,640
so there's kama, rosa, and canadian
5580
03:34:14,640 --> 03:34:15,600
and
5581
03:34:15,600 --> 03:34:18,080
the different um features that we have
5582
03:34:18,080 --> 03:34:20,560
access to are you know geometric
5583
03:34:20,560 --> 03:34:22,880
parameters of those wheat kernels so the
5584
03:34:22,880 --> 03:34:25,760
area perimeter compactness
5585
03:34:25,760 --> 03:34:27,120
length width
5586
03:34:27,120 --> 03:34:30,160
asymmetry and the length of the
5587
03:34:30,160 --> 03:34:31,600
kernel groove
5588
03:34:31,600 --> 03:34:34,080
okay so all these are real values which
5589
03:34:34,080 --> 03:34:36,560
is easy to work with and what we're
5590
03:34:36,560 --> 03:34:37,760
going to do is we're going to try to
5591
03:34:37,760 --> 03:34:39,200
predict
5592
03:34:39,200 --> 03:34:41,680
or i guess we're going to try to cluster
5593
03:34:41,680 --> 03:34:44,880
the different varieties of the wheat
5594
03:34:44,880 --> 03:34:47,120
so let's get started i have a colab
5595
03:34:47,120 --> 03:34:49,359
notebook open again oh you're gonna have
5596
03:34:49,359 --> 03:34:51,359
to you know go to the data folder
5597
03:34:51,359 --> 03:34:52,880
download this
5598
03:34:52,880 --> 03:34:53,760
and
5599
03:34:53,760 --> 03:34:56,319
let's get started
5600
03:34:56,319 --> 03:34:59,840
so the first thing to do is to
5601
03:34:59,840 --> 03:35:02,720
import our seeds data set
5602
03:35:02,720 --> 03:35:05,359
into our colab notebook
5603
03:35:05,359 --> 03:35:07,760
so i've done that here
5604
03:35:07,760 --> 03:35:09,200
okay and then
5605
03:35:09,200 --> 03:35:11,520
we're going to import all the classics
5606
03:35:11,520 --> 03:35:14,720
again so pandas
5607
03:35:23,040 --> 03:35:26,000
um and then i'm also going to import
5608
03:35:26,000 --> 03:35:28,239
seaborn because i'm going to want that
5609
03:35:28,239 --> 03:35:31,600
for this specific class
5610
03:35:31,600 --> 03:35:33,840
okay
5611
03:35:35,200 --> 03:35:38,160
great so now our columns that we have in
5612
03:35:38,160 --> 03:35:40,880
our seed data set are the area
5613
03:35:40,880 --> 03:35:42,880
the perimeter
5614
03:35:42,880 --> 03:35:46,000
um the compactness
5615
03:35:46,000 --> 03:35:48,000
the length
5616
03:35:48,000 --> 03:35:49,279
width
5617
03:35:49,279 --> 03:35:51,600
asymmetry
5618
03:35:51,600 --> 03:35:53,840
groove
5619
03:35:53,840 --> 03:35:55,120
length i mean i'm just going to call it
5620
03:35:55,120 --> 03:35:57,760
groove and then the class right the wheat
5621
03:35:57,760 --> 03:36:00,560
kernel's class so now we have to import
5622
03:36:00,560 --> 03:36:02,239
this um
5623
03:36:02,239 --> 03:36:04,239
i'm going to do that using pandas read
5624
03:36:04,239 --> 03:36:05,520
csv
5625
03:36:05,520 --> 03:36:06,560
and
5626
03:36:06,560 --> 03:36:09,520
it's called seeds data.csv
5627
03:36:09,520 --> 03:36:10,560
so
5628
03:36:10,560 --> 03:36:13,840
i'm going to turn that into a data frame
5629
03:36:13,840 --> 03:36:16,319
and the names are equal to the columns
5630
03:36:16,319 --> 03:36:17,920
over here
5631
03:36:17,920 --> 03:36:20,720
so what happens if i just do that
5632
03:36:20,720 --> 03:36:22,640
oops what did i call this seeds
5633
03:36:22,640 --> 03:36:25,640
dataset.txt
5634
03:36:26,319 --> 03:36:27,520
all right
5635
03:36:27,520 --> 03:36:29,600
so if we actually look at our data frame
5636
03:36:29,600 --> 03:36:31,359
right now
5637
03:36:31,359 --> 03:36:34,640
you'll notice something funky okay and
5638
03:36:34,640 --> 03:36:36,960
here you know we have all the stuff
5639
03:36:36,960 --> 03:36:39,120
under area and these are all our numbers
5640
03:36:39,120 --> 03:36:40,960
with some \t
5641
03:36:40,960 --> 03:36:42,479
so the reason is because we haven't
5642
03:36:42,479 --> 03:36:43,920
actually
5643
03:36:43,920 --> 03:36:47,120
told pandas what the separator is which
5644
03:36:47,120 --> 03:36:48,560
we can do
5645
03:36:48,560 --> 03:36:51,760
like this and this \t that's just a tab
5646
03:36:51,760 --> 03:36:53,920
so in order to ensure that like all
5647
03:36:53,920 --> 03:36:55,760
white space gets recognized as a
5648
03:36:55,760 --> 03:36:56,880
separator
5649
03:36:56,880 --> 03:36:58,640
we can actually
5650
03:36:58,640 --> 03:37:02,720
the \s is for whitespace so any spaces
5651
03:37:02,720 --> 03:37:04,479
are going to get recognized as data
5652
03:37:04,479 --> 03:37:07,359
separators so if i run that
5653
03:37:07,359 --> 03:37:09,279
now our um
5654
03:37:09,279 --> 03:37:12,960
this you know this is a lot better okay
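The loading step just shown comes out roughly like this; as a stand-in for the real seeds_dataset.txt I use a two-row string (the first row copied from the UCI file), since the point being made is the whitespace separator:

```python
import io
import pandas as pd

cols = ["area", "perimeter", "compactness", "length",
        "width", "asymmetry", "groove", "class"]

# stand-in for seeds_dataset.txt: the raw file mixes tabs and spaces
raw = ("15.26\t14.84\t0.871\t5.763\t3.312\t2.221\t5.22\t1\n"
       "14.88 14.57 0.8811 5.554 3.333 1.018 4.956 1\n")

# sep=r"\s+" is a regex: any run of spaces or tabs counts as one separator,
# so the stray \t characters no longer end up inside the values
df = pd.read_csv(io.StringIO(raw), names=cols, sep=r"\s+")
```

In the notebook itself the same call just takes the filename instead of the StringIO buffer.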
5655
03:37:12,960 --> 03:37:14,479
okay
5656
03:37:14,479 --> 03:37:16,399
so now let's actually go and like
5657
03:37:16,399 --> 03:37:18,880
visualize this data so
5658
03:37:18,880 --> 03:37:20,640
what i'm actually going to do is plot
5659
03:37:20,640 --> 03:37:23,200
each of these against one another so
5660
03:37:23,200 --> 03:37:25,279
in this case pretend that we don't have
5661
03:37:25,279 --> 03:37:28,080
access to the class right pretend that
5662
03:37:28,080 --> 03:37:29,600
so this class here i'm just going to
5663
03:37:29,600 --> 03:37:31,680
show you in this example that like hey
5664
03:37:31,680 --> 03:37:33,120
we can predict our classes using
5665
03:37:33,120 --> 03:37:34,960
unsupervised learning
5666
03:37:34,960 --> 03:37:36,880
but for this example in unsupervised
5667
03:37:36,880 --> 03:37:38,640
learning we don't actually have access
5668
03:37:38,640 --> 03:37:39,840
to the class
5669
03:37:39,840 --> 03:37:42,239
so i'm going to just try to plot these
5670
03:37:42,239 --> 03:37:45,600
against one another and see what happens
5671
03:37:45,600 --> 03:37:49,279
so for some i in range
5672
03:37:49,279 --> 03:37:51,920
you know the columns minus one because
5673
03:37:51,920 --> 03:37:54,239
the class is in the columns
5674
03:37:54,239 --> 03:37:56,960
and i'm just going to say for j in range
5675
03:37:56,960 --> 03:37:59,920
so take everything from i
5676
03:37:59,920 --> 03:38:02,160
onwards you know so i like the next
5677
03:38:02,160 --> 03:38:03,600
thing after i
5678
03:38:03,600 --> 03:38:04,960
uh until
5679
03:38:04,960 --> 03:38:07,439
the end of this so this will give us
5680
03:38:07,439 --> 03:38:09,359
basically a grid
5681
03:38:09,359 --> 03:38:13,439
of all the different like combinations
5682
03:38:13,439 --> 03:38:17,600
and our x label is going to be
5683
03:38:17,600 --> 03:38:20,080
columns i our y label
5684
03:38:20,080 --> 03:38:22,239
is going to be the columns
5685
03:38:22,239 --> 03:38:25,200
j so those are our labels up here
5686
03:38:25,200 --> 03:38:27,279
and i'm going to use
5687
03:38:27,279 --> 03:38:29,359
seaborn this time
5688
03:38:29,359 --> 03:38:31,040
and i'm going to say
5689
03:38:31,040 --> 03:38:34,160
scatter my data so our x is going to be
5690
03:38:34,160 --> 03:38:36,960
our x label
5691
03:38:38,080 --> 03:38:41,600
our y is going to be our y label
5692
03:38:41,600 --> 03:38:42,720
um
5693
03:38:42,720 --> 03:38:43,760
and
5694
03:38:43,760 --> 03:38:46,160
our data is going to be the data frame
5695
03:38:46,160 --> 03:38:47,680
that we're passing in
5696
03:38:47,680 --> 03:38:49,359
so what's interesting here is that we
5697
03:38:49,359 --> 03:38:51,439
can say hue
5698
03:38:51,439 --> 03:38:53,359
and what this will do is say
5699
03:38:53,359 --> 03:38:55,200
like if i give this class it's going to
5700
03:38:55,200 --> 03:38:57,040
separate the three different classes
5701
03:38:57,040 --> 03:38:58,640
into three different queues so now what
5702
03:38:58,640 --> 03:39:01,040
we're doing is we're basically comparing
5703
03:39:01,040 --> 03:39:03,120
the area and the perimeter or the area
5704
03:39:03,120 --> 03:39:04,720
and the compactness
5705
03:39:04,720 --> 03:39:07,040
but we're going to visualize you know
5706
03:39:07,040 --> 03:39:10,000
what classes they're in
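That plotting loop can be sketched like this; it's a minimal sketch, assuming `df` is a pandas DataFrame whose last column is `"class"` and whose other columns are the seven seed features (loading the data itself is omitted here):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def plot_feature_pairs(df: pd.DataFrame) -> list:
    """Scatter every pair of feature columns, colored by class."""
    cols = df.columns
    pairs = []
    for i in range(len(cols) - 1):             # skip the last column ("class")
        for j in range(i + 1, len(cols) - 1):  # everything after i
            pairs.append((cols[i], cols[j]))
            sns.scatterplot(x=cols[i], y=cols[j], hue="class", data=df)
            plt.show()
            plt.close()                        # fresh figure for the next pair
    return pairs
```

With seven feature columns this draws 21 plots, one per unordered pair.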
So let's go ahead and show that. Great. So basically, for perimeter and area we get these three groups; for area and compactness we get these three groups; and so on. These all honestly look somewhat similar. But wow, look at this one: compactness and asymmetry. It looks like there's not really any separation; they just look like blobs. Sure, maybe class three is more over here, but one and two kind of look like they're on top of each other. There are some pairs that might look slightly better in terms of clustering, but let's go through some of the clustering examples that we talked about and try to implement them. The first thing that we're going to do is just straight-up clustering.
So, what we learned about was k-means clustering. From sklearn.cluster I'm going to import KMeans. Just for the sake of being able to run some x and some y, let's pick a pair of features; perimeter and asymmetry could be a good one, so x will be perimeter and y will be asymmetry. For the x values, I'm going to extract those specific columns. Now let's define the k-means model. In this specific case we know that the number of clusters is three, so let's just use that, and I'm going to fit it against the x that I've just defined right here.
So, one cool thing is that once I have these clusters, I can say kmeans.labels_, and, if I can type correctly, it'll give me its predicted cluster for every sample. And for the actual labels, if we go to the data frame, take the class column, and get its values, we can compare the two and say: hey, in general, most of the samples it predicted as zero are the ones labeled one; in general the twos correspond to the twos; and the remaining cluster corresponds to three. Now remember, these are unlabeled clusters, so what we actually call them doesn't really matter: we can map zero to one, map two to two, and map one to three, and that mapping would do fairly well.
03:42:28,479 --> 03:42:30,239
but we can actually visualize this and
5798
03:42:30,239 --> 03:42:33,439
in order to do that i'm going to create
5799
03:42:33,439 --> 03:42:36,080
this cluster
5800
03:42:36,080 --> 03:42:37,680
cluster data frame
5801
03:42:37,680 --> 03:42:40,319
so i'm going to create a data frame and
5802
03:42:40,319 --> 03:42:42,479
i'm going to pass in
5803
03:42:42,479 --> 03:42:45,439
um a horizontally stacked
5804
03:42:45,439 --> 03:42:47,279
array with x
5805
03:42:47,279 --> 03:42:50,080
so my values for x and y
5806
03:42:50,080 --> 03:42:53,760
and then um the clusters that i have
5807
03:42:53,760 --> 03:42:55,120
here
5808
03:42:55,120 --> 03:43:00,080
but i'm going to reshape them so it's 2d
5809
03:43:00,239 --> 03:43:01,520
okay
5810
03:43:01,520 --> 03:43:03,279
and the columns
5811
03:43:03,279 --> 03:43:06,880
the labels for that are going to be x y
5812
03:43:06,880 --> 03:43:08,080
and
5813
03:43:08,080 --> 03:43:10,399
plus
5814
03:43:10,399 --> 03:43:12,560
okay
5815
03:43:12,560 --> 03:43:13,840
so
5816
03:43:13,840 --> 03:43:16,479
i'm going to go ahead and do that same
5817
03:43:16,479 --> 03:43:19,279
seabourn scatter plot
5818
03:43:19,279 --> 03:43:20,399
again
5819
03:43:20,399 --> 03:43:22,800
where x is x y is y
5820
03:43:22,800 --> 03:43:26,960
and now uh the hue is again the class
5821
03:43:26,960 --> 03:43:29,680
and the data is now this cluster data
5822
03:43:29,680 --> 03:43:30,720
frame
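Concretely, building that data frame and plotting it might look like this (again a sketch, with synthetic blobs standing in for the two chosen seed features):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the two chosen seed features.
X, _ = make_blobs(n_samples=210, centers=3, n_features=2, random_state=0)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X).labels_

# Stack the features with the (reshaped, now 2-D) cluster labels.
cluster_df = pd.DataFrame(np.hstack((X, clusters.reshape(-1, 1))),
                          columns=["x", "y", "class"])

sns.scatterplot(x="x", y="y", hue="class", data=cluster_df)
plt.show()
```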
All right, so this here is my k-means, I guess, classes. So k-means kind of looks like this. If I come down here and plot my original data frame, these are my original classes with respect to this specific x and y, and you'll see that honestly it doesn't do too poorly. I mean, the colors are different, but that's fine; for the most part it captures the structure of the clusters. And now we can do that with higher dimensions.
So, with the higher dimensions, if we make x equal to all of the columns except for the last one, which is our class, we can do the exact same thing and predict. But now the columns of the cluster data frame are equal to our data frame's columns all the way up to the last one, plus the class, so we can literally just use the data frame's columns. We can fit all of this, and now, if I plot the k-means classes: all right, so that's my clustered version and my original. Actually, let me see if I can get these on the same page.
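The same exercise on all seven features can be sketched like this (synthetic 7-dimensional blobs again stand in for the seeds data):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

cols = ["area", "perimeter", "compactness", "length",
        "width", "asymmetry", "groove"]

# Synthetic stand-in for the seven seed features.
X, _ = make_blobs(n_samples=210, centers=3, n_features=7, random_state=0)
df = pd.DataFrame(X, columns=cols)

# Cluster on every feature column at once.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(df[cols].values)

# Reuse the original column names, plus "class" for the cluster labels.
cluster_df = pd.DataFrame(
    np.hstack((df[cols].values, kmeans.labels_.reshape(-1, 1))),
    columns=cols + ["class"])
```

Any pair of columns of `cluster_df` can then be scattered with `hue="class"` exactly as before.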
So yeah, pretty similar to what we just saw. But what's actually really cool is what happens if we change the features to one of the pairs that were on top of each other. Ah, okay: compactness and asymmetry, this one's messy. So if I come down here and I set compactness and asymmetry and try to do this in 2-D, this is the scatter plot k-means gives me for those two dimensions, compactness and asymmetry, if we just look at those two. These are our three classes, and we know that the original looks something like this. Are these two remotely alike? No. Okay, so now if I come back down here and rerun the higher-dimensions one (though for these clusters I first need to get the labels of the k-means again), if I rerun this with higher dimensions: well, if we zoom out and take a look at these two, sure, the colors are mixed up, but in general the three groups are there. This does a much better job of assessing which group is which.
So, for example, we could relabel the one in the original class to two. Sorry, okay, this is kind of confusing, but: if this light pink were projected onto this darker pink here, and this dark one were actually the light pink, and this light one were this dark one, then you can see that these correspond to one another. Even these two points up here are in the same class, as are all the other ones over here that share a color. So you don't want to compare the colors between the plots; you want to compare which points end up in the same color within each plot.
So that's one cool application, and this is how k-means functions: it basically takes all the data points and asks, all right, where are my clusters, given these pieces of data? The next thing that we talked about is PCA. With PCA we're reducing the dimension: we're mapping all of these, say, seven dimensions (I don't know offhand if there are seven, I made that number up), multiple dimensions, into a lower-dimensional space. So let's see how that works.
So from sklearn.decomposition I can import PCA, and that will be my PCA model. When I construct the PCA, n_components is how many dimensions you want to map down to, and for this exercise let's do two. Okay, so now I'm taking the top two components, and my transformed x is going to be pca.fit_transform of the same x that I had up here, so basically all the feature values: area, perimeter, compactness, length, width, asymmetry, groove. Let's run that, and we've transformed it. Let's look at what the shape of x used to be. There, okay, so seven was right: I had 210 samples, each seven features long, basically. And now my transformed x is 210 samples, but each only of length two, which means I only have two dimensions now that I'm plotting. We can even take a look at the first five entries: okay, so now we see that each sample is a two-dimensional point in our new dimensions.
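A sketch of that dimensionality reduction, again with synthetic 7-feature data in place of the seeds file:

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-in for the 210 x 7 seeds feature matrix.
X, _ = make_blobs(n_samples=210, centers=3, n_features=7, random_state=0)

pca = PCA(n_components=2)         # keep the top two principal components
transformed_x = pca.fit_transform(X)

print(X.shape)                    # (210, 7): 210 samples, 7 features each
print(transformed_x.shape)        # (210, 2): each sample is now a 2-D point
print(transformed_x[:5])          # the first five samples in the new space
```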
So what's cool is that I can actually scatter these, dimension zero of transformed x against dimension one, so I have to take the columns here. And if I show that, basically we've just taken this seven-dimensional thing and made it into a, I guess, two-dimensional representation. So that's the point of PCA.
And actually, let's go ahead and do the same clustering exercise as we did up here. Taking the k-means, I can construct this PCA data frame: the data frame is going to be an h-stack of this transformed x and the cluster labels. Actually, instead of clusters I'm going to use kmeans.labels_, and I need to reshape it so it's 2-D so we can do the h-stack. For the columns, I'm going to set this to pca1, pca2, and class. All right, so now if I take this, I can also do the same for the truth, but instead of the k-means labels I want the original classes from the data frame, and I'm just going to take the values from those. So now I have a data frame for the k-means with the PCA, and then a data frame for the truth, also with the PCA, and I can now plot these similarly to how I plotted these up here. So let me actually take those two plotting cells. Instead of the cluster data frame, this one takes the k-means PCA data frame; the hue is still going to be class, but now x and y are going to be the two PCA dimensions. Okay.
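Putting that together, a sketch of the two PCA data frames and their plots (synthetic data again standing in for the seeds features and classes):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-in for the seeds features and their true classes.
X, true_class = make_blobs(n_samples=210, centers=3, n_features=7,
                           random_state=0)
true_class = true_class + 1          # classes 1..3, as in the seeds file

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
transformed_x = PCA(n_components=2).fit_transform(X)

# One data frame colored by the k-means labels, one by the truth.
kmeans_pca_df = pd.DataFrame(
    np.hstack((transformed_x, kmeans.labels_.reshape(-1, 1))),
    columns=["pca1", "pca2", "class"])
truth_pca_df = pd.DataFrame(
    np.hstack((transformed_x, true_class.reshape(-1, 1))),
    columns=["pca1", "pca2", "class"])

sns.scatterplot(x="pca1", y="pca2", hue="class", data=kmeans_pca_df)
plt.show()
sns.scatterplot(x="pca1", y="pca2", hue="class", data=truth_pca_df)
plt.show()
```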
So these are my two PCA dimensions, and you can see that they're pretty spread out. And then here I'm going to go to my truth classes: again it's pca1 and pca2, but instead of the k-means this should be the truth PCA data frame. You can see that in the truth data frame, along these two dimensions, we're actually doing fairly well in terms of separation; it does seem like this is slightly more separable than the other dimensions that we had been looking at up here, so that's a good sign. And up here you can see that, hey, some of these correspond to one another. For the most part, our unsupervised clustering algorithm is able to spit out the proper labels, if you map its specific labels to the different types of kernels: for example, these might all be the Kama kernels, and same here; these might all be the Rosa kernels; and these might all be the Canadian kernels. It does struggle a little bit where they overlap, but for the most part our algorithm is able to find the three different categories and do a fairly good job of predicting them without any information from us; we haven't given our algorithm any labels. So that's the gist of unsupervised learning.
I hope you guys enjoyed this course, and I hope a lot of these examples made sense. If there are certain things that I have done, and you're somebody with more experience than me, please feel free to correct me in the comments, and we can all learn from this together as a community. So thank you all for watching.