Kylie Ying has worked at many interesting places such as MIT, CERN, and freeCodeCamp. She's a physicist, engineer, and basically a genius, and now she's going to teach you about machine learning in a way that is accessible to absolute beginners.
What's up, you guys! So welcome to Machine Learning for Everyone. If you are someone who is interested in machine learning, and you think you are considered "everyone," then this video is for you.
In this video we'll talk about supervised and unsupervised learning models, we'll go through a little bit of the logic and math behind them, and then we'll also see how we can program them in Google Colab.
If there are certain things that I could have done better, and you're somebody with more experience than me, please feel free to correct me in the comments, and we can all, as a community, learn from this together.
So with that, without wasting any time, let's just dive straight into the code, and I will be teaching you the concepts as we go.
So this here is the UCI Machine Learning Repository. Basically, they have a ton of datasets that we can access, and I found this really cool one called the MAGIC Gamma Telescope dataset.
So, in this dataset, if you don't want to read all this information, to summarize what I think is going on: there's this gamma telescope, and we have all these high-energy particles hitting the telescope. Now, there's a camera, a detector, that records certain patterns of how this light hits the camera, and we can use properties of those patterns to predict what type of particle caused that radiation: whether it was a gamma particle or some other particle, like a hadron.
Down here are all of the attributes of those patterns that we collect in the camera. You can see that there's, you know, some length, width, size, asymmetry, etc. Now, we're going to use all these properties to help us discriminate between the patterns: whether or not they came from a gamma particle or a hadron.
So, in order to do this, we're going to come up here, go to the Data Folder, click this magic04.data file, and download it.
Now, over here I have a Colab notebook open. You go to colab.research.google.com and start a new notebook. I'm just going to call this, um, the MAGIC dataset; so actually, I'm going to call it "fcc magic example." Okay.
So with that, I'm going to first start with some imports. I always import NumPy, I always import pandas, and I always import Matplotlib; and then we'll import other things as we go.
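Those three standard imports, under their usual aliases, would look something like this (a minimal sketch of the cell described above):

```python
import numpy as np               # numerical arrays and math
import pandas as pd              # DataFrames for tabular data
import matplotlib.pyplot as plt  # plotting
```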
So we run that. In order to run a cell, you can either click the play button here, or, on my computer, it's just Shift+Enter, and that will run the cell.
And here I'm just going to let you guys know where I found the dataset. Um, so I've copied and pasted the link, but this is just where I found the dataset.
And in order to import that downloaded file that we got onto the computer, we're going to go over here to this folder icon, and I am literally just going to drag and drop that file into here. Okay.
So, in order to take a look at what this file consists of (do we have the labels, do we not?), I mean, we could open it on our computer, but we can also just call pandas' read_csv and pass in the name of this file, and see what it returns. It doesn't seem like we have the labels, so let's go back to the dataset page.
I'm just going to make the column labels all of these attribute names over here, so I'm just going to take these values and make them the column names. All right, how do I do that?
So basically, I will come back here and I will create a list called cols, and I will type in all of those attribute names: fLength, fWidth, fSize, fConc, fConc1, fAsym, fM3Long, fM3Trans, fAlpha, fDist, and class. Okay, great.
Now, in order to assign those as the column labels down here in our DataFrame: so basically, this command here just reads in some CSV file that you pass it (CSV stands for comma-separated values) and turns it into a pandas DataFrame object. So now, if I pass in a names argument here, it basically assigns these labels to the columns of this dataset. So I'm going to set this DataFrame equal to df, and then if we call df.head(), which just means "give me the first five entries," you'll see that we now have labels for all of these.
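Put together, the cell might look like the sketch below. The column names come from the UCI attribute list; since the downloaded `magic04.data` isn't available here, a two-row in-memory stand-in with illustrative values takes its place so the snippet runs on its own:

```python
import io
import pandas as pd

# Column names from the UCI attribute list; magic04.data itself has no header row.
cols = ["fLength", "fWidth", "fSize", "fConc", "fConc1",
        "fAsym", "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]

# In the notebook this would be: df = pd.read_csv("magic04.data", names=cols)
# Here, a two-row stand-in (illustrative values) replaces the real file:
sample = io.StringIO(
    "28.7967,16.0021,2.6449,0.3918,0.1982,27.7004,22.011,-8.2027,40.092,81.8828,g\n"
    "31.6036,11.7235,2.5185,0.5303,0.3773,26.2722,23.8238,-9.9574,6.3609,205.261,h\n"
)
df = pd.read_csv(sample, names=cols)
print(df.head())  # now every column has a label
```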
Okay, all right, great. So one thing that you might notice is that over here, the class labels are g and h. If I actually go down here and call df["class"].unique(), you'll see that I have either g's or h's, and these stand for gammas or hadrons.
Um, and our computer is not so good at understanding letters, right? Our computer's really good at understanding numbers. So what we're going to do is convert this column to ones and zeros. So here I'm going to set this column equal to whether or not it equals g, and then I'm just going to call .astype(int). What this should do is convert the entire column: if a value equals g, then the comparison is True, so that becomes 1, and if it's h, it's False, so that becomes 0. I'm just converting g and h to ones and zeros; it doesn't really matter whether g is 1 and h is 0 or vice versa.
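That conversion is a one-liner; here's a sketch with a toy three-row column standing in for the real one:

```python
import pandas as pd

df = pd.DataFrame({"class": ["g", "h", "g"]})  # toy stand-in for the real column

# The comparison yields booleans; .astype(int) turns True into 1 and False into 0,
# so g becomes 1 and h becomes 0 (which class gets the 1 doesn't matter here).
df["class"] = (df["class"] == "g").astype(int)
print(df["class"].tolist())  # [1, 0, 1]
```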
Let me just take a step back right now and talk about this dataset. So here I have some DataFrame, and I have all of these different values for each entry. Now, each of these rows is one sample; it's one example, it's one item in our dataset, it's one data point. All these terms mean roughly the same thing, so keep that in mind when I mention "this is one example" or "this is one sample" or whatever.
Now, each of these samples has one value for each of these labels up here, and then it has the class. What we're going to do in this specific example is try to predict, for future samples, whether the class is g for gamma or h for hadron, and that is something known as classification. Now, all of these columns up here are known as our features, and features are just things that we're going to pass into our model in order to help us predict the label, which in this case is the class column. So for, you know, sample 0, I have 10 different features: 10 different values that I can pass into some model which spits out, you know, the class, the label. And I know the true label here is g, so this is actually supervised learning.
All right. So before I move on, let me just give you a quick little crash course on what I just said. This is machine learning for everyone. Well, the first question is: what is machine learning? Well, machine learning is a sub-domain of computer science that focuses on certain algorithms which might help a computer learn from data, without a programmer being there telling the computer exactly what to do (that approach is what we call explicit programming).
So you might have heard of AI and ML and data science. What is the difference between all of these? So, AI is artificial intelligence, and that's an area of computer science where the goal is to enable computers and machines to perform human-like tasks and simulate human behavior. Now, machine learning is a subset of AI that tries to solve one specific problem and make predictions using certain data. And data science is a field that attempts to find patterns and draw insights from data, and that might mean we're using machine learning. So all of these fields kind of overlap, and all of them might use machine learning.
So there are a few types of machine learning. The first one is supervised learning, and in supervised learning we're using labeled inputs. This means that for whatever input we get, we have a corresponding output label, which we can use to train models and to learn the outputs of different new inputs that we might feed our model. So, for example, I might have these pictures. Okay, to a computer, all these pictures are pixels; they're pixels with a certain color. Now, in supervised learning, all of these inputs have a label associated with them; this is the output that we might want the computer to be able to predict. So, for example, over here this picture is a cat, this picture is a dog, and this picture is a lizard.
Now, there's also unsupervised learning, and in unsupervised learning we use unlabeled data to learn about patterns in the data. So here are my input data points again; they're just images, they're just pixels. Well, okay, let's say I have a bunch of these different pictures, and what I can do is feed all of these to my computer. Now, my computer's not going to be able to say "oh, this is a cat, a dog, and a lizard" in terms of, you know, the output, but it might be able to cluster all these pictures. It might say, hey, all of these have something in common, all of those have something in common, and then these down here have something in common. That's finding some sort of structure in our unlabeled data.
And finally, we have reinforcement learning. In reinforcement learning, there's usually an agent that is learning in some sort of interactive environment, based on rewards and penalties. So let's think of a dog. We can train our dog, but there's not necessarily, you know, any wrong or right output at any given moment, right? Well, let's pretend that dog is a computer. Essentially, what we're doing is giving rewards to our computer and telling our computer, hey, this is probably something good that you want to keep doing. (Well, computer, agent: same idea, it's just terminology.)
But in this class today, we'll be focusing on supervised learning and unsupervised learning, and learning different models for each of those.
All right, so let's talk about supervised learning first. So this is kind of what a machine learning model looks like: you have a bunch of inputs that are going into some model, and then the model is spitting out an output, which is our prediction. All these inputs together are what we call the feature vector.
Now, there are different types of features that we can have. We might have qualitative features, and qualitative means categorical data: there's either a finite number of categories or groups. So one example of a qualitative feature might be gender, and in this case there are only two here; it's for the sake of the example, and I know this might be a little bit outdated. Here we have a girl and a boy: there are two genders, two different categories. That's a piece of qualitative data. Another example might be, okay, we have, you know, a bunch of different nationalities; maybe a nationality, or a nation, or a location might also be an example of categorical data. Now, in both of these there's no inherent order. It's not like, you know, we can rate the US as one, France as two, Japan as three, et cetera, right? There's not really any inherent order built into either of these categorical datasets, and that's why we call this nominal data.
Now, for nominal data, the way that we want to feed it into our computer is using something called one-hot encoding. So let's say that, you know, I have a dataset, and some of the items in our data, some of the inputs, might be from the US, some might be from India, then Canada, then France. Now, how do we get our computer to recognize that? We have to do something called one-hot encoding, and basically one-hot encoding is saying: okay, well, if it matches some category, make that a one, and if it doesn't, just make that a zero. So, for example, if your input were from the US, you might have [1, 0, 0, 0]; India, you know, [0, 1, 0, 0]; Canada, okay, well, the item representing Canada is one; and then France, the item representing France is one. And then you can see that the rest are zeros. That's one-hot encoding.
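In pandas, one-hot encoding is exactly what `get_dummies` does; here's a small sketch using the four countries from the example (the `dtype=int` argument just asks for 1/0 instead of True/False):

```python
import pandas as pd

countries = pd.Series(["US", "India", "Canada", "France"])

# One indicator column per category: 1 where the row matches that category, else 0.
one_hot = pd.get_dummies(countries, dtype=int)
print(one_hot)
```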
Now, there's also a different type of qualitative feature. So here on the left there are different age groups: there's babies, toddlers, teenagers, young adults, adults, and so on, right? And on the right-hand side we might have different ratings: so maybe bad, not so good, mediocre, good, and then, like, great. Now, these are known as ordinal pieces of data, because they have some sort of inherent order. Right? Like, being a toddler is a lot closer to being a baby than to being an elderly person, right? Or good is closer to great than it is to really bad. So these have some sort of inherent ordering system, and so for these types of datasets we can actually just mark them from, you know, one to five; we can just say, hey, for each of these, let's give it a number.
00:15:57,680 --> 00:16:00,160
and this makes sense because
407
00:16:00,160 --> 00:16:01,440
like
408
00:16:01,440 --> 00:16:03,360
for example the thing that i just said
409
00:16:03,360 --> 00:16:05,759
how good is closer to great
410
00:16:05,759 --> 00:16:08,800
then good is close to not good at all
411
00:16:08,800 --> 00:16:10,560
well four is closer to five then four is
412
00:16:10,560 --> 00:16:12,399
close to one so this actually kind of
413
00:16:12,399 --> 00:16:14,480
makes sense and it'll make sense for the
414
00:16:14,480 --> 00:16:17,360
computer as well
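For ordinal data like these ratings, a plain mapping from label to number preserves that order; here's a sketch (the rating labels are paraphrased from the example, and the 1 to 5 scale is hand-picked):

```python
import pandas as pd

ratings = pd.Series(["bad", "mediocre", "great", "good"])

# Hand-picked 1-5 scale; distances between the codes mirror the inherent order.
scale = {"bad": 1, "not so good": 2, "mediocre": 3, "good": 4, "great": 5}
encoded = ratings.map(scale)
print(encoded.tolist())  # [1, 3, 5, 4]
```

Notice that, with this encoding, "good" (4) really does sit closer to "great" (5) than to "bad" (1), which is the whole point the paragraph above makes.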
All right, there are also quantitative pieces of data, and quantitative pieces of data are numerical-valued pieces of data. So this could be discrete, which means, you know, they might be integers, or it could be continuous, which means all real numbers. So, for example, the length of something is a quantitative piece of data; it's a quantitative feature. The temperature of something is a quantitative feature. And then maybe how many Easter eggs I collected in my basket this Easter egg hunt: that is an example of a discrete quantitative feature. Okay, so length and temperature are continuous, and the egg count over here is discrete.
00:17:01,680 --> 00:17:04,959
so those are the things that go into our
435
00:17:04,959 --> 00:17:07,439
feature vector those are our features
436
00:17:07,439 --> 00:17:09,520
that we're feeding this model because
437
00:17:09,520 --> 00:17:12,319
our computers are really really good
438
00:17:12,319 --> 00:17:14,559
at understanding math right at
439
00:17:14,559 --> 00:17:16,880
understanding numbers they're not so
440
00:17:16,880 --> 00:17:19,280
good at understanding things that humans
441
00:17:19,280 --> 00:17:22,640
might be able to understand
Well, what are the types of predictions that our model can output? So, in supervised learning, there are some different tasks. There's one: classification. And basically classification is just saying, okay, predict discrete classes. That might mean, you know, this is a hot dog, this is pizza, and this is ice cream. Okay, so there are three distinct classes, and any other pictures of hot dogs, pizza, or ice cream I can put under these labels: hot dog, pizza, ice cream. This is something known as multi-class classification. But there's also binary classification, and in binary classification you might have hot dog or not hot dog. So there are only two categories that you're working with: something that is the thing, and something that isn't. That's binary classification.
Okay, so, yeah, other examples. So if something has positive or negative sentiment, that's binary classification. Maybe you're predicting, for pictures, whether they're cats or dogs: that's binary classification. Maybe, you know, you are writing an email filter and you're trying to figure out if an email is spam or not spam: so that's also binary classification. Now, for multi-class classification, you might have, you know, cat, dog, lizard, dolphin, shark, rabbit, etc. Um, we might have different types of fruits, so like orange, apple, pear, etc., and then maybe different plant species. But multi-class classification just means more than two, okay, and binary means we're predicting between two things.
There's also something called regression, when we talk about supervised learning, and this just means we're trying to predict continuous values. So instead of just trying to predict different categories, we're trying to come up with a number that, you know, is on some sort of scale. So some examples might be the price of Ethereum tomorrow, or it might be, okay, what is the temperature going to be, or it might be, what is the price of this house? Right? So these things don't really fit into discrete classes; we're trying to predict a number that's as close to the true value as possible, using different features of our dataset.
510
00:19:48,960 --> 00:19:51,039
so that's exactly what our model looks
511
00:19:51,039 --> 00:19:53,840
like in supervised learning
512
00:19:53,840 --> 00:19:57,200
now let's talk about the model itself
513
00:19:57,200 --> 00:19:59,760
how do we make this model learn
514
00:19:59,760 --> 00:20:02,160
or how can we tell whether or not it's
515
00:20:02,160 --> 00:20:03,360
even learning
516
00:20:03,360 --> 00:20:05,600
so before we talk about the models
517
00:20:05,600 --> 00:20:07,679
let's talk about how can we actually
518
00:20:07,679 --> 00:20:09,919
like evaluate these models or how can we
519
00:20:09,919 --> 00:20:11,600
tell whether something is a good model
520
00:20:11,600 --> 00:20:14,400
or a bad model
521
00:20:14,640 --> 00:20:15,360
so
522
00:20:15,360 --> 00:20:18,400
let's take a look at this data set so
523
00:20:18,400 --> 00:20:21,280
this data set is from
524
00:20:21,280 --> 00:20:24,320
the Pima Indian diabetes data
525
00:20:24,320 --> 00:20:25,120
set
526
00:20:25,120 --> 00:20:26,880
and here we have different number of
527
00:20:26,880 --> 00:20:28,960
pregnancies different glucose levels
528
00:20:28,960 --> 00:20:31,440
blood pressure skin thickness
529
00:20:31,440 --> 00:20:33,919
insulin bmi age and then the outcome
530
00:20:33,919 --> 00:20:35,679
whether or not they have diabetes one
531
00:20:35,679 --> 00:20:38,640
for they do zero for they don't
532
00:20:38,640 --> 00:20:39,679
so here
533
00:20:39,679 --> 00:20:42,720
um all of these are
534
00:20:42,720 --> 00:20:44,159
quantitative
535
00:20:44,159 --> 00:20:45,840
features right because they're all on
536
00:20:45,840 --> 00:20:48,640
some scale
537
00:20:48,640 --> 00:20:52,240
so each row is a different sample in the
538
00:20:52,240 --> 00:20:54,240
data so it's a different example it's
539
00:20:54,240 --> 00:20:57,039
one person's data and each row
540
00:20:57,039 --> 00:21:00,640
represents one person in this data set
541
00:21:00,640 --> 00:21:02,640
now this column
542
00:21:02,640 --> 00:21:04,720
each column represents a different
543
00:21:04,720 --> 00:21:07,600
feature so this one here is some measure
544
00:21:07,600 --> 00:21:10,960
of blood pressure levels
545
00:21:10,960 --> 00:21:12,799
and this one over here as we mentioned
546
00:21:12,799 --> 00:21:14,799
is the output label so this one is
547
00:21:14,799 --> 00:21:18,960
whether or not they have diabetes
548
00:21:18,960 --> 00:21:20,880
and as i mentioned this is what we would
549
00:21:20,880 --> 00:21:23,200
call a feature vector because these are
550
00:21:23,200 --> 00:21:24,880
all of our features
551
00:21:24,880 --> 00:21:28,000
in one sample
552
00:21:28,000 --> 00:21:30,960
and this is what's known as the target
553
00:21:30,960 --> 00:21:34,000
or the output for that feature vector
554
00:21:34,000 --> 00:21:37,120
that's what we're trying to predict
555
00:21:37,120 --> 00:21:39,280
and all of these together is our
556
00:21:39,280 --> 00:21:40,960
features matrix
557
00:21:40,960 --> 00:21:42,559
x
558
00:21:42,559 --> 00:21:45,760
and over here this is our labels or
559
00:21:45,760 --> 00:21:49,520
targets vector y
560
00:21:49,520 --> 00:21:51,760
so i've condensed this to a chocolate
561
00:21:51,760 --> 00:21:53,280
bar to kind of
562
00:21:53,280 --> 00:21:55,760
talk about some of the other concepts in
563
00:21:55,760 --> 00:21:57,039
machine learning
564
00:21:57,039 --> 00:22:01,039
so over here we have our x our features
565
00:22:01,039 --> 00:22:05,440
matrix and over here this is our label y
566
00:22:05,440 --> 00:22:06,240
so
567
00:22:06,240 --> 00:22:09,520
each row of this will be fed into our
568
00:22:09,520 --> 00:22:10,960
model right
569
00:22:10,960 --> 00:22:12,799
and our model will make some sort of
570
00:22:12,799 --> 00:22:14,240
prediction
571
00:22:14,240 --> 00:22:15,919
and what we do is we compare that
572
00:22:15,919 --> 00:22:18,559
prediction to the actual
573
00:22:18,559 --> 00:22:21,120
value of y that we have in our labeled
574
00:22:21,120 --> 00:22:22,960
data set because that's the whole point
575
00:22:22,960 --> 00:22:24,720
of supervised learning is we can compare
576
00:22:24,720 --> 00:22:27,280
what our model is outputting to oh what
577
00:22:27,280 --> 00:22:29,440
is the truth actually and then we can go
578
00:22:29,440 --> 00:22:31,440
back and we can adjust some things so
579
00:22:31,440 --> 00:22:33,840
the next iteration we get closer
580
00:22:33,840 --> 00:22:34,720
to
581
00:22:34,720 --> 00:22:37,520
what the true value is
582
00:22:37,520 --> 00:22:38,720
so that
583
00:22:38,720 --> 00:22:41,039
whole process here the tinkering the
584
00:22:41,039 --> 00:22:42,720
okay what's the difference where did we
585
00:22:42,720 --> 00:22:44,159
go wrong
586
00:22:44,159 --> 00:22:45,919
that's what's known as training the
587
00:22:45,919 --> 00:22:47,520
model
588
00:22:47,520 --> 00:22:49,520
all right so take this whole
589
00:22:49,520 --> 00:22:51,919
you know chunk right here do we want to
590
00:22:51,919 --> 00:22:53,440
really put
591
00:22:53,440 --> 00:22:55,679
our entire chocolate bar into the model
592
00:22:55,679 --> 00:22:58,799
to train our model
593
00:22:58,880 --> 00:23:01,520
not really right because if we did that
594
00:23:01,520 --> 00:23:05,039
then how do we know that our model can
595
00:23:05,039 --> 00:23:08,559
do well on new data that we haven't seen
596
00:23:08,559 --> 00:23:11,440
like if i were to create a model
597
00:23:11,440 --> 00:23:12,240
to
598
00:23:12,240 --> 00:23:14,240
predict whether or not someone has
599
00:23:14,240 --> 00:23:16,159
diabetes
600
00:23:16,159 --> 00:23:18,799
let's say that i just train all my data
601
00:23:18,799 --> 00:23:20,400
and i see that on my training data it
602
00:23:20,400 --> 00:23:22,640
does well i go to some hospital i'm like
603
00:23:22,640 --> 00:23:24,000
here's my model
604
00:23:24,000 --> 00:23:25,440
i think you can use this to predict if
605
00:23:25,440 --> 00:23:27,679
somebody has diabetes
606
00:23:27,679 --> 00:23:29,760
do we think that would be effective or
607
00:23:29,760 --> 00:23:32,000
not
608
00:23:32,159 --> 00:23:36,400
probably not right because
609
00:23:36,400 --> 00:23:37,600
we haven't
610
00:23:37,600 --> 00:23:39,039
assessed
611
00:23:39,039 --> 00:23:42,480
how well our model can generalize
612
00:23:42,480 --> 00:23:44,960
okay it might do well after you know our
613
00:23:44,960 --> 00:23:46,640
model has seen this data over and over
614
00:23:46,640 --> 00:23:48,720
and over again but what about new data
615
00:23:48,720 --> 00:23:51,840
can our model handle new data
616
00:23:51,840 --> 00:23:53,520
well
617
00:23:53,520 --> 00:23:55,280
how do we how do we get our model to
618
00:23:55,280 --> 00:23:57,039
assess that
619
00:23:57,039 --> 00:23:59,280
so we actually break up our whole data
620
00:23:59,280 --> 00:24:00,960
set that we have
621
00:24:00,960 --> 00:24:03,200
into three different types of data sets
622
00:24:03,200 --> 00:24:05,360
we call it the training data set the
623
00:24:05,360 --> 00:24:07,679
validation data set and the testing data
624
00:24:07,679 --> 00:24:08,640
set
625
00:24:08,640 --> 00:24:10,159
and you know you might have sixty
626
00:24:10,159 --> 00:24:12,080
percent here twenty percent and twenty
627
00:24:12,080 --> 00:24:14,960
percent or eighty ten and ten um it
628
00:24:14,960 --> 00:24:16,400
really depends on how much data
629
00:24:16,400 --> 00:24:17,919
you have i think either of those would
630
00:24:17,919 --> 00:24:20,400
be acceptable
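(The 60/20/20 split described here can be sketched like this; `train_test_split` from scikit-learn is an assumption on my part — the video builds the split with `numpy.split` later — but the proportions come out the same.)

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix X (100 samples, 3 features) and labels y.
X = np.random.rand(100, 3)
y = np.random.randint(0, 2, size=100)

# First split off 20% for the test set, then 25% of the
# remainder for validation: 60% / 20% / 20% overall.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=0)

print(len(X_train), len(X_valid), len(X_test))  # 60 20 20
```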
631
00:24:20,400 --> 00:24:22,080
so what we do is then we feed the
632
00:24:22,080 --> 00:24:24,720
training data set into our model
633
00:24:24,720 --> 00:24:27,200
we come up with you know this might be a
634
00:24:27,200 --> 00:24:29,760
vector of predictions corresponding with
635
00:24:29,760 --> 00:24:31,039
each
636
00:24:31,039 --> 00:24:34,080
sample that we put into our model
637
00:24:34,080 --> 00:24:36,000
we figure out okay what's the difference
638
00:24:36,000 --> 00:24:38,960
between our prediction and the true
639
00:24:38,960 --> 00:24:42,080
values this is something known as loss
640
00:24:42,080 --> 00:24:43,440
loss is you know what's the difference
641
00:24:43,440 --> 00:24:44,240
here
642
00:24:44,240 --> 00:24:48,080
in some numerical quantity of course
643
00:24:48,320 --> 00:24:50,400
and then we make adjustments and that's
644
00:24:50,400 --> 00:24:52,240
what we call training
645
00:24:52,240 --> 00:24:54,320
okay
646
00:24:54,320 --> 00:24:55,520
so then
647
00:24:55,520 --> 00:24:56,960
once you know we've made a bunch of
648
00:24:56,960 --> 00:24:58,400
adjustments
649
00:24:58,400 --> 00:25:00,960
we can put our validation set through
650
00:25:00,960 --> 00:25:02,559
this model
651
00:25:02,559 --> 00:25:04,960
and the validation set is kind of used
652
00:25:04,960 --> 00:25:06,799
as a reality check
653
00:25:06,799 --> 00:25:10,559
during or after training to ensure that
654
00:25:10,559 --> 00:25:14,000
the model can handle unseen data still
655
00:25:14,000 --> 00:25:15,679
so every single time after we train one
656
00:25:15,679 --> 00:25:17,200
iteration we might
657
00:25:17,200 --> 00:25:19,279
stick the validation set in and see hey
658
00:25:19,279 --> 00:25:21,039
what's the loss there
659
00:25:21,039 --> 00:25:22,960
and then after our training is over we
660
00:25:22,960 --> 00:25:25,600
can assess the validation set and ask
661
00:25:25,600 --> 00:25:27,440
hey what's the loss there
662
00:25:27,440 --> 00:25:29,919
but one key difference here is that we
663
00:25:29,919 --> 00:25:33,039
don't have that training step this
664
00:25:33,039 --> 00:25:35,520
loss never gets fed back into the model
665
00:25:35,520 --> 00:25:38,720
right that feedback loop is not closed
666
00:25:38,720 --> 00:25:40,880
all right so let's talk about loss
667
00:25:40,880 --> 00:25:43,120
really quickly
668
00:25:43,120 --> 00:25:45,279
so here i have four different types of
669
00:25:45,279 --> 00:25:47,360
models i have some sort of data that's
670
00:25:47,360 --> 00:25:49,520
being fed into the model and then some
671
00:25:49,520 --> 00:25:51,120
output
672
00:25:51,120 --> 00:25:52,159
okay so
673
00:25:52,159 --> 00:25:55,520
this output here is pretty far from you
674
00:25:55,520 --> 00:25:56,400
know this
675
00:25:56,400 --> 00:25:58,400
truth that we want
676
00:25:58,400 --> 00:25:59,440
and so
677
00:25:59,440 --> 00:26:02,400
this loss is going to be high
678
00:26:02,400 --> 00:26:03,679
in model b
679
00:26:03,679 --> 00:26:06,000
again this is pretty far from what we
680
00:26:06,000 --> 00:26:07,520
want so this loss is also going to be
681
00:26:07,520 --> 00:26:10,480
high let's give it 1.5
682
00:26:10,480 --> 00:26:13,440
now this one here it's pretty close i
683
00:26:13,440 --> 00:26:15,600
mean maybe not almost but pretty close
684
00:26:15,600 --> 00:26:16,720
to this one
685
00:26:16,720 --> 00:26:19,679
so that might have a loss of 0.5
686
00:26:19,679 --> 00:26:21,600
and then this one here is
687
00:26:21,600 --> 00:26:24,720
maybe further than this but still better
688
00:26:24,720 --> 00:26:28,880
than these two so that loss might be 0.9
689
00:26:28,880 --> 00:26:30,720
okay so which of these model performs
690
00:26:30,720 --> 00:26:32,240
the best
691
00:26:32,240 --> 00:26:33,279
well
692
00:26:33,279 --> 00:26:35,520
model c has the smallest loss so it's
693
00:26:35,520 --> 00:26:39,200
probably model c
694
00:26:39,200 --> 00:26:41,600
okay now let's take model c after you
695
00:26:41,600 --> 00:26:43,679
know we've come up with these all these
696
00:26:43,679 --> 00:26:45,919
models and we've seen okay model c is
697
00:26:45,919 --> 00:26:48,640
probably the best model
698
00:26:48,640 --> 00:26:51,120
we take model c and we run our test set
699
00:26:51,120 --> 00:26:52,400
through this model
700
00:26:52,400 --> 00:26:54,799
and this test set is used as a final
701
00:26:54,799 --> 00:26:57,520
check to see how generalizable
702
00:26:57,520 --> 00:27:00,880
that chosen model is so if i you know
703
00:27:00,880 --> 00:27:03,360
finished training my diabetes data set
704
00:27:03,360 --> 00:27:05,360
then i could run it through some chunk
705
00:27:05,360 --> 00:27:07,520
of the data and i can say oh like this
706
00:27:07,520 --> 00:27:10,240
is how it performs on data that it's
707
00:27:10,240 --> 00:27:12,320
never seen before at any point during
708
00:27:12,320 --> 00:27:15,440
the training process okay
709
00:27:15,440 --> 00:27:18,640
and that loss that's the final reported
710
00:27:18,640 --> 00:27:20,480
performance of
711
00:27:20,480 --> 00:27:22,320
my test set or
712
00:27:22,320 --> 00:27:23,840
this would be the final reported
713
00:27:23,840 --> 00:27:26,799
performance of my model
714
00:27:26,799 --> 00:27:29,039
okay
715
00:27:29,200 --> 00:27:31,679
so let's talk about this thing called
716
00:27:31,679 --> 00:27:33,440
loss because i think i kind of just
717
00:27:33,440 --> 00:27:35,679
glossed over it right
718
00:27:35,679 --> 00:27:37,840
so loss is the difference between your
719
00:27:37,840 --> 00:27:40,799
prediction and the actual
720
00:27:40,799 --> 00:27:43,120
like label
721
00:27:43,120 --> 00:27:44,880
so this would give a slightly higher
722
00:27:44,880 --> 00:27:47,200
loss than this
723
00:27:47,200 --> 00:27:50,559
and this would even give a higher loss
724
00:27:50,559 --> 00:27:53,440
because it's even more off
725
00:27:53,440 --> 00:27:55,120
in computer science we like formulas
726
00:27:55,120 --> 00:27:57,520
right we like formulaic ways
727
00:27:57,520 --> 00:27:59,520
of describing things
728
00:27:59,520 --> 00:28:01,600
so here are some examples of loss
729
00:28:01,600 --> 00:28:03,200
functions and how we can actually come
730
00:28:03,200 --> 00:28:05,279
up with numbers
731
00:28:05,279 --> 00:28:08,000
this here is known as l1 loss
732
00:28:08,000 --> 00:28:10,080
and basically l1 loss just takes the
733
00:28:10,080 --> 00:28:11,520
absolute value
734
00:28:11,520 --> 00:28:12,480
of
735
00:28:12,480 --> 00:28:14,240
whatever your
736
00:28:14,240 --> 00:28:15,679
you know real
737
00:28:15,679 --> 00:28:17,840
value is whatever the real output label
738
00:28:17,840 --> 00:28:18,559
is
739
00:28:18,559 --> 00:28:21,279
subtracts the predicted value
740
00:28:21,279 --> 00:28:23,679
and takes the absolute value of that
741
00:28:23,679 --> 00:28:24,640
okay
742
00:28:24,640 --> 00:28:25,440
so
743
00:28:25,440 --> 00:28:28,320
the absolute value is
744
00:28:28,320 --> 00:28:29,520
a function that looks something like
745
00:28:29,520 --> 00:28:30,480
this
746
00:28:30,480 --> 00:28:33,200
so the further off you are the greater
747
00:28:33,200 --> 00:28:35,440
your losses
748
00:28:35,440 --> 00:28:38,880
right in either direction so
749
00:28:38,880 --> 00:28:41,120
if your real value is off from your
750
00:28:41,120 --> 00:28:44,000
predicted value by 10 then your loss for
751
00:28:44,000 --> 00:28:46,559
that point would be 10 and then this sum
752
00:28:46,559 --> 00:28:48,399
here just means hey we're taking all the
753
00:28:48,399 --> 00:28:50,960
points in our data set and we're trying
754
00:28:50,960 --> 00:28:52,960
to figure out the sum of how far
755
00:28:52,960 --> 00:28:55,679
everything is
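(A minimal sketch of the L1 loss just described — the sum of absolute differences over all points; the variable names are illustrative.)

```python
import numpy as np

def l1_loss(y_true, y_pred):
    # Sum of |actual - predicted| over every data point.
    return np.sum(np.abs(y_true - y_pred))

y_true = np.array([3.0, -0.5, 2.0])
y_pred = np.array([2.5, 0.0, 2.0])
print(l1_loss(y_true, y_pred))  # 1.0
```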
756
00:28:56,080 --> 00:28:58,320
now we also have something called l2
757
00:28:58,320 --> 00:28:59,600
loss so
758
00:28:59,600 --> 00:29:01,679
this loss function is quadratic which
759
00:29:01,679 --> 00:29:04,640
means that if it's close the penalty is
760
00:29:04,640 --> 00:29:08,480
very minimal and if it's off by a lot
761
00:29:08,480 --> 00:29:11,840
then the penalty is much much higher
762
00:29:11,840 --> 00:29:12,799
okay
763
00:29:12,799 --> 00:29:14,880
and this instead of the absolute value
764
00:29:14,880 --> 00:29:16,640
we just square
765
00:29:16,640 --> 00:29:18,320
the um
766
00:29:18,320 --> 00:29:21,600
the difference between the two
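(And the L2 loss, where the difference is squared instead of taking the absolute value, so large errors are penalized much more heavily than small ones; again a sketch with illustrative names.)

```python
import numpy as np

def l2_loss(y_true, y_pred):
    # Squared differences: small errors contribute little,
    # large errors contribute quadratically more.
    return np.sum((y_true - y_pred) ** 2)

y_true = np.array([3.0, -0.5, 2.0])
y_pred = np.array([2.5, 0.0, 2.0])
print(l2_loss(y_true, y_pred))  # 0.5
```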
767
00:29:22,880 --> 00:29:24,720
now there's also something called binary
768
00:29:24,720 --> 00:29:26,880
cross entropy loss
769
00:29:26,880 --> 00:29:29,520
it looks something like this and this is
770
00:29:29,520 --> 00:29:32,240
for uh binary classification this this
771
00:29:32,240 --> 00:29:34,399
might be the loss that we use
772
00:29:34,399 --> 00:29:35,760
so this loss
773
00:29:35,760 --> 00:29:37,840
you know i'm not going to really go
774
00:29:37,840 --> 00:29:40,159
through it too much but you just need to
775
00:29:40,159 --> 00:29:42,559
know that loss decreases as the
776
00:29:42,559 --> 00:29:45,919
performance gets better
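(Binary cross-entropy, sketched from the standard formula; the small epsilon clipping is an added assumption for numerical safety, so we never take log(0). The key behavior is what she states: the loss shrinks as predictions improve.)

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    # y_true holds 0/1 labels, p_pred the predicted probability
    # of class 1. Clip probabilities away from 0 and 1.
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1, 0, 1, 1])
# Mostly-correct, confident predictions give a small loss...
good = binary_cross_entropy(y_true, np.array([0.9, 0.1, 0.8, 0.7]))
# ...and confidently wrong ones give a much larger loss.
bad = binary_cross_entropy(y_true, np.array([0.1, 0.9, 0.2, 0.3]))
print(good, bad)
```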
777
00:29:47,039 --> 00:29:49,159
so there are some other measures of
778
00:29:49,159 --> 00:29:52,720
accuracy as well so for example accuracy
779
00:29:52,720 --> 00:29:55,360
what is accuracy
780
00:29:55,360 --> 00:29:57,360
so let's say that these are pictures
781
00:29:57,360 --> 00:30:00,960
that i'm feeding my model okay
782
00:30:00,960 --> 00:30:03,760
and these predictions might be
783
00:30:03,760 --> 00:30:06,559
apple orange orange apple
784
00:30:06,559 --> 00:30:09,200
okay but the actual is apple orange
785
00:30:09,200 --> 00:30:11,120
apple apple
786
00:30:11,120 --> 00:30:12,159
so
787
00:30:12,159 --> 00:30:14,399
three of them were correct and one of
788
00:30:14,399 --> 00:30:16,640
them was incorrect so the accuracy of
789
00:30:16,640 --> 00:30:18,880
this model is three quarters or 75
790
00:30:18,880 --> 00:30:20,480
percent
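(The apple/orange accuracy computation, as a one-liner: the fraction of predictions that match the true labels.)

```python
import numpy as np

predictions = np.array(["apple", "orange", "orange", "apple"])
actual      = np.array(["apple", "orange", "apple",  "apple"])

# Accuracy = fraction of predictions matching the true labels.
accuracy = np.mean(predictions == actual)
print(accuracy)  # 0.75
```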
791
00:30:20,480 --> 00:30:23,120
all right coming back to our
792
00:30:23,120 --> 00:30:25,200
Colab notebook i'm going to close this
793
00:30:25,200 --> 00:30:26,399
a little bit
794
00:30:26,399 --> 00:30:29,520
again we've imported stuff up here um
795
00:30:29,520 --> 00:30:32,240
and we've already created our data frame
796
00:30:32,240 --> 00:30:34,080
right here and this is this is all of
797
00:30:34,080 --> 00:30:35,600
our data this is what we're going to use
798
00:30:35,600 --> 00:30:38,480
to train our models
799
00:30:38,480 --> 00:30:40,480
so down here
800
00:30:40,480 --> 00:30:43,760
again if we now take a look at our data
801
00:30:43,760 --> 00:30:45,919
set
802
00:30:45,919 --> 00:30:47,679
you'll see that our classes are now
803
00:30:47,679 --> 00:30:49,679
zeros and ones so now this is all
804
00:30:49,679 --> 00:30:51,200
numerical which is good because our
805
00:30:51,200 --> 00:30:54,240
computer can now understand that
806
00:30:54,240 --> 00:30:55,760
okay
807
00:30:55,760 --> 00:30:57,279
and you know it would probably be a good
808
00:30:57,279 --> 00:30:58,640
idea to maybe
809
00:30:58,640 --> 00:31:00,960
kind of plot hey do these things have
810
00:31:00,960 --> 00:31:02,559
anything to do
811
00:31:02,559 --> 00:31:04,559
with the class
812
00:31:04,559 --> 00:31:05,360
so
813
00:31:05,360 --> 00:31:06,240
here
814
00:31:06,240 --> 00:31:08,559
i'm going to go through all the labels
815
00:31:08,559 --> 00:31:10,960
so for label in
816
00:31:10,960 --> 00:31:13,120
the columns of this data frame so this
817
00:31:13,120 --> 00:31:15,360
just gets me the list actually we have
818
00:31:15,360 --> 00:31:16,880
the list right it's called so let's just
819
00:31:16,880 --> 00:31:19,440
use that it might be less confusing
820
00:31:19,440 --> 00:31:20,960
of everything up till the last thing
821
00:31:20,960 --> 00:31:22,880
which is the class so i'm going to take
822
00:31:22,880 --> 00:31:25,360
all these 10 different features
823
00:31:25,360 --> 00:31:27,919
and i'm going to plot them
824
00:31:27,919 --> 00:31:30,840
as a histogram
825
00:31:30,840 --> 00:31:32,399
um
826
00:31:32,399 --> 00:31:34,240
so
827
00:31:34,240 --> 00:31:35,440
and now i'm gonna plot them as a
828
00:31:35,440 --> 00:31:37,360
histogram so basically if i take that
829
00:31:37,360 --> 00:31:41,120
data frame and i say okay for everything
830
00:31:41,120 --> 00:31:42,000
where
831
00:31:42,000 --> 00:31:43,279
the class
832
00:31:43,279 --> 00:31:45,840
is equal to one so these are all of our
833
00:31:45,840 --> 00:31:48,559
gammas remember
834
00:31:48,559 --> 00:31:49,440
now
835
00:31:49,440 --> 00:31:53,200
for that portion of the data frame if i
836
00:31:53,200 --> 00:31:56,240
look at this label so now these
837
00:31:56,240 --> 00:31:59,120
okay what this part here is saying
838
00:31:59,120 --> 00:32:00,960
is
839
00:32:00,960 --> 00:32:03,039
inside the data frame get me everything
840
00:32:03,039 --> 00:32:05,039
where the class is equal to one so
841
00:32:05,039 --> 00:32:07,360
that's all all of these would fit into
842
00:32:07,360 --> 00:32:09,039
that category right
843
00:32:09,039 --> 00:32:10,720
and now let's just look at the label
844
00:32:10,720 --> 00:32:11,919
column so
845
00:32:11,919 --> 00:32:13,840
the first label would be f length which
846
00:32:13,840 --> 00:32:15,600
would be this column
847
00:32:15,600 --> 00:32:17,600
so this command here is getting me
848
00:32:17,600 --> 00:32:19,519
all the different values that belong to
849
00:32:19,519 --> 00:32:23,120
class 1 for this specific label
850
00:32:23,120 --> 00:32:25,200
and that's exactly what i'm going to put
851
00:32:25,200 --> 00:32:26,640
into the histogram
852
00:32:26,640 --> 00:32:28,399
and now i'm just going to tell you know
853
00:32:28,399 --> 00:32:31,679
matplotlib make the color blue make
854
00:32:31,679 --> 00:32:32,799
oops
855
00:32:32,799 --> 00:32:35,760
label this as you know gamma
856
00:32:35,760 --> 00:32:36,960
um
857
00:32:36,960 --> 00:32:40,320
set alpha why do i keep doing that alpha
858
00:32:40,320 --> 00:32:42,399
equal to 0.7 so that's just like the
859
00:32:42,399 --> 00:32:44,559
transparency and then i'm going to set
860
00:32:44,559 --> 00:32:47,519
density equal to true so that when we
861
00:32:47,519 --> 00:32:50,399
compare it to
862
00:32:50,399 --> 00:32:52,640
the hadrons here
863
00:32:52,640 --> 00:32:54,960
we'll have a baseline for comparing them
864
00:32:54,960 --> 00:32:57,600
okay so the density being true just
865
00:32:57,600 --> 00:33:00,720
basically normalizes these distributions
866
00:33:00,720 --> 00:33:03,360
so you know if you have
867
00:33:03,360 --> 00:33:04,880
200
868
00:33:04,880 --> 00:33:08,000
of one type and then 50 of another type
869
00:33:08,000 --> 00:33:10,720
well if you drew the histograms it would
870
00:33:10,720 --> 00:33:13,039
be hard to compare because one of them
871
00:33:13,039 --> 00:33:14,960
would be a lot bigger than the other
872
00:33:14,960 --> 00:33:17,360
right but by normalizing them we kind of
873
00:33:17,360 --> 00:33:18,960
are distributing them over how many
874
00:33:18,960 --> 00:33:21,200
samples there are
875
00:33:21,200 --> 00:33:23,360
all right and then i'm just going to put
876
00:33:23,360 --> 00:33:27,200
a title on here make that the label
877
00:33:27,200 --> 00:33:29,039
uh the y label
878
00:33:29,039 --> 00:33:30,960
so because it's density the y label is
879
00:33:30,960 --> 00:33:32,720
probability
880
00:33:32,720 --> 00:33:34,799
and the x label
881
00:33:34,799 --> 00:33:37,600
is just going to be the label
882
00:33:37,600 --> 00:33:39,600
um
883
00:33:39,600 --> 00:33:41,600
what is going on
884
00:33:41,600 --> 00:33:43,200
and i'm going to
885
00:33:43,200 --> 00:33:46,080
include a legend and
886
00:33:46,080 --> 00:33:47,919
plt.show just means okay display the
887
00:33:47,919 --> 00:33:48,880
plot
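(The plotting loop she walks through can be sketched like this; the toy data frame and the single `fLength` column are stand-ins for the real MAGIC telescope data, where class 1 is gamma and class 0 is hadron.)

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Toy stand-in for the data frame: one feature column plus
# a 0/1 "class" column (1 = gamma, 0 = hadron).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "fLength": np.concatenate([rng.normal(20, 5, 200),
                               rng.normal(40, 10, 50)]),
    "class": np.concatenate([np.ones(200, dtype=int),
                             np.zeros(50, dtype=int)]),
})

for label in df.columns[:-1]:  # every column except the class label
    plt.hist(df[df["class"] == 1][label], color="blue", label="gamma",
             alpha=0.7, density=True)
    plt.hist(df[df["class"] == 0][label], color="red", label="hadron",
             alpha=0.7, density=True)
    plt.title(label)
    plt.ylabel("Probability")
    plt.xlabel(label)
    plt.legend()
    plt.show()
```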
888
00:33:48,880 --> 00:33:53,120
so if i run that
889
00:33:53,120 --> 00:33:54,559
just be
890
00:33:54,559 --> 00:33:56,559
up to the last item so we want a list
891
00:33:56,559 --> 00:33:59,679
right not just the last item
892
00:33:59,679 --> 00:34:01,200
and now we can see that we're plotting
893
00:34:01,200 --> 00:34:04,480
all of these so here we have the length
894
00:34:04,480 --> 00:34:08,239
oh and i made this gamma so
895
00:34:08,239 --> 00:34:11,440
uh this should be hadron
896
00:34:11,440 --> 00:34:12,239
okay
897
00:34:12,239 --> 00:34:13,918
so the gamma's in blue the hadrons are
898
00:34:13,918 --> 00:34:16,320
in red so here we can already see that
899
00:34:16,320 --> 00:34:18,960
you know maybe if the length is smaller
900
00:34:18,960 --> 00:34:21,040
it's probably more likely to be gamma
901
00:34:21,040 --> 00:34:22,639
right
902
00:34:22,639 --> 00:34:24,079
and we can kind of you know these all
903
00:34:24,079 --> 00:34:26,639
look somewhat similar
904
00:34:26,639 --> 00:34:29,119
but here okay clearly if there's more
905
00:34:29,119 --> 00:34:30,879
asymmetry
906
00:34:30,879 --> 00:34:32,719
or if you know this
907
00:34:32,719 --> 00:34:36,800
asymmetry measure is larger
908
00:34:36,800 --> 00:34:41,040
um then it's probably a hadron
909
00:34:41,280 --> 00:34:43,760
okay oh this one's a good one so
910
00:34:43,760 --> 00:34:45,040
f alpha
911
00:34:45,040 --> 00:34:46,560
seems like
912
00:34:46,560 --> 00:34:48,320
hadrons are pretty evenly distributed
913
00:34:48,320 --> 00:34:50,239
whereas if this is smaller it
914
00:34:50,239 --> 00:34:52,000
looks like there's more gammas in that
915
00:34:52,000 --> 00:34:54,399
area
916
00:34:55,280 --> 00:34:58,160
okay so this is kind of the data
917
00:34:58,160 --> 00:34:59,680
that we're working with we can kind of
918
00:34:59,680 --> 00:35:02,240
see what's going on
919
00:35:02,240 --> 00:35:03,599
okay so the next thing that we're going
920
00:35:03,599 --> 00:35:05,119
to do here
921
00:35:05,119 --> 00:35:07,839
is we are going to create our
922
00:35:07,839 --> 00:35:09,359
train
923
00:35:09,359 --> 00:35:11,200
our validation
924
00:35:11,200 --> 00:35:13,680
and our test data sets
925
00:35:13,680 --> 00:35:18,000
i'm going to set train valid and test
926
00:35:18,000 --> 00:35:20,079
to be equal to
927
00:35:20,079 --> 00:35:21,440
this
928
00:35:21,440 --> 00:35:23,440
so numpy.split i'm just splitting up the
929
00:35:23,440 --> 00:35:24,720
data frame
930
00:35:24,720 --> 00:35:27,760
and if i do this sample where i'm
931
00:35:27,760 --> 00:35:29,520
sampling everything this will basically
932
00:35:29,520 --> 00:35:31,440
shuffle my data
933
00:35:31,440 --> 00:35:34,079
um now if i
934
00:35:34,079 --> 00:35:37,760
i want to pass in where exactly i'm
935
00:35:37,760 --> 00:35:40,000
splitting my data set so
936
00:35:40,000 --> 00:35:42,079
the first split is going to be maybe at
937
00:35:42,079 --> 00:35:43,440
60 percent
938
00:35:43,440 --> 00:35:45,839
so i'm just going to say 0.6 times the
939
00:35:45,839 --> 00:35:47,599
length of this data frame
940
00:35:47,599 --> 00:35:50,160
so and then cast as an integer
941
00:35:50,160 --> 00:35:52,079
that's going to be the first place where
942
00:35:52,079 --> 00:35:53,440
you know i cut it off and that will be
943
00:35:53,440 --> 00:35:55,280
my training data
944
00:35:55,280 --> 00:35:59,119
now if i then go to 0.8
945
00:35:59,119 --> 00:36:01,040
this basically means everything between
946
00:36:01,040 --> 00:36:03,520
60 and 80 of the length of the data set
947
00:36:03,520 --> 00:36:05,680
will go towards validation
948
00:36:05,680 --> 00:36:06,800
and then
949
00:36:06,800 --> 00:36:08,720
like everything from 80 to 100 is going
950
00:36:08,720 --> 00:36:10,720
to be my test data
951
00:36:10,720 --> 00:36:12,880
so i can run that
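(Her `numpy.split` call, sketched on a tiny data frame so the 60% and 80% cut points are easy to see: rows before the first cut go to train, between the cuts to validation, after the second to test.)

```python
import numpy as np
import pandas as pd

# Toy data frame (10 rows) standing in for the real one.
df = pd.DataFrame({"x": range(10), "class": [0, 1] * 5})

# sample(frac=1) shuffles; np.split cuts at 60% and 80%.
train, valid, test = np.split(
    df.sample(frac=1, random_state=0),
    [int(0.6 * len(df)), int(0.8 * len(df))],
)
print(len(train), len(valid), len(test))  # 6 2 2
```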
952
00:36:12,880 --> 00:36:14,400
and
953
00:36:14,400 --> 00:36:15,200
now
954
00:36:15,200 --> 00:36:17,119
if we go up here and we inspect this
955
00:36:17,119 --> 00:36:19,839
data we'll see that these columns seem
956
00:36:19,839 --> 00:36:22,480
to have values in like the 100s whereas
957
00:36:22,480 --> 00:36:23,520
this one
958
00:36:23,520 --> 00:36:26,000
is 0.03
959
00:36:26,000 --> 00:36:28,240
right so the scale of all these numbers
960
00:36:28,240 --> 00:36:30,000
is way off
961
00:36:30,000 --> 00:36:32,720
and sometimes that will affect our
962
00:36:32,720 --> 00:36:33,839
results
963
00:36:33,839 --> 00:36:36,480
so one thing that we would want to do is
964
00:36:36,480 --> 00:36:37,760
scale these
965
00:36:37,760 --> 00:36:41,599
so that they are you know
966
00:36:41,599 --> 00:36:45,040
so that it's now relative to maybe
967
00:36:45,040 --> 00:36:47,760
the mean and the standard deviation of
968
00:36:47,760 --> 00:36:50,720
that specific column
969
00:36:51,359 --> 00:36:53,359
i'm going to create a function called
970
00:36:53,359 --> 00:36:55,440
scale data set
971
00:36:55,440 --> 00:36:58,240
and i'm going to pass in the data frame
972
00:36:58,240 --> 00:37:00,560
um
973
00:37:00,560 --> 00:37:02,960
and that's what i'll do for now okay so
974
00:37:02,960 --> 00:37:04,160
the x
975
00:37:04,160 --> 00:37:06,240
values are going to be you know i take
976
00:37:06,240 --> 00:37:07,680
the data frame
977
00:37:07,680 --> 00:37:08,560
and
978
00:37:08,560 --> 00:37:10,640
let's assume that
979
00:37:10,640 --> 00:37:12,400
the columns
980
00:37:12,400 --> 00:37:13,520
um
981
00:37:13,520 --> 00:37:15,520
are going to be you know that the label
982
00:37:15,520 --> 00:37:17,599
will always be the last thing in the
983
00:37:17,599 --> 00:37:21,280
data frame so what i can do is say
984
00:37:21,280 --> 00:37:22,960
dataframe dot columns all the way up to the
985
00:37:22,960 --> 00:37:23,920
last
986
00:37:23,920 --> 00:37:24,880
item
987
00:37:24,880 --> 00:37:27,280
and get those values
988
00:37:27,280 --> 00:37:29,920
now for my y
989
00:37:29,920 --> 00:37:31,760
well it's the last column so i can just
990
00:37:31,760 --> 00:37:33,760
do this i can just index into that last
991
00:37:33,760 --> 00:37:34,720
column
992
00:37:34,720 --> 00:37:37,839
and then get those values
993
00:37:38,560 --> 00:37:40,400
now
994
00:37:40,400 --> 00:37:41,520
in
995
00:37:41,520 --> 00:37:42,560
so
996
00:37:42,560 --> 00:37:45,760
i'm actually going to import something
997
00:37:45,760 --> 00:37:48,560
known as the standard scaler from
998
00:37:48,560 --> 00:37:52,880
sklearn so if i come up here i can go to
999
00:37:52,880 --> 00:37:55,880
sklearn.preprocessing
1000
00:37:56,000 --> 00:37:59,520
and i'm going to import um standard
1001
00:37:59,520 --> 00:38:01,119
scaler
1002
00:38:01,119 --> 00:38:03,119
i have to run that cell
1003
00:38:03,119 --> 00:38:04,880
i'm going to come back down here
1004
00:38:04,880 --> 00:38:07,760
and now i'm going to create a scaler and
1005
00:38:07,760 --> 00:38:10,320
use that scaler so standard
1006
00:38:10,320 --> 00:38:12,800
scaler
1007
00:38:13,440 --> 00:38:15,920
and with the scaler what i can do is
1008
00:38:15,920 --> 00:38:19,520
actually just fit and transform x
1009
00:38:19,520 --> 00:38:22,560
so here i can say x is equal to
1010
00:38:22,560 --> 00:38:25,520
scaler dot fit
1011
00:38:25,520 --> 00:38:27,839
fit transform
1012
00:38:27,839 --> 00:38:30,079
x so what that's doing is saying okay
1013
00:38:30,079 --> 00:38:33,920
take x and fit the standard scaler to x
1014
00:38:33,920 --> 00:38:35,520
and then transform all those values and
1015
00:38:35,520 --> 00:38:37,119
what would it be and that's going to be
1016
00:38:37,119 --> 00:38:38,560
our new x
1017
00:38:38,560 --> 00:38:40,160
all right
1018
00:38:40,160 --> 00:38:42,880
and then i'm also going to just create
1019
00:38:42,880 --> 00:38:46,720
you know the whole data as one huge
1020
00:38:46,720 --> 00:38:48,480
2d numpy array
1021
00:38:48,480 --> 00:38:50,800
and in order to do that i'm going to
1022
00:38:50,800 --> 00:38:51,680
call
1023
00:38:51,680 --> 00:38:54,400
hstack so hstack is saying okay take an
1024
00:38:54,400 --> 00:38:57,280
array and another array and horizontally
1025
00:38:57,280 --> 00:38:58,640
stack them together that's what the h
1026
00:38:58,640 --> 00:38:59,599
stands for
1027
00:38:59,599 --> 00:39:01,599
so horizontally stacking them together
1028
00:39:01,599 --> 00:39:03,839
just like put them side by side okay not
1029
00:39:03,839 --> 00:39:05,920
on top of each other
1030
00:39:05,920 --> 00:39:08,160
so what am i stacking well i have to
1031
00:39:08,160 --> 00:39:09,920
pass in something
1032
00:39:09,920 --> 00:39:11,839
so that it can stack
1033
00:39:11,839 --> 00:39:14,079
x and y
1034
00:39:14,079 --> 00:39:16,480
and now
1035
00:39:16,880 --> 00:39:19,359
okay so numpy is very particular about
1036
00:39:19,359 --> 00:39:22,320
dimensions right so in this specific
1037
00:39:22,320 --> 00:39:25,280
case our x is a two-dimensional object but
1038
00:39:25,280 --> 00:39:27,359
y is only a one-dimensional thing it's
1039
00:39:27,359 --> 00:39:28,400
only a vector
1040
00:39:28,400 --> 00:39:29,760
of values
1041
00:39:29,760 --> 00:39:33,200
so in order to now reshape it into a 2d
1042
00:39:33,200 --> 00:39:38,359
item we have to call numpy.reshape
1043
00:39:38,640 --> 00:39:41,680
and we can pass in the dimensions of its
1044
00:39:41,680 --> 00:39:42,880
reshape
1045
00:39:42,880 --> 00:39:43,920
so
1046
00:39:43,920 --> 00:39:46,480
if i pass a negative 1 comma 1 that just
1047
00:39:46,480 --> 00:39:48,880
means okay make this a 2d array where
1048
00:39:48,880 --> 00:39:52,000
the negative 1 just means infer what
1049
00:39:52,000 --> 00:39:54,560
this dimension value would be which
1050
00:39:54,560 --> 00:39:56,400
ends up being the length of y this would
1051
00:39:56,400 --> 00:39:58,880
be the same as literally doing this
1052
00:39:58,880 --> 00:40:00,320
but the negative one is easier because
1053
00:40:00,320 --> 00:40:01,839
we're making the computer do the hard
1054
00:40:01,839 --> 00:40:03,280
work
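The reshape step described above can be sketched on its own like this (the toy vector here is made up for illustration):

```python
import numpy as np

y = np.array([0, 1, 1, 0])      # 1D vector, shape (4,)
col = np.reshape(y, (-1, 1))    # -1 lets numpy infer that dimension
print(col.shape)                # → (4, 1)

# The -1 is equivalent to spelling the length out explicitly
assert (col == np.reshape(y, (len(y), 1))).all()
```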
1055
00:40:03,280 --> 00:40:05,680
so if i stack that i'm going to then
1056
00:40:05,680 --> 00:40:08,839
return the data x and
1057
00:40:08,839 --> 00:40:10,720
y
1058
00:40:10,720 --> 00:40:12,240
okay
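Putting the steps above together, a minimal sketch of this scale_dataset helper (the toy dataframe is a made-up stand-in for the MAGIC data; the oversampling branch comes later):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def scale_dataset(dataframe):
    # Label is always the last column; everything before it is a feature
    X = dataframe[dataframe.columns[:-1]].values
    y = dataframe[dataframe.columns[-1]].values

    # Fit the standard scaler to X, then transform all those values
    scaler = StandardScaler()
    X = scaler.fit_transform(X)

    # Horizontally stack features and the label reshaped into a column
    data = np.hstack((X, np.reshape(y, (-1, 1))))
    return data, X, y

# Hypothetical toy frame standing in for the real dataset
df = pd.DataFrame({"f1": [1.0, 2.0, 3.0],
                   "f2": [10.0, 20.0, 30.0],
                   "class": [1, 0, 1]})
data, X, y = scale_dataset(df)
```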
1059
00:40:12,240 --> 00:40:14,960
so one more thing is that if we go into
1060
00:40:14,960 --> 00:40:16,640
our training data set
1061
00:40:16,640 --> 00:40:19,119
okay again this is our training data set
1062
00:40:19,119 --> 00:40:21,520
and we get the length of the training
1063
00:40:21,520 --> 00:40:22,839
data set
1064
00:40:22,839 --> 00:40:26,160
but where the training data set's
1065
00:40:26,160 --> 00:40:27,680
class
1066
00:40:27,680 --> 00:40:28,800
is one
1067
00:40:28,800 --> 00:40:32,560
so remember that this is the gammas
1068
00:40:32,560 --> 00:40:36,240
and then if we print
1069
00:40:36,839 --> 00:40:40,640
that and we do the same thing but zero
1070
00:40:40,640 --> 00:40:42,160
we'll see that
1071
00:40:42,160 --> 00:40:43,839
you know there's around
1072
00:40:43,839 --> 00:40:46,720
seven thousand of the gammas but only
1073
00:40:46,720 --> 00:40:49,599
around four thousand of the hadrons so
1074
00:40:49,599 --> 00:40:52,720
that might actually become an issue
1075
00:40:52,720 --> 00:40:55,200
and instead what we want to do
1076
00:40:55,200 --> 00:40:58,319
is we want to over sample our
1077
00:40:58,319 --> 00:41:00,079
training data set
1078
00:41:00,079 --> 00:41:02,960
so that means that we want to increase
1079
00:41:02,960 --> 00:41:05,200
the number of
1080
00:41:05,200 --> 00:41:06,640
these values
1081
00:41:06,640 --> 00:41:09,839
so that these kind of match better
1082
00:41:09,839 --> 00:41:13,040
and surprise surprise there is something
1083
00:41:13,040 --> 00:41:14,800
that we can import that will help us do
1084
00:41:14,800 --> 00:41:16,400
that so
1085
00:41:16,400 --> 00:41:20,920
i'm going to go to from imblearn.over_sampling
1086
00:41:22,720 --> 00:41:24,960
and i'm going to import this random over
1087
00:41:24,960 --> 00:41:26,480
sampler
1088
00:41:26,480 --> 00:41:28,079
run that cell
1089
00:41:28,079 --> 00:41:30,480
and come back down here
1090
00:41:30,480 --> 00:41:33,040
so i will actually add in this parameter
1091
00:41:33,040 --> 00:41:35,280
called oversample
1092
00:41:35,280 --> 00:41:37,680
and set that to false
1093
00:41:37,680 --> 00:41:39,520
by default
1094
00:41:39,520 --> 00:41:41,599
um
1095
00:41:41,599 --> 00:41:44,640
and if i do want to oversample then
1096
00:41:44,640 --> 00:41:48,240
here's what i'm going to do
1097
00:41:48,240 --> 00:41:50,480
so if i do want to
1098
00:41:50,480 --> 00:41:51,760
oversample
1099
00:41:51,760 --> 00:41:55,119
then i'm going to create this ros and
1100
00:41:55,119 --> 00:41:58,240
set it equal to this random over sampler
1101
00:41:58,240 --> 00:42:00,400
and then for x and y i'm just going to
1102
00:42:00,400 --> 00:42:02,400
say okay just fit
1103
00:42:02,400 --> 00:42:04,400
and resample
1104
00:42:04,400 --> 00:42:06,480
x and y and what that's doing is saying
1105
00:42:06,480 --> 00:42:09,200
okay take more of the smaller
1106
00:42:09,200 --> 00:42:10,160
class
1107
00:42:10,160 --> 00:42:12,319
so take the smaller class and keep
1108
00:42:12,319 --> 00:42:13,839
sampling from there
1109
00:42:13,839 --> 00:42:16,480
to increase the size of our data set of
1110
00:42:16,480 --> 00:42:18,400
that smaller class so that they now
1111
00:42:18,400 --> 00:42:20,720
match
1112
00:42:20,880 --> 00:42:25,200
so if i do this and i call scale
1113
00:42:25,200 --> 00:42:26,720
data set
1114
00:42:26,720 --> 00:42:28,480
and i pass in the training data set
1115
00:42:28,480 --> 00:42:32,000
where over sample is true
1116
00:42:32,000 --> 00:42:34,480
so this let's say this is train and then
1117
00:42:34,480 --> 00:42:36,079
x train
1118
00:42:36,079 --> 00:42:38,880
y train
1119
00:42:40,640 --> 00:42:42,000
oops
1120
00:42:42,000 --> 00:42:43,520
what's going on
1121
00:42:43,520 --> 00:42:47,200
oh these should be columns
1122
00:42:47,680 --> 00:42:48,960
so basically
1123
00:42:48,960 --> 00:42:50,319
what i'm doing now is i'm just saying
1124
00:42:50,319 --> 00:42:54,240
okay what is the length of y train
1125
00:42:54,240 --> 00:42:57,599
okay now it's 14 800 whatever and now
1126
00:42:57,599 --> 00:42:59,839
let's take a look at
1127
00:42:59,839 --> 00:43:03,920
um how many of these are type one
1128
00:43:04,560 --> 00:43:07,200
so actually we can just
1129
00:43:07,200 --> 00:43:09,280
sum that up
1130
00:43:09,280 --> 00:43:10,880
and then we'll also see that if we
1131
00:43:10,880 --> 00:43:12,560
instead switch the label and ask how
1132
00:43:12,560 --> 00:43:14,560
many of them are the other type
1133
00:43:14,560 --> 00:43:16,880
it's the same value so now these have
1134
00:43:16,880 --> 00:43:18,480
been
1135
00:43:18,480 --> 00:43:22,319
evenly you know rebalanced
1136
00:43:22,319 --> 00:43:24,400
okay well
1137
00:43:24,400 --> 00:43:25,359
okay
1138
00:43:25,359 --> 00:43:27,520
so here i'm just going to make
1139
00:43:27,520 --> 00:43:31,359
this the validation data set
1140
00:43:31,760 --> 00:43:34,800
and then the next one
1141
00:43:34,800 --> 00:43:37,680
uh i'm going to make this the test data
1142
00:43:37,680 --> 00:43:38,640
set
1143
00:43:38,640 --> 00:43:39,680
all right and we're actually going to
1144
00:43:39,680 --> 00:43:42,960
switch over sample here to false
1145
00:43:42,960 --> 00:43:44,400
now the reason why i'm switching that to
1146
00:43:44,400 --> 00:43:45,920
false is because
1147
00:43:45,920 --> 00:43:48,640
my validation and my test sets are
1148
00:43:48,640 --> 00:43:50,079
for the purpose of you know if i have
1149
00:43:50,079 --> 00:43:52,839
data that i haven't seen yet how does my
1150
00:43:52,839 --> 00:43:55,839
model perform on those
1151
00:43:55,839 --> 00:43:58,960
and i don't want to over sample for that
1152
00:43:58,960 --> 00:44:01,040
right now like i i don't care about
1153
00:44:01,040 --> 00:44:03,680
balancing those i want to know if i
1154
00:44:03,680 --> 00:44:06,560
have a random set of data that's
1155
00:44:06,560 --> 00:44:11,280
unlabeled can i trust my model right
1156
00:44:11,280 --> 00:44:13,760
so that's why i'm not oversampling
1157
00:44:13,760 --> 00:44:16,319
i run that and
1158
00:44:16,319 --> 00:44:17,280
again
1159
00:44:17,280 --> 00:44:19,040
what is going on oh it's because we
1160
00:44:19,040 --> 00:44:21,040
already have this train
1161
00:44:21,040 --> 00:44:22,880
so i have to go come up here and split
1162
00:44:22,880 --> 00:44:24,560
that data frame again
1163
00:44:24,560 --> 00:44:26,880
and now let's run these
1164
00:44:26,880 --> 00:44:29,119
okay
1165
00:44:29,520 --> 00:44:31,520
so now we have our data properly
1166
00:44:31,520 --> 00:44:34,480
formatted and we're going to move on to
1167
00:44:34,480 --> 00:44:35,760
different models now and i'm going to
1168
00:44:35,760 --> 00:44:37,359
tell you guys a little bit about each of
1169
00:44:37,359 --> 00:44:39,119
these models and then i'm going to show
1170
00:44:39,119 --> 00:44:42,400
you how we can do that in our code
1171
00:44:42,400 --> 00:44:43,920
so the first model that we're going to
1172
00:44:43,920 --> 00:44:46,560
learn about is knn or k-nearest
1173
00:44:46,560 --> 00:44:47,839
neighbors
1174
00:44:47,839 --> 00:44:48,640
okay
1175
00:44:48,640 --> 00:44:51,760
so here i've already drawn a plot on the
1176
00:44:51,760 --> 00:44:55,839
y-axis i have the number of kids
1177
00:44:55,839 --> 00:44:58,960
that a family might have and then on
1178
00:44:58,960 --> 00:45:01,599
the x-axis i have their income in terms
1179
00:45:01,599 --> 00:45:05,520
of thousands per year so
1180
00:45:05,520 --> 00:45:07,920
you know if someone's making 40
1181
00:45:07,920 --> 00:45:10,000
000 a year that's where this would be
1182
00:45:10,000 --> 00:45:12,079
and if somebody's making 320 that's where
1183
00:45:12,079 --> 00:45:13,920
that would be somebody has zero kids
1184
00:45:13,920 --> 00:45:16,400
it'd be somewhere along this axis
1185
00:45:16,400 --> 00:45:18,319
somebody has five it'd be somewhere over
1186
00:45:18,319 --> 00:45:20,560
here okay
1187
00:45:20,560 --> 00:45:24,319
and now i have
1188
00:45:24,319 --> 00:45:26,640
these plus signs and these minus signs
1189
00:45:26,640 --> 00:45:27,760
on here
1190
00:45:27,760 --> 00:45:30,319
so what i'm going to represent here is
1191
00:45:30,319 --> 00:45:33,280
the plus sign
1192
00:45:33,280 --> 00:45:35,599
means that they
1193
00:45:35,599 --> 00:45:38,480
own a car
1194
00:45:39,599 --> 00:45:41,920
and the minus sign
1195
00:45:41,920 --> 00:45:44,400
is going to represent no car
1196
00:45:44,400 --> 00:45:45,599
okay
1197
00:45:45,599 --> 00:45:46,560
so
1198
00:45:46,560 --> 00:45:48,720
your initial thought should be okay i
1199
00:45:48,720 --> 00:45:50,560
think this is binary classification
1200
00:45:50,560 --> 00:45:52,720
because all of our
1201
00:45:52,720 --> 00:45:55,760
points all of our samples
1202
00:45:55,760 --> 00:45:58,839
have labels so this is a
1203
00:45:58,839 --> 00:46:01,200
sample with
1204
00:46:01,200 --> 00:46:03,839
the plus label
1205
00:46:03,839 --> 00:46:07,839
and this here is another sample
1206
00:46:07,839 --> 00:46:09,119
with
1207
00:46:09,119 --> 00:46:11,520
the minus label
1208
00:46:11,520 --> 00:46:13,200
this is an abbreviation for with that
1209
00:46:13,200 --> 00:46:15,599
i'll use
1210
00:46:15,599 --> 00:46:16,800
all right
1211
00:46:16,800 --> 00:46:19,359
so we have this entire data set and
1212
00:46:19,359 --> 00:46:21,680
maybe around half the people own a car
1213
00:46:21,680 --> 00:46:23,280
and maybe
1214
00:46:23,280 --> 00:46:26,240
around half the people don't own a car
1215
00:46:26,240 --> 00:46:29,680
okay well what if i had some new point
1216
00:46:29,680 --> 00:46:32,079
let me choose a different color i'll
1217
00:46:32,079 --> 00:46:34,000
use this nice green
1218
00:46:34,000 --> 00:46:36,079
well what if i have a new point over
1219
00:46:36,079 --> 00:46:38,720
here so let's say that somebody makes 40
1220
00:46:38,720 --> 00:46:40,839
000 a year and has two
1221
00:46:40,839 --> 00:46:45,599
kids what do we think that would be
1222
00:46:47,760 --> 00:46:50,319
well just logically looking at this plot
1223
00:46:50,319 --> 00:46:53,440
you might think okay it seems like
1224
00:46:53,440 --> 00:46:55,119
they wouldn't have a car right because
1225
00:46:55,119 --> 00:46:56,640
that kind of matches the pattern of
1226
00:46:56,640 --> 00:46:59,119
everybody else around them
1227
00:46:59,119 --> 00:47:01,359
so that's the whole concept of this
1228
00:47:01,359 --> 00:47:05,119
nearest neighbors is you look at okay
1229
00:47:05,119 --> 00:47:06,720
what's around you
1230
00:47:06,720 --> 00:47:08,160
and then you're basically like okay i'm
1231
00:47:08,160 --> 00:47:10,400
going to take the label of the majority
1232
00:47:10,400 --> 00:47:12,720
that's around me
1233
00:47:12,720 --> 00:47:14,000
so the first thing we have to do is we
1234
00:47:14,000 --> 00:47:16,400
have to define a distance function
1235
00:47:16,400 --> 00:47:19,920
and a lot of times in you know 2d plots
1236
00:47:19,920 --> 00:47:21,680
like this our distance function is
1237
00:47:21,680 --> 00:47:23,599
something known as
1238
00:47:23,599 --> 00:47:27,200
euclidean distance
1239
00:47:31,839 --> 00:47:33,920
and euclidean distance
1240
00:47:33,920 --> 00:47:35,680
is basically just
1241
00:47:35,680 --> 00:47:39,839
this straight line distance like this
1242
00:47:40,960 --> 00:47:43,280
okay
1243
00:47:44,960 --> 00:47:47,200
so this would be the euclidean distance
1244
00:47:47,200 --> 00:47:48,960
it seems like
1245
00:47:48,960 --> 00:47:51,599
there's this point
1246
00:47:51,599 --> 00:47:53,599
there's this point
1247
00:47:53,599 --> 00:47:55,200
there's that point
1248
00:47:55,200 --> 00:47:57,839
etcetera so the length of this line this
1249
00:47:57,839 --> 00:48:00,000
green line that i just drew that is
1250
00:48:00,000 --> 00:48:03,119
what's known as euclidean distance
1251
00:48:03,119 --> 00:48:05,440
if we want to get technical with that
1252
00:48:05,440 --> 00:48:08,000
this exact formula
1253
00:48:08,000 --> 00:48:11,680
is the distance here let me zoom in
1254
00:48:11,680 --> 00:48:15,359
the distance is equal to the square root
1255
00:48:15,359 --> 00:48:17,280
of one point's
1256
00:48:17,280 --> 00:48:18,319
x
1257
00:48:18,319 --> 00:48:20,880
minus the other point's x
1258
00:48:20,880 --> 00:48:22,079
squared
1259
00:48:22,079 --> 00:48:24,960
plus extend that square root
1260
00:48:24,960 --> 00:48:27,920
the same thing for y so y one of one
1261
00:48:27,920 --> 00:48:30,000
minus y two of the other
1262
00:48:30,000 --> 00:48:31,280
squared
1263
00:48:31,280 --> 00:48:32,880
okay so we're basically
1264
00:48:32,880 --> 00:48:35,280
trying to find the differences
1266
00:48:35,280 --> 00:48:37,599
in x and in y
1266
00:48:37,599 --> 00:48:39,119
and then
1267
00:48:39,119 --> 00:48:41,359
square each of those sum it up and take
1268
00:48:41,359 --> 00:48:43,040
the square root
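That formula, written out as a small function (using Python's standard library only; the points in the example are made up):

```python
import math

def euclidean_distance(p1, p2):
    # Straight-line distance: square each per-dimension difference,
    # sum them up, and take the square root. Works for any number
    # of dimensions, not just 2.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))

print(euclidean_distance((0, 0), (3, 4)))  # → 5.0
```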
1269
00:48:43,040 --> 00:48:44,559
okay so i'm going to erase this so it
1270
00:48:44,559 --> 00:48:46,960
doesn't clutter
1271
00:48:46,960 --> 00:48:49,839
my drawing
1272
00:48:50,960 --> 00:48:53,280
but anyways now going back to this plot
1273
00:48:53,280 --> 00:48:54,880
so here
1274
00:48:54,880 --> 00:48:57,119
in the nearest neighbor algorithm we see
1275
00:48:57,119 --> 00:48:59,359
that there is a
1276
00:48:59,359 --> 00:49:00,640
k
1277
00:49:00,640 --> 00:49:02,160
right
1278
00:49:02,160 --> 00:49:04,319
and this k is basically telling us okay
1279
00:49:04,319 --> 00:49:06,800
how many neighbors do we use in order to
1280
00:49:06,800 --> 00:49:08,880
judge what the label is
1281
00:49:08,880 --> 00:49:11,920
so usually we use a k of maybe you know
1282
00:49:11,920 --> 00:49:14,160
three or five
1283
00:49:14,160 --> 00:49:16,079
depends on how big our data set is but
1284
00:49:16,079 --> 00:49:17,440
here i would say
1285
00:49:17,440 --> 00:49:18,319
maybe
1286
00:49:18,319 --> 00:49:20,800
a logical number would be three or five
1287
00:49:20,800 --> 00:49:24,559
so let's say that we take k
1288
00:49:24,559 --> 00:49:26,559
to be equal to three
1289
00:49:26,559 --> 00:49:29,839
okay well of this data point that i drew
1290
00:49:29,839 --> 00:49:32,480
over here
1291
00:49:32,480 --> 00:49:35,040
let me use green to highlight this okay
1292
00:49:35,040 --> 00:49:36,880
so of this data point that i drew over
1293
00:49:36,880 --> 00:49:39,359
here looks like the three closest points
1294
00:49:39,359 --> 00:49:41,359
are definitely this one
1295
00:49:41,359 --> 00:49:42,559
this one
1296
00:49:42,559 --> 00:49:46,400
and then this one has a length of four
1297
00:49:46,400 --> 00:49:49,280
and this one
1298
00:49:49,280 --> 00:49:50,720
seems like it'd be a little bit further
1299
00:49:50,720 --> 00:49:53,200
than four so actually this
1300
00:49:53,200 --> 00:49:54,800
would be our these would be our three
1301
00:49:54,800 --> 00:49:56,319
points
1302
00:49:56,319 --> 00:49:59,040
well all those points are blue
1303
00:49:59,040 --> 00:50:01,440
so chances are
1304
00:50:01,440 --> 00:50:04,000
my prediction for this point is going to
1305
00:50:04,000 --> 00:50:05,200
be blue
1306
00:50:05,200 --> 00:50:06,319
it's going to be
1307
00:50:06,319 --> 00:50:08,480
probably don't have a car
1308
00:50:08,480 --> 00:50:11,119
all right now what if my point is
1309
00:50:11,119 --> 00:50:13,680
somewhere
1310
00:50:13,680 --> 00:50:16,800
what if my point is somewhere over here
1311
00:50:16,800 --> 00:50:18,640
let's say that
1312
00:50:18,640 --> 00:50:21,520
a couple has
1313
00:50:21,520 --> 00:50:25,599
four kids and they make 240 000 a year
1314
00:50:25,599 --> 00:50:27,920
all right well now my closest points are
1315
00:50:27,920 --> 00:50:30,319
this one
1316
00:50:30,319 --> 00:50:33,280
probably a little bit over that one
1317
00:50:33,280 --> 00:50:36,160
and then this one right okay still all
1318
00:50:36,160 --> 00:50:37,359
pluses
1319
00:50:37,359 --> 00:50:38,559
well
1320
00:50:38,559 --> 00:50:42,720
this one is more than likely to be
1321
00:50:42,720 --> 00:50:44,559
a plus
1322
00:50:44,559 --> 00:50:46,640
right now let me get rid of some of
1323
00:50:46,640 --> 00:50:49,680
these just so that it looks a little bit
1324
00:50:49,680 --> 00:50:52,920
more clear
1325
00:50:54,800 --> 00:50:56,800
all right let's go through
1326
00:50:56,800 --> 00:50:58,559
one more
1327
00:50:58,559 --> 00:51:02,079
what about a point that might be
1328
00:51:02,079 --> 00:51:04,079
right
1329
00:51:04,079 --> 00:51:05,599
here
1330
00:51:05,599 --> 00:51:08,000
okay let's see well definitely this is
1331
00:51:08,000 --> 00:51:09,440
the closest
1332
00:51:09,440 --> 00:51:12,000
right this one's also closest
1333
00:51:12,000 --> 00:51:12,430
and then
1334
00:51:12,430 --> 00:51:14,319
[Music]
1335
00:51:14,319 --> 00:51:16,640
it's really close between the two of
1336
00:51:16,640 --> 00:51:18,559
these
1337
00:51:18,559 --> 00:51:20,480
but if we actually do the mathematics it
1338
00:51:20,480 --> 00:51:21,760
seems like
1339
00:51:21,760 --> 00:51:23,760
if we zoom in
1340
00:51:23,760 --> 00:51:26,240
this one is right here and this one is
1341
00:51:26,240 --> 00:51:28,720
in between these two so
1342
00:51:28,720 --> 00:51:31,119
this one here is actually shorter than
1343
00:51:31,119 --> 00:51:32,880
this one
1344
00:51:32,880 --> 00:51:35,520
and that means that that top one is the
1345
00:51:35,520 --> 00:51:37,599
one that we're going to take
1346
00:51:37,599 --> 00:51:39,680
now what is the majority of the points
1347
00:51:39,680 --> 00:51:41,520
that are close by
1348
00:51:41,520 --> 00:51:44,640
well we have one plus here we have one
1349
00:51:44,640 --> 00:51:47,200
plus here and we have one minus here
1350
00:51:47,200 --> 00:51:49,280
which means that the pluses
1351
00:51:49,280 --> 00:51:51,040
are the majority
1352
00:51:51,040 --> 00:51:53,920
and that means that this
1353
00:51:53,920 --> 00:51:56,319
label
1354
00:51:56,559 --> 00:51:59,520
is probably somebody with a car
1355
00:51:59,520 --> 00:52:01,280
okay
1356
00:52:01,280 --> 00:52:04,400
so this is how k nearest neighbors would
1357
00:52:04,400 --> 00:52:06,240
work
1358
00:52:06,240 --> 00:52:08,160
it's that simple
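The whole idea can be sketched from scratch in a few lines; the points and labels below are a hypothetical version of the income/kids plot above:

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    # Sort training points by Euclidean distance to the query,
    # then take the majority label among the k nearest neighbors.
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    nearest = [train_labels[i] for i in order[:k]]
    return Counter(nearest).most_common(1)[0][0]

# (income in $1000s/yr, number of kids); '+' owns a car, '-' does not
points = [(40, 0), (60, 1), (50, 2), (240, 4), (300, 3), (280, 5)]
labels = ['-', '-', '-', '+', '+', '+']
print(knn_predict(points, labels, (40, 2)))   # → '-'
print(knn_predict(points, labels, (240, 4)))  # → '+'
```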
1359
00:52:08,160 --> 00:52:11,040
and this can be extended to higher
1360
00:52:11,040 --> 00:52:13,599
dimensions you know
1361
00:52:13,599 --> 00:52:14,880
if you have
1362
00:52:14,880 --> 00:52:16,800
here we have two different features we
1363
00:52:16,800 --> 00:52:18,640
have the income
1364
00:52:18,640 --> 00:52:20,880
and then we have the number of kids
1365
00:52:20,880 --> 00:52:21,760
but
1366
00:52:21,760 --> 00:52:23,839
let's say we have 10 different features
1367
00:52:23,839 --> 00:52:26,960
we can expand our distance function so
1368
00:52:26,960 --> 00:52:29,119
that it includes all 10 of those
1369
00:52:29,119 --> 00:52:31,040
dimensions we take the square root of
1370
00:52:31,040 --> 00:52:32,559
everything and then we figure out which
1371
00:52:32,559 --> 00:52:34,800
one is the closest to the point that we
1372
00:52:34,800 --> 00:52:35,839
desire
1373
00:52:35,839 --> 00:52:39,119
to classify okay
1374
00:52:39,119 --> 00:52:41,040
so that's k-nearest neighbors
1375
00:52:41,040 --> 00:52:42,800
so now we've learned about uh k-nearest
1376
00:52:42,800 --> 00:52:43,760
neighbors
1377
00:52:43,760 --> 00:52:45,599
let's see how we would be able to do
1378
00:52:45,599 --> 00:52:47,520
that within our code
1379
00:52:47,520 --> 00:52:50,160
so here i'm going to label the section k
1380
00:52:50,160 --> 00:52:53,040
nearest neighbors
1381
00:52:53,119 --> 00:52:54,720
and we're actually going to use a
1382
00:52:54,720 --> 00:52:58,160
package from sklearn so the reason why
1383
00:52:58,160 --> 00:53:00,240
we use these packages is so that
1384
00:53:00,240 --> 00:53:02,640
we don't have to manually code all these
1385
00:53:02,640 --> 00:53:04,559
things ourself because it would be
1386
00:53:04,559 --> 00:53:06,240
really difficult and chances are the way
1387
00:53:06,240 --> 00:53:07,839
that we would code it either would have
1388
00:53:07,839 --> 00:53:10,160
bugs or it'd be really slow or i don't
1389
00:53:10,160 --> 00:53:11,839
know a whole bunch of issues
1390
00:53:11,839 --> 00:53:13,280
so what we're going to do is hand it off
1391
00:53:13,280 --> 00:53:14,960
to the pros
1392
00:53:14,960 --> 00:53:16,880
from here i can say
1393
00:53:16,880 --> 00:53:20,000
okay from sklearn which is this package
1394
00:53:20,000 --> 00:53:22,079
dot neighbors
1395
00:53:22,079 --> 00:53:24,319
i'm going to import k
1396
00:53:24,319 --> 00:53:26,079
neighbors classifier because we're
1397
00:53:26,079 --> 00:53:27,440
classifying
1398
00:53:27,440 --> 00:53:28,800
okay
1399
00:53:28,800 --> 00:53:30,160
so i run that
1400
00:53:30,160 --> 00:53:34,800
and our knn model is going to be this k
1401
00:53:34,800 --> 00:53:38,000
neighbors classifier and we can pass in
1402
00:53:38,000 --> 00:53:40,160
a parameter of how many neighbors you
1403
00:53:40,160 --> 00:53:43,119
know we want to use so first let's see
1404
00:53:43,119 --> 00:53:45,839
what happens if we just use one
1405
00:53:45,839 --> 00:53:47,839
so now if i do knn
1406
00:53:47,839 --> 00:53:49,920
and then model dot fit
1407
00:53:49,920 --> 00:53:52,720
i can pass in my x training set and my
1408
00:53:52,720 --> 00:53:54,240
y train
1409
00:53:54,240 --> 00:53:58,480
data okay so that effectively fits this
1410
00:53:58,480 --> 00:54:00,400
model
1411
00:54:00,400 --> 00:54:04,480
and let's get all the predictions so y
1412
00:54:04,480 --> 00:54:06,960
knn or i guess yeah let's do y
1413
00:54:06,960 --> 00:54:08,319
predictions
1414
00:54:08,319 --> 00:54:11,359
and my y predictions are going to be knn
1415
00:54:11,359 --> 00:54:14,000
model dot predict
1416
00:54:14,000 --> 00:54:15,599
um
1417
00:54:15,599 --> 00:54:19,200
so let's use the test set x test
1418
00:54:19,200 --> 00:54:21,040
okay
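The sklearn calls above, sketched end to end; the synthetic data here is a hypothetical stand-in for the scaled MAGIC gamma/hadron split:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary-classification data standing in for the real features
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn_model = KNeighborsClassifier(n_neighbors=1)
knn_model.fit(X_train, y_train)
y_pred = knn_model.predict(X_test)
print(y_pred[:6], y_test[:6])  # eyeball a few predictions vs. truth
```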
1419
00:54:21,040 --> 00:54:23,680
all right so if i
1420
00:54:23,680 --> 00:54:25,280
call y predict you'll see that we have
1421
00:54:25,280 --> 00:54:27,680
those but if i get my truth values for
1422
00:54:27,680 --> 00:54:29,359
that test set you'll see that this is
1423
00:54:29,359 --> 00:54:31,040
what we actually got so just looking at
1424
00:54:31,040 --> 00:54:32,880
this we got five out of six of them okay
1425
00:54:32,880 --> 00:54:35,119
great so let's actually take a look at
1426
00:54:35,119 --> 00:54:36,240
something
1427
00:54:36,240 --> 00:54:38,319
called the classification report that's
1428
00:54:38,319 --> 00:54:41,960
offered by sklearn so if i go to from
1429
00:54:41,960 --> 00:54:44,720
sklearn.metrics import
1430
00:54:44,720 --> 00:54:47,920
classification report
1431
00:54:47,920 --> 00:54:51,200
what i can actually do is say hey print
1432
00:54:51,200 --> 00:54:53,200
out this classification
1433
00:54:53,200 --> 00:54:54,640
report for me
1434
00:54:54,640 --> 00:54:56,960
and let's check you know
1435
00:54:56,960 --> 00:54:58,960
i'm giving you the y test and the y
1436
00:54:58,960 --> 00:55:01,040
prediction
1437
00:55:01,040 --> 00:55:02,880
we run this and we see we get this whole
1438
00:55:02,880 --> 00:55:04,160
entire chart so i'm going to tell you
1439
00:55:04,160 --> 00:55:07,359
guys a few things on this chart
1440
00:55:07,359 --> 00:55:10,319
all right this accuracy is 82 which is
1441
00:55:10,319 --> 00:55:11,920
actually pretty good that's just saying
1442
00:55:11,920 --> 00:55:14,319
hey if we just look at you know what
1443
00:55:14,319 --> 00:55:15,839
each of these new points what it's
1444
00:55:15,839 --> 00:55:17,280
closest to
1445
00:55:17,280 --> 00:55:19,520
then we actually get an 82 percent
1446
00:55:19,520 --> 00:55:21,119
accuracy
1447
00:55:21,119 --> 00:55:22,240
which means
1448
00:55:22,240 --> 00:55:24,160
how many do we get right versus how many
1449
00:55:24,160 --> 00:55:26,800
total are there
1450
00:55:26,800 --> 00:55:29,040
now precision is saying okay you might
1451
00:55:29,040 --> 00:55:31,280
see that we have it for class one or
1452
00:55:31,280 --> 00:55:33,440
class zero and
1453
00:55:33,440 --> 00:55:35,200
what precision is saying is let's go to
1454
00:55:35,200 --> 00:55:36,960
this wikipedia diagram over here because
1455
00:55:36,960 --> 00:55:39,920
i actually kind of like this diagram
1456
00:55:39,920 --> 00:55:41,359
so here
1457
00:55:41,359 --> 00:55:43,680
this is our entire data set and on the
1458
00:55:43,680 --> 00:55:45,760
left over here we have everything that
1459
00:55:45,760 --> 00:55:47,920
we know is positive so everything that
1460
00:55:47,920 --> 00:55:50,640
is actually truly positive that we've
1461
00:55:50,640 --> 00:55:52,480
labeled positive in our original data
1462
00:55:52,480 --> 00:55:53,280
set
1463
00:55:53,280 --> 00:55:54,960
and over here this is everything that's
1464
00:55:54,960 --> 00:55:56,640
truly negative
1465
00:55:56,640 --> 00:55:59,359
now in the circle we have
1466
00:55:59,359 --> 00:56:01,680
things that were labeled
1467
00:56:01,680 --> 00:56:05,440
positive by our model
1468
00:56:05,440 --> 00:56:07,440
on the left here we have things that are
1469
00:56:07,440 --> 00:56:09,440
truly positive because you know this
1470
00:56:09,440 --> 00:56:10,400
side is
1471
00:56:10,400 --> 00:56:12,000
the positive side and this side is the
1472
00:56:12,000 --> 00:56:13,359
negative side so these are truly
1473
00:56:13,359 --> 00:56:14,559
positive
1474
00:56:14,559 --> 00:56:17,040
whereas all these ones out here well
1475
00:56:17,040 --> 00:56:18,640
they should have been positive but they
1476
00:56:18,640 --> 00:56:20,720
are labeled as negative
1477
00:56:20,720 --> 00:56:23,040
and in here these are the ones that
1478
00:56:23,040 --> 00:56:24,400
we've labeled positive but they're
1479
00:56:24,400 --> 00:56:26,960
actually negative and out here these are
1480
00:56:26,960 --> 00:56:28,960
truly negative
1481
00:56:28,960 --> 00:56:32,319
so precision is saying okay
1482
00:56:32,319 --> 00:56:34,720
out of all the ones we've labeled as
1483
00:56:34,720 --> 00:56:37,040
positive how many of them are true
1484
00:56:37,040 --> 00:56:38,799
positives
1485
00:56:38,799 --> 00:56:41,839
and recall is saying okay out of all the
1486
00:56:41,839 --> 00:56:43,839
ones that we know are truly positive how
1487
00:56:43,839 --> 00:56:46,720
many did we actually get right
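Those two definitions made concrete with sklearn's metrics; the truth and prediction vectors here are toy values chosen so the arithmetic is easy to check by hand:

```python
from sklearn.metrics import classification_report, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

# Precision: of everything we labeled positive, how many truly are
#   → 2 true positives out of 3 predicted positives = 2/3
# Recall: of everything truly positive, how many did we label positive
#   → 2 true positives out of 3 actual positives = 2/3
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(classification_report(y_true, y_pred))
```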
1488
00:56:46,720 --> 00:56:48,319
okay
1489
00:56:48,319 --> 00:56:50,640
so going back to this over here our
1490
00:56:50,640 --> 00:56:53,440
precision score so
1491
00:56:53,440 --> 00:56:55,440
again precision out of all the ones that
1492
00:56:55,440 --> 00:56:57,119
we've labeled as
1493
00:56:57,119 --> 00:56:59,440
the specific class how many of them are
1494
00:56:59,440 --> 00:57:03,520
actually that class it's 77 and 84
1495
00:57:03,520 --> 00:57:05,839
now recall out of all the ones that
1496
00:57:05,839 --> 00:57:08,000
are actually this class how many of
1497
00:57:08,000 --> 00:57:10,799
those did we get this is 68
1498
00:57:10,799 --> 00:57:12,799
and eighty nine percent
1499
00:57:12,799 --> 00:57:13,839
all right
1500
00:57:13,839 --> 00:57:15,920
so not too shabby we can clearly see
1501
00:57:15,920 --> 00:57:18,559
that this recall and precision for
1502
00:57:18,559 --> 00:57:20,720
class zero is worse than for
1503
00:57:20,720 --> 00:57:22,079
class one
1504
00:57:22,079 --> 00:57:23,520
right so that means it's
1505
00:57:23,520 --> 00:57:26,319
worse for hadrons than for our gammas
1506
00:57:26,319 --> 00:57:28,319
this f1 score over here is kind of a
1507
00:57:28,319 --> 00:57:30,319
combination of the precision and recall
1508
00:57:30,319 --> 00:57:31,920
score so
1509
00:57:31,920 --> 00:57:33,200
we're actually going to mostly look at
1510
00:57:33,200 --> 00:57:35,760
this one because we have an unbalanced
1511
00:57:35,760 --> 00:57:37,280
test data set
1512
00:57:37,280 --> 00:57:40,720
so here we have a measure of 72 and 87
1513
00:57:40,720 --> 00:57:46,400
or 0.72 and 0.87 which is not too shabby
1514
00:57:46,400 --> 00:57:48,400
all right
1515
00:57:48,400 --> 00:57:53,960
well what if we you know made this three
1516
00:57:54,480 --> 00:57:56,400
so we actually see that
1517
00:57:56,400 --> 00:58:00,240
okay so what was it originally with one
1518
00:58:00,640 --> 00:58:03,200
we see that our f1 score
1519
00:58:03,200 --> 00:58:05,520
you know is now it was 0.72 and then
1520
00:58:05,520 --> 00:58:07,280
0.87
1521
00:58:07,280 --> 00:58:09,680
and then our accuracy was 82 so if i
1522
00:58:09,680 --> 00:58:12,559
change that to three
1523
00:58:12,559 --> 00:58:13,920
all right so
1524
00:58:13,920 --> 00:58:17,119
we've kind of increased class zero at the cost
1525
00:58:17,119 --> 00:58:19,680
of class one and then our overall average
1526
00:58:19,680 --> 00:58:22,640
accuracy is 81. so let's actually just
1527
00:58:22,640 --> 00:58:24,960
make this five
1528
00:58:24,960 --> 00:58:27,680
all right so you know again very similar
1529
00:58:27,680 --> 00:58:29,680
numbers we have 82 percent accuracy
1530
00:58:29,680 --> 00:58:31,520
which is pretty decent for a model
1531
00:58:31,520 --> 00:58:32,880
that's
1532
00:58:32,880 --> 00:58:35,599
relatively simple
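The k-sweep described above can be sketched in a few lines. This is a guess at the notebook cell, with hypothetical names: the real tutorial uses the MAGIC gamma-telescope data, so a tiny synthetic set stands in here just to make the loop runnable.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the MAGIC data (assumption, not the real set)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy labels for illustration
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Try k = 1, 3, 5 as in the video and compare test accuracy
for k in (1, 3, 5):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    print(k, round(knn.score(X_test, y_test), 2))
```

On the real data the accuracies stay close together (around 81-82 percent), which matches the observation that changing k barely moves the overall score.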
1533
00:58:35,599 --> 00:58:37,920
the next type of model that we're going
1534
00:58:37,920 --> 00:58:40,240
to talk about is something known as
1535
00:58:40,240 --> 00:58:42,480
naive bayes
1536
00:58:42,480 --> 00:58:45,040
now in order to understand the concepts
1537
00:58:45,040 --> 00:58:47,680
behind naive bayes we have to be able to
1538
00:58:47,680 --> 00:58:50,079
understand conditional probability and
1539
00:58:50,079 --> 00:58:51,520
bayes rule
1540
00:58:51,520 --> 00:58:52,240
so
1541
00:58:52,240 --> 00:58:54,720
let's say i have some sort of data set
1542
00:58:54,720 --> 00:58:57,280
that's shown in this table right here
1543
00:58:57,280 --> 00:59:00,319
people who have covid are over here in
1544
00:59:00,319 --> 00:59:03,839
this red row and people who do not have
1545
00:59:03,839 --> 00:59:06,880
covid are down here in this green row
1546
00:59:06,880 --> 00:59:08,559
now what about the covid test well people
1547
00:59:08,559 --> 00:59:11,200
who have tested positive
1548
00:59:11,200 --> 00:59:15,040
are over here in this column
1549
00:59:15,040 --> 00:59:17,359
and people who have tested negative are
1550
00:59:17,359 --> 00:59:19,280
over here in this
1551
00:59:19,280 --> 00:59:20,559
column
1552
00:59:20,559 --> 00:59:22,079
okay
1553
00:59:22,079 --> 00:59:23,839
yeah so basically our categories are
1554
00:59:23,839 --> 00:59:26,480
people who have covid and test positive
1555
00:59:26,480 --> 00:59:28,960
people who don't have covid but test
1556
00:59:28,960 --> 00:59:31,680
positive so a false positive
1557
00:59:31,680 --> 00:59:33,520
people who have covid and test negative
1558
00:59:33,520 --> 00:59:35,359
which is a false negative
1559
00:59:35,359 --> 00:59:38,000
and people who don't have covid and test
1560
00:59:38,000 --> 00:59:39,359
negative which
1561
00:59:39,359 --> 00:59:41,440
is good means you don't have covid
1562
00:59:41,440 --> 00:59:43,440
okay so let's make this slightly more
1563
00:59:43,440 --> 00:59:46,440
legible
1564
00:59:47,040 --> 00:59:50,240
and here in the margins i've written
1565
00:59:50,240 --> 00:59:51,839
down the sums
1566
00:59:51,839 --> 00:59:52,960
of
1567
00:59:52,960 --> 00:59:54,880
whatever it's referring to so this here
1568
00:59:54,880 --> 00:59:58,160
is the sum of this entire row
1569
00:59:58,160 --> 01:00:00,799
and this here might be the sum of this
1570
01:00:00,799 --> 01:00:03,440
column over here
1571
01:00:03,440 --> 01:00:04,720
okay
1572
01:00:04,720 --> 01:00:07,599
so the first question that i have is
1573
01:00:07,599 --> 01:00:10,079
what is the probability of having covid
1574
01:00:10,079 --> 01:00:12,240
given that you have a positive test
1575
01:00:12,240 --> 01:00:14,640
and in probability we write that out
1576
01:00:14,640 --> 01:00:16,880
like this so the probability
1577
01:00:16,880 --> 01:00:19,839
of covid
1578
01:00:20,400 --> 01:00:23,440
given so this line that vertical line
1579
01:00:23,440 --> 01:00:24,319
means
1580
01:00:24,319 --> 01:00:26,559
given that you know some condition
1581
01:00:26,559 --> 01:00:30,799
so given a positive test
1582
01:00:30,799 --> 01:00:31,920
okay
1583
01:00:31,920 --> 01:00:33,520
so
1584
01:00:33,520 --> 01:00:35,839
what is the probability of having covid
1585
01:00:35,839 --> 01:00:37,760
given a positive test
1586
01:00:37,760 --> 01:00:40,000
so what this is asking is saying okay
1587
01:00:40,000 --> 01:00:41,839
let's go into
1588
01:00:41,839 --> 01:00:44,640
this condition so the condition of
1589
01:00:44,640 --> 01:00:47,920
having a positive test that is
1590
01:00:47,920 --> 01:00:50,720
this slice of the data right that means
1591
01:00:50,720 --> 01:00:52,000
if you're in this slice of data you have
1592
01:00:52,000 --> 01:00:54,480
a positive test so given that we have a
1593
01:00:54,480 --> 01:00:57,119
positive test given in this condition in
1594
01:00:57,119 --> 01:00:58,880
this circumstance we have a positive
1595
01:00:58,880 --> 01:00:59,839
test
1596
01:00:59,839 --> 01:01:01,520
so what's the probability that we have
1597
01:01:01,520 --> 01:01:03,119
covid
1598
01:01:03,119 --> 01:01:04,000
well
1599
01:01:04,000 --> 01:01:05,440
if we're just using this data the number
1600
01:01:05,440 --> 01:01:09,040
of people that have covid is 531.
1601
01:01:09,040 --> 01:01:12,480
so i'm going to say that there's 531
1602
01:01:12,480 --> 01:01:14,880
people that have covid
1603
01:01:14,880 --> 01:01:17,040
and then now we divide that by the total
1604
01:01:17,040 --> 01:01:18,880
number of people that have a positive
1605
01:01:18,880 --> 01:01:22,520
test which is 551
1606
01:01:24,160 --> 01:01:26,799
okay so that's the probability and
1607
01:01:26,799 --> 01:01:28,160
doing a quick
1608
01:01:28,160 --> 01:01:29,599
division
1609
01:01:29,599 --> 01:01:33,200
we get that this is equal
1610
01:01:33,839 --> 01:01:35,799
to around
1611
01:01:35,799 --> 01:01:38,400
96.4 percent
1612
01:01:38,400 --> 01:01:40,720
so according to this data set which is
1613
01:01:40,720 --> 01:01:42,400
data that i made up off the top of my
1614
01:01:42,400 --> 01:01:45,680
head so it's not actually real covid data
1615
01:01:45,680 --> 01:01:48,720
but according to this data
1616
01:01:48,720 --> 01:01:50,880
uh the probability of covid given that
1617
01:01:50,880 --> 01:01:52,880
you tested positive
1618
01:01:52,880 --> 01:01:54,079
is
1619
01:01:54,079 --> 01:01:57,079
96.4
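The arithmetic here is just the count in the intersection divided by the count in the condition, using the made-up table from the video:

```python
# P(covid | positive test) = (covid AND positive) / (all positive)
covid_and_positive = 531
total_positive = 551
p = covid_and_positive / total_positive
print(round(p * 100, 1))  # about 96.4 percent
```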
1620
01:01:57,920 --> 01:02:00,559
all right now with that let's talk about
1621
01:02:00,559 --> 01:02:01,920
bayes rule
1622
01:02:01,920 --> 01:02:04,400
which is this section here
1623
01:02:04,400 --> 01:02:07,920
let's ignore this bottom part for now
1624
01:02:07,920 --> 01:02:08,640
so
1625
01:02:08,640 --> 01:02:10,400
base rule is asking okay what is the
1626
01:02:10,400 --> 01:02:13,440
probability of some event a happening
1627
01:02:13,440 --> 01:02:16,960
given that b happened so this
1628
01:02:16,960 --> 01:02:18,559
we already know has happened this is our
1629
01:02:18,559 --> 01:02:21,839
condition right
1630
01:02:22,319 --> 01:02:25,119
well what if we don't have data for that
1631
01:02:25,119 --> 01:02:26,880
right like what if we don't know what
1632
01:02:26,880 --> 01:02:29,359
the probability of a given b is
1633
01:02:29,359 --> 01:02:31,200
well bayes rule is saying okay well you
1634
01:02:31,200 --> 01:02:33,520
can actually go and calculate it as long
1635
01:02:33,520 --> 01:02:36,000
as you have the probability of b given a
1636
01:02:36,000 --> 01:02:38,000
the probability of a and the probability
1637
01:02:38,000 --> 01:02:39,280
of b
1638
01:02:39,280 --> 01:02:41,599
okay and this is just a mathematical
1639
01:02:41,599 --> 01:02:43,440
formula for that
1640
01:02:43,440 --> 01:02:46,799
all right so here we have bayes rule
1641
01:02:46,799 --> 01:02:49,200
and let's actually see bayes rule in
1642
01:02:49,200 --> 01:02:51,680
action let's use it on an example
1643
01:02:51,680 --> 01:02:54,079
so here let's say that we have some
1644
01:02:54,079 --> 01:02:56,960
um disease statistics okay
1645
01:02:56,960 --> 01:03:00,160
so not covid a different disease
1646
01:03:00,160 --> 01:03:01,920
and we know that the probability of
1647
01:03:01,920 --> 01:03:04,640
obtaining a false positive is 0.05
1648
01:03:04,640 --> 01:03:06,079
probability of obtaining a false
1649
01:03:06,079 --> 01:03:09,040
negative is 0.01 and the probability of
1650
01:03:09,040 --> 01:03:11,359
the disease is 0.1
1651
01:03:11,359 --> 01:03:12,799
okay what is the probability of the
1652
01:03:12,799 --> 01:03:14,559
disease given that
1653
01:03:14,559 --> 01:03:17,520
we got a positive test
1654
01:03:17,520 --> 01:03:20,319
how do we even go about solving this
1655
01:03:20,319 --> 01:03:22,880
so what what do i mean by false positive
1656
01:03:22,880 --> 01:03:25,280
what's a different way to rewrite that
1657
01:03:25,280 --> 01:03:27,920
a false positive is when you test
1658
01:03:27,920 --> 01:03:30,400
positive but you don't actually have the
1659
01:03:30,400 --> 01:03:33,359
disease so this here is a probability
1660
01:03:33,359 --> 01:03:37,039
that you'd have a positive test given
1661
01:03:37,039 --> 01:03:39,200
no disease
1662
01:03:39,200 --> 01:03:40,400
right
1663
01:03:40,400 --> 01:03:42,240
and similarly for the false negative
1664
01:03:42,240 --> 01:03:43,599
it's the probability that you test
1665
01:03:43,599 --> 01:03:46,240
negative given that you actually have
1666
01:03:46,240 --> 01:03:48,640
the disease so if i put that into a
1667
01:03:48,640 --> 01:03:50,160
chart
1668
01:03:50,160 --> 01:03:52,799
for example
1669
01:03:54,480 --> 01:03:56,319
and this might be my positive and
1670
01:03:56,319 --> 01:03:58,960
negative tests and this might be
1671
01:03:58,960 --> 01:04:00,480
my diseases
1672
01:04:00,480 --> 01:04:04,079
disease and no disease
1673
01:04:04,079 --> 01:04:06,160
well the probability that i test
1674
01:04:06,160 --> 01:04:07,680
positive but actually have no disease
1675
01:04:07,680 --> 01:04:10,480
okay that's 0.05 over here
1676
01:04:10,480 --> 01:04:12,400
and then the false negative is up here
1677
01:04:12,400 --> 01:04:14,720
for 0.01 so i'm testing negative but i
1678
01:04:14,720 --> 01:04:17,839
actually do have the disease
1679
01:04:18,000 --> 01:04:20,799
this so the probability that you test
1680
01:04:20,799 --> 01:04:23,039
positive given that you don't have the disease
1681
01:04:23,039 --> 01:04:24,400
plus the probability that you test
1682
01:04:24,400 --> 01:04:25,839
negative given that you don't have the
1683
01:04:25,839 --> 01:04:27,119
disease
1684
01:04:27,119 --> 01:04:29,280
that should sum up to one
1685
01:04:29,280 --> 01:04:30,400
okay because if you don't have the
1686
01:04:30,400 --> 01:04:32,000
disease then you should have some
1687
01:04:32,000 --> 01:04:33,359
probability that you're testing positive
1688
01:04:33,359 --> 01:04:34,480
and some probably that you're testing
1689
01:04:34,480 --> 01:04:36,799
negative but that probability
1690
01:04:36,799 --> 01:04:39,680
in total should be one
1691
01:04:39,680 --> 01:04:42,160
so that means that
1692
01:04:42,160 --> 01:04:43,839
the probability of negative and no
1693
01:04:43,839 --> 01:04:45,680
disease this should be the complement
1694
01:04:45,680 --> 01:04:47,039
this would be the opposite so it should
1695
01:04:47,039 --> 01:04:48,200
be
1696
01:04:48,200 --> 01:04:51,119
0.95 because it's 1 minus whatever this
1697
01:04:51,119 --> 01:04:54,640
probability is
1698
01:04:54,640 --> 01:04:56,960
and then similarly
1699
01:04:56,960 --> 01:04:59,280
oops
1700
01:04:59,599 --> 01:05:03,440
up here this should be 0.99 because
1701
01:05:03,440 --> 01:05:05,599
the probability that we
1702
01:05:05,599 --> 01:05:07,520
you know test negative and have the
1703
01:05:07,520 --> 01:05:08,960
disease plus the probability that we
1704
01:05:08,960 --> 01:05:10,480
test positive and have the disease
1705
01:05:10,480 --> 01:05:12,160
should equal one
1706
01:05:12,160 --> 01:05:14,640
so this is our probability chart and now
1707
01:05:14,640 --> 01:05:16,720
this probability of disease
1708
01:05:16,720 --> 01:05:18,640
being 0.1 just means i
1709
01:05:18,640 --> 01:05:19,920
have a ten percent probability of
1710
01:05:19,920 --> 01:05:21,599
actually of having the disease right
1711
01:05:21,599 --> 01:05:23,119
like
1712
01:05:23,119 --> 01:05:24,799
in the general population the
1713
01:05:24,799 --> 01:05:26,400
probability that i have the disease is
1714
01:05:26,400 --> 01:05:28,960
0.1
1715
01:05:28,960 --> 01:05:30,559
okay so what is the probability that i
1716
01:05:30,559 --> 01:05:32,480
have the disease given that i got a
1717
01:05:32,480 --> 01:05:34,839
positive
1718
01:05:34,839 --> 01:05:37,359
test well remember that we can write
1719
01:05:37,359 --> 01:05:39,839
this out in terms of bayes rule right so
1720
01:05:39,839 --> 01:05:42,400
if i use this rule up here
1721
01:05:42,400 --> 01:05:44,319
this is the probability of a positive
1722
01:05:44,319 --> 01:05:48,240
test given that i have the disease
1723
01:05:48,240 --> 01:05:50,319
times the probability
1724
01:05:50,319 --> 01:05:52,799
of the disease
1725
01:05:52,799 --> 01:05:55,599
divided by the probability of the
1726
01:05:55,599 --> 01:05:59,920
evidence which is my positive test
1727
01:05:59,920 --> 01:06:01,599
all right now let's plug in some numbers
1728
01:06:01,599 --> 01:06:04,480
for that the probability of having a
1729
01:06:04,480 --> 01:06:06,640
positive test given that i have disease
1730
01:06:06,640 --> 01:06:07,400
is
1731
01:06:07,400 --> 01:06:09,200
0.99
1732
01:06:09,200 --> 01:06:11,039
and then the probability that i have the
1733
01:06:11,039 --> 01:06:12,480
disease
1734
01:06:12,480 --> 01:06:14,960
is this value over here
1735
01:06:14,960 --> 01:06:17,960
0.1
1736
01:06:18,480 --> 01:06:20,240
okay
1737
01:06:20,240 --> 01:06:21,359
and then
1738
01:06:21,359 --> 01:06:23,039
the probability that i have a positive
1739
01:06:23,039 --> 01:06:24,720
test at all
1740
01:06:24,720 --> 01:06:26,880
should be okay what is the probability
1741
01:06:26,880 --> 01:06:28,559
that i have a positive test given that i
1742
01:06:28,559 --> 01:06:30,720
actually have the disease
1743
01:06:30,720 --> 01:06:34,319
and then having having the disease
1744
01:06:34,319 --> 01:06:36,160
and then the other case
1745
01:06:36,160 --> 01:06:38,160
where the probability of me having a
1746
01:06:38,160 --> 01:06:40,240
negative test given or sorry positive
1747
01:06:40,240 --> 01:06:43,359
test given no disease
1748
01:06:43,839 --> 01:06:46,319
times the probability of not actually
1749
01:06:46,319 --> 01:06:48,559
having a disease
1750
01:06:48,559 --> 01:06:50,880
okay so i can expand that
1751
01:06:50,880 --> 01:06:52,480
probability of having positive tests out
1752
01:06:52,480 --> 01:06:54,480
into these two different cases
1753
01:06:54,480 --> 01:06:56,720
i have a disease and then i don't
1754
01:06:56,720 --> 01:06:58,079
and then
1755
01:06:58,079 --> 01:06:59,359
what's the probability of having
1756
01:06:59,359 --> 01:07:00,880
positive tests in either one of those
1757
01:07:00,880 --> 01:07:02,480
cases
1758
01:07:02,480 --> 01:07:05,280
so that expression would become
1759
01:07:05,280 --> 01:07:06,960
0.99
1760
01:07:06,960 --> 01:07:09,440
times 0.1
1761
01:07:09,440 --> 01:07:11,400
plus
1762
01:07:11,400 --> 01:07:13,839
0.05 so that's the probability that i'm
1763
01:07:13,839 --> 01:07:15,599
testing positive but don't have the
1764
01:07:15,599 --> 01:07:16,880
disease
1765
01:07:16,880 --> 01:07:18,000
and the times the probability that i
1766
01:07:18,000 --> 01:07:19,440
don't actually have the disease so
1767
01:07:19,440 --> 01:07:22,079
that's 1 minus 0.1 the probability that
1768
01:07:22,079 --> 01:07:23,599
the population doesn't have the disease
1769
01:07:23,599 --> 01:07:25,280
is 90 percent
1770
01:07:25,280 --> 01:07:26,839
so
1771
01:07:26,839 --> 01:07:32,160
0.9 and let's do that multiplication
1772
01:07:32,160 --> 01:07:33,920
and i get an answer
1773
01:07:33,920 --> 01:07:37,240
of 0.6875
1774
01:07:38,000 --> 01:07:39,960
or
1775
01:07:39,960 --> 01:07:43,839
68.75 percent
1776
01:07:44,400 --> 01:07:46,640
okay
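Plugging the quoted numbers into Bayes' rule, with the evidence term expanded over the two cases exactly as in the video:

```python
# P(disease | +) = P(+ | disease) P(disease) / P(+)
p_pos_given_disease = 0.99      # 1 minus the false-negative rate 0.01
p_disease = 0.1                 # prior prevalence in the population
p_pos_given_no_disease = 0.05   # false-positive rate

# Expand P(+) over the disease / no-disease cases
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_no_disease * (1 - p_disease))

p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)  # 0.6875, i.e. 68.75 percent
```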
1777
01:07:46,720 --> 01:07:49,680
all right so we can actually expand that
1778
01:07:49,680 --> 01:07:52,400
we can expand bayes rule
1779
01:07:52,400 --> 01:07:55,119
and apply it to classification
1780
01:07:55,119 --> 01:07:57,440
and this is what we call naive bayes
1781
01:07:57,440 --> 01:07:59,920
so first a little terminology so the
1782
01:07:59,920 --> 01:08:02,160
posterior
1783
01:08:02,160 --> 01:08:04,559
is this over here because it's asking
1784
01:08:04,559 --> 01:08:06,799
hey what is the probability
1785
01:08:06,799 --> 01:08:11,680
of some class ck so by ck i just mean
1786
01:08:11,680 --> 01:08:13,440
you know the different categories so c
1787
01:08:13,440 --> 01:08:15,839
for category or class or whatever so
1788
01:08:15,839 --> 01:08:19,279
category one might be cats category two
1789
01:08:19,279 --> 01:08:22,799
dogs category three lizards
1790
01:08:22,799 --> 01:08:24,799
all the way we have k categories k is
1791
01:08:24,799 --> 01:08:26,319
just some number
1792
01:08:26,319 --> 01:08:27,439
okay
1793
01:08:27,439 --> 01:08:28,399
so
1794
01:08:28,399 --> 01:08:30,640
what is the probability of having
1795
01:08:30,640 --> 01:08:32,238
of
1796
01:08:32,238 --> 01:08:35,279
this specific sample x so this is our
1797
01:08:35,279 --> 01:08:37,279
feature vector
1798
01:08:37,279 --> 01:08:39,679
of this one sample
1799
01:08:39,679 --> 01:08:41,839
what is the probability of x fitting
1800
01:08:41,839 --> 01:08:43,920
into category one two three four
1801
01:08:43,920 --> 01:08:45,839
whatever right so that that's what this
1802
01:08:45,839 --> 01:08:48,158
is asking what is the probability that
1803
01:08:48,158 --> 01:08:50,880
you know it's actually from this class
1804
01:08:50,880 --> 01:08:53,679
given all this evidence that we see the
1805
01:08:53,679 --> 01:08:56,080
x's
1806
01:08:56,960 --> 01:08:58,158
so
1807
01:08:58,158 --> 01:08:59,679
the likelihood
1808
01:08:59,679 --> 01:09:01,520
is this quantity over here it's saying
1809
01:09:01,520 --> 01:09:02,319
okay
1810
01:09:02,319 --> 01:09:04,960
well assume that
1811
01:05:04,960 --> 01:05:06,479
we are
1812
01:05:06,479 --> 01:05:09,520
in class ck okay
1813
01:09:09,520 --> 01:09:12,080
assume that this is a category
1814
01:09:12,080 --> 01:09:14,000
well what is the likelihood of actually
1815
01:09:14,000 --> 01:09:15,359
seeing x
1816
01:09:15,359 --> 01:09:16,880
all these different features from that
1817
01:09:16,880 --> 01:09:19,440
category
1818
01:09:19,759 --> 01:09:22,319
and then this here is the prior so like
1819
01:09:22,319 --> 01:09:24,880
in the entire population of things
1820
01:09:24,880 --> 01:09:26,399
what what is what are the probabilities
1821
01:09:26,399 --> 01:09:28,000
what is the probability of this class in
1822
01:09:28,000 --> 01:09:30,000
general like if i have
1823
01:09:30,000 --> 01:09:32,238
you know in my entire data set what is
1824
01:09:32,238 --> 01:09:34,960
the percentage what is the chance that
1825
01:09:34,960 --> 01:09:37,279
this image is a cat how many cats do i
1826
01:09:37,279 --> 01:09:38,399
have
1827
01:09:38,399 --> 01:09:39,439
right
1828
01:09:39,439 --> 01:09:41,198
and then this down here is called the
1829
01:09:41,198 --> 01:09:43,120
evidence because
1830
01:09:43,120 --> 01:09:45,120
what we're trying to do
1831
01:09:45,120 --> 01:09:47,839
is we're changing our prior we're
1832
01:09:47,839 --> 01:09:51,198
creating this new posterior probability
1833
01:09:51,198 --> 01:09:53,600
built upon the prior by using some sort
1834
01:09:53,600 --> 01:09:55,360
of evidence right and that evidence is
1835
01:09:55,360 --> 01:09:57,679
the probability of x
1836
01:09:57,679 --> 01:09:59,760
so that's some vocab
1837
01:09:59,760 --> 01:10:00,880
and
1838
01:10:00,880 --> 01:10:04,159
this here
1839
01:10:05,360 --> 01:10:09,040
is a rule for naive bayes
1840
01:10:09,040 --> 01:10:11,360
whoa okay let's digest
1841
01:10:11,360 --> 01:10:12,239
that
1842
01:10:12,239 --> 01:10:14,159
a little bit okay
1843
01:10:14,159 --> 01:10:16,640
so what is let me use a different color
1844
01:10:16,640 --> 01:10:18,560
what is this
1845
01:10:18,560 --> 01:10:21,040
side of the equation asking
1846
01:10:21,040 --> 01:10:22,320
it's asking
1847
01:10:22,320 --> 01:10:24,000
what is the probability that we are in
1848
01:10:24,000 --> 01:10:26,320
some class ck
1849
01:10:26,320 --> 01:10:28,800
given that you know this is my first
1850
01:10:28,800 --> 01:10:30,800
input this is my second input this is
1851
01:10:30,800 --> 01:10:32,640
you know my third fourth this is my nth
1852
01:10:32,640 --> 01:10:33,679
input
1853
01:10:33,679 --> 01:10:37,199
so let's say that our classification is
1854
01:10:37,199 --> 01:10:40,000
do we play soccer today or not
1855
01:10:40,000 --> 01:10:42,640
okay and let's say our x's are okay is
1856
01:10:42,640 --> 01:10:45,520
it how much wind is there how much
1857
01:10:45,520 --> 01:10:47,920
uh rain is there and what day of the
1858
01:10:47,920 --> 01:10:50,480
week is it so let's say that it's
1859
01:10:50,480 --> 01:10:52,159
raining it's not windy but it's
1860
01:10:52,159 --> 01:10:56,000
wednesday do we play soccer do we not
1861
01:10:56,000 --> 01:10:59,280
so let's use bayes rule on this so this
1862
01:10:59,280 --> 01:11:01,600
here
1863
01:11:06,000 --> 01:11:09,520
is equal to the probability of x1
1864
01:11:09,520 --> 01:11:13,120
x2 all these joint probabilities given
1865
01:11:13,120 --> 01:11:14,640
class k
1866
01:11:14,640 --> 01:11:18,080
times the probability of that class
1867
01:11:18,080 --> 01:11:20,239
all over the probability of this
1868
01:11:20,239 --> 01:11:22,719
evidence
1869
01:11:24,840 --> 01:11:26,640
okay
1870
01:11:26,640 --> 01:11:28,640
so what is this fancy symbol over here
1871
01:11:28,640 --> 01:11:31,520
this means proportional
1872
01:11:31,520 --> 01:11:33,520
to
1873
01:11:33,520 --> 01:11:35,360
so how our equal sign means it's equal
1874
01:11:35,360 --> 01:11:38,000
to this like little squiggly sign means
1875
01:11:38,000 --> 01:11:40,960
that this is proportional to
1876
01:11:40,960 --> 01:11:42,159
okay
1877
01:11:42,159 --> 01:11:45,679
and this denominator over here
1878
01:11:45,679 --> 01:11:47,920
you might notice that it has no impact
1879
01:11:47,920 --> 01:11:50,640
on the class like this that number
1880
01:11:50,640 --> 01:11:52,159
doesn't depend on the class right so
1881
01:11:52,159 --> 01:11:54,320
this is going to be constant for all of
1882
01:11:54,320 --> 01:11:56,080
our different classes
1883
01:11:56,080 --> 01:11:57,920
so what i'm going to do is make things
1884
01:11:57,920 --> 01:12:00,400
simpler so i'm just going to say that
1885
01:12:00,400 --> 01:12:02,960
this probability
1886
01:12:02,960 --> 01:12:06,480
x1 x2 all the way to xn
1887
01:12:06,480 --> 01:12:08,080
this is going to be proportional to the
1888
01:12:08,080 --> 01:12:09,199
numerator i don't care about the
1889
01:12:09,199 --> 01:12:10,480
denominator because it's the same for
1890
01:12:10,480 --> 01:12:11,920
every single class
1891
01:12:11,920 --> 01:12:15,920
so this is proportional to x1 x2
1892
01:12:15,920 --> 01:12:17,040
xn
1893
01:12:17,040 --> 01:12:18,239
given
1894
01:12:18,239 --> 01:12:20,960
class k times the probability of that
1895
01:12:20,960 --> 01:12:22,400
class
1896
01:12:22,400 --> 01:12:24,080
okay
1897
01:12:24,080 --> 01:12:25,360
all right
1898
01:12:25,360 --> 01:12:28,080
so in naive bayes the
1899
01:12:28,080 --> 01:12:30,080
point of it being naive
1900
01:12:30,080 --> 01:12:32,880
is that for
1901
01:12:32,880 --> 01:12:34,239
this joint probability we're just
1902
01:12:34,239 --> 01:12:35,840
assuming that all of these different
1903
01:12:35,840 --> 01:12:36,800
things
1904
01:12:36,800 --> 01:12:39,280
are all independent so in my soccer
1905
01:12:39,280 --> 01:12:40,719
example
1906
01:12:40,719 --> 01:12:41,600
you know
1907
01:12:41,600 --> 01:12:42,719
the probability that we're playing
1908
01:12:42,719 --> 01:12:44,239
soccer
1909
01:12:44,239 --> 01:12:45,199
um
1910
01:12:45,199 --> 01:12:47,440
or the probability that you know it's
1911
01:12:47,440 --> 01:12:50,239
windy and it's rainy and and it's
1912
01:12:50,239 --> 01:12:51,679
wednesday all these things are
1913
01:12:51,679 --> 01:12:53,600
independent we're assuming that they're
1914
01:12:53,600 --> 01:12:55,520
independent
1915
01:12:55,520 --> 01:12:58,400
so that means that i can actually write
1916
01:12:58,400 --> 01:13:01,040
this part of the equation here
1917
01:13:01,040 --> 01:13:02,800
as
1918
01:13:02,800 --> 01:13:05,440
this so each term in here
1919
01:13:05,440 --> 01:13:07,520
i can just multiply
1920
01:13:07,520 --> 01:13:10,719
all them together so the probability of
1921
01:13:10,719 --> 01:13:11,920
the first
1922
01:13:11,920 --> 01:13:15,280
feature given that it's class k
1923
01:13:15,280 --> 01:13:17,120
times the probability
1924
01:13:17,120 --> 01:13:18,480
of the second feature given that's
1925
01:13:18,480 --> 01:13:20,640
probably like class k all the way up
1926
01:13:20,640 --> 01:13:21,600
until
1927
01:13:21,600 --> 01:13:23,440
you know the nth
1928
01:13:23,440 --> 01:13:24,800
feature
1929
01:13:24,800 --> 01:13:26,400
of
1930
01:13:26,400 --> 01:13:29,520
given that it's class k
1931
01:13:29,520 --> 01:13:33,040
so this expands to all of this
1932
01:13:33,040 --> 01:13:36,000
all right which means that this here is
1933
01:13:36,000 --> 01:13:37,600
now proportional
1934
01:13:37,600 --> 01:13:40,320
to the thing that we just expanded
1935
01:13:40,320 --> 01:13:42,640
times this
1936
01:13:42,640 --> 01:13:44,320
so i'm going to write
1937
01:13:44,320 --> 01:13:46,640
that out so the probability
1938
01:13:46,640 --> 01:13:48,560
of that class
1939
01:13:48,560 --> 01:13:51,679
and i'm actually going to use this
1940
01:13:51,679 --> 01:13:54,800
symbol so what this means is it's a huge
1941
01:13:54,800 --> 01:13:56,800
multiplication it means multiply
1942
01:13:56,800 --> 01:13:58,239
everything
1943
01:13:58,239 --> 01:14:01,280
to the right of this so this probability
1944
01:14:01,280 --> 01:14:02,640
x
1945
01:14:02,640 --> 01:14:03,840
given
1946
01:14:03,840 --> 01:14:05,600
some class k
1947
01:14:05,600 --> 01:14:08,800
but do it for all the i so i
1948
01:14:08,800 --> 01:14:11,440
what is i okay we're going to go from
1949
01:14:11,440 --> 01:14:13,120
the first
1950
01:14:13,120 --> 01:14:16,159
x i all the way to the end so that means
1951
01:14:16,159 --> 01:14:17,760
for every single i we're just
1952
01:14:17,760 --> 01:14:19,280
multiplying
1953
01:14:19,280 --> 01:14:21,679
these probabilities together
1954
01:14:21,679 --> 01:14:23,280
and that's where
1955
01:14:23,280 --> 01:14:25,840
this up here comes from
1956
01:14:25,840 --> 01:14:26,560
so
1957
01:14:26,560 --> 01:14:28,480
to wrap this up oops this should be a
1958
01:14:28,480 --> 01:14:30,480
line to wrap this up in plain english
1959
01:14:30,480 --> 01:14:31,840
basically what this is saying is the
1960
01:14:31,840 --> 01:14:34,800
probability that you know we're in some
1961
01:14:34,800 --> 01:14:37,199
category given that we have all these
1962
01:14:37,199 --> 01:14:38,560
different features
1963
01:14:38,560 --> 01:14:41,520
is proportional to the probability of
1964
01:14:41,520 --> 01:14:43,520
that class in general
1965
01:14:43,520 --> 01:14:45,360
times the probability of each of those
1966
01:14:45,360 --> 01:14:46,480
features
1967
01:14:46,480 --> 01:14:48,560
given that we're in this one class that
1968
01:14:48,560 --> 01:14:50,080
we're testing
1969
01:14:50,080 --> 01:14:51,600
so the probability
1970
01:14:51,600 --> 01:14:54,080
of it you know of us playing soccer
1971
01:14:54,080 --> 01:14:57,199
today given that it's rainy not windy
1972
01:14:57,199 --> 01:14:58,320
and
1973
01:14:58,320 --> 01:15:00,719
um and it's wednesday
1974
01:15:00,719 --> 01:15:03,199
is proportional to okay well what is
1975
01:15:03,199 --> 01:15:04,480
what is the probability that we play
1976
01:15:04,480 --> 01:15:06,239
soccer anyways
1977
01:15:06,239 --> 01:15:08,239
and then times the probability that it's
1978
01:15:08,239 --> 01:15:10,880
rainy given that we're playing soccer
1979
01:15:10,880 --> 01:15:12,560
times the probability that it's not
1980
01:15:12,560 --> 01:15:14,800
windy given that we're playing soccer so
1981
01:15:14,800 --> 01:15:16,000
how many times are we playing soccer
1982
01:15:16,000 --> 01:15:18,000
when it's windy you know
1983
01:15:18,000 --> 01:15:20,640
and then how many times or what's the
1984
01:15:20,640 --> 01:15:22,960
probability that's wednesday given that
1985
01:15:22,960 --> 01:15:25,120
we're playing soccer
1986
01:15:25,120 --> 01:15:27,360
okay
1987
01:15:27,360 --> 01:15:30,320
so how do we use this in order to make a
1988
01:15:30,320 --> 01:15:32,239
classification
1989
01:15:32,239 --> 01:15:35,520
so that's where this comes in our y hat
1990
01:15:35,520 --> 01:15:37,440
our predicted y
1991
01:15:37,440 --> 01:15:39,520
is going to be equal to something called
1992
01:15:39,520 --> 01:15:42,480
the arg max
1993
01:15:42,480 --> 01:15:44,640
and then this expression over here
1994
01:15:44,640 --> 01:15:46,960
because we want to take the arg max well
1995
01:15:46,960 --> 01:15:48,480
we want
1996
01:15:48,480 --> 01:15:50,719
so okay if i
1997
01:15:50,719 --> 01:15:53,280
write out this
1998
01:15:53,280 --> 01:15:55,120
again this means the probability of
1999
01:15:55,120 --> 01:15:58,080
being in some class c k given all of our
2000
01:15:58,080 --> 01:16:00,560
evidence
2001
01:16:01,760 --> 01:16:04,239
well we're going to take the k
2002
01:16:04,239 --> 01:16:06,560
that maximizes
2003
01:16:06,560 --> 01:16:09,840
this expression on the right
2004
01:16:09,840 --> 01:16:12,880
that's what arg max means so if k is in
2005
01:16:12,880 --> 01:16:14,640
zero oops
2006
01:16:14,640 --> 01:16:16,560
one through
2007
01:16:16,560 --> 01:16:18,159
k so this is how many categories there
2008
01:16:18,159 --> 01:16:20,400
are we're going to go through each k
2009
01:16:20,400 --> 01:16:22,880
and we're going to solve
2010
01:16:22,880 --> 01:16:25,840
this expression over here and find the k
2011
01:16:25,840 --> 01:16:28,960
that makes that the largest
2012
01:16:28,960 --> 01:16:30,000
okay
2013
01:16:30,000 --> 01:16:31,520
and
2014
01:16:31,520 --> 01:16:34,080
remember that instead of writing this we
2015
01:16:34,080 --> 01:16:38,320
have now a formula thanks to bayes rule
2016
01:16:38,320 --> 01:16:40,480
for helping us
2017
01:16:40,480 --> 01:16:42,800
approximate that right and something
2018
01:16:42,800 --> 01:16:45,040
that maybe we can
2019
01:16:45,040 --> 01:16:47,040
we maybe we have like the evidence for
2020
01:16:47,040 --> 01:16:48,800
that we have the answers for that based
2021
01:16:48,800 --> 01:16:51,840
on our training set
2022
01:16:52,000 --> 01:16:54,320
so this principle of going through each
2023
01:16:54,320 --> 01:16:56,480
of these and finding whatever class
2024
01:16:56,480 --> 01:16:59,040
whatever category maximizes
2025
01:16:59,040 --> 01:17:00,800
this expression on the right this is
2026
01:17:00,800 --> 01:17:02,960
something known as m
2027
01:17:02,960 --> 01:17:04,880
ap for short
2028
01:17:04,880 --> 01:17:08,000
or maximum
2029
01:17:08,719 --> 01:17:11,040
a
2030
01:17:11,040 --> 01:17:14,040
posteriori
2031
01:17:14,159 --> 01:17:16,640
uh pick the hypothesis so pick the k
2032
01:17:16,640 --> 01:17:18,640
that is the most probable so that we
2033
01:17:18,640 --> 01:17:20,239
minimize the probability of
2034
01:17:20,239 --> 01:17:22,719
misclassification
2035
01:17:22,719 --> 01:17:24,000
all right
2036
01:17:24,000 --> 01:17:28,159
so that is m ap that is
2037
01:17:28,159 --> 01:17:29,679
naive bayes
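The MAP rule just described can be written out directly: score each class by its prior times the product of the per-feature likelihoods, then take the arg max. The soccer-example numbers below are invented purely for illustration, and the conditional-independence assumption is exactly the "naive" step.

```python
# Made-up priors and per-feature likelihoods for the soccer example
priors = {"play": 0.6, "no_play": 0.4}
likelihoods = {
    "play":    {"rainy": 0.2, "not_windy": 0.7, "wednesday": 0.15},
    "no_play": {"rainy": 0.6, "not_windy": 0.4, "wednesday": 0.14},
}

def map_class(evidence):
    # arg max over classes k of P(Ck) * prod_i P(x_i | Ck)
    scores = {}
    for c, prior in priors.items():
        score = prior
        for x in evidence:
            score *= likelihoods[c][x]
        scores[c] = score
    return max(scores, key=scores.get)

print(map_class(["rainy", "not_windy", "wednesday"]))  # no_play here
```

With these particular numbers no_play wins (0.4 x 0.6 x 0.4 x 0.14 beats 0.6 x 0.2 x 0.7 x 0.15); different made-up likelihoods would flip the answer.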
2038
01:17:29,679 --> 01:17:31,679
back to the notebook so
2039
01:17:31,679 --> 01:17:34,000
just like how i imported k-nearest
2040
01:17:34,000 --> 01:17:35,440
uh k
2041
01:17:35,440 --> 01:17:37,920
neighbors classifier up here for naive
2042
01:17:37,920 --> 01:17:41,520
bayes i can go to sklearn.naive
2043
01:17:41,520 --> 01:17:42,560
bayes
2044
01:17:42,560 --> 01:17:44,880
and i can import gaussian
2045
01:17:44,880 --> 01:17:46,719
naive bayes
2046
01:17:46,719 --> 01:17:47,840
all right
2047
01:17:47,840 --> 01:17:49,760
and here i'm going to say my naive bayes
2048
01:17:49,760 --> 01:17:52,320
model is equal this is very similar to
2049
01:17:52,320 --> 01:17:55,199
what we had above
2050
01:17:55,440 --> 01:17:56,840
and i'm just going to
2051
01:17:56,840 --> 01:18:00,640
say with this model
2052
01:18:01,040 --> 01:18:03,280
we are going to fit
2053
01:18:03,280 --> 01:18:05,120
x train
2054
01:18:05,120 --> 01:18:08,159
and y train
2055
01:18:08,159 --> 01:18:11,960
all right just like above
2056
01:18:13,199 --> 01:18:15,440
so this i might actually have to
2057
01:18:15,440 --> 01:18:17,440
so i'm going to set that
2058
01:18:17,440 --> 01:18:19,920
and uh
2059
01:18:19,920 --> 01:18:21,840
exactly just like above i'm going to
2060
01:18:21,840 --> 01:18:24,880
make my prediction
2061
01:18:24,880 --> 01:18:26,719
so here i'm going to instead use my
2062
01:18:26,719 --> 01:18:29,199
naive bayes model
2063
01:18:29,199 --> 01:18:32,960
and of course i'm going to run
2064
01:18:32,960 --> 01:18:35,440
the classification report again
2065
01:18:35,440 --> 01:18:36,640
so i'm actually just going to put these
2066
01:18:36,640 --> 01:18:38,320
in the same cell
2067
01:18:38,320 --> 01:18:39,920
but here we have the y the new y
2068
01:18:39,920 --> 01:18:42,560
prediction and then y test is still our
2069
01:18:42,560 --> 01:18:44,880
original test data set
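The cell being written here might look like the following sketch. The notebook's actual X_train / y_train split isn't shown in this excerpt, so synthetic two-class data stands in for it to keep the snippet self-contained.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Stand-in data for the notebook's split: class 0 centered at 0,
# class 1 centered at 2, in three features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3)) + np.repeat([[0], [2]], 100, axis=0)
y_train = np.repeat([0, 1], 100)
X_test = rng.normal(size=(40, 3)) + np.repeat([[0], [2]], 20, axis=0)
y_test = np.repeat([0, 1], 20)

nb_model = GaussianNB()
nb_model.fit(X_train, y_train)       # same fit pattern as the KNN cell above

y_pred = nb_model.predict(X_test)
print(classification_report(y_test, y_pred))
```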
2070
01:18:44,880 --> 01:18:47,040
so if i run this
2071
01:18:47,040 --> 01:18:48,800
you'll see that
2072
01:18:48,800 --> 01:18:51,840
okay what's going on here we get worse
2073
01:18:51,840 --> 01:18:54,960
scores right our precision
2074
01:18:54,960 --> 01:18:57,760
for all of them they look slightly worse
2075
01:18:57,760 --> 01:18:58,800
and
2076
01:18:58,800 --> 01:19:00,960
our um
2077
01:19:00,960 --> 01:19:02,560
you know for
2078
01:19:02,560 --> 01:19:04,800
our precision our recall our f1 score
2079
01:19:04,800 --> 01:19:06,320
they look slightly worse for all the
2080
01:19:06,320 --> 01:19:08,239
different categories and our total
2081
01:19:08,239 --> 01:19:11,199
accuracy i mean it's still 72 percent which is
2082
01:19:11,199 --> 01:19:13,040
not too shabby
2083
01:19:13,040 --> 01:19:15,440
but it's still 72
2084
01:19:15,440 --> 01:19:16,560
okay
2085
01:19:16,560 --> 01:19:18,320
um
2086
01:19:18,320 --> 01:19:21,679
which you know is not that great
2087
01:19:21,679 --> 01:19:24,000
okay so let's move on to logistic
2088
01:19:24,000 --> 01:19:25,440
regression
2089
01:19:25,440 --> 01:19:28,480
here i've drawn a plot i have y so this
2090
01:19:28,480 --> 01:19:30,080
is my label
2091
01:19:30,080 --> 01:19:32,640
on one axis and then this is maybe one
2092
01:19:32,640 --> 01:19:34,080
of my features so let's just say i only
2093
01:19:34,080 --> 01:19:35,760
have one feature in this case
2094
01:19:35,760 --> 01:19:38,880
x0 right
2095
01:19:38,880 --> 01:19:39,679
well
2096
01:19:39,679 --> 01:19:41,679
we see that
2097
01:19:41,679 --> 01:19:44,239
you know i have a few of one class type
2098
01:19:44,239 --> 01:19:45,600
down here
2099
01:19:45,600 --> 01:19:47,199
and we know it's one class type because
2100
01:19:47,199 --> 01:19:49,280
it's zero and then we have our other
2101
01:19:49,280 --> 01:19:51,760
class type one up here
2102
01:19:51,760 --> 01:19:53,520
okay
2103
01:19:53,520 --> 01:19:55,920
so many of you guys are familiar with
2104
01:19:55,920 --> 01:19:58,080
regression so let's start there
2105
01:19:58,080 --> 01:20:00,239
if i were to draw a regression line
2106
01:20:00,239 --> 01:20:01,520
through this
2107
01:20:01,520 --> 01:20:03,600
it might look something
2108
01:20:03,600 --> 01:20:05,520
like
2109
01:20:05,520 --> 01:20:08,000
like this
2110
01:20:08,080 --> 01:20:09,600
right
2111
01:20:09,600 --> 01:20:12,239
well this doesn't seem to be a very good
2112
01:20:12,239 --> 01:20:14,560
model like why would we use this
2113
01:20:14,560 --> 01:20:17,440
specific line to predict y
2114
01:20:17,440 --> 01:20:19,360
right
2115
01:20:19,360 --> 01:20:20,640
it's it's
2116
01:20:20,640 --> 01:20:21,679
iffy
2117
01:20:21,679 --> 01:20:23,280
okay
2118
01:20:23,280 --> 01:20:26,080
um for example we might say
2119
01:20:26,080 --> 01:20:27,760
okay well it seems like you know
2120
01:20:27,760 --> 01:20:30,080
everything from here downwards would be
2121
01:20:30,080 --> 01:20:32,320
one class type in here upwards would be
2122
01:20:32,320 --> 01:20:34,560
another class type
2123
01:20:34,560 --> 01:20:36,080
but when you look at this you can
2124
01:20:36,080 --> 01:20:37,760
just
2125
01:20:37,760 --> 01:20:40,400
visually tell okay like
2126
01:20:40,400 --> 01:20:42,640
that line doesn't make sense things are
2127
01:20:42,640 --> 01:20:45,040
not those dots are not along that line
2128
01:20:45,040 --> 01:20:46,800
and the reason is because we are doing
2129
01:20:46,800 --> 01:20:50,400
classification not regression
2130
01:20:50,400 --> 01:20:52,159
okay
2131
01:20:52,159 --> 01:20:54,400
well first of all let's start here we
2132
01:20:54,400 --> 01:20:56,719
know that this
2133
01:20:56,719 --> 01:20:57,600
model
2134
01:20:57,600 --> 01:21:00,880
if we just use this line it equals m x
2135
01:21:00,880 --> 01:21:03,840
so whatever this axis is let's just say it's x
2136
01:21:03,840 --> 01:21:05,760
plus b which is the y intercept right
2137
01:21:05,760 --> 01:21:07,520
and m is the slope
2138
01:21:07,520 --> 01:21:10,080
but when we use a linear regression is
2139
01:21:10,080 --> 01:21:11,600
it actually y hat
2140
01:21:11,600 --> 01:21:14,640
no it's not right so when we're working
2141
01:21:14,640 --> 01:21:16,080
with logistic regression what we're
2142
01:21:16,080 --> 01:21:18,080
actually estimating in our model
2143
01:21:18,080 --> 01:21:19,840
is a probability what's the probability
2144
01:21:19,840 --> 01:21:23,120
between 0 and 1 that is class 0 or class
2145
01:21:23,120 --> 01:21:24,800
1.
2146
01:21:24,800 --> 01:21:27,280
so here let's rewrite this
2147
01:21:27,280 --> 01:21:29,600
as p equals mx
2148
01:21:29,600 --> 01:21:32,159
plus b
2149
01:21:32,560 --> 01:21:37,360
okay well mx plus b that can range
2150
01:21:37,360 --> 01:21:38,800
you know from negative infinity to
2151
01:21:38,800 --> 01:21:41,040
infinity right for any for any value of
2152
01:21:41,040 --> 01:21:42,640
x it goes from negative infinity to
2153
01:21:42,640 --> 01:21:44,080
infinity
2154
01:21:44,080 --> 01:21:45,679
but we know one of
2155
01:21:45,679 --> 01:21:47,280
the rules of probability is that
2156
01:21:47,280 --> 01:21:50,400
probability has to stay between zero and
2157
01:21:50,400 --> 01:21:52,000
one
2158
01:21:52,000 --> 01:21:53,840
so how do we fix this
2159
01:21:53,840 --> 01:21:55,679
well maybe instead of
2160
01:21:55,679 --> 01:21:57,360
just setting the probability equal to
2161
01:21:57,360 --> 01:21:59,760
that we can set the odds
2162
01:21:59,760 --> 01:22:02,000
equal to this so by that i mean okay
2163
01:22:02,000 --> 01:22:04,880
let's do probability divided by 1 minus
2164
01:22:04,880 --> 01:22:06,880
the probability okay so now it becomes
2165
01:22:06,880 --> 01:22:08,560
this ratio
2166
01:22:08,560 --> 01:22:10,639
now this ratio is allowed to take on
2167
01:22:10,639 --> 01:22:13,120
infinite values
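A quick numeric illustration of why the odds help here: probabilities live in (0, 1), but their odds p / (1 - p) can grow without bound, matching the unbounded right-hand side.

```python
# Odds transformation: p in (0, 1) maps to odds in (0, infinity).
def odds(p):
    return p / (1 - p)

for p in [0.5, 0.8, 0.99]:
    print(p, round(odds(p), 6))  # 1.0, 4.0, 99.0 -- odds blow up near p = 1
```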
2168
01:22:13,120 --> 01:22:15,920
but there's still one issue here
2169
01:22:15,920 --> 01:22:18,000
let me move this over a bit
2170
01:22:18,000 --> 01:22:21,440
the one issue here is that
2171
01:22:21,440 --> 01:22:23,280
mx plus b that can still be negative
2172
01:22:23,280 --> 01:22:24,960
right like if you know i have a negative
2173
01:22:24,960 --> 01:22:26,960
slope if i have a negative b if i have
2174
01:22:26,960 --> 01:22:28,719
some negative x's in there i don't know
2175
01:22:28,719 --> 01:22:30,320
but that can be that's allowed to be
2176
01:22:30,320 --> 01:22:32,239
negative
2177
01:22:32,239 --> 01:22:34,320
so how do we fix that
2178
01:22:34,320 --> 01:22:38,480
we do that by actually taking the log
2179
01:22:38,480 --> 01:22:40,800
of the odds
2180
01:22:40,800 --> 01:22:43,040
okay
2181
01:22:43,280 --> 01:22:46,159
so now i have the log of you know some
2182
01:22:46,159 --> 01:22:47,760
probability divided by 1 minus the
2183
01:22:47,760 --> 01:22:50,880
probability and now that is on a range
2184
01:22:50,880 --> 01:22:53,760
of negative infinity to infinity which
2185
01:22:53,760 --> 01:22:56,239
is good because the range of log should
2186
01:22:56,239 --> 01:22:58,719
be negative infinity to infinity
2187
01:22:58,719 --> 01:23:02,560
now how do i solve for p the probability
2188
01:23:02,560 --> 01:23:04,800
well the first thing i can do is take
2189
01:23:04,800 --> 01:23:06,480
you know
2190
01:23:06,480 --> 01:23:08,719
i can remove the log by taking
2191
01:23:08,719 --> 01:23:10,639
e to the
2192
01:23:10,639 --> 01:23:14,080
power of what's on both sides
2193
01:23:14,080 --> 01:23:18,080
so that gives me the probability
2194
01:23:18,080 --> 01:23:20,719
over the one minus the probability
2195
01:23:20,719 --> 01:23:26,239
is now equal to e to the m x plus b
2196
01:23:26,239 --> 01:23:27,920
okay
2197
01:23:27,920 --> 01:23:29,920
so let's multiply that out so the
2198
01:23:29,920 --> 01:23:32,880
probability is equal to
2199
01:23:32,880 --> 01:23:37,120
one minus the probability times e to the m x plus
2200
01:23:37,120 --> 01:23:38,320
b
2201
01:23:38,320 --> 01:23:42,639
so p is equal to e to the m x plus b
2202
01:23:42,639 --> 01:23:44,320
minus p
2203
01:23:44,320 --> 01:23:48,400
times e to the m x plus b
2204
01:23:48,400 --> 01:23:50,480
and now we have we can move like terms
2205
01:23:50,480 --> 01:23:53,600
to one side so if i do p
2206
01:23:53,600 --> 01:23:56,080
uh so basically i'm moving this over so
2207
01:23:56,080 --> 01:23:59,920
i'm adding p so now p times one plus
2208
01:23:59,920 --> 01:24:02,800
e to the m x plus b
2209
01:24:02,800 --> 01:24:05,760
is equal to
2210
01:24:05,760 --> 01:24:09,760
e to the m x plus b and let me change
2211
01:24:09,760 --> 01:24:13,520
this parenthesis make it a little bigger
2212
01:24:13,520 --> 01:24:16,800
so now my probability can be e to the mx
2213
01:24:16,800 --> 01:24:18,239
plus b
2214
01:24:18,239 --> 01:24:22,000
divided by 1 plus e to the mx
2215
01:24:22,000 --> 01:24:24,560
plus b
2216
01:24:25,440 --> 01:24:26,719
okay
2217
01:24:26,719 --> 01:24:28,400
well
2218
01:24:28,400 --> 01:24:30,560
let me just rewrite this really quickly
2219
01:24:30,560 --> 01:24:33,760
i want a numerator of one on top
2220
01:24:33,760 --> 01:24:35,360
okay so what i'm going to do is i'm
2221
01:24:35,360 --> 01:24:37,360
going to multiply
2222
01:24:37,360 --> 01:24:40,719
this by e to the negative mx plus b
2223
01:24:40,719 --> 01:24:43,199
and then also the bottom by e to the
2224
01:24:43,199 --> 01:24:44,639
negative mx plus b and i'm allowed to do that
2225
01:24:44,639 --> 01:24:46,000
because
2226
01:24:46,000 --> 01:24:48,639
this over this is 1.
2227
01:24:48,639 --> 01:24:54,560
so now my probability is equal to 1 over
2228
01:24:54,560 --> 01:24:56,320
1 plus
2229
01:24:56,320 --> 01:25:00,320
e to the negative mx plus b and now why
2230
01:25:00,320 --> 01:25:02,239
do i rewrite it like that it's because
2231
01:25:02,239 --> 01:25:04,880
this is actually a form of a special
2232
01:25:04,880 --> 01:25:06,000
function
2233
01:25:06,000 --> 01:25:09,520
which is called the sigmoid
2234
01:25:10,400 --> 01:25:13,120
function
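Collected in one place, the derivation on the board is:

```latex
\log\frac{p}{1-p} = mx + b
\;\Rightarrow\;
\frac{p}{1-p} = e^{mx+b}
\;\Rightarrow\;
p = \frac{e^{mx+b}}{1 + e^{mx+b}}
  = \frac{1}{1 + e^{-(mx+b)}} = S(mx + b)
```

where S is the sigmoid function, S(y) = 1 / (1 + e^(-y)).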
2235
01:25:13,120 --> 01:25:15,120
and for the sigmoid function
2236
01:25:15,120 --> 01:25:18,080
it looks something like this so s of x
2237
01:25:18,080 --> 01:25:21,600
sigmoid you know of some x
2238
01:25:21,600 --> 01:25:24,400
is equal to 1 over
2239
01:25:24,400 --> 01:25:27,280
1 plus e to the negative
2240
01:25:27,280 --> 01:25:28,480
x
2241
01:25:28,480 --> 01:25:31,679
so essentially what i just did up here
2242
01:25:31,679 --> 01:25:33,440
is rewrite this
2243
01:25:33,440 --> 01:25:35,840
in some sigmoid function
2244
01:25:35,840 --> 01:25:40,000
where the x value is actually mx plus b
2245
01:25:40,000 --> 01:25:42,159
so maybe i'll change this to y just to
2246
01:25:42,159 --> 01:25:43,440
make that a bit more clear it doesn't
2247
01:25:43,440 --> 01:25:46,000
matter what the variable name is
2248
01:25:46,000 --> 01:25:48,719
but this is our sigmoid function
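A tiny check of that algebra, using nothing beyond the standard library: the sigmoid of mx + b matches the e^(mx+b) / (1 + e^(mx+b)) form derived above, and always lands strictly between 0 and 1. The slope and intercept here are arbitrary.

```python
import math

def sigmoid(y):
    # S(y) = 1 / (1 + e^(-y)), the form the probability derivation produced
    return 1 / (1 + math.exp(-y))

m, b = 2.0, -1.0  # arbitrary slope and intercept for the check
for x in [-3.0, 0.0, 3.0]:
    p = sigmoid(m * x + b)
    odds_form = math.exp(m * x + b) / (1 + math.exp(m * x + b))
    assert abs(p - odds_form) < 1e-12  # the two forms agree
    assert 0 < p < 1                   # always a valid probability
```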
2249
01:25:48,719 --> 01:25:49,760
and
2250
01:25:49,760 --> 01:25:51,679
visually what our sigmoid function looks
2251
01:25:51,679 --> 01:25:53,280
like
2252
01:25:53,280 --> 01:25:56,400
is it goes from zero so this here is
2253
01:25:56,400 --> 01:25:57,679
zero
2254
01:25:57,679 --> 01:25:58,800
to one
2255
01:25:58,800 --> 01:26:00,800
and it looks something
2256
01:26:00,800 --> 01:26:03,280
like this curved s which i didn't draw
2257
01:26:03,280 --> 01:26:05,760
too well let me try that again
2258
01:26:05,760 --> 01:26:09,400
this is hard to draw
2259
01:26:10,880 --> 01:26:14,960
something if i can draw this right
2260
01:26:14,960 --> 01:26:16,000
like
2261
01:26:16,000 --> 01:26:17,120
that
2262
01:26:17,120 --> 01:26:20,560
okay so it goes in between zero and one
2263
01:26:20,560 --> 01:26:24,320
and you might notice that this form
2264
01:26:24,320 --> 01:26:25,840
fits our
2265
01:26:25,840 --> 01:26:29,120
shape up here
2266
01:26:30,960 --> 01:26:32,159
oops
2267
01:26:32,159 --> 01:26:33,040
let's
2268
01:26:33,040 --> 01:26:36,080
draw it sharper but it fits our shape up
2269
01:26:36,080 --> 01:26:38,719
there a lot better right
2270
01:26:38,719 --> 01:26:41,360
all right so that is
2271
01:26:41,360 --> 01:26:42,719
what we call
2272
01:26:42,719 --> 01:26:44,480
logistic regression we're basically
2273
01:26:44,480 --> 01:26:46,320
trying to fit our data
2274
01:26:46,320 --> 01:26:48,639
to the sigmoid function
2275
01:26:48,639 --> 01:26:50,239
okay
2276
01:26:50,239 --> 01:26:54,960
and when we only have you know one
2277
01:26:54,960 --> 01:26:56,080
um
2278
01:26:56,080 --> 01:26:58,560
data point so if we only have one
2279
01:26:58,560 --> 01:27:01,600
feature x then that's what we call
2280
01:27:01,600 --> 01:27:03,760
simple
2281
01:27:03,760 --> 01:27:06,400
logistic regression
2282
01:27:06,400 --> 01:27:08,480
but then if we have you know so that's
2283
01:27:08,480 --> 01:27:12,159
only x0 but then if we have x0 x1
2284
01:27:12,159 --> 01:27:13,760
all the way to xn
2285
01:27:13,760 --> 01:27:16,239
we call this multiple
2286
01:27:16,239 --> 01:27:18,560
logistic regression because there are
2287
01:27:18,560 --> 01:27:21,199
multiple features that we're considering
2288
01:27:21,199 --> 01:27:23,520
when we're building our model
2289
01:27:23,520 --> 01:27:26,000
logistic regression
2290
01:27:26,000 --> 01:27:29,520
so i'm going to put that here and again
2291
01:27:29,520 --> 01:27:32,239
from sklearn
2292
01:27:32,239 --> 01:27:35,679
this linear model we can import logistic
2293
01:27:35,679 --> 01:27:37,280
regression
2294
01:27:37,280 --> 01:27:38,960
right
2295
01:27:38,960 --> 01:27:42,000
and just like how we did above we can
2296
01:27:42,000 --> 01:27:45,360
repeat all of this so here instead of nb
2297
01:27:45,360 --> 01:27:46,880
i'm going to call this
2298
01:27:46,880 --> 01:27:52,400
the log model or lg logistic regression
2299
01:27:52,400 --> 01:27:55,440
i'm going to change this to logistic
2300
01:27:55,440 --> 01:27:57,040
regression
2301
01:27:57,040 --> 01:27:58,960
so i'm just going to use the default
2302
01:27:58,960 --> 01:28:00,639
logistic regression
2303
01:28:00,639 --> 01:28:02,320
but actually if you look here you see
2304
01:28:02,320 --> 01:28:04,320
that you can use different penalties so
2305
01:28:04,320 --> 01:28:05,920
right now we're using
2306
01:28:05,920 --> 01:28:08,400
um an l2 penalty
2307
01:28:08,400 --> 01:28:09,280
but
2308
01:28:09,280 --> 01:28:12,159
l2 is a quadratic penalty okay so that
2309
01:28:12,159 --> 01:28:15,120
means that for you know outliers it
2310
01:28:15,120 --> 01:28:16,639
would really
2311
01:28:16,639 --> 01:28:18,880
penalize that
2312
01:28:18,880 --> 01:28:20,719
for all these other things you know you
2313
01:28:20,719 --> 01:28:22,639
can toggle
2314
01:28:22,639 --> 01:28:23,520
these
2315
01:28:23,520 --> 01:28:25,280
different parameters and you might get
2316
01:28:25,280 --> 01:28:27,280
slightly different results if i were
2317
01:28:27,280 --> 01:28:29,280
building a production level logistic
2318
01:28:29,280 --> 01:28:31,280
regression model then i would want to go
2319
01:28:31,280 --> 01:28:32,880
and i would want to figure out you know
2320
01:28:32,880 --> 01:28:34,560
what are the best
2321
01:28:34,560 --> 01:28:37,120
parameters to pass into here based on my
2322
01:28:37,120 --> 01:28:39,040
validation data
2323
01:28:39,040 --> 01:28:40,719
but for now we'll just we'll just use
2324
01:28:40,719 --> 01:28:42,639
this out of the box
2325
01:28:42,639 --> 01:28:44,639
so again i'm going to fit
2326
01:28:44,639 --> 01:28:47,360
the x train and the y train
2327
01:28:47,360 --> 01:28:49,920
and i'm just going to predict again so i
2328
01:28:49,920 --> 01:28:51,920
can just call this again
2329
01:28:51,920 --> 01:28:55,120
and instead of nb i'm going to
2330
01:28:55,120 --> 01:28:57,840
use lg so here this is decent precision
2331
01:28:57,840 --> 01:28:59,040
65
2332
01:28:59,040 --> 01:29:00,719
recall 71
2333
01:29:00,719 --> 01:29:02,880
f1 68
2334
01:29:02,880 --> 01:29:04,239
or 82
2335
01:29:04,239 --> 01:29:06,480
uh total accuracy of 77 okay so it
2336
01:29:06,480 --> 01:29:08,960
performs slightly better than naive bayes
2337
01:29:08,960 --> 01:29:12,639
but it's still not as good as knn
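A sketch of the logistic regression cell; stand-in one-feature data replaces the notebook's split, and predict_proba shows the sigmoid-shaped probabilities just derived.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data: class 0 centered at 0, class 1 centered at 3.
rng = np.random.default_rng(1)
X_train = np.concatenate([rng.normal(0, 1, (100, 1)),
                          rng.normal(3, 1, (100, 1))])
y_train = np.repeat([0, 1], 100)

lg_model = LogisticRegression()  # defaults out of the box: l2 penalty
lg_model.fit(X_train, y_train)

# P(class 1) follows the sigmoid: low for small x, high for large x.
probs = lg_model.predict_proba([[-2.0], [1.5], [5.0]])[:, 1]
print(probs.round(3))
```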
2338
01:29:12,639 --> 01:29:15,199
all right so the last model for
2339
01:29:15,199 --> 01:29:16,880
classification that i wanted to talk
2340
01:29:16,880 --> 01:29:18,560
about is something called
2341
01:29:18,560 --> 01:29:20,639
support vector machines
2342
01:29:20,639 --> 01:29:25,199
or svms for short
2343
01:29:25,199 --> 01:29:26,000
so
2344
01:29:26,000 --> 01:29:27,600
what exactly
2345
01:29:27,600 --> 01:29:29,199
is an svm
2346
01:29:29,199 --> 01:29:32,159
model i have two different features x0
2347
01:29:32,159 --> 01:29:34,480
and x1 on the axis
2348
01:29:34,480 --> 01:29:37,440
and then i've told you if it's you know
2349
01:29:37,440 --> 01:29:40,560
class 0 or class 1 based on the blue and
2350
01:29:40,560 --> 01:29:41,760
the red
2351
01:29:41,760 --> 01:29:43,040
labels
2352
01:29:43,040 --> 01:29:46,639
my goal is to find some sort of
2353
01:29:46,639 --> 01:29:47,679
line
2354
01:29:47,679 --> 01:29:51,360
between these two labels that best
2355
01:29:51,360 --> 01:29:53,840
divides the data
2356
01:29:53,840 --> 01:29:58,400
all right so this line is our svm model
2357
01:29:58,400 --> 01:30:00,800
so i call it a line here because in 2d
2358
01:30:00,800 --> 01:30:03,120
it's a line but in 3d it would be a
2359
01:30:03,120 --> 01:30:04,880
plane and then you can also have more
2360
01:30:04,880 --> 01:30:07,199
and more dimensions so the proper term
2361
01:30:07,199 --> 01:30:08,719
is actually i want to find the
2362
01:30:08,719 --> 01:30:10,080
hyperplane
2363
01:30:10,080 --> 01:30:12,159
that best differentiates these two
2364
01:30:12,159 --> 01:30:14,239
classes
2365
01:30:14,239 --> 01:30:15,679
let's
2366
01:30:15,679 --> 01:30:18,840
see a few examples okay so
2367
01:30:18,840 --> 01:30:20,719
first
2368
01:30:20,719 --> 01:30:23,679
between these
2369
01:30:23,679 --> 01:30:24,639
three
2370
01:30:24,639 --> 01:30:27,120
lines
2371
01:30:27,600 --> 01:30:29,280
let's say a
2372
01:30:29,280 --> 01:30:30,000
b
2373
01:30:30,000 --> 01:30:32,400
and c
2374
01:30:32,400 --> 01:30:34,800
which one is the best divider of the
2375
01:30:34,800 --> 01:30:36,960
data which one has you know all the data
2376
01:30:36,960 --> 01:30:39,520
on one side or the other or at least if
2377
01:30:39,520 --> 01:30:42,000
it doesn't which one divides it the most
2378
01:30:42,000 --> 01:30:43,920
right like which one has the most
2379
01:30:43,920 --> 01:30:46,400
defined boundary between the two
2380
01:30:46,400 --> 01:30:49,719
different groups
2381
01:30:50,719 --> 01:30:52,080
so
2382
01:30:52,080 --> 01:30:54,080
this this question should be pretty
2383
01:30:54,080 --> 01:30:56,159
straightforward
2384
01:30:56,159 --> 01:30:57,600
it should be a
2385
01:30:57,600 --> 01:31:00,159
right because a has that clear distinct
2386
01:31:00,159 --> 01:31:01,120
line
2387
01:31:01,120 --> 01:31:03,760
between where you know everything on
2388
01:31:03,760 --> 01:31:06,719
this side of a is one label it's
2389
01:31:06,719 --> 01:31:08,639
negative and everything on this side of
2390
01:31:08,639 --> 01:31:12,480
a is the other label it's positive
2391
01:31:12,560 --> 01:31:15,040
so what if i have a but then what if i
2392
01:31:15,040 --> 01:31:18,320
had drawn my b
2393
01:31:18,719 --> 01:31:20,400
like this
2394
01:31:20,400 --> 01:31:23,600
and my c
2395
01:31:23,600 --> 01:31:25,520
maybe like this
2396
01:31:25,520 --> 01:31:26,880
sorry the labels are
2397
01:31:26,880 --> 01:31:29,440
kind of close together
2398
01:31:29,440 --> 01:31:33,679
but now which one is the best
2399
01:31:34,560 --> 01:31:38,400
so i would argue that it's still a
2400
01:31:38,400 --> 01:31:40,719
right and why is it still a
2401
01:31:40,719 --> 01:31:42,840
because in these other
2402
01:31:42,840 --> 01:31:44,639
two
2403
01:31:44,639 --> 01:31:47,920
look at how close this is to
2404
01:31:47,920 --> 01:31:50,560
these points
2405
01:31:50,800 --> 01:31:55,120
right so if i had some new point
2406
01:31:55,120 --> 01:31:57,280
that i wanted to estimate okay say i
2407
01:31:57,280 --> 01:31:59,280
didn't have a or b
2408
01:31:59,280 --> 01:32:01,440
so let's say we're just working with c
2409
01:32:01,440 --> 01:32:03,199
let's say i have some new point that's
2410
01:32:03,199 --> 01:32:05,520
right here
2411
01:32:05,520 --> 01:32:08,639
or maybe a new point that's right there
2412
01:32:08,639 --> 01:32:10,800
well it seems like just logically
2413
01:32:10,800 --> 01:32:13,760
looking at this i mean without the
2414
01:32:13,760 --> 01:32:15,920
boundary that
2415
01:32:15,920 --> 01:32:19,440
would probably go under the positives
2416
01:32:19,440 --> 01:32:20,560
right
2417
01:32:20,560 --> 01:32:22,000
i mean it's pretty close to that other
2418
01:32:22,000 --> 01:32:23,760
positive
2419
01:32:23,760 --> 01:32:27,360
so one thing that we care about in svms
2420
01:32:27,360 --> 01:32:30,880
is something known as the margin
2421
01:32:30,880 --> 01:32:31,920
okay
2422
01:32:31,920 --> 01:32:32,800
so
2423
01:32:32,800 --> 01:32:35,280
not only do we want to separate the two
2424
01:32:35,280 --> 01:32:38,480
classes really well we also care about
2425
01:32:38,480 --> 01:32:40,639
the boundary in between
2426
01:32:40,639 --> 01:32:42,400
where the points in those classes in our
2427
01:32:42,400 --> 01:32:43,760
data set are
2428
01:32:43,760 --> 01:32:47,760
and the line that we're drawing so
2429
01:32:47,760 --> 01:32:51,120
in a line like this
2430
01:32:51,120 --> 01:32:54,719
the closest values to this line
2431
01:32:54,719 --> 01:32:55,840
might be
2432
01:32:55,840 --> 01:32:58,560
like here
2433
01:33:00,159 --> 01:33:04,159
i'm trying to draw these perpendicular
2434
01:33:06,960 --> 01:33:10,880
right and so this effectively
2435
01:33:10,880 --> 01:33:15,280
if i switch over to these dotted lines
2436
01:33:17,280 --> 01:33:20,480
if i can draw this right
2437
01:33:21,600 --> 01:33:24,480
so these effectively are what's known as
2438
01:33:24,480 --> 01:33:27,360
the margins
2439
01:33:30,480 --> 01:33:31,760
okay
2440
01:33:31,760 --> 01:33:34,320
so these both here
2441
01:33:34,320 --> 01:33:36,800
these are our margins
2442
01:33:36,800 --> 01:33:38,400
in our svms
2443
01:33:38,400 --> 01:33:40,320
and our goal is to maximize those
2444
01:33:40,320 --> 01:33:42,159
margins so not only do we want the line
2445
01:33:42,159 --> 01:33:43,280
that best separates the two different
2446
01:33:43,280 --> 01:33:46,400
classes we want the line that has the
2447
01:33:46,400 --> 01:33:48,719
largest margin
2448
01:33:48,719 --> 01:33:52,480
and the data points that lie on
2449
01:33:52,480 --> 01:33:55,280
the margin lines the data so basically
2450
01:33:55,280 --> 01:33:56,639
these are the data points that are helping
2451
01:33:56,639 --> 01:33:58,800
us define our divider
2452
01:33:58,800 --> 01:34:01,360
these are what we call support
2453
01:34:01,360 --> 01:34:04,360
vectors
2454
01:34:04,639 --> 01:34:08,159
hence the name support vector machines
2455
01:34:08,159 --> 01:34:11,120
okay so the issue with svm sometimes is
2456
01:34:11,120 --> 01:34:13,520
that they're not so robust
2457
01:34:13,520 --> 01:34:15,280
to outliers
2458
01:34:15,280 --> 01:34:17,679
right so for example if i had
2459
01:34:17,679 --> 01:34:19,920
one outlier
2460
01:34:19,920 --> 01:34:22,000
like this up here
2461
01:34:22,000 --> 01:34:24,400
that would totally change where i want
2462
01:34:24,400 --> 01:34:25,360
my
2463
01:34:25,360 --> 01:34:26,960
support vector to be
2464
01:34:26,960 --> 01:34:28,719
even though that might be my only
2465
01:34:28,719 --> 01:34:30,639
outlier okay
2466
01:34:30,639 --> 01:34:32,960
so that's just something to keep in mind
2467
01:34:32,960 --> 01:34:36,080
as you know you're working with svms is
2468
01:34:36,080 --> 01:34:38,159
it might not be the best model if there
2469
01:34:38,159 --> 01:34:40,320
are outliers in your data set
2470
01:34:40,320 --> 01:34:43,600
okay so another example of svms
2471
01:34:43,600 --> 01:34:45,920
might be let's say that we have data
2472
01:34:45,920 --> 01:34:47,600
like this i'm just going to use a one
2473
01:34:47,600 --> 01:34:50,239
dimensional data set for this example
2474
01:34:50,239 --> 01:34:51,600
let's say we have a data set that looks
2475
01:34:51,600 --> 01:34:53,840
like this
2476
01:34:53,840 --> 01:34:56,719
well our you know separator should be
2477
01:34:56,719 --> 01:34:59,199
perpendicular to this line
2478
01:34:59,199 --> 01:35:00,560
but it should be somewhere along this
2479
01:35:00,560 --> 01:35:02,320
line so it could be
2480
01:35:02,320 --> 01:35:04,719
anywhere like this
2481
01:35:04,719 --> 01:35:07,040
you might argue okay well there's one
2482
01:35:07,040 --> 01:35:09,280
here and then you could also just draw
2483
01:35:09,280 --> 01:35:10,800
another one over here
2484
01:35:10,800 --> 01:35:12,159
right and then maybe you can have two
2485
01:35:12,159 --> 01:35:15,280
svms but that's not really how svms work
2486
01:35:15,280 --> 01:35:17,360
but one thing that we can do is we can
2487
01:35:17,360 --> 01:35:20,320
create some sort of projection
2488
01:35:20,320 --> 01:35:23,280
so i realized here that one thing
2489
01:35:23,280 --> 01:35:24,880
i forgot to do
2490
01:35:24,880 --> 01:35:27,360
was to label where zero was so let's
2491
01:35:27,360 --> 01:35:28,880
just say zero
2492
01:35:28,880 --> 01:35:31,360
is here
2493
01:35:31,920 --> 01:35:33,440
now what i'm going to do is i'm going to
2494
01:35:33,440 --> 01:35:34,560
say okay
2495
01:35:34,560 --> 01:35:36,639
i'm going to have x and then i'm going
2496
01:35:36,639 --> 01:35:38,639
to have x
2497
01:35:38,639 --> 01:35:41,679
sorry x0 and x1 so x0 is just going to
2498
01:35:41,679 --> 01:35:43,440
be my original x
2499
01:35:43,440 --> 01:35:46,239
but i'm going to make x1 equal
2500
01:35:46,239 --> 01:35:49,040
to let's say
2501
01:35:49,040 --> 01:35:49,920
x
2502
01:35:49,920 --> 01:35:53,120
squared so whatever is this squared
2503
01:35:53,120 --> 01:35:55,280
right so now
2504
01:35:55,280 --> 01:35:57,600
my negatives would be you know maybe
2505
01:35:57,600 --> 01:36:00,000
somewhere here
2506
01:36:00,000 --> 01:36:01,040
here
2507
01:36:01,040 --> 01:36:04,800
just pretend that it's somewhere up here
2508
01:36:04,800 --> 01:36:07,280
right and now my pluses might be
2509
01:36:07,280 --> 01:36:11,000
something like
2510
01:36:11,840 --> 01:36:13,600
that
2511
01:36:13,600 --> 01:36:15,199
and i'm going to run out of space over
2512
01:36:15,199 --> 01:36:16,960
here so i'm just going to draw these
2513
01:36:16,960 --> 01:36:20,960
together use your imagination
2514
01:36:21,600 --> 01:36:26,320
but once i draw it like this
2515
01:36:26,560 --> 01:36:28,560
well it's a lot easier to apply a
2516
01:36:28,560 --> 01:36:31,280
boundary right now our svm could be
2517
01:36:31,280 --> 01:36:33,199
maybe something like this
2518
01:36:33,199 --> 01:36:35,520
this
2519
01:36:35,600 --> 01:36:37,440
and now you see that we've divided our
2520
01:36:37,440 --> 01:36:39,520
data set now it's separable where one
2521
01:36:39,520 --> 01:36:41,040
class is this way
2522
01:36:41,040 --> 01:36:42,560
and the other class
2523
01:36:42,560 --> 01:36:44,400
is that way
2524
01:36:44,400 --> 01:36:47,520
okay so that's known as svms
2525
01:36:47,520 --> 01:36:50,080
um i do highly suggest that you know any
2526
01:36:50,080 --> 01:36:51,440
of these models that we just mentioned
2527
01:36:51,440 --> 01:36:53,600
if you're interested in them do go more
2528
01:36:53,600 --> 01:36:55,920
in depth mathematically into them like
2529
01:36:55,920 --> 01:36:59,119
how do we find this hyperplane
2530
01:36:59,119 --> 01:37:00,960
right i'm not going to go over that in
2531
01:37:00,960 --> 01:37:02,960
this specific course because you're just
2532
01:37:02,960 --> 01:37:04,880
learning what an svm is
2533
01:37:04,880 --> 01:37:07,119
but it's a good idea to know oh okay
2534
01:37:07,119 --> 01:37:09,920
this is the technique behind finding
2535
01:37:09,920 --> 01:37:12,320
you know what exactly
2536
01:37:12,320 --> 01:37:15,040
how do you define the hyperplane that
2537
01:37:15,040 --> 01:37:16,880
we're going to use
2538
01:37:16,880 --> 01:37:19,199
so anyways this transformation that we
2539
01:37:19,199 --> 01:37:20,560
did down here
2540
01:37:20,560 --> 01:37:25,040
this is known as the kernel trick
2541
01:37:26,960 --> 01:37:30,000
so when we go from x to some coordinate
2542
01:37:30,000 --> 01:37:32,159
x and then x squared
2543
01:37:32,159 --> 01:37:34,320
what we're doing is we are applying a
2544
01:37:34,320 --> 01:37:35,840
kernel so that's why it's called the
2545
01:37:35,840 --> 01:37:38,400
kernel trick
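The x to (x, x squared) projection she draws can be sketched with numpy. (Strictly speaking, the kernel trick computes the relevant inner products implicitly rather than building the new features, but the explicit feature map is what's drawn here.) The points and labels below are invented to match the picture: pluses at the extremes, minuses near zero.

```python
import numpy as np

# Lift 1-D points x into the 2-D space (x, x^2).
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
labels = np.array([1, 1, 0, 0, 0, 1, 1])  # pluses (class 1) at the extremes

features = np.column_stack([x, x ** 2])   # explicit feature map

# No single threshold on x separates the classes, but in the lifted
# space a horizontal line (x^2 > 2) does.
predictions = (features[:, 1] > 2).astype(int)
print(np.array_equal(predictions, labels))  # True
```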
2546
01:37:38,400 --> 01:37:40,400
so svms are actually really powerful and
2547
01:37:40,400 --> 01:37:42,320
you'll see that here so
2548
01:37:42,320 --> 01:37:43,719
from
2549
01:37:43,719 --> 01:37:46,840
sklearn.svm we are going to import
2550
01:37:46,840 --> 01:37:50,639
svc and svc is our support vector
2551
01:37:50,639 --> 01:37:53,280
classifier
2552
01:37:53,280 --> 01:37:55,600
so with this
2553
01:37:55,600 --> 01:37:58,320
so with our svm model
2554
01:37:58,320 --> 01:38:01,040
we are going to you know create an svc
2555
01:38:01,040 --> 01:38:02,560
model
2556
01:38:02,560 --> 01:38:04,960
and we are going to
2557
01:38:04,960 --> 01:38:06,400
uh
2558
01:38:06,400 --> 01:38:08,639
again fit this to x train i could have
2559
01:38:08,639 --> 01:38:10,080
just copy and pasted this i should have
2560
01:38:10,080 --> 01:38:12,800
probably done that
2561
01:38:13,119 --> 01:38:15,360
okay
2562
01:38:15,360 --> 01:38:17,360
taking a bit longer
2563
01:38:17,360 --> 01:38:19,679
all right
2564
01:38:20,480 --> 01:38:22,719
let's predict using our svm model and
2565
01:38:22,719 --> 01:38:23,600
here
2566
01:38:23,600 --> 01:38:26,080
let's see if i can hover over this
2567
01:38:26,080 --> 01:38:27,679
all right so again you see a lot of
2568
01:38:27,679 --> 01:38:31,520
these different parameters here that you
2569
01:38:31,520 --> 01:38:33,840
can go back and change if you were
2570
01:38:33,840 --> 01:38:36,639
creating a production level model
2571
01:38:36,639 --> 01:38:38,239
okay but
2572
01:38:38,239 --> 01:38:40,000
in this specific case
2573
01:38:40,000 --> 01:38:44,239
we'll just use it out of the box again
2574
01:38:44,239 --> 01:38:45,119
so
2575
01:38:45,119 --> 01:38:47,360
if i make predictions you'll note that
2576
01:38:47,360 --> 01:38:50,159
wow the accuracy actually jumps to 87 percent
2577
01:38:50,159 --> 01:38:51,600
with the svm
2578
01:38:51,600 --> 01:38:53,679
and even with class 0 there's nothing
2579
01:38:53,679 --> 01:38:57,520
less than you know 0.8 which is great
2580
01:38:57,520 --> 01:38:59,840
and for class one i mean everything's at
2581
01:38:59,840 --> 01:39:01,840
0.9 which is higher than anything that
2582
01:39:01,840 --> 01:39:05,280
we had seen to this point
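A sketch of the SVC usage shown here; the video's diabetes data isn't reproduced, so a synthetic stand-in dataset is used:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# synthetic stand-in for the scaled feature matrix used in the video
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 8))
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)  # a non-linear boundary

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# out-of-the-box support vector classifier, as in the video (default RBF kernel)
svm_model = SVC()
svm_model.fit(X_train, y_train)
y_pred = svm_model.predict(X_test)

acc = svm_model.score(X_test, y_test)
print(acc)
```

Because the boundary is curved, the default RBF kernel tends to beat a plain linear model here, mirroring the jump in accuracy seen in the video.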
2583
01:39:06,560 --> 01:39:09,199
so so far we've gone over four different
2584
01:39:09,199 --> 01:39:11,280
classification models we've done svms
2585
01:39:11,280 --> 01:39:14,880
logistic regression naive bayes and knn
2586
01:39:14,880 --> 01:39:16,639
and these are just simple ways on how to
2587
01:39:16,639 --> 01:39:18,639
implement them each of these they have
2588
01:39:18,639 --> 01:39:20,400
different you know
2589
01:39:20,400 --> 01:39:23,199
they have different hyper parameters
2590
01:39:23,199 --> 01:39:25,600
that you can go and you can toggle and
2591
01:39:25,600 --> 01:39:27,119
you can try to see
2592
01:39:27,119 --> 01:39:30,000
if that helps later on or not
2593
01:39:30,000 --> 01:39:30,880
but
2594
01:39:30,880 --> 01:39:33,520
for the most part they perform
2595
01:39:33,520 --> 01:39:36,239
they give us around 70 to 80 percent
2596
01:39:36,239 --> 01:39:37,520
accuracy
2597
01:39:37,520 --> 01:39:40,960
okay with svm being the best now let's
2598
01:39:40,960 --> 01:39:43,199
see if we can actually beat that using a
2599
01:39:43,199 --> 01:39:45,199
neural net now the final type of model
2600
01:39:45,199 --> 01:39:47,360
that i wanted to talk about is known as
2601
01:39:47,360 --> 01:39:50,000
a neural net or neural network
2602
01:39:50,000 --> 01:39:53,280
and neural nets look something like this
2603
01:39:53,280 --> 01:39:55,360
so you have an input layer this is where
2604
01:39:55,360 --> 01:39:57,199
all your features would go
2605
01:39:57,199 --> 01:39:58,000
and
2606
01:39:58,000 --> 01:39:59,679
they have all these arrows pointing to
2607
01:39:59,679 --> 01:40:01,520
some sort of hidden layer
2608
01:40:01,520 --> 01:40:03,119
and then all these arrows point to some
2609
01:40:03,119 --> 01:40:05,199
sort of output layer
2610
01:40:05,199 --> 01:40:08,159
so what does all this mean each
2611
01:40:08,159 --> 01:40:10,480
of these nodes in here this is
2612
01:40:10,480 --> 01:40:13,280
something known as a neuron
2613
01:40:13,280 --> 01:40:15,840
okay so that's a neuron
2614
01:40:15,840 --> 01:40:17,280
in a neural net
2615
01:40:17,280 --> 01:40:19,119
these are all of our features that we're
2616
01:40:19,119 --> 01:40:20,960
inputting into the neural net so that
2617
01:40:20,960 --> 01:40:23,760
might be x0 x1 all the way through
2618
01:40:23,760 --> 01:40:25,119
xn
2619
01:40:25,119 --> 01:40:26,800
right and these are the features that we
2620
01:40:26,800 --> 01:40:28,639
talked about there they might be you
2621
01:40:28,639 --> 01:40:31,920
know the pregnancy the bmi the
2622
01:40:31,920 --> 01:40:34,239
age etc
2623
01:40:34,239 --> 01:40:37,199
and now all these get weighted by some
2624
01:40:37,199 --> 01:40:38,239
value
2625
01:40:38,239 --> 01:40:40,719
so they get multiplied by some w number
2626
01:40:40,719 --> 01:40:42,800
that applies to that one specific
2627
01:40:42,800 --> 01:40:45,119
category that one specific feature so
2628
01:40:45,119 --> 01:40:46,960
these two get multiplied
2629
01:40:46,960 --> 01:40:49,920
and the sum of all of these goes into
2630
01:40:49,920 --> 01:40:51,520
that neuron
2631
01:40:51,520 --> 01:40:54,960
okay so basically i'm taking w0 times x0
2632
01:40:54,960 --> 01:40:58,239
and then i'm adding x1 times w1 and then
2633
01:40:58,239 --> 01:41:01,119
i'm adding you know x2 times w2 etc all
2634
01:41:01,119 --> 01:41:03,280
the way to xn times
2635
01:41:03,280 --> 01:41:05,280
wn and that's getting
2636
01:41:05,280 --> 01:41:07,440
input into the neuron
2637
01:41:07,440 --> 01:41:10,000
now i'm also adding this bias term which
2638
01:41:10,000 --> 01:41:11,520
just means okay i might want to shift
2639
01:41:11,520 --> 01:41:14,639
this by a little bit so i might add 5 or
2640
01:41:14,639 --> 01:41:17,119
i might add 0.1 or i might subtract 100
2641
01:41:17,119 --> 01:41:19,199
i don't know but we're going to add this
2642
01:41:19,199 --> 01:41:21,440
bias term
2643
01:41:21,440 --> 01:41:24,880
and the output of all these things so
2644
01:41:24,880 --> 01:41:27,840
the sum of this this this and this
2645
01:41:27,840 --> 01:41:30,480
go into something known as an activation
2646
01:41:30,480 --> 01:41:32,400
function okay
2647
01:41:32,400 --> 01:41:34,719
and then after applying this activation
2648
01:41:34,719 --> 01:41:37,440
function we get an output
2649
01:41:37,440 --> 01:41:39,840
and this is what a neuron would look
2650
01:41:39,840 --> 01:41:41,920
like
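The single neuron just described, the weighted sum w0·x0 + w1·x1 + … + wn·xn plus a bias, passed through an activation, can be sketched in a few lines; the numbers here are arbitrary:

```python
import numpy as np

def neuron(x, w, b, activation):
    # weighted sum of inputs plus the bias term, then the activation function
    return activation(np.dot(w, x) + b)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([1.0, 2.0, 3.0])    # features x0..x2 (e.g. pregnancies, bmi, age)
w = np.array([0.5, -0.25, 0.1])  # weights w0..w2
b = 0.2                          # bias shifts the weighted sum

out = neuron(x, w, b, sigmoid)
print(out)  # sigmoid(0.5), about 0.62
```

A whole layer is just many of these neurons sharing the same inputs, and a network stacks layers so each layer's outputs become the next layer's inputs.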
2651
01:41:41,920 --> 01:41:43,440
now a whole network of them would look
2652
01:41:43,440 --> 01:41:44,840
something like
2653
01:41:44,840 --> 01:41:47,440
this so i kind of gloss over this
2654
01:41:47,440 --> 01:41:49,199
activation function
2655
01:41:49,199 --> 01:41:51,920
what exactly is that
2656
01:41:51,920 --> 01:41:54,800
this is what a neural net looks like if
2657
01:41:54,800 --> 01:41:56,480
we have all our inputs here and let's
2658
01:41:56,480 --> 01:41:58,480
say all of these arrows represent some
2659
01:41:58,480 --> 01:42:00,800
sort of addition
2660
01:42:00,800 --> 01:42:01,760
right
2661
01:42:01,760 --> 01:42:04,320
then what's going on is we're just
2662
01:42:04,320 --> 01:42:06,639
adding a bunch of times
2663
01:42:06,639 --> 01:42:09,040
right we're adding some sort of
2664
01:42:09,040 --> 01:42:11,280
weight times these inputs
2665
01:42:11,280 --> 01:42:13,360
a bunch of times and then if we were to
2666
01:42:13,360 --> 01:42:16,400
go back and factor that all out
2667
01:42:16,400 --> 01:42:19,360
then this entire neural net
2668
01:42:19,360 --> 01:42:21,600
is just a linear combination of these
2669
01:42:21,600 --> 01:42:23,679
input layers
2670
01:42:23,679 --> 01:42:25,679
which i don't know about you but that
2671
01:42:25,679 --> 01:42:27,440
just seems kind of useless right because
2672
01:42:27,440 --> 01:42:29,199
we could literally just write that out
2673
01:42:29,199 --> 01:42:31,600
in a formula why would we need to set up
2674
01:42:31,600 --> 01:42:34,080
this entire neural network we
2675
01:42:34,080 --> 01:42:35,840
wouldn't
2676
01:42:35,840 --> 01:42:38,560
so the activation function is introduced
2677
01:42:38,560 --> 01:42:40,880
right so without an activation function
2678
01:42:40,880 --> 01:42:44,400
this just becomes a linear model
2679
01:42:44,400 --> 01:42:46,239
an activation function might look
2680
01:42:46,239 --> 01:42:48,320
something like this and as you can tell
2681
01:42:48,320 --> 01:42:50,639
these are not linear and the reason why
2682
01:42:50,639 --> 01:42:52,159
we introduce these
2683
01:42:52,159 --> 01:42:53,920
is so that our entire model doesn't
2684
01:42:53,920 --> 01:42:55,520
collapse on itself and become a linear
2685
01:42:55,520 --> 01:42:57,440
model
2686
01:42:57,440 --> 01:42:59,280
so over here this is something known as
2687
01:42:59,280 --> 01:43:01,280
a sigmoid function it runs between zero
2688
01:43:01,280 --> 01:43:02,480
and one
2689
01:43:02,480 --> 01:43:04,560
tanh runs between negative one all the
2690
01:43:04,560 --> 01:43:05,679
way to one
2691
01:43:05,679 --> 01:43:08,639
and this is relu which anything less
2692
01:43:08,639 --> 01:43:10,719
than zero is zero and that anything
2693
01:43:10,719 --> 01:43:14,480
greater than zero is linear
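The three activation functions just named can be written directly in numpy:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))  # squashes any input into (0, 1)

def tanh(z):
    return np.tanh(z)            # squashes any input into (-1, 1)

def relu(z):
    return np.maximum(0, z)      # zero below zero, linear above zero

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))
```

Each one is non-linear, which is exactly what stops the stacked layers from collapsing into a single linear combination.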
2694
01:43:14,480 --> 01:43:16,800
so with these activation functions
2695
01:43:16,800 --> 01:43:19,360
every single output of a neuron
2696
01:43:19,360 --> 01:43:21,600
is no longer just the linear combination
2697
01:43:21,600 --> 01:43:23,600
of these it's some sort of altered
2698
01:43:23,600 --> 01:43:25,920
linear state which means that the input
2699
01:43:25,920 --> 01:43:28,560
into the next neuron
2700
01:43:28,560 --> 01:43:29,760
is
2701
01:43:29,760 --> 01:43:30,880
you know
2702
01:43:30,880 --> 01:43:32,880
it doesn't collapse on itself
2703
01:43:32,880 --> 01:43:34,800
it doesn't become linear because we've
2704
01:43:34,800 --> 01:43:38,639
introduced all these non-linearities
2705
01:43:38,639 --> 01:43:41,440
so this is the training set the model
2706
01:43:41,440 --> 01:43:44,000
the loss right and then we do this thing
2707
01:43:44,000 --> 01:43:45,600
called training where we have to feed
2708
01:43:45,600 --> 01:43:47,520
the loss back into the model
2709
01:43:47,520 --> 01:43:49,679
and make certain adjustments the model
2710
01:43:49,679 --> 01:43:51,760
to improve
2711
01:43:51,760 --> 01:43:55,119
this predicted output
2712
01:43:55,119 --> 01:43:56,639
let's talk a little bit about the
2713
01:43:56,639 --> 01:43:58,719
training what exactly goes on during
2714
01:43:58,719 --> 01:44:00,719
that step
2715
01:44:00,719 --> 01:44:03,520
let's go back and take a look at our l2
2716
01:44:03,520 --> 01:44:05,360
loss function
2717
01:44:05,360 --> 01:44:07,760
this is what our l2 loss function looks
2718
01:44:07,760 --> 01:44:12,080
like it's a quadratic formula right
2719
01:44:12,080 --> 01:44:14,960
well up here the error is really really
2720
01:44:14,960 --> 01:44:17,520
really really large
2721
01:44:17,520 --> 01:44:18,320
and
2722
01:44:18,320 --> 01:44:20,639
our goal is to get somewhere down here
2723
01:44:20,639 --> 01:44:22,719
where the loss is decreased right
2724
01:44:22,719 --> 01:44:24,960
because that means that our predicted
2725
01:44:24,960 --> 01:44:29,199
value is closer to our true value
2726
01:44:29,199 --> 01:44:31,600
so that means that we want to go
2727
01:44:31,600 --> 01:44:33,679
this way
2728
01:44:33,679 --> 01:44:35,199
okay
2729
01:44:35,199 --> 01:44:37,840
and thanks to a lot of properties of
2730
01:44:37,840 --> 01:44:40,159
math something that we can do is called
2731
01:44:40,159 --> 01:44:41,760
gradient descent
2732
01:44:41,760 --> 01:44:44,960
in order to follow this
2733
01:44:44,960 --> 01:44:46,000
slope
2734
01:44:46,000 --> 01:44:48,639
down this way
2735
01:44:48,639 --> 01:44:50,719
this
2736
01:44:50,719 --> 01:44:52,320
quadratic
2737
01:44:52,320 --> 01:44:55,600
has different
2738
01:44:55,600 --> 01:44:59,280
slopes with respect to some value
2739
01:44:59,280 --> 01:45:00,400
okay so
2740
01:45:00,400 --> 01:45:03,040
the loss with respect to some weight
2741
01:45:03,040 --> 01:45:04,560
w0
2742
01:45:04,560 --> 01:45:08,000
versus w1 versus wn
2743
01:45:08,000 --> 01:45:10,320
they might all be different
2744
01:45:10,320 --> 01:45:11,520
right so
2745
01:45:11,520 --> 01:45:13,040
some way that i kind of think about it
2746
01:45:13,040 --> 01:45:15,600
is to what extent is this value
2747
01:45:15,600 --> 01:45:18,000
contributing to our loss and we can
2748
01:45:18,000 --> 01:45:19,920
actually figure that out through some
2749
01:45:19,920 --> 01:45:21,920
calculus which we're not going to touch
2750
01:45:21,920 --> 01:45:25,520
up on in this specific course but
2751
01:45:25,520 --> 01:45:26,800
if you want to learn more about neural
2752
01:45:26,800 --> 01:45:28,719
nets you should probably also learn some
2753
01:45:28,719 --> 01:45:31,040
calculus and figure out what exactly
2754
01:45:31,040 --> 01:45:33,520
backpropagation is doing in order to
2755
01:45:33,520 --> 01:45:35,840
actually calculate you know how much do
2756
01:45:35,840 --> 01:45:38,400
we have to backstep by
2757
01:45:38,400 --> 01:45:40,320
so the thing is here you might notice
2758
01:45:40,320 --> 01:45:42,800
that this follows this curve at all
2759
01:45:42,800 --> 01:45:45,679
these different points and the closer we
2760
01:45:45,679 --> 01:45:49,119
get to the bottom the smaller this step
2761
01:45:49,119 --> 01:45:50,800
becomes
2762
01:45:50,800 --> 01:45:52,880
now stick with me here
2763
01:45:52,880 --> 01:45:54,320
so
2764
01:45:54,320 --> 01:45:56,960
my new value this is what we call a
2765
01:45:56,960 --> 01:46:00,000
weight update i'm going to take w0
2766
01:46:00,000 --> 01:46:02,560
and i'm going to set some new value for
2767
01:46:02,560 --> 01:46:04,000
w0
2768
01:46:04,000 --> 01:46:05,840
and what i'm going to set for that is
2769
01:46:05,840 --> 01:46:08,320
the old value of w0
2770
01:46:08,320 --> 01:46:09,360
plus
2771
01:46:09,360 --> 01:46:10,239
some
2772
01:46:10,239 --> 01:46:12,480
factor which i'll just call alpha for
2773
01:46:12,480 --> 01:46:13,600
now
2774
01:46:13,600 --> 01:46:14,719
times
2775
01:46:14,719 --> 01:46:17,280
whatever this arrow is so that's
2776
01:46:17,280 --> 01:46:20,480
basically saying okay take our old
2777
01:46:20,480 --> 01:46:22,960
w0 our old weight
2778
01:46:22,960 --> 01:46:25,679
and just decrease it
2779
01:46:25,679 --> 01:46:28,080
this way so i guess increase it in this
2780
01:46:28,080 --> 01:46:30,080
direction right like take a step in this
2781
01:46:30,080 --> 01:46:32,239
direction but this alpha here is telling
2782
01:46:32,239 --> 01:46:34,320
us okay don't don't take a huge step
2783
01:46:34,320 --> 01:46:36,000
right just in case we're wrong take a
2784
01:46:36,000 --> 01:46:37,520
small step take a small step in that
2785
01:46:37,520 --> 01:46:40,800
direction see if we get any closer
2786
01:46:40,800 --> 01:46:43,440
and for those of you who you know do
2787
01:46:43,440 --> 01:46:45,040
want to look more into the mathematics
2788
01:46:45,040 --> 01:46:46,960
of things the reason why i use a plus
2789
01:46:46,960 --> 01:46:48,320
here is because
2790
01:46:48,320 --> 01:46:50,639
this here is the negative gradient right
2791
01:46:50,639 --> 01:46:53,040
if this were just the if you were to use
2792
01:46:53,040 --> 01:46:54,320
the actual gradient this should be a
2793
01:46:54,320 --> 01:46:56,639
minus
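The weight update just described, new w0 = old w0 minus the learning rate alpha times the gradient, can be sketched on a toy quadratic loss where the gradient is easy to write by hand:

```python
# gradient descent on the toy quadratic loss L(w) = (w - 3)^2,
# whose gradient is dL/dw = 2 * (w - 3)
alpha = 0.1  # learning rate: how big a step we take each update
w = 0.0      # starting weight

for _ in range(100):
    grad = 2 * (w - 3)
    w = w - alpha * grad  # step against the gradient, scaled by alpha

print(w)  # converges toward the minimum at w = 3
```

Each step shrinks as the slope flattens near the bottom, which is the behavior pointed out on the curve; set alpha too large and the steps overshoot and the loop can diverge instead.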
2794
01:46:56,800 --> 01:46:59,040
now this alpha is something that we call
2795
01:46:59,040 --> 01:47:00,480
the learning rate
2796
01:47:00,480 --> 01:47:03,440
okay and that adjusts how quickly we're
2797
01:47:03,440 --> 01:47:04,960
taking steps
2798
01:47:04,960 --> 01:47:08,480
and you know
2799
01:47:08,480 --> 01:47:10,320
that will ultimately control
2800
01:47:10,320 --> 01:47:12,400
how long it takes for our neural net to
2801
01:47:12,400 --> 01:47:13,600
converge
2802
01:47:13,600 --> 01:47:15,119
or sometimes if you set it too high it
2803
01:47:15,119 --> 01:47:17,360
might even diverge
2804
01:47:17,360 --> 01:47:19,440
but with all of these weights so here i
2805
01:47:19,440 --> 01:47:23,280
have w0 w1 and then wn
2806
01:47:23,280 --> 01:47:25,360
we make the same update to all of them
2807
01:47:25,360 --> 01:47:27,600
after we calculate
2808
01:47:27,600 --> 01:47:28,480
the
2809
01:47:28,480 --> 01:47:29,360
loss
2810
01:47:29,360 --> 01:47:31,840
the gradient of the loss with respect to
2811
01:47:31,840 --> 01:47:33,600
that weight
2812
01:47:33,600 --> 01:47:37,119
so that's how backpropagation works
2813
01:47:37,119 --> 01:47:39,440
and that is everything that's going on
2814
01:47:39,440 --> 01:47:41,280
here after we calculate the loss we're
2815
01:47:41,280 --> 01:47:42,960
calculating gradients
2816
01:47:42,960 --> 01:47:44,880
making adjustments in the model so we're
2817
01:47:44,880 --> 01:47:47,040
setting all the all the weights to
2818
01:47:47,040 --> 01:47:50,480
something adjusted slightly
2819
01:47:50,480 --> 01:47:51,679
and then
2820
01:47:51,679 --> 01:47:53,119
we're saying okay let's take the
2821
01:47:53,119 --> 01:47:54,159
training set and run it through the
2822
01:47:54,159 --> 01:47:56,080
model again and go through this loop all
2823
01:47:56,080 --> 01:47:59,760
over again so for machine learning we
2824
01:47:59,760 --> 01:48:01,840
already have seen some libraries that we
2825
01:48:01,840 --> 01:48:05,199
use right we've already seen sklearn
2826
01:48:05,199 --> 01:48:06,159
but
2827
01:48:06,159 --> 01:48:10,800
when we start going into neural networks
2828
01:48:11,360 --> 01:48:12,800
this is kind of what we're trying to
2829
01:48:12,800 --> 01:48:14,400
program
2830
01:48:14,400 --> 01:48:15,600
and
2831
01:48:15,600 --> 01:48:16,840
it's not
2832
01:48:16,840 --> 01:48:18,639
very fun
2833
01:48:18,639 --> 01:48:20,480
to try to program this from scratch
2834
01:48:20,480 --> 01:48:21,760
because
2835
01:48:21,760 --> 01:48:24,080
not only will we probably have a lot of
2836
01:48:24,080 --> 01:48:26,080
bugs but also it's probably not going to
2837
01:48:26,080 --> 01:48:27,600
be fast enough
2838
01:48:27,600 --> 01:48:28,400
right
2839
01:48:28,400 --> 01:48:29,840
wouldn't it be great if there are some
2840
01:48:29,840 --> 01:48:30,719
you know
2841
01:48:30,719 --> 01:48:32,400
full-time professionals that are
2842
01:48:32,400 --> 01:48:34,480
dedicated to solving this problem and
2843
01:48:34,480 --> 01:48:36,639
they could literally just give us their
2844
01:48:36,639 --> 01:48:40,320
code that's already running really fast
2845
01:48:40,320 --> 01:48:44,639
well the answer is yes that exists
2846
01:48:44,639 --> 01:48:46,239
and that's why we use tensorflow so
2847
01:48:46,239 --> 01:48:48,239
tensorflow makes it really easy to
2848
01:48:48,239 --> 01:48:50,000
define these models
2849
01:48:50,000 --> 01:48:52,560
but we also have enough control
2850
01:48:52,560 --> 01:48:54,320
over what exactly we're feeding into
2851
01:48:54,320 --> 01:48:55,360
this model
2852
01:48:55,360 --> 01:48:57,920
so for example this line here is
2853
01:48:57,920 --> 01:49:00,800
basically saying okay let's create
2854
01:49:00,800 --> 01:49:02,639
a sequential neural net
2855
01:49:02,639 --> 01:49:04,159
so sequential is just you know what
2856
01:49:04,159 --> 01:49:05,920
we've seen here it just goes one layer
2857
01:49:05,920 --> 01:49:07,520
to the next
2858
01:49:07,520 --> 01:49:09,600
and a dense layer means that all of them
2859
01:49:09,600 --> 01:49:11,840
are interconnected so here this is
2860
01:49:11,840 --> 01:49:13,760
interconnected with all of these nodes
2861
01:49:13,760 --> 01:49:15,600
and this one's all these and then this
2862
01:49:15,600 --> 01:49:17,760
one gets connected to all of
2863
01:49:17,760 --> 01:49:20,639
the next ones and so on so we're going
2864
01:49:20,639 --> 01:49:22,159
to create 16
2865
01:49:22,159 --> 01:49:23,920
dense nodes
2866
01:49:23,920 --> 01:49:26,480
with relu activation functions and then
2867
01:49:26,480 --> 01:49:28,320
we're going to create another layer of
2868
01:49:28,320 --> 01:49:29,679
16
2869
01:49:29,679 --> 01:49:32,800
dense nodes with relu activation and
2870
01:49:32,800 --> 01:49:34,320
then our output layer is going to be
2871
01:49:34,320 --> 01:49:37,440
just one node okay
2872
01:49:37,440 --> 01:49:38,880
and that's how easy it is to define
2873
01:49:38,880 --> 01:49:41,679
something in tensorflow
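The slide's architecture, two dense layers of 16 relu units and a single output node, looks like this in keras; the 10-feature input shape is taken from the notebook code that follows:

```python
import tensorflow as tf

# the video passes input_shape=(10,) to the first Dense layer;
# an explicit Input object is the equivalent, warning-free form
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),  # single output node, as on the slide
])

model.summary()
```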
2874
01:49:41,679 --> 01:49:44,800
so tensorflow is an open source library
2875
01:49:44,800 --> 01:49:47,760
that helps you develop and train your ml
2876
01:49:47,760 --> 01:49:49,440
models
2877
01:49:49,440 --> 01:49:52,000
let's implement this for a neural net so
2878
01:49:52,000 --> 01:49:53,119
we're using a neural net for
2879
01:49:53,119 --> 01:49:54,400
classification
2880
01:49:54,400 --> 01:49:55,440
now
2881
01:49:55,440 --> 01:49:58,159
so our neural net model
2882
01:49:58,159 --> 01:50:00,080
we are going to use tensorflow and i
2883
01:50:00,080 --> 01:50:02,800
don't think i imported that up here so
2884
01:50:02,800 --> 01:50:05,840
we are going to import that down here
2885
01:50:05,840 --> 01:50:07,679
so i'm going to import
2886
01:50:07,679 --> 01:50:10,719
tensorflow as tf
2887
01:50:10,719 --> 01:50:12,320
and enter
2888
01:50:12,320 --> 01:50:14,159
cool
2889
01:50:14,159 --> 01:50:16,719
so my
2890
01:50:17,360 --> 01:50:19,199
neural net model
2891
01:50:19,199 --> 01:50:20,639
is going to be
2892
01:50:20,639 --> 01:50:23,199
i'm going to use this
2893
01:50:23,199 --> 01:50:23,900
um
2894
01:50:23,900 --> 01:50:25,360
[Music]
2895
01:50:25,360 --> 01:50:27,199
so essentially this is saying
2896
01:50:27,199 --> 01:50:28,800
layer all these things that i'm about to
2897
01:50:28,800 --> 01:50:30,000
pass in
2898
01:50:30,000 --> 01:50:31,199
so yeah
2899
01:50:31,199 --> 01:50:33,920
layer them linear stack of layers
2900
01:50:33,920 --> 01:50:35,679
layer them as a model
2901
01:50:35,679 --> 01:50:38,719
and what that means nope not that so
2902
01:50:38,719 --> 01:50:42,080
what that means is i can pass in
2903
01:50:42,080 --> 01:50:44,400
um some sort of layer
2904
01:50:44,400 --> 01:50:47,440
and i'm just going to use a dense layer
2905
01:50:47,440 --> 01:50:50,719
uh oops dot dense
2906
01:50:50,719 --> 01:50:53,520
and let's say we have 32
2907
01:50:53,520 --> 01:50:54,719
units
2908
01:50:54,719 --> 01:50:55,600
okay
2909
01:50:55,600 --> 01:50:58,480
i will also
2910
01:50:58,639 --> 01:51:01,040
um
2911
01:51:01,199 --> 01:51:04,320
set the activation as relu
2912
01:51:04,320 --> 01:51:06,719
and at first we have to specify the
2913
01:51:06,719 --> 01:51:07,920
input shape
2914
01:51:07,920 --> 01:51:11,119
so here we have 10 comma
2915
01:51:11,119 --> 01:51:13,440
all right
2916
01:51:16,000 --> 01:51:18,239
all right so that's our first layer now
2917
01:51:18,239 --> 01:51:19,600
our next layer i'm just going to have
2918
01:51:19,600 --> 01:51:20,960
another
2919
01:51:20,960 --> 01:51:24,719
a dense layer of 32 units all using relu
2920
01:51:24,719 --> 01:51:25,840
and
2921
01:51:25,840 --> 01:51:28,800
that's it so for the final layer this is
2922
01:51:28,800 --> 01:51:31,280
just going to be my output layer
2923
01:51:31,280 --> 01:51:33,679
it's going to just be one node
2924
01:51:33,679 --> 01:51:36,239
and the activation is going to be
2925
01:51:36,239 --> 01:51:37,599
sigmoid
2926
01:51:37,599 --> 01:51:38,639
so
2927
01:51:38,639 --> 01:51:40,719
if you recall from our logistic
2928
01:51:40,719 --> 01:51:42,719
regression what happened there was when
2929
01:51:42,719 --> 01:51:44,560
we had a sigmoid it looks something like
2930
01:51:44,560 --> 01:51:45,440
this
2931
01:51:45,440 --> 01:51:47,599
right so by creating a sigmoid
2932
01:51:47,599 --> 01:51:49,679
activation on our last layer we're
2933
01:51:49,679 --> 01:51:52,400
essentially projecting our predictions
2934
01:51:52,400 --> 01:51:54,639
to be between zero and one
2935
01:51:54,639 --> 01:51:56,159
just like in logistic
2936
01:51:56,159 --> 01:51:57,360
regression
2937
01:51:57,360 --> 01:51:59,199
and that's going to help us
2938
01:51:59,199 --> 01:52:01,199
you know we can just round to zero or
2939
01:52:01,199 --> 01:52:05,040
one and classify that way
2940
01:52:05,040 --> 01:52:07,840
so this is my neural net model and i'm
2941
01:52:07,840 --> 01:52:09,679
going to
2942
01:52:09,679 --> 01:52:12,239
compile this so in tensorflow we have to
2943
01:52:12,239 --> 01:52:13,760
compile it
2944
01:52:13,760 --> 01:52:15,440
it's really cool because i can just
2945
01:52:15,440 --> 01:52:17,440
literally pass in what type of optimizer
2946
01:52:17,440 --> 01:52:19,679
i want and it'll do it
2947
01:52:19,679 --> 01:52:22,639
um so here if i go to optimizers i'm
2948
01:52:22,639 --> 01:52:24,560
actually going to use adam
2949
01:52:24,560 --> 01:52:26,239
and you'll see that you know the
2950
01:52:26,239 --> 01:52:27,719
learning rate is
2951
01:52:27,719 --> 01:52:30,560
0.001 so i'm just going to use that
2952
01:52:30,560 --> 01:52:33,280
default so 0.001
2953
01:52:33,280 --> 01:52:37,199
and my loss is going to be
2954
01:52:37,280 --> 01:52:39,199
binary cross
2955
01:52:39,199 --> 01:52:41,679
entropy
2956
01:52:41,679 --> 01:52:42,480
and
2957
01:52:42,480 --> 01:52:44,719
the metrics that i'm also going to
2958
01:52:44,719 --> 01:52:46,480
include on here so it already will
2959
01:52:46,480 --> 01:52:48,480
consider loss but i'm i'm also going to
2960
01:52:48,480 --> 01:52:50,960
tack on accuracy so we can actually see
2961
01:52:50,960 --> 01:52:53,840
that in a plot later on
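Putting the definition and the compile step together gives something like this, with the adam optimizer at its 0.001 default learning rate, binary cross-entropy loss, and accuracy tacked on as an extra metric:

```python
import tensorflow as tf

nn_model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),  # equivalent to input_shape=(10,) in the video
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# adam with the default 0.001 learning rate; accuracy is tracked alongside
# the binary cross-entropy loss so it can be plotted after training
nn_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                 loss="binary_crossentropy",
                 metrics=["accuracy"])
```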
2962
01:52:53,840 --> 01:52:56,480
all right so i'm going to run this
2963
01:52:56,480 --> 01:52:58,960
um
2964
01:52:58,960 --> 01:53:00,400
and
2965
01:53:00,400 --> 01:53:02,400
one thing that i'm going to also do is
2966
01:53:02,400 --> 01:53:04,159
i'm going to define these plot
2967
01:53:04,159 --> 01:53:05,760
definitions so i'm actually copying
2968
01:53:05,760 --> 01:53:06,880
pasting this
2969
01:53:06,880 --> 01:53:09,119
i got these from tensorflow so if you go
2970
01:53:09,119 --> 01:53:10,960
on to some tensorflow tutorial they
2971
01:53:10,960 --> 01:53:13,199
actually have these this like
2972
01:53:13,199 --> 01:53:14,320
defined
2973
01:53:14,320 --> 01:53:16,159
uh and that's exactly what i'm doing
2974
01:53:16,159 --> 01:53:17,440
here so i'm actually going to move this
2975
01:53:17,440 --> 01:53:19,119
cell up
2976
01:53:19,119 --> 01:53:21,119
run that so we're basically plotting the
2977
01:53:21,119 --> 01:53:23,040
loss over all the different epochs
2978
01:53:23,040 --> 01:53:25,280
epochs means like training cycles and
2979
01:53:25,280 --> 01:53:26,800
we're going to plot the accuracy over
2980
01:53:26,800 --> 01:53:28,880
all the epochs
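The two plotting helpers mentioned here, adapted from the tensorflow tutorials, look roughly like this; they take the history object that fit returns:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; notebooks display inline instead
import matplotlib.pyplot as plt

def plot_loss(history):
    # loss over all the epochs, for both the training and validation splits
    plt.figure()
    plt.plot(history.history["loss"], label="loss")
    plt.plot(history.history["val_loss"], label="val_loss")
    plt.xlabel("Epoch")
    plt.ylabel("Binary crossentropy")
    plt.legend()
    plt.grid(True)

def plot_accuracy(history):
    # accuracy over all the epochs, training and validation
    plt.figure()
    plt.plot(history.history["accuracy"], label="accuracy")
    plt.plot(history.history["val_accuracy"], label="val_accuracy")
    plt.xlabel("Epoch")
    plt.ylabel("Accuracy")
    plt.legend()
    plt.grid(True)
```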
2981
01:53:28,880 --> 01:53:31,280
all right so we have our model
2982
01:53:31,280 --> 01:53:32,480
and now
2983
01:53:32,480 --> 01:53:37,119
all that's left is let's train it okay
2984
01:53:37,119 --> 01:53:39,520
so i'm going to say history so
2985
01:53:39,520 --> 01:53:41,360
tensorflow is great because it keeps
2986
01:53:41,360 --> 01:53:43,280
track of the history of the training
2987
01:53:43,280 --> 01:53:45,199
which is why we can go and plot it later
2988
01:53:45,199 --> 01:53:46,239
on
2989
01:53:46,239 --> 01:53:47,840
now i'm going to set that equal to this
2990
01:53:47,840 --> 01:53:49,679
neural net model
2991
01:53:49,679 --> 01:53:51,440
and fit that
2992
01:53:51,440 --> 01:53:53,280
with x train
2993
01:53:53,280 --> 01:53:55,280
y train
2994
01:53:55,280 --> 01:53:57,920
uh i'm going to
2995
01:53:57,920 --> 01:53:59,840
make the number of epochs equal to let's
2996
01:53:59,840 --> 01:54:02,800
say just let's just use 100 for now
2997
01:54:02,800 --> 01:54:05,040
and the batch size i'm going to set
2998
01:54:05,040 --> 01:54:09,040
equal to let's say 32.
2999
01:54:09,040 --> 01:54:11,199
all right um
3000
01:54:11,199 --> 01:54:14,560
and the validation split
3001
01:54:14,800 --> 01:54:17,520
so what the validation split does if
3002
01:54:17,520 --> 01:54:20,800
it's down here somewhere
3003
01:54:20,800 --> 01:54:22,400
okay so yeah this validation split is
3004
01:54:22,400 --> 01:54:23,920
just the fraction the training data to
3005
01:54:23,920 --> 01:54:25,920
be used as validation data
3006
01:54:25,920 --> 01:54:28,800
so essentially every single epoch what's
3007
01:54:28,800 --> 01:54:30,239
going on is
3008
01:54:30,239 --> 01:54:33,440
uh tensorflow saying leave certain if if
3009
01:54:33,440 --> 01:54:35,440
this is 0.2 then leave 20 percent
3010
01:54:35,440 --> 01:54:37,119
out and we're going to test how the
3011
01:54:37,119 --> 01:54:39,280
model performs on that 20 percent that we've
3012
01:54:39,280 --> 01:54:40,320
left out
3013
01:54:40,320 --> 01:54:41,679
okay so it's basically like our
3014
01:54:41,679 --> 01:54:44,000
validation data set but
3015
01:54:44,000 --> 01:54:45,599
um tensorflow does it on our training
3016
01:54:45,599 --> 01:54:47,520
data set during the training so we have
3017
01:54:47,520 --> 01:54:49,520
now a measure outside of just our
3018
01:54:49,520 --> 01:54:51,840
validation data set to see you know
3019
01:54:51,840 --> 01:54:53,520
what's going on
3020
01:54:53,520 --> 01:54:54,960
so validation split i'm going to make
3021
01:54:54,960 --> 01:54:57,520
that 0.2
3022
01:54:57,520 --> 01:55:02,239
and we can run this so if i run that
3023
01:55:03,199 --> 01:55:07,920
all right and i'm actually going to
3024
01:55:08,560 --> 01:55:10,080
set verbose
3025
01:55:10,080 --> 01:55:11,840
equal to zero which means okay don't
3026
01:55:11,840 --> 01:55:13,360
print anything because printing
3027
01:55:13,360 --> 01:55:14,880
something for 100 epochs might get kind
3028
01:55:14,880 --> 01:55:16,320
of annoying
3029
01:55:16,320 --> 01:55:18,480
so i'm just gonna let it run
3030
01:55:18,480 --> 01:55:20,560
let it train and then we'll see what
3031
01:55:20,560 --> 01:55:23,040
happens
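The full training call being described, fit with 100 epochs, batch size 32, a 0.2 validation split, and verbose=0, can be sketched end to end; synthetic data stands in for the video's dataset, and the epoch count is trimmed so the sketch runs quickly:

```python
import numpy as np
import tensorflow as tf

# synthetic stand-in for the scaled training data in the video
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 10)).astype("float32")
y_train = (X_train[:, 0] > 0).astype("float32")

nn_model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
nn_model.compile(optimizer=tf.keras.optimizers.Adam(0.001),
                 loss="binary_crossentropy", metrics=["accuracy"])

# train quietly, holding out 20% of the training data as validation each epoch
# (the video uses epochs=100; 10 keeps this sketch fast)
history = nn_model.fit(X_train, y_train,
                       epochs=10, batch_size=32,
                       validation_split=0.2, verbose=0)
print(sorted(history.history.keys()))
```

The history object records loss and accuracy for both splits per epoch, which is what the plotting helpers consume.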
3032
01:55:27,920 --> 01:55:30,159
cool so it finished training and now
3033
01:55:30,159 --> 01:55:31,760
what i can do is because you know i've
3034
01:55:31,760 --> 01:55:34,480
already defined these two functions
3035
01:55:34,480 --> 01:55:36,880
i can go ahead and i can plot the loss
3036
01:55:36,880 --> 01:55:38,000
oops
3037
01:55:38,000 --> 01:55:41,040
plot loss of that history
3038
01:55:41,040 --> 01:55:43,520
and i can also plot the accuracy
3039
01:55:43,520 --> 01:55:44,560
throughout
3040
01:55:44,560 --> 01:55:47,520
the training
3041
01:55:47,520 --> 01:55:48,480
so
3042
01:55:48,480 --> 01:55:51,119
this is a little bit ish what we're
3043
01:55:51,119 --> 01:55:53,599
looking for we definitely are looking
3044
01:55:53,599 --> 01:55:56,239
for a steadily decreasing loss
3045
01:55:56,239 --> 01:55:59,520
and an increasing accuracy so here we do
3046
01:55:59,520 --> 01:56:01,520
see that you know our validation
3047
01:56:01,520 --> 01:56:03,760
accuracy improves from
3048
01:56:03,760 --> 01:56:06,639
uh around point seven
3049
01:56:06,639 --> 01:56:08,560
seven or something all the way up to
3050
01:56:08,560 --> 01:56:12,000
somewhere around point maybe eight one
3051
01:56:12,000 --> 01:56:13,760
and our loss is decreasing so this is
3052
01:56:13,760 --> 01:56:14,560
good
3053
01:56:14,560 --> 01:56:17,040
it is expected that the validation loss
3054
01:56:17,040 --> 01:56:20,000
and accuracy are performing worse than
3055
01:56:20,000 --> 01:56:22,400
the training loss or accuracy and that's
3056
01:56:22,400 --> 01:56:24,560
because our model is training on that
3057
01:56:24,560 --> 01:56:26,800
data so it's adapting to that data
3058
01:56:26,800 --> 01:56:28,719
whereas the validation stuff is you know
3059
01:56:28,719 --> 01:56:31,520
stuff that it hasn't seen yet so
3060
01:56:31,520 --> 01:56:33,360
so that's why
3061
01:56:33,360 --> 01:56:35,679
so in machine learning as we saw above
3062
01:56:35,679 --> 01:56:36,800
we could change a bunch of the
3063
01:56:36,800 --> 01:56:38,159
parameters right like i could change
3064
01:56:38,159 --> 01:56:41,119
this to 64. so now it'd be a row of 64
3065
01:56:41,119 --> 01:56:44,800
nodes and then 32 and then one
3066
01:56:44,800 --> 01:56:47,599
so i can change some of these parameters
3067
01:56:47,599 --> 01:56:48,880
and
3068
01:56:48,880 --> 01:56:50,239
a lot of machine learning is trying to
3069
01:56:50,239 --> 01:56:52,080
find hey what do we set these hyper
3070
01:56:52,080 --> 01:56:54,320
parameters to
3071
01:56:54,320 --> 01:56:57,679
so what i'm actually going to do is i'm
3072
01:56:57,679 --> 01:57:00,639
going to rewrite this so that
3073
01:57:00,639 --> 01:57:02,480
we can do something what's known as a
3074
01:57:02,480 --> 01:57:04,719
grid search so we can search through an
3075
01:57:04,719 --> 01:57:07,599
entire space of hey what happens if
3076
01:57:07,599 --> 01:57:08,880
you know we
3077
01:57:08,880 --> 01:57:13,360
have 64 nodes and 64 nodes or 16 nodes
3078
01:57:13,360 --> 01:57:14,880
and 16 nodes
3079
01:57:14,880 --> 01:57:17,520
and so on
3080
01:57:17,520 --> 01:57:19,440
um and then on top of all that we can
3081
01:57:19,440 --> 01:57:21,440
you know we can change
3082
01:57:21,440 --> 01:57:22,960
this uh
3083
01:57:22,960 --> 01:57:25,119
learning rate we can change how many
3084
01:57:25,119 --> 01:57:27,360
epochs we can change
3085
01:57:27,360 --> 01:57:29,520
you know the batch size all these things
3086
01:57:29,520 --> 01:57:31,599
might affect our training
3087
01:57:31,599 --> 01:57:33,119
and
3088
01:57:33,119 --> 01:57:34,880
just for kicks i'm also going to add
3089
01:57:34,880 --> 01:57:40,480
what's known as a dropout layer in here
3090
01:57:41,360 --> 01:57:43,599
and what dropout is doing is saying hey
3091
01:57:43,599 --> 01:57:46,320
randomly choose
3092
01:57:46,320 --> 01:57:49,199
at this rate certain nodes
3093
01:57:49,199 --> 01:57:51,119
and don't train them
3094
01:57:51,119 --> 01:57:53,440
in you know a certain iteration so this
3095
01:57:53,440 --> 01:57:56,000
helps prevent overfitting
3096
01:57:56,000 --> 01:57:56,960
okay
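For anyone following along, the dropout idea being described can be sketched framework-free with NumPy — `dropout_forward` here is a hypothetical helper for illustration, not the actual Keras `Dropout` layer:

```python
import numpy as np

def dropout_forward(activations, drop_prob, rng):
    # During one training iteration, each node is switched off (set to
    # zero) with probability drop_prob; the surviving nodes are scaled
    # by 1 / (1 - drop_prob) so the expected activation is unchanged
    # ("inverted dropout", which is what Keras does at train time).
    mask = rng.random(activations.shape) >= drop_prob
    return activations * mask / (1.0 - drop_prob)

rng = np.random.default_rng(0)
acts = np.ones((4, 32))            # a batch of 4 rows, 32 nodes each
dropped = dropout_forward(acts, 0.2, rng)
```

Because a different random subset of nodes is zeroed each iteration, no single node can be relied on too heavily, which is how dropout helps prevent overfitting.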
3097
01:57:56,960 --> 01:57:58,000
so
3098
01:57:58,000 --> 01:57:59,679
i'm actually going to
3099
01:57:59,679 --> 01:58:01,760
define this
3100
01:58:01,760 --> 01:58:04,080
as a function called train model we're
3101
01:58:04,080 --> 01:58:07,360
going to pass an x train y train
3102
01:58:07,360 --> 01:58:09,679
um the number of
3103
01:58:09,679 --> 01:58:11,360
nodes
3104
01:58:11,360 --> 01:58:13,840
uh the drop out
3105
01:58:13,840 --> 01:58:15,119
you know the probability that we just
3106
01:58:15,119 --> 01:58:16,639
talked about
3107
01:58:16,639 --> 01:58:17,599
um
3108
01:58:17,599 --> 01:58:18,960
learning rate
3109
01:58:18,960 --> 01:58:21,119
so i'm actually going to say lr
3110
01:58:21,119 --> 01:58:23,599
batch size
3111
01:58:23,599 --> 01:58:25,440
and
3112
01:58:25,440 --> 01:58:27,119
we can also pass the number of epochs
3113
01:58:27,119 --> 01:58:30,480
right i mentioned that as a parameter
3114
01:58:30,480 --> 01:58:33,360
so indent this so it goes under here and
3115
01:58:33,360 --> 01:58:35,119
with these two i'm going to set this
3116
01:58:35,119 --> 01:58:38,560
equal to number of nodes
3117
01:58:38,560 --> 01:58:40,560
and now with the two dropout layers i'm
3118
01:58:40,560 --> 01:58:43,840
going to set dropout prob so now you
3119
01:58:43,840 --> 01:58:46,239
know the probability of
3120
01:58:46,239 --> 01:58:48,639
turning off a node during the training
3121
01:58:48,639 --> 01:58:50,639
is equal to dropout prob
3122
01:58:50,639 --> 01:58:52,400
um and i'm going to keep the output
3123
01:58:52,400 --> 01:58:54,080
layer the same
3124
01:58:54,080 --> 01:58:56,080
now i'm compiling it but this here is
3125
01:58:56,080 --> 01:58:58,080
now going to be my learning rate
3126
01:58:58,080 --> 01:58:59,840
and i still want binary cross entropy
3127
01:58:59,840 --> 01:59:02,560
and accuracy
3128
01:59:02,560 --> 01:59:05,520
we are actually going to train
3129
01:59:05,520 --> 01:59:08,080
our model inside of
3130
01:59:08,080 --> 01:59:09,199
this
3131
01:59:09,199 --> 01:59:10,639
function
3132
01:59:10,639 --> 01:59:13,920
but here we can do the epochs equals
3133
01:59:13,920 --> 01:59:16,320
epochs and this is equal to whatever you
3134
01:59:16,320 --> 01:59:18,159
know we're passing in
3135
01:59:18,159 --> 01:59:21,040
uh x train y train belong right here
3136
01:59:21,040 --> 01:59:22,960
okay so those are getting passed in as
3137
01:59:22,960 --> 01:59:23,920
well
3138
01:59:23,920 --> 01:59:26,080
and finally at the end i'm going to
3139
01:59:26,080 --> 01:59:29,679
return this model and the history of
3140
01:59:29,679 --> 01:59:32,239
that model
3141
01:59:32,639 --> 01:59:34,880
okay
3142
01:59:34,880 --> 01:59:37,199
so
3143
01:59:37,280 --> 01:59:40,080
now what i'll do
3144
01:59:40,320 --> 01:59:42,560
is let's just go through all of these so
3145
01:59:42,560 --> 01:59:45,440
let's say let's keep the epochs at 100.
3146
01:59:45,440 --> 01:59:47,280
and now what i can do is i can say hey
3147
01:59:47,280 --> 01:59:49,760
for a number of nodes in
3148
01:59:49,760 --> 01:59:52,960
let's say let's do 16 32 and 64 to see
3149
01:59:52,960 --> 01:59:55,040
what happens
3150
01:59:55,040 --> 01:59:56,840
for the different dropout
3151
01:59:56,840 --> 01:59:59,360
probabilities in
3152
01:59:59,360 --> 02:00:01,679
i mean zero would be nothing let's use
3153
02:00:01,679 --> 02:00:05,440
0.2 also to see what happens
3154
02:00:05,440 --> 02:00:07,840
um
3155
02:00:07,920 --> 02:00:10,719
you know for the learning rate in
3156
02:00:10,719 --> 02:00:11,880
uh
3157
02:00:11,880 --> 02:00:13,679
0.005
3158
02:00:13,679 --> 02:00:16,480
0.001
3159
02:00:16,480 --> 02:00:18,080
and you know maybe we want to throw on
3160
02:00:18,080 --> 02:00:22,080
0.1 in there as well
3161
02:00:22,239 --> 02:00:26,159
and then for the batch size uh let's do
3162
02:00:26,159 --> 02:00:29,520
16 32 64 as well actually and let's also
3163
02:00:29,520 --> 02:00:31,920
throw in 128. actually let's get rid of
3164
02:00:31,920 --> 02:00:33,520
16. sorry
3165
02:00:33,520 --> 02:00:36,880
let's throw 128 in there
3166
02:00:37,199 --> 02:00:39,360
that should be zero one
3167
02:00:39,360 --> 02:00:41,920
i'm going to
3168
02:00:41,920 --> 02:00:44,800
record the model in history using this
3169
02:00:44,800 --> 02:00:47,599
train model here
3170
02:00:47,599 --> 02:00:48,560
so
3171
02:00:48,560 --> 02:00:51,679
we're going to do x train y
3172
02:00:51,679 --> 02:00:52,639
train
3173
02:00:52,639 --> 02:00:54,719
the number of nodes is going to be you
3174
02:00:54,719 --> 02:00:55,920
know the number of nodes that we've
3175
02:00:55,920 --> 02:00:57,520
defined here
3176
02:00:57,520 --> 02:00:59,040
dropout
3177
02:00:59,040 --> 02:01:01,040
prob lr
3178
02:01:01,040 --> 02:01:02,960
batch size
3179
02:01:02,960 --> 02:01:05,119
and epochs okay
3180
02:01:05,119 --> 02:01:07,040
and then now we have both the model and
3181
02:01:07,040 --> 02:01:10,000
the history and what i'm going to do is
3182
02:01:10,000 --> 02:01:12,800
again i want to plot
3183
02:01:12,800 --> 02:01:14,800
the loss
3184
02:01:14,800 --> 02:01:17,119
for the history i'm also going to plot
3185
02:01:17,119 --> 02:01:19,760
the accuracy
3186
02:01:19,760 --> 02:01:20,960
probably should have done them side by
3187
02:01:20,960 --> 02:01:22,080
side that probably would have been
3188
02:01:22,080 --> 02:01:24,480
easier
3189
02:01:26,159 --> 02:01:27,520
okay so
3190
02:01:27,520 --> 02:01:30,159
what i'm going to do is
3191
02:01:30,159 --> 02:01:33,280
split this up
3192
02:01:33,280 --> 02:01:35,520
and that will be
3193
02:01:35,520 --> 02:01:38,320
subplots so now this is just saying okay
3194
02:01:38,320 --> 02:01:40,400
i want one row and two columns in that
3195
02:01:40,400 --> 02:01:43,040
row for my plots
3196
02:01:43,040 --> 02:01:43,920
okay
3197
02:01:43,920 --> 02:01:45,040
so
3198
02:01:45,040 --> 02:01:49,679
i'm going to plot on my axis one
3199
02:01:49,679 --> 02:01:52,400
the loss
3200
02:01:54,719 --> 02:01:57,199
i don't actually know if this is gonna work
3201
02:01:57,199 --> 02:01:59,040
okay we don't care about the grid uh
3202
02:01:59,040 --> 02:02:01,119
yeah let's keep the grid
3203
02:02:01,119 --> 02:02:05,320
and then now on my other
3204
02:02:09,119 --> 02:02:12,320
so now on here i'm going to plot all the
3205
02:02:12,320 --> 02:02:17,320
accuracies on the second plot
3206
02:02:20,080 --> 02:02:23,760
i might have to debug this a bit
3207
02:02:24,239 --> 02:02:26,639
but we should be able to get rid of that
3208
02:02:26,639 --> 02:02:29,280
if we run this we already have history
3209
02:02:29,280 --> 02:02:33,119
saved as a variable in here so if i just
3210
02:02:33,119 --> 02:02:36,159
run it on this okay it has no
3211
02:02:36,159 --> 02:02:38,880
attribute xlabel
3212
02:02:38,880 --> 02:02:41,040
oh i think it's because it's like set x
3213
02:02:41,040 --> 02:02:44,080
label or something
3214
02:02:45,679 --> 02:02:48,560
okay yeah so it's set_xlabel instead of
3215
02:02:48,560 --> 02:02:50,960
just xlabel and likewise for the y label
3216
02:02:50,960 --> 02:02:52,960
so let's see if that works
3217
02:02:52,960 --> 02:02:55,040
all right cool
3218
02:02:55,040 --> 02:02:57,280
um and let's actually make this a bit
3219
02:02:57,280 --> 02:02:58,800
larger
3220
02:02:58,800 --> 02:03:00,239
okay so we can actually change the
3221
02:03:00,239 --> 02:03:02,320
figure size and i'm gonna set
3222
02:03:02,320 --> 02:03:04,960
let's see what happens if i set that to
3223
02:03:04,960 --> 02:03:09,119
oh that's not the way i wanted it um
3224
02:03:09,920 --> 02:03:14,119
okay so that looks reasonable
3225
02:03:15,360 --> 02:03:16,719
and that's just going to be my plot
3226
02:03:16,719 --> 02:03:18,560
history function so now i can plot them
3227
02:03:18,560 --> 02:03:20,800
side by side
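A minimal sketch of the side-by-side plot-history function being built here, with a plain dict of metric lists standing in for the Keras `History.history` object (note `set_xlabel`/`set_ylabel` on an Axes, not `plt.xlabel` — the very fix made in the video):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

def plot_history(history_dict):
    # One row, two columns: loss on the left, accuracy on the right.
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(history_dict["loss"], label="loss")
    ax1.plot(history_dict["val_loss"], label="val_loss")
    ax1.set_xlabel("Epoch")                   # Axes objects use set_xlabel,
    ax1.set_ylabel("Binary cross-entropy")    # not the plt.xlabel shortcut
    ax1.legend(); ax1.grid(True)
    ax2.plot(history_dict["accuracy"], label="accuracy")
    ax2.plot(history_dict["val_accuracy"], label="val_accuracy")
    ax2.set_xlabel("Epoch"); ax2.set_ylabel("Accuracy")
    ax2.legend(); ax2.grid(True)
    return fig

fig = plot_history({"loss": [0.6, 0.4], "val_loss": [0.65, 0.5],
                    "accuracy": [0.7, 0.8], "val_accuracy": [0.68, 0.75]})
```

In the actual notebook you would pass `history.history` from `model.fit` instead of the made-up dict above.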
3228
02:03:20,800 --> 02:03:24,320
here i'm going to plot the history
3229
02:03:24,320 --> 02:03:27,760
and what i'm actually going to do is i
3230
02:03:27,760 --> 02:03:29,760
so here first i'm going to print out all
3231
02:03:29,760 --> 02:03:31,199
these parameters so i'm going to print
3232
02:03:31,199 --> 02:03:33,199
out
3233
02:03:33,199 --> 02:03:35,599
use the f string to print out uh all of
3234
02:03:35,599 --> 02:03:37,360
this stuff
3235
02:03:37,360 --> 02:03:39,679
so here i'm printing out how many nodes
3236
02:03:39,679 --> 02:03:41,199
um
3237
02:03:41,199 --> 02:03:44,560
the dropout probability
3238
02:03:45,440 --> 02:03:48,719
uh the learning rate
3239
02:03:55,040 --> 02:03:56,239
and we already know how many epochs so
3240
02:03:56,239 --> 02:03:59,440
i'm not even gonna bother with that
3241
02:03:59,760 --> 02:04:00,560
so
3242
02:04:00,560 --> 02:04:02,800
once we plot
3243
02:04:02,800 --> 02:04:03,760
this
3244
02:04:03,760 --> 02:04:06,960
uh let's actually also
3245
02:04:06,960 --> 02:04:09,280
figure out what the um
3246
02:04:09,280 --> 02:04:11,520
what the validation loss is on our
3247
02:04:11,520 --> 02:04:13,840
validation set that we have
3248
02:04:13,840 --> 02:04:16,639
that we created all the way back up here
3249
02:04:16,639 --> 02:04:18,080
all right so remember we created three
3250
02:04:18,080 --> 02:04:19,679
data sets
3251
02:04:19,679 --> 02:04:23,040
let's call our model and evaluate
3252
02:04:23,040 --> 02:04:25,599
what the
3253
02:04:25,599 --> 02:04:26,880
validation
3254
02:04:26,880 --> 02:04:29,360
data set's loss
3255
02:04:29,360 --> 02:04:30,719
would be
3256
02:04:30,719 --> 02:04:33,440
and i actually want to record
3257
02:04:33,440 --> 02:04:35,119
let's say i want to record
3258
02:04:35,119 --> 02:04:37,280
whatever model has the least validation
3259
02:04:37,280 --> 02:04:40,080
loss so
3260
02:04:40,560 --> 02:04:42,400
first i'm going to initialize that to
3261
02:04:42,400 --> 02:04:44,480
infinity so that you know any model will
3262
02:04:44,480 --> 02:04:46,159
beat that score
3263
02:04:46,159 --> 02:04:49,119
so if i do float infinity that will set
3264
02:04:49,119 --> 02:04:50,840
that to infinity
3265
02:04:50,840 --> 02:04:54,560
and um maybe i'll keep track of the
3266
02:04:54,560 --> 02:04:56,480
parameters actually it doesn't really
3267
02:04:56,480 --> 02:04:57,599
matter
3268
02:04:57,599 --> 02:04:58,639
i'm just going to keep track of the
3269
02:04:58,639 --> 02:05:00,080
model
3270
02:05:00,080 --> 02:05:02,239
and i'm going to set that to none
3271
02:05:02,239 --> 02:05:03,920
so now down here
3272
02:05:03,920 --> 02:05:06,960
if the validation loss is ever less than
3273
02:05:06,960 --> 02:05:10,159
the least validation loss
3274
02:05:10,159 --> 02:05:12,960
then i am going to simply come down here
3275
02:05:12,960 --> 02:05:14,800
and say hey
3276
02:05:14,800 --> 02:05:16,560
this validation
3277
02:05:16,560 --> 02:05:18,800
or this least validation loss is now
3278
02:05:18,800 --> 02:05:21,520
equal to the validation loss
3279
02:05:21,520 --> 02:05:22,400
and
3280
02:05:22,400 --> 02:05:23,599
the least
3281
02:05:23,599 --> 02:05:25,040
loss model
3282
02:05:25,040 --> 02:05:27,599
is whatever this model
3283
02:05:27,599 --> 02:05:30,159
is that just earned that validation loss
3284
02:05:30,159 --> 02:05:31,760
okay
3285
02:05:31,760 --> 02:05:33,360
so
3286
02:05:33,360 --> 02:05:35,599
we are actually just going to let this
3287
02:05:35,599 --> 02:05:37,199
run
3288
02:05:37,199 --> 02:05:39,440
um for a while and then we're going to
3289
02:05:39,440 --> 02:05:40,880
get our least
3290
02:05:40,880 --> 02:05:44,320
loss model after that
3291
02:05:44,320 --> 02:05:46,480
so
3292
02:05:47,119 --> 02:05:49,840
let's just run
3293
02:05:50,000 --> 02:05:54,440
all right and now we wait
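The grid-search-and-bookkeeping logic described above can be sketched like this — `train_model_stub` is a hypothetical stand-in for the real Keras `train_model` from the video, returning a made-up validation loss so that only the loop structure and least-loss tracking are shown:

```python
import itertools

def train_model_stub(num_nodes, dropout_prob, lr, batch_size):
    # Stand-in for the real training function; it returns a fake
    # validation loss so the search logic itself can be followed
    # without actually training a network.
    return abs(lr - 0.001) + dropout_prob + num_nodes / 1000

least_val_loss = float("inf")   # any real model beats infinity
least_loss_params = None

# Same hyperparameter grid as in the video: nodes, dropout
# probability, learning rate, and batch size.
for num_nodes, dropout_prob, lr, batch_size in itertools.product(
        [16, 32, 64], [0.0, 0.2], [0.005, 0.001, 0.1], [32, 64, 128]):
    val_loss = train_model_stub(num_nodes, dropout_prob, lr, batch_size)
    if val_loss < least_val_loss:
        least_val_loss = val_loss
        least_loss_params = (num_nodes, dropout_prob, lr, batch_size)
```

In the real version, each iteration would call `train_model`, evaluate the returned model on the validation set, and keep the model (not just the parameters) with the lowest validation loss.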
3294
02:06:06,000 --> 02:06:08,159
all right so we have finally finished
3295
02:06:08,159 --> 02:06:09,440
training
3296
02:06:09,440 --> 02:06:11,520
and you'll notice that okay down here
3297
02:06:11,520 --> 02:06:14,480
the loss actually gets to like 0.29
3298
02:06:14,480 --> 02:06:16,800
the accuracy is around 88% which is
3299
02:06:16,800 --> 02:06:18,000
pretty good
3300
02:06:18,000 --> 02:06:19,920
so you might be wondering okay why is
3301
02:06:19,920 --> 02:06:22,239
this accuracy different from this one
3302
02:06:22,239 --> 02:06:24,560
like aren't these both the validation
3303
02:06:24,560 --> 02:06:26,400
so this accuracy here is on the
3304
02:06:26,400 --> 02:06:28,000
validation data set that we've defined
3305
02:06:28,000 --> 02:06:30,239
at the beginning right and this one here
3306
02:06:30,239 --> 02:06:32,800
this is actually taking 20% of
3307
02:06:32,800 --> 02:06:34,880
our training set every time during the
3308
02:06:34,880 --> 02:06:37,280
training and saying okay how much of it
3309
02:06:37,280 --> 02:06:39,199
do i get right now
3310
02:06:39,199 --> 02:06:40,800
you know after this one step where i
3311
02:06:40,800 --> 02:06:43,040
didn't train with any of that
3312
02:06:43,040 --> 02:06:45,440
so they're slightly different and
3313
02:06:45,440 --> 02:06:47,119
actually i realized later on that i
3314
02:06:47,119 --> 02:06:48,560
probably you know probably what i should
3315
02:06:48,560 --> 02:06:49,599
have done
3316
02:06:49,599 --> 02:06:54,560
is over here when we were defining
3317
02:06:54,560 --> 02:06:57,440
the model fit instead of the validation
3318
02:06:57,440 --> 02:07:00,400
split you can define the validation data
3319
02:07:00,400 --> 02:07:02,800
and you can pass in the validation data
3320
02:07:02,800 --> 02:07:03,920
i don't know if this is the proper
3321
02:07:03,920 --> 02:07:05,360
syntax but
3322
02:07:05,360 --> 02:07:07,199
that's probably what i should have done
3323
02:07:07,199 --> 02:07:08,880
but instead you know we'll just stick
3324
02:07:08,880 --> 02:07:11,440
with what we have here
3325
02:07:11,440 --> 02:07:13,679
so you'll see at the end you know with
3326
02:07:13,679 --> 02:07:16,400
the 64 nodes it seems like this is our
3327
02:07:16,400 --> 02:07:18,719
best performance 64 nodes with a dropout
3328
02:07:18,719 --> 02:07:21,560
of 0.2 a learning rate of
3329
02:07:21,560 --> 02:07:25,440
0.001 and a batch size of 64.
3330
02:07:25,440 --> 02:07:28,480
and it does seem like yes the validation
3331
02:07:28,480 --> 02:07:30,719
you know the fake validation but the
3332
02:07:30,719 --> 02:07:33,920
validation um
3333
02:07:33,920 --> 02:07:36,639
loss is decreasing and then the accuracy
3334
02:07:36,639 --> 02:07:40,159
is increasing which is a good sign okay
3335
02:07:40,159 --> 02:07:41,520
so finally
3336
02:07:41,520 --> 02:07:43,119
what i'm going to do is i'm actually
3337
02:07:43,119 --> 02:07:44,719
just going to predict so i'm going to
3338
02:07:44,719 --> 02:07:45,599
take
3339
02:07:45,599 --> 02:07:48,320
this model which we've called our least
3340
02:07:48,320 --> 02:07:50,079
loss model
3341
02:07:50,079 --> 02:07:51,599
i'm going to take this model and i'm
3342
02:07:51,599 --> 02:07:53,280
going to predict
3343
02:07:53,280 --> 02:07:54,560
x test
3344
02:07:54,560 --> 02:07:56,159
on that
3345
02:07:56,159 --> 02:07:58,079
and you'll see that it gives me some
3346
02:07:58,079 --> 02:07:59,440
values that are really close to zero and
3347
02:07:59,440 --> 02:08:00,960
some that are really close to one and
3348
02:08:00,960 --> 02:08:03,040
that's because we have a sigmoid output
3349
02:08:03,040 --> 02:08:04,000
so
3350
02:08:04,000 --> 02:08:05,360
if i
3351
02:08:05,360 --> 02:08:07,840
do this
3352
02:08:07,840 --> 02:08:10,480
what i can do is i can cast them
3353
02:08:10,480 --> 02:08:12,320
so i'm going to say anything that's
3354
02:08:12,320 --> 02:08:15,679
greater than 0.5
3355
02:08:15,679 --> 02:08:18,159
set that to 1. so if i
3356
02:08:18,159 --> 02:08:20,159
actually i think what happens if i do
3357
02:08:20,159 --> 02:08:22,719
this
3358
02:08:22,719 --> 02:08:25,920
oh okay so i have to cast that as type
3359
02:08:25,920 --> 02:08:28,480
and so now you'll see that it's ones and
3360
02:08:28,480 --> 02:08:29,520
zeros
3361
02:08:29,520 --> 02:08:31,920
and i'm actually going to transform this
3362
02:08:31,920 --> 02:08:34,400
into a column as well
3363
02:08:34,400 --> 02:08:37,840
so here i'm going to
3364
02:08:38,800 --> 02:08:41,679
oh oops uh i didn't mean to do that okay
3365
02:08:41,679 --> 02:08:45,280
no i wanted to just reshape it to
3366
02:08:45,280 --> 02:08:46,590
that so now
3368
02:08:47,679 --> 02:08:50,480
it's one dimensional okay
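The thresholding and reshaping step just described can be sketched with NumPy alone — the probabilities below are made up for illustration; in the video they come from `least_loss_model.predict(X_test)`:

```python
import numpy as np

# Hypothetical sigmoid outputs: one probability per test sample,
# in a column just like model.predict returns them.
y_prob = np.array([[0.03], [0.91], [0.47], [0.66]])

# Anything greater than 0.5 becomes class 1, the rest class 0;
# .astype(int) does the cast and .reshape(-1) flattens the column
# into the one-dimensional array the classification report expects.
y_pred = (y_prob > 0.5).astype(int).reshape(-1)
```

The resulting `y_pred` can then be passed, together with the true labels, to scikit-learn's `classification_report`.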
3369
02:08:50,480 --> 02:08:51,599
and
3370
02:08:51,599 --> 02:08:53,840
using that we can actually
3371
02:08:53,840 --> 02:08:56,719
just rerun the classification report
3372
02:08:56,719 --> 02:08:59,679
based on this neural net output
3373
02:08:59,679 --> 02:09:02,400
and you'll see that okay the f1 or
3374
02:09:02,400 --> 02:09:04,480
the accuracy gives us 87%
3375
02:09:04,480 --> 02:09:06,800
so it seems like what happened here is
3376
02:09:06,800 --> 02:09:08,159
the precision
3377
02:09:08,159 --> 02:09:11,840
on uh class 0 so the hadrons has
3378
02:09:11,840 --> 02:09:14,159
increased a bit but the recall decreased
3379
02:09:14,159 --> 02:09:17,199
but the f1 score is still at a good 0.81
3380
02:09:17,199 --> 02:09:19,119
and um
3381
02:09:19,119 --> 02:09:20,719
for the other class it looked like the
3382
02:09:20,719 --> 02:09:22,400
precision decreased a bit the recall
3383
02:09:22,400 --> 02:09:24,960
increased for an overall f1 score
3384
02:09:24,960 --> 02:09:28,000
that's also been increased
3385
02:09:28,000 --> 02:09:30,320
i think i interpreted that properly i
3386
02:09:30,320 --> 02:09:31,920
mean we went through all this work and
3387
02:09:31,920 --> 02:09:34,239
we got a model that performs actually
3388
02:09:34,239 --> 02:09:37,360
very very similarly to the svm model
3389
02:09:37,360 --> 02:09:39,520
that we had earlier
3390
02:09:39,520 --> 02:09:41,119
and the whole point of this exercise was
3391
02:09:41,119 --> 02:09:42,800
to demonstrate okay these are how you
3392
02:09:42,800 --> 02:09:45,040
can define your models but it's also to
3393
02:09:45,040 --> 02:09:47,040
say hey maybe
3394
02:09:47,040 --> 02:09:48,639
you know neural nets are very very
3395
02:09:48,639 --> 02:09:50,880
powerful as you can tell
3396
02:09:50,880 --> 02:09:53,520
but sometimes you know an svm or some
3397
02:09:53,520 --> 02:09:54,639
other model
3398
02:09:54,639 --> 02:09:57,679
might actually be more appropriate
3399
02:09:57,679 --> 02:09:59,199
but in this case i guess it didn't
3400
02:09:59,199 --> 02:10:00,800
really matter which one we used at the
3401
02:10:00,800 --> 02:10:02,639
end um
3402
02:10:02,639 --> 02:10:05,280
an 87 percent accuracy score is
3403
02:10:05,280 --> 02:10:07,119
still pretty good
3404
02:10:07,119 --> 02:10:11,760
so yeah let's now move on to regression
3405
02:10:11,760 --> 02:10:13,360
we just saw a bunch of different
3406
02:10:13,360 --> 02:10:15,679
classification models now let's shift
3407
02:10:15,679 --> 02:10:17,840
gears into regression the other type of
3408
02:10:17,840 --> 02:10:19,599
supervised learning
3409
02:10:19,599 --> 02:10:22,079
if we look at this plot over here we see
3410
02:10:22,079 --> 02:10:24,400
a bunch of scattered data points
3411
02:10:24,400 --> 02:10:27,119
and here we have our x
3412
02:10:27,119 --> 02:10:29,360
value for those data points and then we
3413
02:10:29,360 --> 02:10:32,079
have the corresponding y value which is
3414
02:10:32,079 --> 02:10:34,560
now our label
3415
02:10:34,560 --> 02:10:37,440
and when we look at this plot
3416
02:10:37,440 --> 02:10:40,000
well our goal in regression is to find
3417
02:10:40,000 --> 02:10:43,760
the line of best fit that best models
3418
02:10:43,760 --> 02:10:45,520
this data
3419
02:10:45,520 --> 02:10:47,440
essentially we're trying to let's say
3420
02:10:47,440 --> 02:10:50,239
we're given some new value of x that we
3421
02:10:50,239 --> 02:10:52,400
don't have in our sample we're trying to
3422
02:10:52,400 --> 02:10:56,239
say okay what would my prediction for y
3423
02:10:56,239 --> 02:10:57,119
be
3424
02:10:57,119 --> 02:10:59,599
for that given x value so that you know
3425
02:10:59,599 --> 02:11:03,119
might be somewhere around there
3426
02:11:03,119 --> 02:11:05,119
i don't know
3427
02:11:05,119 --> 02:11:07,360
but remember in regression that you know
3428
02:11:07,360 --> 02:11:08,719
given certain features we're trying to
3429
02:11:08,719 --> 02:11:11,520
predict some continuous numerical value
3430
02:11:11,520 --> 02:11:14,079
for y
3431
02:11:14,159 --> 02:11:16,639
in linear regression
3432
02:11:16,639 --> 02:11:19,520
we want to take our data and fit a
3433
02:11:19,520 --> 02:11:22,800
linear model to this data so in this
3434
02:11:22,800 --> 02:11:24,560
case our linear model might look
3435
02:11:24,560 --> 02:11:29,520
something along the lines of here
3436
02:11:29,520 --> 02:11:30,400
right
3437
02:11:30,400 --> 02:11:33,199
so this here would be considered as
3438
02:11:33,199 --> 02:11:36,239
maybe our line of
3439
02:11:36,239 --> 02:11:37,679
best
3440
02:11:37,679 --> 02:11:39,199
fit
3441
02:11:39,199 --> 02:11:40,719
and this line
3442
02:11:40,719 --> 02:11:43,040
is modeled by the equation i'm going to
3443
02:11:43,040 --> 02:11:44,480
write it down here
3444
02:11:44,480 --> 02:11:45,280
y
3445
02:11:45,280 --> 02:11:46,639
equals
3446
02:11:46,639 --> 02:11:48,239
b 0
3447
02:11:48,239 --> 02:11:50,880
plus b 1 x
3448
02:11:50,880 --> 02:11:53,199
now b0 just means it's this y-intercept
3449
02:11:53,199 --> 02:11:55,679
so if we extend this y down here
3450
02:11:55,679 --> 02:11:59,119
this value here is b0 and then b1
3451
02:11:59,119 --> 02:12:02,400
defines the slope
3452
02:12:04,159 --> 02:12:05,840
of this line
3453
02:12:05,840 --> 02:12:06,800
okay
3454
02:12:06,800 --> 02:12:08,239
all right so that's the
3455
02:12:08,239 --> 02:12:09,599
formula
3456
02:12:09,599 --> 02:12:12,639
for linear regression
3457
02:12:12,639 --> 02:12:15,199
and how exactly do we come up with that
3458
02:12:15,199 --> 02:12:17,199
formula what are we trying to do with
3459
02:12:17,199 --> 02:12:18,880
this linear regression
3460
02:12:18,880 --> 02:12:21,280
you know we could just eyeball
3461
02:12:21,280 --> 02:12:23,199
where should the line be but humans are
3462
02:12:23,199 --> 02:12:25,360
not very good at eyeballing certain
3463
02:12:25,360 --> 02:12:28,159
things like that i mean we can get close
3464
02:12:28,159 --> 02:12:30,320
but a computer is better at giving us a
3465
02:12:30,320 --> 02:12:35,840
precise value for b0 and b1
3466
02:12:35,840 --> 02:12:37,360
well let's introduce the concept of
3467
02:12:37,360 --> 02:12:40,400
something known as a residual
3468
02:12:40,400 --> 02:12:43,040
okay so
3469
02:12:43,040 --> 02:12:44,400
residual
3470
02:12:44,400 --> 02:12:46,480
you might also hear this being called
3471
02:12:46,480 --> 02:12:48,400
the error
3472
02:12:48,400 --> 02:12:51,040
and what that means is let's take some
3473
02:12:51,040 --> 02:12:53,760
data point in our data set
3474
02:12:53,760 --> 02:12:56,480
and we're going to evaluate how far off
3475
02:12:56,480 --> 02:12:58,320
is our prediction
3476
02:12:58,320 --> 02:12:59,119
from
3477
02:12:59,119 --> 02:13:01,440
a data point that we already have
3478
02:13:01,440 --> 02:13:04,800
so this here is our y let's say
3479
02:13:04,800 --> 02:13:09,520
this is one two three four five six
3480
02:13:09,520 --> 02:13:10,560
seven
3481
02:13:10,560 --> 02:13:13,440
eight so this is y eight let's call it
3482
02:13:13,440 --> 02:13:15,840
you'll see that i use this y i in order
3483
02:13:15,840 --> 02:13:18,079
to represent hey just one of these
3484
02:13:18,079 --> 02:13:19,199
points
3485
02:13:19,199 --> 02:13:20,960
okay
3486
02:13:20,960 --> 02:13:23,520
so this here is y and this here would be
3487
02:13:23,520 --> 02:13:25,920
the prediction oops this here would be
3488
02:13:25,920 --> 02:13:28,400
the prediction for y
3489
02:13:28,400 --> 02:13:29,280
8
3490
02:13:29,280 --> 02:13:31,920
which i've labeled with this hat okay if
3491
02:13:31,920 --> 02:13:33,440
it has a hat on it that means hey this
3492
02:13:33,440 --> 02:13:35,280
is what this is my guess this is my
3493
02:13:35,280 --> 02:13:36,560
prediction
3494
02:13:36,560 --> 02:13:37,360
for
3495
02:13:37,360 --> 02:13:40,000
you know this specific
3496
02:13:40,000 --> 02:13:42,840
value of x
3497
02:13:42,840 --> 02:13:45,840
okay now the residual
3498
02:13:45,840 --> 02:13:47,040
would be
3499
02:13:47,040 --> 02:13:49,199
this distance here
3500
02:13:49,199 --> 02:13:50,639
between y
3501
02:13:50,639 --> 02:13:53,599
eight and y hat eight so
3502
02:13:53,599 --> 02:13:54,960
y eight
3503
02:13:54,960 --> 02:13:57,760
minus y hat eight
3504
02:13:57,760 --> 02:13:59,040
all right because that would give us
3505
02:13:59,040 --> 02:14:00,000
this
3506
02:14:00,000 --> 02:14:01,679
here and i'm just going to take the
3507
02:14:01,679 --> 02:14:03,760
absolute value of this because what if
3508
02:14:03,760 --> 02:14:05,679
it's below the line
3509
02:14:05,679 --> 02:14:06,800
right then you would get a negative
3510
02:14:06,800 --> 02:14:08,639
value but distance can't be negative so
3511
02:14:08,639 --> 02:14:11,040
we're just going to put a little
3512
02:14:11,040 --> 02:14:12,239
absolute
3513
02:14:12,239 --> 02:14:15,199
value around this quantity
3514
02:14:15,199 --> 02:14:17,280
and that gives us
3515
02:14:17,280 --> 02:14:19,520
the residual or the error
3516
02:14:19,520 --> 02:14:21,760
so let me rewrite that
3517
02:14:21,760 --> 02:14:24,000
and you know to generalize to all the
3518
02:14:24,000 --> 02:14:26,159
points i'm going to say the residual can
3519
02:14:26,159 --> 02:14:27,679
be calculated
3520
02:14:27,679 --> 02:14:29,199
as y i
3521
02:14:29,199 --> 02:14:31,520
minus y hat
3522
02:14:31,520 --> 02:14:32,560
of i
3523
02:14:32,560 --> 02:14:33,840
okay
3524
02:14:33,840 --> 02:14:35,520
so this just means the distance between
3525
02:14:35,520 --> 02:14:37,119
some given point
3526
02:14:37,119 --> 02:14:39,280
and its prediction its corresponding
3527
02:14:39,280 --> 02:14:41,280
prediction on the line
3528
02:14:41,280 --> 02:14:42,239
so now
3529
02:14:42,239 --> 02:14:44,159
with this residual
3530
02:14:44,159 --> 02:14:46,239
this line of best fit
3531
02:14:46,239 --> 02:14:48,400
is generally trying to decrease these
3532
02:14:48,400 --> 02:14:51,440
residuals as much as possible
3533
02:14:51,440 --> 02:14:52,800
so
3534
02:14:52,800 --> 02:14:55,280
now that we have some value for the
3535
02:14:55,280 --> 02:14:57,119
error our line of best fit is trying to
3536
02:14:57,119 --> 02:14:59,280
decrease the error as much as possible
3537
02:14:59,280 --> 02:15:02,079
for all of the different data points
3538
02:15:02,079 --> 02:15:05,599
and that might mean you know minimizing
3539
02:15:05,599 --> 02:15:07,679
the sum of all the residuals so this
3540
02:15:07,679 --> 02:15:09,520
here this is the sum
3541
02:15:09,520 --> 02:15:12,960
symbol and if i just stick the residual
3542
02:15:12,960 --> 02:15:14,320
calculation
3543
02:15:14,320 --> 02:15:16,560
in there
3544
02:15:16,560 --> 02:15:18,639
it looks something like that right and
3545
02:15:18,639 --> 02:15:20,239
i'm just going to say okay for all of
3546
02:15:20,239 --> 02:15:22,320
the i's in our data set so for all the
3547
02:15:22,320 --> 02:15:24,239
different points we're going to sum up
3548
02:15:24,239 --> 02:15:26,320
all the residuals
3549
02:15:26,320 --> 02:15:29,040
and i'm going to try to decrease that
3550
02:15:29,040 --> 02:15:30,800
with my line of best fit so i'm going to
3551
02:15:30,800 --> 02:15:33,520
find the b0 and b1 which gives me the
3552
02:15:33,520 --> 02:15:36,639
lowest value of this
3553
02:15:36,639 --> 02:15:37,920
okay
3554
02:15:37,920 --> 02:15:40,639
now you know sometimes in
3555
02:15:40,639 --> 02:15:42,960
different circumstances we might
3556
02:15:42,960 --> 02:15:45,040
attach a squared to that so we're trying
3557
02:15:45,040 --> 02:15:48,320
to decrease the sum of the squared
3558
02:15:48,320 --> 02:15:51,320
residuals
3559
02:15:56,960 --> 02:15:59,119
and what that does is it just
3560
02:15:59,119 --> 02:16:01,760
you know it adds a higher penalty for
3561
02:16:01,760 --> 02:16:04,480
how far off we are from you know points
3562
02:16:04,480 --> 02:16:06,639
that are further off so that is linear
3563
02:16:06,639 --> 02:16:08,560
regression we're trying to find
3564
02:16:08,560 --> 02:16:11,440
this equation some line of best fit
3565
02:16:11,440 --> 02:16:14,079
that will help us decrease
3566
02:16:14,079 --> 02:16:16,320
this measure of error
3567
02:16:16,320 --> 02:16:18,239
with respect to all the data points that
3568
02:16:18,239 --> 02:16:20,560
we have in our data set and try to come
3569
02:16:20,560 --> 02:16:22,000
up with the best prediction for all of
3570
02:16:22,000 --> 02:16:22,800
them
3571
02:16:22,800 --> 02:16:26,159
this is known as simple
3572
02:16:26,159 --> 02:16:28,079
linear
3573
02:16:28,079 --> 02:16:31,079
regression
3574
02:16:32,000 --> 02:16:34,558
basically that means you know our
3575
02:16:34,558 --> 02:16:37,200
equation looks something
3576
02:16:37,200 --> 02:16:39,280
like this
3577
02:16:39,280 --> 02:16:43,200
now there's also multiple
3578
02:16:44,959 --> 02:16:47,839
linear regression
3579
02:16:48,398 --> 02:16:50,000
which just means that hey if we have
3580
02:16:50,000 --> 02:16:50,959
more
3581
02:16:50,959 --> 02:16:53,840
than one value for x so like think of
3582
02:16:53,840 --> 02:16:55,359
our feature vectors we have multiple
3583
02:16:55,359 --> 02:16:58,240
values in our x vector
3584
02:16:58,240 --> 02:17:01,439
then our predictor might look something
3585
02:17:01,439 --> 02:17:02,638
more
3586
02:17:02,638 --> 02:17:05,358
like this
3587
02:17:06,879 --> 02:17:09,519
actually i'm just going to say etc plus
3588
02:17:09,519 --> 02:17:13,200
b n x n so now i'm coming up with some
3589
02:17:13,200 --> 02:17:14,398
coefficient
3590
02:17:14,398 --> 02:17:17,040
for all of the different
3591
02:17:17,040 --> 02:17:19,840
x values that i have in my vector
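A small sketch of fitting a multiple linear regression by minimizing the sum of squared residuals, using NumPy's least-squares solver — the data set and its "true" coefficients below are assumptions of the example, not anything from the video:

```python
import numpy as np

# Synthetic data under an assumed true model y = 2 + 3*x1 - 1*x2,
# standing in for a real multi-feature data set.
rng = np.random.default_rng(0)
X = rng.random((50, 2))
y = 2 + 3 * X[:, 0] - 1 * X[:, 1] + rng.normal(0.0, 0.01, 50)

# Prepend a column of ones so the intercept b0 is fit alongside
# b1 and b2; lstsq finds the coefficient vector that minimizes the
# sum of squared residuals.
A = np.column_stack([np.ones(len(X)), X])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
b0, b1, b2 = coeffs
```

With only a little noise added, the recovered `b0`, `b1`, `b2` land very close to the assumed 2, 3, and -1.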
3592
02:17:19,840 --> 02:17:21,439
you guys might have noticed that i have
3593
02:17:21,439 --> 02:17:23,200
some assumptions over here and you might
3594
02:17:23,200 --> 02:17:24,799
be asking okay kylie what in the world
3595
02:17:24,799 --> 02:17:26,959
do these assumptions mean so let's go
3596
02:17:26,959 --> 02:17:28,959
over them
3597
02:17:28,959 --> 02:17:33,039
the first one is linearity
3598
02:17:33,679 --> 02:17:35,599
and what that means is let's say i have
3599
02:17:35,599 --> 02:17:38,000
a data set
3600
02:17:38,000 --> 02:17:41,000
okay
3601
02:17:43,599 --> 02:17:45,760
linearity just means okay does my
3602
02:17:45,760 --> 02:17:49,280
data follow a linear pattern does
3603
02:17:49,280 --> 02:17:52,000
y increase as x increases or does y
3604
02:17:52,000 --> 02:17:56,879
decrease as x increases so if y
3605
02:17:56,879 --> 02:17:59,120
increases or decreases at a constant
3606
02:17:59,120 --> 02:18:01,359
rate as x increases
3607
02:18:01,359 --> 02:18:02,398
then you're probably looking at
3608
02:18:02,398 --> 02:18:04,558
something linear so what's an example of
3609
02:18:04,558 --> 02:18:07,040
a non-linear data set let's say i had
3610
02:18:07,040 --> 02:18:10,879
data that might look something like that
3611
02:18:10,879 --> 02:18:12,160
okay
3612
02:18:12,160 --> 02:18:15,040
so now just visually judging this you
3613
02:18:15,040 --> 02:18:16,879
might say okay seems like the line of
3614
02:18:16,879 --> 02:18:19,519
best fit might actually be some curve
3615
02:18:19,519 --> 02:18:21,599
like this
3616
02:18:21,599 --> 02:18:22,718
right
3617
02:18:22,718 --> 02:18:25,120
and in this case we don't satisfy that
3618
02:18:25,120 --> 02:18:27,439
linearity
3619
02:18:27,439 --> 02:18:29,599
assumption anymore
3620
02:18:29,599 --> 02:18:30,478
so
3621
02:18:30,478 --> 02:18:32,478
with linearity we basically just want
3622
02:18:32,478 --> 02:18:34,718
our data set to follow some sort of
3623
02:18:34,718 --> 02:18:36,240
linear
3624
02:18:36,240 --> 02:18:39,200
trajectory
3625
02:18:39,200 --> 02:18:41,439
and independence
3626
02:18:41,439 --> 02:18:44,240
our second assumption
3627
02:18:44,240 --> 02:18:46,240
just means
3628
02:18:46,240 --> 02:18:48,160
this point over here
3629
02:18:48,160 --> 02:18:50,080
it should have no influence on this
3630
02:18:50,080 --> 02:18:52,080
point over here or this point over here
3631
02:18:52,080 --> 02:18:54,318
or this point over here so in other
3632
02:18:54,318 --> 02:18:57,040
words all the points
3633
02:18:57,040 --> 02:19:01,120
all the samples in our data set
3634
02:19:01,120 --> 02:19:02,959
should be independent
3635
02:19:02,959 --> 02:19:04,959
okay they should not rely on one another
3636
02:19:04,959 --> 02:19:08,478
they should not affect one another
3637
02:19:14,318 --> 02:19:16,478
okay now
3638
02:19:16,478 --> 02:19:18,840
normality and
3639
02:19:18,840 --> 02:19:21,200
homoscedasticity those are concepts
3640
02:19:21,200 --> 02:19:24,959
which use this residual okay
3641
02:19:24,959 --> 02:19:28,959
so if i have a plot
3642
02:19:28,959 --> 02:19:31,120
that looks
3643
02:19:31,120 --> 02:19:32,398
something
3644
02:19:32,398 --> 02:19:35,398
like
3645
02:19:35,439 --> 02:19:37,920
this
3646
02:19:39,840 --> 02:19:42,160
and my line of best fit
3647
02:19:42,160 --> 02:19:43,519
is somewhere
3648
02:19:43,519 --> 02:19:44,558
here
3649
02:19:44,558 --> 02:19:47,120
maybe it's something like that
3650
02:19:47,120 --> 02:19:49,439
in order to look at these normality and
3651
02:19:49,439 --> 02:19:50,960
homoscedasticity
3652
02:19:50,960 --> 02:19:53,040
assumptions let's look at the residual
3653
02:19:53,040 --> 02:19:56,000
plot okay
3654
02:19:57,280 --> 02:19:59,760
and what that means is i'm going to keep
3655
02:19:59,760 --> 02:20:02,560
my same x axis
3656
02:20:02,560 --> 02:20:04,560
but instead of plotting now where they
3657
02:20:04,560 --> 02:20:07,439
are relative to this y i'm going to plot
3658
02:20:07,439 --> 02:20:11,120
these errors so now i'm going to plot y
3659
02:20:11,120 --> 02:20:14,240
minus y hat
3660
02:20:14,240 --> 02:20:15,439
like this
3661
02:20:15,439 --> 02:20:17,120
okay
3662
02:20:17,120 --> 02:20:18,800
and now you know this one is slightly
3663
02:20:18,800 --> 02:20:20,399
positive so it might be here this one
3664
02:20:20,399 --> 02:20:22,319
down here is negative it might be here
3665
02:20:22,319 --> 02:20:23,520
so our
3666
02:20:23,520 --> 02:20:25,760
residual plot
3667
02:20:25,760 --> 02:20:27,280
it's literally just a plot of how you
3668
02:20:27,280 --> 02:20:29,040
know the values are distributed around
3669
02:20:29,040 --> 02:20:31,040
our line of best fit
3670
02:20:31,040 --> 02:20:33,120
so it looks like
3671
02:20:33,120 --> 02:20:37,280
it might you know look something
3672
02:20:38,640 --> 02:20:39,920
like this
3673
02:20:39,920 --> 02:20:42,000
okay
3674
02:20:42,000 --> 02:20:45,120
so this might be our residual plot and
3675
02:20:45,120 --> 02:20:48,240
what normality means so our assumptions
3676
02:20:48,240 --> 02:20:51,200
are normality
3677
02:20:51,600 --> 02:20:53,920
and homo
3678
02:20:53,920 --> 02:20:55,439
scedasticity
3680
02:20:59,120 --> 02:21:00,479
i might have butchered that spelling i
3681
02:21:00,479 --> 02:21:01,840
don't really know
3682
02:21:01,840 --> 02:21:02,880
but
3683
02:21:02,880 --> 02:21:03,840
what
3684
02:21:03,840 --> 02:21:06,319
normality is saying is that these
3685
02:21:06,319 --> 02:21:09,760
residuals should be normally distributed
3686
02:21:09,760 --> 02:21:11,200
okay
3687
02:21:11,200 --> 02:21:13,280
around this line of best fit it should
3688
02:21:13,280 --> 02:21:16,640
follow a normal distribution
3689
02:21:16,640 --> 02:21:17,960
and now what
3690
02:21:17,960 --> 02:21:21,520
homoscedasticity says okay our variance
3691
02:21:21,520 --> 02:21:22,960
of these points
3692
02:21:22,960 --> 02:21:24,720
should remain constant
3693
02:21:24,720 --> 02:21:27,600
throughout so this spread here should be
3694
02:21:27,600 --> 02:21:29,200
approximately the same as this spread
3695
02:21:29,200 --> 02:21:30,960
over here
3696
02:21:30,960 --> 02:21:32,640
now what's an example of where you know
3697
02:21:32,640 --> 02:21:33,520
homo
3698
02:21:33,520 --> 02:21:36,560
scedasticity does not hold
3699
02:21:36,560 --> 02:21:38,800
well let's say that our
3700
02:21:38,800 --> 02:21:41,200
original plot actually looks something
3701
02:21:41,200 --> 02:21:43,520
like
3702
02:21:43,520 --> 02:21:45,840
this
3703
02:21:46,399 --> 02:21:48,240
okay so now if we looked at the
3704
02:21:48,240 --> 02:21:50,240
residuals for that
3705
02:21:50,240 --> 02:21:54,040
it might look something
3706
02:21:54,479 --> 02:21:55,840
like that
3707
02:21:55,840 --> 02:21:58,720
and now if we look at this spread of
3708
02:21:58,720 --> 02:22:00,399
the points
3709
02:22:00,399 --> 02:22:03,680
it decreases right so now the spread is
3710
02:22:03,680 --> 02:22:04,960
not constant which means that
3711
02:22:04,960 --> 02:22:06,840
homoscedasticity
3712
02:22:06,840 --> 02:22:08,720
this
3713
02:22:08,720 --> 02:22:09,840
assumption
3714
02:22:09,840 --> 02:22:11,760
would not be
3715
02:22:11,760 --> 02:22:12,880
fulfilled and it might not be
3716
02:22:12,880 --> 02:22:16,080
appropriate to use linear regression
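The normality and homoscedasticity checks described above are usually done by eyeballing a residual plot. A minimal sketch on synthetic data (my own example, not the course's code) would be:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=(100, 1))
y = 3 * x[:, 0] + rng.normal(0, 1, size=100)

model = LinearRegression().fit(x, y)
residuals = y - model.predict(x)  # y - y_hat for every point

# residual plot: same x axis, but plot the errors instead of y
plt.scatter(x[:, 0], residuals)
plt.axhline(0)
plt.xlabel("x")
plt.ylabel("y - y_hat")
plt.savefig("residuals.png")

# roughly: residuals centered on 0 (normality around the line),
# with a spread that stays constant across x (homoscedasticity)
print(residuals.mean(), residuals.std())
```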
3717
02:22:16,080 --> 02:22:17,680
so that's just linear regression
3718
02:22:17,680 --> 02:22:19,920
basically we have a bunch of data points
3719
02:22:19,920 --> 02:22:22,479
we want to predict some y value
3720
02:22:22,479 --> 02:22:24,399
for those
3721
02:22:24,399 --> 02:22:26,240
and we're trying to come up with this
3722
02:22:26,240 --> 02:22:29,280
line of best fit that best describes hey
3723
02:22:29,280 --> 02:22:31,520
given some value x
3724
02:22:31,520 --> 02:22:35,680
what would be my best guess of what y is
3725
02:22:35,680 --> 02:22:38,960
so let's move on to how do we evaluate a
3726
02:22:38,960 --> 02:22:42,160
linear regression model
3727
02:22:42,240 --> 02:22:44,640
so the first
3728
02:22:44,640 --> 02:22:46,319
measure that i'm going to talk about is
3729
02:22:46,319 --> 02:22:51,520
known as mean absolute error or m-a-e
3730
02:22:52,000 --> 02:22:53,120
for short
3731
02:22:53,120 --> 02:22:53,920
okay
3732
02:22:53,920 --> 02:22:56,319
and mean absolute error
3733
02:22:56,319 --> 02:22:58,960
is basically saying all right let's take
3734
02:22:58,960 --> 02:23:01,280
all the errors so all these residuals
3735
02:23:01,280 --> 02:23:03,439
that we talked about
3736
02:23:03,439 --> 02:23:05,120
let's sum up
3737
02:23:05,120 --> 02:23:06,960
the distance for all of them and then
3738
02:23:06,960 --> 02:23:09,040
take the average and then that can
3739
02:23:09,040 --> 02:23:12,319
describe you know how far off are we
3740
02:23:12,319 --> 02:23:13,359
so the
3741
02:23:13,359 --> 02:23:16,000
mathematical formula for that would be
3742
02:23:16,000 --> 02:23:20,240
okay let's take all the residuals
3743
02:23:21,600 --> 02:23:24,160
all right so this is the distance
3744
02:23:24,160 --> 02:23:27,120
actually let me redraw a plot down here
3745
02:23:27,120 --> 02:23:28,399
so
3746
02:23:28,399 --> 02:23:32,960
suppose i have a data set that looks like this
3747
02:23:32,960 --> 02:23:35,120
and
3748
02:23:35,600 --> 02:23:38,240
here are all of my
3749
02:23:38,240 --> 02:23:41,359
data points right
3750
02:23:41,359 --> 02:23:43,040
and now let's say my line looks
3751
02:23:43,040 --> 02:23:45,120
something like
3752
02:23:45,120 --> 02:23:47,840
that
3753
02:23:47,840 --> 02:23:48,800
so
3754
02:23:48,800 --> 02:23:51,200
my mean absolute error would be summing
3755
02:23:51,200 --> 02:23:52,080
up
3756
02:23:52,080 --> 02:23:55,439
all of these values
3757
02:23:55,920 --> 02:23:58,399
this was a mistake
3758
02:23:58,399 --> 02:24:00,479
so summing up all of these
3759
02:24:00,479 --> 02:24:02,000
and then dividing by how many data
3760
02:24:02,000 --> 02:24:03,680
points i have
3761
02:24:03,680 --> 02:24:05,760
so what would be all the residuals it
3762
02:24:05,760 --> 02:24:09,520
would be y i right so every single point
3763
02:24:09,520 --> 02:24:12,880
minus y hat i so the prediction for that
3764
02:24:12,880 --> 02:24:14,800
on here
3765
02:24:14,800 --> 02:24:16,880
and then we're going to sum over all of
3766
02:24:16,880 --> 02:24:19,200
the different i's in our data set
3767
02:24:19,200 --> 02:24:20,880
right so
3768
02:24:20,880 --> 02:24:21,760
i
3769
02:24:21,760 --> 02:24:24,000
and then we divide by the number of
3770
02:24:24,000 --> 02:24:25,280
points we have so actually i'm going to
3771
02:24:25,280 --> 02:24:27,680
rewrite this to make it a little clearer
3772
02:24:27,680 --> 02:24:30,000
so i is equal to whatever the first data
3773
02:24:30,000 --> 02:24:31,600
point is all the way through the nth
3774
02:24:31,600 --> 02:24:32,720
data point
3775
02:24:32,720 --> 02:24:35,040
and then we divide it by n which is how
3776
02:24:35,040 --> 02:24:36,640
many points there are
3777
02:24:36,640 --> 02:24:41,439
okay so this is our measure of m a e
3778
02:24:41,439 --> 02:24:43,840
and this is basically telling us okay
3779
02:24:43,840 --> 02:24:45,439
on average
3780
02:24:45,439 --> 02:24:47,200
this is the distance
3781
02:24:47,200 --> 02:24:48,399
between
3782
02:24:48,399 --> 02:24:51,680
our predicted value and the actual value
3783
02:24:51,680 --> 02:24:54,479
in our training set
3784
02:24:54,479 --> 02:24:56,160
okay
3785
02:24:56,160 --> 02:25:00,640
and mae is good because it allows us to
3786
02:25:00,640 --> 02:25:03,600
you know when we get this value here we
3787
02:25:03,600 --> 02:25:05,600
can literally directly compare it to
3788
02:25:05,600 --> 02:25:06,640
whatever
3789
02:25:06,640 --> 02:25:09,920
units the y value is in so let's say y
3790
02:25:09,920 --> 02:25:11,040
is
3791
02:25:11,040 --> 02:25:14,720
we're talking you know the prediction of
3792
02:25:14,720 --> 02:25:16,880
the price of a house
3793
02:25:16,880 --> 02:25:17,680
right
3794
02:25:17,680 --> 02:25:19,280
in dollars
3795
02:25:19,280 --> 02:25:21,439
once we have once we calculate the mae
3796
02:25:21,439 --> 02:25:24,080
we can literally say oh the average you
3797
02:25:24,080 --> 02:25:25,680
know price
3798
02:25:25,680 --> 02:25:27,040
the average
3799
02:25:27,040 --> 02:25:29,840
um how much we're off by
3800
02:25:29,840 --> 02:25:32,560
is literally this many dollars
3801
02:25:32,560 --> 02:25:33,760
okay
3802
02:25:33,760 --> 02:25:35,920
so that's the mean absolute error
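The MAE formula described above, sum of |y_i - y_hat_i| over all n points divided by n, is a few lines of code. The house-price numbers here are made up for illustration:

```python
import numpy as np

def mae(y, y_hat):
    """Mean absolute error: average of |y_i - y_hat_i| over all n points."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.abs(y - y_hat).mean()

# e.g. house prices in dollars: the MAE is also in dollars,
# so it's directly interpretable
actual = [200_000, 310_000, 150_000]
predicted = [195_000, 320_000, 160_000]
print(mae(actual, predicted))  # (5000 + 10000 + 10000) / 3
```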
3803
02:25:35,920 --> 02:25:37,520
an evaluation technique that's also
3804
02:25:37,520 --> 02:25:39,200
closely related to that
3805
02:25:39,200 --> 02:25:41,760
is called the mean squared error
3806
02:25:41,760 --> 02:25:44,160
and this is mse
3807
02:25:44,160 --> 02:25:45,680
for short
3808
02:25:45,680 --> 02:25:47,040
okay
3809
02:25:47,040 --> 02:25:47,840
now
3810
02:25:47,840 --> 02:25:51,600
if i take this plot again
3811
02:25:51,680 --> 02:25:55,840
and i duplicate it and move it down here
3812
02:25:55,840 --> 02:25:58,240
well the gist of mean squared error is
3813
02:25:58,240 --> 02:25:59,439
kind of the same but instead of the
3814
02:25:59,439 --> 02:26:01,680
absolute value we're going to square
3815
02:26:01,680 --> 02:26:04,319
so now the mse
3816
02:26:04,319 --> 02:26:06,479
is something along the lines of okay
3817
02:26:06,479 --> 02:26:08,479
let's sum up
3818
02:26:08,479 --> 02:26:10,800
something right so we're going to sum up
3819
02:26:10,800 --> 02:26:13,200
all of our errors
3820
02:26:13,200 --> 02:26:17,120
so now i'm going to do y i minus y hat i
3821
02:26:17,120 --> 02:26:19,200
but instead of absolute valuing them i'm
3822
02:26:19,200 --> 02:26:21,280
going to square them all and then i'm
3823
02:26:21,280 --> 02:26:24,000
going to divide by n in order to find
3824
02:26:24,000 --> 02:26:25,040
the mean
3825
02:26:25,040 --> 02:26:27,840
so basically now i'm taking
3826
02:26:27,840 --> 02:26:29,520
all of these
3827
02:26:29,520 --> 02:26:31,760
different values and i'm squaring them
3828
02:26:31,760 --> 02:26:35,840
first before i add them to one another
3829
02:26:36,080 --> 02:26:39,439
and then i divide by n
3830
02:26:39,439 --> 02:26:41,040
and the reason why we like using mean
3831
02:26:41,040 --> 02:26:43,120
squared error is that
3832
02:26:43,120 --> 02:26:46,000
it helps us punish large errors in the
3833
02:26:46,000 --> 02:26:47,200
prediction
3834
02:26:47,200 --> 02:26:49,600
and later on mse might be important
3835
02:26:49,600 --> 02:26:53,200
because of differentiability right so a
3836
02:26:53,200 --> 02:26:55,600
quadratic equation is differentiable you
3837
02:26:55,600 --> 02:26:57,280
know if you're familiar with calculus a
3838
02:26:57,280 --> 02:26:58,640
quadratic equation is
3839
02:26:58,640 --> 02:27:00,640
differentiable whereas the absolute
3840
02:27:00,640 --> 02:27:02,000
value function is not totally
3841
02:27:02,000 --> 02:27:03,840
differentiable everywhere
3842
02:27:03,840 --> 02:27:05,359
but if you don't understand that don't
3843
02:27:05,359 --> 02:27:07,600
worry about it you won't really need it
3844
02:27:07,600 --> 02:27:08,960
right now
3845
02:27:08,960 --> 02:27:10,479
and now one downside of mean squared
3846
02:27:10,479 --> 02:27:12,640
error is that once i calculate the mean
3847
02:27:12,640 --> 02:27:14,560
squared error over here
3848
02:27:14,560 --> 02:27:16,399
and i go back over to y and i want to
3849
02:27:16,399 --> 02:27:19,200
compare the values
3850
02:27:19,200 --> 02:27:21,520
well it gets a little bit trickier to do
3851
02:27:21,520 --> 02:27:23,760
that because
3852
02:27:23,760 --> 02:27:27,200
now my mean squared error is in terms of
3853
02:27:27,200 --> 02:27:29,359
y squared right it's
3854
02:27:29,359 --> 02:27:32,240
this is now squared so instead of just
3855
02:27:32,240 --> 02:27:34,160
dollars how you know how many dollars
3856
02:27:34,160 --> 02:27:36,240
off am i i'm talking how many dollars
3857
02:27:36,240 --> 02:27:37,359
squared
3858
02:27:37,359 --> 02:27:38,240
off
3859
02:27:38,240 --> 02:27:39,280
am i
3860
02:27:39,280 --> 02:27:41,680
and that you know to humans it doesn't
3861
02:27:41,680 --> 02:27:44,160
really make that much sense which is why
3862
02:27:44,160 --> 02:27:46,240
we have created something known as the
3863
02:27:46,240 --> 02:27:49,280
root mean square error
3864
02:27:49,280 --> 02:27:50,479
and
3865
02:27:50,479 --> 02:27:52,240
i'm just going to copy
3866
02:27:52,240 --> 02:27:54,399
this diagram over here because it's very
3867
02:27:54,399 --> 02:27:55,920
very similar
3868
02:27:55,920 --> 02:27:58,479
to mean squared error
3869
02:27:58,479 --> 02:28:00,080
except
3870
02:28:00,080 --> 02:28:03,200
now we take a big square root
3871
02:28:03,200 --> 02:28:05,280
okay so this is rmse and we take the
3872
02:28:05,280 --> 02:28:06,720
square root
3873
02:28:06,720 --> 02:28:08,240
of that
3874
02:28:08,240 --> 02:28:10,800
mean squared error and so now the term
3875
02:28:10,800 --> 02:28:13,120
in which you know we're defining
3876
02:28:13,120 --> 02:28:14,479
our error
3877
02:28:14,479 --> 02:28:16,880
is now in terms of that dollar sign
3878
02:28:16,880 --> 02:28:19,120
symbol again so that's a pro of root mean
3879
02:28:19,120 --> 02:28:21,520
squared error is that now we can say
3880
02:28:21,520 --> 02:28:24,640
okay our error according to this metric
3881
02:28:24,640 --> 02:28:26,720
is this many dollars off from our
3882
02:28:26,720 --> 02:28:28,319
predictor
3883
02:28:28,319 --> 02:28:30,560
okay so it's in the same unit which is
3884
02:28:30,560 --> 02:28:32,640
one of the pros of root mean squared
3885
02:28:32,640 --> 02:28:34,880
error
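MSE and RMSE as described above, square the residuals instead of absolute-valuing them, then for RMSE take the square root to get back to y's units. A small sketch with made-up numbers:

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: average of (y_i - y_hat_i)^2 — punishes large errors."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return ((y - y_hat) ** 2).mean()

def rmse(y, y_hat):
    """Root mean squared error: sqrt(MSE), back in the same units as y."""
    return np.sqrt(mse(y, y_hat))

actual = [3.0, 5.0, 2.5]
predicted = [2.5, 5.0, 4.0]
print(mse(actual, predicted))   # (0.25 + 0 + 2.25) / 3
print(rmse(actual, predicted))
```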
3886
02:28:34,880 --> 02:28:37,600
and now finally there is the coefficient
3887
02:28:37,600 --> 02:28:40,800
of determination or r squared
3888
02:28:40,800 --> 02:28:42,399
and this is the formula for r squared so
3889
02:28:42,399 --> 02:28:45,520
r squared is equal to 1 minus rss
3890
02:28:45,520 --> 02:28:46,479
over
3891
02:28:46,479 --> 02:28:48,240
tss
3892
02:28:48,240 --> 02:28:51,200
okay so what does that mean
3893
02:28:51,200 --> 02:28:53,600
basically rss
3894
02:28:53,600 --> 02:28:56,560
stands for the sum
3895
02:28:56,560 --> 02:28:59,920
of the squared
3896
02:28:59,920 --> 02:29:01,840
residuals
3897
02:29:01,840 --> 02:29:05,200
so maybe it should be ssr instead but
3898
02:29:05,200 --> 02:29:07,920
rss sum of the squared residuals and
3899
02:29:07,920 --> 02:29:09,359
this
3900
02:29:09,359 --> 02:29:11,359
is equal
3901
02:29:11,359 --> 02:29:12,560
to
3902
02:29:12,560 --> 02:29:15,760
if i take the sum of all the values
3903
02:29:15,760 --> 02:29:19,920
and i take y i minus y hat
3904
02:29:19,920 --> 02:29:20,880
i
3905
02:29:20,880 --> 02:29:22,720
and square that
3906
02:29:22,720 --> 02:29:25,359
that is my rss right it's the sum of the
3907
02:29:25,359 --> 02:29:28,000
squared residuals
3908
02:29:28,000 --> 02:29:30,479
now tss let me actually use a different
3909
02:29:30,479 --> 02:29:33,120
color for that
3910
02:29:33,120 --> 02:29:38,000
so tss is the total
3911
02:29:38,000 --> 02:29:39,920
sum
3912
02:29:39,920 --> 02:29:42,800
of squares
3913
02:29:43,600 --> 02:29:46,000
and what that means is that instead of
3914
02:29:46,000 --> 02:29:48,319
being with respect to
3915
02:29:48,319 --> 02:29:51,560
this prediction
3916
02:29:51,600 --> 02:29:53,600
we are instead
3917
02:29:53,600 --> 02:29:56,319
going to
3918
02:29:58,160 --> 02:30:01,920
take each y value and just subtract
3919
02:30:01,920 --> 02:30:04,640
the mean of all the y values
3920
02:30:04,640 --> 02:30:06,880
and square that
3921
02:30:06,880 --> 02:30:11,960
okay so if i drew this out
3922
02:30:19,520 --> 02:30:21,300
and if this were my
3924
02:30:22,720 --> 02:30:25,040
actually let's use a different color
3925
02:30:25,040 --> 02:30:27,920
let's use green
3926
02:30:28,240 --> 02:30:31,439
if this were my predictor
3927
02:30:31,439 --> 02:30:35,120
so rss is giving me this measure
3928
02:30:35,120 --> 02:30:36,080
here
3929
02:30:36,080 --> 02:30:38,080
right it's giving me some estimate of
3930
02:30:38,080 --> 02:30:40,800
how far off we are from our regressor
3931
02:30:40,800 --> 02:30:42,240
that we predicted
3932
02:30:42,240 --> 02:30:44,640
actually
3933
02:30:45,439 --> 02:30:48,399
i'm going to use red for that
3934
02:30:48,399 --> 02:30:50,080
well
3935
02:30:50,080 --> 02:30:52,800
tss on the other hand is saying okay how
3936
02:30:52,800 --> 02:30:55,920
far off are these values from the mean
3937
02:30:55,920 --> 02:30:57,840
so if we literally didn't do any
3938
02:30:57,840 --> 02:30:59,760
calculations for the line of best fit if
3939
02:30:59,760 --> 02:31:00,399
we
3940
02:31:00,399 --> 02:31:02,880
just took all the y values and averaged
3941
02:31:02,880 --> 02:31:03,920
all of them
3942
02:31:03,920 --> 02:31:05,920
and said hey this is the average value
3943
02:31:05,920 --> 02:31:08,160
for every single x value
3944
02:31:08,160 --> 02:31:09,680
i'm just going to predict that average
3945
02:31:09,680 --> 02:31:11,760
value instead
3946
02:31:11,760 --> 02:31:13,680
then it's asking okay how far off are
3947
02:31:13,680 --> 02:31:17,920
all these points from that line
3948
02:31:19,040 --> 02:31:21,600
okay and remember that this square means
3949
02:31:21,600 --> 02:31:23,200
that we're punishing
3950
02:31:23,200 --> 02:31:24,800
larger errors
3951
02:31:24,800 --> 02:31:26,880
right so even if they look somewhat
3952
02:31:26,880 --> 02:31:28,479
close in terms of distance
3953
02:31:28,479 --> 02:31:31,680
the further a few data points are
3954
02:31:31,680 --> 02:31:34,240
the larger our total
3955
02:31:34,240 --> 02:31:36,800
sum of squares is going to be
3956
02:31:36,800 --> 02:31:38,479
sorry that was my dog
3957
02:31:38,479 --> 02:31:40,479
so the total sum of squares is taking
3958
02:31:40,479 --> 02:31:42,560
all of these values and saying okay what
3959
02:31:42,560 --> 02:31:45,040
is the sum of squares if i didn't do any
3960
02:31:45,040 --> 02:31:46,479
regressor and i literally just
3961
02:31:46,479 --> 02:31:48,800
calculated the average
3962
02:31:48,800 --> 02:31:51,200
of all the y values in my data set and
3963
02:31:51,200 --> 02:31:52,479
for every single x value i'm just going
3964
02:31:52,479 --> 02:31:54,240
to predict that average
3965
02:31:54,240 --> 02:31:56,319
which means that okay like that means
3966
02:31:56,319 --> 02:31:58,800
that maybe y and x aren't associated
3967
02:31:58,800 --> 02:32:01,120
with each other at all like the best
3968
02:32:01,120 --> 02:32:03,280
thing that i can do for any new x value
3969
02:32:03,280 --> 02:32:04,800
just predict hey this is the average of
3970
02:32:04,800 --> 02:32:06,080
my data set
3971
02:32:06,080 --> 02:32:08,880
and this total sum of squares is saying
3972
02:32:08,880 --> 02:32:12,160
okay well with respect to that average
3973
02:32:12,160 --> 02:32:14,960
what is our error
3974
02:32:14,960 --> 02:32:17,280
right so up here the sum of the squared
3975
02:32:17,280 --> 02:32:18,640
residuals
3976
02:32:18,640 --> 02:32:20,960
this is telling us what
3977
02:32:20,960 --> 02:32:23,280
is our error with respect to
3978
02:32:23,280 --> 02:32:26,160
this line of best fit while our total
3979
02:32:26,160 --> 02:32:27,359
sum of square is saying what is the
3980
02:32:27,359 --> 02:32:29,280
error with respect to you know just the
3981
02:32:29,280 --> 02:32:31,840
average y value
3982
02:32:31,840 --> 02:32:33,280
and
3983
02:32:33,280 --> 02:32:36,640
if our line of best fit is a better fit
3984
02:32:36,640 --> 02:32:37,600
then
3985
02:32:37,600 --> 02:32:40,479
this total sum of squares
3986
02:32:40,479 --> 02:32:43,840
that means that you know this
3987
02:32:43,840 --> 02:32:46,000
numerator
3988
02:32:46,000 --> 02:32:48,319
that means that this numerator is going
3989
02:32:48,319 --> 02:32:51,200
to be smaller than this denominator
3990
02:32:51,200 --> 02:32:55,359
right and if our errors in our
3991
02:32:55,359 --> 02:32:57,920
line of best fit are much smaller
3992
02:32:57,920 --> 02:32:59,760
then that means that this ratio of the
3993
02:32:59,760 --> 02:33:03,359
rss over tss is going to be very small
3994
02:33:03,359 --> 02:33:06,160
which means that r squared is going to
3995
02:33:06,160 --> 02:33:08,800
go towards one
3996
02:33:08,800 --> 02:33:11,439
and now when r squared is towards one
3997
02:33:11,439 --> 02:33:13,439
that means that that's usually a sign
3998
02:33:13,439 --> 02:33:15,359
that we have a good
3999
02:33:15,359 --> 02:33:18,000
predictor
4000
02:33:19,600 --> 02:33:22,399
it's one of the signs not the only one
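The R² = 1 - RSS/TSS formula above translates directly to code; sklearn's `r2_score` computes the same thing. The numbers here are made up for illustration:

```python
import numpy as np
from sklearn.metrics import r2_score

def r_squared(y, y_hat):
    """R^2 = 1 - RSS/TSS, where TSS is the error of just predicting the mean."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    rss = ((y - y_hat) ** 2).sum()     # sum of squared residuals
    tss = ((y - y.mean()) ** 2).sum()  # total sum of squares
    return 1 - rss / tss

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.9])
print(r_squared(y, y_hat))
print(r2_score(y, y_hat))  # sklearn agrees
```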
4001
02:33:22,399 --> 02:33:24,880
so over here i also have you know that
4002
02:33:24,880 --> 02:33:26,800
there's this adjusted r squared and what
4003
02:33:26,800 --> 02:33:28,640
that does it just adjusts for the number
4004
02:33:28,640 --> 02:33:33,280
of terms so x1 x2 x3 etc it adjusts for
4005
02:33:33,280 --> 02:33:35,040
how many extra terms we add because
4006
02:33:35,040 --> 02:33:36,640
usually when we
4007
02:33:36,640 --> 02:33:38,319
um you know
4008
02:33:38,319 --> 02:33:40,240
add an extra term the r squared value
4009
02:33:40,240 --> 02:33:41,840
will increase because that'll help us
4010
02:33:41,840 --> 02:33:44,720
predict y some more
4011
02:33:44,720 --> 02:33:47,200
but the value for the adjusted r squared
4012
02:33:47,200 --> 02:33:48,800
increases if the new term actually
4013
02:33:48,800 --> 02:33:50,479
improves this model fit more than
4014
02:33:50,479 --> 02:33:53,280
expected you know by chance so that's
4015
02:33:53,280 --> 02:33:55,280
what adjusted r squared is i'm not you
4016
02:33:55,280 --> 02:33:56,880
know it's out of the scope of this one
4017
02:33:56,880 --> 02:33:59,520
specific course and now that's linear
4018
02:33:59,520 --> 02:34:01,520
regression basically
4019
02:34:01,520 --> 02:34:04,000
i've covered the concept of residuals or
4020
02:34:04,000 --> 02:34:05,200
errors
4021
02:34:05,200 --> 02:34:06,720
and
4022
02:34:06,720 --> 02:34:08,399
you know how do we use that in order to
4023
02:34:08,399 --> 02:34:10,479
find the line of best fit
4024
02:34:10,479 --> 02:34:11,920
and you know our computer can do all the
4025
02:34:11,920 --> 02:34:14,240
calculations for us which is nice but
4026
02:34:14,240 --> 02:34:15,600
behind the scenes it's trying to
4027
02:34:15,600 --> 02:34:18,080
minimize that error right
4028
02:34:18,080 --> 02:34:19,600
and then we've gone through all the
4029
02:34:19,600 --> 02:34:22,240
different ways of actually evaluating a
4030
02:34:22,240 --> 02:34:24,479
linear regression model and the pros and
4031
02:34:24,479 --> 02:34:26,479
cons of each one
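On the "behind the scenes it's trying to minimize that error" point: for ordinary least squares the minimizing coefficients even have a closed form, the normal equation b = (XᵀX)⁻¹Xᵀy. This is a sketch of that idea on made-up data (an illustration of the math, not necessarily how sklearn solves it internally):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 5, size=(40, 1))
y = 4 * x[:, 0] - 2 + rng.normal(0, 0.3, size=40)

# add a column of ones so the intercept b0 is just another coefficient
X = np.hstack([np.ones((40, 1)), x])

# normal equation: solve (X^T X) b = X^T y for the coefficients
# that minimize the sum of squared residuals
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b)  # [b0, b1], close to [-2, 4]
```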
4032
02:34:26,479 --> 02:34:28,640
so now let's look at an example so we're
4033
02:34:28,640 --> 02:34:30,960
still on supervised learning but now
4034
02:34:30,960 --> 02:34:32,640
we're just going to talk about
4035
02:34:32,640 --> 02:34:34,240
regression so what happens when you
4036
02:34:34,240 --> 02:34:35,840
don't just want to predict you know type
4037
02:34:35,840 --> 02:34:37,760
one two three what happens if you
4038
02:34:37,760 --> 02:34:40,880
actually wanna predict a certain value
4039
02:34:40,880 --> 02:34:44,080
so again i'm on the uci machine learning
4040
02:34:44,080 --> 02:34:45,600
repository
4041
02:34:45,600 --> 02:34:46,960
and
4042
02:34:46,960 --> 02:34:50,399
here i found this data set about
4043
02:34:50,399 --> 02:34:53,439
bike sharing in seoul
4044
02:34:53,439 --> 02:34:54,960
south korea
4045
02:34:54,960 --> 02:34:58,479
so this data set is predicting rental
4046
02:34:58,479 --> 02:35:00,479
bike count and here it's the count of
4047
02:35:00,479 --> 02:35:03,520
bikes rented at each hour
4048
02:35:03,520 --> 02:35:05,920
so what we're going to do again you're
4049
02:35:05,920 --> 02:35:07,600
going to go into the data folder and
4050
02:35:07,600 --> 02:35:09,040
you're going to
4051
02:35:09,040 --> 02:35:13,720
download this csv file
4052
02:35:14,720 --> 02:35:16,880
and we're going to move over to colab
4053
02:35:16,880 --> 02:35:18,479
again
4054
02:35:18,479 --> 02:35:21,520
and here i'm going to name this fcc
4055
02:35:21,520 --> 02:35:22,880
bikes
4056
02:35:22,880 --> 02:35:26,479
and regression
4057
02:35:26,479 --> 02:35:27,760
i don't remember what i called the last
4058
02:35:27,760 --> 02:35:31,600
one but yeah fcc bikes regression
4059
02:35:31,600 --> 02:35:34,560
now i'm going to import a bunch of the
4060
02:35:34,560 --> 02:35:36,880
same things that i did earlier
4061
02:35:36,880 --> 02:35:38,720
um
4062
02:35:38,720 --> 02:35:41,120
and you know i'm gonna also continue to
4063
02:35:41,120 --> 02:35:43,040
import the oversampler
4064
02:35:43,040 --> 02:35:44,800
and the standard scaler
4065
02:35:44,800 --> 02:35:48,479
and then i'm actually also just going to
4066
02:35:48,479 --> 02:35:49,840
let you guys know that i have a few more
4067
02:35:49,840 --> 02:35:51,760
things i wanted to import
4068
02:35:51,760 --> 02:35:54,000
so this is a library that lets us copy
4069
02:35:54,000 --> 02:35:56,319
things uh seaborn is a wrapper over
4070
02:35:56,319 --> 02:35:58,319
matplotlib so
4071
02:35:58,319 --> 02:36:00,160
it also allows us to plot certain things
4072
02:36:00,160 --> 02:36:01,680
and then just letting you know that
4073
02:36:01,680 --> 02:36:04,720
we're also going to be using tensorflow
4074
02:36:04,720 --> 02:36:06,240
okay so one more thing that we're also
4075
02:36:06,240 --> 02:36:07,840
going to be using we're going to use the
4076
02:36:07,840 --> 02:36:10,640
sklearn linear model library and
4077
02:36:10,640 --> 02:36:12,080
actually let me make my screen a little
4078
02:36:12,080 --> 02:36:13,280
bit bigger
4079
02:36:13,280 --> 02:36:15,520
so yeah
4080
02:36:15,520 --> 02:36:17,840
awesome
4081
02:36:17,840 --> 02:36:20,080
run this and
4082
02:36:20,080 --> 02:36:21,600
that'll import all the things that we
4083
02:36:21,600 --> 02:36:22,399
need
4084
02:36:22,399 --> 02:36:25,840
so again i'm just going to you know give
4085
02:36:25,840 --> 02:36:27,520
some credit to where we got this data
4086
02:36:27,520 --> 02:36:28,560
set
4087
02:36:28,560 --> 02:36:32,000
so let me copy and paste um
4088
02:36:32,000 --> 02:36:34,560
this uci
4089
02:36:34,560 --> 02:36:37,560
thing
4090
02:36:38,000 --> 02:36:41,840
and i will also give credit to this
4091
02:36:41,840 --> 02:36:44,080
here
4092
02:36:46,479 --> 02:36:48,800
okay
4093
02:36:48,840 --> 02:36:50,880
cool all right cool
4094
02:36:50,880 --> 02:36:53,200
so this is our data set and again it
4095
02:36:53,200 --> 02:36:54,800
tells us all the different attributes
4096
02:36:54,800 --> 02:36:57,359
that we have right here so i'm actually
4097
02:36:57,359 --> 02:36:59,520
going to go ahead
4098
02:36:59,520 --> 02:37:03,359
and paste this in here
4099
02:37:03,359 --> 02:37:05,200
um
4100
02:37:05,200 --> 02:37:07,040
feel free to copy and paste this if you
4101
02:37:07,040 --> 02:37:08,399
want me to read it out loud so you can
4102
02:37:08,399 --> 02:37:10,720
type it it's bike count
4103
02:37:10,720 --> 02:37:11,920
hour
4104
02:37:11,920 --> 02:37:12,800
temp
4105
02:37:12,800 --> 02:37:14,000
humidity
4106
02:37:14,000 --> 02:37:17,200
wind visibility dew point temp
4107
02:37:17,200 --> 02:37:20,240
radiation rain snow
4108
02:37:20,240 --> 02:37:21,359
and
4109
02:37:21,359 --> 02:37:24,399
functional whatever that means
4110
02:37:24,399 --> 02:37:26,319
okay so i'm going to come over here and
4111
02:37:26,319 --> 02:37:30,319
import my data by dragging and dropping
4112
02:37:30,319 --> 02:37:32,319
all right
4113
02:37:32,319 --> 02:37:33,840
now one thing that you guys might
4114
02:37:33,840 --> 02:37:35,040
actually need to do is you might
4115
02:37:35,040 --> 02:37:37,200
actually have to open up the csv because
4116
02:37:37,200 --> 02:37:38,240
there were
4117
02:37:38,240 --> 02:37:41,280
at first a few um like forbidden
4118
02:37:41,280 --> 02:37:43,120
characters in mine at least
4119
02:37:43,120 --> 02:37:45,040
so you might have to get rid of like i
4120
02:37:45,040 --> 02:37:46,640
think there was a degree here but my
4121
02:37:46,640 --> 02:37:48,240
computer wasn't recognizing it so i got
4122
02:37:48,240 --> 02:37:50,160
rid of that so you might have to go
4123
02:37:50,160 --> 02:37:52,479
through and get rid of some of those
4124
02:37:52,479 --> 02:37:55,680
labels that are incorrect
4125
02:37:55,680 --> 02:37:56,960
i'm gonna
4126
02:37:56,960 --> 02:37:58,399
do this okay
4127
02:37:58,399 --> 02:37:59,520
but
4128
02:37:59,520 --> 02:38:01,359
after we've done that we've imported in
4129
02:38:01,359 --> 02:38:04,240
here i'm going to
4130
02:38:04,240 --> 02:38:06,960
create a data a data frame from that so
4131
02:38:06,960 --> 02:38:09,520
all right so now what i can do is i can
4132
02:38:09,520 --> 02:38:11,520
read that csv file and i can get the
4133
02:38:11,520 --> 02:38:13,280
data into here
4134
02:38:13,280 --> 02:38:17,399
so SeoulBikeData.csv
4135
02:38:17,439 --> 02:38:20,160
okay so now if i call data.head
4136
02:38:20,160 --> 02:38:21,680
you'll see that i have all the various
4137
02:38:21,680 --> 02:38:24,960
labels right and then i have the data in
4138
02:38:24,960 --> 02:38:26,840
there
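The step described here — reading the CSV into a pandas DataFrame and peeking at it with `.head()` — can be sketched like this. The two inline rows are made up; in the video the file is the UCI "Seoul Bike Sharing Demand" download (`SeoulBikeData.csv`):

```python
import io
import pandas as pd

# A couple of synthetic rows standing in for SeoulBikeData.csv
# (column names follow the UCI dataset; the values are invented)
csv_text = """Rented Bike Count,Hour,Temperature(C),Humidity(%)
254,0,-5.2,37
204,1,-5.5,38
"""

# In the video this is pd.read_csv("SeoulBikeData.csv")
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())  # .head() shows the first five rows
```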
4139
02:38:26,840 --> 02:38:31,520
so i'm going to from here um
4140
02:38:31,520 --> 02:38:33,760
i'm actually going to get rid of some of
4141
02:38:33,760 --> 02:38:35,600
these columns that you know i don't
4142
02:38:35,600 --> 02:38:39,120
really care about so here i'm going to
4143
02:38:39,120 --> 02:38:40,960
when i when i type this in i'm going to
4144
02:38:40,960 --> 02:38:43,120
drop maybe the date
4145
02:38:43,120 --> 02:38:45,439
whether or not it's a holiday
4146
02:38:45,439 --> 02:38:48,479
and the various seasons
4147
02:38:48,479 --> 02:38:49,680
so i'm just
4148
02:38:49,680 --> 02:38:52,080
not going to care about these things
4149
02:38:52,080 --> 02:38:54,160
axis equals one means drop it from the
4150
02:38:54,160 --> 02:38:56,319
columns
4151
02:38:56,319 --> 02:38:58,160
so now you'll see that okay we still
4152
02:38:58,160 --> 02:38:59,840
have i mean i guess you don't really
4153
02:38:59,840 --> 02:39:01,040
notice it but
4154
02:39:01,040 --> 02:39:03,680
if i set the data frames columns equal
4155
02:39:03,680 --> 02:39:06,399
to data set calls
4156
02:39:06,399 --> 02:39:08,080
and i
4157
02:39:08,080 --> 02:39:09,760
look at you know the first five things
4158
02:39:09,760 --> 02:39:11,280
then you'll see that this is now our
4159
02:39:11,280 --> 02:39:14,399
data set it's a lot easier to read so
4160
02:39:14,399 --> 02:39:19,040
another thing is i'm actually going to
4161
02:39:19,040 --> 02:39:20,880
df functional
4162
02:39:20,880 --> 02:39:23,280
and we're going to create this so
4163
02:39:23,280 --> 02:39:24,800
remember that our computers are not very
4164
02:39:24,800 --> 02:39:27,040
good at language we want it to be in
4165
02:39:27,040 --> 02:39:30,560
zeros and ones so here i will convert
4166
02:39:30,560 --> 02:39:32,800
that
4167
02:39:33,439 --> 02:39:37,280
well if this is equal to yes
4168
02:39:37,280 --> 02:39:38,399
then that
4169
02:39:38,399 --> 02:39:41,840
that gets mapped as one so then set type
4170
02:39:41,840 --> 02:39:43,040
integer
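The yes/no-to-integer conversion described here — comparing the "functional" column to "yes" and casting the resulting booleans to ints — is roughly this (toy column, since the full dataset isn't reproduced here):

```python
import pandas as pd

df = pd.DataFrame({"functional": ["Yes", "Yes", "No"]})

# Compare to "yes" (lowercased first), then cast True/False to 1/0
df["functional"] = (df["functional"].str.lower() == "yes").astype(int)
print(df["functional"].tolist())  # → [1, 1, 0]
```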
4171
02:39:43,040 --> 02:39:44,640
all right
4172
02:39:44,640 --> 02:39:46,880
great
4173
02:39:46,880 --> 02:39:49,680
cool so the thing is right now these
4174
02:39:49,680 --> 02:39:52,640
bike counts are for whatever hour so to
4175
02:39:52,640 --> 02:39:54,479
make this example simpler i'm just going
4176
02:39:54,479 --> 02:39:56,080
to index on an hour and i'm going to say
4177
02:39:56,080 --> 02:39:58,319
okay we're only going to use that
4178
02:39:58,319 --> 02:40:00,000
specific hour
4179
02:40:00,000 --> 02:40:00,880
so
4180
02:40:00,880 --> 02:40:04,479
here let's say um
4181
02:40:04,479 --> 02:40:06,319
so this data frame is only going to be
4182
02:40:06,319 --> 02:40:09,760
data frame where the hour
4183
02:40:09,840 --> 02:40:14,880
let's say it equals 12 okay so it's noon
4184
02:40:14,960 --> 02:40:15,840
all right
4185
02:40:15,840 --> 02:40:17,520
so now you'll see that all the hours are
4186
02:40:17,520 --> 02:40:19,600
equal to 12 and i'm actually going to
4187
02:40:19,600 --> 02:40:22,880
now drop that column
4188
02:40:25,760 --> 02:40:26,960
our
4189
02:40:26,960 --> 02:40:30,160
axis equals one
4190
02:40:30,720 --> 02:40:33,920
all right so if we run this cell okay so
4191
02:40:33,920 --> 02:40:36,960
now we got rid of the hour in here
4192
02:40:36,960 --> 02:40:38,479
and we just have the bike count the
4193
02:40:38,479 --> 02:40:41,600
temperature humidity wind visibility and
4194
02:40:41,600 --> 02:40:43,359
yada yada yada
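Indexing on noon and then dropping the now-constant hour column, as narrated, looks like this on a toy frame (`axis=1` is what makes `drop` remove a column rather than a row):

```python
import pandas as pd

df = pd.DataFrame({
    "bike_count": [254, 204, 980],
    "hour":       [11, 12, 12],
    "temp":       [-5.2, 0.5, 3.1],
})

# Keep only rows recorded at noon, then drop the column that is now all 12s
df = df[df["hour"] == 12]
df = df.drop(["hour"], axis=1)
print(df)
```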
4195
02:40:43,359 --> 02:40:46,240
all right so what i want to do is i'm
4196
02:40:46,240 --> 02:40:49,040
going to actually plot all of these so
4197
02:40:49,040 --> 02:40:50,080
for
4198
02:40:50,080 --> 02:40:53,600
i and all the columns so the range
4199
02:40:53,600 --> 02:40:56,640
length of uh whatever this data frame is
4200
02:40:56,640 --> 02:40:58,240
and all the columns because i don't have
4201
02:40:58,240 --> 02:41:00,080
bike count as
4202
02:41:00,080 --> 02:41:03,040
actually it's my first thing so what i'm
4203
02:41:03,040 --> 02:41:05,439
going to do is say for a label and data
4204
02:41:05,439 --> 02:41:06,479
frame
4205
02:41:06,479 --> 02:41:08,479
columns everything after the first thing
4206
02:41:08,479 --> 02:41:09,920
so that would give me the temperature
4207
02:41:09,920 --> 02:41:12,319
and onwards so these are all my features
4208
02:41:12,319 --> 02:41:13,680
right
4209
02:41:13,680 --> 02:41:16,800
uh i'm going to just scatter
4210
02:41:16,800 --> 02:41:18,000
so
4211
02:41:18,000 --> 02:41:21,120
i want to see how that label how that
4212
02:41:21,120 --> 02:41:22,880
specific data
4213
02:41:22,880 --> 02:41:27,120
um how that affects the bike count so
4214
02:41:27,120 --> 02:41:29,760
i'm going to plot the byte count on the
4215
02:41:29,760 --> 02:41:31,200
y-axis
4216
02:41:31,200 --> 02:41:33,280
and i'm going to plot you know whatever
4217
02:41:33,280 --> 02:41:36,640
the specific label is on the x-axis
4218
02:41:36,640 --> 02:41:39,280
and i'm going to title this
4219
02:41:39,280 --> 02:41:41,840
uh whatever the label is
4220
02:41:41,840 --> 02:41:42,720
and
4221
02:41:42,720 --> 02:41:47,359
you know make my y label the bike count
4222
02:41:47,359 --> 02:41:49,680
at noon
4223
02:41:49,680 --> 02:41:51,840
and the x label
4224
02:41:51,840 --> 02:41:54,880
as just the label
4225
02:41:55,439 --> 02:41:57,280
okay now
4226
02:41:57,280 --> 02:41:59,359
i guess we don't even need the legend
4227
02:41:59,359 --> 02:42:02,640
so just show that plot
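The plotting loop being typed here — one scatter of each feature against the bike count, with title and axis labels — can be sketched as follows (synthetic frame; the Agg backend line just makes it safe to run without a display):

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe; drop this when running interactively
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "bike_count": [254, 204, 980, 150],
    "temp":       [-5.2, 0.5, 18.0, -2.0],
    "humidity":   [37, 38, 55, 40],
})

# Everything after the first column (bike_count) is a feature
for label in df.columns[1:]:
    plt.scatter(df[label], df["bike_count"])
    plt.title(label)
    plt.ylabel("Bike Count at Noon")
    plt.xlabel(label)
    plt.show()
```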
4228
02:42:06,160 --> 02:42:08,080
all right
4229
02:42:08,080 --> 02:42:10,080
so it seems like functional is not
4230
02:42:10,080 --> 02:42:12,640
really uh
4231
02:42:12,640 --> 02:42:16,319
doesn't really give us any utility
4232
02:42:16,319 --> 02:42:18,479
so then snow
4233
02:42:18,479 --> 02:42:19,520
rain
4234
02:42:19,520 --> 02:42:20,479
um
4235
02:42:20,479 --> 02:42:23,040
seems like this radiation
4236
02:42:23,040 --> 02:42:25,359
you know is fairly linear
4237
02:42:25,359 --> 02:42:26,880
dew point temperature
4238
02:42:26,880 --> 02:42:28,720
visibility
4239
02:42:28,720 --> 02:42:30,960
uh wind doesn't really seem like it does
4240
02:42:30,960 --> 02:42:32,000
much
4241
02:42:32,000 --> 02:42:34,399
humidity kind of maybe like an inverse
4242
02:42:34,399 --> 02:42:36,000
relationship
4243
02:42:36,000 --> 02:42:37,280
but the temperature definitely looks
4244
02:42:37,280 --> 02:42:39,040
like there's a relationship between that
4245
02:42:39,040 --> 02:42:41,280
and the number of bikes right so what
4246
02:42:41,280 --> 02:42:42,479
i'm actually going to do is i'm going to
4247
02:42:42,479 --> 02:42:44,560
drop some of the ones that don't don't
4248
02:42:44,560 --> 02:42:46,479
seem like they really matter so
4249
02:42:46,479 --> 02:42:49,040
maybe wind
4250
02:42:49,040 --> 02:42:52,439
you know visibility
4251
02:42:54,240 --> 02:42:55,600
yeah so i'm going to get rid of wind
4252
02:42:55,600 --> 02:42:58,800
visibility and functional
4253
02:42:59,200 --> 02:43:01,200
so
4254
02:43:01,200 --> 02:43:03,920
let me now data frame
4255
02:43:03,920 --> 02:43:06,840
and i'm going to drop
4256
02:43:06,840 --> 02:43:09,520
wind visibility
4257
02:43:09,520 --> 02:43:11,760
and functional
4258
02:43:11,760 --> 02:43:13,200
all right
4259
02:43:13,200 --> 02:43:15,359
and the axis again is the column so
4260
02:43:15,359 --> 02:43:16,880
that's one
4261
02:43:16,880 --> 02:43:20,240
so if i look at my data set now
4262
02:43:20,240 --> 02:43:22,240
i have just the temperature the humidity
4263
02:43:22,240 --> 02:43:24,720
the dew point temperature radiation rain
4264
02:43:24,720 --> 02:43:26,080
and snow
4265
02:43:26,080 --> 02:43:28,800
so again what i want to do is i want to
4266
02:43:28,800 --> 02:43:30,880
split this into my training
4267
02:43:30,880 --> 02:43:34,240
my validation and my test data set
4268
02:43:34,240 --> 02:43:37,439
just as we talked before
4269
02:43:37,439 --> 02:43:38,399
here
4270
02:43:38,399 --> 02:43:40,319
uh we can use the exact same thing that
4271
02:43:40,319 --> 02:43:44,479
we just did and we can say numpy.split
4272
02:43:44,479 --> 02:43:47,120
and sample you know that the whole
4273
02:43:47,120 --> 02:43:48,160
sample
4274
02:43:48,160 --> 02:43:53,279
um and then create our splits
4275
02:43:53,920 --> 02:43:57,439
of the data frame
4276
02:43:57,840 --> 02:43:59,840
and we're going to do that but now set
4277
02:43:59,840 --> 02:44:02,240
this to 8.
4278
02:44:02,240 --> 02:44:04,560
okay
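The three-way split narrated here — shuffle with `sample(frac=1)`, then cut with `np.split` at the 60% and 80% marks (the "8" spoken above is presumably 0.8) — looks like this on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"bike_count": range(10), "temp": range(10)})

# Shuffle all rows, then split at 60% and 80% of the length:
# 60% train, 20% validation, 20% test
train, val, test = np.split(
    df.sample(frac=1, random_state=0),
    [int(0.6 * len(df)), int(0.8 * len(df))],
)
print(len(train), len(val), len(test))  # → 6 2 2
```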
4279
02:44:04,560 --> 02:44:06,319
so i don't really care about you know
4280
02:44:06,319 --> 02:44:07,680
the the full
4281
02:44:07,680 --> 02:44:10,080
grid the full array so i'm just gonna
4282
02:44:10,080 --> 02:44:10,880
use
4283
02:44:10,880 --> 02:44:13,120
an underscore for that variable
4284
02:44:13,120 --> 02:44:16,560
but i will get my training
4285
02:44:16,560 --> 02:44:22,399
x and y's and actually i don't have a um
4286
02:44:22,399 --> 02:44:23,439
function
4287
02:44:23,439 --> 02:44:27,520
for getting the x and y's so here
4288
02:44:27,520 --> 02:44:30,080
i'm going to write a function to find
4289
02:44:30,080 --> 02:44:31,600
get x y
4290
02:44:31,600 --> 02:44:32,479
and
4291
02:44:32,479 --> 02:44:33,279
uh
4292
02:44:33,279 --> 02:44:35,120
i'm going to pass in the data frame and
4293
02:44:35,120 --> 02:44:36,560
i'm actually going to pass in what the
4294
02:44:36,560 --> 02:44:39,120
name of the y label is and
4295
02:44:39,120 --> 02:44:42,160
what the x what specific x labels i want
4296
02:44:42,160 --> 02:44:43,279
to
4297
02:44:43,279 --> 02:44:44,319
look at
4298
02:44:44,319 --> 02:44:45,120
so
4299
02:44:45,120 --> 02:44:47,920
here if that's none then i'm just not
4300
02:44:47,920 --> 02:44:49,520
like i'm only going to i'm going to get
4301
02:44:49,520 --> 02:44:51,520
everything from the data set that's not
4302
02:44:51,520 --> 02:44:53,040
the y label
4303
02:44:53,040 --> 02:44:55,840
so here i'm actually going to
4304
02:44:55,840 --> 02:44:59,520
make first a deep copy
4305
02:44:59,520 --> 02:45:01,760
of my data frame
4306
02:45:01,760 --> 02:45:02,960
and
4307
02:45:02,960 --> 02:45:04,479
that basically means i'm just copying
4308
02:45:04,479 --> 02:45:05,840
everything over
4309
02:45:05,840 --> 02:45:06,800
if
4310
02:45:06,800 --> 02:45:10,399
uh if like x labels is none so if not x
4311
02:45:10,399 --> 02:45:11,439
labels
4312
02:45:11,439 --> 02:45:13,279
then all i'm going to do is say all
4313
02:45:13,279 --> 02:45:15,359
right x is going to be whatever this
4314
02:45:15,359 --> 02:45:17,040
data frame is
4315
02:45:17,040 --> 02:45:18,319
and i'm just going to take all the
4316
02:45:18,319 --> 02:45:21,120
columns so c for c and
4317
02:45:21,120 --> 02:45:22,479
data frame
4318
02:45:22,479 --> 02:45:23,920
dot columns
4319
02:45:23,920 --> 02:45:27,439
if c does not equal the y label
4320
02:45:27,439 --> 02:45:29,520
all right and i'm gonna get the values
4321
02:45:29,520 --> 02:45:31,200
from that
4322
02:45:31,200 --> 02:45:33,040
but if there is
4323
02:45:33,040 --> 02:45:34,960
the x labels
4324
02:45:34,960 --> 02:45:36,640
well okay so
4325
02:45:36,640 --> 02:45:38,880
in order to index only one thing so like
4326
02:45:38,880 --> 02:45:40,560
let's say i pass in only one thing in
4327
02:45:40,560 --> 02:45:43,200
here um
4328
02:45:43,200 --> 02:45:46,319
then my data frame is
4329
02:45:46,319 --> 02:45:47,040
so
4330
02:45:47,040 --> 02:45:49,279
let me make a case for that so if the
4331
02:45:49,279 --> 02:45:52,000
length of x labels is equal to one
4332
02:45:52,000 --> 02:45:54,560
then what i'm going to do is just say
4333
02:45:54,560 --> 02:45:55,359
that
4334
02:45:55,359 --> 02:45:56,960
this
4335
02:45:56,960 --> 02:45:58,960
is going to be
4336
02:45:58,960 --> 02:46:02,640
uh x labels and add that just that label
4337
02:46:02,640 --> 02:46:04,319
um
4338
02:46:04,319 --> 02:46:06,720
values and i actually need to reshape to
4339
02:46:06,720 --> 02:46:08,080
make this 2d
4340
02:46:08,080 --> 02:46:09,760
so i'm going to pass a negative 1 comma
4341
02:46:09,760 --> 02:46:10,479
1
4342
02:46:10,479 --> 02:46:11,600
there
4343
02:46:11,600 --> 02:46:14,880
now otherwise if i have like a list of
4344
02:46:14,880 --> 02:46:17,520
specific x labels that i want to use
4345
02:46:17,520 --> 02:46:19,359
then i'm actually just going to say x is
4346
02:46:19,359 --> 02:46:22,479
equal to data frame of those x labels
4347
02:46:22,479 --> 02:46:27,279
dot values and that should suffice
4348
02:46:27,279 --> 02:46:28,640
all right so now that's just me
4349
02:46:28,640 --> 02:46:30,319
extracting x
4350
02:46:30,319 --> 02:46:32,640
and in order to get my y i'm going to do
4351
02:46:32,640 --> 02:46:34,800
y equals data frame
4352
02:46:34,800 --> 02:46:37,760
and then pass in the y label
4353
02:46:37,760 --> 02:46:39,359
and at the very end i'm going to say
4354
02:46:39,359 --> 02:46:42,080
data equals np
4355
02:46:42,080 --> 02:46:44,720
dot h stack so i'm stacking them
4356
02:46:44,720 --> 02:46:47,600
horizontally one next to each other
4357
02:46:47,600 --> 02:46:49,840
and i'll take x and y
4358
02:46:49,840 --> 02:46:53,200
and return that oh but
4359
02:46:53,200 --> 02:46:55,120
uh this needs to be values and i'm
4360
02:46:55,120 --> 02:46:56,640
actually going to reshape this to make
4361
02:46:56,640 --> 02:46:58,800
it 2d as well so that we can do this h
4362
02:46:58,800 --> 02:46:59,920
stack
4363
02:46:59,920 --> 02:47:04,319
and i will return data x y
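Assembled, the helper being written across these steps comes out roughly like this (names follow the narration; the demo frame at the bottom is an assumption):

```python
import copy
import numpy as np
import pandas as pd

def get_xy(dataframe, y_label, x_labels=None):
    df = copy.deepcopy(dataframe)  # work on a copy, not the caller's frame
    if x_labels is None:
        # Use every column except the label column
        X = df[[c for c in df.columns if c != y_label]].values
    elif len(x_labels) == 1:
        # A single feature comes out 1-D, so reshape it into a column vector
        X = df[x_labels[0]].values.reshape(-1, 1)
    else:
        X = df[x_labels].values

    y = df[y_label].values.reshape(-1, 1)
    data = np.hstack((X, y))  # features and label stacked side by side
    return data, X, y

df = pd.DataFrame({"bike_count": [1, 2, 3], "temp": [10.0, 12.0, 15.0]})
data, X, y = get_xy(df, "bike_count", x_labels=["temp"])
print(X.shape, y.shape)  # → (3, 1) (3, 1)
```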
4364
02:47:04,960 --> 02:47:06,720
so now i should be able to say okay get
4365
02:47:06,720 --> 02:47:08,080
x y
4366
02:47:08,080 --> 02:47:09,359
and
4367
02:47:09,359 --> 02:47:11,120
take that data frame
4368
02:47:11,120 --> 02:47:14,160
and the y label so our y label is bike
4369
02:47:14,160 --> 02:47:15,520
count
4370
02:47:15,520 --> 02:47:18,080
and actually so for the x label i'm
4371
02:47:18,080 --> 02:47:19,600
actually going to
4372
02:47:19,600 --> 02:47:21,439
let's just do like one dimension right
4373
02:47:21,439 --> 02:47:24,000
now and earlier i got rid of the plots
4374
02:47:24,000 --> 02:47:26,720
but we had seen that maybe you know the
4375
02:47:26,720 --> 02:47:29,600
temperature dimension does really well
4376
02:47:29,600 --> 02:47:31,439
and we might be able to use that to
4377
02:47:31,439 --> 02:47:33,200
predict why
4378
02:47:33,200 --> 02:47:35,840
so
4379
02:47:35,840 --> 02:47:37,760
i'm going to label this also that you
4380
02:47:37,760 --> 02:47:41,040
know it's just using the temperature
4381
02:47:41,040 --> 02:47:44,240
and i am also going to do this again
4382
02:47:44,240 --> 02:47:45,600
for
4383
02:47:45,600 --> 02:47:47,840
oh this should be
4384
02:47:47,840 --> 02:47:49,600
and this should be a validation and
4385
02:47:49,600 --> 02:47:52,080
there should be a test
4386
02:47:52,080 --> 02:47:55,680
um because oh that's val
4387
02:47:55,680 --> 02:47:57,439
all right
4388
02:47:57,439 --> 02:47:59,279
but here
4389
02:47:59,279 --> 02:48:00,840
it should be
4390
02:48:00,840 --> 02:48:04,000
val this should be test
4391
02:48:04,000 --> 02:48:06,080
all right so we run this and now we have
4392
02:48:06,080 --> 02:48:08,560
our training validation and test
4393
02:48:08,560 --> 02:48:11,120
data sets for just the temperature so if
4394
02:48:11,120 --> 02:48:13,840
i look at x train
4395
02:48:13,840 --> 02:48:17,279
temp it's literally just the temperature
4396
02:48:17,279 --> 02:48:18,560
okay and i'm doing this first to show
4397
02:48:18,560 --> 02:48:21,439
you simple linear regression
4398
02:48:21,439 --> 02:48:23,359
all right so right now i can create a
4399
02:48:23,359 --> 02:48:24,720
regressor
4400
02:48:24,720 --> 02:48:26,160
so i can say
4401
02:48:26,160 --> 02:48:28,479
the temp regressor here
4402
02:48:28,479 --> 02:48:30,720
and then i'm going to you know make a
4403
02:48:30,720 --> 02:48:32,399
linear regression model and just like
4404
02:48:32,399 --> 02:48:34,479
before i can
4405
02:48:34,479 --> 02:48:35,840
simply
4406
02:48:35,840 --> 02:48:39,600
fit my x train temp y train
4407
02:48:39,600 --> 02:48:42,080
temp in order to train this linear
4408
02:48:42,080 --> 02:48:44,479
regression model
4409
02:48:44,479 --> 02:48:47,279
all right and then i can also
4410
02:48:47,279 --> 02:48:48,880
i can print
4411
02:48:48,880 --> 02:48:51,040
this
4412
02:48:51,040 --> 02:48:53,680
regressor's coefficients
4413
02:48:53,680 --> 02:48:54,560
and
4414
02:48:54,560 --> 02:48:58,359
the intercept so
4415
02:48:58,640 --> 02:49:00,000
if i do that
4416
02:49:00,000 --> 02:49:02,319
okay this is the coefficient for
4417
02:49:02,319 --> 02:49:04,399
whatever the temperature is and then the
4418
02:49:04,399 --> 02:49:06,560
the x-intercept
4419
02:49:06,560 --> 02:49:08,000
okay
4420
02:49:08,000 --> 02:49:10,800
or the y-intercept sorry
4421
02:49:10,800 --> 02:49:12,399
all right
4422
02:49:12,399 --> 02:49:16,479
and i can you know score so i can get
4423
02:49:16,479 --> 02:49:18,640
the um
4424
02:49:18,640 --> 02:49:20,560
the r squared
4425
02:49:20,560 --> 02:49:21,760
score
4426
02:49:21,760 --> 02:49:24,000
so i can score
4427
02:49:24,000 --> 02:49:25,439
x
4428
02:49:25,439 --> 02:49:27,840
test
4429
02:49:28,240 --> 02:49:31,600
and y test
4430
02:49:32,640 --> 02:49:35,040
all right so it's an r squared of around
4431
02:49:35,040 --> 02:49:37,279
0.38 which is better than zero which
4432
02:49:37,279 --> 02:49:38,800
would mean hey there's absolutely no
4433
02:49:38,800 --> 02:49:41,680
association but it's also not you know
4434
02:49:41,680 --> 02:49:42,479
like
4435
02:49:42,479 --> 02:49:44,000
a
4436
02:49:44,000 --> 02:49:46,479
good it depends on the context but
4437
02:49:46,479 --> 02:49:48,000
you know the higher that number it means
4438
02:49:48,000 --> 02:49:49,680
the higher that the two variables would
4439
02:49:49,680 --> 02:49:52,160
be correlated right which
4440
02:49:52,160 --> 02:49:53,279
here it's
4441
02:49:53,279 --> 02:49:55,120
all right it just means there's maybe
4442
02:49:55,120 --> 02:49:58,399
some association between the two
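The scikit-learn calls narrated in this stretch — fit a `LinearRegression`, inspect `coef_` and `intercept_`, then `score` (R²) — on a toy 1-D dataset standing in for the temperature feature:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: y ≈ 3x + 5 with a little noise, over the temperature range
rng = np.random.default_rng(0)
X_train = rng.uniform(-20, 40, size=(100, 1))
y_train = 3 * X_train[:, 0] + 5 + rng.normal(0, 2, size=100)

temp_reg = LinearRegression()
temp_reg.fit(X_train, y_train)

print(temp_reg.coef_, temp_reg.intercept_)  # slope and y-intercept
print(temp_reg.score(X_train, y_train))     # R^2: 1 = perfect, 0 = no linear fit
```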
4443
02:49:58,399 --> 02:50:00,240
but uh the reason why i wanted to do
4444
02:50:00,240 --> 02:50:02,399
this one d was to show you
4445
02:50:02,399 --> 02:50:04,479
you know if we plotted this this is what
4446
02:50:04,479 --> 02:50:06,720
it would look like so if i
4447
02:50:06,720 --> 02:50:08,800
uh create a scatter plot
4448
02:50:08,800 --> 02:50:09,680
and
4449
02:50:09,680 --> 02:50:12,880
let's take the training
4450
02:50:15,520 --> 02:50:17,920
um
4451
02:50:17,920 --> 02:50:20,560
so this is our data and then let's make
4452
02:50:20,560 --> 02:50:22,800
it blue
4453
02:50:22,800 --> 02:50:26,880
and then if i also plotted so
4454
02:50:26,880 --> 02:50:28,640
something that i can do is say you know
4455
02:50:28,640 --> 02:50:31,279
the x range that i'm going to plot it
4456
02:50:31,279 --> 02:50:32,240
is
4457
02:50:32,240 --> 02:50:34,399
linspace and this goes from
4458
02:50:34,399 --> 02:50:37,279
negative 20 to 40 this piece of data so
4459
02:50:37,279 --> 02:50:39,520
i'm going to say let's take 100
4460
02:50:39,520 --> 02:50:41,840
things from there
4461
02:50:41,840 --> 02:50:43,840
so i'm going to
4462
02:50:43,840 --> 02:50:48,160
plot x and i'm going to take this
4463
02:50:48,160 --> 02:50:51,840
temp this like regressor and predict
4464
02:50:51,840 --> 02:50:52,720
x
4465
02:50:52,720 --> 02:50:53,760
with that
4466
02:50:53,760 --> 02:50:55,520
okay and this label
4467
02:50:55,520 --> 02:50:57,359
i'm going to label that
4468
02:50:57,359 --> 02:50:58,319
um
4469
02:50:58,319 --> 02:50:59,520
the
4470
02:50:59,520 --> 02:51:01,760
fit
4471
02:51:02,080 --> 02:51:05,439
and this color let's make this red
4472
02:51:05,439 --> 02:51:06,160
and let's actually
4473
02:51:06,160 --> 02:51:07,680
[Music]
4474
02:51:07,680 --> 02:51:10,399
set the line width so i can change
4475
02:51:10,399 --> 02:51:11,840
how thick
4476
02:51:11,840 --> 02:51:14,080
that value is
4477
02:51:14,080 --> 02:51:15,920
okay
4478
02:51:15,920 --> 02:51:18,800
now at the very end uh let's create a
4479
02:51:18,800 --> 02:51:20,720
legend
4480
02:51:20,720 --> 02:51:23,040
and let's
4481
02:51:23,040 --> 02:51:24,479
all right let's also create you know
4482
02:51:24,479 --> 02:51:26,640
title
4483
02:51:26,640 --> 02:51:29,600
all these things that matter
4484
02:51:29,600 --> 02:51:31,040
in some sense
4485
02:51:31,040 --> 02:51:33,840
so here let's just say um
4486
02:51:33,840 --> 02:51:36,000
this would be the bikes
4487
02:51:36,000 --> 02:51:38,000
versus the temperature
4488
02:51:38,000 --> 02:51:41,600
right and the y label would be
4489
02:51:41,600 --> 02:51:43,760
number of bikes
4490
02:51:43,760 --> 02:51:48,080
and the x label would be the temperature
4491
02:51:48,080 --> 02:51:50,160
so i actually think that this might
4492
02:51:50,160 --> 02:51:52,560
cause an error yeah
4493
02:51:52,560 --> 02:51:54,880
so it's expecting a 2d array so we
4494
02:51:54,880 --> 02:51:56,479
actually have to
4495
02:51:56,479 --> 02:51:58,240
reshape
4496
02:51:58,240 --> 02:51:59,439
this
4497
02:51:59,439 --> 02:52:01,840
let's
4498
02:52:03,800 --> 02:52:04,060
[Applause]
4499
02:52:04,060 --> 02:52:07,160
[Music]
4500
02:52:08,640 --> 02:52:10,160
okay there we go
4501
02:52:10,160 --> 02:52:11,760
so i just had to make this an array and
4502
02:52:11,760 --> 02:52:15,439
then reshape it so it was 2d now we see
4503
02:52:15,439 --> 02:52:17,680
that all right this
4504
02:52:17,680 --> 02:52:19,840
increases but again remember those
4505
02:52:19,840 --> 02:52:21,200
assumptions that we had about linear
4506
02:52:21,200 --> 02:52:22,960
regression like this i don't really know
4507
02:52:22,960 --> 02:52:24,640
if this
4508
02:52:24,640 --> 02:52:26,319
fits those assumptions
4509
02:52:26,319 --> 02:52:27,600
right i just wanted to show you guys
4510
02:52:27,600 --> 02:52:29,359
though that like
4511
02:52:29,359 --> 02:52:31,120
all right this is what a line of best
4512
02:52:31,120 --> 02:52:35,040
fit through this data would look like
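The fit-line plot built here — scatter the training data in blue, then draw the model's predictions over an `np.linspace` grid reshaped to 2-D (the reshape that fixed the error in the video) in red — sketched on the same toy data:

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe; drop this when running interactively
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X_train = rng.uniform(-20, 40, size=(50, 1))
y_train = 3 * X_train[:, 0] + 5 + rng.normal(0, 2, size=50)

reg = LinearRegression().fit(X_train, y_train)

plt.scatter(X_train, y_train, label="Data", color="blue")
x = np.linspace(-20, 40, 100)               # the range the temperatures span
plt.plot(x, reg.predict(x.reshape(-1, 1)),  # predict() expects a 2-D array
         label="Fit", color="red", linewidth=3)
plt.legend()
plt.title("Bikes vs Temp")
plt.ylabel("Number of bikes")
plt.xlabel("Temp")
plt.show()
```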
4513
02:52:35,520 --> 02:52:37,760
okay
4514
02:52:37,840 --> 02:52:39,120
now
4515
02:52:39,120 --> 02:52:42,240
we can do multiple linear regression
4516
02:52:42,240 --> 02:52:44,560
right
4517
02:52:45,200 --> 02:52:47,200
so i'm going to go ahead and do that as
4518
02:52:47,200 --> 02:52:48,720
well
4519
02:52:48,720 --> 02:52:49,520
now
4520
02:52:49,520 --> 02:52:52,319
if i take
4521
02:52:52,399 --> 02:52:54,560
my data set
4522
02:52:54,560 --> 02:52:57,279
and instead of the labels
4523
02:52:57,279 --> 02:52:59,279
so actually what's my current data set
4524
02:52:59,279 --> 02:53:01,760
right now
4525
02:53:06,319 --> 02:53:08,160
all right so let's just use all of these
4526
02:53:08,160 --> 02:53:10,800
except for the bike count right so i'm
4527
02:53:10,800 --> 02:53:14,640
going to just say for the x labels
4528
02:53:15,279 --> 02:53:17,279
let's just take the data frames columns
4529
02:53:17,279 --> 02:53:18,080
and
4530
02:53:18,080 --> 02:53:20,000
just remove the bike count
4531
02:53:20,000 --> 02:53:24,120
so does that work
4532
02:53:24,800 --> 02:53:27,680
so this part should be if x labels
4533
02:53:27,680 --> 02:53:30,080
is none
4534
02:53:30,080 --> 02:53:32,800
and then this should work now
4535
02:53:32,800 --> 02:53:34,240
oops sorry
4536
02:53:34,240 --> 02:53:36,000
okay so i have
4537
02:53:36,000 --> 02:53:38,720
oh but this here because it's not just
4538
02:53:38,720 --> 02:53:40,479
the temperature
4539
02:53:40,479 --> 02:53:41,840
anymore
4540
02:53:41,840 --> 02:53:44,800
we should actually do this um let's say
4541
02:53:44,800 --> 02:53:45,920
all
4542
02:53:45,920 --> 02:53:48,800
right so i'm just going to quickly rerun
4543
02:53:48,800 --> 02:53:50,000
this
4544
02:53:50,000 --> 02:53:51,359
piece here so that we have our
4545
02:53:51,359 --> 02:53:53,359
temperature only data set and now we
4546
02:53:53,359 --> 02:53:55,840
have our all data set
4547
02:53:55,840 --> 02:53:56,640
okay
4548
02:53:56,640 --> 02:53:59,279
and this regressor i can do the same
4549
02:53:59,279 --> 02:54:02,560
thing so i can do the all regressor
4550
02:54:02,560 --> 02:54:04,840
and i'm going to make this the linear
4551
02:54:04,840 --> 02:54:06,560
regression
4552
02:54:06,560 --> 02:54:08,880
and
4553
02:54:08,880 --> 02:54:12,560
i'm going to fit this to x train all and
4554
02:54:12,560 --> 02:54:13,680
y
4555
02:54:13,680 --> 02:54:16,080
train all
4556
02:54:16,080 --> 02:54:16,880
okay
4557
02:54:16,880 --> 02:54:18,319
all right so let's go ahead and also
4558
02:54:18,319 --> 02:54:20,640
score this regressor and let's see how
4559
02:54:20,640 --> 02:54:23,279
the r squared performs now so if i test
4560
02:54:23,279 --> 02:54:24,319
this
4561
02:54:24,319 --> 02:54:28,160
on the test data set what happens
4562
02:54:29,279 --> 02:54:30,720
all right so our r squared seems to
4563
02:54:30,720 --> 02:54:34,640
improve it went from 0.4 to 0.5 which is
4564
02:54:34,640 --> 02:54:36,880
a good sign
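Multiple linear regression, as described, is the same `fit`/`score` call with every feature column instead of just the temperature. A sketch on synthetic multi-feature data (the six columns stand in for temp, humidity, dew point, radiation, rain, snow):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X_all = rng.normal(size=(200, 6))  # six features, like the trimmed data frame
true_w = np.array([3.0, -1.0, 0.5, 2.0, -0.2, 0.1])
y_all = X_all @ true_w + rng.normal(0, 0.5, size=200)

all_reg = LinearRegression().fit(X_all, y_all)
# With more informative features, R^2 should beat the single-feature fit
print(all_reg.score(X_all, y_all))
```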
4565
02:54:36,880 --> 02:54:38,240
okay
4566
02:54:38,240 --> 02:54:41,680
and i can't necessarily plot you know
4567
02:54:41,680 --> 02:54:43,920
every single dimension but this just
4568
02:54:43,920 --> 02:54:45,840
this is just to say okay this has this
4569
02:54:45,840 --> 02:54:48,800
is improved right all right so one cool
4570
02:54:48,800 --> 02:54:50,240
thing that you can do with tensorflow is
4571
02:54:50,240 --> 02:54:51,359
you can actually
4572
02:54:51,359 --> 02:54:52,880
do regression
4573
02:54:52,880 --> 02:54:55,920
but with a neural net
4574
02:54:56,960 --> 02:54:58,640
so
4575
02:54:58,640 --> 02:55:00,840
here i'm going
4576
02:55:00,840 --> 02:55:04,160
to um
4577
02:55:04,160 --> 02:55:06,319
we already have our our training data
4578
02:55:06,319 --> 02:55:08,160
for just the temperature and just you
4579
02:55:08,160 --> 02:55:10,000
know for all the different columns so
4580
02:55:10,000 --> 02:55:11,520
i'm not going to bother with splitting
4581
02:55:11,520 --> 02:55:13,200
up the data again
4582
02:55:13,200 --> 02:55:14,240
i'm just going to go ahead and start
4583
02:55:14,240 --> 02:55:16,880
building the model so
4584
02:55:16,880 --> 02:55:18,720
in this linear regression model uh
4585
02:55:18,720 --> 02:55:19,920
typically
4586
02:55:19,920 --> 02:55:23,600
you know it does help if we normalize it
4587
02:55:23,600 --> 02:55:25,680
so that's very easy to do with
4588
02:55:25,680 --> 02:55:28,000
tensorflow i can just create some
4589
02:55:28,000 --> 02:55:32,160
normalizer layer so i'm going to do
4590
02:55:32,160 --> 02:55:34,479
tensorflow keras layers
4591
02:55:34,479 --> 02:55:36,319
and get the normalization
4592
02:55:36,319 --> 02:55:37,359
layer
4593
02:55:37,359 --> 02:55:39,279
and the input shape
4594
02:55:39,279 --> 02:55:40,479
for that
4595
02:55:40,479 --> 02:55:42,319
will just be one because let's just do
4596
02:55:42,319 --> 02:55:44,960
it again on just the temperature
4597
02:55:44,960 --> 02:55:47,279
and the axis i will
4598
02:55:47,279 --> 02:55:49,520
make none
4599
02:55:49,520 --> 02:55:52,240
now for this temp normalizer
4600
02:55:52,240 --> 02:55:54,080
and i should have had an equal sign
4601
02:55:54,080 --> 02:55:54,960
there
4602
02:55:54,960 --> 02:55:58,800
um i'm going to adapt this to x
4603
02:55:58,800 --> 02:56:00,080
train
4604
02:56:00,080 --> 02:56:01,200
temp
4605
02:56:01,200 --> 02:56:02,000
and
4606
02:56:02,000 --> 02:56:06,399
reshape this to just a single vector
4607
02:56:06,399 --> 02:56:10,160
so that should work great now with this
4608
02:56:10,160 --> 02:56:13,600
model so temp neural net model what i
4609
02:56:13,600 --> 02:56:15,439
can do is i can do
4610
02:56:15,439 --> 02:56:17,359
you know tf.keras
4611
02:56:17,359 --> 02:56:19,760
that's sequential
4612
02:56:19,760 --> 02:56:22,479
and i'm going to pass in this normalizer
4613
02:56:22,479 --> 02:56:23,359
layer
4614
02:56:23,359 --> 02:56:25,200
and then i'm going to say hey just give
4615
02:56:25,200 --> 02:56:27,840
me one single dense layer with one
4616
02:56:27,840 --> 02:56:30,000
single unit and what that's doing is
4617
02:56:30,000 --> 02:56:32,080
saying all right
4618
02:56:32,080 --> 02:56:34,880
well one single node just means that
4619
02:56:34,880 --> 02:56:37,040
it's linear and if you don't add any
4620
02:56:37,040 --> 02:56:38,720
sort of activation function to it the
4621
02:56:38,720 --> 02:56:40,800
output is also linear
4622
02:56:40,800 --> 02:56:43,279
so here i'm going to have tensorflow
4623
02:56:43,279 --> 02:56:44,720
keras
4624
02:56:44,720 --> 02:56:46,720
layers.dense
4625
02:56:46,720 --> 02:56:48,160
and i'm just
4626
02:56:48,160 --> 02:56:50,160
gonna have one unit
4627
02:56:50,160 --> 02:56:52,640
and that's going to be my model
4628
02:56:52,640 --> 02:56:54,399
okay
4629
02:56:54,399 --> 02:56:55,520
so
4630
02:56:55,520 --> 02:56:58,560
uh with this
4631
02:56:59,120 --> 02:57:00,319
model
4632
02:57:00,319 --> 02:57:02,880
let's compile
4633
02:57:02,880 --> 02:57:06,840
and for our optimizer um let's
4634
02:57:06,840 --> 02:57:08,640
use
4635
02:57:08,640 --> 02:57:11,359
let's use adam again
4636
02:57:11,359 --> 02:57:12,640
optimizers
4637
02:57:12,640 --> 02:57:14,800
dot adam and we have to pass in the
4638
02:57:14,800 --> 02:57:16,800
learning rate
4639
02:57:16,800 --> 02:57:19,680
so learning rate and our learning rate
4640
02:57:19,680 --> 02:57:22,319
let's do 0.01
4641
02:57:22,319 --> 02:57:23,279
and now
4642
02:57:23,279 --> 02:57:25,200
the loss we
4643
02:57:25,200 --> 02:57:28,720
actually let's give this one 0.1 and the
4644
02:57:28,720 --> 02:57:32,479
loss i'm going to do mean squared error
4645
02:57:32,479 --> 02:57:34,880
okay so we run that we've compiled it
4646
02:57:34,880 --> 02:57:36,800
okay great
4647
02:57:36,800 --> 02:57:40,560
and just like before we can call history
4648
02:57:40,560 --> 02:57:43,680
and i'm going to fit this model so
4649
02:57:43,680 --> 02:57:46,080
here if i call fit
4650
02:57:46,080 --> 02:57:47,920
i can just fit it and i'm going to take
4651
02:57:47,920 --> 02:57:48,960
the
4652
02:57:48,960 --> 02:57:51,920
uh x train with the temperature
4653
02:57:51,920 --> 02:57:54,399
but reshape it
4654
02:57:54,399 --> 02:57:57,680
um y train for the temperature
4655
02:57:57,680 --> 02:58:00,240
and i'm going to set verbose equal to
4656
02:58:00,240 --> 02:58:02,479
zero so that it doesn't you know display
4657
02:58:02,479 --> 02:58:03,359
stuff
4658
02:58:03,359 --> 02:58:05,200
i'm actually going to set epochs equal
4659
02:58:05,200 --> 02:58:07,840
to let's do 1000
4660
02:58:07,840 --> 02:58:09,760
um
4661
02:58:09,760 --> 02:58:12,479
and the validation
4662
02:58:12,479 --> 02:58:14,479
data should be let's pass in the
4663
02:58:14,479 --> 02:58:18,560
validation data set here
4664
02:58:18,560 --> 02:58:20,240
as a tuple
4665
02:58:20,240 --> 02:58:23,200
and i know i spelled that wrong
4666
02:58:23,200 --> 02:58:26,640
so let's just run this
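[editor's note: what she typed here, a single Dense unit compiled with Adam and mean-squared-error loss, then fit against the reshaped temperature data with the validation data passed as a tuple, comes out to roughly this sketch; synthetic arrays stand in for the notebook's x train / y train temperature data]

```python
import numpy as np
import tensorflow as tf

# synthetic stand-ins for the video's temperature / bike-count arrays
rng = np.random.default_rng(0)
x_train = rng.uniform(-10, 35, 200)
y_train = 20 * x_train + 50 + rng.normal(0, 30, 200)
x_val, y_val = x_train[:40], y_train[:40]

# one Dense layer with one unit and no activation -> a linear model
temp_nn_model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, input_shape=(1,)),
])
temp_nn_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
    loss="mean_squared_error",
)

history = temp_nn_model.fit(
    x_train.reshape(-1, 1), y_train,
    validation_data=(x_val.reshape(-1, 1), y_val),
    verbose=0,
    epochs=100,  # the video uses 1000; fewer here to keep the sketch quick
)
```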
4667
02:58:27,040 --> 02:58:28,720
and up here i've copy and pasted the
4668
02:58:28,720 --> 02:58:31,040
plot loss from our previous but change
4669
02:58:31,040 --> 02:58:33,920
the y label to mse because now we're
4670
02:58:33,920 --> 02:58:35,279
talking we're dealing with mean squared
4671
02:58:35,279 --> 02:58:36,479
error
4672
02:58:36,479 --> 02:58:37,520
and
4673
02:58:37,520 --> 02:58:39,120
i'm going to plot the loss of this
4674
02:58:39,120 --> 02:58:41,279
history after it's done so let's just
4675
02:58:41,279 --> 02:58:43,040
wait for this to finish training and
4676
02:58:43,040 --> 02:58:46,439
then to plot
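[editor's note: the plot loss helper she copied down, with the y-label changed to MSE, is presumably something like the following; called as plot_loss(history) once training finishes]

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

def plot_loss(history):
    # training and validation loss curves from a Keras History object
    plt.plot(history.history["loss"], label="loss")
    plt.plot(history.history["val_loss"], label="val_loss")
    plt.xlabel("Epoch")
    plt.ylabel("MSE")
    plt.legend()
    plt.grid(True)
```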
4677
02:58:49,359 --> 02:58:51,520
okay so this actually looks pretty good
4678
02:58:51,520 --> 02:58:54,640
we see that the values are converging
4679
02:58:54,640 --> 02:58:57,840
so now what i can do is i'm going to
4680
02:58:57,840 --> 02:59:02,399
go back up and take this plot
4681
02:59:02,880 --> 02:59:04,960
and we are going to just run that plot
4682
02:59:04,960 --> 02:59:07,120
again so
4683
02:59:07,120 --> 02:59:08,160
here
4684
02:59:08,160 --> 02:59:09,120
um
4685
02:59:09,120 --> 02:59:10,960
instead of
4686
02:59:10,960 --> 02:59:12,720
this temperature regressor i'm going to
4687
02:59:12,720 --> 02:59:16,160
use the neural net regressor
4688
02:59:16,160 --> 02:59:19,279
this neural net model
4689
02:59:19,840 --> 02:59:22,880
and if i run that i can see that you
4690
02:59:22,880 --> 02:59:24,640
know this also gives me a linear
4691
02:59:24,640 --> 02:59:26,319
regressor
4692
02:59:26,319 --> 02:59:28,479
you'll notice that this this fit is not
4693
02:59:28,479 --> 02:59:31,040
entirely the same as the one
4694
02:59:31,040 --> 02:59:33,920
up here and that's due to the training
4695
02:59:33,920 --> 02:59:35,120
process
4696
02:59:35,120 --> 02:59:36,319
of
4697
02:59:36,319 --> 02:59:38,720
you know of this neural net so just two
4698
02:59:38,720 --> 02:59:40,960
different ways to try to find
4699
02:59:40,960 --> 02:59:42,960
the best linear regressor
4700
02:59:42,960 --> 02:59:45,200
okay but here we're using back
4701
02:59:45,200 --> 02:59:47,520
propagation to train a neural net node
4702
02:59:47,520 --> 02:59:49,680
whereas in the other one they probably
4703
02:59:49,680 --> 02:59:51,920
are not doing that okay they're probably
4704
02:59:51,920 --> 02:59:54,160
just trying to actually compute
4705
02:59:54,160 --> 02:59:57,359
the line of best fit so
4706
02:59:57,359 --> 02:59:59,439
okay given this
4707
02:59:59,439 --> 03:00:01,680
well we can repeat the exact same
4708
03:00:01,680 --> 03:00:03,040
exercise
4709
03:00:03,040 --> 03:00:04,479
with our
4710
03:00:04,479 --> 03:00:06,000
um
4711
03:00:06,000 --> 03:00:08,080
with our multiple linear regressions
4712
03:00:08,080 --> 03:00:09,200
okay
4713
03:00:09,200 --> 03:00:11,439
but i'm actually going to skip that part
4714
03:00:11,439 --> 03:00:13,359
i will leave that as an exercise to the
4715
03:00:13,359 --> 03:00:15,600
viewer okay so now what would happen if
4716
03:00:15,600 --> 03:00:18,000
we use a neural net a real neural net
4717
03:00:18,000 --> 03:00:19,920
instead of just you know one single node
4718
03:00:19,920 --> 03:00:22,720
in order to predict this so
4719
03:00:22,720 --> 03:00:24,720
let's start on that code we already have
4720
03:00:24,720 --> 03:00:26,240
our normalizer so i'm actually going to
4721
03:00:26,240 --> 03:00:27,520
take the same
4722
03:00:27,520 --> 03:00:30,880
uh setup here but instead of you know
4723
03:00:30,880 --> 03:00:32,800
this one dense layer i'm going to set
4724
03:00:32,800 --> 03:00:36,000
this equal to 32 units and for my
4725
03:00:36,000 --> 03:00:39,840
activation i'm going to use relu
4726
03:00:39,840 --> 03:00:41,760
and now let's
4727
03:00:41,760 --> 03:00:43,200
duplicate that
4728
03:00:43,200 --> 03:00:45,439
and for the final output i just want one
4729
03:00:45,439 --> 03:00:47,439
answer so i just want one unit
4730
03:00:47,439 --> 03:00:49,600
and this activation is also going to be
4731
03:00:49,600 --> 03:00:52,000
relu because i can't ever have less than
4732
03:00:52,000 --> 03:00:53,439
zero bikes so i'm just going to set that
4733
03:00:53,439 --> 03:00:54,800
as relu
4734
03:00:54,800 --> 03:00:55,920
i'm just going to name this the neural
4735
03:00:55,920 --> 03:00:58,160
net model okay
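[editor's note: her "real" network, two Dense layers of 32 ReLU units plus a single ReLU output so the predicted bike count can never go negative, would read roughly like this; the temperature normalizer is rebuilt on synthetic data so the sketch stands alone]

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
x_train = rng.uniform(-10, 35, 200)

# scalar normalizer for the single temperature feature
temp_normalizer = tf.keras.layers.Normalization(input_shape=(1,), axis=None)
temp_normalizer.adapt(x_train)

nn_model = tf.keras.Sequential([
    temp_normalizer,
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="relu"),  # ReLU output: never below zero bikes
])
nn_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="mean_squared_error",
)
preds = nn_model.predict(x_train.reshape(-1, 1), verbose=0)
```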
4736
03:00:58,160 --> 03:01:00,960
and at the bottom i'm going to have this
4737
03:01:00,960 --> 03:01:03,840
um neural net model
4738
03:01:03,840 --> 03:01:05,680
i'm going to have to know that model i'm
4739
03:01:05,680 --> 03:01:08,399
going to compile
4740
03:01:08,399 --> 03:01:08,840
um
4741
03:01:08,840 --> 03:01:10,640
[Music]
4742
03:01:10,640 --> 03:01:13,200
and i will actually use the same
4743
03:01:13,200 --> 03:01:14,960
compiler here
4744
03:01:14,960 --> 03:01:18,240
but instead of
4745
03:01:18,560 --> 03:01:20,800
instead of a learning rate of 0.01 i'll
4746
03:01:20,800 --> 03:01:23,040
use 0.001
4747
03:01:23,040 --> 03:01:24,800
okay
4748
03:01:24,800 --> 03:01:27,520
and i'm going to train this here so the
4749
03:01:27,520 --> 03:01:28,880
history
4750
03:01:28,880 --> 03:01:31,359
is this
4751
03:01:31,359 --> 03:01:35,680
neural net model um and i'm going to fit
4752
03:01:35,680 --> 03:01:41,120
that against x train temp y train temp
4753
03:01:41,120 --> 03:01:42,640
and
4754
03:01:42,640 --> 03:01:44,640
valid
4755
03:01:44,640 --> 03:01:46,640
validation
4756
03:01:46,640 --> 03:01:48,960
data i'm going to set this again equal
4757
03:01:48,960 --> 03:01:50,080
to
4758
03:01:50,080 --> 03:01:51,359
xval
4759
03:01:51,359 --> 03:01:53,120
temp and y
4760
03:01:53,120 --> 03:01:54,080
val
4761
03:01:54,080 --> 03:01:56,000
temp
4762
03:01:56,000 --> 03:01:56,960
now
4763
03:01:56,960 --> 03:01:58,479
for the verbose i'm going to set that
4764
03:01:58,479 --> 03:01:59,680
equal to zero
4765
03:01:59,680 --> 03:02:03,359
epochs let's do 100 and
4766
03:02:03,359 --> 03:02:05,600
here for the batch size actually let's
4767
03:02:05,600 --> 03:02:07,760
just not do a batch size right now
4768
03:02:07,760 --> 03:02:09,600
let's just try let's see what happens
4769
03:02:09,600 --> 03:02:11,840
here
4770
03:02:13,279 --> 03:02:15,600
and again we can plot the loss of this
4771
03:02:15,600 --> 03:02:18,479
history after it's done training
4772
03:02:18,479 --> 03:02:21,439
so let's just run this
4773
03:02:21,439 --> 03:02:22,720
and that's not what we're supposed to
4774
03:02:22,720 --> 03:02:26,479
get so what is going on
4775
03:02:26,479 --> 03:02:29,120
here is sequential we have our
4776
03:02:29,120 --> 03:02:30,479
temperature
4777
03:02:30,479 --> 03:02:32,319
normalizer
4778
03:02:32,319 --> 03:02:36,560
which i'm wondering now if we have to
4779
03:02:38,840 --> 03:02:42,920
re-do that
4780
03:02:45,600 --> 03:02:47,279
okay so we do see this
4781
03:02:47,279 --> 03:02:49,520
decline it's an interesting curve but we
4782
03:02:49,520 --> 03:02:51,760
do we do see it eventually
4783
03:02:51,760 --> 03:02:53,279
um
4784
03:02:53,279 --> 03:02:56,080
so this is our loss which all right it's
4785
03:02:56,080 --> 03:02:58,160
decreasing that's a good sign and
4786
03:02:58,160 --> 03:02:59,760
actually what's interesting is let's
4787
03:02:59,760 --> 03:03:02,640
just let's plot this model again
4788
03:03:02,640 --> 03:03:06,000
so here instead of that
4789
03:03:06,800 --> 03:03:08,160
and you'll see that we actually have
4790
03:03:08,160 --> 03:03:10,000
this like
4791
03:03:10,000 --> 03:03:11,920
curve that looks something like this so
4792
03:03:11,920 --> 03:03:13,600
actually what if i got rid of this
4793
03:03:13,600 --> 03:03:16,319
activation
4794
03:03:16,399 --> 03:03:20,319
let's train this again
4795
03:03:21,120 --> 03:03:23,840
and see what happens
4796
03:03:23,840 --> 03:03:25,600
all right so even even when i got rid of
4797
03:03:25,600 --> 03:03:27,840
that relu at the end
4798
03:03:27,840 --> 03:03:31,520
it kind of knows hey you know if
4799
03:03:31,520 --> 03:03:33,279
it's not the best model
4800
03:03:33,279 --> 03:03:37,279
if we had maybe one more layer in here
4801
03:03:39,200 --> 03:03:40,479
these are just things that you have to
4802
03:03:40,479 --> 03:03:41,840
play around with
4803
03:03:41,840 --> 03:03:43,439
when you're you know working with
4804
03:03:43,439 --> 03:03:45,279
machine learning it's like you don't
4805
03:03:45,279 --> 03:03:46,880
really know
4806
03:03:46,880 --> 03:03:49,439
what the best model is going to be
4807
03:03:49,439 --> 03:03:52,960
um for example this also is not
4808
03:03:52,960 --> 03:03:55,840
brilliant
4809
03:03:56,080 --> 03:03:56,880
but
4810
03:03:56,880 --> 03:03:59,600
i guess it's okay so my point is though
4811
03:03:59,600 --> 03:04:02,399
that with a neural net
4812
03:04:02,399 --> 03:04:04,319
i mean this is not brilliant but also
4813
03:04:04,319 --> 03:04:06,640
there's like no data down here right so
4814
03:04:06,640 --> 03:04:08,000
it's kind of hard for our model to
4815
03:04:08,000 --> 03:04:09,600
predict in fact we probably should have
4816
03:04:09,600 --> 03:04:10,960
started the prediction somewhere around
4817
03:04:10,960 --> 03:04:12,560
here
4818
03:04:12,560 --> 03:04:14,000
my point though is that with this neural
4819
03:04:14,000 --> 03:04:15,439
net model you can see that this is no
4820
03:04:15,439 --> 03:04:18,319
longer a linear predictor but yet we
4821
03:04:18,319 --> 03:04:21,359
still get an estimate of the value
4822
03:04:21,359 --> 03:04:23,120
right and we can repeat this exact same
4823
03:04:23,120 --> 03:04:24,800
exercise
4824
03:04:24,800 --> 03:04:27,120
with the multiple
4825
03:04:27,120 --> 03:04:28,479
uh
4826
03:04:28,479 --> 03:04:29,920
inputs
4827
03:04:29,920 --> 03:04:32,560
so here
4828
03:04:33,439 --> 03:04:37,040
if i now pass in all the data
4829
03:04:37,040 --> 03:04:38,000
so
4830
03:04:38,000 --> 03:04:39,840
this is my all
4831
03:04:39,840 --> 03:04:41,760
normalizer
4832
03:04:41,760 --> 03:04:44,080
and i should just be able to pass in
4833
03:04:44,080 --> 03:04:46,319
that
4834
03:04:46,319 --> 03:04:47,439
so
4835
03:04:47,439 --> 03:04:51,279
let's move this to the next cell
4836
03:04:51,279 --> 03:04:53,520
um
4837
03:04:53,920 --> 03:04:55,760
here i'm going to pass in my all
4838
03:04:55,760 --> 03:04:57,279
normalizer
4839
03:04:57,279 --> 03:04:59,920
and let's compile it yeah those
4840
03:04:59,920 --> 03:05:02,880
parameters look good
4841
03:05:02,880 --> 03:05:05,359
great so
4842
03:05:05,359 --> 03:05:06,960
here with the history when we're trying
4843
03:05:06,960 --> 03:05:08,160
to
4844
03:05:08,160 --> 03:05:10,640
fit this model instead of temp we're
4845
03:05:10,640 --> 03:05:13,279
going to use
4846
03:05:13,279 --> 03:05:14,960
our larger data set with all the
4847
03:05:14,960 --> 03:05:16,560
features
4848
03:05:16,560 --> 03:05:19,840
and let's just train that
4849
03:05:21,920 --> 03:05:25,600
and of course we want to plot the loss
4850
03:05:31,439 --> 03:05:34,080
okay so that's what our loss looks like
4851
03:05:34,080 --> 03:05:35,600
um
4852
03:05:35,600 --> 03:05:37,040
it's an interesting curve but it's
4853
03:05:37,040 --> 03:05:39,359
decreasing
4854
03:05:39,359 --> 03:05:41,279
so before we saw that our r squared
4855
03:05:41,279 --> 03:05:44,080
score was around 0.52 well we don't
4856
03:05:44,080 --> 03:05:45,439
really have that with a neural net
4857
03:05:45,439 --> 03:05:47,200
anymore but one thing that we can
4858
03:05:47,200 --> 03:05:49,600
measure is hey what is the mean squared
4859
03:05:49,600 --> 03:05:50,560
error
4860
03:05:50,560 --> 03:05:53,600
right so if i come down here
4861
03:05:53,600 --> 03:05:55,840
um
4862
03:05:56,319 --> 03:05:58,960
and i compare the two mean squared
4863
03:05:58,960 --> 03:06:01,600
errors so
4864
03:06:01,600 --> 03:06:04,880
so i can predict
4865
03:06:05,120 --> 03:06:06,880
x test
4866
03:06:06,880 --> 03:06:09,760
all right
4867
03:06:10,479 --> 03:06:12,319
so these are my predictions using that
4868
03:06:12,319 --> 03:06:14,960
linear regressor well the multiple
4869
03:06:14,960 --> 03:06:17,040
linear regressor so these are
4870
03:06:17,040 --> 03:06:19,279
my predictions
4871
03:06:19,279 --> 03:06:21,040
linear regression
4872
03:06:21,040 --> 03:06:23,359
okay
4873
03:06:24,479 --> 03:06:26,399
i'm actually going to do
4874
03:06:26,399 --> 03:06:29,439
that at the bottom so
4875
03:06:29,439 --> 03:06:31,600
let me just copy and paste that cell
4876
03:06:31,600 --> 03:06:33,600
and bring it down here so now i'm going
4877
03:06:33,600 --> 03:06:36,800
to calculate the mean squared error
4878
03:06:36,800 --> 03:06:38,880
for both
4879
03:06:38,880 --> 03:06:40,960
um the linear
4880
03:06:40,960 --> 03:06:43,359
regressor and the neural net okay so
4881
03:06:43,359 --> 03:06:44,479
this is
4882
03:06:44,479 --> 03:06:45,439
my
4883
03:06:45,439 --> 03:06:50,000
linear and this is my neural net so
4884
03:06:50,000 --> 03:06:52,000
if i use my neural net model and i
4885
03:06:52,000 --> 03:06:53,920
predict
4886
03:06:53,920 --> 03:06:55,760
x test all
4887
03:06:55,760 --> 03:06:58,319
i get my two you know different y
4888
03:06:58,319 --> 03:07:00,080
predictions
4889
03:07:00,080 --> 03:07:01,359
and
4890
03:07:01,359 --> 03:07:02,720
um
4891
03:07:02,720 --> 03:07:04,880
i can calculate the mean squared error
4892
03:07:04,880 --> 03:07:06,319
right so
4893
03:07:06,319 --> 03:07:08,000
if i want to get the mean squared error
4894
03:07:08,000 --> 03:07:10,479
and i have y
4895
03:07:10,479 --> 03:07:12,160
prediction and y
4896
03:07:12,160 --> 03:07:13,359
real
4897
03:07:13,359 --> 03:07:14,560
i can do
4898
03:07:14,560 --> 03:07:16,560
numpy dot square and then i would need
4899
03:07:16,560 --> 03:07:19,840
the y prediction minus you know the real
4900
03:07:19,840 --> 03:07:21,600
so this this is basically squaring
4901
03:07:21,600 --> 03:07:22,880
everything
4902
03:07:22,880 --> 03:07:26,160
um and
4903
03:07:26,160 --> 03:07:28,399
this should be
4904
03:07:28,399 --> 03:07:30,720
a vector so if i
4905
03:07:30,720 --> 03:07:33,359
just take this entire thing and take the
4906
03:07:33,359 --> 03:07:34,560
mean
4907
03:07:34,560 --> 03:07:35,840
of that
4908
03:07:35,840 --> 03:07:38,240
that should give me the mse so let's
4909
03:07:38,240 --> 03:07:41,200
let's just try that out
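[editor's note: her mean squared error recipe, square the element-wise differences and then take the mean of that vector, is just]

```python
import numpy as np

def mse(y_pred, y_real):
    # square every element of the difference vector, then average
    return np.square(y_pred - y_real).mean()

# quick sanity check on known values: (0 + 0 + 4) / 3
result = mse(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 5.0]))
```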
4910
03:07:44,880 --> 03:07:48,640
and the y real is y test
4911
03:07:48,640 --> 03:07:49,680
all
4912
03:07:49,680 --> 03:07:51,359
right so that's my mean squared error
4913
03:07:51,359 --> 03:07:53,279
for the linear regressor
4914
03:07:53,279 --> 03:07:57,359
and this is my mean squared error for
4915
03:07:57,359 --> 03:08:00,080
the neural net
4916
03:08:02,000 --> 03:08:04,720
so that's interesting uh i will debug
4917
03:08:04,720 --> 03:08:06,880
this live i guess
4918
03:08:06,880 --> 03:08:08,560
so my guess is that it's probably coming
4919
03:08:08,560 --> 03:08:09,359
from
4920
03:08:09,359 --> 03:08:11,439
this normalization
4921
03:08:11,439 --> 03:08:12,479
layer
4922
03:08:12,479 --> 03:08:14,160
um
4923
03:08:14,160 --> 03:08:16,160
because this input shape is
4924
03:08:16,160 --> 03:08:17,359
probably just
4925
03:08:17,359 --> 03:08:20,359
six
4926
03:08:25,600 --> 03:08:27,520
and
4927
03:08:27,520 --> 03:08:28,880
okay so
4928
03:08:28,880 --> 03:08:30,960
that works now
4929
03:08:30,960 --> 03:08:33,600
and the reason why is because
4930
03:08:33,600 --> 03:08:35,840
like my inputs are only
4931
03:08:35,840 --> 03:08:37,200
for every vector it's only a one
4932
03:08:37,200 --> 03:08:39,520
dimensional vector of length six so i
4933
03:08:39,520 --> 03:08:42,080
should have i should have just had six
4934
03:08:42,080 --> 03:08:44,479
comma which is a tuple of size six from
4935
03:08:44,479 --> 03:08:46,399
the start or it's a tuple
4936
03:08:46,399 --> 03:08:50,080
containing one element which is a six
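[editor's note: the fix she lands on, giving the normalization layer an input shape of (6,), a one-element tuple containing the six, looks something like this; random data stands in for the six bike-sharing features]

```python
import numpy as np
import tensorflow as tf

X = np.random.default_rng(0).uniform(0, 100, size=(50, 6))  # 6 features per sample

# each input vector is one-dimensional with length six -> input_shape=(6,)
all_normalizer = tf.keras.layers.Normalization(input_shape=(6,), axis=-1)
all_normalizer.adapt(X)

# after adapting, each feature column comes out with roughly zero mean
out = all_normalizer(X).numpy()
```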
4937
03:08:50,080 --> 03:08:52,160
okay so it's actually interesting that
4938
03:08:52,160 --> 03:08:53,200
my
4939
03:08:53,200 --> 03:08:56,640
uh neural net results seem like they
4940
03:08:56,640 --> 03:08:58,720
they have a larger mean squared error
4941
03:08:58,720 --> 03:09:01,200
than my linear regressor
4942
03:09:01,200 --> 03:09:06,240
um one thing that we can look at is
4943
03:09:06,240 --> 03:09:09,680
we can actually plot the real versus you
4944
03:09:09,680 --> 03:09:13,279
know the actual results versus
4945
03:09:13,279 --> 03:09:15,680
what the predictions are so
4946
03:09:15,680 --> 03:09:18,080
if i say
4947
03:09:18,080 --> 03:09:20,240
some axis and i use
4948
03:09:20,240 --> 03:09:22,080
plt.axes
4949
03:09:22,080 --> 03:09:23,439
and make these
4950
03:09:23,439 --> 03:09:25,279
equal
4951
03:09:25,279 --> 03:09:27,520
then i can scatter
4952
03:09:27,520 --> 03:09:28,960
the the y
4953
03:09:28,960 --> 03:09:31,120
you know the test so what the actual
4954
03:09:31,120 --> 03:09:33,120
values are on the x-axis and then what
4955
03:09:33,120 --> 03:09:34,640
the predictions
4956
03:09:34,640 --> 03:09:36,000
are on the
4957
03:09:36,000 --> 03:09:37,760
y-axis
4958
03:09:37,760 --> 03:09:38,800
okay
4959
03:09:38,800 --> 03:09:42,319
uh and i can label this as the linear
4960
03:09:42,319 --> 03:09:46,080
regression predictions
4961
03:09:47,520 --> 03:09:49,520
okay so then let me just label my axes
4962
03:09:49,520 --> 03:09:51,840
so the x-axis i'm going to say is the
4963
03:09:51,840 --> 03:09:54,560
true values
4964
03:09:54,560 --> 03:09:57,920
the y-axis is going to be my linear
4965
03:09:57,920 --> 03:10:01,279
regression predictions
4966
03:10:04,240 --> 03:10:07,600
or actually let's plot
4967
03:10:07,600 --> 03:10:11,200
let's just make this predictions
4968
03:10:11,359 --> 03:10:14,560
and then at the end
4969
03:10:14,560 --> 03:10:17,680
i'm going to plot
4970
03:10:17,680 --> 03:10:20,160
oh let's set some limits
4971
03:10:20,160 --> 03:10:22,399
uh
4972
03:10:22,800 --> 03:10:24,399
because i think that's like
4973
03:10:24,399 --> 03:10:28,160
approximately the max number of bikes
4974
03:10:28,560 --> 03:10:29,359
so
4975
03:10:29,359 --> 03:10:32,560
i'm going to set my x limit to this and
4976
03:10:32,560 --> 03:10:34,640
my y limit
4977
03:10:34,640 --> 03:10:36,240
to this
4978
03:10:36,240 --> 03:10:38,160
so here i'm going to pass that in here
4979
03:10:38,160 --> 03:10:40,640
too
4980
03:10:40,640 --> 03:10:43,040
and
4981
03:10:43,120 --> 03:10:44,160
all right
4982
03:10:44,160 --> 03:10:47,359
this is what we actually get for our
4983
03:10:47,359 --> 03:10:49,920
linear regressor
4984
03:10:49,920 --> 03:10:52,319
you see that actually they align quite
4985
03:10:52,319 --> 03:10:55,520
well i mean to some extent so 2000 is
4986
03:10:55,520 --> 03:10:57,200
probably too much
4987
03:10:57,200 --> 03:10:59,200
2500 i mean
4988
03:10:59,200 --> 03:11:00,319
looks like
4989
03:11:00,319 --> 03:11:03,359
maybe like 1800 would be enough here for
4990
03:11:03,359 --> 03:11:04,720
our limits
4991
03:11:04,720 --> 03:11:06,720
um
4992
03:11:06,720 --> 03:11:09,359
and i'm actually going to
4993
03:11:09,359 --> 03:11:11,279
label something else the neural net
4994
03:11:11,279 --> 03:11:14,000
predictions
4995
03:11:15,760 --> 03:11:18,479
let's add a legend
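[editor's note: the true-versus-predicted scatter she assembles, equal axes, shared limits around the maximum bike count, one series per model, might be sketched like this; the y_test and prediction arrays are hypothetical stand-ins for the notebook's]

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
y_test = rng.uniform(0, 1800, 100)            # stand-in true bike counts
y_pred_lr = y_test + rng.normal(0, 100, 100)  # stand-in linear regression predictions
y_pred_nn = y_test + rng.normal(0, 150, 100)  # stand-in neural net predictions

ax = plt.axes(aspect="equal")  # equal axes so a perfect model lies on y = x
ax.scatter(y_test, y_pred_lr, label="Lin Reg Preds")
ax.scatter(y_test, y_pred_nn, label="NN Preds")
ax.set_xlabel("True Values")
ax.set_ylabel("Predictions")
lims = [0, 1800]  # roughly the max bike count seen in the data
ax.set_xlim(lims)
ax.set_ylim(lims)
ax.plot(lims, lims, c="red")  # the y = x reference line
ax.legend()
```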
4996
03:11:18,479 --> 03:11:21,439
so you you can see that our neural net
4997
03:11:21,439 --> 03:11:23,040
for the
4998
03:11:23,040 --> 03:11:25,120
larger values it seems like it's a
4999
03:11:25,120 --> 03:11:27,680
little bit more spread out and it seems
5000
03:11:27,680 --> 03:11:28,800
like
5001
03:11:28,800 --> 03:11:31,040
we tend to underestimate a little bit
5002
03:11:31,040 --> 03:11:34,319
down here in this area
5003
03:11:34,840 --> 03:11:36,720
okay
5004
03:11:36,720 --> 03:11:38,960
and for some reason these are way off as
5005
03:11:38,960 --> 03:11:40,800
well
5006
03:11:40,800 --> 03:11:43,200
but yeah so we've basically used a
5007
03:11:43,200 --> 03:11:46,479
linear regressor and a neural net um
5008
03:11:46,479 --> 03:11:48,399
honestly there are some times where a
5009
03:11:48,399 --> 03:11:49,840
neural net is more appropriate and a
5010
03:11:49,840 --> 03:11:52,080
linear regressor is more appropriate
5011
03:11:52,080 --> 03:11:54,720
i think that it just comes with time and
5012
03:11:54,720 --> 03:11:57,040
trying to figure out you know and just
5013
03:11:57,040 --> 03:11:58,640
literally seeing like hey what works
5014
03:11:58,640 --> 03:12:00,720
better like here a multiple
5015
03:12:00,720 --> 03:12:02,319
linear regressor might actually work
5016
03:12:02,319 --> 03:12:04,960
better than a neural net but for example
5017
03:12:04,960 --> 03:12:07,279
with the one-dimensional case
5018
03:12:07,279 --> 03:12:09,359
a linear regressor would never be able
5019
03:12:09,359 --> 03:12:11,200
to see this curve
5020
03:12:11,200 --> 03:12:13,040
okay
5021
03:12:13,040 --> 03:12:14,720
i mean i'm not saying this is a great
5022
03:12:14,720 --> 03:12:16,840
model either but i'm just saying
5023
03:12:16,840 --> 03:12:19,520
like hey you know
5024
03:12:19,520 --> 03:12:21,120
sometimes it might be more appropriate
5025
03:12:21,120 --> 03:12:25,040
to use something that's not linear
5026
03:12:25,279 --> 03:12:29,760
so yeah i will leave regression at that
5027
03:12:29,760 --> 03:12:31,920
okay so we just talked about supervised
5028
03:12:31,920 --> 03:12:33,040
learning
5029
03:12:33,040 --> 03:12:35,439
and in supervised learning we have data
5030
03:12:35,439 --> 03:12:37,680
we have some a bunch of features and for
5031
03:12:37,680 --> 03:12:39,760
a bunch of different samples but each of
5032
03:12:39,760 --> 03:12:41,600
those samples has some sort of label on
5033
03:12:41,600 --> 03:12:44,240
it whether that's a number a category a
5034
03:12:44,240 --> 03:12:47,760
class etc right we were able to use that
5035
03:12:47,760 --> 03:12:49,760
label in order to try to predict new
5036
03:12:49,760 --> 03:12:52,160
labels of other points that we haven't
5037
03:12:52,160 --> 03:12:54,000
seen yet
5038
03:12:54,000 --> 03:12:56,880
well now let's move on to unsupervised
5039
03:12:56,880 --> 03:12:57,920
learning
5040
03:12:57,920 --> 03:13:00,160
so with unsupervised learning we have a
5041
03:13:00,160 --> 03:13:02,960
bunch of unlabeled data
5042
03:13:02,960 --> 03:13:04,640
and what can we do with that you know
5043
03:13:04,640 --> 03:13:09,439
can we learn anything from this data
5044
03:13:09,600 --> 03:13:10,800
so the first algorithm that we're going
5045
03:13:10,800 --> 03:13:12,960
to discuss is known as k means
5046
03:13:12,960 --> 03:13:15,520
clustering what k means clustering is
5047
03:13:15,520 --> 03:13:19,439
trying to do is it's trying to compute
5048
03:13:19,439 --> 03:13:20,720
k
5049
03:13:20,720 --> 03:13:21,920
clusters
5050
03:13:21,920 --> 03:13:24,640
from the data
5051
03:13:25,680 --> 03:13:28,640
so in this example below i have a bunch
5052
03:13:28,640 --> 03:13:30,720
of scattered points and you'll see that
5053
03:13:30,720 --> 03:13:32,319
this is
5054
03:13:32,319 --> 03:13:35,520
x0 and x1 on the two axes which means
5055
03:13:35,520 --> 03:13:37,520
i'm actually plotting two different
5056
03:13:37,520 --> 03:13:38,560
features
5057
03:13:38,560 --> 03:13:40,560
right of each point but we don't know
5058
03:13:40,560 --> 03:13:43,680
what the y label is for those points
5059
03:13:43,680 --> 03:13:46,880
and now just looking at these scattered
5060
03:13:46,880 --> 03:13:48,399
points
5061
03:13:48,399 --> 03:13:50,080
we can kind of see how there are
5062
03:13:50,080 --> 03:13:53,040
different clusters in the data set right
5063
03:13:53,040 --> 03:13:55,359
so depending on what we pick for k we
5064
03:13:55,359 --> 03:13:58,560
might have different clusters
5065
03:13:58,560 --> 03:14:01,760
let's say k equals two right then we
5066
03:14:01,760 --> 03:14:03,760
might pick okay this seems like it could
5067
03:14:03,760 --> 03:14:05,920
be one cluster but this here is also
5068
03:14:05,920 --> 03:14:07,600
another cluster so those might be our
5069
03:14:07,600 --> 03:14:10,000
two different clusters
5070
03:14:10,000 --> 03:14:13,040
if we have k equals three
5071
03:14:13,040 --> 03:14:15,200
for example then okay this seems like it
5072
03:14:15,200 --> 03:14:16,880
could be a cluster
5073
03:14:16,880 --> 03:14:18,479
this seems like it could be a cluster
5074
03:14:18,479 --> 03:14:20,640
and maybe this could be a cluster right
5075
03:14:20,640 --> 03:14:22,239
so we could have three different
5076
03:14:22,239 --> 03:14:25,279
clusters in the data set
5077
03:14:25,279 --> 03:14:29,200
now this k here is predefined
5078
03:14:29,200 --> 03:14:32,239
if i can spell that correctly
5079
03:14:32,239 --> 03:14:35,359
by the person who's running the model so
5080
03:14:35,359 --> 03:14:38,560
that would be you
5081
03:14:38,560 --> 03:14:39,680
all right
5082
03:14:39,680 --> 03:14:41,520
and let's discuss how you know the
5083
03:14:41,520 --> 03:14:43,200
computer actually goes through and
5084
03:14:43,200 --> 03:14:44,479
computes
5085
03:14:44,479 --> 03:14:47,120
the k clusters
5086
03:14:47,120 --> 03:14:48,880
so i'm going to write those steps down
5087
03:14:48,880 --> 03:14:51,880
here
5088
03:14:52,560 --> 03:14:55,680
now the first step that happens is we
5089
03:14:55,680 --> 03:14:56,720
actually
5090
03:14:56,720 --> 03:14:59,040
choose well the computer
5091
03:14:59,040 --> 03:15:02,479
chooses three random points
5092
03:15:02,479 --> 03:15:04,640
on this plot
5093
03:15:04,640 --> 03:15:07,840
to be the centroids
5094
03:15:08,319 --> 03:15:10,319
and by centroids i just mean the center
5095
03:15:10,319 --> 03:15:13,120
of the clusters okay
5096
03:15:13,120 --> 03:15:15,040
so three random points let's say we're
5097
03:15:15,040 --> 03:15:16,560
doing k equals three so we're choosing
5098
03:15:16,560 --> 03:15:18,640
three random points to be the centroids
5099
03:15:18,640 --> 03:15:20,399
of the three clusters if it were two
5100
03:15:20,399 --> 03:15:22,880
we'd be choosing two random points
5101
03:15:22,880 --> 03:15:24,319
okay
5102
03:15:24,319 --> 03:15:26,080
so maybe the three random points i'm
5103
03:15:26,080 --> 03:15:29,680
choosing might be here
5104
03:15:29,760 --> 03:15:30,240
here
5105
03:15:30,240 --> 03:15:32,560
[Music]
5106
03:15:32,560 --> 03:15:35,279
and here
5107
03:15:35,279 --> 03:15:36,720
all right
5108
03:15:36,720 --> 03:15:37,680
so we have
5109
03:15:37,680 --> 03:15:39,600
three different points
5110
03:15:39,600 --> 03:15:43,680
and the second thing that we do
5111
03:15:44,319 --> 03:15:48,080
is we actually calculate
5112
03:15:48,080 --> 03:15:50,640
the distance
5113
03:15:50,640 --> 03:15:52,160
for each point
5114
03:15:52,160 --> 03:15:55,359
to those centroids
5115
03:15:55,359 --> 03:15:57,680
so between all the points
5116
03:15:57,680 --> 03:16:00,560
and the centroid
5117
03:16:01,279 --> 03:16:03,359
so basically i'm saying all right this
5118
03:16:03,359 --> 03:16:05,200
is this distance this is this distance
5119
03:16:05,200 --> 03:16:07,520
this is this distance
5120
03:16:07,520 --> 03:16:09,600
all of these distances i'm computing
5121
03:16:09,600 --> 03:16:12,000
between oops not those two
5122
03:16:12,000 --> 03:16:13,600
between the points not the centroids
5123
03:16:13,600 --> 03:16:14,640
themselves
5124
03:16:14,640 --> 03:16:16,800
so i'm computing the distances for all
5125
03:16:16,800 --> 03:16:20,000
of these plots to each of the centroids
5126
03:16:20,000 --> 03:16:21,279
okay
5127
03:16:21,279 --> 03:16:23,840
and that
5128
03:16:24,000 --> 03:16:26,000
comes with also
5129
03:16:26,000 --> 03:16:28,479
assigning
5130
03:16:28,479 --> 03:16:32,560
those points to the closest centroid
5131
03:16:34,720 --> 03:16:37,439
what do i mean by that
5132
03:16:37,439 --> 03:16:38,160
so
5133
03:16:38,160 --> 03:16:39,680
let's take
5134
03:16:39,680 --> 03:16:41,359
this point here for example so i'm
5135
03:16:41,359 --> 03:16:44,080
computing this distance this distance
5136
03:16:44,080 --> 03:16:45,840
and this distance and i'm saying okay it
5137
03:16:45,840 --> 03:16:48,399
seems like the red one is the closest so
5138
03:16:48,399 --> 03:16:50,560
i'm actually going to put this into the
5139
03:16:50,560 --> 03:16:51,840
red
5140
03:16:51,840 --> 03:16:53,120
centroid
5141
03:16:53,120 --> 03:16:58,040
so if i do that for all of these points
5142
03:16:59,760 --> 03:17:01,520
this one seems slightly closer to red
5143
03:17:01,520 --> 03:17:02,960
and this one seems slightly closer to
5144
03:17:02,960 --> 03:17:04,000
red
5145
03:17:04,000 --> 03:17:05,520
right
5146
03:17:05,520 --> 03:17:07,200
now for the blue
5147
03:17:07,200 --> 03:17:09,359
i actually wouldn't
5148
03:17:09,359 --> 03:17:11,680
put any blue ones in here but
5149
03:17:11,680 --> 03:17:14,319
we would probably actually that first
5150
03:17:14,319 --> 03:17:17,359
one is closer to red
5151
03:17:17,840 --> 03:17:20,560
and now it seems like the rest of them
5152
03:17:20,560 --> 03:17:23,279
are probably closer to green
5153
03:17:23,279 --> 03:17:25,359
so let's just put all of these into
5154
03:17:25,359 --> 03:17:27,520
green here
5155
03:17:27,520 --> 03:17:29,279
like that
5156
03:17:29,279 --> 03:17:30,560
and
5157
03:17:30,560 --> 03:17:32,319
cool so now we have
5158
03:17:32,319 --> 03:17:34,720
you know our two three technically
5159
03:17:34,720 --> 03:17:37,760
centroids so there's this group here
5160
03:17:37,760 --> 03:17:40,479
there's this group here
5161
03:17:40,479 --> 03:17:43,200
and then blue is kind of just this group
5162
03:17:43,200 --> 03:17:45,040
here it hasn't really touched any of the
5163
03:17:45,040 --> 03:17:47,600
points yet
5164
03:17:47,600 --> 03:17:49,359
so the next step
5165
03:17:49,359 --> 03:17:51,120
three that we do
5166
03:17:51,120 --> 03:17:54,479
is we actually go and we recalculate the
5167
03:17:54,479 --> 03:17:56,960
centroid so we compute
5168
03:17:56,960 --> 03:17:59,920
new centroids
5169
03:18:00,160 --> 03:18:01,920
based on the points that we have in all
5170
03:18:01,920 --> 03:18:03,920
the centroids
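[editor's note: the steps she writes out, pick k random data points as centroids, assign every point to its closest centroid, recompute each centroid as the mean of its assigned points, and repeat until nothing changes, can be sketched in plain numpy]

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # step 2: distance from every point to every centroid,
        # then assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: assignments no longer move the centroids
        centroids = new_centroids
    return centroids, labels

# two well-separated blobs should be recovered with k = 2
X = np.vstack([np.random.default_rng(1).normal(0, 0.5, (50, 2)),
               np.random.default_rng(2).normal(10, 0.5, (50, 2))])
centroids, labels = k_means(X, k=2)
```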
5171
03:18:03,920 --> 03:18:07,279
and by that i just mean okay well
5172
03:18:07,279 --> 03:18:08,880
let's take the average of all these
5173
03:18:08,880 --> 03:18:11,520
points and where is that new centroid
5174
03:18:11,520 --> 03:18:12,960
that's probably going to be somewhere
5175
03:18:12,960 --> 03:18:15,520
around here right the blue one we don't
5176
03:18:15,520 --> 03:18:16,560
have any points in there so we won't
5177
03:18:16,560 --> 03:18:19,040
touch and then the green one
5178
03:18:19,040 --> 03:18:20,800
we can put that
5179
03:18:20,800 --> 03:18:25,200
hmm probably somewhere over here oops
5180
03:18:25,200 --> 03:18:28,560
somewhere over here
5181
03:18:28,560 --> 03:18:32,319
right so now if i erase
5182
03:18:32,960 --> 03:18:34,560
all of the
5183
03:18:34,560 --> 03:18:38,160
previously computed centroids
5184
03:18:38,160 --> 03:18:41,840
i can go and i can actually redo step
5185
03:18:41,840 --> 03:18:42,800
two
5186
03:18:42,800 --> 03:18:45,200
over here this calculation
5187
03:18:45,200 --> 03:18:46,720
all right so i'm going to go back and
5188
03:18:46,720 --> 03:18:48,080
i'm going to iterate through everything
5189
03:18:48,080 --> 03:18:49,680
again and i'm going to recompute my
5190
03:18:49,680 --> 03:18:52,720
three centroids so
5191
03:18:52,880 --> 03:18:54,720
let's see we're going to take this red
5192
03:18:54,720 --> 03:18:58,160
point these are definitely all red right
5193
03:18:58,160 --> 03:19:00,239
this one still looks a bit
5194
03:19:00,239 --> 03:19:01,439
red
5195
03:19:01,439 --> 03:19:03,600
now
5196
03:19:03,600 --> 03:19:05,439
this part we actually start getting
5197
03:19:05,439 --> 03:19:08,080
closer to the blues
5198
03:19:08,080 --> 03:19:10,479
so this one still seems closer to a blue
5199
03:19:10,479 --> 03:19:12,000
than a green
5200
03:19:12,000 --> 03:19:14,560
this one as well
5201
03:19:14,560 --> 03:19:15,760
and
5202
03:19:15,760 --> 03:19:19,840
i think the rest would belong to green
5203
03:19:21,920 --> 03:19:25,040
okay so now our three centroids or three
5204
03:19:25,040 --> 03:19:28,720
sorry our three clusters would be this
5205
03:19:28,720 --> 03:19:30,840
is
5206
03:19:30,840 --> 03:19:32,960
this and then
5207
03:19:32,960 --> 03:19:33,840
this
5208
03:19:33,840 --> 03:19:35,279
right
5209
03:19:35,279 --> 03:19:38,560
those are our three centroids
5210
03:19:38,720 --> 03:19:40,800
and so now we go back and we compute the
5211
03:19:40,800 --> 03:19:42,319
new sorry those would be the three
5212
03:19:42,319 --> 03:19:43,600
clusters so now we go back and we
5213
03:19:43,600 --> 03:19:46,080
compute the three centroids so i'm going
5214
03:19:46,080 --> 03:19:48,880
to get rid of this this and this
5215
03:19:48,880 --> 03:19:51,520
and now where would this red be centered
5216
03:19:51,520 --> 03:19:54,160
probably closer you know to this point
5217
03:19:54,160 --> 03:19:55,120
here
5218
03:19:55,120 --> 03:19:59,439
this blue might be closer to up here
5219
03:19:59,439 --> 03:20:02,720
and then this green would probably be
5220
03:20:02,720 --> 03:20:04,160
somewhere
5221
03:20:04,160 --> 03:20:05,439
it's pretty similar to what we had
5222
03:20:05,439 --> 03:20:06,479
before
5223
03:20:06,479 --> 03:20:08,239
but it seems like it'd be pulled down a
5224
03:20:08,239 --> 03:20:10,239
bit so probably somewhere around there
5225
03:20:10,239 --> 03:20:12,399
for green
5226
03:20:12,399 --> 03:20:17,600
all right and now again we go back and
5227
03:20:17,600 --> 03:20:19,920
we compute
5228
03:20:19,920 --> 03:20:21,760
the distance between all the points and
5229
03:20:21,760 --> 03:20:23,359
the centroids and then we assign them to
5230
03:20:23,359 --> 03:20:25,200
the closest centroid okay
5231
03:20:25,200 --> 03:20:26,560
so
5232
03:20:26,560 --> 03:20:30,239
the reds are all here it's very clear
5233
03:20:30,239 --> 03:20:33,520
actually let me just circle that
5234
03:20:33,520 --> 03:20:34,880
and this
5235
03:20:34,880 --> 03:20:36,000
um
5236
03:20:36,000 --> 03:20:37,200
it actually seems like this point is
5237
03:20:37,200 --> 03:20:39,120
closer to this blue now
5238
03:20:39,120 --> 03:20:40,000
so
5239
03:20:40,000 --> 03:20:40,840
the
5240
03:20:40,840 --> 03:20:45,600
blues seem like they would be maybe
5241
03:20:45,600 --> 03:20:47,760
this point looks like it'd be blue so
5242
03:20:47,760 --> 03:20:49,120
all these look like they would be blue
5243
03:20:49,120 --> 03:20:50,080
now
5244
03:20:50,080 --> 03:20:52,479
and the greens would probably be this
5245
03:20:52,479 --> 03:20:54,239
cluster right here
5246
03:20:54,239 --> 03:20:56,399
so we go back we compute
5247
03:20:56,399 --> 03:20:59,120
the uh centroids
5248
03:20:59,120 --> 03:21:01,439
bam
5249
03:21:01,680 --> 03:21:02,720
this one
5250
03:21:02,720 --> 03:21:05,359
probably like almost here bam
5251
03:21:05,359 --> 03:21:06,800
and then the green
5252
03:21:06,800 --> 03:21:10,479
looks like it would be probably
5253
03:21:10,479 --> 03:21:13,439
um here-ish
5254
03:21:13,600 --> 03:21:14,800
okay
5255
03:21:14,800 --> 03:21:17,920
and now we go back and we compute
5256
03:21:17,920 --> 03:21:18,960
the
5257
03:21:18,960 --> 03:21:19,920
we
5258
03:21:19,920 --> 03:21:22,479
compute the clusters again
5259
03:21:22,479 --> 03:21:23,359
so
5260
03:21:23,359 --> 03:21:26,000
red still this
5261
03:21:26,000 --> 03:21:27,120
blue
5262
03:21:27,120 --> 03:21:30,960
i would argue is now this cluster here
5263
03:21:30,960 --> 03:21:32,160
and green
5264
03:21:32,160 --> 03:21:35,279
is this cluster here okay so we go and
5265
03:21:35,279 --> 03:21:38,479
we recompute
5266
03:21:38,479 --> 03:21:42,000
the centroids bam
5267
03:21:42,560 --> 03:21:44,720
bam
5268
03:21:44,720 --> 03:21:45,920
and
5269
03:21:45,920 --> 03:21:47,760
you know bam
5270
03:21:47,760 --> 03:21:49,600
and now if i were to go
5271
03:21:49,600 --> 03:21:51,200
and assign all the points to clusters
5272
03:21:51,200 --> 03:21:54,399
again i would get the exact same thing
5273
03:21:54,399 --> 03:21:56,880
right and so that's when we know that we
5274
03:21:56,880 --> 03:21:59,279
can stop iterating between steps two and
5275
03:21:59,279 --> 03:22:02,080
three is when we've converged on some
5276
03:22:02,080 --> 03:22:04,880
solution when we've reached some stable
5277
03:22:04,880 --> 03:22:07,439
point and so now because none of these
5278
03:22:07,439 --> 03:22:08,720
points are really changing out of their
5279
03:22:08,720 --> 03:22:10,800
clusters anymore we can go back to the
5280
03:22:10,800 --> 03:22:12,880
user and say hey these are our three
5281
03:22:12,880 --> 03:22:14,319
clusters
5282
03:22:14,319 --> 03:22:18,319
okay and this process
5283
03:22:18,319 --> 03:22:20,640
is something known as
5284
03:22:20,640 --> 03:22:24,479
expectation maximization
5285
03:22:30,080 --> 03:22:31,840
this part where we're assigning the
5286
03:22:31,840 --> 03:22:33,520
points to the closest centroid this is
5287
03:22:33,520 --> 03:22:35,680
our
5288
03:22:35,680 --> 03:22:37,200
expectation
5289
03:22:37,200 --> 03:22:39,520
step
5290
03:22:39,760 --> 03:22:41,439
and this part where we're computing the
5291
03:22:41,439 --> 03:22:43,120
new centroids
5292
03:22:43,120 --> 03:22:45,439
this is our
5293
03:22:45,439 --> 03:22:48,439
maximization
5294
03:22:49,520 --> 03:22:50,880
step
5295
03:22:50,880 --> 03:22:52,560
okay so that's
5296
03:22:52,560 --> 03:22:54,960
expectation maximization
5297
03:22:54,960 --> 03:22:57,279
and we use this in order to
5298
03:22:57,279 --> 03:22:58,880
compute
5299
03:22:58,880 --> 03:23:01,680
the centroids assign all the points to
5300
03:23:01,680 --> 03:23:04,479
clusters according to those centroids
5301
03:23:04,479 --> 03:23:06,399
and then we're recomputing all that over
5302
03:23:06,399 --> 03:23:08,560
again until we reach some stable point
5303
03:23:08,560 --> 03:23:11,359
where nothing is changing anymore
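The loop just described — expectation (assign every point to its closest centroid), maximization (recompute each centroid as the mean of its points), repeat until nothing changes — can be sketched in a few lines of NumPy. This is a minimal illustration, not the video's code; the function name and the optional `init` argument are my own:

```python
import numpy as np

def kmeans(X, k, init=None, max_iter=100):
    """Toy k-means via expectation-maximization on an (n, d) array."""
    rng = np.random.default_rng(0)
    centroids = (np.asarray(init, dtype=float) if init is not None
                 else X[rng.choice(len(X), k, replace=False)])
    for _ in range(max_iter):
        # expectation step: assign every point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # maximization step: recompute each centroid as the mean of its points
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        # converged: the centroids (and hence the assignments) stopped changing
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

On three well-separated pairs of points this converges in a couple of iterations, returning one centroid per pair — the same stable point the drawing above reaches.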
5304
03:23:11,359 --> 03:23:14,800
all right so that's our first example of
5305
03:23:14,800 --> 03:23:16,880
unsupervised learning and basically what
5306
03:23:16,880 --> 03:23:18,560
this is doing is trying to find some
5307
03:23:18,560 --> 03:23:21,439
structure some pattern in the data so if
5308
03:23:21,439 --> 03:23:24,000
i came up with another point
5309
03:23:24,000 --> 03:23:25,600
you know might be somewhere here i can
5310
03:23:25,600 --> 03:23:28,319
say oh looks like that's closer to
5311
03:23:28,319 --> 03:23:29,600
um
5312
03:23:29,600 --> 03:23:31,920
if this is a b c it looks like that's
5313
03:23:31,920 --> 03:23:34,160
closest to cluster b and so i would
5314
03:23:34,160 --> 03:23:36,479
probably put it in cluster b
5315
03:23:36,479 --> 03:23:38,239
okay so we can find some structure in
5316
03:23:38,239 --> 03:23:41,200
the data based on just how
5317
03:23:41,200 --> 03:23:43,040
how the points are
5318
03:23:43,040 --> 03:23:45,840
scattered relative to one another
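Assigning a fresh point to an existing cluster is just a nearest-centroid lookup. A small sketch with made-up centroids standing in for the hypothetical clusters a, b, c from the drawing:

```python
import numpy as np

def closest_cluster(point, centroids):
    """Index of the centroid nearest to the point (Euclidean distance)."""
    return int(np.linalg.norm(centroids - point, axis=1).argmin())

centroids = np.array([[0.0, 0.0],    # cluster a
                      [5.0, 5.0],    # cluster b
                      [10.0, 0.0]])  # cluster c
closest_cluster(np.array([4.0, 6.0]), centroids)  # → 1, i.e. cluster b
```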
5319
03:23:45,840 --> 03:23:47,680
now the second unsupervised learning
5320
03:23:47,680 --> 03:23:49,040
technique that i'm going to discuss with
5321
03:23:49,040 --> 03:23:50,960
you guys something known as principal
5322
03:23:50,960 --> 03:23:53,200
component analysis
5323
03:23:53,200 --> 03:23:54,640
and the point of principal component
5324
03:23:54,640 --> 03:23:56,560
analysis is
5325
03:23:56,560 --> 03:23:59,279
very often it's used as a dimensionality
5326
03:23:59,279 --> 03:24:01,120
reduction technique
5327
03:24:01,120 --> 03:24:02,399
so
5328
03:24:02,399 --> 03:24:05,120
let me write that down
5329
03:24:05,120 --> 03:24:10,000
it's used for dimensionality reduction
5330
03:24:10,319 --> 03:24:11,920
and what do i mean by dimensionality
5331
03:24:11,920 --> 03:24:14,479
reduction is if i have a bunch of
5332
03:24:14,479 --> 03:24:19,200
features like x1 x2 x3 x4 etc can i just
5333
03:24:19,200 --> 03:24:20,960
reduce that down to one
5334
03:24:20,960 --> 03:24:22,800
dimension that gives me the most
5335
03:24:22,800 --> 03:24:24,880
information about how all these points
5336
03:24:24,880 --> 03:24:26,960
are spread relative to one another
5337
03:24:26,960 --> 03:24:29,439
and that's what pca is for so pca
5338
03:24:29,439 --> 03:24:32,720
principal component analysis
5339
03:24:32,880 --> 03:24:35,760
let's say i have
5340
03:24:35,760 --> 03:24:38,000
some points
5341
03:24:38,000 --> 03:24:39,520
in
5342
03:24:39,520 --> 03:24:43,600
the x0 and x1 feature space
5343
03:24:43,600 --> 03:24:45,439
okay so
5344
03:24:45,439 --> 03:24:48,319
uh these points might be spread you know
5345
03:24:48,319 --> 03:24:50,880
something like
5346
03:24:50,880 --> 03:24:53,880
this
5347
03:24:59,600 --> 03:25:01,920
okay
5348
03:25:02,080 --> 03:25:02,800
so
5349
03:25:02,800 --> 03:25:04,800
for example if this were
5350
03:25:04,800 --> 03:25:06,399
um
5351
03:25:06,399 --> 03:25:08,640
something to do with housing prices
5352
03:25:08,640 --> 03:25:10,080
right
5353
03:25:10,080 --> 03:25:13,840
this here might be x0 might be hey uh
5354
03:25:13,840 --> 03:25:16,160
years
5355
03:25:16,160 --> 03:25:17,439
since
5356
03:25:17,439 --> 03:25:19,520
built right since the house was built
5357
03:25:19,520 --> 03:25:24,560
and x1 might be square footage
5358
03:25:25,439 --> 03:25:28,000
of the house
5359
03:25:28,000 --> 03:25:29,359
all right so like years since built i
5360
03:25:29,359 --> 03:25:31,680
mean like right now it's been you know
5361
03:25:31,680 --> 03:25:35,520
22 years since a house in 2000 was built
5362
03:25:35,520 --> 03:25:37,680
now principal component analysis is just
5363
03:25:37,680 --> 03:25:39,120
saying all right let's say we want to
5364
03:25:39,120 --> 03:25:40,720
build a model or let's say we want to
5365
03:25:40,720 --> 03:25:43,920
you know display something about
5366
03:25:43,920 --> 03:25:47,279
our data but we don't we don't have two
5367
03:25:47,279 --> 03:25:49,439
axes to show it on
5368
03:25:49,439 --> 03:25:52,319
how do we display you know
5369
03:25:52,319 --> 03:25:54,000
how do we how do we demonstrate that
5370
03:25:54,000 --> 03:25:56,479
this point is a further away from this
5371
03:25:56,479 --> 03:25:59,520
point than this point
5372
03:26:00,640 --> 03:26:02,720
and we can do that using principal
5373
03:26:02,720 --> 03:26:06,160
component analysis so
5374
03:26:06,319 --> 03:26:07,439
take what you know about linear
5375
03:26:07,439 --> 03:26:09,040
regression and just forget about it for
5376
03:26:09,040 --> 03:26:10,239
a second otherwise you might get
5377
03:26:10,239 --> 03:26:15,120
confused pca is a way of trying to
5378
03:26:15,120 --> 03:26:17,600
find the direction in the space
5379
03:26:17,600 --> 03:26:20,399
with the largest variance so this
5380
03:26:20,399 --> 03:26:23,520
principal component what that means
5381
03:26:23,520 --> 03:26:27,760
is basically the component
5382
03:26:29,040 --> 03:26:31,359
so some direction
5383
03:26:31,359 --> 03:26:33,840
in this space
5384
03:26:35,760 --> 03:26:38,880
with the largest
5385
03:26:38,880 --> 03:26:40,239
variance
5386
03:26:40,239 --> 03:26:41,439
okay
5387
03:26:41,439 --> 03:26:44,000
it tells us the most about
5388
03:26:44,000 --> 03:26:46,160
our data set without the two different
5389
03:26:46,160 --> 03:26:47,520
dimensions like let's say we have these
5390
03:26:47,520 --> 03:26:49,359
two different dimensions and somebody's
5391
03:26:49,359 --> 03:26:50,479
telling us hey you only get one
5392
03:26:50,479 --> 03:26:53,439
dimension in order to show your data set
5393
03:26:53,439 --> 03:26:56,080
what dimension like what do we do we
5394
03:26:56,080 --> 03:26:58,319
want to project our data onto a single
5395
03:26:58,319 --> 03:27:00,000
dimension
5396
03:27:00,000 --> 03:27:01,520
all right
5397
03:27:01,520 --> 03:27:03,439
so that in this case might be a
5398
03:27:03,439 --> 03:27:06,319
dimension that looks something like
5399
03:27:06,319 --> 03:27:08,399
this and you might say okay
5400
03:27:08,399 --> 03:27:09,680
we're not going to talk about linear
5401
03:27:09,680 --> 03:27:11,520
regression okay
5402
03:27:11,520 --> 03:27:13,439
we don't have a y value so in linear
5403
03:27:13,439 --> 03:27:16,160
regression this would be y this is not y
5404
03:27:16,160 --> 03:27:18,479
okay we don't have a label for that
5405
03:27:18,479 --> 03:27:20,960
instead what we're doing is we're taking
5406
03:27:20,960 --> 03:27:23,359
the right angle projection so all of
5407
03:27:23,359 --> 03:27:26,640
these take that's not very visible
5408
03:27:26,640 --> 03:27:29,920
but take this right angle projection
5409
03:27:29,920 --> 03:27:32,800
onto this line
5410
03:27:32,960 --> 03:27:36,319
and what pca is doing is saying okay map
5411
03:27:36,319 --> 03:27:37,760
all of these points onto this
5412
03:27:37,760 --> 03:27:39,439
one-dimensional space
5413
03:27:39,439 --> 03:27:40,840
so the
5414
03:27:40,840 --> 03:27:45,920
transformed data set would be here
5415
03:27:51,680 --> 03:27:53,840
this one's already on the
5416
03:27:53,840 --> 03:27:56,399
line so we just put that there
5417
03:27:56,399 --> 03:27:58,800
but now this would be our new
5418
03:27:58,800 --> 03:28:00,960
one-dimensional data set
5419
03:28:00,960 --> 03:28:03,680
okay it's not our prediction or anything
5420
03:28:03,680 --> 03:28:06,239
this is our new data set if somebody
5421
03:28:06,239 --> 03:28:08,000
came to us said you only get one
5422
03:28:08,000 --> 03:28:10,319
dimension you only get one number to
5423
03:28:10,319 --> 03:28:12,880
represent each of these 2d points
5424
03:28:12,880 --> 03:28:14,880
what number would you give me
5425
03:28:14,880 --> 03:28:16,239
this
5426
03:28:16,239 --> 03:28:18,239
would be the number
5427
03:28:18,239 --> 03:28:19,920
that we gave
5428
03:28:19,920 --> 03:28:21,200
okay
5429
03:28:21,200 --> 03:28:22,160
this
5430
03:28:22,160 --> 03:28:24,960
in this direction this is where our
5431
03:28:24,960 --> 03:28:27,840
points are the most spread out
5432
03:28:27,840 --> 03:28:30,960
right if i took this plot
5433
03:28:30,960 --> 03:28:33,200
and let me actually duplicate this so i
5434
03:28:33,200 --> 03:28:35,279
don't have to
5435
03:28:35,279 --> 03:28:36,840
rewrite
5436
03:28:36,840 --> 03:28:39,120
anything so i don't have to erase and
5437
03:28:39,120 --> 03:28:41,279
then redraw anything
5438
03:28:41,279 --> 03:28:45,760
um let me get rid of some of this stuff
5439
03:28:47,359 --> 03:28:48,960
and i just got rid of a point there too
5440
03:28:48,960 --> 03:28:52,760
so let me draw that back
5441
03:28:54,080 --> 03:28:55,200
all right
5442
03:28:55,200 --> 03:28:57,359
so if this were my original data point
5443
03:28:57,359 --> 03:28:59,760
what if i had taken you know
5444
03:28:59,760 --> 03:29:00,640
this
5445
03:29:00,640 --> 03:29:01,920
to be
5446
03:29:01,920 --> 03:29:04,319
the pca dimension
5447
03:29:04,319 --> 03:29:05,439
okay
5448
03:29:05,439 --> 03:29:06,720
well
5449
03:29:06,720 --> 03:29:07,920
i
5450
03:29:07,920 --> 03:29:10,640
then would have points
5451
03:29:10,640 --> 03:29:12,640
that
5452
03:29:12,640 --> 03:29:13,760
let me actually do that in different
5453
03:29:13,760 --> 03:29:15,439
color
5454
03:29:15,439 --> 03:29:17,520
so if i were to draw a right angle to
5455
03:29:17,520 --> 03:29:18,560
this
5456
03:29:18,560 --> 03:29:21,840
for every point
5457
03:29:23,359 --> 03:29:28,239
my points would look something like this
5458
03:29:33,359 --> 03:29:35,920
and so just intuitively looking at these
5459
03:29:35,920 --> 03:29:38,479
two different plots this top one and
5460
03:29:38,479 --> 03:29:40,880
this one we can see that the points are
5461
03:29:40,880 --> 03:29:43,359
squished a little bit closer together
5462
03:29:43,359 --> 03:29:45,680
right which means that the variance
5463
03:29:45,680 --> 03:29:47,279
that's not the space with the largest
5464
03:29:47,279 --> 03:29:48,800
variance
5465
03:29:48,800 --> 03:29:52,399
the thing about the largest variance
5466
03:29:52,479 --> 03:29:55,359
is that this will give us the most
5467
03:29:55,359 --> 03:29:57,439
discrimination between all of these
5468
03:29:57,439 --> 03:29:58,479
points
5469
03:29:58,479 --> 03:29:59,920
the larger the variance the further
5470
03:29:59,920 --> 03:30:02,399
spread out these points will likely be
5471
03:30:02,399 --> 03:30:03,279
now
5472
03:30:03,279 --> 03:30:05,120
and so that's the that's the dimension
5473
03:30:05,120 --> 03:30:07,760
that we should project it on
5474
03:30:07,760 --> 03:30:09,840
a different way to actually look at that
5475
03:30:09,840 --> 03:30:11,600
like what is the dimension with the
5476
03:30:11,600 --> 03:30:13,920
largest variance it's actually it also
5477
03:30:13,920 --> 03:30:16,880
happens to be the dimension that
5478
03:30:16,880 --> 03:30:19,279
decreases
5479
03:30:19,279 --> 03:30:21,040
that minimizes
5480
03:30:21,040 --> 03:30:23,680
the residuals so
5481
03:30:23,680 --> 03:30:26,080
if we take all the points and we take
5482
03:30:26,080 --> 03:30:29,040
the residual from that the xy residual
5483
03:30:29,040 --> 03:30:32,319
so in linear regression
5484
03:30:32,399 --> 03:30:34,080
in linear regression we were looking
5485
03:30:34,080 --> 03:30:35,920
only at this residual the differences
5486
03:30:35,920 --> 03:30:37,920
between the predictions right between y
5487
03:30:37,920 --> 03:30:41,040
and y hat it's not that
5488
03:30:41,040 --> 03:30:43,200
here in principal component analysis
5489
03:30:43,200 --> 03:30:46,720
we're taking the difference from
5490
03:30:46,720 --> 03:30:48,560
our current point in two-dimensional
5491
03:30:48,560 --> 03:30:49,520
space
5492
03:30:49,520 --> 03:30:51,680
and then its projected point
5493
03:30:51,680 --> 03:30:53,600
okay so we're taking that
5494
03:30:53,600 --> 03:30:54,880
dimension
5495
03:30:54,880 --> 03:30:56,479
and we're saying
5496
03:30:56,479 --> 03:30:58,399
all right how much
5497
03:30:58,399 --> 03:30:59,359
you know
5498
03:30:59,359 --> 03:31:01,760
how much distance is there between
5499
03:31:01,760 --> 03:31:04,000
that projection residual and we're
5500
03:31:04,000 --> 03:31:06,720
trying to minimize that for all of these
5501
03:31:06,720 --> 03:31:08,239
points
5502
03:31:08,239 --> 03:31:11,359
so that actually equates to
5503
03:31:11,359 --> 03:31:14,960
this largest variance dimension
5504
03:31:14,960 --> 03:31:18,399
this dimension here
5505
03:31:19,680 --> 03:31:22,880
the pca dimension
5506
03:31:22,880 --> 03:31:27,720
you can either look at it as minimizing
5507
03:31:27,760 --> 03:31:30,720
minimize
5508
03:31:31,120 --> 03:31:34,399
let me get rid of this
5509
03:31:34,479 --> 03:31:37,359
the projection residuals so that's the
5510
03:31:37,359 --> 03:31:40,840
stuff in orange
5511
03:31:42,000 --> 03:31:44,720
or as
5512
03:31:44,720 --> 03:31:47,359
maximizing the variance
5513
03:31:47,359 --> 03:31:50,160
between the points
5514
03:31:50,160 --> 03:31:51,680
okay
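Those two views really are the same criterion: for centered data, each point's squared distance from the center splits, by the right-angle projection, into the part along the chosen direction plus the squared projection residual, so maximizing one term minimizes the other. A quick numerical check on made-up data (the random cloud here is my own example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 1.0], [0.0, 0.5]])  # tilted cloud
Xc = X - X.mean(axis=0)               # center the data
total = (Xc ** 2).sum(axis=1).mean()  # total spread around the center

for theta in np.linspace(0.0, np.pi, 9):
    d = np.array([np.cos(theta), np.sin(theta)])  # candidate unit direction
    along = Xc @ d                                 # projected 1-d coordinates
    resid = Xc - np.outer(along, d)                # right-angle projection residuals
    # projected variance + mean squared residual is the same total in every direction
    assert np.isclose(along.var() + (resid ** 2).sum(axis=1).mean(), total)
```

Because the total is fixed, the direction with the largest projected variance is exactly the one with the smallest projection residuals.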
5515
03:31:51,680 --> 03:31:54,319
and we're not really going to talk about
5516
03:31:54,319 --> 03:31:56,160
you know the method that we need in
5517
03:31:56,160 --> 03:31:58,720
order to calculate out the principal
5518
03:31:58,720 --> 03:32:00,880
components or like what that projection
5519
03:32:00,880 --> 03:32:01,920
would be
5520
03:32:01,920 --> 03:32:03,439
because you will need to understand
5521
03:32:03,439 --> 03:32:05,840
linear algebra for that especially
5522
03:32:05,840 --> 03:32:08,479
um eigenvectors and eigenvalues which
5523
03:32:08,479 --> 03:32:10,319
i'm not going to cover in this class
5524
03:32:10,319 --> 03:32:11,760
but that's how you would find the
5525
03:32:11,760 --> 03:32:14,239
principal components okay
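For the curious, the linear-algebra step being skipped here boils down to: center the data, form the covariance matrix, and take the eigenvector with the largest eigenvalue as the principal component. A hedged NumPy sketch (the function name is mine, not from the video):

```python
import numpy as np

def first_principal_component(X):
    """Direction of largest variance and the 1-d projection onto it."""
    Xc = X - X.mean(axis=0)                 # PCA always centers the data first
    cov = np.cov(Xc, rowvar=False)          # feature-by-feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: symmetric input, ascending order
    pc = eigvecs[:, -1]                     # eigenvector with the largest eigenvalue
    return pc, Xc @ pc                      # unit direction, projected coordinates
```

On points lying exactly along y = 2x the recovered direction is proportional to (1, 2), and the variance of the 1-d projection equals the top eigenvalue — all of the spread survives the projection.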
5526
03:32:14,239 --> 03:32:16,640
now with this two-dimensional data set
5527
03:32:16,640 --> 03:32:18,640
here sorry this one-dimensional data set
5528
03:32:18,640 --> 03:32:21,120
we started from a 2d data set and we
5529
03:32:21,120 --> 03:32:23,520
now boil it down to one dimension well
5530
03:32:23,520 --> 03:32:25,040
we can go and take that dimension and we
5531
03:32:25,040 --> 03:32:27,279
can do other things with it
5532
03:32:27,279 --> 03:32:29,600
right we can like if there were a y
5533
03:32:29,600 --> 03:32:32,399
label then we can now show x versus y
5534
03:32:32,399 --> 03:32:35,279
rather than x 0 and x 1
5535
03:32:35,279 --> 03:32:37,600
in different plots with that y now we
5536
03:32:37,600 --> 03:32:38,880
can just say oh this is a principal
5537
03:32:38,880 --> 03:32:40,720
component and we're going to plot that
5538
03:32:40,720 --> 03:32:43,439
with the y or for example if there were
5539
03:32:43,439 --> 03:32:45,359
a hundred different dimensions and you
5540
03:32:45,359 --> 03:32:46,800
only wanted to take
5541
03:32:46,800 --> 03:32:48,800
five of them well you could go and you
5542
03:32:48,800 --> 03:32:52,000
could find the top five pca dimensions
5543
03:32:52,000 --> 03:32:53,040
and
5544
03:32:53,040 --> 03:32:54,720
that might be a lot more useful to you
5545
03:32:54,720 --> 03:32:58,560
than 100 different feature vector values
5546
03:32:58,560 --> 03:32:59,920
right
5547
03:32:59,920 --> 03:33:01,439
so that's principal component analysis
5548
03:33:01,439 --> 03:33:03,120
again we're taking
5549
03:33:03,120 --> 03:33:05,840
you know certain data that's unlabeled
5550
03:33:05,840 --> 03:33:07,680
and we're trying to
5551
03:33:07,680 --> 03:33:09,840
make some sort of estimation
5552
03:33:09,840 --> 03:33:13,120
like some guess about its structure
5553
03:33:13,120 --> 03:33:14,319
from
5554
03:33:14,319 --> 03:33:16,800
that original data set if we wanted to
5555
03:33:16,800 --> 03:33:19,120
take you know a 3d thing so like a
5556
03:33:19,120 --> 03:33:20,319
sphere
5557
03:33:20,319 --> 03:33:22,720
but we only have a 2d surface to draw it
5558
03:33:22,720 --> 03:33:23,760
on
5559
03:33:23,760 --> 03:33:25,520
well what's the best approximation that
5560
03:33:25,520 --> 03:33:28,080
we can make oh it's a circle right pca
5561
03:33:28,080 --> 03:33:29,680
is kind of the same thing it's saying if
5562
03:33:29,680 --> 03:33:31,040
we have something with all these
5563
03:33:31,040 --> 03:33:33,040
different dimensions but we can't show
5564
03:33:33,040 --> 03:33:35,120
all of them how do we boil it down to
5565
03:33:35,120 --> 03:33:38,000
just one dimension how do we extract the
5566
03:33:38,000 --> 03:33:39,760
most information
5567
03:33:39,760 --> 03:33:42,399
from that multiple dimensions
5568
03:33:42,399 --> 03:33:44,960
and that is exactly either you minimize
5569
03:33:44,960 --> 03:33:47,600
the projection residuals or you maximize
5570
03:33:47,600 --> 03:33:51,120
the variance and that is pca so we'll go
5571
03:33:51,120 --> 03:33:53,760
through an example of that now finally
5572
03:33:53,760 --> 03:33:56,720
let's move on to implementing the
5573
03:33:56,720 --> 03:34:00,160
unsupervised learning part of this class
5574
03:34:00,160 --> 03:34:02,239
here again i'm on the uci machine
5575
03:34:02,239 --> 03:34:05,279
learning repository and i have a seeds
5576
03:34:05,279 --> 03:34:07,120
data set where
5577
03:34:07,120 --> 03:34:09,439
you know i have a bunch of kernels that
5578
03:34:09,439 --> 03:34:11,520
belong to three different types of wheat
5579
03:34:11,520 --> 03:34:14,640
so there's kama, rosa, and canadian
5580
03:34:14,640 --> 03:34:15,600
and
5581
03:34:15,600 --> 03:34:18,080
the different um features that we have
5582
03:34:18,080 --> 03:34:20,560
access to are you know geometric
5583
03:34:20,560 --> 03:34:22,880
parameters of those wheat kernels so the
5584
03:34:22,880 --> 03:34:25,760
area perimeter compactness
5585
03:34:25,760 --> 03:34:27,120
length width
5586
03:34:27,120 --> 03:34:30,160
asymmetry and the length of the
5587
03:34:30,160 --> 03:34:31,600
kernel groove
5588
03:34:31,600 --> 03:34:34,080
okay so all these are real values which
5589
03:34:34,080 --> 03:34:36,560
is easy to work with and what we're
5590
03:34:36,560 --> 03:34:37,760
going to do is we're going to try to
5591
03:34:37,760 --> 03:34:39,200
predict
5592
03:34:39,200 --> 03:34:41,680
or i guess we're going to try to cluster
5593
03:34:41,680 --> 03:34:44,880
the different varieties of the wheat
5594
03:34:44,880 --> 03:34:47,120
so let's get started i have a colab
5595
03:34:47,120 --> 03:34:49,359
notebook open again oh you're gonna have
5596
03:34:49,359 --> 03:34:51,359
to you know go to the data folder
5597
03:34:51,359 --> 03:34:52,880
download this
5598
03:34:52,880 --> 03:34:53,760
and
5599
03:34:53,760 --> 03:34:56,319
let's get started
5600
03:34:56,319 --> 03:34:59,840
so the first thing to do is to
5601
03:34:59,840 --> 03:35:02,720
import our seeds data set
5602
03:35:02,720 --> 03:35:05,359
into our colab notebook
5603
03:35:05,359 --> 03:35:07,760
so i've done that here
5604
03:35:07,760 --> 03:35:09,200
okay and then
5605
03:35:09,200 --> 03:35:11,520
we're going to import all the classics
5606
03:35:11,520 --> 03:35:14,720
again so pandas
5607
03:35:23,040 --> 03:35:26,000
um and then i'm also going to import
5608
03:35:26,000 --> 03:35:28,239
seaborn because i'm going to want that
5609
03:35:28,239 --> 03:35:31,600
for this specific class
5610
03:35:31,600 --> 03:35:33,840
okay
5611
03:35:35,200 --> 03:35:38,160
great so now our columns that we have in
5612
03:35:38,160 --> 03:35:40,880
our seed data set are the area
5613
03:35:40,880 --> 03:35:42,880
the perimeter
5614
03:35:42,880 --> 03:35:46,000
um the compactness
5615
03:35:46,000 --> 03:35:48,000
the length
5616
03:35:48,000 --> 03:35:49,279
width
5617
03:35:49,279 --> 03:35:51,600
asymmetry
5618
03:35:51,600 --> 03:35:53,840
groove
5619
03:35:53,840 --> 03:35:55,120
length i mean i'm just going to call it
5620
03:35:55,120 --> 03:35:57,760
groove and then the class right the wheat
5621
03:35:57,760 --> 03:36:00,560
kernel's class so now we have to import
5622
03:36:00,560 --> 03:36:02,239
this um
5623
03:36:02,239 --> 03:36:04,239
i'm going to do that using pandas read
5624
03:36:04,239 --> 03:36:05,520
csv
5625
03:36:05,520 --> 03:36:06,560
and
5626
03:36:06,560 --> 03:36:09,520
it's called seeds data.csv
5627
03:36:09,520 --> 03:36:10,560
so
5628
03:36:10,560 --> 03:36:13,840
i'm going to turn that into a data frame
5629
03:36:13,840 --> 03:36:16,319
and the names are equal to the columns
5630
03:36:16,319 --> 03:36:17,920
over here
5631
03:36:17,920 --> 03:36:20,720
so what happens if i just do that
5632
03:36:20,720 --> 03:36:22,640
oops what did i call this seeds
5633
03:36:22,640 --> 03:36:25,640
dataset.txt
5634
03:36:26,319 --> 03:36:27,520
all right
5635
03:36:27,520 --> 03:36:29,600
so if we actually look at our data frame
5636
03:36:29,600 --> 03:36:31,359
right now
5637
03:36:31,359 --> 03:36:34,640
you'll notice something funky okay and
5638
03:36:34,640 --> 03:36:36,960
here you know we have all the stuff
5639
03:36:36,960 --> 03:36:39,120
under area and these are all our numbers
5640
03:36:39,120 --> 03:36:40,960
with some \t
5641
03:36:40,960 --> 03:36:42,479
so the reason is because we haven't
5642
03:36:42,479 --> 03:36:43,920
actually
5643
03:36:43,920 --> 03:36:47,120
told pandas what the separator is which
5644
03:36:47,120 --> 03:36:48,560
we can do
5645
03:36:48,560 --> 03:36:51,760
like this and this \t that's just a tab
5646
03:36:51,760 --> 03:36:53,920
so in order to ensure that like all
5647
03:36:53,920 --> 03:36:55,760
white space gets recognized as a
5648
03:36:55,760 --> 03:36:56,880
separator
5649
03:36:56,880 --> 03:36:58,640
we can actually
5650
03:36:58,640 --> 03:37:02,720
the \s is for whitespace so any spaces
5651
03:37:02,720 --> 03:37:04,479
are going to get recognized as data
5652
03:37:04,479 --> 03:37:07,359
separators so if i run that
5653
03:37:07,359 --> 03:37:09,279
now our um
5654
03:37:09,279 --> 03:37:12,960
this you know this is a lot better okay
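The loading step just shown comes out roughly like this; as a stand-in for the real seeds_dataset.txt I use a two-row string (the first row copied from the UCI file), since the point being made is the whitespace separator:

```python
import io
import pandas as pd

cols = ["area", "perimeter", "compactness", "length",
        "width", "asymmetry", "groove", "class"]

# stand-in for seeds_dataset.txt: the raw file mixes tabs and spaces
raw = ("15.26\t14.84\t0.871\t5.763\t3.312\t2.221\t5.22\t1\n"
       "14.88 14.57 0.8811 5.554 3.333 1.018 4.956 1\n")

# sep=r"\s+" is a regex: any run of spaces or tabs counts as one separator,
# so the stray \t characters no longer end up inside the values
df = pd.read_csv(io.StringIO(raw), names=cols, sep=r"\s+")
```

In the notebook itself the same call just takes the filename instead of the StringIO buffer.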
5655
03:37:12,960 --> 03:37:14,479
okay
5656
03:37:14,479 --> 03:37:16,399
so now let's actually go and like
5657
03:37:16,399 --> 03:37:18,880
visualize this data so
5658
03:37:18,880 --> 03:37:20,640
what i'm actually going to do is plot
5659
03:37:20,640 --> 03:37:23,200
each of these against one another so
5660
03:37:23,200 --> 03:37:25,279
in this case pretend that we don't have
5661
03:37:25,279 --> 03:37:28,080
access to the class right pretend that
5662
03:37:28,080 --> 03:37:29,600
so this class here i'm just going to
5663
03:37:29,600 --> 03:37:31,680
show you in this example that like hey
5664
03:37:31,680 --> 03:37:33,120
we can predict our classes using
5665
03:37:33,120 --> 03:37:34,960
unsupervised learning
5666
03:37:34,960 --> 03:37:36,880
but for this example in unsupervised
5667
03:37:36,880 --> 03:37:38,640
learning we don't actually have access
5668
03:37:38,640 --> 03:37:39,840
to the class
5669
03:37:39,840 --> 03:37:42,239
so i'm going to just try to plot these
5670
03:37:42,239 --> 03:37:45,600
against one another and see what happens
5671
03:37:45,600 --> 03:37:49,279
so for some i in range
5672
03:37:49,279 --> 03:37:51,920
you know the columns minus one because
5673
03:37:51,920 --> 03:37:54,239
the class is in the columns
5674
03:37:54,239 --> 03:37:56,960
and i'm just going to say for j in range
5675
03:37:56,960 --> 03:37:59,920
so take everything from i
5676
03:37:59,920 --> 03:38:02,160
onwards you know so i like the next
5677
03:38:02,160 --> 03:38:03,600
thing after i
5678
03:38:03,600 --> 03:38:04,960
uh until
5679
03:38:04,960 --> 03:38:07,439
the end of this so this will give us
5680
03:38:07,439 --> 03:38:09,359
basically a grid
5681
03:38:09,359 --> 03:38:13,439
of all the different like combinations
5682
03:38:13,439 --> 03:38:17,600
and our x label is going to be
5683
03:38:17,600 --> 03:38:20,080
columns i our y label
5684
03:38:20,080 --> 03:38:22,239
is going to be the columns
5685
03:38:22,239 --> 03:38:25,200
j so those are our labels up here
5686
03:38:25,200 --> 03:38:27,279
and i'm going to use
5687
03:38:27,279 --> 03:38:29,359
seaborn this time
5688
03:38:29,359 --> 03:38:31,040
and i'm going to say
5689
03:38:31,040 --> 03:38:34,160
scatter my data so our x is going to be
5690
03:38:34,160 --> 03:38:36,960
our x label
5691
03:38:38,080 --> 03:38:41,600
our y is going to be our y label
5692
03:38:41,600 --> 03:38:42,720
um
5693
03:38:42,720 --> 03:38:43,760
and
5694
03:38:43,760 --> 03:38:46,160
our data is going to be the data frame
5695
03:38:46,160 --> 03:38:47,680
that we're passing in
5696
03:38:47,680 --> 03:38:49,359
so what's interesting here is that we
5697
03:38:49,359 --> 03:38:51,439
can say hue
5698
03:38:51,439 --> 03:38:53,359
and what this will do is say
5699
03:38:53,359 --> 03:38:55,200
like if i give this class it's going to
5700
03:38:55,200 --> 03:38:57,040
separate the three different classes
5701
03:38:57,040 --> 03:38:58,640
into three different queues so now what
5702
03:38:58,640 --> 03:39:01,040
we're doing is we're basically comparing
5703
03:39:01,040 --> 03:39:03,120
the area and the perimeter or the area
5704
03:39:03,120 --> 03:39:04,720
and the compactness
5705
03:39:04,720 --> 03:39:07,040
but we're going to visualize you know
5706
03:39:07,040 --> 03:39:10,000
what classes they're in
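That plotting loop can be sketched like this; it's a minimal sketch, assuming `df` is a pandas DataFrame whose last column is `"class"` and whose other columns are the seven seed features (loading the data itself is omitted here):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def plot_feature_pairs(df: pd.DataFrame) -> list:
    """Scatter every pair of feature columns, colored by class."""
    cols = df.columns
    pairs = []
    for i in range(len(cols) - 1):             # skip the last column ("class")
        for j in range(i + 1, len(cols) - 1):  # everything after i
            pairs.append((cols[i], cols[j]))
            sns.scatterplot(x=cols[i], y=cols[j], hue="class", data=df)
            plt.show()
            plt.close()                        # fresh figure for the next pair
    return pairs
```

With seven feature columns this draws 21 plots, one per unordered pair.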
So let's go ahead and show that. Great. So basically, for perimeter and area we get these three groups; for area and compactness we get these three groups; and so on. These all honestly look somewhat similar. But wow, look at this one: compactness and asymmetry. It looks like there's not really any separation; they just look like blobs. Sure, maybe class three is more over here, but one and two kind of look like they're on top of each other. There are some pairs that might look slightly better in terms of clustering, but let's go through some of the clustering examples that we talked about and try to implement them. The first thing that we're going to do is just straight-up clustering.
So, what we learned about was k-means clustering. From sklearn.cluster I'm going to import KMeans. Just for the sake of being able to run some x and some y, let's pick a pair of features; perimeter and asymmetry could be a good one, so x will be perimeter and y will be asymmetry. For the x values, I'm going to extract those specific columns. Now let's define the k-means model. In this specific case we know that the number of clusters is three, so let's just use that, and I'm going to fit it against the x that I've just defined right here.
So, one cool thing is that once I have these clusters, I can say kmeans.labels_, and, if I can type correctly, it'll give me its predicted cluster for every sample. And for the actual labels, if we go to the data frame, take the class column, and get its values, we can compare the two and say: hey, in general, most of the samples it predicted as zero are the ones labeled one; in general the twos correspond to the twos; and the remaining cluster corresponds to three. Now remember, these are unlabeled clusters, so what we actually call them doesn't really matter: we can map zero to one, map two to two, and map one to three, and that mapping would do fairly well.
03:42:28,479 --> 03:42:30,239
but we can actually visualize this and
5798
03:42:30,239 --> 03:42:33,439
in order to do that i'm going to create
5799
03:42:33,439 --> 03:42:36,080
this cluster
5800
03:42:36,080 --> 03:42:37,680
cluster data frame
5801
03:42:37,680 --> 03:42:40,319
so i'm going to create a data frame and
5802
03:42:40,319 --> 03:42:42,479
i'm going to pass in
5803
03:42:42,479 --> 03:42:45,439
um a horizontally stacked
5804
03:42:45,439 --> 03:42:47,279
array with x
5805
03:42:47,279 --> 03:42:50,080
so my values for x and y
5806
03:42:50,080 --> 03:42:53,760
and then um the clusters that i have
5807
03:42:53,760 --> 03:42:55,120
here
5808
03:42:55,120 --> 03:43:00,080
but i'm going to reshape them so it's 2d
5809
03:43:00,239 --> 03:43:01,520
okay
5810
03:43:01,520 --> 03:43:03,279
and the columns
5811
03:43:03,279 --> 03:43:06,880
the labels for that are going to be x y
5812
03:43:06,880 --> 03:43:08,080
and
5813
03:43:08,080 --> 03:43:10,399
plus
5814
03:43:10,399 --> 03:43:12,560
okay
5815
03:43:12,560 --> 03:43:13,840
so
5816
03:43:13,840 --> 03:43:16,479
i'm going to go ahead and do that same
5817
03:43:16,479 --> 03:43:19,279
seabourn scatter plot
5818
03:43:19,279 --> 03:43:20,399
again
5819
03:43:20,399 --> 03:43:22,800
where x is x y is y
5820
03:43:22,800 --> 03:43:26,960
and now uh the hue is again the class
5821
03:43:26,960 --> 03:43:29,680
and the data is now this cluster data
5822
03:43:29,680 --> 03:43:30,720
frame
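Concretely, building that data frame and plotting it might look like this (again a sketch, with synthetic blobs standing in for the two chosen seed features):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the two chosen seed features.
X, _ = make_blobs(n_samples=210, centers=3, n_features=2, random_state=0)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X).labels_

# Stack the features with the (reshaped, now 2-D) cluster labels.
cluster_df = pd.DataFrame(np.hstack((X, clusters.reshape(-1, 1))),
                          columns=["x", "y", "class"])

sns.scatterplot(x="x", y="y", hue="class", data=cluster_df)
plt.show()
```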
All right, so this here is my k-means, I guess, classes. So k-means kind of looks like this. If I come down here and plot my original data frame, these are my original classes with respect to this specific x and y, and you'll see that honestly it doesn't do too poorly. I mean, the colors are different, but that's fine; for the most part it captures the structure of the clusters. And now we can do that with higher dimensions.
So, with the higher dimensions, if we make x equal to all of the columns except for the last one, which is our class, we can do the exact same thing and predict. But now the columns of the cluster data frame are equal to our data frame's columns all the way up to the last one, plus the class, so we can literally just use the data frame's columns. We can fit all of this, and now, if I plot the k-means classes: all right, so that's my clustered version and my original. Actually, let me see if I can get these on the same page.
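The same exercise on all seven features can be sketched like this (synthetic 7-dimensional blobs again stand in for the seeds data):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

cols = ["area", "perimeter", "compactness", "length",
        "width", "asymmetry", "groove"]

# Synthetic stand-in for the seven seed features.
X, _ = make_blobs(n_samples=210, centers=3, n_features=7, random_state=0)
df = pd.DataFrame(X, columns=cols)

# Cluster on every feature column at once.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(df[cols].values)

# Reuse the original column names, plus "class" for the cluster labels.
cluster_df = pd.DataFrame(
    np.hstack((df[cols].values, kmeans.labels_.reshape(-1, 1))),
    columns=cols + ["class"])
```

Any pair of columns of `cluster_df` can then be scattered with `hue="class"` exactly as before.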
So yeah, pretty similar to what we just saw. But what's actually really cool is what happens if we change the features to one of the pairs that were on top of each other. Ah, okay: compactness and asymmetry, this one's messy. So if I come down here and I set compactness and asymmetry and try to do this in 2-D, this is the scatter plot k-means gives me for those two dimensions, compactness and asymmetry, if we just look at those two. These are our three classes, and we know that the original looks something like this. Are these two remotely alike? No. Okay, so now if I come back down here and rerun the higher-dimensions one (though for these clusters I first need to get the labels of the k-means again), if I rerun this with higher dimensions: well, if we zoom out and take a look at these two, sure, the colors are mixed up, but in general the three groups are there. This does a much better job of assessing which group is which.
So, for example, we could relabel the one in the original class to two. Sorry, okay, this is kind of confusing, but: if this light pink were projected onto this darker pink here, and this dark one were actually the light pink, and this light one were this dark one, then you can see that these correspond to one another. Even these two points up here are in the same class, as are all the other ones over here that share a color. So you don't want to compare the colors between the plots; you want to compare which points end up in the same color within each plot.
So that's one cool application, and this is how k-means functions: it basically takes all the data points and asks, all right, where are my clusters, given these pieces of data? The next thing that we talked about is PCA. With PCA we're reducing the dimension: we're mapping all of these, say, seven dimensions (I don't know offhand if there are seven, I made that number up), multiple dimensions, into a lower-dimensional space. So let's see how that works.
So from sklearn.decomposition I can import PCA, and that will be my PCA model. When I construct the PCA, n_components is how many dimensions you want to map down to, and for this exercise let's do two. Okay, so now I'm taking the top two components, and my transformed x is going to be pca.fit_transform of the same x that I had up here, so basically all the feature values: area, perimeter, compactness, length, width, asymmetry, groove. Let's run that, and we've transformed it. Let's look at what the shape of x used to be. There, okay, so seven was right: I had 210 samples, each seven features long, basically. And now my transformed x is 210 samples, but each only of length two, which means I only have two dimensions now that I'm plotting. We can even take a look at the first five entries: okay, so now we see that each sample is a two-dimensional point in our new dimensions.
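A sketch of that dimensionality reduction, again with synthetic 7-feature data in place of the seeds file:

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-in for the 210 x 7 seeds feature matrix.
X, _ = make_blobs(n_samples=210, centers=3, n_features=7, random_state=0)

pca = PCA(n_components=2)         # keep the top two principal components
transformed_x = pca.fit_transform(X)

print(X.shape)                    # (210, 7): 210 samples, 7 features each
print(transformed_x.shape)        # (210, 2): each sample is now a 2-D point
print(transformed_x[:5])          # the first five samples in the new space
```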
So what's cool is that I can actually scatter these, dimension zero of transformed x against dimension one, so I have to take the columns here. And if I show that, basically we've just taken this seven-dimensional thing and made it into a, I guess, two-dimensional representation. So that's the point of PCA.
And actually, let's go ahead and do the same clustering exercise as we did up here. Taking the k-means, I can construct this PCA data frame: the data frame is going to be an h-stack of this transformed x and the cluster labels. Actually, instead of clusters I'm going to use kmeans.labels_, and I need to reshape it so it's 2-D so we can do the h-stack. For the columns, I'm going to set this to pca1, pca2, and class. All right, so now if I take this, I can also do the same for the truth, but instead of the k-means labels I want the original classes from the data frame, and I'm just going to take the values from those. So now I have a data frame for the k-means with the PCA, and then a data frame for the truth, also with the PCA, and I can now plot these similarly to how I plotted these up here. So let me actually take those two plotting cells. Instead of the cluster data frame, this one takes the k-means PCA data frame; the hue is still going to be class, but now x and y are going to be the two PCA dimensions. Okay.
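Putting that together, a sketch of the two PCA data frames and their plots (synthetic data again standing in for the seeds features and classes):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-in for the seeds features and their true classes.
X, true_class = make_blobs(n_samples=210, centers=3, n_features=7,
                           random_state=0)
true_class = true_class + 1          # classes 1..3, as in the seeds file

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
transformed_x = PCA(n_components=2).fit_transform(X)

# One data frame colored by the k-means labels, one by the truth.
kmeans_pca_df = pd.DataFrame(
    np.hstack((transformed_x, kmeans.labels_.reshape(-1, 1))),
    columns=["pca1", "pca2", "class"])
truth_pca_df = pd.DataFrame(
    np.hstack((transformed_x, true_class.reshape(-1, 1))),
    columns=["pca1", "pca2", "class"])

sns.scatterplot(x="pca1", y="pca2", hue="class", data=kmeans_pca_df)
plt.show()
sns.scatterplot(x="pca1", y="pca2", hue="class", data=truth_pca_df)
plt.show()
```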
So these are my two PCA dimensions, and you can see that they're pretty spread out. And then here I'm going to go to my truth classes: again it's pca1 and pca2, but instead of the k-means this should be the truth PCA data frame. You can see that in the truth data frame, along these two dimensions, we're actually doing fairly well in terms of separation; it does seem like this is slightly more separable than the other dimensions that we had been looking at up here, so that's a good sign. And up here you can see that, hey, some of these correspond to one another. For the most part, our unsupervised clustering algorithm is able to spit out the proper labels, if you map its specific labels to the different types of kernels: for example, these might all be the Kama kernels, and same here; these might all be the Rosa kernels; and these might all be the Canadian kernels. It does struggle a little bit where they overlap, but for the most part our algorithm is able to find the three different categories and do a fairly good job of predicting them without any information from us; we haven't given our algorithm any labels. So that's the gist of unsupervised learning.
I hope you guys enjoyed this course, and I hope a lot of these examples made sense. If there are certain things that I have done, and you're somebody with more experience than me, please feel free to correct me in the comments, and we can all learn from this together as a community. So thank you all for watching.