All language subtitles for 2022_lecture7-720p-en

af Afrikaans
sq Albanian
am Amharic
ar Arabic
hy Armenian
az Azerbaijani
eu Basque
be Belarusian
bn Bengali
bs Bosnian
bg Bulgarian
ca Catalan
ceb Cebuano
ny Chichewa
zh-CN Chinese (Simplified)
zh-TW Chinese (Traditional)
co Corsican
hr Croatian
cs Czech
da Danish
nl Dutch
en English
eo Esperanto
et Estonian
tl Filipino
fi Finnish
fr French
fy Frisian
gl Galician
ka Georgian
de German
el Greek
gu Gujarati
ht Haitian Creole
ha Hausa
haw Hawaiian
iw Hebrew
hi Hindi
hmn Hmong
hu Hungarian
is Icelandic
ig Igbo
id Indonesian
ga Irish
it Italian
ja Japanese
jw Javanese
kn Kannada
kk Kazakh
km Khmer
ko Korean
ku Kurdish (Kurmanji)
ky Kyrgyz
lo Lao
la Latin
lv Latvian
lt Lithuanian
lb Luxembourgish
mk Macedonian
mg Malagasy
ms Malay
ml Malayalam
mt Maltese
mi Maori
mr Marathi
mn Mongolian
my Myanmar (Burmese)
ne Nepali
no Norwegian
ps Pashto
fa Persian
pl Polish
pt Portuguese
pa Punjabi
ro Romanian
ru Russian
sm Samoan
gd Scots Gaelic
sr Serbian
st Sesotho
sn Shona
sd Sindhi
si Sinhala
sk Slovak
sl Slovenian
so Somali
es Spanish
su Sundanese
sw Swahili
sv Swedish
tg Tajik
ta Tamil
te Telugu
th Thai
tr Turkish
uk Ukrainian Download
ur Urdu
uz Uzbek
vi Vietnamese
cy Welsh
xh Xhosa
yi Yiddish
yo Yoruba
zu Zulu
or Odia (Oriya)
rw Kinyarwanda
tk Turkmen
tt Tatar
ug Uyghur
Would you like to inspect the original subtitles? These are the user uploaded subtitles that are being translated: 0 00:00:00,000 --> 00:00:03,968 [MUSIC PLAYING] 1 00:00:03,968 --> 00:01:18,020 [MUSIC PLAYING] 2 00:01:18,020 --> 00:01:20,030 DAVID J. MALAN: This is CS50. 3 00:01:20,030 --> 00:01:23,360 And this is week 7, the week, here, of Halloween. 4 00:01:23,360 --> 00:01:26,510 Indeed, special thanks to CS50's own Valerie and her mom 5 00:01:26,510 --> 00:01:30,590 for having created this very festive scenery, and all past ones as well. 6 00:01:30,590 --> 00:01:33,330 Today, we pick up where we left off last time, 7 00:01:33,330 --> 00:01:35,180 which, recall, we introduced Python. 8 00:01:35,180 --> 00:01:38,240 And that was our big transition from C, where suddenly things 9 00:01:38,240 --> 00:01:40,850 started to look new again, probably, syntactically. 10 00:01:40,850 --> 00:01:44,390 But also, probably things hopefully started to feel easier. 11 00:01:44,390 --> 00:01:48,080 Well, with that said, problem set 6 certainly added some challenges, 12 00:01:48,080 --> 00:01:49,380 and you did some new things. 13 00:01:49,380 --> 00:01:53,090 But hopefully you've begun to appreciate that with Python, just a lot more stuff 14 00:01:53,090 --> 00:01:54,290 is easier to do. 15 00:01:54,290 --> 00:01:56,960 You get more out of the box with the language itself. 16 00:01:56,960 --> 00:01:59,450 And that's going to be so useful over the coming weeks 17 00:01:59,450 --> 00:02:03,980 as we transition further to introducing something called databases today, 18 00:02:03,980 --> 00:02:06,270 web programming next week and the week after. 19 00:02:06,270 --> 00:02:08,900 So that by term's end, and perhaps even for your final project, 20 00:02:08,900 --> 00:02:11,870 you really are building something from scratch 21 00:02:11,870 --> 00:02:15,415 using all of these various tools somehow together. 22 00:02:15,415 --> 00:02:17,540 So before we do that, though, today, let's consider 23 00:02:17,540 --> 00:02:22,610 what we weren't really able to do last week, which was actually 24 00:02:22,610 --> 00:02:25,490 create and store data ourselves. 25 00:02:25,490 --> 00:02:30,710 In Python, we've played around with the CSV, comma-separated values library. 26 00:02:30,710 --> 00:02:33,810 And you've been able to read in CSVs from disk, 27 00:02:33,810 --> 00:02:37,880 so to speak, that is, from files in your programming environments. 28 00:02:37,880 --> 00:02:41,455 But we haven't necessarily started saving data, persisting data ourselves. 29 00:02:41,455 --> 00:02:44,330 And that's a huge limitation, because pretty much all of the examples 30 00:02:44,330 --> 00:02:46,288 we've done thus far with a couple of exceptions 31 00:02:46,288 --> 00:02:49,550 have involved my providing input at the keyboard or even vocally. 32 00:02:49,550 --> 00:02:51,508 But then nothing happens to it. 33 00:02:51,508 --> 00:02:53,300 It disappears the moment the program quits, 34 00:02:53,300 --> 00:02:55,410 because it was only being stored in memory. 35 00:02:55,410 --> 00:02:59,090 But today, we'll start to focus all the more on storing things on disk, 36 00:02:59,090 --> 00:03:02,100 that is, storing things in files and folders 37 00:03:02,100 --> 00:03:04,940 so that you can actually write programs that remember 38 00:03:04,940 --> 00:03:06,620 what it is the human did last time. 39 00:03:06,620 --> 00:03:09,320 And ultimately, you can actually make mobile or web apps 40 00:03:09,320 --> 00:03:12,050 that actually begin to grow, and grow, and grow their data sets, 41 00:03:12,050 --> 00:03:15,650 as might happen if you get more and more users, for instance, on a website. 42 00:03:15,650 --> 00:03:19,400 To play, then, with this new capability of being able to write files, 43 00:03:19,400 --> 00:03:22,217 let's go ahead and just collect some data. 44 00:03:22,217 --> 00:03:24,050 In fact, those of you here in person, if you 45 00:03:24,050 --> 00:03:26,960 want to pull up this URL on your phone or laptop, 46 00:03:26,960 --> 00:03:29,060 that's going to lead you to a Google Form. 47 00:03:29,060 --> 00:03:34,130 And that Google Form is going to ask you in just a moment for really 48 00:03:34,130 --> 00:03:35,320 just your favorite TV show. 49 00:03:35,320 --> 00:03:37,070 And it's going to ask you to categorize it 50 00:03:37,070 --> 00:03:41,270 according to a genre, like comedy, or drama, or action, or musical, 51 00:03:41,270 --> 00:03:42,460 or something like that. 52 00:03:42,460 --> 00:03:44,210 And this is useful, because if you've ever 53 00:03:44,210 --> 00:03:47,520 used a Google Form before, or Microsoft's equivalent with Office 365, 54 00:03:47,520 --> 00:03:50,270 it's a really useful mechanism at just collecting data from users, 55 00:03:50,270 --> 00:03:54,000 and then ultimately, putting it into a spreadsheet form. 56 00:03:54,000 --> 00:03:58,160 So this is a screenshot of the form that those of you 57 00:03:58,160 --> 00:04:01,080 here in person or tuning in on Zoom are currently filling out. 58 00:04:01,080 --> 00:04:02,420 It's asking only two questions. 59 00:04:02,420 --> 00:04:04,550 What's the title of your favorite TV show? 60 00:04:04,550 --> 00:04:09,470 And what are one or more genres into which your TV show falls? 61 00:04:09,470 --> 00:04:12,860 And I'll go ahead and pivot now to the view 62 00:04:12,860 --> 00:04:15,667 that I'll be able to see as the person who created this form, which 63 00:04:15,667 --> 00:04:17,209 is quite simply a Google spreadsheet. 64 00:04:17,209 --> 00:04:19,709 Google Forms has this nice feature, if you've never noticed, 65 00:04:19,709 --> 00:04:22,650 that allows you to export your data to a Google Spreadsheet. 66 00:04:22,650 --> 00:04:25,130 And then from there, we can actually grab the file 67 00:04:25,130 --> 00:04:27,500 and download it to my own Mac or your own PC 68 00:04:27,500 --> 00:04:30,320 so that we can actually play around with the data that's come in. 69 00:04:30,320 --> 00:04:32,570 So in fact, let me go ahead and slide over 70 00:04:32,570 --> 00:04:36,540 to this, the live Google Spreadsheet. 71 00:04:36,540 --> 00:04:40,490 And you'll see, probably, a whole bunch of familiar TV shows here, 72 00:04:40,490 --> 00:04:41,360 all coming in. 73 00:04:41,360 --> 00:04:43,820 And if we keep scrolling, and scrolling, and scrolling-- 74 00:04:43,820 --> 00:04:45,290 only 46, 47. 75 00:04:45,290 --> 00:04:47,430 There we go, up to 50 plus already. 76 00:04:47,430 --> 00:04:50,420 If you need that URL again here, if you're just tuning in, 77 00:04:50,420 --> 00:04:52,850 you can go to this URL here. 78 00:04:52,850 --> 00:04:55,760 And in just a moment, we'll have a bunch of data 79 00:04:55,760 --> 00:04:59,450 with which we can start to experiment. 80 00:04:59,450 --> 00:05:01,070 I'll give you a moment or so there. 81 00:05:01,070 --> 00:05:06,330 82 00:05:06,330 --> 00:05:08,418 All right. 83 00:05:08,418 --> 00:05:09,960 Let me hang in there a little longer. 84 00:05:09,960 --> 00:05:11,418 OK, we've got over 100 submissions. 85 00:05:11,418 --> 00:05:12,390 Good. 86 00:05:12,390 --> 00:05:15,270 Good, even more coming in now. 87 00:05:15,270 --> 00:05:16,890 And we can see them coming in live. 88 00:05:16,890 --> 00:05:18,750 Here, let me switch back to the spreadsheet. 89 00:05:18,750 --> 00:05:21,090 The list is growing, and growing, and growing. 90 00:05:21,090 --> 00:05:22,900 And in just a moment-- 91 00:05:22,900 --> 00:05:26,490 let me give Carter a moment to help me export it in real time. 92 00:05:26,490 --> 00:05:29,640 Carter, just give me a heads up when it's reasonable for me 93 00:05:29,640 --> 00:05:32,880 to download this file. 94 00:05:32,880 --> 00:05:35,140 All right, and I'll begin to do this very slowly. 95 00:05:35,140 --> 00:05:37,720 So I'm going to go up to the File menu, if you've never done this before. 96 00:05:37,720 --> 00:05:40,110 Download-- you can download a whole bunch of formats, one in Excel. 97 00:05:40,110 --> 00:05:42,060 But more simply, and the one we'll start to play with here, 98 00:05:42,060 --> 00:05:43,630 is comma-separated values. 99 00:05:43,630 --> 00:05:47,010 So CSV files we used this past week, why are they useful? 100 00:05:47,010 --> 00:05:50,190 Now that you've played with them or used them in past real world, 101 00:05:50,190 --> 00:05:55,260 what's the utility of a CSV file versus something like Excel, for instance? 102 00:05:55,260 --> 00:05:59,010 Why CSV in the first place? 103 00:05:59,010 --> 00:06:00,030 Any instincts? 104 00:06:00,030 --> 00:06:00,858 Yeah? 105 00:06:00,858 --> 00:06:02,730 AUDIENCE: Because it's just a text file? 106 00:06:02,730 --> 00:06:04,605 DAVID J. MALAN: OK, so storage is compelling. 107 00:06:04,605 --> 00:06:07,710 A simple text file with ASCII or Unicode text is probably pretty small. 108 00:06:07,710 --> 00:06:08,290 I like that. 109 00:06:08,290 --> 00:06:10,101 Other thoughts? 110 00:06:10,101 --> 00:06:11,184 AUDIENCE: Structure of it? 111 00:06:11,184 --> 00:06:12,518 DAVID J. MALAN: Yeah, well said. 112 00:06:12,518 --> 00:06:14,760 It's just a simple text format, but using conventions 113 00:06:14,760 --> 00:06:18,330 like commas you can represent the idea of columns using new lines, 114 00:06:18,330 --> 00:06:20,430 backslash ends invisibly at the end of your lines, 115 00:06:20,430 --> 00:06:22,000 you can create the idea of rows. 116 00:06:22,000 --> 00:06:24,000 So it's a very simple way of implementing what 117 00:06:24,000 --> 00:06:26,610 we might call a flat-file database. 118 00:06:26,610 --> 00:06:28,860 It's a way of storing data in a flat, that is, 119 00:06:28,860 --> 00:06:32,310 very simple file that's just pure ASCII or Unicode text. 120 00:06:32,310 --> 00:06:35,340 And more compellingly, I dare say, is that with a CSV file, 121 00:06:35,340 --> 00:06:36,862 it's completely portable. 122 00:06:36,862 --> 00:06:38,820 Something is portable in the world of computing 123 00:06:38,820 --> 00:06:41,910 if it means you can use it on a Mac or a PC running this operating system, 124 00:06:41,910 --> 00:06:42,750 or this other one. 125 00:06:42,750 --> 00:06:45,480 And portability is nice because if I were to download an Excel file, 126 00:06:45,480 --> 00:06:47,760 there'd be a whole bunch of people in this room and online 127 00:06:47,760 --> 00:06:50,190 who couldn't download it because they haven't bought Microsoft Excel 128 00:06:50,190 --> 00:06:50,940 or installed it. 129 00:06:50,940 --> 00:06:55,680 Or if they have a Mac, or if it's a .numbers file in the Mac world, 130 00:06:55,680 --> 00:06:57,610 a PC user might not be able to download it. 131 00:06:57,610 --> 00:06:59,800 So a CSV is indeed very portable. 132 00:06:59,800 --> 00:07:02,940 So I'm going to go ahead and download, quite simply, the CSV 133 00:07:02,940 --> 00:07:04,440 version of this file. 134 00:07:04,440 --> 00:07:06,960 That's going to put it onto my own Mac's Downloads folder. 135 00:07:06,960 --> 00:07:11,460 And let me go ahead here, and in just a moment, let me just simplify the name. 136 00:07:11,460 --> 00:07:14,910 Because it actually downloads it at a pretty large name. 137 00:07:14,910 --> 00:07:18,120 And give me just one moment here, and you'll see that, indeed, 138 00:07:18,120 --> 00:07:21,210 on my Mac I have a file called favorites.csv. 139 00:07:21,210 --> 00:07:22,750 I shortened the name real quick. 140 00:07:22,750 --> 00:07:28,680 And now what I'm going to do is go over to VS Code, and in VS Code, 141 00:07:28,680 --> 00:07:30,480 I'm going to open my File Explorer. 142 00:07:30,480 --> 00:07:34,620 And if I minimize my window here for a moment, a handy feature of VS Code 143 00:07:34,620 --> 00:07:37,980 is that you can just drag and drop a file, for instance, into your Explorer. 144 00:07:37,980 --> 00:07:40,660 And voila, it's going to automatically upload it for you. 145 00:07:40,660 --> 00:07:43,260 So let me go ahead and full screen here, close my Explorer, 146 00:07:43,260 --> 00:07:45,120 temporarily close my Terminal window. 147 00:07:45,120 --> 00:07:48,720 And you'll see here a CSV file, favorites.csv. 148 00:07:48,720 --> 00:07:51,390 And the first row, by convention, has whatever 149 00:07:51,390 --> 00:07:54,720 the columns were in Google Spreadsheets, or Office 365, 150 00:07:54,720 --> 00:07:58,620 in Excel online, timestamp, comma, title, comma, genres. 151 00:07:58,620 --> 00:08:00,390 Then, we have timestamps, which indicates 152 00:08:00,390 --> 00:08:01,722 when people started submitting. 153 00:08:01,722 --> 00:08:03,930 Looks like a couple of people were super eager to get 154 00:08:03,930 --> 00:08:05,430 started an hour or two ago. 155 00:08:05,430 --> 00:08:09,340 And then, you have the title next, after a comma. 156 00:08:09,340 --> 00:08:12,150 But there's kind of a curiosity after that. 157 00:08:12,150 --> 00:08:15,510 Sometimes I see the genre like comedy, comedy, comedy, 158 00:08:15,510 --> 00:08:19,870 but sometimes it's like crime, comma, drama, or action, comma, crime, comma, 159 00:08:19,870 --> 00:08:20,550 drama. 160 00:08:20,550 --> 00:08:22,470 And those things are quoted. 161 00:08:22,470 --> 00:08:24,180 And yet, I didn't do any quotes. 162 00:08:24,180 --> 00:08:25,800 You probably didn't type any quotes. 163 00:08:25,800 --> 00:08:30,180 Where are those quotes coming from in this CSV file? 164 00:08:30,180 --> 00:08:31,650 Why are they there if we infer? 165 00:08:31,650 --> 00:08:32,485 Yeah? 166 00:08:32,485 --> 00:08:35,340 AUDIENCE: [INAUDIBLE] 167 00:08:35,340 --> 00:08:38,308 DAVID J. MALAN: Yeah, so you have a corner case, if you will. 168 00:08:38,308 --> 00:08:40,350 Because if you're using commas, as you described, 169 00:08:40,350 --> 00:08:44,280 to separate your data into what are effectively columns, well, 170 00:08:44,280 --> 00:08:47,010 you've painted yourself into a corner if your actual data 171 00:08:47,010 --> 00:08:48,578 has commas in it itself. 172 00:08:48,578 --> 00:08:51,120 So what Google has done, what Microsoft does, what Apple does 173 00:08:51,120 --> 00:08:54,330 is, they quote any strings of text that themselves 174 00:08:54,330 --> 00:08:58,560 have commas so that these are now English grammatical commas, 175 00:08:58,560 --> 00:09:00,730 not CSV specific commas. 176 00:09:00,730 --> 00:09:03,400 So it's a way of escaping your data, if you will. 177 00:09:03,400 --> 00:09:06,070 And escaping just means to call out a symbol in a special way 178 00:09:06,070 --> 00:09:08,637 so it's not misinterpreted as something else. 179 00:09:08,637 --> 00:09:10,470 All right, so this is all to say that we now 180 00:09:10,470 --> 00:09:13,980 have all of this data with which we can play in the form of what we'll 181 00:09:13,980 --> 00:09:15,840 start calling a flat-file database. 182 00:09:15,840 --> 00:09:19,230 So suppose I wanted to now start manipulating this data, 183 00:09:19,230 --> 00:09:22,110 and I want to store it ultimately, indeed, in this CSV format. 184 00:09:22,110 --> 00:09:24,540 How can I actually start to read this data, 185 00:09:24,540 --> 00:09:26,950 maybe clean it up, maybe do some analytics on it 186 00:09:26,950 --> 00:09:30,570 and actually figure out, what's the most popular show among those who submitted 187 00:09:30,570 --> 00:09:32,190 here over the past few minutes? 188 00:09:32,190 --> 00:09:34,270 Well, let me go ahead and close this. 189 00:09:34,270 --> 00:09:38,970 Let me go ahead, then, and open up, for instance, just my Terminal window. 190 00:09:38,970 --> 00:09:41,970 And let's code up a file called favorites.py. 191 00:09:41,970 --> 00:09:46,110 And let's go ahead and iteratively start simple by just opening up this file 192 00:09:46,110 --> 00:09:47,830 and printing out what's inside of it. 193 00:09:47,830 --> 00:09:51,330 So you might recall that we can do this by doing something like import CSV 194 00:09:51,330 --> 00:09:54,990 to give myself some CSV reading functionality. 195 00:09:54,990 --> 00:09:59,610 Then, I can go ahead and do something like with open, the name of the file 196 00:09:59,610 --> 00:10:02,280 that I want to open in read mode. 197 00:10:02,280 --> 00:10:03,990 Quote, unquote, "r" means to read it. 198 00:10:03,990 --> 00:10:06,600 And then, I can say as file, or whatever other name 199 00:10:06,600 --> 00:10:09,670 for a variable to say that I want to open this file, 200 00:10:09,670 --> 00:10:12,480 and essentially store some kind of reference to it in that variable 201 00:10:12,480 --> 00:10:13,600 called file. 202 00:10:13,600 --> 00:10:16,920 Then, I can give myself a reader, and I can say csv.reader, 203 00:10:16,920 --> 00:10:18,397 passing in that file as input. 204 00:10:18,397 --> 00:10:19,980 And this is the magic of that library. 205 00:10:19,980 --> 00:10:23,190 It deals with the process of opening it, reading it, and giving you 206 00:10:23,190 --> 00:10:26,430 back something that you can just iterate over, like with a for loop 207 00:10:26,430 --> 00:10:30,510 I do want to skip the first row, and recall that I can do this. 208 00:10:30,510 --> 00:10:33,665 Next, reader, is this little trick that just says, ignore the first row. 209 00:10:33,665 --> 00:10:35,040 Because the first one is special. 210 00:10:35,040 --> 00:10:37,410 It said timestamp, title, genres. 211 00:10:37,410 --> 00:10:39,400 That's not your data, that was mine. 212 00:10:39,400 --> 00:10:41,880 But this means now that I've skipped that first row. 213 00:10:41,880 --> 00:10:44,700 Everything hereafter is going to be the title of a show 214 00:10:44,700 --> 00:10:46,260 that you all like, so let me do this. 215 00:10:46,260 --> 00:10:51,090 For row in the reader, let's go ahead and print out the title 216 00:10:51,090 --> 00:10:52,860 of the show each of you typed in. 217 00:10:52,860 --> 00:10:57,000 How do I get at the title of the show each of you typed in? 218 00:10:57,000 --> 00:10:58,740 It's somewhere inside of row. 219 00:10:58,740 --> 00:11:00,790 Row recalls a list. 220 00:11:00,790 --> 00:11:02,910 So what do I want to type next in order to get 221 00:11:02,910 --> 00:11:08,920 at the title of the current row just as a quick check here? 222 00:11:08,920 --> 00:11:11,350 What do I want to type to get at the title of the row, 223 00:11:11,350 --> 00:11:15,520 keeping in mind, again, that it was timestamp, title, genres? 224 00:11:15,520 --> 00:11:16,020 Yeah? 225 00:11:16,020 --> 00:11:16,895 AUDIENCE: [INAUDIBLE] 226 00:11:16,895 --> 00:11:18,910 DAVID J. MALAN: So row bracket 1 would give me 227 00:11:18,910 --> 00:11:22,640 the second column, 0 index, that is, the one in the middle with the title. 228 00:11:22,640 --> 00:11:24,520 So this program isn't that interesting yet, 229 00:11:24,520 --> 00:11:27,370 but it's a quick and dirty way to figure out, all right, what's my data 230 00:11:27,370 --> 00:11:27,820 look like? 231 00:11:27,820 --> 00:11:29,987 Let me actually just do a little bit of a check here 232 00:11:29,987 --> 00:11:32,380 and see if it contains the data I think it does. 233 00:11:32,380 --> 00:11:34,490 Let me maximize my Terminal window here. 234 00:11:34,490 --> 00:11:37,810 Let me run Python of favorites.py, hitting Enter. 235 00:11:37,810 --> 00:11:42,520 And you'll see now a purely textual list of all of the shows 236 00:11:42,520 --> 00:11:44,560 you all seem to like here. 237 00:11:44,560 --> 00:11:47,080 But what's noteworthy about it? 238 00:11:47,080 --> 00:11:49,960 Specific shows aside, judgment aside as to people's 239 00:11:49,960 --> 00:11:54,640 TV tastes, what's interesting or noteworthy about the data that 240 00:11:54,640 --> 00:11:58,090 might create some problems for us if we start to analyze this data, 241 00:11:58,090 --> 00:11:59,740 and figure out what's the most popular? 242 00:11:59,740 --> 00:12:02,590 How many people like this or that? 243 00:12:02,590 --> 00:12:03,340 What do you think? 244 00:12:03,340 --> 00:12:04,151 Yeah? 245 00:12:04,151 --> 00:12:07,458 AUDIENCE: User errors [INAUDIBLE]. 246 00:12:07,458 --> 00:12:09,500 DAVID J. MALAN: Yeah, there might be user errors, 247 00:12:09,500 --> 00:12:12,800 or just stylistic differences that give the appearance that one show 248 00:12:12,800 --> 00:12:14,400 is different from the other. 249 00:12:14,400 --> 00:12:16,970 For instance, here. 250 00:12:16,970 --> 00:12:20,250 Let's see if I can see an example on the screen here. 251 00:12:20,250 --> 00:12:24,340 Yeah, so friends here is an all lowercase, Friends here is capitalized. 252 00:12:24,340 --> 00:12:25,140 No big deal. 253 00:12:25,140 --> 00:12:26,370 We can sort of mitigate that. 254 00:12:26,370 --> 00:12:29,480 But this is just a tiny example of where data in the real world 255 00:12:29,480 --> 00:12:30,758 can get messy fast. 256 00:12:30,758 --> 00:12:32,300 And that probably wasn't even a typo. 257 00:12:32,300 --> 00:12:37,587 It was just someone not caring as much to capitalize it, and that's fine. 258 00:12:37,587 --> 00:12:39,920 Your users are going to type what they're going to type. 259 00:12:39,920 --> 00:12:43,580 So let's see if we can't now begin to get at more specific data, 260 00:12:43,580 --> 00:12:45,500 and maybe even clean some of this data up. 261 00:12:45,500 --> 00:12:50,570 Let me go back into my file called favorites.py here, 262 00:12:50,570 --> 00:12:54,350 and let's actually do something a little more user friendly for me. 263 00:12:54,350 --> 00:12:58,070 Instead of a reader, recall that there was this dictionary reader that's 264 00:12:58,070 --> 00:12:59,750 just a little more user friendly. 265 00:12:59,750 --> 00:13:04,670 And it means I can type in dictionary reader here, passing in the same file. 266 00:13:04,670 --> 00:13:10,670 But now, when I iterate over this reader variable, what is each row? 267 00:13:10,670 --> 00:13:14,060 When using a DictReader instead of a reader, recall, and this 268 00:13:14,060 --> 00:13:17,690 is just a peculiarity of the CSV library, 269 00:13:17,690 --> 00:13:22,530 this gives me back, not a list of cells, but what instead, 270 00:13:22,530 --> 00:13:24,950 which is marginally more user friendly for me? 271 00:13:24,950 --> 00:13:25,460 Yeah? 272 00:13:25,460 --> 00:13:27,585 AUDIENCE: [INAUDIBLE] 273 00:13:27,585 --> 00:13:28,460 DAVID J. MALAN: Yeah. 274 00:13:28,460 --> 00:13:30,860 I can now use open bracket, quotes, and the title. 275 00:13:30,860 --> 00:13:33,770 Because what's coming back now is a dict object, that is, 276 00:13:33,770 --> 00:13:37,010 a dictionary which has keys and values. 277 00:13:37,010 --> 00:13:39,060 The keys of which are the column headings. 278 00:13:39,060 --> 00:13:41,400 The values of which are the data I actually care about. 279 00:13:41,400 --> 00:13:44,030 So this is just marginally better because, one, it's 280 00:13:44,030 --> 00:13:47,330 just way more obvious to me, the author of this code, what it is I'm 281 00:13:47,330 --> 00:13:48,050 getting at. 282 00:13:48,050 --> 00:13:50,310 I don't remember what column the title was. 283 00:13:50,310 --> 00:13:50,810 Was it 0? 284 00:13:50,810 --> 00:13:51,320 Was it 1? 285 00:13:51,320 --> 00:13:51,890 Was it 2? 286 00:13:51,890 --> 00:13:53,450 That's something you're going to forget over time. 287 00:13:53,450 --> 00:13:56,510 And God forbid someone changes the data by just dragging and dropping 288 00:13:56,510 --> 00:13:59,390 the columns in Excel, or Apple Numbers, or Google Spreadsheets. 289 00:13:59,390 --> 00:14:01,978 That's going to break all of your numeric indices. 290 00:14:01,978 --> 00:14:03,770 And so a dictionary reader is arguably just 291 00:14:03,770 --> 00:14:07,310 better design because it's more robust against changes 292 00:14:07,310 --> 00:14:08,960 and potential errors like that. 293 00:14:08,960 --> 00:14:12,560 Now the effect of this change isn't going to be really any different. 294 00:14:12,560 --> 00:14:16,820 If I run Python of favorites.py, voila, I get all of the same results. 295 00:14:16,820 --> 00:14:21,290 But I've now not made any assumptions as to where each of the columns 296 00:14:21,290 --> 00:14:22,920 actually is numerically. 297 00:14:22,920 --> 00:14:23,420 All right. 298 00:14:23,420 --> 00:14:26,330 Well, let's go ahead and now filter out some duplicates. 299 00:14:26,330 --> 00:14:29,960 Because there's a lot of commonality among some of the shows here, so let's 300 00:14:29,960 --> 00:14:32,660 see if we can't filter out duplicates. 301 00:14:32,660 --> 00:14:38,900 If I'm reading a CSV file top to bottom, what intuitively might be the logic 302 00:14:38,900 --> 00:14:41,650 I want to implement to filter out duplicates? 303 00:14:41,650 --> 00:14:44,900 It's not going to be quite as simple as a simple function that does it for me. 304 00:14:44,900 --> 00:14:47,300 I'm going to have to build this. 305 00:14:47,300 --> 00:14:50,630 But logically, if you're reading a file from top to bottom, 306 00:14:50,630 --> 00:14:54,860 how might you go about, in Python or just any context, 307 00:14:54,860 --> 00:14:58,040 getting rid of duplicate values? 308 00:14:58,040 --> 00:14:59,862 Yeah, what do you think? 309 00:14:59,862 --> 00:15:05,735 AUDIENCE: [INAUDIBLE] 310 00:15:05,735 --> 00:15:06,610 DAVID J. MALAN: Sure. 311 00:15:06,610 --> 00:15:09,730 I could use a list and I could add each title to the list, 312 00:15:09,730 --> 00:15:13,070 but first check if I put this into the list before. 313 00:15:13,070 --> 00:15:15,140 So let's try a little something like that. 314 00:15:15,140 --> 00:15:18,340 Let me go ahead and create a variable at the top of my program here. 315 00:15:18,340 --> 00:15:21,340 I'll call it titles, for instance, initialize to an empty list, 316 00:15:21,340 --> 00:15:23,050 open bracket, close bracket. 317 00:15:23,050 --> 00:15:27,860 And then, inside of my loop here, instead of printing it out, 318 00:15:27,860 --> 00:15:29,630 let's start to make a decision. 319 00:15:29,630 --> 00:15:39,332 So if the current row's title is in the titles list 320 00:15:39,332 --> 00:15:40,540 I don't want to put it there. 321 00:15:40,540 --> 00:15:43,582 And actually, let me invert the logic so I'm doing something proactively. 322 00:15:43,582 --> 00:15:48,190 So if it's not the case that row bracket title is in titles, 323 00:15:48,190 --> 00:15:56,200 then, go ahead and do something like titles.append the current row's title. 324 00:15:56,200 --> 00:15:59,170 And recall that we saw .append a week or so ago, 325 00:15:59,170 --> 00:16:01,700 where it just allows you to append to the current list. 326 00:16:01,700 --> 00:16:04,810 And then, what can I do at the very end, after I'm all 327 00:16:04,810 --> 00:16:06,460 done reading the whole file? 328 00:16:06,460 --> 00:16:09,820 Why don't I go ahead and say, for title in titles, 329 00:16:09,820 --> 00:16:12,730 go ahead and print out the current title? 330 00:16:12,730 --> 00:16:16,700 So it's two loops now, and we can come back to the quality of that design. 331 00:16:16,700 --> 00:16:19,373 But let me go ahead here and rerun Python of favorites.py. 332 00:16:19,373 --> 00:16:22,540 Let me increase the size of my Terminal window so we can focus just on this, 333 00:16:22,540 --> 00:16:23,770 and hit Enter. 334 00:16:23,770 --> 00:16:28,120 And now, I'm just skimming. 335 00:16:28,120 --> 00:16:31,090 I don't think I'm seeing duplicates, although I 336 00:16:31,090 --> 00:16:34,000 am seeing some near duplicates. 337 00:16:34,000 --> 00:16:36,760 For instance, there's Friends again. 338 00:16:36,760 --> 00:16:40,060 And if we keep going, and going, and going, and going, 339 00:16:40,060 --> 00:16:41,260 there's Friends again. 340 00:16:41,260 --> 00:16:47,070 Oh, interesting, so that's curious that I seem to have multiple Friends, 341 00:16:47,070 --> 00:16:48,630 and I have this one here, too. 342 00:16:48,630 --> 00:16:51,120 So how might we clean this up further? 343 00:16:51,120 --> 00:16:53,393 I like your instincts, and it's a step closer to it. 344 00:16:53,393 --> 00:16:55,560 What are we going to have to do to really filter out 345 00:16:55,560 --> 00:16:57,360 those near duplicates? 346 00:16:57,360 --> 00:16:59,146 Any thoughts? 347 00:16:59,146 --> 00:17:04,045 AUDIENCE: You could set everything to lower [INAUDIBLE].. 348 00:17:04,045 --> 00:17:04,920 DAVID J. MALAN: Yeah. 349 00:17:04,920 --> 00:17:06,670 What are the common mistakes to summarize? 350 00:17:06,670 --> 00:17:09,405 We could ignore the capitalization altogether and maybe 351 00:17:09,405 --> 00:17:12,030 just force everything to lowercase, or everything to uppercase. 352 00:17:12,030 --> 00:17:13,920 Doesn't matter which, but let's just be consistent. 353 00:17:13,920 --> 00:17:16,795 And for those of you who might have accidentally or instinctively hit 354 00:17:16,795 --> 00:17:19,650 the spacebar at the beginning of your input or even at the end, 355 00:17:19,650 --> 00:17:21,180 we can strip that off, too. 356 00:17:21,180 --> 00:17:24,490 Stripping whitespace is a common thing just to clean up user input. 357 00:17:24,490 --> 00:17:27,869 So let me go back into my code here, and let me go ahead 358 00:17:27,869 --> 00:17:30,540 and tweak the title a little bit. 359 00:17:30,540 --> 00:17:33,360 Let me say that the current title inside of this loop 360 00:17:33,360 --> 00:17:36,300 is not going to be just the current row's title. 361 00:17:36,300 --> 00:17:40,440 But let me go ahead and strip off, from the left and the right implicitly, 362 00:17:40,440 --> 00:17:41,260 any whitespace. 363 00:17:41,260 --> 00:17:44,260 If you read the documentation for the strip function, it does just that. 364 00:17:44,260 --> 00:17:47,230 It gets rid of whitespace to the left, whitespace to the right. 365 00:17:47,230 --> 00:17:50,310 And then, if I want to force everything to maybe uppercase, 366 00:17:50,310 --> 00:17:52,710 I can just uppercase the entire string. 367 00:17:52,710 --> 00:17:56,310 And remember, what's handy about Python is you can chain some of these function 368 00:17:56,310 --> 00:17:59,550 calls together by just using dots again and again. 369 00:17:59,550 --> 00:18:01,500 And that just takes whatever just happened, 370 00:18:01,500 --> 00:18:04,320 like the whitespace got stripped off, then, it additionally 371 00:18:04,320 --> 00:18:06,280 uppercases the whole thing as well. 372 00:18:06,280 --> 00:18:10,980 So now, I'm going to just check whether this specific title is in titles. 373 00:18:10,980 --> 00:18:14,760 And if not, I'm going to go ahead and append that title, 374 00:18:14,760 --> 00:18:17,200 massaged into this different format, if you will. 375 00:18:17,200 --> 00:18:19,350 So I'm throwing away some information. 376 00:18:19,350 --> 00:18:23,760 I'm sacrificing all of the nuances of your grammar and input 377 00:18:23,760 --> 00:18:25,020 to the form itself. 378 00:18:25,020 --> 00:18:27,480 But at least I'm trying to canonicalize size, that is, 379 00:18:27,480 --> 00:18:29,950 standardize what the data actually looks like. 380 00:18:29,950 --> 00:18:33,810 So let me go ahead and run Python of favorites.py again and hit Enter. 381 00:18:33,810 --> 00:18:35,412 Oh, and this is just user error. 382 00:18:35,412 --> 00:18:36,870 Maybe you haven't seen this before. 383 00:18:36,870 --> 00:18:40,870 This just looks like a mistake on my part. 384 00:18:40,870 --> 00:18:43,200 I meant to say not even uppercase. 385 00:18:43,200 --> 00:18:44,202 That's completely wrong. 386 00:18:44,202 --> 00:18:46,410 The function is called upper, now that I think of it. 387 00:18:46,410 --> 00:18:46,830 All right. 388 00:18:46,830 --> 00:18:49,330 Let's go and increase the size of the Terminal window again. 389 00:18:49,330 --> 00:18:50,820 Run Python of favorites.py. 390 00:18:50,820 --> 00:18:55,530 And now, it's a little more overwhelming to look at because it's not sorted yet 391 00:18:55,530 --> 00:18:57,300 and it's all capitalized. 392 00:18:57,300 --> 00:19:03,300 But I don't think I'm seeing multiple Friends, so to speak. 393 00:19:03,300 --> 00:19:06,455 There's one Friends up here and that's it. 394 00:19:06,455 --> 00:19:07,830 I'm back up at my prompt already. 395 00:19:07,830 --> 00:19:10,270 So we seem now to be filtering out duplicates. 396 00:19:10,270 --> 00:19:13,582 Now, before we dive in further and clean this up further than this, 397 00:19:13,582 --> 00:19:14,790 what else could we have done? 398 00:19:14,790 --> 00:19:17,100 Well, it turns out that in Python 2 you often 399 00:19:17,100 --> 00:19:19,350 do get a lot of functionality built into the language. 400 00:19:19,350 --> 00:19:22,313 And I'm kind of implementing myself the idea of a set. 401 00:19:22,313 --> 00:19:24,480 If you think back to mathematics, a set is typically 402 00:19:24,480 --> 00:19:28,470 something with a bunch of values that has duplicates filtered out. 403 00:19:28,470 --> 00:19:31,290 Recall that Python already has this for us. 404 00:19:31,290 --> 00:19:34,200 And we saw it really briefly when I whipped up the dictionary 405 00:19:34,200 --> 00:19:36,190 implementation a couple of weeks back. 406 00:19:36,190 --> 00:19:40,980 So I could actually define my titles to be a set instead of a list, 407 00:19:40,980 --> 00:19:46,080 and this would just modestly allow me to refine my code here, such 408 00:19:46,080 --> 00:19:48,870 that I don't have to bother checking for duplicates anyway. 409 00:19:48,870 --> 00:19:52,710 I can instead just say something like, titles.add 410 00:19:52,710 --> 00:19:55,807 the current title, like this. 411 00:19:55,807 --> 00:19:58,890 Marginally better design if you know that a set exists because you're just 412 00:19:58,890 --> 00:20:00,753 getting more functionality out of this. 413 00:20:00,753 --> 00:20:02,670 All right, so let's clean the data up further. 414 00:20:02,670 --> 00:20:06,150 We've now gone ahead and fixed the problem of case sensitivity. 415 00:20:06,150 --> 00:20:08,940 We threw away whitespace in case someone had hit the spacebar 416 00:20:08,940 --> 00:20:09,990 with some of the input. 417 00:20:09,990 --> 00:20:13,960 Let's go ahead now and sort these things by the titles themselves. 418 00:20:13,960 --> 00:20:17,280 So instead of just printing out the titles in the same order 419 00:20:17,280 --> 00:20:22,142 you all inputted them, but filtering out duplicates as we go, let me go ahead 420 00:20:22,142 --> 00:20:24,600 and use another function in Python you might not have seen, 421 00:20:24,600 --> 00:20:26,940 which is literally called sorted, and will 422 00:20:26,940 --> 00:20:31,900 take care of the process of actually sorting titles for you. 423 00:20:31,900 --> 00:20:34,440 Let me go ahead and increase the font size of my Terminal, 424 00:20:34,440 --> 00:20:36,570 run Python of favorites.py, and hit Enter. 425 00:20:36,570 --> 00:20:40,410 And now you can really see how many of these shows start with the word "the" 426 00:20:40,410 --> 00:20:41,220 or do not. 427 00:20:41,220 --> 00:20:43,590 Now it's a little easier to wrap our minds around, 428 00:20:43,590 --> 00:20:46,650 just because it's at least sorted alphabetically. 429 00:20:46,650 --> 00:20:50,220 But now you can really see some of the differences in people's inputs. 430 00:20:50,220 --> 00:20:51,850 So far, so good. 431 00:20:51,850 --> 00:20:56,400 But a few of you decided to stylize Avatar in three different ways here. 432 00:20:56,400 --> 00:20:59,520 Brooklyn 99 is a couple of different ways here. 433 00:20:59,520 --> 00:21:02,700 And I think if we keep going we'll see further and further variances that we 434 00:21:02,700 --> 00:21:06,460 did not fix by focusing on whitespace and capitalization alone. 435 00:21:06,460 --> 00:21:09,870 So already here, this is only, what, 100 plus, 200 rows. 436 00:21:09,870 --> 00:21:12,795 Already real-world data starts to get messy quickly, 437 00:21:12,795 --> 00:21:14,670 and that might not bode well when we actually 438 00:21:14,670 --> 00:21:16,650 want to keep around real data from real users. 439 00:21:16,650 --> 00:21:19,025 You can imagine an actual website or a mobile application 440 00:21:19,025 --> 00:21:21,660 dealing with this kind of thing on scale. 441 00:21:21,660 --> 00:21:23,080 Well, let's go ahead and do this. 442 00:21:23,080 --> 00:21:26,370 Let's actually figure out the popularity of these various shows 443 00:21:26,370 --> 00:21:31,680 by now iterating over my data, and keeping track of how many of you 444 00:21:31,680 --> 00:21:33,000 inputted a given title. 445 00:21:33,000 --> 00:21:37,890 We're going to ignore the problems like Brooklyn 99 and the Avatar. 446 00:21:37,890 --> 00:21:42,390 Sorry, yeah, Avatar, where there was things 447 00:21:42,390 --> 00:21:47,070 that were different beyond just whitespace and capitalization. 448 00:21:47,070 --> 00:21:48,840 But let's go ahead and keep track of, now, 449 00:21:48,840 --> 00:21:52,290 how many of you inputted each of these titles. 450 00:21:52,290 --> 00:21:53,448 So how can I do this? 451 00:21:53,448 --> 00:21:55,740 I'm still going to take this approach of iterating over 452 00:21:55,740 --> 00:21:58,110 the CSV file from top to bottom. 453 00:21:58,110 --> 00:22:00,360 We've used a couple of data structures thus far, 454 00:22:00,360 --> 00:22:04,260 a list to keep track of titles, or a set to keep track of titles. 455 00:22:04,260 --> 00:22:07,350 But what if I now want to keep around a little more information? 456 00:22:07,350 --> 00:22:13,110 For each title, I want to keep around how many times I've seen it before. 457 00:22:13,110 --> 00:22:14,130 I'm not doing that yet. 458 00:22:14,130 --> 00:22:17,880 I'm throwing away the total number of times I see these shows. 459 00:22:17,880 --> 00:22:20,520 How could I start to keep that around? 460 00:22:20,520 --> 00:22:21,845 AUDIENCE: Use a dictionary. 461 00:22:21,845 --> 00:22:23,970 DAVID J. MALAN: We could use a dictionary, and how? 462 00:22:23,970 --> 00:22:25,260 Elaborate on that. 463 00:22:25,260 --> 00:22:26,135 AUDIENCE: [INAUDIBLE] 464 00:22:26,135 --> 00:22:28,093 DAVID J. MALAN: Perfect, really good instincts. 465 00:22:28,093 --> 00:22:29,850 Using a dictionary, insofar as it lets us 466 00:22:29,850 --> 00:22:33,300 store keys and values, that is, associate something with something 467 00:22:33,300 --> 00:22:33,850 else. 468 00:22:33,850 --> 00:22:36,480 This is why a dictionary or hash tables more 469 00:22:36,480 --> 00:22:39,630 generally are such a useful, practical data structure. 470 00:22:39,630 --> 00:22:43,120 Because they just let you remember stuff in some kind of structured way. 471 00:22:43,120 --> 00:22:45,870 So if the keys are going to be the titles I've seen, 472 00:22:45,870 --> 00:22:49,860 the values could be the number of times I've seen each of those titles. 473 00:22:49,860 --> 00:22:54,308 And so it's kind of like just having a two-column table on paper. 474 00:22:54,308 --> 00:22:56,850 For instance, if I were going to do this on a piece of paper, 475 00:22:56,850 --> 00:22:59,340 I might just have two columns here, where 476 00:22:59,340 --> 00:23:04,590 maybe this is the title that I've seen, and this is the count over here. 477 00:23:04,590 --> 00:23:08,250 This is, in effect, a dictionary in Python. 478 00:23:08,250 --> 00:23:11,490 It's two columns, keys on the left, values on the right. 479 00:23:11,490 --> 00:23:13,620 And this, if I can implement in code, will actually 480 00:23:13,620 --> 00:23:17,580 allow me to store this data, and then maybe do some simple arithmetic 481 00:23:17,580 --> 00:23:19,450 to figure out which is the most popular. 482 00:23:19,450 --> 00:23:20,380 So let's do this. 483 00:23:20,380 --> 00:23:24,240 Let me go ahead and change my titles to not be a list, not be a set. 484 00:23:24,240 --> 00:23:29,160 Let's have it be a dictionary instead, either doing this, or more succinctly, 485 00:23:29,160 --> 00:23:33,600 two curly braces that are empty gives me an empty dictionary automatically. 486 00:23:33,600 --> 00:23:34,860 What do I now want to do? 487 00:23:34,860 --> 00:23:37,440 I think most of my code can stay the same. 488 00:23:37,440 --> 00:23:40,860 But down here, I don't want to just blindly add titles 489 00:23:40,860 --> 00:23:42,180 to the data structure. 490 00:23:42,180 --> 00:23:45,060 I somehow need to keep track of the count. 491 00:23:45,060 --> 00:23:48,690 And unfortunately, if I just do this-- let's do titles, 492 00:23:48,690 --> 00:23:53,580 bracket, title, plus equals 1. 493 00:23:53,580 --> 00:23:56,130 This is a reasonable first attempt at this. 494 00:23:56,130 --> 00:23:57,420 Because what am I doing? 495 00:23:57,420 --> 00:24:02,730 If titles is a dictionary and I want to look up the current title therein, 496 00:24:02,730 --> 00:24:05,520 the syntax for that, like before, is titles, bracket, and then 497 00:24:05,520 --> 00:24:08,880 the key you want to use to index into the dictionary. 498 00:24:08,880 --> 00:24:12,000 It's not a number in this case, it's an actual word, a title. 499 00:24:12,000 --> 00:24:14,220 And you're just going to increment it by one, 500 00:24:14,220 --> 00:24:17,010 and then eventually I'll come back and finish my second loop 501 00:24:17,010 --> 00:24:19,710 and do things in terms of the order. 502 00:24:19,710 --> 00:24:23,580 But for now, let's just keep track of the total counts. 503 00:24:23,580 --> 00:24:25,660 Let me go ahead and increase my Terminal window. 504 00:24:25,660 --> 00:24:29,590 Let me do Python of favorites.py and hit Enter. 505 00:24:29,590 --> 00:24:30,090 Huh. 506 00:24:30,090 --> 00:24:34,140 How I Met Your Mother is giving me a key error. 507 00:24:34,140 --> 00:24:36,750 What does that mean? 508 00:24:36,750 --> 00:24:39,120 And why am I seeing this? 509 00:24:39,120 --> 00:24:42,330 And in fact, just to give a little bit of a breadcrumb here, 510 00:24:42,330 --> 00:24:44,140 let me zoom out here. 511 00:24:44,140 --> 00:24:46,920 Let me open up the CSV file again real quickly. 512 00:24:46,920 --> 00:24:50,460 And wow, we didn't even get past the second row in the file 513 00:24:50,460 --> 00:24:52,210 or the first show in the file. 514 00:24:52,210 --> 00:24:54,840 Notice that How I Met Your Mother, somewhat lowercased, 515 00:24:54,840 --> 00:24:57,562 is the very first show in therein. 516 00:24:57,562 --> 00:24:59,520 What's your instinct for why this is happening? 517 00:24:59,520 --> 00:25:01,750 AUDIENCE: You don't have a starting point. 518 00:25:01,750 --> 00:25:03,667 DAVID J. MALAN: I don't have a starting point. 519 00:25:03,667 --> 00:25:04,920 I'm adding one to what? 520 00:25:04,920 --> 00:25:09,840 I'm blindly indexing into the dictionary using a key, How I Met Your Mother, 521 00:25:09,840 --> 00:25:11,880 that doesn't yet exist in the dictionary. 522 00:25:11,880 --> 00:25:14,610 And so Python throws what's called a key error 523 00:25:14,610 --> 00:25:17,440 because the key you're trying to use just doesn't exist yet. 524 00:25:17,440 --> 00:25:21,000 So logically, how could we fix this? 525 00:25:21,000 --> 00:25:21,690 We're close. 526 00:25:21,690 --> 00:25:25,110 We got half of the problem solved, but I'm not handling the obvious, now, 527 00:25:25,110 --> 00:25:26,850 case of nothing being there. 528 00:25:26,850 --> 00:25:27,595 Yeah? 529 00:25:27,595 --> 00:25:28,930 AUDIENCE: Creating a counter. 530 00:25:28,930 --> 00:25:29,620 DAVID J. MALAN: Creating a-- 531 00:25:29,620 --> 00:25:30,370 AUDIENCE: Counter. 532 00:25:30,370 --> 00:25:32,240 DAVID J. MALAN: Creating the counter itself. 533 00:25:32,240 --> 00:25:34,190 So maybe I could do something like this. 534 00:25:34,190 --> 00:25:38,380 Let me close my Terminal window and let me ask a question first. 535 00:25:38,380 --> 00:25:44,980 If the current title is in the dictionary already, if title in titles, 536 00:25:44,980 --> 00:25:47,440 that's going to give me a true-false answer it turns out. 537 00:25:47,440 --> 00:25:52,240 Then, I can safely say, titles, bracket, title, plus equals 1. 538 00:25:52,240 --> 00:25:56,740 And recall, this is just shorthand notation for the same thing as in C, 539 00:25:56,740 --> 00:25:58,300 title plus 1. 540 00:25:58,300 --> 00:25:59,360 Whoops, typo. 541 00:25:59,360 --> 00:26:00,168 Don't do that. 542 00:26:00,168 --> 00:26:02,710 That's the same thing as this but it's a little more succinct 543 00:26:02,710 --> 00:26:04,810 just to say plus equals 1. 544 00:26:04,810 --> 00:26:09,490 Else, if it's logically not the case that the current title is in the titles 545 00:26:09,490 --> 00:26:13,610 dictionary, then I probably want to say titles, bracket, title equals? 546 00:26:13,610 --> 00:26:15,315 Feel free to just shout it out. 547 00:26:15,315 --> 00:26:15,940 AUDIENCE: Zero. 548 00:26:15,940 --> 00:26:16,815 DAVID J. MALAN: Zero. 549 00:26:16,815 --> 00:26:21,320 I just have to put some value there so that the key itself is also there. 550 00:26:21,320 --> 00:26:21,820 All right. 551 00:26:21,820 --> 00:26:24,370 So now that I've got this going on, let me go ahead 552 00:26:24,370 --> 00:26:26,620 and undo my sorting temporarily. 553 00:26:26,620 --> 00:26:29,650 And now let me go ahead and do this. 554 00:26:29,650 --> 00:26:33,460 I can, as a quick check, let me go ahead and just run the code 555 00:26:33,460 --> 00:26:35,050 as is, Python of favorites.py. 556 00:26:35,050 --> 00:26:37,030 I'm back in business. 557 00:26:37,030 --> 00:26:39,790 It's printing correctly, no key errors, but it's not sorted. 558 00:26:39,790 --> 00:26:41,620 And I'm not seeing any of the counts. 559 00:26:41,620 --> 00:26:43,780 Let me just quickly add the counts, and there's 560 00:26:43,780 --> 00:26:45,530 a couple of ways I could do this. 561 00:26:45,530 --> 00:26:52,900 I could, say, print out the title, and then, maybe, let's do something like-- 562 00:26:52,900 --> 00:26:57,220 how about just, comma, titles, bracket, title? 563 00:26:57,220 --> 00:26:59,020 So I'm going to print two things at once, 564 00:26:59,020 --> 00:27:01,600 both the current title in the dictionary, 565 00:27:01,600 --> 00:27:04,300 and whatever its value is by indexing into it. 566 00:27:04,300 --> 00:27:06,140 Let me increase my Terminal window. 567 00:27:06,140 --> 00:27:10,420 Let me run Python of favorites.py, Enter, and OK. 568 00:27:10,420 --> 00:27:11,908 Huh. 569 00:27:11,908 --> 00:27:14,080 Huh. 570 00:27:14,080 --> 00:27:17,560 None of you said a whole lot of TV shows, it seems. 571 00:27:17,560 --> 00:27:21,690 What's the logical error here? 572 00:27:21,690 --> 00:27:25,460 What did I do wrong if I look back at my code here? 573 00:27:25,460 --> 00:27:25,960 Yeah? 574 00:27:25,960 --> 00:27:27,600 Why so many 0s? 575 00:27:27,600 --> 00:27:29,490 AUDIENCE: [INAUDIBLE] 576 00:27:29,490 --> 00:27:30,490 DAVID J. MALAN: Exactly. 577 00:27:30,490 --> 00:27:35,140 To summarize, I initialized the count to 0 the first time I saw it, 578 00:27:35,140 --> 00:27:38,500 but I should have initialized it at least to 1 because I just saw it. 579 00:27:38,500 --> 00:27:40,220 Or I should change my code a bit. 580 00:27:40,220 --> 00:27:42,880 So for instance, if I go back in here, the simplest fix 581 00:27:42,880 --> 00:27:46,390 is probably to initialize to 1, because on this iteration of the loop, 582 00:27:46,390 --> 00:27:49,210 obviously, I'm seeing this title for the very first time. 583 00:27:49,210 --> 00:27:51,580 Or I could change my logic a little bit. 584 00:27:51,580 --> 00:27:53,470 I could do something like this instead. 585 00:27:53,470 --> 00:27:58,840 If the current title is not in titles, then I could initialize it to 0. 586 00:27:58,840 --> 00:28:02,860 And then I could get rid of the else, and now blindly index 587 00:28:02,860 --> 00:28:04,900 into the titles dictionary. 588 00:28:04,900 --> 00:28:09,100 Because now, on line 11, I can trust that lines 9 and 10 589 00:28:09,100 --> 00:28:12,040 took care of the initialization for me if need be. 590 00:28:12,040 --> 00:28:13,028 Which one is better? 591 00:28:13,028 --> 00:28:13,570 I don't know. 592 00:28:13,570 --> 00:28:17,150 This one's a little nicer, maybe because it's one line fewer. 593 00:28:17,150 --> 00:28:20,470 But I think both approaches are perfectly reasonable and well-designed. 594 00:28:20,470 --> 00:28:22,420 But the key thing, no pun intended, is that we 595 00:28:22,420 --> 00:28:27,100 have to make sure the key exists before we presume to actually incrue. 596 00:28:27,100 --> 00:28:28,810 Oh, this is wrong. 597 00:28:28,810 --> 00:28:30,370 This is incorrect code. 598 00:28:30,370 --> 00:28:33,040 What did I do wrong? 599 00:28:33,040 --> 00:28:33,790 OK, yes. 600 00:28:33,790 --> 00:28:34,400 There we go. 601 00:28:34,400 --> 00:28:37,690 So otherwise, everyone would have liked this show once, and no matter 602 00:28:37,690 --> 00:28:39,190 how many people said the same thing. 603 00:28:39,190 --> 00:28:41,237 Now the code is as it should be. 604 00:28:41,237 --> 00:28:43,570 So let me go ahead and open up my Terminal window again. 605 00:28:43,570 --> 00:28:47,710 Let me run Python of favorites.py, and now we see more reasonable counts. 606 00:28:47,710 --> 00:28:49,150 Some shows weren't that popular. 607 00:28:49,150 --> 00:28:50,830 There's just 1s and maybe 2s. 608 00:28:50,830 --> 00:28:56,570 But I bet if we sort these things we can start to see a little more detail. 609 00:28:56,570 --> 00:28:57,890 So how else can we do this? 610 00:28:57,890 --> 00:29:04,520 Well, turns out, when dealing with a dictionary like this-- 611 00:29:04,520 --> 00:29:07,160 let's go ahead and just sort the titles themselves. 612 00:29:07,160 --> 00:29:12,130 So let's reintroduce the sorted function as I did before, but no other changes. 613 00:29:12,130 --> 00:29:14,735 Let me go ahead now and run Python of favorites.py. 614 00:29:14,735 --> 00:29:17,110 Now it's just a little easier to wrap your mind around it 615 00:29:17,110 --> 00:29:18,568 because at least it's alphabetical. 616 00:29:18,568 --> 00:29:22,600 But it's not sorted by value, it's sorted by key. 617 00:29:22,600 --> 00:29:26,170 But sure enough, if we scroll down, there's something down here, 618 00:29:26,170 --> 00:29:29,202 for instance, like, let's see, The Office. 619 00:29:29,202 --> 00:29:30,910 That's definitely going to be a contender 620 00:29:30,910 --> 00:29:32,860 for most popular, 15 responses. 621 00:29:32,860 --> 00:29:35,860 But let's see what's actually going to bubble up to the top. 622 00:29:35,860 --> 00:29:40,870 Unfortunately, the sorted function only sorts dictionaries by keys 623 00:29:40,870 --> 00:29:44,410 by default, not by values. 624 00:29:44,410 --> 00:29:47,500 But it turns out, in Python, if you read the documentation 625 00:29:47,500 --> 00:29:49,510 for the sorted function, you can actually 626 00:29:49,510 --> 00:29:54,580 pass in other arguments that tell it how to sort things. 627 00:29:54,580 --> 00:29:57,430 For instance, if I want to do things in reverse order, 628 00:29:57,430 --> 00:30:02,140 I can add a second parameter to the sorted function called reverse. 629 00:30:02,140 --> 00:30:03,310 And it's a named parameter. 630 00:30:03,310 --> 00:30:05,620 You literally say, reverse equals true, so 631 00:30:05,620 --> 00:30:08,830 that the position of it in the comma-separated list doesn't matter. 632 00:30:08,830 --> 00:30:11,762 If I now rerun this after increasing my Terminal window, 633 00:30:11,762 --> 00:30:13,720 you'll see now that it's in the opposite order. 634 00:30:13,720 --> 00:30:16,330 Now adventure and Anne with an E is at the bottom 635 00:30:16,330 --> 00:30:18,410 of the output instead of the top. 636 00:30:18,410 --> 00:30:26,845 How can I tell it to sort by values instead of by key? 637 00:30:26,845 --> 00:30:28,220 Well, let's go ahead and do this. 638 00:30:28,220 --> 00:30:30,940 Let me go ahead and define a function. 639 00:30:30,940 --> 00:30:33,070 I'm just going to call it f to keep things simple. 640 00:30:33,070 --> 00:30:36,190 And this f function is going to take a title as input. 641 00:30:36,190 --> 00:30:41,140 And given a given title, it's going to return the value of that title. 642 00:30:41,140 --> 00:30:44,560 So actually, maybe a better name for this would be get value, 643 00:30:44,560 --> 00:30:46,820 and/or we could come up with something else as well. 644 00:30:46,820 --> 00:30:49,300 The purpose of the get value function, to be clear, 645 00:30:49,300 --> 00:30:54,200 is to take it as input a title and then return the corresponding value. 646 00:30:54,200 --> 00:30:55,460 Why is this useful? 647 00:30:55,460 --> 00:30:57,760 Well, it turns out that the sorted function in Python, 648 00:30:57,760 --> 00:31:01,870 according to its documentation, also takes a key parameter, 649 00:31:01,870 --> 00:31:05,860 where you can pass in, crazy enough, the name of a function 650 00:31:05,860 --> 00:31:11,650 that it will use in order to determine what it should sort by, by the key, 651 00:31:11,650 --> 00:31:16,550 or by the value, or in other cases, even other types of data as well. 652 00:31:16,550 --> 00:31:19,390 So there's a curiosity here, though, that's very deliberate. 653 00:31:19,390 --> 00:31:21,550 Key is the name of the parameter, just like reverse 654 00:31:21,550 --> 00:31:23,092 was the name of this other parameter. 655 00:31:23,092 --> 00:31:26,110 The value of it, though, is not a function call. 656 00:31:26,110 --> 00:31:27,580 It's a function name. 657 00:31:27,580 --> 00:31:30,520 Notice I am not doing this, no parentheses. 658 00:31:30,520 --> 00:31:35,400 I'm instead passing in get value, the function I wrote, by its name. 659 00:31:35,400 --> 00:31:37,900 And this is a feature of Python and certain other languages. 660 00:31:37,900 --> 00:31:41,110 Just like variables, you can actually pass whole functions 661 00:31:41,110 --> 00:31:45,560 around so that they can be called for you later on by someone else. 662 00:31:45,560 --> 00:31:48,880 So what this means is that the sorted function written by Python, 663 00:31:48,880 --> 00:31:51,410 they didn't know what you're going to want to sort by today. 664 00:31:51,410 --> 00:31:55,810 But if you provide them with a function called get value, or anything else, now 665 00:31:55,810 --> 00:31:58,000 their sorted function will use that function 666 00:31:58,000 --> 00:32:01,840 to determine, OK, if you don't want to sort by the key of the dictionary, what 667 00:32:01,840 --> 00:32:02,950 do you want to sort by? 668 00:32:02,950 --> 00:32:05,800 This is going to tell it to sort by the value 669 00:32:05,800 --> 00:32:08,750 by returning the specific value we care about. 670 00:32:08,750 --> 00:32:12,580 So let me go ahead now and rerun this after increasing my Terminal, Python 671 00:32:12,580 --> 00:32:14,620 of favorites.py, Enter. 672 00:32:14,620 --> 00:32:17,110 Here we have now an example of all of the titles you all 673 00:32:17,110 --> 00:32:22,180 typed in, albeit forced to uppercase and with any whitespace thrown out. 674 00:32:22,180 --> 00:32:24,733 And now, The Office is an easy win over Friends, 675 00:32:24,733 --> 00:32:27,400 versus Community, versus Game of Thrones, Breaking Bad, and then 676 00:32:27,400 --> 00:32:29,740 a lot of variants thereafter. 677 00:32:29,740 --> 00:32:31,680 So there's a lot of steps to go through. 678 00:32:31,680 --> 00:32:33,555 This isn't that bad once you've done it once, 679 00:32:33,555 --> 00:32:35,472 and you know what these functions are, and you 680 00:32:35,472 --> 00:32:36,870 know that these parameters exist. 681 00:32:36,870 --> 00:32:38,010 But it's a lot of work. 682 00:32:38,010 --> 00:32:42,090 That's 17 lines of code just to analyze a CSV file 683 00:32:42,090 --> 00:32:45,420 that you all created by way of those Google Form submissions. 684 00:32:45,420 --> 00:32:48,360 But it took me a lot of work just to get simple answers out of it. 685 00:32:48,360 --> 00:32:50,277 And indeed, that's going to be among the goals 686 00:32:50,277 --> 00:32:52,920 for today, ultimately, is, how can we just make this easier? 687 00:32:52,920 --> 00:32:54,795 It's one thing to learn new things in Python, 688 00:32:54,795 --> 00:32:57,300 but if we can avoid writing code, or this much code, 689 00:32:57,300 --> 00:32:58,840 that's going to be a good thing. 690 00:32:58,840 --> 00:33:01,020 And so one other technique we can introduce here 691 00:33:01,020 --> 00:33:03,570 that does allow us to write a little less code 692 00:33:03,570 --> 00:33:05,730 is, we can actually get rid of this function. 693 00:33:05,730 --> 00:33:09,240 It turns out, in Python, if you just need to make a function 694 00:33:09,240 --> 00:33:11,940 but it's going to be used and then essentially thrown away, 695 00:33:11,940 --> 00:33:14,790 it's not something you're going to be reusing in multiple places-- 696 00:33:14,790 --> 00:33:17,550 it's not like a library function that you want to keep around-- 697 00:33:17,550 --> 00:33:19,680 you can actually just do this. 698 00:33:19,680 --> 00:33:23,430 You can change the value of this key parameter 699 00:33:23,430 --> 00:33:25,950 to be what's called a lambda function, which 700 00:33:25,950 --> 00:33:29,040 is a fancy way of saying a function that technically has no name. 701 00:33:29,040 --> 00:33:30,870 It's an anonymous function. 702 00:33:30,870 --> 00:33:32,400 Why does it have no name? 703 00:33:32,400 --> 00:33:35,530 Well, it's kind of stupid that I invented this name on line 13. 704 00:33:35,530 --> 00:33:38,670 I used it on line 16, and then I never again used it. 705 00:33:38,670 --> 00:33:42,520 If there's only being used in one place, why bother giving it a name at all? 706 00:33:42,520 --> 00:33:45,000 So if you instead, in Python, say lambda, 707 00:33:45,000 --> 00:33:47,850 and then type out the name of the parameter 708 00:33:47,850 --> 00:33:50,130 you want this anonymous function to take, 709 00:33:50,130 --> 00:33:54,460 you can then say, go ahead and return this value. 710 00:33:54,460 --> 00:33:57,030 Now let's notice the inconsistencies here. 711 00:33:57,030 --> 00:33:59,850 When you use this special lambda keyword that says, hey Python, 712 00:33:59,850 --> 00:34:02,850 give me an anonymous function, a function with no name, 713 00:34:02,850 --> 00:34:06,610 it then says, Python, this anonymous function will take one parameter. 714 00:34:06,610 --> 00:34:08,699 Notice there's no parentheses. 715 00:34:08,699 --> 00:34:10,500 And that's deliberate, if confusing. 716 00:34:10,500 --> 00:34:12,909 It just tightens things up a little bit. 717 00:34:12,909 --> 00:34:16,770 Notice that there's no return keyword, which similarly tightens things 718 00:34:16,770 --> 00:34:18,780 up a bit, albeit inconsistently. 719 00:34:18,780 --> 00:34:22,650 But this line of code I've just highlighted 720 00:34:22,650 --> 00:34:26,429 is actually identical in functionality to this. 721 00:34:26,429 --> 00:34:28,320 But it throws away the word [INAUDIBLE]. 722 00:34:28,320 --> 00:34:29,850 It throws away the word get value. 723 00:34:29,850 --> 00:34:33,510 It throws away the parentheses, and it throws away the return keyword just 724 00:34:33,510 --> 00:34:34,679 to tighten things up. 725 00:34:34,679 --> 00:34:37,290 And it's well suited for a problem like this 726 00:34:37,290 --> 00:34:39,960 where I just want to pass in a tiny little function that 727 00:34:39,960 --> 00:34:41,090 does something useful. 728 00:34:41,090 --> 00:34:42,840 But it's not something I'm going to reuse. 729 00:34:42,840 --> 00:34:45,100 It doesn't need multiple lines to take up space. 730 00:34:45,100 --> 00:34:47,159 It's just a nice, elegant one liner. 731 00:34:47,159 --> 00:34:48,840 That's all a lambda function does. 732 00:34:48,840 --> 00:34:51,909 It allows you to create an anonymous function right then and there. 733 00:34:51,909 --> 00:34:57,210 And then the function you're passing it to, like sorted, will use it as before. 734 00:34:57,210 --> 00:35:00,810 Indeed, if I run Python of favorites.py after growing my Terminal window, 735 00:35:00,810 --> 00:35:02,800 the result is exactly the same. 736 00:35:02,800 --> 00:35:06,360 And we see at the bottom here all of those small results. 737 00:35:06,360 --> 00:35:10,810 Are any questions, then, on this syntax, on these ideas? 738 00:35:10,810 --> 00:35:13,770 The goal here has been to write a Python program that just starts 739 00:35:13,770 --> 00:35:18,860 to analyze or clean up data like this. 740 00:35:18,860 --> 00:35:19,765 Yeah? 741 00:35:19,765 --> 00:35:23,273 AUDIENCE: [INAUDIBLE] 742 00:35:23,273 --> 00:35:26,440 DAVID J. MALAN: Could you use the lambda if it's just returning immediately? 743 00:35:26,440 --> 00:35:29,260 It's really meant for one line of code, generally. 744 00:35:29,260 --> 00:35:31,420 So you don't use the return keyword. 745 00:35:31,420 --> 00:35:33,880 You just say what it is you want to return. 746 00:35:33,880 --> 00:35:37,680 AUDIENCE: [INAUDIBLE] 747 00:35:37,680 --> 00:35:38,930 DAVID J. MALAN: Good question. 748 00:35:38,930 --> 00:35:40,960 Could you do more in that one line if it's 749 00:35:40,960 --> 00:35:42,670 got to be a more involved algorithm? 750 00:35:42,670 --> 00:35:45,820 Yes, but you would just ultimately return the value in question. 751 00:35:45,820 --> 00:35:47,947 In short, if it's getting at all sophisticated 752 00:35:47,947 --> 00:35:49,780 you don't use the lambda function in Python. 753 00:35:49,780 --> 00:35:52,510 You go ahead and actually just define a name for it, 754 00:35:52,510 --> 00:35:54,233 even if it's a one-off name. 755 00:35:54,233 --> 00:35:56,650 JavaScript, another language we'll look at in a few weeks, 756 00:35:56,650 --> 00:35:59,590 makes heavier use, I dare say, of lambda functions. 757 00:35:59,590 --> 00:36:02,030 And those can actually be multiple, multiple lines, 758 00:36:02,030 --> 00:36:05,520 but Python does not support that instinct. 759 00:36:05,520 --> 00:36:06,020 All right. 760 00:36:06,020 --> 00:36:07,728 So let's go ahead and do one other thing. 761 00:36:07,728 --> 00:36:10,340 Office was clearly popping out of the code here quite a bit. 762 00:36:10,340 --> 00:36:12,760 Let's go ahead and write a slightly different program 763 00:36:12,760 --> 00:36:15,400 that maybe just focuses on The Office for the moment, 764 00:36:15,400 --> 00:36:16,900 just focuses on The Office. 765 00:36:16,900 --> 00:36:21,250 So let me go ahead and throw most of this code away, up until this point 766 00:36:21,250 --> 00:36:23,080 when I'm inside of my inner loop. 767 00:36:23,080 --> 00:36:26,050 And let me go ahead, and I don't even want the global variable here. 768 00:36:26,050 --> 00:36:28,240 All I want to do is focus on the current title. 769 00:36:28,240 --> 00:36:30,730 How could I detect if someone likes The Office? 770 00:36:30,730 --> 00:36:33,790 Well, I could say something like-- 771 00:36:33,790 --> 00:36:34,670 how about this? 772 00:36:34,670 --> 00:36:36,310 So counter equals 0. 773 00:36:36,310 --> 00:36:38,350 We'll just focus on The Office. 774 00:36:38,350 --> 00:36:43,930 If title equals, equals The Office, I could then go ahead and say, 775 00:36:43,930 --> 00:36:47,410 counter plus equals 1. 776 00:36:47,410 --> 00:36:48,400 I don't need a key. 777 00:36:48,400 --> 00:36:49,858 There's no dictionary involved now. 778 00:36:49,858 --> 00:36:51,880 It's just a simple integer variable. 779 00:36:51,880 --> 00:36:55,750 And then, down here I'll say something like, 780 00:36:55,750 --> 00:37:00,970 number of people who like The Office is, whatever this value is. 781 00:37:00,970 --> 00:37:03,850 And I'll put in counter in curly braces, and then I'll 782 00:37:03,850 --> 00:37:05,783 turn this whole thing into an F string. 783 00:37:05,783 --> 00:37:07,450 All right, let me go ahead and run this. 784 00:37:07,450 --> 00:37:10,100 Python of favorites.py, Enter. 785 00:37:10,100 --> 00:37:12,610 Number of people who like The Office is 15. 786 00:37:12,610 --> 00:37:13,990 All right, so that's great. 787 00:37:13,990 --> 00:37:17,530 But let's go ahead now and deliberately muddy the data a bit. 788 00:37:17,530 --> 00:37:21,160 All of you were very nice in that you typed in The Office. 789 00:37:21,160 --> 00:37:23,230 But you can imagine someone just typing Office, 790 00:37:23,230 --> 00:37:25,692 for instance, maybe there, maybe there. 791 00:37:25,692 --> 00:37:28,150 And many people might just write Office, you could imagine. 792 00:37:28,150 --> 00:37:30,400 Didn't happen here, but suppose it did, and probably 793 00:37:30,400 --> 00:37:33,290 would have if we had even more and more submissions over time. 794 00:37:33,290 --> 00:37:37,000 Now let's go ahead and rerun this program, no changes to the code. 795 00:37:37,000 --> 00:37:39,130 Now only 13 people like The Office. 796 00:37:39,130 --> 00:37:40,150 So let's fix this. 797 00:37:40,150 --> 00:37:45,790 The data is now as I mutated it to have a couple Offices, and many The Offices. 798 00:37:45,790 --> 00:37:51,050 How could I change my Python code to now count both of those situations? 799 00:37:51,050 --> 00:37:55,390 What could I change up here in order to improve this situation? 800 00:37:55,390 --> 00:37:57,910 Any thoughts? 801 00:37:57,910 --> 00:37:58,750 Yeah? 802 00:37:58,750 --> 00:38:02,300 AUDIENCE: You write the title [INAUDIBLE].. 803 00:38:02,300 --> 00:38:05,050 DAVID J. MALAN: Yeah, so I could just ask two questions like that. 804 00:38:05,050 --> 00:38:09,370 If title equals The Office, or title equals, equals just 805 00:38:09,370 --> 00:38:10,438 Office, for instance. 806 00:38:10,438 --> 00:38:12,730 And I'm still don't have to worry about capitalization. 807 00:38:12,730 --> 00:38:15,813 I don't have to worry about spaces because I at least threw that all away. 808 00:38:15,813 --> 00:38:18,200 Now I can go ahead and rerun this code. 809 00:38:18,200 --> 00:38:19,990 Let me go run it a third time. 810 00:38:19,990 --> 00:38:22,460 OK, so we're back up to 15. 811 00:38:22,460 --> 00:38:25,210 So I like that. 812 00:38:25,210 --> 00:38:28,660 You could imagine this not scaling very well. 813 00:38:28,660 --> 00:38:31,763 Avatar had three different permutations, and there were some others 814 00:38:31,763 --> 00:38:34,180 if we dug deeper that there might have been more variants. 815 00:38:34,180 --> 00:38:36,430 Could we do something a little more general purpose? 816 00:38:36,430 --> 00:38:38,230 Well, we could do something like this. 817 00:38:38,230 --> 00:38:41,883 If Office in the title-- 818 00:38:41,883 --> 00:38:44,050 this is kind of a cool thing you can do with Python. 819 00:38:44,050 --> 00:38:46,930 It's very English-like, just ask the question, albeit tersely. 820 00:38:46,930 --> 00:38:50,660 This, interesting, just got me into trouble. 821 00:38:50,660 --> 00:38:53,140 Now, all of a sudden, we're up to 16. 822 00:38:53,140 --> 00:38:56,252 Does anyone know what the other one is? 823 00:38:56,252 --> 00:38:58,098 AUDIENCE: Someone put V Office. 824 00:38:58,098 --> 00:38:59,265 DAVID J. MALAN: What Office? 825 00:38:59,265 --> 00:39:02,120 AUDIENCE: Someone entered a V Office, [INAUDIBLE].. 826 00:39:02,120 --> 00:39:04,517 827 00:39:04,517 --> 00:39:05,850 DAVID J. MALAN: Oh, interesting. 828 00:39:05,850 --> 00:39:07,930 Yes, so they hit The. 829 00:39:07,930 --> 00:39:08,430 OK. 830 00:39:08,430 --> 00:39:11,568 [APPLAUSE] 831 00:39:11,568 --> 00:39:12,360 DAVID J. MALAN: OK. 832 00:39:12,360 --> 00:39:13,800 Someone did that, sure. 833 00:39:13,800 --> 00:39:17,130 So The V Office. 834 00:39:17,130 --> 00:39:19,470 OK, this one's actually going to be hard to correct for. 835 00:39:19,470 --> 00:39:21,300 I can't really think of a general-- 836 00:39:21,300 --> 00:39:25,860 well, this is actually a good example of data gets messy fast. 837 00:39:25,860 --> 00:39:28,080 And you could imagine doing something where, OK, we 838 00:39:28,080 --> 00:39:32,920 could have like 26 conditions if someone said The A Office, or The B Office, 839 00:39:32,920 --> 00:39:33,420 right? 840 00:39:33,420 --> 00:39:34,650 You could imagine doing that. 841 00:39:34,650 --> 00:39:37,400 But then there's surely going to be other typos that are possible. 842 00:39:37,400 --> 00:39:39,610 So that's actually a hard one to fix. 843 00:39:39,610 --> 00:39:44,730 But it turns out we got lucky and now this is actually the accurate count. 844 00:39:44,730 --> 00:39:46,660 But the data is itself messy. 845 00:39:46,660 --> 00:39:49,950 Let me show another way that just adds another tool to our toolkit. 846 00:39:49,950 --> 00:39:54,750 It turns out that there's this feature in many programming languages, Python 847 00:39:54,750 --> 00:39:57,040 among them, called regular expressions. 848 00:39:57,040 --> 00:39:59,040 And this is actually a really powerful technique 849 00:39:59,040 --> 00:40:00,873 that we'll just scratch the surface of here. 850 00:40:00,873 --> 00:40:04,290 But it's going to be really useful, actually, maybe toward final projects, 851 00:40:04,290 --> 00:40:08,880 in web programming, any time you want to clean up data or validate data. 852 00:40:08,880 --> 00:40:11,910 And actually, just to make this clear, give me a moment 853 00:40:11,910 --> 00:40:14,160 before I switch screens here. 854 00:40:14,160 --> 00:40:18,450 And let me open up a Google Form from scratch. 855 00:40:18,450 --> 00:40:22,230 Give me just a moment to create something real quick. 856 00:40:22,230 --> 00:40:25,260 If you've never noticed this before when creating a Google Form, 857 00:40:25,260 --> 00:40:28,410 you can do a question. 858 00:40:28,410 --> 00:40:30,360 And if you want the user to type in something 859 00:40:30,360 --> 00:40:33,033 very specific as a short text answer like this, 860 00:40:33,033 --> 00:40:35,700 you might know that there's toggles like this in Google's world, 861 00:40:35,700 --> 00:40:36,930 like you can require it. 862 00:40:36,930 --> 00:40:39,270 Or you can do response validation. 863 00:40:39,270 --> 00:40:41,670 You could say, what's your email? 864 00:40:41,670 --> 00:40:47,250 And then you could say something like, text is an email. 865 00:40:47,250 --> 00:40:52,530 So here's an example in Google Forms how you can validate users' input. 866 00:40:52,530 --> 00:40:57,150 But a feature most of you have probably never noticed, or cared about, or used, 867 00:40:57,150 --> 00:40:59,490 is this thing called a regular expression, where 868 00:40:59,490 --> 00:41:01,440 you can actually define a pattern. 869 00:41:01,440 --> 00:41:04,830 And I could actually reimplement that same idea by doing something like this. 870 00:41:04,830 --> 00:41:11,070 I can say, let the user type in anything represented by .star, then an at sign, 871 00:41:11,070 --> 00:41:16,600 then something else, then a literal period, then, for instance, 872 00:41:16,600 --> 00:41:17,680 something else. 873 00:41:17,680 --> 00:41:19,950 So it's very cryptic, admittedly, at first glance. 874 00:41:19,950 --> 00:41:23,430 But this means any character 0 more times. 875 00:41:23,430 --> 00:41:26,160 This means any character 0 more times. 876 00:41:26,160 --> 00:41:28,725 This means a literal period, because apparently 877 00:41:28,725 --> 00:41:32,160 dot means any character in the context of these patterns. 878 00:41:32,160 --> 00:41:36,070 Then this thing means any character 0 more times. 879 00:41:36,070 --> 00:41:38,670 So I should actually be a little more nitpicky. 880 00:41:38,670 --> 00:41:41,530 You don't want 0 or more times, you want 1 or more times. 881 00:41:41,530 --> 00:41:45,340 So this with the plus means any character 1 or more time. 882 00:41:45,340 --> 00:41:47,020 So there has to be something there. 883 00:41:47,020 --> 00:41:51,630 And I think I want the same thing here 1 or more times, 1 or more times. 884 00:41:51,630 --> 00:41:56,160 Or heck, if I want to restrict this form in some sense to edu addresses, 885 00:41:56,160 --> 00:41:59,520 I could change that last thing to literally .edu. 886 00:41:59,520 --> 00:42:01,570 And so long story short, even though this looks, 887 00:42:01,570 --> 00:42:06,420 I'm sure, pretty cryptic, there's this mini language built into Python, 888 00:42:06,420 --> 00:42:09,900 and JavaScript, and Java, and other languages that allows you to express 889 00:42:09,900 --> 00:42:12,430 patterns in a standardized way. 890 00:42:12,430 --> 00:42:15,930 And this pattern is actually something we can implement in code, too. 891 00:42:15,930 --> 00:42:18,150 And let me switch back to Python for a second just 892 00:42:18,150 --> 00:42:19,920 to do the same kind of idea. 893 00:42:19,920 --> 00:42:22,950 Let me toggle back to my code here. 894 00:42:22,950 --> 00:42:26,730 Let me put up, for instance, a summary of what it is you can do. 895 00:42:26,730 --> 00:42:33,030 And here's just a quick summary of some of the available symbols. 896 00:42:33,030 --> 00:42:39,330 A period may represent any character. .star or .asterisks means 0 or more 897 00:42:39,330 --> 00:42:39,970 characters. 898 00:42:39,970 --> 00:42:43,020 So the dot means anything, so it can be A or nothing. 899 00:42:43,020 --> 00:42:44,160 It can be B or nothing. 900 00:42:44,160 --> 00:42:49,530 It can be A, B, A, B, C. It can be any combination of 0 or more characters. 901 00:42:49,530 --> 00:42:52,860 Change that to a plus and you now express one or more characters. 902 00:42:52,860 --> 00:42:55,860 Question mark means something is optional. 903 00:42:55,860 --> 00:42:59,370 Caret symbol means start matching at the beginning of the user's input. 904 00:42:59,370 --> 00:43:05,100 Dollar sign means stop matching at the end of the user's input. 905 00:43:05,100 --> 00:43:07,210 So we won't play with all of these just now. 906 00:43:07,210 --> 00:43:11,470 But let me go over here and actually tackle this Office problem. 907 00:43:11,470 --> 00:43:15,450 Let me go ahead and import a new library called the regular expression library, 908 00:43:15,450 --> 00:43:17,280 import re. 909 00:43:17,280 --> 00:43:20,560 And then, down here, let me say this. 910 00:43:20,560 --> 00:43:25,080 If re.search, this pattern. 911 00:43:25,080 --> 00:43:30,210 Let's just search for Office, quote, unquote, in the current title. 912 00:43:30,210 --> 00:43:32,730 Then we're going to go ahead and increase the counter. 913 00:43:32,730 --> 00:43:35,040 So it turns out that the regular expression library 914 00:43:35,040 --> 00:43:39,090 has a function called search that takes as its first argument a pattern, 915 00:43:39,090 --> 00:43:41,970 and then, as its second argument the string you 916 00:43:41,970 --> 00:43:44,160 want to analyze for that pattern. 917 00:43:44,160 --> 00:43:47,880 So it's sort of looking for a needle in this haystack, from left to right. 918 00:43:47,880 --> 00:43:52,080 Let me go ahead now and run this version of the program, Enter. 919 00:43:52,080 --> 00:43:56,250 And now I screwed up because I forgot my colon, but that's old stuff. 920 00:43:56,250 --> 00:43:57,322 Enter. 921 00:43:57,322 --> 00:43:59,150 Huh. 922 00:43:59,150 --> 00:44:01,800 Number of people who like The Office is now 0. 923 00:44:01,800 --> 00:44:02,970 So this seems like a big-- 924 00:44:02,970 --> 00:44:05,640 thank you-- big step backwards. 925 00:44:05,640 --> 00:44:08,010 What did I do wrong? 926 00:44:08,010 --> 00:44:08,730 Yeah? 927 00:44:08,730 --> 00:44:10,410 AUDIENCE: [INAUDIBLE] 928 00:44:10,410 --> 00:44:11,610 DAVID J. MALAN: Yeah. 929 00:44:11,610 --> 00:44:14,595 I forced all my input to uppercase, so I probably need to do this. 930 00:44:14,595 --> 00:44:16,470 So we'll come back to other approaches there. 931 00:44:16,470 --> 00:44:17,430 Let me rerun it now. 932 00:44:17,430 --> 00:44:19,800 OK, now we're back up to 16. 933 00:44:19,800 --> 00:44:22,200 But I could even, let's say-- 934 00:44:22,200 --> 00:44:25,110 I could tolerate just The Office. 935 00:44:25,110 --> 00:44:30,120 How about this, or how about something like, or The Office? 936 00:44:30,120 --> 00:44:32,373 Let me do this instead. 937 00:44:32,373 --> 00:44:34,290 And let me use these other special characters. 938 00:44:34,290 --> 00:44:37,380 This caret sign means the beginning of the string. 939 00:44:37,380 --> 00:44:40,740 This dollar sign weirdly represents the end of the string. 940 00:44:40,740 --> 00:44:44,490 I'm adding in some parentheses just like in math, just to add another symbol 941 00:44:44,490 --> 00:44:46,470 here, the or symbol here. 942 00:44:46,470 --> 00:44:50,580 And this is saying start matching at the beginning of the user string. 943 00:44:50,580 --> 00:44:54,660 Check if the beginning of the string is Office, or the beginning of the string 944 00:44:54,660 --> 00:44:55,920 is The Office. 945 00:44:55,920 --> 00:44:58,540 And then, you better be at the end of the string. 946 00:44:58,540 --> 00:45:01,650 So they can't keep typing words before or after that input. 947 00:45:01,650 --> 00:45:03,690 Let me go ahead and rerun the program. 948 00:45:03,690 --> 00:45:07,500 And now we're down to 15, which used to be our correct answer, 949 00:45:07,500 --> 00:45:10,770 but then we noticed The V Office. 950 00:45:10,770 --> 00:45:12,680 How can we deal with that? 951 00:45:12,680 --> 00:45:16,200 It's going to be messier to deal with that. 952 00:45:16,200 --> 00:45:21,555 How about if I tolerate any character represented by dot 953 00:45:21,555 --> 00:45:23,640 in between The and Office? 954 00:45:23,640 --> 00:45:28,110 Now if I rerun it, now I really have this expressive capability. 955 00:45:28,110 --> 00:45:32,340 So this is only to say, there are so many ways in languages, in general, 956 00:45:32,340 --> 00:45:33,390 to solve problems. 957 00:45:33,390 --> 00:45:35,950 And some of these tools are more sophisticated than others. 958 00:45:35,950 --> 00:45:38,957 This is one that you've actually probably glanced at but never used 959 00:45:38,957 --> 00:45:41,040 in the context of Google Forms for years if you're 960 00:45:41,040 --> 00:45:43,770 in the habit of creating these for student groups or other activities. 961 00:45:43,770 --> 00:45:45,840 But it's now something you can start to leverage. 962 00:45:45,840 --> 00:45:49,440 And we're just scratching the surface of what's actually possible with this. 963 00:45:49,440 --> 00:45:53,640 But let's now do one final example just using some Python code here. 964 00:45:53,640 --> 00:45:55,530 And let's actually write a program that's 965 00:45:55,530 --> 00:46:00,040 a little more general purpose that allows me to search for any given title 966 00:46:00,040 --> 00:46:01,770 and figure out its popularity. 967 00:46:01,770 --> 00:46:04,410 So let me go ahead and simplify this. 968 00:46:04,410 --> 00:46:06,630 Let's get rid of our regular expressions. 969 00:46:06,630 --> 00:46:09,940 Let's go ahead and continue capitalizing the title. 970 00:46:09,940 --> 00:46:11,580 And let's go ahead to-- 971 00:46:11,580 --> 00:46:16,020 at the beginning of this program, and first ask the user for the title 972 00:46:16,020 --> 00:46:17,380 they want to search for. 973 00:46:17,380 --> 00:46:20,320 So title equals, let's ask the user for input, 974 00:46:20,320 --> 00:46:23,490 which is essentially the same thing as our CS50 get_string function. 975 00:46:23,490 --> 00:46:24,960 Ask them for the title. 976 00:46:24,960 --> 00:46:28,170 And then whatever they type in, let's go ahead and strip whitespace 977 00:46:28,170 --> 00:46:30,900 and uppercase the thing again. 978 00:46:30,900 --> 00:46:35,820 And now, inside of my loop, I could say something like this. 979 00:46:35,820 --> 00:46:42,660 If the current row's title after stripping whitespace and forcing 980 00:46:42,660 --> 00:46:46,920 it to uppercase, too, equals the user's title, then, go ahead 981 00:46:46,920 --> 00:46:49,440 and maybe increment a counter. 982 00:46:49,440 --> 00:46:51,160 So I still need that counter back. 983 00:46:51,160 --> 00:46:56,610 So let me go ahead and define this maybe in here, counter equals 0. 984 00:46:56,610 --> 00:46:58,720 And then, at the very end of this program, 985 00:46:58,720 --> 00:47:01,050 let me go ahead and print out just the popularity 986 00:47:01,050 --> 00:47:03,040 of whatever the human typed in. 987 00:47:03,040 --> 00:47:06,030 So again, the only difference is I'm asking the human for some input 988 00:47:06,030 --> 00:47:06,720 this time. 989 00:47:06,720 --> 00:47:09,150 I'm initializing my counter to 0, then I'm 990 00:47:09,150 --> 00:47:12,660 searching for their title in the CSV file 991 00:47:12,660 --> 00:47:15,810 by doing the same massaging of the data by forcing it to uppercase 992 00:47:15,810 --> 00:47:18,570 and getting rid of the whitespace. 993 00:47:18,570 --> 00:47:21,780 So now, when I run Python of favorites.py, Enter, 994 00:47:21,780 --> 00:47:30,030 I could type in the office all lowercase even, and now we're down to 13. 995 00:47:30,030 --> 00:47:34,940 13, why? 996 00:47:34,940 --> 00:47:36,700 Oh, that's correct. 997 00:47:36,700 --> 00:47:40,640 Because I'm the one that went in and removed those The keywords a bit ago. 998 00:47:40,640 --> 00:47:43,030 If we fixed those, we would be back up to 15. 999 00:47:43,030 --> 00:47:47,650 If we added support for The V Office, we would be up to 16 as well. 1000 00:47:47,650 --> 00:47:50,350 All right, any questions then on these various manipulations? 1001 00:47:50,350 --> 00:47:52,183 And if you're feeling like, oh, my god, this 1002 00:47:52,183 --> 00:47:55,100 is so much Python code just to do simple things, that's the point. 1003 00:47:55,100 --> 00:47:57,160 And indeed, even though it's a powerful language 1004 00:47:57,160 --> 00:48:00,670 and can solve these kinds of problems, we had to write almost 20 lines of code 1005 00:48:00,670 --> 00:48:03,470 just to ask a single question like this. 1006 00:48:03,470 --> 00:48:07,120 But any questions on how we did this, or on any of these building 1007 00:48:07,120 --> 00:48:10,750 blocks along the way? 1008 00:48:10,750 --> 00:48:11,890 Anything here? 1009 00:48:11,890 --> 00:48:12,390 No? 1010 00:48:12,390 --> 00:48:12,600 All right. 1011 00:48:12,600 --> 00:48:13,300 That was a lot. 1012 00:48:13,300 --> 00:48:14,800 Let's take a five-minute break here. 1013 00:48:14,800 --> 00:48:17,230 When we come back, we'll do it better. 1014 00:48:17,230 --> 00:48:18,480 So we are back. 1015 00:48:18,480 --> 00:48:20,610 And the rest of today is ultimately about, how 1016 00:48:20,610 --> 00:48:24,840 can we store, and manipulate, and change, and retrieve data 1017 00:48:24,840 --> 00:48:28,090 more efficiently than we might by just writing raw code? 1018 00:48:28,090 --> 00:48:31,440 This isn't to say that you shouldn't use Python to do the kinds of things 1019 00:48:31,440 --> 00:48:32,280 that we just did. 1020 00:48:32,280 --> 00:48:37,020 And in fact, it might be super common if you're getting a lot of messy input 1021 00:48:37,020 --> 00:48:39,073 from users that you might want to clean it up. 1022 00:48:39,073 --> 00:48:42,240 And maybe the best way to do that is to write a program so that step-by-step 1023 00:48:42,240 --> 00:48:44,370 you can make all of the requisite changes and fixes 1024 00:48:44,370 --> 00:48:47,523 like we did with The Office, for instance, again and again, 1025 00:48:47,523 --> 00:48:50,190 and reuse that code, especially if more and more submissions are 1026 00:48:50,190 --> 00:48:51,150 coming through. 1027 00:48:51,150 --> 00:48:53,550 But another theme of today, ultimately, is 1028 00:48:53,550 --> 00:48:57,638 that sometimes there are different, if not better tools for the same job. 1029 00:48:57,638 --> 00:48:59,430 And in fact, now at this point in the term, 1030 00:48:59,430 --> 00:49:02,310 as we begin to introduce not just Python, but in a moment 1031 00:49:02,310 --> 00:49:06,120 a language called SQL, and next week, a language called JavaScript, 1032 00:49:06,120 --> 00:49:09,150 and the week after that, synthesizing a whole lot of these languages 1033 00:49:09,150 --> 00:49:12,420 together is to just kind of paint a picture of how 1034 00:49:12,420 --> 00:49:15,900 you might decide what the trade-offs are between using this tool, or this tool, 1035 00:49:15,900 --> 00:49:16,830 or this other tool. 1036 00:49:16,830 --> 00:49:19,770 Because undoubtedly you can solve problems moving forward 1037 00:49:19,770 --> 00:49:22,660 in many different ways with many different tools. 1038 00:49:22,660 --> 00:49:25,020 So let's give you another tool, one with which 1039 00:49:25,020 --> 00:49:28,170 you can implement a proper relational database. 1040 00:49:28,170 --> 00:49:31,050 What we just saw in the form of CSV files 1041 00:49:31,050 --> 00:49:33,810 are what we might call flat-file databases. 1042 00:49:33,810 --> 00:49:37,500 Again, just a very simple file, flat in that there's no hierarchy to it. 1043 00:49:37,500 --> 00:49:39,270 It's just rows and columns. 1044 00:49:39,270 --> 00:49:44,280 And that is all ultimately storing ASCII or Unicode text. 1045 00:49:44,280 --> 00:49:47,400 A relational database, though, is something that's actually 1046 00:49:47,400 --> 00:49:50,850 closer to a proper spreadsheet program. 1047 00:49:50,850 --> 00:49:53,440 A CSV is an individual sheet, if you will, 1048 00:49:53,440 --> 00:49:55,260 from a spreadsheet when you export it. 1049 00:49:55,260 --> 00:49:57,460 If you had multiple sheets in a spreadsheet, 1050 00:49:57,460 --> 00:49:59,595 you would have to export multiple CSVs. 1051 00:49:59,595 --> 00:50:01,470 And that gets annoying quickly in code if you 1052 00:50:01,470 --> 00:50:03,990 have to open up this CSV, this CSV, all of which 1053 00:50:03,990 --> 00:50:07,080 represent different sheets or tabs in a proper spreadsheet. 1054 00:50:07,080 --> 00:50:11,520 A relational database is more like a spreadsheet program 1055 00:50:11,520 --> 00:50:14,640 that you, a programmer, now can interact with. 1056 00:50:14,640 --> 00:50:16,140 You can write data to it. 1057 00:50:16,140 --> 00:50:19,680 You can read data from it, and you can have multiple sheets, a.k.a., 1058 00:50:19,680 --> 00:50:22,150 tables storing all of your data. 1059 00:50:22,150 --> 00:50:24,240 So whereas Excel and numbers in Google spreadsheet 1060 00:50:24,240 --> 00:50:27,090 are meant to be reused really by humans with their mouse and their keyboard, 1061 00:50:27,090 --> 00:50:29,850 clicking, and pointing, and manipulating things graphically, 1062 00:50:29,850 --> 00:50:32,160 a relational database using a language called 1063 00:50:32,160 --> 00:50:37,320 SQL is one in which the programmer has similar capabilities, 1064 00:50:37,320 --> 00:50:39,000 but doing so in code. 1065 00:50:39,000 --> 00:50:42,720 Specifically, using a language called SQL, and at a scale 1066 00:50:42,720 --> 00:50:45,670 that's much grander than spreadsheets alone. 1067 00:50:45,670 --> 00:50:48,420 In fact, if you try on your Mac or PC to open a spreadsheet that's 1068 00:50:48,420 --> 00:50:51,090 got tens of thousands of rows, it'll probably 1069 00:50:51,090 --> 00:50:54,780 work fine, hundreds of thousands of rows, millions of rows, no way. 1070 00:50:54,780 --> 00:50:57,000 At some point your Mac or PC is going to struggle 1071 00:50:57,000 --> 00:50:59,130 to open particularly large data sets. 1072 00:50:59,130 --> 00:51:01,620 And that, too, is where proper databases come 1073 00:51:01,620 --> 00:51:04,140 into play and proper languages for databases come 1074 00:51:04,140 --> 00:51:06,120 into play, when it's all about scale. 1075 00:51:06,120 --> 00:51:09,390 And indeed, most any mobile app or web app today that you or someone else 1076 00:51:09,390 --> 00:51:13,420 might write should probably plan on lots of data if it's successful. 1077 00:51:13,420 --> 00:51:15,730 So we need the right tools for that problem. 1078 00:51:15,730 --> 00:51:19,110 So fortunately, even though we're about to learn yet another language, 1079 00:51:19,110 --> 00:51:24,360 it only does four things fundamentally, known by this silly acronym, CRUD. 1080 00:51:24,360 --> 00:51:27,870 SQL, this language for databases, supports the ability 1081 00:51:27,870 --> 00:51:32,400 to create data, read data, update data, and delete data. 1082 00:51:32,400 --> 00:51:33,420 That's it. 1083 00:51:33,420 --> 00:51:36,690 There's a few more keywords that exist in this language called SQL 1084 00:51:36,690 --> 00:51:37,750 that we'll soon see. 1085 00:51:37,750 --> 00:51:39,458 But at the end of the day, even if you're 1086 00:51:39,458 --> 00:51:42,180 starting to feel like this is a lot very quickly, 1087 00:51:42,180 --> 00:51:44,940 it all boils down to these four basic operations. 1088 00:51:44,940 --> 00:51:47,640 And the four commands in SQL, if you will, 1089 00:51:47,640 --> 00:51:51,780 functions in a sense that implement those four ideas happen to be these. 1090 00:51:51,780 --> 00:51:54,270 They're almost the same but with some slight variance. 1091 00:51:54,270 --> 00:51:59,280 The ability to create or insert data is the C. The ability to select data 1092 00:51:59,280 --> 00:52:01,080 is the R, or read. 1093 00:52:01,080 --> 00:52:02,130 Update is the same. 1094 00:52:02,130 --> 00:52:05,048 Delete is the same, but drop is also a keyword as well. 1095 00:52:05,048 --> 00:52:06,840 So we'll see these and a few other keywords 1096 00:52:06,840 --> 00:52:10,410 in SQL that, at the end of the day, just allow you to create, read, and update 1097 00:52:10,410 --> 00:52:14,310 data using verbs, if you will, like these. 1098 00:52:14,310 --> 00:52:18,030 So to do that, what's the syntax going to be? 1099 00:52:18,030 --> 00:52:20,290 Well, we won't get into the weeds too quickly on this. 1100 00:52:20,290 --> 00:52:22,650 But here's a representative syntax of how 1101 00:52:22,650 --> 00:52:25,710 you can create using this language called SQL, in your very 1102 00:52:25,710 --> 00:52:28,020 own database, a brand new table. 1103 00:52:28,020 --> 00:52:30,910 This is so easy in Excel, and Google Spreadsheets, and Apple Numbers. 1104 00:52:30,910 --> 00:52:32,910 You want a new sheet, you click the plus button. 1105 00:52:32,910 --> 00:52:33,690 You get a new tab. 1106 00:52:33,690 --> 00:52:35,490 You give it a name, and boom, you're done. 1107 00:52:35,490 --> 00:52:40,050 In the world of programming, though, if you want to create the analogue of that 1108 00:52:40,050 --> 00:52:43,440 spreadsheet in the computer's memory, you create something called a table, 1109 00:52:43,440 --> 00:52:48,420 like a sheet, that has a name, and then in parentheses has one or more columns. 1110 00:52:48,420 --> 00:52:51,990 But unlike Google Spreadsheets, and Apple Numbers, and Excel, 1111 00:52:51,990 --> 00:52:55,073 you have to decide as the programmer what types of data 1112 00:52:55,073 --> 00:52:57,240 you're going to be storing in each of these columns. 1113 00:52:57,240 --> 00:52:59,430 Now even though Excel, and Google Spreadsheets, 1114 00:52:59,430 --> 00:53:03,310 and Numbers does allow you to format or present data in different ways, 1115 00:53:03,310 --> 00:53:07,680 it's not strongly typed data like it is, for instance, when we were using C. 1116 00:53:07,680 --> 00:53:10,200 And heck, even in Python there's underlying data types. 1117 00:53:10,200 --> 00:53:12,158 Even if you don't have to type them explicitly, 1118 00:53:12,158 --> 00:53:14,900 databases are going to want to know, are you storing integers? 1119 00:53:14,900 --> 00:53:16,640 Are you storing real numbers or floats? 1120 00:53:16,640 --> 00:53:17,630 Are you storing text? 1121 00:53:17,630 --> 00:53:18,140 Why? 1122 00:53:18,140 --> 00:53:20,960 Because especially as your data scales, the more hints 1123 00:53:20,960 --> 00:53:24,410 you give the database about your data, the more performance it can be, 1124 00:53:24,410 --> 00:53:27,500 the faster it can help you get at and store that data. 1125 00:53:27,500 --> 00:53:29,302 So types are about to be important again, 1126 00:53:29,302 --> 00:53:31,760 but there's not going to be that many of them, fortunately. 1127 00:53:31,760 --> 00:53:34,640 Now how can I go about converting, for instance, some real data, 1128 00:53:34,640 --> 00:53:37,490 like that from you, my favorites.csv file, 1129 00:53:37,490 --> 00:53:39,440 into a proper relational database? 1130 00:53:39,440 --> 00:53:42,650 Well, it turns out that using SQL I can do this 1131 00:53:42,650 --> 00:53:45,260 in VS Code on my own Mac, or PC, or in the cloud 1132 00:53:45,260 --> 00:53:48,455 here by just importing the CSV into a database. 1133 00:53:48,455 --> 00:53:50,330 We'll see eventually how to do this manually. 1134 00:53:50,330 --> 00:53:52,622 For now, I'm going to use more of an automated process. 1135 00:53:52,622 --> 00:53:54,680 So let me go over to VS Code here. 1136 00:53:54,680 --> 00:53:57,170 Let me type ls to see where we left off before. 1137 00:53:57,170 --> 00:54:01,005 I had two files favorites.csv, which I downloaded from Google Spreadsheets. 1138 00:54:01,005 --> 00:54:02,630 Recall that I made a couple of changes. 1139 00:54:02,630 --> 00:54:06,050 We deleted a couple of Thes from the file for The Office. 1140 00:54:06,050 --> 00:54:08,600 But this is the same file as before, and then we 1141 00:54:08,600 --> 00:54:11,210 have favorites.py, which we'll set aside for now. 1142 00:54:11,210 --> 00:54:14,870 I'm going to go ahead now and run a command SQLite3. 1143 00:54:14,870 --> 00:54:18,020 So in the world of relational databases, there's 1144 00:54:18,020 --> 00:54:23,030 many different products out there, many different software that 1145 00:54:23,030 --> 00:54:25,370 implements the SQL language. 1146 00:54:25,370 --> 00:54:26,600 Microsoft has their own. 1147 00:54:26,600 --> 00:54:30,080 There's something called MySQL that's been very popular for years. 1148 00:54:30,080 --> 00:54:32,120 Facebook, for instance, used it early on. 1149 00:54:32,120 --> 00:54:35,030 PostgreSQL, Microsoft Access Server, Oracle, 1150 00:54:35,030 --> 00:54:36,958 and maybe a whole bunch of other product names 1151 00:54:36,958 --> 00:54:38,750 you might have encountered over time, which 1152 00:54:38,750 --> 00:54:42,980 is to say there's many different types of tools, and servers, 1153 00:54:42,980 --> 00:54:44,990 and software in which you can use SQL. 1154 00:54:44,990 --> 00:54:47,780 We're going to use a very lightweight version of the SQL language 1155 00:54:47,780 --> 00:54:49,370 today called SQLite. 1156 00:54:49,370 --> 00:54:51,680 This is the version of SQL that's generally 1157 00:54:51,680 --> 00:54:54,020 used on iPhones and Android devices these days. 1158 00:54:54,020 --> 00:54:56,930 If you download an app that stores data like your own contacts, 1159 00:54:56,930 --> 00:54:59,000 typically is stored using SQLite. 1160 00:54:59,000 --> 00:55:02,710 Because it's fairly lightweight, but you can still store hundreds, 1161 00:55:02,710 --> 00:55:05,810 thousands, even tens of thousands of pieces of data 1162 00:55:05,810 --> 00:55:07,970 even using this lightweight version thereof. 1163 00:55:07,970 --> 00:55:10,790 SQLite3 is like version 3 of this tool. 1164 00:55:10,790 --> 00:55:16,340 We're going to go ahead and run SQLite3 with a file called favorites.db. 1165 00:55:16,340 --> 00:55:20,120 It's conventional in the world of SQLite to name your file something.db. 1166 00:55:20,120 --> 00:55:22,490 I'm going to create a database called favorites.db. 1167 00:55:22,490 --> 00:55:27,010 Once I'm inside of the program, now I'm going to go ahead and enter CSV Mode. 1168 00:55:27,010 --> 00:55:28,760 Again, not something you have to memorize, 1169 00:55:28,760 --> 00:55:30,468 just something you can look up as needed. 1170 00:55:30,468 --> 00:55:34,100 And then, I'm going to import favorites.csv 1171 00:55:34,100 --> 00:55:40,260 into a table, that is, a sheet, if you will, called favorites as well. 1172 00:55:40,260 --> 00:55:44,030 Now I'm going to hit Enter and I'm going to go ahead and exit the program 1173 00:55:44,030 --> 00:55:45,650 altogether and type ls. 1174 00:55:45,650 --> 00:55:48,170 Now I have three files in my current directory-- 1175 00:55:48,170 --> 00:55:52,130 the CSV file, the Python file from before, and now favorites.db. 1176 00:55:52,130 --> 00:55:56,180 But if I did this right, all of the data you all typed into the CSV file 1177 00:55:56,180 --> 00:55:59,840 has now been loaded into a proper database where I can now use 1178 00:55:59,840 --> 00:56:03,180 this SQL language to access it instead. 1179 00:56:03,180 --> 00:56:07,730 So let's go ahead again and run SQLite3 of favorites.db, which now exists. 1180 00:56:07,730 --> 00:56:10,640 And now, at the SQLite prompt I can start 1181 00:56:10,640 --> 00:56:13,160 to play around and see what this data is. 1182 00:56:13,160 --> 00:56:16,610 For instance, I can look, by typing .schema, 1183 00:56:16,610 --> 00:56:19,362 at what the schema is of my data, what's the design. 1184 00:56:19,362 --> 00:56:22,070 Now no thought was put into the design of this data at the moment 1185 00:56:22,070 --> 00:56:23,900 because I automated the whole process. 1186 00:56:23,900 --> 00:56:26,750 Once we start creating our own databases we'll 1187 00:56:26,750 --> 00:56:29,750 give more thought to the data types and the columns that we have. 1188 00:56:29,750 --> 00:56:34,220 But we can see what SQLite presumed I wanted just 1189 00:56:34,220 --> 00:56:36,530 by importing the data by default. 1190 00:56:36,530 --> 00:56:41,120 What the import command did for me a moment ago is essentially the syntax. 1191 00:56:41,120 --> 00:56:44,510 It automated the process of creating a table, if it doesn't exist, 1192 00:56:44,510 --> 00:56:45,770 called favorites. 1193 00:56:45,770 --> 00:56:48,980 And then notice, in parentheses it gave me three columns-- 1194 00:56:48,980 --> 00:56:53,360 timestamp, title, and genres, which were inferred, obviously, from the CSV. 1195 00:56:53,360 --> 00:56:56,000 All three of which have been decreed to be text. 1196 00:56:56,000 --> 00:56:59,180 Again, once we're more comfortable we'll create our own tables, 1197 00:56:59,180 --> 00:57:01,010 choose our own types and column names. 1198 00:57:01,010 --> 00:57:03,350 But for now, I just automated the whole process just 1199 00:57:03,350 --> 00:57:08,120 to get us started by using this built-in import command as well. 1200 00:57:08,120 --> 00:57:08,810 All right. 1201 00:57:08,810 --> 00:57:11,630 So what now can I begin to do? 1202 00:57:11,630 --> 00:57:16,910 Well, if I wanted to, for instance, start playing around with data therein, 1203 00:57:16,910 --> 00:57:19,595 I might execute a couple of different commands. 1204 00:57:19,595 --> 00:57:23,000 1205 00:57:23,000 --> 00:57:28,421 Let me find the right one here-- one of which would be select. 1206 00:57:28,421 --> 00:57:31,610 Select being one of our most versatile tools 1207 00:57:31,610 --> 00:57:33,180 to select data from this database. 1208 00:57:33,180 --> 00:57:35,720 So if I have these three columns here-- timestamp, 1209 00:57:35,720 --> 00:57:39,020 title, and genres, suppose I want to select all of the titles. 1210 00:57:39,020 --> 00:57:43,790 Doing that earlier in Python required importing the CSV library, 1211 00:57:43,790 --> 00:57:48,740 opening the file, creating a reader or a DictReader, iterating over every row, 1212 00:57:48,740 --> 00:57:51,500 adding every title to a dictionary or just printing it out, 1213 00:57:51,500 --> 00:57:52,310 and dot, dot, dot. 1214 00:57:52,310 --> 00:57:55,170 There was a dozen or so lines of code when we first began. 1215 00:57:55,170 --> 00:57:56,840 Now, how about this? 1216 00:57:56,840 --> 00:58:01,220 Select title from favorites, semicolon, done. 1217 00:58:01,220 --> 00:58:05,570 So now, with this particular language, the output is very textual 1218 00:58:05,570 --> 00:58:08,930 and it's simulating what it looks like if it were more graphical by creating 1219 00:58:08,930 --> 00:58:10,870 this table, so to speak. 1220 00:58:10,870 --> 00:58:14,080 Select title from favorites is a distillation 1221 00:58:14,080 --> 00:58:17,530 in a different language called SQL of all the lines of code 1222 00:58:17,530 --> 00:58:20,740 I wrote early on when we first started playing with favorites.py. 1223 00:58:20,740 --> 00:58:25,540 SQL is therefore optimized for reading, and creating, and updating, 1224 00:58:25,540 --> 00:58:27,500 and ultimately, deleting data. 1225 00:58:27,500 --> 00:58:30,700 So here's perhaps a better tool for the job once you have the data. 1226 00:58:30,700 --> 00:58:34,030 Tossing it into a more powerful, versatile format 1227 00:58:34,030 --> 00:58:37,228 might allow you now to get more work done more quickly 1228 00:58:37,228 --> 00:58:38,770 without having to reinvent the wheel. 1229 00:58:38,770 --> 00:58:41,510 Someone else has figured out how to select data like this. 1230 00:58:41,510 --> 00:58:43,760 What more can I do here? 1231 00:58:43,760 --> 00:58:47,050 Well, let me go ahead and pull up, in a moment, just a little bit 1232 00:58:47,050 --> 00:58:49,390 of a cheat sheet here. 1233 00:58:49,390 --> 00:58:53,840 Give me one second to find this. 1234 00:58:53,840 --> 00:58:58,090 So suppose I want to now select data a little more powerfully. 1235 00:58:58,090 --> 00:59:00,220 So here's what I just did in a canonical way. 1236 00:59:00,220 --> 00:59:01,720 So select typically works like this. 1237 00:59:01,720 --> 00:59:05,860 You select columns from a specific table, semicolon. 1238 00:59:05,860 --> 00:59:08,260 Unfortunately, stupid semicolons are back. 1239 00:59:08,260 --> 00:59:12,710 Select columns from table then, is the generic form of what I just did. 1240 00:59:12,710 --> 00:59:17,122 More specifically, I selected one column called title from favorites. 1241 00:59:17,122 --> 00:59:18,580 Favorites is the name of the table. 1242 00:59:18,580 --> 00:59:19,690 Semicolon ends my thought. 1243 00:59:19,690 --> 00:59:23,440 Suppose I wanted to get two things, like the genres that each of you inputted. 1244 00:59:23,440 --> 00:59:28,420 I could instead do select title, comma, genres from favorites, 1245 00:59:28,420 --> 00:59:30,220 and then, a semicolon, and Enter. 1246 00:59:30,220 --> 00:59:32,110 It's going to look a little ugly on my screen 1247 00:59:32,110 --> 00:59:33,670 because some of these titles and-- 1248 00:59:33,670 --> 00:59:37,300 OK, one of you really went all out with Community. 1249 00:59:37,300 --> 00:59:40,660 You can see that it's just wrapping in an ugly way, 1250 00:59:40,660 --> 00:59:43,300 but it's just now showing me two columns. 1251 00:59:43,300 --> 00:59:46,840 If we scroll up to the very top again, the left most of one, 1252 00:59:46,840 --> 00:59:48,280 Black Mirror went all out, too. 1253 00:59:48,280 --> 00:59:49,150 Thank you. 1254 00:59:49,150 --> 00:59:52,000 And now, OK, we're going to have to clean some of these up. 1255 00:59:52,000 --> 00:59:54,115 Game of Thrones, good comedy, yes. 1256 00:59:54,115 --> 00:59:57,550 1257 00:59:57,550 --> 00:59:59,480 Keep going, keep going, keep going. 1258 00:59:59,480 --> 01:00:02,870 So now we've selected two of the columns that we care about. 1259 01:00:02,870 --> 01:00:03,370 There it is. 1260 01:00:03,370 --> 01:00:06,380 OK, so it's crazy wide because of all of those genres. 1261 01:00:06,380 --> 01:00:09,135 But it allows me to select exactly the data I want. 1262 01:00:09,135 --> 01:00:12,010 Let's go back to the titles, though, and perhaps start playing around 1263 01:00:12,010 --> 01:00:13,750 with some modifiers here. 1264 01:00:13,750 --> 01:00:18,310 For instance, it turns out, using SQL there's a lot of functionality 1265 01:00:18,310 --> 01:00:19,900 built into the language. 1266 01:00:19,900 --> 01:00:22,843 You've got a lot of functions, similar to Excel or Google Spreadsheets 1267 01:00:22,843 --> 01:00:24,010 where you can have formulas. 1268 01:00:24,010 --> 01:00:26,320 SQL provides you with some of the same heuristics that 1269 01:00:26,320 --> 01:00:30,350 allow you to apply operations like these on entire columns. 1270 01:00:30,350 --> 01:00:32,920 For instance, you can take averages, count the total, 1271 01:00:32,920 --> 01:00:36,010 get the distinct values, force things to lowercase, uppercase, min, 1272 01:00:36,010 --> 01:00:37,220 and max, and so forth. 1273 01:00:37,220 --> 01:00:39,610 So let's try distinct, for instance. 1274 01:00:39,610 --> 01:00:43,450 Let me go back to my Terminal, and let's say, select, 1275 01:00:43,450 --> 01:00:48,760 how about the distinct titles from the favorites table? 1276 01:00:48,760 --> 01:00:49,513 Enter. 1277 01:00:49,513 --> 01:00:51,430 I didn't bother selecting the genres because I 1278 01:00:51,430 --> 01:00:52,763 want it to be a little prettier. 1279 01:00:52,763 --> 01:00:58,090 And you can see here that we have just the distinct titles, 1280 01:00:58,090 --> 01:01:00,547 except for issues of formatting. 1281 01:01:00,547 --> 01:01:02,380 So whitespace is going to be an issue again. 1282 01:01:02,380 --> 01:01:04,213 Capitalization is going to be a thing again. 1283 01:01:04,213 --> 01:01:05,230 So there's a trade-off. 1284 01:01:05,230 --> 01:01:09,280 One of the things I was doing in Python was forcing everything to uppercase 1285 01:01:09,280 --> 01:01:10,930 and then getting rid of whitespace. 1286 01:01:10,930 --> 01:01:12,400 But we could combine some of these. 1287 01:01:12,400 --> 01:01:15,340 I could do something like force every title to uppercase, 1288 01:01:15,340 --> 01:01:16,507 then get the distinct value. 1289 01:01:16,507 --> 01:01:19,382 And that's actually going to get rid of some of those values as well. 1290 01:01:19,382 --> 01:01:21,860 And again, I did it all in one simple line that was fast. 1291 01:01:21,860 --> 01:01:24,027 So let me pull up at the bottom of the screen again. 1292 01:01:24,027 --> 01:01:28,090 I selected distinct upper titles from favorites, 1293 01:01:28,090 --> 01:01:31,070 and that did everything for me at once in just one breath. 1294 01:01:31,070 --> 01:01:33,640 Suppose I want to get the total number of counts of titles. 1295 01:01:33,640 --> 01:01:40,150 How about select count of all of those titles from favorites? 1296 01:01:40,150 --> 01:01:43,990 Semicolon, Enter, and now you get back a mini table 1297 01:01:43,990 --> 01:01:47,960 that contains just your answer, 158 in this case. 1298 01:01:47,960 --> 01:01:50,560 So that's the total number of, not distinct, 1299 01:01:50,560 --> 01:01:52,690 but total titles that we had in the file. 1300 01:01:52,690 --> 01:01:56,560 And we could continue to manipulate the data further using, again, 1301 01:01:56,560 --> 01:01:58,450 functions like these here. 1302 01:01:58,450 --> 01:02:01,550 But there's also additional filtration we can do. 1303 01:02:01,550 --> 01:02:07,010 We can also qualify our selections by saying where some condition is true. 1304 01:02:07,010 --> 01:02:10,630 So just as in Scratch, and C, and Python, you have Boolean expressions, 1305 01:02:10,630 --> 01:02:16,270 you can have the same in SQL as well, where I can filter my data where 1306 01:02:16,270 --> 01:02:19,390 something is true or false. 1307 01:02:19,390 --> 01:02:21,340 Like allows me to do approximations. 1308 01:02:21,340 --> 01:02:23,500 If I want to get something that's like The Office 1309 01:02:23,500 --> 01:02:26,320 but not necessarily T-H-E, space, Office, 1310 01:02:26,320 --> 01:02:29,440 I could do pattern matching using like here. 1311 01:02:29,440 --> 01:02:33,170 Order by, limit, and grouped by are other commands I can execute, too. 1312 01:02:33,170 --> 01:02:35,980 So let me go back and do a couple of these here. 1313 01:02:35,980 --> 01:02:42,610 How about, let me just get, oh, I don't know, all of the titles from favorites 1314 01:02:42,610 --> 01:02:44,848 but limit it to 10 results. 1315 01:02:44,848 --> 01:02:47,890 That might be one thing that's helpful to see if you just care about some 1316 01:02:47,890 --> 01:02:50,110 of the data at the top there instead. 1317 01:02:50,110 --> 01:02:56,530 How about, select all of the titles from favorites, where the title itself 1318 01:02:56,530 --> 01:02:59,710 is like, quote, unquote, "Office?" 1319 01:02:59,710 --> 01:03:02,740 And this will give me only two answers. 1320 01:03:02,740 --> 01:03:07,180 Those are the two rows, recall, that I mutated by getting rid of the word The. 1321 01:03:07,180 --> 01:03:11,880 Notice that like allows me too tolerate uppercase and lowercase. 1322 01:03:11,880 --> 01:03:14,880 Because if I instead just use the equal sign, 1323 01:03:14,880 --> 01:03:21,060 and in SQL a single equal sign does, in fact, mean equality. 1324 01:03:21,060 --> 01:03:23,370 For comparison's sake, it's not doing assignment. 1325 01:03:23,370 --> 01:03:26,070 This is not how you assign data in SQL. 1326 01:03:26,070 --> 01:03:27,940 I got back no answers there. 1327 01:03:27,940 --> 01:03:31,620 So indeed, the equal sign is giving me literal answers 1328 01:03:31,620 --> 01:03:33,990 that searches just for what I typed in. 1329 01:03:33,990 --> 01:03:35,370 How could I get all of these? 1330 01:03:35,370 --> 01:03:39,090 Well, similar in spirit to regular expressions but not quite as powerful 1331 01:03:39,090 --> 01:03:41,620 in SQL, I could do something like this. 1332 01:03:41,620 --> 01:03:45,330 I can select the title from favorites where the title is like, quote, 1333 01:03:45,330 --> 01:03:46,710 unquote, "Office." 1334 01:03:46,710 --> 01:03:52,450 But I can add, a bit weirdly, percent signs to the left and the right. 1335 01:03:52,450 --> 01:03:57,930 So the language SQL supports the same notion of pattern matching 1336 01:03:57,930 --> 01:03:59,547 but much more limited out of the box. 1337 01:03:59,547 --> 01:04:01,380 If we want more powerful regular expressions 1338 01:04:01,380 --> 01:04:03,420 we probably do want to use Python instead. 1339 01:04:03,420 --> 01:04:06,720 But the percent sign here means 0 or more characters 1340 01:04:06,720 --> 01:04:09,340 on the left, 0 or more characters on the right. 1341 01:04:09,340 --> 01:04:14,460 So this will just grab any title that contains O-F-F-I-C-E in it in that 1342 01:04:14,460 --> 01:04:15,100 order. 1343 01:04:15,100 --> 01:04:19,290 And now I get all 16, it would seem, of those results, again. 1344 01:04:19,290 --> 01:04:20,490 How do I know it's 16? 1345 01:04:20,490 --> 01:04:23,160 Well, I can just get the count of those titles 1346 01:04:23,160 --> 01:04:26,140 and get back that answer instead as well. 1347 01:04:26,140 --> 01:04:29,640 So again, it takes some getting used to, the vocabulary 1348 01:04:29,640 --> 01:04:30,973 and the syntax that you can use. 1349 01:04:30,973 --> 01:04:32,682 There's these building blocks and others. 1350 01:04:32,682 --> 01:04:35,500 But SQL is really designed, again, for creating, reading, updating, 1351 01:04:35,500 --> 01:04:36,480 and deleting data. 1352 01:04:36,480 --> 01:04:41,080 For instance, I've never really been a fan of Friends, for instance. 1353 01:04:41,080 --> 01:04:46,860 So right now if I do select, how about title from favorites 1354 01:04:46,860 --> 01:04:53,340 where title like, quote, unquote, Friends with the percent signs? 1355 01:04:53,340 --> 01:04:55,330 We can see that there's a whole bunch of them. 1356 01:04:55,330 --> 01:04:56,520 That's how many exactly. 1357 01:04:56,520 --> 01:04:57,960 Let's just do a quick count. 1358 01:04:57,960 --> 01:04:59,890 So that's nine of them. 1359 01:04:59,890 --> 01:05:03,300 Well, delete from favorites. 1360 01:05:03,300 --> 01:05:10,500 OK, you and me, delete from favorites, where title like Friends, Enter. 1361 01:05:10,500 --> 01:05:14,280 Nothing seems to happen, but bye-bye Friends. 1362 01:05:14,280 --> 01:05:15,060 [APPLAUSE] 1363 01:05:15,060 --> 01:05:16,143 DAVID J. MALAN: Thank you. 1364 01:05:16,143 --> 01:05:19,110 1365 01:05:19,110 --> 01:05:21,390 So now we've actually changed the data. 1366 01:05:21,390 --> 01:05:25,110 And this is what's compelling about a proper database. 1367 01:05:25,110 --> 01:05:29,320 Yes, you could technically write Python code that not only reads the CSV file, 1368 01:05:29,320 --> 01:05:30,540 but also writes it. 1369 01:05:30,540 --> 01:05:32,820 You can change using quote, unquote, "A" for append, 1370 01:05:32,820 --> 01:05:35,910 or quote, unquote, "W" for write, instead of quote, unquote, 1371 01:05:35,910 --> 01:05:37,263 "R" for read alone. 1372 01:05:37,263 --> 01:05:39,930 But it's definitely a little more involved to do that in Python. 1373 01:05:39,930 --> 01:05:42,250 But with SQL, you can update the data in real time. 1374 01:05:42,250 --> 01:05:45,750 And if I were actually running a web application here or a database 1375 01:05:45,750 --> 01:05:47,880 for a mobile app, that change, theoretically, 1376 01:05:47,880 --> 01:05:50,160 would be reflected everywhere on your own devices 1377 01:05:50,160 --> 01:05:52,210 if you're somehow talking to this application. 1378 01:05:52,210 --> 01:05:53,995 So that's the direction we're headed. 1379 01:05:53,995 --> 01:05:55,620 This other thing has been bothering me. 1380 01:05:55,620 --> 01:06:02,640 So select, how about title from favorites, where title equals, 1381 01:06:02,640 --> 01:06:03,180 what was it? 1382 01:06:03,180 --> 01:06:06,600 The V Office, was it? 1383 01:06:06,600 --> 01:06:07,680 Yeah, it was that one. 1384 01:06:07,680 --> 01:06:12,150 How about we update favorites by setting title 1385 01:06:12,150 --> 01:06:19,080 equal to The Office, where title equals quote, unquote, "The V Office" 1386 01:06:19,080 --> 01:06:20,470 semicolon? 1387 01:06:20,470 --> 01:06:22,650 And now, if I select the same thing again 1388 01:06:22,650 --> 01:06:24,810 I can go up and down with my arrow keys quickly. 1389 01:06:24,810 --> 01:06:27,090 Now there is no The V Office. 1390 01:06:27,090 --> 01:06:29,200 We've actually changed that value. 1391 01:06:29,200 --> 01:06:30,300 How about genres? 1392 01:06:30,300 --> 01:06:35,880 Select genres from favorites, where the title is title 1393 01:06:35,880 --> 01:06:39,450 equals Game of Thrones, semicolon. 1394 01:06:39,450 --> 01:06:43,000 These were kind of long, and I don't really agree with all of that. 1395 01:06:43,000 --> 01:06:49,560 So how about we update favorites, set genres equal to, sure, 1396 01:06:49,560 --> 01:06:52,650 action, adventure, sure, drama? 1397 01:06:52,650 --> 01:06:54,390 OK, so it's a decent list. 1398 01:06:54,390 --> 01:06:56,700 Fantasy, sure, thriller, war. 1399 01:06:56,700 --> 01:07:01,050 OK, anything really but comedy, I would say. 1400 01:07:01,050 --> 01:07:03,160 Let's go ahead and hit Enter now. 1401 01:07:03,160 --> 01:07:07,800 And now, if I select genres again, same query, now we've canonicalized that. 1402 01:07:07,800 --> 01:07:09,160 We've thrown data away. 1403 01:07:09,160 --> 01:07:11,250 So whether or not that is right is probably 1404 01:07:11,250 --> 01:07:13,020 a bit subjective and argumentative. 1405 01:07:13,020 --> 01:07:16,920 But I have at least cleaned up my data, which is, again, the U in CRUD. 1406 01:07:16,920 --> 01:07:20,670 Create, read, update, delete, you can do it that easily. 1407 01:07:20,670 --> 01:07:22,290 Beware using delete. 1408 01:07:22,290 --> 01:07:26,430 Beware worse using drop, whereby you can drop an entire table. 1409 01:07:26,430 --> 01:07:29,310 But via these kinds of commands, can we actually now 1410 01:07:29,310 --> 01:07:33,390 manipulate our data much more rapidly and with single thoughts. 1411 01:07:33,390 --> 01:07:36,390 And in fact, if you're an aspiring statistician, or data scientist, 1412 01:07:36,390 --> 01:07:40,320 or analyst in the real world, SQL is such a commonly used language 1413 01:07:40,320 --> 01:07:43,573 because it allows you to really dive into data quickly, and ask 1414 01:07:43,573 --> 01:07:45,990 questions of the data, and get back answers quite quickly. 1415 01:07:45,990 --> 01:07:47,530 And this is a simple data set. 1416 01:07:47,530 --> 01:07:51,840 You can do this with much larger data sets as we soon will, too. 1417 01:07:51,840 --> 01:07:55,050 Or any questions on what we've seen of SQL thus far? 1418 01:07:55,050 --> 01:07:57,170 Only scratched the surface, but again, it 1419 01:07:57,170 --> 01:08:03,100 boils down to creating, reading, updating, and deleting data. 1420 01:08:03,100 --> 01:08:04,910 Questions here? 1421 01:08:04,910 --> 01:08:05,410 All right. 1422 01:08:05,410 --> 01:08:07,820 Well, let's consider the design of this data. 1423 01:08:07,820 --> 01:08:11,900 Recall that if I do .schema, that shows me the design of my table, 1424 01:08:11,900 --> 01:08:13,990 the so-called schema of my data. 1425 01:08:13,990 --> 01:08:15,010 This is OK. 1426 01:08:15,010 --> 01:08:17,649 It gets the job done, and frankly, everything the user typed in 1427 01:08:17,649 --> 01:08:21,609 was arguably text, including the timestamp, which is the date and time. 1428 01:08:21,609 --> 01:08:24,040 But so the data set itself is somewhat simple. 1429 01:08:24,040 --> 01:08:29,569 But if we look at the data set itself, especially genres, let's do this. 1430 01:08:29,569 --> 01:08:31,960 Select genres from favorites. 1431 01:08:31,960 --> 01:08:34,540 And let me point out one other thing stylistically, too. 1432 01:08:34,540 --> 01:08:39,050 I am very deliberately capitalizing all of the special SQL keywords, 1433 01:08:39,050 --> 01:08:42,700 and I'm lowercasing all of the column names and the table names. 1434 01:08:42,700 --> 01:08:45,760 This is a convention, and honestly, it just helps you read, 1435 01:08:45,760 --> 01:08:49,210 I think, the code when you're co-mingling your names for columns 1436 01:08:49,210 --> 01:08:52,600 and tables with proper SQL keywords. 1437 01:08:52,600 --> 01:08:58,029 But I could just as easily do select genres from favorites, 1438 01:08:58,029 --> 01:09:01,149 but again, the SQL specific keywords don't quite jump out as much. 1439 01:09:01,149 --> 01:09:04,240 So stylistically, we would recommend this, selecting genres 1440 01:09:04,240 --> 01:09:06,250 from favorites, semicolon. 1441 01:09:06,250 --> 01:09:09,370 So here is where-- 1442 01:09:09,370 --> 01:09:11,560 oh. 1443 01:09:11,560 --> 01:09:12,850 OK, that was not intended. 1444 01:09:12,850 --> 01:09:15,220 I accidentally made every show, including 1445 01:09:15,220 --> 01:09:20,410 The Office about action, adventure, drama, fantasy, thriller, and war. 1446 01:09:20,410 --> 01:09:24,460 How did I do that accidentally? 1447 01:09:24,460 --> 01:09:25,600 What did I do wrong? 1448 01:09:25,600 --> 01:09:30,234 AUDIENCE: [INAUDIBLE] 1449 01:09:30,234 --> 01:09:31,109 DAVID J. MALAN: Yeah. 1450 01:09:31,109 --> 01:09:32,460 So beware, this is funny. 1451 01:09:32,460 --> 01:09:34,359 I think I did say beware around this time. 1452 01:09:34,359 --> 01:09:37,810 So the SQL database took me-- literally, I updated favorites, 1453 01:09:37,810 --> 01:09:41,010 setting genres equal to that, semicolon, end of thought. 1454 01:09:41,010 --> 01:09:43,500 I really wanted to say where title equals, 1455 01:09:43,500 --> 01:09:46,020 quote, unquote, "Game of Thrones." 1456 01:09:46,020 --> 01:09:49,080 Unfortunately, there isn't an undo command or time machine 1457 01:09:49,080 --> 01:09:51,930 with a SQL database, so the best we can do here 1458 01:09:51,930 --> 01:09:56,250 is, let's actually get rid of favorites.db. 1459 01:09:56,250 --> 01:10:02,340 Let's run SQLite of favorites.db again, which now will be recreated. 1460 01:10:02,340 --> 01:10:04,440 Let me change myself into CSV mode. 1461 01:10:04,440 --> 01:10:09,930 Let me import, into my favorites table, the CSV file. 1462 01:10:09,930 --> 01:10:14,250 And now, Friends is back, for better or for worse, 1463 01:10:14,250 --> 01:10:15,510 but so are all of our genres. 1464 01:10:15,510 --> 01:10:18,090 1465 01:10:18,090 --> 01:10:21,330 If I now reload the file and do select, star, from-- 1466 01:10:21,330 --> 01:10:21,840 sorry. 1467 01:10:21,840 --> 01:10:25,838 Select genres from favorites, that was the result I was getting. 1468 01:10:25,838 --> 01:10:28,630 It's much messier, but that's because some of these are quite long. 1469 01:10:28,630 --> 01:10:30,297 But now we're back to the original data. 1470 01:10:30,297 --> 01:10:32,700 Lesson here, be sure to back up your work. 1471 01:10:32,700 --> 01:10:33,240 All right. 1472 01:10:33,240 --> 01:10:36,850 So what more can we now do with this data? 1473 01:10:36,850 --> 01:10:40,480 Well, I don't love the design of the genres table for a couple of reasons. 1474 01:10:40,480 --> 01:10:43,200 One, we didn't have any sort of validation, 1475 01:10:43,200 --> 01:10:45,210 but user input is going to be messy. 1476 01:10:45,210 --> 01:10:47,790 There's just a lot of redundancy in here. 1477 01:10:47,790 --> 01:10:49,720 Let's go ahead and do this. 1478 01:10:49,720 --> 01:10:51,760 Let me select all the comedies you all typed in. 1479 01:10:51,760 --> 01:10:57,960 So select title from favorites, where genres equals, 1480 01:10:57,960 --> 01:11:00,120 quote, unquote, "comedy." 1481 01:11:00,120 --> 01:11:05,730 OK, so there's all of the shows that are explicitly comedies. 1482 01:11:05,730 --> 01:11:08,940 But I think there might actually be others. 1483 01:11:08,940 --> 01:11:11,350 Let me scroll back up here. 1484 01:11:11,350 --> 01:11:12,180 Comedy, drama. 1485 01:11:12,180 --> 01:11:14,040 What was a comedy and a drama? 1486 01:11:14,040 --> 01:11:19,140 How about let's search for the-- oops, let me copy paste comedy, comma, drama. 1487 01:11:19,140 --> 01:11:23,700 OK, so The Office, in this case, was considered comedy and drama, Billions, 1488 01:11:23,700 --> 01:11:27,010 It's Always Sunny in Philadelphia, and Gilmore Girls as well. 1489 01:11:27,010 --> 01:11:31,450 But notice that I get many more when I just search for comedy. 1490 01:11:31,450 --> 01:11:36,450 So the catch here is that, because I have all of these genres implemented 1491 01:11:36,450 --> 01:11:39,030 the way Google did, as a comma-separated list, 1492 01:11:39,030 --> 01:11:43,590 it's actually really hard and messy to get at any show, all of the shows 1493 01:11:43,590 --> 01:11:46,830 that are somewhere described as comedy. 1494 01:11:46,830 --> 01:11:49,680 Because if I search for quote, unquote, "comedy," the only answers 1495 01:11:49,680 --> 01:11:53,610 I'm going to get are this one, whatever that show is, this one, whatever 1496 01:11:53,610 --> 01:11:55,240 that show is, this one. 1497 01:11:55,240 --> 01:11:56,670 But I'm not going to get this one. 1498 01:11:56,670 --> 01:11:58,170 I'm not going to get this one. 1499 01:11:58,170 --> 01:11:59,070 Why? 1500 01:11:59,070 --> 01:12:02,790 If I'm searching for, where genres equals, quote, unquote, "comedy," 1501 01:12:02,790 --> 01:12:04,350 why am I missing those other shows? 1502 01:12:04,350 --> 01:12:07,202 1503 01:12:07,202 --> 01:12:07,910 Why am I missing? 1504 01:12:07,910 --> 01:12:08,595 Yeah? 1505 01:12:08,595 --> 01:12:11,020 AUDIENCE: [INAUDIBLE] 1506 01:12:11,020 --> 01:12:12,020 DAVID J. MALAN: Exactly. 1507 01:12:12,020 --> 01:12:14,390 It's not just a comedy, it's a comedy and a drama, 1508 01:12:14,390 --> 01:12:16,680 and a comedy or a news show, and so forth. 1509 01:12:16,680 --> 01:12:20,510 So I have to search for these commas, so this gets messy quickly, right? 1510 01:12:20,510 --> 01:12:22,560 Let me copy this so I can do this. 1511 01:12:22,560 --> 01:12:26,150 Let me search for where genres equals comedy. 1512 01:12:26,150 --> 01:12:33,230 How about, or genres equals comedy, drama, or genres 1513 01:12:33,230 --> 01:12:36,650 equals this whole thing, comedy, news, talk show? 1514 01:12:36,650 --> 01:12:38,370 I'm going to get more and more results. 1515 01:12:38,370 --> 01:12:39,900 But that's not going to scale well. 1516 01:12:39,900 --> 01:12:42,810 What could I do instead of enumerating with ors 1517 01:12:42,810 --> 01:12:45,800 all of the different permutations of genres, do you think? 1518 01:12:45,800 --> 01:12:49,655 AUDIENCE: [INAUDIBLE] 1519 01:12:49,655 --> 01:12:50,530 DAVID J. MALAN: Yeah. 1520 01:12:50,530 --> 01:12:54,430 So I could use the keyword is, similar in Python to the word in. 1521 01:12:54,430 --> 01:12:56,980 I could use the like keyword so that so long 1522 01:12:56,980 --> 01:13:02,080 as the genres is like comedy somewhere in there, 1523 01:13:02,080 --> 01:13:05,900 that's going to give me all of them, so long as the word comedy is in there. 1524 01:13:05,900 --> 01:13:09,265 But let me go ahead and just open the form from earlier. 1525 01:13:09,265 --> 01:13:12,250 1526 01:13:12,250 --> 01:13:15,073 Let me see if I can open this real quick before I toggle over. 1527 01:13:15,073 --> 01:13:16,990 If we look back at the form, recall that there 1528 01:13:16,990 --> 01:13:22,630 were all of those radio buttons asking for the specific genres 1529 01:13:22,630 --> 01:13:24,530 into which something fell. 1530 01:13:24,530 --> 01:13:29,710 And if I open this, let me full screen here and now open the original form. 1531 01:13:29,710 --> 01:13:32,920 You'll see all of the genres here, none of which 1532 01:13:32,920 --> 01:13:38,680 are that worrisome except for a corner case is jumping out at me. 1533 01:13:38,680 --> 01:13:43,330 Where might the like keyword alone get me into trouble? 1534 01:13:43,330 --> 01:13:44,380 It's not with comedy. 1535 01:13:44,380 --> 01:13:45,880 I'm OK with comedy. 1536 01:13:45,880 --> 01:13:47,680 AUDIENCE: Music and musical? 1537 01:13:47,680 --> 01:13:51,190 DAVID J. MALAN: Yeah, music and musical are deliberately on the list here. 1538 01:13:51,190 --> 01:13:53,660 Because, one, they're separate genres. 1539 01:13:53,660 --> 01:13:56,310 But if I just search for something that's like music, 1540 01:13:56,310 --> 01:13:58,810 I'm going to accidentally suck in all of the musicals, which 1541 01:13:58,810 --> 01:13:59,950 might not be what I intend. 1542 01:13:59,950 --> 01:14:03,310 If music is a music video or whatever, and musical is actually 1543 01:14:03,310 --> 01:14:06,620 a different type of show, I don't want to just do that. 1544 01:14:06,620 --> 01:14:08,110 So it seems just very messy. 1545 01:14:08,110 --> 01:14:11,682 I could probably hack something together with-- maybe add some commas in there, 1546 01:14:11,682 --> 01:14:12,640 or something like this. 1547 01:14:12,640 --> 01:14:15,520 But this is just not a good design for the data. 1548 01:14:15,520 --> 01:14:17,890 Google has done it this way because it's just 1549 01:14:17,890 --> 01:14:22,000 simple to actually keep the user's data all in a single column, 1550 01:14:22,000 --> 01:14:24,520 and just as they did, separate it by commas. 1551 01:14:24,520 --> 01:14:28,720 But this is a real messy way to use CSV is 1552 01:14:28,720 --> 01:14:32,830 by putting comma-separated values in your comma-separated values. 1553 01:14:32,830 --> 01:14:35,350 Arguably, the folks at Google probably just did this 1554 01:14:35,350 --> 01:14:36,520 because it's just simpler. 1555 01:14:36,520 --> 01:14:38,645 And they didn't want to give people multiple sheets 1556 01:14:38,645 --> 01:14:42,220 or complicate things using some other weirder character than commas alone. 1557 01:14:42,220 --> 01:14:44,450 But I bet there's a better way for us to do this. 1558 01:14:44,450 --> 01:14:45,860 And let me go ahead and do this. 1559 01:14:45,860 --> 01:14:47,978 Let me go back into my code here. 1560 01:14:47,978 --> 01:14:50,020 And in just a moment, I'm going to grab a program 1561 01:14:50,020 --> 01:14:54,250 that I wrote in advance that's going to use Python to open up the CSV file, 1562 01:14:54,250 --> 01:14:59,110 iterate over all of the rows, and load the data into two tables this time, 1563 01:14:59,110 --> 01:15:02,630 two tables, one called shows, and one called genres, 1564 01:15:02,630 --> 01:15:05,140 so as to actually separate these two things out. 1565 01:15:05,140 --> 01:15:07,730 Give me just a moment to grab the code. 1566 01:15:07,730 --> 01:15:10,720 And when I run this, I'll only have to run it once. 1567 01:15:10,720 --> 01:15:13,090 Let me go ahead and run Python in a moment, 1568 01:15:13,090 --> 01:15:15,940 and I'll reveal the results in a sec. 1569 01:15:15,940 --> 01:15:18,790 This is going to be version 8 of the code online. 1570 01:15:18,790 --> 01:15:22,480 When I do this, let me go ahead and open up this file. 1571 01:15:22,480 --> 01:15:26,260 Give me a second to move it into this directory. 1572 01:15:26,260 --> 01:15:28,570 Version 8, OK. 1573 01:15:28,570 --> 01:15:31,515 So here we have version 8 of this that's available online 1574 01:15:31,515 --> 01:15:32,890 that's going to do the following. 1575 01:15:32,890 --> 01:15:34,848 And I'll gloss over some of the details just so 1576 01:15:34,848 --> 01:15:38,740 that we don't get stuck in the weeds of some of this code. 1577 01:15:38,740 --> 01:15:41,380 I'm going to be using, at the top of this program, 1578 01:15:41,380 --> 01:15:45,820 as we'll soon see, a CS50 library, not for the sake of get_string, 1579 01:15:45,820 --> 01:15:48,730 or get_int, or get_float, but because there's some built-in SQL 1580 01:15:48,730 --> 01:15:51,880 functionality that we didn't discuss a couple of weeks back with the CS50 1581 01:15:51,880 --> 01:15:52,750 library itself. 1582 01:15:52,750 --> 01:15:56,680 But inside of the CS50 library we'll see there is a special function called 1583 01:15:56,680 --> 01:16:01,330 SQL that gives you the ability using this weird URL-like looking thing, 1584 01:16:01,330 --> 01:16:06,040 technically called a URI, that allows me to open a file called favorites.db. 1585 01:16:06,040 --> 01:16:08,380 And long story short, all of the subsequent code 1586 01:16:08,380 --> 01:16:12,580 is going to iterate over this favorites.csv file that we downloaded. 1587 01:16:12,580 --> 01:16:16,240 And it's going to import it into the SQLite database, 1588 01:16:16,240 --> 01:16:19,430 but it's going to use two tables instead of just one. 1589 01:16:19,430 --> 01:16:21,640 So give me just a moment to run this, and then I'll 1590 01:16:21,640 --> 01:16:23,600 reveal the actual results. 1591 01:16:23,600 --> 01:16:26,270 This is going to be run on favorites.csv. 1592 01:16:26,270 --> 01:16:31,510 1593 01:16:31,510 --> 01:16:35,485 And taking a look here, give me just a moment. 1594 01:16:35,485 --> 01:16:39,310 1595 01:16:39,310 --> 01:16:43,350 Oh, give me a sec. 1596 01:16:43,350 --> 01:16:45,270 Come on. 1597 01:16:45,270 --> 01:16:46,080 Come on. 1598 01:16:46,080 --> 01:16:49,020 This program should not be taking this long. 1599 01:16:49,020 --> 01:16:50,460 Sorry. 1600 01:16:50,460 --> 01:16:51,840 Let's open this real fast. 1601 01:16:51,840 --> 01:16:54,750 1602 01:16:54,750 --> 01:16:55,860 Whoops, not that file. 1603 01:16:55,860 --> 01:16:59,000 1604 01:16:59,000 --> 01:16:59,500 OK. 1605 01:16:59,500 --> 01:17:02,350 Let me just skim this code real quick to see where we've gone wrong. 1606 01:17:02,350 --> 01:17:04,840 [INAUDIBLE] reader. 1607 01:17:04,840 --> 01:17:10,390 Reader, title, show ID in certain two shows. 1608 01:17:10,390 --> 01:17:15,370 [INAUDIBLE] genres split, DB execute. 1609 01:17:15,370 --> 01:17:15,880 All right. 1610 01:17:15,880 --> 01:17:17,410 This is me debugging in real time. 1611 01:17:17,410 --> 01:17:22,843 All those times we encourage you to use print, this is me actually using print. 1612 01:17:22,843 --> 01:17:24,760 We'll see how quickly I can recover from this. 1613 01:17:24,760 --> 01:17:26,560 Python of favorites version 8. 1614 01:17:26,560 --> 01:17:29,100 1615 01:17:29,100 --> 01:17:32,070 OK, so here's me debugging in real time. 1616 01:17:32,070 --> 01:17:32,882 It's printing it. 1617 01:17:32,882 --> 01:17:34,590 Oh, maybe I just didn't wait long enough. 1618 01:17:34,590 --> 01:17:35,940 OK, so here we go. 1619 01:17:35,940 --> 01:17:39,900 What I'm doing is printing out the dictionary that represents 1620 01:17:39,900 --> 01:17:41,542 each row that you all typed in. 1621 01:17:41,542 --> 01:17:43,000 And we're actually making progress. 1622 01:17:43,000 --> 01:17:44,220 All right. 1623 01:17:44,220 --> 01:17:46,510 I was too impatient and didn't wait long enough. 1624 01:17:46,510 --> 01:17:47,470 So in a moment-- 1625 01:17:47,470 --> 01:17:47,970 there we go. 1626 01:17:47,970 --> 01:17:50,370 All right, so all we have to do sometimes is wait. 1627 01:17:50,370 --> 01:17:54,420 Let me go ahead now and open this file using SQLite3. 1628 01:17:54,420 --> 01:17:58,080 So in SQLite3 I now have a different version of favorites.db. 1629 01:17:58,080 --> 01:17:59,970 I named it number 8 for consistency. 1630 01:17:59,970 --> 01:18:03,280 Once I've run the program I can do .schema to look inside of it. 1631 01:18:03,280 --> 01:18:07,200 And here's what the two tables in this database are going to look like. 1632 01:18:07,200 --> 01:18:10,920 I've created a table called shows, this time to represent all of the TV shows 1633 01:18:10,920 --> 01:18:14,410 that are favorites, that has two columns. 1634 01:18:14,410 --> 01:18:16,812 One is called ID, one is called Title. 1635 01:18:16,812 --> 01:18:19,020 But now I'm going to start taking out for a spin some 1636 01:18:19,020 --> 01:18:20,640 of the other features of SQL. 1637 01:18:20,640 --> 01:18:24,900 And besides there being text, it turns out there's a data type called integer. 1638 01:18:24,900 --> 01:18:26,910 Besides there being a data type called text, 1639 01:18:26,910 --> 01:18:30,150 there's also a special key phrase that you can specify 1640 01:18:30,150 --> 01:18:31,830 that the title can never be null. 1641 01:18:31,830 --> 01:18:35,160 Think back to our use of null in C. Think back 1642 01:18:35,160 --> 01:18:37,260 to the keyword none in Python. 1643 01:18:37,260 --> 01:18:40,830 This is a database constraint that allows you to ensure that none of you 1644 01:18:40,830 --> 01:18:42,630 can't have of favorite TV show. 1645 01:18:42,630 --> 01:18:46,320 If you submit the form, you have to have typed in a title for it 1646 01:18:46,320 --> 01:18:48,250 to end up in our database here. 1647 01:18:48,250 --> 01:18:50,940 And you'll notice one other new feature. 1648 01:18:50,940 --> 01:18:53,460 It turns out, on this table I'm defining what's 1649 01:18:53,460 --> 01:18:56,970 called a primary key, specifically to be the ID column. 1650 01:18:56,970 --> 01:18:58,600 More on that in just a moment. 1651 01:18:58,600 --> 01:19:02,700 Meanwhile, the second table my code has created for me, as we'll soon see, 1652 01:19:02,700 --> 01:19:08,220 gives me a column called show ID, and then, a genre, 1653 01:19:08,220 --> 01:19:11,140 the value of which is text that can also not be null. 1654 01:19:11,140 --> 01:19:12,760 And then more on this in a moment. 1655 01:19:12,760 --> 01:19:15,840 This table has what we're going to call a foreign key, 1656 01:19:15,840 --> 01:19:20,550 specifically the show ID column that references shows ID. 1657 01:19:20,550 --> 01:19:23,250 So before we get into the weeds of this, this 1658 01:19:23,250 --> 01:19:27,120 is now a way of creating the relation in relational database. 1659 01:19:27,120 --> 01:19:31,140 If I have two tables now, not just one, they can somehow 1660 01:19:31,140 --> 01:19:34,720 be linked together by a common column. 1661 01:19:34,720 --> 01:19:37,860 In other words, the shows column-- 1662 01:19:37,860 --> 01:19:41,340 shows table is going to give me a table with two columns-- 1663 01:19:41,340 --> 01:19:43,080 an ID and a title. 1664 01:19:43,080 --> 01:19:46,800 Every title you gave me, I'm going to assign a unique value. 1665 01:19:46,800 --> 01:19:51,600 The genre's table, meanwhile, is going to associate individual genres 1666 01:19:51,600 --> 01:19:54,480 singular with that same idea. 1667 01:19:54,480 --> 01:20:01,150 And the result of this, to pop back to the Terminal here, is, let's do this. 1668 01:20:01,150 --> 01:20:05,460 Select star from shows of this new database, 1669 01:20:05,460 --> 01:20:09,030 and you'll see that I've given, indeed, all of the shows you all typed 1670 01:20:09,030 --> 01:20:10,663 in unique identifiers. 1671 01:20:10,663 --> 01:20:13,830 I didn't filter out duplicates or do anything beyond just forcing everything 1672 01:20:13,830 --> 01:20:14,430 to uppercase. 1673 01:20:14,430 --> 01:20:16,930 So there's going to be some duplicates here because I didn't 1674 01:20:16,930 --> 01:20:18,840 want to get rid of anyone's data. 1675 01:20:18,840 --> 01:20:21,960 But you'll see that, indeed, I've given everyone 1676 01:20:21,960 --> 01:20:24,420 a unique identifier, from the very first person who 1677 01:20:24,420 --> 01:20:28,650 typed How I Met Your Mother, all the way down to input number 158. 1678 01:20:28,650 --> 01:20:35,760 Meanwhile, if I do select star from genres, which is now a table, not just 1679 01:20:35,760 --> 01:20:38,280 a column in the original data, now you'll 1680 01:20:38,280 --> 01:20:43,195 see a much better design for this data. 1681 01:20:43,195 --> 01:20:44,320 Notice what I've done here. 1682 01:20:44,320 --> 01:20:47,370 Let me go all the way to the top and you'll see two columns, one of which 1683 01:20:47,370 --> 01:20:51,508 is called show ID, the other of which is called genre. 1684 01:20:51,508 --> 01:20:53,550 And again, I wrote some code to do this because I 1685 01:20:53,550 --> 01:20:56,675 had to take Google's messy output where everything was separated by commas. 1686 01:20:56,675 --> 01:21:00,330 I had to tear away the commas and then put each genre into this table 1687 01:21:00,330 --> 01:21:01,770 by itself. 1688 01:21:01,770 --> 01:21:04,200 Even though we haven't introduced the syntax via which 1689 01:21:04,200 --> 01:21:07,230 we can reconstitute the data and reassociate 1690 01:21:07,230 --> 01:21:10,530 your genres with your titles, why, at a glance, 1691 01:21:10,530 --> 01:21:13,140 might this be a better design now? 1692 01:21:13,140 --> 01:21:16,740 Even though I've doubled the number of tables from one to two, 1693 01:21:16,740 --> 01:21:21,540 why is this probably on the direction toward a better design? 1694 01:21:21,540 --> 01:21:23,130 What might your instincts be? 1695 01:21:23,130 --> 01:21:26,430 1696 01:21:26,430 --> 01:21:27,720 Why is this cleaner? 1697 01:21:27,720 --> 01:21:31,230 Again, first time with SQL, why is it better, perhaps, 1698 01:21:31,230 --> 01:21:33,780 that we've done this with our genre's table? 1699 01:21:33,780 --> 01:21:35,650 Can I come to you? 1700 01:21:35,650 --> 01:21:36,700 Why might this be better? 1701 01:21:36,700 --> 01:21:37,200 Yeah. 1702 01:21:37,200 --> 01:21:41,360 Oh, just because we had the conversation before about the commas. 1703 01:21:41,360 --> 01:21:42,470 AUDIENCE: [INAUDIBLE] 1704 01:21:42,470 --> 01:21:42,890 DAVID J. MALAN: Exactly. 1705 01:21:42,890 --> 01:21:43,848 It's as simple as that. 1706 01:21:43,848 --> 01:21:49,130 We've cleaned up the data by giving every genre, every word in the genres 1707 01:21:49,130 --> 01:21:51,260 column in the original Google Spreadsheet 1708 01:21:51,260 --> 01:21:54,030 its own cell in this table, if you will. 1709 01:21:54,030 --> 01:21:56,930 And now notice show ID might appear multiple times. 1710 01:21:56,930 --> 01:22:01,530 Whoever typed in How I Met Your Mother, they only associated one genre with it. 1711 01:22:01,530 --> 01:22:04,460 And so we see that show ID 1 is a comedy. 1712 01:22:04,460 --> 01:22:05,645 But whoever typed in-- 1713 01:22:05,645 --> 01:22:07,520 I forget the name of the second show offhand. 1714 01:22:07,520 --> 01:22:11,960 But that person, whoever was assigned show ID 2 checked off 1715 01:22:11,960 --> 01:22:13,850 a whole bunch of the genre's boxes. 1716 01:22:13,850 --> 01:22:17,540 That happened again with show ID 3, 4. 1717 01:22:17,540 --> 01:22:20,940 Persons 5, 6, 7 only checked one box. 1718 01:22:20,940 --> 01:22:25,040 And so you can see now that we've associated the data with what we 1719 01:22:25,040 --> 01:22:27,690 might call a one-to-many relationship. 1720 01:22:27,690 --> 01:22:33,050 A one-to-many relationship, whereby for every one show in the show's table, 1721 01:22:33,050 --> 01:22:36,800 it can now have many genres associated with it, each of which 1722 01:22:36,800 --> 01:22:41,250 is represented by a separate row here. 1723 01:22:41,250 --> 01:22:44,960 So again, if I go ahead and select star from shows-- 1724 01:22:44,960 --> 01:22:48,740 let's limit it to the first 10 just to focus on a subset of the data. 1725 01:22:48,740 --> 01:22:51,770 How I Met Your Mother, The Sopranos was the second input there. 1726 01:22:51,770 --> 01:22:55,100 It would seem that now that I've created the data in this way, 1727 01:22:55,100 --> 01:22:59,925 I could ideally somehow search the data, but a little more correctly. 1728 01:22:59,925 --> 01:23:01,550 I don't have to worry about the commas. 1729 01:23:01,550 --> 01:23:03,925 I don't have to worry about the hackish approach of music 1730 01:23:03,925 --> 01:23:05,630 being a substring of musical. 1731 01:23:05,630 --> 01:23:08,310 But how can I actually get back at this data? 1732 01:23:08,310 --> 01:23:09,750 Well, let's go ahead and do this. 1733 01:23:09,750 --> 01:23:13,700 Suppose I did want to get back maybe all of the comedies. 1734 01:23:13,700 --> 01:23:17,120 All of the comedies, no matter whether the person checked just the comedy 1735 01:23:17,120 --> 01:23:19,550 box or multiple boxes instead. 1736 01:23:19,550 --> 01:23:23,330 How now, given that I have two tables, could I 1737 01:23:23,330 --> 01:23:27,740 go about selecting only the titles of comedies? 1738 01:23:27,740 --> 01:23:29,780 I've actually made the problem a little harder, 1739 01:23:29,780 --> 01:23:32,720 but again, SQL is going to give me a solution for this. 1740 01:23:32,720 --> 01:23:35,030 The problem is that if I want to search for comedies, 1741 01:23:35,030 --> 01:23:37,820 I have to check the genres table first. 1742 01:23:37,820 --> 01:23:39,650 And then what's that going to give me? 1743 01:23:39,650 --> 01:23:43,580 If I search the genres table for comedies, 1744 01:23:43,580 --> 01:23:46,610 what's that going to give me back potentially? 1745 01:23:46,610 --> 01:23:47,360 Yeah? 1746 01:23:47,360 --> 01:23:48,110 AUDIENCE: Show ID. 1747 01:23:48,110 --> 01:23:49,110 DAVID J. MALAN: Maybe show ID. 1748 01:23:49,110 --> 01:23:49,902 So let me try that. 1749 01:23:49,902 --> 01:23:55,820 Let me do select show ID from genres, where the genre in a given row 1750 01:23:55,820 --> 01:23:57,320 equals quote, unquote, "comedy." 1751 01:23:57,320 --> 01:24:00,470 No commas, no like, no percent signs. 1752 01:24:00,470 --> 01:24:04,940 Because literally, that column now is singular words, like comedy, or drama, 1753 01:24:04,940 --> 01:24:05,720 or the like. 1754 01:24:05,720 --> 01:24:07,670 Let me go ahead and hit Enter here. 1755 01:24:07,670 --> 01:24:10,700 OK, so I got back a whole bunch of ID numbers. 1756 01:24:10,700 --> 01:24:13,100 Now this could very quickly get annoying. 1757 01:24:13,100 --> 01:24:17,930 It looks like show ID 1, 2, 4, 5, 6, 7, 9, and so forth, are all comedies. 1758 01:24:17,930 --> 01:24:23,690 So I could do something really crazy like, select title from shows, 1759 01:24:23,690 --> 01:24:29,390 where ID equals 1, or ID equals 2. 1760 01:24:29,390 --> 01:24:32,360 This is not going to scale very well, but this 1761 01:24:32,360 --> 01:24:34,610 is why SQL is especially powerful. 1762 01:24:34,610 --> 01:24:39,420 You can actually compose one SQL question from multiple ones. 1763 01:24:39,420 --> 01:24:40,440 So let's do this. 1764 01:24:40,440 --> 01:24:44,480 Why don't I select the title where the ID of the show 1765 01:24:44,480 --> 01:24:48,050 is in the following list of IDs? 1766 01:24:48,050 --> 01:24:54,710 Select show ID from genres, where the specific genre is, quote, unquote, 1767 01:24:54,710 --> 01:24:55,640 "comedy." 1768 01:24:55,640 --> 01:24:58,550 So I've got two SQL queries. 1769 01:24:58,550 --> 01:25:01,850 One is deliberately nested inside of parentheses. 1770 01:25:01,850 --> 01:25:04,760 That's going to give me back that whole list of show IDs. 1771 01:25:04,760 --> 01:25:07,070 But that's exactly what I want to then look up 1772 01:25:07,070 --> 01:25:11,000 the titles for by selecting title from shows where the ID of the show 1773 01:25:11,000 --> 01:25:13,380 is in that big, tall list. 1774 01:25:13,380 --> 01:25:18,710 And so now if I hit Enter, I get back only those shows 1775 01:25:18,710 --> 01:25:22,520 that were somehow flagged as comedy, whether you in the audience 1776 01:25:22,520 --> 01:25:26,570 checked one box for comedy, two boxes, or all of the boxes. 1777 01:25:26,570 --> 01:25:28,928 Somehow we teased out comedy, again, just 1778 01:25:28,928 --> 01:25:31,220 by using that Python script, which loaded this data not 1779 01:25:31,220 --> 01:25:33,800 into one big table, but instead, two. 1780 01:25:33,800 --> 01:25:36,420 And if we want to clean this up, let's do a couple of things. 1781 01:25:36,420 --> 01:25:40,580 Let's, outside of the parentheses, do order by title. 1782 01:25:40,580 --> 01:25:43,550 This is a way of sorting the data in SQL very easily. 1783 01:25:43,550 --> 01:25:48,140 Now we have a whole list of the same titles that are now sorted. 1784 01:25:48,140 --> 01:25:52,640 And what was the keyword with which I could filter out duplicates? 1785 01:25:52,640 --> 01:25:53,450 Yeah, distinct. 1786 01:25:53,450 --> 01:25:54,260 So let's try this. 1787 01:25:54,260 --> 01:25:59,600 Same query, but let's select only the distinct titles from that whole query. 1788 01:25:59,600 --> 01:26:01,770 And notice, I've very deliberately done it this way. 1789 01:26:01,770 --> 01:26:03,520 And to this day, any time I'm using SQL, I 1790 01:26:03,520 --> 01:26:06,187 don't just start at the beginning and type out my whole thought, 1791 01:26:06,187 --> 01:26:07,940 and just get it right on the first try. 1792 01:26:07,940 --> 01:26:10,610 I very commonly start with the subquery, if you 1793 01:26:10,610 --> 01:26:12,800 will, the thing in parentheses, just to get myself 1794 01:26:12,800 --> 01:26:14,600 one step toward what I care about. 1795 01:26:14,600 --> 01:26:15,332 Then I add to it. 1796 01:26:15,332 --> 01:26:16,040 Then I add to it. 1797 01:26:16,040 --> 01:26:18,590 Then I add to it, just like we've encouraged in Python and C, 1798 01:26:18,590 --> 01:26:22,370 taking baby steps in order to get to the answer you actually care about, 1799 01:26:22,370 --> 01:26:23,480 like this one now. 1800 01:26:23,480 --> 01:26:26,060 And other than this mistake, which we didn't 1801 01:26:26,060 --> 01:26:30,350 fix because I re-imported the data after accidentally changing everyone's genre, 1802 01:26:30,350 --> 01:26:35,190 we now have an alphabetized list of all of the same data. 1803 01:26:35,190 --> 01:26:40,670 But now it's better designed, because we have it split across these two tables. 1804 01:26:40,670 --> 01:26:41,630 Oh, thank you. 1805 01:26:41,630 --> 01:26:44,720 OK, just thanks. 1806 01:26:44,720 --> 01:26:48,950 What questions do we have, if any here? 1807 01:26:48,950 --> 01:26:50,480 Questions on this approach? 1808 01:26:50,480 --> 01:26:53,040 1809 01:26:53,040 --> 01:26:55,316 Yeah? 1810 01:26:55,316 --> 01:27:00,280 AUDIENCE: [INAUDIBLE] 1811 01:27:00,280 --> 01:27:02,280 DAVID J. MALAN: Oh, now that we have a database, 1812 01:27:02,280 --> 01:27:04,640 how do we transfer it to a CSV? 1813 01:27:04,640 --> 01:27:06,060 There are ways to do that. 1814 01:27:06,060 --> 01:27:08,570 And in fact, there's a command within SQLite 1815 01:27:08,570 --> 01:27:11,590 that allows you to export your data back to a CSV file. 1816 01:27:11,590 --> 01:27:13,340 If you want to email it to someone and you 1817 01:27:13,340 --> 01:27:16,430 want them to be able to open it in Excel, or Google Spreadsheets, or Apple 1818 01:27:16,430 --> 01:27:19,010 Numbers, or the like, you can go in the other direction. 1819 01:27:19,010 --> 01:27:21,890 Generally though, once you're in the world of SQL 1820 01:27:21,890 --> 01:27:24,620 you're probably storing your data there long term. 1821 01:27:24,620 --> 01:27:27,590 And you're probably updating it, maybe deleting it, adding to it, 1822 01:27:27,590 --> 01:27:28,273 and so forth. 1823 01:27:28,273 --> 01:27:30,440 For instance, the one command I did not show earlier 1824 01:27:30,440 --> 01:27:33,320 is, suppose someone forgot a show. 1825 01:27:33,320 --> 01:27:35,640 Let's see, did I see this in the output? 1826 01:27:35,640 --> 01:27:37,580 All right, so Curb Your Enthusiasm. 1827 01:27:37,580 --> 01:27:38,540 Saw that last night. 1828 01:27:38,540 --> 01:27:39,530 It was just, yeah. 1829 01:27:39,530 --> 01:27:41,160 Did anyone see it last night? 1830 01:27:41,160 --> 01:27:41,660 No? 1831 01:27:41,660 --> 01:27:44,743 All right, well, just the one person that checked that box, so you and me. 1832 01:27:44,743 --> 01:27:46,740 What's another show that didn't make the list? 1833 01:27:46,740 --> 01:27:48,217 How about Seinfeld? 1834 01:27:48,217 --> 01:27:49,550 It's now on Netflix, apparently. 1835 01:27:49,550 --> 01:27:55,220 So insert into shows. 1836 01:27:55,220 --> 01:27:56,390 What do we want to insert? 1837 01:27:56,390 --> 01:27:59,910 Well, we want to insert maybe an ID and a title. 1838 01:27:59,910 --> 01:28:02,150 But I don't actually care what the ID is, 1839 01:28:02,150 --> 01:28:03,650 so I'm just going to insert a title. 1840 01:28:03,650 --> 01:28:05,780 And the value I'm going to give to that title 1841 01:28:05,780 --> 01:28:08,900 is going to be, quote, unquote, "Seinfeld." 1842 01:28:08,900 --> 01:28:11,810 And then, let me go ahead and hit semicolon. 1843 01:28:11,810 --> 01:28:14,360 Nothing seems to happen, but let me rerun the big query 1844 01:28:14,360 --> 01:28:16,250 from before looking for comedies. 1845 01:28:16,250 --> 01:28:19,850 And unfortunately, Seinfeld has not yet been flagged as a comedy, 1846 01:28:19,850 --> 01:28:21,710 so let's get this right, too. 1847 01:28:21,710 --> 01:28:25,370 What intuitively I'm going to have to do to associate, now, 1848 01:28:25,370 --> 01:28:27,980 Seinfeld with my comedies? 1849 01:28:27,980 --> 01:28:30,140 I just inserted into the show's table. 1850 01:28:30,140 --> 01:28:33,890 What more needs to happen before we can flag Seinfeld as a comedy? 1851 01:28:33,890 --> 01:28:35,369 AUDIENCE: [INAUDIBLE] 1852 01:28:35,369 --> 01:28:36,452 DAVID J. MALAN: Say again? 1853 01:28:36,452 --> 01:28:37,635 AUDIENCE: [INAUDIBLE] 1854 01:28:37,635 --> 01:28:38,510 DAVID J. MALAN: Yeah. 1855 01:28:38,510 --> 01:28:42,950 So I need to insert into the genres table two things now, 1856 01:28:42,950 --> 01:28:48,180 a show ID, like this, and then, the name of the genre, 1857 01:28:48,180 --> 01:28:49,340 which presumably is comedy. 1858 01:28:49,340 --> 01:28:50,810 What values do I want to insert? 1859 01:28:50,810 --> 01:28:52,795 Well, the show ID, I better grab that. 1860 01:28:52,795 --> 01:28:54,170 Oh, I don't even know what it is. 1861 01:28:54,170 --> 01:28:55,770 I'm going to have to figure out what that is. 1862 01:28:55,770 --> 01:28:57,660 So I could do this in a couple of ways. 1863 01:28:57,660 --> 01:28:59,150 Let me do a one-time thing. 1864 01:28:59,150 --> 01:29:02,924 Select star from shows, where title equals, 1865 01:29:02,924 --> 01:29:06,770 quote, unquote, "Seinfeld" semicolon 159. 1866 01:29:06,770 --> 01:29:11,780 So now I could do, insert into genres a show ID 1867 01:29:11,780 --> 01:29:20,510 and a genre name, the values 159, and, quote, unquote, "comedy" semicolon, 1868 01:29:20,510 --> 01:29:21,260 Enter. 1869 01:29:21,260 --> 01:29:24,710 And now, if I scroll back in my history and execute that really big query 1870 01:29:24,710 --> 01:29:26,690 again, looking for all distinct comedies, 1871 01:29:26,690 --> 01:29:29,100 now Seinfeld has made the list. 1872 01:29:29,100 --> 01:29:32,637 But I did this manually so I didn't actually capitalize it. 1873 01:29:32,637 --> 01:29:33,470 Let's clean that up. 1874 01:29:33,470 --> 01:29:35,420 Let's do update. 1875 01:29:35,420 --> 01:29:37,700 Let's do update my shows. 1876 01:29:37,700 --> 01:29:43,440 Set title equals to Seinfeld semicolon. 1877 01:29:43,440 --> 01:29:43,940 No? 1878 01:29:43,940 --> 01:29:47,780 OK, thank you, where title equals, quote, unquote, "Seinfeld." 1879 01:29:47,780 --> 01:29:49,520 Let's not make that mistake again. 1880 01:29:49,520 --> 01:29:50,330 Enter. 1881 01:29:50,330 --> 01:29:53,270 And now, if I execute that really big query, now Seinfeld is, 1882 01:29:53,270 --> 01:29:56,480 indeed, considered a comedy. 1883 01:29:56,480 --> 01:29:57,960 So where are we going with this? 1884 01:29:57,960 --> 01:30:00,240 Well, thus far we've been doing all of this pretty manually. 1885 01:30:00,240 --> 01:30:02,780 And this is absolutely what an analyst, a data scientist type person 1886 01:30:02,780 --> 01:30:05,197 might do if just manipulating a pretty large data set just 1887 01:30:05,197 --> 01:30:08,010 to get at interesting answers that might be across one, 1888 01:30:08,010 --> 01:30:09,440 two, or even many more tables. 1889 01:30:09,440 --> 01:30:12,440 Eventually, in a few weeks, we're going to start to automate all of this 1890 01:30:12,440 --> 01:30:16,610 by writing code in Python that generates SQL to do this. 1891 01:30:16,610 --> 01:30:19,280 If you go to most any website on the internet today, 1892 01:30:19,280 --> 01:30:23,420 and you, for instance, log in, odds are you're typing a username and password, 1893 01:30:23,420 --> 01:30:24,260 clicking Submit. 1894 01:30:24,260 --> 01:30:25,430 What's then happening? 1895 01:30:25,430 --> 01:30:28,010 Well, the website might not be implemented in Python 1896 01:30:28,010 --> 01:30:31,610 but it's probably implemented in some language, Python, JavaScript, Java, 1897 01:30:31,610 --> 01:30:33,110 Ruby, something else. 1898 01:30:33,110 --> 01:30:38,330 And that language is probably using something like a relational database 1899 01:30:38,330 --> 01:30:41,930 to use SQL to get your username, get your password, 1900 01:30:41,930 --> 01:30:43,997 and compare the two against what you've typed in. 1901 01:30:43,997 --> 01:30:46,580 And actually, it's hopefully not getting your actual password, 1902 01:30:46,580 --> 01:30:48,163 but something called the hash thereof. 1903 01:30:48,163 --> 01:30:50,360 But there's probably a database involved doing that. 1904 01:30:50,360 --> 01:30:53,600 When you buy something on Amazon.com and you click Check Out, 1905 01:30:53,600 --> 01:30:56,720 odds are there's some code on Amazon's server 1906 01:30:56,720 --> 01:30:59,870 that's looking at what it is you added to your shopping cart, 1907 01:30:59,870 --> 01:31:03,690 and then maybe using a for loop of some sort, in Python or another language. 1908 01:31:03,690 --> 01:31:08,180 It's doing a whole bunch of SQL inserts to store in their database what 1909 01:31:08,180 --> 01:31:09,290 it is you bought. 1910 01:31:09,290 --> 01:31:12,260 There's other types of databases, too, but SQL databases, 1911 01:31:12,260 --> 01:31:14,647 or relational databases are quite popular. 1912 01:31:14,647 --> 01:31:16,730 So let's go ahead and write one other program here 1913 01:31:16,730 --> 01:31:21,320 in Python that now merges these two languages together, whereby 1914 01:31:21,320 --> 01:31:24,740 I'm going to use SQL inside of a Python program 1915 01:31:24,740 --> 01:31:28,430 so I can implement my logic of my program in Python, 1916 01:31:28,430 --> 01:31:30,050 step-by-step, line-by-line. 1917 01:31:30,050 --> 01:31:34,260 But when I want to get at some data I can actually talk to a SQL database. 1918 01:31:34,260 --> 01:31:37,190 So let me go ahead and open favorites.py. 1919 01:31:37,190 --> 01:31:39,740 1920 01:31:39,740 --> 01:31:45,200 And let me go ahead and throw away some of what we did earlier and really 1921 01:31:45,200 --> 01:31:47,690 just now add a SQL to the mix. 1922 01:31:47,690 --> 01:31:51,440 From the CS50 library, let's import the SQL function. 1923 01:31:51,440 --> 01:31:54,260 This will be useful to use because most third-party libraries 1924 01:31:54,260 --> 01:31:57,390 that deal with SQL and Python are more complicated than they need to be. 1925 01:31:57,390 --> 01:32:00,390 So I think you'll find this library easier to use. 1926 01:32:00,390 --> 01:32:01,780 Let's then do the following. 1927 01:32:01,780 --> 01:32:03,900 Create a variable called db for database. 1928 01:32:03,900 --> 01:32:05,400 But I could call it anything I want. 1929 01:32:05,400 --> 01:32:09,090 Let's use that you URI, which is a fancy way of saying something 1930 01:32:09,090 --> 01:32:17,790 that looks like a URL, but that actually opens up a database on disk, that is, 1931 01:32:17,790 --> 01:32:19,110 in the current folder. 1932 01:32:19,110 --> 01:32:22,620 Let's now ask the user for a title by prompting them for a, quote, unquote, 1933 01:32:22,620 --> 01:32:23,820 "title" like this. 1934 01:32:23,820 --> 01:32:27,960 And let's strip off any whitespace just so that the data is not messy. 1935 01:32:27,960 --> 01:32:30,730 And then, let's go ahead and do this. 1936 01:32:30,730 --> 01:32:32,220 And this is the new logic. 1937 01:32:32,220 --> 01:32:36,570 I'm going to go ahead now and write a line of code that uses Python 1938 01:32:36,570 --> 01:32:39,840 to talk to the original favorites.db. 1939 01:32:39,840 --> 01:32:43,980 So again, I'm not using the two-table database, which is in favorites8.db. 1940 01:32:43,980 --> 01:32:47,320 I'm using the original that we imported from your own data, 1941 01:32:47,320 --> 01:32:49,090 and I'm going to do the following. 1942 01:32:49,090 --> 01:32:54,150 I'm going to use db.execute to execute a SQL command inside of Python. 1943 01:32:54,150 --> 01:33:02,700 I'm going to select the count of shows from the favorites 1944 01:33:02,700 --> 01:33:09,960 table, where the title the user typed in is like this question mark. 1945 01:33:09,960 --> 01:33:11,910 And why I'm doing that is as follows. 1946 01:33:11,910 --> 01:33:15,600 Just like in C, when we had percent S, in SQL for now, 1947 01:33:15,600 --> 01:33:17,490 the analogue is going to be a question mark. 1948 01:33:17,490 --> 01:33:18,990 So same idea, different syntax. 1949 01:33:18,990 --> 01:33:21,180 Instead of percent S, it's just a question mark. 1950 01:33:21,180 --> 01:33:26,280 And using a comma outside of this first string, using CS50's execute 1951 01:33:26,280 --> 01:33:29,340 function I can pass in a SQL string, a command, 1952 01:33:29,340 --> 01:33:33,835 then any arguments I want to plug into the question marks therein. 1953 01:33:33,835 --> 01:33:36,210 So the goal at hand is to actually write a program that's 1954 01:33:36,210 --> 01:33:42,420 going to search favorites.csv, a.k.a., favorites.db for the total number 1955 01:33:42,420 --> 01:33:45,300 of people that liked a particular show. 1956 01:33:45,300 --> 01:33:49,050 So this is going to select the count of people from the favorites table 1957 01:33:49,050 --> 01:33:53,400 where the title they typed in is like whatever the user has just now typed 1958 01:33:53,400 --> 01:33:53,970 in. 1959 01:33:53,970 --> 01:33:56,730 This db execute function returns a list. 1960 01:33:56,730 --> 01:33:57,813 It returns a list of rows. 1961 01:33:57,813 --> 01:34:00,188 And you would only know that by my telling you or reading 1962 01:34:00,188 --> 01:34:01,290 the documentation. 1963 01:34:01,290 --> 01:34:04,140 And therefore, if I want to get back to the total count, 1964 01:34:04,140 --> 01:34:08,940 I'm going to go ahead and grab the first row from those rows. 1965 01:34:08,940 --> 01:34:11,220 Because it's only going to give me back the count. 1966 01:34:11,220 --> 01:34:16,440 And then I'm going to go ahead and print out that row's first value. 1967 01:34:16,440 --> 01:34:17,940 But it's going to be a little weird. 1968 01:34:17,940 --> 01:34:21,420 Technically the column is going to be called "count" star, quote, unquote, 1969 01:34:21,420 --> 01:34:22,420 which is a little weird. 1970 01:34:22,420 --> 01:34:24,180 Let me add one more feature to the mix. 1971 01:34:24,180 --> 01:34:26,280 You can actually give nicknames to columns 1972 01:34:26,280 --> 01:34:30,090 that are coming back, especially if they are the result of functions like this. 1973 01:34:30,090 --> 01:34:34,170 I can just call that column counter, in all lowercase. 1974 01:34:34,170 --> 01:34:41,620 That means I can now say get back the counter key inside of this dictionary. 1975 01:34:41,620 --> 01:34:43,360 So just to recap, what have we done? 1976 01:34:43,360 --> 01:34:46,010 We've imported the CS50 library SQL function. 1977 01:34:46,010 --> 01:34:49,080 We've, with this line of code, opened the favorites.db 1978 01:34:49,080 --> 01:34:54,810 file that you and I created earlier by importing your CSV into SQLite. 1979 01:34:54,810 --> 01:34:58,140 I'm now just asking the user for a title they want to search for. 1980 01:34:58,140 --> 01:35:02,070 I'm now executing this SQL query on that database, 1981 01:35:02,070 --> 01:35:05,250 plugging in whatever the human typed in as their title 1982 01:35:05,250 --> 01:35:07,170 in order to get back a total count. 1983 01:35:07,170 --> 01:35:10,950 And I'm giving the count a nickname, an alias of counter, 1984 01:35:10,950 --> 01:35:13,860 just so it's more self-explanatory. 1985 01:35:13,860 --> 01:35:18,330 This function, db execute, no matter what, always returns a list of rows, 1986 01:35:18,330 --> 01:35:20,470 even if there's only one row inside of it. 1987 01:35:20,470 --> 01:35:23,310 So this line of code just gives me the first and only row. 1988 01:35:23,310 --> 01:35:27,960 And then, this goes inside of that row, which it turns out is a dictionary, 1989 01:35:27,960 --> 01:35:33,730 and gives me the key counter and the value it corresponds to. 1990 01:35:33,730 --> 01:35:35,340 So what, to be clear, is this doing? 1991 01:35:35,340 --> 01:35:38,220 Let's go ahead and run this manually in my Terminal window first. 1992 01:35:38,220 --> 01:35:41,970 Let me run SQLite3 on favorites-- 1993 01:35:41,970 --> 01:35:43,380 Well, let's do this. 1994 01:35:43,380 --> 01:35:47,410 On favorites.db, let me import the data again. 1995 01:35:47,410 --> 01:35:54,910 So mode csv.import in from favorites.csv into a favorites table. 1996 01:35:54,910 --> 01:35:57,330 So I've just recreated the same data set that you all 1997 01:35:57,330 --> 01:35:59,782 gave me earlier in favorites.db. 1998 01:35:59,782 --> 01:36:02,490 If I were to do this manually, let's search for The Office again. 1999 01:36:02,490 --> 01:36:09,210 Select, count star from favorites, where title like, and let's 2000 01:36:09,210 --> 01:36:12,270 just manually type it in for now, The Office. 2001 01:36:12,270 --> 01:36:15,330 We'll search for the one with the word The, semicolon. 2002 01:36:15,330 --> 01:36:16,560 I get back 12. 2003 01:36:16,560 --> 01:36:18,780 But technically, notice what I get back. 2004 01:36:18,780 --> 01:36:25,080 I technically get back a miniature table containing one column and one row. 2005 01:36:25,080 --> 01:36:27,090 What if I want to rename that column? 2006 01:36:27,090 --> 01:36:28,840 That's where the as keyword comes in. 2007 01:36:28,840 --> 01:36:31,080 So select count star as counter. 2008 01:36:31,080 --> 01:36:32,910 Notice what happens, Enter. 2009 01:36:32,910 --> 01:36:35,883 I just get back-- same simple table, but I've 2010 01:36:35,883 --> 01:36:38,550 renamed the column to be counter just because it's a little more 2011 01:36:38,550 --> 01:36:40,410 self-explanatory as to what it is. 2012 01:36:40,410 --> 01:36:43,200 So what am I doing with this line of code? 2013 01:36:43,200 --> 01:36:47,430 This line of code is returning to me that miniature temporary table 2014 01:36:47,430 --> 01:36:50,820 in the form of a list of dictionaries. 2015 01:36:50,820 --> 01:36:55,620 The list contains one row, as we'll see, and it 2016 01:36:55,620 --> 01:37:00,800 contains one column, as we'll see, the key for which is counter. 2017 01:37:00,800 --> 01:37:02,540 So let's now run the code itself. 2018 01:37:02,540 --> 01:37:07,280 I'm going to get out of SQLite3 and I'm going to run Python of favorites.py. 2019 01:37:07,280 --> 01:37:08,120 Enter. 2020 01:37:08,120 --> 01:37:09,540 I'm being prompted for a title. 2021 01:37:09,540 --> 01:37:13,790 I'm going to type in The Office and cross my fingers, and there's that 12. 2022 01:37:13,790 --> 01:37:14,810 Why is it 12? 2023 01:37:14,810 --> 01:37:17,450 Well, there's a typo again because I re-imported the CSV. 2024 01:37:17,450 --> 01:37:21,270 I had deleted two of the Thes, so we're back at the original data set. 2025 01:37:21,270 --> 01:37:25,760 So there's 12 total that have, quote, unquote, "The Office" 2026 01:37:25,760 --> 01:37:27,780 in the title like that. 2027 01:37:27,780 --> 01:37:28,860 So what have we done? 2028 01:37:28,860 --> 01:37:31,820 We've combined some Python with some SQL, 2029 01:37:31,820 --> 01:37:34,790 but we've relegated all of the complexity of searching for something, 2030 01:37:34,790 --> 01:37:36,800 the selecting of something, gotten rid of all 2031 01:37:36,800 --> 01:37:39,560 of the with keyword, the open keyword, the for loop, 2032 01:37:39,560 --> 01:37:41,600 the reader the DictReader, and all of that. 2033 01:37:41,600 --> 01:37:46,460 And it's just one line of SQL now, using the best of both worlds. 2034 01:37:46,460 --> 01:37:51,720 All right, any questions on what we've just done here or how any of this 2035 01:37:51,720 --> 01:37:52,220 works? 2036 01:37:52,220 --> 01:37:54,870 2037 01:37:54,870 --> 01:37:56,210 Any questions here? 2038 01:37:56,210 --> 01:37:57,310 Yeah? 2039 01:37:57,310 --> 01:38:01,070 AUDIENCE: [INAUDIBLE] 2040 01:38:01,070 --> 01:38:04,413 DAVID J. MALAN: When does this function return more than one row? 2041 01:38:04,413 --> 01:38:05,330 Was that the question? 2042 01:38:05,330 --> 01:38:05,760 AUDIENCE: Yeah. 2043 01:38:05,760 --> 01:38:06,030 DAVID J. MALAN: Yeah. 2044 01:38:06,030 --> 01:38:08,240 So let's do that by changing the problem at hand. 2045 01:38:08,240 --> 01:38:10,970 This program was designed just to select the total count. 2046 01:38:10,970 --> 01:38:16,070 Let's go ahead and select, for instance, all 2047 01:38:16,070 --> 01:38:20,705 of the ways you all typed in The Office by selecting the title this time. 2048 01:38:20,705 --> 01:38:24,110 2049 01:38:24,110 --> 01:38:28,470 If I do this in SQLite3, let me go ahead and do this again 2050 01:38:28,470 --> 01:38:29,970 after increasing my Terminal window. 2051 01:38:29,970 --> 01:38:30,920 Let's do it manually. 2052 01:38:30,920 --> 01:38:35,570 Select title from favorites, where the title is like, 2053 01:38:35,570 --> 01:38:38,835 quote, unquote, "The Office," semicolon. 2054 01:38:38,835 --> 01:38:41,960 I get back all of these different rows, and we didn't even notice this one. 2055 01:38:41,960 --> 01:38:43,910 There's actually another little typo in there 2056 01:38:43,910 --> 01:38:47,630 with some capitalization of the E, and the C, and the E. That 2057 01:38:47,630 --> 01:38:50,840 would be an example of a query that gives me back therefore 2058 01:38:50,840 --> 01:38:52,080 for multiple rows. 2059 01:38:52,080 --> 01:38:53,990 So let's now change my Python program. 2060 01:38:53,990 --> 01:38:59,540 If I now, in my Python program, do this, I get back a whole bunch of rows 2061 01:38:59,540 --> 01:39:01,110 containing all of those titles. 2062 01:39:01,110 --> 01:39:06,350 I can now do, for row in rows, I can print out the current row's title, 2063 01:39:06,350 --> 01:39:09,390 and now manipulate all of those things together. 2064 01:39:09,390 --> 01:39:10,760 Let me keep both on the screen. 2065 01:39:10,760 --> 01:39:12,320 Let me run Python of favorites.py. 2066 01:39:12,320 --> 01:39:16,320 And that for loop now should iterate, what, 10 or more times, 2067 01:39:16,320 --> 01:39:18,020 once for each of those titles. 2068 01:39:18,020 --> 01:39:22,010 And indeed, if I type in The Office again, Enter. 2069 01:39:22,010 --> 01:39:24,890 Whoops. 2070 01:39:24,890 --> 01:39:25,670 Row title. 2071 01:39:25,670 --> 01:39:26,750 What did I do wrong? 2072 01:39:26,750 --> 01:39:30,052 Oh, I should not be renaming title to counter this time. 2073 01:39:30,052 --> 01:39:31,760 So that's just a dumb mistake on my part. 2074 01:39:31,760 --> 01:39:33,410 Let me rerun it again. 2075 01:39:33,410 --> 01:39:36,230 And now I should see after typing in The Office, 2076 01:39:36,230 --> 01:39:38,420 Enter, a whole bunch of The Offices. 2077 01:39:38,420 --> 01:39:40,520 And because I'm using like, even the missed 2078 01:39:40,520 --> 01:39:43,550 capitalizations are coming through, because like is case insensitive. 2079 01:39:43,550 --> 01:39:45,890 It doesn't matter if it's uppercase or lowercase. 2080 01:39:45,890 --> 01:39:50,300 Whereas had I used the equal sign I would get back only the same ones 2081 01:39:50,300 --> 01:39:51,935 capitalized correctly. 2082 01:39:51,935 --> 01:39:55,250 All right, any questions on this next? 2083 01:39:55,250 --> 01:39:59,700 All right, so let's transition to a larger, juicier data 2084 01:39:59,700 --> 01:40:01,610 set, and consider some of the issues that 2085 01:40:01,610 --> 01:40:05,870 arise when actually now using SQL and skating toward a world in which we're 2086 01:40:05,870 --> 01:40:08,970 using SQL for mobile apps, web apps, and generally speaking, 2087 01:40:08,970 --> 01:40:10,800 very large data sets. 2088 01:40:10,800 --> 01:40:14,090 So let's start with a larger data set just like that. 2089 01:40:14,090 --> 01:40:19,800 Give me just a moment to switch screens over to what we have for you today, 2090 01:40:19,800 --> 01:40:22,970 which is an actual relational database that we've created out 2091 01:40:22,970 --> 01:40:26,540 of a real-world data set from IMDb. 2092 01:40:26,540 --> 01:40:29,210 So InternetMovieDatabase.com is a website 2093 01:40:29,210 --> 01:40:31,790 where you can search for TV shows, and movies, and actors, 2094 01:40:31,790 --> 01:40:34,880 and so forth, all using their database behind the scenes. 2095 01:40:34,880 --> 01:40:39,530 IMDb wonderfully makes their data set available as not CSV files, 2096 01:40:39,530 --> 01:40:42,960 but TSV files, tab-separated values. 2097 01:40:42,960 --> 01:40:46,460 And so what we did is, before class we downloaded those TSV files. 2098 01:40:46,460 --> 01:40:50,300 We wrote a Python program similar to my favorites8.py file 2099 01:40:50,300 --> 01:40:53,750 earlier that read in all of those TSV files, 2100 01:40:53,750 --> 01:40:58,820 created some SQL tables in an IMDb database 2101 01:40:58,820 --> 01:41:03,270 for you in SQLite that has multiple tables and multiple columns. 2102 01:41:03,270 --> 01:41:07,190 So let's go and wrap our minds around what's actually in this data set. 2103 01:41:07,190 --> 01:41:10,940 Let me go back to VS Code here, and in just a moment, 2104 01:41:10,940 --> 01:41:15,260 I'm going to go ahead and copy the file, which we've named shows.db. 2105 01:41:15,260 --> 01:41:20,510 And I'm going to go ahead and increase my Terminal and do SQLite3 of shows.db. 2106 01:41:20,510 --> 01:41:23,570 Whenever playing around with a SQLite database for the first time, 2107 01:41:23,570 --> 01:41:26,480 typing .schema is perhaps a good place to start to give you a sense 2108 01:41:26,480 --> 01:41:27,410 of what's in there. 2109 01:41:27,410 --> 01:41:29,120 And things just escalated quickly. 2110 01:41:29,120 --> 01:41:31,640 There's a lot in this data set, because, indeed, there's 2111 01:41:31,640 --> 01:41:35,750 going to be tens of hundreds of thousands of rows in this data set, 2112 01:41:35,750 --> 01:41:39,230 and also problem set 7, where we'll look at the movie side of things 2113 01:41:39,230 --> 01:41:40,920 and not just the TV shows. 2114 01:41:40,920 --> 01:41:43,940 So what is the schema that we have created for you 2115 01:41:43,940 --> 01:41:47,150 from IMDb's actual real-world data? 2116 01:41:47,150 --> 01:41:48,950 One, there's a table called shows. 2117 01:41:48,950 --> 01:41:51,950 And notice we've just added whitespace by hitting Enter a bunch of times 2118 01:41:51,950 --> 01:41:54,050 to make it a little more stylistically readable. 2119 01:41:54,050 --> 01:41:57,980 The shows table has an ID column, a title column, a year, 2120 01:41:57,980 --> 01:42:00,740 and the total number of episodes for a given show. 2121 01:42:00,740 --> 01:42:05,750 And the types of those columns are integer, text, numeric, and integer. 2122 01:42:05,750 --> 01:42:08,090 So it turns out there's actually a few different data 2123 01:42:08,090 --> 01:42:13,850 types that are worth being aware of when it comes to creating tables themselves. 2124 01:42:13,850 --> 01:42:18,170 In fact, in SQLite there's five data types, and only five, 2125 01:42:18,170 --> 01:42:21,650 fortunately, one of which is, indeed, integer, negative or positive, 2126 01:42:21,650 --> 01:42:25,010 numeric, which is kind of a catchall for dates and times, 2127 01:42:25,010 --> 01:42:27,320 things that are numeric but are not just integers, 2128 01:42:27,320 --> 01:42:29,510 and not just real numbers, for instance. 2129 01:42:29,510 --> 01:42:33,020 Real number is what we've generally thought of as float up until now. 2130 01:42:33,020 --> 01:42:35,058 Text, of course, is just text, but notice 2131 01:42:35,058 --> 01:42:37,100 that you don't have to worry about how big it is. 2132 01:42:37,100 --> 01:42:39,110 Like in Python, it will size to fit. 2133 01:42:39,110 --> 01:42:41,840 And then there's BLOB, which is binary large object, which 2134 01:42:41,840 --> 01:42:45,300 is for just raw 0s and 1s, like for files or things like that. 2135 01:42:45,300 --> 01:42:47,570 But we'll generally use the other four of these. 2136 01:42:47,570 --> 01:42:50,960 And so, indeed, when we imported this data for you 2137 01:42:50,960 --> 01:42:56,270 we decided that every show would be given an ID, which is just an integer. 2138 01:42:56,270 --> 01:42:59,460 Every show has, of course, a title, which should not be null. 2139 01:42:59,460 --> 01:43:01,320 Otherwise, why is it in the database? 2140 01:43:01,320 --> 01:43:04,830 Every show has a year, which is numeric according 2141 01:43:04,830 --> 01:43:06,180 to that definition a moment ago. 2142 01:43:06,180 --> 01:43:09,540 And the total number of episodes for a show is going to be an integer. 2143 01:43:09,540 --> 01:43:13,110 What now is with these primary keys that we mentioned earlier, too? 2144 01:43:13,110 --> 01:43:18,090 A primary key is the column that uniquely identifies all of the data. 2145 01:43:18,090 --> 01:43:21,120 In our case, with the favorites, I automatically 2146 01:43:21,120 --> 01:43:24,750 gave each of your submissions a unique ID so that even if two or more of you 2147 01:43:24,750 --> 01:43:27,360 typed in The Office, your submission still 2148 01:43:27,360 --> 01:43:32,310 had a unique identifier, a number that allowed me to then correlate it 2149 01:43:32,310 --> 01:43:36,330 with your genres, just as we saw a moment ago. 2150 01:43:36,330 --> 01:43:39,280 In this version of IMDb, there's also genres. 2151 01:43:39,280 --> 01:43:42,120 But they don't come from us, they come from IMDb.com. 2152 01:43:42,120 --> 01:43:46,320 And so a genre has a show ID, and a genre just like our database. 2153 01:43:46,320 --> 01:43:49,890 But these are real-world genres with a bit more filtration. 2154 01:43:49,890 --> 01:43:54,520 Notice, though, just like my version, there's a foreign key. 2155 01:43:54,520 --> 01:43:59,970 A foreign key is the appearance of another table's primary key 2156 01:43:59,970 --> 01:44:02,050 in its own table. 2157 01:44:02,050 --> 01:44:05,100 So when you have a table like genres, which is somehow 2158 01:44:05,100 --> 01:44:10,920 cross referencing the original shows table, if shows have a primary key 2159 01:44:10,920 --> 01:44:15,270 called ID, and those same numbers appear in the genres table 2160 01:44:15,270 --> 01:44:20,250 under the column called show ID, by definition, show ID is a foreign key. 2161 01:44:20,250 --> 01:44:22,650 It's the same numbers but it's foreign in the sense 2162 01:44:22,650 --> 01:44:25,260 that the number is being used in this table, 2163 01:44:25,260 --> 01:44:29,130 even though it's officially defined primarily in this other table. 2164 01:44:29,130 --> 01:44:31,770 This is what we mean by relational databases. 2165 01:44:31,770 --> 01:44:37,170 You have multiple tables with some column in common, numbers typically. 2166 01:44:37,170 --> 01:44:40,860 And those numbers allow you to line the two tables up in such a way 2167 01:44:40,860 --> 01:44:44,040 that you can reconnect the shows with their genres, 2168 01:44:44,040 --> 01:44:46,800 just like we did with our smaller data set a moment ago. 2169 01:44:46,800 --> 01:44:49,050 This logic is extended further. 2170 01:44:49,050 --> 01:44:53,070 Notice that the IMDb database we've created for you has a stars table, 2171 01:44:53,070 --> 01:44:56,730 like TV show stars, the actors therein. 2172 01:44:56,730 --> 01:45:00,210 And that table, interestingly, has no mention of people 2173 01:45:00,210 --> 01:45:02,220 and no mention of shows, per se. 2174 01:45:02,220 --> 01:45:05,730 It only has a column called show ID, which is an integer, 2175 01:45:05,730 --> 01:45:08,220 and a person ID, which is an integer. 2176 01:45:08,220 --> 01:45:14,320 Meanwhile, if we scrolled down to the bottom, 2177 01:45:14,320 --> 01:45:17,490 you will see a table called people. 2178 01:45:17,490 --> 01:45:23,010 And we have decided in IMDb's world that every person in the TV show world 2179 01:45:23,010 --> 01:45:27,510 will have a unique identifier that's a number, a name that's text, a birth 2180 01:45:27,510 --> 01:45:31,500 date, which is numeric, and then, again, specifying that ID 2181 01:45:31,500 --> 01:45:35,350 is going to be their primary key. 2182 01:45:35,350 --> 01:45:36,940 So what's going on here? 2183 01:45:36,940 --> 01:45:42,640 Well, it turns out that TV stars and writers are both types of people. 2184 01:45:42,640 --> 01:45:47,730 So using this relational database, notice the road we're going down. 2185 01:45:47,730 --> 01:45:49,770 We're factoring out commonalities. 2186 01:45:49,770 --> 01:45:52,570 And if a person can be different things in life, 2187 01:45:52,570 --> 01:45:55,260 well, we're defining them first and foremost as people. 2188 01:45:55,260 --> 01:45:58,150 And then, notice these two tables are almost the same. 2189 01:45:58,150 --> 01:46:00,660 The stars table has a show ID, which is a number, 2190 01:46:00,660 --> 01:46:02,670 and a person ID, which is a number, which 2191 01:46:02,670 --> 01:46:10,710 allows us via this middleman table, if you will, to link people with TV shows. 2192 01:46:10,710 --> 01:46:16,080 Similarly, the writers table allows us to connect shows with people, too, 2193 01:46:16,080 --> 01:46:18,220 by just recording those numbers. 2194 01:46:18,220 --> 01:46:20,980 So if we go into this data set, let's do the following. 2195 01:46:20,980 --> 01:46:24,340 Let's do select star from people semicolon. 2196 01:46:24,340 --> 01:46:27,030 So a huge amount of data is coming back. 2197 01:46:27,030 --> 01:46:31,480 This is hundreds of thousands of rows now based on the ID numbers alone. 2198 01:46:31,480 --> 01:46:34,320 So this is real-world data now flying across the screen. 2199 01:46:34,320 --> 01:46:38,160 There's a lot of people in the TV show business, not just actors and writers, 2200 01:46:38,160 --> 01:46:39,780 but others as well. 2201 01:46:39,780 --> 01:46:40,680 It's still going. 2202 01:46:40,680 --> 01:46:41,977 There's a lot of data there. 2203 01:46:41,977 --> 01:46:44,310 So my god, if you had to do anything manual in this data 2204 01:46:44,310 --> 01:46:46,660 set it's probably not going to work out very well. 2205 01:46:46,660 --> 01:46:49,590 And actually, we're up to, what, a million people in this data 2206 01:46:49,590 --> 01:46:51,990 set, plus, which would mean this probably isn't even 2207 01:46:51,990 --> 01:46:55,590 going to open very well in Excel, or Google Spreadsheets, or Apple Numbers. 2208 01:46:55,590 --> 01:46:57,700 SQL probably is the better approach here. 2209 01:46:57,700 --> 01:47:00,360 Let's search for someone specific, like select star 2210 01:47:00,360 --> 01:47:06,060 from people, where name equals Steve Carell, for instance, sticking 2211 01:47:06,060 --> 01:47:06,702 with comedies. 2212 01:47:06,702 --> 01:47:08,160 All right, so there's Steve Carell. 2213 01:47:08,160 --> 01:47:14,100 He is person number 136,797, born in 1962. 2214 01:47:14,100 --> 01:47:16,540 And that's as much data as we have on Steve Carell here. 2215 01:47:16,540 --> 01:47:19,210 How do we figure out what shows, for instance, he's in? 2216 01:47:19,210 --> 01:47:23,500 Well, let's see, select star from shows, semicolon. 2217 01:47:23,500 --> 01:47:27,180 There's a crazy number of shows out there in the IMDb database. 2218 01:47:27,180 --> 01:47:30,150 And you can see it here again flying across the screen. 2219 01:47:30,150 --> 01:47:33,630 Feels like we're going to have to employ some techniques in order 2220 01:47:33,630 --> 01:47:37,060 to get at all of Steve Carell's shows. 2221 01:47:37,060 --> 01:47:39,220 So how are we going to do that? 2222 01:47:39,220 --> 01:47:41,730 Well, god, this is a lot of data here. 2223 01:47:41,730 --> 01:47:45,120 And in fact, yeah, we have, what, 15 million 2224 01:47:45,120 --> 01:47:47,280 shows plus in this data set, too. 2225 01:47:47,280 --> 01:47:50,340 So doing things efficiently is now going to start to matter. 2226 01:47:50,340 --> 01:47:51,790 So let's actually do this. 2227 01:47:51,790 --> 01:47:53,460 Let me select a specific show. 2228 01:47:53,460 --> 01:47:58,590 Select star from shows where title equals, quote, unquote, "The Office." 2229 01:47:58,590 --> 01:48:00,920 And there presumably shouldn't be typos in this data 2230 01:48:00,920 --> 01:48:03,470 because it comes from the real website IMDb.com. 2231 01:48:03,470 --> 01:48:05,000 Let's get back to show. 2232 01:48:05,000 --> 01:48:08,210 Turns out there's been a lot of The Offices out in the world. 2233 01:48:08,210 --> 01:48:12,170 The one that started in 2005 is the one that we want, 2234 01:48:12,170 --> 01:48:14,990 presumably the most popular with 188 episodes. 2235 01:48:14,990 --> 01:48:16,220 How can we get just that? 2236 01:48:16,220 --> 01:48:21,650 Maybe we could do and year equals, how about 2005? 2237 01:48:21,650 --> 01:48:25,250 All right, so now we've got back just the ID of The Office 2238 01:48:25,250 --> 01:48:26,210 that we care about. 2239 01:48:26,210 --> 01:48:27,230 And let's do this, too. 2240 01:48:27,230 --> 01:48:30,110 Let me turn on a timer within SQLite just 2241 01:48:30,110 --> 01:48:31,790 to get a sense of running time now. 2242 01:48:31,790 --> 01:48:33,110 Let me do that again. 2243 01:48:33,110 --> 01:48:36,050 Select star from shows, where title equals The Office, 2244 01:48:36,050 --> 01:48:37,808 and year equals 2005. 2245 01:48:37,808 --> 01:48:38,850 And let's keep it simple. 2246 01:48:38,850 --> 01:48:40,370 Let's just do titles for now. 2247 01:48:40,370 --> 01:48:41,450 Enter. 2248 01:48:41,450 --> 01:48:43,460 All right, so not terribly long. 2249 01:48:43,460 --> 01:48:46,990 It found it pretty fast, but it looks like it took how much real time? 2250 01:48:46,990 --> 01:48:50,010 0.02 seconds, not bad for just a title. 2251 01:48:50,010 --> 01:48:53,210 But just to plant a seed, it turns out that we can probably 2252 01:48:53,210 --> 01:48:54,200 speed even this up. 2253 01:48:54,200 --> 01:48:54,930 Let me do this. 2254 01:48:54,930 --> 01:48:58,220 Let me create something called an index, which is another use of the C 2255 01:48:58,220 --> 01:49:00,170 in CRUD for creating something. 2256 01:49:00,170 --> 01:49:02,810 And I'm going to call this title index. 2257 01:49:02,810 --> 01:49:07,340 And I'm going to create it on the shows table, 2258 01:49:07,340 --> 01:49:09,273 specifically on the title column. 2259 01:49:09,273 --> 01:49:11,690 And we'll see in a moment what this is going to do for me. 2260 01:49:11,690 --> 01:49:12,920 Enter. 2261 01:49:12,920 --> 01:49:17,130 Took a moment, like 0.349 seconds, to create something called an index. 2262 01:49:17,130 --> 01:49:21,590 But now watch, if I select star from shows searching for The Office again, 2263 01:49:21,590 --> 01:49:24,230 previously it took me 0.021 seconds. 2264 01:49:24,230 --> 01:49:27,680 Not bad, but now, wow. 2265 01:49:27,680 --> 01:49:31,190 Literally no time at all, or so low that it wasn't really measurable. 2266 01:49:31,190 --> 01:49:33,470 And I'll do it again just to get a sense of things. 2267 01:49:33,470 --> 01:49:34,860 Still quite low. 2268 01:49:34,860 --> 01:49:40,020 Now even though 0.021 seconds, not crazy long, imagine now having a lot of data, 2269 01:49:40,020 --> 01:49:42,560 a lot of users running a real website or real mobile app. 2270 01:49:42,560 --> 01:49:45,980 Every millisecond we can start to shave off is going to be compelling. 2271 01:49:45,980 --> 01:49:47,810 So what is it we just did? 2272 01:49:47,810 --> 01:49:51,830 Well, we actually just created something called an index. 2273 01:49:51,830 --> 01:49:53,960 And this is a nice way to tie in, now, some 2274 01:49:53,960 --> 01:49:56,420 of our week 5 discussion of data structures, 2275 01:49:56,420 --> 01:49:58,250 and our week 3 discussion of running times. 2276 01:49:58,250 --> 01:50:01,280 An index in a database is some kind of fancy data 2277 01:50:01,280 --> 01:50:06,110 structure that allows the database to do better than linear search. 2278 01:50:06,110 --> 01:50:10,580 Literally, as you just saw, these tables are crazy long or tall right now, 2279 01:50:10,580 --> 01:50:12,150 very linear, that is. 2280 01:50:12,150 --> 01:50:14,300 And so when I first searched for The Office, 2281 01:50:14,300 --> 01:50:17,780 it was literally doing linear search, top to bottom, looking at as many as, 2282 01:50:17,780 --> 01:50:19,400 what, a million plus rows. 2283 01:50:19,400 --> 01:50:20,870 That's relatively slow. 2284 01:50:20,870 --> 01:50:23,330 It's not that slow, 0.021 seconds. 2285 01:50:23,330 --> 01:50:27,020 But that's relatively slow just theoretically, algorithmically, 2286 01:50:27,020 --> 01:50:28,580 doing anything linearly. 2287 01:50:28,580 --> 01:50:32,090 But if you instead create an index using syntax 2288 01:50:32,090 --> 01:50:37,880 like this, which I just did, creating an index on the title column of the show's 2289 01:50:37,880 --> 01:50:41,220 table, that's like giving the database a clue in advance saying, 2290 01:50:41,220 --> 01:50:44,660 hey, I know I'm going to search on this column in this table a lot. 2291 01:50:44,660 --> 01:50:47,430 Do something with data structures to speed things up. 2292 01:50:47,430 --> 01:50:50,030 And so if you think back to our discussion of data structures, 2293 01:50:50,030 --> 01:50:51,720 maybe it's using a tree. 2294 01:50:51,720 --> 01:50:56,060 Maybe it's using a trie or a hash table, some fancier two-dimensional data 2295 01:50:56,060 --> 01:50:59,930 structure is generally going to lift the data up creating right maybe a tree 2296 01:50:59,930 --> 01:51:00,810 structure. 2297 01:51:00,810 --> 01:51:03,620 So it's just much faster to find data, especially 2298 01:51:03,620 --> 01:51:06,500 if it's sorting it now based on title, and not 2299 01:51:06,500 --> 01:51:08,030 just storing it in one long list. 2300 01:51:08,030 --> 01:51:10,220 And in fact, in the world of relational databases, 2301 01:51:10,220 --> 01:51:12,560 the type of structure that's often used in a database 2302 01:51:12,560 --> 01:51:13,920 is something called a B-tree. 2303 01:51:13,920 --> 01:51:15,170 It's not a binary tree. 2304 01:51:15,170 --> 01:51:19,040 Different use of the letter B, but it looks a little something like the trees 2305 01:51:19,040 --> 01:51:19,820 we've seen. 2306 01:51:19,820 --> 01:51:21,680 It's not binary because some of the nodes 2307 01:51:21,680 --> 01:51:24,890 might have more than two children or fewer, 2308 01:51:24,890 --> 01:51:28,190 but it's a very wide but relatively shallow tree. 2309 01:51:28,190 --> 01:51:29,990 It's not very tall. 2310 01:51:29,990 --> 01:51:33,770 And the upside of that is that if your data is stored in this tree, 2311 01:51:33,770 --> 01:51:36,060 the database can find it more quickly. 2312 01:51:36,060 --> 01:51:41,270 And the reason it took half a second, a third of a second to build the index 2313 01:51:41,270 --> 01:51:45,260 is because SQLite needed to take some non-zero amount of time 2314 01:51:45,260 --> 01:51:47,630 to just build up this tree in memory. 2315 01:51:47,630 --> 01:51:51,900 And it has algorithms for doing so based on alphabetization or other techniques. 2316 01:51:51,900 --> 01:51:55,340 But you spend a bit of time up front, a third of a second. 2317 01:51:55,340 --> 01:51:57,470 And then thereafter, wow. 2318 01:51:57,470 --> 01:52:00,470 Every subsequent query, if I keep doing it again and again, 2319 01:52:00,470 --> 01:52:04,040 is going to be crazy low, 0.000, maybe 0.001. 2320 01:52:04,040 --> 01:52:08,240 But an order of magnitude, a factor of 10 or 100 faster than it 2321 01:52:08,240 --> 01:52:10,790 previously was earlier. 2322 01:52:10,790 --> 01:52:14,360 So we have these indexes which allow us to get at data faster. 2323 01:52:14,360 --> 01:52:17,330 But what if we want to actually get data that's 2324 01:52:17,330 --> 01:52:19,370 now across these multiple tables? 2325 01:52:19,370 --> 01:52:20,240 How can we do that? 2326 01:52:20,240 --> 01:52:23,060 And how might these indices or indexes help further? 2327 01:52:23,060 --> 01:52:26,900 Well, it turns out there is a way that we've seen already 2328 01:52:26,900 --> 01:52:29,510 indirectly to join two tables together. 2329 01:52:29,510 --> 01:52:33,410 Previously, when I selected the ID of The Office, 2330 01:52:33,410 --> 01:52:37,740 and then I searched for it in the other table using select in a nested query, 2331 01:52:37,740 --> 01:52:40,410 I was joining two tables together. 2332 01:52:40,410 --> 01:52:42,900 And it turns out there's a couple of ways to do this. 2333 01:52:42,900 --> 01:52:46,550 Let's go ahead now and, for instance, find all of Steve Carell's TV shows. 2334 01:52:46,550 --> 01:52:48,680 Not just The Office but all of them, too. 2335 01:52:48,680 --> 01:52:56,315 Unfortunately, if we look at our schema, shows up here have no mention of TV-- 2336 01:52:56,315 --> 01:53:01,860 oh, shows over here has no mention of the TV stars in them. 2337 01:53:01,860 --> 01:53:05,130 And people have no mention of shows. 2338 01:53:05,130 --> 01:53:09,360 We somehow need to use this table here to connect the two. 2339 01:53:09,360 --> 01:53:14,820 And this is called a join table, in the sense that using two integer columns-- 2340 01:53:14,820 --> 01:53:17,790 it joins the two tables together logically. 2341 01:53:17,790 --> 01:53:21,750 And so if you're savvy enough with SQL, you can do what I did with my hands 2342 01:53:21,750 --> 01:53:26,010 earlier and like recombine tables by using these common IDs, 2343 01:53:26,010 --> 01:53:27,460 these integers together. 2344 01:53:27,460 --> 01:53:28,330 So let me do this. 2345 01:53:28,330 --> 01:53:32,730 Let me go ahead and figure out, step-by-step, Steve Carell's shows. 2346 01:53:32,730 --> 01:53:33,970 So how am I going to do this? 2347 01:53:33,970 --> 01:53:39,120 Well, if I select star from people, where name equals Steve Carell, 2348 01:53:39,120 --> 01:53:40,840 fortunately, there's only one of them. 2349 01:53:40,840 --> 01:53:46,800 So this gives me back his name, his ID, and his birth year. 2350 01:53:46,800 --> 01:53:48,960 But it's really only his ID that I care about. 2351 01:53:48,960 --> 01:53:49,680 Why? 2352 01:53:49,680 --> 01:53:55,500 Because in order to get back his shows, I need to link person ID with show ID. 2353 01:53:55,500 --> 01:53:57,640 So I need to know his ID number. 2354 01:53:57,640 --> 01:53:59,590 So what could I do with this? 2355 01:53:59,590 --> 01:54:04,230 Well, remember the schema and the stars table. 2356 01:54:04,230 --> 01:54:07,830 I've just gotten, from the people table, Steve Carell's ID. 2357 01:54:07,830 --> 01:54:13,530 I bet by transitivity I could now use his person ID, his ID, 2358 01:54:13,530 --> 01:54:15,900 to get back all of his show IDs. 2359 01:54:15,900 --> 01:54:19,170 And then once I've got all of his show IDs, I can take it one step further 2360 01:54:19,170 --> 01:54:21,330 and get back all of his shows' titles. 2361 01:54:21,330 --> 01:54:25,440 So the answer is actually English words and not just random, seemingly, 2362 01:54:25,440 --> 01:54:26,290 integers. 2363 01:54:26,290 --> 01:54:27,640 So let me go ahead and do this. 2364 01:54:27,640 --> 01:54:31,740 Let me, again, get Steve Carell's ID number, but not star. 2365 01:54:31,740 --> 01:54:33,060 Star represents everything. 2366 01:54:33,060 --> 01:54:35,610 It's a wildcard character in SQL. 2367 01:54:35,610 --> 01:54:38,310 Let me just select the ID of Steve Carell. 2368 01:54:38,310 --> 01:54:41,640 And that gives me back 136,797. 2369 01:54:41,640 --> 01:54:43,440 And it's only giving me back one value. 2370 01:54:43,440 --> 01:54:46,710 The thing called ID is just the column heading up above. 2371 01:54:46,710 --> 01:54:51,360 Now, suppose I want to select all of the show IDs 2372 01:54:51,360 --> 01:54:53,460 that Steve Carell is affiliated with. 2373 01:54:53,460 --> 01:55:00,570 Let me select Show ID from stars, where the person ID in stars 2374 01:55:00,570 --> 01:55:03,480 happens to equal Steve Carell's ID. 2375 01:55:03,480 --> 01:55:07,320 So again, I'm building up my answer in reverse and taking these baby steps. 2376 01:55:07,320 --> 01:55:11,430 On the right, in parentheses, I'm getting Steve Carell's ID. 2377 01:55:11,430 --> 01:55:15,330 On the left, I am now selecting all of the show IDs 2378 01:55:15,330 --> 01:55:19,410 that have some connection with that person ID in the stars table. 2379 01:55:19,410 --> 01:55:21,850 This answer, too, is not going to be that illuminating. 2380 01:55:21,850 --> 01:55:25,420 It's just a whole bunch of integers that have no meaning to me as a human. 2381 01:55:25,420 --> 01:55:27,770 But let's take this one step further. 2382 01:55:27,770 --> 01:55:29,520 And even though my code is getting long, I 2383 01:55:29,520 --> 01:55:31,920 could hit Enter and format it nicely, especially 2384 01:55:31,920 --> 01:55:33,660 if I were doing this in a code file. 2385 01:55:33,660 --> 01:55:35,580 But I'm just doing it interactively for now. 2386 01:55:35,580 --> 01:55:39,420 Let's now select all of the titles from the shows table, 2387 01:55:39,420 --> 01:55:48,100 where the ID of the show is in this following previous query. 2388 01:55:48,100 --> 01:55:49,740 So again, the query is getting long. 2389 01:55:49,740 --> 01:55:52,200 But notice, it's the third and last step. 2390 01:55:52,200 --> 01:55:55,950 Select title from the shows table, where the ID of the show 2391 01:55:55,950 --> 01:55:58,590 is in the list of all of the show IDs that 2392 01:55:58,590 --> 01:56:02,040 came back from the stars table searching for Steve Carell's person ID. 2393 01:56:02,040 --> 01:56:03,480 How did we get that person ID? 2394 01:56:03,480 --> 01:56:04,680 Let me scroll to the end. 2395 01:56:04,680 --> 01:56:11,160 Well, I selected, in my innermost parentheses, Steve Carell's own ID. 2396 01:56:11,160 --> 01:56:13,440 So now, when I hit Enter, voila. 2397 01:56:13,440 --> 01:56:16,525 I get all of Steve Carell's TV shows up until now. 2398 01:56:16,525 --> 01:56:19,650 And if I want to tidy this up further, I can use the same tricks as before. 2399 01:56:19,650 --> 01:56:22,180 Order by title, semicolon. 2400 01:56:22,180 --> 01:56:25,540 Now I've got it all alphabetized as before. 2401 01:56:25,540 --> 01:56:28,080 So again, with SQL comes the ability to search-- 2402 01:56:28,080 --> 01:56:31,290 I mean, look how quickly we do this, 0.094 seconds 2403 01:56:31,290 --> 01:56:34,650 to search across three different tables to get back this answer. 2404 01:56:34,650 --> 01:56:38,820 But my data is now all neatly designed in individual tables, 2405 01:56:38,820 --> 01:56:42,000 which is going to be important now that the data set is so large. 2406 01:56:42,000 --> 01:56:44,340 But let me take this one step further. 2407 01:56:44,340 --> 01:56:46,930 Let me go ahead and do this. 2408 01:56:46,930 --> 01:56:51,580 Let me go ahead and point out that with this query, 2409 01:56:51,580 --> 01:56:54,870 notice that I'm searching on-- 2410 01:56:54,870 --> 01:56:58,710 let's say I'm searching on a person ID here. 2411 01:56:58,710 --> 01:57:02,410 And at the end here, I'm searching on a name column here. 2412 01:57:02,410 --> 01:57:05,230 So let me actually go ahead and do this. 2413 01:57:05,230 --> 01:57:09,510 Let me go ahead and see if we can't speed this up. 2414 01:57:09,510 --> 01:57:13,090 This query at the moment takes 0.092 seconds. 2415 01:57:13,090 --> 01:57:15,930 Let's see if we can't speed this up further by just quickly creating 2416 01:57:15,930 --> 01:57:18,930 a few more of those B-trees in the databases memory. 2417 01:57:18,930 --> 01:57:24,240 Create an index called person index, and I'm going to do this on the stars table 2418 01:57:24,240 --> 01:57:25,950 on the person ID column. 2419 01:57:25,950 --> 01:57:26,850 Enter. 2420 01:57:26,850 --> 01:57:28,488 It's taking a moment, taking a moment. 2421 01:57:28,488 --> 01:57:30,780 That's almost a full second because that's a big table. 2422 01:57:30,780 --> 01:57:35,050 Let's create another index called show index on the stars table. 2423 01:57:35,050 --> 01:57:35,550 Why? 2424 01:57:35,550 --> 01:57:37,950 Because I want to search by the show ID also. 2425 01:57:37,950 --> 01:57:39,780 That was part of my big query. 2426 01:57:39,780 --> 01:57:40,660 Takes a moment. 2427 01:57:40,660 --> 01:57:43,810 OK, just more than about 2/3 of a second. 2428 01:57:43,810 --> 01:57:46,440 Now let's create one last one, another index called name index, 2429 01:57:46,440 --> 01:57:49,160 but I could call these things anything I want, on the people table. 2430 01:57:49,160 --> 01:57:49,560 Why? 2431 01:57:49,560 --> 01:57:51,550 Because I'm also searching on the name column. 2432 01:57:51,550 --> 01:57:53,730 So in short, I'm creating indexes on each 2433 01:57:53,730 --> 01:57:57,450 of the columns that are somehow involved in my search query, 2434 01:57:57,450 --> 01:57:59,680 going from one table to the other. 2435 01:57:59,680 --> 01:58:07,590 Now let's go back to the previous query, which, recall, took-- 2436 01:58:07,590 --> 01:58:10,270 I think I erased it, 0.091. 2437 01:58:10,270 --> 01:58:10,770 All right. 2438 01:58:10,770 --> 01:58:12,645 Well, it was roughly this order of magnitude. 2439 01:58:12,645 --> 01:58:14,140 We're not seeing the data now. 2440 01:58:14,140 --> 01:58:17,160 But let me go ahead and run my original big query once. 2441 01:58:17,160 --> 01:58:20,380 And boom, we're down to almost nothing. 2442 01:58:20,380 --> 01:58:22,860 So again, creating these indexes in memory 2443 01:58:22,860 --> 01:58:27,640 has the effect of rapidly speeding up our computation time. 2444 01:58:27,640 --> 01:58:31,140 Now if you've ever used, for instance, the my.harvard course shopping tool 2445 01:58:31,140 --> 01:58:35,130 here on campus, or Yale's analogue, you might wonder, why is the thing so slow? 2446 01:58:35,130 --> 01:58:39,330 This could be one of the reasons why large data sets with thousands of rows, 2447 01:58:39,330 --> 01:58:41,670 thousands of courses tend to be slow, if, 2448 01:58:41,670 --> 01:58:44,880 and I'm only conjecturing, if the database isn't properly indexed. 2449 01:58:44,880 --> 01:58:46,890 If you're building your own web application 2450 01:58:46,890 --> 01:58:49,170 and you're finding that users are waiting and waiting, 2451 01:58:49,170 --> 01:58:52,290 and things are spinning and spinning, what might be among the problems? 2452 01:58:52,290 --> 01:58:55,770 Well, it could absolutely just be bad algorithms and bad code that you wrote. 2453 01:58:55,770 --> 01:58:58,270 Or it might be that you haven't thought about, well, 2454 01:58:58,270 --> 01:59:01,770 what column should be optimized for searches and filtration 2455 01:59:01,770 --> 01:59:05,705 like I've done here in order to speed up subsequent queries? 2456 01:59:05,705 --> 01:59:07,830 Again, from the outside in, we can only conjecture. 2457 01:59:07,830 --> 01:59:11,580 But ultimately, this is just one of the things that 2458 01:59:11,580 --> 01:59:14,308 explains performance problems as well. 2459 01:59:14,308 --> 01:59:17,100 All right, let's point out just a couple of final syntactic things, 2460 01:59:17,100 --> 01:59:19,770 and then we'll consider, bigger picture, some problems 2461 01:59:19,770 --> 01:59:22,110 that might arise in this world. 2462 01:59:22,110 --> 01:59:26,880 If these nested, nested queries start to get a little much, 2463 01:59:26,880 --> 01:59:29,160 there are other ways, just so you've seen it, 2464 01:59:29,160 --> 01:59:31,920 that you can execute similar logic in SQL. 2465 01:59:31,920 --> 01:59:34,410 For instance, if I know in advance that I 2466 01:59:34,410 --> 01:59:39,390 want to connect Steve Carell to his show IDs and to their titles, 2467 01:59:39,390 --> 01:59:41,190 we can do something more like this. 2468 01:59:41,190 --> 01:59:52,050 Select title from the people table, joined with the stars table on people 2469 01:59:52,050 --> 01:59:54,900 ID equals stars.personID. 2470 01:59:54,900 --> 01:59:55,860 So what am I doing? 2471 01:59:55,860 --> 01:59:56,490 New syntax. 2472 01:59:56,490 --> 01:59:59,782 And again, this is not something you'll have to memorize or ingrain right away. 2473 01:59:59,782 --> 02:00:04,080 But just so you've seen other approaches, select title from people 2474 02:00:04,080 --> 02:00:05,160 join stars. 2475 02:00:05,160 --> 02:00:09,690 This is an explicit way to say, take the people table in one hand, the stars 2476 02:00:09,690 --> 02:00:12,030 table in the other hand, and somehow join them 2477 02:00:12,030 --> 02:00:13,990 as I keep doing with my fingertips here. 2478 02:00:13,990 --> 02:00:15,610 How to join them? 2479 02:00:15,610 --> 02:00:20,370 Join them so that the people, the ID column in the people table lines up 2480 02:00:20,370 --> 02:00:23,260 with the person ID in the stars table. 2481 02:00:23,260 --> 02:00:25,260 But that's not quite everything. 2482 02:00:25,260 --> 02:00:28,740 I could also say, join further on the shows table, 2483 02:00:28,740 --> 02:00:35,290 where the stars show ID equals the shows ID column. 2484 02:00:35,290 --> 02:00:36,540 So what am I doing here? 2485 02:00:36,540 --> 02:00:45,990 That's saying, go further and join the stars table with the show's table, 2486 02:00:45,990 --> 02:00:48,990 joining the show ID column with the ID column. 2487 02:00:48,990 --> 02:00:51,750 Again, this starts to get a little messy to think about. 2488 02:00:51,750 --> 02:00:55,830 But now I can just say, where name equals, quote, unquote, "Steve Carell." 2489 02:00:55,830 --> 02:00:59,070 I can do in one query what previously took me three nested queries 2490 02:00:59,070 --> 02:01:00,600 and get back the same answers. 2491 02:01:00,600 --> 02:01:05,380 And I can still add in my order by title to get back the result. 2492 02:01:05,380 --> 02:01:09,780 And if I do this a little more neatly, let me type this out a little 2493 02:01:09,780 --> 02:01:10,750 differently. 2494 02:01:10,750 --> 02:01:16,620 Let me type this out by adding a new line-- ah, I can't do that here. 2495 02:01:16,620 --> 02:01:18,120 I'm going to leave it alone for now. 2496 02:01:18,120 --> 02:01:20,700 We can type it on multiple lines in other contexts. 2497 02:01:20,700 --> 02:01:23,700 And let me do one last thing. 2498 02:01:23,700 --> 02:01:25,012 Do I want to show that? 2499 02:01:25,012 --> 02:01:26,970 I'm going to show it, but this is not something 2500 02:01:26,970 --> 02:01:28,440 you should ingrain just yet either. 2501 02:01:28,440 --> 02:01:31,620 Select title from people, stars, and shows. 2502 02:01:31,620 --> 02:01:34,860 If you know in advance that you want to do something with all three tables, 2503 02:01:34,860 --> 02:01:38,130 you can just enumerate them, one table name after the other. 2504 02:01:38,130 --> 02:01:43,497 And then you can say where people.ID equals stars.personID. 2505 02:01:43,497 --> 02:01:45,330 And now I'm hitting Enter so that it formats 2506 02:01:45,330 --> 02:01:47,070 a little more readably on my screen. 2507 02:01:47,070 --> 02:01:55,140 And stars.showID equals shows.ID, and lastly, name equals Steve Carell. 2508 02:01:55,140 --> 02:01:59,730 In short, you specify that you want to select data from all three 2509 02:01:59,730 --> 02:02:00,690 of these tables. 2510 02:02:00,690 --> 02:02:06,450 And then you tell the database how to combine foreign keys with primary keys, 2511 02:02:06,450 --> 02:02:09,690 that is, the columns that have those integers in common. 2512 02:02:09,690 --> 02:02:12,720 If I hit Enter now, I get the same exact results, ever 2513 02:02:12,720 --> 02:02:16,110 more so if I also add in an order by title. 2514 02:02:16,110 --> 02:02:17,770 Oops. 2515 02:02:17,770 --> 02:02:18,270 All right. 2516 02:02:18,270 --> 02:02:20,190 That's why I didn't want to do this earlier. 2517 02:02:20,190 --> 02:02:23,190 I have to go back through my history multiple times to actually get back 2518 02:02:23,190 --> 02:02:24,640 the multi-line query this time. 2519 02:02:24,640 --> 02:02:25,140 All right. 2520 02:02:25,140 --> 02:02:27,280 That was a lot all at once. 2521 02:02:27,280 --> 02:02:31,365 But this is only to say that, even as we make the design of the data 2522 02:02:31,365 --> 02:02:33,990 more sophisticated, and we put some of it over here, some of it 2523 02:02:33,990 --> 02:02:38,130 over here, some of it over here so as to avoid duplication of data, weird hacks 2524 02:02:38,130 --> 02:02:41,910 like putting commas in the data, we can still get back all of the answers 2525 02:02:41,910 --> 02:02:44,280 that we might want across these several tables. 2526 02:02:44,280 --> 02:02:48,580 And using indexes, we can significantly speed up these processes 2527 02:02:48,580 --> 02:02:52,140 so as to handle 10 times as many, a 100 times as many users 2528 02:02:52,140 --> 02:02:53,670 on the same actual database. 2529 02:02:53,670 --> 02:02:55,020 There is going to be a downside. 2530 02:02:55,020 --> 02:02:57,540 And thinking back to our discussion of algorithms and data 2531 02:02:57,540 --> 02:03:02,120 structures in past weeks, what might be a downside of creating these indexes? 2532 02:03:02,120 --> 02:03:06,110 Because as of now, I created four separate indexes on the name column, 2533 02:03:06,110 --> 02:03:09,560 the title column, and some other columns, too. 2534 02:03:09,560 --> 02:03:11,930 Why wouldn't I just go ahead and index everything 2535 02:03:11,930 --> 02:03:14,390 if it's clearly speeding things up? 2536 02:03:14,390 --> 02:03:15,770 Memory, so space. 2537 02:03:15,770 --> 02:03:18,890 Any time you're starting to benefit time wise in computer science, 2538 02:03:18,890 --> 02:03:21,950 odds are you're sacrificing space, or vice versa. 2539 02:03:21,950 --> 02:03:25,400 And probably indexing absolutely everything is a little dumb 2540 02:03:25,400 --> 02:03:29,430 because you're going to waste way more space than you might actually need. 2541 02:03:29,430 --> 02:03:31,610 So figuring out where the right inflection point is 2542 02:03:31,610 --> 02:03:36,410 is part of the process of designing and just getting better at these things. 2543 02:03:36,410 --> 02:03:40,910 Now unfortunately, a whole lot of things can go wrong in this world, 2544 02:03:40,910 --> 02:03:44,870 and they continue to in the real world with people using SQL databases. 2545 02:03:44,870 --> 02:03:46,760 And in fact, here on out, if you're reading 2546 02:03:46,760 --> 02:03:51,530 something technical about SQL databases, and websites being hacked in some form, 2547 02:03:51,530 --> 02:03:54,660 and passwords leaking out, unfortunately, all too often 2548 02:03:54,660 --> 02:03:57,505 it is because of what are called SQL injection attacks. 2549 02:03:57,505 --> 02:03:59,630 And just to give you a sense now to counterbalance, 2550 02:03:59,630 --> 02:04:01,588 maybe [INAUDIBLE] enthusiasm for like, oh, that 2551 02:04:01,588 --> 02:04:03,620 was neat how we can do things so quickly. 2552 02:04:03,620 --> 02:04:06,680 With great power comes responsibility in this world, too. 2553 02:04:06,680 --> 02:04:09,320 And so many people introduce bugs into their code 2554 02:04:09,320 --> 02:04:17,160 by not quite appreciating how it is the data is getting into your application. 2555 02:04:17,160 --> 02:04:18,390 So what do I mean by that? 2556 02:04:18,390 --> 02:04:21,200 Here, for instance, is a typical login screen for Yale. 2557 02:04:21,200 --> 02:04:23,780 And here's the analogue for Harvard where you're prompted, 2558 02:04:23,780 --> 02:04:26,450 every day probably, for your username and your password, 2559 02:04:26,450 --> 02:04:28,460 your email address and your password here. 2560 02:04:28,460 --> 02:04:32,420 Suppose, though, that behind this login page, 2561 02:04:32,420 --> 02:04:35,030 whether Harvard's or Yale's, there's some website. 2562 02:04:35,030 --> 02:04:38,270 And that website is using SQL underneath the hood 2563 02:04:38,270 --> 02:04:40,700 to store all of the Harvard or Yale people's 2564 02:04:40,700 --> 02:04:43,940 usernames, passwords, ID numbers, courses, transcripts, 2565 02:04:43,940 --> 02:04:44,970 all of that stuff. 2566 02:04:44,970 --> 02:04:47,480 So there's a SQL database underneath the website. 2567 02:04:47,480 --> 02:04:50,360 Well, what might go wrong with this process? 2568 02:04:50,360 --> 02:04:52,850 Unfortunately, there's some special syntax in SQL 2569 02:04:52,850 --> 02:04:54,530 just like there is in C and Python. 2570 02:04:54,530 --> 02:04:56,960 For instance, there are comments in SQL, too. 2571 02:04:56,960 --> 02:05:00,680 If you do two hyphens, dash, dash, that's a comment in SQL. 2572 02:05:00,680 --> 02:05:06,170 And if you, the programmer, aren't sufficiently distrustful of your users, 2573 02:05:06,170 --> 02:05:09,560 such that you defend against potentially adversarial attacks, 2574 02:05:09,560 --> 02:05:11,160 you might do something like this. 2575 02:05:11,160 --> 02:05:16,070 Suppose that I somewhat maliciously or curiously log in 2576 02:05:16,070 --> 02:05:19,130 by typing my username, Malan@harvard.edu, and then maybe 2577 02:05:19,130 --> 02:05:20,990 a single quote and a dash, dash. 2578 02:05:20,990 --> 02:05:21,680 Why? 2579 02:05:21,680 --> 02:05:24,860 Because I'm trying to suss out if there is a vulnerability here 2580 02:05:24,860 --> 02:05:26,420 to a SQL injection attack. 2581 02:05:26,420 --> 02:05:27,910 Do not do this in general. 2582 02:05:27,910 --> 02:05:31,160 But if I were the owner of the website trying to see if I've made any mistake, 2583 02:05:31,160 --> 02:05:35,300 I might try using potentially dangerous characters in my input. 2584 02:05:35,300 --> 02:05:36,290 Dangerous how? 2585 02:05:36,290 --> 02:05:40,340 Because single quote is used for quoting things in SQL, as we've seen-- 2586 02:05:40,340 --> 02:05:41,810 single quotes or double quotes. 2587 02:05:41,810 --> 02:05:44,930 Dash, dash, I claim now, is used for commenting. 2588 02:05:44,930 --> 02:05:47,960 But let's now imagine what the code underneath the hood 2589 02:05:47,960 --> 02:05:52,160 might be for something like Yale's login or Harvard's login. 2590 02:05:52,160 --> 02:05:54,600 What if it's code that looks like this? 2591 02:05:54,600 --> 02:05:56,540 So let me read it from left to right. 2592 02:05:56,540 --> 02:06:00,710 Suppose that they are using something like CS50's own execute function, 2593 02:06:00,710 --> 02:06:03,230 and they've got some SQL typed into the website that 2594 02:06:03,230 --> 02:06:07,160 says select star from users, where username equals this, 2595 02:06:07,160 --> 02:06:09,050 and password equals that. 2596 02:06:09,050 --> 02:06:12,510 And they're plugging in username and password. 2597 02:06:12,510 --> 02:06:13,607 So what am I doing here? 2598 02:06:13,607 --> 02:06:16,190 Well, when the user types their username password, hits Enter, 2599 02:06:16,190 --> 02:06:18,920 I probably want to select that user from my database 2600 02:06:18,920 --> 02:06:21,020 to see if the username and passwords match. 2601 02:06:21,020 --> 02:06:23,720 So the underlying SQL might be, select star 2602 02:06:23,720 --> 02:06:25,790 from users, where username equals question mark, 2603 02:06:25,790 --> 02:06:27,207 and password equals question mark. 2604 02:06:27,207 --> 02:06:28,100 Users is the table. 2605 02:06:28,100 --> 02:06:29,510 One column is username. 2606 02:06:29,510 --> 02:06:31,400 One column is password. 2607 02:06:31,400 --> 02:06:32,000 All right. 2608 02:06:32,000 --> 02:06:37,430 And if we get back one row, presumably Malan@harvard.edu 2609 02:06:37,430 --> 02:06:38,970 exists with that password. 2610 02:06:38,970 --> 02:06:41,190 We should let him proceed from there on out. 2611 02:06:41,190 --> 02:06:45,140 So that's some pseudo code, if you will, for this scenario. 2612 02:06:45,140 --> 02:06:49,580 What if, though, this code is not as well written as it currently 2613 02:06:49,580 --> 02:06:51,500 is, and isn't using question marks? 2614 02:06:51,500 --> 02:06:54,757 So the question mark syntax is a fairly common SQL thing, 2615 02:06:54,757 --> 02:06:56,840 where the question marks are used as placeholders, 2616 02:06:56,840 --> 02:06:59,390 just like in printf, percent S was. 2617 02:06:59,390 --> 02:07:02,900 But this function, db.execute from CS50's library 2618 02:07:02,900 --> 02:07:05,420 and third-party libraries as well, is also 2619 02:07:05,420 --> 02:07:07,790 doing some good stuff with these question marks, 2620 02:07:07,790 --> 02:07:09,830 and defending against the following attack. 2621 02:07:09,830 --> 02:07:12,920 Suppose that you were not using a third-party library like ours 2622 02:07:12,920 --> 02:07:16,490 and you were just manually constructing your SQL queries like this. 2623 02:07:16,490 --> 02:07:19,940 You were to do something like this instead using an f-string in Python. 2624 02:07:19,940 --> 02:07:21,800 You're comfortable with format strings now. 2625 02:07:21,800 --> 02:07:24,883 You've gotten into the habit of using curly braces and plugging in values. 2626 02:07:24,883 --> 02:07:27,230 Suppose that you, the aspiring programmer, 2627 02:07:27,230 --> 02:07:29,660 is just using techniques that you've been taught. 2628 02:07:29,660 --> 02:07:32,660 So you have an f-string with select star from users, 2629 02:07:32,660 --> 02:07:36,470 where username equals, quote, unquote, "username" in curly braces. 2630 02:07:36,470 --> 02:07:41,270 And password equals, quote, unquote, "password" in curly braces. 2631 02:07:41,270 --> 02:07:44,270 As of what, two weeks ago, this was perfectly 2632 02:07:44,270 --> 02:07:49,460 legitimate technique in Python to plug in values into a string. 2633 02:07:49,460 --> 02:07:53,630 But notice if you are using single quotes yourself 2634 02:07:53,630 --> 02:07:58,750 and the user has typed in single quotes to their input, what 2635 02:07:58,750 --> 02:07:59,920 could go wrong here? 2636 02:07:59,920 --> 02:08:04,510 Where are we going with this if you're just blindly plugging user input 2637 02:08:04,510 --> 02:08:08,350 into your own prepared string of text? 2638 02:08:08,350 --> 02:08:09,230 Yeah? 2639 02:08:09,230 --> 02:08:14,725 AUDIENCE: [INAUDIBLE] 2640 02:08:14,725 --> 02:08:15,600 DAVID J. MALAN: Yeah. 2641 02:08:15,600 --> 02:08:20,670 Worst case, they could insert what is actually SQL code into your database 2642 02:08:20,670 --> 02:08:21,810 as follows. 2643 02:08:21,810 --> 02:08:25,140 Generally speaking, if you're using special syntax like single quotes 2644 02:08:25,140 --> 02:08:27,000 to surround the user's input, you'd better 2645 02:08:27,000 --> 02:08:29,250 hope that they don't have an apostrophe in their name. 2646 02:08:29,250 --> 02:08:31,875 Or you better hope that they don't type a single quote as well. 2647 02:08:31,875 --> 02:08:36,025 Because what if their single quote finishes your single quote instead, 2648 02:08:36,025 --> 02:08:37,900 and then the rest of this is somehow ignored? 2649 02:08:37,900 --> 02:08:39,330 Well, let's consider how this might happen. 2650 02:08:39,330 --> 02:08:40,510 Let me go ahead in here. 2651 02:08:40,510 --> 02:08:42,840 This got a little blurry here, but let me 2652 02:08:42,840 --> 02:08:44,850 plug in here-- wow, that looks awful. 2653 02:08:44,850 --> 02:08:47,940 Let me fix the red. 2654 02:08:47,940 --> 02:08:50,430 Just change this to white so it's more readable. 2655 02:08:50,430 --> 02:08:56,730 What happens if the user does this instead? 2656 02:08:56,730 --> 02:08:59,310 They type in, like I did into the screenshot, 2657 02:08:59,310 --> 02:09:02,760 'Malan@harvard.edu,' single quote, dash, dash. 2658 02:09:02,760 --> 02:09:05,070 What has just happened logically, even though we've 2659 02:09:05,070 --> 02:09:06,900 only just begun with SQL today? 2660 02:09:06,900 --> 02:09:12,150 Well, select star from users, where username equals Malan@harvard.edu, end 2661 02:09:12,150 --> 02:09:13,320 quote. 2662 02:09:13,320 --> 02:09:17,340 What's bad about the rest of this? 2663 02:09:17,340 --> 02:09:19,752 Dash, dash, I claim, means a comment, which 2664 02:09:19,752 --> 02:09:22,210 means my color coding is going to be a little blurry again. 2665 02:09:22,210 --> 02:09:25,140 But everything after the dash, dash is just ignored. 2666 02:09:25,140 --> 02:09:27,440 The logic, then, of the SQL query, then, is 2667 02:09:27,440 --> 02:09:30,760 to just say, select Malan@harvard.edu from the database, 2668 02:09:30,760 --> 02:09:33,490 not even checking the password anymore. 2669 02:09:33,490 --> 02:09:36,180 Therefore, you will get back at least one row. 2670 02:09:36,180 --> 02:09:41,190 So length of rows will equal 1, and so presumably the rest of the pseudo code 2671 02:09:41,190 --> 02:09:44,190 logs the user in, gives them access to my my.harvard account, 2672 02:09:44,190 --> 02:09:45,150 or whatever it is. 2673 02:09:45,150 --> 02:09:50,640 And they've pretended to be me simply by using a single quote and a dash, 2674 02:09:50,640 --> 02:09:52,445 dash in the username field. 2675 02:09:52,445 --> 02:09:54,570 Again, please don't go start doing this later today 2676 02:09:54,570 --> 02:09:56,140 on Harvard, Yale, or other websites. 2677 02:09:56,140 --> 02:09:57,670 But it could be as simple as that. 2678 02:09:57,670 --> 02:09:58,170 Why? 2679 02:09:58,170 --> 02:10:00,030 Because the programmer practiced what they 2680 02:10:00,030 --> 02:10:04,110 were taught, which was just to use curly braces to plug in, 2681 02:10:04,110 --> 02:10:05,560 in f-strings, values. 2682 02:10:05,560 --> 02:10:08,590 But if you don't understand how the user's input is going to be used, 2683 02:10:08,590 --> 02:10:12,255 and if you don't distrust your users fundamentally, for every good person 2684 02:10:12,255 --> 02:10:14,130 out there there's going to be, unfortunately, 2685 02:10:14,130 --> 02:10:19,380 some adversary who just wants to try to find fault in your code or hack 2686 02:10:19,380 --> 02:10:20,490 into your data set. 2687 02:10:20,490 --> 02:10:22,980 This is what's known as a SQL injection attack, 2688 02:10:22,980 --> 02:10:27,000 because the user can type something that happens to be or look like SQL, 2689 02:10:27,000 --> 02:10:31,110 and trick your database into doing something it didn't intend to, 2690 02:10:31,110 --> 02:10:34,830 like, for instance, logging the user in. 2691 02:10:34,830 --> 02:10:36,880 Worst case, they could even do something else. 2692 02:10:36,880 --> 02:10:41,490 Maybe the user types a semicolon, then the word drop, or the word update. 2693 02:10:41,490 --> 02:10:45,180 You could imagine doing semicolon update table grades, where 2694 02:10:45,180 --> 02:10:49,140 name equals Malan, and set the grade equal to A instead of B, 2695 02:10:49,140 --> 02:10:50,670 or something like that. 2696 02:10:50,670 --> 02:10:53,550 The ability to inject SQL into the database 2697 02:10:53,550 --> 02:10:56,820 means you can do anything you want with the data set, either constructively, 2698 02:10:56,820 --> 02:11:00,000 or worse, destructively. 2699 02:11:00,000 --> 02:11:02,880 And now, just a quick, little cartoon that should now make sense. 2700 02:11:02,880 --> 02:11:09,410 2701 02:11:09,410 --> 02:11:13,020 OK, to, like, one of us, two of us. 2702 02:11:13,020 --> 02:11:14,400 Awkwardly somewhat funny. 2703 02:11:14,400 --> 02:11:16,560 All right, so let's move on to one last condition. 2704 02:11:16,560 --> 02:11:19,130 There's one other problem that can go awry here. 2705 02:11:19,130 --> 02:11:20,380 Oh, and I should explain this. 2706 02:11:20,380 --> 02:11:25,500 So this is an allusion to the son, Robert, having typed in semicolon. 2707 02:11:25,500 --> 02:11:28,810 The word drop, table, students, and doing some of the same technique. 2708 02:11:28,810 --> 02:11:31,440 This is humor that only CS people would understand 2709 02:11:31,440 --> 02:11:35,040 because it's the mom realizing, oh, her son's doing a SQL injection 2710 02:11:35,040 --> 02:11:36,308 attack onto the database. 2711 02:11:36,308 --> 02:11:39,600 Less funny when you explain it, but once you notice the syntax, that's all this 2712 02:11:39,600 --> 02:11:40,960 is an allusion to. 2713 02:11:40,960 --> 02:11:41,460 All right. 2714 02:11:41,460 --> 02:11:44,850 So one final threat, now that you are graduating 2715 02:11:44,850 --> 02:11:49,320 to the world of proper databases and away from CSV files alone. 2716 02:11:49,320 --> 02:11:52,170 Things can go wrong when using databases, 2717 02:11:52,170 --> 02:11:55,838 and honestly, even using CSV files if you have multiple users. 2718 02:11:55,838 --> 02:11:57,630 And thus far, you and I have had the luxury 2719 02:11:57,630 --> 02:12:00,547 in almost every program we've written that it's just me using my code. 2720 02:12:00,547 --> 02:12:01,830 It's just you using your code. 2721 02:12:01,830 --> 02:12:04,770 And even if your teaching fellow or TA is using it, probably 2722 02:12:04,770 --> 02:12:06,060 not at the same time. 2723 02:12:06,060 --> 02:12:10,770 But the world gets interesting if you start putting your code on phones, 2724 02:12:10,770 --> 02:12:14,730 on websites, such that now you might have two users literally trying 2725 02:12:14,730 --> 02:12:17,430 to log in at the same time, literally clicking a button 2726 02:12:17,430 --> 02:12:19,410 at the same, or nearly the same time. 2727 02:12:19,410 --> 02:12:22,380 What happens, then, if a computer is trying 2728 02:12:22,380 --> 02:12:25,290 to handle requests from two different people at once, 2729 02:12:25,290 --> 02:12:27,480 as might happen all the time on a website? 2730 02:12:27,480 --> 02:12:29,610 You might get what are called race conditions. 2731 02:12:29,610 --> 02:12:33,060 And this is a problem in computing in general, not just with SQL, not just 2732 02:12:33,060 --> 02:12:36,960 with Python, really just any time you have shared data, 2733 02:12:36,960 --> 02:12:39,150 like a database, as follows. 2734 02:12:39,150 --> 02:12:43,620 This apparently is one of the most liked Instagram posts ever. 2735 02:12:43,620 --> 02:12:46,110 It is literally just a picture of an egg. 2736 02:12:46,110 --> 02:12:47,645 Has anyone clicked on this egg? 2737 02:12:47,645 --> 02:12:48,270 Like, a couple? 2738 02:12:48,270 --> 02:12:48,780 Oh, OK. 2739 02:12:48,780 --> 02:12:49,280 Wow. 2740 02:12:49,280 --> 02:12:50,220 All right, so yes. 2741 02:12:50,220 --> 02:12:53,880 So go search for this photo if you'd like to add to the likes on Instagram. 2742 02:12:53,880 --> 02:12:56,110 The account is world_record_egg. 2743 02:12:56,110 --> 02:12:59,040 This is just a screenshot of Instagram of that picture of an egg. 2744 02:12:59,040 --> 02:13:00,540 If you're in the habit of using Instagram, 2745 02:13:00,540 --> 02:13:03,498 or like any social media site, there's some equivalent of a like button 2746 02:13:03,498 --> 02:13:04,920 or a heart button these days. 2747 02:13:04,920 --> 02:13:06,900 And that's actually a really hard problem. 2748 02:13:06,900 --> 02:13:10,050 Such a simple idea to count the number of likes something 2749 02:13:10,050 --> 02:13:12,990 has, but that means someone has to click on it. 2750 02:13:12,990 --> 02:13:14,910 Your code has to detect the click. 2751 02:13:14,910 --> 02:13:17,860 Your code has to update the database, and then do it again and again, 2752 02:13:17,860 --> 02:13:22,740 even if multiple people are perhaps right now clicking on that same egg. 2753 02:13:22,740 --> 02:13:28,350 And unfortunately, bad things can happen if two people try to do something 2754 02:13:28,350 --> 02:13:30,540 at the same time on a computer. 2755 02:13:30,540 --> 02:13:31,690 How might this happen? 2756 02:13:31,690 --> 02:13:35,670 So here's some more code, half pseudocode, half Python code here, 2757 02:13:35,670 --> 02:13:36,580 as follows. 2758 02:13:36,580 --> 02:13:40,230 Suppose that what happens when you, literally, right now, maybe click 2759 02:13:40,230 --> 02:13:43,470 on the like button on the Instagram post. 2760 02:13:43,470 --> 02:13:47,760 Suppose that code, like the following, is executed on Facebook servers. 2761 02:13:47,760 --> 02:13:54,190 db.execute of select likes from posts where ID equals question mark. 2762 02:13:54,190 --> 02:13:54,690 All right. 2763 02:13:54,690 --> 02:13:57,090 So what am I assuming here? 2764 02:13:57,090 --> 02:13:59,280 I'm assuming that that photograph has a unique ID. 2765 02:13:59,280 --> 02:14:02,670 It's some big integer, whatever it was, randomly assigned. 2766 02:14:02,670 --> 02:14:05,130 I'm assuming that when you click on the heart 2767 02:14:05,130 --> 02:14:08,160 the unique ID is somehow sent to Instagram servers 2768 02:14:08,160 --> 02:14:10,740 so that their code can call it ID. 2769 02:14:10,740 --> 02:14:13,830 And I'm assuming that Instagram is using its SQL database 2770 02:14:13,830 --> 02:14:17,790 and selecting, from a posts table, the current number of likes 2771 02:14:17,790 --> 02:14:21,160 of that egg for that given ID number. 2772 02:14:21,160 --> 02:14:21,660 Why? 2773 02:14:21,660 --> 02:14:24,952 Because I need to know how many likes it already has if I want to add one to it 2774 02:14:24,952 --> 02:14:26,190 and then update the database. 2775 02:14:26,190 --> 02:14:29,710 I need to select the data, then I need to update the data here. 2776 02:14:29,710 --> 02:14:30,210 All right. 2777 02:14:30,210 --> 02:14:33,780 So in some Python code here, let's store, in a variable called 2778 02:14:33,780 --> 02:14:37,950 likes, whatever comes back in the first row from the likes column. 2779 02:14:37,950 --> 02:14:40,660 Again, this is new syntax specific to our library, 2780 02:14:40,660 --> 02:14:43,830 but a common way of getting back first row and the column called 2781 02:14:43,830 --> 02:14:44,800 likes therein. 2782 02:14:44,800 --> 02:14:46,920 So at this point in the story, likes is storing 2783 02:14:46,920 --> 02:14:49,462 the total number of likes, in the millions or whatever it is, 2784 02:14:49,462 --> 02:14:50,610 of that particular egg. 2785 02:14:50,610 --> 02:14:51,900 Then I do this. 2786 02:14:51,900 --> 02:14:56,400 Execute update posts, set the number of likes 2787 02:14:56,400 --> 02:15:00,607 equal to this value, where the ID of the post equals this value. 2788 02:15:00,607 --> 02:15:02,190 What do I want to update the likes to? 2789 02:15:02,190 --> 02:15:06,540 Whatever likes currently is plus 1, and then plugging in the ID. 2790 02:15:06,540 --> 02:15:08,340 So a simple idea, right? 2791 02:15:08,340 --> 02:15:11,730 I'm checking the value of the likes, and maybe it's 10. 2792 02:15:11,730 --> 02:15:15,030 I'm changing 10 to 11 and then updating the table. 2793 02:15:15,030 --> 02:15:18,390 But a problem can arise if two people have 2794 02:15:18,390 --> 02:15:23,010 clicked on that egg at roughly the same time, or literally, the same time. 2795 02:15:23,010 --> 02:15:24,340 Why is that? 2796 02:15:24,340 --> 02:15:26,850 Well, in the world of databases and servers, 2797 02:15:26,850 --> 02:15:31,120 and the Instagrams of the world have thousands of physical servers nowadays. 2798 02:15:31,120 --> 02:15:35,430 So they can support millions, billions even, of users nowadays. 2799 02:15:35,430 --> 02:15:36,910 What can go wrong? 2800 02:15:36,910 --> 02:15:40,530 Well, typically code like this is not what we'll call atomic. 2801 02:15:40,530 --> 02:15:44,430 To be atomic means that it all executes together or not at all. 2802 02:15:44,430 --> 02:15:49,500 Rather, code typically is executed, as you might imagine, line by line. 2803 02:15:49,500 --> 02:15:53,110 And if your code is running on a server that multiple people have access to, 2804 02:15:53,110 --> 02:15:55,530 which is absolutely the case for an app like Instagram, 2805 02:15:55,530 --> 02:15:58,560 if you and I click on the heart at roughly the same time, 2806 02:15:58,560 --> 02:16:02,160 for efficiency, the computer, the server, owned by Instagram, 2807 02:16:02,160 --> 02:16:04,500 might execute this line of code for me. 2808 02:16:04,500 --> 02:16:06,537 Then it might execute this line of code for you. 2809 02:16:06,537 --> 02:16:09,120 Then this line of code for me, then this line of code for you, 2810 02:16:09,120 --> 02:16:11,703 then this line of code for me, then this line of code for you. 2811 02:16:11,703 --> 02:16:16,550 That is to say, our queries might get intermingled chronologically. 2812 02:16:16,550 --> 02:16:19,370 Because it'd be a little obnoxious if, when you're using Instagram, 2813 02:16:19,370 --> 02:16:22,040 I'm blocked out while you're interacting with the site. 2814 02:16:22,040 --> 02:16:24,778 It'd be a lot nicer for efficiency and fairness if somehow they 2815 02:16:24,778 --> 02:16:27,320 do a little bit of work for me, a little bit of work for you, 2816 02:16:27,320 --> 02:16:30,540 and back and forth, and back and forth, equitably on the server. 2817 02:16:30,540 --> 02:16:33,320 So that's what typically happens by default. These lines of code 2818 02:16:33,320 --> 02:16:35,420 get executed independently. 2819 02:16:35,420 --> 02:16:39,770 And they can happen in alternating order with other users. 2820 02:16:39,770 --> 02:16:41,809 You can get them combined like this. 2821 02:16:41,809 --> 02:16:45,840 Same order top to bottom, but other things might happen in between. 2822 02:16:45,840 --> 02:16:50,000 So suppose that the number of likes at the very beginning was 10. 2823 02:16:50,000 --> 02:16:54,080 And suppose that Carter and I both click on that egg at roughly the same time. 2824 02:16:54,080 --> 02:16:56,420 And suppose this line of code gets executed for me, 2825 02:16:56,420 --> 02:16:59,870 and that gives me a value in likes, ultimately, of 10. 2826 02:16:59,870 --> 02:17:03,290 Suppose, then, that the computer takes a break from dealing with my request, 2827 02:17:03,290 --> 02:17:05,690 does the same code for Carter, and gets back 2828 02:17:05,690 --> 02:17:07,910 what value for the current number of likes? 2829 02:17:07,910 --> 02:17:09,170 Also 10 for Carter. 2830 02:17:09,170 --> 02:17:11,209 Because mine has not been recorded yet. 2831 02:17:11,209 --> 02:17:13,910 At this point in the story, somewhere in the computer's memory 2832 02:17:13,910 --> 02:17:16,219 there's a likes variable for me, storing 10. 2833 02:17:16,219 --> 02:17:19,400 There's a likes variable storing 10 for Carter. 2834 02:17:19,400 --> 02:17:21,290 Then this line of code executes for me. 2835 02:17:21,290 --> 02:17:25,219 It updates the database to be likes plus 1, which stores 11 in the database. 2836 02:17:25,219 --> 02:17:30,440 Then Carter's code is executed, updating the same row in the database 2837 02:17:30,440 --> 02:17:34,160 to 11, unfortunately. 2838 02:17:34,160 --> 02:17:38,209 Because his value of likes happened to be the same value of mine. 2839 02:17:38,209 --> 02:17:41,540 And so the metaphor here, that if we had a refrigerator on stage we would 2840 02:17:41,540 --> 02:17:44,790 actually act out, is something that was taught to me years ago in an operating 2841 02:17:44,790 --> 02:17:50,299 systems class, whereby the most similar analogue in the real world would be 2842 02:17:50,299 --> 02:17:52,370 if you've got a mini fridge in your dorm room. 2843 02:17:52,370 --> 02:17:58,440 And one of you and your roommates comes home, opens the fridge, and realizes, 2844 02:17:58,440 --> 02:18:01,160 oh, we're out of milk, was how the story went in my day. 2845 02:18:01,160 --> 02:18:05,184 So you close the refrigerator, and you walk across the street, go to CVS, 2846 02:18:05,184 --> 02:18:06,559 and get in line to buy some milk. 2847 02:18:06,559 --> 02:18:08,309 Meanwhile, your roommate comes home. 2848 02:18:08,309 --> 02:18:12,924 They, too, inspect the state of your refrigerator, a.k.a., a variable, 2849 02:18:12,924 --> 02:18:15,049 open the door, and realizes, oh, we're out of milk. 2850 02:18:15,049 --> 02:18:15,980 I'll go get more milk. 2851 02:18:15,980 --> 02:18:18,020 Close the fridge, go across the street, and head 2852 02:18:18,020 --> 02:18:20,059 to maybe a different store, or the line is long enough 2853 02:18:20,059 --> 02:18:21,851 that you don't see each other at the store. 2854 02:18:21,851 --> 02:18:25,700 So long story short, you both eventually get home, open the door, and damn it, 2855 02:18:25,700 --> 02:18:27,559 now there's milk from your other roommate 2856 02:18:27,559 --> 02:18:30,680 there because you both made a decision on this 2857 02:18:30,680 --> 02:18:35,809 based on the state of a variable that you independently examined. 2858 02:18:35,809 --> 02:18:37,700 And you didn't somehow communicate. 2859 02:18:37,700 --> 02:18:40,610 Now in the real world, this is absolutely solvable. 2860 02:18:40,610 --> 02:18:44,120 How would you fix this or avoid this problem in the real world? 2861 02:18:44,120 --> 02:18:45,770 Literally, own roommate, own fridge. 2862 02:18:45,770 --> 02:18:47,750 AUDIENCE: Text your roommate [INAUDIBLE].. 2863 02:18:47,750 --> 02:18:48,750 DAVID J. MALAN: Perfect. 2864 02:18:48,750 --> 02:18:50,482 Let them know, so somehow communicate. 2865 02:18:50,482 --> 02:18:52,940 And in fact, the terminology here would be multiple threads 2866 02:18:52,940 --> 02:18:55,459 can somehow intercommunicate by having shared state, 2867 02:18:55,459 --> 02:18:57,230 like the iMessage thread on your phone. 2868 02:18:57,230 --> 02:18:58,190 You could leave a note. 2869 02:18:58,190 --> 02:19:01,490 You could, more dramatically, lock the refrigerator somehow, 2870 02:19:01,490 --> 02:19:05,570 thereby making the milk purchasing process atomic. 2871 02:19:05,570 --> 02:19:08,389 The fundamental problem is that for efficiency, again, 2872 02:19:08,389 --> 02:19:11,690 computers tend to intermingle logic that needs 2873 02:19:11,690 --> 02:19:15,980 to happen when it's happening across multiple users just for fairness' sake, 2874 02:19:15,980 --> 02:19:17,180 for scheduling sake. 2875 02:19:17,180 --> 02:19:19,820 You need to make sure that all three of these lines of code 2876 02:19:19,820 --> 02:19:23,269 execute for me, and then for Carter, and then for you 2877 02:19:23,269 --> 02:19:25,709 if you want to ensure that this count is correct. 2878 02:19:25,709 --> 02:19:29,219 And for years, when social media was first getting off the ground, 2879 02:19:29,219 --> 02:19:31,100 this was a super hard problem. 2880 02:19:31,100 --> 02:19:33,920 Twitter used to go down all of the time, and tweets, 2881 02:19:33,920 --> 02:19:36,260 and retweets were a thing that were similarly happening 2882 02:19:36,260 --> 02:19:37,490 with a very high frequency. 2883 02:19:37,490 --> 02:19:39,020 These are hard problems to solve. 2884 02:19:39,020 --> 02:19:40,690 And thankfully, there are solutions. 2885 02:19:40,690 --> 02:19:43,440 And we won't get into the weeds of how you might use these things, 2886 02:19:43,440 --> 02:19:46,100 but know that there are solutions in the form of things 2887 02:19:46,100 --> 02:19:49,610 called locks, which I use that word deliberately with the fridge. 2888 02:19:49,610 --> 02:19:53,570 Software locks can allow you to protect a variable so no one else can 2889 02:19:53,570 --> 02:19:55,550 look at it until you're done with it. 2890 02:19:55,550 --> 02:19:57,770 There are things called transactions, which 2891 02:19:57,770 --> 02:20:01,100 allow you to do the equivalent of sending a message to, or really locking 2892 02:20:01,100 --> 02:20:04,130 out your roommate from accessing that same variable, too, 2893 02:20:04,130 --> 02:20:06,920 but for slightly less amount of time. 2894 02:20:06,920 --> 02:20:08,880 There are solutions to these problems. 2895 02:20:08,880 --> 02:20:12,650 So for instance, in Python, the same code now in green 2896 02:20:12,650 --> 02:20:14,360 might look a little something like this. 2897 02:20:14,360 --> 02:20:17,150 When you know that something has to happen all at once, 2898 02:20:17,150 --> 02:20:21,020 altogether, you first begin a transaction, and you do your thing, 2899 02:20:21,020 --> 02:20:23,707 and then you commit the transaction at the very end. 2900 02:20:23,707 --> 02:20:25,790 Here, too, though, there's going to be a downside. 2901 02:20:25,790 --> 02:20:29,902 Typically, the more you use transactions in this way, 2902 02:20:29,902 --> 02:20:31,610 potentially the higher the probability is 2903 02:20:31,610 --> 02:20:35,190 that you're going to box someone out or make Carter's request a little slower. 2904 02:20:35,190 --> 02:20:35,690 Why? 2905 02:20:35,690 --> 02:20:37,482 Because we can't interact at the same time. 2906 02:20:37,482 --> 02:20:39,920 Or you might make his request fail if he tries to update 2907 02:20:39,920 --> 02:20:41,870 something that's already been updated. 2908 02:20:41,870 --> 02:20:44,720 So you generally want to have as few lines of code 2909 02:20:44,720 --> 02:20:47,840 together in between these transactions so that you get in and you get out. 2910 02:20:47,840 --> 02:20:50,780 And you go to CVS and you get back really fast so as to not 2911 02:20:50,780 --> 02:20:52,550 cause these kind of performance things. 2912 02:20:52,550 --> 02:20:55,070 So things indeed escalated quickly today. 2913 02:20:55,070 --> 02:20:58,272 The original goal was just to solve problems using a different language 2914 02:20:58,272 --> 02:20:59,480 more effectively than Python. 2915 02:20:59,480 --> 02:21:01,772 But as soon as you have these more powerful techniques, 2916 02:21:01,772 --> 02:21:03,525 a whole new set of problems arises. 2917 02:21:03,525 --> 02:21:05,150 Takes practice to get comfortable with. 2918 02:21:05,150 --> 02:21:09,170 But ultimately, this is all leading us toward the introduction next week 2919 02:21:09,170 --> 02:21:12,260 of web programming with HTML, CSS, and some JavaScript. 2920 02:21:12,260 --> 02:21:15,090 The week after, bringing Python and SQL back into the mix. 2921 02:21:15,090 --> 02:21:16,940 So that by term's end, we've really now used 2922 02:21:16,940 --> 02:21:19,760 all of these different languages for what they're best at. 2923 02:21:19,760 --> 02:21:22,843 And over the next few weeks, the goal is to make sure you're understanding 2924 02:21:22,843 --> 02:21:25,932 and comfortable with what each of these things is good and bad for. 2925 02:21:25,932 --> 02:21:27,140 Let's go ahead and wrap here. 2926 02:21:27,140 --> 02:21:28,473 I'll stick around for questions. 2927 02:21:28,473 --> 02:21:30,310 We'll see you next time. 2928 02:21:30,310 --> 02:22:04,344 [MUSIC PLAYING] 244433

Can't find what you're looking for?
Get subtitles in any language from opensubtitles.com, and translate them here.