Would you like to inspect the original subtitles? These are the user uploaded subtitles that are being translated:
0
00:00:00,000 --> 00:00:03,968
[MUSIC PLAYING]
1
00:00:03,968 --> 00:01:18,020
[MUSIC PLAYING]
2
00:01:18,020 --> 00:01:20,030
DAVID J. MALAN: This is CS50.
3
00:01:20,030 --> 00:01:23,360
And this is week 7, the week, here, of Halloween.
4
00:01:23,360 --> 00:01:26,510
Indeed, special thanks to CS50's own Valerie and her mom
5
00:01:26,510 --> 00:01:30,590
for having created this very festive scenery, and all past ones as well.
6
00:01:30,590 --> 00:01:33,330
Today, we pick up where we left off last time,
7
00:01:33,330 --> 00:01:35,180
which, recall, we introduced Python.
8
00:01:35,180 --> 00:01:38,240
And that was our big transition from C, where suddenly things
9
00:01:38,240 --> 00:01:40,850
started to look new again, probably, syntactically.
10
00:01:40,850 --> 00:01:44,390
But also, probably things hopefully started to feel easier.
11
00:01:44,390 --> 00:01:48,080
Well, with that said, problem set 6 certainly added some challenges,
12
00:01:48,080 --> 00:01:49,380
and you did some new things.
13
00:01:49,380 --> 00:01:53,090
But hopefully you've begun to appreciate that with Python, just a lot more stuff
14
00:01:53,090 --> 00:01:54,290
is easier to do.
15
00:01:54,290 --> 00:01:56,960
You get more out of the box with the language itself.
16
00:01:56,960 --> 00:01:59,450
And that's going to be so useful over the coming weeks
17
00:01:59,450 --> 00:02:03,980
as we transition further to introducing something called databases today,
18
00:02:03,980 --> 00:02:06,270
web programming next week and the week after.
19
00:02:06,270 --> 00:02:08,900
So that by term's end, and perhaps even for your final project,
20
00:02:08,900 --> 00:02:11,870
you really are building something from scratch
21
00:02:11,870 --> 00:02:15,415
using all of these various tools somehow together.
22
00:02:15,415 --> 00:02:17,540
So before we do that, though, today, let's consider
23
00:02:17,540 --> 00:02:22,610
what we weren't really able to do last week, which was actually
24
00:02:22,610 --> 00:02:25,490
create and store data ourselves.
25
00:02:25,490 --> 00:02:30,710
In Python, we've played around with the CSV, comma-separated values library.
26
00:02:30,710 --> 00:02:33,810
And you've been able to read in CSVs from disk,
27
00:02:33,810 --> 00:02:37,880
so to speak, that is, from files in your programming environments.
28
00:02:37,880 --> 00:02:41,455
But we haven't necessarily started saving data, persisting data ourselves.
29
00:02:41,455 --> 00:02:44,330
And that's a huge limitation, because pretty much all of the examples
30
00:02:44,330 --> 00:02:46,288
we've done thus far with a couple of exceptions
31
00:02:46,288 --> 00:02:49,550
have involved my providing input at the keyboard or even vocally.
32
00:02:49,550 --> 00:02:51,508
But then nothing happens to it.
33
00:02:51,508 --> 00:02:53,300
It disappears the moment the program quits,
34
00:02:53,300 --> 00:02:55,410
because it was only being stored in memory.
35
00:02:55,410 --> 00:02:59,090
But today, we'll start to focus all the more on storing things on disk,
36
00:02:59,090 --> 00:03:02,100
that is, storing things in files and folders
37
00:03:02,100 --> 00:03:04,940
so that you can actually write programs that remember
38
00:03:04,940 --> 00:03:06,620
what it is the human did last time.
39
00:03:06,620 --> 00:03:09,320
And ultimately, you can actually make mobile or web apps
40
00:03:09,320 --> 00:03:12,050
that actually begin to grow, and grow, and grow their data sets,
41
00:03:12,050 --> 00:03:15,650
as might happen if you get more and more users, for instance, on a website.
42
00:03:15,650 --> 00:03:19,400
To play, then, with this new capability of being able to write files,
43
00:03:19,400 --> 00:03:22,217
let's go ahead and just collect some data.
44
00:03:22,217 --> 00:03:24,050
In fact, those of you here in person, if you
45
00:03:24,050 --> 00:03:26,960
want to pull up this URL on your phone or laptop,
46
00:03:26,960 --> 00:03:29,060
that's going to lead you to a Google Form.
47
00:03:29,060 --> 00:03:34,130
And that Google Form is going to ask you in just a moment for really
48
00:03:34,130 --> 00:03:35,320
just your favorite TV show.
49
00:03:35,320 --> 00:03:37,070
And it's going to ask you to categorize it
50
00:03:37,070 --> 00:03:41,270
according to a genre, like comedy, or drama, or action, or musical,
51
00:03:41,270 --> 00:03:42,460
or something like that.
52
00:03:42,460 --> 00:03:44,210
And this is useful, because if you've ever
53
00:03:44,210 --> 00:03:47,520
used a Google Form before, or Microsoft's equivalent with Office 365,
54
00:03:47,520 --> 00:03:50,270
it's a really useful mechanism at just collecting data from users,
55
00:03:50,270 --> 00:03:54,000
and then ultimately, putting it into a spreadsheet form.
56
00:03:54,000 --> 00:03:58,160
So this is a screenshot of the form that those of you
57
00:03:58,160 --> 00:04:01,080
here in person or tuning in on Zoom are currently filling out.
58
00:04:01,080 --> 00:04:02,420
It's asking only two questions.
59
00:04:02,420 --> 00:04:04,550
What's the title of your favorite TV show?
60
00:04:04,550 --> 00:04:09,470
And what are one or more genres into which your TV show falls?
61
00:04:09,470 --> 00:04:12,860
And I'll go ahead and pivot now to the view
62
00:04:12,860 --> 00:04:15,667
that I'll be able to see as the person who created this form, which
63
00:04:15,667 --> 00:04:17,209
is quite simply a Google spreadsheet.
64
00:04:17,209 --> 00:04:19,709
Google Forms has this nice feature, if you've never noticed,
65
00:04:19,709 --> 00:04:22,650
that allows you to export your data to a Google Spreadsheet.
66
00:04:22,650 --> 00:04:25,130
And then from there, we can actually grab the file
67
00:04:25,130 --> 00:04:27,500
and download it to my own Mac or your own PC
68
00:04:27,500 --> 00:04:30,320
so that we can actually play around with the data that's come in.
69
00:04:30,320 --> 00:04:32,570
So in fact, let me go ahead and slide over
70
00:04:32,570 --> 00:04:36,540
to this, the live Google Spreadsheet.
71
00:04:36,540 --> 00:04:40,490
And you'll see, probably, a whole bunch of familiar TV shows here,
72
00:04:40,490 --> 00:04:41,360
all coming in.
73
00:04:41,360 --> 00:04:43,820
And if we keep scrolling, and scrolling, and scrolling--
74
00:04:43,820 --> 00:04:45,290
only 46, 47.
75
00:04:45,290 --> 00:04:47,430
There we go, up to 50 plus already.
76
00:04:47,430 --> 00:04:50,420
If you need that URL again here, if you're just tuning in,
77
00:04:50,420 --> 00:04:52,850
you can go to this URL here.
78
00:04:52,850 --> 00:04:55,760
And in just a moment, we'll have a bunch of data
79
00:04:55,760 --> 00:04:59,450
with which we can start to experiment.
80
00:04:59,450 --> 00:05:01,070
I'll give you a moment or so there.
81
00:05:01,070 --> 00:05:06,330
82
00:05:06,330 --> 00:05:08,418
All right.
83
00:05:08,418 --> 00:05:09,960
Let me hang in there a little longer.
84
00:05:09,960 --> 00:05:11,418
OK, we've got over 100 submissions.
85
00:05:11,418 --> 00:05:12,390
Good.
86
00:05:12,390 --> 00:05:15,270
Good, even more coming in now.
87
00:05:15,270 --> 00:05:16,890
And we can see them coming in live.
88
00:05:16,890 --> 00:05:18,750
Here, let me switch back to the spreadsheet.
89
00:05:18,750 --> 00:05:21,090
The list is growing, and growing, and growing.
90
00:05:21,090 --> 00:05:22,900
And in just a moment--
91
00:05:22,900 --> 00:05:26,490
let me give Carter a moment to help me export it in real time.
92
00:05:26,490 --> 00:05:29,640
Carter, just give me a heads up when it's reasonable for me
93
00:05:29,640 --> 00:05:32,880
to download this file.
94
00:05:32,880 --> 00:05:35,140
All right, and I'll begin to do this very slowly.
95
00:05:35,140 --> 00:05:37,720
So I'm going to go up to the File menu, if you've never done this before.
96
00:05:37,720 --> 00:05:40,110
Download-- you can download a whole bunch of formats, one in Excel.
97
00:05:40,110 --> 00:05:42,060
But more simply, and the one we'll start to play with here,
98
00:05:42,060 --> 00:05:43,630
is comma-separated values.
99
00:05:43,630 --> 00:05:47,010
So CSV files we used this past week, why are they useful?
100
00:05:47,010 --> 00:05:50,190
Now that you've played with them or used them in past real world,
101
00:05:50,190 --> 00:05:55,260
what's the utility of a CSV file versus something like Excel, for instance?
102
00:05:55,260 --> 00:05:59,010
Why CSV in the first place?
103
00:05:59,010 --> 00:06:00,030
Any instincts?
104
00:06:00,030 --> 00:06:00,858
Yeah?
105
00:06:00,858 --> 00:06:02,730
AUDIENCE: Because it's just a text file?
106
00:06:02,730 --> 00:06:04,605
DAVID J. MALAN: OK, so storage is compelling.
107
00:06:04,605 --> 00:06:07,710
A simple text file with ASCII or Unicode text is probably pretty small.
108
00:06:07,710 --> 00:06:08,290
I like that.
109
00:06:08,290 --> 00:06:10,101
Other thoughts?
110
00:06:10,101 --> 00:06:11,184
AUDIENCE: Structure of it?
111
00:06:11,184 --> 00:06:12,518
DAVID J. MALAN: Yeah, well said.
112
00:06:12,518 --> 00:06:14,760
It's just a simple text format, but using conventions
113
00:06:14,760 --> 00:06:18,330
like commas you can represent the idea of columns using new lines,
114
00:06:18,330 --> 00:06:20,430
backslash ends invisibly at the end of your lines,
115
00:06:20,430 --> 00:06:22,000
you can create the idea of rows.
116
00:06:22,000 --> 00:06:24,000
So it's a very simple way of implementing what
117
00:06:24,000 --> 00:06:26,610
we might call a flat-file database.
118
00:06:26,610 --> 00:06:28,860
It's a way of storing data in a flat, that is,
119
00:06:28,860 --> 00:06:32,310
very simple file that's just pure ASCII or Unicode text.
120
00:06:32,310 --> 00:06:35,340
And more compellingly, I dare say, is that with a CSV file,
121
00:06:35,340 --> 00:06:36,862
it's completely portable.
122
00:06:36,862 --> 00:06:38,820
Something is portable in the world of computing
123
00:06:38,820 --> 00:06:41,910
if it means you can use it on a Mac or a PC running this operating system,
124
00:06:41,910 --> 00:06:42,750
or this other one.
125
00:06:42,750 --> 00:06:45,480
And portability is nice because if I were to download an Excel file,
126
00:06:45,480 --> 00:06:47,760
there'd be a whole bunch of people in this room and online
127
00:06:47,760 --> 00:06:50,190
who couldn't download it because they haven't bought Microsoft Excel
128
00:06:50,190 --> 00:06:50,940
or installed it.
129
00:06:50,940 --> 00:06:55,680
Or if they have a Mac, or if it's a .numbers file in the Mac world,
130
00:06:55,680 --> 00:06:57,610
a PC user might not be able to download it.
131
00:06:57,610 --> 00:06:59,800
So a CSV is indeed very portable.
132
00:06:59,800 --> 00:07:02,940
So I'm going to go ahead and download, quite simply, the CSV
133
00:07:02,940 --> 00:07:04,440
version of this file.
134
00:07:04,440 --> 00:07:06,960
That's going to put it onto my own Mac's Downloads folder.
135
00:07:06,960 --> 00:07:11,460
And let me go ahead here, and in just a moment, let me just simplify the name.
136
00:07:11,460 --> 00:07:14,910
Because it actually downloads it at a pretty large name.
137
00:07:14,910 --> 00:07:18,120
And give me just one moment here, and you'll see that, indeed,
138
00:07:18,120 --> 00:07:21,210
on my Mac I have a file called favorites.csv.
139
00:07:21,210 --> 00:07:22,750
I shortened the name real quick.
140
00:07:22,750 --> 00:07:28,680
And now what I'm going to do is go over to VS Code, and in VS Code,
141
00:07:28,680 --> 00:07:30,480
I'm going to open my File Explorer.
142
00:07:30,480 --> 00:07:34,620
And if I minimize my window here for a moment, a handy feature of VS Code
143
00:07:34,620 --> 00:07:37,980
is that you can just drag and drop a file, for instance, into your Explorer.
144
00:07:37,980 --> 00:07:40,660
And voila, it's going to automatically upload it for you.
145
00:07:40,660 --> 00:07:43,260
So let me go ahead and full screen here, close my Explorer,
146
00:07:43,260 --> 00:07:45,120
temporarily close my Terminal window.
147
00:07:45,120 --> 00:07:48,720
And you'll see here a CSV file, favorites.csv.
148
00:07:48,720 --> 00:07:51,390
And the first row, by convention, has whatever
149
00:07:51,390 --> 00:07:54,720
the columns were in Google Spreadsheets, or Office 365,
150
00:07:54,720 --> 00:07:58,620
in Excel online, timestamp, comma, title, comma, genres.
151
00:07:58,620 --> 00:08:00,390
Then, we have timestamps, which indicates
152
00:08:00,390 --> 00:08:01,722
when people started submitting.
153
00:08:01,722 --> 00:08:03,930
Looks like a couple of people were super eager to get
154
00:08:03,930 --> 00:08:05,430
started an hour or two ago.
155
00:08:05,430 --> 00:08:09,340
And then, you have the title next, after a comma.
156
00:08:09,340 --> 00:08:12,150
But there's kind of a curiosity after that.
157
00:08:12,150 --> 00:08:15,510
Sometimes I see the genre like comedy, comedy, comedy,
158
00:08:15,510 --> 00:08:19,870
but sometimes it's like crime, comma, drama, or action, comma, crime, comma,
159
00:08:19,870 --> 00:08:20,550
drama.
160
00:08:20,550 --> 00:08:22,470
And those things are quoted.
161
00:08:22,470 --> 00:08:24,180
And yet, I didn't do any quotes.
162
00:08:24,180 --> 00:08:25,800
You probably didn't type any quotes.
163
00:08:25,800 --> 00:08:30,180
Where are those quotes coming from in this CSV file?
164
00:08:30,180 --> 00:08:31,650
Why are they there if we infer?
165
00:08:31,650 --> 00:08:32,485
Yeah?
166
00:08:32,485 --> 00:08:35,340
AUDIENCE: [INAUDIBLE]
167
00:08:35,340 --> 00:08:38,308
DAVID J. MALAN: Yeah, so you have a corner case, if you will.
168
00:08:38,308 --> 00:08:40,350
Because if you're using commas, as you described,
169
00:08:40,350 --> 00:08:44,280
to separate your data into what are effectively columns, well,
170
00:08:44,280 --> 00:08:47,010
you've painted yourself into a corner if your actual data
171
00:08:47,010 --> 00:08:48,578
has commas in it itself.
172
00:08:48,578 --> 00:08:51,120
So what Google has done, what Microsoft does, what Apple does
173
00:08:51,120 --> 00:08:54,330
is, they quote any strings of text that themselves
174
00:08:54,330 --> 00:08:58,560
have commas so that these are now English grammatical commas,
175
00:08:58,560 --> 00:09:00,730
not CSV specific commas.
176
00:09:00,730 --> 00:09:03,400
So it's a way of escaping your data, if you will.
177
00:09:03,400 --> 00:09:06,070
And escaping just means to call out a symbol in a special way
178
00:09:06,070 --> 00:09:08,637
so it's not misinterpreted as something else.
179
00:09:08,637 --> 00:09:10,470
All right, so this is all to say that we now
180
00:09:10,470 --> 00:09:13,980
have all of this data with which we can play in the form of what we'll
181
00:09:13,980 --> 00:09:15,840
start calling a flat-file database.
182
00:09:15,840 --> 00:09:19,230
So suppose I wanted to now start manipulating this data,
183
00:09:19,230 --> 00:09:22,110
and I want to store it ultimately, indeed, in this CSV format.
184
00:09:22,110 --> 00:09:24,540
How can I actually start to read this data,
185
00:09:24,540 --> 00:09:26,950
maybe clean it up, maybe do some analytics on it
186
00:09:26,950 --> 00:09:30,570
and actually figure out, what's the most popular show among those who submitted
187
00:09:30,570 --> 00:09:32,190
here over the past few minutes?
188
00:09:32,190 --> 00:09:34,270
Well, let me go ahead and close this.
189
00:09:34,270 --> 00:09:38,970
Let me go ahead, then, and open up, for instance, just my Terminal window.
190
00:09:38,970 --> 00:09:41,970
And let's code up a file called favorites.py.
191
00:09:41,970 --> 00:09:46,110
And let's go ahead and iteratively start simple by just opening up this file
192
00:09:46,110 --> 00:09:47,830
and printing out what's inside of it.
193
00:09:47,830 --> 00:09:51,330
So you might recall that we can do this by doing something like import CSV
194
00:09:51,330 --> 00:09:54,990
to give myself some CSV reading functionality.
195
00:09:54,990 --> 00:09:59,610
Then, I can go ahead and do something like with open, the name of the file
196
00:09:59,610 --> 00:10:02,280
that I want to open in read mode.
197
00:10:02,280 --> 00:10:03,990
Quote, unquote, "r" means to read it.
198
00:10:03,990 --> 00:10:06,600
And then, I can say as file, or whatever other name
199
00:10:06,600 --> 00:10:09,670
for a variable to say that I want to open this file,
200
00:10:09,670 --> 00:10:12,480
and essentially store some kind of reference to it in that variable
201
00:10:12,480 --> 00:10:13,600
called file.
202
00:10:13,600 --> 00:10:16,920
Then, I can give myself a reader, and I can say csv.reader,
203
00:10:16,920 --> 00:10:18,397
passing in that file as input.
204
00:10:18,397 --> 00:10:19,980
And this is the magic of that library.
205
00:10:19,980 --> 00:10:23,190
It deals with the process of opening it, reading it, and giving you
206
00:10:23,190 --> 00:10:26,430
back something that you can just iterate over, like with a for loop
207
00:10:26,430 --> 00:10:30,510
I do want to skip the first row, and recall that I can do this.
208
00:10:30,510 --> 00:10:33,665
Next, reader, is this little trick that just says, ignore the first row.
209
00:10:33,665 --> 00:10:35,040
Because the first one is special.
210
00:10:35,040 --> 00:10:37,410
It said timestamp, title, genres.
211
00:10:37,410 --> 00:10:39,400
That's not your data, that was mine.
212
00:10:39,400 --> 00:10:41,880
But this means now that I've skipped that first row.
213
00:10:41,880 --> 00:10:44,700
Everything hereafter is going to be the title of a show
214
00:10:44,700 --> 00:10:46,260
that you all like, so let me do this.
215
00:10:46,260 --> 00:10:51,090
For row in the reader, let's go ahead and print out the title
216
00:10:51,090 --> 00:10:52,860
of the show each of you typed in.
217
00:10:52,860 --> 00:10:57,000
How do I get at the title of the show each of you typed in?
218
00:10:57,000 --> 00:10:58,740
It's somewhere inside of row.
219
00:10:58,740 --> 00:11:00,790
Row recalls a list.
220
00:11:00,790 --> 00:11:02,910
So what do I want to type next in order to get
221
00:11:02,910 --> 00:11:08,920
at the title of the current row just as a quick check here?
222
00:11:08,920 --> 00:11:11,350
What do I want to type to get at the title of the row,
223
00:11:11,350 --> 00:11:15,520
keeping in mind, again, that it was timestamp, title, genres?
224
00:11:15,520 --> 00:11:16,020
Yeah?
225
00:11:16,020 --> 00:11:16,895
AUDIENCE: [INAUDIBLE]
226
00:11:16,895 --> 00:11:18,910
DAVID J. MALAN: So row bracket 1 would give me
227
00:11:18,910 --> 00:11:22,640
the second column, 0 index, that is, the one in the middle with the title.
228
00:11:22,640 --> 00:11:24,520
So this program isn't that interesting yet,
229
00:11:24,520 --> 00:11:27,370
but it's a quick and dirty way to figure out, all right, what's my data
230
00:11:27,370 --> 00:11:27,820
look like?
231
00:11:27,820 --> 00:11:29,987
Let me actually just do a little bit of a check here
232
00:11:29,987 --> 00:11:32,380
and see if it contains the data I think it does.
233
00:11:32,380 --> 00:11:34,490
Let me maximize my Terminal window here.
234
00:11:34,490 --> 00:11:37,810
Let me run Python of favorites.py, hitting Enter.
235
00:11:37,810 --> 00:11:42,520
And you'll see now a purely textual list of all of the shows
236
00:11:42,520 --> 00:11:44,560
you all seem to like here.
237
00:11:44,560 --> 00:11:47,080
But what's noteworthy about it?
238
00:11:47,080 --> 00:11:49,960
Specific shows aside, judgment aside as to people's
239
00:11:49,960 --> 00:11:54,640
TV tastes, what's interesting or noteworthy about the data that
240
00:11:54,640 --> 00:11:58,090
might create some problems for us if we start to analyze this data,
241
00:11:58,090 --> 00:11:59,740
and figure out what's the most popular?
242
00:11:59,740 --> 00:12:02,590
How many people like this or that?
243
00:12:02,590 --> 00:12:03,340
What do you think?
244
00:12:03,340 --> 00:12:04,151
Yeah?
245
00:12:04,151 --> 00:12:07,458
AUDIENCE: User errors [INAUDIBLE].
246
00:12:07,458 --> 00:12:09,500
DAVID J. MALAN: Yeah, there might be user errors,
247
00:12:09,500 --> 00:12:12,800
or just stylistic differences that give the appearance that one show
248
00:12:12,800 --> 00:12:14,400
is different from the other.
249
00:12:14,400 --> 00:12:16,970
For instance, here.
250
00:12:16,970 --> 00:12:20,250
Let's see if I can see an example on the screen here.
251
00:12:20,250 --> 00:12:24,340
Yeah, so friends here is an all lowercase, Friends here is capitalized.
252
00:12:24,340 --> 00:12:25,140
No big deal.
253
00:12:25,140 --> 00:12:26,370
We can sort of mitigate that.
254
00:12:26,370 --> 00:12:29,480
But this is just a tiny example of where data in the real world
255
00:12:29,480 --> 00:12:30,758
can get messy fast.
256
00:12:30,758 --> 00:12:32,300
And that probably wasn't even a typo.
257
00:12:32,300 --> 00:12:37,587
It was just someone not caring as much to capitalize it, and that's fine.
258
00:12:37,587 --> 00:12:39,920
Your users are going to type what they're going to type.
259
00:12:39,920 --> 00:12:43,580
So let's see if we can't now begin to get at more specific data,
260
00:12:43,580 --> 00:12:45,500
and maybe even clean some of this data up.
261
00:12:45,500 --> 00:12:50,570
Let me go back into my file called favorites.py here,
262
00:12:50,570 --> 00:12:54,350
and let's actually do something a little more user friendly for me.
263
00:12:54,350 --> 00:12:58,070
Instead of a reader, recall that there was this dictionary reader that's
264
00:12:58,070 --> 00:12:59,750
just a little more user friendly.
265
00:12:59,750 --> 00:13:04,670
And it means I can type in dictionary reader here, passing in the same file.
266
00:13:04,670 --> 00:13:10,670
But now, when I iterate over this reader variable, what is each row?
267
00:13:10,670 --> 00:13:14,060
When using a DictReader instead of a reader, recall, and this
268
00:13:14,060 --> 00:13:17,690
is just a peculiarity of the CSV library,
269
00:13:17,690 --> 00:13:22,530
this gives me back, not a list of cells, but what instead,
270
00:13:22,530 --> 00:13:24,950
which is marginally more user friendly for me?
271
00:13:24,950 --> 00:13:25,460
Yeah?
272
00:13:25,460 --> 00:13:27,585
AUDIENCE: [INAUDIBLE]
273
00:13:27,585 --> 00:13:28,460
DAVID J. MALAN: Yeah.
274
00:13:28,460 --> 00:13:30,860
I can now use open bracket, quotes, and the title.
275
00:13:30,860 --> 00:13:33,770
Because what's coming back now is a dict object, that is,
276
00:13:33,770 --> 00:13:37,010
a dictionary which has keys and values.
277
00:13:37,010 --> 00:13:39,060
The keys of which are the column headings.
278
00:13:39,060 --> 00:13:41,400
The values of which are the data I actually care about.
279
00:13:41,400 --> 00:13:44,030
So this is just marginally better because, one, it's
280
00:13:44,030 --> 00:13:47,330
just way more obvious to me, the author of this code, what it is I'm
281
00:13:47,330 --> 00:13:48,050
getting at.
282
00:13:48,050 --> 00:13:50,310
I don't remember what column the title was.
283
00:13:50,310 --> 00:13:50,810
Was it 0?
284
00:13:50,810 --> 00:13:51,320
Was it 1?
285
00:13:51,320 --> 00:13:51,890
Was it 2?
286
00:13:51,890 --> 00:13:53,450
That's something you're going to forget over time.
287
00:13:53,450 --> 00:13:56,510
And God forbid someone changes the data by just dragging and dropping
288
00:13:56,510 --> 00:13:59,390
the columns in Excel, or Apple Numbers, or Google Spreadsheets.
289
00:13:59,390 --> 00:14:01,978
That's going to break all of your numeric indices.
290
00:14:01,978 --> 00:14:03,770
And so a dictionary reader is arguably just
291
00:14:03,770 --> 00:14:07,310
better design because it's more robust against changes
292
00:14:07,310 --> 00:14:08,960
and potential errors like that.
293
00:14:08,960 --> 00:14:12,560
Now the effect of this change isn't going to be really any different.
294
00:14:12,560 --> 00:14:16,820
If I run Python of favorites.py, voila, I get all of the same results.
295
00:14:16,820 --> 00:14:21,290
But I've now not made any assumptions as to where each of the columns
296
00:14:21,290 --> 00:14:22,920
actually is numerically.
297
00:14:22,920 --> 00:14:23,420
All right.
298
00:14:23,420 --> 00:14:26,330
Well, let's go ahead and now filter out some duplicates.
299
00:14:26,330 --> 00:14:29,960
Because there's a lot of commonality among some of the shows here, so let's
300
00:14:29,960 --> 00:14:32,660
see if we can't filter out duplicates.
301
00:14:32,660 --> 00:14:38,900
If I'm reading a CSV file top to bottom, what intuitively might be the logic
302
00:14:38,900 --> 00:14:41,650
I want to implement to filter out duplicates?
303
00:14:41,650 --> 00:14:44,900
It's not going to be quite as simple as a simple function that does it for me.
304
00:14:44,900 --> 00:14:47,300
I'm going to have to build this.
305
00:14:47,300 --> 00:14:50,630
But logically, if you're reading a file from top to bottom,
306
00:14:50,630 --> 00:14:54,860
how might you go about, in Python or just any context,
307
00:14:54,860 --> 00:14:58,040
getting rid of duplicate values?
308
00:14:58,040 --> 00:14:59,862
Yeah, what do you think?
309
00:14:59,862 --> 00:15:05,735
AUDIENCE: [INAUDIBLE]
310
00:15:05,735 --> 00:15:06,610
DAVID J. MALAN: Sure.
311
00:15:06,610 --> 00:15:09,730
I could use a list and I could add each title to the list,
312
00:15:09,730 --> 00:15:13,070
but first check if I put this into the list before.
313
00:15:13,070 --> 00:15:15,140
So let's try a little something like that.
314
00:15:15,140 --> 00:15:18,340
Let me go ahead and create a variable at the top of my program here.
315
00:15:18,340 --> 00:15:21,340
I'll call it titles, for instance, initialize to an empty list,
316
00:15:21,340 --> 00:15:23,050
open bracket, close bracket.
317
00:15:23,050 --> 00:15:27,860
And then, inside of my loop here, instead of printing it out,
318
00:15:27,860 --> 00:15:29,630
let's start to make a decision.
319
00:15:29,630 --> 00:15:39,332
So if the current row's title is in the titles list
320
00:15:39,332 --> 00:15:40,540
I don't want to put it there.
321
00:15:40,540 --> 00:15:43,582
And actually, let me invert the logic so I'm doing something proactively.
322
00:15:43,582 --> 00:15:48,190
So if it's not the case that row bracket title is in titles,
323
00:15:48,190 --> 00:15:56,200
then, go ahead and do something like titles.append the current row's title.
324
00:15:56,200 --> 00:15:59,170
And recall that we saw .append a week or so ago,
325
00:15:59,170 --> 00:16:01,700
where it just allows you to append to the current list.
326
00:16:01,700 --> 00:16:04,810
And then, what can I do at the very end, after I'm all
327
00:16:04,810 --> 00:16:06,460
done reading the whole file?
328
00:16:06,460 --> 00:16:09,820
Why don't I go ahead and say, for title in titles,
329
00:16:09,820 --> 00:16:12,730
go ahead and print out the current title?
330
00:16:12,730 --> 00:16:16,700
So it's two loops now, and we can come back to the quality of that design.
331
00:16:16,700 --> 00:16:19,373
But let me go ahead here and rerun Python of favorites.py.
332
00:16:19,373 --> 00:16:22,540
Let me increase the size of my Terminal window so we can focus just on this,
333
00:16:22,540 --> 00:16:23,770
and hit Enter.
334
00:16:23,770 --> 00:16:28,120
And now, I'm just skimming.
335
00:16:28,120 --> 00:16:31,090
I don't think I'm seeing duplicates, although I
336
00:16:31,090 --> 00:16:34,000
am seeing some near duplicates.
337
00:16:34,000 --> 00:16:36,760
For instance, there's Friends again.
338
00:16:36,760 --> 00:16:40,060
And if we keep going, and going, and going, and going,
339
00:16:40,060 --> 00:16:41,260
there's Friends again.
340
00:16:41,260 --> 00:16:47,070
Oh, interesting, so that's curious that I seem to have multiple Friends,
341
00:16:47,070 --> 00:16:48,630
and I have this one here, too.
342
00:16:48,630 --> 00:16:51,120
So how might we clean this up further?
343
00:16:51,120 --> 00:16:53,393
I like your instincts, and it's a step closer to it.
344
00:16:53,393 --> 00:16:55,560
What are we going to have to do to really filter out
345
00:16:55,560 --> 00:16:57,360
those near duplicates?
346
00:16:57,360 --> 00:16:59,146
Any thoughts?
347
00:16:59,146 --> 00:17:04,045
AUDIENCE: You could set everything to lower [INAUDIBLE]..
348
00:17:04,045 --> 00:17:04,920
DAVID J. MALAN: Yeah.
349
00:17:04,920 --> 00:17:06,670
What are the common mistakes to summarize?
350
00:17:06,670 --> 00:17:09,405
We could ignore the capitalization altogether and maybe
351
00:17:09,405 --> 00:17:12,030
just force everything to lowercase, or everything to uppercase.
352
00:17:12,030 --> 00:17:13,920
Doesn't matter which, but let's just be consistent.
353
00:17:13,920 --> 00:17:16,795
And for those of you who might have accidentally or instinctively hit
354
00:17:16,795 --> 00:17:19,650
the spacebar at the beginning of your input or even at the end,
355
00:17:19,650 --> 00:17:21,180
we can strip that off, too.
356
00:17:21,180 --> 00:17:24,490
Stripping whitespace is a common thing just to clean up user input.
357
00:17:24,490 --> 00:17:27,869
So let me go back into my code here, and let me go ahead
358
00:17:27,869 --> 00:17:30,540
and tweak the title a little bit.
359
00:17:30,540 --> 00:17:33,360
Let me say that the current title inside of this loop
360
00:17:33,360 --> 00:17:36,300
is not going to be just the current row's title.
361
00:17:36,300 --> 00:17:40,440
But let me go ahead and strip off, from the left and the right implicitly,
362
00:17:40,440 --> 00:17:41,260
any whitespace.
363
00:17:41,260 --> 00:17:44,260
If you read the documentation for the strip function, it does just that.
364
00:17:44,260 --> 00:17:47,230
It gets rid of whitespace to the left, whitespace to the right.
365
00:17:47,230 --> 00:17:50,310
And then, if I want to force everything to maybe uppercase,
366
00:17:50,310 --> 00:17:52,710
I can just uppercase the entire string.
367
00:17:52,710 --> 00:17:56,310
And remember, what's handy about Python is you can chain some of these function
368
00:17:56,310 --> 00:17:59,550
calls together by just using dots again and again.
369
00:17:59,550 --> 00:18:01,500
And that just takes whatever just happened,
370
00:18:01,500 --> 00:18:04,320
like the whitespace got stripped off, then, it additionally
371
00:18:04,320 --> 00:18:06,280
uppercases the whole thing as well.
372
00:18:06,280 --> 00:18:10,980
So now, I'm going to just check whether this specific title is in titles.
373
00:18:10,980 --> 00:18:14,760
And if not, I'm going to go ahead and append that title,
374
00:18:14,760 --> 00:18:17,200
massaged into this different format, if you will.
375
00:18:17,200 --> 00:18:19,350
So I'm throwing away some information.
376
00:18:19,350 --> 00:18:23,760
I'm sacrificing all of the nuances of your grammar and input
377
00:18:23,760 --> 00:18:25,020
to the form itself.
378
00:18:25,020 --> 00:18:27,480
But at least I'm trying to canonicalize size, that is,
379
00:18:27,480 --> 00:18:29,950
standardize what the data actually looks like.
380
00:18:29,950 --> 00:18:33,810
So let me go ahead and run Python of favorites.py again and hit Enter.
381
00:18:33,810 --> 00:18:35,412
Oh, and this is just user error.
382
00:18:35,412 --> 00:18:36,870
Maybe you haven't seen this before.
383
00:18:36,870 --> 00:18:40,870
This just looks like a mistake on my part.
384
00:18:40,870 --> 00:18:43,200
I meant to say not even uppercase.
385
00:18:43,200 --> 00:18:44,202
That's completely wrong.
386
00:18:44,202 --> 00:18:46,410
The function is called upper, now that I think of it.
387
00:18:46,410 --> 00:18:46,830
All right.
388
00:18:46,830 --> 00:18:49,330
Let's go and increase the size of the Terminal window again.
389
00:18:49,330 --> 00:18:50,820
Run Python of favorites.py.
390
00:18:50,820 --> 00:18:55,530
And now, it's a little more overwhelming to look at because it's not sorted yet
391
00:18:55,530 --> 00:18:57,300
and it's all capitalized.
392
00:18:57,300 --> 00:19:03,300
But I don't think I'm seeing multiple Friends, so to speak.
393
00:19:03,300 --> 00:19:06,455
There's one Friends up here and that's it.
394
00:19:06,455 --> 00:19:07,830
I'm back up at my prompt already.
395
00:19:07,830 --> 00:19:10,270
So we seem now to be filtering out duplicates.
396
00:19:10,270 --> 00:19:13,582
Now, before we dive in further and clean this up further than this,
397
00:19:13,582 --> 00:19:14,790
what else could we have done?
398
00:19:14,790 --> 00:19:17,100
Well, it turns out that in Python 2 you often
399
00:19:17,100 --> 00:19:19,350
do get a lot of functionality built into the language.
400
00:19:19,350 --> 00:19:22,313
And I'm kind of implementing myself the idea of a set.
401
00:19:22,313 --> 00:19:24,480
If you think back to mathematics, a set is typically
402
00:19:24,480 --> 00:19:28,470
something with a bunch of values that has duplicates filtered out.
403
00:19:28,470 --> 00:19:31,290
Recall that Python already has this for us.
404
00:19:31,290 --> 00:19:34,200
And we saw it really briefly when I whipped up the dictionary
405
00:19:34,200 --> 00:19:36,190
implementation a couple of weeks back.
406
00:19:36,190 --> 00:19:40,980
So I could actually define my titles to be a set instead of a list,
407
00:19:40,980 --> 00:19:46,080
and this would just modestly allow me to refine my code here, such
408
00:19:46,080 --> 00:19:48,870
that I don't have to bother checking for duplicates anyway.
409
00:19:48,870 --> 00:19:52,710
I can instead just say something like, titles.add
410
00:19:52,710 --> 00:19:55,807
the current title, like this.
411
00:19:55,807 --> 00:19:58,890
Marginally better design if you know that a set exists because you're just
412
00:19:58,890 --> 00:20:00,753
getting more functionality out of this.
413
00:20:00,753 --> 00:20:02,670
All right, so let's clean the data up further.
414
00:20:02,670 --> 00:20:06,150
We've now gone ahead and fixed the problem of case sensitivity.
415
00:20:06,150 --> 00:20:08,940
We threw away whitespace in case someone had hit the spacebar
416
00:20:08,940 --> 00:20:09,990
with some of the input.
417
00:20:09,990 --> 00:20:13,960
Let's go ahead now and sort these things by the titles themselves.
418
00:20:13,960 --> 00:20:17,280
So instead of just printing out the titles in the same order
419
00:20:17,280 --> 00:20:22,142
you all inputted them, but filtering out duplicates as we go, let me go ahead
420
00:20:22,142 --> 00:20:24,600
and use another function in Python you might not have seen,
421
00:20:24,600 --> 00:20:26,940
which is literally called sorted, and will
422
00:20:26,940 --> 00:20:31,900
take care of the process of actually sorting titles for you.
423
00:20:31,900 --> 00:20:34,440
Let me go ahead and increase the font size of my Terminal,
424
00:20:34,440 --> 00:20:36,570
run Python of favorites.py, and hit Enter.
425
00:20:36,570 --> 00:20:40,410
And now you can really see how many of these shows start with the word "the"
426
00:20:40,410 --> 00:20:41,220
or do not.
427
00:20:41,220 --> 00:20:43,590
Now it's a little easier to wrap our minds around,
428
00:20:43,590 --> 00:20:46,650
just because it's at least sorted alphabetically.
429
00:20:46,650 --> 00:20:50,220
But now you can really see some of the differences in people's inputs.
430
00:20:50,220 --> 00:20:51,850
So far, so good.
431
00:20:51,850 --> 00:20:56,400
But a few of you decided to stylize Avatar in three different ways here.
432
00:20:56,400 --> 00:20:59,520
Brooklyn 99 is a couple of different ways here.
433
00:20:59,520 --> 00:21:02,700
And I think if we keep going we'll see further and further variances that we
434
00:21:02,700 --> 00:21:06,460
did not fix by focusing on whitespace and capitalization alone.
435
00:21:06,460 --> 00:21:09,870
So already here, this is only, what, 100 plus, 200 rows.
436
00:21:09,870 --> 00:21:12,795
Already real-world data starts to get messy quickly,
437
00:21:12,795 --> 00:21:14,670
and that might not bode well when we actually
438
00:21:14,670 --> 00:21:16,650
want to keep around real data from real users.
439
00:21:16,650 --> 00:21:19,025
You can imagine an actual website or a mobile application
440
00:21:19,025 --> 00:21:21,660
dealing with this kind of thing on scale.
441
00:21:21,660 --> 00:21:23,080
Well, let's go ahead and do this.
442
00:21:23,080 --> 00:21:26,370
Let's actually figure out the popularity of these various shows
443
00:21:26,370 --> 00:21:31,680
by now iterating over my data, and keeping track of how many of you
444
00:21:31,680 --> 00:21:33,000
inputted a given title.
445
00:21:33,000 --> 00:21:37,890
We're going to ignore the problems like Brooklyn 99 and the Avatar.
446
00:21:37,890 --> 00:21:42,390
Sorry, yeah, Avatar, where there was things
447
00:21:42,390 --> 00:21:47,070
that were different beyond just whitespace and capitalization.
448
00:21:47,070 --> 00:21:48,840
But let's go ahead and keep track of, now,
449
00:21:48,840 --> 00:21:52,290
how many of you inputted each of these titles.
450
00:21:52,290 --> 00:21:53,448
So how can I do this?
451
00:21:53,448 --> 00:21:55,740
I'm still going to take this approach of iterating over
452
00:21:55,740 --> 00:21:58,110
the CSV file from top to bottom.
453
00:21:58,110 --> 00:22:00,360
We've used a couple of data structures thus far,
454
00:22:00,360 --> 00:22:04,260
a list to keep track of titles, or a set to keep track of titles.
455
00:22:04,260 --> 00:22:07,350
But what if I now want to keep around a little more information?
456
00:22:07,350 --> 00:22:13,110
For each title, I want to keep around how many times I've seen it before.
457
00:22:13,110 --> 00:22:14,130
I'm not doing that yet.
458
00:22:14,130 --> 00:22:17,880
I'm throwing away the total number of times I see these shows.
459
00:22:17,880 --> 00:22:20,520
How could I start to keep that around?
460
00:22:20,520 --> 00:22:21,845
AUDIENCE: Use a dictionary.
461
00:22:21,845 --> 00:22:23,970
DAVID J. MALAN: We could use a dictionary, and how?
462
00:22:23,970 --> 00:22:25,260
Elaborate on that.
463
00:22:25,260 --> 00:22:26,135
AUDIENCE: [INAUDIBLE]
464
00:22:26,135 --> 00:22:28,093
DAVID J. MALAN: Perfect, really good instincts.
465
00:22:28,093 --> 00:22:29,850
Using a dictionary, insofar as it lets us
466
00:22:29,850 --> 00:22:33,300
store keys and values, that is, associate something with something
467
00:22:33,300 --> 00:22:33,850
else.
468
00:22:33,850 --> 00:22:36,480
This is why a dictionary or hash tables more
469
00:22:36,480 --> 00:22:39,630
generally are such a useful, practical data structure.
470
00:22:39,630 --> 00:22:43,120
Because they just let you remember stuff in some kind of structured way.
471
00:22:43,120 --> 00:22:45,870
So if the keys are going to be the titles I've seen,
472
00:22:45,870 --> 00:22:49,860
the values could be the number of times I've seen each of those titles.
473
00:22:49,860 --> 00:22:54,308
And so it's kind of like just having a two-column table on paper.
474
00:22:54,308 --> 00:22:56,850
For instance, if I were going to do this on a piece of paper,
475
00:22:56,850 --> 00:22:59,340
I might just have two columns here, where
476
00:22:59,340 --> 00:23:04,590
maybe this is the title that I've seen, and this is the count over here.
477
00:23:04,590 --> 00:23:08,250
This is, in effect, a dictionary in Python.
478
00:23:08,250 --> 00:23:11,490
It's two columns, keys on the left, values on the right.
479
00:23:11,490 --> 00:23:13,620
And this, if I can implement in code, will actually
480
00:23:13,620 --> 00:23:17,580
allow me to store this data, and then maybe do some simple arithmetic
481
00:23:17,580 --> 00:23:19,450
to figure out which is the most popular.
482
00:23:19,450 --> 00:23:20,380
So let's do this.
483
00:23:20,380 --> 00:23:24,240
Let me go ahead and change my titles to not be a list, not be a set.
484
00:23:24,240 --> 00:23:29,160
Let's have it be a dictionary instead, either doing this, or more succinctly,
485
00:23:29,160 --> 00:23:33,600
two curly braces that are empty gives me an empty dictionary automatically.
486
00:23:33,600 --> 00:23:34,860
What do I now want to do?
487
00:23:34,860 --> 00:23:37,440
I think most of my code can stay the same.
488
00:23:37,440 --> 00:23:40,860
But down here, I don't want to just blindly add titles
489
00:23:40,860 --> 00:23:42,180
to the data structure.
490
00:23:42,180 --> 00:23:45,060
I somehow need to keep track of the count.
491
00:23:45,060 --> 00:23:48,690
And unfortunately, if I just do this-- let's do titles,
492
00:23:48,690 --> 00:23:53,580
bracket, title, plus equals 1.
493
00:23:53,580 --> 00:23:56,130
This is a reasonable first attempt at this.
494
00:23:56,130 --> 00:23:57,420
Because what am I doing?
495
00:23:57,420 --> 00:24:02,730
If titles is a dictionary and I want to look up the current title therein,
496
00:24:02,730 --> 00:24:05,520
the syntax for that, like before, is titles, bracket, and then
497
00:24:05,520 --> 00:24:08,880
the key you want to use to index into the dictionary.
498
00:24:08,880 --> 00:24:12,000
It's not a number in this case, it's an actual word, a title.
499
00:24:12,000 --> 00:24:14,220
And you're just going to increment it by one,
500
00:24:14,220 --> 00:24:17,010
and then eventually I'll come back and finish my second loop
501
00:24:17,010 --> 00:24:19,710
and do things in terms of the order.
502
00:24:19,710 --> 00:24:23,580
But for now, let's just keep track of the total counts.
503
00:24:23,580 --> 00:24:25,660
Let me go ahead and increase my Terminal window.
504
00:24:25,660 --> 00:24:29,590
Let me do Python of favorites.py and hit Enter.
505
00:24:29,590 --> 00:24:30,090
Huh.
506
00:24:30,090 --> 00:24:34,140
How I Met Your Mother is giving me a key error.
507
00:24:34,140 --> 00:24:36,750
What does that mean?
508
00:24:36,750 --> 00:24:39,120
And why am I seeing this?
509
00:24:39,120 --> 00:24:42,330
And in fact, just to give a little bit of a breadcrumb here,
510
00:24:42,330 --> 00:24:44,140
let me zoom out here.
511
00:24:44,140 --> 00:24:46,920
Let me open up the CSV file again real quickly.
512
00:24:46,920 --> 00:24:50,460
And wow, we didn't even get past the second row in the file
513
00:24:50,460 --> 00:24:52,210
or the first show in the file.
514
00:24:52,210 --> 00:24:54,840
Notice that How I Met Your Mother, somewhat lowercased,
515
00:24:54,840 --> 00:24:57,562
is the very first show in therein.
516
00:24:57,562 --> 00:24:59,520
What's your instinct for why this is happening?
517
00:24:59,520 --> 00:25:01,750
AUDIENCE: You don't have a starting point.
518
00:25:01,750 --> 00:25:03,667
DAVID J. MALAN: I don't have a starting point.
519
00:25:03,667 --> 00:25:04,920
I'm adding one to what?
520
00:25:04,920 --> 00:25:09,840
I'm blindly indexing into the dictionary using a key, How I Met Your Mother,
521
00:25:09,840 --> 00:25:11,880
that doesn't yet exist in the dictionary.
522
00:25:11,880 --> 00:25:14,610
And so Python throws what's called a key error
523
00:25:14,610 --> 00:25:17,440
because the key you're trying to use just doesn't exist yet.
524
00:25:17,440 --> 00:25:21,000
So logically, how could we fix this?
525
00:25:21,000 --> 00:25:21,690
We're close.
526
00:25:21,690 --> 00:25:25,110
We got half of the problem solved, but I'm not handling the obvious, now,
527
00:25:25,110 --> 00:25:26,850
case of nothing being there.
528
00:25:26,850 --> 00:25:27,595
Yeah?
529
00:25:27,595 --> 00:25:28,930
AUDIENCE: Creating a counter.
530
00:25:28,930 --> 00:25:29,620
DAVID J. MALAN: Creating a--
531
00:25:29,620 --> 00:25:30,370
AUDIENCE: Counter.
532
00:25:30,370 --> 00:25:32,240
DAVID J. MALAN: Creating the counter itself.
533
00:25:32,240 --> 00:25:34,190
So maybe I could do something like this.
534
00:25:34,190 --> 00:25:38,380
Let me close my Terminal window and let me ask a question first.
535
00:25:38,380 --> 00:25:44,980
If the current title is in the dictionary already, if title in titles,
536
00:25:44,980 --> 00:25:47,440
that's going to give me a true-false answer it turns out.
537
00:25:47,440 --> 00:25:52,240
Then, I can safely say, titles, bracket, title, plus equals 1.
538
00:25:52,240 --> 00:25:56,740
And recall, this is just shorthand notation for the same thing as in C,
539
00:25:56,740 --> 00:25:58,300
title plus 1.
540
00:25:58,300 --> 00:25:59,360
Whoops, typo.
541
00:25:59,360 --> 00:26:00,168
Don't do that.
542
00:26:00,168 --> 00:26:02,710
That's the same thing as this but it's a little more succinct
543
00:26:02,710 --> 00:26:04,810
just to say plus equals 1.
544
00:26:04,810 --> 00:26:09,490
Else, if it's logically not the case that the current title is in the titles
545
00:26:09,490 --> 00:26:13,610
dictionary, then I probably want to say titles, bracket, title equals?
546
00:26:13,610 --> 00:26:15,315
Feel free to just shout it out.
547
00:26:15,315 --> 00:26:15,940
AUDIENCE: Zero.
548
00:26:15,940 --> 00:26:16,815
DAVID J. MALAN: Zero.
549
00:26:16,815 --> 00:26:21,320
I just have to put some value there so that the key itself is also there.
550
00:26:21,320 --> 00:26:21,820
All right.
551
00:26:21,820 --> 00:26:24,370
So now that I've got this going on, let me go ahead
552
00:26:24,370 --> 00:26:26,620
and undo my sorting temporarily.
553
00:26:26,620 --> 00:26:29,650
And now let me go ahead and do this.
554
00:26:29,650 --> 00:26:33,460
I can, as a quick check, let me go ahead and just run the code
555
00:26:33,460 --> 00:26:35,050
as is, Python of favorites.py.
556
00:26:35,050 --> 00:26:37,030
I'm back in business.
557
00:26:37,030 --> 00:26:39,790
It's printing correctly, no key errors, but it's not sorted.
558
00:26:39,790 --> 00:26:41,620
And I'm not seeing any of the counts.
559
00:26:41,620 --> 00:26:43,780
Let me just quickly add the counts, and there's
560
00:26:43,780 --> 00:26:45,530
a couple of ways I could do this.
561
00:26:45,530 --> 00:26:52,900
I could, say, print out the title, and then, maybe, let's do something like--
562
00:26:52,900 --> 00:26:57,220
how about just, comma, titles, bracket, title?
563
00:26:57,220 --> 00:26:59,020
So I'm going to print two things at once,
564
00:26:59,020 --> 00:27:01,600
both the current title in the dictionary,
565
00:27:01,600 --> 00:27:04,300
and whatever its value is by indexing into it.
566
00:27:04,300 --> 00:27:06,140
Let me increase my Terminal window.
567
00:27:06,140 --> 00:27:10,420
Let me run Python of favorites.py, Enter, and OK.
568
00:27:10,420 --> 00:27:11,908
Huh.
569
00:27:11,908 --> 00:27:14,080
Huh.
570
00:27:14,080 --> 00:27:17,560
None of you said a whole lot of TV shows, it seems.
571
00:27:17,560 --> 00:27:21,690
What's the logical error here?
572
00:27:21,690 --> 00:27:25,460
What did I do wrong if I look back at my code here?
573
00:27:25,460 --> 00:27:25,960
Yeah?
574
00:27:25,960 --> 00:27:27,600
Why so many 0s?
575
00:27:27,600 --> 00:27:29,490
AUDIENCE: [INAUDIBLE]
576
00:27:29,490 --> 00:27:30,490
DAVID J. MALAN: Exactly.
577
00:27:30,490 --> 00:27:35,140
To summarize, I initialized the count to 0 the first time I saw it,
578
00:27:35,140 --> 00:27:38,500
but I should have initialized it at least to 1 because I just saw it.
579
00:27:38,500 --> 00:27:40,220
Or I should change my code a bit.
580
00:27:40,220 --> 00:27:42,880
So for instance, if I go back in here, the simplest fix
581
00:27:42,880 --> 00:27:46,390
is probably to initialize to 1, because on this iteration of the loop,
582
00:27:46,390 --> 00:27:49,210
obviously, I'm seeing this title for the very first time.
583
00:27:49,210 --> 00:27:51,580
Or I could change my logic a little bit.
584
00:27:51,580 --> 00:27:53,470
I could do something like this instead.
585
00:27:53,470 --> 00:27:58,840
If the current title is not in titles, then I could initialize it to 0.
586
00:27:58,840 --> 00:28:02,860
And then I could get rid of the else, and now blindly index
587
00:28:02,860 --> 00:28:04,900
into the titles dictionary.
588
00:28:04,900 --> 00:28:09,100
Because now, on line 11, I can trust that lines 9 and 10
589
00:28:09,100 --> 00:28:12,040
took care of the initialization for me if need be.
590
00:28:12,040 --> 00:28:13,028
Which one is better?
591
00:28:13,028 --> 00:28:13,570
I don't know.
592
00:28:13,570 --> 00:28:17,150
This one's a little nicer, maybe because it's one line fewer.
593
00:28:17,150 --> 00:28:20,470
But I think both approaches are perfectly reasonable and well-designed.
594
00:28:20,470 --> 00:28:22,420
But the key thing, no pun intended, is that we
595
00:28:22,420 --> 00:28:27,100
have to make sure the key exists before we presume to actually incrue.
596
00:28:27,100 --> 00:28:28,810
Oh, this is wrong.
597
00:28:28,810 --> 00:28:30,370
This is incorrect code.
598
00:28:30,370 --> 00:28:33,040
What did I do wrong?
599
00:28:33,040 --> 00:28:33,790
OK, yes.
600
00:28:33,790 --> 00:28:34,400
There we go.
601
00:28:34,400 --> 00:28:37,690
So otherwise, everyone would have liked this show once, and no matter
602
00:28:37,690 --> 00:28:39,190
how many people said the same thing.
603
00:28:39,190 --> 00:28:41,237
Now the code is as it should be.
604
00:28:41,237 --> 00:28:43,570
So let me go ahead and open up my Terminal window again.
605
00:28:43,570 --> 00:28:47,710
Let me run Python of favorites.py, and now we see more reasonable counts.
606
00:28:47,710 --> 00:28:49,150
Some shows weren't that popular.
607
00:28:49,150 --> 00:28:50,830
There's just 1s and maybe 2s.
608
00:28:50,830 --> 00:28:56,570
But I bet if we sort these things we can start to see a little more detail.
609
00:28:56,570 --> 00:28:57,890
So how else can we do this?
610
00:28:57,890 --> 00:29:04,520
Well, turns out, when dealing with a dictionary like this--
611
00:29:04,520 --> 00:29:07,160
let's go ahead and just sort the titles themselves.
612
00:29:07,160 --> 00:29:12,130
So let's reintroduce the sorted function as I did before, but no other changes.
613
00:29:12,130 --> 00:29:14,735
Let me go ahead now and run Python of favorites.py.
614
00:29:14,735 --> 00:29:17,110
Now it's just a little easier to wrap your mind around it
615
00:29:17,110 --> 00:29:18,568
because at least it's alphabetical.
616
00:29:18,568 --> 00:29:22,600
But it's not sorted by value, it's sorted by key.
617
00:29:22,600 --> 00:29:26,170
But sure enough, if we scroll down, there's something down here,
618
00:29:26,170 --> 00:29:29,202
for instance, like, let's see, The Office.
619
00:29:29,202 --> 00:29:30,910
That's definitely going to be a contender
620
00:29:30,910 --> 00:29:32,860
for most popular, 15 responses.
621
00:29:32,860 --> 00:29:35,860
But let's see what's actually going to bubble up to the top.
622
00:29:35,860 --> 00:29:40,870
Unfortunately, the sorted function only sorts dictionaries by keys
623
00:29:40,870 --> 00:29:44,410
by default, not by values.
624
00:29:44,410 --> 00:29:47,500
But it turns out, in Python, if you read the documentation
625
00:29:47,500 --> 00:29:49,510
for the sorted function, you can actually
626
00:29:49,510 --> 00:29:54,580
pass in other arguments that tell it how to sort things.
627
00:29:54,580 --> 00:29:57,430
For instance, if I want to do things in reverse order,
628
00:29:57,430 --> 00:30:02,140
I can add a second parameter to the sorted function called reverse.
629
00:30:02,140 --> 00:30:03,310
And it's a named parameter.
630
00:30:03,310 --> 00:30:05,620
You literally say, reverse equals true, so
631
00:30:05,620 --> 00:30:08,830
that the position of it in the comma-separated list doesn't matter.
632
00:30:08,830 --> 00:30:11,762
If I now rerun this after increasing my Terminal window,
633
00:30:11,762 --> 00:30:13,720
you'll see now that it's in the opposite order.
634
00:30:13,720 --> 00:30:16,330
Now adventure and Anne with an E is at the bottom
635
00:30:16,330 --> 00:30:18,410
of the output instead of the top.
636
00:30:18,410 --> 00:30:26,845
How can I tell it to sort by values instead of by key?
637
00:30:26,845 --> 00:30:28,220
Well, let's go ahead and do this.
638
00:30:28,220 --> 00:30:30,940
Let me go ahead and define a function.
639
00:30:30,940 --> 00:30:33,070
I'm just going to call it f to keep things simple.
640
00:30:33,070 --> 00:30:36,190
And this f function is going to take a title as input.
641
00:30:36,190 --> 00:30:41,140
And given a given title, it's going to return the value of that title.
642
00:30:41,140 --> 00:30:44,560
So actually, maybe a better name for this would be get value,
643
00:30:44,560 --> 00:30:46,820
and/or we could come up with something else as well.
644
00:30:46,820 --> 00:30:49,300
The purpose of the get value function, to be clear,
645
00:30:49,300 --> 00:30:54,200
is to take it as input a title and then return the corresponding value.
646
00:30:54,200 --> 00:30:55,460
Why is this useful?
647
00:30:55,460 --> 00:30:57,760
Well, it turns out that the sorted function in Python,
648
00:30:57,760 --> 00:31:01,870
according to its documentation, also takes a key parameter,
649
00:31:01,870 --> 00:31:05,860
where you can pass in, crazy enough, the name of a function
650
00:31:05,860 --> 00:31:11,650
that it will use in order to determine what it should sort by, by the key,
651
00:31:11,650 --> 00:31:16,550
or by the value, or in other cases, even other types of data as well.
652
00:31:16,550 --> 00:31:19,390
So there's a curiosity here, though, that's very deliberate.
653
00:31:19,390 --> 00:31:21,550
Key is the name of the parameter, just like reverse
654
00:31:21,550 --> 00:31:23,092
was the name of this other parameter.
655
00:31:23,092 --> 00:31:26,110
The value of it, though, is not a function call.
656
00:31:26,110 --> 00:31:27,580
It's a function name.
657
00:31:27,580 --> 00:31:30,520
Notice I am not doing this, no parentheses.
658
00:31:30,520 --> 00:31:35,400
I'm instead passing in get value, the function I wrote, by its name.
659
00:31:35,400 --> 00:31:37,900
And this is a feature of Python and certain other languages.
660
00:31:37,900 --> 00:31:41,110
Just like variables, you can actually pass whole functions
661
00:31:41,110 --> 00:31:45,560
around so that they can be called for you later on by someone else.
662
00:31:45,560 --> 00:31:48,880
So what this means is that the sorted function written by Python,
663
00:31:48,880 --> 00:31:51,410
they didn't know what you're going to want to sort by today.
664
00:31:51,410 --> 00:31:55,810
But if you provide them with a function called get value, or anything else, now
665
00:31:55,810 --> 00:31:58,000
their sorted function will use that function
666
00:31:58,000 --> 00:32:01,840
to determine, OK, if you don't want to sort by the key of the dictionary, what
667
00:32:01,840 --> 00:32:02,950
do you want to sort by?
668
00:32:02,950 --> 00:32:05,800
This is going to tell it to sort by the value
669
00:32:05,800 --> 00:32:08,750
by returning the specific value we care about.
670
00:32:08,750 --> 00:32:12,580
So let me go ahead now and rerun this after increasing my Terminal, Python
671
00:32:12,580 --> 00:32:14,620
of favorites.py, Enter.
672
00:32:14,620 --> 00:32:17,110
Here we have now an example of all of the titles you all
673
00:32:17,110 --> 00:32:22,180
typed in, albeit forced to uppercase and with any whitespace thrown out.
674
00:32:22,180 --> 00:32:24,733
And now, The Office is an easy win over Friends,
675
00:32:24,733 --> 00:32:27,400
versus Community, versus Game of Thrones, Breaking Bad, and then
676
00:32:27,400 --> 00:32:29,740
a lot of variants thereafter.
677
00:32:29,740 --> 00:32:31,680
So there's a lot of steps to go through.
678
00:32:31,680 --> 00:32:33,555
This isn't that bad once you've done it once,
679
00:32:33,555 --> 00:32:35,472
and you know what these functions are, and you
680
00:32:35,472 --> 00:32:36,870
know that these parameters exist.
681
00:32:36,870 --> 00:32:38,010
But it's a lot of work.
682
00:32:38,010 --> 00:32:42,090
That's 17 lines of code just to analyze a CSV file
683
00:32:42,090 --> 00:32:45,420
that you all created by way of those Google Form submissions.
684
00:32:45,420 --> 00:32:48,360
But it took me a lot of work just to get simple answers out of it.
685
00:32:48,360 --> 00:32:50,277
And indeed, that's going to be among the goals
686
00:32:50,277 --> 00:32:52,920
for today, ultimately, is, how can we just make this easier?
687
00:32:52,920 --> 00:32:54,795
It's one thing to learn new things in Python,
688
00:32:54,795 --> 00:32:57,300
but if we can avoid writing code, or this much code,
689
00:32:57,300 --> 00:32:58,840
that's going to be a good thing.
690
00:32:58,840 --> 00:33:01,020
And so one other technique we can introduce here
691
00:33:01,020 --> 00:33:03,570
that does allow us to write a little less code
692
00:33:03,570 --> 00:33:05,730
is, we can actually get rid of this function.
693
00:33:05,730 --> 00:33:09,240
It turns out, in Python, if you just need to make a function
694
00:33:09,240 --> 00:33:11,940
but it's going to be used and then essentially thrown away,
695
00:33:11,940 --> 00:33:14,790
it's not something you're going to be reusing in multiple places--
696
00:33:14,790 --> 00:33:17,550
it's not like a library function that you want to keep around--
697
00:33:17,550 --> 00:33:19,680
you can actually just do this.
698
00:33:19,680 --> 00:33:23,430
You can change the value of this key parameter
699
00:33:23,430 --> 00:33:25,950
to be what's called a lambda function, which
700
00:33:25,950 --> 00:33:29,040
is a fancy way of saying a function that technically has no name.
701
00:33:29,040 --> 00:33:30,870
It's an anonymous function.
702
00:33:30,870 --> 00:33:32,400
Why does it have no name?
703
00:33:32,400 --> 00:33:35,530
Well, it's kind of stupid that I invented this name on line 13.
704
00:33:35,530 --> 00:33:38,670
I used it on line 16, and then I never again used it.
705
00:33:38,670 --> 00:33:42,520
If there's only being used in one place, why bother giving it a name at all?
706
00:33:42,520 --> 00:33:45,000
So if you instead, in Python, say lambda,
707
00:33:45,000 --> 00:33:47,850
and then type out the name of the parameter
708
00:33:47,850 --> 00:33:50,130
you want this anonymous function to take,
709
00:33:50,130 --> 00:33:54,460
you can then say, go ahead and return this value.
710
00:33:54,460 --> 00:33:57,030
Now let's notice the inconsistencies here.
711
00:33:57,030 --> 00:33:59,850
When you use this special lambda keyword that says, hey Python,
712
00:33:59,850 --> 00:34:02,850
give me an anonymous function, a function with no name,
713
00:34:02,850 --> 00:34:06,610
it then says, Python, this anonymous function will take one parameter.
714
00:34:06,610 --> 00:34:08,699
Notice there's no parentheses.
715
00:34:08,699 --> 00:34:10,500
And that's deliberate, if confusing.
716
00:34:10,500 --> 00:34:12,909
It just tightens things up a little bit.
717
00:34:12,909 --> 00:34:16,770
Notice that there's no return keyword, which similarly tightens things
718
00:34:16,770 --> 00:34:18,780
up a bit, albeit inconsistently.
719
00:34:18,780 --> 00:34:22,650
But this line of code I've just highlighted
720
00:34:22,650 --> 00:34:26,429
is actually identical in functionality to this.
721
00:34:26,429 --> 00:34:28,320
But it throws away the word [INAUDIBLE].
722
00:34:28,320 --> 00:34:29,850
It throws away the word get value.
723
00:34:29,850 --> 00:34:33,510
It throws away the parentheses, and it throws away the return keyword just
724
00:34:33,510 --> 00:34:34,679
to tighten things up.
725
00:34:34,679 --> 00:34:37,290
And it's well suited for a problem like this
726
00:34:37,290 --> 00:34:39,960
where I just want to pass in a tiny little function that
727
00:34:39,960 --> 00:34:41,090
does something useful.
728
00:34:41,090 --> 00:34:42,840
But it's not something I'm going to reuse.
729
00:34:42,840 --> 00:34:45,100
It doesn't need multiple lines to take up space.
730
00:34:45,100 --> 00:34:47,159
It's just a nice, elegant one liner.
731
00:34:47,159 --> 00:34:48,840
That's all a lambda function does.
732
00:34:48,840 --> 00:34:51,909
It allows you to create an anonymous function right then and there.
733
00:34:51,909 --> 00:34:57,210
And then the function you're passing it to, like sorted, will use it as before.
734
00:34:57,210 --> 00:35:00,810
Indeed, if I run Python of favorites.py after growing my Terminal window,
735
00:35:00,810 --> 00:35:02,800
the result is exactly the same.
736
00:35:02,800 --> 00:35:06,360
And we see at the bottom here all of those small results.
737
00:35:06,360 --> 00:35:10,810
Are any questions, then, on this syntax, on these ideas?
738
00:35:10,810 --> 00:35:13,770
The goal here has been to write a Python program that just starts
739
00:35:13,770 --> 00:35:18,860
to analyze or clean up data like this.
740
00:35:18,860 --> 00:35:19,765
Yeah?
741
00:35:19,765 --> 00:35:23,273
AUDIENCE: [INAUDIBLE]
742
00:35:23,273 --> 00:35:26,440
DAVID J. MALAN: Could you use the lambda if it's just returning immediately?
743
00:35:26,440 --> 00:35:29,260
It's really meant for one line of code, generally.
744
00:35:29,260 --> 00:35:31,420
So you don't use the return keyword.
745
00:35:31,420 --> 00:35:33,880
You just say what it is you want to return.
746
00:35:33,880 --> 00:35:37,680
AUDIENCE: [INAUDIBLE]
747
00:35:37,680 --> 00:35:38,930
DAVID J. MALAN: Good question.
748
00:35:38,930 --> 00:35:40,960
Could you do more in that one line if it's
749
00:35:40,960 --> 00:35:42,670
got to be a more involved algorithm?
750
00:35:42,670 --> 00:35:45,820
Yes, but you would just ultimately return the value in question.
751
00:35:45,820 --> 00:35:47,947
In short, if it's getting at all sophisticated
752
00:35:47,947 --> 00:35:49,780
you don't use the lambda function in Python.
753
00:35:49,780 --> 00:35:52,510
You go ahead and actually just define a name for it,
754
00:35:52,510 --> 00:35:54,233
even if it's a one-off name.
755
00:35:54,233 --> 00:35:56,650
JavaScript, another language we'll look at in a few weeks,
756
00:35:56,650 --> 00:35:59,590
makes heavier use, I dare say, of lambda functions.
757
00:35:59,590 --> 00:36:02,030
And those can actually be multiple, multiple lines,
758
00:36:02,030 --> 00:36:05,520
but Python does not support that instinct.
759
00:36:05,520 --> 00:36:06,020
All right.
760
00:36:06,020 --> 00:36:07,728
So let's go ahead and do one other thing.
761
00:36:07,728 --> 00:36:10,340
Office was clearly popping out of the code here quite a bit.
762
00:36:10,340 --> 00:36:12,760
Let's go ahead and write a slightly different program
763
00:36:12,760 --> 00:36:15,400
that maybe just focuses on The Office for the moment,
764
00:36:15,400 --> 00:36:16,900
just focuses on The Office.
765
00:36:16,900 --> 00:36:21,250
So let me go ahead and throw most of this code away, up until this point
766
00:36:21,250 --> 00:36:23,080
when I'm inside of my inner loop.
767
00:36:23,080 --> 00:36:26,050
And let me go ahead, and I don't even want the global variable here.
768
00:36:26,050 --> 00:36:28,240
All I want to do is focus on the current title.
769
00:36:28,240 --> 00:36:30,730
How could I detect if someone likes The Office?
770
00:36:30,730 --> 00:36:33,790
Well, I could say something like--
771
00:36:33,790 --> 00:36:34,670
how about this?
772
00:36:34,670 --> 00:36:36,310
So counter equals 0.
773
00:36:36,310 --> 00:36:38,350
We'll just focus on The Office.
774
00:36:38,350 --> 00:36:43,930
If title equals, equals The Office, I could then go ahead and say,
775
00:36:43,930 --> 00:36:47,410
counter plus equals 1.
776
00:36:47,410 --> 00:36:48,400
I don't need a key.
777
00:36:48,400 --> 00:36:49,858
There's no dictionary involved now.
778
00:36:49,858 --> 00:36:51,880
It's just a simple integer variable.
779
00:36:51,880 --> 00:36:55,750
And then, down here I'll say something like,
780
00:36:55,750 --> 00:37:00,970
number of people who like The Office is, whatever this value is.
781
00:37:00,970 --> 00:37:03,850
And I'll put in counter in curly braces, and then I'll
782
00:37:03,850 --> 00:37:05,783
turn this whole thing into an F string.
783
00:37:05,783 --> 00:37:07,450
All right, let me go ahead and run this.
784
00:37:07,450 --> 00:37:10,100
Python of favorites.py, Enter.
785
00:37:10,100 --> 00:37:12,610
Number of people who like The Office is 15.
786
00:37:12,610 --> 00:37:13,990
All right, so that's great.
787
00:37:13,990 --> 00:37:17,530
But let's go ahead now and deliberately muddy the data a bit.
788
00:37:17,530 --> 00:37:21,160
All of you were very nice in that you typed in The Office.
789
00:37:21,160 --> 00:37:23,230
But you can imagine someone just typing Office,
790
00:37:23,230 --> 00:37:25,692
for instance, maybe there, maybe there.
791
00:37:25,692 --> 00:37:28,150
And many people might just write Office, you could imagine.
792
00:37:28,150 --> 00:37:30,400
Didn't happen here, but suppose it did, and probably
793
00:37:30,400 --> 00:37:33,290
would have if we had even more and more submissions over time.
794
00:37:33,290 --> 00:37:37,000
Now let's go ahead and rerun this program, no changes to the code.
795
00:37:37,000 --> 00:37:39,130
Now only 13 people like The Office.
796
00:37:39,130 --> 00:37:40,150
So let's fix this.
797
00:37:40,150 --> 00:37:45,790
The data is now as I mutated it to have a couple Offices, and many The Offices.
798
00:37:45,790 --> 00:37:51,050
How could I change my Python code to now count both of those situations?
799
00:37:51,050 --> 00:37:55,390
What could I change up here in order to improve this situation?
800
00:37:55,390 --> 00:37:57,910
Any thoughts?
801
00:37:57,910 --> 00:37:58,750
Yeah?
802
00:37:58,750 --> 00:38:02,300
AUDIENCE: You write the title [INAUDIBLE]..
803
00:38:02,300 --> 00:38:05,050
DAVID J. MALAN: Yeah, so I could just ask two questions like that.
804
00:38:05,050 --> 00:38:09,370
If title equals The Office, or title equals, equals just
805
00:38:09,370 --> 00:38:10,438
Office, for instance.
806
00:38:10,438 --> 00:38:12,730
And I'm still don't have to worry about capitalization.
807
00:38:12,730 --> 00:38:15,813
I don't have to worry about spaces because I at least threw that all away.
808
00:38:15,813 --> 00:38:18,200
Now I can go ahead and rerun this code.
809
00:38:18,200 --> 00:38:19,990
Let me go run it a third time.
810
00:38:19,990 --> 00:38:22,460
OK, so we're back up to 15.
811
00:38:22,460 --> 00:38:25,210
So I like that.
812
00:38:25,210 --> 00:38:28,660
You could imagine this not scaling very well.
813
00:38:28,660 --> 00:38:31,763
Avatar had three different permutations, and there were some others
814
00:38:31,763 --> 00:38:34,180
if we dug deeper that there might have been more variants.
815
00:38:34,180 --> 00:38:36,430
Could we do something a little more general purpose?
816
00:38:36,430 --> 00:38:38,230
Well, we could do something like this.
817
00:38:38,230 --> 00:38:41,883
If Office in the title--
818
00:38:41,883 --> 00:38:44,050
this is kind of a cool thing you can do with Python.
819
00:38:44,050 --> 00:38:46,930
It's very English-like, just ask the question, albeit tersely.
820
00:38:46,930 --> 00:38:50,660
This, interesting, just got me into trouble.
821
00:38:50,660 --> 00:38:53,140
Now, all of a sudden, we're up to 16.
822
00:38:53,140 --> 00:38:56,252
Does anyone know what the other one is?
823
00:38:56,252 --> 00:38:58,098
AUDIENCE: Someone put V Office.
824
00:38:58,098 --> 00:38:59,265
DAVID J. MALAN: What Office?
825
00:38:59,265 --> 00:39:02,120
AUDIENCE: Someone entered a V Office, [INAUDIBLE]..
826
00:39:02,120 --> 00:39:04,517
827
00:39:04,517 --> 00:39:05,850
DAVID J. MALAN: Oh, interesting.
828
00:39:05,850 --> 00:39:07,930
Yes, so they hit The.
829
00:39:07,930 --> 00:39:08,430
OK.
830
00:39:08,430 --> 00:39:11,568
[APPLAUSE]
831
00:39:11,568 --> 00:39:12,360
DAVID J. MALAN: OK.
832
00:39:12,360 --> 00:39:13,800
Someone did that, sure.
833
00:39:13,800 --> 00:39:17,130
So The V Office.
834
00:39:17,130 --> 00:39:19,470
OK, this one's actually going to be hard to correct for.
835
00:39:19,470 --> 00:39:21,300
I can't really think of a general--
836
00:39:21,300 --> 00:39:25,860
well, this is actually a good example of data gets messy fast.
837
00:39:25,860 --> 00:39:28,080
And you could imagine doing something where, OK, we
838
00:39:28,080 --> 00:39:32,920
could have like 26 conditions if someone said The A Office, or The B Office,
839
00:39:32,920 --> 00:39:33,420
right?
840
00:39:33,420 --> 00:39:34,650
You could imagine doing that.
841
00:39:34,650 --> 00:39:37,400
But then there's surely going to be other typos that are possible.
842
00:39:37,400 --> 00:39:39,610
So that's actually a hard one to fix.
843
00:39:39,610 --> 00:39:44,730
But it turns out we got lucky and now this is actually the accurate count.
844
00:39:44,730 --> 00:39:46,660
But the data is itself messy.
845
00:39:46,660 --> 00:39:49,950
Let me show another way that just adds another tool to our toolkit.
846
00:39:49,950 --> 00:39:54,750
It turns out that there's this feature in many programming languages, Python
847
00:39:54,750 --> 00:39:57,040
among them, called regular expressions.
848
00:39:57,040 --> 00:39:59,040
And this is actually a really powerful technique
849
00:39:59,040 --> 00:40:00,873
that we'll just scratch the surface of here.
850
00:40:00,873 --> 00:40:04,290
But it's going to be really useful, actually, maybe toward final projects,
851
00:40:04,290 --> 00:40:08,880
in web programming, any time you want to clean up data or validate data.
852
00:40:08,880 --> 00:40:11,910
And actually, just to make this clear, give me a moment
853
00:40:11,910 --> 00:40:14,160
before I switch screens here.
854
00:40:14,160 --> 00:40:18,450
And let me open up a Google Form from scratch.
855
00:40:18,450 --> 00:40:22,230
Give me just a moment to create something real quick.
856
00:40:22,230 --> 00:40:25,260
If you've never noticed this before when creating a Google Form,
857
00:40:25,260 --> 00:40:28,410
you can do a question.
858
00:40:28,410 --> 00:40:30,360
And if you want the user to type in something
859
00:40:30,360 --> 00:40:33,033
very specific as a short text answer like this,
860
00:40:33,033 --> 00:40:35,700
you might know that there's toggles like this in Google's world,
861
00:40:35,700 --> 00:40:36,930
like you can require it.
862
00:40:36,930 --> 00:40:39,270
Or you can do response validation.
863
00:40:39,270 --> 00:40:41,670
You could say, what's your email?
864
00:40:41,670 --> 00:40:47,250
And then you could say something like, text is an email.
865
00:40:47,250 --> 00:40:52,530
So here's an example in Google Forms how you can validate users' input.
866
00:40:52,530 --> 00:40:57,150
But a feature most of you have probably never noticed, or cared about, or used,
867
00:40:57,150 --> 00:40:59,490
is this thing called a regular expression, where
868
00:40:59,490 --> 00:41:01,440
you can actually define a pattern.
869
00:41:01,440 --> 00:41:04,830
And I could actually reimplement that same idea by doing something like this.
870
00:41:04,830 --> 00:41:11,070
I can say, let the user type in anything represented by .star, then an at sign,
871
00:41:11,070 --> 00:41:16,600
then something else, then a literal period, then, for instance,
872
00:41:16,600 --> 00:41:17,680
something else.
873
00:41:17,680 --> 00:41:19,950
So it's very cryptic, admittedly, at first glance.
874
00:41:19,950 --> 00:41:23,430
But this means any character 0 more times.
875
00:41:23,430 --> 00:41:26,160
This means any character 0 more times.
876
00:41:26,160 --> 00:41:28,725
This means a literal period, because apparently
877
00:41:28,725 --> 00:41:32,160
dot means any character in the context of these patterns.
878
00:41:32,160 --> 00:41:36,070
Then this thing means any character 0 more times.
879
00:41:36,070 --> 00:41:38,670
So I should actually be a little more nitpicky.
880
00:41:38,670 --> 00:41:41,530
You don't want 0 or more times, you want 1 or more times.
881
00:41:41,530 --> 00:41:45,340
So this with the plus means any character 1 or more time.
882
00:41:45,340 --> 00:41:47,020
So there has to be something there.
883
00:41:47,020 --> 00:41:51,630
And I think I want the same thing here 1 or more times, 1 or more times.
884
00:41:51,630 --> 00:41:56,160
Or heck, if I want to restrict this form in some sense to edu addresses,
885
00:41:56,160 --> 00:41:59,520
I could change that last thing to literally .edu.
886
00:41:59,520 --> 00:42:01,570
And so long story short, even though this looks,
887
00:42:01,570 --> 00:42:06,420
I'm sure, pretty cryptic, there's this mini language built into Python,
888
00:42:06,420 --> 00:42:09,900
and JavaScript, and Java, and other languages that allows you to express
889
00:42:09,900 --> 00:42:12,430
patterns in a standardized way.
890
00:42:12,430 --> 00:42:15,930
And this pattern is actually something we can implement in code, too.
891
00:42:15,930 --> 00:42:18,150
And let me switch back to Python for a second just
892
00:42:18,150 --> 00:42:19,920
to do the same kind of idea.
893
00:42:19,920 --> 00:42:22,950
Let me toggle back to my code here.
894
00:42:22,950 --> 00:42:26,730
Let me put up, for instance, a summary of what it is you can do.
895
00:42:26,730 --> 00:42:33,030
And here's just a quick summary of some of the available symbols.
896
00:42:33,030 --> 00:42:39,330
A period may represent any character. .star or .asterisks means 0 or more
897
00:42:39,330 --> 00:42:39,970
characters.
898
00:42:39,970 --> 00:42:43,020
So the dot means anything, so it can be A or nothing.
899
00:42:43,020 --> 00:42:44,160
It can be B or nothing.
900
00:42:44,160 --> 00:42:49,530
It can be A, B, A, B, C. It can be any combination of 0 or more characters.
901
00:42:49,530 --> 00:42:52,860
Change that to a plus and you now express one or more characters.
902
00:42:52,860 --> 00:42:55,860
Question mark means something is optional.
903
00:42:55,860 --> 00:42:59,370
Caret symbol means start matching at the beginning of the user's input.
904
00:42:59,370 --> 00:43:05,100
Dollar sign means stop matching at the end of the user's input.
905
00:43:05,100 --> 00:43:07,210
So we won't play with all of these just now.
906
00:43:07,210 --> 00:43:11,470
But let me go over here and actually tackle this Office problem.
907
00:43:11,470 --> 00:43:15,450
Let me go ahead and import a new library called the regular expression library,
908
00:43:15,450 --> 00:43:17,280
import re.
909
00:43:17,280 --> 00:43:20,560
And then, down here, let me say this.
910
00:43:20,560 --> 00:43:25,080
If re.search, this pattern.
911
00:43:25,080 --> 00:43:30,210
Let's just search for Office, quote, unquote, in the current title.
912
00:43:30,210 --> 00:43:32,730
Then we're going to go ahead and increase the counter.
913
00:43:32,730 --> 00:43:35,040
So it turns out that the regular expression library
914
00:43:35,040 --> 00:43:39,090
has a function called search that takes as its first argument a pattern,
915
00:43:39,090 --> 00:43:41,970
and then, as its second argument the string you
916
00:43:41,970 --> 00:43:44,160
want to analyze for that pattern.
917
00:43:44,160 --> 00:43:47,880
So it's sort of looking for a needle in this haystack, from left to right.
918
00:43:47,880 --> 00:43:52,080
Let me go ahead now and run this version of the program, Enter.
919
00:43:52,080 --> 00:43:56,250
And now I screwed up because I forgot my colon, but that's old stuff.
920
00:43:56,250 --> 00:43:57,322
Enter.
921
00:43:57,322 --> 00:43:59,150
Huh.
922
00:43:59,150 --> 00:44:01,800
Number of people who like The Office is now 0.
923
00:44:01,800 --> 00:44:02,970
So this seems like a big--
924
00:44:02,970 --> 00:44:05,640
thank you-- big step backwards.
925
00:44:05,640 --> 00:44:08,010
What did I do wrong?
926
00:44:08,010 --> 00:44:08,730
Yeah?
927
00:44:08,730 --> 00:44:10,410
AUDIENCE: [INAUDIBLE]
928
00:44:10,410 --> 00:44:11,610
DAVID J. MALAN: Yeah.
929
00:44:11,610 --> 00:44:14,595
I forced all my input to uppercase, so I probably need to do this.
930
00:44:14,595 --> 00:44:16,470
So we'll come back to other approaches there.
931
00:44:16,470 --> 00:44:17,430
Let me rerun it now.
932
00:44:17,430 --> 00:44:19,800
OK, now we're back up to 16.
933
00:44:19,800 --> 00:44:22,200
But I could even, let's say--
934
00:44:22,200 --> 00:44:25,110
I could tolerate just The Office.
935
00:44:25,110 --> 00:44:30,120
How about this, or how about something like, or The Office?
936
00:44:30,120 --> 00:44:32,373
Let me do this instead.
937
00:44:32,373 --> 00:44:34,290
And let me use these other special characters.
938
00:44:34,290 --> 00:44:37,380
This caret sign means the beginning of the string.
939
00:44:37,380 --> 00:44:40,740
This dollar sign weirdly represents the end of the string.
940
00:44:40,740 --> 00:44:44,490
I'm adding in some parentheses just like in math, just to add another symbol
941
00:44:44,490 --> 00:44:46,470
here, the or symbol here.
942
00:44:46,470 --> 00:44:50,580
And this is saying start matching at the beginning of the user string.
943
00:44:50,580 --> 00:44:54,660
Check if the beginning of the string is Office, or the beginning of the string
944
00:44:54,660 --> 00:44:55,920
is The Office.
945
00:44:55,920 --> 00:44:58,540
And then, you better be at the end of the string.
946
00:44:58,540 --> 00:45:01,650
So they can't keep typing words before or after that input.
947
00:45:01,650 --> 00:45:03,690
Let me go ahead and rerun the program.
948
00:45:03,690 --> 00:45:07,500
And now we're down to 15, which used to be our correct answer,
949
00:45:07,500 --> 00:45:10,770
but then we noticed The V Office.
950
00:45:10,770 --> 00:45:12,680
How can we deal with that?
951
00:45:12,680 --> 00:45:16,200
It's going to be messier to deal with that.
952
00:45:16,200 --> 00:45:21,555
How about if I tolerate any character represented by dot
953
00:45:21,555 --> 00:45:23,640
in between The and Office?
954
00:45:23,640 --> 00:45:28,110
Now if I rerun it, now I really have this expressive capability.
955
00:45:28,110 --> 00:45:32,340
So this is only to say, there are so many ways in languages, in general,
956
00:45:32,340 --> 00:45:33,390
to solve problems.
957
00:45:33,390 --> 00:45:35,950
And some of these tools are more sophisticated than others.
958
00:45:35,950 --> 00:45:38,957
This is one that you've actually probably glanced at but never used
959
00:45:38,957 --> 00:45:41,040
in the context of Google Forms for years if you're
960
00:45:41,040 --> 00:45:43,770
in the habit of creating these for student groups or other activities.
961
00:45:43,770 --> 00:45:45,840
But it's now something you can start to leverage.
962
00:45:45,840 --> 00:45:49,440
And we're just scratching the surface of what's actually possible with this.
963
00:45:49,440 --> 00:45:53,640
But let's now do one final example just using some Python code here.
964
00:45:53,640 --> 00:45:55,530
And let's actually write a program that's
965
00:45:55,530 --> 00:46:00,040
a little more general purpose that allows me to search for any given title
966
00:46:00,040 --> 00:46:01,770
and figure out its popularity.
967
00:46:01,770 --> 00:46:04,410
So let me go ahead and simplify this.
968
00:46:04,410 --> 00:46:06,630
Let's get rid of our regular expressions.
969
00:46:06,630 --> 00:46:09,940
Let's go ahead and continue capitalizing the title.
970
00:46:09,940 --> 00:46:11,580
And let's go ahead to--
971
00:46:11,580 --> 00:46:16,020
at the beginning of this program, and first ask the user for the title
972
00:46:16,020 --> 00:46:17,380
they want to search for.
973
00:46:17,380 --> 00:46:20,320
So title equals, let's ask the user for input,
974
00:46:20,320 --> 00:46:23,490
which is essentially the same thing as our CS50 get_string function.
975
00:46:23,490 --> 00:46:24,960
Ask them for the title.
976
00:46:24,960 --> 00:46:28,170
And then whatever they type in, let's go ahead and strip whitespace
977
00:46:28,170 --> 00:46:30,900
and uppercase the thing again.
978
00:46:30,900 --> 00:46:35,820
And now, inside of my loop, I could say something like this.
979
00:46:35,820 --> 00:46:42,660
If the current row's title after stripping whitespace and forcing
980
00:46:42,660 --> 00:46:46,920
it to uppercase, too, equals the user's title, then, go ahead
981
00:46:46,920 --> 00:46:49,440
and maybe increment a counter.
982
00:46:49,440 --> 00:46:51,160
So I still need that counter back.
983
00:46:51,160 --> 00:46:56,610
So let me go ahead and define this maybe in here, counter equals 0.
984
00:46:56,610 --> 00:46:58,720
And then, at the very end of this program,
985
00:46:58,720 --> 00:47:01,050
let me go ahead and print out just the popularity
986
00:47:01,050 --> 00:47:03,040
of whatever the human typed in.
987
00:47:03,040 --> 00:47:06,030
So again, the only difference is I'm asking the human for some input
988
00:47:06,030 --> 00:47:06,720
this time.
989
00:47:06,720 --> 00:47:09,150
I'm initializing my counter to 0, then I'm
990
00:47:09,150 --> 00:47:12,660
searching for their title in the CSV file
991
00:47:12,660 --> 00:47:15,810
by doing the same massaging of the data by forcing it to uppercase
992
00:47:15,810 --> 00:47:18,570
and getting rid of the whitespace.
993
00:47:18,570 --> 00:47:21,780
So now, when I run Python of favorites.py, Enter,
994
00:47:21,780 --> 00:47:30,030
I could type in the office all lowercase even, and now we're down to 13.
995
00:47:30,030 --> 00:47:34,940
13, why?
996
00:47:34,940 --> 00:47:36,700
Oh, that's correct.
997
00:47:36,700 --> 00:47:40,640
Because I'm the one that went in and removed those The keywords a bit ago.
998
00:47:40,640 --> 00:47:43,030
If we fixed those, we would be back up to 15.
999
00:47:43,030 --> 00:47:47,650
If we added support for The V Office, we would be up to 16 as well.
1000
00:47:47,650 --> 00:47:50,350
All right, any questions then on these various manipulations?
1001
00:47:50,350 --> 00:47:52,183
And if you're feeling like, oh, my god, this
1002
00:47:52,183 --> 00:47:55,100
is so much Python code just to do simple things, that's the point.
1003
00:47:55,100 --> 00:47:57,160
And indeed, even though it's a powerful language
1004
00:47:57,160 --> 00:48:00,670
and can solve these kinds of problems, we had to write almost 20 lines of code
1005
00:48:00,670 --> 00:48:03,470
just to ask a single question like this.
1006
00:48:03,470 --> 00:48:07,120
But any questions on how we did this, or on any of these building
1007
00:48:07,120 --> 00:48:10,750
blocks along the way?
1008
00:48:10,750 --> 00:48:11,890
Anything here?
1009
00:48:11,890 --> 00:48:12,390
No?
1010
00:48:12,390 --> 00:48:12,600
All right.
1011
00:48:12,600 --> 00:48:13,300
That was a lot.
1012
00:48:13,300 --> 00:48:14,800
Let's take a five-minute break here.
1013
00:48:14,800 --> 00:48:17,230
When we come back, we'll do it better.
1014
00:48:17,230 --> 00:48:18,480
So we are back.
1015
00:48:18,480 --> 00:48:20,610
And the rest of today is ultimately about, how
1016
00:48:20,610 --> 00:48:24,840
can we store, and manipulate, and change, and retrieve data
1017
00:48:24,840 --> 00:48:28,090
more efficiently than we might by just writing raw code?
1018
00:48:28,090 --> 00:48:31,440
This isn't to say that you shouldn't use Python to do the kinds of things
1019
00:48:31,440 --> 00:48:32,280
that we just did.
1020
00:48:32,280 --> 00:48:37,020
And in fact, it might be super common if you're getting a lot of messy input
1021
00:48:37,020 --> 00:48:39,073
from users that you might want to clean it up.
1022
00:48:39,073 --> 00:48:42,240
And maybe the best way to do that is to write a program so that step-by-step
1023
00:48:42,240 --> 00:48:44,370
you can make all of the requisite changes and fixes
1024
00:48:44,370 --> 00:48:47,523
like we did with The Office, for instance, again and again,
1025
00:48:47,523 --> 00:48:50,190
and reuse that code, especially if more and more submissions are
1026
00:48:50,190 --> 00:48:51,150
coming through.
1027
00:48:51,150 --> 00:48:53,550
But another theme of today, ultimately, is
1028
00:48:53,550 --> 00:48:57,638
that sometimes there are different, if not better tools for the same job.
1029
00:48:57,638 --> 00:48:59,430
And in fact, now at this point in the term,
1030
00:48:59,430 --> 00:49:02,310
as we begin to introduce not just Python, but in a moment
1031
00:49:02,310 --> 00:49:06,120
a language called SQL, and next week, a language called JavaScript,
1032
00:49:06,120 --> 00:49:09,150
and the week after that, synthesizing a whole lot of these languages
1033
00:49:09,150 --> 00:49:12,420
together is to just kind of paint a picture of how
1034
00:49:12,420 --> 00:49:15,900
you might decide what the trade-offs are between using this tool, or this tool,
1035
00:49:15,900 --> 00:49:16,830
or this other tool.
1036
00:49:16,830 --> 00:49:19,770
Because undoubtedly you can solve problems moving forward
1037
00:49:19,770 --> 00:49:22,660
in many different ways with many different tools.
1038
00:49:22,660 --> 00:49:25,020
So let's give you another tool, one with which
1039
00:49:25,020 --> 00:49:28,170
you can implement a proper relational database.
1040
00:49:28,170 --> 00:49:31,050
What we just saw in the form of CSV files
1041
00:49:31,050 --> 00:49:33,810
are what we might call flat-file databases.
1042
00:49:33,810 --> 00:49:37,500
Again, just a very simple file, flat in that there's no hierarchy to it.
1043
00:49:37,500 --> 00:49:39,270
It's just rows and columns.
1044
00:49:39,270 --> 00:49:44,280
And that is all ultimately storing ASCII or Unicode text.
1045
00:49:44,280 --> 00:49:47,400
A relational database, though, is something that's actually
1046
00:49:47,400 --> 00:49:50,850
closer to a proper spreadsheet program.
1047
00:49:50,850 --> 00:49:53,440
A CSV is an individual sheet, if you will,
1048
00:49:53,440 --> 00:49:55,260
from a spreadsheet when you export it.
1049
00:49:55,260 --> 00:49:57,460
If you had multiple sheets in a spreadsheet,
1050
00:49:57,460 --> 00:49:59,595
you would have to export multiple CSVs.
1051
00:49:59,595 --> 00:50:01,470
And that gets annoying quickly in code if you
1052
00:50:01,470 --> 00:50:03,990
have to open up this CSV, this CSV, all of which
1053
00:50:03,990 --> 00:50:07,080
represent different sheets or tabs in a proper spreadsheet.
1054
00:50:07,080 --> 00:50:11,520
A relational database is more like a spreadsheet program
1055
00:50:11,520 --> 00:50:14,640
that you, a programmer, now can interact with.
1056
00:50:14,640 --> 00:50:16,140
You can write data to it.
1057
00:50:16,140 --> 00:50:19,680
You can read data from it, and you can have multiple sheets, a.k.a.,
1058
00:50:19,680 --> 00:50:22,150
tables storing all of your data.
1059
00:50:22,150 --> 00:50:24,240
So whereas Excel and numbers in Google spreadsheet
1060
00:50:24,240 --> 00:50:27,090
are meant to be reused really by humans with their mouse and their keyboard,
1061
00:50:27,090 --> 00:50:29,850
clicking, and pointing, and manipulating things graphically,
1062
00:50:29,850 --> 00:50:32,160
a relational database using a language called
1063
00:50:32,160 --> 00:50:37,320
SQL is one in which the programmer has similar capabilities,
1064
00:50:37,320 --> 00:50:39,000
but doing so in code.
1065
00:50:39,000 --> 00:50:42,720
Specifically, using a language called SQL, and at a scale
1066
00:50:42,720 --> 00:50:45,670
that's much grander than spreadsheets alone.
1067
00:50:45,670 --> 00:50:48,420
In fact, if you try on your Mac or PC to open a spreadsheet that's
1068
00:50:48,420 --> 00:50:51,090
got tens of thousands of rows, it'll probably
1069
00:50:51,090 --> 00:50:54,780
work fine, hundreds of thousands of rows, millions of rows, no way.
1070
00:50:54,780 --> 00:50:57,000
At some point your Mac or PC is going to struggle
1071
00:50:57,000 --> 00:50:59,130
to open particularly large data sets.
1072
00:50:59,130 --> 00:51:01,620
And that, too, is where proper databases come
1073
00:51:01,620 --> 00:51:04,140
into play and proper languages for databases come
1074
00:51:04,140 --> 00:51:06,120
into play, when it's all about scale.
1075
00:51:06,120 --> 00:51:09,390
And indeed, most any mobile app or web app today that you or someone else
1076
00:51:09,390 --> 00:51:13,420
might write should probably plan on lots of data if it's successful.
1077
00:51:13,420 --> 00:51:15,730
So we need the right tools for that problem.
1078
00:51:15,730 --> 00:51:19,110
So fortunately, even though we're about to learn yet another language,
1079
00:51:19,110 --> 00:51:24,360
it only does four things fundamentally, known by this silly acronym, CRUD.
1080
00:51:24,360 --> 00:51:27,870
SQL, this language for databases, supports the ability
1081
00:51:27,870 --> 00:51:32,400
to create data, read data, update data, and delete data.
1082
00:51:32,400 --> 00:51:33,420
That's it.
1083
00:51:33,420 --> 00:51:36,690
There's a few more keywords that exist in this language called SQL
1084
00:51:36,690 --> 00:51:37,750
that we'll soon see.
1085
00:51:37,750 --> 00:51:39,458
But at the end of the day, even if you're
1086
00:51:39,458 --> 00:51:42,180
starting to feel like this is a lot very quickly,
1087
00:51:42,180 --> 00:51:44,940
it all boils down to these four basic operations.
1088
00:51:44,940 --> 00:51:47,640
And the four commands in SQL, if you will,
1089
00:51:47,640 --> 00:51:51,780
functions in a sense that implement those four ideas happen to be these.
1090
00:51:51,780 --> 00:51:54,270
They're almost the same but with some slight variance.
1091
00:51:54,270 --> 00:51:59,280
The ability to create or insert data is the C. The ability to select data
1092
00:51:59,280 --> 00:52:01,080
is the R, or read.
1093
00:52:01,080 --> 00:52:02,130
Update is the same.
1094
00:52:02,130 --> 00:52:05,048
Delete is the same, but drop is also a keyword as well.
1095
00:52:05,048 --> 00:52:06,840
So we'll see these and a few other keywords
1096
00:52:06,840 --> 00:52:10,410
in SQL that, at the end of the day, just allow you to create, read, and update
1097
00:52:10,410 --> 00:52:14,310
data using verbs, if you will, like these.
1098
00:52:14,310 --> 00:52:18,030
So to do that, what's the syntax going to be?
1099
00:52:18,030 --> 00:52:20,290
Well, we won't get into the weeds too quickly on this.
1100
00:52:20,290 --> 00:52:22,650
But here's a representative syntax of how
1101
00:52:22,650 --> 00:52:25,710
you can create using this language called SQL, in your very
1102
00:52:25,710 --> 00:52:28,020
own database, a brand new table.
1103
00:52:28,020 --> 00:52:30,910
This is so easy in Excel, and Google Spreadsheets, and Apple Numbers.
1104
00:52:30,910 --> 00:52:32,910
You want a new sheet, you click the plus button.
1105
00:52:32,910 --> 00:52:33,690
You get a new tab.
1106
00:52:33,690 --> 00:52:35,490
You give it a name, and boom, you're done.
1107
00:52:35,490 --> 00:52:40,050
In the world of programming, though, if you want to create the analogue of that
1108
00:52:40,050 --> 00:52:43,440
spreadsheet in the computer's memory, you create something called a table,
1109
00:52:43,440 --> 00:52:48,420
like a sheet, that has a name, and then in parentheses has one or more columns.
1110
00:52:48,420 --> 00:52:51,990
But unlike Google Spreadsheets, and Apple Numbers, and Excel,
1111
00:52:51,990 --> 00:52:55,073
you have to decide as the programmer what types of data
1112
00:52:55,073 --> 00:52:57,240
you're going to be storing in each of these columns.
1113
00:52:57,240 --> 00:52:59,430
Now even though Excel, and Google Spreadsheets,
1114
00:52:59,430 --> 00:53:03,310
and Numbers does allow you to format or present data in different ways,
1115
00:53:03,310 --> 00:53:07,680
it's not strongly typed data like it is, for instance, when we were using C.
1116
00:53:07,680 --> 00:53:10,200
And heck, even in Python there's underlying data types.
1117
00:53:10,200 --> 00:53:12,158
Even if you don't have to type them explicitly,
1118
00:53:12,158 --> 00:53:14,900
databases are going to want to know, are you storing integers?
1119
00:53:14,900 --> 00:53:16,640
Are you storing real numbers or floats?
1120
00:53:16,640 --> 00:53:17,630
Are you storing text?
1121
00:53:17,630 --> 00:53:18,140
Why?
1122
00:53:18,140 --> 00:53:20,960
Because especially as your data scales, the more hints
1123
00:53:20,960 --> 00:53:24,410
you give the database about your data, the more performance it can be,
1124
00:53:24,410 --> 00:53:27,500
the faster it can help you get at and store that data.
1125
00:53:27,500 --> 00:53:29,302
So types are about to be important again,
1126
00:53:29,302 --> 00:53:31,760
but there's not going to be that many of them, fortunately.
1127
00:53:31,760 --> 00:53:34,640
Now how can I go about converting, for instance, some real data,
1128
00:53:34,640 --> 00:53:37,490
like that from you, my favorites.csv file,
1129
00:53:37,490 --> 00:53:39,440
into a proper relational database?
1130
00:53:39,440 --> 00:53:42,650
Well, it turns out that using SQL I can do this
1131
00:53:42,650 --> 00:53:45,260
in VS Code on my own Mac, or PC, or in the cloud
1132
00:53:45,260 --> 00:53:48,455
here by just importing the CSV into a database.
1133
00:53:48,455 --> 00:53:50,330
We'll see eventually how to do this manually.
1134
00:53:50,330 --> 00:53:52,622
For now, I'm going to use more of an automated process.
1135
00:53:52,622 --> 00:53:54,680
So let me go over to VS Code here.
1136
00:53:54,680 --> 00:53:57,170
Let me type ls to see where we left off before.
1137
00:53:57,170 --> 00:54:01,005
I had two files favorites.csv, which I downloaded from Google Spreadsheets.
1138
00:54:01,005 --> 00:54:02,630
Recall that I made a couple of changes.
1139
00:54:02,630 --> 00:54:06,050
We deleted a couple of Thes from the file for The Office.
1140
00:54:06,050 --> 00:54:08,600
But this is the same file as before, and then we
1141
00:54:08,600 --> 00:54:11,210
have favorites.py, which we'll set aside for now.
1142
00:54:11,210 --> 00:54:14,870
I'm going to go ahead now and run a command SQLite3.
1143
00:54:14,870 --> 00:54:18,020
So in the world of relational databases, there's
1144
00:54:18,020 --> 00:54:23,030
many different products out there, many different software that
1145
00:54:23,030 --> 00:54:25,370
implements the SQL language.
1146
00:54:25,370 --> 00:54:26,600
Microsoft has their own.
1147
00:54:26,600 --> 00:54:30,080
There's something called MySQL that's been very popular for years.
1148
00:54:30,080 --> 00:54:32,120
Facebook, for instance, used it early on.
1149
00:54:32,120 --> 00:54:35,030
PostgreSQL, Microsoft Access Server, Oracle,
1150
00:54:35,030 --> 00:54:36,958
and maybe a whole bunch of other product names
1151
00:54:36,958 --> 00:54:38,750
you might have encountered over time, which
1152
00:54:38,750 --> 00:54:42,980
is to say there's many different types of tools, and servers,
1153
00:54:42,980 --> 00:54:44,990
and software in which you can use SQL.
1154
00:54:44,990 --> 00:54:47,780
We're going to use a very lightweight version of the SQL language
1155
00:54:47,780 --> 00:54:49,370
today called SQLite.
1156
00:54:49,370 --> 00:54:51,680
This is the version of SQL that's generally
1157
00:54:51,680 --> 00:54:54,020
used on iPhones and Android devices these days.
1158
00:54:54,020 --> 00:54:56,930
If you download an app that stores data like your own contacts,
1159
00:54:56,930 --> 00:54:59,000
typically is stored using SQLite.
1160
00:54:59,000 --> 00:55:02,710
Because it's fairly lightweight, but you can still store hundreds,
1161
00:55:02,710 --> 00:55:05,810
thousands, even tens of thousands of pieces of data
1162
00:55:05,810 --> 00:55:07,970
even using this lightweight version thereof.
1163
00:55:07,970 --> 00:55:10,790
SQLite3 is like version 3 of this tool.
1164
00:55:10,790 --> 00:55:16,340
We're going to go ahead and run SQLite3 with a file called favorites.db.
1165
00:55:16,340 --> 00:55:20,120
It's conventional in the world of SQLite to name your file something.db.
1166
00:55:20,120 --> 00:55:22,490
I'm going to create a database called favorites.db.
1167
00:55:22,490 --> 00:55:27,010
Once I'm inside of the program, now I'm going to go ahead and enter CSV Mode.
1168
00:55:27,010 --> 00:55:28,760
Again, not something you have to memorize,
1169
00:55:28,760 --> 00:55:30,468
just something you can look up as needed.
1170
00:55:30,468 --> 00:55:34,100
And then, I'm going to import favorites.csv
1171
00:55:34,100 --> 00:55:40,260
into a table, that is, a sheet, if you will, called favorites as well.
1172
00:55:40,260 --> 00:55:44,030
Now I'm going to hit Enter and I'm going to go ahead and exit the program
1173
00:55:44,030 --> 00:55:45,650
altogether and type ls.
1174
00:55:45,650 --> 00:55:48,170
Now I have three files in my current directory--
1175
00:55:48,170 --> 00:55:52,130
the CSV file, the Python file from before, and now favorites.db.
1176
00:55:52,130 --> 00:55:56,180
But if I did this right, all of the data you all typed into the CSV file
1177
00:55:56,180 --> 00:55:59,840
has now been loaded into a proper database where I can now use
1178
00:55:59,840 --> 00:56:03,180
this SQL language to access it instead.
1179
00:56:03,180 --> 00:56:07,730
So let's go ahead again and run SQLite3 of favorites.db, which now exists.
1180
00:56:07,730 --> 00:56:10,640
And now, at the SQLite prompt I can start
1181
00:56:10,640 --> 00:56:13,160
to play around and see what this data is.
1182
00:56:13,160 --> 00:56:16,610
For instance, I can look, by typing .schema,
1183
00:56:16,610 --> 00:56:19,362
at what the schema is of my data, what's the design.
1184
00:56:19,362 --> 00:56:22,070
Now no thought was put into the design of this data at the moment
1185
00:56:22,070 --> 00:56:23,900
because I automated the whole process.
1186
00:56:23,900 --> 00:56:26,750
Once we start creating our own databases we'll
1187
00:56:26,750 --> 00:56:29,750
give more thought to the data types and the columns that we have.
1188
00:56:29,750 --> 00:56:34,220
But we can see what SQLite presumed I wanted just
1189
00:56:34,220 --> 00:56:36,530
by importing the data by default.
1190
00:56:36,530 --> 00:56:41,120
What the import command did for me a moment ago is essentially the syntax.
1191
00:56:41,120 --> 00:56:44,510
It automated the process of creating a table, if it doesn't exist,
1192
00:56:44,510 --> 00:56:45,770
called favorites.
1193
00:56:45,770 --> 00:56:48,980
And then notice, in parentheses it gave me three columns--
1194
00:56:48,980 --> 00:56:53,360
timestamp, title, and genres, which were inferred, obviously, from the CSV.
1195
00:56:53,360 --> 00:56:56,000
All three of which have been decreed to be text.
1196
00:56:56,000 --> 00:56:59,180
Again, once we're more comfortable we'll create our own tables,
1197
00:56:59,180 --> 00:57:01,010
choose our own types and column names.
1198
00:57:01,010 --> 00:57:03,350
But for now, I just automated the whole process just
1199
00:57:03,350 --> 00:57:08,120
to get us started by using this built-in import command as well.
1200
00:57:08,120 --> 00:57:08,810
All right.
1201
00:57:08,810 --> 00:57:11,630
So what now can I begin to do?
1202
00:57:11,630 --> 00:57:16,910
Well, if I wanted to, for instance, start playing around with data therein,
1203
00:57:16,910 --> 00:57:19,595
I might execute a couple of different commands.
1204
00:57:19,595 --> 00:57:23,000
1205
00:57:23,000 --> 00:57:28,421
Let me find the right one here-- one of which would be select.
1206
00:57:28,421 --> 00:57:31,610
Select being one of our most versatile tools
1207
00:57:31,610 --> 00:57:33,180
to select data from this database.
1208
00:57:33,180 --> 00:57:35,720
So if I have these three columns here-- timestamp,
1209
00:57:35,720 --> 00:57:39,020
title, and genres, suppose I want to select all of the titles.
1210
00:57:39,020 --> 00:57:43,790
Doing that earlier in Python required importing the CSV library,
1211
00:57:43,790 --> 00:57:48,740
opening the file, creating a reader or a DictReader, iterating over every row,
1212
00:57:48,740 --> 00:57:51,500
adding every title to a dictionary or just printing it out,
1213
00:57:51,500 --> 00:57:52,310
and dot, dot, dot.
1214
00:57:52,310 --> 00:57:55,170
There was a dozen or so lines of code when we first began.
1215
00:57:55,170 --> 00:57:56,840
Now, how about this?
1216
00:57:56,840 --> 00:58:01,220
Select title from favorites, semicolon, done.
1217
00:58:01,220 --> 00:58:05,570
So now, with this particular language, the output is very textual
1218
00:58:05,570 --> 00:58:08,930
and it's simulating what it looks like if it were more graphical by creating
1219
00:58:08,930 --> 00:58:10,870
this table, so to speak.
1220
00:58:10,870 --> 00:58:14,080
Select title from favorites is a distillation
1221
00:58:14,080 --> 00:58:17,530
in a different language called SQL of all the lines of code
1222
00:58:17,530 --> 00:58:20,740
I wrote early on when we first started playing with favorites.py.
1223
00:58:20,740 --> 00:58:25,540
SQL is therefore optimized for reading, and creating, and updating,
1224
00:58:25,540 --> 00:58:27,500
and ultimately, deleting data.
1225
00:58:27,500 --> 00:58:30,700
So here's perhaps a better tool for the job once you have the data.
1226
00:58:30,700 --> 00:58:34,030
Tossing it into a more powerful, versatile format
1227
00:58:34,030 --> 00:58:37,228
might allow you now to get more work done more quickly
1228
00:58:37,228 --> 00:58:38,770
without having to reinvent the wheel.
1229
00:58:38,770 --> 00:58:41,510
Someone else has figured out how to select data like this.
1230
00:58:41,510 --> 00:58:43,760
What more can I do here?
1231
00:58:43,760 --> 00:58:47,050
Well, let me go ahead and pull up, in a moment, just a little bit
1232
00:58:47,050 --> 00:58:49,390
of a cheat sheet here.
1233
00:58:49,390 --> 00:58:53,840
Give me one second to find this.
1234
00:58:53,840 --> 00:58:58,090
So suppose I want to now select data a little more powerfully.
1235
00:58:58,090 --> 00:59:00,220
So here's what I just did in a canonical way.
1236
00:59:00,220 --> 00:59:01,720
So select typically works like this.
1237
00:59:01,720 --> 00:59:05,860
You select columns from a specific table, semicolon.
1238
00:59:05,860 --> 00:59:08,260
Unfortunately, stupid semicolons are back.
1239
00:59:08,260 --> 00:59:12,710
Select columns from table then, is the generic form of what I just did.
1240
00:59:12,710 --> 00:59:17,122
More specifically, I selected one column called title from favorites.
1241
00:59:17,122 --> 00:59:18,580
Favorites is the name of the table.
1242
00:59:18,580 --> 00:59:19,690
Semicolon ends my thought.
1243
00:59:19,690 --> 00:59:23,440
Suppose I wanted to get two things, like the genres that each of you inputted.
1244
00:59:23,440 --> 00:59:28,420
I could instead do select title, comma, genres from favorites,
1245
00:59:28,420 --> 00:59:30,220
and then, a semicolon, and Enter.
1246
00:59:30,220 --> 00:59:32,110
It's going to look a little ugly on my screen
1247
00:59:32,110 --> 00:59:33,670
because some of these titles and--
1248
00:59:33,670 --> 00:59:37,300
OK, one of you really went all out with Community.
1249
00:59:37,300 --> 00:59:40,660
You can see that it's just wrapping in an ugly way,
1250
00:59:40,660 --> 00:59:43,300
but it's just now showing me two columns.
1251
00:59:43,300 --> 00:59:46,840
If we scroll up to the very top again, the left most of one,
1252
00:59:46,840 --> 00:59:48,280
Black Mirror went all out, too.
1253
00:59:48,280 --> 00:59:49,150
Thank you.
1254
00:59:49,150 --> 00:59:52,000
And now, OK, we're going to have to clean some of these up.
1255
00:59:52,000 --> 00:59:54,115
Game of Thrones, good comedy, yes.
1256
00:59:54,115 --> 00:59:57,550
1257
00:59:57,550 --> 00:59:59,480
Keep going, keep going, keep going.
1258
00:59:59,480 --> 01:00:02,870
So now we've selected two of the columns that we care about.
1259
01:00:02,870 --> 01:00:03,370
There it is.
1260
01:00:03,370 --> 01:00:06,380
OK, so it's crazy wide because of all of those genres.
1261
01:00:06,380 --> 01:00:09,135
But it allows me to select exactly the data I want.
1262
01:00:09,135 --> 01:00:12,010
Let's go back to the titles, though, and perhaps start playing around
1263
01:00:12,010 --> 01:00:13,750
with some modifiers here.
1264
01:00:13,750 --> 01:00:18,310
For instance, it turns out, using SQL there's a lot of functionality
1265
01:00:18,310 --> 01:00:19,900
built into the language.
1266
01:00:19,900 --> 01:00:22,843
You've got a lot of functions, similar to Excel or Google Spreadsheets
1267
01:00:22,843 --> 01:00:24,010
where you can have formulas.
1268
01:00:24,010 --> 01:00:26,320
SQL provides you with some of the same heuristics that
1269
01:00:26,320 --> 01:00:30,350
allow you to apply operations like these on entire columns.
1270
01:00:30,350 --> 01:00:32,920
For instance, you can take averages, count the total,
1271
01:00:32,920 --> 01:00:36,010
get the distinct values, force things to lowercase, uppercase, min,
1272
01:00:36,010 --> 01:00:37,220
and max, and so forth.
1273
01:00:37,220 --> 01:00:39,610
So let's try distinct, for instance.
1274
01:00:39,610 --> 01:00:43,450
Let me go back to my Terminal, and let's say, select,
1275
01:00:43,450 --> 01:00:48,760
how about the distinct titles from the favorites table?
1276
01:00:48,760 --> 01:00:49,513
Enter.
1277
01:00:49,513 --> 01:00:51,430
I didn't bother selecting the genres because I
1278
01:00:51,430 --> 01:00:52,763
want it to be a little prettier.
1279
01:00:52,763 --> 01:00:58,090
And you can see here that we have just the distinct titles,
1280
01:00:58,090 --> 01:01:00,547
except for issues of formatting.
1281
01:01:00,547 --> 01:01:02,380
So whitespace is going to be an issue again.
1282
01:01:02,380 --> 01:01:04,213
Capitalization is going to be a thing again.
1283
01:01:04,213 --> 01:01:05,230
So there's a trade-off.
1284
01:01:05,230 --> 01:01:09,280
One of the things I was doing in Python was forcing everything to uppercase
1285
01:01:09,280 --> 01:01:10,930
and then getting rid of whitespace.
1286
01:01:10,930 --> 01:01:12,400
But we could combine some of these.
1287
01:01:12,400 --> 01:01:15,340
I could do something like force every title to uppercase,
1288
01:01:15,340 --> 01:01:16,507
then get the distinct value.
1289
01:01:16,507 --> 01:01:19,382
And that's actually going to get rid of some of those values as well.
1290
01:01:19,382 --> 01:01:21,860
And again, I did it all in one simple line that was fast.
1291
01:01:21,860 --> 01:01:24,027
So let me pull up at the bottom of the screen again.
1292
01:01:24,027 --> 01:01:28,090
I selected distinct upper titles from favorites,
1293
01:01:28,090 --> 01:01:31,070
and that did everything for me at once in just one breath.
1294
01:01:31,070 --> 01:01:33,640
Suppose I want to get the total number of counts of titles.
1295
01:01:33,640 --> 01:01:40,150
How about select count of all of those titles from favorites?
1296
01:01:40,150 --> 01:01:43,990
Semicolon, Enter, and now you get back a mini table
1297
01:01:43,990 --> 01:01:47,960
that contains just your answer, 158 in this case.
1298
01:01:47,960 --> 01:01:50,560
So that's the total number of, not distinct,
1299
01:01:50,560 --> 01:01:52,690
but total titles that we had in the file.
1300
01:01:52,690 --> 01:01:56,560
And we could continue to manipulate the data further using, again,
1301
01:01:56,560 --> 01:01:58,450
functions like these here.
1302
01:01:58,450 --> 01:02:01,550
But there's also additional filtration we can do.
1303
01:02:01,550 --> 01:02:07,010
We can also qualify our selections by saying where some condition is true.
1304
01:02:07,010 --> 01:02:10,630
So just as in Scratch, and C, and Python, you have Boolean expressions,
1305
01:02:10,630 --> 01:02:16,270
you can have the same in SQL as well, where I can filter my data where
1306
01:02:16,270 --> 01:02:19,390
something is true or false.
1307
01:02:19,390 --> 01:02:21,340
Like allows me to do approximations.
1308
01:02:21,340 --> 01:02:23,500
If I want to get something that's like The Office
1309
01:02:23,500 --> 01:02:26,320
but not necessarily T-H-E, space, Office,
1310
01:02:26,320 --> 01:02:29,440
I could do pattern matching using like here.
1311
01:02:29,440 --> 01:02:33,170
Order by, limit, and grouped by are other commands I can execute, too.
1312
01:02:33,170 --> 01:02:35,980
So let me go back and do a couple of these here.
1313
01:02:35,980 --> 01:02:42,610
How about, let me just get, oh, I don't know, all of the titles from favorites
1314
01:02:42,610 --> 01:02:44,848
but limit it to 10 results.
1315
01:02:44,848 --> 01:02:47,890
That might be one thing that's helpful to see if you just care about some
1316
01:02:47,890 --> 01:02:50,110
of the data at the top there instead.
1317
01:02:50,110 --> 01:02:56,530
How about, select all of the titles from favorites, where the title itself
1318
01:02:56,530 --> 01:02:59,710
is like, quote, unquote, "Office?"
1319
01:02:59,710 --> 01:03:02,740
And this will give me only two answers.
1320
01:03:02,740 --> 01:03:07,180
Those are the two rows, recall, that I mutated by getting rid of the word The.
1321
01:03:07,180 --> 01:03:11,880
Notice that like allows me too tolerate uppercase and lowercase.
1322
01:03:11,880 --> 01:03:14,880
Because if I instead just use the equal sign,
1323
01:03:14,880 --> 01:03:21,060
and in SQL a single equal sign does, in fact, mean equality.
1324
01:03:21,060 --> 01:03:23,370
For comparison's sake, it's not doing assignment.
1325
01:03:23,370 --> 01:03:26,070
This is not how you assign data in SQL.
1326
01:03:26,070 --> 01:03:27,940
I got back no answers there.
1327
01:03:27,940 --> 01:03:31,620
So indeed, the equal sign is giving me literal answers
1328
01:03:31,620 --> 01:03:33,990
that searches just for what I typed in.
1329
01:03:33,990 --> 01:03:35,370
How could I get all of these?
1330
01:03:35,370 --> 01:03:39,090
Well, similar in spirit to regular expressions but not quite as powerful
1331
01:03:39,090 --> 01:03:41,620
in SQL, I could do something like this.
1332
01:03:41,620 --> 01:03:45,330
I can select the title from favorites where the title is like, quote,
1333
01:03:45,330 --> 01:03:46,710
unquote, "Office."
1334
01:03:46,710 --> 01:03:52,450
But I can add, a bit weirdly, percent signs to the left and the right.
1335
01:03:52,450 --> 01:03:57,930
So the language SQL supports the same notion of pattern matching
1336
01:03:57,930 --> 01:03:59,547
but much more limited out of the box.
1337
01:03:59,547 --> 01:04:01,380
If we want more powerful regular expressions
1338
01:04:01,380 --> 01:04:03,420
we probably do want to use Python instead.
1339
01:04:03,420 --> 01:04:06,720
But the percent sign here means 0 or more characters
1340
01:04:06,720 --> 01:04:09,340
on the left, 0 or more characters on the right.
1341
01:04:09,340 --> 01:04:14,460
So this will just grab any title that contains O-F-F-I-C-E in it in that
1342
01:04:14,460 --> 01:04:15,100
order.
1343
01:04:15,100 --> 01:04:19,290
And now I get all 16, it would seem, of those results, again.
1344
01:04:19,290 --> 01:04:20,490
How do I know it's 16?
1345
01:04:20,490 --> 01:04:23,160
Well, I can just get the count of those titles
1346
01:04:23,160 --> 01:04:26,140
and get back that answer instead as well.
1347
01:04:26,140 --> 01:04:29,640
So again, it takes some getting used to, the vocabulary
1348
01:04:29,640 --> 01:04:30,973
and the syntax that you can use.
1349
01:04:30,973 --> 01:04:32,682
There's these building blocks and others.
1350
01:04:32,682 --> 01:04:35,500
But SQL is really designed, again, for creating, reading, updating,
1351
01:04:35,500 --> 01:04:36,480
and deleting data.
1352
01:04:36,480 --> 01:04:41,080
For instance, I've never really been a fan of Friends, for instance.
1353
01:04:41,080 --> 01:04:46,860
So right now if I do select, how about title from favorites
1354
01:04:46,860 --> 01:04:53,340
where title like, quote, unquote, Friends with the percent signs?
1355
01:04:53,340 --> 01:04:55,330
We can see that there's a whole bunch of them.
1356
01:04:55,330 --> 01:04:56,520
That's how many exactly.
1357
01:04:56,520 --> 01:04:57,960
Let's just do a quick count.
1358
01:04:57,960 --> 01:04:59,890
So that's nine of them.
1359
01:04:59,890 --> 01:05:03,300
Well, delete from favorites.
1360
01:05:03,300 --> 01:05:10,500
OK, you and me, delete from favorites, where title like Friends, Enter.
1361
01:05:10,500 --> 01:05:14,280
Nothing seems to happen, but bye-bye Friends.
1362
01:05:14,280 --> 01:05:15,060
[APPLAUSE]
1363
01:05:15,060 --> 01:05:16,143
DAVID J. MALAN: Thank you.
1364
01:05:16,143 --> 01:05:19,110
1365
01:05:19,110 --> 01:05:21,390
So now we've actually changed the data.
1366
01:05:21,390 --> 01:05:25,110
And this is what's compelling about a proper database.
1367
01:05:25,110 --> 01:05:29,320
Yes, you could technically write Python code that not only reads the CSV file,
1368
01:05:29,320 --> 01:05:30,540
but also writes it.
1369
01:05:30,540 --> 01:05:32,820
You can change using quote, unquote, "A" for append,
1370
01:05:32,820 --> 01:05:35,910
or quote, unquote, "W" for write, instead of quote, unquote,
1371
01:05:35,910 --> 01:05:37,263
"R" for read alone.
1372
01:05:37,263 --> 01:05:39,930
But it's definitely a little more involved to do that in Python.
1373
01:05:39,930 --> 01:05:42,250
But with SQL, you can update the data in real time.
1374
01:05:42,250 --> 01:05:45,750
And if I were actually running a web application here or a database
1375
01:05:45,750 --> 01:05:47,880
for a mobile app, that change, theoretically,
1376
01:05:47,880 --> 01:05:50,160
would be reflected everywhere on your own devices
1377
01:05:50,160 --> 01:05:52,210
if you're somehow talking to this application.
1378
01:05:52,210 --> 01:05:53,995
So that's the direction we're headed.
1379
01:05:53,995 --> 01:05:55,620
This other thing has been bothering me.
1380
01:05:55,620 --> 01:06:02,640
So select, how about title from favorites, where title equals,
1381
01:06:02,640 --> 01:06:03,180
what was it?
1382
01:06:03,180 --> 01:06:06,600
The V Office, was it?
1383
01:06:06,600 --> 01:06:07,680
Yeah, it was that one.
1384
01:06:07,680 --> 01:06:12,150
How about we update favorites by setting title
1385
01:06:12,150 --> 01:06:19,080
equal to The Office, where title equals quote, unquote, "The V Office"
1386
01:06:19,080 --> 01:06:20,470
semicolon?
1387
01:06:20,470 --> 01:06:22,650
And now, if I select the same thing again
1388
01:06:22,650 --> 01:06:24,810
I can go up and down with my arrow keys quickly.
1389
01:06:24,810 --> 01:06:27,090
Now there is no The V Office.
1390
01:06:27,090 --> 01:06:29,200
We've actually changed that value.
1391
01:06:29,200 --> 01:06:30,300
How about genres?
1392
01:06:30,300 --> 01:06:35,880
Select genres from favorites, where the title is title
1393
01:06:35,880 --> 01:06:39,450
equals Game of Thrones, semicolon.
1394
01:06:39,450 --> 01:06:43,000
These were kind of long, and I don't really agree with all of that.
1395
01:06:43,000 --> 01:06:49,560
So how about we update favorites, set genres equal to, sure,
1396
01:06:49,560 --> 01:06:52,650
action, adventure, sure, drama?
1397
01:06:52,650 --> 01:06:54,390
OK, so it's a decent list.
1398
01:06:54,390 --> 01:06:56,700
Fantasy, sure, thriller, war.
1399
01:06:56,700 --> 01:07:01,050
OK, anything really but comedy, I would say.
1400
01:07:01,050 --> 01:07:03,160
Let's go ahead and hit Enter now.
1401
01:07:03,160 --> 01:07:07,800
And now, if I select genres again, same query, now we've canonicalized that.
1402
01:07:07,800 --> 01:07:09,160
We've thrown data away.
1403
01:07:09,160 --> 01:07:11,250
So whether or not that is right is probably
1404
01:07:11,250 --> 01:07:13,020
a bit subjective and argumentative.
1405
01:07:13,020 --> 01:07:16,920
But I have at least cleaned up my data, which is, again, the U in CRUD.
1406
01:07:16,920 --> 01:07:20,670
Create, read, update, delete, you can do it that easily.
1407
01:07:20,670 --> 01:07:22,290
Beware using delete.
1408
01:07:22,290 --> 01:07:26,430
Beware worse using drop, whereby you can drop an entire table.
1409
01:07:26,430 --> 01:07:29,310
But via these kinds of commands, can we actually now
1410
01:07:29,310 --> 01:07:33,390
manipulate our data much more rapidly and with single thoughts.
1411
01:07:33,390 --> 01:07:36,390
And in fact, if you're an aspiring statistician, or data scientist,
1412
01:07:36,390 --> 01:07:40,320
or analyst in the real world, SQL is such a commonly used language
1413
01:07:40,320 --> 01:07:43,573
because it allows you to really dive into data quickly, and ask
1414
01:07:43,573 --> 01:07:45,990
questions of the data, and get back answers quite quickly.
1415
01:07:45,990 --> 01:07:47,530
And this is a simple data set.
1416
01:07:47,530 --> 01:07:51,840
You can do this with much larger data sets as we soon will, too.
1417
01:07:51,840 --> 01:07:55,050
Or any questions on what we've seen of SQL thus far?
1418
01:07:55,050 --> 01:07:57,170
Only scratched the surface, but again, it
1419
01:07:57,170 --> 01:08:03,100
boils down to creating, reading, updating, and deleting data.
1420
01:08:03,100 --> 01:08:04,910
Questions here?
1421
01:08:04,910 --> 01:08:05,410
All right.
1422
01:08:05,410 --> 01:08:07,820
Well, let's consider the design of this data.
1423
01:08:07,820 --> 01:08:11,900
Recall that if I do .schema, that shows me the design of my table,
1424
01:08:11,900 --> 01:08:13,990
the so-called schema of my data.
1425
01:08:13,990 --> 01:08:15,010
This is OK.
1426
01:08:15,010 --> 01:08:17,649
It gets the job done, and frankly, everything the user typed in
1427
01:08:17,649 --> 01:08:21,609
was arguably text, including the timestamp, which is the date and time.
1428
01:08:21,609 --> 01:08:24,040
But so the data set itself is somewhat simple.
1429
01:08:24,040 --> 01:08:29,569
But if we look at the data set itself, especially genres, let's do this.
1430
01:08:29,569 --> 01:08:31,960
Select genres from favorites.
1431
01:08:31,960 --> 01:08:34,540
And let me point out one other thing stylistically, too.
1432
01:08:34,540 --> 01:08:39,050
I am very deliberately capitalizing all of the special SQL keywords,
1433
01:08:39,050 --> 01:08:42,700
and I'm lowercasing all of the column names and the table names.
1434
01:08:42,700 --> 01:08:45,760
This is a convention, and honestly, it just helps you read,
1435
01:08:45,760 --> 01:08:49,210
I think, the code when you're co-mingling your names for columns
1436
01:08:49,210 --> 01:08:52,600
and tables with proper SQL keywords.
1437
01:08:52,600 --> 01:08:58,029
But I could just as easily do select genres from favorites,
1438
01:08:58,029 --> 01:09:01,149
but again, the SQL specific keywords don't quite jump out as much.
1439
01:09:01,149 --> 01:09:04,240
So stylistically, we would recommend this, selecting genres
1440
01:09:04,240 --> 01:09:06,250
from favorites, semicolon.
1441
01:09:06,250 --> 01:09:09,370
So here is where--
1442
01:09:09,370 --> 01:09:11,560
oh.
1443
01:09:11,560 --> 01:09:12,850
OK, that was not intended.
1444
01:09:12,850 --> 01:09:15,220
I accidentally made every show, including
1445
01:09:15,220 --> 01:09:20,410
The Office about action, adventure, drama, fantasy, thriller, and war.
1446
01:09:20,410 --> 01:09:24,460
How did I do that accidentally?
1447
01:09:24,460 --> 01:09:25,600
What did I do wrong?
1448
01:09:25,600 --> 01:09:30,234
AUDIENCE: [INAUDIBLE]
1449
01:09:30,234 --> 01:09:31,109
DAVID J. MALAN: Yeah.
1450
01:09:31,109 --> 01:09:32,460
So beware, this is funny.
1451
01:09:32,460 --> 01:09:34,359
I think I did say beware around this time.
1452
01:09:34,359 --> 01:09:37,810
So the SQL database took me-- literally, I updated favorites,
1453
01:09:37,810 --> 01:09:41,010
setting genres equal to that, semicolon, end of thought.
1454
01:09:41,010 --> 01:09:43,500
I really wanted to say where title equals,
1455
01:09:43,500 --> 01:09:46,020
quote, unquote, "Game of Thrones."
1456
01:09:46,020 --> 01:09:49,080
Unfortunately, there isn't an undo command or time machine
1457
01:09:49,080 --> 01:09:51,930
with a SQL database, so the best we can do here
1458
01:09:51,930 --> 01:09:56,250
is, let's actually get rid of favorites.db.
1459
01:09:56,250 --> 01:10:02,340
Let's run SQLite of favorites.db again, which now will be recreated.
1460
01:10:02,340 --> 01:10:04,440
Let me change myself into CSV mode.
1461
01:10:04,440 --> 01:10:09,930
Let me import, into my favorites table, the CSV file.
1462
01:10:09,930 --> 01:10:14,250
And now, Friends is back, for better or for worse,
1463
01:10:14,250 --> 01:10:15,510
but so are all of our genres.
1464
01:10:15,510 --> 01:10:18,090
1465
01:10:18,090 --> 01:10:21,330
If I now reload the file and do select, star, from--
1466
01:10:21,330 --> 01:10:21,840
sorry.
1467
01:10:21,840 --> 01:10:25,838
Select genres from favorites, that was the result I was getting.
1468
01:10:25,838 --> 01:10:28,630
It's much messier, but that's because some of these are quite long.
1469
01:10:28,630 --> 01:10:30,297
But now we're back to the original data.
1470
01:10:30,297 --> 01:10:32,700
Lesson here, be sure to back up your work.
1471
01:10:32,700 --> 01:10:33,240
All right.
1472
01:10:33,240 --> 01:10:36,850
So what more can we now do with this data?
1473
01:10:36,850 --> 01:10:40,480
Well, I don't love the design of the genres table for a couple of reasons.
1474
01:10:40,480 --> 01:10:43,200
One, we didn't have any sort of validation,
1475
01:10:43,200 --> 01:10:45,210
but user input is going to be messy.
1476
01:10:45,210 --> 01:10:47,790
There's just a lot of redundancy in here.
1477
01:10:47,790 --> 01:10:49,720
Let's go ahead and do this.
1478
01:10:49,720 --> 01:10:51,760
Let me select all the comedies you all typed in.
1479
01:10:51,760 --> 01:10:57,960
So select title from favorites, where genres equals,
1480
01:10:57,960 --> 01:11:00,120
quote, unquote, "comedy."
1481
01:11:00,120 --> 01:11:05,730
OK, so there's all of the shows that are explicitly comedies.
1482
01:11:05,730 --> 01:11:08,940
But I think there might actually be others.
1483
01:11:08,940 --> 01:11:11,350
Let me scroll back up here.
1484
01:11:11,350 --> 01:11:12,180
Comedy, drama.
1485
01:11:12,180 --> 01:11:14,040
What was a comedy and a drama?
1486
01:11:14,040 --> 01:11:19,140
How about let's search for the-- oops, let me copy paste comedy, comma, drama.
1487
01:11:19,140 --> 01:11:23,700
OK, so The Office, in this case, was considered comedy and drama, Billions,
1488
01:11:23,700 --> 01:11:27,010
It's Always Sunny in Philadelphia, and Gilmore Girls as well.
1489
01:11:27,010 --> 01:11:31,450
But notice that I get many more when I just search for comedy.
1490
01:11:31,450 --> 01:11:36,450
So the catch here is that, because I have all of these genres implemented
1491
01:11:36,450 --> 01:11:39,030
the way Google did, as a comma-separated list,
1492
01:11:39,030 --> 01:11:43,590
it's actually really hard and messy to get at any show, all of the shows
1493
01:11:43,590 --> 01:11:46,830
that are somewhere described as comedy.
1494
01:11:46,830 --> 01:11:49,680
Because if I search for quote, unquote, "comedy," the only answers
1495
01:11:49,680 --> 01:11:53,610
I'm going to get are this one, whatever that show is, this one, whatever
1496
01:11:53,610 --> 01:11:55,240
that show is, this one.
1497
01:11:55,240 --> 01:11:56,670
But I'm not going to get this one.
1498
01:11:56,670 --> 01:11:58,170
I'm not going to get this one.
1499
01:11:58,170 --> 01:11:59,070
Why?
1500
01:11:59,070 --> 01:12:02,790
If I'm searching for, where genres equals, quote, unquote, "comedy,"
1501
01:12:02,790 --> 01:12:04,350
why am I missing those other shows?
1502
01:12:04,350 --> 01:12:07,202
1503
01:12:07,202 --> 01:12:07,910
Why am I missing?
1504
01:12:07,910 --> 01:12:08,595
Yeah?
1505
01:12:08,595 --> 01:12:11,020
AUDIENCE: [INAUDIBLE]
1506
01:12:11,020 --> 01:12:12,020
DAVID J. MALAN: Exactly.
1507
01:12:12,020 --> 01:12:14,390
It's not just a comedy, it's a comedy and a drama,
1508
01:12:14,390 --> 01:12:16,680
and a comedy or a news show, and so forth.
1509
01:12:16,680 --> 01:12:20,510
So I have to search for these commas, so this gets messy quickly, right?
1510
01:12:20,510 --> 01:12:22,560
Let me copy this so I can do this.
1511
01:12:22,560 --> 01:12:26,150
Let me search for where genres equals comedy.
1512
01:12:26,150 --> 01:12:33,230
How about, or genres equals comedy, drama, or genres
1513
01:12:33,230 --> 01:12:36,650
equals this whole thing, comedy, news, talk show?
1514
01:12:36,650 --> 01:12:38,370
I'm going to get more and more results.
1515
01:12:38,370 --> 01:12:39,900
But that's not going to scale well.
1516
01:12:39,900 --> 01:12:42,810
What could I do instead of enumerating with ors
1517
01:12:42,810 --> 01:12:45,800
all of the different permutations of genres, do you think?
1518
01:12:45,800 --> 01:12:49,655
AUDIENCE: [INAUDIBLE]
1519
01:12:49,655 --> 01:12:50,530
DAVID J. MALAN: Yeah.
1520
01:12:50,530 --> 01:12:54,430
So I could use the keyword is, similar in Python to the word in.
1521
01:12:54,430 --> 01:12:56,980
I could use the like keyword so that so long
1522
01:12:56,980 --> 01:13:02,080
as the genres is like comedy somewhere in there,
1523
01:13:02,080 --> 01:13:05,900
that's going to give me all of them, so long as the word comedy is in there.
1524
01:13:05,900 --> 01:13:09,265
But let me go ahead and just open the form from earlier.
1525
01:13:09,265 --> 01:13:12,250
1526
01:13:12,250 --> 01:13:15,073
Let me see if I can open this real quick before I toggle over.
1527
01:13:15,073 --> 01:13:16,990
If we look back at the form, recall that there
1528
01:13:16,990 --> 01:13:22,630
were all of those radio buttons asking for the specific genres
1529
01:13:22,630 --> 01:13:24,530
into which something fell.
1530
01:13:24,530 --> 01:13:29,710
And if I open this, let me full screen here and now open the original form.
1531
01:13:29,710 --> 01:13:32,920
You'll see all of the genres here, none of which
1532
01:13:32,920 --> 01:13:38,680
are that worrisome except for a corner case is jumping out at me.
1533
01:13:38,680 --> 01:13:43,330
Where might the like keyword alone get me into trouble?
1534
01:13:43,330 --> 01:13:44,380
It's not with comedy.
1535
01:13:44,380 --> 01:13:45,880
I'm OK with comedy.
1536
01:13:45,880 --> 01:13:47,680
AUDIENCE: Music and musical?
1537
01:13:47,680 --> 01:13:51,190
DAVID J. MALAN: Yeah, music and musical are deliberately on the list here.
1538
01:13:51,190 --> 01:13:53,660
Because, one, they're separate genres.
1539
01:13:53,660 --> 01:13:56,310
But if I just search for something that's like music,
1540
01:13:56,310 --> 01:13:58,810
I'm going to accidentally suck in all of the musicals, which
1541
01:13:58,810 --> 01:13:59,950
might not be what I intend.
1542
01:13:59,950 --> 01:14:03,310
If music is a music video or whatever, and musical is actually
1543
01:14:03,310 --> 01:14:06,620
a different type of show, I don't want to just do that.
1544
01:14:06,620 --> 01:14:08,110
So it seems just very messy.
1545
01:14:08,110 --> 01:14:11,682
I could probably hack something together with-- maybe add some commas in there,
1546
01:14:11,682 --> 01:14:12,640
or something like this.
1547
01:14:12,640 --> 01:14:15,520
But this is just not a good design for the data.
1548
01:14:15,520 --> 01:14:17,890
Google has done it this way because it's just
1549
01:14:17,890 --> 01:14:22,000
simple to actually keep the user's data all in a single column,
1550
01:14:22,000 --> 01:14:24,520
and just as they did, separate it by commas.
1551
01:14:24,520 --> 01:14:28,720
But this is a real messy way to use CSV is
1552
01:14:28,720 --> 01:14:32,830
by putting comma-separated values in your comma-separated values.
1553
01:14:32,830 --> 01:14:35,350
Arguably, the folks at Google probably just did this
1554
01:14:35,350 --> 01:14:36,520
because it's just simpler.
1555
01:14:36,520 --> 01:14:38,645
And they didn't want to give people multiple sheets
1556
01:14:38,645 --> 01:14:42,220
or complicate things using some other weirder character than commas alone.
1557
01:14:42,220 --> 01:14:44,450
But I bet there's a better way for us to do this.
1558
01:14:44,450 --> 01:14:45,860
And let me go ahead and do this.
1559
01:14:45,860 --> 01:14:47,978
Let me go back into my code here.
1560
01:14:47,978 --> 01:14:50,020
And in just a moment, I'm going to grab a program
1561
01:14:50,020 --> 01:14:54,250
that I wrote in advance that's going to use Python to open up the CSV file,
1562
01:14:54,250 --> 01:14:59,110
iterate over all of the rows, and load the data into two tables this time,
1563
01:14:59,110 --> 01:15:02,630
two tables, one called shows, and one called genres,
1564
01:15:02,630 --> 01:15:05,140
so as to actually separate these two things out.
1565
01:15:05,140 --> 01:15:07,730
Give me just a moment to grab the code.
1566
01:15:07,730 --> 01:15:10,720
And when I run this, I'll only have to run it once.
1567
01:15:10,720 --> 01:15:13,090
Let me go ahead and run Python in a moment,
1568
01:15:13,090 --> 01:15:15,940
and I'll reveal the results in a sec.
1569
01:15:15,940 --> 01:15:18,790
This is going to be version 8 of the code online.
1570
01:15:18,790 --> 01:15:22,480
When I do this, let me go ahead and open up this file.
1571
01:15:22,480 --> 01:15:26,260
Give me a second to move it into this directory.
1572
01:15:26,260 --> 01:15:28,570
Version 8, OK.
1573
01:15:28,570 --> 01:15:31,515
So here we have version 8 of this that's available online
1574
01:15:31,515 --> 01:15:32,890
that's going to do the following.
1575
01:15:32,890 --> 01:15:34,848
And I'll gloss over some of the details just so
1576
01:15:34,848 --> 01:15:38,740
that we don't get stuck in the weeds of some of this code.
1577
01:15:38,740 --> 01:15:41,380
I'm going to be using, at the top of this program,
1578
01:15:41,380 --> 01:15:45,820
as we'll soon see, a CS50 library, not for the sake of get_string,
1579
01:15:45,820 --> 01:15:48,730
or get_int, or get_float, but because there's some built-in SQL
1580
01:15:48,730 --> 01:15:51,880
functionality that we didn't discuss a couple of weeks back with the CS50
1581
01:15:51,880 --> 01:15:52,750
library itself.
1582
01:15:52,750 --> 01:15:56,680
But inside of the CS50 library we'll see there is a special function called
1583
01:15:56,680 --> 01:16:01,330
SQL that gives you the ability using this weird URL-like looking thing,
1584
01:16:01,330 --> 01:16:06,040
technically called a URI, that allows me to open a file called favorites.db.
1585
01:16:06,040 --> 01:16:08,380
And long story short, all of the subsequent code
1586
01:16:08,380 --> 01:16:12,580
is going to iterate over this favorites.csv file that we downloaded.
1587
01:16:12,580 --> 01:16:16,240
And it's going to import it into the SQLite database,
1588
01:16:16,240 --> 01:16:19,430
but it's going to use two tables instead of just one.
1589
01:16:19,430 --> 01:16:21,640
So give me just a moment to run this, and then I'll
1590
01:16:21,640 --> 01:16:23,600
reveal the actual results.
1591
01:16:23,600 --> 01:16:26,270
This is going to be run on favorites.csv.
1592
01:16:26,270 --> 01:16:31,510
1593
01:16:31,510 --> 01:16:35,485
And taking a look here, give me just a moment.
1594
01:16:35,485 --> 01:16:39,310
1595
01:16:39,310 --> 01:16:43,350
Oh, give me a sec.
1596
01:16:43,350 --> 01:16:45,270
Come on.
1597
01:16:45,270 --> 01:16:46,080
Come on.
1598
01:16:46,080 --> 01:16:49,020
This program should not be taking this long.
1599
01:16:49,020 --> 01:16:50,460
Sorry.
1600
01:16:50,460 --> 01:16:51,840
Let's open this real fast.
1601
01:16:51,840 --> 01:16:54,750
1602
01:16:54,750 --> 01:16:55,860
Whoops, not that file.
1603
01:16:55,860 --> 01:16:59,000
1604
01:16:59,000 --> 01:16:59,500
OK.
1605
01:16:59,500 --> 01:17:02,350
Let me just skim this code real quick to see where we've gone wrong.
1606
01:17:02,350 --> 01:17:04,840
[INAUDIBLE] reader.
1607
01:17:04,840 --> 01:17:10,390
Reader, title, show ID in certain two shows.
1608
01:17:10,390 --> 01:17:15,370
[INAUDIBLE] genres split, DB execute.
1609
01:17:15,370 --> 01:17:15,880
All right.
1610
01:17:15,880 --> 01:17:17,410
This is me debugging in real time.
1611
01:17:17,410 --> 01:17:22,843
All those times we encourage you to use print, this is me actually using print.
1612
01:17:22,843 --> 01:17:24,760
We'll see how quickly I can recover from this.
1613
01:17:24,760 --> 01:17:26,560
Python of favorites version 8.
1614
01:17:26,560 --> 01:17:29,100
1615
01:17:29,100 --> 01:17:32,070
OK, so here's me debugging in real time.
1616
01:17:32,070 --> 01:17:32,882
It's printing it.
1617
01:17:32,882 --> 01:17:34,590
Oh, maybe I just didn't wait long enough.
1618
01:17:34,590 --> 01:17:35,940
OK, so here we go.
1619
01:17:35,940 --> 01:17:39,900
What I'm doing is printing out the dictionary that represents
1620
01:17:39,900 --> 01:17:41,542
each row that you all typed in.
1621
01:17:41,542 --> 01:17:43,000
And we're actually making progress.
1622
01:17:43,000 --> 01:17:44,220
All right.
1623
01:17:44,220 --> 01:17:46,510
I was too impatient and didn't wait long enough.
1624
01:17:46,510 --> 01:17:47,470
So in a moment--
1625
01:17:47,470 --> 01:17:47,970
there we go.
1626
01:17:47,970 --> 01:17:50,370
All right, so all we have to do sometimes is wait.
1627
01:17:50,370 --> 01:17:54,420
Let me go ahead now and open this file using SQLite3.
1628
01:17:54,420 --> 01:17:58,080
So in SQLite3 I now have a different version of favorites.db.
1629
01:17:58,080 --> 01:17:59,970
I named it number 8 for consistency.
1630
01:17:59,970 --> 01:18:03,280
Once I've run the program I can do .schema to look inside of it.
1631
01:18:03,280 --> 01:18:07,200
And here's what the two tables in this database are going to look like.
1632
01:18:07,200 --> 01:18:10,920
I've created a table called shows, this time to represent all of the TV shows
1633
01:18:10,920 --> 01:18:14,410
that are favorites, that has two columns.
1634
01:18:14,410 --> 01:18:16,812
One is called ID, one is called Title.
1635
01:18:16,812 --> 01:18:19,020
But now I'm going to start taking out for a spin some
1636
01:18:19,020 --> 01:18:20,640
of the other features of SQL.
1637
01:18:20,640 --> 01:18:24,900
And besides there being text, it turns out there's a data type called integer.
1638
01:18:24,900 --> 01:18:26,910
Besides there being a data type called text,
1639
01:18:26,910 --> 01:18:30,150
there's also a special key phrase that you can specify
1640
01:18:30,150 --> 01:18:31,830
that the title can never be null.
1641
01:18:31,830 --> 01:18:35,160
Think back to our use of null in C. Think back
1642
01:18:35,160 --> 01:18:37,260
to the keyword none in Python.
1643
01:18:37,260 --> 01:18:40,830
This is a database constraint that allows you to ensure that none of you
1644
01:18:40,830 --> 01:18:42,630
can't have of favorite TV show.
1645
01:18:42,630 --> 01:18:46,320
If you submit the form, you have to have typed in a title for it
1646
01:18:46,320 --> 01:18:48,250
to end up in our database here.
1647
01:18:48,250 --> 01:18:50,940
And you'll notice one other new feature.
1648
01:18:50,940 --> 01:18:53,460
It turns out, on this table I'm defining what's
1649
01:18:53,460 --> 01:18:56,970
called a primary key, specifically to be the ID column.
1650
01:18:56,970 --> 01:18:58,600
More on that in just a moment.
1651
01:18:58,600 --> 01:19:02,700
Meanwhile, the second table my code has created for me, as we'll soon see,
1652
01:19:02,700 --> 01:19:08,220
gives me a column called show ID, and then, a genre,
1653
01:19:08,220 --> 01:19:11,140
the value of which is text that can also not be null.
1654
01:19:11,140 --> 01:19:12,760
And then more on this in a moment.
1655
01:19:12,760 --> 01:19:15,840
This table has what we're going to call a foreign key,
1656
01:19:15,840 --> 01:19:20,550
specifically the show ID column that references shows ID.
1657
01:19:20,550 --> 01:19:23,250
So before we get into the weeds of this, this
1658
01:19:23,250 --> 01:19:27,120
is now a way of creating the relation in relational database.
1659
01:19:27,120 --> 01:19:31,140
If I have two tables now, not just one, they can somehow
1660
01:19:31,140 --> 01:19:34,720
be linked together by a common column.
1661
01:19:34,720 --> 01:19:37,860
In other words, the shows column--
1662
01:19:37,860 --> 01:19:41,340
shows table is going to give me a table with two columns--
1663
01:19:41,340 --> 01:19:43,080
an ID and a title.
1664
01:19:43,080 --> 01:19:46,800
Every title you gave me, I'm going to assign a unique value.
1665
01:19:46,800 --> 01:19:51,600
The genre's table, meanwhile, is going to associate individual genres
1666
01:19:51,600 --> 01:19:54,480
singular with that same idea.
1667
01:19:54,480 --> 01:20:01,150
And the result of this, to pop back to the Terminal here, is, let's do this.
1668
01:20:01,150 --> 01:20:05,460
Select star from shows of this new database,
1669
01:20:05,460 --> 01:20:09,030
and you'll see that I've given, indeed, all of the shows you all typed
1670
01:20:09,030 --> 01:20:10,663
in unique identifiers.
1671
01:20:10,663 --> 01:20:13,830
I didn't filter out duplicates or do anything beyond just forcing everything
1672
01:20:13,830 --> 01:20:14,430
to uppercase.
1673
01:20:14,430 --> 01:20:16,930
So there's going to be some duplicates here because I didn't
1674
01:20:16,930 --> 01:20:18,840
want to get rid of anyone's data.
1675
01:20:18,840 --> 01:20:21,960
But you'll see that, indeed, I've given everyone
1676
01:20:21,960 --> 01:20:24,420
a unique identifier, from the very first person who
1677
01:20:24,420 --> 01:20:28,650
typed How I Met Your Mother, all the way down to input number 158.
1678
01:20:28,650 --> 01:20:35,760
Meanwhile, if I do select star from genres, which is now a table, not just
1679
01:20:35,760 --> 01:20:38,280
a column in the original data, now you'll
1680
01:20:38,280 --> 01:20:43,195
see a much better design for this data.
1681
01:20:43,195 --> 01:20:44,320
Notice what I've done here.
1682
01:20:44,320 --> 01:20:47,370
Let me go all the way to the top and you'll see two columns, one of which
1683
01:20:47,370 --> 01:20:51,508
is called show ID, the other of which is called genre.
1684
01:20:51,508 --> 01:20:53,550
And again, I wrote some code to do this because I
1685
01:20:53,550 --> 01:20:56,675
had to take Google's messy output where everything was separated by commas.
1686
01:20:56,675 --> 01:21:00,330
I had to tear away the commas and then put each genre into this table
1687
01:21:00,330 --> 01:21:01,770
by itself.
1688
01:21:01,770 --> 01:21:04,200
Even though we haven't introduced the syntax via which
1689
01:21:04,200 --> 01:21:07,230
we can reconstitute the data and reassociate
1690
01:21:07,230 --> 01:21:10,530
your genres with your titles, why, at a glance,
1691
01:21:10,530 --> 01:21:13,140
might this be a better design now?
1692
01:21:13,140 --> 01:21:16,740
Even though I've doubled the number of tables from one to two,
1693
01:21:16,740 --> 01:21:21,540
why is this probably on the direction toward a better design?
1694
01:21:21,540 --> 01:21:23,130
What might your instincts be?
1695
01:21:23,130 --> 01:21:26,430
1696
01:21:26,430 --> 01:21:27,720
Why is this cleaner?
1697
01:21:27,720 --> 01:21:31,230
Again, first time with SQL, why is it better, perhaps,
1698
01:21:31,230 --> 01:21:33,780
that we've done this with our genre's table?
1699
01:21:33,780 --> 01:21:35,650
Can I come to you?
1700
01:21:35,650 --> 01:21:36,700
Why might this be better?
1701
01:21:36,700 --> 01:21:37,200
Yeah.
1702
01:21:37,200 --> 01:21:41,360
Oh, just because we had the conversation before about the commas.
1703
01:21:41,360 --> 01:21:42,470
AUDIENCE: [INAUDIBLE]
1704
01:21:42,470 --> 01:21:42,890
DAVID J. MALAN: Exactly.
1705
01:21:42,890 --> 01:21:43,848
It's as simple as that.
1706
01:21:43,848 --> 01:21:49,130
We've cleaned up the data by giving every genre, every word in the genres
1707
01:21:49,130 --> 01:21:51,260
column in the original Google Spreadsheet
1708
01:21:51,260 --> 01:21:54,030
its own cell in this table, if you will.
1709
01:21:54,030 --> 01:21:56,930
And now notice show ID might appear multiple times.
1710
01:21:56,930 --> 01:22:01,530
Whoever typed in How I Met Your Mother, they only associated one genre with it.
1711
01:22:01,530 --> 01:22:04,460
And so we see that show ID 1 is a comedy.
1712
01:22:04,460 --> 01:22:05,645
But whoever typed in--
1713
01:22:05,645 --> 01:22:07,520
I forget the name of the second show offhand.
1714
01:22:07,520 --> 01:22:11,960
But that person, whoever was assigned show ID 2 checked off
1715
01:22:11,960 --> 01:22:13,850
a whole bunch of the genre's boxes.
1716
01:22:13,850 --> 01:22:17,540
That happened again with show ID 3, 4.
1717
01:22:17,540 --> 01:22:20,940
Persons 5, 6, 7 only checked one box.
1718
01:22:20,940 --> 01:22:25,040
And so you can see now that we've associated the data with what we
1719
01:22:25,040 --> 01:22:27,690
might call a one-to-many relationship.
1720
01:22:27,690 --> 01:22:33,050
A one-to-many relationship, whereby for every one show in the show's table,
1721
01:22:33,050 --> 01:22:36,800
it can now have many genres associated with it, each of which
1722
01:22:36,800 --> 01:22:41,250
is represented by a separate row here.
1723
01:22:41,250 --> 01:22:44,960
So again, if I go ahead and select star from shows--
1724
01:22:44,960 --> 01:22:48,740
let's limit it to the first 10 just to focus on a subset of the data.
1725
01:22:48,740 --> 01:22:51,770
How I Met Your Mother, The Sopranos was the second input there.
1726
01:22:51,770 --> 01:22:55,100
It would seem that now that I've created the data in this way,
1727
01:22:55,100 --> 01:22:59,925
I could ideally somehow search the data, but a little more correctly.
1728
01:22:59,925 --> 01:23:01,550
I don't have to worry about the commas.
1729
01:23:01,550 --> 01:23:03,925
I don't have to worry about the hackish approach of music
1730
01:23:03,925 --> 01:23:05,630
being a substring of musical.
1731
01:23:05,630 --> 01:23:08,310
But how can I actually get back at this data?
1732
01:23:08,310 --> 01:23:09,750
Well, let's go ahead and do this.
1733
01:23:09,750 --> 01:23:13,700
Suppose I did want to get back maybe all of the comedies.
1734
01:23:13,700 --> 01:23:17,120
All of the comedies, no matter whether the person checked just the comedy
1735
01:23:17,120 --> 01:23:19,550
box or multiple boxes instead.
1736
01:23:19,550 --> 01:23:23,330
How now, given that I have two tables, could I
1737
01:23:23,330 --> 01:23:27,740
go about selecting only the titles of comedies?
1738
01:23:27,740 --> 01:23:29,780
I've actually made the problem a little harder,
1739
01:23:29,780 --> 01:23:32,720
but again, SQL is going to give me a solution for this.
1740
01:23:32,720 --> 01:23:35,030
The problem is that if I want to search for comedies,
1741
01:23:35,030 --> 01:23:37,820
I have to check the genres table first.
1742
01:23:37,820 --> 01:23:39,650
And then what's that going to give me?
1743
01:23:39,650 --> 01:23:43,580
If I search the genres table for comedies,
1744
01:23:43,580 --> 01:23:46,610
what's that going to give me back potentially?
1745
01:23:46,610 --> 01:23:47,360
Yeah?
1746
01:23:47,360 --> 01:23:48,110
AUDIENCE: Show ID.
1747
01:23:48,110 --> 01:23:49,110
DAVID J. MALAN: Maybe show ID.
1748
01:23:49,110 --> 01:23:49,902
So let me try that.
1749
01:23:49,902 --> 01:23:55,820
Let me do select show ID from genres, where the genre in a given row
1750
01:23:55,820 --> 01:23:57,320
equals quote, unquote, "comedy."
1751
01:23:57,320 --> 01:24:00,470
No commas, no like, no percent signs.
1752
01:24:00,470 --> 01:24:04,940
Because literally, that column now is singular words, like comedy, or drama,
1753
01:24:04,940 --> 01:24:05,720
or the like.
1754
01:24:05,720 --> 01:24:07,670
Let me go ahead and hit Enter here.
1755
01:24:07,670 --> 01:24:10,700
OK, so I got back a whole bunch of ID numbers.
1756
01:24:10,700 --> 01:24:13,100
Now this could very quickly get annoying.
1757
01:24:13,100 --> 01:24:17,930
It looks like show ID 1, 2, 4, 5, 6, 7, 9, and so forth, are all comedies.
1758
01:24:17,930 --> 01:24:23,690
So I could do something really crazy like, select title from shows,
1759
01:24:23,690 --> 01:24:29,390
where ID equals 1, or ID equals 2.
1760
01:24:29,390 --> 01:24:32,360
This is not going to scale very well, but this
1761
01:24:32,360 --> 01:24:34,610
is why SQL is especially powerful.
1762
01:24:34,610 --> 01:24:39,420
You can actually compose one SQL question from multiple ones.
1763
01:24:39,420 --> 01:24:40,440
So let's do this.
1764
01:24:40,440 --> 01:24:44,480
Why don't I select the title where the ID of the show
1765
01:24:44,480 --> 01:24:48,050
is in the following list of IDs?
1766
01:24:48,050 --> 01:24:54,710
Select show ID from genres, where the specific genre is, quote, unquote,
1767
01:24:54,710 --> 01:24:55,640
"comedy."
1768
01:24:55,640 --> 01:24:58,550
So I've got two SQL queries.
1769
01:24:58,550 --> 01:25:01,850
One is deliberately nested inside of parentheses.
1770
01:25:01,850 --> 01:25:04,760
That's going to give me back that whole list of show IDs.
1771
01:25:04,760 --> 01:25:07,070
But that's exactly what I want to then look up
1772
01:25:07,070 --> 01:25:11,000
the titles for by selecting title from shows where the ID of the show
1773
01:25:11,000 --> 01:25:13,380
is in that big, tall list.
1774
01:25:13,380 --> 01:25:18,710
And so now if I hit Enter, I get back only those shows
1775
01:25:18,710 --> 01:25:22,520
that were somehow flagged as comedy, whether you in the audience
1776
01:25:22,520 --> 01:25:26,570
checked one box for comedy, two boxes, or all of the boxes.
1777
01:25:26,570 --> 01:25:28,928
Somehow we teased out comedy, again, just
1778
01:25:28,928 --> 01:25:31,220
by using that Python script, which loaded this data not
1779
01:25:31,220 --> 01:25:33,800
into one big table, but instead, two.
1780
01:25:33,800 --> 01:25:36,420
And if we want to clean this up, let's do a couple of things.
1781
01:25:36,420 --> 01:25:40,580
Let's, outside of the parentheses, do order by title.
1782
01:25:40,580 --> 01:25:43,550
This is a way of sorting the data in SQL very easily.
1783
01:25:43,550 --> 01:25:48,140
Now we have a whole list of the same titles that are now sorted.
1784
01:25:48,140 --> 01:25:52,640
And what was the keyword with which I could filter out duplicates?
1785
01:25:52,640 --> 01:25:53,450
Yeah, distinct.
1786
01:25:53,450 --> 01:25:54,260
So let's try this.
1787
01:25:54,260 --> 01:25:59,600
Same query, but let's select only the distinct titles from that whole query.
1788
01:25:59,600 --> 01:26:01,770
And notice, I've very deliberately done it this way.
1789
01:26:01,770 --> 01:26:03,520
And to this day, any time I'm using SQL, I
1790
01:26:03,520 --> 01:26:06,187
don't just start at the beginning and type out my whole thought,
1791
01:26:06,187 --> 01:26:07,940
and just get it right on the first try.
1792
01:26:07,940 --> 01:26:10,610
I very commonly start with the subquery, if you
1793
01:26:10,610 --> 01:26:12,800
will, the thing in parentheses, just to get myself
1794
01:26:12,800 --> 01:26:14,600
one step toward what I care about.
1795
01:26:14,600 --> 01:26:15,332
Then I add to it.
1796
01:26:15,332 --> 01:26:16,040
Then I add to it.
1797
01:26:16,040 --> 01:26:18,590
Then I add to it, just like we've encouraged in Python and C,
1798
01:26:18,590 --> 01:26:22,370
taking baby steps in order to get to the answer you actually care about,
1799
01:26:22,370 --> 01:26:23,480
like this one now.
1800
01:26:23,480 --> 01:26:26,060
And other than this mistake, which we didn't
1801
01:26:26,060 --> 01:26:30,350
fix because I re-imported the data after accidentally changing everyone's genre,
1802
01:26:30,350 --> 01:26:35,190
we now have an alphabetized list of all of the same data.
1803
01:26:35,190 --> 01:26:40,670
But now it's better designed, because we have it split across these two tables.
1804
01:26:40,670 --> 01:26:41,630
Oh, thank you.
1805
01:26:41,630 --> 01:26:44,720
OK, just thanks.
1806
01:26:44,720 --> 01:26:48,950
What questions do we have, if any here?
1807
01:26:48,950 --> 01:26:50,480
Questions on this approach?
1808
01:26:50,480 --> 01:26:53,040
1809
01:26:53,040 --> 01:26:55,316
Yeah?
1810
01:26:55,316 --> 01:27:00,280
AUDIENCE: [INAUDIBLE]
1811
01:27:00,280 --> 01:27:02,280
DAVID J. MALAN: Oh, now that we have a database,
1812
01:27:02,280 --> 01:27:04,640
how do we transfer it to a CSV?
1813
01:27:04,640 --> 01:27:06,060
There are ways to do that.
1814
01:27:06,060 --> 01:27:08,570
And in fact, there's a command within SQLite
1815
01:27:08,570 --> 01:27:11,590
that allows you to export your data back to a CSV file.
1816
01:27:11,590 --> 01:27:13,340
If you want to email it to someone and you
1817
01:27:13,340 --> 01:27:16,430
want them to be able to open it in Excel, or Google Spreadsheets, or Apple
1818
01:27:16,430 --> 01:27:19,010
Numbers, or the like, you can go in the other direction.
1819
01:27:19,010 --> 01:27:21,890
Generally though, once you're in the world of SQL
1820
01:27:21,890 --> 01:27:24,620
you're probably storing your data there long term.
1821
01:27:24,620 --> 01:27:27,590
And you're probably updating it, maybe deleting it, adding to it,
1822
01:27:27,590 --> 01:27:28,273
and so forth.
1823
01:27:28,273 --> 01:27:30,440
For instance, the one command I did not show earlier
1824
01:27:30,440 --> 01:27:33,320
is, suppose someone forgot a show.
1825
01:27:33,320 --> 01:27:35,640
Let's see, did I see this in the output?
1826
01:27:35,640 --> 01:27:37,580
All right, so Curb Your Enthusiasm.
1827
01:27:37,580 --> 01:27:38,540
Saw that last night.
1828
01:27:38,540 --> 01:27:39,530
It was just, yeah.
1829
01:27:39,530 --> 01:27:41,160
Did anyone see it last night?
1830
01:27:41,160 --> 01:27:41,660
No?
1831
01:27:41,660 --> 01:27:44,743
All right, well, just the one person that checked that box, so you and me.
1832
01:27:44,743 --> 01:27:46,740
What's another show that didn't make the list?
1833
01:27:46,740 --> 01:27:48,217
How about Seinfeld?
1834
01:27:48,217 --> 01:27:49,550
It's now on Netflix, apparently.
1835
01:27:49,550 --> 01:27:55,220
So insert into shows.
1836
01:27:55,220 --> 01:27:56,390
What do we want to insert?
1837
01:27:56,390 --> 01:27:59,910
Well, we want to insert maybe an ID and a title.
1838
01:27:59,910 --> 01:28:02,150
But I don't actually care what the ID is,
1839
01:28:02,150 --> 01:28:03,650
so I'm just going to insert a title.
1840
01:28:03,650 --> 01:28:05,780
And the value I'm going to give to that title
1841
01:28:05,780 --> 01:28:08,900
is going to be, quote, unquote, "Seinfeld."
1842
01:28:08,900 --> 01:28:11,810
And then, let me go ahead and hit semicolon.
1843
01:28:11,810 --> 01:28:14,360
Nothing seems to happen, but let me rerun the big query
1844
01:28:14,360 --> 01:28:16,250
from before looking for comedies.
1845
01:28:16,250 --> 01:28:19,850
And unfortunately, Seinfeld has not yet been flagged as a comedy,
1846
01:28:19,850 --> 01:28:21,710
so let's get this right, too.
1847
01:28:21,710 --> 01:28:25,370
What intuitively I'm going to have to do to associate, now,
1848
01:28:25,370 --> 01:28:27,980
Seinfeld with my comedies?
1849
01:28:27,980 --> 01:28:30,140
I just inserted into the show's table.
1850
01:28:30,140 --> 01:28:33,890
What more needs to happen before we can flag Seinfeld as a comedy?
1851
01:28:33,890 --> 01:28:35,369
AUDIENCE: [INAUDIBLE]
1852
01:28:35,369 --> 01:28:36,452
DAVID J. MALAN: Say again?
1853
01:28:36,452 --> 01:28:37,635
AUDIENCE: [INAUDIBLE]
1854
01:28:37,635 --> 01:28:38,510
DAVID J. MALAN: Yeah.
1855
01:28:38,510 --> 01:28:42,950
So I need to insert into the genres table two things now,
1856
01:28:42,950 --> 01:28:48,180
a show ID, like this, and then, the name of the genre,
1857
01:28:48,180 --> 01:28:49,340
which presumably is comedy.
1858
01:28:49,340 --> 01:28:50,810
What values do I want to insert?
1859
01:28:50,810 --> 01:28:52,795
Well, the show ID, I better grab that.
1860
01:28:52,795 --> 01:28:54,170
Oh, I don't even know what it is.
1861
01:28:54,170 --> 01:28:55,770
I'm going to have to figure out what that is.
1862
01:28:55,770 --> 01:28:57,660
So I could do this in a couple of ways.
1863
01:28:57,660 --> 01:28:59,150
Let me do a one-time thing.
1864
01:28:59,150 --> 01:29:02,924
Select star from shows, where title equals,
1865
01:29:02,924 --> 01:29:06,770
quote, unquote, "Seinfeld" semicolon 159.
1866
01:29:06,770 --> 01:29:11,780
So now I could do, insert into genres a show ID
1867
01:29:11,780 --> 01:29:20,510
and a genre name, the values 159, and, quote, unquote, "comedy" semicolon,
1868
01:29:20,510 --> 01:29:21,260
Enter.
1869
01:29:21,260 --> 01:29:24,710
And now, if I scroll back in my history and execute that really big query
1870
01:29:24,710 --> 01:29:26,690
again, looking for all distinct comedies,
1871
01:29:26,690 --> 01:29:29,100
now Seinfeld has made the list.
1872
01:29:29,100 --> 01:29:32,637
But I did this manually so I didn't actually capitalize it.
1873
01:29:32,637 --> 01:29:33,470
Let's clean that up.
1874
01:29:33,470 --> 01:29:35,420
Let's do update.
1875
01:29:35,420 --> 01:29:37,700
Let's do update my shows.
1876
01:29:37,700 --> 01:29:43,440
Set title equals to Seinfeld semicolon.
1877
01:29:43,440 --> 01:29:43,940
No?
1878
01:29:43,940 --> 01:29:47,780
OK, thank you, where title equals, quote, unquote, "Seinfeld."
1879
01:29:47,780 --> 01:29:49,520
Let's not make that mistake again.
1880
01:29:49,520 --> 01:29:50,330
Enter.
1881
01:29:50,330 --> 01:29:53,270
And now, if I execute that really big query, now Seinfeld is,
1882
01:29:53,270 --> 01:29:56,480
indeed, considered a comedy.
1883
01:29:56,480 --> 01:29:57,960
So where are we going with this?
1884
01:29:57,960 --> 01:30:00,240
Well, thus far we've been doing all of this pretty manually.
1885
01:30:00,240 --> 01:30:02,780
And this is absolutely what an analyst, a data scientist type person
1886
01:30:02,780 --> 01:30:05,197
might do if just manipulating a pretty large data set just
1887
01:30:05,197 --> 01:30:08,010
to get at interesting answers that might be across one,
1888
01:30:08,010 --> 01:30:09,440
two, or even many more tables.
1889
01:30:09,440 --> 01:30:12,440
Eventually, in a few weeks, we're going to start to automate all of this
1890
01:30:12,440 --> 01:30:16,610
by writing code in Python that generates SQL to do this.
1891
01:30:16,610 --> 01:30:19,280
If you go to most any website on the internet today,
1892
01:30:19,280 --> 01:30:23,420
and you, for instance, log in, odds are you're typing a username and password,
1893
01:30:23,420 --> 01:30:24,260
clicking Submit.
1894
01:30:24,260 --> 01:30:25,430
What's then happening?
1895
01:30:25,430 --> 01:30:28,010
Well, the website might not be implemented in Python
1896
01:30:28,010 --> 01:30:31,610
but it's probably implemented in some language, Python, JavaScript, Java,
1897
01:30:31,610 --> 01:30:33,110
Ruby, something else.
1898
01:30:33,110 --> 01:30:38,330
And that language is probably using something like a relational database
1899
01:30:38,330 --> 01:30:41,930
to use SQL to get your username, get your password,
1900
01:30:41,930 --> 01:30:43,997
and compare the two against what you've typed in.
1901
01:30:43,997 --> 01:30:46,580
And actually, it's hopefully not getting your actual password,
1902
01:30:46,580 --> 01:30:48,163
but something called the hash thereof.
1903
01:30:48,163 --> 01:30:50,360
But there's probably a database involved doing that.
1904
01:30:50,360 --> 01:30:53,600
When you buy something on Amazon.com and you click Check Out,
1905
01:30:53,600 --> 01:30:56,720
odds are there's some code on Amazon's server
1906
01:30:56,720 --> 01:30:59,870
that's looking at what it is you added to your shopping cart,
1907
01:30:59,870 --> 01:31:03,690
and then maybe using a for loop of some sort, in Python or another language.
1908
01:31:03,690 --> 01:31:08,180
It's doing a whole bunch of SQL inserts to store in their database what
1909
01:31:08,180 --> 01:31:09,290
it is you bought.
1910
01:31:09,290 --> 01:31:12,260
There's other types of databases, too, but SQL databases,
1911
01:31:12,260 --> 01:31:14,647
or relational databases are quite popular.
1912
01:31:14,647 --> 01:31:16,730
So let's go ahead and write one other program here
1913
01:31:16,730 --> 01:31:21,320
in Python that now merges these two languages together, whereby
1914
01:31:21,320 --> 01:31:24,740
I'm going to use SQL inside of a Python program
1915
01:31:24,740 --> 01:31:28,430
so I can implement my logic of my program in Python,
1916
01:31:28,430 --> 01:31:30,050
step-by-step, line-by-line.
1917
01:31:30,050 --> 01:31:34,260
But when I want to get at some data I can actually talk to a SQL database.
1918
01:31:34,260 --> 01:31:37,190
So let me go ahead and open favorites.py.
1919
01:31:37,190 --> 01:31:39,740
1920
01:31:39,740 --> 01:31:45,200
And let me go ahead and throw away some of what we did earlier and really
1921
01:31:45,200 --> 01:31:47,690
just now add a SQL to the mix.
1922
01:31:47,690 --> 01:31:51,440
From the CS50 library, let's import the SQL function.
1923
01:31:51,440 --> 01:31:54,260
This will be useful to use because most third-party libraries
1924
01:31:54,260 --> 01:31:57,390
that deal with SQL and Python are more complicated than they need to be.
1925
01:31:57,390 --> 01:32:00,390
So I think you'll find this library easier to use.
1926
01:32:00,390 --> 01:32:01,780
Let's then do the following.
1927
01:32:01,780 --> 01:32:03,900
Create a variable called db for database.
1928
01:32:03,900 --> 01:32:05,400
But I could call it anything I want.
1929
01:32:05,400 --> 01:32:09,090
Let's use that you URI, which is a fancy way of saying something
1930
01:32:09,090 --> 01:32:17,790
that looks like a URL, but that actually opens up a database on disk, that is,
1931
01:32:17,790 --> 01:32:19,110
in the current folder.
1932
01:32:19,110 --> 01:32:22,620
Let's now ask the user for a title by prompting them for a, quote, unquote,
1933
01:32:22,620 --> 01:32:23,820
"title" like this.
1934
01:32:23,820 --> 01:32:27,960
And let's strip off any whitespace just so that the data is not messy.
1935
01:32:27,960 --> 01:32:30,730
And then, let's go ahead and do this.
1936
01:32:30,730 --> 01:32:32,220
And this is the new logic.
1937
01:32:32,220 --> 01:32:36,570
I'm going to go ahead now and write a line of code that uses Python
1938
01:32:36,570 --> 01:32:39,840
to talk to the original favorites.db.
1939
01:32:39,840 --> 01:32:43,980
So again, I'm not using the two-table database, which is in favorites8.db.
1940
01:32:43,980 --> 01:32:47,320
I'm using the original that we imported from your own data,
1941
01:32:47,320 --> 01:32:49,090
and I'm going to do the following.
1942
01:32:49,090 --> 01:32:54,150
I'm going to use db.execute to execute a SQL command inside of Python.
1943
01:32:54,150 --> 01:33:02,700
I'm going to select the count of shows from the favorites
1944
01:33:02,700 --> 01:33:09,960
table, where the title the user typed in is like this question mark.
1945
01:33:09,960 --> 01:33:11,910
And why I'm doing that is as follows.
1946
01:33:11,910 --> 01:33:15,600
Just like in C, when we had percent S, in SQL for now,
1947
01:33:15,600 --> 01:33:17,490
the analogue is going to be a question mark.
1948
01:33:17,490 --> 01:33:18,990
So same idea, different syntax.
1949
01:33:18,990 --> 01:33:21,180
Instead of percent S, it's just a question mark.
1950
01:33:21,180 --> 01:33:26,280
And using a comma outside of this first string, using CS50's execute
1951
01:33:26,280 --> 01:33:29,340
function I can pass in a SQL string, a command,
1952
01:33:29,340 --> 01:33:33,835
then any arguments I want to plug into the question marks therein.
1953
01:33:33,835 --> 01:33:36,210
So the goal at hand is to actually write a program that's
1954
01:33:36,210 --> 01:33:42,420
going to search favorites.csv, a.k.a., favorites.db for the total number
1955
01:33:42,420 --> 01:33:45,300
of people that liked a particular show.
1956
01:33:45,300 --> 01:33:49,050
So this is going to select the count of people from the favorites table
1957
01:33:49,050 --> 01:33:53,400
where the title they typed in is like whatever the user has just now typed
1958
01:33:53,400 --> 01:33:53,970
in.
1959
01:33:53,970 --> 01:33:56,730
This db execute function returns a list.
1960
01:33:56,730 --> 01:33:57,813
It returns a list of rows.
1961
01:33:57,813 --> 01:34:00,188
And you would only know that by my telling you or reading
1962
01:34:00,188 --> 01:34:01,290
the documentation.
1963
01:34:01,290 --> 01:34:04,140
And therefore, if I want to get back to the total count,
1964
01:34:04,140 --> 01:34:08,940
I'm going to go ahead and grab the first row from those rows.
1965
01:34:08,940 --> 01:34:11,220
Because it's only going to give me back the count.
1966
01:34:11,220 --> 01:34:16,440
And then I'm going to go ahead and print out that row's first value.
1967
01:34:16,440 --> 01:34:17,940
But it's going to be a little weird.
1968
01:34:17,940 --> 01:34:21,420
Technically the column is going to be called "count" star, quote, unquote,
1969
01:34:21,420 --> 01:34:22,420
which is a little weird.
1970
01:34:22,420 --> 01:34:24,180
Let me add one more feature to the mix.
1971
01:34:24,180 --> 01:34:26,280
You can actually give nicknames to columns
1972
01:34:26,280 --> 01:34:30,090
that are coming back, especially if they are the result of functions like this.
1973
01:34:30,090 --> 01:34:34,170
I can just call that column counter, in all lowercase.
1974
01:34:34,170 --> 01:34:41,620
That means I can now say get back the counter key inside of this dictionary.
1975
01:34:41,620 --> 01:34:43,360
So just to recap, what have we done?
1976
01:34:43,360 --> 01:34:46,010
We've imported the CS50 library SQL function.
1977
01:34:46,010 --> 01:34:49,080
We've, with this line of code, opened the favorites.db
1978
01:34:49,080 --> 01:34:54,810
file that you and I created earlier by importing your CSV into SQLite.
1979
01:34:54,810 --> 01:34:58,140
I'm now just asking the user for a title they want to search for.
1980
01:34:58,140 --> 01:35:02,070
I'm now executing this SQL query on that database,
1981
01:35:02,070 --> 01:35:05,250
plugging in whatever the human typed in as their title
1982
01:35:05,250 --> 01:35:07,170
in order to get back a total count.
1983
01:35:07,170 --> 01:35:10,950
And I'm giving the count a nickname, an alias of counter,
1984
01:35:10,950 --> 01:35:13,860
just so it's more self-explanatory.
1985
01:35:13,860 --> 01:35:18,330
This function, db execute, no matter what, always returns a list of rows,
1986
01:35:18,330 --> 01:35:20,470
even if there's only one row inside of it.
1987
01:35:20,470 --> 01:35:23,310
So this line of code just gives me the first and only row.
1988
01:35:23,310 --> 01:35:27,960
And then, this goes inside of that row, which it turns out is a dictionary,
1989
01:35:27,960 --> 01:35:33,730
and gives me the key counter and the value it corresponds to.
1990
01:35:33,730 --> 01:35:35,340
So what, to be clear, is this doing?
1991
01:35:35,340 --> 01:35:38,220
Let's go ahead and run this manually in my Terminal window first.
1992
01:35:38,220 --> 01:35:41,970
Let me run SQLite3 on favorites--
1993
01:35:41,970 --> 01:35:43,380
Well, let's do this.
1994
01:35:43,380 --> 01:35:47,410
On favorites.db, let me import the data again.
1995
01:35:47,410 --> 01:35:54,910
So mode csv.import in from favorites.csv into a favorites table.
1996
01:35:54,910 --> 01:35:57,330
So I've just recreated the same data set that you all
1997
01:35:57,330 --> 01:35:59,782
gave me earlier in favorites.db.
1998
01:35:59,782 --> 01:36:02,490
If I were to do this manually, let's search for The Office again.
1999
01:36:02,490 --> 01:36:09,210
Select, count star from favorites, where title like, and let's
2000
01:36:09,210 --> 01:36:12,270
just manually type it in for now, The Office.
2001
01:36:12,270 --> 01:36:15,330
We'll search for the one with the word The, semicolon.
2002
01:36:15,330 --> 01:36:16,560
I get back 12.
2003
01:36:16,560 --> 01:36:18,780
But technically, notice what I get back.
2004
01:36:18,780 --> 01:36:25,080
I technically get back a miniature table containing one column and one row.
2005
01:36:25,080 --> 01:36:27,090
What if I want to rename that column?
2006
01:36:27,090 --> 01:36:28,840
That's where the as keyword comes in.
2007
01:36:28,840 --> 01:36:31,080
So select count star as counter.
2008
01:36:31,080 --> 01:36:32,910
Notice what happens, Enter.
2009
01:36:32,910 --> 01:36:35,883
I just get back-- same simple table, but I've
2010
01:36:35,883 --> 01:36:38,550
renamed the column to be counter just because it's a little more
2011
01:36:38,550 --> 01:36:40,410
self-explanatory as to what it is.
2012
01:36:40,410 --> 01:36:43,200
So what am I doing with this line of code?
2013
01:36:43,200 --> 01:36:47,430
This line of code is returning to me that miniature temporary table
2014
01:36:47,430 --> 01:36:50,820
in the form of a list of dictionaries.
2015
01:36:50,820 --> 01:36:55,620
The list contains one row, as we'll see, and it
2016
01:36:55,620 --> 01:37:00,800
contains one column, as we'll see, the key for which is counter.
2017
01:37:00,800 --> 01:37:02,540
So let's now run the code itself.
2018
01:37:02,540 --> 01:37:07,280
I'm going to get out of SQLite3 and I'm going to run Python of favorites.py.
2019
01:37:07,280 --> 01:37:08,120
Enter.
2020
01:37:08,120 --> 01:37:09,540
I'm being prompted for a title.
2021
01:37:09,540 --> 01:37:13,790
I'm going to type in The Office and cross my fingers, and there's that 12.
2022
01:37:13,790 --> 01:37:14,810
Why is it 12?
2023
01:37:14,810 --> 01:37:17,450
Well, there's a typo again because I re-imported the CSV.
2024
01:37:17,450 --> 01:37:21,270
I had deleted two of the Thes, so we're back at the original data set.
2025
01:37:21,270 --> 01:37:25,760
So there's 12 total that have, quote, unquote, "The Office"
2026
01:37:25,760 --> 01:37:27,780
in the title like that.
2027
01:37:27,780 --> 01:37:28,860
So what have we done?
2028
01:37:28,860 --> 01:37:31,820
We've combined some Python with some SQL,
2029
01:37:31,820 --> 01:37:34,790
but we've relegated all of the complexity of searching for something,
2030
01:37:34,790 --> 01:37:36,800
the selecting of something, gotten rid of all
2031
01:37:36,800 --> 01:37:39,560
of the with keyword, the open keyword, the for loop,
2032
01:37:39,560 --> 01:37:41,600
the reader the DictReader, and all of that.
2033
01:37:41,600 --> 01:37:46,460
And it's just one line of SQL now, using the best of both worlds.
2034
01:37:46,460 --> 01:37:51,720
All right, any questions on what we've just done here or how any of this
2035
01:37:51,720 --> 01:37:52,220
works?
2036
01:37:52,220 --> 01:37:54,870
2037
01:37:54,870 --> 01:37:56,210
Any questions here?
2038
01:37:56,210 --> 01:37:57,310
Yeah?
2039
01:37:57,310 --> 01:38:01,070
AUDIENCE: [INAUDIBLE]
2040
01:38:01,070 --> 01:38:04,413
DAVID J. MALAN: When does this function return more than one row?
2041
01:38:04,413 --> 01:38:05,330
Was that the question?
2042
01:38:05,330 --> 01:38:05,760
AUDIENCE: Yeah.
2043
01:38:05,760 --> 01:38:06,030
DAVID J. MALAN: Yeah.
2044
01:38:06,030 --> 01:38:08,240
So let's do that by changing the problem at hand.
2045
01:38:08,240 --> 01:38:10,970
This program was designed just to select the total count.
2046
01:38:10,970 --> 01:38:16,070
Let's go ahead and select, for instance, all
2047
01:38:16,070 --> 01:38:20,705
of the ways you all typed in The Office by selecting the title this time.
2048
01:38:20,705 --> 01:38:24,110
2049
01:38:24,110 --> 01:38:28,470
If I do this in SQLite3, let me go ahead and do this again
2050
01:38:28,470 --> 01:38:29,970
after increasing my Terminal window.
2051
01:38:29,970 --> 01:38:30,920
Let's do it manually.
2052
01:38:30,920 --> 01:38:35,570
Select title from favorites, where the title is like,
2053
01:38:35,570 --> 01:38:38,835
quote, unquote, "The Office," semicolon.
2054
01:38:38,835 --> 01:38:41,960
I get back all of these different rows, and we didn't even notice this one.
2055
01:38:41,960 --> 01:38:43,910
There's actually another little typo in there
2056
01:38:43,910 --> 01:38:47,630
with some capitalization of the E, and the C, and the E. That
2057
01:38:47,630 --> 01:38:50,840
would be an example of a query that gives me back therefore
2058
01:38:50,840 --> 01:38:52,080
for multiple rows.
2059
01:38:52,080 --> 01:38:53,990
So let's now change my Python program.
2060
01:38:53,990 --> 01:38:59,540
If I now, in my Python program, do this, I get back a whole bunch of rows
2061
01:38:59,540 --> 01:39:01,110
containing all of those titles.
2062
01:39:01,110 --> 01:39:06,350
I can now do, for row in rows, I can print out the current row's title,
2063
01:39:06,350 --> 01:39:09,390
and now manipulate all of those things together.
2064
01:39:09,390 --> 01:39:10,760
Let me keep both on the screen.
2065
01:39:10,760 --> 01:39:12,320
Let me run Python of favorites.py.
2066
01:39:12,320 --> 01:39:16,320
And that for loop now should iterate, what, 10 or more times,
2067
01:39:16,320 --> 01:39:18,020
once for each of those titles.
2068
01:39:18,020 --> 01:39:22,010
And indeed, if I type in The Office again, Enter.
2069
01:39:22,010 --> 01:39:24,890
Whoops.
2070
01:39:24,890 --> 01:39:25,670
Row title.
2071
01:39:25,670 --> 01:39:26,750
What did I do wrong?
2072
01:39:26,750 --> 01:39:30,052
Oh, I should not be renaming title to counter this time.
2073
01:39:30,052 --> 01:39:31,760
So that's just a dumb mistake on my part.
2074
01:39:31,760 --> 01:39:33,410
Let me rerun it again.
2075
01:39:33,410 --> 01:39:36,230
And now I should see after typing in The Office,
2076
01:39:36,230 --> 01:39:38,420
Enter, a whole bunch of The Offices.
2077
01:39:38,420 --> 01:39:40,520
And because I'm using like, even the missed
2078
01:39:40,520 --> 01:39:43,550
capitalizations are coming through, because like is case insensitive.
2079
01:39:43,550 --> 01:39:45,890
It doesn't matter if it's uppercase or lowercase.
2080
01:39:45,890 --> 01:39:50,300
Whereas had I used the equal sign I would get back only the same ones
2081
01:39:50,300 --> 01:39:51,935
capitalized correctly.
2082
01:39:51,935 --> 01:39:55,250
All right, any questions on this next?
2083
01:39:55,250 --> 01:39:59,700
All right, so let's transition to a larger, juicier data
2084
01:39:59,700 --> 01:40:01,610
set, and consider some of the issues that
2085
01:40:01,610 --> 01:40:05,870
arise when actually now using SQL and skating toward a world in which we're
2086
01:40:05,870 --> 01:40:08,970
using SQL for mobile apps, web apps, and generally speaking,
2087
01:40:08,970 --> 01:40:10,800
very large data sets.
2088
01:40:10,800 --> 01:40:14,090
So let's start with a larger data set just like that.
2089
01:40:14,090 --> 01:40:19,800
Give me just a moment to switch screens over to what we have for you today,
2090
01:40:19,800 --> 01:40:22,970
which is an actual relational database that we've created out
2091
01:40:22,970 --> 01:40:26,540
of a real-world data set from IMDb.
2092
01:40:26,540 --> 01:40:29,210
So InternetMovieDatabase.com is a website
2093
01:40:29,210 --> 01:40:31,790
where you can search for TV shows, and movies, and actors,
2094
01:40:31,790 --> 01:40:34,880
and so forth, all using their database behind the scenes.
2095
01:40:34,880 --> 01:40:39,530
IMDb wonderfully makes their data set available as not CSV files,
2096
01:40:39,530 --> 01:40:42,960
but TSV files, tab-separated values.
2097
01:40:42,960 --> 01:40:46,460
And so what we did is, before class we downloaded those TSV files.
2098
01:40:46,460 --> 01:40:50,300
We wrote a Python program similar to my favorites8.py file
2099
01:40:50,300 --> 01:40:53,750
earlier that read in all of those TSV files,
2100
01:40:53,750 --> 01:40:58,820
created some SQL tables in an IMDb database
2101
01:40:58,820 --> 01:41:03,270
for you in SQLite that has multiple tables and multiple columns.
2102
01:41:03,270 --> 01:41:07,190
So let's go and wrap our minds around what's actually in this data set.
2103
01:41:07,190 --> 01:41:10,940
Let me go back to VS Code here, and in just a moment,
2104
01:41:10,940 --> 01:41:15,260
I'm going to go ahead and copy the file, which we've named shows.db.
2105
01:41:15,260 --> 01:41:20,510
And I'm going to go ahead and increase my Terminal and do SQLite3 of shows.db.
2106
01:41:20,510 --> 01:41:23,570
Whenever playing around with a SQLite database for the first time,
2107
01:41:23,570 --> 01:41:26,480
typing .schema is perhaps a good place to start to give you a sense
2108
01:41:26,480 --> 01:41:27,410
of what's in there.
2109
01:41:27,410 --> 01:41:29,120
And things just escalated quickly.
2110
01:41:29,120 --> 01:41:31,640
There's a lot in this data set, because, indeed, there's
2111
01:41:31,640 --> 01:41:35,750
going to be tens of hundreds of thousands of rows in this data set,
2112
01:41:35,750 --> 01:41:39,230
and also problem set 7, where we'll look at the movie side of things
2113
01:41:39,230 --> 01:41:40,920
and not just the TV shows.
2114
01:41:40,920 --> 01:41:43,940
So what is the schema that we have created for you
2115
01:41:43,940 --> 01:41:47,150
from IMDb's actual real-world data?
2116
01:41:47,150 --> 01:41:48,950
One, there's a table called shows.
2117
01:41:48,950 --> 01:41:51,950
And notice we've just added whitespace by hitting Enter a bunch of times
2118
01:41:51,950 --> 01:41:54,050
to make it a little more stylistically readable.
2119
01:41:54,050 --> 01:41:57,980
The shows table has an ID column, a title column, a year,
2120
01:41:57,980 --> 01:42:00,740
and the total number of episodes for a given show.
2121
01:42:00,740 --> 01:42:05,750
And the types of those columns are integer, text, numeric, and integer.
2122
01:42:05,750 --> 01:42:08,090
So it turns out there's actually a few different data
2123
01:42:08,090 --> 01:42:13,850
types that are worth being aware of when it comes to creating tables themselves.
2124
01:42:13,850 --> 01:42:18,170
In fact, in SQLite there's five data types, and only five,
2125
01:42:18,170 --> 01:42:21,650
fortunately, one of which is, indeed, integer, negative or positive,
2126
01:42:21,650 --> 01:42:25,010
numeric, which is kind of a catchall for dates and times,
2127
01:42:25,010 --> 01:42:27,320
things that are numeric but are not just integers,
2128
01:42:27,320 --> 01:42:29,510
and not just real numbers, for instance.
2129
01:42:29,510 --> 01:42:33,020
Real number is what we've generally thought of as float up until now.
2130
01:42:33,020 --> 01:42:35,058
Text, of course, is just text, but notice
2131
01:42:35,058 --> 01:42:37,100
that you don't have to worry about how big it is.
2132
01:42:37,100 --> 01:42:39,110
Like in Python, it will size to fit.
2133
01:42:39,110 --> 01:42:41,840
And then there's BLOB, which is binary large object, which
2134
01:42:41,840 --> 01:42:45,300
is for just raw 0s and 1s, like for files or things like that.
2135
01:42:45,300 --> 01:42:47,570
But we'll generally use the other four of these.
2136
01:42:47,570 --> 01:42:50,960
And so, indeed, when we imported this data for you
2137
01:42:50,960 --> 01:42:56,270
we decided that every show would be given an ID, which is just an integer.
2138
01:42:56,270 --> 01:42:59,460
Every show has, of course, a title, which should not be null.
2139
01:42:59,460 --> 01:43:01,320
Otherwise, why is it in the database?
2140
01:43:01,320 --> 01:43:04,830
Every show has a year, which is numeric according
2141
01:43:04,830 --> 01:43:06,180
to that definition a moment ago.
2142
01:43:06,180 --> 01:43:09,540
And the total number of episodes for a show is going to be an integer.
2143
01:43:09,540 --> 01:43:13,110
What now is with these primary keys that we mentioned earlier, too?
2144
01:43:13,110 --> 01:43:18,090
A primary key is the column that uniquely identifies all of the data.
2145
01:43:18,090 --> 01:43:21,120
In our case, with the favorites, I automatically
2146
01:43:21,120 --> 01:43:24,750
gave each of your submissions a unique ID so that even if two or more of you
2147
01:43:24,750 --> 01:43:27,360
typed in The Office, your submission still
2148
01:43:27,360 --> 01:43:32,310
had a unique identifier, a number that allowed me to then correlate it
2149
01:43:32,310 --> 01:43:36,330
with your genres, just as we saw a moment ago.
2150
01:43:36,330 --> 01:43:39,280
In this version of IMDb, there's also genres.
2151
01:43:39,280 --> 01:43:42,120
But they don't come from us, they come from IMDb.com.
2152
01:43:42,120 --> 01:43:46,320
And so a genre has a show ID, and a genre just like our database.
2153
01:43:46,320 --> 01:43:49,890
But these are real-world genres with a bit more filtration.
2154
01:43:49,890 --> 01:43:54,520
Notice, though, just like my version, there's a foreign key.
2155
01:43:54,520 --> 01:43:59,970
A foreign key is the appearance of another table's primary key
2156
01:43:59,970 --> 01:44:02,050
in its own table.
2157
01:44:02,050 --> 01:44:05,100
So when you have a table like genres, which is somehow
2158
01:44:05,100 --> 01:44:10,920
cross referencing the original shows table, if shows have a primary key
2159
01:44:10,920 --> 01:44:15,270
called ID, and those same numbers appear in the genres table
2160
01:44:15,270 --> 01:44:20,250
under the column called show ID, by definition, show ID is a foreign key.
2161
01:44:20,250 --> 01:44:22,650
It's the same numbers but it's foreign in the sense
2162
01:44:22,650 --> 01:44:25,260
that the number is being used in this table,
2163
01:44:25,260 --> 01:44:29,130
even though it's officially defined primarily in this other table.
2164
01:44:29,130 --> 01:44:31,770
This is what we mean by relational databases.
2165
01:44:31,770 --> 01:44:37,170
You have multiple tables with some column in common, numbers typically.
2166
01:44:37,170 --> 01:44:40,860
And those numbers allow you to line the two tables up in such a way
2167
01:44:40,860 --> 01:44:44,040
that you can reconnect the shows with their genres,
2168
01:44:44,040 --> 01:44:46,800
just like we did with our smaller data set a moment ago.
2169
01:44:46,800 --> 01:44:49,050
This logic is extended further.
2170
01:44:49,050 --> 01:44:53,070
Notice that the IMDb database we've created for you has a stars table,
2171
01:44:53,070 --> 01:44:56,730
like TV show stars, the actors therein.
2172
01:44:56,730 --> 01:45:00,210
And that table, interestingly, has no mention of people
2173
01:45:00,210 --> 01:45:02,220
and no mention of shows, per se.
2174
01:45:02,220 --> 01:45:05,730
It only has a column called show ID, which is an integer,
2175
01:45:05,730 --> 01:45:08,220
and a person ID, which is an integer.
2176
01:45:08,220 --> 01:45:14,320
Meanwhile, if we scrolled down to the bottom,
2177
01:45:14,320 --> 01:45:17,490
you will see a table called people.
2178
01:45:17,490 --> 01:45:23,010
And we have decided in IMDb's world that every person in the TV show world
2179
01:45:23,010 --> 01:45:27,510
will have a unique identifier that's a number, a name that's text, a birth
2180
01:45:27,510 --> 01:45:31,500
date, which is numeric, and then, again, specifying that ID
2181
01:45:31,500 --> 01:45:35,350
is going to be their primary key.
2182
01:45:35,350 --> 01:45:36,940
So what's going on here?
2183
01:45:36,940 --> 01:45:42,640
Well, it turns out that TV stars and writers are both types of people.
2184
01:45:42,640 --> 01:45:47,730
So using this relational database, notice the road we're going down.
2185
01:45:47,730 --> 01:45:49,770
We're factoring out commonalities.
2186
01:45:49,770 --> 01:45:52,570
And if a person can be different things in life,
2187
01:45:52,570 --> 01:45:55,260
well, we're defining them first and foremost as people.
2188
01:45:55,260 --> 01:45:58,150
And then, notice these two tables are almost the same.
2189
01:45:58,150 --> 01:46:00,660
The stars table has a show ID, which is a number,
2190
01:46:00,660 --> 01:46:02,670
and a person ID, which is a number, which
2191
01:46:02,670 --> 01:46:10,710
allows us via this middleman table, if you will, to link people with TV shows.
2192
01:46:10,710 --> 01:46:16,080
Similarly, the writers table allows us to connect shows with people, too,
2193
01:46:16,080 --> 01:46:18,220
by just recording those numbers.
2194
01:46:18,220 --> 01:46:20,980
So if we go into this data set, let's do the following.
2195
01:46:20,980 --> 01:46:24,340
Let's do select star from people semicolon.
2196
01:46:24,340 --> 01:46:27,030
So a huge amount of data is coming back.
2197
01:46:27,030 --> 01:46:31,480
This is hundreds of thousands of rows now based on the ID numbers alone.
2198
01:46:31,480 --> 01:46:34,320
So this is real-world data now flying across the screen.
2199
01:46:34,320 --> 01:46:38,160
There's a lot of people in the TV show business, not just actors and writers,
2200
01:46:38,160 --> 01:46:39,780
but others as well.
2201
01:46:39,780 --> 01:46:40,680
It's still going.
2202
01:46:40,680 --> 01:46:41,977
There's a lot of data there.
2203
01:46:41,977 --> 01:46:44,310
So my god, if you had to do anything manual in this data
2204
01:46:44,310 --> 01:46:46,660
set it's probably not going to work out very well.
2205
01:46:46,660 --> 01:46:49,590
And actually, we're up to, what, a million people in this data
2206
01:46:49,590 --> 01:46:51,990
set, plus, which would mean this probably isn't even
2207
01:46:51,990 --> 01:46:55,590
going to open very well in Excel, or Google Spreadsheets, or Apple Numbers.
2208
01:46:55,590 --> 01:46:57,700
SQL probably is the better approach here.
2209
01:46:57,700 --> 01:47:00,360
Let's search for someone specific, like select star
2210
01:47:00,360 --> 01:47:06,060
from people, where name equals Steve Carell, for instance, sticking
2211
01:47:06,060 --> 01:47:06,702
with comedies.
2212
01:47:06,702 --> 01:47:08,160
All right, so there's Steve Carell.
2213
01:47:08,160 --> 01:47:14,100
He is person number 136,797, born in 1962.
2214
01:47:14,100 --> 01:47:16,540
And that's as much data as we have on Steve Carell here.
2215
01:47:16,540 --> 01:47:19,210
How do we figure out what shows, for instance, he's in?
2216
01:47:19,210 --> 01:47:23,500
Well, let's see, select star from shows, semicolon.
2217
01:47:23,500 --> 01:47:27,180
There's a crazy number of shows out there in the IMDb database.
2218
01:47:27,180 --> 01:47:30,150
And you can see it here again flying across the screen.
2219
01:47:30,150 --> 01:47:33,630
Feels like we're going to have to employ some techniques in order
2220
01:47:33,630 --> 01:47:37,060
to get at all of Steve Carell's shows.
2221
01:47:37,060 --> 01:47:39,220
So how are we going to do that?
2222
01:47:39,220 --> 01:47:41,730
Well, god, this is a lot of data here.
2223
01:47:41,730 --> 01:47:45,120
And in fact, yeah, we have, what, 15 million
2224
01:47:45,120 --> 01:47:47,280
shows plus in this data set, too.
2225
01:47:47,280 --> 01:47:50,340
So doing things efficiently is now going to start to matter.
2226
01:47:50,340 --> 01:47:51,790
So let's actually do this.
2227
01:47:51,790 --> 01:47:53,460
Let me select a specific show.
2228
01:47:53,460 --> 01:47:58,590
Select star from shows where title equals, quote, unquote, "The Office."
2229
01:47:58,590 --> 01:48:00,920
And there presumably shouldn't be typos in this data
2230
01:48:00,920 --> 01:48:03,470
because it comes from the real website IMDb.com.
2231
01:48:03,470 --> 01:48:05,000
Let's get back to show.
2232
01:48:05,000 --> 01:48:08,210
Turns out there's been a lot of The Offices out in the world.
2233
01:48:08,210 --> 01:48:12,170
The one that started in 2005 is the one that we want,
2234
01:48:12,170 --> 01:48:14,990
presumably the most popular with 188 episodes.
2235
01:48:14,990 --> 01:48:16,220
How can we get just that?
2236
01:48:16,220 --> 01:48:21,650
Maybe we could do and year equals, how about 2005?
2237
01:48:21,650 --> 01:48:25,250
All right, so now we've got back just the ID of The Office
2238
01:48:25,250 --> 01:48:26,210
that we care about.
2239
01:48:26,210 --> 01:48:27,230
And let's do this, too.
2240
01:48:27,230 --> 01:48:30,110
Let me turn on a timer within SQLite just
2241
01:48:30,110 --> 01:48:31,790
to get a sense of running time now.
2242
01:48:31,790 --> 01:48:33,110
Let me do that again.
2243
01:48:33,110 --> 01:48:36,050
Select star from shows, where title equals The Office,
2244
01:48:36,050 --> 01:48:37,808
and year equals 2005.
2245
01:48:37,808 --> 01:48:38,850
And let's keep it simple.
2246
01:48:38,850 --> 01:48:40,370
Let's just do titles for now.
2247
01:48:40,370 --> 01:48:41,450
Enter.
2248
01:48:41,450 --> 01:48:43,460
All right, so not terribly long.
2249
01:48:43,460 --> 01:48:46,990
It found it pretty fast, but it looks like it took how much real time?
2250
01:48:46,990 --> 01:48:50,010
0.02 seconds, not bad for just a title.
2251
01:48:50,010 --> 01:48:53,210
But just to plant a seed, it turns out that we can probably
2252
01:48:53,210 --> 01:48:54,200
speed even this up.
2253
01:48:54,200 --> 01:48:54,930
Let me do this.
2254
01:48:54,930 --> 01:48:58,220
Let me create something called an index, which is another use of the C
2255
01:48:58,220 --> 01:49:00,170
in CRUD for creating something.
2256
01:49:00,170 --> 01:49:02,810
And I'm going to call this title index.
2257
01:49:02,810 --> 01:49:07,340
And I'm going to create it on the shows table,
2258
01:49:07,340 --> 01:49:09,273
specifically on the title column.
2259
01:49:09,273 --> 01:49:11,690
And we'll see in a moment what this is going to do for me.
2260
01:49:11,690 --> 01:49:12,920
Enter.
2261
01:49:12,920 --> 01:49:17,130
Took a moment, like 0.349 seconds, to create something called an index.
2262
01:49:17,130 --> 01:49:21,590
But now watch, if I select star from shows searching for The Office again,
2263
01:49:21,590 --> 01:49:24,230
previously it took me 0.021 seconds.
2264
01:49:24,230 --> 01:49:27,680
Not bad, but now, wow.
2265
01:49:27,680 --> 01:49:31,190
Literally no time at all, or so low that it wasn't really measurable.
2266
01:49:31,190 --> 01:49:33,470
And I'll do it again just to get a sense of things.
2267
01:49:33,470 --> 01:49:34,860
Still quite low.
2268
01:49:34,860 --> 01:49:40,020
Now even though 0.021 seconds, not crazy long, imagine now having a lot of data,
2269
01:49:40,020 --> 01:49:42,560
a lot of users running a real website or real mobile app.
2270
01:49:42,560 --> 01:49:45,980
Every millisecond we can start to shave off is going to be compelling.
2271
01:49:45,980 --> 01:49:47,810
So what is it we just did?
2272
01:49:47,810 --> 01:49:51,830
Well, we actually just created something called an index.
2273
01:49:51,830 --> 01:49:53,960
And this is a nice way to tie in, now, some
2274
01:49:53,960 --> 01:49:56,420
of our week 5 discussion of data structures,
2275
01:49:56,420 --> 01:49:58,250
and our week 3 discussion of running times.
2276
01:49:58,250 --> 01:50:01,280
An index in a database is some kind of fancy data
2277
01:50:01,280 --> 01:50:06,110
structure that allows the database to do better than linear search.
2278
01:50:06,110 --> 01:50:10,580
Literally, as you just saw, these tables are crazy long or tall right now,
2279
01:50:10,580 --> 01:50:12,150
very linear, that is.
2280
01:50:12,150 --> 01:50:14,300
And so when I first searched for The Office,
2281
01:50:14,300 --> 01:50:17,780
it was literally doing linear search, top to bottom, looking at as many as,
2282
01:50:17,780 --> 01:50:19,400
what, a million plus rows.
2283
01:50:19,400 --> 01:50:20,870
That's relatively slow.
2284
01:50:20,870 --> 01:50:23,330
It's not that slow, 0.021 seconds.
2285
01:50:23,330 --> 01:50:27,020
But that's relatively slow just theoretically, algorithmically,
2286
01:50:27,020 --> 01:50:28,580
doing anything linearly.
2287
01:50:28,580 --> 01:50:32,090
But if you instead create an index using syntax
2288
01:50:32,090 --> 01:50:37,880
like this, which I just did, creating an index on the title column of the show's
2289
01:50:37,880 --> 01:50:41,220
table, that's like giving the database a clue in advance saying,
2290
01:50:41,220 --> 01:50:44,660
hey, I know I'm going to search on this column in this table a lot.
2291
01:50:44,660 --> 01:50:47,430
Do something with data structures to speed things up.
2292
01:50:47,430 --> 01:50:50,030
And so if you think back to our discussion of data structures,
2293
01:50:50,030 --> 01:50:51,720
maybe it's using a tree.
2294
01:50:51,720 --> 01:50:56,060
Maybe it's using a trie or a hash table, some fancier two-dimensional data
2295
01:50:56,060 --> 01:50:59,930
structure is generally going to lift the data up creating right maybe a tree
2296
01:50:59,930 --> 01:51:00,810
structure.
2297
01:51:00,810 --> 01:51:03,620
So it's just much faster to find data, especially
2298
01:51:03,620 --> 01:51:06,500
if it's sorting it now based on title, and not
2299
01:51:06,500 --> 01:51:08,030
just storing it in one long list.
2300
01:51:08,030 --> 01:51:10,220
And in fact, in the world of relational databases,
2301
01:51:10,220 --> 01:51:12,560
the type of structure that's often used in a database
2302
01:51:12,560 --> 01:51:13,920
is something called a B-tree.
2303
01:51:13,920 --> 01:51:15,170
It's not a binary tree.
2304
01:51:15,170 --> 01:51:19,040
Different use of the letter B, but it looks a little something like the trees
2305
01:51:19,040 --> 01:51:19,820
we've seen.
2306
01:51:19,820 --> 01:51:21,680
It's not binary because some of the nodes
2307
01:51:21,680 --> 01:51:24,890
might have more than two children or fewer,
2308
01:51:24,890 --> 01:51:28,190
but it's a very wide but relatively shallow tree.
2309
01:51:28,190 --> 01:51:29,990
It's not very tall.
2310
01:51:29,990 --> 01:51:33,770
And the upside of that is that if your data is stored in this tree,
2311
01:51:33,770 --> 01:51:36,060
the database can find it more quickly.
2312
01:51:36,060 --> 01:51:41,270
And the reason it took half a second, a third of a second to build the index
2313
01:51:41,270 --> 01:51:45,260
is because SQLite needed to take some non-zero amount of time
2314
01:51:45,260 --> 01:51:47,630
to just build up this tree in memory.
2315
01:51:47,630 --> 01:51:51,900
And it has algorithms for doing so based on alphabetization or other techniques.
2316
01:51:51,900 --> 01:51:55,340
But you spend a bit of time up front, a third of a second.
2317
01:51:55,340 --> 01:51:57,470
And then thereafter, wow.
2318
01:51:57,470 --> 01:52:00,470
Every subsequent query, if I keep doing it again and again,
2319
01:52:00,470 --> 01:52:04,040
is going to be crazy low, 0.000, maybe 0.001.
2320
01:52:04,040 --> 01:52:08,240
But an order of magnitude, a factor of 10 or 100 faster than it
2321
01:52:08,240 --> 01:52:10,790
previously was earlier.
2322
01:52:10,790 --> 01:52:14,360
So we have these indexes which allow us to get at data faster.
2323
01:52:14,360 --> 01:52:17,330
But what if we want to actually get data that's
2324
01:52:17,330 --> 01:52:19,370
now across these multiple tables?
2325
01:52:19,370 --> 01:52:20,240
How can we do that?
2326
01:52:20,240 --> 01:52:23,060
And how might these indices or indexes help further?
2327
01:52:23,060 --> 01:52:26,900
Well, it turns out there is a way that we've seen already
2328
01:52:26,900 --> 01:52:29,510
indirectly to join two tables together.
2329
01:52:29,510 --> 01:52:33,410
Previously, when I selected the ID of The Office,
2330
01:52:33,410 --> 01:52:37,740
and then I searched for it in the other table using select in a nested query,
2331
01:52:37,740 --> 01:52:40,410
I was joining two tables together.
2332
01:52:40,410 --> 01:52:42,900
And it turns out there's a couple of ways to do this.
2333
01:52:42,900 --> 01:52:46,550
Let's go ahead now and, for instance, find all of Steve Carell's TV shows.
2334
01:52:46,550 --> 01:52:48,680
Not just The Office but all of them, too.
2335
01:52:48,680 --> 01:52:56,315
Unfortunately, if we look at our schema, shows up here have no mention of TV--
2336
01:52:56,315 --> 01:53:01,860
oh, shows over here has no mention of the TV stars in them.
2337
01:53:01,860 --> 01:53:05,130
And people have no mention of shows.
2338
01:53:05,130 --> 01:53:09,360
We somehow need to use this table here to connect the two.
2339
01:53:09,360 --> 01:53:14,820
And this is called a join table, in the sense that using two integer columns--
2340
01:53:14,820 --> 01:53:17,790
it joins the two tables together logically.
2341
01:53:17,790 --> 01:53:21,750
And so if you're savvy enough with SQL, you can do what I did with my hands
2342
01:53:21,750 --> 01:53:26,010
earlier and like recombine tables by using these common IDs,
2343
01:53:26,010 --> 01:53:27,460
these integers together.
2344
01:53:27,460 --> 01:53:28,330
So let me do this.
2345
01:53:28,330 --> 01:53:32,730
Let me go ahead and figure out, step-by-step, Steve Carell's shows.
2346
01:53:32,730 --> 01:53:33,970
So how am I going to do this?
2347
01:53:33,970 --> 01:53:39,120
Well, if I select star from people, where name equals Steve Carell,
2348
01:53:39,120 --> 01:53:40,840
fortunately, there's only one of them.
2349
01:53:40,840 --> 01:53:46,800
So this gives me back his name, his ID, and his birth year.
2350
01:53:46,800 --> 01:53:48,960
But it's really only his ID that I care about.
2351
01:53:48,960 --> 01:53:49,680
Why?
2352
01:53:49,680 --> 01:53:55,500
Because in order to get back his shows, I need to link person ID with show ID.
2353
01:53:55,500 --> 01:53:57,640
So I need to know his ID number.
2354
01:53:57,640 --> 01:53:59,590
So what could I do with this?
2355
01:53:59,590 --> 01:54:04,230
Well, remember the schema and the stars table.
2356
01:54:04,230 --> 01:54:07,830
I've just gotten, from the people table, Steve Carell's ID.
2357
01:54:07,830 --> 01:54:13,530
I bet by transitivity I could now use his person ID, his ID,
2358
01:54:13,530 --> 01:54:15,900
to get back all of his show IDs.
2359
01:54:15,900 --> 01:54:19,170
And then once I've got all of his show IDs, I can take it one step further
2360
01:54:19,170 --> 01:54:21,330
and get back all of his shows' titles.
2361
01:54:21,330 --> 01:54:25,440
So the answer is actually English words and not just random, seemingly,
2362
01:54:25,440 --> 01:54:26,290
integers.
2363
01:54:26,290 --> 01:54:27,640
So let me go ahead and do this.
2364
01:54:27,640 --> 01:54:31,740
Let me, again, get Steve Carell's ID number, but not star.
2365
01:54:31,740 --> 01:54:33,060
Star represents everything.
2366
01:54:33,060 --> 01:54:35,610
It's a wildcard character in SQL.
2367
01:54:35,610 --> 01:54:38,310
Let me just select the ID of Steve Carell.
2368
01:54:38,310 --> 01:54:41,640
And that gives me back 136,797.
2369
01:54:41,640 --> 01:54:43,440
And it's only giving me back one value.
2370
01:54:43,440 --> 01:54:46,710
The thing called ID is just the column heading up above.
2371
01:54:46,710 --> 01:54:51,360
Now, suppose I want to select all of the show IDs
2372
01:54:51,360 --> 01:54:53,460
that Steve Carell is affiliated with.
2373
01:54:53,460 --> 01:55:00,570
Let me select Show ID from stars, where the person ID in stars
2374
01:55:00,570 --> 01:55:03,480
happens to equal Steve Carell's ID.
2375
01:55:03,480 --> 01:55:07,320
So again, I'm building up my answer in reverse and taking these baby steps.
2376
01:55:07,320 --> 01:55:11,430
On the right, in parentheses, I'm getting Steve Carell's ID.
2377
01:55:11,430 --> 01:55:15,330
On the left, I am now selecting all of the show IDs
2378
01:55:15,330 --> 01:55:19,410
that have some connection with that person ID in the stars table.
2379
01:55:19,410 --> 01:55:21,850
This answer, too, is not going to be that illuminating.
2380
01:55:21,850 --> 01:55:25,420
It's just a whole bunch of integers that have no meaning to me as a human.
2381
01:55:25,420 --> 01:55:27,770
But let's take this one step further.
2382
01:55:27,770 --> 01:55:29,520
And even though my code is getting long, I
2383
01:55:29,520 --> 01:55:31,920
could hit Enter and format it nicely, especially
2384
01:55:31,920 --> 01:55:33,660
if I were doing this in a code file.
2385
01:55:33,660 --> 01:55:35,580
But I'm just doing it interactively for now.
2386
01:55:35,580 --> 01:55:39,420
Let's now select all of the titles from the shows table,
2387
01:55:39,420 --> 01:55:48,100
where the ID of the show is in this following previous query.
2388
01:55:48,100 --> 01:55:49,740
So again, the query is getting long.
2389
01:55:49,740 --> 01:55:52,200
But notice, it's the third and last step.
2390
01:55:52,200 --> 01:55:55,950
Select title from the shows table, where the ID of the show
2391
01:55:55,950 --> 01:55:58,590
is in the list of all of the show IDs that
2392
01:55:58,590 --> 01:56:02,040
came back from the stars table searching for Steve Carell's person ID.
2393
01:56:02,040 --> 01:56:03,480
How did we get that person ID?
2394
01:56:03,480 --> 01:56:04,680
Let me scroll to the end.
2395
01:56:04,680 --> 01:56:11,160
Well, I selected, in my innermost parentheses, Steve Carell's own ID.
2396
01:56:11,160 --> 01:56:13,440
So now, when I hit Enter, voila.
2397
01:56:13,440 --> 01:56:16,525
I get all of Steve Carell's TV shows up until now.
2398
01:56:16,525 --> 01:56:19,650
And if I want to tidy this up further, I can use the same tricks as before.
2399
01:56:19,650 --> 01:56:22,180
Order by title, semicolon.
2400
01:56:22,180 --> 01:56:25,540
Now I've got it all alphabetized as before.
2401
01:56:25,540 --> 01:56:28,080
So again, with SQL comes the ability to search--
2402
01:56:28,080 --> 01:56:31,290
I mean, look how quickly we do this, 0.094 seconds
2403
01:56:31,290 --> 01:56:34,650
to search across three different tables to get back this answer.
2404
01:56:34,650 --> 01:56:38,820
But my data is now all neatly designed in individual tables,
2405
01:56:38,820 --> 01:56:42,000
which is going to be important now that the data set is so large.
2406
01:56:42,000 --> 01:56:44,340
But let me take this one step further.
2407
01:56:44,340 --> 01:56:46,930
Let me go ahead and do this.
2408
01:56:46,930 --> 01:56:51,580
Let me go ahead and point out that with this query,
2409
01:56:51,580 --> 01:56:54,870
notice that I'm searching on--
2410
01:56:54,870 --> 01:56:58,710
let's say I'm searching on a person ID here.
2411
01:56:58,710 --> 01:57:02,410
And at the end here, I'm searching on a name column here.
2412
01:57:02,410 --> 01:57:05,230
So let me actually go ahead and do this.
2413
01:57:05,230 --> 01:57:09,510
Let me go ahead and see if we can't speed this up.
2414
01:57:09,510 --> 01:57:13,090
This query at the moment takes 0.092 seconds.
2415
01:57:13,090 --> 01:57:15,930
Let's see if we can't speed this up further by just quickly creating
2416
01:57:15,930 --> 01:57:18,930
a few more of those B-trees in the databases memory.
2417
01:57:18,930 --> 01:57:24,240
Create an index called person index, and I'm going to do this on the stars table
2418
01:57:24,240 --> 01:57:25,950
on the person ID column.
2419
01:57:25,950 --> 01:57:26,850
Enter.
2420
01:57:26,850 --> 01:57:28,488
It's taking a moment, taking a moment.
2421
01:57:28,488 --> 01:57:30,780
That's almost a full second because that's a big table.
2422
01:57:30,780 --> 01:57:35,050
Let's create another index called show index on the stars table.
2423
01:57:35,050 --> 01:57:35,550
Why?
2424
01:57:35,550 --> 01:57:37,950
Because I want to search by the show ID also.
2425
01:57:37,950 --> 01:57:39,780
That was part of my big query.
2426
01:57:39,780 --> 01:57:40,660
Takes a moment.
2427
01:57:40,660 --> 01:57:43,810
OK, just more than about 2/3 of a second.
2428
01:57:43,810 --> 01:57:46,440
Now let's create one last one, another index called name index,
2429
01:57:46,440 --> 01:57:49,160
but I could call these things anything I want, on the people table.
2430
01:57:49,160 --> 01:57:49,560
Why?
2431
01:57:49,560 --> 01:57:51,550
Because I'm also searching on the name column.
2432
01:57:51,550 --> 01:57:53,730
So in short, I'm creating indexes on each
2433
01:57:53,730 --> 01:57:57,450
of the columns that are somehow involved in my search query,
2434
01:57:57,450 --> 01:57:59,680
going from one table to the other.
2435
01:57:59,680 --> 01:58:07,590
Now let's go back to the previous query, which, recall, took--
2436
01:58:07,590 --> 01:58:10,270
I think I erased it, 0.091.
2437
01:58:10,270 --> 01:58:10,770
All right.
2438
01:58:10,770 --> 01:58:12,645
Well, it was roughly this order of magnitude.
2439
01:58:12,645 --> 01:58:14,140
We're not seeing the data now.
2440
01:58:14,140 --> 01:58:17,160
But let me go ahead and run my original big query once.
2441
01:58:17,160 --> 01:58:20,380
And boom, we're down to almost nothing.
2442
01:58:20,380 --> 01:58:22,860
So again, creating these indexes in memory
2443
01:58:22,860 --> 01:58:27,640
has the effect of rapidly speeding up our computation time.
2444
01:58:27,640 --> 01:58:31,140
Now if you've ever used, for instance, the my.harvard course shopping tool
2445
01:58:31,140 --> 01:58:35,130
here on campus, or Yale's analogue, you might wonder, why is the thing so slow?
2446
01:58:35,130 --> 01:58:39,330
This could be one of the reasons why large data sets with thousands of rows,
2447
01:58:39,330 --> 01:58:41,670
thousands of courses tend to be slow, if,
2448
01:58:41,670 --> 01:58:44,880
and I'm only conjecturing, if the database isn't properly indexed.
2449
01:58:44,880 --> 01:58:46,890
If you're building your own web application
2450
01:58:46,890 --> 01:58:49,170
and you're finding that users are waiting and waiting,
2451
01:58:49,170 --> 01:58:52,290
and things are spinning and spinning, what might be among the problems?
2452
01:58:52,290 --> 01:58:55,770
Well, it could absolutely just be bad algorithms and bad code that you wrote.
2453
01:58:55,770 --> 01:58:58,270
Or it might be that you haven't thought about, well,
2454
01:58:58,270 --> 01:59:01,770
what column should be optimized for searches and filtration
2455
01:59:01,770 --> 01:59:05,705
like I've done here in order to speed up subsequent queries?
2456
01:59:05,705 --> 01:59:07,830
Again, from the outside in, we can only conjecture.
2457
01:59:07,830 --> 01:59:11,580
But ultimately, this is just one of the things that
2458
01:59:11,580 --> 01:59:14,308
explains performance problems as well.
2459
01:59:14,308 --> 01:59:17,100
All right, let's point out just a couple of final syntactic things,
2460
01:59:17,100 --> 01:59:19,770
and then we'll consider, bigger picture, some problems
2461
01:59:19,770 --> 01:59:22,110
that might arise in this world.
2462
01:59:22,110 --> 01:59:26,880
If these nested, nested queries start to get a little much,
2463
01:59:26,880 --> 01:59:29,160
there are other ways, just so you've seen it,
2464
01:59:29,160 --> 01:59:31,920
that you can execute similar logic in SQL.
2465
01:59:31,920 --> 01:59:34,410
For instance, if I know in advance that I
2466
01:59:34,410 --> 01:59:39,390
want to connect Steve Carell to his show IDs and to their titles,
2467
01:59:39,390 --> 01:59:41,190
we can do something more like this.
2468
01:59:41,190 --> 01:59:52,050
Select title from the people table, joined with the stars table on people
2469
01:59:52,050 --> 01:59:54,900
ID equals stars.personID.
2470
01:59:54,900 --> 01:59:55,860
So what am I doing?
2471
01:59:55,860 --> 01:59:56,490
New syntax.
2472
01:59:56,490 --> 01:59:59,782
And again, this is not something you'll have to memorize or ingrain right away.
2473
01:59:59,782 --> 02:00:04,080
But just so you've seen other approaches, select title from people
2474
02:00:04,080 --> 02:00:05,160
join stars.
2475
02:00:05,160 --> 02:00:09,690
This is an explicit way to say, take the people table in one hand, the stars
2476
02:00:09,690 --> 02:00:12,030
table in the other hand, and somehow join them
2477
02:00:12,030 --> 02:00:13,990
as I keep doing with my fingertips here.
2478
02:00:13,990 --> 02:00:15,610
How to join them?
2479
02:00:15,610 --> 02:00:20,370
Join them so that the people, the ID column in the people table lines up
2480
02:00:20,370 --> 02:00:23,260
with the person ID in the stars table.
2481
02:00:23,260 --> 02:00:25,260
But that's not quite everything.
2482
02:00:25,260 --> 02:00:28,740
I could also say, join further on the shows table,
2483
02:00:28,740 --> 02:00:35,290
where the stars show ID equals the shows ID column.
2484
02:00:35,290 --> 02:00:36,540
So what am I doing here?
2485
02:00:36,540 --> 02:00:45,990
That's saying, go further and join the stars table with the show's table,
2486
02:00:45,990 --> 02:00:48,990
joining the show ID column with the ID column.
2487
02:00:48,990 --> 02:00:51,750
Again, this starts to get a little messy to think about.
2488
02:00:51,750 --> 02:00:55,830
But now I can just say, where name equals, quote, unquote, "Steve Carell."
2489
02:00:55,830 --> 02:00:59,070
I can do in one query what previously took me three nested queries
2490
02:00:59,070 --> 02:01:00,600
and get back the same answers.
2491
02:01:00,600 --> 02:01:05,380
And I can still add in my order by title to get back the result.
2492
02:01:05,380 --> 02:01:09,780
And if I do this a little more neatly, let me type this out a little
2493
02:01:09,780 --> 02:01:10,750
differently.
2494
02:01:10,750 --> 02:01:16,620
Let me type this out by adding a new line-- ah, I can't do that here.
2495
02:01:16,620 --> 02:01:18,120
I'm going to leave it alone for now.
2496
02:01:18,120 --> 02:01:20,700
We can type it on multiple lines in other contexts.
2497
02:01:20,700 --> 02:01:23,700
And let me do one last thing.
2498
02:01:23,700 --> 02:01:25,012
Do I want to show that?
2499
02:01:25,012 --> 02:01:26,970
I'm going to show it, but this is not something
2500
02:01:26,970 --> 02:01:28,440
you should ingrain just yet either.
2501
02:01:28,440 --> 02:01:31,620
Select title from people, stars, and shows.
2502
02:01:31,620 --> 02:01:34,860
If you know in advance that you want to do something with all three tables,
2503
02:01:34,860 --> 02:01:38,130
you can just enumerate them, one table name after the other.
2504
02:01:38,130 --> 02:01:43,497
And then you can say where people.ID equals stars.personID.
2505
02:01:43,497 --> 02:01:45,330
And now I'm hitting Enter so that it formats
2506
02:01:45,330 --> 02:01:47,070
a little more readably on my screen.
2507
02:01:47,070 --> 02:01:55,140
And stars.showID equals shows.ID, and lastly, name equals Steve Carell.
2508
02:01:55,140 --> 02:01:59,730
In short, you specify that you want to select data from all three
2509
02:01:59,730 --> 02:02:00,690
of these tables.
2510
02:02:00,690 --> 02:02:06,450
And then you tell the database how to combine foreign keys with primary keys,
2511
02:02:06,450 --> 02:02:09,690
that is, the columns that have those integers in common.
2512
02:02:09,690 --> 02:02:12,720
If I hit Enter now, I get the same exact results, ever
2513
02:02:12,720 --> 02:02:16,110
more so if I also add in an order by title.
2514
02:02:16,110 --> 02:02:17,770
Oops.
2515
02:02:17,770 --> 02:02:18,270
All right.
2516
02:02:18,270 --> 02:02:20,190
That's why I didn't want to do this earlier.
2517
02:02:20,190 --> 02:02:23,190
I have to go back through my history multiple times to actually get back
2518
02:02:23,190 --> 02:02:24,640
the multi-line query this time.
2519
02:02:24,640 --> 02:02:25,140
All right.
2520
02:02:25,140 --> 02:02:27,280
That was a lot all at once.
2521
02:02:27,280 --> 02:02:31,365
But this is only to say that, even as we make the design of the data
2522
02:02:31,365 --> 02:02:33,990
more sophisticated, and we put some of it over here, some of it
2523
02:02:33,990 --> 02:02:38,130
over here, some of it over here so as to avoid duplication of data, weird hacks
2524
02:02:38,130 --> 02:02:41,910
like putting commas in the data, we can still get back all of the answers
2525
02:02:41,910 --> 02:02:44,280
that we might want across these several tables.
2526
02:02:44,280 --> 02:02:48,580
And using indexes, we can significantly speed up these processes
2527
02:02:48,580 --> 02:02:52,140
so as to handle 10 times as many, a 100 times as many users
2528
02:02:52,140 --> 02:02:53,670
on the same actual database.
2529
02:02:53,670 --> 02:02:55,020
There is going to be a downside.
2530
02:02:55,020 --> 02:02:57,540
And thinking back to our discussion of algorithms and data
2531
02:02:57,540 --> 02:03:02,120
structures in past weeks, what might be a downside of creating these indexes?
2532
02:03:02,120 --> 02:03:06,110
Because as of now, I created four separate indexes on the name column,
2533
02:03:06,110 --> 02:03:09,560
the title column, and some other columns, too.
2534
02:03:09,560 --> 02:03:11,930
Why wouldn't I just go ahead and index everything
2535
02:03:11,930 --> 02:03:14,390
if it's clearly speeding things up?
2536
02:03:14,390 --> 02:03:15,770
Memory, so space.
2537
02:03:15,770 --> 02:03:18,890
Any time you're starting to benefit time wise in computer science,
2538
02:03:18,890 --> 02:03:21,950
odds are you're sacrificing space, or vice versa.
2539
02:03:21,950 --> 02:03:25,400
And probably indexing absolutely everything is a little dumb
2540
02:03:25,400 --> 02:03:29,430
because you're going to waste way more space than you might actually need.
2541
02:03:29,430 --> 02:03:31,610
So figuring out where the right inflection point is
2542
02:03:31,610 --> 02:03:36,410
is part of the process of designing and just getting better at these things.
2543
02:03:36,410 --> 02:03:40,910
Now unfortunately, a whole lot of things can go wrong in this world,
2544
02:03:40,910 --> 02:03:44,870
and they continue to in the real world with people using SQL databases.
2545
02:03:44,870 --> 02:03:46,760
And in fact, here on out, if you're reading
2546
02:03:46,760 --> 02:03:51,530
something technical about SQL databases, and websites being hacked in some form,
2547
02:03:51,530 --> 02:03:54,660
and passwords leaking out, unfortunately, all too often
2548
02:03:54,660 --> 02:03:57,505
it is because of what are called SQL injection attacks.
2549
02:03:57,505 --> 02:03:59,630
And just to give you a sense now to counterbalance,
2550
02:03:59,630 --> 02:04:01,588
maybe [INAUDIBLE] enthusiasm for like, oh, that
2551
02:04:01,588 --> 02:04:03,620
was neat how we can do things so quickly.
2552
02:04:03,620 --> 02:04:06,680
With great power comes responsibility in this world, too.
2553
02:04:06,680 --> 02:04:09,320
And so many people introduce bugs into their code
2554
02:04:09,320 --> 02:04:17,160
by not quite appreciating how it is the data is getting into your application.
2555
02:04:17,160 --> 02:04:18,390
So what do I mean by that?
2556
02:04:18,390 --> 02:04:21,200
Here, for instance, is a typical login screen for Yale.
2557
02:04:21,200 --> 02:04:23,780
And here's the analogue for Harvard where you're prompted,
2558
02:04:23,780 --> 02:04:26,450
every day probably, for your username and your password,
2559
02:04:26,450 --> 02:04:28,460
your email address and your password here.
2560
02:04:28,460 --> 02:04:32,420
Suppose, though, that behind this login page,
2561
02:04:32,420 --> 02:04:35,030
whether Harvard's or Yale's, there's some website.
2562
02:04:35,030 --> 02:04:38,270
And that website is using SQL underneath the hood
2563
02:04:38,270 --> 02:04:40,700
to store all of the Harvard or Yale people's
2564
02:04:40,700 --> 02:04:43,940
usernames, passwords, ID numbers, courses, transcripts,
2565
02:04:43,940 --> 02:04:44,970
all of that stuff.
2566
02:04:44,970 --> 02:04:47,480
So there's a SQL database underneath the website.
2567
02:04:47,480 --> 02:04:50,360
Well, what might go wrong with this process?
2568
02:04:50,360 --> 02:04:52,850
Unfortunately, there's some special syntax in SQL
2569
02:04:52,850 --> 02:04:54,530
just like there is in C and Python.
2570
02:04:54,530 --> 02:04:56,960
For instance, there are comments in SQL, too.
2571
02:04:56,960 --> 02:05:00,680
If you do two hyphens, dash, dash, that's a comment in SQL.
2572
02:05:00,680 --> 02:05:06,170
And if you, the programmer, aren't sufficiently distrustful of your users,
2573
02:05:06,170 --> 02:05:09,560
such that you defend against potentially adversarial attacks,
2574
02:05:09,560 --> 02:05:11,160
you might do something like this.
2575
02:05:11,160 --> 02:05:16,070
Suppose that I somewhat maliciously or curiously log in
2576
02:05:16,070 --> 02:05:19,130
by typing my username, Malan@harvard.edu, and then maybe
2577
02:05:19,130 --> 02:05:20,990
a single quote and a dash, dash.
2578
02:05:20,990 --> 02:05:21,680
Why?
2579
02:05:21,680 --> 02:05:24,860
Because I'm trying to suss out if there is a vulnerability here
2580
02:05:24,860 --> 02:05:26,420
to a SQL injection attack.
2581
02:05:26,420 --> 02:05:27,910
Do not do this in general.
2582
02:05:27,910 --> 02:05:31,160
But if I were the owner of the website trying to see if I've made any mistake,
2583
02:05:31,160 --> 02:05:35,300
I might try using potentially dangerous characters in my input.
2584
02:05:35,300 --> 02:05:36,290
Dangerous how?
2585
02:05:36,290 --> 02:05:40,340
Because single quote is used for quoting things in SQL, as we've seen--
2586
02:05:40,340 --> 02:05:41,810
single quotes or double quotes.
2587
02:05:41,810 --> 02:05:44,930
Dash, dash, I claim now, is used for commenting.
2588
02:05:44,930 --> 02:05:47,960
But let's now imagine what the code underneath the hood
2589
02:05:47,960 --> 02:05:52,160
might be for something like Yale's login or Harvard's login.
2590
02:05:52,160 --> 02:05:54,600
What if it's code that looks like this?
2591
02:05:54,600 --> 02:05:56,540
So let me read it from left to right.
2592
02:05:56,540 --> 02:06:00,710
Suppose that they are using something like CS50's own execute function,
2593
02:06:00,710 --> 02:06:03,230
and they've got some SQL typed into the website that
2594
02:06:03,230 --> 02:06:07,160
says select star from users, where username equals this,
2595
02:06:07,160 --> 02:06:09,050
and password equals that.
2596
02:06:09,050 --> 02:06:12,510
And they're plugging in username and password.
2597
02:06:12,510 --> 02:06:13,607
So what am I doing here?
2598
02:06:13,607 --> 02:06:16,190
Well, when the user types their username password, hits Enter,
2599
02:06:16,190 --> 02:06:18,920
I probably want to select that user from my database
2600
02:06:18,920 --> 02:06:21,020
to see if the username and passwords match.
2601
02:06:21,020 --> 02:06:23,720
So the underlying SQL might be, select star
2602
02:06:23,720 --> 02:06:25,790
from users, where username equals question mark,
2603
02:06:25,790 --> 02:06:27,207
and password equals question mark.
2604
02:06:27,207 --> 02:06:28,100
Users is the table.
2605
02:06:28,100 --> 02:06:29,510
One column is username.
2606
02:06:29,510 --> 02:06:31,400
One column is password.
2607
02:06:31,400 --> 02:06:32,000
All right.
2608
02:06:32,000 --> 02:06:37,430
And if we get back one row, presumably Malan@harvard.edu
2609
02:06:37,430 --> 02:06:38,970
exists with that password.
2610
02:06:38,970 --> 02:06:41,190
We should let him proceed from there on out.
2611
02:06:41,190 --> 02:06:45,140
So that's some pseudo code, if you will, for this scenario.
2612
02:06:45,140 --> 02:06:49,580
What if, though, this code is not as well written as it currently
2613
02:06:49,580 --> 02:06:51,500
is, and isn't using question marks?
2614
02:06:51,500 --> 02:06:54,757
So the question mark syntax is a fairly common SQL thing,
2615
02:06:54,757 --> 02:06:56,840
where the question marks are used as placeholders,
2616
02:06:56,840 --> 02:06:59,390
just like in printf, percent S was.
2617
02:06:59,390 --> 02:07:02,900
But this function, db.execute from CS50's library
2618
02:07:02,900 --> 02:07:05,420
and third-party libraries as well, is also
2619
02:07:05,420 --> 02:07:07,790
doing some good stuff with these question marks,
2620
02:07:07,790 --> 02:07:09,830
and defending against the following attack.
2621
02:07:09,830 --> 02:07:12,920
Suppose that you were not using a third-party library like ours
2622
02:07:12,920 --> 02:07:16,490
and you were just manually constructing your SQL queries like this.
2623
02:07:16,490 --> 02:07:19,940
You were to do something like this instead using an f-string in Python.
2624
02:07:19,940 --> 02:07:21,800
You're comfortable with format strings now.
2625
02:07:21,800 --> 02:07:24,883
You've gotten into the habit of using curly braces and plugging in values.
2626
02:07:24,883 --> 02:07:27,230
Suppose that you, the aspiring programmer,
2627
02:07:27,230 --> 02:07:29,660
is just using techniques that you've been taught.
2628
02:07:29,660 --> 02:07:32,660
So you have an f-string with select star from users,
2629
02:07:32,660 --> 02:07:36,470
where username equals, quote, unquote, "username" in curly braces.
2630
02:07:36,470 --> 02:07:41,270
And password equals, quote, unquote, "password" in curly braces.
2631
02:07:41,270 --> 02:07:44,270
As of what, two weeks ago, this was perfectly
2632
02:07:44,270 --> 02:07:49,460
legitimate technique in Python to plug in values into a string.
2633
02:07:49,460 --> 02:07:53,630
But notice if you are using single quotes yourself
2634
02:07:53,630 --> 02:07:58,750
and the user has typed in single quotes to their input, what
2635
02:07:58,750 --> 02:07:59,920
could go wrong here?
2636
02:07:59,920 --> 02:08:04,510
Where are we going with this if you're just blindly plugging user input
2637
02:08:04,510 --> 02:08:08,350
into your own prepared string of text?
2638
02:08:08,350 --> 02:08:09,230
Yeah?
2639
02:08:09,230 --> 02:08:14,725
AUDIENCE: [INAUDIBLE]
2640
02:08:14,725 --> 02:08:15,600
DAVID J. MALAN: Yeah.
2641
02:08:15,600 --> 02:08:20,670
Worst case, they could insert what is actually SQL code into your database
2642
02:08:20,670 --> 02:08:21,810
as follows.
2643
02:08:21,810 --> 02:08:25,140
Generally speaking, if you're using special syntax like single quotes
2644
02:08:25,140 --> 02:08:27,000
to surround the user's input, you'd better
2645
02:08:27,000 --> 02:08:29,250
hope that they don't have an apostrophe in their name.
2646
02:08:29,250 --> 02:08:31,875
Or you better hope that they don't type a single quote as well.
2647
02:08:31,875 --> 02:08:36,025
Because what if their single quote finishes your single quote instead,
2648
02:08:36,025 --> 02:08:37,900
and then the rest of this is somehow ignored?
2649
02:08:37,900 --> 02:08:39,330
Well, let's consider how this might happen.
2650
02:08:39,330 --> 02:08:40,510
Let me go ahead in here.
2651
02:08:40,510 --> 02:08:42,840
This got a little blurry here, but let me
2652
02:08:42,840 --> 02:08:44,850
plug in here-- wow, that looks awful.
2653
02:08:44,850 --> 02:08:47,940
Let me fix the red.
2654
02:08:47,940 --> 02:08:50,430
Just change this to white so it's more readable.
2655
02:08:50,430 --> 02:08:56,730
What happens if the user does this instead?
2656
02:08:56,730 --> 02:08:59,310
They type in, like I did into the screenshot,
2657
02:08:59,310 --> 02:09:02,760
'Malan@harvard.edu,' single quote, dash, dash.
2658
02:09:02,760 --> 02:09:05,070
What has just happened logically, even though we've
2659
02:09:05,070 --> 02:09:06,900
only just begun with SQL today?
2660
02:09:06,900 --> 02:09:12,150
Well, select star from users, where username equals Malan@harvard.edu, end
2661
02:09:12,150 --> 02:09:13,320
quote.
2662
02:09:13,320 --> 02:09:17,340
What's bad about the rest of this?
2663
02:09:17,340 --> 02:09:19,752
Dash, dash, I claim, means a comment, which
2664
02:09:19,752 --> 02:09:22,210
means my color coding is going to be a little blurry again.
2665
02:09:22,210 --> 02:09:25,140
But everything after the dash, dash is just ignored.
2666
02:09:25,140 --> 02:09:27,440
The logic, then, of the SQL query, then, is
2667
02:09:27,440 --> 02:09:30,760
to just say, select Malan@harvard.edu from the database,
2668
02:09:30,760 --> 02:09:33,490
not even checking the password anymore.
2669
02:09:33,490 --> 02:09:36,180
Therefore, you will get back at least one row.
2670
02:09:36,180 --> 02:09:41,190
So length of rows will equal 1, and so presumably the rest of the pseudo code
2671
02:09:41,190 --> 02:09:44,190
logs the user in, gives them access to my my.harvard account,
2672
02:09:44,190 --> 02:09:45,150
or whatever it is.
2673
02:09:45,150 --> 02:09:50,640
And they've pretended to be me simply by using a single quote and a dash,
2674
02:09:50,640 --> 02:09:52,445
dash in the username field.
2675
02:09:52,445 --> 02:09:54,570
Again, please don't go start doing this later today
2676
02:09:54,570 --> 02:09:56,140
on Harvard, Yale, or other websites.
2677
02:09:56,140 --> 02:09:57,670
But it could be as simple as that.
2678
02:09:57,670 --> 02:09:58,170
Why?
2679
02:09:58,170 --> 02:10:00,030
Because the programmer practiced what they
2680
02:10:00,030 --> 02:10:04,110
were taught, which was just to use curly braces to plug in,
2681
02:10:04,110 --> 02:10:05,560
in f-strings, values.
2682
02:10:05,560 --> 02:10:08,590
But if you don't understand how the user's input is going to be used,
2683
02:10:08,590 --> 02:10:12,255
and if you don't distrust your users fundamentally, for every good person
2684
02:10:12,255 --> 02:10:14,130
out there there's going to be, unfortunately,
2685
02:10:14,130 --> 02:10:19,380
some adversary who just wants to try to find fault in your code or hack
2686
02:10:19,380 --> 02:10:20,490
into your data set.
2687
02:10:20,490 --> 02:10:22,980
This is what's known as a SQL injection attack,
2688
02:10:22,980 --> 02:10:27,000
because the user can type something that happens to be or look like SQL,
2689
02:10:27,000 --> 02:10:31,110
and trick your database into doing something it didn't intend to,
2690
02:10:31,110 --> 02:10:34,830
like, for instance, logging the user in.
2691
02:10:34,830 --> 02:10:36,880
Worst case, they could even do something else.
2692
02:10:36,880 --> 02:10:41,490
Maybe the user types a semicolon, then the word drop, or the word update.
2693
02:10:41,490 --> 02:10:45,180
You could imagine doing semicolon update table grades, where
2694
02:10:45,180 --> 02:10:49,140
name equals Malan, and set the grade equal to A instead of B,
2695
02:10:49,140 --> 02:10:50,670
or something like that.
2696
02:10:50,670 --> 02:10:53,550
The ability to inject SQL into the database
2697
02:10:53,550 --> 02:10:56,820
means you can do anything you want with the data set, either constructively,
2698
02:10:56,820 --> 02:11:00,000
or worse, destructively.
2699
02:11:00,000 --> 02:11:02,880
And now, just a quick, little cartoon that should now make sense.
2700
02:11:02,880 --> 02:11:09,410
2701
02:11:09,410 --> 02:11:13,020
OK, to, like, one of us, two of us.
2702
02:11:13,020 --> 02:11:14,400
Awkwardly somewhat funny.
2703
02:11:14,400 --> 02:11:16,560
All right, so let's move on to one last condition.
2704
02:11:16,560 --> 02:11:19,130
There's one other problem that can go awry here.
2705
02:11:19,130 --> 02:11:20,380
Oh, and I should explain this.
2706
02:11:20,380 --> 02:11:25,500
So this is an allusion to the son, Robert, having typed in semicolon.
2707
02:11:25,500 --> 02:11:28,810
The word drop, table, students, and doing some of the same technique.
2708
02:11:28,810 --> 02:11:31,440
This is humor that only CS people would understand
2709
02:11:31,440 --> 02:11:35,040
because it's the mom realizing, oh, her son's doing a SQL injection
2710
02:11:35,040 --> 02:11:36,308
attack onto the database.
2711
02:11:36,308 --> 02:11:39,600
Less funny when you explain it, but once you notice the syntax, that's all this
2712
02:11:39,600 --> 02:11:40,960
is an allusion to.
2713
02:11:40,960 --> 02:11:41,460
All right.
2714
02:11:41,460 --> 02:11:44,850
So one final threat, now that you are graduating
2715
02:11:44,850 --> 02:11:49,320
to the world of proper databases and away from CSV files alone.
2716
02:11:49,320 --> 02:11:52,170
Things can go wrong when using databases,
2717
02:11:52,170 --> 02:11:55,838
and honestly, even using CSV files if you have multiple users.
2718
02:11:55,838 --> 02:11:57,630
And thus far, you and I have had the luxury
2719
02:11:57,630 --> 02:12:00,547
in almost every program we've written that it's just me using my code.
2720
02:12:00,547 --> 02:12:01,830
It's just you using your code.
2721
02:12:01,830 --> 02:12:04,770
And even if your teaching fellow or TA is using it, probably
2722
02:12:04,770 --> 02:12:06,060
not at the same time.
2723
02:12:06,060 --> 02:12:10,770
But the world gets interesting if you start putting your code on phones,
2724
02:12:10,770 --> 02:12:14,730
on websites, such that now you might have two users literally trying
2725
02:12:14,730 --> 02:12:17,430
to log in at the same time, literally clicking a button
2726
02:12:17,430 --> 02:12:19,410
at the same, or nearly the same time.
2727
02:12:19,410 --> 02:12:22,380
What happens, then, if a computer is trying
2728
02:12:22,380 --> 02:12:25,290
to handle requests from two different people at once,
2729
02:12:25,290 --> 02:12:27,480
as might happen all the time on a website?
2730
02:12:27,480 --> 02:12:29,610
You might get what are called race conditions.
2731
02:12:29,610 --> 02:12:33,060
And this is a problem in computing in general, not just with SQL, not just
2732
02:12:33,060 --> 02:12:36,960
with Python, really just any time you have shared data,
2733
02:12:36,960 --> 02:12:39,150
like a database, as follows.
2734
02:12:39,150 --> 02:12:43,620
This apparently is one of the most liked Instagram posts ever.
2735
02:12:43,620 --> 02:12:46,110
It is literally just a picture of an egg.
2736
02:12:46,110 --> 02:12:47,645
Has anyone clicked on this egg?
2737
02:12:47,645 --> 02:12:48,270
Like, a couple?
2738
02:12:48,270 --> 02:12:48,780
Oh, OK.
2739
02:12:48,780 --> 02:12:49,280
Wow.
2740
02:12:49,280 --> 02:12:50,220
All right, so yes.
2741
02:12:50,220 --> 02:12:53,880
So go search for this photo if you'd like to add to the likes on Instagram.
2742
02:12:53,880 --> 02:12:56,110
The account is world_record_egg.
2743
02:12:56,110 --> 02:12:59,040
This is just a screenshot of Instagram of that picture of an egg.
2744
02:12:59,040 --> 02:13:00,540
If you're in the habit of using Instagram,
2745
02:13:00,540 --> 02:13:03,498
or like any social media site, there's some equivalent of a like button
2746
02:13:03,498 --> 02:13:04,920
or a heart button these days.
2747
02:13:04,920 --> 02:13:06,900
And that's actually a really hard problem.
2748
02:13:06,900 --> 02:13:10,050
Such a simple idea to count the number of likes something
2749
02:13:10,050 --> 02:13:12,990
has, but that means someone has to click on it.
2750
02:13:12,990 --> 02:13:14,910
Your code has to detect the click.
2751
02:13:14,910 --> 02:13:17,860
Your code has to update the database, and then do it again and again,
2752
02:13:17,860 --> 02:13:22,740
even if multiple people are perhaps right now clicking on that same egg.
2753
02:13:22,740 --> 02:13:28,350
And unfortunately, bad things can happen if two people try to do something
2754
02:13:28,350 --> 02:13:30,540
at the same time on a computer.
2755
02:13:30,540 --> 02:13:31,690
How might this happen?
2756
02:13:31,690 --> 02:13:35,670
So here's some more code, half pseudocode, half Python code here,
2757
02:13:35,670 --> 02:13:36,580
as follows.
2758
02:13:36,580 --> 02:13:40,230
Suppose that what happens when you, literally, right now, maybe click
2759
02:13:40,230 --> 02:13:43,470
on the like button on the Instagram post.
2760
02:13:43,470 --> 02:13:47,760
Suppose that code, like the following, is executed on Facebook servers.
2761
02:13:47,760 --> 02:13:54,190
db.execute of select likes from posts where ID equals question mark.
2762
02:13:54,190 --> 02:13:54,690
All right.
2763
02:13:54,690 --> 02:13:57,090
So what am I assuming here?
2764
02:13:57,090 --> 02:13:59,280
I'm assuming that that photograph has a unique ID.
2765
02:13:59,280 --> 02:14:02,670
It's some big integer, whatever it was, randomly assigned.
2766
02:14:02,670 --> 02:14:05,130
I'm assuming that when you click on the heart
2767
02:14:05,130 --> 02:14:08,160
the unique ID is somehow sent to Instagram servers
2768
02:14:08,160 --> 02:14:10,740
so that their code can call it ID.
2769
02:14:10,740 --> 02:14:13,830
And I'm assuming that Instagram is using its SQL database
2770
02:14:13,830 --> 02:14:17,790
and selecting, from a posts table, the current number of likes
2771
02:14:17,790 --> 02:14:21,160
of that egg for that given ID number.
2772
02:14:21,160 --> 02:14:21,660
Why?
2773
02:14:21,660 --> 02:14:24,952
Because I need to know how many likes it already has if I want to add one to it
2774
02:14:24,952 --> 02:14:26,190
and then update the database.
2775
02:14:26,190 --> 02:14:29,710
I need to select the data, then I need to update the data here.
2776
02:14:29,710 --> 02:14:30,210
All right.
2777
02:14:30,210 --> 02:14:33,780
So in some Python code here, let's store, in a variable called
2778
02:14:33,780 --> 02:14:37,950
likes, whatever comes back in the first row from the likes column.
2779
02:14:37,950 --> 02:14:40,660
Again, this is new syntax specific to our library,
2780
02:14:40,660 --> 02:14:43,830
but a common way of getting back first row and the column called
2781
02:14:43,830 --> 02:14:44,800
likes therein.
2782
02:14:44,800 --> 02:14:46,920
So at this point in the story, likes is storing
2783
02:14:46,920 --> 02:14:49,462
the total number of likes, in the millions or whatever it is,
2784
02:14:49,462 --> 02:14:50,610
of that particular egg.
2785
02:14:50,610 --> 02:14:51,900
Then I do this.
2786
02:14:51,900 --> 02:14:56,400
Execute update posts, set the number of likes
2787
02:14:56,400 --> 02:15:00,607
equal to this value, where the ID of the post equals this value.
2788
02:15:00,607 --> 02:15:02,190
What do I want to update the likes to?
2789
02:15:02,190 --> 02:15:06,540
Whatever likes currently is plus 1, and then plugging in the ID.
2790
02:15:06,540 --> 02:15:08,340
So a simple idea, right?
2791
02:15:08,340 --> 02:15:11,730
I'm checking the value of the likes, and maybe it's 10.
2792
02:15:11,730 --> 02:15:15,030
I'm changing 10 to 11 and then updating the table.
2793
02:15:15,030 --> 02:15:18,390
But a problem can arise if two people have
2794
02:15:18,390 --> 02:15:23,010
clicked on that egg at roughly the same time, or literally, the same time.
2795
02:15:23,010 --> 02:15:24,340
Why is that?
2796
02:15:24,340 --> 02:15:26,850
Well, in the world of databases and servers,
2797
02:15:26,850 --> 02:15:31,120
and the Instagrams of the world have thousands of physical servers nowadays.
2798
02:15:31,120 --> 02:15:35,430
So they can support millions, billions even, of users nowadays.
2799
02:15:35,430 --> 02:15:36,910
What can go wrong?
2800
02:15:36,910 --> 02:15:40,530
Well, typically code like this is not what we'll call atomic.
2801
02:15:40,530 --> 02:15:44,430
To be atomic means that it all executes together or not at all.
2802
02:15:44,430 --> 02:15:49,500
Rather, code typically is executed, as you might imagine, line by line.
2803
02:15:49,500 --> 02:15:53,110
And if your code is running on a server that multiple people have access to,
2804
02:15:53,110 --> 02:15:55,530
which is absolutely the case for an app like Instagram,
2805
02:15:55,530 --> 02:15:58,560
if you and I click on the heart at roughly the same time,
2806
02:15:58,560 --> 02:16:02,160
for efficiency, the computer, the server, owned by Instagram,
2807
02:16:02,160 --> 02:16:04,500
might execute this line of code for me.
2808
02:16:04,500 --> 02:16:06,537
Then it might execute this line of code for you.
2809
02:16:06,537 --> 02:16:09,120
Then this line of code for me, then this line of code for you,
2810
02:16:09,120 --> 02:16:11,703
then this line of code for me, then this line of code for you.
2811
02:16:11,703 --> 02:16:16,550
That is to say, our queries might get intermingled chronologically.
2812
02:16:16,550 --> 02:16:19,370
Because it'd be a little obnoxious if, when you're using Instagram,
2813
02:16:19,370 --> 02:16:22,040
I'm blocked out while you're interacting with the site.
2814
02:16:22,040 --> 02:16:24,778
It'd be a lot nicer for efficiency and fairness if somehow they
2815
02:16:24,778 --> 02:16:27,320
do a little bit of work for me, a little bit of work for you,
2816
02:16:27,320 --> 02:16:30,540
and back and forth, and back and forth, equitably on the server.
2817
02:16:30,540 --> 02:16:33,320
So that's what typically happens by default. These lines of code
2818
02:16:33,320 --> 02:16:35,420
get executed independently.
2819
02:16:35,420 --> 02:16:39,770
And they can happen in alternating order with other users.
2820
02:16:39,770 --> 02:16:41,809
You can get them combined like this.
2821
02:16:41,809 --> 02:16:45,840
Same order top to bottom, but other things might happen in between.
2822
02:16:45,840 --> 02:16:50,000
So suppose that the number of likes at the very beginning was 10.
2823
02:16:50,000 --> 02:16:54,080
And suppose that Carter and I both click on that egg at roughly the same time.
2824
02:16:54,080 --> 02:16:56,420
And suppose this line of code gets executed for me,
2825
02:16:56,420 --> 02:16:59,870
and that gives me a value in likes, ultimately, of 10.
2826
02:16:59,870 --> 02:17:03,290
Suppose, then, that the computer takes a break from dealing with my request,
2827
02:17:03,290 --> 02:17:05,690
does the same code for Carter, and gets back
2828
02:17:05,690 --> 02:17:07,910
what value for the current number of likes?
2829
02:17:07,910 --> 02:17:09,170
Also 10 for Carter.
2830
02:17:09,170 --> 02:17:11,209
Because mine has not been recorded yet.
2831
02:17:11,209 --> 02:17:13,910
At this point in the story, somewhere in the computer's memory
2832
02:17:13,910 --> 02:17:16,219
there's a likes variable for me, storing 10.
2833
02:17:16,219 --> 02:17:19,400
There's a likes variable storing 10 for Carter.
2834
02:17:19,400 --> 02:17:21,290
Then this line of code executes for me.
2835
02:17:21,290 --> 02:17:25,219
It updates the database to be likes plus 1, which stores 11 in the database.
2836
02:17:25,219 --> 02:17:30,440
Then Carter's code is executed, updating the same row in the database
2837
02:17:30,440 --> 02:17:34,160
to 11, unfortunately.
2838
02:17:34,160 --> 02:17:38,209
Because his value of likes happened to be the same value of mine.
2839
02:17:38,209 --> 02:17:41,540
And so the metaphor here, that if we had a refrigerator on stage we would
2840
02:17:41,540 --> 02:17:44,790
actually act out, is something that was taught to me years ago in an operating
2841
02:17:44,790 --> 02:17:50,299
systems class, whereby the most similar analogue in the real world would be
2842
02:17:50,299 --> 02:17:52,370
if you've got a mini fridge in your dorm room.
2843
02:17:52,370 --> 02:17:58,440
And one of you and your roommates comes home, opens the fridge, and realizes,
2844
02:17:58,440 --> 02:18:01,160
oh, we're out of milk, was how the story went in my day.
2845
02:18:01,160 --> 02:18:05,184
So you close the refrigerator, and you walk across the street, go to CVS,
2846
02:18:05,184 --> 02:18:06,559
and get in line to buy some milk.
2847
02:18:06,559 --> 02:18:08,309
Meanwhile, your roommate comes home.
2848
02:18:08,309 --> 02:18:12,924
They, too, inspect the state of your refrigerator, a.k.a., a variable,
2849
02:18:12,924 --> 02:18:15,049
open the door, and realizes, oh, we're out of milk.
2850
02:18:15,049 --> 02:18:15,980
I'll go get more milk.
2851
02:18:15,980 --> 02:18:18,020
Close the fridge, go across the street, and head
2852
02:18:18,020 --> 02:18:20,059
to maybe a different store, or the line is long enough
2853
02:18:20,059 --> 02:18:21,851
that you don't see each other at the store.
2854
02:18:21,851 --> 02:18:25,700
So long story short, you both eventually get home, open the door, and damn it,
2855
02:18:25,700 --> 02:18:27,559
now there's milk from your other roommate
2856
02:18:27,559 --> 02:18:30,680
there because you both made a decision on this
2857
02:18:30,680 --> 02:18:35,809
based on the state of a variable that you independently examined.
2858
02:18:35,809 --> 02:18:37,700
And you didn't somehow communicate.
2859
02:18:37,700 --> 02:18:40,610
Now in the real world, this is absolutely solvable.
2860
02:18:40,610 --> 02:18:44,120
How would you fix this or avoid this problem in the real world?
2861
02:18:44,120 --> 02:18:45,770
Literally, own roommate, own fridge.
2862
02:18:45,770 --> 02:18:47,750
AUDIENCE: Text your roommate [INAUDIBLE]..
2863
02:18:47,750 --> 02:18:48,750
DAVID J. MALAN: Perfect.
2864
02:18:48,750 --> 02:18:50,482
Let them know, so somehow communicate.
2865
02:18:50,482 --> 02:18:52,940
And in fact, the terminology here would be multiple threads
2866
02:18:52,940 --> 02:18:55,459
can somehow intercommunicate by having shared state,
2867
02:18:55,459 --> 02:18:57,230
like the iMessage thread on your phone.
2868
02:18:57,230 --> 02:18:58,190
You could leave a note.
2869
02:18:58,190 --> 02:19:01,490
You could, more dramatically, lock the refrigerator somehow,
2870
02:19:01,490 --> 02:19:05,570
thereby making the milk purchasing process atomic.
2871
02:19:05,570 --> 02:19:08,389
The fundamental problem is that for efficiency, again,
2872
02:19:08,389 --> 02:19:11,690
computers tend to intermingle logic that needs
2873
02:19:11,690 --> 02:19:15,980
to happen when it's happening across multiple users just for fairness' sake,
2874
02:19:15,980 --> 02:19:17,180
for scheduling sake.
2875
02:19:17,180 --> 02:19:19,820
You need to make sure that all three of these lines of code
2876
02:19:19,820 --> 02:19:23,269
execute for me, and then for Carter, and then for you
2877
02:19:23,269 --> 02:19:25,709
if you want to ensure that this count is correct.
2878
02:19:25,709 --> 02:19:29,219
And for years, when social media was first getting off the ground,
2879
02:19:29,219 --> 02:19:31,100
this was a super hard problem.
2880
02:19:31,100 --> 02:19:33,920
Twitter used to go down all of the time, and tweets,
2881
02:19:33,920 --> 02:19:36,260
and retweets were a thing that were similarly happening
2882
02:19:36,260 --> 02:19:37,490
with a very high frequency.
2883
02:19:37,490 --> 02:19:39,020
These are hard problems to solve.
2884
02:19:39,020 --> 02:19:40,690
And thankfully, there are solutions.
2885
02:19:40,690 --> 02:19:43,440
And we won't get into the weeds of how you might use these things,
2886
02:19:43,440 --> 02:19:46,100
but know that there are solutions in the form of things
2887
02:19:46,100 --> 02:19:49,610
called locks, which I use that word deliberately with the fridge.
2888
02:19:49,610 --> 02:19:53,570
Software locks can allow you to protect a variable so no one else can
2889
02:19:53,570 --> 02:19:55,550
look at it until you're done with it.
2890
02:19:55,550 --> 02:19:57,770
There are things called transactions, which
2891
02:19:57,770 --> 02:20:01,100
allow you to do the equivalent of sending a message to, or really locking
2892
02:20:01,100 --> 02:20:04,130
out your roommate from accessing that same variable, too,
2893
02:20:04,130 --> 02:20:06,920
but for slightly less amount of time.
2894
02:20:06,920 --> 02:20:08,880
There are solutions to these problems.
2895
02:20:08,880 --> 02:20:12,650
So for instance, in Python, the same code now in green
2896
02:20:12,650 --> 02:20:14,360
might look a little something like this.
2897
02:20:14,360 --> 02:20:17,150
When you know that something has to happen all at once,
2898
02:20:17,150 --> 02:20:21,020
altogether, you first begin a transaction, and you do your thing,
2899
02:20:21,020 --> 02:20:23,707
and then you commit the transaction at the very end.
2900
02:20:23,707 --> 02:20:25,790
Here, too, though, there's going to be a downside.
2901
02:20:25,790 --> 02:20:29,902
Typically, the more you use transactions in this way,
2902
02:20:29,902 --> 02:20:31,610
potentially the higher the probability is
2903
02:20:31,610 --> 02:20:35,190
that you're going to box someone out or make Carter's request a little slower.
2904
02:20:35,190 --> 02:20:35,690
Why?
2905
02:20:35,690 --> 02:20:37,482
Because we can't interact at the same time.
2906
02:20:37,482 --> 02:20:39,920
Or you might make his request fail if he tries to update
2907
02:20:39,920 --> 02:20:41,870
something that's already been updated.
2908
02:20:41,870 --> 02:20:44,720
So you generally want to have as few lines of code
2909
02:20:44,720 --> 02:20:47,840
together in between these transactions so that you get in and you get out.
2910
02:20:47,840 --> 02:20:50,780
And you go to CVS and you get back really fast so as to not
2911
02:20:50,780 --> 02:20:52,550
cause these kind of performance things.
2912
02:20:52,550 --> 02:20:55,070
So things indeed escalated quickly today.
2913
02:20:55,070 --> 02:20:58,272
The original goal was just to solve problems using a different language
2914
02:20:58,272 --> 02:20:59,480
more effectively than Python.
2915
02:20:59,480 --> 02:21:01,772
But as soon as you have these more powerful techniques,
2916
02:21:01,772 --> 02:21:03,525
a whole new set of problems arises.
2917
02:21:03,525 --> 02:21:05,150
Takes practice to get comfortable with.
2918
02:21:05,150 --> 02:21:09,170
But ultimately, this is all leading us toward the introduction next week
2919
02:21:09,170 --> 02:21:12,260
of web programming with HTML, CSS, and some JavaScript.
2920
02:21:12,260 --> 02:21:15,090
The week after, bringing Python and SQL back into the mix.
2921
02:21:15,090 --> 02:21:16,940
So that by term's end, we've really now used
2922
02:21:16,940 --> 02:21:19,760
all of these different languages for what they're best at.
2923
02:21:19,760 --> 02:21:22,843
And over the next few weeks, the goal is to make sure you're understanding
2924
02:21:22,843 --> 02:21:25,932
and comfortable with what each of these things is good and bad for.
2925
02:21:25,932 --> 02:21:27,140
Let's go ahead and wrap here.
2926
02:21:27,140 --> 02:21:28,473
I'll stick around for questions.
2927
02:21:28,473 --> 02:21:30,310
We'll see you next time.
2928
02:21:30,310 --> 02:22:04,344
[MUSIC PLAYING]
244433
Can't find what you're looking for?
Get subtitles in any language from opensubtitles.com, and translate them here.