All language subtitles for hash_tables-720p-en

af Afrikaans
sq Albanian
am Amharic
ar Arabic
hy Armenian
az Azerbaijani
eu Basque
be Belarusian
bn Bengali
bs Bosnian
bg Bulgarian
ca Catalan
ceb Cebuano
ny Chichewa
zh-CN Chinese (Simplified)
zh-TW Chinese (Traditional)
co Corsican
hr Croatian
cs Czech
da Danish
nl Dutch
en English
eo Esperanto
et Estonian
tl Filipino
fi Finnish
fr French
fy Frisian
gl Galician
ka Georgian
de German
el Greek
gu Gujarati
ht Haitian Creole
ha Hausa
haw Hawaiian
iw Hebrew
hi Hindi
hmn Hmong
hu Hungarian
is Icelandic
ig Igbo
id Indonesian
ga Irish
it Italian
ja Japanese
jw Javanese
kn Kannada
kk Kazakh
km Khmer
ko Korean
ku Kurdish (Kurmanji)
ky Kyrgyz
lo Lao
la Latin
lv Latvian
lt Lithuanian
lb Luxembourgish
mk Macedonian
mg Malagasy
ms Malay
ml Malayalam
mt Maltese
mi Maori
mr Marathi
mn Mongolian
my Myanmar (Burmese)
ne Nepali
no Norwegian
ps Pashto
fa Persian
pl Polish
pt Portuguese
pa Punjabi
ro Romanian
ru Russian
sm Samoan
gd Scots Gaelic
sr Serbian
st Sesotho
sn Shona
sd Sindhi
si Sinhala
sk Slovak
sl Slovenian
so Somali
es Spanish
su Sundanese
sw Swahili
sv Swedish
tg Tajik
ta Tamil
te Telugu
th Thai
tr Turkish
uk Ukrainian Download
ur Urdu
uz Uzbek
vi Vietnamese
cy Welsh
xh Xhosa
yi Yiddish
yo Yoruba
zu Zulu
or Odia (Oriya)
rw Kinyarwanda
tk Turkmen
tt Tatar
ug Uyghur
Would you like to inspect the original subtitles? These are the user uploaded subtitles that are being translated: 0 00:00:00,000 --> 00:00:00,500 1 00:00:00,500 --> 00:00:01,900 [MUSIC PLAYING] 2 00:00:01,900 --> 00:00:05,710 3 00:00:05,710 --> 00:00:09,150 >> DOUG LLOYD: By now you know a lot about arrays, 4 00:00:09,150 --> 00:00:11,610 and you know a lot about linked lists. 5 00:00:11,610 --> 00:00:13,650 And we've discuss the pros and cons, we've 6 00:00:13,650 --> 00:00:16,620 discussed that linked lists can get bigger and smaller, 7 00:00:16,620 --> 00:00:18,630 but they take up more size. 8 00:00:18,630 --> 00:00:22,359 Arrays are much more straightforward to use, but they're restrictive in as much 9 00:00:22,359 --> 00:00:24,900 as we have to set the size of the array at the very beginning 10 00:00:24,900 --> 00:00:26,910 and then we're stuck with it. 11 00:00:26,910 --> 00:00:30,470 >> But that's, we've pretty much exhausted all of our topics 12 00:00:30,470 --> 00:00:33,040 about linked lists and arrays. 13 00:00:33,040 --> 00:00:34,950 Or have we? 14 00:00:34,950 --> 00:00:37,720 Maybe we can do something even more creative. 15 00:00:37,720 --> 00:00:40,950 And that sort of lends the idea of a hash table. 16 00:00:40,950 --> 00:00:46,680 >> So in a hash table we're going to try combine an array with a linked list. 17 00:00:46,680 --> 00:00:49,520 We're going to take the advantages of the array, like random access, 18 00:00:49,520 --> 00:00:53,510 being able to just go to array element 4 or array element 8 19 00:00:53,510 --> 00:00:55,560 without having to iterate across. 20 00:00:55,560 --> 00:00:57,260 That's pretty fast, right? 21 00:00:57,260 --> 00:01:00,714 >> But we also want to have our data structure be able to grow and shrink. 22 00:01:00,714 --> 00:01:02,630 We don't need, we don't want to be restricted. 23 00:01:02,630 --> 00:01:04,588 And we want to be able to add and remove things 24 00:01:04,588 --> 00:01:08,430 very easily, which if you recall, is very complex with an array. 25 00:01:08,430 --> 00:01:11,650 And we can call this new thing a hash table. 26 00:01:11,650 --> 00:01:15,190 >> And if implemented correctly, we're sort of taking 27 00:01:15,190 --> 00:01:18,150 the advantages of both data structures you've already seen, 28 00:01:18,150 --> 00:01:19,880 arrays and linked lists. 29 00:01:19,880 --> 00:01:23,070 Insertion can start to tend toward theta of 1. 30 00:01:23,070 --> 00:01:26,207 Theta we haven't really discussed, but theta is just the average case, 31 00:01:26,207 --> 00:01:27,540 what's actually going to happen. 32 00:01:27,540 --> 00:01:29,680 You're not always going to have the worst case scenario, 33 00:01:29,680 --> 00:01:32,555 and you're not always going to have the best case scenario, so what's 34 00:01:32,555 --> 00:01:33,900 the average scenario? 35 00:01:33,900 --> 00:01:36,500 >> Well an average insertion into a hash table 36 00:01:36,500 --> 00:01:39,370 can start to get close to constant time. 37 00:01:39,370 --> 00:01:41,570 And deletion can get close to constant time. 38 00:01:41,570 --> 00:01:44,440 And lookup can get close to constant time. 39 00:01:44,440 --> 00:01:48,600 That's-- we don't have a data structure yet that can do that, 40 00:01:48,600 --> 00:01:51,180 and so this already sounds like a pretty great thing. 41 00:01:51,180 --> 00:01:57,010 We've really mitigated the disadvantages of each on its own. 42 00:01:57,010 --> 00:01:59,160 >> To get this performance upgrade though, we 43 00:01:59,160 --> 00:02:03,580 need to rethink how we add data into the structure. 44 00:02:03,580 --> 00:02:07,380 Specifically we want the data itself to tell us 45 00:02:07,380 --> 00:02:09,725 where it should go in the structure. 46 00:02:09,725 --> 00:02:12,850 And if we then need to see if it's in the structure, if we need to find it, 47 00:02:12,850 --> 00:02:16,610 we want to look at the data again and be able to effectively, 48 00:02:16,610 --> 00:02:18,910 using the data, randomly access it. 49 00:02:18,910 --> 00:02:20,700 Just by looking at the data we should have 50 00:02:20,700 --> 00:02:25,890 an idea of where exactly we're going to find it in the hash table. 51 00:02:25,890 --> 00:02:28,770 >> Now the downside of a hash table is that they're really 52 00:02:28,770 --> 00:02:31,770 pretty bad at ordering or sorting data. 53 00:02:31,770 --> 00:02:34,970 And in fact, if you start to use them to order or sort 54 00:02:34,970 --> 00:02:37,990 data you lose all of the advantages you previously 55 00:02:37,990 --> 00:02:40,710 had in terms of insertion and deletion. 56 00:02:40,710 --> 00:02:44,060 The time becomes closer to theta of n, and we've basically 57 00:02:44,060 --> 00:02:45,530 regressed into a linked list. 58 00:02:45,530 --> 00:02:48,850 And so we only want to use hash tables if we don't care about 59 00:02:48,850 --> 00:02:51,490 whether data is sorted. 60 00:02:51,490 --> 00:02:54,290 For the context in which you'll use them in CS50 61 00:02:54,290 --> 00:02:58,900 you probably don't care that the data is sorted. 62 00:02:58,900 --> 00:03:03,170 >> So a hash table is a combination of two distinct pieces 63 00:03:03,170 --> 00:03:04,980 with which we're familiar. 64 00:03:04,980 --> 00:03:07,930 The first is a function, which we usually call a hash function. 65 00:03:07,930 --> 00:03:11,760 And that hash function is going to return some non-negative integer, which 66 00:03:11,760 --> 00:03:14,870 we usually call a hashcode, OK? 67 00:03:14,870 --> 00:03:20,230 The second piece is an array, which is capable of storing data of the type we 68 00:03:20,230 --> 00:03:22,190 want to place into the data structure. 69 00:03:22,190 --> 00:03:24,310 We'll hold off on the linked list element for now 70 00:03:24,310 --> 00:03:27,810 and just start with the basics of a hash table to get your head around it, 71 00:03:27,810 --> 00:03:30,210 and then we'll maybe blow your mind a little bit when we 72 00:03:30,210 --> 00:03:32,920 combine arrays and link lists together. 73 00:03:32,920 --> 00:03:35,590 >> The basic idea though is we take some data. 74 00:03:35,590 --> 00:03:37,860 We run that data through the hash function. 75 00:03:37,860 --> 00:03:41,980 And so the data is processed and it spits out a number, OK? 76 00:03:41,980 --> 00:03:44,890 And then with that number we just store the data 77 00:03:44,890 --> 00:03:48,930 we want to store in the array at that location. 78 00:03:48,930 --> 00:03:53,990 So for example we have maybe this hash table of strings. 79 00:03:53,990 --> 00:03:57,350 It's got 10 elements in it, so we can fit 10 strings in it. 80 00:03:57,350 --> 00:03:59,320 >> Let's say we want to hash John. 81 00:03:59,320 --> 00:04:02,979 So John as the data we want to insert into this hash table somewhere. 82 00:04:02,979 --> 00:04:03,770 Where do we put it? 83 00:04:03,770 --> 00:04:05,728 Well typically with an array so far we probably 84 00:04:05,728 --> 00:04:07,610 would put it in array location 0. 85 00:04:07,610 --> 00:04:09,960 But now we have this new hash function. 86 00:04:09,960 --> 00:04:13,180 >> And let's say that we run John through this hash function 87 00:04:13,180 --> 00:04:15,417 and it's spits out 4. 88 00:04:15,417 --> 00:04:17,500 Well that's where we're going to want to put John. 89 00:04:17,500 --> 00:04:22,050 We want to put John in array location 4, because if we hash John again-- 90 00:04:22,050 --> 00:04:23,810 let's say later we want to search and see 91 00:04:23,810 --> 00:04:26,960 if John exists in this hash table-- all we need to do 92 00:04:26,960 --> 00:04:30,310 is run it through the same hash function, get the number 4 out, 93 00:04:30,310 --> 00:04:33,929 and be able to find John immediately in our data structure. 94 00:04:33,929 --> 00:04:34,720 That's pretty good. 95 00:04:34,720 --> 00:04:36,928 >> Let's say we now do this again, we want to hash Paul. 96 00:04:36,928 --> 00:04:39,446 We want to add Paul into this hash table. 97 00:04:39,446 --> 00:04:42,070 Let's say that this time we run Paul through the hash function, 98 00:04:42,070 --> 00:04:44,600 the hashcode that is generated is 6. 99 00:04:44,600 --> 00:04:47,340 Well now we can put Paul in the array location 6. 100 00:04:47,340 --> 00:04:50,040 And if we need to look up whether Paul is in this hash table, 101 00:04:50,040 --> 00:04:53,900 all we need to do is run Paul through the hash function again 102 00:04:53,900 --> 00:04:55,830 and we're going to get 6 out again. 103 00:04:55,830 --> 00:04:57,590 >> And then we just look at array location 6. 104 00:04:57,590 --> 00:04:58,910 Is Paul there? 105 00:04:58,910 --> 00:05:00,160 If so, he's in the hash table. 106 00:05:00,160 --> 00:05:01,875 Is Paul not there? 107 00:05:01,875 --> 00:05:03,000 He's not in the hash table. 108 00:05:03,000 --> 00:05:05,720 It's pretty straightforward. 109 00:05:05,720 --> 00:05:07,770 >> Now how do you define a hash function? 110 00:05:07,770 --> 00:05:11,460 Well there's really no limit to the number of possible hash functions. 111 00:05:11,460 --> 00:05:14,350 In fact there's a number of really, really good ones on the internet. 112 00:05:14,350 --> 00:05:17,560 There's a number of really, really bad ones on the internet. 113 00:05:17,560 --> 00:05:21,080 It's also pretty easy to write a bad one. 114 00:05:21,080 --> 00:05:23,760 >> So what makes up a good hash function, right? 115 00:05:23,760 --> 00:05:27,280 Well a good hash function should use only the data being hashed, 116 00:05:27,280 --> 00:05:29,420 and all of the data being hashed. 117 00:05:29,420 --> 00:05:32,500 So we don't want to use anything-- we don't incorporate anything 118 00:05:32,500 --> 00:05:35,560 else other than the data. 119 00:05:35,560 --> 00:05:37,080 And we want to use all of the data. 120 00:05:37,080 --> 00:05:39,830 We don't want to just use a piece of it, we want to use all of it. 121 00:05:39,830 --> 00:05:41,710 A hash function should also be deterministic. 122 00:05:41,710 --> 00:05:42,550 What does that mean? 123 00:05:42,550 --> 00:05:46,200 Well it means that every time we pass the exact same piece of data 124 00:05:46,200 --> 00:05:50,040 into the hash function we always get the same hashcode out. 125 00:05:50,040 --> 00:05:52,870 If I pass John into the hash function I get out 4. 126 00:05:52,870 --> 00:05:56,110 I should be able to do that 10,000 times and I'll always get 4. 127 00:05:56,110 --> 00:06:00,810 So no random numbers effectively can be involved in our hash tables-- 128 00:06:00,810 --> 00:06:02,750 in our hash functions. 129 00:06:02,750 --> 00:06:05,750 >> A hash function should also uniformly distribute data. 130 00:06:05,750 --> 00:06:09,700 If every time you run data through the hash function you get the hashcode 0, 131 00:06:09,700 --> 00:06:11,200 that's probably not so great, right? 132 00:06:11,200 --> 00:06:14,600 You probably want to big a range of hash codes. 133 00:06:14,600 --> 00:06:17,190 Also things can be spread out throughout the table. 134 00:06:17,190 --> 00:06:23,210 And also it would be great if really similar data, like John and Jonathan, 135 00:06:23,210 --> 00:06:26,320 maybe were spread out to weigh different locations in the hash table. 136 00:06:26,320 --> 00:06:28,750 That would be a nice advantage. 137 00:06:28,750 --> 00:06:31,250 >> Here's an example of a hash function. 138 00:06:31,250 --> 00:06:33,150 I wrote this one up earlier. 139 00:06:33,150 --> 00:06:35,047 It's not a particularly good hash function 140 00:06:35,047 --> 00:06:37,380 for reasons that don't really bear going into right now. 141 00:06:37,380 --> 00:06:41,040 But do you see what's going on here? 142 00:06:41,040 --> 00:06:44,420 It seems like we're declaring a variable called sum and setting it equal to 0. 143 00:06:44,420 --> 00:06:50,010 And then apparently I'm doing something so long as strstr[j] is not equal 144 00:06:50,010 --> 00:06:52,490 to backslash 0. 145 00:06:52,490 --> 00:06:54,870 What am I doing there? 146 00:06:54,870 --> 00:06:57,440 >> This is basically just another way of implementing [? strl ?] 147 00:06:57,440 --> 00:06:59,773 and detecting when you've reached the end of the string. 148 00:06:59,773 --> 00:07:02,480 So I don't have to actually calculate the length of the string, 149 00:07:02,480 --> 00:07:05,640 I'm just using when I hit the backslash 0 character I know 150 00:07:05,640 --> 00:07:07,185 I've reached the end of the string. 151 00:07:07,185 --> 00:07:09,560 And then I'm going to keep iterating through that string, 152 00:07:09,560 --> 00:07:15,310 adding strstr[j] to sum, and then at the end of the day going to return sum mod 153 00:07:15,310 --> 00:07:16,200 HASH_MAX. 154 00:07:16,200 --> 00:07:18,700 >> Basically all this hash function is doing is adding up 155 00:07:18,700 --> 00:07:23,450 all of the ASCII values of my string, and then it's 156 00:07:23,450 --> 00:07:26,390 returning some hashcode modded by HASH_MAX. 157 00:07:26,390 --> 00:07:29,790 It's probably the size of my array, right? 158 00:07:29,790 --> 00:07:33,160 I don't want to be getting hash codes if my array is of size 10, 159 00:07:33,160 --> 00:07:35,712 I don't want to be getting out hash codes 11, 12, 160 00:07:35,712 --> 00:07:38,690 13, I can't put things into those locations of the array, 161 00:07:38,690 --> 00:07:39,790 that would be illegal. 162 00:07:39,790 --> 00:07:42,130 I'd suffer a segmentation fault. 163 00:07:42,130 --> 00:07:45,230 >> Now here is another quick aside. 164 00:07:45,230 --> 00:07:48,470 Generally you're probably not going to want to write your own hash functions. 165 00:07:48,470 --> 00:07:50,997 It is actually a bit of an art, not a science. 166 00:07:50,997 --> 00:07:52,580 And there's a lot that goes into them. 167 00:07:52,580 --> 00:07:55,288 The internet, like I said, is full of really good hash functions, 168 00:07:55,288 --> 00:07:58,470 and you should use the internet to find hash functions because it's really 169 00:07:58,470 --> 00:08:03,230 just kind of an unnecessary waste of time to create your own. 170 00:08:03,230 --> 00:08:05,490 >> You can write simple ones for testing purposes. 171 00:08:05,490 --> 00:08:08,323 But when you actually are going to start hashing data and storing it 172 00:08:08,323 --> 00:08:10,780 into a hash table you're probably going to want 173 00:08:10,780 --> 00:08:14,580 to use some function that was generated for you, that exists on the internet. 174 00:08:14,580 --> 00:08:17,240 If you do just be sure to cite your sources. 175 00:08:17,240 --> 00:08:21,740 There's no reason to plagiarize anything here. 176 00:08:21,740 --> 00:08:25,370 >> The computer science community is definitely growing, and really values 177 00:08:25,370 --> 00:08:28,796 open source, and it's really important to cite your sources so that people 178 00:08:28,796 --> 00:08:30,670 can get attribution for the work that they're 179 00:08:30,670 --> 00:08:32,312 doing to the benefit of the community. 180 00:08:32,312 --> 00:08:34,020 So always be sure-- and not just for hash 181 00:08:34,020 --> 00:08:37,270 functions, but generally when you use code from an outside source, 182 00:08:37,270 --> 00:08:38,299 always cite your source. 183 00:08:38,299 --> 00:08:43,500 Give credit to the person who did some of the work so you don't have to. 184 00:08:43,500 --> 00:08:46,720 >> OK so let's revisit this hash table for a second. 185 00:08:46,720 --> 00:08:48,780 This is where we left off after we inserted 186 00:08:48,780 --> 00:08:53,300 John and Paul into this hash table. 187 00:08:53,300 --> 00:08:55,180 Do you see a problem here? 188 00:08:55,180 --> 00:08:58,410 You might see two. 189 00:08:58,410 --> 00:09:02,240 But in particular, do you see this possible problem? 190 00:09:02,240 --> 00:09:06,770 >> What if I hash Ringo, and it turns out that after processing 191 00:09:06,770 --> 00:09:14,000 that data through the hash function Ringo also generated the hashcode 6. 192 00:09:14,000 --> 00:09:16,810 I've already got data at hashcode-- array location 6. 193 00:09:16,810 --> 00:09:22,000 So it's probably going to be a bit of a problem for me now, right? 194 00:09:22,000 --> 00:09:23,060 >> We call this a collision. 195 00:09:23,060 --> 00:09:26,460 And the collision occurs when two pieces of data run through the same hash 196 00:09:26,460 --> 00:09:29,200 function yield the same hashcode. 197 00:09:29,200 --> 00:09:32,850 Presumably we still want to get both pieces of data into the hash table, 198 00:09:32,850 --> 00:09:36,330 otherwise we wouldn't be running Ringo arbitrarily through the hash function. 199 00:09:36,330 --> 00:09:40,870 We presumably want to get Ringo into that array. 200 00:09:40,870 --> 00:09:46,070 >> How do we do it though, if he and Paul both yield hashcode 6? 201 00:09:46,070 --> 00:09:49,520 We don't want to overwrite Paul, we want Paul to be there too. 202 00:09:49,520 --> 00:09:52,790 So we need to find a way to get elements into the hash table that 203 00:09:52,790 --> 00:09:56,550 still preserves our quick insertion and quick look up. 204 00:09:56,550 --> 00:10:01,350 And one way to deal with it is to do something called linear probing. 205 00:10:01,350 --> 00:10:04,170 >> Using this method if we have a collision, well, what do we do? 206 00:10:04,170 --> 00:10:08,580 Well we can't put him in array location 6, or whatever hashcode was generated, 207 00:10:08,580 --> 00:10:10,820 let's put him at hashcode plus 1. 208 00:10:10,820 --> 00:10:13,670 And if that's full let's put him in hashcode plus 2. 209 00:10:13,670 --> 00:10:17,420 The benefit of this being if he's not exactly where we think he is, 210 00:10:17,420 --> 00:10:20,850 and we have to start searching, maybe we don't have to go too far. 211 00:10:20,850 --> 00:10:23,900 Maybe we don't have to search all n elements of the hash table. 212 00:10:23,900 --> 00:10:25,790 Maybe we have to search a couple of them. 213 00:10:25,790 --> 00:10:30,680 >> And so we're still tending towards that average case being close to 1 vs 214 00:10:30,680 --> 00:10:34,280 close to n, so maybe that'll work. 215 00:10:34,280 --> 00:10:38,010 So let's see how this might work out in reality. 216 00:10:38,010 --> 00:10:41,460 And let's see if maybe we can detect the problem that might occur here. 217 00:10:41,460 --> 00:10:42,540 >> Let's say we hash Bart. 218 00:10:42,540 --> 00:10:45,581 So now we're going to run a new set of strings through the hash function, 219 00:10:45,581 --> 00:10:48,460 and we run Bart through the hash function, we get hashcode 6. 220 00:10:48,460 --> 00:10:52,100 We take a look, we see 6 is empty, so we can put Bart there. 221 00:10:52,100 --> 00:10:55,410 >> Now we hash Lisa and that also generates hashcode 6. 222 00:10:55,410 --> 00:10:58,330 Well now that we're using this linear probing method we start at 6, 223 00:10:58,330 --> 00:10:59,330 we see that 6 is full. 224 00:10:59,330 --> 00:11:00,990 We can't put Lisa in 6. 225 00:11:00,990 --> 00:11:02,090 So where do we go? 226 00:11:02,090 --> 00:11:03,280 Let's go to 7. 227 00:11:03,280 --> 00:11:04,610 7's empty, so that works. 228 00:11:04,610 --> 00:11:06,510 So let's put Lisa there. 229 00:11:06,510 --> 00:11:10,200 >> Now we hash Homer and we get 7. 230 00:11:10,200 --> 00:11:13,380 OK well we know that 7's full now, so we can't put Homer there. 231 00:11:13,380 --> 00:11:14,850 So let's go to 8. 232 00:11:14,850 --> 00:11:15,664 Is 8 available? 233 00:11:15,664 --> 00:11:18,330 Yeah, and 8's close to 7, so if we have to start searching we're 234 00:11:18,330 --> 00:11:20,020 not going to have to go too far. 235 00:11:20,020 --> 00:11:22,860 And so let's put Homer at 8. 236 00:11:22,860 --> 00:11:25,151 >> Now we hash Maggie and returns 3, thank goodness 237 00:11:25,151 --> 00:11:26,650 we're able to just put Maggie there. 238 00:11:26,650 --> 00:11:29,070 We don't have to do any sort of probing for that. 239 00:11:29,070 --> 00:11:32,000 Now we hash Marge, and Marge also returns 6. 240 00:11:32,000 --> 00:11:39,560 >> Well 6 is full, 7 is full, 8 is full, 9, all right thank God, 9 is empty. 241 00:11:39,560 --> 00:11:41,600 I can put Marge at 9. 242 00:11:41,600 --> 00:11:45,280 Already we can see that we're starting to have this problem where now we're 243 00:11:45,280 --> 00:11:48,780 starting to stretch things kind of far away from their hash codes. 244 00:11:48,780 --> 00:11:52,960 And that theta of 1, that average case of being constant time, 245 00:11:52,960 --> 00:11:56,560 is starting to get a little more-- starting to tend a little more 246 00:11:56,560 --> 00:11:58,050 towards theta of n. 247 00:11:58,050 --> 00:12:01,270 We're starting to lose that advantage of hash tables. 248 00:12:01,270 --> 00:12:03,902 >> This problem that we just saw is something called clustering. 249 00:12:03,902 --> 00:12:06,360 And what's really bad about clustering is that once you now 250 00:12:06,360 --> 00:12:09,606 have two elements that are side by side it makes it even more likely, 251 00:12:09,606 --> 00:12:11,480 you have double the chance, that you're going 252 00:12:11,480 --> 00:12:13,516 to have another collision with that cluster, 253 00:12:13,516 --> 00:12:14,890 and the cluster will grow by one. 254 00:12:14,890 --> 00:12:19,640 And you'll keep growing and growing your likelihood of having a collision. 255 00:12:19,640 --> 00:12:24,470 And eventually it's just as bad as not sorting the data at all. 256 00:12:24,470 --> 00:12:27,590 >> The other problem though is we still, and so far up to this point, 257 00:12:27,590 --> 00:12:30,336 we've just been sort of understanding what a hash table is, 258 00:12:30,336 --> 00:12:31,960 we still only have room for 10 strings. 259 00:12:31,960 --> 00:12:37,030 If we want to continue to hash the citizens of Springfield, 260 00:12:37,030 --> 00:12:38,790 we can only get 10 of them in there. 261 00:12:38,790 --> 00:12:42,619 And if we try and add an 11th or 12th, we don't have a place to put them. 262 00:12:42,619 --> 00:12:45,660 We could just be spinning around in circles trying to find an empty spot, 263 00:12:45,660 --> 00:12:49,000 and we maybe get stuck in an infinite loop. 264 00:12:49,000 --> 00:12:51,810 >> So this sort of lends to the idea of something called chaining. 265 00:12:51,810 --> 00:12:55,790 And this is where we're going to bring linked lists back into the picture. 266 00:12:55,790 --> 00:13:01,300 What if instead of storing just the data itself in the array, 267 00:13:01,300 --> 00:13:04,471 every element of the array could hold multiple pieces of data? 268 00:13:04,471 --> 00:13:05,970 Well that doesn't make sense, right? 269 00:13:05,970 --> 00:13:09,280 We know that an array can only hold-- each element of an array 270 00:13:09,280 --> 00:13:12,930 can only hold one piece of data of that data type. 271 00:13:12,930 --> 00:13:16,750 >> But what if that data type is a linked list, right? 272 00:13:16,750 --> 00:13:19,830 So what if every element of the array was 273 00:13:19,830 --> 00:13:22,790 a pointer to the head of a linked list? 274 00:13:22,790 --> 00:13:24,680 And then we could build those linked lists 275 00:13:24,680 --> 00:13:27,120 and grow them arbitrarily, because linked lists allow 276 00:13:27,120 --> 00:13:32,090 us to grow and shrink a lot more flexibly than an array does. 277 00:13:32,090 --> 00:13:34,210 So what if we now use, we leverage this, right? 278 00:13:34,210 --> 00:13:37,760 We start to grow these chains out of these array locations. 279 00:13:37,760 --> 00:13:40,740 >> Now we can fit an infinite amount of data, or not infinite, 280 00:13:40,740 --> 00:13:44,170 an arbitrary amount of data, into our hash table 281 00:13:44,170 --> 00:13:47,760 without ever running into the problem of collision. 282 00:13:47,760 --> 00:13:50,740 We've also eliminated clustering by doing this. 283 00:13:50,740 --> 00:13:54,380 And well we know that when we insert into a linked list, if you recall 284 00:13:54,380 --> 00:13:57,779 from our video on linked lists, singly linked lists and doubly linked lists, 285 00:13:57,779 --> 00:13:59,070 it's a constant time operation. 286 00:13:59,070 --> 00:14:00,710 We're just adding to the front. 287 00:14:00,710 --> 00:14:04,400 >> And for look up, well we do know that look up in a linked list 288 00:14:04,400 --> 00:14:05,785 can be a problem, right? 289 00:14:05,785 --> 00:14:07,910 We have to search through it from beginning to end. 290 00:14:07,910 --> 00:14:10,460 There's no random access in a linked list. 291 00:14:10,460 --> 00:14:15,610 But if instead of having one linked list where a lookup would be O of n, 292 00:14:15,610 --> 00:14:19,590 we now have 10 linked lists, or 1,000 linked lists, 293 00:14:19,590 --> 00:14:24,120 now it's O of n divided by 10, or O of n divided by 1,000. 294 00:14:24,120 --> 00:14:26,940 >> And while we were talking theoretically about complexity 295 00:14:26,940 --> 00:14:30,061 we disregard constants, in the real world these things actually matter, 296 00:14:30,061 --> 00:14:30,560 right? 297 00:14:30,560 --> 00:14:33,080 We actually will notice that this happens 298 00:14:33,080 --> 00:14:36,610 to run 10 times faster, or 1,000 times faster, 299 00:14:36,610 --> 00:14:41,570 because we're distributing one long chain across 1,000 smaller chains. 300 00:14:41,570 --> 00:14:45,090 And so each time we have to search through one of those chains we can 301 00:14:45,090 --> 00:14:50,290 ignore the 999 chains we don't care about , and just search that one. 302 00:14:50,290 --> 00:14:53,220 >> Which is on average to be 1,000 times shorter. 303 00:14:53,220 --> 00:14:56,460 And so we still are sort of tending towards this average case 304 00:14:56,460 --> 00:15:01,610 of being constant time, but only because we are leveraging 305 00:15:01,610 --> 00:15:03,730 dividing by some huge constant factor. 306 00:15:03,730 --> 00:15:05,804 Let's see how this might actually look though. 307 00:15:05,804 --> 00:15:08,720 So this was the hash table we had before we declared a hash table that 308 00:15:08,720 --> 00:15:10,270 was capable of storing 10 strings. 309 00:15:10,270 --> 00:15:11,728 We're not going to do that anymore. 310 00:15:11,728 --> 00:15:13,880 We already know the limitations of that method. 311 00:15:13,880 --> 00:15:20,170 Now our hash table's going to be an array of 10 nodes, pointers 312 00:15:20,170 --> 00:15:22,120 to heads of linked lists. 313 00:15:22,120 --> 00:15:23,830 >> And right now it's null. 314 00:15:23,830 --> 00:15:26,170 Each one of those 10 pointers is null. 315 00:15:26,170 --> 00:15:29,870 There's nothing in our hash table right now. 316 00:15:29,870 --> 00:15:32,690 >> Now let's start to put some things into this hash table. 317 00:15:32,690 --> 00:15:35,440 And let's see how this method is going to benefit us a little bit. 318 00:15:35,440 --> 00:15:36,760 Let's now hash Joey. 319 00:15:36,760 --> 00:15:40,210 We'll will run the string Joey through a hash function and we return 6. 320 00:15:40,210 --> 00:15:41,200 Well what do we do now? 321 00:15:41,200 --> 00:15:44,090 >> Well now working with linked lists, we're not working with arrays. 322 00:15:44,090 --> 00:15:45,881 And when we're working with linked lists we 323 00:15:45,881 --> 00:15:49,790 know we need to start dynamically allocating space and building chains. 324 00:15:49,790 --> 00:15:54,020 That's sort of how-- those are the core elements of building a linked list. 325 00:15:54,020 --> 00:15:57,670 So let's dynamically allocate space for Joey, 326 00:15:57,670 --> 00:16:00,390 and then let's add him to the chain. 327 00:16:00,390 --> 00:16:03,170 >> So now look what we've done. 328 00:16:03,170 --> 00:16:06,440 When we hash Joey we got the hashcode 6. 329 00:16:06,440 --> 00:16:11,790 Now the pointer at array location 6 points to the head of a linked list, 330 00:16:11,790 --> 00:16:14,900 and right now it's the only element of a linked list. 331 00:16:14,900 --> 00:16:18,350 And the node in that linked list is Joey. 332 00:16:18,350 --> 00:16:22,300 >> So if we need to look up Joey later, we just hash Joey again, 333 00:16:22,300 --> 00:16:26,300 we get 6 again because our hash function is deterministic. 334 00:16:26,300 --> 00:16:30,400 And then we start at the head of the linked list pointed 335 00:16:30,400 --> 00:16:33,360 to by array location 6, and we can iterate 336 00:16:33,360 --> 00:16:35,650 across that trying to find Joey. 337 00:16:35,650 --> 00:16:37,780 And if we build our hash table effectively, 338 00:16:37,780 --> 00:16:41,790 and our hash function effectively to distribute data well, 339 00:16:41,790 --> 00:16:45,770 on average each of those linked lists at every array location 340 00:16:45,770 --> 00:16:50,110 will be 1/10 the size of if we just had it as a single huge 341 00:16:50,110 --> 00:16:51,650 linked list with everything in it. 342 00:16:51,650 --> 00:16:55,670 >> If we distribute that huge linked list across 10 linked lists 343 00:16:55,670 --> 00:16:57,760 each list will be 1/10 the size. 344 00:16:57,760 --> 00:17:01,432 And thus 10 times quicker to search through. 345 00:17:01,432 --> 00:17:02,390 So let's do this again. 346 00:17:02,390 --> 00:17:04,290 Let's now hash Ross. 347 00:17:04,290 --> 00:17:07,540 >> And let's say Ross, when we do that the hash code we get back is 2. 348 00:17:07,540 --> 00:17:10,630 Well now we dynamically allocate a new node, we put Ross in that node, 349 00:17:10,630 --> 00:17:14,900 and we say now array location 2, instead of pointing to null, 350 00:17:14,900 --> 00:17:18,579 points to the head of a linked list whose only node is Ross. 351 00:17:18,579 --> 00:17:22,660 And we can do this one more time, we can hash Rachel and get hashcode 4. 352 00:17:22,660 --> 00:17:25,490 malloc a new node, put Rachel in the node, and say a array location 353 00:17:25,490 --> 00:17:27,839 4 now points to the head of a linked list whose 354 00:17:27,839 --> 00:17:31,420 only element happens to be Rachel. 355 00:17:31,420 --> 00:17:33,470 >> OK but what happens if we have a collision? 356 00:17:33,470 --> 00:17:38,560 Let's see how we handle collisions using the separate chaining method. 357 00:17:38,560 --> 00:17:39,800 Let's hash Phoebe. 358 00:17:39,800 --> 00:17:41,094 We get the hashcode 6. 359 00:17:41,094 --> 00:17:44,010 In our previous example we were just storing the strings in the array. 360 00:17:44,010 --> 00:17:45,980 This was a problem. 361 00:17:45,980 --> 00:17:48,444 >> We don't want to clobber Joey, and we've already 362 00:17:48,444 --> 00:17:51,110 seen that we can get some clustering problems if we try and step 363 00:17:51,110 --> 00:17:52,202 through and probe. 364 00:17:52,202 --> 00:17:54,660 But what if we just kind of treat this the same way, right? 365 00:17:54,660 --> 00:17:57,440 It's just like adding an element to the head of a linked list. 366 00:17:57,440 --> 00:18:00,220 Let's just malloc space for Phoebe. 367 00:18:00,220 --> 00:18:04,764 >> We'll say Phoebe's next pointer points to the old head of the linked list, 368 00:18:04,764 --> 00:18:07,180 and then 6 just points to the new head of the linked list. 369 00:18:07,180 --> 00:18:10,150 And now look, we've changed Phoebe in. 370 00:18:10,150 --> 00:18:14,210 We can now store two elements with hashcode 6, 371 00:18:14,210 --> 00:18:17,170 and we don't have any problems. 372 00:18:17,170 --> 00:18:20,147 >> That's pretty much all there is to chaining. 373 00:18:20,147 --> 00:18:21,980 And chaining is definitely the method that's 374 00:18:21,980 --> 00:18:27,390 going to be most effective for you if you are storing data in a hash table. 375 00:18:27,390 --> 00:18:30,890 But this combination of arrays and linked lists 376 00:18:30,890 --> 00:18:36,080 together to form a hash table really dramatically improves your ability 377 00:18:36,080 --> 00:18:40,550 to store large amounts of data, and very quickly and efficiently search 378 00:18:40,550 --> 00:18:41,630 through that data. 379 00:18:41,630 --> 00:18:44,150 >> There's still one more data structure out there 380 00:18:44,150 --> 00:18:48,700 that might even be a bit better in terms of guaranteeing 381 00:18:48,700 --> 00:18:51,920 that our insertion, deletion, and look up times are even faster. 382 00:18:51,920 --> 00:18:55,630 And we'll see that in a video on tries. 383 00:18:55,630 --> 00:18:58,930 I'm Doug Lloyd, this is CS50. 384 00:18:58,930 --> 00:19:00,214 32738

Can't find what you're looking for?
Get subtitles in any language from opensubtitles.com, and translate them here.