002 Exploring the Dataset (SQuAD) - English subtitles

So in this lecture we will discuss the SQuAD dataset, which is the dataset we'll be using for this task. SQuAD simply stands for Stanford Question Answering Dataset.

Now, as you already know, the task of question answering is still constrained in a few ways. We can't yet give a neural network a big database of knowledge and just ask it any question we want. Instead, we do what is called extractive question answering. What this means is that we're going to give the network a pair of texts, namely the question and the context, which contains the answer to the question. In that way, the answer is always a substring of the context. Note also that because of this, the network never has to actually generate any text, so we don't require an encoder-decoder setup.

Now there are some details that will become important when you want to actually write the code. Firstly, let's look at how we will load in the data. As you can see, we just call the standard function load_dataset, passing in the string "squad". The dataset comes with five columns, which are id, title, context, question, and answers. Note that the title is pretty much irrelevant. Interestingly, id is not irrelevant, which may seem strange since we've ignored it up until this point. We'll discuss this more when the time comes.

So let's look at some examples. Here's a context. It says: "Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary," and so on and so forth. The question is: "What is in front of the Notre Dame Main Building?" And the corresponding answer is "a copper statue of Christ."

Note that the answer seems to have a funny format. Firstly, the text is stored in a list. We'll see why that makes sense shortly. Furthermore, we see that in addition to the text, we also get the position of the start of the answer in terms of characters. As you recall, a string is simply an array of characters. So if you think of the context as an array of characters, this would be the index of the start of the answer.

So for this example, the corresponding title happens to be "University of Notre Dame". As you can see, this is irrelevant for finding the answer to the question.
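For reference, the loading and inspection steps described above might look like the following minimal sketch. It assumes the Hugging Face datasets library; the variable names (raw_datasets, sample) are illustrative and are not taken from the lecture's own code.

```python
# Minimal sketch: load SQuAD with the Hugging Face `datasets` library
# and inspect one training example, as described in the lecture.
from datasets import load_dataset

raw_datasets = load_dataset("squad")
print(raw_datasets)
# DatasetDict with "train" and "validation" splits; each row has the
# columns: id, title, context, question, answers

sample = raw_datasets["train"][0]
print(sample["question"])  # the question text
print(sample["answers"])   # {'text': [...], 'answer_start': [...]}

# answer_start is a character index into the context string, so slicing
# the context at that position recovers the answer text.
start = sample["answers"]["answer_start"][0]
text = sample["answers"]["text"][0]
assert sample["context"][start:start + len(text)] == text
```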
What should strike you as interesting is that the answers column is plural. This implies that there can potentially be multiple answers to the same question. This also explains why the answer data is stored in lists.

Now, how can this be? Well, consider the question: "Where did Super Bowl 50 take place?" One possible answer is "Santa Clara, California." This is a true fact. But another possible answer is "Levi's Stadium." This is also a true fact. So how can one question have multiple answers? Well, here's the context where this came from. It says: "The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California." Depending on how you interpret the question, both answers would be valid. So this is an example of where the same question can have multiple valid answers.

Now, oddly, this dataset is built such that for some questions, the exact same answer can appear multiple times. I'm not sure why that is.

Finally, note that this only happens for the validation set. For the train set, although the column is called answers, there is only one answer per sample. Since our neural network's loss function is only built for one target per input, this is a good thing, since it means we don't have to do any extra work to split up multiple answers into separate training samples.
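As a quick sanity check on that last claim, a sketch like the one below (again assuming the Hugging Face datasets library; not part of the lecture code) counts how many answers each sample has in the train and validation splits.

```python
# Illustrative check: the train split should have exactly one answer per
# sample, while the validation split can have several (sometimes duplicated).
from datasets import load_dataset

raw_datasets = load_dataset("squad")

train_counts = {len(ex["answers"]["text"]) for ex in raw_datasets["train"]}
print(train_counts)  # expected: {1}

val_counts = {len(ex["answers"]["text"]) for ex in raw_datasets["validation"]}
print(sorted(val_counts))  # expected to include values greater than 1
```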
