So in this lecture we will discuss the SQuAD dataset, which is the dataset we'll be using for this task. SQuAD simply stands for Stanford Question Answering Dataset.
Now, as you already know, the task of question answering is still constrained in a few ways. We can't yet give a neural network a big database of knowledge and just ask it any question we want. Instead, we do what is called extractive question answering. What this means is that we're going to give the network a pair of texts, namely the question and the context, which contains the answer to the question. In that way, the answer is always a substring of the context. Note also that because of this, the network never has to actually generate any text, so we don't require an encoder-decoder setup.
Now there are some details that will become important when you want to actually write the code. Firstly, let's look at how we will load in the data. As you can see, we just call the standard function load_dataset, passing in the string "squad". The dataset comes with five columns, which are id, title, context, question, and answers. Note that the title is pretty much irrelevant. Interestingly, id is not irrelevant, which may seem strange since we've ignored it up until this point. We'll discuss this more when the time comes.
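The record layout described above can be sketched with a plain dictionary that mirrors what `load_dataset("squad")` from the Hugging Face `datasets` library returns for a single sample. The id and answer_start values below are illustrative placeholders, not real entries from the dataset:

```python
# A minimal sketch of the SQuAD record schema. In practice you would run:
#   from datasets import load_dataset
#   raw = load_dataset("squad")
# Here we build one sample by hand to show the five columns.
sample = {
    "id": "abc123",                           # hypothetical unique example id
    "title": "University_of_Notre_Dame",      # article title -- mostly irrelevant for us
    "context": "Architecturally, the school has a Catholic character. ...",
    "question": "What is in front of the Notre Dame Main Building?",
    "answers": {
        "text": ["a copper statue of Christ"],  # note: stored in a list
        "answer_start": [188],                  # illustrative character index
    },
}
print(list(sample.keys()))  # -> ['id', 'title', 'context', 'question', 'answers']
```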
So let's look at some examples. Here's a context. It says: "Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary," and so on and so forth. The question is: "What is in front of the Notre Dame Main Building?" And the corresponding answer is "a copper statue of Christ".
Note that the answer seems to have a funny format. Firstly, the text is stored in a list. We'll see why that makes sense shortly. Furthermore, we see that in addition to the text, we also get the position of the start of the answer in terms of characters. As you recall, a string is simply an array of characters. So if you think of the context as an array of characters, this would be the index of the start of the answer.
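The character-index idea above can be demonstrated in a couple of lines. This is a toy context, not the real SQuAD one, but the slicing logic is exactly what the answer_start field supports:

```python
# Toy context (not a real SQuAD entry) to illustrate answer_start.
context = "Atop the building sits a gold dome above the campus."
answer_text = "gold dome"

# In SQuAD this index is stored in the data; here we compute it.
answer_start = context.index(answer_text)

# Because answers are extractive, slicing the context at the start
# position for the answer's length recovers the answer text exactly.
recovered = context[answer_start:answer_start + len(answer_text)]
print(recovered)  # -> gold dome
```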
So for this example, the corresponding title happens to be "University of Notre Dame". As you can see, this is irrelevant for finding the answer to the question.
What should strike you as interesting is that the answers column is plural. This implies that there can potentially be multiple answers to the same question. This also explains why the answer data is stored in lists. Now, how can this be? Well, consider the question: "Where did Super Bowl 50 take place?" One possible answer is "Santa Clara, California". This is a true fact. But another possible answer is "Levi's Stadium". This is also a true fact. So how can one question have multiple answers? Well, here's the context where this came from. It says: "The game was played on February 7, 2016 at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California." Depending on how you interpret the question, both answers would be valid. So this is an example of where the same question can have multiple valid answers.
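The Super Bowl example above can be sketched as a validation-style record where the answers lists hold two entries, one per valid answer. The start positions are computed here rather than copied from the dataset, but the shape matches the schema described in the lecture:

```python
# Sketch of a validation-style record where one question has two
# distinct valid answers; each is still a substring of the context.
context = ("The game was played on February 7, 2016 at Levi's Stadium "
           "in the San Francisco Bay Area at Santa Clara, California.")
answers = {
    "text": ["Santa Clara, California", "Levi's Stadium"],
    "answer_start": [context.index("Santa Clara, California"),
                     context.index("Levi's Stadium")],
}

# Verify the extractive property for every listed answer.
for text, start in zip(answers["text"], answers["answer_start"]):
    assert context[start:start + len(text)] == text
```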
Now, oddly, this dataset is built such that for some questions, the exact same answer can appear multiple times. I'm not sure why that is. Finally, note that this only happens for the validation set. For the train set, although the column is called answers, there is only one answer per sample. Since our neural network's loss function is only built for one target per input, this is a good thing, since it means we don't have to do any extra work to split up multiple answers into separate training samples.
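Because each train sample carries exactly one answer, extracting the training target is a one-liner per field. The helper name and the toy sample below are my own illustrations, not part of the dataset's API:

```python
# Sketch: each train sample has exactly one answer, so we can pull a
# single (start, end) character span out as the training target.
def char_span(sample):
    text = sample["answers"]["text"][0]           # one entry in the train set
    start = sample["answers"]["answer_start"][0]
    return start, start + len(text)               # end index is exclusive

train_sample = {
    "context": "Paris is the capital of France.",
    "question": "What is the capital of France?",
    "answers": {"text": ["Paris"], "answer_start": [0]},
}
start, end = char_span(train_sample)
print(train_sample["context"][start:end])  # -> Paris
```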