So in this lecture we will discuss the SQuAD dataset, which is the dataset we'll be using for this task. SQuAD simply stands for Stanford Question Answering Dataset.
Now, as you already know, the task of question answering is still constrained in a few ways. We can't yet give a neural network a big database of knowledge and just ask it any question we want. Instead, we do what is called extractive question answering. What this means is that we're going to give the network a pair of texts, namely the question and the context, which contains the answer to the question. In that way, the answer is always a substring of the context. Note also that because of this, the network never has to actually generate any text, so we don't require an encoder-decoder setup.
Now there are some details that will become important when you want to actually write the code. Firstly, let's look at how we will load in the data. As you can see, we just call the standard function load_dataset, passing in the string "squad". The dataset comes with five columns, which are id, title, context, question, and answers. Note that the title is pretty much irrelevant. Interestingly, id is not irrelevant, which may seem strange since we've ignored it up until this point. We'll discuss this more when the time comes.
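The record layout described above can be sketched with a plain dictionary that mirrors what `load_dataset("squad")` from the Hugging Face `datasets` library returns for a single sample. The id and answer_start values below are illustrative placeholders, not real entries from the dataset:

```python
# A minimal sketch of the SQuAD record schema. In practice you would run:
#   from datasets import load_dataset
#   raw = load_dataset("squad")
# Here we build one sample by hand to show the five columns.
sample = {
    "id": "abc123",                           # hypothetical unique example id
    "title": "University_of_Notre_Dame",      # article title -- mostly irrelevant for us
    "context": "Architecturally, the school has a Catholic character. ...",
    "question": "What is in front of the Notre Dame Main Building?",
    "answers": {
        "text": ["a copper statue of Christ"],  # note: stored in a list
        "answer_start": [188],                  # illustrative character index
    },
}
print(list(sample.keys()))  # -> ['id', 'title', 'context', 'question', 'answers']
```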
So let's look at some examples. Here's a context. It says: "Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary," and so on and so forth. The question is: "What is in front of the Notre Dame Main Building?" And the corresponding answer is "a copper statue of Christ".
Note that the answer seems to have a funny format. Firstly, the text is stored in a list. We'll see why that makes sense shortly. Furthermore, we see that in addition to the text, we also get the position of the start of the answer in terms of characters. As you recall, a string is simply an array of characters. So if you think of the context as an array of characters, this would be the index of the start of the answer.
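The character-index idea above can be demonstrated in a couple of lines. This is a toy context, not the real SQuAD one, but the slicing logic is exactly what the answer_start field supports:

```python
# Toy context (not a real SQuAD entry) to illustrate answer_start.
context = "Atop the building sits a gold dome above the campus."
answer_text = "gold dome"

# In SQuAD this index is stored in the data; here we compute it.
answer_start = context.index(answer_text)

# Because answers are extractive, slicing the context at the start
# position for the answer's length recovers the answer text exactly.
recovered = context[answer_start:answer_start + len(answer_text)]
print(recovered)  # -> gold dome
```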
So for this example, the corresponding title happens to be "University of Notre Dame". As you can see, this is irrelevant for finding the answer to the question.
What should strike you as interesting is that the answers column is plural. This implies that there can potentially be multiple answers to the same question. This also explains why the answer data is stored in lists. Now, how can this be? Well, consider the question: "Where did Super Bowl 50 take place?" One possible answer is "Santa Clara, California". This is a true fact. But another possible answer is "Levi's Stadium". This is also a true fact. So how can one question have multiple answers? Well, here's the context where this came from. It says: "The game was played on February 7, 2016 at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California." Depending on how you interpret the question, both answers would be valid. So this is an example of where the same question can have multiple valid answers.
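The Super Bowl example above can be sketched as a validation-style record where the answers lists hold two entries, one per valid answer. The start positions are computed here rather than copied from the dataset, but the shape matches the schema described in the lecture:

```python
# Sketch of a validation-style record where one question has two
# distinct valid answers; each is still a substring of the context.
context = ("The game was played on February 7, 2016 at Levi's Stadium "
           "in the San Francisco Bay Area at Santa Clara, California.")
answers = {
    "text": ["Santa Clara, California", "Levi's Stadium"],
    "answer_start": [context.index("Santa Clara, California"),
                     context.index("Levi's Stadium")],
}

# Verify the extractive property for every listed answer.
for text, start in zip(answers["text"], answers["answer_start"]):
    assert context[start:start + len(text)] == text
```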
Now, oddly, this dataset is built such that for some questions, the exact same answer can appear multiple times. I'm not sure why that is. Finally, note that this only happens for the validation set. For the train set, although the column is called answers, there is only one answer per sample. Since our neural network's loss function is only built for one target per input, this is a good thing, since it means we don't have to do any extra work to split up multiple answers into separate training samples.
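Because each train sample carries exactly one answer, extracting the training target is a one-liner per field. The helper name and the toy sample below are my own illustrations, not part of the dataset's API:

```python
# Sketch: each train sample has exactly one answer, so we can pull a
# single (start, end) character span out as the training target.
def char_span(sample):
    text = sample["answers"]["text"][0]           # one entry in the train set
    start = sample["answers"]["answer_start"][0]
    return start, start + len(text)               # end index is exclusive

train_sample = {
    "context": "Paris is the capital of France.",
    "question": "What is the capital of France?",
    "answers": {"text": ["Paris"], "answer_start": [0]},
}
start, end = char_span(train_sample)
print(train_sample["context"][start:end])  # -> Paris
```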