1
00:00:10,930 --> 00:00:15,730
So in this lecture we will move on to the next step of our question answering notebook.
2
00:00:16,210 --> 00:00:22,390
Now, as you recall, typically after we load in the data, we apply the tokenizer to convert the text
3
00:00:22,390 --> 00:00:25,060
into a numerical format for the neural network.
4
00:00:25,660 --> 00:00:26,500
At a high level,
5
00:00:26,500 --> 00:00:31,600
that seems simple, but as you've seen many times already, the devil is in the details.
6
00:00:36,080 --> 00:00:40,900
Firstly, let's note that the checkpoint we'll be using for this notebook will be either BERT or
7
00:00:40,910 --> 00:00:41,510
DistilBERT.
8
00:00:41,900 --> 00:00:46,460
This might surprise you since question answering seems like a pretty complicated task.
9
00:00:46,580 --> 00:00:50,180
So you might think it would require an encoder-decoder setup.
10
00:00:50,420 --> 00:00:52,070
But this is not the case.
11
00:00:52,250 --> 00:00:57,860
This will be easier to understand once we discuss the model outputs and perhaps some theory about how
12
00:00:57,860 --> 00:00:59,300
transformers actually work.
13
00:01:00,350 --> 00:01:06,370
For now, it suffices to know that BERT and BERT-like models are the correct choice for this task.
14
00:01:06,770 --> 00:01:11,360
This means that we'll be using the same tokenizer that we used earlier in the course.
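For instance, here is a minimal sketch of loading such a tokenizer. The checkpoint name is an assumption, and the notebook may use a different BERT variant:

from transformers import AutoTokenizer

# Assumed checkpoint name; any BERT or DistilBERT checkpoint works the same way here
checkpoint = "distilbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)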
15
00:01:11,990 --> 00:01:17,810
As you know, this tokenizer can already handle two input sentences, such as you saw in the textual
16
00:01:17,810 --> 00:01:19,040
entailment task.
17
00:01:23,650 --> 00:01:23,860
Okay.
18
00:01:23,980 --> 00:01:30,190
So just to review: what happens when we use the tokenizer to tokenize two pieces of text, and how do
19
00:01:30,190 --> 00:01:30,970
we do it?
20
00:01:31,990 --> 00:01:37,750
Well, you'll recall that this works by simply passing in the two pieces of text as two separate arguments.
21
00:01:38,020 --> 00:01:42,160
By convention, we'll pass in the question first and then the context.
22
00:01:42,850 --> 00:01:49,450
If we decode the outputs from the tokenizer, meaning turn the token IDs back into words, we will get
23
00:01:49,450 --> 00:01:53,230
a big, long string containing both the question and the context,
24
00:01:53,230 --> 00:01:54,880
concatenated together.
25
00:01:55,900 --> 00:02:01,690
In particular, this always starts with the special CLS token, followed by the first sentence followed
26
00:02:01,690 --> 00:02:04,260
by the SEP token followed by the second sentence.
27
00:02:04,270 --> 00:02:06,430
And finally, one more SEP token.
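As a rough sketch of what this looks like in code, continuing from the tokenizer loaded above (the question and context strings here are just illustrative placeholders):

# Hypothetical question/context pair, just for illustration
question = "Where is Bob?"
context = "Bob is at home."

# Pass the question first and the context second, as two separate arguments
inputs = tokenizer(question, context)
print(tokenizer.decode(inputs["input_ids"]))
# Prints something like (exact casing depends on the tokenizer):
# [CLS] Where is Bob? [SEP] Bob is at home. [SEP]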
28
00:02:06,970 --> 00:02:10,270
Now, please note that I'm using the word sentence loosely.
29
00:02:10,389 --> 00:02:13,840
In fact, the context may contain more than one sentence.
30
00:02:13,930 --> 00:02:17,320
Indeed, it likely will most, if not all, of the time.
31
00:02:21,840 --> 00:02:27,750
Now, one challenging aspect of question answering is that the context part of the input can be really
32
00:02:27,750 --> 00:02:28,470
long.
33
00:02:28,770 --> 00:02:34,290
This is unlike the textual entailment or next sentence prediction examples, since for those the two
34
00:02:34,290 --> 00:02:36,840
inputs are just two actual sentences.
35
00:02:37,320 --> 00:02:40,350
In this case, the context will contain many sentences.
36
00:02:40,920 --> 00:02:46,980
Now, as you recall, BERT can only handle a limited number of tokens, but it wouldn't be a good idea
37
00:02:46,980 --> 00:02:52,260
to truncate the context, since the part we leave out may contain the answer to the question.
38
00:02:52,590 --> 00:02:55,520
And it also wouldn't be a good idea to truncate the question,
39
00:02:55,530 --> 00:02:57,630
since then we wouldn't know what the question is.
40
00:02:58,080 --> 00:03:00,180
So what is the solution to this?
41
00:03:01,230 --> 00:03:05,250
The solution is to split the context into multiple windows.
42
00:03:05,700 --> 00:03:11,610
As a result, one data sample will turn into multiple data samples, some of which will contain the
43
00:03:11,610 --> 00:03:13,860
answer and some of which will not.
44
00:03:14,160 --> 00:03:18,300
So by doing this, we introduce a case where no answer can be found.
45
00:03:20,320 --> 00:03:26,530
Now, for those of you very keen students, you might notice that we have a problem
46
00:03:26,530 --> 00:03:29,890
when the answer crosses the boundary between these windows.
47
00:03:30,250 --> 00:03:35,560
If half of the answer is at the end of one context window and the other half of the answer is at the
48
00:03:35,560 --> 00:03:41,470
beginning of the next context window, then no answer will be valid and the model won't really be learning
49
00:03:41,470 --> 00:03:43,090
the answers to the questions.
50
00:03:43,810 --> 00:03:47,590
We solve this problem by using overlapping windows instead.
51
00:03:47,890 --> 00:03:51,880
In the Hugging Face library, the amount of overlap is called the stride.
52
00:03:56,330 --> 00:04:02,840
So the full tokenizer call looks like this. As before, we pass in the question and the context.
53
00:04:03,230 --> 00:04:08,930
The next argument is max_length, where we specify the maximum length of the entire input.
54
00:04:09,050 --> 00:04:13,370
This includes the question and the context and the special tokens.
55
00:04:14,700 --> 00:04:18,779
The next step is to specify truncation equals only_second.
56
00:04:19,050 --> 00:04:25,740
This means, since the context is the second input text, only truncate this but do not truncate the
57
00:04:25,740 --> 00:04:28,440
question which is the first input text.
58
00:04:29,880 --> 00:04:35,850
The next argument is the stride, which defines how much overlap there is between the context windows
59
00:04:35,850 --> 00:04:37,230
when they are split up.
60
00:04:38,620 --> 00:04:45,220
And finally, we set return_overflowing_tokens to true. A better name for this would probably be overlapping
61
00:04:45,250 --> 00:04:51,370
tokens, since this corresponds to the tokens that overlap when we set the stride to a positive number.
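Putting these arguments together, the call might look roughly like this. The specific max_length and stride values are assumptions for illustration, not necessarily the ones used in the notebook:

inputs = tokenizer(
    question,                        # first text: the question (never truncated)
    context,                         # second text: the possibly very long context
    max_length=384,                  # assumed value; limit on the whole input, special tokens included
    truncation="only_second",        # only truncate the context (the second text), never the question
    stride=128,                      # assumed value; overlap between consecutive context windows
    return_overflowing_tokens=True,  # return every window instead of just the first
)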
62
00:04:56,040 --> 00:05:00,840
Importantly, note that when we call the tokenizer like this, it expands the data.
63
00:05:01,140 --> 00:05:07,680
If we have one question and context pair, this might be converted into multiple input samples depending
64
00:05:07,680 --> 00:05:09,690
on how long the context is.
65
00:05:10,260 --> 00:05:15,900
One thing that will become important later on is how we know which sample from the tokenized output
66
00:05:15,900 --> 00:05:18,990
corresponds to which sample from the tokenized input.
67
00:05:19,560 --> 00:05:23,730
Imagine for instance, that we enumerated the original input samples.
68
00:05:23,730 --> 00:05:27,540
So we have sample zero, sample one, sample two and so forth.
69
00:05:27,900 --> 00:05:33,570
At the output, we might have 0, 0, 0, 1, 1, 2, 3, 3, 3, and so forth.
70
00:05:34,020 --> 00:05:38,700
This means that Sample zero got expanded into three separate model inputs.
71
00:05:38,940 --> 00:05:42,840
Sample one got expanded into two model inputs and so forth.
72
00:05:44,060 --> 00:05:50,000
Luckily, we do get access to this data when we call the tokenizer as described above.
73
00:05:50,030 --> 00:05:55,100
This will return a new key we haven't seen before called overflow_to_sample_mapping.
74
00:05:55,340 --> 00:06:01,040
It contains exactly these integers, specifically the original input sample indices.
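As a small sketch, assuming we tokenized a batch of several question/context pairs at once (the printed values below just illustrate the pattern described above, they are not real output):

# Which original sample did each tokenized window come from?
print(inputs["overflow_to_sample_mapping"])
# e.g. [0, 0, 0, 1, 1, 2, 3, 3, 3]
# -> sample 0 became 3 windows, sample 1 became 2, sample 2 became 1, and so on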
75
00:06:05,440 --> 00:06:10,240
Now there's one more important argument to the tokenizer that I've left out until now.
76
00:06:10,480 --> 00:06:13,360
This is the argument return_offsets_mapping.
77
00:06:13,540 --> 00:06:19,030
We will be setting this to true. Now, like the previous overflow_to_sample_mapping,
78
00:06:19,060 --> 00:06:24,670
this probably seems very random and unnecessary, but don't worry, I feel the same way.
79
00:06:24,760 --> 00:06:27,610
It will all come together when we look at our next task.
80
00:06:27,610 --> 00:06:32,230
But this is just to set up the preliminaries, so let's just assume that it's useful.
81
00:06:33,040 --> 00:06:38,940
Basically, when we set this argument to true, we will get back an additional tokenizer output
82
00:06:38,950 --> 00:06:43,150
in addition to the usual input IDs, attention mask and so forth.
83
00:06:43,330 --> 00:06:45,550
This will be called the offset_mapping.
84
00:06:46,740 --> 00:06:51,780
What this does is for each model input, it's going to give us a list of tuples.
85
00:06:52,470 --> 00:06:57,000
Each of these tuples corresponds to a token in the input sequence.
86
00:06:57,570 --> 00:07:04,110
Specifically, each tuple will contain the start and end character positions of that token.
87
00:07:04,650 --> 00:07:08,280
So as an example, suppose the question is "Where is Bob?"
88
00:07:08,280 --> 00:07:10,800
and the context is "Bob is at home."
89
00:07:11,400 --> 00:07:17,400
The first tuple is (0, 0), which corresponds to the special CLS token, because technically this doesn't
90
00:07:17,400 --> 00:07:18,690
take up any space.
91
00:07:19,880 --> 00:07:25,490
The second tuple goes from 0 to 5, because the word "where" contains five letters.
92
00:07:26,030 --> 00:07:31,910
The next tuple goes from 6 to 8, because the word "is" contains two letters, and so forth.
93
00:07:32,660 --> 00:07:38,210
Note that after the question is complete, we have another (0, 0) tuple, which corresponds to the
94
00:07:38,210 --> 00:07:41,120
SEP token which does not take up any space.
95
00:07:42,190 --> 00:07:47,320
After this we have the tuples for the context which importantly start counting from zero.
96
00:07:48,040 --> 00:07:52,450
Now again, it might be totally unclear why we would even need this information.
97
00:07:53,170 --> 00:07:55,940
To give you some intuition for why we might need this.
98
00:07:55,960 --> 00:08:01,750
Consider that when we finally want to represent the answer to our question, the answer will be given
99
00:08:01,750 --> 00:08:04,570
as start and end positions of each token.
100
00:08:04,960 --> 00:08:11,650
So, for example, token a 2 to 4, which corresponds to the phrase at home, but in order to convert
101
00:08:11,650 --> 00:08:17,410
this back into a string to present to the user, we must know where these tokens begin and end.
102
00:08:17,440 --> 00:08:20,980
That is, what is the first character and what is the last?
103
00:08:21,190 --> 00:08:25,360
So hopefully that gives you some idea of why this information is needed.
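As a sketch of how this conversion works, continuing the example above. The token positions here are hypothetical, chosen to line up with the illustrative offsets shown earlier:

# Suppose the model predicts that the answer starts at token 8 and ends at token 9
# (hypothetical positions corresponding to "at" and "home" in the offsets above).
start_token, end_token = 8, 9
offsets = enc["offset_mapping"]

# Look up the character positions of those tokens within the context string...
start_char = offsets[start_token][0]
end_char = offsets[end_token][1]

# ...and slice the original context to get the human-readable answer.
context = "Bob is at home."
print(context[start_char:end_char])  # -> "at home"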