004 Using the Tokenizer (English transcript)

So in this lecture we will move on to the next step of our question answering notebook.

Now, as you recall, typically after we load in the data, we apply the tokenizer to convert the text into a numerical format for the neural network. At a high level that seems simple, but as you've seen many times already, the devil is in the details.

Firstly, let's note that the checkpoint we will be using for this notebook will be either BERT or DistilBERT. This might surprise you, since question answering seems like a pretty complicated task, so you might think it requires an encoder-decoder setup. But this is not the case. This will be easier to understand once we discuss the model outputs, and perhaps some theory about how transformers actually work. For now, it suffices to know that BERT and BERT-like models are the correct choice for this task. This means that we'll be using the same tokenizer that we used earlier in the course. As you know, this tokenizer can already handle two input sentences, such as you saw in the textual entailment task.

Okay, so just to review: what happens when we use the tokenizer to tokenize two pieces of text, and how do we do it? Well, you'll recall that this works by simply passing in the two pieces of text as two separate arguments. By convention, we'll pass in the question first and then the context. If we decode the outputs from the tokenizer, meaning turn the token IDs back into words, we will get a big, long string containing both the question and the context concatenated together. In particular, this always starts with the special CLS token, followed by the first sentence, followed by the SEP token, followed by the second sentence, and finally one more SEP token. Now, please note that I'm using the word "sentence" loosely. In fact, the context may contain more than one sentence, and it likely will most, if not all, of the time.
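To make this concrete, here is a minimal sketch of that two-argument call. The checkpoint name is my assumption (any BERT-style checkpoint from earlier in the course would do), and the toy question and context are just for illustration:

```python
# A minimal sketch, assuming a DistilBERT checkpoint; the exact
# checkpoint name here is an assumption, not fixed by the lecture.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")

question = "Where is Bob?"
context = "Bob is at home."

# Question first, then context, as two separate arguments.
inputs = tokenizer(question, context)

# Decoding the token IDs shows the combined string, something like:
# [CLS] Where is Bob? [SEP] Bob is at home. [SEP]
print(tokenizer.decode(inputs["input_ids"]))
```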
Now, one challenging aspect of question answering is that the context part of the input can be really long. This is unlike the textual entailment or next sentence prediction examples, since for those the two inputs are just two actual sentences. In this case, the context will contain many sentences.

Now, as you recall, BERT can only handle a limited number of tokens. But it wouldn't be a good idea to truncate the context, since the part we leave out may contain the answer to the question. And it also wouldn't be a good idea to truncate the question, since then we wouldn't know the question. So what is the solution to this?

The solution is to split the context into multiple windows. As a result, one data sample will turn into multiple data samples, some of which will contain the answer and some of which will not. So by doing this, we introduce a case where no answer can be found.

Now, those of you who are very keen students might notice that we have a problem when the answer crosses the boundary between these windows. If half of the answer is at the end of one context window and the other half is at the beginning of the next context window, then no answer will be valid, and the model won't really be learning the answers to the questions. We solve this problem by using overlapping windows instead. In the Hugging Face library, the amount of overlap is called the stride.

So the full tokenizer call looks like this. As before, we pass in the question and the context. The next argument is max_length, where we specify the maximum length of the entire input. This includes the question, the context, and the special tokens.

The next argument is truncation="only_second". This means: since the context is the second input text, only truncate the context, but do not truncate the question, which is the first input text.

The next argument is the stride, which defines how much overlap there is between the context windows when they are split up.

And finally, we set return_overflowing_tokens to True. A better name for this would probably be "overlapping tokens", since this corresponds to the tokens that overlap when we set the stride to a positive number.
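Put together, the call looks roughly like the sketch below. The max_length and stride values are illustrative choices on my part, not values prescribed here:

```python
# Sketch of the full call with windowing enabled; max_length=384 and
# stride=128 are illustrative values, not prescribed by the lecture.
inputs = tokenizer(
    question,
    context,
    max_length=384,            # cap on question + context + special tokens
    truncation="only_second",  # only the context (the second text) may be cut
    stride=128,                # overlap between consecutive context windows
    return_overflowing_tokens=True,
)

# Number of windows produced: 1 for this toy context, several for a long one.
print(len(inputs["input_ids"]))
```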
Importantly, note that when we call the tokenizer like this, it expands the data. If we have one question and context pair, this might be converted into multiple input samples, depending on how long the context is.

One thing that will become important later on is how we know which sample from the tokenized output corresponds to which sample from the original input. Imagine, for instance, that we enumerated the original input samples, so we have sample zero, sample one, sample two, and so forth. At the output, we might have 0, 0, 0, 1, 1, 2, 3, 3, 3, and so forth. This means that sample zero got expanded into three separate model inputs, sample one got expanded into two model inputs, and so forth.

Luckily, we do get access to this data when we call the tokenizer as described above. This will return a new key we haven't seen before, called overflow_to_sample_mapping. It contains exactly these integers, specifically the original input sample indices.
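For example, a sketch along these lines, where the texts and lengths are made up purely for illustration:

```python
# Hypothetical batch of two (question, context) pairs. With a long first
# context and a small max_length, sample 0 expands into several windows.
long_context = " ".join(["Bob is at home."] * 100)

inputs = tokenizer(
    ["Where is Bob?", "Where is Alice?"],
    [long_context, "Alice is at work."],
    max_length=64,
    truncation="only_second",
    stride=16,
    return_overflowing_tokens=True,
)

# Something like [0, 0, 0, ..., 1]: each entry names the original input
# sample that the corresponding window came from.
print(inputs["overflow_to_sample_mapping"])
```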
Now, there's one more important argument to the tokenizer that I've left out until now. This is the argument return_offsets_mapping, and we will be setting it to True. Now, like the previous overflow_to_sample_mapping, this probably seems very random and unnecessary. But don't worry, I feel the same way. It will all come together when we look at our next task. This is just to set up the preliminaries, so let's just assume that it's useful.

Basically, when we set this argument to True, we will get back an additional tokenizer output, in addition to the usual input IDs, attention mask, and so forth. This will be called the offset_mapping.

What this does is, for each model input, it gives us a list of tuples. Each of these tuples corresponds to a token in the input sequence. Specifically, each tuple contains the start and end character positions of that token.

So as an example, suppose the question is "Where is Bob?" and the context is "Bob is at home." The first tuple is (0, 0), which corresponds to the special CLS token, because technically this doesn't take up any space. The second tuple goes from 0 to 5, because the word "where" contains five letters. The next tuple goes from 6 to 8, because the word "is" contains two letters, and so forth. Note that after the question is complete, we have another (0, 0) tuple, which corresponds to the SEP token, which does not take up any space. After this, we have the tuples for the context, which, importantly, start counting from zero again.

Now again, it might be totally unclear why we would even need this information. To give you some intuition: consider that when we finally want to represent the answer to our question, the answer will be given as start and end token positions. So, for example, tokens 2 to 4, which might correspond to the phrase "at home". But in order to convert this back into a string to present to the user, we must know where these tokens begin and end. That is, what is the first character and what is the last? So hopefully that gives you some idea of why this information is needed.
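Here is a small sketch of what that looks like for the Bob example. One detail worth knowing: offset mappings are only available from the fast, Rust-backed tokenizers, which AutoTokenizer returns by default. The character positions 7 and 14 below are just the spans of "at" and "home" in this toy context:

```python
# Inspect the per-token character spans; (0, 0) marks special tokens
# like [CLS] and [SEP], and the context's offsets restart from zero.
inputs = tokenizer(
    "Where is Bob?",
    "Bob is at home.",
    return_offsets_mapping=True,
)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"])
for token, (start, end) in zip(tokens, inputs["offset_mapping"]):
    print(token, (start, end))

# To turn predicted token positions back into text, slice the context
# using the first token's start and the last token's end. Here "at"
# spans characters (7, 9) and "home" spans (10, 14) in the context:
start_char, end_char = 7, 14
print("Bob is at home."[start_char:end_char])  # -> "at home"
```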
