004 Using the Tokenizer (English transcript)

So in this lecture we will move on to the next step of our question answering notebook.

Now, as you recall, typically after we load in the data, we apply the tokenizer to convert the text into a numerical format for the neural network. At a high level that seems simple, but as you've seen many times already, the devil is in the details.

Firstly, let's note that the checkpoint we will be using for this notebook will be either BERT or DistilBERT. This might surprise you, since question answering seems like a pretty complicated task, so you might think it requires an encoder-decoder setup. But this is not the case. This will be easier to understand once we discuss the model outputs, and perhaps some theory about how transformers actually work. For now, it suffices to know that BERT and BERT-like models are the correct choice for this task. This means that we'll be using the same tokenizer that we used earlier in the course. As you know, this tokenizer can already handle two input sentences, such as you saw in the textual entailment task.

Okay, so just to review: what happens when we use the tokenizer to tokenize two pieces of text, and how do we do it? Well, you'll recall that this works by simply passing in the two pieces of text as two separate arguments. By convention, we'll pass in the question first and then the context. If we decode the outputs from the tokenizer, meaning turn the token IDs back into words, we will get a big, long string containing both the question and the context concatenated together. In particular, this always starts with the special CLS token, followed by the first sentence, followed by the SEP token, followed by the second sentence, and finally one more SEP token. Now, please note that I'm using the word "sentence" loosely. In fact, the context may contain more than one sentence, and it likely will most, if not all, of the time.
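To make this concrete, here is a minimal sketch of that two-argument call. The checkpoint name is my assumption (any BERT-style checkpoint from earlier in the course would do), and the toy question and context are just for illustration:

```python
# A minimal sketch, assuming a DistilBERT checkpoint; the exact
# checkpoint name here is an assumption, not fixed by the lecture.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")

question = "Where is Bob?"
context = "Bob is at home."

# Question first, then context, as two separate arguments.
inputs = tokenizer(question, context)

# Decoding the token IDs shows the combined string, something like:
# [CLS] Where is Bob? [SEP] Bob is at home. [SEP]
print(tokenizer.decode(inputs["input_ids"]))
```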
Now, one challenging aspect of question answering is that the context part of the input can be really long. This is unlike the textual entailment or next sentence prediction examples, since for those the two inputs are just two actual sentences. In this case, the context will contain many sentences.

Now, as you recall, BERT can only handle a limited number of tokens. But it wouldn't be a good idea to truncate the context, since the part we leave out may contain the answer to the question. And it also wouldn't be a good idea to truncate the question, since then we wouldn't know the question. So what is the solution to this?

The solution is to split the context into multiple windows. As a result, one data sample will turn into multiple data samples, some of which will contain the answer and some of which will not. So by doing this, we introduce a case where no answer can be found.

Now, those of you who are very keen students might notice that we have a problem when the answer crosses the boundary between these windows. If half of the answer is at the end of one context window and the other half is at the beginning of the next context window, then no answer will be valid, and the model won't really be learning the answers to the questions. We solve this problem by using overlapping windows instead. In the Hugging Face library, the amount of overlap is called the stride.

So the full tokenizer call looks like this. As before, we pass in the question and the context. The next argument is max_length, where we specify the maximum length of the entire input. This includes the question, the context, and the special tokens.

The next argument is truncation="only_second". This means: since the context is the second input text, only truncate the context, but do not truncate the question, which is the first input text.

The next argument is the stride, which defines how much overlap there is between the context windows when they are split up.

And finally, we set return_overflowing_tokens to True. A better name for this would probably be "overlapping tokens", since this corresponds to the tokens that overlap when we set the stride to a positive number.
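Put together, the call looks roughly like the sketch below. The max_length and stride values are illustrative choices on my part, not values prescribed here:

```python
# Sketch of the full call with windowing enabled; max_length=384 and
# stride=128 are illustrative values, not prescribed by the lecture.
inputs = tokenizer(
    question,
    context,
    max_length=384,            # cap on question + context + special tokens
    truncation="only_second",  # only the context (the second text) may be cut
    stride=128,                # overlap between consecutive context windows
    return_overflowing_tokens=True,
)

# Number of windows produced: 1 for this toy context, several for a long one.
print(len(inputs["input_ids"]))
```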
Importantly, note that when we call the tokenizer like this, it expands the data. If we have one question and context pair, this might be converted into multiple input samples, depending on how long the context is.

One thing that will become important later on is how we know which sample from the tokenized output corresponds to which sample from the original input. Imagine, for instance, that we enumerated the original input samples, so we have sample zero, sample one, sample two, and so forth. At the output, we might have 0, 0, 0, 1, 1, 2, 3, 3, 3, and so forth. This means that sample zero got expanded into three separate model inputs, sample one got expanded into two model inputs, and so forth.

Luckily, we do get access to this data when we call the tokenizer as described above. This will return a new key we haven't seen before, called overflow_to_sample_mapping. It contains exactly these integers, specifically the original input sample indices.
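For example, a sketch along these lines, where the texts and lengths are made up purely for illustration:

```python
# Hypothetical batch of two (question, context) pairs. With a long first
# context and a small max_length, sample 0 expands into several windows.
long_context = " ".join(["Bob is at home."] * 100)

inputs = tokenizer(
    ["Where is Bob?", "Where is Alice?"],
    [long_context, "Alice is at work."],
    max_length=64,
    truncation="only_second",
    stride=16,
    return_overflowing_tokens=True,
)

# Something like [0, 0, 0, ..., 1]: each entry names the original input
# sample that the corresponding window came from.
print(inputs["overflow_to_sample_mapping"])
```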
Now, there's one more important argument to the tokenizer that I've left out until now. This is the argument return_offsets_mapping, and we will be setting it to True. Now, like the previous overflow_to_sample_mapping, this probably seems very random and unnecessary. But don't worry, I feel the same way. It will all come together when we look at our next task. This is just to set up the preliminaries, so let's just assume that it's useful.

Basically, when we set this argument to True, we will get back an additional tokenizer output, in addition to the usual input IDs, attention mask, and so forth. This will be called the offset_mapping.

What this does is, for each model input, it gives us a list of tuples. Each of these tuples corresponds to a token in the input sequence. Specifically, each tuple contains the start and end character positions of that token.

So as an example, suppose the question is "Where is Bob?" and the context is "Bob is at home." The first tuple is (0, 0), which corresponds to the special CLS token, because technically this doesn't take up any space. The second tuple goes from 0 to 5, because the word "where" contains five letters. The next tuple goes from 6 to 8, because the word "is" contains two letters, and so forth. Note that after the question is complete, we have another (0, 0) tuple, which corresponds to the SEP token, which does not take up any space. After this, we have the tuples for the context, which, importantly, start counting from zero again.

Now again, it might be totally unclear why we would even need this information. To give you some intuition: consider that when we finally want to represent the answer to our question, the answer will be given as start and end token positions. So, for example, tokens 2 to 4, which might correspond to the phrase "at home". But in order to convert this back into a string to present to the user, we must know where these tokens begin and end. That is, what is the first character and what is the last? So hopefully that gives you some idea of why this information is needed.
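Here is a small sketch of what that looks like for the Bob example. One detail worth knowing: offset mappings are only available from the fast, Rust-backed tokenizers, which AutoTokenizer returns by default. The character positions 7 and 14 below are just the spans of "at" and "home" in this toy context:

```python
# Inspect the per-token character spans; (0, 0) marks special tokens
# like [CLS] and [SEP], and the context's offsets restart from zero.
inputs = tokenizer(
    "Where is Bob?",
    "Bob is at home.",
    return_offsets_mapping=True,
)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"])
for token, (start, end) in zip(tokens, inputs["offset_mapping"]):
    print(token, (start, end))

# To turn predicted token positions back into text, slice the context
# using the first token's start and the last token's end. Here "at"
# spans characters (7, 9) and "home" spans (10, 14) in the context:
start_char, end_char = 7, 14
print("Bob is at home."[start_char:end_char])  # -> "at home"
```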
