017: The Tokenization Pipeline (video transcript)

The tokenizer pipeline. In this video, we'll look at how a tokenizer converts raw text to numbers that a Transformer model can make sense of, like when we execute this code. Here is a quick overview of what happens inside the tokenizer object: first, the text is split into tokens, which are words, parts of words, or punctuation symbols. Then the tokenizer adds any special tokens and converts each token to its unique ID as defined by the tokenizer's vocabulary. As we'll see, it doesn't actually happen in this order, but viewing it like this is better for understanding what happens.

The first step is to split our input text into tokens with the tokenize method. To do this, the tokenizer may first perform some operations, like lowercasing all words, then follow a set of rules to split the result into small chunks of text. Most Transformer models use a subword tokenization algorithm, which means that a given word can be split into several tokens, like "tokenize" here. Look at the "Tokenization algorithms" videos linked below for more information! The ## prefix we see in front of "ize" is the convention used by BERT to indicate that this token is not the beginning of a word. Other tokenizers may use different conventions, however: for instance, ALBERT tokenizers will add a long underscore in front of all the tokens that had a space before them, which is the convention used by SentencePiece tokenizers.

The second step of the tokenization pipeline is to map those tokens to their respective IDs as defined by the vocabulary of the tokenizer. This is why we need to download a file when we instantiate a tokenizer with the from_pretrained method: we have to make sure we use the same mapping as when the model was pretrained. To do this, we use the convert_tokens_to_ids method.

You may have noticed that we don't have the exact same result as in our first slide (or, since this looks like a list of random numbers, allow me to refresh your memory): we had a number at the beginning and a number at the end that are now missing; those are the special tokens. The special tokens are added by the prepare_for_model method, which knows the indices of those tokens in the vocabulary and just adds the proper numbers. You can look at the special tokens (and, more generally, at how the tokenizer has changed your text) by using the decode method on the outputs of the tokenizer object.
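
As a recap of those intermediate steps, here is a minimal sketch that runs the pipeline by hand. The bert-base-uncased checkpoint and the example sentence are assumptions (the video doesn't name them in the transcript), and the commented outputs are indicative:

    from transformers import AutoTokenizer

    # Assumed checkpoint; any BERT-style tokenizer behaves similarly.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    text = "Let's try to tokenize!"

    # Step 1: split the raw text into tokens.
    tokens = tokenizer.tokenize(text)
    print(tokens)
    # e.g. ['let', "'", 's', 'try', 'to', 'token', '##ize', '!']

    # Step 2: map each token to its ID in the tokenizer's vocabulary.
    ids = tokenizer.convert_tokens_to_ids(tokens)

    # Step 3: add the special tokens ([CLS] and [SEP] for BERT).
    final = tokenizer.prepare_for_model(ids)
    print(final["input_ids"])

    # decode shows how the tokenizer changed the text, special tokens included.
    print(tokenizer.decode(final["input_ids"]))
    # e.g. "[CLS] let's try to tokenize! [SEP]"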
As for the prefix marking the beginning of words versus parts of words, those special tokens vary depending on which tokenizer you are using: the BERT tokenizer uses [CLS] and [SEP], but the RoBERTa tokenizer uses the HTML-like anchors <s> and </s>. Now that you know how the tokenizer works, you can forget all those intermediary methods and just remember that you only have to call it on your input texts. The inputs don't contain only the input IDs, however: to learn what the attention mask is, check out the "Batch inputs together" video. To learn about token type IDs, look at the "Process pairs of sentences" video.
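
As a sketch of that final shortcut, here is what calling the tokenizer directly looks like, with roberta-base added (an assumed checkpoint) to show the differing special tokens; the commented outputs are indicative:

    from transformers import AutoTokenizer

    bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    roberta_tokenizer = AutoTokenizer.from_pretrained("roberta-base")

    # One call runs the whole pipeline: tokenization, ID mapping, special tokens.
    inputs = bert_tokenizer("Let's try to tokenize!")
    print(inputs)
    # e.g. {'input_ids': [...], 'token_type_ids': [...], 'attention_mask': [...]}

    # The same call with RoBERTa wraps the text in <s> ... </s> instead.
    ids = roberta_tokenizer("Let's try to tokenize!")["input_ids"]
    print(roberta_tokenizer.decode(ids))
    # e.g. "<s>Let's try to tokenize!</s>"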
