016 Subword-based Tokenizers (transcript)

Let's take a look at subword-based tokenization. Understanding why subword-based tokenization is interesting requires understanding the flaws of word-based and character-based tokenization. If you haven't seen the first videos on word-based and character-based tokenization, we recommend you check them out before watching this one.

Subword tokenization lies in between character-based and word-based tokenization algorithms. The idea is to find a middle ground between the drawbacks of word-based tokenizers (very large vocabularies, a large quantity of out-of-vocabulary tokens, and loss of meaning across very similar words) and those of character-based tokenizers (very long sequences and less meaningful individual tokens).

These algorithms rely on the following principle: frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords.

An example is the word "dog": we would like our tokenizer to have a single ID for the word "dog", rather than splitting it into the characters d, o, and g. However, when encountering the word "dogs", we would like our tokenizer to understand that at the root this is still the word "dog", with an added "s" that slightly changes the meaning while keeping the original idea. Another example is a complex word like "tokenization", which can be split into meaningful subwords. The root of the word is "token", and "ization" completes the root to give it a slightly different meaning. It makes sense to split the word into two: "token" as the root of the word (labeled as the "start" of the word), and "ization" as additional information (labeled as a "completion" of the word).

In turn, the model will now be able to make sense of "token" in different situations. It will understand that the words "token", "tokens", "tokenizing", and "tokenization" are linked and have a similar meaning. It will also understand that "tokenization", "modernization", and "immunization", which all have the same suffix, are probably used in the same syntactic situations.

Subword-based tokenizers generally have a way to identify which tokens are starts of words and which tokens complete them: "token" as the start of a word, "##ization" as completing a word. Here the ## prefix indicates that "ization" is part of a word rather than the beginning of one. The ## convention comes from the BERT tokenizer, which is based on the WordPiece algorithm.
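As a quick illustration, here is a minimal sketch using the Hugging Face Transformers AutoTokenizer, assuming the transformers library is installed and taking bert-base-uncased as one example of a checkpoint whose tokenizer uses WordPiece; the exact splits depend on the vocabulary that particular tokenizer was trained with.

```python
# Minimal sketch of WordPiece-style subword tokenization
# (assumes the transformers library is installed; bert-base-uncased is
# just one example of a checkpoint whose tokenizer uses WordPiece).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A frequent word is kept as a single token.
print(tokenizer.tokenize("dog"))            # e.g. ['dog']

# A rarer word is decomposed into a word-start piece and a
# "##"-prefixed completion piece.
print(tokenizer.tokenize("tokenization"))   # e.g. ['token', '##ization']
```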
Other tokenizers use other prefixes, which can be placed to indicate parts of words, as seen here, or starts of words instead.

There are a lot of different algorithms that can be used for subword tokenization, and most models obtaining state-of-the-art results in English today use some kind of subword tokenization algorithm. These approaches help reduce vocabulary sizes by sharing information across different words and by having prefixes and suffixes understood as such. They keep meaning across very similar words by recognizing the similar tokens that make them up.
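To show how these markers carry word-boundary information, here is a small self-contained sketch (no particular library assumed, and the input tokens are hypothetical) that regroups WordPiece-style tokens into words by looking at the ## prefix; SentencePiece-based tokenizers work the other way around and mark word starts, typically with a ▁ character.

```python
# Sketch: recover words from WordPiece-style tokens.
# A token starting with "##" continues the previous word;
# any other token starts a new word.
def group_subwords(tokens):
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]   # strip the marker, glue onto the current word
        else:
            words.append(token)      # a new word begins
    return words

# Hypothetical WordPiece-style output, for illustration only:
print(group_subwords(["token", "##ization", "is", "useful"]))
# -> ['tokenization', 'is', 'useful']
```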
