Subtitles for 015 Character-Based Tokenizers

Before diving into character-based tokenization, understanding why this kind of tokenization is interesting requires understanding the flaws of word-based tokenization. If you haven't seen the first video, on word-based tokenization, we recommend you check it out before watching this one. Now, let's take a look at character-based tokenization.

We now split our text into individual characters, rather than words.

There are generally a lot of different words in languages, while the number of characters stays low. English, for example, has an estimated 170,000 different words, so we would need a very large vocabulary to encompass them all. With a character-based vocabulary, we can get by with only 256 characters!

Even languages with a lot of different characters, like Chinese, have dictionaries with around 20,000 different characters but more than 375,000 different words. Character-based vocabularies therefore let us use far fewer different tokens than the word-based tokenization dictionaries we would otherwise use.

These vocabularies are also more complete than their word-based counterparts. Since our vocabulary contains all the characters used in a language, even words unseen during tokenizer training can still be tokenized, so out-of-vocabulary tokens will be much less frequent. This includes the ability to correctly tokenize misspelled words, rather than discarding them as unknown straight away.

However, this algorithm isn't perfect either! Intuitively, an individual character does not hold as much information as a whole word does. For example, "Let's" holds more information than "l". Of course, this is not true for all languages: ideogram-based languages like Chinese pack a lot of information into single characters, but for languages written with the Latin alphabet, the model will have to make sense of multiple tokens at a time to recover the information held in a single word.

This leads to another issue with character-based tokenizers: their sequences are translated into a very large number of tokens for the model to process. This can have an impact on the size of the context the model carries around, and reduces the amount of text we can use as input for our model.

This tokenization, while it has some issues, has seen some very good results in the past and should be considered when approaching a new problem, as it solves some of the issues encountered in the word-based algorithm.
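To make the trade-offs above concrete, here is a minimal sketch in Python. It is not the Hugging Face Tokenizers API: `char_tokenize` and `byte_vocab` are hypothetical helpers invented for this example, and a naive whitespace split stands in for a real word-based tokenizer.

```python
# A minimal character-level tokenization sketch, for illustration only.
# The helpers below are made up for this example; they are not part of
# any real tokenization library.

def char_tokenize(text: str) -> list[str]:
    """Split text into individual characters (spaces and punctuation included)."""
    return list(text)

def byte_vocab() -> dict[int, int]:
    """A byte-level vocabulary: every possible byte value is a token id,
    so the vocabulary is capped at 256 entries regardless of the language."""
    return {b: b for b in range(256)}

text = "Let's do tokenization!"

word_tokens = text.split()      # naive whitespace split, for comparison
char_tokens = char_tokenize(text)

print(word_tokens)              # ["Let's", 'do', 'tokenization!'] -> 3 tokens
print(char_tokens[:6])          # ['L', 'e', 't', "'", 's', ' ']
print(len(byte_vocab()))        # 256

# Unseen or misspelled words pose no problem: every character is already
# in the vocabulary, so no unknown token is needed.
print(char_tokenize("tokenizaton"))

# The trade-off: the same input becomes many more tokens for the model.
print(len(word_tokens), "word tokens vs", len(char_tokens), "character tokens")  # 3 vs 22
```

Running this shows both sides of the argument from the video: the character-based vocabulary stays tiny and never produces unknown tokens, but the same sentence turns into roughly seven times as many tokens for the model to process.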
