014 Word-based tokenizers

Let's take a look at word-based tokenization. Word-based tokenization is the idea of splitting the raw text into words, by splitting on spaces or other specific rules like punctuation. In this algorithm, each word has a specific number, an "ID", attributed to it. In this example, "Let's" has the ID 250, "do" has the ID 861, and "tokenization" followed by an exclamation point has the ID 345.

This approach is interesting, as the model has representations that are based on entire words. The information held in a single number is high, as a word carries a lot of contextual and semantic information in a sentence.

However, this approach does have its limits. For example, the word "dog" and the word "dogs" are very similar, and their meanings are close. Word-based tokenization will nevertheless attribute entirely different IDs to these two words, and the model will therefore learn different meanings for them. This is unfortunate, as we would like the model to understand that these words are indeed related and that "dogs" is the plural form of "dog".

Another issue with this approach is that there are a lot of different words in a language. If we want our model to understand all possible sentences in that language, then we will need an ID for each different word, and the total number of words, also known as the vocabulary size, can quickly become very large. This is an issue because each ID is mapped to a large vector that represents the word's meaning, and keeping track of these mappings requires an enormous number of weights when the vocabulary size is large.

If we want our models to stay lean, we can opt for our tokenizer to ignore certain words that we don't necessarily need. For example, when training our tokenizer on a text, we might want to take the 10,000 most frequent words in that text to create our base vocabulary, instead of taking all of that language's words. The tokenizer will know how to convert those 10,000 words into numbers, but any other word will be converted to the out-of-vocabulary word, or the "unknown" word.

This can rapidly become an issue: the model will have the exact same representation for all words that it doesn't know, which will result in a lot of lost information.
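Below is a minimal sketch of how such a word-based tokenizer could work, assuming whitespace/punctuation splitting, a frequency-limited vocabulary, and an "[UNK]" token for out-of-vocabulary words. The function names and the "[UNK]" label are illustrative choices for this example, not the API of any particular library.

```python
import re
from collections import Counter

def word_tokenize(text):
    # Split raw text on whitespace and treat each punctuation mark as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def build_vocab(corpus, max_size=10_000, unk_token="[UNK]"):
    # Count word frequencies over the training corpus.
    counts = Counter(word for text in corpus for word in word_tokenize(text))
    # Keep only the most frequent words; everything else will map to the unknown token.
    vocab = {unk_token: 0}
    for word, _ in counts.most_common(max_size):
        vocab[word] = len(vocab)
    return vocab

def encode(text, vocab, unk_token="[UNK]"):
    # Each known word gets its own ID; every unseen word collapses to the same [UNK] ID.
    unk_id = vocab[unk_token]
    return [vocab.get(word, unk_id) for word in word_tokenize(text)]

corpus = ["Let's do tokenization!", "The dog plays with the dogs."]
vocab = build_vocab(corpus, max_size=10_000)

print(encode("Let's do tokenization!", vocab))    # one ID per word/punctuation token
print(vocab["dog"], vocab["dogs"])                # related words still get unrelated IDs
print(encode("A completely unseen sentence", vocab))  # unknown words all share the [UNK] ID
```

The last two prints illustrate the two limitations from the transcript: "dog" and "dogs" receive unrelated IDs even though their meanings are close, and every word outside the 10,000-word vocabulary is mapped to the single "[UNK]" ID, losing the information it carried.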
