014 Word-based tokenizers

Let's take a look at word-based tokenization. Word-based tokenization is the idea of splitting the raw text into words, by splitting on spaces or other specific rules like punctuation. In this algorithm, each word has a specific number, an "ID", attributed to it. In this example, "Let's" has the ID 250, "do" has the ID 861, and "tokenization" followed by an exclamation point has the ID 345.

This approach is interesting, as the model has representations that are based on entire words. The information held in a single number is high, as a word carries a lot of contextual and semantic information in a sentence.

However, this approach does have its limits. For example, the word "dog" and the word "dogs" are very similar, and their meanings are close. Word-based tokenization will nevertheless attribute entirely different IDs to these two words, and the model will therefore learn different meanings for them. This is unfortunate, as we would like the model to understand that these words are indeed related and that "dogs" is the plural form of "dog".

Another issue with this approach is that there are a lot of different words in a language. If we want our model to understand all possible sentences in that language, then we will need an ID for each different word, and the total number of words, also known as the vocabulary size, can quickly become very large. This is an issue because each ID is mapped to a large vector that represents the word's meaning, and keeping track of these mappings requires an enormous number of weights when the vocabulary size is large.

If we want our models to stay lean, we can opt for our tokenizer to ignore certain words that we don't necessarily need. For example, when training our tokenizer on a text, we might want to take the 10,000 most frequent words in that text to create our base vocabulary, instead of taking all of that language's words. The tokenizer will know how to convert those 10,000 words into numbers, but any other word will be converted to the out-of-vocabulary word, or the "unknown" word.

This can rapidly become an issue: the model will have the exact same representation for all words that it doesn't know, which will result in a lot of lost information.
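Below is a minimal sketch of how such a word-based tokenizer could work, assuming whitespace/punctuation splitting, a frequency-limited vocabulary, and an "[UNK]" token for out-of-vocabulary words. The function names and the "[UNK]" label are illustrative choices for this example, not the API of any particular library.

```python
import re
from collections import Counter

def word_tokenize(text):
    # Split raw text on whitespace and treat each punctuation mark as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def build_vocab(corpus, max_size=10_000, unk_token="[UNK]"):
    # Count word frequencies over the training corpus.
    counts = Counter(word for text in corpus for word in word_tokenize(text))
    # Keep only the most frequent words; everything else will map to the unknown token.
    vocab = {unk_token: 0}
    for word, _ in counts.most_common(max_size):
        vocab[word] = len(vocab)
    return vocab

def encode(text, vocab, unk_token="[UNK]"):
    # Each known word gets its own ID; every unseen word collapses to the same [UNK] ID.
    unk_id = vocab[unk_token]
    return [vocab.get(word, unk_id) for word in word_tokenize(text)]

corpus = ["Let's do tokenization!", "The dog plays with the dogs."]
vocab = build_vocab(corpus, max_size=10_000)

print(encode("Let's do tokenization!", vocab))    # one ID per word/punctuation token
print(vocab["dog"], vocab["dogs"])                # related words still get unrelated IDs
print(encode("A completely unseen sentence", vocab))  # unknown words all share the [UNK] ID
```

The last two prints illustrate the two limitations from the transcript: "dog" and "dogs" receive unrelated IDs even though their meanings are close, and every word outside the 10,000-word vocabulary is mapped to the single "[UNK]" ID, losing the information it carried.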
