Subtitles for 015 Character-Based Tokenizers

Before diving into character-based tokenization, understanding why this kind of tokenization is interesting requires understanding the flaws of word-based tokenization. If you haven't seen the first video, on word-based tokenization, we recommend you check it out before watching this one. Now, let's take a look at character-based tokenization.

We now split our text into individual characters, rather than words.

There are generally a lot of different words in languages, while the number of characters stays low. English, for example, has an estimated 170,000 different words, so we would need a very large vocabulary to encompass them all. With a character-based vocabulary, we can get by with only 256 characters!

Even languages with a lot of different characters, like Chinese, have dictionaries with around 20,000 different characters but more than 375,000 different words. Character-based vocabularies therefore let us use far fewer different tokens than the word-based tokenization dictionaries we would otherwise use.

These vocabularies are also more complete than their word-based counterparts. Since our vocabulary contains all the characters used in a language, even words unseen during tokenizer training can still be tokenized, so out-of-vocabulary tokens will be much less frequent. This includes the ability to correctly tokenize misspelled words, rather than discarding them as unknown straight away.

However, this algorithm isn't perfect either! Intuitively, an individual character does not hold as much information as a whole word does. For example, "Let's" holds more information than "l". Of course, this is not true for all languages: ideogram-based languages like Chinese pack a lot of information into single characters, but for languages written with the Latin alphabet, the model will have to make sense of multiple tokens at a time to recover the information held in a single word.

This leads to another issue with character-based tokenizers: their sequences are translated into a very large number of tokens for the model to process. This can have an impact on the size of the context the model carries around, and reduces the amount of text we can use as input for our model.

This tokenization, while it has some issues, has seen some very good results in the past and should be considered when approaching a new problem, as it solves some of the issues encountered in the word-based algorithm.
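To make the trade-offs above concrete, here is a minimal sketch in Python. It is not the Hugging Face Tokenizers API: `char_tokenize` and `byte_vocab` are hypothetical helpers invented for this example, and a naive whitespace split stands in for a real word-based tokenizer.

```python
# A minimal character-level tokenization sketch, for illustration only.
# The helpers below are made up for this example; they are not part of
# any real tokenization library.

def char_tokenize(text: str) -> list[str]:
    """Split text into individual characters (spaces and punctuation included)."""
    return list(text)

def byte_vocab() -> dict[int, int]:
    """A byte-level vocabulary: every possible byte value is a token id,
    so the vocabulary is capped at 256 entries regardless of the language."""
    return {b: b for b in range(256)}

text = "Let's do tokenization!"

word_tokens = text.split()      # naive whitespace split, for comparison
char_tokens = char_tokenize(text)

print(word_tokens)              # ["Let's", 'do', 'tokenization!'] -> 3 tokens
print(char_tokens[:6])          # ['L', 'e', 't', "'", 's', ' ']
print(len(byte_vocab()))        # 256

# Unseen or misspelled words pose no problem: every character is already
# in the vocabulary, so no unknown token is needed.
print(char_tokenize("tokenizaton"))

# The trade-off: the same input becomes many more tokens for the model.
print(len(word_tokens), "word tokens vs", len(char_tokens), "character tokens")  # 3 vs 22
```

Running this shows both sides of the argument from the video: the character-based vocabulary stays tiny and never produces unknown tokens, but the same sentence turns into roughly seven times as many tokens for the model to process.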
