016 Subword-based Tokenizers (transcript)

Let's take a look at subword-based tokenization. Understanding why subword-based tokenization is interesting requires understanding the flaws of word-based and character-based tokenization. If you haven't seen the first videos on word-based and character-based tokenization, we recommend you check them out before watching this one.

Subword tokenization lies in between character-based and word-based tokenization algorithms. The idea is to find a middle ground between the drawbacks of word-based tokenizers (very large vocabularies, a large quantity of out-of-vocabulary tokens, and loss of meaning across very similar words) and those of character-based tokenizers (very long sequences and less meaningful individual tokens).

These algorithms rely on the following principle: frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords.

An example is the word "dog": we would like our tokenizer to have a single ID for the word "dog", rather than splitting it into the characters d, o, and g. However, when encountering the word "dogs", we would like our tokenizer to understand that at the root this is still the word "dog", with an added "s" that slightly changes the meaning while keeping the original idea. Another example is a complex word like "tokenization", which can be split into meaningful subwords. The root of the word is "token", and "ization" completes the root to give it a slightly different meaning. It makes sense to split the word into two: "token" as the root of the word (labeled as the "start" of the word), and "ization" as additional information (labeled as a "completion" of the word).

In turn, the model will now be able to make sense of "token" in different situations. It will understand that the words "token", "tokens", "tokenizing", and "tokenization" are linked and have a similar meaning. It will also understand that "tokenization", "modernization", and "immunization", which all have the same suffix, are probably used in the same syntactic situations.

Subword-based tokenizers generally have a way to identify which tokens are starts of words and which tokens complete them: "token" as the start of a word, "##ization" as completing a word. Here the ## prefix indicates that "ization" is part of a word rather than the beginning of one. The ## convention comes from the BERT tokenizer, which is based on the WordPiece algorithm.
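As a quick illustration, here is a minimal sketch using the Hugging Face Transformers AutoTokenizer, assuming the transformers library is installed and taking bert-base-uncased as one example of a checkpoint whose tokenizer uses WordPiece; the exact splits depend on the vocabulary that particular tokenizer was trained with.

```python
# Minimal sketch of WordPiece-style subword tokenization
# (assumes the transformers library is installed; bert-base-uncased is
# just one example of a checkpoint whose tokenizer uses WordPiece).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A frequent word is kept as a single token.
print(tokenizer.tokenize("dog"))            # e.g. ['dog']

# A rarer word is decomposed into a word-start piece and a
# "##"-prefixed completion piece.
print(tokenizer.tokenize("tokenization"))   # e.g. ['token', '##ization']
```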
Other tokenizers use other prefixes, which can be placed to indicate parts of words, as seen here, or starts of words instead.

There are a lot of different algorithms that can be used for subword tokenization, and most models obtaining state-of-the-art results in English today use some kind of subword tokenization algorithm. These approaches help reduce vocabulary sizes by sharing information across different words and by having prefixes and suffixes understood as such. They keep meaning across very similar words by recognizing the similar tokens that make them up.
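To show how these markers carry word-boundary information, here is a small self-contained sketch (no particular library assumed, and the input tokens are hypothetical) that regroups WordPiece-style tokens into words by looking at the ## prefix; SentencePiece-based tokenizers work the other way around and mark word starts, typically with a ▁ character.

```python
# Sketch: recover words from WordPiece-style tokens.
# A token starting with "##" continues the previous word;
# any other token starts a new word.
def group_subwords(tokens):
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]   # strip the marker, glue onto the current word
        else:
            words.append(token)      # a new word begins
    return words

# Hypothetical WordPiece-style output, for illustration only:
print(group_subwords(["token", "##ization", "is", "useful"]))
# -> ['tokenization', 'is', 'useful']
```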
