017: The Tokenization Pipeline (video transcript)

The tokenizer pipeline. In this video, we'll look at how a tokenizer converts raw text to numbers that a Transformer model can make sense of, like when we execute this code. Here is a quick overview of what happens inside the tokenizer object: first, the text is split into tokens, which are words, parts of words, or punctuation symbols. Then the tokenizer adds any special tokens and converts each token to its unique ID as defined by the tokenizer's vocabulary. As we'll see, it doesn't actually happen in this order, but viewing it like this is better for understanding what happens.

The first step is to split our input text into tokens with the tokenize method. To do this, the tokenizer may first perform some operations, like lowercasing all words, then follow a set of rules to split the result into small chunks of text. Most Transformer models use a subword tokenization algorithm, which means that a given word can be split into several tokens, like "tokenize" here. Look at the "Tokenization algorithms" videos linked below for more information! The ## prefix we see in front of "ize" is the convention used by BERT to indicate that this token is not the beginning of a word. Other tokenizers may use different conventions, however: for instance, ALBERT tokenizers will add a long underscore in front of all the tokens that had a space before them, which is the convention used by SentencePiece tokenizers.

The second step of the tokenization pipeline is to map those tokens to their respective IDs as defined by the vocabulary of the tokenizer. This is why we need to download a file when we instantiate a tokenizer with the from_pretrained method: we have to make sure we use the same mapping as when the model was pretrained. To do this, we use the convert_tokens_to_ids method.

You may have noticed that we don't have the exact same result as in our first slide (or, since this looks like a list of random numbers, allow me to refresh your memory): we had a number at the beginning and a number at the end that are now missing; those are the special tokens. The special tokens are added by the prepare_for_model method, which knows the indices of those tokens in the vocabulary and just adds the proper numbers. You can look at the special tokens (and, more generally, at how the tokenizer has changed your text) by using the decode method on the outputs of the tokenizer object.
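
As a recap of those intermediate steps, here is a minimal sketch that runs the pipeline by hand. The bert-base-uncased checkpoint and the example sentence are assumptions (the video doesn't name them in the transcript), and the commented outputs are indicative:

    from transformers import AutoTokenizer

    # Assumed checkpoint; any BERT-style tokenizer behaves similarly.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    text = "Let's try to tokenize!"

    # Step 1: split the raw text into tokens.
    tokens = tokenizer.tokenize(text)
    print(tokens)
    # e.g. ['let', "'", 's', 'try', 'to', 'token', '##ize', '!']

    # Step 2: map each token to its ID in the tokenizer's vocabulary.
    ids = tokenizer.convert_tokens_to_ids(tokens)

    # Step 3: add the special tokens ([CLS] and [SEP] for BERT).
    final = tokenizer.prepare_for_model(ids)
    print(final["input_ids"])

    # decode shows how the tokenizer changed the text, special tokens included.
    print(tokenizer.decode(final["input_ids"]))
    # e.g. "[CLS] let's try to tokenize! [SEP]"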
As for the prefix marking the beginning of words versus parts of words, those special tokens vary depending on which tokenizer you are using: the BERT tokenizer uses [CLS] and [SEP], but the RoBERTa tokenizer uses the HTML-like anchors <s> and </s>. Now that you know how the tokenizer works, you can forget all those intermediary methods and just remember that you only have to call it on your input texts. The inputs don't contain only the input IDs, however: to learn what the attention mask is, check out the "Batch inputs together" video. To learn about token type IDs, look at the "Process pairs of sentences" video.
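
As a sketch of that final shortcut, here is what calling the tokenizer directly looks like, with roberta-base added (an assumed checkpoint) to show the differing special tokens; the commented outputs are indicative:

    from transformers import AutoTokenizer

    bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    roberta_tokenizer = AutoTokenizer.from_pretrained("roberta-base")

    # One call runs the whole pipeline: tokenization, ID mapping, special tokens.
    inputs = bert_tokenizer("Let's try to tokenize!")
    print(inputs)
    # e.g. {'input_ids': [...], 'token_type_ids': [...], 'attention_mask': [...]}

    # The same call with RoBERTa wraps the text in <s> ... </s> instead.
    ids = roberta_tokenizer("Let's try to tokenize!")["input_ids"]
    print(roberta_tokenizer.decode(ids))
    # e.g. "<s>Let's try to tokenize!</s>"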
