All language subtitles for 007 Transformer Decoder Models

In this video, we'll study the decoder architecture. An example of a popular decoder-only architecture is GPT-2. To understand how decoders work, we recommend taking a look at the video on encoders: they're extremely similar to decoders. A decoder can be used for most of the same tasks as an encoder, albeit with, generally, a slight loss of performance. Let's take the same approach we took with the encoder to understand the architectural differences between an encoder and a decoder.

We'll use a small example with three words. We pass them through the decoder and retrieve a numerical representation of each word. Here, for example, the decoder converts the three words "Welcome to NYC" into three sequences of numbers. The decoder outputs exactly one sequence of numbers per input word. This numerical representation can also be called a "feature vector" or "feature tensor". Let's dive into this representation. It contains one vector per word that was passed through the decoder. Each of these vectors is a numerical representation of the word in question, and the dimension of that vector is defined by the architecture of the model.

Where the decoder differs from the encoder is principally in its self-attention mechanism: it uses what is called "masked self-attention". Here, for example, if we focus on the word "to", we'll see that its vector is absolutely unmodified by the word "NYC". That's because all the words on the right of a word (also known as its right context) are masked. Rather than benefiting from all the words on the left and right, i.e., the bidirectional context, decoders only have access to the words on their left. Masked self-attention differs from plain self-attention by adding a mask that hides part of the context; in the decoder, the context to the right of each word is hidden, so a word's numerical representation is not affected by the words in the hidden context.

So when should one use a decoder? Decoders, like encoders, can be used as standalone models. As they generate a numerical representation, they can be used for a wide variety of tasks. However, the strength of a decoder lies in the way a word only has access to its left context. Decoders, having access only to their left context, are inherently good at text generation: the ability to generate a word, or a sequence of words, given a known sequence of words. In NLP, this is known as causal language modeling.
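Here is a minimal code sketch (not from the video) that illustrates both points with the transformers library, assuming the small "gpt2" checkpoint: the decoder returns one feature vector per token, and because of masked self-attention the vectors of a prefix are unchanged when more text is appended on the right.

```python
# Minimal sketch (not from the video): one feature vector per token, and the
# causal property of masked self-attention, illustrated with GPT-2
# ("gpt2" is an assumed checkpoint choice).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2").eval()

with torch.no_grad():
    prefix = model(**tokenizer("Welcome to", return_tensors="pt")).last_hidden_state
    full = model(**tokenizer("Welcome to NYC", return_tensors="pt")).last_hidden_state

# One 768-dimensional vector per token for GPT-2 small; note that the tokenizer
# may split a word into several sub-word tokens, so this counts tokens, not words.
print(full.shape)  # e.g. torch.Size([1, 3, 768])

# Because the right context is masked, appending " NYC" leaves the vectors of
# the earlier positions (numerically) unchanged.
print(torch.allclose(prefix, full[:, : prefix.shape[1]], atol=1e-5))  # expected: True
```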
Let's look at an example of how causal language modeling works. We start with an initial word, "My", and use it as input for the decoder. The model outputs a vector of dimension 768. This vector contains information about the sequence, which is here a single word. We apply a small transformation to that vector so that it maps to all the words known by the model (a mapping called the language modeling head, which we'll see later). We then identify that the model believes the most probable following word is "name".

We take that new word and add it to the initial sequence: from "My", we are now at "My name". This is where the "auto-regressive" aspect comes in: auto-regressive models re-use their past outputs as inputs in the following steps. Once again, we perform the exact same operation: we pass that sequence through the decoder and retrieve the most probable following word, in this case "is". We repeat the operation until we're satisfied. Starting from a single word, we've now generated a full sentence. We decide to stop there, but we could continue for a while; GPT-2, for example, has a maximum context size of 1024 tokens. We could eventually generate up to 1024 tokens, and the decoder would still have some memory of the first words of the sequence!

If we go back up several levels, to the full transformer model, we can summarize what we've learned about its decoder part. It is what we call auto-regressive: it outputs values that are then used as its input values, and we can repeat this operation as many times as we like. It is based on the masked self-attention layer, which gives each word embedding access to the context on the left side of the word. If you look at the diagram, however, you'll see that there is one aspect of the decoder we haven't covered yet: cross-attention. There is also a second aspect we haven't seen, its ability to convert features into words, which is heavily linked to the cross-attention mechanism. However, these only apply to the "encoder-decoder" transformer, also called the "sequence-to-sequence" transformer (the two terms are generally used interchangeably). We recommend you check out the video on encoder-decoders to get an idea of how the decoder can be used as a component of a larger architecture!
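The auto-regressive loop described above can be sketched in a few lines with GPT-2 and its language modeling head. This is a minimal sketch under assumptions not made in the video: the "gpt2" checkpoint, greedy decoding (always taking the most probable token), and a fixed number of new tokens.

```python
# Minimal sketch of the auto-regressive loop: pass the sequence through the
# decoder, pick the most probable next token, append it, and repeat.
# Assumptions (not from the video): "gpt2" checkpoint, greedy decoding, 10 steps.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("My", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(10):                         # repeat until we're satisfied
        logits = model(input_ids).logits        # language modeling head: one score per vocabulary entry
        next_id = logits[:, -1].argmax(dim=-1)  # most probable following token
        input_ids = torch.cat([input_ids, next_id[:, None]], dim=-1)  # feed the output back in

print(tokenizer.decode(input_ids[0]))
```

In practice, model.generate(input_ids, max_new_tokens=10) performs the same greedy loop by default, and GPT-2's context window caps the total length at 1024 tokens.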
