In this video, we'll study the decoder architecture. An example of a popular decoder-only architecture is GPT-2. In order to understand how decoders work, we recommend taking a look at the video on encoders: they're extremely similar to decoders. One can use a decoder for most of the same tasks as an encoder, albeit with, generally, a little loss of performance.

Let's take the same approach we took with the encoder to try and understand the architectural differences between an encoder and a decoder. We'll use a small example, using three words. We pass them through the decoder and retrieve a numerical representation of each word. Here, for example, the decoder converts the three words "Welcome to NYC" into three sequences of numbers. The decoder outputs exactly one sequence of numbers per input word. This numerical representation can also be called a "feature vector" or "feature tensor". Let's dive into this representation. It contains one vector per word that was passed through the decoder. Each of these vectors is a numerical representation of the word in question. The dimension of that vector is defined by the architecture of the model.

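To make this concrete, here is a minimal sketch of that step, assuming the Hugging Face transformers library and the "gpt2" checkpoint: we pass "Welcome to NYC" through the decoder without any task head and read back one feature vector per token (the exact token count depends on the tokenizer).

```python
import torch
from transformers import AutoModel, AutoTokenizer

# A minimal sketch: retrieve one feature vector per token from a decoder (GPT-2).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

inputs = tokenizer("Welcome to NYC", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Shape: (batch_size, number_of_tokens, hidden_size); GPT-2's hidden size is 768.
print(outputs.last_hidden_state.shape)
```
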
Where the decoder differs from the encoder is principally in its self-attention mechanism: it uses what is called "masked self-attention". Here, for example, if we focus on the word "to", we'll see that its vector is absolutely unmodified by the word "NYC". That's because all the words on the right of the word (also known as the right context) are masked. Rather than benefitting from all the words on the left and right, i.e., the bidirectional context, decoders only have access to the words on their left. The masked self-attention mechanism differs from the self-attention mechanism by using an additional mask to hide the context on either side of the word: the word's numerical representation will not be affected by the words in the hidden context.

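As a rough illustration of the idea (not GPT-2's actual implementation), here is a small sketch of how a causal mask hides the right context when attention scores are turned into weights; the three positions stand in for "Welcome", "to" and "NYC", and the score values are made up.

```python
import torch

# A rough sketch of masked (causal) self-attention weights for three positions.
seq_len = 3  # "Welcome", "to", "NYC"
scores = torch.randn(seq_len, seq_len)  # raw attention scores (made-up values)

# Lower-triangular mask: position i may only attend to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
scores = scores.masked_fill(~causal_mask, float("-inf"))

weights = scores.softmax(dim=-1)
print(weights)  # the upper triangle is 0: the vector for "to" ignores "NYC"
```
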
So when should one use a decoder? Decoders, like encoders, can be used as standalone models. As they generate a numerical representation, they can also be used in a wide variety of tasks. However, the strength of a decoder lies in the way a word has access to its left context. Decoders, having access only to their left context, are inherently good at text generation: the ability to generate a word, or a sequence of words, given a known sequence of words. In NLP, this is known as causal language modeling.

Let's look at an example of how causal language modeling works: we start with an initial word, "My", and use it as input for the decoder. The model outputs a vector of dimension 768. This vector contains information about the sequence, which here is a single word. We apply a small transformation to that vector so that it maps to all the words known by the model (a mapping which we'll see later, called the language modeling head). We identify that the model believes the most probable following word is "name". We then take that new word and add it to the initial sequence: from "My", we are now at "My name". This is where the "autoregressive" aspect comes in: autoregressive models reuse their past outputs as inputs in the following steps.

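Here is a minimal sketch of that single step, again assuming the transformers library and the "gpt2" checkpoint: the language modeling head maps the 768-dimensional vector to a score for every word in the vocabulary, and we pick the most probable one (the exact word greedy decoding picks may differ from the video's example).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# A minimal sketch of one causal language modeling step with GPT-2.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("My", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch_size, sequence_length, vocab_size)

# The language modeling head gives a score per vocabulary word; take the most probable one.
next_token_id = int(logits[0, -1].argmax())
print(tokenizer.decode([next_token_id]))
```
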
Once again, we perform the exact same operation: we pass that sequence through the decoder and retrieve the most probable following word. In this case, it is the word "is". We repeat the operation until we're satisfied. Starting from a single word, we've now generated a full sentence. We decide to stop there, but we could continue for a while; GPT-2, for example, has a maximum context size of 1024 tokens. We could eventually generate up to 1024 words, and the decoder would still have some memory of the first words of the sequence!

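Here is a minimal sketch of that autoregressive loop with the same GPT-2 checkpoint: each predicted token is appended to the input and fed back in on the next step (the number of steps here is an arbitrary choice). In practice, the transformers library wraps this loop in model.generate().

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# A minimal sketch of greedy autoregressive generation: past outputs become new inputs.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("My", return_tensors="pt").input_ids
for _ in range(10):  # stop whenever we're satisfied (GPT-2 allows up to 1024 tokens)
    with torch.no_grad():
        logits = model(input_ids).logits
    next_token_id = logits[0, -1].argmax().reshape(1, 1)
    input_ids = torch.cat([input_ids, next_token_id], dim=1)  # the sequence grows by one token per step

print(tokenizer.decode(input_ids[0]))
```
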
If we go back several levels higher, back to the full transformer model, we can see what we've learned about the decoder part of the full transformer model. It is what we call autoregressive: it outputs values that are then used as its input values, and we repeat this operation as many times as we like. It is based on the masked self-attention layer, which gives word embeddings access to the context on the left side of the word. If you look at the diagram, however, you'll see that we haven't covered one of the aspects of the decoder: cross-attention. There is a second aspect we haven't seen, which is its ability to convert features to words; this is heavily linked to the cross-attention mechanism. However, these only apply in the "encoder-decoder" transformer, or the "sequence-to-sequence" transformer (the two terms can generally be used interchangeably). We recommend you check out the video on encoder-decoders to get an idea of how the decoder can be used as a component of a larger architecture!