Subtitles for 006: Transformer Encoder Models (Transformer编码器模型)

In this video, we'll study the encoder architecture. An example of a popular encoder-only architecture is BERT, the most popular model of its kind.

Let's first start by understanding how it works. We'll use a small example with three words. We use these as inputs and pass them through the encoder, and we retrieve a numerical representation of each word. Here, for example, the encoder converts the three words "Welcome to NYC" into three sequences of numbers. The encoder outputs exactly one sequence of numbers per input word. This numerical representation can also be called a "feature vector" or "feature tensor".

Let's dive into this representation. It contains one vector per word that was passed through the encoder. Each of these vectors is a numerical representation of the word in question. The dimension of that vector is defined by the architecture of the model; for the base BERT model, it is 768. These representations contain the value of a word, but contextualized. For example, the vector attributed to the word "to" isn't the representation of only the word "to". It also takes into account the words around it, which we call the "context": the encoder looks at the left context, the word on the left of the one we're studying (here "Welcome"), and the context on the right (here "NYC"), and outputs a value for the word within its context. It is therefore a contextualized value. One could say that the vector of 768 values holds the "meaning" of that word in the text. The encoder does this thanks to the self-attention mechanism.

The self-attention mechanism relates different positions (or different words) in a single sequence in order to compute a representation of that sequence. As we've seen before, this means that the resulting representation of a word has been affected by the other words in the sequence.

We won't dive into the specifics here, but we'll offer some further reading if you want to get a better understanding of what happens under the hood. So when should one use an encoder? Encoders can be used as standalone models in a wide variety of tasks. For example, BERT, arguably the most famous Transformer model, is a standalone encoder model, and at the time of its release it beat the state of the art in many sequence classification tasks, question answering tasks, and masked language modeling, to name only a few.
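As a small aside, here is a minimal sketch of how one might retrieve such feature vectors with the Hugging Face `transformers` library. The checkpoint name `bert-base-uncased` is an assumption for illustration, and note that in practice the tokenizer splits the input into subword tokens and adds special tokens, so the number of vectors can differ from the number of words.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint for illustration: the base (uncased) BERT encoder.
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

# Tokenize the example sentence and run it through the encoder.
inputs = tokenizer("Welcome to NYC", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextualized feature vector per token, each with 768 values
# for the base BERT model: shape (batch_size, sequence_length, 768).
print(outputs.last_hidden_state.shape)
```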
The idea is that encoders are very powerful at extracting vectors that carry meaningful information about a sequence. These vectors can then be handled down the road by additional layers of neurons to make sense of them.

Let's take a look at some examples where encoders really shine.

First of all, Masked Language Modeling, or MLM. It's the task of predicting a hidden word in a sequence of words. Here, for example, we have hidden the word between "My" and "is". This is one of the objectives with which BERT was trained: it was trained to predict hidden words in a sequence. Encoders shine in this scenario in particular, as bidirectional information is crucial here. If we didn't have the words on the right ("is", "Sylvain", and the period), there would be very little chance that BERT could have identified "name" as the correct word. The encoder needs a good understanding of the sequence in order to predict a masked word: even if the text is grammatically correct, it does not necessarily make sense in the context of the sequence.

As mentioned earlier, encoders are good at sequence classification. Sentiment analysis is an example of a sequence classification task. The model's aim is to identify the sentiment of a sequence; it can range from giving a sequence a rating from one to five stars when doing review analysis, to giving a positive or negative rating to a sequence, which is what is shown here. For example, given the two sequences, we use the model to compute a prediction and to classify the sequences into these two classes: positive and negative. While the two sequences are very similar, containing the same words, the meaning is different, and the encoder model is able to grasp that difference.
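To make these two examples concrete, here is a small sketch using the `pipeline` API from the `transformers` library. The masked-language-modeling example reuses the `bert-base-uncased` checkpoint assumed above; the sentiment-analysis pipeline falls back to its default fine-tuned encoder checkpoint when no model is specified; and the two sentiment sentences are assumptions chosen to mirror the "same words, different meaning" example in the video.

```python
from transformers import pipeline

# Masked language modeling: BERT predicts the hidden word using
# context on both sides of the [MASK] token.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for prediction in unmasker("My [MASK] is Sylvain.", top_k=3):
    print(prediction["token_str"], round(prediction["score"], 3))

# Sequence classification: sentiment analysis with the pipeline's
# default fine-tuned encoder checkpoint. The two sentences below are
# illustrative: they contain the same words but opposite meanings.
classifier = pipeline("sentiment-analysis")
results = classifier([
    "I love this, it is not bad at all.",
    "I do not love this, it is bad.",
])
for result in results:
    print(result["label"], round(result["score"], 3))
```

With the bidirectional context available, the top predictions for the masked slot should typically include "name", which is exactly the behavior described above.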
