In this video, we'll study the decoder architecture. An example of a popular decoder-only architecture is GPT-2. In order to understand how decoders work, we recommend taking a look at the video on encoders: they're extremely similar to decoders. One can use a decoder for most of the same tasks as an encoder, albeit with, generally, a little loss of performance.

Let's take the same approach we took with the encoder to try and understand the architectural differences between an encoder and a decoder. We'll use a small example, using three words. We pass them through the decoder and retrieve a numerical representation of each word. Here, for example, the decoder converts the three words "Welcome to NYC" into three sequences of numbers. The decoder outputs exactly one sequence of numbers per input word. This numerical representation can also be called a "feature vector" or "feature tensor". Let's dive into this representation. It contains one vector per word that was passed through the decoder. Each of these vectors is a numerical representation of the word in question. The dimension of that vector is defined by the architecture of the model.

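To make this concrete, here is a minimal sketch of that step, assuming the Hugging Face transformers library and the "gpt2" checkpoint: we pass "Welcome to NYC" through the decoder without any task head and read back one feature vector per token (the exact token count depends on the tokenizer).

```python
import torch
from transformers import AutoModel, AutoTokenizer

# A minimal sketch: retrieve one feature vector per token from a decoder (GPT-2).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

inputs = tokenizer("Welcome to NYC", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Shape: (batch_size, number_of_tokens, hidden_size); GPT-2's hidden size is 768.
print(outputs.last_hidden_state.shape)
```
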
Where the decoder differs from the encoder is principally in its self-attention mechanism: it uses what is called "masked self-attention". Here, for example, if we focus on the word "to", we'll see that its vector is absolutely unmodified by the word "NYC". That's because all the words on the right of the word (also known as the right context) are masked. Rather than benefitting from all the words on the left and right, i.e., the bidirectional context, decoders only have access to the words on their left. The masked self-attention mechanism differs from the self-attention mechanism by using an additional mask to hide the context on either side of the word: the word's numerical representation will not be affected by the words in the hidden context.

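As a rough illustration of the idea (not GPT-2's actual implementation), here is a small sketch of how a causal mask hides the right context when attention scores are turned into weights; the three positions stand in for "Welcome", "to" and "NYC", and the score values are made up.

```python
import torch

# A rough sketch of masked (causal) self-attention weights for three positions.
seq_len = 3  # "Welcome", "to", "NYC"
scores = torch.randn(seq_len, seq_len)  # raw attention scores (made-up values)

# Lower-triangular mask: position i may only attend to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
scores = scores.masked_fill(~causal_mask, float("-inf"))

weights = scores.softmax(dim=-1)
print(weights)  # the upper triangle is 0: the vector for "to" ignores "NYC"
```
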
So when should one use a decoder? Decoders, like encoders, can be used as standalone models. As they generate a numerical representation, they can also be used in a wide variety of tasks. However, the strength of a decoder lies in the way a word has access to its left context. Decoders, having access only to their left context, are inherently good at text generation: the ability to generate a word, or a sequence of words, given a known sequence of words. In NLP, this is known as causal language modeling.

Let's look at an example of how causal language modeling works: we start with an initial word, "My", and use it as input for the decoder. The model outputs a vector of dimension 768. This vector contains information about the sequence, which here is a single word. We apply a small transformation to that vector so that it maps to all the words known by the model (a mapping which we'll see later, called the language modeling head). We identify that the model believes the most probable following word is "name". We then take that new word and add it to the initial sequence: from "My", we are now at "My name". This is where the "autoregressive" aspect comes in: autoregressive models reuse their past outputs as inputs in the following steps.

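Here is a minimal sketch of that single step, again assuming the transformers library and the "gpt2" checkpoint: the language modeling head maps the 768-dimensional vector to a score for every word in the vocabulary, and we pick the most probable one (the exact word greedy decoding picks may differ from the video's example).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# A minimal sketch of one causal language modeling step with GPT-2.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("My", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch_size, sequence_length, vocab_size)

# The language modeling head gives a score per vocabulary word; take the most probable one.
next_token_id = int(logits[0, -1].argmax())
print(tokenizer.decode([next_token_id]))
```
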
Once again, we perform the exact same operation: we pass that sequence through the decoder and retrieve the most probable following word. In this case, it is the word "is". We repeat the operation until we're satisfied. Starting from a single word, we've now generated a full sentence. We decide to stop there, but we could continue for a while; GPT-2, for example, has a maximum context size of 1024 tokens. We could eventually generate up to 1024 words, and the decoder would still have some memory of the first words of the sequence!

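Here is a minimal sketch of that autoregressive loop with the same GPT-2 checkpoint: each predicted token is appended to the input and fed back in on the next step (the number of steps here is an arbitrary choice). In practice, the transformers library wraps this loop in model.generate().

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# A minimal sketch of greedy autoregressive generation: past outputs become new inputs.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("My", return_tensors="pt").input_ids
for _ in range(10):  # stop whenever we're satisfied (GPT-2 allows up to 1024 tokens)
    with torch.no_grad():
        logits = model(input_ids).logits
    next_token_id = logits[0, -1].argmax().reshape(1, 1)
    input_ids = torch.cat([input_ids, next_token_id], dim=1)  # the sequence grows by one token per step

print(tokenizer.decode(input_ids[0]))
```
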
If we go back several levels higher, back to the full transformer model, we can see what we've learned about the decoder part of the full transformer model. It is what we call autoregressive: it outputs values that are then used as its input values, and we repeat this operation as many times as we like. It is based on the masked self-attention layer, which gives word embeddings access to the context on the left side of the word. If you look at the diagram, however, you'll see that we haven't covered one of the aspects of the decoder: cross-attention. There is a second aspect we haven't seen, which is its ability to convert features to words; this is heavily linked to the cross-attention mechanism. However, these only apply in the "encoder-decoder" transformer, or the "sequence-to-sequence" transformer (the two terms can generally be used interchangeably). We recommend you check out the video on encoder-decoders to get an idea of how the decoder can be used as a component of a larger architecture!