1
00:00:04,160 --> 00:00:07,200
In this video, we'll study the
encoder-decoder architecture.
2
00:00:08,160 --> 00:00:16,160
An example of a popular encoder-decoder model is
T5. In order to understand how the encoder-decoder
3
00:00:16,160 --> 00:00:21,680
works, we recommend you check out the videos
on encoders and decoders as standalone models.
4
00:00:22,400 --> 00:00:30,320
Understanding how they behave individually will
help you understand how an encoder-decoder behaves.
5
00:00:30,320 --> 00:00:35,360
Let's start from what we've seen about the
encoder. The encoder takes words as inputs,
6
00:00:36,000 --> 00:00:40,640
passes them through its layers, and
retrieves a numerical representation
7
00:00:40,640 --> 00:00:47,360
for each word passed through it. We now know that
the numerical representation holds information
8
00:00:47,360 --> 00:00:54,000
about the meaning of the sequence. Let's put
this aside and add the decoder to the diagram.
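Here is a minimal sketch of the encoder step just described, assuming the t5-small checkpoint and the T5EncoderModel class from the transformers library (illustrative choices, not something mandated by the video):
```python
# Sketch: run only the encoder of T5 and look at the numerical
# representation it produces -- one vector per input token.
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small")

inputs = tokenizer("Welcome to NYC", return_tensors="pt")
outputs = encoder(**inputs)

# Shape: (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```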
9
00:00:56,480 --> 00:01:00,160
In this scenario, we're using the decoder
in a manner that we haven't seen before.
10
00:01:00,720 --> 00:01:07,600
We're passing the outputs of the encoder directly
to it! In addition to the encoder outputs,
11
00:01:07,600 --> 00:01:13,040
we also give the decoder a sequence. When
prompting the decoder for an output with no
12
00:01:13,040 --> 00:01:17,360
initial sequence, we can give it the value
that indicates the start of a sequence.
13
00:01:18,000 --> 00:01:23,520
And that's where the encoder-decoder magic
happens. The encoder accepts a sequence as input.
14
00:01:24,560 --> 00:01:30,480
It computes a prediction, and outputs a
numerical representation. Then, it sends
15
00:01:30,480 --> 00:01:38,000
that over to the decoder. It has, in a sense,
encoded the sequence. And the decoder, in turn,
16
00:01:38,000 --> 00:01:42,960
using this input alongside its usual sequence
input, will take a stab at decoding the sequence.
17
00:01:44,720 --> 00:01:50,400
The decoder decodes the sequence, and outputs a
word. As of now, we don't need to make sense of
18
00:01:50,400 --> 00:01:55,440
that word, but we can understand that the decoder
is essentially decoding what the encoder has
19
00:01:55,440 --> 00:02:02,160
output. The "start of sequence word" indicates
that it should start decoding the sequence.
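As a hedged sketch of this step, here is one way to hand the encoder's output and a start-of-sequence token to the decoder by hand (t5-small and the translation prefix are assumptions):
```python
# Sketch: give the decoder the encoder's output plus a start-of-sequence
# token, and read off the first word it predicts.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer("translate English to French: Welcome to NYC",
                   return_tensors="pt")
encoder_outputs = model.encoder(**inputs)  # the "encoded" sequence

# For T5, the start-of-sequence value is the pad token.
start = torch.tensor([[model.config.decoder_start_token_id]])
out = model(encoder_outputs=encoder_outputs,
            attention_mask=inputs.attention_mask,
            decoder_input_ids=start)

first_word_id = out.logits[:, -1].argmax(-1)
print(tokenizer.decode(first_word_id))
```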
20
00:02:03,600 --> 00:02:10,240
Now that we have both the feature vector and
an initial generated word, we don't need the
21
00:02:10,240 --> 00:02:17,760
encoder anymore. As we have seen before with the
decoder, it can act in an auto-regressive manner;
22
00:02:18,640 --> 00:02:24,960
the word it has just output can now be used
as an input. This, in combination with the
23
00:02:24,960 --> 00:02:30,800
numerical representation output by the encoder,
can now be used to generate a second word.
24
00:02:33,200 --> 00:02:38,880
Please note that the first word is still here, as
the model still outputs it. However, it is greyed
25
00:02:38,880 --> 00:02:45,120
out as we have no need for it anymore. We can
continue on and on, for example until the decoder
26
00:02:45,120 --> 00:02:50,720
outputs a value that we consider a "stopping
value", like a dot, meaning the end of a sequence.
27
00:02:53,440 --> 00:02:58,080
Here, we've seen the full mechanism of the
encoder-decoder transformer: let's go over it one
28
00:02:58,080 --> 00:03:05,120
more time. We have an initial sequence that is
sent to the encoder. That encoder output is then
29
00:03:05,120 --> 00:03:12,240
sent to the decoder, for it to be decoded. While
we can now discard the encoder after a single use,
30
00:03:12,240 --> 00:03:17,840
the decoder will be used several times, until
we have generated every word that we need.
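In the transformers library, this encode-once, decode-many-times loop is what a single generate() call performs; a hedged sketch, again assuming t5-small (it also previews the concrete translation example that follows):
```python
# Sketch: the same encode-once, decode-step-by-step procedure, wrapped in a
# single generate() call.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tokenizer("translate English to French: Welcome to NYC",
                   return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```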
31
00:03:20,000 --> 00:03:25,120
Let's see a concrete example with Translation
Language Modeling, also called transduction:
32
00:03:25,120 --> 00:03:30,800
the act of translating a sequence. Here, we would
like to translate this English sequence "Welcome
33
00:03:30,800 --> 00:03:38,400
to NYC" in French. We're using a transformer model
that is trained for that task explicitly. We use
34
00:03:38,400 --> 00:03:43,520
the encoder to create a representation
of the English sentence. We pass this
35
00:03:43,520 --> 00:03:48,880
to the decoder and, with the use of the start of
sequence word, we ask it to output the first word.
36
00:03:50,720 --> 00:03:52,960
It outputs Bienvenue, which means "Welcome".
37
00:03:55,280 --> 00:04:02,480
We then use "Bienvenue" as the input sequence for
the decoder. This, alongside the feature vector,
38
00:04:04,320 --> 00:04:08,480
allows the decoder to predict the second
word, "à", which is "to" in English.
39
00:04:10,160 --> 00:04:14,400
Finally, we ask the decoder to predict
a third word; it predicts "NYC",
40
00:04:14,400 --> 00:04:20,240
which is, once again, correct. We've translated
the sentence! Where the encoder-decoder really
41
00:04:20,240 --> 00:04:24,880
shines is that we have an encoder and a
decoder, which often do not share weights.
42
00:04:27,280 --> 00:04:31,440
We, therefore, have an entire block (the encoder)
that can be trained to understand the sequence,
43
00:04:31,440 --> 00:04:36,480
and extract the relevant information. For the
translation scenario we've seen earlier, for
44
00:04:36,480 --> 00:04:44,160
example, this would mean parsing and understanding
what was said in the English language, extracting
45
00:04:44,160 --> 00:04:49,040
information from that language, and putting
all of that into an information-dense vector.
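As a quick, hedged check that the encoder really is a separate block with its own weights (and the decoder another), assuming t5-small; note that T5 does share the token-embedding matrix between the two blocks:
```python
# Sketch: the encoder and the decoder are two distinct blocks, each with its
# own stack of layers and parameters.
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small")

print(type(model.encoder).__name__, type(model.decoder).__name__)
print("encoder parameters:", sum(p.numel() for p in model.encoder.parameters()))
print("decoder parameters:", sum(p.numel() for p in model.decoder.parameters()))
```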
46
00:04:50,880 --> 00:04:57,280
On the other hand, we have the decoder, whose
sole purpose is to decode the feature vector output by
47
00:04:57,280 --> 00:05:03,760
the encoder. This decoder can be specialized in
a completely different language, or even a different modality
48
00:05:03,760 --> 00:05:11,760
like images or speech. Encoder-decoders
are special for several reasons. Firstly,
49
00:05:11,760 --> 00:05:17,040
they're able to manage sequence-to-sequence
tasks, like the translation we have just seen.
50
00:05:18,640 --> 00:05:23,880
Secondly, the weights between the encoder and the
decoder parts are not necessarily shared. Let's
51
00:05:24,480 --> 00:05:31,200
take another example of translation. Here we're
translating "Transformers are powerful" into French.
52
00:05:32,240 --> 00:05:36,560
Firstly, this means that from a sequence
of three words, we're able to generate
53
00:05:36,560 --> 00:05:42,240
a sequence of four words. One could argue
that this could be handled with a decoder
54
00:05:42,240 --> 00:05:46,960
that would generate the translation in an
auto-regressive manner, and they would be right!
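A hedged illustration of that length mismatch, assuming t5-small (the exact French wording may vary between checkpoints):
```python
# Sketch: the input and output sequences need not have the same length.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tokenizer("translate English to French: Transformers are powerful",
                   return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
translation = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(translation)
print("input words:", len("Transformers are powerful".split()))
print("output words:", len(translation.split()))
```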
55
00:05:49,840 --> 00:05:53,840
Another example of where sequence-to-sequence
transformers shine is in summarization.
56
00:05:54,640 --> 00:05:58,560
Here we have a very long
sequence, generally a full text,
57
00:05:58,560 --> 00:06:03,840
and we want to summarize it. Since the
encoder and decoder are separate,
58
00:06:03,840 --> 00:06:08,880
we can have different context lengths (for
example, a very long context for the encoder, which
59
00:06:08,880 --> 00:06:13,840
handles the text, and a smaller context for the
decoder, which handles the summarized sequence).
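A minimal sketch of summarization as a sequence-to-sequence task, through the high-level pipeline API; the checkpoint name is an assumption:
```python
# Sketch: a long input text goes in, a much shorter summary comes out.
from transformers import pipeline

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

long_text = (
    "Encoder-decoder transformers encode a long input text into a numerical "
    "representation, which the decoder then turns into a much shorter text. "
) * 10  # stand-in for a full document

print(summarizer(long_text, max_length=40, min_length=10)[0]["summary_text"])
```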
60
00:06:16,240 --> 00:06:20,480
There are a lot of sequence-to-sequence
models. This slide shows a few examples of
61
00:06:20,480 --> 00:06:24,160
popular encoder-decoder models
available in the transformers library.
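For instance, several well-known encoder-decoder checkpoints can all be loaded through the same Auto class; this short list is illustrative, not the exact one shown in the video:
```python
# Sketch: a few popular encoder-decoder checkpoints, all loadable the same way.
from transformers import AutoModelForSeq2SeqLM

for name in ["t5-small", "facebook/bart-base", "Helsinki-NLP/opus-mt-en-fr"]:
    model = AutoModelForSeq2SeqLM.from_pretrained(name)
    print(name, "->", model.__class__.__name__)
```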
62
00:06:26,320 --> 00:06:31,200
Additionally, you can load an encoder
and a decoder inside an encoder-decoder
63
00:06:31,200 --> 00:06:35,040
model! Therefore, depending on the
specific task you are targeting,
64
00:06:35,040 --> 00:06:40,240
you may choose to use specific encoders
and decoders, which have proven their worth
65
00:06:40,240 --> 00:06:49,850
on these specific tasks. This wraps things up
for encoder-decoders. Thanks for watching!
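As a hedged sketch of that last point, here is one way to combine two standalone checkpoints into a single encoder-decoder model (BERT as encoder and GPT-2 as decoder are assumptions, and the combined model would still need fine-tuning on the target task):
```python
# Sketch: assemble a custom encoder-decoder from two standalone checkpoints.
from transformers import EncoderDecoderModel

model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "gpt2"
)
# The encoder and decoder are now two separate sub-modules of one model.
print(type(model.encoder).__name__, type(model.decoder).__name__)
```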