Applying the Cutting Edge of Object Detection to Medical Imaging (transcribed talk)
So nice to meet you, everybody. My name is Dan Basak, and I'm the head of AI at Aidoc, where we use object detection and image segmentation to detect urgent medical abnormalities in different medical imaging modalities, such as CT scans. It's nice to see some familiar faces here in the crowd; thank you all for coming. Today I want to share with you, and talk through together with you, three of the most interesting papers in object detection from the last year or so, which have done really amazing things that are highly relevant to the challenges of medical imaging data.

First of all, a little bit about object detection. Object detection is progressing really quickly. Just between 2016 and 2017, on the living benchmark, the COCO competition, accuracy rose by about 20% relative between the years. The bottom row is the best submission of 2016, and these four are the submissions from the 2017 challenge, which ended at the close of 2017. Not only is it a mature and quickly advancing technology, I think it is also a very transformational technology. It has the potential, and I really believe that it will, to transform any and every industry: medical imaging, defense and business intelligence, robotics and autonomous vehicles, even augmented reality, and many more industries besides.
The reason I really like these meetups and talks is that object detection, and deep learning in general, has a lot of potential, but we won't be able to fulfill that potential fast enough if we don't have a lot of people who really understand the field. The problem is that we are building a mountain of papers that are really hard to read and really hard to get into. It takes hours to really understand them, especially if you want to dive into the little details that matter when you actually implement something, and especially if it's on new data.

I was really inspired by an article on Distill called "Research Debt", which talks about exactly that, and I really recommend you read it. It was written by Chris Olah and Shan Carter, two research scientists from Google. What they are saying, and I really agree with them, is that we can keep making the mountain bigger, as long as we build staircases and elevators that enable everyone to climb it together with us. Because if we don't have enough engineers who understand the state of the art, we won't be able to create applicable solutions fast enough.

I've personally invested hundreds of hours in learning this field, and I still have a lot more to learn. After investing all that time, my conclusion is this: deep learning is advanced, it's mind-blowing, it's creative, and you need to dive in and learn it seriously in order to understand it. But it's not rocket science. (By the way, I think even rocket science is not really rocket science.) My question is: what can we do to reduce the time required for the next people to join this field by a factor of 10, compared to the time it took me? I think it should be a community effort, and we should spend more time trying to make these things more explainable and easier to understand, using the right explanations and the right visualizations. And that's it.
So now let's dive in. First, the structure of this talk: it's going to be about an hour and a half. First I will talk about the challenges in medical imaging data, and after that I will dive into, depending on how much time we have, two to three of the most advanced papers: Deformable Convolutional Networks, Feature Pyramid Networks, and Focal Loss. I guess most of you have heard these names if you're in the field. These three papers address most of the challenges that I'm going to present now, and they do it in a really nice way. I'm going to explain them in a way that is relevant to the medical imaging domain, including the unique details needed to apply them there, but it will also be useful for anyone who wants to understand these concepts and take them to other fields as well. So let's start with the challenges of medical imaging data.
By the way, can everyone hear me well? I don't think I asked. OK, if someone can't hear me, just say so. The bottom of the screen? Well, I'm not sure I can solve that, but you're welcome to come closer.
The first challenge is extreme class imbalance: objects are very small and rare compared to the number of images and the image sizes. What I mean by that: you can see here, this arrow is the detection made by one of our algorithms in a brain CT scan, a very urgent finding in the brain of relatively medium size. This is actually relatively large compared to many of the findings we are required to detect, and it's relatively obvious. I put it here because even this is not considered large in terms of classic object detection, and I wanted you to understand what I'm talking about and really see it.

We're talking about findings that are sometimes smaller than 10 by 10 pixels, and they are found in images which, for us, are 3D rather than 2D. So a finding can be 10 by 10 pixels over a few slices, inside a scan which is 100 slices, 100 2D images, or even more. The finding is a very small part of the brain, and most of the scans are of healthy brains, or healthy spines, or whatever. So the interesting data, the data that we want to detect, is very rare, and that's a very big challenge.
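To put rough numbers on this imbalance, here is a back-of-envelope sketch. The sizes are hypothetical, chosen only to match the orders of magnitude mentioned above (a roughly 10 by 10 pixel finding over a few slices inside a multi-slice scan); they are not figures from the talk.

```python
# Hypothetical sizes: a 10x10 finding spanning 4 slices,
# inside a 512x512 scan with 100 slices.
finding_voxels = 10 * 10 * 4
scan_voxels = 512 * 512 * 100

# Fraction of the volume occupied by the finding.
foreground_ratio = finding_voxels / scan_voxels
print(f"{foreground_ratio:.2e}")  # 1.53e-05
```

With only about one voxel in 65,000 belonging to a finding, and most scans containing no finding at all, a naive model that predicts "healthy" everywhere already looks extremely accurate, which is exactly why this imbalance is a challenge.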
The second challenge is that the objects, and also our background, which is anatomical structures, are in my opinion relatively much less well structured: much more deformable and less rigid. Take an example (of course we can find hard examples in the classical datasets as well): if you look at a wheel, it is bounded pretty well by a square bounding box, which is the classic use case of object detection. But here is a part of a brain, and you can see these pixels that I highlighted in yellow (it's not their original color); this is a single finding in the brain. If I put the tightest bounding box that I can around this finding, it will still contain a lot of uninteresting pixels, and then when I extract the features for this bounding box, most of the signal will come from the background rather than from the interesting pixels. And that's a consequence of the shapes of the objects being more deformable and less rigid.
Another challenge is that the images are 3D and they're large. There is a big difference between images of cats and CAT scans, and you can see the difference in sizes: the inputs can be something like 30 times larger, while the objects are nominally ten times smaller. This is a challenge in terms of the computation power that is required, the time it takes to converge, and the memory footprint of your networks, and in how you find a good compromise in the design of your model and in what input to use.
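A quick back-of-envelope comparison of input sizes. The specific resolutions below are my own illustrative assumptions, not numbers from the talk, but they land in the same ballpark as the "30 times larger" figure.

```python
# A typical 2D natural image vs. a 3D CT volume, in raw element counts.
photo_pixels = 640 * 480 * 3   # RGB photo of a cat
ct_voxels = 512 * 512 * 100    # 100-slice CT scan, single channel

ratio = ct_voxels / photo_pixels
print(round(ratio, 1))  # 28.4
```

That factor feeds directly into GPU memory and training time, before even accounting for the fact that the objects to find are themselves roughly ten times smaller.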
The last challenge that I will talk about is that when a radiologist analyzes a CT scan, he doesn't just look at the current CT scan. He looks at many more types of data: the demographics of the patient, his age, the referral letter of the doctor who sent him to this scan, his past scans, and the reports that were written on those scans. And radiologists not only do this, they are obligated to do it by regulation, and for a reason: not all the information is in the image, and if you don't look at the past, you can't really diagnose some of the cases. So how do you combine all the different types of data: visual, text, and structured data?
OK, so now I want to dive into, as I said, two to three papers. The first paper I want to start with is Deformable Convolutional Networks. That's the paper I chose because, in my experience, a lot of people are somewhat afraid of this paper and find it very hard to get into, and I think it's not that difficult if it's explained correctly.

By the way, this paper is from the Microsoft Research Asia group, and for the last two years, both in 2016 and 2017, it has been a significant component of one of the top three entries to the COCO object detection competition. So it gives a very significant boost to performance, and it comes from Microsoft Research Asia, which in my opinion is one of the top object detection groups in the world and has made some of the top contributions of recent years.

The motivation for this paper is that neural networks, and the popular mechanisms we use with convolutional neural networks such as data augmentation, know how to deal pretty well with simple transformations such as translation and rotation. But non-rigid transformations, like a change of pose or viewpoint, or just the object appearing in a less clear, irregular form, are much more challenging for neural networks to deal with. So how can we answer that challenge?
The solution is to give the network a dynamic capability to control the receptive field of the convolution. Instead of using the traditional convolution, which samples the image with a square grid, why not sample the image in any shape that we want? And why not have that shape adapt to the input, to our images and objects? We haven't yet discussed why this should solve the problem we are talking about, but we will see it in a minute.

But we don't only want to implement this solution and hope it works well. In order for it to be applicable in industry, we want to implement something that is easy to train, and preferably end-to-end: we don't want to train several different components and then combine them, because that creates a very cumbersome research process. We don't want to increase the model complexity, its run time, its training time, or its number of parameters too much. And we don't want to increase the code complexity: if it's a convolution, then when I define the model architecture I want to write deformable_conv2d and pass the parameters into it, just like I write conv2d today. Finally, I want it to be proven on challenging tasks rather than just toy datasets. Earlier works that tried to deal with similar problems, such as Spatial Transformer Networks, made major scientific contributions, but the paper only demonstrated them on toy datasets, and many people who tried to apply them to real-world datasets found it very hard to make them converge at all. So these are our requirements from this solution.
Now let's talk about the components of this solution. There are actually two components, and they can be combined with any object detection meta-architecture, like Faster R-CNN and R-FCN, which are the two architectures demonstrated in the paper.

The first component is the deformable convolution. The concept is that we keep the convolution the same, except for making the sampling locations a function of the image: the sampling locations are not a fixed grid, they are a function of the image. A few examples that they give in the paper: the receptive field, after several convolutions, of a neuron in this area of the image will be the area covered by each of these points highlighted in the image. The value we get from that: you can see that this neuron is on the sky, or on the border between the sky and the mountains. If we used traditional convolutions, after three convolutions we would get a square, or effectively something like a circular or Gaussian shape around that point. But with deformable convolutions we are able to sample a very large part of the sky, the mountains, and the objects in the image. Intuitively, that looks like a desired property, because if I'm just seeing blue pixels, how can I know whether it's sky, or water, or a wall of that color? I can't really know that for sure unless I have larger context.

When they use the same network but look at a different part of the image, where there is a distant motorcycle, the same deformable convolution mechanism creates a much more spatially dense receptive field, which covers a much smaller area and samples the object very tightly, along with a bit of the object's background, because intuitively we want to sample not only the object but the background as well. And on a closer, larger object, we can see that the receptive field is larger and a bit less dense, but again covers the entire object, instead of just a rectangular or circular part of it, and the background as well.

So this is an intuition for the value we can get from deformable convolutions. This was the first component; we will dive into its implementation in a few slides, but first I want to talk about the second component, which is deformable RoI pooling.
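The core mechanic of adaptive sampling can be sketched in a few lines of plain Python. This is a simplified, hypothetical illustration (a single 3x3 tap with per-sample offsets read via bilinear interpolation), not the paper's implementation; in the paper the offsets are produced by an extra convolutional layer and learned end-to-end.

```python
def bilinear(img, y, x):
    """Bilinearly interpolate img (a list of rows) at fractional (y, x)."""
    h, w = len(img), len(img[0])
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = img[y0][x0] * (1 - dx) + img[y0][x1] * dx
    bot = img[y1][x0] * (1 - dx) + img[y1][x1] * dx
    return top * (1 - dy) + bot * dy

def deformable_tap(img, weights, cy, cx, offsets):
    """One output value: a 3x3 convolution tap around (cy, cx) whose
    sampling grid is deformed by per-sample (dy, dx) offsets."""
    out, k = 0.0, 0
    for gy in (-1, 0, 1):
        for gx in (-1, 0, 1):
            dy, dx = offsets[k]
            out += weights[k] * bilinear(img, cy + gy + dy, cx + gx + dx)
            k += 1
    return out

# With all offsets zero this reduces to an ordinary 3x3 convolution.
img = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
zero = [(0.0, 0.0)] * 9
avg = deformable_tap(img, [1 / 9] * 9, 1, 1, zero)
print(avg)  # mean of the 3x3 neighbourhood around (1, 1)
```

The only change versus a standard convolution is the `(dy, dx)` shift per sample; because bilinear interpolation is differentiable in the offsets, the network can learn where to sample.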
So first, a short reminder of Faster R-CNN. In Faster R-CNN we take an input image, put it through a feature extractor, and get a feature map. (I'm assuming you know Faster R-CNN or similar models; I'm just giving you a really short reminder.) Then, using this feature map, we predict, let's say, about 2,000 bounding box proposals. A few of them really cover the objects that we are interested in, such as the cars, but some of them are just false positives of our bounding box proposal mechanism and lie on the background. Then we take each of these bounding boxes and crop it from the feature map, one by one, so we have about 2,000 cropped feature maps for the different bounding boxes. We put each of them through a second feature extractor. (The first part is called the RPN, the region proposal network; the second part is called the second stage, or Fast R-CNN.) At the end of this second feature extractor we classify each bounding box and refine its coordinates with a regression head.

Deformable RoI pooling changes the implementation of how we crop each of these proposals from the feature map. So what is deformable RoI pooling actually? Instead of pooling a single rectangular bounding box, we pool nine separate bounding boxes (we'll see how it actually works in a few minutes). That way, as you can see in this example from the paper, the boxes are able to cover the object of interest much more tightly, and the features cropped from the feature map are much more relevant for classifying the object of interest. They are not wasted on background, which, for a deformable object, fills a large part of the rectangular box and is less interesting and valuable to us. So that was the description of the two components.
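Here is a minimal, hypothetical sketch of that pooling step in plain Python: the RoI is divided into a 3x3 grid of bins, each bin is shifted by its own offset, and each shifted bin is average-pooled. In the actual paper the nine offsets are predicted from the features and are fractional; here they are integer shifts for simplicity.

```python
def avg_pool_bin(fmap, y0, x0, y1, x1):
    """Average the feature map over the integer bin [y0:y1) x [x0:x1)."""
    h, w = len(fmap), len(fmap[0])
    vals = [fmap[y][x]
            for y in range(max(y0, 0), min(y1, h))
            for x in range(max(x0, 0), min(x1, w))]
    return sum(vals) / len(vals) if vals else 0.0

def deformable_roi_pool(fmap, roi, offsets, bins=3):
    """roi = (y0, x0, y1, x1); offsets = per-bin (dy, dx) integer shifts."""
    y0, x0, y1, x1 = roi
    bh, bw = (y1 - y0) // bins, (x1 - x0) // bins
    out, k = [], 0
    for by in range(bins):
        for bx in range(bins):
            dy, dx = offsets[k]
            ty, tx = y0 + by * bh + dy, x0 + bx * bw + dx
            out.append(avg_pool_bin(fmap, ty, tx, ty + bh, tx + bw))
            k += 1
    return out

fmap = [[float(y * 6 + x) for x in range(6)] for y in range(6)]
# With zero offsets this degenerates to plain 3x3 RoI pooling.
plain = deformable_roi_pool(fmap, (0, 0, 6, 6), [(0, 0)] * 9)
print(plain[0], plain[8])  # top-left and bottom-right bin averages
```

With all offsets zero this is ordinary RoI pooling; learned non-zero offsets let the nine bins slide off the rectangle and hug a non-rectangular finding.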
411
00:19:14,480 --> 00:19:20,090
it shows a very strong improvement in
412
00:19:17,150 --> 00:19:21,679
the both in both of the significant
413
00:19:20,090 --> 00:19:24,049
metrics in the world of object detection
414
00:19:21,679 --> 00:19:26,059
we have two metrics the cocoa metric and
415
00:19:24,049 --> 00:19:28,100
the Pascal VOC metric this is the cocoa
416
00:19:26,059 --> 00:19:31,010
metric which gives more weight to
417
00:19:28,100 --> 00:19:33,260
accurate localization how well am I
418
00:19:31,010 --> 00:19:35,960
giving tight bounding boxes around the
419
00:19:33,260 --> 00:19:38,090
object so this locally tight
420
00:19:35,960 --> 00:19:40,700
localization metric gets about five to
421
00:19:38,090 --> 00:19:43,760
ten percent relative improvement due to
422
00:19:40,700 --> 00:19:47,750
this solution and the second metric
423
00:19:43,760 --> 00:19:52,429
which is which only which gives less
424
00:19:47,750 --> 00:19:58,100
weight to tight localization and thus it
425
00:19:52,429 --> 00:20:01,429
means its value is actually by telling
426
00:19:58,100 --> 00:20:03,199
us even giving us more insight into how
427
00:20:01,429 --> 00:20:05,120
many objects we were missing or
428
00:20:03,200 --> 00:20:09,260
detecting how many objects am I not
429
00:20:05,120 --> 00:20:11,239
detecting at all etc so this metric by
430
00:20:09,260 --> 00:20:13,250
the way is much more important in my
431
00:20:11,240 --> 00:20:16,130
opinion to medical imaging applications
432
00:20:13,250 --> 00:20:18,650
in most cases because tight localization
433
00:20:16,130 --> 00:20:20,480
is often less important but if
434
00:20:18,650 --> 00:20:22,640
we miss a critical
435
00:20:20,480 --> 00:20:26,179
medical finding that's something that
436
00:20:22,640 --> 00:20:30,169
the doctors will really be mad at us
437
00:20:26,179 --> 00:20:44,480
about so this metric is also improved by
438
00:20:30,169 --> 00:20:46,309
5% sorry sorry yeah that's theirs
439
00:20:44,480 --> 00:20:49,700
here in this table is their
440
00:20:46,309 --> 00:20:51,910
implementation of for example Faster
441
00:20:49,700 --> 00:20:54,910
R-CNN with deformable convolutional
442
00:20:51,910 --> 00:20:54,910
networks
443
00:20:55,920 --> 00:21:04,090
5% 5% Oh 51
444
00:21:01,060 --> 00:21:05,830
it's the Pascal VOC mean average
445
00:21:04,090 --> 00:21:08,199
precision metric I don't want to dive
446
00:21:05,830 --> 00:21:11,530
into that too much just take it as a
447
00:21:08,200 --> 00:21:14,590
score the score for how good your
447
00:21:08,200 --> 00:21:14,590
detector is it's not really
448
00:21:11,530 --> 00:21:16,810
important to understand it right now
450
00:21:16,810 --> 00:21:26,889
okay what is the percentage of
451
00:21:22,600 --> 00:21:28,600
undetected objects you can't
451
00:21:26,890 --> 00:21:31,090
understand it from this number but you
452
00:21:28,600 --> 00:21:34,899
can only understand that it
453
00:21:31,090 --> 00:21:36,939
has increased by a significant amount it is
455
00:21:34,900 --> 00:21:39,340
improved by a relatively significant
456
00:21:36,940 --> 00:21:42,580
amount you don't know the exact number
457
00:21:39,340 --> 00:21:45,639
of undetected objects because this metric
458
00:21:42,580 --> 00:21:47,649
covers a lot of different working points
459
00:21:45,640 --> 00:21:51,150
of sensitivity of recall and precision
460
00:21:47,650 --> 00:21:54,370
that you can choose for your algorithm
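The idea that this metric averages over many working points of recall and precision can be sketched roughly like this — a minimal NumPy illustration of an 11-point interpolated average precision in the spirit of the Pascal VOC metric, not the official evaluation code; the function name and inputs are made up for illustration:

```python
import numpy as np

def voc_average_precision(scores, is_true_positive, num_ground_truth):
    """11-point interpolated AP in the spirit of the Pascal VOC metric.

    scores:           confidence of each detection (any order)
    is_true_positive: 1 if the detection matched a ground-truth box
                      (e.g. IoU >= 0.5), else 0
    num_ground_truth: total number of ground-truth objects
    """
    order = np.argsort(-np.asarray(scores, dtype=float))  # sort by confidence
    tp = np.asarray(is_true_positive, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    recall = cum_tp / num_ground_truth                    # one working point per rank
    precision = cum_tp / (np.arange(len(tp)) + 1)

    # interpolate precision at 11 evenly spaced recall levels and average
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        candidates = precision[recall >= r]
        ap += (candidates.max() if candidates.size else 0.0) / 11
    return ap
```

Because the score is averaged over all these recall levels, a single AP number indeed does not tell you the exact percentage of undetected objects at any one operating point, which is the point made above.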
461
00:21:51,150 --> 00:21:57,760
okay so now let's talk about the
462
00:21:54,370 --> 00:22:03,790
implementation by the way after that
462
00:21:57,760 --> 00:22:06,879
you can ask questions freely so please
463
00:22:03,790 --> 00:22:08,680
keep your questions to the end
465
00:22:06,880 --> 00:22:11,410
of this part if you have any more
466
00:22:08,680 --> 00:22:15,820
questions unless they are really really
467
00:22:11,410 --> 00:22:17,320
important so first of all let's start
467
00:22:15,820 --> 00:22:20,020
with the implementation of
469
00:22:17,320 --> 00:22:21,820
deformable convolutions so this is the
470
00:22:20,020 --> 00:22:23,650
diagram that they have in the paper and
471
00:22:21,820 --> 00:22:25,720
I think that it's confusing a little bit
472
00:22:23,650 --> 00:22:28,750
because it's a good diagram but it
473
00:22:25,720 --> 00:22:32,530
contains too many levels of abstraction
474
00:22:28,750 --> 00:22:35,440
and it's hard to wrap your minds around
475
00:22:32,530 --> 00:22:37,629
what's going on here so I invested some
476
00:22:35,440 --> 00:22:40,330
time in decomposing this diagram into
477
00:22:37,630 --> 00:22:44,410
several parts so it would be easier to
478
00:22:40,330 --> 00:22:46,000
understand so this is the essence of the
479
00:22:44,410 --> 00:22:48,190
layer which is called deformable
480
00:22:46,000 --> 00:22:49,990
convolution the essence is that we have an
480
00:22:48,190 --> 00:22:51,880
input feature map it doesn't have to be
481
00:22:49,990 --> 00:22:53,950
the image it's actually most of the
482
00:22:51,880 --> 00:22:56,620
time not used directly on the
484
00:22:53,950 --> 00:23:02,260
image but on deeper layers on deeper
485
00:22:56,620 --> 00:23:03,250
feature maps and we
485
00:23:02,260 --> 00:23:05,650
use
486
00:23:03,250 --> 00:23:08,110
a convolution over this image but
487
00:23:05,650 --> 00:23:13,480
the convolution is not the old square
488
00:23:08,110 --> 00:23:15,699
3x3 convolution it has a different
490
00:23:13,480 --> 00:23:17,590
sampling grid for each location of the
491
00:23:15,700 --> 00:23:19,180
convolution and then let's say that I'm
492
00:23:17,590 --> 00:23:22,300
talking about this location so I have
493
00:23:19,180 --> 00:23:23,890
nine points that I'm sampling in
493
00:23:22,300 --> 00:23:28,060
the locations of the blue squares and
494
00:23:23,890 --> 00:23:30,490
then those nine points are
496
00:23:28,060 --> 00:23:31,990
transformed into one point just like in
497
00:23:30,490 --> 00:23:34,990
the regular convolution the 3 by 3
498
00:23:31,990 --> 00:23:37,750
square was converted into one point or
499
00:23:34,990 --> 00:23:42,070
one vector in the feature map one
500
00:23:37,750 --> 00:23:45,910
spatial location so this is the essence
501
00:23:42,070 --> 00:23:49,389
now the implementation so you start by
502
00:23:45,910 --> 00:23:52,030
doing a regular square 3x3 convolution
503
00:23:49,390 --> 00:23:55,270
ignore the blue squares for now we start
504
00:23:52,030 --> 00:23:58,420
with the regular 3x3 convolution with
505
00:23:55,270 --> 00:24:03,490
the square shape and the output of this
506
00:23:58,420 --> 00:24:06,540
convolution is a feature map whose spatial
506
00:23:58,420 --> 00:24:06,540
size roughly matches the input feature map
507
00:24:03,490 --> 00:24:09,550
spatial size but the depth of this
508
00:24:06,540 --> 00:24:13,690
feature map is 9 times 2 that is 18 why 9
509
00:24:09,550 --> 00:24:17,070
times 2 yeah it's because we can
510
00:24:13,690 --> 00:24:20,500
visualize it this is just an aid to
512
00:24:20,500 --> 00:24:24,070
understand it this is not a stage this
513
00:24:22,510 --> 00:24:28,150
is the last computation that happens
514
00:24:24,070 --> 00:24:32,590
here actually but each vector of
515
00:24:28,150 --> 00:24:39,910
length 18 can be seen as 2 squares of
516
00:24:32,590 --> 00:24:42,909
size 3 by 3 so the top left elements in
517
00:24:39,910 --> 00:24:45,430
these two squares give us the offsets
518
00:24:42,910 --> 00:24:47,740
that tell us where to locate the top
519
00:24:45,430 --> 00:24:51,070
left sampling point of our sampling grid
520
00:24:47,740 --> 00:24:53,770
and the center elements in these two
520
00:24:47,740 --> 00:24:53,770
squares give us the offset that
521
00:24:51,070 --> 00:24:55,210
tells us where to place the
522
00:24:53,770 --> 00:24:57,310
center blue square in our new sampling
524
00:24:57,310 --> 00:25:04,659
grid and because we have nine
524
00:24:57,310 --> 00:25:04,659
elements in each of
525
00:25:02,020 --> 00:25:07,300
them we get the offset for each of
526
00:25:04,660 --> 00:25:13,260
our new blue squares
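The bookkeeping just described — an 18-channel output read as two 3x3 squares of per-point offsets — can be sketched like this; it's a small NumPy illustration with stand-in values, and which square holds which axis is an assumption here:

```python
import numpy as np

# One spatial location of the offset feature map produced by the
# extra ("yellow") 3x3 convolution: a vector of length 18 = 9 points x 2 axes.
offset_vector = np.arange(18, dtype=float)  # stand-in values

# View it as two 3x3 "squares" of offsets, one per axis (assumed order).
offsets = offset_vector.reshape(2, 3, 3)
horizontal, vertical = offsets[0], offsets[1]

# The regular square 3x3 sampling grid around a center pixel (cy, cx)...
cy, cx = 10.0, 10.0
grid_y, grid_x = np.meshgrid([-1, 0, 1], [-1, 0, 1], indexing="ij")

# ...deformed by the predicted offsets into 9 fractional sampling points
# (the "blue squares"), one pair of coordinates per kernel element.
sample_y = cy + grid_y + vertical
sample_x = cx + grid_x + horizontal
```

The top-left entry of each square moves the top-left blue sampling point, the center entry moves the center point, and so on — exactly the correspondence described above.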
528
00:25:16,070 --> 00:25:22,830
yeah that's it yes so one
528
00:25:16,070 --> 00:25:22,830
square of course is the horizontal
529
00:25:21,600 --> 00:25:24,090
offsets
530
00:25:22,830 --> 00:25:27,000
they tell us on the left-to-right axis
531
00:25:24,090 --> 00:25:28,918
how much we want to move each of
533
00:25:28,919 --> 00:25:34,020
our squares and the second square gives
534
00:25:30,870 --> 00:25:36,510
us the vertical offsets so then we take
533
00:25:28,919 --> 00:25:34,020
these offsets and we just sample from
534
00:25:30,870 --> 00:25:36,510
the input feature map sample the points
535
00:25:34,020 --> 00:25:40,370
multiply them with the weights
538
00:25:42,840 --> 00:25:51,959
of our convolutional kernel and get the
539
00:25:46,470 --> 00:25:53,580
vector and there is a little bit of a
540
00:25:51,960 --> 00:25:58,289
problem with what I just described
541
00:25:53,580 --> 00:26:01,320
because this convolutional layer is a
542
00:25:58,289 --> 00:26:03,960
convolution so it outputs continuous
543
00:26:01,320 --> 00:26:06,450
valued real numbers it doesn't output
544
00:26:03,960 --> 00:26:08,460
integers but in order
544
00:26:03,960 --> 00:26:08,460
to sample the image which is
545
00:26:06,450 --> 00:26:10,409
discrete it contains discrete pixels so
546
00:26:08,460 --> 00:26:12,480
we need integers but the problem is
548
00:26:12,480 --> 00:26:16,260
that we can't round these numbers
549
00:26:14,580 --> 00:26:19,139
because then it wouldn't be
550
00:26:16,260 --> 00:26:21,419
differentiable and then we wouldn't
550
00:26:16,260 --> 00:26:21,419
be able to backpropagate through it or
551
00:26:19,140 --> 00:26:23,880
it would require a much heavier
552
00:26:21,419 --> 00:26:26,429
and more cumbersome solution so what
553
00:26:23,880 --> 00:26:30,320
we do is something that this group
555
00:26:30,320 --> 00:26:37,530
mentions a lot in their papers imagine
556
00:26:34,350 --> 00:26:39,809
that we have two coordinates
556
00:26:34,350 --> 00:26:39,809
let's say the x
558
00:26:39,809 --> 00:26:45,539
coordinate was 2.3 and the y coordinate
559
00:26:42,450 --> 00:26:49,140
was 7.2 and we wanted to sample
559
00:26:42,450 --> 00:26:49,140
this point
560
00:26:45,539 --> 00:26:51,809
from the image
561
00:26:49,140 --> 00:26:53,190
we could use bilinear
563
00:26:53,190 --> 00:26:59,039
interpolation in order to
564
00:26:56,850 --> 00:27:02,580
interpolate what should be the value at
565
00:26:59,039 --> 00:27:05,370
that point so fortunately
565
00:26:59,039 --> 00:27:05,370
bilinear interpolation can be
566
00:27:02,580 --> 00:27:07,408
implemented very efficiently
568
00:27:07,409 --> 00:27:13,200
using matrix operators and matrix
569
00:27:09,809 --> 00:27:15,899
multiplication and that's why we can do
570
00:27:13,200 --> 00:27:18,450
it for many points of the sampling grid
571
00:27:15,900 --> 00:27:22,010
in real time and even on the GPU of
572
00:27:18,450 --> 00:27:25,799
course so it's not very
572
00:27:18,450 --> 00:27:25,799
complex to understand how the
573
00:27:22,010 --> 00:27:28,240
matrix implementation of
574
00:27:25,799 --> 00:27:31,540
bilinear interpolation works but it is
575
00:27:28,240 --> 00:27:33,430
outside of the scope of this talk if
577
00:27:33,430 --> 00:27:37,650
you are interested in it you can come
578
00:27:34,720 --> 00:27:40,630
talk to me about it later
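The bilinear interpolation step mentioned above can be sketched like this — a minimal single-channel NumPy version that ignores border handling; the real implementations batch this as matrix multiplications so it runs efficiently on the GPU:

```python
import numpy as np

def bilinear_sample(feature_map, y, x):
    """Sample a 2D feature map at a fractional (y, x) coordinate.

    The result is a fixed weighted sum of the four neighboring pixels,
    so it stays differentiable with respect to the pixel values.
    Simplification: assumes (y, x) is at least one pixel away from the
    bottom/right border of the map.
    """
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = y0 + 1, x0 + 1
    wy, wx = y - y0, x - x0  # fractional parts act as interpolation weights

    return ((1 - wy) * (1 - wx) * feature_map[y0, x0]
            + (1 - wy) * wx * feature_map[y0, x1]
            + wy * (1 - wx) * feature_map[y1, x0]
            + wy * wx * feature_map[y1, x1])
```

At integer coordinates this reduces to the pixel value itself; at a point like (2.3, 7.2) from the example above it blends the four surrounding pixels.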
579
00:27:37,650 --> 00:27:43,450
okay so that's it about the first
580
00:27:40,630 --> 00:27:45,160
component by the way if anyone has a
581
00:27:43,450 --> 00:27:50,340
question about this component ask
582
00:27:45,160 --> 00:27:50,340
because maybe it's a better time yeah yeah
583
00:27:52,830 --> 00:27:55,830
yeah
584
00:28:01,910 --> 00:28:05,330
right so
585
00:28:10,830 --> 00:28:17,019
no those are okay so I'll repeat the
586
00:28:14,259 --> 00:28:20,049
question so everyone will hear so I said
587
00:28:17,019 --> 00:28:22,360
that first of all before I do the
588
00:28:20,049 --> 00:28:24,490
convolution with the square the yellow
589
00:28:22,360 --> 00:28:26,408
square the regular convolution I don't
590
00:28:24,490 --> 00:28:28,179
know the offsets for where I want to
591
00:28:26,409 --> 00:28:30,669
locate my blue sampling grid of the
592
00:28:28,179 --> 00:28:33,039
deformable convolutions and then I said
593
00:28:30,669 --> 00:28:37,509
that when I know these sampling points
593
00:28:30,669 --> 00:28:37,509
I take them and multiply them with the
595
00:28:37,509 --> 00:28:45,309
convolution kernel and what's your name
596
00:28:41,669 --> 00:28:47,769
Lisa and Lisa asked me if it's the same
597
00:28:45,309 --> 00:28:49,240
kernel if the same kernel is used for
598
00:28:47,769 --> 00:28:51,250
both of these convolutions or it's a
599
00:28:49,240 --> 00:28:54,490
different kernel so it's a different
600
00:28:51,250 --> 00:28:56,980
kernel the yellow convolution
601
00:28:54,490 --> 00:28:59,440
has a single kernel and the blue
602
00:28:56,980 --> 00:29:04,269
convolution has a different kernel okay
603
00:28:59,440 --> 00:29:08,279
and they are learned separately okay any
604
00:29:04,269 --> 00:29:08,279
other questions yeah
605
00:29:12,400 --> 00:29:19,630
mm-hm
606
00:29:13,550 --> 00:29:19,629
what do you mean mhm
607
00:29:28,190 --> 00:29:32,190
probably but you know empirically it
608
00:29:30,539 --> 00:29:33,690
improves the results so I guess it has
609
00:29:32,190 --> 00:29:35,610
some drawbacks and maybe this solution
610
00:29:33,690 --> 00:29:39,240
can be improved but it also has
610
00:29:33,690 --> 00:29:39,240
desirable properties he asked me if
611
00:29:35,610 --> 00:29:42,990
maybe it creates some
613
00:29:42,990 --> 00:29:47,850
discontinuity because of the weird
614
00:29:45,509 --> 00:29:51,629
sampling strategy so probably it has
615
00:29:47,850 --> 00:29:53,519
some disadvantages but even the
616
00:29:51,629 --> 00:29:55,590
convolution that we are using today also
617
00:29:53,519 --> 00:29:57,539
has disadvantages so I think the
618
00:29:55,590 --> 00:29:59,490
only question is which mechanism has
619
00:29:57,539 --> 00:30:04,590
more disadvantages relative to its
620
00:29:59,490 --> 00:30:08,039
advantages yes yeah
621
00:30:04,590 --> 00:30:14,120
so when you get the loss
621
00:30:08,039 --> 00:30:14,120
you backpropagate it just like
623
00:30:15,559 --> 00:30:20,639
through your bilinear interpolation
624
00:30:17,750 --> 00:30:24,529
operator that I talked about so
624
00:30:20,639 --> 00:30:27,840
you have these
625
00:30:24,529 --> 00:30:30,570
numbers and you multiply them with the
626
00:30:27,840 --> 00:30:34,110
matrix of the bilinear interpolation and
628
00:30:30,570 --> 00:30:35,970
then you get these values okay it's not
629
00:30:34,110 --> 00:30:37,830
that you do something active to sample
630
00:30:35,970 --> 00:30:39,870
them it's just that you have a
630
00:30:35,970 --> 00:30:39,870
matrix which is a bilinear
631
00:30:37,830 --> 00:30:42,210
interpolation kernel and you multiply it
632
00:30:39,870 --> 00:30:46,320
with these numbers after some
634
00:30:46,320 --> 00:30:51,960
vector operations and then you get the
635
00:30:49,769 --> 00:30:54,720
values that are sampled in each of these
636
00:30:51,960 --> 00:30:57,539
points and then you multiply them with
637
00:30:54,720 --> 00:30:59,549
another matrix so it's
638
00:30:57,539 --> 00:31:03,919
backpropagated through the bilinear
639
00:30:59,549 --> 00:31:03,918
interpolation operator yes
640
00:31:03,960 --> 00:31:08,419
different sampling patterns for it so in
641
00:31:07,500 --> 00:31:12,889
the
642
00:31:08,419 --> 00:31:14,629
oh I hope I understood your question yes
643
00:31:12,889 --> 00:31:18,320
if there is a different sampling pattern
644
00:31:14,629 --> 00:31:20,539
for each pixel in the image so I'll go
645
00:31:18,320 --> 00:31:23,629
back I hope I understood your
646
00:31:20,539 --> 00:31:27,259
question I'll go back to this example
647
00:31:23,629 --> 00:31:29,389
images that I showed here and I hope
648
00:31:27,259 --> 00:31:32,539
this answers your question you can see
649
00:31:29,389 --> 00:31:35,539
that for this pixel the sampling
649
00:31:29,389 --> 00:31:35,539
has a much wider coverage and
650
00:31:32,539 --> 00:31:39,799
for this pixel or activation the
651
00:31:35,539 --> 00:31:44,230
coverage is much smaller and the
653
00:31:44,230 --> 00:31:53,619
receptive field is a function of the
654
00:31:47,330 --> 00:31:53,619
local input and just a second
655
00:31:55,269 --> 00:32:03,100
for each location in the image the
656
00:31:59,179 --> 00:32:05,869
offsets are a function of these 3x3
657
00:32:03,100 --> 00:32:07,789
pixels in the input feature maps so of
657
00:32:05,869 --> 00:32:07,789
course you will get different
658
00:32:07,789 --> 00:32:09,499
offsets if you place your convolution
659
00:32:09,499 --> 00:32:12,499
here than if you place the
661
00:32:11,029 --> 00:32:14,559
convolution here does that answer your
662
00:32:12,499 --> 00:32:14,559
question
663
00:32:17,869 --> 00:32:28,019
it depends on the yellow convolution yes
664
00:32:23,509 --> 00:32:30,269
the offsets yes the yellow regular 3x3
665
00:32:28,019 --> 00:32:34,399
2D convolution the square 2D
666
00:32:30,269 --> 00:32:37,679
convolution determines the offsets and
667
00:32:34,399 --> 00:32:39,658
of course the output of the
668
00:32:37,679 --> 00:32:42,029
convolution is different for each part
669
00:32:39,659 --> 00:32:44,789
of the image because its input is
669
00:32:42,029 --> 00:32:46,919
different okay and that's why the offsets
670
00:32:44,789 --> 00:32:48,959
will be different that's the mechanism
671
00:32:46,919 --> 00:32:50,339
just a second that enables the
673
00:32:48,959 --> 00:33:02,789
offsets to be different between
674
00:32:50,339 --> 00:33:05,039
different parts of the image okay the
675
00:33:02,789 --> 00:33:07,799
output of the original square
676
00:33:05,039 --> 00:33:10,229
convolution enables us to sample the
677
00:33:07,799 --> 00:33:12,029
image for the real convolution for the
678
00:33:10,229 --> 00:33:14,039
deformable convolution and that
678
00:33:12,029 --> 00:33:15,629
convolution is the one
679
00:33:14,039 --> 00:33:18,739
that really creates the next feature map
681
00:33:15,629 --> 00:33:18,738
of our feature extractors
682
00:33:27,330 --> 00:33:31,429
[Music]
683
00:33:35,799 --> 00:33:42,729
I'm sorry not your range - okay
684
00:33:46,020 --> 00:33:52,680
wait please I would
684
00:33:51,000 --> 00:33:55,170
love to answer questions and
685
00:33:52,680 --> 00:33:57,360
I think it's better that we cover
686
00:33:55,170 --> 00:34:00,810
fewer papers but understand them better
687
00:33:57,360 --> 00:34:03,659
so please ask only if you
688
00:34:00,810 --> 00:34:05,280
have gaps in understanding what I just
690
00:34:03,660 --> 00:34:06,660
explained and don't be shy to ask
691
00:34:05,280 --> 00:34:10,909
because I'm sure that you are not the
692
00:34:06,660 --> 00:34:10,909
only one that didn't understand yes
693
00:34:21,040 --> 00:34:26,259
can you speak louder I didn't hear you
693
00:34:34,449 --> 00:34:40,009
the sampling grid yes for each pixel in the
695
00:34:37,969 --> 00:34:42,109
original for each spatial location in
696
00:34:40,010 --> 00:34:45,679
the original feature map we have
697
00:34:42,109 --> 00:34:48,290
18 different values they determine the
697
00:34:42,109 --> 00:34:48,290
new sampling grid the
698
00:34:45,679 --> 00:34:50,359
deformable sampling grid and this
699
00:34:48,290 --> 00:34:52,339
sampling grid is different
700
00:34:50,359 --> 00:34:57,078
between spatial locations okay
702
00:34:57,079 --> 00:35:00,040
yes
703
00:35:08,589 --> 00:35:13,930
I would love it if you could keep this
703
00:35:11,380 --> 00:35:15,789
question until after we finish covering
704
00:35:13,930 --> 00:35:18,009
this paper and I also have an example of
705
00:35:15,789 --> 00:35:18,700
why I think it's interesting okay
707
00:35:18,009 --> 00:35:24,190
thanks
708
00:35:18,700 --> 00:35:26,049
yes the layer that generates the
708
00:35:24,190 --> 00:35:27,339
offsets is only one layer and it's even a
710
00:35:26,049 --> 00:35:47,380
linear layer it doesn't have a
711
00:35:27,339 --> 00:35:49,359
non-linearity yeah I think it could be
711
00:35:47,380 --> 00:35:53,069
an interesting paper to try it
713
00:35:49,359 --> 00:35:53,069
with more convolutions and see if it's
714
00:35:53,910 --> 00:35:57,009
[Music]
715
00:35:58,170 --> 00:36:02,369
what do you mean
716
00:36:20,160 --> 00:36:50,170
if I understand
716
00:36:47,890 --> 00:36:51,580
your question correctly if the
717
00:36:50,170 --> 00:36:53,410
yellow convolution will be applied at
718
00:36:51,580 --> 00:36:55,240
different locations in the image but the
719
00:36:53,410 --> 00:36:57,490
values at these locations will be
721
00:36:55,240 --> 00:37:00,580
equal then the offsets will also be
722
00:36:57,490 --> 00:37:05,279
equal is that your question mm-hmm okay
723
00:37:00,580 --> 00:37:09,270
so yes mm-hmm okay
724
00:37:05,280 --> 00:37:09,270
can we move on yeah
725
00:37:14,060 --> 00:37:32,570
no the offsets are not
725
00:37:16,520 --> 00:37:35,180
bounded mathematically nothing
726
00:37:32,570 --> 00:37:39,290
bounds these offsets and we also know
727
00:37:35,180 --> 00:37:41,149
that the
728
00:37:39,290 --> 00:37:43,850
offsets are not confined to the area of this
729
00:37:41,150 --> 00:37:47,390
same 3x3 square they are not
730
00:37:43,850 --> 00:37:50,210
bounded and usually they are larger than
731
00:37:47,390 --> 00:37:52,190
this 3x3 square because even in
733
00:37:50,210 --> 00:37:55,370
traditional object detection we know
734
00:37:52,190 --> 00:37:59,240
that we can use a small convolution
735
00:37:55,370 --> 00:38:02,600
kernel to predict much larger bounding
736
00:37:59,240 --> 00:38:07,310
boxes that are larger than the
735
00:37:55,370 --> 00:38:02,600
receptive field of these kernels so
736
00:37:59,240 --> 00:38:07,310
it's like if you look at my torso
739
00:38:10,940 --> 00:38:16,220
you have enough information to know that
740
00:38:13,760 --> 00:38:19,630
my head is up here and my feet are
741
00:38:16,220 --> 00:38:22,819
down there right so you can infer the
742
00:38:19,630 --> 00:38:26,500
desired sampling points even if you are
742
00:38:19,630 --> 00:38:26,500
looking just at a part of an object
744
00:38:36,330 --> 00:38:42,460
no it just uses the features that it has
745
00:38:39,010 --> 00:38:46,480
and just a limited spatial context is
745
00:38:39,010 --> 00:38:46,480
enough to infer
746
00:38:42,460 --> 00:38:48,190
something that is outside of
747
00:38:46,480 --> 00:38:49,390
your context
749
00:38:49,390 --> 00:38:55,440
okay okay I'll yeah
750
00:39:07,790 --> 00:39:15,359
okay good question she asked if after we
751
00:39:11,660 --> 00:39:16,589
guys can you be quiet so people will be
751
00:39:11,660 --> 00:39:16,589
able to hear thanks
753
00:39:16,589 --> 00:39:21,509
so she asked if after we do this
754
00:39:19,859 --> 00:39:23,970
deformable convolutions maybe there are
755
00:39:21,510 --> 00:39:26,430
pixels that are not covered by our blue
756
00:39:23,970 --> 00:39:28,319
pixels anywhere in the image and yeah it
756
00:39:23,970 --> 00:39:28,319
can happen nothing ensures that it
758
00:39:28,319 --> 00:39:32,220
doesn't happen and it's okay that it
759
00:39:29,790 --> 00:39:35,640
happens because maybe these pixels are
760
00:39:32,220 --> 00:39:38,308
yeah maybe the information
760
00:39:32,220 --> 00:39:38,308
there is less relevant or maybe
761
00:39:35,640 --> 00:39:40,558
it's possible that we also miss
762
00:39:38,309 --> 00:39:43,380
something that is important but if it
763
00:39:40,559 --> 00:39:46,980
happens if it's important it
765
00:39:46,980 --> 00:39:50,690
means it will harm our classification
766
00:39:48,839 --> 00:39:53,279
results and then in the backpropagation
767
00:39:50,690 --> 00:39:55,920
the weights that generated the offsets
768
00:39:53,280 --> 00:40:05,780
will adapt to predict better offsets so
769
00:39:55,920 --> 00:40:05,780
yeah what do you mean
770
00:40:17,900 --> 00:40:22,380
I
771
00:40:19,290 --> 00:40:24,330
I'm pretty sure that in the formula
771
00:40:19,290 --> 00:40:24,330
there is nothing that prevents it but I
772
00:40:22,380 --> 00:40:27,690
guess it's something that
773
00:40:24,330 --> 00:40:30,590
just happens because
774
00:40:27,690 --> 00:40:33,890
I guess it's not really
775
00:40:30,590 --> 00:40:36,270
beneficial in any way that they will
776
00:40:33,890 --> 00:40:37,859
converge to sample the same
777
00:40:36,270 --> 00:40:41,360
point and then they naturally learn
778
00:40:37,860 --> 00:40:43,830
to sample different
780
00:40:43,830 --> 00:40:50,100
points okay so let's move on I just want
781
00:40:48,510 --> 00:40:54,960
to see how much time we have
782
00:40:50,100 --> 00:40:57,420
okay we're good so now let's move on to
783
00:40:54,960 --> 00:41:00,630
the second component which is the
784
00:40:57,420 --> 00:41:03,720
deformable RoI pooling and I just want
784
00:41:00,630 --> 00:41:05,580
to give a quick reminder of the regular
785
00:41:03,720 --> 00:41:07,709
RoI pooling and there are many ways to
786
00:41:05,580 --> 00:41:11,549
perform RoI pooling they're also called
787
00:41:07,710 --> 00:41:13,110
RoI warping and other names so now it
789
00:41:11,550 --> 00:41:15,270
doesn't really matter it can work with
790
00:41:13,110 --> 00:41:16,710
all of these methods and I'm going to
791
00:41:15,270 --> 00:41:18,630
demonstrate it with the original
792
00:41:16,710 --> 00:41:24,690
RoI pooling which by the way
792
00:41:18,630 --> 00:41:27,240
is not differentiable okay
793
00:41:24,690 --> 00:41:30,120
so by the way it's not differentiable
795
00:41:27,240 --> 00:41:32,220
and the solution that I spoke
796
00:41:30,120 --> 00:41:34,049
about here is also the solution to make
797
00:41:32,220 --> 00:41:35,339
RoI pooling differentiable or one of
797
00:41:34,050 --> 00:41:37,200
the solutions to make RoI pooling
798
00:41:35,340 --> 00:41:40,050
differentiable the solution with the
799
00:41:37,200 --> 00:41:42,899
bilinear interpolation operation so let's see
800
00:41:40,050 --> 00:41:44,850
how RoI pooling works from the
802
00:41:42,900 --> 00:41:46,980
RPN the first stage of the Faster
802
00:41:44,850 --> 00:41:49,980
R-CNN we get a bounding box
804
00:41:46,980 --> 00:41:53,670
proposal and then we split this proposal
805
00:41:49,980 --> 00:41:57,750
into several bins for example two by two
805
00:41:53,670 --> 00:41:57,750
bins but in reality it's seven by seven
807
00:41:57,750 --> 00:42:02,340
or 14 by 14 in most of the cases but for
808
00:42:00,810 --> 00:42:04,799
the simplicity of the example let's
809
00:42:02,340 --> 00:42:08,070
assume it's two by two then for each of
810
00:42:04,800 --> 00:42:10,110
these bins separately we perform max
811
00:42:08,070 --> 00:42:12,060
pooling on the entire bin it doesn't
812
00:42:10,110 --> 00:42:14,250
matter the bin size we perform max
813
00:42:12,060 --> 00:42:18,420
pooling on the entire bin so for example
813
00:42:14,250 --> 00:42:21,660
for this bin we get 0.74 for this
814
00:42:18,420 --> 00:42:24,000
bin we get 0.39 etc so it can be
815
00:42:21,660 --> 00:42:25,470
max pooling in the paper they talk
816
00:42:24,000 --> 00:42:30,359
about average pooling it doesn't really
818
00:42:25,470 --> 00:42:32,819
matter and this is the original ROI
819
00:42:30,359 --> 00:42:35,819
pooling now in deformable RoI pooling the
820
00:42:32,819 --> 00:42:38,630
idea is that we keep the same things and
821
00:42:35,819 --> 00:42:43,650
we have the same sizes for these bins
822
00:42:38,630 --> 00:42:45,720
but we take each bin keep its size but
823
00:42:43,650 --> 00:42:47,760
give it an offset so we take the top
824
00:42:45,720 --> 00:42:49,859
right bin and we place it somewhere here
825
00:42:47,760 --> 00:42:52,890
and take the top left bin and we place
826
00:42:49,859 --> 00:42:55,890
it somewhere here etc so in reality we
827
00:42:52,890 --> 00:42:59,098
have like 7 by 7 bins and we predict
828
00:42:55,890 --> 00:43:03,299
offsets for all of them and that's
829
00:42:59,099 --> 00:43:05,970
that's uh basically how it works so the
830
00:43:03,299 --> 00:43:07,470
implementation is very similar to the to
831
00:43:05,970 --> 00:43:11,819
the previous implementation it will be
832
00:43:07,470 --> 00:43:15,089
very easy to understand now so this is
833
00:43:11,819 --> 00:43:18,058
the input feature map from which we crop
834
00:43:15,089 --> 00:43:20,400
and we perform the RoI pooling this
834
00:43:15,089 --> 00:43:20,400
is the feature map that we pool the
835
00:43:18,059 --> 00:43:22,890
RoI from so let's say we have an
836
00:43:20,400 --> 00:43:25,500
RoI and in this example
838
00:43:25,500 --> 00:43:30,990
they're all with all of their diagrams
839
00:43:27,119 --> 00:43:33,390
are for 3x3 bins I just mentioned 2x2
840
00:43:30,990 --> 00:43:35,430
they do it with three by three bins so
841
00:43:33,390 --> 00:43:37,140
we have an RoI when we split it to
841
00:43:33,390 --> 00:43:37,140
three by three bins again it's
842
00:43:35,430 --> 00:43:39,990
the three by three yellow square here
843
00:43:37,140 --> 00:43:45,629
and then we do the regular RoI pooling on
845
00:43:45,630 --> 00:43:54,690
this RoI and we get the downsampled
845
00:43:45,630 --> 00:43:54,690
RoI and then we put this RoI into a
847
00:43:54,690 --> 00:44:01,170
fully connected layer and again we get
848
00:43:58,039 --> 00:44:04,260
vector this time we get a single vector
849
00:44:01,170 --> 00:44:07,319
for this RoI of size 18
850
00:44:04,260 --> 00:44:10,049
and we can look at this vector as two
851
00:44:07,319 --> 00:44:12,029
squares of size three by three which are
852
00:44:10,049 --> 00:44:14,400
the horizontal and the vertical offsets
853
00:44:12,029 --> 00:44:17,039
for each bin so the value in the top
854
00:44:14,400 --> 00:44:20,760
left part of these squares is the offset
855
00:44:17,039 --> 00:44:24,930
for the top left bin and
855
00:44:17,039 --> 00:44:24,930
the two values in the top
856
00:44:20,760 --> 00:44:27,270
right part of the squares are the offset
857
00:44:24,930 --> 00:44:30,868
for the top right bin and that way we
859
00:44:30,869 --> 00:44:34,860
can get an offset for each bin and place
860
00:44:34,049 --> 00:44:37,410
them in
861
00:44:34,860 --> 00:44:43,620
and sample different parts of the
862
00:44:37,410 --> 00:44:45,509
feature map with them and again we
863
00:44:43,620 --> 00:44:49,650
need to sample the bins and do max
864
00:44:45,510 --> 00:44:51,750
pooling on areas at coordinates
865
00:44:49,650 --> 00:44:53,370
that are not on the grid and this is also
866
00:44:51,750 --> 00:44:56,010
solved using the same bilinear
867
00:44:53,370 --> 00:44:59,160
interpolation and matrix multiplication
868
00:44:56,010 --> 00:45:03,300
that I mentioned earlier so some really
869
00:44:59,160 --> 00:45:06,750
cool examples from their paper
869
00:45:03,300 --> 00:45:09,930
on how deformable RoI
870
00:45:06,750 --> 00:45:11,790
pooling works so you can see that this
872
00:45:09,930 --> 00:45:14,210
is the original proposal the yellow is
873
00:45:11,790 --> 00:45:19,529
the original proposal and then the nine
874
00:45:14,210 --> 00:45:21,270
different bins we pool the original
875
00:45:19,530 --> 00:45:23,100
proposal and we put it into a fully
876
00:45:21,270 --> 00:45:24,990
connected layer and actually predict
877
00:45:23,100 --> 00:45:27,029
offsets which would give us nine
878
00:45:24,990 --> 00:45:29,160
different bounding boxes which are the
879
00:45:27,030 --> 00:45:32,610
bounding boxes in red so you can see how
880
00:45:29,160 --> 00:45:36,330
nicely they cover the cat and the less
881
00:45:32,610 --> 00:45:40,260
relevant information here we
882
00:45:36,330 --> 00:45:42,240
don't waste any capacity on it so I
883
00:45:40,260 --> 00:45:47,400
think this is really really elegant and
884
00:45:42,240 --> 00:45:49,979
another example which we have a good
885
00:45:47,400 --> 00:45:52,080
bounding this is an example of the post
886
00:45:49,980 --> 00:45:54,780
problem so the woman is reaching her
887
00:45:52,080 --> 00:45:58,740
hand forward and thus the bounding
888
00:45:54,780 --> 00:46:01,890
box that covers her has a lot
889
00:45:58,740 --> 00:46:04,740
of wasted space that we and we will
890
00:46:01,890 --> 00:46:06,420
waste our pooled features on these on
891
00:46:04,740 --> 00:46:10,609
the features of the of this background
892
00:46:06,420 --> 00:46:12,960
and it speaks for itself I think and
893
00:46:10,610 --> 00:46:17,840
regarding how this can be useful for
894
00:46:12,960 --> 00:46:17,840
medical applications I think that yeah
895
00:46:25,140 --> 00:46:31,740
yeah sure it's great that you asked
896
00:46:29,430 --> 00:46:33,509
because this is like the basics so
897
00:46:31,740 --> 00:46:35,640
this is the most important thing
898
00:46:33,510 --> 00:46:37,530
that everyone will learn so I
899
00:46:35,640 --> 00:46:39,810
will explain again how from the 18
900
00:46:37,530 --> 00:46:41,070
from the 18 numbers
901
00:46:39,810 --> 00:46:43,170
that are the output of the fully
902
00:46:41,070 --> 00:46:48,990
connected layer you can get nine different
903
00:46:43,170 --> 00:46:50,580
bounding boxes okay so by the way it
904
00:46:48,990 --> 00:46:52,680
should I explain it again also for the
905
00:46:50,580 --> 00:47:00,960
deformable convolutions or just for the
906
00:46:52,680 --> 00:47:07,230
deformable Roi polling okay so when you
907
00:47:00,960 --> 00:47:11,370
get 18 numbers the yellow 3x3 structure
908
00:47:07,230 --> 00:47:13,710
here is the original grid each sub
909
00:47:11,370 --> 00:47:18,810
square each of these nine sub squares
910
00:47:13,710 --> 00:47:21,870
are the original bins the original 3x3
911
00:47:18,810 --> 00:47:23,670
bins of the original proposal and we
912
00:47:21,870 --> 00:47:25,500
know the location of
913
00:47:23,670 --> 00:47:27,960
the center for each bin
914
00:47:25,500 --> 00:47:31,110
it can be calculated easily so
915
00:47:27,960 --> 00:47:33,390
now I have their center and I
916
00:47:31,110 --> 00:47:35,670
have two additional numbers I have the
917
00:47:33,390 --> 00:47:37,740
four this top left square I have did
918
00:47:35,670 --> 00:47:42,210
its horizontal offset for example if
919
00:47:37,740 --> 00:47:44,520
the offset is minus 2.5 then I know that
920
00:47:42,210 --> 00:47:48,060
the new center for the top left
921
00:47:44,520 --> 00:47:50,370
bin will be placed at a minus 2.5
922
00:47:48,060 --> 00:47:52,200
offset in the X the horizontal axis
923
00:47:50,370 --> 00:47:55,529
compared to the original centre of that
924
00:47:52,200 --> 00:47:58,200
bin and I take the value from this
925
00:47:55,530 --> 00:48:02,550
square which represents the vertical
926
00:47:58,200 --> 00:48:04,529
offset and if the value is minus 3.1
927
00:48:02,550 --> 00:48:08,430
then I know that the center will be
928
00:48:04,530 --> 00:48:11,160
located at a minus 2.5 offset in the
929
00:48:08,430 --> 00:48:13,950
horizontal axis and minus 3.1 in the
930
00:48:11,160 --> 00:48:16,620
vertical axis
931
00:48:13,950 --> 00:48:20,250
and I repeat this process
932
00:48:16,620 --> 00:48:22,589
it's performed in a vectorized
933
00:48:20,250 --> 00:48:25,560
operation but this process you can
934
00:48:22,590 --> 00:48:27,630
imagine that it is repeated 9 times for
935
00:48:25,560 --> 00:48:30,779
each of these bins and that way we can
936
00:48:27,630 --> 00:48:32,880
get the new sampling points of
937
00:48:30,780 --> 00:48:37,400
our grid do you think it was more
938
00:48:32,880 --> 00:48:37,400
understandable right now ok great
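The step just walked through, going from 18 predicted numbers to 9 shifted bin centers, can be sketched in plain Python. The function name, the 10-pixel bin size, and the example offsets are illustrative only:

```python
def shifted_bin_centers(bin_centers, offsets):
    """bin_centers: 9 (x, y) tuples for the regular 3x3 grid of a proposal.
    offsets: flat list of 18 numbers, one (dx, dy) pair per bin, as output
    by the fully connected layer. Returns the 9 new sampling centers."""
    assert len(bin_centers) == 9 and len(offsets) == 18
    new_centers = []
    for i, (cx, cy) in enumerate(bin_centers):
        dx, dy = offsets[2 * i], offsets[2 * i + 1]
        new_centers.append((cx + dx, cy + dy))
    return new_centers

# Regular 3x3 grid over a proposal whose bins are 10 pixels wide:
grid = [(x, y) for y in (5.0, 15.0, 25.0) for x in (5.0, 15.0, 25.0)]
# Shift only the top-left bin by (-2.5, -3.1), as in the spoken example:
offs = [-2.5, -3.1] + [0.0] * 16
print(shifted_bin_centers(grid, offs)[0])  # top-left centre moves left and up
```

As noted in the talk, the real implementation does this as one vector operation rather than a loop over bins.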
939
00:48:37,730 --> 00:48:44,670
so if we look at the medical case so I
940
00:48:41,880 --> 00:48:46,440
will come back to the same example I
941
00:48:44,670 --> 00:48:50,820
showed earlier and I think this
942
00:48:46,440 --> 00:48:53,850
demonstrates pretty well how one finding
943
00:48:50,820 --> 00:48:55,830
that can be detected pretty nicely but
944
00:48:53,850 --> 00:48:59,220
when you want to classify it if I just
945
00:48:55,830 --> 00:49:00,569
took this RoI and I pooled it and put
946
00:48:59,220 --> 00:49:03,600
it through the second stage of my
947
00:49:00,570 --> 00:49:05,730
detector then most of the information
948
00:49:03,600 --> 00:49:08,640
most of the features that will be pooled
949
00:49:05,730 --> 00:49:11,910
will be healthy pixels healthy
950
00:49:08,640 --> 00:49:14,580
brain pixels so it increases the chances
951
00:49:11,910 --> 00:49:17,520
that like the classifier of the second
952
00:49:14,580 --> 00:49:20,970
stage will misclassify this example as
953
00:49:17,520 --> 00:49:24,440
a healthy example so if I use the
954
00:49:20,970 --> 00:49:28,169
deformable RoI pooling it
955
00:49:24,440 --> 00:49:30,240
naturally covers the object
956
00:49:28,170 --> 00:49:34,280
the interesting object in a much tighter
957
00:49:30,240 --> 00:49:42,569
way and then the features in
958
00:49:34,280 --> 00:49:46,410
the pooled RoI
959
00:49:42,570 --> 00:49:49,950
are much more relevant
960
00:49:46,410 --> 00:49:51,960
to the non-healthy pixels in the
961
00:49:49,950 --> 00:50:01,919
image do you think this answers your
962
00:49:51,960 --> 00:50:06,119
question from before okay great yeah the
963
00:50:01,920 --> 00:50:10,650
final region do you mean the final
964
00:50:06,119 --> 00:50:12,810
prediction of the model the final
965
00:50:10,650 --> 00:50:15,359
prediction of the model will still be
966
00:50:12,810 --> 00:50:16,080
this this yellow square this yellow
967
00:50:15,359 --> 00:50:18,090
rectangle
968
00:50:16,080 --> 00:50:20,549
the original proposal of the yellow
969
00:50:18,090 --> 00:50:22,560
rectangle or a single
970
00:50:20,550 --> 00:50:24,180
rectangle it will not be the
971
00:50:22,560 --> 00:50:26,520
original proposal it will probably be
972
00:50:24,180 --> 00:50:28,379
refined by the second stage but it will
973
00:50:26,520 --> 00:50:31,080
be something like this original proposal
974
00:50:28,380 --> 00:50:34,080
but these nine different bounding
975
00:50:31,080 --> 00:50:35,730
boxes help us classify and refine the
976
00:50:34,080 --> 00:50:38,000
coordinates of this bounding box much
977
00:50:35,730 --> 00:50:38,000
better
978
00:50:38,420 --> 00:50:41,530
[Music]
979
00:50:43,430 --> 00:50:58,379
sorry you said yes not at the end of the
980
00:50:56,820 --> 00:51:00,720
first stage at the beginning you have a
981
00:50:58,380 --> 00:51:04,800
normal region proposal layer you get
982
00:51:00,720 --> 00:51:07,350
the proposal and then you take
983
00:51:04,800 --> 00:51:08,220
this proposal you do regular RoI pooling
984
00:51:07,350 --> 00:51:11,490
on this proposal
985
00:51:08,220 --> 00:51:13,680
you put the pooled original proposal into
986
00:51:11,490 --> 00:51:15,870
a fully connected layer and then you get
987
00:51:13,680 --> 00:51:21,049
the offsets that enable you to
988
00:51:15,870 --> 00:51:24,330
locate these red rectangles on the image
989
00:51:21,050 --> 00:51:26,430
the features
990
00:51:24,330 --> 00:51:30,660
under these rectangles are passed to the
991
00:51:26,430 --> 00:51:32,819
second stage of the detector the second
992
00:51:30,660 --> 00:51:36,629
stage uses these features to classify
993
00:51:32,820 --> 00:51:38,520
the original yellow rectangle and that
994
00:51:36,630 --> 00:51:40,800
is the final output it doesn't matter
995
00:51:38,520 --> 00:51:43,470
where these red rectangles will be
996
00:51:40,800 --> 00:51:46,380
placed the final output of this entire
997
00:51:43,470 --> 00:51:49,169
detector will still be something like a
998
00:51:46,380 --> 00:51:54,350
single yellow rectangle around this
999
00:51:49,170 --> 00:51:58,830
proposal okay okay great
1000
00:51:54,350 --> 00:52:05,310
some best practices so first of all they
1001
00:51:58,830 --> 00:52:06,870
tried it on the last layers only so
1002
00:52:05,310 --> 00:52:09,240
the meaning of that is that they only
1003
00:52:06,870 --> 00:52:11,460
try to use this to model
1004
00:52:09,240 --> 00:52:13,740
deformation in high level features you
1005
00:52:11,460 --> 00:52:17,490
can think about it it's very intuitive when
1006
00:52:13,740 --> 00:52:18,990
you look at something like this change
1007
00:52:17,490 --> 00:52:20,910
of pose and you have a feature
1008
00:52:18,990 --> 00:52:22,890
for a hand and
1009
00:52:20,910 --> 00:52:27,420
a feature for a torso and that way you
1010
00:52:22,890 --> 00:52:33,690
can model their locations in an
1011
00:52:27,420 --> 00:52:35,760
irregular way and they tried to use it on
1012
00:52:33,690 --> 00:52:37,500
more than the three last layers and it
1013
00:52:35,760 --> 00:52:40,600
gave them diminishing
1014
00:52:37,500 --> 00:52:45,340
returns so they use ResNet-101
1015
00:52:40,600 --> 00:52:48,009
in this paper and this is the end of
1016
00:52:45,340 --> 00:52:51,640
the ResNet-101 these are the last 26
1017
00:52:48,010 --> 00:52:54,820
layers so this is the last ResNet
1018
00:52:51,640 --> 00:53:00,190
block and this is the one before it
1019
00:52:54,820 --> 00:53:02,320
so a large amount of the convolutions
1020
00:53:00,190 --> 00:53:04,330
in ResNet are 1x1 convolutions and of
1021
00:53:02,320 --> 00:53:05,770
course it's probably less interesting to
1022
00:53:04,330 --> 00:53:08,170
do something deformable with the
1023
00:53:05,770 --> 00:53:11,650
location of these convolutions so
1024
00:53:08,170 --> 00:53:14,980
you have three blocks like this so
1025
00:53:11,650 --> 00:53:17,380
the optimal configuration was to
1026
00:53:14,980 --> 00:53:19,630
put the deformable convolution on each
1027
00:53:17,380 --> 00:53:21,250
of these 3x3 convolutions and when they
1028
00:53:19,630 --> 00:53:24,370
tried it on some of the convolutions out
1029
00:53:21,250 --> 00:53:27,670
of these 23 it didn't give them too much
1030
00:53:24,370 --> 00:53:30,069
value in addition what's really amazing
1031
00:53:27,670 --> 00:53:32,620
about this solution is that it answered
1032
00:53:30,070 --> 00:53:35,440
our requirement of not adding a lot of
1033
00:53:32,620 --> 00:53:37,120
complexity to our model and the number
1034
00:53:35,440 --> 00:53:40,090
of parameters in the network barely
1035
00:53:37,120 --> 00:53:42,730
increased and also the inference time
1036
00:53:40,090 --> 00:53:50,340
for a single image didn't increase
1037
00:53:42,730 --> 00:53:51,730
significantly which is very nice
1038
00:53:50,340 --> 00:53:56,620
right
1039
00:53:51,730 --> 00:54:01,210
so yeah their work is on 2D images
1040
00:53:56,620 --> 00:54:04,870
so of course but yeah adapting it to
1041
00:54:01,210 --> 00:54:08,200
3D is more advanced
1042
00:54:04,870 --> 00:54:10,540
just like 3d convolutions also require a
1043
00:54:08,200 --> 00:54:14,770
bit further explanations on how to work
1044
00:54:10,540 --> 00:54:20,529
with them and this benefit of a very
1045
00:54:14,770 --> 00:54:22,090
efficient and quick inference is I
1046
00:54:20,530 --> 00:54:24,370
think due to the layers being
1047
00:54:22,090 --> 00:54:26,500
implemented in CUDA so they implemented
1048
00:54:24,370 --> 00:54:28,960
them in CUDA and it's important that you
1049
00:54:26,500 --> 00:54:31,270
know that their original CUDA
1050
00:54:28,960 --> 00:54:33,400
implementation is open-source it was
1051
00:54:31,270 --> 00:54:35,890
originally intended to be used with
1052
00:54:33,400 --> 00:54:38,170
MXNet but already several repositories
1053
00:54:35,890 --> 00:54:42,009
around the internet adapted them to use
1054
00:54:38,170 --> 00:54:44,080
with other frameworks such as
1055
00:54:42,010 --> 00:54:46,000
TensorFlow and if you work with
1056
00:54:44,080 --> 00:54:46,569
Keras then this will also work for you
1057
00:54:46,000 --> 00:54:49,020
et cetera
1058
00:54:46,570 --> 00:54:49,020
yeah
1059
00:54:57,280 --> 00:55:08,710
convolution okay so how can
1060
00:55:06,849 --> 00:55:10,180
it be that it had so few parameters so
1061
00:55:08,710 --> 00:55:14,140
for example because they only do it on
1062
00:55:10,180 --> 00:55:17,290
three convolutions so yeah that's part
1063
00:55:14,140 --> 00:55:19,720
of the reason I guess yeah but
1064
00:55:17,290 --> 00:55:22,000
we can sit on it later and you
1065
00:55:19,720 --> 00:55:25,270
can see easily that that's the number
1066
00:55:22,000 --> 00:55:27,940
of parameters that it adds yeah let me
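A back-of-envelope count of the extra parameters makes the point. The shapes below are assumptions for illustration (ResNet-101 res5 3x3 convs with 512 input channels, one offset branch per deformable layer); the function name is made up:

```python
def offset_conv_params(in_channels, kernel=3):
    """Extra weights added by the offset branch of one deformable conv.
    The offset branch is itself a kernel x kernel conv whose output has
    2 * kernel * kernel channels (one (dx, dy) pair per kernel tap)."""
    out_channels = 2 * kernel * kernel                    # 18 for a 3x3 kernel
    return in_channels * out_channels * kernel * kernel   # ignoring the bias

extra = 3 * offset_conv_params(512)   # assume three deformable 3x3 convs
print(extra)                          # roughly a quarter million extra weights
```

Against the tens of millions of parameters in ResNet-101, this is why the parameter count barely moves.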
1067
00:55:25,270 --> 00:55:34,720
finish just like two slides and then
1068
00:55:27,940 --> 00:55:39,190
we can have more questions so as we
1069
00:55:34,720 --> 00:55:41,319
expected the receptive field is affected
1070
00:55:39,190 --> 00:55:43,420
by the object size just like we saw
1071
00:55:41,320 --> 00:55:50,710
in the intuitive cherry-picked
1072
00:55:43,420 --> 00:55:53,080
examples in the beginning they
1073
00:55:50,710 --> 00:55:54,970
analyzed all of the objects in the data
1074
00:55:53,080 --> 00:55:58,119
set or many objects in their data set and
1075
00:55:54,970 --> 00:55:59,560
they checked what is the
1076
00:55:58,119 --> 00:56:01,359
receptive field of the deformable
1077
00:55:59,560 --> 00:56:03,549
convolutions for the small objects the
1078
00:56:01,359 --> 00:56:06,310
medium objects the large objects and
1079
00:56:03,550 --> 00:56:10,530
and when the convolution is on top
1080
00:56:06,310 --> 00:56:13,210
of the background and they saw that the
1081
00:56:10,530 --> 00:56:16,119
for the large objects and for the
1082
00:56:13,210 --> 00:56:18,339
background the offsets were larger and
1083
00:56:16,119 --> 00:56:20,619
the receptive fields were largest just
1084
00:56:18,339 --> 00:56:25,390
as we would expect intuitively from this
1085
00:56:20,619 --> 00:56:29,250
mechanism but what if maybe the
1086
00:56:25,390 --> 00:56:32,170
deformable part sampling the
1087
00:56:29,250 --> 00:56:33,880
image in an irregular grid
1088
00:56:32,170 --> 00:56:35,950
maybe it's not really important maybe
1089
00:56:33,880 --> 00:56:38,609
the only thing that's important here is
1090
00:56:35,950 --> 00:56:40,509
just the dilation is just sampling
1091
00:56:38,609 --> 00:56:43,270
further context
1092
00:56:40,510 --> 00:56:45,820
so using dilated convolutions in the
1093
00:56:43,270 --> 00:56:47,859
last layers of the feature
1094
00:56:45,820 --> 00:56:50,589
extractors that are used for
1095
00:56:47,859 --> 00:56:54,069
detection is already standard practice
1096
00:56:50,589 --> 00:56:58,570
in detection networks and it does
1097
00:56:54,070 --> 00:57:01,089
improve the performance but usually it's
1098
00:56:58,570 --> 00:57:03,400
used with the I'm sorry I'll explain
1099
00:57:01,089 --> 00:57:04,960
what are dilated convolutions for a
1100
00:57:03,400 --> 00:57:07,330
second so this is the standard
1101
00:57:04,960 --> 00:57:09,760
convolution and dilated convolution is
1102
00:57:07,330 --> 00:57:10,930
just keeping it a square but putting
1103
00:57:09,760 --> 00:57:12,910
holes between the
1104
00:57:10,930 --> 00:57:15,730
sampling points so this is with a
1105
00:57:12,910 --> 00:57:19,390
hole size of 1 and here
1106
00:57:15,730 --> 00:57:25,750
the hole size is 2 or it's also called
1107
00:57:19,390 --> 00:57:27,670
the dilation rate and usually people use
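The dilation idea just defined is easy to see as sampling offsets. A minimal sketch (the function name is made up; real frameworks expose this as a `dilation` argument on the conv layer):

```python
def conv_offsets(kernel=3, dilation=1):
    """(dy, dx) sampling offsets of a kernel x kernel convolution.
    dilation=1 is the standard dense kernel; larger rates insert holes
    between the taps, widening the receptive field for free."""
    half = (kernel - 1) // 2
    return [(dy * dilation, dx * dilation)
            for dy in range(-half, half + 1)
            for dx in range(-half, half + 1)]

print(conv_offsets(3, 1)[:3])  # [(-1, -1), (-1, 0), (-1, 1)]
print(conv_offsets(3, 2)[:3])  # [(-2, -2), (-2, 0), (-2, 2)]
```

Note that both versions still have exactly 9 taps; dilation changes where the taps land, not how many weights there are, which is the same trick deformable convolution generalizes with learned offsets.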
1108
00:57:25,750 --> 00:57:30,579
a dilation size of 2 and they show
1109
00:57:27,670 --> 00:57:32,500
that for their COCO applications in
1110
00:57:30,579 --> 00:57:35,140
some segmentation applications it's even
1111
00:57:32,500 --> 00:57:39,490
more optimal to use a dilation size of 4
1112
00:57:35,140 --> 00:57:41,920
and 6 but it depends the optimal
1113
00:57:39,490 --> 00:57:43,779
dilation depends on your architecture it
1114
00:57:41,920 --> 00:57:46,300
depends on your specific image it even
1115
00:57:43,780 --> 00:57:48,640
depends on your specific object because
1116
00:57:46,300 --> 00:57:53,550
as we saw for large objects and small
1117
00:57:48,640 --> 00:57:57,220
objects the dilation is different so
1118
00:57:53,550 --> 00:58:01,150
even if the only important thing here is
1119
00:57:57,220 --> 00:58:03,149
the dilation still a solution like that
1120
00:58:01,150 --> 00:58:07,839
will be desirable because the dilation
1121
00:58:03,150 --> 00:58:09,670
is learned and adapted to the local
1122
00:58:07,839 --> 00:58:13,720
parts of the image and to the object
1123
00:58:09,670 --> 00:58:16,030
that you are looking at it's a
1124
00:58:13,720 --> 00:58:18,189
generalization of convolutions in
1125
00:58:16,030 --> 00:58:27,280
general and of the dilated convolution
1126
00:58:18,190 --> 00:58:29,380
specifically yeah what sorry on acid
1127
00:58:27,280 --> 00:58:32,950
yeah dilated convolution on acid that
1128
00:58:29,380 --> 00:58:39,369
will be the title of my next talk I love
1129
00:58:32,950 --> 00:58:43,720
it so but they showed that if you use
1130
00:58:39,369 --> 00:58:46,839
their method then it
1131
00:58:43,720 --> 00:58:49,500
it improves even on the most optimal
1132
00:58:46,839 --> 00:58:52,540
configuration that they could use when
1133
00:58:49,500 --> 00:58:54,520
with just dilated convolution so it even
1134
00:58:52,540 --> 00:58:57,190
improved the results further and it
1135
00:58:54,520 --> 00:58:59,319
didn't require any manual tuning or
1136
00:58:57,190 --> 00:59:03,160
hyperparameter sweep of the dilation
1137
00:58:59,319 --> 00:59:05,589
hyperparameter ok so these were
1138
00:59:03,160 --> 00:59:08,410
deformable convolutions now we have two
1139
00:59:05,589 --> 00:59:10,540
choices one choice is to move on to the
1140
00:59:08,410 --> 00:59:13,058
next paper which can be feature pyramid
1141
00:59:10,540 --> 00:59:15,460
networks or focal loss and the second
1142
00:59:13,059 --> 00:59:16,930
choice can be to like answer questions
1143
00:59:15,460 --> 00:59:21,089
so I don't mind
1144
00:59:16,930 --> 00:59:21,089
either you decide or we can have a vote
1145
00:59:24,220 --> 00:59:34,430
okay so any anyone that has more
1146
00:59:32,240 --> 00:59:36,370
questions can of course come to me later
1147
00:59:34,430 --> 00:59:40,009
and ask them or send me an email or
1148
00:59:36,370 --> 00:59:43,730
whatever you want so feature pyramid
1149
00:59:40,010 --> 00:59:46,520
networks this is the paper
1150
00:59:43,730 --> 00:59:48,980
by the Facebook AI Research group one of the
1151
00:59:46,520 --> 00:59:55,280
best object detection and deep learning
1152
00:59:48,980 --> 00:59:57,890
teams in the world and it's as I told
1153
00:59:55,280 --> 01:00:01,850
you the previous paper tried to answer the
1154
00:59:57,890 --> 01:00:03,200
challenge of the anatomy and the
1155
01:00:01,850 --> 01:00:05,029
pathologies the medical
1156
01:00:03,200 --> 01:00:07,220
pathologies that we are trying to look
1157
01:00:05,030 --> 01:00:09,260
for the previous paper tried to answer
1158
01:00:07,220 --> 01:00:12,620
the problem that they
1159
01:00:09,260 --> 01:00:14,720
have an irregular shape and a
1160
01:00:12,620 --> 01:00:17,390
deformable shape and this paper tries to
1161
01:00:14,720 --> 01:00:21,470
answer the problem of the object being
1162
01:00:17,390 --> 01:00:23,210
small so this is a spine fracture a
1163
01:00:21,470 --> 01:00:26,029
fracture in the spine you can see it
1164
01:00:23,210 --> 01:00:28,370
here and what is
1165
01:00:26,030 --> 01:00:30,200
the problem with
1166
01:00:28,370 --> 01:00:32,720
small objects why are they difficult
1167
01:00:30,200 --> 01:00:37,339
for neural networks so just as an
1168
01:00:32,720 --> 01:00:41,689
intuition if we perform max pooling
1169
01:00:37,340 --> 01:00:44,390
and we have several
1170
01:00:41,690 --> 01:00:46,460
neurons next to each
1171
01:00:44,390 --> 01:00:48,259
other and this neuron's receptive field
1172
01:00:46,460 --> 01:00:50,390
covers mainly this area and the
1173
01:00:48,260 --> 01:00:52,550
receptive field of this neuron covers
1174
01:00:50,390 --> 01:00:54,560
the fracture and the receptive field of
1175
01:00:52,550 --> 01:00:57,890
this neuron covers this area of the bone
1176
01:00:54,560 --> 01:01:00,950
then even if we have really
1177
01:00:57,890 --> 01:01:02,900
good features and we know that here this
1178
01:01:00,950 --> 01:01:04,759
neuron indicates that there is a
1179
01:01:02,900 --> 01:01:06,590
bone under it and here it indicates
1180
01:01:04,760 --> 01:01:09,050
there is a fracture under it and here it
1181
01:01:06,590 --> 01:01:11,900
indicates there is a bone under it after
1182
01:01:09,050 --> 01:01:14,390
we perform max pooling on the three of
1183
01:01:11,900 --> 01:01:16,850
these neurons we lose the
1184
01:01:14,390 --> 01:01:19,670
spatial order between them we know there
1185
01:01:16,850 --> 01:01:23,299
is bones there are bones there and we
1186
01:01:19,670 --> 01:01:25,220
know there is something like a hole
1187
01:01:23,300 --> 01:01:27,680
there but the hole is not necessarily a
1188
01:01:25,220 --> 01:01:29,660
fracture and we need to understand the
1189
01:01:27,680 --> 01:01:32,480
spatial structure between the things in
1190
01:01:29,660 --> 01:01:33,390
order to really classify fractures so
1191
01:01:32,480 --> 01:01:36,600
this is
1192
01:01:33,390 --> 01:01:38,910
one intuition for why convolutional
1193
01:01:36,600 --> 01:01:42,960
neural networks have problems with small
1194
01:01:38,910 --> 01:01:44,850
objects this is not the only reason by
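The max-pooling intuition above in miniature: pooling keeps the strongest activation but discards the spatial order that distinguishes a fracture between two bone responses from any other arrangement of the same values. The activation numbers are made up for illustration:

```python
# Toy activations of a "bone" feature along three neighbouring neurons.
bone, fracture = 0.9, 0.2
pattern_a = [bone, fracture, bone]   # a gap between two bones: a fracture
pattern_b = [fracture, bone, bone]   # the same values in a harmless order
print(max(pattern_a) == max(pattern_b))  # True: pooling cannot tell them apart
```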
1195
01:01:42,960 --> 01:01:48,210
the way another reason is class
1196
01:01:44,850 --> 01:01:51,630
imbalance small objects are more
1197
01:01:48,210 --> 01:01:53,820
underrepresented in our data and this
1198
01:01:51,630 --> 01:01:58,170
is also another reason but this paper
1199
01:01:53,820 --> 01:02:01,800
deals with the problem of not having
1200
01:01:58,170 --> 01:02:05,340
good enough features that describe the
1201
01:02:01,800 --> 01:02:10,620
spatial locations so the motivation for
1202
01:02:05,340 --> 01:02:18,210
this paper is weak
1203
01:02:10,620 --> 01:02:20,370
a lot of papers before did such things
1204
01:02:18,210 --> 01:02:23,100
similar to that maybe instead of just
1205
01:02:20,370 --> 01:02:25,770
predicting using the deepest feature map
1206
01:02:23,100 --> 01:02:28,110
maybe we can combine somehow feature
1207
01:02:25,770 --> 01:02:31,850
maps from several depths and a lot of
1208
01:02:28,110 --> 01:02:31,850
papers did it before and they also show
1209
01:02:31,880 --> 01:02:41,700
like DenseNet as well DenseNet
1210
01:02:35,070 --> 01:02:43,530
also did something like that but in
1211
01:02:41,700 --> 01:02:45,149
DenseNet it's part of the
1212
01:02:43,530 --> 01:02:47,700
architecture you can say that
1213
01:02:45,150 --> 01:02:49,980
ResNet also does it because
1214
01:02:47,700 --> 01:02:52,980
ResNet also has skip connections but I
1215
01:02:49,980 --> 01:02:54,870
mean that you take your final feature
1216
01:02:52,980 --> 01:03:04,740
map that you predict from we'll see it
1217
01:02:54,870 --> 01:03:07,620
in a minute okay so we can assume
1218
01:03:04,740 --> 01:03:09,959
that features from shallower feature
1219
01:03:07,620 --> 01:03:13,470
maps can be important to classify small
1220
01:03:09,960 --> 01:03:15,780
objects because they were developed
1221
01:03:13,470 --> 01:03:18,540
before doing too much max pooling so
1222
01:03:15,780 --> 01:03:21,270
they lost less spatial information so it
1223
01:03:18,540 --> 01:03:24,960
would be desirable to use them
1224
01:03:21,270 --> 01:03:28,800
when we classify small objects but we
1225
01:03:24,960 --> 01:03:31,680
miss them because the network we use
1226
01:03:28,800 --> 01:03:34,290
only the last layer and we could say
1227
01:03:31,680 --> 01:03:39,270
that we can hope that the network
1228
01:03:34,290 --> 01:03:41,430
will be smart enough to develop good
1229
01:03:39,270 --> 01:03:43,080
enough features before it does max
1230
01:03:41,430 --> 01:03:45,029
pooling to identify that this is the
1231
01:03:43,080 --> 01:03:46,100
fracture before it does max pooling
1232
01:03:45,030 --> 01:03:50,540
while it
1233
01:03:46,100 --> 01:03:54,950
so while it has the lower
1234
01:03:50,540 --> 01:03:57,920
level features and we can hope that it
1235
01:03:54,950 --> 01:03:59,870
will work but as we know with neural
1236
01:03:57,920 --> 01:04:01,880
networks a lot of times if you don't
1237
01:03:59,870 --> 01:04:04,310
force the network and you don't encode
1238
01:04:01,880 --> 01:04:06,680
your prior knowledge of the problem into
1239
01:04:04,310 --> 01:04:08,450
the architecture's design then the neural
1240
01:04:06,680 --> 01:04:10,370
networks don't behave optimally and
1241
01:04:08,450 --> 01:04:14,000
although they could fit many types of
1242
01:04:10,370 --> 01:04:15,830
functions they tend to fit not the
1243
01:04:14,000 --> 01:04:18,430
optimal functions unless you encode your
1244
01:04:15,830 --> 01:04:21,319
prior information into the design so
1245
01:04:18,430 --> 01:04:24,440
there are a lot of ways to combine
1246
01:04:21,320 --> 01:04:26,180
the shallower layers and the
1247
01:04:24,440 --> 01:04:27,140
shallower feature maps but this is
1248
01:04:26,180 --> 01:04:29,629
currently the most popular
1249
01:04:27,140 --> 01:04:31,700
implementation and the reason
1250
01:04:29,630 --> 01:04:33,860
that I feel confident to say it is that
1251
01:04:31,700 --> 01:04:38,180
in the last COCO object detection
1252
01:04:33,860 --> 01:04:40,460
competition in the end of 2017 all four
1253
01:04:38,180 --> 01:04:42,980
top competitors all four top submissions
1254
01:04:40,460 --> 01:04:45,920
used feature pyramid networks as a major
1255
01:04:42,980 --> 01:04:47,990
component of their submission and it
1256
01:04:45,920 --> 01:04:51,290
improves object detection accuracy by
1257
01:04:47,990 --> 01:04:53,720
about ten percent so what I really love
1258
01:04:51,290 --> 01:04:57,110
about this paper is that it puts
1259
01:04:53,720 --> 01:05:04,220
simplicity and elegance as a major part
1260
01:04:57,110 --> 01:05:08,120
of their work so the first element of it
1261
01:05:04,220 --> 01:05:11,000
is they said we already know that in
1262
01:05:08,120 --> 01:05:13,190
convolutional neural networks if we use
1263
01:05:11,000 --> 01:05:18,190
image pyramids some of you may be know
1264
01:05:13,190 --> 01:05:21,260
it is test time multi scale so I'll
1265
01:05:18,190 --> 01:05:23,720
explain what it is in a minute if we use
1266
01:05:21,260 --> 01:05:27,200
an image pyramid it really improves our
1267
01:05:23,720 --> 01:05:29,390
way to deal with smaller objects so what
1268
01:05:27,200 --> 01:05:32,180
is an image pyramid we take the original
1269
01:05:29,390 --> 01:05:34,190
image and we scale it up to many sizes
1270
01:05:32,180 --> 01:05:36,649
or scale it up and down to many sizes
1271
01:05:34,190 --> 01:05:38,210
and then small objects appear larger and
1272
01:05:36,650 --> 01:05:40,760
are less affected by the pooling
1273
01:05:38,210 --> 01:05:43,520
operation and it has several several
1274
01:05:40,760 --> 01:05:46,520
other advantages and then we pass if we
1275
01:05:43,520 --> 01:05:48,650
use where's at 101 for example we pass
1276
01:05:46,520 --> 01:05:50,750
each of these scaled images separately
1277
01:05:48,650 --> 01:05:53,060
and nominally we have like 10
1278
01:05:50,750 --> 01:05:55,520
sizes for example we pass each of these
1279
01:05:53,060 --> 01:05:59,390
10 images separately through the 101
1280
01:05:55,520 --> 01:06:00,950
layers and get separate predictions for
1281
01:05:59,390 --> 01:06:03,830
each of them
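The test-time image-pyramid scheme just described can be sketched as a loop over scales. Everything here is illustrative: `rescale` and `toy_detector` are placeholder stand-ins for real image resizing and a real detector, not the speaker's code:

```python
def rescale(image, s):
    # Placeholder: real code would resize the pixels; here we just tag
    # the image with its scale factor so the toy detector can use it.
    return (image, s)

def toy_detector(scaled):
    # Pretend detector: returns one box in the scaled image's coordinates.
    _, s = scaled
    return [(10 * s, 10 * s, 20 * s, 20 * s, 0.9)]

def detect_multiscale(image, detector, scales=(0.5, 1.0, 2.0)):
    """Run the detector on each rescaled copy and map every box back to
    original-image coordinates by dividing out the scale factor."""
    boxes = []
    for s in scales:
        for (x0, y0, x1, y1, score) in detector(rescale(image, s)):
            boxes.append((x0 / s, y0 / s, x1 / s, y1 / s, score))
    return boxes

print(detect_multiscale("img", toy_detector))
```

The cost problem the talk points out is visible in the loop: one full forward pass of the big network per scale, which is exactly what FPN avoids.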
1282
01:06:00,950 --> 01:06:06,109
and this works quite well the problem
1283
01:06:03,830 --> 01:06:07,640
with it is it's not really feasible for
1284
01:06:06,110 --> 01:06:10,250
most applications because
1285
01:06:07,640 --> 01:06:13,509
it requires a lot of time to make so
1286
01:06:10,250 --> 01:06:16,250
many forward passes of large networks so
1287
01:06:13,510 --> 01:06:18,590
they try to imitate or take their
1288
01:06:16,250 --> 01:06:20,660
intuition from image pyramids which
1289
01:06:18,590 --> 01:06:23,540
has already proven itself and in many of
1290
01:06:20,660 --> 01:06:25,609
the design choices in the paper they try
1291
01:06:23,540 --> 01:06:27,529
to just instead of inventing something
1292
01:06:25,610 --> 01:06:29,480
from scratch they try to imitate something
1293
01:06:27,530 --> 01:06:32,480
that already exists and is known to work
1294
01:06:29,480 --> 01:06:35,510
and you can see that even in the diagram
1295
01:06:32,480 --> 01:06:37,820
it looks quite similar to image pyramid
1296
01:06:35,510 --> 01:06:40,820
and we'll talk more about it in a few
1297
01:06:37,820 --> 01:06:43,760
minutes so you can use feature pyramid
1298
01:06:40,820 --> 01:06:45,620
networks as a part of the RPN the region
1299
01:06:43,760 --> 01:06:47,810
proposal network the first stage of the
1300
01:06:45,620 --> 01:06:51,350
Faster R-CNN and as part of the
1301
01:06:47,810 --> 01:06:53,870
second stage of the detection network so
1302
01:06:51,350 --> 01:06:55,970
there it is combined differently into
1303
01:06:53,870 --> 01:06:57,470
these two parts of the network and we'll
1304
01:06:55,970 --> 01:07:01,310
talk about each of them separately so
1305
01:06:57,470 --> 01:07:02,930
for combining it into the RPN and now
1306
01:07:01,310 --> 01:07:07,610
you will understand what feature pyramid
1307
01:07:02,930 --> 01:07:09,740
networks actually do we take the image
1308
01:07:07,610 --> 01:07:12,080
and put it through our feature extractor
1309
01:07:09,740 --> 01:07:17,080
this is our feature extractor for
1310
01:07:12,080 --> 01:07:19,310
example ResNet-101 and then we have
1311
01:07:17,080 --> 01:07:21,290
something that they call a lateral
1312
01:07:19,310 --> 01:07:23,720
connection which is a one-by-one
1313
01:07:21,290 --> 01:07:26,120
convolution a one-by-one convolution
1314
01:07:23,720 --> 01:07:30,379
that keeps this feature map the same
1315
01:07:26,120 --> 01:07:33,529
size but transforms it to have 256
1316
01:07:30,380 --> 01:07:36,530
features and then we take this new
1317
01:07:33,530 --> 01:07:39,070
feature map and we up sample it using
1318
01:07:36,530 --> 01:07:42,620
nearest neighbor up sampling with a
1319
01:07:39,070 --> 01:07:45,200
scale factor of 2 in each dimension so
1320
01:07:42,620 --> 01:07:47,390
we enlarge it and since our pulling in
1321
01:07:45,200 --> 01:07:50,270
the original fixed feature extractor was
1322
01:07:47,390 --> 01:07:52,609
also in with a factor of 2 then now the
1323
01:07:50,270 --> 01:07:55,730
scaled up feature map is the same size
1324
01:07:52,610 --> 01:07:57,860
of the feature maps of this for the
1325
01:07:55,730 --> 01:08:00,950
previous stage in the feature extractor
1326
01:07:57,860 --> 01:08:04,100
so now we can choose a
1327
01:08:00,950 --> 01:08:05,839
single feature map from the previous
1328
01:08:04,100 --> 01:08:09,589
pooling stage in the feature extractor
1329
01:08:05,840 --> 01:08:11,480
and we can combine them using a one by
1330
01:08:09,590 --> 01:08:12,900
one convolution on this feature map and
1331
01:08:11,480 --> 01:08:15,830
summation of the
1332
01:08:12,900 --> 01:08:19,170
featuring us not concatenation summation
1333
01:08:15,830 --> 01:08:20,790
and we repeat this process several times
1334
01:08:19,170 --> 01:08:23,580
actually this process is repeated
1335
01:08:20,790 --> 01:08:27,660
something like five or six times
1336
01:08:23,580 --> 01:08:31,080
depending on the implementation and then
1337
01:08:27,660 --> 01:08:33,180
we get a pyramid of feature Maps and the
1338
01:08:31,080 --> 01:08:38,939
shallowest feature Maps feature map
1339
01:08:33,180 --> 01:08:42,300
contains features from all contains
1340
01:08:38,939 --> 01:08:44,729
features that were transformed for from
1341
01:08:42,300 --> 01:08:45,960
all or almost all of the levels of
1342
01:08:44,729 --> 01:08:50,189
pulling in the original feature
1343
01:08:45,960 --> 01:08:52,800
extractor and then sorry for each of
1344
01:08:50,189 --> 01:08:55,649
these feature maps we predict separately
1345
01:08:52,800 --> 01:08:56,970
we predict bounding boxes from this this
1346
01:08:55,649 --> 01:08:58,759
feature map separately and this
1347
01:08:56,970 --> 01:09:01,050
separately and this one separately and
1348
01:08:58,760 --> 01:09:03,990
there are only three feature maps drawn
1349
01:09:01,050 --> 01:09:08,600
here but in practice they use five or
1350
01:09:03,990 --> 01:09:12,500
six depending on the implementation so
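The construction just described can be sketched with plain shape arithmetic. This is an illustrative, shape-only sketch, not the paper's code; the ResNet-style stage shapes and the 512-pixel input are assumptions chosen for the example:

```python
# Shape-only sketch of the top-down pathway: each backbone stage output is a
# (channels, height, width) triple.  The stage shapes below are an assumption,
# modeled on ResNet-101 stage widths for a 512x512 input.
def build_pyramid(backbone_shapes, out_channels=256):
    pyramid = []
    prev = None
    # Walk from the deepest (coarsest) stage back toward the shallowest.
    for c, h, w in reversed(backbone_shapes):
        lateral = (out_channels, h, w)  # 1x1 lateral conv: same size, 256 features
        if prev is not None:
            # Nearest-neighbor upsampling by 2 undoes one pooling step...
            upsampled = (prev[0], prev[1] * 2, prev[2] * 2)
            # ...so the upsampled map matches the lateral map and they can be summed.
            assert upsampled[1:] == lateral[1:]
        merged = lateral  # elementwise summation keeps the shape
        pyramid.append(merged)
        prev = merged
    return list(reversed(pyramid))  # finest (shallowest) level first

stages = [(256, 128, 128), (512, 64, 64), (1024, 32, 32), (2048, 16, 16)]
pyramid = build_pyramid(stages)
# Every pyramid level now has 256 features at its stage's original resolution.
```

Note how the summation (rather than concatenation) is what lets every level keep the same fixed feature count of 256.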
So which layers do we choose for the lateral connections? Does anyone have a guess? Yes: we don't take only shallow layers; we take both shallow layers and deep layers, but we take one layer from each stage. There are five pooling operations in the network, and before each of these pooling operations we take the output of the last convolutional layer before that pooling. Why? Because, as we said before, the pooling operation is the component that causes us the trouble with small objects. After we perform the pooling operation we lose information, so from each pooling level we want to take some information, some features. And we choose the last layer before the pooling because, intuitively, it has the most developed features for that level of spatial resolution. Okay? Yeah.
It's similar, but you predict from each level, yes. As I said, it's very similar to many other architectures. They didn't invent the concept of using features from shallower levels; it was mentioned in dozens of papers. But this group, and maybe a few other groups at around the same time, were the first to propose this mechanism of predicting from several stages. By the way, the SSD detector also predicts from several stages, but it doesn't use the shallower features: it starts from the top and creates new layers, and it never combines the shallower features, and that's where SSD misses. Okay, yeah; sorry, you asked before, yes?

This one? Yes, and I will show the results in a few minutes.
From each level? For each...? I'll explain so that everyone can hear; I won't repeat the question, I'll just explain the process. Okay, thank you. So for each pooling operation you take the output of the convolutional layer before it, you put it through a one-by-one convolution to reduce its number of features to 256, and then you sum it with the upsampled map from the top-down connections that they have here. Okay? So this bottom feature map is the result of a summation involving all of the outputs above it, but the one above it doesn't include the one below it: each level is only a result of the summation of the ones above it. Okay? Yes: more lateral connections?
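The nearest-neighbor upsampling with scale factor 2 used in these top-down connections is simple enough to sketch directly; a minimal illustration on a plain 2D list, with no framework assumed:

```python
# Minimal nearest-neighbor upsampling with scale factor 2 in each dimension,
# as used in the top-down connections, applied to a plain 2D list.
def upsample_nearest_2x(fmap):
    out = []
    for row in fmap:
        expanded = [v for v in row for _ in (0, 1)]  # duplicate each column
        out.append(expanded)
        out.append(list(expanded))  # duplicate each row
    return out

up = upsample_nearest_2x([[1, 2], [3, 4]])
# up == [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```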
I would love it if we could answer this question later, because I think it's less of an understanding question and more of an intuition question. So if anyone has more understanding questions about the concept, it's important to ask them now. Yes?

The training is end to end, yes. We will get to it in one of the next few slides. How do you train it? We'll get to it; actually, I think, right now.

So now that you have all of these five pyramid levels (we see three of them here), we use a predict head on top of each of them, which predicts the bounding boxes, or the proposal bounding boxes. And how do these heads work? They work just like the head in Faster R-CNN. This, by the way, is part of what I said about the simplicity of their design: they tried not to change too much, to keep all the mechanisms the same and add as little as possible. So this figure is taken from the Faster R-CNN paper, the original paper; this is how you predict in Faster R-CNN. You didn't have this feature pyramid, you just had the deepest feature map, and you put a 3x3 convolution on top of it, and each location of the 3x3 convolution was turned into a vector of 256 dimensions. On these vectors you use one one-by-one convolution to predict the coordinates of the bounding boxes for that spatial location, and one one-by-one convolution to predict the probability of being an object and the probability of not being an object. So here you use the exact same predict head on each of these layers separately. And not only do you use the same predict-head architecture for all of them, you also use the same weights for all of them: they share weights. So if the predict head on this layer predicts a false-positive box, and it's punished by the backpropagation mechanism, and the weights in the predict head change, they change for all of the layers together. Okay? And they tested this experimentally, empirically, and saw that using different predict heads, where they don't share weights, doesn't really improve performance.
So, more about how they train: for each of the levels of the network (we have three here and two more here), they assign a different size of bounding boxes. The deepest level is trained only on the smallest objects, and the top level is trained only on the largest objects. This is similar to the concept of anchors in Faster R-CNN, for those of you who are familiar with it, only here the anchors are trained on separate layers: each anchor scale is trained on a completely separate layer. And the reason for that is that we would expect this layer to contain the most relevant information for small objects, so we want it to specialize on small objects. It's a really difficult task, so you want it to be the best it can be on small objects. That's the intuition behind it. Okay, great.
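The one-scale-per-level assignment described above can be sketched as a simple mapping. The concrete pixel scales below (32 to 512, doubling per level, the values used in the FPN paper) are an assumption; the talk itself gives no numbers, only the rule that each anchor scale gets its own level:

```python
# One anchor scale per pyramid level: the finest level handles the smallest
# objects, the coarsest level the largest.  Scales here are assumed values
# (those of the FPN paper), doubling once per level.
def assign_anchor_scales(levels, base_scale=32):
    return {level: base_scale * (2 ** i) for i, level in enumerate(levels)}

scales = assign_anchor_scales([2, 3, 4, 5, 6])  # pyramid levels P2..P6
```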
So this was about how to combine it with the RPN, and now we will talk about how to combine it with the second stage of the network, Fast R-CNN. A short reminder about how Fast R-CNN works: with the RPN we get the proposals on the image, for example about two thousand proposals, and then we use RoI pooling, or deformable RoI pooling, to pool each one of these proposals from the feature map. Okay. But in Faster R-CNN we didn't have these pyramid layers; we only had this layer, and that's where we pooled the bounding boxes from. Now that we have many layers of different spatial resolutions, maybe we can also pool the bounding boxes from these layers, and not only from the deepest layer. That's the concept. But how do you decide which layer to pool the RoI from? Here, again, they used the concept of simplicity. They said: if we're trying to imitate image pyramids, we can just use the decision rule that is already known to work with image pyramids. So there is a very clear formula that is used, and this is it: the floor of 4 plus log2 of sqrt(w*h) divided by 224. Okay, so you can read it, and now I'll give examples of how it works.

Let's say that w and h are the width and height of our proposal in pixels. If the proposal is of size 224 by 224, what we get here is 4 plus log2 of 1, which is 0, and the end result is 4. This means we RoI-pool from the fourth layer, and we'll talk about the intuition behind this in a minute. When we take a proposal whose size is 112 by 112, the result will be 4 plus log2 of one half, which is of course minus 1, so it will be 3. And for a larger bounding box it will be 5. So you can see that this formula very efficiently lets us RoI-pool larger bounding boxes from the layers with coarser spatial resolution, and smaller objects from the layers that contain the information with the highest spatial resolution. By the way, 3 is the index of this layer, 4 is the index of this layer, 5 is this layer's index, and so on; or actually, this is 2, 3, 4, and so on.
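The decision rule and the worked examples above can be written down directly. A minimal sketch; the clamping to levels 2 through 5 is an assumption about how many pyramid levels are available:

```python
import math

# The decision rule above: a w-by-h proposal (in pixels) is RoI-pooled from
# pyramid level k = floor(4 + log2(sqrt(w * h) / 224)).  Clamping the result
# to the available levels (here 2..5) is an assumption about pyramid depth.
def roi_level(w, h, k0=4, canonical=224.0, k_min=2, k_max=5):
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return max(k_min, min(k_max, k))

k_224 = roi_level(224, 224)  # log2(1) = 0   -> level 4
k_112 = roi_level(112, 112)  # log2(1/2) = -1 -> level 3
k_448 = roi_level(448, 448)  # log2(2) = 1   -> level 5
```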
100 by 100? Yes, it's still a relatively big object, and we have one feature map below it to take care of objects that are smaller than that; just one feature map below it. She said that 100 by 100 is still big, but we use a relatively shallow layer to pool it from, so maybe we're wasting the high-resolution information there on relatively large objects. I'm sure this can be optimized more, but it works quite well. Okay. So in this formula we saw there is what looks like a magic number, 224. Yes? Sorry?
Okay, maybe it doesn't really matter; even if it is, it's just the way they decided to build the formula, for ease, to make the formula more friendly, actually. And we saw that there is the number 224. It looks like a magic number, but it's also the size of images in ImageNet, and that's the reason they use this number here. And layer number four is actually the layer that the RoI pooling is performed on in the original Faster R-CNN: with ResNet-101, it's pooled from layer four. Since ResNet was pre-trained on ImageNet, and all of the images in ImageNet are pretty much objects of this scale, then if we get an object of this scale we would like to pool it from the default layer, the layer that has proven itself so far. So that's the intuition for the number 224, and that's also why they have 4 here: when you have an object of size 224 by 224, the log goes to zero and you remain with the default layer index. Yes?
Explain about the... what did I mention? The units of the score, you mean? Sorry, on the previous slide? Okay, yes. So k is the number of bounding boxes that is predicted for each spatial location; it's also called the number of anchors, if you know that term. For each spatial location I don't predict only a single bounding box; I predict something like nine bounding boxes, or fifteen bounding boxes. And for each of these nine bounding boxes I want four coordinates and two probabilities. So this is the 2k and the 4k: it's not two thousand or four thousand, it's just two times and four times the number of bounding boxes that I'm predicting at that spatial location. Okay?
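The 2k and 4k outputs just described amount to simple per-location shape bookkeeping. A small sketch; the 32-by-32 feature-map size is an arbitrary example, while k = 9 matches the "nine bounding boxes" mentioned above:

```python
# Per-location output sizes of the predict head: k anchor boxes per spatial
# location, each with 4 coordinates and 2 objectness scores.
def head_output_shapes(height, width, k=9):
    coords = (height, width, 4 * k)  # 1x1 conv -> 4k regression outputs
    scores = (height, width, 2 * k)  # 1x1 conv -> 2k object / not-object scores
    return coords, scores

coords, scores = head_output_shapes(32, 32, k=9)
```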
Okay, let's move on; we're close to finishing this. Regarding their experimental results (this is relevant to the question that was asked before): they tried not to use the top-down connections and to use only the lateral connections, which means predicting from each level of features but not combining features from deep layers and shallower layers, and they saw that this didn't even improve on top of Faster R-CNN. They also tried to use only the top-down connections, without the lateral connections, and that didn't improve either. The only other thing they tried that did improve was building all of this pyramid but not using all of these predict heads, using only a single predict head on the bottom level. Since the bottom contains a combination of all of the features, intuitively maybe it's enough just to predict from it, and it's more efficient. But in practice they saw that it does improve, yet it lags far behind the full feature-pyramid-network solution.
Regarding Fast R-CNN, they asked themselves: if we use feature pyramid networks in the RPN, maybe that's enough; maybe it's too much to put it in Fast R-CNN as well, maybe it's not improving anything there. So they trained the RPN separately using feature pyramid networks, recorded for each image the best proposals they got for that image, and then trained Fast R-CNN separately using these proposals. From the beginning of its training, Fast R-CNN only got the best proposals; it was not trained end to end. And still, Fast R-CNN improved the results by another five to ten percent, so this component in Fast R-CNN is important. And we talked about this decision rule for choosing the layer, and it's nice, but they saw that even if we don't use it in Fast R-CNN, in the second stage, and we only pool from this layer, the deepest layer, we don't get a large difference in results. So Fast R-CNN is less sensitive to which layer you pool from; you just need to pool from the last layer, with the most information.
And some other neat things they saw: of course, it improves the results especially for small objects, and the test time per image on a single GPU is less than that of a single-scale, non-feature-pyramid network of the same architecture. So although the feature pyramid network is a more complex architecture, it's faster. It's out of scope to explain exactly why, but Ross Girshick, who is one of the authors of this paper and is very well known, wrote a comment on GitHub explaining it.
So this is feature pyramid networks. Before we summarize: if someone has a really, really important question, ask now; otherwise you can come ask me, I'll stay here. Okay, great. We won't have time to cover focal loss, but I will publish my slides online, and hopefully they will be self-contained enough for you to look at them, and maybe we will see each other some other time.
So, to summarize: I really believe object detection is a revolutionary technology that is progressing really fast and has the potential to change every industry. And the latest major advances in object detection, besides being really creative and mind-blowing and fascinating to learn about, also address some of the most major problems in many data domains, and specifically in medical imaging. The problems we talked about earlier were: small objects, for which feature pyramid networks give a very elegant and efficient solution; deformable shapes, to which deformable convolutions and deformable RoI pooling give a very elegant solution; and extreme class imbalance, for which focal loss (which we didn't have time to cover) is a very nice and innovative solution. So, as usual in deep learning, like we said in the beginning, it's not rocket science, and I think that many engineers can understand the concepts and implementations. But the question is: what can we do better in order to make this information more friendly, and reduce the time required to understand it by a factor of 10? Thank you very much.

[Applause]