subtitlecat.com

All language subtitles for Applying the Cutting Edge of Object Detection to Medical Imaging.en.transcribed

Afrikaans

Albanian

Amharic

Arabic

Armenian

Azerbaijani

Basque

Belarusian

Bengali

Bosnian

Bulgarian

Catalan

Cebuano

Chichewa

Chinese (Simplified)

Chinese (Traditional)

Corsican

Croatian

Czech

Danish

Dutch

English

Esperanto

Estonian

Filipino

Finnish

French

Frisian

Galician

Georgian

German

Greek

Gujarati

Haitian Creole

Hausa

Hawaiian

Hebrew

Hindi

Hmong

Hungarian

Icelandic

Igbo

Indonesian

Irish

Italian

Japanese

Javanese

Kannada

Kazakh

Khmer

Korean

Kurdish (Kurmanji)

Kyrgyz

Lao

Latin

Latvian

Lithuanian

Luxembourgish

Macedonian

Malagasy

Malay

Malayalam

Maltese

Maori

Marathi

Mongolian

Myanmar (Burmese)

Nepali

Norwegian

Pashto

Persian

Polish

Portuguese

Punjabi

Romanian

Russian

Samoan

Scots Gaelic

Serbian

Sesotho

Shona

Sindhi

Sinhala

Slovak

Slovenian

Somali

Spanish

Sundanese

Swahili

Swedish

Tajik

Tamil

Telugu

Thai

Turkish

Ukrainian

Urdu

Uzbek

Vietnamese

Welsh

Xhosa

Yiddish

Yoruba

Zulu

Odia (Oriya)

Kinyarwanda

Turkmen

Tatar

Uyghur

Would you like to inspect the original subtitles? These are the user uploaded subtitles that are being translated: 1 00:00:02,610 --> 00:00:09,420 so nice to meet you everybody my name is 2 00:00:06,420 --> 00:00:12,450 Dan basak and I'm the head of AI at a 3 00:00:09,420 --> 00:00:13,830 dock where we do we use the technologies 4 00:00:12,450 --> 00:00:18,299 of object detection and image 5 00:00:13,830 --> 00:00:21,650 segmentation to detect urgent medical 6 00:00:18,300 --> 00:00:24,570 normalities in different medical imaging 7 00:00:21,650 --> 00:00:26,848 modalities such as CT scans and it's 8 00:00:24,570 --> 00:00:28,920 nice to see some familiar faces here in 9 00:00:26,849 --> 00:00:33,629 the crowd thank you all for coming and 10 00:00:28,920 --> 00:00:36,989 today I want to share with you and talk 11 00:00:33,629 --> 00:00:39,030 talk together with you about some some 12 00:00:36,989 --> 00:00:41,159 of the three most interesting papers in 13 00:00:39,030 --> 00:00:44,909 object detection of the last year or so 14 00:00:41,159 --> 00:00:47,010 which have done really amazing things 15 00:00:44,909 --> 00:00:56,390 that are really relevant to the 16 00:00:47,010 --> 00:00:58,650 challenges of medical imaging data so 17 00:00:56,390 --> 00:01:01,140 first of all a little bit about object 18 00:00:58,650 --> 00:01:04,519 detection so object detection is 19 00:01:01,140 --> 00:01:07,740 progressing really quickly just between 20 00:01:04,519 --> 00:01:10,530 2016 and 2017 21 00:01:07,740 --> 00:01:15,990 in the living benchmark the cocoa 22 00:01:10,530 --> 00:01:19,050 competition the accuracy has raised in 23 00:01:15,990 --> 00:01:21,830 about - a 20% relatively between the 24 00:01:19,050 --> 00:01:25,979 years so the bottom row is the best 25 00:01:21,830 --> 00:01:30,929 submission of 2016 and all of these four 26 00:01:25,980 --> 00:01:34,650 are the submissions from 2017 challenge 27 00:01:30,930 --> 00:01:37,740 in the Abbott was in the end of 2017 and 28 00:01:34,650 --> 00:01:40,080 not only is it a mature and quickly 29 00:01:37,740 --> 00:01:43,530 advancing technology I think it's also a 30 00:01:40,080 --> 00:01:45,210 very transformational technology and it 31 00:01:43,530 --> 00:01:48,210 it has the potential and I really 32 00:01:45,210 --> 00:01:52,669 believe that it will transform any and 33 00:01:48,210 --> 00:01:55,399 every industry be it medical imaging 34 00:01:52,670 --> 00:01:58,650 defense and business intelligence 35 00:01:55,400 --> 00:02:00,570 robotics and autonomous vehicles and 36 00:01:58,650 --> 00:02:04,159 even augmented reality and many many 37 00:02:00,570 --> 00:02:04,158 other more industries 38 00:02:09,090 --> 00:02:13,710 and the reason I really like these these 39 00:02:11,850 --> 00:02:15,989 meetups and talks is because I think 40 00:02:13,710 --> 00:02:19,950 that object detection and deep learning 41 00:02:15,990 --> 00:02:23,160 in general is a very very it has a lot 42 00:02:19,950 --> 00:02:25,560 of potential but we wouldn't be able to 43 00:02:23,160 --> 00:02:27,180 fulfill this potential fast enough if 44 00:02:25,560 --> 00:02:29,190 you won't have a lot of people that 45 00:02:27,180 --> 00:02:31,530 really understand this field and the 46 00:02:29,190 --> 00:02:34,470 problem with that is that we are 47 00:02:31,530 --> 00:02:36,270 building a mountain of papers that are 48 00:02:34,470 --> 00:02:38,580 really hard to read or really hard to 49 00:02:36,270 --> 00:02:40,290 get into it takes hours to and really 50 00:02:38,580 --> 00:02:43,350 understand them especially if you want 51 00:02:40,290 --> 00:02:45,660 to dive into the little details that are 52 00:02:43,350 --> 00:02:48,840 relevant to really implement something 53 00:02:45,660 --> 00:02:51,390 especially if it's on new data and I was 54 00:02:48,840 --> 00:02:54,240 really inspired by this article on 55 00:02:51,390 --> 00:02:58,470 distill which was called research debt 56 00:02:54,240 --> 00:03:00,240 and was talked exactly about that and I 57 00:02:58,470 --> 00:03:03,030 really recommend you to read it it was 58 00:03:00,240 --> 00:03:07,200 written by Chris Ola and the Shankar - 59 00:03:03,030 --> 00:03:09,270 research scientists from Google and what 60 00:03:07,200 --> 00:03:12,030 they saying I really agree with them is 61 00:03:09,270 --> 00:03:14,610 that we really need to look at this 62 00:03:12,030 --> 00:03:16,380 mountain and instead of making we can 63 00:03:14,610 --> 00:03:18,540 continue making the mountain bigger as 64 00:03:16,380 --> 00:03:20,880 long as we build staircases and 65 00:03:18,540 --> 00:03:23,850 elevators that enable everyone to climb 66 00:03:20,880 --> 00:03:25,380 it together with us because if we won't 67 00:03:23,850 --> 00:03:26,760 have enough engineers that understand 68 00:03:25,380 --> 00:03:29,640 the state-of-the-art we won't we 69 00:03:26,760 --> 00:03:32,880 wouldn't be able to create placated 70 00:03:29,640 --> 00:03:34,859 solution fast enough and I've personally 71 00:03:32,880 --> 00:03:37,320 invested hundreds of hours in learning 72 00:03:34,860 --> 00:03:43,610 this field and I really I still have a 73 00:03:37,320 --> 00:03:46,410 lot more to learn so and after learning 74 00:03:43,610 --> 00:03:48,360 learning investing a lot of time in it 75 00:03:46,410 --> 00:03:50,579 my conclusion is deep learning is 76 00:03:48,360 --> 00:03:53,760 advanced and it's mind-blowing and it's 77 00:03:50,580 --> 00:03:55,800 creative and you need to dive into it 78 00:03:53,760 --> 00:03:57,540 and learn it seriously in order to 79 00:03:55,800 --> 00:03:59,790 understand it but it's not rocket 80 00:03:57,540 --> 00:04:01,440 science and by the way I think that even 81 00:03:59,790 --> 00:04:05,429 rocket science is not really rocket 82 00:04:01,440 --> 00:04:08,730 science and my question is what can we 83 00:04:05,430 --> 00:04:12,090 do to reduce the time that is required 84 00:04:08,730 --> 00:04:14,670 for the next people to join this field 85 00:04:12,090 --> 00:04:19,079 by a factor of 10 from the time it took 86 00:04:14,670 --> 00:04:21,579 me to join in this field and I think 87 00:04:19,079 --> 00:04:24,938 that if we 88 00:04:21,579 --> 00:04:27,580 it should be a community effort and we 89 00:04:24,939 --> 00:04:29,860 should spend more time on trying to make 90 00:04:27,580 --> 00:04:31,568 these things more explainable more easy 91 00:04:29,860 --> 00:04:35,199 to understand use the right explanation 92 00:04:31,569 --> 00:04:35,560 and the right visualizations and that's 93 00:04:35,199 --> 00:04:39,849 it 94 00:04:35,560 --> 00:04:41,530 so now let's dive into first the 95 00:04:39,849 --> 00:04:43,930 structure of this doc it's going to be 96 00:04:41,530 --> 00:04:46,239 about how an hour and a half 97 00:04:43,930 --> 00:04:50,710 first I will talk about the challenges 98 00:04:46,240 --> 00:04:54,340 in medical imaging data and after that I 99 00:04:50,710 --> 00:04:57,430 will dive into a depends on how much 100 00:04:54,340 --> 00:04:59,979 time we'll have two to three of the most 101 00:04:57,430 --> 00:05:02,500 advanced papers deformable convolutional 102 00:04:59,979 --> 00:05:04,690 networks feature pyramid networks and 103 00:05:02,500 --> 00:05:06,879 focal loss I guess most of you heard 104 00:05:04,690 --> 00:05:10,949 these names if you're in the field and 105 00:05:06,879 --> 00:05:13,479 I'm going to these three papers address 106 00:05:10,949 --> 00:05:16,210 most of the challenges that I'm going to 107 00:05:13,479 --> 00:05:17,979 present now and they do it in a really 108 00:05:16,210 --> 00:05:20,229 nice way and I'm going to explain them 109 00:05:17,979 --> 00:05:22,120 in a way that is relevant both to the 110 00:05:20,229 --> 00:05:23,590 medical imaging domain and explain the 111 00:05:22,120 --> 00:05:25,180 unique details that are relevant to 112 00:05:23,590 --> 00:05:27,969 apply it in the medical imaging human 113 00:05:25,180 --> 00:05:29,500 but also it will be relevant for any one 114 00:05:27,969 --> 00:05:31,419 of you that wants to understand these 115 00:05:29,500 --> 00:05:34,569 concepts and take them to other fields 116 00:05:31,419 --> 00:05:36,789 as well so let's start with challenges 117 00:05:34,569 --> 00:05:39,669 of medical imaging data by the way can 118 00:05:36,789 --> 00:05:42,400 everyone hear me well I I don't think I 119 00:05:39,669 --> 00:05:48,909 asked ok if anyone if someone can't hear 120 00:05:42,400 --> 00:05:51,520 me just say mmm the bottom of the screen 121 00:05:48,909 --> 00:05:55,719 well I'm not sure I can solve that but 122 00:05:51,520 --> 00:05:57,849 you're welcome to come closer so the 123 00:05:55,719 --> 00:06:00,610 first challenge is extreme class 124 00:05:57,849 --> 00:06:03,190 imbalance objects are very small and 125 00:06:00,610 --> 00:06:05,110 rare compared to the number of images 126 00:06:03,190 --> 00:06:07,870 and the image sizes what I'm what I mean 127 00:06:05,110 --> 00:06:10,089 about with that you can see here this 128 00:06:07,870 --> 00:06:12,930 arrow is the detection of one of our 129 00:06:10,089 --> 00:06:17,020 algorithms in a brain CT scan of 130 00:06:12,930 --> 00:06:21,490 relatively medium size very urgent 131 00:06:17,020 --> 00:06:24,039 finding in the brain and the this is a 132 00:06:21,490 --> 00:06:25,870 relatively this is relatively large 133 00:06:24,039 --> 00:06:27,310 compared to many of the findings that we 134 00:06:25,870 --> 00:06:29,589 are required to detect and it's 135 00:06:27,310 --> 00:06:32,289 relatively obvious I put it here because 136 00:06:29,589 --> 00:06:34,659 I think that even that is not considered 137 00:06:32,289 --> 00:06:35,169 large in terms of classic object 138 00:06:34,659 --> 00:06:38,740 detection 139 00:06:35,170 --> 00:06:41,980 and I wanted you to understand what I'm 140 00:06:38,740 --> 00:06:44,980 talking about and and really see it so 141 00:06:41,980 --> 00:06:47,080 and the we're talking about findings 142 00:06:44,980 --> 00:06:49,420 that are sometimes smaller than 10 by 10 143 00:06:47,080 --> 00:06:52,180 pixels and they are found in images 144 00:06:49,420 --> 00:06:55,060 which are a single image is actually a 145 00:06:52,180 --> 00:06:58,480 3d image for us it's not a 2d image 146 00:06:55,060 --> 00:07:04,690 so it can be 10 by 10 pixels over a few 147 00:06:58,480 --> 00:07:08,530 slices inside a an image which is 100 148 00:07:04,690 --> 00:07:10,270 slices 100 2d images or even more so the 149 00:07:08,530 --> 00:07:13,809 finding is a very small part of the 150 00:07:10,270 --> 00:07:16,659 brain and most of the scans are of 151 00:07:13,810 --> 00:07:19,990 healthy brains or healthy spines or 152 00:07:16,660 --> 00:07:22,300 whatever so the the interesting data the 153 00:07:19,990 --> 00:07:27,040 data that we want to detect is very rare 154 00:07:22,300 --> 00:07:30,720 and that's a very big challenge the 155 00:07:27,040 --> 00:07:33,550 second challenge is that the objects and 156 00:07:30,720 --> 00:07:35,920 they are not in our background which is 157 00:07:33,550 --> 00:07:40,540 done atomic all structures in my opinion 158 00:07:35,920 --> 00:07:43,960 are relatively much less well structured 159 00:07:40,540 --> 00:07:46,090 much more deformable and less rigid if 160 00:07:43,960 --> 00:07:48,909 you can we can take an example which of 161 00:07:46,090 --> 00:07:50,950 course we can find a hard example in in 162 00:07:48,910 --> 00:07:53,080 the classical data set as well but if 163 00:07:50,950 --> 00:07:55,659 you look at a wheel it is bounded pretty 164 00:07:53,080 --> 00:07:58,180 well by a square bounding box which is 165 00:07:55,660 --> 00:08:02,110 the classic use case of object detection 166 00:07:58,180 --> 00:08:05,200 but here is a part of a brain and you 167 00:08:02,110 --> 00:08:06,610 can see that these pixels that I 168 00:08:05,200 --> 00:08:10,090 highlighted in yellow it's not their 169 00:08:06,610 --> 00:08:14,170 original color this is a single finding 170 00:08:10,090 --> 00:08:16,150 in the brain so if I put a the the 171 00:08:14,170 --> 00:08:19,420 tightest bounding box that I can around 172 00:08:16,150 --> 00:08:22,090 this finding it will sink still contain 173 00:08:19,420 --> 00:08:23,980 a lot of uninteresting pixels and then 174 00:08:22,090 --> 00:08:27,369 when I will extract the features for 175 00:08:23,980 --> 00:08:29,470 this bounding box most of the signal 176 00:08:27,370 --> 00:08:35,200 will come from background rather than 177 00:08:29,470 --> 00:08:38,260 from interesting pixels and that's 178 00:08:35,200 --> 00:08:42,220 that's part of the shapes of the object 179 00:08:38,260 --> 00:08:44,830 being less deformable another challenge 180 00:08:42,220 --> 00:08:46,330 is that the images are 3d and they're 181 00:08:44,830 --> 00:08:46,690 large and there is a lot of difference 182 00:08:46,330 --> 00:08:49,720 between 183 00:08:46,690 --> 00:08:52,510 images of cats - cats cans and actually 184 00:08:49,720 --> 00:08:57,160 you can see that the difference in sizes 185 00:08:52,510 --> 00:09:00,819 the inputs are like could be 30 times 186 00:08:57,160 --> 00:09:04,270 larger and the objects are nominally ten 187 00:09:00,820 --> 00:09:06,940 times smaller and this is a challenge in 188 00:09:04,270 --> 00:09:08,890 terms of a computation power that is 189 00:09:06,940 --> 00:09:12,310 required a time that is it takes it to 190 00:09:08,890 --> 00:09:14,800 converge and the memory footprint of 191 00:09:12,310 --> 00:09:16,599 your networks and how you can find a 192 00:09:14,800 --> 00:09:19,060 good compromise in the design of your 193 00:09:16,600 --> 00:09:25,300 model and what input to use and the rest 194 00:09:19,060 --> 00:09:29,380 challenge that I will talk about is that 195 00:09:25,300 --> 00:09:32,949 when a radiologist analyzes a CT scan it 196 00:09:29,380 --> 00:09:36,520 doesn't just look at the current CT scan 197 00:09:32,950 --> 00:09:38,710 he looks at a lot more types of data the 198 00:09:36,520 --> 00:09:40,990 Democrat the demographics of the patient 199 00:09:38,710 --> 00:09:44,100 his age the referral letter of the 200 00:09:40,990 --> 00:09:47,800 doctor that referred him to this to this 201 00:09:44,100 --> 00:09:49,540 scan his past scans and the reports that 202 00:09:47,800 --> 00:09:52,000 were written on these cans and actually 203 00:09:49,540 --> 00:09:54,880 the radiologists are not only they do 204 00:09:52,000 --> 00:09:57,730 this but they are obligated to do is do 205 00:09:54,880 --> 00:10:00,220 this by regulation and for a reason 206 00:09:57,730 --> 00:10:01,690 because not all the information is in 207 00:10:00,220 --> 00:10:04,420 the image and if you don't look at the 208 00:10:01,690 --> 00:10:07,180 past you can't really diagnose some of 209 00:10:04,420 --> 00:10:10,930 the cases so how do you combine all the 210 00:10:07,180 --> 00:10:14,069 different types of data both visual text 211 00:10:10,930 --> 00:10:14,069 and structured data 212 00:10:22,480 --> 00:10:31,610 okay so now I want to dive in into as I 213 00:10:27,290 --> 00:10:33,920 said two to three papers and the first 214 00:10:31,610 --> 00:10:35,959 paper I wanted to start with is 215 00:10:33,920 --> 00:10:38,269 deformable convolutional networks and 216 00:10:35,959 --> 00:10:40,189 that's the paper I chose because for my 217 00:10:38,269 --> 00:10:42,740 experience a lot of people are kind of 218 00:10:40,189 --> 00:10:44,629 afraid from this paper and find it very 219 00:10:42,740 --> 00:10:47,449 hard to get into and I think that it's 220 00:10:44,629 --> 00:10:50,660 not that difficult if you understand it 221 00:10:47,449 --> 00:10:53,628 if you explain it correctly so what is 222 00:10:50,660 --> 00:10:55,579 the motivation for the differ by the way 223 00:10:53,629 --> 00:10:58,430 this paper is by the Microsoft Research 224 00:10:55,579 --> 00:11:02,508 Asia group and it's a it's been for the 225 00:10:58,430 --> 00:11:05,269 last two years both in 2016 and 2017 a 226 00:11:02,509 --> 00:11:06,889 significant component of one of the top 227 00:11:05,269 --> 00:11:09,649 three entries to the cocoa object 228 00:11:06,889 --> 00:11:11,779 detection competition so it's a it's a 229 00:11:09,649 --> 00:11:14,720 very significant boost to performance 230 00:11:11,779 --> 00:11:17,449 and it's by Microsoft Research Asia 231 00:11:14,720 --> 00:11:19,069 which is one of the top object detection 232 00:11:17,449 --> 00:11:21,170 groups in my opinion in the world and 233 00:11:19,069 --> 00:11:24,500 gave some of the top contributions in 234 00:11:21,170 --> 00:11:26,750 the last year's so the motivation for 235 00:11:24,500 --> 00:11:29,540 this paper is that neural networks and 236 00:11:26,750 --> 00:11:31,250 the popular mechanism that we use with 237 00:11:29,540 --> 00:11:32,959 the neural networks with convolutional 238 00:11:31,250 --> 00:11:35,899 neural networks such as data 239 00:11:32,959 --> 00:11:38,290 augmentation know how to deal pretty 240 00:11:35,899 --> 00:11:43,430 well with simple transformations such as 241 00:11:38,290 --> 00:11:46,389 translation and rotation but non rigid 242 00:11:43,430 --> 00:11:49,729 transformation like changing the pose or 243 00:11:46,389 --> 00:11:53,240 the viewpoint or just the object being 244 00:11:49,730 --> 00:11:55,189 the less clear in regular form are much 245 00:11:53,240 --> 00:11:57,589 more challenging for a neural networks 246 00:11:55,189 --> 00:12:01,990 to deal with and how can we answer that 247 00:11:57,589 --> 00:12:05,660 that challenge so the solution is to 248 00:12:01,990 --> 00:12:07,579 give the network a capability a dynamic 249 00:12:05,660 --> 00:12:09,439 on a dynamic way to control the 250 00:12:07,579 --> 00:12:10,569 receptive field of the convolution so 251 00:12:09,439 --> 00:12:14,209 instead of using the traditional 252 00:12:10,569 --> 00:12:17,240 convolution does that samples the that 253 00:12:14,209 --> 00:12:18,109 samples the image with a square grid we 254 00:12:17,240 --> 00:12:20,750 can sample 255 00:12:18,110 --> 00:12:23,059 why not sample the image in any shape 256 00:12:20,750 --> 00:12:25,040 that we want and why not have it adapt 257 00:12:23,059 --> 00:12:27,199 to the input the shape that we sample 258 00:12:25,040 --> 00:12:29,029 the image with adapt to the image to the 259 00:12:27,199 --> 00:12:32,559 input that we get to our images and 260 00:12:29,029 --> 00:12:34,830 objects so we didn't talk it about why 261 00:12:32,559 --> 00:12:37,380 should it solve the problem that 262 00:12:34,830 --> 00:12:40,770 we are talking about but we will see it 263 00:12:37,380 --> 00:12:44,910 in a minute but we don't only want to 264 00:12:40,770 --> 00:12:47,310 implement this solution if it works well 265 00:12:44,910 --> 00:12:50,069 in order to be applicable in the 266 00:12:47,310 --> 00:12:51,989 industry we want to do it in to 267 00:12:50,070 --> 00:12:54,570 implement something that is easy to 268 00:12:51,990 --> 00:12:56,430 train and preferably end-to-end we don't 269 00:12:54,570 --> 00:12:59,190 want to train several different 270 00:12:56,430 --> 00:13:00,810 components and then combine them it 271 00:12:59,190 --> 00:13:02,940 creates a very cumbersome research 272 00:13:00,810 --> 00:13:04,650 process we don't want to increase the 273 00:13:02,940 --> 00:13:06,990 model complexity its run time and 274 00:13:04,650 --> 00:13:09,090 training time a number of parameters too 275 00:13:06,990 --> 00:13:11,250 much and we don't want to increase the 276 00:13:09,090 --> 00:13:13,440 code complexity if it's a convolution 277 00:13:11,250 --> 00:13:15,960 when I define the modern architecture I 278 00:13:13,440 --> 00:13:17,790 want to write the formable conf 2d lie 279 00:13:15,960 --> 00:13:21,200 and they put the parameters inside it 280 00:13:17,790 --> 00:13:24,360 just like I write cons 2d today and I 281 00:13:21,200 --> 00:13:27,060 want to I want it to be proven on 282 00:13:24,360 --> 00:13:30,270 challenging tasks rather than just toy 283 00:13:27,060 --> 00:13:33,060 datasets and the earlier works in 284 00:13:30,270 --> 00:13:35,520 similar they try to deal with similar 285 00:13:33,060 --> 00:13:37,439 problems with this such as spatial 286 00:13:35,520 --> 00:13:41,579 transformer networks gave major 287 00:13:37,440 --> 00:13:43,440 scientific contributions but in the in 288 00:13:41,580 --> 00:13:45,480 the paper they only proved it on toya 289 00:13:43,440 --> 00:13:48,480 datasets and many people they try to 290 00:13:45,480 --> 00:13:53,240 apply it on real-world datasets found it 291 00:13:48,480 --> 00:13:55,650 very hard to make them converge at all 292 00:13:53,240 --> 00:13:59,850 so this is our these are our 293 00:13:55,650 --> 00:14:01,530 requirements from this solution and now 294 00:13:59,850 --> 00:14:02,820 let's talk about the components of this 295 00:14:01,530 --> 00:14:04,650 solution and there are actually two 296 00:14:02,820 --> 00:14:07,230 components and they build them on top 297 00:14:04,650 --> 00:14:09,120 they can be combined with any object 298 00:14:07,230 --> 00:14:12,000 detection architecture in the meta 299 00:14:09,120 --> 00:14:14,040 architecture like faster are CNN and RFC 300 00:14:12,000 --> 00:14:17,160 and these are the two architectures that 301 00:14:14,040 --> 00:14:18,660 they demonstrate in the paper and the 302 00:14:17,160 --> 00:14:21,959 first component is the deformable 303 00:14:18,660 --> 00:14:24,689 convolution and the concept is we keep 304 00:14:21,960 --> 00:14:26,820 convolution the same except for making 305 00:14:24,690 --> 00:14:28,980 the settling location a function of the 306 00:14:26,820 --> 00:14:30,300 image the sampling location are not a 307 00:14:28,980 --> 00:14:33,780 fixed speed they are a function of the 308 00:14:30,300 --> 00:14:36,839 image so as few samples examples that 309 00:14:33,780 --> 00:14:38,730 they give in the paper is that the 310 00:14:36,840 --> 00:14:42,180 receptive field after several 311 00:14:38,730 --> 00:14:44,760 convolutions of this of a neuron in this 312 00:14:42,180 --> 00:14:47,420 area of the image will be each of these 313 00:14:44,760 --> 00:14:50,360 in the area of each of these we're ready 314 00:14:47,420 --> 00:14:52,400 in the image so the value that we get 315 00:14:50,360 --> 00:14:56,540 from that is that you can see that the 316 00:14:52,400 --> 00:14:58,730 pixel is then this neuron is is on the 317 00:14:56,540 --> 00:15:00,800 sky or on the border between the sky and 318 00:14:58,730 --> 00:15:02,450 the mountains and if we reduce the 319 00:15:00,800 --> 00:15:04,760 traditional convolutions after three 320 00:15:02,450 --> 00:15:06,460 convolutions we would get like a square 321 00:15:04,760 --> 00:15:08,930 or affectively it would be like a 322 00:15:06,460 --> 00:15:11,540 circular or Gaussian shape around that 323 00:15:08,930 --> 00:15:13,729 point but now with the deformable 324 00:15:11,540 --> 00:15:16,969 convolutions we are able able to simple 325 00:15:13,730 --> 00:15:19,730 a very large part of the sky the 326 00:15:16,970 --> 00:15:22,610 mountains and the objects in the image 327 00:15:19,730 --> 00:15:25,910 and intuitively it looks like a desired 328 00:15:22,610 --> 00:15:29,030 property because if I'm just seeing blue 329 00:15:25,910 --> 00:15:32,930 pixels how can I know if it's sky or 330 00:15:29,030 --> 00:15:35,600 water or or like a wall in that color I 331 00:15:32,930 --> 00:15:39,170 can't really know that for sure unless I 332 00:15:35,600 --> 00:15:41,720 have larger context and when they use 333 00:15:39,170 --> 00:15:43,400 the same network but they look on a 334 00:15:41,720 --> 00:15:46,220 different part of the image where there 335 00:15:43,400 --> 00:15:49,160 is a far motorcycle the same deformable 336 00:15:46,220 --> 00:15:51,080 convolution mechanism creates a much 337 00:15:49,160 --> 00:15:53,180 more dense with especially dense 338 00:15:51,080 --> 00:15:59,560 receptive field which covers a much 339 00:15:53,180 --> 00:16:03,500 smaller area and samples the object very 340 00:15:59,560 --> 00:16:04,849 very tightly and also samples a bit of 341 00:16:03,500 --> 00:16:07,280 the object background because it's 342 00:16:04,850 --> 00:16:09,590 intuitive that we want to sample not 343 00:16:07,280 --> 00:16:12,319 only the object but the background as 344 00:16:09,590 --> 00:16:14,630 well and on a closer and larger object 345 00:16:12,320 --> 00:16:18,440 we can see that the receptive field is 346 00:16:14,630 --> 00:16:20,240 larger and a bit less dense but does 347 00:16:18,440 --> 00:16:23,540 cover again the the entire object 348 00:16:20,240 --> 00:16:26,900 instead of just a sip a rectangular or 349 00:16:23,540 --> 00:16:32,599 circular part of it and the background 350 00:16:26,900 --> 00:16:34,310 as well so this is this is the value 351 00:16:32,600 --> 00:16:36,320 that we can get this is an intuition for 352 00:16:34,310 --> 00:16:39,439 the value that we can get from these 353 00:16:36,320 --> 00:16:41,690 deformable convolutions and this was the 354 00:16:39,440 --> 00:16:43,070 first component which will we will dive 355 00:16:41,690 --> 00:16:45,560 into the implementation of this 356 00:16:43,070 --> 00:16:48,020 component in a few slides but first I 357 00:16:45,560 --> 00:16:51,079 want to talk about the second component 358 00:16:48,020 --> 00:16:54,530 which is deformable Roi pulling 359 00:16:51,080 --> 00:16:58,640 so first let's do a short reminder of 360 00:16:54,530 --> 00:17:01,130 faster our CN n and in faster our CN n 361 00:16:58,640 --> 00:17:02,810 we get an input image we put it through 362 00:17:01,130 --> 00:17:05,839 a feature extractor and we get a feature 363 00:17:02,810 --> 00:17:08,240 man okay I'm assuming that you are you 364 00:17:05,839 --> 00:17:09,649 know fast you are CNN or similar models 365 00:17:08,240 --> 00:17:12,530 and I'm just giving you a really short 366 00:17:09,650 --> 00:17:16,010 reminder then using this feature map we 367 00:17:12,530 --> 00:17:22,579 predict about let's say 2000 bounding 368 00:17:16,010 --> 00:17:26,900 box proposals and about 2000 bounding 369 00:17:22,579 --> 00:17:29,480 box proposals and some of them a few of 370 00:17:26,900 --> 00:17:31,460 them really cover the objects that we 371 00:17:29,480 --> 00:17:33,290 are interested in such as the cars but 372 00:17:31,460 --> 00:17:35,570 some of them are just false positive of 373 00:17:33,290 --> 00:17:38,450 our bounding box proposal mechanism and 374 00:17:35,570 --> 00:17:40,280 they lie on the background and then we 375 00:17:38,450 --> 00:17:43,120 take each of these these bounding boxes 376 00:17:40,280 --> 00:17:45,710 and we crop them from the feature map 377 00:17:43,120 --> 00:17:49,399 one by one we crop them from the feature 378 00:17:45,710 --> 00:17:51,380 map and we have like 2000 crop feature 379 00:17:49,400 --> 00:17:54,230 maps for different bounding boxes and 380 00:17:51,380 --> 00:17:57,110 then we put we put each of them through 381 00:17:54,230 --> 00:17:59,360 a second feature extractor which is also 382 00:17:57,110 --> 00:18:01,250 called the first part is called our PN 383 00:17:59,360 --> 00:18:03,560 region proposal network the second part 384 00:18:01,250 --> 00:18:07,160 is they called the second stage or first 385 00:18:03,560 --> 00:18:09,290 our CNN and at the end of this feature 386 00:18:07,160 --> 00:18:11,540 extractor we classify each bounding box 387 00:18:09,290 --> 00:18:15,680 and we find its coordinates with a 388 00:18:11,540 --> 00:18:18,639 regression head so the deformable Roi 389 00:18:15,680 --> 00:18:22,130 polling changes the implementation of 390 00:18:18,640 --> 00:18:26,600 how we crop the each of these proposals 391 00:18:22,130 --> 00:18:28,790 from the feature map so what is actually 392 00:18:26,600 --> 00:18:31,639 deformable RI pooling instead of 393 00:18:28,790 --> 00:18:33,950 cropping a rectangular bounding box a 394 00:18:31,640 --> 00:18:37,310 single rectangular bounding box we pull 395 00:18:33,950 --> 00:18:39,710 nine separate bounding boxes and we'll 396 00:18:37,310 --> 00:18:42,230 see how it actually works in a few 397 00:18:39,710 --> 00:18:45,470 minutes we pull nine separate bounding 398 00:18:42,230 --> 00:18:47,450 boxes and that way you can see that 399 00:18:45,470 --> 00:18:50,060 these objects and that's an example from 400 00:18:47,450 --> 00:18:52,490 the paper are able to cover the object 401 00:18:50,060 --> 00:18:54,470 of interest much more tightly and the 402 00:18:52,490 --> 00:18:56,690 features that are cropped from the 403 00:18:54,470 --> 00:18:58,970 feature map are much more relevant to 404 00:18:56,690 --> 00:19:02,480 classify the object of interest and are 405 00:18:58,970 --> 00:19:04,370 not wasted on background which in a site 406 00:19:02,480 --> 00:19:06,020 in which in large amounts in the for 407 00:19:04,370 --> 00:19:10,639 mobile object is less interesting and 408 00:19:06,020 --> 00:19:11,900 valuable for us so these were this was 409 00:19:10,640 --> 00:19:14,480 the description of the different 410 00:19:11,900 --> 00:19:17,150 components and 411 00:19:14,480 --> 00:19:20,090 it shows a very strong improvement in 412 00:19:17,150 --> 00:19:21,679 the both in both of the significant 413 00:19:20,090 --> 00:19:24,049 metrics in the world of object detection 414 00:19:21,679 --> 00:19:26,059 we have two metrics the cocoa metric and 415 00:19:24,049 --> 00:19:28,100 the Pascal VOC metric this is the cocoa 416 00:19:26,059 --> 00:19:31,010 metric which gives more weight to 417 00:19:28,100 --> 00:19:33,260 accurate localization how well am I 418 00:19:31,010 --> 00:19:35,960 giving tight bounding boxes around the 419 00:19:33,260 --> 00:19:38,090 object so this locally tight 420 00:19:35,960 --> 00:19:40,700 localization metric gets about five to 421 00:19:38,090 --> 00:19:43,760 ten percent relative improvement due to 422 00:19:40,700 --> 00:19:47,750 this solution and the second metric 423 00:19:43,760 --> 00:19:52,429 which is which only which gives less 424 00:19:47,750 --> 00:19:58,100 weight to tight localization and thus it 425 00:19:52,429 --> 00:20:01,429 means its value is actually by telling 426 00:19:58,100 --> 00:20:03,199 us even giving us more insight into how 427 00:20:01,429 --> 00:20:05,120 many objects we were missing or 428 00:20:03,200 --> 00:20:09,260 detecting how many objects am I not 429 00:20:05,120 --> 00:20:11,239 detecting at all etc so this metric by 430 00:20:09,260 --> 00:20:13,250 the way is much more important in my 431 00:20:11,240 --> 00:20:16,130 opinion to medical imaging applications 432 00:20:13,250 --> 00:20:18,650 in most cases because tight localization 433 00:20:16,130 --> 00:20:20,480 in many times it's less important but if 434 00:20:18,650 --> 00:20:22,640 we miss a medical finding critical 435 00:20:20,480 --> 00:20:26,179 medical finding that's something that 436 00:20:22,640 --> 00:20:30,169 the doctors will really be mad at us 437 00:20:26,179 --> 00:20:44,480 about so this metric is also improved by 438 00:20:30,169 --> 00:20:46,309 5% sorry sorry yeah that's the hours 439 00:20:44,480 --> 00:20:49,700 here in this table is their 440 00:20:46,309 --> 00:20:51,910 implementation of for example faster our 441 00:20:49,700 --> 00:20:54,910 CNN with deformable convolutional 442 00:20:51,910 --> 00:20:54,910 networks 443 00:20:55,920 --> 00:21:04,090 5% 5% Oh 51 444 00:21:01,060 --> 00:21:05,830 it's a Pascal vocab mini average 445 00:21:04,090 --> 00:21:08,199 precision metric I don't want to dive 446 00:21:05,830 --> 00:21:11,530 into that too much just take it as a 447 00:21:08,200 --> 00:21:14,590 score the score for how good your a 448 00:21:11,530 --> 00:21:16,810 detector is it's a it's not really 449 00:21:14,590 --> 00:21:22,600 important to to understand it right now 450 00:21:16,810 --> 00:21:26,889 okay what is the percentage of 451 00:21:22,600 --> 00:21:28,600 undetected object it is not you can't 452 00:21:26,890 --> 00:21:31,090 understand it from this number but you 453 00:21:28,600 --> 00:21:34,899 can only understand it that is it is it 454 00:21:31,090 --> 00:21:36,939 has increased by significant Emma it is 455 00:21:34,900 --> 00:21:39,340 improved by a relatively significant 456 00:21:36,940 --> 00:21:42,580 amount you don't know the exact number 457 00:21:39,340 --> 00:21:45,639 of undetected object because this metric 458 00:21:42,580 --> 00:21:47,649 covers a lot of different working points 459 00:21:45,640 --> 00:21:51,150 of sensitivity of recall and precision 460 00:21:47,650 --> 00:21:54,370 that you can choose for your algorithm 461 00:21:51,150 --> 00:21:57,760 okay so now let's talk about the 462 00:21:54,370 --> 00:22:03,790 implementation by the way after that we 463 00:21:57,760 --> 00:22:06,879 you can ask questions freely so please 464 00:22:03,790 --> 00:22:08,680 keep like keep your questions to the end 465 00:22:06,880 --> 00:22:11,410 of this part if you have any more 466 00:22:08,680 --> 00:22:15,820 questions unless they are really really 467 00:22:11,410 --> 00:22:17,320 important so first of all let's start by 468 00:22:15,820 --> 00:22:20,020 the with the implementation of 469 00:22:17,320 --> 00:22:21,820 deformable convolutions so this is the 470 00:22:20,020 --> 00:22:23,650 diagram that they have in the paper and 471 00:22:21,820 --> 00:22:25,720 I think that it's confusing a little bit 472 00:22:23,650 --> 00:22:28,750 because it's a good diagram but it 473 00:22:25,720 --> 00:22:32,530 contains too many levels of abstraction 474 00:22:28,750 --> 00:22:35,440 and it's hard to wrap your minds around 475 00:22:32,530 --> 00:22:37,629 what's going on here so I invested some 476 00:22:35,440 --> 00:22:40,330 time in decomposing this diagram into 477 00:22:37,630 --> 00:22:44,410 several parts so it would be easier to 478 00:22:40,330 --> 00:22:46,000 understand so this is the essence of the 479 00:22:44,410 --> 00:22:48,190 layer which is called deformable 480 00:22:46,000 --> 00:22:49,990 convolution the essence that we have an 481 00:22:48,190 --> 00:22:51,880 input feature map he doesn't have to be 482 00:22:49,990 --> 00:22:53,950 the image it's actually most of the 483 00:22:51,880 --> 00:22:56,620 times not it's not used directly on the 484 00:22:53,950 --> 00:23:02,260 image but on deeper layers on deeper 485 00:22:56,620 --> 00:23:03,250 feature maps and we put a cone we put we 486 00:23:02,260 --> 00:23:05,650 use 487 00:23:03,250 --> 00:23:08,110 a convolution all over this image but 488 00:23:05,650 --> 00:23:13,480 the convolution is not the old square 489 00:23:08,110 --> 00:23:15,699 3x3 a convolution it's a different 490 00:23:13,480 --> 00:23:17,590 sampling grid for each location of the 491 00:23:15,700 --> 00:23:19,180 convolution and then let's say that I'm 492 00:23:17,590 --> 00:23:22,300 talking about this location so I have 493 00:23:19,180 --> 00:23:23,890 nine points that I'm sampling with in 494 00:23:22,300 --> 00:23:28,060 the locations of the blue squares and 495 00:23:23,890 --> 00:23:30,490 then those nine nine points are 496 00:23:28,060 --> 00:23:31,990 transformed into one point just like in 497 00:23:30,490 --> 00:23:34,990 the regular convolution the 3 by 3 498 00:23:31,990 --> 00:23:37,750 square was converted into one point or 499 00:23:34,990 --> 00:23:42,070 one vector in the feature map one 500 00:23:37,750 --> 00:23:45,910 spatial location so this is the essence 501 00:23:42,070 --> 00:23:49,389 now the implementation so you start by 502 00:23:45,910 --> 00:23:52,030 doing a regular square 3x3 convolution 503 00:23:49,390 --> 00:23:55,270 ignore the blue squares for now we start 504 00:23:52,030 --> 00:23:58,420 with the regular 3x3 convolution with 505 00:23:55,270 --> 00:24:03,490 the square shape and the output of this 506 00:23:58,420 --> 00:24:06,540 convolution is a feature map with which 507 00:24:03,490 --> 00:24:09,550 size is relatively the input feature map 508 00:24:06,540 --> 00:24:13,690 spatial size but the depths of this 509 00:24:09,550 --> 00:24:17,070 feature map is about 9 times to 18 y 9 510 00:24:13,690 --> 00:24:20,500 times 2 yeah it's because we can we can 511 00:24:17,070 --> 00:24:22,510 visualize it this this is just an aid to 512 00:24:20,500 --> 00:24:24,070 understand it this is not a stage this 513 00:24:22,510 --> 00:24:28,150 is the last computation that happens 514 00:24:24,070 --> 00:24:32,590 here actually but each 980 a vector of 515 00:24:28,150 --> 00:24:39,910 length 18 can be seen as 2 squares of 516 00:24:32,590 --> 00:24:42,909 size 3 by 3 so the top left elements in 517 00:24:39,910 --> 00:24:45,430 these two square give us the offsets 518 00:24:42,910 --> 00:24:47,740 that tell us where to locate the top 519 00:24:45,430 --> 00:24:51,070 left sampling point of our sampling grid 520 00:24:47,740 --> 00:24:53,770 and that Center squares in these two 521 00:24:51,070 --> 00:24:55,210 squares tell us where to place the 522 00:24:53,770 --> 00:24:57,310 offset the tell us where to place the 523 00:24:55,210 --> 00:25:02,020 center blue square in our new sampling 524 00:24:57,310 --> 00:25:04,659 grid and and and because we have nine 525 00:25:02,020 --> 00:25:07,300 squares nine elements in each each of 526 00:25:04,660 --> 00:25:13,260 them we get the offset for for each of 527 00:25:07,300 --> 00:25:13,260 our new blue squares and and 528 00:25:16,070 --> 00:25:22,830 yeah that's it so now the - yes so one 529 00:25:21,600 --> 00:25:24,090 square of course is the horizontal 530 00:25:22,830 --> 00:25:27,000 offsets 531 00:25:24,090 --> 00:25:28,918 tell tell us on the left-to-right axis 532 00:25:27,000 --> 00:25:30,870 how much do we want to move out each of 533 00:25:28,919 --> 00:25:34,020 our squares and the second square gives 534 00:25:30,870 --> 00:25:36,510 us the vertical offset so then we take 535 00:25:34,020 --> 00:25:40,370 this offset and we just sample them from 536 00:25:36,510 --> 00:25:42,840 the input feature map sample them 537 00:25:40,370 --> 00:25:46,469 multiply them with our with the weight 538 00:25:42,840 --> 00:25:51,959 of our convolutional kernel and get the 539 00:25:46,470 --> 00:25:53,580 vector and there is a little bit of a 540 00:25:51,960 --> 00:25:58,289 problem with what I just described 541 00:25:53,580 --> 00:26:01,320 because this convolutional layer is a 542 00:25:58,289 --> 00:26:03,960 convolution so it outputs continuous 543 00:26:01,320 --> 00:26:06,450 valued real numbers it doesn't output 544 00:26:03,960 --> 00:26:08,460 integers but we need to sample in order 545 00:26:06,450 --> 00:26:10,409 to sample the image when the image is 546 00:26:08,460 --> 00:26:12,480 discrete it contains discrete pixels so 547 00:26:10,409 --> 00:26:14,580 we need the integers but the problem is 548 00:26:12,480 --> 00:26:16,260 that we can't round these numbers 549 00:26:14,580 --> 00:26:19,139 because then it wouldn't be 550 00:26:16,260 --> 00:26:21,419 differentiable so and then we wouldn't 551 00:26:19,140 --> 00:26:23,880 be able to back propagate through it or 552 00:26:21,419 --> 00:26:26,429 it will require much heavier solution 553 00:26:23,880 --> 00:26:30,320 and a much cumbersome solution so what 554 00:26:26,429 --> 00:26:34,350 we do is say something that this group 555 00:26:30,320 --> 00:26:37,530 mentions a lot in their papers imagine 556 00:26:34,350 --> 00:26:39,809 that we have just like if we wanted to 557 00:26:37,530 --> 00:26:42,450 just if we had two coordinates the x 558 00:26:39,809 --> 00:26:45,539 coordinate was 2.3 and the y coordinate 559 00:26:42,450 --> 00:26:49,140 was 7.2 and we wanted to sample it from 560 00:26:45,539 --> 00:26:51,809 the image hmm so and we wanted to sample 561 00:26:49,140 --> 00:26:53,190 them from the image sample this point 562 00:26:51,809 --> 00:26:56,850 from the image we could use bilinear 563 00:26:53,190 --> 00:26:59,039 interpolation in order to like to 564 00:26:56,850 --> 00:27:02,580 interpolate what should be the value at 565 00:26:59,039 --> 00:27:05,370 that point so fortunately by neat 566 00:27:02,580 --> 00:27:07,408 bilinear interpolation can blink can be 567 00:27:05,370 --> 00:27:09,809 implemented very inefficient efficiently 568 00:27:07,409 --> 00:27:13,200 using matrix operators and matrix 569 00:27:09,809 --> 00:27:15,899 multiplication and that's why we can do 570 00:27:13,200 --> 00:27:18,450 it for many points of the sampling grid 571 00:27:15,900 --> 00:27:22,010 in real time and even on the GPU of 572 00:27:18,450 --> 00:27:25,799 course so explaining it's not very 573 00:27:22,010 --> 00:27:28,240 complex to understand how this the 574 00:27:25,799 --> 00:27:31,540 implementation of the Metro 575 00:27:28,240 --> 00:27:33,430 bi linear interpolation work but it is 576 00:27:31,540 --> 00:27:34,720 outside of the scope of this token if 577 00:27:33,430 --> 00:27:37,650 you are interested in it you can come 578 00:27:34,720 --> 00:27:40,630 talk to me about it later 579 00:27:37,650 --> 00:27:43,450 okay so let's say about the first 580 00:27:40,630 --> 00:27:45,160 component by the way if anyone has a 581 00:27:43,450 --> 00:27:50,340 question about this component ask 582 00:27:45,160 --> 00:27:50,340 because maybe it's better time yeah yeah 583 00:27:52,830 --> 00:27:55,830 yeah 584 00:28:01,910 --> 00:28:05,330 right so 585 00:28:10,830 --> 00:28:17,019 no it's those are okay so I repeat the 586 00:28:14,259 --> 00:28:20,049 question so everyone will hear so I said 587 00:28:17,019 --> 00:28:22,360 that first of all before I do the 588 00:28:20,049 --> 00:28:24,490 convolution with the square the yellow 589 00:28:22,360 --> 00:28:26,408 square the regular convolution I don't 590 00:28:24,490 --> 00:28:28,179 know the offsets for where I want to 591 00:28:26,409 --> 00:28:30,669 locate my blue sampling grid of the 592 00:28:28,179 --> 00:28:33,039 deformable convolutions and then I said 593 00:28:30,669 --> 00:28:37,509 that I when I know these sampling points 594 00:28:33,039 --> 00:28:41,669 I take them and multiply them with a the 595 00:28:37,509 --> 00:28:45,309 convolution kernel and what's your name 596 00:28:41,669 --> 00:28:47,769 Lisa and Lisa asked me if it's the same 597 00:28:45,309 --> 00:28:49,240 kernel if the same kernel is used for 598 00:28:47,769 --> 00:28:51,250 both of these convolutions or it's a 599 00:28:49,240 --> 00:28:54,490 different kernel so it's a different 600 00:28:51,250 --> 00:28:56,980 kernel between the yellow convolution 601 00:28:54,490 --> 00:28:59,440 has a single kernel and the blue 602 00:28:56,980 --> 00:29:04,269 convolution has a different kernel okay 603 00:28:59,440 --> 00:29:08,279 and they are learned separately okay any 604 00:29:04,269 --> 00:29:08,279 other questions yeah 605 00:29:12,400 --> 00:29:19,630 mm-hm 606 00:29:13,550 --> 00:29:19,629 what do you mean mhm 607 00:29:28,190 --> 00:29:32,190 probably but you know empirically it 608 00:29:30,539 --> 00:29:33,690 improves the results so I guess it has 609 00:29:32,190 --> 00:29:35,610 some drawbacks and maybe this solution 610 00:29:33,690 --> 00:29:39,240 can be improved but it has also 611 00:29:35,610 --> 00:29:42,990 desirable is that he asked me if it 612 00:29:39,240 --> 00:29:45,509 maybe maybe it creates some 613 00:29:42,990 --> 00:29:47,850 discontinuity because of the weird 614 00:29:45,509 --> 00:29:51,629 sampling strategy so probably it has 615 00:29:47,850 --> 00:29:53,519 some disadvantages but like I even the 616 00:29:51,629 --> 00:29:55,590 convolution that we are using today also 617 00:29:53,519 --> 00:29:57,539 has disadvantages so it's I think the 618 00:29:55,590 --> 00:29:59,490 only question is which mechanism has 619 00:29:57,539 --> 00:30:04,590 more disadvantages there relative to its 620 00:29:59,490 --> 00:30:08,039 advantages yes yeah 621 00:30:04,590 --> 00:30:14,120 so when you get when you get the loss 622 00:30:08,039 --> 00:30:14,120 you back propagate them just like 623 00:30:15,559 --> 00:30:20,639 through your bilinear interpolation 624 00:30:17,750 --> 00:30:24,529 operator that I that I talked about so 625 00:30:20,639 --> 00:30:27,840 you you get from it you have these 626 00:30:24,529 --> 00:30:30,570 numbers and you multiply them with a 627 00:30:27,840 --> 00:30:34,110 matrix of a bilinear interpolation and 628 00:30:30,570 --> 00:30:35,970 then you get these values okay it's not 629 00:30:34,110 --> 00:30:37,830 that you do something active to sample 630 00:30:35,970 --> 00:30:39,870 them it's just like you have a bilinear 631 00:30:37,830 --> 00:30:42,210 intermet ryx which is a bilinear 632 00:30:39,870 --> 00:30:46,320 interpolation kernel and you multiply it 633 00:30:42,210 --> 00:30:49,769 with with these numbers after some 634 00:30:46,320 --> 00:30:51,960 vector operations and then you get the 635 00:30:49,769 --> 00:30:54,720 values that are sampled in each of these 636 00:30:51,960 --> 00:30:57,539 points and then you multiply them with 637 00:30:54,720 --> 00:30:59,549 an with an another matrix so it's back 638 00:30:57,539 --> 00:31:03,919 propagated through the bilinear 639 00:30:59,549 --> 00:31:03,918 interpolation operator yes 640 00:31:03,960 --> 00:31:08,419 different sampling patterns for it so in 641 00:31:07,500 --> 00:31:12,889 the 642 00:31:08,419 --> 00:31:14,629 oh I hope I understood your question yes 643 00:31:12,889 --> 00:31:18,320 if there is a different sampling pattern 644 00:31:14,629 --> 00:31:20,539 for each pixel in the image so I'll go 645 00:31:18,320 --> 00:31:23,629 back if I hope I understood your 646 00:31:20,539 --> 00:31:27,259 question I'll go back to this example 647 00:31:23,629 --> 00:31:29,389 images that I showed here and I hope 648 00:31:27,259 --> 00:31:32,539 this answers your question you can see 649 00:31:29,389 --> 00:31:35,539 that for this pixel the the sampling is 650 00:31:32,539 --> 00:31:39,799 much more has a much wider coverage and 651 00:31:35,539 --> 00:31:44,230 for this pixel or activation the the 652 00:31:39,799 --> 00:31:47,330 coverage is is much smaller and the 653 00:31:44,230 --> 00:31:53,619 receptive field is a function of the 654 00:31:47,330 --> 00:31:53,619 local input and just a second 655 00:31:55,269 --> 00:32:03,100 the for each location in the image the 656 00:31:59,179 --> 00:32:05,869 offsets are a function of these 3x3 657 00:32:03,100 --> 00:32:07,789 pixels in the input feature Maps so of 658 00:32:05,869 --> 00:32:09,499 course that you will get different 659 00:32:07,789 --> 00:32:11,029 offsets if you place your conversion 660 00:32:09,499 --> 00:32:12,499 here or any purple if you place a 661 00:32:11,029 --> 00:32:14,559 convolution here does that answer your 662 00:32:12,499 --> 00:32:14,559 question 663 00:32:17,869 --> 00:32:28,019 it depends on the yellow convolution yes 664 00:32:23,509 --> 00:32:30,269 the offsets yes the yellow regular 3x3 665 00:32:28,019 --> 00:32:34,399 to the convolution square to the 666 00:32:30,269 --> 00:32:37,679 convolution determines the offsets and 667 00:32:34,399 --> 00:32:39,658 of course that the the output of the 668 00:32:37,679 --> 00:32:42,029 convolution is different for each part 669 00:32:39,659 --> 00:32:44,789 of the image because it's input is 670 00:32:42,029 --> 00:32:46,919 different okay and that's why the offset 671 00:32:44,789 --> 00:32:48,959 will be dead that's the mechanism that 672 00:32:46,919 --> 00:32:50,339 enables just a second that enables the 673 00:32:48,959 --> 00:33:02,789 offsets to be different between 674 00:32:50,339 --> 00:33:05,039 different parts of the image okay the 675 00:33:02,789 --> 00:33:07,799 output of the original square 676 00:33:05,039 --> 00:33:10,229 convolution enables us to sample the 677 00:33:07,799 --> 00:33:12,029 image for the real convolution for the 678 00:33:10,229 --> 00:33:14,039 deformable convolution which which is 679 00:33:12,029 --> 00:33:15,629 and that convolution is the one that way 680 00:33:14,039 --> 00:33:18,739 that really creates the next feature map 681 00:33:15,629 --> 00:33:18,738 of our feature extractors 682 00:33:27,330 --> 00:33:31,429 [Music] 683 00:33:35,799 --> 00:33:42,729 I'm sorry not your range - okay 684 00:33:46,020 --> 00:33:52,680 wait there's some wait please I would 685 00:33:51,000 --> 00:33:55,170 love to answer a question and I prefer 686 00:33:52,680 --> 00:33:57,360 to I think it's better that we cover 687 00:33:55,170 --> 00:34:00,810 less papers but I understand them better 688 00:33:57,360 --> 00:34:03,659 but please keep it to only like if you 689 00:34:00,810 --> 00:34:05,280 have a gaps to understand what I just 690 00:34:03,660 --> 00:34:06,660 explained and don't be shy to ask 691 00:34:05,280 --> 00:34:10,909 because I'm sure that you are not the 692 00:34:06,660 --> 00:34:10,909 only one that didn't understand yes 693 00:34:21,040 --> 00:34:26,259 can you speak louder I didn't see one 694 00:34:34,449 --> 00:34:40,009 something grid yes for each pixel in the 695 00:34:37,969 --> 00:34:42,109 original for each spatial location in 696 00:34:40,010 --> 00:34:45,679 the original feature map we have 697 00:34:42,109 --> 00:34:48,290 different 18 values they determine the 698 00:34:45,679 --> 00:34:50,359 real set the new sampling read the 699 00:34:48,290 --> 00:34:52,339 deformable self sampling grid and this 700 00:34:50,359 --> 00:34:57,078 sampling grid is different between some 701 00:34:52,339 --> 00:35:00,040 some between spatial locations okay 702 00:34:57,079 --> 00:35:00,040 yes 703 00:35:08,589 --> 00:35:13,930 I will love if you could keep this 704 00:35:11,380 --> 00:35:15,789 question today after we finish covering 705 00:35:13,930 --> 00:35:18,009 this paper and I also have an example of 706 00:35:15,789 --> 00:35:18,700 the reason I think it's interesting okay 707 00:35:18,009 --> 00:35:24,190 thanks 708 00:35:18,700 --> 00:35:26,049 yes the the the layer degenerates the 709 00:35:24,190 --> 00:35:27,339 offset only one layer and it's even a 710 00:35:26,049 --> 00:35:47,380 linear layer it doesn't have a 711 00:35:27,339 --> 00:35:49,359 non-linearity yeah I I think it could be 712 00:35:47,380 --> 00:35:53,069 an in in an interesting paper to try it 713 00:35:49,359 --> 00:35:53,069 with more convolutions and see if it's 714 00:35:53,910 --> 00:35:57,009 [Music] 715 00:35:58,170 --> 00:36:02,369 re what what do you mean 716 00:36:20,160 --> 00:36:50,170 I'm not if the in this if I understand 717 00:36:47,890 --> 00:36:51,580 your question correctly if this called 718 00:36:50,170 --> 00:36:53,410 the yellow convolution will be on 719 00:36:51,580 --> 00:36:55,240 different locations in the image but the 720 00:36:53,410 --> 00:36:57,490 the values of these locations will be 721 00:36:55,240 --> 00:37:00,580 equal then the offsets will also be 722 00:36:57,490 --> 00:37:05,279 equal is that your question mm-hmm okay 723 00:37:00,580 --> 00:37:09,270 so yes mm-hmm okay 724 00:37:05,280 --> 00:37:09,270 can we move on yeah 725 00:37:14,060 --> 00:37:32,570 no the offsets are not Li are not 726 00:37:16,520 --> 00:37:35,180 bounded but so mathematically nothing 727 00:37:32,570 --> 00:37:39,290 bounds this offsets and we also know 728 00:37:35,180 --> 00:37:41,149 that we from traditional assess if the 729 00:37:39,290 --> 00:37:43,850 offsets are bounded to the area of this 730 00:37:41,150 --> 00:37:47,390 safe 3x3 square and the authors are not 731 00:37:43,850 --> 00:37:50,210 bounded and usually they are larger than 732 00:37:47,390 --> 00:37:52,190 these 3x3 square because even in 733 00:37:50,210 --> 00:37:55,370 traditional object detection we know 734 00:37:52,190 --> 00:37:59,240 that we can use a small convolution 735 00:37:55,370 --> 00:38:02,600 kernel to predict much larger bounding 736 00:37:59,240 --> 00:38:07,310 boxes that are larger than than than the 737 00:38:02,600 --> 00:38:10,940 receptive field of these kernels so the 738 00:38:07,310 --> 00:38:13,759 is it's like if you look at it my torso 739 00:38:10,940 --> 00:38:16,220 you have enough information to know that 740 00:38:13,760 --> 00:38:19,630 my head is up to here and my footer are 741 00:38:16,220 --> 00:38:22,819 down there right so you can infer the 742 00:38:19,630 --> 00:38:26,500 wanted sampling points even if you are 743 00:38:22,820 --> 00:38:26,500 looking just on a part of an object 744 00:38:36,330 --> 00:38:42,460 no it just uses the features that it has 745 00:38:39,010 --> 00:38:46,480 and in just a limited spatial context is 746 00:38:42,460 --> 00:38:48,190 enough to predict something to infer for 747 00:38:46,480 --> 00:38:49,390 something that is outside of your of 748 00:38:48,190 --> 00:38:55,440 your context 749 00:38:49,390 --> 00:38:55,440 okay okay I'll yeah 750 00:39:07,790 --> 00:39:15,359 okay good question she asked if after we 751 00:39:11,660 --> 00:39:16,589 guys can like be quite so people will be 752 00:39:15,359 --> 00:39:19,859 able to hear Thanks 753 00:39:16,589 --> 00:39:21,509 so she asked if after we do this 754 00:39:19,859 --> 00:39:23,970 deformable convolutions maybe there are 755 00:39:21,510 --> 00:39:26,430 pixels that are not covered by our blue 756 00:39:23,970 --> 00:39:28,319 pixels all over the image and yeah it 757 00:39:26,430 --> 00:39:29,790 can happen nothing ensures us that it 758 00:39:28,319 --> 00:39:32,220 doesn't happen and it's okay that it 759 00:39:29,790 --> 00:39:35,640 happens because maybe these pixels are 760 00:39:32,220 --> 00:39:38,308 yeah yeah maybe there is the information 761 00:39:35,640 --> 00:39:40,558 there is less relevant yeah or maybe we 762 00:39:38,309 --> 00:39:43,380 are me it's possible do we also miss 763 00:39:40,559 --> 00:39:46,980 something that is important but if it 764 00:39:43,380 --> 00:39:48,839 happens that we our if it's important it 765 00:39:46,980 --> 00:39:50,690 means it will harm our classification 766 00:39:48,839 --> 00:39:53,279 results and then in the backpropagation 767 00:39:50,690 --> 00:39:55,920 these weights that generated offsets 768 00:39:53,280 --> 00:40:05,780 will adapt to predict better offsets so 769 00:39:55,920 --> 00:40:05,780 yeah what do you mean 770 00:40:17,900 --> 00:40:22,380 I 771 00:40:19,290 --> 00:40:24,330 I'm pretty sure that like in the formula 772 00:40:22,380 --> 00:40:27,690 there is nothing that prevents it but I 773 00:40:24,330 --> 00:40:30,590 get but but I guess it's something that 774 00:40:27,690 --> 00:40:33,890 that just happens because it's it's 775 00:40:30,590 --> 00:40:36,270 because it I guess it's it's not really 776 00:40:33,890 --> 00:40:37,859 beneficial in any way that they will 777 00:40:36,270 --> 00:40:41,360 converge to the same sample the same 778 00:40:37,860 --> 00:40:43,830 point and then it the way that I learned 779 00:40:41,360 --> 00:40:48,510 like a naturally simple difference 780 00:40:43,830 --> 00:40:50,100 points okay so let's move on I just want 781 00:40:48,510 --> 00:40:54,960 to see how much time we have 782 00:40:50,100 --> 00:40:57,420 okay we're good so now let's move on to 783 00:40:54,960 --> 00:41:00,630 the second component which is the 784 00:40:57,420 --> 00:41:03,720 deformable our Y pulling and I just want 785 00:41:00,630 --> 00:41:05,580 to give a quick reminder of the regular 786 00:41:03,720 --> 00:41:07,709 our Y pulling and there are many ways to 787 00:41:05,580 --> 00:41:11,549 perform our pulling there also called 788 00:41:07,710 --> 00:41:13,110 our warping and other names so now it 789 00:41:11,550 --> 00:41:15,270 doesn't really matter it can work with 790 00:41:13,110 --> 00:41:16,710 all of these methods and I'm going to 791 00:41:15,270 --> 00:41:18,630 demonstrate it with the original 792 00:41:16,710 --> 00:41:24,690 alright pulling which by the way is not 793 00:41:18,630 --> 00:41:27,240 is not differentiable and okay 794 00:41:24,690 --> 00:41:30,120 so by the way it's not a French Abell 795 00:41:27,240 --> 00:41:32,220 and the solution that I that I spoke 796 00:41:30,120 --> 00:41:34,049 about here is also the solution to make 797 00:41:32,220 --> 00:41:35,339 our i pulling differentiable or one of 798 00:41:34,050 --> 00:41:37,200 the solution to make our i pulling 799 00:41:35,340 --> 00:41:40,050 differentiable the solution with the 800 00:41:37,200 --> 00:41:42,899 bilinear interpolation operation so let 801 00:41:40,050 --> 00:41:44,850 how does our i pulling works from the 802 00:41:42,900 --> 00:41:46,980 RPM from the first stage of the fester 803 00:41:44,850 --> 00:41:49,980 are seen and we get a bounding box 804 00:41:46,980 --> 00:41:53,670 proposal and then we split this proposal 805 00:41:49,980 --> 00:41:57,750 into several bins for example to buy two 806 00:41:53,670 --> 00:42:00,810 bins or in reality it's a seven by seven 807 00:41:57,750 --> 00:42:02,340 or 14 by 14 in most of the cases but for 808 00:42:00,810 --> 00:42:04,799 the simplicity of the example let's 809 00:42:02,340 --> 00:42:08,070 assume it's two by two then for each of 810 00:42:04,800 --> 00:42:10,110 these bins separately we perform max 811 00:42:08,070 --> 00:42:12,060 pooling on the entire bin it doesn't 812 00:42:10,110 --> 00:42:14,250 matter the bin size we perform max 813 00:42:12,060 --> 00:42:18,420 pulling on the entire bin so for example 814 00:42:14,250 --> 00:42:21,660 for this P being we get 0.74 for this 815 00:42:18,420 --> 00:42:24,000 mean we get 0.39 etc so it can be 816 00:42:21,660 --> 00:42:25,470 max pulling in the paper it they talk 817 00:42:24,000 --> 00:42:30,359 about average pulling it doesn't really 818 00:42:25,470 --> 00:42:32,819 matter and this is the original ROI 819 00:42:30,359 --> 00:42:35,819 pulling and deferrable are I pulling the 820 00:42:32,819 --> 00:42:38,630 idea is that we keep the same things and 821 00:42:35,819 --> 00:42:43,650 we have the same sizes for these bins 822 00:42:38,630 --> 00:42:45,720 but we take each bin keep its size but 823 00:42:43,650 --> 00:42:47,760 give it an offset so we take the top 824 00:42:45,720 --> 00:42:49,859 right bin and we place it somewhere here 825 00:42:47,760 --> 00:42:52,890 and take the top left beam and we place 826 00:42:49,859 --> 00:42:55,890 it somewhere here etc so in reality we 827 00:42:52,890 --> 00:42:59,098 have like 7 by 7 bins and we predict 828 00:42:55,890 --> 00:43:03,299 offsets for all of them and that's 829 00:42:59,099 --> 00:43:05,970 that's uh basically how it works so the 830 00:43:03,299 --> 00:43:07,470 implementation is very similar to the to 831 00:43:05,970 --> 00:43:11,819 the previous implementation it will be 832 00:43:07,470 --> 00:43:15,089 very easy to understand now so this is 833 00:43:11,819 --> 00:43:18,058 the input feature map from which we crop 834 00:43:15,089 --> 00:43:20,400 and we perform the ROI pulling on this 835 00:43:18,059 --> 00:43:22,890 is the feature map that we pull it at 836 00:43:20,400 --> 00:43:25,500 our Y from so let's say we have an hour 837 00:43:22,890 --> 00:43:27,118 Y this is the and in this example 838 00:43:25,500 --> 00:43:30,990 they're all with all of their diagrams 839 00:43:27,119 --> 00:43:33,390 are for 3x3 bins I just mentioned 2x2 840 00:43:30,990 --> 00:43:35,430 they do it with three by three bins so 841 00:43:33,390 --> 00:43:37,140 we have in our Y when we split it to 842 00:43:35,430 --> 00:43:39,990 three by three bins it's a again it's 843 00:43:37,140 --> 00:43:45,629 the three by squeeze yellow square here 844 00:43:39,990 --> 00:43:49,740 and then we do the regular ry pulling on 845 00:43:45,630 --> 00:43:54,690 this ROI and we get the the down sample 846 00:43:49,740 --> 00:43:58,038 the ROI and then we put this ROI into a 847 00:43:54,690 --> 00:44:01,170 fully connected layer and again we get 848 00:43:58,039 --> 00:44:04,260 vector this time we get a single vector 849 00:44:01,170 --> 00:44:07,319 for this ROI of size 18 which can will 850 00:44:04,260 --> 00:44:10,049 and we can look at this vector as two 851 00:44:07,319 --> 00:44:12,029 squares of size three by three which are 852 00:44:10,049 --> 00:44:14,400 the horizontal and the vertical offsets 853 00:44:12,029 --> 00:44:17,039 for each bin so the value in the top 854 00:44:14,400 --> 00:44:20,760 left part of these squares is the offset 855 00:44:17,039 --> 00:44:24,930 for the top left bin and the value for 856 00:44:20,760 --> 00:44:27,270 the top right the two values in the top 857 00:44:24,930 --> 00:44:30,868 right a part of the square is the offset 858 00:44:27,270 --> 00:44:34,049 for the top right beam and that way we 859 00:44:30,869 --> 00:44:34,860 can get an offset for each bin and place 860 00:44:34,049 --> 00:44:37,410 them in 861 00:44:34,860 --> 00:44:43,620 in a sample different parts of the 862 00:44:37,410 --> 00:44:45,509 feature map with them and again these we 863 00:44:43,620 --> 00:44:49,650 need to sample the beans and do max 864 00:44:45,510 --> 00:44:51,750 pooling on areas that are a coordinates 865 00:44:49,650 --> 00:44:53,370 that are not squid and this is also 866 00:44:51,750 --> 00:44:56,010 solved using the same bilinear 867 00:44:53,370 --> 00:44:59,160 interpolation and matrix multiplication 868 00:44:56,010 --> 00:45:03,300 that I mentioned earlier so some really 869 00:44:59,160 --> 00:45:06,750 cool examples for AHA from their paper 870 00:45:03,300 --> 00:45:09,930 on how are I for how different about our 871 00:45:06,750 --> 00:45:11,790 i pulling works so you can see that this 872 00:45:09,930 --> 00:45:14,210 is the original proposal the yellow is 873 00:45:11,790 --> 00:45:19,529 the original proposal and then the nine 874 00:45:14,210 --> 00:45:21,270 different bins are we pull the original 875 00:45:19,530 --> 00:45:23,100 proposal and we put it into a fully 876 00:45:21,270 --> 00:45:24,990 connected layer and actually predict 877 00:45:23,100 --> 00:45:27,029 offsets which would give us nine 878 00:45:24,990 --> 00:45:29,160 different bounding boxes which are the 879 00:45:27,030 --> 00:45:32,610 bounding boxes in red so you can see how 880 00:45:29,160 --> 00:45:36,330 nicely they cover the cat and the less 881 00:45:32,610 --> 00:45:40,260 relevant information here is not we 882 00:45:36,330 --> 00:45:42,240 don't waste any any capacity on it so I 883 00:45:40,260 --> 00:45:47,400 think this is really really elegant and 884 00:45:42,240 --> 00:45:49,979 another example which we have a good 885 00:45:47,400 --> 00:45:52,080 bounding this is an example of the post 886 00:45:49,980 --> 00:45:54,780 problem so the woman is reaching her 887 00:45:52,080 --> 00:45:58,740 head hand forward and thus the bounding 888 00:45:54,780 --> 00:46:01,890 box that covers her covers the has a lot 889 00:45:58,740 --> 00:46:04,740 of wasted space that we and we will 890 00:46:01,890 --> 00:46:06,420 waste our pooled features on these on 891 00:46:04,740 --> 00:46:10,609 the features of the of this background 892 00:46:06,420 --> 00:46:12,960 and it speaks for itself I think and 893 00:46:10,610 --> 00:46:17,840 regarding how this can be useful for 894 00:46:12,960 --> 00:46:17,840 medical applications I think that yeah 895 00:46:25,140 --> 00:46:31,740 yeah sure it's good great that you asked 896 00:46:29,430 --> 00:46:33,509 because this is their like the basic so 897 00:46:31,740 --> 00:46:35,640 is this is the most important things 898 00:46:33,510 --> 00:46:37,530 that everyone will learn is turned so I 899 00:46:35,640 --> 00:46:39,810 will explain again how from the 18 900 00:46:37,530 --> 00:46:41,070 offsets from how from the 18 numbers 901 00:46:39,810 --> 00:46:43,170 that are the output of the fully 902 00:46:41,070 --> 00:46:48,990 connected you can get nine different 903 00:46:43,170 --> 00:46:50,580 bounding boxes okay so by the way it 904 00:46:48,990 --> 00:46:52,680 should I explain it again also for the 905 00:46:50,580 --> 00:47:00,960 deformable convolutions or just for the 906 00:46:52,680 --> 00:47:07,230 deformable Roi polling okay so when you 907 00:47:00,960 --> 00:47:11,370 get 18 numbers the yellow 3x3 structure 908 00:47:07,230 --> 00:47:13,710 here is the original each each sub 909 00:47:11,370 --> 00:47:18,810 square each of these nine sub squares 910 00:47:13,710 --> 00:47:21,870 are the original bins the original 3x3 911 00:47:18,810 --> 00:47:23,670 bins of the original proposal and we 912 00:47:21,870 --> 00:47:25,500 know the location of there's the 913 00:47:23,670 --> 00:47:27,960 coordinates of the center for each pin 914 00:47:25,500 --> 00:47:31,110 it can be can be calculated easily so 915 00:47:27,960 --> 00:47:33,390 now I have I have their Center and I 916 00:47:31,110 --> 00:47:35,670 have two additional numbers I have the 917 00:47:33,390 --> 00:47:37,740 four this top left square I have did 918 00:47:35,670 --> 00:47:42,210 it's horizontal offset for example if 919 00:47:37,740 --> 00:47:44,520 the offset is minus 2.5 then I know that 920 00:47:42,210 --> 00:47:48,060 the new center will for the top left 921 00:47:44,520 --> 00:47:50,370 beam will be placed minus 2 minus 2.5 922 00:47:48,060 --> 00:47:52,200 offset in the X in our rosante axis 923 00:47:50,370 --> 00:47:55,529 compared to the original centre of that 924 00:47:52,200 --> 00:47:58,200 bin and I take the value from this 925 00:47:55,530 --> 00:48:02,550 square which represent the vertical 926 00:47:58,200 --> 00:48:04,529 offset and if the value is minus 3.1 927 00:48:02,550 --> 00:48:08,430 then I know that this the center will be 928 00:48:04,530 --> 00:48:11,160 located minus 2 2 in the vertical axis 929 00:48:08,430 --> 00:48:13,950 and minus 3 in the solid minus 2 in the 930 00:48:11,160 --> 00:48:16,620 horizontal axis minus minus 3 in the 931 00:48:13,950 --> 00:48:20,250 vertical axis and I repeat this process 932 00:48:16,620 --> 00:48:22,589 it's doing it it's performed in a vector 933 00:48:20,250 --> 00:48:25,560 operation but this process you can 934 00:48:22,590 --> 00:48:27,630 imagine that is repeated 9 9 times for 935 00:48:25,560 --> 00:48:30,779 each of these bins and that way we can 936 00:48:27,630 --> 00:48:32,880 get the new sampling points of our of 937 00:48:30,780 --> 00:48:37,400 our grid do you think it was more 938 00:48:32,880 --> 00:48:37,400 understandable right now ok great 939 00:48:37,730 --> 00:48:44,670 so if we look at the medical case so I 940 00:48:41,880 --> 00:48:46,440 will come back to the same example they 941 00:48:44,670 --> 00:48:50,820 showed earlier and I think this 942 00:48:46,440 --> 00:48:53,850 demonstrates pretty well how one finding 943 00:48:50,820 --> 00:48:55,830 that can be detected pretty nicely but 944 00:48:53,850 --> 00:48:59,220 when you want to classify it if I just 945 00:48:55,830 --> 00:49:00,569 took this ROI and I pulled it and I put 946 00:48:59,220 --> 00:49:03,600 it through the second stage of my 947 00:49:00,570 --> 00:49:05,730 detector then most of the information 948 00:49:03,600 --> 00:49:08,640 most of the features that will be pulled 949 00:49:05,730 --> 00:49:11,910 will will be healthy pixels healthy 950 00:49:08,640 --> 00:49:14,580 brain pixels so it increases the chances 951 00:49:11,910 --> 00:49:17,520 that like the classifier of the second 952 00:49:14,580 --> 00:49:20,970 stage will miss classify this example as 953 00:49:17,520 --> 00:49:24,440 a healthy example so if I use the 954 00:49:20,970 --> 00:49:28,169 deformable are I pulling I can it 955 00:49:24,440 --> 00:49:30,240 naturally covers the objects the object 956 00:49:28,170 --> 00:49:34,280 the interesting object in a much tighter 957 00:49:30,240 --> 00:49:42,569 way and then they pulled our eyes are 958 00:49:34,280 --> 00:49:46,410 the the pooled ROI is the features in 959 00:49:42,570 --> 00:49:49,950 the pooled roi are much more relevant to 960 00:49:46,410 --> 00:49:51,960 the to the non healthy pixels in the 961 00:49:49,950 --> 00:50:01,919 image do you think this answers your 962 00:49:51,960 --> 00:50:06,119 question from before okay great yeah the 963 00:50:01,920 --> 00:50:10,650 final region the do you mean the final 964 00:50:06,119 --> 00:50:12,810 prediction of the of the model the final 965 00:50:10,650 --> 00:50:15,359 prediction of the model will still be 966 00:50:12,810 --> 00:50:16,080 this this yellow square this yellow 967 00:50:15,359 --> 00:50:18,090 rectangle 968 00:50:16,080 --> 00:50:20,549 the original proposal of the yellow 969 00:50:18,090 --> 00:50:22,560 rectangle or or every or a single 970 00:50:20,550 --> 00:50:24,180 rectangular e it will not be the 971 00:50:22,560 --> 00:50:26,520 original proposal it will probably be 972 00:50:24,180 --> 00:50:28,379 refined by the second stage but it will 973 00:50:26,520 --> 00:50:31,080 be something like this original proposal 974 00:50:28,380 --> 00:50:34,080 but these nine nine different bounding 975 00:50:31,080 --> 00:50:35,730 boxes help us classify and refine the 976 00:50:34,080 --> 00:50:38,000 coordinates of this bounding box much 977 00:50:35,730 --> 00:50:38,000 better 978 00:50:38,420 --> 00:50:41,530 [Music] 979 00:50:43,430 --> 00:50:58,379 sorry you said yes not at the end of the 980 00:50:56,820 --> 00:51:00,720 first stage at the beginning you have a 981 00:50:58,380 --> 00:51:04,800 normal region proposed a layer you get 982 00:51:00,720 --> 00:51:07,350 the proposal and then and then you take 983 00:51:04,800 --> 00:51:08,220 this proposal you do regular ROI pulling 984 00:51:07,350 --> 00:51:11,490 on this proposal 985 00:51:08,220 --> 00:51:13,680 you put the pull original proposal into 986 00:51:11,490 --> 00:51:15,870 a fully connected layer and then you get 987 00:51:13,680 --> 00:51:21,049 the offsets for that enable you to 988 00:51:15,870 --> 00:51:24,330 locate these red rectangles on the image 989 00:51:21,050 --> 00:51:26,430 these rectangles are past the features 990 00:51:24,330 --> 00:51:30,660 under these rectangles are passed to the 991 00:51:26,430 --> 00:51:32,819 second stage of the detector the second 992 00:51:30,660 --> 00:51:36,629 stage uses these features to classify 993 00:51:32,820 --> 00:51:38,520 the original yellow rectangle and they 994 00:51:36,630 --> 00:51:40,800 are the final output it doesn't matter 995 00:51:38,520 --> 00:51:43,470 what where these red rectangles will be 996 00:51:40,800 --> 00:51:46,380 placed the final output of this entire 997 00:51:43,470 --> 00:51:49,169 detector will still be something like a 998 00:51:46,380 --> 00:51:54,350 single yellow rectangle around this 999 00:51:49,170 --> 00:51:58,830 proposal okay okay great 1000 00:51:54,350 --> 00:52:05,310 some best practices so first of all they 1001 00:51:58,830 --> 00:52:06,870 tried it on the less layers only so the 1002 00:52:05,310 --> 00:52:09,240 the meaning of that is that they only 1003 00:52:06,870 --> 00:52:11,460 try to model to use this to model 1004 00:52:09,240 --> 00:52:13,740 deformation in high level features you 1005 00:52:11,460 --> 00:52:17,490 can think about it's very intuitive when 1006 00:52:13,740 --> 00:52:18,990 you look at something like this change 1007 00:52:17,490 --> 00:52:20,910 of pose and you have feature a feature 1008 00:52:18,990 --> 00:52:22,890 for a hand and a feature for a hand and 1009 00:52:20,910 --> 00:52:27,420 a feature for a torso and that way you 1010 00:52:22,890 --> 00:52:33,690 can model at their locations in in an 1011 00:52:27,420 --> 00:52:35,760 irregular way and they try to use it on 1012 00:52:33,690 --> 00:52:37,500 more than the three less layers and it 1013 00:52:35,760 --> 00:52:40,600 give them they gave them diminishing 1014 00:52:37,500 --> 00:52:45,340 returns so the use resonant 101 layer 1015 00:52:40,600 --> 00:52:48,009 in this paper and the this is the end of 1016 00:52:45,340 --> 00:52:51,640 the resident 101 these are the last 26 1017 00:52:48,010 --> 00:52:54,820 layers so this is the left resonant 1018 00:52:51,640 --> 00:53:00,190 block and this is the the one before it 1019 00:52:54,820 --> 00:53:02,320 so a large amount or of the convolutions 1020 00:53:00,190 --> 00:53:04,330 in resonate are 1x1 convolutions and of 1021 00:53:02,320 --> 00:53:05,770 course it's probably less interesting to 1022 00:53:04,330 --> 00:53:08,170 do something deformable with the 1023 00:53:05,770 --> 00:53:11,650 location of these convolutions so the 1024 00:53:08,170 --> 00:53:14,980 four you have three blocks like this so 1025 00:53:11,650 --> 00:53:17,380 the the optimal a configuration was to 1026 00:53:14,980 --> 00:53:19,630 put the deformable solution on on each 1027 00:53:17,380 --> 00:53:21,250 of these 3x3 convolution and when they 1028 00:53:19,630 --> 00:53:24,370 tried it on some of the convolutions out 1029 00:53:21,250 --> 00:53:27,670 of these 23 it didn't gave them too much 1030 00:53:24,370 --> 00:53:30,069 value in addition what's really amazing 1031 00:53:27,670 --> 00:53:32,620 about this solution is that it answered 1032 00:53:30,070 --> 00:53:35,440 our requirement of not adding a lot of 1033 00:53:32,620 --> 00:53:37,120 complexity to our model and the number 1034 00:53:35,440 --> 00:53:40,090 of parameters in the network barely 1035 00:53:37,120 --> 00:53:42,730 increased and also the inference time 1036 00:53:40,090 --> 00:53:50,340 for a single image didn't increase 1037 00:53:42,730 --> 00:53:51,730 significantly which is very nice and per 1038 00:53:50,340 --> 00:53:56,620 right 1039 00:53:51,730 --> 00:54:01,210 so yeah the their work is on 2d images 1040 00:53:56,620 --> 00:54:04,870 so of course but yeah adapting it to 2 3 1041 00:54:01,210 --> 00:54:08,200 2 3 D is the is more is more advanced 1042 00:54:04,870 --> 00:54:10,540 just like 3d convolutions also require a 1043 00:54:08,200 --> 00:54:14,770 bit further explanations on how to work 1044 00:54:10,540 --> 00:54:20,529 with them and this benefit of a very 1045 00:54:14,770 --> 00:54:22,090 efficient and weak inference is a I 1046 00:54:20,530 --> 00:54:24,370 think due to the layers being 1047 00:54:22,090 --> 00:54:26,500 implemented in CUDA so they implemented 1048 00:54:24,370 --> 00:54:28,960 them in CUDA and it's important that you 1049 00:54:26,500 --> 00:54:31,270 know that their original cuda 1050 00:54:28,960 --> 00:54:33,400 implementation is open-source it was 1051 00:54:31,270 --> 00:54:35,890 originally intended to be used with MX 1052 00:54:33,400 --> 00:54:38,170 net but already several repositories 1053 00:54:35,890 --> 00:54:42,009 around the internet adapted them to use 1054 00:54:38,170 --> 00:54:44,080 with under other frameworks such as 1055 00:54:42,010 --> 00:54:46,000 tensor flow and they if you work with 1056 00:54:44,080 --> 00:54:46,569 chaos then this will also work for you 1057 00:54:46,000 --> 00:54:49,020 et cetera 1058 00:54:46,570 --> 00:54:49,020 yeah 1059 00:54:57,280 --> 00:55:08,710 convolution okay so how can we how can 1060 00:55:06,849 --> 00:55:10,180 it be that it had so few parameters so 1061 00:55:08,710 --> 00:55:14,140 for example because they only do it on 1062 00:55:10,180 --> 00:55:17,290 three convolutions so yeah that's part 1063 00:55:14,140 --> 00:55:19,720 of the reason I guess yeah so but it's 1064 00:55:17,290 --> 00:55:22,000 it's like we can sit on it later and you 1065 00:55:19,720 --> 00:55:25,270 can see easily that it that's the number 1066 00:55:22,000 --> 00:55:27,940 of parameter that it adds yeah let me 1067 00:55:25,270 --> 00:55:34,720 finish just a like two slides and then 1068 00:55:27,940 --> 00:55:39,190 we we can have more questions so as we 1069 00:55:34,720 --> 00:55:41,319 expected the receptive field is affected 1070 00:55:39,190 --> 00:55:43,420 by the object size just like we saw 1071 00:55:41,320 --> 00:55:50,710 intuitive in the intuitive cherry-picked 1072 00:55:43,420 --> 00:55:53,080 examples in the beginning they they they 1073 00:55:50,710 --> 00:55:54,970 analyzed all of the objects in the data 1074 00:55:53,080 --> 00:55:58,119 set or many object in their data set and 1075 00:55:54,970 --> 00:55:59,560 they checked what are the what is the 1076 00:55:58,119 --> 00:56:01,359 receptive field of the deformable 1077 00:55:59,560 --> 00:56:03,549 convolutions for the small objects the 1078 00:56:01,359 --> 00:56:06,310 medium objects the large opt objects and 1079 00:56:03,550 --> 00:56:10,530 the end when the convolution is on top 1080 00:56:06,310 --> 00:56:13,210 of the background and they saw that the 1081 00:56:10,530 --> 00:56:16,119 for the large object objects and for the 1082 00:56:13,210 --> 00:56:18,339 background the offsets were larger than 1083 00:56:16,119 --> 00:56:20,619 the receptive fields were largest just 1084 00:56:18,339 --> 00:56:25,390 as we would expect intuitively from this 1085 00:56:20,619 --> 00:56:29,250 mechanism but what if maybe the 1086 00:56:25,390 --> 00:56:32,170 deformable part and sampling the the 1087 00:56:29,250 --> 00:56:33,880 sampling the image in an irregular grid 1088 00:56:32,170 --> 00:56:35,950 maybe it's not really important maybe 1089 00:56:33,880 --> 00:56:38,609 the only thing that's important here is 1090 00:56:35,950 --> 00:56:40,509 just the dilation is just sampling 1091 00:56:38,609 --> 00:56:43,270 further context 1092 00:56:40,510 --> 00:56:45,820 so using dilated convolutions in the 1093 00:56:43,270 --> 00:56:47,859 last layers of the of the feature 1094 00:56:45,820 --> 00:56:50,589 extraction extractors that are used for 1095 00:56:47,859 --> 00:56:54,069 detection is already standard practice 1096 00:56:50,589 --> 00:56:58,570 in detection networks and it and it does 1097 00:56:54,070 --> 00:57:01,089 improve the performance but usually it's 1098 00:56:58,570 --> 00:57:03,400 used with the I'm sorry I'll explain 1099 00:57:01,089 --> 00:57:04,960 what are dilated convolutions for for a 1100 00:57:03,400 --> 00:57:07,330 second so this is the standard 1101 00:57:04,960 --> 00:57:09,760 convolution and dilated convolution is 1102 00:57:07,330 --> 00:57:10,930 just in keeping it a square but putting 1103 00:57:09,760 --> 00:57:12,910 holes between the 1104 00:57:10,930 --> 00:57:15,730 playing points so this is with a 1105 00:57:12,910 --> 00:57:19,390 dilation of the hole sizes 1 and here 1106 00:57:15,730 --> 00:57:25,750 the hole size is 2 or it's also called 1107 00:57:19,390 --> 00:57:27,670 the dilation rate and usually people use 1108 00:57:25,750 --> 00:57:30,579 the duration size of 2 and they show 1109 00:57:27,670 --> 00:57:32,500 that for their cocoa applications in 1110 00:57:30,579 --> 00:57:35,140 some segmentation applications is even 1111 00:57:32,500 --> 00:57:39,490 more optimal to use a dilation size of 4 1112 00:57:35,140 --> 00:57:41,920 and 6 but it depends the optimal 1113 00:57:39,490 --> 00:57:43,779 dilation depends on your architecture it 1114 00:57:41,920 --> 00:57:46,300 depends on your specific imagine it even 1115 00:57:43,780 --> 00:57:48,640 depends on your specific object because 1116 00:57:46,300 --> 00:57:53,550 as we saw for large objects and small 1117 00:57:48,640 --> 00:57:57,220 objects the dilation is different so 1118 00:57:53,550 --> 00:58:01,150 even if the only important thing here is 1119 00:57:57,220 --> 00:58:03,149 the dilation still a solution like that 1120 00:58:01,150 --> 00:58:07,839 will be desirable because the dilation 1121 00:58:03,150 --> 00:58:09,670 is learned and adapted to the local 1122 00:58:07,839 --> 00:58:13,720 parts of the image and to the object 1123 00:58:09,670 --> 00:58:16,030 that you are looking on it's a 1124 00:58:13,720 --> 00:58:18,189 generalization of convolutions in 1125 00:58:16,030 --> 00:58:27,280 general and the dilation convolution 1126 00:58:18,190 --> 00:58:29,380 specifically yeah what sorry on acid 1127 00:58:27,280 --> 00:58:32,950 yeah dilated convolution on acid that 1128 00:58:29,380 --> 00:58:39,369 will be the title of my next talk I love 1129 00:58:32,950 --> 00:58:43,720 it so but they showed that if you use 1130 00:58:39,369 --> 00:58:46,839 their method then it 1131 00:58:43,720 --> 00:58:49,500 it improves even on the most optimal a 1132 00:58:46,839 --> 00:58:52,540 configuration that they could use when 1133 00:58:49,500 --> 00:58:54,520 with just dilated convolution so it even 1134 00:58:52,540 --> 00:58:57,190 improved the results further and it 1135 00:58:54,520 --> 00:58:59,319 didn't require any manual tuning or 1136 00:58:57,190 --> 00:59:03,160 hyper parameter sweep of the dilation 1137 00:58:59,319 --> 00:59:05,589 hyper parameter ok so these were 1138 00:59:03,160 --> 00:59:08,410 deformable convolutions now we have two 1139 00:59:05,589 --> 00:59:10,540 choices one choice is to move on to the 1140 00:59:08,410 --> 00:59:13,058 next paper which can be featured pyramid 1141 00:59:10,540 --> 00:59:15,460 networks or vocalist and the second 1142 00:59:13,059 --> 00:59:16,930 choice can be to like answer questions 1143 00:59:15,460 --> 00:59:21,089 so i don'r 1144 00:59:16,930 --> 00:59:21,089 either you decide or we can have a vote 1145 00:59:24,220 --> 00:59:34,430 okay so any anyone that has more 1146 00:59:32,240 --> 00:59:36,370 questions can of course come to me later 1147 00:59:34,430 --> 00:59:40,009 and ask them or send me an email or 1148 00:59:36,370 --> 00:59:43,730 whatever you want so feature pyramid 1149 00:59:40,010 --> 00:59:46,520 networks this is also this is the paper 1150 00:59:43,730 --> 00:59:48,980 by Facebook AI research group one of the 1151 00:59:46,520 --> 00:59:55,280 best object detection and deep learning 1152 00:59:48,980 --> 00:59:57,890 teams in the world and it's as I told 1153 00:59:55,280 --> 01:00:01,850 you the previous paper try to answer the 1154 00:59:57,890 --> 01:00:03,200 challenge of the anatomy and the 1155 01:00:01,850 --> 01:00:05,029 pathologies that to the medical 1156 01:00:03,200 --> 01:00:07,220 pathologies that we are trying to look 1157 01:00:05,030 --> 01:00:09,260 for the previous paper I try to answer 1158 01:00:07,220 --> 01:00:12,620 the problem that they don't have every 1159 01:00:09,260 --> 01:00:14,720 they have any irregular shape and in a 1160 01:00:12,620 --> 01:00:17,390 deformable shape and this paper tries to 1161 01:00:14,720 --> 01:00:21,470 answer the problem of the object being 1162 01:00:17,390 --> 01:00:23,210 small so this is a spine fracture as a 1163 01:00:21,470 --> 01:00:26,029 fracture in the spine you can see it 1164 01:00:23,210 --> 01:00:28,370 here and what is the problem of what 1165 01:00:26,030 --> 01:00:30,200 what what is what is the problem with 1166 01:00:28,370 --> 01:00:32,720 the small object why are they difficult 1167 01:00:30,200 --> 01:00:37,339 for neural network so just as an 1168 01:00:32,720 --> 01:00:41,689 intuition if we perform max pulling on 1169 01:00:37,340 --> 01:00:44,390 and each each of the and we have several 1170 01:00:41,690 --> 01:00:46,460 neurons several neurons next to each 1171 01:00:44,390 --> 01:00:48,259 other and these neurons receptive field 1172 01:00:46,460 --> 01:00:50,390 covers mainly this area and the 1173 01:00:48,260 --> 01:00:52,550 receptive field of this neuron covers 1174 01:00:50,390 --> 01:00:54,560 the fracture and the receptive field of 1175 01:00:52,550 --> 01:00:57,890 this neuron covers this area of the bone 1176 01:00:54,560 --> 01:01:00,950 then after then even if we have really 1177 01:00:57,890 --> 01:01:02,900 good feature and we know that here this 1178 01:01:00,950 --> 01:01:04,759 neurons indicate that it's there is a 1179 01:01:02,900 --> 01:01:06,590 bone under it and here it indicates 1180 01:01:04,760 --> 01:01:09,050 there is a fracture under it and here it 1181 01:01:06,590 --> 01:01:11,900 indicates there is a bone under it after 1182 01:01:09,050 --> 01:01:14,390 we perform max pulling on the three of 1183 01:01:11,900 --> 01:01:16,850 these neurons we live we lose the 1184 01:01:14,390 --> 01:01:19,670 spatial order between them we know there 1185 01:01:16,850 --> 01:01:23,299 is bones there are bones there and we 1186 01:01:19,670 --> 01:01:25,220 know there is a something like a hole 1187 01:01:23,300 --> 01:01:27,680 there but the hole is not necessarily a 1188 01:01:25,220 --> 01:01:29,660 fracture and we need to understand the 1189 01:01:27,680 --> 01:01:32,480 spatial structure between the things in 1190 01:01:29,660 --> 01:01:33,390 order to really classify fractures so 1191 01:01:32,480 --> 01:01:36,600 this is 1192 01:01:33,390 --> 01:01:38,910 one intuition for white convolutional 1193 01:01:36,600 --> 01:01:42,960 neural networks have problems with small 1194 01:01:38,910 --> 01:01:44,850 objects this is not the only reason by 1195 01:01:42,960 --> 01:01:48,210 the way another reason is class 1196 01:01:44,850 --> 01:01:51,630 imbalance small objects are more 1197 01:01:48,210 --> 01:01:53,820 underrepresented in in our data and this 1198 01:01:51,630 --> 01:01:58,170 is also another reason but this paper 1199 01:01:53,820 --> 01:02:01,800 deals with the problem of not having 1200 01:01:58,170 --> 01:02:05,340 good enough features that describe the 1201 01:02:01,800 --> 01:02:10,620 special occasions so the motivation for 1202 01:02:05,340 --> 01:02:18,210 this paper is weak 1203 01:02:10,620 --> 01:02:20,370 a lot of papers before did such things 1204 01:02:18,210 --> 01:02:23,100 similar to that maybe instead of just 1205 01:02:20,370 --> 01:02:25,770 predicting using the deepest feature map 1206 01:02:23,100 --> 01:02:28,110 maybe we can combine somehow feature 1207 01:02:25,770 --> 01:02:31,850 maps from several depths and a lot of 1208 01:02:28,110 --> 01:02:31,850 papers did it before and they also show 1209 01:02:31,880 --> 01:02:41,700 like like dense net as well then set 1210 01:02:35,070 --> 01:02:43,530 also did did something like that but the 1211 01:02:41,700 --> 01:02:45,149 Internet's like it's part of the of the 1212 01:02:43,530 --> 01:02:47,700 architecture is like you can say that 1213 01:02:45,150 --> 01:02:49,980 ResNet also also does it because of 1214 01:02:47,700 --> 01:02:52,980 resonant also has keep connections but I 1215 01:02:49,980 --> 01:02:54,870 mean that you take your final feature 1216 01:02:52,980 --> 01:03:04,740 map that you predict from will say T in 1217 01:02:54,870 --> 01:03:07,620 in a minute okay so so we can assume 1218 01:03:04,740 --> 01:03:09,959 that features from shallower feature 1219 01:03:07,620 --> 01:03:13,470 maps can be important to classify small 1220 01:03:09,960 --> 01:03:15,780 objects because they were developed 1221 01:03:13,470 --> 01:03:18,540 before doing too much max pooling so 1222 01:03:15,780 --> 01:03:21,270 they lost less spatial information so it 1223 01:03:18,540 --> 01:03:24,960 would be desirable to use them when we 1224 01:03:21,270 --> 01:03:28,800 when we classify small objects but we 1225 01:03:24,960 --> 01:03:31,680 miss them because the the network we use 1226 01:03:28,800 --> 01:03:34,290 only the last layer and we we could say 1227 01:03:31,680 --> 01:03:39,270 that we we can hope that the network 1228 01:03:34,290 --> 01:03:41,430 will be smart enough to develop good 1229 01:03:39,270 --> 01:03:43,080 enough features before it does max 1230 01:03:41,430 --> 01:03:45,029 pooling to identify that this is the 1231 01:03:43,080 --> 01:03:46,100 fracture before it does works pulling 1232 01:03:45,030 --> 01:03:50,540 what it while it 1233 01:03:46,100 --> 01:03:54,950 so while it while it has the the lower 1234 01:03:50,540 --> 01:03:57,920 level features and we can hope that it 1235 01:03:54,950 --> 01:03:59,870 will work but as we know with neural 1236 01:03:57,920 --> 01:04:01,880 networks a lot of times if you don't 1237 01:03:59,870 --> 01:04:04,310 force the network and you don't encode 1238 01:04:01,880 --> 01:04:06,680 your prior knowledge of the problem into 1239 01:04:04,310 --> 01:04:08,450 the architectures design then the neural 1240 01:04:06,680 --> 01:04:10,370 networks don't behave optimally and 1241 01:04:08,450 --> 01:04:14,000 although they could fit many types of 1242 01:04:10,370 --> 01:04:15,830 functions they tend to fit not the 1243 01:04:14,000 --> 01:04:18,430 optimal functions unless you encode your 1244 01:04:15,830 --> 01:04:21,319 primary information into the design so 1245 01:04:18,430 --> 01:04:24,440 there are a lot of ways to to combine 1246 01:04:21,320 --> 01:04:26,180 the shallower the shallower layers and 1247 01:04:24,440 --> 01:04:27,140 show a shallower feature map but this is 1248 01:04:26,180 --> 01:04:29,629 currently the most popular 1249 01:04:27,140 --> 01:04:31,700 implementation and the reason the reason 1250 01:04:29,630 --> 01:04:33,860 that I feel confident to say it is that 1251 01:04:31,700 --> 01:04:38,180 in the last cocoa object detection 1252 01:04:33,860 --> 01:04:40,460 competition in the end of 2017 all four 1253 01:04:38,180 --> 01:04:42,980 top competitors all four top submission 1254 01:04:40,460 --> 01:04:45,920 used feature pyramid networks as a major 1255 01:04:42,980 --> 01:04:47,990 component of their submission and it 1256 01:04:45,920 --> 01:04:51,290 improves object detection accuracy by 1257 01:04:47,990 --> 01:04:53,720 about ten percent so what I really love 1258 01:04:51,290 --> 01:04:57,110 about this paper is that it puts 1259 01:04:53,720 --> 01:05:04,220 simplicity and elegance is a major part 1260 01:04:57,110 --> 01:05:08,120 of their work so the first element of it 1261 01:05:04,220 --> 01:05:11,000 is they said we already know that in 1262 01:05:08,120 --> 01:05:13,190 convolutional neural networks if we use 1263 01:05:11,000 --> 01:05:18,190 image pyramids some of you may be know 1264 01:05:13,190 --> 01:05:21,260 it is test time multi scale so I'll 1265 01:05:18,190 --> 01:05:23,720 explain what it is in a minute if we use 1266 01:05:21,260 --> 01:05:27,200 an image pyramid it really improves our 1267 01:05:23,720 --> 01:05:29,390 way to deal with smaller objects so what 1268 01:05:27,200 --> 01:05:32,180 is an image pyramid we take the original 1269 01:05:29,390 --> 01:05:34,190 image and we scale it up to many sizes 1270 01:05:32,180 --> 01:05:36,649 or scale it up and down to many sizes 1271 01:05:34,190 --> 01:05:38,210 and then small objects appear larger and 1272 01:05:36,650 --> 01:05:40,760 are less affected by the pulling 1273 01:05:38,210 --> 01:05:43,520 operation and it has several several 1274 01:05:40,760 --> 01:05:46,520 other advantages and then we pass if we 1275 01:05:43,520 --> 01:05:48,650 use where's at 101 for example we pass 1276 01:05:46,520 --> 01:05:50,750 each of these scaled images separately 1277 01:05:48,650 --> 01:05:53,060 and we have nominal nominally like 10 1278 01:05:50,750 --> 01:05:55,520 sizes for example which is each of these 1279 01:05:53,060 --> 01:05:59,390 10 images separately through the 101 1280 01:05:55,520 --> 01:06:00,950 layers and get separate predictions for 1281 01:05:59,390 --> 01:06:03,830 each of them 1282 01:06:00,950 --> 01:06:06,109 and this works quite well the problem 1283 01:06:03,830 --> 01:06:07,640 with it is it's not really feasible for 1284 01:06:06,110 --> 01:06:10,250 most application between because it we 1285 01:06:07,640 --> 01:06:13,509 it requires a lot of time to make so 1286 01:06:10,250 --> 01:06:16,250 many forward passes of large networks so 1287 01:06:13,510 --> 01:06:18,590 they try to imitate or take their 1288 01:06:16,250 --> 01:06:20,660 intuition from image pyramid which which 1289 01:06:18,590 --> 01:06:23,540 has already proven itself and in many of 1290 01:06:20,660 --> 01:06:25,609 the design choices in the paper they try 1291 01:06:23,540 --> 01:06:27,529 to just instead of inventing something 1292 01:06:25,610 --> 01:06:29,480 from strategy try to imitate something 1293 01:06:27,530 --> 01:06:32,480 that already exists and is known to work 1294 01:06:29,480 --> 01:06:35,510 and you can see that even in the diagram 1295 01:06:32,480 --> 01:06:37,820 it looks quite similar to image pyramid 1296 01:06:35,510 --> 01:06:40,820 and we'll talk more about it in a few 1297 01:06:37,820 --> 01:06:43,760 minutes so you can use feature pyramid 1298 01:06:40,820 --> 01:06:45,620 networks as a part of the RPM the region 1299 01:06:43,760 --> 01:06:47,810 proposal network the first stage of the 1300 01:06:45,620 --> 01:06:51,350 faster are CNN and as the part of the 1301 01:06:47,810 --> 01:06:53,870 second stage of the detection network so 1302 01:06:51,350 --> 01:06:55,970 there it is combined differently into 1303 01:06:53,870 --> 01:06:57,470 these two parts of the network and we'll 1304 01:06:55,970 --> 01:07:01,310 talk about each of them separately so 1305 01:06:57,470 --> 01:07:02,930 for combining it into the RPN and now 1306 01:07:01,310 --> 01:07:07,610 you will understand what feature pyramid 1307 01:07:02,930 --> 01:07:09,740 networks actually do we take the image 1308 01:07:07,610 --> 01:07:12,080 and put it through our feature extractor 1309 01:07:09,740 --> 01:07:17,080 this is our feature extractor for 1310 01:07:12,080 --> 01:07:19,310 example ResNet 101 101 and then we have 1311 01:07:17,080 --> 01:07:21,290 something that they call a lateral 1312 01:07:19,310 --> 01:07:23,720 connection which is a one-by-one 1313 01:07:21,290 --> 01:07:26,120 convolution a one-by-one convolution 1314 01:07:23,720 --> 01:07:30,379 that keeps this feature map the same 1315 01:07:26,120 --> 01:07:33,529 size but transforms it to be to have 256 1316 01:07:30,380 --> 01:07:36,530 features and then we take this new 1317 01:07:33,530 --> 01:07:39,070 feature map and we up sample it using 1318 01:07:36,530 --> 01:07:42,620 nearest neighbor up sampling with a 1319 01:07:39,070 --> 01:07:45,200 scale factor of 2 in each dimension so 1320 01:07:42,620 --> 01:07:47,390 we enlarge it and since our pulling in 1321 01:07:45,200 --> 01:07:50,270 the original fixed feature extractor was 1322 01:07:47,390 --> 01:07:52,609 also in with a factor of 2 then now the 1323 01:07:50,270 --> 01:07:55,730 scaled up feature map is the same size 1324 01:07:52,610 --> 01:07:57,860 of the feature maps of this for the 1325 01:07:55,730 --> 01:08:00,950 previous stage in the feature extractor 1326 01:07:57,860 --> 01:08:04,100 so now we can take we can choose a 1327 01:08:00,950 --> 01:08:05,839 single feature map from the previous 1328 01:08:04,100 --> 01:08:09,589 pooling stage in the feature X structure 1329 01:08:05,840 --> 01:08:11,480 and we can combine them using a one by 1330 01:08:09,590 --> 01:08:12,900 one convolution on this feature map and 1331 01:08:11,480 --> 01:08:15,830 summation of the 1332 01:08:12,900 --> 01:08:19,170 featuring us not concatenation summation 1333 01:08:15,830 --> 01:08:20,790 and we repeat this process several times 1334 01:08:19,170 --> 01:08:23,580 actually this process is repeated 1335 01:08:20,790 --> 01:08:27,660 something like five or six times 1336 01:08:23,580 --> 01:08:31,080 depending on the implementation and then 1337 01:08:27,660 --> 01:08:33,180 we get a pyramid of feature Maps and the 1338 01:08:31,080 --> 01:08:38,939 shallowest feature Maps feature map 1339 01:08:33,180 --> 01:08:42,300 contains features from all contains 1340 01:08:38,939 --> 01:08:44,729 features that were transformed for from 1341 01:08:42,300 --> 01:08:45,960 all or almost all of the levels of 1342 01:08:44,729 --> 01:08:50,189 pulling in the original feature 1343 01:08:45,960 --> 01:08:52,800 extractor and then sorry for each of 1344 01:08:50,189 --> 01:08:55,649 these feature maps we predict separately 1345 01:08:52,800 --> 01:08:56,970 we predict bounding boxes from this this 1346 01:08:55,649 --> 01:08:58,759 feature map separately and this 1347 01:08:56,970 --> 01:09:01,050 separately and this one separately and 1348 01:08:58,760 --> 01:09:03,990 there are only three feature maps drawn 1349 01:09:01,050 --> 01:09:08,600 here but in practice they use five or 1350 01:09:03,990 --> 01:09:12,500 six depending on the implementation so 1351 01:09:08,600 --> 01:09:15,360 which which layers do we choose for this 1352 01:09:12,500 --> 01:09:17,510 to which layers do we choose for the 1353 01:09:15,359 --> 01:09:24,089 lateral connection that does anyone have 1354 01:09:17,510 --> 01:09:25,800 like I guess mmm yeah so we we don't 1355 01:09:24,090 --> 01:09:28,050 take only shallower layers we take both 1356 01:09:25,800 --> 01:09:31,770 shallow layers and deep layers but we 1357 01:09:28,050 --> 01:09:33,390 take one layer from each before we there 1358 01:09:31,770 --> 01:09:35,340 are five pulling operation in the 1359 01:09:33,390 --> 01:09:37,680 network so before each of these pooling 1360 01:09:35,340 --> 01:09:39,750 operations we take the let the output of 1361 01:09:37,680 --> 01:09:42,420 the of the last convolutional layer 1362 01:09:39,750 --> 01:09:45,149 before before this pooling operation why 1363 01:09:42,420 --> 01:09:47,399 because as we said before the pooling 1364 01:09:45,149 --> 01:09:50,009 operation is the component that code 1365 01:09:47,399 --> 01:09:53,519 that causes us the trouble with the 1366 01:09:50,010 --> 01:09:54,870 small objects so after we perform the 1367 01:09:53,520 --> 01:09:57,720 pooling operation we will lose 1368 01:09:54,870 --> 01:09:59,309 information so we want from from each 1369 01:09:57,720 --> 01:10:04,470 pooling level to take some information 1370 01:09:59,310 --> 01:10:06,120 some features and we and we choose the 1371 01:10:04,470 --> 01:10:08,580 last layer before the pooling because 1372 01:10:06,120 --> 01:10:12,510 intuitively it has the most developed 1373 01:10:08,580 --> 01:10:17,300 features for this level of spatial 1374 01:10:12,510 --> 01:10:17,300 information okay yeah 1375 01:10:21,730 --> 01:10:26,389 it's similar but you predict from each 1376 01:10:24,290 --> 01:10:28,670 level yeah it's as I said it's it's very 1377 01:10:26,390 --> 01:10:31,370 similar to many other architectures they 1378 01:10:28,670 --> 01:10:33,290 didn't invent the concept of using 1379 01:10:31,370 --> 01:10:36,620 features from shallower levels it was 1380 01:10:33,290 --> 01:10:39,469 mentioned in dozens of papers but this 1381 01:10:36,620 --> 01:10:41,840 group and also several maybe a few other 1382 01:10:39,469 --> 01:10:45,020 groups come in the same time are the 1383 01:10:41,840 --> 01:10:47,510 first to propose this mechanism of 1384 01:10:45,020 --> 01:10:50,870 predicting from several stages by the 1385 01:10:47,510 --> 01:10:53,570 way SSD SSD detector also predicts from 1386 01:10:50,870 --> 01:10:56,750 several stages but it doesn't use the 1387 01:10:53,570 --> 01:10:58,940 shallower features it starts it starts 1388 01:10:56,750 --> 01:11:00,800 from the top and creates new layers but 1389 01:10:58,940 --> 01:11:04,909 it never combines the shallower features 1390 01:11:00,800 --> 01:11:09,489 and that's where SSD misses okay yeah I 1391 01:11:04,909 --> 01:11:09,489 sorry yeah you are returned before yeah 1392 01:11:23,210 --> 01:11:27,930 this one yes and I will show the results 1393 01:11:27,150 --> 01:11:30,410 in a few minutes 1394 01:11:27,930 --> 01:11:33,470 mm-hm yeah 1395 01:11:30,410 --> 01:11:33,470 [Music] 1396 01:11:40,119 --> 01:11:45,348 from each level from each for each other 1397 01:11:43,550 --> 01:11:46,760 I'll explain so that everyone can hear 1398 01:11:45,349 --> 01:11:49,699 won't repeat the question I'll just 1399 01:11:46,760 --> 01:11:53,360 explain the process okay thank you 1400 01:11:49,699 --> 01:11:56,059 so for each pulling operation you take 1401 01:11:53,360 --> 01:12:01,159 the output of the convolutional layer 1402 01:11:56,060 --> 01:12:03,110 before it and you put it to a one by one 1403 01:12:01,159 --> 01:12:07,309 convolution to reduce its number of 1404 01:12:03,110 --> 01:12:09,500 features to 256 and then you sum it with 1405 01:12:07,310 --> 01:12:12,170 the up sampling from the top down 1406 01:12:09,500 --> 01:12:19,190 connections that they have here okay is 1407 01:12:12,170 --> 01:12:21,530 that this this feature map is a result 1408 01:12:19,190 --> 01:12:24,500 of a summation of all of the of all of 1409 01:12:21,530 --> 01:12:26,360 the 500 outputs but the one above it 1410 01:12:24,500 --> 01:12:28,580 doesn't include the one below it it's 1411 01:12:26,360 --> 01:12:37,989 only a result of the summation of of the 1412 01:12:28,580 --> 01:12:37,989 the ones above it okay and yeah okay yes 1413 01:12:42,790 --> 01:12:46,720 more lateral connections 1414 01:13:14,330 --> 01:13:19,980 I think the III would love if we could 1415 01:13:18,360 --> 01:13:21,750 answer this question later because I 1416 01:13:19,980 --> 01:13:23,070 think it's a less of an understanding 1417 01:13:21,750 --> 01:13:24,720 question and more of an intuition 1418 01:13:23,070 --> 01:13:26,519 question so if anyone has more 1419 01:13:24,720 --> 01:13:30,920 understanding question about the concept 1420 01:13:26,520 --> 01:13:30,920 then it's important to ask them now yes 1421 01:13:32,030 --> 01:13:43,320 the training is end to end yes we will 1422 01:13:41,580 --> 01:13:45,990 get to it in one of the next few slides 1423 01:13:43,320 --> 01:13:49,139 mm-hmm how do you how you train it we'll 1424 01:13:45,990 --> 01:13:52,559 get to it actually I think right right 1425 01:13:49,140 --> 01:13:55,380 now so now that you have all of these 1426 01:13:52,560 --> 01:13:59,100 these five pyramid levels and we see a 1427 01:13:55,380 --> 01:14:01,080 three of them here we use a predict head 1428 01:13:59,100 --> 01:14:03,270 on top of each of them which predicts 1429 01:14:01,080 --> 01:14:06,720 the bounding boxes or the proposal 1430 01:14:03,270 --> 01:14:09,840 bounding boxes and how do these these 1431 01:14:06,720 --> 01:14:12,480 heads work they work just like the head 1432 01:14:09,840 --> 01:14:14,280 investor our CNN this is by the way part 1433 01:14:12,480 --> 01:14:16,500 of what I said of there's the simplicity 1434 01:14:14,280 --> 01:14:18,570 of their design they tried not to change 1435 01:14:16,500 --> 01:14:22,680 too much keep all the mechanism the same 1436 01:14:18,570 --> 01:14:25,200 just add as few as possible so this is 1437 01:14:22,680 --> 01:14:27,630 taken from the faster our CNN paper this 1438 01:14:25,200 --> 01:14:30,450 is the original paper this is how to 1439 01:14:27,630 --> 01:14:31,890 predict in faster our CNN you didn't 1440 01:14:30,450 --> 01:14:35,670 have this feature map you just had the 1441 01:14:31,890 --> 01:14:39,090 deepest feature map and you put a 3x3 1442 01:14:35,670 --> 01:14:41,580 convolution on top of it and for each 1443 01:14:39,090 --> 01:14:44,490 location of the 3x3 convolution you 1444 01:14:41,580 --> 01:14:48,750 turned it into a vector of 256 1445 01:14:44,490 --> 01:14:52,019 dimensions you and you own these vectors 1446 01:14:48,750 --> 01:14:54,780 you use the 1x1 convolution one one by 1447 01:14:52,020 --> 01:14:56,730 one convolution to predict the 1448 01:14:54,780 --> 01:14:58,830 coordinates of the bounding boxes for 1449 01:14:56,730 --> 01:15:01,320 that special occasion and one one by one 1450 01:14:58,830 --> 01:15:04,230 convolution to predict the probabilities 1451 01:15:01,320 --> 01:15:07,500 of being an object and the probability 1452 01:15:04,230 --> 01:15:09,959 of not being an object so here you use 1453 01:15:07,500 --> 01:15:12,330 the exact same predict head on each of 1454 01:15:09,960 --> 01:15:15,780 these layers separately and not only do 1455 01:15:12,330 --> 01:15:17,580 you use the exact same predict head it's 1456 01:15:15,780 --> 01:15:19,950 not only that you the same architecture 1457 01:15:17,580 --> 01:15:21,870 for all of them you also use the same 1458 01:15:19,950 --> 01:15:26,170 weights for all of them they share 1459 01:15:21,870 --> 01:15:31,030 weights so if the 1460 01:15:26,170 --> 01:15:33,830 this the predict head on this layer 1461 01:15:31,030 --> 01:15:35,960 predicts a false positive box and then 1462 01:15:33,830 --> 01:15:38,480 it's punished in the back propagation 1463 01:15:35,960 --> 01:15:41,510 mechanism and the weights in the predict 1464 01:15:38,480 --> 01:15:43,280 head change they change for all of the 1465 01:15:41,510 --> 01:15:46,360 layers together okay 1466 01:15:43,280 --> 01:15:48,950 and they they tested it experimentally 1467 01:15:46,360 --> 01:15:51,200 empirically and they saw that using 1468 01:15:48,950 --> 01:15:52,940 different predict tabs if they don't 1469 01:15:51,200 --> 01:15:57,800 share weights then it doesn't really 1470 01:15:52,940 --> 01:16:04,280 improve their performance so more about 1471 01:15:57,800 --> 01:16:06,290 how they train so for each of the levels 1472 01:16:04,280 --> 01:16:09,230 of the networks we have three here and 1473 01:16:06,290 --> 01:16:12,800 two more here they assign a different 1474 01:16:09,230 --> 01:16:15,830 size of bounding boxes so the deepest 1475 01:16:12,800 --> 01:16:19,520 bounding box is only trained on the 1476 01:16:15,830 --> 01:16:21,800 smallest object and the and the top 1477 01:16:19,520 --> 01:16:25,220 bounding box is only trained on the 1478 01:16:21,800 --> 01:16:27,830 largest object this is similar to the 1479 01:16:25,220 --> 01:16:29,270 concept of anchors in festersen air for 1480 01:16:27,830 --> 01:16:34,309 those of you who are familiar with it 1481 01:16:29,270 --> 01:16:36,890 only here the anchors are are trained on 1482 01:16:34,310 --> 01:16:39,680 separate layers each each anchor at the 1483 01:16:36,890 --> 01:16:44,350 end of each scale is trained on a 1484 01:16:39,680 --> 01:16:44,350 separate completely separate layer and 1485 01:16:44,890 --> 01:16:54,050 the yeah and the reason for that is that 1486 01:16:51,560 --> 01:16:55,660 we would expect that this layer will 1487 01:16:54,050 --> 01:16:58,960 contain the most relevant information 1488 01:16:55,660 --> 01:17:01,309 for small objects so we want it to 1489 01:16:58,960 --> 01:17:03,140 specialize on small objects because it's 1490 01:17:01,310 --> 01:17:05,870 a really difficult task so you want it 1491 01:17:03,140 --> 01:17:09,070 to be the best it can on small objects 1492 01:17:05,870 --> 01:17:10,340 and that's the intuition behind it okay 1493 01:17:09,070 --> 01:17:12,849 great 1494 01:17:10,340 --> 01:17:16,070 so this was about how to combine it with 1495 01:17:12,850 --> 01:17:18,710 RPN and now we will talk about how to 1496 01:17:16,070 --> 01:17:21,950 combine it with the second stage of the 1497 01:17:18,710 --> 01:17:27,290 network first our CNN so a short 1498 01:17:21,950 --> 01:17:30,679 reminder about how fast our CNN works so 1499 01:17:27,290 --> 01:17:34,580 with the feature map we with the RPN we 1500 01:17:30,680 --> 01:17:36,740 get the proposals on the image we get 1501 01:17:34,580 --> 01:17:38,230 about for example two thousand proposals 1502 01:17:36,740 --> 01:17:40,360 and then we use our 1503 01:17:38,230 --> 01:17:43,450 pulling or deformable our eye pulling to 1504 01:17:40,360 --> 01:17:46,719 pull each one of these proposals from 1505 01:17:43,450 --> 01:17:49,720 this feature man okay but now that we 1506 01:17:46,720 --> 01:17:51,400 have so in faster our CNN we didn't have 1507 01:17:49,720 --> 01:17:53,530 these layers we only had this layer and 1508 01:17:51,400 --> 01:17:56,080 that's where we pulled the bounding 1509 01:17:53,530 --> 01:17:58,989 boxes from now that we have many layers 1510 01:17:56,080 --> 01:18:01,420 of different special resolutions maybe 1511 01:17:58,989 --> 01:18:02,860 we can also pull the bounding boxes from 1512 01:18:01,420 --> 01:18:07,510 these layers and not only from the 1513 01:18:02,860 --> 01:18:11,349 deepest layer and that's the concept but 1514 01:18:07,510 --> 01:18:15,580 how do you decide which layer do you 1515 01:18:11,350 --> 01:18:17,500 pull the ROI from so here again they use 1516 01:18:15,580 --> 01:18:19,120 the concept of simplicity and they said 1517 01:18:17,500 --> 01:18:21,130 if you're trying to imitate image 1518 01:18:19,120 --> 01:18:23,050 pyramids we can just use the decision 1519 01:18:21,130 --> 01:18:26,080 rule that is already known to work with 1520 01:18:23,050 --> 01:18:28,989 image pyramids so there is a very clear 1521 01:18:26,080 --> 01:18:33,550 formula that is used and this is the 1522 01:18:28,989 --> 01:18:36,969 formula the floor of four plus you can 1523 01:18:33,550 --> 01:18:41,970 read it okay so you can read it and now 1524 01:18:36,970 --> 01:18:46,060 I'll give examples of how it works so 1525 01:18:41,970 --> 01:18:47,800 let's say that our W and H are the sizes 1526 01:18:46,060 --> 01:18:49,720 of our proposal that within the height 1527 01:18:47,800 --> 01:18:53,350 of the proposal in pixels so let's say 1528 01:18:49,720 --> 01:18:57,910 that the proposal Z is of size 224 by 1529 01:18:53,350 --> 01:19:02,860 224 what we get here is 4 plus log log 2 1530 01:18:57,910 --> 01:19:06,790 of 1 which is 0 and the end result is 4 1531 01:19:02,860 --> 01:19:08,799 this means we are oh I pull the results 1532 01:19:06,790 --> 01:19:11,140 from the fourth layer and we'll talk 1533 01:19:08,800 --> 01:19:14,950 about the intuition behind this in a 1534 01:19:11,140 --> 01:19:22,030 minute and when we take the a proposal 1535 01:19:14,950 --> 01:19:25,599 which size is 112 by 112 then the the 1536 01:19:22,030 --> 01:19:28,509 result will be 4 plus what you can see 1537 01:19:25,600 --> 01:19:30,850 here and there is the result of this log 1538 01:19:28,510 --> 01:19:33,580 operation is minus 1 of course so it 1539 01:19:30,850 --> 01:19:36,430 will be 3 and for larger bounding box it 1540 01:19:33,580 --> 01:19:42,610 will be 5 ok so you can see that this 1541 01:19:36,430 --> 01:19:46,750 this formula enables us very efficiently 1542 01:19:42,610 --> 01:19:49,690 to our Y pool larger bounding boxes from 1543 01:19:46,750 --> 01:19:51,260 shallowest layers with less spatial 1544 01:19:49,690 --> 01:19:55,700 resolution information 1545 01:19:51,260 --> 01:19:59,810 and a smaller object from the layers 1546 01:19:55,700 --> 01:20:01,610 that are that contain the most special 1547 01:19:59,810 --> 01:20:04,460 the information with the most special 1548 01:20:01,610 --> 01:20:06,230 resolution by the way a three is the 1549 01:20:04,460 --> 01:20:07,940 number for this layer four is the number 1550 01:20:06,230 --> 01:20:11,059 for this layer four five is this layer 1551 01:20:07,940 --> 01:20:14,769 index and and so on or actually this is 1552 01:20:11,060 --> 01:20:14,770 two three four and so on 1553 01:20:20,820 --> 01:20:27,670 100 by 100 yes it's a still a relatively 1554 01:20:25,300 --> 01:20:29,920 big object and we have one feature map 1555 01:20:27,670 --> 01:20:37,809 below it to take care of objects that 1556 01:20:29,920 --> 01:20:39,249 are smaller than that we have just one 1557 01:20:37,809 --> 01:20:42,429 feature map below it ma'am 1558 01:20:39,249 --> 01:20:44,920 she said 100 and 100 is still big but we 1559 01:20:42,429 --> 01:20:47,949 use a relatively what we use a 1560 01:20:44,920 --> 01:20:51,550 relatively shallow layer to to pull it 1561 01:20:47,949 --> 01:20:53,138 from and maybe we're wasting the high 1562 01:20:51,550 --> 01:20:56,199 resolution information there on 1563 01:20:53,139 --> 01:20:59,739 relatively large objects I'm sure that 1564 01:20:56,199 --> 01:21:08,049 this can be optimized more but but it 1565 01:20:59,739 --> 01:21:09,968 works quite well okay so in this formula 1566 01:21:08,050 --> 01:21:14,999 we saw there is like a magic number here 1567 01:21:09,969 --> 01:21:14,999 200 yes 1568 01:21:17,290 --> 01:21:20,290 sorry 1569 01:21:31,980 --> 01:21:35,030 [Music] 1570 01:21:35,530 --> 01:21:45,980 before okay maybe it doesn't really 1571 01:21:43,310 --> 01:21:49,010 matter if even if it it is just the way 1572 01:21:45,980 --> 01:21:51,709 they decided to build a formula okay so 1573 01:21:49,010 --> 01:21:53,600 it's for its for ease of it to make the 1574 01:21:51,710 --> 01:21:56,360 formula more friendly actually and we 1575 01:21:53,600 --> 01:21:58,610 saw that there is a number in 224 it 1576 01:21:56,360 --> 01:22:01,190 looks like a magic number because it's 1577 01:21:58,610 --> 01:22:04,849 although it's also the size of images in 1578 01:22:01,190 --> 01:22:07,730 image net so that's the reason we use 1579 01:22:04,850 --> 01:22:10,520 this number here because and that and 1580 01:22:07,730 --> 01:22:13,790 layer number four is actually the layer 1581 01:22:10,520 --> 01:22:17,380 that the ry pooling is performed on on 1582 01:22:13,790 --> 01:22:20,210 the original faster are CNN with ResNet 1583 01:22:17,380 --> 01:22:24,500 101 it's pulled from layer four and 1584 01:22:20,210 --> 01:22:27,680 since ResNet was pre trained on image 1585 01:22:24,500 --> 01:22:30,440 net on and all of the images in in image 1586 01:22:27,680 --> 01:22:34,760 net are pretty much objects of this 1587 01:22:30,440 --> 01:22:37,490 scale then if we get an object of this 1588 01:22:34,760 --> 01:22:41,240 scale we would like to pull it from like 1589 01:22:37,490 --> 01:22:43,610 the default layer to pull to pull object 1590 01:22:41,240 --> 01:22:47,000 that was that has proven itself so far 1591 01:22:43,610 --> 01:22:49,429 so that's the intuition for the 224 a 1592 01:22:47,000 --> 01:22:53,780 number and that's also why they have 1593 01:22:49,430 --> 01:22:57,440 four here because then when you have an 1594 01:22:53,780 --> 01:22:59,840 image an object of size 224 by 224 the 1595 01:22:57,440 --> 01:23:04,330 log goes to zero and you remain with the 1596 01:22:59,840 --> 01:23:04,330 default layer index yes 1597 01:23:05,840 --> 01:23:15,090 explain about the what did I mention of 1598 01:23:12,870 --> 01:23:21,590 the score the units of the score you 1599 01:23:15,090 --> 01:23:31,380 mean sorry 1600 01:23:21,590 --> 01:23:33,120 previous okay yes yeah okay so okay is 1601 01:23:31,380 --> 01:23:35,130 the number of bounding boxes that is 1602 01:23:33,120 --> 01:23:37,140 predicted for each special location it's 1603 01:23:35,130 --> 01:23:39,690 also called the number of anchors if you 1604 01:23:37,140 --> 01:23:41,430 if you know it so for each special 1605 01:23:39,690 --> 01:23:43,110 occasion I don't only predict a single 1606 01:23:41,430 --> 01:23:45,660 bounding box I predict something like 1607 01:23:43,110 --> 01:23:48,719 nine bounding boxes or fifteen bounding 1608 01:23:45,660 --> 01:23:50,309 boxes so for each of these these 1609 01:23:48,720 --> 01:23:52,770 bounding box for each of these nine 1610 01:23:50,310 --> 01:23:56,820 bounding boxes I want four coordinates 1611 01:23:52,770 --> 01:23:59,700 and two probabilities so this is the 2k 1612 01:23:56,820 --> 01:24:02,040 and the 4k it's not two thousand or four 1613 01:23:59,700 --> 01:24:03,630 or four thousand it just like two times 1614 01:24:02,040 --> 01:24:07,940 the number of bounding boxes that I'm 1615 01:24:03,630 --> 01:24:07,940 predicting in that special occasion okay 1616 01:24:09,470 --> 01:24:16,740 okay let's move on and we're close to 1617 01:24:13,980 --> 01:24:20,610 finishing this so regarding European 1618 01:24:16,740 --> 01:24:22,290 experiment results they try to test it 1619 01:24:20,610 --> 01:24:25,139 it's relevant to solve the question that 1620 01:24:22,290 --> 01:24:27,570 you asked before so they tried not to 1621 01:24:25,140 --> 01:24:30,870 use the top-down connections only use 1622 01:24:27,570 --> 01:24:33,030 the lateral connections and not which 1623 01:24:30,870 --> 01:24:34,950 means predict from each level of 1624 01:24:33,030 --> 01:24:36,660 features but don't combine features from 1625 01:24:34,950 --> 01:24:39,090 deep layers and shallower layers and 1626 01:24:36,660 --> 01:24:41,610 they saw they saw that they didn't even 1627 01:24:39,090 --> 01:24:44,040 improve on top of faster are CNN and 1628 01:24:41,610 --> 01:24:45,389 they try to use only the top down 1629 01:24:44,040 --> 01:24:47,280 connection without the lateral 1630 01:24:45,390 --> 01:24:49,500 connection and it also didn't improve 1631 01:24:47,280 --> 01:24:54,840 the only thing the only other thing that 1632 01:24:49,500 --> 01:24:57,300 they tried the did improve was doing 1633 01:24:54,840 --> 01:24:59,070 creating all of this pyramid but not 1634 01:24:57,300 --> 01:25:01,140 using all of these predict heads using 1635 01:24:59,070 --> 01:25:05,670 only a single predict head from the 1636 01:25:01,140 --> 01:25:07,710 bottom and since the bottom contains a 1637 01:25:05,670 --> 01:25:09,840 combination of all of the features 1638 01:25:07,710 --> 01:25:11,400 intuitively maybe it's enough just to 1639 01:25:09,840 --> 01:25:14,550 predict from it and it's more efficient 1640 01:25:11,400 --> 01:25:18,019 but in practice they saw that it's not 1641 01:25:14,550 --> 01:25:18,020 enough and that 1642 01:25:19,099 --> 01:25:26,190 but it's not sorry that it's not enough 1643 01:25:21,960 --> 01:25:28,289 and that it does improve but a lags far 1644 01:25:26,190 --> 01:25:33,719 behind the full feature pyramid Network 1645 01:25:28,289 --> 01:25:36,900 solution regarding faster CNN they asked 1646 01:25:33,719 --> 01:25:40,019 themselves maybe we if we use feature 1647 01:25:36,900 --> 01:25:42,808 pyramid networks on the RPM maybe it's 1648 01:25:40,019 --> 01:25:45,510 enough maybe it's too much to put it on 1649 01:25:42,809 --> 01:25:48,329 the faster CNN as well maybe it's not 1650 01:25:45,510 --> 01:25:52,139 improving anything else so they trained 1651 01:25:48,329 --> 01:25:55,199 the RPN separately using feature pyramid 1652 01:25:52,139 --> 01:25:57,239 networks recorded for each image the 1653 01:25:55,199 --> 01:26:01,589 best proposals that they got for that 1654 01:25:57,239 --> 01:26:04,650 image and then train faster CNN 1655 01:26:01,590 --> 01:26:06,630 separately using these proposals from 1656 01:26:04,650 --> 01:26:08,610 the beginning of the training faster CNN 1657 01:26:06,630 --> 01:26:12,210 only got the best proposals it was not 1658 01:26:08,610 --> 01:26:14,579 trained end to end and still faster CNN 1659 01:26:12,210 --> 01:26:18,030 improved the results by another five to 1660 01:26:14,579 --> 01:26:22,440 ten percent so this component of faster 1661 01:26:18,030 --> 01:26:24,750 CNN is important and we talked about 1662 01:26:22,440 --> 01:26:27,808 this decision rule here and it's it's 1663 01:26:24,750 --> 01:26:30,239 nice but I saw that even if we don't use 1664 01:26:27,809 --> 01:26:33,659 it in investors in festersen and in the 1665 01:26:30,239 --> 01:26:36,869 second stage we only pull from this 1666 01:26:33,659 --> 01:26:40,049 layer from the deepest layer we don't 1667 01:26:36,869 --> 01:26:42,719 get a large difference in results so 1668 01:26:40,050 --> 01:26:45,449 faster CNN is less sensitive to which 1669 01:26:42,719 --> 01:26:48,840 layer you pull from you just need to 1670 01:26:45,449 --> 01:26:53,190 pull from the last layer with the most 1671 01:26:48,840 --> 01:26:56,340 information and another some other neat 1672 01:26:53,190 --> 01:26:58,768 things that I saw of course it improves 1673 01:26:56,340 --> 01:27:00,989 especially for small objects the results 1674 01:26:58,769 --> 01:27:04,440 for small object and the test time per 1675 01:27:00,989 --> 01:27:08,159 image on a single GPU is less than a 1676 01:27:04,440 --> 01:27:10,018 single scale non feature pyramid network 1677 01:27:08,159 --> 01:27:15,119 of the same architecture 1678 01:27:10,019 --> 01:27:17,789 so the the feature pyramid network 1679 01:27:15,119 --> 01:27:23,759 although it it is a more complex 1680 01:27:17,789 --> 01:27:26,360 architecture it's faster and it's out of 1681 01:27:23,760 --> 01:27:30,220 the scope to to explain exactly why but 1682 01:27:26,360 --> 01:27:31,900 but Ross gear cheek which is one of the 1683 01:27:30,220 --> 01:27:35,980 writers of this paper and is very 1684 01:27:31,900 --> 01:27:40,089 well-known wrote like a comment on 1685 01:27:35,980 --> 01:27:43,150 github explaining that so this is 1686 01:27:40,090 --> 01:27:46,540 feature pyramid networks and before we 1687 01:27:43,150 --> 01:27:48,160 need to summarize if someone has a 1688 01:27:46,540 --> 01:27:51,010 really really important question 1689 01:27:48,160 --> 01:27:52,269 otherwise you can come ask me I'll stay 1690 01:27:51,010 --> 01:27:55,600 here okay 1691 01:27:52,270 --> 01:27:59,890 great so we won't have time to cover 1692 01:27:55,600 --> 01:28:04,120 focal loss but I will publish my slides 1693 01:27:59,890 --> 01:28:06,100 online and hopefully maybe there will be 1694 01:28:04,120 --> 01:28:07,720 enough self contained for you to look at 1695 01:28:06,100 --> 01:28:08,650 them and maybe we will see each other 1696 01:28:07,720 --> 01:28:11,860 some other time 1697 01:28:08,650 --> 01:28:13,450 so to summarize object detection I 1698 01:28:11,860 --> 01:28:17,019 really believe that it's a revolutionary 1699 01:28:13,450 --> 01:28:19,360 technology with that is progressing 1700 01:28:17,020 --> 01:28:22,600 really fast and has the potential to 1701 01:28:19,360 --> 01:28:25,750 change every industry and the latest 1702 01:28:22,600 --> 01:28:30,400 major adventures advantages the latest 1703 01:28:25,750 --> 01:28:32,620 major advances in object detection 1704 01:28:30,400 --> 01:28:34,540 except for being really creative and 1705 01:28:32,620 --> 01:28:37,630 mind-blowing and fascinating to learn 1706 01:28:34,540 --> 01:28:41,080 about also address some of the most 1707 01:28:37,630 --> 01:28:43,560 major problems in many data domains but 1708 01:28:41,080 --> 01:28:46,690 also specifically in medical imaging and 1709 01:28:43,560 --> 01:28:48,700 the solutions that we talked about 1710 01:28:46,690 --> 01:28:51,009 earlier except for focal loss but we 1711 01:28:48,700 --> 01:28:53,019 didn't have time to cover were small 1712 01:28:51,010 --> 01:28:54,670 objects which feature pyramid networks 1713 01:28:53,020 --> 01:28:58,540 to give a very elegant and efficient 1714 01:28:54,670 --> 01:29:00,700 solution for deformable shapes which the 1715 01:28:58,540 --> 01:29:02,140 deformable convolutions and deformable 1716 01:29:00,700 --> 01:29:04,840 are i pulling give a very elegant 1717 01:29:02,140 --> 01:29:07,120 solution to and extreme class imbalance 1718 01:29:04,840 --> 01:29:13,270 which vocalist is a very nice and 1719 01:29:07,120 --> 01:29:16,390 innovative solution for so as usual in 1720 01:29:13,270 --> 01:29:18,790 deep learning like what we said in the 1721 01:29:16,390 --> 01:29:21,640 beginning it's not rocket science and I 1722 01:29:18,790 --> 01:29:24,400 think that many engineers can understand 1723 01:29:21,640 --> 01:29:27,010 the concepts and implementations but the 1724 01:29:24,400 --> 01:29:28,870 question is what can we do better in 1725 01:29:27,010 --> 01:29:31,450 order to make it this information more 1726 01:29:28,870 --> 01:29:33,190 friendly and reduce the time that is 1727 01:29:31,450 --> 01:29:35,030 required to understand it by a factor of 1728 01:29:33,190 --> 01:29:40,620 10 thank you very much 1729 01:29:35,030 --> 01:29:40,620 [Applause] 132375