023 How SSD is different (English transcript)

Hello, and welcome back to the course on computer vision. We've got an exciting section ahead, and we're going to kick off by talking about how SSD, the Single Shot MultiBox Detection algorithm, is different from other detectors.

Let's start by looking at this image. Here is an image, and we can see some sheep walking around in a field. The question is: how would a computer, or an object detection algorithm, normally detect these sheep? Well, as you can see, the sheep are quite small. And it's not enough to just look at the image the way we did in the annex on convolutional neural networks, where we were only asking whether something is present in the image or not. Here we actually need to detect exactly where the sheep are and put boxes around them, so we can say "there's a sheep here", "there's a sheep here", and so on.

What these algorithms normally do is try to save time, or rather computational cost, through certain tricks and hacks. Of course every algorithm, even SSD, has its own tricks and hacks; what changes from one to the next is which tricks and hacks are used. Previously, most algorithms, or at least a huge portion of them, relied on something called an object proposal methodology. You've got an image, and of course it's a huge image; normally these images are resized to something smaller, the algorithm works on the smaller image, and the results may be resized back afterwards. But even on a small image you still have the same problem: if you were, for instance, to cover the whole image with rectangles and then try to guess which ones contain a sheep (here's a rectangle, here's another rectangle), you can see that most of them don't actually contain a sheep. You would spend a lot of time looking for the sheep, the computational cost would be high, and the algorithm would be slow. That's a big problem.
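To make that cost argument concrete, here is a minimal sketch (not from the course) of the brute-force rectangle sweep being described: slide a window across the image and run a classifier on every crop. The `score_crop` helper, the window size, the stride and the threshold are all illustrative assumptions, with a trivial texture score standing in for a real CNN classifier.

```python
import numpy as np

def score_crop(crop):
    # Stand-in "classifier": rate a patch by how non-uniform it looks.
    # A real pipeline would run a CNN here, once per crop.
    return float(np.std(crop) / 255.0)

def sliding_window_detect(image, window=64, stride=32, threshold=0.25):
    """Return (x, y, w, h, score) for every window the classifier flags."""
    h, w = image.shape[:2]
    detections = []
    for y in range(0, h - window + 1, stride):
        for x in range(0, w - window + 1, stride):
            crop = image[y:y + window, x:x + window]
            score = score_crop(crop)            # one classifier call per rectangle
            if score > threshold:
                detections.append((x, y, window, window, score))
    return detections

# Even on a resized 300x300 image this is 64 classifier calls, and a real system
# would repeat the sweep at several window sizes, which is exactly the
# computational cost described above.
```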
So what these algorithms would normally do is object proposal: they would come up with a way to break the image down, segment it into parts, and suggest where there could potentially be objects and where there could not, for instance by looking at the gradients in color. If you look at this big chunk over here, it's all green. There's not much gradient, not much change; it's all gradual. There are hardly any edges, or only faint ones, and it's all roughly the same texture and the same color. Nothing changes radically, even from one neighboring pixel to the next, and therefore you could make a proposal that there is no object in there. Then, depending on how you break the image up, you might say there's an object here, and you might say there's an object over there, because you can see the shadow.

And remember, we're not just detecting sheep. Ideally, in the most challenging setting, we're detecting any type of object: sheep, mice, cars, giraffes, anything. So it's very dangerous to discard a region, because you might miss an object. On the other hand, if you take this segment, however it happens to be segmented, you would first calculate the gradient, see that some kind of change is happening there, issue an object proposal for that segment, and then dig into it further.

That's how these methods save time. But as you can imagine, they sacrifice accuracy. There are lots of other details to it, but in short, they save computational cost at the price of accuracy, and they sometimes miss objects. That might be acceptable in some applications, but in others, especially self-driving cars and the like, you need a very, very high level of accuracy. You can't afford to sacrifice accuracy, yet you also need computational efficiency, because all of this has to happen in real time.
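Before looking at SSD's answer, here is an equally minimal sketch of the gradient-based proposal idea described above: split the image into tiles and keep only the tiles where the color changes enough to suggest something other than uniform grass. The tile size and threshold are illustrative assumptions, and this is a deliberate simplification rather than Selective Search or any other specific published proposal method.

```python
import numpy as np

def propose_regions(gray, tile=32, min_mean_gradient=8.0):
    """Return (x, y, tile, tile) regions worth handing to a detector.

    `gray` is a 2-D grayscale image with values in 0..255.
    """
    gy, gx = np.gradient(gray.astype(np.float32))   # per-pixel change along each axis
    magnitude = np.hypot(gx, gy)
    h, w = gray.shape
    proposals = []
    for y in range(0, h - tile + 1, tile):
        for x in range(0, w - tile + 1, tile):
            if magnitude[y:y + tile, x:x + tile].mean() > min_mean_gradient:
                proposals.append((x, y, tile, tile))   # enough edges to bother checking
    return proposals

# Uniform green tiles fall below the threshold and are never sent to the
# classifier, which saves time; a sheep that blends into the texture can be
# skipped too, which is the accuracy trade-off just mentioned.
```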
So what did SSD come up with? Let's have a look at this slide. The authors of SSD came up with a brand new solution, a very interesting one, where they do everything in one shot. That's why it's called the Single Shot MultiBox Detection algorithm. It looks at the image once and doesn't have to go back to it. It doesn't have to do object proposal, and it doesn't have to run a convolutional neural network over and over, which is where the computational cost comes in. Going back one slide: if you break the image down and then run a convolutional neural network on every single one of those rectangles to check "is there a sheep here, is there a sheep here, is there a sheep here", that becomes very costly, and that's exactly why those methods need object proposals, to rule out the parts that are definitely out of the question so the network doesn't have to run many times.

In the Single Shot MultiBox Detection algorithm, the image is looked at only once. There's only one other algorithm in the world that does the same thing: the YOLO algorithm, which stands for "You Only Look Once". In their paper the SSD authors actually compare the two and show empirically that SSD does a better job. But apart from YOLO, there was, at the time, no other algorithm that looks at the image just once and runs a single convolutional neural network; all the others would come back to the image several times or many times.

So what did they do? Basically, all of those boxes, as we will discuss in the upcoming tutorials, all of the boxes that you see here on the image, go through the network at the same time. And not only do they go through the network at the same time, the network also keeps track of which boxes it is dealing with, and it trains on detecting objects and on detecting the borders of those objects correctly. Moreover, and this is the really cool thing they introduced, within this same network there is a series of convolutions that keep reducing the image size: it starts from a 300-pixel input, and the feature maps shrink to 38, 19, 10, 5, 3 and finally 1 pixel across. That helps deal with objects of different sizes in the original image, and we'll talk about that as well. So they combined lots of hacks in this one algorithm, and probably the most important one is, of course, that they do everything in a single shot.
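The feature-map sizes quoted here (38, 19, 10, 5, 3 and 1 for a 300 by 300 input) match the SSD300 configuration in the paper, where each cell of each feature map predicts a handful of default boxes at different aspect ratios, all within the same forward pass. A quick back-of-the-envelope count, using the per-layer box counts from that configuration, shows how much one pass covers:

```python
# SSD300: six feature maps of decreasing resolution, each predicting a few
# default boxes per cell (4 or 6 depending on the layer).
feature_map_sizes = [38, 19, 10, 5, 3, 1]
boxes_per_cell    = [4, 6, 6, 6, 4, 4]

total = sum(s * s * b for s, b in zip(feature_map_sizes, boxes_per_cell))
print(total)   # 8732 default boxes, all scored in a single forward pass

# The large 38x38 map keeps enough spatial detail to catch small objects such as
# distant sheep, while the 3x3 and 1x1 maps cover objects that fill most of the
# frame, which is how one pass handles many object scales at once.
```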
It all happens in one big convolutional network as the image passes through, and that really helps with computational efficiency. So that's, in a nutshell, how SSD is different. It does everything in a single shot: it trains to identify objects and to identify their borders, and it deals with scale, all at the same time. There's only one other algorithm that rivals SSD in that architecture and setup, and that's the YOLO algorithm; the authors talk about it more in the paper.

We'll go through those components in more detail in the upcoming tutorials, but in a very intuitive way. If you would like the math behind it all, the more rigorous view of what's going on, the best place to look is of course the original paper by Wei Liu and others, called "SSD: Single Shot MultiBox Detector"; you can find it on arXiv, and there's a link. This is quite a new algorithm, so there isn't much written about it yet, but the paper is, as you can imagine, the ultimate source of truth.

On that note, I hope you enjoyed this tutorial. I hope you're excited about the tutorials coming up in this section, and I look forward to the next one. Until then, enjoy computer vision.
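For anyone who wants a preview of the math before opening the paper, here is a minimal sketch of the joint objective implied by training on objects and their borders at the same time: a classification term plus a box-regression term computed from the same forward pass. The function name, tensor shapes and the `alpha` weight are assumptions made for illustration; the full MultiBox loss in the paper also matches default boxes to ground truth and applies hard negative mining, which this sketch leaves out.

```python
import torch
import torch.nn.functional as F

def multibox_loss_sketch(cls_logits, box_preds, cls_targets, box_targets, alpha=1.0):
    """Simplified SSD-style loss over one batch of default boxes.

    cls_logits  : (N, num_boxes, num_classes)  class scores per default box
    box_preds   : (N, num_boxes, 4)            predicted box offsets
    cls_targets : (N, num_boxes)               matched class index (0 = background)
    box_targets : (N, num_boxes, 4)            target offsets for matched boxes
    """
    positive = cls_targets > 0                 # boxes matched to a real object

    # Classification term: every default box is classified, object or background.
    conf_loss = F.cross_entropy(
        cls_logits.reshape(-1, cls_logits.size(-1)),
        cls_targets.reshape(-1),
        reduction="sum",
    )

    # Localization term: only matched boxes contribute to border regression.
    loc_loss = F.smooth_l1_loss(
        box_preds[positive], box_targets[positive], reduction="sum"
    )

    num_pos = positive.sum().clamp(min=1)
    return (conf_loss + alpha * loc_loss) / num_pos
```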
