1
00:00:00,690 --> 00:00:04,830
Hello and welcome back to the course on computer vision. We've got an exciting section ahead, and we're going
2
00:00:04,830 --> 00:00:13,470
to kick off by talking about how SSD, or the single shot multibox detection algorithm, is different
3
00:00:13,550 --> 00:00:15,260
to others.
4
00:00:15,440 --> 00:00:16,840
Let's start by looking at this image.
5
00:00:16,860 --> 00:00:21,860
So here is an image and...
6
00:00:22,130 --> 00:00:22,550
Yeah.
7
00:00:22,680 --> 00:00:26,060
So we can see some sheep walking around on a field.
8
00:00:26,250 --> 00:00:33,500
And the question is: how would a computer, or object detection algorithms, normally detect these sheep?
9
00:00:34,520 --> 00:00:42,190
Well, the sheep, as you can see, are quite small, and you not only need to look at the image, as we learned
10
00:00:42,200 --> 00:00:46,580
in the annex on convolutional neural networks, where we're just looking at an image and
11
00:00:46,580 --> 00:00:51,170
saying whether something is present in the image or not. Here we actually need to detect exactly where the
12
00:00:51,170 --> 00:00:56,840
sheep are and put those boxes around them to, you know, say that a sheep is here, or a sheep is here,
13
00:00:56,840 --> 00:01:05,800
and so on. And so what these algorithms would normally do is try to save time, or try to save
14
00:01:05,810 --> 00:01:14,900
on computational cost, through certain tricks and hacks. And of course all of them, even SSD,
15
00:01:14,900 --> 00:01:16,120
have their own tricks and hacks.
16
00:01:16,130 --> 00:01:23,480
But what changes is the tricks and hacks themselves. And so previously most, or, like,
17
00:01:23,480 --> 00:01:25,490
a huge portion of the algorithms,
18
00:01:25,490 --> 00:01:31,870
what they would do is engage in something called an object proposal methodology.
19
00:01:31,880 --> 00:01:36,890
They would have some object proposal techniques. Basically, you've got an image, and of course it's a
20
00:01:36,890 --> 00:01:44,150
huge image. Normally these images are resized to smaller images, and the algorithms work
21
00:01:44,150 --> 00:01:47,050
with the smaller images, and then maybe they're resized back, and so on.
22
00:01:47,060 --> 00:01:53,310
But even on a small image you still have the same problem: if you were to, for instance,
23
00:01:53,330 --> 00:01:58,520
cover this whole image with rectangles and then try to, you know, guess in which ones there's a
24
00:01:59,000 --> 00:02:03,440
sheep or not (let's say here's a rectangle, here's a rectangle), you can see that most of them don't actually
25
00:02:03,440 --> 00:02:10,790
have a sheep, and you would be spending a lot of time finding these sheep, and a lot of
26
00:02:10,790 --> 00:02:15,490
computational cost, and the algorithm would be slow, and that's a big problem.
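To get a rough feel for why covering the whole image with rectangles is expensive, here is a minimal sketch that only counts how many rectangles a naive sliding-window search would have to classify. The 300x300 image size, the three window sizes and the stride of 8 pixels are illustrative assumptions, not values from the lecture.

```python
def count_windows(image_w, image_h, win_w, win_h, stride):
    """Count the positions of a win_w x win_h window slid over the image."""
    if win_w > image_w or win_h > image_h:
        return 0
    cols = (image_w - win_w) // stride + 1
    rows = (image_h - win_h) // stride + 1
    return cols * rows


# Hypothetical setup: 300x300 image, three square window sizes, stride 8.
window_sizes = [32, 64, 128]
total = sum(count_windows(300, 300, s, s, 8) for s in window_sizes)

# Each counted window would need its own classifier run in the naive approach.
print(total)  # 2540
```

Even in this small setup that is a couple of thousand separate classifier runs for one image, most of them on rectangles that contain only grass.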
27
00:02:15,740 --> 00:02:21,830
And so what they would normally do is this thing called object proposal, where they would come up
28
00:02:21,830 --> 00:02:30,290
with a way to break down the image, segment it into parts, to suggest where there could potentially be objects
29
00:02:30,290 --> 00:02:35,720
and where there could not be objects, for instance by looking at the gradients in color.
30
00:02:35,840 --> 00:02:39,800
So if you look at this big chunk over here it's all green.
31
00:02:39,800 --> 00:02:47,510
There's not much gradient, there's not much change; it's all gradual,
32
00:02:47,510 --> 00:02:51,950
and there's no red, there are no edges. Well, there's a little bit of edges, but they're all kind of,
33
00:02:51,950 --> 00:02:56,990
like, it's all about the same kind of texture, the same color more or less.
34
00:02:57,230 --> 00:02:58,390
And so,
35
00:02:58,580 --> 00:03:04,700
nothing changes radically, even from, like, one neighboring pixel to the next, and therefore you could make
36
00:03:04,700 --> 00:03:07,560
a proposal that there is no object in here.
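The idea of discarding regions where neighboring pixels barely change can be sketched as follows. This is a toy illustration only, not the method used by any particular object-proposal algorithm: the fixed square tiles, the helper names `mean_gradient` and `propose_tiles`, and the threshold are all assumptions made for the example.

```python
def mean_gradient(gray, top, left, size):
    """Average absolute difference between adjacent pixels inside one tile."""
    total, count = 0, 0
    for r in range(top, top + size):
        for c in range(left, left + size):
            if c + 1 < left + size:  # horizontal neighbor inside the tile
                total += abs(gray[r][c + 1] - gray[r][c])
                count += 1
            if r + 1 < top + size:  # vertical neighbor inside the tile
                total += abs(gray[r + 1][c] - gray[r][c])
                count += 1
    return total / count if count else 0.0


def propose_tiles(gray, size, threshold):
    """Return (top, left) of tiles whose mean gradient exceeds the threshold."""
    rows, cols = len(gray), len(gray[0])
    proposals = []
    for top in range(0, rows - size + 1, size):
        for left in range(0, cols - size + 1, size):
            if mean_gradient(gray, top, left, size) > threshold:
                proposals.append((top, left))
    return proposals


# Toy 8x8 grayscale image: uniform "grass" with one bright 2x2 "sheep".
image = [[100] * 8 for _ in range(8)]
for r in (5, 6):
    for c in (5, 6):
        image[r][c] = 255

print(propose_tiles(image, size=4, threshold=5.0))  # [(4, 4)]
```

Only the tile containing the bright patch survives; the three uniform tiles are rejected without ever running a classifier on them.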
37
00:03:07,730 --> 00:03:12,770
And then you would take this and, you know, maybe you would make a proposal that, well, it depends on how
38
00:03:12,770 --> 00:03:17,270
you break it up. Break it up like this, and you might say, you know, there's an object here. Then you might
39
00:03:17,270 --> 00:03:20,220
say that there's an object here because, you know, you can see the shadow.
40
00:03:20,330 --> 00:03:26,870
And again, we're not just detecting sheep. Ideally, in the most complex of all challenges, we're
41
00:03:26,870 --> 00:03:32,420
detecting any type of object that we're after. So, for instance, you know, sheep, or we could be looking
42
00:03:32,420 --> 00:03:38,900
for mice, and we could be looking for cars, and we could be looking for giraffes, anything.
43
00:03:38,900 --> 00:03:44,300
So it's very dangerous to discard anything, any object, because then you might miss out.
44
00:03:44,460 --> 00:03:50,480
And then, on the other hand, if you take this segment, if it's segmented like that or however it's
45
00:03:50,480 --> 00:03:54,950
segmented, you would first calculate the gradient and see that there is some kind
46
00:03:54,950 --> 00:04:00,380
of change happening over here, so there is an object proposal for this segment, and then you would dig
47
00:04:00,380 --> 00:04:01,530
into this further.
48
00:04:02,000 --> 00:04:06,420
And so that's the way they save time.
49
00:04:06,480 --> 00:04:15,050
But as you can imagine, they would sacrifice accuracy. There are lots of other details
50
00:04:15,050 --> 00:04:21,470
to it, but in short they would save computational cost, but they would sacrifice accuracy and, you know,
51
00:04:21,470 --> 00:04:27,620
miss objects sometimes. And while it might be OK in some applications, in other applications
52
00:04:27,630 --> 00:04:32,480
you need, like, a very, very high percentage of accuracy. Especially in self-driving cars and things like that,
53
00:04:32,480 --> 00:04:37,970
you can't afford to sacrifice accuracy, but at the same time you also need computational efficiency,
54
00:04:38,030 --> 00:04:42,740
because it all needs to happen in real time.
55
00:04:42,770 --> 00:04:45,150
Yeah, and so what did SSD come up with?
56
00:04:45,170 --> 00:04:52,070
Well, let's have a look at this slide. These are the authors of SSD.
57
00:04:52,070 --> 00:04:57,150
They came up with a brand new solution.
58
00:04:57,260 --> 00:05:02,390
Well it's a very interesting solution where they do everything in one shot.
59
00:05:02,390 --> 00:05:04,540
That's why it's called the single shot
60
00:05:04,660 --> 00:05:06,610
multibox detection algorithm.
61
00:05:06,630 --> 00:05:08,100
It...
62
00:05:08,150 --> 00:05:11,420
It just looks at the image once. It doesn't have to go back to it.
63
00:05:11,410 --> 00:05:17,800
It doesn't have to do this object proposal. It doesn't have to then run many convolutional neural networks.
64
00:05:17,810 --> 00:05:20,030
That's the part where the computational cost comes in.
65
00:05:20,030 --> 00:05:28,850
So if you go back one slide, here, if you break it down and then run a convolutional neural network for
66
00:05:28,850 --> 00:05:30,920
every single one of those rectangles to detect,
67
00:05:30,920 --> 00:05:35,220
OK, is there a sheep there, is there a sheep there, is there a sheep there, is there a sheep there, that's very
68
00:05:35,220 --> 00:05:40,350
computationally costly, and that's where you need to do these object proposals, to cut out parts that are definitely
69
00:05:40,440 --> 00:05:45,430
out of the question so you don't have to run the convolutional network many times.
70
00:05:45,510 --> 00:05:52,220
But in the single shot multibox detection algorithm, what they did is they only look at it once. There's
71
00:05:52,230 --> 00:05:55,920
only one other algorithm in the world that does the same thing.
72
00:05:56,010 --> 00:05:58,130
It's called the YOLO algorithm.
73
00:05:58,710 --> 00:06:05,310
It stands for "you only look once", and in their paper they actually compare the two.
74
00:06:05,390 --> 00:06:07,110
SSD does a better job.
75
00:06:07,110 --> 00:06:10,080
They show empirically that SSD is better.
76
00:06:10,140 --> 00:06:15,810
But apart from that, apart from YOLO and SSD, there is no other algorithm in the world,
77
00:06:16,440 --> 00:06:23,520
or at the time there was no other algorithm in the world, that actually only looks at the image once,
78
00:06:23,550 --> 00:06:24,620
and that's it.
79
00:06:24,610 --> 00:06:29,100
It only runs that one convolutional neural network. All the others would come back to the image several
80
00:06:29,100 --> 00:06:31,410
times or many times.
81
00:06:31,560 --> 00:06:33,870
And one second.
82
00:06:34,100 --> 00:06:34,850
OK.
83
00:06:35,340 --> 00:06:36,420
And.
84
00:06:36,530 --> 00:06:36,840
Yeah.
85
00:06:36,840 --> 00:06:39,900
And so now we have this, and the question is: what did they do?
86
00:06:39,900 --> 00:06:46,450
So basically all of those boxes, as we will discuss in further tutorials, all the
87
00:06:46,530 --> 00:06:51,120
boxes that are up here on the image, all of them go through the network at the same time.
88
00:06:51,120 --> 00:06:57,950
And moreover, they don't just go through the network at the same time; the network also remembers which boxes
89
00:06:58,020 --> 00:06:59,560
it's dealing with.
90
00:06:59,640 --> 00:07:08,460
They train on detecting objects and on detecting the borders of objects correctly.
91
00:07:08,880 --> 00:07:16,040
And moreover, in this same network, and this is the really cool thing that they introduced, there are
92
00:07:16,050 --> 00:07:22,590
many convolutions that reduce the image size. You can see it starts with a 300-pixel image, then 38 pixels, 19, 10, 5, 3, 1
93
00:07:22,880 --> 00:07:24,460
as the image size reduces.
94
00:07:24,600 --> 00:07:30,960
And that helps deal with different sizes of objects in the original image, and we'll talk about that
95
00:07:30,960 --> 00:07:38,190
as well. So they combined lots of hacks in this one algorithm, and probably one of the most important
96
00:07:38,190 --> 00:07:41,690
ones is of course that they do everything in a single shot.
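The shrinking grid sizes mentioned above (38, 19, 10, 5, 3, 1 for a 300-pixel input) can be turned into a quick count of how many candidate boxes go through the network in that single shot. The grid sizes are from the lecture; the number of boxes per grid cell is not, and is taken here from the SSD300 configuration described in the paper, so treat it as an assumption of this sketch.

```python
# Feature-map grid sizes mentioned in the lecture for a 300x300 input.
grid_sizes = [38, 19, 10, 5, 3, 1]

# Boxes predicted per grid cell at each scale. Assumption: values from the
# SSD300 configuration in the paper, not stated in the lecture.
boxes_per_cell = [4, 6, 6, 6, 4, 4]


def total_default_boxes(grids, per_cell):
    """Sum of (grid cells x boxes per cell) over all feature-map scales."""
    return sum(g * g * k for g, k in zip(grids, per_cell))


for g in grid_sizes:
    # Coarser grids have larger cells, so they suit larger objects.
    print(f"{g}x{g} grid: one cell spans about {300 / g:.1f} input pixels")

print(total_default_boxes(grid_sizes, boxes_per_cell))  # 8732
```

The fine 38x38 grid contributes most of the boxes and handles small objects such as distant sheep, while the 1x1 grid looks at the image as a whole.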
97
00:07:42,320 --> 00:07:49,500
It all happens in one huge convolution as it goes through this network, and that really helps
98
00:07:49,500 --> 00:07:51,000
save on computational cost.
99
00:07:51,210 --> 00:07:56,440
So that's, in a nutshell, how this is different.
100
00:07:56,520 --> 00:07:59,790
So it does everything in a single shot.
101
00:07:59,790 --> 00:08:08,610
It can train on identifying objects and also on identifying their borders, and it can deal with
102
00:08:08,960 --> 00:08:11,350
scale, all at the same time.
103
00:08:11,670 --> 00:08:20,850
And there's only one other algorithm that rivals SSD in that architecture and that setup. It's called
104
00:08:20,850 --> 00:08:22,130
the YOLO algorithm.
105
00:08:22,130 --> 00:08:27,300
And yeah, they talk about it more in the paper. So we'll go into those components
106
00:08:27,300 --> 00:08:31,860
in more detail in upcoming tutorials, but we'll go into them in a very intuitive sense. If you would like
107
00:08:31,860 --> 00:08:38,780
to get the math behind it all, the more rigorous approach to what's going on,
108
00:08:38,790 --> 00:08:47,360
the best place to look is of course the original paper by Wei Liu and others. It's called "SSD: Single Shot
109
00:08:47,370 --> 00:08:51,870
MultiBox Detector". You can find it on arXiv; there's a link.
110
00:08:51,980 --> 00:08:57,480
And yeah, it's a very new algorithm, and therefore there isn't much written about it,
111
00:08:58,360 --> 00:09:05,200
but nevertheless this paper is, as you can imagine, the ultimate source of truth, and it's rigorous.
112
00:09:05,850 --> 00:09:08,040
And on that note, I hope you enjoyed this tutorial.
113
00:09:08,040 --> 00:09:12,420
Hope you're excited about the tutorials coming up in this section, and I look forward to seeing you next time.
114
00:09:12,420 --> 00:09:14,310
And until then enjoy computer vision.