1
00:00:00,550 --> 00:00:03,860
Hello, and welcome back to the course on computer vision.
2
00:00:04,080 --> 00:00:11,340
Today we're talking about the scale problem: the scale problem that the SSD algorithm
3
00:00:11,370 --> 00:00:13,490
actually solves, and how it solves it.
4
00:00:13,650 --> 00:00:14,270
OK.
5
00:00:14,580 --> 00:00:16,250
So what is the scale problem?
6
00:00:16,470 --> 00:00:18,040
Well, let's have a look at this image.
7
00:00:18,060 --> 00:00:25,930
This is an image we took in the Czech Republic.
8
00:00:26,560 --> 00:00:31,460
Yes, just some horses in a field.
9
00:00:31,660 --> 00:00:33,980
And what can we see here?
10
00:00:33,990 --> 00:00:37,970
Well, let's just do what we have learned so far.
11
00:00:37,980 --> 00:00:43,350
And let's see how the algorithm will go if we just apply the convolution, so that across the grid it would
12
00:00:43,350 --> 00:00:51,780
apply some boxes. Right away we can see the boxes that have found features
13
00:00:51,780 --> 00:00:54,760
of horses and the boxes which haven't found any.
14
00:00:55,020 --> 00:01:04,380
And so if I remove the green boxes which haven't found features of horses, then you will see that somehow
15
00:01:04,590 --> 00:01:11,900
this horse right in front, the main horse, the one that's the most obvious to us as humans,
16
00:01:11,940 --> 00:01:13,480
has been missed.
17
00:01:13,530 --> 00:01:18,810
The algorithm did not pick up this horse. It picked up that horse over there; it has obviously
18
00:01:18,810 --> 00:01:24,780
seen features of that horse, features of that horse, and features of this horse that you can barely
19
00:01:24,780 --> 00:01:28,440
see, but it does not see this horse.
20
00:01:28,440 --> 00:01:31,930
Point blank. And so the question is: why?
21
00:01:32,970 --> 00:01:34,560
Why is that happening here?
22
00:01:34,610 --> 00:01:37,630
What's going on?
23
00:01:37,740 --> 00:01:41,350
Well, the reason for that is that this horse is too close.
24
00:01:41,430 --> 00:01:46,320
It's too big for the algorithm to pick up features, and I'll show you what I mean by that. So let's
25
00:01:46,410 --> 00:01:51,300
look at just a pair of rectangles or a couple of rectangles.
26
00:01:51,300 --> 00:01:55,220
Look at these ones. Let's just single out one of them, so that rectangle.
27
00:01:55,410 --> 00:02:01,170
And while in the grand scheme of things you might ask how it can not pick up a horse when it's pretty
28
00:02:01,170 --> 00:02:02,500
obvious that there is a horse,
29
00:02:02,820 --> 00:02:08,700
let's remember what we talked about at the start: we said that every single one of these rectangles works
30
00:02:08,730 --> 00:02:09,960
as a separate image.
31
00:02:10,050 --> 00:02:16,200
It is dealing in its own right with what it has inside, and is trying to understand: do I see features
32
00:02:16,200 --> 00:02:21,960
of a horse, or do I not see features of a horse? And to make it even easier to visualize, let's crop
33
00:02:21,960 --> 00:02:25,570
the rest of the image and look at what this rectangle is getting.
34
00:02:25,650 --> 00:02:28,500
Let's imagine that we are this rectangle.
35
00:02:28,680 --> 00:02:33,670
And what this rectangle is seeing is that. Now, by looking at that image, can you identify that there's
36
00:02:33,690 --> 00:02:35,070
a horse in there?
37
00:02:35,220 --> 00:02:36,580
Probably not.
38
00:02:36,580 --> 00:02:40,820
What does that look like? If anything, that looks like,
39
00:02:41,190 --> 00:02:52,230
I don't know, something like a puddle of mud, or maybe, I don't know, some fog, or like a fire in the
40
00:02:52,230 --> 00:02:54,390
distance.
41
00:02:54,390 --> 00:02:55,590
I don't know.
42
00:02:55,800 --> 00:02:59,620
Maybe water falling or something like that. It could be anything, right?
43
00:02:59,620 --> 00:03:04,160
If you just forget about what we saw previously and let your eyes adjust,
44
00:03:04,170 --> 00:03:07,960
you'll notice that it doesn't resemble a horse in any sort of way.
45
00:03:08,580 --> 00:03:11,540
And you'll probably say, OK, that's fair enough.
46
00:03:11,550 --> 00:03:15,090
But there are other features of the horse that you could identify from this image.
47
00:03:15,120 --> 00:03:18,920
You know, you could maybe identify the hooves, or the tail, or the eyes.
48
00:03:18,960 --> 00:03:20,340
Well, let's have a look at that as well.
49
00:03:20,640 --> 00:03:23,280
Let's have a look at another rectangle.
50
00:03:23,330 --> 00:03:24,110
Look at this one.
51
00:03:24,300 --> 00:03:32,050
So here you can see that, yes, we can kind of identify that there is an eye.
52
00:03:32,160 --> 00:03:40,180
And because of the, what is it called, the part that goes in the mouth of the horse.
53
00:03:40,210 --> 00:03:47,220
I really should know this. But because of that pink strap we can see that it's a horse, and
54
00:03:47,390 --> 00:03:51,190
it's fair that the algorithm should also be leveraging that feature.
55
00:03:51,300 --> 00:03:53,300
So yes, it does.
56
00:03:53,420 --> 00:03:55,740
It does resemble a horse better.
57
00:03:55,740 --> 00:04:02,520
But remember that SSD not only needs to identify that indeed there's a horse in there, but it also
58
00:04:02,520 --> 00:04:09,390
needs to make a suggestion for the actual box that encompasses this horse,
59
00:04:09,440 --> 00:04:12,800
the box that actually contains the horse.
60
00:04:12,870 --> 00:04:19,290
And so from here, just forgetting about that image, can you tell where the horse is actually
61
00:04:19,740 --> 00:04:20,250
located?
62
00:04:20,250 --> 00:04:28,650
Is it located like this, or is it located like this? Or maybe the guy has turned his head
63
00:04:28,650 --> 00:04:31,060
like this, but he's actually standing on this side.
64
00:04:31,110 --> 00:04:36,780
Maybe that's one of its hooves. Maybe that is the horse's hoof, and maybe the horse is standing here.
65
00:04:36,780 --> 00:04:43,320
So it's very hard for the algorithm to predict what the horse looks like, what the full image of the
66
00:04:43,320 --> 00:04:44,880
horse is like.
67
00:04:45,030 --> 00:04:51,600
Maybe there's something else in the way, blocking the horse. So the horse is so big that
68
00:04:51,960 --> 00:04:55,820
in most cases, for most rectangles, we cannot even see that horse.
69
00:04:55,830 --> 00:05:01,210
But even in those cases where we can, just from certain features, identify that it's a horse, it's very
70
00:05:01,210 --> 00:05:03,360
hard to build the rectangle.
71
00:05:03,730 --> 00:05:09,460
And so this is where the next component of SSD comes to the rescue.
72
00:05:09,730 --> 00:05:13,390
And so let's have a look at the scale problem.
73
00:05:13,440 --> 00:05:16,060
That is, how SSD deals with the scale problem.
74
00:05:16,160 --> 00:05:21,370
So let's move back to the architecture, and here we can see that there are many layers.
75
00:05:21,370 --> 00:05:27,280
So far we've mostly been looking at examples on the original image, but in reality
76
00:05:27,280 --> 00:05:33,000
what happens is the image is constantly reduced in size.
77
00:05:33,040 --> 00:05:41,190
There are many layers in this network that's working behind the algorithm, in the background.
78
00:05:41,200 --> 00:05:48,310
And basically what it does is apply a convolution operation to reduce the image size. So here
79
00:05:48,310 --> 00:05:55,330
you can see how it was 300, then it's 38, 19, 10 pixels, 5 pixels, 3 pixels, 1 pixel.
80
00:05:55,330 --> 00:05:57,610
So it's constantly being reduced in size.
81
00:05:57,630 --> 00:06:05,150
And then everything we talked about is applied on every single step, on
82
00:06:05,530 --> 00:06:13,750
all of those images. What we were talking about is applied to all of them: the rectangles are identified, and the training
83
00:06:13,750 --> 00:06:17,370
of where they should be and what they contain happens.
84
00:06:17,440 --> 00:06:23,830
And as you can see here, there's a huge number of detections: 8732 detections
85
00:06:23,830 --> 00:06:27,620
in this setup, for a 300 by 300 input image at the start.
86
00:06:27,850 --> 00:06:31,500
8732 detections per class.
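As a rough check on that number, here is a minimal Python sketch. The feature-map sizes (38, 19, 10, 5, 3, 1) are the ones read out in the lecture; the number of default boxes per cell (4, 6, 6, 6, 4, 4) is an assumption taken from the commonly used SSD300 configuration and is not stated in this tutorial. The function name `detections_per_class` is made up for illustration.

```python
# Side lengths of the feature maps named in the lecture (300x300 input).
FEATURE_MAP_SIDES = [38, 19, 10, 5, 3, 1]

# ASSUMPTION: default boxes per grid cell for each feature map,
# as in the commonly used SSD300 configuration (not given in the lecture).
BOXES_PER_CELL = [4, 6, 6, 6, 4, 4]


def detections_per_class(sides, boxes_per_cell):
    """Sum side * side * boxes over all feature maps."""
    return sum(s * s * b for s, b in zip(sides, boxes_per_cell))


total = detections_per_class(FEATURE_MAP_SIDES, BOXES_PER_CELL)
print(total)  # 8732
```

Under that assumption the sum is 5776 + 2166 + 600 + 150 + 36 + 4, which matches the 8732 detections per class shown on the architecture slide.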
87
00:06:31,580 --> 00:06:34,690
And as you can imagine, there are many different classes as well.
88
00:06:34,810 --> 00:06:43,420
And that's because there are many iterations. Basically, what happens is that every time we convolve
89
00:06:43,480 --> 00:06:48,090
the image further, we are again applying these classifiers.
90
00:06:48,160 --> 00:06:54,170
Here you can see classifiers are being applied to detect, for instance, in our case, horses. Again, classifiers
91
00:06:54,250 --> 00:06:59,350
are being applied to detect horses; classifiers are being applied, classifiers. So basically that just shows
92
00:06:59,350 --> 00:07:02,890
that everything is being repeated again and again and again.
93
00:07:02,980 --> 00:07:07,090
And so let's have a look at how that happens, and then I'll provide some additional comments on
94
00:07:07,090 --> 00:07:07,710
that.
95
00:07:08,170 --> 00:07:14,000
So there is our image. What happens to the image is it actually goes through a convolution, and
96
00:07:14,540 --> 00:07:20,560
we will just resize it, but that's just for visual purposes, just to help us better understand what's
97
00:07:20,560 --> 00:07:21,200
going on.
98
00:07:21,310 --> 00:07:23,360
In reality it's not just resized.
99
00:07:23,380 --> 00:07:29,500
It actually goes through the convolution, so the image won't look anything like itself any more. It will
100
00:07:29,500 --> 00:07:35,610
be a completely random-looking thing for us, but it will preserve the features.
101
00:07:35,600 --> 00:07:40,990
So once again, we discuss this more in the convolutional neural networks tutorials, which are in annex number
102
00:07:40,990 --> 00:07:41,470
two.
103
00:07:41,620 --> 00:07:46,390
But for our purposes, just for us to better understand the intuition behind things, we're going to keep it
104
00:07:46,420 --> 00:07:52,480
as an image of horses, the exact same image, just resized. That will still help us understand what's
105
00:07:52,480 --> 00:07:53,680
going on.
106
00:07:53,750 --> 00:08:01,300
And so here what happens is the image is resized, but the rectangles stay the same size, and now they
107
00:08:01,300 --> 00:08:03,770
are applied to this new image which is resized.
108
00:08:03,880 --> 00:08:07,640
And because the image is smaller, that means the horse is smaller.
109
00:08:07,660 --> 00:08:12,990
And now, if we try to pick up features of horses, you'll see that these rectangles do pick up a horse,
110
00:08:13,360 --> 00:08:19,500
whereas those rectangles that used to pick up a horse don't. So this horse is not picked up any more, that one
111
00:08:19,510 --> 00:08:24,550
isn't, that one isn't, because they're smaller now, just similar to what we had in the first example
112
00:08:24,550 --> 00:08:28,540
with Adlon at the lake, where there were people in the background that were too small
113
00:08:28,540 --> 00:08:30,160
to be identified.
114
00:08:30,160 --> 00:08:30,820
Same thing here.
115
00:08:30,820 --> 00:08:35,710
So, just for argument's sake, we're going to say that these horses are not picked up, and that's totally
116
00:08:35,710 --> 00:08:41,440
fine because we already picked them up before, whereas now we've picked up this big horse, which in this
117
00:08:41,440 --> 00:08:43,630
new version of the image is smaller.
118
00:08:44,080 --> 00:08:48,590
And so now it will be identified.
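One way to put numbers on this intuition is the sketch below. It uses the feature-map sizes mentioned in the lecture and computes how many pixels of the original 300 by 300 image one grid cell spans at each scale. The function name `cell_footprint` is hypothetical, and this is a simplification: in a real network the receptive field of a cell is larger than this footprint.

```python
IMAGE_SIDE = 300  # input image side used in the lecture
FEATURE_MAP_SIDES = [38, 19, 10, 5, 3, 1]


def cell_footprint(fmap_side, image_side=IMAGE_SIDE):
    """Pixels of the original image spanned by one grid cell (simplified)."""
    return image_side / fmap_side


for side in FEATURE_MAP_SIDES:
    # Smaller feature maps mean each cell covers more of the original image.
    print(f"{side:>2} x {side:<2} grid: about {cell_footprint(side):.1f} px per cell")
```

On the 38 by 38 grid a cell spans only about 8 pixels, so the big horse looks like texture; on the 3 by 3 grid a cell spans 100 pixels, which is closer to the size of the horse in front.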
119
00:08:48,760 --> 00:08:54,900
The question that still remains, though, is how we will draw the box for this horse in the original image.
120
00:08:54,910 --> 00:09:00,040
From what we discussed in the previous tutorial, we know how the algorithm goes about drawing
121
00:09:00,040 --> 00:09:00,840
this box
122
00:09:00,880 --> 00:09:07,960
in this version of the image, because through training it will understand. You know, it has lots of
123
00:09:07,960 --> 00:09:15,610
examples of horses or other objects with the ground truth, and it will be able to understand how to draw
124
00:09:15,610 --> 00:09:19,630
these boxes. And so in this case it will also learn how to draw the boxes in this image.
125
00:09:19,630 --> 00:09:24,300
But remember that we got this image not just through resizing but through a convolution operation.
126
00:09:24,490 --> 00:09:31,580
And that means we still have a question: how do we then take this box and move it back?
127
00:09:31,720 --> 00:09:35,380
Well, it's pretty straightforward.
128
00:09:36,080 --> 00:09:42,310
We're just going to look at our architecture again here. Wherever we are in the process, the SSD
129
00:09:42,370 --> 00:09:49,660
algorithm is constructed in such a way that it preserves information about how to return, how to
130
00:09:49,660 --> 00:09:55,990
scale back to the original image: how to go from, let's say, this layer to here, or how to go from
131
00:09:55,990 --> 00:10:01,270
this layer to here. So it will remember that it found the horse, for instance, over here in this part of
132
00:10:01,270 --> 00:10:08,950
the image in this layer, and it will also know how it got to this layer, so it will know how to get back.
133
00:10:09,010 --> 00:10:17,140
So, how to deconvolve, I mean, in simplistic terms, how to get back to this original image and where it should
134
00:10:17,140 --> 00:10:21,940
place the horse there. So that's an important part: it needs to place the horse in the right place,
135
00:10:21,940 --> 00:10:24,300
not only in this image but in the original image.
136
00:10:24,370 --> 00:10:28,290
And the SSD algorithm is equipped to do that.
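A simplified way to picture "getting back" to the original image is sketched below. It assumes boxes are expressed as fractions of the image, centred on a grid cell, and then scaled up to pixels. The function `cell_to_image_box` and its parameters are hypothetical names for illustration; this is not the lecture's implementation, and a trained network would also predict learned offsets for each box.

```python
def cell_to_image_box(row, col, fmap_side, box_w, box_h, image_size=300):
    """Map a box centred on a feature-map cell to pixel coordinates.

    row, col      : cell index in the fmap_side x fmap_side grid
    box_w, box_h  : box width and height as fractions of the image (0..1)
    Returns (x_min, y_min, x_max, y_max) in pixels, clamped to the image.
    """
    # Cell centre in normalized (0..1) image coordinates.
    cx = (col + 0.5) / fmap_side
    cy = (row + 0.5) / fmap_side

    corners = (
        (cx - box_w / 2) * image_size,
        (cy - box_h / 2) * image_size,
        (cx + box_w / 2) * image_size,
        (cy + box_h / 2) * image_size,
    )
    # Keep the box inside the original image.
    return tuple(min(max(v, 0.0), float(image_size)) for v in corners)


# Centre cell of the 3x3 feature map, box covering half the image each way.
print(cell_to_image_box(1, 1, 3, 0.5, 0.5))  # (75.0, 75.0, 225.0, 225.0)
```

Because the coordinates are relative to the image rather than to the feature map, the same box can be drawn on the original 300 by 300 image regardless of which layer found it.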
137
00:10:28,360 --> 00:10:34,310
And so basically, this way, we are now able to handle objects of different sizes.
138
00:10:34,540 --> 00:10:41,590
And the beauty of the SSD algorithm is that it doesn't just resize the image and then proceed to do
139
00:10:41,590 --> 00:10:43,200
the convolution operation again.
140
00:10:43,210 --> 00:10:49,870
The idea here is that all this happens at the same time: all of these boxes that we're detecting, all
141
00:10:49,870 --> 00:10:56,920
of the positions of the boxes, the classes inside the boxes, and even the resizing, or the different scales
142
00:10:56,920 --> 00:10:57,470
of the images.
143
00:10:57,490 --> 00:11:02,910
It all happens in one neural network, over here, like this. So instead of just having this image and then
144
00:11:02,910 --> 00:11:07,920
having a copy of the image that's just resized, and another copy of the image that's smaller, and smaller,
145
00:11:07,910 --> 00:11:12,330
and running neural networks for each one, we're running all of this in one neural network.
146
00:11:12,360 --> 00:11:13,500
And why is that good?
147
00:11:13,500 --> 00:11:24,510
Why that is important is because we can utilize the features; we can share the features across the different
148
00:11:25,680 --> 00:11:32,160
layers, so that, you know, when we're training the network it learns how to detect horses
149
00:11:33,630 --> 00:11:40,080
when they're at full size, let's say in this layer, at this size, then when they're smaller,
150
00:11:40,170 --> 00:11:44,970
when they're made smaller, and so on. And so that way all of the layers are learning together.
151
00:11:45,120 --> 00:11:51,890
And that gives it that additional power. You know, there's power in numbers: they're
152
00:11:51,900 --> 00:11:57,270
seeing more different horses at the same time, so some horses are seen here, some horses are seen here,
153
00:11:57,540 --> 00:12:04,200
and therefore it is more powerful in the sense that it has more examples to work off, and the network
154
00:12:04,290 --> 00:12:08,700
can better adjust its weights to properly detect those horses.
155
00:12:08,700 --> 00:12:12,110
So this really helps or facilitates the training.
156
00:12:13,580 --> 00:12:20,310
Yeah, so that's how the SSD algorithm deals with the scale problem.
157
00:12:20,320 --> 00:12:23,600
It's quite a smart solution.
158
00:12:23,650 --> 00:12:30,130
I'm sure you'd agree that, you know, putting all of this into one convolutional neural network is pretty
159
00:12:30,820 --> 00:12:34,410
exciting and definitely gives us some amazing results.
160
00:12:34,410 --> 00:12:40,270
And on that note, I hope you enjoyed this tutorial, and I look forward to seeing you next time.
161
00:12:40,270 --> 00:12:42,370
And until then, enjoy computer vision.