Hello and welcome back to the course on computer vision. Today we're talking about training classifiers. Previously we discussed that the Viola-Jones algorithm has two stages: training and detection. We've already spoken about detection, and today is going to be a quick tutorial to get us started in the world of training.
So what exactly do we mean when we say training? Well, we're training these features, as you remember. We've got five basic features, but the trick is that they're scalable: they can be shorter or longer, wider or narrower. And we need to identify which of them are actually descriptive of a face, which of them are common features of a face.
So here we've got a face. Remember, we talked about how the bridge of the nose is lighter and the areas beside it are darker. We know that from our experience with faces, but the algorithm, because this is probably the first face it's ever seen, won't necessarily know that this is a common feature of faces. For instance, from this photo it might think: OK, when this part is lighter and this part is darker, that's also a common feature of faces. But that's not necessarily true, because not everybody has a beard. On the other hand, it might also look at the background and think that this white-to-black contrast is also common to faces, which is also not the case. So we need to help the algorithm understand which features are important for identifying a face and which features are not.
23
00:01:28,820 --> 00:01:34,520
So that's number one and number two is remember when we're talking about features we spoke about the
24
00:01:34,520 --> 00:01:39,210
threshold that you will never get like exactly.
25
00:01:39,230 --> 00:01:47,030
Very bright white here and very dark black here it all is a form of grey greyscale So a bit maybe of
26
00:01:47,300 --> 00:01:54,290
some contrast like not exactly 100 percent dark but close to dark and close to bright.
27
00:01:54,320 --> 00:02:00,680
So there's some there's some difference in between the contrast and that difference has to be it has
28
00:02:00,680 --> 00:02:08,300
to exceed the threshold for us to consider this feature to be present while the training will also help
29
00:02:08,300 --> 00:02:11,500
the algorithm understand those thresholds should they be set up 0.3.
30
00:02:11,570 --> 00:02:17,780
Should they be set at zero 0.7 position to be sets and zero point fifty seven or 56 or something like
31
00:02:18,110 --> 00:02:21,590
it needs to understand where to set those thresholds for the different features.
32
00:02:21,590 --> 00:02:22,500
And that's part of the training.
33
00:02:22,500 --> 00:02:28,700
So first of all identify the features and second to set thresholds and that's that's in a very brief
34
00:02:28,700 --> 00:02:30,380
overview of what the training is about.
35
00:02:30,380 --> 00:02:37,520
We'll talk more about it in coming to terms but basically that's kind of the goal to understand what
36
00:02:37,640 --> 00:02:44,540
is what features identify a face and how to say how to know when their presence or basically the threshold
37
00:02:44,540 --> 00:02:45,460
side of things.
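To make the threshold idea concrete, here is a minimal sketch in Python with NumPy of evaluating a single thresholded feature. The integral-image trick and the two-rectangle feature follow what we've described in the course; the function names and the way the check is written are illustrative assumptions, not code from the paper.

```python
import numpy as np

def integral_image(img):
    # Cumulative sums over rows then columns: any rectangle sum
    # can then be read off in at most four array lookups.
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, x, y, w, h):
    # Sum of pixels in the w-by-h rectangle with top-left corner (x, y).
    total = ii[y + h - 1, x + w - 1]
    if x > 0:
        total -= ii[y + h - 1, x - 1]
    if y > 0:
        total -= ii[y - 1, x + w - 1]
    if x > 0 and y > 0:
        total += ii[y - 1, x - 1]
    return total

def two_rect_feature(ii, x, y, w, h):
    # Horizontal two-rectangle feature: left half minus right half,
    # like the bright bridge of the nose against its darker sides.
    left = rect_sum(ii, x, y, w // 2, h)
    right = rect_sum(ii, x + w // 2, y, w // 2, h)
    return left - right

def feature_present(value, threshold):
    # The feature only "counts" when the contrast clears the learned
    # threshold -- the 0.3 versus 0.7 question from above.
    return value > threshold
```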
So what the algorithm does is shrink the image to 24 by 24 pixels, and then it looks for these different features on the image and tries to understand which features are common for faces. The first question is: why does it shrink the image to 24 by 24? Well, because, as we discussed, these features are scalable. When you have a very large image, for instance 1000 by 1000, or even 500 by 500 pixels, there are lots of different variations of these features. A feature can be one pixel by one pixel, it can be two by two, it can be four pixels by four pixels, it can be 10 by 10, or 100 by 100. There are just lots and lots of different combinations, different lengths, widths, and heights of these features, and it would take forever to look through all of them and understand which ones are telling of a face. So it's easier to scale the image down and then apply a much more limited set of features, the ones that fit in a 24 by 24 pixel window (we'll talk about this further down as well), so there are far fewer combinations.
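To get a feel for why the window is kept small, here's a short counting sketch. The base shapes follow the standard description of the five Viola-Jones feature types; the function itself is just an illustrative exercise, not part of any detection library.

```python
def count_haar_features(window=24):
    # Count every placement of a feature whose base shape is bw x bh,
    # over all scales that still fit inside a square window.
    def placements(bw, bh):
        total = 0
        for w in range(bw, window + 1, bw):        # widths: multiples of bw
            for h in range(bh, window + 1, bh):    # heights: multiples of bh
                total += (window - w + 1) * (window - h + 1)
        return total

    return (placements(2, 1) + placements(1, 2)    # two-rectangle features
            + placements(3, 1) + placements(1, 3)  # three-rectangle features
            + placements(2, 2))                    # four-rectangle feature

print(count_haar_features(24))   # 162336: even 24x24 holds ~160k features
print(count_haar_features(100))  # tens of millions; large windows blow up fast
```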
50
00:03:56,800 --> 00:04:02,680
And then once we found them once we found OK so that you will still find here that the nose is like
51
00:04:02,950 --> 00:04:09,960
brighter and darker and you'll still find the same features but then when you're actually applying it
52
00:04:09,960 --> 00:04:16,140
to a Nimish to Detective face what happens then is we apply we keep the image the same size we don't
53
00:04:16,140 --> 00:04:20,080
scale it down to 24 by 24 but then we scale the features up.
54
00:04:20,220 --> 00:04:20,910
So that's the trick.
55
00:04:20,910 --> 00:04:23,820
So when you're training you scale the image down.
56
00:04:23,820 --> 00:04:27,360
All of that and we'll talk about many training images just now.
57
00:04:27,360 --> 00:04:33,990
But all of the training images are scaled down to 24 pixels and the features I detected there and the
58
00:04:33,990 --> 00:04:39,670
training happens there so we know which features are important and what what threshold they should have.
59
00:04:39,900 --> 00:04:44,940
And then in the real life when you're actually detecting you don't scale the image you keep the image
60
00:04:44,940 --> 00:04:48,440
at its original size wherever it is you scale the features up.
61
00:04:48,450 --> 00:04:52,350
So those features that are important you scale them up and then you look for them on the image.
62
00:04:52,410 --> 00:04:54,440
So that's an important thing to remember.
63
00:04:54,450 --> 00:04:55,990
That's the important trick there.
64
00:04:56,340 --> 00:04:57,950
And there you go.
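As an aside, this is what running a trained cascade looks like in practice. Here's a quick sketch using the frontal-face cascade that ships with OpenCV; the file name "photo.jpg" is an assumed input, and note that OpenCV's implementation rescales the image across a pyramid rather than the features, which is equivalent in effect.

```python
import cv2

# Load the pre-trained frontal-face Haar cascade bundled with OpenCV.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("photo.jpg")                      # assumed input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# scaleFactor sets the step between successive detection scales;
# minNeighbors discards detections not confirmed by overlapping windows.
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
```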
So what's missing here? We already know the concept: we need to look for these features on an image; that much we all understand. What's missing is the data. One image is not enough. As we said, the algorithm might pick up the wrong things from that one image. That's why you need to supply lots of images, lots of frontal faces of people, to the algorithm, so that it can understand which of the features it's looking for are actually common. When it has lots of faces, it will be able to tell: OK, this feature that I saw on one face keeps appearing on many, many different faces, so if I look for that feature, I'm going to have a good chance of finding a face. Whereas this other feature, which I saw in one image, is actually not present in most of the other images, so it's not important.
Here we've only got a couple of faces, but in the actual original paper, Viola and Jones supplied their algorithm with 4,916 manually labeled faces. You can imagine how much work that was: going through 4,916 photos and manually labeling each one (this is a face, this is a face), cutting those faces out, making them 24 by 24, making sure there really are faces in there, and so on. And then they applied a quick trick, which I also suggest if you're going to be training your own classifier. It's a very interesting trick: for every face you found, take a mirror image of it, a left-to-right mirror image. For instance, here we've got a face where the person is more to the left; in the mirror image they would be a bit more to the right. His left eye would look like his right eye, and his right eye would look like his left eye. It would be just a mirror image of this image, and to us it would look pretty much the same; you'd see that it's a mirror image. But for a computer it's technically a brand-new image. By doing that, they instantly doubled the number of images to 9,832.
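That augmentation trick is essentially a one-liner in practice. A sketch, assuming your face crops are already saved as image files with names like "face_0001.png":

```python
import cv2

face = cv2.imread("face_0001.png", cv2.IMREAD_GRAYSCALE)  # assumed 24x24 crop
mirrored = cv2.flip(face, 1)   # flip code 1 = flip around the vertical axis
cv2.imwrite("face_0001_mirror.png", mirrored)
# Saving a mirror of every crop doubles the training set for free:
# to us it's the same face, to the algorithm it's a brand-new example.
```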
So that's good. From these 9,832, almost 10,000, images we can find which features are common for faces, and we keep those. But then we also need to make sure that those features are common just for faces, that they're not common for something else. That's why you also need to supply non-face images: a whole set of random photos, objects, everything, but you need to make sure that none of those images have faces in them. Viola and Jones supplied their algorithm with 9,544 of these images. The interesting thing here is that these ones don't have to be 24 by 24; they can be any size (within limits, of course, say a couple of hundred pixels). Even if it's a big image, you can just take sub-windows from it and treat each one as a separate image for training purposes, and that increases the amount of non-face data you have even further. In the Viola-Jones paper this came to about 350 million sub-windows that they had for training.
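Here's a minimal sketch of that sub-window idea, assuming each face-free image is already a NumPy array; the window size matches the 24 by 24 convention, while the stride is an illustrative choice (a dense sampling would use stride 1).

```python
import numpy as np

def negative_windows(image, size=24, stride=12):
    # Slide a size-by-size window over a face-free image; every crop
    # becomes a separate negative example for training.
    h, w = image.shape[:2]
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            yield image[y:y + size, x:x + size]

# Even a single 200x200 non-face photo yields 15 * 15 = 225 windows at
# this stride; sampled densely across 9,544 photos, the count reaches
# hundreds of millions of sub-windows.
```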
So there you go. As you can imagine, together these will do the trick. The face images will help the algorithm understand which features are important for faces, and it will keep just those. Then the non-face images will help it reconcile which of those features, although good for faces, are also leading to a high rate of false positives. For instance, it found a certain feature that helps it pick up a face, but that feature also keeps picking up, I don't know, a dog. Because these images are labeled as non-faces, the algorithm will be able to learn that that feature, although it is good for faces, is bad in the sense that it's picking up a lot of dogs, so it won't keep that feature. And that's how it will work: it will use the face images to narrow down the huge number of features that potentially fit into this window and pick out the ones that are good for faces, and then, using the non-face images, it will drop the ones that are giving it high rates of false positives, so that it picks out faces, only faces, and nothing else.
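A rough sketch of how that bookkeeping could look, assuming we've already computed one feature's value on every labeled example. The equal 0.5 weighting is an illustrative simplification of the weighted error used in the real boosting step of the paper.

```python
import numpy as np

def feature_error(values_on_faces, values_on_nonfaces, threshold):
    # A feature is only worth keeping if it fires on faces (few misses)
    # AND stays quiet on non-faces (few false positives -- the "dogs").
    miss_rate = np.mean(values_on_faces <= threshold)
    false_positive_rate = np.mean(values_on_nonfaces > threshold)
    return 0.5 * (miss_rate + false_positive_rate)

# Keep the features with the lowest combined error; drop the rest.
```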
So that's, in a nutshell, how the training works: it helps pick out the features, and it also helps establish the thresholds that those features should have. If you'd like to do some additional reading on this topic, then a good paper is "A General Framework for Object Detection" by Constantin Papageorgiou and colleagues. It was written in 1998, before the paper by Viola and Jones, and that's where the Haar-like features were introduced in the first place, so you might want to check that out. And of course the original Viola-Jones paper has everything you need; it has all the details about the Haar-like features and how everything works. I'll admit I haven't read the Papageorgiou paper in detail, but I looked at the abstract and it looks interesting. Personally, I prefer the Viola-Jones paper; I think that's a full enough resource on its own, so if you haven't read that one yet, I highly recommend checking it out. And on that note, I look forward to seeing you in the next tutorial. Until then, enjoy computer vision.