Hello and welcome back to the course on computer vision. Today we're talking about training classifiers. Previously we discussed that the Viola-Jones algorithm has two stages: training and detection. We've already spoken about detection, and today is going to be a quick tutorial to get us started in the world of training.
So what exactly do we mean when we say training? Well, we're training these features, as you remember. We've got five basic features, but the trick is that they're scalable: they can be shorter or longer, wider or narrower. And we need to identify which of them are actually descriptive of a face, which of them are common features of a face.
So here we've got a face. Remember, we talked about how the bridge of the nose is lighter and the areas beside it are darker. We know that from our experience with faces, but the algorithm, because this is probably the first face it's ever seen, won't necessarily know that this is a common feature of faces. For instance, from this photo it might think: OK, when this part is lighter and this part is darker, that's also a common feature of faces. But that's not necessarily true, because not everybody has a beard. On the other hand, it might also look at the background and think that this white-to-black contrast is also common to faces, which is also not the case. So we need to help the algorithm understand which features are important for identifying a face and which features are not.
23
00:01:28,820 --> 00:01:34,520
So that's number one and number two is remember when we're talking about features we spoke about the
24
00:01:34,520 --> 00:01:39,210
threshold that you will never get like exactly.
25
00:01:39,230 --> 00:01:47,030
Very bright white here and very dark black here it all is a form of grey greyscale So a bit maybe of
26
00:01:47,300 --> 00:01:54,290
some contrast like not exactly 100 percent dark but close to dark and close to bright.
27
00:01:54,320 --> 00:02:00,680
So there's some there's some difference in between the contrast and that difference has to be it has
28
00:02:00,680 --> 00:02:08,300
to exceed the threshold for us to consider this feature to be present while the training will also help
29
00:02:08,300 --> 00:02:11,500
the algorithm understand those thresholds should they be set up 0.3.
30
00:02:11,570 --> 00:02:17,780
Should they be set at zero 0.7 position to be sets and zero point fifty seven or 56 or something like
31
00:02:18,110 --> 00:02:21,590
it needs to understand where to set those thresholds for the different features.
32
00:02:21,590 --> 00:02:22,500
And that's part of the training.
33
00:02:22,500 --> 00:02:28,700
So first of all identify the features and second to set thresholds and that's that's in a very brief
34
00:02:28,700 --> 00:02:30,380
overview of what the training is about.
35
00:02:30,380 --> 00:02:37,520
We'll talk more about it in coming to terms but basically that's kind of the goal to understand what
36
00:02:37,640 --> 00:02:44,540
is what features identify a face and how to say how to know when their presence or basically the threshold
37
00:02:44,540 --> 00:02:45,460
side of things.
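To make the threshold idea concrete, here is a minimal sketch in Python with NumPy of evaluating a single thresholded feature. The integral-image trick and the two-rectangle feature follow what we've described in the course; the function names and the way the check is written are illustrative assumptions, not code from the paper.

```python
import numpy as np

def integral_image(img):
    # Cumulative sums over rows then columns: any rectangle sum
    # can then be read off in at most four array lookups.
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, x, y, w, h):
    # Sum of pixels in the w-by-h rectangle with top-left corner (x, y).
    total = ii[y + h - 1, x + w - 1]
    if x > 0:
        total -= ii[y + h - 1, x - 1]
    if y > 0:
        total -= ii[y - 1, x + w - 1]
    if x > 0 and y > 0:
        total += ii[y - 1, x - 1]
    return total

def two_rect_feature(ii, x, y, w, h):
    # Horizontal two-rectangle feature: left half minus right half,
    # like the bright bridge of the nose against its darker sides.
    left = rect_sum(ii, x, y, w // 2, h)
    right = rect_sum(ii, x + w // 2, y, w // 2, h)
    return left - right

def feature_present(value, threshold):
    # The feature only "counts" when the contrast clears the learned
    # threshold -- the 0.3 versus 0.7 question from above.
    return value > threshold
```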
So what the algorithm does is shrink the image to 24 by 24 pixels, and then it looks for these different features on the image and tries to understand which features are common for faces. The first question is: why does it shrink the image to 24 by 24? Well, because, as we discussed, these features are scalable. When you have a very large image, for instance 1000 by 1000, or even 500 by 500 pixels, there are lots of different variations of these features. A feature can be one pixel by one pixel, it can be two by two, it can be four pixels by four pixels, it can be 10 by 10, or 100 by 100. There are just lots and lots of different combinations, different lengths, widths, and heights of these features, and it would take forever to look through all of them and understand which ones are telling of a face. So it's easier to scale the image down and then apply a much more limited set of features, the ones that fit in a 24 by 24 pixel window (we'll talk about this further down as well), so there are far fewer combinations.
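To get a feel for why the window is kept small, here's a short counting sketch. The base shapes follow the standard description of the five Viola-Jones feature types; the function itself is just an illustrative exercise, not part of any detection library.

```python
def count_haar_features(window=24):
    # Count every placement of a feature whose base shape is bw x bh,
    # over all scales that still fit inside a square window.
    def placements(bw, bh):
        total = 0
        for w in range(bw, window + 1, bw):        # widths: multiples of bw
            for h in range(bh, window + 1, bh):    # heights: multiples of bh
                total += (window - w + 1) * (window - h + 1)
        return total

    return (placements(2, 1) + placements(1, 2)    # two-rectangle features
            + placements(3, 1) + placements(1, 3)  # three-rectangle features
            + placements(2, 2))                    # four-rectangle feature

print(count_haar_features(24))   # 162336: even 24x24 holds ~160k features
print(count_haar_features(100))  # tens of millions; large windows blow up fast
```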
50
00:03:56,800 --> 00:04:02,680
And then once we found them once we found OK so that you will still find here that the nose is like
51
00:04:02,950 --> 00:04:09,960
brighter and darker and you'll still find the same features but then when you're actually applying it
52
00:04:09,960 --> 00:04:16,140
to a Nimish to Detective face what happens then is we apply we keep the image the same size we don't
53
00:04:16,140 --> 00:04:20,080
scale it down to 24 by 24 but then we scale the features up.
54
00:04:20,220 --> 00:04:20,910
So that's the trick.
55
00:04:20,910 --> 00:04:23,820
So when you're training you scale the image down.
56
00:04:23,820 --> 00:04:27,360
All of that and we'll talk about many training images just now.
57
00:04:27,360 --> 00:04:33,990
But all of the training images are scaled down to 24 pixels and the features I detected there and the
58
00:04:33,990 --> 00:04:39,670
training happens there so we know which features are important and what what threshold they should have.
59
00:04:39,900 --> 00:04:44,940
And then in the real life when you're actually detecting you don't scale the image you keep the image
60
00:04:44,940 --> 00:04:48,440
at its original size wherever it is you scale the features up.
61
00:04:48,450 --> 00:04:52,350
So those features that are important you scale them up and then you look for them on the image.
62
00:04:52,410 --> 00:04:54,440
So that's an important thing to remember.
63
00:04:54,450 --> 00:04:55,990
That's the important trick there.
64
00:04:56,340 --> 00:04:57,950
And there you go.
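As an aside, this is what running a trained cascade looks like in practice. Here's a quick sketch using the frontal-face cascade that ships with OpenCV; the file name "photo.jpg" is an assumed input, and note that OpenCV's implementation rescales the image across a pyramid rather than the features, which is equivalent in effect.

```python
import cv2

# Load the pre-trained frontal-face Haar cascade bundled with OpenCV.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("photo.jpg")                      # assumed input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# scaleFactor sets the step between successive detection scales;
# minNeighbors discards detections not confirmed by overlapping windows.
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
```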
So what's missing here? We already know the concept: we need to look for these features on an image; that much we all understand. What's missing is the data. One image is not enough. As we said, the algorithm might pick up the wrong things from that one image. That's why you need to supply lots of images, lots of frontal faces of people, to the algorithm, so that it can understand which of the features it's looking for are actually common. When it has lots of faces, it will be able to tell: OK, this feature that I saw on one face keeps appearing on many, many different faces, so if I look for that feature, I'm going to have a good chance of finding a face. Whereas this other feature, which I saw in one image, is actually not present in most of the other images, so it's not important.
Here we've only got a couple of faces, but in the actual original paper, Viola and Jones supplied their algorithm with 4,916 manually labeled faces. You can imagine how much work that was: going through 4,916 photos and manually labeling each one (this is a face, this is a face), cutting those faces out, making them 24 by 24, making sure there really are faces in there, and so on. And then they applied a quick trick, which I also suggest if you're going to be training your own classifier. It's a very interesting trick: for every face you found, take a mirror image of it, a left-to-right mirror image. For instance, here we've got a face where the person is more to the left; in the mirror image they would be a bit more to the right. His left eye would look like his right eye, and his right eye would look like his left eye. It would be just a mirror image of this image, and to us it would look pretty much the same; you'd see that it's a mirror image. But for a computer it's technically a brand-new image. By doing that, they instantly doubled the number of images to 9,832.
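That augmentation trick is essentially a one-liner in practice. A sketch, assuming your face crops are already saved as image files with names like "face_0001.png":

```python
import cv2

face = cv2.imread("face_0001.png", cv2.IMREAD_GRAYSCALE)  # assumed 24x24 crop
mirrored = cv2.flip(face, 1)   # flip code 1 = flip around the vertical axis
cv2.imwrite("face_0001_mirror.png", mirrored)
# Saving a mirror of every crop doubles the training set for free:
# to us it's the same face, to the algorithm it's a brand-new example.
```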
So that's good. From these 9,832, almost 10,000, images we can find which features are common for faces, and we keep those. But then we also need to make sure that those features are common just for faces, that they're not common for something else. That's why you also need to supply non-face images: a whole set of random photos, objects, everything, but you need to make sure that none of those images have faces in them. Viola and Jones supplied their algorithm with 9,544 of these images. The interesting thing here is that these ones don't have to be 24 by 24; they can be any size (within limits, of course, say a couple of hundred pixels). Even if it's a big image, you can just take sub-windows from it and treat each one as a separate image for training purposes, and that increases the amount of non-face data you have even further. In the Viola-Jones paper this came to about 350 million sub-windows that they had for training.
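Here's a minimal sketch of that sub-window idea, assuming each face-free image is already a NumPy array; the window size matches the 24 by 24 convention, while the stride is an illustrative choice (a dense sampling would use stride 1).

```python
import numpy as np

def negative_windows(image, size=24, stride=12):
    # Slide a size-by-size window over a face-free image; every crop
    # becomes a separate negative example for training.
    h, w = image.shape[:2]
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            yield image[y:y + size, x:x + size]

# Even a single 200x200 non-face photo yields 15 * 15 = 225 windows at
# this stride; sampled densely across 9,544 photos, the count reaches
# hundreds of millions of sub-windows.
```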
So there you go. As you can imagine, together these will do the trick. The face images will help the algorithm understand which features are important for faces, and it will keep just those. Then the non-face images will help it reconcile which of those features, although good for faces, are also leading to a high rate of false positives. For instance, it found a certain feature that helps it pick up a face, but that feature also keeps picking up, I don't know, a dog. Because these images are labeled as non-faces, the algorithm will be able to learn that that feature, although it is good for faces, is bad in the sense that it's picking up a lot of dogs, so it won't keep that feature. And that's how it will work: it will use the face images to narrow down the huge number of features that potentially fit into this window and pick out the ones that are good for faces, and then, using the non-face images, it will drop the ones that are giving it high rates of false positives, so that it picks out faces, only faces, and nothing else.
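A rough sketch of how that bookkeeping could look, assuming we've already computed one feature's value on every labeled example. The equal 0.5 weighting is an illustrative simplification of the weighted error used in the real boosting step of the paper.

```python
import numpy as np

def feature_error(values_on_faces, values_on_nonfaces, threshold):
    # A feature is only worth keeping if it fires on faces (few misses)
    # AND stays quiet on non-faces (few false positives -- the "dogs").
    miss_rate = np.mean(values_on_faces <= threshold)
    false_positive_rate = np.mean(values_on_nonfaces > threshold)
    return 0.5 * (miss_rate + false_positive_rate)

# Keep the features with the lowest combined error; drop the rest.
```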
So that's, in a nutshell, how the training works: it helps pick out the features, and it also helps establish the thresholds that those features should have. If you'd like to do some additional reading on this topic, then a good paper is "A General Framework for Object Detection" by Constantin Papageorgiou and colleagues. It was written in 1998, before the paper by Viola and Jones, and that's where the Haar-like features were introduced in the first place, so you might want to check that out. And of course the original Viola-Jones paper has everything you need; it has all the details about the Haar-like features and how everything works. I'll admit I haven't read the Papageorgiou paper in detail, but I looked at the abstract and it looks interesting. Personally, I prefer the Viola-Jones paper; I think that's a full enough resource on its own, so if you haven't read that one yet, I highly recommend checking it out. And on that note, I look forward to seeing you in the next tutorial. Until then, enjoy computer vision.