1
00:00:00,970 --> 00:00:03,410
Hello and welcome back to the course on computer vision.
2
00:00:03,550 --> 00:00:09,070
And today we're talking about the MultiBox concept behind the SSD algorithm.
3
00:00:09,100 --> 00:00:17,290
This is going to be quite a visual tutorial, so there we have Hadelin standing on a pontoon at Lake Bohinj
4
00:00:17,320 --> 00:00:24,280
in Slovenia. Just as a side note, if you're ever in Slovenia or around there, definitely visit Lake Bohinj.
5
00:00:24,400 --> 00:00:29,900
It's a beautiful, magical place. As you can see, it's amazing there.
6
00:00:29,910 --> 00:00:35,380
There are also the Alps behind, and they have a great lookout tower there.
7
00:00:35,490 --> 00:00:40,950
You can actually go up and see a huge part of the Alps from up there.
8
00:00:41,080 --> 00:00:46,210
Very lovely. We were there during our European road trip when we were visiting students all across Europe.
9
00:00:46,540 --> 00:00:48,070
And they have some great stuff there.
10
00:00:48,070 --> 00:00:53,250
They have two big lakes in Slovenia, Lake Bled and Lake Bohinj, and this is Lake Bohinj, the bigger one.
11
00:00:53,260 --> 00:00:57,450
It's a bit further away from the main cities but it is worth the trip.
12
00:00:57,550 --> 00:01:04,180
And so as you can see, we've already got a box here identifying him as a person as he's standing there
13
00:01:05,200 --> 00:01:06,240
pointing.
14
00:01:06,760 --> 00:01:14,710
And what we're going to do today is we're going to take this box as the ground truth, and then we're
15
00:01:14,710 --> 00:01:22,440
going to see how the algorithm works with its own boxes to actually identify Hadelin
16
00:01:22,510 --> 00:01:24,930
and build this exact same box.
17
00:01:25,090 --> 00:01:27,130
And first things first.
18
00:01:27,130 --> 00:01:28,390
What is the ground truth?
19
00:01:28,420 --> 00:01:36,310
Well, ground truth is a concept that's not just used in computing; it's used in many places, and it separates
20
00:01:36,940 --> 00:01:42,240
the observed evidence or empirical evidence from inferred evidence.
21
00:01:42,340 --> 00:01:49,330
So for instance, when you go and actually look at something, you see it, and you can physically
22
00:01:49,330 --> 00:01:53,510
confirm that this is the true state of things, that's your ground truth there.
23
00:01:53,600 --> 00:01:55,000
There are no two ways about it.
24
00:01:55,020 --> 00:02:01,400
There are no gray areas; you can see exactly that this is the box that should be drawn.
25
00:02:01,810 --> 00:02:09,220
Whereas when you apply an algorithm and it goes and processes all of this data through its convolutional
26
00:02:09,220 --> 00:02:13,180
neural networks and so on, and comes back with boxes, those are not going to be ground truth boxes; those
27
00:02:13,180 --> 00:02:17,860
are going to be inferred boxes, suggestions brought up by our algorithm. Even if they're
28
00:02:17,860 --> 00:02:23,260
correct, they're still merely suggestions, and in order to make sure that they're correct, we need to have
29
00:02:23,260 --> 00:02:24,260
the ground truth.
30
00:02:24,340 --> 00:02:33,760
And that's why in the process of training these networks, the SSD in particular, we need the ground
31
00:02:33,780 --> 00:02:34,020
truth.
32
00:02:34,030 --> 00:02:40,420
So we need these boxes to be identified; we cannot train the algorithm
33
00:02:40,420 --> 00:02:45,940
if these boxes are not present in the first place. That's something important to keep in mind for the
34
00:02:46,000 --> 00:02:47,120
training process.
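To make that concrete, here is a minimal sketch of what one labeled training example could look like; the file name, coordinates and class name are made up purely for illustration and are not from the actual dataset used here.

```python
# Minimal sketch: a detection training example needs the image plus
# ground-truth boxes, not just an image-level label.
# All values below are hypothetical.
training_example = {
    "image_path": "lake_bohinj.jpg",                      # hypothetical file name
    "ground_truth": [
        {"class": "person", "box": (412, 96, 516, 388)},  # (xmin, ymin, xmax, ymax) in pixels
    ],
}

# A plain classification dataset would only store something like
# {"image_path": ..., "label": "person"}, which is not enough here:
# the detector must also learn *where* the person is.
```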
35
00:02:47,990 --> 00:02:50,850
And yeah, so there is Hadelin.
36
00:02:50,870 --> 00:02:57,020
Let's see how the algorithm will work and build its boxes and how the training will happen.
37
00:02:57,020 --> 00:03:05,420
So what the SSD will do is break down the image into segments, and then for every segment it
38
00:03:05,480 --> 00:03:12,710
will construct several boxes: a box like that, one that looks like that, and a box like that, for
39
00:03:12,980 --> 00:03:16,940
instance. The exact setup can vary.
40
00:03:16,940 --> 00:03:22,790
But we're going to stick to this one, the most generic one, and basically the whole image will
41
00:03:22,790 --> 00:03:24,840
get covered with these boxes.
42
00:03:24,860 --> 00:03:27,600
As you can see it's covered entirely.
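As a rough sketch of how such boxes can cover the whole image, here is one way to tile a picture with default boxes on a coarse grid; the grid size and aspect ratios are illustrative choices, not the exact SSD configuration.

```python
# Minimal sketch of tiling an image with default boxes: for every cell of a
# coarse grid we place a few boxes with different aspect ratios.
# Grid size and ratios below are illustrative, not the SSD paper's values.

def default_boxes(img_w, img_h, grid=8, aspect_ratios=(1.0, 2.0, 0.5)):
    boxes = []
    cell_w, cell_h = img_w / grid, img_h / grid
    for i in range(grid):
        for j in range(grid):
            cx, cy = (j + 0.5) * cell_w, (i + 0.5) * cell_h  # cell centre
            for ar in aspect_ratios:
                w = cell_w * (ar ** 0.5)
                h = cell_h / (ar ** 0.5)
                boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

print(len(default_boxes(640, 480)))  # 8 * 8 * 3 = 192 boxes covering the image
```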
43
00:03:27,800 --> 00:03:36,590
And then for every single one of these boxes it's going to ask the question: is there an object in there
44
00:03:36,620 --> 00:03:45,140
or not? And not just any object; it's going to ask that question for every single class of objects that it's trained
45
00:03:45,140 --> 00:03:49,860
for. Or, in the training process, it is going to ask that question for every single class of objects that
46
00:03:49,860 --> 00:03:51,540
it is training for.
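A minimal sketch of that "one question per class, per box" idea, assuming the network produces one raw score per class for each box; the class list and the scores below are invented for illustration.

```python
# Minimal sketch: for every default box, the network outputs one score per class
# (plus a "background"/no-object class); a softmax turns them into probabilities.
import math

CLASSES = ["background", "person", "car", "boat", "cat", "dog"]

def class_probabilities(raw_scores):
    exps = [math.exp(s) for s in raw_scores]
    total = sum(exps)
    return {c: e / total for c, e in zip(CLASSES, exps)}

# One box's hypothetical raw scores: it is fairly confident it sees a person.
print(class_probabilities([0.1, 2.3, 0.4, 0.2, 0.0, 0.1]))
```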
47
00:03:52,100 --> 00:04:03,590
And so basically in this case it will identify that none of these boxes except for the ones highlighted
48
00:04:03,590 --> 00:04:10,430
in red actually have an object of interest. And we're going to assume that in this case, for simplicity's
49
00:04:10,430 --> 00:04:12,010
sake, we're only looking for people.
50
00:04:12,080 --> 00:04:15,780
There are people in the background there but they're very small.
51
00:04:15,790 --> 00:04:22,060
And the algorithm will not be able to pick up their features and therefore will not detect those people.
52
00:04:22,070 --> 00:04:27,090
We'll talk about size and scale further down, in another tutorial in this section. For now we're
53
00:04:27,090 --> 00:04:31,880
going to focus in on Hadelin, and we can see that some of the boxes have picked up some features of a
54
00:04:31,880 --> 00:04:34,620
human being in that current form.
55
00:04:34,790 --> 00:04:40,310
So if we remove all that, let's just look at one box, just so that it's clear why, for instance, this box
56
00:04:40,550 --> 00:04:48,350
is able to pick up that there is a person within this box. And it doesn't have to be a full person.
57
00:04:48,350 --> 00:04:52,910
This is what I wanted to point out here: it doesn't have to be the full person or object. As long as it can
58
00:04:52,910 --> 00:04:59,930
find a sufficient number of features, it will produce a probability; it will say that with a likelihood
59
00:04:59,930 --> 00:05:02,470
of 80 percent or 70 percent.
60
00:05:02,540 --> 00:05:09,120
what I have in this box is a human being, or a car, or a boat, or a cat, or a dog.
61
00:05:09,230 --> 00:05:14,820
So in this case it sees enough features to say that there is a human or a person inside this box.
62
00:05:15,180 --> 00:05:22,260
And then all of these boxes that we have identified here can see almost the same features, or
63
00:05:22,260 --> 00:05:29,360
very different features, but nevertheless they all have enough to identify him. Except the green one;
64
00:05:29,360 --> 00:05:32,180
you can see that green one, for instance.
65
00:05:32,440 --> 00:05:35,660
It only sees from the elbow downwards.
66
00:05:35,690 --> 00:05:42,260
So just for the sake of this tutorial, I decided to assume that it's not seeing enough features, although,
67
00:05:42,260 --> 00:05:46,770
as you'll see from the practical side of things, that would actually be enough features for the
68
00:05:46,770 --> 00:05:49,120
SSD algorithm to detect a human being.
69
00:05:49,190 --> 00:05:55,400
But just for our case, I wanted to show you that these three boxes don't always agree; as you can see,
70
00:05:55,400 --> 00:05:56,890
the boxes come in threes.
71
00:05:56,900 --> 00:06:00,070
In our case, they do not always tally.
72
00:06:00,140 --> 00:06:03,420
So if one of them picks up a person, that doesn't mean all of them do.
73
00:06:03,500 --> 00:06:05,150
So in this case, that one would be dropped.
74
00:06:05,230 --> 00:06:09,260
And so we have not six but five boxes that have detected the person.
75
00:06:09,830 --> 00:06:14,900
And so that is the ideal case and that's what we're striving towards.
76
00:06:14,900 --> 00:06:17,050
That's what we want to train the network to do.
77
00:06:17,240 --> 00:06:19,730
And the way it will happen is this: here's our network.
78
00:06:20,000 --> 00:06:27,520
And every time, all of these boxes predict something and we'll have an output, so I'll bring them out here;
79
00:06:27,530 --> 00:06:31,490
for instance, they'll look at these small areas.
80
00:06:31,490 --> 00:06:32,720
We'll talk about those later.
81
00:06:32,720 --> 00:06:38,670
For now we're dealing with this layer over here, and so we've got all these boxes and they all predict something.
82
00:06:38,810 --> 00:06:46,340
And so then the neural network searches for features and predicts that, for instance, in these boxes
83
00:06:46,430 --> 00:06:48,410
and in these boxes there's a human; that's what we want.
84
00:06:48,410 --> 00:06:53,210
But for example, if it had predicted that in these boxes there is a human, or in these boxes a human,
85
00:06:53,450 --> 00:06:58,040
and then on the other hand it predicted that some of these boxes don't have a human, at the end we would
86
00:06:58,040 --> 00:06:58,900
have an error.
87
00:06:58,910 --> 00:07:02,250
How would we know there is an error? Because we actually have the ground truth.
88
00:07:02,300 --> 00:07:07,940
We have the ground truth with this white box, and by overlaying these images, the
89
00:07:07,940 --> 00:07:13,250
algorithm knows exactly which boxes should say: yes, there is a human.
90
00:07:13,250 --> 00:07:14,570
And which boxes should say no.
91
00:07:14,570 --> 00:07:20,960
So it really knows. What we actually did here is: the red boxes are the ones that ideally
92
00:07:21,020 --> 00:07:26,510
should say there's a human, and we want them to say human, and all other boxes, which we've deleted, should
93
00:07:26,510 --> 00:07:28,760
say there is no human, there is no person.
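One common way detectors of this kind decide which boxes "should say yes" is to overlay each box on the ground-truth box and measure the overlap with intersection-over-union; the sketch below assumes that convention and a typical 0.5 threshold, neither of which is spelled out in this tutorial.

```python
# Minimal sketch: decide which boxes "should say there is a person" by
# overlapping them with the ground-truth box and thresholding the IoU.
# Boxes are (xmin, ymin, xmax, ymax); 0.5 is a common (assumed) threshold.

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def should_say_person(default_box, ground_truth_box, threshold=0.5):
    return iou(default_box, ground_truth_box) >= threshold
```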
94
00:07:28,970 --> 00:07:32,090
And so unless that is the case there will be an error.
95
00:07:32,180 --> 00:07:36,830
So if, say for instance, a box over here says that there is a human, and a box over there says that there is a
96
00:07:36,830 --> 00:07:41,110
human, and one or a couple of these boxes don't pick up the human, there'll be an error.
97
00:07:41,330 --> 00:07:46,990
And at the end of the algorithm, that error can be calculated and expressed mathematically. We
98
00:07:47,020 --> 00:07:52,670
won't get into that here; you can find more information in the annex on convolutional neural
99
00:07:52,670 --> 00:07:53,160
networks.
100
00:07:53,300 --> 00:07:58,190
But basically that error can be expressed, and then what will happen with that error is it will be backpropagated
101
00:07:58,190 --> 00:08:03,590
through the network (this is just a visual representation of that point); it just gets backpropagated through
102
00:08:03,590 --> 00:08:10,940
the network, and the weights of the network will be updated in order to better accommodate that, so that
103
00:08:11,000 --> 00:08:15,430
next time this box won't predict a human, and this box won't predict a human,
104
00:08:15,560 --> 00:08:17,310
but these boxes will predict a human.
105
00:08:17,330 --> 00:08:21,770
So it takes many iterations; it's going to happen on and on and on and on, but that's how the training process
106
00:08:21,770 --> 00:08:22,250
goes.
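Here is a generic training-loop sketch of that idea in PyTorch; it is not the actual SSD code, just a stand-in head where per-box class predictions are compared with per-box ground-truth labels, the errors are combined into one loss, and that loss is backpropagated to update the weights.

```python
# Generic training-loop sketch (a stand-in, not SSD itself): per-box class
# scores vs. per-box ground-truth labels, summed into one loss, backpropagated.
import torch
import torch.nn as nn

torch.manual_seed(0)
num_boxes, num_classes, feat_dim = 192, 6, 32

model = nn.Linear(feat_dim, num_classes)            # stand-in detection head
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

features = torch.randn(num_boxes, feat_dim)             # fake per-box features
targets = torch.randint(0, num_classes, (num_boxes,))   # fake per-box labels (0 = background)

for step in range(100):               # "on and on and on": many iterations
    scores = model(features)          # class scores for every box
    loss = loss_fn(scores, targets)   # aggregate per-box classification error
    optimizer.zero_grad()
    loss.backward()                   # backpropagate the error through the network
    optimizer.step()                  # update the weights
```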
107
00:08:22,490 --> 00:08:29,600
And so the important thing to remember here is that, unlike simple convolutional neural networks
108
00:08:30,110 --> 00:08:35,520
where we just want to predict if there's a dog in the image or not, in this case we actually
109
00:08:35,520 --> 00:08:41,700
need to find where on the image there is a person, and for that we need a ground truth.
110
00:08:41,700 --> 00:08:47,580
So it's not sufficient for us for this image itself to be labeled that there is a person or there
111
00:08:47,580 --> 00:08:48,600
is not a person.
112
00:08:48,690 --> 00:08:49,520
No that's not enough.
113
00:08:49,530 --> 00:08:55,590
We have to know exactly where the person is, so we need this box already on the image for the training
114
00:08:55,590 --> 00:08:56,160
process.
115
00:08:56,460 --> 00:09:01,860
And the way to think about it, a very intuitive way of thinking about it, is to take each one of
116
00:09:01,860 --> 00:09:05,150
these boxes as a separate example.
117
00:09:05,340 --> 00:09:11,850
Think of each one of these boxes as a separate example for a convolutional neural network.
118
00:09:11,850 --> 00:09:16,370
In practice, if you think of each one of these boxes as a separate image, that's one way to see it.
119
00:09:16,380 --> 00:09:23,610
If you take each one of these boxes as a separate image, then yes, that kind of is very
120
00:09:23,610 --> 00:09:28,770
similar to a convolutional neural network, because for each one of the boxes you just need to make
121
00:09:28,770 --> 00:09:31,500
sure it predicts that there is a human inside this box.
122
00:09:31,530 --> 00:09:36,840
And you also need to know, for the purposes of training, that indeed there was a human inside as well.
123
00:09:36,870 --> 00:09:39,000
So that's the ground truth in that case.
124
00:09:39,000 --> 00:09:40,180
So that's how it works.
125
00:09:40,180 --> 00:09:46,210
Basically, all of those boxes that we saw plotted around the image:
126
00:09:46,590 --> 00:09:49,730
you can just think of them as separate images in their own right.
127
00:09:49,740 --> 00:09:55,690
And each one is trying to understand: is there a human inside my image, or is there not a human in my
128
00:09:55,720 --> 00:09:57,340
image? And it doesn't care about the rest.
129
00:09:57,690 --> 00:10:03,860
But then the algorithm puts everything together at the end, and it takes the aggregate error, propagates
130
00:10:04,090 --> 00:10:07,190
it through the network, and trains everything up like that.
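To capture that intuition in code, here is a tiny sketch that treats each box as its own yes/no classification and sums the per-box mistakes into one aggregate error; the two callables are hypothetical stand-ins, and a real SSD shares a single forward pass rather than literally cropping images.

```python
# Intuition only (not how SSD is literally implemented): treat every box as its
# own little image. Each crop gets a yes/no "is there a person here?" answer,
# compared against the ground truth for that crop, and the per-crop mistakes
# are summed into one aggregate error for training.

def aggregate_error(boxes, classify_crop, crop_has_person):
    """classify_crop and crop_has_person are stand-in callables returning 0 or 1."""
    errors = 0
    for box in boxes:
        predicted = classify_crop(box)   # what the "separate image" classifier says
        truth = crop_has_person(box)     # known from overlaying the ground-truth box
        errors += int(predicted != truth)
    return errors
```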
131
00:10:07,200 --> 00:10:11,710
So there we go; that's the MultiBox concept of the SSD algorithm.
132
00:10:11,790 --> 00:10:13,870
And we will continue in the next tutorial.
133
00:10:14,010 --> 00:10:15,280
I look forward to seeing you there.
134
00:10:15,280 --> 00:10:17,490
And until then enjoy computer vision.