All language subtitles for 024 The Multi-Box Concept-en

af Afrikaans
ak Akan
sq Albanian
am Amharic
ar Arabic
hy Armenian
az Azerbaijani
eu Basque
be Belarusian
bem Bemba
bn Bengali
bh Bihari
bs Bosnian
br Breton
bg Bulgarian
km Cambodian
ca Catalan
ceb Cebuano
chr Cherokee
ny Chichewa
zh-CN Chinese (Simplified)
zh-TW Chinese (Traditional)
co Corsican
hr Croatian
cs Czech
da Danish
nl Dutch
en English
eo Esperanto
et Estonian
ee Ewe
fo Faroese
tl Filipino
fi Finnish
fr French
fy Frisian
gaa Ga
gl Galician
ka Georgian
de German
el Greek
gn Guarani
gu Gujarati
ht Haitian Creole
ha Hausa
haw Hawaiian
iw Hebrew
hi Hindi
hmn Hmong
hu Hungarian
is Icelandic
ig Igbo
id Indonesian
ia Interlingua
ga Irish
it Italian
ja Japanese
jw Javanese
kn Kannada
kk Kazakh
rw Kinyarwanda
rn Kirundi
kg Kongo
ko Korean
kri Krio (Sierra Leone)
ku Kurdish
ckb Kurdish (Soranî)
ky Kyrgyz
lo Laothian
la Latin
lv Latvian
ln Lingala
lt Lithuanian
loz Lozi
lg Luganda
ach Luo
lb Luxembourgish
mk Macedonian
mg Malagasy
ms Malay
ml Malayalam
mt Maltese
mi Maori
mr Marathi
mfe Mauritian Creole
mo Moldavian
mn Mongolian
my Myanmar (Burmese)
sr-ME Montenegrin
ne Nepali
pcm Nigerian Pidgin
nso Northern Sotho
no Norwegian
nn Norwegian (Nynorsk)
oc Occitan
or Oriya
om Oromo
ps Pashto
fa Persian
pl Polish
pt-BR Portuguese (Brazil)
pt Portuguese (Portugal)
pa Punjabi
qu Quechua
ro Romanian
rm Romansh
nyn Runyakitara
ru Russian
sm Samoan
gd Scots Gaelic
sr Serbian
sh Serbo-Croatian
st Sesotho
tn Setswana
crs Seychellois Creole
sn Shona
sd Sindhi
si Sinhalese
sk Slovak
sl Slovenian
so Somali
es Spanish
es-419 Spanish (Latin American)
su Sundanese
sw Swahili
sv Swedish
tg Tajik
ta Tamil
tt Tatar
te Telugu
th Thai
ti Tigrinya
to Tonga
lua Tshiluba
tum Tumbuka
tr Turkish
tk Turkmen
tw Twi
ug Uighur
uk Ukrainian
ur Urdu
uz Uzbek
vi Vietnamese Download
cy Welsh
wo Wolof
xh Xhosa
yi Yiddish
yo Yoruba
zu Zulu
Would you like to inspect the original subtitles? These are the user uploaded subtitles that are being translated: 1 00:00:00,970 --> 00:00:03,410 Hello and welcome actually a course on computer vision. 2 00:00:03,550 --> 00:00:09,070 And today we're talking about the multi box concept behind the as the algorithm. 3 00:00:09,100 --> 00:00:17,290 This is going to be quite a visual tutorial so there we have Avalon standing on a pontoon at Lake Bonynge 4 00:00:17,320 --> 00:00:24,280 in Slovenia just as a side note if you're ever in Sylvania or around there definitely visit lake behind. 5 00:00:24,400 --> 00:00:29,900 It's a beautiful magical place as you can see it's it's amazing there. 6 00:00:29,910 --> 00:00:35,380 There's also those the Alps behind you is really great to have a lookout tower there. 7 00:00:35,490 --> 00:00:40,950 You can actually go up and see all that like a huge part of Alps from up there. 8 00:00:41,080 --> 00:00:46,210 Very lovely We were there during our European road trip when we are visiting students all across Europe. 9 00:00:46,540 --> 00:00:48,070 And is there any great stuff. 10 00:00:48,070 --> 00:00:53,250 They have two big lakes and Savannah Lake Bled and Lake voyage and this is like Boeing's bigger one. 11 00:00:53,260 --> 00:00:57,450 It's a bit further away from the main cities but it is worth the trip. 12 00:00:57,550 --> 00:01:04,180 And so as you can see we're very guilty posture on identifying him as a person that he's standing there 13 00:01:05,200 --> 00:01:06,240 pointing. 14 00:01:06,760 --> 00:01:14,710 And what we're going to do today is we're going to take this pork's as the ground troops and then we're 15 00:01:14,710 --> 00:01:22,440 going to see how the the Allegan works with its own boxes to understand to actually identify ODLAND 16 00:01:22,510 --> 00:01:24,930 and build this same exact box. 17 00:01:25,090 --> 00:01:27,130 And first things first. 18 00:01:27,130 --> 00:01:28,390 What is the ground truth. 19 00:01:28,420 --> 00:01:36,310 Well a ground truth is a concept that's not just using computers is used in many places and it separates 20 00:01:36,940 --> 00:01:42,240 the observed evidence or empirical evidence from inferred evidence. 21 00:01:42,340 --> 00:01:49,330 So for instance when you go and actually look at something and you see something and you can physically 22 00:01:49,330 --> 00:01:53,510 understand that this is a true state of things that's your ground truth there. 23 00:01:53,600 --> 00:01:55,000 There is no two ways about it. 24 00:01:55,020 --> 00:02:01,400 No there's no gray areas as you can see exactly that this is the box that should be wrong. 25 00:02:01,810 --> 00:02:09,220 Whereas when you apply an algorithm and it goes processes all of this data through its convolutional 26 00:02:09,220 --> 00:02:13,180 neural networks and so on and comes back with boxes those are not going to be ground truth boxes those 27 00:02:13,180 --> 00:02:17,860 are going to be inferred boxes those are going to be suggestion brought up by our group even if they're 28 00:02:17,860 --> 00:02:23,260 correct they're still merely suggestions and in order to make sure that they're correct we need to have 29 00:02:23,260 --> 00:02:24,260 the ground truth. 30 00:02:24,340 --> 00:02:33,760 And that's why in the process of training these networks the D-Fla. in particular we need the ground 31 00:02:33,780 --> 00:02:34,020 true. 32 00:02:34,030 --> 00:02:40,420 So we need these boxes to be identified we cannot just train the algorithm. 33 00:02:40,420 --> 00:02:45,940 If these boxes are not present in the first place that's something important to keep in mind for that 34 00:02:46,000 --> 00:02:47,120 training process. 35 00:02:47,990 --> 00:02:50,850 And yeah so there is a lull. 36 00:02:50,870 --> 00:02:57,020 Let's see how the algorithm will work and build its boxes and how the training will happen. 37 00:02:57,020 --> 00:03:05,420 So what this is they will do is will break down the image into segments and then for every segment it 38 00:03:05,480 --> 00:03:12,710 will construct several boxes will construct the box like that looks like that and a box like that for 39 00:03:12,980 --> 00:03:16,940 for instance so these exact exact set up can vary. 40 00:03:16,940 --> 00:03:22,790 But we're going to stick to this one this is the most generic one and basically the whole image will 41 00:03:22,790 --> 00:03:24,840 get covered with these boxes. 42 00:03:24,860 --> 00:03:27,600 As you can see it's covered entirely. 43 00:03:27,800 --> 00:03:36,590 And then for every single one of these boxes is going to ask the question is there an object in there 44 00:03:36,620 --> 00:03:45,140 or not and not just an object is to ask that question every single class of objects that it's it's trained 45 00:03:45,140 --> 00:03:49,860 for or in the training processes that is going to ask that question for every single class of all that 46 00:03:49,860 --> 00:03:51,540 it is training for. 47 00:03:52,100 --> 00:04:03,590 And so basically in this case it will identify that none of these boxes except for the ones highlighted 48 00:04:03,590 --> 00:04:10,430 in red actually have an object of interest and then we're going to assume that in this case for simplicity's 49 00:04:10,430 --> 00:04:12,010 sake they we're only looking for people. 50 00:04:12,080 --> 00:04:15,780 There are people in the background there but they're very small. 51 00:04:15,790 --> 00:04:22,060 And the algorithm will not be able to pick up their features and therefore will not detect those people. 52 00:04:22,070 --> 00:04:27,090 I will talk about size and scale further down and inch and others Tauriel in the section for now which 53 00:04:27,090 --> 00:04:31,880 is going to focus in on Lot and we can see that some of the boxes have picked up some features of a 54 00:04:31,880 --> 00:04:34,620 human being in that current form. 55 00:04:34,790 --> 00:04:40,310 So if we remove all that let's just look at one box just so that it's clear why for instance this box 56 00:04:40,550 --> 00:04:48,350 is able to pick up that there is a person within this box or it doesn't have to be a full person. 57 00:04:48,350 --> 00:04:52,910 This is what I wanted to point out here doesn't have to be a fool personal object as long as it can 58 00:04:52,910 --> 00:04:59,930 find a sufficient number of features it will produce a probability it will say that with a likelihood 59 00:04:59,930 --> 00:05:02,470 of 80 percent or 70 percent. 60 00:05:02,540 --> 00:05:09,120 What I have in this box is the human being or as a car or as a both or as a cat or a dog. 61 00:05:09,230 --> 00:05:14,820 So in this case it sees enough features to say that there is a human or a person inside this box. 62 00:05:15,180 --> 00:05:22,260 And then all of these ones that we have identified here can see the same almost the same features and 63 00:05:22,260 --> 00:05:29,360 they're very different features but nevertheless they all have enough identify each with a green one 64 00:05:29,360 --> 00:05:32,180 you can see that that green one for instance. 65 00:05:32,440 --> 00:05:35,660 It can it only sees from the elbow downwards. 66 00:05:35,690 --> 00:05:42,260 So just for the sake of this tutorial I decided to assume that it's not seeing enough features although 67 00:05:42,260 --> 00:05:46,770 as you'll see from the practical side of things that is actually that would be enough features for the 68 00:05:46,770 --> 00:05:49,120 S's algorithm to detect a human being. 69 00:05:49,190 --> 00:05:55,400 But just for our case I just wanted to show you that not always these three boxes like you can see that 70 00:05:55,400 --> 00:05:56,890 the boxes come in threes. 71 00:05:56,900 --> 00:06:00,070 In our case that does not always tally. 72 00:06:00,140 --> 00:06:03,420 So if one of them takes a person that doesn't mean all the time. 73 00:06:03,500 --> 00:06:05,150 So in this case that would be draw. 74 00:06:05,230 --> 00:06:09,260 And so far we have not six but five officers that have to take that approach. 75 00:06:09,830 --> 00:06:14,900 And so that is the ideal case and that's what we're striving towards. 76 00:06:14,900 --> 00:06:17,050 That's what we want to train the next hope to do. 77 00:06:17,240 --> 00:06:19,730 And the way it will happen is here's our network. 78 00:06:20,000 --> 00:06:27,520 And every time all of these boxes predict something we'll have an output so I'll bring them out here 79 00:06:27,530 --> 00:06:31,490 so for instance they'll look at these small areas. 80 00:06:31,490 --> 00:06:32,720 For now we'll talk about them later. 81 00:06:32,720 --> 00:06:38,670 We're dealing with this layer over here and so we've got all these boxes and they all predict something. 82 00:06:38,810 --> 00:06:46,340 And so then the on neural network searches for features that predicts that for instance in these boxes 83 00:06:46,430 --> 00:06:48,410 in these boxes that there's a human that's what we want. 84 00:06:48,410 --> 00:06:53,210 But for example if it had predicted that there is in these boxes as a human or in these boxes a human 85 00:06:53,450 --> 00:06:58,040 and then on the other hand it predicted that some of these boxes don't have a human at the end would 86 00:06:58,040 --> 00:06:58,900 have an error. 87 00:06:58,910 --> 00:07:02,250 How would we have this error because we actually have the ground truth. 88 00:07:02,300 --> 00:07:07,940 We have the ground truth with this white box and we would actually know by overlaying these images the 89 00:07:07,940 --> 00:07:13,250 algorithm knows exactly which boxes should say yes there is a human. 90 00:07:13,250 --> 00:07:14,570 And which boxes should say no. 91 00:07:14,570 --> 00:07:20,960 So it really knows the way what we did actually here is with the red boxes are the ones that I ideally 92 00:07:21,020 --> 00:07:26,510 should say there's a human and we want them to say human and all other boxes which we've deleted should 93 00:07:26,510 --> 00:07:28,760 say there is no human there is no person. 94 00:07:28,970 --> 00:07:32,090 And so unless that is the case there will be an error. 95 00:07:32,180 --> 00:07:36,830 So if say for instance a box or here says that there is a human and a box or a says that there is a 96 00:07:36,830 --> 00:07:41,110 human and one or a couple of these boxes don't pick up the humans there'll be an error. 97 00:07:41,330 --> 00:07:46,990 And at the end of the algorithm that error can be calculated and can be expressed mathematically we 98 00:07:47,020 --> 00:07:52,670 don't get into that that you can find more information on that in the annex on convolutional neural 99 00:07:52,670 --> 00:07:53,160 networks. 100 00:07:53,300 --> 00:07:58,190 But basically that area can be expressed and then what will happen with that error is will be back propagated 101 00:07:58,190 --> 00:08:03,590 through the network as is just a visual representation of the to this point just back propagate through 102 00:08:03,590 --> 00:08:10,940 a network and the weights of the network will be updated in order to better accommodate that so that 103 00:08:11,000 --> 00:08:15,430 next time this box won't predict a human and this box will predict a human. 104 00:08:15,560 --> 00:08:17,310 But these boxes will predict the. 105 00:08:17,330 --> 00:08:21,770 So it takes many iterations is going to happen on and on and on and on but that's how the training process 106 00:08:21,770 --> 00:08:22,250 goes. 107 00:08:22,490 --> 00:08:29,600 And so the important thing here to remember is that we unlike just simple convolutional neural networks 108 00:08:30,110 --> 00:08:35,520 where we just want to predict if there's a dog on the if there's not a dog in this case we actually 109 00:08:35,520 --> 00:08:41,700 need to find where on the image there is a person and so that we need a ground truth. 110 00:08:41,700 --> 00:08:47,580 So it's not sufficient for us as for this image itself to be labeled that there is a person or there 111 00:08:47,580 --> 00:08:48,600 is not a person. 112 00:08:48,690 --> 00:08:49,520 No that's not enough. 113 00:08:49,530 --> 00:08:55,590 We have to know exactly where the person is so we need this box already on the image for the training 114 00:08:55,590 --> 00:08:56,160 process. 115 00:08:56,460 --> 00:09:01,860 And the way to think about it kind of a very intuitive way of thinking about it is take each one of 116 00:09:01,860 --> 00:09:05,150 these boxes as a separate. 117 00:09:05,340 --> 00:09:11,850 They think of each one of these boxes as a separate example of the convolutional neural network. 118 00:09:11,850 --> 00:09:16,370 In practice if you think of each one of these boxes as a separate image that's one. 119 00:09:16,380 --> 00:09:23,610 If you take each one of these boxes as a separate image then yes then you that that kind of is very 120 00:09:23,610 --> 00:09:28,770 similar to a convolutional in your own codes because for each one of the boxes you just need to make 121 00:09:28,770 --> 00:09:31,500 sure it predicts that there is a human inside this box. 122 00:09:31,530 --> 00:09:36,840 And you also need to know for the purpose of this that indeed there was a human side as well. 123 00:09:36,870 --> 00:09:39,000 So that's the ground truth in that case. 124 00:09:39,000 --> 00:09:40,180 So that's how it works. 125 00:09:40,180 --> 00:09:46,210 Like basically all of those boxes that we saw around that were built plotted around the image. 126 00:09:46,590 --> 00:09:49,730 You just can think of them as separate images in their own right. 127 00:09:49,740 --> 00:09:55,690 And each one is trying to understand is there a human inside my image or is it or not it's human an 128 00:09:55,720 --> 00:09:57,340 image and doesn't care about the rest. 129 00:09:57,690 --> 00:10:03,860 But then the algorithm puts everything together on the hands and it takes the aggregate error propagates 130 00:10:04,090 --> 00:10:07,190 and network and trains everything up like that. 131 00:10:07,200 --> 00:10:11,710 So there we go that's the multi box concept of the essence the algorithm. 132 00:10:11,790 --> 00:10:13,870 And we will continue that etc.. 133 00:10:14,010 --> 00:10:15,280 I look forward to seeing you there. 134 00:10:15,280 --> 00:10:17,490 And until then enjoy computer vision. 15458

Can't find what you're looking for?
Get subtitles in any language from opensubtitles.com, and translate them here.