1
00:00:00,970 --> 00:00:03,410
Hello and welcome back to the course on computer vision.
2
00:00:03,550 --> 00:00:09,070
And today we're talking about the MultiBox concept behind the SSD algorithm.
3
00:00:09,100 --> 00:00:17,290
This is going to be quite a visual tutorial, so there we have Hadelin standing on a pontoon at Lake Bohinj
4
00:00:17,320 --> 00:00:24,280
in Slovenia. Just as a side note, if you're ever in Slovenia or around there, definitely visit Lake Bohinj.
5
00:00:24,400 --> 00:00:29,900
It's a beautiful, magical place. As you can see, it's amazing there.
6
00:00:29,910 --> 00:00:35,380
There are also the Alps behind, and they have a great lookout tower there.
7
00:00:35,490 --> 00:00:40,950
You can actually go up and see a huge part of the Alps from up there.
8
00:00:41,080 --> 00:00:46,210
Very lovely. We were there during our European road trip when we were visiting students all across Europe.
9
00:00:46,540 --> 00:00:48,070
And they have some great stuff there.
10
00:00:48,070 --> 00:00:53,250
They have two big lakes in Slovenia, Lake Bled and Lake Bohinj, and this is Lake Bohinj, the bigger one.
11
00:00:53,260 --> 00:00:57,450
It's a bit further away from the main cities but it is worth the trip.
12
00:00:57,550 --> 00:01:04,180
And so as you can see, we've already got a box here identifying him as a person as he's standing there
13
00:01:05,200 --> 00:01:06,240
pointing.
14
00:01:06,760 --> 00:01:14,710
And what we're going to do today is we're going to take this box as the ground truth, and then we're
15
00:01:14,710 --> 00:01:22,440
going to see how the algorithm works with its own boxes to actually identify Hadelin
16
00:01:22,510 --> 00:01:24,930
and build this exact same box.
17
00:01:25,090 --> 00:01:27,130
And first things first.
18
00:01:27,130 --> 00:01:28,390
What is the ground truth?
19
00:01:28,420 --> 00:01:36,310
Well, ground truth is a concept that's not just used in computing; it's used in many places, and it separates
20
00:01:36,940 --> 00:01:42,240
the observed evidence or empirical evidence from inferred evidence.
21
00:01:42,340 --> 00:01:49,330
So for instance, when you go and actually look at something, you see it, and you can physically
22
00:01:49,330 --> 00:01:53,510
confirm that this is the true state of things, that's your ground truth there.
23
00:01:53,600 --> 00:01:55,000
There are no two ways about it.
24
00:01:55,020 --> 00:02:01,400
There are no gray areas; you can see exactly that this is the box that should be drawn.
25
00:02:01,810 --> 00:02:09,220
Whereas when you apply an algorithm and it goes and processes all of this data through its convolutional
26
00:02:09,220 --> 00:02:13,180
neural networks and so on, and comes back with boxes, those are not going to be ground truth boxes; those
27
00:02:13,180 --> 00:02:17,860
are going to be inferred boxes, suggestions brought up by our algorithm. Even if they're
28
00:02:17,860 --> 00:02:23,260
correct, they're still merely suggestions, and in order to make sure that they're correct, we need to have
29
00:02:23,260 --> 00:02:24,260
the ground truth.
30
00:02:24,340 --> 00:02:33,760
And that's why in the process of training these networks, the SSD in particular, we need the ground
31
00:02:33,780 --> 00:02:34,020
truth.
32
00:02:34,030 --> 00:02:40,420
So we need these boxes to be identified; we cannot train the algorithm
33
00:02:40,420 --> 00:02:45,940
if these boxes are not present in the first place. That's something important to keep in mind for the
34
00:02:46,000 --> 00:02:47,120
training process.
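To make that concrete, here is a minimal sketch of what one labeled training example could look like; the file name, coordinates and class name are made up purely for illustration and are not from the actual dataset used here.

```python
# Minimal sketch: a detection training example needs the image plus
# ground-truth boxes, not just an image-level label.
# All values below are hypothetical.
training_example = {
    "image_path": "lake_bohinj.jpg",                      # hypothetical file name
    "ground_truth": [
        {"class": "person", "box": (412, 96, 516, 388)},  # (xmin, ymin, xmax, ymax) in pixels
    ],
}

# A plain classification dataset would only store something like
# {"image_path": ..., "label": "person"}, which is not enough here:
# the detector must also learn *where* the person is.
```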
35
00:02:47,990 --> 00:02:50,850
And yeah, so there is Hadelin.
36
00:02:50,870 --> 00:02:57,020
Let's see how the algorithm will work and build its boxes and how the training will happen.
37
00:02:57,020 --> 00:03:05,420
So what the SSD will do is break down the image into segments, and then for every segment it
38
00:03:05,480 --> 00:03:12,710
will construct several boxes: a box like that, one that looks like that, and a box like that, for
39
00:03:12,980 --> 00:03:16,940
instance. The exact setup can vary.
40
00:03:16,940 --> 00:03:22,790
But we're going to stick to this one, the most generic one, and basically the whole image will
41
00:03:22,790 --> 00:03:24,840
get covered with these boxes.
42
00:03:24,860 --> 00:03:27,600
As you can see it's covered entirely.
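As a rough sketch of how such boxes can cover the whole image, here is one way to tile a picture with default boxes on a coarse grid; the grid size and aspect ratios are illustrative choices, not the exact SSD configuration.

```python
# Minimal sketch of tiling an image with default boxes: for every cell of a
# coarse grid we place a few boxes with different aspect ratios.
# Grid size and ratios below are illustrative, not the SSD paper's values.

def default_boxes(img_w, img_h, grid=8, aspect_ratios=(1.0, 2.0, 0.5)):
    boxes = []
    cell_w, cell_h = img_w / grid, img_h / grid
    for i in range(grid):
        for j in range(grid):
            cx, cy = (j + 0.5) * cell_w, (i + 0.5) * cell_h  # cell centre
            for ar in aspect_ratios:
                w = cell_w * (ar ** 0.5)
                h = cell_h / (ar ** 0.5)
                boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

print(len(default_boxes(640, 480)))  # 8 * 8 * 3 = 192 boxes covering the image
```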
43
00:03:27,800 --> 00:03:36,590
And then for every single one of these boxes it's going to ask the question: is there an object in there
44
00:03:36,620 --> 00:03:45,140
or not? And not just any object; it's going to ask that question for every single class of objects that it's trained
45
00:03:45,140 --> 00:03:49,860
for. Or, in the training process, it is going to ask that question for every single class of objects that
46
00:03:49,860 --> 00:03:51,540
it is training for.
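A minimal sketch of that "one question per class, per box" idea, assuming the network produces one raw score per class for each box; the class list and the scores below are invented for illustration.

```python
# Minimal sketch: for every default box, the network outputs one score per class
# (plus a "background"/no-object class); a softmax turns them into probabilities.
import math

CLASSES = ["background", "person", "car", "boat", "cat", "dog"]

def class_probabilities(raw_scores):
    exps = [math.exp(s) for s in raw_scores]
    total = sum(exps)
    return {c: e / total for c, e in zip(CLASSES, exps)}

# One box's hypothetical raw scores: it is fairly confident it sees a person.
print(class_probabilities([0.1, 2.3, 0.4, 0.2, 0.0, 0.1]))
```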
47
00:03:52,100 --> 00:04:03,590
And so basically in this case it will identify that none of these boxes except for the ones highlighted
48
00:04:03,590 --> 00:04:10,430
in red actually have an object of interest. And we're going to assume that in this case, for simplicity's
49
00:04:10,430 --> 00:04:12,010
sake, we're only looking for people.
50
00:04:12,080 --> 00:04:15,780
There are people in the background there but they're very small.
51
00:04:15,790 --> 00:04:22,060
And the algorithm will not be able to pick up their features and therefore will not detect those people.
52
00:04:22,070 --> 00:04:27,090
We'll talk about size and scale further down, in another tutorial in this section. For now we're
53
00:04:27,090 --> 00:04:31,880
going to focus in on Hadelin, and we can see that some of the boxes have picked up some features of a
54
00:04:31,880 --> 00:04:34,620
human being in that current form.
55
00:04:34,790 --> 00:04:40,310
So if we remove all that, let's just look at one box, just so that it's clear why, for instance, this box
56
00:04:40,550 --> 00:04:48,350
is able to pick up that there is a person within this box. And it doesn't have to be a full person.
57
00:04:48,350 --> 00:04:52,910
This is what I wanted to point out here: it doesn't have to be the full person or object. As long as it can
58
00:04:52,910 --> 00:04:59,930
find a sufficient number of features, it will produce a probability; it will say that with a likelihood
59
00:04:59,930 --> 00:05:02,470
of 80 percent or 70 percent.
60
00:05:02,540 --> 00:05:09,120
what I have in this box is a human being, or a car, or a boat, or a cat, or a dog.
61
00:05:09,230 --> 00:05:14,820
So in this case it sees enough features to say that there is a human or a person inside this box.
62
00:05:15,180 --> 00:05:22,260
And then all of these boxes that we have identified here can see almost the same features, or
63
00:05:22,260 --> 00:05:29,360
very different features, but nevertheless they all have enough to identify him. Except the green one;
64
00:05:29,360 --> 00:05:32,180
you can see that green one, for instance.
65
00:05:32,440 --> 00:05:35,660
It only sees from the elbow downwards.
66
00:05:35,690 --> 00:05:42,260
So just for the sake of this tutorial, I decided to assume that it's not seeing enough features, although,
67
00:05:42,260 --> 00:05:46,770
as you'll see from the practical side of things, that would actually be enough features for the
68
00:05:46,770 --> 00:05:49,120
SSD algorithm to detect a human being.
69
00:05:49,190 --> 00:05:55,400
But just for our case, I wanted to show you that these three boxes don't always agree; as you can see,
70
00:05:55,400 --> 00:05:56,890
the boxes come in threes.
71
00:05:56,900 --> 00:06:00,070
In our case, they do not always tally.
72
00:06:00,140 --> 00:06:03,420
So if one of them picks up a person, that doesn't mean all of them do.
73
00:06:03,500 --> 00:06:05,150
So in this case, that one would be dropped.
74
00:06:05,230 --> 00:06:09,260
And so we have not six but five boxes that have detected the person.
75
00:06:09,830 --> 00:06:14,900
And so that is the ideal case and that's what we're striving towards.
76
00:06:14,900 --> 00:06:17,050
That's what we want to train the network to do.
77
00:06:17,240 --> 00:06:19,730
And the way it will happen is this: here's our network.
78
00:06:20,000 --> 00:06:27,520
And every time, all of these boxes predict something and we'll have an output, so I'll bring them out here;
79
00:06:27,530 --> 00:06:31,490
for instance, they'll look at these small areas.
80
00:06:31,490 --> 00:06:32,720
We'll talk about those later.
81
00:06:32,720 --> 00:06:38,670
For now we're dealing with this layer over here, and so we've got all these boxes and they all predict something.
82
00:06:38,810 --> 00:06:46,340
And so then the neural network searches for features and predicts that, for instance, in these boxes
83
00:06:46,430 --> 00:06:48,410
and in these boxes there's a human; that's what we want.
84
00:06:48,410 --> 00:06:53,210
But for example, if it had predicted that in these boxes there is a human, or in these boxes a human,
85
00:06:53,450 --> 00:06:58,040
and then on the other hand it predicted that some of these boxes don't have a human, at the end we would
86
00:06:58,040 --> 00:06:58,900
have an error.
87
00:06:58,910 --> 00:07:02,250
How would we know there is an error? Because we actually have the ground truth.
88
00:07:02,300 --> 00:07:07,940
We have the ground truth with this white box, and by overlaying these images, the
89
00:07:07,940 --> 00:07:13,250
algorithm knows exactly which boxes should say: yes, there is a human.
90
00:07:13,250 --> 00:07:14,570
And which boxes should say no.
91
00:07:14,570 --> 00:07:20,960
So it really knows. What we actually did here is: the red boxes are the ones that ideally
92
00:07:21,020 --> 00:07:26,510
should say there's a human, and we want them to say human, and all other boxes, which we've deleted, should
93
00:07:26,510 --> 00:07:28,760
say there is no human, there is no person.
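One common way detectors of this kind decide which boxes "should say yes" is to overlay each box on the ground-truth box and measure the overlap with intersection-over-union; the sketch below assumes that convention and a typical 0.5 threshold, neither of which is spelled out in this tutorial.

```python
# Minimal sketch: decide which boxes "should say there is a person" by
# overlapping them with the ground-truth box and thresholding the IoU.
# Boxes are (xmin, ymin, xmax, ymax); 0.5 is a common (assumed) threshold.

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def should_say_person(default_box, ground_truth_box, threshold=0.5):
    return iou(default_box, ground_truth_box) >= threshold
```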
94
00:07:28,970 --> 00:07:32,090
And so unless that is the case there will be an error.
95
00:07:32,180 --> 00:07:36,830
So if, say for instance, a box over here says that there is a human, and a box over there says that there is a
96
00:07:36,830 --> 00:07:41,110
human, and one or a couple of these boxes don't pick up the human, there'll be an error.
97
00:07:41,330 --> 00:07:46,990
And at the end of the algorithm, that error can be calculated and expressed mathematically. We
98
00:07:47,020 --> 00:07:52,670
won't get into that here; you can find more information in the annex on convolutional neural
99
00:07:52,670 --> 00:07:53,160
networks.
100
00:07:53,300 --> 00:07:58,190
But basically that error can be expressed, and then what will happen with that error is it will be backpropagated
101
00:07:58,190 --> 00:08:03,590
through the network (this is just a visual representation of that point); it just gets backpropagated through
102
00:08:03,590 --> 00:08:10,940
the network, and the weights of the network will be updated in order to better accommodate that, so that
103
00:08:11,000 --> 00:08:15,430
next time this box won't predict a human, and this box won't predict a human,
104
00:08:15,560 --> 00:08:17,310
but these boxes will predict a human.
105
00:08:17,330 --> 00:08:21,770
So it takes many iterations; it's going to happen on and on and on and on, but that's how the training process
106
00:08:21,770 --> 00:08:22,250
goes.
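Here is a generic training-loop sketch of that idea in PyTorch; it is not the actual SSD code, just a stand-in head where per-box class predictions are compared with per-box ground-truth labels, the errors are combined into one loss, and that loss is backpropagated to update the weights.

```python
# Generic training-loop sketch (a stand-in, not SSD itself): per-box class
# scores vs. per-box ground-truth labels, summed into one loss, backpropagated.
import torch
import torch.nn as nn

torch.manual_seed(0)
num_boxes, num_classes, feat_dim = 192, 6, 32

model = nn.Linear(feat_dim, num_classes)            # stand-in detection head
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

features = torch.randn(num_boxes, feat_dim)             # fake per-box features
targets = torch.randint(0, num_classes, (num_boxes,))   # fake per-box labels (0 = background)

for step in range(100):               # "on and on and on": many iterations
    scores = model(features)          # class scores for every box
    loss = loss_fn(scores, targets)   # aggregate per-box classification error
    optimizer.zero_grad()
    loss.backward()                   # backpropagate the error through the network
    optimizer.step()                  # update the weights
```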
107
00:08:22,490 --> 00:08:29,600
And so the important thing to remember here is that, unlike simple convolutional neural networks
108
00:08:30,110 --> 00:08:35,520
where we just want to predict if there's a dog in the image or not, in this case we actually
109
00:08:35,520 --> 00:08:41,700
need to find where on the image there is a person, and for that we need a ground truth.
110
00:08:41,700 --> 00:08:47,580
So it's not sufficient for us for this image itself to be labeled that there is a person or there
111
00:08:47,580 --> 00:08:48,600
is not a person.
112
00:08:48,690 --> 00:08:49,520
No that's not enough.
113
00:08:49,530 --> 00:08:55,590
We have to know exactly where the person is, so we need this box already on the image for the training
114
00:08:55,590 --> 00:08:56,160
process.
115
00:08:56,460 --> 00:09:01,860
And the way to think about it, a very intuitive way of thinking about it, is to take each one of
116
00:09:01,860 --> 00:09:05,150
these boxes as a separate example.
117
00:09:05,340 --> 00:09:11,850
Think of each one of these boxes as a separate example for a convolutional neural network.
118
00:09:11,850 --> 00:09:16,370
In practice, if you think of each one of these boxes as a separate image, that's one way to see it.
119
00:09:16,380 --> 00:09:23,610
If you take each one of these boxes as a separate image, then yes, that kind of is very
120
00:09:23,610 --> 00:09:28,770
similar to a convolutional neural network, because for each one of the boxes you just need to make
121
00:09:28,770 --> 00:09:31,500
sure it predicts that there is a human inside this box.
122
00:09:31,530 --> 00:09:36,840
And you also need to know, for the purposes of training, that indeed there was a human inside as well.
123
00:09:36,870 --> 00:09:39,000
So that's the ground truth in that case.
124
00:09:39,000 --> 00:09:40,180
So that's how it works.
125
00:09:40,180 --> 00:09:46,210
Basically, all of those boxes that we saw plotted around the image:
126
00:09:46,590 --> 00:09:49,730
you can just think of them as separate images in their own right.
127
00:09:49,740 --> 00:09:55,690
And each one is trying to understand: is there a human inside my image, or is there not a human in my
128
00:09:55,720 --> 00:09:57,340
image? And it doesn't care about the rest.
129
00:09:57,690 --> 00:10:03,860
But then the algorithm puts everything together at the end, and it takes the aggregate error, propagates
130
00:10:04,090 --> 00:10:07,190
it through the network, and trains everything up like that.
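To capture that intuition in code, here is a tiny sketch that treats each box as its own yes/no classification and sums the per-box mistakes into one aggregate error; the two callables are hypothetical stand-ins, and a real SSD shares a single forward pass rather than literally cropping images.

```python
# Intuition only (not how SSD is literally implemented): treat every box as its
# own little image. Each crop gets a yes/no "is there a person here?" answer,
# compared against the ground truth for that crop, and the per-crop mistakes
# are summed into one aggregate error for training.

def aggregate_error(boxes, classify_crop, crop_has_person):
    """classify_crop and crop_has_person are stand-in callables returning 0 or 1."""
    errors = 0
    for box in boxes:
        predicted = classify_crop(box)   # what the "separate image" classifier says
        truth = crop_has_person(box)     # known from overlaying the ground-truth box
        errors += int(predicted != truth)
    return errors
```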
131
00:10:07,200 --> 00:10:11,710
So there we go; that's the MultiBox concept of the SSD algorithm.
132
00:10:11,790 --> 00:10:13,870
And we will continue in the next tutorial.
133
00:10:14,010 --> 00:10:15,280
I look forward to seeing you there.
134
00:10:15,280 --> 00:10:17,490
And until then enjoy computer vision.