Would you like to inspect the original subtitles? These are the user uploaded subtitles that are being translated:
1
00:00:00,650 --> 00:00:05,690
Hello and welcome back to the course on a I I in the previous part we talked about the deep learning
2
00:00:05,750 --> 00:00:08,360
Killary intuition we started there.
3
00:00:08,360 --> 00:00:14,900
And in fact we actually got all the way to this part and where we talked about learning and now we're
4
00:00:14,900 --> 00:00:18,200
going to move on to the actual acting part.
5
00:00:18,200 --> 00:00:22,250
So there's there's two parts to distinct parts that we have to remember.
6
00:00:22,250 --> 00:00:25,520
So that's the learning part but now he actually he's done all of this.
7
00:00:25,520 --> 00:00:26,390
That's beautiful.
8
00:00:26,390 --> 00:00:30,500
Now he actually has to take an action he has to decide what is he going to do is going to do action
9
00:00:30,500 --> 00:00:31,710
one two three or four.
10
00:00:31,740 --> 00:00:32,860
And so how does he do that.
11
00:00:33,020 --> 00:00:39,370
Well the way he does it is now given those same values so the values don't change after we've we have
12
00:00:39,370 --> 00:00:43,430
these values of compare them with Calcott the last two by arrogated era we've updated the weights but
13
00:00:43,430 --> 00:00:45,950
the values don't change in that whole process.
14
00:00:45,990 --> 00:00:47,410
To have got the cube values there.
15
00:00:47,430 --> 00:00:48,380
They're fixed.
16
00:00:48,380 --> 00:00:49,440
We know what they are.
17
00:00:49,440 --> 00:00:50,480
All this happens though.
18
00:00:50,510 --> 00:00:53,820
Networks updated and out using those same values that we had.
19
00:00:53,960 --> 00:00:58,600
What we're going to do is we're going to parse them through a soft max function.
20
00:00:58,610 --> 00:01:00,580
And again soft Max as described.
21
00:01:00,620 --> 00:01:05,160
I think an annex 2 and we'll talk a bit more about soft max.
22
00:01:05,180 --> 00:01:12,070
Further down in or we'll talk about this action selection policy further down in the rest of this section.
23
00:01:12,140 --> 00:01:13,610
So just in a few tutorials.
24
00:01:13,730 --> 00:01:17,270
But for now we're just going to say we're passing it through a soft next function.
25
00:01:17,270 --> 00:01:22,150
Basically what it does is it allows it helps select the best one it selects the best action possible.
26
00:01:22,250 --> 00:01:23,650
And there's a small caveat to that.
27
00:01:23,660 --> 00:01:26,120
It's not just the best one possible.
28
00:01:26,120 --> 00:01:28,940
We'll talk about that in the action selection policy tutorial.
29
00:01:28,940 --> 00:01:35,890
But for now let's just say it selects the best action from here it says OK so Q1 you know the likelihood.
30
00:01:36,140 --> 00:01:41,960
Basically we know that q values predicted the Q value so it can look at them and say OK so the highest
31
00:01:41,960 --> 00:01:46,280
Q value of these just as we did in the simple Q learning algorithm.
32
00:01:46,280 --> 00:01:50,240
Ill just look at all these for say the highest values this one I'm going to select that action we're
33
00:01:50,240 --> 00:01:50,860
going to take those.
34
00:01:50,900 --> 00:01:52,180
And that's pretty much it.
35
00:01:52,220 --> 00:01:57,300
That's how he chooses which action take takes takes action and then all of this process happens again.
36
00:01:57,290 --> 00:02:02,120
For for the next stage the agent ends up in in our case and the next square of the maze.
37
00:02:02,120 --> 00:02:04,540
But generally speaking in the next state.
38
00:02:04,640 --> 00:02:05,420
So there we go.
39
00:02:05,420 --> 00:02:14,660
That's how we feed in a reinforcement learning problem into a neural network through a vector describing
40
00:02:14,660 --> 00:02:16,160
the state that we're in.
41
00:02:16,160 --> 00:02:17,510
And once we fit it.
42
00:02:17,510 --> 00:02:22,210
There's two parts of the process that happen Part one is the learning.
43
00:02:22,400 --> 00:02:26,840
So remember that part where we compare each of the cube values with the target and then we back propagate
44
00:02:26,840 --> 00:02:32,360
the loss through the network to update the weights so that our network is learning as we go through
45
00:02:32,360 --> 00:02:34,830
this maze or through this environment.
46
00:02:35,210 --> 00:02:41,120
And also the second part is of course we have to act we have to select an action and that is where we
47
00:02:41,120 --> 00:02:46,880
pass the values through a soft max function and or basically an action selection policy which we'll
48
00:02:46,880 --> 00:02:48,330
talk about further down.
49
00:02:48,470 --> 00:02:53,570
And then we simply select the action that we want to take and we perform that action and then this whole
50
00:02:53,570 --> 00:02:54,580
process starts again.
51
00:02:54,770 --> 00:02:59,570
And then maybe the agent gets then maybe the agent doesn't pausa the game.
52
00:02:59,630 --> 00:03:01,250
In any case the game ends.
53
00:03:01,250 --> 00:03:08,270
And then once again the whole process repeats the agent plays the whole game again and then that stops
54
00:03:08,270 --> 00:03:14,460
so basically that's that's another airpark every time the agent you know every time the game ends with
55
00:03:14,460 --> 00:03:16,680
a favor beyond fairie that's the end of an airport.
56
00:03:16,700 --> 00:03:19,560
And then he starts again and then he starts again and then he starts again.
57
00:03:19,790 --> 00:03:20,420
And so on.
58
00:03:20,420 --> 00:03:26,810
So that happens and this process happens for every single time the agent is in you in a new state so
59
00:03:26,810 --> 00:03:32,240
the state is encoded here so that's important not just for every single game that he plays but for every
60
00:03:32,240 --> 00:03:33,020
single state.
61
00:03:33,020 --> 00:03:38,030
So he's in a state that goes through his process dates and so on and happens every single time.
62
00:03:38,150 --> 00:03:41,410
And so the learning happens and the acting happens as well.
63
00:03:41,720 --> 00:03:47,090
So that is deep learning in the intuition behind deep learning.
64
00:03:47,090 --> 00:03:54,200
We've got lots more to cover off and then of course practical and in the meantime if you'd like to get
65
00:03:54,410 --> 00:03:56,720
some additional information on keep learning.
66
00:03:56,720 --> 00:04:05,200
We've got a recommended reading so we've already spoken about Arthur Giuliani's series of blog posts.
67
00:04:05,210 --> 00:04:12,590
If you look at simple informal learning Lifton's flow part 4 you will find the part that's relevant
68
00:04:12,590 --> 00:04:14,260
to what we discussed today.
69
00:04:14,270 --> 00:04:21,170
Note that here he talks about convolutions we are not covering revolutions in this section we're going
70
00:04:21,170 --> 00:04:23,650
to be talking about them in the next section of the course.
71
00:04:23,720 --> 00:04:28,880
So the difference here is that it's just kind of skip the conclusions part for now and we'll talk about
72
00:04:28,880 --> 00:04:32,850
them in the next part of the course but the difference is in evolutions.
73
00:04:32,850 --> 00:04:39,170
You're like looking the agent is looking at the image and and therefore he has to process an image an
74
00:04:39,170 --> 00:04:43,540
additional complication for now where we're slowly gradually building up to that.
75
00:04:43,580 --> 00:04:50,060
For now we're encoding our environment through you look here we're encoding our environment or maybe
76
00:04:50,060 --> 00:04:58,700
like look at this one probably in coding our environment as a or in to state the agent is in as a vector.
77
00:04:58,700 --> 00:05:01,330
So in our case was very simple vector of values.
78
00:05:01,490 --> 00:05:06,190
Sometimes people even in that in that simple may sometimes or as you'll see from this blog post.
79
00:05:06,290 --> 00:05:10,180
Sometimes people prefer the one hot and coded version of that state.
80
00:05:10,180 --> 00:05:13,380
So basically where every single box of the maze has a.
81
00:05:13,620 --> 00:05:17,780
So you have like a vector of for a null case would be 12 values three by four.
82
00:05:17,800 --> 00:05:22,130
So it isn't like either either 1 or 0 depending on which elements and which box you're in.
83
00:05:22,160 --> 00:05:22,990
In the environment.
84
00:05:23,060 --> 00:05:29,900
So in whichever way you decide to code your environment and the state of your environment that's how
85
00:05:29,900 --> 00:05:31,520
in coding It's basically a vector.
86
00:05:31,520 --> 00:05:36,410
The key here is that it's not a convolution So it's not like an image and there's no convolution volt
87
00:05:36,410 --> 00:05:37,810
So this part will come later.
88
00:05:37,820 --> 00:05:43,410
For us it starts over here and that just simplifies the process for us to gradually understand better.
89
00:05:43,550 --> 00:05:49,130
And of course don't forget that this post is rude and tends to flow and we're using pi torche in our
90
00:05:49,130 --> 00:05:50,090
tutorials.
91
00:05:50,090 --> 00:05:51,910
So hopefully you enjoy this.
92
00:05:51,920 --> 00:05:59,220
A quick intro into a deep convolutional deep not yet deep book learning.
93
00:05:59,310 --> 00:06:02,910
And on that note I look forward to seeing you next.
94
00:06:02,930 --> 00:06:05,430
And until then enjoy artificial intelligence.
10213
Can't find what you're looking for?
Get subtitles in any language from opensubtitles.com, and translate them here.