Hello and welcome back to the course on artificial intelligence. I hope you're enjoying the course so far. Today we're talking about action selection policies. All right, let's get straight into it.

Previously we talked about adding a neural network to our simple Q-learning, and by now we are getting quite deep into deep Q-learning. We've talked about the learning part quite a bit, including adding some elements to it, and today we're talking about this other part: the acting. So let's have a look.
So here we've got what we discussed about acting: once you input the values, the parameters, the vector describing the state the agent is currently in in that environment, then, whether that's after all the learning is done or even before the learning is done, we basically get all the Q-values. We're not interested in the learning right now, we're focused on the acting. So once we have these Q-values, how do we decide which one we need to use?

Well, if you think about it, Q-values are simply predictions of how valuable each action is. So, as we did in the simple Q-learning algorithm, what did we do? We just selected the one with the best, or highest, value. Once we have the one with the highest Q-value, we just take that action, because it brings us the highest value, and we know that the Q-value is calculated as the immediate reward that we expect to receive plus the discount factor times the value of the next state. It's a recursive calculation, so why wouldn't you just take the best value, and that's kind of the end of it?
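To make that concrete, here is a minimal sketch of greedy selection and of the recursive Q-value target it relies on. This is my own illustration rather than code from the course; the Q-values and the discount factor of 0.9 are assumptions made up for the example.

```python
import numpy as np

gamma = 0.9  # assumed discount factor for this example

def greedy_action(q_values):
    # Pure exploitation: always pick the action with the highest Q-value.
    return int(np.argmax(q_values))

def q_target(reward, next_q_values):
    # Recursive Q-value: immediate reward plus the discount factor
    # times the value of the next state (its best Q-value).
    return reward + gamma * np.max(next_q_values)

q_values = np.array([1.2, 3.4, 0.7, 2.1])            # hypothetical Q-values for one state
print(greedy_action(q_values))                        # -> 1
print(q_target(reward=1.0, next_q_values=q_values))   # -> 1.0 + 0.9 * 3.4 = 4.06
```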
But as you can see here, it's not that simple: here we're using a softmax function, and this is where we're going to talk about action selection policies. In reality we don't have to use just a softmax function; we can have different action selection policies. For example, we've got epsilon-greedy, we've got epsilon-soft, and we've got softmax, and those are the most commonly used action selection policies; of course, there are others. For instance, the most basic one is a very simple selection that just takes the best action, the one with the highest Q-value. But why doesn't that simple policy suffice, and why do we have different types of action selection policies?
Well, it all boils down to exploration versus exploitation, and that is the core of reinforcement learning. We already talked about this a little bit: when your agent is operating in an environment, it might predict certain Q-values which look good, and it might turn out great, or it might turn out that those predictions are off and the agent will be forced to explore.

So if, for instance, in this case it predicts that Q2 is the best one, then it takes action 2, it goes from here to section 2, and it gets a very negative reward. Then the environment is forcing the agent to go and explore, because now it's going to learn: "I thought Q2 was going to be very good, but it turned out very bad." The results are not good, so the network can update itself, and next time the agent is in this state it probably won't select that action. You might think it would take a couple of penalties or punishments for it to learn that this is a bad action, but fairly soon it will learn to take a different action instead, the one that now has the best value.
So sometimes the environment forces the agent to explore different actions. But sometimes the agent might find itself stuck in a local maximum. It might find, through its initial exploration, that "this is a pretty good action, I'm going to go right here", and that's a great selection. The problem is that it only thinks it's the best action because it hasn't explored everything: it has explored going up, it has explored going left, it has explored going right, but it hasn't explored going down from that specific state it's in. And now it's biased towards this action: it thinks it's a good action, it's going to keep taking it, and it's going to keep getting a good reward. But what if that unexplored action would have been even better? What if it would have been so much better that, if the agent knew about it, it would actually switch to it? Because the agent got stuck in a local maximum and keeps getting these good rewards, it just keeps being reinforced; the environment keeps telling it that this is a good action to take, so keep doing it. But the reality is that there's this other action, which it hasn't found yet, hasn't even explored, that would have been much better.
So what we want to do is come up with an action selection policy that allows our agent not to get stuck in a local maximum. Yes, it's important to keep doing the good actions; that's the exploitation part, we want to exploit what we've found. But at the same time we still want to explore; we never want to stop exploring. It's like in life: you never want to stop learning, because as the saying goes, when you're not growing, you're dying. So you want to keep learning, and your agent wants to keep learning, and that's where these action selection policies come in.
So we've got three of them listed here. The first one is epsilon-greedy. It's a very simple one; it sounds pretty complex in the sense that it's got a cool name, and usually things with technical-sounding names seem complicated, but it's actually not. Basically, what it does is select the action with the highest Q-value all the time, except for epsilon percent of the time; epsilon, which you might come across in other places as well, is just the parameter of this selection policy. So for instance, if you set epsilon to 10 percent, or 0.1, then 10 percent of the time the action is going to be selected at random. So 90 percent of the time you're still going to be selecting the best action, the one with the highest Q-value, but 10 percent of the time you're going to be selecting a random action, uniformly, absolutely at random. Or if you set epsilon to 0.05, that means that 95 percent of the time the agent is going to be taking the action with the highest value, but 5 percent of the time it's still going to be selecting a random action, so it's still going to be going out there and exploring.
That's kind of why it's called epsilon-greedy: you're greedily selecting the good action, except for that little epsilon portion of the time. The lower the epsilon, the more greedily you're selecting the action that looks optimal, and the fewer chances you're leaving for exploration. Epsilon-soft is very similar but works the other way around: you select at random one minus epsilon of the time. So if your epsilon is 0.1, or 10 percent, then only 10 percent of the time are you taking the best action, and 90 percent of the time you're selecting a random action.
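Here's a minimal sketch of both policies as described above, just to make the mechanics explicit. It's my own illustration, not course code, and the Q-values and the epsilon value are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(q_values, epsilon=0.1):
    # epsilon of the time: explore with a uniformly random action.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    # The rest of the time: exploit the action with the highest Q-value.
    return int(np.argmax(q_values))

def epsilon_soft(q_values, epsilon=0.1):
    # The inverted version described above: greedy only epsilon of the time,
    # random the remaining 1 - epsilon of the time.
    if rng.random() < epsilon:
        return int(np.argmax(q_values))
    return int(rng.integers(len(q_values)))

q_values = np.array([1.2, 3.4, 0.7, 2.1])  # hypothetical Q-values
print(epsilon_greedy(q_values))            # usually 1, occasionally a random index
```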
So they're very, very simple, just inverted versions of each other. And softmax is kind of the next step from those; it's a more advanced version, I would say, of the epsilon-greedy algorithm, although they both have merit and they both have a place. We're going to be using softmax in our coding, in the practical part of this section, so that's why we're going to talk in a bit more detail about softmax. So let's have a look.
Let's move on to the next one. Hopefully epsilon-greedy is pretty clear; it's a pretty straightforward algorithm: select the best action most of the time, and sometimes go and explore. And now we also see why it's important to do that exploration: so that we don't end up in local maxima in our optimization process. So now we're going to talk a bit more about softmax.
There's a tutorial on softmax at the end of the course; I think it's in annex number two, where we talk about the concept of softmax, so you can refresh a little bit there. There we're talking about neural networks, and by the way, we're not covering convolutional neural networks in this section; in this section we're still using a simple vector of inputs. But in the next section of the course, when we're creating an AI to play Doom, we are going to be using a convolutional neural network. So it could be beneficial for you to look at the convolutional neural networks annex and then take the softmax tutorial, or you can learn a bit more about softmax after you take the convolutional neural networks annex later on, of course.
But here's a quick refresher. Here we've got our convolutional neural network, which decides whether an image is a dog or a cat. Here we've got the voting process between these neurons, and this one says that the image has the features of a dog: the fluffy ears, the pointed face, the type of eyes, the way the eyes look, all these features that belong to a dog. So it outputs a 95 percent chance that it's a dog and a 5 percent chance that it's a cat. But the question we asked in that tutorial was: how do we get these values to add up to one? Well, whatever our whole neural network, the convolutional part plus the fully connected layers, spits out, those are the values we apply the softmax function to. That's where we introduced the formula for the softmax function, this is what it looks like, and that's how we got these values. So that's a quick refresher.
This is the formula for softmax. What it does is take however many outputs you have, it doesn't matter how many, and squash them all into values between 0 and 1, regardless of how big they are, just through this formula: you can see that there's a total sum in the denominator, so each of these values is going to land between 0 and 1, and all of these values are going to add up to one, always. And that's very beneficial for us, because when we're using the softmax function, we don't simply get the Q-values and pick the best one. In reality, the values that come out of the network are just arbitrary numbers: they don't have to add up to one and they don't have to be between 0 and 1, they're just some numbers. But when we apply softmax, we don't just select the best one; we actually get numbers like these, numbers in the range between 0 and 1 that also add up to 1.
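In other words, each output is the exponential of a Q-value divided by the sum of the exponentials of all of them. Here's a small sketch of that squashing in code; it's my own example with made-up Q-values, not code from the course.

```python
import numpy as np

def softmax(q_values):
    # Subtracting the max is a standard numerical-stability trick;
    # it doesn't change the resulting probabilities.
    exp_q = np.exp(q_values - np.max(q_values))
    return exp_q / exp_q.sum()

q_values = np.array([1.2, 3.4, 0.7, 2.1])   # arbitrary raw Q-values
probs = softmax(q_values)
print(probs)          # each entry lies between 0 and 1
print(probs.sum())    # -> 1.0
```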
And what other thing do we know that adds up to one? Well, probabilities. We know that probabilities always have to add up to 1, and that is why we can say: here we've got Q-values, but over here, all of a sudden, we've got probabilities. So we can say that the likelihood of this being the best action is 90 percent, of this one being the best action 5 percent, then 2 percent, then 3 percent, because we know that the higher the Q-value, the better the action. So once we squash them into the 0-to-1 range, they become probabilities and we can deal with them as such. And that is when the action is selected, and that's how we come up with Q2.
But if you look at it closely, this isn't a strict 100 percent, and these are not all zero percent: these are 5 percent, 2 percent and 3 percent. So the most natural way to apply softmax, in order to preserve exploration in the algorithm, is to use these exact probabilities as how often we're going to be taking each action. These probabilities represent the distribution of the actions that we take, so basically softmax makes it very easy for us to come up with a way to combine exploitation and exploration. The best action will always have the highest probability, because it has the highest Q-value, so we're just going to use these values as our distribution. We're going to say: okay, we're going to be taking Q2 90 percent of the time, but 5 percent of the time we're still going to be taking Q1, 2 percent of the time Q3, and 3 percent of the time Q4.
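A minimal sketch of that sampling step, building on the softmax function above; again this is my own illustration, using the example probabilities from the lecture rather than anything computed by the course's network.

```python
import numpy as np

rng = np.random.default_rng()

# Example softmax outputs for actions Q1..Q4, as in the lecture.
probs = np.array([0.05, 0.90, 0.02, 0.03])

def softmax_action(probs):
    # Sample an action index according to the softmax probabilities:
    # the best action is taken most often, but every action keeps a chance.
    return int(rng.choice(len(probs), p=probs))

counts = np.bincount([softmax_action(probs) for _ in range(10_000)], minlength=4)
print(counts / counts.sum())  # roughly [0.05, 0.90, 0.02, 0.03]
```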
And the beauty here is also that as these values update, as the agent goes through the environment more and more and becomes more familiar with it, the Q-values change. It might ascertain, for instance, that this value is actually lower, or that this one is actually higher, and so these probabilities will also change as the agent goes along. So even though Q2 is the top action here, 5 percent of the time, to be more precise, we'll still be selecting Q1 as the action to take, action 3 two percent of the time, and action 4 about 3 percent of the time. So every action has a chance to play a part in this process, as long as we have enough iterations and the agent goes through these states lots and lots of times.
And that's how any kind of deep learning algorithm works: you want to do this many, many times so that the agent learns from experience. And as you can see, it's a very natural transition: we're not just randomly selecting actions, as in the epsilon-greedy algorithm; we're selecting them based on their softmax values, so there's some logic behind it. It's not just "ten percent of the time we pick a random action"; there's some logic behind how we're doing it, based on the Q-values that we've learned.
And so that's the action selection policy that we're going to be using in this course. You're definitely welcome to check out the epsilon-greedy action selection policy if you like, but we're going to be predominantly using the softmax action selection policy. And I've got some interesting reading for you.
It's called "Adaptive Epsilon-Greedy Exploration in Reinforcement Learning Based on Value Differences"; it's a 2010 article. It's interesting because Michel Tokic (I'm not sure how to pronounce the name) introduces a different type of algorithm, an adjusted epsilon-greedy algorithm called the VDBE algorithm, or epsilon-greedy VDBE, as you can see here, and he actually compares it to epsilon-greedy and softmax. It's an adaptive epsilon-greedy algorithm, and the main idea behind it is to adjust the value of epsilon depending on the state the agent is in: if the agent is very certain about the state it's in, then epsilon should be smaller, so there should be less exploration; if the agent is uncertain, epsilon should be higher, so there should be more exploration.
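Just to give a feel for that idea, here's a heavily simplified, hypothetical sketch of state-dependent epsilon. This is not the actual VDBE update from the paper, only a toy version of the general concept: exploration stays high in states where the value estimates are still changing a lot, and shrinks where they have settled down.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng()

# One epsilon per state, adapted from how much the Q-values are still changing there.
epsilon = defaultdict(lambda: 1.0)

def update_epsilon(state, value_difference, sensitivity=1.0, decay=0.9):
    # Toy rule: large recent value differences -> keep epsilon high (uncertain state),
    # small differences -> let epsilon decay (certain state).
    uncertainty = 1.0 - np.exp(-abs(value_difference) / sensitivity)
    epsilon[state] = decay * epsilon[state] + (1.0 - decay) * uncertainty

def act(state, q_values):
    if rng.random() < epsilon[state]:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit
```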
It is a 2010 article, and I'm not sure whether this newly proposed algorithm is widely used, whether it has been accepted in the community, or whether artificial intelligence research has since moved away from this suggestion. But nevertheless it will definitely help you reinforce your knowledge about the action selection policies we discussed, epsilon-greedy and softmax; it will give you an opportunity to compare them side by side, and also to see in which direction people think when they want to improve artificial intelligence. So if you're ever planning on creating really interesting algorithms that push the edge of artificial intelligence, that push the envelope in this space, then this could be a good way for you to see in which direction people sometimes go when they're trying to improve on the norms of artificial intelligence, or the norms that existed back then, in 2010.
So there we go. Hopefully you enjoyed today's tutorial about action selection policies. We learned about epsilon-greedy, epsilon-soft and softmax, and now you're even more prepared for the practical side of things. On that note, I look forward to seeing you in the next tutorial, and until then, enjoy AI.