Would you like to inspect the original subtitles? These are the user uploaded subtitles that are being translated:
1
00:00:01,040 --> 00:00:04,020
Hello and welcome back to the course on artificial intelligence.
2
00:00:04,040 --> 00:00:07,040
Today we are finally talking about Kule learning.
3
00:00:07,070 --> 00:00:12,890
All right so we've already got this equation the bellmen equation which we've added lots of components
4
00:00:12,890 --> 00:00:13,120
to.
5
00:00:13,130 --> 00:00:19,910
We've got the reward here which can be not just at the very end but it can be at any given step.
6
00:00:19,940 --> 00:00:21,920
We've got the discount factor.
7
00:00:21,950 --> 00:00:26,880
We've got the probability because now we're looking at mark of a decision processes.
8
00:00:26,900 --> 00:00:32,780
And here we've got the possibility of ending up in a different states regardless of what action we take
9
00:00:33,350 --> 00:00:35,210
or actually given the action we take.
10
00:00:35,210 --> 00:00:40,670
There can be multiple states that we can end up in and then we got the value of the next states because
11
00:00:40,670 --> 00:00:46,790
he kind of like a recursive function and so on but you probably still have one question.
12
00:00:46,820 --> 00:00:53,560
The question is where in all of this is no letter Q Why is it all called q.
13
00:00:53,750 --> 00:00:54,270
Learning.
14
00:00:54,350 --> 00:00:55,790
So where's the cue.
15
00:00:55,910 --> 00:00:58,940
And that's the question that we're going to be answering today.
16
00:00:58,940 --> 00:01:06,620
So far we've been dealing with values the value of being in a certain state and now we're going to look
17
00:01:06,620 --> 00:01:09,820
at how Q fits into all of that as well.
18
00:01:10,070 --> 00:01:16,360
So here we've got two examples on the left is what we would be do so far our agent has been analyzing.
19
00:01:16,400 --> 00:01:18,170
Ok I'm over here.
20
00:01:18,230 --> 00:01:21,640
This is a mark of decision process so doesn't matter how I got here.
21
00:01:21,770 --> 00:01:28,250
The rest of the environment doesn't care of the steps that it took me to get here from now on.
22
00:01:28,460 --> 00:01:32,050
I have to make the optimal decision where to go here here or here.
23
00:01:32,060 --> 00:01:37,280
Based on the current state and all the future states that come from here but not from the past.
24
00:01:37,490 --> 00:01:42,010
And so he can see that there's three options there's state one state to state three.
25
00:01:42,260 --> 00:01:48,920
And based on his experience he has calculated the values in these states and now he's going to using
26
00:01:48,920 --> 00:01:49,880
the bellmen equation.
27
00:01:49,880 --> 00:01:54,260
So even though this is a classic Proceso he knows that he'll go here but there's a chance that he will
28
00:01:54,260 --> 00:01:56,120
go left right and so on.
29
00:01:56,110 --> 00:02:02,450
So based on these values going to make a decision that's what we do so far and that is totally legitimate
30
00:02:02,450 --> 00:02:03,470
approach here.
31
00:02:03,560 --> 00:02:05,640
But now we're get modified a little bit.
32
00:02:05,660 --> 00:02:12,860
We're going to take the same exact concept same exact problem but here instead of looking at the values
33
00:02:12,950 --> 00:02:21,440
of each state that he can end up in we're going to look at the values or the value of each action.
34
00:02:21,440 --> 00:02:25,640
So we're we're not going to use the letter V anymore because for the value of the state we're going
35
00:02:25,640 --> 00:02:30,740
to use a Q and the you might have a question why the letter Q Well.
36
00:02:30,740 --> 00:02:32,300
Q Some people speculate that.
37
00:02:32,300 --> 00:02:33,760
Q Will I read this.
38
00:02:33,770 --> 00:02:35,420
I think on Quora.
39
00:02:35,420 --> 00:02:41,480
Somebody mentioned that Q is because of quality but at the same time I couldn't find any other references
40
00:02:41,480 --> 00:02:45,520
to that so it might not be because that might just because that's the letter that was used at the time
41
00:02:45,920 --> 00:02:50,750
and now it became super popular because it's all called key learning because of that.
42
00:02:50,780 --> 00:02:52,520
So no exact reason was hold.
43
00:02:52,530 --> 00:02:58,830
Q But nevertheless at least it helps us distinguish between V and Q So Q here.
44
00:02:58,850 --> 00:03:03,340
There were presents rather than the value of the state it represents lets go of quality.
45
00:03:03,410 --> 00:03:06,260
It represents the quality of the action that represents.
46
00:03:06,260 --> 00:03:07,980
OK so I've got four actions.
47
00:03:08,300 --> 00:03:10,860
What are the different qualities of these action.
48
00:03:10,860 --> 00:03:16,340
What is the value of the action or the quality of the action which action is more lucrative so I need
49
00:03:16,340 --> 00:03:21,380
a metric telling me all right how do I quantify this action and then I can compare them and that is
50
00:03:21,380 --> 00:03:23,200
exactly what Q is.
51
00:03:23,470 --> 00:03:26,240
And so he's got four possible actions.
52
00:03:26,360 --> 00:03:29,240
As always go up right left or down.
53
00:03:29,240 --> 00:03:35,480
And based on the action theres going to be a formula which tells us the quantifiable value of that action
54
00:03:35,480 --> 00:03:38,410
which we're calling the Q q value of that action.
55
00:03:38,630 --> 00:03:41,700
So let's have a look at how we going to derive this formula.
56
00:03:41,710 --> 00:03:44,510
Q What how does it actually relate to these.
57
00:03:44,510 --> 00:03:51,290
Because as you can imagine because actions lead to states there has to be some sort of link between
58
00:03:51,290 --> 00:03:51,850
the two.
59
00:03:51,870 --> 00:03:56,060
Right we've got we've already determined how to calculate this and we're pretty good at it.
60
00:03:56,060 --> 00:04:02,030
We know how to use the Belman equation in very different environments with lots of different complications.
61
00:04:02,270 --> 00:04:06,080
Well let's leverage that knowledge to understand how we can now calculate.
62
00:04:06,080 --> 00:04:12,170
Q In order to make the same predictions because as you can imagine the environment doesn't change depending
63
00:04:12,500 --> 00:04:16,530
depending on what approach we use the environment is going to be the same regardless.
64
00:04:16,550 --> 00:04:22,130
So therefore this approach and this approach should always give the same result and therefore that's
65
00:04:22,460 --> 00:04:24,690
another reason why these two should be linked.
66
00:04:25,100 --> 00:04:26,290
So let's have a look.
67
00:04:26,300 --> 00:04:31,280
So here is our view approach where we just get to look at the value of any given state this state or
68
00:04:31,280 --> 00:04:32,260
any other state.
69
00:04:32,420 --> 00:04:37,190
And here we go into we're just using the lead here because that's the current state.
70
00:04:37,190 --> 00:04:43,730
And so therefore the terminology will be the same in both equations and here we using q as a Q Is the
71
00:04:43,790 --> 00:04:45,520
of the state s and the action.
72
00:04:45,540 --> 00:04:51,970
A because action is up but in which state we perform that action we perform that action in the State.
73
00:04:53,000 --> 00:04:57,230
OK so now we're going to ride out the Belman equation for the first approach as you can see here we've
74
00:04:57,230 --> 00:05:06,620
got the of s or the value of any given state s is the maximum of the reward that you get a maximum bet
75
00:05:07,070 --> 00:05:08,660
based on the actions you have three.
76
00:05:08,690 --> 00:05:14,210
In this case you actually have four actions so maxim out of all the possible actions of this part which
77
00:05:14,210 --> 00:05:20,090
we've heard discussed many times so this is our reward that we get from performing that action in that
78
00:05:20,090 --> 00:05:26,850
state plaza discount in fact multiplied by the expected value of the new state that we're going to be
79
00:05:26,850 --> 00:05:29,420
in an expected value because it is a stochastic process.
80
00:05:29,420 --> 00:05:34,460
We don't know exactly for sure that we're going to end up over here we might end up on the left or the
81
00:05:34,460 --> 00:05:36,050
right sort of probability.
82
00:05:36,050 --> 00:05:38,230
That's why these probabilities are in you.
83
00:05:38,240 --> 00:05:40,290
All right so that's our value.
84
00:05:40,350 --> 00:05:41,150
And now let's look at.
85
00:05:41,150 --> 00:05:43,530
Q So Q is going to be defined.
86
00:05:43,580 --> 00:05:49,550
We're going to use this to define Q So let's say the agent from this location from this state perform
87
00:05:49,550 --> 00:05:50,640
the action up.
88
00:05:50,840 --> 00:05:54,350
What is the q value going to be called to.
89
00:05:54,500 --> 00:05:59,320
Well first of all let's see what he will get in return for performing this action up.
90
00:05:59,420 --> 00:06:02,160
First thing that you'll get is a reward right.
91
00:06:02,360 --> 00:06:04,180
Knows no doubt about it.
92
00:06:04,250 --> 00:06:09,920
There's going to be some sort of rule or might be zero but we know that the whole is the way this reinforcement
93
00:06:09,920 --> 00:06:15,770
learning process works is that some towns are performing certain actions from a given state or two.
94
00:06:15,840 --> 00:06:17,140
So I'm going to add that in here.
95
00:06:17,480 --> 00:06:19,680
And then we're going to add what are we going to add.
96
00:06:19,850 --> 00:06:21,090
Well let's think about it.
97
00:06:21,110 --> 00:06:24,640
What is the next thing that happens after he's going there.
98
00:06:24,860 --> 00:06:32,030
Well next thing that happens is that now the agent is in a certain state he could end up here with a
99
00:06:32,330 --> 00:06:34,640
80 percent probability or some probability.
100
00:06:34,730 --> 00:06:36,670
But actually up here right here.
101
00:06:36,800 --> 00:06:43,940
But wherever he ends up now there's we already have a quantified metric for that state he's in.
102
00:06:44,210 --> 00:06:47,100
And that is actually the value of that state.
103
00:06:47,180 --> 00:06:52,340
But because he came up in many different states and three of the possible different states we have to
104
00:06:52,370 --> 00:06:55,730
look at the expected value of the state that he'll be in.
105
00:06:56,210 --> 00:06:58,610
And so we're going to add that in we're going to add.
106
00:06:58,610 --> 00:07:04,020
Of course the discounted factor as we previously had because that is somewhere in the future.
107
00:07:04,190 --> 00:07:11,210
And then we're going to add the some of across all possible states across all possible states that he
108
00:07:11,210 --> 00:07:12,910
could end up by taking this action.
109
00:07:12,910 --> 00:07:14,240
Terms of probability.
110
00:07:14,240 --> 00:07:20,150
So what we're saying here is that OK so by performing an action you're going to get a reward Plus which
111
00:07:20,150 --> 00:07:22,700
is a quantified metric Plus you're going to get.
112
00:07:22,730 --> 00:07:25,820
You end up in a state we don't know which one it could be here.
113
00:07:25,850 --> 00:07:26,950
Could be here it could be here.
114
00:07:27,050 --> 00:07:32,240
But here is the expected value of the state that you're going to end up in.
115
00:07:32,270 --> 00:07:36,290
And now we're going to multiply by discounting factor because that is one move away.
116
00:07:36,380 --> 00:07:44,180
So that is our Q value for this for performance section and what you will notice here right away is
117
00:07:44,180 --> 00:07:44,730
that.
118
00:07:44,760 --> 00:07:51,470
Q The Q value is actually exactly identical to what's inside these brackets over here.
119
00:07:51,950 --> 00:07:52,660
And why is that.
120
00:07:52,670 --> 00:07:59,930
Well if you think about it here we're taking the maximum of the results will get the maximum across
121
00:07:59,930 --> 00:08:04,910
all possible actions so we got for action taking the maximum across all possible actions of the result
122
00:08:04,910 --> 00:08:10,500
that we'll get by taking each of those actions and enqueue we're defining.
123
00:08:10,610 --> 00:08:11,160
Interesting.
124
00:08:11,160 --> 00:08:14,000
What will we get by taking a certain action.
125
00:08:14,000 --> 00:08:19,340
So if you think about it it makes sense that the value of a state.
126
00:08:19,370 --> 00:08:25,720
So for instance this state is the maximum of all of the possible Q values.
127
00:08:25,790 --> 00:08:32,360
Right so here in the States by being in the state the agent has one key value to keep the 3Q value for
128
00:08:32,360 --> 00:08:32,870
q value.
129
00:08:32,870 --> 00:08:37,760
So yes positive for possible Q values while the value of the stay it makes sense that the value of the
130
00:08:37,760 --> 00:08:42,460
state is the maximum of all of those four key values.
131
00:08:42,490 --> 00:08:44,420
That is exactly what we can see here.
132
00:08:44,420 --> 00:08:48,060
That's a good confirmation of this new formula that we derive.
133
00:08:48,080 --> 00:08:53,080
If that wasn't the case if that if that didn't match up then we would have questions would be like.
134
00:08:53,270 --> 00:08:55,150
So why why doesn't it match.
135
00:08:55,160 --> 00:08:57,510
Why doesn't it match up if.
136
00:08:57,690 --> 00:09:05,810
Q value is a quantified metric of performing an action and V depends on the floor.
137
00:09:05,930 --> 00:09:12,650
Is like is the maximum of the possible results of the four actions that he can perform over that makes
138
00:09:12,650 --> 00:09:12,970
sense.
139
00:09:12,980 --> 00:09:21,050
And that confirms the formula that we've just derived and now we're going to make it even more interesting.
140
00:09:21,080 --> 00:09:26,620
We're going to get rid of the Wii entirely because you can see here you've got Wii is a recursive function.
141
00:09:26,810 --> 00:09:29,750
So and then you've got me and then B and then B and then B and so on.
142
00:09:29,760 --> 00:09:35,480
So you can express this view through all of the following Vee's the most optimal these will come up
143
00:09:36,150 --> 00:09:36,830
here.
144
00:09:36,840 --> 00:09:43,210
We're expecting Q As a funk a recursive function of the OR as a function of the next V and then you'd
145
00:09:43,250 --> 00:09:45,200
have to plug in this V and then we get back to the B.
146
00:09:45,200 --> 00:09:51,110
So what are we going to do is we're actually going to take this V and we're going to we're going to
147
00:09:51,230 --> 00:09:54,280
replace it with Q Right so let's have a look at that.
148
00:09:54,930 --> 00:10:01,410
We're going to take this V of the next state and we're going to plug this into that formula over here.
149
00:10:01,570 --> 00:10:07,180
And as you can see now so this part doesn't change this probability doesn't change.
150
00:10:07,180 --> 00:10:16,950
But as we just discussed the of s is the maximum by all actions of q of S and a right over here.
151
00:10:16,990 --> 00:10:19,180
So that's what we're going to replace in here.
152
00:10:19,180 --> 00:10:24,310
So we're going to say a maximum of of course is the new action the action that we're going to take because
153
00:10:24,310 --> 00:10:26,760
here we've got the Wii of as prime.
154
00:10:26,770 --> 00:10:30,700
So here now we've got the maximal console at a prime.
155
00:10:30,700 --> 00:10:34,510
So the actions that we're going to take from this state are from wherever whichever other state we end
156
00:10:34,510 --> 00:10:41,200
up in but the action we're going to take from there and Maxima across all those and the maximum is of
157
00:10:41,260 --> 00:10:50,170
all the cube values that will that are available to us in that new state as prime comma a prime.
158
00:10:50,170 --> 00:10:51,280
And that's action.
159
00:10:51,280 --> 00:10:52,140
So that's the.
160
00:10:52,210 --> 00:10:53,500
So there's going to be another four.
161
00:10:53,500 --> 00:10:54,530
Q values there.
162
00:10:54,610 --> 00:10:56,700
So now as you can see let's go through again.
163
00:10:57,040 --> 00:11:02,740
So as from what we derive this word would be just cause just through logic and intuition so that we
164
00:11:02,740 --> 00:11:07,400
can see that VNS are actually view of AS and of and a are linked.
165
00:11:07,400 --> 00:11:12,400
The of S is the maximum across all actions of Cuba S and you can see right here so this this part is
166
00:11:12,400 --> 00:11:13,820
identical to this part.
167
00:11:14,290 --> 00:11:20,740
And then we're going to leverage that and we're going to replace this bit with VNS from here but not
168
00:11:20,740 --> 00:11:25,730
this exact funnel we're going to take this internal part and replace it with kill the innocent.
169
00:11:26,080 --> 00:11:32,920
So we're going to plug that in here and this part is going to be q of s prime a prime maximum of cube
170
00:11:33,430 --> 00:11:36,810
by Crucell a Priam's of Q As Prime a prime.
171
00:11:37,060 --> 00:11:39,790
And now we have our formula.
172
00:11:39,790 --> 00:11:46,880
So now we have a recursive formula for the q value so now the agent can think what's the value of the
173
00:11:46,890 --> 00:11:50,310
section what's the quality of this section was the new value of this action.
174
00:11:50,470 --> 00:11:56,570
Well it depends on the reward I get in the immediate step after that plus it depends on the discounted
175
00:11:56,590 --> 00:12:02,410
factor times the maximum of all the possible Q actions in that state.
176
00:12:02,410 --> 00:12:06,760
But I don't know if I'm going to get their side need to also look at that state in that state and that's
177
00:12:06,760 --> 00:12:12,770
why we have this expected value over here so we have some probability times the maximum that's expected
178
00:12:12,860 --> 00:12:13,300
value.
179
00:12:13,450 --> 00:12:18,010
So a very similar formula as you can see but this time we're expressing things through the q values
180
00:12:18,490 --> 00:12:27,310
and that's why this whole algorithm is called Kill learning because this is what is looked at this is
181
00:12:27,310 --> 00:12:32,020
what the agents actually use they don't look at the states look at their possible actions and then based
182
00:12:32,020 --> 00:12:35,760
on the actions on the q value of the actions they will decide which action to take.
183
00:12:35,760 --> 00:12:40,330
So they'll just look at the maximum Q value in this given state it has four actions.
184
00:12:40,330 --> 00:12:45,340
What is the best action to take so it can compare sort of comparing the different states that can end
185
00:12:45,350 --> 00:12:51,820
up end up in is going to compare the possible actions that it currently has then by finding the optimal
186
00:12:51,820 --> 00:12:56,830
one is going to take that action and then engage is going to repeat that process repeat that process
187
00:12:56,860 --> 00:12:57,440
and so on.
188
00:12:57,580 --> 00:13:03,940
So now you can see how all this comes together how the reward the discounting facts or the stochastic
189
00:13:04,360 --> 00:13:10,330
market decision processes and the values and the q values all come together in order to cueist this
190
00:13:10,690 --> 00:13:18,400
one super powerful Belman equation for q values which we can now apply and let our agents learn how
191
00:13:18,400 --> 00:13:20,410
to beat the environment.
192
00:13:20,410 --> 00:13:23,380
And so that is a intuitive explanation of what's going on.
193
00:13:23,380 --> 00:13:28,510
I know we went through the formulas but it is necessary because this is like our formula that's we've
194
00:13:28,510 --> 00:13:34,730
been going through this whole chapter and I think it's a good transition from the To.
195
00:13:34,780 --> 00:13:43,450
Q And it illustrates how there are links between Yishun And if you'd like to get a bit more of a rigorous
196
00:13:43,450 --> 00:13:49,410
approach mathematical approach and like you see the mathematics behind it and learn a bit more about
197
00:13:49,420 --> 00:13:51,600
q values and how they work.
198
00:13:51,640 --> 00:13:54,090
Then we've got some additional reading for you.
199
00:13:54,130 --> 00:14:02,980
This paper is called Markov decision processes concepts and algorithms by martn von Autor low 2009.
200
00:14:02,980 --> 00:14:09,610
So you cut the link here as always and here you can read in a bit more detail to understand all the
201
00:14:09,820 --> 00:14:15,220
nitty gritty behind Hugh values and so on and now that we've discussed all of these things relating
202
00:14:15,220 --> 00:14:21,660
to the Belman equation now we are ready to look at something more complex such as this paper in order
203
00:14:21,790 --> 00:14:27,670
if if we want to get some additional information on this in order to kind of get a deeper understanding.
204
00:14:27,670 --> 00:14:34,390
But even if you don't read the newspaper or radio you should have a good working knowledge of what learning
205
00:14:34,390 --> 00:14:40,850
is all about and how agents come up with the actions that they need to take in a certain environment.
206
00:14:40,870 --> 00:14:43,980
So I hope you enjoy today Statoil and I look forward to your next them.
207
00:14:43,990 --> 00:14:45,360
Until then enjoy.
208
00:14:45,390 --> 00:14:45,620
I.
23212
Can't find what you're looking for?
Get subtitles in any language from opensubtitles.com, and translate them here.