Would you like to inspect the original subtitles? These are the user uploaded subtitles that are being translated:
1
00:00:00,620 --> 00:00:04,010
Hello and welcome back to the course on artificial intelligence.
2
00:00:04,010 --> 00:00:05,940
In today's tutorial we're going to have some fun.
3
00:00:05,960 --> 00:00:11,900
We're going to have a look and artificial intelligence actually going through that maze that we've been
4
00:00:11,900 --> 00:00:18,740
talking about so long and is going to use kill learning to navigate its way and find the way out and
5
00:00:18,830 --> 00:00:24,350
we'll see what happens to the q values was going to happen to the policy and so on.
6
00:00:24,350 --> 00:00:26,310
So let's have a look.
7
00:00:26,330 --> 00:00:31,910
We're going to be using some materials kindly provided by the Berkeley University.
8
00:00:31,910 --> 00:00:40,700
So if you go to a I don't Birk only the E R K E L E Why don't you just go to that link again.
9
00:00:40,790 --> 00:00:47,510
You'll see this Web site and hear what we're going to be looking at is the need to go to we're going
10
00:00:47,550 --> 00:00:49,130
to go to PacMan projects.
11
00:00:49,130 --> 00:00:58,160
I think Pacman projects and here if you scroll down and you look at them in first learning this is what
12
00:00:58,160 --> 00:00:59,050
we're working with.
13
00:00:59,180 --> 00:01:01,700
So here you can download the zip archive.
14
00:01:01,700 --> 00:01:03,500
So that's if you want to.
15
00:01:03,530 --> 00:01:08,330
So don't you don't have to this is we're not going to go through a solution together in this trial just
16
00:01:08,330 --> 00:01:11,860
letting you know where this is all from because we're very like.
17
00:01:11,870 --> 00:01:12,930
We really appreciate that.
18
00:01:12,980 --> 00:01:16,180
UC Berkeley has made these materials available.
19
00:01:16,190 --> 00:01:19,300
But if you do wish to experiment with this on your own.
20
00:01:19,400 --> 00:01:20,660
Just bear in mind this is not part.
21
00:01:20,680 --> 00:01:23,310
Is not going to be part of our courses as part of the Berkeley course.
22
00:01:23,330 --> 00:01:27,860
I'm not sure how it works for illustration purposes but if you do want to experiment with this you can
23
00:01:27,860 --> 00:01:31,340
find it here the zip archive and all old instructions as well.
24
00:01:31,430 --> 00:01:38,450
And we're just going to go into Python right away and first thing I wanted to show you is that here
25
00:01:38,450 --> 00:01:42,790
we've got the licensing information so this is what I mean.
26
00:01:42,870 --> 00:01:47,720
We're very lucky that they said we are free to use or extend these projects for educational purposes
27
00:01:47,720 --> 00:01:51,120
provided you know distributing publish solutions which we're not going to do.
28
00:01:51,200 --> 00:01:56,750
You retain this notice which we have and you provide a clear archbishop to UC Berkeley including a link
29
00:01:56,780 --> 00:01:57,860
to which we also have.
30
00:01:57,860 --> 00:02:00,750
So once again if you'd like to learn more there the link.
31
00:02:00,770 --> 00:02:01,720
You can have a look.
32
00:02:01,730 --> 00:02:07,490
And thank you very much all of these people who have worked on this project so here's the grid world.
33
00:02:07,490 --> 00:02:09,370
We're going to be working if there is a solution there.
34
00:02:09,460 --> 00:02:15,110
You would have to in order to make it work you'd have to solve it yourself or possibly find a solution.
35
00:02:15,110 --> 00:02:18,980
Maybe some of your some people somebody you know might help you out with that.
36
00:02:19,160 --> 00:02:24,260
If again what you want to you don't have to because we are just going to look at it on this screen right
37
00:02:24,320 --> 00:02:25,110
now.
38
00:02:25,160 --> 00:02:29,720
So after we've created all those files we could just launch it over here.
39
00:02:29,720 --> 00:02:36,680
So there are some parameters that are involved in this whole world and we're not going to just show
40
00:02:36,680 --> 00:02:39,080
you what it looks like if we launch it.
41
00:02:39,080 --> 00:02:41,540
So let's try to launch it in manual mode.
42
00:02:41,540 --> 00:02:47,070
So if I go minus one of these panoramas are manual so I can command your control agent.
43
00:02:47,090 --> 00:02:52,820
So here you can see all grids so I can go up up so you can see that it's taking action starting and
44
00:02:52,820 --> 00:02:54,980
started in states where I was.
45
00:02:55,100 --> 00:03:00,650
And then you saw that I pressed up took action Norf and first time I ended up in zero once I did go
46
00:03:00,650 --> 00:03:01,310
up.
47
00:03:01,490 --> 00:03:05,000
But second time I took action Norf and I ended in the same sad didn't move.
48
00:03:05,000 --> 00:03:08,440
So something happened you know the randomness happened I either went left or right.
49
00:03:08,780 --> 00:03:10,910
And by default the parameters are set.
50
00:03:10,910 --> 00:03:16,910
You can see here by default they're set to exactly what we discussed that how often actually results
51
00:03:16,940 --> 00:03:18,250
in unintended direction.
52
00:03:18,270 --> 00:03:20,960
20 percent of the time to 10 percent to the left some to the right.
53
00:03:21,230 --> 00:03:23,520
So if I go up and say I went up I go right.
54
00:03:23,520 --> 00:03:26,810
I went right right now didn't happen.
55
00:03:26,810 --> 00:03:29,790
Right again and right and I'm finished.
56
00:03:29,790 --> 00:03:35,810
But in this implementation you have to click again to get out of this final output so out of there just
57
00:03:35,810 --> 00:03:37,140
click again and you're finished.
58
00:03:37,190 --> 00:03:40,700
That's a terminal state so we can run our manual.
59
00:03:40,730 --> 00:03:45,620
You can see that if I go right right right left up.
60
00:03:45,740 --> 00:03:50,060
So here what we saw previously that the agent wouldn't go straight up right.
61
00:03:50,060 --> 00:03:53,300
What's the point of going up if there's a chance of going into the pit.
62
00:03:53,300 --> 00:03:54,580
So let's see what the agent would do.
63
00:03:54,610 --> 00:03:56,780
It would go left and go west here would go West.
64
00:03:56,780 --> 00:04:00,820
And you see I clicked left but it went up and here I would click right.
65
00:04:00,860 --> 00:04:05,390
And I end up in the final exit stage and you see God's reward equal to one.
66
00:04:05,390 --> 00:04:07,190
So that's what it looks like manually.
67
00:04:07,190 --> 00:04:12,520
Now let's actually hook up an AI to this and let it go through.
68
00:04:12,510 --> 00:04:16,800
So let's do an H here and let's add some Brandner.
69
00:04:16,820 --> 00:04:24,170
So let me just see what I typed here so hopefully you can see by grid world why then here minus our
70
00:04:24,230 --> 00:04:25,370
means.
71
00:04:25,370 --> 00:04:27,980
That's the reward for living.
72
00:04:27,980 --> 00:04:31,840
So I've got two of them so I probably should remove this one.
73
00:04:32,190 --> 00:04:35,050
So minus k is how many iterations.
74
00:04:35,060 --> 00:04:36,690
That's way too many iterations.
75
00:04:36,690 --> 00:04:41,180
Let's do less Let's do like 10 iterations should be enough.
76
00:04:41,180 --> 00:04:42,710
Minus a is Agent.
77
00:04:42,710 --> 00:04:47,040
What type of agent don't want to do honor and image and some value or a Q.
78
00:04:47,060 --> 00:04:49,120
Q So I want a Q.
79
00:04:49,190 --> 00:04:57,090
Q learning agent doing this minus s is what is s speed so that's way too large a force that just use
80
00:04:57,090 --> 00:05:04,780
the full speed for now minus R is a living penalty's so by default is zero.
81
00:05:04,820 --> 00:05:11,000
So remember at the very start restart 0 living penances so let's call it also 0 0 and can just remove
82
00:05:11,000 --> 00:05:16,040
this parameter and D is what is d discount.
83
00:05:16,040 --> 00:05:20,660
So I just kind of factor so let's keep it at zero point and so very similar to what we're starting off
84
00:05:20,660 --> 00:05:27,880
in this section on the course so let's run that OK way too fast again all actually so pretty OK so you
85
00:05:27,880 --> 00:05:30,130
can see how he's exploring.
86
00:05:30,580 --> 00:05:35,650
And so so far he's hit negative three times and you can see how the q values are being updated in these
87
00:05:35,650 --> 00:05:36,690
squares.
88
00:05:36,700 --> 00:05:37,860
So these are key values.
89
00:05:37,870 --> 00:05:39,310
They are sort of zero.
90
00:05:39,320 --> 00:05:40,740
You can see now the Q value.
91
00:05:40,740 --> 00:05:45,220
So he's learned that this one is a bit different implement because once you get to the final stage you
92
00:05:45,220 --> 00:05:46,560
have to get out of it.
93
00:05:46,660 --> 00:05:48,990
You have to just click one more button to exit.
94
00:05:49,000 --> 00:05:51,740
And so it's very close to one but not exactly one.
95
00:05:51,760 --> 00:05:57,530
But at the same time you can see that here you know the value slowly kind of crystallizing hands are
96
00:05:57,520 --> 00:06:02,290
a point a ex-colleague getting somewhere but they're just so far they're kind of zeroes because he doesn't
97
00:06:02,290 --> 00:06:05,470
have enough information to understand what's going on.
98
00:06:05,470 --> 00:06:08,710
OK so let's see let's see what happens here.
99
00:06:10,180 --> 00:06:13,620
Exploring exploring exploring what's going to happen.
100
00:06:13,710 --> 00:06:15,300
Well was being a while.
101
00:06:15,670 --> 00:06:17,940
And we get this some randomness involved here.
102
00:06:18,100 --> 00:06:20,100
So there is that good one a few times.
103
00:06:20,110 --> 00:06:22,500
Now he only gets 10 iterations.
104
00:06:22,510 --> 00:06:26,780
So he's got to learn fast Ok I need you there.
105
00:06:27,220 --> 00:06:29,280
Let's see what's going on.
106
00:06:29,320 --> 00:06:30,050
Come on.
107
00:06:30,060 --> 00:06:31,820
Get out of that maze already.
108
00:06:32,840 --> 00:06:38,450
And yes 10 episodes so average it turns out that.
109
00:06:38,590 --> 00:06:40,430
That's not really interested in that.
110
00:06:40,460 --> 00:06:41,760
So here let's see.
111
00:06:41,760 --> 00:06:43,060
I've never seen enough of a click.
112
00:06:43,100 --> 00:06:43,460
Right.
113
00:06:43,460 --> 00:06:43,810
There we go.
114
00:06:43,820 --> 00:06:47,780
So you can see this is the policy that he came up with.
115
00:06:48,020 --> 00:06:50,860
Even through just 10 episodes he's already got a pulse.
116
00:06:50,890 --> 00:06:55,820
I'm going to go up a bomb and here I'm going to go down here I'm going to go down here I'm going to
117
00:06:55,820 --> 00:06:58,320
go into the wall and then I'm going to bounce we're here.
118
00:06:58,550 --> 00:06:59,620
That's pretty cool.
119
00:07:00,000 --> 00:07:00,250
OK.
120
00:07:00,260 --> 00:07:02,530
So now let's increase the speed.
121
00:07:02,650 --> 00:07:04,220
What was the parameter s there.
122
00:07:04,220 --> 00:07:06,240
And that's like double lawlessness.
123
00:07:06,260 --> 00:07:13,070
That's quadruple the speed and let's increase the number of iterations so let's say 20 to ration this
124
00:07:13,070 --> 00:07:16,390
time and let's see if he can get through a bit more now.
125
00:07:16,790 --> 00:07:18,700
So you can see he's going a bit faster.
126
00:07:19,600 --> 00:07:25,900
And he's learning he's learning that it's not really you know out of this state there's not many good
127
00:07:25,900 --> 00:07:30,220
actions Orio these actions that the right and straight are not that good.
128
00:07:30,250 --> 00:07:32,400
Definitely this was definitely not good.
129
00:07:32,410 --> 00:07:34,680
He still needs to learn that so from here is also good.
130
00:07:34,680 --> 00:07:36,820
You can see that this action is pretty good.
131
00:07:36,820 --> 00:07:37,330
All right.
132
00:07:37,330 --> 00:07:38,380
What did he get.
133
00:07:38,530 --> 00:07:39,100
OK.
134
00:07:39,100 --> 00:07:42,200
So interesting policy here you we decide to go up.
135
00:07:42,330 --> 00:07:43,270
Just not enough information.
136
00:07:43,270 --> 00:07:45,610
So let's let's really do that.
137
00:07:46,850 --> 00:07:50,370
And let's increase the speed to like 100.
138
00:07:50,630 --> 00:07:56,570
Super fast and the number of iterations will give him 100 iterations this time it's run that scene like
139
00:07:56,570 --> 00:08:02,930
crazy fast and you can see that because there's so many more iterations He's got more information more
140
00:08:02,930 --> 00:08:09,500
opportunity to experiment and to actually build out this this matrix or matrix these values for every
141
00:08:09,500 --> 00:08:10,240
single state.
142
00:08:10,250 --> 00:08:13,220
He now knows you can see that zero point eighty nine.
143
00:08:13,250 --> 00:08:16,050
What did we say in our zero point 86.
144
00:08:16,120 --> 00:08:20,660
Another thing to remember is that the value of any given state.
145
00:08:20,720 --> 00:08:24,230
Remember that formula we had is the maximum of the cube values.
146
00:08:24,230 --> 00:08:27,160
Remember that thing that we came up with shortcut formula.
147
00:08:27,170 --> 00:08:30,690
So what is it what with the value in this state be the V of this.
148
00:08:30,900 --> 00:08:32,060
It would be 0.18.
149
00:08:32,060 --> 00:08:37,870
Because that's the highest out of the four here the value of this state 0.7 you want the value of this
150
00:08:37,870 --> 00:08:38,180
day.
151
00:08:38,210 --> 00:08:40,260
Is there is point sixty one and so on.
152
00:08:40,400 --> 00:08:41,480
So that's something to remember.
153
00:08:41,490 --> 00:08:45,590
I remember when I was up I think we had like zero point 86 or something so praecox.
154
00:08:45,770 --> 00:08:55,060
And so if we go next year I'll just disappear or disappear again and this can make it come back.
155
00:08:55,170 --> 00:08:55,750
OK.
156
00:08:55,760 --> 00:08:56,210
OK.
157
00:08:56,210 --> 00:09:00,680
Slowly slowly slowly filling up some spaces.
158
00:09:00,970 --> 00:09:01,450
I see.
159
00:09:01,490 --> 00:09:06,170
And it's also pretty random because not only the environment has randomness but also the way he explores
160
00:09:06,170 --> 00:09:10,750
that the star really doesn't know the policy is he's exploring at random.
161
00:09:11,190 --> 00:09:12,150
Just keeps disappearing.
162
00:09:12,170 --> 00:09:13,420
I don't understand why.
163
00:09:13,680 --> 00:09:18,650
Anyways let's see what happens if you increase the number here and here should pretty much take the
164
00:09:18,650 --> 00:09:23,060
same amount of time if the speed doesn't have a cap on it.
165
00:09:23,480 --> 00:09:27,610
OK so he's like he has more opportunity to explore things.
166
00:09:27,650 --> 00:09:30,850
OK let's see how it all goes.
167
00:09:31,260 --> 00:09:35,010
And you can see the values are converging they go up and down depending you know because there is some
168
00:09:35,010 --> 00:09:38,640
randomness and he might end up like in the pit even though he goes like this way.
169
00:09:38,640 --> 00:09:44,940
But at the same time they're slowly starting to converge to some sort of values and cue values.
170
00:09:44,950 --> 00:09:48,540
OK probably a thousand is a bit too much in terms of time.
171
00:09:48,540 --> 00:09:53,250
It doesn't look like the speed is proportionally increasing as well.
172
00:09:53,610 --> 00:09:55,560
So it might cut that part.
173
00:09:55,650 --> 00:09:57,560
I mean like reduce the speed.
174
00:09:57,600 --> 00:10:02,850
You know while this is very low you don't have to watch to the end of this tutorial I just want to experiment
175
00:10:02,850 --> 00:10:08,430
with quite a bit so to give you some like examples of what we've been working through but you get the
176
00:10:08,430 --> 00:10:10,920
point that it goes through all of this.
177
00:10:10,950 --> 00:10:14,800
It has some randomness like Rambler's built into his behavior.
178
00:10:14,820 --> 00:10:20,720
So even when it has like a policy it will still continue exploring so it won't just like once it has
179
00:10:20,720 --> 00:10:23,420
a basic policy it won't just continue following its policy.
180
00:10:23,460 --> 00:10:29,130
It will still experiment with other variations once in a while in order to enhance its policy maybe
181
00:10:29,130 --> 00:10:31,350
it hasn't found the best policy already right away.
182
00:10:31,350 --> 00:10:33,240
Maybe it can improve the policy.
183
00:10:33,360 --> 00:10:40,080
And that's why even after so many iterations you can still see some random effects it is sometimes jumps
184
00:10:40,080 --> 00:10:45,060
in to random states not just because of the randomness in the environment but also because there is
185
00:10:45,060 --> 00:10:50,750
some level like a parameter which you could control which you could set up for your agent saying that's
186
00:10:50,820 --> 00:10:56,040
you know most of the time 80 percent of the time do whatever your policy tells you to do but 20 percent
187
00:10:56,040 --> 00:11:00,930
of the time you just have some fun experiment and see what happens and use that information that you
188
00:11:00,930 --> 00:11:03,410
gather to update your policy.
189
00:11:03,410 --> 00:11:05,300
OK this is taking way too long.
190
00:11:05,310 --> 00:11:06,360
Let's try that again.
191
00:11:06,560 --> 00:11:11,640
Yeah so that's how the agent learns in different states.
192
00:11:11,640 --> 00:11:14,270
Maybe let's just run one more just out of curiosity.
193
00:11:14,280 --> 00:11:16,590
So is there anything else we can change about it.
194
00:11:18,420 --> 00:11:20,110
Iterations.
195
00:11:21,630 --> 00:11:22,400
OK.
196
00:11:22,430 --> 00:11:24,280
OK let's have a look.
197
00:11:24,550 --> 00:11:26,680
Yeah well we could change the discussion for example.
198
00:11:26,680 --> 00:11:39,860
So in this case we could say K minus a hundred minus a Q minus two and minus are OK thousand.
199
00:11:39,920 --> 00:11:41,380
So reward.
200
00:11:41,390 --> 00:11:47,920
We want to keep it maybe let's keep it at 0.04 But let's say set against this keep the reward at my
201
00:11:47,920 --> 00:11:49,270
desert point zero for every time.
202
00:11:49,280 --> 00:11:58,340
And then here we're going to say that the discount is not zero point nine but it's like zero point point
203
00:11:58,340 --> 00:11:59,030
five.
204
00:11:59,060 --> 00:12:02,300
So it gets discounted quite a lot as you go through the game.
205
00:12:02,600 --> 00:12:08,960
So it actually now will be incentivized to be closer to the finish rather than further route the states
206
00:12:08,960 --> 00:12:14,060
close to finish will get a high value so you can see that the values really drops off it's not as as
207
00:12:14,060 --> 00:12:15,400
green as it was before.
208
00:12:16,360 --> 00:12:20,190
So here you can see that this is the policy now.
209
00:12:20,380 --> 00:12:26,490
So it goes like that like that like that very similar to what we saw before just probably only differences
210
00:12:26,500 --> 00:12:28,830
from here jumping straight into here.
211
00:12:28,840 --> 00:12:29,980
So that's one.
212
00:12:30,000 --> 00:12:32,500
And OK let's just run one more.
213
00:12:32,500 --> 00:12:33,510
This is so much fun.
214
00:12:33,580 --> 00:12:39,020
Let's just run one more so k minus k 100 a q discard.
215
00:12:39,130 --> 00:12:48,960
Keep it as it was original So let's just run this basic vanilla set up ok ok ok.
216
00:12:49,110 --> 00:12:51,110
It's going to see if it will show us the policy.
217
00:12:51,210 --> 00:12:54,820
And yes we got the policy.
218
00:12:54,840 --> 00:12:55,150
Yes.
219
00:12:55,150 --> 00:12:56,350
Good finish.
220
00:12:56,350 --> 00:12:58,820
So here we've got the policy.
221
00:12:58,900 --> 00:12:59,830
You know this is familiar.
222
00:12:59,830 --> 00:13:05,260
Remember that time when we saw that the AI outsmarted the human bomb into the wall to go there and boom
223
00:13:05,290 --> 00:13:08,530
into the wall to go like that to increase the problem.
224
00:13:08,530 --> 00:13:09,270
So there we go.
225
00:13:09,280 --> 00:13:17,020
That is an example of artificial intelligence inaction very very basic simple kill earnings so no deep
226
00:13:17,020 --> 00:13:18,190
learning at this stage.
227
00:13:18,610 --> 00:13:23,810
But at the same time it's already pretty smart and I hope you enjoyed today's tutorial.
228
00:13:23,810 --> 00:13:29,210
And once again thank you to UC Berkeley and I hope you enjoyed today's tutorial and I look forward scenics
229
00:13:29,230 --> 00:13:29,630
them.
230
00:13:29,650 --> 00:13:31,120
Until then enjoy AI.
22701
Can't find what you're looking for?
Get subtitles in any language from opensubtitles.com, and translate them here.