Would you like to inspect the original subtitles? These are the user uploaded subtitles that are being translated:
1
00:00:00,830 --> 00:00:04,470
Hello and welcome back to the course on artificial intelligence.
2
00:00:04,580 --> 00:00:09,520
I hope you're excited about today's tutorial because we are taking our very first step into the world
3
00:00:09,520 --> 00:00:10,170
the I.
4
00:00:10,460 --> 00:00:13,150
And today we're talking about reinforcement learning.
5
00:00:13,280 --> 00:00:18,710
It's a very important story because it will underpin everything else is going to be happen in this course.
6
00:00:18,770 --> 00:00:21,010
So let's get started here.
7
00:00:21,020 --> 00:00:27,140
We've got a little maze and this maze is our representation of an environment and that's what we're
8
00:00:27,140 --> 00:00:29,210
going to be dealing with in this course.
9
00:00:29,210 --> 00:00:34,040
We're going to be dealing with certain environments in which our artificial intelligence is going to
10
00:00:34,040 --> 00:00:39,950
be performing it's going to be taking actions it's going to be looking to beat these in my going she'll
11
00:00:39,950 --> 00:00:42,350
be looking to win in these environments.
12
00:00:42,350 --> 00:00:44,190
And here we've got an agent.
13
00:00:44,360 --> 00:00:46,990
The agent is our artificial intelligence.
14
00:00:47,030 --> 00:00:52,910
That's the person or that's the mind that's going to be navigating these environments and learning from
15
00:00:53,000 --> 00:00:57,110
the feedback that their minds are going to be giving it in order to perform certain actions.
16
00:00:57,150 --> 00:01:02,180
And so the way it works is the agent perform certain actions in this environment.
17
00:01:02,360 --> 00:01:09,050
And as a result the state in which it is in will change so it might be further or closer or more to
18
00:01:09,050 --> 00:01:10,070
the left more to the right.
19
00:01:10,070 --> 00:01:15,030
It might have sort of the other parameters that describe it state and those parameters.
20
00:01:15,100 --> 00:01:20,720
So the state is going to change because of the action takes and it will also get rewards based on the
21
00:01:20,720 --> 00:01:20,970
action.
22
00:01:20,970 --> 00:01:24,950
So every time it takes an action the state will change and it'll get reward.
23
00:01:24,950 --> 00:01:29,170
Now bear in mind sometimes it might happen that it won't change the state the action won't change a
24
00:01:29,170 --> 00:01:33,070
stay or there won't be a reward for taking that action.
25
00:01:33,110 --> 00:01:34,530
In that sense it was.
26
00:01:34,670 --> 00:01:38,480
But nevertheless the agent's going to keep doing that was going to be taking actions cheating the state
27
00:01:38,480 --> 00:01:42,510
getting rewards changing action taking actions changing the state and getting rewards.
28
00:01:42,800 --> 00:01:47,840
And by doing that process it's going to be learning about what was going to be exploring the environment
29
00:01:48,200 --> 00:01:53,970
understanding what actions lead to good rewards and favorable states and what actions the two rewards
30
00:01:53,990 --> 00:01:55,840
an unfavorable state.
31
00:01:56,000 --> 00:01:59,690
And this is a very simplistic representational very global problem.
32
00:01:59,690 --> 00:02:04,390
So if you think about it environments actually don't have to be just mazes.
33
00:02:04,400 --> 00:02:09,170
It's not just about getting out of a maze or finding a treasure in a maze.
34
00:02:09,170 --> 00:02:11,740
An environment can be pretty much anything in life.
35
00:02:11,750 --> 00:02:15,180
So imagine you waking up in the morning and cooking an omelette.
36
00:02:15,410 --> 00:02:22,010
So in order to make that omelet you need to go through certain steps you need to get the salt get the
37
00:02:22,010 --> 00:02:27,770
eggs get the frying pans which to fire on and so on and it does sound like a routine mundane thing.
38
00:02:27,770 --> 00:02:29,870
But it's become routine because you've done it so many times.
39
00:02:29,960 --> 00:02:34,670
But in reality it's an environment where you're performing certain actions you're taking that you putting
40
00:02:34,670 --> 00:02:40,250
the fire on you putting a frying pan on the fire you're putting all the eggs into the frying pan and
41
00:02:40,250 --> 00:02:43,190
you put some salt on the eggs and you're turning over and so on.
42
00:02:43,190 --> 00:02:49,970
So as you can see they are CRN actions actions which are taking in certain states and those actions
43
00:02:49,970 --> 00:02:52,460
lead to certain other states and sometimes reward.
44
00:02:52,460 --> 00:02:57,650
So for instance when you put the fire on and you wait wait wait wait wait you take an action of wait
45
00:02:57,650 --> 00:03:01,900
wait wait wait too long and then you put the eggs in into the frying pan.
46
00:03:01,910 --> 00:03:03,560
The rewards are going to be very negative.
47
00:03:03,560 --> 00:03:05,120
It's all going to burn.
48
00:03:05,120 --> 00:03:10,130
On the other hand if you do all the all the correct actions in the correct time so it's also very important
49
00:03:10,130 --> 00:03:13,850
to understand that actions should be taken at the correct points in time.
50
00:03:13,850 --> 00:03:20,090
So for instance putting the salt in the frying pan before you put the eggs in might not be the best
51
00:03:20,090 --> 00:03:20,770
idea.
52
00:03:20,780 --> 00:03:26,190
You might want to take that action of putting the salt into the frying pan after the eggs are in there
53
00:03:26,200 --> 00:03:28,320
so that in a different state.
54
00:03:28,370 --> 00:03:29,620
So it's important to remember that.
55
00:03:29,780 --> 00:03:34,070
And at the same time so if you take all the correct actions in the correct order in the correct states
56
00:03:34,580 --> 00:03:38,840
your final reward could be that you get an omelet which you can eat.
57
00:03:38,900 --> 00:03:44,660
And so that's a very basic activity in your life but if you think about it it is actually an environment
58
00:03:44,990 --> 00:03:50,060
and you are the agent going through this environment and perform a task you don't really need to learn
59
00:03:50,060 --> 00:03:52,190
anything because you already know it pretty well.
60
00:03:52,220 --> 00:03:56,170
But at same time you could learn maybe you could learn how to make a better omelet or especially if
61
00:03:56,340 --> 00:03:59,010
it's your first omelet that you're making you're probably going to screw it up.
62
00:03:59,030 --> 00:04:04,010
But you will learn from that because you will understand what actions lead towards states and routes
63
00:04:04,490 --> 00:04:05,890
and anything else in life.
64
00:04:06,050 --> 00:04:11,900
For instance even trading on the stock market and you know buying and selling and getting certain feedback
65
00:04:11,900 --> 00:04:16,390
from the market in the sense of return positive or negative returns.
66
00:04:16,430 --> 00:04:20,160
That's also an environment that's you participating in that environment as an aged.
67
00:04:20,210 --> 00:04:25,220
Driving a car is also an environment where you can turn the steering wheel you can accelerate you can
68
00:04:25,220 --> 00:04:29,510
break and so on and you're getting feedback from the environment and you know one of those feedbacks
69
00:04:29,510 --> 00:04:35,840
is the policeman giving you a speeding fine if you're going above the acceptable or allowed speed limit
70
00:04:35,840 --> 00:04:36,960
on that highway.
71
00:04:37,040 --> 00:04:41,900
And therefore from there you learn that that's not something that should be done because it leads to
72
00:04:41,900 --> 00:04:43,020
a negative reward.
73
00:04:43,220 --> 00:04:45,590
So rewards don't have to be just at the very end of the process.
74
00:04:45,590 --> 00:04:48,020
They can be throughout the journey throughout the process.
75
00:04:48,020 --> 00:04:49,490
So those are a couple of examples.
76
00:04:49,490 --> 00:04:54,980
And in terms of a I the simplest way to think of reinforcement learning is like training a dog when
77
00:04:54,980 --> 00:05:00,270
you train the dog you to give it certain commands and if it obeys those commands then you give it a
78
00:05:00,440 --> 00:05:04,820
reach you give it like a biscuit or something if it doesn't Abeles Kamaz you tell it that it's a bad
79
00:05:04,820 --> 00:05:06,600
dog or you just don't give it a treat.
80
00:05:06,830 --> 00:05:13,820
And through that process it learns what certain commands or what it needs to do what action it needs
81
00:05:13,820 --> 00:05:18,470
to take in certain states and the states are the commands that you're giving it.
82
00:05:18,470 --> 00:05:22,700
And based on that it will get some certain rewards of course in the world of AI.
83
00:05:22,700 --> 00:05:24,590
It's not that complex.
84
00:05:24,590 --> 00:05:26,910
You don't have to give the treats.
85
00:05:26,960 --> 00:05:32,120
You don't have to have like a bag of biscuits with you every time you just give it a plus one or a minus
86
00:05:32,120 --> 00:05:37,290
one so it's a huge advantage that in the world of AI we've created these AIs ourselves.
87
00:05:37,310 --> 00:05:42,680
So the rewards that we're giving them if you think wow this is really cool rewards are giving them they
88
00:05:42,680 --> 00:05:48,490
don't actually exist they're just a plus or minus one or plus a one or a zero or something.
89
00:05:48,500 --> 00:05:51,100
So it's all nonexistence all imaginary stuff.
90
00:05:51,110 --> 00:05:56,300
But at the same time it leads to great results as we can create these amazing things these amazing artificial
91
00:05:56,300 --> 00:06:01,760
intelligence as by this amazing artificial intelligence by just providing rewards we don't really exist.
92
00:06:01,790 --> 00:06:05,670
Plus and minus one doesn't cost anything but same time release results.
93
00:06:05,900 --> 00:06:08,170
So very similar to real world.
94
00:06:08,210 --> 00:06:15,140
And you know for example Dokes But here the rewards are digital and just numbers.
95
00:06:15,140 --> 00:06:20,920
And with that in mind we can talk about about robot dogs I love this example so this is just around
96
00:06:20,920 --> 00:06:26,630
in pictures not necessarily that exact robot dog you know that is trained through reinforcement learning
97
00:06:26,710 --> 00:06:31,050
some of the robot dogs especially the older ones you'd have an algorithm in there.
98
00:06:31,370 --> 00:06:39,260
And this is actually a good example of the difference between preprogramed agents and reinforcement
99
00:06:39,260 --> 00:06:46,120
learning agent so you could have a robot dog which is preprogrammed to how to walk it will say.
100
00:06:46,160 --> 00:06:51,500
So in the in the algorithm behind the dog in the software will say OK so in order to walk you need to
101
00:06:52,370 --> 00:06:58,160
move your left leg forward left front leg forward then your back right leg forward then your front right
102
00:06:58,160 --> 00:07:02,480
leg forward then your back left leg forward and repeat that action and you know that's that's the definition
103
00:07:02,480 --> 00:07:04,870
of walking is a function inside this dog.
104
00:07:05,040 --> 00:07:09,060
And then it might have you know how to sit how to stand and things like that.
105
00:07:09,680 --> 00:07:16,360
Whereas in a robot dog that is trained through reinforcement learning what happens is you don't preprogram
106
00:07:16,360 --> 00:07:16,710
it.
107
00:07:16,730 --> 00:07:23,810
This is the key concept to everything here that you don't have any algorithm inside that is hard coded
108
00:07:23,810 --> 00:07:24,850
into the dog.
109
00:07:24,860 --> 00:07:28,300
Instead you have what we'll be discussing in the future.
110
00:07:28,460 --> 00:07:36,710
You have this reinforcement learning algorithm which is told that OK so the goal is from to get from
111
00:07:36,860 --> 00:07:41,990
where you are now not knowing anything to that to the end of the room for example.
112
00:07:42,170 --> 00:07:44,270
And here are the certain actions you can take.
113
00:07:44,270 --> 00:07:48,950
You can move your right foot you can move your left foot you can move your right back foot you are left
114
00:07:48,950 --> 00:07:53,000
back foot so here all the degrees of freedom you can do you can move it like this you can move like
115
00:07:53,000 --> 00:07:59,180
that so like a list of actions you can take and your rewards are every time you take a step forward
116
00:07:59,210 --> 00:08:01,430
you get a plus one every time you fall over.
117
00:08:01,430 --> 00:08:04,090
You get a minus one and that's all there is to it.
118
00:08:04,160 --> 00:08:07,390
And then they just leave the dog and let it figure it out on its own.
119
00:08:07,400 --> 00:08:13,460
So the dog tries to stand up it falls then it realizes that OK I shouldn't do that action that led to
120
00:08:13,460 --> 00:08:17,040
me falling because every time I fall I get a minus one which is not good for me then.
121
00:08:17,060 --> 00:08:21,560
So does the other action that helped him stand up and then it figures are just experiments experiments
122
00:08:21,560 --> 00:08:26,090
experiments tri's things randomly and then figures out that it can make a step forward by moving its
123
00:08:26,090 --> 00:08:31,410
right front foot and he gets a plus one and realize oh I should do more of that.
124
00:08:31,460 --> 00:08:35,620
OK cool so it now learns that it should do more of this and less of that.
125
00:08:35,630 --> 00:08:42,270
And through this learning process it quickly very quickly understands how it can walk.
126
00:08:42,410 --> 00:08:49,130
And those those dogs that figured out on their own can actually sometimes walk better than dogs that
127
00:08:49,130 --> 00:08:53,930
are preprogramed because really preprogrammed things we look at the real life dogs and or you know we
128
00:08:53,930 --> 00:08:59,960
use our own imagination how to do it whereas a reinforcement learning dog can optimize things on its
129
00:08:59,960 --> 00:09:00,300
own.
130
00:09:00,320 --> 00:09:03,540
And because in AI sometimes it can get even better results.
131
00:09:03,680 --> 00:09:05,290
And that's how they can train these robot.
132
00:09:05,320 --> 00:09:07,320
The same robot dogs to play soccer.
133
00:09:07,520 --> 00:09:12,970
You can train a normal dog to play soccer because you know simply the whole approach is different.
134
00:09:12,980 --> 00:09:20,900
And it's not something that you know probably a normal dog has been trained to do or has ever done in
135
00:09:20,900 --> 00:09:23,030
its process of its evolution.
136
00:09:23,030 --> 00:09:28,190
Whereas a reinforcement learning robot dogs can very easily understand how to play soccer as long as
137
00:09:28,190 --> 00:09:32,760
you tell them what the rewards are what the goals are what the possible actions they can take.
138
00:09:33,080 --> 00:09:36,390
So that is how reinforcement learning works.
139
00:09:36,410 --> 00:09:39,160
In general there's a quick overview of reinforcement learning.
140
00:09:39,170 --> 00:09:45,500
I hope that got you very excited about was going to come next because it's a completely different world
141
00:09:45,530 --> 00:09:51,980
compared to preprogram solutions a hard program hardcoded solutions where you have the if else conditions.
142
00:09:51,980 --> 00:09:53,750
This is very different.
143
00:09:53,840 --> 00:09:56,010
And we're going to be talking more about that.
144
00:09:56,150 --> 00:10:03,400
In the meantime we've got some additional reading for you so if you'd like to have some supporting materials
145
00:10:03,700 --> 00:10:06,810
Here's a great article which you can look and look into.
146
00:10:06,830 --> 00:10:09,300
It's called simple reinforcement learning with tensor flow.
147
00:10:09,430 --> 00:10:10,570
It's got ten parts.
148
00:10:10,570 --> 00:10:14,790
The link is here and you'll find the full clickable link on.
149
00:10:14,820 --> 00:10:22,540
In the course of resources by Arthur Giuliani's 2016 article and you can follow along this course and
150
00:10:22,540 --> 00:10:24,770
also get additional information from that article.
151
00:10:24,790 --> 00:10:30,010
But bear in mind that that article is tends to flow where as in this course we are using pi torche so
152
00:10:30,520 --> 00:10:35,830
different implementation but implantations but at the same time you might pick up a few things here
153
00:10:35,830 --> 00:10:41,260
and there that might supplement your learning that we're going to be doing in this course.
154
00:10:41,260 --> 00:10:44,910
So great articles follow you in if you're considering following it for sure.
155
00:10:44,920 --> 00:10:45,820
Still just in case.
156
00:10:45,820 --> 00:10:51,890
Check out that that first part and see if you like it see if you'd like to read it a bit more.
157
00:10:52,210 --> 00:10:58,210
And then we've got specific to this tutorial a border enforcement learning there's a paper by Richard
158
00:10:58,210 --> 00:11:00,380
Sutton which is called reinforcement learning.
159
00:11:00,420 --> 00:11:08,170
One introduction is the 1998 papers are quite old but at the same time you can learn a bit about reinforcement
160
00:11:08,170 --> 00:11:13,960
learning some of the examples like that omlet example and other examples of where reinforcement learning
161
00:11:13,960 --> 00:11:17,710
can be applied and just a general overview of reinforcement learning.
162
00:11:17,710 --> 00:11:23,220
If you are looking for some additional reading and on that note we're going to wrap up this tutorial.
163
00:11:23,230 --> 00:11:24,640
Can't wait to see you next time.
164
00:11:24,640 --> 00:11:26,560
And until then enjoy AI.
19208
Can't find what you're looking for?
Get subtitles in any language from opensubtitles.com, and translate them here.