subtitlecat.com

All language subtitles for 015 Q-Learning Visualization-subtitle-en

Afrikaans

Albanian

Amharic

Arabic

Armenian

Azerbaijani

Basque

Belarusian

Bengali

Bosnian

Bulgarian

Catalan

Cebuano

Chichewa

Chinese (Simplified)

Chinese (Traditional)

Corsican

Croatian

Czech

Danish

Dutch

English

Esperanto

Estonian

Filipino

Finnish

French

Frisian

Galician

Georgian

German

Greek

Gujarati

Haitian Creole

Hausa

Hawaiian

Hebrew

Hindi

Hmong

Hungarian

Icelandic

Igbo

Indonesian

Irish

Italian

Japanese

Javanese

Kannada

Kazakh

Khmer

Korean

Kurdish (Kurmanji)

Kyrgyz

Lao

Latin

Latvian

Lithuanian

Luxembourgish

Macedonian

Malagasy

Malay

Malayalam

Maltese

Maori

Marathi

Mongolian

Myanmar (Burmese)

Nepali

Norwegian

Pashto

Persian

Polish

Portuguese

Punjabi

Romanian

Russian

Samoan

Scots Gaelic

Serbian

Sesotho

Shona

Sindhi

Sinhala

Slovak

Slovenian

Somali

Spanish

Sundanese

Swahili

Swedish

Tajik

Tamil

Telugu

Thai

Turkish

Ukrainian

Urdu Download

Uzbek

Vietnamese

Welsh

Xhosa

Yiddish

Yoruba

Zulu

Odia (Oriya)

Kinyarwanda

Turkmen

Tatar

Uyghur

Would you like to inspect the original subtitles? These are the user uploaded subtitles that are being translated: 1 00:00:00,620 --> 00:00:04,010 Hello and welcome back to the course on artificial intelligence. 2 00:00:04,010 --> 00:00:05,940 In today's tutorial we're going to have some fun. 3 00:00:05,960 --> 00:00:11,900 We're going to have a look and artificial intelligence actually going through that maze that we've been 4 00:00:11,900 --> 00:00:18,740 talking about so long and is going to use kill learning to navigate its way and find the way out and 5 00:00:18,830 --> 00:00:24,350 we'll see what happens to the q values was going to happen to the policy and so on. 6 00:00:24,350 --> 00:00:26,310 So let's have a look. 7 00:00:26,330 --> 00:00:31,910 We're going to be using some materials kindly provided by the Berkeley University. 8 00:00:31,910 --> 00:00:40,700 So if you go to a I don't Birk only the E R K E L E Why don't you just go to that link again. 9 00:00:40,790 --> 00:00:47,510 You'll see this Web site and hear what we're going to be looking at is the need to go to we're going 10 00:00:47,550 --> 00:00:49,130 to go to PacMan projects. 11 00:00:49,130 --> 00:00:58,160 I think Pacman projects and here if you scroll down and you look at them in first learning this is what 12 00:00:58,160 --> 00:00:59,050 we're working with. 13 00:00:59,180 --> 00:01:01,700 So here you can download the zip archive. 14 00:01:01,700 --> 00:01:03,500 So that's if you want to. 15 00:01:03,530 --> 00:01:08,330 So don't you don't have to this is we're not going to go through a solution together in this trial just 16 00:01:08,330 --> 00:01:11,860 letting you know where this is all from because we're very like. 17 00:01:11,870 --> 00:01:12,930 We really appreciate that. 18 00:01:12,980 --> 00:01:16,180 UC Berkeley has made these materials available. 19 00:01:16,190 --> 00:01:19,300 But if you do wish to experiment with this on your own. 20 00:01:19,400 --> 00:01:20,660 Just bear in mind this is not part. 21 00:01:20,680 --> 00:01:23,310 Is not going to be part of our courses as part of the Berkeley course. 22 00:01:23,330 --> 00:01:27,860 I'm not sure how it works for illustration purposes but if you do want to experiment with this you can 23 00:01:27,860 --> 00:01:31,340 find it here the zip archive and all old instructions as well. 24 00:01:31,430 --> 00:01:38,450 And we're just going to go into Python right away and first thing I wanted to show you is that here 25 00:01:38,450 --> 00:01:42,790 we've got the licensing information so this is what I mean. 26 00:01:42,870 --> 00:01:47,720 We're very lucky that they said we are free to use or extend these projects for educational purposes 27 00:01:47,720 --> 00:01:51,120 provided you know distributing publish solutions which we're not going to do. 28 00:01:51,200 --> 00:01:56,750 You retain this notice which we have and you provide a clear archbishop to UC Berkeley including a link 29 00:01:56,780 --> 00:01:57,860 to which we also have. 30 00:01:57,860 --> 00:02:00,750 So once again if you'd like to learn more there the link. 31 00:02:00,770 --> 00:02:01,720 You can have a look. 32 00:02:01,730 --> 00:02:07,490 And thank you very much all of these people who have worked on this project so here's the grid world. 33 00:02:07,490 --> 00:02:09,370 We're going to be working if there is a solution there. 34 00:02:09,460 --> 00:02:15,110 You would have to in order to make it work you'd have to solve it yourself or possibly find a solution. 35 00:02:15,110 --> 00:02:18,980 Maybe some of your some people somebody you know might help you out with that. 36 00:02:19,160 --> 00:02:24,260 If again what you want to you don't have to because we are just going to look at it on this screen right 37 00:02:24,320 --> 00:02:25,110 now. 38 00:02:25,160 --> 00:02:29,720 So after we've created all those files we could just launch it over here. 39 00:02:29,720 --> 00:02:36,680 So there are some parameters that are involved in this whole world and we're not going to just show 40 00:02:36,680 --> 00:02:39,080 you what it looks like if we launch it. 41 00:02:39,080 --> 00:02:41,540 So let's try to launch it in manual mode. 42 00:02:41,540 --> 00:02:47,070 So if I go minus one of these panoramas are manual so I can command your control agent. 43 00:02:47,090 --> 00:02:52,820 So here you can see all grids so I can go up up so you can see that it's taking action starting and 44 00:02:52,820 --> 00:02:54,980 started in states where I was. 45 00:02:55,100 --> 00:03:00,650 And then you saw that I pressed up took action Norf and first time I ended up in zero once I did go 46 00:03:00,650 --> 00:03:01,310 up. 47 00:03:01,490 --> 00:03:05,000 But second time I took action Norf and I ended in the same sad didn't move. 48 00:03:05,000 --> 00:03:08,440 So something happened you know the randomness happened I either went left or right. 49 00:03:08,780 --> 00:03:10,910 And by default the parameters are set. 50 00:03:10,910 --> 00:03:16,910 You can see here by default they're set to exactly what we discussed that how often actually results 51 00:03:16,940 --> 00:03:18,250 in unintended direction. 52 00:03:18,270 --> 00:03:20,960 20 percent of the time to 10 percent to the left some to the right. 53 00:03:21,230 --> 00:03:23,520 So if I go up and say I went up I go right. 54 00:03:23,520 --> 00:03:26,810 I went right right now didn't happen. 55 00:03:26,810 --> 00:03:29,790 Right again and right and I'm finished. 56 00:03:29,790 --> 00:03:35,810 But in this implementation you have to click again to get out of this final output so out of there just 57 00:03:35,810 --> 00:03:37,140 click again and you're finished. 58 00:03:37,190 --> 00:03:40,700 That's a terminal state so we can run our manual. 59 00:03:40,730 --> 00:03:45,620 You can see that if I go right right right left up. 60 00:03:45,740 --> 00:03:50,060 So here what we saw previously that the agent wouldn't go straight up right. 61 00:03:50,060 --> 00:03:53,300 What's the point of going up if there's a chance of going into the pit. 62 00:03:53,300 --> 00:03:54,580 So let's see what the agent would do. 63 00:03:54,610 --> 00:03:56,780 It would go left and go west here would go West. 64 00:03:56,780 --> 00:04:00,820 And you see I clicked left but it went up and here I would click right. 65 00:04:00,860 --> 00:04:05,390 And I end up in the final exit stage and you see God's reward equal to one. 66 00:04:05,390 --> 00:04:07,190 So that's what it looks like manually. 67 00:04:07,190 --> 00:04:12,520 Now let's actually hook up an AI to this and let it go through. 68 00:04:12,510 --> 00:04:16,800 So let's do an H here and let's add some Brandner. 69 00:04:16,820 --> 00:04:24,170 So let me just see what I typed here so hopefully you can see by grid world why then here minus our 70 00:04:24,230 --> 00:04:25,370 means. 71 00:04:25,370 --> 00:04:27,980 That's the reward for living. 72 00:04:27,980 --> 00:04:31,840 So I've got two of them so I probably should remove this one. 73 00:04:32,190 --> 00:04:35,050 So minus k is how many iterations. 74 00:04:35,060 --> 00:04:36,690 That's way too many iterations. 75 00:04:36,690 --> 00:04:41,180 Let's do less Let's do like 10 iterations should be enough. 76 00:04:41,180 --> 00:04:42,710 Minus a is Agent. 77 00:04:42,710 --> 00:04:47,040 What type of agent don't want to do honor and image and some value or a Q. 78 00:04:47,060 --> 00:04:49,120 Q So I want a Q. 79 00:04:49,190 --> 00:04:57,090 Q learning agent doing this minus s is what is s speed so that's way too large a force that just use 80 00:04:57,090 --> 00:05:04,780 the full speed for now minus R is a living penalty's so by default is zero. 81 00:05:04,820 --> 00:05:11,000 So remember at the very start restart 0 living penances so let's call it also 0 0 and can just remove 82 00:05:11,000 --> 00:05:16,040 this parameter and D is what is d discount. 83 00:05:16,040 --> 00:05:20,660 So I just kind of factor so let's keep it at zero point and so very similar to what we're starting off 84 00:05:20,660 --> 00:05:27,880 in this section on the course so let's run that OK way too fast again all actually so pretty OK so you 85 00:05:27,880 --> 00:05:30,130 can see how he's exploring. 86 00:05:30,580 --> 00:05:35,650 And so so far he's hit negative three times and you can see how the q values are being updated in these 87 00:05:35,650 --> 00:05:36,690 squares. 88 00:05:36,700 --> 00:05:37,860 So these are key values. 89 00:05:37,870 --> 00:05:39,310 They are sort of zero. 90 00:05:39,320 --> 00:05:40,740 You can see now the Q value. 91 00:05:40,740 --> 00:05:45,220 So he's learned that this one is a bit different implement because once you get to the final stage you 92 00:05:45,220 --> 00:05:46,560 have to get out of it. 93 00:05:46,660 --> 00:05:48,990 You have to just click one more button to exit. 94 00:05:49,000 --> 00:05:51,740 And so it's very close to one but not exactly one. 95 00:05:51,760 --> 00:05:57,530 But at the same time you can see that here you know the value slowly kind of crystallizing hands are 96 00:05:57,520 --> 00:06:02,290 a point a ex-colleague getting somewhere but they're just so far they're kind of zeroes because he doesn't 97 00:06:02,290 --> 00:06:05,470 have enough information to understand what's going on. 98 00:06:05,470 --> 00:06:08,710 OK so let's see let's see what happens here. 99 00:06:10,180 --> 00:06:13,620 Exploring exploring exploring what's going to happen. 100 00:06:13,710 --> 00:06:15,300 Well was being a while. 101 00:06:15,670 --> 00:06:17,940 And we get this some randomness involved here. 102 00:06:18,100 --> 00:06:20,100 So there is that good one a few times. 103 00:06:20,110 --> 00:06:22,500 Now he only gets 10 iterations. 104 00:06:22,510 --> 00:06:26,780 So he's got to learn fast Ok I need you there. 105 00:06:27,220 --> 00:06:29,280 Let's see what's going on. 106 00:06:29,320 --> 00:06:30,050 Come on. 107 00:06:30,060 --> 00:06:31,820 Get out of that maze already. 108 00:06:32,840 --> 00:06:38,450 And yes 10 episodes so average it turns out that. 109 00:06:38,590 --> 00:06:40,430 That's not really interested in that. 110 00:06:40,460 --> 00:06:41,760 So here let's see. 111 00:06:41,760 --> 00:06:43,060 I've never seen enough of a click. 112 00:06:43,100 --> 00:06:43,460 Right. 113 00:06:43,460 --> 00:06:43,810 There we go. 114 00:06:43,820 --> 00:06:47,780 So you can see this is the policy that he came up with. 115 00:06:48,020 --> 00:06:50,860 Even through just 10 episodes he's already got a pulse. 116 00:06:50,890 --> 00:06:55,820 I'm going to go up a bomb and here I'm going to go down here I'm going to go down here I'm going to 117 00:06:55,820 --> 00:06:58,320 go into the wall and then I'm going to bounce we're here. 118 00:06:58,550 --> 00:06:59,620 That's pretty cool. 119 00:07:00,000 --> 00:07:00,250 OK. 120 00:07:00,260 --> 00:07:02,530 So now let's increase the speed. 121 00:07:02,650 --> 00:07:04,220 What was the parameter s there. 122 00:07:04,220 --> 00:07:06,240 And that's like double lawlessness. 123 00:07:06,260 --> 00:07:13,070 That's quadruple the speed and let's increase the number of iterations so let's say 20 to ration this 124 00:07:13,070 --> 00:07:16,390 time and let's see if he can get through a bit more now. 125 00:07:16,790 --> 00:07:18,700 So you can see he's going a bit faster. 126 00:07:19,600 --> 00:07:25,900 And he's learning he's learning that it's not really you know out of this state there's not many good 127 00:07:25,900 --> 00:07:30,220 actions Orio these actions that the right and straight are not that good. 128 00:07:30,250 --> 00:07:32,400 Definitely this was definitely not good. 129 00:07:32,410 --> 00:07:34,680 He still needs to learn that so from here is also good. 130 00:07:34,680 --> 00:07:36,820 You can see that this action is pretty good. 131 00:07:36,820 --> 00:07:37,330 All right. 132 00:07:37,330 --> 00:07:38,380 What did he get. 133 00:07:38,530 --> 00:07:39,100 OK. 134 00:07:39,100 --> 00:07:42,200 So interesting policy here you we decide to go up. 135 00:07:42,330 --> 00:07:43,270 Just not enough information. 136 00:07:43,270 --> 00:07:45,610 So let's let's really do that. 137 00:07:46,850 --> 00:07:50,370 And let's increase the speed to like 100. 138 00:07:50,630 --> 00:07:56,570 Super fast and the number of iterations will give him 100 iterations this time it's run that scene like 139 00:07:56,570 --> 00:08:02,930 crazy fast and you can see that because there's so many more iterations He's got more information more 140 00:08:02,930 --> 00:08:09,500 opportunity to experiment and to actually build out this this matrix or matrix these values for every 141 00:08:09,500 --> 00:08:10,240 single state. 142 00:08:10,250 --> 00:08:13,220 He now knows you can see that zero point eighty nine. 143 00:08:13,250 --> 00:08:16,050 What did we say in our zero point 86. 144 00:08:16,120 --> 00:08:20,660 Another thing to remember is that the value of any given state. 145 00:08:20,720 --> 00:08:24,230 Remember that formula we had is the maximum of the cube values. 146 00:08:24,230 --> 00:08:27,160 Remember that thing that we came up with shortcut formula. 147 00:08:27,170 --> 00:08:30,690 So what is it what with the value in this state be the V of this. 148 00:08:30,900 --> 00:08:32,060 It would be 0.18. 149 00:08:32,060 --> 00:08:37,870 Because that's the highest out of the four here the value of this state 0.7 you want the value of this 150 00:08:37,870 --> 00:08:38,180 day. 151 00:08:38,210 --> 00:08:40,260 Is there is point sixty one and so on. 152 00:08:40,400 --> 00:08:41,480 So that's something to remember. 153 00:08:41,490 --> 00:08:45,590 I remember when I was up I think we had like zero point 86 or something so praecox. 154 00:08:45,770 --> 00:08:55,060 And so if we go next year I'll just disappear or disappear again and this can make it come back. 155 00:08:55,170 --> 00:08:55,750 OK. 156 00:08:55,760 --> 00:08:56,210 OK. 157 00:08:56,210 --> 00:09:00,680 Slowly slowly slowly filling up some spaces. 158 00:09:00,970 --> 00:09:01,450 I see. 159 00:09:01,490 --> 00:09:06,170 And it's also pretty random because not only the environment has randomness but also the way he explores 160 00:09:06,170 --> 00:09:10,750 that the star really doesn't know the policy is he's exploring at random. 161 00:09:11,190 --> 00:09:12,150 Just keeps disappearing. 162 00:09:12,170 --> 00:09:13,420 I don't understand why. 163 00:09:13,680 --> 00:09:18,650 Anyways let's see what happens if you increase the number here and here should pretty much take the 164 00:09:18,650 --> 00:09:23,060 same amount of time if the speed doesn't have a cap on it. 165 00:09:23,480 --> 00:09:27,610 OK so he's like he has more opportunity to explore things. 166 00:09:27,650 --> 00:09:30,850 OK let's see how it all goes. 167 00:09:31,260 --> 00:09:35,010 And you can see the values are converging they go up and down depending you know because there is some 168 00:09:35,010 --> 00:09:38,640 randomness and he might end up like in the pit even though he goes like this way. 169 00:09:38,640 --> 00:09:44,940 But at the same time they're slowly starting to converge to some sort of values and cue values. 170 00:09:44,950 --> 00:09:48,540 OK probably a thousand is a bit too much in terms of time. 171 00:09:48,540 --> 00:09:53,250 It doesn't look like the speed is proportionally increasing as well. 172 00:09:53,610 --> 00:09:55,560 So it might cut that part. 173 00:09:55,650 --> 00:09:57,560 I mean like reduce the speed. 174 00:09:57,600 --> 00:10:02,850 You know while this is very low you don't have to watch to the end of this tutorial I just want to experiment 175 00:10:02,850 --> 00:10:08,430 with quite a bit so to give you some like examples of what we've been working through but you get the 176 00:10:08,430 --> 00:10:10,920 point that it goes through all of this. 177 00:10:10,950 --> 00:10:14,800 It has some randomness like Rambler's built into his behavior. 178 00:10:14,820 --> 00:10:20,720 So even when it has like a policy it will still continue exploring so it won't just like once it has 179 00:10:20,720 --> 00:10:23,420 a basic policy it won't just continue following its policy. 180 00:10:23,460 --> 00:10:29,130 It will still experiment with other variations once in a while in order to enhance its policy maybe 181 00:10:29,130 --> 00:10:31,350 it hasn't found the best policy already right away. 182 00:10:31,350 --> 00:10:33,240 Maybe it can improve the policy. 183 00:10:33,360 --> 00:10:40,080 And that's why even after so many iterations you can still see some random effects it is sometimes jumps 184 00:10:40,080 --> 00:10:45,060 in to random states not just because of the randomness in the environment but also because there is 185 00:10:45,060 --> 00:10:50,750 some level like a parameter which you could control which you could set up for your agent saying that's 186 00:10:50,820 --> 00:10:56,040 you know most of the time 80 percent of the time do whatever your policy tells you to do but 20 percent 187 00:10:56,040 --> 00:11:00,930 of the time you just have some fun experiment and see what happens and use that information that you 188 00:11:00,930 --> 00:11:03,410 gather to update your policy. 189 00:11:03,410 --> 00:11:05,300 OK this is taking way too long. 190 00:11:05,310 --> 00:11:06,360 Let's try that again. 191 00:11:06,560 --> 00:11:11,640 Yeah so that's how the agent learns in different states. 192 00:11:11,640 --> 00:11:14,270 Maybe let's just run one more just out of curiosity. 193 00:11:14,280 --> 00:11:16,590 So is there anything else we can change about it. 194 00:11:18,420 --> 00:11:20,110 Iterations. 195 00:11:21,630 --> 00:11:22,400 OK. 196 00:11:22,430 --> 00:11:24,280 OK let's have a look. 197 00:11:24,550 --> 00:11:26,680 Yeah well we could change the discussion for example. 198 00:11:26,680 --> 00:11:39,860 So in this case we could say K minus a hundred minus a Q minus two and minus are OK thousand. 199 00:11:39,920 --> 00:11:41,380 So reward. 200 00:11:41,390 --> 00:11:47,920 We want to keep it maybe let's keep it at 0.04 But let's say set against this keep the reward at my 201 00:11:47,920 --> 00:11:49,270 desert point zero for every time. 202 00:11:49,280 --> 00:11:58,340 And then here we're going to say that the discount is not zero point nine but it's like zero point point 203 00:11:58,340 --> 00:11:59,030 five. 204 00:11:59,060 --> 00:12:02,300 So it gets discounted quite a lot as you go through the game. 205 00:12:02,600 --> 00:12:08,960 So it actually now will be incentivized to be closer to the finish rather than further route the states 206 00:12:08,960 --> 00:12:14,060 close to finish will get a high value so you can see that the values really drops off it's not as as 207 00:12:14,060 --> 00:12:15,400 green as it was before. 208 00:12:16,360 --> 00:12:20,190 So here you can see that this is the policy now. 209 00:12:20,380 --> 00:12:26,490 So it goes like that like that like that very similar to what we saw before just probably only differences 210 00:12:26,500 --> 00:12:28,830 from here jumping straight into here. 211 00:12:28,840 --> 00:12:29,980 So that's one. 212 00:12:30,000 --> 00:12:32,500 And OK let's just run one more. 213 00:12:32,500 --> 00:12:33,510 This is so much fun. 214 00:12:33,580 --> 00:12:39,020 Let's just run one more so k minus k 100 a q discard. 215 00:12:39,130 --> 00:12:48,960 Keep it as it was original So let's just run this basic vanilla set up ok ok ok. 216 00:12:49,110 --> 00:12:51,110 It's going to see if it will show us the policy. 217 00:12:51,210 --> 00:12:54,820 And yes we got the policy. 218 00:12:54,840 --> 00:12:55,150 Yes. 219 00:12:55,150 --> 00:12:56,350 Good finish. 220 00:12:56,350 --> 00:12:58,820 So here we've got the policy. 221 00:12:58,900 --> 00:12:59,830 You know this is familiar. 222 00:12:59,830 --> 00:13:05,260 Remember that time when we saw that the AI outsmarted the human bomb into the wall to go there and boom 223 00:13:05,290 --> 00:13:08,530 into the wall to go like that to increase the problem. 224 00:13:08,530 --> 00:13:09,270 So there we go. 225 00:13:09,280 --> 00:13:17,020 That is an example of artificial intelligence inaction very very basic simple kill earnings so no deep 226 00:13:17,020 --> 00:13:18,190 learning at this stage. 227 00:13:18,610 --> 00:13:23,810 But at the same time it's already pretty smart and I hope you enjoyed today's tutorial. 228 00:13:23,810 --> 00:13:29,210 And once again thank you to UC Berkeley and I hope you enjoyed today's tutorial and I look forward scenics 229 00:13:29,230 --> 00:13:29,630 them. 230 00:13:29,650 --> 00:13:31,120 Until then enjoy AI. 22701