English transcript for 066 Gradient Descent

Hello, and welcome back to the course on deep learning. In today's tutorial we're talking about gradient descent.

What we learned previously is that for a neural network to learn, backpropagation needs to happen: the error, the difference (or the sum of squared differences) between y-hat and y, is propagated back through the network and the weights are adjusted accordingly. We saw that already, and today we're going to learn exactly how those weights are adjusted.

So let's have a look. This is our very simple version of a neural network, a perceptron, a single-layer feedforward network, and what we can see here is the whole process in action: we have an input value, then a weight, then an activation function is applied, we get y-hat, and then we compare it to the actual value and calculate the cost function.

So how can we minimize the cost function? What can we do about it? Well, one approach is brute force, where we just try lots of different possible weights and see which one looks best. For instance, we could try out, say, a thousand weights, and we would get something like this for the cost function: a chart with the cost function on the vertical axis and y-hat on the horizontal axis. Because the formula is (y-hat minus y) squared, the cost function looks something like this parabola, and you would find that the best weight is the one over here at the bottom.

So it's a very simple, very intuitive approach. Why not use this brute-force method? Why not just try a thousand different values for each weight and see which one works best? Well, if you have just one weight to optimize, this might work. But as you increase the number of weights, the number of synapses in your network, you run into the curse of dimensionality. So what is the curse of dimensionality? The best way to explain it is to look at a practical example. Remember the example we had when we were talking about how neural networks actually work, where we were building a neural network for a property valuation.
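To make the brute-force idea concrete before we count the weights in that network, here is a minimal sketch of an exhaustive search over a single weight. The toy data, the grid of 1,000 candidate weights, and the simple linear model y_hat = w * x are illustrative assumptions, not something taken from the lecture:

import numpy as np

# Toy data (hypothetical): one input feature and its target values.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])      # underlying relationship: y = 2x

# Try 1,000 candidate weights on a fixed grid, as described above.
candidate_weights = np.linspace(-10.0, 10.0, 1000)

def cost(w):
    y_hat = w * x                          # single weight, identity activation
    return 0.5 * np.sum((y_hat - y) ** 2)  # the 1/2 is a common convention;
                                           # it does not change the minimizer

costs = [cost(w) for w in candidate_weights]
best_w = candidate_weights[int(np.argmin(costs))]
print(best_w)                              # close to 2.0 on this toy data

Even on this toy problem the search scales linearly in the number of candidate values per weight and, as the lecture explains next, exponentially in the number of weights.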
So this is what it looked like once it was already trained up. But when it's not trained, before we know what the weights are, the actual neural network looks like this: we have all of these possible synapses, and we still have to train up the weights. Here we have a total of 25 weights: four times five at the start, plus five more from the hidden layer to the output, so 25 weights in total. Let's see how we could possibly brute-force 25 weights. This is a very simple neural network, just one hidden layer. How could we brute-force our way through a network of this size? Well, some simple math: we have 25 weights, so if we try a thousand values for every weight, the total number of combinations is 1000 to the power of 25, which is 10 to the power of 75 different combinations.

Now let's see how Sunway TaihuLight, the world's fastest supercomputer as of June 2016, would approach this problem. Sunway TaihuLight looks like this: pretty much a whole huge building for this one supercomputer, and it holds the Guinness World Record for being the fastest supercomputer; right now it is the fastest supercomputer in the world. Sunway TaihuLight can operate at a speed of 93 petaflops. FLOPS stands for floating-point operations per second, so it can do 93 times 10 to the power of 15 floating-point operations per second. That's how quick it is. In comparison, average computers right now do just over several gigaflops, so they are in those kinds of ranges, far below Sunway TaihuLight, which really is at the forefront of technology.

Now let's say, hypothetically, that it can do one test, one combination of weights for our neural network, in one floating-point operation. That's not actually possible, not practical, because you need multiple floating-point operations to test even a single weight in a neural network. But let's give it a head start and say that, in an ideal world, it can do one test per floating-point operation. That means it would still require 10 to the power of 75 divided by 93 times 10 to the power of 15 seconds to run all of those tests, to brute-force through that network. That works out to approximately 10 to the power of 58 seconds, which is on the order of 10 to the power of 50 years.
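That back-of-the-envelope figure is easy to verify. Here is a minimal check of the arithmetic just quoted, using only the lecture's own assumptions (one candidate setting tested per floating-point operation, at a sustained 93 petaflops):

# Checking the numbers from the lecture.
combinations = 1000 ** 25              # 10**75 candidate weight settings
flops_per_second = 93e15               # Sunway TaihuLight: ~93 petaFLOPS
seconds = combinations / flops_per_second
years = seconds / (60 * 60 * 24 * 365)
print(f"{seconds:.1e} seconds, roughly {years:.1e} years")
# prints about 1.1e+58 seconds, roughly 3.4e+50 years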
That is a huge number, longer than the universe has existed, and it is definitely not going to work for our optimization at all. So there we go: this is a no-go, even on the world's fastest supercomputer, Sunway TaihuLight. We have to come up with a different approach to finding the optimal weights. And by the way, this was a very simple neural network; what if the network looks something like this, or is even bigger than that? Then it's just not going to happen at all, ever.

So the method we're going to look at is called gradient descent, and you may have heard of it already; if not, we'll find out what it is right now. Here is our cost function, and now we're going to see a faster way to find the best option. Let's say we start somewhere; you have to start somewhere, so we start over there, at that point in the top left. From that point, we look at the angle, the slope, of the cost function. That's what's called the gradient, because you have to differentiate. We're not going to go through the mathematical equations here; we'll provide some tips on additional reading at the end of the next lecture. But basically you differentiate to find the slope at that specific point, and you find out whether the slope is positive or negative. If the slope is negative, as in this case, it means you're going downhill: to the right is downhill, to the left is uphill, so you need to go right, you need to go downhill. And that's what we do: boom, we take a step to the right, and the ball rolls down. Then the same thing again: you calculate the slope, and this time it's positive, meaning right is uphill and left is downhill, so you need to go left, and you roll the ball down. And again you calculate the slope, and you roll the ball, and there you go: it settles at the bottom. In simple terms, that's how you find the best weights, the setting that minimizes your cost function.
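In code, this slope-and-step procedure is just a loop: compute the derivative of the cost with respect to the weight, then move the weight a small step against the slope. Here is a minimal sketch for a single weight; the toy data, the identity activation (so the derivative has a simple closed form), the learning rate of 0.01, and the 200 iterations are all illustrative assumptions:

import numpy as np

# Same kind of toy data as before (hypothetical).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

w = -8.0                # arbitrary starting point ("start somewhere")
learning_rate = 0.01    # step size; too large and the ball overshoots

for step in range(200):
    y_hat = w * x                       # forward pass with identity activation
    grad = np.sum(x * (y_hat - y))      # d/dw of 0.5 * sum((y_hat - y)**2)
    w = w - learning_rate * grad        # step downhill, against the slope

print(w)                                # converges to about 2.0

The key point is that each iteration needs only one gradient evaluation, so the work no longer explodes with the number of weights the way the brute-force search does.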
Of course, it's not actually going to be a smooth ball rolling; it's going to be a rather zigzag type of approach. It's just easier to remember, and more fun, to picture it as a ball rolling, but in reality it's a step-by-step, zigzag kind of method. There are also lots of other elements to it, for instance: why does it settle downwards at all, why doesn't it jump over the curve and head upwards instead of downwards? There are parameters you can tweak to control that, and again we'll mention where you can find out more, plus we'll see all of this in a practical application. But in the simplest, most intuitive terms, this is what's happening: we get to the bottom just by working out which way we need to go. Instead of brute-forcing through thousands and millions and billions and quadrillions of combinations, each time we simply look at which way the curve is sloping. Imagine you're standing on a hill: whichever way it feels like the ground is going downwards, you keep walking that way. You take 50 steps, then you assess again which way is downhill, then you take another 50, or maybe 40, steps that way, and the steps get smaller and smaller as you get closer to the bottom.

Here's an example of gradient descent applied in a two-dimensional space. The previous one was a one-dimensional example; here we have a two-dimensional space, and as you can see, gradient descent gets closer and closer to the minimum. It's called gradient descent, by the way, because you are descending towards the minimum of the cost function. And here is gradient descent applied in three dimensions: this is what it looks like, and if you project it onto two dimensions you can see it zigzagging its way into the minimum.

So there you go, that was gradient descent. In the next tutorial we'll talk about stochastic gradient descent, which is really a continuation of this tutorial. I look forward to seeing you there, and until next time, enjoy deep learning.
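On the question raised above of why the ball doesn't simply jump over to the other side of the curve: that behaviour is governed by the step size, the learning rate, which the lecture leaves for the further reading it mentions. A minimal, self-contained sketch on a simple quadratic cost; the cost (w - 3)**2 and the two learning rates are illustrative assumptions, not values from the lecture:

def descend(learning_rate, steps=10, w=0.0):
    # Run gradient descent on the toy cost C(w) = (w - 3)**2.
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)          # derivative of (w - 3)**2
        w = w - learning_rate * grad
    return w

print(descend(0.2))   # small step: settles near the minimum at w = 3
print(descend(1.1))   # too large a step: jumps over and ends up further away

With a learning rate of 0.2 each step shrinks the distance to the minimum, while with 1.1 each step overshoots to the far side and lands further from the minimum than before, which is exactly the jumping-over behaviour mentioned above.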
