All language subtitles for 04 - Introduce the central limit theorem

[Instructor] One of the dangers of business data analysis is making decisions too soon. The reason is that short-term results can be deceiving. But you should start to see patterns as you gather more data. One reliable principle of data analysis is the central limit theorem, which says that as the number of measurements increases, the more likely it is that your data will be distributed as you expect.

As an example, let's say that your data is normally distributed. A normal distribution has an average and also a standard deviation. In this case, we're looking at a so-called normal curve with a mu, or average value, of 100 and a standard deviation of 20. You can see a curve of values on this graph. On the left, on the vertical axis, there is the probability of a specific value occurring. So you can see that the chance of getting exactly 100 is 2%. That's pretty low, but there are a lot of values clustered around it. And that is where the power of the normal curve and normal distribution comes into play.
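The 2% figure read off the vertical axis can be checked with a short calculation. The sketch below is not part of the course materials; it assumes the graph plots the standard normal density formula with a mean of 100 and a standard deviation of 20, and the function names are made up for illustration. It prints the height of the curve at 100 and the chance of a value that rounds to 100.

```python
import math

MU, SIGMA = 100.0, 20.0  # mean and standard deviation quoted in the video


def normal_pdf(x, mu=MU, sigma=SIGMA):
    """Height of the normal curve at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mu=MU, sigma=SIGMA):
    """Probability of a value less than or equal to x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


# Height of the curve at the mean.
peak = normal_pdf(100.0)

# Chance of a value that rounds to 100, i.e. between 99.5 and 100.5.
near_100 = normal_cdf(100.5) - normal_cdf(99.5)

print(f"curve height at 100:   {peak:.4f}")
print(f"P(99.5 <= X <= 100.5): {near_100:.4f}")
```

Both numbers come out just under 0.02, which matches the roughly 2% shown on the graph if each point on the curve is read as covering a range one unit wide.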
If your data is normally distributed, and a lot of it is, then you should expect to see about 68% of the values in your data set within one standard deviation, plus or minus, of the mean. In this case, that would mean that 68% of your values would be between 80 and 120. So again, that's the average or mean of 100, minus 20 for 80 and plus 20 for 120. You can also expect to see about 95% of your values within two standard deviations plus or minus, between 60 and 140, and approximately 99.7% of values within three standard deviations plus or minus. And of course the probabilities of seeing other values get smaller as you go further away from the average. It doesn't mean they never occur, but it does mean that they are very rare.

To show you how this data works in practice, I will switch to an Excel workbook, which you can find in the exercise files collection, to use a macro to generate random values and show you what it looks like in practice. The workbook I'm using is 01_04 Central Limit Theorem. And as I said, it is available in the exercise files collection.
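The 68%, 95%, and 99.7% figures quoted for the normal curve can be reproduced from the error function in Python's standard library. This is a minimal sketch, assuming the same mean of 100 and standard deviation of 20; `share_within` is a helper name invented for this example.

```python
import math

MU, SIGMA = 100.0, 20.0  # mean and standard deviation quoted in the video


def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))


for k in (1, 2, 3):
    low, high = MU - k * SIGMA, MU + k * SIGMA
    print(f"within {k} sd ({low:.0f} to {high:.0f}): {share_within(k):.1%}")
```

The ranges printed are 80 to 120, 60 to 140, and 40 to 160, with shares of roughly 68%, 95%, and 99.7%.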
This workbook uses macros, so I'm going to go ahead and enable that content. If you're not able to run macros on your own system, then you probably won't be able to interact with this workbook. But if you can, go ahead and click Enable Content and we're ready to go.

So what just happened was that Excel recalculated my workbook and I have a new set of random values. I have 30 values, and again it's within plus or minus three standard deviations from the mean. Currently I have 30 values selected. If I click the 30 button, you'll see that I have four values that are one standard deviation above the mean, and I have one that is three standard deviations below, and so on. But also note that the values are distributed in what appears to be a flat pattern. You don't see the curve that we saw in the graphic earlier.

Now let's go up to 100 values. So I click 100. And you can see we're starting to get something that looks a little bit more like a curve. We're seeing more clustering toward the middle. So I'll click 100 again. 100 again. And we're seeing patterns, but it's not what we looked at before.
So now click 1,000. And here the pattern really does start to develop, because we have created or randomized more values and we're seeing a bit more of a hump in the middle. And if I click there again, you can see that the pattern is much more like what we expected.

Now click 10,000. And here the curve really starts to look like what we saw in the graphic earlier. So you have 10,000. And it's still a little bit lumpy, in the sense that some bars are larger than others toward the middle. But it looks very much like the curve we saw. And finally, if I click 100,000, then the curve looks almost perfect. Because we're taking so many values, we've had the opportunity to smooth out random chance, and our values look very much like the normal curve that we saw before.

So keep measuring, keep analyzing, and keep an open mind as to what your data tells you.
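For anyone who cannot run the workbook's macro, the demonstration can be approximated in a few lines of Python. This is only a sketch under stated assumptions: the workbook's macro is not shown in the video, so the code substitutes Python's `random.gauss` with a fixed seed, groups values into bins one standard deviation wide, and reports the largest gap between the observed and expected share in any bin. The names `bin_shares` and `largest_gap` are hypothetical.

```python
import math
import random

MU, SIGMA = 100.0, 20.0  # mean and standard deviation quoted in the video

# Bin edges one standard deviation apart: 40, 60, 80, 100, 120, 140, 160.
EDGES = [MU + k * SIGMA for k in range(-3, 4)]


def normal_cdf(x):
    """Probability of a value less than or equal to x."""
    return 0.5 * (1.0 + math.erf((x - MU) / (SIGMA * math.sqrt(2.0))))


# Share of values the normal curve predicts for each bin.
EXPECTED = [normal_cdf(b) - normal_cdf(a) for a, b in zip(EDGES, EDGES[1:])]


def bin_shares(values):
    """Observed share of values in each bin (values beyond 3 sd are not binned)."""
    counts = [0] * (len(EDGES) - 1)
    for v in values:
        for i, (a, b) in enumerate(zip(EDGES, EDGES[1:])):
            if a <= v < b:
                counts[i] += 1
                break
    return [c / len(values) for c in counts]


def largest_gap(n, rng):
    """Largest difference between observed and expected bin shares for n values."""
    values = [rng.gauss(MU, SIGMA) for _ in range(n)]
    observed = bin_shares(values)
    return max(abs(o - e) for o, e in zip(observed, EXPECTED))


rng = random.Random(42)  # fixed seed so repeated runs give the same numbers
gaps = {n: largest_gap(n, rng) for n in (30, 100, 1_000, 10_000, 100_000)}
for n, gap in gaps.items():
    print(f"{n:>7,} values: largest gap from the expected share = {gap:.4f}")
```

Individual runs with small sample sizes can bounce around, much like the repeated clicks on the 100 button in the video, but the gap for 100,000 values should be far smaller than the gap for 30.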