1
00:00:00,005 --> 00:00:03,003
- [Instructor] One of the dangers of business data analysis
2
00:00:03,003 --> 00:00:05,007
is making decisions too soon.
3
00:00:05,007 --> 00:00:09,002
The reason is that short-term results can be deceiving.
4
00:00:09,002 --> 00:00:10,007
But you should start to see patterns
5
00:00:10,007 --> 00:00:12,005
as you gather more data.
6
00:00:12,005 --> 00:00:14,007
One reliable principle of data analysis
7
00:00:14,007 --> 00:00:18,006
is the central limit theorem, which says that as the number
8
00:00:18,006 --> 00:00:21,000
of measurements increases, it becomes more likely
9
00:00:21,000 --> 00:00:24,003
that your data will be distributed as you expect.
10
00:00:24,003 --> 00:00:26,008
As an example, let's say that your data
11
00:00:26,008 --> 00:00:28,004
is normally distributed.
12
00:00:28,004 --> 00:00:31,008
And a normal distribution has an average
13
00:00:31,008 --> 00:00:34,003
and also a standard deviation.
14
00:00:34,003 --> 00:00:38,003
In this case, we're looking at a so-called normal curve
15
00:00:38,003 --> 00:00:42,000
with a mu or average value of 100
16
00:00:42,000 --> 00:00:44,006
and a standard deviation of 20.
17
00:00:44,006 --> 00:00:49,000
And you can see a curve of values on this graph.
18
00:00:49,000 --> 00:00:51,004
And on the left, on the vertical axis,
19
00:00:51,004 --> 00:00:54,008
there is the probability of a specific value occurring.
20
00:00:54,008 --> 00:00:59,007
So you can see that the chance of getting exactly 100 is 2%.
21
00:00:59,007 --> 00:01:01,005
That's pretty low but there are a lot
22
00:01:01,005 --> 00:01:03,003
of values clustered around it.
23
00:01:03,003 --> 00:01:06,002
And that is where the power of the normal curve
24
00:01:06,002 --> 00:01:09,000
and normal distribution comes into play.
25
00:01:09,000 --> 00:01:12,005
If your data is normally distributed, and a lot of it is,
26
00:01:12,005 --> 00:01:16,001
then you should expect to see about 68% of your values
27
00:01:16,001 --> 00:01:19,009
in your data set within one standard deviation
28
00:01:19,009 --> 00:01:21,009
plus or minus of the mean.
29
00:01:21,009 --> 00:01:25,002
In this case, that would mean that 68% of your values
30
00:01:25,002 --> 00:01:27,008
would be between 80 and 120.
31
00:01:27,008 --> 00:01:33,000
So again, that's the average or mean of 100 minus 20 for 80
32
00:01:33,000 --> 00:01:36,009
and plus 20 for 120.
33
00:01:36,009 --> 00:01:39,009
You can also expect to see about 95% of your values
34
00:01:39,009 --> 00:01:43,000
within two standard deviations plus or minus,
35
00:01:43,000 --> 00:01:48,002
between 60 and 140, and approximately 99.7% of values
36
00:01:48,002 --> 00:01:51,000
within three standard deviations plus or minus.
37
00:01:51,000 --> 00:01:53,008
And of course the probabilities of seeing other values
38
00:01:53,008 --> 00:01:57,007
get smaller as you go further away from the average.
39
00:01:57,007 --> 00:02:00,009
It doesn't mean they never occur but it does mean
40
00:02:00,009 --> 00:02:03,002
that they are very rare.
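The percentages quoted above can be checked directly. The following is a minimal sketch, assuming Python 3.8 or later and its standard-library `statistics.NormalDist`; the names `curve` and `share_within` are illustrative and do not come from the course or its exercise files. It uses the same mean of 100 and standard deviation of 20.

```python
from statistics import NormalDist

# Normal curve from the example: mean (mu) of 100, standard deviation of 20.
curve = NormalDist(mu=100, sigma=20)


def share_within(k):
    """Return (low, high, share) for the band within k standard deviations of the mean."""
    low = curve.mean - k * curve.stdev
    high = curve.mean + k * curve.stdev
    # The share of values in the band is the difference of the cumulative probabilities.
    return low, high, curve.cdf(high) - curve.cdf(low)


for k in (1, 2, 3):
    low, high, share = share_within(k)
    print(f"within {k} SD: {low:.0f} to {high:.0f} -> {share:.1%}")
# within 1 SD: 80 to 120 -> 68.3%
# within 2 SD: 60 to 140 -> 95.4%
# within 3 SD: 40 to 160 -> 99.7%

# Height of the curve at the mean, which is where the roughly 2% reading comes from.
peak_density = curve.pdf(100)
print(f"density at 100: {peak_density:.4f}")
# density at 100: 0.0199
```

Strictly speaking, the height of the curve at 100 is a density per unit of value rather than the probability of one exact value, which is why the shares over a range are the more useful numbers.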
41
00:02:03,002 --> 00:02:05,004
To show you how this data works in practice,
42
00:02:05,004 --> 00:02:08,000
I will switch to an Excel workbook, which you can find
43
00:02:08,000 --> 00:02:11,001
in the exercise files collection, to use a macro
44
00:02:11,001 --> 00:02:12,009
to generate random values and show you
45
00:02:12,009 --> 00:02:15,007
what it looks like in practice.
46
00:02:15,007 --> 00:02:19,008
The workbook I'm using is 01_04 Central Limit Theorem.
47
00:02:19,008 --> 00:02:21,007
And as I said, it is available
48
00:02:21,007 --> 00:02:24,004
in the exercise files collection.
49
00:02:24,004 --> 00:02:28,005
This workbook uses macros so I'm going to go ahead
50
00:02:28,005 --> 00:02:30,009
and enable that content.
51
00:02:30,009 --> 00:02:33,009
If you're not able to run macros on your own system,
52
00:02:33,009 --> 00:02:35,005
then you probably won't be able
53
00:02:35,005 --> 00:02:37,002
to interact with this workbook.
54
00:02:37,002 --> 00:02:40,002
But if you can, go ahead and click Enable Content
55
00:02:40,002 --> 00:02:42,000
and we're ready to go.
56
00:02:42,000 --> 00:02:44,009
So what just happened was that Excel recalculated
57
00:02:44,009 --> 00:02:48,003
my workbook and I have a new set of random values.
58
00:02:48,003 --> 00:02:52,004
I have 30 values and again they're within plus or minus
59
00:02:52,004 --> 00:02:55,001
three standard deviations from the mean.
60
00:02:55,001 --> 00:02:57,003
Currently I have 30 values selected.
61
00:02:57,003 --> 00:03:01,005
And if I click the 30 button, you'll see
62
00:03:01,005 --> 00:03:05,006
that I have four values that are one standard deviation
63
00:03:05,006 --> 00:03:10,002
above the mean and I have one that is three
64
00:03:10,002 --> 00:03:13,002
standard deviations below and so on.
65
00:03:13,002 --> 00:03:16,004
But also note that the values are distributed
66
00:03:16,004 --> 00:03:19,000
in what appears to be a flat pattern.
67
00:03:19,000 --> 00:03:22,005
You don't see the curve that we saw in the graphic earlier.
68
00:03:22,005 --> 00:03:26,000
Now let's go up to 100 values. So I click 100.
69
00:03:26,000 --> 00:03:27,008
And you can see we're starting to get something
70
00:03:27,008 --> 00:03:30,000
that looks a little bit more like a curve.
71
00:03:30,000 --> 00:03:32,004
We're seeing more clustering toward the middle.
72
00:03:32,004 --> 00:03:36,005
So I'll click 100 again. 100 again.
73
00:03:36,005 --> 00:03:39,001
And we're seeing patterns but it's not
74
00:03:39,001 --> 00:03:41,002
what we looked at before.
75
00:03:41,002 --> 00:03:43,005
So now click 1,000.
76
00:03:43,005 --> 00:03:46,005
And here the pattern really does start to develop
77
00:03:46,005 --> 00:03:50,005
because we have created or randomized more values
78
00:03:50,005 --> 00:03:54,007
and we're seeing a bit more of a hump in the middle.
79
00:03:54,007 --> 00:03:56,005
And if I click there again,
80
00:03:56,005 --> 00:03:57,008
you can see that the pattern
81
00:03:57,008 --> 00:04:00,006
is much more like what we expected.
82
00:04:00,006 --> 00:04:02,005
Now click 10,000.
83
00:04:02,005 --> 00:04:04,005
And here the curve really starts to look like
84
00:04:04,005 --> 00:04:07,000
what we saw in the graphic earlier.
85
00:04:07,000 --> 00:04:08,007
So you have 10,000.
86
00:04:08,007 --> 00:04:12,008
And it's still a little bit lumpy in the sense
87
00:04:12,008 --> 00:04:15,009
that some bars are larger than others toward the middle.
88
00:04:15,009 --> 00:04:18,008
But it looks very much like the curve we saw.
89
00:04:18,008 --> 00:04:21,001
And finally, if I click 100,000,
90
00:04:21,001 --> 00:04:23,006
then the curve looks almost perfect.
91
00:04:23,006 --> 00:04:25,005
Because we're taking so many values,
92
00:04:25,005 --> 00:04:29,007
we've had the opportunity to smooth out random chance
93
00:04:29,007 --> 00:04:32,002
and our values look very much like the normal curve
94
00:04:32,002 --> 00:04:34,006
that we saw before.
95
00:04:34,006 --> 00:04:38,001
So keep measuring, keep analyzing, and keep an open mind
96
00:04:38,001 --> 00:04:40,000
as to what your data tells you.
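For anyone who cannot run the workbook's macro, the same experiment can be approximated in a few lines of code. This is a rough sketch, not the macro itself: it assumes Python's standard-library `random` module, and the names `simulate` and `bin_shares`, the fixed seed, and the one-standard-deviation-wide bins are all illustrative choices. It draws 30 to 100,000 random values from the normal curve with mean 100 and standard deviation 20 and reports how closely each sample follows the expected pattern.

```python
import math
import random
from collections import Counter

MEAN, SD = 100, 20  # same curve as in the video


def simulate(n, seed=0):
    """Draw n random values from the normal curve (seed fixed so runs repeat)."""
    rng = random.Random(seed)
    return [rng.gauss(MEAN, SD) for _ in range(n)]


def bin_shares(values):
    """Share of values in each one-SD-wide bin, keyed by the bin's lower edge in SDs."""
    counts = Counter(math.floor((v - MEAN) / SD) for v in values)
    return {b: counts[b] / len(values) for b in sorted(counts)}


# Small samples wander; large samples settle near the expected 68%.
for n in (30, 100, 1_000, 10_000, 100_000):
    values = simulate(n)
    within_one = sum(abs(v - MEAN) <= SD for v in values) / n
    print(f"n={n:>7}: {within_one:.1%} within one standard deviation")

# Text histogram of the largest sample: the hump in the middle should be clear.
for b, share in bin_shares(simulate(100_000)).items():
    print(f"{b:+d} SD to {b + 1:+d} SD | {'#' * round(share * 50)}")
```

Changing the seed and rerunning the small sample sizes is the quickest way to see how misleading 30 values can be.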