1
00:00:00,004 --> 00:00:01,009
- [Instructor] The data distribution
2
00:00:01,009 --> 00:00:03,005
that you are most likely to use
3
00:00:03,005 --> 00:00:06,009
during your analysis is the normal distribution.
4
00:00:06,009 --> 00:00:09,003
The normal curve or bell curve
5
00:00:09,003 --> 00:00:11,007
has the shape shown in this chart.
6
00:00:11,007 --> 00:00:13,008
This chart indicates probabilities for a curve
7
00:00:13,008 --> 00:00:15,006
with an average or mean of 100
8
00:00:15,006 --> 00:00:18,009
and a standard deviation of 20.
9
00:00:18,009 --> 00:00:22,000
The mean is usually indicated by the Greek letter mu
10
00:00:22,000 --> 00:00:26,001
and the standard deviation by the Greek letter sigma.
11
00:00:26,001 --> 00:00:29,007
The normal curve has some very useful properties.
12
00:00:29,007 --> 00:00:33,009
The first is that approximately 68% of all values
13
00:00:33,009 --> 00:00:37,006
will occur within plus or minus one standard deviation.
14
00:00:37,006 --> 00:00:40,009
So with our mean of 100, that would mean
15
00:00:40,009 --> 00:00:44,004
that about 68% of the values would fall
16
00:00:44,004 --> 00:00:46,009
within 20 above or below the average.
17
00:00:46,009 --> 00:00:49,003
So 80 to 120.
18
00:00:49,003 --> 00:00:53,001
95% of values will be within two standard deviations,
19
00:00:53,001 --> 00:00:55,003
so 60 to 140,
20
00:00:55,003 --> 00:00:58,008
and 99.7% within three standard deviations
21
00:00:58,008 --> 00:01:03,001
plus or minus, so that's 40 to 160.
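The 68%, 95%, and 99.7% figures can be checked numerically. The following is a minimal sketch, assuming Python's standard-library `statistics.NormalDist` as a stand-in for the chart in the video; the helper name `share_within` is illustrative and does not come from the course.

```python
from statistics import NormalDist

# Mean and standard deviation quoted in the video (assumed here, not read from the workbook).
dist = NormalDist(mu=100, sigma=20)


def share_within(k):
    """Return the fraction of values within k standard deviations of the mean."""
    lower = dist.mean - k * dist.stdev
    upper = dist.mean + k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)


for k in (1, 2, 3):
    lower = dist.mean - k * dist.stdev
    upper = dist.mean + k * dist.stdev
    print(f"{lower:.0f} to {upper:.0f}: {share_within(k):.2%}")
```

The three shares come out at roughly 68.27%, 95.45%, and 99.73%, which is where the rounded figures in the narration come from.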
22
00:01:03,001 --> 00:01:05,005
To see how to work with these values in Excel,
23
00:01:05,005 --> 00:01:08,009
we'll switch over to our practice workbook.
24
00:01:08,009 --> 00:01:10,005
I've switched over to Excel
25
00:01:10,005 --> 00:01:13,004
and my sample file is 04_01_Normal,
26
00:01:13,004 --> 00:01:15,007
and you can find it in the chapter four folder
27
00:01:15,007 --> 00:01:18,001
of the exercise files collection.
28
00:01:18,001 --> 00:01:20,003
I use the values in columns A and B
29
00:01:20,003 --> 00:01:23,002
to create the graph of the curve
30
00:01:23,002 --> 00:01:25,003
that you see at the bottom right.
31
00:01:25,003 --> 00:01:28,008
But let's ask some numerical questions of our data.
32
00:01:28,008 --> 00:01:31,005
For example, we can calculate the probability
33
00:01:31,005 --> 00:01:33,009
of getting exactly 92.
34
00:01:33,009 --> 00:01:36,002
So we have an average of 100,
35
00:01:36,002 --> 00:01:37,009
standard deviation of 20,
36
00:01:37,009 --> 00:01:39,005
92 is close to the middle,
37
00:01:39,005 --> 00:01:42,005
so let's calculate the probability
38
00:01:42,005 --> 00:01:44,004
of getting exactly that value
39
00:01:44,004 --> 00:01:47,000
if we're generating random numbers.
40
00:01:47,000 --> 00:01:49,008
So I'll click in cell E1
41
00:01:49,008 --> 00:01:52,001
and then type an equal sign.
42
00:01:52,001 --> 00:01:55,007
And the function we use is NORM.DIST
43
00:01:55,007 --> 00:01:57,004
and as you might guess,
44
00:01:57,004 --> 00:02:00,002
that stands for normal distribution.
45
00:02:00,002 --> 00:02:03,002
The value we're working with, our X, is 92.
46
00:02:03,002 --> 00:02:05,005
So I'll type that in, then a comma.
47
00:02:05,005 --> 00:02:07,008
The mean is in B1, comma,
48
00:02:07,008 --> 00:02:11,004
standard deviation in B2, then a comma.
49
00:02:11,004 --> 00:02:14,008
And we are looking for the probability mass function
50
00:02:14,008 --> 00:02:17,008
which is also called a point probability.
51
00:02:17,008 --> 00:02:19,009
And that means that for the last argument,
52
00:02:19,009 --> 00:02:23,007
I need to select FALSE so I highlight that.
53
00:02:23,007 --> 00:02:24,009
Press tab to accept it,
54
00:02:24,009 --> 00:02:28,000
type a right parenthesis and enter.
55
00:02:28,000 --> 00:02:34,000
And we see the probability of getting exactly 92 is 1.84%.
56
00:02:34,000 --> 00:02:36,000
And that might seem pretty low, but remember,
57
00:02:36,000 --> 00:02:37,009
within three standard deviations,
58
00:02:37,009 --> 00:02:39,009
we go from 40 to 160.
59
00:02:39,009 --> 00:02:43,001
So the fact that 92 is as probable
60
00:02:43,001 --> 00:02:45,005
as it is in a random selection
61
00:02:45,005 --> 00:02:50,000
is an indication of how close to the average it is.
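The 1.84% result can be reproduced outside Excel. This is a rough sketch, assuming `statistics.NormalDist.pdf` plays the role of `NORM.DIST(92, B1, B2, FALSE)`; the variable names are illustrative. Because the normal distribution is continuous, the number returned is the height of the density curve at 92, which is close to the chance of landing in a one-unit-wide band around 92.

```python
from statistics import NormalDist

# Same mean and standard deviation as cells B1 and B2 in the narration (assumed values).
dist = NormalDist(mu=100, sigma=20)

# Counterpart of NORM.DIST(92, B1, B2, FALSE): density of the curve at x = 92.
density_at_92 = dist.pdf(92)

# Probability of falling between 91.5 and 92.5, for comparison with the density.
band_around_92 = dist.cdf(92.5) - dist.cdf(91.5)

print(f"density at 92:          {density_at_92:.4f}")
print(f"P(91.5 <= X <= 92.5):   {band_around_92:.4f}")
```

Both numbers round to 0.0184, which matches the 1.84% shown in the workbook.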
62
00:02:50,000 --> 00:02:51,007
Now let's calculate the probability
63
00:02:51,007 --> 00:02:54,001
of getting 92 or more.
64
00:02:54,001 --> 00:02:56,003
And I will do it incorrectly the first time
65
00:02:56,003 --> 00:02:59,008
and then show you how to fix what is a very common mistake.
66
00:02:59,008 --> 00:03:04,004
So in E2, I'll type equal, NORM.DIST.
67
00:03:04,004 --> 00:03:06,009
As before, our X is 92,
68
00:03:06,009 --> 00:03:10,000
the mean is in B1, standard deviation in B2
69
00:03:10,000 --> 00:03:12,005
and then a comma, but now we do want to look
70
00:03:12,005 --> 00:03:15,003
for the cumulative distribution function.
71
00:03:15,003 --> 00:03:18,008
And that's because we're looking for 92 or more.
72
00:03:18,008 --> 00:03:20,008
So we want a spread of values instead
73
00:03:20,008 --> 00:03:23,004
of a single point probability.
74
00:03:23,004 --> 00:03:25,005
So I highlight TRUE,
75
00:03:25,005 --> 00:03:27,001
type a right parenthesis,
76
00:03:27,001 --> 00:03:31,002
and again, this is going to be an incorrect result.
77
00:03:31,002 --> 00:03:36,001
I get a 34.46% probability of getting 92 or more.
78
00:03:36,001 --> 00:03:38,003
And here's why that's wrong.
79
00:03:38,003 --> 00:03:42,009
92 is to the left of the mean.
80
00:03:42,009 --> 00:03:44,009
And if you look at the normal curve,
81
00:03:44,009 --> 00:03:47,002
half the values are greater than the mean
82
00:03:47,002 --> 00:03:49,008
and the other half are less than the mean.
83
00:03:49,008 --> 00:03:54,005
So the fact that our calculation shows only 34.46%
84
00:03:54,005 --> 00:03:57,003
of values are greater than 92,
85
00:03:57,003 --> 00:04:00,006
which is less than the mean, must be incorrect.
86
00:04:00,006 --> 00:04:02,000
The way to fix this error
87
00:04:02,000 --> 00:04:05,005
is to subtract that calculation from one.
88
00:04:05,005 --> 00:04:07,005
So I will
89
00:04:07,005 --> 00:04:09,005
double click in cell E2,
90
00:04:09,005 --> 00:04:12,009
and then I will add one minus
91
00:04:12,009 --> 00:04:14,002
our previous calculation.
92
00:04:14,002 --> 00:04:18,007
Now, when I press tab, I get 65.54%
93
00:04:18,007 --> 00:04:20,004
and that makes a lot more sense
94
00:04:20,004 --> 00:04:24,000
because 92 is approximately here,
95
00:04:24,000 --> 00:04:26,004
I've highlighted 90,
96
00:04:26,004 --> 00:04:28,003
and I'll just leave the mouse pointer there
97
00:04:28,003 --> 00:04:30,006
to show you the approximate point.
98
00:04:30,006 --> 00:04:34,001
You can see that about 65.54% of the values
99
00:04:34,001 --> 00:04:35,003
are to the right
100
00:04:35,003 --> 00:04:38,007
so our calculation makes intuitive sense.
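The one-minus correction can be sketched in a few lines. This assumes `statistics.NormalDist.cdf` as the counterpart of `NORM.DIST(..., TRUE)`; the variable names are illustrative. The cumulative function always measures the area to the left of a value, so the area to the right is whatever remains of 1.

```python
from statistics import NormalDist

# Mean 100 and standard deviation 20, as in the narration (assumed values).
dist = NormalDist(mu=100, sigma=20)

# Counterpart of NORM.DIST(92, B1, B2, TRUE): share of values at or below 92.
at_or_below_92 = dist.cdf(92)

# Counterpart of 1 - NORM.DIST(92, B1, B2, TRUE): share of values at or above 92.
at_or_above_92 = 1 - at_or_below_92

print(f"92 or less: {at_or_below_92:.2%}")
print(f"92 or more: {at_or_above_92:.2%}")
```

The two shares are about 34.46% and 65.54%, the uncorrected and corrected results from the demonstration.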
101
00:04:38,007 --> 00:04:41,006
We can also ask about percentages of values
102
00:04:41,006 --> 00:04:43,002
within a distribution.
103
00:04:43,002 --> 00:04:46,002
So let's say, what is the value inside this curve
104
00:04:46,002 --> 00:04:48,004
or as part of this data distribution
105
00:04:48,004 --> 00:04:51,006
that 33% of values are below?
106
00:04:51,006 --> 00:04:54,007
So I will click in cell H1
107
00:04:54,007 --> 00:04:56,008
and type an equal sign.
108
00:04:56,008 --> 00:05:00,002
We can't use NORM.DIST for this calculation
109
00:05:00,002 --> 00:05:04,004
but we can use a different function, NORM.INV
110
00:05:04,004 --> 00:05:07,006
and this is the inverse of the normal distribution
111
00:05:07,006 --> 00:05:11,001
where we got probabilities with NORM.DIST,
112
00:05:11,001 --> 00:05:13,000
the inverse
113
00:05:13,000 --> 00:05:17,001
of that gives us values based on a probability.
114
00:05:17,001 --> 00:05:21,000
So our probability is 33%
115
00:05:21,000 --> 00:05:22,000
then a comma,
116
00:05:22,000 --> 00:05:23,008
our mean is still in B1,
117
00:05:23,008 --> 00:05:25,009
standard deviation is still in B2.
118
00:05:25,009 --> 00:05:27,006
We don't need any other arguments
119
00:05:27,006 --> 00:05:31,000
so I'll type a right parenthesis and enter.
120
00:05:31,000 --> 00:05:34,001
And we get 91.2.
121
00:05:34,001 --> 00:05:36,004
And this again makes sense.
122
00:05:36,004 --> 00:05:40,002
About 33% of our values are below 91,
123
00:05:40,002 --> 00:05:43,002
which again is here on the curve, approximately,
124
00:05:43,002 --> 00:05:48,002
and that will show that about 33% of the values
125
00:05:48,002 --> 00:05:49,007
are to the left
126
00:05:49,007 --> 00:05:52,003
so our value checks out.
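The inverse lookup can be checked the same way. This is a minimal sketch, assuming `statistics.NormalDist.inv_cdf` as the counterpart of `NORM.INV(33%, B1, B2)`; the variable names are illustrative.

```python
from statistics import NormalDist

# Mean 100 and standard deviation 20, as in the narration (assumed values).
dist = NormalDist(mu=100, sigma=20)

# Counterpart of NORM.INV(33%, B1, B2): the value that 33% of values fall below.
value_at_33_percent = dist.inv_cdf(0.33)

# Feeding the value back into the cumulative function recovers the probability.
round_trip = dist.cdf(value_at_33_percent)

print(f"value with 33% below it: {value_at_33_percent:.2f}")
print(f"share below that value:  {round_trip:.2%}")
```

The value comes out at about 91.20, and the round trip returns 33%, which shows that the two functions undo each other.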
127
00:05:52,003 --> 00:05:55,004
If I want to find the value for which 90% of values
128
00:05:55,004 --> 00:05:57,003
in this curve are above,
129
00:05:57,003 --> 00:06:00,000
then I can use NORM.INV.
130
00:06:00,000 --> 00:06:02,003
And if you're suspecting that we need to do
131
00:06:02,003 --> 00:06:05,001
one minus something, as we did with probability
132
00:06:05,001 --> 00:06:07,002
of 92 or more, you are correct
133
00:06:07,002 --> 00:06:09,004
but we put it in a different place.
134
00:06:09,004 --> 00:06:12,006
So in H2 I'll type an equal sign,
135
00:06:12,006 --> 00:06:14,007
NORM.INV.
136
00:06:14,007 --> 00:06:17,008
The result we would get by typing in 90%
137
00:06:17,008 --> 00:06:22,002
would be to return a value that 90% of values are below.
138
00:06:22,002 --> 00:06:25,006
So we need to subtract the percentage from one
139
00:06:25,006 --> 00:06:28,001
as part of the probability calculation.
140
00:06:28,001 --> 00:06:33,008
So for the first argument, I'll type 1 minus 90%,
141
00:06:33,008 --> 00:06:35,009
then a comma, B1 for the mean,
142
00:06:35,009 --> 00:06:37,008
B2 for the standard deviation,
143
00:06:37,008 --> 00:06:39,008
right parenthesis and enter,
144
00:06:39,008 --> 00:06:44,004
and we get 74.37 approximately.
145
00:06:44,004 --> 00:06:45,006
And again, that makes sense.
146
00:06:45,006 --> 00:06:48,003
If I go down to 74,
147
00:06:48,003 --> 00:06:50,004
that's 72,
148
00:06:50,004 --> 00:06:52,002
oh, there's 74.
149
00:06:52,002 --> 00:06:54,009
We can see where it lies on the curve
150
00:06:54,009 --> 00:06:58,003
and it makes sense that about 90% of values,
151
00:06:58,003 --> 00:07:01,005
including the fat part of the curve in the middle
152
00:07:01,005 --> 00:07:08,000
would be above the return value of about 74.37.
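The "90% above" lookup follows the same pattern, with the subtraction moved into the probability argument. This is a rough sketch, assuming `statistics.NormalDist.inv_cdf` as the counterpart of `NORM.INV(1-90%, B1, B2)`; the variable names are illustrative.

```python
from statistics import NormalDist

# Mean 100 and standard deviation 20, as in the narration (assumed values).
dist = NormalDist(mu=100, sigma=20)

share_above = 0.90

# The inverse function expects the share BELOW the value, so convert first.
cutoff = dist.inv_cdf(1 - share_above)

# Confirm that about 90% of values lie above the cutoff.
check_above = 1 - dist.cdf(cutoff)

print(f"cutoff with 90% above it: {cutoff:.2f}")
print(f"share above the cutoff:   {check_above:.2%}")
```

The cutoff is about 74.37. Passing 0.90 directly instead would return the value that 90% of values are below, which sits on the other side of the mean.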
153
00:07:08,000 --> 00:07:11,008
So as you can see, you can do a lot with the normal curve,
154
00:07:11,008 --> 00:07:17,000
especially with the functions NORM.DIST and NORM.INV.