1
00:00:00,300 --> 00:00:01,800
So far we have covered graphs
2
00:00:01,800 --> 00:00:03,960
that represent only one variable.
3
00:00:03,960 --> 00:00:08,010
But how do we represent relationships between two variables?
4
00:00:08,010 --> 00:00:12,390
In this video, we'll explore cross tables and scatter plots.
5
00:00:12,390 --> 00:00:14,190
Once again, we have a division
6
00:00:14,190 --> 00:00:16,833
between categorical and numerical variables.
7
00:00:17,790 --> 00:00:20,610
Let's start with categorical variables.
8
00:00:20,610 --> 00:00:23,850
The most common way to represent them is using cross tables
9
00:00:23,850 --> 00:00:27,483
or, as some statisticians call them, contingency tables.
10
00:00:28,410 --> 00:00:30,120
Imagine you are an investment manager
11
00:00:30,120 --> 00:00:31,800
and you manage stocks, bonds
12
00:00:31,800 --> 00:00:35,220
and real estate investments for three different investors.
13
00:00:35,220 --> 00:00:38,160
Each of them has a different idea of risk, and hence
14
00:00:38,160 --> 00:00:40,290
their money is allocated in a different way
15
00:00:40,290 --> 00:00:42,570
among the three asset classes.
16
00:00:42,570 --> 00:00:44,850
A cross table representing all the data looks
17
00:00:44,850 --> 00:00:45,993
like this.
18
00:00:46,890 --> 00:00:48,810
You can clearly see the rows showing the type
19
00:00:48,810 --> 00:00:50,280
of investment that's been made
20
00:00:50,280 --> 00:00:53,850
and the columns with each investor's allocation.
21
00:00:53,850 --> 00:00:56,130
It is good practice to calculate the totals
22
00:00:56,130 --> 00:00:58,830
of each row and column, as they are often useful
23
00:00:58,830 --> 00:01:00,750
in further analysis.
24
00:01:00,750 --> 00:01:02,970
Notice that the subtotals of the rows give us
25
00:01:02,970 --> 00:01:06,840
total investments in stocks, bonds and real estate.
26
00:01:06,840 --> 00:01:08,490
On the other hand, the subtotals
27
00:01:08,490 --> 00:01:11,253
of the columns give us the holdings of each investor.
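The row and column totals described above can be sketched in a few lines of Python. This is a minimal illustration only: the investor names and the amounts are made up, since the video does not state the actual figures, and `add_totals` is a hypothetical helper name.

```python
# Hypothetical allocations (illustrative numbers, not taken from the video).
# Rows are asset classes, columns are investors.
allocations = {
    "Stocks": {"Investor A": 96, "Investor B": 185, "Investor C": 39},
    "Bonds": {"Investor A": 181, "Investor B": 3, "Investor C": 29},
    "Real estate": {"Investor A": 88, "Investor B": 152, "Investor C": 142},
}


def add_totals(table):
    """Return (row_totals, column_totals, grand_total) for a nested dict."""
    # Row totals: total invested in each asset class.
    row_totals = {asset: sum(by_investor.values())
                  for asset, by_investor in table.items()}
    # Column totals: total holdings of each investor.
    column_totals = {}
    for by_investor in table.values():
        for investor, amount in by_investor.items():
            column_totals[investor] = column_totals.get(investor, 0) + amount
    grand_total = sum(row_totals.values())
    return row_totals, column_totals, grand_total


row_totals, column_totals, grand_total = add_totals(allocations)

for asset, total in row_totals.items():
    print(f"{asset}: {total}")
for investor, total in column_totals.items():
    print(f"{investor}: {total}")
print(f"Grand total: {grand_total}")
```

A useful sanity check is that the row totals and the column totals both add up to the same grand total.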
28
00:01:12,420 --> 00:01:14,370
Once we have created a cross table
29
00:01:14,370 --> 00:01:17,853
we can proceed to visualize the data on a plane.
30
00:01:18,930 --> 00:01:21,444
A very useful chart in such cases is a variation
31
00:01:21,444 --> 00:01:24,933
of the bar chart called the side-by-side bar chart.
32
00:01:25,860 --> 00:01:28,140
It represents the holdings of each investor
33
00:01:28,140 --> 00:01:30,120
in the different types of assets.
34
00:01:30,120 --> 00:01:32,700
Stocks are in green, bonds are in red
35
00:01:32,700 --> 00:01:34,293
and real estate is in blue.
36
00:01:35,250 --> 00:01:36,690
The name of this type of chart comes
37
00:01:36,690 --> 00:01:38,820
from the fact that for each investor,
38
00:01:38,820 --> 00:01:42,060
the categories of assets are represented side by side.
39
00:01:42,060 --> 00:01:44,940
In this way, we can easily compare asset holdings
40
00:01:44,940 --> 00:01:47,880
for a specific investor or among investors.
41
00:01:47,880 --> 00:01:49,230
Easy, right?
42
00:01:49,230 --> 00:01:51,810
All graphs are very easy to create and read
43
00:01:51,810 --> 00:01:54,180
once you have identified the type of data you are
44
00:01:54,180 --> 00:01:57,093
dealing with and decided on the best way to visualize it.
45
00:01:58,500 --> 00:02:00,870
Finally, we would like to conclude with a very
46
00:02:00,870 --> 00:02:03,333
important graph, the scatter plot.
47
00:02:04,170 --> 00:02:08,190
It is used when representing two numerical variables.
48
00:02:08,190 --> 00:02:10,560
For this example, we have gathered the reading
49
00:02:10,560 --> 00:02:14,370
and writing SAT scores of 100 individuals.
50
00:02:14,370 --> 00:02:16,970
Let me first show you the graph before analyzing it.
51
00:02:18,060 --> 00:02:18,900
All right.
52
00:02:18,900 --> 00:02:22,260
First, SAT scores by component range from 200 to
53
00:02:22,260 --> 00:02:25,260
800 points, and that is why our data is bounded
54
00:02:25,260 --> 00:02:27,543
within the range of 200 to 800.
55
00:02:28,440 --> 00:02:31,680
Second, our vertical axis shows the writing scores
56
00:02:31,680 --> 00:02:34,623
while the horizontal axis contains reading scores.
57
00:02:35,910 --> 00:02:39,210
Third, there are 100 students, and each student's results correspond
58
00:02:39,210 --> 00:02:41,193
to a specific point on the graph.
59
00:02:42,060 --> 00:02:43,830
Each point gives us information about
60
00:02:43,830 --> 00:02:46,380
a particular student's performance.
61
00:02:46,380 --> 00:02:48,330
For example, this is Jane.
62
00:02:48,330 --> 00:02:52,473
She scored 300 on writing, but 550 on the reading part.
63
00:02:53,700 --> 00:02:55,740
Scatter plots usually represent lots
64
00:02:55,740 --> 00:02:57,243
and lots of observations.
65
00:02:58,080 --> 00:02:59,550
When interpreting a scatter plot,
66
00:02:59,550 --> 00:03:01,500
a statistician is not expected to look
67
00:03:01,500 --> 00:03:03,150
into single data points.
68
00:03:03,150 --> 00:03:04,470
They would be much more interested
69
00:03:04,470 --> 00:03:07,773
in getting the main idea of how the data is distributed.
70
00:03:08,790 --> 00:03:11,070
Okay, the first thing we see is
71
00:03:11,070 --> 00:03:13,440
that there is an obvious uptrend.
72
00:03:13,440 --> 00:03:16,110
This is because lower writing scores are usually obtained
73
00:03:16,110 --> 00:03:18,240
by students with lower reading scores
74
00:03:18,240 --> 00:03:20,010
and higher writing scores are achieved
75
00:03:20,010 --> 00:03:22,050
by students with higher reading scores.
76
00:03:22,050 --> 00:03:24,240
This is logical, right?
77
00:03:24,240 --> 00:03:25,620
Students are more likely to do well
78
00:03:25,620 --> 00:03:28,473
on both because the two tasks are closely related.
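One common way to put a number on an uptrend like this is the Pearson correlation coefficient. The video does not name this technique, so treat the following as a side sketch under stated assumptions: the scores are made up for illustration, and `pearson` is a hypothetical helper written from the standard formula.

```python
from math import sqrt

# Hypothetical (reading, writing) SAT scores, made up for illustration.
scores = [
    (300, 320), (400, 380), (450, 470), (500, 510),
    (550, 540), (700, 720), (750, 760),
]


def pearson(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # Sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    # Sums of squared deviations for each variable.
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y)


r = pearson(scores)
print(f"r = {r:.3f}")
```

A value of r close to 1 means that higher reading scores go together with higher writing scores, which is what an upward-sloping cloud of points looks like on a scatter plot.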
79
00:03:29,400 --> 00:03:32,370
Second, we notice a concentration of students in the middle
80
00:03:32,370 --> 00:03:34,140
of the graph with scores in the region
81
00:03:34,140 --> 00:03:38,100
of 450 to 550 on both reading and writing.
82
00:03:38,100 --> 00:03:39,660
Remember we said that scores can be
83
00:03:39,660 --> 00:03:42,060
anywhere between 200 and 800?
84
00:03:42,060 --> 00:03:45,330
Well, 500 is the average score one can get
85
00:03:45,330 --> 00:03:49,448
so it makes sense that a lot of people fall into that area.
86
00:03:49,448 --> 00:03:52,020
Third, there is this group of people
87
00:03:52,020 --> 00:03:55,470
with both very high writing and reading scores.
88
00:03:55,470 --> 00:03:57,660
The exceptional students tend to be excellent
89
00:03:57,660 --> 00:03:58,803
at both components.
90
00:03:59,640 --> 00:04:01,350
This is less true for weaker students,
91
00:04:01,350 --> 00:04:02,670
as their performance tends to
92
00:04:02,670 --> 00:04:04,773
deviate when performing different tasks.
93
00:04:05,970 --> 00:04:08,520
Finally, we have Jane from a minute ago.
94
00:04:08,520 --> 00:04:11,730
She is far away from every other observation as she scored
95
00:04:11,730 --> 00:04:14,910
above average on reading, but poorly on writing.
96
00:04:14,910 --> 00:04:17,220
This observation is called an outlier
97
00:04:17,220 --> 00:04:20,370
as it goes against the logic of the whole data set.
98
00:04:20,370 --> 00:04:22,800
We will learn more about outliers and how to treat them
99
00:04:22,800 --> 00:04:24,873
in our analysis later on in this course.
100
00:04:25,980 --> 00:04:28,230
So we have gone through the basics.
101
00:04:28,230 --> 00:04:31,020
We have covered populations, samples,
102
00:04:31,020 --> 00:04:34,710
types of variables, graphs and tables.
103
00:04:34,710 --> 00:04:36,630
And it is time for us to dive into
104
00:04:36,630 --> 00:04:39,030
the heart of descriptive statistics:
105
00:04:39,030 --> 00:04:42,810
measures of central tendency and variability.
106
00:04:42,810 --> 00:04:43,810
Thanks for watching.