Instructor: It's time for a more sophisticated, yet easy-to-understand example. We will look into market segmentation. I'll open a clean Jupyter notebook and import all the relevant packages, including the K-means module. Next, we should load the dataset, located in 3.12 example.csv.

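A minimal sketch of that first cell could look like this; the file name comes from the lecture, while the 'Satisfaction' and 'Loyalty' column names are my assumption:

    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    # Load the retail dataset (30 clients; column names assumed here)
    data = pd.read_csv('3.12 example.csv')
    data.head()
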
In the CSV, we've got data from a retail shop. There are 30 observations. Each observation is a client, and for each client we have a score for customer satisfaction and one for brand loyalty.

Let's see how the data was gathered. Satisfaction is self-reported: people were basically asked to rate their shopping experience from 1 to 10, where 10 means extremely satisfied. Therefore, satisfaction here is a discrete variable that takes integer values.

Brand loyalty, on the other hand, is a tricky metric. There is no widely accepted technique to measure it, but there are different proxies, like churn rate, retention rate, or customer lifetime value. In this dataset, loyalty was measured through the number of purchases from that shop over a year, together with several other factors found to be significant. The range is roughly from minus two to two, as the variable is already standardized. That's something that often occurs, especially when creating latent variables like this one.

Okay, let's plot the data.

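Continuing from the imports above, a plotting cell along these lines would produce that scatter (column names again assumed):

    # Scatter of the two features, one point per client
    plt.scatter(data['Satisfaction'], data['Loyalty'])
    plt.xlabel('Satisfaction')
    plt.ylabel('Loyalty')
    plt.show()
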
We can kind of identify two clusters by looking at the graph: this one and that one, right?

Well, before we perform any analysis, let's reason about the problem for a while. We can divide this graph into four squares: low satisfaction, low loyalty; low satisfaction, high loyalty; high satisfaction, low loyalty; and high satisfaction, high loyalty.

So, going back to our two-cluster solution, we realize it didn't make much sense, right? What would these two clusters represent? The first one looks like low satisfaction, low loyalty, but the other one is all over the place. It seems to me that a two-cluster solution won't cut it. But enough speculation; let's leverage the knowledge we already possess and learn something new.

First, I'll copy the data into a new variable called x. Next, I'll use the same code we have used so far: kmeans = KMeans(2), then kmeans.fit(x).

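As a sketch, those steps in the notebook would be:

    # Work on a copy so the original data frame stays untouched
    x = data.copy()

    # Fit K-means with two clusters, as a first guess
    kmeans = KMeans(2)
    kmeans.fit(x)
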
Clusters is a duplicate of x, and the column cluster_pred of this data frame will contain the cluster where a particular observation was predicted to be placed by the algorithm.

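One way to build that data frame, assuming the fitted kmeans object from the previous cell:

    # Duplicate x and attach the predicted cluster for each client
    clusters = x.copy()
    clusters['cluster_pred'] = kmeans.fit_predict(x)
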
Next, I'll quickly plot the data.

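Coloring the scatter by the predicted cluster, roughly like so (the colormap choice is mine):

    # Same scatter as before, now colored by cluster membership
    plt.scatter(clusters['Satisfaction'], clusters['Loyalty'],
                c=clusters['cluster_pred'], cmap='rainbow')
    plt.xlabel('Satisfaction')
    plt.ylabel('Loyalty')
    plt.show()
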
What we see are two clusters, but they aren't the two we imagined, are they? If you examine the plot closely, you will realize that there is a cutoff line at the satisfaction value of six. Everything on the right is one cluster; everything on the left, the other. This solution may make sense to some, but most probably the algorithm only considered satisfaction as a feature.

Why? Because we did not standardize the variable. The satisfaction values are much higher than those of loyalty, so K-means more or less disregarded loyalty as a feature. Whenever we cluster on the basis of a single feature, the result will look like this, as if it were cut off by a vertical line. That's one of the ways to spot that something fishy is going on.

Okay, satisfaction and loyalty seem equally important features for market segmentation. So how can we fix this problem? How can we give them equal weight? Yes, by standardizing satisfaction.

There are several ways to do that in sklearn. The simplest one I am aware of is through the preprocessing module. So, from sklearn import preprocessing; then we will declare a new variable called x_scaled, equal to preprocessing.scale of x.

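In code, that's just:

    from sklearn import preprocessing

    # Standardize each column of x separately (zero mean, unit variance)
    x_scaled = preprocessing.scale(x)
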
Scale is a method which scales each variable separately. In other words, each column will be standardized with respect to itself, which is exactly what we need. And it's done.

As you can see, x_scaled is an array which contains the standardized satisfaction and the same values for loyalty. That's because loyalty was already standardized: it already had a mean of zero and a standard deviation of one.

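If you want to verify that for yourself, a quick check along these lines works, since x_scaled is a NumPy array:

    # Both columns should now have mean ~0 and standard deviation ~1
    print(x_scaled.mean(axis=0))
    print(x_scaled.std(axis=0))
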
Okay, let's continue in the known way.

Since we don't know the number of clusters needed, the elbow method will come in handy. Let's declare a list, wcss. Then, for i in range(1, 10), we set kmeans = KMeans(i) (again, K and M are capital when referring to the method), call kmeans.fit(x_scaled), and append the result to the wcss list using the inertia_ attribute. Great. Note that the range is from 1 to 10, so we will get the WCSS for the one-, two-, three-, and so on, up to nine-cluster solutions. That was an arbitrary decision on my side.

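Assembled into a cell, the loop would look roughly like this; note that the fit is on x_scaled, the standardized data:

    # Within-cluster sum of squares for 1 through 9 clusters
    wcss = []
    for i in range(1, 10):
        kmeans = KMeans(i)
        kmeans.fit(x_scaled)
        wcss.append(kmeans.inertia_)
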
Finally, let's plot wcss versus the number of clusters, as we did in the previous lectures. The result is an elbow.

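The plotting cell would be something like:

    # The classic elbow plot: look for the kink in the curve
    plt.plot(range(1, 10), wcss)
    plt.xlabel('Number of clusters')
    plt.ylabel('WCSS')
    plt.show()
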
Given this graph, think about the correct number of clusters we should use. We will explore the different solutions in the next lesson. Thanks for watching.