011 Market Segmentation with Cluster Analysis (Part 1), English transcript

Instructor: It's time for a more sophisticated, yet easy-to-understand example. We will look into market segmentation. I'll open a clean Jupyter notebook and import all the relevant packages, including the KMeans module. Next, we should load the dataset located in 3.12 example.csv.

In the CSV, we've got data from a retail shop. There are 30 observations. Each observation is a client, and we have a score for their customer satisfaction and brand loyalty. Let's see how the data was gathered. Satisfaction is self-reported: people were basically asked to rate their shopping experience from 1 to 10, where 10 means extremely satisfied. Therefore, satisfaction here is a discrete variable and takes integer values.

Brand loyalty, on the other hand, is a tricky metric. There is no widely accepted technique to measure it, but there are different proxies like churn rate, retention rate, or customer lifetime value. In this dataset, loyalty was measured through the number of purchases from that shop over a year, together with several other factors found to be significant. The range is from around minus two to around two, as the variable is already standardized. That's something that often occurs, especially when creating latent variables like this one.

Okay, let's plot the data; a sketch of these first steps follows below.
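[Editor's note: a minimal sketch of the loading and plotting steps described above. The file name follows the transcript; the column names Satisfaction and Loyalty are assumptions, since the CSV header is never shown.]

    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    # Load the retail data set (30 clients, two features).
    data = pd.read_csv('3.12 example.csv')

    # Scatter plot: satisfaction (integers 1-10) vs. loyalty (standardized).
    plt.scatter(data['Satisfaction'], data['Loyalty'])
    plt.xlabel('Satisfaction')
    plt.ylabel('Loyalty')
    plt.show()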
We can kind of identify two clusters by looking at the graph: this one and that one, right? Well, before we perform any analysis, let's reason about the problem for a while. We can divide this graph into four squares: low satisfaction, low loyalty; low satisfaction, high loyalty; high satisfaction, low loyalty; and high satisfaction, high loyalty.

So, going back to our two-cluster solution, we realize it didn't make much sense, right? What would these two clusters represent? The first one looks like low satisfaction, low loyalty, but the other one is all over the place. It seems to me that a two-cluster solution won't cut it, but enough speculation. Let's leverage the knowledge we already possess and learn something new.

First, I'll copy the data into a new variable called x. Next, I'll use the same code we have used so far: kmeans = KMeans(2), then kmeans.fit(x). clusters is a duplicate of x, and the column cluster_pred of this data frame will contain the cluster in which the algorithm predicted each observation to be placed.

Next, I'll quickly plot the data; these steps are sketched below.
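[Editor's note: a sketch of the clustering steps just described, assuming the data frame from the previous snippet and its assumed column names.]

    # Copy the features into x, as in the lecture.
    x = data.copy()

    # Fit K-means with two clusters on the unscaled features.
    kmeans = KMeans(2)
    kmeans.fit(x)

    # clusters duplicates x; cluster_pred stores each observation's
    # predicted cluster label.
    clusters = x.copy()
    clusters['cluster_pred'] = kmeans.fit_predict(x)

    # Plot again, colouring the points by predicted cluster.
    plt.scatter(clusters['Satisfaction'], clusters['Loyalty'],
                c=clusters['cluster_pred'], cmap='rainbow')
    plt.xlabel('Satisfaction')
    plt.ylabel('Loyalty')
    plt.show()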
What we see are two clusters, but they aren't the two we imagined, are they? If you examine the plot closely, you will realize that there is a cutoff line at the satisfaction value of six. Everything on the right is one cluster; everything on the left is the other. This solution may make sense to some, but most probably the algorithm only considered satisfaction as a feature.

Why? Because we did not standardize the variable. The satisfaction values are much larger in magnitude than those of loyalty, so K-means more or less disregarded loyalty as a feature. Whenever we cluster on the basis of a single feature, the result will look like this: as if it was cut off by a vertical line. That's one of the ways to spot that something fishy is going on.

Okay, satisfaction and loyalty seem equally important features for market segmentation. So how can we fix this problem? How can we give them equal weight? Yes, by standardizing satisfaction. There are several ways to do that in sklearn; the simplest one I am aware of is through the preprocessing module. So, from sklearn import preprocessing, then we declare a new variable called x_scaled, equal to preprocessing.scale(x). scale standardizes each variable separately. In other words, each column is standardized with respect to itself, which is exactly what we need, and it's done.

As you can see, x_scaled is an array which contains the standardized satisfaction and the same values for loyalty. That's because loyalty was already standardized: it had a mean of zero and a standard deviation of one.

Okay, let's continue in the known way. Since we don't know the number of clusters needed, the elbow method will come in handy. Let's declare a list, wcss. Then, for i in range(1, 10): kmeans = KMeans(i). Again, K and M are capital when referring to the method. We fit the scaled data, kmeans.fit(x_scaled), and append the result to the wcss list using the inertia_ attribute. Great. Note that the range is from 1 to 10, so we will get the WCSS for the one-, two-, three-, and so on up to nine-cluster solutions. That was an arbitrary decision on my side.

Finally, let's plot WCSS versus the number of clusters, as we did in the previous lectures. The result is an elbow; the scaling and elbow steps are sketched together below. Given this graph, think about the correct number of clusters we should use. We will explore the different solutions in the next lesson. Thanks for watching.
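[Editor's note: a minimal sketch of the scaling and elbow-method steps described above, assuming the x data frame from the previous snippet.]

    from sklearn import preprocessing

    # Standardize each column separately (mean 0, standard deviation 1);
    # loyalty is already standardized, so mainly satisfaction changes.
    x_scaled = preprocessing.scale(x)

    # Elbow method: WCSS for the one- to nine-cluster solutions.
    wcss = []
    for i in range(1, 10):
        kmeans = KMeans(i)
        kmeans.fit(x_scaled)
        wcss.append(kmeans.inertia_)  # inertia_ holds the WCSS

    # Plot WCSS against the number of clusters and look for the elbow.
    plt.plot(range(1, 10), wcss)
    plt.xlabel('Number of clusters')
    plt.ylabel('WCSS')
    plt.show()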
