Instructor: It's time for a more sophisticated, yet easy-to-understand example. We will look into market segmentation. I'll open a clean Jupyter notebook and import all the relevant packages, including the K-means module. Next, we should load the dataset, located in 3.12 example.csv.

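A minimal sketch of that first cell could look like this; the file name comes from the lecture, while the 'Satisfaction' and 'Loyalty' column names are my assumption:

    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    # Load the retail dataset (30 clients; column names assumed here)
    data = pd.read_csv('3.12 example.csv')
    data.head()
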
In the CSV, we've got data from a retail shop. There are 30 observations. Each observation is a client, and for each client we have a score for customer satisfaction and one for brand loyalty.

Let's see how the data was gathered. Satisfaction is self-reported: people were basically asked to rate their shopping experience from 1 to 10, where 10 means extremely satisfied. Therefore, satisfaction here is a discrete variable that takes integer values.

Brand loyalty, on the other hand, is a tricky metric. There is no widely accepted technique to measure it, but there are different proxies, like churn rate, retention rate, or customer lifetime value. In this dataset, loyalty was measured through the number of purchases from that shop over a year, together with several other factors found to be significant. The range is roughly from minus two to two, as the variable is already standardized. That's something that often occurs, especially when creating latent variables like this one.

Okay, let's plot the data.

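Continuing from the imports above, a plotting cell along these lines would produce that scatter (column names again assumed):

    # Scatter of the two features, one point per client
    plt.scatter(data['Satisfaction'], data['Loyalty'])
    plt.xlabel('Satisfaction')
    plt.ylabel('Loyalty')
    plt.show()
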
We can kind of identify two clusters by looking at the graph: this one and that one, right?

Well, before we perform any analysis, let's reason about the problem for a while. We can divide this graph into four squares: low satisfaction, low loyalty; low satisfaction, high loyalty; high satisfaction, low loyalty; and high satisfaction, high loyalty.

So, going back to our two-cluster solution, we realize it didn't make much sense, right? What would these two clusters represent? The first one looks like low satisfaction, low loyalty, but the other one is all over the place. It seems to me that a two-cluster solution won't cut it. But enough speculation; let's leverage the knowledge we already possess and learn something new.

First, I'll copy the data into a new variable called x. Next, I'll use the same code we have used so far: kmeans = KMeans(2), then kmeans.fit(x).

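As a sketch, those steps in the notebook would be:

    # Work on a copy so the original data frame stays untouched
    x = data.copy()

    # Fit K-means with two clusters, as a first guess
    kmeans = KMeans(2)
    kmeans.fit(x)
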
Clusters is a duplicate of x, and the column cluster_pred of this data frame will contain the cluster where a particular observation was predicted to be placed by the algorithm.

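One way to build that data frame, assuming the fitted kmeans object from the previous cell:

    # Duplicate x and attach the predicted cluster for each client
    clusters = x.copy()
    clusters['cluster_pred'] = kmeans.fit_predict(x)
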
Next, I'll quickly plot the data.

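Coloring the scatter by the predicted cluster, roughly like so (the colormap choice is mine):

    # Same scatter as before, now colored by cluster membership
    plt.scatter(clusters['Satisfaction'], clusters['Loyalty'],
                c=clusters['cluster_pred'], cmap='rainbow')
    plt.xlabel('Satisfaction')
    plt.ylabel('Loyalty')
    plt.show()
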
What we see are two clusters, but they aren't the two we imagined, are they? If you examine the plot closely, you will realize that there is a cutoff line at the satisfaction value of six. Everything on the right is one cluster; everything on the left, the other. This solution may make sense to some, but most probably the algorithm only considered satisfaction as a feature.

Why? Because we did not standardize the variable. The satisfaction values are much higher than those of loyalty, so K-means more or less disregarded loyalty as a feature. Whenever we cluster on the basis of a single feature, the result will look like this, as if it were cut off by a vertical line. That's one of the ways to spot that something fishy is going on.

Okay, satisfaction and loyalty seem equally important features for market segmentation. So how can we fix this problem? How can we give them equal weight? Yes, by standardizing satisfaction.

There are several ways to do that in sklearn. The simplest one I am aware of is through the preprocessing module. So, from sklearn import preprocessing; then we will declare a new variable called x_scaled, equal to preprocessing.scale of x.

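In code, that's just:

    from sklearn import preprocessing

    # Standardize each column of x separately (zero mean, unit variance)
    x_scaled = preprocessing.scale(x)
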
Scale is a method which scales each variable separately. In other words, each column will be standardized with respect to itself, which is exactly what we need. And it's done.

As you can see, x_scaled is an array which contains the standardized satisfaction and the same values for loyalty. That's because loyalty was already standardized: it already had a mean of zero and a standard deviation of one.

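If you want to verify that for yourself, a quick check along these lines works, since x_scaled is a NumPy array:

    # Both columns should now have mean ~0 and standard deviation ~1
    print(x_scaled.mean(axis=0))
    print(x_scaled.std(axis=0))
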
Okay, let's continue in the known way.

Since we don't know the number of clusters needed, the elbow method will come in handy. Let's declare a list, wcss. Then, for i in range(1, 10), we set kmeans = KMeans(i) (again, K and M are capital when referring to the method), call kmeans.fit(x_scaled), and append the result to the wcss list using the inertia_ attribute. Great. Note that the range is from 1 to 10, so we will get the WCSS for the one-, two-, three-, and so on, up to nine-cluster solutions. That was an arbitrary decision on my side.

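Assembled into a cell, the loop would look roughly like this; note that the fit is on x_scaled, the standardized data:

    # Within-cluster sum of squares for 1 through 9 clusters
    wcss = []
    for i in range(1, 10):
        kmeans = KMeans(i)
        kmeans.fit(x_scaled)
        wcss.append(kmeans.inertia_)
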
Finally, let's plot wcss versus the number of clusters, as we did in the previous lectures. The result is an elbow.

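The plotting cell would be something like:

    # The classic elbow plot: look for the kink in the curve
    plt.plot(range(1, 10), wcss)
    plt.xlabel('Number of clusters')
    plt.ylabel('WCSS')
    plt.show()
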
Given this graph, think about the correct number of clusters we should use. We will explore the different solutions in the next lesson. Thanks for watching.