1
00:00:00,000 --> 00:00:02,000
Instructor: Data profiling tools provide
2
00:00:02,000 --> 00:00:04,000
a visual way for you to explore your data
3
00:00:04,000 --> 00:00:07,000
and to get a sense of its composition,
4
00:00:07,000 --> 00:00:10,000
and each of the data profiling tools that we'll cover
5
00:00:10,000 --> 00:00:14,000
has slightly different options, layouts, and purposes.
6
00:00:14,000 --> 00:00:16,000
Digging into column quality first,
7
00:00:16,000 --> 00:00:18,000
this is all about showing the percentage
8
00:00:18,000 --> 00:00:20,000
of values within a column
9
00:00:20,000 --> 00:00:22,000
and these percentages are broken down
10
00:00:22,000 --> 00:00:26,000
into three types: valid, error, or empty,
11
00:00:26,000 --> 00:00:29,000
and if you're ever trying to quickly identify the percentage
12
00:00:29,000 --> 00:00:32,000
of empty or error values within a column
13
00:00:32,000 --> 00:00:34,000
this is a great tool to use.
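To make that breakdown concrete, here's a rough pandas sketch of the valid/error/empty split — purely an analogy with made-up values, not what Power BI runs under the hood:

```python
import pandas as pd

# Hypothetical column: two empty values and one that fails numeric conversion.
col = pd.Series(["100", "250", None, "30---", "75", None])

converted = pd.to_numeric(col, errors="coerce")  # NaN where conversion fails
empty = col.isna()                               # empty in the source
error = col.notna() & converted.isna()           # present but unconvertible
valid = ~empty & ~error

n = len(col)
print(f"valid: {valid.sum() / n:.0%}")  # 3 of 6 -> 50%
print(f"error: {error.sum() / n:.0%}")  # 1 of 6 -> 17%
print(f"empty: {empty.sum() / n:.0%}")  # 2 of 6 -> 33%
```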
14
00:00:34,000 --> 00:00:37,000
Additionally, you can hover over the column quality box
15
00:00:37,000 --> 00:00:40,000
to reveal a contextual menu
16
00:00:40,000 --> 00:00:43,000
with some details and additional options.
17
00:00:43,000 --> 00:00:46,000
On the lower right, you'll see a little ellipsis icon
18
00:00:46,000 --> 00:00:50,000
that when clicked will reveal a bigger contextual menu
19
00:00:50,000 --> 00:00:52,000
with some other options to do things
20
00:00:52,000 --> 00:00:55,000
like clean up duplicates, remove empty values
21
00:00:55,000 --> 00:00:57,000
and remove or replace errors.
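Those menu options map onto familiar table operations. As a hedged illustration, here's roughly the same cleanup in pandas, with hypothetical column names and values — note that once you coerce types, pandas folds both errors and empties into NaN, whereas Power Query keeps them distinct:

```python
import pandas as pd

# Hypothetical table with a duplicate row, an empty key, and a bad value.
df = pd.DataFrame({
    "CustomerKey": ["11000", "11001", "11001", None, "30---"],
    "FirstName":   ["Jon", "Eugene", "Eugene", "Ruben", "Ian"],
})

# "30---" can't convert, so it becomes NaN alongside the truly empty key.
df["CustomerKey"] = pd.to_numeric(df["CustomerKey"], errors="coerce")

df = df.drop_duplicates()               # ~ Remove Duplicates
df = df.dropna(subset=["CustomerKey"])  # ~ Remove Empty / Remove Errors
print(df)                               # two clean rows remain
```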
22
00:00:57,000 --> 00:01:00,000
So one thing to keep in mind here is that the great thing
23
00:01:00,000 --> 00:01:02,000
about these types of contextual menus
24
00:01:02,000 --> 00:01:05,000
is that they allow you to solve column quality issues
25
00:01:05,000 --> 00:01:06,000
without having to try
26
00:01:06,000 --> 00:01:10,000
and find the appropriate steps within the query editor.
27
00:01:10,000 --> 00:01:13,000
These menus really bring this functionality straight
28
00:01:13,000 --> 00:01:15,000
to you without you having to go and find it.
29
00:01:16,000 --> 00:01:19,000
Another profiling tool is column distribution
30
00:01:19,000 --> 00:01:22,000
and this provides a sample distribution
31
00:01:22,000 --> 00:01:25,000
of the data contained within a column
32
00:01:25,000 --> 00:01:27,000
and very similar to column quality tools,
33
00:01:27,000 --> 00:01:29,000
we have the same types
34
00:01:29,000 --> 00:01:32,000
of contextual menu options that help us clean up
35
00:01:32,000 --> 00:01:34,000
and sort out error or duplicate values
36
00:01:34,000 --> 00:01:37,000
that may be contained within the column.
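Conceptually, the column distribution tile is a value-frequency chart plus distinct and unique counts. A minimal pandas analogy, with an invented marital-status column:

```python
import pandas as pd

# Hypothetical marital-status column; the values are made up.
marital = pd.Series(["M", "S", "M", "M", "S", None])

print(marital.value_counts(dropna=False))              # bar heights in the chart
print("distinct:", marital.nunique())                  # 2 different values
print("unique:", (marital.value_counts() == 1).sum())  # 0 -- every value repeats
```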
37
00:01:37,000 --> 00:01:39,000
The other thing that I'd like to call out here
38
00:01:39,000 --> 00:01:43,000
is that you actually have a suggestion from the query editor
39
00:01:43,000 --> 00:01:45,000
and this suggestion is a potential remedy based
40
00:01:45,000 --> 00:01:48,000
on the distribution that it finds within the column.
41
00:01:48,000 --> 00:01:50,000
So here we can see
42
00:01:50,000 --> 00:01:52,000
that there's a remove duplicates suggestion
43
00:01:52,000 --> 00:01:54,000
to help clean up this data.
44
00:01:54,000 --> 00:01:57,000
This may not always be exactly what you want,
45
00:01:57,000 --> 00:02:00,000
but it can be kind of handy if the recommended option
46
00:02:00,000 --> 00:02:03,000
is the exact data cleaning step that you're looking for.
47
00:02:03,000 --> 00:02:07,000
Our last data profiling tool is called column profile,
48
00:02:07,000 --> 00:02:10,000
and this provides a more holistic view of the data
49
00:02:10,000 --> 00:02:13,000
within a column by providing a sample distribution
50
00:02:13,000 --> 00:02:17,000
of the data along with specific column statistics.
51
00:02:17,000 --> 00:02:19,000
So on the bottom left of this view,
52
00:02:19,000 --> 00:02:21,000
we have the column statistics view,
53
00:02:21,000 --> 00:02:23,000
which provides a bit more detail
54
00:02:23,000 --> 00:02:27,000
than the column distribution view we just reviewed.
55
00:02:27,000 --> 00:02:30,000
This actually now gives us some detail about counts,
56
00:02:30,000 --> 00:02:33,000
the number of distinct values, the number of uniques,
57
00:02:33,000 --> 00:02:38,000
some min and max values, empty values, and so on.
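Those statistics are all simple aggregates. As a sketch (invented names, not Power BI's internal code), here's how the same numbers fall out in pandas:

```python
import pandas as pd

# Hypothetical first-name column with one empty value.
names = pd.Series(["Jon", "Eugene", "Jon", None, "Ruben", "Eugene", "Jon"])

counts = names.value_counts()
stats = {
    "count": int(names.count()),        # non-empty values
    "empty": int(names.isna().sum()),
    "distinct": int(names.nunique()),   # different values
    "unique": int((counts == 1).sum()), # values appearing exactly once
    "min": names.dropna().min(),        # alphabetical first, for text
    "max": names.dropna().max(),        # alphabetical last
}
print(stats)
```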
58
00:02:38,000 --> 00:02:40,000
On the right hand side, we have the value distribution
59
00:02:40,000 --> 00:02:41,000
which shows the distribution
60
00:02:41,000 --> 00:02:44,000
of each value within the column,
61
00:02:44,000 --> 00:02:46,000
and similar to the other profiling tools
62
00:02:46,000 --> 00:02:49,000
you can also hover to reveal a contextual menu
63
00:02:49,000 --> 00:02:51,000
with suggested transformation options.
64
00:02:51,000 --> 00:02:54,000
All right, so with all of this in mind,
65
00:02:54,000 --> 00:02:56,000
let's head over to the query editor
66
00:02:56,000 --> 00:02:57,000
and we're gonna check out
67
00:02:57,000 --> 00:03:00,000
these column profiling tools for ourselves.
68
00:03:00,000 --> 00:03:01,000
All right, so for this demo
69
00:03:01,000 --> 00:03:05,000
we're gonna connect to the customer lookup table,
70
00:03:05,000 --> 00:03:06,000
and this is the table that we're gonna use
71
00:03:06,000 --> 00:03:10,000
to demo some of these profiling and QA tools.
72
00:03:14,000 --> 00:03:15,000
All right, perfect.
73
00:03:15,000 --> 00:03:18,000
So again, kind of following some of our best practices
74
00:03:18,000 --> 00:03:22,000
and initial first steps, let's update our table name here
75
00:03:22,000 --> 00:03:25,000
to just customer lookup, and then we'll check to make sure
76
00:03:25,000 --> 00:03:29,000
that all of the data types and column headers look good.
77
00:03:30,000 --> 00:03:32,000
All right, so kind of scrolling along here.
78
00:03:32,000 --> 00:03:35,000
Text values, our birthdate as a date, marital status
79
00:03:35,000 --> 00:03:40,000
gender, text, email, text, annual income here.
80
00:03:40,000 --> 00:03:41,000
This is a whole number.
81
00:03:41,000 --> 00:03:43,000
We could update that to currency if we want,
82
00:03:43,000 --> 00:03:44,000
but that looks good.
83
00:03:45,000 --> 00:03:47,000
Okay, everything looks really good here.
84
00:03:47,000 --> 00:03:49,000
So all of these column profiling tools
85
00:03:49,000 --> 00:03:53,000
that we talked about are located up here in the view menu
86
00:03:53,000 --> 00:03:57,000
and let's enable our column quality,
87
00:03:57,000 --> 00:03:59,000
and we can see that our quality, you know
88
00:03:59,000 --> 00:04:01,000
it actually looks pretty decent, right?
89
00:04:01,000 --> 00:04:05,000
We get this little results window for valid, error, and empty.
90
00:04:05,000 --> 00:04:09,000
Like I said, when we hover, we get these contextual menus.
91
00:04:09,000 --> 00:04:12,000
You can right click and also get a contextual menu
92
00:04:12,000 --> 00:04:13,000
with some of these kind of
93
00:04:13,000 --> 00:04:15,000
common transformation steps, right?
94
00:04:15,000 --> 00:04:18,000
So if we scroll along here, right?
95
00:04:18,000 --> 00:04:20,000
We've got prefix column here
96
00:04:20,000 --> 00:04:24,000
with less than 1% of empty values here,
97
00:04:24,000 --> 00:04:25,000
and again like this is stuff
98
00:04:25,000 --> 00:04:26,000
that kind of makes sense, right?
99
00:04:26,000 --> 00:04:30,000
We may not have a Mr. or Mrs. prefix
100
00:04:30,000 --> 00:04:32,000
for all of the customers within our data set.
101
00:04:32,000 --> 00:04:35,000
So seeing that it's only nine here,
102
00:04:35,000 --> 00:04:37,000
I'm not super concerned about this,
103
00:04:37,000 --> 00:04:40,000
and it's probably not some sort of issue
104
00:04:40,000 --> 00:04:43,000
that is gonna compromise the integrity of our data, right?
105
00:04:43,000 --> 00:04:46,000
If we keep scrolling along here, again, everything is valid
106
00:04:46,000 --> 00:04:49,000
at 100%, no errors, no empties
107
00:04:49,000 --> 00:04:51,000
everything looks pretty good here,
108
00:04:51,000 --> 00:04:54,000
but this is actually a bit misleading,
109
00:04:54,000 --> 00:04:57,000
because the default column profiling only analyzes
110
00:04:57,000 --> 00:05:00,000
the first 1000 rows of the table,
111
00:05:00,000 --> 00:05:05,000
and our customer lookup table contains over 18,000 records.
112
00:05:05,000 --> 00:05:07,000
So in order to view all of the records,
113
00:05:07,000 --> 00:05:09,000
we need to update the column profiling
114
00:05:09,000 --> 00:05:11,000
from the first 1000 records
115
00:05:11,000 --> 00:05:14,000
to be based on the entire data set
116
00:05:14,000 --> 00:05:16,000
and then once we update that range
117
00:05:16,000 --> 00:05:17,000
we'll see how this really looks.
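This is worth internalizing: any profile computed on only the top rows can miss problems further down. A toy illustration in pandas, with a fabricated 18,000-row key column whose only bad values sit past the first 1000 rows:

```python
import numpy as np
import pandas as pd

# Fabricated key column: five missing values, all deep in the table.
keys = pd.Series(np.arange(18_000), dtype="float")
keys.iloc[[5_000, 9_200, 12_345, 15_000, 17_999]] = np.nan

top_1000 = keys.head(1000)
print(top_1000.isna().sum())  # 0 -> the preview looks perfectly clean
print(keys.isna().sum())      # 5 -> the full scan reveals the problems
```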
118
00:05:17,000 --> 00:05:19,000
So to do that, we see down at the bottom here
119
00:05:19,000 --> 00:05:21,000
where it says column profiling is based
120
00:05:21,000 --> 00:05:23,000
on the top 1000 rows,
121
00:05:23,000 --> 00:05:24,000
and we actually want to have it
122
00:05:24,000 --> 00:05:26,000
based on the entire data set.
123
00:05:26,000 --> 00:05:29,000
So once we do that, Power BI goes through and re-scans this
124
00:05:29,000 --> 00:05:32,000
and once this updates, look what happens, right?
125
00:05:32,000 --> 00:05:35,000
Our customer key column now has a bunch
126
00:05:35,000 --> 00:05:36,000
of errors in it, right?
127
00:05:36,000 --> 00:05:39,000
We're seeing that there are five errors here.
128
00:05:39,000 --> 00:05:42,000
Some of our other columns like prefix and first name,
129
00:05:42,000 --> 00:05:45,000
like we're actually seeing some empty values
130
00:05:45,000 --> 00:05:46,000
in here as well.
131
00:05:46,000 --> 00:05:48,000
So something is going on with this data set
132
00:05:48,000 --> 00:05:50,000
that we actually need to clean up,
133
00:05:50,000 --> 00:05:52,000
and like we had talked about
134
00:05:52,000 --> 00:05:56,000
we can actually use Power Query's suggestions here
135
00:05:56,000 --> 00:05:58,000
to remove these errors, right?
136
00:05:58,000 --> 00:06:00,000
So if I click remove errors here,
137
00:06:00,000 --> 00:06:03,000
a new applied step is going to be added for that
138
00:06:03,000 --> 00:06:08,000
and then we'll see the updated results based on that change.
139
00:06:08,000 --> 00:06:10,000
The only other piece that we have here now
140
00:06:10,000 --> 00:06:12,000
is that we're actually getting some empty values
141
00:06:12,000 --> 00:06:14,000
within this column, right?
142
00:06:14,000 --> 00:06:18,000
So again, we can follow Power Query suggested update here,
143
00:06:18,000 --> 00:06:20,000
click remove empty, right?
144
00:06:20,000 --> 00:06:23,000
We've added another applied step here for filtering rows.
145
00:06:23,000 --> 00:06:27,000
We're filtering out those empty rows
146
00:06:27,000 --> 00:06:30,000
and now we're kind of back to where we were at, right?
147
00:06:30,000 --> 00:06:33,000
We've got 130 empty values here for prefix,
148
00:06:33,000 --> 00:06:37,000
but again, that may make sense based on the data set
149
00:06:37,000 --> 00:06:40,000
and if I scroll over, right?
150
00:06:40,000 --> 00:06:44,000
All of these other columns are now 100% valid, right?
151
00:06:44,000 --> 00:06:47,000
So it looks like that took care of it,
152
00:06:47,000 --> 00:06:49,000
but what I would still like to understand is
153
00:06:49,000 --> 00:06:52,000
what exactly was causing those errors, right?
154
00:06:52,000 --> 00:06:55,000
Like, what were the actual values, or the reason behind them?
155
00:06:55,000 --> 00:07:00,000
So let's delete these last two applied steps, right?
156
00:07:00,000 --> 00:07:02,000
And now we're back to the state
157
00:07:02,000 --> 00:07:04,000
where we have all of these errors.
158
00:07:04,000 --> 00:07:07,000
If we come back to our header here
159
00:07:07,000 --> 00:07:11,000
and we right click, if you select keep errors,
160
00:07:11,000 --> 00:07:14,000
what this is going to do is add a new applied step
161
00:07:14,000 --> 00:07:15,000
that shows a results table
162
00:07:15,000 --> 00:07:17,000
with all of those errors in it,
163
00:07:17,000 --> 00:07:21,000
and you can actually click into each of these errors
164
00:07:21,000 --> 00:07:23,000
to see the Power Query error message.
165
00:07:23,000 --> 00:07:25,000
So this is a data format error
166
00:07:25,000 --> 00:07:28,000
that we couldn't convert to a number, right?
167
00:07:28,000 --> 00:07:30,000
And here's the details.
168
00:07:30,000 --> 00:07:33,000
This 30 with three dashes is what Power Query
169
00:07:33,000 --> 00:07:37,000
was trying to convert to a number, right?
170
00:07:37,000 --> 00:07:39,000
And if I clear out of this last applied step,
171
00:07:40,000 --> 00:07:43,000
you know, I can click into another error here,
172
00:07:44,000 --> 00:07:46,000
and again it's that same error.
173
00:07:46,000 --> 00:07:48,000
We couldn't convert this to a number
174
00:07:48,000 --> 00:07:52,000
and here's the value that it was trying to convert.
175
00:07:52,000 --> 00:07:53,000
All right, we'll check one more here.
176
00:07:55,000 --> 00:07:57,000
All right, so this is the same error,
177
00:07:57,000 --> 00:07:59,000
but a slightly different issue, right?
178
00:07:59,000 --> 00:08:02,000
It looks like there's some source data here
179
00:08:02,000 --> 00:08:04,000
or a link or something like that,
180
00:08:04,000 --> 00:08:07,000
that couldn't be converted to a whole number.
181
00:08:07,000 --> 00:08:11,000
So again, like that is the underlying reason why
182
00:08:11,000 --> 00:08:12,000
those errors were happening.
183
00:08:12,000 --> 00:08:15,000
Those are the actual issues within that data set.
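That same kind of detective work is easy to script. A hedged pandas sketch of keep errors — filtering down to just the raw values that fail numeric conversion (the sample values here are invented):

```python
import pandas as pd

# Hypothetical raw key column with two values that won't convert.
raw = pd.Series(["11000", "11001", "30---", "11003", "link-to-source"])

converted = pd.to_numeric(raw, errors="coerce")
bad = raw[converted.isna() & raw.notna()]  # ~ Power Query's Keep Errors
print(bad.tolist())                        # the culprits behind the errors
```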
184
00:08:15,000 --> 00:08:19,000
We also see that there's some other values within prefix
185
00:08:19,000 --> 00:08:21,000
that really don't make sense, right?
186
00:08:21,000 --> 00:08:23,000
These are not picked up as inaccurate,
187
00:08:23,000 --> 00:08:26,000
because of the column data type being text,
188
00:08:26,000 --> 00:08:29,000
but lowercase M doesn't really mean anything, right?
189
00:08:29,000 --> 00:08:31,000
If we scroll over here,
190
00:08:31,000 --> 00:08:33,000
looks like we've got some birth dates from,
191
00:08:33,000 --> 00:08:35,000
you know January 1st, 1900.
192
00:08:35,000 --> 00:08:39,000
So again, these different rows here don't make sense
193
00:08:39,000 --> 00:08:42,000
to keep as part of this data set.
194
00:08:42,000 --> 00:08:45,000
So if we clear our keep errors step,
195
00:08:45,000 --> 00:08:46,000
the great thing about this
196
00:08:46,000 --> 00:08:48,000
is that we can just rerun through
197
00:08:48,000 --> 00:08:51,000
those same cleaning steps, right?
198
00:08:51,000 --> 00:08:54,000
We can either remove errors by clicking this,
199
00:08:54,000 --> 00:08:59,000
or we can right click and then select remove errors, right?
200
00:09:01,000 --> 00:09:04,000
We can right click again, remove empty
201
00:09:05,000 --> 00:09:09,000
and after that we're back to this nice clean data set,
202
00:09:09,000 --> 00:09:11,000
where our only empty values are
203
00:09:11,000 --> 00:09:14,000
within this customer prefix column.
204
00:09:14,000 --> 00:09:17,000
All right, so now that we've got these errors cleaned up
205
00:09:17,000 --> 00:09:18,000
let's quickly check out some
206
00:09:18,000 --> 00:09:21,000
of the other profiling tool options,
207
00:09:21,000 --> 00:09:23,000
and I'm gonna deselect column quality.
208
00:09:23,000 --> 00:09:25,000
We'll check out column distribution,
209
00:09:25,000 --> 00:09:28,000
and again, when we select our column distribution here
210
00:09:28,000 --> 00:09:29,000
we see a sample distribution
211
00:09:29,000 --> 00:09:32,000
of the values within each of the columns,
212
00:09:32,000 --> 00:09:36,000
and if we go in, let's check out first name here,
213
00:09:36,000 --> 00:09:39,000
we can see that we have 666 distinct values
214
00:09:39,000 --> 00:09:41,000
and 89 unique values.
215
00:09:41,000 --> 00:09:46,000
So this means that there are 666 distinct first names
216
00:09:46,000 --> 00:09:51,000
within the column, and 89 of them appear only once, right?
217
00:09:51,000 --> 00:09:54,000
So that's how to interpret those numbers,
218
00:09:54,000 --> 00:09:57,000
and if we hover over the column header here,
219
00:09:57,000 --> 00:10:01,000
again Power Query has this remove duplicates suggestion,
220
00:10:01,000 --> 00:10:02,000
and again, we're a little bit smarter
221
00:10:02,000 --> 00:10:04,000
than this suggestion and understand
222
00:10:04,000 --> 00:10:07,000
that duplicates are actually necessary within the column,
223
00:10:07,000 --> 00:10:09,000
so we'll ignore that
224
00:10:09,000 --> 00:10:12,000
and then last, we'll check out the column profile tool
225
00:10:12,000 --> 00:10:17,000
and here we get a really nice view of our column statistics
226
00:10:17,000 --> 00:10:19,000
and a chart showing the distribution.
227
00:10:19,000 --> 00:10:21,000
Again, one of the important pieces to note here
228
00:10:21,000 --> 00:10:24,000
is that some of these column statistics
229
00:10:24,000 --> 00:10:26,000
really may not make sense.
230
00:10:26,000 --> 00:10:27,000
Count makes sense.
231
00:10:27,000 --> 00:10:29,000
We're gonna wanna see the errors
232
00:10:29,000 --> 00:10:33,000
or empty values, distinct, unique, these may make sense,
233
00:10:33,000 --> 00:10:35,000
but the min and max here,
234
00:10:35,000 --> 00:10:37,000
these don't really tell you much beyond
235
00:10:37,000 --> 00:10:40,000
the first and last names in alphabetical order,
236
00:10:40,000 --> 00:10:44,000
but if we scroll over to a column like our annual income
237
00:10:46,000 --> 00:10:51,000
and select this one, now some of these column statistics
238
00:10:51,000 --> 00:10:53,000
are gonna provide a little bit more value here,
239
00:10:53,000 --> 00:10:56,000
especially when we get down to the minimum, maximum values
240
00:10:56,000 --> 00:10:59,000
our average and standard deviation, right?
241
00:10:59,000 --> 00:11:02,000
So depending on the data contained
242
00:11:02,000 --> 00:11:05,000
within each of these columns, some of the statistics tools
243
00:11:05,000 --> 00:11:08,000
and the distribution may make more sense.
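For a numeric column like income, those statistics are ordinary aggregates. A small pandas sketch with a made-up income column:

```python
import pandas as pd

# Hypothetical annual-income column: numeric stats are meaningful here.
income = pd.Series([10_000, 20_000, 30_000, 40_000, 170_000])

print(income.min(), income.max())  # 10000 170000
print(income.mean())               # 54000.0
print(income.std())                # sample standard deviation
```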
244
00:11:08,000 --> 00:11:11,000
The larger point here is that you have a bunch of tools
245
00:11:11,000 --> 00:11:14,000
at your disposal that are really helpful to use
246
00:11:14,000 --> 00:11:15,000
when you're exploring
247
00:11:15,000 --> 00:11:16,000
and trying to understand
248
00:11:16,000 --> 00:11:19,000
the data contained within your tables.
249
00:11:19,000 --> 00:11:22,000
Data prep and QA is often a very time-consuming
250
00:11:22,000 --> 00:11:25,000
and intensive process for an analyst,
251
00:11:25,000 --> 00:11:27,000
and these tools can help make that process
252
00:11:27,000 --> 00:11:29,000
a bit quicker and more scalable.