1
00:00:00,000 --> 00:00:02,000
Instructor: Data profiling tools provide
2
00:00:02,000 --> 00:00:04,000
a visual way for you to explore your data
3
00:00:04,000 --> 00:00:07,000
and to get a sense of its composition,
4
00:00:07,000 --> 00:00:10,000
and each of the data profiling tools that we'll cover
5
00:00:10,000 --> 00:00:14,000
has slightly different options, layouts, and purposes.
6
00:00:14,000 --> 00:00:16,000
Digging into column quality first,
7
00:00:16,000 --> 00:00:18,000
this is all about showing the percentage
8
00:00:18,000 --> 00:00:20,000
of values within a column
9
00:00:20,000 --> 00:00:22,000
and these percentages are broken down
10
00:00:22,000 --> 00:00:26,000
into three types: valid, error, or empty,
11
00:00:26,000 --> 00:00:29,000
and if you're ever trying to quickly identify the percentage
12
00:00:29,000 --> 00:00:32,000
of empty or error values within a column
13
00:00:32,000 --> 00:00:34,000
this is a great tool to use.
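To make that breakdown concrete, here's a rough pandas sketch of the valid/error/empty split — purely an analogy with made-up values, not what Power BI runs under the hood:

```python
import pandas as pd

# Hypothetical column: two empty values and one that fails numeric conversion.
col = pd.Series(["100", "250", None, "30---", "75", None])

converted = pd.to_numeric(col, errors="coerce")  # NaN where conversion fails
empty = col.isna()                               # empty in the source
error = col.notna() & converted.isna()           # present but unconvertible
valid = ~empty & ~error

n = len(col)
print(f"valid: {valid.sum() / n:.0%}")  # 3 of 6 -> 50%
print(f"error: {error.sum() / n:.0%}")  # 1 of 6 -> 17%
print(f"empty: {empty.sum() / n:.0%}")  # 2 of 6 -> 33%
```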
14
00:00:34,000 --> 00:00:37,000
Additionally, you can hover over the column quality box
15
00:00:37,000 --> 00:00:40,000
to reveal a contextual menu
16
00:00:40,000 --> 00:00:43,000
with some details and additional options.
17
00:00:43,000 --> 00:00:46,000
On the lower right, you'll see a little ellipsis icon
18
00:00:46,000 --> 00:00:50,000
that when clicked will reveal a bigger contextual menu
19
00:00:50,000 --> 00:00:52,000
with some other options to do things
20
00:00:52,000 --> 00:00:55,000
like clean up duplicates, remove empty values
21
00:00:55,000 --> 00:00:57,000
and remove or replace errors.
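Those menu options map onto familiar table operations. As a hedged illustration, here's roughly the same cleanup in pandas, with hypothetical column names and values — note that once you coerce types, pandas folds both errors and empties into NaN, whereas Power Query keeps them distinct:

```python
import pandas as pd

# Hypothetical table with a duplicate row, an empty key, and a bad value.
df = pd.DataFrame({
    "CustomerKey": ["11000", "11001", "11001", None, "30---"],
    "FirstName":   ["Jon", "Eugene", "Eugene", "Ruben", "Ian"],
})

# "30---" can't convert, so it becomes NaN alongside the truly empty key.
df["CustomerKey"] = pd.to_numeric(df["CustomerKey"], errors="coerce")

df = df.drop_duplicates()               # ~ Remove Duplicates
df = df.dropna(subset=["CustomerKey"])  # ~ Remove Empty / Remove Errors
print(df)                               # two clean rows remain
```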
22
00:00:57,000 --> 00:01:00,000
So one thing to keep in mind here is that the great thing
23
00:01:00,000 --> 00:01:02,000
about these types of contextual menus
24
00:01:02,000 --> 00:01:05,000
is that they allow you to solve column quality issues
25
00:01:05,000 --> 00:01:06,000
without having to try
26
00:01:06,000 --> 00:01:10,000
and find the appropriate steps within the query editor.
27
00:01:10,000 --> 00:01:13,000
These menus really bring this functionality straight
28
00:01:13,000 --> 00:01:15,000
to you without you having to go and find it.
29
00:01:16,000 --> 00:01:19,000
Another profiling tool is column distribution
30
00:01:19,000 --> 00:01:22,000
and this provides a sample distribution
31
00:01:22,000 --> 00:01:25,000
of the data contained within a column
32
00:01:25,000 --> 00:01:27,000
and very similar to column quality tools,
33
00:01:27,000 --> 00:01:29,000
we have the same types
34
00:01:29,000 --> 00:01:32,000
of contextual menu options that help us clean up
35
00:01:32,000 --> 00:01:34,000
and sort out error or duplicate values
36
00:01:34,000 --> 00:01:37,000
that may be contained within the column.
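Conceptually, the column distribution tile is a value-frequency chart plus distinct and unique counts. A minimal pandas analogy, with an invented marital-status column:

```python
import pandas as pd

# Hypothetical marital-status column; the values are made up.
marital = pd.Series(["M", "S", "M", "M", "S", None])

print(marital.value_counts(dropna=False))              # bar heights in the chart
print("distinct:", marital.nunique())                  # 2 different values
print("unique:", (marital.value_counts() == 1).sum())  # 0 -- every value repeats
```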
37
00:01:37,000 --> 00:01:39,000
The other thing that I'd like to call out here
38
00:01:39,000 --> 00:01:43,000
is that you actually have a suggestion from the query editor
39
00:01:43,000 --> 00:01:45,000
and this suggestion is a potential remedy based
40
00:01:45,000 --> 00:01:48,000
on the distribution that it finds within the column.
41
00:01:48,000 --> 00:01:50,000
So here we can see
42
00:01:50,000 --> 00:01:52,000
that there's a remove duplicates suggestion
43
00:01:52,000 --> 00:01:54,000
to help clean up this data.
44
00:01:54,000 --> 00:01:57,000
This may not always be exactly what you want,
45
00:01:57,000 --> 00:02:00,000
but it can be kind of handy if the recommended option
46
00:02:00,000 --> 00:02:03,000
is the exact data cleaning step that you're looking for.
47
00:02:03,000 --> 00:02:07,000
Our last data profiling tool is called column profile,
48
00:02:07,000 --> 00:02:10,000
and this provides a more holistic view of the data
49
00:02:10,000 --> 00:02:13,000
within a column by providing a sample distribution
50
00:02:13,000 --> 00:02:17,000
of the data along with specific column statistics.
51
00:02:17,000 --> 00:02:19,000
So on the bottom left of this view,
52
00:02:19,000 --> 00:02:21,000
we have the column statistics view,
53
00:02:21,000 --> 00:02:23,000
which provides a bit more detail
54
00:02:23,000 --> 00:02:27,000
than the column distribution view we just reviewed.
55
00:02:27,000 --> 00:02:30,000
This actually now gives us some detail about counts,
56
00:02:30,000 --> 00:02:33,000
the number of distinct values, the number of uniques,
57
00:02:33,000 --> 00:02:38,000
some min and max values, empty values, and so on.
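Those statistics are all simple aggregates. As a sketch (invented names, not Power BI's internal code), here's how the same numbers fall out in pandas:

```python
import pandas as pd

# Hypothetical first-name column with one empty value.
names = pd.Series(["Jon", "Eugene", "Jon", None, "Ruben", "Eugene", "Jon"])

counts = names.value_counts()
stats = {
    "count": int(names.count()),        # non-empty values
    "empty": int(names.isna().sum()),
    "distinct": int(names.nunique()),   # different values
    "unique": int((counts == 1).sum()), # values appearing exactly once
    "min": names.dropna().min(),        # alphabetical first, for text
    "max": names.dropna().max(),        # alphabetical last
}
print(stats)
```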
58
00:02:38,000 --> 00:02:40,000
On the right hand side, we have the value distribution
59
00:02:40,000 --> 00:02:41,000
which shows the distribution
60
00:02:41,000 --> 00:02:44,000
of each value within the column,
61
00:02:44,000 --> 00:02:46,000
and similar to the other profiling tools
62
00:02:46,000 --> 00:02:49,000
you can also hover to reveal a contextual menu
63
00:02:49,000 --> 00:02:51,000
with suggested transformation options.
64
00:02:51,000 --> 00:02:54,000
All right, so with all of this in mind,
65
00:02:54,000 --> 00:02:56,000
let's head over to the query editor
66
00:02:56,000 --> 00:02:57,000
and we're gonna check out
67
00:02:57,000 --> 00:03:00,000
these column profiling tools for ourselves.
68
00:03:00,000 --> 00:03:01,000
All right, so for this demo
69
00:03:01,000 --> 00:03:05,000
we're gonna connect to the customer lookup table,
70
00:03:05,000 --> 00:03:06,000
and this is the table that we're gonna use
71
00:03:06,000 --> 00:03:10,000
to demo some of these profiling and QA tools.
72
00:03:14,000 --> 00:03:15,000
All right, perfect.
73
00:03:15,000 --> 00:03:18,000
So again, kind of following some of our best practices
74
00:03:18,000 --> 00:03:22,000
and initial first steps, let's update our table name here
75
00:03:22,000 --> 00:03:25,000
to just customer lookup, and then we'll check to make sure
76
00:03:25,000 --> 00:03:29,000
that all of the data types and column headers look good.
77
00:03:30,000 --> 00:03:32,000
All right, so kind of scrolling along here.
78
00:03:32,000 --> 00:03:35,000
Text values, our birthdate as a date, marital status
79
00:03:35,000 --> 00:03:40,000
gender, text, email, text, annual income here.
80
00:03:40,000 --> 00:03:41,000
This is a whole number.
81
00:03:41,000 --> 00:03:43,000
We could update that to currency if we want,
82
00:03:43,000 --> 00:03:44,000
but that looks good.
83
00:03:45,000 --> 00:03:47,000
Okay, everything looks really good here.
84
00:03:47,000 --> 00:03:49,000
So all of these column profiling tools
85
00:03:49,000 --> 00:03:53,000
that we talked about are located up here in the view menu
86
00:03:53,000 --> 00:03:57,000
and let's enable our column quality,
87
00:03:57,000 --> 00:03:59,000
and we can see that our quality, you know
88
00:03:59,000 --> 00:04:01,000
it actually looks pretty decent, right?
89
00:04:01,000 --> 00:04:05,000
We get this little results window for valid, error, and empty.
90
00:04:05,000 --> 00:04:09,000
Like I said, when we hover, we get these contextual menus.
91
00:04:09,000 --> 00:04:12,000
You can right click and also get a contextual menu
92
00:04:12,000 --> 00:04:13,000
with some of these kind of
93
00:04:13,000 --> 00:04:15,000
common transformation steps, right?
94
00:04:15,000 --> 00:04:18,000
So if we scroll along here, right?
95
00:04:18,000 --> 00:04:20,000
We've got prefix column here
96
00:04:20,000 --> 00:04:24,000
with less than 1% of empty values here,
97
00:04:24,000 --> 00:04:25,000
and again like this is stuff
98
00:04:25,000 --> 00:04:26,000
that kind of makes sense, right?
99
00:04:26,000 --> 00:04:30,000
We may not have a Mr. or Mrs. prefix
100
00:04:30,000 --> 00:04:32,000
for all of the customers within our data set.
101
00:04:32,000 --> 00:04:35,000
So seeing that it's only nine here,
102
00:04:35,000 --> 00:04:37,000
I'm not super concerned about this,
103
00:04:37,000 --> 00:04:40,000
and it's probably not some sort of issue
104
00:04:40,000 --> 00:04:43,000
that is gonna compromise the integrity of our data, right?
105
00:04:43,000 --> 00:04:46,000
If we keep scrolling along here, again, everything is valid
106
00:04:46,000 --> 00:04:49,000
at 100%, no errors, no empties
107
00:04:49,000 --> 00:04:51,000
everything looks pretty good here,
108
00:04:51,000 --> 00:04:54,000
but this is actually a bit misleading,
109
00:04:54,000 --> 00:04:57,000
because the default column profiling only analyzes
110
00:04:57,000 --> 00:05:00,000
the first 1000 rows of the table,
111
00:05:00,000 --> 00:05:05,000
and our customer lookup table contains over 18,000 records.
112
00:05:05,000 --> 00:05:07,000
So in order to view all of the records,
113
00:05:07,000 --> 00:05:09,000
we need to update the column profiling
114
00:05:09,000 --> 00:05:11,000
from the first 1000 records
115
00:05:11,000 --> 00:05:14,000
to be based on the entire data set
116
00:05:14,000 --> 00:05:16,000
and then once we update that range
117
00:05:16,000 --> 00:05:17,000
we'll see how this really looks.
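This is worth internalizing: any profile computed on only the top rows can miss problems further down. A toy illustration in pandas, with a fabricated 18,000-row key column whose only bad values sit past the first 1000 rows:

```python
import numpy as np
import pandas as pd

# Fabricated key column: five missing values, all deep in the table.
keys = pd.Series(np.arange(18_000), dtype="float")
keys.iloc[[5_000, 9_200, 12_345, 15_000, 17_999]] = np.nan

top_1000 = keys.head(1000)
print(top_1000.isna().sum())  # 0 -> the preview looks perfectly clean
print(keys.isna().sum())      # 5 -> the full scan reveals the problems
```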
118
00:05:17,000 --> 00:05:19,000
So to do that, we see down at the bottom here
119
00:05:19,000 --> 00:05:21,000
where it says column profiling is based
120
00:05:21,000 --> 00:05:23,000
on the top 1000 rows,
121
00:05:23,000 --> 00:05:24,000
and we actually want to have it
122
00:05:24,000 --> 00:05:26,000
based on the entire data set.
123
00:05:26,000 --> 00:05:29,000
So once we do that, Power BI goes through and re-scans this
124
00:05:29,000 --> 00:05:32,000
and once this updates, look what happens, right?
125
00:05:32,000 --> 00:05:35,000
Our customer key column now has a bunch
126
00:05:35,000 --> 00:05:36,000
of errors in it, right?
127
00:05:36,000 --> 00:05:39,000
We're seeing that there are five errors here.
128
00:05:39,000 --> 00:05:42,000
Some of our other columns like prefix and first name,
129
00:05:42,000 --> 00:05:45,000
like we're actually seeing some empty values
130
00:05:45,000 --> 00:05:46,000
in here as well.
131
00:05:46,000 --> 00:05:48,000
So something is going on with this data set
132
00:05:48,000 --> 00:05:50,000
that we actually need to clean up,
133
00:05:50,000 --> 00:05:52,000
and like we had talked about
134
00:05:52,000 --> 00:05:56,000
we can actually use Power Query's suggestions here
135
00:05:56,000 --> 00:05:58,000
to remove these errors, right?
136
00:05:58,000 --> 00:06:00,000
So if I click remove errors here,
137
00:06:00,000 --> 00:06:03,000
a new applied step is going to be added for that
138
00:06:03,000 --> 00:06:08,000
and then we'll see the updated results based on that change.
139
00:06:08,000 --> 00:06:10,000
The only other piece that we have here now
140
00:06:10,000 --> 00:06:12,000
is that we're actually getting some empty values
141
00:06:12,000 --> 00:06:14,000
within this column, right?
142
00:06:14,000 --> 00:06:18,000
So again, we can follow Power Query suggested update here,
143
00:06:18,000 --> 00:06:20,000
click remove empty, right?
144
00:06:20,000 --> 00:06:23,000
We've added another applied step here for filtering rows.
145
00:06:23,000 --> 00:06:27,000
We're filtering out those empty rows
146
00:06:27,000 --> 00:06:30,000
and now we're kind of back to where we were at, right?
147
00:06:30,000 --> 00:06:33,000
We've got 130 empty values here for prefix,
148
00:06:33,000 --> 00:06:37,000
but again, that may make sense based on the data set
149
00:06:37,000 --> 00:06:40,000
and if I scroll over, right?
150
00:06:40,000 --> 00:06:44,000
All of these other columns are now 100% valid, right?
151
00:06:44,000 --> 00:06:47,000
So it looks like that took care of it,
152
00:06:47,000 --> 00:06:49,000
but what I would still like to understand is
153
00:06:49,000 --> 00:06:52,000
what exactly was causing those errors, right?
154
00:06:52,000 --> 00:06:55,000
Like, what were the actual values, or the reason behind them?
155
00:06:55,000 --> 00:07:00,000
So let's delete these last two applied steps, right?
156
00:07:00,000 --> 00:07:02,000
And now we're back to the state
157
00:07:02,000 --> 00:07:04,000
where we have all of these errors.
158
00:07:04,000 --> 00:07:07,000
If we come back to our header here
159
00:07:07,000 --> 00:07:11,000
and we right click, if you select keep errors,
160
00:07:11,000 --> 00:07:14,000
what this is going to do is add a new applied step
161
00:07:14,000 --> 00:07:15,000
that shows a results table
162
00:07:15,000 --> 00:07:17,000
with all of those errors in it,
163
00:07:17,000 --> 00:07:21,000
and you can actually click into each of these errors
164
00:07:21,000 --> 00:07:23,000
to see the Power Query error message.
165
00:07:23,000 --> 00:07:25,000
So this is a data format error
166
00:07:25,000 --> 00:07:28,000
that we couldn't convert to a number, right?
167
00:07:28,000 --> 00:07:30,000
And here's the details.
168
00:07:30,000 --> 00:07:33,000
This 30 with three dashes is what Power Query
169
00:07:33,000 --> 00:07:37,000
was trying to convert to a number, right?
170
00:07:37,000 --> 00:07:39,000
And if I clear out of this last applied step,
171
00:07:40,000 --> 00:07:43,000
you know, I can click into another error here,
172
00:07:44,000 --> 00:07:46,000
and again it's that same error.
173
00:07:46,000 --> 00:07:48,000
We couldn't convert this to a number
174
00:07:48,000 --> 00:07:52,000
and here's the value that it was trying to convert.
175
00:07:52,000 --> 00:07:53,000
All right, we'll check one more here.
176
00:07:55,000 --> 00:07:57,000
All right, so this is the same error,
177
00:07:57,000 --> 00:07:59,000
but a slightly different issue, right?
178
00:07:59,000 --> 00:08:02,000
It looks like there's some source data here
179
00:08:02,000 --> 00:08:04,000
or a link or something like that,
180
00:08:04,000 --> 00:08:07,000
that couldn't be converted to a whole number.
181
00:08:07,000 --> 00:08:11,000
So again, like that is the underlying reason why
182
00:08:11,000 --> 00:08:12,000
those errors were happening.
183
00:08:12,000 --> 00:08:15,000
Those are the actual issues within that data set.
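That same kind of detective work is easy to script. A hedged pandas sketch of keep errors — filtering down to just the raw values that fail numeric conversion (the sample values here are invented):

```python
import pandas as pd

# Hypothetical raw key column with two values that won't convert.
raw = pd.Series(["11000", "11001", "30---", "11003", "link-to-source"])

converted = pd.to_numeric(raw, errors="coerce")
bad = raw[converted.isna() & raw.notna()]  # ~ Power Query's Keep Errors
print(bad.tolist())                        # the culprits behind the errors
```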
184
00:08:15,000 --> 00:08:19,000
We also see that there's some other values within prefix
185
00:08:19,000 --> 00:08:21,000
that really don't make sense, right?
186
00:08:21,000 --> 00:08:23,000
These are not picked up as inaccurate,
187
00:08:23,000 --> 00:08:26,000
because of the column data type being text,
188
00:08:26,000 --> 00:08:29,000
but lowercase M doesn't really mean anything, right?
189
00:08:29,000 --> 00:08:31,000
If we scroll over here,
190
00:08:31,000 --> 00:08:33,000
looks like we've got some birth dates from,
191
00:08:33,000 --> 00:08:35,000
you know January 1st, 1900.
192
00:08:35,000 --> 00:08:39,000
So again, these different rows here don't make sense
193
00:08:39,000 --> 00:08:42,000
to keep as part of this data set.
194
00:08:42,000 --> 00:08:45,000
So if we clear our keep errors step,
195
00:08:45,000 --> 00:08:46,000
the great thing about this
196
00:08:46,000 --> 00:08:48,000
is that we can just rerun through
197
00:08:48,000 --> 00:08:51,000
those same cleaning steps, right?
198
00:08:51,000 --> 00:08:54,000
We can either remove errors by clicking this,
199
00:08:54,000 --> 00:08:59,000
or we can right click and then select remove errors, right?
200
00:09:01,000 --> 00:09:04,000
We can right click again, remove empty
201
00:09:05,000 --> 00:09:09,000
and after that we're back to this nice clean data set,
202
00:09:09,000 --> 00:09:11,000
where our only empty values are
203
00:09:11,000 --> 00:09:14,000
within this customer prefix column.
204
00:09:14,000 --> 00:09:17,000
All right, so now that we've got these errors cleaned up
205
00:09:17,000 --> 00:09:18,000
let's quickly check out some
206
00:09:18,000 --> 00:09:21,000
of the other profiling tool options,
207
00:09:21,000 --> 00:09:23,000
and I'm gonna deselect column quality.
208
00:09:23,000 --> 00:09:25,000
We'll check out column distribution,
209
00:09:25,000 --> 00:09:28,000
and again, when we select our column distribution here
210
00:09:28,000 --> 00:09:29,000
we see a sample distribution
211
00:09:29,000 --> 00:09:32,000
of the values within each of the columns,
212
00:09:32,000 --> 00:09:36,000
and if we go in, let's check out first name here,
213
00:09:36,000 --> 00:09:39,000
we can see that we have 666 distinct values
214
00:09:39,000 --> 00:09:41,000
and 89 unique values.
215
00:09:41,000 --> 00:09:46,000
So this means that there are 666 distinct first names
216
00:09:46,000 --> 00:09:51,000
within the column, and 89 of them appear only once, right?
217
00:09:51,000 --> 00:09:54,000
So that's how to interpret those numbers,
218
00:09:54,000 --> 00:09:57,000
and if we hover over the column header here,
219
00:09:57,000 --> 00:10:01,000
again Power Query has this remove duplicates suggestion,
220
00:10:01,000 --> 00:10:02,000
and again, we're a little bit smarter
221
00:10:02,000 --> 00:10:04,000
than this suggestion and understand
222
00:10:04,000 --> 00:10:07,000
that duplicates are actually necessary within the column,
223
00:10:07,000 --> 00:10:09,000
so we'll ignore that
224
00:10:09,000 --> 00:10:12,000
and then last, we'll check out the column profile tool
225
00:10:12,000 --> 00:10:17,000
and here we get a really nice view of our column statistics
226
00:10:17,000 --> 00:10:19,000
and a chart showing the distribution.
227
00:10:19,000 --> 00:10:21,000
Again, one of the important pieces to note here
228
00:10:21,000 --> 00:10:24,000
is that some of these column statistics
229
00:10:24,000 --> 00:10:26,000
really may not make sense.
230
00:10:26,000 --> 00:10:27,000
Count makes sense.
231
00:10:27,000 --> 00:10:29,000
We're gonna wanna see the errors
232
00:10:29,000 --> 00:10:33,000
or empty values, distinct, unique, these may make sense,
233
00:10:33,000 --> 00:10:35,000
but the min and max here,
234
00:10:35,000 --> 00:10:37,000
these don't really tell you much beyond
235
00:10:37,000 --> 00:10:40,000
the first and last names in alphabetical order,
236
00:10:40,000 --> 00:10:44,000
but if we scroll over to a column like our annual income
237
00:10:46,000 --> 00:10:51,000
and select this one, now some of these column statistics
238
00:10:51,000 --> 00:10:53,000
are gonna provide a little bit more value here,
239
00:10:53,000 --> 00:10:56,000
especially when we get down to the minimum, maximum values
240
00:10:56,000 --> 00:10:59,000
our average and standard deviation, right?
241
00:10:59,000 --> 00:11:02,000
So depending on the data contained
242
00:11:02,000 --> 00:11:05,000
within each of these columns, some of the statistics tools
243
00:11:05,000 --> 00:11:08,000
and the distribution may make more sense.
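For a numeric column like income, those statistics are ordinary aggregates. A small pandas sketch with a made-up income column:

```python
import pandas as pd

# Hypothetical annual-income column: numeric stats are meaningful here.
income = pd.Series([10_000, 20_000, 30_000, 40_000, 170_000])

print(income.min(), income.max())  # 10000 170000
print(income.mean())               # 54000.0
print(income.std())                # sample standard deviation
```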
244
00:11:08,000 --> 00:11:11,000
The larger point here is that you have a bunch of tools
245
00:11:11,000 --> 00:11:14,000
at your disposal that are really helpful to use
246
00:11:14,000 --> 00:11:15,000
when you're exploring
247
00:11:15,000 --> 00:11:16,000
and trying to understand
248
00:11:16,000 --> 00:11:19,000
the data contained within your tables.
249
00:11:19,000 --> 00:11:22,000
Data prep and QA is often a very time-consuming
250
00:11:22,000 --> 00:11:25,000
and intensive process for an analyst,
251
00:11:25,000 --> 00:11:27,000
and these tools can help make that process
252
00:11:27,000 --> 00:11:29,000
a bit quicker and more scalable.