Would you like to inspect the original subtitles? These are the user uploaded subtitles that are being translated:
1
00:00:00,000 --> 00:00:01,000
Instructor: All right, let's talk about a concept
2
00:00:01,000 --> 00:00:03,000
called normalization.
3
00:00:03,000 --> 00:00:06,000
Now, this is a really important topic.
4
00:00:06,000 --> 00:00:08,000
It's gonna help you grasp exactly why
5
00:00:08,000 --> 00:00:10,000
and how data models actually work,
6
00:00:10,000 --> 00:00:11,000
but it can be a little bit tricky
7
00:00:11,000 --> 00:00:13,000
to wrap your head around at first.
8
00:00:15,000 --> 00:00:16,000
So the textbook definition here
9
00:00:16,000 --> 00:00:18,000
is that normalization is the process
10
00:00:18,000 --> 00:00:20,000
of organizing the tables and columns
11
00:00:20,000 --> 00:00:22,000
in a relational database
12
00:00:22,000 --> 00:00:25,000
to reduce redundancy and preserve data integrity.
13
00:00:25,000 --> 00:00:28,000
So what the heck does that even mean, right?
14
00:00:29,000 --> 00:00:31,000
Well, it's commonly used to do three different things.
15
00:00:31,000 --> 00:00:35,000
First, to eliminate redundant data in a model,
16
00:00:35,000 --> 00:00:37,000
to decrease your table sizes, and more importantly,
17
00:00:37,000 --> 00:00:40,000
improve processing speed and efficiency.
18
00:00:41,000 --> 00:00:43,000
It's also used to minimize errors and anomalies
19
00:00:43,000 --> 00:00:46,000
when you do things like insert, update, or delete,
20
00:00:46,000 --> 00:00:49,000
or otherwise modify data inside of your model.
21
00:00:50,000 --> 00:00:52,000
And third, it helps simplify queries
22
00:00:52,000 --> 00:00:55,000
and structure your database for meaningful analysis.
23
00:00:55,000 --> 00:00:57,000
And we'll see examples of this
24
00:00:57,000 --> 00:00:59,000
as we start writing DAX queries and measures
25
00:00:59,000 --> 00:01:01,000
later on in the course.
26
00:01:01,000 --> 00:01:03,000
So at this point, it still feels a little bit vague, right?
27
00:01:03,000 --> 00:01:05,000
A little ambiguous.
28
00:01:05,000 --> 00:01:07,000
So the way that I like to wrap my head around
29
00:01:07,000 --> 00:01:09,000
what normalization really means
30
00:01:09,000 --> 00:01:11,000
is to think of it this way,
31
00:01:11,000 --> 00:01:13,000
in a normalized database,
32
00:01:13,000 --> 00:01:16,000
each table should serve one distinct
33
00:01:16,000 --> 00:01:19,000
and specific purpose and one purpose only.
34
00:01:19,000 --> 00:01:22,000
So you may have a table that only contains information
35
00:01:22,000 --> 00:01:24,000
about individual products.
36
00:01:24,000 --> 00:01:25,000
You might have another table
37
00:01:25,000 --> 00:01:28,000
that only contains daily transaction records,
38
00:01:28,000 --> 00:01:30,000
another that's a calendar table
39
00:01:30,000 --> 00:01:32,000
with information about dates,
40
00:01:32,000 --> 00:01:34,000
and maybe some others that contain attributes
41
00:01:34,000 --> 00:01:37,000
about your customers, or your stores, and so on.
42
00:01:37,000 --> 00:01:38,000
And these should sound pretty familiar
43
00:01:38,000 --> 00:01:40,000
because these are the exact tables
44
00:01:40,000 --> 00:01:43,000
that we'll be working with in the project
45
00:01:43,000 --> 00:01:45,000
that we work through throughout this course.
46
00:01:45,000 --> 00:01:47,000
So let me take a stab at kind of bringing this to life,
47
00:01:47,000 --> 00:01:50,000
and visualizing the concept of normalization a bit.
48
00:01:51,000 --> 00:01:53,000
Imagine a table that looks like this.
49
00:01:53,000 --> 00:01:55,000
You've got transactional records here, right?
50
00:01:55,000 --> 00:01:57,000
You've got dates of transactions,
51
00:01:57,000 --> 00:02:00,000
product IDs that were sold, quantity of goods sold,
52
00:02:00,000 --> 00:02:03,000
but you also have product specific attributes
53
00:02:03,000 --> 00:02:04,000
stored here as well,
54
00:02:04,000 --> 00:02:06,000
like the product brand, the product name,
55
00:02:06,000 --> 00:02:08,000
the SKU, and weight,
56
00:02:08,000 --> 00:02:12,000
and because you have multiple transactions per day,
57
00:02:12,000 --> 00:02:13,000
and multiple transactions per product,
58
00:02:13,000 --> 00:02:17,000
you end up with a bunch of redundant duplicate fields
59
00:02:17,000 --> 00:02:18,000
or rows here.
60
00:02:19,000 --> 00:02:21,000
In other words, every time a specific product ID is sold,
61
00:02:21,000 --> 00:02:24,000
we're storing the same set of information
62
00:02:24,000 --> 00:02:26,000
about the brand, the name, the SKU, and the weight
63
00:02:26,000 --> 00:02:28,000
in row after row after row.
64
00:02:28,000 --> 00:02:30,000
And you might be looking at this and saying,
65
00:02:30,000 --> 00:02:32,000
"You know, what's the big deal?
66
00:02:32,000 --> 00:02:33,000
Like, I've got all the information that I need
67
00:02:33,000 --> 00:02:34,000
for my analysis.
68
00:02:35,000 --> 00:02:36,000
I've got it all in one place.
69
00:02:36,000 --> 00:02:37,000
So that's great, right?
70
00:02:38,000 --> 00:02:40,000
But suppose that instead of a handful of records,
71
00:02:40,000 --> 00:02:43,000
like we see here, you've got a hundred different products
72
00:02:43,000 --> 00:02:46,000
that sell an average of 10,000 times a day.
73
00:02:46,000 --> 00:02:50,000
Now, we're talking about upwards of 1 million duplicate rows
74
00:02:50,000 --> 00:02:52,000
for each date in the table.
75
00:02:53,000 --> 00:02:54,000
So the point is that minor inefficiencies
76
00:02:54,000 --> 00:02:58,000
like this can very quickly become major, major problems
77
00:02:58,000 --> 00:03:00,000
as you scale.
78
00:03:00,000 --> 00:03:02,000
And the way that we would solve a problem like this
79
00:03:02,000 --> 00:03:04,000
is to normalize the data, right?
80
00:03:04,000 --> 00:03:07,000
We'd take all those product specific attributes,
81
00:03:07,000 --> 00:03:09,000
brand names, SKU, and weight,
82
00:03:09,000 --> 00:03:10,000
and we would store those
83
00:03:10,000 --> 00:03:14,000
in its own separate table containing one row,
84
00:03:14,000 --> 00:03:17,000
or one unique record for each product ID
85
00:03:17,000 --> 00:03:20,000
that would essentially transform this one table
86
00:03:20,000 --> 00:03:22,000
into two tables that each serve one purpose.
87
00:03:22,000 --> 00:03:25,000
You'd have transactions with date, product ID,
88
00:03:25,000 --> 00:03:27,000
and quantity in one table.
89
00:03:28,000 --> 00:03:29,000
You'd have products in a second table,
90
00:03:29,000 --> 00:03:32,000
and you could use a relationship between those
91
00:03:32,000 --> 00:03:34,000
based on the product ID field
92
00:03:34,000 --> 00:03:37,000
to access the exact same information
93
00:03:37,000 --> 00:03:39,000
without any of the redundant data.
94
00:03:39,000 --> 00:03:40,000
So hopefully that helps,
95
00:03:40,000 --> 00:03:42,000
we're gonna walk through a ton of additional demos
96
00:03:42,000 --> 00:03:45,000
and examples to help really hammer this point home,
97
00:03:45,000 --> 00:03:47,000
but that's our introduction to the concept
98
00:03:47,000 --> 00:03:49,000
of database normalization.
7615
Can't find what you're looking for?
Get subtitles in any language from opensubtitles.com, and translate them here.