1
00:00:01,180 --> 00:00:02,990
One of the most important steps
2
00:00:02,990 --> 00:00:06,250
in building data intensive apps is to actually model
3
00:00:06,250 --> 00:00:08,700
all this data in MongoDB.
4
00:00:08,700 --> 00:00:12,300
And so that's what we're gonna talk about in this lecture.
5
00:00:12,300 --> 00:00:14,710
So it's really crucial that you follow it
6
00:00:14,710 --> 00:00:19,710
through, even if at first it's a lot to take in. All right.
7
00:00:19,810 --> 00:00:22,013
Anyway, let's now get started.
8
00:00:23,530 --> 00:00:27,530
Now, data modeling is probably a very new concept to you.
9
00:00:27,530 --> 00:00:28,920
So before we start;
10
00:00:28,920 --> 00:00:32,070
let's make clear what we're actually gonna talk about.
11
00:00:32,070 --> 00:00:35,656
So, data modeling is the process of taking unstructured data
12
00:00:35,656 --> 00:00:38,770
generated by a real world scenario
13
00:00:38,770 --> 00:00:42,090
and then structuring it into a logical data model
14
00:00:42,090 --> 00:00:43,410
in a database.
15
00:00:43,410 --> 00:00:46,300
And we do that according to a set of criteria
16
00:00:46,300 --> 00:00:49,330
which we're gonna learn about in this video.
17
00:00:49,330 --> 00:00:51,980
For example; let's say that we want to design
18
00:00:51,980 --> 00:00:54,120
an online shop data model.
19
00:00:54,120 --> 00:00:57,040
There will be initially a ton of unstructured data
20
00:00:57,040 --> 00:00:58,130
that we know we need.
21
00:00:58,130 --> 00:00:58,980
Right.
22
00:00:58,980 --> 00:01:00,900
Stuff like products, categories,
23
00:01:00,900 --> 00:01:03,875
customer's orders, shopping carts, suppliers.
24
00:01:03,875 --> 00:01:06,300
And so on and so forth.
25
00:01:06,300 --> 00:01:09,240
Our goal with data modeling is to then structure
26
00:01:09,240 --> 00:01:11,450
this data in a logical way.
27
00:01:11,450 --> 00:01:14,090
Reflecting the real-world relationships
28
00:01:14,090 --> 00:01:16,920
that exist between some of these datasets.
29
00:01:16,920 --> 00:01:19,670
A bit like you can see in this example.
30
00:01:19,670 --> 00:01:23,110
And this is of course just a kind of imaginary situation
31
00:01:23,110 --> 00:01:24,320
but you get the idea.
32
00:01:24,320 --> 00:01:25,600
Right.
33
00:01:25,600 --> 00:01:28,940
Now, many backend developers say that data modeling
34
00:01:28,940 --> 00:01:30,930
is where we have to think the most.
35
00:01:30,930 --> 00:01:33,670
That it's the most demanding part of building
36
00:01:33,670 --> 00:01:35,310
an entire application.
37
00:01:35,310 --> 00:01:38,100
Because it really is not always straightforward.
38
00:01:38,100 --> 00:01:41,070
And sometimes there are simply no right answers.
39
00:01:41,070 --> 00:01:45,500
So not just one unique correct way of structuring the data.
40
00:01:45,500 --> 00:01:48,420
But anyway I will do my best to lay down the process
41
00:01:48,420 --> 00:01:49,510
in this video.
42
00:01:49,510 --> 00:01:52,367
And for that we're gonna go through four steps.
43
00:01:52,367 --> 00:01:56,200
So in the first step; we will learn how to identify
44
00:01:56,200 --> 00:01:59,340
different types of relationships between data.
45
00:01:59,340 --> 00:02:00,360
Then we're gonna understand the difference
46
00:02:00,360 --> 00:02:03,019
between referencing or normalization
47
00:02:03,019 --> 00:02:07,163
and embedding or denormalization.
48
00:02:07,163 --> 00:02:09,030
In the next and most important step;
49
00:02:09,030 --> 00:02:11,660
I will show you my framework for deciding
50
00:02:11,660 --> 00:02:13,560
whether we should embed documents
51
00:02:13,560 --> 00:02:15,750
or reference to other documents
52
00:02:15,750 --> 00:02:18,690
based on a couple of different factors.
53
00:02:18,690 --> 00:02:20,810
Also, we have to quickly talk about
54
00:02:20,810 --> 00:02:22,680
different types of referencing.
55
00:02:22,680 --> 00:02:25,920
Because that's important if that is the type of design
56
00:02:25,920 --> 00:02:28,220
that we choose for our data.
57
00:02:28,220 --> 00:02:32,290
So this is gonna be in fact quite a theoretical lecture.
58
00:02:32,290 --> 00:02:35,940
But also an absolutely essential one for your progress
59
00:02:35,940 --> 00:02:37,660
as a back-end developer.
60
00:02:37,660 --> 00:02:41,553
Because the way we design our data, so the way we model our data,
61
00:02:41,553 --> 00:02:45,180
can make or break our entire application.
62
00:02:45,180 --> 00:02:47,950
And there will be a lot of examples along the way
63
00:02:47,950 --> 00:02:49,510
to make this process easier.
64
00:02:49,510 --> 00:02:50,343
All right.
65
00:02:51,320 --> 00:02:53,440
And the first thing that we are gonna talk about
66
00:02:53,440 --> 00:02:55,780
is the different types of relationships
67
00:02:55,780 --> 00:02:58,210
that can exist between data.
68
00:02:58,210 --> 00:03:00,780
So there are three big types of relationships.
69
00:03:00,780 --> 00:03:05,150
One to one, one to many, and many to many.
70
00:03:05,150 --> 00:03:06,990
And I'm gonna use a movie application
71
00:03:06,990 --> 00:03:08,890
as an example in this slide.
72
00:03:08,890 --> 00:03:10,000
Okay?
73
00:03:10,000 --> 00:03:12,440
So first a one to one relationship
74
00:03:12,440 --> 00:03:14,140
between data is basically
75
00:03:14,140 --> 00:03:17,370
when one field can only have one value.
76
00:03:17,370 --> 00:03:21,550
So in our movie application example; one movie only ever
77
00:03:21,550 --> 00:03:22,990
has one name.
78
00:03:22,990 --> 00:03:24,910
And so this is a simple example
79
00:03:24,910 --> 00:03:27,160
of a one to one relationship.
80
00:03:27,160 --> 00:03:29,690
But these relationships are not really that important
81
00:03:29,690 --> 00:03:31,363
in terms of data modeling.
82
00:03:32,330 --> 00:03:34,430
Now the most important relationships
83
00:03:34,430 --> 00:03:37,210
are the one to many relationships.
84
00:03:37,210 --> 00:03:39,770
And they are so important that in MongoDB
85
00:03:39,770 --> 00:03:42,510
we actually distinguish between three types
86
00:03:42,510 --> 00:03:44,540
of one to many relationships.
87
00:03:44,540 --> 00:03:49,540
One to a few, one to many, and one to a ton or to a million
88
00:03:49,910 --> 00:03:53,230
or something like that. So the difference here is based
89
00:03:53,230 --> 00:03:56,893
on the relative amount of the many. All right.
90
00:03:57,840 --> 00:04:00,969
So an example of a one to a few relationship is that
91
00:04:00,969 --> 00:04:05,967
one movie can win many awards but actually just a few.
92
00:04:05,967 --> 00:04:09,630
So a movie is not gonna win a thousand awards
93
00:04:09,630 --> 00:04:11,220
but it can win some.
94
00:04:11,220 --> 00:04:14,930
And so this is a typical one to few relationship.
95
00:04:14,930 --> 00:04:18,709
So you see that in general a one to many relationship
96
00:04:18,709 --> 00:04:23,210
means that one document can relate to many other documents.
97
00:04:23,210 --> 00:04:26,680
Now this might look a bit abstract without the JSON data
98
00:04:26,680 --> 00:04:28,480
but that's actually the purpose here.
99
00:04:28,480 --> 00:04:31,040
I just wanna show you a conceptual overview
100
00:04:31,040 --> 00:04:33,759
of these different types of relationships.
101
00:04:33,759 --> 00:04:36,872
Anyway, in a one to many relationship
102
00:04:36,872 --> 00:04:40,600
one document can relate to hundreds or thousands
103
00:04:40,600 --> 00:04:42,070
of other documents.
104
00:04:42,070 --> 00:04:44,788
For example; one movie can have thousands of reviews
105
00:04:44,788 --> 00:04:46,710
in our application.
106
00:04:46,710 --> 00:04:49,380
And so this is not really a one to few
107
00:04:49,380 --> 00:04:51,524
but one to many relationship. Okay?
108
00:04:51,524 --> 00:04:55,616
And finally we have the one to a ton relationship.
109
00:04:55,616 --> 00:04:59,720
Imagine we wanted to implement some logging functionality
110
00:04:59,720 --> 00:05:03,110
in our app. So basically to know exactly what's going on
111
00:05:03,110 --> 00:05:04,870
on our server.
112
00:05:04,870 --> 00:05:08,770
These logs can then easily grow to millions of documents.
113
00:05:08,770 --> 00:05:11,270
And so this is a very typical example
114
00:05:11,270 --> 00:05:14,200
of a one to a ton relationship.
115
00:05:14,200 --> 00:05:17,100
And the difference between many and a ton is of course
116
00:05:17,100 --> 00:05:20,730
a bit fuzzy but just think that if something can grow
117
00:05:20,730 --> 00:05:23,360
almost to infinity then it's definitely
118
00:05:23,360 --> 00:05:25,532
a one to a ton relationship.
119
00:05:25,532 --> 00:05:28,763
So again the one to many relationships
120
00:05:28,763 --> 00:05:31,650
are the most important ones to know.
121
00:05:31,650 --> 00:05:34,050
By the way; in relational databases
122
00:05:34,050 --> 00:05:37,061
there is just one to many without quantifying
123
00:05:37,061 --> 00:05:39,800
how much that many actually is.
124
00:05:39,800 --> 00:05:41,800
In MongoDB databases though
125
00:05:41,800 --> 00:05:44,010
it is an extremely important difference.
126
00:05:44,010 --> 00:05:47,150
Because it's one of the factors that we're gonna use
127
00:05:47,150 --> 00:05:49,891
to decide if we should denormalize or normalize data
128
00:05:49,891 --> 00:05:53,340
as you will learn a bit later.
129
00:05:53,340 --> 00:05:57,181
Anyway, the last type of relationship is the many to many
130
00:05:57,181 --> 00:06:00,149
where one movie can have many actors.
131
00:06:00,149 --> 00:06:04,876
But at the same time one actor can play in many movies.
132
00:06:04,876 --> 00:06:07,910
And so here the relationship basically
133
00:06:07,910 --> 00:06:09,630
goes in both directions.
134
00:06:09,630 --> 00:06:11,800
Where before in the other types
135
00:06:11,800 --> 00:06:13,939
it was only in one direction.
136
00:06:13,939 --> 00:06:17,470
For example one movie can have many reviews
137
00:06:17,470 --> 00:06:22,450
but one specific review is only for that one movie. Right.
138
00:06:22,450 --> 00:06:24,560
And the same goes for the awards.
139
00:06:24,560 --> 00:06:27,506
So one specific award like for the best actor
140
00:06:27,506 --> 00:06:30,914
goes to only one movie not multiple ones.
141
00:06:30,914 --> 00:06:35,580
But with movies and actors it is indeed different.
142
00:06:35,580 --> 00:06:39,250
So again one movie stars many actors
143
00:06:39,250 --> 00:06:41,920
but one actor plays in many movies
144
00:06:41,920 --> 00:06:45,020
and so it's a many to many relationship.
145
00:06:45,020 --> 00:06:46,170
Okay.
146
00:06:46,170 --> 00:06:49,060
So keep all this in mind as we now move forward
147
00:06:49,060 --> 00:06:50,063
in this lecture.
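As a rough sketch of these relationship types in document form (all names and values here are made up for illustration, not taken from the lecture's slides):

```javascript
// Relationship types sketched with plain objects standing in for
// MongoDB documents. All field names and values are illustrative.
const movie = {
  name: "Example Movie",          // one to one: one field, one value
  awards: ["Award A", "Award B"], // one to a few: only a handful of values
};

// one to many / one to a ton: reviews or logs live as their own
// documents, since there can be thousands (or millions) of them
const review = { movie: "Example Movie", text: "Loved it" };

// many to many: an actor plays in many movies, and a movie has many
// actors -- the relationship goes in both directions
const actor = { name: "Actor One", movies: ["Example Movie"] };

console.log(movie.awards.length); // 2
```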
148
00:06:51,760 --> 00:06:54,870
And probably the most important aspect that we need to learn
149
00:06:54,870 --> 00:06:57,900
about MongoDB databases is referencing
150
00:06:57,900 --> 00:07:00,340
and embedding of datasets.
151
00:07:00,340 --> 00:07:02,350
And we actually already talked a little bit
152
00:07:02,350 --> 00:07:05,050
about this before but let's review it here
153
00:07:05,050 --> 00:07:07,311
and go a bit deeper also.
154
00:07:07,311 --> 00:07:09,962
So each time we have two related datasets
155
00:07:09,962 --> 00:07:13,829
we can either represent that related data in a reference
156
00:07:13,829 --> 00:07:18,829
or normalized form or in an embedded or denormalized form.
157
00:07:18,842 --> 00:07:22,190
And I keep using the two related terms together
158
00:07:22,190 --> 00:07:24,340
like referencing and normalizing
159
00:07:24,340 --> 00:07:26,460
because you will see them both being used
160
00:07:26,460 --> 00:07:29,510
and so it's important that you know all of them.
161
00:07:29,510 --> 00:07:33,070
Anyway, in the referenced form we keep two related
162
00:07:33,070 --> 00:07:35,826
datasets and all the documents separated.
163
00:07:35,826 --> 00:07:39,589
So again all the data is nicely separated
164
00:07:39,589 --> 00:07:43,275
which is exactly what normalized means.
165
00:07:43,275 --> 00:07:47,110
So continuing, the movie database example from before
166
00:07:47,110 --> 00:07:50,750
we would have one movie document and one actor document
167
00:07:50,750 --> 00:07:54,870
for each actor. Now how would we then make the connection
168
00:07:54,870 --> 00:07:58,710
between movie and the actors so that later in our app
169
00:07:58,710 --> 00:08:02,150
we can show which actors played in a particular movie.
170
00:08:02,150 --> 00:08:05,210
Because if they are all completely different document
171
00:08:05,210 --> 00:08:09,438
the movie has no way of knowing about the actors. Right.
172
00:08:09,438 --> 00:08:12,253
Well that's where the IDs come in.
173
00:08:12,253 --> 00:08:16,460
So we use the actor IDs in order to create references
174
00:08:16,460 --> 00:08:18,020
on the movie document.
175
00:08:18,020 --> 00:08:20,981
Effectively connecting movies with actors.
176
00:08:20,981 --> 00:08:24,760
So you see that in a movie document we have an array
177
00:08:24,760 --> 00:08:27,198
where we store the IDs of all the actors
178
00:08:27,198 --> 00:08:30,760
so that when we request data about a certain movie
179
00:08:30,760 --> 00:08:34,553
we can easily identify its actors. Does that make sense?
180
00:08:34,553 --> 00:08:38,830
Now this type of referencing is called child referencing
181
00:08:38,830 --> 00:08:41,480
because it's the parent, in this case the movie,
182
00:08:41,480 --> 00:08:45,104
who references its children. In this case the actors.
183
00:08:45,104 --> 00:08:48,841
So we're really creating some sort of hierarchy here. Right.
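A minimal sketch of this child-referencing shape, using plain objects in place of MongoDB documents (all IDs and names are made up):

```javascript
// Child referencing: the parent (movie) stores an array of its
// children's (actors') IDs. Illustrative data, not real driver calls.
const actors = [
  { _id: "actor_1", name: "Actor One" },
  { _id: "actor_2", name: "Actor Two" },
];

const movie = {
  _id: "movie_1",
  title: "Example Movie",
  actors: ["actor_1", "actor_2"], // references to the child documents
};

// Resolving the references means a second lookup in the actors data:
const cast = movie.actors.map((id) => actors.find((a) => a._id === id));
console.log(cast.map((a) => a.name)); // [ 'Actor One', 'Actor Two' ]
```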
184
00:08:48,841 --> 00:08:51,870
Now there is also parent referencing
185
00:08:51,870 --> 00:08:54,390
and we are gonna talk about that a bit later.
186
00:08:54,390 --> 00:08:58,710
And by the way in relational databases; all data is always
187
00:08:58,710 --> 00:09:01,958
represented in normalized form like this.
188
00:09:01,958 --> 00:09:05,490
But in a NoSQL database like MongoDB
189
00:09:05,490 --> 00:09:09,700
we can put the data into a denormalized form
190
00:09:09,700 --> 00:09:12,450
simply by embedding the related document
191
00:09:12,450 --> 00:09:15,330
right into the main document.
192
00:09:15,330 --> 00:09:18,330
So now we have all the relevant data about actors
193
00:09:18,330 --> 00:09:22,060
right inside one main movie document without the need
194
00:09:22,060 --> 00:09:25,700
for separate documents, collections, and IDs.
195
00:09:25,700 --> 00:09:30,088
So again, if we choose to denormalize or to embed our data
196
00:09:30,088 --> 00:09:34,280
we will have one main document containing all the main data
197
00:09:34,280 --> 00:09:37,197
as well as the related data. All right.
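For contrast, here is the same movie in an embedded (denormalized) sketch, again with made-up names:

```javascript
// Embedding (denormalizing): the related actor data lives right
// inside the movie document -- no separate collection or IDs needed.
// All names and fields are illustrative.
const movie = {
  _id: "movie_1",
  title: "Example Movie",
  actors: [
    { name: "Actor One", age: 40 },
    { name: "Actor Two", age: 35 },
  ],
};

// One read returns the movie and its actors all at once:
console.log(movie.actors.length); // 2
```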
198
00:09:37,197 --> 00:09:40,340
And the result of this is that our application
199
00:09:40,340 --> 00:09:43,330
will need fewer queries to the database.
200
00:09:43,330 --> 00:09:45,000
Because we can get all the data
201
00:09:45,000 --> 00:09:48,074
about movies and actors all at the same time
202
00:09:48,074 --> 00:09:51,650
which will of course increase our performance.
203
00:09:51,650 --> 00:09:53,840
Now the downside here is of course
204
00:09:53,840 --> 00:09:57,530
that we can't really query the embedded data on its own.
205
00:09:57,530 --> 00:10:00,810
And so if that's a requirement for the application
206
00:10:00,810 --> 00:10:03,790
you would have to choose a normalized design
207
00:10:03,790 --> 00:10:06,280
and since we're talking about pros and cons
208
00:10:06,280 --> 00:10:09,030
of the denormalized form; lets do the same
209
00:10:09,030 --> 00:10:11,490
about the normalized design.
210
00:10:11,490 --> 00:10:13,920
And basically its kind of the opposite
211
00:10:13,920 --> 00:10:15,770
of what we just talked about.
212
00:10:15,770 --> 00:10:18,319
So there is an improvement in performance
213
00:10:18,319 --> 00:10:22,390
when we often need to query the related data on its own
214
00:10:22,390 --> 00:10:25,740
because we then can just query the data that we need
215
00:10:25,740 --> 00:10:28,490
and not always movies and actors together.
216
00:10:28,490 --> 00:10:31,640
But on the other hand; when we need to actually query
217
00:10:31,640 --> 00:10:33,906
movies and actors together we then are gonna need
218
00:10:33,906 --> 00:10:36,396
many queries to the database.
219
00:10:36,396 --> 00:10:40,010
So first the query for the movie and then from there
220
00:10:40,010 --> 00:10:42,610
we will also need a query for the actor
221
00:10:42,610 --> 00:10:44,989
and that is of course worse for performance.
222
00:10:44,989 --> 00:10:48,328
So when designing your database; this is the kind of stuff
223
00:10:48,328 --> 00:10:50,569
that you need to keep in mind. All right.
224
00:10:50,569 --> 00:10:54,900
And now just as a side note; we could of course begin
225
00:10:54,900 --> 00:10:56,994
our thought process with denormalized data
226
00:10:56,994 --> 00:10:59,670
and then come to the conclusion
227
00:10:59,670 --> 00:11:01,692
that it's best to actually normalize the data.
228
00:11:01,692 --> 00:11:05,043
So when thinking about our data model
229
00:11:05,043 --> 00:11:08,378
this way of organizing data works of course in both ways.
230
00:11:08,378 --> 00:11:12,570
Now, how do we actually decide if we should
231
00:11:12,570 --> 00:11:15,330
normalize or denormalize the data?
232
00:11:15,330 --> 00:11:18,033
Well that's exactly what we're gonna learn next.
233
00:11:19,690 --> 00:11:22,974
So when we have two related datasets; we have to decide
234
00:11:22,974 --> 00:11:26,180
if we're gonna embed the datasets or if we're gonna
235
00:11:26,180 --> 00:11:27,693
keep them separated and reference them
236
00:11:27,693 --> 00:11:30,400
from one dataset to the other.
237
00:11:30,400 --> 00:11:32,730
And I kind of developed this decision framework
238
00:11:32,730 --> 00:11:36,070
which I'm gonna show you where we use three criteria
239
00:11:36,070 --> 00:11:37,770
to make that decision.
240
00:11:37,770 --> 00:11:40,450
First we look at the type of relationships
241
00:11:40,450 --> 00:11:42,800
that exist between the datasets.
242
00:11:42,800 --> 00:11:45,856
Second we try to determine the data access pattern
243
00:11:45,856 --> 00:11:50,150
of the dataset that we want to either embed or reference.
244
00:11:50,150 --> 00:11:53,320
And this just means to analyze how often data is read
245
00:11:53,320 --> 00:11:55,282
and written in that dataset.
246
00:11:55,282 --> 00:11:59,025
Then we also look at something that I call data closeness.
247
00:11:59,025 --> 00:12:02,940
And data closeness is a term that I actually just made up
248
00:12:02,940 --> 00:12:06,870
but what it means is how much the data is really related
249
00:12:06,870 --> 00:12:10,109
and how we want to query the data from the database.
250
00:12:10,109 --> 00:12:11,850
And this will make more sense
251
00:12:11,850 --> 00:12:14,180
when we talk about it in a moment.
252
00:12:14,180 --> 00:12:17,330
Now to actually make the decision; we need to combine
253
00:12:17,330 --> 00:12:19,350
all of these three criteria
254
00:12:19,350 --> 00:12:21,792
and not just use one of them in isolation.
255
00:12:21,792 --> 00:12:25,230
So for example; just because criteria number one
256
00:12:25,230 --> 00:12:28,380
says to embed it doesn't mean that we don't need to look
257
00:12:28,380 --> 00:12:30,425
at the other two criteria.
258
00:12:30,425 --> 00:12:34,124
All right, and let's start with the relationship type.
259
00:12:34,124 --> 00:12:37,968
So usually when we have a one to few relationship
260
00:12:37,968 --> 00:12:40,700
we will always embed the related dataset
261
00:12:40,700 --> 00:12:43,430
into the main dataset just like we learned
262
00:12:43,430 --> 00:12:45,860
in the last slide. Right.
263
00:12:45,860 --> 00:12:49,110
Now in a one to many relationship; things are a bit
264
00:12:49,110 --> 00:12:52,880
more fuzzy so it's okay to either embed or reference.
265
00:12:52,880 --> 00:12:55,140
In that case we will have to decide
266
00:12:55,140 --> 00:12:57,304
according to the other two criteria.
267
00:12:57,304 --> 00:12:59,825
Now on the other hand, on a one to a ton
268
00:12:59,825 --> 00:13:03,894
or a many to many relationship we usually always reference
269
00:13:03,894 --> 00:13:06,811
the data. That's because if we actually did embed
270
00:13:06,811 --> 00:13:10,004
in this case we could quickly create way too large documents.
271
00:13:10,004 --> 00:13:14,902
Even potentially surpassing the maximum of 16 megabytes.
272
00:13:14,902 --> 00:13:18,214
And so the solution for that is of course referencing
273
00:13:18,214 --> 00:13:22,090
or normalizing the data. And as a quick example;
274
00:13:22,090 --> 00:13:24,142
let's say that in our movie database example
275
00:13:24,142 --> 00:13:27,830
we have around 100 images associated with each movie.
276
00:13:27,830 --> 00:13:30,874
So we could say it's a one to many relationship
277
00:13:30,874 --> 00:13:34,230
but are we gonna embed the dataset or should we rather
278
00:13:34,230 --> 00:13:37,523
reference them here? Well, we don't really know.
279
00:13:37,523 --> 00:13:40,571
So let's take a look at the other two criteria.
280
00:13:40,571 --> 00:13:44,420
So the second one is about data access patterns
281
00:13:44,420 --> 00:13:46,290
which is just a fancy description
282
00:13:46,290 --> 00:13:48,242
for evaluating whether a certain dataset
283
00:13:48,242 --> 00:13:51,559
is mostly written to or mostly read from.
284
00:13:51,559 --> 00:13:55,760
So if the dataset that we're deciding about is mostly read
285
00:13:55,760 --> 00:13:58,179
and the data is not updated a lot
286
00:13:58,179 --> 00:14:01,620
then we should probably embed that dataset.
287
00:14:01,620 --> 00:14:04,690
So a high read/write ratio just means
288
00:14:04,690 --> 00:14:07,100
that there is a lot more reading than writing.
289
00:14:07,100 --> 00:14:11,100
And a again, a dataset like that is a good candidate
290
00:14:11,100 --> 00:14:11,983
for embedding.
291
00:14:12,830 --> 00:14:15,980
The reason for this is that by embedding we only need
292
00:14:15,980 --> 00:14:18,379
one trip to the database per query.
293
00:14:18,379 --> 00:14:22,197
While for referencing we need two trips. Right.
294
00:14:22,197 --> 00:14:25,660
So if we embed data that is read a lot;
295
00:14:25,660 --> 00:14:28,383
in each query we save one trip to the database
296
00:14:28,383 --> 00:14:32,147
making the entire process way more performant.
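A toy illustration of that trip-count difference, with plain arrays standing in for collections and a counter standing in for database round trips (not real MongoDB driver calls):

```javascript
// Simulated "collections" and a counter for round trips to the database.
// All IDs and names are made up for illustration.
const moviesNormalized = [{ _id: 1, title: "Example Movie", actorIds: [10, 11] }];
const actorsCollection = [
  { _id: 10, name: "Actor One" },
  { _id: 11, name: "Actor Two" },
];

let trips = 0;
const query = (collection, predicate) => {
  trips += 1; // each call counts as one trip to the "database"
  return collection.filter(predicate);
};

// Normalized: two trips -- one for the movie, one for its actors.
const [movie] = query(moviesNormalized, (m) => m._id === 1);
const cast = query(actorsCollection, (a) => movie.actorIds.includes(a._id));
console.log(trips); // 2

// With embedding, the same data would come back in a single trip.
```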
297
00:14:32,147 --> 00:14:35,260
So I think that our movie image example
298
00:14:35,260 --> 00:14:38,320
would actually be a good candidate for embedding.
299
00:14:38,320 --> 00:14:41,543
Because once the 100 images are saved to the database
300
00:14:41,543 --> 00:14:43,920
they are not really updated anymore
301
00:14:43,920 --> 00:14:46,930
because there is not really anything to update
302
00:14:46,930 --> 00:14:50,057
about an image. Right, so it's all about reading
303
00:14:50,057 --> 00:14:52,563
and therefore based on this criteria
304
00:14:52,563 --> 00:14:55,501
we would embed the image documents.
305
00:14:55,501 --> 00:14:59,092
Now on the other hand, if our data is updated a lot
306
00:14:59,092 --> 00:15:03,118
then we should consider referencing or normalizing the data.
307
00:15:03,118 --> 00:15:06,700
That's because its more work for the database engine
308
00:15:06,700 --> 00:15:08,870
to update an embedded document
309
00:15:08,870 --> 00:15:11,600
than a more simple standalone document.
310
00:15:11,600 --> 00:15:13,980
And since our main goal is performance;
311
00:15:13,980 --> 00:15:15,917
we just normalize the dataset.
312
00:15:15,917 --> 00:15:19,653
In our example lets say each movie has many reviews
313
00:15:19,653 --> 00:15:23,284
and each review can be marked as helpful by the user.
314
00:15:23,284 --> 00:15:27,560
So each time someone clicks on "this review was helpful"
315
00:15:27,560 --> 00:15:29,780
in our application, we need to update
316
00:15:29,780 --> 00:15:31,740
the corresponding document.
317
00:15:31,740 --> 00:15:35,030
And this means that the data can change all the time
318
00:15:35,030 --> 00:15:38,520
and so this is a great candidate for normalizing.
319
00:15:38,520 --> 00:15:41,420
Again because we don't want to be querying the movies
320
00:15:41,420 --> 00:15:45,190
all the time if all we really wanna update is the reviews
321
00:15:45,190 --> 00:15:47,230
by marking them as helpful.
322
00:15:47,230 --> 00:15:49,464
Okay, does that make sense?
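As a quick sketch of why normalizing helps here, with reviews as standalone objects (field names like `helpfulCount` are my own, not from the lecture):

```javascript
// With reviews normalized into their own collection, marking one as
// helpful touches only that small standalone document -- the movie
// document is never involved. Illustrative data only.
const reviews = [
  { _id: "r1", movieId: "movie_1", text: "Great movie!", helpfulCount: 0 },
];

function markHelpful(reviewId) {
  const review = reviews.find((r) => r._id === reviewId);
  review.helpfulCount += 1; // cheap, frequent write on a small document
}

markHelpful("r1");
console.log(reviews[0].helpfulCount); // 1
```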
323
00:15:49,464 --> 00:15:53,500
And finally the last criterion I call data closeness;
324
00:15:53,500 --> 00:15:56,320
which is just like a measure for how much the data
325
00:15:56,320 --> 00:15:59,469
is related. So if the two datasets really
326
00:15:59,469 --> 00:16:02,890
intrinsically belong together then they should
327
00:16:02,890 --> 00:16:05,880
probably be embedded into one another.
328
00:16:05,880 --> 00:16:10,440
In our example; all users can have many email addresses
329
00:16:10,440 --> 00:16:13,780
on their account and since they are so intrinsically
330
00:16:13,780 --> 00:16:17,190
connected to the user, there is no doubt emails
331
00:16:17,190 --> 00:16:19,920
should be embedded into the user document.
332
00:16:19,920 --> 00:16:23,830
Now if we frequently need to query both datasets
333
00:16:23,830 --> 00:16:26,388
on their own then that's a very good reason
334
00:16:26,388 --> 00:16:29,696
to normalize the data into two separate datasets.
335
00:16:29,696 --> 00:16:32,790
Even if they are closely related.
336
00:16:32,790 --> 00:16:35,227
So imagine that in our app we have a quiz
337
00:16:35,227 --> 00:16:40,227
where users have to identify a movie based on images.
338
00:16:40,440 --> 00:16:43,080
This means that we're gonna query a lot of images
339
00:16:43,080 --> 00:16:44,180
on their own.
340
00:16:44,180 --> 00:16:47,756
So without necessarily querying for the movies themselves.
341
00:16:47,756 --> 00:16:50,640
And so if we apply this third criteria;
342
00:16:50,640 --> 00:16:54,137
we come to the conclusion that we should actually normalize
343
00:16:54,137 --> 00:16:56,759
the image dataset. All right.
344
00:16:56,759 --> 00:17:00,770
Because again if we implement this quiz functionality;
345
00:17:00,770 --> 00:17:04,057
images are gonna be queried on their own all the time.
346
00:17:04,057 --> 00:17:07,422
So, all of this shows that we should really look
347
00:17:07,422 --> 00:17:09,849
at all three criteria together
348
00:17:09,849 --> 00:17:12,700
rather than just one of them in isolation.
349
00:17:12,700 --> 00:17:15,840
Because that might lead to less optimal decisions.
350
00:17:15,840 --> 00:17:18,907
And I say less optimal instead of wrong
351
00:17:18,907 --> 00:17:21,766
because there are not really completely right
352
00:17:21,766 --> 00:17:25,262
or completely wrong ways of modeling our data.
353
00:17:25,262 --> 00:17:28,970
There are no hard rules; these are just like guidelines
354
00:17:28,970 --> 00:17:31,380
that you can follow to find the probably
355
00:17:31,380 --> 00:17:33,860
most correct way of structuring your data.
356
00:17:33,860 --> 00:17:37,077
But again, it's hard to be really really wrong.
357
00:17:37,077 --> 00:17:38,253
Okay?
358
00:17:39,740 --> 00:17:43,110
Now, let's say that we have chosen to normalize
359
00:17:43,110 --> 00:17:44,270
our datasets.
360
00:17:44,270 --> 00:17:46,653
So in other words to reference data.
361
00:17:46,653 --> 00:17:49,380
Then after that we still have to choose
362
00:17:49,380 --> 00:17:52,840
between three different types of referencing.
363
00:17:52,840 --> 00:17:55,460
Child referencing, parent referencing
364
00:17:55,460 --> 00:17:57,540
and two-way referencing.
365
00:17:57,540 --> 00:18:00,767
So the first type is child referencing.
366
00:18:00,767 --> 00:18:04,440
Which is the referencing type I actually showed you before.
367
00:18:04,440 --> 00:18:05,470
Okay?
368
00:18:05,470 --> 00:18:07,850
And let's now take the error logging example
369
00:18:07,850 --> 00:18:10,128
that I mentioned earlier. Where we could potentially
370
00:18:10,128 --> 00:18:13,021
have millions of log documents.
371
00:18:13,021 --> 00:18:17,300
So in child referencing; we basically keep references
372
00:18:17,300 --> 00:18:20,460
to the related child documents in a parent document.
373
00:18:20,460 --> 00:18:22,941
And they are usually stored in an array.
374
00:18:22,941 --> 00:18:25,735
So you see that each log has an ID
375
00:18:25,735 --> 00:18:29,040
and then in the app document there is that array
376
00:18:29,040 --> 00:18:31,358
with all of these IDs. Right?
377
00:18:31,358 --> 00:18:34,400
However, the problem here is that this array
378
00:18:34,400 --> 00:18:39,320
of IDs can become very large if there are lots of children.
379
00:18:39,320 --> 00:18:42,230
And this is an anti-pattern in MongoDB.
380
00:18:42,230 --> 00:18:45,156
So something that we should avoid at all costs.
381
00:18:45,156 --> 00:18:47,660
Also, child referencing makes it
382
00:18:47,660 --> 00:18:51,410
so that parents and children are very tightly coupled.
383
00:18:51,410 --> 00:18:54,840
Which is not always ideal. But that's exactly
384
00:18:54,840 --> 00:18:57,020
why we have parent referencing.
385
00:18:57,020 --> 00:19:00,300
So in parent referencing; it actually works
386
00:19:00,300 --> 00:19:01,870
the other way around.
387
00:19:01,870 --> 00:19:05,570
So here in each child document we keep a reference
388
00:19:05,570 --> 00:19:07,430
to the parent element.
389
00:19:07,430 --> 00:19:10,267
Therefore the name parent referencing.
390
00:19:10,267 --> 00:19:13,890
In this example the app ID is 23
391
00:19:13,890 --> 00:19:16,640
and so in each log there is the app field
392
00:19:16,640 --> 00:19:18,990
with the 23 ID in it.
393
00:19:18,990 --> 00:19:21,660
So that the child always knows its parent.
394
00:19:21,660 --> 00:19:24,920
And so in this case the parent actually knows nothing
395
00:19:24,920 --> 00:19:26,080
about the children.
396
00:19:26,080 --> 00:19:28,768
Not who they are and not how many they are.
397
00:19:28,768 --> 00:19:32,890
So, they are way more isolated and more standalone.
398
00:19:32,890 --> 00:19:35,326
And that can sometimes be beneficial.
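Parent referencing can be sketched like this, reusing the lecture's app ID of 23 (the log fields themselves are made up):

```javascript
// Parent referencing: each child (log) keeps a reference to its
// parent (app); the parent knows nothing about its children.
const app = { _id: 23, name: "Example App" }; // no array of log IDs here

const logs = [
  { _id: "log_1", app: 23, message: "server started" },
  { _id: "log_2", app: 23, message: "request handled" },
];

// Finding an app's logs means filtering the children by parent ID,
// so no array on the parent ever grows:
const appLogs = logs.filter((l) => l.app === app._id);
console.log(appLogs.length); // 2
```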
399
00:19:35,326 --> 00:19:38,880
So which of these two types is actually better
400
00:19:38,880 --> 00:19:40,527
for this data relationship?
401
00:19:40,527 --> 00:19:42,820
And remember how I said that there
402
00:19:42,820 --> 00:19:45,860
could be millions of logs, and so let's suppose
403
00:19:45,860 --> 00:19:47,652
there are two million log documents.
404
00:19:47,652 --> 00:19:51,340
In a case of child referencing, that would mean
405
00:19:51,340 --> 00:19:53,209
that there are two million ID references
406
00:19:53,209 --> 00:19:55,091
in the app document.
407
00:19:55,091 --> 00:19:58,300
Right? Now also remember how I said that
408
00:19:58,300 --> 00:20:00,545
there is a 16 megabyte limit on documents.
409
00:20:00,545 --> 00:20:04,302
So if we kept adding and adding these child IDs
410
00:20:04,302 --> 00:20:06,716
into the array on the parent; then we would
411
00:20:06,716 --> 00:20:09,575
pretty quickly hit that 16 megabyte limit
412
00:20:09,575 --> 00:20:11,772
that each BSON document can hold.
413
00:20:11,772 --> 00:20:14,702
Simply because that array will grow so much.
414
00:20:14,702 --> 00:20:17,210
So that's not really gonna work.
415
00:20:17,210 --> 00:20:18,510
Is it?
416
00:20:18,510 --> 00:20:20,590
On the other hand with parent referencing
417
00:20:20,590 --> 00:20:22,990
that problem is not gonna happen.
418
00:20:22,990 --> 00:20:25,570
We will simply have two million log documents
419
00:20:25,570 --> 00:20:30,540
just like before, but each of them holds the ID of its parent.
420
00:20:30,540 --> 00:20:33,098
But there is no array that will grow indefinitely
421
00:20:33,098 --> 00:20:35,740
and therefore parent referencing
422
00:20:35,740 --> 00:20:38,443
would be the best solution here.
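As a minimal sketch of what parent referencing looks like (all field values here are hypothetical, and the database query is simulated with a plain array filter):

```javascript
// Parent referencing: each child log stores its parent's ID;
// the parent "app" document stores no references at all.
const app = { _id: 23, name: 'My App' };

const logs = [
  { _id: 1, app: 23, message: 'Server started' },
  { _id: 2, app: 23, message: 'Request received' },
  // ...could be millions more, none of which bloat the parent
];

// In the Mongo shell this would be db.logs.find({ app: 23 }).
// Simulated here in plain JavaScript:
const appLogs = logs.filter(log => log.app === app._id);
console.log(appLogs.length); // 2
```

Note that the parent document stays the same size no matter how many logs exist.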
423
00:20:39,380 --> 00:20:41,901
So the conclusion of all this is that in general
424
00:20:41,901 --> 00:20:44,385
child referencing is best used
425
00:20:44,385 --> 00:20:48,008
for one to a few relationships, where we know beforehand
426
00:20:48,008 --> 00:20:51,118
that the array of child documents won't grow that much.
427
00:20:51,118 --> 00:20:54,573
On the other hand, parent referencing is best used
428
00:20:54,573 --> 00:20:58,690
for one to many and one to a ton relationships
429
00:20:58,690 --> 00:21:00,927
like this one. Okay?
430
00:21:00,927 --> 00:21:04,610
So again always keep in mind that one of the most
431
00:21:04,610 --> 00:21:07,920
important principles of MongoDB data modeling
432
00:21:07,920 --> 00:21:11,900
is that arrays should never be allowed to grow indefinitely.
433
00:21:11,900 --> 00:21:15,420
That way, we never break that 16-megabyte limit.
434
00:21:15,420 --> 00:21:18,170
We also don't want to send our users an array
435
00:21:18,170 --> 00:21:20,730
with thousands of IDs each time
436
00:21:20,730 --> 00:21:24,340
they request a parent dataset. Okay?
437
00:21:24,340 --> 00:21:26,900
So did this logic make sense to you?
438
00:21:26,900 --> 00:21:29,660
Then let's move on to the third type of referencing
439
00:21:29,660 --> 00:21:31,870
which is two-way referencing.
440
00:21:31,870 --> 00:21:34,395
And this time with the movie and actor example
441
00:21:34,395 --> 00:21:36,380
I showed you when we talked about
442
00:21:36,380 --> 00:21:39,364
many to many relationships. Remember that?
443
00:21:39,364 --> 00:21:42,229
So again, each movie has many actors
444
00:21:42,229 --> 00:21:44,880
and each actor plays in many movies.
445
00:21:44,880 --> 00:21:48,464
And so that's a typical many to many relationship.
446
00:21:48,464 --> 00:21:52,100
And we usually use this two-way referencing to design
447
00:21:52,100 --> 00:21:55,346
many to many relationships. And it works like this:
448
00:21:55,346 --> 00:21:59,370
in each movie we will keep references to all the actors
449
00:21:59,370 --> 00:22:03,980
that star in that movie. So a bit like in child referencing.
450
00:22:03,980 --> 00:22:07,000
However, at the same time, in each actor
451
00:22:07,000 --> 00:22:09,570
we also keep references to all the movies
452
00:22:09,570 --> 00:22:11,660
that the actor played in.
453
00:22:11,660 --> 00:22:15,120
So movies and actors are connected in both directions.
454
00:22:15,120 --> 00:22:17,900
And hence the name two-way referencing.
455
00:22:17,900 --> 00:22:19,950
And this makes it really easy to search
456
00:22:19,950 --> 00:22:23,290
for both movies and actors completely independently.
457
00:22:23,290 --> 00:22:25,910
While also making it easy to find the actors
458
00:22:25,910 --> 00:22:29,029
associated to each movie and the movies associated
459
00:22:29,029 --> 00:22:30,383
to each actor.
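Here is a minimal sketch of two-way referencing for this movie/actor relationship (hypothetical _id values; the lookups are simulated with plain arrays rather than real queries):

```javascript
// Two-way referencing: movies reference their actors
// AND actors reference their movies.
const movies = [
  { _id: 'm1', title: 'Movie A', actors: ['a1', 'a2'] },
  { _id: 'm2', title: 'Movie B', actors: ['a1'] },
];
const actors = [
  { _id: 'a1', name: 'Actor One', movies: ['m1', 'm2'] },
  { _id: 'a2', name: 'Actor Two', movies: ['m1'] },
];

// Both directions resolve easily and independently:
const castOfMovieA = actors.filter(a => movies[0].actors.includes(a._id));
const filmsOfActorOne = movies.filter(m => actors[0].movies.includes(m._id));
console.log(castOfMovieA.length); // 2
console.log(filmsOfActorOne.length); // 2
```

The trade-off is that both sides must be kept in sync whenever the relationship changes, which is the price of querying each dataset on its own.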
460
00:22:31,623 --> 00:22:32,560
(deep breath)
461
00:22:32,560 --> 00:22:34,747
This was quite a long lecture indeed.
462
00:22:34,747 --> 00:22:38,030
With a lot of new concepts and principles
463
00:22:38,030 --> 00:22:40,220
and guidelines to remember.
464
00:22:40,220 --> 00:22:43,460
So in order to help you with that, here is a quick
465
00:22:43,460 --> 00:22:46,650
summary and some more general guidelines that you can
466
00:22:46,650 --> 00:22:48,423
take a look at whenever you need them.
467
00:22:49,260 --> 00:22:52,753
So the most important principle is: structure your data
468
00:22:52,753 --> 00:22:56,120
to match the ways that your application queries
469
00:22:56,120 --> 00:22:57,436
and updates data.
470
00:22:57,436 --> 00:23:01,400
Or in other words: identify the questions that arise
471
00:23:01,400 --> 00:23:03,784
from your application's use cases first, and then model
472
00:23:03,784 --> 00:23:06,634
your data so that the questions can get answered
473
00:23:06,634 --> 00:23:08,995
in the most efficient way.
474
00:23:08,995 --> 00:23:12,610
For example: will I need to query movies and actors
475
00:23:12,610 --> 00:23:16,130
always together, or are there scenarios where I only
476
00:23:16,130 --> 00:23:18,041
query movies or only actors?
477
00:23:18,041 --> 00:23:20,528
Those kinds of questions are what your data model
478
00:23:20,528 --> 00:23:22,930
will be based on.
479
00:23:22,930 --> 00:23:26,730
In general, always favor embedding unless there is a good
480
00:23:26,730 --> 00:23:28,440
reason not to embed.
481
00:23:28,440 --> 00:23:32,513
Especially for one to a few and one to many relationships.
482
00:23:33,370 --> 00:23:37,713
Next up, a one to a ton or a many to many relationship
483
00:23:37,713 --> 00:23:41,543
is usually a good reason to reference instead of embedding.
484
00:23:41,543 --> 00:23:45,734
Also, favor referencing when data is updated a lot
485
00:23:45,734 --> 00:23:50,717
and if you need to frequently access a dataset on its own.
486
00:23:50,717 --> 00:23:55,340
Use embedding when data is mostly read but rarely updated
487
00:23:55,340 --> 00:23:58,469
and when two datasets belong intrinsically together.
488
00:23:58,469 --> 00:24:02,840
Don't allow arrays to grow indefinitely.
489
00:24:02,840 --> 00:24:05,982
Therefore, if you want to normalize, use child referencing
490
00:24:05,982 --> 00:24:09,680
for one to many relationships and parent referencing
491
00:24:09,680 --> 00:24:11,856
for one to a ton relationships.
492
00:24:11,856 --> 00:24:15,160
And finally, use two-way referencing
493
00:24:15,160 --> 00:24:17,520
for many to many relationships.
494
00:24:17,520 --> 00:24:18,720
All right?
495
00:24:18,720 --> 00:24:21,202
And that pretty much sums it up.
496
00:24:21,202 --> 00:24:23,970
I would actually recommend watching this video
497
00:24:23,970 --> 00:24:27,144
twice if you can, just because of how important
498
00:24:27,144 --> 00:24:30,091
this material really is. All right?
499
00:24:30,091 --> 00:24:33,363
Anyway, see you in the next video.