1
00:00:01,180 --> 00:00:02,990
One of the most important steps
2
00:00:02,990 --> 00:00:06,250
in building data intensive apps is to actually model
3
00:00:06,250 --> 00:00:08,700
all this data in MongoDB.
4
00:00:08,700 --> 00:00:12,300
And so that's what we're gonna talk about in this lecture.
5
00:00:12,300 --> 00:00:14,710
So it's really crucial that you follow it
6
00:00:14,710 --> 00:00:19,710
through, even if at first it's a lot to take in. All right.
7
00:00:19,810 --> 00:00:22,013
Anyway, let's now get started.
8
00:00:23,530 --> 00:00:27,530
Now, data modeling is probably a very new concept to you.
9
00:00:27,530 --> 00:00:28,920
So before we start;
10
00:00:28,920 --> 00:00:32,070
let's make clear what we're actually gonna talk about.
11
00:00:32,070 --> 00:00:35,656
So, data modeling is the process of taking unstructured data
12
00:00:35,656 --> 00:00:38,770
generated by a real world scenario
13
00:00:38,770 --> 00:00:42,090
and then structuring it into a logical data model
14
00:00:42,090 --> 00:00:43,410
in a database.
15
00:00:43,410 --> 00:00:46,300
And we do that according to a set of criteria
16
00:00:46,300 --> 00:00:49,330
which we're gonna learn about in this video.
17
00:00:49,330 --> 00:00:51,980
For example; let's say that we want to design
18
00:00:51,980 --> 00:00:54,120
an online shop data model.
19
00:00:54,120 --> 00:00:57,040
There will be initially a ton of unstructured data
20
00:00:57,040 --> 00:00:58,130
that we know we need.
21
00:00:58,130 --> 00:00:58,980
Right.
22
00:00:58,980 --> 00:01:00,900
Stuff like products, categories,
23
00:01:00,900 --> 00:01:03,875
customer's orders, shopping carts, suppliers.
24
00:01:03,875 --> 00:01:06,300
And so on and so forth.
25
00:01:06,300 --> 00:01:09,240
Our goal with data modeling is to then structure
26
00:01:09,240 --> 00:01:11,450
this data in a logical way.
27
00:01:11,450 --> 00:01:14,090
Reflecting the real-world relationships
28
00:01:14,090 --> 00:01:16,920
that exist between some of these datasets.
29
00:01:16,920 --> 00:01:19,670
A bit like you can see in this example.
30
00:01:19,670 --> 00:01:23,110
And this is of course just a kind of imaginary situation
31
00:01:23,110 --> 00:01:24,320
but you get the idea.
32
00:01:24,320 --> 00:01:25,600
Right.
33
00:01:25,600 --> 00:01:28,940
Now, many backend developers say that data modeling
34
00:01:28,940 --> 00:01:30,930
is where we have to think the most.
35
00:01:30,930 --> 00:01:33,670
That it's the most demanding part of building
36
00:01:33,670 --> 00:01:35,310
an entire application.
37
00:01:35,310 --> 00:01:38,100
Because it really is not always straightforward.
38
00:01:38,100 --> 00:01:41,070
And sometimes there are simply no right answers.
39
00:01:41,070 --> 00:01:45,500
So not just one unique correct way of structuring the data.
40
00:01:45,500 --> 00:01:48,420
But anyway I will do my best to lay down the process
41
00:01:48,420 --> 00:01:49,510
in this video.
42
00:01:49,510 --> 00:01:52,367
And for that we're gonna go through four steps.
43
00:01:52,367 --> 00:01:56,200
So in the first step; we will learn how to identify
44
00:01:56,200 --> 00:01:59,340
different types of relationships between data.
45
00:01:59,340 --> 00:02:00,360
Then we're gonna understand the difference
46
00:02:00,360 --> 00:02:03,019
between referencing or normalization
47
00:02:03,019 --> 00:02:07,163
and embedding or denormalization.
48
00:02:07,163 --> 00:02:09,030
In the next and most important step;
49
00:02:09,030 --> 00:02:11,660
I will show you my framework for deciding
50
00:02:11,660 --> 00:02:13,560
whether we should embed documents
51
00:02:13,560 --> 00:02:15,750
or reference to other documents
52
00:02:15,750 --> 00:02:18,690
based on a couple of different factors.
53
00:02:18,690 --> 00:02:20,810
Also, we have to quickly talk about
54
00:02:20,810 --> 00:02:22,680
different types of referencing.
55
00:02:22,680 --> 00:02:25,920
Because that's important if that is the type of design
56
00:02:25,920 --> 00:02:28,220
that we choose for our data.
57
00:02:28,220 --> 00:02:32,290
So this is gonna be in fact quite a theoretical lecture.
58
00:02:32,290 --> 00:02:35,940
But also an absolutely essential one for your progress
59
00:02:35,940 --> 00:02:37,660
as a back-end developer.
60
00:02:37,660 --> 00:02:41,553
Because the way we design our data, so the way we model our data,
61
00:02:41,553 --> 00:02:45,180
can make or break our entire application.
62
00:02:45,180 --> 00:02:47,950
And there will be a lot of examples along the way
63
00:02:47,950 --> 00:02:49,510
to make this process easier.
64
00:02:49,510 --> 00:02:50,343
All right.
65
00:02:51,320 --> 00:02:53,440
And the first thing that we are gonna talk about
66
00:02:53,440 --> 00:02:55,780
is the different types of relationships
67
00:02:55,780 --> 00:02:58,210
that can exist between data.
68
00:02:58,210 --> 00:03:00,780
So there are three big types of relationships.
69
00:03:00,780 --> 00:03:05,150
One to one, one to many, and many to many.
70
00:03:05,150 --> 00:03:06,990
And I'm gonna use a movie application
71
00:03:06,990 --> 00:03:08,890
as an example in this slide.
72
00:03:08,890 --> 00:03:10,000
Okay?
73
00:03:10,000 --> 00:03:12,440
So first a one to one relationship
74
00:03:12,440 --> 00:03:14,140
between data is basically
75
00:03:14,140 --> 00:03:17,370
when one field can only have one value.
76
00:03:17,370 --> 00:03:21,550
So in our movie application example; one movie only ever
77
00:03:21,550 --> 00:03:22,990
has one name.
78
00:03:22,990 --> 00:03:24,910
And so this is a simple example
79
00:03:24,910 --> 00:03:27,160
of a one to one relationship.
80
00:03:27,160 --> 00:03:29,690
But these relationships are not really that important
81
00:03:29,690 --> 00:03:31,363
in terms of data modeling.
82
00:03:32,330 --> 00:03:34,430
Now the most important relationships
83
00:03:34,430 --> 00:03:37,210
are the one to many relationships.
84
00:03:37,210 --> 00:03:39,770
And they are so important that in MongoDB
85
00:03:39,770 --> 00:03:42,510
we actually distinguish between three types
86
00:03:42,510 --> 00:03:44,540
of one to many relationships.
87
00:03:44,540 --> 00:03:49,540
One to a few, one to many, and one to a ton or to a million
88
00:03:49,910 --> 00:03:53,230
or something like that. So the difference here is based
89
00:03:53,230 --> 00:03:56,893
on the relative amount of the many. All right.
90
00:03:57,840 --> 00:04:00,969
So an example of a one to a few relationship is that
91
00:04:00,969 --> 00:04:05,967
one movie can win many awards but actually just a few.
92
00:04:05,967 --> 00:04:09,630
So a movie is not gonna win a thousand awards
93
00:04:09,630 --> 00:04:11,220
but it can win some.
94
00:04:11,220 --> 00:04:14,930
And so this is a typical one to few relationship.
95
00:04:14,930 --> 00:04:18,709
So you see that in general a one to many relationship
96
00:04:18,709 --> 00:04:23,210
means that one document can relate to many other documents.
97
00:04:23,210 --> 00:04:26,680
Now this might look a bit abstract without the JSON data
98
00:04:26,680 --> 00:04:28,480
but that's actually the purpose here.
99
00:04:28,480 --> 00:04:31,040
I just wanna show you a conceptual overview
100
00:04:31,040 --> 00:04:33,759
of these different types of relationships.
101
00:04:33,759 --> 00:04:36,872
Anyway, in a one to many relationship
102
00:04:36,872 --> 00:04:40,600
one document can relate to hundreds or thousands
103
00:04:40,600 --> 00:04:42,070
of other documents.
104
00:04:42,070 --> 00:04:44,788
For example; one movie can have thousands of reviews
105
00:04:44,788 --> 00:04:46,710
in our application.
106
00:04:46,710 --> 00:04:49,380
And so this is not really a one to few
107
00:04:49,380 --> 00:04:51,524
but one to many relationship. Okay?
108
00:04:51,524 --> 00:04:55,616
And finally we have the one to a ton relationship.
109
00:04:55,616 --> 00:04:59,720
Imagine we wanted to implement some logging functionality
110
00:04:59,720 --> 00:05:03,110
in our app. So basically to know exactly what's going on
111
00:05:03,110 --> 00:05:04,870
on our server.
112
00:05:04,870 --> 00:05:08,770
These logs can then easily grow to millions of documents.
113
00:05:08,770 --> 00:05:11,270
And so this is a very typical example
114
00:05:11,270 --> 00:05:14,200
of a one to a ton relationship.
115
00:05:14,200 --> 00:05:17,100
And the difference between many and a ton is of course
116
00:05:17,100 --> 00:05:20,730
a bit fuzzy but just think that if something can grow
117
00:05:20,730 --> 00:05:23,360
almost to infinity then it's definitely
118
00:05:23,360 --> 00:05:25,532
a one to a ton relationship.
119
00:05:25,532 --> 00:05:28,763
So again the one to many relationships
120
00:05:28,763 --> 00:05:31,650
are the most important ones to know.
121
00:05:31,650 --> 00:05:34,050
By the way; in relational databases
122
00:05:34,050 --> 00:05:37,061
there is just one to many without quantifying
123
00:05:37,061 --> 00:05:39,800
how much that many actually is.
124
00:05:39,800 --> 00:05:41,800
In MongoDB databases though
125
00:05:41,800 --> 00:05:44,010
it is an extremely important difference.
126
00:05:44,010 --> 00:05:47,150
Because it's one of the factors that we're gonna use
127
00:05:47,150 --> 00:05:49,891
to decide if we should denormalize or normalize data
128
00:05:49,891 --> 00:05:53,340
as you will learn a bit later.
129
00:05:53,340 --> 00:05:57,181
Anyway, the last type of relationship is the many to many
130
00:05:57,181 --> 00:06:00,149
where one movie can have many actors.
131
00:06:00,149 --> 00:06:04,876
But at the same time one actor can play in many movies.
132
00:06:04,876 --> 00:06:07,910
And so here the relationship basically
133
00:06:07,910 --> 00:06:09,630
goes in both directions.
134
00:06:09,630 --> 00:06:11,800
Where before in the other types
135
00:06:11,800 --> 00:06:13,939
it was only in one direction.
136
00:06:13,939 --> 00:06:17,470
For example one movie can have many reviews
137
00:06:17,470 --> 00:06:22,450
but one specific review is only for that one movie. Right.
138
00:06:22,450 --> 00:06:24,560
And the same goes for the awards.
139
00:06:24,560 --> 00:06:27,506
So one specific award like for the best actor
140
00:06:27,506 --> 00:06:30,914
goes to only one movie not multiple ones.
141
00:06:30,914 --> 00:06:35,580
But with movies and actors it is indeed different.
142
00:06:35,580 --> 00:06:39,250
So again one movie stars many actors
143
00:06:39,250 --> 00:06:41,920
but one actor plays in many movies
144
00:06:41,920 --> 00:06:45,020
and so it's a many to many relationship.
145
00:06:45,020 --> 00:06:46,170
Okay.
146
00:06:46,170 --> 00:06:49,060
So keep all this in mind as we now move forward
147
00:06:49,060 --> 00:06:50,063
in this lecture.
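As a rough sketch of these relationship types in document form (all names and values here are made up for illustration, not taken from the lecture's slides):

```javascript
// Relationship types sketched with plain objects standing in for
// MongoDB documents. All field names and values are illustrative.
const movie = {
  name: "Example Movie",          // one to one: one field, one value
  awards: ["Award A", "Award B"], // one to a few: only a handful of values
};

// one to many / one to a ton: reviews or logs live as their own
// documents, since there can be thousands (or millions) of them
const review = { movie: "Example Movie", text: "Loved it" };

// many to many: an actor plays in many movies, and a movie has many
// actors -- the relationship goes in both directions
const actor = { name: "Actor One", movies: ["Example Movie"] };

console.log(movie.awards.length); // 2
```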
148
00:06:51,760 --> 00:06:54,870
And probably the most important aspect that we need to learn
149
00:06:54,870 --> 00:06:57,900
about MongoDB databases is referencing
150
00:06:57,900 --> 00:07:00,340
and embedding of datasets.
151
00:07:00,340 --> 00:07:02,350
And we actually already talked a little bit
152
00:07:02,350 --> 00:07:05,050
about this before but let's review it here
153
00:07:05,050 --> 00:07:07,311
and go a bit deeper also.
154
00:07:07,311 --> 00:07:09,962
So each time we have two related datasets
155
00:07:09,962 --> 00:07:13,829
we can either represent that related data in a reference
156
00:07:13,829 --> 00:07:18,829
or normalized form or in an embedded or denormalized form.
157
00:07:18,842 --> 00:07:22,190
And I keep using the two related terms together
158
00:07:22,190 --> 00:07:24,340
like referencing and normalizing
159
00:07:24,340 --> 00:07:26,460
because you will see them both being used
160
00:07:26,460 --> 00:07:29,510
and so it's important that you know all of them.
161
00:07:29,510 --> 00:07:33,070
Anyway, in the referenced form we keep two related
162
00:07:33,070 --> 00:07:35,826
datasets and all the documents separated.
163
00:07:35,826 --> 00:07:39,589
So again all the data is nicely separated
164
00:07:39,589 --> 00:07:43,275
which is exactly what normalized means.
165
00:07:43,275 --> 00:07:47,110
So continuing, the movie database example from before
166
00:07:47,110 --> 00:07:50,750
we would have one movie document and one actor document
167
00:07:50,750 --> 00:07:54,870
for each actor. Now how would we then make the connection
168
00:07:54,870 --> 00:07:58,710
between movie and the actors so that later in our app
169
00:07:58,710 --> 00:08:02,150
we can show which actors played in a particular movie.
170
00:08:02,150 --> 00:08:05,210
Because if they are all completely different document
171
00:08:05,210 --> 00:08:09,438
the movie has no way of knowing about the actors. Right.
172
00:08:09,438 --> 00:08:12,253
Well that's where the IDs come in.
173
00:08:12,253 --> 00:08:16,460
So we use the actor IDs in order to create references
174
00:08:16,460 --> 00:08:18,020
on the movie document.
175
00:08:18,020 --> 00:08:20,981
Effectively connecting movies with actors.
176
00:08:20,981 --> 00:08:24,760
So you see that in a movie document we have an array
177
00:08:24,760 --> 00:08:27,198
where we store the IDs of all the actors
178
00:08:27,198 --> 00:08:30,760
so that when we request data about a certain movie
179
00:08:30,760 --> 00:08:34,553
we can easily identify its actors. Does that make sense?
180
00:08:34,553 --> 00:08:38,830
Now this type of referencing is called child referencing
181
00:08:38,830 --> 00:08:41,480
because it's the parent, in this case the movie,
182
00:08:41,480 --> 00:08:45,104
who references its children. In this case the actors.
183
00:08:45,104 --> 00:08:48,841
So we're really creating some sort of hierarchy here. Right.
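A minimal sketch of this child-referencing shape, using plain objects in place of MongoDB documents (all IDs and names are made up):

```javascript
// Child referencing: the parent (movie) stores an array of its
// children's (actors') IDs. Illustrative data, not real driver calls.
const actors = [
  { _id: "actor_1", name: "Actor One" },
  { _id: "actor_2", name: "Actor Two" },
];

const movie = {
  _id: "movie_1",
  title: "Example Movie",
  actors: ["actor_1", "actor_2"], // references to the child documents
};

// Resolving the references means a second lookup in the actors data:
const cast = movie.actors.map((id) => actors.find((a) => a._id === id));
console.log(cast.map((a) => a.name)); // [ 'Actor One', 'Actor Two' ]
```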
184
00:08:48,841 --> 00:08:51,870
Now there is also parent referencing
185
00:08:51,870 --> 00:08:54,390
and we are gonna talk about that a bit later.
186
00:08:54,390 --> 00:08:58,710
And by the way in relational databases; all data is always
187
00:08:58,710 --> 00:09:01,958
represented in normalized form like this.
188
00:09:01,958 --> 00:09:05,490
But in a NoSQL database like MongoDB
189
00:09:05,490 --> 00:09:09,700
we can put the data into a denormalized form
190
00:09:09,700 --> 00:09:12,450
simply by embedding the related document
191
00:09:12,450 --> 00:09:15,330
right into the main document.
192
00:09:15,330 --> 00:09:18,330
So now we have all the relevant data about actors
193
00:09:18,330 --> 00:09:22,060
right inside one main movie document without the need
194
00:09:22,060 --> 00:09:25,700
for separate documents, collections, and IDs.
195
00:09:25,700 --> 00:09:30,088
So again, if we choose to denormalize or to embed our data
196
00:09:30,088 --> 00:09:34,280
we will have one main document containing all the main data
197
00:09:34,280 --> 00:09:37,197
as well as the related data. All right.
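For contrast, here is the same movie in an embedded (denormalized) sketch, again with made-up names:

```javascript
// Embedding (denormalizing): the related actor data lives right
// inside the movie document -- no separate collection or IDs needed.
// All names and fields are illustrative.
const movie = {
  _id: "movie_1",
  title: "Example Movie",
  actors: [
    { name: "Actor One", age: 40 },
    { name: "Actor Two", age: 35 },
  ],
};

// One read returns the movie and its actors all at once:
console.log(movie.actors.length); // 2
```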
198
00:09:37,197 --> 00:09:40,340
And the result of this is that our application
199
00:09:40,340 --> 00:09:43,330
will need fewer queries to the database.
200
00:09:43,330 --> 00:09:45,000
Because we can get all the data
201
00:09:45,000 --> 00:09:48,074
about movies and actors all at the same time
202
00:09:48,074 --> 00:09:51,650
which will of course increase our performance.
203
00:09:51,650 --> 00:09:53,840
Now the downside here is of course
204
00:09:53,840 --> 00:09:57,530
that we can't really query the embedded data on its own.
205
00:09:57,530 --> 00:10:00,810
And so if that's a requirement for the application
206
00:10:00,810 --> 00:10:03,790
you would have to choose a normalized design
207
00:10:03,790 --> 00:10:06,280
and since we're talking about pros and cons
208
00:10:06,280 --> 00:10:09,030
of the denormalized form; lets do the same
209
00:10:09,030 --> 00:10:11,490
about the normalized design.
210
00:10:11,490 --> 00:10:13,920
And basically its kind of the opposite
211
00:10:13,920 --> 00:10:15,770
of what we just talked about.
212
00:10:15,770 --> 00:10:18,319
So there is an improvement in performance
213
00:10:18,319 --> 00:10:22,390
when we often need to query the related data on its own
214
00:10:22,390 --> 00:10:25,740
because we then can just query the data that we need
215
00:10:25,740 --> 00:10:28,490
and not always movies and actors together.
216
00:10:28,490 --> 00:10:31,640
But on the other hand; when we need to actually query
217
00:10:31,640 --> 00:10:33,906
movies and actors together we then are gonna need
218
00:10:33,906 --> 00:10:36,396
many queries to the database.
219
00:10:36,396 --> 00:10:40,010
So first the query for the movie and then from there
220
00:10:40,010 --> 00:10:42,610
we will also need a query for the actor
221
00:10:42,610 --> 00:10:44,989
and that is of course worse for performance.
222
00:10:44,989 --> 00:10:48,328
So when designing your database; this is the kind of stuff
223
00:10:48,328 --> 00:10:50,569
that you need to keep in mind. All right.
224
00:10:50,569 --> 00:10:54,900
And now just as a side note; we could of course begin
225
00:10:54,900 --> 00:10:56,994
our thought process with denormalized data
226
00:10:56,994 --> 00:10:59,670
and then come to the conclusion
227
00:10:59,670 --> 00:11:01,692
that it's best to actually normalize the data.
228
00:11:01,692 --> 00:11:05,043
So when thinking about our data model
229
00:11:05,043 --> 00:11:08,378
this way of organizing data works of course in both ways.
230
00:11:08,378 --> 00:11:12,570
Now, how do we actually decide if we should
231
00:11:12,570 --> 00:11:15,330
normalize or denormalize the data?
232
00:11:15,330 --> 00:11:18,033
Well that's exactly what we're gonna learn next.
233
00:11:19,690 --> 00:11:22,974
So when we have two related datasets; we have to decide
234
00:11:22,974 --> 00:11:26,180
if we're gonna embed the datasets or if we're gonna
235
00:11:26,180 --> 00:11:27,693
keep them separated and reference them
236
00:11:27,693 --> 00:11:30,400
from one dataset to the other.
237
00:11:30,400 --> 00:11:32,730
And I kind of developed this decision framework
238
00:11:32,730 --> 00:11:36,070
which I'm gonna show you where we use three criteria
239
00:11:36,070 --> 00:11:37,770
to make that decision.
240
00:11:37,770 --> 00:11:40,450
First we look at the type of relationships
241
00:11:40,450 --> 00:11:42,800
that exist between the datasets.
242
00:11:42,800 --> 00:11:45,856
Second we try to determine the data access pattern
243
00:11:45,856 --> 00:11:50,150
of the dataset that we want to either embed or reference.
244
00:11:50,150 --> 00:11:53,320
And this just means to analyze how often data is read
245
00:11:53,320 --> 00:11:55,282
and written in that dataset.
246
00:11:55,282 --> 00:11:59,025
Then we also look at something that I call data closeness.
247
00:11:59,025 --> 00:12:02,940
And data closeness is a term that I actually just made up
248
00:12:02,940 --> 00:12:06,870
but what it means is how much the data is really related
249
00:12:06,870 --> 00:12:10,109
and how we want to query the data from the database.
250
00:12:10,109 --> 00:12:11,850
And this will make more sense
251
00:12:11,850 --> 00:12:14,180
when we talk about it in a moment.
252
00:12:14,180 --> 00:12:17,330
Now to actually make the decision; we need to combine
253
00:12:17,330 --> 00:12:19,350
all of these three criteria
254
00:12:19,350 --> 00:12:21,792
and not just use one of them in isolation.
255
00:12:21,792 --> 00:12:25,230
So for example; just because criteria number one
256
00:12:25,230 --> 00:12:28,380
says to embed it doesn't mean that we don't need to look
257
00:12:28,380 --> 00:12:30,425
at the other two criteria.
258
00:12:30,425 --> 00:12:34,124
All right, and let's start with the relationship type.
259
00:12:34,124 --> 00:12:37,968
So usually when we have a one to few relationship
260
00:12:37,968 --> 00:12:40,700
we will always embed the related dataset
261
00:12:40,700 --> 00:12:43,430
into the main dataset just like we learned
262
00:12:43,430 --> 00:12:45,860
in the last slide. Right.
263
00:12:45,860 --> 00:12:49,110
Now in a one to many relationship; things are a bit
264
00:12:49,110 --> 00:12:52,880
more fuzzy so it's okay to either embed or reference.
265
00:12:52,880 --> 00:12:55,140
In that case we will have to decide
266
00:12:55,140 --> 00:12:57,304
according to the other two criteria.
267
00:12:57,304 --> 00:12:59,825
Now on the other hand, on a one to a ton
268
00:12:59,825 --> 00:13:03,894
or a many to many relationship we usually always reference
269
00:13:03,894 --> 00:13:06,811
the data. That's because if we actually did embed
270
00:13:06,811 --> 00:13:10,004
in this case we could quickly create way too large documents.
271
00:13:10,004 --> 00:13:14,902
Even potentially surpassing the maximum of 16 megabytes.
272
00:13:14,902 --> 00:13:18,214
And so the solution for that is of course referencing
273
00:13:18,214 --> 00:13:22,090
or normalizing the data. And as a quick example;
274
00:13:22,090 --> 00:13:24,142
let's say that in our movie database example
275
00:13:24,142 --> 00:13:27,830
we have around 100 images associated with each movie.
276
00:13:27,830 --> 00:13:30,874
So we could say it's a one to many relationship
277
00:13:30,874 --> 00:13:34,230
but are we gonna embed the dataset or should we rather
278
00:13:34,230 --> 00:13:37,523
reference them here? Well, we don't really know.
279
00:13:37,523 --> 00:13:40,571
So let's take a look at the other two criteria.
280
00:13:40,571 --> 00:13:44,420
So the second one is about data access patterns
281
00:13:44,420 --> 00:13:46,290
which is just a fancy description
282
00:13:46,290 --> 00:13:48,242
for evaluating whether a certain dataset
283
00:13:48,242 --> 00:13:51,559
is mostly written to or mostly read from.
284
00:13:51,559 --> 00:13:55,760
So if the dataset that we're deciding about is mostly read
285
00:13:55,760 --> 00:13:58,179
and the data is not updated a lot
286
00:13:58,179 --> 00:14:01,620
then we should probably embed that dataset.
287
00:14:01,620 --> 00:14:04,690
So a high read/write ratio just means
288
00:14:04,690 --> 00:14:07,100
that there is a lot more reading than writing.
289
00:14:07,100 --> 00:14:11,100
And a again, a dataset like that is a good candidate
290
00:14:11,100 --> 00:14:11,983
for embedding.
291
00:14:12,830 --> 00:14:15,980
The reason for this is that by embedding we only need
292
00:14:15,980 --> 00:14:18,379
one trip to the database per query.
293
00:14:18,379 --> 00:14:22,197
While for referencing we need two trips. Right.
294
00:14:22,197 --> 00:14:25,660
So if we embed data that is read a lot;
295
00:14:25,660 --> 00:14:28,383
in each query we save one trip to the database
296
00:14:28,383 --> 00:14:32,147
making the entire process way more performant.
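A toy illustration of that trip-count difference, with plain arrays standing in for collections and a counter standing in for database round trips (not real MongoDB driver calls):

```javascript
// Simulated "collections" and a counter for round trips to the database.
// All IDs and names are made up for illustration.
const moviesNormalized = [{ _id: 1, title: "Example Movie", actorIds: [10, 11] }];
const actorsCollection = [
  { _id: 10, name: "Actor One" },
  { _id: 11, name: "Actor Two" },
];

let trips = 0;
const query = (collection, predicate) => {
  trips += 1; // each call counts as one trip to the "database"
  return collection.filter(predicate);
};

// Normalized: two trips -- one for the movie, one for its actors.
const [movie] = query(moviesNormalized, (m) => m._id === 1);
const cast = query(actorsCollection, (a) => movie.actorIds.includes(a._id));
console.log(trips); // 2

// With embedding, the same data would come back in a single trip.
```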
297
00:14:32,147 --> 00:14:35,260
So I think that our movie image example
298
00:14:35,260 --> 00:14:38,320
would actually be a good candidate for embedding.
299
00:14:38,320 --> 00:14:41,543
Because once the 100 images are saved to the database
300
00:14:41,543 --> 00:14:43,920
they are not really updated anymore
301
00:14:43,920 --> 00:14:46,930
because there is not really anything to update
302
00:14:46,930 --> 00:14:50,057
about an image. Right, so it's all about reading
303
00:14:50,057 --> 00:14:52,563
and therefore based on this criteria
304
00:14:52,563 --> 00:14:55,501
we would embed the image documents.
305
00:14:55,501 --> 00:14:59,092
Now on the other hand, if our data is updated a lot
306
00:14:59,092 --> 00:15:03,118
then we should consider referencing or normalizing the data.
307
00:15:03,118 --> 00:15:06,700
That's because its more work for the database engine
308
00:15:06,700 --> 00:15:08,870
to update an embedded document
309
00:15:08,870 --> 00:15:11,600
than a more simple standalone document.
310
00:15:11,600 --> 00:15:13,980
And since our main goal is performance;
311
00:15:13,980 --> 00:15:15,917
we just normalize the dataset.
312
00:15:15,917 --> 00:15:19,653
In our example lets say each movie has many reviews
313
00:15:19,653 --> 00:15:23,284
and each review can be marked as helpful by the user.
314
00:15:23,284 --> 00:15:27,560
So each time someone clicks on "this review was helpful"
315
00:15:27,560 --> 00:15:29,780
in our application, we need to update
316
00:15:29,780 --> 00:15:31,740
the corresponding document.
317
00:15:31,740 --> 00:15:35,030
And this means that the data can change all the time
318
00:15:35,030 --> 00:15:38,520
and so this is a great candidate for normalizing.
319
00:15:38,520 --> 00:15:41,420
Again because we don't want to be querying the movies
320
00:15:41,420 --> 00:15:45,190
all the time if all we really wanna update is the reviews
321
00:15:45,190 --> 00:15:47,230
by marking them as helpful.
322
00:15:47,230 --> 00:15:49,464
Okay, does that make sense?
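As a quick sketch of why normalizing helps here, with reviews as standalone objects (field names like `helpfulCount` are my own, not from the lecture):

```javascript
// With reviews normalized into their own collection, marking one as
// helpful touches only that small standalone document -- the movie
// document is never involved. Illustrative data only.
const reviews = [
  { _id: "r1", movieId: "movie_1", text: "Great movie!", helpfulCount: 0 },
];

function markHelpful(reviewId) {
  const review = reviews.find((r) => r._id === reviewId);
  review.helpfulCount += 1; // cheap, frequent write on a small document
}

markHelpful("r1");
console.log(reviews[0].helpfulCount); // 1
```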
323
00:15:49,464 --> 00:15:53,500
And finally the last criterion I call data closeness;
324
00:15:53,500 --> 00:15:56,320
which is just like a measure for how much the data
325
00:15:56,320 --> 00:15:59,469
is related. So if the two datasets really
326
00:15:59,469 --> 00:16:02,890
intrinsically belong together then they should
327
00:16:02,890 --> 00:16:05,880
probably be embedded into one another.
328
00:16:05,880 --> 00:16:10,440
In our example; all users can have many email addresses
329
00:16:10,440 --> 00:16:13,780
on their account and since they are so intrinsically
330
00:16:13,780 --> 00:16:17,190
connected to the user, there is no doubt emails
331
00:16:17,190 --> 00:16:19,920
should be embedded into the user document.
332
00:16:19,920 --> 00:16:23,830
Now if we frequently need to query both datasets
333
00:16:23,830 --> 00:16:26,388
on their own then that's a very good reason
334
00:16:26,388 --> 00:16:29,696
to normalize the data into two separate datasets.
335
00:16:29,696 --> 00:16:32,790
Even if they are closely related.
336
00:16:32,790 --> 00:16:35,227
So imagine that in our app we have a quiz
337
00:16:35,227 --> 00:16:40,227
where users have to identify a movie based on images.
338
00:16:40,440 --> 00:16:43,080
This means that we're gonna query a lot of images
339
00:16:43,080 --> 00:16:44,180
on their own.
340
00:16:44,180 --> 00:16:47,756
So without necessarily querying for the movies themselves.
341
00:16:47,756 --> 00:16:50,640
And so if we apply this third criteria;
342
00:16:50,640 --> 00:16:54,137
we come to the conclusion that we should actually normalize
343
00:16:54,137 --> 00:16:56,759
the image dataset. All right.
344
00:16:56,759 --> 00:17:00,770
Because again if we implement this quiz functionality;
345
00:17:00,770 --> 00:17:04,057
images are gonna be queried on their own all the time.
346
00:17:04,057 --> 00:17:07,422
So, all of this shows that we should really look
347
00:17:07,422 --> 00:17:09,849
at all three criteria together
348
00:17:09,849 --> 00:17:12,700
rather than just one of them in isolation.
349
00:17:12,700 --> 00:17:15,840
Because that might lead to less optimal decisions.
350
00:17:15,840 --> 00:17:18,907
And I say less optimal instead of wrong
351
00:17:18,907 --> 00:17:21,766
because there are not really completely right
352
00:17:21,766 --> 00:17:25,262
or completely wrong ways of modeling our data.
353
00:17:25,262 --> 00:17:28,970
There are no hard rules; these are just like guidelines
354
00:17:28,970 --> 00:17:31,380
that you can follow to find the probably
355
00:17:31,380 --> 00:17:33,860
most correct way of structuring your data.
356
00:17:33,860 --> 00:17:37,077
But again, it's hard to be really really wrong.
357
00:17:37,077 --> 00:17:38,253
Okay?
358
00:17:39,740 --> 00:17:43,110
Now, let's say that we have chosen to normalize
359
00:17:43,110 --> 00:17:44,270
our datasets.
360
00:17:44,270 --> 00:17:46,653
So in other words to reference data.
361
00:17:46,653 --> 00:17:49,380
Then after that we still have to choose
362
00:17:49,380 --> 00:17:52,840
between three different types of referencing.
363
00:17:52,840 --> 00:17:55,460
Child referencing, parent referencing
364
00:17:55,460 --> 00:17:57,540
and two-way referencing.
365
00:17:57,540 --> 00:18:00,767
So the first type is child referencing.
366
00:18:00,767 --> 00:18:04,440
Which is the referencing type I actually showed you before.
367
00:18:04,440 --> 00:18:05,470
Okay?
368
00:18:05,470 --> 00:18:07,850
And let's now take the error logging example
369
00:18:07,850 --> 00:18:10,128
that I mentioned earlier. Where we could potentially
370
00:18:10,128 --> 00:18:13,021
have millions of log documents.
371
00:18:13,021 --> 00:18:17,300
So in child referencing; we basically keep references
372
00:18:17,300 --> 00:18:20,460
to the related child documents in a parent document.
373
00:18:20,460 --> 00:18:22,941
And they are usually stored in an array.
374
00:18:22,941 --> 00:18:25,735
So you see that each log has an ID
375
00:18:25,735 --> 00:18:29,040
and then in the app document there is that array
376
00:18:29,040 --> 00:18:31,358
with all of these IDs. Right?
377
00:18:31,358 --> 00:18:34,400
However, the problem here is that this array
378
00:18:34,400 --> 00:18:39,320
of IDs can become very large if there are lots of children.
379
00:18:39,320 --> 00:18:42,230
And this is an anti-pattern in MongoDB.
380
00:18:42,230 --> 00:18:45,156
So something that we should avoid at all costs.
381
00:18:45,156 --> 00:18:47,660
Also, child referencing makes it
382
00:18:47,660 --> 00:18:51,410
so that parents and children are very tightly coupled.
383
00:18:51,410 --> 00:18:54,840
Which is not always ideal. But that's exactly
384
00:18:54,840 --> 00:18:57,020
why we have parent referencing.
385
00:18:57,020 --> 00:19:00,300
So in parent referencing; it actually works
386
00:19:00,300 --> 00:19:01,870
the other way around.
387
00:19:01,870 --> 00:19:05,570
So here in each child document we keep a reference
388
00:19:05,570 --> 00:19:07,430
to the parent element.
389
00:19:07,430 --> 00:19:10,267
Therefore the name parent referencing.
390
00:19:10,267 --> 00:19:13,890
In this example the app ID is 23
391
00:19:13,890 --> 00:19:16,640
and so in each log there is the app field
392
00:19:16,640 --> 00:19:18,990
with the 23 ID in it.
393
00:19:18,990 --> 00:19:21,660
So that the child always knows its parent.
394
00:19:21,660 --> 00:19:24,920
And so in this case the parent actually knows nothing
395
00:19:24,920 --> 00:19:26,080
about the children.
396
00:19:26,080 --> 00:19:28,768
Not who they are and not how many they are.
397
00:19:28,768 --> 00:19:32,890
So, they are way more isolated and more standalone.
398
00:19:32,890 --> 00:19:35,326
And that can sometimes be beneficial.
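Parent referencing can be sketched like this, reusing the lecture's app ID of 23 (the log fields themselves are made up):

```javascript
// Parent referencing: each child (log) keeps a reference to its
// parent (app); the parent knows nothing about its children.
const app = { _id: 23, name: "Example App" }; // no array of log IDs here

const logs = [
  { _id: "log_1", app: 23, message: "server started" },
  { _id: "log_2", app: 23, message: "request handled" },
];

// Finding an app's logs means filtering the children by parent ID,
// so no array on the parent ever grows:
const appLogs = logs.filter((l) => l.app === app._id);
console.log(appLogs.length); // 2
```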
399
00:19:35,326 --> 00:19:38,880
So which of these two types is actually better
400
00:19:38,880 --> 00:19:40,527
for this data relationship?
401
00:19:40,527 --> 00:19:42,820
And remember how I said that there
402
00:19:42,820 --> 00:19:45,860
could be millions of logs, and so let's suppose
403
00:19:45,860 --> 00:19:47,652
there are two million log documents.
404
00:19:47,652 --> 00:19:51,340
In a case of child referencing, that would mean
405
00:19:51,340 --> 00:19:53,209
that there are two million ID references
406
00:19:53,209 --> 00:19:55,091
in the app document.
407
00:19:55,091 --> 00:19:58,300
Right? Now also remember how I said that
408
00:19:58,300 --> 00:20:00,545
there is a 16 megabyte limit on documents.
409
00:20:00,545 --> 00:20:04,302
So if we kept adding and adding these child IDs
410
00:20:04,302 --> 00:20:06,716
into the array on the parent; then we would
411
00:20:06,716 --> 00:20:09,575
pretty quickly hit that 16 megabyte limit
412
00:20:09,575 --> 00:20:11,772
that each BSON document can hold.
413
00:20:11,772 --> 00:20:14,702
Simply because that array will grow so much.
414
00:20:14,702 --> 00:20:17,210
So that's not really gonna work.
415
00:20:17,210 --> 00:20:18,510
Is it?
416
00:20:18,510 --> 00:20:20,590
On the other hand with parent referencing
417
00:20:20,590 --> 00:20:22,990
that problem is not gonna happen.
418
00:20:22,990 --> 00:20:25,570
We will simply have two million log documents
419
00:20:25,570 --> 00:20:30,540
just like before, but each of them holds the ID of its parent.
420
00:20:30,540 --> 00:20:33,098
But there is no array that will grow indefinitely
421
00:20:33,098 --> 00:20:35,740
and therefore parent referencing
422
00:20:35,740 --> 00:20:38,443
would be the best solution here.
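As a minimal sketch of what parent referencing looks like (all field values here are hypothetical, and the database query is simulated with a plain array filter):

```javascript
// Parent referencing: each child log stores its parent's ID;
// the parent "app" document stores no references at all.
const app = { _id: 23, name: 'My App' };

const logs = [
  { _id: 1, app: 23, message: 'Server started' },
  { _id: 2, app: 23, message: 'Request received' },
  // ...could be millions more, none of which bloat the parent
];

// In the Mongo shell this would be db.logs.find({ app: 23 }).
// Simulated here in plain JavaScript:
const appLogs = logs.filter(log => log.app === app._id);
console.log(appLogs.length); // 2
```

Note that the parent document stays the same size no matter how many logs exist.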
423
00:20:39,380 --> 00:20:41,901
So the conclusion of all this is that in general
424
00:20:41,901 --> 00:20:44,385
child referencing is best used
425
00:20:44,385 --> 00:20:48,008
for one to a few relationships, where we know beforehand
426
00:20:48,008 --> 00:20:51,118
that the array of child documents won't grow that much.
427
00:20:51,118 --> 00:20:54,573
On the other hand, parent referencing is best used
428
00:20:54,573 --> 00:20:58,690
for one to many and one to a ton relationships
429
00:20:58,690 --> 00:21:00,927
like this one. Okay?
430
00:21:00,927 --> 00:21:04,610
So again always keep in mind that one of the most
431
00:21:04,610 --> 00:21:07,920
important principles of MongoDB data modeling
432
00:21:07,920 --> 00:21:11,900
is that arrays should never be allowed to grow indefinitely.
433
00:21:11,900 --> 00:21:15,420
That way, we never break that 16-megabyte limit.
434
00:21:15,420 --> 00:21:18,170
We also don't want to send our users an array
435
00:21:18,170 --> 00:21:20,730
with thousands of IDs each time
436
00:21:20,730 --> 00:21:24,340
they request a parent dataset. Okay?
437
00:21:24,340 --> 00:21:26,900
So did this logic make sense to you?
438
00:21:26,900 --> 00:21:29,660
Then let's move on to the third type of referencing
439
00:21:29,660 --> 00:21:31,870
which is two-way referencing.
440
00:21:31,870 --> 00:21:34,395
And this time with the movie and actor example
441
00:21:34,395 --> 00:21:36,380
I showed you when we talked about
442
00:21:36,380 --> 00:21:39,364
many to many relationships. Remember that?
443
00:21:39,364 --> 00:21:42,229
So again, each movie has many actors
444
00:21:42,229 --> 00:21:44,880
and each actor plays in many movies.
445
00:21:44,880 --> 00:21:48,464
And so that's a typical many to many relationship.
446
00:21:48,464 --> 00:21:52,100
And we usually use this two-way referencing to design
447
00:21:52,100 --> 00:21:55,346
many to many relationships. And it works like this:
448
00:21:55,346 --> 00:21:59,370
in each movie we will keep references to all the actors
449
00:21:59,370 --> 00:22:03,980
that star in that movie. So a bit like in child referencing.
450
00:22:03,980 --> 00:22:07,000
However, at the same time, in each actor
451
00:22:07,000 --> 00:22:09,570
we also keep references to all the movies
452
00:22:09,570 --> 00:22:11,660
that the actor played in.
453
00:22:11,660 --> 00:22:15,120
So movies and actors are connected in both directions.
454
00:22:15,120 --> 00:22:17,900
And hence the name two-way referencing.
455
00:22:17,900 --> 00:22:19,950
And this makes it really easy to search
456
00:22:19,950 --> 00:22:23,290
for both movies and actors completely independently.
457
00:22:23,290 --> 00:22:25,910
While also making it easy to find the actors
458
00:22:25,910 --> 00:22:29,029
associated to each movie and the movies associated
459
00:22:29,029 --> 00:22:30,383
to each actor.
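Here is a minimal sketch of two-way referencing for this movie/actor relationship (hypothetical _id values; the lookups are simulated with plain arrays rather than real queries):

```javascript
// Two-way referencing: movies reference their actors
// AND actors reference their movies.
const movies = [
  { _id: 'm1', title: 'Movie A', actors: ['a1', 'a2'] },
  { _id: 'm2', title: 'Movie B', actors: ['a1'] },
];
const actors = [
  { _id: 'a1', name: 'Actor One', movies: ['m1', 'm2'] },
  { _id: 'a2', name: 'Actor Two', movies: ['m1'] },
];

// Both directions resolve easily and independently:
const castOfMovieA = actors.filter(a => movies[0].actors.includes(a._id));
const filmsOfActorOne = movies.filter(m => actors[0].movies.includes(m._id));
console.log(castOfMovieA.length); // 2
console.log(filmsOfActorOne.length); // 2
```

The trade-off is that both sides must be kept in sync whenever the relationship changes, which is the price of querying each dataset on its own.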
460
00:22:31,623 --> 00:22:32,560
(deep breath)
461
00:22:32,560 --> 00:22:34,747
This was quite a long lecture indeed.
462
00:22:34,747 --> 00:22:38,030
With a lot of new concepts and principles
463
00:22:38,030 --> 00:22:40,220
and guidelines to remember.
464
00:22:40,220 --> 00:22:43,460
So in order to help you with that, here is a quick
465
00:22:43,460 --> 00:22:46,650
summary and some more general guidelines that you can
466
00:22:46,650 --> 00:22:48,423
take a look at whenever you need them.
467
00:22:49,260 --> 00:22:52,753
So the most important principle is: structure your data
468
00:22:52,753 --> 00:22:56,120
to match the ways that your application queries
469
00:22:56,120 --> 00:22:57,436
and updates data.
470
00:22:57,436 --> 00:23:01,400
Or in other words: identify the questions that arise
471
00:23:01,400 --> 00:23:03,784
from your application's use cases first, and then model
472
00:23:03,784 --> 00:23:06,634
your data so that the questions can get answered
473
00:23:06,634 --> 00:23:08,995
in the most efficient way.
474
00:23:08,995 --> 00:23:12,610
For example: will I need to query movies and actors
475
00:23:12,610 --> 00:23:16,130
always together, or are there scenarios where I only
476
00:23:16,130 --> 00:23:18,041
query movies or only actors?
477
00:23:18,041 --> 00:23:20,528
Those kinds of questions are what your data model
478
00:23:20,528 --> 00:23:22,930
will be based on.
479
00:23:22,930 --> 00:23:26,730
In general, always favor embedding unless there is a good
480
00:23:26,730 --> 00:23:28,440
reason not to embed.
481
00:23:28,440 --> 00:23:32,513
Especially for one to a few and one to many relationships.
482
00:23:33,370 --> 00:23:37,713
Next up, a one to a ton or a many to many relationship
483
00:23:37,713 --> 00:23:41,543
is usually a good reason to reference instead of embedding.
484
00:23:41,543 --> 00:23:45,734
Also, favor referencing when data is updated a lot
485
00:23:45,734 --> 00:23:50,717
and if you need to frequently access a dataset on its own.
486
00:23:50,717 --> 00:23:55,340
Use embedding when data is mostly read but rarely updated
487
00:23:55,340 --> 00:23:58,469
and when two datasets belong intrinsically together.
488
00:23:58,469 --> 00:24:02,840
Don't allow arrays to grow indefinitely.
489
00:24:02,840 --> 00:24:05,982
Therefore, if you want to normalize, use child referencing
490
00:24:05,982 --> 00:24:09,680
for one to many relationships and parent referencing
491
00:24:09,680 --> 00:24:11,856
for one to a ton relationships.
492
00:24:11,856 --> 00:24:15,160
And finally, use two-way referencing
493
00:24:15,160 --> 00:24:17,520
for many to many relationships.
494
00:24:17,520 --> 00:24:18,720
All right?
495
00:24:18,720 --> 00:24:21,202
And that pretty much sums it up.
496
00:24:21,202 --> 00:24:23,970
I would actually recommend watching this video
497
00:24:23,970 --> 00:24:27,144
twice if you can, just because of how important
498
00:24:27,144 --> 00:24:30,091
this material really is. All right?
499
00:24:30,091 --> 00:24:33,363
Anyway, see you in the next video.