Apache Spark Full Course - Learn Apache Spark in 8 Hours - Apache Spark Tutorial - Edureka (YouTube transcript)

For the past five years, Spark has been on an absolute tear, becoming one of the most widely used technologies in big data and AI. Today's cutting-edge companies like Facebook, Apple, Netflix, Uber and many more have deployed Spark at massive scale, processing petabytes of data to deliver innovations ranging from detecting fraudulent behavior to delivering personalized experiences in real time, and many such innovations that are transforming every industry.

Hi all, I welcome you to this full course session on Apache Spark: a complete crash course consisting of everything you need to know to get started with Apache Spark from scratch. But before we get into the details, let's look at our agenda for today. For better understanding and ease of learning, the entire crash course is divided into 12 modules. In the first module, Introduction to Spark, we will try to understand what exactly Spark is and how it performs real-time processing. In the second module, we will dive deep into the different components that constitute Spark; we will also learn about the Spark architecture and its ecosystem. Next up, in the third module, we will learn what exactly resilient distributed datasets are in Spark. The fourth module is all about DataFrames: in this module we will learn what exactly DataFrames are and how to perform different operations on DataFrames. Moving on, in the fifth module we will discuss the different ways that Spark provides to perform SQL queries for accessing and processing data. In the sixth module, we will learn how to perform streaming on live data streams using Spark Streaming, and in the seventh module we will discuss how to execute different machine learning algorithms using the Spark machine learning library. The eighth module is all about Spark GraphX:
in this module, we are going to learn what graph processing is and how to perform graph processing using the Spark GraphX library. In the ninth module, we will discuss the key differences between two popular data processing paradigms, MapReduce and Spark. Talking about the tenth module, we will integrate two popular frameworks, Spark and Kafka. The eleventh module is all about PySpark: in this module we will try to understand how PySpark exposes the Spark programming model to Python. Lastly, in the twelfth module, we will take a look at the most frequently asked interview questions on Spark, which will help you ace your interview with flying colors. Thank you guys, and while you are at it, please do not forget to subscribe to the Edureka YouTube channel to stay updated with current trending technologies.

There has been a buzz around the world that Spark is the future of the big data platform: that it is a hundred times faster than MapReduce and is also a go-to tool for all solutions. But what exactly is Apache Spark, and what makes it so popular? In this session I will give you a complete insight into Apache Spark and its fundamentals. Without any further ado, let's quickly look at the topics to be covered in this session. First and foremost, I will tell you what Apache Spark is and what its features are. Next, I will take you through the components of the Spark ecosystem that make Spark the future of the big data platform. After that, I will talk about the fundamental data structure of Spark, that is the RDD; I will also tell you about its features, its operations, the ways to create RDDs, etc. And at the last, I will wrap up the session by giving a real-time use case of Spark. So let's get started with the very first topic and understand what Spark is. Spark is an open-source, scalable, massively parallel, in-memory execution environment for running analytics applications.
You can just think of it as an in-memory layer that sits above multiple data stores, where data can be loaded into memory and analyzed in parallel across the cluster. When it comes to big data processing, much like MapReduce, Spark works by distributing the data across the cluster and then processing that data in parallel. The difference here is that, unlike MapReduce, which shuffles files around on disk, Spark works in memory, and that makes it much faster at processing the data than MapReduce. It is also said to be the lightning-fast unified analytics engine for big data and machine learning.

So now let's look at the interesting features of Apache Spark. Coming to speed, you can call Spark a swift processing framework. Why? Because it is a hundred times faster in memory and ten times faster on disk when compared with Hadoop; not only that, it also provides a high data processing speed. Next, powerful caching: it has a simple programming layer that provides powerful caching and disk persistence capabilities, and Spark can be deployed through Mesos, Hadoop YARN or Spark's own cluster manager. As you all know, Spark itself was designed and developed for real-time data processing, so it's an obvious fact that it offers real-time computation and low latency because of its in-memory computation. Next, polyglot: Spark provides high-level APIs in Java, Scala, Python and R, and Spark code can be written in any of these four languages; not only that, it also provides a shell in Scala and Python. These are the various features of Spark.
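As a quick illustration of the caching and persistence feature just mentioned, here is a minimal spark-shell sketch; the HDFS path is only a placeholder, while persist() and the storage levels are standard Spark API.

```scala
import org.apache.spark.storage.StorageLevel

// `sc` is the SparkContext that spark-shell creates for you.
val logs = sc.textFile("hdfs://localhost:9000/example/sample")   // placeholder path

// Keep the RDD in memory, spilling to disk if it does not fit,
// so that repeated actions do not re-read the file from HDFS each time.
logs.persist(StorageLevel.MEMORY_AND_DISK)

println(logs.count())   // first action: reads from HDFS and fills the cache
println(logs.count())   // second action: served from the cached partitions
```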
Now let's see the various components of the Spark ecosystem. Let me first tell you about the Spark Core component: it is the most vital component of the Spark ecosystem, and it is responsible for basic I/O functions, scheduling, monitoring, etc. The entire Apache Spark ecosystem is built on top of this core execution engine, which has extensible APIs in different languages like Scala, Python, R and Java, and, as I have already mentioned, Spark can be deployed through Mesos, Hadoop YARN or Spark's own cluster manager. The Spark ecosystem library is composed of various components like Spark SQL, Spark Streaming, the machine learning library and GraphX. Now let me explain each of them. The Spark SQL component is used to leverage the power of declarative queries and optimized storage by executing SQL queries on Spark data, which is present in RDDs and other external sources. Next, the Spark Streaming component allows developers to perform batch processing and streaming of data in the same application. Coming to the machine learning library, it eases the deployment and development of scalable machine learning pipelines, like summary statistics, correlations, feature extraction, transformation functions, optimization algorithms, etc. And the GraphX component lets data scientists work with graph and non-graph sources to achieve flexibility and resilience in graph construction and transformation.

Now, talking about the programming languages: Spark supports Scala, a functional programming language in which Spark itself is written, so Spark supports Scala as an interface. Then Spark also supports a Python interface: you can write the program in Python and execute it over Spark, and again, if you look at the code, Python and Scala are very similar. Then, R is very famous for data analysis and machine learning, so Spark has also added support for R, and it also supports Java, so you can go ahead, write the code in Java and execute it with Spark. Next, the data can be stored in HDFS, the local file system, Amazon S3 or the cloud, and Spark also supports SQL and NoSQL databases as well. So this is all about the various components of the Spark ecosystem.
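Since the Spark SQL component is only described in words here, this is a minimal sketch of what a declarative query looks like in spark-shell 2.x, where a SparkSession named spark is available alongside sc; the people.json file is a hypothetical input used purely for illustration.

```scala
// spark-shell 2.x provides a SparkSession called `spark` in addition to `sc`.
// people.json is a hypothetical input file, not part of this demo.
val people = spark.read.json("hdfs://localhost:9000/example/people.json")
people.createOrReplaceTempView("people")

// A declarative SQL query, optimized and executed by the Spark SQL engine.
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()
```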
Now, let's see what's next. When it comes to iterative distributed computing, that is, processing the data over multiple jobs and computations, we need to reuse or share the data among multiple jobs. In earlier frameworks like Hadoop there were problems while dealing with multiple operations or jobs: here we need to store the data in some intermediate, stable, distributed storage such as HDFS, and the multiple I/O operations make the overall computation of jobs much slower; and there were replications and serializations, which in turn made the process even slower. Our goal here was to reduce the number of I/O operations to HDFS, and this can be achieved only through in-memory data sharing. In-memory data sharing is 10 to 100 times faster than network and disk sharing, and RDDs try to solve all these problems by enabling fault-tolerant, distributed, in-memory computations.

So now let's understand what RDDs are. RDD stands for resilient distributed dataset. They are considered to be the backbone of Spark and are one of the fundamental data structures of Spark; they are also known as schema-less structures that can handle both structured and unstructured data. So in Spark, anything you do is around RDDs: when you're reading data in Spark, it is read into an RDD; when you're transforming the data, you're performing transformations on an old RDD and creating a new one; then at last you perform some actions on the RDD and store the data present in that RDD to persistent storage. A resilient distributed dataset is an immutable, distributed collection of objects. Your objects can be anything like strings, lines, rows, objects, collections, etc., and RDDs can contain any type of Python, Java or Scala objects, even including user-defined classes.
221 00:09:32,900 --> 00:09:35,612 Each data set present in an rdd is divided 222 00:09:35,612 --> 00:09:37,200 into logical partitions, 223 00:09:37,200 --> 00:09:39,353 which may be computed on different nodes 224 00:09:39,353 --> 00:09:42,500 of the cluster due to this you can perform Transformations 225 00:09:42,500 --> 00:09:44,190 or actions on the complete data 226 00:09:44,190 --> 00:09:47,300 parallely and I don't have to worry about the distribution 227 00:09:47,300 --> 00:09:49,400 because spark takes care of that 228 00:09:49,400 --> 00:09:52,100 are they these are highly resilient that is 229 00:09:52,100 --> 00:09:55,141 they are able to recover quickly from any issues 230 00:09:55,141 --> 00:09:56,500 as a same data chunks 231 00:09:56,500 --> 00:09:59,700 are replicated across multiple executor notes thus 232 00:09:59,700 --> 00:10:02,564 so even if one executor fails another will still 233 00:10:02,564 --> 00:10:03,600 process the data. 234 00:10:03,600 --> 00:10:06,482 This allows you to perform functional calculations 235 00:10:06,482 --> 00:10:08,287 against a data set very quickly 236 00:10:08,287 --> 00:10:10,699 by harnessing the power of multiple nodes. 237 00:10:10,699 --> 00:10:12,472 So this is all about rdd now. 238 00:10:12,472 --> 00:10:14,000 Let's have a look at some 239 00:10:14,000 --> 00:10:17,847 of the important features of our dbe's rdds have a provision 240 00:10:17,847 --> 00:10:19,327 of in memory competition 241 00:10:19,327 --> 00:10:21,300 and all transformations are lazy. 242 00:10:21,300 --> 00:10:24,044 That is it does not compute the results right away 243 00:10:24,044 --> 00:10:25,679 until an action is applied. 244 00:10:25,679 --> 00:10:27,800 So it supports in memory competition 245 00:10:27,800 --> 00:10:30,034 and lazy evaluation as well next. 246 00:10:30,034 --> 00:10:32,200 Fault tolerant in case of rdds. 247 00:10:32,200 --> 00:10:34,454 They track the data lineage information 248 00:10:34,454 --> 00:10:37,341 to rebuild the last data automatically and this is 249 00:10:37,341 --> 00:10:40,000 how it provides fault tolerance to the system. 250 00:10:40,000 --> 00:10:42,600 Next immutability data can be created 251 00:10:42,600 --> 00:10:43,800 or received any time 252 00:10:43,800 --> 00:10:46,388 and once defined its value cannot be changed. 253 00:10:46,388 --> 00:10:47,900 And that is the reason why 254 00:10:47,900 --> 00:10:51,235 I said are they these are immutable next partitioning 255 00:10:51,235 --> 00:10:53,774 at is the fundamental unit of parallelism 256 00:10:53,774 --> 00:10:54,605 and Spark rdd 257 00:10:54,605 --> 00:10:57,800 and all the data chunks are divided into partitions 258 00:10:57,800 --> 00:10:59,960 and already next persistence. 259 00:10:59,960 --> 00:11:01,600 So users can reuse rdd 260 00:11:01,600 --> 00:11:05,400 and choose a storage stategy for them coarse-grained operations 261 00:11:05,400 --> 00:11:08,493 applies to all elements in datasets through Maps 262 00:11:08,493 --> 00:11:10,600 or filter or group by operations. 263 00:11:10,700 --> 00:11:13,000 So these are the various features of our daily. 264 00:11:13,300 --> 00:11:15,800 Now, let's see the ways to create rdd. 
Now, let's see the ways to create RDDs. There are three ways to create RDDs: you can create an RDD from a parallelized collection, you can create an RDD from an existing RDD (or other RDDs), and an RDD can also be created from external data sources like HDFS, Amazon S3, HBase, etc. Now let me show you how to create RDDs. I'll open my terminal and first check whether my daemons are running or not. Cool, here I can see that the Hadoop and Spark daemons are both running. So now, first, let's start the Spark shell; it will take a bit of time to start. Cool, now the Spark shell has started, I can see the version of Spark as 2.1.1, and we have a Scala shell over here. Now I will tell you how to create RDDs in three different ways using the Scala language.

At first, let's see how to create an RDD from a parallelized collection. sc.parallelize is the method I use: it is the Spark context's parallelize method, which creates a parallelized collection. So I will give sc.parallelize, and here I will parallelize the numbers 1 to 100 into five different partitions, and I will apply collect as the action to start the process. So here in the result you can see an array of the numbers 1 to 100. Okay, now let me show you how the partitions appear in the web UI of Spark; the web UI port for Spark is localhost:4040. So here you have just completed one job, that is, the sc.parallelize followed by collect, correct? Here you can see all five tasks that succeeded, because we have divided the job into five partitions. So let me show you the partitions. So this is the DAG visualization, that is, the directed acyclic graph visualization, wherein you have applied only parallelize as a method, so you can see only one stage here.
So here you can see the RDD that has been created, and coming to the event timeline, you can see the task that has been executed in five different partitions, and the different colors imply the scheduler delay, task deserialization time, shuffle read time, shuffle write time, executor computing time, etc. Here you can see the summary metrics for the created RDD: you can see that the maximum time it took to execute the tasks in five partitions in parallel is just 45 milliseconds. You can also see the executor ID, the host ID, the status (that is, succeeded), the duration, the launch time, etc. So this is one way of creating an RDD, from a parallelized collection.

Now let me show you how to create an RDD from an existing RDD. Okay, here I'll create an array called a1 and assign the numbers one to ten: one, two, three, four, five, six, seven... okay, so I got the result here, that is, I have created an integer array of 1 to 10, and now I will parallelize this array a1. Sorry, I got an error; it should be sc.parallelize of a1. Okay, so I created an RDD called parallelCollection. Cool. Now I will create a new RDD from the existing RDD, that is, val newRDD equal to a1.map, to map the data present in the RDD. I will create a new RDD from the existing RDD, so here I will take a1 as the reference, map the data, and multiply that data by two. So what should our output be if I map the data present in the RDD and multiply it by two? It would be 2, 4, 6, 8, up to 20, correct? So let's see how it works. Yes, we got the output, that is, the multiples of 1 to 10: 2, 4, 6, 8, up to 20. So this is one method of creating a new RDD from an old RDD.

And I have one more method, that is, from external file sources. So what I will do here is give val test equal to sc.textFile, and here I will give the path to the HDFS file location, linking the path, that is, hdfs://localhost:9000 is the path, and I have a folder
called example, and in that I have a file called sample. Cool, so I got one more RDD created here. Now let me show you this file that I have already kept in the HDFS directory: I will browse the file system and show you the /example directory that I have created. So here you can see the example directory that I have created, and in it I have sample as the input file that I have given; here you can see the same path location. So this is how I can create an RDD from external file sources; in this case I have used HDFS as the external file source. So this is how we can create RDDs in three different ways: from parallelized collections, from external data sources and from existing RDDs.

So let's move further and see the various operations that an RDD supports. It actually supports two main operations, namely transformations and actions. As I have already said, RDDs are immutable, so once you create an RDD you cannot change any content in the RDD. So you might be wondering how an RDD applies those transformations, correct? When you run any transformation, it runs that transformation on the old RDD and creates a new RDD; this is basically done for optimization reasons. Transformations are the operations which are applied on an RDD to create a new RDD. Now, these transformations work on the principle of lazy evaluation. So what does that mean? It means that when we call some operation on an RDD, it does not execute immediately; Spark maintains a record of the operations being called. Since transformations are lazy in nature, we can execute the operation at any time by calling an action on the data. Hence, in lazy evaluation, data is not loaded until it is necessary.
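Here is a small sketch of that laziness in the shell, using the same placeholder HDFS file as before: the transformations only build up the lineage, and nothing touches the data until the action at the end.

```scala
// Transformations: declared, but nothing is read or computed yet.
val lines = sc.textFile("hdfs://localhost:9000/example/sample")   // lazy
val words = lines.flatMap(_.split(" "))                           // lazy
val pairs = words.map(word => (word, 1))                          // lazy

// Action: only now does Spark load the file and run the whole lineage.
println(pairs.count())
```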
Now, actions analyze the RDD and produce a result. A simple action can be count, which will count the rows in the RDD and then produce a result; so I can say that transformations produce new RDDs and actions produce results.

Before moving further with the discussion, let me tell you about the three different workloads that Spark caters to: batch mode, interactive mode and streaming mode. In the case of batch mode, we run a batch job: you write a job and then schedule it; it works through a queue or batch of separate jobs without manual intervention. Then, in the case of interactive mode, there is an interactive shell where you go and execute the commands one by one: you execute one command, check the result, and then execute another command based on the output result, and so on. It works similar to the SQL shell, so the shell is the one which executes the driver program, and in shell mode you can run it on the cluster; it is generally used for development work or for ad hoc queries. Then comes the streaming mode, where the program is continuously running: as and when the data comes, it takes the data, does some transformations and actions on the data, and gets some results. So these are the three different workloads that Spark caters to.

Now let's see a real-time use case. Here I'm considering Yahoo as an example. So what are the problems of Yahoo? Yahoo properties are highly personalized to maximize relevance. The algorithms used to provide personalization, that is, the targeted advertisements and personalized content, are highly sophisticated,
and the relevance model must be updated frequently, because stories, news feeds and ads change over time. Yahoo has over 150 petabytes of data stored on a 35,000-node Hadoop cluster, which should be accessed efficiently to avoid the latency caused by data movement and to gain insights from the data in a cost-effective manner. So, to overcome these problems, Yahoo looked to Spark to improve the performance of its iterative model training. Here, the machine learning algorithm for news personalization required 15,000 lines of C++ code; on the other hand, the Spark machine learning algorithm has just 120 lines of Scala code. So that is the advantage of Spark, and this algorithm was ready for production use in just 30 minutes of training on a hundred million datasets. Spark's rich API is available in several programming languages, it has resilient in-memory storage options, and it is compatible with Hadoop through YARN and the Spark-on-YARN project. Yahoo uses Apache Spark for personalizing its news web pages and for targeted advertising; not only that, it also uses machine learning algorithms that run on Apache Spark to find out what kind of news users are interested in reading, and also for categorizing the news stories to find out what kind of users would be interested in reading each category of news. Spark runs over Hadoop YARN to use the existing data and clusters, the extensive APIs of Spark and its machine learning library ease the development of machine learning algorithms, and Spark reduces the latency of model training via in-memory RDDs. So this is how Spark has helped Yahoo improve its performance and achieve its targets. So I hope you understood the concept of Spark and its fundamentals.

Now, let me just give you an overview of the Spark architecture. Apache Spark has a well-defined layered architecture where all the components and layers are loosely coupled and integrated with various extensions and libraries. This architecture is based on two main abstractions:
the first one is the resilient distributed dataset, that is the RDD, and the next one is the directed acyclic graph, called the DAG. In order to understand the Spark architecture, you need to first know the components of Spark, that is, the Spark ecosystem, and its fundamental data structure, the RDD. So let's start by understanding the Spark ecosystem. As you can see from the diagram, the Spark ecosystem is composed of various components like Spark SQL, Spark Streaming, the machine learning library, GraphX, SparkR and the Core API component. Talking about Spark SQL, it is used to leverage the power of declarative queries and optimized storage by executing SQL queries on Spark data, which is present in RDDs and other external sources. Next, the Spark Streaming component allows developers to perform batch processing and streaming of the data in the same application. Coming to the machine learning library, it eases the development and deployment of scalable machine learning pipelines, like summary statistics, cluster analysis methods, correlations, dimensionality reduction techniques, feature extraction and many more (a small sketch of this library follows after this overview). Now, the GraphX component lets data scientists work with graph and non-graph sources to achieve flexibility and resilience in graph construction and transformation. Coming to SparkR, it is an R package that provides a lightweight front end to use Apache Spark: it provides a distributed data frame implementation that supports operations like selection, filtering and aggregation, but on large datasets, and it also supports distributed machine learning using the machine learning library. Finally, the Spark Core component is the most vital component of the Spark ecosystem, and it is responsible
for basic I/O functions, scheduling and monitoring. The entire Spark ecosystem is built on top of this core execution engine, which has extensible APIs in different languages like Scala, Python, R and Java. Now, let me tell you about the programming languages. First, Spark supports Scala: Scala is a functional programming language in which Spark is written, and Spark supports Scala as an interface. Then Spark also supports a Python interface: you can write a program in Python and execute it over Spark, and again, if you look at the code, Scala and Python are very similar. Then, coming to R, it is very famous for data analysis and machine learning, so Spark has also added support for R; and it also supports Java, so you can go ahead, write the Java code and execute it over Spark. Spark also provides you an interactive shell for Scala, Python and R, where you can go ahead and execute the commands one by one. So this is all about the Spark ecosystem.
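As promised in the overview above, here is a minimal sketch of the machine learning library's summary-statistics feature, runnable in spark-shell; the numbers are made-up illustration data, and Statistics.colStats is the RDD-based MLlib API.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Made-up observations, two features per row, just to illustrate summary statistics.
val observations = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0),
  Vectors.dense(2.0, 20.0),
  Vectors.dense(3.0, 30.0)
))

val summary = Statistics.colStats(observations)   // column-wise summary statistics
println(summary.mean)      // per-column means
println(summary.variance)  // per-column variances
```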
Next, let's discuss the fundamental data structure of Spark, that is the RDD, the resilient distributed dataset. In Spark, anything you do is around RDDs: you're reading the data in Spark, and it is read into an RDD; again, when you're transforming the data, you're performing transformations on an old RDD and creating a new one; then at the last you will perform some actions on the data and store the dataset present in the RDD to persistent storage. A resilient distributed dataset is an immutable, distributed collection of objects; your objects can be anything like strings, lines, rows, objects, collections, etc. Now, talking about the distributed environment: each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster. Due to this, you can perform transformations and actions on the complete data in parallel, and you don't have to worry about the distribution, because Spark takes care of that.

Next, as I said, RDDs are immutable, so once you create an RDD you cannot change any content in the RDD; so you might be wondering how an RDD applies those transformations, correct? When you run any transformation, it runs that transformation on the old RDD and creates a new RDD; this is basically done for optimization reasons. So let me tell you one thing here: RDDs can be cached and persisted. If you want to save an RDD for future work, you can cache it, and it will improve Spark's performance. An RDD is a fault-tolerant collection of elements that can be operated on in parallel: if an RDD is lost, it will automatically be recomputed by using the original transformations, and this is how Spark provides fault tolerance. There are two ways to create RDDs: the first one is by parallelizing an existing collection in your driver program, and the second one is by referencing a dataset in an external storage system such as a shared file system, HDFS, HBase, etc. Now, transformations are the operations that you perform on an RDD which will create a new RDD; for example, you can perform a filter on an RDD and create a new RDD. Then there are actions, which analyze the RDD and produce a result; a simple action can be count, which will count the rows in the RDD and produce a result. So I can say that transformations produce new RDDs and actions produce results. So this is all about the fundamental data structure of Spark, that is the RDD.

Now let's dive into the core topic of today's discussion, that is, the Spark architecture. So this is the Spark architecture: on your master node, you have the driver program, which drives your application. The code that you're writing behaves as the driver program, or, if you are using the interactive shell, the shell acts as the driver program.
Inside the driver program, the first thing that you do is create a SparkContext. Assume that the SparkContext is a gateway to all Spark functionality; it is similar to your database connection: any command you execute in your database goes through the database connection, and similarly, anything you do on Spark goes through the SparkContext. Now, this SparkContext works with the cluster manager to manage various jobs. The driver program and the SparkContext take care of executing the job across the cluster: a job is split into tasks, and then these tasks are distributed over the worker nodes. So any time you create an RDD in the SparkContext, that RDD can be distributed across various nodes and can be cached there; so the RDD is said to be taken, partitioned and distributed across various nodes. Now, worker nodes are the slave nodes, whose job is basically to execute the tasks. The task is executed on the partitioned RDDs in the worker nodes, and the result is returned to the SparkContext. The SparkContext takes the job, breaks the job into tasks and distributes them to the worker nodes; these tasks work on the partitioned RDDs, perform whatever operations you wanted to perform, and then collect the results and give them back to the main SparkContext. If you increase the number of workers, then you can divide the job into more partitions and execute them in parallel over multiple systems, which will actually be a lot faster; also, if you increase the number of workers, it will increase your memory, and you can cache the jobs so that they execute much faster. So this is all about the Spark architecture.
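To tie the driver, SparkContext and worker pieces together, here is a minimal standalone driver program sketch; the application name, the local[*] master and the file path are placeholders for illustration, not something taken from the video.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A minimal standalone driver program (outside spark-shell, where no `sc` is pre-created).
object WordCountApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("WordCountApp")      // placeholder application name
      .setMaster("local[*]")           // or a cluster manager URL (yarn, spark://..., mesos://...)
    val sc = new SparkContext(conf)    // the gateway to all Spark functionality

    // The job: Spark splits this work into tasks and schedules them on the executors.
    val counts = sc.textFile("hdfs://localhost:9000/example/sample")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.collect().foreach(println)  // action: results come back to the driver
    sc.stop()
  }
}
```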
Now let me give you an infographic idea about the Spark architecture. It follows a master-slave architecture. Here, the client submits the Spark user application code. When the application code is submitted, the driver implicitly converts the user code that contains transformations and actions into a logical directed acyclic graph called a DAG; at this stage it also performs optimizations such as pipelining transformations. Then it converts the logical graph called the DAG into a physical execution plan with many stages; after converting it into the physical execution plan, it creates physical execution units called tasks under each stage. These tasks are then bundled and sent to the cluster. Now the driver talks to the cluster manager and negotiates the resources, and the cluster manager launches the needed executors; at this point the driver will also send the tasks to the executors based on data placement. When the executors start, they register themselves with the driver, so that the driver has a complete view of the executors, and the executors now start executing the tasks that are assigned by the driver program. At any point of time when the application is running, the driver program will monitor the set of executors that run, and the driver node also schedules future tasks based on data placement. So this is how the internal working takes place in the Spark architecture.

There are three different types of workloads that Spark caters to. First, batch mode: in the case of batch mode we run a batch job; here you write the job and then schedule it, and it works through a queue or batch of separate jobs without manual intervention. Next, interactive mode: this is an interactive shell where you go and execute the commands one by one; you'll execute one command, check the result, and then execute the other command based on the output result, and so on. It works similar to the SQL shell, and the shell is the one which executes the driver program, so it is generally used for development work, or it is also used for ad hoc queries. Then comes the streaming mode, where the program is continuously running: as and when the data comes, it takes the data, does some transformations and actions on the data, and then produces output results. So these are the three different types of workloads that Spark actually caters to.
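Since the streaming workload is only described in words here, this is a minimal Spark Streaming sketch, assuming a hypothetical text source on a local socket (localhost:9999) and 5-second micro-batches.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Build a streaming context on top of the existing SparkContext, batching every 5 seconds.
val ssc = new StreamingContext(sc, Seconds(5))

// Hypothetical live source: lines of text arriving on a local socket.
val lines = ssc.socketTextStream("localhost", 9999)

// The same transformation/action style, applied continuously to each micro-batch.
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()             // start receiving and processing data
ssc.awaitTermination()  // keep the program running
```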
706 00:30:48,200 --> 00:30:50,833 So it is generally used for development work 707 00:30:50,833 --> 00:30:53,100 or it is also used for ad hoc queries, 708 00:30:53,100 --> 00:30:54,670 then comes the streaming mode 709 00:30:54,670 --> 00:30:57,200 where the program is continuously running as 710 00:30:57,200 --> 00:30:59,400 and when the data comes it takes a data 711 00:30:59,500 --> 00:31:02,000 and do some Transformations and actions on the data 712 00:31:02,300 --> 00:31:04,200 and then produce output results. 713 00:31:04,400 --> 00:31:06,900 So these are the three different types of workloads 714 00:31:06,900 --> 00:31:09,000 that spark actually caters now, 715 00:31:09,000 --> 00:31:11,866 let's move ahead and see a simple demo here. 716 00:31:11,866 --> 00:31:14,600 Let's understand how to create a spark up. 717 00:31:14,600 --> 00:31:17,000 Location in spark shell using Scala. 718 00:31:17,000 --> 00:31:18,266 So let's understand 719 00:31:18,266 --> 00:31:21,400 how to create a spark application in spark shell 720 00:31:21,400 --> 00:31:22,700 using Scala assume 721 00:31:22,700 --> 00:31:25,700 that we have a text file in the hdfs directory 722 00:31:25,700 --> 00:31:28,900 and we are counting the number of words in that text file. 723 00:31:28,900 --> 00:31:30,421 So, let's see how to do it. 724 00:31:30,421 --> 00:31:32,900 So before I start running, let me first check 725 00:31:32,900 --> 00:31:34,900 whether all my demons are running or not. 726 00:31:35,200 --> 00:31:37,100 So I'll type sudo JPS 727 00:31:37,200 --> 00:31:40,600 so all my spark demons and Hadoop elements are running 728 00:31:40,600 --> 00:31:44,353 that I have master/worker as Park demon son named notice. 729 00:31:44,353 --> 00:31:47,400 Manager non-manager everything as Hadoop team it. 730 00:31:47,400 --> 00:31:48,749 So the first thing 731 00:31:48,749 --> 00:31:51,600 that I do here is I run the spark shell 732 00:31:51,700 --> 00:31:54,700 so it takes bit time to start in the meanwhile. 733 00:31:54,700 --> 00:31:56,700 Let me tell you the web UI port 734 00:31:56,700 --> 00:31:59,623 for spark shell is localhost for 0 4 0. 735 00:32:00,300 --> 00:32:02,900 So this is a web UI first Park like 736 00:32:02,900 --> 00:32:06,400 if you click on jobs right now, we have not executed anything. 737 00:32:06,400 --> 00:32:08,861 So there is no details over here. 738 00:32:09,400 --> 00:32:11,900 So there you have job stages. 739 00:32:12,100 --> 00:32:14,200 So once you execute the chops 740 00:32:14,200 --> 00:32:16,300 If you'll be having the records of the tasks 741 00:32:16,300 --> 00:32:17,700 that you have executed here. 742 00:32:17,700 --> 00:32:20,400 So here you can see the stages of various jobs 743 00:32:20,400 --> 00:32:21,706 and tasks executed. 744 00:32:21,706 --> 00:32:22,943 So now let's check 745 00:32:22,943 --> 00:32:25,900 whether our spark shall have started or not. 746 00:32:25,900 --> 00:32:26,500 Yes. 747 00:32:26,500 --> 00:32:30,074 So you have your spark version as two point one point one 748 00:32:30,074 --> 00:32:32,500 and you have a scholar shell over here. 749 00:32:32,600 --> 00:32:34,300 So before I start the code, 750 00:32:34,300 --> 00:32:36,300 let's check the content that is present 751 00:32:36,300 --> 00:32:38,600 in the input text file by running this command. 
752 00:32:38,933 --> 00:32:39,933 So I'll write
753 00:32:39,933 --> 00:32:44,000 var test = sc.textFile,
754 00:32:44,000 --> 00:32:46,700 because I have saved a text file over there,
755 00:32:46,700 --> 00:32:49,300 and I'll give the HDFS path location.
756 00:32:50,000 --> 00:32:52,900 I've stored my text file in this location,
757 00:32:53,300 --> 00:32:55,600 and sample is the name of the text file.
758 00:32:55,600 --> 00:32:58,400 So now let me give test.collect
759 00:32:58,400 --> 00:32:59,834 so that it collects the data
760 00:32:59,834 --> 00:33:02,600 and displays the data that is present in the text file.
761 00:33:02,600 --> 00:33:04,500 So in my text file,
762 00:33:04,500 --> 00:33:08,500 I have Hadoop, research, analyst, data, science and science.
763 00:33:08,500 --> 00:33:10,500 So this is my input data.
764 00:33:10,500 --> 00:33:12,200 So now let me map
765 00:33:12,200 --> 00:33:15,600 the functions and apply the transformations and actions.
766 00:33:15,600 --> 00:33:20,000 So I'll give var map = sc.textFile
767 00:33:20,000 --> 00:33:22,600 and I will specify
768 00:33:22,600 --> 00:33:28,800 my path location. So this is my input path location,
769 00:33:29,073 --> 00:33:32,226 and I'll apply the flatMap transformation
770 00:33:32,457 --> 00:33:33,842 to split the data
771 00:33:36,100 --> 00:33:38,100 that is separated by spaces,
772 00:33:38,900 --> 00:33:44,330 and then map the word count to be given as (word, 1). Now
773 00:33:44,330 --> 00:33:46,100 this would be executed.
774 00:33:46,100 --> 00:33:46,600 Yes.
775 00:33:47,100 --> 00:33:49,000 Now, let me apply the action
776 00:33:49,000 --> 00:33:52,000 for this to start the execution of the task.
777 00:33:52,900 --> 00:33:56,100 So let me tell you one thing here: before applying an action,
778 00:33:56,100 --> 00:33:58,600 Spark will not start the execution process.
779 00:33:58,600 --> 00:34:00,600 So here I have applied reduceByKey
780 00:34:00,600 --> 00:34:02,800 to start counting the number
781 00:34:02,800 --> 00:34:04,100 of words in the text file.
782 00:34:04,500 --> 00:34:07,100 So now we are done with applying transformations
783 00:34:07,100 --> 00:34:08,300 and actions as well.
784 00:34:08,300 --> 00:34:09,774 So now the next step is
785 00:34:09,774 --> 00:34:13,300 to specify the output location to store the output file.
786 00:34:13,300 --> 00:34:16,400 So I will give counts.saveAsTextFile
787 00:34:16,400 --> 00:34:19,500 and then specify the location for my output file.
788 00:34:19,500 --> 00:34:21,398 I'll store it in the same location
789 00:34:21,398 --> 00:34:23,000 where I have my input file.
790 00:34:23,700 --> 00:34:28,400 Now I'll specify my output file name as output9. Cool.
791 00:34:29,000 --> 00:34:31,200 I forgot to give the double quotes.
792 00:34:31,800 --> 00:34:33,200 And I will run this.
793 00:34:36,603 --> 00:34:38,296 So it's completed now.
794 00:34:38,473 --> 00:34:40,626 So now let's see the output.
795 00:34:41,000 --> 00:34:42,900 I will open my Hadoop web UI
796 00:34:42,900 --> 00:34:45,750 by giving localhost:50070
797 00:34:45,750 --> 00:34:48,600 and browse the file system to check the output.
798 00:34:48,900 --> 00:34:50,284 So as I have said,
799 00:34:50,284 --> 00:34:54,000 I have "example" as the directory that I have created,
800 00:34:54,000 --> 00:34:57,600 and in that I have specified output9 as my output.
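For reference, the dictated commands amount to something like the following minimal word-count sketch (the HDFS paths here are placeholders for wherever your input actually lives):

  // Inspect the input file first
  var test = sc.textFile("hdfs://localhost:9000/example/sample")   // placeholder path
  test.collect()

  // Build the word-count pipeline: flatMap -> map -> reduceByKey
  var counts = sc.textFile("hdfs://localhost:9000/example/sample")
    .flatMap(line => line.split(" "))    // split each line on spaces
    .map(word => (word, 1))              // pair every word with a count of 1
    .reduceByKey(_ + _)                  // sum the counts per word

  // The action below actually triggers the job and writes the part files
  counts.saveAsTextFile("hdfs://localhost:9000/example/output9")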
801 00:34:57,600 --> 00:35:00,300 So I have the two part files that have been created.
802 00:35:00,300 --> 00:35:02,600 Let's check each of them one by one.
803 00:35:04,800 --> 00:35:06,512 So we have the data count
804 00:35:06,512 --> 00:35:09,116 as one, analyst count as one and science
805 00:35:09,116 --> 00:35:12,200 count as two. So this is the first part file. Now
806 00:35:12,200 --> 00:35:14,200 let me open the second part file for you.
807 00:35:18,500 --> 00:35:20,800 So this is the second part file: there you
808 00:35:20,800 --> 00:35:23,800 have Hadoop count as one and the research count as one.
809 00:35:24,500 --> 00:35:26,558 So now let me show you the text file
810 00:35:26,558 --> 00:35:28,600 that we have specified as the input.
811 00:35:30,200 --> 00:35:31,363 So as I have told
812 00:35:31,363 --> 00:35:34,076 you, Hadoop count is one, research count is
813 00:35:34,076 --> 00:35:37,400 one, analyst one, data one, science and science as 1 and 1. So
814 00:35:37,400 --> 00:35:39,600 you might be thinking data science is one word;
815 00:35:39,600 --> 00:35:40,969 no, in the program code
816 00:35:40,969 --> 00:35:44,600 we have asked to count the words that are separated by a space.
817 00:35:44,600 --> 00:35:47,600 So that is why we have the science count as two.
818 00:35:47,600 --> 00:35:51,100 I hope you got an idea about how word count works.
819 00:35:51,515 --> 00:35:54,900 Similarly, I will now parallelize 1 to 100 numbers
820 00:35:54,900 --> 00:35:56,200 and divide the tasks
821 00:35:56,200 --> 00:36:00,100 into five partitions to show you what partitioning of tasks is.
822 00:36:00,100 --> 00:36:04,400 So I will write sc.parallelize on 1 to 100 numbers
823 00:36:04,403 --> 00:36:07,096 and divide them into five partitions
824 00:36:07,115 --> 00:36:10,900 and apply the collect action to collect the numbers
825 00:36:10,900 --> 00:36:12,700 and start the execution.
826 00:36:12,784 --> 00:36:16,015 So it displays an array of 100 numbers.
827 00:36:16,300 --> 00:36:20,900 Now, let me explain the job stages, partitions, event timeline,
828 00:36:20,900 --> 00:36:23,100 DAG representation and everything.
829 00:36:23,100 --> 00:36:26,023 So now let me go to the web UI of Spark
830 00:36:26,023 --> 00:36:27,437 and click on jobs.
831 00:36:27,601 --> 00:36:29,294 So these are the tasks
832 00:36:29,294 --> 00:36:33,217 that I have submitted. So coming to the word count example,
833 00:36:33,700 --> 00:36:36,300 this is the DAG visualization.
834 00:36:36,300 --> 00:36:38,700 I hope you can see it clearly: first
835 00:36:38,700 --> 00:36:40,401 you collected the text file,
836 00:36:40,401 --> 00:36:42,709 then you applied the flatMap transformation
837 00:36:42,709 --> 00:36:45,139 and mapped it to count the number of words,
838 00:36:45,139 --> 00:36:47,333 then applied reduceByKey
839 00:36:47,333 --> 00:36:49,100 and then saved the output file
840 00:36:49,100 --> 00:36:50,500 with saveAsTextFile.
841 00:36:50,500 --> 00:36:52,900 So this is the entire DAG visualization
842 00:36:52,900 --> 00:36:54,000 of the number of steps
843 00:36:54,000 --> 00:36:56,000 that we have covered in our program.
844 00:36:56,000 --> 00:36:58,271 So here it shows the completed stages,
845 00:36:58,271 --> 00:37:01,900 that is two stages, and it also shows the duration,
846 00:37:01,900 --> 00:37:03,284 that is 2 seconds.
847 00:37:03,400 --> 00:37:05,800 And if you click on the event timeline,
848 00:37:05,800 --> 00:37:08,482 it just shows the executor that is added.
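The parallelize-with-partitions step mentioned above looks roughly like this in the shell (a small sketch, not the exact session):

  val nums = sc.parallelize(1 to 100, 5)   // explicitly ask for five partitions
  nums.getNumPartitions                    // should report 5
  nums.collect()                           // action: returns Array(1, 2, ..., 100)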
849 00:37:08,482 --> 00:37:11,500 And in this case you cannot see any partitions
850 00:37:11,500 --> 00:37:15,300 because you have not split the jobs into various partitions.
851 00:37:15,500 --> 00:37:19,200 So this is how you can see the event timeline and the DAG
852 00:37:19,200 --> 00:37:21,700 visualization. Here you can also see
853 00:37:21,700 --> 00:37:24,759 the stage IDs and descriptions, when you have submitted
854 00:37:24,759 --> 00:37:26,800 (I have just submitted it now),
855 00:37:26,800 --> 00:37:29,294 and in this it also shows the duration
856 00:37:29,294 --> 00:37:32,800 that it took to execute the task, and the output bytes
857 00:37:32,800 --> 00:37:35,500 that it took, the shuffle read, shuffle write
858 00:37:35,500 --> 00:37:39,100 and many more. Now to show you the partitions: see,
859 00:37:39,100 --> 00:37:42,500 in this you just applied sc.parallelize, right?
860 00:37:42,500 --> 00:37:45,151 So it is just showing one stage where you
861 00:37:45,151 --> 00:37:48,400 have applied the parallelize operation. Here
862 00:37:48,400 --> 00:37:51,300 it shows the succeeded tasks as five by five.
863 00:37:51,300 --> 00:37:54,700 That is, you have divided the job into five tasks
864 00:37:54,700 --> 00:37:58,762 and all the five tasks have been executed successfully. Now here
865 00:37:58,762 --> 00:38:02,300 you can see the partitions of the five different tasks
866 00:38:02,300 --> 00:38:04,112 that are executed in parallel.
867 00:38:04,112 --> 00:38:05,800 So depending on the colors,
868 00:38:05,800 --> 00:38:07,500 it shows the scheduler delay,
869 00:38:07,500 --> 00:38:10,500 the shuffle read time, executor computing time, result
870 00:38:10,500 --> 00:38:11,500 serialization time
871 00:38:11,500 --> 00:38:13,921 and getting result time and many more.
872 00:38:13,921 --> 00:38:15,836 So you can see that the duration
873 00:38:15,836 --> 00:38:19,252 that it took to execute the five tasks in parallel
874 00:38:19,252 --> 00:38:21,263 at the same time is a maximum
875 00:38:21,263 --> 00:38:22,700 of one millisecond.
876 00:38:22,700 --> 00:38:26,200 So in memory, Spark has much faster computation,
877 00:38:26,200 --> 00:38:27,810 and you can see the IDs
878 00:38:27,810 --> 00:38:31,100 of all the five different tasks; all are a success.
879 00:38:31,100 --> 00:38:33,166 You can see the locality level.
880 00:38:33,166 --> 00:38:37,033 You can see the executor and the host IP, the ID, the launch time,
881 00:38:37,033 --> 00:38:39,100 the duration it takes, everything.
882 00:38:39,200 --> 00:38:40,631 So you can also see
883 00:38:40,631 --> 00:38:44,978 that we have created our RDD and parallelized it. Similarly here
884 00:38:44,978 --> 00:38:47,000 also, for the word count example,
885 00:38:47,000 --> 00:38:48,306 you can see the RDD
886 00:38:48,306 --> 00:38:51,324 that has been created and also the actions
887 00:38:51,324 --> 00:38:53,800 that we have applied to execute the task,
888 00:38:54,000 --> 00:38:57,401 and you can see the duration that it took. Even here also,
889 00:38:57,401 --> 00:38:58,980 it's just one millisecond
890 00:38:58,980 --> 00:39:02,200 that it took to execute the entire word count example,
891 00:39:02,200 --> 00:39:05,900 and you can see the IDs, locality level, executor ID.
892 00:39:05,900 --> 00:39:06,916 So in this case,
893 00:39:06,916 --> 00:39:09,712 we have just executed the task in two stages,
894 00:39:09,712 --> 00:39:11,900 so it is just showing the two stages.
895 00:39:11,900 --> 00:39:13,100 So this is all about
896 00:39:13,100 --> 00:39:16,266 how the web UI looks and what are the features and information
897 00:39:16,266 --> 00:39:18,435 that you can see in the web UI of Spark
898 00:39:18,435 --> 00:39:21,200 after executing the program in the Scala shell.
899 00:39:21,200 --> 00:39:22,271 So in this program,
900 00:39:22,271 --> 00:39:25,635 you can see that we first gave the path to the input location
901 00:39:25,635 --> 00:39:26,700 and checked the data
902 00:39:26,700 --> 00:39:29,063 that is present in the input file,
903 00:39:29,063 --> 00:39:31,900 and then we applied flatMap transformations
904 00:39:31,900 --> 00:39:33,100 and created an RDD
905 00:39:33,100 --> 00:39:36,800 and then applied an action to start the execution of the task
906 00:39:36,800 --> 00:39:39,500 and saved the output file in this location.
907 00:39:39,500 --> 00:39:41,643 So I hope you got a clear idea
908 00:39:41,643 --> 00:39:45,054 of how to execute a word count example and check
909 00:39:45,054 --> 00:39:46,861 for the various features
910 00:39:46,861 --> 00:39:50,700 in the Spark web UI like partitions and DAG visualizations,
911 00:39:50,700 --> 00:39:59,900 and I hope you found the session interesting. Apache Spark:
912 00:40:00,000 --> 00:40:03,900 this word can generate a spark in every Hadoop engineer's mind.
913 00:40:03,900 --> 00:40:06,188 It is a big data processing framework,
914 00:40:06,188 --> 00:40:08,805 which is lightning fast in cluster computing,
915 00:40:08,805 --> 00:40:12,300 and the core reason behind its outstanding performance is
916 00:40:12,300 --> 00:40:15,500 the resilient distributed data set, or in short,
917 00:40:15,500 --> 00:40:17,779 the RDD, and today I'll focus
918 00:40:17,779 --> 00:40:20,200 on the topic called RDD using Spark.
919 00:40:20,200 --> 00:40:21,723 Before we get started,
920 00:40:21,723 --> 00:40:23,900 let's have a quick look at the agenda
921 00:40:23,900 --> 00:40:24,900 for today's session.
922 00:40:25,100 --> 00:40:28,213 We shall start with understanding the need for RDDs,
923 00:40:28,213 --> 00:40:29,272 where we'll learn
924 00:40:29,272 --> 00:40:32,200 the reasons behind which the RDDs were required.
925 00:40:32,200 --> 00:40:34,700 Then we shall learn what RDDs are,
926 00:40:34,700 --> 00:40:37,871 where we'll understand what exactly an RDD is
927 00:40:37,871 --> 00:40:39,800 and how they work. Later,
928 00:40:39,800 --> 00:40:42,400 I'll walk you through the fascinating features
929 00:40:42,400 --> 00:40:46,300 of RDDs such as in-memory computation, partitioning,
930 00:40:46,374 --> 00:40:48,475 persistence, fault tolerance
931 00:40:48,475 --> 00:40:49,475 and many more.
932 00:40:49,600 --> 00:40:51,200 Once I finish the theory,
933 00:40:51,300 --> 00:40:53,200 I'll get your hands on RDDs,
934 00:40:53,200 --> 00:40:55,100 where we'll practically create
935 00:40:55,100 --> 00:40:58,141 and perform all possible operations on RDDs,
936 00:40:58,141 --> 00:40:59,500 and finally I'll wind
937 00:40:59,500 --> 00:41:02,677 up this session with an interesting Pokemon use case,
938 00:41:02,677 --> 00:41:06,100 which will help you understand RDDs in a much better way.
939 00:41:06,100 --> 00:41:08,100 Let's get started. Spark is one
940 00:41:08,100 --> 00:41:10,792 of the top mandatory skills required by each
941 00:41:10,792 --> 00:41:12,518 and every big data developer.
942 00:41:12,518 --> 00:41:14,687 It is used in multiple applications
943 00:41:14,687 --> 00:41:17,800 which need real-time processing, such as Google's
944 00:41:17,800 --> 00:41:21,066 recommendation engine, credit card fraud detection
945 00:41:21,066 --> 00:41:23,713 and many more. To understand this in depth,
946 00:41:23,713 --> 00:41:27,200 we shall consider Amazon's recommendation engine. Assume
947 00:41:27,200 --> 00:41:29,500 that you are searching for a mobile phone
948 00:41:29,500 --> 00:41:33,126 on Amazon and you have certain specifications of your choice.
949 00:41:33,126 --> 00:41:36,742 Then the Amazon search engine understands your requirements
950 00:41:36,742 --> 00:41:38,450 and provides you the products
951 00:41:38,450 --> 00:41:41,155 which match the specifications of your choice.
952 00:41:41,155 --> 00:41:43,800 All this is made possible because of the most
953 00:41:43,800 --> 00:41:46,717 powerful tool existing in the big data environment,
954 00:41:46,717 --> 00:41:49,000 which is none other than Apache Spark,
955 00:41:49,000 --> 00:41:51,000 and the resilient distributed data set
956 00:41:51,000 --> 00:41:53,946 is considered to be the heart of Apache Spark.
957 00:41:53,946 --> 00:41:56,735 So with this let's begin our first question:
958 00:41:56,735 --> 00:41:58,300 why do we need RDDs?
959 00:41:58,300 --> 00:42:01,410 Well, the current world is expanding in technology,
960 00:42:01,410 --> 00:42:02,903 and artificial intelligence
961 00:42:02,903 --> 00:42:06,891 is the face of this evolution. The machine learning algorithms
962 00:42:06,891 --> 00:42:09,300 and the data needed to train these computers
963 00:42:09,300 --> 00:42:10,453 are huge; the logic
964 00:42:10,453 --> 00:42:13,378 behind all these algorithms is very complicated
965 00:42:13,378 --> 00:42:17,300 and mostly runs in a distributed and iterative computation method.
966 00:42:17,300 --> 00:42:19,800 The machine learning algorithms could not use
967 00:42:19,800 --> 00:42:21,053 the older MapReduce
968 00:42:21,053 --> 00:42:24,500 programs, because the traditional MapReduce programs needed
969 00:42:24,500 --> 00:42:26,733 a stable state in HDFS, and we know
970 00:42:26,733 --> 00:42:31,200 that HDFS generates redundancy during intermediate computations,
971 00:42:31,200 --> 00:42:34,800 which resulted in a major latency in data processing,
972 00:42:34,800 --> 00:42:36,900 and in HDFS, gathering data
973 00:42:36,900 --> 00:42:39,400 for multiple processing units at a single instance
974 00:42:39,400 --> 00:42:42,752 is very time consuming. Along with this, the major issue
975 00:42:42,752 --> 00:42:46,600 was that HDFS did not have random read and write ability.
976 00:42:46,600 --> 00:42:49,000 So using these old MapReduce programs
977 00:42:49,000 --> 00:42:52,000 for machine learning problems would be difficult.
978 00:42:52,000 --> 00:42:53,700 Then Spark was introduced.
979 00:42:53,700 --> 00:42:55,318 Compared to MapReduce, Spark
980 00:42:55,318 --> 00:42:58,435 is an advanced big data processing framework. The resilient
981 00:42:58,435 --> 00:42:59,503 distributed data set,
982 00:42:59,503 --> 00:43:02,423 which is a fundamental and most crucial data structure
983 00:43:02,423 --> 00:43:03,600 of Spark, was the one
984 00:43:03,600 --> 00:43:06,900 which made it all possible. RDDs are effortless to create,
985 00:43:06,900 --> 00:43:09,205 and the mind-blowing property which solved
986 00:43:09,205 --> 00:43:12,500 the problem was its in-memory data processing capability.
987 00:43:12,500 --> 00:43:15,600 An RDD is not a distributed file system; instead,
988 00:43:15,600 --> 00:43:17,894 it is a distributed collection of memory,
989 00:43:17,894 --> 00:43:19,905 where the data needed is always stored
990 00:43:19,905 --> 00:43:21,057 and kept available
991 00:43:21,057 --> 00:43:24,269 in RAM, and because of this property the elevation it
992 00:43:24,269 --> 00:43:27,300 gave to the memory accessing speed was unbelievable.
993 00:43:27,300 --> 00:43:29,250 The RDDs are fault tolerant,
994 00:43:29,250 --> 00:43:32,900 and this property brought it a dignity of a whole new level.
995 00:43:32,900 --> 00:43:35,074 So our next question would be:
996 00:43:35,074 --> 00:43:38,522 what are RDDs? The resilient distributed data sets,
997 00:43:38,522 --> 00:43:39,600 or the RDDs, are
998 00:43:39,600 --> 00:43:42,600 the primary underlying data structures of Spark.
999 00:43:42,600 --> 00:43:44,311 They are highly fault tolerant
1000 00:43:44,311 --> 00:43:46,900 and they store data amongst multiple computers
1001 00:43:46,900 --> 00:43:51,000 in a network. The data is written into multiple executor nodes,
1002 00:43:51,000 --> 00:43:54,800 so that in case of a calamity, if any executing node fails,
1003 00:43:54,800 --> 00:43:57,459 then within a fraction of a second it gets backed up
1004 00:43:57,459 --> 00:43:59,100 from the next executor node
1005 00:43:59,100 --> 00:44:02,200 with the same processing speed as the current node.
1006 00:44:02,300 --> 00:44:04,900 The fault tolerance property enables them to roll back
1007 00:44:04,900 --> 00:44:06,876 their data to the original state
1008 00:44:06,876 --> 00:44:09,038 by applying simple transformations on
1009 00:44:09,038 --> 00:44:11,225 to the lost part in the lineage.
1010 00:44:11,225 --> 00:44:13,696 RDDs do not need anything called a hard disk
1011 00:44:13,696 --> 00:44:15,489 or any other secondary storage;
1012 00:44:15,489 --> 00:44:17,700 all that they need is the main memory,
1013 00:44:17,700 --> 00:44:18,700 which is RAM. Now
1014 00:44:18,700 --> 00:44:21,100 that we have understood the need for RDDs
1015 00:44:21,100 --> 00:44:22,482 and what exactly
1016 00:44:22,482 --> 00:44:25,204 an RDD is, let us see the different sources
1017 00:44:25,204 --> 00:44:28,223 from which the data can be ingested into an RDD.
1018 00:44:28,223 --> 00:44:30,600 The data can be loaded from any source
1019 00:44:30,600 --> 00:44:33,700 like HDFS, HBase, Hive, SQL;
1020 00:44:33,700 --> 00:44:34,658 you name it,
1021 00:44:34,658 --> 00:44:35,582 they've got it.
1022 00:44:35,700 --> 00:44:36,200 Hence
1023 00:44:36,200 --> 00:44:39,000 the collected data is dropped into an RDD.
1024 00:44:39,000 --> 00:44:42,000 And guess what, the RDDs are free-spirited: they
1025 00:44:42,000 --> 00:44:44,051 can process any type of data.
1026 00:44:44,051 --> 00:44:47,800 They won't care if the data is structured, unstructured
1027 00:44:47,800 --> 00:44:49,500 or semi-structured. Now,
1028 00:44:49,500 --> 00:44:51,200 let me walk you through the features
1029 00:44:51,200 --> 00:44:52,300 of RDDs,
1030 00:44:52,300 --> 00:44:54,700 which give it an edge over the other alternatives.
1031 00:44:54,900 --> 00:44:57,100 In-memory computation: the idea
1032 00:44:57,100 --> 00:45:00,632 of in-memory computation brought the groundbreaking progress
1033 00:45:00,632 --> 00:45:03,800 in cluster computing; it increased the processing speed
1034 00:45:03,800 --> 00:45:07,877 when compared with HDFS. Moving on to lazy evaluation:
1035 00:45:07,877 --> 00:45:08,827 the phrase lazy
1036 00:45:08,827 --> 00:45:09,527 explains it
1037 00:45:09,527 --> 00:45:12,564 all. Spark logs all the transformations you apply
1038 00:45:12,564 --> 00:45:16,056 onto it and will not throw any output onto the display
1039 00:45:16,056 --> 00:45:17,900 until an action is provoked.
1040 00:45:17,900 --> 00:45:22,200 Next is fault tolerance: RDDs are absolutely fault tolerant.
1041 00:45:22,200 --> 00:45:26,008 Any lost partition of an RDD can be rolled back by applying
1042 00:45:26,008 --> 00:45:28,700 simple transformations onto the lost part
1043 00:45:28,700 --> 00:45:30,286 in the lineage. Speaking
1044 00:45:30,286 --> 00:45:34,700 about immutability, the data once dropped into an RDD is immutable,
1045 00:45:34,700 --> 00:45:38,016 because the access provided by an RDD is just read-
1046 00:45:38,016 --> 00:45:39,920 only. The only way to access
1047 00:45:39,920 --> 00:45:43,800 or modify it is by applying a transformation onto an RDD
1048 00:45:43,800 --> 00:45:45,400 which is prior to the present one.
1049 00:45:45,400 --> 00:45:47,200 Discussing partitioning:
1050 00:45:47,200 --> 00:45:48,923 the important reason for Spark's
1051 00:45:48,923 --> 00:45:51,100 parallel processing is its partitioning.
1052 00:45:51,300 --> 00:45:54,163 By default, Spark determines the number of partitions
1053 00:45:54,163 --> 00:45:56,200 into which your data is divided,
1054 00:45:56,200 --> 00:45:59,652 but you can override this and decide the number of blocks
1055 00:45:59,652 --> 00:46:01,200 you want to split your data into.
1056 00:46:01,200 --> 00:46:03,193 Let's see what persistence is:
1057 00:46:03,193 --> 00:46:05,600 Spark's RDDs are totally reusable.
1058 00:46:05,600 --> 00:46:06,757 The users can apply
1059 00:46:06,757 --> 00:46:09,502 a certain number of transformations onto an RDD
1060 00:46:09,502 --> 00:46:11,302 and preserve the final RDD
1061 00:46:11,302 --> 00:46:14,383 for future use; this avoids all the hectic process
1062 00:46:14,383 --> 00:46:17,369 of applying all the transformations from scratch.
1063 00:46:17,369 --> 00:46:20,867 And now, last but not the least, coarse-grained operations:
1064 00:46:20,867 --> 00:46:24,300 the operations performed on RDDs using transformations
1065 00:46:24,300 --> 00:46:28,069 like map, filter, flatMap etc. change the RDDs
1066 00:46:28,069 --> 00:46:29,300 and update them.
1067 00:46:29,300 --> 00:46:29,686 Hence
1068 00:46:29,686 --> 00:46:33,100 every operation applied onto an RDD is coarse-grained.
1069 00:46:33,100 --> 00:46:36,800 These are the features of RDDs. Moving on to the next stage,
1070 00:46:36,800 --> 00:46:37,800 we shall understand
1071 00:46:37,800 --> 00:46:39,700 the creation of RDDs.
1072 00:46:39,700 --> 00:46:42,500 RDDs can be created using three methods.
1073 00:46:42,500 --> 00:46:46,000 The first method is using parallelized collections.
1074 00:46:46,000 --> 00:46:50,400 The next method is by using external storage like HDFS, HBase,
1075 00:46:50,400 --> 00:46:51,100 Hive
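As a small illustration of lazy evaluation, partitioning and persistence together, here is a hedged spark-shell sketch (again assuming the shell's SparkContext sc; the storage level shown is just one reasonable choice):

  import org.apache.spark.storage.StorageLevel

  val base  = sc.parallelize(1 to 1000, 4)      // override the default partition count
  val evens = base.filter(_ % 2 == 0)           // lazy: only the lineage is recorded
  evens.persist(StorageLevel.MEMORY_ONLY)       // keep the computed partitions for reuse
  evens.count()                                 // first action materialises and caches the RDD
  evens.take(5)                                 // subsequent actions reuse the cached data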
1076 00:46:51,100 --> 00:46:54,700 and many more. The third one is using an existing RDD
1077 00:46:54,700 --> 00:46:56,800 which is prior to the present one.
1078 00:46:56,800 --> 00:46:58,800 Now, let us understand
1079 00:46:58,800 --> 00:47:02,300 and create an RDD through each method. Now,
1080 00:47:02,300 --> 00:47:05,600 Spark can be run on virtual machines like the Spark VM,
1081 00:47:05,600 --> 00:47:08,300 or you can install a Linux operating system
1082 00:47:08,300 --> 00:47:10,774 like Ubuntu and run it standalone,
1083 00:47:10,774 --> 00:47:14,600 but we here at Edureka use the best-in-class CloudLab,
1084 00:47:14,600 --> 00:47:16,900 which comprises all the frameworks
1085 00:47:16,900 --> 00:47:19,400 you need in a single-stop cloud framework.
1086 00:47:19,400 --> 00:47:20,776 No need of any hectic
1087 00:47:20,776 --> 00:47:22,323 process of downloading any file
1088 00:47:22,323 --> 00:47:24,632 or setting up environment variables
1089 00:47:24,632 --> 00:47:27,289 and looking for hardware specifications, etc.
1090 00:47:27,289 --> 00:47:28,890 All you need is a login ID
1091 00:47:28,890 --> 00:47:32,091 and password to the all-in-one, ready-to-use CloudLab
1092 00:47:32,091 --> 00:47:34,800 where you can run and save all your programs.
1093 00:47:35,400 --> 00:47:39,600 Let us fire up our Spark shell using the command spark2-
1094 00:47:39,600 --> 00:47:42,446 shell. Now that the Spark shell has been fired up,
1095 00:47:42,446 --> 00:47:44,215 let's create a new RDD.
1096 00:47:44,800 --> 00:47:48,400 So here we are creating a new RDD with the first method,
1097 00:47:48,400 --> 00:47:51,500 which is using the parallelized collections. Here
1098 00:47:51,500 --> 00:47:52,954 we are creating a new RDD
1099 00:47:52,954 --> 00:47:55,800 by the name parallelizedCollectionsRDD.
1100 00:47:55,800 --> 00:47:57,705 We are starting a Spark context
1101 00:47:57,705 --> 00:48:00,321 and we are parallelizing an array into the RDD
1102 00:48:00,321 --> 00:48:03,300 which consists of the data of the days of a week,
1103 00:48:03,300 --> 00:48:04,875 which is Monday, Tuesday,
1104 00:48:04,875 --> 00:48:07,500 Wednesday, Thursday, Friday and Saturday.
1105 00:48:07,500 --> 00:48:10,600 Now, let's create this. Our new RDD
1106 00:48:10,600 --> 00:48:13,841 parallelizedCollectionsRDD is successfully created. Now
1107 00:48:13,841 --> 00:48:16,900 let's display the data which is present in our RDD.
1108 00:48:19,400 --> 00:48:23,630 So this was the data which is present in our RDD. Now
1109 00:48:23,630 --> 00:48:27,038 let's create a new RDD using the second method.
1110 00:48:28,200 --> 00:48:30,892 The second method of creating an RDD
1111 00:48:30,892 --> 00:48:35,400 was using an external storage such as HDFS, Hive, SQL
1112 00:48:35,600 --> 00:48:37,100 and many more. Here
1113 00:48:37,100 --> 00:48:40,200 I'm creating a new RDD by the name sparkfile,
1114 00:48:40,200 --> 00:48:43,312 where I'll be loading a text document into the RDD
1115 00:48:43,312 --> 00:48:44,900 from an external storage,
1116 00:48:44,900 --> 00:48:45,900 which is HDFS,
1117 00:48:45,900 --> 00:48:49,700 and this is the location where my text file is located.
1118 00:48:49,800 --> 00:48:53,600 So the new RDD sparkfile is successfully created. Now
1119 00:48:53,600 --> 00:48:55,054 let's display the data
1120 00:48:55,054 --> 00:48:57,500 which is present in our sparkfile RDD.
1121 00:48:58,700 --> 00:48:59,620 This is the data
1122 00:48:59,620 --> 00:49:02,241 which is present in our sparkfile RDD; it is
1123 00:49:02,241 --> 00:49:05,500 a collection of alphabets starting from A to Z.
1124 00:49:05,500 --> 00:49:05,900 Now
1125 00:49:05,900 --> 00:49:08,851 let's create a new RDD using the third method,
1126 00:49:08,851 --> 00:49:10,946 which is using an existing RDD
1127 00:49:10,946 --> 00:49:14,201 which is prior to the present one. In the third method,
1128 00:49:14,201 --> 00:49:16,900 I'm creating a new RDD by the name words and
1129 00:49:16,900 --> 00:49:18,700 I'm creating a Spark context
1130 00:49:18,700 --> 00:49:21,803 and parallelizing a statement into the RDD words,
1131 00:49:21,803 --> 00:49:24,700 which is "Spark is a very powerful language".
1132 00:49:24,800 --> 00:49:26,517 So this is a collection of words
1133 00:49:26,517 --> 00:49:28,400 which I have passed into the new
1134 00:49:28,400 --> 00:49:29,400 RDD words.
1135 00:49:29,400 --> 00:49:29,900 Now
1136 00:49:29,900 --> 00:49:31,700 let us apply a transformation
1137 00:49:31,700 --> 00:49:34,800 onto the RDD and create a new RDD through that.
1138 00:49:35,100 --> 00:49:37,656 So here I'm applying the map transformation
1139 00:49:37,656 --> 00:49:39,140 onto the previous RDD,
1140 00:49:39,140 --> 00:49:42,717 that is words, and I'm storing the data into the new RDD,
1141 00:49:42,717 --> 00:49:44,000 which is wordPair.
1142 00:49:44,000 --> 00:49:46,500 So here we are applying the map transformation in order
1143 00:49:46,500 --> 00:49:49,645 to display the first letter of each and every word
1144 00:49:49,645 --> 00:49:51,700 which is stored in the RDD words.
1145 00:49:51,700 --> 00:49:53,200 Now, let's continue.
1146 00:49:53,200 --> 00:49:56,093 The transformation has been applied successfully. Now
1147 00:49:56,093 --> 00:49:59,300 let's display the contents which are present in the new RDD,
1148 00:49:59,300 --> 00:50:01,800 which is wordPair. So
1149 00:50:01,800 --> 00:50:05,100 as explained, we have displayed the starting letter of each
1150 00:50:05,100 --> 00:50:06,100 and every word:
1151 00:50:06,100 --> 00:50:10,888 S is the starting letter of Spark, i is the starting letter of is, and
1152 00:50:10,888 --> 00:50:13,700 so on; l is the starting letter of language.
1153 00:50:13,900 --> 00:50:17,000 Now we have understood the creation of RDDs.
1154 00:50:17,000 --> 00:50:17,823 Let us move on
1155 00:50:17,823 --> 00:50:21,000 to the next stage where we'll understand the operations
1156 00:50:21,000 --> 00:50:23,716 that are performed on RDDs. Transformations
1157 00:50:23,716 --> 00:50:26,300 and actions are the two major operations
1158 00:50:26,300 --> 00:50:27,700 that are performed on RDDs.
1159 00:50:27,700 --> 00:50:31,677 Let us understand what transformations are: we apply
1160 00:50:31,677 --> 00:50:35,575 transformations in order to access, filter and modify the data
1161 00:50:35,575 --> 00:50:37,470 which is present in an RDD.
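The three creation methods just demonstrated look roughly like this in the shell (a sketch with a placeholder HDFS path, not the exact CloudLab session):

  // 1. Parallelizing a local collection
  val daysRDD = sc.parallelize(Seq("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

  // 2. Loading from external storage (placeholder HDFS path)
  val sparkfile = sc.textFile("hdfs:///user/edureka/alphabets.txt")

  // 3. Deriving a new RDD from an existing one
  val words    = sc.parallelize(Seq("Spark", "is", "a", "very", "powerful", "language"))
  val wordPair = words.map(w => (w.charAt(0), w))   // first letter paired with the word
  wordPair.collect()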
1162 00:50:37,470 --> 00:50:41,087 Now transformations are further divided into two types:
1163 00:50:41,087 --> 00:50:44,500 narrow transformations and wide transformations. Now,
1164 00:50:44,500 --> 00:50:47,500 let us understand what narrow transformations are.
1165 00:50:47,500 --> 00:50:50,200 We apply narrow transformations onto a single partition
1166 00:50:50,200 --> 00:50:51,400 of the parent RDD,
1167 00:50:51,400 --> 00:50:54,886 because the data required to process the RDD is available
1168 00:50:54,886 --> 00:50:56,200 on a single partition
1169 00:50:56,200 --> 00:50:58,200 of the parent RDD. The examples
1170 00:50:58,200 --> 00:51:01,125 for narrow transformations are map, filter,
1171 00:51:01,500 --> 00:51:04,300 flatMap and mapPartitions.
1172 00:51:04,400 --> 00:51:06,940 Let us move on to the next type of transformations,
1173 00:51:06,940 --> 00:51:08,511 which is wide transformations.
1174 00:51:08,511 --> 00:51:11,600 We apply wide transformations onto multiple partitions
1175 00:51:11,600 --> 00:51:12,698 of the parent RDD,
1176 00:51:12,698 --> 00:51:16,080 because the data required to process an RDD is available
1177 00:51:16,080 --> 00:51:17,514 on multiple partitions
1178 00:51:17,514 --> 00:51:19,600 of the parent RDD. The examples
1179 00:51:19,600 --> 00:51:23,000 for wide transformations are reduceByKey and union. Now,
1180 00:51:23,000 --> 00:51:24,823 let us move on to the next part,
1181 00:51:24,823 --> 00:51:27,200 which is actions. Actions, on the other hand,
1182 00:51:27,200 --> 00:51:29,802 are considered to be the next part of operations,
1183 00:51:29,802 --> 00:51:31,700 which are used to display the final results.
1184 00:51:32,200 --> 00:51:35,793 The examples for actions are collect, count, take
1185 00:51:35,800 --> 00:51:38,479 and first. Till now we have discussed
1186 00:51:38,479 --> 00:51:40,700 the theory part on RDDs.
1187 00:51:40,700 --> 00:51:42,870 Let us start executing the operations
1188 00:51:42,870 --> 00:51:44,800 that are performed on RDDs.
1189 00:51:46,500 --> 00:51:49,100 In the practical part we'll be dealing with an example
1190 00:51:49,100 --> 00:51:50,600 of IPL match data.
1191 00:51:50,900 --> 00:51:52,900 So here I have a CSV file
1192 00:51:52,900 --> 00:51:57,158 which has the IPL match records, and this CSV file is stored
1193 00:51:57,158 --> 00:51:59,081 in my HDFS, and I'm loading
1194 00:51:59,081 --> 00:52:01,956 my matches.csv file into the new RDD,
1195 00:52:01,956 --> 00:52:04,200 which is CKfile, as a text file.
1196 00:52:04,200 --> 00:52:07,909 So the matches.csv file has been successfully loaded
1197 00:52:07,909 --> 00:52:09,990 as a text file into the new RDD,
1198 00:52:09,990 --> 00:52:11,400 which is CKfile. Now
1199 00:52:11,400 --> 00:52:13,759 let us display the data which is present
1200 00:52:13,759 --> 00:52:16,300 in our CKfile using an action command.
1201 00:52:16,400 --> 00:52:18,170 So collect is the action command
1202 00:52:18,170 --> 00:52:20,700 which I'm using in order to display the data
1203 00:52:20,700 --> 00:52:23,100 which is present in my CKfile RDD.
1204 00:52:23,600 --> 00:52:27,569 So here we have in total six hundred and thirty-six rows
1205 00:52:27,569 --> 00:52:30,600 of data, which consists of IPL match records
1206 00:52:30,600 --> 00:52:33,500 from the year 2008 to 2017.
1207 00:52:33,711 --> 00:52:36,788 Now, let us see the schema of our CSV file.
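A hedged sketch of loading the file and of the narrow-versus-wide distinction follows (the HDFS path and the city column index are assumptions about the matches.csv layout):

  val CKfile = sc.textFile("hdfs:///user/edureka/matches.csv")   // placeholder path

  // Narrow transformations: each output partition depends on one parent partition
  val cities = CKfile.map(_.split(",")(2))          // city assumed to be the third field

  // Wide transformation: reduceByKey shuffles data across partitions
  val matchesPerCity = cities.map(c => (c, 1)).reduceByKey(_ + _)

  // Actions bring results back to the driver
  CKfile.first()          // the header / schema line
  matchesPerCity.take(5)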
1208 00:52:37,300 --> 00:52:40,561 I am using the action command first in order to display
1209 00:52:40,561 --> 00:52:42,800 the schema of our matches.csv file,
1210 00:52:42,800 --> 00:52:45,300 so this command will display the starting line
1211 00:52:45,300 --> 00:52:46,400 of the CSV file
1212 00:52:46,400 --> 00:52:48,005 we have. So the schema
1213 00:52:48,005 --> 00:52:51,600 of our CSV file is the ID of the match, season, city
1214 00:52:51,600 --> 00:52:54,386 where the IPL match was conducted, date
1215 00:52:54,386 --> 00:52:57,700 of the match, team one, team two and so on. Now
1216 00:52:57,700 --> 00:53:01,100 let's perform the further operations on our CSV file.
1217 00:53:02,000 --> 00:53:04,300 Now moving on to the further operations,
1218 00:53:04,300 --> 00:53:07,800 I'm about to split the second column of my CSV file,
1219 00:53:07,800 --> 00:53:10,787 which consists of the information regarding the cities
1220 00:53:10,787 --> 00:53:12,700 which conducted the IPL matches.
1221 00:53:12,700 --> 00:53:15,467 So I am using this operation in order to display
1222 00:53:15,467 --> 00:53:18,000 the cities where the matches were conducted.
1223 00:53:18,700 --> 00:53:21,600 So the transformation has been successfully applied
1224 00:53:21,600 --> 00:53:24,600 and the data has been stored into the new RDD, which is states.
1225 00:53:24,600 --> 00:53:26,700 Now, let's display the data which is stored
1226 00:53:26,700 --> 00:53:30,100 in our states RDD using the collect action command.
1227 00:53:30,400 --> 00:53:31,890 So these were the cities
1228 00:53:31,890 --> 00:53:34,500 where the matches were being conducted. Now,
1229 00:53:34,500 --> 00:53:35,817 let's find out the city
1230 00:53:35,817 --> 00:53:38,700 which conducted the maximum number of IPL matches.
1231 00:53:39,400 --> 00:53:41,700 Here, I'm creating a new RDD again,
1232 00:53:41,700 --> 00:53:45,017 which is statesCount, and I'm using the map transformation,
1233 00:53:45,017 --> 00:53:47,799 and I am counting each and every city and the number
1234 00:53:47,799 --> 00:53:50,200 of matches conducted in that particular city.
1235 00:53:50,500 --> 00:53:52,776 The transformation is successfully applied
1236 00:53:52,776 --> 00:53:55,600 and the data has been stored into the statesCount RDD.
1237 00:53:56,400 --> 00:53:56,900 Now,
1238 00:53:56,900 --> 00:54:00,097 let us create a new RDD by name stateCountM
1239 00:54:00,097 --> 00:54:01,414 and apply the reduceByKey
1240 00:54:01,414 --> 00:54:04,572 transformation and map transformation together,
1241 00:54:04,572 --> 00:54:07,900 and consider tuple one as the city name and tuple
1242 00:54:07,900 --> 00:54:09,500 two as the number of matches
1243 00:54:09,500 --> 00:54:11,876 which were conducted in that particular city,
1244 00:54:11,876 --> 00:54:12,701 and apply the sortByKey
1245 00:54:12,701 --> 00:54:15,000 transformation to find out the city
1246 00:54:15,000 --> 00:54:17,700 which conducted the maximum number of IPL matches.
1247 00:54:17,900 --> 00:54:20,317 The transformations are successfully applied
1248 00:54:20,317 --> 00:54:23,200 and the data is being stored into the stateCountM
1249 00:54:23,200 --> 00:54:25,200 RDD. Now let's display the data
1250 00:54:25,200 --> 00:54:26,800 which is present in the stateCountM
1251 00:54:26,800 --> 00:54:29,600 RDD. Here I am using the
1252 00:54:29,600 --> 00:54:33,320 take action command in order to take the top 10 results
1253 00:54:33,320 --> 00:54:35,800 which are stored in the stateCountM RDD.
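Putting those steps together, a minimal sketch of the counting-and-sorting pipeline (the city column index is an assumption, and the header line is ignored here for simplicity):

  val statesCount = CKfile.map(_.split(",")(2)).map(city => (city, 1))
  val stateCountM = statesCount
    .reduceByKey(_ + _)                        // total matches per city
    .map { case (city, n) => (n, city) }       // swap so we can sort by the count
    .sortByKey(ascending = false)              // highest counts first
  stateCountM.take(10)                         // e.g. Mumbai shows up at the top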
1254 00:54:36,100 --> 00:54:38,600 So according to the results, we have Mumbai,
1255 00:54:38,600 --> 00:54:41,300 which hosted the maximum number of IPL matches,
1256 00:54:41,300 --> 00:54:45,700 which is 85, from the year 2008 to the year 2017.
1257 00:54:46,400 --> 00:54:50,300 Now let us create a new RDD by name fil RDD and use
1258 00:54:50,300 --> 00:54:53,144 a filter in order to filter out the match data
1259 00:54:53,144 --> 00:54:55,800 which was conducted in the city Hyderabad
1260 00:54:55,800 --> 00:54:58,500 and store the same data into the fil RDD.
1261 00:54:58,500 --> 00:55:01,617 The transformation has been successfully applied. Now
1262 00:55:01,617 --> 00:55:04,600 let us display the data which is present in our fil RDD,
1263 00:55:04,600 --> 00:55:06,161 which consists of the matches
1264 00:55:06,161 --> 00:55:08,800 which were conducted excluding the city Hyderabad.
1265 00:55:09,900 --> 00:55:11,126 So this is the data
1266 00:55:11,126 --> 00:55:15,000 which is present in our fil RDD, which excludes the matches
1267 00:55:15,000 --> 00:55:18,000 which were played in the city Hyderabad. Now,
1268 00:55:18,000 --> 00:55:19,768 let us create another RDD
1269 00:55:19,768 --> 00:55:22,773 by name fil and store the data of the matches
1270 00:55:22,773 --> 00:55:25,300 which were conducted in the year 2017.
1271 00:55:25,300 --> 00:55:27,394 We shall use the filter transformation
1272 00:55:27,394 --> 00:55:28,600 for this operation.
1273 00:55:28,700 --> 00:55:31,000 The transformation has been applied successfully
1274 00:55:31,000 --> 00:55:34,100 and the data has been stored into the fil RDD. Now,
1275 00:55:34,100 --> 00:55:36,600 let us display the data which is present there.
1276 00:55:37,200 --> 00:55:38,588 We shall use the collect
1277 00:55:38,588 --> 00:55:42,545 action command, and now we have the data of all the matches
1278 00:55:42,545 --> 00:55:45,600 which were played in the year 2017.
1279 00:55:47,100 --> 00:55:49,400 Similarly, we can find out the matches
1280 00:55:49,400 --> 00:55:52,000 which were played in the year 2016 and we
1281 00:55:52,000 --> 00:55:54,600 can save the same data into the new RDD,
1282 00:55:54,600 --> 00:55:57,500 which is fil2. Similarly,
1283 00:55:57,500 --> 00:55:59,823 we can find out the data of the matches
1284 00:55:59,823 --> 00:56:03,100 which were conducted in the year 2016 and we can store
1285 00:56:03,100 --> 00:56:05,061 the same data into our new RDD,
1286 00:56:05,061 --> 00:56:08,200 which is fil2. I have used the filter transformation
1287 00:56:08,200 --> 00:56:10,800 in order to filter out the data of the matches
1288 00:56:10,800 --> 00:56:13,581 which were conducted in the year 2016 and I
1289 00:56:13,581 --> 00:56:15,900 have saved the data into the new RDD,
1290 00:56:15,900 --> 00:56:18,300 which is fil2. Now,
1291 00:56:18,300 --> 00:56:20,889 let us understand the union transformation.
1292 00:56:20,889 --> 00:56:21,900 We will apply
1293 00:56:21,900 --> 00:56:26,400 the union transformation onto the fil RDD and fil2 RDD
1294 00:56:26,400 --> 00:56:29,100 in order to combine the data present
1295 00:56:29,100 --> 00:56:30,816 in both the RDDs. Here
1296 00:56:30,816 --> 00:56:32,232 I'm creating a new RDD
1297 00:56:32,232 --> 00:56:35,931 by the name unionRDD and I'm applying the union transformation
1298 00:56:35,931 --> 00:56:38,600 on the two RDDs that we created before.
1299 00:56:38,600 --> 00:56:42,400 The first one is the fil RDD, which consists of the data
1300 00:56:42,400 --> 00:56:44,818 of the matches played in the year 2017,
1301 00:56:44,818 --> 00:56:46,633 and the second one is fil2,
1302 00:56:46,633 --> 00:56:49,295 which consists of the data of the matches
1303 00:56:49,295 --> 00:56:52,469 which were played in the year 2016. Here I'll be clubbing
1304 00:56:52,469 --> 00:56:53,921 both the RDDs together
1305 00:56:53,921 --> 00:56:56,700 and I'll be saving the data into the new RDD,
1306 00:56:56,701 --> 00:56:58,163 which is unionRDD.
1307 00:56:58,600 --> 00:57:02,600 Now let us display the data which is present in the new RDD,
1308 00:57:02,600 --> 00:57:04,100 which is unionRDD.
1309 00:57:04,100 --> 00:57:06,100 I am using the collect action command in order
1310 00:57:06,100 --> 00:57:07,100 to display the data.
1311 00:57:07,300 --> 00:57:09,800 So here we have the data of the matches
1312 00:57:09,800 --> 00:57:11,400 which were played in the years
1313 00:57:11,400 --> 00:57:13,400 2016 and 2017.
1314 00:57:13,900 --> 00:57:16,306 And now let's continue with our operations
1315 00:57:16,306 --> 00:57:19,188 and find out the player with the maximum number of man
1316 00:57:19,188 --> 00:57:21,603 of the match awards. For this operation
1317 00:57:21,603 --> 00:57:23,293 I am applying the map transformation
1318 00:57:23,293 --> 00:57:25,345 and splitting out column number 13,
1319 00:57:25,345 --> 00:57:28,314 which consists of the data of the players who won the man
1320 00:57:28,314 --> 00:57:30,800 of the match awards for that particular match.
1321 00:57:30,800 --> 00:57:33,252 So the transformation has been successfully applied
1322 00:57:33,252 --> 00:57:35,752 and column number 13 has been successfully split
1323 00:57:35,752 --> 00:57:37,700 and the data has been stored into the man
1324 00:57:37,700 --> 00:57:39,238 of the match RDD. Now
1325 00:57:39,238 --> 00:57:42,155 we are creating a new RDD by the name man
1326 00:57:42,155 --> 00:57:45,600 of the match count, applying map transformations on
1327 00:57:45,600 --> 00:57:46,800 to the previous RDD,
1328 00:57:46,800 --> 00:57:48,300 and we are counting the number
1329 00:57:48,300 --> 00:57:51,300 of awards won by each and every particular player.
1330 00:57:51,700 --> 00:57:55,733 Now, we shall create a new RDD by the name man of the match,
1331 00:57:55,733 --> 00:57:59,500 and we are applying reduceByKey on the previous RDD,
1332 00:57:59,500 --> 00:58:01,311 which is man of the match count.
1333 00:58:01,311 --> 00:58:03,765 And again, we are applying the map transformation
1334 00:58:03,765 --> 00:58:06,600 and considering tuple one as the name of the player
1335 00:58:06,600 --> 00:58:08,843 and tuple two as the number of matches
1336 00:58:08,843 --> 00:58:11,500 he played and won the man of the match awards.
1337 00:58:11,500 --> 00:58:14,794 Let us use the take action command in order to print the data
1338 00:58:14,794 --> 00:58:18,000 which is stored in our new RDD, which is man of the match.
1339 00:58:18,200 --> 00:58:21,400 So according to the result, we have AB de Villiers,
1340 00:58:21,400 --> 00:58:24,000 who won the maximum number of man of the match awards,
1341 00:58:24,000 --> 00:58:24,923 which is 15.
1342 00:58:25,800 --> 00:58:29,129 So these are a few operations that were performed on RDDs.
1343 00:58:29,129 --> 00:58:31,600 Now, let us move on to our Pokemon use case
1344 00:58:31,600 --> 00:58:34,800 so that we can understand RDDs in a much better way.
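A hedged sketch of the year filters, the union and the man-of-the-match count described above (the season and player_of_match column indices are assumptions about the file layout):

  val fil      = CKfile.filter(_.split(",")(1) == "2017")   // season assumed at index 1
  val fil2     = CKfile.filter(_.split(",")(1) == "2016")
  val unionRDD = fil.union(fil2)                            // matches of both seasons

  val manOfTheMatch = CKfile
    .map(_.split(",")(13))                                  // player_of_match assumed at index 13
    .map(player => (player, 1))
    .reduceByKey(_ + _)                                     // awards per player
    .map { case (player, n) => (n, player) }
    .sortByKey(ascending = false)
  manOfTheMatch.take(5)                                     // top award winners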
1345 00:58:35,800 --> 00:58:39,300 So the steps to be performed in the Pokemon use case are: loading
1346 00:58:39,300 --> 00:58:41,164 the Pokemon data CSV file
1347 00:58:41,164 --> 00:58:44,624 from an external storage into an RDD, removing the schema
1348 00:58:44,624 --> 00:58:46,700 from the Pokemon data CSV file,
1349 00:58:46,700 --> 00:58:49,730 finding out the total number of water type Pokemon,
1350 00:58:49,730 --> 00:58:52,117 and finding the total number of fire type Pokemon.
1351 00:58:52,117 --> 00:58:53,882 I know it's getting interesting,
1352 00:58:53,882 --> 00:58:57,000 so let me explain you each and every step practically.
1353 00:58:57,700 --> 00:59:00,200 So here I am creating a new RDD
1354 00:59:00,200 --> 00:59:02,400 by name PokemonDataRDD1
1355 00:59:02,400 --> 00:59:05,700 and I'm loading my CSV file from an external storage,
1356 00:59:05,700 --> 00:59:08,100 that is my HDFS, as a text file.
1357 00:59:08,100 --> 00:59:11,800 So the Pokemon data CSV file has been successfully loaded
1358 00:59:11,800 --> 00:59:12,800 into our new RDD.
1359 00:59:12,800 --> 00:59:14,100 So let us display the data
1360 00:59:14,100 --> 00:59:17,100 which is present in our PokemonDataRDD1.
1361 00:59:17,200 --> 00:59:19,700 I am using the collect action command for this.
1362 00:59:20,000 --> 00:59:23,900 So here we have 721 rows of data of all the types
1363 00:59:23,900 --> 00:59:28,979 of Pokemons we have. So now let us display the schema
1364 00:59:28,979 --> 00:59:30,441 of the data we have.
1365 00:59:30,700 --> 00:59:33,900 I have used the action command first in order to display
1366 00:59:33,900 --> 00:59:35,727 the first line of the CSV file,
1367 00:59:35,727 --> 00:59:38,600 which happens to be the schema of the CSV file.
1368 00:59:38,600 --> 00:59:40,000 So we have the index
1369 00:59:40,000 --> 00:59:42,100 of the Pokemon, name of the Pokemon,
1370 00:59:42,100 --> 00:59:46,700 its type, total points, HP, attack points, defense points,
1371 00:59:46,992 --> 00:59:50,607 special attack, special defense, speed, generation,
1372 00:59:50,700 --> 00:59:51,938 and we can also find
1373 00:59:51,938 --> 00:59:54,600 if a particular Pokemon is legendary or not.
1374 00:59:55,773 --> 00:59:57,926 Here, I'm creating a new RDD,
1375 00:59:58,000 --> 00:59:59,400 which is NoHeader,
1376 00:59:59,400 --> 01:00:02,800 and I'm using the filter operation in order to remove the schema
1377 01:00:02,800 --> 01:00:04,900 of our Pokemon data CSV file.
1378 01:00:04,900 --> 01:00:08,407 The schema of the Pokemon data CSV file has been removed
1379 01:00:08,407 --> 01:00:10,705 because Spark considers the schema
1380 01:00:10,705 --> 01:00:12,300 as data to be processed.
1381 01:00:12,300 --> 01:00:13,480 So for this reason,
1382 01:00:13,480 --> 01:00:16,500 we remove the schema. Now, let's display the data
1383 01:00:16,500 --> 01:00:19,000 which is present in the NoHeader RDD.
1384 01:00:19,000 --> 01:00:20,441 I am using the action command
1385 01:00:20,441 --> 01:00:22,500 collect in order to display the data
1386 01:00:22,500 --> 01:00:24,700 which is present in the NoHeader RDD.
1387 01:00:24,900 --> 01:00:26,104 So this is the data
1388 01:00:26,104 --> 01:00:28,195 which is stored in the NoHeader RDD
1389 01:00:28,195 --> 01:00:29,400 without the schema.
1390 01:00:31,200 --> 01:00:33,978 So now let us find out the number of partitions
1391 01:00:33,978 --> 01:00:37,300 into which our NoHeader RDD has been split.
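The loading and header-removal steps amount to something like this (the HDFS path and the file name are placeholders):

  val PokemonDataRDD1 = sc.textFile("hdfs:///user/edureka/PokemonData.csv")  // placeholder path
  val header   = PokemonDataRDD1.first()              // the schema line
  val NoHeader = PokemonDataRDD1.filter(_ != header)  // drop it so it is not processed as data
  NoHeader.partitions.size                            // number of partitions the data was split into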
1392 01:00:37,300 --> 01:00:40,320 So I am using the partitions method in order to find
1393 01:00:40,320 --> 01:00:42,060 out the number of partitions
1394 01:00:42,060 --> 01:00:45,000 the data was split into. According to the result,
1395 01:00:45,000 --> 01:00:48,300 the NoHeader RDD has been split into two partitions.
1396 01:00:48,600 --> 01:00:52,000 I am here creating a new RDD by name WaterRDD
1397 01:00:52,000 --> 01:00:55,100 and I'm using the filter transformation in order to find
1398 01:00:55,100 --> 01:00:59,000 out the water type Pokemons in our Pokemon data CSV file.
1399 01:00:59,600 --> 01:01:02,800 I'm using the action command collect in order to print the data
1400 01:01:02,800 --> 01:01:04,900 which is present in the WaterRDD.
1401 01:01:05,200 --> 01:01:08,000 So these are the total number of water type Pokemon
1402 01:01:08,000 --> 01:01:10,528 that we have in our Pokemon data CSV.
1403 01:01:10,528 --> 01:01:11,160 Similarly,
1404 01:01:11,160 --> 01:01:13,500 let's find out the fire type Pokemons.
1405 01:01:14,600 --> 01:01:17,500 I'm creating a new RDD by the name FireRDD
1406 01:01:17,500 --> 01:01:20,523 and applying the filter operation in order to find out
1407 01:01:20,523 --> 01:01:23,300 the fire type Pokemon present in our CSV file.
1408 01:01:24,200 --> 01:01:27,200 I'm using the collect action command in order to print the data
1409 01:01:27,200 --> 01:01:29,200 which is present in the FireRDD.
1410 01:01:29,400 --> 01:01:32,100 So these are the fire type Pokemon which are present
1411 01:01:32,100 --> 01:01:34,400 in our Pokemon data CSV file.
1412 01:01:34,600 --> 01:01:37,600 Now, let us count the total number of water type Pokemon
1413 01:01:37,600 --> 01:01:40,400 which are present in our Pokemon data CSV file.
1414 01:01:40,400 --> 01:01:44,500 I am using the count action for this, and we have 112 water type
1415 01:01:44,500 --> 01:01:47,400 Pokemons present in our Pokemon data CSV file.
1416 01:01:47,400 --> 01:01:47,924 Similarly,
1417 01:01:47,924 --> 01:01:50,600 let's find out the total number of fire type Pokemon
1418 01:01:50,600 --> 01:01:54,300 that we have; I'm using the count action command for the same.
1419 01:01:54,300 --> 01:01:56,178 So we have a total of 52
1420 01:01:56,178 --> 01:01:59,800 fire type Pokemons in our Pokemon data CSV file.
1421 01:01:59,800 --> 01:02:01,992 Let's continue with our further operations,
1422 01:02:01,992 --> 01:02:05,200 where we'll find out the highest defense strength of a Pokemon.
1423 01:02:05,300 --> 01:02:08,400 I am creating a new RDD by the name DefenseList
1424 01:02:08,400 --> 01:02:10,400 and I'm applying the map transformation
1425 01:02:10,400 --> 01:02:12,935 and splitting out column number six in order
1426 01:02:12,935 --> 01:02:14,500 to extract the defense points
1427 01:02:14,500 --> 01:02:18,100 of all the Pokemons present in our Pokemon data CSV file.
1428 01:02:18,300 --> 01:02:21,400 So the data has been stored successfully into the new
1429 01:02:21,400 --> 01:02:23,100 RDD, which is DefenseList.
1430 01:02:23,500 --> 01:02:23,700 Now
1431 01:02:23,700 --> 01:02:26,249 I'm using the max action command in order to print out
1432 01:02:26,249 --> 01:02:29,100 the maximum defense strength out of all the Pokemons.
1433 01:02:29,200 --> 01:02:32,576 So we have 230 points as the maximum defense strength
1434 01:02:32,576 --> 01:02:34,200 amongst all the Pokemons.
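A sketch of the type filters and the maximum-defense lookup (the column indices for type and defense are assumptions based on the schema read out above):

  val WaterRDD = NoHeader.filter(_.split(",")(2) == "Water")   // type assumed at index 2
  val FireRDD  = NoHeader.filter(_.split(",")(2) == "Fire")
  WaterRDD.count()                                             // e.g. 112
  FireRDD.count()                                              // e.g. 52

  val DefenseList = NoHeader.map(_.split(",")(6).toDouble)     // defense assumed at index 6
  DefenseList.max()                                            // e.g. 230.0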
1435 01:02:34,200 --> 01:02:35,702 So in our further operations,
1436 01:02:35,702 --> 01:02:38,502 let's find out the Pokemons which come under the category
1437 01:02:38,502 --> 01:02:40,600 of having the highest defense strength,
1438 01:02:40,600 --> 01:02:42,400 which is 230 points.
1439 01:02:43,100 --> 01:02:45,456 In order to find out the name of the Pokemon
1440 01:02:45,456 --> 01:02:47,100 with the highest defense strength,
1441 01:02:47,100 --> 01:02:49,182 I'm creating a new RDD with the name
1442 01:02:49,182 --> 01:02:51,717 DefenseWithPokemonName, and I'm applying
1443 01:02:51,717 --> 01:02:54,000 the map transformation onto the previous RDD,
1444 01:02:54,000 --> 01:02:55,000 which is NoHeader,
1445 01:02:55,000 --> 01:02:56,062 and I'm splitting out
1446 01:02:56,062 --> 01:02:59,100 column number six, which happens to be the defense strength,
1447 01:02:59,100 --> 01:03:02,300 in order to extract the data from that particular row
1448 01:03:02,300 --> 01:03:05,100 which has the defense strength as 230 points.
1449 01:03:05,769 --> 01:03:08,230 Now I'm creating a new RDD again
1450 01:03:08,300 --> 01:03:11,500 with the name MaxDefensePokemon and I'm applying
1451 01:03:11,500 --> 01:03:15,100 the groupByKey transformation in order to display the Pokemon
1452 01:03:15,100 --> 01:03:18,675 which have the maximum defense points, that is 230 points.
1453 01:03:18,675 --> 01:03:20,400 So according to the result,
1454 01:03:20,400 --> 01:03:23,400 we have Steelix, Steelix Mega, Shuckle, Aggron
1455 01:03:23,400 --> 01:03:24,500 and Aggron Mega
1456 01:03:24,500 --> 01:03:27,200 as the Pokemons with the highest defense strength,
1457 01:03:27,200 --> 01:03:28,800 which is 230 points.
1458 01:03:28,800 --> 01:03:31,100 Now we shall find out the Pokemon
1459 01:03:31,100 --> 01:03:33,600 which is having the least defense strength.
1460 01:03:34,200 --> 01:03:35,900 So before we find out the Pokemon
1461 01:03:35,900 --> 01:03:37,580 with the least defense strength,
1462 01:03:37,580 --> 01:03:39,694 let us find out the least defense points
1463 01:03:39,694 --> 01:03:41,700 which are present in the defense list.
1464 01:03:42,900 --> 01:03:45,100 So in order to find out the Pokemon
1465 01:03:45,100 --> 01:03:46,788 with the least defense strength,
1466 01:03:46,788 --> 01:03:48,200 I have created a new RDD
1467 01:03:48,200 --> 01:03:51,654 by name MinimumDefensePokemon and I have applied the distinct
1468 01:03:51,654 --> 01:03:54,900 and sortBy transformations onto the DefenseList RDD
1469 01:03:54,900 --> 01:03:57,900 in order to extract the least defense points present
1470 01:03:57,900 --> 01:03:58,955 in the defense list,
1471 01:03:58,955 --> 01:04:01,484 and I have used the take action command in order
1472 01:04:01,484 --> 01:04:02,600 to display the data
1473 01:04:02,600 --> 01:04:05,300 which is present in the MinimumDefensePokemon RDD.
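A hedged sketch of pairing the defense value with the name and picking out the 230-point entries; the minimum-defense lookup works the same way with distinct and an ascending sortBy (column indices assumed as before):

  val DefenseWithPokemonName = NoHeader.map { row =>
    val cols = row.split(",")
    (cols(6).toDouble, cols(1))          // (defense, name); indices are assumptions
  }
  val MaxDefensePokemon = DefenseWithPokemonName
    .groupByKey()                        // group the names by their defense value
    .filter { case (defense, _) => defense == 230.0 }
  MaxDefensePokemon.collect()

  // Least defense value present in the list
  DefenseList.distinct().sortBy(x => x).take(1)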
1474 01:04:05,300 --> 01:04:06,700 So according to the results,
1475 01:04:06,700 --> 01:04:09,300 we have five points as the least defense strength
1476 01:04:09,300 --> 01:04:11,053 of a particular Pokemon. Now,
1477 01:04:11,053 --> 01:04:13,148 let us find out the name of the Pokemon
1478 01:04:13,148 --> 01:04:16,650 which comes under the category of having five points as
1479 01:04:16,650 --> 01:04:18,290 defense strength. Now,
1480 01:04:18,290 --> 01:04:19,808 let us create a new RDD,
1481 01:04:19,808 --> 01:04:23,956 which is DefenseWithPokemonName2, and apply the map transformation
1482 01:04:23,956 --> 01:04:27,217 and split column number 6 and store the data
1483 01:04:27,217 --> 01:04:28,259 into our new RDD,
1484 01:04:28,259 --> 01:04:30,800 which is DefenseWithPokemonName2.
1485 01:04:32,000 --> 01:04:34,500 The transformation has been successfully applied
1486 01:04:34,500 --> 01:04:36,970 and the data is now stored into the new RDD,
1487 01:04:36,970 --> 01:04:37,900 which is
1488 01:04:37,900 --> 01:04:41,900 DefenseWithPokemonName2. The data has been successfully loaded.
1489 01:04:41,900 --> 01:04:45,500 Now, let us apply the further operations. Here
1490 01:04:45,538 --> 01:04:50,000 I am creating another RDD with the name MinimumDefensePokemon
1491 01:04:50,000 --> 01:04:53,400 and I'm applying the groupByKey transformation in order
1492 01:04:53,400 --> 01:04:55,500 to extract the data from the row
1493 01:04:55,500 --> 01:04:58,206 which has the defense points as 5.0.
1494 01:04:58,500 --> 01:05:01,829 The data has been successfully loaded. Now let us display
1495 01:05:01,829 --> 01:05:03,300 the data which is present
1496 01:05:03,300 --> 01:05:07,307 in the MinimumDefensePokemon RDD. Now, according to the results,
1497 01:05:07,307 --> 01:05:09,073 we have two Pokemons
1498 01:05:09,073 --> 01:05:12,098 which come under the category of having five points
1499 01:05:12,098 --> 01:05:15,400 as their defense strength: the Pokemons Chansey
1500 01:05:15,400 --> 01:05:17,500 and Happiny are the two Pokemons
1501 01:05:17,500 --> 01:05:24,500 which have the least defense strength. The world
1502 01:05:24,500 --> 01:05:26,100 of information technology
1503 01:05:26,100 --> 01:05:29,786 and big data processing started to see multiple potentialities
1504 01:05:29,786 --> 01:05:31,600 from Spark coming into action.
1505 01:05:31,700 --> 01:05:34,685 One such pinnacle in Spark's technology advancements is
1506 01:05:34,685 --> 01:05:35,600 the data frame,
1507 01:05:35,600 --> 01:05:38,200 and today we shall understand the technicalities
1508 01:05:38,200 --> 01:05:39,000 of data frames
1509 01:05:39,000 --> 01:05:42,500 in Spark. A data frame in Spark is all about performance.
1510 01:05:42,500 --> 01:05:46,300 It is a powerful, multifunctional and integrated data structure
1511 01:05:46,300 --> 01:05:49,100 where the programmer can work with different libraries
1512 01:05:49,100 --> 01:05:52,000 and perform numerous functionalities without breaking
1513 01:05:52,000 --> 01:05:53,529 a sweat, and understand the APIs
1514 01:05:53,529 --> 01:05:54,823 and libraries involved
1515 01:05:54,823 --> 01:05:57,500 in the process. Without wasting any time,
1516 01:05:57,500 --> 01:06:00,000 let us understand the topic for today's discussion.
1517 01:06:00,000 --> 01:06:01,900 I have lined up the docket for understanding
1518 01:06:01,900 --> 01:06:03,800 data frames in Spark as below,
1519 01:06:03,800 --> 01:06:06,962 which will begin with what data frames are. Here
Here we will learn what exactly a DataFrame is, how it looks and what its functionalities are. Then we shall see why we need DataFrames, where we will understand the requirements which led to their invention. Later I'll walk you through the important features of DataFrames, and then we shall look into the sources from which DataFrames in Spark get their data. Once the theory part is finished, I will get us involved in the practical part, where the creation of a DataFrame happens to be the first step. Next we shall work with an interesting example related to football, and finally, to understand DataFrames in Spark in a much better way, we shall work with the most trending topic as our use case, which is none other than Game of Thrones. So let's get started.

What is a DataFrame? In simple terms, a DataFrame can be considered a distributed collection of data. The data is organized under named columns, which provide us operations to filter, group, process and aggregate the available data. DataFrames can also be used with Spark SQL, and we can construct DataFrames from structured data files, from RDDs, or from external storage like HDFS, Hive, Cassandra, HBase and many more. With this, let us look at a more simplified example which gives a basic description of a DataFrame. We shall deal with an employee database, where we have entities and their data types: the name of the employee is the first entity and its data type is string; similarly, the employee ID is a string, the employee phone number is an integer, the employee address is a string, and finally the employee salary is a float.
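As a minimal sketch, the employee schema just described could be expressed with a StructType along these lines; the field names are illustrative, not taken from the demo.

    import org.apache.spark.sql.types._

    // Schema matching the entities mentioned above (names are hypothetical)
    val employeeSchema = StructType(Seq(
      StructField("EmployeeName",    StringType,  nullable = true),
      StructField("EmployeeID",      StringType,  nullable = true),
      StructField("EmployeePhoneNo", IntegerType, nullable = true),
      StructField("EmployeeAddress", StringType,  nullable = true),
      StructField("EmployeeSalary",  FloatType,   nullable = true)
    ))

    // The same schema could then back a DataFrame read from external storage,
    // for example (placeholder path):
    // val employeesDF = spark.read.schema(employeeSchema).csv("hdfs:///data/employees.csv")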
All of this data is stored in an external storage, which may be HDFS, Hive or Cassandra, and accessed through the DataFrame API along with its respective schema, which consists of the name of each entity and its data type. Now that we have understood what exactly a DataFrame is, let us quickly move on to the next stage, where we shall understand the requirements which led to DataFrames. A DataFrame provides support for multiple programming languages, it has the capacity to work with multiple data sources, it can process both structured and unstructured data, and it is well versed with slicing and dicing the data. The first requirement is support for multiple programming languages: the IT industry required a powerful and integrated data structure which could support multiple programming languages at the same time, without requiring additional APIs. The DataFrame was the one-stop solution which supported multiple languages through a single API; the most popular languages a DataFrame supports are R, Python, Scala, Java and more. The next requirement was support for multiple data sources. We all know that in a real-time approach to data processing we will never end up with a single data source, and the DataFrame is one such data structure which has the capability to support and process data from a variety of sources: Hadoop, Cassandra, JSON files, HBase and CSV files are a few examples. The next requirement was to process structured and unstructured data. The big data environment was designed to store huge amounts of data regardless of its type, and Spark's DataFrame is designed in such a way that it can store a huge collection of both structured and unstructured data in a tabular format along with its schema.
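To make the multiple-source point concrete, something along these lines is possible with the same read API; the paths and connection details below are placeholders, not part of the demo.

    // A structured CSV source and a semi-structured JSON source through one API
    val structuredDF     = spark.read.option("header", "true").csv("hdfs:///data/employees.csv")
    val semiStructuredDF = spark.read.json("hdfs:///data/events.json")

    // A relational source over JDBC (hypothetical database and credentials)
    val jdbcDF = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/sales")
      .option("dbtable", "orders")
      .option("user", "root")
      .option("password", "secret")
      .load()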
The last requirement was slicing and dicing data. The humongous amount of data stored in a Spark DataFrame can be sliced and diced using operations like filter, select, groupBy, orderBy and many more, and these operations are applied on the data stored in the form of rows and columns in a DataFrame. These were a few crucial requirements which led to the invention of DataFrames. Now let us get into the important features of DataFrames which give them an edge over the alternatives: immutability, lazy evaluation, fault tolerance and distributed memory storage. Let us discuss each feature in detail. The first one is immutability: similar to resilient distributed datasets, DataFrames in Spark are also immutable, which means the data stored in a DataFrame will not be altered; the only way to change the data present in a DataFrame is by applying transformation operations to it. The next feature is lazy evaluation, which is the key to the remarkable performance offered by Spark: similar to RDDs, DataFrames in Spark will not produce any output until an action command is encountered. The next feature is fault tolerance: a Spark DataFrame does not lose its data; it follows the principle of being fault tolerant to the unexpected calamities which tend to destroy the available data. The last feature is distributed storage: a Spark DataFrame distributes its data across multiple locations, so that in case of a node failure the next available node can take its place and continue the data processing.
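To make the lazy evaluation point above concrete, here is a small illustrative sketch; the path and column names are placeholders.

    import spark.implicits._

    // Transformations only build up a plan; nothing runs until an action is called
    val playersDF = spark.read.option("header", "true").csv("hdfs:///data/players.csv")

    val young  = playersDF.filter($"Age".cast("int") < 30)   // transformation: no job yet
    val byClub = young.groupBy($"Club").count()              // still no job

    byClub.show()                                            // action: the plan executes here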
The next stage is about the multiple data sources a Spark DataFrame can support. The Spark API can integrate itself with multiple programming languages such as Scala, Java, Python and R, making it capable of handling a variety of data sources such as Hadoop, Hive, HBase, Cassandra, JSON files, CSV files, MySQL and many more. So that was the theory part; now let us move into the practical part, where the creation of a DataFrame happens to be the first step. Before we begin, let us load the libraries which are required in order to process data with DataFrames. With those libraries loaded, let us begin with the creation of our DataFrame. We shall create a new DataFrame with the name employee and load the details of the employees present in an organization; the details consist of the first name, the last name and the mail ID, along with the salary. The employee data has been created successfully; now let us design the schema for this DataFrame. The schema is described as shown: the first name is of string data type, and similarly the last name and the mail address are strings, and finally the salary is of integer data type (you could also give it float). The schema has been defined successfully, so let us create the DataFrame using the createDataFrame function. Here I'm creating a new DataFrame by starting from the Spark context, using the createDataFrame method and loading the data from employee and employeeSchema. The DataFrame has been created successfully; now let's print the data present in the empDF DataFrame, for which I am using the show method.
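A minimal sketch of these steps is shown below, assuming the spark-shell; the sample records and field names are invented for illustration and are not the demo's data.

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    // Hypothetical employee records: first name, last name, mail ID, salary
    val employees = Seq(
      Row("John", "Doe",   "john.doe@example.com",   35000),
      Row("Asha", "Verma", "asha.verma@example.com", 42000)
    )

    // Schema for those records
    val employeeSchema = StructType(Seq(
      StructField("firstName", StringType,  nullable = true),
      StructField("lastName",  StringType,  nullable = true),
      StructField("email",     StringType,  nullable = true),
      StructField("salary",    IntegerType, nullable = true)
    ))

    // Build the DataFrame from an RDD of rows plus the schema, then display it
    val empDF = spark.createDataFrame(spark.sparkContext.parallelize(employees), employeeSchema)
    empDF.show()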
So the data present in empDF has been printed successfully; now let us move on to the next step, which is working with an example related to the FIFA data set. The first step in our FIFA example is loading the schema for the CSV file we are working with. With the schema loaded, let us load the CSV file from our external storage, which is HDFS, into our DataFrame, fifaDF. The CSV file has been loaded successfully into the new DataFrame, so let us print its schema using the printSchema command. The schema is displayed, and we have the credentials of each and every player in our CSV file. Now let's move on to further operations on the DataFrame. We will count the total number of player records we have in the CSV file using the count command: there are a total of 18,207 players in the file. Next, let us look at the columns we are working with, which include the ID of the player, name, age, nationality, potential and many more. Now let us take the Value column, which holds the worth of each player for a particular team, and use the describe command to see the highest and lowest values assigned to a player. We get a count of 18,207 players, a minimum worth of 0 and a maximum of 9 million pounds. Next, let us use the select command to extract the Name and Nationality columns, so that we can see the name of each player along with his nationality; here we display the top 20 rows of players with their nationalities.
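The sequence of steps narrated here roughly corresponds to the sketch below; the HDFS path, the schema value (fifaSchema, assumed to be defined earlier in the demo) and the column names Value, Name and Nationality are assumptions based on the narration.

    // Load the CSV with an explicit schema (path and schema name are placeholders)
    val fifaDF = spark.read
      .option("header", "true")
      .schema(fifaSchema)
      .csv("hdfs://localhost:9000/fifa/fifa.csv")

    fifaDF.printSchema()                         // structure of the data set
    fifaDF.count()                               // ~18,207 records in the demo data
    fifaDF.columns                               // names of the available columns
    fifaDF.describe("Value").show()              // count, min and max of the Value column
    fifaDF.select("Name", "Nationality").show()  // top 20 rows by default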
Similarly, let us find the club each player plays for. Here we have the top 20 players along with their respective clubs, for example Messi playing for Barcelona and Ronaldo for Juventus, and so on. Now let's move to the next stage: let us find the players who are most active for a particular national team or club and whose age is less than 30 years. We shall use the filter transformation to apply this operation. Here we have the details of the players whose age is less than 30, with their club and nationality along with their jersey numbers. With this we have finished our FIFA example; now, to understand DataFrames in a much better way, let us move on to our use case, which is about the most trending topic, Game of Thrones. Similar to the previous example, let us design the schema for our CSV files first. This is the schema for the first CSV file, which holds the character data for Game of Thrones. Next, let us create the schema for the second CSV file; I have named it schema2 and defined the data types for each entity, and it has been designed successfully as well. Now let us load the CSV files from our external storage, which is HDFS. The location of the first CSV file, character_deaths.csv, is the HDFS path defined above; the schema is provided as schema and the header option is set to true. We are using the spark.read function for this and loading the data into a new DataFrame, the Game of Thrones DataFrame. Similarly, let's load the other CSV file, battles.csv, into another DataFrame, the Game of Thrones battles DataFrame. Both CSV files have been loaded successfully; now let us continue with the further operations.
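The loading step described here would look roughly like the sketch below; the HDFS paths, the schema names (schema, schema2) and the DataFrame names follow the narration and are assumptions, not the exact demo code.

    // Character data (placeholder path; schema assumed to be defined earlier)
    val gotCharactersDF = spark.read
      .option("header", "true")
      .schema(schema)
      .csv("hdfs://localhost:9000/got/character_deaths.csv")

    // Battle data (placeholder path; schema2 assumed to be defined earlier)
    val gotBattlesDF = spark.read
      .option("header", "true")
      .schema(schema2)
      .csv("hdfs://localhost:9000/got/battles.csv")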
Now let us print the schema of the Game of Thrones DataFrame using the printSchema command. Here we have the schema, which consists of the name, allegiances, death year, book of death and many more fields. Similarly, let's print the schema of the Game of Thrones battles DataFrame; this is the schema for that second DataFrame. Now let us display the DataFrame we created using the show command; it prints successfully and this is the data we have in the DataFrame. We know that there are multiple houses present in the story of Game of Thrones, so let us find each individual house present in the story; the following command displays every house, and these are the houses we have in the Game of Thrones story. The battles in Game of Thrones were fought over many years, so let us classify the wars waged by the year in which they occurred. We shall use the select and filter transformations and access the columns with the battle details and the year in which each battle was fought. Let us first find the battles fought in the year 298; the following code uses a filter transformation to pull out exactly those rows. According to the result, these were the battles fought in 298, and we have the attacker kings, the defender kings, the outcome for the attacker, their commanders and the locations where the battles were fought. Next, let us find the wars waged in the year 299; these are the details of the battles fought in that year.
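Two of the queries just described could be sketched as follows; the column names (house, year, attacker_king and so on) are assumptions about the Game of Thrones data set, not confirmed by the transcript.

    import spark.implicits._

    // Every individual house present in the character data
    gotCharactersDF.select("house").distinct().show(false)

    // Battles fought in a given year, with the details mentioned in the narration
    gotBattlesDF
      .select("name", "year", "attacker_king", "defender_king", "attacker_outcome",
              "battle_type", "attacker_commander", "defender_commander", "location")
      .filter($"year" === 298)
      .show(false)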
Similarly, let us find the wars waged in the year 300; these were the battles fought in that final year. Now let's move on to the next operations in our use case. Let us find the tactics used in the wars and the total number of battles waged using each type of tactic; the following code, which uses the select and groupBy operations, gives us every type of tactic used: ambush, siege, razing and pitched battle, with pitched battles being the most common. The ambush type of battle is the deadliest, so let us find the kings who fought battles using that kind of tactic and the outcome of those battles. The following code extracts the data we need: we use the select and where commands, selecting the columns year, attacker king, defender king, attacker outcome, battle type, attacker commander and defender commander, and then print the details. These were the battles fought using ambush tactics, with their attacker and defender kings, their respective commanders and the years in which they were waged. Now let us focus on the houses and extract the deadliest house among them; the following code gives us each house and the number of battles it waged, and according to the results the Stark and Lannister houses are the deadliest among the others.
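A rough sketch of the tactics breakdown and the ambush query is below; the column names and the literal value "ambush" are assumptions based on the narration.

    import spark.implicits._

    // How many battles used each type of tactic
    gotBattlesDF
      .groupBy("battle_type")
      .count()
      .show()

    // Kings, commanders and outcomes for the ambush battles
    gotBattlesDF
      .select("year", "attacker_king", "defender_king", "attacker_outcome",
              "battle_type", "attacker_commander", "defender_commander")
      .where($"battle_type" === "ambush")
      .show(false)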
Now let us find the deadliest king among the others, using the following command to rank the kings by the number of battles they fought. According to the results, Joffrey is the deadliest king, having fought a total of 14 battles. Next, let us find the houses which defended the most wars waged against them; the following code shows that the Lannister house defended the most battles. Likewise, let us find the defender king who defended the most battles waged against him: according to the result, Robb Stark is that king. Now, since the Lannister house is my personal favourite, let me find the details of the characters in the Lannister house. This code will describe their name and gender, with 1 for male and 0 for female, along with their respective counts. Let me find the male characters in the Lannister house first: here we have used the select and where commands to pull out those characters, and the data is stored in the df1 DataFrame. Let us print it using the show command; these are the male characters present in the Lannister house.
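The ranking of kings and the Lannister filter could be sketched as follows; again, the column names and the gender encoding are assumptions based on the narration.

    import spark.implicits._

    // Kings ranked by the number of battles they attacked in
    gotBattlesDF
      .groupBy("attacker_king")
      .count()
      .orderBy($"count".desc)
      .show()

    // Male characters of House Lannister (gender: 1 = male, 0 = female in this data set)
    val df1 = gotCharactersDF
      .where($"house" === "Lannister" && $"gender" === 1)
      .select("name", "gender", "house")
    df1.show()
    df1.count()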
Similarly, let us find the female characters present in the Lannister house. These are the female characters, so in total we have 69 male characters and 12 female characters in the Lannister house. Now let us continue with the next operations. At the end of the day, every episode of Game of Thrones had a noble character, so let us find all the noble characters among all the houses in our Game of Thrones CSV file; the following code helps us do that. The details of all the characters from all the houses who are considered noble have been saved into a new DataFrame, df3; let us print them. These are the top 20 members from all the houses who are considered noble, along with their genders. Now let us count the total number of noble characters in the entire Game of Thrones story: there are 430 of them. Nonetheless, we have also seen a few commoners whose roles in Game of Thrones were exceptional, so let us find the details of all those commoners who were highly dedicated to their roles in each episode. The data of all the commoners has been loaded into another new DataFrame, df4; printing it with the show command gives the top 20 characters identified as commoners, and counting them shows there are a total of 487 common characters across the Game of Thrones story. Let us continue with the further operations. Now, there were a few characters who were considered important and equally noble, and hence were carried through until the last book. Let us filter out those characters and find the details of each of them. That data has been stored into another DataFrame, and printing it shows that, according to the results, we have two candidates who are considered noble and whose characters were carried on until the last book.
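A sketch of the noble-versus-commoner split follows; the nobility flag (1 for noble, 0 for commoner) and the column names are assumptions about the character_deaths.csv data set.

    import spark.implicits._

    // Noble characters across all houses
    val df3 = gotCharactersDF.where($"nobility" === 1).select("name", "gender", "house")
    df3.show()
    df3.count()      // 430 noble characters in the demo data

    // Common characters across all houses
    val df4 = gotCharactersDF.where($"nobility" === 0).select("name", "gender", "house")
    df4.show()
    df4.count()      // 487 commoners in the demo data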
Amongst all the battles, I found the battles of the last books to be the ones generating the most adrenaline in the readers, so let us find the details of those battles using the following code, which pulls out the wars fought in the last years of Game of Thrones. These are the details of those battles, with the kings involved, their commanders and the locations where each war was fought.

Welcome to this interesting session, a Spark SQL tutorial from Edureka. In today's session we are going to learn how to work with Spark SQL. What can you expect from this particular session? We will first learn why Spark SQL exists, what libraries are present in Spark SQL and what its important features are; we will also do some hands-on examples, and in the end we will see an interesting use case on stock market analysis. Now, why Spark SQL? Why are we learning it, why is it really important, and is it really hot in the market? If yes, then why? We want all those answers from this session. If you're coming from a Hadoop background, you must have heard a lot about Apache Hive. In Apache Hive, SQL developers can write queries in the SQL way, and those queries get converted into MapReduce jobs which give you the output. Now, we all know that MapReduce is slow in nature, and since MapReduce is slow, your overall Hive query is going to be slow as well. That was one challenge: even if you have, let's say, less than 200 GB of data, a smaller data set, performance in Hive was not that great.
Hive also does not have any resume capability: if a job gets stuck, you can only restart it. You cannot even drop your encrypted databases in Hive, which was another challenge on the security side. Now, Spark SQL has solved almost all of these problems. In the previous sessions you have already learned why Spark is faster than MapReduce, so in this session we are going to take advantage of all that. Since Spark is faster because of in-memory computation, which we have already seen, where everything is computed directly in memory, your Spark SQL is also bound to be fast. So if I talk about the advantages of Spark SQL over Hive: number one, it is going to be faster in comparison to Hive. A Hive query which takes, let's say, around ten minutes can finish in Spark SQL in less than one minute. Don't you think that's an awesome capability of Spark SQL? Definitely it is. The second thing: let's say you have been writing things in Hive. Take the example of a company that has been developing Hive queries for the last ten years. They were happy that they were able to process their data, but they were worried about performance, because Hive was not able to give them the level of processing speed they were looking for. Now this becomes a challenge for that particular company. Then they came to know about Spark SQL, fine.
They came to know that they can execute everything in Spark SQL and it is going to be faster as well, fine. But don't you think that if this company has been working in Hive for the past ten years, they must have already written a lot of code in Hive? If you ask them to migrate to Spark SQL, will it be an easy task? No, definitely not. Why? Because Hive syntax and Spark SQL syntax, though both tackle the SQL way of writing things, still carry a good amount of difference, so it would take a very long time for that company to convert all of its queries to the Spark SQL way. Now Spark SQL came up with a smart solution: even if you have written a query in Hive, you can execute that Hive query directly through Spark SQL. Don't you think that is again a very important and awesome facility? Even if you are a good Hive developer, you need not worry about how you will migrate to Spark SQL; you can keep writing Hive queries and they will run through Spark SQL. Similarly, as we learned in the past sessions, especially in the Spark Streaming session, Apache Spark lets you do real-time processing; you can take leverage of that in Spark SQL too, so you can do real-time processing and at the same time perform your SQL queries on that data.
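The "run your existing Hive queries through Spark SQL" point could look roughly like this minimal sketch; the database and table names are placeholders, and the query is only an example of HiveQL, not taken from the session.

    import org.apache.spark.sql.SparkSession

    // Enabling Hive support points Spark SQL at the existing Hive metastore
    val spark = SparkSession.builder()
      .appName("HiveOnSparkSQL")
      .enableHiveSupport()
      .getOrCreate()

    // Existing HiveQL runs as-is, but with Spark's in-memory execution engine
    spark.sql("SHOW TABLES").show()
    spark.sql("SELECT customer_id, SUM(amount) AS total FROM sales.transactions GROUP BY customer_id").show()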
Now, in Hive you cannot combine things that way, because Hadoop and Hive are all about batch processing: you keep historical data and then process it later. Hive only follows that batch-processing mode, but Apache Spark also takes care of real-time processing. So how does all this work? Spark SQL uses the metastore service of Hive to query the data stored and managed by Hive. When you were learning about Hive, we saw that everything you do in Hive is recorded in the metastore; that metastore is the crucial piece, because it is what lets you do everything: when you run any sort of query or create a table, the details end up in that same metastore. Spark SQL uses the same metastore: whatever metastore you have created for Hive can be reused for Spark SQL, and that is something really convenient about Spark SQL, because you do not need to create a new metastore or worry about new storage space; everything you have done with respect to Hive, the same metadata can be used. Now you can ask me: then how is it faster if it is using the same metastore? Remember the processing part: Hive was slow because it converts everything to MapReduce, which makes the processing very slow, but here the processing happens as in-memory computation, so in Spark SQL's case it is always going to be faster.
The metastore side is only about fetching the metadata; for anything processing-related, everything happens in memory at the processing stage, and that is what makes it fast. So let's talk about some success stories of Spark SQL, some use cases. Twitter sentiment analysis: if you remember our Spark Streaming session, we did a Twitter sentiment analysis there. We first got the data from Twitter with the help of Spark Streaming, and later we analyzed everything with the help of Spark SQL. So you can see the advantage: in Twitter sentiment analysis, where let's say you want to find out about Donald Trump, you fetch every tweet related to Donald Trump and then analyze whether it is a positive, negative, neutral, very negative or very positive tweet. We have already seen that example in that session; the point here is that once you are streaming the data in real time, you can also do the processing in real time using Spark SQL. Similarly, in stock market analysis you can use Spark SQL in a lot of places, and you can adopt it for banking fraud-detection cases as well. Let's say your credit card is swiped in India and within the next ten minutes the same card is swiped in, say, the US; that is obviously not possible. So you do all that processing in real time: you detect everything with Spark Streaming, and then you apply Spark SQL to verify whether it matches the user's usual pattern or not. Similarly, you can use it in the medical domain. Now let's talk about some Spark SQL features. When SQL got combined with Spark, we started calling it Spark SQL, and since we are talking about SQL, we are dealing with either structured or semi-structured data; SQL queries cannot deal with unstructured data, so that is definitely one thing to keep in mind. Spark SQL supports various data formats: you can get data from Parquet, which you may have heard of, a columnar and highly compressed storage format that is not human-readable; similarly JSON and Avro, where we keep the values as key-value pairs; and Hive and Cassandra, Cassandra being a NoSQL database. You can get data from all of these sources. You can also convert your SQL queries into RDDs, so you will be able to perform all the transformation steps on them. Now, if we talk about performance and scalability, on the graph the red curve corresponds to Hadoop and the blue curve denotes the performance of Spark SQL, with running time on the Y-axis and the number of iterations on the X-axis; you can see that Spark SQL performs much better in comparison to Hadoop.
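As a small illustration of mixing those formats with SQL queries, something like the following is possible; the paths, view name and column names are placeholders.

    // Read the same kind of data from Parquet and JSON (placeholder paths)
    val stocksParquet = spark.read.parquet("hdfs:///data/stocks.parquet")
    val stocksJson    = spark.read.json("hdfs:///data/stocks.json")

    // Expose a DataFrame as a temporary view and query it with plain SQL
    stocksParquet.createOrReplaceTempView("stocks")
    spark.sql("SELECT symbol, MAX(close) AS max_close FROM stocks GROUP BY symbol").show()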
Now a few more features. For example, you can create a connection through a simple JDBC or ODBC driver; these drivers are available, and you can connect to Spark SQL using them. You can also create user-defined functions: if a built-in function is available, use it, and if it is not, you can create a UDF, a user-defined function, and execute it directly to get your results. Here is one example showing that if, say, an uppercase function were not available in Spark SQL, you could create a simple UDF for it and execute it. Notice what we are doing: this dataset is my data, generated as a sequence and turned into a DataFrame (see the toDF part here). After that we create an upper UDF, and notice that we are converting any value which comes in to uppercase using the toUpperCase API. We import the udf function, declare that this is our UDF called upper, and then, when we use it with our dataset, we pass the text column to the upper UDF so that every value gets converted to uppercase.
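A minimal sketch of this kind of UDF usage is shown below; the sample data is made up, and the function and view names are only illustrative.

    import org.apache.spark.sql.functions.udf
    import spark.implicits._

    // A small dataset created from a sequence and turned into a DataFrame
    val dataset = Seq("hello", "spark", "sql").toDF("text")

    // User-defined function that upper-cases its input
    val upperUDF = udf((s: String) => s.toUpperCase)

    // Use it through the DataFrame API ...
    dataset.withColumn("upper_text", upperUDF($"text")).show()

    // ... or register it so that it can be called from SQL as well
    spark.udf.register("upper", (s: String) => s.toUpperCase)
    dataset.createOrReplaceTempView("words")
    spark.sql("SELECT text, upper(text) AS upper_text FROM words").show()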
So now every value is getting converted, and you get the output in uppercase: you can notice that this is the original value and this is its uppercase version, converted in that particular column. With the same kind of steps you can also register your UDF, which was not shown earlier: using the spark.udf.register API you can register the function, and similarly you can get the output afterwards using the show API. Now let's look at the Spark SQL architecture. In the Spark SQL architecture, you get data in various formats: let's say from CSV, from JSON, or through JDBC. There is a Data Source API, and using the Data Source API you fetch the data; after fetching it, you convert it into a DataFrame. So what is a DataFrame here? In the earlier part we learned that when we create an RDD on a cluster, say with several machines, we distribute the values across those machines and keep the data in memory, and we call that an RDD. Now, when we work with SQL, we have to store tabular data: say a table with columns such as name and age and some rows of values. If I have to keep this data in my cluster, one node holds the name and age columns along with some of the rows in memory, and another node holds a similar slice of the table with other rows, again together with the column details. If you notice, this sounds similar to an RDD, but it is not exactly an RDD, because here we are not only keeping the data, we are also storing the column information: we keep the column details along with the row data on all the data nodes, or worker nodes. That is what is called a DataFrame. So that is what happens: the data is converted through the DataFrame API, and then, using the DataFrame DSL or Spark SQL with HiveQL, the results are processed and the output is given. We will learn about all these things in detail. So let's look at the Spark SQL libraries: there are multiple APIs available, like the Data Source API, the DataFrame API, the interpreter and optimizer, and the SQL service.
So now the value simply gets converted, and the output comes back in uppercase: you can see the original value and the uppercase value side by side, so each entry got converted to uppercase in that column. The same steps are shown again here, along with something that was not shown before: how to register the UDF. Using spark.udf.register you can register your UDF so that it is available in SQL as well. Similarly, if you want to see the output afterwards, you can use the show API to display the Spark SQL result. Now let's look at the Spark SQL architecture. In the Spark SQL architecture, you get data in various formats: you can get it from CSV, from JSON, from JDBC sources, and so on. There is a Data Source API, and using that API you fetch the data; after fetching it, you convert it into a DataFrame. So what is a DataFrame? In the last module we learned what was happening when we created an RDD. Say this is my cluster: one machine, another machine, another machine. When we created an RDD across this cluster, what was happening? We were distributing the values across these machines, so let's say we were keeping all the data there.
Say block B1 was there: we kept those values in memory across the nodes, and we called that an RDD. Now, when we work with SQL, we have to store tabular data. Say there is a table with column details, for example name and age, and some values in each row. If I have to keep this data in my cluster, what do I need to do? I will again keep it in memory, but on each node I first keep the column details, name and age, and then that node's share of the rows; a similar chunk of the table, with other values but again with the column details, will sit on another node as well. Now, if you notice, this sounds similar to an RDD, but it is not exactly an RDD, because here we are not keeping only the data: we are also storing the column information on all of the data nodes, or worker nodes. We keep the column structure along with the row details, and that is what is called a DataFrame. So that is what we do: we convert the data into a DataFrame using the DataFrame API, and then, using the DataFrame DSL or Spark SQL (HiveQL), we process the data and produce the output. We will learn about all of these pieces in detail. So let's look at the Spark SQL libraries. There are multiple APIs available, such as the Data Source API, the DataFrame API, the Interpreter and Optimizer, and the SQL Service.
We will explore each of these in detail. Let's start with the Data Source API. The Data Source API is used to read and store structured and semi-structured data into Spark SQL. As you can see, in Spark SQL we can fetch data from multiple sources: Hive, Cassandra, CSV, Apache HBase, Oracle DB, and many other formats. This API helps you read all that data and store it wherever you want to use it. After that, the DataFrame API helps you convert that data into named columns and rows. Remember what I just explained about how the data is stored: you are not keeping it like a plain RDD, you are keeping the named columns as well as the row details, and that is the difference here. A DataFrame also follows the same properties as an RDD: it is immutable, lazily evaluated, and so on; all the same properties carry over. Next, the Interpreter and Optimizer. What happens in the Interpreter and Optimizer step? Starting from the DataFrame API, we first create the named columns, then an RDD is created underneath, the transformation steps are applied, and the action step is performed to output the value; all of that happens in the Interpreter and Optimizer. So those are the features it has. Now let's talk about the SQL Service.
In the SQL Service, what happens is that it again helps you: it takes over after the transformations and actions of the previous step, and then, using the Spark SQL service, you get your Spark SQL output. So whatever processing you have done in terms of transformations and so on, you can see that the Spark SQL service is the entry point for working with structured data in Apache Spark; it helps you fetch the results from the data you have optimized or interpreted before. That completes the whole diagram. Now let us see how we can run queries using Spark SQL. You can go to the spark-shell itself and execute everything there, or you can run your program from an IDE such as Eclipse; either way works. Say you are logged into a spark-shell session. First you need to import SparkSession, because from version 2.x onwards there is something called SparkSession, which we learned about in the last module; that is what we import. After that, we create the session using the builder API: look at this, we use builder, then we give it an app name, we provide a configuration, and then we call getOrCreate because we want to create the session. Then we import the implicits, and once that is done we can say that we want to read this JSON file, employee.json, and at the end we output the value. So df becomes my DataFrame containing the contents of employee.json.
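A minimal sketch of those spark-shell steps, assuming the file is called employee.json in the working directory (the configuration key shown is just a placeholder):

```scala
import org.apache.spark.sql.SparkSession

// Build (or reuse) a SparkSession via the builder API
val spark = SparkSession.builder()
  .appName("SparkSQLBasics")
  .config("spark.some.config.option", "some-value") // placeholder configuration
  .getOrCreate()

import spark.implicits._

// Read the JSON file into a DataFrame and display it
val df = spark.read.json("employee.json")
df.show()
```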
So that JSON content gets converted into my DataFrame, and at the end we just output the result. If you look at what is happening here: we first import SparkSession (the same story, we just execute it), then we build the session, import the implicits, read the JSON file with the read.json API, reading employee.json from that particular directory, and output it. JSON is a key-value format, but when I call df.show it displays all my values as a table. Now let's see how we can create a Dataset. When we talk about Datasets, notice what we do. We have already understood the basics, so first of all, for a Dataset, we create a case class; you can see that we create a case class Employee. With that case class in place, we simply create a sequence, putting in values for the name and age columns, and then we display the output, which is our Dataset. We also create a primitive Dataset, to demonstrate mapping DataFrames to Datasets; you can notice that we use toDS instead of toDF in this case. Now you may ask what the difference is with respect to a DataFrame. A DataFrame and a Dataset look exactly the same: a Dataset also has named columns and rows and everything, but it was introduced later, in version 1.6 and above.
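As a sketch of those Dataset steps (the sample values and the employee.json path are assumptions):

```scala
import org.apache.spark.sql.SparkSession

// Case class describing the records we want strongly-typed access to
case class Employee(name: String, age: Long)

val spark = SparkSession.builder().appName("DatasetExample").getOrCreate()
import spark.implicits._

// Dataset from a case-class sequence, using toDS rather than toDF
val caseClassDS = Seq(Employee("Andrew", 30)).toDS()
caseClassDS.show()

// A primitive Dataset, mapped with an ordinary Scala function
val primitiveDS = Seq(1, 2, 3).toDS()
primitiveDS.map(_ + 1).collect()   // Array(2, 3, 4)

// Read a JSON file straight into a Dataset of Employee via the encoder
val employeeDS = spark.read.json("employee.json").as[Employee]
employeeDS.show()
```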
And what does a Dataset provide? It provides an encoder mechanism, which means that when you are, say, reading the data back and deserializing it, you are not paying the full cost of that step; it is going to be faster. So performance-wise a Dataset is better, and that is the reason it was introduced later; nowadays people are moving from DataFrames to Datasets. So we just output it at the end and see the same thing in the result: we create the Employee case class, put the values inside it, create the Dataset, and look at the values. Those are the steps we have just understood. Now, how can we read a file into it? To read the file, we use read.json(...).as[Employee]; what was Employee? Remember, it is the case class we created earlier, so we are telling Spark to create the Dataset like that. Then we just output the value, and with show we can see this output as well. Now let's see how we can add a schema to an RDD. In order to add the schema to an RDD, what do we do? In this case, too, you can see that we import all the required libraries, and after that we use the SparkContext textFile API: we read the data, split each line on the comma, map the attributes to the Employee case class (that is what we have done), converting the age value to an integer, and then we convert it with toDF. After that, we create a temporary view or table, so let's create the temporary view "employee". Then we use spark.sql and pass in our SQL query; you can notice that we pass the query and that it accesses this "employee" view.
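Here is a rough sketch of that reflection-based flow, assuming a comma-separated employee.txt file laid out as name,age:

```scala
import org.apache.spark.sql.SparkSession

case class Employee(name: String, age: Int)

val spark = SparkSession.builder().appName("SchemaOnRDD").getOrCreate()
import spark.implicits._

// Read the text file, split on commas, and map each line onto the case class
val employeeDF = spark.sparkContext
  .textFile("employee.txt")
  .map(_.split(","))
  .map(attrs => Employee(attrs(0), attrs(1).trim.toInt))
  .toDF()

// SQL cannot run against the DataFrame name directly, so register a temp view
employeeDF.createOrReplaceTempView("employee")

// Query the view with ordinary SQL
val youngstersDF = spark.sql(
  "SELECT name, age FROM employee WHERE age BETWEEN 18 AND 30")
youngstersDF.show()
```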
Now, what is this "employee"? It is the temporary view we created, and we created it because of a challenge in Spark SQL: when you want to execute a SQL query, you cannot say select star from the DataFrame itself; that is simply not supported. Instead, what you need to do is create a temporary table or a temporary view, and you can notice that we use createOrReplaceTempView: "replace" because if it already exists we override it. So we create a temporary table that behaves exactly like the DataFrame, and now you can execute all your queries directly against that temporary view or temporary table; notice that in the query, instead of employeeDF, which was our DataFrame, I use the temporary view. Then, at the end, we just map the names and ages and output the values; that's it. The rest is just the execution of these steps, shown one by one, and you can see that at the end we output the values. Now, still on adding a schema to an RDD, let's look at the transformation step. In this case, you can notice that we map over the youngsters result, converting the name into a string for the transformation part; we check each value as a string-typed name and simply show it. What are we doing here? We use the map encoder from the implicits, which is available to us, to map the name to the age, because remember that in the Employee case class we have the name and age columns that we want to map. So in this case we map the names to the ages.
So we do this for each row of the youngstersDF DataFrame that we created earlier, and the result is an array with each name mapped to its respective age. You can see the output here: something like "Name: John" with age 28, which is exactly the mapping we were talking about. Earlier it was displayed as a table; in this case the output comes out in that mapped format. Now let's talk about how we can add the schema while reading the file programmatically. First we import the types package into the spark-shell; that is what the import statement does. Then we import the Row class into the shell as well, because Row will be used when mapping the RDD to the schema; you can notice we import that too. Then we create an RDD called employeeRDD; in this case you can notice that we create the same kind of RDD with the help of the textFile API. Once we have created it, we define our schema, and this is the schema approach: we define it as "name", a space, then "age", because those are the two fields I have in my data, employee.txt; if you look, those are the two fields, name and age. Once we have done that, we split the schema string on the space, map each field name, and pass each one into a StructField; so we are defining an array of fields, and that fields array is what will give us the schema once it is mapped onto the employee RDD. That is what we are doing.
So we just build that from the schema string, and in the end we take the fields we created and obtain them as a schema: we pass the fields into a StructType, and that gets converted into our schema. That is what we do, and you can see the execution of the same steps in the terminal. Now let's see how we transform the results. We already have the employee RDD, so let's create a row RDD: we want to transform the employee RDD, using the map function, into an RDD of Row objects. So let's do that. Look at what we do with the employee RDD: we split each line on the comma, and then, remembering that we have the name and then the age, we take attribute zero and attribute one, and we trim the second one just to make sure there are no stray spaces or anything else that we don't want to keep; that is the reason we add the trim call. After that, once we are done with this, we define a DataFrame, employeeDF, and store the RDD plus schema into it. If you notice, the rowRDD we defined here and the schema we defined in the previous step (the one built from the fields, if you go back and look) are both passed in, and we say that we want to create a DataFrame from them; that is what helps us create the DataFrame. Now we can create our temporary view on top of employeeDF, so let's create an "employee" temporary view, and then we can execute any SQL query on top of it.
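A minimal sketch of that programmatic-schema flow, again assuming a comma-separated employee.txt with name and age fields:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().appName("ProgrammaticSchema").getOrCreate()

// 1. Create an RDD of lines from the text file
val employeeRDD = spark.sparkContext.textFile("employee.txt")

// 2. Define the schema as a string and turn each field name into a StructField
val schemaString = "name age"
val fields = schemaString.split(" ")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)

// 3. Transform the RDD of lines into an RDD of Rows (trimming the age field)
val rowRDD = employeeRDD
  .map(_.split(","))
  .map(attrs => Row(attrs(0), attrs(1).trim))

// 4. Apply the schema to the row RDD to obtain a DataFrame
val employeeDF = spark.createDataFrame(rowRDD, schema)

// 5. Register a temp view and run SQL against it
employeeDF.createOrReplaceTempView("employee")
spark.sql("SELECT name FROM employee").show()
```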
So, as you can see, with Spark SQL we can write all the SQL queries and execute them directly. Now, if we want to output the values we can do that quickly; say we want to display the names: attribute zero contains the name, so we can refer to it and use the show command. That is how we perform the operation the schema way. The rest is the same output path; we just execute the whole thing, and you can notice that attribute zero is what represents the name in my output. Now let's talk about JSON data: how we can load such files and work with them. In this case we first import our libraries, and once we are done with that, we can simply say read.json and bring in our employee.json; you can see the execution of this part. Similarly, we can also write the data back out in Parquet format, or read values from Parquet. You can notice that if I want to write this employee DataFrame out the Parquet way, I can call write.parquet, and an employee.parquet output will be created with all the values converted to Parquet. The only thing about that data is that if you go and look, employee.parquet is actually a directory that gets created, and you will notice that you cannot read the data inside it, because Parquet is not human readable; that is the reason you will not be able to do that.
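Below is a small sketch of that Parquet round trip, including the read-back step described next; employeeDF is the DataFrame built from employee.json earlier, and the paths are illustrative:

```scala
// Write the DataFrame out as Parquet; this creates a directory of Parquet files
employeeDF.write.parquet("employee.parquet")

// Read the Parquet data back, register a temporary view, and query it with SQL
val parquetFileDF = spark.read.parquet("employee.parquet")
parquetFileDF.createOrReplaceTempView("parquetFile")
spark.sql("SELECT name FROM parquetFile WHERE age BETWEEN 18 AND 30").show()
```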
So, say you want to read it back: as in the sketch above, you can bring it back using read.parquet, reading the employee.parquet we just created, then create a temporary view or temporary table on it, and then execute standard SQL against that temporary table. In this way you can read your Parquet file data, and at the end we just display the result; you can see the similar output. That is how we execute all of these steps. Once we have done all of that, let's see how we can create our DataFrames from a file path. Say we have the file path for employee.json; after that we can create a DataFrame from that JSON path, creating it with read.json, and then we can print the schema: printSchema is going to print the schema of my employee DataFrame, so we use printSchema to print all of it. Then we can create a temporary view of this DataFrame, so we call createOrReplaceTempView, which we saw last time as well. After that we can execute our SQL query; say we run a query on employee where the age is between 18 and 30. We can run that kind of SQL query and, at the end, see the output as well; you can see in the execution that all the employees whose age is between 18 and 30 show up in the output. Now let's look at the RDD operation way. We are going to create another RDD, otherEmployeeRDD, which will store the contents of a JSON record for an employee named George from New Delhi, Delhi.
Look at this part: we create it using makeRDD, and it stores the record containing that JSON, so New Delhi is my city name and Delhi is the state; that is what we pass inside it. Now what do we do? We assign the content of this otherEmployeeRDD to otherEmployee by using spark.read.json on it, so we read the value, and at the end we call show on it; you can notice the output coming up. Now let's see this with a Hive table. If you want to read from a Hive table, let's do it with a case class and a SparkSession. First of all, we import the Row class and bring SparkSession into the spark-shell, so let's do that first: I import Row and SparkSession. After that, we create a case class Record containing a key, which is of integer type, and a value, which is of string type. Then we set the warehouse location to the path of spark-warehouse; that is what we do. Next we build a SparkSession to demonstrate the Hive example in Spark SQL. Look at this: we call SparkSession.builder again, we pass an app name to it, we pass the configuration to it, and then we say that we want to enable Hive support. Once we have done that, we import the spark.sql helper and so on, and then you can notice that we can use sql directly: we create a table src, "CREATE TABLE IF NOT EXISTS src", with two columns that store the data as key and value pairs. That is what we do here, and you can see the execution of the same steps.
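A sketch of that Hive-enabled session, following the layout of the standard Spark Hive example; the warehouse path and the kv1.txt sample file are assumptions here:

```scala
import java.io.File
import org.apache.spark.sql.{Row, SparkSession}

// Key/value record matching the two columns of the src table
case class Record(key: Int, value: String)

val warehouseLocation = new File("spark-warehouse").getAbsolutePath

val spark = SparkSession.builder()
  .appName("SparkHiveExample")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()

import spark.implicits._
import spark.sql

// Create the Hive table and load sample data into it
sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Ordinary HiveQL queries now run through Spark SQL
sql("SELECT * FROM src").show()
sql("SELECT COUNT(*) FROM src").show()
```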
Now let's see the SQL operations happening here. In this case, we can load the data from the example file, kv1.txt, which is available to us, and store it into the table src that we just created. Now, if we want to view all of it, it becomes a simple "SELECT * FROM src", and that shows all the values; you can see the output, and that is how you display them. Similarly, we can perform a count operation, "SELECT COUNT(*) FROM src", to count the number of keys in the table, and we can select the records themselves, key and value, so you can see that we can perform all our usual Hive operations here. Likewise, we can create a Dataset, stringDS, from sqlDF, which we already have: you can see that we just call map, use the case-class pattern to map the key and value pair, and then show all the values; you can see the execution and then the output we wanted. Now let's look at the result part. We can create a DataFrame here, recordsDF, and store in it records whose values lie between 1 and 100; so we store all the values from 1 to 100, and then we create a temporary view for the records, so that we can run all our SQL queries over it. Now we can execute everything, and you can also notice that we do a join operation here: we display the content of a join between the records view and the src table, so we can perform all the join operations like that and get the output.
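A short sketch of that last step, reusing the Record case class and the src table from the previous snippet:

```scala
// A DataFrame of Record(1..100), exposed as the temporary view "records"
val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i, s"val_$i")))
recordsDF.createOrReplaceTempView("records")

// Join the temporary view with the Hive table and display the matching rows
sql("SELECT * FROM records r JOIN src s ON r.key = s.key").show()
```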
Now let's look at our use case: we are going to analyze the stock market with the help of Spark SQL, so let's understand the problem statement first. Everybody is surely aware of the stock market: a lot of activity happens there, and you want to analyze it in order to make some profit out of it, and so on. So let's say our company has collected a lot of data for ten different companies and wants to do some computation on it. They want to compute the average closing price, list the companies with the highest closing prices, compute the average closing price per month, list the number of big price rises and falls, and compute some statistical correlation. These are the things we are going to do with the help of Spark SQL statements. The requirements are very common ones: we want to process huge amounts of data, handle input from multiple sources, process the data in real time, and it should be easy to use, not very complicated. All of these requirements are handled by Spark SQL, and that is the reason we are going to use it. As I said, we are going to take ten companies, and on those ten companies' data we are going to perform our analysis. We will be using daily table data from Yahoo Finance for all of the following stocks, that is, for each of these companies we are going to work on. This is how the data will look: it will have the date, the opening price, the high and low prices, the closing price, the volume, and the adjusted close. Now let's see how we can implement the stock analysis using Spark SQL.
So what do we have to do for that? This is how my data flow diagram will look: initially we have a huge amount of real-time stock data, which we are going to process through Spark SQL; we will convert it into a named-column form, and then we will create an RDD for functional programming. So let's do that. Then we will use our Spark SQL queries to calculate the average closing price per year and the company with the highest closing price per year, and with a few more stock SQL queries we will get our outputs. That is what we are going to do; and it is not only these, we are also going to compute a few other queries to answer the questions we set out, and we will execute them. Now, this is how the flow will look. We start with the data I just showed you; then we create a DataFrame, then we create a joined-closing-prices RDD (we will see what we do there), then we calculate the average closing price per year, hit our Spark SQL query, and get the result back as a table. This is how my execution will look. In this case, first of all, we initialize Spark SQL: we import all the required libraries and then start our Spark session. After importing everything required, we create our case class with whatever fields are needed (you can notice them), and then we define our parse-stock schema, because we have already learned how to create a schema, so we create the stock table schema this way. Then we define our parseRDD helper; for the parse RDD, if you notice, here is where we create it.
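A hedged sketch of that setup; the Stock fields follow the columns mentioned above (date, open, high, low, close, volume, adjusted close), while the helper names, column names, and the CSV file name are assumptions rather than the video's exact code:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// One row of daily stock data
case class Stock(dt: String, openprice: Double, highprice: Double,
                 lowprice: Double, closeprice: Double,
                 volume: Double, adjcloseprice: Double)

// Parse one CSV line into a Stock record
def parseStock(line: String): Stock = {
  val c = line.split(",")
  Stock(c(0), c(1).toDouble, c(2).toDouble, c(3).toDouble,
        c(4).toDouble, c(5).toDouble, c(6).toDouble)
}

// Drop the header line and parse the rest of the file
def parseRDD(rdd: RDD[String]): RDD[Stock] = {
  val header = rdd.first()
  rdd.filter(_ != header).map(parseStock)
}

val spark = SparkSession.builder().appName("StockAnalysis").getOrCreate()
import spark.implicits._

// Read one company's CSV into a DataFrame (file name is illustrative)
val stockAAON = parseRDD(spark.sparkContext.textFile("AAON.csv")).toDF()
stockAAON.show()
```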
So with that parseRDD we create the parsed RDD, removing the header lines from it as well, and then we read our CSV file for the first company into a stock DataFrame; you can see that we read the file and convert it into a DataFrame by passing it through the RDD. Once we are done, if we want to print the output we can do it with the help of the show API. Once we are done with this, say we want to display the average adjusted closing price for that company for every month: we can do all of that with a select query, so we call select on the DataFrame and pass whatever expressions are required to get the average. You can notice that inside the select we also create aliases; for example, for the date column we create aliases for the year and month, and we show the output as well. Next, what we are going to do is check where the closing price for Microsoft went up by 2 or more, that is, wherever the closing price minus the opening price is greater than 2, and we want to get that output and display the result. You can notice that wherever the difference is greater than 2 we get the value, and we hit a SQL query to do that, on the stocksMSFT DataFrame.
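A sketch of those two queries, assuming stockMSFT was built the same way as stockAAON, that the dt column is a yyyy-MM-dd string, and that the column names match the earlier sketch:

```scala
import org.apache.spark.sql.functions.{avg, col, month, year}

// Average adjusted closing price per year and month, with aliased columns
stockAAON
  .select(year(col("dt")).alias("yr"), month(col("dt")).alias("mo"), col("adjcloseprice"))
  .groupBy("yr", "mo")
  .agg(avg("adjcloseprice").alias("avg_adjclose"))
  .orderBy("yr", "mo")
  .show()

// Days on which Microsoft closed at least 2 above its open
stockMSFT.createOrReplaceTempView("stocksMSFT")
spark.sql(
  "SELECT dt, openprice, closeprice FROM stocksMSFT WHERE closeprice - openprice > 2"
).show()
```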
That is the DataFrame we created, and on it we put our query with the condition that has to be true: the closing price minus the opening price should be greater than 2. Say at close the stock price was 100 US dollars and in the morning it opened at, say, 98 US dollars; wherever there is a difference of 2 or more, that is the only output we want to get, and that is what we do here. Once we are done with that, what do we do next? We use a join operation: we will be joining the stocks of the two companies in order to compare their closing prices, because we want to compare the prices. First of all, we create a combined view of these stocks and then display the joined rows. Look at what we do: we use Spark SQL, and if you look at it closely, we hit the SQL query, we say which tables we select from, and here we use the join operation; you can see the join, we join on the date, and at the end we output it. So here you can do a comparison of the closing prices for these stocks, and you can also include more companies; right now we have just shown an example with two companies, but you can do it for more as well. In this case, if you notice, we then write the result out in Parquet file format and save it into a particular location, so we create a joined-stock Parquet output.
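A rough sketch of that join-and-save step; the view names, column aliases, and output path are assumptions carried over from the previous sketches:

```scala
// Register the first company's DataFrame as a view as well
stockAAON.createOrReplaceTempView("stocksAAON")

// Join the two companies on the date so their closing prices line up side by side
val joinClose = spark.sql(
  """SELECT a.dt, a.adjcloseprice AS aaon_close, m.adjcloseprice AS msft_close
     FROM stocksAAON a
     JOIN stocksMSFT m ON a.dt = m.dt""")
joinClose.show()

// Persist the joined result as Parquet (a directory, not human readable)
joinClose.write.parquet("joinstock.parquet")
```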
3055 02:03:47,135 --> 02:03:49,869 So we are storing it as a Parquet file format and here 3056 02:03:49,869 --> 02:03:51,705 if you want to read it we can read 3057 02:03:51,705 --> 02:03:52,800 that and show the output, 3058 02:03:52,800 --> 02:03:55,300 but whatever file you have saved as a Parquet 3059 02:03:55,300 --> 02:03:57,900 file, definitely you will not be able to open it up as plain text 3060 02:03:57,900 --> 02:04:00,700 because that file is going to be in the Parquet format, 3061 02:04:00,800 --> 02:04:03,900 and Parquet files are binary files which you can never read directly. 3062 02:04:03,900 --> 02:04:05,900 You will only be able to read them back through Spark now, 3063 02:04:05,900 --> 02:04:08,382 so you will be seeing this average closing price per year. 3064 02:04:08,382 --> 02:04:10,631 I'm going to show you all these things running as well, 3065 02:04:10,631 --> 02:04:13,181 not just explaining you how things will be run 3066 02:04:13,181 --> 02:04:13,900 and what we're doing up here. 3067 02:04:13,900 --> 02:04:15,900 So I will be showing you all these things 3068 02:04:15,900 --> 02:04:17,100 in execution as well. 3069 02:04:17,200 --> 02:04:18,200 Now in this case, 3070 02:04:18,200 --> 02:04:20,100 if you notice what we are doing again, 3071 02:04:20,100 --> 02:04:21,907 we are creating our data frame here. 3072 02:04:21,907 --> 02:04:24,800 Again, we are executing our query on whatever table we have. 3073 02:04:24,800 --> 02:04:26,300 We are executing on top of it. 3074 02:04:26,300 --> 02:04:27,050 So in this case, 3075 02:04:27,050 --> 02:04:29,650 because we want to find the average closing price per year, 3076 02:04:29,650 --> 02:04:31,300 what we are doing in this case is 3077 02:04:31,300 --> 02:04:33,800 we are going to create a new table containing 3078 02:04:33,800 --> 02:04:37,700 the average closing price of each of these companies first 3079 02:04:37,700 --> 02:04:40,319 and then we are going to display this new table. 3080 02:04:40,319 --> 02:04:41,369 So in the end, 3081 02:04:41,369 --> 02:04:42,800 we are going to register this table 3082 02:04:42,800 --> 02:04:43,900 as a temporary table 3083 02:04:43,900 --> 02:04:46,515 so that we can execute our SQL queries on top of it. 3084 02:04:46,515 --> 02:04:47,328 So in this case, 3085 02:04:47,328 --> 02:04:49,828 you can notice that we are creating this new table. 3086 02:04:49,828 --> 02:04:50,900 And in this new table, 3087 02:04:50,900 --> 02:04:52,900 we are putting our SQL query, right, 3088 02:04:52,900 --> 02:04:53,711 and that SQL query 3089 02:04:53,711 --> 02:04:56,300 is going to contain the average closing price, so 3090 02:04:56,300 --> 02:05:00,100 the SQL query is finding out the average closing price 3091 02:05:00,100 --> 02:05:03,100 of all these companies, and that is what we have now.
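A rough sketch of reading that Parquet file back through Spark and building the average-closing-price-per-year table could look like this; the path, view names and column names follow the placeholders used above.

  // read the saved Parquet file back and register it as a temporary view
  val joinedBack = spark.read.parquet("/path/to/joinstock")
  joinedBack.createOrReplaceTempView("joinstock")

  // new table with the average closing price per year for each company
  val companyAvgDF = spark.sql("""
    SELECT year(dt) AS yr,
           avg(aaon_close) AS avg_aaon,
           avg(abax_close) AS avg_abax
    FROM joinstock
    GROUP BY year(dt)
    ORDER BY yr""")

  companyAvgDF.createOrReplaceTempView("companyAvg")   // register it for later SQL queries
  companyAvgDF.show()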
3092 02:05:03,100 --> 02:05:05,688 We are going to apply the transformation step 3093 02:05:05,688 --> 02:05:07,488 not transformation of this new table, 3094 02:05:07,488 --> 02:05:09,188 which we have created with the year 3095 02:05:09,188 --> 02:05:11,100 and the corresponding three company data 3096 02:05:11,100 --> 02:05:13,400 what we have created into the The company 3097 02:05:13,400 --> 02:05:15,103 or table select which you can notice 3098 02:05:15,103 --> 02:05:17,100 that we are creating this company or table 3099 02:05:17,100 --> 02:05:18,247 and here first of all, 3100 02:05:18,247 --> 02:05:20,725 we are going to create a transform table company 3101 02:05:20,725 --> 02:05:23,413 or and going to display the output so you can notice 3102 02:05:23,413 --> 02:05:25,100 that we are hitting the SQL query 3103 02:05:25,100 --> 02:05:27,900 and in the end we have printing this output similarly 3104 02:05:27,900 --> 02:05:29,975 if we want to let's say compute the best 3105 02:05:29,975 --> 02:05:31,597 of average close we can do that. 3106 02:05:31,597 --> 02:05:33,618 So in this case again the same way now, 3107 02:05:33,618 --> 02:05:35,800 if once they have learned the basic stuff, 3108 02:05:35,800 --> 02:05:37,426 you can notice that everything 3109 02:05:37,426 --> 02:05:40,400 is following a similar approach now in this case also, 3110 02:05:40,400 --> 02:05:43,200 we want to find out let's say the best of the average 3111 02:05:43,200 --> 02:05:46,100 So we are creating this best company here now. 3112 02:05:46,100 --> 02:05:49,500 It should contain the best average closing price of an MX 3113 02:05:49,500 --> 02:05:52,700 and first so we can just get this greatest and all battery. 3114 02:05:52,700 --> 02:05:53,400 So we creating 3115 02:05:53,400 --> 02:05:56,675 that then after that we are going to display this output 3116 02:05:56,675 --> 02:05:59,846 and we will be again registering it as a temporary table now, 3117 02:05:59,846 --> 02:06:02,700 once we have done that then we can hit our queries now, 3118 02:06:02,700 --> 02:06:04,350 so we want to check let's say best 3119 02:06:04,350 --> 02:06:05,600 performing company per year. 3120 02:06:05,600 --> 02:06:07,200 Now what we have to do for that. 3121 02:06:07,200 --> 02:06:09,319 So we are creating the final table in which 3122 02:06:09,319 --> 02:06:10,400 we are going to compute 3123 02:06:10,400 --> 02:06:13,200 all the things we are going to perform the join or not. 3124 02:06:13,200 --> 02:06:16,082 So although SQL query we are going to perform here 3125 02:06:16,082 --> 02:06:17,200 in order to compute 3126 02:06:17,200 --> 02:06:19,500 that which company is doing the best 3127 02:06:19,500 --> 02:06:21,250 and then we are going to display the output. 3128 02:06:21,250 --> 02:06:23,800 So this is what the output is going showing up here. 3129 02:06:23,800 --> 02:06:25,850 We are again storing as a comparative View 3130 02:06:25,850 --> 02:06:28,000 and here again the same story of correlation 3131 02:06:28,000 --> 02:06:29,400 what we're going to do here. 3132 02:06:29,400 --> 02:06:32,843 So now we will be using our statistics libraries to find 3133 02:06:32,843 --> 02:06:36,400 the correlation between Anand epochs companies closing price. 3134 02:06:36,400 --> 02:06:38,300 So that is what we are going to do now. 3135 02:06:38,300 --> 02:06:41,088 So correlation in finance and the investment 3136 02:06:41,088 --> 02:06:43,079 and industries is a statistics. 
3137 02:06:43,079 --> 02:06:44,300 Measures the degree 3138 02:06:44,300 --> 02:06:47,564 to which to Securities move in relation to each other. 3139 02:06:47,564 --> 02:06:49,625 So the closer the correlation is 3140 02:06:49,625 --> 02:06:52,200 to be 1 this is going to be a better one. 3141 02:06:52,200 --> 02:06:53,722 So it is always like 3142 02:06:53,722 --> 02:06:57,300 how to variables are correlated with each other. 3143 02:06:57,300 --> 02:07:01,400 Let's say your H is highly correlated to your salary, 3144 02:07:01,400 --> 02:07:05,000 but you're earning like when you are young you usually 3145 02:07:05,000 --> 02:07:06,400 unless and when you 3146 02:07:06,400 --> 02:07:09,500 are more Edge definitely you will be earning more 3147 02:07:09,500 --> 02:07:12,811 because you will be more mature similar way I can say that. 3148 02:07:12,811 --> 02:07:16,400 Your salary is also dependent on your education qualification. 3149 02:07:16,400 --> 02:07:18,815 And also on the premium Institute from where you 3150 02:07:18,815 --> 02:07:20,149 have done your education. 3151 02:07:20,149 --> 02:07:21,751 Let's say if you are from IIT, 3152 02:07:21,751 --> 02:07:24,100 or I am definitely your salary will be higher 3153 02:07:24,100 --> 02:07:25,300 from any other campuses. 3154 02:07:25,300 --> 02:07:26,100 Right Miss. 3155 02:07:26,100 --> 02:07:27,072 It's a probability. 3156 02:07:27,072 --> 02:07:28,300 We what I'm telling you. 3157 02:07:28,300 --> 02:07:28,900 So let's say 3158 02:07:28,900 --> 02:07:32,132 if I have to correlate now in this case the education 3159 02:07:32,132 --> 02:07:35,600 and the salary but I can easily create a correlation, right? 3160 02:07:35,600 --> 02:07:37,300 So that is what the correlation go. 3161 02:07:37,300 --> 02:07:38,589 So we are going to do all 3162 02:07:38,589 --> 02:07:40,573 that with respect to Overstock analysis. 3163 02:07:40,573 --> 02:07:41,869 Now now what we are doing 3164 02:07:41,869 --> 02:07:45,185 in this case, so You can notice we are creating this series one 3165 02:07:45,185 --> 02:07:47,188 where we heading the select query now, 3166 02:07:47,188 --> 02:07:49,401 we are mapping all this an enclosed price. 3167 02:07:49,401 --> 02:07:52,400 We are converting to a DD similar way for Series 2. 3168 02:07:52,400 --> 02:07:53,691 Also we are doing that right. 3169 02:07:53,691 --> 02:07:55,832 So this is we are doing for rabbits or earlier. 3170 02:07:55,832 --> 02:07:58,600 We have done it for an enclosed and then in the end we 3171 02:07:58,600 --> 02:08:00,911 are using the statistics dot core to create 3172 02:08:00,911 --> 02:08:02,500 a correlation between them. 3173 02:08:02,600 --> 02:08:06,200 So you can notice this is how we can execute everything now. 3174 02:08:06,200 --> 02:08:10,353 Let's go to our VM and see everything in our execution. 3175 02:08:11,142 --> 02:08:12,757 Question from at all. 3176 02:08:12,900 --> 02:08:15,300 So this VM how we will be getting you 3177 02:08:15,300 --> 02:08:17,659 will be getting all this VM from a director. 3178 02:08:17,659 --> 02:08:19,815 So you need not worry about all that but 3179 02:08:19,815 --> 02:08:21,930 that how I will be getting all this p.m. 3180 02:08:21,930 --> 02:08:24,100 In a so a once you enroll for the courses 3181 02:08:24,100 --> 02:08:27,300 and also you will be getting all this came from that Erika said 3182 02:08:27,300 --> 02:08:28,541 so even if I am working 3183 02:08:28,541 --> 02:08:30,711 on Mac operating system my VM will work. 
3184 02:08:30,711 --> 02:08:32,300 Yes every operating system. 3185 02:08:32,300 --> 02:08:33,535 It will be supported. 3186 02:08:33,535 --> 02:08:35,592 So no trouble you can just use any sort 3187 02:08:35,592 --> 02:08:38,428 of VM in all means any operating system to do that. 3188 02:08:38,428 --> 02:08:41,000 So what I would occur do is they just don't want 3189 02:08:41,000 --> 02:08:43,900 You to be troubled in any sort of stuff here. 3190 02:08:43,900 --> 02:08:46,076 So what they do is they kind of ensure 3191 02:08:46,076 --> 02:08:48,342 that whatever is required for your practicals. 3192 02:08:48,342 --> 02:08:49,400 They take care of it. 3193 02:08:49,400 --> 02:08:51,700 That's the reason they have created their own VM, 3194 02:08:51,700 --> 02:08:54,600 which is also going to be a lower size and compassion 3195 02:08:54,600 --> 02:08:56,100 to Cloudera hortonworks VM 3196 02:08:56,100 --> 02:08:58,997 and this is going to definitely be more helpful for you. 3197 02:08:58,997 --> 02:09:01,000 So all these things will be provided to 3198 02:09:01,000 --> 02:09:02,524 you question from nothing. 3199 02:09:02,524 --> 02:09:05,900 So all this project I am going to learn from the sessions. 3200 02:09:05,900 --> 02:09:06,200 Yes. 3201 02:09:06,200 --> 02:09:09,650 So once you enroll for so right now whatever we have seen 3202 02:09:09,650 --> 02:09:13,100 definitely we have just Otten upper level of view of this 3203 02:09:13,100 --> 02:09:15,350 how the session looks like for a purchase. 3204 02:09:15,350 --> 02:09:18,700 But but when we actually teach all these things in the course, 3205 02:09:18,700 --> 02:09:21,587 it's usually are much more in the detailed format. 3206 02:09:21,587 --> 02:09:22,700 So in detail format, 3207 02:09:22,700 --> 02:09:25,300 we kind of keep on showing you each step in detail 3208 02:09:25,300 --> 02:09:28,299 that how the things are working even including the project. 3209 02:09:28,299 --> 02:09:30,900 So you will be also learning with the help of project 3210 02:09:30,900 --> 02:09:32,157 on each different topic. 3211 02:09:32,157 --> 02:09:34,200 So that is the way we kind of go for it. 3212 02:09:34,200 --> 02:09:36,605 Now if I am stuck in any other project then 3213 02:09:36,605 --> 02:09:37,985 who will be helping me 3214 02:09:37,985 --> 02:09:40,308 so they will be a support team 24 by 7 3215 02:09:40,308 --> 02:09:42,046 if Get stuck at any moment. 3216 02:09:42,046 --> 02:09:44,300 You need to just give a call and kit 3217 02:09:44,300 --> 02:09:45,900 and a call or email. 3218 02:09:45,900 --> 02:09:49,076 There is a support ticket and immediately the technical 3219 02:09:49,076 --> 02:09:52,100 team will be helping across the support team is 24 by 7. 3220 02:09:52,100 --> 02:09:53,900 They are they are all technical people 3221 02:09:53,900 --> 02:09:55,821 and they will be assisting you across on all 3222 02:09:55,821 --> 02:09:58,100 that even the trainers will be assisting you for any 3223 02:09:58,100 --> 02:10:00,000 of the technical query great. 3224 02:10:00,000 --> 02:10:00,400 Awesome. 3225 02:10:00,800 --> 02:10:01,900 Thank you now. 3226 02:10:01,900 --> 02:10:03,700 So if you notice this is my data 3227 02:10:03,700 --> 02:10:06,446 we have we were executing all the things on this data. 3228 02:10:06,446 --> 02:10:08,726 Now what we want to do if you notice this is 3229 02:10:08,726 --> 02:10:10,900 the same code which I have just shown you. 3230 02:10:10,900 --> 02:10:13,800 Earlier also now let us just execute this code. 
3231 02:10:13,800 --> 02:10:15,481 So in order to execute this 3232 02:10:15,481 --> 02:10:18,345 what we can do we can connect to my spa action. 3233 02:10:18,345 --> 02:10:20,400 So let's get connected to suction. 3234 02:10:21,700 --> 02:10:23,970 Someone's will be connected to Spur action. 3235 02:10:23,970 --> 02:10:25,382 We will go step by step. 3236 02:10:25,382 --> 02:10:27,700 So first we will be importing our package. 3237 02:10:31,400 --> 02:10:34,861 This take some time let it just get connected. 3238 02:10:36,300 --> 02:10:38,400 Once this is connected now, 3239 02:10:38,400 --> 02:10:39,400 you can notice 3240 02:10:39,400 --> 02:10:42,400 that I'm just importing all the all the important libraries 3241 02:10:42,400 --> 02:10:44,400 we have already learned about that. 3242 02:10:45,800 --> 02:10:49,137 After that, you will be initialising your spark session. 3243 02:10:49,137 --> 02:10:49,805 So let's do 3244 02:10:49,805 --> 02:10:52,900 that again the same steps what you have done before. 3245 02:10:58,600 --> 02:10:59,922 Once we will be done. 3246 02:10:59,922 --> 02:11:02,000 We will be creating a stock class. 3247 02:11:07,000 --> 02:11:09,900 We could have also directly executed from Eclipse. 3248 02:11:09,900 --> 02:11:11,400 Also, this is just I want 3249 02:11:11,400 --> 02:11:13,800 to show you step-by-step whatever we have learnt. 3250 02:11:13,800 --> 02:11:15,700 So now you can see for company one and then 3251 02:11:15,700 --> 02:11:16,700 if you want to do 3252 02:11:16,700 --> 02:11:20,000 some computation we want to even see the values and all right, 3253 02:11:20,000 --> 02:11:21,600 so that's what we're doing here. 3254 02:11:21,700 --> 02:11:24,700 So if we are just getting the files creating another did, 3255 02:11:24,700 --> 02:11:26,800 you know, so let's execute this. 3256 02:11:28,500 --> 02:11:31,200 Similarly for your a back similarly for your fast 3257 02:11:31,200 --> 02:11:34,050 for all this so I'm just copying all these things together 3258 02:11:34,050 --> 02:11:36,100 because there are a lot of companies for which we 3259 02:11:36,100 --> 02:11:37,400 have to do all this step. 3260 02:11:37,400 --> 02:11:39,625 So let's bring it for all the 10 companies 3261 02:11:39,625 --> 02:11:41,200 which we are going to create. 3262 02:11:49,000 --> 02:11:49,900 So as you can see, 3263 02:11:49,900 --> 02:11:52,400 this print scheme has giving it output right now. 3264 02:11:52,400 --> 02:11:52,900 Similarly. 3265 02:11:52,900 --> 02:11:55,800 I can execute for a rest of the things as well. 3266 02:11:55,800 --> 02:11:57,800 So this is just giving you the similar way. 3267 02:11:57,800 --> 02:12:01,702 All the outputs will be shown up here company for company V 3268 02:12:01,702 --> 02:12:05,000 all these companies you can see this in execution. 3269 02:12:08,000 --> 02:12:11,000 After that, we will be creating our temporary view 3270 02:12:11,000 --> 02:12:13,800 so that we can execute our SQL queries. 3271 02:12:16,500 --> 02:12:19,700 So let's do it for complaint and also then after that we 3272 02:12:19,700 --> 02:12:22,900 can just create a work all over temporary table for it. 3273 02:12:22,900 --> 02:12:25,200 Once we are done now we can do our queries. 3274 02:12:25,200 --> 02:12:27,357 Like let's say we can display the average 3275 02:12:27,357 --> 02:12:30,000 of existing closing price for and and for each one 3276 02:12:30,000 --> 02:12:31,400 so we can hit this query. 
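The shell steps being described would look roughly like the sketch below: the imports, a session, a case class for the stock rows, reading one company's file, and printing its schema. The column layout, class name and path are assumed for illustration; in spark-shell a SparkSession named spark already exists, so building one is shown only for completeness.

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("StockAnalysis").getOrCreate()
  import spark.implicits._

  // a case class describing one row of stock data (assumed columns)
  case class Stock(dt: String, open: Double, high: Double, low: Double,
                   close: Double, volume: Double, adjclose: Double)

  // read the file for company one; the same step is repeated for every company
  val company1DF = spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/path/to/company1.csv")

  company1DF.printSchema()                        // the schema output seen in the demo
  company1DF.createOrReplaceTempView("company1")  // temporary view for the SQL queries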
3277 02:12:34,700 --> 02:12:37,500 So all these queries will happen on your temporary view 3278 02:12:37,600 --> 02:12:39,800 because we cannot anyway to all these queries 3279 02:12:39,800 --> 02:12:41,471 on our data frames are out 3280 02:12:41,471 --> 02:12:44,300 so you can see this this is getting executed. 3281 02:12:45,500 --> 02:12:49,200 Trying it out to Tulsa now because they've done dot shoe. 3282 02:12:49,200 --> 02:12:51,237 That's the reason you're getting this output. 3283 02:12:51,237 --> 02:12:51,700 Similarly. 3284 02:12:51,700 --> 02:12:55,600 If we want to let's say list the closing price for msft 3285 02:12:55,600 --> 02:12:57,600 which went up more than $2 way. 3286 02:12:57,600 --> 02:12:58,794 So that query also we 3287 02:12:58,794 --> 02:13:02,500 can execute now we have already understood this query in detail. 3288 02:13:03,100 --> 02:13:05,300 It is seeing is execution partner 3289 02:13:05,500 --> 02:13:08,100 so that you can appreciate whatever you have learned. 3290 02:13:08,300 --> 02:13:10,700 See this is the output showing up to you. 3291 02:13:10,800 --> 02:13:12,300 Now after that 3292 02:13:12,300 --> 02:13:15,723 how you can join all the stack closing price right similar way 3293 02:13:15,723 --> 02:13:18,966 how we can save the joint view in the packet for table. 3294 02:13:18,966 --> 02:13:20,435 You want to read that back. 3295 02:13:20,435 --> 02:13:22,157 You want to create a new table 3296 02:13:22,157 --> 02:13:25,275 like so let's execute all these three queries together 3297 02:13:25,275 --> 02:13:27,100 because we have already seen this. 3298 02:13:29,700 --> 02:13:30,502 Look at this. 3299 02:13:30,502 --> 02:13:31,800 So this in this case, 3300 02:13:31,800 --> 02:13:34,300 we are doing the drawing class basing this output. 3301 02:13:34,300 --> 02:13:36,499 Then we want to save it in the package files. 3302 02:13:36,499 --> 02:13:39,100 We are saving it and we want to again reiterate back. 3303 02:13:39,100 --> 02:13:40,893 Then we are creating our new table, right? 3304 02:13:40,893 --> 02:13:42,043 We were doing that join 3305 02:13:42,043 --> 02:13:44,200 and on so that is what we are doing in this case. 3306 02:13:44,200 --> 02:13:45,900 Then you want to see this output. 3307 02:13:47,700 --> 02:13:50,400 Then we are against touring as a temp table or not. 3308 02:13:50,499 --> 02:13:50,700 Now. 3309 02:13:50,700 --> 02:13:53,700 Once we are done with this step also then what so we 3310 02:13:53,700 --> 02:13:55,400 have done it in Step 6. 3311 02:13:55,400 --> 02:13:56,900 Now we want to perform. 3312 02:13:56,900 --> 02:13:58,488 Let's have a transformation 3313 02:13:58,488 --> 02:14:01,000 on new table corresponding to the three companies 3314 02:14:01,000 --> 02:14:03,411 so that we can compare we want to create 3315 02:14:03,411 --> 02:14:06,305 the best company containing the best average closing price 3316 02:14:06,305 --> 02:14:07,748 for all these three companies. 3317 02:14:07,748 --> 02:14:09,300 We want to find the companies 3318 02:14:09,300 --> 02:14:11,600 but the best closing price average per year. 3319 02:14:11,600 --> 02:14:13,200 So let's do all that as well. 3320 02:14:18,800 --> 02:14:22,343 So you can see best company of the year now here also 3321 02:14:22,343 --> 02:14:26,500 the same stuff we are doing to be registering over temp table. 3322 02:14:34,100 --> 02:14:35,700 Okay, so there's a mistake here. 
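For the best-average and best-company-per-year step, a minimal sketch using the companyAvg view from the earlier snippet and the SQL greatest() function might look like this; the view names and tickers are, again, placeholders.

  // best average closing price per year across the companies
  val bestDF = spark.sql("""
    SELECT yr, greatest(avg_aaon, avg_abax) AS best_avg_close
    FROM companyAvg""")
  bestDF.createOrReplaceTempView("bestCompanyPerYear")
  bestDF.show()

  // join back to see which company produced that best average in each year
  spark.sql("""
    SELECT c.yr,
           CASE WHEN c.avg_aaon >= c.avg_abax THEN 'AAON' ELSE 'ABAX' END AS best_company,
           b.best_avg_close
    FROM companyAvg c JOIN bestCompanyPerYear b ON c.yr = b.yr
    ORDER BY c.yr""").show()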
3323 02:14:35,700 --> 02:14:38,096 So if you notice here it is 1 3324 02:14:38,100 --> 02:14:40,722 but here we are doing a show of all right, 3325 02:14:40,722 --> 02:14:42,129 so there is a mistake. 3326 02:14:42,129 --> 02:14:43,600 I'm just correcting it. 3327 02:14:45,000 --> 02:14:48,300 So here also it should be 1 I'm just updating 3328 02:14:48,300 --> 02:14:51,300 in the sheet itself so that it will start working now. 3329 02:14:51,300 --> 02:14:53,102 So here I have just made it one. 3330 02:14:53,102 --> 02:14:55,300 So now after that it will start working. 3331 02:14:55,300 --> 02:14:59,600 Okay, wherever it is going to be all I have to make it one. 3332 02:15:00,400 --> 02:15:03,500 So that is the change which I need to do here also. 3333 02:15:04,400 --> 02:15:06,700 And you will notice it will start working. 3334 02:15:06,900 --> 02:15:09,433 So here also you need to make it one. 3335 02:15:09,433 --> 02:15:10,748 So all those places 3336 02:15:10,748 --> 02:15:14,363 where ever it was so just kind of a good point to make 3337 02:15:14,363 --> 02:15:18,388 so wherever you are working on this we need to always ensure 3338 02:15:18,388 --> 02:15:21,800 that all these values what you are putting up here. 3339 02:15:21,800 --> 02:15:25,900 Okay, so I could have also done it like this one second. 3340 02:15:26,300 --> 02:15:27,876 In fact in this place. 3341 02:15:27,876 --> 02:15:30,600 I need not do all this step one second. 3342 02:15:30,600 --> 02:15:33,842 Let me explain you also why no in this place. 3343 02:15:33,842 --> 02:15:37,600 It's So see from here this error started opening why 3344 02:15:37,600 --> 02:15:38,758 because my data frame 3345 02:15:38,758 --> 02:15:40,500 what I have created here most one. 3346 02:15:40,500 --> 02:15:41,500 Let's execute it. 3347 02:15:41,500 --> 02:15:43,500 Now, you will notice this Quest artwork. 3348 02:15:44,340 --> 02:15:45,659 See this is working. 3349 02:15:46,000 --> 02:15:46,300 Now. 3350 02:15:46,300 --> 02:15:47,000 After that. 3351 02:15:47,000 --> 02:15:49,493 I am creating a temp table that temp table. 3352 02:15:49,493 --> 02:15:52,400 What we are creating is let's say company on okay. 3353 02:15:52,400 --> 02:15:55,100 So this is the temp table which we have created. 3354 02:15:55,100 --> 02:15:57,808 You can see this company now in this case 3355 02:15:57,808 --> 02:16:01,300 if I am keeping this company on itself it is going to work. 3356 02:16:02,000 --> 02:16:03,195 Because here anyway, 3357 02:16:03,195 --> 02:16:05,897 I'm going to use the whatever temporary table 3358 02:16:05,897 --> 02:16:07,310 we have created, right? 3359 02:16:07,310 --> 02:16:08,600 So now let's execute. 3360 02:16:10,800 --> 02:16:12,700 So you can see now it started book. 3361 02:16:14,000 --> 02:16:15,900 No further to that now, 3362 02:16:15,900 --> 02:16:18,500 we want to create a correlation between them 3363 02:16:18,500 --> 02:16:19,600 so we can do that. 3364 02:16:23,700 --> 02:16:26,400 See this is going to give me the correlation 3365 02:16:26,400 --> 02:16:30,500 between the two column names and so that we can see here. 3366 02:16:30,700 --> 02:16:34,445 So this is the correlation the more it is closer to 1 means the 3367 02:16:34,445 --> 02:16:37,950 better it is it means definitely it is near to 1 it is 0.9, 3368 02:16:37,950 --> 02:16:39,400 which is a bigger value. 
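To recap that correlation step in code, here is a minimal sketch with the MLlib statistics library, assuming the per-company views registered earlier, that close was read as a numeric column, and that both series have the same number of trading days; the view and column names are placeholders.

  import org.apache.spark.mllib.stat.Statistics
  import org.apache.spark.rdd.RDD

  // pull each company's closing prices out as an RDD[Double];
  // coalesce(1) keeps the two series aligned for the sketch
  val series1: RDD[Double] = spark.sql("SELECT close FROM stocksAAON ORDER BY dt")
    .rdd.map(_.getDouble(0)).coalesce(1)
  val series2: RDD[Double] = spark.sql("SELECT close FROM stocksABAX ORDER BY dt")
    .rdd.map(_.getDouble(0)).coalesce(1)

  // Pearson correlation between the two closing-price series
  val corr = Statistics.corr(series1, series2, "pearson")
  println(f"Correlation between the two closing prices: $corr%.4f")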
3369 02:16:39,400 --> 02:16:42,700 So definitely it is going to be much they both are 3370 02:16:42,700 --> 02:16:45,700 highly correlated means definitely they are impacting 3371 02:16:45,700 --> 02:16:47,300 each other stock price. 3372 02:16:47,400 --> 02:16:49,700 So this is all about the project 3373 02:16:49,700 --> 02:16:58,500 but Welcome to this interesting session of spots remaining 3374 02:16:58,673 --> 02:16:59,826 from and Erica. 3375 02:17:00,800 --> 02:17:02,261 What is pathogenic? 3376 02:17:02,261 --> 02:17:04,415 Is it like really important? 3377 02:17:04,500 --> 02:17:05,400 Definitely? 3378 02:17:05,400 --> 02:17:05,704 Yes. 3379 02:17:05,704 --> 02:17:07,001 Is it really hot? 3380 02:17:07,001 --> 02:17:07,600 Definitely? 3381 02:17:07,600 --> 02:17:08,100 Yes. 3382 02:17:08,100 --> 02:17:10,900 That's the reason we are learning this technology. 3383 02:17:10,900 --> 02:17:14,600 And this is one of the very sort things in the market 3384 02:17:14,600 --> 02:17:16,272 when it's a hot thing means 3385 02:17:16,272 --> 02:17:18,750 in terms of job market I'm talking about. 3386 02:17:18,750 --> 02:17:21,600 So let's see what will be our agenda for today. 3387 02:17:21,900 --> 02:17:25,500 So we are going to Gus about spark ecosystem 3388 02:17:25,500 --> 02:17:27,900 where we are going to see that okay, 3389 02:17:27,900 --> 02:17:28,700 what is pop 3390 02:17:28,700 --> 02:17:32,100 how smarts the main threats in the West Park ecosystem 3391 02:17:32,100 --> 02:17:35,631 wise path streaming we are going to have overview 3392 02:17:35,631 --> 02:17:39,900 of stock streaming kind of getting into the basics of that. 3393 02:17:39,900 --> 02:17:41,832 We will learn about these cream. 3394 02:17:41,832 --> 02:17:44,890 We will learn also about these theme Transformations. 3395 02:17:44,890 --> 02:17:46,800 We will be learning about caching 3396 02:17:46,800 --> 02:17:51,200 and persistence accumulators broadcast variables checkpoints. 3397 02:17:51,200 --> 02:17:53,600 These are like Advanced concept of paths. 3398 02:17:54,100 --> 02:17:55,600 And then in the end, 3399 02:17:55,600 --> 02:17:59,900 we will walk through a use case of Twitter sentiment analysis. 3400 02:18:00,500 --> 02:18:04,700 Now, what is streaming let's understand that. 3401 02:18:04,800 --> 02:18:08,000 So let me start by us example to you. 3402 02:18:08,600 --> 02:18:12,300 So let's see if there is a bank and in Bank. 3403 02:18:12,500 --> 02:18:13,082 Definitely. 3404 02:18:13,082 --> 02:18:14,200 I'm pretty sure all 3405 02:18:14,200 --> 02:18:18,700 of you must have views credit card debit card all those karts 3406 02:18:18,700 --> 02:18:20,900 what dance provide now, 3407 02:18:20,900 --> 02:18:23,500 let's say you have done a transaction. 3408 02:18:23,500 --> 02:18:27,300 From India just now and within an art 3409 02:18:27,300 --> 02:18:30,260 and edit your card is getting swept in u.s. 3410 02:18:30,260 --> 02:18:31,600 Is it even possible 3411 02:18:31,600 --> 02:18:35,801 for your car to vision and arduous definitely know now 3412 02:18:35,900 --> 02:18:38,100 how that bank will realize 3413 02:18:38,700 --> 02:18:41,000 that it is a fraud connection 3414 02:18:41,000 --> 02:18:44,600 because Bank cannot let that transition happen. 3415 02:18:44,700 --> 02:18:46,238 They need to stop it 3416 02:18:46,238 --> 02:18:49,771 at the time of when it is getting swiped either. 3417 02:18:49,771 --> 02:18:51,000 You can block it. 
3418 02:18:51,000 --> 02:18:52,800 Give a call to you ask you 3419 02:18:52,800 --> 02:18:55,394 whether It is a genuine transaction or not. 3420 02:18:55,394 --> 02:18:57,000 Do something of that sort. 3421 02:18:57,692 --> 02:18:58,000 Now. 3422 02:18:58,000 --> 02:19:00,300 Do you think they will put some manual person 3423 02:19:00,300 --> 02:19:01,127 behind the scene 3424 02:19:01,127 --> 02:19:03,300 that will be looking at all the transaction 3425 02:19:03,300 --> 02:19:05,100 and you will block it manually. 3426 02:19:05,100 --> 02:19:08,315 No, so they require something of the sort 3427 02:19:08,315 --> 02:19:11,100 where the data will be getting stream. 3428 02:19:11,100 --> 02:19:12,500 And at the real time 3429 02:19:12,500 --> 02:19:16,113 they should be able to catch with the help of some pattern. 3430 02:19:16,113 --> 02:19:17,851 They will do some processing 3431 02:19:17,851 --> 02:19:20,575 and they will get some pattern out of it with 3432 02:19:20,575 --> 02:19:23,305 if it is not sounding like a genuine transition. 3433 02:19:23,305 --> 02:19:26,649 They will immediately add a block it I'll give you a call 3434 02:19:26,649 --> 02:19:28,565 maybe send me an OTP to confirm 3435 02:19:28,565 --> 02:19:31,100 whether it's a genuine connection dot they 3436 02:19:31,100 --> 02:19:32,050 will not wait 3437 02:19:32,050 --> 02:19:36,000 till the next day to kind of complete that transaction. 3438 02:19:36,000 --> 02:19:38,941 Otherwise if what happened nobody is going to touch 3439 02:19:38,941 --> 02:19:40,000 that that right. 3440 02:19:40,000 --> 02:19:43,000 So that is the how we work on stomach. 3441 02:19:43,100 --> 02:19:46,300 Now someone have mentioned 3442 02:19:46,500 --> 02:19:51,400 that without stream processing of data is not even possible. 3443 02:19:51,400 --> 02:19:52,435 In fact, we can see 3444 02:19:52,435 --> 02:19:55,200 that there is no And big data which is possible. 3445 02:19:55,200 --> 02:19:57,900 We cannot even talk about internet of things. 3446 02:19:57,900 --> 02:20:00,800 Right and this this is a very famous statement 3447 02:20:00,800 --> 02:20:01,900 from Donna Saint 3448 02:20:01,900 --> 02:20:05,600 do from C equals 3 lot of companies 3449 02:20:05,700 --> 02:20:13,500 like YouTube Netflix Facebook Twitter iTunes topped Pandora. 3450 02:20:13,769 --> 02:20:17,230 All these companies are using spark screaming. 3451 02:20:17,700 --> 02:20:18,100 Now. 3452 02:20:19,100 --> 02:20:20,400 What is this? 3453 02:20:20,400 --> 02:20:23,580 We have just seen with an example to kind of got an idea. 3454 02:20:23,580 --> 02:20:25,000 Idea about steaming pack. 3455 02:20:25,100 --> 02:20:30,300 Now as I said with the time growing with the internet doing 3456 02:20:30,453 --> 02:20:35,146 these three main Technologies are becoming popular day by day. 3457 02:20:35,500 --> 02:20:39,300 It's a technique to transfer the data 3458 02:20:39,500 --> 02:20:45,000 so that it can be processed as a steady and continuous 3459 02:20:45,000 --> 02:20:47,000 drip means immediately 3460 02:20:47,000 --> 02:20:49,500 as and when the data is coming 3461 02:20:49,600 --> 02:20:52,900 you are continuously processing it as well. 
3462 02:20:53,600 --> 02:20:54,400 In fact, 3463 02:20:54,400 --> 02:20:58,938 this real-time streaming is what is driving to this big data 3464 02:20:59,100 --> 02:21:02,000 and also internet of things now, 3465 02:21:02,000 --> 02:21:04,786 they will be lot of things like fundamental unit 3466 02:21:04,786 --> 02:21:06,387 of streaming media streams. 3467 02:21:06,387 --> 02:21:08,700 We will also be Transforming Our screen. 3468 02:21:08,700 --> 02:21:09,700 We will be doing it. 3469 02:21:09,700 --> 02:21:10,994 In fact, the companies 3470 02:21:10,994 --> 02:21:13,400 are using it with their business intelligence. 3471 02:21:13,400 --> 02:21:16,200 We will see more details in further of the slides. 3472 02:21:16,300 --> 02:21:20,900 But before that we will be talking about spark ecosystem 3473 02:21:21,200 --> 02:21:23,500 when we talk about Spark mmm, 3474 02:21:23,500 --> 02:21:25,653 there are multiple libraries 3475 02:21:25,653 --> 02:21:29,565 which are present in a first one is pop frequent now 3476 02:21:29,565 --> 02:21:31,100 in spark SQL is like 3477 02:21:31,100 --> 02:21:35,000 when you can SQL Developer can write the query in SQL way 3478 02:21:35,000 --> 02:21:38,600 and it is going to get converted into a spark way 3479 02:21:38,600 --> 02:21:42,828 and then going to give you output kind of analogous to hide 3480 02:21:42,828 --> 02:21:46,400 but it is going to be faster in comparison to hide 3481 02:21:46,400 --> 02:21:48,252 when we talk about sports clinic 3482 02:21:48,252 --> 02:21:50,900 that is what we are going to learn it is going 3483 02:21:50,900 --> 02:21:55,300 to enable all the analytical and Practical applications 3484 02:21:55,600 --> 02:21:59,400 for your live streaming data M11. 3485 02:21:59,700 --> 02:22:02,400 Ml it is mostly for machine learning. 3486 02:22:02,400 --> 02:22:03,546 And in fact, 3487 02:22:03,546 --> 02:22:06,007 the interesting part about MLA is 3488 02:22:06,200 --> 02:22:11,100 that it is completely replacing mom invited are almost replaced. 3489 02:22:11,100 --> 02:22:13,500 Now all the core contributors 3490 02:22:13,500 --> 02:22:17,700 of Mahal have moved in two words the 3491 02:22:18,184 --> 02:22:19,800 towards the MLF thing 3492 02:22:19,800 --> 02:22:23,500 because of the faster response performance is really good. 3493 02:22:23,500 --> 02:22:26,707 In MLA Graphics Graphics. 3494 02:22:26,707 --> 02:22:27,005 Okay. 3495 02:22:27,005 --> 02:22:29,794 Let me give you example everybody must have used 3496 02:22:29,794 --> 02:22:31,100 Google Maps right now. 3497 02:22:31,100 --> 02:22:34,082 What you doing Google Map you search for the path. 3498 02:22:34,082 --> 02:22:36,600 You put your Source you put your destination. 3499 02:22:36,600 --> 02:22:38,900 Now when you just search for the part, 3500 02:22:39,000 --> 02:22:40,500 it's certainly different paths 3501 02:22:40,800 --> 02:22:45,100 and then provide you an optimal path right now 3502 02:22:45,300 --> 02:22:47,300 how it providing the optimal party. 3503 02:22:47,300 --> 02:22:50,500 These things can be done with the help of Graphics. 3504 02:22:50,500 --> 02:22:53,500 So wherever you can create a kind of a graphical stuff. 3505 02:22:53,500 --> 02:22:54,500 Up, we will say 3506 02:22:54,500 --> 02:22:56,997 that we can use Graphics spark up. 3507 02:22:56,997 --> 02:22:57,300 Now. 3508 02:22:57,300 --> 02:23:00,600 This is the kind of a package provided for art. 
3509 02:23:00,600 --> 02:23:02,538 So R is of Open Source, 3510 02:23:02,538 --> 02:23:05,000 which is mostly used by analysts 3511 02:23:05,000 --> 02:23:08,300 and now spark committee won't infect all 3512 02:23:08,300 --> 02:23:11,594 the analysts kind of to move towards the sparkling water. 3513 02:23:11,594 --> 02:23:12,900 And that's the reason 3514 02:23:12,900 --> 02:23:15,615 they have recently stopped supporting spark 3515 02:23:15,615 --> 02:23:17,226 on we are all the analysts 3516 02:23:17,226 --> 02:23:20,301 can now execute the query using spark environment 3517 02:23:20,301 --> 02:23:22,800 that's getting better performance and we 3518 02:23:22,800 --> 02:23:25,000 can also work on Big Data. 3519 02:23:25,200 --> 02:23:27,800 That's that's all about the ecosystem point 3520 02:23:27,800 --> 02:23:31,061 below this we are going to have a core engine for engine 3521 02:23:31,061 --> 02:23:34,500 is the one which defines all the basics of the participants 3522 02:23:34,500 --> 02:23:36,363 all the RGV related stuff 3523 02:23:36,363 --> 02:23:38,600 and not is going to be defined 3524 02:23:38,600 --> 02:23:43,300 in your staff for Engine moving further now, 3525 02:23:43,300 --> 02:23:46,227 so as we have just discussed this part we 3526 02:23:46,227 --> 02:23:49,767 are going to now discuss past screaming indicate 3527 02:23:49,767 --> 02:23:53,500 which is going to enable analytical and Interactive. 3528 02:23:53,600 --> 02:23:58,300 For live streaming data know Y is positive 3529 02:23:58,800 --> 02:24:01,400 if I talk about bias past him indefinitely. 3530 02:24:01,400 --> 02:24:04,230 We have just gotten after different is very important. 3531 02:24:04,230 --> 02:24:06,100 That's the reason we are learning it 3532 02:24:06,200 --> 02:24:09,804 but this is so powerful that it is used now 3533 02:24:09,804 --> 02:24:14,169 for the by lot of companies to perform their marketing they 3534 02:24:14,169 --> 02:24:15,900 kind of getting an idea 3535 02:24:15,900 --> 02:24:18,250 that what a customer is looking for. 3536 02:24:18,250 --> 02:24:22,094 In fact, we are going to learn a use case of similar to that 3537 02:24:22,094 --> 02:24:24,700 where we are going to to use pasta me now 3538 02:24:24,700 --> 02:24:28,283 where we are going to use a Twitter sentimental analysis, 3539 02:24:28,283 --> 02:24:31,100 which can be used for your crisis management. 3540 02:24:31,100 --> 02:24:33,680 Maybe you want to check all your products 3541 02:24:33,680 --> 02:24:35,100 on our behave service. 3542 02:24:35,100 --> 02:24:37,420 I just think target marketing 3543 02:24:37,500 --> 02:24:40,342 by all the companies around the world. 3544 02:24:40,342 --> 02:24:42,800 This is getting used in this way. 3545 02:24:42,817 --> 02:24:46,355 And that's the reason spark steaming is gaining 3546 02:24:46,355 --> 02:24:50,432 the popularity and because of its performance as well. 3547 02:24:50,600 --> 02:24:53,200 It is beeping on other platforms. 3548 02:24:53,600 --> 02:24:57,400 At the moment now moving further. 3549 02:24:57,600 --> 02:25:01,300 Let's eat Sparks training features when we talk 3550 02:25:01,300 --> 02:25:03,300 about Sparks training teachers. 3551 02:25:03,400 --> 02:25:05,100 It's very easy to scale. 3552 02:25:05,100 --> 02:25:07,420 You can scale to even multiple nodes 3553 02:25:07,420 --> 02:25:11,083 which can even run till hundreds of most speed is going 3554 02:25:11,083 --> 02:25:14,000 to be very quick means in a very short time. 
3555 02:25:14,000 --> 02:25:17,900 You can scream as well as processor data soil tolerant, 3556 02:25:17,900 --> 02:25:19,300 even it made sure 3557 02:25:19,300 --> 02:25:23,100 that even you're not losing your data integration. 3558 02:25:23,100 --> 02:25:26,600 You with your bash time and real-time processing is possible 3559 02:25:26,600 --> 02:25:30,446 and it can also be used for your business analytics 3560 02:25:30,500 --> 02:25:34,800 which is used to track the behavior of your customer. 3561 02:25:34,900 --> 02:25:38,700 So as you can see this is super polite and it's 3562 02:25:38,700 --> 02:25:43,000 like we are kind of getting to know so many interesting things 3563 02:25:43,000 --> 02:25:48,000 about this pasta me now next quickly have an overview 3564 02:25:48,000 --> 02:25:50,900 so that we can get some basics of spots. 3565 02:25:50,900 --> 02:25:53,200 Don't know let's understand. 3566 02:25:53,200 --> 02:25:54,300 Which box? 3567 02:25:55,100 --> 02:25:59,200 So as we have just discussed it is for real-time streaming data. 3568 02:25:59,600 --> 02:26:04,100 It is useful addition in your spark for API. 3569 02:26:04,100 --> 02:26:06,500 So we have already seen at the base level. 3570 02:26:06,500 --> 02:26:07,400 We have that spark 3571 02:26:07,400 --> 02:26:10,700 or in our ecosystem on top of that we have passed we 3572 02:26:10,700 --> 02:26:14,700 will impact Sparks claiming is kind of adding a lot 3573 02:26:14,700 --> 02:26:18,000 of advantage to spark Community 3574 02:26:18,000 --> 02:26:22,349 because a lot of people are only joining spark Community to kind 3575 02:26:22,349 --> 02:26:23,800 of use this pasta me. 3576 02:26:23,800 --> 02:26:25,000 It's so powerful. 3577 02:26:25,000 --> 02:26:26,344 Everyone wants to come 3578 02:26:26,344 --> 02:26:29,478 and want to use it because all the other Frameworks 3579 02:26:29,478 --> 02:26:30,809 which we already have 3580 02:26:30,809 --> 02:26:33,469 which are existing are not as good in terms 3581 02:26:33,469 --> 02:26:34,783 of performance in all 3582 02:26:34,783 --> 02:26:36,311 and and it's the easiness 3583 02:26:36,311 --> 02:26:38,482 of moving Sparks coming is also great 3584 02:26:38,482 --> 02:26:41,482 if you compare your program for let's say two orbits 3585 02:26:41,482 --> 02:26:44,100 from which is used for real-time processing. 3586 02:26:44,100 --> 02:26:46,356 You will notice that it is much easier 3587 02:26:46,356 --> 02:26:49,100 in terms of from a developer point of your ass 3588 02:26:49,100 --> 02:26:52,400 that that's the reason a lot of regular showing interest 3589 02:26:52,400 --> 02:26:53,800 in this domain now, 3590 02:26:53,800 --> 02:26:56,800 it will also enable Table of high throughput 3591 02:26:56,800 --> 02:26:58,187 and fault-tolerant 3592 02:26:58,187 --> 02:27:02,725 so that you to stream your data to process all the things up 3593 02:27:02,900 --> 02:27:06,900 and the fundamental unit Force past dreaming is going 3594 02:27:06,900 --> 02:27:08,200 to be District. 3595 02:27:08,300 --> 02:27:09,700 What is this thing? 3596 02:27:09,700 --> 02:27:10,600 Let me explain it. 3597 02:27:11,100 --> 02:27:14,200 So this dream is basically a series 3598 02:27:14,200 --> 02:27:18,900 of bodies to process the real-time data. 
3599 02:27:19,400 --> 02:27:21,100 What we generally do is 3600 02:27:21,100 --> 02:27:23,678 if you look at this slide, 3601 02:27:23,678 --> 02:27:25,300 when you get the data, 3602 02:27:25,400 --> 02:27:29,800 it is continuous data and you divide it into batches 3603 02:27:29,800 --> 02:27:31,200 of input data. 3604 02:27:31,400 --> 02:27:35,700 We are going to call each of these a micro-batch and then 3605 02:27:35,700 --> 02:27:39,447 we are going to get batches of processed data. Though 3606 02:27:39,447 --> 02:27:40,600 it is real time, 3607 02:27:40,600 --> 02:27:42,300 still, how come it is a batch? 3608 02:27:42,300 --> 02:27:44,547 Because definitely you are doing processing 3609 02:27:44,547 --> 02:27:46,258 on some part of the data, right? 3610 02:27:46,258 --> 02:27:48,300 Even if it is coming in at real time, 3611 02:27:48,300 --> 02:27:52,500 that is what we are going to call a micro-batch. 3612 02:27:53,600 --> 02:27:55,700 Moving further now. 3613 02:27:56,600 --> 02:27:59,100 Let's see a few more details on it. 3614 02:27:59,223 --> 02:28:02,300 Now, from where can you get all your data? 3615 02:28:02,300 --> 02:28:04,600 What can be your data sources here? 3616 02:28:04,600 --> 02:28:09,000 So if we talk about data sources here, we can stream the data 3617 02:28:09,000 --> 02:28:13,700 from multiple sources, like market data or past events. 3618 02:28:13,700 --> 02:28:16,586 You have databases like HBase, MongoDB, 3619 02:28:16,586 --> 02:28:20,051 which are, you know, NoSQL databases, Elasticsearch, Postgre 3620 02:28:20,051 --> 02:28:24,600 SQL, the Parquet file format. You can get all the data from here. 3621 02:28:24,600 --> 02:28:27,700 Now after that you can also do processing 3622 02:28:27,700 --> 02:28:29,553 with the help of machine learning. 3623 02:28:29,553 --> 02:28:32,700 You can do the processing with the help of your Spark SQL 3624 02:28:32,700 --> 02:28:34,800 and then give the output. 3625 02:28:34,900 --> 02:28:37,000 So this is a very strong thing, 3626 02:28:37,000 --> 02:28:40,100 that you are bringing the data in using Spark Streaming 3627 02:28:40,100 --> 02:28:41,964 but the processing you can do 3628 02:28:41,964 --> 02:28:44,800 by using some other frameworks as well. 3629 02:28:44,800 --> 02:28:47,514 Right, like machine learning you can apply on the data 3630 02:28:47,514 --> 02:28:49,549 that you're getting at real time. 3631 02:28:49,549 --> 02:28:51,966 You can also apply your Spark SQL on the data 3632 02:28:51,966 --> 02:28:53,200 which you're getting at 3633 02:28:53,200 --> 02:28:56,300 real time. Moving further. 3634 02:28:57,100 --> 02:29:00,089 So this is the single picture now in Spark Streaming: 3635 02:29:00,089 --> 02:29:03,200 you can just get the data from multiple sources 3636 02:29:03,200 --> 02:29:07,600 like Kafka, Flume, HDFS, Kinesis, Twitter, bringing it 3637 02:29:07,600 --> 02:29:10,300 into Spark Streaming, doing the processing 3638 02:29:10,300 --> 02:29:12,500 and storing it back to your HDFS. 3639 02:29:12,500 --> 02:29:15,900 Maybe you can bring it to your DB, you can also publish 3640 02:29:15,900 --> 02:29:17,400 to your UI dashboard. 3641 02:29:17,400 --> 02:29:21,402 Like Tableau, AngularJS; a lot of UI dashboards are there 3642 02:29:21,700 --> 02:29:25,100 in which you can publish your output now. 3643 02:29:25,500 --> 02:29:26,346 Alright folks, 3644 02:29:26,346 --> 02:29:29,782 let us just break it down into more fine-grained pieces.
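Before breaking that flow down step by step, here is a minimal end-to-end sketch in Scala of the same idea: read lines from a source, cut the stream into one-second micro-batches, process each batch, and push the result out. The socket host, port and output path are assumptions, not part of the original demo.

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf().setAppName("MicroBatchSketch").setMaster("local[2]")
  val ssc  = new StreamingContext(conf, Seconds(1))          // every second becomes one micro-batch

  val lines = ssc.socketTextStream("localhost", 9999)        // assumed source
  val wordCounts = lines.flatMap(_.split(" "))
                        .map(word => (word, 1))
                        .reduceByKey(_ + _)                  // the processed batch for that second

  wordCounts.print()                                         // or publish to a dashboard or database
  wordCounts.saveAsTextFiles("hdfs:///streaming/output/wc")  // store each batch back to HDFS
  ssc.start()                                                // start receiving and processing
  ssc.awaitTermination()

Running something like nc -lk 9999 in another terminal would feed test lines into that socket so you can watch the batches come out.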
3645 02:29:29,782 --> 02:29:32,700 Now we are going to get our input data stream. 3646 02:29:32,700 --> 02:29:34,500 We are going to put it inside 3647 02:29:34,500 --> 02:29:38,200 of a spot screaming going to get the batches of input data. 3648 02:29:38,200 --> 02:29:40,772 Once it executes to his path engine. 3649 02:29:40,772 --> 02:29:44,300 We are going to get that chest of processed data. 3650 02:29:44,300 --> 02:29:47,146 We have just seen the same diagram before so 3651 02:29:47,146 --> 02:29:49,000 the same explanation for it. 3652 02:29:49,000 --> 02:29:52,400 Now again breaking it down into more glamour part. 3653 02:29:52,400 --> 02:29:55,060 We are getting a d string B string was 3654 02:29:55,060 --> 02:29:58,800 what Vulnerabilities of data multiple set of Harmony, 3655 02:29:58,800 --> 02:30:00,500 so we are getting a d string. 3656 02:30:00,500 --> 02:30:03,400 So let's say we are getting an rdd and the rate of time but 3657 02:30:03,400 --> 02:30:06,200 because now we are getting real steam data, right? 3658 02:30:06,200 --> 02:30:07,936 So let's say in today right now. 3659 02:30:07,936 --> 02:30:08,872 I got one second. 3660 02:30:08,872 --> 02:30:11,399 Maybe now I got some one second in one second. 3661 02:30:11,399 --> 02:30:14,600 I got more data now I got more data in the next not Frank. 3662 02:30:14,600 --> 02:30:16,300 So that is what we're talking about. 3663 02:30:16,300 --> 02:30:17,602 So we are creating data. 3664 02:30:17,602 --> 02:30:20,322 We are getting from time 0 to time what we get say 3665 02:30:20,322 --> 02:30:22,171 that we have an RGB at the rate 3666 02:30:22,171 --> 02:30:24,556 of Timbre similarly it is this proceeding 3667 02:30:24,556 --> 02:30:27,300 with the time that He's getting proceeded here. 3668 02:30:27,400 --> 02:30:30,683 Now in the next thing we extracting the words 3669 02:30:30,683 --> 02:30:32,400 from an input Stream So 3670 02:30:32,400 --> 02:30:33,300 if you can notice 3671 02:30:33,300 --> 02:30:35,550 what we are doing here from where let's say, 3672 02:30:35,550 --> 02:30:37,700 we started applying doing our operations 3673 02:30:37,700 --> 02:30:40,419 as we started doing our any sort of processing. 3674 02:30:40,419 --> 02:30:43,200 So as in when we get the data in this timeframe, 3675 02:30:43,200 --> 02:30:44,707 we started being subversive. 3676 02:30:44,707 --> 02:30:46,307 It can be a flat map operation. 3677 02:30:46,307 --> 02:30:49,300 It can be any sort of operation you're doing it can be even 3678 02:30:49,300 --> 02:30:51,800 a machine-learning opposite of whatever you are doing 3679 02:30:51,800 --> 02:30:55,600 and then you are generating the words in that kind of thing. 3680 02:30:55,700 --> 02:30:58,700 So this is how we as we're seeing 3681 02:30:58,700 --> 02:31:02,700 that how gravity we can kind of see all these part 3682 02:31:02,700 --> 02:31:04,620 at a very high level this work. 3683 02:31:04,620 --> 02:31:06,738 We again went into detail then again, 3684 02:31:06,738 --> 02:31:08,249 we went into more detail. 3685 02:31:08,249 --> 02:31:09,700 And finally we have seen 3686 02:31:09,700 --> 02:31:13,600 that how we can even process the data along the time 3687 02:31:13,600 --> 02:31:16,594 when we are screaming our data as well. 3688 02:31:17,100 --> 02:31:21,500 Now one important point is just like spark context is 3689 02:31:21,853 --> 02:31:25,700 mean entry point for any spark application similar. 
3690 02:31:25,700 --> 02:31:28,300 Need to work on streaming a spot 3691 02:31:28,300 --> 02:31:31,600 screaming you require a streaming context. 3692 02:31:31,700 --> 02:31:35,800 What is that when you're passing your input data stream you 3693 02:31:35,800 --> 02:31:38,400 when you are working on the Spark engine 3694 02:31:38,400 --> 02:31:41,000 when you're walking on this path screaming engine, 3695 02:31:41,000 --> 02:31:42,900 you have to use your system 3696 02:31:42,900 --> 02:31:46,289 in context of its using screaming context only 3697 02:31:46,289 --> 02:31:48,700 you are going to get the batches 3698 02:31:48,700 --> 02:31:52,300 of your input data now so streaming context 3699 02:31:52,300 --> 02:31:57,000 is going to consume a stream of data in In Apache spark, 3700 02:31:57,300 --> 02:31:58,800 it is registers 3701 02:31:58,800 --> 02:32:04,000 and input D string to produce or receiver object. 3702 02:32:04,500 --> 02:32:08,200 Now it is the main entry point as we discussed 3703 02:32:08,200 --> 02:32:11,011 that like spark context is the main entry point 3704 02:32:11,011 --> 02:32:12,600 for the spark application. 3705 02:32:12,600 --> 02:32:13,400 Similarly. 3706 02:32:13,400 --> 02:32:16,110 Your streaming context is an entry point 3707 02:32:16,110 --> 02:32:17,500 for yourself Paxton. 3708 02:32:17,500 --> 02:32:20,800 Now does that mean now Spa context is 3709 02:32:20,800 --> 02:32:22,569 not an entry point know 3710 02:32:22,569 --> 02:32:25,779 when you creates pastrini it is dependent. 3711 02:32:25,779 --> 02:32:27,600 On your spots community. 3712 02:32:27,600 --> 02:32:30,007 So when you create this thing in context 3713 02:32:30,007 --> 02:32:33,509 it is going to be dependent on your spark of context only 3714 02:32:33,509 --> 02:32:36,732 because you will not be able to create swimming contest 3715 02:32:36,732 --> 02:32:38,000 without spot Pockets. 3716 02:32:38,000 --> 02:32:41,000 So that's the reason it is definitely required spark 3717 02:32:41,000 --> 02:32:45,600 also provide a number of default implementations of sources, 3718 02:32:45,800 --> 02:32:50,000 like looking in the data from Critter a factor 0 mq 3719 02:32:50,100 --> 02:32:53,100 which are accessible from the context. 3720 02:32:53,100 --> 02:32:55,800 So it is supporting so many things, right? 3721 02:32:55,800 --> 02:32:58,600 now If you notice this 3722 02:32:58,600 --> 02:33:01,000 what we are doing in streaming contact, 3723 02:33:01,000 --> 02:33:03,497 this is just to give you an idea about 3724 02:33:03,497 --> 02:33:06,500 how we can initialize our system in context. 3725 02:33:06,500 --> 02:33:09,971 So we will be importing these two libraries after that. 3726 02:33:09,971 --> 02:33:12,923 Can you see I'm passing spot context SE right son 3727 02:33:12,923 --> 02:33:14,400 passing it every second. 3728 02:33:14,400 --> 02:33:17,323 We are collecting the data means collect the data 3729 02:33:17,323 --> 02:33:18,400 for every 1 second. 3730 02:33:18,400 --> 02:33:21,500 You can increase this number if you want and then this 3731 02:33:21,500 --> 02:33:24,028 is your SSC means in every one second 3732 02:33:24,028 --> 02:33:25,482 what ever gonna happen? 3733 02:33:25,482 --> 02:33:27,000 I'm going to process it. 3734 02:33:27,000 --> 02:33:28,800 And what we're doing in this place, 3735 02:33:28,900 --> 02:33:33,100 let's go to the D string topic now now in these three 3736 02:33:33,500 --> 02:33:37,000 it is the full form is discretized stream. 
3737 02:33:37,053 --> 02:33:38,900 It's a basic abstraction 3738 02:33:38,900 --> 02:33:41,679 provided by your spa streaming framework. 3739 02:33:41,679 --> 02:33:46,400 It's appointing a stream of data and it is going to be received 3740 02:33:46,400 --> 02:33:47,630 from your source 3741 02:33:47,630 --> 02:33:52,200 and from processed steaming context is related 3742 02:33:52,200 --> 02:33:56,900 to your response living Fun Spot context is belonging. 3743 02:33:56,900 --> 02:33:57,974 To your spark or 3744 02:33:57,974 --> 02:34:01,600 if you remember the ecosystem radical in the ecosystem, 3745 02:34:01,600 --> 02:34:06,400 we have that spark context right now streaming context is built 3746 02:34:06,400 --> 02:34:08,784 with the help of spark context. 3747 02:34:08,800 --> 02:34:11,800 And in fact using streaming context only 3748 02:34:11,800 --> 02:34:15,604 you will be able to perform your sponsoring just like 3749 02:34:15,604 --> 02:34:17,722 without spark context you will 3750 02:34:17,722 --> 02:34:19,700 not able to execute anything 3751 02:34:19,700 --> 02:34:22,482 in spark application just park application 3752 02:34:22,482 --> 02:34:25,100 will not be able to do anything similarly 3753 02:34:25,100 --> 02:34:27,200 without streaming content. 3754 02:34:27,200 --> 02:34:31,500 You're streaming application will not be able to do anything. 3755 02:34:31,500 --> 02:34:34,838 It just that screaming context is built on top 3756 02:34:34,838 --> 02:34:36,100 of spark context. 3757 02:34:36,500 --> 02:34:39,700 Okay, so it now it's a continuous stream 3758 02:34:39,700 --> 02:34:42,400 of data we can talk about these three. 3759 02:34:42,400 --> 02:34:46,200 It is received from source of on the processed data speed 3760 02:34:46,200 --> 02:34:49,000 generated by the transformation of interesting. 3761 02:34:49,300 --> 02:34:53,800 If you look at this part internally a these thing 3762 02:34:53,800 --> 02:34:57,389 can be represented by a continuous series of I 3763 02:34:57,389 --> 02:34:59,620 need these this is important. 3764 02:34:59,946 --> 02:35:04,400 Now what we're doing is every second remember last time 3765 02:35:04,400 --> 02:35:05,800 we have just seen an example 3766 02:35:05,900 --> 02:35:08,335 of like every second whatever going to happen. 3767 02:35:08,335 --> 02:35:10,100 We are going to do processing. 3768 02:35:10,200 --> 02:35:13,700 So in that every second whatever data you 3769 02:35:13,700 --> 02:35:17,300 are collecting and you're performing your operation. 3770 02:35:17,300 --> 02:35:18,010 So the data 3771 02:35:18,010 --> 02:35:21,500 what you're getting here is will be your District means 3772 02:35:21,500 --> 02:35:23,129 it's a Content you can say 3773 02:35:23,129 --> 02:35:26,200 that all these things will be your D string point. 3774 02:35:26,200 --> 02:35:29,800 It's our Representation by a continuous series 3775 02:35:29,800 --> 02:35:32,300 of kinetic energy so many hundred is getting more 3776 02:35:32,300 --> 02:35:34,500 because let's say right knocking one second. 3777 02:35:34,500 --> 02:35:36,000 What data I got collected. 3778 02:35:36,000 --> 02:35:37,100 I executed it. 3779 02:35:37,100 --> 02:35:40,500 I in the second second this data is happening here. 3780 02:35:40,715 --> 02:35:41,100 Okay? 3781 02:35:41,100 --> 02:35:41,800 Okay. 3782 02:35:41,800 --> 02:35:42,700 Sorry for that. 3783 02:35:42,700 --> 02:35:46,300 Now in the second time also the it is happening 3784 02:35:46,300 --> 02:35:47,400 a third second. 
3785 02:35:47,400 --> 02:35:49,000 Also it is happening here. 3786 02:35:49,700 --> 02:35:50,500 No problem. 3787 02:35:50,500 --> 02:35:53,100 No, I'm not going to do it now fine. 3788 02:35:53,100 --> 02:35:54,727 So in the third second Auto 3789 02:35:54,727 --> 02:35:57,200 if I did something I'm processing it here. 3790 02:35:57,200 --> 02:35:57,500 Right. 3791 02:35:57,500 --> 02:35:59,800 So if you see that this diagram itself, 3792 02:35:59,800 --> 02:36:03,600 so it is every second whatever data is getting collected. 3793 02:36:03,600 --> 02:36:05,400 We are doing the processing 3794 02:36:05,400 --> 02:36:09,250 on top of it and the whole countenance series of RDV 3795 02:36:09,250 --> 02:36:13,100 what we are seeing here will be called as the strip. 3796 02:36:13,100 --> 02:36:13,500 Okay. 3797 02:36:13,500 --> 02:36:18,100 So this is what your distinct moving further now 3798 02:36:18,600 --> 02:36:22,300 we are going to understand the operation on these three. 3799 02:36:22,300 --> 02:36:24,500 So let's say you are doing 3800 02:36:24,500 --> 02:36:27,300 this operation on this dream that you are getting. 3801 02:36:27,300 --> 02:36:30,000 The data from 0 to 1 again, 3802 02:36:30,000 --> 02:36:32,300 you are applying some operation 3803 02:36:32,300 --> 02:36:36,108 on that then whatever output you get you're going to call 3804 02:36:36,108 --> 02:36:39,200 it as words the state means this is the thing 3805 02:36:39,200 --> 02:36:41,166 what you're doing you're doing a pack of operation. 3806 02:36:41,166 --> 02:36:42,700 That's the reason we're calling it is at 3807 02:36:42,700 --> 02:36:46,058 what these three now similarly whatever thing you're doing. 3808 02:36:46,058 --> 02:36:48,000 So you're going to get accordingly 3809 02:36:48,000 --> 02:36:50,569 and output be screen for it as well. 3810 02:36:50,569 --> 02:36:55,100 So this is what is happening in this particular example now. 3811 02:36:56,700 --> 02:36:59,700 Flat map flatmap is API. 3812 02:37:00,000 --> 02:37:02,100 It is very similar to mac. 3813 02:37:02,100 --> 02:37:04,089 Its kind of platen of your value. 3814 02:37:04,089 --> 02:37:04,400 Okay. 3815 02:37:04,400 --> 02:37:06,400 So let me explain you with an example. 3816 02:37:06,400 --> 02:37:07,300 What is flat back? 3817 02:37:07,500 --> 02:37:10,100 So let's say if I say that hi, 3818 02:37:10,400 --> 02:37:13,200 this is a doulica. 3819 02:37:14,500 --> 02:37:15,600 Welcome. 3820 02:37:16,200 --> 02:37:18,100 Okay, let's say listen later. 3821 02:37:18,222 --> 02:37:18,723 Now. 3822 02:37:18,723 --> 02:37:20,800 I want to apply a flatworm. 3823 02:37:20,800 --> 02:37:22,900 So let's say this is a form of rdd. 3824 02:37:22,900 --> 02:37:24,600 Also now on this rdd, 3825 02:37:24,600 --> 02:37:28,200 let's say I apply flat back to let's say our DB this is 3826 02:37:28,200 --> 02:37:30,000 the already flat map. 3827 02:37:31,600 --> 02:37:35,000 It's not map Captain black pepper. 3828 02:37:35,100 --> 02:37:38,467 And then let's say you want to define something for it. 3829 02:37:38,467 --> 02:37:40,400 So let's say you say that okay, 3830 02:37:41,100 --> 02:37:43,400 you are defining a variable sale. 3831 02:37:43,700 --> 02:37:48,300 So let's say a a DOT now 3832 02:37:48,400 --> 02:37:53,300 after that you are defining your thoughts split split. 3833 02:37:55,300 --> 02:37:58,417 We're splitting with respect to visit now in this case 3834 02:37:58,417 --> 02:38:00,106 what is going to happen now? 
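The flatMap example just described could be sketched like this on a plain RDD; the sentence is the same sample phrase used above.

  val rdd = sc.parallelize(Seq("Hi this is Edureka welcome"))

  // flatMap splits every line and flattens the result: one element per word
  val words = rdd.flatMap(line => line.split(" "))
  words.collect().foreach(println)                // Hi, this, is, Edureka, welcome

  // map, by contrast, would keep one array per line instead of flattening it
  val arrays = rdd.map(line => line.split(" "))   // RDD[Array[String]] with a single element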
Now that we have understood that part, let's look at input DStreams and receivers. What are these? Input DStreams can come from basic sources or advanced sources. Under basic sources we have file systems and socket connections; under advanced sources we have Kafka, Flume and Kinesis. An input DStream is simply a DStream representing the stream of input data received from a streaming source, and as I said there are two types, basic and advanced. Let's move further. If you look at the diagram, events flow into the receiver and from there into the DStream; RDDs keep getting created and we perform our steps on them. In other words, the receiver sends the data into the DStream, where each batch contains an RDD. That is what the receiver is doing here.
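As a small sketch, this is how a basic source is wired up, continuing the StreamingContext created above; the host and port are placeholder assumptions, and an advanced source such as Kafka would use its own connector library instead of this call.

    // A basic source: a socket connection. The receiver pulls text lines from it and
    // packs each batch interval's data into one RDD of the resulting DStream.
    val lines = ssc.socketTextStream("localhost", 9999)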
Moving further, let's look at transformations on the DStream and understand which transformations are available. There are several, and these are the most popular: map, flatMap, filter, reduce and groupBy. The idea is that you take your input data, apply any of these transformations, and a new DStream is created from the result. Let's explore them one by one, starting with map. map returns a new DStream by passing each element of the source DStream through a function that you define, so for every input element you get a mapped output element, batch by batch. Then flatMap, which we have just discussed: it also returns a new DStream by passing each source element through a function, but it flattens the result, so each input item can be mapped to zero or more output items. We have already seen an example of it, which should make the difference between map and flatMap easier to see. Moving further, filter. As the name states, filter lets you keep only the values you want. Say you have a huge amount of data and you only want to work with a filtered subset, maybe removing some part of it, or applying some logic such as checking whether a line contains a particular word; in that case the resulting stream contains only the elements that match that criterion.
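Continuing the socket sketch above, a few of these transformations might look like this; the filter condition is just an illustrative assumption.

    // map: exactly one output element per input element.
    val lineLengths = lines.map(line => line.length)

    // flatMap: zero or more output elements per input element, here the words of each line.
    val words = lines.flatMap(line => line.split(" "))

    // filter: keep only the elements that match a condition.
    val errorLines = lines.filter(line => line.contains("error"))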
With filter, of course, the output will usually be smaller than the input. Next, reduce: reduce performs an aggregation over the data, so if at the end you want to sum up everything you have, that is done with the help of reduce. After that, groupBy: groupBy combines all the common values, so in the example all the names starting with C get grouped together, all the names starting with J get grouped together, all the names starting with S get grouped together, and so on. Now, what is this sliding window? To give you an example, everybody knows Twitter, right? Let me switch to my paint tool. Say in the first 10 seconds the tweets are hashtag A, hashtag A, hashtag A; clearly A is the trending hashtag. Maybe in the next 10 seconds it is hashtag A, hashtag B, hashtag B, so B is trending there. In another 10 seconds it is hashtag B again and again, so only B is trending in that window. But now I want to find out which hashtag is trending over the last 30 seconds; if I combine those three windows, I can work that out easily. That is your windowing operation: you are not only looking at your current window, you are also looking at the previous windows.
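Here is a sketch of that idea using Spark Streaming's window operations; the 30-second window and 10-second slide mirror the example above, and hashTags is an assumed DStream of hashtags taken from the input.

    import org.apache.spark.streaming.Seconds

    // Assumed: a DStream of hashtags, derived here from the lines DStream above.
    val hashTags = lines.flatMap(_.split(" ")).filter(_.startsWith("#"))

    // Count each hashtag over the last 30 seconds, recomputed every 10 seconds.
    val tagCounts = hashTags
      .map(tag => (tag, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))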
To be precise, by current window I mean, say, a 10-second slot; in that slot you are processing hashtag B, hashtag B, hashtag B, and that is the current window, but you are not computing only over the current window, you are also considering the previous windows. Now suppose I ask you: can you tell me which hashtag is trending in the last 17 seconds? You will not be able to answer, because you do not have partial information for 7 seconds; you have information for 10, 20 or 30 seconds, multiples of your window, but nothing in between. So keep this in mind: you can perform a windowing operation only in terms of your window size, you cannot pick an arbitrary partial duration. Let's get back to the slides. The same thing is shown here: we consider not only the current window but also the previous windows. Next, let's understand the output operations on DStreams. Output operations allow the DStream's data to be pushed out to external systems: whatever processing you have done, you can store the output in multiple places, in a file system, in a database or in some other external system, and that is what is reflected here. The output operations that are supported are print, which prints the values; saveAsTextFiles, which saves into HDFS or, if you want, into the local file system; saveAsObjectFiles; saveAsHadoopFiles; and the foreachRDD function.
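Continuing the earlier sketch, the two simplest output operations look like this; the path prefix is an assumption, and Spark appends a timestamp to each batch's output directory.

    // Print the first elements of every batch on the driver console.
    tagCounts.print()

    // Write each batch as text files under a timestamped directory with this prefix.
    tagCounts.saveAsTextFiles("hdfs:///user/edureka/tagcounts")

    // Nothing actually runs until the streaming context is started.
    ssc.start()
    ssc.awaitTermination()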
Now, what is the foreachRDD function? We normally spend much more time on this in the full course, but just to give you an idea: it is a very powerful primitive that allows your data to be sent out to external systems. Using it you can push the data to a web server, to a file system, to whichever external system you have; that is how you transfer it out. Next, let's understand caching and persistence. DStreams also allow developers to cache, or persist, the stream's data in memory: you can keep the data in memory for a longer time, and even after an action completes it is not deleted, so you can reuse it as many times as you want. You simply use the persist method for that. For input streams that receive data over the network, for example via Kafka, Flume or sockets, the default persistence level is set to replicate the data to two nodes for fault tolerance, as you can see in the diagram.
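A small sketch of both ideas, under the same assumptions as the sketches above: persist keeps the DStream's data in memory, and foreachRDD hands you each batch's RDD so you can push it to an external system. In an actual job these calls would be set up before ssc.start() is invoked.

    // Keep the windowed counts in memory so repeated operations can reuse them.
    tagCounts.persist()

    // foreachRDD gives you the RDD behind each batch.
    tagCounts.foreachRDD { rdd =>
      // In a real job you would open a connection per partition and write the records
      // to your external system; printing on the executors is just a stand-in here.
      rdd.foreachPartition { records =>
        records.foreach(record => println(record))
      }
    }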
Now let's look at accumulators, broadcast variables and checkpoints. These are mostly about performance; they help you on the performance side. Accumulators are variables that are only added to through an associative and commutative operation. If you come from a Hadoop background and have done MapReduce programming, you must have seen counters; counters help us debug the program and do some quick analysis right in the console. You can do something similar with accumulators: you can implement counters with them, you can sum things up with them, and if you want to track them through the UI you can do that too, because the accumulators show up in the Spark UI, as you can see here. Similarly we have broadcast variables. Broadcast variables allow the programmer to keep a read-only variable cached on all the available machines, and they can be used to give every node a copy of a large input data set in an efficient manner; Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost. As you can see here, we hand the broadcast value to the Spark context and it broadcasts it out to the nodes; that is how it works in this application. Generally when we teach this in class, since these are advanced concepts, we go deeper with practicals; when you work through how accumulators behave and how broadcasting actually happens, things become much clearer. Right now I just want everybody to have a high-level overview of these things.
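A minimal sketch of both, as you might type it in the Spark 1.x shell used in this course; newer Spark versions would use sc.longAccumulator instead of sc.accumulator.

    // An accumulator used as a counter; it is only ever added to from the executors.
    val badRecords = sc.accumulator(0)

    // A read-only lookup table cached once on every machine.
    val lookup = sc.broadcast(Map("IN" -> "India", "JP" -> "Japan"))

    sc.parallelize(Seq("IN", "XX", "JP")).foreach { code =>
      if (!lookup.value.contains(code)) badRecords += 1
    }

    println(badRecords.value)   // the driver sees the combined count: 1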
Now moving further: what are checkpoints? Checkpoints are similar to checkpoints in gaming. They let a streaming application run 24/7 and make it resilient to failures unrelated to the application logic. As the diagram shows, there are two kinds: metadata checkpointing saves the information that defines the streaming computation, while data checkpointing saves the generated RDDs to reliable storage; both of them produce the checkpoint. Now, moving forward, we come to our project, where we are going to perform Twitter sentiment analysis. Let's discuss this very important use case. It is going to be very interesting, because we will do the analysis on Twitter in real time, and there are a lot of possible applications of sentiment analysis; we will pick one particular one. Generally in the full course we cover all of this in more detail, because going very deep is not really possible here, but during the Edureka training you will learn all of these things thoroughly. So, we talked about some use cases of Twitter. As I said, there can be many, because there is so much happening on social media and people are very active on it these days. You must have noticed that even politicians have started using Twitter, and their tweets are shown on the news channels, with people constantly reacting to them and debating the positives and negatives whenever a politician says something, right?
And it is not only politics: if we talk about sports, say the FIFA World Cup is going on, you will notice Twitter fills up with a huge number of tweets. How can we make use of that, how can we do some analysis on top of it? That is what we are going to learn here. Sentiment analysis can serve many purposes: it can be used for crisis management, for adjusting services, for target marketing, and the list goes on. When a new movie releases, even the moviemakers can gauge how it is going to perform, so they can estimate beforehand roughly what range of profit the movie will land in; interesting, right? It has even been used in political campaigns: you must have heard that during the US presidential election, social media and this kind of analysis were used heavily and played a major role in winning that election. Similarly, investors may want to predict whether they should invest in a particular company, and advertisers want to decide which customers to target, because we cannot target everyone; targeting everyone is very costly, so we want to narrow it down to the set of people for whom the advertisement will be most effective, which is also the most cost-effective approach. The same applies to products and services. Now let's get to the use case itself; I will show you a practical of how it works. First of all, we will import all the required packages, because we are going to perform Twitter sentiment analysis and we will need some packages for that.
So that is the first step. Then we need to set up the authentication, because without authentication we cannot do anything. The challenge here is that we cannot simply put our username and password in the code; don't you think they would get compromised? So Twitter came up with something very smart: OAuth authentication tokens. You go to dev.twitter.com, log in, and you will find all these authentication tokens available to you; you take whatever is required from there and put it into the application. Then, using the DStream transformations we have just learned, you do the computation, generate the tweet data and save it into a particular directory. Once that is done, you extract the sentiment, and once you have extracted it, you are done. Let me quickly show you how it works on our VM.
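The skeleton of such a job might look roughly like this. It is only a sketch: it assumes the external spark-streaming-twitter (twitter4j based) connector is on the classpath, the credential values are placeholders you would copy from dev.twitter.com, and the actual sentiment scoring step is left out.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.twitter.TwitterUtils

    // Placeholder OAuth tokens; twitter4j reads them from system properties.
    System.setProperty("twitter4j.oauth.consumerKey", "<consumer-key>")
    System.setProperty("twitter4j.oauth.consumerSecret", "<consumer-secret>")
    System.setProperty("twitter4j.oauth.accessToken", "<access-token>")
    System.setProperty("twitter4j.oauth.accessTokenSecret", "<access-token-secret>")

    val conf = new SparkConf().setAppName("TwitterSentiment").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Only pull tweets mentioning the keyword we want to analyse, "Trump" in the demo.
    val tweets = TwitterUtils.createStream(ssc, None, Seq("Trump"))
    val texts = tweets.map(status => status.getText)

    // The sentiment scoring would happen here; the demo then saves the results to a directory.
    texts.saveAsTextFiles("output/tweets")

    ssc.start()
    ssc.awaitTermination()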
One more interesting thing about Edureka is that you get these preconfigured machines, so you need not worry about where to get the software or how difficult it is to install; we have often seen people struggle with things like an open-source tool that simply refuses to work on their operating system. To avoid all of that, the VM we provide has everything pre-installed, whatever is required for your training, which is the best part. So Eclipse is already there; you just go to the Eclipse location and double-click on it, no need to install Eclipse, and even Spark is already installed for you. Let us go to our project. This is the project in front of you, the one we are going to work on. You can see that we first imported all the libraries, then set up the authentication, then applied the DStream transformations, extracted the sentiment and finally saved the output; those are the steps in this program. Now let's execute it. Running it is very simple: go to Run As and click on Scala Application. You will notice at the end that it starts executing, so the program is running; let us wait. There, it brought in a tweet about Trump, and tweets about Trump are anyway rated negative most of the time, no surprise there, since Trump is a hot topic for us. Let me make the window a little bigger. You will notice a lot of negative tweets coming up.
Now I am stopping it so that I can show you something. Yes, it is filtering the tweets: we have written that in the program itself, we have given the keyword at one location, and using that we were asking for tweets about Trump. Here we are doing the analysis, and it also tells us whether each tweet is positive or negative; in this case it comes out negative, because tweets about Trump will generally not be positive, right, and that is why you see negatives here. Similarly, for any other tweet we will get its sentiment too, and if I keep going we will see multiple negative tweets come up. That is how this program runs. We can also stop it, and the output results are written to a location as well: if I go to my project location, this is the actual project directory where it is running, so you can just come to this location, find all your output written there and take a look. Everything here is done using Spark Streaming; that is what we have been seeing with the DStream transformations, and we have done all of it with Spark Streaming alone, which is one of the awesome things about it, that you can do such powerful things with it. Now let's analyze the results. As we have just seen, it shows whether the tweets are positive or negative; this is where your output gets stored, and it will appear like this. It prints your output and explicitly tells you whether each tweet is neutral, positive or negative, all done with the help of Spark Streaming only. We have done it for Trump because, as I explained, we put that keyword in the program itself, and based on that we are getting this output; you can apply sentiment analysis like this wherever you want, just as we have learned. I hope you have found this use case very useful, and that it convinced you that yes, this can be done. Right now we have put Trump there, but if you want you can keep changing the hashtag, because that is how we are doing it; you can keep changing the tags.
Maybe you could run it for football, or when a cricket match is going on you could pull the tweets for that; in that case, instead of Trump you would put a player name or a team name, and you would see all the related tweets coming up. That is how you can play with this. There are many examples to try, and this use case can be evolved into many other kinds of use cases; you can keep transforming it for your own needs. So that's it about Spark Streaming, which is what I wanted to discuss; I hope you found it useful. Now, moving on to machine learning with Spark. In classification, just to give you an example, you must have noticed the spam folder in your mailbox; when a new email arrives, how does Google decide whether it is spam or not spam? That decision is an example of classification. Clustering: say I go to Google News and type something; it groups all the related news together, and that is called clustering. Regression is also a very important technique. As an example of regression, say you have a house you want to sell and you have no idea what the optimal price should be; regression will help you arrive at that. Collaborative filtering you may have seen on the Amazon web page, where they show you recommendations such as 'you might buy this because you bought that'; that is done with the help of collaborative filtering.
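To make that last one concrete, here is a small, hedged sketch of collaborative filtering with Spark MLlib's ALS algorithm; the handful of in-memory ratings are made-up illustrative data, and the rank and iteration counts are arbitrary.

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // Made-up (user, product, rating) triples, just for illustration.
    val ratings = sc.parallelize(Seq(
      Rating(1, 101, 5.0), Rating(1, 102, 3.0),
      Rating(2, 101, 4.0), Rating(2, 103, 1.0)
    ))

    // Train a matrix factorization model: rank 10, 10 iterations.
    val model = ALS.train(ratings, 10, 10)

    // Recommend two products for user 1.
    model.recommendProducts(1, 2).foreach(println)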
Before I move on to the project, I want to show you some practicals of how we execute Spark, so let me take you to the VM machine provided by Edureka. These machines are provided by Edureka, so you need not worry about where you will get the software or what you will have to install; everything is taken care of when they are handed over. When you open it, you will see a machine like this; let me close this first. You will see a fairly blank desktop, and this is how your machine will look. To start working, you open the terminal by clicking on the black terminal icon. After that you can go to Spark. How do I work with Spark? To execute any Spark program using the Scala programming language, you enter spark-shell. Typing spark-shell takes you to the Scala prompt, where you can write your Spark programs in Scala. You can also notice that it shows the version, 1.5.2, which is the version of Spark here, and the message that the Spark context is available as sc: when you get connected to spark-shell, a SparkContext is available to you by default. Let us get connected; it takes some time. Now we are connected to the Scala prompt. If I want to come out of it, I just type exit and it brings me out. Secondly, I can also write my programs in Python: if I want to program Spark with the Python programming language, I connect with PySpark, so I just need to type pyspark to get connected. I am not connecting to it now because I will not need it; I will be explaining everything with Scala itself.
But if you want to get connected that way, you can simply type pyspark. So let's get connected to spark-shell again, and meanwhile, while it is connecting, let us look at a small file. Currently, if you notice, I already have a.txt, so let's say cat a.txt: I have some data, one two three four five, and this is the data I have. Now let me push this file into HDFS. First let me check whether it is already there: hadoop dfs -cat a.txt, just to quickly check whether it is already available. There is no such file, so let me put it there: hadoop dfs -put a.txt. This puts it in the default HDFS location, and if I want to read it I can cat it; again, I am assuming you are aware of these HDFS commands. You can see the one two three four five now coming from the Hadoop file system. Now I want to use this file in my Spark program; how can I do that? Let's come back to the shell. In Scala, we do not declare types the way we do in Java, where we write something like int a = 10; in Scala we do not spell out the data type. Instead, we use var. If I write var a = 10, Scala automatically identifies that it is an integer value; notice that it tells me a is of type Int. If I now want to update this value to 20, I can do that. But if I try to update it to the string "ABC", it will throw an error, because a is already defined as an integer and you are trying to assign a string to it; that is the reason you get this error.
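In the spark-shell, which is a Scala REPL, that exchange looks roughly like this; the values are just the ones used in the demo.

    var a = 10       // Scala infers the type: a is an Int
    a = 20           // fine, a var can be reassigned
    a = "ABC"        // error: type mismatch, a String cannot be assigned to an Int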
Similarly, there is one more keyword, called val. Say I write val b = 10; it works very much like var, but with one difference: if I now do b = 20, you will see an error. Why? Because when you define something as a val, it is a constant; it is not a variable anymore, and that is the reason that anything defined as a val is not updatable, you should not be able to change its value. So this is how you program in Scala: var for your variables and val for your constants. Now let's use this for the example we have learned. Say I want to create an RDD: val numbers = sc.textFile, remember this API, we have already learned it, and I give it the file a.txt. If I pass a.txt, it creates an RDD, and you can see it tells me it has created an RDD of String type. Now if I want to read this data, I call numbers.collect, and this prints the values that were in the file; can you see them? The line you are seeing here is coming from memory: it is read through the RDD, and that is why it shows up in this particular way. So this is how you perform these steps.
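Typed out, that part of the shell session looks like this; a.txt is the small sample file from the demo, read from HDFS by default.

    val b = 10                          // val declares a constant
    // b = 20                           // would fail: reassignment to val

    val numbers = sc.textFile("a.txt")  // creates an RDD[String] from the file in HDFS
    numbers.collect()                   // action: brings the contents back to the driver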
The second thing: I told you that Spark can work on standalone systems as well, right? So far we executed this against HDFS; if I want to run it on our local file system instead, can I do that? Yes, you still can. The difference comes in the path you give: instead of passing the name as before, you prefix it with the file keyword and then give the local path. For example, the path /home/edureka is a local path, not an HDFS path, so you would write file:///home/edureka/a.txt. If you give this, it loads the file into memory, but from your local file system rather than from HDFS; that is the difference, and in this second case I am not even using HDFS. Now, can you tell me why this error appeared? This is interesting: why does it say the input path does not exist? Because I made a typo in the path. But notice that I did not get this error at the point where I created the RDD, even though the file does not exist. I got no error there because of lazy evaluation: lazy evaluation means that even though you gave a wrong path while creating the RDD, nothing was actually executed, so you only receive the output, or in this case the error, when you hit the action, collect. To fix it, I correct the path, and this time when I execute it, it works; you can see the output 1 2 3 4 5. So now we should be much clearer about lazy evaluation: even a wrong file name does not matter until an action runs.
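The lazy evaluation behaviour from the demo, written out; the misspelled path is deliberate, and the correct one is the local file used above.

    // No error yet: textFile is a transformation, so nothing is read at this point.
    val broken = sc.textFile("file:///home/edureka/a_typo.txt")

    // The "Input path does not exist" error only appears when an action forces execution:
    // broken.collect()

    // With the correct local path, the same action returns the file's contents.
    val fixed = sc.textFile("file:///home/edureka/a.txt")
    fixed.collect()   // Array(1, 2, 3, 4, 5)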
A question came in: suppose I want to use Spark in a production setup, but not on top of Hadoop; is that possible? Yes, you can do that, Sunny, although usually that is not what you do. There are plenty of options; you can also deploy it on your Amazon clusters, for example. How will the data be distributed in that case? In that case you are not using HDFS, so you will not get that kind of distribution and redundancy from Hadoop itself, but you can use something like Amazon S3 for the storage. Okay, so that is how you would use it. So this is how you will be performing your practicals, a sample of how you work with Spark; I will be training you on this, as I told you, and this is how things work. Now let us see an interesting use case, and for that let us go back to our slides; this is going to be very interesting. Look at this: this use case is earthquake detection using Spark. You might know that there are many earthquakes in Japan; you may not have seen one yourself, but you must have heard that a great many earthquakes happen there. So how do we approach that problem with Spark? I am just going to give you a glimpse of the kind of problem we solve in the sessions; we will not walk through it in full detail, but you will get an idea of how Spark fits in, and you will learn these projects properly during the course. So let's see how Spark is used here. As everybody knows, an earthquake is a shaking of the surface of the Earth caused by tectonic activity. If you are from India, you might remember the recent earthquake that came from Nepal.
4626 03:12:57,053 --> 03:12:59,900 So these are techniques on coming now, 4627 03:12:59,900 --> 03:13:02,300 very important part is let's say 4628 03:13:02,300 --> 03:13:06,100 if the earthquake is on major earthquake like arguing 4629 03:13:06,100 --> 03:13:08,992 or maybe tsunami maybe forest fires, 4630 03:13:08,992 --> 03:13:10,600 maybe a volcano now, 4631 03:13:10,600 --> 03:13:14,000 it's very important for them to kind of SC. 4632 03:13:15,100 --> 03:13:19,600 That black is going to come they should be able to kind 4633 03:13:19,600 --> 03:13:21,600 of predicted beforehand. 4634 03:13:21,600 --> 03:13:23,776 It's not happen that as a last moment. 4635 03:13:23,776 --> 03:13:25,254 They got to the that okay 4636 03:13:25,254 --> 03:13:27,862 Dirtbag is comes after I came up cracking No, 4637 03:13:27,862 --> 03:13:29,700 it should not happen like that. 4638 03:13:29,700 --> 03:13:34,000 They should be able to estimate all these things beforehand. 4639 03:13:34,000 --> 03:13:36,600 They should be able to predict beforehand. 4640 03:13:36,688 --> 03:13:40,611 So this is the system with Japan's is using already. 4641 03:13:40,700 --> 03:13:44,300 So this is a real-time kind of use case what I am presenting. 4642 03:13:44,300 --> 03:13:47,300 It's so Japan is already using this path finger 4643 03:13:47,300 --> 03:13:49,770 in order to solve this earthquake problem. 4644 03:13:49,770 --> 03:13:52,482 We are going to see that how they're using it. 4645 03:13:52,482 --> 03:13:52,866 Okay. 4646 03:13:52,900 --> 03:13:56,900 Now let's say what happens in Japan earthquake model. 4647 03:13:57,000 --> 03:14:00,000 So whenever there is an earthquake coming 4648 03:14:00,000 --> 03:14:02,000 for example at 2:46 p.m. 4649 03:14:02,000 --> 03:14:04,800 On March 4 2011 now 4650 03:14:04,800 --> 03:14:08,300 Japan earthquake early warning was detected. 4651 03:14:08,600 --> 03:14:12,800 Now the thing was as soon as it detected immediately, 4652 03:14:12,800 --> 03:14:16,999 they start sending Not those fools to the lift 4653 03:14:17,000 --> 03:14:20,700 to the factories every station through TV stations. 4654 03:14:20,700 --> 03:14:23,300 They immediately kind of told everyone 4655 03:14:23,300 --> 03:14:26,315 so that all the students were there in school. 4656 03:14:26,315 --> 03:14:29,800 They got the time to go under the desk bullet trains, 4657 03:14:29,800 --> 03:14:30,900 which were running. 4658 03:14:30,900 --> 03:14:31,571 They stop. 4659 03:14:31,571 --> 03:14:35,200 Otherwise the capabilities of us will start shaking now 4660 03:14:35,200 --> 03:14:38,200 the bullet trains are already running at the very high speed. 4661 03:14:38,200 --> 03:14:39,432 They want to ensure 4662 03:14:39,432 --> 03:14:43,000 that there should be no sort of casualty because of that 4663 03:14:43,000 --> 03:14:46,600 so all the bullet train Stop all the elevators the lift 4664 03:14:46,600 --> 03:14:47,825 which were running. 4665 03:14:47,825 --> 03:14:50,600 They stop otherwise some incident can happen 4666 03:14:50,700 --> 03:14:53,930 in 60 seconds 60 seconds 4667 03:14:53,930 --> 03:14:55,700 before this number they 4668 03:14:55,700 --> 03:14:59,100 were able to inform almost every month. 4669 03:14:59,300 --> 03:15:01,212 They have send the message. 
4670 03:15:01,212 --> 03:15:02,698 They have a broadcast 4671 03:15:02,698 --> 03:15:05,949 on TV all those things they have done immediately 4672 03:15:05,949 --> 03:15:07,100 to all the people 4673 03:15:07,100 --> 03:15:09,856 so that they can send at least this message 4674 03:15:09,856 --> 03:15:11,300 whoever can receive it 4675 03:15:11,300 --> 03:15:13,600 and that have saved millions 4676 03:15:13,600 --> 03:15:17,300 of So powerful they were able to achieve 4677 03:15:17,300 --> 03:15:22,100 that they have done all this with the help of Apache spark. 4678 03:15:22,192 --> 03:15:24,500 That is the most important job 4679 03:15:24,500 --> 03:15:27,900 how they've got you can select everything 4680 03:15:27,900 --> 03:15:29,800 what they are doing there. 4681 03:15:29,800 --> 03:15:33,600 They are doing it on the real time system, right? 4682 03:15:33,700 --> 03:15:35,690 They cannot just collect the data 4683 03:15:35,690 --> 03:15:39,100 and then later the processes they did everything as 4684 03:15:39,100 --> 03:15:40,300 a real-time system. 4685 03:15:40,300 --> 03:15:43,300 So they collected the data immediately process it 4686 03:15:43,300 --> 03:15:45,004 and as soon has the detected 4687 03:15:45,004 --> 03:15:47,484 that has quick they immediately inform the 4688 03:15:47,484 --> 03:15:49,381 in fact this happened in 2011. 4689 03:15:49,381 --> 03:15:52,100 Now they they start using it very frequently 4690 03:15:52,100 --> 03:15:54,318 because Japan is one of the area 4691 03:15:54,318 --> 03:15:58,200 which is very frequently of kind of affected by all this. 4692 03:15:58,200 --> 03:15:58,900 So as I said, 4693 03:15:58,900 --> 03:16:01,548 the main thing is we should be able to process the data 4694 03:16:01,548 --> 03:16:02,449 and we are finding 4695 03:16:02,449 --> 03:16:04,900 that the bigger thing you should be able to handle 4696 03:16:04,900 --> 03:16:06,400 the data from multiple sources 4697 03:16:06,400 --> 03:16:07,789 because data may be coming 4698 03:16:07,789 --> 03:16:10,882 from multiple sources may be different different sources. 4699 03:16:10,882 --> 03:16:13,600 They might be suggesting some of the other events. 4700 03:16:13,600 --> 03:16:16,305 It's because Which we are predicting that okay, 4701 03:16:16,305 --> 03:16:17,770 this earthquake can happen. 4702 03:16:17,770 --> 03:16:19,729 It should be very easy to use because 4703 03:16:19,729 --> 03:16:22,500 if it is very complicated then in that case 4704 03:16:22,500 --> 03:16:23,500 for a user to use it 4705 03:16:23,500 --> 03:16:25,549 if you'd be very good become competitive service. 4706 03:16:25,549 --> 03:16:27,600 You will not be able to solve the problem. 4707 03:16:27,700 --> 03:16:29,200 Now even in the end 4708 03:16:29,200 --> 03:16:32,100 how to send the alert message is important. 4709 03:16:32,100 --> 03:16:32,900 Okay. 4710 03:16:32,900 --> 03:16:36,000 So all those things are taken care by your spark. 4711 03:16:36,000 --> 03:16:39,923 Now there are two kinds of layer in your earthquake. 4712 03:16:40,100 --> 03:16:42,633 The number one layer is a prime the way 4713 03:16:42,633 --> 03:16:43,900 and second is fake. 4714 03:16:43,900 --> 03:16:44,864 And we'll wait. 
4715 03:16:44,864 --> 03:16:46,600 There are two kinds of wave 4716 03:16:46,600 --> 03:16:49,100 in an earthquake Prime Z Wave is like 4717 03:16:49,100 --> 03:16:52,261 when the earthquake is just about to start it start 4718 03:16:52,261 --> 03:16:53,400 to the city center 4719 03:16:53,400 --> 03:16:55,200 and it's vendor or Quake 4720 03:16:55,200 --> 03:16:59,100 is going to start secondary wave is more severe than 4721 03:16:59,100 --> 03:17:01,400 which sparked after producing. 4722 03:17:01,400 --> 03:17:03,912 Now what happens in secondary wheel is 4723 03:17:03,912 --> 03:17:06,900 when it's that start it can do maximum damage 4724 03:17:06,900 --> 03:17:09,605 because primary ways you can see the initial wave 4725 03:17:09,605 --> 03:17:11,900 but the second we will be on top of that 4726 03:17:11,900 --> 03:17:14,800 so they will be some details with respect to I 'm not going 4727 03:17:14,800 --> 03:17:15,800 in detail of that. 4728 03:17:15,800 --> 03:17:17,600 But yeah, there will be some details 4729 03:17:17,600 --> 03:17:18,700 with respect to that. 4730 03:17:18,700 --> 03:17:21,700 Now what we are going to do using Sparks. 4731 03:17:21,700 --> 03:17:23,907 We will be creating our arms. 4732 03:17:23,907 --> 03:17:26,799 So let's go and see that in our machine 4733 03:17:26,799 --> 03:17:30,600 how we will be sick calculating our Roc which using 4734 03:17:30,600 --> 03:17:33,600 which we will be solving this problem later 4735 03:17:33,600 --> 03:17:36,524 and we will be calculating this Roc with the help 4736 03:17:36,524 --> 03:17:37,500 of Apache spark. 4737 03:17:37,500 --> 03:17:39,729 Let us again come back to this machine now 4738 03:17:39,729 --> 03:17:41,369 in order to walk on that. 4739 03:17:41,369 --> 03:17:43,600 Let's first exit from this console. 4740 03:17:43,800 --> 03:17:48,300 Once you exit from this console now what you're going to do. 4741 03:17:48,300 --> 03:17:51,900 I have already created this project in kept it here 4742 03:17:51,900 --> 03:17:55,563 because we just want to give you an overview of this. 4743 03:17:55,563 --> 03:17:57,900 Let me go to my downloads section. 4744 03:17:57,900 --> 03:18:01,400 There is a project called as Earth to so this is 4745 03:18:01,400 --> 03:18:03,400 your project initially 4746 03:18:03,500 --> 03:18:06,400 what all things you will be having you 4747 03:18:06,400 --> 03:18:08,839 will not be having all the things initial part. 4748 03:18:08,839 --> 03:18:09,900 So what will happen. 4749 03:18:09,900 --> 03:18:12,990 So let's say if I go to my downloads from here, 4750 03:18:12,990 --> 03:18:14,200 I have worked too. 4751 03:18:14,200 --> 03:18:16,800 project Okay. 4752 03:18:16,800 --> 03:18:19,000 Now initially I will not be having 4753 03:18:19,000 --> 03:18:22,300 this target directory project directory bin directory. 4754 03:18:22,300 --> 03:18:25,400 We will be using our SBT framework. 4755 03:18:25,400 --> 03:18:28,900 If you do not know SBP this is the skill of Bill tooth 4756 03:18:28,900 --> 03:18:32,400 which takes care of all your dependencies takes care 4757 03:18:32,400 --> 03:18:36,700 of all your dependencies are not so it is very similar to Melvin 4758 03:18:36,700 --> 03:18:40,577 if you already know Megan you this is because very similar 4759 03:18:40,577 --> 03:18:42,900 but at the same time I prefer this BTW 4760 03:18:42,900 --> 03:18:46,100 because as BT is more easier to write income. 
4761 03:18:46,100 --> 03:18:47,700 at least in my experience. 4762 03:18:47,700 --> 03:18:50,700 So you will be writing this build file as follows. 4763 03:18:50,700 --> 03:18:55,800 This file will be your build.sbt. Now in this file, 4764 03:18:55,800 --> 03:18:57,255 you will be giving the name 4765 03:18:57,255 --> 03:18:59,700 of your project, what its version is, 4766 03:18:59,700 --> 03:19:02,800 the version of Scala that you are using, 4767 03:19:02,800 --> 03:19:05,385 and what dependencies you have, with 4768 03:19:05,385 --> 03:19:09,400 what versions. For example, for spark-core you are using 4769 03:19:09,400 --> 03:19:11,194 the 1.5.2 version of Spark. 4770 03:19:11,200 --> 03:19:15,100 So you are telling it that whatever program 4771 03:19:15,150 --> 03:19:16,150 I am writing, 4772 03:19:16,200 --> 03:19:22,100 if I require anything related to spark-core, go and get it 4773 03:19:22,100 --> 03:19:27,400 from this org.apache.spark repository, download it, install it. 4774 03:19:27,800 --> 03:19:29,900 If I require any dependency 4775 03:19:29,900 --> 03:19:34,700 for a Spark Streaming program, for this particular version 1.5.2, 4776 03:19:35,000 --> 03:19:37,700 go to this repository or this link 4777 03:19:37,700 --> 03:19:41,200 and get it, and similarly for Spark MLlib as well. 4778 03:19:41,200 --> 03:19:43,353 So you are just declaring them. Now 4779 03:19:43,400 --> 03:19:47,200 once you have done this, you will be creating a folder structure. 4780 03:19:47,200 --> 03:19:49,200 Your folder structure would be: you need 4781 03:19:49,200 --> 03:19:50,722 to create a src folder. 4782 03:19:50,722 --> 03:19:51,393 After that, 4783 03:19:51,393 --> 03:19:54,612 you will be creating a main folder. From the main folder, 4784 03:19:54,612 --> 03:19:57,200 you will be creating again a folder called 4785 03:19:57,200 --> 03:19:58,800 scala. Now inside 4786 03:19:58,800 --> 03:20:01,100 that you will be keeping your program. 4787 03:20:01,100 --> 03:20:03,300 So now here you will be writing a program. 4788 03:20:03,300 --> 03:20:04,500 So you are writing, and you 4789 03:20:04,500 --> 03:20:07,499 can see these Scala programs: a streaming one, a network count one 4790 03:20:07,499 --> 03:20:08,500 and roc.scala. 4791 03:20:08,500 --> 03:20:10,623 So let's keep it as a black box for now. 4792 03:20:10,623 --> 03:20:12,730 So you will be writing the code to achieve 4793 03:20:12,730 --> 03:20:14,083 this problem statement. 4794 03:20:14,083 --> 03:20:15,500 Now what we are going to do is 4795 03:20:15,500 --> 03:20:20,200 come out of this, to the main project folder, 4796 03:20:20,400 --> 03:20:21,500 and from here 4797 03:20:21,700 --> 03:20:24,400 we will be writing sbt package. 4798 03:20:24,500 --> 03:20:26,400 It will start downloading 4799 03:20:26,400 --> 03:20:29,700 with respect to your build.sbt. It will check your program, and 4800 03:20:29,700 --> 03:20:31,900 whatever dependency you require 4801 03:20:31,900 --> 03:20:35,750 for spark-core, spark-streaming, spark-mllib, 4802 03:20:35,750 --> 03:20:36,895 it will download 4803 03:20:36,895 --> 03:20:39,400 and install it. It will just download 4804 03:20:39,400 --> 03:20:42,200 and install it. So we are not going to execute it, 4805 03:20:42,200 --> 03:20:43,900 because I've already done it before 4806 03:20:43,900 --> 03:20:45,300 and it also takes some time. 4807 03:20:45,300 --> 03:20:48,453 So that's the reason I'm not doing it now.
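The build.sbt being described is not shown in full, so the following is only a sketch along the lines of the narration. The Spark version 1.5.2 and the core, streaming and MLlib dependencies come from the explanation above; the project name and the Scala version are assumptions.

  name := "earth2"

  version := "1.0"

  scalaVersion := "2.10.4"

  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core"      % "1.5.2",
    "org.apache.spark" %% "spark-streaming" % "1.5.2",
    "org.apache.spark" %% "spark-mllib"     % "1.5.2"
  )

With this file at the project root and the programs under src/main/scala, running sbt package from that root resolves the dependencies and builds the jar, which is the step walked through next.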
4808 03:20:48,500 --> 03:20:50,689 You have been this packet, 4809 03:20:50,700 --> 03:20:53,788 you will find all this directly Target directly 4810 03:20:53,788 --> 03:20:55,400 toward project directed. 4811 03:20:55,400 --> 03:20:58,100 These got created later on the now 4812 03:20:58,100 --> 03:20:59,600 what is going to happen. 4813 03:20:59,600 --> 03:21:03,400 Once you have created this you will go to your Eclipse. 4814 03:21:03,400 --> 03:21:04,900 So you are a pure c will open. 4815 03:21:04,900 --> 03:21:06,600 So let me open my Eclipse. 4816 03:21:06,900 --> 03:21:08,995 So this is how you're equipped to protect. 4817 03:21:08,995 --> 03:21:09,200 Now. 4818 03:21:09,200 --> 03:21:11,300 I already have this program in front of me, 4819 03:21:11,300 --> 03:21:14,900 but let me tell you how you will be bringing this program. 4820 03:21:14,900 --> 03:21:17,800 You will be going to your import option 4821 03:21:17,800 --> 03:21:18,934 with We import you 4822 03:21:18,934 --> 03:21:22,400 will be selecting your existing projects into workspace. 4823 03:21:22,400 --> 03:21:23,700 Next once you do 4824 03:21:23,700 --> 03:21:26,400 that you need to select your main project. 4825 03:21:26,400 --> 03:21:29,000 For example, you need to select this Earth to project 4826 03:21:29,000 --> 03:21:31,900 what you have created and click on OK 4827 03:21:31,900 --> 03:21:32,709 once you do 4828 03:21:32,709 --> 03:21:35,872 that they will be a project directory coming 4829 03:21:35,872 --> 03:21:38,300 from this Earth to will come here. 4830 03:21:38,300 --> 03:21:41,700 Now what we need to do go to your s RC / Main 4831 03:21:41,700 --> 03:21:43,628 and not ignore all this program. 4832 03:21:43,628 --> 03:21:46,400 I require only just are jocular because this is 4833 03:21:46,400 --> 03:21:48,500 where I've written my main function. 4834 03:21:48,500 --> 03:21:50,260 Important now after that 4835 03:21:50,260 --> 03:21:52,900 once you reach to this you need to go 4836 03:21:52,900 --> 03:21:55,900 to your run as Kayla application 4837 03:21:56,100 --> 03:21:59,600 and your spot code will start to execute now, 4838 03:21:59,600 --> 03:22:01,800 this will return me a row 0. 4839 03:22:02,000 --> 03:22:02,314 Okay. 4840 03:22:02,314 --> 03:22:03,700 Let's see this output. 4841 03:22:06,600 --> 03:22:08,200 Now if I see this, 4842 03:22:08,200 --> 03:22:11,800 this will show me once it's finished executing. 4843 03:22:22,900 --> 03:22:26,300 See this our area under carosi is this 4844 03:22:26,300 --> 03:22:29,107 so this is all computed with the elbows path program. 4845 03:22:29,107 --> 03:22:29,695 Similarly. 4846 03:22:29,695 --> 03:22:32,100 There are other programs also met will help you 4847 03:22:32,100 --> 03:22:33,400 to spin the data or not. 4848 03:22:33,509 --> 03:22:35,010 I'm not walking over all that. 4849 03:22:35,160 --> 03:22:39,000 Now, let's come back to my wedding and see 4850 03:22:39,000 --> 03:22:40,900 that what is the next step 4851 03:22:40,900 --> 03:22:44,500 what we will be doing so you can see this way will be next. 4852 03:22:44,500 --> 03:22:48,200 Is she getting created now, I'm keeping my Roc here. 4853 03:22:48,200 --> 03:22:53,100 Now after you have created your RZ you will be Our graph 4854 03:22:53,100 --> 03:22:56,200 now in Japan there is one important thing. 4855 03:22:56,200 --> 03:22:59,771 Japan is already of affected area of your organs. 
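The code inside the ROC program itself is treated as a black box in this session. One plausible way such an area-under-ROC figure is obtained in Spark 1.x is MLlib's BinaryClassificationMetrics, sketched below; scoreAndLabels is a placeholder RDD of (predicted score, true label) pairs produced by whatever model the earthquake project actually trains.

  import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
  import org.apache.spark.rdd.RDD

  // scoreAndLabels: one (score, label) pair per test example
  def areaUnderRoc(scoreAndLabels: RDD[(Double, Double)]): Double = {
    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    metrics.areaUnderROC()   // the "area under ROC" value printed in the console
  }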
4856 03:22:59,771 --> 03:23:01,714 And now the trouble here is 4857 03:23:01,714 --> 03:23:05,600 that whatever it's not the even for a minor earthquake. 4858 03:23:05,600 --> 03:23:07,852 I should start sending the alert right? 4859 03:23:07,852 --> 03:23:11,300 I don't want to do all that for the minor minor affection. 4860 03:23:11,300 --> 03:23:14,100 In fact, the buildings and the infrastructure. 4861 03:23:14,100 --> 03:23:17,300 What is created is the point is in such a way 4862 03:23:17,300 --> 03:23:18,600 if any odd quack 4863 03:23:18,600 --> 03:23:21,700 below six magnitude comes there there. 4864 03:23:22,000 --> 03:23:25,713 The phones are designed in a way that they will be no damage. 4865 03:23:25,713 --> 03:23:27,400 They will be no damage them. 4866 03:23:27,400 --> 03:23:29,400 So this is the major thing 4867 03:23:29,400 --> 03:23:33,300 when you work with your Japan free book now in Japan, 4868 03:23:33,300 --> 03:23:36,000 so that means with six they are not even worried 4869 03:23:36,000 --> 03:23:37,300 but about six they 4870 03:23:37,300 --> 03:23:40,668 are worried now for that day will be a graph simulation 4871 03:23:40,668 --> 03:23:43,600 what you can do you can do it with Park as well. 4872 03:23:43,600 --> 03:23:47,800 Once you generate this graph you will be seeing that anything 4873 03:23:47,800 --> 03:23:49,449 which is going above 6 4874 03:23:49,449 --> 03:23:52,000 if anything which is going above 6, 4875 03:23:52,000 --> 03:23:55,400 Should immediately start the vendor now ignore all 4876 03:23:55,400 --> 03:23:56,700 this programming side 4877 03:23:56,700 --> 03:23:59,800 because that is what we have just created and showing 4878 03:23:59,800 --> 03:24:01,411 you this execution fact now 4879 03:24:01,411 --> 03:24:03,800 if you have to visualize the same result, 4880 03:24:03,800 --> 03:24:05,200 this is what is happening. 4881 03:24:05,200 --> 03:24:07,300 This is showing my Roc but 4882 03:24:07,300 --> 03:24:11,800 if my artwork is going to be greater than 6 then only 4883 03:24:11,800 --> 03:24:16,415 weighs those alert then only send the alert to all the paper. 4884 03:24:16,415 --> 03:24:18,400 Otherwise take come 4885 03:24:18,600 --> 03:24:22,000 that is what the project what we generally show. 4886 03:24:22,000 --> 03:24:25,563 Oh in our space program sent now it is not the only project 4887 03:24:25,563 --> 03:24:28,900 we also kind of create multiple other products as well. 4888 03:24:28,900 --> 03:24:31,600 For example, I kind of create a model just 4889 03:24:31,600 --> 03:24:33,204 like how Walmart to it 4890 03:24:33,204 --> 03:24:35,100 how Walmart maybe creating 4891 03:24:35,100 --> 03:24:38,241 a whatever sales is happening with respect to that. 4892 03:24:38,241 --> 03:24:39,743 They're using Apache spark 4893 03:24:39,743 --> 03:24:43,000 and at the end they are kind of making you visualize the output 4894 03:24:43,000 --> 03:24:45,400 of doing whatever analytics they're doing. 4895 03:24:45,400 --> 03:24:46,900 So that is ordering the spark. 4896 03:24:46,900 --> 03:24:48,900 So all those things we walking through 4897 03:24:48,900 --> 03:24:52,252 when we do the per session all the things you learn quick. 4898 03:24:52,252 --> 03:24:55,100 I feel that all these projects are using right now, 4899 03:24:55,100 --> 03:24:56,700 since you do not know the topic 4900 03:24:56,700 --> 03:24:59,400 you are not able to get hundred percent of the project. 
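The alerting rule described above boils down to a threshold check: magnitudes up to 6 are absorbed by the building design, anything above 6 should trigger the warnings. A tiny sketch of that rule, where sendAlert stands in for whatever notification channel is used:

  // Buildings are engineered for quakes up to magnitude 6,
  // so only estimates above 6 should fan out alerts
  def shouldAlert(estimatedMagnitude: Double): Boolean = estimatedMagnitude > 6.0

  def handle(estimatedMagnitude: Double, sendAlert: Double => Unit): Unit =
    if (shouldAlert(estimatedMagnitude)) sendAlert(estimatedMagnitude)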
4901 03:24:59,400 --> 03:25:00,434 But at that time 4902 03:25:00,434 --> 03:25:03,366 once you know each and every topics of deadly 4903 03:25:03,366 --> 03:25:07,100 you will have a clearer picture of how spark is handling. 4904 03:25:07,100 --> 03:25:15,000 All these use cases graphs are very attractive 4905 03:25:15,000 --> 03:25:17,900 when it comes to modeling real world data 4906 03:25:17,900 --> 03:25:19,900 because they are intuitive flexible 4907 03:25:19,900 --> 03:25:23,100 and the theory supporting them has Been maturing 4908 03:25:23,100 --> 03:25:25,209 for centuries welcome everyone 4909 03:25:25,209 --> 03:25:27,600 in today's session on Spa Graphics. 4910 03:25:27,700 --> 03:25:30,700 So without any further delay, let's look at the agenda first. 4911 03:25:31,500 --> 03:25:34,561 We start by understanding the basics of craft Theory 4912 03:25:34,561 --> 03:25:36,229 and different types of craft. 4913 03:25:36,229 --> 03:25:38,806 Then we'll look at the features of Graphics 4914 03:25:38,806 --> 03:25:40,170 further will understand 4915 03:25:40,170 --> 03:25:43,820 what is property graph and look at various crafts operations. 4916 03:25:43,820 --> 03:25:44,594 Moving ahead. 4917 03:25:44,594 --> 03:25:48,258 We'll look at different graph processing algorithms at last. 4918 03:25:48,258 --> 03:25:49,500 We'll look at a demo 4919 03:25:49,500 --> 03:25:52,400 where we will try to analyze Ford's go by 4920 03:25:52,400 --> 03:25:54,700 data using pagerank algorithm. 4921 03:25:54,700 --> 03:25:56,800 Let's move to the first topic. 4922 03:25:57,200 --> 03:25:59,845 So we'll start with basics of graph. 4923 03:25:59,845 --> 03:26:03,661 So graphs are I basically made up of two sets called 4924 03:26:03,661 --> 03:26:05,089 vertices and edges. 4925 03:26:05,089 --> 03:26:08,704 The vertices are drawn from some underlying type 4926 03:26:08,704 --> 03:26:11,550 and the set can be finite or infinite. 4927 03:26:11,550 --> 03:26:12,900 Now each element 4928 03:26:12,900 --> 03:26:17,035 of the edge set is a pair consisting of two elements 4929 03:26:17,035 --> 03:26:18,728 from the vertices set. 4930 03:26:18,900 --> 03:26:21,400 So your vertex is V1. 4931 03:26:21,403 --> 03:26:23,173 Then your vertex is V3. 4932 03:26:23,173 --> 03:26:25,480 Then your vertex is V2 and V4. 4933 03:26:25,700 --> 03:26:29,300 And your edges are V 1 comma V 3 then next 4934 03:26:29,300 --> 03:26:33,500 is V 1 comma V 2 Then you have B2 comma V 3 4935 03:26:33,500 --> 03:26:34,961 and then you have V 4936 03:26:34,961 --> 03:26:38,807 2 comma V fo so basically we represent vertices set 4937 03:26:38,807 --> 03:26:43,000 as closed in curly braces all the name of vertices. 4938 03:26:43,100 --> 03:26:45,561 So we have V 1 we have V 2 4939 03:26:45,561 --> 03:26:48,176 we have V 3 and then we have before 4940 03:26:48,300 --> 03:26:53,073 and we'll close the curly braces and to represent the edge set. 4941 03:26:53,073 --> 03:26:56,600 We use curly braces again and then in curly braces, 4942 03:26:56,600 --> 03:27:00,907 we specify those two vertex which are joined by the edge. 4943 03:27:01,000 --> 03:27:02,600 So for this Edge, 4944 03:27:02,600 --> 03:27:07,700 we will use a viven comma V 3 and then for this Edge 4945 03:27:07,700 --> 03:27:12,700 will use we one comma V 2 and then for this Edge again, 4946 03:27:12,700 --> 03:27:15,000 we'll use V 2 comma V 4. 
4947 03:27:16,088 --> 03:27:19,011 And then at last for this Edge will use 4948 03:27:19,300 --> 03:27:23,700 we do comma V 3 and At Last I will close the curly braces. 4949 03:27:24,100 --> 03:27:26,400 So this is your vertices set. 4950 03:27:26,500 --> 03:27:28,900 And this is your headset. 4951 03:27:29,400 --> 03:27:31,958 Now one, very important thing that is 4952 03:27:31,958 --> 03:27:35,476 if headset is containing U comma V or you can say 4953 03:27:35,476 --> 03:27:38,700 that are instead is containing V 1 comma V 3. 4954 03:27:38,700 --> 03:27:42,000 So V1 is basically a adjacent to V 3. 4955 03:27:42,200 --> 03:27:45,100 Similarly your V 1 is adjacent to V 2. 4956 03:27:45,200 --> 03:27:48,427 Then V2 is adjacent to V for and looking at this 4957 03:27:48,427 --> 03:27:50,900 as you can say V2 is adjacent to V 3. 4958 03:27:50,900 --> 03:27:53,686 Now, let's quickly move ahead and we'll look 4959 03:27:53,686 --> 03:27:55,500 at different types of craft. 4960 03:27:55,500 --> 03:27:58,300 So first we have undirected graphs. 4961 03:27:58,500 --> 03:28:00,936 So basically in an undirected graph, 4962 03:28:00,936 --> 03:28:04,000 we use straight lines to represent the edges. 4963 03:28:04,000 --> 03:28:08,350 Now the order of the vertices in the edge set does not matter 4964 03:28:08,350 --> 03:28:09,800 in undirected graph. 4965 03:28:09,800 --> 03:28:14,040 So the undirected graph usually are drawn using straight lines 4966 03:28:14,040 --> 03:28:15,500 between the vertices. 4967 03:28:15,500 --> 03:28:18,300 Now it is almost similar to the graph 4968 03:28:18,300 --> 03:28:20,763 which we have seen in the last slide. 4969 03:28:20,763 --> 03:28:21,563 Similarly. 4970 03:28:21,563 --> 03:28:25,000 We can again represent the vertices set as 5 comma 4971 03:28:25,000 --> 03:28:27,500 6 comma 7 comma 8 and the edge 4972 03:28:27,500 --> 03:28:32,000 set as 5 comma 6 then 5 comma 7 now talking 4973 03:28:32,000 --> 03:28:33,643 about directed graphs. 4974 03:28:33,643 --> 03:28:37,605 So basically in a directed graph the order of vertices 4975 03:28:37,605 --> 03:28:39,400 in the edge set matters. 4976 03:28:39,700 --> 03:28:43,100 So we use Arrow to represent the edges 4977 03:28:43,300 --> 03:28:45,014 as you can see in the image 4978 03:28:45,014 --> 03:28:48,000 as It was not the case with the undirected graph 4979 03:28:48,000 --> 03:28:49,900 where we were using the straight lines. 4980 03:28:50,000 --> 03:28:51,400 So in directed graph, 4981 03:28:51,400 --> 03:28:56,000 we use Arrow to denote the edges and the important thing is 4982 03:28:56,000 --> 03:28:58,214 The Edge set should be similar. 4983 03:28:58,214 --> 03:29:00,500 It will contain the source vertex 4984 03:29:00,500 --> 03:29:04,200 that is five in this case and the destination vertex, 4985 03:29:04,200 --> 03:29:09,400 which is 6 in this case and this is never similar to six comma 4986 03:29:09,400 --> 03:29:13,300 five you cannot represent this Edge as 6 comma 5 4987 03:29:13,400 --> 03:29:17,100 because the direction always Does indeed directed graph 4988 03:29:17,100 --> 03:29:18,500 similarly you can see 4989 03:29:18,500 --> 03:29:20,556 that 5 is adjacent to 6, 4990 03:29:20,556 --> 03:29:23,787 but you cannot say that 6 is adjacent to 5. 
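Written out as plain Scala sets, before any GraphX, the difference between the two kinds of edge set is simply whether the pair order matters:

  val vertices = Set(5, 6, 7, 8)
  val edges    = Set((5, 6), (5, 7), (7, 8))

  // Undirected: (u, v) and (v, u) describe the same edge
  def adjacentUndirected(u: Int, v: Int): Boolean =
    edges.contains((u, v)) || edges.contains((v, u))

  // Directed: only the (source, destination) order counts, so 5 is
  // adjacent to 6 but 6 is not adjacent to 5
  def adjacentDirected(u: Int, v: Int): Boolean = edges.contains((u, v))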
4991 03:29:24,200 --> 03:29:29,000 So for this graph the vertices said would be similar as 5 comma 4992 03:29:29,000 --> 03:29:32,620 6 comma 7 comma 8 which was similar 4993 03:29:32,620 --> 03:29:34,158 in undirected graph, 4994 03:29:34,200 --> 03:29:38,700 but in directed graph your Edge set should be your first opal. 4995 03:29:38,700 --> 03:29:42,835 This one will be 5 comma 6 then you second Edge, 4996 03:29:42,835 --> 03:29:46,528 which is this one would be five comma Mama seven, 4997 03:29:47,000 --> 03:29:53,300 and at last your this set would be 7 comma 8 but in case 4998 03:29:53,300 --> 03:29:56,166 of undirected graph you can write this as 4999 03:29:56,166 --> 03:29:57,600 8 comma 7 or in case 5000 03:29:57,600 --> 03:30:00,400 of undirected graph you can write this one as seven comma 5001 03:30:00,400 --> 03:30:03,369 5 but this is not the case with the directed graph. 5002 03:30:03,369 --> 03:30:05,428 You have to follow the source vertex 5003 03:30:05,428 --> 03:30:08,100 and the destination vertex to represent the edge. 5004 03:30:08,100 --> 03:30:10,642 So I hope you guys are clear with the undirected 5005 03:30:10,642 --> 03:30:11,846 and directed graph. 5006 03:30:11,846 --> 03:30:12,100 Now. 5007 03:30:12,100 --> 03:30:15,200 Let's talk about vertex label graph now. 5008 03:30:15,200 --> 03:30:18,840 A Vertex liberal graph each vertex is labeled 5009 03:30:18,840 --> 03:30:21,650 with some data in addition to the data 5010 03:30:21,650 --> 03:30:23,700 that identifies the vertex. 5011 03:30:23,700 --> 03:30:28,100 So basically we say this X or this v as the vertex ID. 5012 03:30:28,200 --> 03:30:29,500 So there will be data 5013 03:30:29,500 --> 03:30:31,800 that would be added to this vertex. 5014 03:30:32,000 --> 03:30:35,200 So let's say this vertex would be 6 comma 5015 03:30:35,200 --> 03:30:37,500 and then we are adding the color 5016 03:30:37,500 --> 03:30:39,700 so it would be purple next. 5017 03:30:39,800 --> 03:30:42,100 This vertex would be 8 comma 5018 03:30:42,100 --> 03:30:44,700 and the color would be green next. 5019 03:30:44,700 --> 03:30:50,400 We'll say See this as 7 comma read and then this one is as 5020 03:30:50,400 --> 03:30:54,400 five comma blue now the six or this five 5021 03:30:54,400 --> 03:30:55,639 or seven or eight. 5022 03:30:55,639 --> 03:30:58,800 These are vertex ID and the additional data, 5023 03:30:58,800 --> 03:31:03,500 which is attached is the color like blue purple green or red. 5024 03:31:03,900 --> 03:31:08,696 But only the identifying data is present in the pair of edges 5025 03:31:08,696 --> 03:31:12,543 or you can say only the ID of the vertex is present 5026 03:31:12,543 --> 03:31:13,773 in the edge set. 5027 03:31:14,100 --> 03:31:15,322 So here the Edsel. 5028 03:31:15,322 --> 03:31:17,700 Again similar to your directed graph 5029 03:31:17,700 --> 03:31:19,587 that is your Source ID this 5030 03:31:19,587 --> 03:31:21,992 which is 5 and then destination ID, 5031 03:31:21,992 --> 03:31:25,274 which is 6 in this case then for this case. 5032 03:31:25,274 --> 03:31:28,785 It's similar as five comma 7 then in for this case. 5033 03:31:28,785 --> 03:31:30,469 It's similar as 7 comma 8 5034 03:31:30,469 --> 03:31:33,600 so we are not specifying this additional data, 5035 03:31:33,600 --> 03:31:35,699 which is attached to the vertices. 5036 03:31:35,699 --> 03:31:36,878 That is the color. 
5037 03:31:36,878 --> 03:31:40,121 If you only specify the identifiers of the vertex 5038 03:31:40,121 --> 03:31:41,300 that is the number 5039 03:31:41,300 --> 03:31:44,700 but your vertex set would be something 5040 03:31:44,700 --> 03:31:46,300 like so this vertex 5041 03:31:46,300 --> 03:31:50,100 would be 5 comma blue then your next vertex 5042 03:31:50,100 --> 03:31:52,600 will become 6 comma purple 5043 03:31:53,100 --> 03:31:56,700 then your next vertex will become 8 comma green 5044 03:31:57,000 --> 03:31:59,800 and at last your last vertex will be written 5045 03:31:59,800 --> 03:32:01,100 as 7 comma read. 5046 03:32:01,100 --> 03:32:04,808 So basically when you are specifying the vertices set 5047 03:32:04,808 --> 03:32:07,305 in the vertex label graph you attach 5048 03:32:07,305 --> 03:32:10,683 the additional information in the vertices are set 5049 03:32:10,683 --> 03:32:12,200 but while representing 5050 03:32:12,200 --> 03:32:16,183 the edge set it is represented similarly as A directed graph 5051 03:32:16,183 --> 03:32:19,900 where you have to just specify the source vertex identifier 5052 03:32:19,900 --> 03:32:20,900 and then you have 5053 03:32:20,900 --> 03:32:24,300 to specify the destination vertex identifier now. 5054 03:32:24,300 --> 03:32:27,500 I hope that you guys are clear with underrated directed 5055 03:32:27,500 --> 03:32:29,000 and vertex label graph. 5056 03:32:29,184 --> 03:32:33,615 So let's quickly move forward next we have cyclic graph. 5057 03:32:33,800 --> 03:32:36,800 So a cyclic graph is a directed graph 5058 03:32:36,900 --> 03:32:38,900 with at least one cycle 5059 03:32:39,000 --> 03:32:43,153 and the cycle is the path along with the directed edges 5060 03:32:43,153 --> 03:32:44,933 from a Vertex to itself. 5061 03:32:44,933 --> 03:32:47,000 So so once you see over here, 5062 03:32:47,000 --> 03:32:47,708 you can see 5063 03:32:47,708 --> 03:32:50,541 that from this vertex V. It's moving toward x 5064 03:32:50,541 --> 03:32:51,700 7 then it's moving 5065 03:32:51,700 --> 03:32:54,700 to vertex Aid then with arrows moving to vertex six. 5066 03:32:54,700 --> 03:32:57,539 And then again, it's moving to vertex V. 5067 03:32:57,539 --> 03:33:01,600 So there should be at least one cycle in a cyclic graph. 5068 03:33:01,600 --> 03:33:04,000 There might be a new component. 5069 03:33:04,000 --> 03:33:08,400 It's a Vertex 9 which is attached over here again, 5070 03:33:08,400 --> 03:33:10,401 so it would be a cyclic graph 5071 03:33:10,401 --> 03:33:13,300 because it has one complete cycle over here 5072 03:33:13,300 --> 03:33:15,500 and the important thing to notice is 5073 03:33:15,500 --> 03:33:20,300 That the arrow should make the cycle like from 5 to 7 5074 03:33:20,300 --> 03:33:23,300 and then from 7 to 8 and then 8 to 6 5075 03:33:23,300 --> 03:33:25,300 and 6 to 5 and let's say 5076 03:33:25,300 --> 03:33:26,831 that there is an arrow 5077 03:33:26,831 --> 03:33:30,281 from 5 to 6 and then there is an arrow from 6 to 8. 5078 03:33:30,281 --> 03:33:32,233 So we have flipped the arrows. 5079 03:33:32,233 --> 03:33:33,600 So in that situation, 5080 03:33:33,600 --> 03:33:36,372 this is not a cyclic graph because the arrows 5081 03:33:36,372 --> 03:33:38,200 are not completing the cycle. 5082 03:33:38,200 --> 03:33:41,370 So once you move from 5 to 7 and then from 7 to 8, 5083 03:33:41,370 --> 03:33:44,452 you cannot move from 8:00 to 6:00 and similarly 5084 03:33:44,452 --> 03:33:47,167 once you move from 5 to 6 and then 6 to 8. 
5085 03:33:47,167 --> 03:33:49,020 You cannot move from 8 to 7. 5086 03:33:49,020 --> 03:33:52,000 So in that situation, it's not a cyclic graph. 5087 03:33:52,000 --> 03:33:54,307 So let's clear all this thing. 5088 03:33:54,307 --> 03:33:56,461 So will represent this cycle 5089 03:33:56,461 --> 03:34:00,300 as five then using double arrows will go to 7 5090 03:34:00,300 --> 03:34:05,300 and then we'll move to 8 and then we'll move to 6 5091 03:34:05,300 --> 03:34:09,774 and at last we'll come back to 5 now. 5092 03:34:09,774 --> 03:34:11,851 We have Edge liberal graph. 5093 03:34:12,000 --> 03:34:15,030 So basically as label graph is a graph. 5094 03:34:15,030 --> 03:34:17,752 The edges are associated with labels. 5095 03:34:17,752 --> 03:34:22,059 So one can basically indicate this by making the edge set 5096 03:34:22,059 --> 03:34:23,906 as be a set of triplets. 5097 03:34:23,906 --> 03:34:25,600 So for example, 5098 03:34:25,600 --> 03:34:26,900 let's say this H 5099 03:34:26,900 --> 03:34:30,875 in this Edge label graph will be denoted as the source 5100 03:34:30,875 --> 03:34:33,200 which is 6 then the destination 5101 03:34:33,200 --> 03:34:38,000 which is 7 and then the label of the edge which is blue. 5102 03:34:38,000 --> 03:34:41,400 So this Edge would be defined something 5103 03:34:41,400 --> 03:34:44,700 like 6 comma 7 comma blue and then for this 5104 03:34:44,700 --> 03:34:47,100 and Hurley The Source vertex 5105 03:34:47,100 --> 03:34:49,414 that is 7 the destination vertex, 5106 03:34:49,414 --> 03:34:52,100 which is 8 then the label of the edge, 5107 03:34:52,100 --> 03:34:55,400 which is white like similarly for this Edge. 5108 03:34:55,400 --> 03:35:00,200 It's five comma 7 and then blue comma red. 5109 03:35:01,000 --> 03:35:03,076 And it lasts for this Edge. 5110 03:35:03,076 --> 03:35:09,200 It's five comma six and then it would be yellow common green, 5111 03:35:09,200 --> 03:35:11,362 which is the label of the edge. 5112 03:35:11,362 --> 03:35:14,665 So all these four edges will become the headset 5113 03:35:14,665 --> 03:35:18,400 for this graph and the vertices set is almost similar 5114 03:35:18,400 --> 03:35:21,200 that is 5 comma 6 comma 7 comma 8 now 5115 03:35:21,200 --> 03:35:24,200 to generalize this I would say x 5116 03:35:24,200 --> 03:35:26,400 comma y so X here is 5117 03:35:26,400 --> 03:35:30,700 the source vertex then why here is the destination vertex? 5118 03:35:30,700 --> 03:35:33,914 X and then a here is the label of the edge 5119 03:35:33,914 --> 03:35:36,900 then Edge label graph are usually drawn 5120 03:35:36,900 --> 03:35:39,573 with the labels written adjacent to the Earth 5121 03:35:39,573 --> 03:35:40,902 specifying the edges 5122 03:35:40,902 --> 03:35:41,900 as you can see. 5123 03:35:41,900 --> 03:35:43,900 We have mentioned blue white 5124 03:35:43,900 --> 03:35:46,695 and all those label addition to the edges. 5125 03:35:46,695 --> 03:35:50,400 So I hope you guys a player with the edge label graph, 5126 03:35:50,400 --> 03:35:51,561 which is nothing 5127 03:35:51,561 --> 03:35:54,900 but labels attached to each and every Edge now, 5128 03:35:54,900 --> 03:35:57,200 let's talk about weighted graph. 5129 03:35:57,200 --> 03:36:00,310 So we did graph is an edge label draft. 
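Looking back at the edge-labelled and weighted graphs for a moment, both can be written as triples of (source, destination, label); the only difference is that a weight is a numeric label you can compare and add, as in this small sketch using the edges from the examples above:

  // Edge-labelled graph: labels are arbitrary values such as colours
  val labelledEdges = Set((6, 7, "blue"), (7, 8, "white"))

  // Weighted graph: labels are numbers, so arithmetic and comparisons apply
  val weightedEdges = Set((5, 6, 3), (6, 7, 6))
  val totalWeight   = weightedEdges.toSeq.map(_._3).sum   // 9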
5130 03:36:00,700 --> 03:36:03,700 Where the labels can be operated on by 5131 03:36:03,700 --> 03:36:06,921 usually automatic operators or comparison operators, 5132 03:36:06,921 --> 03:36:09,700 like less than or greater than symbol usually 5133 03:36:09,700 --> 03:36:12,900 these are integers or floats and the idea is 5134 03:36:12,900 --> 03:36:15,534 that some edges may be more expensive 5135 03:36:15,534 --> 03:36:18,900 and this cost is represented by the edge labels 5136 03:36:18,900 --> 03:36:22,992 or weights now in short weighted graphs are a special kind 5137 03:36:22,992 --> 03:36:24,500 of Edgley build rafts 5138 03:36:24,500 --> 03:36:27,200 where your Edge is attached to a weight. 5139 03:36:27,200 --> 03:36:29,800 Generally, which is a integer or a float 5140 03:36:29,800 --> 03:36:33,100 so that we can perform some addition or subtraction 5141 03:36:33,100 --> 03:36:35,452 or different kind of automatic operations 5142 03:36:35,452 --> 03:36:36,689 or it can be some kind 5143 03:36:36,689 --> 03:36:39,500 of conditional operations like less than or greater 5144 03:36:39,500 --> 03:36:40,800 than so we'll again 5145 03:36:40,800 --> 03:36:45,700 represent this Edge as 5 comma 6 and then the weight as 3 5146 03:36:46,100 --> 03:36:49,900 and similarly will represent this Edge as 6 comma 5147 03:36:49,900 --> 03:36:55,351 7 and the weight is again 6 so similarly we represent 5148 03:36:55,351 --> 03:36:57,197 these two edges as well. 5149 03:36:57,300 --> 03:36:57,900 So I hope 5150 03:36:57,900 --> 03:37:00,500 that you guys are clear with the weighted graphs. 5151 03:37:00,500 --> 03:37:02,300 Now let's quickly move ahead and look 5152 03:37:02,300 --> 03:37:04,200 at this directed acyclic graph. 5153 03:37:04,200 --> 03:37:06,900 So this is a directed acyclic graph, 5154 03:37:07,100 --> 03:37:09,500 which is basically without Cycles. 5155 03:37:09,500 --> 03:37:12,445 So as we just discussed in cyclic graphs here, 5156 03:37:12,445 --> 03:37:13,151 you can see 5157 03:37:13,151 --> 03:37:16,601 that it is not completing the graph from the directions 5158 03:37:16,601 --> 03:37:19,607 or you can say the direction of the edges, right? 5159 03:37:19,607 --> 03:37:21,011 We can move from 5 to 7, 5160 03:37:21,011 --> 03:37:22,164 then seven to eight 5161 03:37:22,164 --> 03:37:25,500 but we cannot move from 8 to 6 and similarly we can move 5162 03:37:25,500 --> 03:37:27,600 from 5:00 to 6:00 then 6:00 to 8:00, 5163 03:37:27,600 --> 03:37:29,700 but we cannot move from 8 to 7. 5164 03:37:29,700 --> 03:37:32,962 So this is Not forming a cycle and these kind 5165 03:37:32,962 --> 03:37:36,300 of crafts are known as directed acyclic graph. 5166 03:37:36,300 --> 03:37:39,914 Now, they appear as special cases in CS application all 5167 03:37:39,914 --> 03:37:41,855 the time and the vertices set 5168 03:37:41,855 --> 03:37:44,600 and the edge set are represented similarly 5169 03:37:44,700 --> 03:37:46,700 as we have seen earlier not talking 5170 03:37:46,700 --> 03:37:48,670 about the disconnected graph. 5171 03:37:48,670 --> 03:37:51,972 So vertices in a graph do not need to be connected 5172 03:37:51,972 --> 03:37:53,100 to other vertices. 5173 03:37:53,100 --> 03:37:54,466 It is basically legal 5174 03:37:54,466 --> 03:37:57,200 for a graph to have disconnected components 5175 03:37:57,200 --> 03:38:00,466 or even loan vertices without a single connection. 
5176 03:38:00,466 --> 03:38:04,400 So basically this disconnected graph which has four vertices 5177 03:38:04,400 --> 03:38:05,300 but no edges. 5178 03:38:05,300 --> 03:38:05,543 Now. 5179 03:38:05,543 --> 03:38:08,100 Let me tell you something important that is 5180 03:38:08,100 --> 03:38:10,176 what our sources and sinks. 5181 03:38:10,200 --> 03:38:13,738 So let's say we have one Arrow from five to six 5182 03:38:13,738 --> 03:38:18,233 and one Arrow from 5 to 7 now word is with only 5183 03:38:18,233 --> 03:38:20,233 in arrows are called sink. 5184 03:38:20,600 --> 03:38:25,200 So the 7 and 6 are known as sinks and the vertices 5185 03:38:25,307 --> 03:38:28,400 with only out arrows are called sources. 5186 03:38:28,400 --> 03:38:32,500 So as you can see in the image this Five only have out arrows 5187 03:38:32,500 --> 03:38:33,800 to six and seven. 5188 03:38:33,800 --> 03:38:36,200 So these are called sources now. 5189 03:38:36,200 --> 03:38:38,506 We'll talk about this in a while guys. 5190 03:38:38,506 --> 03:38:41,500 Once we are going through the pagerank algorithm. 5191 03:38:41,500 --> 03:38:45,228 So I hope that you guys know what our vertices what our edges 5192 03:38:45,228 --> 03:38:48,149 how vertices and edges represents the graph then 5193 03:38:48,149 --> 03:38:50,200 what are different kinds of graph? 5194 03:38:50,384 --> 03:38:52,615 Let's move to the next topic. 5195 03:38:52,800 --> 03:38:54,236 So next let's know. 5196 03:38:54,236 --> 03:38:55,900 What is Park Graphics. 5197 03:38:55,900 --> 03:38:58,616 So talking about Graphics Graphics is 5198 03:38:58,616 --> 03:39:00,519 a new component in spark. 5199 03:39:00,519 --> 03:39:03,843 For graphs and crafts parallel computation now 5200 03:39:03,843 --> 03:39:06,170 at a high level graphic extends 5201 03:39:06,170 --> 03:39:09,954 The Spark rdd by introducing a new graph abstraction 5202 03:39:09,954 --> 03:39:12,046 that is directed multigraph 5203 03:39:12,046 --> 03:39:15,122 that is properties attached to each vertex 5204 03:39:15,122 --> 03:39:18,800 and Edge now to support craft computation Graphics 5205 03:39:18,800 --> 03:39:22,320 basically exposes a set of fundamental operators, 5206 03:39:22,320 --> 03:39:25,400 like finding sub graph for joining vertices 5207 03:39:25,400 --> 03:39:30,253 or aggregating messages as well as it also exposes and optimize. 5208 03:39:30,253 --> 03:39:34,713 This variant of the pregnant a pi in addition Graphics also 5209 03:39:34,713 --> 03:39:37,987 provides you a collection of graph algorithms 5210 03:39:37,987 --> 03:39:41,700 and Builders to simplify your spark analytics tasks. 5211 03:39:41,700 --> 03:39:45,600 So basically your graphics is extending your spark rdd. 5212 03:39:45,600 --> 03:39:48,800 Then you have Graphics is providing an abstraction 5213 03:39:48,800 --> 03:39:50,614 that is directed multigraph 5214 03:39:50,614 --> 03:39:53,800 with properties attached to each vertex and Edge. 5215 03:39:53,800 --> 03:39:56,800 So we'll look at this property graph in a while. 5216 03:39:56,800 --> 03:40:00,200 Then again Graphics gives you some fundamental operators 5217 03:40:00,200 --> 03:40:01,000 and Then it also 5218 03:40:01,000 --> 03:40:03,800 provides you some graph algorithms and Builders 5219 03:40:03,800 --> 03:40:07,260 which makes your analytics easier now to get started 5220 03:40:07,260 --> 03:40:11,400 you first need to import spark and Graphics into your project. 
5221 03:40:11,400 --> 03:40:12,550 So as you can see, 5222 03:40:12,550 --> 03:40:15,875 we are importing first Park and then we are importing 5223 03:40:15,875 --> 03:40:19,200 spark Graphics to get those graphics functionalities. 5224 03:40:19,200 --> 03:40:21,150 And at last we are importing 5225 03:40:21,150 --> 03:40:25,400 spark rdd to use those already functionalities in our program. 5226 03:40:25,400 --> 03:40:28,098 But let me tell you that if you are not using 5227 03:40:28,098 --> 03:40:30,400 spark shell then you will need a spark. 5228 03:40:30,400 --> 03:40:31,807 Context in your program. 5229 03:40:31,807 --> 03:40:32,341 So I hope 5230 03:40:32,341 --> 03:40:35,400 that you guys are clear with the features of graphics 5231 03:40:35,400 --> 03:40:36,400 and the libraries 5232 03:40:36,400 --> 03:40:39,200 which you need to import in order to use Graphics. 5233 03:40:39,300 --> 03:40:43,500 So let us quickly move ahead and look at the property graph. 5234 03:40:43,500 --> 03:40:45,800 Now property graph is something 5235 03:40:45,800 --> 03:40:50,300 as the name suggests property graph have properties attached 5236 03:40:50,300 --> 03:40:52,400 to each vertex and Edge. 5237 03:40:52,500 --> 03:40:54,115 So the property graph 5238 03:40:54,115 --> 03:40:58,653 is a directed multigraph with user-defined objects attached 5239 03:40:58,653 --> 03:41:00,500 to each vertex and Edge. 5240 03:41:00,500 --> 03:41:03,700 Now you might be wondering what is undirected multigraph. 5241 03:41:03,700 --> 03:41:08,123 So a directed multi graph is a directed graph with potentially 5242 03:41:08,123 --> 03:41:11,137 multiple parallel edges sharing same source 5243 03:41:11,137 --> 03:41:13,050 and same destination vertex. 5244 03:41:13,050 --> 03:41:15,102 So as you can see in the image 5245 03:41:15,102 --> 03:41:17,700 that from San Francisco to Los Angeles, 5246 03:41:17,700 --> 03:41:22,106 we have two edges and similarly from Los Angeles to Chicago. 5247 03:41:22,106 --> 03:41:23,600 There are two edges. 5248 03:41:23,600 --> 03:41:26,019 So basically in a directed multigraph, 5249 03:41:26,019 --> 03:41:28,400 the first thing is the directed graph, 5250 03:41:28,400 --> 03:41:30,386 so it should have a Direction. 5251 03:41:30,386 --> 03:41:33,300 Ian attached to the edges and then talking 5252 03:41:33,300 --> 03:41:36,100 about multigraph so between Source vertex 5253 03:41:36,100 --> 03:41:37,850 and a destination vertex, 5254 03:41:37,850 --> 03:41:39,600 there could be two edges. 5255 03:41:39,800 --> 03:41:42,886 So the ability to support parallel edges 5256 03:41:42,886 --> 03:41:46,100 basically simplifies the modeling scenarios 5257 03:41:46,100 --> 03:41:49,054 where there can be multiple relationships 5258 03:41:49,054 --> 03:41:51,997 between the same vertices for an example. 5259 03:41:51,997 --> 03:41:54,200 Let's say these are two persons 5260 03:41:54,200 --> 03:41:56,644 so they can be friends as well as they 5261 03:41:56,644 --> 03:41:58,361 can be co-workers, right? 5262 03:41:58,361 --> 03:42:02,000 So these kind of scenarios can be Easily modeled using 5263 03:42:02,000 --> 03:42:03,900 directed multigraph now. 5264 03:42:03,900 --> 03:42:08,700 Each vertex is keyed by a unique 64-bit long identifier, 5265 03:42:08,800 --> 03:42:12,700 which is basically the vertex ID and it helps an indexing. 
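For reference, the imports being described are the standard GraphX ones; if you are not inside spark-shell you also need to construct the SparkContext yourself, as mentioned above.

  import org.apache.spark._
  import org.apache.spark.graphx._
  // RDD is needed because the property-graph examples build vertex and edge RDDs
  import org.apache.spark.rdd.RDD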
5266 03:42:12,700 --> 03:42:16,500 So each of your vertex contains a Vertex ID, 5267 03:42:16,600 --> 03:42:20,000 which is a unique 64-bit long identifier 5268 03:42:20,200 --> 03:42:21,900 and similarly edges 5269 03:42:21,900 --> 03:42:26,600 have corresponding source and destination vertex identifiers. 5270 03:42:26,700 --> 03:42:28,174 So this Edge would have 5271 03:42:28,174 --> 03:42:31,647 this vertex identifier as well as This vertex identifier 5272 03:42:31,647 --> 03:42:35,620 or you can say Source vertex ID and the destination vertex ID. 5273 03:42:35,620 --> 03:42:37,900 So as we discuss this property graph 5274 03:42:37,900 --> 03:42:42,300 is basically parameterised over the vertex and Edge types, 5275 03:42:42,300 --> 03:42:45,684 and these are the types of objects associated 5276 03:42:45,684 --> 03:42:47,700 with each vertex and Edge. 5277 03:42:48,400 --> 03:42:51,792 So your graphics basically optimizes the representation 5278 03:42:51,792 --> 03:42:53,300 of vertex and Edge types 5279 03:42:53,300 --> 03:42:56,900 and it reduces the in memory footprint by storing 5280 03:42:56,900 --> 03:43:00,400 the primitive data types in a specialized array. 5281 03:43:00,400 --> 03:43:04,400 In some cases it might be desirable to have vertices 5282 03:43:04,400 --> 03:43:07,200 with different property types in the same graph. 5283 03:43:07,200 --> 03:43:10,400 Now this can be accomplished through inheritance. 5284 03:43:10,400 --> 03:43:14,000 So for an example to model a user and product 5285 03:43:14,000 --> 03:43:15,300 in a bipartite graph, 5286 03:43:15,300 --> 03:43:17,676 or you can see that we have user property 5287 03:43:17,676 --> 03:43:19,400 and we have product property. 5288 03:43:19,400 --> 03:43:19,762 Okay. 5289 03:43:19,762 --> 03:43:23,400 So let me first tell you what is a bipartite graph. 5290 03:43:23,400 --> 03:43:26,861 So a bipartite graph is also called a by graph 5291 03:43:27,000 --> 03:43:29,500 which is a set of graph vertices. 5292 03:43:30,300 --> 03:43:35,400 Opposed into two disjoint sets such that no two graph vertices 5293 03:43:35,469 --> 03:43:37,930 within the same set are adjacent. 5294 03:43:38,100 --> 03:43:39,700 So as you can see over here, 5295 03:43:39,700 --> 03:43:43,000 we have user property and then we have product property 5296 03:43:43,000 --> 03:43:46,282 but no to user property can be adjacent or you 5297 03:43:46,282 --> 03:43:48,592 can say there should be no edges 5298 03:43:48,592 --> 03:43:51,707 that is joining any of the to user property or 5299 03:43:51,707 --> 03:43:53,300 there should be no Edge 5300 03:43:53,300 --> 03:43:56,000 that should be joining product property. 5301 03:43:56,400 --> 03:44:00,000 So in this scenario we use inheritance. 5302 03:44:00,200 --> 03:44:01,757 So as you can see here, 5303 03:44:01,757 --> 03:44:04,600 we have class vertex property now basically 5304 03:44:04,600 --> 03:44:07,400 what we are doing we are creating another class 5305 03:44:07,400 --> 03:44:08,900 with user property. 5306 03:44:08,900 --> 03:44:10,700 And here we have name, 5307 03:44:10,700 --> 03:44:13,500 which is again a string and we are extending 5308 03:44:13,500 --> 03:44:17,038 or you can say we are inheriting the vertex property class. 5309 03:44:17,038 --> 03:44:19,600 Now again, in the case of product property. 
5310 03:44:19,600 --> 03:44:22,100 We have name that is name of the product 5311 03:44:22,100 --> 03:44:25,000 which is again string and then we have price of the product 5312 03:44:25,000 --> 03:44:25,985 which is double 5313 03:44:25,985 --> 03:44:29,400 and we are again extending this vertex property graph 5314 03:44:29,400 --> 03:44:32,900 and at last You're grading a graph with this vertex property 5315 03:44:32,900 --> 03:44:33,900 and then string. 5316 03:44:33,900 --> 03:44:37,045 So this is how we can basically model user 5317 03:44:37,045 --> 03:44:39,500 and product as a bipartite graph. 5318 03:44:39,500 --> 03:44:41,430 So we have created user property 5319 03:44:41,430 --> 03:44:44,265 as well as we have created this product property 5320 03:44:44,265 --> 03:44:47,100 and we are extending this vertex property class. 5321 03:44:47,400 --> 03:44:50,076 No talking about this property graph. 5322 03:44:50,076 --> 03:44:51,907 It's similar to your rdd. 5323 03:44:51,907 --> 03:44:55,900 So like your rdd property graph are immutable distributed 5324 03:44:55,900 --> 03:44:57,200 and fault tolerant. 5325 03:44:57,200 --> 03:45:00,491 So changes to the values or structure of the graph. 5326 03:45:00,491 --> 03:45:01,908 Basically accomplished 5327 03:45:01,908 --> 03:45:04,900 by producing a new graph with the desired changes 5328 03:45:04,900 --> 03:45:07,700 and the substantial part of the original graph 5329 03:45:07,700 --> 03:45:09,900 which can be your structure of the graph 5330 03:45:09,900 --> 03:45:11,800 or attributes or indices. 5331 03:45:11,800 --> 03:45:15,081 These are basically reused in the new graph reducing 5332 03:45:15,081 --> 03:45:18,040 the cost of inherent functional data structure. 5333 03:45:18,040 --> 03:45:20,100 So basically your property graph 5334 03:45:20,100 --> 03:45:22,500 once you're trying to change values of structure. 5335 03:45:22,500 --> 03:45:26,024 So it creates a new graph with changed structure 5336 03:45:26,024 --> 03:45:27,300 or changed values 5337 03:45:27,300 --> 03:45:30,182 and zero substantial part of original graph. 5338 03:45:30,182 --> 03:45:33,300 Re used multiple times to improve the performance 5339 03:45:33,300 --> 03:45:35,900 and it can be your structure of the graph 5340 03:45:35,900 --> 03:45:38,600 which is getting reuse or it can be your attributes 5341 03:45:38,600 --> 03:45:41,000 or indices of the graph which is getting reused. 5342 03:45:41,000 --> 03:45:44,400 So this is how your property graph provides efficiency. 5343 03:45:44,400 --> 03:45:46,400 Now, the graph is partitioned 5344 03:45:46,400 --> 03:45:48,800 across the executors using a range 5345 03:45:48,800 --> 03:45:50,500 of vertex partitioning rules, 5346 03:45:50,500 --> 03:45:52,700 which are basically Loosely defined 5347 03:45:52,700 --> 03:45:56,514 and similar to our DD each partition of the graph 5348 03:45:56,514 --> 03:45:57,800 can be recreated 5349 03:45:57,800 --> 03:46:01,100 on different machines in the event of Failure. 5350 03:46:01,100 --> 03:46:05,000 So this is how your property graph provides fault tolerance. 
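The inheritance pattern just described for the user/product bipartite graph looks like this in code, matching the class names mentioned above:

  class VertexProperty()

  // A user vertex carries only a name
  case class UserProperty(val name: String) extends VertexProperty

  // A product vertex carries a name and a price
  case class ProductProperty(val name: String, val price: Double) extends VertexProperty

  // The graph can then hold both kinds of vertex behind the common supertype
  var graph: Graph[VertexProperty, String] = null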
5351 03:46:05,000 --> 03:46:07,643 So as we already discussed logically 5352 03:46:07,643 --> 03:46:12,174 the property graph corresponds to a pair of type collections, 5353 03:46:12,174 --> 03:46:15,800 including the properties for each vertex and Edge 5354 03:46:15,800 --> 03:46:17,338 and as a consequence 5355 03:46:17,338 --> 03:46:21,492 the graph class contains members to access the vertices 5356 03:46:21,492 --> 03:46:22,569 and the edges. 5357 03:46:22,800 --> 03:46:24,067 So as you can see we 5358 03:46:24,067 --> 03:46:27,300 have graphed class then you can see we have vertices 5359 03:46:27,307 --> 03:46:28,692 and we have edges. 5360 03:46:29,500 --> 03:46:34,400 Now this vertex Rd DVD is extending your rdd, 5361 03:46:34,600 --> 03:46:41,100 which is your body D and then your vertex ID 5362 03:46:41,500 --> 03:46:43,807 and then your vertex property. 5363 03:46:44,600 --> 03:46:45,100 Similarly. 5364 03:46:45,100 --> 03:46:47,600 Your Edge rdd is extending 5365 03:46:47,600 --> 03:46:53,500 your Oddity with your Edge property so the classes 5366 03:46:53,500 --> 03:46:54,900 that is vertex rdd 5367 03:46:54,900 --> 03:47:00,100 and HR DD extends under optimized version of your rdd, 5368 03:47:00,100 --> 03:47:03,810 which includes vertex idn vertex property and your rdd 5369 03:47:03,810 --> 03:47:06,746 which includes your Edge property and Booth 5370 03:47:06,746 --> 03:47:07,795 this vertex rdd 5371 03:47:07,795 --> 03:47:11,501 and hrd provides additional functionality build on top 5372 03:47:11,501 --> 03:47:12,876 of graph computation 5373 03:47:12,876 --> 03:47:15,900 and leverages internal optimizations as well. 5374 03:47:15,900 --> 03:47:19,159 So this is the reason we use this Vertex rdd or Edge already 5375 03:47:19,159 --> 03:47:22,500 because it already extends your already containing your word. 5376 03:47:22,500 --> 03:47:23,888 X ID and vertex property 5377 03:47:23,888 --> 03:47:26,700 or your Edge property it also provides you 5378 03:47:26,700 --> 03:47:30,100 additional functionalities built on top of craft computation. 5379 03:47:30,100 --> 03:47:33,700 And again, it gives you some internal optimizations as well. 5380 03:47:34,100 --> 03:47:37,715 Now, let me clear this and let's take an example 5381 03:47:37,715 --> 03:47:39,000 of property graph 5382 03:47:39,000 --> 03:47:40,633 where the vertex property 5383 03:47:40,633 --> 03:47:43,300 might contain the user name and occupation. 5384 03:47:43,300 --> 03:47:47,200 So as you can see in this table that we have ID of the vertex 5385 03:47:47,200 --> 03:47:50,000 and then we have property attached to each vertex. 5386 03:47:50,000 --> 03:47:52,602 That is the username as well as the Station 5387 03:47:52,602 --> 03:47:55,700 of the user or you can see the profession of the user 5388 03:47:55,700 --> 03:47:58,715 and we can annotate the edges with the string 5389 03:47:58,715 --> 03:48:01,800 describing the relationship between the users. 5390 03:48:01,800 --> 03:48:04,400 So so as you can see first is Thomas 5391 03:48:04,400 --> 03:48:06,300 who is a professor then second is Frank 5392 03:48:06,300 --> 03:48:08,000 who is also a professor then 5393 03:48:08,000 --> 03:48:09,900 as you can see third is Jenny. 5394 03:48:09,900 --> 03:48:12,241 She's a student and forth is Bob 5395 03:48:12,241 --> 03:48:15,997 who is a doctor now Thomas is a colleague of Frank. 5396 03:48:15,997 --> 03:48:17,200 Then you can see 5397 03:48:17,200 --> 03:48:21,000 that Thomas is academic advisor of Jenny again. 
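Before continuing with the example users, here are the Graph class members mentioned a moment ago in simplified signature form:

  class Graph[VD, ED] {
    val vertices: VertexRDD[VD]   // VertexRDD[VD] extends RDD[(VertexId, VD)]
    val edges: EdgeRDD[ED]        // EdgeRDD[ED] extends RDD[Edge[ED]]
  }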
5398 03:48:21,000 --> 03:48:23,153 Frank is also a Make advisor 5399 03:48:23,153 --> 03:48:27,692 of Jenny and then the doctor is the health advisor of Jenny. 5400 03:48:27,700 --> 03:48:31,200 So the resulting graph would have a signature 5401 03:48:31,200 --> 03:48:32,800 of something like this. 5402 03:48:32,800 --> 03:48:34,800 So I'll explain this in a while. 5403 03:48:34,900 --> 03:48:38,300 So there are numerous ways to construct the property graph 5404 03:48:38,300 --> 03:48:39,300 from raw files 5405 03:48:39,300 --> 03:48:43,400 or RDS or even synthetic generators and we'll discuss it 5406 03:48:43,400 --> 03:48:44,766 in graph Builders, 5407 03:48:44,766 --> 03:48:46,313 but the very probable 5408 03:48:46,313 --> 03:48:49,700 and most General method is to use graph object. 5409 03:48:49,700 --> 03:48:52,129 So let's take a look at the code first. 5410 03:48:52,129 --> 03:48:53,651 And so first over here, 5411 03:48:53,651 --> 03:48:55,900 we are assuming that Parker context 5412 03:48:55,900 --> 03:48:58,100 has already been constructed. 5413 03:48:58,100 --> 03:49:01,700 Then we are giving the SES power context next. 5414 03:49:01,700 --> 03:49:04,600 We are creating an rdd for the vertices. 5415 03:49:04,600 --> 03:49:06,689 So as you can see for users, 5416 03:49:06,689 --> 03:49:09,600 we have specified idd and then vertex ID 5417 03:49:09,600 --> 03:49:11,393 and then these are two strings. 5418 03:49:11,393 --> 03:49:12,605 So first one would be 5419 03:49:12,605 --> 03:49:15,900 your username and the second one will be your profession. 5420 03:49:15,900 --> 03:49:19,612 Then we are using SC paralyzed and we are creating an array 5421 03:49:19,612 --> 03:49:22,300 where we are specifying all the vertices so 5422 03:49:22,300 --> 03:49:23,838 And that is this one 5423 03:49:23,900 --> 03:49:25,900 and you are getting the name as Thomas 5424 03:49:25,900 --> 03:49:26,800 and the profession 5425 03:49:26,800 --> 03:49:30,646 is Professor similarly for to well Frank Professor. 5426 03:49:30,646 --> 03:49:34,600 Then 3L Jenny cheese student and 4L Bob doctors. 5427 03:49:34,600 --> 03:49:37,746 So here we have created the vertex next. 5428 03:49:37,746 --> 03:49:40,207 We are creating an rdd for edges. 5429 03:49:40,500 --> 03:49:43,400 So first we are giving the values relationship. 5430 03:49:43,400 --> 03:49:46,400 Then we are creating an rdd with Edge string 5431 03:49:46,400 --> 03:49:50,000 and then we're using SC paralyzed to create the edge 5432 03:49:50,000 --> 03:49:52,948 and in the array we are specifying the A source vertex, 5433 03:49:52,948 --> 03:49:55,595 then we are specifying the destination vertex. 5434 03:49:55,595 --> 03:49:57,400 And then we are giving the relation 5435 03:49:57,400 --> 03:50:01,000 that is colleague similarly for next Edge resources 5436 03:50:01,000 --> 03:50:02,800 when this nation is one 5437 03:50:02,800 --> 03:50:06,131 and then the profession is academic advisor 5438 03:50:06,165 --> 03:50:07,934 and then it goes so on. 5439 03:50:08,242 --> 03:50:11,857 So then this line we are defining a default user 5440 03:50:12,200 --> 03:50:16,276 in case there is a relationship between missing users. 5441 03:50:16,300 --> 03:50:18,900 Now we have given the name as default user 5442 03:50:18,900 --> 03:50:20,800 and the profession is missing. 5443 03:50:21,400 --> 03:50:24,000 Nature trying to build an initial graph. 5444 03:50:24,000 --> 03:50:27,100 So for that we are using this graph object. 
5445 03:50:27,100 --> 03:50:30,100 So we have specified users that is your vertices. 5446 03:50:30,100 --> 03:50:34,300 Then we are specifying the relations that is your edges. 5447 03:50:34,400 --> 03:50:36,867 And then we are giving the default user 5448 03:50:36,867 --> 03:50:39,400 which is basically for any missing user. 5449 03:50:39,400 --> 03:50:41,800 So now as you can see over here, 5450 03:50:41,800 --> 03:50:46,700 we are using Edge case class and edges have a source ID 5451 03:50:46,700 --> 03:50:48,300 and a destination ID, 5452 03:50:48,300 --> 03:50:51,300 which is basically corresponding to your source 5453 03:50:51,300 --> 03:50:52,800 and destination vertex. 5454 03:50:52,800 --> 03:50:55,100 And in addition to the Edge class. 5455 03:50:55,100 --> 03:50:56,900 We have an attribute member 5456 03:50:56,900 --> 03:51:00,600 which stores The Edge property which is the relation over here 5457 03:51:00,600 --> 03:51:01,600 that is colleague 5458 03:51:01,600 --> 03:51:06,138 or it is academic advisor or it is Health advisor and so on. 5459 03:51:06,200 --> 03:51:06,900 So, I hope 5460 03:51:06,900 --> 03:51:10,287 that you guys are clear about creating a property graph 5461 03:51:10,287 --> 03:51:13,800 how to specify the vertices how to specify edges and then 5462 03:51:13,800 --> 03:51:17,763 how to create a graph Now we can deconstruct a graph 5463 03:51:17,763 --> 03:51:19,461 into respective vertex 5464 03:51:19,461 --> 03:51:23,000 and Edge views by using a graph toward vertices 5465 03:51:23,000 --> 03:51:24,900 and graph edges members. 5466 03:51:25,000 --> 03:51:27,041 So as you can see we are using craft 5467 03:51:27,041 --> 03:51:30,100 or vertices over here and crafts dot edges over here. 5468 03:51:30,100 --> 03:51:32,100 Now what we are trying to do. 5469 03:51:32,100 --> 03:51:35,900 So first over here the graph which we have created earlier. 5470 03:51:35,900 --> 03:51:37,291 So we have graphed 5471 03:51:37,300 --> 03:51:40,700 vertices dot filter Now using this case class. 5472 03:51:40,700 --> 03:51:42,300 We have this vertex ID. 5473 03:51:42,300 --> 03:51:45,378 We have the name and then we have the position. 5474 03:51:45,378 --> 03:51:48,322 And we are specifying the position as doctor. 5475 03:51:48,322 --> 03:51:51,400 So first we are trying to filter the profession 5476 03:51:51,400 --> 03:51:53,600 of the user as doctor. 5477 03:51:53,600 --> 03:51:55,400 And then we are trying to count. 5478 03:51:55,400 --> 03:51:55,630 It. 5479 03:51:55,900 --> 03:51:56,900 Next. 5480 03:51:56,900 --> 03:51:59,700 We are specifying graph edges filter 5481 03:51:59,900 --> 03:52:03,270 and we are basically trying to filter the edges 5482 03:52:03,270 --> 03:52:07,300 where the source ID is greater than your destination ID. 5483 03:52:07,300 --> 03:52:09,800 And then we are trying to count those edges. 5484 03:52:09,800 --> 03:52:12,600 We are using a Scala case expression 5485 03:52:12,600 --> 03:52:15,400 as you can see to deconstruct the temple. 5486 03:52:15,500 --> 03:52:17,400 You can say to deconstruct 5487 03:52:17,400 --> 03:52:23,358 the result on the other hand craft edges returns a edge rdd, 5488 03:52:23,358 --> 03:52:26,282 which is containing Edge string object. 5489 03:52:26,400 --> 03:52:30,800 So we could also have used the case Class Type Constructor 5490 03:52:30,900 --> 03:52:32,200 as you can see here. 5491 03:52:32,200 --> 03:52:34,832 So again over here we are using graph dot s 5492 03:52:34,832 --> 03:52:36,400 dot filter and over here. 
5493 03:52:36,400 --> 03:52:40,400 We have given case h and then we are specifying the property 5494 03:52:40,400 --> 03:52:43,900 that is Source destination and then property of the edge 5495 03:52:43,900 --> 03:52:45,000 which is attached. 5496 03:52:45,000 --> 03:52:48,800 And then we are filtering it and then we are trying to count it. 5497 03:52:48,800 --> 03:52:53,547 So this is how using Edge class either you can see with edges 5498 03:52:53,547 --> 03:52:55,603 or you can see with vertices. 5499 03:52:55,603 --> 03:52:59,191 This is how you can go ahead and deconstruct them. 5500 03:52:59,191 --> 03:53:01,900 Right because you're grounded vertices 5501 03:53:01,900 --> 03:53:06,300 or your s dot vertices returns a Vertex rdd or Edge rdd. 5502 03:53:06,400 --> 03:53:07,947 So to deconstruct them, 5503 03:53:07,947 --> 03:53:10,100 we basically use this case class. 5504 03:53:10,100 --> 03:53:11,000 So I hope you 5505 03:53:11,000 --> 03:53:13,742 guys are clear about transforming property graph. 5506 03:53:13,742 --> 03:53:15,400 And how do you use this case? 5507 03:53:15,400 --> 03:53:19,300 Us to deconstruct the protects our DD or HR DD. 5508 03:53:20,169 --> 03:53:22,630 So now let's quickly move ahead. 5509 03:53:22,700 --> 03:53:24,875 Now in addition to the vertex 5510 03:53:24,875 --> 03:53:27,406 and Edge views of the property graph 5511 03:53:27,406 --> 03:53:30,300 Graphics also exposes a triplet view now, 5512 03:53:30,300 --> 03:53:32,700 you might be wondering what is a triplet view. 5513 03:53:32,700 --> 03:53:35,977 So the triplet view logically joins the vertex 5514 03:53:35,977 --> 03:53:39,600 and Edge properties yielding an rdd edge triplet 5515 03:53:39,600 --> 03:53:42,700 with vertex property and your Edge property. 5516 03:53:42,700 --> 03:53:45,174 So as you can see it gives an rdd. 5517 03:53:45,174 --> 03:53:47,217 D with s triplet and then it 5518 03:53:47,217 --> 03:53:51,523 has vertex property as well as H property associated with it 5519 03:53:51,523 --> 03:53:55,100 and it contains an instance of each triplet class. 5520 03:53:55,200 --> 03:53:55,700 Now. 5521 03:53:55,700 --> 03:53:57,800 I am taking example of a join. 5522 03:53:57,800 --> 03:54:01,603 So in this joint we are trying to select Source ID destination 5523 03:54:01,603 --> 03:54:03,100 ID Source attribute then 5524 03:54:03,100 --> 03:54:04,635 this is your Edge attribute 5525 03:54:04,635 --> 03:54:07,400 and then at last you have destination attribute. 5526 03:54:07,400 --> 03:54:11,200 So basically your edges has Alias e then your vertices 5527 03:54:11,200 --> 03:54:12,907 has Alias as source. 5528 03:54:12,907 --> 03:54:16,516 And again your vertices has Alias as Nation so we 5529 03:54:16,516 --> 03:54:19,900 are trying to select Source ID destination ID, 5530 03:54:19,900 --> 03:54:23,155 then Source, attribute and destination attribute, 5531 03:54:23,155 --> 03:54:25,800 and we also selecting The Edge attribute 5532 03:54:25,800 --> 03:54:28,200 and we are performing left join. 5533 03:54:28,400 --> 03:54:31,900 The edge Source ID should be equal to Source ID 5534 03:54:31,900 --> 03:54:35,600 and the h destination ID should be equal to destination ID. 
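Putting the construction and deconstruction steps walked through above into one runnable Scala sketch: a SparkContext `sc` is assumed to exist already, and the exact vertex IDs and edge directions are assumptions based on the table described (Thomas is a colleague of Frank, Thomas and Frank are academic advisors of Jenny, Bob is her health advisor):

```scala
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Vertices: (vertexId, (userName, profession))
val users: RDD[(VertexId, (String, String))] = sc.parallelize(Array(
  (1L, ("Thomas", "professor")),
  (2L, ("Frank",  "professor")),
  (3L, ("Jenny",  "student")),
  (4L, ("Bob",    "doctor"))))

// Edges: Edge(srcId, dstId, relationship)
val relationships: RDD[Edge[String]] = sc.parallelize(Array(
  Edge(1L, 2L, "colleague"),
  Edge(1L, 3L, "academic advisor"),
  Edge(2L, 3L, "academic advisor"),
  Edge(4L, 3L, "health advisor")))

// Default attribute for any vertex referenced by an edge but missing from `users`
val defaultUser = ("default user", "missing")

// Build the property graph
val graph: Graph[(String, String), String] = Graph(users, relationships, defaultUser)

// Deconstruct the graph into its vertex and edge views:
// count the users whose profession is "doctor", using a case expression on the vertex tuple
val doctorCount = graph.vertices.filter {
  case (id, (name, profession)) => profession == "doctor"
}.count()

// count the edges whose source ID is greater than their destination ID
val backwardEdges = graph.edges.filter(e => e.srcId > e.dstId).count()
```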
5535 03:54:36,400 --> 03:54:39,700 And now your Edge triplet class basically 5536 03:54:39,700 --> 03:54:43,090 extends your Edge class by adding your Source attribute 5537 03:54:43,090 --> 03:54:45,100 and destination attribute members 5538 03:54:45,100 --> 03:54:48,100 which contains the source and destination properties 5539 03:54:48,200 --> 03:54:49,155 and we can use 5540 03:54:49,155 --> 03:54:52,500 the triplet view of a graph to render a collection 5541 03:54:52,500 --> 03:54:55,804 of strings describing relationship between users. 5542 03:54:55,804 --> 03:54:59,521 This is vertex 1 which is again denoting your user one. 5543 03:54:59,521 --> 03:55:01,986 That is Thomas and who is a professor 5544 03:55:01,986 --> 03:55:03,081 and is vertex 3, 5545 03:55:03,081 --> 03:55:06,400 which is denoting you Jenny and she's a student. 5546 03:55:06,400 --> 03:55:07,994 And this is your Edge, 5547 03:55:07,994 --> 03:55:11,400 which is defining the relationship between them. 5548 03:55:11,400 --> 03:55:13,600 So this is a h triplet 5549 03:55:13,600 --> 03:55:17,300 which is denoting the both vertex as well 5550 03:55:17,300 --> 03:55:20,900 as the edge which denote the relation between them. 5551 03:55:20,900 --> 03:55:23,600 So now looking at this code first we have already 5552 03:55:23,600 --> 03:55:26,377 created the graph then we are taking this graph. 5553 03:55:26,377 --> 03:55:27,979 We are finding the triplets 5554 03:55:27,979 --> 03:55:30,194 and then we are mapping each triplet. 5555 03:55:30,194 --> 03:55:33,700 We are trying to find out the triplet dot Source attribute 5556 03:55:33,700 --> 03:55:36,155 in which we are picking up the username. 5557 03:55:36,155 --> 03:55:37,100 Then over here. 5558 03:55:37,100 --> 03:55:39,800 We are trying to pick up the triplet attribute, 5559 03:55:39,800 --> 03:55:42,400 which is nothing but the edge attribute 5560 03:55:42,400 --> 03:55:44,400 which is your academic advisor. 5561 03:55:44,400 --> 03:55:45,800 Then we are trying 5562 03:55:45,800 --> 03:55:48,800 to pick up the triplet destination attribute. 5563 03:55:48,800 --> 03:55:50,904 It will again pick up the username 5564 03:55:50,904 --> 03:55:52,500 of destination attribute, 5565 03:55:52,500 --> 03:55:54,766 which is username of this vertex 3. 5566 03:55:54,766 --> 03:55:57,100 So for an example in this situation, 5567 03:55:57,100 --> 03:56:01,000 it will print Thomas is the academic advisor of Jenny. 5568 03:56:01,000 --> 03:56:03,211 So then we are trying to take this facts. 5569 03:56:03,211 --> 03:56:04,726 We are collecting the facts 5570 03:56:04,726 --> 03:56:07,900 using this forage we have Painting each of the triplet 5571 03:56:07,900 --> 03:56:09,812 that is present in this graph. 5572 03:56:09,812 --> 03:56:10,385 So I hope 5573 03:56:10,385 --> 03:56:13,700 that you guys are clear with the concepts of triplet. 5574 03:56:14,600 --> 03:56:17,300 So now let's quickly take a look at graph Builders. 5575 03:56:17,353 --> 03:56:19,200 So as I already told you 5576 03:56:19,200 --> 03:56:22,700 that Graphics provides several ways of building a graph 5577 03:56:22,700 --> 03:56:25,551 from a collection of vertices and edges either. 5578 03:56:25,551 --> 03:56:28,900 It can be stored in our DD or it can be stored on disk. 5579 03:56:28,900 --> 03:56:32,600 So in this graph object first, we have this apply method. 
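Before going into the graph builders in detail, here is a small sketch of the triplet view just described, reusing the `graph` built in the previous sketch:

```scala
// Each EdgeTriplet exposes srcAttr, attr and dstAttr, so one sentence can be rendered per edge
val facts = graph.triplets.map(triplet =>
  triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1)

facts.collect.foreach(println)
// Prints lines such as: Thomas is the academic advisor of Jenny
```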
5580 03:56:32,600 --> 03:56:36,300 So basically this apply method allows creating a graph 5581 03:56:36,300 --> 03:56:37,773 from rdd of vertices 5582 03:56:37,773 --> 03:56:42,000 and edges and duplicate vertices are picked up our by Tralee 5583 03:56:42,000 --> 03:56:43,139 and the vertices 5584 03:56:43,139 --> 03:56:46,700 which are found in the Edge rdd and are not present 5585 03:56:46,700 --> 03:56:50,522 in the vertices rdd are assigned a default attribute. 5586 03:56:50,522 --> 03:56:52,653 So in this apply method first, 5587 03:56:52,653 --> 03:56:55,100 we are providing the vertex rdd then 5588 03:56:55,100 --> 03:56:57,000 we are providing the edge rdd 5589 03:56:57,000 --> 03:57:00,311 and then we are providing the default vertex attribute. 5590 03:57:00,311 --> 03:57:03,613 So it will create the vertex which we have specified. 5591 03:57:03,613 --> 03:57:05,400 Then it will create the edges 5592 03:57:05,400 --> 03:57:08,700 which are specified and if there is a vertex 5593 03:57:08,700 --> 03:57:11,173 which is being referred by The Edge, 5594 03:57:11,173 --> 03:57:14,000 but it is not present in this vertex rdd. 5595 03:57:14,000 --> 03:57:16,763 So So what it does it creates that vertex 5596 03:57:16,763 --> 03:57:20,900 and assigns them the value of this default vertex attribute. 5597 03:57:20,900 --> 03:57:22,700 Next we have from edges. 5598 03:57:22,700 --> 03:57:27,000 So graph Dot from edges allows creating a graph only 5599 03:57:27,000 --> 03:57:28,900 from the rdd of edges 5600 03:57:29,000 --> 03:57:32,266 which automatically creates any vertices mentioned 5601 03:57:32,266 --> 03:57:35,400 in the edges and assigns them the default value. 5602 03:57:35,500 --> 03:57:39,000 So what happens over here you provide the edge rdd 5603 03:57:39,000 --> 03:57:40,496 and all the vertices 5604 03:57:40,496 --> 03:57:44,385 that are present in the hrd are automatically created 5605 03:57:44,385 --> 03:57:48,500 and Default value is assigned to each of those vertices. 5606 03:57:48,500 --> 03:57:49,522 So graphed out 5607 03:57:49,522 --> 03:57:53,100 from adjustables basically allows creating a graph 5608 03:57:53,100 --> 03:57:55,484 from only the rdd of vegetables 5609 03:57:55,500 --> 03:58:00,100 and it assigns the edges as value 1 and again the vertices 5610 03:58:00,100 --> 03:58:04,200 which are specified by the edges are automatically created 5611 03:58:04,200 --> 03:58:05,788 and the default value which 5612 03:58:05,788 --> 03:58:09,005 we are specifying over here will be allocated to them. 5613 03:58:09,005 --> 03:58:10,100 So basically you're 5614 03:58:10,100 --> 03:58:12,980 from has double supports deduplicating of edges, 5615 03:58:12,980 --> 03:58:15,800 which means you can remove the duplicate edges, 5616 03:58:15,800 --> 03:58:19,373 but for that you have to provide a partition strategy 5617 03:58:19,373 --> 03:58:23,953 in the unique edges parameter as it is necessary to co-locate 5618 03:58:23,953 --> 03:58:25,277 The Identical edges 5619 03:58:25,277 --> 03:58:28,900 on the same partition duplicate edges can be removed. 5620 03:58:29,100 --> 03:58:33,000 So moving ahead men of the graph Builders re partitions, 5621 03:58:33,000 --> 03:58:37,146 the graph edges by default instead edges are left 5622 03:58:37,146 --> 03:58:39,300 in their default partitions. 5623 03:58:39,300 --> 03:58:42,540 So as you can see, we have a graph loader object, 5624 03:58:42,540 --> 03:58:44,700 which is basically used to load. 
5625 03:58:44,700 --> 03:58:46,776 Crafts from the file system 5626 03:58:46,900 --> 03:58:51,571 so graft or group edges requires the graph to be re-partition 5627 03:58:51,571 --> 03:58:52,956 because it assumes 5628 03:58:53,000 --> 03:58:55,900 that identical edges will be co-located 5629 03:58:55,900 --> 03:58:57,378 on the same partition. 5630 03:58:57,378 --> 03:59:00,200 And so you must call graph dot Partition by 5631 03:59:00,200 --> 03:59:02,200 before calling group edges. 5632 03:59:02,900 --> 03:59:07,500 So so now you can see the edge list file method over here 5633 03:59:07,538 --> 03:59:12,000 which provides a way to load a graph from the list of edges 5634 03:59:12,000 --> 03:59:14,577 which is present on the disk and it 5635 03:59:14,577 --> 03:59:18,900 It passes the adjacency list that is your Source vertex ID 5636 03:59:18,900 --> 03:59:22,900 and the destination vertex ID Pairs and it creates a graph. 5637 03:59:23,200 --> 03:59:24,300 So now for an example, 5638 03:59:24,300 --> 03:59:29,600 let's say we have two and one which is one Edge then you have 5639 03:59:29,600 --> 03:59:31,533 for one which is another Edge 5640 03:59:31,533 --> 03:59:34,600 and then you have 1/2 which is another Edge. 5641 03:59:34,600 --> 03:59:36,700 So it will load these edges 5642 03:59:36,900 --> 03:59:39,300 and then it will create the graph. 5643 03:59:39,300 --> 03:59:40,792 So it will create 2, 5644 03:59:40,792 --> 03:59:44,600 then it will create for and then it will create one. 5645 03:59:44,900 --> 03:59:46,100 And for to one it 5646 03:59:46,100 --> 03:59:49,757 will create the edge and then for one it will create the edge 5647 03:59:49,757 --> 03:59:52,500 and at last we create an edge for one and two. 5648 03:59:52,700 --> 03:59:55,300 So do you create a graph something like this? 5649 03:59:56,000 --> 03:59:59,100 It creates a graph from specified edges 5650 03:59:59,300 --> 04:00:01,929 where automatically vertices are created 5651 04:00:01,929 --> 04:00:05,751 which are mentioned by the edges and all the vertex 5652 04:00:05,751 --> 04:00:08,465 and Edge attribute are set by default one 5653 04:00:08,465 --> 04:00:10,907 and as well as one will be associated 5654 04:00:10,907 --> 04:00:12,400 with all the vertices. 5655 04:00:12,543 --> 04:00:15,900 So it will be 4 comma 1 then again for this. 5656 04:00:15,900 --> 04:00:19,200 It would be 1 comma 1 and similarly it would be 5657 04:00:19,200 --> 04:00:21,201 2 comma 1 for this vertex. 5658 04:00:21,800 --> 04:00:24,184 Now, let's go back to the code. 5659 04:00:24,184 --> 04:00:27,800 So then we have this canonical orientation. 5660 04:00:28,200 --> 04:00:31,655 So this argument allows reorienting edges 5661 04:00:31,655 --> 04:00:33,500 in the positive direction 5662 04:00:33,500 --> 04:00:35,100 that is from the lower Source ID 5663 04:00:35,100 --> 04:00:38,000 to the higher destination ID now, 5664 04:00:38,000 --> 04:00:40,800 which is basically required by your connected components 5665 04:00:40,800 --> 04:00:41,782 algorithm will talk 5666 04:00:41,782 --> 04:00:43,800 about this algorithm in a while you guys 5667 04:00:44,100 --> 04:00:47,069 but before this this basically helps 5668 04:00:47,069 --> 04:00:49,300 in view orienting your edges, 5669 04:00:49,300 --> 04:00:51,500 which means your Source vertex, 5670 04:00:51,500 --> 04:00:55,400 Tex should always be less than your destination vertex. 5671 04:00:55,400 --> 04:00:58,700 So in that situation it might reorient this Edge. 
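As a quick sketch of the builders discussed so far, before the canonical orientation and partition details: `sc` and the `relationships` RDD are reused from the earlier sketch, the file path is an assumption, and note that the partition argument of edgeListFile is called minEdgePartitions or numEdgePartitions depending on the Spark version.

```scala
import org.apache.spark.graphx._

// Graph.fromEdgeTuples: only (srcId, dstId) pairs are given; every edge gets the attribute 1
// and every vertex referenced by an edge is created with the default value 1.
val rawEdges = sc.parallelize(Array((2L, 1L), (4L, 1L), (1L, 2L)))
val tupleGraph: Graph[Int, Int] = Graph.fromEdgeTuples(rawEdges, defaultValue = 1)

// Graph.fromEdges: only an RDD of Edge objects is given; any vertex mentioned by an edge
// is created automatically and assigned the default attribute.
val edgeGraph: Graph[String, String] = Graph.fromEdges(relationships, defaultValue = "missing")

// GraphLoader.edgeListFile: load "srcId dstId" pairs from disk; canonicalOrientation = true
// reorients every edge so that srcId < dstId (as required by connected components).
val loaded: Graph[Int, Int] =
  GraphLoader.edgeListFile(sc, "data/graphx/followers.txt", canonicalOrientation = true)
```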
5672 04:00:58,700 --> 04:01:01,970 So it will reorient this Edge and basically to reverse 5673 04:01:01,970 --> 04:01:04,862 direction of the edge similarly over here. 5674 04:01:04,862 --> 04:01:06,000 So with the vertex 5675 04:01:06,000 --> 04:01:08,896 which is coming from 2 to 1 will be reoriented 5676 04:01:08,896 --> 04:01:10,700 and will be again reversed. 5677 04:01:10,700 --> 04:01:11,754 Now the talking 5678 04:01:11,754 --> 04:01:16,300 about the minimum Edge partition this minimum Edge partition 5679 04:01:16,300 --> 04:01:18,858 basically specifies the minimum number 5680 04:01:18,858 --> 04:01:21,900 of edge partitions to generate There might be 5681 04:01:21,900 --> 04:01:24,242 more Edge partitions than a specified. 5682 04:01:24,242 --> 04:01:26,900 So let's say the hdfs file has more blocks. 5683 04:01:26,900 --> 04:01:29,300 So obviously more partitions will be created 5684 04:01:29,300 --> 04:01:32,182 but this will give you the minimum Edge partitions 5685 04:01:32,182 --> 04:01:33,651 that should be created. 5686 04:01:33,651 --> 04:01:34,192 So I hope 5687 04:01:34,192 --> 04:01:36,900 that you guys are clear with this graph loader 5688 04:01:36,900 --> 04:01:38,358 how this graph loader Works 5689 04:01:38,358 --> 04:01:41,300 how you can go ahead and provide the edge list file 5690 04:01:41,300 --> 04:01:43,300 and how it will create the craft 5691 04:01:43,300 --> 04:01:47,124 from this Edge list file and then this canonical orientation 5692 04:01:47,124 --> 04:01:50,300 where we are again going and reorienting the graph 5693 04:01:50,300 --> 04:01:52,299 and then we have Minimum Edge partition 5694 04:01:52,299 --> 04:01:54,900 which is giving the minimum number of edge partitions 5695 04:01:54,900 --> 04:01:56,300 that should be created. 5696 04:01:56,300 --> 04:02:00,000 So now I guess you guys are clear with the graph Builder. 5697 04:02:00,000 --> 04:02:03,400 So how to go ahead and use this graph object 5698 04:02:03,400 --> 04:02:06,900 and how to create graph using apply from edges 5699 04:02:06,900 --> 04:02:09,200 and from vegetables method 5700 04:02:09,400 --> 04:02:11,700 and then I guess you might be clear 5701 04:02:11,700 --> 04:02:13,586 with the graph loader object 5702 04:02:13,586 --> 04:02:17,715 and where you can go ahead and create a graph from Edge list. 5703 04:02:17,715 --> 04:02:17,990 Now. 5704 04:02:17,990 --> 04:02:21,500 Let's move ahead and talk about vertex and Edge rdd. 5705 04:02:21,900 --> 04:02:23,561 So as I already told you 5706 04:02:23,561 --> 04:02:27,007 that Graphics exposes our DD views of the vertices 5707 04:02:27,007 --> 04:02:30,056 and edges stored within the graph at however, 5708 04:02:30,056 --> 04:02:33,798 because Graphics again maintains the vertices and edges 5709 04:02:33,798 --> 04:02:35,600 in optimize data structure 5710 04:02:35,600 --> 04:02:36,979 and these data structure 5711 04:02:36,979 --> 04:02:39,499 provide additional functionalities as well. 5712 04:02:39,499 --> 04:02:42,679 Now, let us see some of the additional functionalities 5713 04:02:42,679 --> 04:02:44,300 which are provided by them. 5714 04:02:44,465 --> 04:02:47,234 So let's first talk about vertex rdd. 5715 04:02:47,600 --> 04:02:51,100 So I already told you that vertex rdd. 
5716 04:02:51,100 --> 04:02:54,800 He is basically extending this rdd with vertex ID 5717 04:02:54,800 --> 04:02:59,338 and the vertex property and it adds an additional constraint 5718 04:02:59,338 --> 04:03:05,600 that each vertex ID occurs only words now moreover vertex rdd 5719 04:03:05,800 --> 04:03:10,000 a represents a set of vertices each with an attribute 5720 04:03:10,000 --> 04:03:12,600 of type A now internally 5721 04:03:12,700 --> 04:03:17,600 what happens this is achieved by storing the vertex attribute 5722 04:03:17,700 --> 04:03:19,184 in an reusable, 5723 04:03:19,184 --> 04:03:21,030 hash map data structure. 5724 04:03:24,200 --> 04:03:27,700 So suppose, this is our hash map data structure. 5725 04:03:27,700 --> 04:03:30,200 So suppose if to vertex rdd 5726 04:03:30,200 --> 04:03:34,840 are derived from the same base vertex rdd suppose. 5727 04:03:35,280 --> 04:03:37,600 These are two vertex rdd 5728 04:03:37,600 --> 04:03:41,200 which are basically derived from this vertex rdd 5729 04:03:41,200 --> 04:03:44,400 so they can be joined in constant time 5730 04:03:44,400 --> 04:03:46,100 without hash evaluations. 5731 04:03:46,100 --> 04:03:49,400 So you don't have to go ahead and evaluate the properties 5732 04:03:49,400 --> 04:03:52,400 of both the vertices you can easily go ahead 5733 04:03:52,400 --> 04:03:55,398 and you can join them without the Yes, 5734 04:03:55,400 --> 04:03:58,288 and this is one of the way in which this vertex 5735 04:03:58,288 --> 04:04:00,800 already provides you the optimization now 5736 04:04:00,800 --> 04:04:03,900 to leverage this indexed data structure 5737 04:04:04,200 --> 04:04:08,700 the vertex rdd exposes multiple additional functionalities. 5738 04:04:09,000 --> 04:04:11,000 So it gives you all these functions 5739 04:04:11,000 --> 04:04:12,000 as you can see here. 5740 04:04:12,300 --> 04:04:15,300 It gives you filter map values then - 5741 04:04:15,300 --> 04:04:16,663 difference left join 5742 04:04:16,663 --> 04:04:19,800 in a joint and aggregate using index functions. 5743 04:04:19,800 --> 04:04:22,600 So let us first discuss about these functions. 5744 04:04:22,600 --> 04:04:26,800 So basically filter a function filters the vertex set 5745 04:04:26,800 --> 04:04:31,700 but preserves the internal index So based on some condition. 5746 04:04:31,700 --> 04:04:33,405 It filters the vertices 5747 04:04:33,405 --> 04:04:36,300 that are present then in map values. 5748 04:04:36,300 --> 04:04:39,200 It is basically used to transform the values 5749 04:04:39,200 --> 04:04:41,000 without changing the IDS 5750 04:04:41,000 --> 04:04:44,461 and which again preserves your internal index. 5751 04:04:44,461 --> 04:04:49,399 So it does not change the idea of the vertices and it helps 5752 04:04:49,399 --> 04:04:53,100 in transforming those values now talking about the - 5753 04:04:53,100 --> 04:04:55,900 method it shows What is unique 5754 04:04:55,900 --> 04:04:58,500 in the said based on their vertex IDs? 5755 04:04:58,500 --> 04:04:59,500 So what happens 5756 04:04:59,500 --> 04:05:03,300 if you are providing to set of vertices first contains V1 V2 5757 04:05:03,300 --> 04:05:06,100 and V3 and second one contains V3, 5758 04:05:06,200 --> 04:05:08,276 so it will return V1 and V2 5759 04:05:08,276 --> 04:05:11,366 because they are unique in both the sets 5760 04:05:11,700 --> 04:05:14,700 and it is basically done with the help of vertex ID. 5761 04:05:14,900 --> 04:05:17,053 So next we have dysfunction. 
5762 04:05:17,100 --> 04:05:20,900 So it basically removes the vertices from this set 5763 04:05:20,900 --> 04:05:25,800 that appears in another set Then we have left join an inner join. 5764 04:05:25,800 --> 04:05:28,300 So join operators basically take advantage 5765 04:05:28,300 --> 04:05:30,900 of the internal indexing to accelerate join. 5766 04:05:30,900 --> 04:05:32,900 So you can go ahead and you can perform left join 5767 04:05:32,900 --> 04:05:34,400 or you can perform inner join. 5768 04:05:34,453 --> 04:05:37,246 Next you have aggregate using index. 5769 04:05:37,700 --> 04:05:40,800 So basically is aggregate using index is nothing 5770 04:05:40,800 --> 04:05:42,400 by reduced by key, 5771 04:05:42,500 --> 04:05:44,200 but it uses index 5772 04:05:44,300 --> 04:05:48,000 on this rdd to accelerate the Reduce by key function 5773 04:05:48,000 --> 04:05:50,500 or you can say reduced by key operation. 5774 04:05:50,700 --> 04:05:54,900 So again filter is actually Using bit set and there 5775 04:05:54,900 --> 04:05:56,500 by reusing the index 5776 04:05:56,500 --> 04:05:58,800 and preserving the ability to do 5777 04:05:58,800 --> 04:06:02,220 fast joints with other vertex rdd now similarly 5778 04:06:02,220 --> 04:06:04,600 the map values operator as well. 5779 04:06:04,600 --> 04:06:08,200 Do not allow the map function to change the vertex ID 5780 04:06:08,200 --> 04:06:09,600 and this again helps 5781 04:06:09,600 --> 04:06:13,120 in reusing the same hash map data structure now both 5782 04:06:13,120 --> 04:06:14,533 of your left join as 5783 04:06:14,533 --> 04:06:17,900 well as your inner join is able to identify 5784 04:06:17,900 --> 04:06:20,400 that whether the two vertex rdd 5785 04:06:20,400 --> 04:06:23,169 which are joining are derived from the same. 5786 04:06:23,169 --> 04:06:24,208 Hash map or not. 5787 04:06:24,208 --> 04:06:28,300 And for this they basically use linear scan did again don't have 5788 04:06:28,300 --> 04:06:31,900 to go ahead and search for costly Point lookups. 5789 04:06:31,900 --> 04:06:35,300 So this is the benefit of using vertex rdd. 5790 04:06:35,500 --> 04:06:36,571 So to summarize 5791 04:06:36,571 --> 04:06:40,300 your vertex audit abuses hash map data structure, 5792 04:06:40,426 --> 04:06:42,273 which is again reusable. 5793 04:06:42,300 --> 04:06:44,700 They try to preserve your indexes 5794 04:06:44,700 --> 04:06:48,500 so that it would be easier to create a new vertex already 5795 04:06:48,500 --> 04:06:51,404 derive a new vertex already from them then again 5796 04:06:51,404 --> 04:06:54,000 while performing some joining or Relations, 5797 04:06:54,000 --> 04:06:57,900 it is pretty much easy to go ahead perform a linear scan 5798 04:06:57,900 --> 04:07:01,500 and then you can go ahead and join those two vertex rdd. 5799 04:07:01,500 --> 04:07:05,423 So it actually helps in optimizing your performance. 5800 04:07:05,700 --> 04:07:06,700 Now moving ahead. 5801 04:07:06,700 --> 04:07:10,200 Let's talk about HR DD now again, 5802 04:07:10,200 --> 04:07:13,900 as you can see your Edge already is extending your rdd 5803 04:07:13,900 --> 04:07:15,400 with property Edge. 
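Before going further into the EdgeRDD, here is a minimal sketch of the VertexRDD operations listed above, reusing `graph` and `sc` from the earlier sketches; the message RDD at the end is a made-up example just to show aggregateUsingIndex:

```scala
val verts: VertexRDD[(String, String)] = graph.vertices

// filter keeps only the matching vertices but preserves the internal index
val students = verts.filter { case (id, (name, profession)) => profession == "student" }

// mapValues transforms the attribute without changing the vertex ID, so the index is reused
val namesOnly: VertexRDD[String] = verts.mapValues((id, attr) => attr._1)

// minus keeps the vertices of `verts` that do not appear in `students` (matched by vertex ID)
val others = verts.minus(students)

// joins that exploit the shared hash-map index
val degrees: VertexRDD[Int] = graph.degrees
val withDegree    = verts.innerJoin(degrees) { (id, attr, deg)    => (attr._1, deg) }
val withOptDegree = verts.leftJoin(degrees)  { (id, attr, degOpt) => (attr._1, degOpt.getOrElse(0)) }

// aggregateUsingIndex: a reduceByKey that reuses the index of `verts` to build a new VertexRDD
val msgs = sc.parallelize(Array((3L, 1), (3L, 1), (1L, 1)))
val counts: VertexRDD[Int] = verts.aggregateUsingIndex(msgs, _ + _)
```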
5804 04:07:15,400 --> 04:07:18,792 Now it organizes the edge in Block partition using 5805 04:07:18,792 --> 04:07:21,700 one of the various partitioning strategies, 5806 04:07:21,700 --> 04:07:25,608 which is again defined in Your partition strategies attribute 5807 04:07:25,608 --> 04:07:28,800 or you can say partition strategy parameter within 5808 04:07:28,800 --> 04:07:30,865 each partition each attribute 5809 04:07:30,865 --> 04:07:34,100 and a decency structure are stored separately 5810 04:07:34,100 --> 04:07:36,200 which enables the maximum reuse 5811 04:07:36,200 --> 04:07:38,200 when changing the attribute values. 5812 04:07:38,600 --> 04:07:42,900 So basically what it does while storing your Edge attributes 5813 04:07:42,900 --> 04:07:46,400 and your Source vertex and destination vertex, 5814 04:07:46,400 --> 04:07:48,400 they are stored separately so 5815 04:07:48,400 --> 04:07:51,200 that changing the values of the attributes 5816 04:07:51,200 --> 04:07:54,200 either of the source Vertex or Nation Vertex 5817 04:07:54,200 --> 04:07:55,500 or Edge attribute 5818 04:07:55,500 --> 04:07:58,300 so that it can be reused as many times 5819 04:07:58,300 --> 04:08:01,600 as we need by changing the attribute values itself. 5820 04:08:01,600 --> 04:08:04,713 So that once the vertex ID is changed of an edge. 5821 04:08:04,713 --> 04:08:06,400 It could be easily changed 5822 04:08:06,400 --> 04:08:09,196 and the earlier part can be reused now 5823 04:08:09,196 --> 04:08:10,314 as you can see, 5824 04:08:10,314 --> 04:08:13,518 we have three additional functions over here 5825 04:08:13,518 --> 04:08:16,500 that is map values reverse an inner join. 5826 04:08:16,700 --> 04:08:19,000 So in hrd basically map 5827 04:08:19,000 --> 04:08:21,400 values is to transform the edge attributes 5828 04:08:21,400 --> 04:08:23,200 while preserving the structure. 5829 04:08:23,200 --> 04:08:25,029 ER it is helpful in transforming 5830 04:08:25,029 --> 04:08:28,500 so you can use map values and map the values of Courage rdd. 5831 04:08:28,800 --> 04:08:31,300 Then you can go ahead and use this reverse function 5832 04:08:31,300 --> 04:08:35,400 which rivers The Edge reusing both attribute and structure. 5833 04:08:35,400 --> 04:08:37,531 So the source becomes destination. 5834 04:08:37,531 --> 04:08:40,179 The destination becomes Source not talking 5835 04:08:40,179 --> 04:08:41,600 about this inner join. 5836 04:08:41,700 --> 04:08:43,600 So it basically joins 5837 04:08:43,600 --> 04:08:48,500 to Edge rdds partitioned using same partitioning strategy. 5838 04:08:49,100 --> 04:08:52,900 Now as we already discuss that same partition strategies, 5839 04:08:52,900 --> 04:08:55,585 Tired because again to co-locate you need 5840 04:08:55,585 --> 04:08:57,600 to use same partition strategy 5841 04:08:57,600 --> 04:08:59,682 and your identical vertex should reside 5842 04:08:59,682 --> 04:09:02,800 in same partition to perform join operation over them. 5843 04:09:02,800 --> 04:09:03,092 Now. 5844 04:09:03,092 --> 04:09:07,290 Let me quickly give you an idea about optimization performed 5845 04:09:07,290 --> 04:09:08,500 in this Graphics. 5846 04:09:08,536 --> 04:09:10,151 So Graphics basically 5847 04:09:10,151 --> 04:09:14,844 adopts a Vertex cut approach to distribute graph partitioning. 5848 04:09:15,500 --> 04:09:20,700 So suppose you have five vertex and then they are connected. 5849 04:09:20,800 --> 04:09:23,100 Let's not worry about the arrows, right? 
5850 04:09:23,100 --> 04:09:26,200 Now or let's not worry about Direction right now. 5851 04:09:26,200 --> 04:09:29,200 So either it can be divided from the edges, 5852 04:09:29,200 --> 04:09:32,287 which is one approach or again. 5853 04:09:32,287 --> 04:09:34,825 It can be divided from the vertex. 5854 04:09:35,300 --> 04:09:36,840 So in that situation, 5855 04:09:36,840 --> 04:09:39,700 it would be divided something like this. 5856 04:09:41,200 --> 04:09:43,500 So rather than splitting crafts 5857 04:09:43,500 --> 04:09:47,900 along edges Graphics partition is the graph along vertices, 5858 04:09:47,900 --> 04:09:50,305 which can again reduce the communication 5859 04:09:50,305 --> 04:09:51,600 and storage overhead. 5860 04:09:51,600 --> 04:09:53,523 So logically what happens 5861 04:09:53,523 --> 04:09:56,500 that your edges are assigned to machines 5862 04:09:56,500 --> 04:10:00,200 and allowing your vertices to span multiple machines. 5863 04:10:00,200 --> 04:10:03,500 So what this is is basically divided into multiple machines 5864 04:10:03,500 --> 04:10:06,900 and your edges is assigned to a single machine right 5865 04:10:06,900 --> 04:10:09,600 then the exact method of assigning edges. 5866 04:10:09,600 --> 04:10:11,800 Depends on the partition strategy. 5867 04:10:11,800 --> 04:10:15,400 So the partition strategy is the one which basically decides 5868 04:10:15,400 --> 04:10:16,800 how to assign the edges 5869 04:10:16,800 --> 04:10:20,300 to different machines or you can send different partitions. 5870 04:10:20,300 --> 04:10:21,400 So user can choose 5871 04:10:21,400 --> 04:10:24,900 between different strategies by partitioning the graph 5872 04:10:24,900 --> 04:10:28,200 with the help of this graft Partition by operator. 5873 04:10:28,200 --> 04:10:29,500 Now as we discussed 5874 04:10:29,500 --> 04:10:31,329 that this craft or Partition 5875 04:10:31,329 --> 04:10:34,400 by operator three partitions and then it divides 5876 04:10:34,400 --> 04:10:36,900 or relocates the edges 5877 04:10:37,000 --> 04:10:39,900 and basically we try to put the identical edges. 5878 04:10:39,900 --> 04:10:41,500 On a single partition 5879 04:10:41,500 --> 04:10:43,827 so that different operations like join 5880 04:10:43,827 --> 04:10:45,400 can be performed on them. 5881 04:10:45,400 --> 04:10:49,629 So once the edges have been partitioned the mean challenge 5882 04:10:49,629 --> 04:10:52,690 is efficiently joining the vertex attributes 5883 04:10:52,690 --> 04:10:54,400 with the edges right now 5884 04:10:54,400 --> 04:10:56,000 because real world graphs 5885 04:10:56,000 --> 04:10:58,600 typically have more edges than vertices. 5886 04:10:58,600 --> 04:11:03,300 So we move vertex attributes to the edges and because not all 5887 04:11:03,300 --> 04:11:07,800 the partitions will contain edges adjacent to all vertices. 5888 04:11:07,800 --> 04:11:09,755 We internally maintain a row. 5889 04:11:09,755 --> 04:11:10,700 Routing table. 5890 04:11:10,700 --> 04:11:14,400 So the routing table is the one who will broadcast the vertices 5891 04:11:14,400 --> 04:11:18,146 and 10 will implement the join required for the operations. 5892 04:11:18,146 --> 04:11:18,946 So, I hope 5893 04:11:18,946 --> 04:11:22,200 that you guys are clear how vertex rdd and hrd 5894 04:11:22,200 --> 04:11:23,338 works and then 5895 04:11:23,338 --> 04:11:25,800 how the optimizations take place 5896 04:11:25,800 --> 04:11:29,900 and how vertex cut optimizes the operations in graphics. 
5897 04:11:30,100 --> 04:11:32,600 Now, let's talk about graph operators. 5898 04:11:32,600 --> 04:11:35,480 So just as already have basic operations 5899 04:11:35,480 --> 04:11:37,400 like map filter reduced by 5900 04:11:37,400 --> 04:11:41,300 key property graph also have Election of basic operators 5901 04:11:41,300 --> 04:11:44,530 that take user-defined functions and produce new graphs 5902 04:11:44,530 --> 04:11:48,029 the transform properties and structure Now The Co-operators 5903 04:11:48,029 --> 04:11:50,900 that have optimized implementation are basically 5904 04:11:50,900 --> 04:11:54,061 defined in crafts class and convenient operators 5905 04:11:54,061 --> 04:11:55,262 that are expressed 5906 04:11:55,262 --> 04:11:57,600 as a composition of The Co-operators 5907 04:11:57,600 --> 04:12:00,500 are basically defined in your graphs class. 5908 04:12:00,500 --> 04:12:03,346 But in Scala it implicit the operators 5909 04:12:03,346 --> 04:12:04,800 in graph Ops class, 5910 04:12:04,800 --> 04:12:08,500 they are automatically available as a member of graft class 5911 04:12:08,600 --> 04:12:09,600 so you can use them. 5912 04:12:09,700 --> 04:12:12,450 M using the graph class as well now 5913 04:12:12,500 --> 04:12:14,593 as you can see we have list of operators 5914 04:12:14,593 --> 04:12:15,858 like property operator, 5915 04:12:15,858 --> 04:12:17,800 then you have structural operator. 5916 04:12:17,800 --> 04:12:19,300 Then you have join operator 5917 04:12:19,300 --> 04:12:22,000 and then you have something called neighborhood operator. 5918 04:12:22,000 --> 04:12:24,700 So let's talk about them one by one now talking 5919 04:12:24,700 --> 04:12:26,400 about property operators, 5920 04:12:26,400 --> 04:12:30,016 like rdd has map operator the property graph contains 5921 04:12:30,016 --> 04:12:34,168 map vertices map edges and map triplets operators right now. 5922 04:12:34,168 --> 04:12:38,445 Each of this operator basically eels a new graph with the vertex 5923 04:12:38,445 --> 04:12:39,600 or Edge property. 5924 04:12:39,600 --> 04:12:42,600 Modified by the user-defined map function based 5925 04:12:42,600 --> 04:12:46,366 on the user-defined map function it basically transforms 5926 04:12:46,366 --> 04:12:47,915 or modifies the vertices 5927 04:12:47,915 --> 04:12:49,202 if it's map vertices 5928 04:12:49,202 --> 04:12:51,489 or it transform or modify the edges 5929 04:12:51,489 --> 04:12:53,170 if it is map edges method 5930 04:12:53,170 --> 04:12:56,600 or map is operator and so on format repeats as well. 5931 04:12:56,600 --> 04:13:00,053 Now the important thing to note is that in each case. 5932 04:13:00,053 --> 04:13:02,700 The graph structure is unaffected and this 5933 04:13:02,700 --> 04:13:04,968 is a key feature of these operators. 5934 04:13:04,968 --> 04:13:07,513 Basically which allows the resulting graph 5935 04:13:07,513 --> 04:13:09,500 to reuse the structural indices. 5936 04:13:09,500 --> 04:13:10,300 Of the original graph 5937 04:13:10,300 --> 04:13:12,600 each and every time you apply a transformation, 5938 04:13:12,600 --> 04:13:14,700 so it creates a new graph 5939 04:13:14,700 --> 04:13:17,500 and the original graph is unaffected 5940 04:13:17,500 --> 04:13:19,200 so that it can be used 5941 04:13:19,200 --> 04:13:22,500 so you can see it can be reused in creating new graphs. 5942 04:13:22,500 --> 04:13:22,800 Right? 
5943 04:13:22,800 --> 04:13:24,600 So your structure indices 5944 04:13:24,600 --> 04:13:27,700 can be used from the original graph not talking 5945 04:13:27,700 --> 04:13:29,400 about this map vertices. 5946 04:13:29,400 --> 04:13:31,152 Let me use the highlighter. 5947 04:13:31,152 --> 04:13:32,900 So first we have map vertices. 5948 04:13:32,900 --> 04:13:34,200 So be it Maps the vertices 5949 04:13:34,200 --> 04:13:36,100 or you can still transform the vertices. 5950 04:13:36,100 --> 04:13:39,300 So you provide vertex ID and then vertex. 5951 04:13:40,100 --> 04:13:43,400 And you apply some of the transformation function using 5952 04:13:43,400 --> 04:13:46,600 which so it will give you a graph with newer text property 5953 04:13:46,600 --> 04:13:49,500 as you can see now same is the case with map edges. 5954 04:13:49,500 --> 04:13:53,800 So again you provide the edges then you transform the edges. 5955 04:13:53,800 --> 04:13:57,600 So initially it was Ed and then you transform it to Edie to 5956 04:13:57,700 --> 04:13:58,600 and then the graph 5957 04:13:58,600 --> 04:14:01,000 which is given or you can see the graph 5958 04:14:01,000 --> 04:14:04,947 which is returned is the graph for the changed each attribute. 5959 04:14:04,947 --> 04:14:07,535 So you can see here the attribute is ed2. 5960 04:14:07,535 --> 04:14:09,800 Same is the case with Mark triplets. 5961 04:14:09,900 --> 04:14:11,500 So using Mark triplets, 5962 04:14:11,500 --> 04:14:14,657 you can use the edge triplet where you can go ahead 5963 04:14:14,657 --> 04:14:18,700 and Target the vertex Properties or you can say vertex attributes 5964 04:14:18,700 --> 04:14:21,817 or to be more specific Source vertex attribute as well 5965 04:14:21,817 --> 04:14:23,641 as destination vertex attribute 5966 04:14:23,641 --> 04:14:26,900 and the edge attribute and then you can apply transformation 5967 04:14:26,900 --> 04:14:28,654 over those Source attributes 5968 04:14:28,654 --> 04:14:31,600 or destination attributes or the edge attributes 5969 04:14:31,600 --> 04:14:34,500 so you can change them and then it will again return a graph 5970 04:14:34,500 --> 04:14:36,300 with the transformed values now, 5971 04:14:36,300 --> 04:14:39,000 I guess you guys are clear the property operator. 5972 04:14:39,000 --> 04:14:40,819 So let's move Next operator 5973 04:14:40,819 --> 04:14:44,958 that is structural operator So currently Graphics supports only 5974 04:14:44,958 --> 04:14:48,200 a simple set of commonly use structural operators. 5975 04:14:48,200 --> 04:14:50,712 And we expect more to be added in future. 5976 04:14:50,712 --> 04:14:53,220 Now you can see in structural operator. 5977 04:14:53,220 --> 04:14:54,800 We have reversed operator. 5978 04:14:54,800 --> 04:14:56,464 Then we have subgraph operator. 5979 04:14:56,464 --> 04:14:57,923 Then we have masks operator 5980 04:14:57,923 --> 04:15:00,100 and then we have group edges operator. 5981 04:15:00,100 --> 04:15:04,096 So let's talk about them one by one so first reverse operator, 5982 04:15:04,096 --> 04:15:05,640 so as the name suggests, 5983 04:15:05,640 --> 04:15:09,500 it returns a new graph with all the edge directions reversed. 5984 04:15:09,500 --> 04:15:11,750 So basically it will change your Source vertex 5985 04:15:11,750 --> 04:15:12,950 into destination vertex, 5986 04:15:12,950 --> 04:15:15,108 and then it will change your destination vertex 5987 04:15:15,108 --> 04:15:16,000 into Source vertex. 
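Before continuing with reverse and the other structural operators, here is a sketch of the three property operators just covered, reusing `graph: Graph[(String, String), String]` from the earlier sketch; the transformations themselves are only illustrative:

```scala
// mapVertices: transform the vertex attribute; the structure and indices of the original
// graph are reused in the new graph
val upperNames = graph.mapVertices((id, attr) => (attr._1.toUpperCase, attr._2))

// mapEdges: transform only the edge attribute
val shortRelations = graph.mapEdges(e => e.attr.take(3))

// mapTriplets: the transformation can also read the source and destination attributes
val labelledEdges = graph.mapTriplets(t => t.srcAttr._1 + " -> " + t.attr + " -> " + t.dstAttr._1)
```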
5988 04:15:16,000 --> 04:15:18,500 So it will reverse the direction of your edges. 5989 04:15:18,500 --> 04:15:21,600 And the reverse operation does not modify Vertex 5990 04:15:21,600 --> 04:15:23,300 or Edge Properties or change. 5991 04:15:23,300 --> 04:15:24,300 The number of edges. 5992 04:15:24,400 --> 04:15:25,739 It can be implemented 5993 04:15:25,739 --> 04:15:28,800 efficiently without data movement or duplication. 5994 04:15:28,800 --> 04:15:31,400 So next we have subgraph operator. 5995 04:15:31,400 --> 04:15:34,615 So basically subgraph operator takes the vertex 5996 04:15:34,615 --> 04:15:35,967 and Edge predicates 5997 04:15:35,967 --> 04:15:38,577 or you can say Vertex or edge condition 5998 04:15:38,577 --> 04:15:41,600 and Returns the Of containing only the vertex 5999 04:15:41,600 --> 04:15:44,835 that satisfy those vertex predicates and then it Returns 6000 04:15:44,835 --> 04:15:47,306 the edges that satisfy the edge predicates. 6001 04:15:47,306 --> 04:15:50,200 So basically will give a condition about edges and 6002 04:15:50,200 --> 04:15:51,954 vertices and those predicates 6003 04:15:51,954 --> 04:15:54,009 which are fulfilled or those vertex 6004 04:15:54,009 --> 04:15:57,303 which are fulfilling the predicates will be only returned 6005 04:15:57,303 --> 04:15:59,302 and again seems the case with your edges 6006 04:15:59,302 --> 04:16:01,237 and then your graph will be connected. 6007 04:16:01,237 --> 04:16:03,800 Now, the subgraph operator can be used in a number 6008 04:16:03,800 --> 04:16:06,953 of situations to restrict the graph to the vertices 6009 04:16:06,953 --> 04:16:08,245 and edges of interest 6010 04:16:08,245 --> 04:16:10,615 and eliminate the Rest of the components, 6011 04:16:10,615 --> 04:16:13,450 right so you can see this is The Edge predicate. 6012 04:16:13,450 --> 04:16:15,200 This is the vertex predicate. 6013 04:16:15,200 --> 04:16:18,900 Then we are providing the extra plate with the vertex 6014 04:16:18,900 --> 04:16:20,500 and Edge attributes 6015 04:16:20,500 --> 04:16:21,567 and we are waiting 6016 04:16:21,567 --> 04:16:24,700 for the Boolean value then same is the case with vertex. 6017 04:16:24,700 --> 04:16:27,100 We're providing the vertex properties over here 6018 04:16:27,100 --> 04:16:29,150 or you can say vertex attribute over here. 6019 04:16:29,150 --> 04:16:29,925 And then again, 6020 04:16:29,925 --> 04:16:32,126 it will yield a graph which is a sub graph 6021 04:16:32,126 --> 04:16:35,400 of the original graph which will fulfill those predicates now, 6022 04:16:35,400 --> 04:16:37,600 the next operator is mask operator. 6023 04:16:37,600 --> 04:16:39,746 So mask operator Constructors. 6024 04:16:39,746 --> 04:16:43,466 Graph by returning a graph that contains the vertices 6025 04:16:43,466 --> 04:16:46,888 and edges that are also found in the input graph. 6026 04:16:46,888 --> 04:16:48,637 Basically, you can treat 6027 04:16:48,637 --> 04:16:52,500 this mask operator as a comparison between two graphs. 6028 04:16:52,500 --> 04:16:53,314 So suppose. 6029 04:16:53,314 --> 04:16:54,500 We are comparing 6030 04:16:54,500 --> 04:16:58,100 graph 1 and graph 2 and it will return this sub graph 6031 04:16:58,100 --> 04:17:00,800 which is common in both the graphs again. 6032 04:17:00,800 --> 04:17:04,600 This can be used in conjunction with the subgraph operator. 6033 04:17:04,600 --> 04:17:05,900 Basically to restrict 6034 04:17:05,900 --> 04:17:09,400 a graph based on properties in another related graph, right. 
6035 04:17:09,400 --> 04:17:12,280 And so I guess you guys are clear with the mask operator. 6036 04:17:12,280 --> 04:17:13,000 So we're here. 6037 04:17:13,000 --> 04:17:14,233 We're providing a graph 6038 04:17:14,233 --> 04:17:16,776 and then we are providing the input graph as well. 6039 04:17:16,776 --> 04:17:18,671 And then it will return a graph 6040 04:17:18,671 --> 04:17:21,700 which is basically a subset of both of these graph 6041 04:17:21,700 --> 04:17:23,600 not talking about group edges. 6042 04:17:23,600 --> 04:17:26,796 So the group edges operator merges the parallel edges 6043 04:17:26,796 --> 04:17:28,446 in the multigraph, right? 6044 04:17:28,446 --> 04:17:29,683 So what it does it, 6045 04:17:29,683 --> 04:17:33,244 the duplicate edges between pair of vertices are merged 6046 04:17:33,244 --> 04:17:35,800 or you can say are at can be aggregated 6047 04:17:35,800 --> 04:17:37,325 or perform some action 6048 04:17:37,325 --> 04:17:41,000 and in many numerical applications I just can be added 6049 04:17:41,000 --> 04:17:43,702 and their weights can be combined into a single edge, 6050 04:17:43,702 --> 04:17:46,804 right which will again reduce the size of the graph. 6051 04:17:46,804 --> 04:17:47,900 So for an example, 6052 04:17:47,900 --> 04:17:51,400 you have to vertex V1 and V2 and there are two edges 6053 04:17:51,400 --> 04:17:53,100 with weight 10 and 15. 6054 04:17:53,100 --> 04:17:56,291 So actually what you can do is you can merge those two edges 6055 04:17:56,291 --> 04:17:59,700 if they have same direction and you can represent the way to 25. 6056 04:17:59,700 --> 04:18:02,100 So this will actually reduce the size 6057 04:18:02,100 --> 04:18:05,144 of the graph now looking at the next operator, 6058 04:18:05,144 --> 04:18:06,700 which is join operator. 6059 04:18:06,700 --> 04:18:09,400 So in many cases it is necessary. 6060 04:18:09,400 --> 04:18:13,151 To join data from external collection with graphs, right? 6061 04:18:13,151 --> 04:18:13,909 For example. 6062 04:18:13,909 --> 04:18:16,100 We might have an extra user property 6063 04:18:16,100 --> 04:18:18,855 that we want to merge with the existing graph 6064 04:18:18,855 --> 04:18:21,186 or we might want to pull vertex property 6065 04:18:21,186 --> 04:18:23,100 from one graph to another right. 6066 04:18:23,100 --> 04:18:24,700 So these are some of the situations 6067 04:18:24,700 --> 04:18:27,000 where you go ahead and use this join operators. 6068 04:18:27,000 --> 04:18:28,900 So now as you can see over here, 6069 04:18:28,900 --> 04:18:31,100 the first operator is joined vertices. 6070 04:18:31,100 --> 04:18:34,792 So the joint vertices operator joins the vertices 6071 04:18:34,792 --> 04:18:36,176 with the input rdd 6072 04:18:36,200 --> 04:18:39,516 and returns a new graph with the vertex properties. 6073 04:18:39,516 --> 04:18:42,700 Dean after applying the user-defined map function 6074 04:18:42,700 --> 04:18:45,400 now the vertices without a matching value 6075 04:18:45,400 --> 04:18:49,500 in the rdd basically retains their original value not talking 6076 04:18:49,500 --> 04:18:51,400 about outer join vertices. 6077 04:18:51,400 --> 04:18:55,100 So it behaves similar to join vertices except that 6078 04:18:55,100 --> 04:18:59,586 which user-defined map function is applied to all the vertices 6079 04:18:59,586 --> 04:19:02,200 and can change the vertex property type. 
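Before finishing the join operators, here is a sketch of the structural operators just described, reusing `graph` from the earlier sketch; the predicates and the merge function are assumptions chosen only to show the shape of each call:

```scala
import org.apache.spark.graphx.PartitionStrategy

// reverse: flip every edge direction; attributes and edge count stay the same
val reversedGraph = graph.reverse

// subgraph: keep only the vertices / edges that satisfy the predicates
val noDoctors    = graph.subgraph(vpred = (id, attr) => attr._2 != "doctor")
val advisorsOnly = graph.subgraph(epred = t => t.attr.contains("advisor"))

// mask: restrict the graph to the vertices and edges that also exist in another graph
val restricted = graph.mask(noDoctors)

// groupEdges: merge parallel edges between the same pair of vertices; partitionBy first,
// because identical edges must be co-located on the same partition
val merged = graph.partitionBy(PartitionStrategy.EdgePartition2D)
                  .groupEdges((a, b) => a + "," + b)
```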
6080 04:19:02,200 --> 04:19:05,600 So suppose that you have a old graph which has 6081 04:19:05,600 --> 04:19:08,100 a Vertex attribute as old price 6082 04:19:08,200 --> 04:19:10,700 and then you created a new a graph from it 6083 04:19:10,700 --> 04:19:13,735 and then it has the vertex attribute as new rice. 6084 04:19:13,735 --> 04:19:16,645 So you can go ahead and join two of these graphs 6085 04:19:16,645 --> 04:19:19,249 and you can perform an aggregation of both 6086 04:19:19,249 --> 04:19:21,725 the Old and New prices in the new graph. 6087 04:19:21,725 --> 04:19:25,265 So in this kind of situation join vertices are used 6088 04:19:25,265 --> 04:19:26,389 now moving ahead. 6089 04:19:26,389 --> 04:19:29,814 Let's talk about neighborhood aggregation now key step 6090 04:19:29,814 --> 04:19:33,239 in many graph analytics is aggregating the information 6091 04:19:33,239 --> 04:19:36,600 about the neighborhood of each vertex for an example. 6092 04:19:36,600 --> 04:19:39,500 We might want to know the number of followers each user has 6093 04:19:39,700 --> 04:19:41,200 Or the average age 6094 04:19:41,200 --> 04:19:45,600 of the follower of each user now many iterative graph algorithms, 6095 04:19:45,600 --> 04:19:47,416 like pagerank shortest path, 6096 04:19:47,416 --> 04:19:50,501 then connected components repeatedly aggregate 6097 04:19:50,501 --> 04:19:52,893 the properties of neighboring vertices. 6098 04:19:52,893 --> 04:19:56,200 Now, it has four operators in neighborhood aggregation. 6099 04:19:56,200 --> 04:19:58,803 So the first one is your aggregate messages. 6100 04:19:58,803 --> 04:20:01,500 So the core aggregation operation in graphics 6101 04:20:01,500 --> 04:20:02,900 is aggregate messages. 6102 04:20:02,900 --> 04:20:04,090 Now this operator 6103 04:20:04,090 --> 04:20:07,100 applies a user-defined send message function 6104 04:20:07,100 --> 04:20:10,799 as you can see over here to Each of the edge triplet 6105 04:20:10,799 --> 04:20:11,600 in the graph 6106 04:20:11,600 --> 04:20:14,230 and then it uses merge message function 6107 04:20:14,230 --> 04:20:17,900 to aggregate those messages at the destination vertex. 6108 04:20:18,000 --> 04:20:19,900 Now the user-defined 6109 04:20:19,900 --> 04:20:23,150 send message function takes an edge context 6110 04:20:23,150 --> 04:20:26,200 as you can see and which exposes the source 6111 04:20:26,200 --> 04:20:29,892 and destination address Buttes along with the edge attribute 6112 04:20:29,892 --> 04:20:32,399 and functions like send to Source or send 6113 04:20:32,399 --> 04:20:35,303 to destination is used to send messages to source 6114 04:20:35,303 --> 04:20:37,013 and destination attributes. 6115 04:20:37,013 --> 04:20:39,800 Now you can think of send message as the map. 6116 04:20:39,800 --> 04:20:43,592 Function in mapreduce and the user-defined merge function 6117 04:20:43,592 --> 04:20:46,000 which actually takes the two messages 6118 04:20:46,000 --> 04:20:48,200 which are present on the same Vertex 6119 04:20:48,200 --> 04:20:50,784 or you can see the same destination vertex 6120 04:20:50,784 --> 04:20:52,090 and it again combines 6121 04:20:52,090 --> 04:20:55,662 or aggregate those messages and produces a single message. 6122 04:20:55,662 --> 04:20:58,146 Now, you can think of the merge message 6123 04:20:58,146 --> 04:21:00,500 as reduce function the mapreduce now, 6124 04:21:00,500 --> 04:21:05,100 the aggregate messages operator returns a Vertex rdd. 
6125 04:21:05,100 --> 04:21:08,128 Basically, it contains the aggregated messages at each 6126 04:21:08,128 --> 04:21:09,657 of the destination vertex. 6127 04:21:09,657 --> 04:21:10,600 It's and vertices 6128 04:21:10,600 --> 04:21:13,815 that did not receive a message are not included 6129 04:21:13,815 --> 04:21:15,693 in the returned vertex rdd. 6130 04:21:15,693 --> 04:21:17,028 So only those vertex 6131 04:21:17,028 --> 04:21:20,500 are returned which actually have received the message 6132 04:21:20,500 --> 04:21:22,956 and then those messages have been merged. 6133 04:21:22,956 --> 04:21:25,250 If any vertex which haven't received. 6134 04:21:25,250 --> 04:21:28,437 The message will not be included in the returned rdd 6135 04:21:28,437 --> 04:21:31,500 or you can say a return vertex rdd now in addition 6136 04:21:31,500 --> 04:21:34,000 as you can see we have a triplets Fields. 6137 04:21:34,000 --> 04:21:37,519 So aggregate messages takes an optional triplet fields, 6138 04:21:37,519 --> 04:21:39,400 which indicates what data is. 6139 04:21:39,400 --> 04:21:41,304 Accessed in the edge content. 6140 04:21:41,304 --> 04:21:42,752 So the possible options 6141 04:21:42,752 --> 04:21:45,900 for the triplet fields are defined interpret fields 6142 04:21:45,900 --> 04:21:48,600 to default value of triplet Fields is triplet 6143 04:21:48,600 --> 04:21:52,300 Fields oil as you can see over here this basically indicates 6144 04:21:52,300 --> 04:21:55,600 that user-defined send message function May access 6145 04:21:55,600 --> 04:21:58,074 any of the fields in the edge content. 6146 04:21:58,074 --> 04:22:01,982 So this triplet field argument can be used to notify Graphics 6147 04:22:01,982 --> 04:22:05,549 that only these part of the edge content will be needed 6148 04:22:05,549 --> 04:22:09,491 which basically allows Graphics to select the optimize joining. 6149 04:22:09,491 --> 04:22:10,700 Strategy, so I hope 6150 04:22:10,700 --> 04:22:13,500 that you guys are clear with the aggregate messages. 6151 04:22:13,500 --> 04:22:16,794 Let's quickly move ahead and look at the second operator. 6152 04:22:16,794 --> 04:22:20,019 So the second operator is mapreduce triplet transition. 6153 04:22:20,019 --> 04:22:21,400 Now in earlier versions 6154 04:22:21,400 --> 04:22:24,700 of Graphics neighborhood aggregation was accomplished 6155 04:22:24,700 --> 04:22:27,272 using the mapreduce triplets operator. 6156 04:22:27,272 --> 04:22:29,802 This mapreduce triplet operator is used 6157 04:22:29,802 --> 04:22:31,814 in older versions of Graphics. 6158 04:22:31,814 --> 04:22:35,100 This operator takes the user-defined map function, 6159 04:22:35,100 --> 04:22:38,900 which is applied to each triplet and can yield messages 6160 04:22:38,900 --> 04:22:42,300 which are Aggregating using the user-defined reduce functions. 6161 04:22:42,300 --> 04:22:44,300 This one is the reason I defined malfunction. 6162 04:22:44,300 --> 04:22:46,600 And this one is your user defined reduce function. 6163 04:22:46,600 --> 04:22:49,081 So it basically applies the map function 6164 04:22:49,081 --> 04:22:50,305 to all the triplets 6165 04:22:50,305 --> 04:22:53,654 and then the aggregate those messages using this user 6166 04:22:53,654 --> 04:22:55,171 defined reduce function. 6167 04:22:55,171 --> 04:22:58,900 Now the newer version of this map produced triplets operator 6168 04:22:58,900 --> 04:23:01,770 is the aggregate messages now moving ahead. 
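Here is a sketch that ties aggregateMessages and outerJoinVertices together on the graph from the earlier sketch: it counts incoming "follower" edges per vertex and then joins the counts back onto the graph. Passing TripletFields.None is only an illustration of the optimization hint described above, since the send function touches no vertex attributes:

```scala
import org.apache.spark.graphx._

// Send the message 1 to the destination of every edge and sum the messages at each vertex,
// i.e. count incoming edges per vertex. Vertices with no incoming edge are not in the result.
val followerCounts: VertexRDD[Int] = graph.aggregateMessages[Int](
  sendMsg       = ctx => ctx.sendToDst(1),
  mergeMsg      = _ + _,
  tripletFields = TripletFields.None)

// Attach the counts to the graph; vertices that received no message get 0
val withCounts: Graph[((String, String), Int), String] =
  graph.outerJoinVertices(followerCounts) { (id, attr, countOpt) => (attr, countOpt.getOrElse(0)) }
```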
6169 04:23:01,770 --> 04:23:04,900 Let's talk about the computing degree information operator. 6170 04:23:04,900 --> 04:23:07,900 So one of the common aggregation tasks is computing 6171 04:23:07,900 --> 04:23:09,579 the degree of each vertex. 6172 04:23:09,579 --> 04:23:12,842 That is the number of edges adjacent to each vertex. 6173 04:23:12,842 --> 04:23:15,072 Now in the context of directed graphs, 6174 04:23:15,072 --> 04:23:18,400 it is often necessary to know the in-degree, out-degree, 6175 04:23:18,400 --> 04:23:20,300 and the total degree of each vertex. 6176 04:23:20,300 --> 04:23:22,800 These kinds of things are pretty much important 6177 04:23:22,800 --> 04:23:25,389 and the GraphOps class contains a collection 6178 04:23:25,389 --> 04:23:28,400 of operators to compute the degrees of each vertex. 6179 04:23:28,500 --> 04:23:29,800 So as you can see, 6180 04:23:29,800 --> 04:23:33,100 we have maximum in-degree, then maximum out-degree, 6181 04:23:33,100 --> 04:23:36,100 then maximum total degree. Maximum in-degree will tell 6182 04:23:36,100 --> 04:23:39,400 us the maximum number of incoming edges, then maximum 6183 04:23:39,400 --> 04:23:42,325 out-degree will tell us the maximum number of outgoing edges, 6184 04:23:42,325 --> 04:23:43,510 and this maximum total degree 6185 04:23:43,510 --> 04:23:46,685 will actually tell us the number of input as well as 6186 04:23:46,685 --> 04:23:49,572 output edges. Now moving ahead to the next operator, 6187 04:23:49,572 --> 04:23:52,300 that is collecting neighbors. In some cases 6188 04:23:52,300 --> 04:23:54,182 it may be easier to express 6189 04:23:54,182 --> 04:23:57,600 the computation by collecting neighboring vertices 6190 04:23:57,600 --> 04:24:00,000 and their attributes at each vertex. 6191 04:24:00,000 --> 04:24:02,624 Now, this can be easily accomplished using 6192 04:24:02,624 --> 04:24:06,400 the collectNeighborIds and the collectNeighbors operators. 6193 04:24:06,400 --> 04:24:09,600 So basically your collectNeighborIds takes 6194 04:24:09,600 --> 04:24:12,200 the edge direction as the parameter 6195 04:24:12,300 --> 04:24:14,400 and it returns a VertexRDD 6196 04:24:14,400 --> 04:24:17,400 that contains the array of vertex IDs 6197 04:24:17,500 --> 04:24:20,000 that are neighboring to the particular vertex. 6198 04:24:20,000 --> 04:24:23,400 Now similarly, the collectNeighbors operator again takes 6199 04:24:23,400 --> 04:24:25,717 the edge direction as the input 6200 04:24:25,717 --> 04:24:28,000 and it will return you the array 6201 04:24:28,000 --> 04:24:31,600 with both the vertex ID and the vertex attribute. Now, 6202 04:24:31,600 --> 04:24:32,717 let me quickly open 6203 04:24:32,717 --> 04:24:35,700 my VM and let us go through the Spark directory first. 6204 04:24:35,900 --> 04:24:38,600 Let me first open my terminal, so first 6205 04:24:38,600 --> 04:24:41,800 I'll start the Hadoop daemons. So for that I will go 6206 04:24:41,800 --> 04:24:46,358 to the Hadoop home directory, then inside its sbin I'll execute 6207 04:24:46,358 --> 04:24:48,282 the start-all.sh script file. 6208 04:24:52,000 --> 04:24:53,400 So let me check 6209 04:24:53,400 --> 04:24:55,700 if the Hadoop daemons are running or not. 6210 04:24:58,700 --> 04:25:00,706 So as you can see, the NameNode, 6211 04:25:00,706 --> 04:25:03,000 DataNode, secondary NameNode, 6212 04:25:03,000 --> 04:25:05,848 the NodeManager and ResourceManager, 6213 04:25:05,848 --> 04:25:08,400 all the daemons of Hadoop are up. Now 6214 04:25:08,400 --> 04:25:10,661 I will navigate to the Spark home.
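Before the hands-on part, here is a minimal sketch of the degree and neighbor-collection operators just described, again assuming sc and the followers.txt edge list:

```scala
import org.apache.spark.graphx.{EdgeDirection, GraphLoader}

val graph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt")

// Degree information from the GraphOps class.
val inDeg  = graph.inDegrees   // VertexRDD[Int]: number of incoming edges
val outDeg = graph.outDegrees  // number of outgoing edges
val deg    = graph.degrees     // total degree

// Vertex with the maximum in-degree, found with a simple reduce.
val maxInDegree = inDeg.reduce((a, b) => if (a._2 > b._2) a else b)

// Neighborhood collection: neighbor IDs only, or IDs plus attributes.
val neighborIds   = graph.collectNeighborIds(EdgeDirection.Either)
val neighborAttrs = graph.collectNeighbors(EdgeDirection.Either)
```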
6215 04:25:10,661 --> 04:25:13,300 Let me first start this park demons. 6216 04:25:17,600 --> 04:25:19,700 I See Spark demons are running 6217 04:25:19,700 --> 04:25:24,000 alko first minimize this and let me take you to this park home. 6218 04:25:24,900 --> 04:25:27,309 And this is my spot directories. 6219 04:25:27,309 --> 04:25:28,712 I'll go inside now. 6220 04:25:28,712 --> 04:25:30,926 Let me first show you the data 6221 04:25:30,926 --> 04:25:34,100 which is by default present with your spark. 6222 04:25:34,400 --> 04:25:36,700 So we'll open this in a new tab. 6223 04:25:36,700 --> 04:25:38,865 So you can see we have two files 6224 04:25:38,865 --> 04:25:41,100 in this Graphics data directory. 6225 04:25:41,100 --> 04:25:44,638 Meanwhile, let me take you to the example code. 6226 04:25:44,638 --> 04:25:48,900 So this is example and inside so main scalar. 6227 04:25:49,600 --> 04:25:50,500 You can find 6228 04:25:50,500 --> 04:25:54,700 the graphics directory and inside this Graphics directory 6229 04:25:54,700 --> 04:25:59,000 you Some of the sample codes which are present over here. 6230 04:25:59,000 --> 04:26:01,692 So I will take you to this aggregate 6231 04:26:01,692 --> 04:26:05,100 messages example dots Kayla now meanwhile, 6232 04:26:05,100 --> 04:26:07,287 let me open the data as well. 6233 04:26:07,287 --> 04:26:09,700 So you'll be able to understand. 6234 04:26:10,500 --> 04:26:12,967 Now this is followers dot txt file. 6235 04:26:12,967 --> 04:26:15,000 So basically you can imagine 6236 04:26:15,000 --> 04:26:18,545 these are the edges which are representing the vertex. 6237 04:26:18,545 --> 04:26:21,580 So this is what x 2 and this is vertex 1 then 6238 04:26:21,580 --> 04:26:25,100 this is Vertex 4 and this is vertex 1 and similarly. 6239 04:26:25,100 --> 04:26:28,400 So on these are representing those vertex and 6240 04:26:28,400 --> 04:26:30,900 if you can remember I have already told you 6241 04:26:30,900 --> 04:26:33,200 that inside graph loader class. 6242 04:26:33,200 --> 04:26:35,818 There is a function called Edge list file 6243 04:26:35,818 --> 04:26:37,200 which takes the edges 6244 04:26:37,200 --> 04:26:40,500 from a file and then it construct the graph based. 6245 04:26:40,500 --> 04:26:43,800 That now second you have this user dot txt. 6246 04:26:43,800 --> 04:26:47,550 So these are basically the edges with the vertex ID. 6247 04:26:47,550 --> 04:26:51,200 So vertex ID for this vertex is 1 then for this is 2 6248 04:26:51,200 --> 04:26:53,539 and so on and then this is the data 6249 04:26:53,539 --> 04:26:57,600 which is attached or you can say the attribute of the edges. 6250 04:26:57,600 --> 04:26:59,800 So these are the vertex ID 6251 04:26:59,958 --> 04:27:03,700 which is 1 2 3 respectively and this is the data 6252 04:27:03,700 --> 04:27:06,800 which is associated with your each vertex. 6253 04:27:06,800 --> 04:27:10,500 So this is username and this might be the name of your user. 6254 04:27:10,500 --> 04:27:13,100 Zur and so on now you can also see 6255 04:27:13,100 --> 04:27:16,900 that in some of the cases the name of the user is missing. 6256 04:27:16,900 --> 04:27:18,800 So as in this case the name 6257 04:27:18,800 --> 04:27:22,100 of the user is missing these are the vertices 6258 04:27:22,100 --> 04:27:26,300 or you can see the vertex ID and vertex attributes. 
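As a quick sketch of how these two files are typically used in the examples that follow (assuming the data/graphx directory that ships with Spark): followers.txt is read as an edge list and users.txt is parsed into (vertex ID, attribute) pairs.

```scala
import org.apache.spark.graphx.GraphLoader

// Construct the graph from the edge list in followers.txt.
val graph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt")

// Parse users.txt into (vertexId, username) pairs; lines with missing
// fields would need extra handling in real code.
val users = sc.textFile("data/graphx/users.txt").map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}
```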
6259 04:27:26,600 --> 04:27:30,500 Now, let me take you through this aggregate messages example, 6260 04:27:30,600 --> 04:27:32,400 so as you can see, we are giving the name 6261 04:27:32,400 --> 04:27:36,100 of the packages over G Apache spark examples dot Graphics, 6262 04:27:36,300 --> 04:27:40,306 then we are importing Graphics in that very important. 6263 04:27:40,306 --> 04:27:41,764 Off class as well as 6264 04:27:41,764 --> 04:27:45,700 this vertex rdd next we are using this graph generator. 6265 04:27:45,700 --> 04:27:48,500 I'll tell you why we are using this graph generator 6266 04:27:48,700 --> 04:27:52,400 and then we are using the spark session over here. 6267 04:27:52,400 --> 04:27:54,105 So this is an example 6268 04:27:54,163 --> 04:27:58,778 where we are using the aggregate messages operator to compute 6269 04:27:58,778 --> 04:28:03,163 the average age of the more senior followers of each user. 6270 04:28:03,200 --> 04:28:03,700 Okay. 6271 04:28:03,928 --> 04:28:06,929 So this is the object of aggregate messages example. 6272 04:28:07,000 --> 04:28:10,000 Now, this is the main function where we are first. 6273 04:28:10,100 --> 04:28:13,600 Realizing this box session then the name of the application. 6274 04:28:13,600 --> 04:28:16,400 So you have to provide the name of the application 6275 04:28:16,400 --> 04:28:17,400 and this is get 6276 04:28:17,400 --> 04:28:20,600 or create method now next you are initializing 6277 04:28:20,600 --> 04:28:24,338 the spark context as SC now coming to the code. 6278 04:28:24,400 --> 04:28:27,400 So we are specifying a graph then this graph 6279 04:28:27,400 --> 04:28:30,300 is containing double and N now. 6280 04:28:30,400 --> 04:28:33,200 I just told you that we are importing craft generator. 6281 04:28:33,200 --> 04:28:35,023 So this graph generator is 6282 04:28:35,023 --> 04:28:37,900 to generate a random graph for Simplicity. 6283 04:28:37,900 --> 04:28:40,400 So you would have multiple number of edges and vertices. 6284 04:28:40,400 --> 04:28:43,047 Says then you are using this log normal graph. 6285 04:28:43,047 --> 04:28:44,900 You're passing the spark context 6286 04:28:44,900 --> 04:28:47,677 and you're specifying the number of vertices as hundred. 6287 04:28:47,677 --> 04:28:49,956 So it will generate hundred vertices for you. 6288 04:28:49,956 --> 04:28:51,200 Then what you are doing. 6289 04:28:51,200 --> 04:28:53,400 You are specifying the map vertices 6290 04:28:53,400 --> 04:28:56,815 and you're trying to map ID to double so 6291 04:28:56,815 --> 04:28:58,200 what this would do 6292 04:28:58,200 --> 04:29:02,100 this will basically map your ID to double then 6293 04:29:02,100 --> 04:29:05,700 in next year trying to calculate the older followers 6294 04:29:05,700 --> 04:29:08,300 where you have given it as vertex rdd 6295 04:29:08,300 --> 04:29:10,494 and then put is nth and Also, 6296 04:29:10,494 --> 04:29:13,900 your vertex already has sent as your vertex ID 6297 04:29:13,900 --> 04:29:15,200 and your data is double 6298 04:29:15,200 --> 04:29:17,533 which is associated with each of the vertex 6299 04:29:17,533 --> 04:29:19,604 or you can say the vertex attribute. 6300 04:29:19,604 --> 04:29:20,900 So you have this graph 6301 04:29:20,900 --> 04:29:23,178 which is basically generated randomly 6302 04:29:23,178 --> 04:29:26,189 and then you are performing aggregate messages. 
6303 04:29:26,189 --> 04:29:29,200 So this is the aggregate messages operator now, 6304 04:29:29,200 --> 04:29:33,353 if you can remember we first have the send messages, right? 6305 04:29:33,353 --> 04:29:35,000 So inside this triplet, 6306 04:29:35,000 --> 04:29:38,620 we are specifying a function that if the source attribute 6307 04:29:38,620 --> 04:29:40,100 of the triplet is board. 6308 04:29:40,100 --> 04:29:42,300 Destination attribute of the triplet. 6309 04:29:42,300 --> 04:29:43,900 So basically it will return 6310 04:29:43,900 --> 04:29:47,144 if the followers age is greater than the age 6311 04:29:47,144 --> 04:29:48,452 of person whom he 6312 04:29:48,452 --> 04:29:52,259 is following this tells the followers is is greater 6313 04:29:52,259 --> 04:29:55,000 than the age of whom he is following. 6314 04:29:55,000 --> 04:29:56,462 So in that situation, 6315 04:29:56,462 --> 04:29:59,200 it will send message to the destination 6316 04:29:59,200 --> 04:30:01,400 with vertex containing counter 6317 04:30:01,400 --> 04:30:05,000 that is 1 and the age of the source attribute 6318 04:30:05,000 --> 04:30:07,700 that is the age of the follower so first 6319 04:30:07,700 --> 04:30:10,800 so you can see the age of the destination on is less 6320 04:30:10,800 --> 04:30:12,807 than the age of source attribute. 6321 04:30:12,807 --> 04:30:14,000 So it will tell you 6322 04:30:14,000 --> 04:30:17,293 if the follower is older than the user or not. 6323 04:30:17,293 --> 04:30:21,100 So in that situation will send one to the destination 6324 04:30:21,100 --> 04:30:23,900 and we'll send the age of the source 6325 04:30:23,900 --> 04:30:26,900 or you can see the edge of the follower then second. 6326 04:30:26,900 --> 04:30:29,400 I have told you that we have merged messages. 6327 04:30:29,500 --> 04:30:32,500 So here we are adding the counter and the H 6328 04:30:32,600 --> 04:30:33,800 in this reduce function. 6329 04:30:33,900 --> 04:30:37,515 So now what we are doing we are dividing the total age 6330 04:30:37,515 --> 04:30:38,421 of the number 6331 04:30:38,421 --> 04:30:41,439 of older followers to Write an average age 6332 04:30:41,439 --> 04:30:42,700 of older followers. 6333 04:30:42,700 --> 04:30:45,400 So this is the reason why we have passed the attribute 6334 04:30:45,400 --> 04:30:47,200 of source vertex firstly 6335 04:30:47,200 --> 04:30:49,300 if we are specifying this variable that is 6336 04:30:49,300 --> 04:30:51,194 average age of older followers. 6337 04:30:51,194 --> 04:30:53,700 And then we are specifying the vertex rdd. 6338 04:30:53,888 --> 04:30:58,211 So this will be double and then this older followers 6339 04:30:58,292 --> 04:30:59,600 that is the graph 6340 04:30:59,600 --> 04:31:02,349 which we are picking up from here and then we 6341 04:31:02,349 --> 04:31:04,100 are trying to map the value. 6342 04:31:04,100 --> 04:31:05,400 So in the vertex, 6343 04:31:05,400 --> 04:31:10,100 we have ID and we have value so in this situation We 6344 04:31:10,100 --> 04:31:13,600 are using this case class about count and total age. 6345 04:31:13,600 --> 04:31:16,000 So what we are doing we are taking this total age 6346 04:31:16,000 --> 04:31:19,246 and we are dividing it by count which we have gathered from this 6347 04:31:19,246 --> 04:31:20,011 send message. 6348 04:31:20,011 --> 04:31:22,800 And then we have aggregated using this reduce function. 6349 04:31:22,800 --> 04:31:26,400 We are again taking the total age of the older followers. 
6350 04:31:26,400 --> 04:31:28,994 And then we are trying to divide it by count 6351 04:31:28,994 --> 04:31:30,377 to get the average age 6352 04:31:30,377 --> 04:31:33,900 when at last we are trying to display the result and then 6353 04:31:33,900 --> 04:31:35,600 we are stopping this park. 6354 04:31:35,600 --> 04:31:38,385 So let me quickly open the terminal so I 6355 04:31:38,385 --> 04:31:39,742 will go to examples 6356 04:31:39,742 --> 04:31:43,600 so I'd examples I took you through the source directory 6357 04:31:43,600 --> 04:31:46,400 where the code is present inside skaila. 6358 04:31:46,400 --> 04:31:49,154 And then inside there is a spark directory 6359 04:31:49,154 --> 04:31:51,975 where you will find the code but to execute 6360 04:31:51,975 --> 04:31:55,200 the example you need to go to the jars territory. 6361 04:31:56,100 --> 04:31:58,392 Now, this is the scale example jar 6362 04:31:58,392 --> 04:32:00,200 which you need to execute. 6363 04:32:00,200 --> 04:32:03,100 But before this, let me take you to the hdfs. 6364 04:32:03,400 --> 04:32:05,600 So the URL is localhost. 6365 04:32:05,600 --> 04:32:07,400 Colon 5 0 0 7 0 6366 04:32:08,500 --> 04:32:10,800 And we'll go to utilities then 6367 04:32:10,800 --> 04:32:12,800 we'll go to browse the file system. 6368 04:32:13,000 --> 04:32:14,137 So as you can see, 6369 04:32:14,137 --> 04:32:16,849 I have created a user directory in which I 6370 04:32:16,849 --> 04:32:18,700 have specified the username. 6371 04:32:18,700 --> 04:32:22,000 That is Ed Eureka and inside Ed Eureka. 6372 04:32:22,000 --> 04:32:24,200 I have placed my data directory 6373 04:32:24,200 --> 04:32:27,500 where we have this graphics and inside the graphics. 6374 04:32:27,500 --> 04:32:30,100 We have both the file that is followers Dot txt 6375 04:32:30,100 --> 04:32:31,600 and users dot txt. 6376 04:32:31,600 --> 04:32:32,854 So in this program, 6377 04:32:32,854 --> 04:32:35,100 we are not referring to these files 6378 04:32:35,100 --> 04:32:38,500 but incoming examples will be referring to these files. 6379 04:32:38,500 --> 04:32:42,700 So I would request you to first move it to this hdfs directory. 6380 04:32:42,700 --> 04:32:46,800 So that spark can refer the files in data Graphics. 6381 04:32:47,000 --> 04:32:50,300 Now, let me quickly minimize this and the command 6382 04:32:50,300 --> 04:32:53,000 to execute is Spock - 6383 04:32:53,000 --> 04:32:56,900 submit and then I'll pass this charge parameter 6384 04:32:56,900 --> 04:32:59,900 and I'll provide the spark example jar. 6385 04:33:01,200 --> 04:33:05,100 So this is the jar then I'll specify the class name. 6386 04:33:05,100 --> 04:33:06,900 So to get the class name. 6387 04:33:06,900 --> 04:33:08,900 I will go to the code. 6388 04:33:09,200 --> 04:33:12,000 I'll first take the package name from here. 6389 04:33:12,700 --> 04:33:14,100 And then I'll take 6390 04:33:14,100 --> 04:33:17,935 the class name which is aggregated messages example, 6391 04:33:17,935 --> 04:33:19,400 so this is my class. 6392 04:33:19,400 --> 04:33:21,928 And as I told you have to provide the name 6393 04:33:21,928 --> 04:33:23,100 of the application. 6394 04:33:23,100 --> 04:33:26,600 So let me keep it as example and I'll hit enter. 6395 04:33:31,946 --> 04:33:34,253 So now you can see the result. 6396 04:33:36,000 --> 04:33:37,700 So this is the followers 6397 04:33:37,700 --> 04:33:40,500 and this is the average age of followers. 6398 04:33:40,500 --> 04:33:41,827 So it is 34 Den. 
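While the job runs, here is a condensed sketch along the lines of the example just walked through; the spark-submit form is illustrative and the exact jar name depends on your Spark build:

```scala
import org.apache.spark.graphx.{Graph, VertexRDD}
import org.apache.spark.graphx.util.GraphGenerators

// A random graph in which each vertex attribute is its ID treated as an age.
val graph: Graph[Double, Int] =
  GraphGenerators.logNormalGraph(sc, numVertices = 100).mapVertices((id, _) => id.toDouble)

// Send (1, age) to the followed user whenever the follower is older, then
// sum counters and ages per destination vertex.
val olderFollowers: VertexRDD[(Int, Double)] = graph.aggregateMessages[(Int, Double)](
  triplet =>
    if (triplet.srcAttr > triplet.dstAttr) triplet.sendToDst((1, triplet.srcAttr)),
  (a, b) => (a._1 + b._1, a._2 + b._2)
)

// Divide total age by count to get the average age of older followers.
val avgAgeOfOlderFollowers: VertexRDD[Double] =
  olderFollowers.mapValues((id, value) => value match {
    case (count, totalAge) => totalAge / count
  })
avgAgeOfOlderFollowers.collect().foreach(println)
```

A typical submission looks something like `spark-submit --class org.apache.spark.examples.graphx.AggregateMessagesExample <path-to-spark-examples-jar> <app-specific-arguments>`.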
6399 04:33:41,827 --> 04:33:45,038 We have 52 which is the count of follower. 6400 04:33:45,038 --> 04:33:48,500 And the average age is seventy six point eight 6401 04:33:48,500 --> 04:33:51,100 that is it has 96 senior followers. 6402 04:33:51,100 --> 04:33:52,900 And then the average age 6403 04:33:52,900 --> 04:33:56,000 of the followers is ninety nine point zero, 6404 04:33:56,100 --> 04:33:58,600 then it has four senior followers 6405 04:33:58,600 --> 04:34:00,520 and the average age is 51. 6406 04:34:00,520 --> 04:34:03,400 Then this vertex has 16 senior followers 6407 04:34:03,400 --> 04:34:06,003 with the average age of 57 point five. 6408 04:34:06,003 --> 04:34:09,024 5 and so on you can see the result over here. 6409 04:34:09,024 --> 04:34:12,800 So I hope now you guys are clear with aggregate messages 6410 04:34:12,800 --> 04:34:14,748 how to use aggregate messages 6411 04:34:14,748 --> 04:34:17,100 how to specify the send message then 6412 04:34:17,100 --> 04:34:19,200 how to write the merge message. 6413 04:34:19,200 --> 04:34:21,788 So let's quickly go back to the presentation. 6414 04:34:21,788 --> 04:34:23,500 Now, let us quickly move ahead 6415 04:34:23,500 --> 04:34:26,014 and look at some of the graph algorithms. 6416 04:34:26,014 --> 04:34:27,959 So the first one is Page rank. 6417 04:34:27,959 --> 04:34:31,200 So page rank measures the importance of each vertex 6418 04:34:31,200 --> 04:34:32,706 in a graph assuming 6419 04:34:32,800 --> 04:34:35,900 that an edge from U to V represents. 6420 04:34:36,000 --> 04:34:37,453 And recommendation 6421 04:34:37,453 --> 04:34:41,300 or support of Vis importance by you for an example. 6422 04:34:41,300 --> 04:34:45,468 Let's say if a Twitter user is followed by many others user 6423 04:34:45,468 --> 04:34:48,200 will obviously rank high graphics comes 6424 04:34:48,200 --> 04:34:51,919 with the static and dynamic implementation of pagerank as 6425 04:34:51,919 --> 04:34:53,780 methods on page rank object 6426 04:34:53,780 --> 04:34:57,500 and static page rank runs a fixed number of iterations, 6427 04:34:57,500 --> 04:35:02,200 which can be specified by you while the dynamic page rank runs 6428 04:35:02,200 --> 04:35:04,100 until the ranks converge 6429 04:35:04,500 --> 04:35:08,300 what we mean by that is it Stop changing by more 6430 04:35:08,300 --> 04:35:10,400 than a specified tolerance. 6431 04:35:10,500 --> 04:35:11,300 So it runs 6432 04:35:11,300 --> 04:35:14,500 until it have optimized the page rank of each 6433 04:35:14,500 --> 04:35:19,400 of the vertices now graphs class allows calling these algorithms 6434 04:35:19,400 --> 04:35:22,100 directly as methods on crafts class. 6435 04:35:22,200 --> 04:35:24,800 Now, let's quickly go back to the VM. 6436 04:35:25,000 --> 04:35:27,469 So this is the pagerank example. 6437 04:35:27,469 --> 04:35:29,161 Let me open this file. 6438 04:35:29,600 --> 04:35:32,595 So first we are specifying this Graphics package, 6439 04:35:32,595 --> 04:35:35,065 then we are importing the graph loader. 6440 04:35:35,065 --> 04:35:37,600 So as you can Remember inside this graph 6441 04:35:37,600 --> 04:35:41,000 loader class we have that edge list file operator, 6442 04:35:41,000 --> 04:35:43,600 which will basically create the graph using the edges 6443 04:35:43,600 --> 04:35:46,575 and we have those edges in our followers 6444 04:35:46,575 --> 04:35:50,542 dot txt file now coming back to pagerank example now, 6445 04:35:50,542 --> 04:35:53,900 we're importing the spark SQL Sparks session. 
6446 04:35:54,100 --> 04:35:56,619 Now, this is Page rank example object 6447 04:35:56,619 --> 04:35:59,700 and inside which we have created a main class 6448 04:35:59,700 --> 04:36:04,000 and we have similarly created this park session then Builders 6449 04:36:04,000 --> 04:36:05,600 and we're specifying the app name 6450 04:36:05,600 --> 04:36:09,800 which Is to be provided then we have get our grid method. 6451 04:36:09,800 --> 04:36:10,415 So this is 6452 04:36:10,415 --> 04:36:12,800 where we are initializing the spark context 6453 04:36:12,800 --> 04:36:13,800 as you can remember. 6454 04:36:13,800 --> 04:36:16,900 I told you that using this Edge list file method. 6455 04:36:16,900 --> 04:36:19,115 We are basically creating the graph 6456 04:36:19,115 --> 04:36:21,200 from the followers dot txt file. 6457 04:36:21,200 --> 04:36:24,223 Now, we are running the page rank over here. 6458 04:36:24,223 --> 04:36:28,421 So in rank it will give you all the page rank of the vertices 6459 04:36:28,421 --> 04:36:30,104 that is inside this graph 6460 04:36:30,104 --> 04:36:33,400 which we have just to reducing graph loader class. 6461 04:36:33,400 --> 04:36:36,575 So if you're passing an integer as an an argument 6462 04:36:36,575 --> 04:36:37,700 to the page rank, 6463 04:36:37,700 --> 04:36:40,018 it will run that number iterations. 6464 04:36:40,018 --> 04:36:43,000 Otherwise, if you're passing a double value, 6465 04:36:43,000 --> 04:36:45,495 it will run until the convergence. 6466 04:36:45,495 --> 04:36:48,400 So we are running page rank on this graph 6467 04:36:48,400 --> 04:36:50,861 and we have passed the vertices. 6468 04:36:50,900 --> 04:36:55,300 Now after this we are trying to load the users dot txt file 6469 04:36:55,500 --> 04:36:58,400 and then we are trying to play 6470 04:36:58,400 --> 04:37:02,400 the line by comma then the field zero too long 6471 04:37:02,400 --> 04:37:04,571 and we are storing the field one. 6472 04:37:04,571 --> 04:37:06,200 So basically field zero. 6473 04:37:06,300 --> 04:37:09,376 In your user txt is your vertex ID or you 6474 04:37:09,376 --> 04:37:13,790 can see the ID of the user and field one is your username. 6475 04:37:13,790 --> 04:37:17,252 So we are trying to load these two Fields now. 6476 04:37:17,280 --> 04:37:19,819 We are trying to rank by username. 6477 04:37:19,969 --> 04:37:24,600 So we are taking the users and we are joining the ranks. 6478 04:37:24,600 --> 04:37:28,000 So this is where we are using the join operation. 6479 04:37:28,000 --> 04:37:29,670 So Frank's by username. 6480 04:37:29,670 --> 04:37:32,562 We are trying to attach those username 6481 04:37:32,562 --> 04:37:35,793 or put those username with the page rank value. 6482 04:37:35,793 --> 04:37:37,641 So we are taking the users 6483 04:37:37,641 --> 04:37:40,554 then we are joining the ranks it is again, 6484 04:37:40,554 --> 04:37:42,900 we are getting from this page Rank 6485 04:37:43,300 --> 04:37:47,700 and then we are mapping the ID user name and rank. 6486 04:37:56,500 --> 04:38:00,517 Second week sometime run some iterations over the craft 6487 04:38:00,517 --> 04:38:02,600 and will try to converge it. 6488 04:38:08,000 --> 04:38:11,700 So after converging you can see the user and the rank. 6489 04:38:11,700 --> 04:38:14,300 So the maximum rank is with Barack Obama, 6490 04:38:14,300 --> 04:38:18,000 which is 1.45 then with Lady Gaga. 6491 04:38:18,100 --> 04:38:22,200 It's 1.39 and then with order ski and so on. 
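For reference, a condensed sketch of the PageRank example just demonstrated, using the data/graphx files shown earlier:

```scala
import org.apache.spark.graphx.GraphLoader

// Build the graph from the edge list, then run PageRank until the ranks
// converge to the given tolerance (staticPageRank(numIter) would run a
// fixed number of iterations instead).
val graph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt")
val ranks = graph.pageRank(0.0001).vertices

// Join the ranks with the usernames from users.txt and print them.
val users = sc.textFile("data/graphx/users.txt").map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}
val ranksByUsername = users.join(ranks).map {
  case (id, (username, rank)) => (username, rank)
}
println(ranksByUsername.collect().mkString("\n"))
```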
6492 04:38:22,261 --> 04:38:24,338 Let's go back to the slide. 6493 04:38:25,200 --> 04:38:27,000 So now after page rank, 6494 04:38:27,200 --> 04:38:28,856 let's quickly move ahead 6495 04:38:28,856 --> 04:38:32,200 to Connected components the connected components 6496 04:38:32,200 --> 04:38:34,923 algorithm labels each connected component 6497 04:38:34,923 --> 04:38:38,600 of the graph with the ID of its lowest numbered vertex. 6498 04:38:38,600 --> 04:38:40,700 So let us quickly go back to the VM. 6499 04:38:42,000 --> 04:38:45,200 Now let's go inside the graphics directory 6500 04:38:45,200 --> 04:38:48,300 and now we'll open this connect components example. 6501 04:38:48,400 --> 04:38:51,818 So again, it's the same very important graph load 6502 04:38:51,818 --> 04:38:53,100 and Spark session. 6503 04:38:53,300 --> 04:38:56,600 Now, this is the connect components example object makes 6504 04:38:56,600 --> 04:39:00,176 this is the main function and inside the main function. 6505 04:39:00,176 --> 04:39:01,800 We are again specifying all 6506 04:39:01,800 --> 04:39:04,500 those Sparks session then app name, 6507 04:39:04,500 --> 04:39:06,389 then we have spark context. 6508 04:39:06,389 --> 04:39:07,509 So it's similar. 6509 04:39:07,509 --> 04:39:10,100 So again using this graph loader class 6510 04:39:10,130 --> 04:39:11,669 and using this Edge. 6511 04:39:11,900 --> 04:39:15,700 To file we are loading the followers dot txt file. 6512 04:39:15,700 --> 04:39:16,733 Now in this graph. 6513 04:39:16,733 --> 04:39:19,706 We are using this connected components algorithm. 6514 04:39:19,706 --> 04:39:23,300 And then we are trying to find the connected components now 6515 04:39:23,300 --> 04:39:26,600 at last we are trying to again load this user file 6516 04:39:26,600 --> 04:39:28,300 that is users Dot txt. 6517 04:39:28,500 --> 04:39:31,312 And we are trying to join the connected components 6518 04:39:31,312 --> 04:39:34,387 with the username so over here it is also the same thing 6519 04:39:34,387 --> 04:39:36,504 which we have discussed in page rank, 6520 04:39:36,504 --> 04:39:38,000 which is taking the field 0 6521 04:39:38,000 --> 04:39:41,100 and field one of your user dot txt file 6522 04:39:41,400 --> 04:39:45,100 and a at last we are joining this users 6523 04:39:45,100 --> 04:39:49,200 and at last year trying to join this users to connect component 6524 04:39:49,200 --> 04:39:50,584 that is from here. 6525 04:39:50,584 --> 04:39:50,882 Now. 6526 04:39:50,882 --> 04:39:54,008 We are printing the CC by username collect. 6527 04:39:54,008 --> 04:39:58,400 So let us quickly go ahead and execute this example as well. 6528 04:39:58,600 --> 04:40:01,400 So let me first copy this object name. 6529 04:40:03,800 --> 04:40:17,300 that's name this as example to so 6530 04:40:17,300 --> 04:40:20,100 as you can see Justin Bieber has one connected component, 6531 04:40:20,100 --> 04:40:23,300 then you can see this has three connected component. 6532 04:40:23,300 --> 04:40:25,100 Then this has one connected component 6533 04:40:25,100 --> 04:40:28,600 than Barack Obama has one connected component and so on. 6534 04:40:28,600 --> 04:40:30,464 So this basically gives you an idea 6535 04:40:30,464 --> 04:40:32,200 about the connected components. 6536 04:40:32,200 --> 04:40:33,900 Now, let's quickly move back 6537 04:40:33,900 --> 04:40:37,300 to the slide will discuss about the third algorithm 6538 04:40:37,300 --> 04:40:39,100 that is triangle counting. 
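Before moving on to triangle counting, here is a condensed sketch of the connected components example that was just run:

```scala
import org.apache.spark.graphx.GraphLoader

// Label each connected component with its lowest vertex ID, then attach
// usernames to the component labels.
val graph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt")
val cc = graph.connectedComponents().vertices

val users = sc.textFile("data/graphx/users.txt").map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}
val ccByUsername = users.join(cc).map {
  case (id, (username, component)) => (username, component)
}
println(ccByUsername.collect().mkString("\n"))
```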
6539 04:40:39,100 --> 04:40:43,177 So basically a Vertex is a part of a triangle when it has 6540 04:40:43,177 --> 04:40:46,900 two adjacent vertices with an edge between them. 6541 04:40:46,900 --> 04:40:49,100 So it will form a triangle, right? 6542 04:40:49,100 --> 04:40:52,313 And then that vertex is a part of a triangle 6543 04:40:52,313 --> 04:40:56,092 now Graphics implements a triangle counting algorithm 6544 04:40:56,092 --> 04:40:58,200 in the Triangle count object. 6545 04:40:58,200 --> 04:41:01,200 Now that determines the number of triangles passing 6546 04:41:01,200 --> 04:41:04,600 through each vertex providing a measure of clustering 6547 04:41:04,600 --> 04:41:07,400 so we can compute the triangle count 6548 04:41:07,400 --> 04:41:09,875 of the social network data set 6549 04:41:09,875 --> 04:41:13,675 from the pagerank section 1 mode thing to note is 6550 04:41:13,675 --> 04:41:16,598 that triangle count requires the edges. 6551 04:41:16,600 --> 04:41:18,800 To be in a canonical orientation. 6552 04:41:18,800 --> 04:41:21,364 That is your Source ID should always be less 6553 04:41:21,364 --> 04:41:22,868 than your destination ID 6554 04:41:22,868 --> 04:41:25,500 and the graph will be partition using craft 6555 04:41:25,500 --> 04:41:27,318 or Partition by Method now, 6556 04:41:27,318 --> 04:41:28,800 let's quickly go back. 6557 04:41:28,800 --> 04:41:32,000 So let me open the graphics directory again, 6558 04:41:32,000 --> 04:41:35,200 and we'll see the triangle counting example. 6559 04:41:36,500 --> 04:41:38,100 So again, it's the same 6560 04:41:38,100 --> 04:41:40,900 and the object is triangle counting example, 6561 04:41:40,900 --> 04:41:43,400 then the main function is same as well. 6562 04:41:43,400 --> 04:41:46,400 Now we are again using this graph load of class 6563 04:41:46,400 --> 04:41:50,183 and we are loading the followers dot txt 6564 04:41:50,183 --> 04:41:52,000 which contains the edges 6565 04:41:52,000 --> 04:41:53,000 as you can see here. 6566 04:41:53,000 --> 04:41:54,600 We are using this Partition 6567 04:41:54,600 --> 04:41:58,800 by argument and we are passing the random vertex cut, 6568 04:41:58,800 --> 04:42:01,000 which is the partition strategy. 6569 04:42:01,000 --> 04:42:03,165 So this is how you can go ahead 6570 04:42:03,165 --> 04:42:06,100 and you can Implement a partition strategy. 6571 04:42:06,123 --> 04:42:09,277 He is loading the edges in canonical order 6572 04:42:09,400 --> 04:42:11,900 and partitioning the graph for triangle count. 6573 04:42:11,900 --> 04:42:12,129 Now. 6574 04:42:12,129 --> 04:42:14,600 We are trying to find out the triangle count 6575 04:42:14,600 --> 04:42:15,830 for each vertex. 6576 04:42:15,830 --> 04:42:18,000 So we have this try count 6577 04:42:18,000 --> 04:42:22,600 variable and then we are using this triangle count algorithm 6578 04:42:22,600 --> 04:42:25,074 and then we are specifying the vertices 6579 04:42:25,074 --> 04:42:28,200 so it will execute triangle count over this graph 6580 04:42:28,200 --> 04:42:31,900 which we have just loaded from follows dot txt file. 6581 04:42:31,900 --> 04:42:35,074 And again, we are basically joining usernames. 6582 04:42:35,074 --> 04:42:38,320 So first we are Being the usernames again here. 6583 04:42:38,320 --> 04:42:42,600 We are performing the join between users and try counts. 6584 04:42:42,900 --> 04:42:45,300 So try counts is from here. 6585 04:42:45,300 --> 04:42:48,806 And then we are again printing the value from here. 
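As a condensed sketch of the triangle counting example being described, note the canonical-orientation flag and the explicit partition strategy:

```scala
import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

// Load the edges in canonical orientation (srcId < dstId) and partition
// the graph, both of which triangleCount expects.
val graph = GraphLoader
  .edgeListFile(sc, "data/graphx/followers.txt", canonicalOrientation = true)
  .partitionBy(PartitionStrategy.RandomVertexCut)

val triCounts = graph.triangleCount().vertices

val users = sc.textFile("data/graphx/users.txt").map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}
val triCountByUsername = users.join(triCounts).map {
  case (id, (username, count)) => (username, count)
}
println(triCountByUsername.collect().mkString("\n"))
```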
6586 04:42:48,806 --> 04:42:50,700 So again, this is the same. 6587 04:42:50,700 --> 04:42:52,844 Let us quickly go ahead and execute 6588 04:42:52,844 --> 04:42:54,800 this triangle counting example. 6589 04:42:54,800 --> 04:42:56,338 So let me copy this. 6590 04:42:56,500 --> 04:42:58,300 I'll go back to the terminal. 6591 04:42:58,400 --> 04:43:02,300 I'll limit as example 3 and change the class name. 6592 04:43:04,134 --> 04:43:05,365 And I hit enter. 6593 04:43:14,100 --> 04:43:16,900 So now you can see the triangle associated 6594 04:43:16,900 --> 04:43:20,100 with Justin Bieber 0 then Barack Obama is one 6595 04:43:20,100 --> 04:43:21,600 with odors kids one 6596 04:43:21,661 --> 04:43:23,200 and with Jerry sick. 6597 04:43:23,200 --> 04:43:24,100 It's fun. 6598 04:43:24,300 --> 04:43:27,800 So for better understanding I would recommend you to go ahead 6599 04:43:27,800 --> 04:43:30,136 and take this followers or txt. 6600 04:43:30,136 --> 04:43:33,000 And you can create a graph by yourself. 6601 04:43:33,000 --> 04:43:36,227 And then you can attach these users names with them 6602 04:43:36,227 --> 04:43:38,100 and then you will get an idea 6603 04:43:38,100 --> 04:43:41,700 about why it is giving the number as 1 or 0. 6604 04:43:41,700 --> 04:43:44,065 So again the graph which is connecting. 6605 04:43:44,065 --> 04:43:45,000 In two and four 6606 04:43:45,000 --> 04:43:47,600 is disconnect and it is not completing any triangles. 6607 04:43:47,600 --> 04:43:52,900 So the value of these 3 are 0 and next year's second graph 6608 04:43:52,900 --> 04:43:54,600 which is connecting 6609 04:43:54,600 --> 04:43:59,400 your vertex 3 6 & 7 is completing one triangle. 6610 04:43:59,400 --> 04:44:01,323 So this is the reason why 6611 04:44:01,323 --> 04:44:05,300 these three vertices have values one now. 6612 04:44:05,400 --> 04:44:06,952 Let me quickly go back. 6613 04:44:06,952 --> 04:44:07,875 So now I hope 6614 04:44:07,875 --> 04:44:11,000 that you guys are clear with all the concepts 6615 04:44:11,000 --> 04:44:14,011 of graph operators then graph algorithms. 6616 04:44:14,011 --> 04:44:17,400 Eames so now is the right time and let us look 6617 04:44:17,400 --> 04:44:19,200 at a spa Graphics demo 6618 04:44:19,300 --> 04:44:20,838 where we'll go ahead 6619 04:44:20,838 --> 04:44:24,300 and we'll try to analyze the force go by data. 6620 04:44:24,800 --> 04:44:27,800 So let me quickly go back to my VM. 6621 04:44:28,000 --> 04:44:29,699 So let me first show you the website 6622 04:44:29,699 --> 04:44:32,500 where you can go ahead and download the Fords go by data. 6623 04:44:38,600 --> 04:44:40,350 So over here you can go 6624 04:44:40,350 --> 04:44:43,700 to download the fort bike strip history data. 6625 04:44:46,480 --> 04:44:51,019 So you can go ahead and download this 2017 Ford's trip data. 6626 04:44:51,100 --> 04:44:53,000 So I have already downloaded it. 6627 04:44:55,300 --> 04:44:56,696 So to avoid the typos, 6628 04:44:56,696 --> 04:44:59,300 I have already written all the commands so 6629 04:44:59,300 --> 04:45:07,100 first let me go ahead and start the spark shell So I'm inside 6630 04:45:07,100 --> 04:45:09,700 these Park shell now. 6631 04:45:09,700 --> 04:45:13,300 Let me first import graphics and Spa body. 6632 04:45:15,800 --> 04:45:19,200 So I've successfully imported graphics and Spark rdd. 6633 04:45:20,180 --> 04:45:23,719 Now, let me create a spark SQL context as well. 6634 04:45:25,100 --> 04:45:28,900 So I have successfully created this park SQL context. 
6635 04:45:28,900 --> 04:45:31,520 So this is basically for running SQL queries 6636 04:45:31,520 --> 04:45:32,800 over the data frames. 6637 04:45:34,100 --> 04:45:37,176 Now, let me go ahead and import the data. 6638 04:45:37,826 --> 04:45:40,673 So I'm loading the data in data frame. 6639 04:45:40,800 --> 04:45:43,623 So the format of file is CSV, 6640 04:45:43,623 --> 04:45:46,853 then an option the header is already added. 6641 04:45:46,853 --> 04:45:48,700 So that's why it's true. 6642 04:45:48,800 --> 04:45:51,600 Then it will automatically infer this schema 6643 04:45:51,600 --> 04:45:53,332 and then in the load parameter, 6644 04:45:53,332 --> 04:45:55,400 I have specified the path of the file. 6645 04:45:55,400 --> 04:45:57,100 So I'll quickly hit enter. 6646 04:45:59,100 --> 04:46:02,500 So the data is loaded in the data frame to check. 6647 04:46:02,500 --> 04:46:07,000 I'll use d f dot count so it will give me the count. 6648 04:46:09,900 --> 04:46:16,553 So you can see it has 5 lakhs 19 2007 Red Rose now. 6649 04:46:16,553 --> 04:46:20,092 Let me click go back and I'll print the schema. 6650 04:46:21,400 --> 04:46:25,010 So this is the schema the duration in second, 6651 04:46:25,010 --> 04:46:27,625 then we have the start time end time. 6652 04:46:27,625 --> 04:46:29,876 Then you have start station ID. 6653 04:46:29,876 --> 04:46:32,200 Then you have start station name. 6654 04:46:32,300 --> 04:46:35,761 Then you have start station latitude longitude 6655 04:46:35,761 --> 04:46:37,207 then end station ID 6656 04:46:37,207 --> 04:46:40,360 and station name then end station latitude 6657 04:46:40,360 --> 04:46:42,007 and station longitude. 6658 04:46:42,007 --> 04:46:46,500 Then your bike ID user type then the birth year of the member 6659 04:46:46,500 --> 04:46:48,650 and the gender of the member now, 6660 04:46:48,650 --> 04:46:50,800 I'm trying to create a data frame 6661 04:46:50,800 --> 04:46:52,306 that is Gas stations 6662 04:46:52,306 --> 04:46:56,300 so it will only create the station ID and station name 6663 04:46:56,300 --> 04:46:58,607 which I'll be using as vertex. 6664 04:46:58,800 --> 04:47:02,000 So here I am trying to create a data frame 6665 04:47:02,000 --> 04:47:03,500 with the name of just stations 6666 04:47:03,658 --> 04:47:07,120 where I am just selecting the start station ID 6667 04:47:07,120 --> 04:47:09,600 and I'm casting it as float 6668 04:47:09,600 --> 04:47:12,400 and then I'm selecting the start station name 6669 04:47:12,400 --> 04:47:15,400 and then I'm using the distinct function to only 6670 04:47:15,400 --> 04:47:17,169 keep the unique values. 6671 04:47:17,169 --> 04:47:19,864 So I quickly go ahead and hit enter. 6672 04:47:20,100 --> 04:47:21,600 So again, let me go 6673 04:47:21,600 --> 04:47:27,000 ahead and use this just stations and I will print the schema. 6674 04:47:28,300 --> 04:47:31,531 So you can see there is station ID, 6675 04:47:31,531 --> 04:47:34,000 and then there is start station name. 6676 04:47:34,569 --> 04:47:36,800 It contains the unique values 6677 04:47:36,800 --> 04:47:40,600 of stations in this just station data frame. 6678 04:47:40,800 --> 04:47:41,735 So now again, 6679 04:47:41,735 --> 04:47:44,900 I am taking this stations where I'm selecting 6680 04:47:44,900 --> 04:47:47,971 these thought station ID and and station ID. 
6681 04:47:47,971 --> 04:47:49,900 Then I am using re distinct 6682 04:47:49,900 --> 04:47:52,700 which will again give me the unique values 6683 04:47:52,700 --> 04:47:54,600 and I'm using this flat map 6684 04:47:54,600 --> 04:47:56,200 where I am specifying 6685 04:47:56,200 --> 04:47:59,700 the iterables where we are taking the x0 6686 04:47:59,700 --> 04:48:01,700 that is your start station ID, 6687 04:48:01,700 --> 04:48:04,405 and I am taking x 1 which is your ends. 6688 04:48:04,405 --> 04:48:05,700 An ID and then again, 6689 04:48:05,700 --> 04:48:07,800 I'm applying this distinct function 6690 04:48:07,800 --> 04:48:12,200 that it will keep only the unique values and then 6691 04:48:12,400 --> 04:48:14,600 at last we have to d f function 6692 04:48:14,600 --> 04:48:16,619 which will convert it to data frame. 6693 04:48:16,619 --> 04:48:19,100 So let me quickly go ahead and execute this. 6694 04:48:19,500 --> 04:48:21,376 So I am printing this schema. 6695 04:48:21,376 --> 04:48:23,576 So as you can see it has one column 6696 04:48:23,576 --> 04:48:26,100 that is value and it has data type long. 6697 04:48:26,100 --> 04:48:29,715 So I have taken all the start and end station ID 6698 04:48:29,715 --> 04:48:31,561 and using this flat map. 6699 04:48:31,561 --> 04:48:34,200 I have retreated over all the start. 6700 04:48:34,200 --> 04:48:37,705 And and station ID and then using the distinct function 6701 04:48:37,705 --> 04:48:41,600 and taking the unique values and converting it to data frames 6702 04:48:41,600 --> 04:48:44,800 so I can use the stations and using the station. 6703 04:48:44,800 --> 04:48:49,000 I will basically keep each of the stations in a Vertex. 6704 04:48:49,000 --> 04:48:52,500 So this is the reason why I'm taking the stations 6705 04:48:52,500 --> 04:48:55,300 or you can say I am taking the unique stations 6706 04:48:55,300 --> 04:48:58,107 from the start station ID and station ID 6707 04:48:58,107 --> 04:48:59,691 so that I can go ahead 6708 04:48:59,691 --> 04:49:02,500 and I can define vertex as the stations. 6709 04:49:03,100 --> 04:49:06,400 So now we are creating our set of vertices 6710 04:49:06,400 --> 04:49:09,804 and attaching a bit of metadata to each one of them 6711 04:49:09,804 --> 04:49:12,800 which in our case is the name of the station. 6712 04:49:12,800 --> 04:49:16,035 So as you can see we are creating this station vertices, 6713 04:49:16,035 --> 04:49:18,679 which is again an rdd with vertex ID and strength. 6714 04:49:18,679 --> 04:49:21,700 So we are using the station's which we have just created. 6715 04:49:21,700 --> 04:49:24,500 We are joining it with just stations 6716 04:49:24,500 --> 04:49:27,100 at the station value should be equal 6717 04:49:27,100 --> 04:49:29,300 to just station station ID. 6718 04:49:29,600 --> 04:49:32,400 So as we have created stations, 6719 04:49:32,400 --> 04:49:35,200 And just station so we are joining it. 6720 04:49:36,600 --> 04:49:39,061 And then selecting the station ID 6721 04:49:39,061 --> 04:49:43,000 and start station name then we are mapping row 0. 6722 04:49:44,700 --> 04:49:48,600 And Row 1 so your row 0 will basically be 6723 04:49:48,600 --> 04:49:51,088 your vertex ID and Row 1 will be the string. 6724 04:49:51,088 --> 04:49:55,100 That is the name of your station to let me quickly go ahead 6725 04:49:55,100 --> 04:49:56,300 and execute this. 6726 04:49:57,357 --> 04:50:01,742 So let us quickly print this using collect forage println. 
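Putting the steps so far into one rough sketch: the CSV path is a placeholder and the column names are assumptions based on the schema printed above, so adjust them to whatever printSchema actually shows.

```scala
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import sqlContext.implicits._

// Load the trip CSV into a DataFrame.
val df = sqlContext.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/path/to/2017-fordgobike-tripdata.csv")

// Unique (station id, station name) pairs from the start columns.
val justStations = df
  .selectExpr("float(start_station_id) as station_id", "start_station_name")
  .distinct()

// Every station id that occurs as either the start or the end of a trip.
val stations = df
  .select("start_station_id", "end_station_id")
  .rdd
  .distinct()
  .flatMap(x => Iterable(x(0).asInstanceOf[Number].longValue, x(1).asInstanceOf[Number].longValue))
  .distinct()
  .toDF()

// Vertices: (stationId, stationName).
val stationVertices: RDD[(VertexId, String)] = stations
  .join(justStations, stations("value") === justStations("station_id"))
  .select("station_id", "start_station_name")
  .rdd
  .map(row => (row(0).asInstanceOf[Number].longValue, row(1).asInstanceOf[String]))
```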
6727 04:50:19,500 --> 04:50:20,366 So over here, 6728 04:50:20,366 --> 04:50:23,900 we are basically attaching the edges or you can see we 6729 04:50:23,900 --> 04:50:27,500 are creating the trip edges to all our individual rights 6730 04:50:27,500 --> 04:50:29,900 and then we'll get the station values 6731 04:50:30,350 --> 04:50:33,350 and then we'll add a dummy value of one. 6732 04:50:33,800 --> 04:50:34,900 So as you can see 6733 04:50:34,900 --> 04:50:37,200 that I am selecting the start station and 6734 04:50:37,200 --> 04:50:38,600 and station from the DF 6735 04:50:38,600 --> 04:50:41,300 which is the first data frame which we have loaded 6736 04:50:41,300 --> 04:50:46,200 and then I am mapping it to row 0 + Row 1, 6737 04:50:46,400 --> 04:50:49,000 which is your source and destination. 6738 04:50:49,100 --> 04:50:53,500 And then and then I'm attaching a value one to each one of them. 6739 04:50:53,600 --> 04:50:55,000 So I'll hit enter. 6740 04:50:57,500 --> 04:51:00,900 Now, let me quickly go ahead and print this station edges. 6741 04:51:07,500 --> 04:51:10,300 So just taking the source ID of the vertex 6742 04:51:10,300 --> 04:51:12,182 and destination ID of the vertex 6743 04:51:12,182 --> 04:51:14,800 or you can say so station ID or vertex station ID 6744 04:51:14,800 --> 04:51:17,900 and it is attaching value one to each one of them. 6745 04:51:17,900 --> 04:51:20,700 So now you can go ahead and build your graph. 6746 04:51:20,700 --> 04:51:23,854 But again as we discuss that we need a default station 6747 04:51:23,854 --> 04:51:25,700 so you can have some situations 6748 04:51:25,700 --> 04:51:29,033 where your edges might be indicating some vertices, 6749 04:51:29,033 --> 04:51:31,500 but that vertices might not be present 6750 04:51:31,500 --> 04:51:33,107 in your vertex re D. 6751 04:51:33,107 --> 04:51:34,764 So for that situation, 6752 04:51:34,764 --> 04:51:37,400 we need to create a default station. 6753 04:51:37,400 --> 04:51:40,651 So I created a default station as missing station. 6754 04:51:40,651 --> 04:51:42,100 So now we are all set. 6755 04:51:42,100 --> 04:51:44,400 We can go ahead and create the graph. 6756 04:51:44,400 --> 04:51:46,700 So the name of the graph is station graph. 6757 04:51:46,700 --> 04:51:49,000 Then the vertices are stationed vertices 6758 04:51:49,000 --> 04:51:50,485 which we have created 6759 04:51:50,485 --> 04:51:54,247 which basically contains the station ID and station name 6760 04:51:54,247 --> 04:51:56,300 and then we have station edges 6761 04:51:56,300 --> 04:51:58,600 and at last we have default station. 6762 04:51:58,600 --> 04:52:01,500 So let me quickly go ahead and execute this. 6763 04:52:03,100 --> 04:52:06,500 So now I need to cash this graph for faster access. 6764 04:52:06,500 --> 04:52:08,700 So I'll use cash function. 6765 04:52:09,500 --> 04:52:13,300 So let us quickly go ahead and check the number of vertices. 6766 04:52:24,700 --> 04:52:28,600 So these are the number of vertices again, 6767 04:52:28,900 --> 04:52:31,600 we can check the number of edges as well. 6768 04:52:35,700 --> 04:52:37,300 So these are the number of edges. 6769 04:52:38,405 --> 04:52:40,400 And to get a sanity check. 6770 04:52:40,400 --> 04:52:43,500 So let's go ahead and check the number of records 6771 04:52:43,500 --> 04:52:45,500 that are present in the data frame. 
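Continuing the same sketch, the graph construction itself looks roughly like this:

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Edges: one edge per trip from start station to end station, with a
// dummy attribute of 1 that can later be summed into trip counts.
val stationEdges: RDD[Edge[Long]] = df
  .select("start_station_id", "end_station_id")
  .rdd
  .map(row => Edge(
    row(0).asInstanceOf[Number].longValue,
    row(1).asInstanceOf[Number].longValue,
    1L))

// Default attribute for any vertex referenced by an edge but missing
// from stationVertices.
val defaultStation = "Missing Station"

val stationGraph = Graph(stationVertices, stationEdges, defaultStation)
stationGraph.cache()   // cache for faster repeated access

stationGraph.numVertices
stationGraph.numEdges
df.count()             // sanity check: should equal numEdges
```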
6772 04:52:48,000 --> 04:52:50,900 So as you can see that the number of edges 6773 04:52:50,900 --> 04:52:55,100 in our graph and the count in our data frame is similar, 6774 04:52:55,100 --> 04:52:56,900 or you can see the same. 6775 04:52:56,900 --> 04:53:00,702 So now let's go ahead and run page rank on our data 6776 04:53:00,702 --> 04:53:04,200 so we can either run a set number of iterations 6777 04:53:04,200 --> 04:53:06,700 or we can run it until the convergence. 6778 04:53:06,700 --> 04:53:10,400 So in my case, I'll run it till convergence. 6779 04:53:11,700 --> 04:53:15,000 So it's rank then station graph then page rank. 6780 04:53:15,000 --> 04:53:17,133 So has specified the double value 6781 04:53:17,133 --> 04:53:21,000 so it will Tell convergence so let's wait for some time. 6782 04:53:51,600 --> 04:53:55,400 So now that we have executed the pagerank algorithm. 6783 04:53:55,700 --> 04:53:57,300 So we got the ranks 6784 04:53:57,300 --> 04:53:59,700 which are attached to each vertices. 6785 04:54:00,100 --> 04:54:03,700 So now let us quickly go ahead and look at the ranks. 6786 04:54:03,700 --> 04:54:06,601 So we are joining ranks with station vertices 6787 04:54:06,601 --> 04:54:09,675 and then we have sorting it in descending values 6788 04:54:09,675 --> 04:54:11,900 and we are taking the first 10 rows 6789 04:54:11,900 --> 04:54:13,500 and then we are printing them. 6790 04:54:13,500 --> 04:54:16,700 So let's quickly go ahead and hit enter. 6791 04:54:21,700 --> 04:54:26,000 So you can see these are the top 10 stations which have 6792 04:54:26,000 --> 04:54:27,800 the most pagerank values 6793 04:54:27,800 --> 04:54:30,800 so you can say it has more number of incoming trips. 6794 04:54:30,800 --> 04:54:32,270 Now one question would be 6795 04:54:32,270 --> 04:54:35,000 what are the most common destinations in the data set 6796 04:54:35,000 --> 04:54:36,598 from location to location 6797 04:54:36,598 --> 04:54:40,500 so we can do this by performing a grouping operator and adding 6798 04:54:40,500 --> 04:54:42,218 The Edge counts together. 6799 04:54:42,218 --> 04:54:46,000 So basically this will give a new graph except each Edge 6800 04:54:46,000 --> 04:54:50,300 will now be the sum of all the semantically same edges. 6801 04:54:51,500 --> 04:54:53,700 So again, we are taking the station graph. 6802 04:54:53,700 --> 04:54:56,800 We are performing Group by edges H1 and H2. 6803 04:54:56,800 --> 04:55:00,197 So we are basically grouping edges H1 and H2. 6804 04:55:00,200 --> 04:55:01,629 So we are aggregating them. 6805 04:55:01,629 --> 04:55:03,100 Then we are using triplet 6806 04:55:03,100 --> 04:55:06,099 and then we are sorting them in descending order again. 6807 04:55:06,099 --> 04:55:08,200 And then we are printing the triplets 6808 04:55:08,200 --> 04:55:10,908 from The Source vertex and the number of trips 6809 04:55:10,908 --> 04:55:13,864 and then we are taking the destination attribute 6810 04:55:13,864 --> 04:55:15,500 or you can see destination 6811 04:55:15,500 --> 04:55:18,100 Vertex or you can see destination station. 6812 04:55:26,526 --> 04:55:28,373 So you can see there are 6813 04:55:28,500 --> 04:55:32,300 1933 trips from San Francisco Ferry Building 6814 04:55:32,300 --> 04:55:34,100 to the station then again, 6815 04:55:34,100 --> 04:55:36,700 you can see there are fourteen hundred and eleven 6816 04:55:36,700 --> 04:55:39,900 trips from San Francisco to this location. 
6817 04:55:39,900 --> 04:55:42,200 Then there are 1,025 trips 6818 04:55:42,200 --> 04:55:45,300 from this station to San Francisco, 6819 04:55:45,500 --> 04:55:49,100 and so on. So now we have got a directed graph, 6820 04:55:49,100 --> 04:55:50,885 that means our trips are directional, 6821 04:55:50,885 --> 04:55:52,400 from one location to another. 6822 04:55:52,600 --> 04:55:55,787 So now we can go ahead and find the number of trips 6823 04:55:55,787 --> 04:55:57,725 that went to a specific station 6824 04:55:57,725 --> 04:56:00,100 and that leave from a specific station. 6825 04:56:00,100 --> 04:56:01,806 So basically we are trying 6826 04:56:01,806 --> 04:56:04,300 to find the inbound and outbound values, 6827 04:56:04,300 --> 04:56:07,829 or you can say we are trying to find the in-degree and out-degree 6828 04:56:07,829 --> 04:56:08,723 of the stations. 6829 04:56:08,723 --> 04:56:12,300 So let us first calculate the in-degrees using the station graph 6830 04:56:12,300 --> 04:56:14,364 and I am using the inDegrees operator. 6831 04:56:14,364 --> 04:56:17,298 Then I'm joining it with the station vertices 6832 04:56:17,298 --> 04:56:20,435 and then I'm sorting it again in descending order 6833 04:56:20,435 --> 04:56:22,852 and then I'm taking the top 10 values. 6834 04:56:22,852 --> 04:56:25,400 So let's quickly go ahead and hit enter. 6835 04:56:30,900 --> 04:56:34,815 So these are the top 10 stations and you can see the in-degrees. 6836 04:56:34,815 --> 04:56:36,600 So there are these many trips 6837 04:56:36,600 --> 04:56:38,797 which are coming into these stations. 6838 04:56:38,797 --> 04:56:39,651 Now similarly, 6839 04:56:39,651 --> 04:56:41,300 we can find the out-degree. 6840 04:56:48,200 --> 04:56:51,400 Now again, you can see the out-degrees as well. 6841 04:56:51,400 --> 04:56:54,896 So these are the stations and these are the out-degrees. 6842 04:56:54,896 --> 04:56:58,439 So again, you can go ahead and perform some more operations 6843 04:56:58,439 --> 04:56:59,400 over this graph. 6844 04:56:59,400 --> 04:57:01,635 So you can go ahead and find the station 6845 04:57:01,635 --> 04:57:03,700 which has the most number of trips coming in, 6846 04:57:03,700 --> 04:57:07,241 that is, the most number of people coming into that station 6847 04:57:07,241 --> 04:57:09,758 but fewer people leaving that station, 6848 04:57:09,758 --> 04:57:13,320 and again, on the contrary, you can find out the stations 6849 04:57:13,320 --> 04:57:15,538 where there are a greater number of edges, 6850 04:57:15,538 --> 04:57:18,240 or you can say trips, leaving those stations 6851 04:57:18,240 --> 04:57:19,848 but there are a smaller number 6852 04:57:19,848 --> 04:57:22,100 of trips coming into those stations. 6853 04:57:22,100 --> 04:57:25,800 So I guess you guys are now clear with Spark GraphX. 6854 04:57:25,800 --> 04:57:27,810 Then we discussed the different types 6855 04:57:27,810 --> 04:57:29,398 of graphs, then moving ahead 6856 04:57:29,398 --> 04:57:31,100 we discussed the features of GraphX. 6857 04:57:31,100 --> 04:57:33,675 Then we discussed something about the property graph. 6858 04:57:33,675 --> 04:57:35,500 We understood what a property graph is, 6859 04:57:35,500 --> 04:57:38,200 how you can create vertices, how you can create edges, 6860 04:57:38,200 --> 04:57:40,800 how to use VertexRDD and EdgeRDD. 6861 04:57:40,800 --> 04:57:44,500 Then we looked at some of the important vertex operations 6862 04:57:44,500 --> 04:57:48,500 and at last we understood some of the graph algorithms.
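To gather the analysis steps of this demo in one place, here is a rough sketch continuing from the graph built above:

```scala
import org.apache.spark.graphx.PartitionStrategy

// PageRank until convergence, then attach station names to the top ranks.
val ranks = stationGraph.pageRank(0.0001).vertices
ranks
  .join(stationVertices)
  .sortBy(_._2._1, ascending = false)
  .take(10)
  .foreach { case (_, (rank, name)) => println(s"$name has rank $rank") }

// Most common trips: merge semantically identical edges and sum their
// counts (groupEdges expects the graph to be partitioned first).
stationGraph
  .partitionBy(PartitionStrategy.RandomVertexCut)
  .groupEdges((a, b) => a + b)
  .triplets
  .sortBy(_.attr, ascending = false)
  .take(10)
  .foreach(t => println(s"${t.attr} trips from ${t.srcAttr} to ${t.dstAttr}"))

// Stations with the most incoming and outgoing trips.
stationGraph.inDegrees.join(stationVertices)
  .sortBy(_._2._1, ascending = false).take(10).foreach(println)
stationGraph.outDegrees.join(stationVertices)
  .sortBy(_._2._1, ascending = false).take(10).foreach(println)
```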
6863 04:57:48,500 --> 04:57:51,349 So I guess now you guys are clear about 6864 04:57:51,349 --> 04:57:53,600 how to work with Spark GraphX. 6865 04:57:58,300 --> 04:58:01,300 Today's video is on Hadoop versus Spark. 6866 04:58:01,400 --> 04:58:04,683 Now as we know, organizations from different domains 6867 04:58:04,683 --> 04:58:07,400 are investing in big data analytics today. 6868 04:58:07,400 --> 04:58:10,400 They're analyzing large data sets to uncover 6869 04:58:10,400 --> 04:58:11,730 all hidden patterns, 6870 04:58:11,730 --> 04:58:15,510 unknown correlations, market trends, customer preferences 6871 04:58:15,510 --> 04:58:18,100 and other useful business information. 6872 04:58:18,100 --> 04:58:20,800 And all of these findings are helping organizations 6873 04:58:20,800 --> 04:58:24,100 in more effective marketing, new revenue opportunities 6874 04:58:24,100 --> 04:58:25,973 and better customer service, 6875 04:58:25,973 --> 04:58:29,241 and they're trying to get competitive advantages 6876 04:58:29,241 --> 04:58:30,947 over rival organizations 6877 04:58:30,947 --> 04:58:33,920 and other business benefits. And Apache Spark 6878 04:58:33,920 --> 04:58:38,000 and Hadoop are two of the most prominent big data frameworks, 6879 04:58:38,000 --> 04:58:41,289 and I see people often comparing these two technologies, 6880 04:58:41,289 --> 04:58:44,700 and that is what exactly we're going to do in this video. 6881 04:58:44,700 --> 04:58:48,100 Now, we'll compare these two big data frameworks based 6882 04:58:48,100 --> 04:58:49,800 on different parameters, 6883 04:58:49,800 --> 04:58:52,487 but first it is important to get an overview 6884 04:58:52,487 --> 04:58:53,800 about what is Hadoop 6885 04:58:53,800 --> 04:58:55,600 and what is Apache Spark. 6886 04:58:55,600 --> 04:58:58,900 So let me just tell you a little bit about Hadoop. Hadoop is 6887 04:58:58,900 --> 04:59:00,200 a framework to store 6888 04:59:00,200 --> 04:59:04,200 and process large sets of data across computer clusters, 6889 04:59:04,200 --> 04:59:07,100 and Hadoop can scale from a single computer system 6890 04:59:07,100 --> 04:59:09,710 up to thousands of commodity systems 6891 04:59:09,710 --> 04:59:11,500 that offer local storage 6892 04:59:11,500 --> 04:59:14,801 and compute power, and Hadoop is composed of modules 6893 04:59:14,801 --> 04:59:18,500 that work together to create the entire Hadoop framework. 6894 04:59:18,500 --> 04:59:20,557 These are some of the components 6895 04:59:20,557 --> 04:59:23,254 that we have in the entire Hadoop framework 6896 04:59:23,254 --> 04:59:24,800 or the Hadoop ecosystem. 6897 04:59:24,800 --> 04:59:27,500 For example, let me tell you about HDFS, 6898 04:59:27,500 --> 04:59:30,856 which is the storage unit of Hadoop, and YARN, which is 6899 04:59:30,856 --> 04:59:32,500 for resource management. 6900 04:59:32,500 --> 04:59:34,600 There are different analytical tools 6901 04:59:34,600 --> 04:59:39,500 like Apache Hive and Pig, and NoSQL databases like Apache HBase. 6902 04:59:39,900 --> 04:59:40,900 Even Apache Spark 6903 04:59:40,900 --> 04:59:43,893 and Apache Storm fit in the Hadoop ecosystem 6904 04:59:43,893 --> 04:59:45,399 for processing big data 6905 04:59:45,399 --> 04:59:49,200 in real time. For ingesting data we have tools like Flume 6906 04:59:49,200 --> 04:59:52,082 and Sqoop. Flume is used to ingest unstructured data 6907 04:59:52,082 --> 04:59:53,600 or semi-structured data, 6908 04:59:53,600 --> 04:59:57,135 whereas Sqoop is used to ingest structured data into HDFS.
6909 04:59:57,135 --> 04:59:59,900 If you want to learn more about these tools, 6910 04:59:59,900 --> 05:00:01,470 you can go to Edureka's 6911 05:00:01,470 --> 05:00:04,000 YouTube channel and look for the Hadoop tutorial 6912 05:00:04,000 --> 05:00:06,600 where everything has been explained in detail. 6913 05:00:06,600 --> 05:00:08,171 Now, let's move to Spark. 6914 05:00:08,171 --> 05:00:12,100 Apache Spark is a lightning-fast cluster computing technology 6915 05:00:12,100 --> 05:00:14,400 that is designed for fast computation. 6916 05:00:14,400 --> 05:00:18,223 The main feature of Spark is its in-memory 6917 05:00:18,223 --> 05:00:19,400 cluster computing 6918 05:00:19,400 --> 05:00:23,482 that increases the processing speed of an application. Spark 6919 05:00:23,482 --> 05:00:27,100 performs similar operations to that of Hadoop modules, 6920 05:00:27,100 --> 05:00:30,365 but it uses in-memory processing and optimizes 6921 05:00:30,365 --> 05:00:33,791 the steps. The primary difference between MapReduce 6922 05:00:33,791 --> 05:00:35,400 in Hadoop and Spark is 6923 05:00:35,400 --> 05:00:38,500 that MapReduce uses persistent storage 6924 05:00:38,500 --> 05:00:42,100 and Spark uses resilient distributed data sets, 6925 05:00:42,100 --> 05:00:44,920 which are known as RDDs, which reside 6926 05:00:44,920 --> 05:00:48,458 in memory. The different components in Spark are: 6927 05:00:48,800 --> 05:00:52,000 the Spark Core. The Spark Core is the base engine 6928 05:00:52,000 --> 05:00:53,600 for large-scale parallel 6929 05:00:53,600 --> 05:00:57,463 and distributed data processing. Further, additional libraries 6930 05:00:57,463 --> 05:01:01,100 which are built on top of the core allow diverse workloads 6931 05:01:01,100 --> 05:01:02,381 for streaming, SQL 6932 05:01:02,381 --> 05:01:06,000 and machine learning. Spark Core is also responsible 6933 05:01:06,000 --> 05:01:09,500 for memory management and fault recovery, scheduling, 6934 05:01:09,500 --> 05:01:12,749 distributing and monitoring jobs on a cluster 6935 05:01:12,749 --> 05:01:16,000 and interacting with the storage systems as well. 6936 05:01:16,100 --> 05:01:16,649 Next up, 6937 05:01:16,649 --> 05:01:18,300 we have Spark Streaming. 6938 05:01:18,300 --> 05:01:20,906 Spark Streaming is the component of Spark 6939 05:01:20,906 --> 05:01:24,100 which is used to process real-time streaming data. 6940 05:01:24,100 --> 05:01:25,822 It enables high-throughput 6941 05:01:25,822 --> 05:01:29,600 and fault-tolerant stream processing of live data streams. 6942 05:01:29,600 --> 05:01:33,500 Then we have Spark SQL. Spark SQL is a new module in Spark 6943 05:01:33,500 --> 05:01:36,800 which integrates relational processing with Spark's 6944 05:01:36,800 --> 05:01:38,800 functional programming API. 6945 05:01:38,800 --> 05:01:41,700 It supports querying data either via SQL 6946 05:01:41,700 --> 05:01:44,000 or via the Hive query language. 6947 05:01:44,000 --> 05:01:46,381 For those of you familiar with RDBMS, 6948 05:01:46,381 --> 05:01:48,300 Spark SQL will be an easy 6949 05:01:48,300 --> 05:01:51,637 transition from your earlier tools, where you can extend 6950 05:01:51,637 --> 05:01:55,100 the boundaries of traditional relational data processing.
6951 05:01:55,200 --> 05:02:00,092 Next up is GraphX. GraphX is the Spark API for graphs 6952 05:02:00,092 --> 05:02:02,400 and graph-parallel computation 6953 05:02:02,400 --> 05:02:04,867 and thus it extends the Spark resilient 6954 05:02:04,867 --> 05:02:08,700 distributed data sets with a resilient distributed property 6955 05:02:08,700 --> 05:02:09,500 graph. 6956 05:02:09,900 --> 05:02:13,000 Next is Spark MLlib for machine learning. 6957 05:02:13,000 --> 05:02:16,500 MLlib stands for machine learning library. Spark 6958 05:02:16,500 --> 05:02:18,300 MLlib is used to perform machine 6959 05:02:18,400 --> 05:02:20,900 learning in Apache Spark. Now 6960 05:02:20,900 --> 05:02:24,200 since you've got an overview of both these two frameworks, 6961 05:02:24,200 --> 05:02:25,985 I believe that the ground 6962 05:02:25,985 --> 05:02:29,200 is all set to compare Apache Spark and Hadoop. 6963 05:02:29,200 --> 05:02:32,617 Let's move ahead and compare Apache Spark with Hadoop 6964 05:02:32,617 --> 05:02:36,100 on different parameters to understand their strengths. 6965 05:02:36,100 --> 05:02:38,887 We will be comparing these two frameworks 6966 05:02:38,887 --> 05:02:40,700 based on these parameters. 6967 05:02:40,700 --> 05:02:44,400 Let's start with performance first. Spark is fast 6968 05:02:44,400 --> 05:02:45,476 because it has 6969 05:02:45,476 --> 05:02:49,000 in-memory processing. It can also use disk for data 6970 05:02:49,000 --> 05:02:51,774 that doesn't fit into memory. Spark's 6971 05:02:51,774 --> 05:02:55,851 in-memory processing delivers near real-time analytics 6972 05:02:56,000 --> 05:02:57,771 and this makes Spark suitable 6973 05:02:57,771 --> 05:03:00,300 for credit card processing systems, machine 6974 05:03:00,300 --> 05:03:02,300 learning, security analysis 6975 05:03:02,300 --> 05:03:05,100 and processing data for IoT sensors. 6976 05:03:05,200 --> 05:03:07,700 Now, let's talk about Hadoop's performance. 6977 05:03:07,700 --> 05:03:10,700 Now Hadoop was originally designed to continuously 6978 05:03:10,700 --> 05:03:13,700 gather data from multiple sources without worrying 6979 05:03:13,700 --> 05:03:14,800 about the type of data 6980 05:03:14,800 --> 05:03:15,687 and storing it 6981 05:03:15,687 --> 05:03:18,544 across a distributed environment, and MapReduce 6982 05:03:18,544 --> 05:03:22,185 uses batch processing. MapReduce was never built for 6983 05:03:22,185 --> 05:03:24,108 real-time processing. The main idea 6984 05:03:24,108 --> 05:03:27,751 behind YARN is parallel processing over a distributed data 6985 05:03:27,751 --> 05:03:30,400 set. The problem with comparing the two is 6986 05:03:30,400 --> 05:03:33,400 that they have different ways of processing 6987 05:03:33,400 --> 05:03:37,400 and the idea behind the development is also divergent. 6988 05:03:37,700 --> 05:03:40,300 Next, ease of use. Spark comes 6989 05:03:40,300 --> 05:03:44,400 with user-friendly APIs for Scala, Java, Python 6990 05:03:44,400 --> 05:03:48,300 and Spark SQL. Spark SQL is very similar to SQL. 6991 05:03:48,600 --> 05:03:50,047 So it becomes easier 6992 05:03:50,047 --> 05:03:53,202 for SQL developers to learn it. Spark also 6993 05:03:53,202 --> 05:03:55,272 provides an interactive shell 6994 05:03:55,272 --> 05:03:58,700 for developers to query and perform other actions 6995 05:03:58,700 --> 05:04:00,800 and have immediate feedback. 6996 05:04:00,900 --> 05:04:02,762 Now, let's talk about Hadoop.
6997 05:04:02,762 --> 05:04:06,544 You can ingest data in Hadoop easily either by using shell 6998 05:04:06,544 --> 05:04:09,000 or integrating it with multiple tools, 6999 05:04:09,000 --> 05:04:10,353 like scoop and Flume 7000 05:04:10,353 --> 05:04:13,021 and yarn is just a processing framework 7001 05:04:13,021 --> 05:04:15,900 that can be integrated with multiple tools 7002 05:04:15,900 --> 05:04:18,200 like Hive and pig for Analytics. 7003 05:04:18,200 --> 05:04:20,353 I visit data warehousing component 7004 05:04:20,353 --> 05:04:22,381 which performs Reading Writing 7005 05:04:22,381 --> 05:04:26,058 and managing large data set in a distributed environment 7006 05:04:26,058 --> 05:04:29,100 using sql-like interface to conclude here. 7007 05:04:29,100 --> 05:04:31,700 Both of them have their own ways to make 7008 05:04:31,700 --> 05:04:33,500 themselves user-friendly. 7009 05:04:33,826 --> 05:04:36,365 Now, let's come to the cost Hadoop 7010 05:04:36,365 --> 05:04:39,903 and Spark are both Apache open source projects. 7011 05:04:40,000 --> 05:04:43,900 So there's no cost for the software cost is only associated 7012 05:04:43,900 --> 05:04:47,433 with the infrastructure both the products are designed 7013 05:04:47,433 --> 05:04:48,300 in such a way 7014 05:04:48,300 --> 05:04:50,800 that Can run on commodity Hardware 7015 05:04:50,800 --> 05:04:54,100 with low TCO or total cost of ownership. 7016 05:04:54,800 --> 05:04:56,895 Well now you might be wondering the ways 7017 05:04:56,895 --> 05:04:58,400 in which they are different. 7018 05:04:58,400 --> 05:05:02,117 They're all the same storage and processing in Hadoop is 7019 05:05:02,117 --> 05:05:05,700 disc-based and Hadoop uses standard amounts of memory. 7020 05:05:05,700 --> 05:05:06,717 So with Hadoop, 7021 05:05:06,717 --> 05:05:07,600 we need a lot 7022 05:05:07,600 --> 05:05:12,200 of disk space as well as faster transfer speed Hadoop 7023 05:05:12,200 --> 05:05:15,300 also requires multiple systems to distribute 7024 05:05:15,300 --> 05:05:17,000 the disk input output, 7025 05:05:17,000 --> 05:05:18,900 but in case of Apache spark 7026 05:05:18,900 --> 05:05:22,800 due to its in-memory processing it requires a lot of memory, 7027 05:05:22,800 --> 05:05:24,900 but it can deal with the standard. 7028 05:05:24,900 --> 05:05:28,400 Speed and amount of disk as disk space is a relatively 7029 05:05:28,400 --> 05:05:29,855 inexpensive commodity 7030 05:05:29,855 --> 05:05:32,985 and since Park does not use disk input output 7031 05:05:32,985 --> 05:05:34,591 for processing instead. 7032 05:05:34,591 --> 05:05:36,337 It requires large amounts 7033 05:05:36,337 --> 05:05:39,200 of RAM for executing everything in memory. 7034 05:05:39,300 --> 05:05:42,000 So spark systems incurs more cost 7035 05:05:42,300 --> 05:05:45,314 but yes one important thing to keep in mind is 7036 05:05:45,314 --> 05:05:49,400 that Sparks technology reduces the number of required systems, 7037 05:05:49,400 --> 05:05:52,900 it needs significantly fewer systems that cost more 7038 05:05:52,900 --> 05:05:55,991 so there will be a point at which spark reduces 7039 05:05:55,991 --> 05:05:57,134 the cost per unit 7040 05:05:57,134 --> 05:06:01,100 of the computation even with the additional RAM requirement. 
7041 05:06:01,200 --> 05:06:04,500 There are two types of data processing batch processing 7042 05:06:04,500 --> 05:06:08,344 and stream processing batch processing has been crucial 7043 05:06:08,344 --> 05:06:09,904 to the Big Data World 7044 05:06:09,904 --> 05:06:13,100 in simplest term batch processing is working 7045 05:06:13,100 --> 05:06:16,500 with high data volumes collected over a period 7046 05:06:16,500 --> 05:06:20,423 in batch processing data is first collected then processed 7047 05:06:20,423 --> 05:06:21,800 and then the results 7048 05:06:21,800 --> 05:06:24,624 are produced at a later stage and batch. 7049 05:06:24,624 --> 05:06:26,000 Is it efficient way 7050 05:06:26,000 --> 05:06:28,769 of processing large static data sets? 7051 05:06:28,800 --> 05:06:30,300 Generally we perform 7052 05:06:30,300 --> 05:06:34,300 batch processing for archived data sets for example, 7053 05:06:34,300 --> 05:06:36,887 calculating average income of a country 7054 05:06:36,887 --> 05:06:40,700 or evaluating the change in e-commerce in the last decade 7055 05:06:40,900 --> 05:06:45,000 now stream processing stream processing is the current Trend 7056 05:06:45,000 --> 05:06:48,258 in the Big Data World need of the hour is speed 7057 05:06:48,258 --> 05:06:50,100 and real-time information, 7058 05:06:50,100 --> 05:06:52,100 which is what stream processing 7059 05:06:52,100 --> 05:06:54,500 does batch processing does not allow. 7060 05:06:54,500 --> 05:06:57,700 Businesses to quickly react to changing business needs 7061 05:06:57,700 --> 05:07:01,900 and real-time stream processing has seen a rapid growth 7062 05:07:01,900 --> 05:07:05,188 in that demand now coming back to Apache Spark 7063 05:07:05,188 --> 05:07:09,420 versus Hadoop yarn is basically a batch processing framework 7064 05:07:09,420 --> 05:07:11,500 when we submit a job to yarn. 7065 05:07:11,500 --> 05:07:14,827 It reads data from the cluster performs operation 7066 05:07:14,827 --> 05:07:17,539 and write the results back to the cluster 7067 05:07:17,539 --> 05:07:19,100 and then it again reads 7068 05:07:19,100 --> 05:07:21,900 the updated data performs the next operation 7069 05:07:21,900 --> 05:07:25,500 and write the results back to the cluster and Off 7070 05:07:25,700 --> 05:07:29,678 on the other hand spark is designed to cover a wide range 7071 05:07:29,678 --> 05:07:31,100 of workloads such as 7072 05:07:31,100 --> 05:07:35,429 batch application iterative algorithms interactive queries 7073 05:07:35,429 --> 05:07:37,100 and streaming as well. 7074 05:07:37,400 --> 05:07:40,899 Now, let's come to fault tolerance Hadoop and Spark 7075 05:07:40,899 --> 05:07:43,000 both provides fault tolerance, 7076 05:07:43,000 --> 05:07:45,716 but have different approaches for hdfs 7077 05:07:45,716 --> 05:07:47,673 and yarn both Master demons. 7078 05:07:47,673 --> 05:07:49,700 That is the name node in hdfs 7079 05:07:49,700 --> 05:07:53,285 and resource manager in the arm checks the heartbeat 7080 05:07:53,285 --> 05:07:54,651 of the slave demons. 7081 05:07:54,651 --> 05:07:58,000 The slave demons are data nodes and node managers. 
7082 05:07:58,000 --> 05:08:00,100 So if any slave demon fails, 7083 05:08:00,100 --> 05:08:03,800 the master demons reschedules all pending an in-progress 7084 05:08:03,800 --> 05:08:07,900 operations to another slave now this method is effective 7085 05:08:07,900 --> 05:08:11,300 but it can significantly increase the completion time 7086 05:08:11,300 --> 05:08:14,000 for operations with single failure also 7087 05:08:14,000 --> 05:08:16,400 and as Hadoop uses commodity hardware 7088 05:08:16,400 --> 05:08:20,200 and another way in which hdfs ensures fault tolerance is 7089 05:08:20,200 --> 05:08:21,797 by replicating data. 7090 05:08:22,200 --> 05:08:24,200 Now let's talk about spark 7091 05:08:24,200 --> 05:08:29,094 as we discussed earlier rdds are resilient distributed data sets 7092 05:08:29,094 --> 05:08:31,710 are building blocks of Apache spark 7093 05:08:32,000 --> 05:08:34,100 and rdds are the one 7094 05:08:34,226 --> 05:08:37,073 which provide fault tolerant to spark. 7095 05:08:37,073 --> 05:08:38,000 They can refer 7096 05:08:38,000 --> 05:08:41,600 to any data set present and external storage system 7097 05:08:41,600 --> 05:08:45,200 like hdfs Edge base shared file system Etc. 7098 05:08:45,300 --> 05:08:47,100 They can also be operated 7099 05:08:47,100 --> 05:08:49,869 parallely rdds can persist a data set 7100 05:08:49,869 --> 05:08:52,100 and memory across operations. 7101 05:08:52,100 --> 05:08:56,061 It's which makes future actions 10 times much faster 7102 05:08:56,061 --> 05:08:58,731 if rdd is lost it will automatically 7103 05:08:58,731 --> 05:09:02,700 get recomputed by using the original Transformations. 7104 05:09:02,700 --> 05:09:06,720 And this is how spark provides fault tolerance and at the end. 7105 05:09:06,720 --> 05:09:08,500 Let us talk about security. 7106 05:09:08,500 --> 05:09:11,100 Well Hadoop has multiple ways of providing 7107 05:09:11,100 --> 05:09:14,806 security Hadoop supports Kerberos for authentication, 7108 05:09:14,806 --> 05:09:17,800 but it is difficult to handle nevertheless. 7109 05:09:17,800 --> 05:09:21,800 It also supports third-party vendors like ldap. 7110 05:09:22,000 --> 05:09:23,441 For authentication, 7111 05:09:23,441 --> 05:09:26,400 they also offer encryption hdfs supports 7112 05:09:26,400 --> 05:09:30,600 traditional file permissions as well as Access Control lists, 7113 05:09:30,600 --> 05:09:34,222 Hadoop provides service level authorization which guarantees 7114 05:09:34,222 --> 05:09:36,800 that clients have the right permissions for 7115 05:09:36,800 --> 05:09:40,400 job submission spark currently supports authentication 7116 05:09:40,400 --> 05:09:44,600 via a shared secret spark can integrate with hdfs 7117 05:09:44,600 --> 05:09:46,900 and it can use hdfs ACLS 7118 05:09:46,900 --> 05:09:50,652 or Access Control lists and file level permissions 7119 05:09:50,652 --> 05:09:52,024 sparking also run. 7120 05:09:52,024 --> 05:09:55,100 Yarn, leveraging the capability of Kerberos. 7121 05:09:55,100 --> 05:09:55,900 Now. 7122 05:09:55,900 --> 05:09:59,100 This was the comparison of these two Frameworks based 7123 05:09:59,100 --> 05:10:00,600 on these following parameters. 7124 05:10:00,600 --> 05:10:03,300 Now, let us understand use cases 7125 05:10:03,300 --> 05:10:06,900 where these Technologies fit best use cases were 7126 05:10:06,900 --> 05:10:07,900 Hadoop fits best. 
7127 05:10:07,900 --> 05:10:09,300 For example, 7128 05:10:09,300 --> 05:10:12,500 when you're analyzing archive data yarn 7129 05:10:12,500 --> 05:10:14,300 allows parallel processing 7130 05:10:14,300 --> 05:10:18,657 over huge amounts of data parts of data is processed parallely 7131 05:10:18,657 --> 05:10:21,300 and separately on different data nodes 7132 05:10:21,300 --> 05:10:25,825 and gathers result from each node manager in cases 7133 05:10:25,825 --> 05:10:29,000 when instant results are not required now 7134 05:10:29,000 --> 05:10:32,319 Hadoop mapreduce is a good and economical solution 7135 05:10:32,319 --> 05:10:33,700 for batch processing. 7136 05:10:33,700 --> 05:10:35,546 However, it is incapable 7137 05:10:35,900 --> 05:10:39,015 of processing data in real-time use cases 7138 05:10:39,015 --> 05:10:43,400 where Spark fits best in real-time Big Data analysis, 7139 05:10:43,400 --> 05:10:46,600 real-time data analysis means processing data 7140 05:10:46,600 --> 05:10:50,300 that is getting generated by the real-time event streams 7141 05:10:50,300 --> 05:10:53,000 coming in at the rate of Billions of events 7142 05:10:53,000 --> 05:10:55,000 per second the strength 7143 05:10:55,000 --> 05:10:58,277 of spark lies in its abilities to support streaming 7144 05:10:58,277 --> 05:11:00,900 of data along with distributed processing 7145 05:11:00,900 --> 05:11:04,700 and Spark claims to process data hundred times faster 7146 05:11:04,700 --> 05:11:09,100 than mapreduce while 10 times faster with the discs. 7147 05:11:09,100 --> 05:11:13,000 It is used in graph processing spark contains 7148 05:11:13,000 --> 05:11:15,720 a graph computation Library called Graphics 7149 05:11:15,720 --> 05:11:18,700 which simplifies our life in memory computation 7150 05:11:18,700 --> 05:11:22,100 along with inbuilt graph support improves the performance. 7151 05:11:22,100 --> 05:11:24,700 Performance of algorithm by a magnitude 7152 05:11:24,700 --> 05:11:28,516 of one or two degrees over traditional mapreduce programs. 7153 05:11:28,516 --> 05:11:32,200 It is also used in iterative machine learning algorithms 7154 05:11:32,200 --> 05:11:35,900 almost all machine learning algorithms work iteratively 7155 05:11:35,900 --> 05:11:39,039 as we have seen earlier iterative algorithms 7156 05:11:39,039 --> 05:11:41,389 involve input/output bottlenecks 7157 05:11:41,389 --> 05:11:44,400 in the mapreduce implementations mapreduce 7158 05:11:44,400 --> 05:11:46,400 uses coarse-grained tasks 7159 05:11:46,400 --> 05:11:47,600 that are too heavy 7160 05:11:47,600 --> 05:11:51,926 for iterative algorithms spark caches the intermediate data. 7161 05:11:51,926 --> 05:11:53,972 I said after each iteration 7162 05:11:53,972 --> 05:11:57,586 and runs multiple iterations on the cache data set 7163 05:11:57,586 --> 05:12:01,200 which eventually reduces the input output overhead 7164 05:12:01,200 --> 05:12:03,142 and executes the algorithm 7165 05:12:03,142 --> 05:12:07,400 faster in a fault-tolerant manner sad the end which one is 7166 05:12:07,400 --> 05:12:10,900 the best the answer to this is Hadoop 7167 05:12:10,900 --> 05:12:14,800 and Apache spark are not competing with one another. 7168 05:12:15,000 --> 05:12:18,100 In fact, they complement each other quite well, 7169 05:12:18,100 --> 05:12:20,745 how do brings huge data sets under control 7170 05:12:20,745 --> 05:12:22,100 by commodity systems? 
7171 05:12:22,100 --> 05:12:26,100 Systems and Spark provides a real-time in-memory processing 7172 05:12:26,100 --> 05:12:27,700 for those data sets. 7173 05:12:27,900 --> 05:12:30,600 When we combine Apache Sparks ability. 7174 05:12:30,600 --> 05:12:34,200 That is the high processing speed and advanced analytics 7175 05:12:34,200 --> 05:12:38,600 and multiple integration support with Hadoop slow cost operation 7176 05:12:38,600 --> 05:12:40,200 on commodity Hardware. 7177 05:12:40,200 --> 05:12:42,091 It gives the best results 7178 05:12:42,091 --> 05:12:45,800 Hadoop compliments Apache spark capabilities spark 7179 05:12:45,800 --> 05:12:48,737 not completely replace a do but the good news is 7180 05:12:48,737 --> 05:12:52,079 that the demand of spark is currently at an all-time. 7181 05:12:52,079 --> 05:12:55,849 Hi, if you want to learn more about the Hadoop ecosystem tools 7182 05:12:55,849 --> 05:12:56,900 and Apache spark, 7183 05:12:56,900 --> 05:12:59,106 don't forget to take a look at the editor 7184 05:12:59,106 --> 05:13:01,700 Acres YouTube channel and check out the big data 7185 05:13:01,700 --> 05:13:03,000 and Hadoop playlist. 7186 05:13:07,600 --> 05:13:09,776 Welcome everyone in today's session on 7187 05:13:09,776 --> 05:13:11,100 kafka's Park streaming. 7188 05:13:11,100 --> 05:13:14,400 So without any further delay, let's look at the agenda first. 7189 05:13:14,400 --> 05:13:16,128 We will start by understanding. 7190 05:13:16,128 --> 05:13:17,310 What is Apache Kafka? 7191 05:13:17,310 --> 05:13:19,900 Then we will discuss about different components 7192 05:13:19,900 --> 05:13:22,000 of Apache Kafka and it's architecture. 7193 05:13:22,000 --> 05:13:24,899 Further we will look at different Kafka commands. 7194 05:13:24,899 --> 05:13:25,546 After that. 7195 05:13:25,546 --> 05:13:27,994 We'll take a brief overview of Apache spark 7196 05:13:27,994 --> 05:13:30,700 and will understand different spark components. 7197 05:13:30,700 --> 05:13:31,201 Finally. 7198 05:13:31,201 --> 05:13:32,579 We'll look at the demo 7199 05:13:32,579 --> 05:13:35,900 where we will use spark streaming with Apache caf-pow. 7200 05:13:36,100 --> 05:13:37,600 Let's move to our first slide. 7201 05:13:37,900 --> 05:13:39,323 So in a real time scenario, 7202 05:13:39,323 --> 05:13:41,500 we have different systems of services, 7203 05:13:41,500 --> 05:13:43,000 which will be communicating 7204 05:13:43,000 --> 05:13:46,200 with each other and the data pipelines are the ones 7205 05:13:46,200 --> 05:13:48,800 which are establishing connection between two servers 7206 05:13:48,800 --> 05:13:49,953 or two systems. 7207 05:13:50,000 --> 05:13:52,100 Now, let's take an example of e-commerce. 7208 05:13:52,100 --> 05:13:55,255 Except site where it can have multiple servers at front end 7209 05:13:55,255 --> 05:13:58,161 like Weber application server for hosting application. 7210 05:13:58,161 --> 05:13:59,530 It can have a chat server 7211 05:13:59,530 --> 05:14:01,958 for the customers to provide chart facilities. 7212 05:14:01,958 --> 05:14:04,900 Then it can have a separate server for payment Etc. 7213 05:14:04,900 --> 05:14:08,145 Similarly organization can also have multiple server 7214 05:14:08,145 --> 05:14:09,100 at the back end 7215 05:14:09,100 --> 05:14:11,900 which will be receiving messages from different front end servers 7216 05:14:11,900 --> 05:14:13,200 based on the requirements. 
7217 05:14:13,400 --> 05:14:15,600 Now they can have a database server 7218 05:14:15,600 --> 05:14:17,700 which will be storing the records then they 7219 05:14:17,700 --> 05:14:20,100 can have security systems for user authentication 7220 05:14:20,100 --> 05:14:21,916 and authorization then they can have 7221 05:14:21,916 --> 05:14:23,368 Real-time monitoring server, 7222 05:14:23,368 --> 05:14:25,600 which is basically used for recommendations. 7223 05:14:25,600 --> 05:14:28,100 So all these data pipelines becomes complex 7224 05:14:28,100 --> 05:14:30,200 with the increase in number of systems 7225 05:14:30,200 --> 05:14:31,594 and adding a new system 7226 05:14:31,594 --> 05:14:33,900 or server requires more data pipelines, 7227 05:14:33,900 --> 05:14:35,900 which will again make the data flow 7228 05:14:35,900 --> 05:14:37,800 more complicated and complex. 7229 05:14:37,800 --> 05:14:38,662 Now managing. 7230 05:14:38,662 --> 05:14:41,646 These data pipelines also become very difficult 7231 05:14:41,646 --> 05:14:45,100 as each data pipeline has their own set of requirements 7232 05:14:45,100 --> 05:14:46,700 for example data pipelines, 7233 05:14:46,700 --> 05:14:49,700 which handles transaction should be more fault tolerant 7234 05:14:49,700 --> 05:14:51,700 and robust on the other hand. 7235 05:14:51,700 --> 05:14:54,372 Clickstream data pipeline can be more fragile. 7236 05:14:54,372 --> 05:14:55,784 So adding some pipelines 7237 05:14:55,784 --> 05:14:58,400 or removing some pipelines becomes more difficult 7238 05:14:58,400 --> 05:14:59,600 from the complex system. 7239 05:14:59,800 --> 05:15:02,800 So now I hope that you would have understood the problem 7240 05:15:02,800 --> 05:15:05,400 due to which misting systems was originated. 7241 05:15:05,400 --> 05:15:08,200 Let's move to the next slide and we'll understand 7242 05:15:08,200 --> 05:15:11,970 how Kafka solves this problem now measuring system reduces 7243 05:15:11,970 --> 05:15:13,835 the complexity of data pipelines 7244 05:15:13,835 --> 05:15:16,600 and makes the communication between systems more 7245 05:15:16,600 --> 05:15:19,780 simpler and manageable using messaging system. 7246 05:15:19,780 --> 05:15:22,500 Now, you can easily stablish remote Education 7247 05:15:22,500 --> 05:15:25,063 and send your data easily across Netbook. 7248 05:15:25,063 --> 05:15:26,536 Now a different systems 7249 05:15:26,536 --> 05:15:29,100 may use different platforms and languages 7250 05:15:29,200 --> 05:15:30,200 and messaging system 7251 05:15:30,200 --> 05:15:32,852 provides you a common Paradigm independent 7252 05:15:32,852 --> 05:15:34,560 of any platformer language. 7253 05:15:34,560 --> 05:15:36,900 So basically it decouples the platform 7254 05:15:36,900 --> 05:15:39,800 on which a front end server as well as your back-end server 7255 05:15:39,800 --> 05:15:43,600 is running you can also stablish a no synchronous communication 7256 05:15:43,600 --> 05:15:44,800 and send messages 7257 05:15:44,800 --> 05:15:47,000 so that the sender does not have to wait 7258 05:15:47,000 --> 05:15:49,000 for the receiver to process the messages. 7259 05:15:49,200 --> 05:15:51,300 Now one of the benefit of messaging system is 7260 05:15:51,300 --> 05:15:53,295 that you can Reliable communication. 7261 05:15:53,295 --> 05:15:56,600 So even when the receiver and network is not working properly. 
7262 05:15:56,600 --> 05:15:59,272 Your messages wouldn't get lost. Now talking 7263 05:15:59,272 --> 05:16:02,900 about Kafka. Kafka decouples the data pipelines 7264 05:16:02,900 --> 05:16:06,205 and solves the complexity problem. The applications 7265 05:16:06,205 --> 05:16:10,050 which are producing messages to Kafka are called producers 7266 05:16:10,050 --> 05:16:11,400 and the applications 7267 05:16:11,400 --> 05:16:13,600 which are consuming those messages from Kafka 7268 05:16:13,600 --> 05:16:14,706 are called consumers. 7269 05:16:14,706 --> 05:16:17,500 Now, as you can see in the image, the front end server, 7270 05:16:17,500 --> 05:16:20,200 then your application server 1, application server 7271 05:16:20,200 --> 05:16:21,500 2 and chat server 7272 05:16:21,500 --> 05:16:25,500 are producing messages to Kafka and these are called producers, 7273 05:16:25,500 --> 05:16:26,985 and your database server, 7274 05:16:26,985 --> 05:16:29,594 security systems, real-time monitoring server, 7275 05:16:29,594 --> 05:16:31,900 then other services and data warehouse, 7276 05:16:31,900 --> 05:16:34,300 these are basically consuming the messages 7277 05:16:34,300 --> 05:16:35,900 and are called consumers. 7278 05:16:36,100 --> 05:16:39,600 So your producer sends the message to Kafka 7279 05:16:39,700 --> 05:16:42,781 and then Kafka stores those messages, and consumers 7280 05:16:42,781 --> 05:16:45,000 who want those messages can subscribe 7281 05:16:45,000 --> 05:16:47,607 and receive them. Now you can also have 7282 05:16:47,607 --> 05:16:51,191 multiple subscribers to a single category of messages. 7283 05:16:51,191 --> 05:16:52,623 So your database server 7284 05:16:52,623 --> 05:16:56,400 and your security system can be consuming the same messages 7285 05:16:56,400 --> 05:16:58,600 which is produced by application server 7286 05:16:58,600 --> 05:17:01,423 1, and again adding a new consumer is very easy. 7287 05:17:01,423 --> 05:17:03,658 You can go ahead and add a new consumer 7288 05:17:03,658 --> 05:17:06,268 and just subscribe to the message categories 7289 05:17:06,268 --> 05:17:07,300 that is required. 7290 05:17:07,300 --> 05:17:10,700 So again, you can add a new consumer, say consumer one, 7291 05:17:10,700 --> 05:17:13,100 and you can again go ahead and subscribe 7292 05:17:13,100 --> 05:17:14,570 to the category of messages 7293 05:17:14,570 --> 05:17:17,100 which is produced by application server one. 7294 05:17:17,100 --> 05:17:19,100 So, let's quickly move ahead. 7295 05:17:19,100 --> 05:17:21,606 Let's talk about Apache Kafka. So Apache 7296 05:17:21,606 --> 05:17:24,853 Kafka is a distributed publish/subscribe messaging 7297 05:17:24,853 --> 05:17:28,300 system. Messaging traditionally has two models: queuing 7298 05:17:28,300 --> 05:17:32,173 and publish/subscribe. In a queue, a pool of consumers 7299 05:17:32,173 --> 05:17:33,769 may read from a server 7300 05:17:33,769 --> 05:17:36,540 and each record only goes to one of them, 7301 05:17:36,540 --> 05:17:38,600 whereas in publish/subscribe 7302 05:17:38,600 --> 05:17:41,313 the record is broadcasted to all consumers. 7303 05:17:41,313 --> 05:17:43,722 So multiple consumers can get the record. 7304 05:17:43,722 --> 05:17:45,700 The Kafka cluster is distributed 7305 05:17:45,700 --> 05:17:48,374 and has multiple machines running in parallel. 7306 05:17:48,374 --> 05:17:50,700 And this is the reason why Kafka is fast, 7307 05:17:50,700 --> 05:17:52,000 scalable and fault-tolerant.
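To make the producer side concrete, here is a minimal Scala sketch using the standard Kafka Java client; the kafka-clients dependency, the localhost:9092 broker and the sales topic are assumptions for illustration, not taken from the video.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    // Assumed broker address; any one live broker is enough to bootstrap
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)

    // Publish a few records to an assumed "sales" topic
    for (i <- 1 to 5) {
      producer.send(new ProducerRecord[String, String]("sales", s"order-$i", s"amount=${i * 100}"))
    }

    producer.close()   // flushes any pending records before exiting
  }
}
```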
7308 05:17:52,300 --> 05:17:53,309 Now let me tell you 7309 05:17:53,309 --> 05:17:55,700 that Kafka is developed at LinkedIn and later. 7310 05:17:55,700 --> 05:17:57,700 It became a part of Apache project. 7311 05:17:57,900 --> 05:18:01,100 Now, let us look at some of the important terminologies. 7312 05:18:01,100 --> 05:18:03,499 So we'll first start with topic. 7313 05:18:03,499 --> 05:18:05,081 So topic is a category 7314 05:18:05,081 --> 05:18:08,100 or feed name to which records are published 7315 05:18:08,100 --> 05:18:11,226 and Topic in Kafka are always multi subscriber. 7316 05:18:11,226 --> 05:18:14,800 That is a topic can have zero one or multiple consumers 7317 05:18:14,800 --> 05:18:16,600 that can subscribe the topic 7318 05:18:16,600 --> 05:18:19,300 and consume the data written to it for an example. 7319 05:18:19,300 --> 05:18:21,850 You can have serious record getting published in sales, too. 7320 05:18:21,850 --> 05:18:23,500 Topic you can have product records 7321 05:18:23,500 --> 05:18:25,600 which is getting published in product topic 7322 05:18:25,600 --> 05:18:28,965 and so on this will actually segregate your messages 7323 05:18:28,965 --> 05:18:31,756 and consumer will only subscribe the topic 7324 05:18:31,756 --> 05:18:35,500 that they need and again you consumer can also subscribe 7325 05:18:35,500 --> 05:18:37,300 to two or more topics. 7326 05:18:37,300 --> 05:18:40,100 Now, let's talk about partitions. 7327 05:18:40,100 --> 05:18:44,253 So Kafka topics are divided into a number of partitions 7328 05:18:44,253 --> 05:18:47,800 and partitions allow you to paralyze a topic 7329 05:18:47,800 --> 05:18:49,284 by splitting the data 7330 05:18:49,284 --> 05:18:51,846 in a particular topic across multiple. 7331 05:18:51,846 --> 05:18:55,200 Brokers which means each partition can be placed 7332 05:18:55,200 --> 05:18:58,869 on separate machine to allow multiple consumers to read 7333 05:18:58,869 --> 05:19:00,500 from a topic parallelly. 7334 05:19:00,500 --> 05:19:02,700 So in case of serious topic you can have 7335 05:19:02,700 --> 05:19:05,700 three partition partition 0 partition 1 and partition 7336 05:19:05,700 --> 05:19:09,400 to from where three consumers can read data parallel. 7337 05:19:09,400 --> 05:19:10,481 Now moving ahead. 7338 05:19:10,481 --> 05:19:12,200 Let's talk about producers. 7339 05:19:12,200 --> 05:19:13,845 So producers are the one 7340 05:19:13,845 --> 05:19:17,000 who publishes the data to topics of the choice. 7341 05:19:17,000 --> 05:19:18,600 Then you have consumers 7342 05:19:18,600 --> 05:19:21,786 so consumers can subscribe to one or more topic. 7343 05:19:21,786 --> 05:19:22,910 And consume data 7344 05:19:22,910 --> 05:19:26,773 from that topic now consumers basically label themselves 7345 05:19:26,773 --> 05:19:28,600 with a consumer group name 7346 05:19:28,600 --> 05:19:31,900 and each record publish to a topic is delivered 7347 05:19:31,900 --> 05:19:35,703 to one consumer instance within each subscribing consumer group. 7348 05:19:35,703 --> 05:19:37,536 So suppose you have a consumer group. 7349 05:19:37,536 --> 05:19:40,072 Let's say consumer Group 1 and then you have 7350 05:19:40,072 --> 05:19:41,900 three consumers residing in it. 7351 05:19:41,900 --> 05:19:45,400 That is consumer a consumer be an consumer see now 7352 05:19:45,400 --> 05:19:47,015 from the seals topic. 
7353 05:19:47,100 --> 05:19:51,600 Each record can be read once by consumer group 1 and it 7354 05:19:51,600 --> 05:19:56,200 can either be read by consumer A or consumer B or consumer C, 7355 05:19:56,200 --> 05:20:00,337 but it can only be consumed once by the single consumer group, 7356 05:20:00,337 --> 05:20:02,200 that is consumer group one. 7357 05:20:02,200 --> 05:20:05,700 But again, you can have multiple consumer groups 7358 05:20:05,700 --> 05:20:07,700 which can subscribe to a topic 7359 05:20:07,700 --> 05:20:11,260 where one record can be consumed by multiple consumers, 7360 05:20:11,260 --> 05:20:14,226 that is one consumer from each consumer group. 7361 05:20:14,226 --> 05:20:16,842 So now let's say you have a consumer group one 7362 05:20:16,842 --> 05:20:19,291 and consumer group two. In consumer group 7363 05:20:19,291 --> 05:20:20,600 1 we have two consumers, 7364 05:20:20,600 --> 05:20:22,854 that is consumer 1A and consumer 1B, 7365 05:20:22,854 --> 05:20:24,400 and in consumer group two we 7366 05:20:24,400 --> 05:20:27,819 have two consumers, consumer 2A and consumer 2B. So 7367 05:20:27,819 --> 05:20:30,229 if consumer group 1 and consumer group 7368 05:20:30,229 --> 05:20:32,900 2 are consuming messages from topic sales, 7369 05:20:32,900 --> 05:20:36,000 the single record will be consumed by consumer group one 7370 05:20:36,000 --> 05:20:39,111 as well as consumer group 2, and a single consumer 7371 05:20:39,111 --> 05:20:43,000 from both the consumer groups will consume the record once. So 7372 05:20:43,000 --> 05:20:45,900 I guess you are clear with the concept of consumer 7373 05:20:45,900 --> 05:20:49,124 and consumer group. Now consumer instances can be 7374 05:20:49,124 --> 05:20:51,800 separate processes or separate machines. 7375 05:20:51,900 --> 05:20:55,918 Now talking about brokers. Brokers are nothing but a single machine 7376 05:20:55,918 --> 05:20:57,300 in the Kafka cluster, 7377 05:20:57,300 --> 05:21:00,800 and ZooKeeper is another Apache open source project. 7378 05:21:00,800 --> 05:21:03,536 It stores metadata information related 7379 05:21:03,536 --> 05:21:04,700 to the Kafka cluster, 7380 05:21:04,700 --> 05:21:08,100 like broker information, topic details, etc. 7381 05:21:08,100 --> 05:21:09,933 ZooKeeper is basically the one 7382 05:21:09,933 --> 05:21:12,316 who is managing the whole Kafka cluster. 7383 05:21:12,316 --> 05:21:14,700 Now, let's quickly go to the next slide. 7384 05:21:14,700 --> 05:21:16,900 So suppose you have a topic. 7385 05:21:16,900 --> 05:21:21,100 Let's assume this is topic sales and you have four partitions, 7386 05:21:21,100 --> 05:21:23,900 so you have partition 0, partition 1, partition 7387 05:21:23,900 --> 05:21:27,600 2 and partition three. Now you have five brokers over here. 7388 05:21:27,614 --> 05:21:30,768 Now, let's take the case of partition 1. So 7389 05:21:30,850 --> 05:21:34,800 if the replication factor is 3, it will have 3 copies 7390 05:21:34,800 --> 05:21:37,100 which will reside on different brokers. 7391 05:21:37,100 --> 05:21:40,121 So one replica is on broker 2, the next is 7392 05:21:40,121 --> 05:21:43,000 on broker 3 and the next is on broker 5, and 7393 05:21:43,000 --> 05:21:44,800 as you can see, replica 5, 7394 05:21:45,000 --> 05:21:47,800 so this 5 is from this broker 5. 7395 05:21:48,100 --> 05:21:52,500 So the ID of the replica is the same as the ID of the broker 7396 05:21:52,500 --> 05:21:55,700 that hosts it. Now moving ahead.
7397 05:21:55,700 --> 05:21:57,100 One of the replica 7398 05:21:57,100 --> 05:22:00,800 of partition one will serve as the leader replica. 7399 05:22:00,800 --> 05:22:02,074 So now the leader 7400 05:22:02,074 --> 05:22:06,200 of partition one is replica five and any consumer coming 7401 05:22:06,200 --> 05:22:07,684 and consuming messages 7402 05:22:07,684 --> 05:22:10,944 from partition one will be solved by this replica. 7403 05:22:10,944 --> 05:22:14,635 And these two replicas is basically for fault tolerance. 7404 05:22:14,635 --> 05:22:17,343 So that once you're broken five goes off 7405 05:22:17,343 --> 05:22:19,264 or your disc becomes corrupt, 7406 05:22:19,264 --> 05:22:21,115 so your replica 3 or replica 7407 05:22:21,115 --> 05:22:24,100 to to one of them will again serve as a leader 7408 05:22:24,100 --> 05:22:26,938 and this is basically decided on the basis 7409 05:22:26,938 --> 05:22:28,600 of most in sync replica. 7410 05:22:28,600 --> 05:22:30,587 So the replica which will be most 7411 05:22:30,587 --> 05:22:34,100 in sync with this replica will become the next leader. 7412 05:22:34,100 --> 05:22:36,700 So similarly this partition 0 may decide 7413 05:22:36,700 --> 05:22:40,400 on broker one broker to and broker three again 7414 05:22:40,400 --> 05:22:44,500 your partition to May reside on broke of for group 7415 05:22:44,500 --> 05:22:46,800 of five and say broker one 7416 05:22:46,900 --> 05:22:49,500 and then your third partition might reside 7417 05:22:49,500 --> 05:22:51,500 on these three brokers. 7418 05:22:51,700 --> 05:22:54,900 So suppose that this is the leader for partition 7419 05:22:54,900 --> 05:22:56,378 to this is the leader 7420 05:22:56,378 --> 05:22:59,900 for partition 0 this is the leader for partition 3. 7421 05:22:59,900 --> 05:23:02,900 This is the leader for partition 1 right 7422 05:23:02,900 --> 05:23:03,600 so you can see 7423 05:23:03,600 --> 05:23:08,300 that for consumers can consume data pad Ali from these Brokers 7424 05:23:08,300 --> 05:23:10,798 so it can consume data from partition 7425 05:23:10,798 --> 05:23:14,200 to this consumer can consume data from partition 0 7426 05:23:14,200 --> 05:23:17,800 and similarly for partition 3 and partition fun 7427 05:23:18,100 --> 05:23:21,500 now by maintaining the replica basically helps. 7428 05:23:21,500 --> 05:23:25,433 Sin fault tolerance and keeping different partition leaders 7429 05:23:25,433 --> 05:23:29,300 on different Brokers basically helps in parallel execution 7430 05:23:29,300 --> 05:23:32,300 or you can say baddeley consuming those messages. 7431 05:23:32,300 --> 05:23:34,391 So I hope that you guys are clear 7432 05:23:34,391 --> 05:23:36,972 about topics partitions and replicas now, 7433 05:23:36,972 --> 05:23:38,803 let's move to our next slide. 7434 05:23:38,803 --> 05:23:42,062 So this is how the whole Kafka cluster looks like you 7435 05:23:42,062 --> 05:23:43,567 have multiple producers, 7436 05:23:43,567 --> 05:23:46,200 which is again producing messages to Kafka. 7437 05:23:46,200 --> 05:23:48,600 Then this whole is the Kafka cluster 7438 05:23:48,600 --> 05:23:51,590 where you have two nodes node one has to broker. 
7439 05:23:51,590 --> 05:23:55,128 Joker one and broker to and the Note II has two Brokers 7440 05:23:55,128 --> 05:23:58,600 which is broker three and broke of for again consumers 7441 05:23:58,600 --> 05:24:01,434 will be consuming data from these Brokers 7442 05:24:01,434 --> 05:24:03,135 and zookeeper is the one 7443 05:24:03,135 --> 05:24:05,900 who is managing this whole calf cluster. 7444 05:24:06,200 --> 05:24:07,100 Now, let's look 7445 05:24:07,100 --> 05:24:10,688 at some basic commands of Kafka and understand how Kafka Works 7446 05:24:10,688 --> 05:24:12,500 how to go ahead and start zookeeper 7447 05:24:12,500 --> 05:24:14,708 how to go ahead and start Kafka server 7448 05:24:14,708 --> 05:24:16,200 and how to again go ahead 7449 05:24:16,200 --> 05:24:19,141 and produce some messages to Kafka and then consume 7450 05:24:19,141 --> 05:24:20,600 some messages to Kafka. 7451 05:24:20,600 --> 05:24:21,800 So let me quickly. 7452 05:24:21,800 --> 05:24:27,200 on my VM So let me quickly open the terminal. 7453 05:24:28,600 --> 05:24:31,400 Let me quickly go ahead and execute sudo GPS 7454 05:24:31,400 --> 05:24:33,180 so that I can check all the demons 7455 05:24:33,180 --> 05:24:34,800 that are running in my system. 7456 05:24:35,400 --> 05:24:37,095 So you can see I have named 7457 05:24:37,095 --> 05:24:40,800 no data node resource manager node manager job is to server. 7458 05:24:42,000 --> 05:24:43,933 So now as all the hdfs demons 7459 05:24:43,933 --> 05:24:46,200 are burning let us quickly go ahead 7460 05:24:46,200 --> 05:24:48,100 and start Kafka services. 7461 05:24:48,100 --> 05:24:50,561 So first I will go to Kafka home. 7462 05:24:51,400 --> 05:24:53,800 So let me show you the directory. 7463 05:24:53,800 --> 05:24:56,200 So my Kafka is in user lib. 7464 05:24:56,600 --> 05:24:56,900 Now. 7465 05:24:56,900 --> 05:25:00,088 Let me quickly go ahead and start zookeeper service. 7466 05:25:00,088 --> 05:25:01,087 But before that, 7467 05:25:01,087 --> 05:25:03,900 let me show you zookeeper dot properties file. 7468 05:25:06,415 --> 05:25:10,800 So decline Port is 2 1 8 1 so my zookeeper will be running 7469 05:25:10,800 --> 05:25:12,300 on Port to 181 7470 05:25:12,600 --> 05:25:15,400 and the data directory in which my zookeeper 7471 05:25:15,400 --> 05:25:19,700 will store all the metadata is slash temp / zookeeper. 7472 05:25:20,000 --> 05:25:23,200 So let us quickly go ahead and start zookeeper 7473 05:25:23,400 --> 05:25:28,300 and the command is bins zookeeper server start. 7474 05:25:28,900 --> 05:25:30,500 So this is the script file 7475 05:25:30,500 --> 05:25:33,300 and then I'll pass the properties file 7476 05:25:33,357 --> 05:25:37,988 which is inside config directory and a little Meanwhile, 7477 05:25:37,988 --> 05:25:39,834 let me open another tab. 7478 05:25:40,403 --> 05:25:44,096 So here I will be starting my first Kafka broker. 7479 05:25:44,200 --> 05:25:47,200 But before that let me show you the properties file. 7480 05:25:47,576 --> 05:25:50,423 So we'll go in config directory again, 7481 05:25:51,100 --> 05:25:53,700 and I have server dot properties. 7482 05:25:54,400 --> 05:25:58,300 So this is the properties of my first Kafka broker. 7483 05:25:59,507 --> 05:26:01,892 So first we have server Basics. 7484 05:26:02,300 --> 05:26:06,400 So here the broker idea of my first broker is 0 then 7485 05:26:06,400 --> 05:26:10,700 the port is 9:09 to on which my first broker will be running. 
7486 05:26:11,400 --> 05:26:14,500 So it contains all the socket server settings 7487 05:26:14,657 --> 05:26:16,042 then moving ahead. 7488 05:26:16,049 --> 05:26:17,555 We have log base X. 7489 05:26:17,555 --> 05:26:21,139 So in that log Basics, this is log directory, 7490 05:26:21,200 --> 05:26:23,500 which is / them / Kafka - 7491 05:26:23,500 --> 05:26:26,400 logs so over here my Kafka will store 7492 05:26:26,400 --> 05:26:28,226 all those messages or records, 7493 05:26:28,226 --> 05:26:30,600 which will be produced by The Producers. 7494 05:26:30,600 --> 05:26:31,799 So all the records 7495 05:26:31,799 --> 05:26:35,600 which belongs to broker 0 will be stored at this location. 7496 05:26:35,900 --> 05:26:39,200 Now, the next section is internal topic settings 7497 05:26:39,200 --> 05:26:40,900 in which the offset topical. 7498 05:26:40,900 --> 05:26:42,500 application factor is 1 7499 05:26:42,500 --> 05:26:48,100 then transaction State log replication factor is 1 Next 7500 05:26:48,384 --> 05:26:50,615 we have log retention policy. 7501 05:26:50,900 --> 05:26:54,500 So the log retention ours is 168. 7502 05:26:54,500 --> 05:26:58,319 So your records will be stored for 168 hours by default 7503 05:26:58,319 --> 05:27:00,300 and then it will be deleted. 7504 05:27:00,300 --> 05:27:02,300 Then you have zookeeper properties 7505 05:27:02,300 --> 05:27:05,100 where we have specified zookeeper connect and 7506 05:27:05,100 --> 05:27:07,482 as we have seen in Zookeeper dot properties file 7507 05:27:07,482 --> 05:27:10,000 that are zookeeper will be running on Port 2 1 8 1 7508 05:27:10,000 --> 05:27:12,000 so we are giving the address of Zookeeper 7509 05:27:12,000 --> 05:27:13,900 that is localized to one eight one 7510 05:27:14,300 --> 05:27:15,911 and at last we have group. 7511 05:27:15,911 --> 05:27:18,700 Coordinator setting so let us quickly go ahead 7512 05:27:18,700 --> 05:27:20,700 and start the first broker. 7513 05:27:21,457 --> 05:27:24,842 So the script file is Kafka server started sh 7514 05:27:24,900 --> 05:27:27,100 and then we have to give the properties file, 7515 05:27:27,200 --> 05:27:31,000 which is server dot properties for the first broker. 7516 05:27:31,200 --> 05:27:35,276 I'll hit enter and meanwhile, let me open another tab. 7517 05:27:36,234 --> 05:27:39,865 now I'll show you the next properties file, 7518 05:27:40,200 --> 05:27:43,400 which is Server 1. 7519 05:27:43,400 --> 05:27:44,600 Properties. 7520 05:27:45,300 --> 05:27:46,400 So the things 7521 05:27:46,400 --> 05:27:50,700 which you have to change for creating a new broker 7522 05:27:51,000 --> 05:27:54,700 is first you have to change the broker ID. 7523 05:27:54,900 --> 05:27:59,100 So my earlier book ID was 0 the new broker ID is 1 again, 7524 05:27:59,100 --> 05:28:02,255 you can replicate this file and for a new server, 7525 05:28:02,255 --> 05:28:05,059 you have to change the broker idea to to then 7526 05:28:05,059 --> 05:28:08,513 you have to change the port because on 9:09 to already. 7527 05:28:08,513 --> 05:28:11,200 My first broker is running that is broker 0 7528 05:28:11,200 --> 05:28:12,019 so my broker. 7529 05:28:12,019 --> 05:28:14,099 Should connect to a different port 7530 05:28:14,099 --> 05:28:17,000 and here I have specified nine zero nine three. 7531 05:28:17,700 --> 05:28:21,600 Next thing what you have to change is the log directory. 7532 05:28:21,600 --> 05:28:25,830 So here I have added a - 1 to the default log directory. 
7533 05:28:25,830 --> 05:28:27,400 So all these records 7534 05:28:27,400 --> 05:28:30,600 which is stored to my broker one will be going 7535 05:28:30,600 --> 05:28:32,505 to this particular directory 7536 05:28:32,505 --> 05:28:35,500 that is slashed and slashed cough call logs - 7537 05:28:35,500 --> 05:28:39,500 1 And rest of the things are similar, 7538 05:28:39,700 --> 05:28:42,900 so let me quickly go ahead and start second broker as well. 7539 05:28:45,800 --> 05:28:48,000 And let me open one more terminal. 7540 05:28:51,569 --> 05:28:54,030 And I'll start broker to as well. 7541 05:29:01,400 --> 05:29:06,475 So the Zookeeper started then procurve one is also started 7542 05:29:06,475 --> 05:29:09,700 and this is broker to which is also started 7543 05:29:09,702 --> 05:29:11,472 and this is proof of 3. 7544 05:29:12,600 --> 05:29:14,600 So now let me quickly minimize this 7545 05:29:15,200 --> 05:29:17,300 and I'll open a new terminal. 7546 05:29:18,000 --> 05:29:20,800 Now first, let us look at some commands later 7547 05:29:20,800 --> 05:29:21,900 to Kafka topics. 7548 05:29:21,900 --> 05:29:24,900 So I'll quickly go ahead and create a topic. 7549 05:29:25,250 --> 05:29:29,250 So again, let me first go to my Kafka home directory. 7550 05:29:31,700 --> 05:29:36,000 Then the script file is Kafka top it dot sh, 7551 05:29:36,000 --> 05:29:37,762 then the first parameter 7552 05:29:37,762 --> 05:29:41,800 is create then we have to give the address of zoo keeper 7553 05:29:41,800 --> 05:29:43,327 because zookeeper is the one 7554 05:29:43,327 --> 05:29:46,000 who is actually containing all the details related 7555 05:29:46,000 --> 05:29:47,000 to your topic. 7556 05:29:47,700 --> 05:29:50,600 So the address of my zookeeper is localized to one eight one 7557 05:29:50,700 --> 05:29:53,000 then we'll give the topic name. 7558 05:29:53,000 --> 05:29:56,076 So let me name the topic as Kafka - 7559 05:29:56,076 --> 05:30:00,000 spark next we have to specify the replication factor 7560 05:30:00,000 --> 05:30:01,100 of the topic. 7561 05:30:01,300 --> 05:30:04,900 So it will replicate all the partitions inside the topic 7562 05:30:04,900 --> 05:30:05,700 that many times. 7563 05:30:06,600 --> 05:30:08,300 So replication - 7564 05:30:08,300 --> 05:30:10,900 Factor as we have three Brokers, 7565 05:30:10,900 --> 05:30:15,600 so let me keep it as 3 and then we have partitions. 7566 05:30:15,800 --> 05:30:17,074 So I will keep it as 7567 05:30:17,074 --> 05:30:19,746 three because we have three Brokers running 7568 05:30:19,746 --> 05:30:21,689 and our consumer can go ahead 7569 05:30:21,689 --> 05:30:23,700 and consume messages parallely 7570 05:30:23,700 --> 05:30:27,010 from three Brokers and let me press enter. 7571 05:30:29,300 --> 05:30:32,000 So now you can see the topic is created. 7572 05:30:32,000 --> 05:30:35,100 Now, let us quickly go ahead and list all the topics. 7573 05:30:35,100 --> 05:30:36,100 So the command 7574 05:30:36,100 --> 05:30:40,200 for listing all the topics is dot slash bin again. 7575 05:30:40,200 --> 05:30:44,200 We'll open cough car topic script file then - 7576 05:30:44,200 --> 05:30:48,300 - list and again will provide the address of Zookeeper. 7577 05:30:48,700 --> 05:30:50,000 So do again list the topic 7578 05:30:50,000 --> 05:30:53,674 we have to first go to the CAF core topic script file. 7579 05:30:53,674 --> 05:30:55,200 Then we have to give - 7580 05:30:55,200 --> 05:30:59,300 - list parameter and next we have to give the zookeepers. 
7581 05:30:59,576 --> 05:31:02,423 which is localhost:2181, and I'll hit enter. 7582 05:31:04,100 --> 05:31:07,000 And you can see I have this kafka- 7583 05:31:07,000 --> 05:31:11,000 spark topic. The kafka-spark topic has been created. 7584 05:31:11,100 --> 05:31:11,407 Now 7585 05:31:11,407 --> 05:31:14,176 let me show you one more thing. Again 7586 05:31:14,176 --> 05:31:18,900 we'll go to bin/kafka-topics.sh 7587 05:31:19,000 --> 05:31:21,100 and we'll describe this topic. 7588 05:31:21,900 --> 05:31:24,600 I will pass the address of ZooKeeper, 7589 05:31:24,800 --> 05:31:26,300 which is localhost 7590 05:31:26,600 --> 05:31:30,600 :2181, and then I'll pass the topic name, 7591 05:31:31,000 --> 05:31:34,700 which is kafka-spark. 7592 05:31:36,400 --> 05:31:37,600 So now you can see here 7593 05:31:37,600 --> 05:31:40,100 the topic is kafka-spark. 7594 05:31:40,100 --> 05:31:43,400 The partition count is 3, the replication factor is 3 7595 05:31:43,400 --> 05:31:45,600 and the config is as follows. 7596 05:31:45,700 --> 05:31:49,900 So here you can see all the three partitions of the topic, 7597 05:31:49,900 --> 05:31:54,400 that is partition 0, partition 1 and partition 2. Then the leader 7598 05:31:54,400 --> 05:31:57,400 for partition 0 is broker 2, the leader 7599 05:31:57,400 --> 05:31:59,417 for partition one is broker 0 7600 05:31:59,417 --> 05:32:02,200 and the leader for partition 2 is broker one, 7601 05:32:02,200 --> 05:32:06,194 so you can see we have different partition leaders residing on 7602 05:32:06,194 --> 05:32:09,600 different brokers, so this is basically for load balancing, 7603 05:32:09,600 --> 05:32:11,900 so that different partitions could be served 7604 05:32:11,900 --> 05:32:13,000 from different brokers 7605 05:32:13,000 --> 05:32:15,413 and they could be consumed parallelly. Again, 7606 05:32:15,413 --> 05:32:16,800 you can see the replica 7607 05:32:16,800 --> 05:32:20,512 of this partition is residing in all the three brokers, same 7608 05:32:20,512 --> 05:32:23,200 with partition 1 and same with partition 2, 7609 05:32:23,200 --> 05:32:25,700 and it's showing you the in-sync replicas. 7610 05:32:25,700 --> 05:32:27,100 So in the in-sync replicas, 7611 05:32:27,100 --> 05:32:30,600 the first is 2, then you have 0 and then you have 1, 7612 05:32:30,600 --> 05:32:33,600 and similarly with partition 1 and 2. 7613 05:32:33,900 --> 05:32:35,100 So now let us quickly 7614 05:32:35,100 --> 05:32:35,900 go ahead. 7615 05:32:36,500 --> 05:32:38,346 I'll reduce this to half. 7616 05:32:40,000 --> 05:32:42,200 Let me open one more terminal. 7617 05:32:43,300 --> 05:32:45,200 The reason why I'm doing this is 7618 05:32:45,200 --> 05:32:48,600 that we can actually produce a message from one console 7619 05:32:48,600 --> 05:32:51,700 and then we can receive the message in another console. 7620 05:32:51,707 --> 05:32:56,092 So for that I'll start the Kafka console producer first. 7621 05:32:56,396 --> 05:32:57,703 So the command is 7622 05:32:58,000 --> 05:33:04,400 ./bin/kafka-console-producer.sh 7623 05:33:04,400 --> 05:33:06,100 and then in case 7624 05:33:06,100 --> 05:33:11,400 of producer you have to give the parameter as --broker-list, 7625 05:33:11,800 --> 05:33:18,000 which is localhost:9092. You can provide any of the brokers 7626 05:33:18,000 --> 05:33:19,000 that is running 7627 05:33:19,000 --> 05:33:22,400 and it will again take the rest of the brokers from there.
7628 05:33:22,400 --> 05:33:25,794 So you just have to provide the address of one broker. 7629 05:33:25,794 --> 05:33:28,100 You can also provide a set of Brokers 7630 05:33:28,100 --> 05:33:30,000 so you can give it as localhost colon. 7631 05:33:30,000 --> 05:33:33,800 9:09 2 comma Lu closed: 9 0 9 3 and similarly. 7632 05:33:33,800 --> 05:33:35,800 So here I am passing the address 7633 05:33:35,800 --> 05:33:39,700 of the first broker now next I have to mention the topic. 7634 05:33:39,700 --> 05:33:41,900 So topic is Kafka Spark. 7635 05:33:43,700 --> 05:33:45,161 And I'll hit enter. 7636 05:33:45,500 --> 05:33:47,900 So my console producer is started. 7637 05:33:47,900 --> 05:33:50,600 Let me produce a message saying hi. 7638 05:33:51,000 --> 05:33:53,376 Now in the second terminal I will go ahead 7639 05:33:53,376 --> 05:33:55,200 and start the console consumer. 7640 05:33:55,500 --> 05:34:00,700 So again, the command is Kafka console consumer not sh 7641 05:34:00,800 --> 05:34:03,000 and then in case of consumer, 7642 05:34:03,000 --> 05:34:06,600 you have to give the parameter as bootstrap server. 7643 05:34:07,800 --> 05:34:10,400 So this is the thing to notice guys that in case 7644 05:34:10,400 --> 05:34:13,600 of producer you have to give the broker list by in. 7645 05:34:13,600 --> 05:34:14,725 So of consumer, 7646 05:34:14,725 --> 05:34:19,000 you have to give bootstrap server and it is again the same 7647 05:34:19,000 --> 05:34:23,389 that is localhost 9:09 to which the address of my broker 0 7648 05:34:23,500 --> 05:34:25,807 and then I will give the topic 7649 05:34:25,807 --> 05:34:30,700 which is cuff cost park now adding this parameter 7650 05:34:30,700 --> 05:34:32,100 that is from - 7651 05:34:32,100 --> 05:34:35,800 beginning will basically give me messages stored 7652 05:34:35,800 --> 05:34:37,926 in that topic from beginning. 7653 05:34:37,926 --> 05:34:41,300 Otherwise, if I'm not giving this parameter - - 7654 05:34:41,300 --> 05:34:43,200 from beginning I'll only 7655 05:34:43,200 --> 05:34:44,630 I'm the recent messages 7656 05:34:44,630 --> 05:34:48,300 that has been produced after starting this console consumer. 7657 05:34:48,300 --> 05:34:49,484 So let me hit enter 7658 05:34:49,484 --> 05:34:52,600 and you can see I'll get a message saying hi first. 7659 05:34:55,700 --> 05:34:57,267 Well, I'm sorry guys. 7660 05:34:57,267 --> 05:35:00,400 The topic name I have given is not correct. 7661 05:35:00,400 --> 05:35:01,784 Sorry for my typo. 7662 05:35:01,784 --> 05:35:03,707 Let me quickly corrected. 7663 05:35:04,300 --> 05:35:05,800 And let me hit enter. 7664 05:35:06,800 --> 05:35:10,300 So as you can see, I am receiving the messages. 7665 05:35:10,300 --> 05:35:13,900 I received High then let me produce some more messages. 7666 05:35:19,200 --> 05:35:21,600 So now you can see all the messages 7667 05:35:21,600 --> 05:35:22,858 that I am producing 7668 05:35:22,858 --> 05:35:26,900 from console producer is getting consumed by console consumer. 7669 05:35:26,900 --> 05:35:30,466 Now this console producer as well as console consumer 7670 05:35:30,466 --> 05:35:31,838 is basically used by 7671 05:35:31,838 --> 05:35:35,200 the developers to actually test the Kafka cluster. 
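Beyond the console tools, the consuming side is usually written with the Kafka client library in application code; a minimal Scala sketch is below, where the kafka-spark topic and localhost:9092 broker follow the demo while the group id and the fixed poll loop are assumptions for illustration.

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

object ConsumerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")   // broker address from the demo
    props.put("group.id", "demo-group")                // assumed consumer group name
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("auto.offset.reset", "earliest")         // behaves like --from-beginning

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("kafka-spark"))

    // Poll a few batches and print whatever the console producer sent
    for (_ <- 1 to 10) {
      val records = consumer.poll(Duration.ofSeconds(1))
      val it = records.iterator()
      while (it.hasNext) {
        val r = it.next()
        println(s"partition=${r.partition()} offset=${r.offset()} value=${r.value()}")
      }
    }
    consumer.close()
  }
}
```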
7672 05:35:35,200 --> 05:35:37,100 So what happens if you are 7673 05:35:37,100 --> 05:35:38,300 if there is a producer 7674 05:35:38,300 --> 05:35:40,300 which is running and which is producing 7675 05:35:40,300 --> 05:35:43,196 those messages to Kafka then you can go ahead 7676 05:35:43,196 --> 05:35:45,558 and you can start console consumer and check 7677 05:35:45,558 --> 05:35:47,500 whether the producer is producing. 7678 05:35:47,500 --> 05:35:49,900 Messages or not or you can again go ahead 7679 05:35:49,900 --> 05:35:50,900 and check the format 7680 05:35:50,900 --> 05:35:53,860 in which your message are getting produced to the topic. 7681 05:35:53,860 --> 05:35:56,988 Those kind of testing part is done using console consumer 7682 05:35:56,988 --> 05:35:59,000 and similarly using console producer. 7683 05:35:59,000 --> 05:36:01,500 You do something like you are creating a consumer 7684 05:36:01,500 --> 05:36:04,900 so you can go ahead you can produce a message to Kafka topic 7685 05:36:04,900 --> 05:36:06,000 and then you can check 7686 05:36:06,000 --> 05:36:08,700 whether your consumer is consuming that message or not. 7687 05:36:08,700 --> 05:36:11,049 This is basically used for testing now, 7688 05:36:11,049 --> 05:36:13,400 let us quickly go ahead and close this. 7689 05:36:15,700 --> 05:36:18,700 Now let us get back to our slides now. 7690 05:36:18,700 --> 05:36:20,605 I have briefly covered Kafka 7691 05:36:20,605 --> 05:36:24,300 and the concepts of Kafka so here basically I'm giving 7692 05:36:24,300 --> 05:36:27,200 you a small brief idea about what Kafka is 7693 05:36:27,200 --> 05:36:29,100 and how Kafka works now 7694 05:36:29,100 --> 05:36:32,100 as we have understood why we need misting systems. 7695 05:36:32,100 --> 05:36:33,100 What is cough cough? 7696 05:36:33,100 --> 05:36:35,000 What are different terminologies and Kafka 7697 05:36:35,000 --> 05:36:36,657 how Kafka architecture works 7698 05:36:36,657 --> 05:36:39,513 and we have seen some of the basic cuff Pokemons. 7699 05:36:39,513 --> 05:36:41,000 So let us now understand. 7700 05:36:41,000 --> 05:36:42,600 What is Apache spark. 7701 05:36:42,800 --> 05:36:44,900 So basically Apache spark 7702 05:36:44,900 --> 05:36:47,802 is an Source cluster Computing framework 7703 05:36:47,802 --> 05:36:51,300 for near real-time processing now spark provides 7704 05:36:51,300 --> 05:36:54,205 an interface for programming the entire cluster 7705 05:36:54,205 --> 05:36:56,047 with implicit data parallelism 7706 05:36:56,047 --> 05:36:59,300 and fault tolerance will talk about how spark provides 7707 05:36:59,300 --> 05:37:02,900 fault tolerance but talking about implicit data parallelism. 7708 05:37:02,900 --> 05:37:06,600 That means you do not need any special directives operators 7709 05:37:06,600 --> 05:37:09,000 or functions to enable parallel execution. 7710 05:37:09,000 --> 05:37:12,600 It sparked by default provides the data parallelism spark 7711 05:37:12,600 --> 05:37:15,628 is designed to cover a wide range of workloads such. 7712 05:37:15,628 --> 05:37:16,919 As batch applications 7713 05:37:16,919 --> 05:37:20,400 iterative algorithms interactive queries machine learning 7714 05:37:20,400 --> 05:37:22,000 algorithms and streaming. 7715 05:37:22,000 --> 05:37:24,174 So basically the main feature 7716 05:37:24,174 --> 05:37:27,500 of spark is it's in memory cluster Computing 7717 05:37:27,500 --> 05:37:30,900 that increases the processing speed of the application. 
7718 05:37:30,900 --> 05:37:34,763 So what Spark does: Spark does not store the data on disk;
7719 05:37:34,763 --> 05:37:36,950 instead, it transforms the data
7720 05:37:36,950 --> 05:37:38,700 and keeps the data in memory,
7721 05:37:38,700 --> 05:37:39,616 so that quickly
7722 05:37:39,616 --> 05:37:42,500 multiple operations can be applied over the data,
7723 05:37:42,500 --> 05:37:45,500 and only the final result is stored on the disk.
7724 05:37:45,500 --> 05:37:49,629 Now, on the other side, Spark can also do batch processing a hundred times
7725 05:37:49,629 --> 05:37:51,108 faster than MapReduce,
7726 05:37:51,108 --> 05:37:54,400 and this is the reason why Apache Spark is the go-
7727 05:37:54,400 --> 05:37:57,324 to tool for big data processing in the industry.
7728 05:37:57,324 --> 05:38:00,000 Now, let's quickly move ahead and understand
7729 05:38:00,000 --> 05:38:01,461 how Spark does this.
7730 05:38:01,600 --> 05:38:03,617 So the answer is RDD,
7731 05:38:03,617 --> 05:38:07,700 that is, resilient distributed datasets. Now an RDD is
7732 05:38:07,700 --> 05:38:11,406 a read-only partitioned collection of records, and you
7733 05:38:11,406 --> 05:38:14,897 can say it is the fundamental data structure of Spark.
7734 05:38:14,897 --> 05:38:16,312 So basically, an RDD is
7735 05:38:16,312 --> 05:38:19,522 an immutable distributed collection of objects.
7736 05:38:19,522 --> 05:38:21,709 So each dataset in an RDD is divided
7737 05:38:21,709 --> 05:38:23,300 into logical partitions,
7738 05:38:23,300 --> 05:38:25,639 which may be computed on different nodes
7739 05:38:25,639 --> 05:38:28,400 of the cluster. Now an RDD can contain any type
7740 05:38:28,400 --> 05:38:30,800 of Python, Java or Scala objects.
7741 05:38:30,800 --> 05:38:33,900 Now talking about fault tolerance: an RDD
7742 05:38:33,900 --> 05:38:37,900 is a fault-tolerant collection of elements that can be operated
7743 05:38:37,900 --> 05:38:39,000 on in parallel.
7744 05:38:39,000 --> 05:38:40,500 Now, how does an RDD do
7745 05:38:40,500 --> 05:38:43,380 that? If an RDD is lost, it will automatically
7746 05:38:43,380 --> 05:38:45,609 be recomputed by using the original
7747 05:38:45,609 --> 05:38:49,300 transformations, and this is how Spark provides fault tolerance.
7748 05:38:49,300 --> 05:38:51,255 So I hope that you guys are clear
7749 05:38:51,255 --> 05:38:53,700 on how Spark provides fault tolerance.
7750 05:38:54,132 --> 05:38:57,500 Now let's talk about how we can create RDDs.
7751 05:38:57,500 --> 05:39:01,600 So there are two ways to create RDDs: first is parallelizing
7752 05:39:01,600 --> 05:39:04,474 an existing collection in your driver program,
7753 05:39:04,474 --> 05:39:06,200 or you can refer to a dataset
7754 05:39:06,200 --> 05:39:09,300 in an external storage system such as a shared file system.
7755 05:39:09,300 --> 05:39:11,300 It can be HDFS, HBase
7756 05:39:11,300 --> 05:39:15,200 or any other data source offering a Hadoop input format.
7757 05:39:15,200 --> 05:39:16,800 Now Spark makes use
7758 05:39:16,800 --> 05:39:20,200 of the concept of RDD to achieve fast and efficient operations.
7759 05:39:20,200 --> 05:39:22,600 Now, let's quickly move ahead
7760 05:39:22,600 --> 05:39:27,200 and look at how an RDD works. So first we create an RDD,
7761 05:39:27,200 --> 05:39:29,600 which you can create either by parallelizing a collection or by referring
7762 05:39:29,600 --> 05:39:31,800 to an external storage system.
7763 05:39:31,800 --> 05:39:35,400 And then once you create an rdd you can go ahead 7764 05:39:35,400 --> 05:39:37,800 and you can apply multiple Transformations 7765 05:39:37,800 --> 05:39:38,800 over that are ready. 7766 05:39:39,100 --> 05:39:43,100 Like will perform filter map Union Etc. 7767 05:39:43,100 --> 05:39:44,219 And then again, 7768 05:39:44,219 --> 05:39:48,400 it gives you a new rdd or you can see the transformed rdd 7769 05:39:48,400 --> 05:39:51,500 and at last you apply some action and get 7770 05:39:51,500 --> 05:39:55,100 the result now this action can be Count first 7771 05:39:55,100 --> 05:39:57,149 a can collect all those kind 7772 05:39:57,149 --> 05:39:58,100 of functions. 7773 05:39:58,100 --> 05:40:01,151 So now this is a brief idea about what is rdd 7774 05:40:01,151 --> 05:40:02,400 and how rdd works. 7775 05:40:02,400 --> 05:40:04,570 So now let's quickly move ahead and look 7776 05:40:04,570 --> 05:40:06,100 at the different workloads 7777 05:40:06,100 --> 05:40:08,200 that can be handled by Apache spark. 7778 05:40:08,200 --> 05:40:10,883 So we have interactive streaming analytics. 7779 05:40:10,883 --> 05:40:12,800 Then we have machine learning. 7780 05:40:12,800 --> 05:40:14,158 We have data integration. 7781 05:40:14,158 --> 05:40:16,207 We have spark streaming and processing. 7782 05:40:16,207 --> 05:40:17,944 So let us talk about them one 7783 05:40:17,944 --> 05:40:20,400 by one first is spark streaming and processing. 7784 05:40:20,400 --> 05:40:21,400 So now basically, 7785 05:40:21,400 --> 05:40:24,007 you know data arrives at a steady rate. 7786 05:40:24,007 --> 05:40:27,000 Are you can say at a continuous streams, right? 7787 05:40:27,000 --> 05:40:29,300 And then what you can do you can again go ahead 7788 05:40:29,300 --> 05:40:30,829 and store the data set in disk 7789 05:40:30,829 --> 05:40:34,299 and then you can actually go ahead and apply some processing 7790 05:40:34,299 --> 05:40:36,007 over it some analytics over it 7791 05:40:36,007 --> 05:40:38,000 and then get some results out of it, 7792 05:40:38,000 --> 05:40:41,200 but this is not the scenario with each and every case. 7793 05:40:41,200 --> 05:40:44,100 Let's take an example of financial transactions 7794 05:40:44,100 --> 05:40:46,343 where you have to go ahead and identify 7795 05:40:46,343 --> 05:40:48,931 and refuse potential fraudulent transactions. 7796 05:40:48,931 --> 05:40:50,297 Now if you will go ahead 7797 05:40:50,297 --> 05:40:53,197 and store the data stream and then you will go ahead 7798 05:40:53,197 --> 05:40:55,800 and apply some Assessing it would be too late 7799 05:40:55,800 --> 05:40:58,287 and someone would have got away with the money. 7800 05:40:58,287 --> 05:41:00,386 So in that scenario what you need to do. 7801 05:41:00,386 --> 05:41:03,183 So you need to quickly take that input data stream. 7802 05:41:03,183 --> 05:41:05,700 You need to apply some Transformations over it 7803 05:41:05,700 --> 05:41:08,300 and then you have to take actions accordingly. 7804 05:41:08,300 --> 05:41:10,015 Like you can send some notification 7805 05:41:10,015 --> 05:41:11,322 or you can actually reject 7806 05:41:11,322 --> 05:41:13,972 that fraudulent transaction something like that. 7807 05:41:13,972 --> 05:41:15,200 And then you can go ahead 7808 05:41:15,200 --> 05:41:17,686 and if you want you can store those results 7809 05:41:17,686 --> 05:41:19,700 or data set in some of the database 7810 05:41:19,700 --> 05:41:21,700 or you can see some of the file system. 
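To make the create-transform-act RDD flow described above a bit more concrete, here is a minimal sketch using Spark's Java API. The numbers, the filter condition, and the commented-out HDFS path are illustrative assumptions, not something taken from the course demo.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class RddFlowSketch {
        public static void main(String[] args) {
            // Local context, just for illustration
            SparkConf conf = new SparkConf().setAppName("RddFlowSketch").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Way 1: parallelize an existing collection in the driver program
            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));

            // Transformations (lazy): keep even numbers, then square them
            JavaRDD<Integer> squares = numbers.filter(n -> n % 2 == 0)
                                              .map(n -> n * n);

            // Keep the transformed RDD in memory so repeated actions stay fast
            squares.cache();

            // Actions trigger the actual computation
            System.out.println("count   = " + squares.count());
            System.out.println("collect = " + squares.collect());

            // Way 2 (hypothetical path): refer to a dataset in external storage such as HDFS
            // JavaRDD<String> lines = sc.textFile("hdfs://localhost:9000/data/sample.txt");

            sc.close();
        }
    }

Transformations like filter and map stay lazy; nothing is computed until an action such as count or collect runs, which is also why cache() pays off when several actions reuse the same transformed RDD.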
7811 05:41:21,800 --> 05:41:24,000 So we have some scenarios. 7812 05:41:24,026 --> 05:41:27,873 Very we have to actually process the stream of data 7813 05:41:27,900 --> 05:41:29,300 and then we have to go ahead 7814 05:41:29,300 --> 05:41:30,358 and store the data 7815 05:41:30,358 --> 05:41:34,008 or perform some analysis on it or take some necessary actions. 7816 05:41:34,008 --> 05:41:37,000 So this is where Spark streaming comes into picture 7817 05:41:37,000 --> 05:41:38,575 and Spark is a best fit 7818 05:41:38,575 --> 05:41:42,000 for processing those continuous input data streams. 7819 05:41:42,000 --> 05:41:45,500 Now moving to next that is machine learning now, 7820 05:41:45,500 --> 05:41:46,314 as you know, 7821 05:41:46,314 --> 05:41:47,730 that first we create 7822 05:41:47,730 --> 05:41:51,182 a machine learning model then we continuously feed 7823 05:41:51,182 --> 05:41:54,011 those incoming data streams to the model. 7824 05:41:54,011 --> 05:41:56,700 And we get some continuous output based 7825 05:41:56,700 --> 05:41:58,144 on the input values. 7826 05:41:58,144 --> 05:42:00,453 Now, we reuse intermediate results 7827 05:42:00,453 --> 05:42:04,300 across multiple computation in multi-stage applications, 7828 05:42:04,300 --> 05:42:07,600 which basically includes substantial overhead due to 7829 05:42:07,600 --> 05:42:10,500 data replication disk I/O and sterilization 7830 05:42:10,500 --> 05:42:12,200 which makes the system slow. 7831 05:42:12,200 --> 05:42:16,200 Now what Spock does spark rdd will store intermediate result 7832 05:42:16,200 --> 05:42:19,446 in a distributed memory instead of a stable storage 7833 05:42:19,446 --> 05:42:21,200 and make the system faster. 7834 05:42:21,200 --> 05:42:24,800 So as we saw in spark rdd all the Transformations 7835 05:42:24,800 --> 05:42:26,482 will be applied over there 7836 05:42:26,482 --> 05:42:29,200 and all the transformed rdds will be stored 7837 05:42:29,200 --> 05:42:31,999 in the memory itself so we can quickly go ahead 7838 05:42:31,999 --> 05:42:35,037 and apply some more iterative algorithms over there 7839 05:42:35,037 --> 05:42:37,508 and it does not take much time in functions 7840 05:42:37,508 --> 05:42:39,333 like data replication or disk 7841 05:42:39,333 --> 05:42:42,164 I/O so all those overheads will be reduced now 7842 05:42:42,164 --> 05:42:45,500 you might be wondering that memories always very less. 7843 05:42:45,500 --> 05:42:48,000 So what if the memory gets over so 7844 05:42:48,000 --> 05:42:50,600 if the distributed memory is not sufficient 7845 05:42:50,600 --> 05:42:52,100 to store intermediate results, 7846 05:42:52,300 --> 05:42:54,300 then it will store those results. 7847 05:42:54,300 --> 05:42:55,100 On the desk. 7848 05:42:55,100 --> 05:42:58,000 So I hope that you guys are clear how sparks perform 7849 05:42:58,000 --> 05:43:00,000 this iterative machine learning algorithms 7850 05:43:00,000 --> 05:43:01,500 and why spark is fast, 7851 05:43:01,819 --> 05:43:04,280 let's look at the next workload. 7852 05:43:04,400 --> 05:43:08,200 So next workload is interactive streaming analytics. 7853 05:43:08,200 --> 05:43:10,900 Now as we already discussed about streaming data 7854 05:43:10,900 --> 05:43:15,300 so user runs ad hoc queries on the same subset of data 7855 05:43:15,300 --> 05:43:19,127 and each query will do a disk I/O on the stable storage 7856 05:43:19,127 --> 05:43:22,386 which can dominate applications execution time. 7857 05:43:22,386 --> 05:43:24,300 So, let me take an example. 
7858 05:43:24,300 --> 05:43:25,400 Data scientist. 7859 05:43:25,400 --> 05:43:27,800 So basically you have continuous streams of data, 7860 05:43:27,800 --> 05:43:28,800 which is coming in. 7861 05:43:28,800 --> 05:43:30,650 So what your data scientists would do. 7862 05:43:30,650 --> 05:43:32,900 So do your data scientists will either ask 7863 05:43:32,900 --> 05:43:34,274 some questions execute 7864 05:43:34,274 --> 05:43:37,208 some queries over the data then view the result 7865 05:43:37,208 --> 05:43:40,563 and then he might alter the initial question slightly 7866 05:43:40,563 --> 05:43:41,804 by seeing the output 7867 05:43:41,804 --> 05:43:44,332 or he might also drill deeper into results 7868 05:43:44,332 --> 05:43:47,757 and execute some more queries over the gathered result. 7869 05:43:47,757 --> 05:43:51,500 So there are multiple scenarios in which your data scientist 7870 05:43:51,500 --> 05:43:54,265 would be running some interactive queries. 7871 05:43:54,265 --> 05:43:57,569 On the streaming analytics now house path helps 7872 05:43:57,569 --> 05:44:00,200 in this interactive streaming analytics. 7873 05:44:00,200 --> 05:44:04,453 So each transformed our DD may be recomputed each time. 7874 05:44:04,453 --> 05:44:06,838 You run an action on it, right? 7875 05:44:06,838 --> 05:44:10,692 And when you persist an rdd in memory in which case 7876 05:44:10,692 --> 05:44:13,430 Park will keep all the elements around 7877 05:44:13,430 --> 05:44:15,800 on the cluster for faster access 7878 05:44:15,800 --> 05:44:18,296 and whenever you will execute the query next time 7879 05:44:18,296 --> 05:44:19,077 over the data, 7880 05:44:19,077 --> 05:44:21,200 then the query will be executed quickly 7881 05:44:21,200 --> 05:44:23,700 and it will give you a instant result, right? 7882 05:44:24,100 --> 05:44:26,090 So I hope that you guys are clear 7883 05:44:26,090 --> 05:44:29,200 how spark helps in interactive streaming analytics. 7884 05:44:29,400 --> 05:44:32,000 Now, let's talk about data integration. 7885 05:44:32,000 --> 05:44:33,570 So basically as you know, 7886 05:44:33,570 --> 05:44:36,900 that in large organizations data is basically produced 7887 05:44:36,900 --> 05:44:39,400 from different systems across the business 7888 05:44:39,400 --> 05:44:42,000 and basically you need a framework 7889 05:44:42,000 --> 05:44:45,800 which can actually integrate different data sources. 7890 05:44:45,800 --> 05:44:46,900 So Spock is the one 7891 05:44:46,900 --> 05:44:49,382 which actually integrate different data sources 7892 05:44:49,382 --> 05:44:50,500 so you can go ahead 7893 05:44:50,500 --> 05:44:53,800 and you can take the data from Kafka Cassandra flu. 7894 05:44:53,800 --> 05:44:55,518 Umm hbase then Amazon S3. 7895 05:44:55,518 --> 05:44:59,300 Then you can perform some real time analytics over there 7896 05:44:59,300 --> 05:45:02,000 or even say some near real-time analytics over there. 7897 05:45:02,000 --> 05:45:04,250 You can apply some machine learning algorithms 7898 05:45:04,250 --> 05:45:05,700 and then you can go ahead 7899 05:45:05,700 --> 05:45:08,500 and store the process result in Apache hbase. 7900 05:45:08,500 --> 05:45:10,600 Then msql hdfs. 7901 05:45:10,600 --> 05:45:12,100 It could be your Kafka. 
7902 05:45:12,100 --> 05:45:15,500 So spark basically gives you a multiple options 7903 05:45:15,500 --> 05:45:16,600 where you can go ahead 7904 05:45:16,600 --> 05:45:18,500 and pick the data from and again, 7905 05:45:18,500 --> 05:45:21,200 you can go ahead and write the data into now. 7906 05:45:21,200 --> 05:45:23,620 Let's quickly move ahead and we'll talk. 7907 05:45:23,620 --> 05:45:27,013 About different spark components so you can see here. 7908 05:45:27,013 --> 05:45:28,500 I have a spark or engine. 7909 05:45:28,500 --> 05:45:30,376 So basically this is the core engine 7910 05:45:30,376 --> 05:45:32,200 and on top of this core engine. 7911 05:45:32,200 --> 05:45:35,574 You have spark SQL spark streaming then MLA, 7912 05:45:35,900 --> 05:45:38,100 then you have graphics and the newest Parker. 7913 05:45:38,200 --> 05:45:41,087 Let's talk about them one by one and we'll start 7914 05:45:41,087 --> 05:45:42,500 with spark core engine. 7915 05:45:42,500 --> 05:45:45,200 So spark or engine is the base engine 7916 05:45:45,200 --> 05:45:46,800 for large-scale parallel 7917 05:45:46,800 --> 05:45:50,026 and distributed data processing additional libraries, 7918 05:45:50,026 --> 05:45:52,200 which are built on top of the core allows 7919 05:45:52,200 --> 05:45:53,700 divers workloads Force. 7920 05:45:53,700 --> 05:45:57,300 Streaming SQL machine learning then you can go ahead 7921 05:45:57,300 --> 05:45:59,300 and execute our on spark 7922 05:45:59,300 --> 05:46:01,731 or you can go ahead and execute python on spark 7923 05:46:01,731 --> 05:46:03,000 those kind of workloads. 7924 05:46:03,000 --> 05:46:04,700 You can easily go ahead and execute. 7925 05:46:04,700 --> 05:46:07,800 So basically your spark or engine is the one 7926 05:46:07,800 --> 05:46:10,040 who is managing all your memory, 7927 05:46:10,040 --> 05:46:13,084 then all your fault recovery your scheduling 7928 05:46:13,084 --> 05:46:14,755 your Distributing of jobs 7929 05:46:14,755 --> 05:46:16,078 and monitoring jobs 7930 05:46:16,078 --> 05:46:19,700 on a cluster and interacting with the storage system. 7931 05:46:19,700 --> 05:46:22,400 So in in short we can see the spark 7932 05:46:22,400 --> 05:46:24,501 or engine is the heart of Spock 7933 05:46:24,501 --> 05:46:25,951 and on top of this all 7934 05:46:25,951 --> 05:46:28,389 of these libraries are there so first, 7935 05:46:28,389 --> 05:46:30,429 let's talk about spark streaming. 7936 05:46:30,429 --> 05:46:33,088 So spot streaming is the component of Spas 7937 05:46:33,088 --> 05:46:36,273 which is used to process real-time streaming data 7938 05:46:36,273 --> 05:46:37,600 as we just discussed 7939 05:46:37,600 --> 05:46:41,061 and it is a useful addition to spark core API. 7940 05:46:41,200 --> 05:46:43,600 Now it enables high throughput and fault 7941 05:46:43,600 --> 05:46:46,554 tolerance stream processing for live data streams. 7942 05:46:46,554 --> 05:46:47,700 So you can go ahead 7943 05:46:47,700 --> 05:46:51,338 and you can perform all the streaming data analytics 7944 05:46:51,338 --> 05:46:55,800 using this spark streaming then You have Spock SQL over here. 7945 05:46:55,800 --> 05:46:58,900 So basically spark SQL is a new module in spark 7946 05:46:58,900 --> 05:47:02,200 which integrates relational processing of Sparks functional 7947 05:47:02,200 --> 05:47:06,900 programming API and it supports querying data either via SQL 7948 05:47:06,900 --> 05:47:08,315 or SQL that is - 7949 05:47:08,315 --> 05:47:09,469 query language. 
7950 05:47:09,500 --> 05:47:11,500 So basically for those of you 7951 05:47:11,500 --> 05:47:15,615 who are familiar with rdbms Spock SQL is an easy transition 7952 05:47:15,615 --> 05:47:17,100 from your earlier tool 7953 05:47:17,100 --> 05:47:19,511 where you can go ahead and extend the boundaries 7954 05:47:19,511 --> 05:47:22,100 of traditional relational data processing now 7955 05:47:22,100 --> 05:47:23,700 talking about graphics. 7956 05:47:23,700 --> 05:47:24,900 So Graphics is 7957 05:47:24,900 --> 05:47:28,500 the spaag API for graphs and crafts parallel computation. 7958 05:47:28,500 --> 05:47:30,800 It extends the spark rdd 7959 05:47:30,800 --> 05:47:34,309 with a resilient distributed property graph a talking 7960 05:47:34,309 --> 05:47:35,213 at high level. 7961 05:47:35,213 --> 05:47:38,700 Basically Graphics extend the graph already abstraction 7962 05:47:38,700 --> 05:47:41,758 by introducing the resilient distributed property graph, 7963 05:47:41,758 --> 05:47:42,778 which is nothing 7964 05:47:42,778 --> 05:47:45,900 but a directed multigraph with properties attached 7965 05:47:45,900 --> 05:47:49,700 to each vertex and Edge next we have spark are so 7966 05:47:49,700 --> 05:47:52,394 basically it provides you packages for our language 7967 05:47:52,394 --> 05:47:54,100 and then you can go ahead and 7968 05:47:54,100 --> 05:47:55,399 Leverage Park power 7969 05:47:55,399 --> 05:47:58,000 with our shell next you have spark MLA. 7970 05:47:58,000 --> 05:48:01,849 So ml is basically stands for machine learning library. 7971 05:48:01,849 --> 05:48:05,200 So spark MLM is used to perform machine learning 7972 05:48:05,200 --> 05:48:06,500 in Apache spark. 7973 05:48:06,500 --> 05:48:08,773 Now many common machine learning 7974 05:48:08,773 --> 05:48:11,784 and statical algorithms have been implemented 7975 05:48:11,784 --> 05:48:13,700 and are shipped with ML live 7976 05:48:13,700 --> 05:48:16,935 which simplifies large scale machine learning pipelines, 7977 05:48:16,935 --> 05:48:18,347 which basically includes 7978 05:48:18,347 --> 05:48:20,994 summary statistics correlations classification 7979 05:48:20,994 --> 05:48:23,800 and regression collaborative filtering techniques. 7980 05:48:23,800 --> 05:48:25,700 New cluster analysis methods 7981 05:48:25,700 --> 05:48:28,582 then you have dimensionality reduction techniques. 7982 05:48:28,582 --> 05:48:31,400 You have feature extraction and transformation functions. 7983 05:48:31,400 --> 05:48:33,700 When you have optimization algorithms, 7984 05:48:33,700 --> 05:48:35,900 it is basically a MLM package 7985 05:48:35,900 --> 05:48:39,000 or you can see a machine learning package on top of spa. 7986 05:48:39,000 --> 05:48:41,639 Then you also have something called by spark, 7987 05:48:41,639 --> 05:48:43,979 which is python package for spark there. 7988 05:48:43,979 --> 05:48:46,800 You can go ahead and leverage python over spark. 7989 05:48:46,800 --> 05:48:47,376 So I hope 7990 05:48:47,376 --> 05:48:50,900 that you guys are clear with different spark components. 7991 05:48:51,100 --> 05:48:53,200 So before moving to cough gasp, 7992 05:48:53,200 --> 05:48:54,524 ah, Exclaiming demo. 7993 05:48:54,524 --> 05:48:58,075 So I have just given you a brief intro to Apache spark. 
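As a small illustration of the Spark SQL module mentioned above, here is a hedged sketch in Java. The people.json file, its name and age fields, and the query itself are placeholder assumptions used only to show the general pattern of registering a temporary view and querying it with SQL.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class SparkSqlSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("SparkSqlSketch")
                    .master("local[*]")
                    .getOrCreate();

            // Assumption: people.json is a hypothetical file with "name" and "age" fields
            Dataset<Row> people = spark.read().json("people.json");

            // Register the DataFrame as a temporary view and query it with plain SQL
            people.createOrReplaceTempView("people");
            Dataset<Row> adults = spark.sql("SELECT name, age FROM people WHERE age >= 18");
            adults.show();

            spark.stop();
        }
    }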
7994 05:48:58,075 --> 05:49:01,100 If you want a detailed tutorial on Apache spark 7995 05:49:01,100 --> 05:49:02,600 or different components 7996 05:49:02,600 --> 05:49:06,753 of Apache spark like Apache spark SQL spark data frames 7997 05:49:06,800 --> 05:49:10,200 or spark streaming Spa Graphics Spock MLA, 7998 05:49:10,200 --> 05:49:13,200 so you can go to editor Acres YouTube channel again. 7999 05:49:13,200 --> 05:49:14,800 So now we are here guys. 8000 05:49:14,800 --> 05:49:18,252 I know that you guys are waiting for this demo from a while. 8001 05:49:18,252 --> 05:49:21,900 So now let's go ahead and look at calf by spark streaming demo. 8002 05:49:21,900 --> 05:49:23,700 So let me quickly go ahead and open. 8003 05:49:23,700 --> 05:49:28,000 my virtual machine and I'll open a terminal. 8004 05:49:28,600 --> 05:49:30,658 So let me first check all the demons 8005 05:49:30,658 --> 05:49:32,400 that are running in my system. 8006 05:49:33,800 --> 05:49:35,341 So my zookeeper is running 8007 05:49:35,341 --> 05:49:37,753 name node is running data node is running. 8008 05:49:37,753 --> 05:49:39,130 The my resource manager 8009 05:49:39,130 --> 05:49:42,714 is running all the three cough cough Brokers are running then 8010 05:49:42,714 --> 05:49:44,088 node manager is running 8011 05:49:44,088 --> 05:49:46,000 and job is to server is running. 8012 05:49:46,200 --> 05:49:49,200 So now I have to start my spark demons. 8013 05:49:49,200 --> 05:49:51,900 So let me first go to the spark home 8014 05:49:52,600 --> 05:49:54,600 and start this part demon. 8015 05:49:54,600 --> 05:49:57,800 The command is a spin start or not. 8016 05:49:57,800 --> 05:49:58,900 Sh. 8017 05:50:01,400 --> 05:50:03,400 So let me quickly go ahead 8018 05:50:03,400 --> 05:50:06,861 and execute sudo JPS to check my spark demons. 8019 05:50:08,500 --> 05:50:12,200 So you can see master and vocal demons are running. 8020 05:50:12,596 --> 05:50:14,903 So let me close this terminal. 8021 05:50:16,300 --> 05:50:18,700 Let me go to the project directory. 8022 05:50:20,600 --> 05:50:22,808 So basically, I have two projects. 8023 05:50:22,808 --> 05:50:25,376 This is cough card transaction producer. 8024 05:50:25,376 --> 05:50:28,852 And the next one is the spark streaming Kafka master. 8025 05:50:28,852 --> 05:50:31,327 So first we will be producing messages 8026 05:50:31,327 --> 05:50:33,400 from Kafka transaction producer 8027 05:50:33,400 --> 05:50:36,200 and then we'll be streaming those records 8028 05:50:36,200 --> 05:50:39,670 which is basically produced by this producer using the spark 8029 05:50:39,670 --> 05:50:41,025 streaming Kafka master. 8030 05:50:41,025 --> 05:50:42,494 So first, let me take you 8031 05:50:42,494 --> 05:50:45,100 through this cough card transaction producer. 8032 05:50:45,100 --> 05:50:47,244 So this is our cornbread XML file. 8033 05:50:47,244 --> 05:50:49,004 Let me open it with G edit. 8034 05:50:49,004 --> 05:50:50,700 So basically this is a me. 8035 05:50:50,700 --> 05:50:54,400 Project and and I have used spring boot server. 8036 05:50:54,800 --> 05:50:57,071 So I have given Java version 8037 05:50:57,071 --> 05:51:00,456 as a you can see cough cough client over here 8038 05:51:00,500 --> 05:51:02,900 and the version of Kafka client, 8039 05:51:03,780 --> 05:51:07,719 then you can see I'm putting Jackson data bind. 8040 05:51:08,800 --> 05:51:13,500 Then ji-sun and then I am packaging it as a war file 8041 05:51:13,600 --> 05:51:15,500 that is web archive file. 
8042 05:51:15,500 --> 05:51:20,000 And here I am again specifying the spring boot Maven plugins, 8043 05:51:20,000 --> 05:51:21,300 which is to be downloaded. 8044 05:51:21,300 --> 05:51:23,258 So let me quickly go ahead 8045 05:51:23,258 --> 05:51:27,100 and close this and we'll go to this Source directory 8046 05:51:27,100 --> 05:51:29,125 and then we'll go inside main. 8047 05:51:29,125 --> 05:51:32,972 So basically this is the file that is sales Jan 2009 file. 8048 05:51:32,972 --> 05:51:35,200 So let me show you the file first. 8049 05:51:37,300 --> 05:51:38,860 So these are the records 8050 05:51:38,860 --> 05:51:41,200 which I'll be producing to the Kafka. 8051 05:51:41,200 --> 05:51:43,600 So the fields are transaction date 8052 05:51:43,600 --> 05:51:45,500 than product price payment 8053 05:51:45,500 --> 05:51:49,767 type the name city state country account created 8054 05:51:49,800 --> 05:51:51,646 then last login latitude 8055 05:51:51,646 --> 05:51:52,846 and longitude. 8056 05:51:52,846 --> 05:51:57,400 So let me close this file and then the application dot. 8057 05:51:57,400 --> 05:51:59,778 Yml is the main property file. 8058 05:51:59,900 --> 05:52:02,654 So in this application dot yml am specifying 8059 05:52:02,654 --> 05:52:04,000 the bootstrap server, 8060 05:52:04,000 --> 05:52:07,900 which is localhost 9:09 to than am specifying the Pause 8061 05:52:07,900 --> 05:52:11,500 which again resides on localhost 9:09 to so here. 8062 05:52:11,500 --> 05:52:16,200 I have specified the broker list now next I have product topic. 8063 05:52:16,200 --> 05:52:19,000 So the topic of the product is transaction. 8064 05:52:19,000 --> 05:52:21,230 Then the partition count is 1 8065 05:52:21,500 --> 05:52:25,800 so basically you're a cks config controls the criteria 8066 05:52:25,800 --> 05:52:29,100 under which requests are considered complete 8067 05:52:29,100 --> 05:52:32,900 and the all setting we have specified will result 8068 05:52:32,900 --> 05:52:35,828 in blocking on the full Committee of the record. 8069 05:52:35,828 --> 05:52:37,225 It is the slowest burn 8070 05:52:37,225 --> 05:52:40,900 the most durable setting not talking about retries. 8071 05:52:40,900 --> 05:52:44,600 So it will retry Thrice then we have mempool size 8072 05:52:44,600 --> 05:52:46,587 and we have maximum pool size, 8073 05:52:46,587 --> 05:52:49,700 which is basically for implementing Java threads 8074 05:52:49,700 --> 05:52:52,000 and at last we have the file path. 8075 05:52:52,000 --> 05:52:53,900 So this is the path of the file, 8076 05:52:53,900 --> 05:52:57,900 which I have shown you just now so messages will be consumed 8077 05:52:57,900 --> 05:52:58,800 from this file. 8078 05:52:58,800 --> 05:53:02,600 Let me quickly close this file and we'll look at application 8079 05:53:02,600 --> 05:53:06,792 but properties so here we have specified the properties 8080 05:53:06,792 --> 05:53:08,600 for Springboard server. 8081 05:53:08,700 --> 05:53:10,877 So we have server context path. 8082 05:53:10,877 --> 05:53:12,185 That is /n Eureka. 8083 05:53:12,185 --> 05:53:14,607 Then we have spring application name 8084 05:53:14,607 --> 05:53:16,301 that is Kafka producer. 8085 05:53:16,301 --> 05:53:17,700 We have server Port 8086 05:53:17,700 --> 05:53:22,200 that is double line W8 and the spring events timeout is 20. 8087 05:53:22,200 --> 05:53:24,430 So let me close this as well. 8088 05:53:24,430 --> 05:53:25,530 Let's go back. 
8089 05:53:25,800 --> 05:53:29,500 Let's go inside Java calm and Eureka Kafka. 8090 05:53:29,700 --> 05:53:33,400 So we'll explore the important files one by one. 8091 05:53:33,400 --> 05:53:36,800 So let me first take you through this dito directory. 8092 05:53:36,900 --> 05:53:39,617 And over here, we have transaction dot Java. 8093 05:53:39,617 --> 05:53:42,253 So basically here we are storing the model. 8094 05:53:42,253 --> 05:53:45,871 So basically you can see these are the fields from the file, 8095 05:53:45,871 --> 05:53:47,372 which I have shown you. 8096 05:53:47,372 --> 05:53:49,200 So we have transaction date. 8097 05:53:49,200 --> 05:53:53,600 We have product price payment type name city state country 8098 05:53:53,600 --> 05:53:57,700 and so on so we have created variable for each field. 8099 05:53:57,700 --> 05:54:01,101 So what we are doing we are basically creating a getter 8100 05:54:01,101 --> 05:54:03,766 and Setter function for all these variables. 8101 05:54:03,766 --> 05:54:05,702 So we have get transaction ID, 8102 05:54:05,702 --> 05:54:08,800 which will basically returned Transaction ID then 8103 05:54:08,800 --> 05:54:10,600 we have sent transaction ID, 8104 05:54:10,600 --> 05:54:13,300 which will basically send the transaction ID. 8105 05:54:13,300 --> 05:54:13,809 Similarly. 8106 05:54:13,809 --> 05:54:17,036 We have get transaction date for getting the transaction date. 8107 05:54:17,036 --> 05:54:19,100 Then we have set transaction date and it 8108 05:54:19,100 --> 05:54:21,900 will set the transaction date using this variable. 8109 05:54:21,900 --> 05:54:25,532 Then we have get products and product get price set price 8110 05:54:25,532 --> 05:54:26,700 and all the getter 8111 05:54:26,700 --> 05:54:29,900 and Setter methods for each of the variable. 8112 05:54:32,000 --> 05:54:34,000 This is the Constructor. 8113 05:54:34,100 --> 05:54:35,615 So here we are taking 8114 05:54:35,615 --> 05:54:39,513 all the parameters like transaction date product price. 8115 05:54:39,513 --> 05:54:42,295 And then we are setting the value of each 8116 05:54:42,295 --> 05:54:44,800 of the variables using this operator. 8117 05:54:44,800 --> 05:54:48,295 So we are setting the value for transaction date product price 8118 05:54:48,295 --> 05:54:51,500 payment and all of the fields that is present over there. 8119 05:54:51,515 --> 05:54:51,900 Next. 8120 05:54:51,900 --> 05:54:55,053 We are also creating a default Constructor 8121 05:54:55,200 --> 05:54:56,616 and then over here. 8122 05:54:56,616 --> 05:54:59,300 We are overriding the tostring method 8123 05:54:59,300 --> 05:55:01,600 and what we are doing we are basically 8124 05:55:02,400 --> 05:55:04,500 The transaction details 8125 05:55:04,500 --> 05:55:06,600 and we are returning transaction date 8126 05:55:06,600 --> 05:55:09,100 and then the value of transaction date product 8127 05:55:09,100 --> 05:55:12,300 then body of product price then value of price 8128 05:55:12,300 --> 05:55:14,900 and so on for all the fields. 8129 05:55:15,300 --> 05:55:18,800 So basically this is the model of the transaction 8130 05:55:18,800 --> 05:55:20,000 so we can go ahead 8131 05:55:20,000 --> 05:55:22,529 and we can create object of this transaction 8132 05:55:22,529 --> 05:55:24,400 and then we can easily go ahead 8133 05:55:24,400 --> 05:55:27,700 and send the transaction object as the value. 8134 05:55:27,700 --> 05:55:29,900 So this is the main reason of creating 8135 05:55:29,900 --> 05:55:31,588 this transaction model, LOL. 
8136 05:55:31,588 --> 05:55:34,000 Me quickly, go ahead and close this file. 8137 05:55:34,000 --> 05:55:38,400 Let's go back and let's first take a look at this config. 8138 05:55:38,615 --> 05:55:41,384 So this is Kafka properties dot Java. 8139 05:55:41,500 --> 05:55:43,202 So what we did again 8140 05:55:43,202 --> 05:55:46,894 as I have shown you the application dot yml file. 8141 05:55:46,942 --> 05:55:48,500 So we have taken all 8142 05:55:48,500 --> 05:55:51,500 the parameters that we have specified over there. 8143 05:55:51,600 --> 05:55:54,600 That is your bootstrap product topic partition count 8144 05:55:54,600 --> 05:55:57,700 then Brokers filename and thread count. 8145 05:55:57,700 --> 05:55:59,322 So all these properties 8146 05:55:59,322 --> 05:56:02,367 then you have file path then all these Days, 8147 05:56:02,367 --> 05:56:04,300 we have taken we have created 8148 05:56:04,300 --> 05:56:07,100 a variable and then what we are doing again, 8149 05:56:07,100 --> 05:56:08,700 we are doing the same thing 8150 05:56:08,700 --> 05:56:11,039 as we did with our transaction model. 8151 05:56:11,039 --> 05:56:12,600 We are creating a getter 8152 05:56:12,600 --> 05:56:15,247 and Setter method for each of these variables. 8153 05:56:15,247 --> 05:56:17,305 So you can see we have get file path 8154 05:56:17,305 --> 05:56:19,300 and we are returning the file path. 8155 05:56:19,300 --> 05:56:20,924 Then we have set file path 8156 05:56:20,924 --> 05:56:24,300 where we are setting the file path using this operator. 8157 05:56:24,300 --> 05:56:24,800 Similarly. 8158 05:56:24,800 --> 05:56:26,600 We have get product topics 8159 05:56:26,600 --> 05:56:29,567 at product topic then we have greater incentive 8160 05:56:29,567 --> 05:56:30,400 for third count. 8161 05:56:30,400 --> 05:56:31,700 We have greater incentive. 8162 05:56:31,700 --> 05:56:36,000 for bootstrap and all those properties No, 8163 05:56:36,100 --> 05:56:37,522 we can again go ahead 8164 05:56:37,522 --> 05:56:40,300 and call this cough cough properties anywhere 8165 05:56:40,300 --> 05:56:41,400 and then we can easily 8166 05:56:41,400 --> 05:56:44,000 extract those values using getter methods. 8167 05:56:44,100 --> 05:56:48,400 So let me quickly close this file and I'll take you 8168 05:56:48,400 --> 05:56:50,500 to the configurations. 8169 05:56:50,900 --> 05:56:52,100 So in this configuration 8170 05:56:52,100 --> 05:56:54,700 what we are doing we are creating the object 8171 05:56:54,700 --> 05:56:56,700 of Kafka properties as you can see, 8172 05:56:57,000 --> 05:56:59,800 so what we are doing then we are again creating a property's 8173 05:56:59,800 --> 05:57:02,600 object and then we are setting the properties 8174 05:57:02,700 --> 05:57:03,800 so you can see 8175 05:57:03,800 --> 05:57:06,800 that we are Setting the bootstrap server config 8176 05:57:06,800 --> 05:57:08,400 and then we are retrieving 8177 05:57:08,400 --> 05:57:11,900 the value using the cough cough properties object. 8178 05:57:11,900 --> 05:57:14,300 And this is the get bootstrap server function. 8179 05:57:14,300 --> 05:57:17,500 Then you can see we are setting the acknowledgement config 8180 05:57:17,500 --> 05:57:18,400 and we are getting 8181 05:57:18,400 --> 05:57:22,100 the acknowledgement from this get acknowledgement function. 8182 05:57:22,100 --> 05:57:24,900 And then we are using this get rate rise method. 8183 05:57:24,900 --> 05:57:27,300 So from all these Kafka properties object. 
8184 05:57:27,300 --> 05:57:29,000 We are calling those getter methods 8185 05:57:29,000 --> 05:57:30,700 and retrieving those values 8186 05:57:30,700 --> 05:57:34,100 and setting those values in this property object. 8187 05:57:34,100 --> 05:57:36,900 So We have partitioner class. 8188 05:57:37,000 --> 05:57:40,294 So we are basically implementing this default partitioner 8189 05:57:40,294 --> 05:57:41,400 which is present in 8190 05:57:41,400 --> 05:57:45,700 over G. Apache car park client producer internals package. 8191 05:57:45,700 --> 05:57:48,600 Then we are creating a producer over here 8192 05:57:48,600 --> 05:57:50,756 and we are passing this props 8193 05:57:50,756 --> 05:57:54,400 object which will set the properties so over here. 8194 05:57:54,400 --> 05:57:56,684 We are passing the key serializer, 8195 05:57:56,684 --> 05:57:58,900 which is the string T serializer. 8196 05:57:58,900 --> 05:58:00,100 And then this is 8197 05:58:00,100 --> 05:58:04,400 the value serializer in which we are creating new customer. 8198 05:58:04,400 --> 05:58:07,500 Distance Eliezer and then we are passing transaction 8199 05:58:07,500 --> 05:58:10,400 over here and then it will return the producer 8200 05:58:10,500 --> 05:58:13,735 and then we are implementing thread we are again getting 8201 05:58:13,735 --> 05:58:15,200 the get minimum pool size 8202 05:58:15,200 --> 05:58:17,700 from Kafka properties and get maximum pool size 8203 05:58:17,700 --> 05:58:18,700 from Kafka property. 8204 05:58:18,700 --> 05:58:19,600 So we're here. 8205 05:58:19,600 --> 05:58:22,000 We are implementing Java threads now. 8206 05:58:22,000 --> 05:58:25,534 Let me quickly close this cough pop producer configuration 8207 05:58:25,534 --> 05:58:28,200 where we are configuring our Kafka producer. 8208 05:58:28,461 --> 05:58:29,538 Let's go back. 8209 05:58:30,400 --> 05:58:32,800 Let's quickly go to this API 8210 05:58:32,946 --> 05:58:36,253 which have event producer EPA dot Java file. 8211 05:58:36,300 --> 05:58:40,130 So here we are basically creating an event producer API 8212 05:58:40,130 --> 05:58:42,400 which has this dispatch function. 8213 05:58:42,400 --> 05:58:46,900 So we'll use this dispatch function to send the records. 8214 05:58:47,180 --> 05:58:49,719 So let me quickly close this file. 8215 05:58:50,061 --> 05:58:51,138 Let's go back. 8216 05:58:51,300 --> 05:58:53,475 We have already seen this config 8217 05:58:53,475 --> 05:58:54,700 and configurations 8218 05:58:54,700 --> 05:58:57,100 in which we are basically retrieving those values 8219 05:58:57,100 --> 05:58:58,984 from application dot yml file 8220 05:58:58,984 --> 05:59:02,300 and then we are Setting the producer configurations, 8221 05:59:02,300 --> 05:59:04,000 then we have constants. 8222 05:59:04,000 --> 05:59:07,100 So in Kafka constants or Java, 8223 05:59:07,200 --> 05:59:09,900 we have created this Kafka constant interface 8224 05:59:09,900 --> 05:59:11,393 where we have specified 8225 05:59:11,393 --> 05:59:14,925 the batch size account limit check some limit then read 8226 05:59:14,925 --> 05:59:17,494 batch size minimum balance maximum balance 8227 05:59:17,494 --> 05:59:19,500 minimum account maximum account. 8228 05:59:19,500 --> 05:59:22,604 Then we are also implementing daytime for matter. 8229 05:59:22,604 --> 05:59:25,643 So we are specifying all the constants over here. 8230 05:59:25,643 --> 05:59:27,100 Let me close this file. 
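Pulling together the producer configuration just walked through, here is a minimal, hedged sketch of the same idea in plain Java: properties for the bootstrap server, acks and retries, a String key serializer, and a custom JSON value serializer. The Transaction class is only a stand-in for the demo's TransactionDTO, and the sketch assumes a reasonably recent kafka-clients version in which Serializer's configure() and close() methods have default implementations.

    import java.util.Properties;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.Serializer;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class JsonProducerSketch {

        // Kafka only accepts bytes, so the value object is turned into a JSON byte array
        static class JsonSerializer<T> implements Serializer<T> {
            private final ObjectMapper mapper = new ObjectMapper();
            @Override
            public byte[] serialize(String topic, T data) {
                try {
                    return mapper.writeValueAsBytes(data);
                } catch (Exception e) {
                    throw new RuntimeException("JSON serialization failed for object " + data, e);
                }
            }
        }

        // Hypothetical value type standing in for the demo's TransactionDTO
        static class Transaction {
            public String transactionId = "t-1";
            public String product = "product-1";
            public double price = 1200.0;
        }

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // broker address
            props.put(ProducerConfig.ACKS_CONFIG, "all");   // wait for the full commit: slowest but most durable
            props.put(ProducerConfig.RETRIES_CONFIG, 3);    // retry a failed send up to three times

            // String keys, JSON-serialized values
            KafkaProducer<String, Transaction> producer =
                    new KafkaProducer<>(props, new StringSerializer(), new JsonSerializer<Transaction>());

            Transaction tx = new Transaction();
            producer.send(new ProducerRecord<>("transaction", tx.transactionId, tx));
            producer.close();
        }
    }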
8231 05:59:27,100 --> 05:59:31,300 Let's go back then this is Manso will not look 8232 05:59:31,300 --> 05:59:32,506 at these two files, 8233 05:59:32,506 --> 05:59:35,300 but let me tell you what does these two files 8234 05:59:35,300 --> 05:59:39,400 to these two files are basically to record the metrics 8235 05:59:39,400 --> 05:59:42,000 of your Kafka like time in which 8236 05:59:42,000 --> 05:59:44,889 your thousand records have been produced in cough power. 8237 05:59:44,889 --> 05:59:45,781 You can say time 8238 05:59:45,781 --> 05:59:48,400 in which records are getting published to Kafka. 8239 05:59:48,400 --> 05:59:51,936 It will be monitored and then you can record those starts. 8240 05:59:51,936 --> 05:59:53,292 So basically it helps 8241 05:59:53,292 --> 05:59:57,100 in optimizing the performance of your Kafka producer, right? 8242 05:59:57,100 --> 05:59:59,863 You can actually know how to do Recon. 8243 05:59:59,863 --> 06:00:03,000 How to add just those configuration factors 8244 06:00:03,000 --> 06:00:05,041 and then you can see the difference 8245 06:00:05,041 --> 06:00:07,159 or you can actually monitor the stats 8246 06:00:07,159 --> 06:00:08,259 and then understand 8247 06:00:08,259 --> 06:00:11,612 or how you can actually make your producer more efficient. 8248 06:00:11,612 --> 06:00:13,039 So these are basically 8249 06:00:13,039 --> 06:00:16,800 for those factors but let's not worry about this right now. 8250 06:00:16,900 --> 06:00:18,600 Let's go back next. 8251 06:00:18,600 --> 06:00:21,500 Let me quickly take you through this file utility. 8252 06:00:21,500 --> 06:00:24,000 So you have file you treated or Java. 8253 06:00:24,000 --> 06:00:26,600 So basically what we are doing over here, 8254 06:00:26,600 --> 06:00:28,550 we are reading each record 8255 06:00:28,550 --> 06:00:32,200 from the file we using For reader so over here, 8256 06:00:32,200 --> 06:00:36,900 you can see we have this list and then we have bufferedreader. 8257 06:00:36,900 --> 06:00:38,700 Then we have file reader. 8258 06:00:38,700 --> 06:00:41,000 So first we are reading the file 8259 06:00:41,000 --> 06:00:44,105 and then we are trying to split each of the fields 8260 06:00:44,105 --> 06:00:45,500 present in the record. 8261 06:00:45,500 --> 06:00:49,500 And then we are setting the value of those fields over here. 8262 06:00:49,700 --> 06:00:52,407 Then we are specifying some of the exceptions 8263 06:00:52,407 --> 06:00:54,900 that may occur like number format exception 8264 06:00:54,900 --> 06:00:57,500 or pass exception all those kind of exception 8265 06:00:57,500 --> 06:01:00,900 we have specified over here and then we are Closing this 8266 06:01:00,900 --> 06:01:01,959 so in this file. 8267 06:01:01,959 --> 06:01:04,746 We are basically reading the records now. 8268 06:01:04,746 --> 06:01:06,000 Let me close this. 8269 06:01:06,000 --> 06:01:07,100 Let's go back. 8270 06:01:07,500 --> 06:01:07,766 Now. 8271 06:01:07,766 --> 06:01:10,500 Let's take a quick look at the seal lizer. 8272 06:01:10,500 --> 06:01:13,100 So this is custom Jason serializer. 8273 06:01:13,500 --> 06:01:15,100 So in serializer, 8274 06:01:15,100 --> 06:01:18,000 we have created a custom decency réaliser. 8275 06:01:18,000 --> 06:01:22,023 Now, this is basically to write the values as bites. 
8276 06:01:22,100 --> 06:01:26,082 So the data which you will be passing will be written in bytes 8277 06:01:26,082 --> 06:01:27,197 because as we know 8278 06:01:27,197 --> 06:01:29,800 that data is sent to Kafka and form of pie. 8279 06:01:29,800 --> 06:01:32,000 And this is the reason why we have created 8280 06:01:32,000 --> 06:01:33,700 this custom Jason serializer. 8281 06:01:33,930 --> 06:01:37,469 So now let me quickly close this let's go back. 8282 06:01:37,700 --> 06:01:41,800 This file is basically for my spring boot web application. 8283 06:01:41,900 --> 06:01:44,200 So let's not get into this. 8284 06:01:44,300 --> 06:01:47,100 Let's look at events Red Dot Java. 8285 06:01:47,865 --> 06:01:51,634 So basically over here we have event producer API. 8286 06:01:52,300 --> 06:01:57,100 So now we are trying to dispatch those events and to show you 8287 06:01:57,100 --> 06:01:58,988 how dispatch function works. 8288 06:01:58,988 --> 06:02:00,000 Let me go back. 8289 06:02:00,000 --> 06:02:01,691 Let me open services 8290 06:02:01,700 --> 06:02:05,000 and even producer I MPL is implementation. 8291 06:02:05,000 --> 06:02:08,100 So let me show you how this dispatch works. 8292 06:02:08,100 --> 06:02:10,400 So basically over here what we are doing first. 8293 06:02:10,400 --> 06:02:11,576 We are initializing. 8294 06:02:11,576 --> 06:02:13,047 So using the file utility. 8295 06:02:13,047 --> 06:02:16,000 We are basically reading the files and read the file. 8296 06:02:16,000 --> 06:02:19,356 We are getting the path using this Kafka properties object 8297 06:02:19,356 --> 06:02:22,300 and we are calling this getter method of file path. 8298 06:02:22,300 --> 06:02:24,900 Then what we are doing we are basically taking 8299 06:02:24,900 --> 06:02:25,900 the product list 8300 06:02:25,900 --> 06:02:28,700 and then we are trying to dispatch it so 8301 06:02:28,700 --> 06:02:32,800 in dispatch Are basically using Kafka producer 8302 06:02:33,600 --> 06:02:37,000 and then we are creating the object of the producer record. 8303 06:02:37,000 --> 06:02:41,594 Then we are using the get topic from this calf pad properties. 8304 06:02:41,594 --> 06:02:44,004 We are getting this transaction ID 8305 06:02:44,004 --> 06:02:45,459 from the transaction 8306 06:02:45,459 --> 06:02:49,540 and then we are using event producer send to send the data. 8307 06:02:49,540 --> 06:02:51,300 And finally we are trying 8308 06:02:51,300 --> 06:02:54,827 to monitor this but let's not worry about the monitoring 8309 06:02:54,827 --> 06:02:57,200 and cash the monitoring and start spot 8310 06:02:57,200 --> 06:02:59,661 so we can ignore this part Nets. 8311 06:02:59,800 --> 06:03:03,700 Let's quickly go back and look at the last file 8312 06:03:03,700 --> 06:03:05,100 which is producer. 8313 06:03:05,600 --> 06:03:07,835 So let me show you this event producer. 8314 06:03:07,835 --> 06:03:09,300 So what we are doing here, 8315 06:03:09,300 --> 06:03:11,500 we are actually creating a logger. 8316 06:03:11,900 --> 06:03:13,500 So in this on completion method, 8317 06:03:13,500 --> 06:03:16,300 we are basically passing the record metadata. 8318 06:03:16,300 --> 06:03:20,838 And if your e-except shin is not null then it will basically 8319 06:03:20,838 --> 06:03:25,200 throw an error saying this and recorded metadata else. 8320 06:03:25,400 --> 06:03:29,700 It will give you the send message to topic partition. 
8321 06:03:29,700 --> 06:03:32,300 All set and then the record metadata 8322 06:03:32,300 --> 06:03:34,564 and topic and then it will give 8323 06:03:34,564 --> 06:03:38,800 you all the details regarding topic partitions and offsets. 8324 06:03:38,800 --> 06:03:40,888 So I hope that you guys have understood 8325 06:03:40,888 --> 06:03:44,110 how this cough cough producer is working now is the time we 8326 06:03:44,110 --> 06:03:47,169 need to go ahead and we need to quickly execute this. 8327 06:03:47,169 --> 06:03:49,200 So let me open a terminal over here. 8328 06:03:49,500 --> 06:03:51,653 No first build this project. 8329 06:03:51,653 --> 06:03:54,423 We need to execute mvn clean install. 8330 06:03:54,900 --> 06:03:56,800 This will install all the dependencies. 8331 06:04:01,600 --> 06:04:04,100 So as you can see our build is successful. 8332 06:04:04,100 --> 06:04:08,111 So let me minimize this and this target directory is created 8333 06:04:08,111 --> 06:04:10,394 after you build a wave in project. 8334 06:04:10,394 --> 06:04:11,778 So let me quickly go 8335 06:04:11,778 --> 06:04:16,000 inside this target directory and this is the root dot bar file 8336 06:04:16,000 --> 06:04:18,300 that is root dot web archive file 8337 06:04:18,300 --> 06:04:19,897 which we need to execute. 8338 06:04:19,897 --> 06:04:22,900 So let's quickly go ahead and execute this file. 8339 06:04:23,100 --> 06:04:24,755 But before this to verify 8340 06:04:24,755 --> 06:04:27,800 whether the data is getting produced in our car 8341 06:04:27,800 --> 06:04:29,900 for topics so for testing 8342 06:04:29,900 --> 06:04:33,300 as I already told you We need to go ahead 8343 06:04:33,300 --> 06:04:36,200 and we need to open a console consumer 8344 06:04:36,500 --> 06:04:37,500 so that we can check 8345 06:04:37,500 --> 06:04:40,200 that whether data is getting published or not. 8346 06:04:42,400 --> 06:04:45,100 So let me quickly minimize this. 8347 06:04:48,300 --> 06:04:52,700 So let's quickly go to Kafka directory and the command 8348 06:04:52,700 --> 06:04:59,300 is dot slash bin Kafka console consumer and then - 8349 06:04:59,300 --> 06:05:01,500 - bootstrap server. 8350 06:05:14,800 --> 06:05:21,964 nine zero nine two Okay, I'll let me check the topic. 8351 06:05:21,964 --> 06:05:23,271 What's the topic? 8352 06:05:24,000 --> 06:05:27,000 Let's go to our application dot yml file. 8353 06:05:27,000 --> 06:05:31,000 So the topic name is transaction. 8354 06:05:31,000 --> 06:05:35,100 Let me quickly minimize this specify the topic name 8355 06:05:35,100 --> 06:05:36,500 and I'll hit enter. 8356 06:05:36,500 --> 06:05:41,300 So now let me place this console aside. 8357 06:05:41,300 --> 06:05:45,900 And now let's quickly go ahead and execute our project. 8358 06:05:45,900 --> 06:05:49,400 So for that the command is Java - 8359 06:05:49,400 --> 06:05:52,938 jar and then we'll provide the path of the file 8360 06:05:52,938 --> 06:05:54,100 that is inside. 8361 06:05:54,300 --> 06:05:59,700 Great, and the file is rude dot war and here we go. 8362 06:06:18,100 --> 06:06:20,955 So now you can see in the console consumer. 8363 06:06:20,955 --> 06:06:23,200 The records are getting published. 8364 06:06:23,200 --> 06:06:23,700 Right? 8365 06:06:24,000 --> 06:06:25,903 So there are multiple records 8366 06:06:25,903 --> 06:06:29,118 which have been published in our transaction topic 8367 06:06:29,118 --> 06:06:32,400 and you can verify this using the console consumer. 
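Coming back to the onCompletion callback described a moment ago: the same logging pattern can be sketched with the standard producer API as follows. The topic, key and value here are placeholders rather than the demo's exact code.

    import org.apache.kafka.clients.producer.Callback;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    public class SendWithCallbackSketch {
        // Sends one record and logs either the error or the topic/partition/offset it landed on
        static void sendAndLog(KafkaProducer<String, String> producer, String topic, String key, String value) {
            producer.send(new ProducerRecord<>(topic, key, value), new Callback() {
                @Override
                public void onCompletion(RecordMetadata metadata, Exception exception) {
                    if (exception != null) {
                        System.err.println("Send failed: " + exception.getMessage());
                    } else {
                        System.out.printf("Sent to topic=%s partition=%d offset=%d%n",
                                metadata.topic(), metadata.partition(), metadata.offset());
                    }
                }
            });
        }
    }

That is the same "log the exception, otherwise log the record metadata" behaviour the producer project's callback was just shown to implement.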
8368 06:06:32,400 --> 06:06:33,145 So this is 8369 06:06:33,145 --> 06:06:36,500 where the developers use the console consumer. 8370 06:06:38,000 --> 06:06:40,980 So now we have successfully verified our producer. 8371 06:06:40,980 --> 06:06:43,900 So let me quickly go ahead and stop the producer. 8372 06:06:45,500 --> 06:06:48,200 Lat, let me stop consumer as well. 8373 06:06:49,400 --> 06:06:51,370 Let's quickly minimize this 8374 06:06:51,370 --> 06:06:54,144 and now let's go to the second project. 8375 06:06:54,144 --> 06:06:56,700 That is Park streaming Kafka Master. 8376 06:06:56,900 --> 06:06:57,200 Again. 8377 06:06:57,200 --> 06:06:59,667 We have specified all the dependencies 8378 06:06:59,667 --> 06:07:00,800 that is required. 8379 06:07:01,000 --> 06:07:03,700 Let me quickly show you those dependencies. 8380 06:07:07,700 --> 06:07:09,800 Now again, you can see were here. 8381 06:07:09,800 --> 06:07:12,400 We have specified Java version then we 8382 06:07:12,400 --> 06:07:16,600 have specified the artifacts or you can see the dependencies. 8383 06:07:16,796 --> 06:07:18,796 So we have Scala compiler. 8384 06:07:18,796 --> 06:07:21,411 Then we have spark streaming Kafka. 8385 06:07:21,900 --> 06:07:24,200 Then we have cough cough clients. 8386 06:07:24,400 --> 06:07:28,400 Then Json data binding then we have Maven compiler plug-in. 8387 06:07:28,400 --> 06:07:30,600 So all those dependencies which are required. 8388 06:07:30,600 --> 06:07:32,300 We are specified over here. 8389 06:07:32,500 --> 06:07:35,500 So let me quickly go ahead and close it. 8390 06:07:36,200 --> 06:07:40,503 Let's quickly move to the source directory main then let's look 8391 06:07:40,503 --> 06:07:42,100 at the resources again. 8392 06:07:42,203 --> 06:07:44,896 So this is application dot yml file. 8393 06:07:45,700 --> 06:07:46,700 So we have put 8394 06:07:46,700 --> 06:07:49,600 eight zero eight zero then we have bootstrap server over here. 8395 06:07:49,600 --> 06:07:51,100 Then we have proven over here. 8396 06:07:51,100 --> 06:07:53,200 Then we have topic is as transaction. 8397 06:07:53,200 --> 06:07:56,000 The group is transaction partition count is one 8398 06:07:56,000 --> 06:07:57,273 and then the file name 8399 06:07:57,273 --> 06:07:59,664 so we won't be using this file name then. 8400 06:07:59,664 --> 06:08:01,900 Let me quickly go ahead and close this. 8401 06:08:01,900 --> 06:08:02,984 Let's go back. 8402 06:08:02,984 --> 06:08:06,600 Let's go back to Java directory comms Park demo, 8403 06:08:06,600 --> 06:08:08,200 then this is the model. 8404 06:08:08,200 --> 06:08:10,100 So it's same 8405 06:08:10,600 --> 06:08:13,011 so these are all the fields that are there 8406 06:08:13,011 --> 06:08:15,800 in the transaction you have transaction. 8407 06:08:15,800 --> 06:08:18,100 Eight product price payment type 8408 06:08:18,100 --> 06:08:22,500 the name city state country account created and so on. 8409 06:08:22,500 --> 06:08:25,100 And again, we have specified all the getter 8410 06:08:25,100 --> 06:08:29,285 and Setter methods over here and similarly again, 8411 06:08:29,285 --> 06:08:32,600 we have created this transaction dto Constructor 8412 06:08:32,600 --> 06:08:34,900 where we have taken all the parameters 8413 06:08:34,900 --> 06:08:38,200 and then we have setting the values using this operator. 8414 06:08:38,200 --> 06:08:39,100 Next. 8415 06:08:39,100 --> 06:08:42,400 We are again over adding this tostring function 8416 06:08:42,400 --> 06:08:43,414 and over here. 
8417 06:08:43,414 --> 06:08:47,500 We are again returning the details like transaction date
8418 06:08:47,500 --> 06:08:49,700 and then the value of transaction date, product
8419 06:08:49,700 --> 06:08:53,200 and then the value of product, and similarly all the fields.
8420 06:08:53,411 --> 06:08:55,488 So let me close this model.
8421 06:08:55,900 --> 06:08:57,100 Let's go back.
8422 06:08:57,200 --> 06:09:00,500 Let's look at the Kafka package; then we have the serializer.
8423 06:09:00,500 --> 06:09:02,294 So this is the JSON serializer
8424 06:09:02,294 --> 06:09:06,187 which was there in our producer, and this is the transaction decoder.
8425 06:09:06,187 --> 06:09:07,300 Let's take a look.
8426 06:09:07,780 --> 06:09:09,319 Now you have a decoder
8427 06:09:09,400 --> 06:09:12,600 which is again implementing Decoder, and we're passing
8428 06:09:12,600 --> 06:09:14,800 this transaction DTO. Then again,
8429 06:09:14,800 --> 06:09:17,339 you can see we have this fromBytes method
8430 06:09:17,339 --> 06:09:18,800 which we are overriding,
8431 06:09:18,800 --> 06:09:22,022 and we are reading the values using these bytes
8432 06:09:22,022 --> 06:09:24,600 and the transaction DTO class. Again,
8433 06:09:24,600 --> 06:09:28,600 if it is failing to parse, we are giving "JSON processing failed
8434 06:09:28,600 --> 06:09:29,799 for object" for this,
8435 06:09:30,200 --> 06:09:31,573 and you can see we have
8436 06:09:31,573 --> 06:09:34,200 this transaction decoder constructor over here.
8437 06:09:34,200 --> 06:09:37,200 So let me quickly again close this file.
8438 06:09:37,200 --> 06:09:38,892 Let's quickly go back.
8439 06:09:39,400 --> 06:09:42,500 And now let's take a look at the Spark streaming app,
8440 06:09:42,500 --> 06:09:44,200 where basically the data
8441 06:09:44,200 --> 06:09:48,100 which the producer project will be producing to Kafka
8442 06:09:48,100 --> 06:09:51,900 will actually be consumed by the Spark streaming application.
8443 06:09:51,900 --> 06:09:55,071 So Spark streaming will stream the data in real time
8444 06:09:55,071 --> 06:09:57,000 and then will display the data.
8445 06:09:57,000 --> 06:09:59,600 So in this Spark streaming application,
8446 06:09:59,600 --> 06:10:03,189 we are creating a conf object and then we are setting
8447 06:10:03,189 --> 06:10:05,900 the application name as Kafka sandbox.
8448 06:10:05,900 --> 06:10:09,331 The master is local[*]; then we have the
8449 06:10:09,331 --> 06:10:13,100 JavaSparkContext, so here we are specifying the Spark context,
8450 06:10:13,100 --> 06:10:16,700 and then next we are specifying the Java streaming context.
8451 06:10:16,700 --> 06:10:18,500 So this object will basically
8452 06:10:18,500 --> 06:10:21,100 be used to take the streaming data.
8453 06:10:21,100 --> 06:10:25,946 So we are passing this JavaSparkContext over here as a parameter,
8454 06:10:25,946 --> 06:10:29,900 and then we are specifying the duration, that is 2000.
8455 06:10:29,900 --> 06:10:30,200 Next,
8456 06:10:30,200 --> 06:10:32,600 we have the Kafka parameters. So to connect
8457 06:10:32,600 --> 06:10:35,555 to Kafka you need to specify these parameters.
8458 06:10:35,555 --> 06:10:37,100 So in the Kafka parameters,
8459 06:10:37,100 --> 06:10:39,500 we are specifying the metadata.broker.list,
8460 06:10:39,500 --> 06:10:44,292 which is localhost:9092; then we have auto.offset.reset,
8461 06:10:44,292 --> 06:10:45,600 that is smallest.
8462 06:10:45,600 --> 06:10:49,200 Then in topics, the name of the topic from which we
8463 06:10:49,200 --> 06:10:53,300 will be consuming messages is transaction. Next,
8464 06:10:53,300 --> 06:10:56,200 we're creating a JavaPairInputDStream.
8465 06:10:56,200 --> 06:10:59,300 So basically this DStream is a discretized stream,
8466 06:10:59,300 --> 06:11:02,300 which is the basic abstraction of Spark streaming
8467 06:11:02,300 --> 06:11:04,290 and is a continuous sequence
8468 06:11:04,290 --> 06:11:07,104 of RDDs representing a continuous stream
8469 06:11:07,104 --> 06:11:11,200 of data. Now the DStream can either be created from live data
8470 06:11:11,200 --> 06:11:13,000 from Kafka, HDFS or Flume,
8471 06:11:13,000 --> 06:11:14,457 or it can be generated
8472 06:11:14,457 --> 06:11:17,900 by transforming existing DStreams using operations.
8473 06:11:17,900 --> 06:11:18,828 So over here
8474 06:11:18,828 --> 06:11:21,700 we are again creating a Java input DStream.
8475 06:11:21,700 --> 06:11:24,700 We are passing String and TransactionDTO as parameters
8476 06:11:24,700 --> 06:11:27,504 and we are creating the direct Kafka stream object.
8477 06:11:27,504 --> 06:11:29,700 Then we're using this KafkaUtils
8478 06:11:29,700 --> 06:11:33,000 and we are calling the method createDirectStream,
8479 06:11:33,000 --> 06:11:35,885 where we are passing the parameters as ssc,
8480 06:11:35,885 --> 06:11:38,700 that is your Spark streaming context; then
8481 06:11:38,700 --> 06:11:40,341 you have String.class,
8482 06:11:40,341 --> 06:11:42,829 which is basically your key type;
8483 06:11:42,829 --> 06:11:45,322 then TransactionDTO.class,
8484 06:11:45,322 --> 06:11:46,500 that is basically
8485 06:11:46,500 --> 06:11:49,700 your value type; then StringDecoder,
8486 06:11:49,700 --> 06:11:52,868 that is to decode your key; and then TransactionDecoder,
8487 06:11:52,868 --> 06:11:55,900 basically to decode your transaction.
8488 06:11:55,900 --> 06:11:57,784 Then you have the Kafka parameters,
8489 06:11:57,784 --> 06:11:59,501 which you have created here,
8490 06:11:59,501 --> 06:12:02,300 where you have specified the broker list and auto
8491 06:12:02,300 --> 06:12:05,900 offset reset, and then you are specifying the topics,
8492 06:12:05,900 --> 06:12:10,500 which is your transaction. So next, using this direct Kafka stream,
8493 06:12:10,500 --> 06:12:14,000 you're actually continuously iterating over the RDDs,
8494 06:12:14,000 --> 06:12:17,345 and then you are trying to print your new RDD
8495 06:12:17,345 --> 06:12:19,400 with the RDD partitions
8496 06:12:19,400 --> 06:12:21,200 and size, then the RDD count,
8497 06:12:21,200 --> 06:12:24,600 and the records: so for each record in the RDD,
8498 06:12:24,600 --> 06:12:26,400 you are printing the record,
8499 06:12:26,500 --> 06:12:30,400 and then you are starting the Spark streaming context
8500 06:12:30,400 --> 06:12:32,800 and then you are waiting for the termination.
8501 06:12:32,800 --> 06:12:35,500 So this is the Spark streaming application.
8502 06:12:35,500 --> 06:12:39,200 So let's first quickly go ahead and execute this application.
8503 06:12:39,200 --> 06:12:40,900 Let me close this file.
8504 06:12:41,000 --> 06:12:43,400 Let's go to the source.
8505 06:12:44,900 --> 06:12:49,000 Now, let me quickly go ahead and delete this target directory.
8506 06:12:49,000 --> 06:12:53,615 So now let me quickly open the terminal and run mvn clean install.
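Before looking at the build and run steps, here is a minimal, hedged sketch of the direct-stream setup just described, written against the older spark-streaming-kafka 0.8-style API that the walkthrough refers to. To keep it self-contained, the values are left as plain Strings with a StringDecoder instead of the demo's custom TransactionDecoder, and the SparkConf is passed straight to the streaming context.

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    import kafka.serializer.StringDecoder;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaPairInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka.KafkaUtils;

    public class KafkaSparkStreamingSketch {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setAppName("KafkaSandboxSketch").setMaster("local[*]");
            // Micro-batches every 2 seconds, matching the 2000 ms duration mentioned in the walkthrough
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(2));

            Map<String, String> kafkaParams = new HashMap<>();
            kafkaParams.put("metadata.broker.list", "localhost:9092");
            kafkaParams.put("auto.offset.reset", "smallest");

            Set<String> topics = new HashSet<>(Arrays.asList("transaction"));

            // Direct stream over the "transaction" topic; values kept as plain Strings here,
            // whereas the demo plugs in its own TransactionDecoder for the value type
            JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(
                    jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
                    kafkaParams, topics);

            // For every micro-batch RDD, print partition count, record count, and each record
            stream.foreachRDD(rdd -> {
                System.out.println("partitions=" + rdd.getNumPartitions() + " count=" + rdd.count());
                rdd.collect().forEach(record -> System.out.println(record));
            });

            jssc.start();
            jssc.awaitTermination();
        }
    }

The foreachRDD body mirrors what the walkthrough prints for each micro-batch: the number of partitions, the record count, and then every record.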
8507 06:12:58,400 --> 06:13:01,800 So now as you can see the target directory is again created 8508 06:13:01,800 --> 06:13:05,307 and this park streaming Kafka snapshot jar is created. 8509 06:13:05,307 --> 06:13:07,300 So we need to execute this jar. 8510 06:13:07,700 --> 06:13:10,800 So let me quickly go ahead and minimize it. 8511 06:13:12,500 --> 06:13:14,300 Let me close this terminal. 8512 06:13:14,400 --> 06:13:18,000 So now first I'll start this pop streaming job. 8513 06:13:18,600 --> 06:13:24,100 So the command is Java - jar inside the target directory. 8514 06:13:24,600 --> 06:13:31,500 We have this spark streaming of college are so let's hit enter. 8515 06:13:34,500 --> 06:13:38,100 So let me know quickly go ahead and start producing messages. 8516 06:13:41,000 --> 06:13:44,100 So I will minimize this and I will wait for the messages. 8517 06:13:50,019 --> 06:13:53,480 So let me quickly close this pot streaming job 8518 06:13:53,600 --> 06:13:56,900 and then I will show you the consumed records 8519 06:13:59,000 --> 06:14:00,400 so you can see the record 8520 06:14:00,400 --> 06:14:02,673 that is consumed from spark streaming. 8521 06:14:02,673 --> 06:14:05,500 So here you have got record and transaction dto 8522 06:14:05,500 --> 06:14:08,561 and then transaction date products all the details, 8523 06:14:08,561 --> 06:14:09,969 which we are specified. 8524 06:14:09,969 --> 06:14:11,500 You can see it over here. 8525 06:14:11,500 --> 06:14:15,400 So this is how spark streaming works with Kafka now, 8526 06:14:15,400 --> 06:14:17,600 it's just a basic job again. 8527 06:14:17,600 --> 06:14:20,900 You can go ahead and you can take Those transaction you 8528 06:14:20,900 --> 06:14:23,651 can perform some real-time analytics over there 8529 06:14:23,651 --> 06:14:27,406 and then you can go ahead and write those results so over here 8530 06:14:27,406 --> 06:14:29,500 we have just given you a basic demo 8531 06:14:29,500 --> 06:14:32,401 in which we are producing the records to Kafka 8532 06:14:32,401 --> 06:14:34,400 and then using spark streaming. 8533 06:14:34,400 --> 06:14:37,533 We are streaming those records from Kafka again. 8534 06:14:37,533 --> 06:14:38,600 You can go ahead 8535 06:14:38,600 --> 06:14:41,083 and you can perform multiple Transformations 8536 06:14:41,083 --> 06:14:42,848 over the data multiple actions 8537 06:14:42,848 --> 06:14:45,500 and produce some real-time results using this data. 8538 06:14:45,500 --> 06:14:48,975 So this is just a basic demo where we have shown you 8539 06:14:48,975 --> 06:14:51,700 how to basically produce recalls to Kafka 8540 06:14:51,700 --> 06:14:55,000 and then consume those records using spark streaming. 8541 06:14:55,000 --> 06:14:57,846 So let's quickly go back to our slide. 8542 06:14:58,600 --> 06:15:00,526 Now as this was a basic project. 8543 06:15:00,526 --> 06:15:01,669 Let me explain you 8544 06:15:01,669 --> 06:15:04,390 one of the cough by spark streaming project, 8545 06:15:04,390 --> 06:15:05,754 which is a Ted Eureka. 8546 06:15:05,754 --> 06:15:09,100 So basically there is a company called Tech review.com. 8547 06:15:09,100 --> 06:15:11,900 So this take review.com basically provide reviews 8548 06:15:11,900 --> 06:15:14,481 for your recent and different Technologies, 8549 06:15:14,481 --> 06:15:17,800 like a smart watches phones different operating systems 8550 06:15:17,800 --> 06:15:20,100 and anything new that is coming into Market. 
8551 06:15:20,100 --> 06:15:23,409 So what happens is the company decided to include a new feature 8552 06:15:23,409 --> 06:15:26,883 which will basically allow users to compare the popularity 8553 06:15:26,883 --> 06:15:29,200 or trend of multiple technologies based 8554 06:15:29,200 --> 06:15:32,400 on the Twitter feeds, and second, for the USP, 8555 06:15:32,400 --> 06:15:33,500 they are basically 8556 06:15:33,500 --> 06:15:36,200 aiming for this comparison to happen in real time. 8557 06:15:36,200 --> 06:15:38,788 So basically they have assigned you this task, 8558 06:15:38,788 --> 06:15:41,299 so you have to go ahead, you have to take 8559 06:15:41,299 --> 06:15:42,752 the real-time Twitter feeds, 8560 06:15:42,752 --> 06:15:45,400 then you have to show the real-time comparison 8561 06:15:45,400 --> 06:15:46,900 of various technologies. 8562 06:15:46,900 --> 06:15:50,500 So again, the company is asking you to identify 8563 06:15:50,500 --> 06:15:51,684 the minute-wise trend 8564 06:15:51,684 --> 06:15:55,500 between different technologies by consuming Twitter streams 8565 06:15:55,500 --> 06:15:58,900 and writing the aggregated minutely counts to Cassandra, 8566 06:15:58,900 --> 06:16:00,200 from where again the dash- 8567 06:16:00,200 --> 06:16:02,700 boarding team will come into the picture, and then they 8568 06:16:02,700 --> 06:16:06,700 will try to dashboard that data, and it can show you a graph 8569 06:16:06,700 --> 06:16:07,800 where you can see 8570 06:16:07,800 --> 06:16:09,892 how the trend of two different, 8571 06:16:09,892 --> 06:16:13,656 or you can say various, technologies is going ahead. Now, 8572 06:16:13,656 --> 06:16:16,157 the solution strategy which is there: 8573 06:16:16,157 --> 06:16:20,083 you have to continuously stream the data from Twitter. 8574 06:16:20,083 --> 06:16:21,689 Then you will be storing 8575 06:16:21,689 --> 06:16:24,322 those tweets inside a Kafka topic. 8576 06:16:24,322 --> 06:16:25,567 Then, second, 8577 06:16:25,567 --> 06:16:27,987 you have to perform Spark streaming, 8578 06:16:27,987 --> 06:16:31,009 so you will be continuously streaming the data 8579 06:16:31,009 --> 06:16:34,300 and then you will be applying some transformations 8580 06:16:34,300 --> 06:16:36,900 which will basically give you the minute-wise trend 8581 06:16:36,900 --> 06:16:38,361 of the two technologies. 8582 06:16:38,361 --> 06:16:41,747 And again, you'll write it back to a Kafka topic, and at last 8583 06:16:41,747 --> 06:16:42,992 you'll write a consumer 8584 06:16:42,992 --> 06:16:46,051 that will be consuming messages from the Kafka topic 8585 06:16:46,051 --> 06:16:49,200 and that will write the data into your Cassandra database. 8586 06:16:49,200 --> 06:16:51,018 So first you have to write a program 8587 06:16:51,018 --> 06:16:53,049 that will be consuming data from Twitter 8588 06:16:53,049 --> 06:16:54,696 and write it to a Kafka topic.
8589 06:16:54,696 --> 06:16:56,999 Then you have to write a spark streaming job, 8590 06:16:56,999 --> 06:17:00,200 which will be continuously streaming the data from Kafka 8591 06:17:00,300 --> 06:17:03,300 and perform analytics to identify the military Trend 8592 06:17:03,300 --> 06:17:06,200 and then it will write the data back to a cuff for topic 8593 06:17:06,200 --> 06:17:08,282 and then you have to write the third job 8594 06:17:08,282 --> 06:17:10,114 which will be basically a consumer 8595 06:17:10,114 --> 06:17:12,668 that will consume data from the table for topic 8596 06:17:12,668 --> 06:17:15,000 and write the data to a Cassandra database. 8597 06:17:19,800 --> 06:17:21,709 But a spark is a powerful framework, 8598 06:17:21,709 --> 06:17:23,960 which has been heavily used in the industry 8599 06:17:23,960 --> 06:17:26,800 for real-time analytics and machine learning purposes. 8600 06:17:26,800 --> 06:17:28,689 So before I proceed with the session, 8601 06:17:28,689 --> 06:17:30,489 let's have a quick look at the topics 8602 06:17:30,489 --> 06:17:31,968 which will be covering today. 8603 06:17:31,968 --> 06:17:33,600 So I'm starting off by explaining 8604 06:17:33,600 --> 06:17:35,900 what exactly is by spot and how it works. 8605 06:17:35,900 --> 06:17:36,900 When we go ahead. 8606 06:17:36,900 --> 06:17:39,819 We'll find out the various advantages provided by spark. 8607 06:17:39,819 --> 06:17:41,200 Then I will be showing you 8608 06:17:41,200 --> 06:17:43,400 how to install by sparking a systems. 8609 06:17:43,400 --> 06:17:45,300 Once we are done with the installation. 8610 06:17:45,300 --> 06:17:48,200 I will talk about the fundamental concepts of by spark 8611 06:17:48,200 --> 06:17:49,800 like this spark context. 8612 06:17:49,900 --> 06:17:53,900 Data frames MLA Oddities and much more and finally, 8613 06:17:53,900 --> 06:17:57,100 I'll close of the session with the demo in which I'll show you 8614 06:17:57,100 --> 06:18:00,200 how to implement by spark to solve real life use cases. 8615 06:18:00,200 --> 06:18:01,791 So without any further Ado, 8616 06:18:01,791 --> 06:18:04,621 let's quickly embark on our journey to pie spot now 8617 06:18:04,621 --> 06:18:06,558 before I start off with by spark. 8618 06:18:06,558 --> 06:18:09,500 Let me first brief you about the by spark ecosystem 8619 06:18:09,500 --> 06:18:13,154 as you can see from the diagram the spark ecosystem is composed 8620 06:18:13,154 --> 06:18:16,400 of various components like Sparks equals Park streaming. 8621 06:18:16,400 --> 06:18:19,800 Ml Abe graphics and the core API component the spark. 8622 06:18:19,800 --> 06:18:22,000 Equal component is used to Leverage The Power 8623 06:18:22,000 --> 06:18:23,320 of decorative queries 8624 06:18:23,320 --> 06:18:26,281 and optimize storage by executing sql-like queries 8625 06:18:26,281 --> 06:18:27,124 on spark data, 8626 06:18:27,124 --> 06:18:28,654 which is presented in rdds 8627 06:18:28,654 --> 06:18:31,589 and other external sources spark streaming component 8628 06:18:31,589 --> 06:18:33,882 allows developers to perform batch processing 8629 06:18:33,882 --> 06:18:36,714 and streaming of data with ease in the same application. 8630 06:18:36,714 --> 06:18:39,345 The machine learning library eases the development 8631 06:18:39,345 --> 06:18:41,600 and deployment of scalable machine learning 8632 06:18:41,600 --> 06:18:43,600 pipelines Graphics component. 
8633 06:18:43,600 --> 06:18:47,100 Let's the data scientists work with graph and non graph sources 8634 06:18:47,100 --> 06:18:49,982 to achieve flexibility and resilience in graph. 8635 06:18:49,982 --> 06:18:51,775 Struction and Transformations 8636 06:18:51,775 --> 06:18:54,000 and finally the spark core component. 8637 06:18:54,000 --> 06:18:56,723 It is the most vital component of spark ecosystem, 8638 06:18:56,723 --> 06:18:57,900 which is responsible 8639 06:18:57,900 --> 06:19:00,644 for basic input output functions scheduling 8640 06:19:00,644 --> 06:19:04,172 and monitoring the entire spark ecosystem is built on top 8641 06:19:04,172 --> 06:19:06,014 of this code execution engine 8642 06:19:06,014 --> 06:19:09,000 which has extensible apis in different languages 8643 06:19:09,000 --> 06:19:12,300 like Scala Python and Java and in today's session, 8644 06:19:12,300 --> 06:19:13,915 I will specifically discuss 8645 06:19:13,915 --> 06:19:16,967 about the spark API in Python programming languages, 8646 06:19:16,967 --> 06:19:19,600 which is more popularly known as the pie Spa. 8647 06:19:19,700 --> 06:19:22,839 You might be wondering why pie spot well to get 8648 06:19:22,839 --> 06:19:24,000 a better Insight. 8649 06:19:24,000 --> 06:19:26,400 Let me give you a brief into pie spot. 8650 06:19:26,400 --> 06:19:29,300 Now as you already know by spec is the collaboration 8651 06:19:29,300 --> 06:19:31,050 of two powerful Technologies, 8652 06:19:31,050 --> 06:19:32,500 which are spark which is 8653 06:19:32,500 --> 06:19:35,459 an open-source clustering Computing framework built 8654 06:19:35,459 --> 06:19:38,300 around speed ease of use and streaming analytics. 8655 06:19:38,300 --> 06:19:40,707 And the other one is python, of course python, 8656 06:19:40,707 --> 06:19:43,900 which is a general purpose high-level programming language. 8657 06:19:43,900 --> 06:19:46,900 It provides wide range of libraries and is majorly used 8658 06:19:46,900 --> 06:19:50,000 for machine learning and real-time analytics now, 8659 06:19:50,000 --> 06:19:52,000 Now which gives us by spark 8660 06:19:52,000 --> 06:19:53,852 which is a python a pay for spark 8661 06:19:53,852 --> 06:19:56,581 that lets you harness the Simplicity of Python 8662 06:19:56,581 --> 06:19:58,400 and The Power of Apache spark. 8663 06:19:58,400 --> 06:20:01,059 In order to tame pick data up ice pack. 8664 06:20:01,059 --> 06:20:03,398 Also lets you use the rdds and come 8665 06:20:03,398 --> 06:20:06,700 with a default integration of Pi Forge a library. 8666 06:20:06,700 --> 06:20:10,397 We learn about rdds later in this video now that you know, 8667 06:20:10,397 --> 06:20:11,500 what is pi spark. 8668 06:20:11,500 --> 06:20:14,400 Let's now see the advantages of using spark with python 8669 06:20:14,400 --> 06:20:17,700 as we all know python itself is very simple and easy. 8670 06:20:17,700 --> 06:20:20,700 So when Spock is written in Python it To participate 8671 06:20:20,700 --> 06:20:22,837 quite easy to learn and use moreover. 8672 06:20:22,837 --> 06:20:24,737 It's a dynamically type language 8673 06:20:24,737 --> 06:20:28,300 which means Oddities can hold objects of multiple data types. 
8674 06:20:28,300 --> 06:20:30,711 Not only does it also makes the EPA simple 8675 06:20:30,711 --> 06:20:32,400 and comprehensive and talking 8676 06:20:32,400 --> 06:20:34,700 about the readability of code maintenance 8677 06:20:34,700 --> 06:20:36,700 and familiarity with the python API 8678 06:20:36,700 --> 06:20:38,577 for purchase Park is far better 8679 06:20:38,577 --> 06:20:41,000 than other programming languages python also 8680 06:20:41,000 --> 06:20:43,100 provides various options for visualization, 8681 06:20:43,100 --> 06:20:46,180 which is not possible using Scala or Java moreover. 8682 06:20:46,180 --> 06:20:49,200 You can conveniently call are directly from python 8683 06:20:49,200 --> 06:20:50,800 on above this python comes 8684 06:20:50,800 --> 06:20:52,300 with a wide range of libraries 8685 06:20:52,300 --> 06:20:55,800 like numpy pandas Caitlin Seaborn matplotlib 8686 06:20:55,800 --> 06:20:57,912 and these debris is in data analysis 8687 06:20:57,912 --> 06:20:59,300 and also provide mature 8688 06:20:59,300 --> 06:21:02,564 and time test statistics with all these feature. 8689 06:21:02,564 --> 06:21:04,100 You can effortlessly program 8690 06:21:04,100 --> 06:21:06,700 and spice Park in case you get stuck somewhere 8691 06:21:06,700 --> 06:21:07,600 or habit out. 8692 06:21:07,600 --> 06:21:08,835 There is a huge price 8693 06:21:08,835 --> 06:21:12,600 but Community out there whom you can reach out and put your query 8694 06:21:12,600 --> 06:21:13,800 and that is very actor. 8695 06:21:13,800 --> 06:21:16,647 So I will make good use of this opportunity to show you 8696 06:21:16,647 --> 06:21:18,000 how to install Pi spark 8697 06:21:18,000 --> 06:21:20,900 in a system now here I'm using Red Hat Linux 8698 06:21:20,900 --> 06:21:24,400 based sent to a system the same steps can be applied 8699 06:21:24,400 --> 06:21:26,000 for using Linux systems as well. 8700 06:21:26,200 --> 06:21:28,500 So in order to install Pi spark first, 8701 06:21:28,500 --> 06:21:31,100 make sure that you have Hadoop installed in your system. 8702 06:21:31,100 --> 06:21:33,700 So if you want to know more about how to install Ado, 8703 06:21:33,700 --> 06:21:36,500 please check out our new playlist on YouTube 8704 06:21:36,500 --> 06:21:39,909 or you can check out our blog on a direct our website the first 8705 06:21:39,909 --> 06:21:43,100 of all you need to go to the Apache spark official website, 8706 06:21:43,100 --> 06:21:44,750 which is parked at a party Dot o-- r-- 8707 06:21:44,750 --> 06:21:48,025 g-- and the download section you can download the latest version 8708 06:21:48,025 --> 06:21:48,907 of spark release 8709 06:21:48,907 --> 06:21:51,500 which supports It's the latest version of Hadoop 8710 06:21:51,500 --> 06:21:53,800 or Hadoop version 2.7 or above now. 8711 06:21:53,800 --> 06:21:55,429 Once you have downloaded it, 8712 06:21:55,429 --> 06:21:57,900 all you need to do is extract it or add say 8713 06:21:57,900 --> 06:21:59,400 under the file contents. 8714 06:21:59,400 --> 06:22:01,400 And after that you need to put in the path 8715 06:22:01,400 --> 06:22:04,200 where the spark is installed in the bash RC file. 
8716 06:22:04,200 --> 06:22:06,082 Now, you also need to install pip 8717 06:22:06,082 --> 06:22:09,300 and jupyter notebook using the pipe command and make sure 8718 06:22:09,300 --> 06:22:11,700 that the version of piston or above so 8719 06:22:11,700 --> 06:22:12,820 as you can see here, 8720 06:22:12,820 --> 06:22:16,114 this is what our bash RC file looks like here you can see 8721 06:22:16,114 --> 06:22:17,700 that we have put in the path 8722 06:22:17,700 --> 06:22:20,700 for Hadoop spark and as well as Spunk driver python, 8723 06:22:20,700 --> 06:22:22,200 which is The jupyter Notebook. 8724 06:22:22,200 --> 06:22:23,087 What we'll do. 8725 06:22:23,087 --> 06:22:25,939 Is that the moment you run the pie Spock shell 8726 06:22:25,939 --> 06:22:29,300 it will automatically open a jupyter notebook for you. 8727 06:22:29,300 --> 06:22:29,551 Now. 8728 06:22:29,551 --> 06:22:32,000 I find jupyter notebook very easy to work 8729 06:22:32,000 --> 06:22:35,700 with rather than the shell is supposed to search choice now 8730 06:22:35,700 --> 06:22:37,899 that we are done with the installation path. 8731 06:22:37,899 --> 06:22:40,100 Let's now dive deeper into pie Sparkle on few 8732 06:22:40,100 --> 06:22:41,100 of its fundamentals, 8733 06:22:41,100 --> 06:22:43,770 which you need to know in order to work with by Spar. 8734 06:22:43,770 --> 06:22:45,870 Now this timeline shows the various topics, 8735 06:22:45,870 --> 06:22:48,600 which we will be covering under the pie spark fundamentals. 8736 06:22:48,700 --> 06:22:49,650 So let's start off. 8737 06:22:49,650 --> 06:22:51,500 With the very first Topic in our list. 8738 06:22:51,500 --> 06:22:53,100 That is the spark context. 8739 06:22:53,100 --> 06:22:56,335 The spark context is the heart of any spark application. 8740 06:22:56,335 --> 06:22:59,518 It sets up internal services and establishes a connection 8741 06:22:59,518 --> 06:23:03,300 to a spark execution environment through a spark context object. 8742 06:23:03,300 --> 06:23:05,357 You can create rdds accumulators 8743 06:23:05,357 --> 06:23:09,000 and broadcast variable access Park service's run jobs 8744 06:23:09,000 --> 06:23:11,362 and much more the spark context allows 8745 06:23:11,362 --> 06:23:14,094 the spark driver application to access the cluster 8746 06:23:14,094 --> 06:23:15,600 through a resource manager, 8747 06:23:15,600 --> 06:23:16,600 which can be yarn 8748 06:23:16,600 --> 06:23:19,600 or Sparks cluster manager the driver program then runs. 8749 06:23:19,700 --> 06:23:23,044 Operations inside the executors on the worker nodes 8750 06:23:23,044 --> 06:23:26,478 and Spark context uses the pie for Jay to launch a jvm 8751 06:23:26,478 --> 06:23:29,200 which in turn creates a Java spark context. 8752 06:23:29,200 --> 06:23:30,884 Now there are various parameters, 8753 06:23:30,884 --> 06:23:33,200 which can be used with a spark context object 8754 06:23:33,200 --> 06:23:34,700 like the Master app name 8755 06:23:34,700 --> 06:23:37,366 spark home the pie files the environment 8756 06:23:37,366 --> 06:23:41,600 in which has set the path size serializer configuration Gateway 8757 06:23:41,600 --> 06:23:44,267 and much more among these parameters 8758 06:23:44,267 --> 06:23:47,700 the master and app name are the most commonly used now 8759 06:23:47,700 --> 06:23:51,061 to give you a basic Insight on how Spark program works. 
8760 06:23:51,061 --> 06:23:53,807 I have listed down its basic lifecycle phases 8761 06:23:53,807 --> 06:23:56,903 the typical life cycle of a spark program includes 8762 06:23:56,903 --> 06:23:59,367 creating rdds from external data sources 8763 06:23:59,367 --> 06:24:02,400 or paralyzed a collection in your driver program. 8764 06:24:02,400 --> 06:24:05,361 Then we have the lazy transformation in a lazily 8765 06:24:05,361 --> 06:24:07,064 transforming the base rdds 8766 06:24:07,064 --> 06:24:10,600 into new Oddities using transformation then caching few 8767 06:24:10,600 --> 06:24:12,700 of those rdds for future reuse 8768 06:24:12,800 --> 06:24:15,800 and finally performing action to execute parallel computation 8769 06:24:15,800 --> 06:24:17,500 and to produce the results. 8770 06:24:17,500 --> 06:24:19,800 The next Topic in our list is added. 8771 06:24:19,800 --> 06:24:20,700 And I'm sure people 8772 06:24:20,700 --> 06:24:23,700 who have already worked with spark a familiar with this term, 8773 06:24:23,700 --> 06:24:25,582 but for people who are new to it, 8774 06:24:25,582 --> 06:24:26,900 let me just explain it. 8775 06:24:26,900 --> 06:24:29,782 No Artie T stands for resilient distributed data set. 8776 06:24:29,782 --> 06:24:32,000 It is considered to be the building block 8777 06:24:32,000 --> 06:24:33,433 of any spark application. 8778 06:24:33,433 --> 06:24:35,900 The reason behind this is these elements run 8779 06:24:35,900 --> 06:24:38,600 and operate on multiple nodes to do parallel processing 8780 06:24:38,600 --> 06:24:39,400 on a cluster. 8781 06:24:39,400 --> 06:24:40,952 And once you create an RTD, 8782 06:24:40,952 --> 06:24:43,273 it becomes immutable and by imitable, 8783 06:24:43,273 --> 06:24:46,637 I mean that it is an object whose State cannot be modified 8784 06:24:46,637 --> 06:24:47,700 after its created, 8785 06:24:47,700 --> 06:24:49,654 but we can transform its values by up. 8786 06:24:49,654 --> 06:24:51,438 Applying certain transformation. 8787 06:24:51,438 --> 06:24:53,500 They have good fault tolerance ability 8788 06:24:53,500 --> 06:24:56,700 and can automatically recover for almost any failures. 8789 06:24:56,700 --> 06:25:00,700 This adds an added Advantage not to achieve a certain task 8790 06:25:00,700 --> 06:25:03,205 multiple operations can be applied on these IDs 8791 06:25:03,205 --> 06:25:05,675 which are categorized in two ways the first 8792 06:25:05,675 --> 06:25:06,800 in the transformation 8793 06:25:06,800 --> 06:25:09,900 and the second one is the actions the Transformations 8794 06:25:09,900 --> 06:25:10,800 are the operations 8795 06:25:10,800 --> 06:25:13,800 which are applied on an oddity to create a new rdd. 8796 06:25:14,000 --> 06:25:15,300 Now these transformation work 8797 06:25:15,300 --> 06:25:17,300 on the principle of lazy evaluation 8798 06:25:17,700 --> 06:25:19,900 and transformation are lazy in nature. 8799 06:25:19,900 --> 06:25:22,927 Meaning when we call some operation in our dirty. 8800 06:25:22,927 --> 06:25:25,758 It does not execute immediately spark maintains, 8801 06:25:25,758 --> 06:25:28,602 the record of the operations it is being called 8802 06:25:28,602 --> 06:25:31,324 through with the help of direct acyclic graphs, 8803 06:25:31,324 --> 06:25:33,100 which is also known as the DHS 8804 06:25:33,100 --> 06:25:35,900 and since the Transformations are lazy in nature. 
8805 06:25:35,900 --> 06:25:37,604 So when we execute operation 8806 06:25:37,604 --> 06:25:40,100 any time by calling an action on the data, 8807 06:25:40,100 --> 06:25:42,371 the lazy evaluation data is not loaded 8808 06:25:42,371 --> 06:25:43,547 until it's necessary 8809 06:25:43,547 --> 06:25:46,900 and the moment we call out the action all the computations 8810 06:25:46,900 --> 06:25:49,900 are performed parallely to give you the desired output. 8811 06:25:49,900 --> 06:25:52,400 Put now a few important Transformations are 8812 06:25:52,400 --> 06:25:53,944 the map flatmap filter 8813 06:25:53,944 --> 06:25:55,360 this thing reduced by 8814 06:25:55,360 --> 06:25:59,000 key map partition sort by actions are the operations 8815 06:25:59,000 --> 06:26:02,058 which are applied on an rdd to instruct a party spark 8816 06:26:02,058 --> 06:26:03,188 to apply computation 8817 06:26:03,188 --> 06:26:05,600 and pass the result back to the driver few 8818 06:26:05,600 --> 06:26:09,100 of these actions include collect the collectors mapreduce 8819 06:26:09,100 --> 06:26:10,300 take first now, 8820 06:26:10,300 --> 06:26:13,600 let me Implement few of these for your better understanding. 8821 06:26:14,600 --> 06:26:17,000 So first of all, let me show you the bash 8822 06:26:17,000 --> 06:26:18,800 as if I'll which I was talking about. 8823 06:26:25,100 --> 06:26:27,196 So here you can see in the bash RC file. 8824 06:26:27,196 --> 06:26:29,400 We provide the path for all the Frameworks 8825 06:26:29,400 --> 06:26:31,250 which we have installed in the system. 8826 06:26:31,250 --> 06:26:32,800 So for example, you can see here. 8827 06:26:32,800 --> 06:26:35,100 We have installed Hadoop the moment we 8828 06:26:35,100 --> 06:26:38,100 install and unzip it or rather see entire it 8829 06:26:38,100 --> 06:26:41,300 I have shifted all my Frameworks to one particular location 8830 06:26:41,300 --> 06:26:43,492 as you can see is the US are the user 8831 06:26:43,492 --> 06:26:46,140 and inside this we have the library and inside 8832 06:26:46,140 --> 06:26:49,217 that I have installed the Hadoop and also the spa now 8833 06:26:49,217 --> 06:26:50,400 as you can see here, 8834 06:26:50,400 --> 06:26:51,300 we have two lines. 8835 06:26:51,300 --> 06:26:54,800 I'll highlight this one for you the pie spark driver. 8836 06:26:54,800 --> 06:26:56,392 Titan which is the Jupiter 8837 06:26:56,392 --> 06:26:59,700 and we have given it as a notebook the option available 8838 06:26:59,700 --> 06:27:02,100 as know to what we'll do is at the moment. 8839 06:27:02,100 --> 06:27:04,731 I start spark will automatically redirect me 8840 06:27:04,731 --> 06:27:06,200 to The jupyter Notebook. 8841 06:27:10,200 --> 06:27:14,500 So let me just rename this notebook as rdd tutorial. 8842 06:27:15,200 --> 06:27:16,900 So let's get started. 8843 06:27:17,800 --> 06:27:21,000 So here to load any file into an rdd suppose. 8844 06:27:21,000 --> 06:27:23,795 I'm loading a text file you need to use the S 8845 06:27:23,795 --> 06:27:26,700 if it is a spark context as C dot txt file 8846 06:27:26,700 --> 06:27:28,952 and you need to provide the path of the data 8847 06:27:28,952 --> 06:27:30,600 which you are going to load. 8848 06:27:30,600 --> 06:27:33,300 So one thing to keep in mind is that the default path 8849 06:27:33,300 --> 06:27:35,483 which the artery takes or the jupyter. 8850 06:27:35,483 --> 06:27:37,365 Notebook takes is the hdfs path. 
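Pulling the pieces so far into one small sketch: a SparkContext built with the master and app name parameters mentioned earlier, and a text file loaded into an RDD the way the walkthrough does here. The file path below is only a placeholder, not the actual sample file used in the demo; the file:// prefix for local files is explained in the next step.

from pyspark import SparkConf, SparkContext

# master and appName are the two most commonly used configuration parameters.
conf = SparkConf().setMaster("local[*]").setAppName("rdd_tutorial")
sc = SparkContext(conf=conf)

# textFile() resolves paths against HDFS by default; the file:// prefix points
# it at the local file system instead.
sample = sc.textFile("file:///home/edureka/sample_blockchain.txt")  # placeholder path

# take() is an action, so this is the point where data is actually read.
print(sample.take(5))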
8851 06:27:37,365 --> 06:27:39,456 So in order to use the local file system, 8852 06:27:39,456 --> 06:27:41,311 you need to mention the file colon 8853 06:27:41,311 --> 06:27:42,900 and double forward slashes. 8854 06:27:43,800 --> 06:27:46,100 So once our sample data is 8855 06:27:46,100 --> 06:27:49,076 inside the ret not to have a look at it. 8856 06:27:49,076 --> 06:27:52,000 We need to invoke using it the action. 8857 06:27:52,000 --> 06:27:54,900 So let's go ahead and take a look at the first five objects 8858 06:27:54,900 --> 06:27:59,400 or rather say the first five elements of this particular rdt. 8859 06:27:59,700 --> 06:28:02,776 The sample it I have taken here is about blockchain 8860 06:28:02,776 --> 06:28:03,700 as you can see. 8861 06:28:03,700 --> 06:28:05,000 We have one two, 8862 06:28:05,030 --> 06:28:07,569 three four and five elements here. 8863 06:28:08,500 --> 06:28:12,080 Suppose I need to convert all the data into a lowercase 8864 06:28:12,080 --> 06:28:14,600 and split it according to word by word. 8865 06:28:14,600 --> 06:28:17,000 So for that I will create a function 8866 06:28:17,000 --> 06:28:20,000 and in the function I'll pass on this Oddity. 8867 06:28:20,000 --> 06:28:21,700 So I'm creating as you can see here. 8868 06:28:21,700 --> 06:28:22,990 I'm creating rdd one 8869 06:28:22,990 --> 06:28:25,700 that is a new ID and using the map function 8870 06:28:25,700 --> 06:28:29,200 or rather say the transformation and passing on the function, 8871 06:28:29,200 --> 06:28:32,100 which I just created to lower and to split it. 8872 06:28:32,496 --> 06:28:35,803 So if we have a look at the output of our D1 8873 06:28:37,800 --> 06:28:39,059 As you can see here, 8874 06:28:39,059 --> 06:28:41,200 all the words are in the lower case 8875 06:28:41,200 --> 06:28:44,300 and all of them are separated with the help of a space bar. 8876 06:28:44,700 --> 06:28:47,000 Now this another transformation, 8877 06:28:47,000 --> 06:28:50,216 which is known as the flat map to give you a flat and output 8878 06:28:50,216 --> 06:28:52,157 and I am passing the same function 8879 06:28:52,157 --> 06:28:53,569 which I created earlier. 8880 06:28:53,569 --> 06:28:54,500 So let's go ahead 8881 06:28:54,500 --> 06:28:56,800 and have a look at the output for this one. 8882 06:28:56,800 --> 06:28:58,200 So as you can see here, 8883 06:28:58,200 --> 06:29:00,189 we got the first five elements 8884 06:29:00,189 --> 06:29:04,355 which are the save one as we got here the contrast transactions 8885 06:29:04,355 --> 06:29:05,700 and and the records. 8886 06:29:05,700 --> 06:29:07,523 So just one thing to keep in mind. 8887 06:29:07,523 --> 06:29:09,700 Is at the flat map is a transformation 8888 06:29:09,700 --> 06:29:11,664 where as take is the action now, 8889 06:29:11,664 --> 06:29:13,614 as you can see that the contents 8890 06:29:13,614 --> 06:29:16,007 of the sample data contains stop words. 8891 06:29:16,007 --> 06:29:18,762 So if I want to remove all the stop was all you 8892 06:29:18,762 --> 06:29:19,900 need to do is start 8893 06:29:19,900 --> 06:29:23,351 and create a list of stop words in which I have mentioned here 8894 06:29:23,351 --> 06:29:24,200 as you can see. 8895 06:29:24,200 --> 06:29:26,200 We have a all the as is 8896 06:29:26,200 --> 06:29:28,700 and now these are not all the stop words. 
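As a sketch, the two transformations just described plus the small stop-word list, assuming the sample RDD created in the previous step (the filter that uses this list follows in the next part of the walkthrough):

# Lower-case a line and split it into individual words.
def to_words(line):
    return line.lower().split()

rdd1 = sample.map(to_words)       # map: one list of words per line
rdd2 = sample.flatMap(to_words)   # flatMap: a single flattened stream of words
print(rdd2.take(5))

# Only a handful of stop words, roughly the ones mentioned in the walkthrough.
stop_words = ["a", "all", "the", "as", "is", "am", "an", "and"]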
8897 06:29:28,700 --> 06:29:31,701 So I've chosen only a few of them just to show you 8898 06:29:31,701 --> 06:29:33,600 what exactly the output will be 8899 06:29:33,600 --> 06:29:36,100 and now we are using here the filter transformation 8900 06:29:36,100 --> 06:29:37,800 and with the help of Lambda. 8901 06:29:37,800 --> 06:29:40,800 Function and which we have X specified as X naught 8902 06:29:40,800 --> 06:29:43,360 in stock quotes and we have created another rdd 8903 06:29:43,360 --> 06:29:44,465 which is added III 8904 06:29:44,465 --> 06:29:46,000 which will take the input 8905 06:29:46,000 --> 06:29:48,800 from our DD to so let's go ahead and see 8906 06:29:48,800 --> 06:29:51,700 whether and and the are removed or not. 8907 06:29:51,700 --> 06:29:55,600 This is you can see contracts transaction records of them. 8908 06:29:55,600 --> 06:29:57,500 If you look at the output 5, 8909 06:29:57,500 --> 06:30:00,979 we have contracts transaction and and the and in the 8910 06:30:00,979 --> 06:30:02,337 are not in this list, 8911 06:30:02,337 --> 06:30:04,600 but suppose I want to group the data 8912 06:30:04,600 --> 06:30:07,523 according to the first three characters of any element. 8913 06:30:07,523 --> 06:30:08,756 So for that I'll use 8914 06:30:08,756 --> 06:30:11,900 the group by and I'll use the Lambda function again. 8915 06:30:11,900 --> 06:30:14,000 So let's have a look at the output 8916 06:30:14,000 --> 06:30:16,769 so you can see we have EDG and edges. 8917 06:30:16,900 --> 06:30:20,638 So the first three letters of both words are same similarly. 8918 06:30:20,638 --> 06:30:23,300 We can find it using the first two letters. 8919 06:30:23,300 --> 06:30:27,800 Also, let me just change it to two so you can see we are gu 8920 06:30:27,800 --> 06:30:29,800 and guid just a guide 8921 06:30:30,000 --> 06:30:32,200 not these are the basic Transformations 8922 06:30:32,200 --> 06:30:33,785 and actions but suppose. 8923 06:30:33,785 --> 06:30:37,400 I want to find out the sum of the first thousand numbers. 8924 06:30:37,400 --> 06:30:39,436 Others have first 10,000 numbers. 8925 06:30:39,436 --> 06:30:42,300 All I need to do is initialize another Oddity, 8926 06:30:42,300 --> 06:30:44,400 which is the number underscore ID. 8927 06:30:44,400 --> 06:30:47,512 And we use the AC Dot parallelized and the range 8928 06:30:47,512 --> 06:30:49,500 we have given is one to 10,000 8929 06:30:49,500 --> 06:30:51,600 and we'll use the reduce action 8930 06:30:51,600 --> 06:30:54,532 here to see the output you can see here. 8931 06:30:54,532 --> 06:30:56,840 We have the sum of the numbers ranging 8932 06:30:56,840 --> 06:30:58,400 from one to ten thousand. 8933 06:30:58,400 --> 06:31:00,900 Now this was all about rdd. 8934 06:31:00,900 --> 06:31:01,699 The next topic 8935 06:31:01,699 --> 06:31:03,711 that we have on a list is broadcast 8936 06:31:03,711 --> 06:31:07,181 and accumulators now in spark we perform parallel processing 8937 06:31:07,181 --> 06:31:09,100 through the Help of shared variables 8938 06:31:09,100 --> 06:31:11,672 or when the driver sends any tasks with the executor 8939 06:31:11,672 --> 06:31:14,900 present on the cluster a copy of the shared variable is also sent 8940 06:31:14,900 --> 06:31:15,700 to the each node 8941 06:31:15,700 --> 06:31:18,100 of the cluster thus maintaining High availability 8942 06:31:18,100 --> 06:31:19,400 and fault tolerance. 
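Before moving on to shared variables, here is a compact sketch of the filter, groupBy and reduce examples above, reusing the rdd2 and stop_words defined in the previous sketch:

# Drop the stop words with a filter transformation and a lambda.
rdd3 = rdd2.filter(lambda word: word not in stop_words)
print(rdd3.take(10))

# Group words by their first three characters (the walkthrough then switches to two).
grouped = rdd3.groupBy(lambda word: word[0:3])
print([(key, list(values)) for key, values in grouped.take(5)])

# Sum of the numbers from 1 to 10,000 using parallelize() and the reduce() action.
num_rdd = sc.parallelize(range(1, 10000))
print(num_rdd.reduce(lambda x, y: x + y))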
8943 06:31:19,400 --> 06:31:22,223 Now, this is done in order to accomplish the task 8944 06:31:22,223 --> 06:31:25,341 and Apache spark supposed to type of shared variables. 8945 06:31:25,341 --> 06:31:26,711 One of them is broadcast. 8946 06:31:26,711 --> 06:31:28,861 And the other one is the accumulator now 8947 06:31:28,861 --> 06:31:31,735 broadcast variables are used to save the copy of data 8948 06:31:31,735 --> 06:31:33,334 on all the nodes in a cluster. 8949 06:31:33,334 --> 06:31:36,117 Whereas the accumulator is the variable that is used 8950 06:31:36,117 --> 06:31:37,700 for aggregating the incoming. 8951 06:31:37,700 --> 06:31:40,056 Information we are different associative 8952 06:31:40,056 --> 06:31:43,500 and commutative operations now moving on to our next topic 8953 06:31:43,500 --> 06:31:47,094 which is a spark configuration the spark configuration class 8954 06:31:47,094 --> 06:31:49,800 provides a set of configurations and parameters 8955 06:31:49,800 --> 06:31:52,300 that are needed to execute a spark application 8956 06:31:52,300 --> 06:31:54,300 on the local system or any cluster. 8957 06:31:54,300 --> 06:31:56,800 Now when you use spark configuration object 8958 06:31:56,800 --> 06:31:59,112 to set the values to these parameters, 8959 06:31:59,112 --> 06:32:02,800 they automatically take priority over the system properties. 8960 06:32:02,800 --> 06:32:05,035 Now this class contains various Getters 8961 06:32:05,035 --> 06:32:07,800 and Setters methods some of which are Set method 8962 06:32:07,800 --> 06:32:10,323 which is used to set a configuration property. 8963 06:32:10,323 --> 06:32:11,555 We have the set master 8964 06:32:11,555 --> 06:32:13,605 which is used for setting the master URL. 8965 06:32:13,605 --> 06:32:14,840 Yeah the set app name, 8966 06:32:14,840 --> 06:32:17,421 which is used to set an application name and we 8967 06:32:17,421 --> 06:32:20,900 have the get method to retrieve a configuration value of a key. 8968 06:32:20,900 --> 06:32:23,000 And finally we have set spark home 8969 06:32:23,000 --> 06:32:25,600 which is used for setting the spark installation path 8970 06:32:25,600 --> 06:32:26,700 on worker nodes. 8971 06:32:26,700 --> 06:32:28,800 Now coming to the next topic on our list 8972 06:32:28,800 --> 06:32:31,600 which is a spark files the spark file class 8973 06:32:31,600 --> 06:32:33,264 contains only the class methods 8974 06:32:33,264 --> 06:32:36,500 so that the user cannot create any spark files instance. 8975 06:32:36,500 --> 06:32:39,200 Now this helps in Dissolving the path of the files 8976 06:32:39,200 --> 06:32:41,500 that are added using the spark context add 8977 06:32:41,500 --> 06:32:44,600 file method the class Park files contain to class methods 8978 06:32:44,600 --> 06:32:47,798 which are the get method and the get root directory method. 8979 06:32:47,798 --> 06:32:50,500 Now, the get is used to retrieve the absolute path 8980 06:32:50,500 --> 06:32:53,900 of a file added through spark context to add file 8981 06:32:54,000 --> 06:32:55,300 and the get root directory 8982 06:32:55,300 --> 06:32:57,076 is used to retrieve the root directory 8983 06:32:57,076 --> 06:32:58,900 that contains the files that are added. 8984 06:32:58,900 --> 06:33:00,700 So this park context dot add file. 8985 06:33:00,700 --> 06:33:03,022 Now, these are smart topics and the next topic 8986 06:33:03,022 --> 06:33:04,257 that we will covering 8987 06:33:04,257 --> 06:33:07,600 in our list are the data frames now data frames in a party. 
8988 06:33:07,600 --> 06:33:09,655 Spark is a distributed collection of rows 8989 06:33:09,655 --> 06:33:10,831 under named columns, 8990 06:33:10,831 --> 06:33:13,400 which is similar to the relational database tables 8991 06:33:13,400 --> 06:33:14,700 or Excel sheets. 8992 06:33:14,700 --> 06:33:16,812 It also shares common attributes 8993 06:33:16,812 --> 06:33:19,800 with the rdds few characteristics of data frames 8994 06:33:19,800 --> 06:33:21,300 are immutable in nature. 8995 06:33:21,300 --> 06:33:23,500 That is the same as you can create a data frame, 8996 06:33:23,500 --> 06:33:24,900 but you cannot change it. 8997 06:33:24,900 --> 06:33:26,500 It allows lazy evaluation. 8998 06:33:26,500 --> 06:33:28,300 That is the task not executed 8999 06:33:28,300 --> 06:33:30,500 unless and until an action is triggered 9000 06:33:30,500 --> 06:33:33,000 and moreover data frames are distributed in nature, 9001 06:33:33,000 --> 06:33:34,900 which are designed for processing large 9002 06:33:34,900 --> 06:33:37,400 collection of structure or semi-structured data. 9003 06:33:37,400 --> 06:33:39,953 Can be created using different data formats, 9004 06:33:39,953 --> 06:33:41,200 like loading the data 9005 06:33:41,200 --> 06:33:43,650 from source files such as Json or CSV, 9006 06:33:43,650 --> 06:33:46,100 or you can load it from an existing re 9007 06:33:46,100 --> 06:33:48,842 you can use databases like hi Cassandra. 9008 06:33:48,842 --> 06:33:50,600 You can use pocket files. 9009 06:33:50,600 --> 06:33:52,800 You can use CSV XML files. 9010 06:33:52,800 --> 06:33:53,900 There are many sources 9011 06:33:53,900 --> 06:33:56,448 through which you can create a particular R DT now, 9012 06:33:56,448 --> 06:33:59,200 let me show you how to create a data frame in pie spark 9013 06:33:59,200 --> 06:34:02,100 and perform various actions and Transformations on it. 9014 06:34:02,300 --> 06:34:05,065 So let's continue this in the same notebook 9015 06:34:05,065 --> 06:34:07,700 which we have here now here we have taken 9016 06:34:07,700 --> 06:34:09,300 In the NYC Flight data, 9017 06:34:09,300 --> 06:34:12,561 and I'm creating a data frame which is the NYC flights 9018 06:34:12,561 --> 06:34:13,300 on the score 9019 06:34:13,300 --> 06:34:14,959 TF now to load the data. 9020 06:34:14,959 --> 06:34:18,340 We are using the spark dot RI dot CSV method and you 9021 06:34:18,340 --> 06:34:19,600 to provide the path 9022 06:34:19,600 --> 06:34:21,900 which is the local path of by default. 9023 06:34:21,900 --> 06:34:24,200 It takes the hdfs same as our GD 9024 06:34:24,200 --> 06:34:26,208 and one thing to note down here is 9025 06:34:26,208 --> 06:34:28,886 that I've provided two parameters extra here, 9026 06:34:28,886 --> 06:34:31,400 which is the info schema and the header 9027 06:34:31,400 --> 06:34:34,700 if we do not provide this as true of a skip it 9028 06:34:34,700 --> 06:34:35,800 what will happen. 9029 06:34:35,800 --> 06:34:39,300 Is that if your data set Is the name of the columns 9030 06:34:39,300 --> 06:34:42,863 on the first row it will take those as data as well. 9031 06:34:42,863 --> 06:34:45,100 It will not infer the schema now. 9032 06:34:45,100 --> 06:34:49,023 Once we have loaded the data in our data frame we need to use 9033 06:34:49,023 --> 06:34:51,900 the show action to have a look at the output. 
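A minimal sketch of the DataFrame creation step just described. The spark object is a SparkSession (already available in a PySpark shell or notebook; built explicitly here), and the CSV path is a placeholder for wherever the NYC flights file sits on your machine:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nyc_flights").getOrCreate()

# header=True keeps the first row as column names instead of treating it as data,
# and inferSchema=True lets Spark work out integer vs. string column types.
nyc_flights_df = spark.read.csv("file:///home/edureka/nyc_flights.csv",  # placeholder path
                                inferSchema=True, header=True)

nyc_flights_df.show()  # prints the top 20 rows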
9034 06:34:51,900 --> 06:34:53,223 So as you can see here, 9035 06:34:53,223 --> 06:34:55,399 we have the output which is exactly it 9036 06:34:55,399 --> 06:34:58,600 gives us the top 20 rows or the particular data set. 9037 06:34:58,600 --> 06:35:02,600 We have the year month day departure time deposit delay 9038 06:35:02,600 --> 06:35:07,000 arrival time arrival delay and so many more attributes. 9039 06:35:07,300 --> 06:35:08,500 To print the schema 9040 06:35:08,500 --> 06:35:11,500 of the particular data frame you need the transformation 9041 06:35:11,500 --> 06:35:13,762 or as say the action of print schema. 9042 06:35:13,762 --> 06:35:15,900 So let's have a look at the schema. 9043 06:35:15,900 --> 06:35:19,117 As you can see here we have here which is integer month integer. 9044 06:35:19,117 --> 06:35:21,000 Almost half of them are integer. 9045 06:35:21,000 --> 06:35:23,600 We have the carrier as string the tail number 9046 06:35:23,600 --> 06:35:26,625 a string the origin string destination string 9047 06:35:26,625 --> 06:35:28,123 and so on now suppose. 9048 06:35:28,123 --> 06:35:29,075 I want to know 9049 06:35:29,075 --> 06:35:31,786 how many records are there in my database 9050 06:35:31,786 --> 06:35:33,685 or the data frame rather say 9051 06:35:33,685 --> 06:35:36,600 so you need the count function for this one. 9052 06:35:36,600 --> 06:35:40,600 I will provide but the results so as you can see here, 9053 06:35:40,600 --> 06:35:42,992 we have three point three million records 9054 06:35:42,992 --> 06:35:44,097 here three million 9055 06:35:44,097 --> 06:35:46,800 thirty six thousand seven hundred seventy six 9056 06:35:46,800 --> 06:35:48,400 to be exact now suppose. 9057 06:35:48,400 --> 06:35:51,153 I want to have a look at the flight name the origin 9058 06:35:51,153 --> 06:35:52,400 and the destination 9059 06:35:52,400 --> 06:35:55,400 of just these three columns from the particular data frame. 9060 06:35:55,400 --> 06:35:57,800 We need to use the select option. 9061 06:35:58,200 --> 06:36:00,882 So as you can see here, we have the top 20 rows. 9062 06:36:00,882 --> 06:36:03,128 Now, what we saw was the select query 9063 06:36:03,128 --> 06:36:05,000 on this particular data frame, 9064 06:36:05,000 --> 06:36:07,240 but if I wanted to see or rather, 9065 06:36:07,240 --> 06:36:09,200 I want to check the summary. 9066 06:36:09,200 --> 06:36:11,400 Of any particular column suppose. 9067 06:36:11,400 --> 06:36:14,500 I want to check the what is the lowest count 9068 06:36:14,500 --> 06:36:18,100 or the highest count in the particular distance column. 9069 06:36:18,100 --> 06:36:20,500 I need to use the describe function here. 9070 06:36:20,500 --> 06:36:23,100 So I'll show you what the summer it looks like. 9071 06:36:23,500 --> 06:36:25,142 So the distance the count 9072 06:36:25,142 --> 06:36:27,900 is the number of rows total number of rows. 9073 06:36:27,900 --> 06:36:30,800 We have the mean the standard deviation via the minimum value, 9074 06:36:30,800 --> 06:36:32,900 which is 17 and the maximum value, 9075 06:36:32,900 --> 06:36:34,500 which is 4983. 9076 06:36:34,900 --> 06:36:38,100 Now this gives you a summary of the particular column 9077 06:36:38,100 --> 06:36:39,856 if you want to So that we know 9078 06:36:39,856 --> 06:36:41,838 that the minimum distance is 70. 9079 06:36:41,838 --> 06:36:44,500 Let's go ahead and filter out our data using 9080 06:36:44,500 --> 06:36:47,700 the filter function in which the distance is 17. 
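The inspection and filtering steps above as a sketch; the exact column names are assumed from the narration and may differ slightly in your copy of the dataset:

nyc_flights_df.printSchema()        # column names with their inferred types
print(nyc_flights_df.count())       # total number of records

# Project just a few columns: carrier, origin and destination.
nyc_flights_df.select("carrier", "origin", "dest").show()

# Summary (count, mean, stddev, min, max) for a single column.
nyc_flights_df.describe("distance").show()

# Filter for the shortest flight found in the summary above.
nyc_flights_df.filter(nyc_flights_df.distance == 17).show()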
9081 06:36:48,700 --> 06:36:49,978 So you can see here. 9082 06:36:49,978 --> 06:36:51,000 We have one data 9083 06:36:51,000 --> 06:36:55,700 in which in the 2013 year the minimum distance here is 17 9084 06:36:55,700 --> 06:36:59,100 but similarly suppose I want to have a look at the flash 9085 06:36:59,100 --> 06:37:01,600 which are originating from EWR. 9086 06:37:01,900 --> 06:37:02,400 Similarly. 9087 06:37:02,400 --> 06:37:04,600 We use the filter function here as well. 9088 06:37:04,600 --> 06:37:06,599 Now the another Clause here, 9089 06:37:06,599 --> 06:37:09,300 which is the where Clause is also used 9090 06:37:09,300 --> 06:37:11,236 for filtering the suppose. 9091 06:37:11,236 --> 06:37:12,800 I want to have a look 9092 06:37:12,815 --> 06:37:16,046 at the flight data and filter it out to see 9093 06:37:16,046 --> 06:37:17,507 if the day at work. 9094 06:37:17,507 --> 06:37:22,000 Which the flight took off was the second of any month suppose. 9095 06:37:22,000 --> 06:37:23,589 So here instead of filter. 9096 06:37:23,589 --> 06:37:25,422 We can also use a where clause 9097 06:37:25,422 --> 06:37:27,500 which will give us the same output. 9098 06:37:29,200 --> 06:37:33,100 Now, we can also pass on multiple parameters 9099 06:37:33,100 --> 06:37:36,000 and rather say the multiple conditions. 9100 06:37:36,000 --> 06:37:39,866 So suppose I want the day of the flight should be seventh 9101 06:37:39,866 --> 06:37:41,839 and the origin should be JFK 9102 06:37:41,839 --> 06:37:45,292 and the arrival delay should be less than 0 I mean 9103 06:37:45,292 --> 06:37:47,900 that is for none of the postponed fly. 9104 06:37:48,000 --> 06:37:49,600 So just to have a look 9105 06:37:49,600 --> 06:37:52,314 at these numbers will use the way clause 9106 06:37:52,314 --> 06:37:55,600 and separate all the conditions using the + symbol 9107 06:37:56,100 --> 06:37:57,800 so you can see here all the data. 9108 06:37:57,800 --> 06:38:00,700 The day is 7 the origin is JFK 9109 06:38:01,100 --> 06:38:04,900 and the arrival delay is less than 0 now. 9110 06:38:04,900 --> 06:38:07,621 These were the basic Transformations and actions 9111 06:38:07,621 --> 06:38:09,300 on the particular data frame. 9112 06:38:09,300 --> 06:38:12,900 Now one thing we can also do is create a temporary table 9113 06:38:12,900 --> 06:38:14,100 for SQL queries 9114 06:38:14,100 --> 06:38:15,100 if someone is 9115 06:38:15,100 --> 06:38:19,000 not good or is not Wanted to all these transformation 9116 06:38:19,000 --> 06:38:22,400 and action add would rather use SQL queries on the data. 9117 06:38:22,400 --> 06:38:26,006 They can use this register dot temp table to create a table 9118 06:38:26,006 --> 06:38:27,925 for their particular data frame. 9119 06:38:27,925 --> 06:38:30,129 What we'll do is convert the NYC flights 9120 06:38:30,129 --> 06:38:33,600 and a Squatty of data frame into NYC endoscope flight table, 9121 06:38:33,600 --> 06:38:36,700 which can be used later and SQL queries can be performed 9122 06:38:36,700 --> 06:38:38,500 on this particular table. 9123 06:38:38,600 --> 06:38:43,000 So you remember in the beginning we use the NYC flies and score d 9124 06:38:43,000 --> 06:38:47,600 f dot show now we can use the select asterisk from I 9125 06:38:47,600 --> 06:38:51,600 am just go flights to get the same output now suppose 9126 06:38:51,600 --> 06:38:55,011 we want to look at the minimum a time of any flights. 9127 06:38:55,011 --> 06:38:58,217 We use the select minimum air time from NYC flights. 
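A sketch of the filter/where examples and the temporary table just described. In the DataFrame API, multiple conditions are combined with the & operator, each wrapped in parentheses (this is what the narration is getting at when it talks about separating the conditions with a symbol); column and table names are again assumed:

# filter() and where() are interchangeable on a DataFrame.
nyc_flights_df.filter(nyc_flights_df.origin == "EWR").show()
nyc_flights_df.where(nyc_flights_df.day == 2).show()

# Several conditions at once: the 7th of the month, origin JFK, and no arrival delay.
nyc_flights_df.where((nyc_flights_df.day == 7) &
                     (nyc_flights_df.origin == "JFK") &
                     (nyc_flights_df.arr_delay < 0)).show()

# Register a temporary table so the same data can be queried with plain SQL.
# (registerTempTable is the older call used in the video; createOrReplaceTempView
# is its newer equivalent.)
nyc_flights_df.registerTempTable("nyc_flights")
spark.sql("SELECT * FROM nyc_flights").show()
spark.sql("SELECT MIN(air_time) FROM nyc_flights").show()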
9128 06:38:58,217 --> 06:38:59,600 That is the SQL query. 9129 06:38:59,600 --> 06:39:02,400 We pass all the SQL query in the sequel context 9130 06:39:02,400 --> 06:39:03,700 or SQL function. 9131 06:39:03,700 --> 06:39:04,800 So you can see here. 9132 06:39:04,800 --> 06:39:07,900 We have the minimum air time as 20 now to have a look 9133 06:39:07,900 --> 06:39:11,400 at the Wreckers in which the air time is minimum 20. 9134 06:39:11,600 --> 06:39:14,693 Now we can also use nested SQL queries a suppose 9135 06:39:14,693 --> 06:39:15,847 if I want to check 9136 06:39:15,847 --> 06:39:19,328 which all flights have the Minimum air time as 20 now 9137 06:39:19,328 --> 06:39:20,553 that cannot be done 9138 06:39:20,553 --> 06:39:24,132 in a simple SQL query we need nested query for that one. 9139 06:39:24,132 --> 06:39:26,800 So selecting aspects from New York flights 9140 06:39:26,800 --> 06:39:29,500 where the airtime is in and inside 9141 06:39:29,500 --> 06:39:30,913 that we have another query, 9142 06:39:30,913 --> 06:39:33,477 which is Select minimum air time from NYC flights. 9143 06:39:33,477 --> 06:39:35,100 Let's see if this works or not. 9144 06:39:37,200 --> 06:39:38,497 CS as you can see here, 9145 06:39:38,497 --> 06:39:41,600 we have two Flats which have the minimum air time as 20. 9146 06:39:42,200 --> 06:39:44,400 So guys this is it for data frames. 9147 06:39:44,400 --> 06:39:46,147 So let's get back to our presentation 9148 06:39:46,147 --> 06:39:48,697 and have a look at the list which we were following. 9149 06:39:48,697 --> 06:39:49,966 We completed data frames. 9150 06:39:49,966 --> 06:39:52,600 Next we have stories levels now Storage level 9151 06:39:52,600 --> 06:39:55,200 in pie spark is a class which helps in deciding 9152 06:39:55,200 --> 06:39:56,991 how the rdds should be stored 9153 06:39:56,991 --> 06:39:59,400 now based on this rdds are either stored 9154 06:39:59,400 --> 06:40:01,400 in this or in memory or in 9155 06:40:01,400 --> 06:40:04,300 both the class Storage level also decides 9156 06:40:04,300 --> 06:40:06,594 whether the RADS should be serialized 9157 06:40:06,594 --> 06:40:09,480 or replicate its partition for the final 9158 06:40:09,480 --> 06:40:12,000 and the last topic for the today's list 9159 06:40:12,000 --> 06:40:15,100 is MLM blog MLM is the machine learning APA 9160 06:40:15,100 --> 06:40:17,000 which is provided by spark, 9161 06:40:17,000 --> 06:40:18,600 which is also present in Python. 9162 06:40:18,700 --> 06:40:21,180 And this library is heavily used in Python 9163 06:40:21,180 --> 06:40:22,597 for machine learning as 9164 06:40:22,597 --> 06:40:26,094 well as real-time streaming analytics Aurelius algorithm 9165 06:40:26,094 --> 06:40:28,773 supported by this libraries are first of all, 9166 06:40:28,773 --> 06:40:30,600 we have the spark dot m l live 9167 06:40:30,600 --> 06:40:33,482 now recently the spice Park MN lips supports model 9168 06:40:33,482 --> 06:40:37,500 based collaborative filtering by a small set of latent factors 9169 06:40:37,500 --> 06:40:40,500 and here all the users and the products are described 9170 06:40:40,500 --> 06:40:42,300 which we can use to predict them. 9171 06:40:42,300 --> 06:40:45,909 Missing entries however to learn these latent factors 9172 06:40:45,909 --> 06:40:48,886 Park dot ml abuses the alternatingly square 9173 06:40:48,886 --> 06:40:50,755 which is the ALS algorithm. 
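Two small follow-ups before the MLlib discussion continues: the nested SQL query just demonstrated, and a one-line illustration of the StorageLevel class mentioned above (MEMORY_AND_DISK is just one example level; the other levels follow the same naming pattern):

# Nested query: which flights actually have the minimum air time?
spark.sql("""SELECT * FROM nyc_flights
             WHERE air_time IN (SELECT MIN(air_time) FROM nyc_flights)""").show()

# StorageLevel decides whether an RDD is cached in memory, on disk, or both,
# and whether it is serialized and/or replicated.
from pyspark import StorageLevel
rdd3.persist(StorageLevel.MEMORY_AND_DISK)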
9174 06:40:50,755 --> 06:40:52,900 Next we have the MLF clustering 9175 06:40:52,900 --> 06:40:53,852 and are supervised 9176 06:40:53,852 --> 06:40:57,700 learning problem is clustering now here we try to group subsets 9177 06:40:57,700 --> 06:40:59,989 of entities with one another on the basis 9178 06:40:59,989 --> 06:41:02,000 of some notion of similarity. 9179 06:41:02,200 --> 06:41:02,500 Next. 9180 06:41:02,500 --> 06:41:04,500 We have the frequent pattern matching, 9181 06:41:04,500 --> 06:41:08,400 which is the fpm now frequent pattern matching is mining 9182 06:41:08,400 --> 06:41:12,800 frequent items item set subsequences or other Lectures 9183 06:41:12,800 --> 06:41:13,600 that are usually 9184 06:41:13,600 --> 06:41:16,900 among the first steps to analyze a large-scale data set. 9185 06:41:16,900 --> 06:41:20,600 This has been an active research topic in data mining for years. 9186 06:41:20,600 --> 06:41:22,800 We have the linear algebra. 9187 06:41:23,000 --> 06:41:25,032 Now this algorithm support spice Park, 9188 06:41:25,032 --> 06:41:27,403 I mean live utilities for linear algebra. 9189 06:41:27,403 --> 06:41:29,300 We have collaborative filtering. 9190 06:41:29,400 --> 06:41:30,900 We have classification 9191 06:41:30,900 --> 06:41:34,000 for binary classification various methods are available 9192 06:41:34,000 --> 06:41:37,700 in sparked MLA package such as multi-class classification as 9193 06:41:37,700 --> 06:41:40,912 well as regression analysis in classification some 9194 06:41:40,912 --> 06:41:44,067 of the most popular Terms used are Nave by a strand 9195 06:41:44,067 --> 06:41:45,457 of forest decision tree 9196 06:41:45,457 --> 06:41:48,600 and so much and finally we have the linear regression 9197 06:41:48,600 --> 06:41:51,300 now basically lead integration comes from the family 9198 06:41:51,300 --> 06:41:54,064 of recreation algorithms to find relationships 9199 06:41:54,064 --> 06:41:56,812 and dependencies between variables is the main goal 9200 06:41:56,812 --> 06:41:58,594 of regression all the pie spark 9201 06:41:58,594 --> 06:42:01,400 MLA package also covers other algorithm classes 9202 06:42:01,400 --> 06:42:02,100 and functions. 9203 06:42:02,400 --> 06:42:04,591 Let's now try to implement all the concepts 9204 06:42:04,591 --> 06:42:07,200 which we have learned in pie spark tutorial session 9205 06:42:07,200 --> 06:42:10,600 now here we are going to use a heart disease prediction model 9206 06:42:10,600 --> 06:42:13,278 and we are going to predict Using the decision tree 9207 06:42:13,278 --> 06:42:16,599 with the help of classification as well as regression. 9208 06:42:16,599 --> 06:42:16,800 Now. 9209 06:42:16,800 --> 06:42:19,600 These all are part of the ml Live library here. 9210 06:42:19,600 --> 06:42:21,800 Let's see how we can perform these types 9211 06:42:21,800 --> 06:42:23,300 of functions and queries. 9212 06:42:39,800 --> 06:42:40,600 The first of all 9213 06:42:40,600 --> 06:42:43,700 what we need to do is initialize the spark context. 9214 06:42:45,100 --> 06:42:48,300 Next we are going to read the UCI data set 9215 06:42:48,400 --> 06:42:50,500 of the heart disease prediction 9216 06:42:50,600 --> 06:42:52,600 and we are going to clean the data. 9217 06:42:52,600 --> 06:42:55,700 So let's import the pandas and the numpy library here. 
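For the demo setup just outlined, a small sketch of the first step: getting a SparkContext and loading the UCI heart-disease file with pandas. The file name is a placeholder for wherever the downloaded dataset lives; as noted next in the walkthrough, the file has no header row:

import pandas as pd
import numpy as np
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# The UCI file ships without column names, hence header=None.
heart_disease_df = pd.read_csv("heart_disease.csv", header=None)  # placeholder file name
print(heart_disease_df.shape)  # (303, 14) for the original dataset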
9218 06:42:56,000 --> 06:42:58,852 Let's create a data frame as heart disease TF and 9219 06:42:58,852 --> 06:43:00,100 as mentioned earlier, 9220 06:43:00,100 --> 06:43:03,544 we are going to use the read CSV method here 9221 06:43:03,700 --> 06:43:05,300 and here we don't have a header. 9222 06:43:05,300 --> 06:43:07,500 So we have provided header as none. 9223 06:43:07,700 --> 06:43:10,800 Now the original data set contains 300 3 rows 9224 06:43:10,800 --> 06:43:12,100 and 14 columns. 9225 06:43:12,600 --> 06:43:15,800 Now the categories of diagnosis of heart disease 9226 06:43:15,900 --> 06:43:17,000 that we are projecting 9227 06:43:17,300 --> 06:43:22,400 if the value 0 is for 50% less than narrowing and for the value 9228 06:43:22,400 --> 06:43:24,900 1 which we are giving is for the values 9229 06:43:24,900 --> 06:43:27,500 which have 50% more diameter of naren. 9230 06:43:28,700 --> 06:43:31,623 So here we are using the numpy library. 9231 06:43:32,700 --> 06:43:35,921 These are particularly old methods which is showing 9232 06:43:35,921 --> 06:43:39,400 the deprecated warning but no issues it will work fine. 9233 06:43:40,900 --> 06:43:42,500 So as you can see here, 9234 06:43:42,500 --> 06:43:45,300 we have the categories of diagnosis of heart disease 9235 06:43:45,300 --> 06:43:48,100 that we are predicting the value 0 is 4 less than 50 9236 06:43:48,100 --> 06:43:50,000 and value 1 is greater than 50. 9237 06:43:50,400 --> 06:43:53,014 So what we did here was clear the row 9238 06:43:53,014 --> 06:43:57,500 which have the question mark or which have the empty spaces. 9239 06:43:58,700 --> 06:44:00,900 Now to get a look at the data set here. 9240 06:44:00,900 --> 06:44:02,200 Now, you can see here. 9241 06:44:02,200 --> 06:44:06,086 We have zero at many places instead of the question mark 9242 06:44:06,086 --> 06:44:07,500 which we had earlier 9243 06:44:08,600 --> 06:44:11,300 and now we are saving it to a txt file. 9244 06:44:12,000 --> 06:44:14,200 And you can see her after dropping the rose 9245 06:44:14,200 --> 06:44:15,494 with any empty values. 9246 06:44:15,494 --> 06:44:18,000 We have two ninety seven rows and 14 columns. 9247 06:44:18,300 --> 06:44:20,800 But this is what the new clear data set looks 9248 06:44:20,800 --> 06:44:24,400 like now we are importing the ml lived library 9249 06:44:24,400 --> 06:44:26,500 and the regression here now here 9250 06:44:26,500 --> 06:44:29,077 what we are going to do is create a label point, 9251 06:44:29,077 --> 06:44:31,900 which is a local Vector associated with a label 9252 06:44:31,900 --> 06:44:33,100 or a response. 9253 06:44:33,100 --> 06:44:36,600 So for that we need to import the MLF dot regression. 9254 06:44:37,800 --> 06:44:39,600 So for that we are taking the text file 9255 06:44:39,600 --> 06:44:43,000 which we just created now without the missing values. 9256 06:44:43,000 --> 06:44:43,665 Now next. 9257 06:44:43,665 --> 06:44:47,678 What we are going to do is pass the MLA data line by line 9258 06:44:47,678 --> 06:44:49,900 into the MLM label Point object 9259 06:44:49,900 --> 06:44:51,671 and we are going to convert the - 9260 06:44:51,671 --> 06:44:53,000 one labels to the 0 now. 9261 06:44:53,000 --> 06:44:56,200 Let's have a look after passing the number of fishing lines. 9262 06:44:57,800 --> 06:45:00,200 Okay, we have to label .01. 9263 06:45:00,600 --> 06:45:01,700 That's cool. 
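A sketch of the preparation just walked through. The cleanup in the video differs slightly in places (it also overwrites some of the '?' placeholders with 0 before dropping rows), so treat this as an approximation of the flow; the column index 13 for the diagnosis and the intermediate file name are assumptions:

from pyspark.mllib.regression import LabeledPoint

# Binarise the diagnosis column: 0 stays 0 (less than 50% narrowing),
# anything greater than 0 becomes 1 (more than 50% narrowing).
heart_disease_df[13] = np.where(heart_disease_df[13] > 0, 1, 0)

# Drop the rows containing '?' placeholders and write a clean copy to disk.
clean_df = heart_disease_df.replace("?", np.nan).dropna()  # 297 rows remain
clean_df.to_csv("heart_disease_clean.txt", header=False, index=False)

# Read the cleaned file back as an RDD and turn every line into an MLlib LabeledPoint:
# the label is the last field, the 13 remaining fields are the features.
def parse_point(line):
    values = [float(x) for x in line.split(",")]
    return LabeledPoint(values[13], values[0:13])

parsed_data = sc.textFile("heart_disease_clean.txt").map(parse_point)
print(parsed_data.take(2))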
9264 06:45:01,700 --> 06:45:04,700 Now next what we are going to do is perform classification using 9265 06:45:04,700 --> 06:45:05,800 the decision tree. 9266 06:45:05,800 --> 06:45:09,300 So for that we need to import the pie spark the ml 8.3. 9267 06:45:09,600 --> 06:45:13,200 So next what we have to do is split the data into the training 9268 06:45:13,200 --> 06:45:14,300 and testing data 9269 06:45:14,300 --> 06:45:18,500 and we split here the data into 70s 233 standard ratio, 9270 06:45:18,600 --> 06:45:20,672 70 being the training data set 9271 06:45:20,672 --> 06:45:24,541 and the 30% being the testing data set next what we do is 9272 06:45:24,541 --> 06:45:26,200 that we train the model. 9273 06:45:26,200 --> 06:45:28,600 Which we are created here using the training set. 9274 06:45:29,100 --> 06:45:31,100 We have created a training model decision trees 9275 06:45:31,100 --> 06:45:32,400 or trained classifier. 9276 06:45:32,400 --> 06:45:34,400 We have used a training data number 9277 06:45:34,400 --> 06:45:36,947 of classes is file the categorical feature, 9278 06:45:36,947 --> 06:45:38,104 which we have given 9279 06:45:38,104 --> 06:45:40,600 maximum depth to which we are classifying. 9280 06:45:40,600 --> 06:45:42,000 It is 3 the next 9281 06:45:42,000 --> 06:45:45,505 what we are going to do is evaluate the model based 9282 06:45:45,505 --> 06:45:49,000 on the test data set now and evaluate the error. 9283 06:45:49,300 --> 06:45:50,800 So here we are creating 9284 06:45:50,800 --> 06:45:53,211 predictions and we are using the test data 9285 06:45:53,211 --> 06:45:55,800 to get the predictions through the model 9286 06:45:55,800 --> 06:45:58,200 which we Do and we are also going to find 9287 06:45:58,200 --> 06:45:59,500 the test errors here. 9288 06:45:59,700 --> 06:46:00,900 So as you can see here, 9289 06:46:00,900 --> 06:46:04,507 the test error is zero point 2 2 9 7 we 9290 06:46:04,507 --> 06:46:08,200 have created a classification decision tree model 9291 06:46:08,200 --> 06:46:11,100 in which the feature less than 12 is 3 the value 9292 06:46:11,100 --> 06:46:13,225 of the features distance 0 is 54. 9293 06:46:13,225 --> 06:46:16,014 So as you can see our model is pretty good. 9294 06:46:16,014 --> 06:46:19,700 So now next we'll use regression for the same purposes. 9295 06:46:19,700 --> 06:46:22,300 So let's perform the regression using decision tree. 9296 06:46:22,500 --> 06:46:24,500 So as you can see we have the train model 9297 06:46:24,500 --> 06:46:26,400 and we are using the decision tree, too. 9298 06:46:26,400 --> 06:46:29,460 Trine request using the training data the same 9299 06:46:29,460 --> 06:46:33,200 which we created using the decision tree model up there. 9300 06:46:33,200 --> 06:46:34,811 We use the classification 9301 06:46:34,811 --> 06:46:37,440 now we are using regression now similarly. 9302 06:46:37,440 --> 06:46:38,921 We are going to evaluate 9303 06:46:38,921 --> 06:46:42,500 our model using our test data set and find that test errors 9304 06:46:42,500 --> 06:46:45,600 which is the mean squared error here for aggression. 9305 06:46:45,600 --> 06:46:48,200 So let's have a look at the mean square error here. 9306 06:46:48,200 --> 06:46:50,584 The mean square error is 0.168. 9307 06:46:50,800 --> 06:46:52,100 That is good. 9308 06:46:52,100 --> 06:46:53,318 Now finally if we have 9309 06:46:53,318 --> 06:46:55,700 a look at the Learned regression tree model. 
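Here is a minimal sketch of the classification and regression steps walked through above, using pyspark.mllib.tree with a 70/30 split and a maximum depth of 3; binary labels are assumed here, and the exact numClasses and the error values quoted in the video may differ.

from pyspark.mllib.tree import DecisionTree

# 70/30 train/test split
training_data, test_data = parsed_data.randomSplit([0.7, 0.3])

# Classification tree
clf_model = DecisionTree.trainClassifier(training_data, numClasses=2,
                                         categoricalFeaturesInfo={}, maxDepth=3)
predictions = clf_model.predict(test_data.map(lambda lp: lp.features))
labels_and_preds = test_data.map(lambda lp: lp.label).zip(predictions)
test_err = labels_and_preds.filter(lambda p: p[0] != p[1]).count() / float(test_data.count())
print("Classification test error:", test_err)
print(clf_model.toDebugString())   # the learned tree, node by node

# Regression tree on the same data, evaluated with mean squared error
reg_model = DecisionTree.trainRegressor(training_data,
                                        categoricalFeaturesInfo={}, maxDepth=3)
reg_predictions = reg_model.predict(test_data.map(lambda lp: lp.features))
mse = test_data.map(lambda lp: lp.label).zip(reg_predictions) \
               .map(lambda p: (p[0] - p[1]) ** 2).mean()
print("Mean squared error:", mse)
print(reg_model.toDebugString())   # the learned regression tree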
9310 06:46:56,800 --> 06:47:00,300 You can see we have created the regression tree model 9311 06:47:00,300 --> 06:47:02,800 till the depth of 3 with 15 notes. 9312 06:47:02,800 --> 06:47:04,577 And here we have all the features 9313 06:47:04,577 --> 06:47:06,300 and classification of the tree. 9314 06:47:11,000 --> 06:47:11,675 Hello folks. 9315 06:47:11,675 --> 06:47:13,700 Welcome to spawn interview questions. 9316 06:47:13,800 --> 06:47:16,949 The session has been planned collectively to have commonly 9317 06:47:16,949 --> 06:47:19,988 asked interview questions later to the smart technology 9318 06:47:19,988 --> 06:47:22,400 and the general answer and the expectation 9319 06:47:22,400 --> 06:47:25,594 is already you are aware of this particular technology. 9320 06:47:25,594 --> 06:47:29,200 To some extent and in general the common questions being asked 9321 06:47:29,200 --> 06:47:31,500 as well as I will give interaction with the technology 9322 06:47:31,500 --> 06:47:33,600 as so let's get this started. 9323 06:47:33,600 --> 06:47:36,023 So the agenda for this particular session is 9324 06:47:36,023 --> 06:47:38,197 the basic questions are going to cover 9325 06:47:38,197 --> 06:47:41,138 and questions later to the spark core Technologies. 9326 06:47:41,138 --> 06:47:42,400 That's when I say spark 9327 06:47:42,400 --> 06:47:44,900 or that's going to be the base and top 9328 06:47:44,900 --> 06:47:48,075 of spark or we have four important components 9329 06:47:48,075 --> 06:47:50,669 which work that is streaming Graphics. 9330 06:47:50,669 --> 06:47:53,100 Ml Abe and SQL all these components 9331 06:47:53,100 --> 06:47:57,500 have been created to satisfy a The government again interaction 9332 06:47:57,500 --> 06:47:59,495 with these Technologies and get 9333 06:47:59,495 --> 06:48:02,200 into the commonly asked interview questions 9334 06:48:02,300 --> 06:48:04,500 and the questions also framed such a way. 9335 06:48:04,500 --> 06:48:07,200 It covers the spectrum of the doubts as well 9336 06:48:07,200 --> 06:48:10,600 as the features available within that specific technology. 9337 06:48:10,600 --> 06:48:12,512 So let's take the first question 9338 06:48:12,512 --> 06:48:15,800 and look into the answer like how commonly this covered. 9339 06:48:15,800 --> 06:48:19,800 What is Apache spark and Spark It's with Apache Foundation now, 9340 06:48:20,000 --> 06:48:21,000 it's open source. 9341 06:48:21,000 --> 06:48:22,809 It's a cluster Computing framework 9342 06:48:22,809 --> 06:48:24,280 for real-time processing. 9343 06:48:24,280 --> 06:48:25,750 So three main keywords over. 9344 06:48:25,750 --> 06:48:28,151 Here a purchase markets are open source project. 9345 06:48:28,151 --> 06:48:29,856 It's used for cluster Computing. 9346 06:48:29,856 --> 06:48:33,272 And for a memory processing along with real-time processing. 9347 06:48:33,272 --> 06:48:35,485 It's going to support in memory Computing. 9348 06:48:35,485 --> 06:48:36,672 So the lots of project 9349 06:48:36,672 --> 06:48:38,400 which supports cluster Computing 9350 06:48:38,400 --> 06:48:42,100 along with that spark differentiate Itself by doing 9351 06:48:42,100 --> 06:48:43,839 the in-memory Computing. 9352 06:48:43,839 --> 06:48:46,231 It's very active community and out 9353 06:48:46,231 --> 06:48:50,000 of the Hadoop ecosystem technology is Apache spark is 9354 06:48:50,000 --> 06:48:51,500 very active multiple releases. 9355 06:48:51,500 --> 06:48:52,800 We got last year. 
9356 06:48:52,800 --> 06:48:56,750 It's a very inactive project among the about your Basically, 9357 06:48:56,750 --> 06:49:00,072 it's a framework kind support in memory Computing 9358 06:49:00,072 --> 06:49:04,100 and cluster Computing and you may face this specific question 9359 06:49:04,100 --> 06:49:05,700 how spark is different 9360 06:49:05,700 --> 06:49:08,085 than mapreduce on how you can compare it 9361 06:49:08,085 --> 06:49:11,400 with the mapreduce mapreduce is the processing pathology 9362 06:49:11,400 --> 06:49:12,900 within the Hadoop ecosystem 9363 06:49:12,900 --> 06:49:14,400 and within Hadoop ecosystem. 9364 06:49:14,400 --> 06:49:18,700 We have hdfs Hadoop distributed file system mapreduce going 9365 06:49:18,700 --> 06:49:23,300 to support distributed computing and how spark is different. 9366 06:49:23,300 --> 06:49:25,900 So how we can compare smart with them. 9367 06:49:25,900 --> 06:49:28,907 Mapreduce in a way this comparison going 9368 06:49:28,907 --> 06:49:32,400 to help us to understand the technology better. 9369 06:49:32,400 --> 06:49:33,100 But definitely 9370 06:49:33,100 --> 06:49:36,600 like we cannot compare these two or two different methodologies 9371 06:49:36,600 --> 06:49:40,200 by which it's going to work spark is very simple to program 9372 06:49:40,200 --> 06:49:42,700 but mapreduce there is no abstraction 9373 06:49:42,700 --> 06:49:44,118 or the sense like all 9374 06:49:44,118 --> 06:49:47,900 the implementations we have to provide and interactivity. 9375 06:49:47,900 --> 06:49:52,200 It's has an interactive mode to work with inspark a mapreduce. 9376 06:49:52,200 --> 06:49:53,800 That is no interactive mode. 9377 06:49:53,800 --> 06:49:55,900 There are some components like Apache. 9378 06:49:55,900 --> 06:49:56,800 Big and high 9379 06:49:56,800 --> 06:50:00,400 which facilitates has to do the interactive Computing 9380 06:50:00,400 --> 06:50:02,145 or interactive programming 9381 06:50:02,145 --> 06:50:05,100 and smog supports real-time stream processing 9382 06:50:05,100 --> 06:50:07,700 and to precisely say with inspark 9383 06:50:07,700 --> 06:50:11,000 the stream processing is called a near real-time processing. 9384 06:50:11,000 --> 06:50:13,600 There's nothing in the world is Real Time processing. 9385 06:50:13,600 --> 06:50:15,100 It's near real-time processing. 9386 06:50:15,100 --> 06:50:18,200 It's going to do the processing and micro batches. 9387 06:50:18,200 --> 06:50:19,200 I'll cover in detail 9388 06:50:19,200 --> 06:50:21,400 when we are moving onto the streaming concept 9389 06:50:21,400 --> 06:50:22,600 and you're going to do 9390 06:50:22,600 --> 06:50:25,700 the batch processing on the historical data in Matrix. 9391 06:50:25,700 --> 06:50:28,300 Zeus when I say stream processing I will get the data 9392 06:50:28,300 --> 06:50:31,025 that is getting processed in real time and do 9393 06:50:31,025 --> 06:50:33,849 the processing and get the result either store it 9394 06:50:33,849 --> 06:50:35,772 on publish to publish Community. 
9395 06:50:35,772 --> 06:50:37,697 We will be doing it let and see 9396 06:50:37,697 --> 06:50:40,149 wise mapreduce will have very high latency 9397 06:50:40,149 --> 06:50:42,915 because it has to read the data from hard disk, 9398 06:50:42,915 --> 06:50:45,200 but spark it will have very low latency 9399 06:50:45,200 --> 06:50:47,200 because it can reprocess 9400 06:50:47,200 --> 06:50:50,500 are used the data already cased in memory, 9401 06:50:50,500 --> 06:50:53,786 but there is a small catch over here in spark first time 9402 06:50:53,786 --> 06:50:56,600 when the data gets loaded it has Tool to read it 9403 06:50:56,600 --> 06:50:59,100 from the hard disk same as mapreduce. 9404 06:50:59,100 --> 06:51:01,600 So once it is red it will be there in the memory. 9405 06:51:01,692 --> 06:51:03,000 So spark is good. 9406 06:51:03,000 --> 06:51:05,100 Whenever we need to do I treat 9407 06:51:05,100 --> 06:51:08,900 a Computing so spark whenever you do I treat a Computing again 9408 06:51:08,900 --> 06:51:11,400 and again to the processing on the same data, 9409 06:51:11,400 --> 06:51:14,200 especially in machine learning deep learning all we will be 9410 06:51:14,200 --> 06:51:17,900 using the iterative Computing his Fox performs much better. 9411 06:51:17,900 --> 06:51:19,805 You will see the rock performance 9412 06:51:19,805 --> 06:51:22,651 Improvement hundred times faster than mapreduce. 9413 06:51:22,651 --> 06:51:25,800 But if it is one time processing and fire-and-forget, 9414 06:51:25,800 --> 06:51:28,805 Get the type of processing spark lately, 9415 06:51:28,805 --> 06:51:30,600 maybe the same latency, 9416 06:51:30,600 --> 06:51:32,699 you will be getting a tan mapreduce maybe 9417 06:51:32,699 --> 06:51:35,900 like some improvements because of the building block or spark. 9418 06:51:35,900 --> 06:51:38,800 That's the ID you may get some additional Advantage. 9419 06:51:38,800 --> 06:51:43,000 So that's the key feature are the key comparison factor 9420 06:51:43,300 --> 06:51:45,200 of sparkin mapreduce. 9421 06:51:45,800 --> 06:51:50,100 Now, let's get on to the key features xnk features of spark. 9422 06:51:50,200 --> 06:51:52,200 We discussed over the Speed and Performance. 9423 06:51:52,200 --> 06:51:54,200 It's going to use the in-memory Computing 9424 06:51:54,200 --> 06:51:55,559 so Speed and Performance. 9425 06:51:55,559 --> 06:51:57,300 Place it's going to much better. 9426 06:51:57,300 --> 06:52:00,900 When we do actually to Computing and Somali got the sense 9427 06:52:00,900 --> 06:52:03,810 the programming language to be used with a spark. 9428 06:52:03,810 --> 06:52:06,700 It can be any of these languages can be python. 9429 06:52:06,700 --> 06:52:08,400 Java are our scale. 9430 06:52:08,400 --> 06:52:08,570 Mm. 9431 06:52:08,570 --> 06:52:11,300 We can do programming with any of these languages 9432 06:52:11,300 --> 06:52:14,200 and data formats to give us a input. 
9433 06:52:14,200 --> 06:52:17,172 We can give any data formats like Jason back 9434 06:52:17,172 --> 06:52:18,900 with a data formats began 9435 06:52:18,900 --> 06:52:21,888 if there is a input and the key selling point 9436 06:52:21,888 --> 06:52:24,400 with the spark is it's lazy evaluation the 9437 06:52:24,400 --> 06:52:25,575 since it's going 9438 06:52:25,575 --> 06:52:29,100 To calculate the DAC cycle directed acyclic graph 9439 06:52:29,100 --> 06:52:32,700 d a g because that is a th e it's going to calculate 9440 06:52:32,700 --> 06:52:35,300 what all steps needs to be executed to achieve 9441 06:52:35,300 --> 06:52:36,400 the final result. 9442 06:52:36,400 --> 06:52:38,969 So we need to give all the steps as well as 9443 06:52:38,969 --> 06:52:40,519 what final result I want. 9444 06:52:40,519 --> 06:52:42,983 It's going to calculate the optimal cycle 9445 06:52:42,983 --> 06:52:44,400 on optimal calculation. 9446 06:52:44,400 --> 06:52:46,400 What else tips needs to be calculated 9447 06:52:46,400 --> 06:52:49,100 or what else tips needs to be executed only those steps 9448 06:52:49,100 --> 06:52:50,500 it will be executing it. 9449 06:52:50,500 --> 06:52:52,900 So basically it's a lazy execution only 9450 06:52:52,900 --> 06:52:54,450 if the results needs to be processed, 9451 06:52:54,450 --> 06:52:55,800 it will be processing that. 9452 06:52:55,800 --> 06:52:58,623 Because of it and it's about real-time Computing. 9453 06:52:58,623 --> 06:53:00,200 It's through spark streaming 9454 06:53:00,200 --> 06:53:02,200 that is a component called spark streaming 9455 06:53:02,200 --> 06:53:04,700 which supports real-time Computing and it gels 9456 06:53:04,700 --> 06:53:07,115 with Hadoop ecosystem variable. 9457 06:53:07,115 --> 06:53:09,500 It can run on top of Hadoop Ian 9458 06:53:09,500 --> 06:53:12,562 or it can Leverage The hdfs to do the processing. 9459 06:53:12,562 --> 06:53:16,300 So when it leverages the hdfs the Hadoop cluster container 9460 06:53:16,300 --> 06:53:19,400 can be used to do the distributed computing 9461 06:53:19,400 --> 06:53:23,707 as well as it can leverage the resource manager to manage 9462 06:53:23,707 --> 06:53:25,400 the resources so spot. 9463 06:53:25,400 --> 06:53:28,426 I can gel with the hdfs very well as well as it can leverage 9464 06:53:28,426 --> 06:53:29,642 the resource manager 9465 06:53:29,642 --> 06:53:32,500 to share the resources as well as data locality. 9466 06:53:32,500 --> 06:53:34,699 You can give each data locality. 9467 06:53:34,699 --> 06:53:36,900 It can do the processing we have 9468 06:53:36,900 --> 06:53:41,200 to the database data is located within the hdfs and has a fleet 9469 06:53:41,200 --> 06:53:43,700 of machine learning algorithms already implemented 9470 06:53:43,700 --> 06:53:46,100 right from clustering classification regression. 9471 06:53:46,100 --> 06:53:48,238 All this logic already implemented 9472 06:53:48,238 --> 06:53:49,600 and machine learning. 9473 06:53:49,600 --> 06:53:52,400 It's achieved using MLA be within spark 9474 06:53:52,400 --> 06:53:54,800 and there is a component called a graphics 9475 06:53:54,800 --> 06:53:58,600 which supports Maybe we can solve the problems using 9476 06:53:58,600 --> 06:54:02,600 graph Theory using the component Graphics within this park. 9477 06:54:02,700 --> 06:54:04,700 So these are the things we can consider as 9478 06:54:04,700 --> 06:54:06,700 the key features of spark. 
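As a small illustration of the lazy evaluation and DAG behaviour mentioned above, here is a sketch (sample numbers are made up, and an existing SparkContext sc is assumed): the transformations only record steps in the DAG, and nothing executes until an action is called.

nums = sc.parallelize(range(1, 1001))

squares = nums.map(lambda x: x * x)           # transformation - lazy
evens = squares.filter(lambda x: x % 2 == 0)  # transformation - still lazy
evens.cache()                                 # keep the result in memory for iterative reuse

print(evens.count())   # action - triggers execution of the whole DAG
print(evens.count())   # a second action reuses the cached data instead of recomputing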
9479 06:54:06,700 --> 06:54:09,400 So when you discuss with the installation 9480 06:54:09,400 --> 06:54:10,300 of the spark, 9481 06:54:10,300 --> 06:54:13,581 you may come across this year on what is he on do you 9482 06:54:13,581 --> 06:54:16,765 need to install spark on all nodes of young cluster? 9483 06:54:16,765 --> 06:54:19,700 So yarn is nothing but another is US negotiator. 9484 06:54:19,700 --> 06:54:22,500 That's the resource manager within the Hadoop ecosystem. 9485 06:54:22,500 --> 06:54:25,529 So that's going to provide the resource management platform. 9486 06:54:25,529 --> 06:54:28,200 Ian going to provide the resource management platform 9487 06:54:28,200 --> 06:54:29,500 across all the Clusters 9488 06:54:29,600 --> 06:54:33,200 and Spark It's going to provide the data processing. 9489 06:54:33,200 --> 06:54:35,300 So wherever there is a horse being used 9490 06:54:35,300 --> 06:54:38,049 that location response will be used to do the data processing. 9491 06:54:38,049 --> 06:54:39,056 And of course, yes, 9492 06:54:39,056 --> 06:54:41,600 we need to have spark installed on all the nodes. 9493 06:54:41,800 --> 06:54:43,900 It's Parker stores are located. 9494 06:54:43,900 --> 06:54:47,100 That's basically we need those libraries an additional 9495 06:54:47,100 --> 06:54:50,200 to the installation of spark and all the worker nodes. 9496 06:54:50,200 --> 06:54:52,106 We need to increase the ram capacity 9497 06:54:52,106 --> 06:54:53,283 on the VOC emissions 9498 06:54:53,283 --> 06:54:55,800 as well as far going to consume huge amounts. 9499 06:54:56,100 --> 06:55:00,500 Memory to do the processing it will not do the mapreduce way 9500 06:55:00,500 --> 06:55:01,600 of working internally. 9501 06:55:01,600 --> 06:55:04,191 It's going to generate the next cycle and do 9502 06:55:04,191 --> 06:55:06,000 the processing on top of yeah, 9503 06:55:06,000 --> 06:55:09,900 so Ian and the high level it's like resource manager 9504 06:55:09,900 --> 06:55:13,100 or like an operating system for the distributed computing. 9505 06:55:13,100 --> 06:55:15,500 It's going to coordinate all the resource management 9506 06:55:15,500 --> 06:55:17,900 across the fleet of servers on top of it. 9507 06:55:17,900 --> 06:55:20,100 I can have multiple components 9508 06:55:20,100 --> 06:55:25,100 like spark these giraffe this park especially it's going 9509 06:55:25,100 --> 06:55:27,800 to help Just watch it in memory Computing. 9510 06:55:27,800 --> 06:55:30,900 So sparkly on is nothing but it's a resource manager 9511 06:55:30,900 --> 06:55:33,600 to manage the resource across the cluster on top of it. 9512 06:55:33,600 --> 06:55:35,470 We can have spunk and yes, 9513 06:55:35,470 --> 06:55:37,700 we need to have spark installed 9514 06:55:37,700 --> 06:55:41,800 and all the notes on where the spark yarn cluster is used 9515 06:55:41,800 --> 06:55:43,581 and also additional to that. 9516 06:55:43,581 --> 06:55:45,809 We need to have the memory increased 9517 06:55:45,809 --> 06:55:47,400 in all the worker robots. 9518 06:55:47,600 --> 06:55:48,870 The next question goes 9519 06:55:48,870 --> 06:55:51,400 like this what file system response support. 9520 06:55:52,300 --> 06:55:55,779 What is the file system then we work in individual system. 9521 06:55:55,779 --> 06:55:58,100 We will be having a file system to work 9522 06:55:58,100 --> 06:56:01,000 within that particular operating system Mary 9523 06:56:01,000 --> 06:56:04,900 redistributed cluster or in the distributed architecture. 
9524 06:56:04,900 --> 06:56:06,744 We need a file system with which 9525 06:56:06,744 --> 06:56:09,800 where we can store the data in a distribute mechanism. 9526 06:56:09,800 --> 06:56:12,900 How do comes with the file system called hdfs. 9527 06:56:13,100 --> 06:56:15,800 It's called Hadoop distributed file system 9528 06:56:15,800 --> 06:56:19,131 by data gets distributed across multiple systems 9529 06:56:19,131 --> 06:56:21,400 and it will be coordinated by 2. 9530 06:56:21,400 --> 06:56:24,500 Different type of components called name node and data node 9531 06:56:24,500 --> 06:56:27,800 and Spark it can use this hdfs directly. 9532 06:56:27,800 --> 06:56:30,900 So you can have any files in hdfs and start using it 9533 06:56:30,900 --> 06:56:34,800 within the spark ecosystem and it gives another advantage 9534 06:56:34,800 --> 06:56:35,900 of data locality 9535 06:56:35,900 --> 06:56:38,415 when it does the distributed processing wherever 9536 06:56:38,415 --> 06:56:39,700 the data is distributed. 9537 06:56:39,700 --> 06:56:42,400 The processing could be done locally to that particular 9538 06:56:42,400 --> 06:56:44,300 Mission way data is located 9539 06:56:44,300 --> 06:56:47,223 and to start with as a standalone mode. 9540 06:56:47,223 --> 06:56:49,500 You can use the local file system aspect. 9541 06:56:49,600 --> 06:56:51,508 So this could be used especially 9542 06:56:51,508 --> 06:56:53,818 when we are doing the development or any 9543 06:56:53,818 --> 06:56:56,390 of you see you can use the local file system 9544 06:56:56,390 --> 06:56:59,500 and Amazon Cloud provides another file system called. 9545 06:56:59,500 --> 06:57:02,119 Yes, three simple storage service we call 9546 06:57:02,119 --> 06:57:03,100 that is the S3. 9547 06:57:03,100 --> 06:57:04,998 It's a block storage service. 9548 06:57:04,998 --> 06:57:06,700 This can also be leveraged 9549 06:57:06,700 --> 06:57:09,238 or used within spa for the storage 9550 06:57:09,800 --> 06:57:11,100 and lot other file system. 9551 06:57:11,100 --> 06:57:14,700 Also, it supports there are some file systems like Alex, 9552 06:57:14,700 --> 06:57:17,700 oh which provides in memory storage 9553 06:57:17,700 --> 06:57:20,800 so we can leverage that particular file system as well. 9554 06:57:21,100 --> 06:57:22,796 So we have seen all the features. 9555 06:57:22,796 --> 06:57:25,580 What are the functionalities available with inspark? 9556 06:57:25,580 --> 06:57:27,600 We're going to look at the limitations 9557 06:57:27,600 --> 06:57:28,800 of using spark. 9558 06:57:28,800 --> 06:57:30,252 Of course every component 9559 06:57:30,252 --> 06:57:33,000 when it comes with a huge power and Advantage. 9560 06:57:33,000 --> 06:57:35,200 It will have its own limitations as well. 9561 06:57:35,300 --> 06:57:38,900 The equation illustrates some limitations of using 9562 06:57:38,900 --> 06:57:41,900 spark spark utilizes more storage space 9563 06:57:41,900 --> 06:57:43,400 compared to Hadoop 9564 06:57:43,400 --> 06:57:44,715 and it comes to the installation. 9565 06:57:44,715 --> 06:57:47,600 It's going to consume more space but in the Big Data world, 9566 06:57:47,600 --> 06:57:49,500 that's not a very huge constraint 9567 06:57:49,500 --> 06:57:52,206 because storage cons is not Great are very high 9568 06:57:52,206 --> 06:57:55,504 and our big data space and developer needs to be careful 9569 06:57:55,504 --> 06:57:58,275 while running the apps and Spark the reason 9570 06:57:58,275 --> 06:58:00,300 because it uses in-memory Computing. 
9571 06:58:00,400 --> 06:58:02,870 Of course, it handles the memory very well. 9572 06:58:02,870 --> 06:58:05,400 But if you try to load a huge amount of data 9573 06:58:05,400 --> 06:58:08,700 and the distributed environment and if you try to do is join 9574 06:58:08,700 --> 06:58:09,903 when you try to do join 9575 06:58:09,903 --> 06:58:13,491 within the distributed world the data going to get transferred 9576 06:58:13,491 --> 06:58:14,700 over the network network 9577 06:58:14,700 --> 06:58:18,100 is really a costly resource So the plan 9578 06:58:18,200 --> 06:58:20,800 or design should be such a way to reduce or minimize. 9579 06:58:20,800 --> 06:58:23,500 As the data transferred over the network 9580 06:58:23,500 --> 06:58:27,103 and however the way possible with all possible means 9581 06:58:27,103 --> 06:58:30,000 we should facilitate distribution of theta 9582 06:58:30,000 --> 06:58:32,200 over multiple missions the more 9583 06:58:32,200 --> 06:58:34,600 we distribute the more parallelism we can achieve 9584 06:58:34,600 --> 06:58:38,500 and the more results we can get and cost efficiency. 9585 06:58:38,500 --> 06:58:40,700 If you try to compare the cost 9586 06:58:40,700 --> 06:58:42,800 how much cost involved 9587 06:58:42,800 --> 06:58:45,700 to do a particular processing take any unit 9588 06:58:45,700 --> 06:58:48,545 in terms of processing 1 GB of data with say 9589 06:58:48,545 --> 06:58:50,200 like II Treaty processing 9590 06:58:50,200 --> 06:58:53,800 if you come Cost-wise in-memory Computing always it's considered 9591 06:58:53,800 --> 06:58:57,088 because memory It's relatively come costlier 9592 06:58:57,088 --> 06:58:58,200 than the storage 9593 06:58:58,400 --> 06:59:00,000 so that may act like a bottleneck 9594 06:59:00,000 --> 06:59:01,400 and we cannot increase 9595 06:59:01,400 --> 06:59:05,200 the memory capacity of the mission Beyond supplement. 9596 06:59:05,900 --> 06:59:07,500 So we have to grow horizontally. 9597 06:59:07,800 --> 06:59:10,042 So when we have the data distributor 9598 06:59:10,042 --> 06:59:11,900 in memory across the cluster, 9599 06:59:12,000 --> 06:59:13,337 of course the network transfer 9600 06:59:13,337 --> 06:59:15,300 all those bottlenecks will come into picture. 9601 06:59:15,300 --> 06:59:17,400 So we have to strike the right balance 9602 06:59:17,400 --> 06:59:20,700 which will help us to achieve the in-memory computing. 9603 06:59:20,700 --> 06:59:22,775 Whatever, they memory computer repair it 9604 06:59:22,775 --> 06:59:24,000 will help us to achieve 9605 06:59:24,000 --> 06:59:25,757 and it consumes huge amount 9606 06:59:25,757 --> 06:59:28,400 of data processing compared to Hadoop 9607 06:59:28,600 --> 06:59:30,600 and Spark it performs 9608 06:59:30,600 --> 06:59:33,800 better than use it as a creative Computing 9609 06:59:33,800 --> 06:59:36,700 because it likes for both spark and the other Technologies. 9610 06:59:36,700 --> 06:59:37,699 It has to read data 9611 06:59:37,699 --> 06:59:39,700 for the first time from the hottest car 9612 06:59:39,700 --> 06:59:43,300 from other data source and Spark performance is really better 9613 06:59:43,300 --> 06:59:46,114 when it reads the data onto does the processing 9614 06:59:46,114 --> 06:59:48,500 when the data is available in the cache, 9615 06:59:48,723 --> 06:59:50,800 of course is the DAC cycle. 
9616 06:59:50,800 --> 06:59:53,094 It's going to give us a lot of advantage 9617 06:59:53,094 --> 06:59:54,400 while doing the processing 9618 06:59:54,400 --> 06:59:56,802 but the in-memory Computing processing 9619 06:59:56,802 --> 06:59:59,400 that's going to give us lots of Leverage. 9620 06:59:59,400 --> 07:00:01,605 The next question list some use cases 9621 07:00:01,605 --> 07:00:04,300 where Spark outperforms Hadoop in processing. 9622 07:00:04,400 --> 07:00:06,300 The first thing is the real time processing. 9623 07:00:06,300 --> 07:00:08,629 How do you cannot handle real time processing 9624 07:00:08,629 --> 07:00:10,884 but spark and handle real time processing. 9625 07:00:10,884 --> 07:00:13,843 So any data that's coming in in the land architecture. 9626 07:00:13,843 --> 07:00:15,300 You will have three layers. 9627 07:00:15,300 --> 07:00:17,210 The most of the Big Data projects will be 9628 07:00:17,210 --> 07:00:18,500 in the Lambda architecture. 9629 07:00:18,500 --> 07:00:21,500 You will have speed layer by layer and sighs Leo 9630 07:00:21,500 --> 07:00:23,900 and the speed layer whenever the river comes 9631 07:00:23,900 --> 07:00:26,900 in that needs to be processed stored and handled. 9632 07:00:26,900 --> 07:00:27,975 So in those type 9633 07:00:27,975 --> 07:00:30,800 of real-time processing stock is the best fit. 9634 07:00:30,800 --> 07:00:32,500 Of course, we can Hadoop ecosystem. 9635 07:00:32,500 --> 07:00:33,837 We have other components 9636 07:00:33,837 --> 07:00:36,400 which does the real-time processing like storm. 9637 07:00:36,400 --> 07:00:39,000 But when you want to Leverage The Machine learning 9638 07:00:39,000 --> 07:00:40,500 along with the Sparks dreaming 9639 07:00:40,500 --> 07:00:43,200 on such computation spark will be much better. 9640 07:00:43,200 --> 07:00:44,243 So that's why I like 9641 07:00:44,243 --> 07:00:45,621 when you have architecture 9642 07:00:45,621 --> 07:00:47,900 like a Lambda architecture you want to have 9643 07:00:47,900 --> 07:00:51,100 all three layers bachelier speed layer and service. 9644 07:00:51,100 --> 07:00:54,800 A spark and gel the speed layer and service layer far better 9645 07:00:54,800 --> 07:00:56,800 and it's going to provide better performance. 9646 07:00:56,800 --> 07:00:59,400 And whenever you do the edge processing 9647 07:00:59,400 --> 07:01:02,400 especially like doing a machine learning processing, 9648 07:01:02,400 --> 07:01:04,501 we will leverage nitrate in Computing 9649 07:01:04,501 --> 07:01:06,210 and can perform a hundred times 9650 07:01:06,210 --> 07:01:08,800 faster than Hadoop the more diversity processing 9651 07:01:08,800 --> 07:01:11,600 that we do the more data will be read from the memory 9652 07:01:11,600 --> 07:01:14,700 and it's going to get as much faster performance 9653 07:01:14,700 --> 07:01:16,700 than I did with mapreduce. 9654 07:01:16,700 --> 07:01:20,100 So again, remember whenever you do the processing only buns, 9655 07:01:20,100 --> 07:01:23,000 so you're going to to do the processing finally bonds 9656 07:01:23,000 --> 07:01:24,900 read process it and deliver. 9657 07:01:24,900 --> 07:01:27,516 The result spark may not be the best fit 9658 07:01:27,516 --> 07:01:30,200 that can be done with a mapreduce itself. 
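For the speed-layer, stream-processing point above, here is a minimal sketch using the classic DStream API in pyspark.streaming; the socket host and port are hypothetical, and the 5-second micro-batch interval is just an example.

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 5)                       # micro-batches every 5 seconds
lines = ssc.socketTextStream("localhost", 9999)     # hypothetical socket source

counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                     # print each micro-batch result

ssc.start()
ssc.awaitTermination()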
9659 07:01:30,200 --> 07:01:32,773 And there is another component called akka it's 9660 07:01:32,773 --> 07:01:35,600 a messaging system our message quantity 9661 07:01:35,600 --> 07:01:38,500 in system Sparkle internally uses account 9662 07:01:38,500 --> 07:01:40,500 for scheduling our any task 9663 07:01:40,500 --> 07:01:43,100 that needs to be assigned by the master to the worker 9664 07:01:43,700 --> 07:01:45,700 and the follow-up of that particular task 9665 07:01:45,700 --> 07:01:49,000 by the master basically asynchronous coordination system 9666 07:01:49,000 --> 07:01:51,000 and that's achieved using akka 9667 07:01:51,400 --> 07:01:55,100 I call programming internally it's used by this monk 9668 07:01:55,100 --> 07:01:56,551 as such for the developers. 9669 07:01:56,551 --> 07:01:59,358 We don't need to worry about a couple of growing up. 9670 07:01:59,358 --> 07:02:00,900 Of course we can leverage it 9671 07:02:00,900 --> 07:02:04,500 but the car is used internally by the spawn for scheduling 9672 07:02:04,500 --> 07:02:08,800 and coordination between master and the burqa and with inspark. 9673 07:02:08,800 --> 07:02:10,700 We have few major components. 9674 07:02:10,700 --> 07:02:13,200 Let's see, what are the major components 9675 07:02:13,200 --> 07:02:14,500 of a possessed man. 9676 07:02:14,500 --> 07:02:18,069 The lay the components of spot ecosystem start comes 9677 07:02:18,069 --> 07:02:19,319 with a core engine. 9678 07:02:19,319 --> 07:02:20,700 So that has the core. 9679 07:02:20,700 --> 07:02:23,570 Realities of what is required from by the spark 9680 07:02:23,570 --> 07:02:26,600 of all this Punk Oddities are the building blocks 9681 07:02:26,600 --> 07:02:29,361 of the spark core engine on top of spark 9682 07:02:29,361 --> 07:02:31,300 or the basic functionalities are 9683 07:02:31,300 --> 07:02:34,600 file interaction file system coordination all that's done 9684 07:02:34,600 --> 07:02:36,400 by the spark core engine 9685 07:02:36,400 --> 07:02:38,432 on top of spark core engine. 9686 07:02:38,432 --> 07:02:40,900 We have a number of other offerings 9687 07:02:40,900 --> 07:02:44,700 to do machine learning to do graph Computing to do streaming. 9688 07:02:44,700 --> 07:02:47,000 We have n number of other components. 9689 07:02:47,000 --> 07:02:48,800 So the major use the components 9690 07:02:48,800 --> 07:02:51,000 of these components like Sparks equal. 9691 07:02:51,000 --> 07:02:52,037 Spock streaming. 9692 07:02:52,037 --> 07:02:55,520 I'm a little graphics and Spark our other high level. 9693 07:02:55,520 --> 07:02:58,400 We will see what are these components Sparks 9694 07:02:58,400 --> 07:03:02,000 equal especially it's designed to do the processing 9695 07:03:02,000 --> 07:03:03,729 against a structure data 9696 07:03:03,729 --> 07:03:07,400 so we can write SQL queries and we can handle 9697 07:03:07,400 --> 07:03:08,854 or we can do the processing. 9698 07:03:08,854 --> 07:03:11,400 So it's going to give us the interface to interact 9699 07:03:11,400 --> 07:03:12,100 with the data, 9700 07:03:12,300 --> 07:03:15,900 especially structure data and other language 9701 07:03:15,900 --> 07:03:18,700 that we can use it's more similar to 9702 07:03:18,700 --> 07:03:20,600 what we use within the SQL. 
9703 07:03:20,600 --> 07:03:22,700 Well, I can say 99 percentage is seen 9704 07:03:22,700 --> 07:03:25,934 and most of the commonly used functionalities within the SQL 9705 07:03:25,934 --> 07:03:28,111 have been implemented within smocks equal 9706 07:03:28,111 --> 07:03:31,700 and Spark streaming is going to support the stream processing. 9707 07:03:31,700 --> 07:03:34,000 That's the offering available to handle 9708 07:03:34,000 --> 07:03:35,920 the stream processing and MLA 9709 07:03:35,920 --> 07:03:38,900 based the offering to handle machine learning. 9710 07:03:38,900 --> 07:03:42,700 So the component name is called ml in and has a list 9711 07:03:42,700 --> 07:03:44,300 of components a list 9712 07:03:44,300 --> 07:03:47,300 of machine learning algorithms already defined 9713 07:03:47,300 --> 07:03:50,700 we can leverage and use any of those machine learning. 9714 07:03:51,400 --> 07:03:54,944 Graphics again, it's a graph processing offerings 9715 07:03:54,944 --> 07:03:56,200 within the spark. 9716 07:03:56,200 --> 07:03:59,141 It's going to support us to achieve graph Computing 9717 07:03:59,141 --> 07:04:02,330 against the data that we have like pagerank calculation. 9718 07:04:02,330 --> 07:04:04,107 How many connector identities 9719 07:04:04,107 --> 07:04:07,600 how many triangles all those going to provide us a meaning 9720 07:04:07,600 --> 07:04:09,300 to that particular data 9721 07:04:09,300 --> 07:04:12,500 and Spark are is the component is going to interact 9722 07:04:12,500 --> 07:04:14,371 or helpers to leverage. 9723 07:04:14,371 --> 07:04:17,856 The language are within the spark environment 9724 07:04:18,100 --> 07:04:20,600 are is a statistical programming language. 9725 07:04:20,600 --> 07:04:23,170 Each where we can do statistical Computing, 9726 07:04:23,170 --> 07:04:24,700 which is Park environment 9727 07:04:24,700 --> 07:04:28,306 and we can leverage our language by using this parka to get 9728 07:04:28,306 --> 07:04:32,194 that executed within the spark a environment addition to that. 9729 07:04:32,194 --> 07:04:35,675 There are other components as well like approximative is 9730 07:04:35,675 --> 07:04:39,118 it's called blink DB all other things I can be test each. 9731 07:04:39,118 --> 07:04:42,541 So these are the major Lee used components within spark. 9732 07:04:42,541 --> 07:04:43,561 So next question. 9733 07:04:43,561 --> 07:04:45,944 How can start be used alongside her too? 9734 07:04:45,944 --> 07:04:49,000 So when we see a spark performance much better it's 9735 07:04:49,000 --> 07:04:51,000 not a replacement to handle it. 9736 07:04:51,000 --> 07:04:52,100 Going to coexist 9737 07:04:52,100 --> 07:04:55,488 with the Hadoop right Square leveraging the spark 9738 07:04:55,488 --> 07:04:56,900 and Hadoop together. 9739 07:04:56,900 --> 07:05:00,000 It's going to help us to achieve the best result. 9740 07:05:00,000 --> 07:05:00,268 Yes. 9741 07:05:00,268 --> 07:05:04,300 Mark can do in memory Computing or can handle the speed layer 9742 07:05:04,300 --> 07:05:06,600 and Hadoop comes with the resource manager 9743 07:05:06,600 --> 07:05:08,500 so we can leverage the resource manager 9744 07:05:08,500 --> 07:05:10,900 of Hadoop to make smart to work 9745 07:05:11,000 --> 07:05:13,529 and few processing be don't need to Leverage 9746 07:05:13,529 --> 07:05:14,904 The in-memory Computing. 9747 07:05:14,904 --> 07:05:18,500 For example, one time processing to the processing and forget. 
9748 07:05:18,500 --> 07:05:20,773 I just store it we can use mapreduce. 9749 07:05:20,773 --> 07:05:24,700 He's so the processing cost Computing cost will be much less 9750 07:05:24,700 --> 07:05:26,100 compared to Spa 9751 07:05:26,100 --> 07:05:29,400 so we can amalgam eyes and get strike the right balance 9752 07:05:29,400 --> 07:05:31,700 between the batch processing and stream processing 9753 07:05:31,700 --> 07:05:34,507 when we have spark along with Adam. 9754 07:05:34,507 --> 07:05:38,100 Let's have some detail question later to spark core 9755 07:05:38,100 --> 07:05:39,100 with inspark or 9756 07:05:39,100 --> 07:05:41,900 as I mentioned earlier the core building block 9757 07:05:41,900 --> 07:05:45,600 of spark or is our DD resilient distributed data set. 9758 07:05:45,600 --> 07:05:46,654 It's a virtual. 9759 07:05:46,654 --> 07:05:48,442 It's not a physical entity. 9760 07:05:48,442 --> 07:05:49,900 It's a logical entity. 9761 07:05:49,900 --> 07:05:52,400 You will not See this audit is existing. 9762 07:05:52,400 --> 07:05:54,700 The existence of hundred will come into picture 9763 07:05:54,900 --> 07:05:56,474 when you take some action. 9764 07:05:56,474 --> 07:05:59,200 So this is our Unity will be used are referred 9765 07:05:59,200 --> 07:06:00,800 to create the DAC cycle 9766 07:06:00,943 --> 07:06:05,500 and arteries will be optimized to transform from one form 9767 07:06:05,500 --> 07:06:07,264 to another form to make a plan 9768 07:06:07,264 --> 07:06:09,400 how the data set needs to be transformed 9769 07:06:09,400 --> 07:06:11,500 from one structure to another structure. 9770 07:06:11,700 --> 07:06:14,817 And finally when you take some against an RTD that existence 9771 07:06:14,817 --> 07:06:15,924 of the data structure 9772 07:06:15,924 --> 07:06:18,200 that resulted in data will come into picture 9773 07:06:18,200 --> 07:06:20,500 and that can be stored in any file system 9774 07:06:20,500 --> 07:06:22,000 whether it's GFS is 3 9775 07:06:22,000 --> 07:06:24,568 or any other file system can be stored and 9776 07:06:24,568 --> 07:06:27,900 that it is can exist in a partition form the sense. 9777 07:06:27,900 --> 07:06:30,600 It can get distributed across multiple systems 9778 07:06:30,600 --> 07:06:33,800 and it's fault tolerant and it's a fault tolerant. 9779 07:06:33,800 --> 07:06:36,494 If any of the artery is lost any partition 9780 07:06:36,494 --> 07:06:37,742 of the RTD is lost. 9781 07:06:37,742 --> 07:06:40,700 It can regenerate only that specific partition 9782 07:06:40,700 --> 07:06:41,700 it can regenerate 9783 07:06:41,900 --> 07:06:43,900 so that's a huge advantage of our GD. 9784 07:06:43,900 --> 07:06:46,600 So it's a mass like first the huge advantage of added. 9785 07:06:46,600 --> 07:06:47,900 It's a fault-tolerant 9786 07:06:47,900 --> 07:06:50,600 where it can regenerate the last rdds. 9787 07:06:50,600 --> 07:06:53,606 And it can exist in a distributed fashion 9788 07:06:53,606 --> 07:06:55,165 and it is immutable the 9789 07:06:55,165 --> 07:06:59,300 since once the RTD is defined on like it it cannot be changed. 9790 07:06:59,300 --> 07:07:01,500 The next question is how do we create rdds 9791 07:07:01,500 --> 07:07:04,500 in spark the two ways we can create The Oddities one 9792 07:07:04,664 --> 07:07:09,700 as isn't the spark context we can use any of the collections 9793 07:07:09,700 --> 07:07:12,700 that's available within this scalar or in the Java and using 9794 07:07:12,700 --> 07:07:14,000 the paralyzed function. 
9795 07:07:14,000 --> 07:07:17,049 We can create the RTD and it's going to use 9796 07:07:17,049 --> 07:07:20,474 the underlying file systems distribution mechanism 9797 07:07:20,474 --> 07:07:23,900 if The data is located in distributed file system, 9798 07:07:23,900 --> 07:07:24,700 like hdfs. 9799 07:07:25,000 --> 07:07:27,154 It will leverage that and it will make 9800 07:07:27,154 --> 07:07:30,331 those arteries available in a number of systems. 9801 07:07:30,331 --> 07:07:33,696 So it's going to leverage and follow the same distribution 9802 07:07:33,696 --> 07:07:34,700 and already Aspen 9803 07:07:34,700 --> 07:07:37,200 or we can create the rdt by loading the data 9804 07:07:37,200 --> 07:07:39,835 from external sources as well like its peace 9805 07:07:39,835 --> 07:07:42,900 and hdfs be may not consider as an external Source. 9806 07:07:42,900 --> 07:07:45,300 It will be consider as a file system of Hadoop. 9807 07:07:45,400 --> 07:07:47,300 So when Spock is working 9808 07:07:47,300 --> 07:07:49,743 with Hadoop mostly the file system, 9809 07:07:49,743 --> 07:07:51,900 we will be using will be Hdfs, 9810 07:07:51,900 --> 07:07:53,782 if you can read from it each piece 9811 07:07:53,782 --> 07:07:55,900 or even we can do from other sources, 9812 07:07:55,900 --> 07:07:59,781 like Parkwood file or has three different sources a roux. 9813 07:07:59,781 --> 07:08:02,000 You can read and create the RTD. 9814 07:08:02,200 --> 07:08:03,000 Next question is 9815 07:08:03,000 --> 07:08:05,800 what is executed memory in spark application. 9816 07:08:05,800 --> 07:08:08,100 Every Spark application will have fixed. 9817 07:08:08,100 --> 07:08:09,900 It keeps eyes and fixed number, 9818 07:08:09,900 --> 07:08:13,196 of course for the spark executor executor is nothing 9819 07:08:13,196 --> 07:08:16,500 but the execution unit available in every machine 9820 07:08:16,500 --> 07:08:19,600 and that's going to facilitate to do the processing to do 9821 07:08:19,600 --> 07:08:21,654 the tasks in the Water machine, 9822 07:08:21,654 --> 07:08:25,300 so irrespective of whether you use yarn resource manager 9823 07:08:25,300 --> 07:08:26,800 or any other measures 9824 07:08:26,800 --> 07:08:29,600 like resource manager every worker Mission. 9825 07:08:29,600 --> 07:08:31,200 We will have an Executor 9826 07:08:31,200 --> 07:08:34,400 and within the executor the task will be handled 9827 07:08:34,400 --> 07:08:38,700 and the memory to be allocated for that particular executor is 9828 07:08:38,700 --> 07:08:41,893 what we Define as the hip size and we can Define 9829 07:08:41,893 --> 07:08:42,775 how much amount 9830 07:08:42,775 --> 07:08:45,788 of memory should be used for that particular executor 9831 07:08:45,788 --> 07:08:47,700 within the worker machine as well. 9832 07:08:47,700 --> 07:08:50,900 As number of cores can be used within the exit. 9833 07:08:51,000 --> 07:08:53,988 Our by the executor with this path application 9834 07:08:53,988 --> 07:08:55,600 and that can be controlled 9835 07:08:55,600 --> 07:08:58,100 through the configuration files of spark. 9836 07:08:58,100 --> 07:09:01,300 Next questions different partitions in Apache spark. 
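Before moving on to partitions, here is a quick sketch of the two ways of creating RDDs and the executor sizing discussed above; the paths and memory sizes are illustrative assumptions, and an existing SparkContext sc is assumed.

# Way 1: parallelize an in-memory collection through the Spark context
rdd_from_collection = sc.parallelize([10, 20, 30, 40, 50])

# Way 2: load from an external source; the URI scheme picks the file system,
# e.g. a local path, hdfs://..., or s3a://...
rdd_from_file = sc.textFile("hdfs:///data/movies.txt")

# Executor memory and cores are normally fixed at submit time, for example:
#   spark-submit --executor-memory 4g --executor-cores 2 --num-executors 10 app.py
# or through the configuration before the context is created:
#   conf = SparkConf().set("spark.executor.memory", "4g") \
#                     .set("spark.executor.cores", "2")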
9837 07:09:01,300 --> 07:09:03,100 So any data irrespective of 9838 07:09:03,100 --> 07:09:05,478 whether it is a small data a large data, 9839 07:09:05,478 --> 07:09:07,213 we can divide those data sets 9840 07:09:07,213 --> 07:09:10,708 across multiple systems the process of dividing the data 9841 07:09:10,708 --> 07:09:11,961 into multiple pieces 9842 07:09:11,961 --> 07:09:13,310 and making it to store 9843 07:09:13,310 --> 07:09:16,500 across multiple systems as a different logical units. 9844 07:09:16,500 --> 07:09:17,549 It's called partitioning. 9845 07:09:17,549 --> 07:09:20,600 So in simple terms partitioning is nothing but the process 9846 07:09:20,600 --> 07:09:21,700 of Dividing the data 9847 07:09:21,700 --> 07:09:24,800 and storing in multiple systems is called partitions 9848 07:09:24,800 --> 07:09:26,600 and by default the conversion 9849 07:09:26,600 --> 07:09:29,700 of the data into R. TD will happen in the system 9850 07:09:29,700 --> 07:09:31,400 where the partition is existing. 9851 07:09:31,400 --> 07:09:33,830 So the more the partition the more parallelism 9852 07:09:33,830 --> 07:09:36,049 they are going to get at the same time. 9853 07:09:36,049 --> 07:09:38,500 We have to be careful not to trigger huge amount 9854 07:09:38,500 --> 07:09:40,100 of network data transfer as well 9855 07:09:40,300 --> 07:09:43,455 and every a DD can be partitioned with inspark 9856 07:09:43,455 --> 07:09:45,700 and the panel is the partitioning 9857 07:09:45,700 --> 07:09:49,559 going to help us to achieve parallelism more the partition 9858 07:09:49,559 --> 07:09:50,685 that we have more. 9859 07:09:50,685 --> 07:09:52,000 Solutions can be done 9860 07:09:52,000 --> 07:09:54,300 and that the key thing about the success 9861 07:09:54,300 --> 07:09:58,200 of the spark program is minimizing the network traffic 9862 07:09:58,200 --> 07:10:00,900 while doing the parallel processing and minimizing 9863 07:10:00,900 --> 07:10:04,247 the data transfer within the systems of spark. 9864 07:10:04,247 --> 07:10:08,000 What operations does already support so I can operate 9865 07:10:08,000 --> 07:10:10,228 multiple operations against our GD. 9866 07:10:10,228 --> 07:10:13,900 So there are two type of things we can do we can group it 9867 07:10:13,900 --> 07:10:16,000 into two one is transformations 9868 07:10:16,000 --> 07:10:18,800 in Transformations are did he will get transformed 9869 07:10:18,800 --> 07:10:20,600 from one form to another form. 9870 07:10:20,600 --> 07:10:22,600 Select filtering grouping all 9871 07:10:22,600 --> 07:10:25,000 that like it's going to get transformed 9872 07:10:25,000 --> 07:10:28,000 from one form to another form one small example, 9873 07:10:28,000 --> 07:10:31,470 like reduced by key filter all that will be Transformations. 9874 07:10:31,470 --> 07:10:33,700 The resultant of the transformation will be 9875 07:10:33,700 --> 07:10:35,300 another rdd the same time. 9876 07:10:35,300 --> 07:10:37,700 We can take some actions against the rdd 9877 07:10:37,700 --> 07:10:40,245 that's going to give us the final result. 9878 07:10:40,245 --> 07:10:41,262 I can say count 9879 07:10:41,262 --> 07:10:43,500 how many records or they are store 9880 07:10:43,500 --> 07:10:45,700 that result into the hdfs. 9881 07:10:46,100 --> 07:10:49,541 They all our actions so multiple actions can be taken 9882 07:10:49,541 --> 07:10:50,600 against the RTD. 
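A small sketch of partitioning plus the transformation/action split just described; the partition counts and output path are assumptions, and an existing SparkContext sc is assumed.

data = sc.parallelize(range(100), numSlices=4)   # ask for 4 partitions
print(data.getNumPartitions())                   # 4

repartitioned = data.repartition(8)              # more partitions -> more parallelism,
                                                 # but repartitioning shuffles data over the network

doubled = repartitioned.map(lambda x: x * 2)     # transformation: returns another RDD, lazily
small = doubled.filter(lambda x: x < 50)         # transformation: still nothing has executed

print(small.count())                             # action: computes and returns a value to the driver
small.saveAsTextFile("output_dir")               # action: materializes the result to storage
print(small.toDebugString())                     # shows the lineage of steps behind this RDD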
9883 07:10:50,600 --> 07:10:53,700 The existence of the data will come into picture only 9884 07:10:53,700 --> 07:10:56,200 if I take some action against not ready. 9885 07:10:56,200 --> 07:10:56,515 Okay. 9886 07:10:56,515 --> 07:10:57,400 Next question. 9887 07:10:57,400 --> 07:11:01,000 What do you understand by transformations in spark? 9888 07:11:01,100 --> 07:11:03,679 So Transformations are nothing but functions 9889 07:11:03,679 --> 07:11:06,800 mostly it will be higher order functions within scale 9890 07:11:06,800 --> 07:11:09,400 and we have something like a higher order functions 9891 07:11:09,400 --> 07:11:12,356 which will be applied against the tardy. 9892 07:11:12,356 --> 07:11:14,100 Mostly against the list 9893 07:11:14,100 --> 07:11:16,407 of elements that we have within the rdd 9894 07:11:16,407 --> 07:11:19,314 that function will get applied by the existence 9895 07:11:19,314 --> 07:11:21,875 of the arditi will Come into picture one lie 9896 07:11:21,875 --> 07:11:25,597 if we take some action against it in this particular example, 9897 07:11:25,597 --> 07:11:26,900 I am reading the file 9898 07:11:26,900 --> 07:11:30,536 and having it within the rdd Control Data then I am doing 9899 07:11:30,536 --> 07:11:32,500 some transformation using a map. 9900 07:11:32,500 --> 07:11:34,382 So it's going to apply a function 9901 07:11:34,382 --> 07:11:35,623 so we can map I have 9902 07:11:35,623 --> 07:11:39,100 some function which will split each record using the tab. 9903 07:11:39,100 --> 07:11:41,632 So the spit with the app will be applied 9904 07:11:41,632 --> 07:11:44,300 against each record within the raw data 9905 07:11:44,300 --> 07:11:48,200 and the resultant movies data will again be another rdd, 9906 07:11:48,200 --> 07:11:50,644 but of course, this will be a lazy operation. 9907 07:11:50,644 --> 07:11:53,700 The existence of movies data will come into picture only 9908 07:11:53,700 --> 07:11:57,700 if I take some action against it like count or print 9909 07:11:57,726 --> 07:12:01,573 or store only those actions will generate the data. 9910 07:12:01,800 --> 07:12:04,600 So next question Define functions of spark code. 9911 07:12:04,600 --> 07:12:07,100 So that's going to take care of the memory management 9912 07:12:07,100 --> 07:12:09,400 and fault tolerance of rdds. 9913 07:12:09,400 --> 07:12:12,700 It's going to help us to schedule distribute the task 9914 07:12:12,700 --> 07:12:15,400 and manage the jobs running within the cluster 9915 07:12:15,400 --> 07:12:17,700 and so we're going to help us to or store the rear 9916 07:12:17,700 --> 07:12:20,700 in the storage system as well as reads data from the storage. 9917 07:12:20,700 --> 07:12:23,905 System that's to do the file system level operations. 9918 07:12:23,905 --> 07:12:25,200 It's going to help us 9919 07:12:25,200 --> 07:12:27,500 and Spark core programming can be done in any 9920 07:12:27,500 --> 07:12:30,347 of these languages like Java scalar python 9921 07:12:30,347 --> 07:12:32,500 as well as using our so core is 9922 07:12:32,500 --> 07:12:35,600 that the horizontal level on top of spark or we can have 9923 07:12:35,600 --> 07:12:37,500 a number of components 9924 07:12:37,600 --> 07:12:41,000 and there are different type of rdds available one such 9925 07:12:41,000 --> 07:12:42,923 a special type is parody. 9926 07:12:42,923 --> 07:12:43,800 So next question. 9927 07:12:43,800 --> 07:12:46,100 What do you understand by pay an rdd? 
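Before the pair RDD answer continues, here is a sketch tying together the tab-split transformation just described and the key/value idea it leads into; the file name movies.tsv and the field positions are purely illustrative assumptions.

raw_data = sc.textFile("movies.tsv")
movies_data = raw_data.map(lambda line: line.split("\t"))   # lazy transformation

# Turn each record into a (key, value) pair - say (genre, 1) - and
# aggregate per key, which is exactly what a pair RDD is for
genre_counts = (movies_data.map(lambda fields: (fields[2], 1))
                           .reduceByKey(lambda a, b: a + b))
print(genre_counts.take(5))   # action: only now does anything actually run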
9928 07:12:46,100 --> 07:12:49,792 It's going to exist in peace as a keys and values 9929 07:12:49,800 --> 07:12:51,906 so I can Some special functions 9930 07:12:51,906 --> 07:12:55,400 within the parodies are special Transformations, 9931 07:12:55,400 --> 07:12:58,900 like connect all the values corresponding to the same key 9932 07:12:58,900 --> 07:13:00,200 like solder Shuffle 9933 07:13:00,300 --> 07:13:02,800 what happens within the shortened Shuffle of Hadoop 9934 07:13:02,900 --> 07:13:04,356 those type of operations 9935 07:13:04,356 --> 07:13:05,161 like you want 9936 07:13:05,161 --> 07:13:08,339 to consolidate our group all the values corresponding 9937 07:13:08,339 --> 07:13:10,792 to the same key are apply some functions 9938 07:13:10,792 --> 07:13:14,400 against all the values corresponding to the same key. 9939 07:13:14,400 --> 07:13:16,200 Like I want to get the sum 9940 07:13:16,200 --> 07:13:20,400 of the value of all the keys we can use the parody. 9941 07:13:20,400 --> 07:13:23,600 D and get that a cheat so it's going to the data 9942 07:13:23,600 --> 07:13:29,300 within the re going to exist in Pace keys and right. 9943 07:13:29,300 --> 07:13:31,376 Okay a question from Jason. 9944 07:13:31,376 --> 07:13:33,223 What are our Vector rdds 9945 07:13:33,300 --> 07:13:36,300 in machine learning you will have huge amount 9946 07:13:36,300 --> 07:13:38,700 of processing handled by vectors 9947 07:13:38,700 --> 07:13:42,812 and matrices and we do lots of operations Vector operations, 9948 07:13:42,812 --> 07:13:44,200 like effective actor 9949 07:13:44,200 --> 07:13:47,700 or transforming any data into a vector form so vectors 9950 07:13:47,700 --> 07:13:50,755 like as the normal way it will have a Direction. 9951 07:13:50,755 --> 07:13:51,624 And magnitude 9952 07:13:51,624 --> 07:13:54,900 so we can do some operations like some two vectors 9953 07:13:54,900 --> 07:13:58,622 and what is the difference between the vector A 9954 07:13:58,622 --> 07:14:00,500 and B as well as a and see 9955 07:14:00,500 --> 07:14:02,400 if the difference between Vector A 9956 07:14:02,400 --> 07:14:04,200 and B is less compared to a 9957 07:14:04,200 --> 07:14:06,487 and C we can say the vector A 9958 07:14:06,487 --> 07:14:10,825 and B is somewhat similar in terms of features. 9959 07:14:11,100 --> 07:14:13,815 So the vector R GD will be used to represent 9960 07:14:13,815 --> 07:14:17,100 the vector directly and that will be used extensively 9961 07:14:17,100 --> 07:14:19,500 while doing the measuring and Jason. 9962 07:14:19,700 --> 07:14:20,500 Thank you other. 9963 07:14:20,500 --> 07:14:21,400 Is another question. 9964 07:14:21,400 --> 07:14:22,900 What is our GD lineage? 9965 07:14:22,900 --> 07:14:25,800 So here I any data processing any Transformations 9966 07:14:25,800 --> 07:14:28,811 that we do it maintains something called a lineage. 9967 07:14:28,811 --> 07:14:31,100 So what how data is getting transformed 9968 07:14:31,100 --> 07:14:33,543 when the data is available in the partition form 9969 07:14:33,543 --> 07:14:36,300 in multiple systems and when we do the transformation, 9970 07:14:36,300 --> 07:14:39,800 it will undergo multiple steps and in the distributed word. 
9971 07:14:39,800 --> 07:14:42,700 It's very common to have failures of machines 9972 07:14:42,700 --> 07:14:45,200 or machines going out of the network 9973 07:14:45,200 --> 07:14:47,000 and the system our framework 9974 07:14:47,000 --> 07:14:47,800 as it should be 9975 07:14:47,800 --> 07:14:50,800 in a position to handle small handles it through. 9976 07:14:50,858 --> 07:14:55,800 Did he leave eh it can restore the last partition only assume 9977 07:14:55,800 --> 07:14:59,004 like out of ten machines data is distributed 9978 07:14:59,004 --> 07:15:00,828 across five machines out of 9979 07:15:00,828 --> 07:15:03,800 that those five machines One mission is lost. 9980 07:15:03,800 --> 07:15:06,500 So whatever the latest transformation 9981 07:15:06,500 --> 07:15:07,807 that had the data 9982 07:15:08,000 --> 07:15:10,100 for that particular partition the partition 9983 07:15:10,100 --> 07:15:13,924 in the last mission alone can be regenerated and it knows 9984 07:15:13,924 --> 07:15:16,700 how to regenerate that data on how to get that result 9985 07:15:16,700 --> 07:15:18,384 and data using the concept 9986 07:15:18,384 --> 07:15:21,153 of rdd lineage so from which Each data source, 9987 07:15:21,153 --> 07:15:22,200 it got generated. 9988 07:15:22,200 --> 07:15:23,800 What was its previous step. 9989 07:15:23,800 --> 07:15:26,300 So the completely is will be available 9990 07:15:26,300 --> 07:15:29,724 and it's maintained by the spark framework internally. 9991 07:15:29,724 --> 07:15:31,700 We call that as Oddities in eh, 9992 07:15:31,700 --> 07:15:34,682 what is point driver to put it simply for those 9993 07:15:34,682 --> 07:15:37,600 who are from her do background yarn back room. 9994 07:15:37,600 --> 07:15:40,000 We can compare this to at muster. 9995 07:15:40,100 --> 07:15:43,300 Every application will have a spark driver 9996 07:15:43,300 --> 07:15:44,900 that will have a spot context 9997 07:15:44,900 --> 07:15:47,550 which is going to moderate the complete execution 9998 07:15:47,550 --> 07:15:50,200 of the job that will connect to the spark master. 9999 07:15:50,500 --> 07:15:52,300 Delivers the RTD graph 10000 07:15:52,300 --> 07:15:54,900 that is the lineage for the master 10001 07:15:54,900 --> 07:15:56,810 and the coordinate the tasks. 10002 07:15:56,810 --> 07:15:57,817 What are the tasks 10003 07:15:57,817 --> 07:16:00,700 that gets executed in the distributed environment? 10004 07:16:00,700 --> 07:16:01,500 It can do 10005 07:16:01,500 --> 07:16:04,400 the parallel processing do the Transformations 10006 07:16:04,600 --> 07:16:06,900 and actions against the RTD. 10007 07:16:06,900 --> 07:16:08,551 So it's a single point of contact 10008 07:16:08,551 --> 07:16:10,100 for that specific application. 10009 07:16:10,100 --> 07:16:12,500 So smart driver is a short linked 10010 07:16:12,500 --> 07:16:15,300 and the spawn context within this part driver 10011 07:16:15,300 --> 07:16:18,558 is going to be the coordinator between the master and the tasks 10012 07:16:18,558 --> 07:16:20,694 that are running and smart driver. 10013 07:16:20,694 --> 07:16:23,100 I can get started in any of the executor 10014 07:16:23,100 --> 07:16:26,800 with inspark name types of custom managers in spark. 10015 07:16:26,800 --> 07:16:28,800 So whenever you have a group of machines, 10016 07:16:28,800 --> 07:16:30,247 you need a manager to manage 10017 07:16:30,247 --> 07:16:33,415 the resources the different type of the store manager already. 
10018 07:16:33,415 --> 07:16:35,700 We have seen the yarn yet another assist ago. 10019 07:16:35,700 --> 07:16:39,400 She later which manages the resources of Hadoop on top 10020 07:16:39,400 --> 07:16:43,000 of yarn we can make Spock to book sometimes I 10021 07:16:43,000 --> 07:16:46,700 may want to have sparkle own my organization 10022 07:16:46,700 --> 07:16:49,594 and not along with the Hadoop or any other technology. 10023 07:16:49,594 --> 07:16:50,297 Then I can go 10024 07:16:50,297 --> 07:16:53,100 with the And alone spawn has built-in cluster manager. 10025 07:16:53,100 --> 07:16:55,547 So only spawn can get executed multiple systems. 10026 07:16:55,547 --> 07:16:57,423 But generally if we have a cluster we 10027 07:16:57,423 --> 07:16:58,600 will try to leverage 10028 07:16:58,600 --> 07:17:01,600 various other Computing platforms Computing Frameworks, 10029 07:17:01,600 --> 07:17:04,601 like graph processing giraffe these on that. 10030 07:17:04,601 --> 07:17:07,000 We will try to leverage that case. 10031 07:17:07,000 --> 07:17:08,321 We will go with yarn 10032 07:17:08,321 --> 07:17:10,700 or some generalized resource manager, 10033 07:17:10,700 --> 07:17:12,000 like masseuse Ian. 10034 07:17:12,000 --> 07:17:14,400 It's very specific to Hadoop and it comes along 10035 07:17:14,400 --> 07:17:18,500 with Hadoop measures is the cluster level resource manager 10036 07:17:18,500 --> 07:17:20,600 and I have multiple clusters. 10037 07:17:20,600 --> 07:17:23,700 Within organization, then you can use mrs. 10038 07:17:23,800 --> 07:17:25,883 Mrs. Is also a resource manager. 10039 07:17:25,883 --> 07:17:29,400 It's a separate table project within Apache X question. 10040 07:17:29,400 --> 07:17:30,600 What do you understand 10041 07:17:30,600 --> 07:17:34,200 by worker node in a cluster redistribute environment. 10042 07:17:34,200 --> 07:17:36,252 We will have n number of workers we call 10043 07:17:36,252 --> 07:17:38,200 that is a worker node or a slave node, 10044 07:17:38,200 --> 07:17:40,665 which does the actual processing going to get 10045 07:17:40,665 --> 07:17:43,300 the data do the processing and get us the result 10046 07:17:43,300 --> 07:17:45,100 and masternode going to assign 10047 07:17:45,100 --> 07:17:48,000 what has to be done by one person own and it's going 10048 07:17:48,000 --> 07:17:50,551 to read the data available in the specific work on. 10049 07:17:50,551 --> 07:17:53,196 Generally, the tasks assigned to the worker node, 10050 07:17:53,196 --> 07:17:55,900 or the task will be assigned to the output node data 10051 07:17:55,900 --> 07:17:57,500 is located in vigorous Pace. 10052 07:17:57,500 --> 07:18:00,100 Especially Hadoop always it will try to achieve 10053 07:18:00,100 --> 07:18:01,183 the data locality. 10054 07:18:01,183 --> 07:18:04,391 That's what we can't is the resource availability as 10055 07:18:04,391 --> 07:18:05,900 well as the availability 10056 07:18:05,900 --> 07:18:08,900 of the resource in terms of CPU memory as well 10057 07:18:08,900 --> 07:18:10,000 will be considered 10058 07:18:10,000 --> 07:18:13,601 as you might have some data in replicated in three missions. 10059 07:18:13,601 --> 07:18:16,884 All three machines are busy doing the work and no CPU 10060 07:18:16,884 --> 07:18:19,414 or memory available to start the other task. 10061 07:18:19,414 --> 07:18:20,400 It will not wait. 
10062 07:18:20,400 --> 07:18:23,300 For those missions to complete the job and get the resource 10063 07:18:23,300 --> 07:18:25,900 and do the processing it will start the processing 10064 07:18:25,900 --> 07:18:27,000 and some other machine 10065 07:18:27,000 --> 07:18:28,200 which is going to be near 10066 07:18:28,200 --> 07:18:31,300 to that the missions having the data and read the data 10067 07:18:31,300 --> 07:18:32,400 over the network. 10068 07:18:32,600 --> 07:18:35,100 So to answer straight or commissions are nothing but 10069 07:18:35,100 --> 07:18:36,600 which does the actual work 10070 07:18:36,600 --> 07:18:37,755 and going to report 10071 07:18:37,755 --> 07:18:41,315 to the master in terms of what is the resource utilization 10072 07:18:41,315 --> 07:18:42,627 and the tasks running 10073 07:18:42,627 --> 07:18:46,000 within the work emissions will be doing the actual work 10074 07:18:46,000 --> 07:18:49,049 and what ways as past Vector just few minutes back. 10075 07:18:49,049 --> 07:18:50,656 I was answering a question. 10076 07:18:50,656 --> 07:18:52,697 What is a vector vector is nothing 10077 07:18:52,697 --> 07:18:55,500 but representing the data in multi dimensional form? 10078 07:18:55,500 --> 07:18:57,500 The vector can be multi-dimensional 10079 07:18:57,500 --> 07:18:58,500 Vector as well. 10080 07:18:58,500 --> 07:19:02,400 As you know, I am going to represent a point in space. 10081 07:19:02,400 --> 07:19:04,938 I need three dimensions the X y&z. 10082 07:19:05,000 --> 07:19:08,076 So the vector will have three dimensions. 10083 07:19:08,300 --> 07:19:10,934 If I need to represent a line in the species. 10084 07:19:10,934 --> 07:19:14,107 Then I need two points to represent the starting point 10085 07:19:14,107 --> 07:19:17,700 of the line and the endpoint of the line then I need a vector 10086 07:19:17,700 --> 07:19:18,800 which can hold 10087 07:19:18,800 --> 07:19:21,049 so it will have two Dimensions the first First Dimension 10088 07:19:21,049 --> 07:19:23,121 will have one point the second dimension 10089 07:19:23,121 --> 07:19:24,400 will have another Point 10090 07:19:24,400 --> 07:19:25,429 let us say point B 10091 07:19:25,429 --> 07:19:29,200 if I have to represent a plane then I need another dimension 10092 07:19:29,200 --> 07:19:30,702 to represent two lines. 10093 07:19:30,702 --> 07:19:31,510 So each line 10094 07:19:31,510 --> 07:19:34,203 will be representing two points same way. 10095 07:19:34,203 --> 07:19:37,200 I can represent any data using a vector form 10096 07:19:37,200 --> 07:19:40,217 as you might have huge number of feedback 10097 07:19:40,217 --> 07:19:43,500 or ratings of products across an organization. 10098 07:19:43,500 --> 07:19:46,327 Let's take a simple example Amazon Amazon have 10099 07:19:46,327 --> 07:19:47,632 millions of products. 10100 07:19:47,632 --> 07:19:50,498 Not every user not even a single user would have 10101 07:19:50,498 --> 07:19:53,461 It was millions of all the products within Amazon. 10102 07:19:53,461 --> 07:19:55,341 The only hardly we would have used 10103 07:19:55,341 --> 07:19:58,400 like a point one percent or like even less than that, 10104 07:19:58,400 --> 07:20:00,200 maybe like few hundred products. 10105 07:20:00,200 --> 07:20:02,600 We would have used and rated the products 10106 07:20:02,600 --> 07:20:04,600 within amazing for the complete lifetime. 
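[Editor's note: a tiny sketch of the dense-vector representation described above — a point in space held as one MLlib vector. The coordinates are arbitrary.]

```scala
import org.apache.spark.ml.linalg.Vectors

// A point in space needs three dimensions (x, y, z); a dense vector stores all of them.
val point = Vectors.dense(1.0, 4.5, -2.0)
println(point.size)   // 3
```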
10107 07:20:04,600 --> 07:20:07,700 If I have to represent all ratings of the products 10108 07:20:07,700 --> 07:20:10,194 with director and see the first position 10109 07:20:10,194 --> 07:20:13,400 of the rating it's going to refer to the product 10110 07:20:13,400 --> 07:20:15,200 with ID 1 second position. 10111 07:20:15,200 --> 07:20:17,600 It's going to refer to the product with ID 2. 10112 07:20:17,600 --> 07:20:20,700 So I will have million values within that particular vector. 10113 07:20:20,700 --> 07:20:22,645 After out of million values, 10114 07:20:22,645 --> 07:20:25,493 I'll have only values 400 products where I 10115 07:20:25,493 --> 07:20:27,300 have provided the ratings. 10116 07:20:27,400 --> 07:20:30,947 So it may vary from number 1 to 5 for all others. 10117 07:20:30,947 --> 07:20:34,200 It will say 0 sparse pins thinly distributed. 10118 07:20:34,800 --> 07:20:38,774 So to represent the huge amount of data with the position 10119 07:20:38,774 --> 07:20:41,900 and saying this particular position is having 10120 07:20:41,900 --> 07:20:43,800 a 0 value we can mention 10121 07:20:43,800 --> 07:20:45,900 that with a key and value. 10122 07:20:45,900 --> 07:20:47,415 So what position having 10123 07:20:47,415 --> 07:20:51,500 what value rather than storing all Zero seconds told one lie 10124 07:20:51,500 --> 07:20:55,471 non-zeros the position of it and that the corresponding value. 10125 07:20:55,471 --> 07:20:58,400 That means all others going to be a zero value 10126 07:20:58,400 --> 07:21:01,400 so we can mention this particular space 10127 07:21:01,400 --> 07:21:05,400 Vector mentioning it to representa nonzero entities. 10128 07:21:05,400 --> 07:21:08,300 So to store only the nonzero entities 10129 07:21:08,300 --> 07:21:10,364 this Mass Factor will be used 10130 07:21:10,364 --> 07:21:12,500 so that we don't need to based 10131 07:21:12,500 --> 07:21:15,550 on additional space was during this past Vector. 10132 07:21:15,550 --> 07:21:18,600 Let's discuss some questions on spark streaming. 10133 07:21:18,600 --> 07:21:21,422 How is streaming Dad in sparking explained 10134 07:21:21,422 --> 07:21:23,900 with examples smart streaming is used 10135 07:21:23,900 --> 07:21:25,452 for processing real-time 10136 07:21:25,452 --> 07:21:29,500 streaming data to precisely say it's a micro batch processing. 10137 07:21:29,500 --> 07:21:32,852 So data will be collected between every small interval say 10138 07:21:32,852 --> 07:21:35,128 maybe like .5 seconds or every seconds 10139 07:21:35,128 --> 07:21:36,200 until you get processed. 10140 07:21:36,200 --> 07:21:36,900 So internally, 10141 07:21:36,900 --> 07:21:40,100 it's going to create micro patches the data created 10142 07:21:40,100 --> 07:21:43,800 out of that micro batch we call there is a d stream the stream 10143 07:21:43,800 --> 07:21:45,500 is like a and ready 10144 07:21:45,500 --> 07:21:48,200 so I can do Transformations and actions. 
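[Editor's note: a minimal sketch of the sparse-vector idea discussed above — only the non-zero positions and their values are stored. The catalogue size, indices and ratings are invented.]

```scala
import org.apache.spark.ml.linalg.Vectors

// One user's ratings over a catalogue of 1,000,000 products: only the products at
// positions 3 and 417 were rated (5.0 and 4.0); every other position is implicitly
// zero, so nothing is stored for it.
val ratings = Vectors.sparse(1000000, Array(3, 417), Array(5.0, 4.0))

println(ratings(3))     // 5.0
println(ratings(1000))  // 0.0 -- not stored, defaults to zero
```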
10145 07:21:48,200 --> 07:21:50,691 Whatever that I do with our DD I can do 10146 07:21:50,691 --> 07:21:52,200 With the stream as well 10147 07:21:52,500 --> 07:21:57,100 and Spark streaming can read data from Flume hdfs are 10148 07:21:57,100 --> 07:21:59,500 other streaming services Aspen 10149 07:21:59,800 --> 07:22:02,565 and store the data in the dashboard or in 10150 07:22:02,565 --> 07:22:06,300 any other database and it provides very high throughput 10151 07:22:06,400 --> 07:22:09,200 as it can be processed with a number of different systems 10152 07:22:09,200 --> 07:22:11,800 in a distributed fashion again streaming. 10153 07:22:11,800 --> 07:22:14,858 This stream will be partitioned internally and it has 10154 07:22:14,858 --> 07:22:17,100 the built-in feature of fault tolerance, 10155 07:22:17,100 --> 07:22:18,700 even if any data is lost 10156 07:22:18,700 --> 07:22:22,100 and it's transformed already is Lost it can regenerate 10157 07:22:22,100 --> 07:22:23,930 those rdds from the existing 10158 07:22:23,930 --> 07:22:25,500 or from the source data. 10159 07:22:25,500 --> 07:22:28,100 So these three is going to be the building block 10160 07:22:28,100 --> 07:22:32,748 of streaming and it has the fault tolerance mechanism 10161 07:22:32,748 --> 07:22:34,902 what we have within the RTD. 10162 07:22:35,000 --> 07:22:38,600 So this stream are specialized on Didi specialized form 10163 07:22:38,600 --> 07:22:42,000 of our GD specifically to use it within this box dreaming. 10164 07:22:42,000 --> 07:22:42,253 Okay. 10165 07:22:42,253 --> 07:22:42,963 Next question. 10166 07:22:42,963 --> 07:22:45,600 What is the significance of sliding window operation? 10167 07:22:45,600 --> 07:22:48,700 That's a very interesting one in the streaming data whenever 10168 07:22:48,700 --> 07:22:50,600 we do the Computing the data. 10169 07:22:50,600 --> 07:22:53,218 Density are the business implications 10170 07:22:53,218 --> 07:22:56,500 of that specific data May oscillate a lot. 10171 07:22:56,500 --> 07:22:58,400 For example within Twitter. 10172 07:22:58,400 --> 07:23:01,455 We used to say the trending tweet hashtag just 10173 07:23:01,455 --> 07:23:03,900 because that hashtag is very popular. 10174 07:23:03,900 --> 07:23:06,200 Maybe someone might have hacked into the system 10175 07:23:06,200 --> 07:23:09,500 and used a number of tweets maybe for that particular 10176 07:23:09,500 --> 07:23:12,202 our it might have appeared millions of times just 10177 07:23:12,202 --> 07:23:15,123 because it appear billions of times for that specific 10178 07:23:15,123 --> 07:23:16,107 and minute duration 10179 07:23:16,107 --> 07:23:18,800 or like say to three minute duration each not getting 10180 07:23:18,800 --> 07:23:20,200 to the trending tank. 10181 07:23:20,200 --> 07:23:22,286 Trending hashtag for that particular day 10182 07:23:22,286 --> 07:23:23,992 or for that particular month. 10183 07:23:23,992 --> 07:23:26,700 So what we will do we will try to do an average. 10184 07:23:26,700 --> 07:23:29,357 So like a window this current time frame 10185 07:23:29,357 --> 07:23:32,500 and T minus 1 T minus 2 all the data we will consider 10186 07:23:32,500 --> 07:23:34,807 and we will try to find the average or some 10187 07:23:34,807 --> 07:23:37,276 so the complete business logic will be applied 10188 07:23:37,276 --> 07:23:39,100 against that particular window. 
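[Editor's note: a small Scala sketch of the micro-batch / DStream model described above. The one-second batch interval and the socket source are assumptions chosen only to keep the example short.]

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("dstream-demo").setMaster("local[2]")
// Every 1-second micro-batch becomes one RDD inside the DStream.
val ssc = new StreamingContext(conf, Seconds(1))

// Any source works (Flume, Kafka, HDFS, ...); a TCP socket keeps the sketch small.
val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()   // RDD-style transformations and actions apply per micro-batch

ssc.start()
ssc.awaitTermination()
```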
10189 07:23:39,200 --> 07:23:43,400 So any drastic changes on to precisely say the spike 10190 07:23:43,500 --> 07:23:46,200 or deep very drastic spinal cords 10191 07:23:46,200 --> 07:23:50,300 drastic deep in the pattern of the data will be normalized. 10192 07:23:50,300 --> 07:23:51,100 So that's the 10193 07:23:51,100 --> 07:23:54,452 because significance of using the sliding window operation 10194 07:23:54,452 --> 07:23:55,800 with inspark streaming 10195 07:23:55,800 --> 07:23:59,600 and smart can handle this sliding window automatically. 10196 07:23:59,600 --> 07:24:04,000 It can store the prior data the T minus 1 T minus 2 and 10197 07:24:04,000 --> 07:24:06,300 how big the window needs to be maintained 10198 07:24:06,300 --> 07:24:09,192 or that can be handled easily within the program 10199 07:24:09,192 --> 07:24:11,100 and it's at the abstract level. 10200 07:24:11,300 --> 07:24:12,100 Next question is 10201 07:24:12,100 --> 07:24:15,600 what is destroying the expansion is discretized stream. 10202 07:24:15,600 --> 07:24:17,600 So that's the abstract form 10203 07:24:17,600 --> 07:24:20,500 or the which will form of representation of the data. 10204 07:24:20,500 --> 07:24:22,494 For the spark streaming the same way, 10205 07:24:22,494 --> 07:24:25,200 how are ready getting transformed from one form 10206 07:24:25,200 --> 07:24:26,200 to another form? 10207 07:24:26,200 --> 07:24:27,504 We will have series 10208 07:24:27,504 --> 07:24:30,800 of oddities all put together called as a d string 10209 07:24:30,800 --> 07:24:32,100 so this term is nothing 10210 07:24:32,100 --> 07:24:34,000 but it's another representation 10211 07:24:34,000 --> 07:24:36,593 of our GD are like to group of oddities 10212 07:24:36,593 --> 07:24:38,223 because there is a stream 10213 07:24:38,223 --> 07:24:41,100 and I can apply the streaming functions 10214 07:24:41,100 --> 07:24:43,921 or any of the functions Transformations are actions 10215 07:24:43,921 --> 07:24:47,200 available within the streaming against this D string 10216 07:24:47,300 --> 07:24:49,674 So within that particular micro badge, 10217 07:24:49,674 --> 07:24:51,600 so I will Define What interval 10218 07:24:51,600 --> 07:24:54,377 the data should be collected on should be processed 10219 07:24:54,377 --> 07:24:56,100 because there is a micro batch. 10220 07:24:56,100 --> 07:24:59,900 It could be every 1 second or every hundred milliseconds 10221 07:24:59,900 --> 07:25:01,000 or every five seconds. 10222 07:25:01,300 --> 07:25:02,300 I can Define that page 10223 07:25:02,300 --> 07:25:04,300 particular period so all the data is used 10224 07:25:04,300 --> 07:25:07,300 in that particular duration will be considered 10225 07:25:07,300 --> 07:25:08,400 as a piece of data 10226 07:25:08,400 --> 07:25:09,600 and that will be called 10227 07:25:09,600 --> 07:25:13,400 as ADI string s question explain casing in spark streaming. 10228 07:25:13,400 --> 07:25:14,000 Of course. 10229 07:25:14,000 --> 07:25:15,000 Yes Mark internally. 10230 07:25:15,000 --> 07:25:16,300 It uses in memory Computing. 
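[Editor's note: a sketch of the sliding-window idea above, assuming `hashtags` is an existing DStream[String] of hashtag strings. The window and slide durations are arbitrary choices.]

```scala
import org.apache.spark.streaming.{Minutes, Seconds}

// Smooth out short spikes by counting over a 10-minute window that slides forward
// every 30 seconds, instead of looking at a single micro-batch in isolation.
val hashtagCounts = hashtags
  .map((_, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Minutes(10), Seconds(30))

hashtagCounts.print()
```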
10231 07:25:16,600 --> 07:25:18,700 So any data when it is doing the Computing 10232 07:25:18,900 --> 07:25:21,600 that's killing generated will be there in Mary but find 10233 07:25:21,600 --> 07:25:25,100 that if you do more and more processing with other jobs 10234 07:25:25,100 --> 07:25:27,190 when there is a need for more memory, 10235 07:25:27,190 --> 07:25:30,500 the least used on DDS will be clear enough from the memory 10236 07:25:30,500 --> 07:25:34,100 or the least used data available out of actions 10237 07:25:34,100 --> 07:25:36,700 from the arditi will be cleared of from the memory. 10238 07:25:36,700 --> 07:25:40,000 Sometimes I may need that data forever in memory, 10239 07:25:40,000 --> 07:25:41,800 very simple example, like dictionary. 10240 07:25:42,100 --> 07:25:43,600 I want the dictionary words 10241 07:25:43,600 --> 07:25:45,658 should be always available in memory 10242 07:25:45,658 --> 07:25:48,900 because I may do a spell check against the Tweet comments 10243 07:25:48,900 --> 07:25:51,500 or feedback comments and our of nines. 10244 07:25:51,500 --> 07:25:54,900 So what I can do I can say KH those any data 10245 07:25:54,900 --> 07:25:57,036 that comes in we can cash it. 10246 07:25:57,036 --> 07:25:59,100 What possessed it in memory. 10247 07:25:59,100 --> 07:26:02,100 So even when there is a need for memory by other applications 10248 07:26:02,100 --> 07:26:05,800 this specific data will not be remote and especially 10249 07:26:05,800 --> 07:26:08,800 that will be used to do the further processing 10250 07:26:08,800 --> 07:26:11,500 and the casing also can be defined 10251 07:26:11,500 --> 07:26:15,200 whether it should be in memory only I in memory and hard disk 10252 07:26:15,200 --> 07:26:17,000 that also we can Define it. 10253 07:26:17,000 --> 07:26:20,100 Let's discuss some questions on spark graphics. 10254 07:26:20,300 --> 07:26:24,000 The next question is is there an APA for implementing collapse 10255 07:26:24,000 --> 07:26:26,200 and Spark in graph Theory? 10256 07:26:26,600 --> 07:26:28,100 Everything will be represented 10257 07:26:28,100 --> 07:26:33,200 as a graph is a graph it will have nodes and edges. 10258 07:26:33,419 --> 07:26:36,880 So all will be represented using the arteries. 10259 07:26:37,000 --> 07:26:40,300 So it's going to extend the RTD and there is 10260 07:26:40,300 --> 07:26:42,482 a component called graphics 10261 07:26:42,500 --> 07:26:44,983 and it exposes the functionalities 10262 07:26:44,983 --> 07:26:49,800 to represent a graph we can have H RG D buttocks rdd by creating. 10263 07:26:49,800 --> 07:26:51,700 During the edges and vertex. 10264 07:26:51,700 --> 07:26:53,239 I can create a graph 10265 07:26:53,500 --> 07:26:57,400 and this graph can exist in a distributed environment. 10266 07:26:57,400 --> 07:27:00,208 So same way we will be in a position to do 10267 07:27:00,208 --> 07:27:02,400 the parallel processing as well. 10268 07:27:02,700 --> 07:27:06,300 So Graphics, it's just a form of representing 10269 07:27:06,400 --> 07:27:11,200 the data paragraphs with edges and the traces and of course, 10270 07:27:11,200 --> 07:27:14,299 yes, it provides the APA to implement out create 10271 07:27:14,299 --> 07:27:17,400 the graph do the processing on the graph the APA 10272 07:27:17,400 --> 07:27:19,900 so divided what is Page rank? 10273 07:27:20,100 --> 07:27:24,600 Graphics we didn't have sex once the graph is created. 10274 07:27:24,600 --> 07:27:28,900 We can calculate the page rank for a particular note. 
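[Editor's note: a minimal sketch of the caching/persistence point made above, before the GraphX discussion continues. `sc` and the dictionary file path are assumptions for illustration.]

```scala
import org.apache.spark.storage.StorageLevel

// Keep the dictionary words in memory so repeated spell checks don't recompute them.
val dictionary = sc.textFile("dictionary.txt")   // hypothetical input path
dictionary.cache()                               // shorthand for MEMORY_ONLY

// The storage level can also be chosen explicitly, e.g. memory plus disk:
// dictionary.persist(StorageLevel.MEMORY_AND_DISK)
```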
10275 07:27:29,100 --> 07:27:32,000 So that's very similar to how we have the page rank 10276 07:27:32,100 --> 07:27:35,635 for the websites within Google the higher the page rank. 10277 07:27:35,635 --> 07:27:38,774 That means it's more important within that particular graph. 10278 07:27:38,774 --> 07:27:40,547 It's going to show the importance 10279 07:27:40,547 --> 07:27:41,900 of that particular node 10280 07:27:41,900 --> 07:27:45,154 or Edge within that particular graph is a graph is 10281 07:27:45,154 --> 07:27:46,700 a connected set of data. 10282 07:27:46,800 --> 07:27:49,600 All right, I will be connected using the property 10283 07:27:49,600 --> 07:27:51,100 and How much important 10284 07:27:51,100 --> 07:27:55,300 that property makes we will have a value Associated to it. 10285 07:27:55,500 --> 07:27:57,900 So within pagerank we can calculate 10286 07:27:57,900 --> 07:27:59,100 like a static page rank. 10287 07:27:59,300 --> 07:28:00,703 It will run a number 10288 07:28:00,703 --> 07:28:03,300 of iterations or there is another page 10289 07:28:03,300 --> 07:28:06,600 and code anomic page rank that will get executed 10290 07:28:06,600 --> 07:28:09,200 till we reach a particular saturation level 10291 07:28:09,300 --> 07:28:13,600 and the saturation level can be defined with multiple criterias 10292 07:28:14,100 --> 07:28:15,200 and the APA is 10293 07:28:15,200 --> 07:28:17,500 because there is a graph operations. 10294 07:28:17,700 --> 07:28:20,289 And be direct executed against those graph 10295 07:28:20,289 --> 07:28:23,700 and they all are available as a PA within the graphics. 10296 07:28:24,103 --> 07:28:25,796 What is lineage graph? 10297 07:28:26,000 --> 07:28:28,400 So the audit is very similar 10298 07:28:28,500 --> 07:28:32,800 to the graphics how the graph representation every rtt. 10299 07:28:32,800 --> 07:28:33,800 Internally. 10300 07:28:33,800 --> 07:28:36,400 It will have the relation saying 10301 07:28:36,500 --> 07:28:39,157 how that particular rdd got created. 10302 07:28:39,157 --> 07:28:42,725 And from where how that got transformed argit is 10303 07:28:42,725 --> 07:28:44,700 how their got transformed. 10304 07:28:44,700 --> 07:28:47,600 So the complete lineage or the complete history 10305 07:28:47,600 --> 07:28:50,587 or the complete path will be recorded 10306 07:28:50,587 --> 07:28:51,900 within the lineage. 10307 07:28:52,100 --> 07:28:53,517 That will be used in case 10308 07:28:53,517 --> 07:28:56,400 if any particular partition of the target is lost. 10309 07:28:56,400 --> 07:28:57,900 It can be regenerated. 10310 07:28:58,000 --> 07:28:59,899 Even if the complete artery is lost. 10311 07:28:59,899 --> 07:29:00,900 We can regenerate 10312 07:29:00,900 --> 07:29:03,149 so it will have the complete information on what are 10313 07:29:03,149 --> 07:29:06,193 the partitions where it is existing water Transformations. 10314 07:29:06,193 --> 07:29:07,119 It had undergone. 10315 07:29:07,119 --> 07:29:08,747 What is the resultant and you 10316 07:29:08,747 --> 07:29:10,600 if anything is lost in the middle, 10317 07:29:10,600 --> 07:29:12,511 it knows where to recalculate 10318 07:29:12,511 --> 07:29:16,400 from and what are essential things needs to be recalculated. 10319 07:29:16,400 --> 07:29:19,817 It's going to save us a lot of time and if that Audrey 10320 07:29:19,817 --> 07:29:21,762 is never being used it will now. 10321 07:29:21,762 --> 07:29:23,100 Ever get recalculated. 
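[Editor's note: a small GraphX sketch matching the discussion above — a graph built from a vertex RDD and an edge RDD, then ranked with dynamic PageRank. The vertices, edges and tolerance are made up, and `sc` is an assumed SparkContext.]

```scala
import org.apache.spark.graphx.{Edge, Graph}

// A tiny made-up graph: vertex RDD of (id, name) plus an edge RDD between those ids.
val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(3L, 2L, "follows")))
val graph    = Graph(vertices, edges)

// Dynamic PageRank: keeps iterating until the ranks change by less than the tolerance.
val ranks = graph.pageRank(0.0001).vertices
ranks.collect().foreach(println)
```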
10322 07:29:23,100 --> 07:29:26,500 So they recalculation also triggers based on the action 10323 07:29:26,500 --> 07:29:27,799 only on need basis. 10324 07:29:27,799 --> 07:29:29,100 It will recalculate 10325 07:29:29,200 --> 07:29:32,500 that's why it's going to use the memory optimally 10326 07:29:32,700 --> 07:29:36,087 does Apache spark provide checkpointing officially 10327 07:29:36,087 --> 07:29:38,300 like the example like a streaming 10328 07:29:38,600 --> 07:29:43,600 and if any data is lost within that particular sliding window, 10329 07:29:43,600 --> 07:29:47,492 we cannot get back the data are like the data will be lost 10330 07:29:47,492 --> 07:29:50,103 because Jim I'm making a window of say 24 10331 07:29:50,103 --> 07:29:51,800 asks to do some averaging. 10332 07:29:51,800 --> 07:29:55,270 Each I'm making a sliding window of 24 hours every 24 hours. 10333 07:29:55,270 --> 07:29:59,100 It will keep on getting slider and if you lose any system 10334 07:29:59,100 --> 07:30:01,500 as in there is a complete failure of the cluster. 10335 07:30:01,500 --> 07:30:02,562 I may lose the data 10336 07:30:02,562 --> 07:30:04,800 because it's all available in the memory. 10337 07:30:04,900 --> 07:30:06,400 So how to recalculate 10338 07:30:06,400 --> 07:30:08,902 if the data system is lost it follows something 10339 07:30:08,902 --> 07:30:10,100 called a checkpointing 10340 07:30:10,100 --> 07:30:12,831 so we can check point the data and directly. 10341 07:30:12,831 --> 07:30:14,800 It's provided by the spark APA. 10342 07:30:14,800 --> 07:30:16,600 We have to just provide the location 10343 07:30:16,600 --> 07:30:19,700 where it should get checked pointed and you can read 10344 07:30:19,700 --> 07:30:23,200 that particular data back when you Not the system again, 10345 07:30:23,200 --> 07:30:24,866 whatever the state it was 10346 07:30:24,866 --> 07:30:27,600 in be can regenerate that particular data. 10347 07:30:27,700 --> 07:30:29,454 So yes to answer the question 10348 07:30:29,454 --> 07:30:32,300 straight about this path points check monitoring 10349 07:30:32,300 --> 07:30:35,300 and it will help us to regenerate the state 10350 07:30:35,300 --> 07:30:37,010 what it was earlier. 10351 07:30:37,200 --> 07:30:40,000 Let's move on to the next component spark ml it. 10352 07:30:40,300 --> 07:30:41,515 How is machine learning 10353 07:30:41,515 --> 07:30:44,600 implemented in spark the machine learning again? 10354 07:30:44,600 --> 07:30:46,800 It's a very huge ocean by itself 10355 07:30:46,900 --> 07:30:49,800 and it's not a technology specific to spark 10356 07:30:49,800 --> 07:30:51,800 which learning is a common data science. 
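[Editor's note: a sketch of streaming checkpointing as described above. The checkpoint directory is only a placeholder (in practice it would be a reliable store such as HDFS), and `conf` is the SparkConf from the earlier sketches.]

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint("/tmp/spark-checkpoints")   // placeholder checkpoint location
  ssc
}

// On a clean start this builds a fresh context; after a failure it rebuilds the
// previous state from the checkpointed data instead.
val ssc = StreamingContext.getOrCreate("/tmp/spark-checkpoints", createContext _)
```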
10357 07:30:51,800 --> 07:30:55,235 It's a Set of data science work where we have different type 10358 07:30:55,235 --> 07:30:57,983 of algorithms different categories of algorithm, 10359 07:30:57,983 --> 07:31:01,100 like clustering regression dimensionality reduction 10360 07:31:01,100 --> 07:31:02,100 or that we have 10361 07:31:02,300 --> 07:31:05,600 and all these algorithms are most of the algorithms 10362 07:31:05,600 --> 07:31:08,070 have been implemented in spark and smart is 10363 07:31:08,070 --> 07:31:09,481 the preferred framework 10364 07:31:09,481 --> 07:31:12,910 or before preferred application component to do the machine 10365 07:31:12,910 --> 07:31:14,500 learning algorithm nowadays 10366 07:31:14,500 --> 07:31:16,500 or machine learning processing the reason 10367 07:31:16,500 --> 07:31:19,700 because most of the machine learning algorithms needs 10368 07:31:19,700 --> 07:31:21,890 to be executed i3t real number. 10369 07:31:21,890 --> 07:31:25,000 Of times till we get the optimal result maybe 10370 07:31:25,000 --> 07:31:27,700 like say twenty five iterations are 58 iterations 10371 07:31:27,700 --> 07:31:29,900 or till we get that specific accuracy. 10372 07:31:29,900 --> 07:31:33,100 You will keep on running the processing again and again 10373 07:31:33,100 --> 07:31:36,092 and smog is very good fit whenever you want to do 10374 07:31:36,092 --> 07:31:37,900 the processing again and again 10375 07:31:37,900 --> 07:31:40,400 because the data will be available in memory. 10376 07:31:40,400 --> 07:31:43,600 I can read it faster store the data back into the memory 10377 07:31:43,600 --> 07:31:44,700 again reach faster 10378 07:31:44,700 --> 07:31:47,500 and all this machine learning algorithms have been provided 10379 07:31:47,500 --> 07:31:50,800 within the spark a separate component called ml lip 10380 07:31:50,900 --> 07:31:53,096 and within mlsp We have other components 10381 07:31:53,096 --> 07:31:55,800 like feature Association to extract the features. 10382 07:31:55,800 --> 07:31:58,575 You may be wondering how they can process 10383 07:31:58,575 --> 07:32:02,600 the images the core thing about processing a image or audio 10384 07:32:02,600 --> 07:32:04,922 or video is about extracting the feature 10385 07:32:04,922 --> 07:32:08,363 and comparing the future how much they are related. 10386 07:32:08,363 --> 07:32:10,300 So that's where vectors matrices all 10387 07:32:10,300 --> 07:32:13,500 that will come into picture and we can have pipeline 10388 07:32:13,500 --> 07:32:16,144 of processing as well to the processing 10389 07:32:16,144 --> 07:32:18,800 one then take the result and do the processing 10390 07:32:18,800 --> 07:32:21,700 to and it has persistence algorithm as well. 10391 07:32:21,700 --> 07:32:24,234 The result of it the generator process 10392 07:32:24,234 --> 07:32:25,999 the result it can be persisted 10393 07:32:25,999 --> 07:32:27,010 and reloaded back 10394 07:32:27,010 --> 07:32:29,421 into the system to continue the processing 10395 07:32:29,421 --> 07:32:32,245 from that particular Point onwards next question. 10396 07:32:32,245 --> 07:32:34,605 What are categories of machine learning machine 10397 07:32:34,605 --> 07:32:38,000 learning assets different categories available supervised 10398 07:32:38,000 --> 07:32:41,001 or unsupervised and reinforced learning supervised 10399 07:32:41,001 --> 07:32:42,900 and surprised it's very popular 10400 07:32:43,200 --> 07:32:46,700 where we will know some I'll give an example. 
10401 07:32:47,200 --> 07:32:50,123 I'll know well in advance what category 10402 07:32:50,123 --> 07:32:54,800 that belongs to Z. Want to do a character recognition 10403 07:32:55,400 --> 07:32:57,185 while training the data, 10404 07:32:57,185 --> 07:33:01,800 I can give information saying this particular image belongs 10405 07:33:01,800 --> 07:33:04,160 to this particular category character 10406 07:33:04,160 --> 07:33:05,800 or this particular number 10407 07:33:05,800 --> 07:33:10,100 and I can train sometimes I will not know well in advance 10408 07:33:10,100 --> 07:33:14,478 assume like I may have different type of images 10409 07:33:14,700 --> 07:33:19,200 like it may have cars bikes cat dog all that. 10410 07:33:19,400 --> 07:33:21,920 I want to know how many category available. 10411 07:33:21,920 --> 07:33:25,279 No, I will not know well in advance so I want to group it 10412 07:33:25,279 --> 07:33:26,900 how many category available 10413 07:33:26,900 --> 07:33:29,100 and then I'll realize saying okay, 10414 07:33:29,100 --> 07:33:31,600 they're all this belongs to a particular category. 10415 07:33:31,600 --> 07:33:33,800 I'll identify the pattern within the category 10416 07:33:33,800 --> 07:33:36,333 and I'll give a category named say 10417 07:33:36,333 --> 07:33:39,751 like all these images belongs to boot category 10418 07:33:39,751 --> 07:33:41,300 on looks like a boat. 10419 07:33:41,500 --> 07:33:45,400 So leaving it to the system by providing this value or not. 10420 07:33:45,400 --> 07:33:48,400 Let's say the cat is different type of machine learning comes 10421 07:33:48,400 --> 07:33:49,503 into picture and 10422 07:33:49,503 --> 07:33:53,160 as such machine learning is not specific to It's going 10423 07:33:53,160 --> 07:33:57,300 to help us to achieve to run this machine learning algorithms 10424 07:33:57,400 --> 07:34:00,700 what our spark ml lead tools MLA business thing 10425 07:34:00,700 --> 07:34:02,300 but machine learning library 10426 07:34:02,300 --> 07:34:03,700 or machine learning offering 10427 07:34:03,700 --> 07:34:07,200 within this Mark and has a number of algorithms implemented 10428 07:34:07,200 --> 07:34:09,800 and it provides very good feature to persist 10429 07:34:09,800 --> 07:34:12,306 the result generally in machine learning. 10430 07:34:12,306 --> 07:34:14,509 We will generate a model the pattern 10431 07:34:14,509 --> 07:34:17,089 of the data recorder is a model the model 10432 07:34:17,089 --> 07:34:20,688 will be persisted either in different forms Like Pat. 10433 07:34:20,688 --> 07:34:23,087 Quit I have Through different forms, 10434 07:34:23,087 --> 07:34:26,700 it can be stored opposite district and has methodologies 10435 07:34:26,700 --> 07:34:29,600 to extract the features from a set of data. 10436 07:34:29,600 --> 07:34:31,353 I may have million images. 10437 07:34:31,353 --> 07:34:32,500 I want to extract 10438 07:34:32,500 --> 07:34:36,300 the common features available within those millions of images 10439 07:34:36,300 --> 07:34:40,170 and other utilities available to process to define 10440 07:34:40,170 --> 07:34:43,607 or like to define the seed the randomizing it so 10441 07:34:43,607 --> 07:34:47,441 different utilities are available as well as pipelines. 
10442 07:34:47,441 --> 07:34:49,500 That's very specific to spark 10443 07:34:49,800 --> 07:34:53,300 where I can Channel Arrange the sequence 10444 07:34:53,300 --> 07:34:56,700 of steps to be undergone by the machine learning submission 10445 07:34:56,700 --> 07:34:58,100 learning one algorithm first 10446 07:34:58,100 --> 07:34:59,863 and then the result of it will be fed 10447 07:34:59,863 --> 07:35:02,163 into a machine learning algorithm to like that. 10448 07:35:02,163 --> 07:35:03,400 We can have a sequence 10449 07:35:03,400 --> 07:35:06,500 of execution and that will be defined using 10450 07:35:06,500 --> 07:35:10,562 the pipeline's is Honorable features of spark Emily. 10451 07:35:11,000 --> 07:35:15,100 What are some popular algorithms and Utilities in spark Emily. 10452 07:35:15,500 --> 07:35:18,382 So these are some popular algorithms like regression 10453 07:35:18,382 --> 07:35:22,000 classification basic statistics recommendation system. 10454 07:35:22,000 --> 07:35:24,678 It's a comedy system is like well implemented. 10455 07:35:24,678 --> 07:35:27,000 All we have to provide is give the data. 10456 07:35:27,000 --> 07:35:30,579 If you give the ratings and products within an organization, 10457 07:35:30,579 --> 07:35:32,400 if you have the complete damp, 10458 07:35:32,400 --> 07:35:35,800 we can build the recommendation system in no time. 10459 07:35:35,800 --> 07:35:39,283 And if you give any user you can give a recommendation. 10460 07:35:39,283 --> 07:35:41,600 These are the products the user may like 10461 07:35:41,600 --> 07:35:42,500 and those products 10462 07:35:42,500 --> 07:35:45,900 can be displayed in the search result recommendation system 10463 07:35:45,900 --> 07:35:48,017 really works on the basis of the feedback 10464 07:35:48,017 --> 07:35:50,400 that we are providing for the earlier products 10465 07:35:50,400 --> 07:35:51,500 that we had bought. 10466 07:35:51,600 --> 07:35:54,225 Bustling dimensionality reduction whenever 10467 07:35:54,225 --> 07:35:57,300 we do transitioning with the huge amount of data, 10468 07:35:57,600 --> 07:35:59,511 it's very very compute-intensive 10469 07:35:59,511 --> 07:36:01,900 and we may have to reduce the dimensions, 10470 07:36:01,900 --> 07:36:03,752 especially the matrix dimensions 10471 07:36:03,752 --> 07:36:07,000 within them early without losing the features. 10472 07:36:07,000 --> 07:36:09,538 What are the features available without losing it? 10473 07:36:09,538 --> 07:36:11,308 We should reduce the dimensionality 10474 07:36:11,308 --> 07:36:13,580 and there are some algorithms available to do 10475 07:36:13,580 --> 07:36:16,660 that dimensionality reduction and feature extraction. 10476 07:36:16,660 --> 07:36:19,486 So what are the common features are features available 10477 07:36:19,486 --> 07:36:22,227 within that particular image and I can Compare 10478 07:36:22,227 --> 07:36:23,300 what are the common 10479 07:36:23,300 --> 07:36:26,600 across common features available within those images? 10480 07:36:26,600 --> 07:36:29,106 That's how we will group those images. 10481 07:36:29,106 --> 07:36:29,716 So get me 10482 07:36:29,716 --> 07:36:32,900 whether this particular image the person looking 10483 07:36:32,900 --> 07:36:35,300 like this image available in the database or not. 
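[Editor's note: a minimal sketch of the MLlib pipeline idea described above — feature extraction feeding a learning algorithm in sequence, with the fitted model persisted. The stages, column names, `trainingDF` DataFrame and save path are all assumptions for illustration.]

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// A hypothetical text-classification pipeline: tokenize, extract features, then learn.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr        = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model    = pipeline.fit(trainingDF)   // trainingDF: assumed DataFrame with "text" and "label"

// Persist the fitted model so it can be reloaded later to continue processing.
model.write.overwrite().save("/tmp/text-model")   // placeholder path
```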
10484 07:36:35,700 --> 07:36:37,524 For example, assume the organization 10485 07:36:37,524 --> 07:36:40,600 or the police department crime Department maintaining a list 10486 07:36:40,600 --> 07:36:44,400 of persons committed crime and if we get a new photo 10487 07:36:44,400 --> 07:36:48,161 when they do a search they may not have the exact photo bit 10488 07:36:48,161 --> 07:36:49,200 by bit the photo 10489 07:36:49,200 --> 07:36:51,600 might have been taken with a different background. 10490 07:36:51,600 --> 07:36:55,000 Front lighting's different locations different time. 10491 07:36:55,000 --> 07:36:57,754 So a hundred percent the data will be different on bits 10492 07:36:57,754 --> 07:37:00,520 and bytes will be different but look nice. 10493 07:37:00,520 --> 07:37:03,767 Yes, they are going to be seeing so I'm going to search 10494 07:37:03,767 --> 07:37:05,100 the photo looking similar 10495 07:37:05,100 --> 07:37:07,500 to this particular photograph as the input. 10496 07:37:07,500 --> 07:37:09,033 I'll provide to achieve 10497 07:37:09,033 --> 07:37:11,976 that we will be extracting the features in each 10498 07:37:11,976 --> 07:37:13,000 of those photos. 10499 07:37:13,000 --> 07:37:15,717 We will extract the features and we will try to match 10500 07:37:15,717 --> 07:37:17,697 the feature rather than the bits 10501 07:37:17,697 --> 07:37:21,015 and bytes and optimization as well in terms of processing 10502 07:37:21,015 --> 07:37:22,200 or doing the piping. 10503 07:37:22,200 --> 07:37:25,100 There are a number of algorithms to do the optimization. 10504 07:37:25,400 --> 07:37:27,000 Let's move on to spark SQL. 10505 07:37:27,100 --> 07:37:29,811 Is there a module to implement sequence Park? 10506 07:37:29,811 --> 07:37:32,475 How does it work so directly not the sequel 10507 07:37:32,475 --> 07:37:36,300 may be very similar to high whatever the structure data 10508 07:37:36,300 --> 07:37:37,300 that we have. 10509 07:37:37,400 --> 07:37:38,800 We can read the data 10510 07:37:38,800 --> 07:37:42,000 or extract the meaning out of the data using SQL 10511 07:37:42,400 --> 07:37:44,600 and it exposes the APA 10512 07:37:44,700 --> 07:37:48,700 and we can use those API to read the data or create data frames 10513 07:37:48,834 --> 07:37:51,065 and spunk SQL has four major. 10514 07:37:51,500 --> 07:37:55,800 Degrees data source data Frame data frame is 10515 07:37:55,800 --> 07:37:58,900 like the representation of X and Y data 10516 07:37:59,300 --> 07:38:02,800 or like Excel data multi-dimensional structure data 10517 07:38:03,000 --> 07:38:06,000 and abstract form on top of dataframe. 10518 07:38:06,000 --> 07:38:08,541 I can do the query and internally, 10519 07:38:08,541 --> 07:38:11,700 it has interpreter and Optimizer any query 10520 07:38:11,700 --> 07:38:15,100 I fire that will get interpreted or optimized 10521 07:38:15,100 --> 07:38:18,500 and get executed using the SQL services and get 10522 07:38:18,500 --> 07:38:20,300 the data from the data frame 10523 07:38:20,300 --> 07:38:22,900 or it An read the data from the data source 10524 07:38:22,900 --> 07:38:24,000 and do the processing. 10525 07:38:24,265 --> 07:38:26,034 What is a package file? 10526 07:38:26,100 --> 07:38:27,800 It's a format of the file 10527 07:38:27,800 --> 07:38:30,361 where the data in some structured form, 10528 07:38:30,361 --> 07:38:33,800 especially the result of the Spock SQL can be stored 10529 07:38:33,800 --> 07:38:37,350 or returned in some persistence and the packet again. 
10530 07:38:37,350 --> 07:38:41,317 It is a open source from Apache its data serialization technique 10531 07:38:41,317 --> 07:38:44,833 where we can serialize the data using the pad could form 10532 07:38:44,833 --> 07:38:46,078 and to precisely say, 10533 07:38:46,078 --> 07:38:47,500 it's a columnar storage. 10534 07:38:47,500 --> 07:38:49,900 It's going to consume less space it will use 10535 07:38:49,900 --> 07:38:51,200 the keys and values. 10536 07:38:51,300 --> 07:38:55,500 Store the data and also it helps you to access a specific data 10537 07:38:55,500 --> 07:38:59,100 from that packaged form using the query so backward. 10538 07:38:59,100 --> 07:39:02,200 It's another open source format data serialization format 10539 07:39:02,200 --> 07:39:03,267 to store the data 10540 07:39:03,267 --> 07:39:04,900 on purses the data as well 10541 07:39:04,900 --> 07:39:08,700 as to retrieve the data list the functions of Sparks equal. 10542 07:39:08,700 --> 07:39:10,800 You can be used to load the varieties 10543 07:39:10,800 --> 07:39:12,300 of structured data, of course, 10544 07:39:12,300 --> 07:39:15,600 yes monks equal can work only with the structure data. 10545 07:39:15,600 --> 07:39:17,900 It can be used to load varieties 10546 07:39:17,900 --> 07:39:20,900 of structured data and you can use SQL 10547 07:39:20,900 --> 07:39:23,600 like it's to query against the program 10548 07:39:23,600 --> 07:39:25,000 and it can be used 10549 07:39:25,000 --> 07:39:27,839 with external tools to connect to this park as well. 10550 07:39:27,839 --> 07:39:30,400 It gives very good the integration with the SQL 10551 07:39:30,400 --> 07:39:32,900 and using python Java Scala code. 10552 07:39:33,000 --> 07:39:35,831 We can create an rdd from the structure data 10553 07:39:35,831 --> 07:39:38,400 available directly using this box equal. 10554 07:39:38,400 --> 07:39:40,300 I can generate the TD. 10555 07:39:40,500 --> 07:39:42,600 So it's going to facilitate the people 10556 07:39:42,600 --> 07:39:46,400 from database background to make the program faster and quicker. 10557 07:39:47,100 --> 07:39:48,100 Next question is 10558 07:39:48,100 --> 07:39:50,700 what do you understand by lazy evaluation? 10559 07:39:50,900 --> 07:39:54,400 So whenever you do any operation within the spark word, 10560 07:39:54,400 --> 07:39:57,281 it will not do the processing immediately it look 10561 07:39:57,281 --> 07:40:00,100 for the final results that we are asking for it. 10562 07:40:00,100 --> 07:40:02,000 If it doesn't ask for the final result. 10563 07:40:02,000 --> 07:40:04,660 It doesn't need to do the processing So based 10564 07:40:04,660 --> 07:40:07,200 on the final action until we do the action. 10565 07:40:07,200 --> 07:40:08,990 There will not be any Transformations. 10566 07:40:08,990 --> 07:40:11,700 I will there will not be any actual processing happening. 10567 07:40:11,700 --> 07:40:13,141 It will just understand 10568 07:40:13,141 --> 07:40:15,900 what our Transformations it has to do finally 10569 07:40:15,900 --> 07:40:18,900 if you ask The action then in optimized way, 10570 07:40:18,900 --> 07:40:22,200 it's going to complete the data processing and get 10571 07:40:22,200 --> 07:40:23,553 us the final result. 10572 07:40:23,553 --> 07:40:26,600 So to answer straight lazy evaluation is doing 10573 07:40:26,600 --> 07:40:30,300 the processing one Leon need of the resultant data. 10574 07:40:30,300 --> 07:40:32,100 The data is not required. 
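[Editor's note: a small sketch tying together the Spark SQL, Parquet and lazy-evaluation points above. `spark` is an assumed SparkSession, and the file path and column names are invented.]

```scala
// Read structured data from a columnar Parquet source.
val df = spark.read.parquet("/data/ratings.parquet")   // hypothetical path

df.createOrReplaceTempView("ratings")
val top = spark.sql(
  "SELECT productId, AVG(rating) AS avg_rating FROM ratings GROUP BY productId")

// Nothing above has actually been computed yet -- reads and transformations are lazy.
// Only an action such as show() (or count(), collect()) triggers the optimised execution.
top.show()
```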
10575 07:40:32,100 --> 07:40:34,757 It's not going to do the processing. 10576 07:40:34,757 --> 07:40:36,726 Can you use Funk to access 10577 07:40:36,726 --> 07:40:40,200 and analyze data stored in Cassandra data piece? 10578 07:40:40,200 --> 07:40:41,600 Yes, it is possible. 10579 07:40:41,600 --> 07:40:44,400 Okay, not only Cassandra any of the nosql database it 10580 07:40:44,400 --> 07:40:46,100 can very well do the processing 10581 07:40:46,100 --> 07:40:49,700 and Sandra also works in a distributed architecture. 10582 07:40:49,700 --> 07:40:51,200 It's a nosql database 10583 07:40:51,200 --> 07:40:53,800 so it can leverage the data locality. 10584 07:40:53,800 --> 07:40:56,000 The query can be executed locally 10585 07:40:56,000 --> 07:40:58,200 where the Cassandra notes are available. 10586 07:40:58,200 --> 07:41:01,100 It's going to make the query execution faster 10587 07:41:01,100 --> 07:41:04,326 and reduce the network load and Spark executors. 10588 07:41:04,326 --> 07:41:06,009 It will try to get started 10589 07:41:06,009 --> 07:41:08,242 or the spark executors in the mission 10590 07:41:08,242 --> 07:41:10,600 where the Cassandra notes are available 10591 07:41:10,600 --> 07:41:13,900 or data is available going to do the processing locally. 10592 07:41:13,900 --> 07:41:16,450 So it's going to leverage the data locality. 10593 07:41:16,450 --> 07:41:17,426 T next question, 10594 07:41:17,426 --> 07:41:19,500 how can you minimize data transfers 10595 07:41:19,500 --> 07:41:21,200 when working with spark 10596 07:41:21,200 --> 07:41:23,636 if you ask the core design the success 10597 07:41:23,636 --> 07:41:25,514 of the spark program depends on 10598 07:41:25,514 --> 07:41:28,300 how much you are reducing the network transfer. 10599 07:41:28,300 --> 07:41:30,900 This network transfer is very costly operation 10600 07:41:30,900 --> 07:41:32,300 and you cannot paralyzed 10601 07:41:32,400 --> 07:41:35,600 in case multiple ways are especially two ways to avoid. 10602 07:41:35,600 --> 07:41:37,664 This one is called broadcast variable 10603 07:41:37,664 --> 07:41:40,300 and at Co-operators broadcast variable. 10604 07:41:40,300 --> 07:41:43,536 It will help us to transfer any static data 10605 07:41:43,536 --> 07:41:46,428 or any informations keep on publish. 10606 07:41:46,500 --> 07:41:48,300 To multiple systems. 10607 07:41:48,300 --> 07:41:49,300 So I'll see 10608 07:41:49,300 --> 07:41:52,257 if any data to be transferred to multiple executors 10609 07:41:52,257 --> 07:41:53,500 to be used in common. 10610 07:41:53,500 --> 07:41:55,016 I can broadcast it 10611 07:41:55,200 --> 07:41:58,800 and I might want to consolidate the values happening 10612 07:41:58,800 --> 07:42:02,172 in multiple workers in a single centralized location. 10613 07:42:02,172 --> 07:42:03,600 I can use accumulator. 10614 07:42:03,600 --> 07:42:06,412 So this will help us to achieve the data consolidation 10615 07:42:06,412 --> 07:42:08,800 of data distribution in the distributed world. 10616 07:42:08,800 --> 07:42:11,800 The ap11 are not abstract level 10617 07:42:11,800 --> 07:42:14,351 where we don't need to do the heavy lifting 10618 07:42:14,351 --> 07:42:16,600 that's taken care by the spark for us. 10619 07:42:16,800 --> 07:42:19,275 What our broadcast variables just now 10620 07:42:19,275 --> 07:42:22,300 as we discussed the value of the common value 10621 07:42:22,300 --> 07:42:23,200 that we need. 
10622 07:42:23,200 --> 07:42:27,300 I am a want that to be available in multiple executors 10623 07:42:27,300 --> 07:42:31,000 multiple workers simple example you want to do a spell check 10624 07:42:31,000 --> 07:42:33,500 on the Tweet Commons the dictionary 10625 07:42:33,500 --> 07:42:36,100 which has the right list of words. 10626 07:42:36,200 --> 07:42:37,800 I'll have the complete list. 10627 07:42:37,800 --> 07:42:40,300 I want that particular dictionary to be available 10628 07:42:40,300 --> 07:42:41,400 in each executor 10629 07:42:41,400 --> 07:42:43,944 so that with a task with that's running locally 10630 07:42:43,944 --> 07:42:46,600 in those Executives can refer to that particular. 10631 07:42:46,600 --> 07:42:49,900 Task and get the processing done by avoiding 10632 07:42:49,900 --> 07:42:51,616 the network data transfer. 10633 07:42:51,616 --> 07:42:55,485 So the process of Distributing the data from the spark context 10634 07:42:55,485 --> 07:42:56,500 to the executors 10635 07:42:56,500 --> 07:42:58,700 where the task going to run is achieved 10636 07:42:58,700 --> 07:43:00,400 using broadcast variables 10637 07:43:00,400 --> 07:43:03,952 and the built-in within the spark APA using this parquet p-- 10638 07:43:03,952 --> 07:43:06,000 we can create the bronchus variable 10639 07:43:06,200 --> 07:43:09,500 and the process of Distributing this data available 10640 07:43:09,500 --> 07:43:13,524 in all executors is taken care by the spark framework explain 10641 07:43:13,524 --> 07:43:15,000 accumulators in spark. 10642 07:43:15,100 --> 07:43:18,500 The similar way how we have broadcast variables. 10643 07:43:18,500 --> 07:43:21,290 We have accumulators as well simple example, 10644 07:43:21,290 --> 07:43:25,100 you want to count how many error codes are available 10645 07:43:25,100 --> 07:43:26,600 in the distributed environment 10646 07:43:26,800 --> 07:43:28,400 as your data is distributed 10647 07:43:28,400 --> 07:43:31,300 across multiple systems multiple Executives. 10648 07:43:31,400 --> 07:43:34,784 Each executor will do the process thing count 10649 07:43:34,784 --> 07:43:37,200 the records anatomically. 10650 07:43:37,200 --> 07:43:38,978 I may want the total count. 10651 07:43:38,978 --> 07:43:42,600 So what I will do I will ask to maintain an accumulator, 10652 07:43:42,600 --> 07:43:45,250 of course, it will be maintained in this more context. 10653 07:43:45,250 --> 07:43:48,500 In the driver program the driver program going 10654 07:43:48,500 --> 07:43:50,100 to be one per application. 10655 07:43:50,100 --> 07:43:52,200 It will keep on getting accumulated 10656 07:43:52,200 --> 07:43:54,900 and whenever I want I can read those values 10657 07:43:54,900 --> 07:43:57,100 and take any appropriate action. 10658 07:43:57,200 --> 07:44:00,300 So it's like more or less the accumulators and practice videos 10659 07:44:00,300 --> 07:44:01,600 looks opposite each other, 10660 07:44:02,000 --> 07:44:03,800 but the purpose is totally different. 
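[Editor's note: a minimal sketch of the two mechanisms discussed above — a read-only broadcast variable shipped to every executor, and an accumulator consolidated back on the driver. The dictionary contents are made up, and `words` is an assumed RDD[String] of tweet words.]

```scala
// Broadcast: ship the read-only dictionary once to every executor.
val dictionary = sc.broadcast(Set("spark", "hadoop", "streaming"))

// Accumulator: count misspelt words across all executors; only the driver reads the total.
val misspelt = sc.longAccumulator("misspelt-words")

words.foreach { word =>
  if (!dictionary.value.contains(word)) misspelt.add(1L)
}

println(misspelt.value)   // read the consolidated count back on the driver
```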
10661 07:44:04,200 --> 07:44:06,531 Why is there a need for workers variable 10662 07:44:06,531 --> 07:44:10,400 when working with Apache Spark It's read only variable 10663 07:44:10,400 --> 07:44:13,800 and it will be cached in memory in a distributed fashion 10664 07:44:13,800 --> 07:44:15,789 and it eliminates the The work 10665 07:44:15,789 --> 07:44:19,012 of moving the data from a centralized location 10666 07:44:19,012 --> 07:44:20,400 that is Spong driver 10667 07:44:20,400 --> 07:44:24,200 or from a particular program to all the executors 10668 07:44:24,200 --> 07:44:26,830 within the cluster where the transfer into get executed. 10669 07:44:26,830 --> 07:44:29,700 We don't need to worry about where the task will get executed 10670 07:44:29,700 --> 07:44:31,100 within the cluster. 10671 07:44:31,100 --> 07:44:32,138 So when compared 10672 07:44:32,138 --> 07:44:34,900 with the accumulators broadcast variables, 10673 07:44:34,900 --> 07:44:37,256 it's going to have a read-only operation. 10674 07:44:37,256 --> 07:44:38,903 The executors cannot change 10675 07:44:38,903 --> 07:44:41,100 the value can only read those values. 10676 07:44:41,100 --> 07:44:44,900 It cannot update so mostly will be used like a quiche. 10677 07:44:44,900 --> 07:44:47,400 Have for the identity next question, 10678 07:44:47,400 --> 07:44:50,327 how can you trigger automatically naps in spark 10679 07:44:50,327 --> 07:44:52,300 to handle accumulated metadata. 10680 07:44:52,700 --> 07:44:54,500 So there is a parameter 10681 07:44:54,500 --> 07:44:57,900 that we can set TTL the will get triggered along 10682 07:44:57,900 --> 07:45:00,900 with the running jobs and intermediately. 10683 07:45:00,900 --> 07:45:04,000 It's going to write the data result into the disc 10684 07:45:04,000 --> 07:45:07,155 or cleaned unnecessary data or clean the rdds. 10685 07:45:07,155 --> 07:45:08,600 That's not being used. 10686 07:45:08,600 --> 07:45:09,800 The least used RTD. 10687 07:45:09,800 --> 07:45:10,987 It will get cleaned 10688 07:45:10,987 --> 07:45:14,800 and click keep the metadata as well as the memory clean water. 10689 07:45:14,800 --> 07:45:17,800 The various levels of persistence in Apache spark 10690 07:45:17,800 --> 07:45:20,200 when you say data should be stored in memory. 10691 07:45:20,200 --> 07:45:23,000 It can be indifferent now you can be possessed it 10692 07:45:23,000 --> 07:45:27,100 so it can be in memory of only or memory and disk or disk only 10693 07:45:27,200 --> 07:45:30,500 and when it is getting stored we can ask it to store it 10694 07:45:30,500 --> 07:45:31,800 in a civilized form. 10695 07:45:31,900 --> 07:45:35,300 So the reason why we may store or possess dress, 10696 07:45:35,303 --> 07:45:36,996 I want this particular 10697 07:45:37,100 --> 07:45:40,200 on very this form of body little back 10698 07:45:40,200 --> 07:45:42,038 for using so I can really 10699 07:45:42,038 --> 07:45:45,200 back maybe I may not need it very immediate. 10700 07:45:45,400 --> 07:45:48,477 So I don't want that to keep occupying my memory. 10701 07:45:48,477 --> 07:45:50,400 I'll write it to the hard disk 10702 07:45:50,400 --> 07:45:52,700 and I'll read it back whenever there is a need. 10703 07:45:52,700 --> 07:45:55,300 I'll read it back the next question. 10704 07:45:55,300 --> 07:45:58,069 What do you understand by schema rdd, 10705 07:45:58,200 --> 07:46:01,900 so schema rdd will be used as slave Within These Punk's equal. 
10706 07:46:01,900 --> 07:46:05,300 So the RTD will have the meta information built into it. 10707 07:46:05,300 --> 07:46:07,919 It will have the schema also very similar to 10708 07:46:07,919 --> 07:46:10,642 what we have the database schema the structure 10709 07:46:10,642 --> 07:46:11,976 of the particular data 10710 07:46:11,976 --> 07:46:14,994 and when I have a structure it will be easy for me. 10711 07:46:14,994 --> 07:46:16,081 To handle the data 10712 07:46:16,081 --> 07:46:19,100 so data and the structure will be existing together 10713 07:46:19,100 --> 07:46:20,360 and the schema are ready. 10714 07:46:20,360 --> 07:46:20,550 Now. 10715 07:46:20,550 --> 07:46:22,100 It's called as a data frame 10716 07:46:22,100 --> 07:46:25,009 but it's Mark and dataframe term is very popular 10717 07:46:25,009 --> 07:46:27,616 in languages like our as other languages. 10718 07:46:27,616 --> 07:46:28,700 It's very popular. 10719 07:46:28,700 --> 07:46:31,700 So it's going to have the data and The Meta information 10720 07:46:31,700 --> 07:46:34,700 about that data saying what column was structure it. 10721 07:46:34,700 --> 07:46:36,300 Is it explain the scenario 10722 07:46:36,300 --> 07:46:38,656 where you will be using spark streaming 10723 07:46:38,656 --> 07:46:41,200 as you may want to do a sentiment analysis 10724 07:46:41,200 --> 07:46:44,200 of Twitter's so there I will be streamed 10725 07:46:44,400 --> 07:46:49,200 so we will Flume sort of a tool to harvest the information 10726 07:46:49,300 --> 07:46:52,700 from Peter and fit it into spark streaming. 10727 07:46:52,700 --> 07:46:56,300 It will extract or identify the sentiment of each 10728 07:46:56,300 --> 07:46:58,300 and every tweet and Market 10729 07:46:58,300 --> 07:47:00,899 whether it is positive or negative and accordingly 10730 07:47:00,899 --> 07:47:02,900 the data will be the structure data 10731 07:47:02,900 --> 07:47:03,700 that we tidy 10732 07:47:03,700 --> 07:47:05,742 whether it is positive or negative maybe 10733 07:47:05,742 --> 07:47:06,856 percentage of positive 10734 07:47:06,856 --> 07:47:09,088 and percentage of negative sentiment store it 10735 07:47:09,088 --> 07:47:10,500 in some structured form. 10736 07:47:10,500 --> 07:47:14,111 Then you can leverage this park Sequel and do grouping 10737 07:47:14,111 --> 07:47:16,403 or filtering Based on the sentiment 10738 07:47:16,403 --> 07:47:19,587 and maybe I can use a machine learning algorithm. 10739 07:47:19,587 --> 07:47:22,107 What drives that particular tweet to be 10740 07:47:22,107 --> 07:47:23,500 in the negative side. 10741 07:47:23,500 --> 07:47:26,700 Is there any similarity between all this negative sentiment 10742 07:47:26,700 --> 07:47:28,812 negative tweets may be specific 10743 07:47:28,812 --> 07:47:32,700 to a product a specific time by when the Tweet was sweeter 10744 07:47:32,700 --> 07:47:34,421 or from a specific region 10745 07:47:34,421 --> 07:47:36,900 that we it was Twitter those analysis 10746 07:47:36,900 --> 07:47:40,194 could be done by leveraging the MLA above spark. 10747 07:47:40,194 --> 07:47:43,700 So Emily streaming core all going to work together. 10748 07:47:43,700 --> 07:47:45,200 All these are like different. 10749 07:47:45,200 --> 07:47:48,500 Offerings available to solve different problems. 10750 07:47:48,600 --> 07:47:51,100 So with this we are coming to end of this interview 10751 07:47:51,100 --> 07:47:53,100 questions discussion of spark. 10752 07:47:53,100 --> 07:47:54,465 I hope you all enjoyed. 
10753
07:47:54,465 --> 07:47:56,913
I hope it was a constructive and useful one.

10754
07:47:56,913 --> 07:47:59,600
More information about Edureka is available

10755
07:47:59,600 --> 07:48:02,183
on the Edureka website. All the best,

10756
07:48:02,183 --> 07:48:05,900
and keep visiting the website for blogs and the latest updates.

10757
07:48:05,900 --> 07:48:07,000
Thank you, folks.

10758
07:48:07,500 --> 07:48:10,400
I hope you have enjoyed listening to this video.

10759
07:48:10,400 --> 07:48:12,450
Please be kind enough to like it,

10760
07:48:12,450 --> 07:48:15,600
and you can comment any of your doubts and queries

10761
07:48:15,600 --> 07:48:17,078
and we will reply to them

10762
07:48:17,078 --> 07:48:20,923
at the earliest. Do look out for more videos in our playlist

10763
07:48:20,923 --> 07:48:24,105
and subscribe to the Edureka channel to learn more.

10764
07:48:24,105 --> 07:48:25,100
Happy learning.
