Apache Spark Full Course - Learn Apache Spark in 8 Hours - Apache Spark Tutorial - Edureka - YouTube
0
00:00:06,800 --> 00:00:10,102
For the past five years Spark
has been on an absolute tear
1
00:00:10,102 --> 00:00:13,700
becoming one of the most widely
used technologies in big data
2
00:00:13,700 --> 00:00:17,226
and AI. Today's cutting-edge
companies like Facebook, Apple,
3
00:00:17,226 --> 00:00:18,300
Netflix, Uber,
4
00:00:18,300 --> 00:00:19,965
and many more have deployed
5
00:00:19,965 --> 00:00:23,366
Spark at massive scale
processing petabytes of data
6
00:00:23,366 --> 00:00:25,192
to deliver innovations ranging
7
00:00:25,192 --> 00:00:27,212
from detecting
fraudulent behavior
8
00:00:27,212 --> 00:00:30,103
to delivering personalized
experiences in real
9
00:00:30,103 --> 00:00:32,741
time, and many such
innovations that are
10
00:00:32,741 --> 00:00:34,500
transforming every industry.
11
00:00:34,800 --> 00:00:37,300
Hi all, I welcome you
all to this full course session
12
00:00:37,300 --> 00:00:40,408
on Apache Spark, a complete
crash course consisting
13
00:00:40,408 --> 00:00:43,200
of everything you need
to know to get started
14
00:00:43,200 --> 00:00:45,500
with Apache Spark from scratch.
15
00:00:45,700 --> 00:00:47,410
But before we get into details,
16
00:00:47,410 --> 00:00:51,000
let's look at our agenda for
today. For better understanding
17
00:00:51,000 --> 00:00:52,300
and ease of learning.
18
00:00:52,300 --> 00:00:55,400
The entire crash course
is divided into 12 modules
19
00:00:55,400 --> 00:00:59,200
In the first module, Introduction
to Spark, we'll try to understand
20
00:00:59,200 --> 00:01:03,100
what exactly Spark is and how it
performs real-time processing.
21
00:01:03,200 --> 00:01:06,741
In the second module, we'll dive deep
into different components
22
00:01:06,741 --> 00:01:10,600
that constitute Spark. We'll also
learn about Spark architecture
23
00:01:10,600 --> 00:01:13,800
and its ecosystem. Next up,
in the third module.
24
00:01:13,800 --> 00:01:15,594
We will learn what exactly
25
00:01:15,594 --> 00:01:18,700
resilient distributed data
sets are in Spark.
26
00:01:19,100 --> 00:01:22,427
The fourth module is all about
data frames. In this module,
27
00:01:22,427 --> 00:01:25,000
We will learn what
exactly data frames are
28
00:01:25,000 --> 00:01:28,300
and how to perform different
operations on data frames.
29
00:01:28,400 --> 00:01:29,940
Moving on, in the fifth
30
00:01:29,940 --> 00:01:32,446
module, we will
discuss different ways
31
00:01:32,446 --> 00:01:35,300
that Spark provides
to perform SQL queries
32
00:01:35,300 --> 00:01:39,000
for accessing and processing
data. In the sixth module,
33
00:01:39,000 --> 00:01:39,847
We will learn
34
00:01:39,847 --> 00:01:43,500
how to perform streaming
on live data streams using Spark
35
00:01:43,500 --> 00:01:46,029
Streaming, and in the seventh
module we'll discuss
36
00:01:46,029 --> 00:01:49,200
how to execute different machine
learning algorithms using
37
00:01:49,200 --> 00:01:52,469
Spark's machine learning library.
The eighth module is all
38
00:01:52,469 --> 00:01:54,917
about Spark GraphX.
In this module,
39
00:01:54,917 --> 00:01:57,800
We are going to learn what
graph processing is and
40
00:01:57,800 --> 00:02:01,700
how to perform graph processing
using Spark's GraphX library.
41
00:02:01,700 --> 00:02:05,500
In the ninth module, we'll discuss
the key differences between
42
00:02:05,500 --> 00:02:08,800
two popular data processing
paradigms, MapReduce
43
00:02:08,800 --> 00:02:12,500
and Spark. Talking
about the tenth module, we'll integrate
44
00:02:12,500 --> 00:02:14,400
two popular frameworks, Spark
45
00:02:14,400 --> 00:02:19,400
and Kafka. The 11th module is
all about PySpark. In this module,
46
00:02:19,400 --> 00:02:21,000
we'll try to understand
47
00:02:21,000 --> 00:02:24,281
how PySpark exposes
the Spark programming model
48
00:02:24,281 --> 00:02:26,800
to Python. Lastly,
in the 12th module,
49
00:02:26,800 --> 00:02:30,100
We'll take a look at most
frequently asked interview
50
00:02:30,100 --> 00:02:31,200
questions on Spark,
51
00:02:31,200 --> 00:02:33,200
which will help you
ace your interview
52
00:02:33,200 --> 00:02:34,200
with flying colors.
53
00:02:34,200 --> 00:02:35,900
Thank you guys
while you are at it,
54
00:02:35,900 --> 00:02:37,600
please do not
forget to subscribe
55
00:02:37,600 --> 00:02:39,173
to Edureka's YouTube channel
56
00:02:39,173 --> 00:02:42,200
to stay updated with
trending technologies.
57
00:02:47,200 --> 00:02:48,400
There has been a buzz
58
00:02:48,400 --> 00:02:51,576
around the world that Spark is
the future of the Big Data platform,
59
00:02:51,576 --> 00:02:53,400
which is a hundred times faster
60
00:02:53,400 --> 00:02:57,250
than MapReduce and is also
a go-to tool for all solutions.
61
00:02:57,250 --> 00:03:00,019
But what exactly is
Apache Spark and what makes
62
00:03:00,019 --> 00:03:01,100
it so popular?
63
00:03:01,100 --> 00:03:03,700
In this session I will give
you a complete insight
64
00:03:03,700 --> 00:03:04,600
into Apache Spark
65
00:03:04,600 --> 00:03:07,500
and its fundamentals
without any further ado.
66
00:03:07,500 --> 00:03:08,200
Let's quickly
67
00:03:08,200 --> 00:03:09,898
look at the topics to be covered
68
00:03:09,898 --> 00:03:12,198
in this session.
First and foremost,
69
00:03:12,198 --> 00:03:13,000
I will tell you
70
00:03:13,000 --> 00:03:15,724
what Apache Spark is
and its features. Next,
71
00:03:15,724 --> 00:03:17,773
I will take you
through the components
72
00:03:17,773 --> 00:03:18,948
of the Spark ecosystem
73
00:03:18,948 --> 00:03:21,932
that make Spark the future
of the Big Data platform.
74
00:03:21,932 --> 00:03:22,600
After that,
75
00:03:22,600 --> 00:03:23,300
I will talk
76
00:03:23,300 --> 00:03:26,100
about the fundamental
data structure of Spark,
77
00:03:26,100 --> 00:03:28,400
that is RDD. I will also tell you
78
00:03:28,400 --> 00:03:32,400
about its features, its operations,
the ways to create RDDs, etc.,
79
00:03:32,400 --> 00:03:35,500
and at the last I'll wrap
up the session by giving
80
00:03:35,500 --> 00:03:37,351
a real-time use case of Spark.
81
00:03:37,351 --> 00:03:38,505
So let's get started
82
00:03:38,505 --> 00:03:40,800
with the very first
topic and understand
83
00:03:40,800 --> 00:03:43,400
what Spark is. Spark
is an open-source,
84
00:03:43,400 --> 00:03:45,100
scalable, massively parallel,
85
00:03:45,100 --> 00:03:47,700
in-memory execution
environment for running
86
00:03:47,700 --> 00:03:49,300
analytics applications.
87
00:03:49,300 --> 00:03:52,085
You can just think
of it as an in-memory layer
88
00:03:52,085 --> 00:03:54,507
that sits above
multiple data stores,
89
00:03:54,507 --> 00:03:56,929
where data can be loaded
into the memory
90
00:03:56,929 --> 00:03:59,600
and analyzed in parallel
across the cluster.
91
00:03:59,800 --> 00:04:03,189
Coming to big data processing, much
like MapReduce, Spark works
92
00:04:03,189 --> 00:04:05,700
to distribute the data
across the cluster
93
00:04:05,700 --> 00:04:08,118
and then process
that data in parallel.
94
00:04:08,118 --> 00:04:10,833
The difference here is
that unlike mapreduce
95
00:04:10,833 --> 00:04:14,867
which shuffles the files around
the disk, Spark works in memory,
96
00:04:14,867 --> 00:04:17,600
and that makes it much
faster at processing
97
00:04:17,600 --> 00:04:19,300
the data than MapReduce.
98
00:04:19,300 --> 00:04:20,663
It is also said to be
99
00:04:20,663 --> 00:04:24,235
the lightning-fast unified
analytics engine for big data
100
00:04:24,235 --> 00:04:25,600
and machine learning.
101
00:04:25,600 --> 00:04:28,680
So now let's look
at the interesting features
102
00:04:28,680 --> 00:04:29,800
of Apache Spark.
103
00:04:29,800 --> 00:04:32,181
Coming to speed, you
can call Spark
104
00:04:32,181 --> 00:04:34,100
a swift processing framework.
105
00:04:34,100 --> 00:04:37,500
Why? Because it is
a hundred times faster in memory
106
00:04:37,500 --> 00:04:40,900
and 10 times faster on the disk
when comparing it with Hadoop.
107
00:04:40,900 --> 00:04:41,700
Not only
108
00:04:41,700 --> 00:04:45,100
that it also provides
high data processing speed.
109
00:04:45,200 --> 00:04:46,900
Next, powerful caching.
110
00:04:46,900 --> 00:04:48,809
It has a simple
programming layer
111
00:04:48,809 --> 00:04:50,600
that provides powerful caching
112
00:04:50,600 --> 00:04:53,341
and disk persistence
capabilities. And Spark
113
00:04:53,341 --> 00:04:55,300
can be deployed through Mesos,
114
00:04:55,300 --> 00:04:58,600
Hadoop YARN, or Spark's
own cluster manager
115
00:04:58,700 --> 00:04:59,700
As you all know,
116
00:04:59,700 --> 00:05:01,370
Spark itself was designed
117
00:05:01,370 --> 00:05:03,900
and developed for
real-time data processing.
118
00:05:03,900 --> 00:05:05,239
So it's an obvious fact
119
00:05:05,239 --> 00:05:07,584
that it offers
real-time computation
120
00:05:07,584 --> 00:05:10,800
and low latency because of
in-memory computations.
121
00:05:10,900 --> 00:05:14,700
Next, polyglot: Spark
provides high-level APIs
122
00:05:14,700 --> 00:05:16,700
in Java, Scala, Python,
123
00:05:16,700 --> 00:05:19,536
and R. Spark code
can be written in any
124
00:05:19,536 --> 00:05:21,281
of these four languages.
125
00:05:21,281 --> 00:05:25,500
Not only that it also provides
a shell in Scala and Python.
126
00:05:25,692 --> 00:05:29,000
These are the various
features of Spark. Now,
127
00:05:29,000 --> 00:05:32,700
let's see the various
components of the Spark ecosystem.
128
00:05:32,700 --> 00:05:36,100
Let me first tell you
about the Spark Core component.
129
00:05:36,100 --> 00:05:39,385
It is the most vital component
of the Spark ecosystem,
130
00:05:39,385 --> 00:05:40,700
which is responsible
131
00:05:40,700 --> 00:05:44,400
for basic I/O functions,
scheduling, monitoring, etc.
132
00:05:44,400 --> 00:05:47,800
The entire Apache Spark
ecosystem is built on the top
133
00:05:47,800 --> 00:05:49,670
of this core execution engine
134
00:05:49,670 --> 00:05:52,700
which has extensible APIs
in different languages
135
00:05:52,700 --> 00:05:55,100
like Scala, Python, R, and Java.
136
00:05:55,100 --> 00:05:57,442
as I have already
mentioned, Spark
137
00:05:57,442 --> 00:05:59,200
can be deployed through Mesos,
138
00:05:59,200 --> 00:06:02,800
Hadoop YARN,
or Spark's own cluster manager.
139
00:06:02,800 --> 00:06:05,433
The Spark ecosystem
library is composed
140
00:06:05,433 --> 00:06:06,888
of various components
141
00:06:06,888 --> 00:06:10,700
like Spark SQL, Spark Streaming,
and the machine learning library.
142
00:06:10,700 --> 00:06:13,200
Now, let me explain
each of them to you.
143
00:06:13,200 --> 00:06:16,573
The Spark SQL component
is used to leverage the power
144
00:06:16,573 --> 00:06:18,000
of declarative queries
145
00:06:18,000 --> 00:06:21,034
and optimize storage
by executing SQL queries
146
00:06:21,034 --> 00:06:22,000
on Spark data,
147
00:06:22,000 --> 00:06:23,778
which is present in the RDDs
148
00:06:23,778 --> 00:06:27,100
and other external sources.
Next, the Spark Streaming
149
00:06:27,100 --> 00:06:29,617
component allows developers
to perform batch
150
00:06:29,617 --> 00:06:31,395
processing and streaming of data
151
00:06:31,395 --> 00:06:35,042
in the same application. And coming
to the machine learning library:
152
00:06:35,042 --> 00:06:36,313
It eases the deployment
153
00:06:36,313 --> 00:06:39,300
and development of scalable
machine learning pipelines,
154
00:06:39,300 --> 00:06:43,000
like summary statistics,
correlations, feature extraction,
155
00:06:43,000 --> 00:06:46,200
transformation functions,
optimization algorithms, etc.
156
00:06:46,200 --> 00:06:49,365
And the GraphX component lets
the data scientist work
157
00:06:49,365 --> 00:06:52,584
with graph and non-graph sources
to achieve flexibility
158
00:06:52,584 --> 00:06:55,820
and resilience in graph
construction and transformation
159
00:06:55,820 --> 00:06:56,784
and now talking
160
00:06:56,784 --> 00:07:00,000
about the programming
languages, Spark supports Scala.
161
00:07:00,000 --> 00:07:02,851
It is a functional
programming language in which
162
00:07:02,851 --> 00:07:04,100
Spark is written.
163
00:07:04,100 --> 00:07:08,200
So Spark supports Scala as
the interface. Then Spark also
164
00:07:08,200 --> 00:07:10,100
supports a Python interface.
165
00:07:10,100 --> 00:07:13,066
You can write the program
in Python and execute it
166
00:07:13,066 --> 00:07:14,408
over Spark. Again,
167
00:07:14,408 --> 00:07:16,899
If you see the code
in Python and Scala,
168
00:07:16,899 --> 00:07:20,858
both are very similar. Then R
is very famous for data analysis
169
00:07:20,858 --> 00:07:22,200
and machine learning.
170
00:07:22,200 --> 00:07:25,081
So Spark has also added
the support for R,
171
00:07:25,081 --> 00:07:26,717
and it also supports Java
172
00:07:26,717 --> 00:07:27,961
so you can go ahead
173
00:07:27,961 --> 00:07:31,300
and write the code in Java
and execute it over Spark.
174
00:07:31,300 --> 00:07:33,300
Next, the data can be stored
175
00:07:33,300 --> 00:07:36,400
in HDFS, the local file
system, Amazon S3 cloud,
176
00:07:36,700 --> 00:07:39,700
and it also supports SQL
and NoSQL databases as well.
177
00:07:39,700 --> 00:07:43,645
So this is all about the various
components of the Spark ecosystem.
178
00:07:43,645 --> 00:07:45,300
Now, let's see what's next
179
00:07:45,300 --> 00:07:48,064
when it comes to iterative
distributed computing
180
00:07:48,064 --> 00:07:50,600
that is processing the data
over multiple jobs
181
00:07:50,600 --> 00:07:51,600
and computations,
182
00:07:51,700 --> 00:07:52,776
We need to reuse
183
00:07:52,776 --> 00:07:55,200
or share the data
among multiple jobs
184
00:07:55,200 --> 00:07:58,258
In earlier frameworks
like Hadoop, there were problems
185
00:07:58,258 --> 00:07:59,950
while dealing with multiple
186
00:07:59,950 --> 00:08:01,400
operations or jobs here.
187
00:08:01,400 --> 00:08:02,900
We need to store the data
188
00:08:02,900 --> 00:08:07,053
in some intermediate stable
distributed storage such as HDFS,
189
00:08:07,053 --> 00:08:11,003
and multiple I/O operations
make the overall computation
190
00:08:11,003 --> 00:08:13,976
of jobs much slower
and there were replications
191
00:08:13,976 --> 00:08:15,100
and serializations,
192
00:08:15,100 --> 00:08:17,955
which in turn made
the process even slower,
193
00:08:17,955 --> 00:08:20,500
and our goal here was
to reduce the number
194
00:08:20,500 --> 00:08:22,400
of I/O operations to HDFS,
195
00:08:22,400 --> 00:08:26,350
and this can be achieved only
through in-memory data sharing
196
00:08:26,350 --> 00:08:29,900
The in-memory data sharing
is 10 to 100 times faster
197
00:08:29,900 --> 00:08:31,966
than network and disk sharing,
198
00:08:31,966 --> 00:08:35,138
and RDDs try to solve all
the problems by enabling
199
00:08:35,138 --> 00:08:38,447
fault-tolerant distributed
in-memory computations.
200
00:08:38,447 --> 00:08:40,000
So now let's understand
201
00:08:40,000 --> 00:08:44,000
what our rdds it stands for
Resilient Distributed Dataset.
202
00:08:44,000 --> 00:08:46,509
They are considered to be
the backbone of Spark
203
00:08:46,509 --> 00:08:49,419
and are one of the fundamental
data structures of Spark.
204
00:08:49,419 --> 00:08:51,782
They are also known as
schema-less structures
205
00:08:51,782 --> 00:08:54,900
that can handle both structured
and unstructured data.
206
00:08:54,900 --> 00:08:57,900
So in Spark, anything
you do is around RDDs.
207
00:08:57,900 --> 00:08:59,700
You're reading the
data in Spark?
208
00:08:59,700 --> 00:09:01,500
Then it is read
into an RDD. Again,
209
00:09:01,500 --> 00:09:04,300
when you're transforming
the data, then you're performing
210
00:09:04,300 --> 00:09:07,268
transformations on an old RDD
and creating a new one.
211
00:09:07,268 --> 00:09:10,378
Then at last you will perform
some actions on the RDD
212
00:09:10,378 --> 00:09:12,533
and store that data
present in an RDD
213
00:09:12,533 --> 00:09:15,906
to persistent storage.
A resilient distributed dataset
214
00:09:15,906 --> 00:09:18,900
is an immutable distributed
collection of objects.
215
00:09:18,900 --> 00:09:20,300
Your objects can be anything
216
00:09:20,300 --> 00:09:23,200
like strings, lines,
rows, objects, collections,
217
00:09:23,200 --> 00:09:26,400
etc. RDDs can contain
any type of Python, Java,
218
00:09:26,400 --> 00:09:27,533
or Scala objects.
219
00:09:27,533 --> 00:09:30,000
Even including user
defined classes.
220
00:09:30,000 --> 00:09:32,900
Now talking about
the distributed environment.
221
00:09:32,900 --> 00:09:35,612
Each data set present
in an RDD is divided
222
00:09:35,612 --> 00:09:37,200
into logical partitions,
223
00:09:37,200 --> 00:09:39,353
which may be computed
on different nodes
224
00:09:39,353 --> 00:09:42,500
of the cluster. Due to this, you
can perform transformations
225
00:09:42,500 --> 00:09:44,190
or actions on the complete data
226
00:09:44,190 --> 00:09:47,300
parallelly, and you don't have
to worry about the distribution
227
00:09:47,300 --> 00:09:49,400
because Spark takes care of that.
228
00:09:49,400 --> 00:09:52,100
RDDs are
highly resilient, that is,
229
00:09:52,100 --> 00:09:55,141
they are able to recover
quickly from any issues
230
00:09:55,141 --> 00:09:56,500
as the same data chunks
231
00:09:56,500 --> 00:09:59,700
are replicated across
multiple executor nodes. Thus,
232
00:09:59,700 --> 00:10:02,564
even if one executor
fails, another will still
233
00:10:02,564 --> 00:10:03,600
process the data.
234
00:10:03,600 --> 00:10:06,482
This allows you to perform
functional calculations
235
00:10:06,482 --> 00:10:08,287
against a data set very quickly
236
00:10:08,287 --> 00:10:10,699
by harnessing the power
of multiple nodes.
237
00:10:10,699 --> 00:10:12,472
So this is all about RDDs now.
238
00:10:12,472 --> 00:10:14,000
Let's have a look at some
239
00:10:14,000 --> 00:10:17,847
of the important features of
RDDs. RDDs have a provision
240
00:10:17,847 --> 00:10:19,327
of in-memory computation,
241
00:10:19,327 --> 00:10:21,300
and all transformations
are lazy.
242
00:10:21,300 --> 00:10:24,044
That is it does not compute
the results right away
243
00:10:24,044 --> 00:10:25,679
until an action is applied.
244
00:10:25,679 --> 00:10:27,800
So it supports
in-memory computation
245
00:10:27,800 --> 00:10:30,034
and lazy evaluation
as well. Next,
246
00:10:30,034 --> 00:10:32,200
fault tolerance: in case of RDDs,
247
00:10:32,200 --> 00:10:34,454
They track the data
lineage information
248
00:10:34,454 --> 00:10:37,341
to rebuild the lost data
automatically, and this is
249
00:10:37,341 --> 00:10:40,000
how it provides fault tolerance
to the system.
250
00:10:40,000 --> 00:10:42,600
Next, immutability: data
can be created
251
00:10:42,600 --> 00:10:43,800
or retrieved anytime,
252
00:10:43,800 --> 00:10:46,388
and once defined its value
cannot be changed.
253
00:10:46,388 --> 00:10:47,900
And that is the reason why
254
00:10:47,900 --> 00:10:51,235
I said RDDs are
immutable. Next, partitioning:
255
00:10:51,235 --> 00:10:53,774
it is the fundamental
unit of parallelism
256
00:10:53,774 --> 00:10:54,605
in Spark RDD,
257
00:10:54,605 --> 00:10:57,800
and all the data chunks
are divided into partitions
258
00:10:57,800 --> 00:10:59,960
in an RDD. Next, persistence:
259
00:10:59,960 --> 00:11:01,600
users can reuse RDDs
260
00:11:01,600 --> 00:11:05,400
and choose a storage strategy for
them. Coarse-grained operations
261
00:11:05,400 --> 00:11:08,493
apply to all elements
in datasets through map,
262
00:11:08,493 --> 00:11:10,600
filter, or
group-by operations.
263
00:11:10,700 --> 00:11:13,000
So these are the various
features of RDDs.
264
00:11:13,300 --> 00:11:15,800
Now, let's see
the ways to create RDDs.
265
00:11:15,800 --> 00:11:19,117
There are three ways to create
RDDs. One can create RDDs
266
00:11:19,117 --> 00:11:22,800
from parallelized collections,
and one can also create RDDs
267
00:11:22,800 --> 00:11:24,367
from an existing RDD
268
00:11:24,367 --> 00:11:27,100
or other RDDs,
and it can also be created
269
00:11:27,100 --> 00:11:30,000
from external data sources
as well, like HDFS,
270
00:11:30,000 --> 00:11:31,900
Amazon S3, HBase, etc.
271
00:11:32,000 --> 00:11:34,600
Now let me show you
how to create RDDs.
272
00:11:34,800 --> 00:11:37,199
I'll open my terminal
and first check
273
00:11:37,199 --> 00:11:39,600
whether my daemons
are running or not.
274
00:11:40,500 --> 00:11:41,300
Cool, here
275
00:11:41,300 --> 00:11:42,757
I can see that Hadoop
276
00:11:42,757 --> 00:11:45,041
and Spark daemons
both are running.
277
00:11:45,041 --> 00:11:47,186
So now, at first, let's start
278
00:11:47,186 --> 00:11:51,200
the Spark shell. It will take
a bit of time to start the shell.
279
00:11:52,500 --> 00:11:52,900
Cool.
280
00:11:52,900 --> 00:11:54,800
Now the Spark shell has started
281
00:11:54,800 --> 00:11:58,329
and I can see the version of
Spark as 2.1.1,
282
00:11:58,329 --> 00:12:00,500
and we have a Scala
shell over here.
283
00:12:00,500 --> 00:12:00,759
Now.
284
00:12:00,759 --> 00:12:02,888
I will tell you
how to create RDDs
285
00:12:02,888 --> 00:12:06,557
in three different ways using
the Scala language. At first,
286
00:12:06,557 --> 00:12:08,450
let's see how to create an RDD
287
00:12:08,450 --> 00:12:12,178
from parallelized collections.
sc.parallelize is the method
288
00:12:12,178 --> 00:12:15,600
that is used to create a parallelized
collection of RDDs,
289
00:12:15,600 --> 00:12:16,733
and this method is
290
00:12:16,733 --> 00:12:20,700
the Spark context's parallelize method
to create a parallelized collection.
291
00:12:20,700 --> 00:12:22,500
So I will give sc.parallelize,
292
00:12:22,500 --> 00:12:26,200
and here I will parallelize
numbers 1 to 100
293
00:12:27,300 --> 00:12:31,371
in five different partitions
and I will apply collect
294
00:12:31,371 --> 00:12:33,500
as an action to start the process.
295
00:12:34,900 --> 00:12:36,592
So here in the result,
296
00:12:36,592 --> 00:12:39,600
you can see an array
of 1 to 100 numbers.
297
00:12:39,600 --> 00:12:40,100
Okay.
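For reference, a minimal sketch of this first method in the Scala shell; the numbers and the partition count follow the demo:

  // create an RDD from a parallelized collection, split into 5 partitions
  val parRDD = sc.parallelize(1 to 100, 5)
  // collect is an action: it triggers the computation and returns the results
  parRDD.collect()   // Array(1, 2, 3, ..., 100)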
298
00:12:40,300 --> 00:12:41,635
Now let me show you
299
00:12:41,635 --> 00:12:45,010
how the partitions appear
in the web UI of Spark.
300
00:12:45,010 --> 00:12:49,300
So the web UI port for Spark is
localhost:4040.
301
00:12:50,700 --> 00:12:53,630
So here you have just
completed one task.
302
00:12:53,630 --> 00:12:55,903
That is the
sc.parallelize collect.
303
00:12:55,903 --> 00:12:56,800
Correct?
304
00:12:56,800 --> 00:13:00,114
You can see all the five stages
that succeeded
305
00:13:00,114 --> 00:13:03,700
because we have divided the task
into five partitions.
306
00:13:03,700 --> 00:13:06,000
So let me show you the partitions.
307
00:13:06,000 --> 00:13:08,100
So this is the DAG
visualization,
308
00:13:08,100 --> 00:13:11,558
that is the directed acyclic
graph visualization wherein
309
00:13:11,558 --> 00:13:14,200
you have applied only
parallelize as a method,
310
00:13:14,200 --> 00:13:16,200
so you can see only
one stage here.
311
00:13:16,800 --> 00:13:20,291
So here you can see the RDD
that has been created,
312
00:13:20,291 --> 00:13:24,032
and coming to the event timeline,
you can see the task
313
00:13:24,032 --> 00:13:27,400
that has been executed
in five different stages
314
00:13:27,400 --> 00:13:29,011
and the different colors imply
315
00:13:29,011 --> 00:13:30,632
the scheduler delay, task
316
00:13:30,632 --> 00:13:34,300
deserialization time, shuffle
read time, shuffle write time,
317
00:13:34,300 --> 00:13:36,612
executor computing
time, etc. Here
318
00:13:36,612 --> 00:13:40,227
you can see the summary metrics
for the created RDD. Here
319
00:13:40,227 --> 00:13:41,000
You can see
320
00:13:41,000 --> 00:13:44,300
that the maximum time it
took to execute the tasks
321
00:13:44,300 --> 00:13:48,400
in five partitions parallelly is
just 45 milliseconds.
322
00:13:49,000 --> 00:13:53,300
You can also see the executor ID,
the host ID, the status
323
00:13:53,300 --> 00:13:56,800
that is succeeded,
duration, launch time, etc.
324
00:13:57,000 --> 00:13:59,255
So this is one way
of creating an RDD
325
00:13:59,255 --> 00:14:01,061
from parallelized collections.
326
00:14:01,061 --> 00:14:02,400
Now, let me show you
327
00:14:02,400 --> 00:14:05,900
how to create an RDD
from an existing RDD. Okay,
328
00:14:06,000 --> 00:14:08,770
here I'll create
an array called a1
329
00:14:08,770 --> 00:14:11,077
and assign numbers one to ten.
330
00:14:11,800 --> 00:14:14,900
One, two, three,
four, five, six, seven.
331
00:14:16,200 --> 00:14:18,900
Okay, so I got the result here.
332
00:14:18,900 --> 00:14:22,300
That is I have created
an integer array of 1 to 10
333
00:14:22,300 --> 00:14:25,200
and now I will parallelize
this array a1.
334
00:14:31,303 --> 00:14:32,996
Sorry, I got an error.
335
00:14:33,300 --> 00:14:37,300
It is sc.parallelize
of a1.
336
00:14:38,200 --> 00:14:42,800
Okay, so I created an RDD
called parallel collection. Cool.
337
00:14:42,800 --> 00:14:46,600
Now I will create a new RDD
from the existing RDD.
338
00:14:46,600 --> 00:14:51,000
That is, val newRDD is equal
339
00:14:51,000 --> 00:14:55,900
to a1.map. To map the data
present in an RDD,
340
00:14:56,061 --> 00:14:59,138
I will create a new RDD
from the existing RDD.
341
00:14:59,200 --> 00:15:01,200
So here I will take a1
342
00:15:01,200 --> 00:15:05,800
as a reference and map
the data and multiply
343
00:15:05,800 --> 00:15:07,300
that data by two.
344
00:15:07,573 --> 00:15:09,726
So what should be our output
345
00:15:10,019 --> 00:15:13,480
if I multiply the data present
in an RDD by two,
346
00:15:13,700 --> 00:15:18,600
so it would be like
2, 4, 6, 8, up to 20, correct?
347
00:15:18,600 --> 00:15:20,400
So, let's see how it works.
348
00:15:20,700 --> 00:15:24,500
Yes, we got the output
that is, multiples of 1 to 10.
349
00:15:24,500 --> 00:15:26,691
That is two, four,
six, eight, up to 20.
350
00:15:26,691 --> 00:15:28,357
So this is one of the methods
351
00:15:28,357 --> 00:15:30,500
of creating a new RDD
from an old RDD.
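Putting that together, a minimal sketch of this second method; a1 is the array from the demo, first parallelized into an RDD:

  val a1 = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
  val parallelCollection = sc.parallelize(a1)
  // map is a transformation: it leaves the old RDD unchanged and returns a new one
  val newRDD = parallelCollection.map(data => data * 2)
  newRDD.collect()   // Array(2, 4, 6, ..., 20)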
352
00:15:30,500 --> 00:15:34,088
And I have one more method that
is from external file sources.
353
00:15:34,088 --> 00:15:37,500
So what I will do here is I
will give val test is equal
354
00:15:37,500 --> 00:15:39,780
to sc.textFile. Here
355
00:15:40,790 --> 00:15:43,800
I will give the path
to the HDFS file location
356
00:15:43,800 --> 00:15:48,900
and link the path, that is,
hdfs://localhost:9000 is the path,
357
00:15:48,900 --> 00:15:50,800
and I have a folder
358
00:15:50,800 --> 00:15:54,600
called example, and in that
I have a file called sample.
359
00:15:57,300 --> 00:16:01,500
Cool, so I got one
more RDD created here.
360
00:16:02,000 --> 00:16:02,281
Now.
361
00:16:02,281 --> 00:16:04,042
Let me show you this file
362
00:16:04,042 --> 00:16:07,000
that I have already kept
in the HDFS directory.
363
00:16:08,100 --> 00:16:09,897
I will browse the file system
364
00:16:09,897 --> 00:16:12,500
and I will show you
the /example directory
365
00:16:12,500 --> 00:16:13,800
that I have created.
366
00:16:14,800 --> 00:16:16,867
So here you can see the example
367
00:16:16,867 --> 00:16:19,800
that I have created as
a directory and here I
368
00:16:19,800 --> 00:16:23,000
have sample as the input file
that I have given.
369
00:16:23,000 --> 00:16:25,800
So here you can see
the same path location.
370
00:16:25,800 --> 00:16:26,300
So this is
371
00:16:26,300 --> 00:16:29,633
how I can create an RDD
from external file sources.
372
00:16:29,633 --> 00:16:30,484
In this case,
373
00:16:30,484 --> 00:16:33,300
I have used HDFS as
an external file source.
374
00:16:33,300 --> 00:16:36,757
So this is how we can create
RDDs in three different ways,
375
00:16:36,757 --> 00:16:39,700
that is, parallelized collections,
from external data sources,
376
00:16:39,700 --> 00:16:41,600
and from existing RDDs.
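And a sketch of the third method; the HDFS URL and file name follow the demo and will differ on your setup:

  // create an RDD from an external data source (here, a file kept in HDFS)
  val test = sc.textFile("hdfs://localhost:9000/example/sample")
  test.collect()   // returns the lines of the file as an array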
377
00:16:41,700 --> 00:16:44,900
So let's move further and see
the various RDD operations.
378
00:16:44,900 --> 00:16:46,500
RDDs actually support
379
00:16:46,500 --> 00:16:50,100
two main operations, namely
transformations and actions.
380
00:16:50,100 --> 00:16:51,419
As I have already said,
381
00:16:51,419 --> 00:16:53,200
RDDs are immutable.
382
00:16:53,200 --> 00:16:54,900
So once you create an RDD,
383
00:16:54,900 --> 00:16:57,500
you cannot change
any content in the RDD,
384
00:16:57,500 --> 00:16:58,913
so you might be wondering
385
00:16:58,913 --> 00:17:01,400
how RDDs apply
those transformations.
386
00:17:01,400 --> 00:17:02,200
Correct?
387
00:17:02,200 --> 00:17:04,299
When you run
any transformations,
388
00:17:04,299 --> 00:17:07,062
it runs those transformations
on an old RDD
389
00:17:07,062 --> 00:17:08,445
and creates a new RDD.
390
00:17:08,445 --> 00:17:11,400
This is basically done
for optimization reasons.
391
00:17:11,400 --> 00:17:13,446
Transformations are
the operations
392
00:17:13,446 --> 00:17:14,500
which are applied
393
00:17:14,500 --> 00:17:18,815
on an RDD to create a new RDD.
Now these transformations work
394
00:17:18,815 --> 00:17:21,221
on the principle
of lazy evaluation.
395
00:17:21,221 --> 00:17:23,075
So what does it mean? It means
396
00:17:23,075 --> 00:17:25,500
that when we call
some operation on an RDD,
397
00:17:25,500 --> 00:17:28,888
it does not execute immediately,
and Spark maintains
398
00:17:28,888 --> 00:17:31,704
the record of the operation
that is being called
399
00:17:31,704 --> 00:17:34,127
Since transformations
are lazy in nature
400
00:17:34,127 --> 00:17:36,052
we can execute the operation
401
00:17:36,052 --> 00:17:38,600
any time by calling
an action on the data.
402
00:17:38,800 --> 00:17:42,200
Hence in lazy evaluation
data is not loaded
403
00:17:42,200 --> 00:17:44,525
until it is necessary. Now, actions
404
00:17:44,525 --> 00:17:46,100
analyze the RDD
405
00:17:46,100 --> 00:17:49,103
and produce results. A
simple action can be count,
406
00:17:49,103 --> 00:17:52,800
which will count the rows in an
RDD and then produce a result,
407
00:17:52,800 --> 00:17:53,583
so I can say
408
00:17:53,583 --> 00:17:57,700
that transformations produce new
RDDs and actions produce results.
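A minimal sketch of that principle: nothing below touches the data until the action on the last line is called.

  val lines = sc.textFile("hdfs://localhost:9000/example/sample")
  // transformation: only recorded by Spark, not executed yet
  val words = lines.flatMap(line => line.split(" "))
  // action: triggers the whole recorded lineage and produces a result
  val total = words.count()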
409
00:17:57,700 --> 00:18:00,058
Before moving further
with the discussion,
410
00:18:00,058 --> 00:18:03,000
let me tell you about
the three different workloads
411
00:18:03,000 --> 00:18:06,500
that Spark caters to. They are
batch mode, interactive mode,
412
00:18:06,500 --> 00:18:09,052
and streaming mode.
In the case of batch mode,
413
00:18:09,052 --> 00:18:10,839
we run a batch job;
that is, you write a job
414
00:18:10,839 --> 00:18:13,427
and then schedule it.
It works through a queue
415
00:18:13,427 --> 00:18:14,703
or a batch of separate
416
00:18:14,703 --> 00:18:17,292
jobs without manual
intervention. Then, in the case
417
00:18:17,292 --> 00:18:18,400
of interactive mode.
418
00:18:18,400 --> 00:18:19,700
It is an interactive shell
419
00:18:19,700 --> 00:18:22,100
where you go and execute
the commands one by one.
420
00:18:22,300 --> 00:18:24,844
So you will execute
one command check the result
421
00:18:24,844 --> 00:18:26,902
and then execute
another command based
422
00:18:26,902 --> 00:18:28,400
on the output result and so
423
00:18:28,400 --> 00:18:30,754
on. It works similar
to the SQL shell,
424
00:18:30,754 --> 00:18:32,100
so the shell is the one
425
00:18:32,100 --> 00:18:35,221
which executes a driver program
and in the shell mode,
426
00:18:35,221 --> 00:18:37,096
you can run it
in the cluster mode.
427
00:18:37,096 --> 00:18:39,449
It is generally used
for development work
428
00:18:39,449 --> 00:18:41,159
or it is used
for ad hoc queries,
429
00:18:41,159 --> 00:18:42,708
then comes the streaming mode
430
00:18:42,708 --> 00:18:44,900
where the program
is continuously running.
431
00:18:44,900 --> 00:18:47,300
As and when the data
comes, it takes the data
432
00:18:47,300 --> 00:18:48,818
and does some transformations
433
00:18:48,818 --> 00:18:51,300
and actions on the data
and gets some results.
434
00:18:51,300 --> 00:18:53,800
So these are the three
different workloads
435
00:18:53,800 --> 00:18:55,600
that Spark caters to now.
436
00:18:55,600 --> 00:18:58,100
Let's see a real-time
use case here.
437
00:18:58,100 --> 00:18:59,600
I'm considering Yahoo!
438
00:18:59,600 --> 00:19:00,600
As an example.
439
00:19:00,600 --> 00:19:02,716
So what are
the problems of Yahoo!
440
00:19:02,716 --> 00:19:03,128
Yahoo!
441
00:19:03,128 --> 00:19:04,062
Properties are
442
00:19:04,062 --> 00:19:06,800
highly personalized
to maximize relevance.
443
00:19:06,800 --> 00:19:09,600
The algorithms used
to provide personalization,
444
00:19:09,600 --> 00:19:11,692
that is, the
targeted advertisement
445
00:19:11,692 --> 00:19:14,800
and personalized content
are highly sophisticated.
446
00:19:14,800 --> 00:19:18,300
And the relevance model
must be updated frequently
447
00:19:18,300 --> 00:19:22,745
because stories, news feeds, and
ads change in time. And Yahoo!
448
00:19:22,745 --> 00:19:24,967
has over 150 petabytes of data
449
00:19:24,967 --> 00:19:28,300
that is stored
on a 35,000-node Hadoop cluster,
450
00:19:28,300 --> 00:19:31,391
which should be accessed
efficiently to avoid latency
451
00:19:31,391 --> 00:19:33,150
caused by the data movement
452
00:19:33,150 --> 00:19:35,300
and to gain insights
from the data
453
00:19:35,300 --> 00:19:37,000
in a cost-effective manner.
454
00:19:37,000 --> 00:19:39,600
So to overcome
these problems, Yahoo!
455
00:19:39,600 --> 00:19:42,171
looked to Spark to
improve the performance
456
00:19:42,171 --> 00:19:44,687
of this iterative
model training. Here,
457
00:19:44,687 --> 00:19:48,700
the machine learning algorithm for
news personalization required
458
00:19:48,700 --> 00:19:51,200
15,000 lines of C++ code
459
00:19:51,300 --> 00:19:55,000
on the other hand the machine
learning algorithm has just
460
00:19:55,000 --> 00:19:57,076
120 lines of Scala code.
461
00:19:57,100 --> 00:19:59,600
So that is
the advantage of Spark,
462
00:19:59,800 --> 00:20:02,600
and this algorithm was ready
for production use
463
00:20:02,600 --> 00:20:06,700
in just 30 minutes of training
on a hundred million datasets
464
00:20:06,700 --> 00:20:08,900
And Spark's rich API is available
465
00:20:08,900 --> 00:20:12,201
in several programming
languages and has resilient
466
00:20:12,201 --> 00:20:14,588
in-memory storage
options and is compatible
467
00:20:14,588 --> 00:20:18,567
with Hadoop through YARN
and the Spark-YARN project.
468
00:20:18,567 --> 00:20:21,400
Yahoo! uses Apache Spark
for personalizing its
469
00:20:21,400 --> 00:20:24,490
news web pages and for
targeted advertising.
470
00:20:24,490 --> 00:20:28,300
Not only that it also uses
machine learning algorithms
471
00:20:28,300 --> 00:20:31,375
that run on Apache Spark
to find out what kind
472
00:20:31,375 --> 00:20:33,700
of news users are
interested in reading,
473
00:20:33,700 --> 00:20:36,714
and also for categorizing
the news stories to find
474
00:20:36,714 --> 00:20:39,290
out what kind of users
would be interested
475
00:20:39,290 --> 00:20:41,300
in reading each category of news,
476
00:20:41,524 --> 00:20:44,524
and Spark runs over Hadoop YARN
to use existing data
477
00:20:44,600 --> 00:20:47,800
and clusters. And
the extensive API of Spark
478
00:20:47,800 --> 00:20:50,605
and machine learning library
eases the development
479
00:20:50,605 --> 00:20:54,276
of machine learning algorithms
and Spark reduces the latency
480
00:20:54,276 --> 00:20:55,400
of model training.
481
00:20:55,400 --> 00:20:56,800
via in-memory RDDs.
482
00:20:56,800 --> 00:21:00,855
So this is how Spark has helped
Yahoo! to improve the performance
483
00:21:00,855 --> 00:21:02,431
and achieve the targets.
484
00:21:02,431 --> 00:21:05,320
So I hope you understood
the concept of Spark
485
00:21:05,320 --> 00:21:06,700
and its fundamentals.
486
00:21:11,500 --> 00:21:14,000
Now, let me just give
you an overview
487
00:21:14,000 --> 00:21:17,600
of the Spark architecture.
Apache Spark has a well-defined
488
00:21:17,600 --> 00:21:18,711
layered architecture
489
00:21:18,711 --> 00:21:22,017
where all the components
and layers are loosely coupled
490
00:21:22,017 --> 00:21:25,200
and integrated with various
extensions and libraries.
491
00:21:25,200 --> 00:21:28,600
This architecture is based
on two main abstractions.
492
00:21:28,600 --> 00:21:31,500
First one, resilient
distributed data sets
493
00:21:31,500 --> 00:21:32,419
that is RDD,
494
00:21:32,419 --> 00:21:36,108
and the next one, directed
acyclic graph, called DAG.
495
00:21:36,108 --> 00:21:40,100
In order to understand
the Spark architecture,
496
00:21:40,100 --> 00:21:43,400
you need to first know
the components of Spark,
497
00:21:43,400 --> 00:21:44,500
that is, the Spark
498
00:21:44,500 --> 00:21:47,700
ecosystem and its fundamental
data structure RDD.
499
00:21:47,700 --> 00:21:51,100
So let's start by understanding
the Spark ecosystem.
500
00:21:51,100 --> 00:21:53,080
As you can see from the diagram,
501
00:21:53,080 --> 00:21:56,300
the Spark ecosystem is composed
of various components
502
00:21:56,300 --> 00:21:57,812
like Spark SQL, Spark
503
00:21:57,812 --> 00:22:01,400
Streaming, the machine learning
library, GraphX, SparkR,
504
00:22:01,400 --> 00:22:05,600
and the Core API component.
Talking about Spark SQL,
505
00:22:05,600 --> 00:22:08,700
it is used to leverage the power
of declarative queries
506
00:22:08,700 --> 00:22:11,827
and optimize storage
by executing SQL queries
507
00:22:11,827 --> 00:22:12,817
on Spark data,
508
00:22:12,817 --> 00:22:14,520
which is present in RDDs
509
00:22:14,520 --> 00:22:18,600
and other external sources.
Next, the Spark Streaming component
510
00:22:18,600 --> 00:22:21,400
allows developers
to perform batch processing
511
00:22:21,400 --> 00:22:22,600
and streaming of the data
512
00:22:22,600 --> 00:22:26,300
in the same application. Coming
to the machine learning library:
513
00:22:26,300 --> 00:22:27,745
It eases the development
514
00:22:27,745 --> 00:22:30,862
and deployment of scalable
machine learning pipelines,
515
00:22:30,862 --> 00:22:33,765
like summary statistics,
cluster analysis methods,
516
00:22:33,765 --> 00:22:36,709
correlations, dimensionality
reduction techniques
517
00:22:36,709 --> 00:22:37,900
feature extractions,
518
00:22:37,900 --> 00:22:40,500
and many more. Now, the
GraphX component
519
00:22:40,500 --> 00:22:42,100
lets the data scientist work
520
00:22:42,100 --> 00:22:44,689
with graph and non-graph
sources to achieve
521
00:22:44,689 --> 00:22:47,400
flexibility and resilience
in graph construction
522
00:22:47,400 --> 00:22:51,000
and transformation. Coming
to SparkR, it is an R package
523
00:22:51,000 --> 00:22:54,818
that provides a lightweight
front end to use Apache Spark.
524
00:22:54,818 --> 00:22:58,000
It provides a distributed
data frame implementation
525
00:22:58,000 --> 00:23:01,994
that supports operations like
selection, filtering, aggregation,
526
00:23:01,994 --> 00:23:03,500
but on large data sets,
527
00:23:03,500 --> 00:23:06,198
it also supports
distributed machine learning
528
00:23:06,198 --> 00:23:08,100
using the machine learning library.
529
00:23:08,157 --> 00:23:10,542
Finally, the Spark Core component.
530
00:23:10,600 --> 00:23:13,600
That is the most vital component
of the Spark ecosystem,
531
00:23:13,600 --> 00:23:14,800
which is responsible
532
00:23:14,800 --> 00:23:17,621
for basic
I/O functions, scheduling,
533
00:23:17,621 --> 00:23:21,517
and monitoring. The entire Spark
ecosystem is built on the top
534
00:23:21,517 --> 00:23:23,456
of this core execution engine,
535
00:23:23,456 --> 00:23:26,600
which has extensible APIs
in different languages
536
00:23:26,600 --> 00:23:29,400
like Scala, Python,
R, and Java. Now,
537
00:23:29,400 --> 00:23:32,200
let me tell you
about the programming languages
538
00:23:32,200 --> 00:23:33,977
At first, Spark supports
539
00:23:33,977 --> 00:23:37,190
Scala. Scala is a functional
programming language
540
00:23:37,190 --> 00:23:38,900
in which Spark is written,
541
00:23:39,092 --> 00:23:42,400
and Spark supports Scala
as an interface. Then
542
00:23:42,400 --> 00:23:44,400
Spark also supports a Python
543
00:23:44,400 --> 00:23:48,012
interface. You can write a program
in Python and execute it
544
00:23:48,012 --> 00:23:49,500
over Spark. Again,
545
00:23:49,500 --> 00:23:52,166
If you see the code
in Scala and Python,
546
00:23:52,166 --> 00:23:56,166
both are very similar. Then
coming to R, it is very famous
547
00:23:56,166 --> 00:23:58,700
for data analysis
and machine learning.
548
00:23:58,700 --> 00:24:01,708
So Spark has also added
the support for R,
549
00:24:01,708 --> 00:24:03,500
and it also supports Java
550
00:24:03,500 --> 00:24:06,280
so you can go ahead
and write the Java code
551
00:24:06,280 --> 00:24:08,200
and execute it over Spark.
552
00:24:08,200 --> 00:24:11,100
Again, Spark also provides
you an interactive shell
553
00:24:11,100 --> 00:24:14,005
for Scala, Python, and R,
where you can go ahead
554
00:24:14,005 --> 00:24:16,230
and execute the commands
one by one.
555
00:24:16,230 --> 00:24:18,700
So this is all about
the Spark ecosystem.
556
00:24:18,700 --> 00:24:19,500
Next.
557
00:24:19,500 --> 00:24:22,600
Let's discuss the fundamental
data structure of Spark,
558
00:24:22,600 --> 00:24:26,400
that is RDD, called
resilient distributed datasets.
559
00:24:26,784 --> 00:24:30,015
So in Spark, anything
you do is around RDDs.
560
00:24:30,200 --> 00:24:33,200
You're reading the data
in Spark? Then it is read
561
00:24:33,200 --> 00:24:34,400
into an RDD. Again,
562
00:24:34,400 --> 00:24:37,200
When you're transforming
the data, then you're performing
563
00:24:37,200 --> 00:24:40,509
transformations on an old RDD
and creating a new one.
564
00:24:40,509 --> 00:24:43,200
Then at the last you
will perform some actions
565
00:24:43,200 --> 00:24:44,643
on the data and store
566
00:24:44,643 --> 00:24:46,288
the dataset present in an RDD
567
00:24:46,288 --> 00:24:49,764
to persistent storage.
A resilient distributed data
568
00:24:49,764 --> 00:24:53,300
set is an immutable distributed
collection of objects.
569
00:24:53,300 --> 00:24:55,200
Your objects can be anything
570
00:24:55,200 --> 00:24:58,910
like strings, lines,
rows, objects, collections, etc.
571
00:24:59,600 --> 00:25:02,704
Now talking about
the distributed environment.
572
00:25:02,704 --> 00:25:06,500
Each dataset in an RDD is divided
into logical partitions,
573
00:25:06,500 --> 00:25:08,709
which may be computed
on different nodes
574
00:25:08,709 --> 00:25:12,062
of the cluster. Due to this, you
can perform transformations
575
00:25:12,062 --> 00:25:14,416
and actions on the
complete data parallelly.
576
00:25:14,416 --> 00:25:17,100
And you don't have to worry
about the distribution
577
00:25:17,100 --> 00:25:18,700
because Spark takes care
578
00:25:18,700 --> 00:25:22,200
of that. Next, as I said,
RDDs are immutable.
579
00:25:22,200 --> 00:25:25,000
So once you create
an RDD, you cannot change
580
00:25:25,000 --> 00:25:26,500
any content in the RDD,
581
00:25:26,500 --> 00:25:28,102
so you might be wondering
582
00:25:28,102 --> 00:25:31,500
how RDDs apply
those transformations, correct?
583
00:25:31,600 --> 00:25:35,845
When you run any transformations,
it runs those transformations
584
00:25:35,845 --> 00:25:38,300
on an old RDD
and creates a new RDD.
585
00:25:38,300 --> 00:25:41,700
This is basically done
for optimization reasons.
586
00:25:41,700 --> 00:25:44,609
So, let me tell you
one thing here: RDDs can
587
00:25:44,609 --> 00:25:46,205
be cached and persisted.
588
00:25:46,205 --> 00:25:49,270
If you want to save an RDD
for future work,
589
00:25:49,270 --> 00:25:50,218
you can cache it,
590
00:25:50,218 --> 00:25:53,000
and it will improve
the Spark performance. An RDD
591
00:25:53,000 --> 00:25:55,589
is a fault-tolerant
collection of elements
592
00:25:55,589 --> 00:25:57,800
that can be operated
on in parallel.
593
00:25:57,800 --> 00:26:00,400
If an RDD is lost,
it will automatically
594
00:26:00,400 --> 00:26:03,400
be recomputed by using
the original Transformations.
595
00:26:03,500 --> 00:26:06,500
This is how Spark
provides fault tolerance.
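A minimal sketch of the caching just described; persist with an explicit storage level is the more general form:

  import org.apache.spark.storage.StorageLevel

  val rdd = sc.textFile("hdfs://localhost:9000/example/sample")
  rdd.cache()                                   // shorthand for persist(MEMORY_ONLY)
  // rdd.persist(StorageLevel.MEMORY_AND_DISK)  // or choose a storage strategy explicitly
  rdd.count()   // the first action computes and caches the RDD
  rdd.count()   // later actions reuse the cached data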
596
00:26:06,500 --> 00:26:10,300
There are two ways to create
RDDs: first one, by parallelizing
597
00:26:10,300 --> 00:26:13,100
an existing collection
in your driver program
598
00:26:13,100 --> 00:26:15,809
and the second one,
by referencing a dataset
599
00:26:15,809 --> 00:26:17,700
in the external storage system
600
00:26:17,700 --> 00:26:21,200
such as a shared file
system, HDFS, HBase, etc.
601
00:26:21,400 --> 00:26:23,852
Now, transformations
are the operations
602
00:26:23,852 --> 00:26:27,300
that you perform on an RDD,
which will create a new RDD.
603
00:26:27,300 --> 00:26:30,346
For example, you
can perform a filter on an RDD
604
00:26:30,346 --> 00:26:31,800
and create a new RDD.
605
00:26:31,800 --> 00:26:34,577
Then there are actions
which analyze the RDD
606
00:26:34,577 --> 00:26:37,717
and produce results. A
simple action can be count,
607
00:26:37,717 --> 00:26:39,900
which will count
the rows in an RDD
608
00:26:39,900 --> 00:26:42,100
and produce a result. So I can say
609
00:26:42,100 --> 00:26:46,200
that transformations produce
new RDDs and actions produce results.
610
00:26:46,200 --> 00:26:47,011
So this is all
611
00:26:47,011 --> 00:26:49,600
about the fundamental
data structure of Spark,
612
00:26:49,600 --> 00:26:51,000
that is RDD now.
613
00:26:51,000 --> 00:26:54,300
Let's dive into the core topic
of today's discussion
614
00:26:54,300 --> 00:26:56,120
that is, the Spark architecture.
615
00:26:56,120 --> 00:26:58,100
So this is
the Spark architecture.
616
00:26:58,100 --> 00:26:59,300
In your master node,
617
00:26:59,300 --> 00:27:02,681
you have the driver program
which drives your application.
618
00:27:02,681 --> 00:27:06,300
So the code that you're writing
behaves as a driver program or
619
00:27:06,300 --> 00:27:08,752
if you are using
the interactive shell, the shell
620
00:27:08,752 --> 00:27:12,017
acts as the driver program.
Inside the driver program,
621
00:27:12,017 --> 00:27:12,900
the first thing
622
00:27:12,900 --> 00:27:16,134
that you do is you create
a Spark context. Assume
623
00:27:16,134 --> 00:27:19,300
that the Spark context
is a gateway to all Spark
624
00:27:19,300 --> 00:27:22,800
functionality. It is similar
to your database connection.
625
00:27:22,800 --> 00:27:25,800
So any command you execute
in a database goes
626
00:27:25,800 --> 00:27:29,600
through the database connection.
Similarly, anything you do
627
00:27:29,600 --> 00:27:32,600
on Spark goes through
the Spark context.
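In the shell this context already exists as sc; in a standalone application you create it yourself. A minimal sketch (the app name and master URL are illustrative):

  import org.apache.spark.{SparkConf, SparkContext}

  // the Spark context is the gateway to all Spark functionality
  val conf = new SparkConf()
    .setAppName("MyApp")      // illustrative application name
    .setMaster("local[*]")    // or the URL of a cluster manager
  val sc = new SparkContext(conf)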
628
00:27:32,700 --> 00:27:34,800
Now, this Spark context works
629
00:27:34,800 --> 00:27:37,652
with the cluster manager
to manage various jobs,
630
00:27:37,652 --> 00:27:38,783
the driver program
631
00:27:38,783 --> 00:27:42,050
and the Spark context take care
of executing the job
632
00:27:42,050 --> 00:27:44,700
across the cluster.
A job is split into tasks
633
00:27:45,161 --> 00:27:46,700
and then these tasks
634
00:27:46,700 --> 00:27:48,500
are distributed over
the worker nodes.
635
00:27:48,500 --> 00:27:50,417
So anytime you create an RDD
636
00:27:50,417 --> 00:27:53,562
in the Spark context,
that RDD can be distributed
637
00:27:53,562 --> 00:27:54,900
across various nodes
638
00:27:54,900 --> 00:27:58,711
and can be cached there. So an RDD
is said to be partitioned
639
00:27:58,711 --> 00:28:02,426
and distributed across various
nodes. Now, worker nodes are
640
00:28:02,426 --> 00:28:06,268
the slave nodes whose job is
to basically execute the tasks.
641
00:28:06,268 --> 00:28:07,895
The task is then executed
642
00:28:07,895 --> 00:28:10,500
on the partitioned RDDs
in the worker nodes
643
00:28:10,500 --> 00:28:14,327
and then returns the result back
to the Spark context.
644
00:28:14,327 --> 00:28:17,892
The Spark context takes the job, breaks
the job into tasks,
645
00:28:17,892 --> 00:28:20,400
and distributes them
on the worker nodes
646
00:28:20,400 --> 00:28:23,900
and these tasks work
on partitioned RDDs, perform
647
00:28:23,900 --> 00:28:26,252
whatever operations you
wanted to perform
648
00:28:26,252 --> 00:28:27,800
and then collect the result
649
00:28:27,800 --> 00:28:30,300
and give it back
to the main Spark context.
650
00:28:30,300 --> 00:28:32,690
If you increase
the number of workers,
651
00:28:32,690 --> 00:28:34,199
then you can divide jobs
652
00:28:34,199 --> 00:28:38,100
into more partitions and execute
them parallelly over multiple systems.
653
00:28:38,100 --> 00:28:40,600
This will actually be
a lot faster.
654
00:28:40,600 --> 00:28:42,900
Also if you increase
the number of workers,
655
00:28:42,900 --> 00:28:44,700
it will also
increase your memory.
656
00:28:44,900 --> 00:28:46,746
And you can cache the jobs
657
00:28:46,746 --> 00:28:49,800
so that they can be executed
much faster.
658
00:28:49,800 --> 00:28:52,231
So this is all
about the Spark architecture.
659
00:28:52,231 --> 00:28:52,491
Now.
660
00:28:52,491 --> 00:28:54,709
Let me give you
an infographic idea
661
00:28:54,709 --> 00:28:56,600
about the Spark architecture.
662
00:28:56,600 --> 00:28:59,397
It follows master-slave
architecture. Here,
663
00:28:59,397 --> 00:29:02,400
The client submits
the Spark user application code.
664
00:29:02,400 --> 00:29:05,189
When an application code
is submitted, the driver
665
00:29:05,189 --> 00:29:07,200
implicitly converts the user code
666
00:29:07,200 --> 00:29:09,000
that contains transformations
667
00:29:09,000 --> 00:29:12,700
and actions into a logical
directed graph called DAG.
668
00:29:12,700 --> 00:29:14,200
At this stage, it also
669
00:29:14,200 --> 00:29:18,172
performs optimizations such as
pipelining transformations.
670
00:29:18,172 --> 00:29:21,165
Then it converts
the logical graph called DAG
671
00:29:21,165 --> 00:29:23,032
into a physical execution plan
672
00:29:23,032 --> 00:29:24,100
with many stages
673
00:29:24,100 --> 00:29:26,972
After converting into
the physical execution plan,
674
00:29:26,972 --> 00:29:30,100
it creates physical
execution units called tasks
675
00:29:30,100 --> 00:29:31,100
under each stage.
676
00:29:31,200 --> 00:29:33,300
Then these tasks are bundled
677
00:29:33,300 --> 00:29:36,300
and sent to the cluster.
Now the driver talks
678
00:29:36,300 --> 00:29:39,523
to the cluster manager
and negotiates for resources,
679
00:29:39,523 --> 00:29:42,727
and the cluster manager launches
the needed executors
680
00:29:42,727 --> 00:29:45,392
At this point, the driver
will also send the tasks
681
00:29:45,392 --> 00:29:47,828
to the executors based
on data placement.
682
00:29:47,828 --> 00:29:51,610
Then executors start to register
themselves with the driver,
683
00:29:51,610 --> 00:29:55,147
so that the driver will have
a complete view of the executors
684
00:29:55,147 --> 00:29:57,815
and executors now start
executing the tasks
685
00:29:57,815 --> 00:30:00,099
that are assigned by
the driver program
686
00:30:00,099 --> 00:30:01,300
At any point of time
687
00:30:01,300 --> 00:30:04,800
when the application is running,
the driver program will monitor
688
00:30:04,800 --> 00:30:06,000
the set of executors
689
00:30:06,000 --> 00:30:07,848
that run, and the driver node
690
00:30:07,848 --> 00:30:11,100
also schedules future tasks
based on data placement.
691
00:30:11,100 --> 00:30:14,600
So this is how the internal
working takes place in the Spark
692
00:30:14,600 --> 00:30:17,400
architecture. There are
three different types
693
00:30:17,400 --> 00:30:18,968
of workloads that Spark
694
00:30:18,968 --> 00:30:22,282
can cater to. First, batch mode:
in the case of batch mode,
695
00:30:22,282 --> 00:30:24,800
we run a batch job. Here
you write the job
696
00:30:24,800 --> 00:30:26,100
and then schedule it.
697
00:30:26,100 --> 00:30:28,989
It works through a queue
or batch of separate jobs
698
00:30:28,989 --> 00:30:31,804
without manual intervention.
Next, interactive mode:
699
00:30:31,804 --> 00:30:33,460
This is an interactive shell
700
00:30:33,460 --> 00:30:36,300
where you go and execute
the commands one by one.
701
00:30:36,300 --> 00:30:39,100
So you'll execute
one command, check the result,
702
00:30:39,100 --> 00:30:41,177
and then execute
the other command based
703
00:30:41,177 --> 00:30:42,700
on the output result and so
704
00:30:42,700 --> 00:30:44,600
on. It works similar to the SQL
705
00:30:44,600 --> 00:30:48,200
shell. So the shell is the one
which executes the driver program.
706
00:30:48,200 --> 00:30:50,833
So it is generally used
for development work
707
00:30:50,833 --> 00:30:53,100
or it is also used
for ad hoc queries,
708
00:30:53,100 --> 00:30:54,670
then comes the streaming mode
709
00:30:54,670 --> 00:30:57,200
where the program
is continuously running as
710
00:30:57,200 --> 00:30:59,400
and when the data
comes, it takes the data
711
00:30:59,500 --> 00:31:02,000
and does some transformations
and actions on the data
712
00:31:02,300 --> 00:31:04,200
and then produces output results.
713
00:31:04,400 --> 00:31:06,900
So these are the three
different types of workloads
714
00:31:06,900 --> 00:31:09,000
that Spark actually caters to. Now,
715
00:31:09,000 --> 00:31:11,866
let's move ahead and see
a simple demo here.
716
00:31:11,866 --> 00:31:14,600
Let's understand how
to create a Spark
717
00:31:14,600 --> 00:31:17,000
application in Spark
shell using Scala.
718
00:31:17,000 --> 00:31:18,266
So let's understand
719
00:31:18,266 --> 00:31:21,400
how to create a Spark
application in Spark shell
720
00:31:21,400 --> 00:31:22,700
using Scala. Assume
721
00:31:22,700 --> 00:31:25,700
that we have a text file
in the HDFS directory
722
00:31:25,700 --> 00:31:28,900
and we are counting the number
of words in that text file.
723
00:31:28,900 --> 00:31:30,421
So, let's see how to do it.
724
00:31:30,421 --> 00:31:32,900
So before I start running,
let me first check
725
00:31:32,900 --> 00:31:34,900
whether all my daemons
are running or not.
726
00:31:35,200 --> 00:31:37,100
So I'll type sudo JPS
727
00:31:37,200 --> 00:31:40,600
So all my Spark daemons
and Hadoop daemons are running;
728
00:31:40,600 --> 00:31:44,353
that is, I have Master/Worker
as Spark daemons, and NameNode,
729
00:31:44,353 --> 00:31:47,400
DataNode, ResourceManager, NodeManager,
everything as Hadoop daemons.
730
00:31:47,400 --> 00:31:48,749
So the first thing
731
00:31:48,749 --> 00:31:51,600
that I do here is
I run the Spark shell,
732
00:31:51,700 --> 00:31:54,700
so it takes a bit of time
to start. In the meanwhile,
733
00:31:54,700 --> 00:31:56,700
let me tell you the web UI port
734
00:31:56,700 --> 00:31:59,623
for Spark shell is
localhost:4040.
735
00:32:00,300 --> 00:32:02,900
So this is the web
UI for Spark. Like,
736
00:32:02,900 --> 00:32:06,400
if you click on jobs right now,
we have not executed anything.
737
00:32:06,400 --> 00:32:08,861
So there are
no details over here.
738
00:32:09,400 --> 00:32:11,900
So there you have job stages.
739
00:32:12,100 --> 00:32:14,200
So once you execute the jobs,
740
00:32:14,200 --> 00:32:16,300
you'll be having
the records of the tasks
741
00:32:16,300 --> 00:32:17,700
that you have executed here.
742
00:32:17,700 --> 00:32:20,400
So here you can see
the stages of various jobs
743
00:32:20,400 --> 00:32:21,706
and tasks executed.
744
00:32:21,706 --> 00:32:22,943
So now let's check
745
00:32:22,943 --> 00:32:25,900
whether our Spark
shell has started or not.
746
00:32:25,900 --> 00:32:26,500
Yes.
747
00:32:26,500 --> 00:32:30,074
So you have your Spark version
as 2.1.1,
748
00:32:30,074 --> 00:32:32,500
and you have a Scala
shell over here.
749
00:32:32,600 --> 00:32:34,300
So before I start the code,
750
00:32:34,300 --> 00:32:36,300
let's check the content
that is present
751
00:32:36,300 --> 00:32:38,600
in the input text file
by running this command.
752
00:32:38,933 --> 00:32:39,933
So I'll write
753
00:32:39,933 --> 00:32:44,000
val test = sc.textFile
754
00:32:44,000 --> 00:32:46,700
because I have saved
a text file over there
755
00:32:46,700 --> 00:32:49,300
and I'll give
the HDFS path location.
756
00:32:50,000 --> 00:32:52,900
I've stored my text file
in this location.
757
00:32:53,300 --> 00:32:55,600
And Sample is the name
of the text file.
758
00:32:55,600 --> 00:32:58,400
So now let me run
test.collect
759
00:32:58,400 --> 00:32:59,834
so that it collects the data
760
00:32:59,834 --> 00:33:02,600
and displays the data that
is present in the text file.
761
00:33:02,600 --> 00:33:04,500
So in my text file,
762
00:33:04,500 --> 00:33:08,500
I have the words hadoop, research, analyst,
data, science and science.
763
00:33:08,500 --> 00:33:10,500
So this is my input data.
764
00:33:10,500 --> 00:33:12,200
So now let me map
765
00:33:12,200 --> 00:33:15,600
the functions and apply
the Transformations and actions.
766
00:33:15,600 --> 00:33:20,000
So I'll write
val map = sc.textFile
767
00:33:20,000 --> 00:33:22,600
and I will specify
768
00:33:22,600 --> 00:33:28,800
my path location. So this
is my input path location,
769
00:33:29,073 --> 00:33:32,226
and I'll apply
the flatMap transformation
770
00:33:32,457 --> 00:33:33,842
to split the data.
771
00:33:36,100 --> 00:33:38,100
The words are separated by a space,
772
00:33:38,900 --> 00:33:44,330
and then map each word
to be given as (word, 1). Now
773
00:33:44,330 --> 00:33:46,100
This would be executed.
774
00:33:46,100 --> 00:33:46,600
Yes.
775
00:33:47,100 --> 00:33:49,000
Now, let me apply the action
776
00:33:49,000 --> 00:33:52,000
for this to start
the execution of the task.
777
00:33:52,900 --> 00:33:56,100
So let me tell you one thing
here before applying an action.
778
00:33:56,100 --> 00:33:58,600
Spark will not start
the execution process.
779
00:33:58,600 --> 00:34:00,600
So here I have applied
reduceByKey
780
00:34:00,600 --> 00:34:02,800
to start
counting the number
781
00:34:02,800 --> 00:34:04,100
of words in the text file.
782
00:34:04,500 --> 00:34:07,100
So now we are done
with applying Transformations
783
00:34:07,100 --> 00:34:08,300
and actions as well.
784
00:34:08,300 --> 00:34:09,774
So now the next step is
785
00:34:09,774 --> 00:34:13,300
to specify the output location
to store the output file.
786
00:34:13,300 --> 00:34:16,400
So I will write
counts.saveAsTextFile
787
00:34:16,400 --> 00:34:19,500
and then specify
the location for the output file.
788
00:34:19,500 --> 00:34:21,398
I'll store it
in the same location
789
00:34:21,398 --> 00:34:23,000
where I have my input file.
790
00:34:23,700 --> 00:34:28,400
Now I'll specify my output
file name as output9. Cool.
791
00:34:29,000 --> 00:34:31,200
I forgot to give
the double quotes.
792
00:34:31,800 --> 00:34:33,200
And I will run this.
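For reference, the whole flow we just typed can be read as one small Scala program. A minimal sketch is below; the HDFS paths and the file names sample and output9 are illustrative stand-ins for the exact ones used on screen.

```scala
// Word count in the Spark shell (Scala) - a sketch under assumed paths
val map = sc.textFile("hdfs://localhost:9000/example/sample")
            .flatMap(line => line.split(" "))   // split each line on spaces
            .map(word => (word, 1))             // pair every word with a 1
val counts = map.reduceByKey(_ + _)             // sum the counts per word
counts.saveAsTextFile("hdfs://localhost:9000/example/output9")
```

Nothing runs until saveAsTextFile is called, which is why the earlier lines returned instantly.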
793
00:34:36,603 --> 00:34:38,296
So it's completed now.
794
00:34:38,473 --> 00:34:40,626
So now let's see the output.
795
00:34:41,000 --> 00:34:42,900
I will open my Hadoop web UI
796
00:34:42,900 --> 00:34:45,750
by giving localhost:50070
797
00:34:45,750 --> 00:34:48,600
and browse the file system
to check the output.
798
00:34:48,900 --> 00:34:50,284
So as I have said,
799
00:34:50,284 --> 00:34:54,000
I have example as my directory
that I have created
800
00:34:54,000 --> 00:34:57,600
and in that I have specified
output9 as my output.
801
00:34:57,600 --> 00:35:00,300
So I have two part
files created.
802
00:35:00,300 --> 00:35:02,600
Let's check each
of them one by one.
803
00:35:04,800 --> 00:35:06,512
So we have the data count
804
00:35:06,512 --> 00:35:09,116
as one, analyst count
as one and science
805
00:35:09,116 --> 00:35:12,200
count as two. So this is
the first part file. Now,
806
00:35:12,200 --> 00:35:14,200
Let me open the second
part file for you.
807
00:35:18,500 --> 00:35:20,800
So this is the second
part file. There you
808
00:35:20,800 --> 00:35:23,800
have Hadoop count as one
and the research count as one.
809
00:35:24,500 --> 00:35:26,558
So now let me show
you the text file
810
00:35:26,558 --> 00:35:28,600
that we have specified
as the input.
811
00:35:30,200 --> 00:35:31,363
So as I have told
812
00:35:31,363 --> 00:35:34,076
you, Hadoop count as
one, research count as
813
00:35:34,076 --> 00:35:37,400
one, analyst one, data one, science
and science as one and one. So
814
00:35:37,400 --> 00:35:39,600
you might be thinking
data science is one word;
815
00:35:39,600 --> 00:35:40,969
no, in the program code
816
00:35:40,969 --> 00:35:44,600
we have asked to count the words
that are separated by a space.
817
00:35:44,600 --> 00:35:47,600
So that is why we have
science count as two.
818
00:35:47,600 --> 00:35:51,100
I hope you got an idea
about how word count works.
819
00:35:51,515 --> 00:35:54,900
Similarly, I will now
parallelize the numbers 1 to 100
820
00:35:54,900 --> 00:35:56,200
and divide the tasks
821
00:35:56,200 --> 00:36:00,100
into five partitions to show
you what partitioning of tasks is.
822
00:36:00,100 --> 00:36:04,400
So I will write sc.parallelize
with the numbers 1 to 100
823
00:36:04,403 --> 00:36:07,096
and divide them
into five partitions
824
00:36:07,115 --> 00:36:10,900
and apply collect action
to collect the numbers
825
00:36:10,900 --> 00:36:12,700
and start the execution.
826
00:36:12,784 --> 00:36:16,015
So it displays you
an array of 100 numbers.
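For reference, a minimal sketch of that parallelize call; the variable name is illustrative.

```scala
// Distribute the numbers 1 to 100 across five partitions
val nums = sc.parallelize(1 to 100, 5)
println(nums.partitions.length)   // 5
nums.collect()                    // action: returns the array of 100 numbers
```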
827
00:36:16,300 --> 00:36:20,900
Now, let me explain the job
stages, partitions, event timeline,
828
00:36:20,900 --> 00:36:23,100
DAG representation
and everything.
829
00:36:23,100 --> 00:36:26,023
So now let me go
to the web UI of spark
830
00:36:26,023 --> 00:36:27,437
and click on jobs.
831
00:36:27,601 --> 00:36:29,294
So these are the tasks
832
00:36:29,294 --> 00:36:33,217
that I have submitted. So
coming to the word count example,
833
00:36:33,700 --> 00:36:36,300
this is the
DAG visualization.
834
00:36:36,300 --> 00:36:38,700
I hope you can see
it clearly. First
835
00:36:38,700 --> 00:36:40,401
you collected the text file,
836
00:36:40,401 --> 00:36:42,709
then you applied
the flatMap transformation
837
00:36:42,709 --> 00:36:45,139
and mapped it to count
the number of words
838
00:36:45,139 --> 00:36:47,333
and then applied
reduceByKey
839
00:36:47,333 --> 00:36:49,100
and then saved the output file
840
00:36:49,100 --> 00:36:50,500
with saveAsTextFile.
841
00:36:50,500 --> 00:36:52,900
So this is the entire
DAG visualization
842
00:36:52,900 --> 00:36:54,000
of the number of steps
843
00:36:54,000 --> 00:36:56,000
that we have covered
in our program.
844
00:36:56,000 --> 00:36:58,271
So here it shows
the completed stages
845
00:36:58,271 --> 00:37:01,900
that is two stages
and it also shows the duration
846
00:37:01,900 --> 00:37:03,284
that is 2 seconds.
847
00:37:03,400 --> 00:37:05,800
And if you click
on the event timeline,
848
00:37:05,800 --> 00:37:08,482
it just shows the executor
that is added.
849
00:37:08,482 --> 00:37:11,500
And in this case you
cannot see any partitions
850
00:37:11,500 --> 00:37:15,300
because you have not split the
jobs into various partitions.
851
00:37:15,500 --> 00:37:19,200
So this is how you can see
the event timeline and the DAG
852
00:37:19,200 --> 00:37:21,700
visualization. Here
you can also see
853
00:37:21,700 --> 00:37:24,759
the stage ID descriptions
when you have submitted
854
00:37:24,759 --> 00:37:26,800
that I have just
submitted it now
855
00:37:26,800 --> 00:37:29,294
and in this it also
shows the duration
856
00:37:29,294 --> 00:37:32,800
that it took to execute the task,
the output bytes,
857
00:37:32,800 --> 00:37:35,500
the shuffle
read, shuffle write
858
00:37:35,500 --> 00:37:39,100
and many more. Now, to show
you the partitions: see,
859
00:37:39,100 --> 00:37:42,500
in this you just applied
sc.parallelize, right?
860
00:37:42,500 --> 00:37:45,151
So it is just showing
one stage where you
861
00:37:45,151 --> 00:37:48,400
have applied the parallelize
transformation. Here
862
00:37:48,400 --> 00:37:51,300
it shows the succeeded
tasks as five by five.
863
00:37:51,300 --> 00:37:54,700
That is, you have divided
the job into five tasks
864
00:37:54,700 --> 00:37:58,762
and all the five tasks have been
executed successfully. Now here
865
00:37:58,762 --> 00:38:02,300
you can see the partitions
of the five different stages
866
00:38:02,300 --> 00:38:04,112
that are executed in parallel.
867
00:38:04,112 --> 00:38:05,800
So depending on the colors,
868
00:38:05,800 --> 00:38:07,500
it shows the scheduler delay
869
00:38:07,500 --> 00:38:10,500
the shuffle read time,
executor computing time, result
870
00:38:10,500 --> 00:38:11,500
serialization time,
871
00:38:11,500 --> 00:38:13,921
and getting result time
and many more
872
00:38:13,921 --> 00:38:15,836
So you can see the duration
873
00:38:15,836 --> 00:38:19,252
that it took to execute
the five tasks in parallel
874
00:38:19,252 --> 00:38:21,263
at the same time is a maximum
875
00:38:21,263 --> 00:38:22,700
of one millisecond.
876
00:38:22,700 --> 00:38:26,200
So in memory, Spark has
much faster computation,
877
00:38:26,200 --> 00:38:27,810
and you can see the IDs
878
00:38:27,810 --> 00:38:31,100
of all the five different
tasks; all are a success.
879
00:38:31,100 --> 00:38:33,166
You can see the locality level.
880
00:38:33,166 --> 00:38:37,033
You can see the executor and
the host IP ID the launch time
881
00:38:37,033 --> 00:38:39,100
the duration it took, everything,
882
00:38:39,200 --> 00:38:40,631
so you can also see
883
00:38:40,631 --> 00:38:44,978
that we have created our RDD
and parallelized it. Similarly, here
884
00:38:44,978 --> 00:38:47,000
also for word count example,
885
00:38:47,000 --> 00:38:48,306
you can see the rdd
886
00:38:48,306 --> 00:38:51,324
that has been created
and also the Actions
887
00:38:51,324 --> 00:38:53,800
that have applied
to execute the task
888
00:38:54,000 --> 00:38:57,401
and you can see the duration
that it took. Even here also,
889
00:38:57,401 --> 00:38:58,980
it's just one millisecond
890
00:38:58,980 --> 00:39:02,200
that it took to execute
the entire word count example,
891
00:39:02,200 --> 00:39:05,900
and you can see the IDs,
locality level, executor ID.
892
00:39:05,900 --> 00:39:06,916
So in this case,
893
00:39:06,916 --> 00:39:09,712
we have just executed
the task in two stages.
894
00:39:09,712 --> 00:39:11,900
So it is just showing
the two stages.
895
00:39:11,900 --> 00:39:13,100
So this is all about
896
00:39:13,100 --> 00:39:16,266
how the web UI looks and what are
the features and information
897
00:39:16,266 --> 00:39:18,435
that you can see
in the web UI of spark
898
00:39:18,435 --> 00:39:21,200
after executing the program
in the Scala shell.
899
00:39:21,200 --> 00:39:22,271
So in this program,
900
00:39:22,271 --> 00:39:25,635
you can see that we first gave
the path to the input location
901
00:39:25,635 --> 00:39:26,700
and check the data
902
00:39:26,700 --> 00:39:29,063
that is present
in the input file.
903
00:39:29,063 --> 00:39:31,900
And then we applied
flatMap transformations
904
00:39:31,900 --> 00:39:33,100
and created rdd
905
00:39:33,100 --> 00:39:36,800
and then applied action to start
the execution of the task
906
00:39:36,800 --> 00:39:39,500
and save the output file
in this location.
907
00:39:39,500 --> 00:39:41,643
So I hope you got a clear idea
908
00:39:41,643 --> 00:39:45,054
of how to execute a word
count example and check
909
00:39:45,054 --> 00:39:46,861
for the various features
910
00:39:46,861 --> 00:39:50,700
in the Spark web UI, like
partitions and DAG visualizations,
911
00:39:50,700 --> 00:39:59,900
and I hope you found the session
interesting. Apache Spark:
912
00:40:00,000 --> 00:40:03,900
this word can generate a spark
in every Hadoop engineer's mind.
913
00:40:03,900 --> 00:40:06,188
It is a big data
processing framework,
914
00:40:06,188 --> 00:40:08,805
which is lightning fast
at cluster computing.
915
00:40:08,805 --> 00:40:12,300
And the core reason behind
its outstanding performance is
916
00:40:12,300 --> 00:40:15,500
the resilient distributed
data set, or in short,
917
00:40:15,500 --> 00:40:17,779
the RDD, and today I'll focus
918
00:40:17,779 --> 00:40:20,200
on the topic called
RDDs using Spark.
919
00:40:20,200 --> 00:40:21,723
Before we get started,
920
00:40:21,723 --> 00:40:23,900
let's have a quick look
at the agenda
921
00:40:23,900 --> 00:40:24,900
for today's session.
922
00:40:25,100 --> 00:40:28,213
We shall start with
understanding the need for rdds
923
00:40:28,213 --> 00:40:29,272
where we'll learn
924
00:40:29,272 --> 00:40:32,200
the reasons behind which
the rdds were required.
925
00:40:32,200 --> 00:40:34,700
Then we shall learn
what RDDs are,
926
00:40:34,700 --> 00:40:37,871
where we'll understand
what exactly an RDD is
927
00:40:37,871 --> 00:40:39,800
and how do they work later?
928
00:40:39,800 --> 00:40:42,400
I'll walk you through
the fascinating features
929
00:40:42,400 --> 00:40:46,300
of RDDs, such as in-memory
computation, partitioning,
930
00:40:46,374 --> 00:40:48,475
persistence, fault tolerance
931
00:40:48,475 --> 00:40:49,475
and many more
932
00:40:49,600 --> 00:40:51,200
Once I finish the theory,
933
00:40:51,300 --> 00:40:53,200
I'll get your hands on RDDs,
934
00:40:53,200 --> 00:40:55,100
where we'll practically create
935
00:40:55,100 --> 00:40:58,141
and perform all possible
operations on RDDs,
936
00:40:58,141 --> 00:40:59,500
and finally I'll wind
937
00:40:59,500 --> 00:41:02,677
up this session with
an interesting Pokémon use case,
938
00:41:02,677 --> 00:41:06,100
which will help you understand
rdds in a much better way.
939
00:41:06,100 --> 00:41:08,100
Let's get started. Spark is one
940
00:41:08,100 --> 00:41:10,792
of the top mandatory skills
required by each
941
00:41:10,792 --> 00:41:12,518
and every Big Data developer.
942
00:41:12,518 --> 00:41:14,687
It is used
in multiple applications,
943
00:41:14,687 --> 00:41:17,800
which need real-time processing
such as Google's
944
00:41:17,800 --> 00:41:21,066
recommendation engine, credit
card fraud detection,
945
00:41:21,066 --> 00:41:23,713
and many more. To understand
this in depth,
946
00:41:23,713 --> 00:41:27,200
we shall consider Amazon's
recommendation engine. Assume
947
00:41:27,200 --> 00:41:29,500
that you are searching
for a mobile phone
948
00:41:29,500 --> 00:41:33,126
on Amazon and you have certain
specifications of your choice.
949
00:41:33,126 --> 00:41:36,742
Then the Amazon search engine
understands your requirements
950
00:41:36,742 --> 00:41:38,450
and provides you the products
951
00:41:38,450 --> 00:41:41,155
which match the specifications
of your choice.
952
00:41:41,155 --> 00:41:43,800
All this is made possible
because of the most
953
00:41:43,800 --> 00:41:46,717
powerful tool existing
in Big Data environment,
954
00:41:46,717 --> 00:41:49,000
which is none other
than Apache spark
955
00:41:49,000 --> 00:41:51,000
and the resilient distributed data set
956
00:41:51,000 --> 00:41:53,946
is considered to be
the heart of Apache Spark.
957
00:41:53,946 --> 00:41:56,735
So with this let's begin
our first question.
958
00:41:56,735 --> 00:41:58,300
why do we need RDDs?
959
00:41:58,300 --> 00:42:01,410
Well, the current world
is expanding its technology,
960
00:42:01,410 --> 00:42:02,903
and artificial intelligence
961
00:42:02,903 --> 00:42:06,891
is the face of this evolution.
The machine learning algorithms
962
00:42:06,891 --> 00:42:09,300
and the data needed
to train these computers
963
00:42:09,300 --> 00:42:10,453
are huge. The logic
964
00:42:10,453 --> 00:42:13,378
behind all these algorithms
is very complicated
965
00:42:13,378 --> 00:42:17,300
and mostly runs in a distributed
and iterative computation method.
966
00:42:17,300 --> 00:42:19,800
The machine learning
algorithms could not use
967
00:42:19,800 --> 00:42:21,053
the older MapReduce
968
00:42:21,053 --> 00:42:24,500
programs, because the traditional
MapReduce programs needed
969
00:42:24,500 --> 00:42:26,733
a stable state HDFS, and we know
970
00:42:26,733 --> 00:42:31,200
that HDFS generates redundancy
during intermediate computations,
971
00:42:31,200 --> 00:42:34,800
which resulted in a major
latency in data processing
972
00:42:34,800 --> 00:42:36,900
and in HDFS, gathering data
973
00:42:36,900 --> 00:42:39,400
for multiple processing units
at a single instance
974
00:42:39,400 --> 00:42:42,752
was time consuming. Along
with this, the major issue
975
00:42:42,752 --> 00:42:46,600
was that HDFS did not have
random read and write ability.
976
00:42:46,600 --> 00:42:49,000
So using these old
MapReduce programs
977
00:42:49,000 --> 00:42:52,000
for machine learning
problems would be difficult. Then
978
00:42:52,000 --> 00:42:53,700
Spark was introduced.
979
00:42:53,700 --> 00:42:55,318
Compared to MapReduce, Spark
980
00:42:55,318 --> 00:42:58,435
is an advanced big data
processing framework. Resilient
981
00:42:58,435 --> 00:42:59,503
distributed data set,
982
00:42:59,503 --> 00:43:02,423
which is a fundamental
and most crucial data structure
983
00:43:02,423 --> 00:43:03,600
of Spark, was the one
which made it all possible. RDDs
00:43:03,600 --> 00:43:06,900
which made it all possible rdds
are effortless to create
985
00:43:06,900 --> 00:43:09,205
and the mind-blowing
property which solved
986
00:43:09,205 --> 00:43:12,500
the problem was its in-memory
data processing capability.
987
00:43:12,500 --> 00:43:15,600
An RDD is not a distributed
file system; instead,
988
00:43:15,600 --> 00:43:17,894
it is a distributed
collection of memory
989
00:43:17,894 --> 00:43:19,905
where the data needed
is always stored
990
00:43:19,905 --> 00:43:21,057
and kept available
991
00:43:21,057 --> 00:43:24,269
in RAM, and because of
this property, the elevation it
992
00:43:24,269 --> 00:43:27,300
gave to the memory
access speed was unbelievable.
993
00:43:27,300 --> 00:43:29,250
RDDs are fault tolerant,
994
00:43:29,250 --> 00:43:32,900
and this property brought them
a dignity of a whole new level.
995
00:43:32,900 --> 00:43:35,074
So our next question would be
996
00:43:35,074 --> 00:43:38,522
what are RDDs? The resilient
distributed data sets,
997
00:43:38,522 --> 00:43:39,600
or the RDDs, are
998
00:43:39,600 --> 00:43:42,600
the primary underlying
data structures of Spark.
999
00:43:42,600 --> 00:43:44,311
They are highly fault tolerant
1000
00:43:44,311 --> 00:43:46,900
and they store data
amongst multiple computers
1001
00:43:46,900 --> 00:43:51,000
in a network. The data is written
into multiple executor nodes,
1002
00:43:51,000 --> 00:43:54,800
so that in case of a calamity,
if any executing node fails,
1003
00:43:54,800 --> 00:43:57,459
then within a fraction
of a second it gets backed up
1004
00:43:57,459 --> 00:43:59,100
from the next executor node
1005
00:43:59,100 --> 00:44:02,200
with the same processing speeds
of the current node,
1006
00:44:02,300 --> 00:44:04,900
The fault-tolerance property
enables them to roll back
1007
00:44:04,900 --> 00:44:06,876
their data to the original state
1008
00:44:06,876 --> 00:44:09,038
by applying simple
transformations on
1009
00:44:09,038 --> 00:44:11,225
to the lost part
in the lineage.
1010
00:44:11,225 --> 00:44:13,696
RDDs do not need
anything called a hard disk
1011
00:44:13,696 --> 00:44:15,489
or any other secondary storage
1012
00:44:15,489 --> 00:44:17,700
All that they need
is the main memory,
1013
00:44:17,700 --> 00:44:18,700
which is RAM. Now
1014
00:44:18,700 --> 00:44:21,100
that we have understood
the need for RDDs
1015
00:44:21,100 --> 00:44:22,482
and what exactly
1016
00:44:22,482 --> 00:44:25,204
an RDD is, let us see
the different sources
1017
00:44:25,204 --> 00:44:28,223
from which the data
can be ingested into an rdd.
1018
00:44:28,223 --> 00:44:30,600
The data can be loaded
from any Source
1019
00:44:30,600 --> 00:44:33,700
like HDFS, HBase, Hive, SQL,
1020
00:44:33,700 --> 00:44:34,658
you name it,
1021
00:44:34,658 --> 00:44:35,582
they've got it.
1022
00:44:35,700 --> 00:44:36,200
Hence,
1023
00:44:36,200 --> 00:44:39,000
the collected data
is dropped into an RDD.
1024
00:44:39,000 --> 00:44:42,000
And guess what, the RDDs
are free-spirited: they
1025
00:44:42,000 --> 00:44:44,051
can process any type of data.
1026
00:44:44,051 --> 00:44:47,800
They won't care if the data
is structured, unstructured
1027
00:44:47,800 --> 00:44:49,500
or semi-structured. Now,
1028
00:44:49,500 --> 00:44:51,200
let me walk you
through the features
1029
00:44:51,200 --> 00:44:52,300
of RDDs,
1030
00:44:52,300 --> 00:44:54,700
which give them an edge
over the other alternatives.
1031
00:44:54,900 --> 00:44:57,100
In-memory computation: the idea
1032
00:44:57,100 --> 00:45:00,632
of in-memory computation brought
groundbreaking progress
1033
00:45:00,632 --> 00:45:03,800
in cluster computing. It
increased the processing speed
1034
00:45:03,800 --> 00:45:07,877
when compared with HDFS.
Moving on to lazy evaluation:
1035
00:45:07,877 --> 00:45:08,827
the phrase lazy
1036
00:45:08,827 --> 00:45:09,527
explains it
1037
00:45:09,527 --> 00:45:12,564
all. Spark logs all
the transformations you apply
1038
00:45:12,564 --> 00:45:16,056
onto it and will not throw
any output onto the display
1039
00:45:16,056 --> 00:45:17,900
until an action is provoked.
1040
00:45:17,900 --> 00:45:22,200
Next is fault tolerance. RDDs
are absolutely fault-tolerant.
1041
00:45:22,200 --> 00:45:26,008
Any lost partition of an RDD
can be rolled back by applying
1042
00:45:26,008 --> 00:45:28,700
simple transformations on
to the lost part
1043
00:45:28,700 --> 00:45:30,286
in the lineage. Speaking
1044
00:45:30,286 --> 00:45:34,700
about immutability, the data once
dropped into an RDD is immutable,
1045
00:45:34,700 --> 00:45:38,016
because the access provided
by an RDD is just read-
1046
00:45:38,016 --> 00:45:39,920
only. The only way to access
1047
00:45:39,920 --> 00:45:43,800
or modify it is by applying
a transformation on to an RDD
1048
00:45:43,800 --> 00:45:45,400
which is prior
to the present one.
1049
00:45:45,400 --> 00:45:47,200
Discussing partitioning:
1050
00:45:47,200 --> 00:45:48,923
the important reason for Spark's
1051
00:45:48,923 --> 00:45:51,100
parallel processing is
its partitioning.
1052
00:45:51,300 --> 00:45:54,163
By default, Spark determines
the number of partitions
1053
00:45:54,163 --> 00:45:56,200
into which your data is divided,
1054
00:45:56,200 --> 00:45:59,652
but you can override this
and decide the number of blocks
1055
00:45:59,652 --> 00:46:01,200
you want to split your data into.
1056
00:46:01,200 --> 00:46:03,193
Let's see what persistence is.
1057
00:46:03,193 --> 00:46:05,600
Spark's RDDs are
totally reusable.
1058
00:46:05,600 --> 00:46:06,757
The users can apply
1059
00:46:06,757 --> 00:46:09,502
a certain number of
transformations on to an RDD
1060
00:46:09,502 --> 00:46:11,302
and preserve the final RDD
1061
00:46:11,302 --> 00:46:14,383
for future use. This avoids
all the hectic process
1062
00:46:14,383 --> 00:46:17,369
of applying all
the transformations from scratch.
1063
00:46:17,369 --> 00:46:20,867
And now, last but not the least:
coarse-grained operations.
1064
00:46:20,867 --> 00:46:24,300
The operations performed
on RDDs using transformations
1065
00:46:24,300 --> 00:46:28,069
like map, filter, flatMap,
etc. change the RDDs
1066
00:46:28,069 --> 00:46:29,300
and update them.
1067
00:46:29,300 --> 00:46:29,686
Hence,
1068
00:46:29,686 --> 00:46:33,100
every operation applied
onto an RDD is coarse-grained.
1069
00:46:33,100 --> 00:46:36,800
These are the features of RDDs,
and moving on to the next stage,
1070
00:46:36,800 --> 00:46:37,800
we shall understand
1071
00:46:37,800 --> 00:46:39,700
the creation of RDDs.
1072
00:46:39,700 --> 00:46:42,500
RDDs can be created
using three methods.
1073
00:46:42,500 --> 00:46:46,000
The first method is using
parallelized collections.
1074
00:46:46,000 --> 00:46:50,400
The next method is by using external
storage like HDFS, HBase,
1075
00:46:50,400 --> 00:46:51,100
Hive
1076
00:46:51,100 --> 00:46:54,700
and many more. The third one
is using an existing RDD,
1077
00:46:54,700 --> 00:46:56,800
which is prior
to the present one.
1078
00:46:56,800 --> 00:46:58,800
Now, let us understand
1079
00:46:58,800 --> 00:47:02,300
and create an RDD
through each method. Now,
1080
00:47:02,300 --> 00:47:05,600
Spark can be run on virtual
machines like the Spark VM,
1081
00:47:05,600 --> 00:47:08,300
or you can install
a Linux operating system
1082
00:47:08,300 --> 00:47:10,774
like Ubuntu and
run it standalone,
1083
00:47:10,774 --> 00:47:14,600
but we here at Edureka use
the best-in-class CloudLab,
1084
00:47:14,600 --> 00:47:16,900
which comprises
all the frameworks
1085
00:47:16,900 --> 00:47:19,400
you need, a single-stop
cloud framework.
1086
00:47:19,400 --> 00:47:20,776
No need of any hectic
1087
00:47:20,776 --> 00:47:22,323
process of downloading any file
1088
00:47:22,323 --> 00:47:24,632
or setting up
environment variables
1089
00:47:24,632 --> 00:47:27,289
and looking for
hardware specifications, etc.
1090
00:47:27,289 --> 00:47:28,890
All you need is a login ID
1091
00:47:28,890 --> 00:47:32,091
and password to the all-in-one
ready to use cloud lab
1092
00:47:32,091 --> 00:47:34,800
where you can run
and save all your programs.
1093
00:47:35,400 --> 00:47:39,600
Let us fire up our Spark shell
using the command spark2-shell.
1094
00:47:39,600 --> 00:47:42,446
Now, as the Spark shell
has been fired up,
1095
00:47:42,446 --> 00:47:44,215
let's create a new RDD.
1096
00:47:44,800 --> 00:47:48,400
So here we are creating
a new RDD with the first method,
1097
00:47:48,400 --> 00:47:51,500
which is using
parallelized collections. Here,
1098
00:47:51,500 --> 00:47:52,954
we are creating a new RDD
1099
00:47:52,954 --> 00:47:55,800
by the name parallelized
collections RDD.
1100
00:47:55,800 --> 00:47:57,705
We are starting from the Spark context
1101
00:47:57,705 --> 00:48:00,321
and we are parallelizing
an array into the RDD,
1102
00:48:00,321 --> 00:48:03,300
which consists of the data
of the days of a week,
1103
00:48:03,300 --> 00:48:04,875
which is Monday, Tuesday,
1104
00:48:04,875 --> 00:48:07,500
Wednesday, Thursday,
Friday and Saturday.
1105
00:48:07,500 --> 00:48:10,600
Now, let's create
this, our new RDD.
1106
00:48:10,600 --> 00:48:13,841
The parallelized collections RDD
is successfully created. Now,
1107
00:48:13,841 --> 00:48:16,900
let's display the data
which is present in our RDD.
1108
00:48:19,400 --> 00:48:23,630
So this was the data
which is present in our RDD.
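A minimal sketch of this first method, with an illustrative variable name:

```scala
// Method 1: an RDD from a parallelized collection of the days of the week
val parallelizedCollectionsRdd = sc.parallelize(
  Array("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
parallelizedCollectionsRdd.collect().foreach(println)
```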
1109
00:48:23,630 --> 00:48:27,038
Now, let's create a new RDD
using the second method.
1110
00:48:28,200 --> 00:48:30,892
The second method
of creating an RDD
1111
00:48:30,892 --> 00:48:35,400
is using external storage
such as HDFS, Hive, SQL
1112
00:48:35,600 --> 00:48:37,100
and many more. Here,
1113
00:48:37,100 --> 00:48:40,200
I'm creating a new RDD
by the name spark file,
1114
00:48:40,200 --> 00:48:43,312
where I'll be loading
a text document into the RDD
1115
00:48:43,312 --> 00:48:44,900
from an external storage,
1116
00:48:44,900 --> 00:48:45,900
which is HDFS.
1117
00:48:45,900 --> 00:48:49,700
And this is the location
where my text file is located.
1118
00:48:49,800 --> 00:48:53,600
So the new RDD spark file
is successfully created. Now,
1119
00:48:53,600 --> 00:48:55,054
let's display the data
1120
00:48:55,054 --> 00:48:57,500
which is present
in the spark file RDD.
1121
00:48:58,700 --> 00:48:59,620
This is the data
1122
00:48:59,620 --> 00:49:02,241
which is present in
the spark file RDD:
1123
00:49:02,241 --> 00:49:05,500
a collection of alphabets
starting from A to Z.
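A minimal sketch of this second method; the HDFS path and file name are illustrative:

```scala
// Method 2: an RDD loaded from external storage (HDFS here)
val sparkFile = sc.textFile("hdfs://localhost:9000/example/alphabets.txt")
sparkFile.collect().foreach(println)   // prints the letters a to z
```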
1124
00:49:05,500 --> 00:49:05,900
Now,
1125
00:49:05,900 --> 00:49:08,851
let's create a new RDD
using the third method,
1126
00:49:08,851 --> 00:49:10,946
which is using
an existing RDD,
1127
00:49:10,946 --> 00:49:14,201
which is prior to the present
one. In the third method,
1128
00:49:14,201 --> 00:49:16,900
I'm creating a new RDD
by the name words, and
1129
00:49:16,900 --> 00:49:18,700
I'm using the Spark context
1130
00:49:18,700 --> 00:49:21,803
and parallelizing a statement
into the RDD words,
1131
00:49:21,803 --> 00:49:24,700
which is: Spark is
a very powerful language.
1132
00:49:24,800 --> 00:49:26,517
So this is
a collection of words,
1133
00:49:26,517 --> 00:49:28,400
which I have passed
into the new
1134
00:49:28,400 --> 00:49:29,400
RDD words.
1135
00:49:29,400 --> 00:49:29,900
Now,
1136
00:49:29,900 --> 00:49:31,700
let us apply a transformation
1137
00:49:31,700 --> 00:49:34,800
on to the RDD and create
a new RDD through that.
1138
00:49:35,100 --> 00:49:37,656
So here I'm applying
map transformation
1139
00:49:37,656 --> 00:49:39,140
on to the previous RDD,
1140
00:49:39,140 --> 00:49:42,717
that is words, and I'm storing
the data into the new RDD,
1141
00:49:42,717 --> 00:49:44,000
which is wordPair.
1142
00:49:44,000 --> 00:49:46,500
So here we are applying
map transformation in order
1143
00:49:46,500 --> 00:49:49,645
to display the first letter
of each and every word
1144
00:49:49,645 --> 00:49:51,700
which is stored
in the RDD words.
1145
00:49:51,700 --> 00:49:53,200
Now, let's continue.
1146
00:49:53,200 --> 00:49:56,093
The transformation has been
applied successfully. Now,
1147
00:49:56,093 --> 00:49:59,300
let's display the contents
which are present in the new RDD,
1148
00:49:59,300 --> 00:50:01,800
which is wordPair. So,
1149
00:50:01,800 --> 00:50:05,100
as explained, we have displayed
the starting letter of each
1150
00:50:05,100 --> 00:50:06,100
and every word
1151
00:50:06,100 --> 00:50:10,888
as S is the starting letter of Spark,
I is the starting letter of is, and
1152
00:50:10,888 --> 00:50:13,700
so on; L is the starting
letter of language.
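A minimal sketch of this third method:

```scala
// Method 3: a new RDD derived from an existing one via a map transformation
val words = sc.parallelize(Seq("Spark", "is", "a", "very", "powerful", "language"))
val wordPair = words.map(w => (w.charAt(0), w))   // (first letter, word)
wordPair.collect().foreach(println)
```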
1153
00:50:13,900 --> 00:50:17,000
Now we have understood
the creation of RDDs.
1154
00:50:17,000 --> 00:50:17,823
Let us move on
1155
00:50:17,823 --> 00:50:21,000
to the next stage where we'll
understand the operations
1156
00:50:21,000 --> 00:50:23,716
that are performed
on RDDs. Transformations
1157
00:50:23,716 --> 00:50:26,300
and actions are
the two major operations
1158
00:50:26,300 --> 00:50:27,700
that are performed on RDDs.
1159
00:50:27,700 --> 00:50:31,677
Let us understand what
transformations are. We apply
1160
00:50:31,677 --> 00:50:35,575
transformations in order to access,
filter and modify the data
1161
00:50:35,575 --> 00:50:37,470
which is present in an RDD.
1162
00:50:37,470 --> 00:50:41,087
Now, transformations are further
divided into two types:
1163
00:50:41,087 --> 00:50:44,500
narrow transformations and
wide transformations. Now,
1164
00:50:44,500 --> 00:50:47,500
let us understand what
narrow transformations are.
1165
00:50:47,500 --> 00:50:50,200
We apply narrow transformations
onto a single partition
1166
00:50:50,200 --> 00:50:51,400
of the parent RDD,
1167
00:50:51,400 --> 00:50:54,886
because the data required
to process the RDD is available
1168
00:50:54,886 --> 00:50:56,200
on a single partition
1169
00:50:56,200 --> 00:50:58,200
of the parent RDD. The examples
1170
00:50:58,200 --> 00:51:01,125
of narrow transformations
are map, filter,
1171
00:51:01,500 --> 00:51:04,300
flatMap
and mapPartitions.
1172
00:51:04,400 --> 00:51:06,940
Let us move on to the next
type of transformations,
1173
00:51:06,940 --> 00:51:08,511
which is wide transformations.
1174
00:51:08,511 --> 00:51:11,600
We apply wide transformations
on to multiple partitions
1175
00:51:11,600 --> 00:51:12,698
of the parent RDD,
1176
00:51:12,698 --> 00:51:16,080
because the data required
to process an RDD is available
1177
00:51:16,080 --> 00:51:17,514
on multiple partitions
1178
00:51:17,514 --> 00:51:19,600
of the parent
RDD. The examples
1179
00:51:19,600 --> 00:51:23,000
of wide transformations
are reduceByKey and union. Now,
1180
00:51:23,000 --> 00:51:24,823
let us move on to the next part
1181
00:51:24,823 --> 00:51:27,200
which is actions. Actions,
on the other hand,
1182
00:51:27,200 --> 00:51:29,802
are considered to be
the next part of operations,
1183
00:51:29,802 --> 00:51:31,700
which are used
to display the final results.
1184
00:51:32,200 --> 00:51:35,793
The examples of actions
are collect, count, take
1185
00:51:35,800 --> 00:51:38,479
and first. Till now,
we have discussed
1186
00:51:38,479 --> 00:51:40,700
the theory part on RDDs.
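Before the hands-on part, here is a tiny sketch of the transformation-versus-action split just described; the variable names are illustrative.

```scala
// Transformations are lazy; only an action triggers execution
val numbers = sc.parallelize(1 to 10)
val doubled = numbers.map(_ * 2)     // narrow transformation: no job runs yet
val total = doubled.reduce(_ + _)    // action: the job executes now
println(total)                       // 110
```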
1187
00:51:40,700 --> 00:51:42,870
Let us start
executing the operations
1188
00:51:42,870 --> 00:51:44,800
that are performed on RDDs.
1189
00:51:46,500 --> 00:51:49,100
In the practical part,
we'll be dealing with an example
1190
00:51:49,100 --> 00:51:50,600
of IPL match data.
1191
00:51:50,900 --> 00:51:52,900
So here I have a CSV file
1192
00:51:52,900 --> 00:51:57,158
which has the IPL match records
and this CSV file is stored
1193
00:51:57,158 --> 00:51:59,081
in my HDFS, and I'm loading
1194
00:51:59,081 --> 00:52:01,956
my matches.csv file
into the new RDD,
1195
00:52:01,956 --> 00:52:04,200
which is CKfile, as a text file.
1196
00:52:04,200 --> 00:52:07,909
So the matches.csv file
has been successfully loaded
1197
00:52:07,909 --> 00:52:09,990
as a text file into the new RDD,
1198
00:52:09,990 --> 00:52:11,400
which is CKfile. Now,
1199
00:52:11,400 --> 00:52:13,759
let us display the data
which is present
1200
00:52:13,759 --> 00:52:16,300
in our CKfile
using an action command.
1201
00:52:16,400 --> 00:52:18,170
So collect is the action command
1202
00:52:18,170 --> 00:52:20,700
which I'm using in order
to display the data
1203
00:52:20,700 --> 00:52:23,100
which is present
in my CKfile RDD.
1204
00:52:23,600 --> 00:52:27,569
So here we have in total
six hundred and thirty six rows
1205
00:52:27,569 --> 00:52:30,600
of data which consists
of IPL match records
1206
00:52:30,600 --> 00:52:33,500
from the year 2008 to 2017.
1207
00:52:33,711 --> 00:52:36,788
Now, let us see the schema
of the CSV file.
1208
00:52:37,300 --> 00:52:40,561
I am using the action command
first in order to display
1209
00:52:40,561 --> 00:52:42,800
the schema of the
matches.csv file.
1210
00:52:42,800 --> 00:52:45,300
So this command will display
the starting line
1211
00:52:45,300 --> 00:52:46,400
of the CSV file.
1212
00:52:46,400 --> 00:52:48,005
we have. So the schema
1213
00:52:48,005 --> 00:52:51,600
of the CSV file is the ID
of the match, season, city
1214
00:52:51,600 --> 00:52:54,386
where the IPL match
was conducted, date
1215
00:52:54,386 --> 00:52:57,700
of the match, team one, team
two, and so on. Now,
1216
00:52:57,700 --> 00:53:01,100
let's perform the further
operations on the CSV file.
1217
00:53:02,000 --> 00:53:04,300
Now moving on
to the further operations.
1218
00:53:04,300 --> 00:53:07,800
I'm about to split
the second column of my CSV file
1219
00:53:07,800 --> 00:53:10,787
which consists of the information
regarding the states
1220
00:53:10,787 --> 00:53:12,700
which conducted the IPL matches.
1221
00:53:12,700 --> 00:53:15,467
So I am using this operation
in order to display
1222
00:53:15,467 --> 00:53:18,000
the states where
the matches were conducted.
1223
00:53:18,700 --> 00:53:21,600
So the transformation
has been successfully applied
1224
00:53:21,600 --> 00:53:24,600
and the data has been stored
into the new RDD, which is states.
1225
00:53:24,600 --> 00:53:26,700
Now, let's display the data
which is stored
1226
00:53:26,700 --> 00:53:30,100
in our states RDD using
the collect action command.
1227
00:53:30,400 --> 00:53:31,890
So these were the states
1228
00:53:31,890 --> 00:53:34,500
where the matches
were being conducted. Now,
1229
00:53:34,500 --> 00:53:35,817
let's find out the city
1230
00:53:35,817 --> 00:53:38,700
which conducted the maximum
number of IPL matches.
1231
00:53:39,400 --> 00:53:41,700
Here, I'm creating
a new RDD again,
1232
00:53:41,700 --> 00:53:45,017
which is statesCount,
and I'm using map transformation,
1233
00:53:45,017 --> 00:53:47,799
and I am counting each
and every city and the number
1234
00:53:47,799 --> 00:53:50,200
of matches conducted
in that particular city.
1235
00:53:50,500 --> 00:53:52,776
The transformation
is successfully applied
1236
00:53:52,776 --> 00:53:55,600
and the data has been stored
into the statesCount RDD.
1237
00:53:56,400 --> 00:53:56,900
Now,
1238
00:53:56,900 --> 00:54:00,097
let us create a new RDD
by name stateCountM
1239
00:54:00,097 --> 00:54:01,414
and apply the reduceByKey
1240
00:54:01,414 --> 00:54:04,572
transformation and map
transformation together,
1241
00:54:04,572 --> 00:54:07,900
and consider tuple one as
the city name and tuple
1242
00:54:07,900 --> 00:54:09,500
two as the number of matches
1243
00:54:09,500 --> 00:54:11,876
which were conducted
in that particular city,
1244
00:54:11,876 --> 00:54:12,701
and apply the
1245
00:54:12,701 --> 00:54:15,000
sortByKey transformation
to find out the city
1246
00:54:15,000 --> 00:54:17,700
which conducted the maximum number
of IPL matches.
1247
00:54:17,900 --> 00:54:20,317
The Transformations
are successfully applied
1248
00:54:20,317 --> 00:54:23,200
and the data is stored
into the stateCountM
1249
00:54:23,200 --> 00:54:25,200
RDD. Now let's
display the data
1250
00:54:25,200 --> 00:54:26,800
which is present in stateCountM.
1251
00:54:26,800 --> 00:54:29,600
Here I am using the
1252
00:54:29,600 --> 00:54:33,320
take action command in order
to take the top 10 results
1253
00:54:33,320 --> 00:54:35,800
which are stored
in the stateCountM RDD.
1254
00:54:36,100 --> 00:54:38,600
So according to the results
we have Mumbai
1255
00:54:38,600 --> 00:54:41,300
which hosted the maximum number
of IPL matches,
1256
00:54:41,300 --> 00:54:45,700
which is 85, from the year
2008 to the year 2017.
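A minimal sketch of the pipeline just described; the HDFS path and the city column index (2) are assumptions based on the schema read out earlier (id, season, city, date, team1, team2, and so on).

```scala
// City with the most IPL matches (sketch under assumed column layout)
val CKfile = sc.textFile("hdfs://localhost:9000/example/matches.csv")
val states = CKfile.map(_.split(",")(2))              // city column, assumed index 2
val stateCountM = states.map(city => (city, 1))
                        .reduceByKey(_ + _)           // matches per city
                        .map { case (city, n) => (n, city) }
                        .sortByKey(ascending = false) // most matches first
stateCountM.take(10).foreach(println)                 // Mumbai tops with 85
```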
1257
00:54:46,400 --> 00:54:50,300
Now let us create a new RDD
by name fil RDD and use
1258
00:54:50,300 --> 00:54:53,144
flatMap in order to filter
out the match data
1259
00:54:53,144 --> 00:54:55,800
which were conducted
in the city Hyderabad,
1260
00:54:55,800 --> 00:54:58,500
and store the same data
into the fil RDD.
1261
00:54:58,500 --> 00:55:01,617
Since the transformation has been
successfully applied, now
1262
00:55:01,617 --> 00:55:04,600
let us display the data
which is present in our fil RDD,
1263
00:55:04,600 --> 00:55:06,161
which consists of the matches
1264
00:55:06,161 --> 00:55:08,800
which were conducted
excluding the city Hyderabad.
1265
00:55:09,900 --> 00:55:11,126
So this is the data
1266
00:55:11,126 --> 00:55:15,000
which is present in our fil RDD,
which excludes the matches
1267
00:55:15,000 --> 00:55:18,000
which are played
in the city Hyderabad. Now,
1268
00:55:18,000 --> 00:55:19,768
let us create another RDD
1269
00:55:19,768 --> 00:55:22,773
by name fil and store
the data of the matches
1270
00:55:22,773 --> 00:55:25,300
which were conducted
in the year 2017.
1271
00:55:25,300 --> 00:55:27,394
We shall use
filter transformation
1272
00:55:27,394 --> 00:55:28,600
for this operation.
1273
00:55:28,700 --> 00:55:31,000
The transformation has
been applied successfully
1274
00:55:31,000 --> 00:55:34,100
and the data has been stored
into the fil RDD. Now,
1275
00:55:34,100 --> 00:55:36,600
let us display the data
which is present there.
1276
00:55:37,200 --> 00:55:38,588
We shall use the collect
1277
00:55:38,588 --> 00:55:42,545
action command and now we have
the data of all the matches
1278
00:55:42,545 --> 00:55:45,600
which were played especially
in the year 2017.
1279
00:55:47,100 --> 00:55:49,400
Similarly, we can find
out the matches
1280
00:55:49,400 --> 00:55:52,000
which were played
in the year 2016 and we
1281
00:55:52,000 --> 00:55:54,600
can save the same data
into the new RDD,
1282
00:55:54,600 --> 00:55:57,500
which is fil2. Similarly,
1283
00:55:57,500 --> 00:55:59,823
we can find out the data
of the matches
1284
00:55:59,823 --> 00:56:03,100
which were conducted in the year
2016 and we can store
1285
00:56:03,100 --> 00:56:05,061
the same data into our new RDD,
1286
00:56:05,061 --> 00:56:08,200
which is fil2. I
have used filter transformation
1287
00:56:08,200 --> 00:56:10,800
in order to filter out
the data of the matches
1288
00:56:10,800 --> 00:56:13,581
which were conducted
in the year 2016 and I
1289
00:56:13,581 --> 00:56:15,900
have saved the data
into the new RDD,
1290
00:56:15,900 --> 00:56:18,300
which is fil2. Now,
1291
00:56:18,300 --> 00:56:20,889
let us understand
the union transformation
1292
00:56:20,889 --> 00:56:21,900
where we'll apply
1293
00:56:21,900 --> 00:56:26,400
the union transformation on
to the fil RDD and fil2 RDD
1294
00:56:26,400 --> 00:56:29,100
in order to combine
the data present
1295
00:56:29,100 --> 00:56:30,816
in both the RDDs. Here,
1296
00:56:30,816 --> 00:56:32,232
I'm creating a new RDD
1297
00:56:32,232 --> 00:56:35,931
by the name unionRDD, and I'm
applying the union transformation
1298
00:56:35,931 --> 00:56:38,600
on the two RDDs
that we created before.
1299
00:56:38,600 --> 00:56:42,400
The first one is the fil RDD,
which consists of the data
1300
00:56:42,400 --> 00:56:44,818
of the matches played
in the year 2017.
1301
00:56:44,818 --> 00:56:46,633
And the second one is fil2,
1302
00:56:46,633 --> 00:56:49,295
which consists of
the data of the matches
1303
00:56:49,295 --> 00:56:52,469
which were played in the year
2016. Here I'll be clubbing
1304
00:56:52,469 --> 00:56:53,921
both the RDDs together
1305
00:56:53,921 --> 00:56:56,700
and I'll be saving the data
into the new RDD,
1306
00:56:56,701 --> 00:56:58,163
which is unionRDD.
1307
00:56:58,600 --> 00:57:02,600
Now let us display the data
which is present in the new RDD,
1308
00:57:02,600 --> 00:57:04,100
which is unionRDD.
1309
00:57:04,100 --> 00:57:06,100
I am using collect
action command in order
1310
00:57:06,100 --> 00:57:07,100
to display the data.
1311
00:57:07,300 --> 00:57:09,800
So here we have the data
of the matches
1312
00:57:09,800 --> 00:57:11,400
which were played in the years
1313
00:57:11,400 --> 00:57:13,400
2016 and 2017.
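A minimal sketch of this filter-and-union step, reusing CKfile from the earlier sketch; the season column index (1) is an assumption:

```scala
// Matches from 2017 and 2016, combined with union
val fil  = CKfile.filter(_.split(",")(1) == "2017")
val fil2 = CKfile.filter(_.split(",")(1) == "2016")
val unionRDD = fil.union(fil2)
unionRDD.collect().foreach(println)
```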
1314
00:57:13,900 --> 00:57:16,306
And now let's continue
with our operations
1315
00:57:16,306 --> 00:57:19,188
and find out the player
with the maximum number of Man
1316
00:57:19,188 --> 00:57:21,603
of the Match awards.
For this operation,
1317
00:57:21,603 --> 00:57:23,293
I am applying map transformation
1318
00:57:23,293 --> 00:57:25,345
and splitting out
the column number 13,
1319
00:57:25,345 --> 00:57:28,314
which consists of the data
of the players who won the Man
1320
00:57:28,314 --> 00:57:30,800
of the Match award
for that particular match.
1321
00:57:30,800 --> 00:57:33,252
So the transformation
has been successfully applied
1322
00:57:33,252 --> 00:57:35,752
and the column number
13 has been successfully split,
1323
00:57:35,752 --> 00:57:37,700
and the data has been
stored into the man
1324
00:57:37,700 --> 00:57:39,238
of the match RDD. Now,
1325
00:57:39,238 --> 00:57:42,155
we are creating a new RDD
by the name man
1326
00:57:42,155 --> 00:57:45,600
of the match count, applying
map transformations on
1327
00:57:45,600 --> 00:57:46,800
to the previous RDD,
1328
00:57:46,800 --> 00:57:48,300
and we are counting the number
1329
00:57:48,300 --> 00:57:51,300
of awards won by each and
every particular player.
1330
00:57:51,700 --> 00:57:55,733
Now, we shall create a new RDD
by the name man of the match,
1331
00:57:55,733 --> 00:57:59,500
and we are applying reduceByKey
on the previous RDD,
1332
00:57:59,500 --> 00:58:01,311
which is man of the match count.
1333
00:58:01,311 --> 00:58:03,765
And again, we are applying
map transformation
1334
00:58:03,765 --> 00:58:06,600
and considering tuple one
as the name of the player
1335
00:58:06,600 --> 00:58:08,843
and tuple two as
the number of matches
1336
00:58:08,843 --> 00:58:11,500
he played and won the Man
of the Match awards.
1337
00:58:11,500 --> 00:58:14,794
Let us use the take action command
in order to print the data
1338
00:58:14,794 --> 00:58:18,000
which is stored in our new RDD,
which is man of the match.
1339
00:58:18,200 --> 00:58:21,400
So according to the result,
we have AB de Villiers,
1340
00:58:21,400 --> 00:58:24,000
who won the maximum number
of Man of the Match awards,
1341
00:58:24,000 --> 00:58:24,923
which is 15.
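A minimal sketch of the Man of the Match computation; column index 13 follows the narration above, and CKfile is from the earlier sketch:

```scala
// Player with the most Man of the Match awards
val manOfTheMatch = CKfile.map(_.split(",")(13))
val manOfTheMatchCount = manOfTheMatch.map(player => (player, 1))
                                      .reduceByKey(_ + _)
manOfTheMatchCount.map(_.swap)                // (count, player)
                  .sortByKey(ascending = false)
                  .take(5)
                  .foreach(println)           // the top player has 15 awards here
```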
1342
00:58:25,800 --> 00:58:29,129
So these are a few operations
that were performed on RDDs.
1343
00:58:29,129 --> 00:58:31,600
Now, let us move on
to our Pokémon use case
1344
00:58:31,600 --> 00:58:34,800
so that we can understand
RDDs in a much better way.
1345
00:58:35,800 --> 00:58:39,300
So the steps to be performed
in the Pokémon use case are loading
1346
00:58:39,300 --> 00:58:41,164
the Pokemon data CSV file
1347
00:58:41,164 --> 00:58:44,624
from an external storage
into an RDD, removing the schema
1348
00:58:44,624 --> 00:58:46,700
from the Pokemon
data CSV file,
1349
00:58:46,700 --> 00:58:49,730
finding out the total number
of water type Pokemon
1350
00:58:49,730 --> 00:58:52,117
and finding the total number
of fire type Pokemon.
1351
00:58:52,117 --> 00:58:53,882
I know it's getting interesting.
1352
00:58:53,882 --> 00:58:57,000
So let me explain each
and every step practically.
1353
00:58:57,700 --> 00:59:00,200
So here I am creating
a new RDD
1354
00:59:00,200 --> 00:59:02,400
by name PokemonDataRDD1
1355
00:59:02,400 --> 00:59:05,700
and I'm loading my CSV file
from an external storage,
1356
00:59:05,700 --> 00:59:08,100
that is my HDFS, as a text file.
1357
00:59:08,100 --> 00:59:11,800
So the Pokemon data CSV file
has been successfully loaded
1358
00:59:11,800 --> 00:59:12,800
into our new RDD.
1359
00:59:12,800 --> 00:59:14,100
So let us display the data
1360
00:59:14,100 --> 00:59:17,100
which is present
in our PokemonDataRDD1.
1361
00:59:17,200 --> 00:59:19,700
I am using collect
action command for this.
1362
00:59:20,000 --> 00:59:23,900
So here we have 721 rows
of data of all the types
1363
00:59:23,900 --> 00:59:28,979
of Pokemons we have. So now
let us display the schema
1364
00:59:28,979 --> 00:59:30,441
of the data we have.
1365
00:59:30,700 --> 00:59:33,900
I have used the action command
first in order to display
1366
00:59:33,900 --> 00:59:35,727
the first line of the CSV file,
1367
00:59:35,727 --> 00:59:38,600
which happens to be
the schema of the CSV file.
1368
00:59:38,600 --> 00:59:40,000
So we have the index
1369
00:59:40,000 --> 00:59:42,100
of the Pokemon, name
of the Pokémon,
1370
00:59:42,100 --> 00:59:46,700
its type, total points,
HP, attack points, defense points,
1371
00:59:46,992 --> 00:59:50,607
special attack, special
defense, speed, generation,
1372
00:59:50,700 --> 00:59:51,938
and we can also find
1373
00:59:51,938 --> 00:59:54,600
if a particular Pokemon
is legendary or not.
1374
00:59:55,773 --> 00:59:57,926
Here, I'm creating a new RDD,
1375
00:59:58,000 --> 00:59:59,400
which is noHeader,
1376
00:59:59,400 --> 01:00:02,800
and I'm using filter operation
in order to remove the schema
1377
01:00:02,800 --> 01:00:04,900
of the Pokemon data CSV file.
1378
01:00:04,900 --> 01:00:08,407
The schema of the Pokemon data
CSV file has been removed,
1379
01:00:08,407 --> 01:00:10,705
because Spark
considers the schema
1380
01:00:10,705 --> 01:00:12,300
as data to be processed.
1381
01:00:12,300 --> 01:00:13,480
So for this reason,
we remove the schema. Now,
01:00:13,480 --> 01:00:16,500
we remove the schema now,
let's display the data
1383
01:00:16,500 --> 01:00:19,000
which is present
in the noHeader RDD.
1384
01:00:19,000 --> 01:00:20,441
I am using action command
1385
01:00:20,441 --> 01:00:22,500
collect in order
to display the data
1386
01:00:22,500 --> 01:00:24,700
which is present
in the noHeader RDD.
1387
01:00:24,900 --> 01:00:26,104
So this is the data
1388
01:00:26,104 --> 01:00:28,195
which is stored
in the noHeader RDD
1389
01:00:28,195 --> 01:00:29,400
without the schema.
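The usual idiom for dropping the header is to grab the first line and filter it out; a sketch, with an illustrative path and names:

```scala
// Remove the schema line so Spark does not treat it as data
val PokemonDataRDD1 = sc.textFile("hdfs://localhost:9000/example/PokemonData.csv")
val header = PokemonDataRDD1.first()              // the schema row
val noHeader = PokemonDataRDD1.filter(_ != header)
```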
1390
01:00:31,200 --> 01:00:33,978
So now let us find out
the number of partitions
1391
01:00:33,978 --> 01:00:37,300
into which our noHeader
RDD has been split.
1392
01:00:37,300 --> 01:00:40,320
So I am using the partitions
method in order to find
1393
01:00:40,320 --> 01:00:42,060
out the number of partitions
1394
01:00:42,060 --> 01:00:45,000
the data was split into.
According to the result,
1395
01:00:45,000 --> 01:00:48,300
the noHeader RDD has been split
into two partitions.
1396
01:00:48,600 --> 01:00:52,000
I am here creating a new RDD
by name waterRDD,
1397
01:00:52,000 --> 01:00:55,100
and I'm using filter
transformation in order to find
1398
01:00:55,100 --> 01:00:59,000
out the water type Pokemons in
our Pokemon data CSV file.
1399
01:00:59,600 --> 01:01:02,800
I'm using action command collect
in order to print the data
1400
01:01:02,800 --> 01:01:04,900
which is present in waterRDD.
1401
01:01:05,200 --> 01:01:08,000
So these are the total number
of water type Pokemon
1402
01:01:08,000 --> 01:01:10,528
that we have in our
Pokemon data CSV file.
1403
01:01:10,528 --> 01:01:11,160
Similarly,
1404
01:01:11,160 --> 01:01:13,500
let's find out
the fire type Pokemons.
1405
01:01:14,600 --> 01:01:17,500
I'm creating a new RDD
by the name fireRDD
1406
01:01:17,500 --> 01:01:20,523
and applying filter operation
in order to find out
1407
01:01:20,523 --> 01:01:23,300
the fire type Pokemon
present in our CSV file.
1408
01:01:24,200 --> 01:01:27,200
I'm using collect action command
in order to print the data
1409
01:01:27,200 --> 01:01:29,200
which is present in fireRDD.
1410
01:01:29,400 --> 01:01:32,100
So these are the fire type
Pokemon which are present
1411
01:01:32,100 --> 01:01:34,400
in our Pokémon
data dot CSV file.
1412
01:01:34,600 --> 01:01:37,600
Now, let us count the total
number of water type Pokemon
1413
01:01:37,600 --> 01:01:40,400
which are present
in the Pokemon data CSV file.
1414
01:01:40,400 --> 01:01:44,500
I am using count action for this
and we have 112 water type
1415
01:01:44,500 --> 01:01:47,400
Pokemons present in
our Pokemon data CSV file.
1416
01:01:47,400 --> 01:01:47,924
Similarly,
1417
01:01:47,924 --> 01:01:50,600
let's find out the total number
of fire-type Pokémon
1418
01:01:50,600 --> 01:01:54,300
we have. I'm using the count
action command for the same.
1419
01:01:54,300 --> 01:01:56,178
So we have a total of 52
fire type Pokemons in our
01:01:56,178 --> 01:01:59,800
of fire type Pokemon Sinnoh
Pokemon data CSV file.
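A minimal sketch of both type counts, reusing noHeader from the earlier sketch; the type column index (2) is an assumption from the schema above:

```scala
// Count water and fire type Pokemon
val waterRDD = noHeader.filter(_.split(",")(2) == "Water")
val fireRDD  = noHeader.filter(_.split(",")(2) == "Fire")
println(waterRDD.count())   // 112 in this dataset
println(fireRDD.count())    // 52
```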
1421
01:01:59,800 --> 01:02:01,992
Let's continue with
our further operations
1422
01:02:01,992 --> 01:02:05,200
where we'll find out the highest
defense strength of a Pokémon.
1423
01:02:05,300 --> 01:02:08,400
I am creating a new RDD
by the name defenseList
1424
01:02:08,400 --> 01:02:10,400
and I'm applying
map transformation
1425
01:02:10,400 --> 01:02:12,935
and splitting out
the column number six in order
1426
01:02:12,935 --> 01:02:14,500
to extract the defense points
1427
01:02:14,500 --> 01:02:18,100
of all the Pokemons present in
our Pokémon data dot CSV file.
1428
01:02:18,300 --> 01:02:21,400
So the data has been stored
successfully into the new
1429
01:02:21,400 --> 01:02:23,100
RDD, which is defenseList.
1430
01:02:23,500 --> 01:02:23,700
Now,
1431
01:02:23,700 --> 01:02:26,249
I'm using the max action command
in order to print out
1432
01:02:26,249 --> 01:02:29,100
the maximum defense strength
out of all the Pokemons.
1433
01:02:29,200 --> 01:02:32,576
So we have 230 points as
the maximum defense strength
1434
01:02:32,576 --> 01:02:34,200
amongst all the Pokemons.
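A sketch of the defense extraction and the max action; column index 6 follows the narration, and noHeader is from the earlier sketch:

```scala
// Highest defense strength across all Pokemon
val defenseList = noHeader.map(_.split(",")(6).toDouble)
println(defenseList.max())   // 230.0
```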
1435
01:02:34,200 --> 01:02:35,702
So in our further operations,
1436
01:02:35,702 --> 01:02:38,502
let's find out the Pokemons
which come under the category
1437
01:02:38,502 --> 01:02:40,600
of having the highest
defense strength,
1438
01:02:40,600 --> 01:02:42,400
which is 230 points.
1439
01:02:43,100 --> 01:02:45,456
In order to find out
the name of the Pokemon
1440
01:02:45,456 --> 01:02:47,100
with the highest defense strength,
1441
01:02:47,100 --> 01:02:49,182
I'm creating a new RDD
with the name
1442
01:02:49,182 --> 01:02:51,717
defenseWithPokemonName,
and I'm applying
1443
01:02:51,717 --> 01:02:54,000
map transformation on
to the previous RDD,
1444
01:02:54,000 --> 01:02:55,000
which is noHeader,
1445
01:02:55,000 --> 01:02:56,062
and I'm splitting out
1446
01:02:56,062 --> 01:02:59,100
column number six, which happens
to be the defense strength,
1447
01:02:59,100 --> 01:03:02,300
in order to extract the data
from that particular row,
1448
01:03:02,300 --> 01:03:05,100
which has the defense
strength as 230 points.
1449
01:03:05,769 --> 01:03:08,230
Now I'm creating a new RDD again
1450
01:03:08,300 --> 01:03:11,500
with the name maximumDefensePokemon,
and I'm applying
1451
01:03:11,500 --> 01:03:15,100
the groupByKey transformation
in order to display the Pokemons
1452
01:03:15,100 --> 01:03:18,675
which have the maximum defense
points, that is, 230 points.
1453
01:03:18,675 --> 01:03:20,400
So according to the result,
1454
01:03:20,400 --> 01:03:23,400
we have Steelix, Steelix
Mega, Shuckle, Aggron
1455
01:03:23,400 --> 01:03:24,500
and Aggron Mega
1456
01:03:24,500 --> 01:03:27,200
as the Pokemons with
the highest defense strength,
1457
01:03:27,200 --> 01:03:28,800
which is 230 points.
1458
01:03:28,800 --> 01:03:31,100
Now we shall find
out the Pokemon
1459
01:03:31,100 --> 01:03:33,600
which is having the least
defense strength.
1460
01:03:34,200 --> 01:03:35,900
So before we find
out the Pokemon
1461
01:03:35,900 --> 01:03:37,580
with the least defense strength,
1462
01:03:37,580 --> 01:03:39,694
let us find out
the least defense points
1463
01:03:39,694 --> 01:03:41,700
which are present
in the defenseList.
1464
01:03:42,900 --> 01:03:45,100
So in order to find
out the Pokémon
1465
01:03:45,100 --> 01:03:46,788
with the least defense strength,
1466
01:03:46,788 --> 01:03:48,200
I have created a new RDD
1467
01:03:48,200 --> 01:03:51,654
by name minimumDefensePokemon
and I have applied the distinct
1468
01:03:51,654 --> 01:03:54,900
and sortBy transformations
on to the defenseList RDD
1469
01:03:54,900 --> 01:03:57,900
in order to extract
the least defense points present
1470
01:03:57,900 --> 01:03:58,955
in the defenseList,
1471
01:03:58,955 --> 01:04:01,484
and I have used the take
action command in order
1472
01:04:01,484 --> 01:04:02,600
to display the data
1473
01:04:02,600 --> 01:04:05,300
which is present
in the minimumDefensePokemon RDD.
1474
01:04:05,300 --> 01:04:06,700
So according to the results,
1475
01:04:06,700 --> 01:04:09,300
we have five points as
the least defense strength
1476
01:04:09,300 --> 01:04:11,053
of a particular Pokémon.
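A sketch of that distinct-and-sort step, reusing defenseList from the earlier sketch:

```scala
// Least defense strength via distinct values sorted ascending
val minDefense = defenseList.distinct().sortBy(x => x).take(1)
minDefense.foreach(println)   // 5.0
```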
1477
01:04:11,053 --> 01:04:13,148
Now, let us find out
the name of the Pokemon
1478
01:04:13,148 --> 01:04:16,650
which comes under the category
of having five points as
1479
01:04:16,650 --> 01:04:18,290
its defense strength. Now,
1480
01:04:18,290 --> 01:04:19,808
let us create a new RDD,
1481
01:04:19,808 --> 01:04:23,956
which is defenseWithPokemonName2,
and apply map transformation
1482
01:04:23,956 --> 01:04:27,217
and split the column number 6
and store the data
1483
01:04:27,217 --> 01:04:28,259
into our new RDD,
1484
01:04:28,259 --> 01:04:30,800
which is
defenseWithPokemonName2.
1485
01:04:32,000 --> 01:04:34,500
The transformation has
been successfully applied
1486
01:04:34,500 --> 01:04:36,970
and the data is now
stored into the new RDD,
1487
01:04:36,970 --> 01:04:37,900
which is defenseWithPokemonName2.
1488
01:04:37,900 --> 01:04:41,900
The data
has been successfully loaded.
1489
01:04:41,900 --> 01:04:45,500
Now, let us apply
the further operations. Here,
1490
01:04:45,538 --> 01:04:50,000
I am creating another RDD with
name minimumDefensePokemon,
1491
01:04:50,000 --> 01:04:53,400
and I'm applying the groupByKey
transformation in order
1492
01:04:53,400 --> 01:04:55,500
to extract the data from the row
1493
01:04:55,500 --> 01:04:58,206
which has the defense
points as 5.0.
1494
01:04:58,500 --> 01:05:01,829
The data has been successfully
loaded now, and let us display
1495
01:05:01,829 --> 01:05:03,300
the data which is present
1496
01:05:03,300 --> 01:05:07,307
in the minimumDefensePokemon RDD.
Now, according to the results,
1497
01:05:07,307 --> 01:05:09,073
we have two Pokemons
1498
01:05:09,073 --> 01:05:12,098
which come under the category
of having five points
1499
01:05:12,098 --> 01:05:15,400
as their defense strength:
the Pokemons Chansey
1500
01:05:15,400 --> 01:05:17,500
and Happiny are
the two Pokemons
1501
01:05:17,500 --> 01:05:24,500
which have the least
defense strength. The world
1502
01:05:24,500 --> 01:05:26,100
of Information Technology
1503
01:05:26,100 --> 01:05:29,786
and big data processing started
to see multiple potentialities
1504
01:05:29,786 --> 01:05:31,600
from Spark coming into action.
1505
01:05:31,700 --> 01:05:34,685
One such pinnacle in Spark's
technology advancements is
1506
01:05:34,685 --> 01:05:35,600
the data frame.
1507
01:05:35,600 --> 01:05:38,200
And today we shall
understand the technicalities
1508
01:05:38,200 --> 01:05:39,000
of data frames
1509
01:05:39,000 --> 01:05:42,500
in Spark. A data frame in Spark
is all about performance.
1510
01:05:42,500 --> 01:05:46,300
It is a powerful multifunctional
and an integrated data structure
1511
01:05:46,300 --> 01:05:49,100
where the programmer can work
with different libraries
1512
01:05:49,100 --> 01:05:52,000
and perform numerous
functionalities without breaking
1513
01:05:52,000 --> 01:05:53,529
a sweat. To understand the APIs
1514
01:05:53,529 --> 01:05:54,823
and libraries involved
1515
01:05:54,823 --> 01:05:57,500
in the process
without wasting any time.
1516
01:05:57,500 --> 01:06:00,000
Let us understand a topic
for today's discussion.
1517
01:06:00,000 --> 01:06:01,900
I have lined up the docket
for understanding
1518
01:06:01,900 --> 01:06:03,800
data frames in Spark as below,
1519
01:06:03,800 --> 01:06:06,962
which will begin with
what are data frames. Here,
1520
01:06:06,962 --> 01:06:09,700
We will learn what
exactly a data frame is.
1521
01:06:09,700 --> 01:06:13,706
What does it look like and what
are its functionalities then we
1522
01:06:13,706 --> 01:06:16,400
shall see why do we need
data frames here?
1523
01:06:16,400 --> 01:06:18,900
We shall understand
the requirements which led us
1524
01:06:18,900 --> 01:06:21,200
to the invention
of data frames later.
1525
01:06:21,200 --> 01:06:23,400
I'll walk you through
the important features
1526
01:06:23,400 --> 01:06:24,282
of data frames.
1527
01:06:24,282 --> 01:06:25,400
Then we should look
1528
01:06:25,400 --> 01:06:28,000
into the sources from which
the data frames and Spark
1529
01:06:28,000 --> 01:06:31,000
get their data from. Once
the theory part is finished,
1530
01:06:31,000 --> 01:06:33,400
I will get us involved
into the Practical part
1531
01:06:33,400 --> 01:06:35,700
where the creation
of a dataframe happens to be
1532
01:06:35,700 --> 01:06:39,400
a first step next we shall work
with an interesting example,
1533
01:06:39,400 --> 01:06:41,100
which is related to football
1534
01:06:41,100 --> 01:06:43,237
and finally to understand
the data frames
1535
01:06:43,237 --> 01:06:44,200
in spark in a much
1536
01:06:44,200 --> 01:06:46,980
better way, we shall work
with the most trending topic
1537
01:06:46,980 --> 01:06:47,711
as a use case,
1538
01:06:47,711 --> 01:06:50,300
which is none other
than the Game of Thrones.
1539
01:06:50,400 --> 01:06:52,100
So let's get started.
1540
01:06:52,200 --> 01:06:55,500
What is a data frame
in simple terms a data frame
1541
01:06:55,500 --> 01:06:58,617
can be considered as a
distributed collection of data.
1542
01:06:58,617 --> 01:07:01,156
The data is organized
under named columns,
1543
01:07:01,156 --> 01:07:04,500
which provide us the operations
to filter, group, process
1544
01:07:04,500 --> 01:07:08,205
and aggregate the available data
data frames can also be used
1545
01:07:08,205 --> 01:07:11,100
with Spark SQL and we
can construct data frames
1546
01:07:11,100 --> 01:07:14,800
from structured data files rdds
or from an external storage
1547
01:07:14,800 --> 01:07:17,500
like HDFS, Hive, Cassandra, HBase
1548
01:07:17,500 --> 01:07:19,676
and many more. With
this we shall look
1549
01:07:19,676 --> 01:07:21,500
into a more simplified example,
1550
01:07:21,500 --> 01:07:24,455
which will give us a basic
description of a data frame.
1551
01:07:24,455 --> 01:07:26,700
So we shall deal
with an employee database
1552
01:07:26,700 --> 01:07:29,229
where we have entities
and their data types.
1553
01:07:29,229 --> 01:07:31,817
So the name of the employee
is a first entity
1554
01:07:31,817 --> 01:07:33,500
And its respective data type
1555
01:07:33,500 --> 01:07:37,102
is string data type. Similarly,
employee ID has data type
1556
01:07:37,102 --> 01:07:39,004
of string employee phone number
1557
01:07:39,004 --> 01:07:40,646
which is integer data type
1558
01:07:40,646 --> 01:07:43,642
and employee address happens
to be string data type.
1559
01:07:43,642 --> 01:07:46,700
And finally the employee salary
is float data type.
1560
01:07:46,700 --> 01:07:49,500
All this data is stored
into an external storage,
1561
01:07:49,500 --> 01:07:51,093
which may be hdfs Hive
1562
01:07:51,093 --> 01:07:53,700
or Cassandra using
the data frame API
1563
01:07:53,700 --> 01:07:55,200
with their respective schema,
1564
01:07:55,200 --> 01:07:56,500
which consists of the name
1565
01:07:56,500 --> 01:07:58,913
of the entity along
with this data type now
1566
01:07:58,913 --> 01:08:01,900
that we have understood what
exactly a data frame is.
1567
01:08:01,900 --> 01:08:03,910
Let us quickly move on
to our next stage
1568
01:08:03,910 --> 01:08:06,900
where we shall understand the
requirement for a data frame.
1569
01:08:07,000 --> 01:08:07,806
It provides us
1570
01:08:07,806 --> 01:08:10,400
multiple programming
language supportability.
1571
01:08:10,400 --> 01:08:13,670
It has the capacity to work
with multiple data sources,
1572
01:08:13,670 --> 01:08:16,904
it can process both structured
and unstructured data.
1573
01:08:16,904 --> 01:08:19,455
And finally it is
well versed with slicing
1574
01:08:19,455 --> 01:08:20,681
and dicing the data.
1575
01:08:20,681 --> 01:08:21,723
So the first one is
1576
01:08:21,723 --> 01:08:24,900
the supportability for
multiple programming languages.
1577
01:08:24,900 --> 01:08:26,937
The IT industry
required a powerful
1578
01:08:26,937 --> 01:08:28,700
and an integrated data structure
1579
01:08:28,700 --> 01:08:29,500
which could support
1580
01:08:29,500 --> 01:08:31,800
multiple programming languages
and at the same
1581
01:08:31,800 --> 01:08:33,900
time without
the requirement of
1582
01:08:33,900 --> 01:08:36,900
additional APIs. Data frame
was the one-stop solution
1583
01:08:36,900 --> 01:08:39,900
which supported multiple
languages along with a single
1584
01:08:39,900 --> 01:08:41,982
API. The most popular languages
1585
01:08:41,982 --> 01:08:45,046
that a dataframe could
support are R, Python,
1586
01:08:45,046 --> 01:08:48,777
Scala, Java and many more.
The next requirement
1587
01:08:48,777 --> 01:08:51,500
was to support
the multiple data sources.
1588
01:08:51,500 --> 01:08:53,608
We all know that in
a real-time approach
1589
01:08:53,608 --> 01:08:55,700
to data processing,
we will never end up
1590
01:08:55,700 --> 01:08:57,700
at a single data
source. Data frame is
1591
01:08:57,700 --> 01:08:59,057
one such data structure,
1592
01:08:59,057 --> 01:09:02,000
which has the capability
to support and process data.
1593
01:09:02,000 --> 01:09:05,615
from a variety of data
sources. Hadoop, Cassandra,
1594
01:09:05,615 --> 01:09:07,207
JSON files, HBase,
1595
01:09:07,207 --> 01:09:10,284
CSV files are the examples
to name a few.
1596
01:09:10,300 --> 01:09:12,947
The next requirement was
to process structured
1597
01:09:12,947 --> 01:09:14,200
and unstructured data.
1598
01:09:14,200 --> 01:09:17,400
The Big Data environment was
designed to store huge amount
1599
01:09:17,400 --> 01:09:18,487
of data regardless
1600
01:09:18,487 --> 01:09:19,755
of which type exactly
1601
01:09:19,755 --> 01:09:22,827
it is now Sparks data frame
is designed in such a way
1602
01:09:22,827 --> 01:09:25,994
that it can store a huge
collection of both structured
1603
01:09:25,994 --> 01:09:27,249
and unstructured data
1604
01:09:27,249 --> 01:09:29,900
in a tabular format
along with its schema.
1605
01:09:29,900 --> 01:09:33,300
The next requirement was slicing
and dicing data. Now,
1606
01:09:33,300 --> 01:09:34,300
the humongous amount
1607
01:09:34,300 --> 01:09:37,400
of data stored in Sparks
data frame can be sliced
1608
01:09:37,400 --> 01:09:40,975
and diced using the operations
like filter, select, group
1609
01:09:40,975 --> 01:09:42,300
by, order by and many
1610
01:09:42,300 --> 01:09:45,100
more these operations
are applied upon the data
1611
01:09:45,100 --> 01:09:47,456
which are stored in form
of rows and columns
1612
01:09:47,456 --> 01:09:50,388
in a data frame. These
were a few crucial requirements
1613
01:09:50,388 --> 01:09:52,700
which led to the invention
of data frames.
1614
01:09:52,800 --> 01:09:55,173
Now, let us get
into the important features
1615
01:09:55,173 --> 01:09:55,997
of data frames
1616
01:09:55,997 --> 01:09:58,700
which bring it an edge
over the other alternatives.
1617
01:09:59,100 --> 01:10:02,400
Immutability, lazy
evaluation, fault tolerance
1618
01:10:02,400 --> 01:10:04,400
and distributed memory storage,
1619
01:10:04,400 --> 01:10:07,800
let us discuss about each
and every feature in detail.
1620
01:10:07,800 --> 01:10:10,600
So the first one is
immutability. Similar to
1621
01:10:10,600 --> 01:10:13,295
the resilient distributed data
sets the data frames
1622
01:10:13,295 --> 01:10:16,688
in Spark are also immutable.
The term immutable depicts
1623
01:10:16,688 --> 01:10:18,100
that the data once stored
1624
01:10:18,100 --> 01:10:20,300
into a data frame
will not be altered.
1625
01:10:20,300 --> 01:10:23,100
The only way to alter the data
present in a data frame
1626
01:10:23,100 --> 01:10:25,700
would be by applying
simple transformation operations
1627
01:10:25,700 --> 01:10:26,600
on to them.
1628
01:10:26,600 --> 01:10:28,900
So the next feature
is lazy evaluation.
1629
01:10:28,900 --> 01:10:32,126
Lazy evaluation
is the key to the remarkable
1630
01:10:32,126 --> 01:10:36,100
performance offered by spark
similar to the rdds data frames
1631
01:10:36,100 --> 01:10:38,999
in spark will not throw
any output onto the screen
1632
01:10:38,999 --> 01:10:41,900
until and unless an action
command is encountered.
1633
01:10:41,900 --> 01:10:44,300
The next feature
is Fault tolerance.
1634
01:10:44,300 --> 01:10:45,182
There is no way
1635
01:10:45,182 --> 01:10:47,900
that the Sparks data frames
can lose their data.
1636
01:10:47,900 --> 01:10:50,300
They follow the principle
of being fault tolerant
1637
01:10:50,300 --> 01:10:51,782
to the unexpected calamities
1638
01:10:51,782 --> 01:10:53,900
which tend to destroy
the available data.
1639
01:10:53,900 --> 01:10:55,893
The next feature is distributed
1640
01:10:55,893 --> 01:10:58,590
memory storage. Spark's dataframes
distribute the data
1641
01:10:58,590 --> 01:11:00,000
across multiple locations
1642
01:11:00,000 --> 01:11:03,294
so that in case of a node
failure the next available node
1643
01:11:03,294 --> 01:11:05,900
can take its place to continue
the data processing.
1644
01:11:05,900 --> 01:11:08,700
The next stage will be
about the multiple data source
1645
01:11:08,700 --> 01:11:12,204
that the spark dataframe
can support the spark API
1646
01:11:12,204 --> 01:11:13,690
can integrate itself
1647
01:11:13,690 --> 01:11:17,700
with multiple programming
languages such as Scala, Java,
1648
01:11:17,700 --> 01:11:19,300
Python, R, MySQL
1649
01:11:19,300 --> 01:11:22,600
and many more making
itself capable to handle
1650
01:11:22,600 --> 01:11:26,700
a variety of data sources
such as Hadoop, Hive, HBase,
1651
01:11:26,800 --> 01:11:28,500
Cassandra, JSON files,
1652
01:11:28,600 --> 01:11:31,600
CSV files, MySQL
and many more.
1653
01:11:32,200 --> 01:11:33,726
So this was the theory part
1654
01:11:33,726 --> 01:11:36,100
and now let us move
into the Practical part
1655
01:11:36,100 --> 01:11:37,000
where the creation
1656
01:11:37,000 --> 01:11:39,500
of a dataframe happens
to be the first step.
1657
01:11:40,100 --> 01:11:42,412
So before we begin
the Practical part,
1658
01:11:42,412 --> 01:11:43,975
let us load the libraries
1659
01:11:43,975 --> 01:11:47,600
which are required in order to
process the data in data frames.
1660
01:11:48,200 --> 01:11:50,822
So these are the few libraries
which we required
1661
01:11:50,822 --> 01:11:53,600
before we process the data
using our data frames.
1662
01:11:54,200 --> 01:11:56,300
Now that we have loaded
all the libraries
1663
01:11:56,300 --> 01:11:59,393
which we required to process
the data using the data frames.
1664
01:11:59,393 --> 01:12:01,914
Let us begin with the creation
of our data frame.
1665
01:12:01,914 --> 01:12:05,000
So we shall create a new data
frame with the name employee
1666
01:12:05,000 --> 01:12:05,935
and load the data
1667
01:12:05,935 --> 01:12:08,300
of the employees present
in an organization.
1668
01:12:08,300 --> 01:12:11,400
The details of the employees
will consist of the first name
1669
01:12:11,400 --> 01:12:14,968
the last name and their mail ID
along with their salary.
1670
01:12:14,968 --> 01:12:18,500
So the first data frame has
been successfully created. Now,
1671
01:12:18,500 --> 01:12:20,700
let us design the schema
for this data frame.
1672
01:12:21,600 --> 01:12:24,100
So the schema for this data
frame has been described
1673
01:12:24,100 --> 01:12:27,900
as shown. The first name is of
string data type and similarly,
1674
01:12:27,900 --> 01:12:29,900
The last name is
a string data type
1675
01:12:29,900 --> 01:12:31,500
along with the mail address.
1676
01:12:31,500 --> 01:12:34,500
And finally the salary
is integer data type
1677
01:12:34,500 --> 01:12:37,000
or you can give
float data type also,
1678
01:12:37,000 --> 01:12:39,882
so the schema has been
successfully declared. Now,
1679
01:12:39,882 --> 01:12:41,600
let us create
the data frame using
1680
01:12:41,600 --> 01:12:43,700
createDataFrame function. Here,
1681
01:12:43,700 --> 01:12:47,260
I'm creating a new data frame
by starting a spark context
1682
01:12:47,260 --> 01:12:50,200
and using the create
data frame method and loading
1683
01:12:50,200 --> 01:12:52,800
the data from employees
and the employee schema.
1684
01:12:52,800 --> 01:12:55,200
The data frame is
successfully created now,
1685
01:12:55,200 --> 01:12:56,200
let's print the data
1686
01:12:56,200 --> 01:12:59,353
which is existing
in the dataframe EMP DF.
1687
01:13:00,273 --> 01:13:02,426
I am using show method here.
1688
01:13:03,200 --> 01:13:03,907
So the data
1689
01:13:03,907 --> 01:13:07,700
which is present in empDF has
been successfully printed. Now,
1690
01:13:07,700 --> 01:13:09,600
let us move on to the next step.
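
As a rough Scala equivalent of what was just walked through (the sample rows and field names are made up for illustration):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    // sample employee records: first name, last name, mail ID, salary
    val employees = Seq(
      Row("John", "Doe", "john@example.com", 35000),
      Row("Jane", "Roe", "jane@example.com", 42000))
    // the schema pairs each entity with its data type
    val employeeSchema = StructType(List(
      StructField("firstName", StringType, true),
      StructField("lastName", StringType, true),
      StructField("mailID", StringType, true),
      StructField("salary", IntegerType, true)))
    // createDataFrame combines the rows with the schema
    val empDF = spark.createDataFrame(sc.parallelize(employees), employeeSchema)
    empDF.show()
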
1691
01:13:09,800 --> 01:13:12,800
So the next step for our today's
discussion is working
1692
01:13:12,800 --> 01:13:15,500
with an example related
to the FIFA data set.
1693
01:13:16,100 --> 01:13:18,217
So the first step
in our FIFA example
1694
01:13:18,217 --> 01:13:20,772
would be loading the schema
for the CSV file.
1695
01:13:20,772 --> 01:13:22,000
we are working with. So
1696
01:13:22,000 --> 01:13:24,400
the schema has been
successfully loaded now.
1697
01:13:24,400 --> 01:13:28,066
Now let us load the CSV file
from our external storage
1698
01:13:28,066 --> 01:13:30,600
which is hdfs
into our data frame,
1699
01:13:30,600 --> 01:13:31,907
which is FIFA DF.
1700
01:13:32,100 --> 01:13:34,394
The CSV file is been
successfully loaded
1701
01:13:34,394 --> 01:13:35,800
into our new data frame,
1702
01:13:35,800 --> 01:13:37,100
which is FIFA DF now,
1703
01:13:37,100 --> 01:13:39,300
let us print the schema
of a data frame using
1704
01:13:39,300 --> 01:13:40,900
the print schema command.
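
A hedged sketch of the load step, assuming an HDFS path and a fifaSchema built the same way as the employee schema above:

    val fifaDF = spark.read
      .format("csv")
      .option("header", "true")   // first line of the file is the header
      .schema(fifaSchema)         // the schema designed beforehand
      .load("hdfs://localhost:9000/user/edureka/fifa.csv")
    fifaDF.printSchema()
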
1705
01:13:41,900 --> 01:13:43,400
So the schema
is been successfully
1706
01:13:43,400 --> 01:13:46,000
displayed here and we have
the following credentials.
1707
01:13:46,000 --> 01:13:49,300
Of each and every player
in our CSV file now,
1708
01:13:49,300 --> 01:13:51,900
let's move on to a further
operations on a dataframe.
1709
01:13:53,100 --> 01:13:56,200
We will count the total number
of records of the players
1710
01:13:56,200 --> 01:13:59,100
that we have in our CSV file
using count command.
1711
01:13:59,300 --> 01:14:01,500
So we have a total
of eighteen thousand
1712
01:14:01,500 --> 01:14:04,300
two hundred and seven players
in our CSV file.
1713
01:14:04,300 --> 01:14:06,091
Now, let us find out the details
1714
01:14:06,091 --> 01:14:08,500
of the columns on which
we are working with.
1715
01:14:08,500 --> 01:14:11,300
So these were the columns
which we are working with which
1716
01:14:11,300 --> 01:14:15,466
consists the ID of the player,
name, age, nationality, potential
1717
01:14:15,466 --> 01:14:16,400
and many more.
1718
01:14:17,100 --> 01:14:19,600
Now let us use the column Value,
1719
01:14:19,600 --> 01:14:21,282
which has the value of each
1720
01:14:21,282 --> 01:14:23,900
and every player
for a particular team and let
1721
01:14:23,900 --> 01:14:27,399
us use describe command in order
to see the highest value
1722
01:14:27,399 --> 01:14:29,900
and the least value
provided to a player.
1723
01:14:29,900 --> 01:14:33,000
So we have a count
of a total number of eighteen thousand
1724
01:14:33,000 --> 01:14:34,400
two hundred and seven players
1725
01:14:34,400 --> 01:14:37,612
and the minimum worth
given to a player is 0
1726
01:14:37,612 --> 01:14:40,900
and the maximum is given
as 9 million pounds.
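
The count and describe steps could look like this in Scala (the column name Value is an assumption about the CSV):

    fifaDF.count()                   // total player records, 18,207 per the video
    fifaDF.columns                   // lists the columns we are working with
    fifaDF.describe("Value").show()  // count, min and max of the Value column
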
1727
01:14:41,100 --> 01:14:43,100
Now, let us use
the select command
1728
01:14:43,100 --> 01:14:46,216
in order to extract
the column name and nationality.
1729
01:14:46,216 --> 01:14:48,172
to find out the name of each
1730
01:14:48,172 --> 01:14:50,800
and every player along
with his nationality.
1731
01:14:51,000 --> 01:14:54,226
So here we can display
the top 20 rows of each
1732
01:14:54,226 --> 01:14:55,200
and every player
1733
01:14:55,200 --> 01:14:58,900
which we have in our CSV file
along with his nationality.
1734
01:14:59,000 --> 01:14:59,700
Similarly.
1735
01:14:59,700 --> 01:15:03,200
Let us find out the players
playing for a particular Club.
1736
01:15:03,200 --> 01:15:05,500
So here we have
the top 20 players playing
1737
01:15:05,500 --> 01:15:07,029
for their respective clubs
1738
01:15:07,029 --> 01:15:08,300
along with their names
1739
01:15:08,300 --> 01:15:10,800
for example, Messi
playing for Barcelona
1740
01:15:10,800 --> 01:15:13,100
and Ronaldo for
Juventus, etc.
1741
01:15:13,100 --> 01:15:15,100
Now, let's move
to the next stages.
1742
01:15:15,999 --> 01:15:17,900
Now, let us find out the players
1743
01:15:18,000 --> 01:15:21,000
who are found to be most active
in a particular national team
1744
01:15:21,000 --> 01:15:24,500
or a particular club
with age less than 30 years.
1745
01:15:24,500 --> 01:15:25,300
We shall use
1746
01:15:25,300 --> 01:15:28,300
filter transformation
to apply this operation.
1747
01:15:28,600 --> 01:15:30,500
So here we have the details
1748
01:15:30,500 --> 01:15:33,300
of the Players whose age
is less than 30 years
1749
01:15:33,300 --> 01:15:37,200
and their club and nationality
along with their jersey numbers.
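
In Scala, the select and filter steps above might read as follows (the column names are assumed from the narration):

    // name and nationality of every player, top 20 rows shown
    fifaDF.select("Name", "Nationality").show()
    // players with their clubs
    fifaDF.select("Name", "Club").show()
    // active players under 30, with club, nationality and jersey number
    fifaDF.filter(fifaDF("Age") < 30)
      .select("Name", "Age", "Club", "Nationality", "Jersey Number")
      .show()
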
1750
01:15:37,700 --> 01:15:40,700
So with this we have finished
our FIFA example now
1751
01:15:40,700 --> 01:15:43,466
to understand the data frames
in a much better way,
1752
01:15:43,466 --> 01:15:45,300
let us move on
into our use case,
1753
01:15:45,300 --> 01:15:48,400
which is about the most hot
topic, the Game of Thrones.
1754
01:15:49,100 --> 01:15:51,319
Similar to our previous example,
1755
01:15:51,319 --> 01:15:54,300
let us design the schema
of a CSV file first.
1756
01:15:54,300 --> 01:15:56,600
So this is the schema
for a CSV file
1757
01:15:56,600 --> 01:15:59,300
which consists the data
about the Game of Thrones.
1758
01:15:59,800 --> 01:16:02,800
So, this is a schema
for our first CSV file.
1759
01:16:02,800 --> 01:16:06,200
Now, let us create the schema
for our next CSV file.
1760
01:16:06,700 --> 01:16:09,991
I have named the schema
for our next CSV file as schema
1761
01:16:09,991 --> 01:16:12,667
2 and I've defined
the data types for each
1762
01:16:12,667 --> 01:16:16,300
and every entity. The schema
has been successfully designed
1763
01:16:16,300 --> 01:16:18,300
for the second CSV file also.
1764
01:16:18,300 --> 01:16:21,700
Now let us load our CSV files
from our external storage,
1765
01:16:21,700 --> 01:16:23,200
which is our hdfs.
1766
01:16:24,000 --> 01:16:28,100
The location of the first CSV
file character deaths dot CSV
1767
01:16:28,100 --> 01:16:29,076
is our hdfs,
1768
01:16:29,076 --> 01:16:31,000
which is defined as above
1769
01:16:31,000 --> 01:16:33,303
and the schema has been
provided as schema.
1770
01:16:33,303 --> 01:16:35,919
And the header true option
is also been provided.
1771
01:16:35,919 --> 01:16:38,100
We are using spark
read function for this
1772
01:16:38,100 --> 01:16:40,789
and we are loading this data
into our new data frame,
1773
01:16:40,789 --> 01:16:42,600
which is Game
of Thrones data frame.
1774
01:16:42,800 --> 01:16:43,700
Similarly.
1775
01:16:43,700 --> 01:16:45,743
Let's load the other CSV file
1776
01:16:45,743 --> 01:16:49,232
which is battles dot CSV
into another data frame,
1777
01:16:49,232 --> 01:16:53,000
which is the Game of Thrones
battles dataframe. The CSV file
1778
01:16:53,000 --> 01:16:54,792
has been successfully
loaded now.
1779
01:16:54,792 --> 01:16:57,200
Let us continue
with the further operations.
1780
01:16:57,900 --> 01:17:00,207
Now let us print
the schema offer Game
1781
01:17:00,207 --> 01:17:03,200
of Thrones data frame using
print schema command.
1782
01:17:03,300 --> 01:17:04,962
So here we have the schema
1783
01:17:04,962 --> 01:17:07,200
which consists of
the name alliances
1784
01:17:07,200 --> 01:17:10,821
death rate book of death
and many more similarly.
1785
01:17:10,821 --> 01:17:15,100
Let's print the schema of Game
of Thrones battles data frame.
1786
01:17:16,300 --> 01:17:18,600
So this is a schema
for our new data frame,
1787
01:17:18,600 --> 01:17:20,700
which is Game of Thrones
battle data frame.
1788
01:17:20,900 --> 01:17:23,600
Now, let's continue
the further operations.
1789
01:17:24,100 --> 01:17:26,000
Now, let us display
the data frame
1790
01:17:26,000 --> 01:17:29,500
which we have created using
the following command data frame
1791
01:17:29,500 --> 01:17:32,188
has been successfully printed
and this is the data
1792
01:17:32,188 --> 01:17:33,813
which we have in our data frame.
1793
01:17:33,813 --> 01:17:36,200
Now, let's continue
with the further operations.
1794
01:17:36,400 --> 01:17:38,449
We know that there are
a multiple number
1795
01:17:38,449 --> 01:17:41,100
of houses present in the story
of Game of Thrones.
1796
01:17:41,100 --> 01:17:42,211
Now, let us find out
1797
01:17:42,211 --> 01:17:45,100
each and every individual house
present in the story.
1798
01:17:45,300 --> 01:17:48,200
Let us use the following command
in order to display each
1799
01:17:48,200 --> 01:17:51,400
and every house present
in the Game of Thrones story.
1800
01:17:51,600 --> 01:17:54,600
So we have the following houses
in the Game of Thrones story.
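
One plausible Scala line for this, assuming the character-deaths CSV stores the house under an Allegiances column and was loaded as gotDF:

    // every distinct house present in the story
    gotDF.select("Allegiances").distinct().show()
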
1801
01:17:54,600 --> 01:17:57,064
Now, let's continue
with the further operations
1802
01:17:57,064 --> 01:18:00,299
the battles in the Game
of Thrones were fought for ages.
1803
01:18:00,299 --> 01:18:02,000
Let us classify the wars waged
1804
01:18:02,000 --> 01:18:04,300
with their occurrence
according to the years.
1805
01:18:04,300 --> 01:18:06,800
We shall use select
and filter transformation
1806
01:18:06,800 --> 01:18:09,750
and we shall access The Columns
of the details of the battle
1807
01:18:09,750 --> 01:18:11,600
and the year in which
they were fought.
1808
01:18:12,100 --> 01:18:13,800
Let us first find
out the battles
1809
01:18:13,800 --> 01:18:15,300
which were fought in the year
1810
01:18:15,300 --> 01:18:18,000
298. The following
code consists of
1811
01:18:18,000 --> 01:18:19,300
filter transformation
1812
01:18:19,300 --> 01:18:22,000
which will provide the details
for which we are looking.
1813
01:18:22,000 --> 01:18:23,350
So according to the results,
1814
01:18:23,350 --> 01:18:25,400
these were the battles
fought in the year
1815
01:18:25,400 --> 01:18:28,700
298 and we have the details
of the attacker Kings
1816
01:18:28,700 --> 01:18:30,002
and the defender Kings
1817
01:18:30,002 --> 01:18:33,648
and the outcome of the attacker
along with their commanders
1818
01:18:33,648 --> 01:18:36,400
and the location
where the war was fought now,
1819
01:18:36,400 --> 01:18:39,861
let us find out the wars
waged in the year 299.
1820
01:18:40,400 --> 01:18:41,764
So these were the details
1821
01:18:41,764 --> 01:18:45,293
of the wars which were fought
in the year 299 and similarly,
1822
01:18:45,293 --> 01:18:48,600
let us also find out the wars
which were waged in the year 300.
1823
01:18:48,600 --> 01:18:49,952
So these were the wars
1824
01:18:49,952 --> 01:18:51,700
which were fought
in the year 300.
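
A sketch of the year filter, assuming the battles CSV was loaded as gotBattlesDF and follows the commonly used column names:

    gotBattlesDF
      .select("name", "year", "attacker_king", "defender_king",
              "attacker_outcome", "attacker_commander", "location")
      .filter(gotBattlesDF("year") === 298)   // swap 298 for 299 or 300
      .show()
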
1825
01:18:51,700 --> 01:18:53,700
Now, let's move on
to the next operations
1826
01:18:53,700 --> 01:18:54,700
in our use case.
1827
01:18:55,000 --> 01:18:58,005
Now, let us find out the tactics
used in the wars waged
1828
01:18:58,005 --> 01:19:01,343
and also find out the total
number of wars waged by using
1829
01:19:01,343 --> 01:19:05,200
each type of those tactics
the following code must help us.
1830
01:19:05,800 --> 01:19:07,200
Here we are using select
1831
01:19:07,200 --> 01:19:10,196
and group by operations
in order to find out each
1832
01:19:10,196 --> 01:19:12,500
and every type of tactics
used in the war.
1833
01:19:12,600 --> 01:19:16,221
So they have used ambush, siege,
razing and pitched battle types
1834
01:19:16,221 --> 01:19:17,500
of tactics in wars
1835
01:19:17,500 --> 01:19:20,300
and most of the times they
have used pitched battle type
1836
01:19:20,300 --> 01:19:21,600
of tactics in wars.
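
The tactic tally sketched in Scala, with battle_type as the assumed column name:

    // how many wars used each type of tactic
    gotBattlesDF.groupBy("battle_type").count().show()
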
1837
01:19:21,600 --> 01:19:24,600
Now, let us continue
with the further operations
1838
01:19:24,600 --> 01:19:27,300
the Ambush type of battles are
the deadliest now,
1839
01:19:27,300 --> 01:19:28,650
let us find out the Kings
1840
01:19:28,650 --> 01:19:31,397
who fought the battles
using these kind of tactics
1841
01:19:31,397 --> 01:19:34,200
and also let us find out
the outcome of the battles
1842
01:19:34,200 --> 01:19:37,425
fought. Here the following code
will help us extract the data
1843
01:19:37,425 --> 01:19:38,600
which we need here.
1844
01:19:38,600 --> 01:19:40,962
We are using select
and where commands
1845
01:19:40,962 --> 01:19:43,900
and we are selecting
the columns year, attacker king,
1846
01:19:43,900 --> 01:19:48,181
defender king, attacker outcome,
battle type, attacker commander,
1847
01:19:48,181 --> 01:19:49,840
defender commander. Now,
1848
01:19:49,840 --> 01:19:51,500
let us print the details.
1849
01:19:51,900 --> 01:19:54,700
So these were the battles
fought using the Ambush tactics
1850
01:19:54,700 --> 01:19:56,300
and these were
the attacker Kings
1851
01:19:56,300 --> 01:19:59,300
and the defender Kings along
with their respective commanders
1852
01:19:59,300 --> 01:20:01,641
and the wars waged
in a particular year. Now,
1853
01:20:01,641 --> 01:20:03,700
Let's move on
to the next operation.
1854
01:20:04,300 --> 01:20:06,000
Now let us focus on the houses
1855
01:20:06,000 --> 01:20:08,600
and extract the deadliest house
amongst the rest.
1856
01:20:08,600 --> 01:20:11,893
The following code will help us
to find out the deadliest house
1857
01:20:11,893 --> 01:20:13,700
and the number
of battles they waged.
1858
01:20:13,700 --> 01:20:16,600
So here we have the details
of each and every house
1859
01:20:16,600 --> 01:20:19,383
and the battles they waged.
According to the results,
1860
01:20:19,383 --> 01:20:20,033
we have Stark
1861
01:20:20,033 --> 01:20:22,883
and Lannister houses to be
the deadliest among the others.
1862
01:20:22,883 --> 01:20:25,400
Now, let's continue
with the rest of the operations.
1863
01:20:25,900 --> 01:20:28,100
Now, let us find out
the deadliest king
1864
01:20:28,100 --> 01:20:29,100
among the others
1865
01:20:29,100 --> 01:20:31,400
which will use the following
command in order to find
1866
01:20:31,400 --> 01:20:33,600
the deadliest king
amongst the other kings
1867
01:20:33,600 --> 01:20:35,600
who fought the most
number of wars.
1868
01:20:35,600 --> 01:20:38,000
So according to the results
we have Joffrey as
1869
01:20:38,000 --> 01:20:38,900
the deadliest King
1870
01:20:38,900 --> 01:20:41,200
who fought a total number
of 14 battles.
1871
01:20:41,200 --> 01:20:44,000
Now, let us continue
with the further operations.
1872
01:20:44,500 --> 01:20:46,323
Now, let us find out the houses
1873
01:20:46,323 --> 01:20:49,400
which defended most number
of wars waged against them.
1874
01:20:49,400 --> 01:20:52,500
So the following code must help
us find out the details.
1875
01:20:52,600 --> 01:20:54,223
So according to the results,
1876
01:20:54,223 --> 01:20:57,400
We have Lannister house
to be defending the most number
1877
01:20:57,400 --> 01:20:59,009
of wars waged against them.
1878
01:20:59,009 --> 01:21:01,682
Now, let us find out
the defender king who defended
1879
01:21:01,682 --> 01:21:04,900
the most number of battles
which were waged against him.
1880
01:21:05,400 --> 01:21:08,405
So according to the result, Robb
Stark is the king
1881
01:21:08,405 --> 01:21:10,597
who defended the most
number of battles
1882
01:21:10,597 --> 01:21:12,100
which were waged against him.
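
Sketch of the defender-king ranking (the defender_king column name is an assumption):

    import org.apache.spark.sql.functions.desc

    gotBattlesDF.groupBy("defender_king")
      .count()
      .orderBy(desc("count"))   // the king who defended the most battles first
      .show()
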
1883
01:21:12,100 --> 01:21:12,300
Now.
1884
01:21:12,300 --> 01:21:14,600
Let's continue with
the further operations.
1885
01:21:14,800 --> 01:21:17,300
Since Lannister house
is my personal favorite.
1886
01:21:17,300 --> 01:21:18,800
Let me find out the details
1887
01:21:18,800 --> 01:21:20,800
of the characters
in Lannister house.
1888
01:21:20,800 --> 01:21:22,921
This code will
describe their name
1889
01:21:22,921 --> 01:21:24,400
and gender: 1 for male
1890
01:21:24,400 --> 01:21:27,700
and 0 for female along with
their respective population.
1891
01:21:27,700 --> 01:21:29,830
So let me find out
the male characters
1892
01:21:29,830 --> 01:21:31,500
in The Lannister house first.
1893
01:21:32,300 --> 01:21:34,899
So here we have used select
and where commands
1894
01:21:34,900 --> 01:21:37,600
in order to find out
the details of the characters
1895
01:21:37,600 --> 01:21:39,100
present in Lannister house
1896
01:21:39,100 --> 01:21:42,300
and the data has been stored
into the df1 dataframe.
1897
01:21:42,300 --> 01:21:44,700
Let us print the data
which is present in the
1898
01:21:44,700 --> 01:21:46,900
df1 data frame
using show command.
1899
01:21:47,800 --> 01:21:49,000
So these are the details
1900
01:21:49,000 --> 01:21:51,400
of the characters
present in Lannister house,
1901
01:21:51,400 --> 01:21:53,100
who are male. Now similarly,
1902
01:21:53,100 --> 01:21:55,400
Let us find out the female
characters present
1903
01:21:55,400 --> 01:21:56,800
in Lannister house.
1904
01:21:57,500 --> 01:22:00,000
So these are the characters
present in Lannister house
1905
01:22:00,000 --> 01:22:01,100
who are females
1906
01:22:01,300 --> 01:22:05,028
so we have a total number of
69 male characters and 12
1907
01:22:05,028 --> 01:22:07,900
female characters
in The Lannister house.
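
A hedged Scala sketch of the gender split, assuming Name, Allegiances and Gender columns in the character CSV:

    // male characters (Gender = 1) of House Lannister
    val df1 = gotDF.select("Name", "Allegiances", "Gender")
      .where("Allegiances = 'Lannister' AND Gender = 1")
    df1.show()
    df1.count()   // 69 per the video
    // female characters (Gender = 0)
    val df2 = gotDF.select("Name", "Allegiances", "Gender")
      .where("Allegiances = 'Lannister' AND Gender = 0")
    df2.count()   // 12 per the video
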
1908
01:22:07,900 --> 01:22:11,311
Now, let us continue with
the next operations at the end
1909
01:22:11,311 --> 01:22:12,800
of the day every episode
1910
01:22:12,800 --> 01:22:14,800
of Game of Thrones had
a noble character.
1911
01:22:15,000 --> 01:22:17,365
Let us now find out all
the noble characters
1912
01:22:17,365 --> 01:22:18,664
amongst all the houses
1913
01:22:18,664 --> 01:22:21,193
that we have in our Game
of Thrones CSV file
1914
01:22:21,193 --> 01:22:24,100
the following code must help
us find out the details.
1915
01:22:25,600 --> 01:22:26,300
So the details
1916
01:22:26,300 --> 01:22:28,500
of all the characters
from all the houses
1917
01:22:28,500 --> 01:22:30,050
who are considered to be Noble.
1918
01:22:30,050 --> 01:22:32,200
have been saved
into the new data frame,
1919
01:22:32,200 --> 01:22:33,427
which is df3. Now,
1920
01:22:33,427 --> 01:22:36,800
let us print the details
from the df3 data frame.
1921
01:22:37,500 --> 01:22:40,000
So these are the top 20 members
from all the houses
1922
01:22:40,000 --> 01:22:42,900
who are considered to be Noble
along with their genders.
1923
01:22:42,900 --> 01:22:45,400
Now, let us count the total
number of noble characters
1924
01:22:45,400 --> 01:22:47,600
from the entire game
of thrones stories.
1925
01:22:48,300 --> 01:22:50,500
So there are a total
of four hundred and thirty
1926
01:22:50,500 --> 01:22:53,300
noble characters
existing in the whole Game
1927
01:22:53,300 --> 01:22:54,300
of Thrones story.
1928
01:22:54,800 --> 01:22:56,211
Nonetheless, we have also
1929
01:22:56,211 --> 01:22:59,086
seen a few commoners
whose role in The Game
1930
01:22:59,086 --> 01:23:01,700
of Thrones is found
to be exceptional. We shall
1931
01:23:01,700 --> 01:23:04,219
now find out the details
of all those commoners
1932
01:23:04,219 --> 01:23:07,300
who were highly dedicated
to their roles in each episode
1933
01:23:07,600 --> 01:23:08,700
the data of all,
1934
01:23:08,700 --> 01:23:10,700
the commoners has
been successfully loaded
1935
01:23:10,700 --> 01:23:11,900
into the new data frame,
1936
01:23:11,900 --> 01:23:14,202
which is df4. Now let
us print the data
1937
01:23:14,202 --> 01:23:17,500
which is present in the df4
using the show command.
1938
01:23:17,900 --> 01:23:20,396
So these are the top
20 characters identified as
1939
01:23:20,396 --> 01:23:23,004
commoners amongst all the Game
of Thrones stories.
1940
01:23:23,004 --> 01:23:25,400
Now, let us find out
the count of total number
1941
01:23:25,400 --> 01:23:26,600
of common characters.
1942
01:23:26,700 --> 01:23:27,649
So there are a total
1943
01:23:27,649 --> 01:23:30,099
of four hundred and
eighty-seven commoner characters
1944
01:23:30,099 --> 01:23:32,000
amongst all stories
of Game of Thrones.
1945
01:23:32,000 --> 01:23:34,100
Let us continue
with the further operations.
1946
01:23:34,100 --> 01:23:35,700
Now there were a few characters
1947
01:23:35,700 --> 01:23:37,700
who were considered
to be important
1948
01:23:37,700 --> 01:23:39,210
and equally noble. Hence,
1949
01:23:39,210 --> 01:23:41,526
they were carried on
until the last book.
1950
01:23:41,526 --> 01:23:43,644
So let us filter
out those characters
1951
01:23:43,644 --> 01:23:46,100
and find out the details
of each one of them.
1952
01:23:46,400 --> 01:23:49,520
The data of all the characters
who are considered to be Noble
1953
01:23:49,520 --> 01:23:50,300
and carried on
1954
01:23:50,300 --> 01:23:53,300
until the last book are being
stored into the new data frame,
1955
01:23:53,300 --> 01:23:55,629
which is df4. Now let
us print the data
1956
01:23:55,629 --> 01:23:56,652
which is existing
1957
01:23:56,652 --> 01:23:59,600
in the data frame df4. So
according to the results.
1958
01:23:59,600 --> 01:24:00,650
We have two candidates
1959
01:24:00,650 --> 01:24:03,300
who are considered to be
noble and their character
1960
01:24:03,300 --> 01:24:05,200
has been carried on
until the last book
1961
01:24:05,700 --> 01:24:06,900
amongst all the battles.
1962
01:24:06,900 --> 01:24:09,068
I found the battles
of the last books
1963
01:24:09,068 --> 01:24:11,900
to be generating more
adrenaline in the readers.
1964
01:24:11,900 --> 01:24:14,500
Let us find out the details
of those battles using
1965
01:24:14,500 --> 01:24:15,600
the following code.
1966
01:24:16,000 --> 01:24:18,700
So the following code will help
us to find out the wars
1967
01:24:18,700 --> 01:24:20,500
which were fought
in the last year's
1968
01:24:20,500 --> 01:24:21,700
of the Game of Thrones.
1969
01:24:22,100 --> 01:24:24,799
So these are the details
of the wars which were fought
1970
01:24:24,799 --> 01:24:26,800
in the last year's
of the Game of Thrones
1971
01:24:26,800 --> 01:24:28,200
and the details of the Kings
1972
01:24:28,300 --> 01:24:30,067
and the details
of their commanders
1973
01:24:30,067 --> 01:24:32,200
and the location
where the war was fought.
1974
01:24:36,700 --> 01:24:40,579
Welcome to this interesting
session of Spark SQL tutorial
1975
01:24:40,579 --> 01:24:41,600
from Edureka.
1976
01:24:41,600 --> 01:24:42,700
So in today's session,
1977
01:24:42,700 --> 01:24:46,100
we are going to learn about
how we will be working
1978
01:24:46,100 --> 01:24:48,500
with Spark SQL. Now what all you
1979
01:24:48,500 --> 01:24:51,944
can expect from this course
from this particular session
1980
01:24:51,944 --> 01:24:53,300
so you can expect that.
1981
01:24:53,300 --> 01:24:56,400
We will be first learning
why Spark SQL.
1982
01:24:56,500 --> 01:24:58,139
What are the libraries
1983
01:24:58,139 --> 01:25:00,600
which are present
in Spark SQL.
1984
01:25:00,600 --> 01:25:03,600
What are the important
features of Spark SQL?
1985
01:25:03,600 --> 01:25:06,400
We will also be doing
some Hands-On example
1986
01:25:06,400 --> 01:25:10,323
and in the end we will see
some interesting use case
1987
01:25:10,323 --> 01:25:13,300
of stock market analysis now
1988
01:25:13,400 --> 01:25:15,042
Why Spark SQL? Is it
1989
01:25:15,042 --> 01:25:19,200
like, why are we learning it,
why is it really important
1990
01:25:19,200 --> 01:25:22,067
for us to know about
this Spark SQL?
1991
01:25:22,067 --> 01:25:24,200
Is it like really hot in the market?
1992
01:25:24,200 --> 01:25:27,700
If yes, then why? We want
all those answers from this session.
1993
01:25:27,700 --> 01:25:30,500
So if you're coming
from a Hadoop background,
1994
01:25:30,500 --> 01:25:34,102
you must have heard a lot
about Apache Hive. Now,
1995
01:25:34,300 --> 01:25:36,100
what happens in Apache Hive?
1996
01:25:36,100 --> 01:25:39,061
Like in Apache
Hive, SQL developers
1997
01:25:39,061 --> 01:25:41,430
can write the queries in SQL way
1998
01:25:41,430 --> 01:25:43,800
and it will be getting converted
1999
01:25:43,800 --> 01:25:45,800
to your mapreduce
and giving you the output.
2000
01:25:46,400 --> 01:25:47,600
Now we all know
2001
01:25:47,600 --> 01:25:50,000
that mapreduce is
slower in nature.
2002
01:25:50,000 --> 01:25:52,726
And since mapreduce
is going to be slower
2003
01:25:52,726 --> 01:25:54,500
in nature, then definitely
2004
01:25:54,500 --> 01:25:58,000
your overall Hive query
is going to be slower in nature.
2005
01:25:58,000 --> 01:25:59,537
So that was one challenge.
2006
01:25:59,537 --> 01:26:02,361
So if you have let's say
less than 200 GB of data
2007
01:26:02,361 --> 01:26:04,400
or if you have
a smaller set of data.
2008
01:26:04,400 --> 01:26:06,800
This was actually
a big challenge
2009
01:26:06,800 --> 01:26:10,400
that in Hive your performance
was not that great.
2010
01:26:10,400 --> 01:26:13,900
It also does not have
any resuming capability; if it gets stuck,
2011
01:26:13,900 --> 01:26:15,900
you can just restart it. Also,
2012
01:26:15,900 --> 01:26:19,200
Hive cannot even drop
your encrypted databases.
2013
01:26:19,200 --> 01:26:21,082
That was also one
of the challenges
2014
01:26:21,082 --> 01:26:23,200
when you deal with
the security side.
2015
01:26:23,200 --> 01:26:25,082
Now what Spark SQL has done is,
2016
01:26:25,082 --> 01:26:28,300
Spark SQL has solved
almost all of the problems.
2017
01:26:28,300 --> 01:26:31,064
So in the last sessions
you have already learned
2018
01:26:31,064 --> 01:26:34,500
about the Spark way, right? How
Spark is faster than MapReduce
2019
01:26:34,500 --> 01:26:36,200
and all, we have already learned
2020
01:26:36,200 --> 01:26:38,800
that in the previous
few sessions. Now,
2021
01:26:38,800 --> 01:26:39,917
So in this session,
2022
01:26:39,917 --> 01:26:43,000
we are going to kind of take
a leverage of all that. So
2023
01:26:43,000 --> 01:26:44,800
definitely in this case
2024
01:26:44,800 --> 01:26:47,500
since this Spark is
faster because of
2025
01:26:47,500 --> 01:26:49,200
the in-memory computation.
2026
01:26:49,200 --> 01:26:50,866
What is in-memory computation?
2027
01:26:50,866 --> 01:26:52,200
We have already seen it.
2028
01:26:52,200 --> 01:26:55,105
So in-memory computation
is like whenever we
2029
01:26:55,105 --> 01:26:57,700
are Computing anything
in memory directly.
2030
01:26:57,700 --> 01:27:01,165
So because of in-memory
computation capability, because
2031
01:27:01,165 --> 01:27:02,800
of that Apache Spark is faster.
2032
01:27:02,800 --> 01:27:07,500
So definitely your spark SQL is
also bound to become fast now.
2033
01:27:07,500 --> 01:27:08,600
so if I talk
2034
01:27:08,600 --> 01:27:11,900
about the advantages
of Spark SQL over Hive,
2035
01:27:11,900 --> 01:27:14,970
definitely number one it
is going to be faster
2036
01:27:14,970 --> 01:27:17,900
in comparison to your Hive.
So a Hive query,
2037
01:27:17,900 --> 01:27:20,900
which is let's say
you're taking around 10 minutes
2038
01:27:20,900 --> 01:27:21,905
in Spark SQL,
2039
01:27:21,905 --> 01:27:25,300
You can finish that same query
in less than one minute.
2040
01:27:25,300 --> 01:27:27,400
Don't you think it's
an awesome capability
2041
01:27:27,400 --> 01:27:31,400
of Spark SQL? Definitely, yes,
right. Now second thing is
2042
01:27:31,400 --> 01:27:34,400
when if let's say you
are writing something in Hive.
2043
01:27:34,400 --> 01:27:36,148
now you can take an example
2044
01:27:36,148 --> 01:27:39,751
of let's say a company
who is let's say developing Hive
2045
01:27:39,751 --> 01:27:41,467
queries from the last 10 years.
2046
01:27:41,467 --> 01:27:42,900
Now they were doing it.
2047
01:27:42,900 --> 01:27:44,000
They were all happy
2048
01:27:44,000 --> 01:27:46,000
that they were able
to process it, except
2049
01:27:46,100 --> 01:27:48,200
that they were worried
about the performance
2050
01:27:48,200 --> 01:27:50,778
that Hive is not able
to give them that level
2051
01:27:50,778 --> 01:27:53,273
of processing speed which
they are looking for.
2052
01:27:53,273 --> 01:27:54,160
Now this was
2053
01:27:54,160 --> 01:27:56,600
a challenge
for that particular company.
2054
01:27:56,600 --> 01:27:58,801
Now, there's a challenge right?
2055
01:27:58,801 --> 01:28:01,397
The challenge is
they came to know
2056
01:28:01,397 --> 01:28:02,900
about Spark SQL, fine.
2057
01:28:02,900 --> 01:28:04,685
Let's say we came
to know about it,
2058
01:28:04,685 --> 01:28:05,853
but they came to know
2059
01:28:05,853 --> 01:28:08,300
that we can execute
everything in Spark SQL
2060
01:28:08,300 --> 01:28:10,700
and it is going to be
faster as well fine.
2061
01:28:10,700 --> 01:28:12,281
But don't you think that
2062
01:28:12,281 --> 01:28:15,708
if this company is working
for, let's say, the past 10 years
2063
01:28:15,708 --> 01:28:19,200
in Hive, they must have already
written a lot of code in Hive.
2064
01:28:19,200 --> 01:28:23,100
now if you ask them to migrate
to Spark SQL, will it be
2065
01:28:23,100 --> 01:28:24,400
an easy task?
2066
01:28:24,400 --> 01:28:25,200
No, right.
2067
01:28:25,200 --> 01:28:25,982
Definitely.
2068
01:28:25,982 --> 01:28:28,384
It is not going
to be an easy task.
2069
01:28:28,384 --> 01:28:32,200
Why? Because Hive syntax
and Spark SQL syntax, though
2070
01:28:32,200 --> 01:28:35,800
they both follow the SQL way
of writing the things,
2071
01:28:35,800 --> 01:28:39,346
but at the same time
it is always the case that
2072
01:28:39,346 --> 01:28:41,500
it carries a big difference,
2073
01:28:41,500 --> 01:28:44,300
so there will be a good
difference whenever we talk
2074
01:28:44,300 --> 01:28:45,905
about the syntax between them.
2075
01:28:45,905 --> 01:28:48,100
So it will take a very
good amount of time
2076
01:28:48,100 --> 01:28:51,017
for that company to change
all of the queries made
2077
01:28:51,017 --> 01:28:54,052
to the Spark SQL way.
Now Spark SQL came up
2078
01:28:54,052 --> 01:28:55,426
with a smart solution.
2079
01:28:55,426 --> 01:28:56,899
what they said is even
2080
01:28:56,899 --> 01:28:58,900
if you are writing
the query with Hive,
2081
01:28:58,900 --> 01:29:01,300
you can execute
that Hive query directly
2082
01:29:01,300 --> 01:29:03,500
through Spark SQL. Don't you
think it's again
2083
01:29:03,500 --> 01:29:06,600
a very important
and awesome facility, right?
2084
01:29:06,600 --> 01:29:09,900
Because even now
if you're a good Hive developer,
2085
01:29:09,900 --> 01:29:12,000
you need not worry about
2086
01:29:12,000 --> 01:29:15,600
how you will be
migrating to Spark SQL.
2087
01:29:15,600 --> 01:29:18,658
Well, you can still keep on
writing the Hive query
2088
01:29:18,658 --> 01:29:20,900
and your query
will automatically be
2089
01:29:20,900 --> 01:29:24,767
getting converted to Spark SQL.
Similarly, in Apache Spark,
2090
01:29:24,767 --> 01:29:27,200
as we have learned
in the past sessions,
2091
01:29:27,200 --> 01:29:30,100
especially through Spark
streaming, that Spark
2092
01:29:30,100 --> 01:29:33,600
streaming is going to make
your real-time processing, right?
2093
01:29:33,600 --> 01:29:36,000
You can also perform
your real-time processing
2094
01:29:36,000 --> 01:29:37,615
using Apache Spark. Now,
2095
01:29:37,615 --> 01:29:39,500
this sort of facility you
2096
01:29:39,500 --> 01:29:41,800
can take leverage of even
in Spark SQL.
2097
01:29:41,800 --> 01:29:44,235
So let's say you can do
a real-time processing
2098
01:29:44,235 --> 01:29:46,400
and at the same time
you can also perform
2099
01:29:46,400 --> 01:29:47,860
your SQL query. Now in Hive
2100
01:29:47,860 --> 01:29:49,120
that was the problem.
2101
01:29:49,120 --> 01:29:49,900
You cannot do
2102
01:29:49,900 --> 01:29:52,900
that because when we talk
about Hive, now in Hive
2103
01:29:52,900 --> 01:29:54,320
it's all about Hadoop. Hadoop is
2104
01:29:54,320 --> 01:29:56,663
all about batch
processing. Batch processing,
2105
01:29:56,663 --> 01:29:58,509
where you keep historical data
2106
01:29:58,509 --> 01:30:00,736
and then later you
process it, right?
2107
01:30:00,736 --> 01:30:03,699
So definitely Hive also
follows the same approach.
2108
01:30:03,699 --> 01:30:05,300
In this case also, Hive is
2109
01:30:05,300 --> 01:30:07,850
going to just only follow
the batch processing mode,
2110
01:30:07,850 --> 01:30:09,600
But when it comes to Apache Spark,
2111
01:30:09,600 --> 01:30:13,500
it will also be taking care
of the real-time processing.
2112
01:30:13,500 --> 01:30:15,499
So how do all these things happen?
2113
01:30:15,499 --> 01:30:18,400
So our Spark SQL always
uses your metastore
2114
01:30:18,400 --> 01:30:21,350
services of your Hive
to query the data stored
2115
01:30:21,350 --> 01:30:22,400
and managed by Hive.
2116
01:30:22,400 --> 01:30:24,728
So when you were
learning about Hive,
2117
01:30:24,728 --> 01:30:28,123
we have learned at that time
that in Hive, everything
2118
01:30:28,123 --> 01:30:30,711
what we do is always
stored in the metastore,
2119
01:30:30,711 --> 01:30:33,491
so that metastore was
the crucial point, right?
2120
01:30:33,491 --> 01:30:35,200
Because using that metastore
2121
01:30:35,200 --> 01:30:37,600
only you are able
to do everything.
2122
01:30:37,600 --> 01:30:41,100
So like when you are doing
let's say any sort of query,
2123
01:30:41,100 --> 01:30:42,707
when you're creating a table,
2124
01:30:42,707 --> 01:30:45,700
everything was getting stored
in that same metastore.
2125
01:30:45,700 --> 01:30:47,559
What happens, Spark SQL
2126
01:30:47,559 --> 01:30:51,800
also uses the same metastore.
Now, whatever metastore
2127
01:30:51,800 --> 01:30:55,051
You have created with respect
to Hive, the same metastore
2128
01:30:55,051 --> 01:30:56,219
You can also use it
2129
01:30:56,219 --> 01:30:58,900
for your Spark SQL
and that is something
2130
01:30:58,900 --> 01:31:02,000
which is really awesome
about this Spark SQL,
2131
01:31:02,000 --> 01:31:04,000
that you need not create
a new metastore.
2132
01:31:04,000 --> 01:31:06,300
You need not worry
about a new storage space
2133
01:31:06,300 --> 01:31:07,404
and all. Everything
2134
01:31:07,404 --> 01:31:10,820
what you have done with respect
to your Hive, the same metastore
2135
01:31:10,820 --> 01:31:11,620
you can use it.
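
A minimal sketch of how a SparkSession can be pointed at the existing Hive metastore (the app name and warehouse path are assumptions):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("HiveOnSpark")
      .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
      .enableHiveSupport()   // reuse the Hive metastore instead of creating one
      .getOrCreate()
    // a table created in Hive can now be queried straight from Spark SQL
    spark.sql("SELECT * FROM employees").show()
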
2136
01:31:11,620 --> 01:31:11,833
Now.
2137
01:31:11,833 --> 01:31:13,700
You can ask me then
how it is faster
2138
01:31:13,700 --> 01:31:15,700
if they're using
the same metastore. Remember
2139
01:31:15,700 --> 01:31:18,500
the processing part?
Why Hive was slower
2140
01:31:18,500 --> 01:31:20,301
because of its processing way
2141
01:31:20,301 --> 01:31:23,519
because it is converting
everything to the mapreduce
2142
01:31:23,519 --> 01:31:26,782
and thus it was making
the processing very very slow.
2143
01:31:26,782 --> 01:31:28,100
But here in this case
2144
01:31:28,100 --> 01:31:31,452
since the processing is going
to be in memory computation.
2145
01:31:31,452 --> 01:31:32,705
So in Spark SQL's case,
2146
01:31:32,705 --> 01:31:35,588
it is always going to be
faster. Now definitely,
2147
01:31:35,588 --> 01:31:37,545
it is just because of
the metastore side
2148
01:31:37,545 --> 01:31:39,600
We are only able
to fetch the data or
2149
01:31:39,600 --> 01:31:42,129
not but at the same time
for any other thing
2150
01:31:42,129 --> 01:31:44,100
of the processing related stuff,
2151
01:31:44,100 --> 01:31:46,200
it is always going to be fast.
2152
01:31:46,200 --> 01:31:48,180
when we talk about
the processing stage
2153
01:31:48,180 --> 01:31:51,200
it is going to be in memory
thus it's going to be faster.
2154
01:31:51,300 --> 01:31:54,335
So let's talk about some success
stories of Spark SQL.
2155
01:31:54,335 --> 01:31:57,550
Let's see some use cases:
Twitter sentiment analysis.
2156
01:31:57,550 --> 01:31:58,844
If you go through, or
2157
01:31:58,844 --> 01:32:01,699
if you, let's say, remember
our spark streaming session,
2158
01:32:01,700 --> 01:32:04,300
we have done a Twitter
sentiment analysis, right?
2159
01:32:04,300 --> 01:32:05,400
So there you have seen
2160
01:32:05,400 --> 01:32:08,497
that we have first initially
got the data from Twitter and
2161
01:32:08,497 --> 01:32:10,400
that too we have got
it with the help
2162
01:32:10,400 --> 01:32:11,911
of Spark streaming and later
2163
01:32:11,911 --> 01:32:13,000
what we did later.
2164
01:32:13,000 --> 01:32:15,600
We just analyze everything
with the help of Spark
2165
01:32:15,600 --> 01:32:18,080
SQL. So you can see
an advantage of Spark SQL.
2166
01:32:18,080 --> 01:32:19,761
So in Twitter sentiment analysis
2167
01:32:19,761 --> 01:32:21,600
where let's say
you want to find out
2168
01:32:21,600 --> 01:32:23,200
about the Donald Trump, right?
2169
01:32:23,200 --> 01:32:24,509
You are fetching the data
2170
01:32:24,509 --> 01:32:26,547
every tweet related
to the Donald Trump
2171
01:32:26,547 --> 01:32:28,900
and then kind of doing
analysis and checking
2172
01:32:28,900 --> 01:32:31,200
that whether it's
a positive tweet, negative
2173
01:32:31,200 --> 01:32:32,475
tweet neutral tweet,
2174
01:32:32,475 --> 01:32:34,900
very negative or very
positive tweet.
2175
01:32:34,900 --> 01:32:37,257
Okay, so we have already
seen the same example there
2176
01:32:37,257 --> 01:32:38,607
in that particular session.
2177
01:32:38,607 --> 01:32:39,549
So in this session,
2178
01:32:39,549 --> 01:32:40,499
as you are noticing
2179
01:32:40,499 --> 01:32:42,600
what we are doing is we
just want to kind of show
2180
01:32:42,600 --> 01:32:44,202
that once you're
streaming the data
2181
01:32:44,202 --> 01:32:45,900
in the real time
you can also do it.
2182
01:32:45,900 --> 01:32:47,977
the same thing using
Spark SQL, just you
2183
01:32:47,977 --> 01:32:50,724
are doing all the processing
at the real time similarly
2184
01:32:50,724 --> 01:32:52,270
in the stock market analysis.
2185
01:32:52,270 --> 01:32:54,295
You can use Spark
SQL at a lot of places.
2186
01:32:54,295 --> 01:32:57,400
You can adopt it in the banking
fraud case transactions and all;
2187
01:32:57,400 --> 01:32:58,400
you can use that.
2188
01:32:58,400 --> 01:33:01,000
So let's say your credit
card is getting swiped
2189
01:33:01,000 --> 01:33:02,580
in India and in the next 10 minutes,
2190
01:33:02,580 --> 01:33:04,429
if your credit card
is getting swiped
2191
01:33:04,429 --> 01:33:05,456
in, let's say, the US,
2192
01:33:05,456 --> 01:33:07,100
Definitely that is not possible.
2193
01:33:07,100 --> 01:33:07,400
Right?
2194
01:33:07,400 --> 01:33:09,872
So let's say you are doing all
that processing real-time.
2195
01:33:09,872 --> 01:33:12,300
You're detecting everything
with respect to Spark streaming.
2196
01:33:12,300 --> 01:33:15,400
Then you are let's say applying
your Spark SQL to verify
2197
01:33:15,400 --> 01:33:18,000
whether it's
a user trend or not, right?
2198
01:33:18,000 --> 01:33:20,600
So all those things you want
to match up using Spark SQL.
2199
01:33:20,600 --> 01:33:21,960
So you can do that. Similarly, in
2200
01:33:21,960 --> 01:33:23,750
the medical domain
you can use that.
2201
01:33:23,750 --> 01:33:25,949
Let's talk about
some Spark SQL features.
2202
01:33:25,949 --> 01:33:28,200
So there will be
some features related to it.
2203
01:33:28,400 --> 01:33:30,200
Now, you see,
2204
01:33:30,200 --> 01:33:33,700
what happens when this SQL
got combined with this Spark:
2205
01:33:33,700 --> 01:33:34,830
We started calling it
2206
01:33:34,830 --> 01:33:35,825
as Spark SQL. Now
2207
01:33:35,825 --> 01:33:38,700
when definitely we are talking
about SQL, we are talking
2208
01:33:38,700 --> 01:33:40,405
about either a structured data
2209
01:33:40,405 --> 01:33:41,800
or a semi-structured data. Now
2210
01:33:41,800 --> 01:33:44,231
SQL queries cannot deal
with the unstructured data,
2211
01:33:44,231 --> 01:33:47,300
so that is definitely one of
the things you need to keep in mind.
2212
01:33:47,300 --> 01:33:51,000
Now your Spark SQL also
supports various data formats.
2213
01:33:51,000 --> 01:33:52,800
You can get the data from Parquet.
2214
01:33:52,800 --> 01:33:54,500
You must have heard about Parquet,
2215
01:33:54,500 --> 01:33:56,911
that it is a columnar
based storage and it
2216
01:33:56,911 --> 01:33:59,884
is kind of very much
compressed format of the data
2217
01:33:59,884 --> 01:34:02,300
what you have but it's
not human readable.
2218
01:34:02,300 --> 01:34:02,800
Similarly.
2219
01:34:02,800 --> 01:34:04,800
You must have heard
about JSON, Avro,
2220
01:34:04,800 --> 01:34:07,200
where we keep the value
as a key value pair.
2221
01:34:07,200 --> 01:34:08,482
HBase, Cassandra, right?
2222
01:34:08,482 --> 01:34:09,700
These are NoSQL DBs,
2223
01:34:09,700 --> 01:34:12,800
so you can get all the data
from these sources now.
2224
01:34:12,800 --> 01:34:15,114
You can also convert
your SQL queries
2225
01:34:15,114 --> 01:34:16,400
to your RDDs,
2226
01:34:16,400 --> 01:34:18,650
so you
will be able to perform
2227
01:34:18,650 --> 01:34:20,113
all the transformation steps.
2228
01:34:20,113 --> 01:34:21,800
So that is one thing you can do.
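
Sketched reads for the formats mentioned, with illustrative HDFS paths:

    val parquetDF = spark.read.parquet("hdfs://localhost:9000/data/input.parquet")
    val jsonDF    = spark.read.json("hdfs://localhost:9000/data/input.json")
    val csvDF     = spark.read.option("header", "true")
                              .csv("hdfs://localhost:9000/data/input.csv")
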
2229
01:34:21,800 --> 01:34:23,500
Now if we talk about performance
2230
01:34:23,500 --> 01:34:26,700
and scalability definitely
on this red color graph.
2231
01:34:26,700 --> 01:34:29,431
If you notice this
is related to your Hadoop,
2232
01:34:29,431 --> 01:34:30,300
you can notice
2233
01:34:30,300 --> 01:34:34,000
that red color graph is much
more in comparison to blue color
2234
01:34:34,000 --> 01:34:36,617
and blue color denotes
my performance with respect
2235
01:34:36,617 --> 01:34:37,503
to Spark SQL,
2236
01:34:37,503 --> 01:34:40,856
so you can notice that spark
SQL is performing much better
2237
01:34:40,856 --> 01:34:42,684
in comparison to your Hadoop.
2238
01:34:42,684 --> 01:34:44,260
So on this Y-axis,
2239
01:34:44,260 --> 01:34:45,900
we are taking the running time.
2240
01:34:46,000 --> 01:34:47,200
On the x-axis.
2241
01:34:47,200 --> 01:34:50,119
We were considering
the number of iterations.
2242
01:34:50,119 --> 01:34:53,000
when we talk about
Spark SQL features,
2243
01:34:53,000 --> 01:34:56,000
Now few more features
we have. For example,
2244
01:34:56,000 --> 01:34:59,200
you can create a connection
with your simple JDBC driver
2245
01:34:59,200 --> 01:35:00,494
or ODBC driver, right?
2246
01:35:00,494 --> 01:35:02,482
These are simple
drivers being present.
2247
01:35:02,482 --> 01:35:03,600
Now, you can create
2248
01:35:03,600 --> 01:35:06,700
your connection with this Spark
SQL using all these drivers.
2249
01:35:06,700 --> 01:35:10,000
You can also create a user
defined function means let's say
2250
01:35:10,000 --> 01:35:12,200
if any function is
not available to you
2251
01:35:12,200 --> 01:35:14,600
in that case you can create
your own functions.
2252
01:35:14,600 --> 01:35:16,900
Let's say if a function
is available, use
2253
01:35:16,900 --> 01:35:18,639
that; if it is not available,
2254
01:35:18,639 --> 01:35:21,497
you can create a UDF means
user-defined function
2255
01:35:21,497 --> 01:35:23,235
and you can directly execute
2256
01:35:23,235 --> 01:35:26,478
that user-defined function
and get your results.
2257
01:35:26,478 --> 01:35:28,900
So this is one example
where we have shown
2258
01:35:28,900 --> 01:35:30,100
that you can convert.
2259
01:35:30,100 --> 01:35:33,000
Let's say if you don't have
an uppercase API present
2260
01:35:33,000 --> 01:35:36,405
in Spark SQL, how you
can create a simple UDF for it
2261
01:35:36,405 --> 01:35:37,700
and can execute it.
2262
01:35:37,700 --> 01:35:38,850
So if you notice there
2263
01:35:38,850 --> 01:35:41,200
what we are doing
let's get this is my data.
2264
01:35:41,200 --> 01:35:42,700
So if you notice in this case,
2265
01:35:43,069 --> 01:35:45,530
this is data set is
my data part.
2266
01:35:45,800 --> 01:35:48,100
So this is I'm generating
as a sequence.
2267
01:35:48,100 --> 01:35:51,800
I'm creating it as a data frame
see this 2df part here.
2268
01:35:51,800 --> 01:35:55,100
Now after that we
are creating a / U DF here
2269
01:35:55,100 --> 01:35:58,217
and notice we are converting
any value which is coming
2270
01:35:58,217 --> 01:35:59,600
to my upper case, right?
2271
01:35:59,600 --> 01:36:02,000
We are using this to uppercase
API to convert it.
2272
01:36:02,100 --> 01:36:05,800
We are importing this function
and then what we did now
2273
01:36:05,800 --> 01:36:08,100
when we came here,
we are telling that okay.
2274
01:36:08,100 --> 01:36:09,236
This is my UDF.
2275
01:36:09,236 --> 01:36:10,600
So UDF is upper by
2276
01:36:10,600 --> 01:36:12,719
because we have created
here also a zapper.
2277
01:36:12,719 --> 01:36:13,569
So we are telling
2278
01:36:13,569 --> 01:36:16,100
that this is my UDF
in the first step and then Then
2279
01:36:16,100 --> 01:36:17,153
when we are using it,
2280
01:36:17,153 --> 01:36:20,253
let's say with our datasets
what we are doing so data sets.
2281
01:36:20,253 --> 01:36:22,100
We are passing year
that okay, whatever.
2282
01:36:22,100 --> 01:36:23,393
We are doing convert it
2283
01:36:23,393 --> 01:36:26,600
to my upper developer you DFX
convert it to my upper case.
2284
01:36:26,600 --> 01:36:29,100
So see we are telling you
we have created our / UDF
2285
01:36:29,100 --> 01:36:31,500
that is what we are passing
inside this text value.
2286
01:36:31,800 --> 01:36:34,600
So now it is just
getting converted
2287
01:36:34,600 --> 01:36:37,600
and giving you all the output
in your upper case way
2288
01:36:37,600 --> 01:36:40,400
so you can notice
that this is your last value
2289
01:36:40,400 --> 01:36:42,700
and this is your
uppercase value, right?
2290
01:36:42,700 --> 01:36:43,841
So this got converted
2291
01:36:43,841 --> 01:36:45,900
to my upper case
in this particular.
2292
01:36:45,900 --> 01:36:46,500
Love it.
2293
01:36:46,500 --> 01:36:46,900
Now.
2294
01:36:46,900 --> 01:36:49,123
If you notice here
also same steps.
2295
01:36:49,123 --> 01:36:52,000
We are how to we
can register all of our UDF.
2296
01:36:52,000 --> 01:36:53,620
This is not being shown here.
2297
01:36:53,620 --> 01:36:55,800
So now this is
how you can do that spark
2298
01:36:55,800 --> 01:36:57,354
that UDF not register.
2299
01:36:57,354 --> 01:36:58,574
So using this API,
2300
01:36:58,574 --> 01:37:02,100
you can just register
your data frames now similarly,
2301
01:37:02,100 --> 01:37:03,870
if you want to get the output
2302
01:37:03,870 --> 01:37:06,800
after that you can get
it using this following me
2303
01:37:06,800 --> 01:37:09,900
so you can use the show API
to get the output
2304
01:37:09,900 --> 01:37:12,100
for this Sparks
equal at attacher.
2305
01:37:12,100 --> 01:37:13,800
Let's see that so what is Park
2306
01:37:13,800 --> 01:37:16,400
sequel architecture now is
Park sequel architecture
2307
01:37:16,400 --> 01:37:18,100
if we talked about so
what happens to your let
2308
01:37:18,100 --> 01:37:19,900
's say getting the data
of with using
2309
01:37:19,900 --> 01:37:21,500
your various formats, right?
2310
01:37:21,500 --> 01:37:23,911
So let's say you can get
it from your CSP.
2311
01:37:23,911 --> 01:37:26,056
You can get it
from your Json format.
2312
01:37:26,056 --> 01:37:28,475
You can also get it
from your jdbc format.
2313
01:37:28,475 --> 01:37:30,400
Now, they will be
a data source API.
2314
01:37:30,400 --> 01:37:31,708
So using data source API,
2315
01:37:31,708 --> 01:37:34,273
you can fetch the data
after fetching the data
2316
01:37:34,273 --> 01:37:36,300
you will be converting
to a data frame
2317
01:37:36,300 --> 01:37:38,000
where so what is data frame.
2318
01:37:38,000 --> 01:37:39,833
So in the last one
we have learned
2319
01:37:39,833 --> 01:37:42,892
that that when we were creating
everything is already
2320
01:37:42,892 --> 01:37:43,900
what we were doing.
2321
01:37:43,900 --> 01:37:46,437
So, let's say this was
my Cluster, right?
2322
01:37:46,437 --> 01:37:48,358
So let's say this is machine.
2323
01:37:48,358 --> 01:37:49,860
This is another machine.
2324
01:37:49,860 --> 01:37:51,800
This is another machine, right?
2325
01:37:51,800 --> 01:37:53,757
So let's say these are
all my clusters.
2326
01:37:53,757 --> 01:37:55,703
So what we were doing
in this case now
2327
01:37:55,703 --> 01:37:58,700
when we were creating all
these things are as were cluster
2328
01:37:58,700 --> 01:38:00,000
what was happening here.
2329
01:38:00,000 --> 01:38:02,600
We were passing
Oliver values him, right?
2330
01:38:02,600 --> 01:38:04,739
So let's say we
were keeping all the data.
2331
01:38:04,739 --> 01:38:06,200
Let's say block B1 was there
2332
01:38:06,200 --> 01:38:08,850
so we were passing all
the values and work creating it
2333
01:38:08,850 --> 01:38:11,400
in the form of in the memory
and we were calling
2334
01:38:11,400 --> 01:38:12,800
that as rdd now
2335
01:38:12,800 --> 01:38:16,094
when we were walking in SQL
we have to store the the data
2336
01:38:16,094 --> 01:38:17,900
which is a table of data, right?
2337
01:38:17,900 --> 01:38:19,200
So let's say there is a table
2338
01:38:19,200 --> 01:38:21,200
which is let's say
having column details.
2339
01:38:21,200 --> 01:38:23,200
Let's say name age.
2340
01:38:23,200 --> 01:38:24,024
Let's say here.
2341
01:38:24,024 --> 01:38:26,236
I have some value here
are some value here.
2342
01:38:26,236 --> 01:38:28,506
I have some value here
at some value, right?
2343
01:38:28,506 --> 01:38:31,200
So let's say I have some value
of this table format.
2344
01:38:31,200 --> 01:38:34,200
Now if I have to keep
this data into my cluster
2345
01:38:34,200 --> 01:38:35,200
what you need to do,
2346
01:38:35,200 --> 01:38:37,962
so you will be keeping first
of all into the memory.
2347
01:38:37,962 --> 01:38:39,100
So you will be having
2348
01:38:39,100 --> 01:38:42,418
let's say name H this column
to test first of all year
2349
01:38:42,418 --> 01:38:45,767
and after that you will be
having some details of this.
2350
01:38:45,767 --> 01:38:46,210
Perfect.
2351
01:38:46,210 --> 01:38:47,804
So let's say this much data,
2352
01:38:47,804 --> 01:38:49,900
you have some part
in the similar kind
2353
01:38:49,900 --> 01:38:52,572
of table with some other values
will be here also,
2354
01:38:52,572 --> 01:38:55,300
but here also you are going
to have column details.
2355
01:38:55,300 --> 01:38:58,500
You will be having name H
some more data here.
2356
01:38:58,600 --> 01:39:02,600
Now if you notice this
is sounding similar to our DD,
2357
01:39:02,700 --> 01:39:06,000
but this is not exactly
like our GD right
2358
01:39:06,000 --> 01:39:09,400
because here we are not only
keeping just the data but we
2359
01:39:09,400 --> 01:39:12,500
are also studying something
like a column in a storage
2360
01:39:12,500 --> 01:39:12,861
right?
2361
01:39:12,861 --> 01:39:15,400
We also the keeping
the column in all of it.
2362
01:39:15,400 --> 01:39:18,500
Data nodes or we can call it as
if Burke or not, right?
2363
01:39:18,500 --> 01:39:20,653
So we are also keeping
the column vectors
2364
01:39:20,653 --> 01:39:22,000
along with the rule test.
2365
01:39:22,000 --> 01:39:24,700
So this thing is called
as data frames.
2366
01:39:24,700 --> 01:39:26,600
Okay, so that is called
your data frame.
2367
01:39:26,600 --> 01:39:29,400
So that is what we are going to
do is we are going to convert it
2368
01:39:29,400 --> 01:39:31,057
to a data frame API then
2369
01:39:31,057 --> 01:39:35,200
using the data frame TSS or by
using Sparks equal to H square
2370
01:39:35,200 --> 01:39:37,550
or you will be processing
the results and giving
2371
01:39:37,550 --> 01:39:40,300
the output we will learn about
all these things in detail.
2372
01:39:40,600 --> 01:39:44,100
So, let's see this Popsicle
libraries now there are
2373
01:39:44,100 --> 01:39:45,800
multiple apis available.
2374
01:39:45,800 --> 01:39:48,700
This like we have
data source API we
2375
01:39:48,700 --> 01:39:50,500
have data frame API.
2376
01:39:50,500 --> 01:39:53,510
We have interpreter
and Optimizer and SQL service.
2377
01:39:53,510 --> 01:39:55,600
We will explore
all this in detail.
2378
01:39:55,600 --> 01:39:58,000
So let's talk about
data source appear
2379
01:39:58,000 --> 01:40:02,787
if we talk about data source API
what happens in data source API,
2380
01:40:02,787 --> 01:40:04,133
it is used to read
2381
01:40:04,133 --> 01:40:07,364
and store the structured
and unstructured data
2382
01:40:07,364 --> 01:40:08,800
into your spark SQL.
2383
01:40:08,800 --> 01:40:12,200
So as you can notice in Sparks
equal we can give fetch the data
2384
01:40:12,200 --> 01:40:13,437
using multiple sources
2385
01:40:13,437 --> 01:40:15,800
like you can get it
from hive take Cosette.
2386
01:40:15,800 --> 01:40:18,800
Inverse ESP Apache
BSD base Oracle DB so
2387
01:40:18,800 --> 01:40:20,300
many formats available, right?
2388
01:40:20,300 --> 01:40:21,427
So this API is going
2389
01:40:21,427 --> 01:40:24,956
to help you to get all the data
to read all the data store it
2390
01:40:24,956 --> 01:40:26,700
where ever you want to use it.
2391
01:40:26,700 --> 01:40:28,387
Now after that your data
2392
01:40:28,387 --> 01:40:31,200
frame API is going
to help you to convert
2393
01:40:31,200 --> 01:40:33,100
that into a named Colin
2394
01:40:33,100 --> 01:40:34,700
and remember I
just explained you
2395
01:40:34,800 --> 01:40:36,902
that how you store
the data in that
2396
01:40:36,902 --> 01:40:39,793
because here you are not keeping
like I did it.
2397
01:40:39,793 --> 01:40:42,100
You're also keeping
the named column as
2398
01:40:42,100 --> 01:40:45,500
well as Road it is That is
the difference coming up here.
2399
01:40:45,500 --> 01:40:47,382
So that is
what it is converting.
2400
01:40:47,382 --> 01:40:48,100
In this case.
2401
01:40:48,100 --> 01:40:50,561
We are using data
frame API to convert it
2402
01:40:50,561 --> 01:40:52,900
into your named column
and rows, right?
2403
01:40:52,900 --> 01:40:54,600
So that is what you
will be doing.
2404
01:40:54,600 --> 01:40:57,700
So at it also follows the same
properties like your IDs
2405
01:40:57,700 --> 01:40:59,993
like your attitude is
Pearl easily evaluated
2406
01:40:59,993 --> 01:41:02,500
in all same properties
will also follow up here.
2407
01:41:02,500 --> 01:41:06,000
Okay now interpret
an Optimizer and interpreter
2408
01:41:06,000 --> 01:41:08,485
and Optimizer step
what we are going to do.
2409
01:41:08,485 --> 01:41:11,184
So, let's see if we have
this data frame API,
2410
01:41:11,184 --> 01:41:13,700
so we are going to first
create this name.
2411
01:41:13,700 --> 01:41:17,800
Column then after that we
will be now creating an rdd.
2412
01:41:17,800 --> 01:41:20,400
We will be applying
our transformation step.
2413
01:41:20,400 --> 01:41:23,877
We will be doing over action
step right to Output the value.
2414
01:41:23,877 --> 01:41:25,040
So all those things
2415
01:41:25,040 --> 01:41:28,100
where it is happens it happening
in The Interpreter
2416
01:41:28,100 --> 01:41:29,400
and optimizes them.
2417
01:41:29,400 --> 01:41:33,500
So this is all happening
in The Interpreter and optimism.
2418
01:41:33,600 --> 01:41:36,000
So this is what all
the features you have.
2419
01:41:36,000 --> 01:41:39,500
Now, let's talk about
SQL service now in SQL service
2420
01:41:39,500 --> 01:41:41,934
what happens it is going
to again help you
2421
01:41:41,934 --> 01:41:43,698
so it is just doing the order.
2422
01:41:43,698 --> 01:41:45,200
Formation action the last day
2423
01:41:45,200 --> 01:41:47,567
after that using
spark SQL service,
2424
01:41:47,567 --> 01:41:50,700
you will be getting
your spark sequel outputs.
2425
01:41:50,700 --> 01:41:54,200
So now in this case whatever
processing you have done right
2426
01:41:54,200 --> 01:41:57,500
in terms of transformations
in all of that so you can see
2427
01:41:57,500 --> 01:42:01,600
that your sparkers SQL service
is an entry point for working
2428
01:42:01,600 --> 01:42:04,486
along the structure data
in your aperture spur.
2429
01:42:04,486 --> 01:42:04,800
Okay.
2430
01:42:04,800 --> 01:42:07,611
So it is going to kind of
help you to fetch the results
2431
01:42:07,611 --> 01:42:08,700
from your optimize data
2432
01:42:08,700 --> 01:42:10,900
or maybe whatever you
have interpreted before
2433
01:42:10,900 --> 01:42:12,100
so that is what it's doing.
2434
01:42:12,100 --> 01:42:13,400
So this kind of completes.
2435
01:42:13,500 --> 01:42:15,400
This whole diagram now,
2436
01:42:15,400 --> 01:42:18,082
let us see that how we
can perform a work queries
2437
01:42:18,082 --> 01:42:19,200
using spark sequin.
2438
01:42:19,200 --> 01:42:21,435
Now if we talk
about spark SQL queries,
2439
01:42:21,435 --> 01:42:22,376
so first of all,
2440
01:42:22,376 --> 01:42:25,348
we can go to spark cell itself
engine execute everything.
2441
01:42:25,348 --> 01:42:27,253
You can also execute
your program using
2442
01:42:27,253 --> 01:42:29,500
spark your Eclipse also
directing from there.
2443
01:42:29,500 --> 01:42:30,600
Also, you can do that.
2444
01:42:30,600 --> 01:42:33,249
So if you are let's say log in
with your spark shell session.
2445
01:42:33,249 --> 01:42:34,200
So what you can do,
2446
01:42:34,200 --> 01:42:36,700
so let's say you have first
you need to import this
2447
01:42:36,700 --> 01:42:38,464
because into point x
you must have heard
2448
01:42:38,464 --> 01:42:40,500
that there is something
called as Park session
2449
01:42:40,500 --> 01:42:42,197
which came so that is
what we are doing.
2450
01:42:42,197 --> 01:42:44,200
So in our last session
we have Have you learned
2451
01:42:44,200 --> 01:42:47,077
about all these things are
now Sparkstation is something
2452
01:42:47,077 --> 01:42:48,700
but we're importing after that.
2453
01:42:48,700 --> 01:42:51,940
We are creating sessions path
using a builder function.
2454
01:42:51,940 --> 01:42:52,704
Look at this.
2455
01:42:52,704 --> 01:42:55,822
So This Builder API you we
are using this Builder API,
2456
01:42:55,822 --> 01:42:57,458
then we are using the app name.
2457
01:42:57,458 --> 01:43:00,256
We are providing a configuration
and then we are telling
2458
01:43:00,256 --> 01:43:02,860
that we are going to create
our values here, right?
2459
01:43:02,860 --> 01:43:05,100
So we had that's why
we are giving get okay,
2460
01:43:05,100 --> 01:43:07,987
then we are importing
all these things right
2461
01:43:07,987 --> 01:43:09,800
once we imported after that
2462
01:43:09,800 --> 01:43:10,900
we can say that okay.
2463
01:43:10,900 --> 01:43:12,731
We were want to read
this Json file.
2464
01:43:12,731 --> 01:43:15,400
So this implies God
or Jason we want to read up here
2465
01:43:15,400 --> 01:43:18,398
and in the end we want
to Output this value, right?
2466
01:43:18,398 --> 01:43:21,700
So this d f becomes my data
frame containing store value
2467
01:43:21,700 --> 01:43:23,188
of my employed or Jason.
2468
01:43:23,188 --> 01:43:25,655
So this decent value
will get converted
2469
01:43:25,655 --> 01:43:26,710
to my data frame.
2470
01:43:26,710 --> 01:43:30,000
We're now in the end PR just
outputting the result now
2471
01:43:30,000 --> 01:43:32,100
if you notice here
what we are doing,
2472
01:43:32,100 --> 01:43:33,312
so here we are first
2473
01:43:33,312 --> 01:43:36,100
of all importing your spark
session same story.
2474
01:43:36,100 --> 01:43:37,200
We just executing it.
2475
01:43:37,200 --> 01:43:39,500
Then we are building
our things better in that.
2476
01:43:39,500 --> 01:43:41,000
We're going to
create that again.
2477
01:43:41,000 --> 01:43:44,243
We are importing it then
we are reading Json file
2478
01:43:44,243 --> 01:43:46,000
by using Red Dot Json API.
2479
01:43:46,000 --> 01:43:47,900
We are reading
never employed or Jason.
2480
01:43:47,900 --> 01:43:50,428
Okay, which is present
in this particular directory
2481
01:43:50,428 --> 01:43:52,400
and we are outputting
so can you can see
2482
01:43:52,400 --> 01:43:55,300
that Json format will be
the T value format.
2483
01:43:55,300 --> 01:43:59,200
But when I'm doing this DF
not show it is just showing
2484
01:43:59,200 --> 01:44:00,700
up all my values here.
2485
01:44:00,700 --> 01:44:00,935
Now.
2486
01:44:00,935 --> 01:44:03,138
Let's see how we
can create our data set.
2487
01:44:03,138 --> 01:44:04,900
Now when we talk about data set,
2488
01:44:04,900 --> 01:44:06,500
you can notice
what we're doing.
2489
01:44:06,500 --> 01:44:06,700
Now.
2490
01:44:06,700 --> 01:44:09,200
We have understood all
this stability the how we
2491
01:44:09,200 --> 01:44:12,300
can create a data set now
first of all in data set
2492
01:44:12,300 --> 01:44:14,800
what we do so So
in data set we can create
2493
01:44:14,800 --> 01:44:17,900
the plus you can see we
are creating a case class employ
2494
01:44:17,900 --> 01:44:19,600
right now in case class
2495
01:44:19,600 --> 01:44:22,400
what we are doing we are done
just creating a sequence
2496
01:44:22,400 --> 01:44:25,600
in putting the value Andrew H
like name and age column.
2497
01:44:25,600 --> 01:44:28,076
Then we are displaying
our output all this data
2498
01:44:28,076 --> 01:44:28,803
set right now.
2499
01:44:28,803 --> 01:44:32,010
We are creating a primitive data
set also to demonstrate mapping
2500
01:44:32,010 --> 01:44:33,894
of this data frames
to your data sets.
2501
01:44:33,894 --> 01:44:34,200
Right?
2502
01:44:34,200 --> 01:44:36,200
So you can notice
that we are using
2503
01:44:36,200 --> 01:44:37,700
to D's instead of 2 DF.
2504
01:44:37,700 --> 01:44:39,500
We are using two DS
in this case.
2505
01:44:39,500 --> 01:44:42,293
Now, you may ask me what's
the difference with respect
2506
01:44:42,293 --> 01:44:43,400
to data frame, right?
2507
01:44:43,400 --> 01:44:45,100
With respect to data frame
2508
01:44:45,100 --> 01:44:46,700
in data frame
what we were doing.
2509
01:44:46,700 --> 01:44:48,682
We were create
again the data frame
2510
01:44:48,682 --> 01:44:50,800
and data set both
exactly looks safe.
2511
01:44:50,800 --> 01:44:53,228
It will also be having
the name column in rows
2512
01:44:53,228 --> 01:44:54,200
and everything up.
2513
01:44:54,200 --> 01:44:57,334
It is introduced lately
in 1.6 versions and later.
2514
01:44:57,334 --> 01:44:58,196
And what is it
2515
01:44:58,196 --> 01:45:01,100
provides it it provides
a encoder mechanism using
2516
01:45:01,100 --> 01:45:02,000
which you can get
2517
01:45:02,000 --> 01:45:04,208
when you are let's say
reading the weight data back.
2518
01:45:04,208 --> 01:45:06,200
Let's say you are DC
realizing you're not doing
2519
01:45:06,200 --> 01:45:06,968
that step, right?
2520
01:45:06,968 --> 01:45:08,300
It is going to be faster.
2521
01:45:08,300 --> 01:45:10,400
So the performance
wise data set is better.
2522
01:45:10,400 --> 01:45:13,000
That's the reason it
is introduced later nowadays.
2523
01:45:13,000 --> 01:45:15,794
People are moving from
data frame two data sets Okay.
2524
01:45:15,794 --> 01:45:17,500
So now we are just outputting
2525
01:45:17,500 --> 01:45:19,703
in the end see the same
thing in the output.
2526
01:45:19,703 --> 01:45:21,623
But so we are creating
employ a class.
2527
01:45:21,623 --> 01:45:24,684
Then we are putting the value
inside it creating a data set.
2528
01:45:24,684 --> 01:45:26,500
We are looking
at the values, right?
2529
01:45:26,500 --> 01:45:29,200
So these are the steps we
have just understood them now
2530
01:45:29,200 --> 01:45:32,000
how we can read of a Phi so
we want to read the file.
2531
01:45:32,000 --> 01:45:35,300
So we will use three dot Json
as employee employee was
2532
01:45:35,300 --> 01:45:38,026
what remember case class which
we have created last thing.
2533
01:45:38,026 --> 01:45:39,700
This was the classic
we have created
2534
01:45:39,700 --> 01:45:40,900
your case class employee.
2535
01:45:40,900 --> 01:45:43,300
So we are telling
that we are creating like this.
2536
01:45:43,500 --> 01:45:45,200
We are just out
putting this value.
2537
01:45:45,200 --> 01:45:47,612
We just within shop
you can see this way.
2538
01:45:47,612 --> 01:45:49,000
We can see this output.
2539
01:45:49,000 --> 01:45:50,700
Also now, let's see
2540
01:45:50,700 --> 01:45:53,900
how we can add the schema
to rdd now in order
2541
01:45:53,900 --> 01:45:57,300
to add the schema to rdd
what we are going to do.
2542
01:45:57,300 --> 01:45:59,100
So in this case also,
2543
01:45:59,200 --> 01:46:01,500
you can look at we
are importing all the values
2544
01:46:01,500 --> 01:46:03,700
that we are importing all
the libraries whatever
2545
01:46:03,700 --> 01:46:04,779
are required then
2546
01:46:04,779 --> 01:46:07,622
after that we are using
this spark context text
2547
01:46:07,622 --> 01:46:09,600
by reading the data splitting it
2548
01:46:09,600 --> 01:46:12,400
with respect to comma then
mapping the attributes.
2549
01:46:12,400 --> 01:46:14,750
We will employ The case
that's what we have done
2550
01:46:14,750 --> 01:46:17,041
and putting converting
this values to integer.
2551
01:46:17,041 --> 01:46:19,891
So in then we are converting
to to death right after that.
2552
01:46:19,891 --> 01:46:22,378
We are going to create
a temporary viewer table.
2553
01:46:22,378 --> 01:46:24,600
So let's create
this temporary view employ.
2554
01:46:24,600 --> 01:46:26,800
Then we are going
to use part dot Sequel
2555
01:46:26,800 --> 01:46:28,570
and passing up our SQL query.
2556
01:46:28,570 --> 01:46:31,500
Can you notice that we
have now passing the value
2557
01:46:31,500 --> 01:46:33,900
and we are assessing
this employ, right?
2558
01:46:33,900 --> 01:46:36,000
We are assessing
this employee here.
2559
01:46:36,000 --> 01:46:38,500
Now, what is this employ
this employee was
2560
01:46:38,500 --> 01:46:40,500
of a temporary view
which we have created
2561
01:46:40,500 --> 01:46:43,128
because the challenge
in Sparks equalist
2562
01:46:43,128 --> 01:46:46,329
when Whether you want
to execute any SQL query you
2563
01:46:46,329 --> 01:46:49,400
cannot say select aesthetic
from the data frame.
2564
01:46:49,400 --> 01:46:50,439
You cannot do that.
2565
01:46:50,439 --> 01:46:52,300
There's this is
not even supported.
2566
01:46:52,300 --> 01:46:55,547
So you cannot do select extract
from your data frame.
2567
01:46:55,547 --> 01:46:56,508
So instead of that
2568
01:46:56,508 --> 01:46:59,500
what we need to do is we need
to create a temporary table
2569
01:46:59,500 --> 01:47:01,732
or a temporary view
so you can notice here.
2570
01:47:01,732 --> 01:47:04,456
We are using this create
or replace temp You by replace
2571
01:47:04,456 --> 01:47:07,349
because if it is already
existing override on top of it.
2572
01:47:07,349 --> 01:47:09,400
So now we are creating
a temporary table
2573
01:47:09,400 --> 01:47:12,900
which will be exactly similar
to mine this data frame now
2574
01:47:12,900 --> 01:47:15,605
you You can just directly
execute all the query
2575
01:47:15,605 --> 01:47:18,100
on your return preview
Autumn Prairie table.
2576
01:47:18,100 --> 01:47:21,258
So you can notice here
instead of using employ DF
2577
01:47:21,258 --> 01:47:22,800
which was our data frame.
2578
01:47:22,800 --> 01:47:24,730
I am using here temporary view.
2579
01:47:24,730 --> 01:47:26,100
Okay, then in the end,
2580
01:47:26,100 --> 01:47:28,000
we just mapping
the names and a right
2581
01:47:28,000 --> 01:47:29,669
and we are outputting the bells.
2582
01:47:29,669 --> 01:47:30,200
That's it.
2583
01:47:30,200 --> 01:47:31,000
Same thing.
2584
01:47:31,000 --> 01:47:33,300
This is just
an execution part of it.
2585
01:47:33,300 --> 01:47:35,350
So we are just showing
all the steps here.
2586
01:47:35,350 --> 01:47:36,500
You can see in the end.
2587
01:47:36,500 --> 01:47:38,500
We are outputting
all this value now
2588
01:47:38,600 --> 01:47:40,800
how we can add
the schema to rdd.
2589
01:47:40,800 --> 01:47:43,850
Let's see this transformation
step now in this case you Notice
2590
01:47:43,850 --> 01:47:45,404
that we can map
this youngster fact
2591
01:47:45,404 --> 01:47:46,900
the we're converting
this map name
2592
01:47:46,900 --> 01:47:49,211
into the string for
the transformation part, right?
2593
01:47:49,211 --> 01:47:51,200
So we are checking all
this value that okay.
2594
01:47:51,200 --> 01:47:53,500
This is the string type name.
2595
01:47:53,500 --> 01:47:55,900
We are just showing up
this value right now.
2596
01:47:55,900 --> 01:47:56,900
What were you doing?
2597
01:47:56,900 --> 01:48:00,400
We are using this map encoder
from the implicit class,
2598
01:48:00,400 --> 01:48:03,717
which is available to us
to map the name and Each pie.
2599
01:48:03,717 --> 01:48:04,000
Okay.
2600
01:48:04,000 --> 01:48:05,529
So this is
what we're going to do
2601
01:48:05,529 --> 01:48:07,579
because remember in
the employee is class.
2602
01:48:07,579 --> 01:48:10,400
We have the name and age column
that we want to map now.
2603
01:48:10,400 --> 01:48:11,272
Now in this case,
2604
01:48:11,272 --> 01:48:13,164
we are mapping
the names to the ages.
2605
01:48:13,164 --> 01:48:14,400
Has so you can notice
2606
01:48:14,400 --> 01:48:17,600
that we are doing for ages
of our younger CF data frame
2607
01:48:17,600 --> 01:48:19,335
that what we
have created earlier
2608
01:48:19,335 --> 01:48:20,800
and the result is an array.
2609
01:48:20,800 --> 01:48:23,400
So the result but you're going
to get will be an array
2610
01:48:23,400 --> 01:48:25,700
with the name map
to your respective ages.
2611
01:48:25,700 --> 01:48:27,800
You can see this output
here so you can see
2612
01:48:27,800 --> 01:48:29,100
that this is getting map.
2613
01:48:29,100 --> 01:48:29,426
Right.
2614
01:48:29,426 --> 01:48:32,201
So we are getting seeing
this output like name is John
2615
01:48:32,201 --> 01:48:34,402
it is 28 that is what
we are talking about.
2616
01:48:34,402 --> 01:48:36,300
So here in this case,
you can notice
2617
01:48:36,300 --> 01:48:38,900
that it was representing
like this in this case.
2618
01:48:38,900 --> 01:48:42,200
The output is coming out
in this particular format now,
2619
01:48:42,200 --> 01:48:44,568
let's talk about
how Can add the schema
2620
01:48:44,568 --> 01:48:47,674
how we can read the file
we can add a whiskey minor
2621
01:48:47,674 --> 01:48:50,702
so we will be first
of all importing the type class
2622
01:48:50,702 --> 01:48:51,706
into your passion.
2623
01:48:51,706 --> 01:48:52,588
So with this is
2624
01:48:52,588 --> 01:48:54,815
what we have done
by using import statement.
2625
01:48:54,815 --> 01:48:58,286
Then we are going to import
the row class into this partial.
2626
01:48:58,286 --> 01:49:00,500
So rho will be used
in mapping our DB schema.
2627
01:49:00,500 --> 01:49:00,813
Right?
2628
01:49:00,813 --> 01:49:01,700
So you can notice
2629
01:49:01,700 --> 01:49:05,100
we're importing this also then
we are creating an rdd called
2630
01:49:05,000 --> 01:49:06,200
as employ a DD.
2631
01:49:06,200 --> 01:49:07,900
So in case this case
you can notice
2632
01:49:07,900 --> 01:49:09,809
that the same priority
we are creating
2633
01:49:09,809 --> 01:49:12,700
and we are creating this
with the help of this text file.
2634
01:49:12,700 --> 01:49:15,700
So once we have create this we
are going to Define our schema.
2635
01:49:15,700 --> 01:49:17,300
So this is the scheme approach.
2636
01:49:17,300 --> 01:49:17,572
Okay.
2637
01:49:17,572 --> 01:49:18,452
So in this case,
2638
01:49:18,452 --> 01:49:21,050
we are going to Define it
like named and space
2639
01:49:21,050 --> 01:49:21,800
than H. Okay,
2640
01:49:21,800 --> 01:49:24,700
because they these were
the two I have in my data also
2641
01:49:24,700 --> 01:49:26,129
in this employed or tht
2642
01:49:26,129 --> 01:49:27,305
if you look at these
2643
01:49:27,305 --> 01:49:29,600
are the two data which
we have named NH.
2644
01:49:29,600 --> 01:49:31,635
Now what we can do
once we have done
2645
01:49:31,635 --> 01:49:34,100
that then we can split it
with respect to space.
2646
01:49:34,100 --> 01:49:34,600
We can say
2647
01:49:34,600 --> 01:49:37,082
that our mapping value
and we are passing it
2648
01:49:37,082 --> 01:49:39,200
all this value inside
of a structure.
2649
01:49:39,200 --> 01:49:42,200
Okay, so we are defining a burn
or fields are ready.
2650
01:49:42,200 --> 01:49:43,500
That is what we are doing.
2651
01:49:43,500 --> 01:49:45,200
See this the fields are ready,
2652
01:49:45,200 --> 01:49:49,500
which is going to now output
after mapping the employee ID.
2653
01:49:49,500 --> 01:49:51,200
Okay, so that is
what we are doing.
2654
01:49:51,200 --> 01:49:54,413
So we want to just do this
into my schema strength,
2655
01:49:54,413 --> 01:49:55,375
then in the end.
2656
01:49:55,375 --> 01:49:57,300
We will be obtaining this field.
2657
01:49:57,300 --> 01:49:59,940
If you notice this field
what we have created here.
2658
01:49:59,940 --> 01:50:01,788
We are obtaining
this into a schema.
2659
01:50:01,788 --> 01:50:03,900
So we are passing this
into a struct type
2660
01:50:03,900 --> 01:50:06,400
and it is getting converted
to be our scheme of it.
2661
01:50:06,500 --> 01:50:08,200
So that is what we will do.
2662
01:50:08,200 --> 01:50:10,768
You can see all
this execution same steps.
2663
01:50:10,768 --> 01:50:13,357
We are just executing
in this terminal now,
2664
01:50:13,357 --> 01:50:16,500
Let's see how we are going
to transform the results.
2665
01:50:16,500 --> 01:50:18,300
Now, whatever we
have done, right?
2666
01:50:18,300 --> 01:50:21,229
So now we have already created
already called row editing.
2667
01:50:21,229 --> 01:50:22,000
So let's create
2668
01:50:22,000 --> 01:50:25,088
that Rogue additive are going
to Gray and we want
2669
01:50:25,088 --> 01:50:28,500
to transform the employee ID
using the map function
2670
01:50:28,500 --> 01:50:29,513
into row already.
2671
01:50:29,513 --> 01:50:30,564
So let's do that.
2672
01:50:30,564 --> 01:50:30,837
Okay.
2673
01:50:30,837 --> 01:50:31,717
So in this case
2674
01:50:31,717 --> 01:50:34,483
what we are doing so look
at this employed reading
2675
01:50:34,483 --> 01:50:36,797
we are splitting it
with respect to coma
2676
01:50:36,797 --> 01:50:40,000
and after that we are telling
see remember we have name
2677
01:50:40,000 --> 01:50:41,400
and then H like this so
2678
01:50:41,400 --> 01:50:43,500
that's what you're telling
me telling that act.
2679
01:50:43,500 --> 01:50:44,737
Zero or my attributes
2680
01:50:44,737 --> 01:50:47,796
one and why we're trimming
it just inverted to ensure
2681
01:50:47,796 --> 01:50:49,900
if there is no spaces
and on which other
2682
01:50:49,900 --> 01:50:52,600
so those things we don't want
to unnecessarily keep up.
2683
01:50:52,600 --> 01:50:55,400
So that's the reason we are
defining this term statement.
2684
01:50:55,400 --> 01:50:58,300
Now after that after we
once we are done with this,
2685
01:50:58,300 --> 01:51:01,100
we are going to define
a data frame employed EF
2686
01:51:01,100 --> 01:51:03,874
and we are going to store
that rdd schema into it.
2687
01:51:03,874 --> 01:51:05,764
So now if you notice
this row ID,
2688
01:51:05,764 --> 01:51:07,300
which we have defined here
2689
01:51:07,300 --> 01:51:11,124
and schema which we have defined
in the last case right now
2690
01:51:11,124 --> 01:51:13,300
if you'll go back
and notice here.
2691
01:51:13,300 --> 01:51:16,300
Schema, we have created here
right with respect to my Fields.
2692
01:51:16,600 --> 01:51:19,100
So that schema and this value
2693
01:51:19,100 --> 01:51:21,900
what we have just
created here rowady.
2694
01:51:21,900 --> 01:51:23,450
We are going to pass it and say
2695
01:51:23,450 --> 01:51:25,200
that we are going
to create a data frame.
2696
01:51:25,200 --> 01:51:27,900
So this will help us
in creating a data frame now,
2697
01:51:27,900 --> 01:51:31,135
we can create our temporary view
on the base of employee
2698
01:51:31,135 --> 01:51:33,900
of let's create an employee
or temporary View and then
2699
01:51:33,900 --> 01:51:36,900
what we can do we can execute
any SQL queries on top of it.
2700
01:51:36,900 --> 01:51:38,700
So as you can see
SparkNotes equal we
2701
01:51:38,700 --> 01:51:42,000
can create all the SQL queries
and can directly execute
2702
01:51:42,000 --> 01:51:43,200
that now what we can do.
2703
01:51:43,300 --> 01:51:45,700
We want to Output the values
we can quickly do that.
2704
01:51:45,800 --> 01:51:46,000
Now.
2705
01:51:46,000 --> 01:51:48,500
We want to let's say display
the names of we can say Okay,
2706
01:51:48,500 --> 01:51:51,600
attribute 0 contains the name
we can use the show command.
2707
01:51:51,600 --> 01:51:54,662
So this is how we
will be performing the operation
2708
01:51:54,662 --> 01:51:56,100
in the scheme away now,
2709
01:51:56,100 --> 01:51:58,900
so this is the same output way
means we're just executing
2710
01:51:58,900 --> 01:51:59,914
this whole thing up.
2711
01:51:59,914 --> 01:52:01,100
You can notice here.
2712
01:52:01,100 --> 01:52:03,400
Also, we are just
saying attribute 0.0.
2713
01:52:03,400 --> 01:52:06,205
It is representing
or me my output now,
2714
01:52:06,205 --> 01:52:08,200
let's talk about Json data.
2715
01:52:08,200 --> 01:52:10,085
Now when we talk
about Json data,
2716
01:52:10,085 --> 01:52:13,261
let's talk about how we
can load our files and work on.
2717
01:52:13,261 --> 01:52:15,496
This so in this case,
we will be first.
2718
01:52:15,496 --> 01:52:17,338
Let's say importing
our libraries.
2719
01:52:17,338 --> 01:52:18,800
Once we are done with that.
2720
01:52:18,800 --> 01:52:20,300
Now after that we can just say
2721
01:52:20,300 --> 01:52:23,587
that retort Jason we are
just bringing up our employed
2722
01:52:23,587 --> 01:52:25,611
or Jason you see
this is the execution
2723
01:52:25,611 --> 01:52:27,200
of this part now similarly,
2724
01:52:27,200 --> 01:52:29,042
we can also write
back in the pocket
2725
01:52:29,042 --> 01:52:31,282
or we can also read
the value from parque.
2726
01:52:31,282 --> 01:52:32,400
You can notice this
2727
01:52:32,400 --> 01:52:35,600
if you want to write
let's say this value employee
2728
01:52:35,600 --> 01:52:37,730
of data frame to my market way
2729
01:52:37,730 --> 01:52:40,500
so I can sit right dot
right dot market.
2730
01:52:40,500 --> 01:52:43,143
So this will be created
employed or Park.
2731
01:52:43,143 --> 01:52:46,504
Be created and hear all
the values should be converted
2732
01:52:46,504 --> 01:52:47,900
to employed or packet.
2733
01:52:47,900 --> 01:52:49,133
Only thing is the data.
2734
01:52:49,133 --> 01:52:51,600
If you go and see
in this particular directory,
2735
01:52:51,600 --> 01:52:52,717
this will be a directory.
2736
01:52:52,717 --> 01:52:53,954
We should be getting created.
2737
01:52:53,954 --> 01:52:55,400
So in this data,
you will notice
2738
01:52:55,400 --> 01:52:57,500
that you will not be able
to read the data.
2739
01:52:57,500 --> 01:53:00,100
So in that case
because it's not human readable.
2740
01:53:00,100 --> 01:53:02,200
So that's the reason you
will not be able to do that.
2741
01:53:02,200 --> 01:53:04,299
So, let's say you want
to read it now so you
2742
01:53:04,299 --> 01:53:05,449
can again bring it back
2743
01:53:05,449 --> 01:53:08,600
by using Red Dot Market you are
reading this employed at pocket,
2744
01:53:08,600 --> 01:53:09,600
which I just created
2745
01:53:09,600 --> 01:53:11,700
then you are creating
a temporary view
2746
01:53:11,700 --> 01:53:12,775
or temporary table
2747
01:53:12,775 --> 01:53:15,488
and then By using
standard SQL you can execute
2748
01:53:15,488 --> 01:53:16,903
on your temporary table.
2749
01:53:16,903 --> 01:53:17,844
Now in this way.
2750
01:53:17,844 --> 01:53:21,000
You can read your pocket file
data and in then we are just
2751
01:53:21,000 --> 01:53:24,284
displaying the result see
the similar output of this.
2752
01:53:24,284 --> 01:53:24,600
Okay.
2753
01:53:24,600 --> 01:53:27,100
This is how we can execute
all these things up now.
2754
01:53:27,100 --> 01:53:28,670
Once we have done all this,
2755
01:53:28,670 --> 01:53:31,200
let's see how we
can create our data frames.
2756
01:53:31,200 --> 01:53:33,100
So let's create this file path.
2757
01:53:33,100 --> 01:53:36,390
So let's say we have created
this file employed or Jason
2758
01:53:36,390 --> 01:53:38,508
after that we can
create a data frame
2759
01:53:38,508 --> 01:53:39,943
from our Json path, right?
2760
01:53:39,943 --> 01:53:42,884
So we are creating this
by using retouch Jason then
2761
01:53:42,884 --> 01:53:44,420
we can Print the schema.
2762
01:53:44,420 --> 01:53:47,300
What does to this is going
to print the schema
2763
01:53:47,300 --> 01:53:49,300
of my employee data frame?
2764
01:53:49,300 --> 01:53:52,500
Okay, so we are going to use
this print schemer to print
2765
01:53:52,500 --> 01:53:55,795
up all the values then we
can create a temporary view
2766
01:53:55,795 --> 01:53:57,000
of this data frame.
2767
01:53:57,000 --> 01:53:58,100
So we are create doing
2768
01:53:58,100 --> 01:54:00,618
that see create or replace
temp you we are creating
2769
01:54:00,618 --> 01:54:02,860
that which we have seen
it last time also now
2770
01:54:02,860 --> 01:54:04,888
after that we can
execute our SQL query.
2771
01:54:04,888 --> 01:54:07,800
So let's say we are executing
our SQL query from employee
2772
01:54:07,800 --> 01:54:10,000
where age is between 18
and 30, right?
2773
01:54:10,000 --> 01:54:11,300
So this kind of SQL query.
2774
01:54:11,300 --> 01:54:12,854
Let's say we want
to do we can get
2775
01:54:12,854 --> 01:54:14,989
that And in the end we
can see the output Also.
2776
01:54:14,989 --> 01:54:16,278
Let's see this execution.
2777
01:54:16,278 --> 01:54:17,000
So you can see
2778
01:54:17,000 --> 01:54:20,891
that all the vampires who these
are let's say between 18 and 30
2779
01:54:20,891 --> 01:54:22,900
that is showing up
in the output.
2780
01:54:22,900 --> 01:54:23,147
Now.
2781
01:54:23,147 --> 01:54:25,176
Let's see this
rdd operation way.
2782
01:54:25,176 --> 01:54:26,369
Now what you can do
2783
01:54:26,369 --> 01:54:30,200
so we are going to create this
add any other employer Nene now
2784
01:54:30,200 --> 01:54:33,900
which is going to store
the content of employed George
2785
01:54:33,900 --> 01:54:35,300
and New Delhi Delhi.
2786
01:54:35,300 --> 01:54:36,433
So see this part,
2787
01:54:36,433 --> 01:54:39,500
so here we are creating this
by using make a DD
2788
01:54:39,500 --> 01:54:43,400
and we have just this is going
to store the content containing
2789
01:54:43,400 --> 01:54:45,000
Such from noodle, right?
2790
01:54:45,000 --> 01:54:45,900
You can see this
2791
01:54:45,900 --> 01:54:48,300
so New Delhi is my city
named state is the ring.
2792
01:54:48,300 --> 01:54:50,250
So that is what we
are passing inside it.
2793
01:54:50,250 --> 01:54:52,900
Now what we are doing we
are assigning the content
2794
01:54:52,900 --> 01:54:56,700
of this other employee ID
into my other employees.
2795
01:54:56,700 --> 01:54:59,200
So we are using
this dark dot RI dot Json
2796
01:54:59,200 --> 01:55:00,600
and we are reading at the value
2797
01:55:00,600 --> 01:55:02,800
and in the end we
are using this show appear.
2798
01:55:02,800 --> 01:55:04,857
You can notice
this output coming up now.
2799
01:55:04,857 --> 01:55:06,400
Let's see with the hive table.
2800
01:55:06,400 --> 01:55:08,536
So with the hive table
if you want to read that,
2801
01:55:08,536 --> 01:55:10,186
so let's do it
with the case class
2802
01:55:10,186 --> 01:55:11,136
and Spark sessions.
2803
01:55:11,136 --> 01:55:11,900
So first of all,
2804
01:55:11,900 --> 01:55:14,713
we are going to import
a guru class and we are going
2805
01:55:14,713 --> 01:55:16,700
to use path session
into the Spartan.
2806
01:55:16,700 --> 01:55:18,000
So let's do that for a way.
2807
01:55:18,000 --> 01:55:20,082
I'm putting this row
this past session
2808
01:55:20,082 --> 01:55:21,200
and not after that.
2809
01:55:21,200 --> 01:55:24,186
We are going to create a class
record containing this key
2810
01:55:24,186 --> 01:55:25,756
which is of integer data type
2811
01:55:25,756 --> 01:55:27,576
and a value which is
of string type.
2812
01:55:27,576 --> 01:55:29,426
Then we are going
to set our location
2813
01:55:29,426 --> 01:55:30,726
of the warehouse location.
2814
01:55:30,726 --> 01:55:31,948
Okay to this pathway rows.
2815
01:55:31,948 --> 01:55:33,400
So that is what we are doing.
2816
01:55:33,400 --> 01:55:33,629
Now.
2817
01:55:33,629 --> 01:55:36,100
We are going to build
a spark sessions back
2818
01:55:36,100 --> 01:55:39,200
to demonstrate the hive
example in spots equal.
2819
01:55:39,200 --> 01:55:40,100
Look at this now,
2820
01:55:40,100 --> 01:55:42,700
so we are creating Sparks
session dot Builder again.
2821
01:55:42,700 --> 01:55:44,331
We are passing the Any app name
2822
01:55:44,331 --> 01:55:46,700
to it we have passing
the configuration to it.
2823
01:55:46,700 --> 01:55:48,968
And then we are saying
that we want to enable
2824
01:55:48,968 --> 01:55:50,000
The Hive support now
2825
01:55:50,000 --> 01:55:50,800
once we have done
2826
01:55:50,800 --> 01:55:53,800
that we are importing
this spark SQL library center.
2827
01:55:54,000 --> 01:55:56,612
And then you can notice
that we can use SQL
2828
01:55:56,612 --> 01:55:58,601
so we can create now a table SRC
2829
01:55:58,601 --> 01:56:01,336
so you can see create table
if not exist as RC
2830
01:56:01,336 --> 01:56:04,800
with column to stores the data
as a key common value pair.
2831
01:56:04,800 --> 01:56:06,399
So that is what we
are doing here.
2832
01:56:06,400 --> 01:56:09,000
Now, you can see all
this execution of the same step.
2833
01:56:09,000 --> 01:56:09,209
Now.
2834
01:56:09,209 --> 01:56:12,430
Let's see the sequel operation
happening here now in this case
2835
01:56:12,430 --> 01:56:13,229
what we can do.
2836
01:56:13,229 --> 01:56:15,700
We can now load the data
from this example,
2837
01:56:15,700 --> 01:56:17,500
which is present to succeed.
2838
01:56:17,500 --> 01:56:19,400
Is this KV m dot txt file,
2839
01:56:19,400 --> 01:56:20,869
which is available to us
2840
01:56:20,869 --> 01:56:23,281
and we want to store it
into the table SRC
2841
01:56:23,281 --> 01:56:25,225
which we have just
created and now
2842
01:56:25,225 --> 01:56:28,872
if you want to just view the all
this output becomes a sequence
2843
01:56:28,872 --> 01:56:30,305
select aesthetic form SRC
2844
01:56:30,305 --> 01:56:31,764
and it is going to show up
2845
01:56:31,764 --> 01:56:34,005
all the values you
can see this output.
2846
01:56:34,005 --> 01:56:34,300
Okay.
2847
01:56:34,300 --> 01:56:37,341
So this is the way you can show
up the virus now similarly we
2848
01:56:37,341 --> 01:56:38,899
can perform the count operation.
2849
01:56:38,899 --> 01:56:40,993
Okay, so we can say
select Counter-Strike
2850
01:56:40,993 --> 01:56:43,400
from SRC to select the number
of keys in there.
2851
01:56:43,400 --> 01:56:45,858
See tables, and now
select all the records,
2852
01:56:45,858 --> 01:56:48,800
right so we can say
that key select key gamma value
2853
01:56:48,800 --> 01:56:49,500
so you can see
2854
01:56:49,500 --> 01:56:52,150
that we can perform all
over Hive operations here
2855
01:56:52,150 --> 01:56:53,562
on this right similarly.
2856
01:56:53,562 --> 01:56:56,300
We can create a data set
string DS from spark DF
2857
01:56:56,300 --> 01:56:58,623
so you can see this
also by using SQL DF
2858
01:56:58,623 --> 01:57:00,835
what we already have
we can just say map
2859
01:57:00,835 --> 01:57:01,730
and then provide
2860
01:57:01,730 --> 01:57:04,541
the case class in can map
the ski common value pair
2861
01:57:04,541 --> 01:57:07,600
and then in the end we
can show up all this value see
2862
01:57:07,600 --> 01:57:10,644
this execution of this in then
you can notice this output
2863
01:57:10,644 --> 01:57:11,828
which we want it now.
2864
01:57:11,828 --> 01:57:13,288
Let's see the result back.
2865
01:57:13,288 --> 01:57:15,700
But now we can create
our data frame here.
2866
01:57:15,700 --> 01:57:18,384
Right so we can create
our data frame records deaf
2867
01:57:18,384 --> 01:57:19,848
and store all the results
2868
01:57:19,848 --> 01:57:21,900
which contains the value
between 1 200.
2869
01:57:21,900 --> 01:57:24,600
So we are storing all the values
between 1/2 and video.
2870
01:57:24,600 --> 01:57:26,700
Then we are creating
a victim Prairie View.
2871
01:57:26,700 --> 01:57:28,900
Okay for the records,
that's what we are doing.
2872
01:57:28,900 --> 01:57:31,200
So for requires the FAA
creating a temporary view
2873
01:57:31,200 --> 01:57:33,800
so that we can have
over Oliver SQL queries now,
2874
01:57:33,800 --> 01:57:35,336
we can execute all the values
2875
01:57:35,336 --> 01:57:38,400
so you can also notice we
are doing join operation here.
2876
01:57:38,400 --> 01:57:40,900
Okay, so we can display
the content of join
2877
01:57:40,900 --> 01:57:43,300
between the records
and this is our city.
2878
01:57:43,600 --> 01:57:46,400
We can do a joint on this part
so we can also perform all
2879
01:57:46,400 --> 01:57:48,300
the joint operations
and get the output.
2880
01:57:48,300 --> 01:57:48,500
Now.
2881
01:57:48,500 --> 01:57:50,356
Let's see our use case for it.
2882
01:57:50,356 --> 01:57:51,908
If we talk about use case.
2883
01:57:51,908 --> 01:57:55,071
We are going to analyze
our stock market with the help
2884
01:57:55,071 --> 01:57:57,100
of spark sequence
select understand
2885
01:57:57,100 --> 01:57:58,500
the problem statement first.
2886
01:57:58,500 --> 01:58:00,382
So now in our problem statement,
2887
01:58:00,382 --> 01:58:04,029
so what we want to do so we want
to accept definitely everybody
2888
01:58:04,029 --> 01:58:07,156
must be aware of this top market
like in stock market.
2889
01:58:07,156 --> 01:58:08,811
You can lot
of activities happen.
2890
01:58:08,811 --> 01:58:10,400
You want to know analyze it
2891
01:58:10,400 --> 01:58:13,300
in order to make some profit
out of it and all those stuff.
2892
01:58:13,300 --> 01:58:15,200
Alright, so now
let's say our company
2893
01:58:15,200 --> 01:58:18,200
have collected a lot of data
for different 10 companies
2894
01:58:18,200 --> 01:58:20,000
and they want to do
some computation.
2895
01:58:20,000 --> 01:58:22,964
Let's say they want to compute
the average closing price.
2896
01:58:22,964 --> 01:58:26,300
They want to list the companies
with the highest closing prices.
2897
01:58:26,300 --> 01:58:29,749
They want to compute the average
closing price per month.
2898
01:58:29,749 --> 01:58:32,485
They want to list the number
of big price Rises
2899
01:58:32,485 --> 01:58:35,400
and fall and compute
some statistical correlation.
2900
01:58:35,400 --> 01:58:37,700
So these things we are going
to do with the help
2901
01:58:37,700 --> 01:58:39,158
of our spark SQL statement.
2902
01:58:39,158 --> 01:58:42,255
So this is a very common we want
to process the huge data.
2903
01:58:42,255 --> 01:58:45,103
We want to handle The input
from the multiple sources,
2904
01:58:45,103 --> 01:58:47,200
we want to process
the data in real time
2905
01:58:47,200 --> 01:58:48,754
and it should be easy to use.
2906
01:58:48,754 --> 01:58:50,488
It should not be
very complicated.
2907
01:58:50,488 --> 01:58:53,800
So all this requirement will be
handled by my spots equal right?
2908
01:58:53,800 --> 01:58:55,700
So that's the reason
we are going to use
2909
01:58:55,700 --> 01:58:56,950
the spacer sequence.
2910
01:58:56,950 --> 01:58:57,700
So as I said
2911
01:58:57,700 --> 01:58:59,600
that we are going
to use 10 companies.
2912
01:58:59,600 --> 01:59:02,076
So we are going to kind
of use this 10 companies
2913
01:59:02,076 --> 01:59:03,498
and on those ten companies.
2914
01:59:03,498 --> 01:59:04,500
We are going to see
2915
01:59:04,500 --> 01:59:07,200
that we are going to perform
our analysis on top of it.
2916
01:59:07,200 --> 01:59:09,100
So we will be using
this table data
2917
01:59:09,100 --> 01:59:11,800
from Yahoo finance
for all this following stocks.
2918
01:59:11,800 --> 01:59:14,300
So for n and a A bit sexist.
2919
01:59:14,300 --> 01:59:15,400
So all these companies
2920
01:59:15,400 --> 01:59:17,600
we have on on which we
are going to perform.
2921
01:59:17,600 --> 01:59:20,800
So this is how my data will look
like which will be having date
2922
01:59:20,800 --> 01:59:25,046
opening High rate low rate
closing volume adjusted close.
2923
01:59:25,046 --> 01:59:27,700
All this data will
be presented now.
2924
01:59:27,700 --> 01:59:28,917
So, let's see how we
2925
01:59:28,917 --> 01:59:31,900
can Implement a stock analysis
using spark sequel.
2926
01:59:31,900 --> 01:59:33,497
So what we have to do for that,
2927
01:59:33,497 --> 01:59:36,278
so this is how many data
flow diagram will sound like
2928
01:59:36,278 --> 01:59:38,811
so we have going to initially
have the huge amount
2929
01:59:38,811 --> 01:59:40,000
of real-time stock data
2930
01:59:40,000 --> 01:59:42,400
that we are going to process it
through this path SQL.
2931
01:59:42,400 --> 01:59:44,600
So going to It into
a named column base.
2932
01:59:44,600 --> 01:59:46,308
Then we are going
to create an rdd
2933
01:59:46,308 --> 01:59:47,658
for functional programming.
2934
01:59:47,658 --> 01:59:48,395
So let's do that.
2935
01:59:48,395 --> 01:59:50,354
Then we are going to use
a reverse Park sequel
2936
01:59:50,354 --> 01:59:52,500
which will calculate
the average closing price
2937
01:59:52,500 --> 01:59:53,600
for your calculating.
2938
01:59:53,600 --> 01:59:56,188
The company with is closing
per year then buy
2939
01:59:56,188 --> 01:59:59,000
some stock SQL queries
will be getting our outputs.
2940
01:59:59,000 --> 02:00:01,000
Okay, so that is
what we're going to do.
2941
02:00:01,000 --> 02:00:03,400
So all the queries
what we are getting generated,
2942
02:00:03,400 --> 02:00:05,500
so it's not only this we
are also going to compute
2943
02:00:05,500 --> 02:00:08,000
few other queries what we
have solve those queries.
2944
02:00:08,000 --> 02:00:09,200
We're going to execute him.
2945
02:00:09,200 --> 02:00:09,500
Now.
2946
02:00:09,500 --> 02:00:11,273
This is how the flow
will look like.
2947
02:00:11,273 --> 02:00:13,200
So we are going
to initially have this Data
2948
02:00:13,200 --> 02:00:16,000
what I have just shown you a now
what you're going to do.
2949
02:00:16,000 --> 02:00:17,700
You're going to create
a data frame you
2950
02:00:17,700 --> 02:00:19,990
are going to then create
a joint clothes are ready.
2951
02:00:19,990 --> 02:00:21,850
We will see what we
are going to do here.
2952
02:00:21,850 --> 02:00:23,900
Then we are going
to calculate the average
2953
02:00:23,900 --> 02:00:25,160
closing price per year.
2954
02:00:25,160 --> 02:00:27,900
We are going to hit
a rough patch SQL query and get
2955
02:00:27,900 --> 02:00:29,314
the result in the table.
2956
02:00:29,314 --> 02:00:31,800
So this is how my execution
will look like.
2957
02:00:31,800 --> 02:00:33,445
So what we are going
to do in this case,
2958
02:00:33,445 --> 02:00:34,095
first of all,
2959
02:00:34,095 --> 02:00:36,839
we are going to initialize the
Sparks equal in this function.
2960
02:00:36,839 --> 02:00:39,600
We are going to import all
the required libraries then we
2961
02:00:39,600 --> 02:00:40,500
are going to start
2962
02:00:40,500 --> 02:00:43,216
our spark session after
importing all the required.
2963
02:00:43,216 --> 02:00:44,473
B we are going to create
2964
02:00:44,473 --> 02:00:47,251
our case class whatever
is required in the case class,
2965
02:00:47,251 --> 02:00:49,466
you can notice a then
we are going to Define
2966
02:00:49,466 --> 02:00:50,600
our past stock scheme.
2967
02:00:50,600 --> 02:00:53,350
So because we have already
learnt how to create a schema
2968
02:00:53,350 --> 02:00:55,500
as we're going to create
this page table schema
2969
02:00:55,500 --> 02:00:56,800
by creating this way.
2970
02:00:56,800 --> 02:00:59,200
Well, then we are going
to Define our parts.
2971
02:00:59,200 --> 02:01:00,900
I DD so in parts are did
2972
02:01:00,900 --> 02:01:02,895
if you notice so
here we are creating.
2973
02:01:02,895 --> 02:01:04,289
This parts are ready mix.
2974
02:01:04,289 --> 02:01:05,708
We have going to create all
2975
02:01:05,708 --> 02:01:07,600
of that by using
this additive first.
2976
02:01:07,600 --> 02:01:10,300
We are going to remove
the header files also from it.
2977
02:01:10,300 --> 02:01:12,749
Then we are going
to read our CSV file
2978
02:01:12,749 --> 02:01:15,200
into Into stocks a a
on DF data frame.
2979
02:01:15,200 --> 02:01:17,500
So we are going to read
this as C dot txt file.
2980
02:01:17,500 --> 02:01:20,161
You can see we are reading
this file and we are going
2981
02:01:20,161 --> 02:01:21,800
to convert it into a data frame.
2982
02:01:21,800 --> 02:01:23,450
So we are passing
it as an oddity.
2983
02:01:23,450 --> 02:01:24,511
Once we are done then
2984
02:01:24,511 --> 02:01:26,697
if you want to print
the output we can do it
2985
02:01:26,697 --> 02:01:27,997
with the help of show API.
2986
02:01:27,997 --> 02:01:29,852
Once we are done
with this now we want
2987
02:01:29,852 --> 02:01:31,450
to let's say display the average
2988
02:01:31,450 --> 02:01:34,100
of addressing closing price
for n and for every month,
2989
02:01:34,100 --> 02:01:37,629
so if we can do all of that also
by using select query, right
2990
02:01:37,629 --> 02:01:40,300
so we can say this data frame
dot select and pass
2991
02:01:40,300 --> 02:01:43,100
whatever parameters are required
to get the average know,
2992
02:01:43,100 --> 02:01:44,000
You can notice are
2993
02:01:44,000 --> 02:01:47,200
inside this we are creating
the Elias of the things as well.
2994
02:01:47,200 --> 02:01:48,300
So for this DT,
2995
02:01:48,300 --> 02:01:50,059
we are creating
areas here, right?
2996
02:01:50,059 --> 02:01:52,538
So we are creating the Elias
for it in a binder
2997
02:01:52,538 --> 02:01:54,714
and we are showing
the output also so here
2998
02:01:54,714 --> 02:01:56,307
what we are going to do now,
2999
02:01:56,307 --> 02:01:57,400
we will be checking
3000
02:01:57,400 --> 02:01:59,669
that the closing
price for Microsoft.
3001
02:01:59,669 --> 02:02:03,300
So let's say they're going up
by 2 or with greater than 2
3002
02:02:03,300 --> 02:02:05,900
or wherever it is going
by greater than 2 and now we
3003
02:02:05,900 --> 02:02:08,039
want to get the output
and display the result
3004
02:02:08,039 --> 02:02:10,023
so you can notice
that wherever it is going
3005
02:02:10,023 --> 02:02:12,282
to be greater than 2 we
are getting the value.
3006
02:02:12,282 --> 02:02:14,383
So we are hitting
the SQL query to do that.
3007
02:02:14,383 --> 02:02:16,483
So we are hitting
the SQL query now on this
3008
02:02:16,483 --> 02:02:17,935
you can notice the SQL query
3009
02:02:17,935 --> 02:02:19,975
which we are hitting
on the stocks.
3010
02:02:19,975 --> 02:02:20,775
Msft.
3011
02:02:20,775 --> 02:02:21,128
Right?
3012
02:02:21,128 --> 02:02:22,768
This is the we have data frame
3013
02:02:22,768 --> 02:02:24,900
we have created now
on this we are doing
3014
02:02:24,900 --> 02:02:27,076
that and we are putting
our query that
3015
02:02:27,076 --> 02:02:29,395
where my condition
this to be true means
3016
02:02:29,395 --> 02:02:32,066
where my closing price
and my opening price
3017
02:02:32,066 --> 02:02:34,300
because let's say
at the closing price
3018
02:02:34,300 --> 02:02:36,852
the stock price by let's say
a hundred US Dollars
3019
02:02:36,852 --> 02:02:38,500
and at that time in the morning
3020
02:02:38,500 --> 02:02:40,800
when it open with
the Lexi 98 used or so,
3021
02:02:40,800 --> 02:02:43,131
wherever it is going
to be having a different.
3022
02:02:43,131 --> 02:02:43,961
Of to or greater
3023
02:02:43,961 --> 02:02:46,300
than to that only output
we want to get so that is
3024
02:02:46,300 --> 02:02:47,400
what we're doing here.
3025
02:02:47,400 --> 02:02:47,600
Now.
3026
02:02:47,600 --> 02:02:50,600
Once we are done then after that
what we are going to do now,
3027
02:02:50,600 --> 02:02:52,628
we are going to use
the join operation.
3028
02:02:52,629 --> 02:02:55,500
So what we are going to do:
we will be joining the AAON
3029
02:02:55,500 --> 02:02:58,300
and ABAX stocks in order
to compare the closing price,
3030
02:02:58,300 --> 02:03:00,200
because we want
to compare the prices
3031
02:03:00,200 --> 02:03:01,297
so we will be doing that.
3032
02:03:01,297 --> 02:03:02,000
So first of all,
3033
02:03:02,000 --> 02:03:04,600
we are going to create a union
of all these stocks
3034
02:03:04,600 --> 02:03:06,500
and then display
the joined rows.
3035
02:03:06,500 --> 02:03:07,259
So look at this
3036
02:03:07,259 --> 02:03:09,284
what we're going to do
we're going to use
3037
02:03:09,284 --> 02:03:10,200
Spark SQL, and
3038
02:03:10,200 --> 02:03:13,000
if you notice this closely
what we're doing in this case,
3039
02:03:13,000 --> 02:03:14,439
So now in this Spark SQL,
3040
02:03:14,439 --> 02:03:16,200
we are hitting
the select query
3041
02:03:16,200 --> 02:03:18,780
and all those stuff then
we are saying from this
3042
02:03:18,780 --> 02:03:21,192
and here we are using
this join operation.
3043
02:03:21,192 --> 02:03:22,704
You can see this join operation.
3044
02:03:22,704 --> 02:03:24,500
So this we are joining it on
3045
02:03:24,500 --> 02:03:26,500
and then in the end
we are outputting it.
3046
02:03:26,500 --> 02:03:28,700
So here you can see
you can do a comparison
3047
02:03:28,700 --> 02:03:31,300
of all these close prices
for all these stocks.
3048
02:03:31,300 --> 02:03:34,000
You can also do this
for more companies; right now
3049
02:03:34,000 --> 02:03:36,280
we have just shown you
an example with two companies,
3050
02:03:36,280 --> 02:03:38,480
but you can do it
for more companies as well.
3051
02:03:38,480 --> 02:03:39,188
Now in this case
3052
02:03:39,188 --> 02:03:41,800
if you notice what we're doing,
we are writing this in the Parquet
3053
02:03:41,800 --> 02:03:44,928
file format and saving it
into this particular location.
3054
02:03:44,928 --> 02:03:47,135
So we are creating
this joined stock market table.
3055
02:03:47,135 --> 02:03:49,869
So we are storing it in
the Parquet file format, and here,
3056
02:03:49,869 --> 02:03:51,705
if you want to read
it we can read
3057
02:03:51,705 --> 02:03:52,800
that and show the output.
3058
02:03:52,800 --> 02:03:55,300
But whatever file you
have saved as a Parquet
3059
02:03:55,300 --> 02:03:57,900
file, definitely you
will not be able to read directly,
3060
02:03:57,900 --> 02:04:00,700
because that file is going
to be in the Parquet format,
3061
02:04:00,800 --> 02:04:03,900
and Parquet files are binary files
which you can never read as text.
3062
02:04:03,900 --> 02:04:05,900
You will not be able
to read them as plain text.
3063
02:04:05,900 --> 02:04:08,382
Next, you will be seeing this
average closing price per year.
3064
02:04:08,382 --> 02:04:10,631
I'm going to show you all
these things running also; I'm
3065
02:04:10,631 --> 02:04:13,181
just first explaining to you
how things will be run,
3066
02:04:13,181 --> 02:04:13,900
what we're doing up here.
3067
02:04:13,900 --> 02:04:15,900
So I will be showing
you all these things
3068
02:04:15,900 --> 02:04:17,100
in execution as well.
3069
02:04:17,200 --> 02:04:18,200
Now in this case,
3070
02:04:18,200 --> 02:04:20,100
if you notice
what we are doing again,
3071
02:04:20,100 --> 02:04:21,907
we are creating
our data frame here.
3072
02:04:21,907 --> 02:04:24,800
Again, we are executing our
query whatever table we have.
3073
02:04:24,800 --> 02:04:26,300
We are executing on top of it.
3074
02:04:26,300 --> 02:04:27,050
So in this case
3075
02:04:27,050 --> 02:04:29,650
because we want to find
the average closing per year.
3076
02:04:29,650 --> 02:04:31,300
So what we are doing
in this case,
3077
02:04:31,300 --> 02:04:33,800
we are going to create
a new table containing
3078
02:04:33,800 --> 02:04:37,700
the average closing price
of, let's say, AAON, ABAX and FAST,
3079
02:04:37,700 --> 02:04:40,319
and then we are going
to display all this new table.
3080
02:04:40,319 --> 02:04:41,369
So, in the end,
3081
02:04:41,369 --> 02:04:42,800
We are going to
register this table
3082
02:04:42,800 --> 02:04:43,900
as a temporary table
3083
02:04:43,900 --> 02:04:46,515
so that we can execute
our SQL queries on top of it.
3084
02:04:46,515 --> 02:04:47,328
So in this case,
3085
02:04:47,328 --> 02:04:49,828
you can notice that we
are creating this new table.
3086
02:04:49,828 --> 02:04:50,900
And in this new table,
3087
02:04:50,900 --> 02:04:52,900
we are putting
our SQL query, right?
3088
02:04:52,900 --> 02:04:53,711
that SQL query
3089
02:04:53,711 --> 02:04:56,300
is going to contain
the average closing price. So
3090
02:04:56,300 --> 02:05:00,100
the SQL query is finding out
the average closing price of AAON
3091
02:05:00,100 --> 02:05:03,100
and all these companies.
Then, with whatever we have now,
3092
02:05:03,100 --> 02:05:05,688
we are going to apply
the transformation step:
3093
02:05:05,688 --> 02:05:07,488
now, a transformation
of this new table,
3094
02:05:07,488 --> 02:05:09,188
which we have created
with the year
3095
02:05:09,188 --> 02:05:11,100
and the corresponding
three company data
3096
02:05:11,100 --> 02:05:13,400
what we have created,
into the companyAll table,
3097
02:05:13,400 --> 02:05:15,103
and you can notice
3098
02:05:15,103 --> 02:05:17,100
that we are creating
this companyAll table,
3099
02:05:17,100 --> 02:05:18,247
and here first of all,
3100
02:05:18,247 --> 02:05:20,725
we are going to create
a transformed table, companyAll,
3101
02:05:20,725 --> 02:05:23,413
and are going to display
the output. So you can notice
3102
02:05:23,413 --> 02:05:25,100
that we are hitting
the SQL query
3103
02:05:25,100 --> 02:05:27,900
and in the end we are printing
this output. Similarly,
3104
02:05:27,900 --> 02:05:29,975
if we want to let's say
compute the best
3105
02:05:29,975 --> 02:05:31,597
of the average close, we can do that.
3106
02:05:31,597 --> 02:05:33,618
So in this case again
the same way now,
3107
02:05:33,618 --> 02:05:35,800
once you have learned
the basic stuff,
3108
02:05:35,800 --> 02:05:37,426
you can notice that everything
3109
02:05:37,426 --> 02:05:40,400
is following a similar approach
now in this case also,
3110
02:05:40,400 --> 02:05:43,200
we want to find out let's say
the best of the average
3111
02:05:43,200 --> 02:05:46,100
So we are creating
this best company here now.
3112
02:05:46,100 --> 02:05:49,500
It should contain the best
average closing price of AAON, ABAX
3113
02:05:49,500 --> 02:05:52,700
and FAST, so we can just use
this greatest function and all.
3114
02:05:52,700 --> 02:05:53,400
So we are creating
3115
02:05:53,400 --> 02:05:56,675
that then after that we
are going to display this output
3116
02:05:56,675 --> 02:05:59,846
and we will be again registering
it as a temporary table.
3117
02:05:59,846 --> 02:06:02,700
once we have done that then
we can hit our queries now,
3118
02:06:02,700 --> 02:06:04,350
so we want to check
let's say best
3119
02:06:04,350 --> 02:06:05,600
performing company per year.
3120
02:06:05,600 --> 02:06:07,200
Now what we have to do for that.
3121
02:06:07,200 --> 02:06:09,319
So we are creating
the final table in which
3122
02:06:09,319 --> 02:06:10,400
we are going to compute
3123
02:06:10,400 --> 02:06:13,200
all the things; we are going
to perform the join and all.
3124
02:06:13,200 --> 02:06:16,082
So all those SQL queries we
are going to perform here
3125
02:06:16,082 --> 02:06:17,200
in order to compute
3126
02:06:17,200 --> 02:06:19,500
that which company
is doing the best
3127
02:06:19,500 --> 02:06:21,250
and then we are going
to display the output.
3128
02:06:21,250 --> 02:06:23,800
So this is the output
that is showing up here.
3129
02:06:23,800 --> 02:06:25,850
We are again storing it
as a temporary view,
3130
02:06:25,850 --> 02:06:28,000
and here again the same
story of correlation
3131
02:06:28,000 --> 02:06:29,400
what we're going to do here.
3132
02:06:29,400 --> 02:06:32,843
So now we will be using
our statistics libraries to find
3133
02:06:32,843 --> 02:06:36,400
the correlation between the AAON
and ABAX companies' closing prices.
3134
02:06:36,400 --> 02:06:38,300
So that is what we
are going to do now.
3135
02:06:38,300 --> 02:06:41,088
So correlation, in the finance
and investment
3136
02:06:41,088 --> 02:06:43,079
industries, is a statistic that
3137
02:06:43,079 --> 02:06:44,300
measures the degree
3138
02:06:44,300 --> 02:06:47,564
to which two securities move
in relation to each other.
3139
02:06:47,564 --> 02:06:49,625
So the closer the correlation is
3140
02:06:49,625 --> 02:06:52,200
to 1, the stronger
the relationship is going to be.
3141
02:06:52,200 --> 02:06:53,722
So it is always like
3142
02:06:53,722 --> 02:06:57,300
how two variables are correlated
with each other.
3143
02:06:57,300 --> 02:07:01,400
Let's say your age is highly
correlated to your salary:
3144
02:07:01,400 --> 02:07:05,000
when you are young,
you usually earn less,
3145
02:07:05,000 --> 02:07:06,400
and when you
3146
02:07:06,400 --> 02:07:09,500
are older, definitely
you will be earning more
3147
02:07:09,500 --> 02:07:12,811
because you will be more mature.
In a similar way, I can say that
3148
02:07:12,811 --> 02:07:16,400
Your salary is also dependent
on your education qualification.
3149
02:07:16,400 --> 02:07:18,815
And also on the premium
Institute from where you
3150
02:07:18,815 --> 02:07:20,149
have done your education.
3151
02:07:20,149 --> 02:07:21,751
Let's say if you are from IIT,
3152
02:07:21,751 --> 02:07:24,100
or IIM, definitely
your salary will be higher
3153
02:07:24,100 --> 02:07:25,300
than from other campuses.
3154
02:07:25,300 --> 02:07:26,100
Right?
3155
02:07:26,100 --> 02:07:27,072
It's a probability.
3156
02:07:27,072 --> 02:07:28,300
That's what I'm telling you.
3157
02:07:28,300 --> 02:07:28,900
So let's say
3158
02:07:28,900 --> 02:07:32,132
if I have to correlate now
in this case the education
3159
02:07:32,132 --> 02:07:35,600
and the salary, I can easily
create a correlation, right?
3160
02:07:35,600 --> 02:07:37,300
So that is
what correlation is.
3161
02:07:37,300 --> 02:07:38,589
So we are going to do all
3162
02:07:38,589 --> 02:07:40,573
that with respect
to our stock analysis.
3163
02:07:40,573 --> 02:07:41,869
Now, what we are doing
3164
02:07:41,869 --> 02:07:45,185
in this case: you can notice
we are creating this series one,
3165
02:07:45,185 --> 02:07:47,188
where we are hitting
the select query. Now,
3166
02:07:47,188 --> 02:07:49,401
we are mapping all
this AAON close price.
3167
02:07:49,401 --> 02:07:52,400
We are converting it to an RDD;
similar way for series two.
3168
02:07:52,400 --> 02:07:53,691
Also we are doing that right.
3169
02:07:53,691 --> 02:07:55,832
So this we are doing
for ABAX close; earlier
3170
02:07:55,832 --> 02:07:58,600
we have done it for AAON close,
and then in the end we
3171
02:07:58,600 --> 02:08:00,911
are using the Statistics
dot corr API to create
3172
02:08:00,911 --> 02:08:02,500
a correlation between them.
3173
02:08:02,600 --> 02:08:06,200
So you can notice this is how we
can execute everything now.
3174
02:08:06,200 --> 02:08:10,353
Let's go to our VM and see
everything in our execution.
3175
02:08:11,142 --> 02:08:12,757
There's a question from an attendee.
3176
02:08:12,900 --> 02:08:15,300
So, how will we
be getting this VM? You
3177
02:08:15,300 --> 02:08:17,659
will be getting all
this VM setup from Edureka.
3178
02:08:17,659 --> 02:08:19,815
So you need not worry
about all that but
3179
02:08:19,815 --> 02:08:21,930
the question of how you will be
getting all this VM:
3180
02:08:21,930 --> 02:08:24,100
so once you
enroll for the course,
3181
02:08:24,100 --> 02:08:27,300
you will be getting all
of this from the Edureka side.
3182
02:08:27,300 --> 02:08:28,541
so even if I am working
3183
02:08:28,541 --> 02:08:30,711
on Mac operating system
my VM will work.
3184
02:08:30,711 --> 02:08:32,300
Yes every operating system.
3185
02:08:32,300 --> 02:08:33,535
It will be supported.
3186
02:08:33,535 --> 02:08:35,592
So no trouble; you
can just use this
3187
02:08:35,592 --> 02:08:38,428
VM on, I mean,
any operating system to do that.
3188
02:08:38,428 --> 02:08:41,000
So what Edureka does
is, they just don't want
3189
02:08:41,000 --> 02:08:43,900
you to be troubled
with any sort of setup here.
3190
02:08:43,900 --> 02:08:46,076
So what they do is
they kind of ensure
3191
02:08:46,076 --> 02:08:48,342
that whatever is required
for your practicals.
3192
02:08:48,342 --> 02:08:49,400
They take care of it.
3193
02:08:49,400 --> 02:08:51,700
That's the reason they
have created their own VM,
3194
02:08:51,700 --> 02:08:54,600
which is also going to be
of a smaller size in comparison
3195
02:08:54,600 --> 02:08:56,100
to the Cloudera and Hortonworks VMs,
3196
02:08:56,100 --> 02:08:58,997
and this is going to definitely
be more helpful for you.
3197
02:08:58,997 --> 02:09:01,000
So all these things
will be provided to
3198
02:09:01,000 --> 02:09:02,524
you. A question from another attendee:
3199
02:09:02,524 --> 02:09:05,900
so, all these projects I am going
to learn from the sessions?
3200
02:09:05,900 --> 02:09:06,200
Yes.
3201
02:09:06,200 --> 02:09:09,650
So once you enroll:
right now, whatever we have seen
3202
02:09:09,650 --> 02:09:13,100
is definitely just an
upper-level view of
3203
02:09:13,100 --> 02:09:15,350
how the session looks
like, for our purposes.
3204
02:09:15,350 --> 02:09:18,700
But when we actually teach
all these things in the course,
3205
02:09:18,700 --> 02:09:21,587
it's usually in a much more
detailed format.
3206
02:09:21,587 --> 02:09:22,700
So in detail format,
3207
02:09:22,700 --> 02:09:25,300
we kind of keep on showing
you each step in detail
3208
02:09:25,300 --> 02:09:28,299
of how the things are working,
even including the project.
3209
02:09:28,299 --> 02:09:30,900
So you will also be learning
with the help of a project
3210
02:09:30,900 --> 02:09:32,157
on each different topic.
3211
02:09:32,157 --> 02:09:34,200
So that is the way
we kind of go for it.
3212
02:09:34,200 --> 02:09:36,605
Now, if I am stuck
in any of the projects, then
3213
02:09:36,605 --> 02:09:37,985
who will be helping me
3214
02:09:37,985 --> 02:09:40,308
There will be
a support team, 24 by 7;
3215
02:09:40,308 --> 02:09:42,046
if you get stuck at any moment,
3216
02:09:42,046 --> 02:09:44,300
you just need to
raise a request,
3217
02:09:44,300 --> 02:09:45,900
give a call, or email.
3218
02:09:45,900 --> 02:09:49,076
There is a support ticket
and immediately the technical
3219
02:09:49,076 --> 02:09:52,100
team will be helping you out;
the support team is 24 by 7.
3220
02:09:52,100 --> 02:09:53,900
They are
all technical people,
3221
02:09:53,900 --> 02:09:55,821
and they will be assisting
you across all of
3222
02:09:55,821 --> 02:09:58,100
that even the trainers
will be assisting you for any
3223
02:09:58,100 --> 02:10:00,000
of the technical queries. Great.
3224
02:10:00,000 --> 02:10:00,400
Awesome.
3225
02:10:00,800 --> 02:10:01,900
Thank you now.
3226
02:10:01,900 --> 02:10:03,700
So if you notice this is my data
3227
02:10:03,700 --> 02:10:06,446
we have; we were executing
all the things on this data.
3228
02:10:06,446 --> 02:10:08,726
Now what we want to do
if you notice this is
3229
02:10:08,726 --> 02:10:10,900
the same code which I
have just shown you.
3230
02:10:10,900 --> 02:10:13,800
Earlier also now let us
just execute this code.
3231
02:10:13,800 --> 02:10:15,481
So in order to execute this
3232
02:10:15,481 --> 02:10:18,345
what we can do is connect
to my Spark shell.
3233
02:10:18,345 --> 02:10:20,400
So let's get
connected to the Spark shell.
3234
02:10:21,700 --> 02:10:23,970
Once we are connected
to the Spark shell,
3235
02:10:23,970 --> 02:10:25,382
We will go step by step.
3236
02:10:25,382 --> 02:10:27,700
So first we will be
importing our package.
3237
02:10:31,400 --> 02:10:34,861
This takes some time; let
it just get connected.
3238
02:10:36,300 --> 02:10:38,400
Once this is connected now,
3239
02:10:38,400 --> 02:10:39,400
you can notice
3240
02:10:39,400 --> 02:10:42,400
that I'm just importing all
the important libraries;
3241
02:10:42,400 --> 02:10:44,400
we have already
learned about that.
3242
02:10:45,800 --> 02:10:49,137
After that, you will be
initialising your spark session.
3243
02:10:49,137 --> 02:10:49,805
So let's do
3244
02:10:49,805 --> 02:10:52,900
that again the same steps
what you have done before.
3245
02:10:58,600 --> 02:10:59,922
Once we are done,
3246
02:10:59,922 --> 02:11:02,000
We will be creating
a stock class.
3247
02:11:07,000 --> 02:11:09,900
We could have also directly
executed from Eclipse.
3248
02:11:09,900 --> 02:11:11,400
Also, this is just I want
3249
02:11:11,400 --> 02:11:13,800
to show you step-by-step
whatever we have learnt.
3250
02:11:13,800 --> 02:11:15,700
So now you can see
for company one and then
3251
02:11:15,700 --> 02:11:16,700
if you want to do
3252
02:11:16,700 --> 02:11:20,000
some computation, we want to even
see the values and all, right?
3253
02:11:20,000 --> 02:11:21,600
so that's what we're doing here.
3254
02:11:21,700 --> 02:11:24,700
So we are just reading
the files and creating an RDD,
3255
02:11:24,700 --> 02:11:26,800
you know, so let's execute this.
3256
02:11:28,500 --> 02:11:31,200
Similarly for your ABAX,
similarly for your FAST,
3257
02:11:31,200 --> 02:11:34,050
for all this so I'm just copying
all these things together
3258
02:11:34,050 --> 02:11:36,100
because there are a lot
of companies for which we
3259
02:11:36,100 --> 02:11:37,400
have to do all this step.
3260
02:11:37,400 --> 02:11:39,625
So let's run it
for all the 10 companies
3261
02:11:39,625 --> 02:11:41,200
which we are going to create.
3262
02:11:49,000 --> 02:11:49,900
So as you can see,
3263
02:11:49,900 --> 02:11:52,400
this printSchema has given
its output right now.
3264
02:11:52,400 --> 02:11:52,900
Similarly.
3265
02:11:52,900 --> 02:11:55,800
I can execute the rest
of the things as well.
3266
02:11:55,800 --> 02:11:57,800
So this is just working
in the similar way.
3267
02:11:57,800 --> 02:12:01,702
All the outputs will be shown
up here: company four, company five,
3268
02:12:01,702 --> 02:12:05,000
all these companies you
can see this in execution.
3269
02:12:08,000 --> 02:12:11,000
After that, we will be creating
our temporary view
3270
02:12:11,000 --> 02:12:13,800
so that we can execute
our SQL queries.
3271
02:12:16,500 --> 02:12:19,700
So let's do it for company one
also, and after that we
3272
02:12:19,700 --> 02:12:22,900
can just create our
temporary table for it.
3273
02:12:22,900 --> 02:12:25,200
Once we are done now
we can do our queries.
3274
02:12:25,200 --> 02:12:27,357
Like let's say we
can display the average
3275
02:12:27,357 --> 02:12:30,000
of the adjusted closing price
for AAON for each month,
3276
02:12:30,000 --> 02:12:31,400
so we can hit this query.
3277
02:12:34,700 --> 02:12:37,500
So all these queries will happen
on your temporary view
3278
02:12:37,600 --> 02:12:39,800
because we cannot anyway
run all these queries directly
3279
02:12:39,800 --> 02:12:41,471
on our data frames, right?
3280
02:12:41,471 --> 02:12:44,300
So you can see this
is getting executed.
3281
02:12:45,500 --> 02:12:49,200
It is printing the output also now,
because we have done dot show.
3282
02:12:49,200 --> 02:12:51,237
That's the reason
you're getting this output.
3283
02:12:51,237 --> 02:12:51,700
Similarly.
3284
02:12:51,700 --> 02:12:55,600
If we want to let's say list
the closing price for msft
3285
02:12:55,600 --> 02:12:57,600
which went up more than $2 a day.
3286
02:12:57,600 --> 02:12:58,794
So that query also we
3287
02:12:58,794 --> 02:13:02,500
can execute now we have already
understood this query in detail.
3288
02:13:03,100 --> 02:13:05,300
We are seeing its
execution now,
3289
02:13:05,500 --> 02:13:08,100
so that you can appreciate
whatever you have learned.
3290
02:13:08,300 --> 02:13:10,700
See this is the output
showing up to you.
3291
02:13:10,800 --> 02:13:12,300
Now after that
3292
02:13:12,300 --> 02:13:15,723
how you can join all the stock
closing prices, right? Similar way,
3293
02:13:15,723 --> 02:13:18,966
how we can save the joined view
in the Parquet format table.
3294
02:13:18,966 --> 02:13:20,435
You want to read that back.
3295
02:13:20,435 --> 02:13:22,157
You want to create a new table
3296
02:13:22,157 --> 02:13:25,275
So let's execute all
these three queries together,
3297
02:13:25,275 --> 02:13:27,100
because we have
already seen this.
3298
02:13:29,700 --> 02:13:30,502
Look at this.
3299
02:13:30,502 --> 02:13:31,800
So this in this case,
3300
02:13:31,800 --> 02:13:34,300
we are doing the join and
displaying this output.
3301
02:13:34,300 --> 02:13:36,499
Then we want to save it
in the Parquet files.
3302
02:13:36,499 --> 02:13:39,100
We are saving it, and we want
to again read it back.
3303
02:13:39,100 --> 02:13:40,893
Then we are creating
our new table, right?
3304
02:13:40,893 --> 02:13:42,043
We were doing that join
3305
02:13:42,043 --> 02:13:44,200
and all, so that is
what we are doing in this case.
3306
02:13:44,200 --> 02:13:45,900
Then you want
to see this output.
3307
02:13:47,700 --> 02:13:50,400
Then we are again storing it
as a temp table and all.
3308
02:13:50,499 --> 02:13:50,700
Now.
3309
02:13:50,700 --> 02:13:53,700
Once we are done with this step,
then what? So we
3310
02:13:53,700 --> 02:13:55,400
have done it in Step 6.
3311
02:13:55,400 --> 02:13:56,900
Now we want to perform,
3312
02:13:56,900 --> 02:13:58,488
let's say, a transformation
3313
02:13:58,488 --> 02:14:01,000
on the new table corresponding
to the three companies
3314
02:14:01,000 --> 02:14:03,411
so that we can compare.
We want to create
3315
02:14:03,411 --> 02:14:06,305
the best company containing
the best average closing price
3316
02:14:06,305 --> 02:14:07,748
for all these three companies.
3317
02:14:07,748 --> 02:14:09,300
We want to find the company
3318
02:14:09,300 --> 02:14:11,600
with the best closing
price average per year.
3319
02:14:11,600 --> 02:14:13,200
So let's do all that as well.
3320
02:14:18,800 --> 02:14:22,343
So you can see the best company
of the year. Now here also
3321
02:14:22,343 --> 02:14:26,500
we are doing the same stuff:
registering our temp table.
3322
02:14:34,100 --> 02:14:35,700
Okay, so there's a mistake here.
3323
02:14:35,700 --> 02:14:38,096
So if you notice here it is 1
3324
02:14:38,100 --> 02:14:40,722
but here we are doing
a show of 'all', right?
3325
02:14:40,722 --> 02:14:42,129
so there is a mistake.
3326
02:14:42,129 --> 02:14:43,600
I'm just correcting it.
3327
02:14:45,000 --> 02:14:48,300
So here also it should be
1; I'm just updating
3328
02:14:48,300 --> 02:14:51,300
it in the sheet itself so
that it will start working now.
3329
02:14:51,300 --> 02:14:53,102
So here I have just made it one.
3330
02:14:53,102 --> 02:14:55,300
So now after that it
will start working.
3331
02:14:55,300 --> 02:14:59,600
Okay, wherever it is going
to be 'all', I have to make it one.
3332
02:15:00,400 --> 02:15:03,500
So that is the change
which I need to do here also.
3333
02:15:04,400 --> 02:15:06,700
And you will notice
it will start working.
3334
02:15:06,900 --> 02:15:09,433
So here also you
need to make it one.
3335
02:15:09,433 --> 02:15:10,748
So all those places
3336
02:15:10,748 --> 02:15:14,363
wherever it was. So, just
a good point to make:
3337
02:15:14,363 --> 02:15:18,388
wherever you are working
on this, we need to always ensure
3338
02:15:18,388 --> 02:15:21,800
that all these values
you are putting up here are consistent.
3339
02:15:21,800 --> 02:15:25,900
Okay, so I could have also
done it like this one second.
3340
02:15:26,300 --> 02:15:27,876
In fact in this place.
3341
02:15:27,876 --> 02:15:30,600
I need not do all
this step one second.
3342
02:15:30,600 --> 02:15:33,842
Let me explain to you also
why. Now, in this place,
3343
02:15:33,842 --> 02:15:37,600
see, from here
this error started popping up. Why?
3344
02:15:37,600 --> 02:15:38,758
because my data frame
3345
02:15:38,758 --> 02:15:40,500
which I have created
here must be 'one'.
3346
02:15:40,500 --> 02:15:41,500
Let's execute it.
3347
02:15:41,500 --> 02:15:43,500
Now, you will notice
this query will start working.
3348
02:15:44,340 --> 02:15:45,659
See this is working.
3349
02:15:46,000 --> 02:15:46,300
Now.
3350
02:15:46,300 --> 02:15:47,000
After that.
3351
02:15:47,000 --> 02:15:49,493
I am creating a temp table;
that temp table
3352
02:15:49,493 --> 02:15:52,400
which we are creating is,
let's say, companyAll. Okay,
3353
02:15:52,400 --> 02:15:55,100
So this is the temp table
which we have created.
3354
02:15:55,100 --> 02:15:57,808
you can see this companyAll;
now in this case,
3355
02:15:57,808 --> 02:16:01,300
if I am keeping this
companyAll itself, it is going to work,
3356
02:16:02,000 --> 02:16:03,195
Because here anyway,
3357
02:16:03,195 --> 02:16:05,897
I'm going to use
whatever temporary table
3358
02:16:05,897 --> 02:16:07,310
we have created, right?
3359
02:16:07,310 --> 02:16:08,600
So now let's execute.
3360
02:16:10,800 --> 02:16:12,700
So you can see now
it started to work.
3361
02:16:14,000 --> 02:16:15,900
Now, further to that,
3362
02:16:15,900 --> 02:16:18,500
we want to create
a correlation between them
3363
02:16:18,500 --> 02:16:19,600
so we can do that.
3364
02:16:23,700 --> 02:16:26,400
See this is going to give
me the correlation
3365
02:16:26,400 --> 02:16:30,500
between the two columns,
and we can see that here.
3366
02:16:30,700 --> 02:16:34,445
So this is the correlation: the
closer it is to 1, the
3367
02:16:34,445 --> 02:16:37,950
better it is. Here it is
near to 1; it is 0.9,
3368
02:16:37,950 --> 02:16:39,400
which is a bigger value.
3369
02:16:39,400 --> 02:16:42,700
So definitely
they both are
3370
02:16:42,700 --> 02:16:45,700
highly correlated, meaning
definitely they are impacting
3371
02:16:45,700 --> 02:16:47,300
each other's stock price.
3372
02:16:47,400 --> 02:16:49,700
So this is all about the project
3373
02:16:49,700 --> 02:16:58,500
Now, welcome to this interesting
session on Spark Streaming
3374
02:16:58,673 --> 02:16:59,826
from Edureka.
3375
02:17:00,800 --> 02:17:02,261
What is Spark Streaming?
3376
02:17:02,261 --> 02:17:04,415
Is it like really important?
3377
02:17:04,500 --> 02:17:05,400
Definitely?
3378
02:17:05,400 --> 02:17:05,704
Yes.
3379
02:17:05,704 --> 02:17:07,001
Is it really hot?
3380
02:17:07,001 --> 02:17:07,600
Definitely?
3381
02:17:07,600 --> 02:17:08,100
Yes.
3382
02:17:08,100 --> 02:17:10,900
That's the reason we
are learning this technology.
3383
02:17:10,900 --> 02:17:14,600
And this is one of the most
sought-after things in the market;
3384
02:17:14,600 --> 02:17:16,272
when I say a hot thing, I mean
3385
02:17:16,272 --> 02:17:18,750
in terms of job market
I'm talking about.
3386
02:17:18,750 --> 02:17:21,600
So let's see what will be
our agenda for today.
3387
02:17:21,900 --> 02:17:25,500
So we are going to discuss
the Spark ecosystem,
3388
02:17:25,500 --> 02:17:27,900
where we are going
to see that okay,
3389
02:17:27,900 --> 02:17:28,700
what is Spark,
3390
02:17:28,700 --> 02:17:32,100
how Spark Streaming fits
in the Spark ecosystem,
3391
02:17:32,100 --> 02:17:35,631
why Spark Streaming; we
are going to have an overview
3392
02:17:35,631 --> 02:17:39,900
of Spark Streaming, kind of
getting into the basics of that.
3393
02:17:39,900 --> 02:17:41,832
We will learn about DStreams.
3394
02:17:41,832 --> 02:17:44,890
We will also learn about
DStream transformations.
3395
02:17:44,890 --> 02:17:46,800
We will be
learning about caching
3396
02:17:46,800 --> 02:17:51,200
and persistence, accumulators,
broadcast variables, checkpoints.
3397
02:17:51,200 --> 02:17:53,600
These are like advanced
concepts of Spark.
3398
02:17:54,100 --> 02:17:55,600
And then in the end,
3399
02:17:55,600 --> 02:17:59,900
we will walk through a use case
of Twitter sentiment analysis.
3400
02:18:00,500 --> 02:18:04,700
Now, what is streaming
let's understand that.
3401
02:18:04,800 --> 02:18:08,000
So let me start
by giving an example to you.
3402
02:18:08,600 --> 02:18:12,300
So let's say there is
a bank, and in the bank...
3403
02:18:12,500 --> 02:18:13,082
Definitely.
3404
02:18:13,082 --> 02:18:14,200
I'm pretty sure all
3405
02:18:14,200 --> 02:18:18,700
of you must have used credit
cards, debit cards, all those cards
3406
02:18:18,700 --> 02:18:20,900
which banks provide. Now,
3407
02:18:20,900 --> 02:18:23,500
let's say you
have done a transaction.
3408
02:18:23,500 --> 02:18:27,300
from India just now,
and within an hour
3409
02:18:27,300 --> 02:18:30,260
your card
is getting swiped in the US.
3410
02:18:30,260 --> 02:18:31,600
Is it even possible
3411
02:18:31,600 --> 02:18:35,801
for your card to reach the US
in an hour? Definitely no. Now,
3412
02:18:35,900 --> 02:18:38,100
how that bank will realize
3413
02:18:38,700 --> 02:18:41,000
that it is a fraud transaction?
3414
02:18:41,000 --> 02:18:44,600
Because the bank cannot let
that transaction happen,
3415
02:18:44,700 --> 02:18:46,238
They need to stop it
3416
02:18:46,238 --> 02:18:49,771
at the time when it
is getting swiped. Either
3417
02:18:49,771 --> 02:18:51,000
they can block it,
3418
02:18:51,000 --> 02:18:52,800
or give you a call and ask
3419
02:18:52,800 --> 02:18:55,394
whether it is a genuine
transaction or not,
3420
02:18:55,394 --> 02:18:57,000
Do something of that sort.
3421
02:18:57,692 --> 02:18:58,000
Now.
3422
02:18:58,000 --> 02:19:00,300
Do you think they will put
some manual person
3423
02:19:00,300 --> 02:19:01,127
behind the scene
3424
02:19:01,127 --> 02:19:03,300
who will be looking
at all the transactions
3425
02:19:03,300 --> 02:19:05,100
and will block them manually?
3426
02:19:05,100 --> 02:19:08,315
No, so they require
something of the sort
3427
02:19:08,315 --> 02:19:11,100
where the data will
be getting streamed,
3428
02:19:11,100 --> 02:19:12,500
and in real time
3429
02:19:12,500 --> 02:19:16,113
they should be able to catch it
with the help of some pattern.
3430
02:19:16,113 --> 02:19:17,851
They will do some processing
3431
02:19:17,851 --> 02:19:20,575
and they will get
some pattern out of it, and
3432
02:19:20,575 --> 02:19:23,305
if it is not looking
like a genuine transaction,
3433
02:19:23,305 --> 02:19:26,649
they will immediately
block it, or give you a call,
3434
02:19:26,649 --> 02:19:28,565
maybe send an OTP to confirm
3435
02:19:28,565 --> 02:19:31,100
whether it's a genuine
transaction or not. They
3436
02:19:31,100 --> 02:19:32,050
will not wait
3437
02:19:32,050 --> 02:19:36,000
till the next day to kind of
complete that transaction.
3438
02:19:36,000 --> 02:19:38,941
Otherwise what will happen?
Nobody is going to trust
3439
02:19:38,941 --> 02:19:40,000
that bank, right?
3440
02:19:40,000 --> 02:19:43,000
So that is how
streaming works.
3441
02:19:43,100 --> 02:19:46,300
Now someone has mentioned
3442
02:19:46,500 --> 02:19:51,400
that without stream processing
of data, big data is not even possible.
3443
02:19:51,400 --> 02:19:52,435
In fact, we can see
3444
02:19:52,435 --> 02:19:55,200
that there is no big data
which is possible without it.
3445
02:19:55,200 --> 02:19:57,900
We cannot even talk
about internet of things.
3446
02:19:57,900 --> 02:20:00,800
Right, and this is
a very famous statement
3447
02:20:00,800 --> 02:20:01,900
in the industry.
3448
02:20:01,900 --> 02:20:05,600
A lot of companies,
3449
02:20:05,700 --> 02:20:13,500
like YouTube, Netflix, Facebook,
Twitter, iTunes, and Pandora --
3450
02:20:13,769 --> 02:20:17,230
all these companies
are using Spark Streaming.
3451
02:20:17,700 --> 02:20:18,100
Now.
3452
02:20:19,100 --> 02:20:20,400
What is this?
3453
02:20:20,400 --> 02:20:23,580
We have just seen an
example to kind of get an idea
3454
02:20:23,580 --> 02:20:25,000
about streaming.
3455
02:20:25,100 --> 02:20:30,300
Now as I said, with time,
with the internet growing,
3456
02:20:30,453 --> 02:20:35,146
these streaming technologies
are becoming popular day by day.
3457
02:20:35,500 --> 02:20:39,300
It's a technique
to transfer the data
3458
02:20:39,500 --> 02:20:45,000
so that it can be processed
as a steady and continuous
3459
02:20:45,000 --> 02:20:47,000
stream, meaning immediately,
3460
02:20:47,000 --> 02:20:49,500
as and when the data is coming
3461
02:20:49,600 --> 02:20:52,900
you are continuously
processing it as well.
3462
02:20:53,600 --> 02:20:54,400
In fact,
3463
02:20:54,400 --> 02:20:58,938
this real-time streaming is
what is driving big data
3464
02:20:59,100 --> 02:21:02,000
and also the Internet of Things. Now,
3465
02:21:02,000 --> 02:21:04,786
there will be a lot of things,
like the fundamental unit
3466
02:21:04,786 --> 02:21:06,387
of streaming being data streams.
3467
02:21:06,387 --> 02:21:08,700
We will also be
transforming our streams;
3468
02:21:08,700 --> 02:21:09,700
We will be doing it.
3469
02:21:09,700 --> 02:21:10,994
In fact, the companies
3470
02:21:10,994 --> 02:21:13,400
are using it with
their business intelligence.
3471
02:21:13,400 --> 02:21:16,200
We will see more details
in the further slides.
3472
02:21:16,300 --> 02:21:20,900
But before that we will be
talking about spark ecosystem
3473
02:21:21,200 --> 02:21:23,500
When we talk about the Spark ecosystem,
3474
02:21:23,500 --> 02:21:25,653
there are multiple libraries
3475
02:21:25,653 --> 02:21:29,565
which are present in it; the first
one is Spark SQL. Now,
3476
02:21:29,565 --> 02:21:31,100
Spark SQL is where
3477
02:21:31,100 --> 02:21:35,000
an SQL developer
can write the query in an SQL way,
3478
02:21:35,000 --> 02:21:38,600
and it is going to get converted
into a spark way
3479
02:21:38,600 --> 02:21:42,828
and then give you the
output, kind of analogous to Hive,
3480
02:21:42,828 --> 02:21:46,400
but it is going to be faster
in comparison to Hive.
3481
02:21:46,400 --> 02:21:48,252
When we talk about Spark Streaming,
3482
02:21:48,252 --> 02:21:50,900
that is what we are going
to learn it is going
3483
02:21:50,900 --> 02:21:55,300
to enable all the analytical
and interactive applications
3484
02:21:55,600 --> 02:21:59,400
for your live
streaming data. MLlib:
3485
02:21:59,700 --> 02:22:02,400
MLlib is mostly
for machine learning.
3486
02:22:02,400 --> 02:22:03,546
And in fact,
3487
02:22:03,546 --> 02:22:06,007
the interesting part
about MLlib is
3488
02:22:06,200 --> 02:22:11,100
that it is completely replacing
Mahout, or has almost replaced it.
3489
02:22:11,100 --> 02:22:13,500
Now all the core contributors
3490
02:22:13,500 --> 02:22:17,700
of Mahout have moved
towards
3491
02:22:18,184 --> 02:22:19,800
the MLlib side
3492
02:22:19,800 --> 02:22:23,500
because of the faster response;
the performance is really good
3493
02:22:23,500 --> 02:22:26,707
in MLlib. Next, GraphX.
3494
02:22:26,707 --> 02:22:27,005
Okay.
3495
02:22:27,005 --> 02:22:29,794
Let me give you an example:
everybody must have used
3496
02:22:29,794 --> 02:22:31,100
Google Maps right now.
3497
02:22:31,100 --> 02:22:34,082
What do you do in Google Maps?
You search for the path.
3498
02:22:34,082 --> 02:22:36,600
You put your Source you
put your destination.
3499
02:22:36,600 --> 02:22:38,900
Now when you just
search for the path,
3500
02:22:39,000 --> 02:22:40,500
it searches different paths
3501
02:22:40,800 --> 02:22:45,100
and then provides you
an optimal path, right? Now,
3502
02:22:45,300 --> 02:22:47,300
how is it providing
the optimal path?
3503
02:22:47,300 --> 02:22:50,500
These things can be done
with the help of GraphX.
3504
02:22:50,500 --> 02:22:53,500
So wherever you can create
kind of a graph-based thing,
3505
02:22:53,500 --> 02:22:54,500
we will say
3506
02:22:54,500 --> 02:22:56,997
that we can use
GraphX. Next, SparkR:
3507
02:22:56,997 --> 02:22:57,300
Now.
3508
02:22:57,300 --> 02:23:00,600
this is the kind
of package provided for R.
3509
02:23:00,600 --> 02:23:02,538
So R is open source,
3510
02:23:02,538 --> 02:23:05,000
which is mostly used by analysts
3511
02:23:05,000 --> 02:23:08,300
and now the Spark community
wants, in fact, all
3512
02:23:08,300 --> 02:23:11,594
the analysts to kind of move
towards the Spark world.
3513
02:23:11,594 --> 02:23:12,900
And that's the reason
3514
02:23:12,900 --> 02:23:15,615
they have recently
started supporting SparkR,
3515
02:23:15,615 --> 02:23:17,226
where all the analysts
3516
02:23:17,226 --> 02:23:20,301
can now execute their queries
using the Spark environment,
3517
02:23:20,301 --> 02:23:22,800
thus getting better
performance, and we
3518
02:23:22,800 --> 02:23:25,000
can also work on Big Data.
3519
02:23:25,200 --> 02:23:27,800
That's all
about the ecosystem part;
3520
02:23:27,800 --> 02:23:31,061
below this we are going to have
the core engine. The core engine
3521
02:23:31,061 --> 02:23:34,500
is the one which defines all
the basics of Spark:
3522
02:23:34,500 --> 02:23:36,363
all the RDD-related stuff
3523
02:23:36,363 --> 02:23:38,600
and all is going to be defined
3524
02:23:38,600 --> 02:23:43,300
in your Spark core engine.
Moving further now,
3525
02:23:43,300 --> 02:23:46,227
so as we have just
discussed this part we
3526
02:23:46,227 --> 02:23:49,767
are going to now discuss
Spark Streaming in detail,
3527
02:23:49,767 --> 02:23:53,500
which is going to enable
analytical and interactive applications
3528
02:23:53,600 --> 02:23:58,300
for live streaming data.
Now, why Spark Streaming?
3529
02:23:58,800 --> 02:24:01,400
If I talk about why
Spark Streaming: definitely,
3530
02:24:01,400 --> 02:24:04,230
we have just gotten an idea
that it is very important.
3531
02:24:04,230 --> 02:24:06,100
That's the reason
we are learning it
3532
02:24:06,200 --> 02:24:09,804
but this is so powerful
that it is used now
3533
02:24:09,804 --> 02:24:14,169
by a lot of companies
to perform their marketing; they
3534
02:24:14,169 --> 02:24:15,900
kind of get an idea
3535
02:24:15,900 --> 02:24:18,250
that what a customer
is looking for.
3536
02:24:18,250 --> 02:24:22,094
In fact, we are going to learn
a use case of similar to that
3537
02:24:22,094 --> 02:24:24,700
where we are going
to use Spark Streaming:
3538
02:24:24,700 --> 02:24:28,283
we are going to do
Twitter sentiment analysis,
3539
02:24:28,283 --> 02:24:31,100
which can be used
for your crisis management.
3540
02:24:31,100 --> 02:24:33,680
Maybe you want to check
all your products
3541
02:24:33,680 --> 02:24:35,100
or how your services behave,
3542
02:24:35,100 --> 02:24:37,420
or things like target marketing;
3543
02:24:37,500 --> 02:24:40,342
by all the companies
around the world.
3544
02:24:40,342 --> 02:24:42,800
This is getting used
in this way.
3545
02:24:42,817 --> 02:24:46,355
And that's the reason
Spark Streaming is gaining
3546
02:24:46,355 --> 02:24:50,432
popularity, because
of its performance as well.
3547
02:24:50,600 --> 02:24:53,200
It is beating
other platforms
3548
02:24:53,600 --> 02:24:57,400
at the moment.
Now, moving further,
3549
02:24:57,600 --> 02:25:01,300
let's see Spark Streaming
features. When we talk
3550
02:25:01,300 --> 02:25:03,300
about Spark Streaming features:
3551
02:25:03,400 --> 02:25:05,100
It's very easy to scale.
3552
02:25:05,100 --> 02:25:07,420
You can scale
to even multiple nodes
3553
02:25:07,420 --> 02:25:11,083
which can even run to hundreds
of nodes. Speed is going
3554
02:25:11,083 --> 02:25:14,000
to be very quick means
in a very short time.
3555
02:25:14,000 --> 02:25:17,900
you can stream as well as
process your data. Fault tolerance:
3556
02:25:17,900 --> 02:25:19,300
it is made sure
3557
02:25:19,300 --> 02:25:23,100
that you're not losing
your data. Integration:
3558
02:25:23,100 --> 02:25:26,600
integration with your batch-time and
real-time processing is possible,
3559
02:25:26,600 --> 02:25:30,446
and it can also be used
for your business analytics
3560
02:25:30,500 --> 02:25:34,800
which is used to track
the behavior of your customer.
3561
02:25:34,900 --> 02:25:38,700
So as you can see, this
is super powerful, and it's
3562
02:25:38,700 --> 02:25:43,000
like we are kind of getting to
know so many interesting things
3563
02:25:43,000 --> 02:25:48,000
about Spark Streaming. Now, let's next
quickly have an overview
3564
02:25:48,000 --> 02:25:50,900
so that we can get
some basics of Spark Streaming.
3565
02:25:50,900 --> 02:25:53,200
Now let's understand:
3566
02:25:53,200 --> 02:25:54,300
what is Spark Streaming?
3567
02:25:55,100 --> 02:25:59,200
So as we have just discussed it
is for real-time streaming data.
3568
02:25:59,600 --> 02:26:04,100
It is a useful addition
to your Spark core API.
3569
02:26:04,100 --> 02:26:06,500
So we have already seen
at the base level.
3570
02:26:06,500 --> 02:26:07,400
We have that spark
3571
02:26:07,400 --> 02:26:10,700
core in our ecosystem; on top
of that we have Spark Streaming.
3572
02:26:10,700 --> 02:26:14,700
In fact, Spark Streaming
is kind of adding a lot
3573
02:26:14,700 --> 02:26:18,000
of value to the Spark community,
3574
02:26:18,000 --> 02:26:22,349
because a lot of people are
joining the Spark community only to kind
3575
02:26:22,349 --> 02:26:23,800
of use Spark Streaming.
3576
02:26:23,800 --> 02:26:25,000
It's so powerful.
3577
02:26:25,000 --> 02:26:26,344
Everyone wants to come
3578
02:26:26,344 --> 02:26:29,478
and wants to use it,
because all the other frameworks
3579
02:26:29,478 --> 02:26:30,809
which we already have
3580
02:26:30,809 --> 02:26:33,469
which are existing are
not as good in terms
3581
02:26:33,469 --> 02:26:34,783
of performance and all.
3582
02:26:34,783 --> 02:26:36,311
And the ease
3583
02:26:36,311 --> 02:26:38,482
of use of Spark
Streaming is also great.
3584
02:26:38,482 --> 02:26:41,482
If you compare your program
with, let's say, Apache Storm,
3585
02:26:41,482 --> 02:26:44,100
which is used
for real-time processing,
3586
02:26:44,100 --> 02:26:46,356
You will notice
that it is much easier
3587
02:26:46,356 --> 02:26:49,100
from
a developer point of view.
3588
02:26:49,100 --> 02:26:52,400
That's the reason a lot
of people are showing interest
3589
02:26:52,400 --> 02:26:53,800
in this domain now,
3590
02:26:53,800 --> 02:26:56,800
It also enables
high-throughput
3591
02:26:56,800 --> 02:26:58,187
and fault-tolerant processing,
3592
02:26:58,187 --> 02:27:02,725
so that you can stream your data
and process all of it,
3593
02:27:02,900 --> 02:27:06,900
and the fundamental unit
of Spark Streaming is going
3594
02:27:06,900 --> 02:27:08,200
to be the DStream.
3595
02:27:08,300 --> 02:27:09,700
What is a DStream?
3596
02:27:09,700 --> 02:27:10,600
Let me explain it.
3597
02:27:11,100 --> 02:27:14,200
So a DStream is
basically a series
3598
02:27:14,200 --> 02:27:18,900
of RDDs to process
the real-time data.
3599
02:27:19,400 --> 02:27:21,100
What we generally do is
3600
02:27:21,100 --> 02:27:23,678
if you look
at this slide,
3601
02:27:23,678 --> 02:27:25,300
when you get the data,
3602
02:27:25,400 --> 02:27:29,800
it is continuous data; you
divide it into batches
3603
02:27:29,800 --> 02:27:31,200
of input data.
3604
02:27:31,400 --> 02:27:35,700
We are going to call it
as micro batch and then
3605
02:27:35,700 --> 02:27:39,447
we are going to get batches
of processed data. Though
3606
02:27:39,447 --> 02:27:40,600
it is real time,
3607
02:27:40,600 --> 02:27:42,300
still, how come it is batch?
3608
02:27:42,300 --> 02:27:44,547
because definitely you
are doing processing
3609
02:27:44,547 --> 02:27:46,258
on some part of the data, right?
3610
02:27:46,258 --> 02:27:48,300
Even if it is coming
at real time.
3611
02:27:48,300 --> 02:27:52,500
And that is what we are going
to call a micro-batch.
3612
02:27:53,600 --> 02:27:55,700
Moving further now.
3613
02:27:56,600 --> 02:27:59,100
Let's see few more
details on it.
3614
02:27:59,223 --> 02:28:02,300
Now, from where can you
get all your data?
3615
02:28:02,300 --> 02:28:04,600
What can be your
data sources here.
3616
02:28:04,600 --> 02:28:09,000
So if we talk about data sources
here, we can stream the data
3617
02:28:09,000 --> 02:28:13,700
from multiple sources,
like Kafka or past events.
3618
02:28:13,700 --> 02:28:16,586
You have databases
like HBase, MongoDB,
3619
02:28:16,586 --> 02:28:20,051
which are, you know,
NoSQL databases, Elasticsearch,
3620
02:28:20,051 --> 02:28:24,600
PostgreSQL, Parquet file format; you
can get all the data from here.
3621
02:28:24,600 --> 02:28:27,700
Now after that you can also
do processing
3622
02:28:27,700 --> 02:28:29,553
with the help
of machine learning.
3623
02:28:29,553 --> 02:28:32,700
You can do the processing
with the help of your spark SQL
3624
02:28:32,700 --> 02:28:34,800
and then give the output.
3625
02:28:34,900 --> 02:28:37,000
So this is a very strong thing
3626
02:28:37,000 --> 02:28:40,100
that you are bringing
the data using Spark Streaming,
3627
02:28:40,100 --> 02:28:41,964
but processing you can do
3628
02:28:41,964 --> 02:28:44,800
by using some other
Frameworks as well.
3629
02:28:44,800 --> 02:28:47,514
Right like machine learning
you can apply on the data
3630
02:28:47,514 --> 02:28:49,549
which you're getting
at real time.
3631
02:28:49,549 --> 02:28:51,966
You can also apply
your Spark SQL on the data
3632
02:28:51,966 --> 02:28:53,200
which you're getting at.
3633
02:28:53,200 --> 02:28:56,300
real time. Moving further.
3634
02:28:57,100 --> 02:29:00,089
So this is the same thing; now,
in Spark Streaming,
3635
02:29:00,089 --> 02:29:03,200
you can just get the data
from multiple sources,
3636
02:29:03,200 --> 02:29:07,600
like from Kafka, Flume,
HDFS, Kinesis, Twitter, bringing it
3637
02:29:07,600 --> 02:29:10,300
into Spark Streaming,
doing the processing
3638
02:29:10,300 --> 02:29:12,500
and storing it back
to your hdfs.
3639
02:29:12,500 --> 02:29:15,900
Maybe you can bring it to
your database; you can also publish
3640
02:29:15,900 --> 02:29:17,400
to your UI dashboard.
3641
02:29:17,400 --> 02:29:21,402
Like Tableau, AngularJS; a lot
of UI dashboards are there
3642
02:29:21,700 --> 02:29:25,100
in which you can publish
your output now.
3643
02:29:25,500 --> 02:29:26,346
Alright,
3644
02:29:26,346 --> 02:29:29,782
let us just break this down
into more fine-grained terms.
3645
02:29:29,782 --> 02:29:32,700
Now we are going to get
our input data stream.
3646
02:29:32,700 --> 02:29:34,500
We are going to put it inside
3647
02:29:34,500 --> 02:29:38,200
of Spark Streaming, going to get
the batches of input data.
3648
02:29:38,200 --> 02:29:40,772
Once it executes
through the Spark engine,
3649
02:29:40,772 --> 02:29:44,300
we are going to get batches
of processed data.
3650
02:29:44,300 --> 02:29:47,146
We have just seen
the same diagram before so
3651
02:29:47,146 --> 02:29:49,000
the same explanation for it.
3652
02:29:49,000 --> 02:29:52,400
Now again breaking it down
into more granular parts:
3653
02:29:52,400 --> 02:29:55,060
we are getting a DStream;
a DStream is
3654
02:29:55,060 --> 02:29:58,800
a bundle of data,
multiple sets of RDDs;
3655
02:29:58,800 --> 02:30:00,500
so we are getting a DStream.
3656
02:30:00,500 --> 02:30:03,400
So let's say we are getting
an RDD at the rate of time t,
3657
02:30:03,400 --> 02:30:06,200
because now we are getting
real stream data, right?
3658
02:30:06,200 --> 02:30:07,936
So let's say, right now,
3659
02:30:07,936 --> 02:30:08,872
in one second I got some data.
3660
02:30:08,872 --> 02:30:11,399
Maybe in the next
one second
3661
02:30:11,399 --> 02:30:14,600
I got more data, and more
data in the next timeframe.
3662
02:30:14,600 --> 02:30:16,300
So that is what
we're talking about.
3663
02:30:16,300 --> 02:30:17,602
So, for the data
3664
02:30:17,602 --> 02:30:20,322
we are getting from time
0 to time 1, we say
3665
02:30:20,322 --> 02:30:22,171
that we have an RDD at the rate
3666
02:30:22,171 --> 02:30:24,556
of time 1; similarly,
it keeps proceeding
3667
02:30:24,556 --> 02:30:27,300
with time as the data
is getting processed here.
3668
02:30:27,400 --> 02:30:30,683
Now, next,
we are extracting the words
3669
02:30:30,683 --> 02:30:32,400
from an input stream. So,
3670
02:30:32,400 --> 02:30:33,300
if you can notice
3671
02:30:33,300 --> 02:30:35,550
what we are doing here
from where let's say,
3672
02:30:35,550 --> 02:30:37,700
we start applying
our operations,
3673
02:30:37,700 --> 02:30:40,419
we start doing
any sort of processing.
3674
02:30:40,419 --> 02:30:43,200
So as in when we get the data
in this timeframe,
3675
02:30:43,200 --> 02:30:44,707
we start processing it.
3676
02:30:44,707 --> 02:30:46,307
It can be a flat map operation.
3677
02:30:46,307 --> 02:30:49,300
It can be any sort of operation
you're doing it can be even
3678
02:30:49,300 --> 02:30:51,800
a machine-learning operation,
whatever you are doing,
3679
02:30:51,800 --> 02:30:55,600
and then you are generating
the words and that kind of thing.
3680
02:30:55,700 --> 02:30:58,700
So this is how we
as we're seeing
3681
02:30:58,700 --> 02:31:02,700
gradually we can kind
of see all these parts,
3682
02:31:02,700 --> 02:31:04,620
at a very high level, how this works.
3683
02:31:04,620 --> 02:31:06,738
We again went into
detail then again,
3684
02:31:06,738 --> 02:31:08,249
we went into more detail.
3685
02:31:08,249 --> 02:31:09,700
And finally we have seen
3686
02:31:09,700 --> 02:31:13,600
how we can even process
the data along with time
3687
02:31:13,600 --> 02:31:16,594
when we are streaming
our data as well.
3688
02:31:17,100 --> 02:31:21,500
Now, one important point: just
like the Spark context is
3689
02:31:21,853 --> 02:31:25,700
the main entry point for
any Spark application, similarly,
3690
02:31:25,700 --> 02:31:28,300
to work on streaming with Spark
3691
02:31:28,300 --> 02:31:31,600
Streaming, you require
a streaming context.
3692
02:31:31,700 --> 02:31:35,800
What is that? When you're passing
your input data stream,
3693
02:31:35,800 --> 02:31:38,400
when you are working
on the Spark engine
3694
02:31:38,400 --> 02:31:41,000
when you're working
on this Spark Streaming engine,
3695
02:31:41,000 --> 02:31:42,900
you have to use your streaming
3696
02:31:42,900 --> 02:31:46,289
context; it is using
the streaming context only that
3697
02:31:46,289 --> 02:31:48,700
you are going to get the batches
3698
02:31:48,700 --> 02:31:52,300
of your input data. Now,
the streaming context
3699
02:31:52,300 --> 02:31:57,000
is going to consume a stream
of data in Apache Spark;
3700
02:31:57,300 --> 02:31:58,800
it registers
3701
02:31:58,800 --> 02:32:04,000
an input DStream to produce
a receiver object.
3702
02:32:04,500 --> 02:32:08,200
Now it is the main entry point
as we discussed
3703
02:32:08,200 --> 02:32:11,011
that like spark context is
the main entry point
3704
02:32:11,011 --> 02:32:12,600
for the spark application.
3705
02:32:12,600 --> 02:32:13,400
Similarly.
3706
02:32:13,400 --> 02:32:16,110
Your streaming context
is an entry point
3707
02:32:16,110 --> 02:32:17,500
for your Spark Streaming.
3708
02:32:17,500 --> 02:32:20,800
Now, does that mean
the Spark context is
3709
02:32:20,800 --> 02:32:22,569
not an entry point? No.
3710
02:32:22,569 --> 02:32:25,779
When you create Spark Streaming,
it is dependent
3711
02:32:25,779 --> 02:32:27,600
on your Spark context.
3712
02:32:27,600 --> 02:32:30,007
So when you create
this streaming context,
3713
02:32:30,007 --> 02:32:33,509
it is going to be dependent
on your Spark context only,
3714
02:32:33,509 --> 02:32:36,732
because you will not be able
to create a streaming context
3715
02:32:36,732 --> 02:32:38,000
without a Spark context.
3716
02:32:38,000 --> 02:32:41,000
So that's the reason it
is definitely required. Spark
3717
02:32:41,000 --> 02:32:45,600
also provides a number of default
implementations of sources,
3718
02:32:45,800 --> 02:32:50,000
like reading the data
from Twitter, Akka actors, ZeroMQ,
3719
02:32:50,100 --> 02:32:53,100
which are accessible
from the context.
3720
02:32:53,100 --> 02:32:55,800
So it is supporting
so many things, right?
3721
02:32:55,800 --> 02:32:58,600
Now, if you notice
3722
02:32:58,600 --> 02:33:01,000
what we are doing
with the streaming context:
3723
02:33:01,000 --> 02:33:03,497
this is just to give
you an idea about
3724
02:33:03,497 --> 02:33:06,500
how we can initialize
our streaming context.
3725
02:33:06,500 --> 02:33:09,971
So we will be importing
these two libraries after that.
3726
02:33:09,971 --> 02:33:12,923
Can you see, I'm passing
the Spark context sc, right, and
3727
02:33:12,923 --> 02:33:14,400
passing Seconds(1).
3728
02:33:14,400 --> 02:33:17,323
We are collecting the data
means collect the data
3729
02:33:17,323 --> 02:33:18,400
for every 1 second.
3730
02:33:18,400 --> 02:33:21,500
You can increase this number
if you want and then this
3731
02:33:21,500 --> 02:33:24,028
is your ssc, meaning
in every one second,
3732
02:33:24,028 --> 02:33:25,482
whatever is going to happen,
3733
02:33:25,482 --> 02:33:27,000
I'm going to process it.
3734
02:33:27,000 --> 02:33:28,800
And what we're doing
in this place,
3735
02:33:28,900 --> 02:33:33,100
Let's go to the DStream topic
now. Now, DStream:
3736
02:33:33,500 --> 02:33:37,000
the full form
is discretized stream.
3737
02:33:37,053 --> 02:33:38,900
It's a basic abstraction
3738
02:33:38,900 --> 02:33:41,679
provided by your Spark
Streaming framework.
3739
02:33:41,679 --> 02:33:46,400
It represents a stream of data,
and it is going to be received
3740
02:33:46,400 --> 02:33:47,630
from your source
3741
02:33:47,630 --> 02:33:52,200
or from a processed data stream.
The streaming context is related
3742
02:33:52,200 --> 02:33:56,900
to your Spark Streaming,
while the Spark context belongs
3743
02:33:56,900 --> 02:33:57,974
to your Spark core.
3744
02:33:57,974 --> 02:34:01,600
If you remember the ecosystem
diagram: in the ecosystem,
3745
02:34:01,600 --> 02:34:06,400
we have that Spark context, right?
Now, the streaming context is built
3746
02:34:06,400 --> 02:34:08,784
with the help of spark context.
3747
02:34:08,800 --> 02:34:11,800
And in fact using
streaming context only
3748
02:34:11,800 --> 02:34:15,604
you will be able to perform
your Spark Streaming, just like
3749
02:34:15,604 --> 02:34:17,722
without the Spark context you will
3750
02:34:17,722 --> 02:34:19,700
not be able to execute anything
3751
02:34:19,700 --> 02:34:22,482
in a Spark application;
your Spark application
3752
02:34:22,482 --> 02:34:25,100
will not be able
to do anything. Similarly,
3753
02:34:25,100 --> 02:34:27,200
without the streaming context,
3754
02:34:27,200 --> 02:34:31,500
your streaming application
will not be able to do anything.
3755
02:34:31,500 --> 02:34:34,838
It's just that the streaming
context is built on top
3756
02:34:34,838 --> 02:34:36,100
of spark context.
3757
02:34:36,500 --> 02:34:39,700
Okay, so now, it's
a continuous stream
3758
02:34:39,700 --> 02:34:42,400
of data we are talking
about with DStreams.
3759
02:34:42,400 --> 02:34:46,200
It is received from a source,
or it is the processed data stream
3760
02:34:46,200 --> 02:34:49,000
generated by the
transformation of the input stream.
3761
02:34:49,300 --> 02:34:53,800
If you look at this part,
internally a DStream
3762
02:34:53,800 --> 02:34:57,389
can be represented by
a continuous series of
3763
02:34:57,389 --> 02:34:59,620
RDDs; this is important.
3764
02:34:59,946 --> 02:35:04,400
Now what we're doing is
every second remember last time
3765
02:35:04,400 --> 02:35:05,800
we have just seen an example
3766
02:35:05,900 --> 02:35:08,335
of like every second
whatever going to happen.
3767
02:35:08,335 --> 02:35:10,100
We are going to do processing.
3768
02:35:10,200 --> 02:35:13,700
So in that every second
whatever data you
3769
02:35:13,700 --> 02:35:17,300
are collecting and you're
performing your operation.
3770
02:35:17,300 --> 02:35:18,010
So the data
3771
02:35:18,010 --> 02:35:21,500
which you're getting here
will be your DStream; it means
3772
02:35:21,500 --> 02:35:23,129
it's continuous, you can say;
3773
02:35:23,129 --> 02:35:26,200
all these things together
will be your DStream.
3774
02:35:26,200 --> 02:35:29,800
It's a representation
by a continuous series
3775
02:35:29,800 --> 02:35:32,300
of RDDs, so
many RDDs are getting created,
3776
02:35:32,300 --> 02:35:34,500
because, let's say right
now, in one second,
3777
02:35:34,500 --> 02:35:36,000
whatever data got collected,
3778
02:35:36,000 --> 02:35:37,100
I executed it.
3779
02:35:37,100 --> 02:35:40,500
In the second second,
this data is happening here.
3780
02:35:40,715 --> 02:35:41,100
Okay?
3781
02:35:41,100 --> 02:35:41,800
Okay.
3782
02:35:41,800 --> 02:35:42,700
Sorry for that.
3783
02:35:42,700 --> 02:35:46,300
Now in the second timeframe
also it is happening; in
3784
02:35:46,300 --> 02:35:47,400
the third second
3785
02:35:47,400 --> 02:35:49,000
also it is happening here.
3786
02:35:49,700 --> 02:35:50,500
No problem.
3787
02:35:50,500 --> 02:35:53,100
No, I'm not going
to do it now fine.
3788
02:35:53,100 --> 02:35:54,727
So in the third second also,
3789
02:35:54,727 --> 02:35:57,200
if I did something
I'm processing it here.
3790
02:35:57,200 --> 02:35:57,500
Right.
3791
02:35:57,500 --> 02:35:59,800
So if you see
this diagram itself,
3792
02:35:59,800 --> 02:36:03,600
so it is every second whatever
data is getting collected.
3793
02:36:03,600 --> 02:36:05,400
We are doing the processing
3794
02:36:05,400 --> 02:36:09,250
on top of it, and the whole
continuous series of RDDs
3795
02:36:09,250 --> 02:36:13,100
what we are seeing here
will be called as the DStream.
3796
02:36:13,100 --> 02:36:13,500
Okay.
3797
02:36:13,500 --> 02:36:18,100
So this is what your DStream is.
Moving further now,
3798
02:36:18,600 --> 02:36:22,300
we are going to understand
the operations on the DStream.
3799
02:36:22,300 --> 02:36:24,500
So let's say you are doing
3800
02:36:24,500 --> 02:36:27,300
this operation on this DStream
that you are getting.
3801
02:36:27,300 --> 02:36:30,000
The data from 0 to 1 again,
3802
02:36:30,000 --> 02:36:32,300
you are applying some operation
3803
02:36:32,300 --> 02:36:36,108
on that then whatever output
you get you're going to call
3804
02:36:36,108 --> 02:36:39,200
it as the words DStream.
Means, this is the thing
3805
02:36:39,200 --> 02:36:41,166
what you're doing: you're doing
a flatMap operation.
3806
02:36:41,166 --> 02:36:42,700
That's the reason
we're calling it as
3807
02:36:42,700 --> 02:36:46,058
the words DStream. Now similarly,
whatever thing you're doing,
3808
02:36:46,058 --> 02:36:48,000
So you're going
to get accordingly
3809
02:36:48,000 --> 02:36:50,569
an output DStream
for it as well.
3810
02:36:50,569 --> 02:36:55,100
So this is what is happening
in this particular example now.
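As a hedged sketch of that lines-to-words idea (the socket source on port 9999 is an assumption for illustration):

    val lines = ssc.socketTextStream("localhost", 9999) // input DStream of raw lines
    val words = lines.flatMap(_.split(" "))             // the words DStream derived from it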
3811
02:36:56,700 --> 02:36:59,700
Flat map: flatMap is an API.
3812
02:37:00,000 --> 02:37:02,100
It is very similar to map.
3813
02:37:02,100 --> 02:37:04,089
It kind of flattens
up your values.
3814
02:37:04,089 --> 02:37:04,400
Okay.
3815
02:37:04,400 --> 02:37:06,400
So let me explain you
with an example.
3816
02:37:06,400 --> 02:37:07,300
what is flatMap.
3817
02:37:07,500 --> 02:37:10,100
So let's say
if I say that hi,
3818
02:37:10,400 --> 02:37:13,200
this is edureka.
3819
02:37:14,500 --> 02:37:15,600
Welcome.
3820
02:37:16,200 --> 02:37:18,100
Okay, let's say this is the data.
3821
02:37:18,222 --> 02:37:18,723
Now.
3822
02:37:18,723 --> 02:37:20,800
I want to apply a flatMap.
3823
02:37:20,800 --> 02:37:22,900
So let's say this is
in the form of an RDD.
3824
02:37:22,900 --> 02:37:24,600
Now on this rdd,
3825
02:37:24,600 --> 02:37:28,200
let's say I apply flatMap;
so let's say rdd, this is
3826
02:37:28,200 --> 02:37:30,000
the RDD, dot flatMap.
3827
02:37:31,600 --> 02:37:35,000
It's not map,
it's flatMap.
3828
02:37:35,100 --> 02:37:38,467
And then let's say you want
to define something for it.
3829
02:37:38,467 --> 02:37:40,400
So let's say you say that okay,
3830
02:37:41,100 --> 02:37:43,400
you are defining
a variable, say a.
3831
02:37:43,700 --> 02:37:48,300
So let's say a, a dot; now
3832
02:37:48,400 --> 02:37:53,300
after that you are defining
your a dot split: split.
3833
02:37:55,300 --> 02:37:58,417
We're splitting with respect
to space. Now in this case,
3834
02:37:58,417 --> 02:38:00,106
what is going to happen now?
3835
02:38:00,106 --> 02:38:03,966
I'm not writing the exact thing here,
just giving an example of flatMap,
3836
02:38:03,966 --> 02:38:06,500
just to kind of give
you an idea about it.
3837
02:38:06,503 --> 02:38:09,196
It is going to flatten
up this file
3838
02:38:09,200 --> 02:38:11,200
with respect to the split
3839
02:38:11,200 --> 02:38:15,200
what you have mentioned here,
means it is now going to
3840
02:38:15,200 --> 02:38:18,500
create each element as one word.
3841
02:38:18,684 --> 02:38:21,915
It is going to create
like this: hi as one
3842
02:38:22,200 --> 02:38:26,100
word, one element; this
as one element;
3843
02:38:26,100 --> 02:38:27,515
is as another word,
3844
02:38:27,515 --> 02:38:30,939
one element; edureka as
one word, one element;
3845
02:38:30,939 --> 02:38:33,200
welcome as one
word, for example.
3846
02:38:33,200 --> 02:38:33,841
So this is
3847
02:38:33,841 --> 02:38:37,558
how your flatMap works: it kind
of flattens up your whole file.
3848
02:38:37,558 --> 02:38:40,700
So this is what we are doing
in our DStream as well.
3849
02:38:40,700 --> 02:38:43,400
So this is
how this will work.
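Here is a minimal runnable sketch of that flatMap call in the spark-shell, mirroring the example sentence:

    val rdd  = sc.parallelize(Seq("hi this is edureka welcome"))
    val flat = rdd.flatMap(a => a.split(" ")) // flattens: each word becomes one element
    flat.collect()                            // Array(hi, this, is, edureka, welcome)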
3850
02:38:44,100 --> 02:38:47,143
Now so we have just
understood this part.
3851
02:38:47,143 --> 02:38:51,100
Now, let's understand input
DStreams and receivers.
3852
02:38:51,100 --> 02:38:52,500
Okay, what are these things?
3853
02:38:52,500 --> 02:38:53,900
Let's understand this part.
3854
02:38:54,800 --> 02:38:55,200
Okay.
3855
02:38:55,200 --> 02:38:57,700
So what are the input
DStreams possible?
3856
02:38:57,700 --> 02:39:00,900
They can be basic sources
or advanced sources. In basic sources
3857
02:39:00,900 --> 02:39:04,500
we can have file systems,
socket connections;
3858
02:39:04,600 --> 02:39:08,400
in advanced sources we
can have Kafka, Flume, Kinesis.
3859
02:39:08,800 --> 02:39:09,200
Okay.
3860
02:39:09,300 --> 02:39:10,800
So your input DStreams are
3861
02:39:10,800 --> 02:39:14,000
the DStreams
representing the stream
3862
02:39:14,300 --> 02:39:19,200
of input data received
from streaming sources.
3863
02:39:19,400 --> 02:39:20,865
This is again the same thing.
3864
02:39:20,865 --> 02:39:21,136
Okay.
3865
02:39:21,136 --> 02:39:23,198
So there are
two types of sources
3866
02:39:23,198 --> 02:39:24,500
which we just discussed.
3867
02:39:24,600 --> 02:39:27,676
First is your basic and second
is your advanced.
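Two basic-source sketches (paths and ports are hypothetical; advanced sources like Kafka need extra connector libraries):

    val fromFiles  = ssc.textFileStream("hdfs://localhost:9000/stream-in") // watches a directory for new files
    val fromSocket = ssc.socketTextStream("localhost", 9999)               // reads lines from a TCP socket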
3868
02:39:28,400 --> 02:39:29,800
Let's move further.
3869
02:39:30,700 --> 02:39:33,700
Now what are we going
to see here?
3870
02:39:33,700 --> 02:39:35,870
So if you notice let's see here.
3871
02:39:35,870 --> 02:39:39,600
There are some events; first it
is going to your receiver
3872
02:39:39,600 --> 02:39:44,158
and then into the DStream. Now
RDDs are getting created
3873
02:39:44,158 --> 02:39:47,082
and we are performing
some steps on it.
3874
02:39:47,300 --> 02:39:52,300
So the receiver sends
the data into the DStream,
3875
02:39:52,500 --> 02:39:57,100
where each batch is going
to contain the RDDs.
3876
02:39:57,200 --> 02:40:00,800
So this is what your
receiver is doing, what it
3877
02:40:00,800 --> 02:40:02,500
is doing here. Now,
3878
02:40:03,500 --> 02:40:07,200
moving further: transformations
on the DStream.
3879
02:40:07,200 --> 02:40:08,384
Let's understand that.
3880
02:40:08,384 --> 02:40:10,500
What are the
Transformations available?
3881
02:40:10,500 --> 02:40:13,000
There are multiple
Transformations, which are
3882
02:40:13,000 --> 02:40:14,700
possible; the most popular ones,
3883
02:40:14,700 --> 02:40:16,100
Let's talk about that.
3884
02:40:16,100 --> 02:40:20,700
We have map, flatMap, filter,
reduce, groupBy. So there
3885
02:40:20,700 --> 02:40:23,992
are multiple transformations
available here. Now
3886
02:40:23,992 --> 02:40:27,500
It is like you are getting
your input data now you
3887
02:40:27,500 --> 02:40:30,400
will be applying any
of these operations.
3888
02:40:30,400 --> 02:40:33,700
Means any Transformations
that is going to happen.
3889
02:40:33,700 --> 02:40:37,700
And then a new DStream
is going to be created.
3890
02:40:37,700 --> 02:40:39,900
Okay, so that is
what's going to happen.
3891
02:40:39,900 --> 02:40:41,851
So let's explore it one by one.
3892
02:40:41,851 --> 02:40:43,344
So let's start now.
3893
02:40:43,344 --> 02:40:46,200
If I start with map,
what happens with map:
3894
02:40:46,200 --> 02:40:48,600
it is going to create
batches of data.
3895
02:40:48,600 --> 02:40:49,100
Okay.
3896
02:40:49,100 --> 02:40:51,386
So let's say it is going
to create a mapped value
3897
02:40:51,386 --> 02:40:52,200
of it like this.
3898
02:40:52,200 --> 02:40:55,600
So let's say x is mapped to
give the output z,
3899
02:40:55,600 --> 02:40:57,600
and that is giving
the output x, right?
3900
02:40:57,600 --> 02:41:00,700
So in this similar format,
this is going to get mapped.
3901
02:41:00,700 --> 02:41:02,887
That is, whatever
you're performing,
3902
02:41:02,887 --> 02:41:05,394
It is just going to create
batches of input data,
3903
02:41:05,394 --> 02:41:06,700
which you can execute.
3904
02:41:06,700 --> 02:41:10,800
So it returns a new DStream
by passing each element
3905
02:41:10,800 --> 02:41:13,946
of the source DStream
through a function,
3906
02:41:13,946 --> 02:41:15,600
which you have defined.
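A small sketch of map on the words DStream from earlier; pairing each word with its length is just an illustration:

    val lengths = words.map(w => (w, w.length)) // passes each element through the function you define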
3907
02:41:16,300 --> 02:41:17,789
Let's discuss flatMap
3908
02:41:17,789 --> 02:41:20,074
that we have just
discussed: it is going
3909
02:41:20,074 --> 02:41:21,565
to flatten up the things.
3910
02:41:21,565 --> 02:41:22,805
So in this case, also,
3911
02:41:22,805 --> 02:41:25,400
if you notice, we are just
kind of flattening it. It
3912
02:41:25,400 --> 02:41:27,169
is very similar to map,
3913
02:41:27,169 --> 02:41:31,100
but each input item
can be mapped to zero
3914
02:41:31,200 --> 02:41:34,200
or more output items here.
3915
02:41:34,200 --> 02:41:38,400
Okay, and it is going to return
a new DStream by passing
3916
02:41:38,400 --> 02:41:41,700
each source element
through a function. For this part,
3917
02:41:41,700 --> 02:41:44,600
So we have just seen an example
of that flatMap anyway,
3918
02:41:44,600 --> 02:41:47,300
so if you remember that,
it will be more easy
3919
02:41:47,300 --> 02:41:49,200
for you to kind of
see the difference
3920
02:41:49,200 --> 02:41:55,260
between map and flatMap.
Now moving further: filter.
3921
02:41:55,360 --> 02:41:58,593
As the name states, you
can now filter out the values.
3922
02:41:58,593 --> 02:41:59,876
So let's say you have
3923
02:41:59,876 --> 02:42:03,701
a huge data set and you kind of
want to filter out some values.
3924
02:42:03,701 --> 02:42:06,900
You just want to kind of work
with some filtered data.
3925
02:42:06,900 --> 02:42:09,700
Maybe you want to remove
some part of it.
3926
02:42:09,700 --> 02:42:11,900
Maybe you are trying
to put some Logic on it.
3927
02:42:11,900 --> 02:42:15,800
Does this line contain
this word, or does that line?
3928
02:42:16,100 --> 02:42:16,900
If that is so,
3929
02:42:16,900 --> 02:42:20,169
in that case extract only
with that particular criteria.
3930
02:42:20,169 --> 02:42:21,691
So this is what we do here,
3931
02:42:21,691 --> 02:42:25,300
but definitely most of the times
the output is going to be smaller
3932
02:42:25,300 --> 02:42:31,000
in comparison to your input.
Reduce: reduce is just
3933
02:42:31,000 --> 02:42:34,500
like it's going to do kind
of aggregation on the whole.
3934
02:42:34,500 --> 02:42:37,400
Let's say in the end you want
to sum up all the data
3935
02:42:37,400 --> 02:42:38,200
what you have
3936
02:42:38,200 --> 02:42:41,500
that is going to be done
with the help of reduce.
3937
02:42:42,100 --> 02:42:43,800
Now after that group
3938
02:42:43,800 --> 02:42:48,600
by. GroupBy is like it's going
to combine all the common values;
3939
02:42:48,600 --> 02:42:50,600
that is what group
by is going to do.
3940
02:42:50,600 --> 02:42:53,112
So as you can see
in this example all the things
3941
02:42:53,112 --> 02:42:55,196
which are starting
with C got grouped,
3942
02:42:55,196 --> 02:42:56,786
all the things starting
3943
02:42:56,786 --> 02:42:59,300
with J got grouped,
all the names starting
3944
02:42:59,300 --> 02:43:00,761
with C got grouped.
3945
02:43:00,800 --> 02:43:01,600
Now,
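Minimal spark-shell sketches of these three transformations; the sample data is made up:

    val filtered = words.filter(w => w.contains("spark"))        // keep only matching elements
    val numbers  = sc.parallelize(1 to 5)
    val total    = numbers.reduce(_ + _)                         // aggregates the whole set: 15
    val names    = sc.parallelize(Seq("Chris", "Carl", "John", "Jane"))
    val grouped  = names.groupBy(_.charAt(0))                    // (C -> Chris, Carl), (J -> John, Jane)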
3946
02:43:02,000 --> 02:43:03,300
So again, what is
this sliding window? Now to give
02:43:03,300 --> 02:43:07,500
this screen window now to give
you an example of this window?
3948
02:43:07,500 --> 02:43:10,108
Everybody must be
knowing Twitter, right?
3949
02:43:10,108 --> 02:43:12,000
So now what happens in Twitter?
3950
02:43:12,000 --> 02:43:13,700
Let me go to my paint.
3951
02:43:14,100 --> 02:43:16,100
So in this example,
3952
02:43:16,100 --> 02:43:19,853
let's understand how
this windowing operation works. So,
3953
02:43:19,853 --> 02:43:21,400
let's say in the initial,
3954
02:43:21,400 --> 02:43:24,600
in the initial
10 seconds,
3955
02:43:24,600 --> 02:43:27,200
Let's say the tweets
are happening in this way.
3956
02:43:27,200 --> 02:43:32,200
Let's say hash A,
hash A, hash A. Now,
3957
02:43:32,200 --> 02:43:35,773
which is the trending tweet?
Definitely A is, right? A is
3958
02:43:35,773 --> 02:43:38,900
my trending tweet. Maybe
in the next 10 seconds;
3959
02:43:40,600 --> 02:43:46,500
in the next 10 seconds,
now again hash A, hash B.
3960
02:43:47,200 --> 02:43:48,400
Hash B is coming up;
3961
02:43:48,400 --> 02:43:51,400
which is the trending one?
B is happening here.
3962
02:43:51,400 --> 02:43:51,800
Now,
3963
02:43:51,800 --> 02:43:54,261
let's say in another 10 seconds.
3964
02:43:54,900 --> 02:43:56,700
Now this time let's say
3965
02:43:56,700 --> 02:44:03,266
hash B, hash B; so actually
hash B is happening now,
3966
02:44:03,266 --> 02:44:05,266
and which is trending? B only.
3967
02:44:05,500 --> 02:44:07,776
But now I want to find out
3968
02:44:07,776 --> 02:44:10,546
which is the trending
one in the last 30 seconds?
3969
02:44:11,400 --> 02:44:15,100
Hash A, right? Because
if I combine, I can do it easily.
3970
02:44:15,400 --> 02:44:19,900
Now this is your windowing
operation example. Means you
3971
02:44:19,900 --> 02:44:23,300
are not only looking
at your current window,
3972
02:44:23,300 --> 02:44:24,800
but you're also looking
3973
02:44:24,800 --> 02:44:27,516
at your previous window.
When I say current window,
3974
02:44:27,516 --> 02:44:30,008
I'm talking about let's say
10 seconds of slot
3975
02:44:30,008 --> 02:44:32,600
in this 10 second slot,
let's say you are doing
3976
02:44:32,600 --> 02:44:35,431
this operation on hash B, hash
B, hash B, hash B;
3977
02:44:35,431 --> 02:44:37,456
so this is a current
window now you are
3978
02:44:37,456 --> 02:44:40,282
not only computing with respect
to your current window,
3979
02:44:40,282 --> 02:44:42,800
But you are also considering
your previous window.
3980
02:44:42,800 --> 02:44:44,055
Now, let's say in this case.
3981
02:44:44,055 --> 02:44:44,681
If I ask you,
3982
02:44:44,681 --> 02:44:46,900
can you give me the output
of which is trending
3983
02:44:46,900 --> 02:44:48,361
in last 17 seconds?
3984
02:44:48,361 --> 02:44:50,900
Will you be able
to answer? No. Why?
3985
02:44:50,900 --> 02:44:54,900
because you don't have partial
information for 7 Seconds
3986
02:44:54,900 --> 02:44:56,400
you have information
3987
02:44:56,400 --> 02:45:01,000
for your 10, 20, 30:
multiples of 10,
3988
02:45:01,200 --> 02:45:03,500
but not an intermediate one.
3989
02:45:03,500 --> 02:45:04,711
So keep this in mind.
3990
02:45:04,711 --> 02:45:07,365
Okay, so you will be able
to perform the windowing
3991
02:45:07,365 --> 02:45:10,207
operation only with respect
to your window size.
3992
02:45:10,207 --> 02:45:11,900
It's not like you can take
3993
02:45:11,900 --> 02:45:15,085
any partial value and do
the windowing operation. Now,
3994
02:45:15,085 --> 02:45:16,800
let's get back to the slides.
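A hedged sketch of that trending-hashtag idea with a windowed count: a 30-second window sliding every 10 seconds, both multiples of the batch interval:

    val hashtags = words.filter(_.startsWith("#"))
    val counts   = hashtags.map(tag => (tag, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))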
3995
02:45:21,800 --> 02:45:23,203
Now it's a similar thing.
3996
02:45:23,203 --> 02:45:24,350
So now it is shown here
3997
02:45:24,350 --> 02:45:27,100
that we are not only considering
the current window,
3998
02:45:27,100 --> 02:45:30,200
but we are also considering
the previous window
3999
02:45:30,200 --> 02:45:31,604
Now next, let's understand
4000
02:45:31,604 --> 02:45:35,300
the output operations,
operations on the DStreams.
4001
02:45:35,700 --> 02:45:38,434
when we talk
about output operations.
4002
02:45:38,434 --> 02:45:41,400
The output operations
are going to allow
4003
02:45:41,400 --> 02:45:45,853
the DStream data to be pushed
out to your external systems.
4004
02:45:45,853 --> 02:45:47,700
If you notice here means
4005
02:45:47,700 --> 02:45:51,300
whenever whatever processing
you have done with respect to
4006
02:45:51,300 --> 02:45:54,300
whatever data you are processing
here, now your output you
4007
02:45:54,300 --> 02:45:57,100
can store in multiple ways:
you can keep it in your file system,
4008
02:45:57,100 --> 02:45:58,600
you can keep it in your database,
4009
02:45:58,600 --> 02:46:01,800
You can keep it even
in your external systems
4010
02:46:01,800 --> 02:46:04,200
so you can keep
in multiple places.
4011
02:46:04,200 --> 02:46:06,400
So that is
what being reflected here.
4012
02:46:07,500 --> 02:46:10,600
Now, so if I talk
about output operation,
4013
02:46:10,600 --> 02:46:11,653
these are the one
4014
02:46:11,653 --> 02:46:15,495
which are supported we can print
out the value we can use save
4015
02:46:15,495 --> 02:46:17,700
as text files. When you save
as text files,
4016
02:46:17,700 --> 02:46:19,500
it saves it into your HDFS.
4017
02:46:19,500 --> 02:46:21,736
If you want, you can
also use it to save it
4018
02:46:21,736 --> 02:46:23,100
in the local file system.
4019
02:46:23,100 --> 02:46:25,174
You can save it as
an object file.
4020
02:46:25,174 --> 02:46:27,500
Also, you can save
it as a Hadoop file
4021
02:46:27,500 --> 02:46:30,800
or you can also apply
the foreachRDD function.
4022
02:46:31,200 --> 02:46:34,500
Now what is the
foreachRDD function?
4023
02:46:34,500 --> 02:46:35,956
Let's see this example.
4024
02:46:35,956 --> 02:46:39,700
So normally we explain this
part in detail when we teach
4025
02:46:39,700 --> 02:46:41,600
you in our Edureka sessions,
4026
02:46:41,600 --> 02:46:43,927
but just to give
you an idea now.
4027
02:46:43,927 --> 02:46:46,310
This is a very
powerful primitive
4028
02:46:46,310 --> 02:46:49,608
that is going to allow
your data to be sent out
4029
02:46:49,608 --> 02:46:51,400
to your external systems.
4030
02:46:51,400 --> 02:46:53,700
So using this you
can send it across
4031
02:46:53,700 --> 02:46:55,500
to your web server system.
4032
02:46:55,500 --> 02:46:57,385
We have just seen
our external system
4033
02:46:57,385 --> 02:46:58,904
that it can be a file system.
4034
02:46:58,904 --> 02:46:59,900
It can be anything.
4035
02:46:59,900 --> 02:47:02,800
So using this you
will be able to transfer it.
4036
02:47:02,800 --> 02:47:05,100
You will be
able to send it out
4037
02:47:05,100 --> 02:47:07,162
to your external systems.
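Two output-operation sketches, reusing the counts DStream from above (the HDFS prefix is hypothetical):

    counts.saveAsTextFiles("hdfs://localhost:9000/out/counts") // one output directory per batch
    counts.foreachRDD { rdd =>
      rdd.foreach(record => println(record)) // e.g., push each record to an external system
    }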
4038
02:47:07,500 --> 02:47:11,500
Now, let's understand caching
and persistence. Now,
4039
02:47:11,500 --> 02:47:14,300
when we talk
about caching and persistence,
4040
02:47:14,300 --> 02:47:18,900
so DStreams also allow
the developers to cache
4041
02:47:19,000 --> 02:47:22,100
or to persist the stream's data
4042
02:47:22,100 --> 02:47:27,023
in memory. Means you
can keep your data in memory.
4043
02:47:27,023 --> 02:47:31,100
You can cache your data
in memory for a longer time.
4044
02:47:31,200 --> 02:47:33,200
Even after your
action is complete.
4045
02:47:33,200 --> 02:47:36,000
It is not going to delete it
4046
02:47:36,100 --> 02:47:38,946
so you can just Use
this as many times
4047
02:47:38,946 --> 02:47:39,800
as you want
4048
02:47:39,800 --> 02:47:42,900
so you can simply use
the persist method to do that.
4049
02:47:42,900 --> 02:47:44,485
So for your input streams
4050
02:47:44,485 --> 02:47:48,100
which are receiving the data
over the network may be using
4051
02:47:48,100 --> 02:47:50,000
Kafka, Flume, sockets.
4052
02:47:50,400 --> 02:47:54,500
The default persistence level
is set to replicate
4053
02:47:54,500 --> 02:47:57,331
the data to two nodes
for fault tolerance,
4054
02:47:57,331 --> 02:48:00,500
like it is also going
to be replicating the data
4055
02:48:00,502 --> 02:48:01,600
onto two nodes,
4056
02:48:01,600 --> 02:48:04,800
so you can see the same thing
in this diagram.
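A one-line caching sketch; note that for receiver-based streams the default level already replicates to two nodes, as just described:

    words.persist() // keep this DStream's RDDs in memory across repeated uses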
4057
02:48:05,300 --> 02:48:07,979
Let's understand this
accumulators broadcast
4058
02:48:07,979 --> 02:48:09,600
variables and checkpoints.
4059
02:48:09,700 --> 02:48:12,553
Now, these are mostly
for your performance.
4060
02:48:12,553 --> 02:48:16,626
So this is going to help you
to kind of perform, to help you
4061
02:48:16,626 --> 02:48:18,444
in the performance part.
4062
02:48:18,444 --> 02:48:20,600
So accumulators are nothing
4063
02:48:20,600 --> 02:48:25,200
but variables that are only
added through an associative
4064
02:48:25,300 --> 02:48:27,400
and commutative operation.
4065
02:48:28,000 --> 02:48:31,100
Usually if you're coming
from a Hadoop background,
4066
02:48:31,100 --> 02:48:32,678
if you have done, let's say,
4067
02:48:32,678 --> 02:48:35,400
MapReduce programming, you
must have seen something:
4068
02:48:35,400 --> 02:48:36,900
counters, like that;
4069
02:48:36,900 --> 02:48:38,749
they'll be used
there as counters,
4070
02:48:38,749 --> 02:48:42,000
which kind of helps us to debug
the program as well and you
4071
02:48:42,000 --> 02:48:44,700
can perform some analysis
in the console itself.
4072
02:48:44,700 --> 02:48:46,600
Now similar things
you can do
4073
02:48:46,600 --> 02:48:48,100
with the accumulators as well.
4074
02:48:48,100 --> 02:48:50,152
So you can implement
your counters with accumulators;
4075
02:48:50,152 --> 02:48:52,800
apart from this, you can
also sum up things
4076
02:48:52,800 --> 02:48:54,800
with this. Now you can,
4077
02:48:54,800 --> 02:48:57,800
if you want to track
through UI you can also do
4078
02:48:57,800 --> 02:49:00,402
that as you can see
in this UI itself.
4079
02:49:00,402 --> 02:49:02,500
You can see all your accumulators
4080
02:49:02,500 --> 02:49:05,400
as well. Now similarly
we have broadcast
4081
02:49:05,400 --> 02:49:10,300
variables. Now broadcast variables
allow the programmer to keep
4082
02:49:10,300 --> 02:49:14,787
a read-only variable cached
on all the machines
4083
02:49:14,787 --> 02:49:16,325
which are available.
4084
02:49:16,838 --> 02:49:19,838
Now it is going
to be kind of caching it
4085
02:49:19,838 --> 02:49:21,684
on all the machines now,
4086
02:49:22,000 --> 02:49:25,900
they can be used to give
every node a copy
4087
02:49:26,200 --> 02:49:29,000
of a large input data set
4088
02:49:29,300 --> 02:49:35,028
in an efficient manner, so you
can just use that. Spark
4089
02:49:35,028 --> 02:49:39,643
also attempts to distribute
broadcast variables
4090
02:49:39,643 --> 02:49:41,700
using efficient broadcast
4091
02:49:41,700 --> 02:49:44,907
algorithms to reduce
the communication cost.
4092
02:49:44,907 --> 02:49:46,100
So as you can see here,
4093
02:49:46,100 --> 02:49:47,800
we are passing
this broadcast value
4094
02:49:47,800 --> 02:49:50,700
it is going to the Spark context
and then it is broadcasting
4095
02:49:50,700 --> 02:49:51,700
to these places.
4096
02:49:51,700 --> 02:49:55,500
So this is how it
is working in this application.
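A combined sketch of an accumulator and a broadcast variable using the Spark 1.x APIs; the names and data are hypothetical:

    val missing = sc.accumulator(0, "missing keys")     // counter-style accumulator
    val lookup  = sc.broadcast(Map("a" -> 1, "b" -> 2)) // read-only copy cached on each node
    sc.parallelize(Seq("a", "x", "b")).foreach { k =>
      if (!lookup.value.contains(k)) missing += 1       // workers only add to it
    }
    println(missing.value)                              // the driver reads the result: 1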
4097
02:49:55,700 --> 02:49:58,582
Generally when we teach
in these classes, and also
4098
02:49:58,582 --> 02:50:00,600
since these are
advanced concepts,
4099
02:50:00,600 --> 02:50:02,953
we kind of try
to explain it to you
4100
02:50:02,953 --> 02:50:05,189
with the practicals
and all. Right now,
4101
02:50:05,189 --> 02:50:08,915
I just want to give you an idea
about what are these things?
4102
02:50:08,915 --> 02:50:09,764
So when you go
4103
02:50:09,764 --> 02:50:12,009
with the practicals
of all these things
4104
02:50:12,009 --> 02:50:13,367
that how accumulators work,
4105
02:50:13,367 --> 02:50:16,700
how this is happening, how data
is getting broadcasted, things
4106
02:50:16,700 --> 02:50:19,941
become more and more clear
at that time. Right now,
4107
02:50:19,941 --> 02:50:20,683
I just want
4108
02:50:20,683 --> 02:50:24,600
that everybody gets a
high-level overview of things.
4109
02:50:25,246 --> 02:50:28,400
Now moving further:
what are checkpoints?
4110
02:50:28,400 --> 02:50:30,257
so checkpoints are similar
4111
02:50:30,257 --> 02:50:32,900
to your checkpoints
in gaming. Now,
4112
02:50:32,900 --> 02:50:37,200
they make
it run 24/7, make it resilient
4113
02:50:37,200 --> 02:50:41,400
to failures unrelated
to the application logic.
4114
02:50:41,500 --> 02:50:43,214
So if you can see this diagram,
4115
02:50:43,214 --> 02:50:45,296
we are just
creating the checkpoint.
4116
02:50:45,296 --> 02:50:47,200
So as in the
metadata checkpoint,
4117
02:50:47,200 --> 02:50:50,279
you can see it is the saving
of the information
4118
02:50:50,279 --> 02:50:53,827
which is defining the streaming
computation if we talk
4119
02:50:53,827 --> 02:50:55,300
about data checkpointing,
4120
02:50:55,600 --> 02:51:01,000
it is saving of the generated
RDDs to reliable storage.
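Enabling both kinds of checkpointing is a single call on the streaming context; the directory is hypothetical:

    ssc.checkpoint("hdfs://localhost:9000/checkpoints") // metadata and generated RDDs go to reliable storage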
4121
02:51:01,100 --> 02:51:03,400
So both of these
are generating
4122
02:51:03,400 --> 02:51:06,900
the checkpoint.
Now moving forward,
4123
02:51:06,900 --> 02:51:09,815
We are going to move
towards our project
4124
02:51:09,815 --> 02:51:14,300
where we are going to perform
our Twitter sentiment analysis.
4125
02:51:14,400 --> 02:51:17,413
Let's discuss a very
important use case
4126
02:51:17,413 --> 02:51:19,600
of Twitter sentiment analysis.
4127
02:51:19,600 --> 02:51:21,500
This is going to
be very interesting
4128
02:51:21,500 --> 02:51:24,600
because we will just
do a real-time analysis
4129
02:51:24,900 --> 02:51:28,588
on Twitter, sentiment
analysis, and there can be
4130
02:51:28,588 --> 02:51:31,900
a lot of possibilities
of this sentiment analysis;
4131
02:51:31,900 --> 02:51:33,631
we will
be taking something
4132
02:51:33,631 --> 02:51:36,000
for the tutorial, and it's going
to be very interesting.
4133
02:51:36,100 --> 02:51:39,900
So generally when we do
all this in our course,
4134
02:51:39,900 --> 02:51:41,070
it is more detailed,
4135
02:51:41,070 --> 02:51:44,582
because right now
definitely going in deep is
4136
02:51:44,582 --> 02:51:46,000
not very much possible,
4137
02:51:46,000 --> 02:51:48,600
but during the training
at Edureka,
4138
02:51:48,600 --> 02:51:51,470
you will learn all these things
within the class sessions,
4139
02:51:51,470 --> 02:51:52,994
right? That's something
4140
02:51:52,994 --> 02:51:55,100
which we learn
during the sessions.
4141
02:51:55,100 --> 02:51:59,061
Now, we talked
about some use cases of Twitter.
4142
02:51:59,300 --> 02:52:01,300
As I said there can be
multiple use cases
4143
02:52:01,300 --> 02:52:02,300
which are possible
4144
02:52:02,300 --> 02:52:04,156
because there are solutions
4145
02:52:04,156 --> 02:52:07,100
behind whatever people keep
doing; there is so much
4146
02:52:07,100 --> 02:52:08,700
of social media right now,
4147
02:52:08,700 --> 02:52:11,288
and everyone these days is
very active on it, right?
4148
02:52:11,288 --> 02:52:12,400
You must be noticing
4149
02:52:12,400 --> 02:52:15,300
that even politicians
have started using Twitter
4150
02:52:15,300 --> 02:52:18,000
and all
their tweets are being shown
4151
02:52:18,000 --> 02:52:21,200
in the news channels, statistics
of what's trending too,
4152
02:52:21,200 --> 02:52:23,900
because they are talking
about positive, negative,
4153
02:52:23,900 --> 02:52:26,100
when any politician
says something, right?
4154
02:52:26,100 --> 02:52:27,900
And if we talk
about anything is even
4155
02:52:27,900 --> 02:52:29,100
if we talk about let's
4156
02:52:29,100 --> 02:52:32,260
say any sports, the FIFA World Cup
is going on, then you will notice
4157
02:52:32,260 --> 02:52:35,200
Twitter will be filled up
with a lot of tweets.
4158
02:52:35,200 --> 02:52:38,435
So how we can make use of it
how we can do some analysis
4159
02:52:38,435 --> 02:52:41,400
on top of it, that's what we
are going to learn in this.
4160
02:52:41,400 --> 02:52:44,600
So there can be multiple sorts
of sentiment analysis;
4161
02:52:44,600 --> 02:52:47,595
it can be done for
your crisis management, service
4162
02:52:47,595 --> 02:52:50,900
adjusting, target marketing;
we can keep on talking about it.
4163
02:52:50,900 --> 02:52:52,716
When a new movie releases now,
4164
02:52:52,716 --> 02:52:55,200
even the moviemakers
kind of analyze:
4165
02:52:55,200 --> 02:52:57,628
okay, how this movie
is going to perform,
4166
02:52:57,628 --> 02:53:00,356
so they can easily make
it out beforehand:
4167
02:53:00,356 --> 02:53:04,200
Okay, this movie is going to go
in this kind of range of profit
4168
02:53:04,200 --> 02:53:05,800
or not. Interesting, right?
4169
02:53:05,800 --> 02:53:08,200
Let us explore.
It is not impossible even
4170
02:53:08,200 --> 02:53:10,500
in the political campaigns;
you must have heard
4171
02:53:10,600 --> 02:53:11,400
that in the U.S.,
4172
02:53:11,400 --> 02:53:13,600
when the presidential
election was happening,
4173
02:53:13,600 --> 02:53:15,676
they have used, in fact, the role
4174
02:53:15,676 --> 02:53:19,600
of social media and all
this analysis, and
4175
02:53:19,600 --> 02:53:22,400
that has played
a major role in winning
4176
02:53:22,400 --> 02:53:23,880
that election. Similarly,
4177
02:53:23,880 --> 02:53:26,100
investors
want to predict
4178
02:53:26,100 --> 02:53:28,950
whether they should invest
in a particular company or not,
4179
02:53:28,950 --> 02:53:30,300
whether they want to check
4180
02:53:30,300 --> 02:53:33,715
that, like, which customers we
should target
4181
02:53:33,715 --> 02:53:34,900
for advertisement
4182
02:53:34,900 --> 02:53:38,000
because we cannot target
everyone; the problem with targeting
4183
02:53:38,000 --> 02:53:40,580
everyone is, if we try
to target everyone,
4184
02:53:40,580 --> 02:53:43,032
it will be very costly,
so we want to kind
4185
02:53:43,032 --> 02:53:44,333
of segment it a little bit,
4186
02:53:44,333 --> 02:53:46,178
because maybe my set
of people whom I
4187
02:53:46,178 --> 02:53:48,954
should send this advertisement
to will be more effective,
4188
02:53:48,954 --> 02:53:52,000
and as well it
is going to be cost-effective
4189
02:53:52,000 --> 02:53:54,100
as well. If you want
to do it for products
4190
02:53:54,100 --> 02:53:57,200
and services also,
I guess we can also do this.
4191
02:53:57,200 --> 02:53:57,500
Now.
4192
02:53:57,500 --> 02:54:00,900
Let's see the use case;
in terms of the use case,
4193
02:54:00,900 --> 02:54:03,100
I will show you a practical
of how it works.
4194
02:54:03,100 --> 02:54:04,000
So first of all,
4195
02:54:04,000 --> 02:54:06,724
we will be importing all
the required packages
4196
02:54:06,724 --> 02:54:08,725
because we are going
to now perform
4197
02:54:08,725 --> 02:54:10,400
our Twitter sentiment analysis.
4198
02:54:10,400 --> 02:54:12,824
So we will be requiring
some packages for that.
4199
02:54:12,824 --> 02:54:15,700
So we will be doing that as
a first step then we need
4200
02:54:15,700 --> 02:54:18,641
to set all our authentication;
without authentication,
4201
02:54:18,641 --> 02:54:21,405
we cannot do anything.
Now here the challenge is:
4202
02:54:21,405 --> 02:54:23,201
we cannot directly
put your username
4203
02:54:23,201 --> 02:54:24,431
and password; don't you think
4204
02:54:24,431 --> 02:54:27,100
it will get compromised if we put
your username and password?
4205
02:54:27,200 --> 02:54:28,800
So Twitter came up with something,
4206
02:54:28,800 --> 02:54:30,400
a very smart thing.
4207
02:54:30,500 --> 02:54:33,100
What they did is they came
up with something
4208
02:54:33,100 --> 02:54:35,080
called authentication tokens.
4209
02:54:35,080 --> 02:54:37,100
So you have to go
to dev dot
4210
02:54:37,100 --> 02:54:39,100
twitter.com, login from there,
4211
02:54:39,100 --> 02:54:42,972
and you will find kind of all
these authentication tokens
4212
02:54:42,972 --> 02:54:44,100
available to you
4213
02:54:44,100 --> 02:54:47,900
whichever will be required; take
that and put it here. Then,
4214
02:54:47,900 --> 02:54:50,335
as we have learned
the DStream transformations,
4215
02:54:50,335 --> 02:54:52,294
you will be doing
all that computation,
4216
02:54:52,294 --> 02:54:55,100
so you will be having
your DStream transformation
4217
02:54:55,100 --> 02:54:58,100
action. Then you will be
generating your tweet data
4218
02:54:58,100 --> 02:55:01,472
and going to save it
in this particular directory.
4219
02:55:01,472 --> 02:55:03,400
Once you are done with this.
4220
02:55:03,400 --> 02:55:06,200
Then you are going
to extract your sentiment
4221
02:55:06,200 --> 02:55:07,600
once you extract it.
4222
02:55:07,600 --> 02:55:08,400
And you're done.
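A hedged outline of those steps with the spark-streaming-twitter helper; the tokens, filter term, and output path are placeholders you must supply yourself:

    import org.apache.spark.streaming.twitter.TwitterUtils

    System.setProperty("twitter4j.oauth.consumerKey", "<your key>")
    // set the other three twitter4j.oauth properties the same way

    val tweets = TwitterUtils.createStream(ssc, None, Seq("trump")) // filtered tweet DStream
    val texts  = tweets.map(status => status.getText)               // extract the tweet text
    texts.saveAsTextFiles("output/tweets")                          // save per-batch results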
4223
02:55:08,400 --> 02:55:11,900
Let me show you quickly
how it works in our VM.
4224
02:55:12,000 --> 02:55:15,226
Now one more interesting thing
about Edureka would be
4225
02:55:15,226 --> 02:55:18,247
that you will be getting all
these preconfigured machines.
4226
02:55:18,247 --> 02:55:19,482
So you need not worry
4227
02:55:19,482 --> 02:55:21,892
about from where I
will be getting all this.
4228
02:55:21,892 --> 02:55:25,100
Is it very difficult
to install? When I was trying
4229
02:55:25,100 --> 02:55:26,400
this open-source software,
4230
02:55:26,400 --> 02:55:29,061
It was not working for me
in my operating system.
4231
02:55:29,061 --> 02:55:30,179
It was not working.
4232
02:55:30,179 --> 02:55:32,400
So many things we
have generally seen
4233
02:55:32,400 --> 02:55:34,700
people face issues to resolve
4234
02:55:34,700 --> 02:55:36,600
everything. So we kind
4235
02:55:36,600 --> 02:55:40,000
of provide all this
preconfigured from our end.
4236
02:55:40,000 --> 02:55:41,900
This VM has it pre-installed; yes,
4237
02:55:41,900 --> 02:55:44,300
that's what it has
everything pre-installed.
4238
02:55:44,300 --> 02:55:46,700
Whatever will be required
for your training.
4239
02:55:46,700 --> 02:55:49,133
So that's the best part
what we also provide.
4240
02:55:49,133 --> 02:55:51,700
So in this case your Eclipse
will already be there.
4241
02:55:51,700 --> 02:55:53,900
You need to just go
to your Eclipse location.
4242
02:55:53,900 --> 02:55:55,300
Let me show you how you can.
4243
02:55:55,300 --> 02:55:56,700
and open it if you want,
4244
02:55:57,200 --> 02:56:00,600
because it's given to you; you
just need to go inside it
4245
02:56:00,600 --> 02:56:02,200
and double-click on it. That's it.
4246
02:56:02,200 --> 02:56:04,400
You need not go and kind
of install Eclipse,
4247
02:56:04,400 --> 02:56:07,400
and not even Spark; it will
already be installed for you.
4248
02:56:07,400 --> 02:56:09,900
Let us go to our project.
4249
02:56:09,900 --> 02:56:12,895
So this is our project
which is in front of you.
4250
02:56:12,895 --> 02:56:15,674
This is my project
which we are going to work on.
4251
02:56:15,674 --> 02:56:16,653
Now you can see
4252
02:56:16,653 --> 02:56:19,522
that we have first
imported all the libraries
4253
02:56:19,522 --> 02:56:22,146
then we have set
up our authentication system,
4254
02:56:22,146 --> 02:56:24,806
and then we have moved
on and kind of executed
4255
02:56:24,806 --> 02:56:27,900
the DStream transformations,
extracted what we want,
4256
02:56:27,900 --> 02:56:29,900
and then saved
the output in the file.
4257
02:56:29,900 --> 02:56:32,100
So these are the things
which we have done
4258
02:56:32,100 --> 02:56:36,000
in this program. Now let's
execute it. To run this program,
4259
02:56:36,000 --> 02:56:39,900
it's very simple: go
to Run As, and from Run
4260
02:56:39,900 --> 02:56:42,700
As click on Scala Application.
4261
02:56:43,200 --> 02:56:45,276
You will notice in the end
4262
02:56:45,276 --> 02:56:48,600
it is running;
that's great, good to see that,
4263
02:56:48,886 --> 02:56:51,286
so it is executing the program.
4264
02:56:51,286 --> 02:56:52,440
Let us execute.
4265
02:56:55,700 --> 02:56:57,800
I did bring the tweets for Trump.
4266
02:56:57,800 --> 02:57:01,292
So the tweets for Trump anyway
tend to be negative.
4267
02:57:01,292 --> 02:57:01,629
Right?
4268
02:57:01,629 --> 02:57:02,654
It's no achievement,
4269
02:57:02,654 --> 02:57:06,036
because anything you do for Trump
will tend to be negative; Trump is
4270
02:57:06,036 --> 02:57:07,563
anyway the hot topic for us.
4271
02:57:07,563 --> 02:57:09,200
Let me make it a little bigger.
4272
02:57:14,100 --> 02:57:17,200
You will notice a lot
of negative tweets coming up now.
4273
02:57:24,700 --> 02:57:26,900
Yes, now, I'm just stopping it
4274
02:57:26,900 --> 02:57:28,742
so that I can
show you something.
4275
02:57:28,742 --> 02:57:28,972
Yes.
4276
02:57:28,972 --> 02:57:30,700
It's filtering the tweets;
4277
02:57:30,800 --> 02:57:33,700
so we have actually written that
in the program itself.
4278
02:57:33,700 --> 02:57:36,300
We have given
it at one location; from using
4279
02:57:36,300 --> 02:57:38,087
that, we were kind of asking
4280
02:57:38,087 --> 02:57:41,200
for the tweets of Trump. Now
here we are doing analysis,
4281
02:57:41,200 --> 02:57:43,064
and it is also going to tell us
4282
02:57:43,064 --> 02:57:46,264
whether it's a positive or a
negative tweet.
4283
02:57:46,264 --> 02:57:47,500
It is giving negative,
4284
02:57:47,500 --> 02:57:50,444
because tweets for Trump
will not quite be positive, right?
4285
02:57:50,444 --> 02:57:51,454
So that's something
4286
02:57:51,454 --> 02:57:53,790
which is expected; that's
the reason you're finding
4287
02:57:53,790 --> 02:57:54,800
this as negative.
4288
02:57:54,900 --> 02:57:56,412
Similarly if there
will be any other
4289
02:57:56,412 --> 02:57:57,964
tweet, we should
be getting it as well.
4290
02:57:57,964 --> 02:58:00,200
So right now if I keep on
moving ahead we will see
4291
02:58:00,200 --> 02:58:02,300
multiple negative tweets
which will come up.
4292
02:58:02,300 --> 02:58:04,600
So that's how this program runs.
4293
02:58:04,900 --> 02:58:07,000
So this is how our program
4294
02:58:07,000 --> 02:58:09,403
we will be executing;
we can also stop it.
4295
02:58:09,403 --> 02:58:13,100
Even the output results will be
getting stored at a location,
4296
02:58:13,100 --> 02:58:16,500
as you can see in this
if I go to my location here,
4297
02:58:16,500 --> 02:58:19,100
this is my actual project
where it is running
4298
02:58:19,100 --> 02:58:20,533
so you can just come
4299
02:58:20,533 --> 02:58:23,400
to this location; here
is all your output.
4300
02:58:23,400 --> 02:58:24,982
All your output
is getting stored there,
4301
02:58:24,982 --> 02:58:26,200
so you can just take a look.
4302
02:58:26,200 --> 02:58:28,200
But yes, so
everything is done
4303
02:58:28,200 --> 02:58:29,971
by using Spark Streaming.
4304
02:58:29,971 --> 02:58:30,300
Okay.
4305
02:58:30,300 --> 02:58:31,900
That's what we've
seen, right? Whatever
4306
02:58:31,900 --> 02:58:33,653
we were seeing,
it was with respect
4307
02:58:33,653 --> 02:58:35,200
to the DStream transformations,
4308
02:58:35,200 --> 02:58:38,300
so we have done all that
in this project as well.
4309
02:58:38,400 --> 02:58:41,200
So that is one
of the awesome parts about this,
4310
02:58:41,200 --> 02:58:44,700
that you can do such
powerful things with respect
4311
02:58:44,700 --> 02:58:47,279
to your streaming data
this way.
4312
02:58:47,279 --> 02:58:49,500
Now, let's analyze the results.
4313
02:58:49,800 --> 02:58:51,152
So as we have just seen
4314
02:58:51,152 --> 02:58:53,400
that it is showing
whether it's a positive
4315
02:58:53,400 --> 02:58:54,800
or a negative tweet.
4316
02:58:55,000 --> 02:58:57,200
So this is where your output
is getting stored;
4317
02:58:57,200 --> 02:59:00,000
as shown, your output
will appear like this.
4318
02:59:00,000 --> 02:59:00,300
Okay.
4319
02:59:00,300 --> 02:59:02,700
It just prints
your output; it will explicitly
4320
02:59:02,700 --> 02:59:03,762
print and also tell
4321
02:59:03,762 --> 02:59:05,848
whether it's a neutral
one positive one
4322
02:59:05,848 --> 02:59:07,277
negative one, everything.
4323
02:59:07,277 --> 02:59:09,600
We have done it
with the help of Spark
4324
02:59:09,600 --> 02:59:12,000
Streaming only. Now we
have done it for Trump,
4325
02:59:12,000 --> 02:59:14,000
as I just explained you,
that we have put
4326
02:59:14,000 --> 02:59:15,555
trump in our program itself;
4327
02:59:15,555 --> 02:59:17,589
like we have put
everything up here
4328
02:59:17,589 --> 02:59:21,000
and based on that only we
are getting all the tweets. Now
4329
02:59:21,000 --> 02:59:23,498
we can apply all
the sentiment analysis
4330
02:59:23,498 --> 02:59:24,403
and like this.
4331
02:59:24,403 --> 02:59:25,731
Like we have learned.
4332
02:59:25,731 --> 02:59:28,754
So I hope you have found
all this, especially
4333
02:59:28,754 --> 02:59:30,593
this use case very much useful
4334
02:59:30,593 --> 02:59:32,800
for you kind of
getting you that yes,
4335
02:59:32,800 --> 02:59:34,388
it can be done like this.
4336
02:59:34,388 --> 02:59:36,200
Right now we
have put trump here,
4337
02:59:36,200 --> 02:59:38,550
but if you want you can keep
on putting the hashtag as
4338
02:59:38,550 --> 02:59:40,286
well because that's
how we are doing it.
4339
02:59:40,286 --> 02:59:41,886
You can keep on
changing the tags.
4340
02:59:41,886 --> 02:59:44,335
Maybe you can kind of do it
for, let's say, sports;
4341
02:59:44,335 --> 02:59:45,200
for instance something is going
4342
02:59:45,200 --> 02:59:49,000
on, a cricket match will be going
on, we can just pull the tweets
4343
02:59:49,000 --> 02:59:52,300
according to that; just, in
that case, instead of trump
4344
02:59:52,300 --> 02:59:53,980
you can put any player name
4345
02:59:53,980 --> 02:59:56,432
or maybe a team name,
and you will see all
4346
02:59:56,432 --> 02:59:58,300
those trends coming up further.
4347
02:59:58,300 --> 03:00:00,700
Okay, so that's
how you can play with this.
4348
03:00:01,000 --> 03:00:01,500
Now.
4349
03:00:01,800 --> 03:00:04,400
So there are
multiple examples with it
4350
03:00:04,400 --> 03:00:05,400
which we can play with,
4351
03:00:05,500 --> 03:00:09,500
and this use case can be even
evolved into multiple other types
4352
03:00:09,500 --> 03:00:10,250
of use cases.
4353
03:00:10,250 --> 03:00:12,200
You can just keep
on transforming it
4354
03:00:12,200 --> 03:00:14,300
according to your own use cases.
4355
03:00:14,400 --> 03:00:17,800
So that's it about Spark Streaming
which I wanted to discuss.
4356
03:00:17,800 --> 03:00:21,000
So I hope you must
have found it useful.
4357
03:00:26,000 --> 03:00:28,228
So in classification generally
4358
03:00:28,228 --> 03:00:31,200
what happens just
to give you an example.
4359
03:00:31,300 --> 03:00:33,867
You must have noticed
the spam email box.
4360
03:00:33,867 --> 03:00:36,500
I hope everybody
must have seen
4361
03:00:36,500 --> 03:00:39,700
that folder, your spam
email box in Gmail.
4362
03:00:39,800 --> 03:00:45,000
Now when any new email comes up,
how does Google decide
4363
03:00:45,165 --> 03:00:49,134
whether it's a spam email
or a non-spam email?
4364
03:00:49,300 --> 03:00:53,400
That is an example
of classification. Clustering:
4365
03:00:53,576 --> 03:00:56,423
let's say you go
to Google News;
4366
03:00:56,500 --> 03:00:58,794
when you type
something, it groups
4367
03:00:58,794 --> 03:01:00,300
all the news together;
4368
03:01:00,300 --> 03:01:04,700
that is called your clustering.
Regression is also one
4369
03:01:04,700 --> 03:01:07,300
of the very important
techniques here.
4370
03:01:07,500 --> 03:01:11,700
Regression is, let's say,
you have a house
4371
03:01:11,900 --> 03:01:14,100
and you want to sell that house
4372
03:01:14,400 --> 03:01:16,500
and you have no idea.
4373
03:01:16,700 --> 03:01:18,715
What is the optimal price?
4374
03:01:18,715 --> 03:01:21,100
You should keep for your house.
4375
03:01:21,100 --> 03:01:24,400
Now this regression
will help you
4376
03:01:24,400 --> 03:01:28,534
to achieve that. Collaborative
filtering: you might have seen,
4377
03:01:28,534 --> 03:01:31,000
when you go
to your Amazon web page
4378
03:01:31,000 --> 03:01:33,400
that they show you
a recommendation, right?
4379
03:01:33,400 --> 03:01:34,430
You can buy this
4380
03:01:34,430 --> 03:01:38,400
because you are buying this;
this is done with the help
4381
03:01:38,400 --> 03:01:40,900
of collaborative filtering.
4382
03:01:42,028 --> 03:01:44,315
Before I move to the project,
4383
03:01:44,315 --> 03:01:47,700
I want to show you
some practicals, like how we
4384
03:01:47,700 --> 03:01:50,300
will be executing Spark things.
4385
03:01:50,503 --> 03:01:53,196
So let me take you
to the VM machine
4386
03:01:53,300 --> 03:01:55,300
which will be provided
by Edureka.
4387
03:01:55,300 --> 03:01:57,928
So these machines are also
provided by Edureka,
4388
03:01:57,928 --> 03:02:00,222
so you need not worry
about from where I
4389
03:02:00,222 --> 03:02:01,963
will be getting the software,
4390
03:02:01,963 --> 03:02:04,421
what I will be doing,
how to install it there;
4391
03:02:04,421 --> 03:02:07,300
everything is taken care of
by Edureka. Now,
4392
03:02:07,300 --> 03:02:08,957
Once you will be coming
4393
03:02:08,957 --> 03:02:12,059
to this you will see
a machine like Like this,
4394
03:02:12,059 --> 03:02:13,300
let me close this.
4395
03:02:13,300 --> 03:02:16,970
So what will happen you will see
a blank machine like this.
4396
03:02:16,970 --> 03:02:18,300
Let me show you this.
4397
03:02:18,300 --> 03:02:20,500
So this is how your machine
will look like.
4398
03:02:20,500 --> 03:02:24,100
Now what you are going to do
in order to start working.
4399
03:02:24,100 --> 03:02:26,600
You will be opening
this terminal by clicking
4400
03:02:26,600 --> 03:02:27,800
on this black option.
4401
03:02:28,000 --> 03:02:29,300
Now after that,
4402
03:02:29,400 --> 03:02:34,400
what you can do is you
can now go to your Spark. Now,
4403
03:02:34,400 --> 03:02:39,300
how can I work with Spark in order
to execute any program
4404
03:02:39,300 --> 03:02:43,000
in Spark by using
your Scala program? You
4405
03:02:43,000 --> 03:02:46,700
will be entering it as spark -
4406
03:02:46,700 --> 03:02:49,400
shell. If you type spark-shell,
4407
03:02:49,500 --> 03:02:52,500
it will take you
to the Scala prompt
4408
03:02:52,800 --> 03:02:55,800
where you can write
your Spark program,
4409
03:02:56,100 --> 03:03:00,020
but by using the Scala
programming language;
4410
03:03:00,020 --> 03:03:01,558
you can notice this.
4411
03:03:02,200 --> 03:03:06,300
Now, can you see that it
is also giving me the 1.5.2 version?
4412
03:03:06,300 --> 03:03:09,200
So that is the version
of your Spark.
4413
03:03:09,800 --> 03:03:11,400
Now you can see here.
4414
03:03:11,400 --> 03:03:15,200
You can also see the Spark
context available as sc;
4415
03:03:15,200 --> 03:03:17,752
when you get connected
to your spark shell,
4416
03:03:17,752 --> 03:03:21,441
you can just see this will be
by default available to you.
4417
03:03:21,441 --> 03:03:22,800
Let us get connected.
4418
03:03:22,800 --> 03:03:23,800
It takes some time.
4419
03:03:39,207 --> 03:03:40,746
Now, we got it.
4420
03:03:40,746 --> 03:03:43,900
So we got connected
to this Scala prompt. Now
4421
03:03:43,900 --> 03:03:45,894
if I want to come out of it,
4422
03:03:45,894 --> 03:03:49,300
I will just type exit
it will just let me come
4423
03:03:49,300 --> 03:03:51,400
out of this prompt now.
4424
03:03:52,100 --> 03:03:56,176
Secondly, I can also write
my programs with Python.
4425
03:03:56,176 --> 03:03:57,407
So what I can do
4426
03:03:57,500 --> 03:04:00,200
if I want to do
programming in Spark,
4427
03:04:00,200 --> 03:04:03,040
but with the
Python programming language,
4428
03:04:03,040 --> 03:04:05,300
I will be connecting
with PySpark.
4429
03:04:05,300 --> 03:04:09,148
So I just need to type pyspark
in order to get connected
4430
03:04:09,148 --> 03:04:09,912
to PySpark.
4431
03:04:09,912 --> 03:04:10,206
Okay.
4432
03:04:10,206 --> 03:04:11,791
I'm not getting connected now
4433
03:04:11,791 --> 03:04:13,576
because I'm not
going to require it.
4434
03:04:13,576 --> 03:04:16,700
I think I will be explaining
everything in Scala itself.
4435
03:04:16,700 --> 03:04:19,700
But if you want to get connected,
you can type pyspark.
4436
03:04:19,700 --> 03:04:21,100
So let's again get connected
4437
03:04:21,100 --> 03:04:23,900
to my spark -
shell now. Meanwhile,
4438
03:04:23,900 --> 03:04:25,800
this is getting connected.
4439
03:04:25,800 --> 03:04:27,800
Let us create a small file.
4440
03:04:27,800 --> 03:04:29,823
So let us create
a file so currently
4441
03:04:29,823 --> 03:04:31,897
if you notice I
don't have any file.
4442
03:04:31,897 --> 03:04:32,281
Okay.
4443
03:04:32,284 --> 03:04:34,300
I already have a.txt.
4444
03:04:34,300 --> 03:04:37,300
So let's say: cat a.txt.
4445
03:04:37,400 --> 03:04:38,958
So I have some data: one,
4446
03:04:38,958 --> 03:04:40,200
two, three, four, five.
4447
03:04:40,200 --> 03:04:42,362
This is my data,
which is with me.
4448
03:04:42,362 --> 03:04:44,000
Now what I'm going to do,
4449
03:04:44,000 --> 03:04:47,900
let me push this file
into HDFS, and let me check first
4450
03:04:47,900 --> 03:04:49,900
if it is already available
4451
03:04:49,900 --> 03:04:55,000
in my HDFS system; that means:
hadoop dfs -
4452
03:04:55,000 --> 03:04:57,900
cat a.txt, just
to quickly check
4453
03:04:57,900 --> 03:04:59,700
if it is already available.
4454
03:05:06,100 --> 03:05:09,400
There is no such file, so let
me first put this file
4455
03:05:09,400 --> 03:05:12,700
to my HDFS: hadoop dfs -put a.txt.
4456
03:05:14,200 --> 03:05:16,300
So this will put it
in the default location
4457
03:05:16,300 --> 03:05:17,200
of HDFS.
4458
03:05:17,200 --> 03:05:19,700
Now if I want to read it,
I can cat it in HDFS.
4459
03:05:19,700 --> 03:05:20,922
So again, I'm assuming
4460
03:05:20,922 --> 03:05:23,700
that you're aware of these
HDFS commands, so you
4461
03:05:23,700 --> 03:05:25,300
can see now this one two,
4462
03:05:25,300 --> 03:05:28,500
three, four, five is coming
from the Hadoop file system.
4463
03:05:28,500 --> 03:05:30,192
Now what I want to do,
4464
03:05:30,192 --> 03:05:36,400
I want to use this file
in my Spark now;
4465
03:05:36,400 --> 03:05:39,200
how can I do that? So let's
come here.
4466
03:05:39,200 --> 03:05:42,500
So in Scala, in Scala,
4467
03:05:42,500 --> 03:05:46,000
we do not have int, float,
and so on like in Java;
4468
03:05:46,000 --> 03:05:48,700
in Java we define
like this, right: integer
4469
03:05:48,700 --> 03:05:49,907
k is equal to 10,
4470
03:05:49,907 --> 03:05:53,000
like this is used
to define. But in Scala
4471
03:05:53,000 --> 03:05:55,400
we do not use these data types.
4472
03:05:55,473 --> 03:05:58,626
In fact, what we do
is we call it as var.
4473
03:05:58,700 --> 03:06:02,000
So if I use
var a is equal to 10,
4474
03:06:02,100 --> 03:06:04,700
it will automatically identify
4475
03:06:04,900 --> 03:06:08,100
that it is
an integer value. Notice,
4476
03:06:08,900 --> 03:06:13,303
it will tell me that
a is of Int type. Now
4477
03:06:13,303 --> 03:06:16,072
if I want to update
this value to 20,
4478
03:06:16,072 --> 03:06:17,149
I can do that.
4479
03:06:17,400 --> 03:06:17,800
Now.
4480
03:06:17,900 --> 03:06:20,900
Let's say if I want to update
this to ABC like this.
4481
03:06:21,200 --> 03:06:23,700
This will throw an error. Why?
4482
03:06:23,900 --> 03:06:27,400
Because a is already
defined as an integer
4483
03:06:27,600 --> 03:06:31,300
and you're trying to assign
some ABC string to it.
4484
03:06:31,300 --> 03:06:34,000
So that is the reason
you got this error.
4485
03:06:34,000 --> 03:06:34,900
Similarly.
4486
03:06:35,000 --> 03:06:38,000
There is one more thing
called as val.
4487
03:06:38,300 --> 03:06:40,300
Val b is equal to 10,
4488
03:06:40,300 --> 03:06:44,200
let's say; if I do it, it works
exactly similar to that,
4489
03:06:44,200 --> 03:06:47,500
but with one difference
now in this case.
4490
03:06:47,500 --> 03:06:51,600
If I do b is equal
to 20, you will see an error.
4491
03:06:51,800 --> 03:06:57,000
And why this error? Because when
you define something as val,
4492
03:06:57,200 --> 03:06:59,200
it is a constant.
4493
03:06:59,300 --> 03:07:02,400
It is not going
to be a variable anymore.
4494
03:07:02,430 --> 03:07:04,046
It will be a constant
4495
03:07:04,046 --> 03:07:08,300
and that is the reason
if you define something as val,
4496
03:07:08,300 --> 03:07:10,700
it will not be updatable.
4497
03:07:10,700 --> 03:07:14,400
You will not be able
to update that value.
4498
03:07:14,400 --> 03:07:19,400
So this is how in Scala you
will be doing your programming:
4499
03:07:19,700 --> 03:07:23,969
var for your variable part
and val for your constants.
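As typed in the spark-shell, the whole var/val contrast fits in a few lines:

    var a = 10   // var: mutable; type Int is inferred
    a = 20       // fine, vars can be reassigned
    // a = "ABC" // error: type mismatch, a is an Int

    val b = 10   // val: a constant
    // b = 20    // error: reassignment to val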
4500
03:07:23,969 --> 03:07:27,200
So you will be
doing it like this. Now,
4501
03:07:27,200 --> 03:07:31,664
let's use it for the example
what we have learned now.
4502
03:07:31,664 --> 03:07:34,971
Let's say if I want
to create an RDD.
4503
03:07:35,100 --> 03:07:40,100
So val number is equal
to sc dot textFile.
4504
03:07:40,100 --> 03:07:43,000
Remember this API? We
have learned this API
4505
03:07:43,000 --> 03:07:45,500
already: sc.textFile. Now
4506
03:07:45,500 --> 03:07:49,300
let me give this file: a.txt.
4507
03:07:49,500 --> 03:07:52,000
If I give this file a.txt,
4508
03:07:52,300 --> 03:07:55,900
it will be creating
an RDD. We'll see this:
4509
03:07:55,900 --> 03:07:57,000
it is telling
4510
03:07:57,000 --> 03:08:00,800
that it created an RDD
of string type.
4511
03:08:01,100 --> 03:08:01,300
Now.
4512
03:08:01,300 --> 03:08:06,600
If I want to read this data,
I will call number dot collect.
4513
03:08:06,800 --> 03:08:10,415
This will print the values
which were available.
4514
03:08:10,415 --> 03:08:14,261
Can you see now, this line
what you are seeing here
4515
03:08:14,300 --> 03:08:17,300
is going to be from your memory.
4516
03:08:17,400 --> 03:08:19,382
This is from your memory;
4517
03:08:19,382 --> 03:08:23,500
it is reading a.txt, and that is
the reason it is showing up
4518
03:08:23,500 --> 03:08:25,800
in this particular manner.
4519
03:08:25,842 --> 03:08:29,457
So this is how you
will be performing your steps.
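Recapping what was just typed, as a short sketch:

    val number = sc.textFile("a.txt") // RDD[String] read from the default HDFS location
    number.collect()                  // action: returns and prints the lines (1 2 3 4 5)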
4520
03:08:29,484 --> 03:08:30,715
Now, second thing:
4521
03:08:31,100 --> 03:08:36,000
I told you that Spark can work
on standalone systems as well,
4522
03:08:36,100 --> 03:08:36,400
Right?
4523
03:08:36,400 --> 03:08:38,400
So right now
what was happening was
4524
03:08:38,400 --> 03:08:42,000
that we have executed this part
on our HDFS. Now,
4525
03:08:42,000 --> 03:08:46,283
if I want to execute this
on our local file system,
4526
03:08:46,283 --> 03:08:47,338
Can I do that?
4527
03:08:47,338 --> 03:08:49,300
Yes, I can still do that.
4528
03:08:49,300 --> 03:08:51,300
What you need to do for that.
4529
03:08:51,300 --> 03:08:54,700
So in that case
the difference will come here.
4530
03:08:54,700 --> 03:08:57,000
Now the file you are giving
4531
03:08:57,000 --> 03:08:59,748
here would be: instead
of giving it like that,
4532
03:08:59,748 --> 03:09:03,100
you will be denoting
this file keyword before that,
4533
03:09:03,100 --> 03:09:06,300
and after that you need
to give your local file path.
4534
03:09:06,300 --> 03:09:09,200
For example, what is
this path: slash home slash
4535
03:09:09,200 --> 03:09:09,900
edureka.
4536
03:09:09,900 --> 03:09:12,400
This is a local path,
not an HDFS path.
4537
03:09:12,400 --> 03:09:14,400
So you will be
writing /home
4538
03:09:14,400 --> 03:09:17,400
/edureka/a.txt.
4539
03:09:17,500 --> 03:09:19,100
Now if you give this
4540
03:09:19,300 --> 03:09:22,700
this will be loading
the file into memory,
4541
03:09:23,000 --> 03:09:26,300
but not from your HDFS; instead,
4542
03:09:26,300 --> 03:09:29,100
what it does is it loads it
4543
03:09:29,100 --> 03:09:33,000
from your local file system;
it just loaded it like this,
4544
03:09:33,200 --> 03:09:34,921
so that is the difference.
4545
03:09:34,921 --> 03:09:37,600
So as you can see
in the second case,
4546
03:09:37,600 --> 03:09:41,600
I am not even using my hdfs.
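The same read against the local file system, as shown in the demo:

    val local = sc.textFile("file:///home/edureka/a.txt") // the file: prefix skips HDFS
    local.collect()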
4547
03:09:41,700 --> 03:09:43,000
Which means what now?
4548
03:09:43,000 --> 03:09:46,000
Can you tell me why this
error? This is interesting.
4549
03:09:46,000 --> 03:09:49,300
Why this error, input path
does not exist?
4550
03:09:49,300 --> 03:09:51,600
Because I have made
a typo here.
4551
03:09:51,600 --> 03:09:52,400
Okay.
4552
03:09:52,400 --> 03:09:53,595
Now if you notice
4553
03:09:53,595 --> 03:09:58,555
but I did not get this error here.
Why did I not get this error here?
4554
03:09:58,555 --> 03:10:00,200
This file does not exist,
4555
03:10:00,200 --> 03:10:02,500
but still I did not get
4556
03:10:02,500 --> 03:10:07,300
any error, because of
lazy evaluation. Lazy
4557
03:10:07,300 --> 03:10:11,500
evaluation kind
of made sure that even
4558
03:10:11,500 --> 03:10:14,400
if you have given
the wrong path in creating
4559
03:10:14,400 --> 03:10:18,200
an RDD, it
has not executed anything.
4560
03:10:18,400 --> 03:10:19,900
So all the output
4561
03:10:19,900 --> 03:10:22,800
or the error mistake
you are able to receive
4562
03:10:22,800 --> 03:10:25,600
when you hit that action
of Collective Now
4563
03:10:25,600 --> 03:10:27,997
in order to correct this value.
4564
03:10:27,997 --> 03:10:32,890
I need to connect this adorable
and this time if I execute it,
4565
03:10:32,975 --> 03:10:33,975
it will work.
4566
03:10:34,050 --> 03:10:37,050
Okay, you can see
this output 1 2 3 4 5.
4567
03:10:37,100 --> 03:10:40,500
So this time it works
by so now we should be
4568
03:10:40,500 --> 03:10:44,200
more clear about the lazy
evaluation of the even
4569
03:10:44,200 --> 03:10:46,375
if you are giving
the wrong file name
4570
03:10:46,375 --> 03:10:47,628
doesn't matter suppose.
4571
03:10:47,628 --> 03:10:49,804
I want to use Park
in production unit,
4572
03:10:49,804 --> 03:10:51,155
but not on top of Hadoop.
4573
03:10:51,155 --> 03:10:52,007
Is it possible?
4574
03:10:52,007 --> 03:10:53,200
Yes, you can do that.
4575
03:10:53,200 --> 03:10:54,500
You can do that Sonny,
4576
03:10:54,500 --> 03:10:56,900
but usually that's
not what you do.
4577
03:10:56,900 --> 03:10:58,958
But yes, if you
want to can do that,
4578
03:10:58,958 --> 03:11:00,299
there are a lot of things
4579
03:11:00,299 --> 03:11:02,239
which you can view
can also deploy it
4580
03:11:02,239 --> 03:11:05,611
on your Amazon clusters as that
lot of things you can do that.
4581
03:11:05,611 --> 03:11:07,900
How will it provided
distribute in that case?
4582
03:11:07,900 --> 03:11:10,186
We'll be using
some other distribution system.
4583
03:11:10,186 --> 03:11:12,425
So in that case you
are not using this fact,
4584
03:11:12,425 --> 03:11:14,300
you can deploy it
will be just death.
4585
03:11:14,300 --> 03:11:16,400
He will not be able
to kind of go across
4586
03:11:16,400 --> 03:11:17,698
and distribute in that Master.
4587
03:11:17,698 --> 03:11:19,849
You will not be able to lift
weight that redundancy,
4588
03:11:19,849 --> 03:11:22,500
but you can use them in Amazon
is the enough for that.
4589
03:11:22,500 --> 03:11:23,700
Okay, so that is
4590
03:11:23,700 --> 03:11:28,089
how you will be using this now
you're going to get so this is
4591
03:11:28,089 --> 03:11:31,600
how you will be performing
your practice as a sec
4592
03:11:31,600 --> 03:11:33,643
how you will be working
on this part.
4593
03:11:33,643 --> 03:11:35,800
I will be a training you
as I told you.
4594
03:11:35,800 --> 03:11:37,500
So this is how things work.
4595
03:11:37,700 --> 03:11:41,600
Now, let us see
an interesting use case.
4596
03:11:41,800 --> 03:11:43,900
So for that let us go back.
4597
03:11:43,900 --> 03:11:47,900
Back to our visiting this
is going to be very interesting.
4598
03:11:48,161 --> 03:11:50,238
So let's see this use case.
4599
03:11:50,600 --> 03:11:51,600
Look at this.
4600
03:11:51,900 --> 03:11:53,500
This is very interested.
4601
03:11:53,500 --> 03:11:57,600
Now this use case is for
earthquake detection using Spa.
4602
03:11:57,600 --> 03:12:00,200
So in Japan you
might have already seen
4603
03:12:00,200 --> 03:12:02,450
that there are so many
up to access coming you
4604
03:12:02,450 --> 03:12:03,800
might have thought about it.
4605
03:12:03,800 --> 03:12:05,591
I definitely you
might have not seen it
4606
03:12:05,591 --> 03:12:07,100
but you must have heard about it
4607
03:12:07,100 --> 03:12:09,200
that there are
so many earthquake
4608
03:12:09,200 --> 03:12:13,700
which happens in Japan now
how to solve that problem with
4609
03:12:13,700 --> 03:12:16,111
about I'm just going
to give you a glimpse
4610
03:12:16,111 --> 03:12:17,400
of what kind of problems
4611
03:12:17,400 --> 03:12:18,563
in solving the sessions
4612
03:12:18,563 --> 03:12:21,600
definitely we are not going to
walk through in detail of this
4613
03:12:21,600 --> 03:12:24,500
but you will get an idea
House of Prince fastest.
4614
03:12:24,500 --> 03:12:27,300
Okay, just to give you
a little bit of brief here.
4615
03:12:27,300 --> 03:12:30,500
But all these products
will learn at the time
4616
03:12:30,500 --> 03:12:31,900
of sessions now.
4617
03:12:32,000 --> 03:12:35,300
So let's see this part
how we will be using this bill.
4618
03:12:35,300 --> 03:12:38,500
So as everybody must be knowing
what is asked website.
4619
03:12:38,500 --> 03:12:39,800
So our crack is
4620
03:12:40,200 --> 03:12:44,028
like a shaking of your surface
of the Earth your own country.
4621
03:12:44,028 --> 03:12:46,900
Ignore all those events
that happen in tector.
4622
03:12:46,900 --> 03:12:48,050
If you're from India,
4623
03:12:48,050 --> 03:12:51,400
you might have seen recently
there was an earthquake incident
4624
03:12:51,400 --> 03:12:54,600
which came from Nepal
by even recently two days back.
4625
03:12:54,600 --> 03:12:56,900
Also there was upset incident.
4626
03:12:57,053 --> 03:12:59,900
So these are techniques
on coming now,
4627
03:12:59,900 --> 03:13:02,300
very important part is let's say
4628
03:13:02,300 --> 03:13:06,100
if the earthquake is
on major earthquake like arguing
4629
03:13:06,100 --> 03:13:08,992
or maybe tsunami
maybe forest fires,
4630
03:13:08,992 --> 03:13:10,600
maybe a volcano now,
4631
03:13:10,600 --> 03:13:14,000
it's very important
for them to kind of SC.
4632
03:13:15,100 --> 03:13:19,600
That black is going to come
they should be able to kind
4633
03:13:19,600 --> 03:13:21,600
of predicted beforehand.
4634
03:13:21,600 --> 03:13:23,776
It's not happen
that as a last moment.
4635
03:13:23,776 --> 03:13:25,254
They got to the that okay
4636
03:13:25,254 --> 03:13:27,862
Dirtbag is comes
after I came up cracking No,
4637
03:13:27,862 --> 03:13:29,700
it should not happen like that.
4638
03:13:29,700 --> 03:13:34,000
They should be able to estimate
all these things beforehand.
4639
03:13:34,000 --> 03:13:36,600
They should be able
to predict beforehand.
4640
03:13:36,688 --> 03:13:40,611
So this is the system
with Japan's is using already.
4641
03:13:40,700 --> 03:13:44,300
So this is a real-time kind of
use case what I am presenting.
4642
03:13:44,300 --> 03:13:47,300
It's so Japan is already
using this path finger
4643
03:13:47,300 --> 03:13:49,770
in order to solve
this earthquake problem.
4644
03:13:49,770 --> 03:13:52,482
We are going to see
that how they're using it.
4645
03:13:52,482 --> 03:13:52,866
Okay.
4646
03:13:52,900 --> 03:13:56,900
Now let's say what happens
in Japan earthquake model.
4647
03:13:57,000 --> 03:14:00,000
So whenever there is
an earthquake coming
4648
03:14:00,000 --> 03:14:02,000
for example at 2:46 p.m.
4649
03:14:02,000 --> 03:14:04,800
On March 4 2011 now
4650
03:14:04,800 --> 03:14:08,300
Japan earthquake early
warning was detected.
4651
03:14:08,600 --> 03:14:12,800
Now the thing was as
soon as it detected immediately,
4652
03:14:12,800 --> 03:14:16,999
they start sending
Not those fools to the lift
4653
03:14:17,000 --> 03:14:20,700
to the factories every station
through TV stations.
4654
03:14:20,700 --> 03:14:23,300
They immediately kind
of told everyone
4655
03:14:23,300 --> 03:14:26,315
so that all the students
were there in school.
4656
03:14:26,315 --> 03:14:29,800
They got the time to go
under the desk bullet trains,
4657
03:14:29,800 --> 03:14:30,900
which were running.
4658
03:14:30,900 --> 03:14:31,571
They stop.
4659
03:14:31,571 --> 03:14:35,200
Otherwise the capabilities
of us will start shaking now
4660
03:14:35,200 --> 03:14:38,200
the bullet trains are already
running at the very high speed.
4661
03:14:38,200 --> 03:14:39,432
They want to ensure
4662
03:14:39,432 --> 03:14:43,000
that there should be no sort
of casualty because of that
4663
03:14:43,000 --> 03:14:46,600
so all the bullet train Stop
all the elevators the lift
4664
03:14:46,600 --> 03:14:47,825
which were running.
4665
03:14:47,825 --> 03:14:50,600
They stop otherwise
some incident can happen
4666
03:14:50,700 --> 03:14:53,930
in 60 seconds 60 seconds
4667
03:14:53,930 --> 03:14:55,700
before this number they
4668
03:14:55,700 --> 03:14:59,100
were able to inform
almost every month.
4669
03:14:59,300 --> 03:15:01,212
They have send the message.
4670
03:15:01,212 --> 03:15:02,698
They have a broadcast
4671
03:15:02,698 --> 03:15:05,949
on TV all those things
they have done immediately
4672
03:15:05,949 --> 03:15:07,100
to all the people
4673
03:15:07,100 --> 03:15:09,856
so that they can send
at least this message
4674
03:15:09,856 --> 03:15:11,300
whoever can receive it
4675
03:15:11,300 --> 03:15:13,600
and that have saved millions
4676
03:15:13,600 --> 03:15:17,300
of So powerful they
were able to achieve
4677
03:15:17,300 --> 03:15:22,100
that they have done all this
with the help of Apache spark.
4678
03:15:22,192 --> 03:15:24,500
That is the most important job
4679
03:15:24,500 --> 03:15:27,900
how they've got you
can select everything
4680
03:15:27,900 --> 03:15:29,800
what they are doing there.
4681
03:15:29,800 --> 03:15:33,600
They are doing it
on the real time system, right?
4682
03:15:33,700 --> 03:15:35,690
They cannot just
collect the data
4683
03:15:35,690 --> 03:15:39,100
and then later the processes
they did everything as
4684
03:15:39,100 --> 03:15:40,300
a real-time system.
4685
03:15:40,300 --> 03:15:43,300
So they collected the data
immediately process it
4686
03:15:43,300 --> 03:15:45,004
and as soon has the detected
4687
03:15:45,004 --> 03:15:47,484
that has quick they
immediately inform the
4688
03:15:47,484 --> 03:15:49,381
in fact this happened in 2011.
4689
03:15:49,381 --> 03:15:52,100
Now they they start
using it very frequently
4690
03:15:52,100 --> 03:15:54,318
because Japan is one of the area
4691
03:15:54,318 --> 03:15:58,200
which is very frequently
of kind of affected by all this.
4692
03:15:58,200 --> 03:15:58,900
So as I said,
4693
03:15:58,900 --> 03:16:01,548
the main thing is we should be
able to process the data
4694
03:16:01,548 --> 03:16:02,449
and we are finding
4695
03:16:02,449 --> 03:16:04,900
that the bigger thing you
should be able to handle
4696
03:16:04,900 --> 03:16:06,400
the data from multiple sources
4697
03:16:06,400 --> 03:16:07,789
because data may be coming
4698
03:16:07,789 --> 03:16:10,882
from multiple sources may be
different different sources.
4699
03:16:10,882 --> 03:16:13,600
They might be suggesting some
of the other events.
4700
03:16:13,600 --> 03:16:16,305
It's because Which we
are predicting that okay,
4701
03:16:16,305 --> 03:16:17,770
this earthquake can happen.
4702
03:16:17,770 --> 03:16:19,729
It should be very
easy to use because
4703
03:16:19,729 --> 03:16:22,500
if it is very complicated
then in that case
4704
03:16:22,500 --> 03:16:23,500
for a user to use it
4705
03:16:23,500 --> 03:16:25,549
if you'd be very good
become competitive service.
4706
03:16:25,549 --> 03:16:27,600
You will not be able
to solve the problem.
4707
03:16:27,700 --> 03:16:29,200
Now even in the end
4708
03:16:29,200 --> 03:16:32,100
how to send the alert
message is important.
4709
03:16:32,100 --> 03:16:32,900
Okay.
4710
03:16:32,900 --> 03:16:36,000
So all those things
are taken care by your spark.
4711
03:16:36,000 --> 03:16:39,923
Now there are two kinds
of layer in your earthquake.
4712
03:16:40,100 --> 03:16:42,633
The number one layer
is a prime the way
4713
03:16:42,633 --> 03:16:43,900
and second is fake.
4714
03:16:43,900 --> 03:16:44,864
And we'll wait.
4715
03:16:44,864 --> 03:16:46,600
There are two kinds of wave
4716
03:16:46,600 --> 03:16:49,100
in an earthquake
Prime Z Wave is like
4717
03:16:49,100 --> 03:16:52,261
when the earthquake is
just about to start it start
4718
03:16:52,261 --> 03:16:53,400
to the city center
4719
03:16:53,400 --> 03:16:55,200
and it's vendor or Quake
4720
03:16:55,200 --> 03:16:59,100
is going to start secondary wave
is more severe than
4721
03:16:59,100 --> 03:17:01,400
which sparked after producing.
4722
03:17:01,400 --> 03:17:03,912
Now what happens
in secondary wheel is
4723
03:17:03,912 --> 03:17:06,900
when it's that start it
can do maximum damage
4724
03:17:06,900 --> 03:17:09,605
because primary ways you
can see the initial wave
4725
03:17:09,605 --> 03:17:11,900
but the second we
will be on top of that
4726
03:17:11,900 --> 03:17:14,800
so they will be some details
with respect to I 'm not going
4727
03:17:14,800 --> 03:17:15,800
in detail of that.
4728
03:17:15,800 --> 03:17:17,600
But yeah, there
will be some details
4729
03:17:17,600 --> 03:17:18,700
with respect to that.
4730
03:17:18,700 --> 03:17:21,700
Now what we are going
to do using Sparks.
4731
03:17:21,700 --> 03:17:23,907
We will be creating our arms.
4732
03:17:23,907 --> 03:17:26,799
So let's go and see
that in our machine
4733
03:17:26,799 --> 03:17:30,600
how we will be sick
calculating our Roc which using
4734
03:17:30,600 --> 03:17:33,600
which we will be solving
this problem later
4735
03:17:33,600 --> 03:17:36,524
and we will be calculating
this Roc with the help
4736
03:17:36,524 --> 03:17:37,500
of Apache spark.
4737
03:17:37,500 --> 03:17:39,729
Let us again come back
to this machine now
4738
03:17:39,729 --> 03:17:41,369
in order to walk on that.
4739
03:17:41,369 --> 03:17:43,600
Let's first exit
from this console.
4740
03:17:43,800 --> 03:17:48,300
Once you exit from this console
now what you're going to do.
4741
03:17:48,300 --> 03:17:51,900
I have already created
this project in kept it here
4742
03:17:51,900 --> 03:17:55,563
because we just want to give
you an overview of this.
4743
03:17:55,563 --> 03:17:57,900
Let me go to
my downloads section.
4744
03:17:57,900 --> 03:18:01,400
There is a project called
as Earth to so this is
4745
03:18:01,400 --> 03:18:03,400
your project initially
4746
03:18:03,500 --> 03:18:06,400
what all things you
will be having you
4747
03:18:06,400 --> 03:18:08,839
will not be having all
the things initial part.
4748
03:18:08,839 --> 03:18:09,900
So what will happen.
4749
03:18:09,900 --> 03:18:12,990
So let's say if I go
to my downloads from here,
4750
03:18:12,990 --> 03:18:14,200
I have worked too.
4751
03:18:14,200 --> 03:18:16,800
project Okay.
4752
03:18:16,800 --> 03:18:19,000
Now initially I
will not be having
4753
03:18:19,000 --> 03:18:22,300
this target directory project
directory bin directory.
4754
03:18:22,300 --> 03:18:25,400
We will be using
our SBT framework.
4755
03:18:25,400 --> 03:18:28,900
If you do not know SBP this
is the skill of Bill tooth
4756
03:18:28,900 --> 03:18:32,400
which takes care of all
your dependencies takes care
4757
03:18:32,400 --> 03:18:36,700
of all your dependencies are not
so it is very similar to Melvin
4758
03:18:36,700 --> 03:18:40,577
if you already know Megan you
this is because very similar
4759
03:18:40,577 --> 03:18:42,900
but at the same time
I prefer this BTW
4760
03:18:42,900 --> 03:18:46,100
because as BT is
more easier to write income.
4761
03:18:46,100 --> 03:18:47,700
I've been doing yoga never
4762
03:18:47,700 --> 03:18:50,700
so you will be writing
this bill taught as begins.
4763
03:18:50,700 --> 03:18:55,800
So this finally will provide you
build dot SBT now in this file,
4764
03:18:55,800 --> 03:18:57,255
you will be giving the name
4765
03:18:57,255 --> 03:18:59,700
of your project your
what's a version of is
4766
03:18:59,700 --> 03:19:02,800
because using version of scale
of what you are using.
4767
03:19:02,800 --> 03:19:05,385
What are the dependencies
you have with
4768
03:19:05,385 --> 03:19:09,400
what versions dependencies you
have like 4 stock 4 and using
4769
03:19:09,400 --> 03:19:11,194
1.5.2 version of stock.
4770
03:19:11,200 --> 03:19:15,100
So you are telling
that whatever in my program,
4771
03:19:15,150 --> 03:19:16,150
I am writing.
4772
03:19:16,200 --> 03:19:22,100
So if I require anything related
to spawn quote go and get it
4773
03:19:22,100 --> 03:19:27,400
from this website of dot Apache
dot box download it install it.
4774
03:19:27,800 --> 03:19:29,900
If I require any dependency
4775
03:19:29,900 --> 03:19:34,700
for spark streaming program for
this particular version 1.5.2.
4776
03:19:35,000 --> 03:19:37,700
Go to this website or this link
4777
03:19:37,700 --> 03:19:41,200
and executed similar theme
for Amanda password.
4778
03:19:41,200 --> 03:19:43,353
So you just telling them now
4779
03:19:43,400 --> 03:19:47,200
once you have done this you will
be creating a Folder structure.
4780
03:19:47,200 --> 03:19:49,200
Your folder structure
would be you need
4781
03:19:49,200 --> 03:19:50,722
to create a sassy folder.
4782
03:19:50,722 --> 03:19:51,393
After that.
4783
03:19:51,393 --> 03:19:54,612
You will be creating
a main folder from Main folder.
4784
03:19:54,612 --> 03:19:57,200
You will be creating
again a folder called
4785
03:19:57,200 --> 03:19:58,800
as Kayla now inside
4786
03:19:58,800 --> 03:20:01,100
that you will be
keeping your program.
4787
03:20:01,100 --> 03:20:03,300
So now here you will
be writing a program.
4788
03:20:03,300 --> 03:20:04,500
So you are writing you.
4789
03:20:04,500 --> 03:20:07,499
Can you see this screaming
to a scalar Network on scale
4790
03:20:07,499 --> 03:20:08,500
of our DOT Stella.
4791
03:20:08,500 --> 03:20:10,623
So let's keep it as
a black box for them.
4792
03:20:10,623 --> 03:20:12,730
So you will be writing
the code to achieve
4793
03:20:12,730 --> 03:20:14,083
this problem statement.
4794
03:20:14,083 --> 03:20:15,500
Now what we are going to do
4795
03:20:15,500 --> 03:20:20,200
that come out of this What
do you mean project folder
4796
03:20:20,400 --> 03:20:21,500
and from here?
4797
03:20:21,700 --> 03:20:24,400
We will be writing SBT packaged.
4798
03:20:24,500 --> 03:20:26,400
It will start downloading
4799
03:20:26,400 --> 03:20:29,700
with respect to your is beating
it will check your program.
4800
03:20:29,700 --> 03:20:31,900
Whatever dependency you require
4801
03:20:31,900 --> 03:20:35,750
for stock course starts
screaming stuck in the lift.
4802
03:20:35,750 --> 03:20:36,895
It will download
4803
03:20:36,895 --> 03:20:39,400
and install it it
will just download
4804
03:20:39,400 --> 03:20:42,200
and install it so we
are not going to execute it
4805
03:20:42,200 --> 03:20:43,900
because I've already
done it before
4806
03:20:43,900 --> 03:20:45,300
and it also takes some time.
4807
03:20:45,300 --> 03:20:48,453
So that's the reason
I'm not doing it now.
4808
03:20:48,500 --> 03:20:50,689
You have been this packet,
4809
03:20:50,700 --> 03:20:53,788
you will find all
this directly Target directly
4810
03:20:53,788 --> 03:20:55,400
toward project directed.
4811
03:20:55,400 --> 03:20:58,100
These got created
later on the now
4812
03:20:58,100 --> 03:20:59,600
what is going to happen.
4813
03:20:59,600 --> 03:21:03,400
Once you have created this
you will go to your Eclipse.
4814
03:21:03,400 --> 03:21:04,900
So you are a pure c will open.
4815
03:21:04,900 --> 03:21:06,600
So let me open my Eclipse.
4816
03:21:06,900 --> 03:21:08,995
So this is how you're
equipped to protect.
4817
03:21:08,995 --> 03:21:09,200
Now.
4818
03:21:09,200 --> 03:21:11,300
I already have this program
in front of me,
4819
03:21:11,300 --> 03:21:14,900
but let me tell you how you
will be bringing this program.
4820
03:21:14,900 --> 03:21:17,800
You will be going
to your import option
4821
03:21:17,800 --> 03:21:18,934
with We import you
4822
03:21:18,934 --> 03:21:22,400
will be selecting your existing
projects into workspace.
4823
03:21:22,400 --> 03:21:23,700
Next once you do
4824
03:21:23,700 --> 03:21:26,400
that you need to select
your main project.
4825
03:21:26,400 --> 03:21:29,000
For example, you need
to select this Earth to project
4826
03:21:29,000 --> 03:21:31,900
what you have created
and click on OK
4827
03:21:31,900 --> 03:21:32,709
once you do
4828
03:21:32,709 --> 03:21:35,872
that they will be
a project directory coming
4829
03:21:35,872 --> 03:21:38,300
from this Earth
to will come here.
4830
03:21:38,300 --> 03:21:41,700
Now what we need to do go
to your s RC / Main
4831
03:21:41,700 --> 03:21:43,628
and not ignore all this program.
4832
03:21:43,628 --> 03:21:46,400
I require only just are jocular
because this is
4833
03:21:46,400 --> 03:21:48,500
where I've written
my main function.
4834
03:21:48,500 --> 03:21:50,260
Important now after that
4835
03:21:50,260 --> 03:21:52,900
once you reach
to this you need to go
4836
03:21:52,900 --> 03:21:55,900
to your run as Kayla application
4837
03:21:56,100 --> 03:21:59,600
and your spot code
will start to execute now,
4838
03:21:59,600 --> 03:22:01,800
this will return me a row 0.
4839
03:22:02,000 --> 03:22:02,314
Okay.
4840
03:22:02,314 --> 03:22:03,700
Let's see this output.
4841
03:22:06,600 --> 03:22:08,200
Now if I see this,
4842
03:22:08,200 --> 03:22:11,800
this will show me
once it's finished executing.
4843
03:22:22,900 --> 03:22:26,300
See this our area
under carosi is this
4844
03:22:26,300 --> 03:22:29,107
so this is all computed
with the elbows path program.
4845
03:22:29,107 --> 03:22:29,695
Similarly.
4846
03:22:29,695 --> 03:22:32,100
There are other programs
also met will help you
4847
03:22:32,100 --> 03:22:33,400
to spin the data or not.
4848
03:22:33,509 --> 03:22:35,010
I'm not walking over all that.
4849
03:22:35,160 --> 03:22:39,000
Now, let's come back
to my wedding and see
4850
03:22:39,000 --> 03:22:40,900
that what is the next step
4851
03:22:40,900 --> 03:22:44,500
what we will be doing so you
can see this way will be next.
4852
03:22:44,500 --> 03:22:48,200
Is she getting created now,
I'm keeping my Roc here.
4853
03:22:48,200 --> 03:22:53,100
Now after you have created
your RZ you will be Our graph
4854
03:22:53,100 --> 03:22:56,200
now in Japan there is
one important thing.
4855
03:22:56,200 --> 03:22:59,771
Japan is already
of affected area of your organs.
4856
03:22:59,771 --> 03:23:01,714
And now the trouble here is
4857
03:23:01,714 --> 03:23:05,600
that whatever it's not the even
for a minor earthquake.
4858
03:23:05,600 --> 03:23:07,852
I should start sending
the alert right?
4859
03:23:07,852 --> 03:23:11,300
I don't want to do all that
for the minor minor affection.
4860
03:23:11,300 --> 03:23:14,100
In fact, the buildings
and the infrastructure.
4861
03:23:14,100 --> 03:23:17,300
What is created is
the point is in such a way
4862
03:23:17,300 --> 03:23:18,600
if any odd quack
4863
03:23:18,600 --> 03:23:21,700
below six magnitude
comes there there.
4864
03:23:22,000 --> 03:23:25,713
The phones are designed in a way
that they will be no damage.
4865
03:23:25,713 --> 03:23:27,400
They will be no damage them.
4866
03:23:27,400 --> 03:23:29,400
So this is the major thing
4867
03:23:29,400 --> 03:23:33,300
when you work with your Japan
free book now in Japan,
4868
03:23:33,300 --> 03:23:36,000
so that means with six
they are not even worried
4869
03:23:36,000 --> 03:23:37,300
but about six they
4870
03:23:37,300 --> 03:23:40,668
are worried now for that day
will be a graph simulation
4871
03:23:40,668 --> 03:23:43,600
what you can do you can do it
with Park as well.
4872
03:23:43,600 --> 03:23:47,800
Once you generate this graph you
will be seeing that anything
4873
03:23:47,800 --> 03:23:49,449
which is going above 6
4874
03:23:49,449 --> 03:23:52,000
if anything which
is going above 6,
4875
03:23:52,000 --> 03:23:55,400
Should immediately start
the vendor now ignore all
4876
03:23:55,400 --> 03:23:56,700
this programming side
4877
03:23:56,700 --> 03:23:59,800
because that is what we
have just created and showing
4878
03:23:59,800 --> 03:24:01,411
you this execution fact now
4879
03:24:01,411 --> 03:24:03,800
if you have to visualize
the same result,
4880
03:24:03,800 --> 03:24:05,200
this is what is happening.
4881
03:24:05,200 --> 03:24:07,300
This is showing my Roc but
4882
03:24:07,300 --> 03:24:11,800
if my artwork is going
to be greater than 6 then only
4883
03:24:11,800 --> 03:24:16,415
weighs those alert then only
send the alert to all the paper.
4884
03:24:16,415 --> 03:24:18,400
Otherwise take come
4885
03:24:18,600 --> 03:24:22,000
that is what the project
what we generally show.
4886
03:24:22,000 --> 03:24:25,563
Oh in our space program sent now
it is not the only project
4887
03:24:25,563 --> 03:24:28,900
we also kind of create
multiple other products as well.
4888
03:24:28,900 --> 03:24:31,600
For example, I kind
of create a model just
4889
03:24:31,600 --> 03:24:33,204
like how Walmart to it
4890
03:24:33,204 --> 03:24:35,100
how Walmart maybe creating
4891
03:24:35,100 --> 03:24:38,241
a whatever sales is happening
with respect to that.
4892
03:24:38,241 --> 03:24:39,743
They're using Apache spark
4893
03:24:39,743 --> 03:24:43,000
and at the end they are kind of
making you visualize the output
4894
03:24:43,000 --> 03:24:45,400
of doing whatever
analytics they're doing.
4895
03:24:45,400 --> 03:24:46,900
So that is ordering the spark.
4896
03:24:46,900 --> 03:24:48,900
So all those things
we walking through
4897
03:24:48,900 --> 03:24:52,252
when we do the per session all
the things you learn quick.
4898
03:24:52,252 --> 03:24:55,100
I feel that all these projects
are using right now,
4899
03:24:55,100 --> 03:24:56,700
since you do not know the topic
4900
03:24:56,700 --> 03:24:59,400
you are not able to get
hundred percent of the project.
4901
03:24:59,400 --> 03:25:00,434
But at that time
4902
03:25:00,434 --> 03:25:03,366
once you know each
and every topics of deadly
4903
03:25:03,366 --> 03:25:07,100
you will have a clearer picture
of how spark is handling.
4904
03:25:07,100 --> 03:25:15,000
All these use cases graphs
are very attractive
4905
03:25:15,000 --> 03:25:17,900
when it comes to modeling
real world data
4906
03:25:17,900 --> 03:25:19,900
because they are
intuitive flexible
4907
03:25:19,900 --> 03:25:23,100
and the theory supporting
them has Been maturing
4908
03:25:23,100 --> 03:25:25,209
for centuries welcome everyone
4909
03:25:25,209 --> 03:25:27,600
in today's session
on Spa Graphics.
4910
03:25:27,700 --> 03:25:30,700
So without any further delay,
let's look at the agenda first.
4911
03:25:31,500 --> 03:25:34,561
We start by understanding
the basics of craft Theory
4912
03:25:34,561 --> 03:25:36,229
and different types of craft.
4913
03:25:36,229 --> 03:25:38,806
Then we'll look
at the features of Graphics
4914
03:25:38,806 --> 03:25:40,170
further will understand
4915
03:25:40,170 --> 03:25:43,820
what is property graph and look
at various crafts operations.
4916
03:25:43,820 --> 03:25:44,594
Moving ahead.
4917
03:25:44,594 --> 03:25:48,258
We'll look at different graph
processing algorithms at last.
4918
03:25:48,258 --> 03:25:49,500
We'll look at a demo
4919
03:25:49,500 --> 03:25:52,400
where we will try
to analyze Ford's go by
4920
03:25:52,400 --> 03:25:54,700
data using pagerank algorithm.
4921
03:25:54,700 --> 03:25:56,800
Let's move to the first topic.
4922
03:25:57,200 --> 03:25:59,845
So we'll start
with basics of graph.
4923
03:25:59,845 --> 03:26:03,661
So graphs are I basically
made up of two sets called
4924
03:26:03,661 --> 03:26:05,089
vertices and edges.
4925
03:26:05,089 --> 03:26:08,704
The vertices are drawn
from some underlying type
4926
03:26:08,704 --> 03:26:11,550
and the set can be
finite or infinite.
4927
03:26:11,550 --> 03:26:12,900
Now each element
4928
03:26:12,900 --> 03:26:17,035
of the edge set is a pair
consisting of two elements
4929
03:26:17,035 --> 03:26:18,728
from the vertices set.
4930
03:26:18,900 --> 03:26:21,400
So your vertex is V1.
4931
03:26:21,403 --> 03:26:23,173
Then your vertex is V3.
4932
03:26:23,173 --> 03:26:25,480
Then your vertex is V2 and V4.
4933
03:26:25,700 --> 03:26:29,300
And your edges are V
1 comma V 3 then next
4934
03:26:29,300 --> 03:26:33,500
is V 1 comma V 2 Then
you have B2 comma V 3
4935
03:26:33,500 --> 03:26:34,961
and then you have V
4936
03:26:34,961 --> 03:26:38,807
2 comma V fo so basically
we represent vertices set
4937
03:26:38,807 --> 03:26:43,000
as closed in curly braces
all the name of vertices.
4938
03:26:43,100 --> 03:26:45,561
So we have V 1 we have V 2
4939
03:26:45,561 --> 03:26:48,176
we have V 3 and then
we have before
4940
03:26:48,300 --> 03:26:53,073
and we'll close the curly braces
and to represent the edge set.
4941
03:26:53,073 --> 03:26:56,600
We use curly braces again
and then in curly braces,
4942
03:26:56,600 --> 03:27:00,907
we specify those two vertex
which are joined by the edge.
4943
03:27:01,000 --> 03:27:02,600
So for this Edge,
4944
03:27:02,600 --> 03:27:07,700
we will use a viven comma V
3 and then for this Edge
4945
03:27:07,700 --> 03:27:12,700
will use we one comma V
2 and then for this Edge again,
4946
03:27:12,700 --> 03:27:15,000
we'll use V 2 comma V 4.
4947
03:27:16,088 --> 03:27:19,011
And then at last
for this Edge will use
4948
03:27:19,300 --> 03:27:23,700
we do comma V 3 and At Last I
will close the curly braces.
4949
03:27:24,100 --> 03:27:26,400
So this is your vertices set.
4950
03:27:26,500 --> 03:27:28,900
And this is your headset.
4951
03:27:29,400 --> 03:27:31,958
Now one, very
important thing that is
4952
03:27:31,958 --> 03:27:35,476
if headset is containing
U comma V or you can say
4953
03:27:35,476 --> 03:27:38,700
that are instead
is containing V 1 comma V 3.
4954
03:27:38,700 --> 03:27:42,000
So V1 is basically
a adjacent to V 3.
4955
03:27:42,200 --> 03:27:45,100
Similarly your V
1 is adjacent to V 2.
4956
03:27:45,200 --> 03:27:48,427
Then V2 is adjacent
to V for and looking at this
4957
03:27:48,427 --> 03:27:50,900
as you can say V2
is adjacent to V 3.
4958
03:27:50,900 --> 03:27:53,686
Now, let's quickly move
ahead and we'll look
4959
03:27:53,686 --> 03:27:55,500
at different types of craft.
4960
03:27:55,500 --> 03:27:58,300
So first we have
undirected graphs.
4961
03:27:58,500 --> 03:28:00,936
So basically in
an undirected graph,
4962
03:28:00,936 --> 03:28:04,000
we use straight lines
to represent the edges.
4963
03:28:04,000 --> 03:28:08,350
Now the order of the vertices
in the edge set does not matter
4964
03:28:08,350 --> 03:28:09,800
in undirected graph.
4965
03:28:09,800 --> 03:28:14,040
So the undirected graph usually
are drawn using straight lines
4966
03:28:14,040 --> 03:28:15,500
between the vertices.
4967
03:28:15,500 --> 03:28:18,300
Now it is almost
similar to the graph
4968
03:28:18,300 --> 03:28:20,763
which we have seen
in the last slide.
4969
03:28:20,763 --> 03:28:21,563
Similarly.
4970
03:28:21,563 --> 03:28:25,000
We can again represent
the vertices set as 5 comma
4971
03:28:25,000 --> 03:28:27,500
6 comma 7 comma 8 and the edge
4972
03:28:27,500 --> 03:28:32,000
set as 5 comma 6 then
5 comma 7 now talking
4973
03:28:32,000 --> 03:28:33,643
about directed graphs.
4974
03:28:33,643 --> 03:28:37,605
So basically in a directed graph
the order of vertices
4975
03:28:37,605 --> 03:28:39,400
in the edge set matters.
4976
03:28:39,700 --> 03:28:43,100
So we use Arrow
to represent the edges
4977
03:28:43,300 --> 03:28:45,014
as you can see in the image
4978
03:28:45,014 --> 03:28:48,000
as It was not the case
with the undirected graph
4979
03:28:48,000 --> 03:28:49,900
where we were using
the straight lines.
4980
03:28:50,000 --> 03:28:51,400
So in directed graph,
4981
03:28:51,400 --> 03:28:56,000
we use Arrow to denote the edges
and the important thing is
4982
03:28:56,000 --> 03:28:58,214
The Edge set should be similar.
4983
03:28:58,214 --> 03:29:00,500
It will contain
the source vertex
4984
03:29:00,500 --> 03:29:04,200
that is five in this case
and the destination vertex,
4985
03:29:04,200 --> 03:29:09,400
which is 6 in this case and this
is never similar to six comma
4986
03:29:09,400 --> 03:29:13,300
five you cannot represent
this Edge as 6 comma 5
4987
03:29:13,400 --> 03:29:17,100
because the direction always
Does indeed directed graph
4988
03:29:17,100 --> 03:29:18,500
similarly you can see
4989
03:29:18,500 --> 03:29:20,556
that 5 is adjacent to 6,
4990
03:29:20,556 --> 03:29:23,787
but you cannot say
that 6 is adjacent to 5.
4991
03:29:24,200 --> 03:29:29,000
So for this graph the vertices
said would be similar as 5 comma
4992
03:29:29,000 --> 03:29:32,620
6 comma 7 comma 8
which was similar
4993
03:29:32,620 --> 03:29:34,158
in undirected graph,
4994
03:29:34,200 --> 03:29:38,700
but in directed graph your Edge
set should be your first opal.
4995
03:29:38,700 --> 03:29:42,835
This one will be 5 comma
6 then you second Edge,
4996
03:29:42,835 --> 03:29:46,528
which is this one would be
five comma Mama seven,
4997
03:29:47,000 --> 03:29:53,300
and at last your this set
would be 7 comma 8 but in case
4998
03:29:53,300 --> 03:29:56,166
of undirected graph
you can write this as
4999
03:29:56,166 --> 03:29:57,600
8 comma 7 or in case
5000
03:29:57,600 --> 03:30:00,400
of undirected graph you can
write this one as seven comma
5001
03:30:00,400 --> 03:30:03,369
5 but this is not the case
with the directed graph.
5002
03:30:03,369 --> 03:30:05,428
You have to follow
the source vertex
5003
03:30:05,428 --> 03:30:08,100
and the destination vertex
to represent the edge.
5004
03:30:08,100 --> 03:30:10,642
So I hope you guys are clear
with the undirected
5005
03:30:10,642 --> 03:30:11,846
and directed graph.
5006
03:30:11,846 --> 03:30:12,100
Now.
5007
03:30:12,100 --> 03:30:15,200
Let's talk about
vertex label graph now.
5008
03:30:15,200 --> 03:30:18,840
A Vertex liberal graph
each vertex is labeled
5009
03:30:18,840 --> 03:30:21,650
with some data
in addition to the data
5010
03:30:21,650 --> 03:30:23,700
that identifies the vertex.
5011
03:30:23,700 --> 03:30:28,100
So basically we say this X
or this v as the vertex ID.
5012
03:30:28,200 --> 03:30:29,500
So there will be data
5013
03:30:29,500 --> 03:30:31,800
that would be added
to this vertex.
5014
03:30:32,000 --> 03:30:35,200
So let's say this vertex
would be 6 comma
5015
03:30:35,200 --> 03:30:37,500
and then we are adding the color
5016
03:30:37,500 --> 03:30:39,700
so it would be purple next.
5017
03:30:39,800 --> 03:30:42,100
This vertex would be 8 comma
5018
03:30:42,100 --> 03:30:44,700
and the color
would be green next.
5019
03:30:44,700 --> 03:30:50,400
We'll say See this as 7 comma
read and then this one is as
5020
03:30:50,400 --> 03:30:54,400
five comma blue now
the six or this five
5021
03:30:54,400 --> 03:30:55,639
or seven or eight.
5022
03:30:55,639 --> 03:30:58,800
These are vertex ID
and the additional data,
5023
03:30:58,800 --> 03:31:03,500
which is attached is the color
like blue purple green or red.
5024
03:31:03,900 --> 03:31:08,696
But only the identifying data
is present in the pair of edges
5025
03:31:08,696 --> 03:31:12,543
or you can say only the ID
of the vertex is present
5026
03:31:12,543 --> 03:31:13,773
in the edge set.
5027
03:31:14,100 --> 03:31:15,322
So here the Edsel.
5028
03:31:15,322 --> 03:31:17,700
Again similar to
your directed graph
5029
03:31:17,700 --> 03:31:19,587
that is your Source ID this
5030
03:31:19,587 --> 03:31:21,992
which is 5 and
then destination ID,
5031
03:31:21,992 --> 03:31:25,274
which is 6 in this case
then for this case.
5032
03:31:25,274 --> 03:31:28,785
It's similar as five comma
7 then in for this case.
5033
03:31:28,785 --> 03:31:30,469
It's similar as 7 comma 8
5034
03:31:30,469 --> 03:31:33,600
so we are not specifying
this additional data,
5035
03:31:33,600 --> 03:31:35,699
which is attached
to the vertices.
5036
03:31:35,699 --> 03:31:36,878
That is the color.
5037
03:31:36,878 --> 03:31:40,121
If you only specify
the identifiers of the vertex
5038
03:31:40,121 --> 03:31:41,300
that is the number
5039
03:31:41,300 --> 03:31:44,700
but your vertex set
would be something
5040
03:31:44,700 --> 03:31:46,300
like so this vertex
5041
03:31:46,300 --> 03:31:50,100
would be 5 comma blue
then your next vertex
5042
03:31:50,100 --> 03:31:52,600
will become 6 comma purple
5043
03:31:53,100 --> 03:31:56,700
then your next vertex
will become 8 comma green
5044
03:31:57,000 --> 03:31:59,800
and at last your last
vertex will be written
5045
03:31:59,800 --> 03:32:01,100
as 7 comma read.
5046
03:32:01,100 --> 03:32:04,808
So basically when you
are specifying the vertices set
5047
03:32:04,808 --> 03:32:07,305
in the vertex label
graph you attach
5048
03:32:07,305 --> 03:32:10,683
the additional information
in the vertices are set
5049
03:32:10,683 --> 03:32:12,200
but while representing
5050
03:32:12,200 --> 03:32:16,183
the edge set it is represented
similarly as A directed graph
5051
03:32:16,183 --> 03:32:19,900
where you have to just specify
the source vertex identifier
5052
03:32:19,900 --> 03:32:20,900
and then you have
5053
03:32:20,900 --> 03:32:24,300
to specify the destination
vertex identifier now.
5054
03:32:24,300 --> 03:32:27,500
I hope that you guys are clear
with underrated directed
5055
03:32:27,500 --> 03:32:29,000
and vertex label graph.
5056
03:32:29,184 --> 03:32:33,615
So let's quickly move forward
next we have cyclic graph.
5057
03:32:33,800 --> 03:32:36,800
So a cyclic graph
is a directed graph
5058
03:32:36,900 --> 03:32:38,900
with at least one cycle
5059
03:32:39,000 --> 03:32:43,153
and the cycle is the path
along with the directed edges
5060
03:32:43,153 --> 03:32:44,933
from a Vertex to itself.
5061
03:32:44,933 --> 03:32:47,000
So so once you see over here,
5062
03:32:47,000 --> 03:32:47,708
you can see
5063
03:32:47,708 --> 03:32:50,541
that from this vertex
V. It's moving toward x
5064
03:32:50,541 --> 03:32:51,700
7 then it's moving
5065
03:32:51,700 --> 03:32:54,700
to vertex Aid then with arrows
moving to vertex six.
5066
03:32:54,700 --> 03:32:57,539
And then again,
it's moving to vertex V.
5067
03:32:57,539 --> 03:33:01,600
So there should be at least
one cycle in a cyclic graph.
5068
03:33:01,600 --> 03:33:04,000
There might be a new component.
5069
03:33:04,000 --> 03:33:08,400
It's a Vertex 9 which is
attached over here again,
5070
03:33:08,400 --> 03:33:10,401
so it would be a cyclic graph
5071
03:33:10,401 --> 03:33:13,300
because it has
one complete cycle over here
5072
03:33:13,300 --> 03:33:15,500
and the important
thing to notice is
5073
03:33:15,500 --> 03:33:20,300
That the arrow should make
the cycle like from 5 to 7
5074
03:33:20,300 --> 03:33:23,300
and then from 7 to 8
and then 8 to 6
5075
03:33:23,300 --> 03:33:25,300
and 6 to 5 and let's say
5076
03:33:25,300 --> 03:33:26,831
that there is an arrow
5077
03:33:26,831 --> 03:33:30,281
from 5 to 6 and then there
is an arrow from 6 to 8.
5078
03:33:30,281 --> 03:33:32,233
So we have flipped the arrows.
5079
03:33:32,233 --> 03:33:33,600
So in that situation,
5080
03:33:33,600 --> 03:33:36,372
this is not a cyclic graph
because the arrows
5081
03:33:36,372 --> 03:33:38,200
are not completing the cycle.
5082
03:33:38,200 --> 03:33:41,370
So once you move from 5 to 7
and then from 7 to 8,
5083
03:33:41,370 --> 03:33:44,452
you cannot move from 8:00
to 6:00 and similarly
5084
03:33:44,452 --> 03:33:47,167
once you move from 5 to 6
and then 6 to 8.
5085
03:33:47,167 --> 03:33:49,020
You cannot move from 8 to 7.
5086
03:33:49,020 --> 03:33:52,000
So in that situation,
it's not a cyclic graph.
5087
03:33:52,000 --> 03:33:54,307
So let's clear all this thing.
5088
03:33:54,307 --> 03:33:56,461
So will represent this cycle
5089
03:33:56,461 --> 03:34:00,300
as five then using
double arrows will go to 7
5090
03:34:00,300 --> 03:34:05,300
and then we'll move to 8
and then we'll move to 6
5091
03:34:05,300 --> 03:34:09,774
and at last we'll
come back to 5 now.
5092
03:34:09,774 --> 03:34:11,851
We have Edge liberal graph.
5093
03:34:12,000 --> 03:34:15,030
So basically as label
graph is a graph.
5094
03:34:15,030 --> 03:34:17,752
The edges are
associated with labels.
5095
03:34:17,752 --> 03:34:22,059
So one can basically indicate
this by making the edge set
5096
03:34:22,059 --> 03:34:23,906
as be a set of triplets.
5097
03:34:23,906 --> 03:34:25,600
So for example,
5098
03:34:25,600 --> 03:34:26,900
let's say this H
5099
03:34:26,900 --> 03:34:30,875
in this Edge label graph
will be denoted as the source
5100
03:34:30,875 --> 03:34:33,200
which is 6 then the destination
5101
03:34:33,200 --> 03:34:38,000
which is 7 and then the label
of the edge which is blue.
5102
03:34:38,000 --> 03:34:41,400
So this Edge would
be defined something
5103
03:34:41,400 --> 03:34:44,700
like 6 comma 7 comma blue
and then for this
5104
03:34:44,700 --> 03:34:47,100
and Hurley The Source vertex
5105
03:34:47,100 --> 03:34:49,414
that is 7 the
destination vertex,
5106
03:34:49,414 --> 03:34:52,100
which is 8 then
the label of the edge,
5107
03:34:52,100 --> 03:34:55,400
which is white like
similarly for this Edge.
5108
03:34:55,400 --> 03:35:00,200
It's five comma 7 and
then blue comma red.
5109
03:35:01,000 --> 03:35:03,076
And it lasts for this Edge.
5110
03:35:03,076 --> 03:35:09,200
It's five comma six and then it
would be yellow common green,
5111
03:35:09,200 --> 03:35:11,362
which is the label of the edge.
5112
03:35:11,362 --> 03:35:14,665
So all these four edges
will become the headset
5113
03:35:14,665 --> 03:35:18,400
for this graph and the vertices
set is almost similar
5114
03:35:18,400 --> 03:35:21,200
that is 5 comma
6 comma 7 comma 8 now
5115
03:35:21,200 --> 03:35:24,200
to generalize this I would say x
5116
03:35:24,200 --> 03:35:26,400
comma y so X here is
5117
03:35:26,400 --> 03:35:30,700
the source vertex then why
here is the destination vertex?
5118
03:35:30,700 --> 03:35:33,914
X and then a here is
the label of the edge
5119
03:35:33,914 --> 03:35:36,900
then Edge label graph
are usually drawn
5120
03:35:36,900 --> 03:35:39,573
with the labels written
adjacent to the Earth
5121
03:35:39,573 --> 03:35:40,902
specifying the edges
5122
03:35:40,902 --> 03:35:41,900
as you can see.
5123
03:35:41,900 --> 03:35:43,900
We have mentioned blue white
5124
03:35:43,900 --> 03:35:46,695
and all those label
addition to the edges.
5125
03:35:46,695 --> 03:35:50,400
So I hope you guys a player
with the edge label graph,
5126
03:35:50,400 --> 03:35:51,561
which is nothing
5127
03:35:51,561 --> 03:35:54,900
but labels attached
to each and every Edge now,
5128
03:35:54,900 --> 03:35:57,200
let's talk about weighted graph.
5129
03:35:57,200 --> 03:36:00,310
So we did graph is
an edge label draft.
5130
03:36:00,700 --> 03:36:03,700
Where the labels
can be operated on by
5131
03:36:03,700 --> 03:36:06,921
usually automatic operators
or comparison operators,
5132
03:36:06,921 --> 03:36:09,700
like less than or greater
than symbol usually
5133
03:36:09,700 --> 03:36:12,900
these are integers
or floats and the idea is
5134
03:36:12,900 --> 03:36:15,534
that some edges
may be more expensive
5135
03:36:15,534 --> 03:36:18,900
and this cost is represented
by the edge labels
5136
03:36:18,900 --> 03:36:22,992
or weights now in short weighted
graphs are a special kind
5137
03:36:22,992 --> 03:36:24,500
of Edgley build rafts
5138
03:36:24,500 --> 03:36:27,200
where your Edge
is attached to a weight.
5139
03:36:27,200 --> 03:36:29,800
Generally, which is
a integer or a float
5140
03:36:29,800 --> 03:36:33,100
so that we can perform
some addition or subtraction
5141
03:36:33,100 --> 03:36:35,452
or different kind
of automatic operations
5142
03:36:35,452 --> 03:36:36,689
or it can be some kind
5143
03:36:36,689 --> 03:36:39,500
of conditional operations
like less than or greater
5144
03:36:39,500 --> 03:36:40,800
than so we'll again
5145
03:36:40,800 --> 03:36:45,700
represent this Edge as 5 comma
6 and then the weight as 3
5146
03:36:46,100 --> 03:36:49,900
and similarly will represent
this Edge as 6 comma
5147
03:36:49,900 --> 03:36:55,351
7 and the weight is again
6 so similarly we represent
5148
03:36:55,351 --> 03:36:57,197
these two edges as well.
5149
03:36:57,300 --> 03:36:57,900
So I hope
5150
03:36:57,900 --> 03:37:00,500
that you guys are clear
with the weighted graphs.
5151
03:37:00,500 --> 03:37:02,300
Now let's quickly
move ahead and look
5152
03:37:02,300 --> 03:37:04,200
at this directed acyclic graph.
5153
03:37:04,200 --> 03:37:06,900
So this is
a directed acyclic graph,
5154
03:37:07,100 --> 03:37:09,500
which is basically
without Cycles.
5155
03:37:09,500 --> 03:37:12,445
So as we just discussed
in cyclic graphs here,
5156
03:37:12,445 --> 03:37:13,151
you can see
5157
03:37:13,151 --> 03:37:16,601
that it is not completing
the graph from the directions
5158
03:37:16,601 --> 03:37:19,607
or you can say the direction
of the edges, right?
5159
03:37:19,607 --> 03:37:21,011
We can move from 5 to 7,
5160
03:37:21,011 --> 03:37:22,164
then seven to eight
5161
03:37:22,164 --> 03:37:25,500
but we cannot move from 8 to 6
and similarly we can move
5162
03:37:25,500 --> 03:37:27,600
from 5:00 to 6:00
then 6:00 to 8:00,
5163
03:37:27,600 --> 03:37:29,700
but we cannot move from 8 to 7.
5164
03:37:29,700 --> 03:37:32,962
So this is Not forming
a cycle and these kind
5165
03:37:32,962 --> 03:37:36,300
of crafts are known as
directed acyclic graph.
5166
03:37:36,300 --> 03:37:39,914
Now, they appear as special
cases in CS application all
5167
03:37:39,914 --> 03:37:41,855
the time and the vertices set
5168
03:37:41,855 --> 03:37:44,600
and the edge set
are represented similarly
5169
03:37:44,700 --> 03:37:46,700
as we have seen
earlier not talking
5170
03:37:46,700 --> 03:37:48,670
about the disconnected graph.
5171
03:37:48,670 --> 03:37:51,972
So vertices in a graph
do not need to be connected
5172
03:37:51,972 --> 03:37:53,100
to other vertices.
5173
03:37:53,100 --> 03:37:54,466
It is basically legal
5174
03:37:54,466 --> 03:37:57,200
for a graph to have
disconnected components
5175
03:37:57,200 --> 03:38:00,466
or even loan vertices
without a single connection.
5176
03:38:00,466 --> 03:38:04,400
So basically this disconnected
graph which has four vertices
5177
03:38:04,400 --> 03:38:05,300
but no edges.
5178
03:38:05,300 --> 03:38:05,543
Now.
5179
03:38:05,543 --> 03:38:08,100
Let me tell you something
important that is
5180
03:38:08,100 --> 03:38:10,176
what our sources and sinks.
5181
03:38:10,200 --> 03:38:13,738
So let's say we have
one Arrow from five to six
5182
03:38:13,738 --> 03:38:18,233
and one Arrow from 5 to 7
now word is with only
5183
03:38:18,233 --> 03:38:20,233
in arrows are called sink.
5184
03:38:20,600 --> 03:38:25,200
So the 7 and 6 are known
as sinks and the vertices
5185
03:38:25,307 --> 03:38:28,400
with only out arrows
are called sources.
5186
03:38:28,400 --> 03:38:32,500
So as you can see in the image
this Five only have out arrows
5187
03:38:32,500 --> 03:38:33,800
to six and seven.
5188
03:38:33,800 --> 03:38:36,200
So these are called sources now.
5189
03:38:36,200 --> 03:38:38,506
We'll talk about this
in a while guys.
5190
03:38:38,506 --> 03:38:41,500
Once we are going
through the pagerank algorithm.
5191
03:38:41,500 --> 03:38:45,228
So I hope that you guys know
what our vertices what our edges
5192
03:38:45,228 --> 03:38:48,149
how vertices and edges
represents the graph then
5193
03:38:48,149 --> 03:38:50,200
what are different
kinds of graph?
5194
03:38:50,384 --> 03:38:52,615
Let's move to the next topic.
5195
03:38:52,800 --> 03:38:54,236
So next let's know.
5196
03:38:54,236 --> 03:38:55,900
What is Park Graphics.
5197
03:38:55,900 --> 03:38:58,616
So talking about
Graphics Graphics is
5198
03:38:58,616 --> 03:39:00,519
a new component in spark.
5199
03:39:00,519 --> 03:39:03,843
For graphs and crafts
parallel computation now
5200
03:39:03,843 --> 03:39:06,170
at a high level graphic extends
5201
03:39:06,170 --> 03:39:09,954
The Spark rdd by introducing
a new graph abstraction
5202
03:39:09,954 --> 03:39:12,046
that is directed multigraph
5203
03:39:12,046 --> 03:39:15,122
that is properties
attached to each vertex
5204
03:39:15,122 --> 03:39:18,800
and Edge now to support
craft computation Graphics
5205
03:39:18,800 --> 03:39:22,320
basically exposes a set
of fundamental operators,
5206
03:39:22,320 --> 03:39:25,400
like finding sub graph
for joining vertices
5207
03:39:25,400 --> 03:39:30,253
or aggregating messages as well
as it also exposes and optimize.
5208
03:39:30,253 --> 03:39:34,713
This variant of the pregnant
a pi in addition Graphics also
5209
03:39:34,713 --> 03:39:37,987
provides you a collection
of graph algorithms
5210
03:39:37,987 --> 03:39:41,700
and Builders to simplify
your spark analytics tasks.
5211
03:39:41,700 --> 03:39:45,600
So basically your graphics
is extending your spark rdd.
5212
03:39:45,600 --> 03:39:48,800
Then you have Graphics
is providing an abstraction
5213
03:39:48,800 --> 03:39:50,614
that is directed multigraph
5214
03:39:50,614 --> 03:39:53,800
with properties attached
to each vertex and Edge.
5215
03:39:53,800 --> 03:39:56,800
So we'll look at this
property graph in a while.
5216
03:39:56,800 --> 03:40:00,200
Then again Graphics gives you
some fundamental operators
5217
03:40:00,200 --> 03:40:01,000
and Then it also
5218
03:40:01,000 --> 03:40:03,800
provides you some graph
algorithms and Builders
5219
03:40:03,800 --> 03:40:07,260
which makes your analytics
easier now to get started
5220
03:40:07,260 --> 03:40:11,400
you first need to import spark
and Graphics into your project.
5221
03:40:11,400 --> 03:40:12,550
So as you can see,
5222
03:40:12,550 --> 03:40:15,875
we are importing first Park
and then we are importing
5223
03:40:15,875 --> 03:40:19,200
spark Graphics to get
those graphics functionalities.
5224
03:40:19,200 --> 03:40:21,150
And at last we are importing
5225
03:40:21,150 --> 03:40:25,400
spark rdd to use those already
functionalities in our program.
5226
03:40:25,400 --> 03:40:28,098
But let me tell you
that if you are not using
5227
03:40:28,098 --> 03:40:30,400
spark shell then you
will need a spark.
5228
03:40:30,400 --> 03:40:31,807
Context in your program.
5229
03:40:31,807 --> 03:40:32,341
So I hope
5230
03:40:32,341 --> 03:40:35,400
that you guys are clear
with the features of graphics
5231
03:40:35,400 --> 03:40:36,400
and the libraries
5232
03:40:36,400 --> 03:40:39,200
which you need to import
in order to use Graphics.
5233
03:40:39,300 --> 03:40:43,500
So let us quickly move ahead
and look at the property graph.
5234
03:40:43,500 --> 03:40:45,800
Now property graph is something
5235
03:40:45,800 --> 03:40:50,300
as the name suggests property
graph have properties attached
5236
03:40:50,300 --> 03:40:52,400
to each vertex and Edge.
5237
03:40:52,500 --> 03:40:54,115
So the property graph
5238
03:40:54,115 --> 03:40:58,653
is a directed multigraph with
user-defined objects attached
5239
03:40:58,653 --> 03:41:00,500
to each vertex and Edge.
5240
03:41:00,500 --> 03:41:03,700
Now you might be wondering
what is undirected multigraph.
5241
03:41:03,700 --> 03:41:08,123
So a directed multi graph is a
directed graph with potentially
5242
03:41:08,123 --> 03:41:11,137
multiple parallel edges
sharing same source
5243
03:41:11,137 --> 03:41:13,050
and same destination vertex.
5244
03:41:13,050 --> 03:41:15,102
So as you can see in the image
5245
03:41:15,102 --> 03:41:17,700
that from San Francisco
to Los Angeles,
5246
03:41:17,700 --> 03:41:22,106
we have two edges and similarly
from Los Angeles to Chicago.
5247
03:41:22,106 --> 03:41:23,600
There are two edges.
5248
03:41:23,600 --> 03:41:26,019
So basically in
a directed multigraph,
5249
03:41:26,019 --> 03:41:28,400
the first thing is
the directed graph,
5250
03:41:28,400 --> 03:41:30,386
so it should have a Direction.
5251
03:41:30,386 --> 03:41:33,300
Ian attached to the edges
and then talking
5252
03:41:33,300 --> 03:41:36,100
about multigraph so
between Source vertex
5253
03:41:36,100 --> 03:41:37,850
and a destination vertex,
5254
03:41:37,850 --> 03:41:39,600
there could be two edges.
5255
03:41:39,800 --> 03:41:42,886
So the ability to
support parallel edges
5256
03:41:42,886 --> 03:41:46,100
basically simplifies
the modeling scenarios
5257
03:41:46,100 --> 03:41:49,054
where there can be
multiple relationships
5258
03:41:49,054 --> 03:41:51,997
between the same vertices
for an example.
5259
03:41:51,997 --> 03:41:54,200
Let's say these are two persons
5260
03:41:54,200 --> 03:41:56,644
so they can be friends
as well as they
5261
03:41:56,644 --> 03:41:58,361
can be co-workers, right?
5262
03:41:58,361 --> 03:42:02,000
So these kind of scenarios
can be Easily modeled using
5263
03:42:02,000 --> 03:42:03,900
directed multigraph now.
5264
03:42:03,900 --> 03:42:08,700
Each vertex is keyed by
a unique 64-bit long identifier,
5265
03:42:08,800 --> 03:42:12,700
which is basically the vertex ID
and it helps an indexing.
5266
03:42:12,700 --> 03:42:16,500
So each of your vertex
contains a Vertex ID,
5267
03:42:16,600 --> 03:42:20,000
which is a unique
64-bit long identifier
5268
03:42:20,200 --> 03:42:21,900
and similarly edges
5269
03:42:21,900 --> 03:42:26,600
have corresponding source and
destination vertex identifiers.
5270
03:42:26,700 --> 03:42:28,174
So this Edge would have
5271
03:42:28,174 --> 03:42:31,647
this vertex identifier as
well as This vertex identifier
5272
03:42:31,647 --> 03:42:35,620
or you can say Source vertex ID
and the destination vertex ID.
5273
03:42:35,620 --> 03:42:37,900
So as we discuss
this property graph
5274
03:42:37,900 --> 03:42:42,300
is basically parameterised
over the vertex and Edge types,
5275
03:42:42,300 --> 03:42:45,684
and these are the types
of objects associated
5276
03:42:45,684 --> 03:42:47,700
with each vertex and Edge.
5277
03:42:48,400 --> 03:42:51,792
So your graphics basically
optimizes the representation
5278
03:42:51,792 --> 03:42:53,300
of vertex and Edge types
5279
03:42:53,300 --> 03:42:56,900
and it reduces the in
memory footprint by storing
5280
03:42:56,900 --> 03:43:00,400
the primitive data types
in a specialized array.
5281
03:43:00,400 --> 03:43:04,400
In some cases it might be
desirable to have vertices
5282
03:43:04,400 --> 03:43:07,200
with different property types
in the same graph.
5283
03:43:07,200 --> 03:43:10,400
Now this can be accomplished
through inheritance.
5284
03:43:10,400 --> 03:43:14,000
So for an example to model
a user and product
5285
03:43:14,000 --> 03:43:15,300
in a bipartite graph,
5286
03:43:15,300 --> 03:43:17,676
or you can see
that we have user property
5287
03:43:17,676 --> 03:43:19,400
and we have product property.
5288
03:43:19,400 --> 03:43:19,762
Okay.
5289
03:43:19,762 --> 03:43:23,400
So let me first tell you
what is a bipartite graph.
5290
03:43:23,400 --> 03:43:26,861
So a bipartite graph
is also called a by graph
5291
03:43:27,000 --> 03:43:29,500
which is a set
of graph vertices.
5292
03:43:30,300 --> 03:43:35,400
Opposed into two disjoint sets
such that no two graph vertices
5293
03:43:35,469 --> 03:43:37,930
within the same
set are adjacent.
5294
03:43:38,100 --> 03:43:39,700
So as you can see over here,
5295
03:43:39,700 --> 03:43:43,000
we have user property and then
we have product property
5296
03:43:43,000 --> 03:43:46,282
but no to user property
can be adjacent or you
5297
03:43:46,282 --> 03:43:48,592
can say there should be no edges
5298
03:43:48,592 --> 03:43:51,707
that is joining any
of the to user property or
5299
03:43:51,707 --> 03:43:53,300
there should be no Edge
5300
03:43:53,300 --> 03:43:56,000
that should be joining
product property.
5301
03:43:56,400 --> 03:44:00,000
So in this scenario
we use inheritance.
5302
03:44:00,200 --> 03:44:01,757
So as you can see here,
5303
03:44:01,757 --> 03:44:04,600
we have class vertex
property now basically
5304
03:44:04,600 --> 03:44:07,400
what we are doing we
are creating another class
5305
03:44:07,400 --> 03:44:08,900
with user property.
5306
03:44:08,900 --> 03:44:10,700
And here we have name,
5307
03:44:10,700 --> 03:44:13,500
which is again a string
and we are extending
5308
03:44:13,500 --> 03:44:17,038
or you can say we are inheriting
the vertex property class.
5309
03:44:17,038 --> 03:44:19,600
Now again, in the case
of product property.
5310
03:44:19,600 --> 03:44:22,100
We have name that is
name of the product
5311
03:44:22,100 --> 03:44:25,000
which is again string and then
we have price of the product
5312
03:44:25,000 --> 03:44:25,985
which is double
5313
03:44:25,985 --> 03:44:29,400
and we are again extending
this vertex property graph
5314
03:44:29,400 --> 03:44:32,900
and at last You're grading a
graph with this vertex property
5315
03:44:32,900 --> 03:44:33,900
and then string.
5316
03:44:33,900 --> 03:44:37,045
So this is how we
can basically model user
5317
03:44:37,045 --> 03:44:39,500
and product as
a bipartite graph.
5318
03:44:39,500 --> 03:44:41,430
So we have created user property
5319
03:44:41,430 --> 03:44:44,265
as well as we have created
this product property
5320
03:44:44,265 --> 03:44:47,100
and we are extending
this vertex property class.
5321
03:44:47,400 --> 03:44:50,076
No talking about
this property graph.
5322
03:44:50,076 --> 03:44:51,907
It's similar to your rdd.
5323
03:44:51,907 --> 03:44:55,900
So like your rdd property graph
are immutable distributed
5324
03:44:55,900 --> 03:44:57,200
and fault tolerant.
5325
03:44:57,200 --> 03:45:00,491
So changes to the values
or structure of the graph.
5326
03:45:00,491 --> 03:45:01,908
are basically accomplished
5327
03:45:01,908 --> 03:45:04,900
by producing a new graph
with the desired changes
5328
03:45:04,900 --> 03:45:07,700
and the substantial part
of the original graph
5329
03:45:07,700 --> 03:45:09,900
which can be your structure
of the graph
5330
03:45:09,900 --> 03:45:11,800
or attributes or indices.
5331
03:45:11,800 --> 03:45:15,081
These are basically reused
in the new graph reducing
5332
03:45:15,081 --> 03:45:18,040
the cost of this inherently
functional data structure.
5333
03:45:18,040 --> 03:45:20,100
So basically your property graph
5334
03:45:20,100 --> 03:45:22,500
once you're trying to change
values of structure.
5335
03:45:22,500 --> 03:45:26,024
So it creates a new graph
with changed structure
5336
03:45:26,024 --> 03:45:27,300
or changed values
5337
03:45:27,300 --> 03:45:30,182
and the substantial part
of the original graph is
5338
03:45:30,182 --> 03:45:33,300
reused multiple times
to improve the performance
5339
03:45:33,300 --> 03:45:35,900
and it can be
your structure of the graph
5340
03:45:35,900 --> 03:45:38,600
which is getting reuse
or it can be your attributes
5341
03:45:38,600 --> 03:45:41,000
or indices of the graph
which is getting reused.
5342
03:45:41,000 --> 03:45:44,400
So this is how your property
graph provides efficiency.
5343
03:45:44,400 --> 03:45:46,400
Now, the graph is partitioned
5344
03:45:46,400 --> 03:45:48,800
across the executors
using a range
5345
03:45:48,800 --> 03:45:50,500
of vertex partitioning rules,
5346
03:45:50,500 --> 03:45:52,700
which are basically
Loosely defined
5347
03:45:52,700 --> 03:45:56,514
and similar to an RDD,
each partition of the graph
5348
03:45:56,514 --> 03:45:57,800
can be recreated
5349
03:45:57,800 --> 03:46:01,100
on different machines
in the event of Failure.
5350
03:46:01,100 --> 03:46:05,000
So this is how your property
graph provides fault tolerance.
5351
03:46:05,000 --> 03:46:07,643
So as we already
discussed logically
5352
03:46:07,643 --> 03:46:12,174
the property graph corresponds
to a pair of typed collections,
5353
03:46:12,174 --> 03:46:15,800
including the properties
for each vertex and Edge
5354
03:46:15,800 --> 03:46:17,338
and as a consequence
5355
03:46:17,338 --> 03:46:21,492
the graph class contains
members to access the vertices
5356
03:46:21,492 --> 03:46:22,569
and the edges.
5357
03:46:22,800 --> 03:46:24,067
So as you can see we
5358
03:46:24,067 --> 03:46:27,300
have the Graph class, then you
can see we have vertices
5359
03:46:27,307 --> 03:46:28,692
and we have edges.
5360
03:46:29,500 --> 03:46:34,400
Now this VertexRDD
is extending your RDD
5361
03:46:34,600 --> 03:46:41,100
with your vertex ID
5362
03:46:41,500 --> 03:46:43,807
and then your vertex property.
5363
03:46:44,600 --> 03:46:45,100
Similarly.
5364
03:46:45,100 --> 03:46:47,600
Your EdgeRDD is extending
5365
03:46:47,600 --> 03:46:53,500
your RDD with your Edge
property so the classes
5366
03:46:53,500 --> 03:46:54,900
that is VertexRDD
5367
03:46:54,900 --> 03:47:00,100
and EdgeRDD, extend an
optimized version of your RDD,
5368
03:47:00,100 --> 03:47:03,810
which includes vertex ID and
vertex property, and your RDD
5369
03:47:03,810 --> 03:47:06,746
which includes your Edge
property, and both
5370
03:47:06,746 --> 03:47:07,795
this VertexRDD
5371
03:47:07,795 --> 03:47:11,501
and EdgeRDD provide additional
functionality built on top
5372
03:47:11,501 --> 03:47:12,876
of graph computation
5373
03:47:12,876 --> 03:47:15,900
and leverages internal
optimizations as well.
5374
03:47:15,900 --> 03:47:19,159
So this is the reason we use
this VertexRDD or EdgeRDD,
5375
03:47:19,159 --> 03:47:22,500
because it already extends your
RDD containing your vertex
5376
03:47:22,500 --> 03:47:23,888
ID and vertex property
5377
03:47:23,888 --> 03:47:26,700
or your Edge property
it also provides you
5378
03:47:26,700 --> 03:47:30,100
additional functionalities built
on top of graph computation.
5379
03:47:30,100 --> 03:47:33,700
And again, it gives you some
internal optimizations as well.
5380
03:47:34,100 --> 03:47:37,715
Now, let me clear
this and let's take an example
5381
03:47:37,715 --> 03:47:39,000
of property graph
5382
03:47:39,000 --> 03:47:40,633
where the vertex property
5383
03:47:40,633 --> 03:47:43,300
might contain the user
name and occupation.
5384
03:47:43,300 --> 03:47:47,200
So as you can see in this table
that we have ID of the vertex
5385
03:47:47,200 --> 03:47:50,000
and then we have property
attached to each vertex.
5386
03:47:50,000 --> 03:47:52,602
That is the username
as well as the occupation
5387
03:47:52,602 --> 03:47:55,700
of the user or you can see
the profession of the user
5388
03:47:55,700 --> 03:47:58,715
and we can annotate
the edges with the string
5389
03:47:58,715 --> 03:48:01,800
describing the relationship
between the users.
5390
03:48:01,800 --> 03:48:04,400
So as you can see,
first is Thomas
5391
03:48:04,400 --> 03:48:06,300
who is a professor
then second is Frank
5392
03:48:06,300 --> 03:48:08,000
who is also a professor then
5393
03:48:08,000 --> 03:48:09,900
as you can see third is Jenny.
5394
03:48:09,900 --> 03:48:12,241
She's a student and fourth is Bob
5395
03:48:12,241 --> 03:48:15,997
who is a doctor now Thomas is
a colleague of Frank.
5396
03:48:15,997 --> 03:48:17,200
Then you can see
5397
03:48:17,200 --> 03:48:21,000
that Thomas is academic
advisor of Jenny again.
5398
03:48:21,000 --> 03:48:23,153
Frank is also an academic advisor
5399
03:48:23,153 --> 03:48:27,692
of Jenny and then the doctor
is the health advisor of Jenny.
5400
03:48:27,700 --> 03:48:31,200
So the resulting graph
would have a signature
5401
03:48:31,200 --> 03:48:32,800
of something like this.
5402
03:48:32,800 --> 03:48:34,800
So I'll explain this in a while.
5403
03:48:34,900 --> 03:48:38,300
So there are numerous ways
to construct the property graph
5404
03:48:38,300 --> 03:48:39,300
from raw files
5405
03:48:39,300 --> 03:48:43,400
or RDDs or even synthetic
generators and we'll discuss it
5406
03:48:43,400 --> 03:48:44,766
in graph Builders,
5407
03:48:44,766 --> 03:48:46,313
but the most basic
5408
03:48:46,313 --> 03:48:49,700
and most General method
is to use the Graph object.
5409
03:48:49,700 --> 03:48:52,129
So let's take a look
at the code first.
5410
03:48:52,129 --> 03:48:53,651
And so first over here,
5411
03:48:53,651 --> 03:48:55,900
we are assuming
that the Spark context
5412
03:48:55,900 --> 03:48:58,100
has already been constructed.
5413
03:48:58,100 --> 03:49:01,700
Then we are taking
sc as the Spark context. Next,
5414
03:49:01,700 --> 03:49:04,600
We are creating an rdd
for the vertices.
5415
03:49:04,600 --> 03:49:06,689
So as you can see for users,
5416
03:49:06,689 --> 03:49:09,600
we have specified an RDD
and then vertex ID
5417
03:49:09,600 --> 03:49:11,393
and then these are two strings.
5418
03:49:11,393 --> 03:49:12,605
So first one would be
5419
03:49:12,605 --> 03:49:15,900
your username and the second one
will be your profession.
5420
03:49:15,900 --> 03:49:19,612
Then we are using sc.parallelize
and we are creating an array
5421
03:49:19,612 --> 03:49:22,300
where we are specifying
all the vertices so
5422
03:49:22,300 --> 03:49:23,838
And that is this one
5423
03:49:23,900 --> 03:49:25,900
and you are getting
the name as Thomas
5424
03:49:25,900 --> 03:49:26,800
and the profession
5425
03:49:26,800 --> 03:49:30,646
is professor. Similarly,
for 2L we have Frank, professor.
5426
03:49:30,646 --> 03:49:34,600
Then 3L Jenny, she's a student,
and 4L Bob, doctor.
5427
03:49:34,600 --> 03:49:37,746
So here we have created
the vertices. Next,
5428
03:49:37,746 --> 03:49:40,207
We are creating
an rdd for edges.
5429
03:49:40,500 --> 03:49:43,400
So first we are giving
the values relationship.
5430
03:49:43,400 --> 03:49:46,400
Then we are creating
an rdd with Edge string
5431
03:49:46,400 --> 03:49:50,000
and then we're using
sc.parallelize to create the edges
5432
03:49:50,000 --> 03:49:52,948
and in the array we are
specifying the source vertex,
5433
03:49:52,948 --> 03:49:55,595
then we are specifying
the destination vertex.
5434
03:49:55,595 --> 03:49:57,400
And then we are
giving the relation
5435
03:49:57,400 --> 03:50:01,000
that is colleague. Similarly,
for the next edge, the source
5436
03:50:01,000 --> 03:50:02,800
and the destination are given,
5437
03:50:02,800 --> 03:50:06,131
and then the relation
is academic advisor
5438
03:50:06,165 --> 03:50:07,934
and then it goes so on.
5439
03:50:08,242 --> 03:50:11,857
So then this line we
are defining a default user
5440
03:50:12,200 --> 03:50:16,276
in case there is a relationship
between missing users.
5441
03:50:16,300 --> 03:50:18,900
Now we have given
the name as default user
5442
03:50:18,900 --> 03:50:20,800
and the profession is missing.
5443
03:50:21,400 --> 03:50:24,000
Next, we are trying to build
an initial graph.
5444
03:50:24,000 --> 03:50:27,100
So for that we are using
this graph object.
5445
03:50:27,100 --> 03:50:30,100
So we have specified users
that is your vertices.
5446
03:50:30,100 --> 03:50:34,300
Then we are specifying the
relations that is your edges.
5447
03:50:34,400 --> 03:50:36,867
And then we are giving
the default user
5448
03:50:36,867 --> 03:50:39,400
which is basically
for any missing user.
5449
03:50:39,400 --> 03:50:41,800
So now as you can see over here,
5450
03:50:41,800 --> 03:50:46,700
we are using Edge case class
and edges have a source ID
5451
03:50:46,700 --> 03:50:48,300
and a destination ID,
5452
03:50:48,300 --> 03:50:51,300
which is basically
corresponding to your source
5453
03:50:51,300 --> 03:50:52,800
and destination vertex.
5454
03:50:52,800 --> 03:50:55,100
And in addition
to the Edge class.
5455
03:50:55,100 --> 03:50:56,900
We have an attribute member
5456
03:50:56,900 --> 03:51:00,600
which stores The Edge property
which is the relation over here
5457
03:51:00,600 --> 03:51:01,600
that is colleague
5458
03:51:01,600 --> 03:51:06,138
or it is academic advisor or it
is Health advisor and so on.
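
Putting the whole walkthrough together, a minimal sketch of the construction code looks like this (the vertex IDs 1L to 4L for Thomas, Frank, Jenny and Bob, and the edge directions, are assumptions read off the narration):

    import org.apache.spark.graphx.{Edge, Graph, VertexId}
    import org.apache.spark.rdd.RDD

    // Assume the SparkContext (sc) has already been constructed.
    // Create an RDD for the vertices: (vertex ID, (username, profession))
    val users: RDD[(VertexId, (String, String))] =
      sc.parallelize(Array(
        (1L, ("Thomas", "professor")),
        (2L, ("Frank", "professor")),
        (3L, ("Jenny", "student")),
        (4L, ("Bob", "doctor"))))
    // Create an RDD for the edges: Edge(sourceId, destinationId, relationship)
    val relationships: RDD[Edge[String]] =
      sc.parallelize(Array(
        Edge(1L, 2L, "colleague"),
        Edge(1L, 3L, "academic advisor"),
        Edge(2L, 3L, "academic advisor"),
        Edge(4L, 3L, "health advisor")))
    // Define a default user in case there are relationships with missing users
    val defaultUser = ("default user", "missing")
    // Build the initial graph
    val graph = Graph(users, relationships, defaultUser)
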
5459
03:51:06,200 --> 03:51:06,900
So, I hope
5460
03:51:06,900 --> 03:51:10,287
that you guys are clear
about creating a property graph
5461
03:51:10,287 --> 03:51:13,800
how to specify the vertices
how to specify edges and then
5462
03:51:13,800 --> 03:51:17,763
how to create a graph Now
we can deconstruct a graph
5463
03:51:17,763 --> 03:51:19,461
into respective vertex
5464
03:51:19,461 --> 03:51:23,000
and Edge views by using
the graph.vertices
5465
03:51:23,000 --> 03:51:24,900
and graph.edges members.
5466
03:51:25,000 --> 03:51:27,041
So as you can see
we are using graph.vertices
5467
03:51:27,041 --> 03:51:30,100
over here
and graph.edges over here.
5468
03:51:30,100 --> 03:51:32,100
Now what we are trying to do.
5469
03:51:32,100 --> 03:51:35,900
So first over here the graph
which we have created earlier.
5470
03:51:35,900 --> 03:51:37,291
So we have
5471
03:51:37,300 --> 03:51:40,700
graph.vertices.filter now
using this case class.
5472
03:51:40,700 --> 03:51:42,300
We have this vertex ID.
5473
03:51:42,300 --> 03:51:45,378
We have the name and then
we have the position.
5474
03:51:45,378 --> 03:51:48,322
And we are specifying
the position as doctor.
5475
03:51:48,322 --> 03:51:51,400
So first we are trying
to filter the profession
5476
03:51:51,400 --> 03:51:53,600
of the user as doctor.
5477
03:51:53,600 --> 03:51:55,400
And then we are trying to count
5478
03:51:55,400 --> 03:51:55,630
it.
5479
03:51:55,900 --> 03:51:56,900
Next.
5480
03:51:56,900 --> 03:51:59,700
We are specifying
graph.edges.filter
5481
03:51:59,900 --> 03:52:03,270
and we are basically
trying to filter the edges
5482
03:52:03,270 --> 03:52:07,300
where the source ID is greater
than your destination ID.
5483
03:52:07,300 --> 03:52:09,800
And then we are trying
to count those edges.
5484
03:52:09,800 --> 03:52:12,600
We are using
a Scala case expression
5485
03:52:12,600 --> 03:52:15,400
as you can see to
deconstruct the tuple.
5486
03:52:15,500 --> 03:52:17,400
You can say to deconstruct
5487
03:52:17,400 --> 03:52:23,358
the result on the other hand
graph.edges returns an EdgeRDD,
5488
03:52:23,358 --> 03:52:26,282
which is containing
Edge[String] objects.
5489
03:52:26,400 --> 03:52:30,800
So we could also have used
the case Class Type Constructor
5490
03:52:30,900 --> 03:52:32,200
as you can see here.
5491
03:52:32,200 --> 03:52:34,832
So again over here we
are using graph.edges.filter
5492
03:52:34,832 --> 03:52:36,400
and over here,
5493
03:52:36,400 --> 03:52:40,400
we have given case Edge and then
we are specifying the property
5494
03:52:40,400 --> 03:52:43,900
that is Source destination
and then property of the edge
5495
03:52:43,900 --> 03:52:45,000
which is attached.
5496
03:52:45,000 --> 03:52:48,800
And then we are filtering it and
then we are trying to count it.
5497
03:52:48,800 --> 03:52:53,547
So this is how using Edge class
either you can see with edges
5498
03:52:53,547 --> 03:52:55,603
or you can see with vertices.
5499
03:52:55,603 --> 03:52:59,191
This is how you can go ahead
and deconstruct them.
5500
03:52:59,191 --> 03:53:01,900
Right, because your
graph.vertices
5501
03:53:01,900 --> 03:53:06,300
or graph.edges returns
a VertexRDD or EdgeRDD.
5502
03:53:06,400 --> 03:53:07,947
So to deconstruct them,
5503
03:53:07,947 --> 03:53:10,100
we basically use
this case class.
5504
03:53:10,100 --> 03:53:11,000
So I hope you
5505
03:53:11,000 --> 03:53:13,742
guys are clear about
transforming property graph.
5506
03:53:13,742 --> 03:53:15,400
And how to use this case
5507
03:53:15,400 --> 03:53:19,300
class to deconstruct
the VertexRDD or EdgeRDD.
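
The two filters just described can be sketched like this (the counts assume the user graph built earlier, and doctor is the profession being filtered in this narration):

    // Count all users whose profession is doctor, deconstructing the
    // (VertexId, (name, position)) tuple with a Scala case expression
    graph.vertices.filter { case (id, (name, pos)) => pos == "doctor" }.count
    // Count all the edges where the source ID is greater than the destination ID
    graph.edges.filter(e => e.srcId > e.dstId).count
    // The same count written with the Edge case class constructor in the pattern
    graph.edges.filter { case Edge(src, dst, prop) => src > dst }.count
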
5508
03:53:20,169 --> 03:53:22,630
So now let's quickly move ahead.
5509
03:53:22,700 --> 03:53:24,875
Now in addition to the vertex
5510
03:53:24,875 --> 03:53:27,406
and Edge views
of the property graph
5511
03:53:27,406 --> 03:53:30,300
GraphX also exposes
a triplet view now,
5512
03:53:30,300 --> 03:53:32,700
you might be wondering
what is a triplet view.
5513
03:53:32,700 --> 03:53:35,977
So the triplet view
logically joins the vertex
5514
03:53:35,977 --> 03:53:39,600
and Edge properties
yielding an RDD of EdgeTriplets
5515
03:53:39,600 --> 03:53:42,700
with vertex property
and your Edge property.
5516
03:53:42,700 --> 03:53:45,174
So as you can see
it gives an RDD
5517
03:53:45,174 --> 03:53:47,217
with EdgeTriplet, and then it
5518
03:53:47,217 --> 03:53:51,523
has vertex property as well as
edge property associated with it,
5519
03:53:51,523 --> 03:53:55,100
and it contains an instance
of the EdgeTriplet class.
5520
03:53:55,200 --> 03:53:55,700
Now.
5521
03:53:55,700 --> 03:53:57,800
I am taking example of a join.
5522
03:53:57,800 --> 03:54:01,603
So in this join we are trying
to select Source ID destination
5523
03:54:01,603 --> 03:54:03,100
ID Source attribute then
5524
03:54:03,100 --> 03:54:04,635
this is your Edge attribute
5525
03:54:04,635 --> 03:54:07,400
and then at last you
have destination attribute.
5526
03:54:07,400 --> 03:54:11,200
So basically your edges has
Alias e then your vertices
5527
03:54:11,200 --> 03:54:12,907
has Alias as source.
5528
03:54:12,907 --> 03:54:16,516
And again your vertices
has alias as destination, so we
5529
03:54:16,516 --> 03:54:19,900
are trying to select
Source ID destination ID,
5530
03:54:19,900 --> 03:54:23,155
then Source, attribute
and destination attribute,
5531
03:54:23,155 --> 03:54:25,800
and we also selecting
The Edge attribute
5532
03:54:25,800 --> 03:54:28,200
and we are performing left join.
5533
03:54:28,400 --> 03:54:31,900
The edge Source ID should
be equal to Source ID
5534
03:54:31,900 --> 03:54:35,600
and the edge destination ID should
be equal to destination ID.
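
The join being read out corresponds to this SQL expression (it is the one given in the GraphX programming guide to explain the triplet view):

    SELECT src.id, dst.id, src.attr, e.attr, dst.attr
    FROM edges AS e LEFT JOIN vertices AS src, vertices AS dst
    ON e.srcId = src.Id AND e.dstId = dst.Id
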
5535
03:54:36,400 --> 03:54:39,700
And now your Edge
triplet class basically
5536
03:54:39,700 --> 03:54:43,090
extends your Edge class
by adding your Source attribute
5537
03:54:43,090 --> 03:54:45,100
and destination
attribute members
5538
03:54:45,100 --> 03:54:48,100
which contains the source
and destination properties
5539
03:54:48,200 --> 03:54:49,155
and we can use
5540
03:54:49,155 --> 03:54:52,500
the triplet view of a graph
to render a collection
5541
03:54:52,500 --> 03:54:55,804
of strings describing
relationship between users.
5542
03:54:55,804 --> 03:54:59,521
This is vertex 1 which is again
denoting your user one.
5543
03:54:59,521 --> 03:55:01,986
That is Thomas and
who is a professor
5544
03:55:01,986 --> 03:55:03,081
and this is vertex 3,
5545
03:55:03,081 --> 03:55:06,400
which is denoting Jenny,
and she's a student.
5546
03:55:06,400 --> 03:55:07,994
And this is your Edge,
5547
03:55:07,994 --> 03:55:11,400
which is defining
the relationship between them.
5548
03:55:11,400 --> 03:55:13,600
So this is an edge triplet
5549
03:55:13,600 --> 03:55:17,300
which is denoting
the both vertex as well
5550
03:55:17,300 --> 03:55:20,900
as the edge which denote
the relation between them.
5551
03:55:20,900 --> 03:55:23,600
So now looking at this code
first we have already
5552
03:55:23,600 --> 03:55:26,377
created the graph then we
are taking this graph.
5553
03:55:26,377 --> 03:55:27,979
We are finding the triplets
5554
03:55:27,979 --> 03:55:30,194
and then we are
mapping each triplet.
5555
03:55:30,194 --> 03:55:33,700
We are trying to find out
the triplet dot Source attribute
5556
03:55:33,700 --> 03:55:36,155
in which we are picking
up the username.
5557
03:55:36,155 --> 03:55:37,100
Then over here.
5558
03:55:37,100 --> 03:55:39,800
We are trying to pick up
the triplet attribute,
5559
03:55:39,800 --> 03:55:42,400
which is nothing
but the edge attribute
5560
03:55:42,400 --> 03:55:44,400
which is your academic advisor.
5561
03:55:44,400 --> 03:55:45,800
Then we are trying
5562
03:55:45,800 --> 03:55:48,800
to pick up the triplet
destination attribute.
5563
03:55:48,800 --> 03:55:50,904
It will again pick
up the username
5564
03:55:50,904 --> 03:55:52,500
of destination attribute,
5565
03:55:52,500 --> 03:55:54,766
which is username
of this vertex 3.
5566
03:55:54,766 --> 03:55:57,100
So for an example
in this situation,
5567
03:55:57,100 --> 03:56:01,000
it will print Thomas is
the academic advisor of Jenny.
5568
03:56:01,000 --> 03:56:03,211
So then we are trying
to take these facts.
5569
03:56:03,211 --> 03:56:04,726
We are collecting the facts
5570
03:56:04,726 --> 03:56:07,900
using this foreach we are
printing each of the triplets
5571
03:56:07,900 --> 03:56:09,812
that are present in this graph.
5572
03:56:09,812 --> 03:56:10,385
So I hope
5573
03:56:10,385 --> 03:56:13,700
that you guys are clear
with the concepts of triplet.
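
A minimal sketch of the triplet code being walked through (the srcAttr._1 accesses assume each vertex attribute is the (username, profession) pair from earlier):

    // Build a human-readable fact for every edge triplet in the graph
    val facts: RDD[String] = graph.triplets.map(triplet =>
      triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1)
    // Collect and print each fact,
    // e.g. "Thomas is the academic advisor of Jenny"
    facts.collect.foreach(println(_))
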
5574
03:56:14,600 --> 03:56:17,300
So now let's quickly take
a look at graph Builders.
5575
03:56:17,353 --> 03:56:19,200
So as I already told you
5576
03:56:19,200 --> 03:56:22,700
that GraphX provides
several ways of building a graph
5577
03:56:22,700 --> 03:56:25,551
from a collection of vertices
and edges either.
5578
03:56:25,551 --> 03:56:28,900
They can be stored in an RDD
or it can be stored on disk.
5579
03:56:28,900 --> 03:56:32,600
So in this graph object first,
we have this apply method.
5580
03:56:32,600 --> 03:56:36,300
So basically this apply
method allows creating a graph
5581
03:56:36,300 --> 03:56:37,773
from rdd of vertices
5582
03:56:37,773 --> 03:56:42,000
and edges and duplicate vertices
are picked arbitrarily
5583
03:56:42,000 --> 03:56:43,139
and the vertices
5584
03:56:43,139 --> 03:56:46,700
which are found in the Edge rdd
and are not present
5585
03:56:46,700 --> 03:56:50,522
in the vertices rdd are assigned
a default attribute.
5586
03:56:50,522 --> 03:56:52,653
So in this apply method first,
5587
03:56:52,653 --> 03:56:55,100
we are providing
the vertex rdd then
5588
03:56:55,100 --> 03:56:57,000
we are providing the edge rdd
5589
03:56:57,000 --> 03:57:00,311
and then we are providing
the default vertex attribute.
5590
03:57:00,311 --> 03:57:03,613
So it will create the vertex
which we have specified.
5591
03:57:03,613 --> 03:57:05,400
Then it will create the edges
5592
03:57:05,400 --> 03:57:08,700
which are specified and
if there is a vertex
5593
03:57:08,700 --> 03:57:11,173
which is being referred
by The Edge,
5594
03:57:11,173 --> 03:57:14,000
but it is not present
in this vertex rdd.
5595
03:57:14,000 --> 03:57:16,763
So So what it does it
creates that vertex
5596
03:57:16,763 --> 03:57:20,900
and assigns them the value of
this default vertex attribute.
5597
03:57:20,900 --> 03:57:22,700
Next we have from edges.
5598
03:57:22,700 --> 03:57:27,000
So Graph.fromEdges
allows creating a graph only
5599
03:57:27,000 --> 03:57:28,900
from the rdd of edges
5600
03:57:29,000 --> 03:57:32,266
which automatically creates
any vertices mentioned
5601
03:57:32,266 --> 03:57:35,400
in the edges and assigns
them the default value.
5602
03:57:35,500 --> 03:57:39,000
So what happens over here
you provide the edge rdd
5603
03:57:39,000 --> 03:57:40,496
and all the vertices
5604
03:57:40,496 --> 03:57:44,385
that are present in the edge RDD
are automatically created
5605
03:57:44,385 --> 03:57:48,500
and Default value is assigned
to each of those vertices.
5606
03:57:48,500 --> 03:57:49,522
So Graph.
5607
03:57:49,522 --> 03:57:53,100
fromEdgeTuples basically
allows creating a graph
5608
03:57:53,100 --> 03:57:55,484
from only the RDD of edge tuples
5609
03:57:55,500 --> 03:58:00,100
and it assigns the edges as
value 1 and again the vertices
5610
03:58:00,100 --> 03:58:04,200
which are specified by the edges
are automatically created
5611
03:58:04,200 --> 03:58:05,788
and the default value which
5612
03:58:05,788 --> 03:58:09,005
we are specifying over here
will be allocated to them.
5613
03:58:09,005 --> 03:58:10,100
So basically your
5614
03:58:10,100 --> 03:58:12,980
fromEdgeTuples supports
deduplicating of edges,
5615
03:58:12,980 --> 03:58:15,800
which means you can remove
the duplicate edges,
5616
03:58:15,800 --> 03:58:19,373
but for that you have
to provide a partition strategy
5617
03:58:19,373 --> 03:58:23,953
in the unique edges parameter
as it is necessary to co-locate
5618
03:58:23,953 --> 03:58:25,277
The Identical edges
5619
03:58:25,277 --> 03:58:28,900
on the same partition duplicate
edges can be removed.
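
For reference, the three builders just discussed have roughly these signatures (abbreviated from the GraphX programming guide; the real methods also take optional storage-level parameters):

    object Graph {
      // Build a graph from vertex and edge RDDs; vertices referenced by
      // edges but missing from `vertices` get `defaultVertexAttr`
      def apply[VD, ED](
          vertices: RDD[(VertexId, VD)],
          edges: RDD[Edge[ED]],
          defaultVertexAttr: VD): Graph[VD, ED]
      // Build a graph from edges only; all vertices get `defaultValue`
      def fromEdges[VD, ED](
          edges: RDD[Edge[ED]],
          defaultValue: VD): Graph[VD, ED]
      // Build a graph from raw (srcId, dstId) pairs; edge attributes are
      // set to 1, and `uniqueEdges` optionally deduplicates identical
      // edges that have been co-located by a partition strategy
      def fromEdgeTuples[VD](
          rawEdges: RDD[(VertexId, VertexId)],
          defaultValue: VD,
          uniqueEdges: Option[PartitionStrategy] = None): Graph[VD, Int]
    }
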
5620
03:58:29,100 --> 03:58:33,000
So moving ahead, none of the graph
builders repartitions
5621
03:58:33,000 --> 03:58:37,146
the graph's edges; by default,
edges are left
5622
03:58:37,146 --> 03:58:39,300
in their default partitions.
5623
03:58:39,300 --> 03:58:42,540
So as you can see,
we have a GraphLoader object,
5624
03:58:42,540 --> 03:58:44,700
which is basically used to load
5625
03:58:44,700 --> 03:58:46,776
graphs from the file system,
5626
03:58:46,900 --> 03:58:51,571
so Graph.groupEdges requires
the graph to be repartitioned
5627
03:58:51,571 --> 03:58:52,956
because it assumes
5628
03:58:53,000 --> 03:58:55,900
that identical edges
will be co-located
5629
03:58:55,900 --> 03:58:57,378
on the same partition.
5630
03:58:57,378 --> 03:59:00,200
And so you must call
graph.partitionBy
5631
03:59:00,200 --> 03:59:02,200
before calling groupEdges.
5632
03:59:02,900 --> 03:59:07,500
So now you can see the edge
list file method over here
5633
03:59:07,538 --> 03:59:12,000
which provides a way to load
a graph from the list of edges
5634
03:59:12,000 --> 03:59:14,577
which is present
on the disk and it
5635
03:59:14,577 --> 03:59:18,900
it parses the adjacency list,
that is your Source vertex ID
5636
03:59:18,900 --> 03:59:22,900
and the destination vertex ID
Pairs and it creates a graph.
5637
03:59:23,200 --> 03:59:24,300
So now for an example,
5638
03:59:24,300 --> 03:59:29,600
let's say we have (2, 1)
which is one Edge then you have
5639
03:59:29,600 --> 03:59:31,533
(4, 1) which is another edge
5640
03:59:31,533 --> 03:59:34,600
and then you have (1, 2)
which is another Edge.
5641
03:59:34,600 --> 03:59:36,700
So it will load these edges
5642
03:59:36,900 --> 03:59:39,300
and then it will
create the graph.
5643
03:59:39,300 --> 03:59:40,792
So it will create 2,
5644
03:59:40,792 --> 03:59:44,600
then it will create
4 and then it will create 1.
5645
03:59:44,900 --> 03:59:46,100
And for (2, 1) it
5646
03:59:46,100 --> 03:59:49,757
will create the edge and then
for (4, 1) it will create the edge
5647
03:59:49,757 --> 03:59:52,500
and at last we create
an edge for (1, 2).
5648
03:59:52,700 --> 03:59:55,300
So it will create a graph
something like this?
5649
03:59:56,000 --> 03:59:59,100
It creates a graph
from specified edges
5650
03:59:59,300 --> 04:00:01,929
where automatically
vertices are created
5651
04:00:01,929 --> 04:00:05,751
which are mentioned by the edges
and all the vertex
5652
04:00:05,751 --> 04:00:08,465
and Edge attribute
are set by default one
5653
04:00:08,465 --> 04:00:10,907
and as well as one
will be associated
5654
04:00:10,907 --> 04:00:12,400
with all the vertices.
5655
04:00:12,543 --> 04:00:15,900
So it will be 4 comma
1 then again for this.
5656
04:00:15,900 --> 04:00:19,200
It would be 1 comma
1 and similarly it would be
5657
04:00:19,200 --> 04:00:21,201
2 comma 1 for this vertex.
5658
04:00:21,800 --> 04:00:24,184
Now, let's go back to the code.
5659
04:00:24,184 --> 04:00:27,800
So then we have
this canonical orientation.
5660
04:00:28,200 --> 04:00:31,655
So this argument
allows reorienting edges
5661
04:00:31,655 --> 04:00:33,500
in the positive direction
5662
04:00:33,500 --> 04:00:35,100
that is from the lower Source ID
5663
04:00:35,100 --> 04:00:38,000
to the higher
destination ID now,
5664
04:00:38,000 --> 04:00:40,800
which is basically required
by your connected components
5665
04:00:40,800 --> 04:00:41,782
algorithm will talk
5666
04:00:41,782 --> 04:00:43,800
about this algorithm
in a while you guys
5667
04:00:44,100 --> 04:00:47,069
but before that,
this basically helps
5668
04:00:47,069 --> 04:00:49,300
in reorienting your edges,
5669
04:00:49,300 --> 04:00:51,500
which means your source vertex
5670
04:00:51,500 --> 04:00:55,400
ID should always be less
than your destination vertex.
5671
04:00:55,400 --> 04:00:58,700
So in that situation it
might reorient this Edge.
5672
04:00:58,700 --> 04:01:01,970
So it will reorient this Edge
and basically to reverse
5673
04:01:01,970 --> 04:01:04,862
direction of the edge
similarly over here.
5674
04:01:04,862 --> 04:01:06,000
So the edge
5675
04:01:06,000 --> 04:01:08,896
which is coming from 2 to 1
will be reoriented
5676
04:01:08,896 --> 04:01:10,700
and will be again reversed.
5677
04:01:10,700 --> 04:01:11,754
Now talking
5678
04:01:11,754 --> 04:01:16,300
about the minimum Edge partition
this minimum Edge partition
5679
04:01:16,300 --> 04:01:18,858
basically specifies
the minimum number
5680
04:01:18,858 --> 04:01:21,900
of edge partitions
to generate There might be
5681
04:01:21,900 --> 04:01:24,242
more Edge partitions
than specified.
5682
04:01:24,242 --> 04:01:26,900
So let's say the hdfs
file has more blocks.
5683
04:01:26,900 --> 04:01:29,300
So obviously more partitions
will be created
5684
04:01:29,300 --> 04:01:32,182
but this will give you
the minimum Edge partitions
5685
04:01:32,182 --> 04:01:33,651
that should be created.
5686
04:01:33,651 --> 04:01:34,192
So I hope
5687
04:01:34,192 --> 04:01:36,900
that you guys are clear
with this graph loader
5688
04:01:36,900 --> 04:01:38,358
how this graph loader Works
5689
04:01:38,358 --> 04:01:41,300
how you can go ahead
and provide the edge list file
5690
04:01:41,300 --> 04:01:43,300
and how it will create the graph
5691
04:01:43,300 --> 04:01:47,124
from this Edge list file and
then this canonical orientation
5692
04:01:47,124 --> 04:01:50,300
where we are again going
and reorienting the graph
5693
04:01:50,300 --> 04:01:52,299
and then we have
Minimum Edge partition
5694
04:01:52,299 --> 04:01:54,900
which is giving the minimum
number of edge partitions
5695
04:01:54,900 --> 04:01:56,300
that should be created.
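
A hedged sketch of loading the edge list just described (the file path here is hypothetical; note the partition-count parameter is named minEdgePartitions in older Spark releases and numEdgePartitions in newer ones):

    import org.apache.spark.graphx.GraphLoader
    // Each line of the file is a "sourceVertexId destinationVertexId" pair
    val graph = GraphLoader.edgeListFile(
      sc, "hdfs:///path/to/edge_list.txt",  // hypothetical path
      canonicalOrientation = true,          // reorient edges so srcId < dstId
      numEdgePartitions = 4)                // minimum number of edge partitions
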
5696
04:01:56,300 --> 04:02:00,000
So now I guess you guys are
clear with the graph Builder.
5697
04:02:00,000 --> 04:02:03,400
So how to go ahead and use
this graph object
5698
04:02:03,400 --> 04:02:06,900
and how to create graph
using apply from edges
5699
04:02:06,900 --> 04:02:09,200
and fromEdgeTuples methods,
5700
04:02:09,400 --> 04:02:11,700
and then I guess
you might be clear
5701
04:02:11,700 --> 04:02:13,586
with the graph loader object
5702
04:02:13,586 --> 04:02:17,715
and where you can go ahead and
create a graph from Edge list.
5703
04:02:17,715 --> 04:02:17,990
Now.
5704
04:02:17,990 --> 04:02:21,500
Let's move ahead and talk
about vertex and Edge rdd.
5705
04:02:21,900 --> 04:02:23,561
So as I already told you
5706
04:02:23,561 --> 04:02:27,007
that GraphX exposes
RDD views of the vertices
5707
04:02:27,007 --> 04:02:30,056
and edges stored
within the graph. However,
5708
04:02:30,056 --> 04:02:33,798
because GraphX again
maintains the vertices and edges
5709
04:02:33,798 --> 04:02:35,600
in optimized data structures,
5710
04:02:35,600 --> 04:02:36,979
and these data structure
5711
04:02:36,979 --> 04:02:39,499
provide additional
functionalities as well.
5712
04:02:39,499 --> 04:02:42,679
Now, let us see some of
the additional functionalities
5713
04:02:42,679 --> 04:02:44,300
which are provided by them.
5714
04:02:44,465 --> 04:02:47,234
So let's first talk
about vertex rdd.
5715
04:02:47,600 --> 04:02:51,100
So I already told
you that VertexRDD
5716
04:02:51,100 --> 04:02:54,800
is basically extending
this RDD with vertex ID
5717
04:02:54,800 --> 04:02:59,338
and the vertex property and it
adds an additional constraint
5718
04:02:59,338 --> 04:03:05,600
that each vertex ID occurs only
once. Now moreover, VertexRDD
5719
04:03:05,800 --> 04:03:10,000
represents a set of vertices,
each with an attribute
5720
04:03:10,000 --> 04:03:12,600
of type A now internally
5721
04:03:12,700 --> 04:03:17,600
what happens this is achieved
by storing the vertex attribute
5722
04:03:17,700 --> 04:03:19,184
in a reusable
5723
04:03:19,184 --> 04:03:21,030
hash map data structure.
5724
04:03:24,200 --> 04:03:27,700
So suppose, this is
our hash map data structure.
5725
04:03:27,700 --> 04:03:30,200
So suppose if two VertexRDDs
5726
04:03:30,200 --> 04:03:34,840
are derived from the same
base vertex rdd suppose.
5727
04:03:35,280 --> 04:03:37,600
These are two vertex rdd
5728
04:03:37,600 --> 04:03:41,200
which are basically derived
from this vertex rdd
5729
04:03:41,200 --> 04:03:44,400
so they can be joined
in constant time
5730
04:03:44,400 --> 04:03:46,100
without hash evaluations.
5731
04:03:46,100 --> 04:03:49,400
So you don't have to go ahead
and evaluate the properties
5732
04:03:49,400 --> 04:03:52,400
of both the vertices
you can easily go ahead
5733
04:03:52,400 --> 04:03:55,398
and you can join them
without the Yes,
5734
04:03:55,400 --> 04:03:58,288
and this is one of the way
in which this vertex
5735
04:03:58,288 --> 04:04:00,800
already provides you
the optimization now
5736
04:04:00,800 --> 04:04:03,900
to leverage this
indexed data structure
5737
04:04:04,200 --> 04:04:08,700
the vertex rdd exposes multiple
additional functionalities.
5738
04:04:09,000 --> 04:04:11,000
So it gives you
all these functions
5739
04:04:11,000 --> 04:04:12,000
as you can see here.
5740
04:04:12,300 --> 04:04:15,300
It gives you filter,
mapValues, then minus,
5741
04:04:15,300 --> 04:04:16,663
diff, leftJoin,
5742
04:04:16,663 --> 04:04:19,800
innerJoin and aggregateUsingIndex
functions.
5743
04:04:19,800 --> 04:04:22,600
So let us first discuss
about these functions.
5744
04:04:22,600 --> 04:04:26,800
So basically filter a function
filters the vertex set
5745
04:04:26,800 --> 04:04:31,700
but preserves the internal index
So based on some condition.
5746
04:04:31,700 --> 04:04:33,405
It filters the vertices
5747
04:04:33,405 --> 04:04:36,300
that are present
then in map values.
5748
04:04:36,300 --> 04:04:39,200
It is basically used
to transform the values
5749
04:04:39,200 --> 04:04:41,000
without changing the IDS
5750
04:04:41,000 --> 04:04:44,461
and which again preserves
your internal index.
5751
04:04:44,461 --> 04:04:49,399
So it does not change the IDs
of the vertices and it helps
5752
04:04:49,399 --> 04:04:53,100
in transforming those values
now talking about the minus
5753
04:04:53,100 --> 04:04:55,900
method: it shows what is unique
5754
04:04:55,900 --> 04:04:58,500
in the set based
on their vertex IDs?
5755
04:04:58,500 --> 04:04:59,500
So what happens
5756
04:04:59,500 --> 04:05:03,300
if you are providing two sets
of vertices first contains V1 V2
5757
04:05:03,300 --> 04:05:06,100
and V3 and second
one contains V3,
5758
04:05:06,200 --> 04:05:08,276
so it will return V1 and V2
5759
04:05:08,276 --> 04:05:11,366
because they are unique
in both the sets
5760
04:05:11,700 --> 04:05:14,700
and it is basically done
with the help of vertex ID.
5761
04:05:14,900 --> 04:05:17,053
So next we have dysfunction.
5762
04:05:17,100 --> 04:05:20,900
So it basically removes
the vertices from this set
5763
04:05:20,900 --> 04:05:25,800
that appears in another set Then
we have left join an inner join.
5764
04:05:25,800 --> 04:05:28,300
So join operators
basically take advantage
5765
04:05:28,300 --> 04:05:30,900
of the internal indexing
to accelerate join.
5766
04:05:30,900 --> 04:05:32,900
So you can go ahead
and you can perform left join
5767
04:05:32,900 --> 04:05:34,400
or you can perform inner join.
5768
04:05:34,453 --> 04:05:37,246
Next you have
aggregateUsingIndex.
5769
04:05:37,700 --> 04:05:40,800
So basically aggregateUsingIndex
is nothing
5770
04:05:40,800 --> 04:05:42,400
but reduceByKey,
5771
04:05:42,500 --> 04:05:44,200
but it uses index
5772
04:05:44,300 --> 04:05:48,000
on this RDD to accelerate
the reduceByKey function
5773
04:05:48,000 --> 04:05:50,500
or you can say reduceByKey
operation.
5774
04:05:50,700 --> 04:05:54,900
So again filter is actually
using a bitset and there
5775
04:05:54,900 --> 04:05:56,500
by reusing the index
5776
04:05:56,500 --> 04:05:58,800
and preserving the ability to do
5777
04:05:58,800 --> 04:06:02,220
fast joins with other
VertexRDDs. Now similarly,
5778
04:06:02,220 --> 04:06:04,600
the map values operator as well.
5779
04:06:04,600 --> 04:06:08,200
Do not allow the map function
to change the vertex ID
5780
04:06:08,200 --> 04:06:09,600
and this again helps
5781
04:06:09,600 --> 04:06:13,120
in reusing the same
hash map data structure now both
5782
04:06:13,120 --> 04:06:14,533
of your left join as
5783
04:06:14,533 --> 04:06:17,900
well as your inner join
is able to identify
5784
04:06:17,900 --> 04:06:20,400
whether the two VertexRDDs
5785
04:06:20,400 --> 04:06:23,169
which are joining
are derived from the same
5786
04:06:23,169 --> 04:06:24,208
hash map or not.
5787
04:06:24,208 --> 04:06:28,300
And for this they basically use
a linear scan and thereby don't have
5788
04:06:28,300 --> 04:06:31,900
to go ahead and search
for costly Point lookups.
5789
04:06:31,900 --> 04:06:35,300
So this is the benefit
of using vertex rdd.
5790
04:06:35,500 --> 04:06:36,571
So to summarize
5791
04:06:36,571 --> 04:06:40,300
your VertexRDD uses a
hash map data structure,
5792
04:06:40,426 --> 04:06:42,273
which is again reusable.
5793
04:06:42,300 --> 04:06:44,700
They try to
preserve your indexes
5794
04:06:44,700 --> 04:06:48,500
so that it would be easier
to derive a new VertexRDD
5795
04:06:48,500 --> 04:06:51,404
from them. Then again,
5796
04:06:51,404 --> 04:06:54,000
while performing some
join operations,
5797
04:06:54,000 --> 04:06:57,900
it is pretty much easy to go
ahead perform a linear scan
5798
04:06:57,900 --> 04:07:01,500
and then you can go ahead
and join those two vertex rdd.
5799
04:07:01,500 --> 04:07:05,423
So it actually helps
in optimizing your performance.
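
To summarize the interface, the additional VertexRDD operations discussed above are listed in the GraphX programming guide roughly as follows:

    class VertexRDD[VD] extends RDD[(VertexId, VD)] {
      // Filter the vertex set but preserve the internal index
      def filter(pred: Tuple2[VertexId, VD] => Boolean): VertexRDD[VD]
      // Transform the values without changing the ids (preserves the index)
      def mapValues[VD2](map: VD => VD2): VertexRDD[VD2]
      def mapValues[VD2](map: (VertexId, VD) => VD2): VertexRDD[VD2]
      // Show only vertices unique to this set based on their VertexId's
      def minus(other: RDD[(VertexId, VD)])
      // Remove vertices from this set that appear in the other set
      def diff(other: VertexRDD[VD]): VertexRDD[VD]
      // Joins that take advantage of the internal indexing to accelerate them
      def leftJoin[VD2, VD3](other: RDD[(VertexId, VD2)])
          (f: (VertexId, VD, Option[VD2]) => VD3): VertexRDD[VD3]
      def innerJoin[U, VD2](other: RDD[(VertexId, U)])
          (f: (VertexId, VD, U) => VD2): VertexRDD[VD2]
      // Use the index on this RDD to accelerate a reduceByKey operation
      def aggregateUsingIndex[VD2](
          other: RDD[(VertexId, VD2)],
          reduceFunc: (VD2, VD2) => VD2): VertexRDD[VD2]
    }
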
5800
04:07:05,700 --> 04:07:06,700
Now moving ahead.
5801
04:07:06,700 --> 04:07:10,200
Let's talk about
EdgeRDD. Now again,
5802
04:07:10,200 --> 04:07:13,900
as you can see, your EdgeRDD
is extending your RDD
5803
04:07:13,900 --> 04:07:15,400
with the Edge property.
5804
04:07:15,400 --> 04:07:18,792
Now it organizes the edges
in block partitions using
5805
04:07:18,792 --> 04:07:21,700
one of the various
partitioning strategies,
5806
04:07:21,700 --> 04:07:25,608
which is again defined in your
partition strategy attribute
5807
04:07:25,608 --> 04:07:28,800
or you can say partition
strategy parameter within
5808
04:07:28,800 --> 04:07:30,865
each partition each attribute
5809
04:07:30,865 --> 04:07:34,100
and adjacency structure
are stored separately
5810
04:07:34,100 --> 04:07:36,200
which enables the maximum reuse
5811
04:07:36,200 --> 04:07:38,200
when changing the
attribute values.
5812
04:07:38,600 --> 04:07:42,900
So basically what it does while
storing your Edge attributes
5813
04:07:42,900 --> 04:07:46,400
and your Source vertex
and destination vertex,
5814
04:07:46,400 --> 04:07:48,400
they are stored separately so
5815
04:07:48,400 --> 04:07:51,200
that changing the values
of the attributes
5816
04:07:51,200 --> 04:07:54,200
either of the source
vertex or destination vertex
5817
04:07:54,200 --> 04:07:55,500
or Edge attribute
5818
04:07:55,500 --> 04:07:58,300
so that it can be
reused as many times
5819
04:07:58,300 --> 04:08:01,600
as we need by changing
the attribute values itself.
5820
04:08:01,600 --> 04:08:04,713
So that once the vertex ID
is changed of an edge.
5821
04:08:04,713 --> 04:08:06,400
It could be easily changed
5822
04:08:06,400 --> 04:08:09,196
and the earlier part
can be reused now
5823
04:08:09,196 --> 04:08:10,314
as you can see,
5824
04:08:10,314 --> 04:08:13,518
we have three additional
functions over here
5825
04:08:13,518 --> 04:08:16,500
that is mapValues,
reverse and innerJoin.
5826
04:08:16,700 --> 04:08:19,000
So in EdgeRDD basically map
5827
04:08:19,000 --> 04:08:21,400
values is to transform
the edge attributes
5828
04:08:21,400 --> 04:08:23,200
while preserving the structure.
5829
04:08:23,200 --> 04:08:25,029
It is helpful in transforming,
5830
04:08:25,029 --> 04:08:28,500
so you can use mapValues and
map the values of your EdgeRDD.
5831
04:08:28,800 --> 04:08:31,300
Then you can go ahead and use
this reverse function
5832
04:08:31,300 --> 04:08:35,400
which reverses the edges reusing
both attribute and structure.
5833
04:08:35,400 --> 04:08:37,531
So the source
becomes destination.
5834
04:08:37,531 --> 04:08:40,179
The destination becomes
source. Now talking
5835
04:08:40,179 --> 04:08:41,600
about this inner join.
5836
04:08:41,700 --> 04:08:43,600
So it basically joins
5837
04:08:43,600 --> 04:08:48,500
two EdgeRDDs partitioned using the
same partitioning strategy.
5838
04:08:49,100 --> 04:08:52,900
Now as we already discuss
that same partition strategies,
5839
04:08:52,900 --> 04:08:55,585
Tired because again
to co-locate you need
5840
04:08:55,585 --> 04:08:57,600
to use same partition strategy
5841
04:08:57,600 --> 04:08:59,682
and your identical
vertex should reside
5842
04:08:59,682 --> 04:09:02,800
in same partition to perform
join operation over them.
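
Likewise, the three EdgeRDD operations described here are summarized in the GraphX programming guide roughly as:

    // Transform the edge attributes while preserving the structure
    def mapValues[ED2](f: Edge[ED] => ED2): EdgeRDD[ED2]
    // Reverse the edges reusing both attributes and structure
    def reverse: EdgeRDD[ED]
    // Join two EdgeRDDs partitioned using the same partitioning strategy
    def innerJoin[ED2, ED3](other: EdgeRDD[ED2])
        (f: (VertexId, VertexId, ED, ED2) => ED3): EdgeRDD[ED3]
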
5843
04:09:02,800 --> 04:09:03,092
Now.
5844
04:09:03,092 --> 04:09:07,290
Let me quickly give you an idea
about optimization performed
5845
04:09:07,290 --> 04:09:08,500
in GraphX.
5846
04:09:08,536 --> 04:09:10,151
So GraphX basically
5847
04:09:10,151 --> 04:09:14,844
adopts a Vertex cut approach to
distributed graph partitioning.
5848
04:09:15,500 --> 04:09:20,700
So suppose you have five vertex
and then they are connected.
5849
04:09:20,800 --> 04:09:23,100
Let's not worry
about the arrows, right?
5850
04:09:23,100 --> 04:09:26,200
Now or let's not worry
about Direction right now.
5851
04:09:26,200 --> 04:09:29,200
So either it can be divided
from the edges,
5852
04:09:29,200 --> 04:09:32,287
which is one approach or again.
5853
04:09:32,287 --> 04:09:34,825
It can be divided
from the vertex.
5854
04:09:35,300 --> 04:09:36,840
So in that situation,
5855
04:09:36,840 --> 04:09:39,700
it would be divided
something like this.
5856
04:09:41,200 --> 04:09:43,500
So rather than splitting graphs
5857
04:09:43,500 --> 04:09:47,900
along edges, GraphX partitions
the graph along vertices,
5858
04:09:47,900 --> 04:09:50,305
which can again
reduce the communication
5859
04:09:50,305 --> 04:09:51,600
and storage overhead.
5860
04:09:51,600 --> 04:09:53,523
So logically what happens
5861
04:09:53,523 --> 04:09:56,500
that your edges
are assigned to machines
5862
04:09:56,500 --> 04:10:00,200
and allowing your vertices
to span multiple machines.
5863
04:10:00,200 --> 04:10:03,500
So the vertices are basically
divided across multiple machines
5864
04:10:03,500 --> 04:10:06,900
and each edge is assigned
to a single machine right
5865
04:10:06,900 --> 04:10:09,600
then the exact method
of assigning edges.
5866
04:10:09,600 --> 04:10:11,800
Depends on the
partition strategy.
5867
04:10:11,800 --> 04:10:15,400
So the partition strategy is
the one which basically decides
5868
04:10:15,400 --> 04:10:16,800
how to assign the edges
5869
04:10:16,800 --> 04:10:20,300
to different machines or you
can say different partitions.
5870
04:10:20,300 --> 04:10:21,400
So user can choose
5871
04:10:21,400 --> 04:10:24,900
between different strategies
by partitioning the graph
5872
04:10:24,900 --> 04:10:28,200
with the help of this
Graph.partitionBy operator.
5873
04:10:28,200 --> 04:10:29,500
Now as we discussed
5874
04:10:29,500 --> 04:10:31,329
that this Graph.partitionBy
5875
04:10:31,329 --> 04:10:34,400
operator repartitions
and then it divides
5876
04:10:34,400 --> 04:10:36,900
or relocates the edges
5877
04:10:37,000 --> 04:10:39,900
and basically we try
to put the identical edges.
5878
04:10:39,900 --> 04:10:41,500
On a single partition
5879
04:10:41,500 --> 04:10:43,827
so that different
operations like join
5880
04:10:43,827 --> 04:10:45,400
can be performed on them.
5881
04:10:45,400 --> 04:10:49,629
So once the edges have been
partitioned, the main challenge
5882
04:10:49,629 --> 04:10:52,690
is efficiently joining
the vertex attributes
5883
04:10:52,690 --> 04:10:54,400
with the edges right now
5884
04:10:54,400 --> 04:10:56,000
because real world graphs
5885
04:10:56,000 --> 04:10:58,600
typically have more
edges than vertices.
5886
04:10:58,600 --> 04:11:03,300
So we move vertex attributes
to the edges and because not all
5887
04:11:03,300 --> 04:11:07,800
the partitions will contain
edges adjacent to all vertices.
5888
04:11:07,800 --> 04:11:09,755
We internally maintain a
5889
04:11:09,755 --> 04:11:10,700
routing table.
5890
04:11:10,700 --> 04:11:14,400
So the routing table is the one
who will broadcast the vertices
5891
04:11:14,400 --> 04:11:18,146
and then will implement the join
required for the operations.
5892
04:11:18,146 --> 04:11:18,946
So, I hope
5893
04:11:18,946 --> 04:11:22,200
that you guys are clear
how VertexRDD and EdgeRDD
5894
04:11:22,200 --> 04:11:23,338
works and then
5895
04:11:23,338 --> 04:11:25,800
how the optimizations take place
5896
04:11:25,800 --> 04:11:29,900
and how vertex cut optimizes
the operations in GraphX.
5897
04:11:30,100 --> 04:11:32,600
Now, let's talk
about graph operators.
5898
04:11:32,600 --> 04:11:35,480
So just as RDDs
have basic operations
5899
04:11:35,480 --> 04:11:37,400
like map, filter, reduceBy
5900
04:11:37,400 --> 04:11:41,300
Key, property graphs also have
a collection of basic operators
5901
04:11:41,300 --> 04:11:44,530
that take user-defined functions
and produce new graphs
5902
04:11:44,530 --> 04:11:48,029
that transform properties and
structure. Now the core operators
5903
04:11:48,029 --> 04:11:50,900
that have optimized
implementation are basically
5904
04:11:50,900 --> 04:11:54,061
defined in the Graph class,
and convenient operators
5905
04:11:54,061 --> 04:11:55,262
that are expressed
5906
04:11:55,262 --> 04:11:57,600
as a composition
of the core operators
5907
04:11:57,600 --> 04:12:00,500
are basically defined
in the GraphOps class.
5908
04:12:00,500 --> 04:12:03,346
But in Scala, through implicits,
the operators
5909
04:12:03,346 --> 04:12:04,800
in the GraphOps class
5910
04:12:04,800 --> 04:12:08,500
are automatically available
as members of the Graph class,
5911
04:12:08,600 --> 04:12:09,600
so you can use them.
5912
04:12:09,700 --> 04:12:12,450
using the Graph
class as well. Now,
5913
04:12:12,500 --> 04:12:14,593
as you can see we have
list of operators
5914
04:12:14,593 --> 04:12:15,858
like property operator,
5915
04:12:15,858 --> 04:12:17,800
then you have
structural operator.
5916
04:12:17,800 --> 04:12:19,300
Then you have join operator
5917
04:12:19,300 --> 04:12:22,000
and then you have something
called neighborhood operator.
5918
04:12:22,000 --> 04:12:24,700
So let's talk about them one
by one now talking
5919
04:12:24,700 --> 04:12:26,400
about property operators,
5920
04:12:26,400 --> 04:12:30,016
like RDD has the map operator,
the property graph contains
5921
04:12:30,016 --> 04:12:34,168
mapVertices, mapEdges and
mapTriplets operators. Now,
5922
04:12:34,168 --> 04:12:38,445
each of these operators basically
yields a new graph with the vertex
5923
04:12:38,445 --> 04:12:39,600
or Edge property.
5924
04:12:39,600 --> 04:12:42,600
Modified by the user-defined
map function based
5925
04:12:42,600 --> 04:12:46,366
on the user-defined map function
it basically transforms
5926
04:12:46,366 --> 04:12:47,915
or modifies the vertices
5927
04:12:47,915 --> 04:12:49,202
if it's map vertices
5928
04:12:49,202 --> 04:12:51,489
or it transform
or modify the edges
5929
04:12:51,489 --> 04:12:53,170
if it is map edges method
5930
04:12:53,170 --> 04:12:56,600
or mapEdges operator, and so
on for mapTriplets as well.
5931
04:12:56,600 --> 04:13:00,053
Now the important thing
to note is that in each case.
5932
04:13:00,053 --> 04:13:02,700
The graph structure
is unaffected and this
5933
04:13:02,700 --> 04:13:04,968
is a key feature
of these operators.
5934
04:13:04,968 --> 04:13:07,513
Basically which allows
the resulting graph
5935
04:13:07,513 --> 04:13:09,500
to reuse the structural indices.
5936
04:13:09,500 --> 04:13:10,300
Of the original graph
5937
04:13:10,300 --> 04:13:12,600
each and every time you
apply a transformation,
5938
04:13:12,600 --> 04:13:14,700
so it creates a new graph
5939
04:13:14,700 --> 04:13:17,500
and the original
graph is unaffected
5940
04:13:17,500 --> 04:13:19,200
so that it can be used
5941
04:13:19,200 --> 04:13:22,500
so you can see it can be reused
in creating new graphs.
5942
04:13:22,500 --> 04:13:22,800
Right?
5943
04:13:22,800 --> 04:13:24,600
So your structure indices
5944
04:13:24,600 --> 04:13:27,700
can be used from the original
graph. Now talking
5945
04:13:27,700 --> 04:13:29,400
about this map vertices.
5946
04:13:29,400 --> 04:13:31,152
Let me use the highlighter.
5947
04:13:31,152 --> 04:13:32,900
So first we have map vertices.
5948
04:13:32,900 --> 04:13:34,200
So it maps the vertices
5949
04:13:34,200 --> 04:13:36,100
or you can still
transform the vertices.
5950
04:13:36,100 --> 04:13:39,300
So you provide vertex ID
and then vertex.
5951
04:13:40,100 --> 04:13:43,400
And you apply some of the
transformation function using
5952
04:13:43,400 --> 04:13:46,600
which so it will give you
a graph with new vertex property
5953
04:13:46,600 --> 04:13:49,500
as you can see now same is
the case with map edges.
5954
04:13:49,500 --> 04:13:53,800
So again you provide the edges
then you transform the edges.
5955
04:13:53,800 --> 04:13:57,600
So initially it was ED and then
you transform it to ED2,
5956
04:13:57,700 --> 04:13:58,600
and then the graph
5957
04:13:58,600 --> 04:14:01,000
which is given or you
can see the graph
5958
04:14:01,000 --> 04:14:04,947
which is returned is the graph
with the changed edge attribute.
5959
04:14:04,947 --> 04:14:07,535
So you can see here
the attribute is ED2.
5960
04:14:07,535 --> 04:14:09,800
Same is the case
with mapTriplets.
5961
04:14:09,900 --> 04:14:11,500
So using mapTriplets,
5962
04:14:11,500 --> 04:14:14,657
you can use the edge triplet
where you can go ahead
5963
04:14:14,657 --> 04:14:18,700
and Target the vertex Properties
or you can say vertex attributes
5964
04:14:18,700 --> 04:14:21,817
or to be more specific
Source vertex attribute as well
5965
04:14:21,817 --> 04:14:23,641
as destination vertex attribute
5966
04:14:23,641 --> 04:14:26,900
and the edge attribute and then
you can apply transformation
5967
04:14:26,900 --> 04:14:28,654
over those Source attributes
5968
04:14:28,654 --> 04:14:31,600
or destination attributes
or the edge attributes
5969
04:14:31,600 --> 04:14:34,500
so you can change them and then
it will again return a graph
5970
04:14:34,500 --> 04:14:36,300
with the transformed values now,
5971
04:14:36,300 --> 04:14:39,000
I guess you guys are clear
with the property operators.
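
The three property operators just covered have these signatures (abbreviated from the GraphX programming guide):

    class Graph[VD, ED] {
      // Transform each vertex attribute; the graph structure is unaffected
      def mapVertices[VD2](map: (VertexId, VD) => VD2): Graph[VD2, ED]
      // Transform each edge attribute
      def mapEdges[ED2](map: Edge[ED] => ED2): Graph[VD, ED2]
      // Transform each edge attribute using the full triplet
      // (source attribute, edge attribute, destination attribute)
      def mapTriplets[ED2](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]
    }
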
5972
04:14:39,000 --> 04:14:40,819
So let's move to the next operator,
5973
04:14:40,819 --> 04:14:44,958
that is structural operators. So
currently GraphX supports only
5974
04:14:44,958 --> 04:14:48,200
a simple set of commonly
used structural operators.
5975
04:14:48,200 --> 04:14:50,712
And we expect more
to be added in future.
5976
04:14:50,712 --> 04:14:53,220
Now you can see
in structural operator.
5977
04:14:53,220 --> 04:14:54,800
We have the reverse operator.
5978
04:14:54,800 --> 04:14:56,464
Then we have subgraph operator.
5979
04:14:56,464 --> 04:14:57,923
Then we have masks operator
5980
04:14:57,923 --> 04:15:00,100
and then we have
group edges operator.
5981
04:15:00,100 --> 04:15:04,096
So let's talk about them one by
one so first reverse operator,
5982
04:15:04,096 --> 04:15:05,640
so as the name suggests,
5983
04:15:05,640 --> 04:15:09,500
it returns a new graph with all
the edge directions reversed.
5984
04:15:09,500 --> 04:15:11,750
So basically it will change
your Source vertex
5985
04:15:11,750 --> 04:15:12,950
into destination vertex,
5986
04:15:12,950 --> 04:15:15,108
and then it will change
your destination vertex
5987
04:15:15,108 --> 04:15:16,000
into Source vertex.
5988
04:15:16,000 --> 04:15:18,500
So it will reverse
the direction of your edges.
5989
04:15:18,500 --> 04:15:21,600
And the reverse operation
does not modify Vertex
5990
04:15:21,600 --> 04:15:23,300
or edge properties or change
5991
04:15:23,300 --> 04:15:24,300
the number of edges.
5992
04:15:24,400 --> 04:15:25,739
It can be implemented
5993
04:15:25,739 --> 04:15:28,800
efficiently without
data movement or duplication.
5994
04:15:28,800 --> 04:15:31,400
So next we have
subgraph operator.
5995
04:15:31,400 --> 04:15:34,615
So basically subgraph
operator takes the vertex
5996
04:15:34,615 --> 04:15:35,967
and Edge predicates
5997
04:15:35,967 --> 04:15:38,577
or you can say Vertex
or edge condition
5998
04:15:38,577 --> 04:15:41,600
and returns the graph
containing only the vertex
5999
04:15:41,600 --> 04:15:44,835
that satisfy those vertex
predicates and then it Returns
6000
04:15:44,835 --> 04:15:47,306
the edges that satisfy
the edge predicates.
6001
04:15:47,306 --> 04:15:50,200
So basically will give
a condition about edges and
6002
04:15:50,200 --> 04:15:51,954
vertices and those predicates
6003
04:15:51,954 --> 04:15:54,009
which are fulfilled
or those vertex
6004
04:15:54,009 --> 04:15:57,303
which are fulfilling the
predicates will be only returned
6005
04:15:57,303 --> 04:15:59,302
and again same is the case
with your edges
6006
04:15:59,302 --> 04:16:01,237
and then your graph
will be connected.
6007
04:16:01,237 --> 04:16:03,800
Now, the subgraph operator
can be used in a number
6008
04:16:03,800 --> 04:16:06,953
of situations to restrict
the graph to the vertices
6009
04:16:06,953 --> 04:16:08,245
and edges of interest
6010
04:16:08,245 --> 04:16:10,615
and eliminate the Rest
of the components,
6011
04:16:10,615 --> 04:16:13,450
right so you can see
this is The Edge predicate.
6012
04:16:13,450 --> 04:16:15,200
This is the vertex predicate.
6013
04:16:15,200 --> 04:16:18,900
Then we are providing
the edge triplet with the vertex
6014
04:16:18,900 --> 04:16:20,500
and Edge attributes
6015
04:16:20,500 --> 04:16:21,567
and we are waiting
6016
04:16:21,567 --> 04:16:24,700
for the Boolean value then
same is the case with vertex.
6017
04:16:24,700 --> 04:16:27,100
We're providing the vertex
properties over here
6018
04:16:27,100 --> 04:16:29,150
or you can say vertex
attribute over here.
6019
04:16:29,150 --> 04:16:29,925
And then again,
6020
04:16:29,925 --> 04:16:32,126
it will yield a graph
which is a sub graph
6021
04:16:32,126 --> 04:16:35,400
of the original graph which will
fulfill those predicates.
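
A small sketch of such a predicate, assuming the user graph from before where the vertex attribute is a (name, profession) pair:

    // Keep only vertices whose profession is not "missing", and only
    // the edges whose both endpoints survive the vertex predicate
    val validGraph = graph.subgraph(vpred = (id, attr) => attr._2 != "missing")
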
6022
04:16:35,400 --> 04:16:37,600
Now, the next operator
is the mask operator.
6023
04:16:37,600 --> 04:16:39,746
So the mask operator constructs a
6024
04:16:39,746 --> 04:16:43,466
graph by returning a graph
that contains the vertices
6025
04:16:43,466 --> 04:16:46,888
and edges that are also found
in the input graph.
6026
04:16:46,888 --> 04:16:48,637
Basically, you can treat
6027
04:16:48,637 --> 04:16:52,500
this mask operator as
a comparison between two graphs.
6028
04:16:52,500 --> 04:16:53,314
So suppose.
6029
04:16:53,314 --> 04:16:54,500
We are comparing
6030
04:16:54,500 --> 04:16:58,100
graph 1 and graph 2 and it
will return this sub graph
6031
04:16:58,100 --> 04:17:00,800
which is common in both
the graphs again.
6032
04:17:00,800 --> 04:17:04,600
This can be used in conjunction
with the subgraph operator.
6033
04:17:04,600 --> 04:17:05,900
Basically to restrict
6034
04:17:05,900 --> 04:17:09,400
a graph based on properties
in another related graph, right.
6035
04:17:09,400 --> 04:17:12,280
And so I guess you guys are
clear with the mask operator.
6036
04:17:12,280 --> 04:17:13,000
So we're here.
6037
04:17:13,000 --> 04:17:14,233
We're providing a graph
6038
04:17:14,233 --> 04:17:16,776
and then we are providing
the input graph as well.
6039
04:17:16,776 --> 04:17:18,671
And then it will return a graph
6040
04:17:18,671 --> 04:17:21,700
which is basically a subset
of both of these graphs
6041
04:17:21,700 --> 04:17:23,600
Now talking about group edges,
6042
04:17:23,600 --> 04:17:26,796
So the group edges operator
merges the parallel edges
6043
04:17:26,796 --> 04:17:28,446
in the multigraph, right?
6044
04:17:28,446 --> 04:17:29,683
So what it does it,
6045
04:17:29,683 --> 04:17:33,244
the duplicate edges between pair
of vertices are merged
6046
04:17:33,244 --> 04:17:35,800
or you can say
can be aggregated
6047
04:17:35,800 --> 04:17:37,325
or perform some action
6048
04:17:37,325 --> 04:17:41,000
and in many numerical
applications, edges can be added
6049
04:17:41,000 --> 04:17:43,702
and their weights can be
combined into a single edge,
6050
04:17:43,702 --> 04:17:46,804
right which will again
reduce the size of the graph.
6051
04:17:46,804 --> 04:17:47,900
So for an example,
6052
04:17:47,900 --> 04:17:51,400
you have two vertices V1 and V2
and there are two edges
6053
04:17:51,400 --> 04:17:53,100
with weight 10 and 15.
6054
04:17:53,100 --> 04:17:56,291
So actually what you can do is
you can merge those two edges
6055
04:17:56,291 --> 04:17:59,700
if they have same direction and
you can represent the weight as 25.
6056
04:17:59,700 --> 04:18:02,100
So this will actually
reduce the size
6057
04:18:02,100 --> 04:18:05,144
of the graph now looking
at the next operator,
6058
04:18:05,144 --> 04:18:06,700
which is join operator.
6059
04:18:06,700 --> 04:18:09,400
So in many cases
it is necessary
6060
04:18:09,400 --> 04:18:13,151
to join data from an external
collection with graphs, right?
6061
04:18:13,151 --> 04:18:13,909
For example.
6062
04:18:13,909 --> 04:18:16,100
We might have
an extra user property
6063
04:18:16,100 --> 04:18:18,855
that we want to merge
with the existing graph
6064
04:18:18,855 --> 04:18:21,186
or we might want
to pull vertex property
6065
04:18:21,186 --> 04:18:23,100
from one graph to another right.
6066
04:18:23,100 --> 04:18:24,700
So these are some
of the situations
6067
04:18:24,700 --> 04:18:27,000
where you go ahead and use
this join operators.
6068
04:18:27,000 --> 04:18:28,900
So now as you can see over here,
6069
04:18:28,900 --> 04:18:31,100
the first operator
is joinVertices.
6070
04:18:31,100 --> 04:18:34,792
So the joinVertices operator
joins the vertices
6071
04:18:34,792 --> 04:18:36,176
with the input rdd
6072
04:18:36,200 --> 04:18:39,516
and returns a new graph
with the vertex properties.
6073
04:18:39,516 --> 04:18:42,700
obtained after applying
the user-defined map function
6074
04:18:42,700 --> 04:18:45,400
now the vertices
without a matching value
6075
04:18:45,400 --> 04:18:49,500
in the RDD basically retain
their original value. Now talking
6076
04:18:49,500 --> 04:18:51,400
about outerJoinVertices.
6077
04:18:51,400 --> 04:18:55,100
So it behaves similar
to join vertices except that
6078
04:18:55,100 --> 04:18:59,586
the user-defined map function
is applied to all the vertices
6079
04:18:59,586 --> 04:19:02,200
and can change
the vertex property type.
6080
04:19:02,200 --> 04:19:05,600
So suppose that you have
an old graph which has
6081
04:19:05,600 --> 04:19:08,100
a Vertex attribute as old price
6082
04:19:08,200 --> 04:19:10,700
and then you created
a new graph from it
6083
04:19:10,700 --> 04:19:13,735
and then it has the vertex
attribute as new price.
6084
04:19:13,735 --> 04:19:16,645
So you can go ahead
and join two of these graphs
6085
04:19:16,645 --> 04:19:19,249
and you can perform
an aggregation of both
6086
04:19:19,249 --> 04:19:21,725
the Old and New prices
in the new graph.
6087
04:19:21,725 --> 04:19:25,265
So in this kind of situation
join vertices are used
6088
04:19:25,265 --> 04:19:26,389
now moving ahead.
6089
04:19:26,389 --> 04:19:29,814
Let's talk about neighborhood
aggregation now key step
6090
04:19:29,814 --> 04:19:33,239
in many graph analytics
is aggregating the information
6091
04:19:33,239 --> 04:19:36,600
about the neighborhood
of each vertex for an example.
6092
04:19:36,600 --> 04:19:39,500
We might want to know the number
of followers each user has
6093
04:19:39,700 --> 04:19:41,200
Or the average age
6094
04:19:41,200 --> 04:19:45,600
of the followers of each user. Now
many iterative graph algorithms,
6095
04:19:45,600 --> 04:19:47,416
like PageRank, shortest path,
6096
04:19:47,416 --> 04:19:50,501
then connected components
repeatedly aggregate
6097
04:19:50,501 --> 04:19:52,893
the properties of
neighboring vertices.
6098
04:19:52,893 --> 04:19:56,200
Now, it has four operators
in neighborhood aggregation.
6099
04:19:56,200 --> 04:19:58,803
So the first one is
your aggregate messages.
6100
04:19:58,803 --> 04:20:01,500
So the core aggregation
operation in GraphX
6101
04:20:01,500 --> 04:20:02,900
is aggregate messages.
6102
04:20:02,900 --> 04:20:04,090
Now this operator
6103
04:20:04,090 --> 04:20:07,100
applies a user-defined
send message function
6104
04:20:07,100 --> 04:20:10,799
as you can see over here
to Each of the edge triplet
6105
04:20:10,799 --> 04:20:11,600
in the graph
6106
04:20:11,600 --> 04:20:14,230
and then it uses
merge message function
6107
04:20:14,230 --> 04:20:17,900
to aggregate those messages
at the destination vertex.
6108
04:20:18,000 --> 04:20:19,900
Now the user-defined
6109
04:20:19,900 --> 04:20:23,150
send message function
takes an edge context
6110
04:20:23,150 --> 04:20:26,200
as you can see and
which exposes the source
6111
04:20:26,200 --> 04:20:29,892
and destination attributes
along with the edge attribute
6112
04:20:29,892 --> 04:20:32,399
and functions like send
to Source or send
6113
04:20:32,399 --> 04:20:35,303
to destination is used
to send messages to source
6114
04:20:35,303 --> 04:20:37,013
and destination attributes.
6115
04:20:37,013 --> 04:20:39,800
Now you can think
of send message as the map
6116
04:20:39,800 --> 04:20:43,592
function in MapReduce and
the user-defined merge function
6117
04:20:43,592 --> 04:20:46,000
which actually takes
the two messages
6118
04:20:46,000 --> 04:20:48,200
which are present
on the same Vertex
6119
04:20:48,200 --> 04:20:50,784
or you can see
the same destination vertex
6120
04:20:50,784 --> 04:20:52,090
and it again combines
6121
04:20:52,090 --> 04:20:55,662
or aggregate those messages
and produces a single message.
6122
04:20:55,662 --> 04:20:58,146
Now, you can think
of the merge message
6123
04:20:58,146 --> 04:21:00,500
as the reduce function
in MapReduce. Now,
6124
04:21:00,500 --> 04:21:05,100
the aggregate messages operator
returns a Vertex rdd.
6125
04:21:05,100 --> 04:21:08,128
Basically, it contains
the aggregated messages at each
6126
04:21:08,128 --> 04:21:09,657
of the destination vertices,
6127
04:21:09,657 --> 04:21:10,600
and vertices
6128
04:21:10,600 --> 04:21:13,815
that did not receive
a message are not included
6129
04:21:13,815 --> 04:21:15,693
in the returned vertex rdd.
6130
04:21:15,693 --> 04:21:17,028
So only those vertex
6131
04:21:17,028 --> 04:21:20,500
are returned which actually
have received the message
6132
04:21:20,500 --> 04:21:22,956
and then those messages
have been merged.
6133
04:21:22,956 --> 04:21:25,250
Any vertex
which hasn't received
6134
04:21:25,250 --> 04:21:28,437
the message will not be included
in the returned rdd
6135
04:21:28,437 --> 04:21:31,500
or you can say a return
vertex rdd now in addition
6136
04:21:31,500 --> 04:21:34,000
as you can see we have
a tripletFields argument.
6137
04:21:34,000 --> 04:21:37,519
So aggregate messages takes
an optional triplet fields,
6138
04:21:37,519 --> 04:21:39,400
which indicates what data is
6139
04:21:39,400 --> 04:21:41,304
accessed in the edge context.
6140
04:21:41,304 --> 04:21:42,752
So the possible options
6141
04:21:42,752 --> 04:21:45,900
for the triplet fields
are defined in TripletFields.
6142
04:21:45,900 --> 04:21:48,600
The default value
of tripletFields is Triplet
6143
04:21:48,600 --> 04:21:52,300
Fields.All. As you can see over
here this basically indicates
6144
04:21:52,300 --> 04:21:55,600
that user-defined send
message function May access
6145
04:21:55,600 --> 04:21:58,074
any of the fields
in the edge context.
6146
04:21:58,074 --> 04:22:01,982
So this triplet field argument
can be used to notify GraphX
6147
04:22:01,982 --> 04:22:05,549
that only this part of
the edge context will be needed
6148
04:22:05,549 --> 04:22:09,491
which basically allows GraphX
to select an optimized joining
6149
04:22:09,491 --> 04:22:10,700
strategy. So I hope
6150
04:22:10,700 --> 04:22:13,500
that you guys are clear
with the aggregate messages.
6151
04:22:13,500 --> 04:22:16,794
Let's quickly move ahead
and look at the second operator.
6152
04:22:16,794 --> 04:22:20,019
So the second operator is
the mapReduceTriplets transition.
6153
04:22:20,019 --> 04:22:21,400
Now in earlier versions
6154
04:22:21,400 --> 04:22:24,700
of GraphX, neighborhood
aggregation was accomplished
6155
04:22:24,700 --> 04:22:27,272
using the mapreduce
triplets operator.
6156
04:22:27,272 --> 04:22:29,802
This mapreduce triplet
operator is used
6157
04:22:29,802 --> 04:22:31,814
in older versions of GraphX.
6158
04:22:31,814 --> 04:22:35,100
This operator takes
the user-defined map function,
6159
04:22:35,100 --> 04:22:38,900
which is applied to each triplet
and can yield messages
6160
04:22:38,900 --> 04:22:42,300
which are aggregated using the
user-defined reduce functions.
6161
04:22:42,300 --> 04:22:44,300
This one is the user-
defined map function.
6162
04:22:44,300 --> 04:22:46,600
And this one is your user
defined reduce function.
6163
04:22:46,600 --> 04:22:49,081
So it basically applies
the map function
6164
04:22:49,081 --> 04:22:50,305
to all the triplets
6165
04:22:50,305 --> 04:22:53,654
and then it aggregates
those messages using this user
6166
04:22:53,654 --> 04:22:55,171
defined reduce function.
6167
04:22:55,171 --> 04:22:58,900
Now the newer version of this
mapReduceTriplets operator
6168
04:22:58,900 --> 04:23:01,770
is the aggregate messages
now moving ahead.
6169
04:23:01,770 --> 04:23:04,900
Let's talk about Computing
degree information operator.
6170
04:23:04,900 --> 04:23:07,900
So one of the common
aggregation tasks is computing
6171
04:23:07,900 --> 04:23:09,579
the degree of each vertex.
6172
04:23:09,579 --> 04:23:12,842
That is the number of edges
adjacent to each vertex.
6173
04:23:12,842 --> 04:23:15,072
Now in the context
of directed graph.
6174
04:23:15,072 --> 04:23:18,400
It is often necessary to know
the in degree out degree.
6175
04:23:18,400 --> 04:23:20,300
Then the total degree of vertex.
6176
04:23:20,300 --> 04:23:22,800
These kind of things are
pretty much important
6177
04:23:22,800 --> 04:23:25,389
and the graph Ops class
contain a collection
6178
04:23:25,389 --> 04:23:28,400
of operators to compute
the degrees of each vertex.
6179
04:23:28,500 --> 04:23:29,800
So as you can see,
6180
04:23:29,800 --> 04:23:33,100
we have maximum in-degree,
then maximum out-degree,
6181
04:23:33,100 --> 04:23:36,100
then maximum degrees.
maxInDegree will tell
6182
04:23:36,100 --> 04:23:39,400
us the number of maximum
incoming edges, then maxOut
6183
04:23:39,400 --> 04:23:42,325
Degree will tell us the
maximum number of outgoing edges
6184
04:23:42,325 --> 04:23:43,510
and this maxDegrees
6185
04:23:43,510 --> 04:23:46,685
will actually tell us the number
of incoming as well as
6186
04:23:46,685 --> 04:23:49,572
outgoing edges, as in
the sketch below.
6187
04:23:49,572 --> 04:23:52,300
Now moving ahead to the next operator,
that is collecting neighbors. In some cases,
6188
04:23:52,300 --> 04:23:54,182
it may be easier to express
6189
04:23:54,182 --> 04:23:57,600
the computation by collecting
neighboring vertices
6190
04:23:57,600 --> 04:24:00,000
and their attribute
at each vertex.
6191
04:24:00,000 --> 04:24:02,624
Now, this can be easily
accomplished using
6192
04:24:02,624 --> 04:24:06,400
the collectNeighborIds and
the collectNeighbors operators.
6193
04:24:06,400 --> 04:24:09,600
So basically your collect
neighbor ID takes
6194
04:24:09,600 --> 04:24:12,200
The Edge direction
as the parameter
6195
04:24:12,300 --> 04:24:14,400
and it returns a Vertex rdd
6196
04:24:14,400 --> 04:24:17,400
that contains the array
of vertex ID
6197
04:24:17,500 --> 04:24:20,000
that is neighboring
to the particular vertex
6198
04:24:20,000 --> 04:24:23,400
now similarly the collect
neighbors again takes
6199
04:24:23,400 --> 04:24:25,717
the edge directions as the input
6200
04:24:25,717 --> 04:24:28,000
and it will return you the array
6201
04:24:28,000 --> 04:24:31,600
with both the vertex ID
and the vertex attribute.
6202
04:24:31,600 --> 04:24:32,717
Now, let me quickly open
6203
04:24:32,717 --> 04:24:35,700
my VM and let us go through
the spark directory first.
6204
04:24:35,900 --> 04:24:38,600
Let me first open
my terminal so first
6205
04:24:38,600 --> 04:24:41,800
I'll start the Hadoop daemons, so
for that I will go
6206
04:24:41,800 --> 04:24:46,358
to the Hadoop home directory
and inside sbin run the start-
6207
04:24:46,358 --> 04:24:48,282
all.sh script file.
6208
04:24:52,000 --> 04:24:53,400
So let me check
6209
04:24:53,400 --> 04:24:55,700
if the Hadoop daemons
are running or not.
6210
04:24:58,700 --> 04:25:00,706
So as you can see the NameNode,
6211
04:25:00,706 --> 04:25:03,000
then DataNode,
secondary NameNode,
6212
04:25:03,000 --> 04:25:05,848
the NodeManager
and ResourceManager.
6213
04:25:05,848 --> 04:25:08,400
All the daemons
of Hadoop are up. Now
6214
04:25:08,400 --> 04:25:10,661
I will navigate to spark home.
6215
04:25:10,661 --> 04:25:13,300
Let me first start
the Spark daemons.
6216
04:25:17,600 --> 04:25:19,700
I see Spark daemons are running,
6217
04:25:19,700 --> 04:25:24,000
so I'll first minimize this and let
me take you to the Spark home.
6218
04:25:24,900 --> 04:25:27,309
And this is my Spark directory.
6219
04:25:27,309 --> 04:25:28,712
I'll go inside now.
6220
04:25:28,712 --> 04:25:30,926
Let me first show you the data
6221
04:25:30,926 --> 04:25:34,100
which is by default present
with your spark.
6222
04:25:34,400 --> 04:25:36,700
So we'll open this in a new tab.
6223
04:25:36,700 --> 04:25:38,865
So you can see
we have two files
6224
04:25:38,865 --> 04:25:41,100
in this graphx data directory.
6225
04:25:41,100 --> 04:25:44,638
Meanwhile, let me take you
to the example code.
6226
04:25:44,638 --> 04:25:48,900
So this is examples,
and inside src, main, scala.
6227
04:25:49,600 --> 04:25:50,500
You can find
6228
04:25:50,500 --> 04:25:54,700
the graphx directory and
inside this graphx directory
6229
04:25:54,700 --> 04:25:59,000
you can see some of the sample codes
which are present over here.
6230
04:25:59,000 --> 04:26:01,692
So I will take you
to this aggregate
6231
04:26:01,692 --> 04:26:05,100
messages example dot
scala file. Now meanwhile,
6232
04:26:05,100 --> 04:26:07,287
let me open the data as well.
6233
04:26:07,287 --> 04:26:09,700
So you'll be able to understand.
6234
04:26:10,500 --> 04:26:12,967
Now this is
followers dot txt file.
6235
04:26:12,967 --> 04:26:15,000
So basically you can imagine
6236
04:26:15,000 --> 04:26:18,545
these are the edges which
are connecting the vertices.
6237
04:26:18,545 --> 04:26:21,580
So this is vertex 2
and this is vertex 1 then
6238
04:26:21,580 --> 04:26:25,100
this is Vertex 4 and this
is vertex 1 and similarly.
6239
04:26:25,100 --> 04:26:28,400
so on. These are representing
those vertices and
6240
04:26:28,400 --> 04:26:30,900
if you can remember I
have already told you
6241
04:26:30,900 --> 04:26:33,200
that inside graph loader class.
6242
04:26:33,200 --> 04:26:35,818
There is a function
called Edge list file
6243
04:26:35,818 --> 04:26:37,200
which takes the edges
6244
04:26:37,200 --> 04:26:40,500
from a file and then it
constructs the graph based on
6245
04:26:40,500 --> 04:26:43,800
that. Now second, you
have this users dot txt.
6246
04:26:43,800 --> 04:26:47,550
So these are basically the vertices
with the vertex ID.
6247
04:26:47,550 --> 04:26:51,200
So vertex ID for this vertex
is 1 then for this is 2
6248
04:26:51,200 --> 04:26:53,539
and so on and then
this is the data
6249
04:26:53,539 --> 04:26:57,600
which is attached, or you can say
the attribute of the vertices.
6250
04:26:57,600 --> 04:26:59,800
So these are the vertex ID
6251
04:26:59,958 --> 04:27:03,700
which is 1 2 3 respectively
and this is the data
6252
04:27:03,700 --> 04:27:06,800
which is associated
with each of your vertices.
6253
04:27:06,800 --> 04:27:10,500
So this is username and this
might be the name of your
6254
04:27:10,500 --> 04:27:13,100
user, and so on. Now
you can also see
6255
04:27:13,100 --> 04:27:16,900
that in some of the cases
the name of the user is missing.
6256
04:27:16,900 --> 04:27:18,800
So as in this case the name
6257
04:27:18,800 --> 04:27:22,100
of the user is missing
these are the vertices
6258
04:27:22,100 --> 04:27:26,300
or you can see the vertex ID
and vertex attributes.
6259
04:27:26,600 --> 04:27:30,500
Now, let me take you through
this aggregate messages example,
6260
04:27:30,600 --> 04:27:32,400
so as you can see,
we are giving the name
6261
04:27:32,400 --> 04:27:36,100
of the package, org.apache.
spark.examples.graphx,
6262
04:27:36,300 --> 04:27:40,306
then we are importing GraphX;
in that we import the
6263
04:27:40,306 --> 04:27:41,764
Graph class as well as
6264
04:27:41,764 --> 04:27:45,700
this VertexRDD. Next we
are using this graph generator.
6265
04:27:45,700 --> 04:27:48,500
I'll tell you why we
are using this graph generator
6266
04:27:48,700 --> 04:27:52,400
and then we are using
the spark session over here.
6267
04:27:52,400 --> 04:27:54,105
So this is an example
6268
04:27:54,163 --> 04:27:58,778
where we are using the aggregate
messages operator to compute
6269
04:27:58,778 --> 04:28:03,163
the average age of the more
senior followers of each user.
6270
04:28:03,200 --> 04:28:03,700
Okay.
6271
04:28:03,928 --> 04:28:06,929
So this is the object
of aggregate messages example.
6272
04:28:07,000 --> 04:28:10,000
Now, this is the main function
where we are first
6273
04:28:10,100 --> 04:28:13,600
initializing the Spark session, then
the name of the application.
6274
04:28:13,600 --> 04:28:16,400
So you have to provide the name
of the application
6275
04:28:16,400 --> 04:28:17,400
and this is get
6276
04:28:17,400 --> 04:28:20,600
or create method now
next you are initializing
6277
04:28:20,600 --> 04:28:24,338
the spark context as SC
now coming to the code.
6278
04:28:24,400 --> 04:28:27,400
So we are specifying
a graph then this graph
6279
04:28:27,400 --> 04:28:30,300
is containing Double and Int. Now,
6280
04:28:30,400 --> 04:28:33,200
I just told you that we
are importing graph generators.
6281
04:28:33,200 --> 04:28:35,023
So this graph generator is
6282
04:28:35,023 --> 04:28:37,900
used to generate a random
graph for simplicity.
6283
04:28:37,900 --> 04:28:40,400
So you would have multiple
number of edges and vertices.
6284
04:28:40,400 --> 04:28:43,047
Then you are using
this log normal graph.
6285
04:28:43,047 --> 04:28:44,900
You're passing the spark context
6286
04:28:44,900 --> 04:28:47,677
and you're specifying the number
of vertices as hundred.
6287
04:28:47,677 --> 04:28:49,956
So it will generate
hundred vertices for you.
6288
04:28:49,956 --> 04:28:51,200
Then what you are doing.
6289
04:28:51,200 --> 04:28:53,400
You are specifying
the map vertices
6290
04:28:53,400 --> 04:28:56,815
and you're trying
to map ID to double so
6291
04:28:56,815 --> 04:28:58,200
what this would do
6292
04:28:58,200 --> 04:29:02,100
this will basically map
your ID to double then
6293
04:29:02,100 --> 04:29:05,700
in the next line we are trying
to calculate the older followers
6294
04:29:05,700 --> 04:29:08,300
where you have typed
it as a VertexRDD
6295
04:29:08,300 --> 04:29:10,494
of (Int, Double), and also,
6296
04:29:10,494 --> 04:29:13,900
your VertexRDD
has Long as your vertex ID
6297
04:29:13,900 --> 04:29:15,200
and your data is double
6298
04:29:15,200 --> 04:29:17,533
which is associated
with each of the vertex
6299
04:29:17,533 --> 04:29:19,604
or you can say
the vertex attribute.
6300
04:29:19,604 --> 04:29:20,900
So you have this graph
6301
04:29:20,900 --> 04:29:23,178
which is basically
generated randomly
6302
04:29:23,178 --> 04:29:26,189
and then you are performing
aggregate messages.
6303
04:29:26,189 --> 04:29:29,200
So this is the aggregate
messages operator now,
6304
04:29:29,200 --> 04:29:33,353
if you can remember we first
have the send messages, right?
6305
04:29:33,353 --> 04:29:35,000
So inside this triplet,
6306
04:29:35,000 --> 04:29:38,620
we are specifying a function
that if the source attribute
6307
04:29:38,620 --> 04:29:40,100
of the triplet is more than the
6308
04:29:40,100 --> 04:29:42,300
destination attribute
of the triplet.
6309
04:29:42,300 --> 04:29:43,900
So basically it will return
6310
04:29:43,900 --> 04:29:47,144
if the followers age
is greater than the age
6311
04:29:47,144 --> 04:29:48,452
of person whom he
6312
04:29:48,452 --> 04:29:52,259
is following. This tells if
the follower's age is greater
6313
04:29:52,259 --> 04:29:55,000
than the age of whom
he is following.
6314
04:29:55,000 --> 04:29:56,462
So in that situation,
6315
04:29:56,462 --> 04:29:59,200
it will send message
to the destination
6316
04:29:59,200 --> 04:30:01,400
with vertex containing counter
6317
04:30:01,400 --> 04:30:05,000
that is 1 and the age
of the source attribute
6318
04:30:05,000 --> 04:30:07,700
that is the age
of the follower so first
6319
04:30:07,700 --> 04:30:10,800
so you can see the age
of the destination node is less
6320
04:30:10,800 --> 04:30:12,807
than the age
of source attribute.
6321
04:30:12,807 --> 04:30:14,000
So it will tell you
6322
04:30:14,000 --> 04:30:17,293
if the follower is older
than the user or not.
6323
04:30:17,293 --> 04:30:21,100
So in that situation will send
one to the destination
6324
04:30:21,100 --> 04:30:23,900
and we'll send the age
of the source
6325
04:30:23,900 --> 04:30:26,900
or you can say the age
of the follower. Then second,
6326
04:30:26,900 --> 04:30:29,400
I have told you
that we have merged messages.
6327
04:30:29,500 --> 04:30:32,500
So here we are adding
the counter and the age
6328
04:30:32,600 --> 04:30:33,800
in this reduce function.
6329
04:30:33,900 --> 04:30:37,515
So now what we are doing we
are dividing the total age
6330
04:30:37,515 --> 04:30:38,421
by the number
6331
04:30:38,421 --> 04:30:41,439
of older followers
to derive an average age
6332
04:30:41,439 --> 04:30:42,700
of older followers.
6333
04:30:42,700 --> 04:30:45,400
So this is the reason why
we have passed the attribute
6334
04:30:45,400 --> 04:30:47,200
of source vertex firstly
6335
04:30:47,200 --> 04:30:49,300
if we are specifying
this variable that is
6336
04:30:49,300 --> 04:30:51,194
average age of older followers.
6337
04:30:51,194 --> 04:30:53,700
And then we are specifying
the vertex rdd.
6338
04:30:53,888 --> 04:30:58,211
So this will be double
and then this older followers
6339
04:30:58,292 --> 04:30:59,600
that is the graph
6340
04:30:59,600 --> 04:31:02,349
which we are picking up
from here and then we
6341
04:31:02,349 --> 04:31:04,100
are trying to map the value.
6342
04:31:04,100 --> 04:31:05,400
So in the vertex,
6343
04:31:05,400 --> 04:31:10,100
we have ID and we have value so
in this situation We
6344
04:31:10,100 --> 04:31:13,600
are using this case class
about count and total age.
6345
04:31:13,600 --> 04:31:16,000
So what we are doing we
are taking this total age
6346
04:31:16,000 --> 04:31:19,246
and we are dividing it by count
which we have gathered from this
6347
04:31:19,246 --> 04:31:20,011
send message.
6348
04:31:20,011 --> 04:31:22,800
And then we have aggregated
using this reduce function.
6349
04:31:22,800 --> 04:31:26,400
We are again taking the total
age of the older followers.
6350
04:31:26,400 --> 04:31:28,994
And then we are trying
to divide it by count
6351
04:31:28,994 --> 04:31:30,377
to get the average age
6352
04:31:30,377 --> 04:31:33,900
when at last we are trying
to display the result and then
6353
04:31:33,900 --> 04:31:35,600
we are stopping the Spark session.
6354
04:31:35,600 --> 04:31:38,385
So let me quickly open
the terminal so I
6355
04:31:38,385 --> 04:31:39,742
will go to examples
6356
04:31:39,742 --> 04:31:43,600
so cd examples. I took you
through the source directory
6357
04:31:43,600 --> 04:31:46,400
where the code is
present inside scala.
6358
04:31:46,400 --> 04:31:49,154
And then inside there
is a spark directory
6359
04:31:49,154 --> 04:31:51,975
where you will find
the code but to execute
6360
04:31:51,975 --> 04:31:55,200
the example you need to go
to the jars directory.
6361
04:31:56,100 --> 04:31:58,392
Now, this is
the scala examples jar
6362
04:31:58,392 --> 04:32:00,200
which you need to execute.
6363
04:32:00,200 --> 04:32:03,100
But before this,
let me take you to the hdfs.
6364
04:32:03,400 --> 04:32:05,600
So the URL is localhost
6365
04:32:05,600 --> 04:32:07,400
colon 50070.
6366
04:32:08,500 --> 04:32:10,800
And we'll go to utilities then
6367
04:32:10,800 --> 04:32:12,800
we'll go to browse
the file system.
6368
04:32:13,000 --> 04:32:14,137
So as you can see,
6369
04:32:14,137 --> 04:32:16,849
I have created a user
directory in which I
6370
04:32:16,849 --> 04:32:18,700
have specified the username.
6371
04:32:18,700 --> 04:32:22,000
That is edureka,
and inside edureka,
6372
04:32:22,000 --> 04:32:24,200
I have placed my data directory
6373
04:32:24,200 --> 04:32:27,500
where we have this graphx
and inside the graphx.
6374
04:32:27,500 --> 04:32:30,100
We have both the files,
that is followers dot txt
6375
04:32:30,100 --> 04:32:31,600
and users dot txt.
6376
04:32:31,600 --> 04:32:32,854
So in this program,
6377
04:32:32,854 --> 04:32:35,100
we are not referring
to these files
6378
04:32:35,100 --> 04:32:38,500
but incoming examples will
be referring to these files.
6379
04:32:38,500 --> 04:32:42,700
So I would request you to first
move it to this hdfs directory.
6380
04:32:42,700 --> 04:32:46,800
So that spark can refer
the files in data graphx.
6381
04:32:47,000 --> 04:32:50,300
Now, let me quickly minimize
this and the command
6382
04:32:50,300 --> 04:32:53,000
to execute is spark-
6383
04:32:53,000 --> 04:32:56,900
submit, and then I'll pass
this jars parameter
6384
04:32:56,900 --> 04:32:59,900
and I'll provide
the spark examples jar.
6385
04:33:01,200 --> 04:33:05,100
So this is the jar then
I'll specify the class name.
6386
04:33:05,100 --> 04:33:06,900
So to get the class name.
6387
04:33:06,900 --> 04:33:08,900
I will go to the code.
6388
04:33:09,200 --> 04:33:12,000
I'll first take
the package name from here.
6389
04:33:12,700 --> 04:33:14,100
And then I'll take
6390
04:33:14,100 --> 04:33:17,935
the class name which is
aggregate messages example,
6391
04:33:17,935 --> 04:33:19,400
so this is my class.
6392
04:33:19,400 --> 04:33:21,928
And as I told you have
to provide the name
6393
04:33:21,928 --> 04:33:23,100
of the application.
6394
04:33:23,100 --> 04:33:26,600
So let me keep it as example
and I'll hit enter.
6395
04:33:31,946 --> 04:33:34,253
So now you can see the result.
6396
04:33:36,000 --> 04:33:37,700
So this is the followers
6397
04:33:37,700 --> 04:33:40,500
and this is the average
age of followers.
6398
04:33:40,500 --> 04:33:41,827
So it is 34. Then
6399
04:33:41,827 --> 04:33:45,038
We have 52 which is
the count of followers.
6400
04:33:45,038 --> 04:33:48,500
And the average age is
seventy six point eight.
6401
04:33:48,500 --> 04:33:51,100
Then this one has
96 senior followers.
6402
04:33:51,100 --> 04:33:52,900
And then the average age
6403
04:33:52,900 --> 04:33:56,000
of the followers is
ninety nine point zero,
6404
04:33:56,100 --> 04:33:58,600
then it has
four senior followers
6405
04:33:58,600 --> 04:34:00,520
and the average age is 51.
6406
04:34:00,520 --> 04:34:03,400
Then this vertex has
16 senior followers
6407
04:34:03,400 --> 04:34:06,003
with the average age
of 57 point five
6408
04:34:06,003 --> 04:34:09,024
five, and so on. You can see
the result over here.
6409
04:34:09,024 --> 04:34:12,800
So I hope now you guys are clear
with aggregate messages
6410
04:34:12,800 --> 04:34:14,748
how to use aggregate messages
6411
04:34:14,748 --> 04:34:17,100
how to specify
the send message then
6412
04:34:17,100 --> 04:34:19,200
how to write the merge message.
6413
04:34:19,200 --> 04:34:21,788
So let's quickly go back
to the presentation.
6414
04:34:21,788 --> 04:34:23,500
Now, let us quickly move ahead
6415
04:34:23,500 --> 04:34:26,014
and look at some
of the graph algorithms.
6416
04:34:26,014 --> 04:34:27,959
So the first one is Page rank.
6417
04:34:27,959 --> 04:34:31,200
So page rank measures
the importance of each vertex
6418
04:34:31,200 --> 04:34:32,706
in a graph assuming
6419
04:34:32,800 --> 04:34:35,900
that an edge from U
to V represents
6420
04:34:36,000 --> 04:34:37,453
a recommendation
6421
04:34:37,453 --> 04:34:41,300
or support of V's importance
by U. For an example,
6422
04:34:41,300 --> 04:34:45,468
Let's say if a Twitter user
is followed by many others, that user
6423
04:34:45,468 --> 04:34:48,200
will obviously rank
high. GraphX comes
6424
04:34:48,200 --> 04:34:51,919
with the static and dynamic
implementations of PageRank as
6425
04:34:51,919 --> 04:34:53,780
methods on page rank object
6426
04:34:53,780 --> 04:34:57,500
and static page rank runs
a fixed number of iterations,
6427
04:34:57,500 --> 04:35:02,200
which can be specified by you
while the dynamic page rank runs
6428
04:35:02,200 --> 04:35:04,100
until the ranks converge
6429
04:35:04,500 --> 04:35:08,300
what we mean by that is
it stops changing by more
6430
04:35:08,300 --> 04:35:10,400
than a specified tolerance.
6431
04:35:10,500 --> 04:35:11,300
So it runs
6432
04:35:11,300 --> 04:35:14,500
until it has optimized
the page rank of each
6433
04:35:14,500 --> 04:35:19,400
of the vertices. Now GraphOps class
allows calling these algorithms
6434
04:35:19,400 --> 04:35:22,100
directly as methods
on the Graph class.
6435
04:35:22,200 --> 04:35:24,800
Now, let's quickly go
back to the VM.
6436
04:35:25,000 --> 04:35:27,469
So this is the pagerank example.
6437
04:35:27,469 --> 04:35:29,161
Let me open this file.
6438
04:35:29,600 --> 04:35:32,595
So first we are specifying
this graphx package,
6439
04:35:32,595 --> 04:35:35,065
then we are importing
the graph loader.
6440
04:35:35,065 --> 04:35:37,600
So as you can Remember
inside this graph
6441
04:35:37,600 --> 04:35:41,000
loader class we have
that edge list file operator,
6442
04:35:41,000 --> 04:35:43,600
which will basically create
the graph using the edges
6443
04:35:43,600 --> 04:35:46,575
and we have those edges
in our followers
6444
04:35:46,575 --> 04:35:50,542
dot txt file now coming back
to pagerank example now,
6445
04:35:50,542 --> 04:35:53,900
we're importing the spark
SQL SparkSession.
6446
04:35:54,100 --> 04:35:56,619
Now, this is Page
rank example object
6447
04:35:56,619 --> 04:35:59,700
and inside which we
have created a main function
6448
04:35:59,700 --> 04:36:04,000
and we have similarly created
the Spark session with builder
6449
04:36:04,000 --> 04:36:05,600
and we're specifying
the app name
6450
04:36:05,600 --> 04:36:09,800
which is to be provided, then
we have the getOrCreate method.
6451
04:36:09,800 --> 04:36:10,415
So this is
6452
04:36:10,415 --> 04:36:12,800
where we are initializing
the spark context
6453
04:36:12,800 --> 04:36:13,800
as you can remember.
6454
04:36:13,800 --> 04:36:16,900
I told you that using
this Edge list file method.
6455
04:36:16,900 --> 04:36:19,115
We are basically
creating the graph
6456
04:36:19,115 --> 04:36:21,200
from the followers dot txt file.
6457
04:36:21,200 --> 04:36:24,223
Now, we are running
the page rank over here.
6458
04:36:24,223 --> 04:36:28,421
So in ranks it will give you all
the page rank of the vertices
6459
04:36:28,421 --> 04:36:30,104
that is inside this graph
6460
04:36:30,104 --> 04:36:33,400
which we have just
produced using graph loader class.
6461
04:36:33,400 --> 04:36:36,575
So if you're passing
an integer as an argument
6462
04:36:36,575 --> 04:36:37,700
to the page rank,
6463
04:36:37,700 --> 04:36:40,018
it will run
that number of iterations.
6464
04:36:40,018 --> 04:36:43,000
Otherwise, if you're
passing a double value,
6465
04:36:43,000 --> 04:36:45,495
it will run
until the convergence.
6466
04:36:45,495 --> 04:36:48,400
So we are running
page rank on this graph
6467
04:36:48,400 --> 04:36:50,861
and we have passed the vertices.
6468
04:36:50,900 --> 04:36:55,300
Now after this we are trying
to load the users dot txt file
6469
04:36:55,500 --> 04:36:58,400
and then we are trying to split
6470
04:36:58,400 --> 04:37:02,400
the line by comma, then
cast the field zero to Long
6471
04:37:02,400 --> 04:37:04,571
and we are storing
the field one.
6472
04:37:04,571 --> 04:37:06,200
So basically field zero.
6473
04:37:06,300 --> 04:37:09,376
in your users txt is
your vertex ID, or you
6474
04:37:09,376 --> 04:37:13,790
can see the ID of the user
and field one is your username.
6475
04:37:13,790 --> 04:37:17,252
So we are trying to load
these two Fields now.
6476
04:37:17,280 --> 04:37:19,819
We are trying
to rank by username.
6477
04:37:19,969 --> 04:37:24,600
So we are taking the users
and we are joining the ranks.
6478
04:37:24,600 --> 04:37:28,000
So this is where we
are using the join operation.
6479
04:37:28,000 --> 04:37:29,670
So ranks by username.
6480
04:37:29,670 --> 04:37:32,562
We are trying to
attach those username
6481
04:37:32,562 --> 04:37:35,793
or put those username
with the page rank value.
6482
04:37:35,793 --> 04:37:37,641
So we are taking the users
6483
04:37:37,641 --> 04:37:40,554
then we are joining
the ranks it is again,
6484
04:37:40,554 --> 04:37:42,900
we are getting
from this page Rank
6485
04:37:43,300 --> 04:37:47,700
and then we are mapping
the ID user name and rank.
6486
04:37:56,500 --> 04:38:00,517
So it will take some time, run
some iterations over the graph
6487
04:38:00,517 --> 04:38:02,600
and will try to converge it.
6488
04:38:08,000 --> 04:38:11,700
So after converging you
can see the user and the rank.
6489
04:38:11,700 --> 04:38:14,300
So the maximum rank is
with Barack Obama,
6490
04:38:14,300 --> 04:38:18,000
which is 1.45 then
with Lady Gaga.
6491
04:38:18,100 --> 04:38:22,200
It's 1.39 and then with
odersky and so on.
6492
04:38:22,261 --> 04:38:24,338
Let's go back to the slide.
6493
04:38:25,200 --> 04:38:27,000
So now after page rank,
6494
04:38:27,200 --> 04:38:28,856
let's quickly move ahead
6495
04:38:28,856 --> 04:38:32,200
to Connected components
the connected components
6496
04:38:32,200 --> 04:38:34,923
algorithm labels each
connected component
6497
04:38:34,923 --> 04:38:38,600
of the graph with the ID
of its lowest numbered vertex.
6498
04:38:38,600 --> 04:38:40,700
So let us quickly go
back to the VM.
6499
04:38:42,000 --> 04:38:45,200
Now let's go inside
the graphx directory
6500
04:38:45,200 --> 04:38:48,300
and now we'll open
this connected components example.
6501
04:38:48,400 --> 04:38:51,818
So again, it's the same: we
import the graph loader
6502
04:38:51,818 --> 04:38:53,100
and Spark session.
6503
04:38:53,300 --> 04:38:56,600
Now, this is the connected
components example object. Next,
6504
04:38:56,600 --> 04:39:00,176
this is the main function
and inside the main function.
6505
04:39:00,176 --> 04:39:01,800
We are again specifying all
6506
04:39:01,800 --> 04:39:04,500
those: the Spark session,
then app name,
6507
04:39:04,500 --> 04:39:06,389
then we have spark context.
6508
04:39:06,389 --> 04:39:07,509
So it's similar.
6509
04:39:07,509 --> 04:39:10,100
So again using
this graph loader class
6510
04:39:10,130 --> 04:39:11,669
and using this edge list
6511
04:39:11,900 --> 04:39:15,700
file method we are loading
the followers dot txt file.
6512
04:39:15,700 --> 04:39:16,733
Now in this graph.
6513
04:39:16,733 --> 04:39:19,706
We are using this connected
components algorithm.
6514
04:39:19,706 --> 04:39:23,300
And then we are trying to find
the connected components now
6515
04:39:23,300 --> 04:39:26,600
at last we are trying
to again load this user file
6516
04:39:26,600 --> 04:39:28,300
that is users Dot txt.
6517
04:39:28,500 --> 04:39:31,312
And we are trying to join
the connected components
6518
04:39:31,312 --> 04:39:34,387
with the username so over
here it is also the same thing
6519
04:39:34,387 --> 04:39:36,504
which we have discussed
in page rank,
6520
04:39:36,504 --> 04:39:38,000
which is taking the field 0
6521
04:39:38,000 --> 04:39:41,100
and field one
of your user dot txt file
6522
04:39:41,400 --> 04:39:45,100
and at last we
are joining this users
6523
04:39:45,100 --> 04:39:49,200
we are trying to join
these users to connected components,
6524
04:39:49,200 --> 04:39:50,584
that is from here.
6525
04:39:50,584 --> 04:39:50,882
Now.
6526
04:39:50,882 --> 04:39:54,008
We are printing the cc
by username with collect.
6527
04:39:54,008 --> 04:39:58,400
So let us quickly go ahead and
execute this example as well.
6528
04:39:58,600 --> 04:40:01,400
So let me first copy
this object name.
6529
04:40:03,800 --> 04:40:17,300
Let's name this
as example2. So
6530
04:40:17,300 --> 04:40:20,100
as you can see Justin Bieber has
one connected component,
6531
04:40:20,100 --> 04:40:23,300
then you can see this has
three connected component.
6532
04:40:23,300 --> 04:40:25,100
Then this has
one connected component
6533
04:40:25,100 --> 04:40:28,600
than Barack Obama has one
connected component and so on.
6534
04:40:28,600 --> 04:40:30,464
So this basically
gives you an idea
6535
04:40:30,464 --> 04:40:32,200
about the connected components.
6536
04:40:32,200 --> 04:40:33,900
Now, let's quickly move back
6537
04:40:33,900 --> 04:40:37,300
to the slide will discuss
about the third algorithm
6538
04:40:37,300 --> 04:40:39,100
that is triangle counting.
6539
04:40:39,100 --> 04:40:43,177
So basically a Vertex is a part
of a triangle when it has
6540
04:40:43,177 --> 04:40:46,900
two adjacent vertices
with an edge between them.
6541
04:40:46,900 --> 04:40:49,100
So it will form
a triangle, right?
6542
04:40:49,100 --> 04:40:52,313
And then that vertex
is a part of a triangle
6543
04:40:52,313 --> 04:40:56,092
now Graphics implements
a triangle counting algorithm
6544
04:40:56,092 --> 04:40:58,200
in the Triangle count object.
6545
04:40:58,200 --> 04:41:01,200
Now that determines the number
of triangles passing
6546
04:41:01,200 --> 04:41:04,600
through each vertex providing
a measure of clustering
6547
04:41:04,600 --> 04:41:07,400
so we can compute
the triangle count
6548
04:41:07,400 --> 04:41:09,875
of the social network data set
6549
04:41:09,875 --> 04:41:13,675
from the pagerank section.
One more thing to note is
6550
04:41:13,675 --> 04:41:16,598
that triangle count
requires the edges
6551
04:41:16,600 --> 04:41:18,800
to be in
a canonical orientation.
6552
04:41:18,800 --> 04:41:21,364
That is your Source ID
should always be less
6553
04:41:21,364 --> 04:41:22,868
than your destination ID
6554
04:41:22,868 --> 04:41:25,500
and the graph will be
partitioned using the Graph
6555
04:41:25,500 --> 04:41:27,318
dot partitionBy method. Now,
6556
04:41:27,318 --> 04:41:28,800
let's quickly go back.
6557
04:41:28,800 --> 04:41:32,000
So let me open
the graphx directory again,
6558
04:41:32,000 --> 04:41:35,200
and we'll see
the triangle counting example.
6559
04:41:36,500 --> 04:41:38,100
So again, it's the same
6560
04:41:38,100 --> 04:41:40,900
and the object is
triangle counting example,
6561
04:41:40,900 --> 04:41:43,400
then the main function
is same as well.
6562
04:41:43,400 --> 04:41:46,400
Now we are again using
this graph loader class
6563
04:41:46,400 --> 04:41:50,183
and we are loading
the followers dot txt
6564
04:41:50,183 --> 04:41:52,000
which contains the edges
6565
04:41:52,000 --> 04:41:53,000
as you can see here.
6566
04:41:53,000 --> 04:41:54,600
We are using this Partition
6567
04:41:54,600 --> 04:41:58,800
by argument and we are passing
the random vertex cut,
6568
04:41:58,800 --> 04:42:01,000
which is the partition strategy.
6569
04:42:01,000 --> 04:42:03,165
So this is how you can go ahead
6570
04:42:03,165 --> 04:42:06,100
and you can Implement
a partition strategy.
6571
04:42:06,123 --> 04:42:09,277
Here we are loading the edges
in canonical order
6572
04:42:09,400 --> 04:42:11,900
and partitioning the graph
for triangle count.
6573
04:42:11,900 --> 04:42:12,129
Now.
6574
04:42:12,129 --> 04:42:14,600
We are trying to find
out the triangle count
6575
04:42:14,600 --> 04:42:15,830
for each vertex.
6576
04:42:15,830 --> 04:42:18,000
So we have this tri count
6577
04:42:18,000 --> 04:42:22,600
variable and then we are using
this triangle count algorithm
6578
04:42:22,600 --> 04:42:25,074
and then we are
specifying the vertices
6579
04:42:25,074 --> 04:42:28,200
so it will execute
triangle count over this graph
6580
04:42:28,200 --> 04:42:31,900
which we have just loaded
from followers dot txt file.
6581
04:42:31,900 --> 04:42:35,074
And again, we are basically
joining usernames.
6582
04:42:35,074 --> 04:42:38,320
So first we are getting
the usernames again here.
6583
04:42:38,320 --> 04:42:42,600
We are performing the join
between users and tri counts.
6584
04:42:42,900 --> 04:42:45,300
So tri counts is from here.
6585
04:42:45,300 --> 04:42:48,806
And then we are again
printing the value from here.
6586
04:42:48,806 --> 04:42:50,700
So again, this is the same.
6587
04:42:50,700 --> 04:42:52,844
Let us quickly go
ahead and execute
6588
04:42:52,844 --> 04:42:54,800
this triangle counting example.
6589
04:42:54,800 --> 04:42:56,338
So let me copy this.
6590
04:42:56,500 --> 04:42:58,300
I'll go back to the terminal.
6591
04:42:58,400 --> 04:43:02,300
I'll name it as example
3 and change the class name.
6592
04:43:04,134 --> 04:43:05,365
And I hit enter.
6593
04:43:14,100 --> 04:43:16,900
So now you can see
the triangle associated
6594
04:43:16,900 --> 04:43:20,100
with justinbieber is 0, then
BarackObama is one,
6595
04:43:20,100 --> 04:43:21,600
with odersky it's one
6596
04:43:21,661 --> 04:43:23,200
and with jeresig,
6597
04:43:23,200 --> 04:43:24,100
it's one.
6598
04:43:24,300 --> 04:43:27,800
So for better understanding I
would recommend you to go ahead
6599
04:43:27,800 --> 04:43:30,136
and take this followers dot txt
6600
04:43:30,136 --> 04:43:33,000
And you can create
a graph by yourself.
6601
04:43:33,000 --> 04:43:36,227
And then you can attach
these users names with them
6602
04:43:36,227 --> 04:43:38,100
and then you will get an idea
6603
04:43:38,100 --> 04:43:41,700
about why it is giving
the number as 1 or 0.
6604
04:43:41,700 --> 04:43:44,065
So again the graph
which is connecting
6605
04:43:44,065 --> 04:43:45,000
1, 2 and 4
6606
04:43:45,000 --> 04:43:47,600
is disconnected and it
is not completing any triangles.
6607
04:43:47,600 --> 04:43:52,900
So the value of these 3 are 0
and next, the second graph
6608
04:43:52,900 --> 04:43:54,600
which is connecting
6609
04:43:54,600 --> 04:43:59,400
your vertices 3, 6 and 7
is completing one triangle.
6610
04:43:59,400 --> 04:44:01,323
So this is the reason why
6611
04:44:01,323 --> 04:44:05,300
these three vertices
have values one now.
6612
04:44:05,400 --> 04:44:06,952
Let me quickly go back.
6613
04:44:06,952 --> 04:44:07,875
So now I hope
6614
04:44:07,875 --> 04:44:11,000
that you guys are clear
with all the concepts
6615
04:44:11,000 --> 04:44:14,011
of graph operators
then graph algorithms.
6616
04:44:14,011 --> 04:44:17,400
So now is the right
time and let us look
6617
04:44:17,400 --> 04:44:19,200
at a Spark GraphX demo
6618
04:44:19,300 --> 04:44:20,838
where we'll go ahead
6619
04:44:20,838 --> 04:44:24,300
and we'll try to analyze
the Ford GoBike data.
6620
04:44:24,800 --> 04:44:27,800
So let me quickly go
back to my VM.
6621
04:44:28,000 --> 04:44:29,699
So let me first show
you the website
6622
04:44:29,699 --> 04:44:32,500
where you can go ahead and
download the Ford GoBike data.
6623
04:44:38,600 --> 04:44:40,350
So over here you can go
6624
04:44:40,350 --> 04:44:43,700
to download the Ford GoBike
trip history data.
6625
04:44:46,480 --> 04:44:51,019
So you can go ahead and download
this 2017 Ford GoBike trip data.
6626
04:44:51,100 --> 04:44:53,000
So I have already downloaded it.
6627
04:44:55,300 --> 04:44:56,696
So to avoid the typos,
6628
04:44:56,696 --> 04:44:59,300
I have already written
all the commands so
6629
04:44:59,300 --> 04:45:07,100
first let me go ahead and start
the Spark shell. So I'm inside
6630
04:45:07,100 --> 04:45:09,700
the Spark shell now.
6631
04:45:09,700 --> 04:45:13,300
Let me first import GraphX
and Spark RDD.
6632
04:45:15,800 --> 04:45:19,200
So I've successfully
imported GraphX and Spark RDD.
6633
04:45:20,180 --> 04:45:23,719
Now, let me create
a spark SQL context as well.
6634
04:45:25,100 --> 04:45:28,900
So I have successfully
created the Spark SQL context.
6635
04:45:28,900 --> 04:45:31,520
So this is basically
for running SQL queries
6636
04:45:31,520 --> 04:45:32,800
over the data frames.
6637
04:45:34,100 --> 04:45:37,176
Now, let me go ahead
and import the data.
6638
04:45:37,826 --> 04:45:40,673
So I'm loading the data
in data frame.
6639
04:45:40,800 --> 04:45:43,623
So the format of file is CSV,
6640
04:45:43,623 --> 04:45:46,853
then an option the header
is already added.
6641
04:45:46,853 --> 04:45:48,700
So that's why it's true.
6642
04:45:48,800 --> 04:45:51,600
Then it will automatically
infer this schema
6643
04:45:51,600 --> 04:45:53,332
and then in the load parameter,
6644
04:45:53,332 --> 04:45:55,400
I have specified
the path of the file.
6645
04:45:55,400 --> 04:45:57,100
So I'll quickly hit enter.
6646
04:45:59,100 --> 04:46:02,500
So the data is loaded
in the data frame to check.
6647
04:46:02,500 --> 04:46:07,000
I'll use d f dot count
so it will give me the count.
6648
04:46:09,900 --> 04:46:16,553
So you can see it has
519,700 rows now.
6649
04:46:16,553 --> 04:46:20,092
Let me quickly go back
and I'll print the schema.
6650
04:46:21,400 --> 04:46:25,010
So this is the schema
the duration in second,
6651
04:46:25,010 --> 04:46:27,625
then we have
the start time end time.
6652
04:46:27,625 --> 04:46:29,876
Then you have start station ID.
6653
04:46:29,876 --> 04:46:32,200
Then you have
start station name.
6654
04:46:32,300 --> 04:46:35,761
Then you have start
station latitude longitude
6655
04:46:35,761 --> 04:46:37,207
then end station ID
6656
04:46:37,207 --> 04:46:40,360
and station name then
end station latitude
6657
04:46:40,360 --> 04:46:42,007
and station longitude.
6658
04:46:42,007 --> 04:46:46,500
Then your bike ID user type then
the birth year of the member
6659
04:46:46,500 --> 04:46:48,650
and the gender
of the member now,
6660
04:46:48,650 --> 04:46:50,800
I'm trying to create
a data frame
6661
04:46:50,800 --> 04:46:52,306
that is just stations,
6662
04:46:52,306 --> 04:46:56,300
so it will only contain
the station ID and station name
6663
04:46:56,300 --> 04:46:58,607
which I'll be using as vertex.
6664
04:46:58,800 --> 04:47:02,000
So here I am trying
to create a data frame
6665
04:47:02,000 --> 04:47:03,500
with the name of just stations
6666
04:47:03,658 --> 04:47:07,120
where I am just selecting
the start station ID
6667
04:47:07,120 --> 04:47:09,600
and I'm casting it as float
6668
04:47:09,600 --> 04:47:12,400
and then I'm selecting
the start station name
6669
04:47:12,400 --> 04:47:15,400
and then I'm using
the distinct function to only
6670
04:47:15,400 --> 04:47:17,169
keep the unique values.
6671
04:47:17,169 --> 04:47:19,864
So I quickly go
ahead and hit enter.
6672
04:47:20,100 --> 04:47:21,600
So again, let me go
6673
04:47:21,600 --> 04:47:27,000
ahead and use this just stations
and I will print the schema.
6674
04:47:28,300 --> 04:47:31,531
So you can see
there is station ID,
6675
04:47:31,531 --> 04:47:34,000
and then there is
start station name.
6676
04:47:34,569 --> 04:47:36,800
It contains the unique values
6677
04:47:36,800 --> 04:47:40,600
of stations in this just
station data frame.
6678
04:47:40,800 --> 04:47:41,735
So now again,
6679
04:47:41,735 --> 04:47:44,900
I am taking this stations
where I'm selecting
6680
04:47:44,900 --> 04:47:47,971
the start station ID
and end station ID.
6681
04:47:47,971 --> 04:47:49,900
Then I am using the distinct
6682
04:47:49,900 --> 04:47:52,700
which will again give
me the unique values
6683
04:47:52,700 --> 04:47:54,600
and I'm using this flat map
6684
04:47:54,600 --> 04:47:56,200
where I am specifying
6685
04:47:56,200 --> 04:47:59,700
the iterables where we
are taking the x0
6686
04:47:59,700 --> 04:48:01,700
that is your start station ID,
6687
04:48:01,700 --> 04:48:04,405
and I am taking x 1
which is your end
6688
04:48:04,405 --> 04:48:05,700
station ID, and then again,
6689
04:48:05,700 --> 04:48:07,800
I'm applying this
distinct function
6690
04:48:07,800 --> 04:48:12,200
that it will keep only
the unique values and then
6691
04:48:12,400 --> 04:48:14,600
at last we have the toDF function
6692
04:48:14,600 --> 04:48:16,619
which will convert
it to data frame.
6693
04:48:16,619 --> 04:48:19,100
So let me quickly go ahead
and execute this.
6694
04:48:19,500 --> 04:48:21,376
So I am printing this schema.
6695
04:48:21,376 --> 04:48:23,576
So as you can see
it has one column
6696
04:48:23,576 --> 04:48:26,100
that is value and it
has data type long.
6697
04:48:26,100 --> 04:48:29,715
So I have taken all
the start and end station ID
6698
04:48:29,715 --> 04:48:31,561
and using this flat map.
6699
04:48:31,561 --> 04:48:34,200
I have iterated
over all the start
6700
04:48:34,200 --> 04:48:37,705
and end station IDs and then
using the distinct function
6701
04:48:37,705 --> 04:48:41,600
and taking the unique values
and converting it to data frames
6702
04:48:41,600 --> 04:48:44,800
so I can use the stations
and using the station.
6703
04:48:44,800 --> 04:48:49,000
I will basically keep each
of the stations in a Vertex.
6704
04:48:49,000 --> 04:48:52,500
So this is the reason why
I'm taking the stations
6705
04:48:52,500 --> 04:48:55,300
or you can say I am taking
the unique stations
6706
04:48:55,300 --> 04:48:58,107
from the start station ID
and station ID
6707
04:48:58,107 --> 04:48:59,691
so that I can go ahead
6708
04:48:59,691 --> 04:49:02,500
and I can define
vertex as the stations.
6709
04:49:03,100 --> 04:49:06,400
So now we are creating
our set of vertices
6710
04:49:06,400 --> 04:49:09,804
and attaching a bit
of metadata to each one of them
6711
04:49:09,804 --> 04:49:12,800
which in our case is
the name of the station.
6712
04:49:12,800 --> 04:49:16,035
So as you can see we are
creating this station vertices,
6713
04:49:16,035 --> 04:49:18,679
which is again an rdd
with vertex ID and string.
6714
04:49:18,679 --> 04:49:21,700
So we are using the stations
which we have just created.
6715
04:49:21,700 --> 04:49:24,500
We are joining it
with just stations
6716
04:49:24,500 --> 04:49:27,100
such that the station value
should be equal
6717
04:49:27,100 --> 04:49:29,300
to just station station ID.
6718
04:49:29,600 --> 04:49:32,400
So as we have created stations,
6719
04:49:32,400 --> 04:49:35,200
And just station
so we are joining it.
6720
04:49:36,600 --> 04:49:39,061
And then selecting
the station ID
6721
04:49:39,061 --> 04:49:43,000
and start station name
then we are mapping row 0.
6722
04:49:44,700 --> 04:49:48,600
And Row 1 so your row
0 will basically be
6723
04:49:48,600 --> 04:49:51,088
your vertex ID and Row
1 will be the string.
6724
04:49:51,088 --> 04:49:55,100
That is the name of your station.
So let me quickly go ahead
6725
04:49:55,100 --> 04:49:56,300
and execute this.
6726
04:49:57,357 --> 04:50:01,742
So let us quickly print this
using collect foreach println.
6727
04:50:19,500 --> 04:50:20,366
So over here,
6728
04:50:20,366 --> 04:50:23,900
we are basically attaching
the edges, or you can say we
6729
04:50:23,900 --> 04:50:27,500
are creating the trip edges
to all our individual rides
6730
04:50:27,500 --> 04:50:29,900
and then we'll get
the station values
6731
04:50:30,350 --> 04:50:33,350
and then we'll add
a dummy value of one.
6732
04:50:33,800 --> 04:50:34,900
So as you can see
6733
04:50:34,900 --> 04:50:37,200
that I am selecting
the start station and
6734
04:50:37,200 --> 04:50:38,600
and station from the DF
6735
04:50:38,600 --> 04:50:41,300
which is the first data frame
which we have loaded
6736
04:50:41,300 --> 04:50:46,200
and then I am mapping
it to row 0 + Row 1,
6737
04:50:46,400 --> 04:50:49,000
which is your source
and destination.
6738
04:50:49,100 --> 04:50:53,500
And then I'm attaching
a value one to each one of them.
6739
04:50:53,600 --> 04:50:55,000
So I'll hit enter.
6740
04:50:57,500 --> 04:51:00,900
Now, let me quickly go ahead
and print this station edges.
6741
04:51:07,500 --> 04:51:10,300
So just taking the source
ID of the vertex
6742
04:51:10,300 --> 04:51:12,182
and destination ID of the vertex
6743
04:51:12,182 --> 04:51:14,800
or you can say source station ID
and destination station ID,
6744
04:51:14,800 --> 04:51:17,900
and it is attaching value
one to each one of them.
6745
04:51:17,900 --> 04:51:20,700
So now you can go ahead
and build your graph.
6746
04:51:20,700 --> 04:51:23,854
But again as we discuss
that we need a default station
6747
04:51:23,854 --> 04:51:25,700
so you can have some situations
6748
04:51:25,700 --> 04:51:29,033
where your edges might be
indicating some vertices,
6749
04:51:29,033 --> 04:51:31,500
but those vertices
might not be present
6750
04:51:31,500 --> 04:51:33,107
in your vertex rdd.
6751
04:51:33,107 --> 04:51:34,764
So for that situation,
6752
04:51:34,764 --> 04:51:37,400
we need to create
a default station.
6753
04:51:37,400 --> 04:51:40,651
So I created a default station
as missing station.
6754
04:51:40,651 --> 04:51:42,100
So now we are all set.
6755
04:51:42,100 --> 04:51:44,400
We can go ahead
and create the graph.
6756
04:51:44,400 --> 04:51:46,700
So the name of the graph
is station graph.
6757
04:51:46,700 --> 04:51:49,000
Then the vertices
are station vertices
6758
04:51:49,000 --> 04:51:50,485
which we have created
6759
04:51:50,485 --> 04:51:54,247
which basically contains
the station ID and station name
6760
04:51:54,247 --> 04:51:56,300
and then we have station edges
6761
04:51:56,300 --> 04:51:58,600
and at last we
have default station.
6762
04:51:58,600 --> 04:52:01,500
So let me quickly go ahead
and execute this.
6763
04:52:03,100 --> 04:52:06,500
So now I need to cache this graph
for faster access.
6764
04:52:06,500 --> 04:52:08,700
So I'll use the cache function.
6765
04:52:09,500 --> 04:52:13,300
So let us quickly go ahead and
check the number of vertices.
6766
04:52:24,700 --> 04:52:28,600
So these are the number
of vertices again,
6767
04:52:28,900 --> 04:52:31,600
we can check the number
of edges as well.
6768
04:52:35,700 --> 04:52:37,300
So these are
the number of edges.
6769
04:52:38,405 --> 04:52:40,400
And to get a sanity check.
6770
04:52:40,400 --> 04:52:43,500
So let's go ahead
and check the number of records
6771
04:52:43,500 --> 04:52:45,500
that are present
in the data frame.
6772
04:52:48,000 --> 04:52:50,900
So as you can see
that the number of edges
6773
04:52:50,900 --> 04:52:55,100
in our graph and the count
in our data frame is similar,
6774
04:52:55,100 --> 04:52:56,900
or you can say the same.
6775
04:52:56,900 --> 04:53:00,702
So now let's go ahead and run
page rank on our data
6776
04:53:00,702 --> 04:53:04,200
so we can either run
a set number of iterations
6777
04:53:04,200 --> 04:53:06,700
or we can run it
until the convergence.
6778
04:53:06,700 --> 04:53:10,400
So in my case,
I'll run it till convergence.
6779
04:53:11,700 --> 04:53:15,000
So it's rank then
station graph then page rank.
6780
04:53:15,000 --> 04:53:17,133
So I have specified
the double value
6781
04:53:17,133 --> 04:53:21,000
so it will run till convergence,
so let's wait for some time.
6782
04:53:51,600 --> 04:53:55,400
So now that we have executed
the pagerank algorithm.
6783
04:53:55,700 --> 04:53:57,300
So we got the ranks
6784
04:53:57,300 --> 04:53:59,700
which are attached
to each vertices.
6785
04:54:00,100 --> 04:54:03,700
So now let us quickly go ahead
and look at the ranks.
6786
04:54:03,700 --> 04:54:06,601
So we are joining ranks
with station vertices
6787
04:54:06,601 --> 04:54:09,675
and then we are sorting it
in descending values
6788
04:54:09,675 --> 04:54:11,900
and we are taking
the first 10 rows
6789
04:54:11,900 --> 04:54:13,500
and then we are printing them.
6790
04:54:13,500 --> 04:54:16,700
So let's quickly go
ahead and hit enter.
6791
04:54:21,700 --> 04:54:26,000
So you can see these are
the top 10 stations which have
6792
04:54:26,000 --> 04:54:27,800
the most pagerank values
6793
04:54:27,800 --> 04:54:30,800
so you can say it has
more number of incoming trips.
6794
04:54:30,800 --> 04:54:32,270
Now one question would be
6795
04:54:32,270 --> 04:54:35,000
what are the most common
destinations in the data set
6796
04:54:35,000 --> 04:54:36,598
from location to location
6797
04:54:36,598 --> 04:54:40,500
so we can do this by performing
a grouping operator and adding
6798
04:54:40,500 --> 04:54:42,218
The Edge counts together.
6799
04:54:42,218 --> 04:54:46,000
So basically this will give
a new graph except each Edge
6800
04:54:46,000 --> 04:54:50,300
will now be the sum of all
the semantically same edges.
6801
04:54:51,500 --> 04:54:53,700
So again, we are taking
the station graph.
6802
04:54:53,700 --> 04:54:56,800
We are performing group
by edges with edge 1 and edge 2.
6803
04:54:56,800 --> 04:55:00,197
So we are basically
grouping edges edge 1 and edge 2.
6804
04:55:00,200 --> 04:55:01,629
So we are aggregating them.
6805
04:55:01,629 --> 04:55:03,100
Then we are using triplet
6806
04:55:03,100 --> 04:55:06,099
and then we are sorting them
in descending order again.
6807
04:55:06,099 --> 04:55:08,200
And then we are
printing the triplets
6808
04:55:08,200 --> 04:55:10,908
from The Source vertex
and the number of trips
6809
04:55:10,908 --> 04:55:13,864
and then we are taking
the destination attribute
6810
04:55:13,864 --> 04:55:15,500
or you can say destination
6811
04:55:15,500 --> 04:55:18,100
vertex, or you can say
destination station.
6812
04:55:26,526 --> 04:55:28,373
So you can see there are
6813
04:55:28,500 --> 04:55:32,300
1933 trips from San
Francisco Ferry Building
6814
04:55:32,300 --> 04:55:34,100
to the station then again,
6815
04:55:34,100 --> 04:55:36,700
you can see there are
fourteen hundred and eleven
6816
04:55:36,700 --> 04:55:39,900
trips from San Francisco
to this location.
6817
04:55:39,900 --> 04:55:42,200
Then there are 1,025 trips
6818
04:55:42,200 --> 04:55:45,300
from this station
to San Francisco
6819
04:55:45,500 --> 04:55:49,100
and it goes so on so now we
have got a directed graph
6820
04:55:49,100 --> 04:55:50,885
that means our
trips are directional
6821
04:55:50,885 --> 04:55:52,400
from one location to another
6822
04:55:52,600 --> 04:55:55,787
so now we can go ahead
and find the number of trips
6823
04:55:55,787 --> 04:55:57,725
that Went to a specific station
6824
04:55:57,725 --> 04:56:00,100
and then leave
from a specific station.
6825
04:56:00,100 --> 04:56:01,806
So basically we are trying
6826
04:56:01,806 --> 04:56:04,300
to find the inbound
and outbound values
6827
04:56:04,300 --> 04:56:07,829
or you can say we are trying
to find in degree and out degree
6828
04:56:07,829 --> 04:56:08,723
of the stations.
6829
04:56:08,723 --> 04:56:12,300
So let us first calculate the in
degrees using the station graph
6830
04:56:12,300 --> 04:56:14,364
and I am using
the inDegrees operator.
6831
04:56:14,364 --> 04:56:17,298
Then I'm joining it
with the station vertices
6832
04:56:17,298 --> 04:56:20,435
and then I'm sorting it again
in descending order
6833
04:56:20,435 --> 04:56:22,852
and then I'm taking
the top 10 values.
6834
04:56:22,852 --> 04:56:25,400
So let's quickly go
ahead and hit enter.
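A hedged sketch of that in-degree query (again assuming the identifiers from the narration):

```scala
// Top 10 stations by number of incoming trips
stationGraph.inDegrees                  // (VertexId, inDegree)
  .join(stationVertices)                // attach station names
  .sortBy(_._2._1, ascending = false)   // highest in-degree first
  .take(10)
  .foreach(println)
// For outbound trips, the same pattern works with outDegrees
```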
6835
04:56:30,900 --> 04:56:34,815
So these are the top 10 station
and you can see the in degrees.
6836
04:56:34,815 --> 04:56:36,600
So there are these many trips
6837
04:56:36,600 --> 04:56:38,797
which are coming
into these stations.
6838
04:56:38,797 --> 04:56:39,651
Now similarly,
6839
04:56:39,651 --> 04:56:41,300
we can find the out-degree.
6840
04:56:48,200 --> 04:56:51,400
Now again, you can see
the out degrees as well.
6841
04:56:51,400 --> 04:56:54,896
So these are the stations
and these are the out degrees.
6842
04:56:54,896 --> 04:56:58,439
So again, you can go ahead
and perform some more operations
6843
04:56:58,439 --> 04:56:59,400
over this graph.
6844
04:56:59,400 --> 04:57:01,635
So you can go ahead
and find the station
6845
04:57:01,635 --> 04:57:03,700
which has most number
of trips,
6846
04:57:03,700 --> 04:57:07,241
that is, the most number of people
coming into that station,
6847
04:57:07,241 --> 04:57:09,758
but less people are
leaving that station
6848
04:57:09,758 --> 04:57:13,320
and again on the contrary
you can find out the stations
6849
04:57:13,320 --> 04:57:15,538
where there are
more number of edges
6850
04:57:15,538 --> 04:57:18,240
or you can say trips
leaving those stations.
6851
04:57:18,240 --> 04:57:19,848
But there are less number
6852
04:57:19,848 --> 04:57:22,100
of trips coming
into those stations.
6853
04:57:22,100 --> 04:57:25,800
So I guess you guys are
now clear with Spark GraphX.
6854
04:57:25,800 --> 04:57:27,810
Then we discussed
the different types
6855
04:57:27,810 --> 04:57:29,398
of graphs. Then moving ahead,
6856
04:57:29,398 --> 04:57:31,100
we discussed the
features of GraphX.
6857
04:57:31,100 --> 04:57:33,675
Then we discussed something
about property graph.
6858
04:57:33,675 --> 04:57:35,500
We understood what
is property graph
6859
04:57:35,500 --> 04:57:38,200
how you can create vertex
how you can create edges
6860
04:57:38,200 --> 04:57:40,800
and how to use VertexRDD and EdgeRDD.
6861
04:57:40,800 --> 04:57:44,500
Then we looked at some of
the important vertex operations
6862
04:57:44,500 --> 04:57:48,500
and at last we understood some
of the graph algorithms.
6863
04:57:48,500 --> 04:57:51,349
So I guess now you
guys are clear about
6864
04:57:51,349 --> 04:57:53,600
how to work with Spark GraphX.
6865
04:57:58,300 --> 04:58:01,300
Today's video is
on Hadoop versus Spark.
6866
04:58:01,400 --> 04:58:04,683
Now as we know organizations
from different domains
6867
04:58:04,683 --> 04:58:07,400
are investing in big
data analytics today.
6868
04:58:07,400 --> 04:58:10,400
They're analyzing large
data sets to uncover
6869
04:58:10,400 --> 04:58:11,730
all hidden patterns
6870
04:58:11,730 --> 04:58:15,510
unknown correlations market
trends customer preferences
6871
04:58:15,510 --> 04:58:18,100
and other useful
business information.
6872
04:58:18,100 --> 04:58:20,800
Analytics findings
are helping organizations
6873
04:58:20,800 --> 04:58:24,100
in more effective marketing,
new Revenue opportunities
6874
04:58:24,100 --> 04:58:25,973
and better customer service
6875
04:58:25,973 --> 04:58:29,241
and they're trying
to get competitive advantages
6876
04:58:29,241 --> 04:58:30,947
over rival organizations
6877
04:58:30,947 --> 04:58:33,920
and other business benefits
and Apache spark
6878
04:58:33,920 --> 04:58:38,000
and Hadoop are the two of most
prominent Big Data Frameworks
6879
04:58:38,000 --> 04:58:41,289
and I see people often comparing
these two technologies
6880
04:58:41,289 --> 04:58:44,700
and that is what exactly
we're going to do in this video.
6881
04:58:44,700 --> 04:58:48,100
Now, we'll compare these two big
data Frame Works based
6882
04:58:48,100 --> 04:58:49,800
on different parameters,
6883
04:58:49,800 --> 04:58:52,487
but first it is important
to get an overview
6884
04:58:52,487 --> 04:58:53,800
about what is Hadoop.
6885
04:58:53,800 --> 04:58:55,600
And what is Apache spark?
6886
04:58:55,600 --> 04:58:58,900
So let me just tell you a little
bit about Hadoop Hadoop is
6887
04:58:58,900 --> 04:59:00,200
a framework to store
6888
04:59:00,200 --> 04:59:04,200
and process large sets of data
across computer clusters
6889
04:59:04,200 --> 04:59:07,100
and Hadoop can scale
from single computer system
6890
04:59:07,100 --> 04:59:09,710
up to thousands
of commodity systems
6891
04:59:09,710 --> 04:59:11,500
that offer local storage
6892
04:59:11,500 --> 04:59:14,801
and compute power and Hadoop
is composed of modules
6893
04:59:14,801 --> 04:59:18,500
that work together to create
the entire Hadoop framework.
6894
04:59:18,500 --> 04:59:20,557
These are some of the components
6895
04:59:20,557 --> 04:59:23,254
that we have in the
entire Hadoop framework
6896
04:59:23,254 --> 04:59:24,800
or the Hadoop ecosystem.
6897
04:59:24,800 --> 04:59:27,500
For example, let
me tell you about hdfs,
6898
04:59:27,500 --> 04:59:30,856
which is the storage unit
of Hadoop yarn, which is
6899
04:59:30,856 --> 04:59:32,500
for resource management.
6900
04:59:32,500 --> 04:59:34,600
There are different
than a little tools
6901
04:59:34,600 --> 04:59:39,500
like Apache Hive and Pig, and NoSQL
databases like Apache HBase.
6902
04:59:39,900 --> 04:59:40,900
Even Apache spark
6903
04:59:40,900 --> 04:59:43,893
and Apache Storm fit
in the Hadoop ecosystem
6904
04:59:43,893 --> 04:59:45,399
for processing big data
6905
04:59:45,399 --> 04:59:49,200
in real-time for ingesting data
we have Tools like Flume
6906
04:59:49,200 --> 04:59:52,082
and Sqoop. Flume is used
to ingest unstructured data
6907
04:59:52,082 --> 04:59:53,600
or semi-structured data
6908
04:59:53,600 --> 04:59:57,135
whereas Sqoop is used to ingest
structured data into HDFS.
6909
04:59:57,135 --> 04:59:59,900
If you want to learn more
about these tools,
6910
04:59:59,900 --> 05:00:01,470
you can go to Edureka's
6911
05:00:01,470 --> 05:00:04,000
YouTube channel and look
for Hadoop tutorial
6912
05:00:04,000 --> 05:00:06,600
where everything has
been explained in detail.
6913
05:00:06,600 --> 05:00:08,171
Now, let's move to spark
6914
05:00:08,171 --> 05:00:12,100
Apache spark is a lightning-fast
cluster Computing technology
6915
05:00:12,100 --> 05:00:14,400
that is designed
for fast computation.
6916
05:00:14,400 --> 05:00:18,223
The main feature of spark
is its in-memory cluster
6917
05:00:18,223 --> 05:00:19,400
computing
6918
05:00:19,400 --> 05:00:23,482
that increases the processing
speed of an application. Spark
6919
05:00:23,482 --> 05:00:27,100
performs similar operations
to that of Hadoop modules,
6920
05:00:27,100 --> 05:00:30,365
but it uses in-memory
processing and optimizes
6921
05:00:30,365 --> 05:00:33,791
the steps. The primary
difference between MapReduce
6922
05:00:33,791 --> 05:00:35,400
in Hadoop and Spark is
6923
05:00:35,400 --> 05:00:38,500
that MapReduce uses
persistent storage
6924
05:00:38,500 --> 05:00:42,100
and Spark uses resilient
distributed data sets,
6925
05:00:42,100 --> 05:00:44,920
which is known as
RDDs, which reside
6926
05:00:44,920 --> 05:00:48,458
in memory. The different
components in Spark are as follows.
6927
05:00:48,800 --> 05:00:52,000
First, the Spark Core. The Spark
Core is the base engine
6928
05:00:52,000 --> 05:00:53,600
for large-scale parallel
6929
05:00:53,600 --> 05:00:57,463
and distributed data processing
further additional libraries
6930
05:00:57,463 --> 05:01:01,100
which are built on top of
the core allow diverse workloads
6931
05:01:01,100 --> 05:01:02,381
for streaming SQL
6932
05:01:02,381 --> 05:01:06,000
and machine learning. Spark
Core is also responsible
6933
05:01:06,000 --> 05:01:09,500
for memory management
and fault recovery; scheduling,
6934
05:01:09,500 --> 05:01:12,749
distributing and monitoring
jobs on a cluster;
6935
05:01:12,749 --> 05:01:16,000
and interacting with
the storage systems as well.
6936
05:01:16,100 --> 05:01:16,649
Next up.
6937
05:01:16,649 --> 05:01:18,300
We have spark streaming.
6938
05:01:18,300 --> 05:01:20,906
Spark streaming is
the component of spark
6939
05:01:20,906 --> 05:01:24,100
which is used to process
real-time streaming data.
6940
05:01:24,100 --> 05:01:25,822
It enables high throughput
6941
05:01:25,822 --> 05:01:29,600
and fault-tolerant stream
processing of live data streams.
6942
05:01:29,600 --> 05:01:33,500
Next we have Spark SQL. Spark
SQL is a new module in Spark
6943
05:01:33,500 --> 05:01:36,800
which integrates relational
processing with Sparks
6944
05:01:36,800 --> 05:01:38,800
functional programming API.
6945
05:01:38,800 --> 05:01:41,700
It supports querying
data either via SQL
6946
05:01:41,700 --> 05:01:44,000
or via the hive query language.
6947
05:01:44,000 --> 05:01:46,381
For those of you
familiar with RDBMS,
6948
05:01:46,381 --> 05:01:48,300
Spark SQL will be an easy
6949
05:01:48,300 --> 05:01:51,637
transition from your earlier
tools where you can extend
6950
05:01:51,637 --> 05:01:55,100
the boundaries of traditional
relational data processing.
6951
05:01:55,200 --> 05:02:00,092
Next up is GraphX. GraphX is
the Spark API for graphs
6952
05:02:00,092 --> 05:02:02,400
and graph parallel computation
6953
05:02:02,400 --> 05:02:04,867
and thus it extends
the spark resilient
6954
05:02:04,867 --> 05:02:08,700
distributed data sets with a
Resilient Distributed Property
6955
05:02:08,700 --> 05:02:09,500
Graph.
6956
05:02:09,900 --> 05:02:13,000
Next is Spark MLlib
for machine learning.
6957
05:02:13,000 --> 05:02:16,500
MLlib stands for Machine
Learning Library. Spark
6958
05:02:16,500 --> 05:02:18,300
MLlib is used
to perform machine
6959
05:02:18,400 --> 05:02:20,900
learning in Apache Spark. Now,
6960
05:02:20,900 --> 05:02:24,200
since you've got an overview
of both these two Frameworks,
6961
05:02:24,200 --> 05:02:25,985
I believe that the ground
6962
05:02:25,985 --> 05:02:29,200
is all set to compare
Apache spark and Hadoop.
6963
05:02:29,200 --> 05:02:32,617
Let's move ahead and compare
Apache spark with Hadoop
6964
05:02:32,617 --> 05:02:36,100
on different parameters
to understand their strengths.
6965
05:02:36,100 --> 05:02:38,887
We will be comparing
these two Frameworks
6966
05:02:38,887 --> 05:02:40,700
based on these parameters.
6967
05:02:40,700 --> 05:02:44,400
Let's start with performance
first. Spark is fast
6968
05:02:44,400 --> 05:02:45,476
because it has
6969
05:02:45,476 --> 05:02:49,000
in-memory processing. It
can also use disk for data
6970
05:02:49,000 --> 05:02:51,774
that doesn't fit
into memory. Spark's
6971
05:02:51,774 --> 05:02:55,851
in-memory processing delivers
near real-time analytics
6972
05:02:56,000 --> 05:02:57,771
and this makes Spark suitable
6973
05:02:57,771 --> 05:03:00,300
for credit card
processing systems, machine
6974
05:03:00,300 --> 05:03:02,300
learning security analysis
6975
05:03:02,300 --> 05:03:05,100
and processing data
for iot sensors.
6976
05:03:05,200 --> 05:03:07,700
Now, let's talk
about hadoop's performance.
6977
05:03:07,700 --> 05:03:10,700
Now Hadoop was originally
designed to continuously
6978
05:03:10,700 --> 05:03:13,700
gather data from multiple
sources without worrying
6979
05:03:13,700 --> 05:03:14,800
about the type of data
6980
05:03:14,800 --> 05:03:15,687
and storing it
6981
05:03:15,687 --> 05:03:18,544
across a distributed
environment, and MapReduce
6982
05:03:18,544 --> 05:03:22,185
uses batch processing.
MapReduce was never built for
6983
05:03:22,185 --> 05:03:24,108
real-time processing. The main idea
6984
05:03:24,108 --> 05:03:27,751
behind yarn is parallel
processing over distributed data
6985
05:03:27,751 --> 05:03:30,400
sets. The problem
with comparing the two is
6986
05:03:30,400 --> 05:03:33,400
that they have different
ways of processing,
6987
05:03:33,400 --> 05:03:37,400
and the idea behind the
development is also divergent.
6988
05:03:37,700 --> 05:03:40,300
Next, ease of use. Spark comes
6989
05:03:40,300 --> 05:03:44,400
with user-friendly APIs
for Scala, Java, Python,
6990
05:03:44,400 --> 05:03:48,300
and Spark SQL. Spark SQL
is very similar to SQL.
6991
05:03:48,600 --> 05:03:50,047
So it becomes easier
6992
05:03:50,047 --> 05:03:53,202
for SQL developers
to learn it. Spark also
6993
05:03:53,202 --> 05:03:55,272
provides an interactive shell
6994
05:03:55,272 --> 05:03:58,700
for developers to query
and perform other actions
6995
05:03:58,700 --> 05:04:00,800
and have immediate feedback.
6996
05:04:00,900 --> 05:04:02,762
Now, let's talk about Hadoop.
6997
05:04:02,762 --> 05:04:06,544
You can ingest data in Hadoop
easily either by using shell
6998
05:04:06,544 --> 05:04:09,000
or integrating it
with multiple tools,
6999
05:04:09,000 --> 05:04:10,353
like Sqoop and Flume,
7000
05:04:10,353 --> 05:04:13,021
and yarn is just
a processing framework
7001
05:04:13,021 --> 05:04:15,900
that can be integrated
with multiple tools
7002
05:04:15,900 --> 05:04:18,200
like Hive and pig for Analytics.
7003
05:04:18,200 --> 05:04:20,353
Hive is a data
warehousing component
7004
05:04:20,353 --> 05:04:22,381
which performs reading, writing,
7005
05:04:22,381 --> 05:04:26,058
and managing large data set
in a distributed environment
7006
05:04:26,058 --> 05:04:29,100
using an SQL-like interface.
To conclude here,
7007
05:04:29,100 --> 05:04:31,700
Both of them have
their own ways to make
7008
05:04:31,700 --> 05:04:33,500
themselves user-friendly.
7009
05:04:33,826 --> 05:04:36,365
Now, let's come
to the cost Hadoop
7010
05:04:36,365 --> 05:04:39,903
and Spark are both Apache
open source projects.
7011
05:04:40,000 --> 05:04:43,900
So there's no cost for the
software cost is only associated
7012
05:04:43,900 --> 05:04:47,433
with the infrastructure both
the products are designed
7013
05:04:47,433 --> 05:04:48,300
in such a way
7014
05:04:48,300 --> 05:04:50,800
that they can run
on commodity hardware
7015
05:04:50,800 --> 05:04:54,100
with low TCO or total
cost of ownership.
7016
05:04:54,800 --> 05:04:56,895
Well now you might
be wondering the ways
7017
05:04:56,895 --> 05:04:58,400
in which they are different.
7018
05:04:58,400 --> 05:05:02,117
or if they're all the same. Storage
and processing in Hadoop is
7019
05:05:02,117 --> 05:05:05,700
disc-based and Hadoop uses
standard amounts of memory.
7020
05:05:05,700 --> 05:05:06,717
So with Hadoop,
7021
05:05:06,717 --> 05:05:07,600
we need a lot
7022
05:05:07,600 --> 05:05:12,200
of disk space as well as
faster transfer speed Hadoop
7023
05:05:12,200 --> 05:05:15,300
also requires multiple
systems to distribute
7024
05:05:15,300 --> 05:05:17,000
the disk input output,
7025
05:05:17,000 --> 05:05:18,900
but in case of Apache spark
7026
05:05:18,900 --> 05:05:22,800
due to its in-memory processing
it requires a lot of memory,
7027
05:05:22,800 --> 05:05:24,900
but it can deal
with a standard
7028
05:05:24,900 --> 05:05:28,400
speed and amount of disk, as
disk space is a relatively
7029
05:05:28,400 --> 05:05:29,855
inexpensive commodity
7030
05:05:29,855 --> 05:05:32,985
and since Spark does not use
disk input output
7031
05:05:32,985 --> 05:05:34,591
for processing; instead,
7032
05:05:34,591 --> 05:05:36,337
It requires large amounts
7033
05:05:36,337 --> 05:05:39,200
of RAM for executing
everything in memory.
7034
05:05:39,300 --> 05:05:42,000
So spark systems
incur more cost,
7035
05:05:42,300 --> 05:05:45,314
but yes one important thing
to keep in mind is
7036
05:05:45,314 --> 05:05:49,400
that Sparks technology reduces
the number of required systems,
7037
05:05:49,400 --> 05:05:52,900
it needs significantly
fewer systems that cost more
7038
05:05:52,900 --> 05:05:55,991
so there will be a point
at which spark reduces
7039
05:05:55,991 --> 05:05:57,134
the cost per unit
7040
05:05:57,134 --> 05:06:01,100
of the computation even with
the additional RAM requirement.
7041
05:06:01,200 --> 05:06:04,500
There are two types of
data processing batch processing
7042
05:06:04,500 --> 05:06:08,344
and stream processing batch
processing has been crucial
7043
05:06:08,344 --> 05:06:09,904
to the Big Data World
7044
05:06:09,904 --> 05:06:13,100
in simplest term batch
processing is working
7045
05:06:13,100 --> 05:06:16,500
with high data volumes
collected over a period
7046
05:06:16,500 --> 05:06:20,423
in batch processing data is
first collected then processed
7047
05:06:20,423 --> 05:06:21,800
and then the results
7048
05:06:21,800 --> 05:06:24,624
are produced at a later
stage, and batch
7049
05:06:24,624 --> 05:06:26,000
processing is an efficient way
7050
05:06:26,000 --> 05:06:28,769
of processing large
static data sets.
7051
05:06:28,800 --> 05:06:30,300
Generally we perform
7052
05:06:30,300 --> 05:06:34,300
batch processing for archived
data sets for example,
7053
05:06:34,300 --> 05:06:36,887
calculating average income
of a country
7054
05:06:36,887 --> 05:06:40,700
or evaluating the change
in e-commerce in the last decade
7055
05:06:40,900 --> 05:06:45,000
now stream processing stream
processing is the current Trend
7056
05:06:45,000 --> 05:06:48,258
in the big data world. The need
of the hour is speed
7057
05:06:48,258 --> 05:06:50,100
and real-time information,
7058
05:06:50,100 --> 05:06:52,100
which is what stream processing
7059
05:06:52,100 --> 05:06:54,500
does. Batch processing
does not allow
7060
05:06:54,500 --> 05:06:57,700
businesses to quickly react
to changing business needs
7061
05:06:57,700 --> 05:07:01,900
and real-time stream processing
has seen a rapid growth
7062
05:07:01,900 --> 05:07:05,188
in demand. Now coming
back to Apache Spark
7063
05:07:05,188 --> 05:07:09,420
versus Hadoop: YARN is basically
a batch processing framework
7064
05:07:09,420 --> 05:07:11,500
when we submit a job to yarn.
7065
05:07:11,500 --> 05:07:14,827
It reads data from
the cluster, performs operations,
7066
05:07:14,827 --> 05:07:17,539
and writes the results
back to the cluster
7067
05:07:17,539 --> 05:07:19,100
and then it again reads
7068
05:07:19,100 --> 05:07:21,900
the updated data, performs
the next operation
7069
05:07:21,900 --> 05:07:25,500
and writes the results back
to the cluster, and so on.
7070
05:07:25,700 --> 05:07:29,678
on the other hand spark is
designed to cover a wide range
7071
05:07:29,678 --> 05:07:31,100
of workloads such as
7072
05:07:31,100 --> 05:07:35,429
batch applications, iterative
algorithms, interactive queries,
7073
05:07:35,429 --> 05:07:37,100
and streaming as well.
7074
05:07:37,400 --> 05:07:40,899
Now, let's come to fault
tolerance Hadoop and Spark
7075
05:07:40,899 --> 05:07:43,000
both provide fault tolerance,
7076
05:07:43,000 --> 05:07:45,716
but have different
approaches. For HDFS
7077
05:07:45,716 --> 05:07:47,673
and YARN, both master daemons,
7078
05:07:47,673 --> 05:07:49,700
that is, the NameNode in HDFS
7079
05:07:49,700 --> 05:07:53,285
and the ResourceManager
in YARN, check the heartbeat
7080
05:07:53,285 --> 05:07:54,651
of the slave daemons.
7081
05:07:54,651 --> 05:07:58,000
The slave daemons are DataNodes
and NodeManagers.
7082
05:07:58,000 --> 05:08:00,100
So if any slave daemon fails,
7083
05:08:00,100 --> 05:08:03,800
the master daemon reschedules
all pending and in-progress
7084
05:08:03,800 --> 05:08:07,900
operations to another slave
now this method is effective
7085
05:08:07,900 --> 05:08:11,300
but it can significantly
increase the completion time
7086
05:08:11,300 --> 05:08:14,000
for operations with
a single failure. Also,
7087
05:08:14,000 --> 05:08:16,400
as Hadoop uses
commodity hardware,
7088
05:08:16,400 --> 05:08:20,200
another way in which HDFS
ensures fault tolerance is
7089
05:08:20,200 --> 05:08:21,797
by replicating data.
7090
05:08:22,200 --> 05:08:24,200
Now let's talk about spark
7091
05:08:24,200 --> 05:08:29,094
as we discussed earlier, RDDs,
or resilient distributed data sets,
7092
05:08:29,094 --> 05:08:31,710
are the building blocks
of Apache Spark,
7093
05:08:32,000 --> 05:08:34,100
and RDDs are the ones
7094
05:08:34,226 --> 05:08:37,073
which provide fault
tolerance to Spark.
7095
05:08:37,073 --> 05:08:38,000
They can refer
7096
05:08:38,000 --> 05:08:41,600
to any data set present
in an external storage system
7097
05:08:41,600 --> 05:08:45,200
like HDFS, HBase, a
shared file system, etc.
7098
05:08:45,300 --> 05:08:47,100
They can also be operated
7099
05:08:47,100 --> 05:08:49,869
on parallelly. RDDs can
persist a data set
7100
05:08:49,869 --> 05:08:52,100
in memory across operations,
7101
05:08:52,100 --> 05:08:56,061
which makes future actions
up to 10 times faster.
7102
05:08:56,061 --> 05:08:58,731
If an RDD is lost,
it will automatically
7103
05:08:58,731 --> 05:09:02,700
get recomputed by using
the original Transformations.
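To make that concrete, here is a minimal sketch of lineage-based recovery (the file path is just an assumption for illustration):

```scala
// Transformations are only recorded in the lineage graph; nothing runs yet
val lines  = sc.textFile("hdfs:///data/events.log")
val errors = lines.filter(_.contains("ERROR"))
errors.persist()      // keep the computed partitions in memory
errors.count()        // first action materializes the RDD
errors.take(5)        // reuses the cached data; if a partition is lost,
                      // Spark recomputes just that partition from the lineage
```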
7104
05:09:02,700 --> 05:09:06,720
And this is how spark provides
fault tolerance and at the end.
7105
05:09:06,720 --> 05:09:08,500
Let us talk about security.
7106
05:09:08,500 --> 05:09:11,100
Well Hadoop has
multiple ways of providing
7107
05:09:11,100 --> 05:09:14,806
security. Hadoop supports
Kerberos for authentication,
7108
05:09:14,806 --> 05:09:17,800
but it is difficult
to handle nevertheless.
7109
05:09:17,800 --> 05:09:21,800
It also supports
third-party vendors like LDAP
7110
05:09:22,000 --> 05:09:23,441
for authentication.
7111
05:09:23,441 --> 05:09:26,400
They also offer
encryption. HDFS supports
7112
05:09:26,400 --> 05:09:30,600
traditional file permissions as
well as Access Control lists,
7113
05:09:30,600 --> 05:09:34,222
Hadoop provides service level
authorization which guarantees
7114
05:09:34,222 --> 05:09:36,800
that clients have
the right permissions for
7115
05:09:36,800 --> 05:09:40,400
job submission. Spark currently
supports authentication
7116
05:09:40,400 --> 05:09:44,600
via a shared secret. Spark
can integrate with hdfs
7117
05:09:44,600 --> 05:09:46,900
and it can use HDFS ACLs
7118
05:09:46,900 --> 05:09:50,652
or Access Control lists
and file level permissions
7119
05:09:50,652 --> 05:09:52,024
sparking also run.
7120
05:09:52,024 --> 05:09:55,100
YARN, leveraging the
capability of Kerberos.
7121
05:09:55,100 --> 05:09:55,900
Now.
7122
05:09:55,900 --> 05:09:59,100
This was the comparison
of these two Frameworks based
7123
05:09:59,100 --> 05:10:00,600
on these following parameters.
7124
05:10:00,600 --> 05:10:03,300
Now, let us understand use cases
7125
05:10:03,300 --> 05:10:06,900
where these Technologies
fit best. Use cases where
7126
05:10:06,900 --> 05:10:07,900
Hadoop fits best.
7127
05:10:07,900 --> 05:10:09,300
For example,
7128
05:10:09,300 --> 05:10:12,500
when you're analyzing
archived data: YARN
7129
05:10:12,500 --> 05:10:14,300
allows parallel processing
7130
05:10:14,300 --> 05:10:18,657
over huge amounts of data. Parts
of data are processed parallelly
7131
05:10:18,657 --> 05:10:21,300
and separately on
different data nodes
7132
05:10:21,300 --> 05:10:25,825
and the result is gathered
from each NodeManager. In cases
7133
05:10:25,825 --> 05:10:29,000
when instant results
are not required,
7134
05:10:29,000 --> 05:10:32,319
Hadoop MapReduce is a good
and economical solution
7135
05:10:32,319 --> 05:10:33,700
for batch processing.
7136
05:10:33,700 --> 05:10:35,546
However, it is incapable
7137
05:10:35,900 --> 05:10:39,015
of processing data
in real time. Use cases
7138
05:10:39,015 --> 05:10:43,400
where Spark fits best:
real-time big data analysis.
7139
05:10:43,400 --> 05:10:46,600
real-time data analysis
means processing data
7140
05:10:46,600 --> 05:10:50,300
that is getting generated by
the real-time event streams
7141
05:10:50,300 --> 05:10:53,000
coming in at the rate
of Billions of events
7142
05:10:53,000 --> 05:10:55,000
per second. The strength
of Spark lies in its ability
05:10:55,000 --> 05:10:58,277
of spark lies in its abilities
to support streaming
7144
05:10:58,277 --> 05:11:00,900
of data along with
distributed processing
7145
05:11:00,900 --> 05:11:04,700
and Spark claims to process
data hundred times faster
7146
05:11:04,700 --> 05:11:09,100
than MapReduce in memory, and 10 times
faster on disk.
7147
05:11:09,100 --> 05:11:13,000
It is used in graph
processing. Spark contains
7148
05:11:13,000 --> 05:11:15,720
a graph computation
library called GraphX,
7149
05:11:15,720 --> 05:11:18,700
which simplifies our life.
In-memory computation,
7150
05:11:18,700 --> 05:11:22,100
along with inbuilt graph support
improves the performance
7151
05:11:22,100 --> 05:11:24,700
of the algorithm
by one or two orders
7152
05:11:24,700 --> 05:11:28,516
of magnitude over
traditional MapReduce programs.
7153
05:11:28,516 --> 05:11:32,200
It is also used in iterative
machine learning algorithms.
7154
05:11:32,200 --> 05:11:35,900
Almost all machine learning
algorithms work iteratively
7155
05:11:35,900 --> 05:11:39,039
as we have seen earlier
iterative algorithms
7156
05:11:39,039 --> 05:11:41,389
involve input/output bottlenecks
7157
05:11:41,389 --> 05:11:44,400
in the MapReduce
implementations. MapReduce
7158
05:11:44,400 --> 05:11:46,400
uses coarse-grained tasks
7159
05:11:46,400 --> 05:11:47,600
that are too heavy
7160
05:11:47,600 --> 05:11:51,926
for iterative algorithms. Spark
caches the intermediate data
7161
05:11:51,926 --> 05:11:53,972
set after each iteration
7162
05:11:53,972 --> 05:11:57,586
and runs multiple iterations
on the cache data set
7163
05:11:57,586 --> 05:12:01,200
which eventually reduces
the input output overhead
7164
05:12:01,200 --> 05:12:03,142
and executes the algorithm
7165
05:12:03,142 --> 05:12:07,400
faster in a fault-tolerant
manner. So at the end, which one is
7166
05:12:07,400 --> 05:12:10,900
the best? The answer
to this is that Hadoop
7167
05:12:10,900 --> 05:12:14,800
and Apache spark are
not competing with one another.
7168
05:12:15,000 --> 05:12:18,100
In fact, they complement
each other quite well,
7169
05:12:18,100 --> 05:12:20,745
Hadoop brings huge
data sets under control
7170
05:12:20,745 --> 05:12:22,100
with commodity systems,
7171
05:12:22,100 --> 05:12:26,100
and Spark provides
a real-time in-memory processing
7172
05:12:26,100 --> 05:12:27,700
for those data sets.
7173
05:12:27,900 --> 05:12:30,600
When we combine
Apache Spark's abilities,
7174
05:12:30,600 --> 05:12:34,200
that is, the high processing
speed and advanced analytics
7175
05:12:34,200 --> 05:12:38,600
and multiple integration support
with Hadoop's low-cost operation
7176
05:12:38,600 --> 05:12:40,200
on commodity Hardware.
7177
05:12:40,200 --> 05:12:42,091
it gives the best results.
7178
05:12:42,091 --> 05:12:45,800
Hadoop complements Apache
Spark's capabilities. Spark will
7179
05:12:45,800 --> 05:12:48,737
not completely replace Hadoop,
but the good news is
7180
05:12:48,737 --> 05:12:52,079
that the demand of spark is
currently at an all-time high.
7181
05:12:52,079 --> 05:12:55,849
If you want to learn more
about the Hadoop ecosystem tools
7182
05:12:55,849 --> 05:12:56,900
and Apache spark,
7183
05:12:56,900 --> 05:12:59,106
don't forget to take
a look at the
7184
05:12:59,106 --> 05:13:01,700
Edureka YouTube channel
and check out the big data
7185
05:13:01,700 --> 05:13:03,000
and Hadoop playlist.
7186
05:13:07,600 --> 05:13:09,776
Welcome everyone in
today's session on
7187
05:13:09,776 --> 05:13:11,100
Kafka Spark Streaming.
7188
05:13:11,100 --> 05:13:14,400
So without any further delay,
let's look at the agenda first.
7189
05:13:14,400 --> 05:13:16,128
We will start by understanding.
7190
05:13:16,128 --> 05:13:17,310
What is Apache Kafka?
7191
05:13:17,310 --> 05:13:19,900
Then we will discuss
about different components
7192
05:13:19,900 --> 05:13:22,000
of Apache Kafka
and it's architecture.
7193
05:13:22,000 --> 05:13:24,899
Further we will look
at different Kafka commands.
7194
05:13:24,899 --> 05:13:25,546
After that.
7195
05:13:25,546 --> 05:13:27,994
We'll take a brief overview
of Apache spark
7196
05:13:27,994 --> 05:13:30,700
and will understand
different spark components.
7197
05:13:30,700 --> 05:13:31,201
Finally.
7198
05:13:31,201 --> 05:13:32,579
We'll look at the demo
7199
05:13:32,579 --> 05:13:35,900
where we will use spark
streaming with Apache Kafka.
7200
05:13:36,100 --> 05:13:37,600
Let's move to our first slide.
7201
05:13:37,900 --> 05:13:39,323
So in a real time scenario,
7202
05:13:39,323 --> 05:13:41,500
we have different
systems or services,
7203
05:13:41,500 --> 05:13:43,000
which will be communicating
7204
05:13:43,000 --> 05:13:46,200
with each other and
the data pipelines are the ones
7205
05:13:46,200 --> 05:13:48,800
which are establishing
connection between two servers
7206
05:13:48,800 --> 05:13:49,953
or two systems.
7207
05:13:50,000 --> 05:13:52,100
Now, let's take
an example of an e-commerce
7208
05:13:52,100 --> 05:13:55,255
website, where it can have
multiple servers at front end
7209
05:13:55,255 --> 05:13:58,161
like a web or application server
for hosting the application.
7210
05:13:58,161 --> 05:13:59,530
It can have a chat server
7211
05:13:59,530 --> 05:14:01,958
for the customers
to provide chat facilities.
7212
05:14:01,958 --> 05:14:04,900
Then it can have a separate
server for payment Etc.
7213
05:14:04,900 --> 05:14:08,145
Similarly, organizations can also
have multiple servers
7214
05:14:08,145 --> 05:14:09,100
at the back end
7215
05:14:09,100 --> 05:14:11,900
which will be receiving messages
from different front end servers
7216
05:14:11,900 --> 05:14:13,200
based on the requirements.
7217
05:14:13,400 --> 05:14:15,600
Now they can have
a database server
7218
05:14:15,600 --> 05:14:17,700
which will be storing
the records then they
7219
05:14:17,700 --> 05:14:20,100
can have security systems
for user authentication
7220
05:14:20,100 --> 05:14:21,916
and authorization then
they can have
7221
05:14:21,916 --> 05:14:23,368
Real-time monitoring server,
7222
05:14:23,368 --> 05:14:25,600
which is basically
used for recommendations.
7223
05:14:25,600 --> 05:14:28,100
So all these data
pipelines become complex
7224
05:14:28,100 --> 05:14:30,200
with the increase
in number of systems
7225
05:14:30,200 --> 05:14:31,594
and adding a new system
7226
05:14:31,594 --> 05:14:33,900
or server requires
more data pipelines,
7227
05:14:33,900 --> 05:14:35,900
which will again
make the data flow
7228
05:14:35,900 --> 05:14:37,800
more complicated and complex.
7229
05:14:37,800 --> 05:14:38,662
Now managing
7230
05:14:38,662 --> 05:14:41,646
these data pipelines also
becomes very difficult
7231
05:14:41,646 --> 05:14:45,100
as each data pipeline has
their own set of requirements
7232
05:14:45,100 --> 05:14:46,700
for example data pipelines,
7233
05:14:46,700 --> 05:14:49,700
which handle transactions
should be more fault tolerant
7234
05:14:49,700 --> 05:14:51,700
and robust; on the other hand,
7235
05:14:51,700 --> 05:14:54,372
Clickstream data pipeline
can be more fragile.
7236
05:14:54,372 --> 05:14:55,784
So adding some pipelines
7237
05:14:55,784 --> 05:14:58,400
or removing some pipelines
becomes more difficult
7238
05:14:58,400 --> 05:14:59,600
from the complex system.
7239
05:14:59,800 --> 05:15:02,800
So now I hope that you would
have understood the problem
7240
05:15:02,800 --> 05:15:05,400
due to which messaging
systems originated.
7241
05:15:05,400 --> 05:15:08,200
Let's move to the next slide
and we'll understand
7242
05:15:08,200 --> 05:15:11,970
how Kafka solves this problem
now. A messaging system reduces
7243
05:15:11,970 --> 05:15:13,835
the complexity of data pipelines
7244
05:15:13,835 --> 05:15:16,600
and makes the communication
between systems more
7245
05:15:16,600 --> 05:15:19,780
simpler and manageable
using messaging system.
7246
05:15:19,780 --> 05:15:22,500
Now, you can easily
establish remote communication
7247
05:15:22,500 --> 05:15:25,063
and send your data
easily across the network.
7248
05:15:25,063 --> 05:15:26,536
Now, different systems
7249
05:15:26,536 --> 05:15:29,100
may use different
platforms and languages
7250
05:15:29,200 --> 05:15:30,200
and messaging system
7251
05:15:30,200 --> 05:15:32,852
provides you a common
Paradigm independent
7252
05:15:32,852 --> 05:15:34,560
of any platform or language.
7253
05:15:34,560 --> 05:15:36,900
So basically it
decouples the platform
7254
05:15:36,900 --> 05:15:39,800
on which a front end server as
well as your back-end server
7255
05:15:39,800 --> 05:15:43,600
is running. You can also establish
asynchronous communication
7256
05:15:43,600 --> 05:15:44,800
and send messages
7257
05:15:44,800 --> 05:15:47,000
so that the sender
does not have to wait
7258
05:15:47,000 --> 05:15:49,000
for the receiver
to process the messages.
7259
05:15:49,200 --> 05:15:51,300
Now one of the benefit
of messaging system is
7260
05:15:51,300 --> 05:15:53,295
that you can have
reliable communication.
7261
05:15:53,295 --> 05:15:56,600
So even when the receiver and
network is not working properly.
7262
05:15:56,600 --> 05:15:59,272
Your messages wouldn't
get lost. Now talking
7263
05:15:59,272 --> 05:16:02,900
about Kafka: Kafka
decouples the data pipelines
7264
05:16:02,900 --> 05:16:06,205
and solves the complexity
problem the applications
7265
05:16:06,205 --> 05:16:10,050
which are producing messages
to Kafka are called producers
7266
05:16:10,050 --> 05:16:11,400
and the applications
7267
05:16:11,400 --> 05:16:13,600
which are consuming
those messages from Kafka
7268
05:16:13,600 --> 05:16:14,706
are called consumers.
7269
05:16:14,706 --> 05:16:17,500
Now, as you can see in the image
the front end server,
7270
05:16:17,500 --> 05:16:20,200
then your application server
one and application server
7271
05:16:20,200 --> 05:16:21,500
two and chat server
7272
05:16:21,500 --> 05:16:25,500
are producing messages to Kafka
and these are called producers
7273
05:16:25,500 --> 05:16:26,985
and your database server
7274
05:16:26,985 --> 05:16:29,594
security systems real-time
monitoring server
7275
05:16:29,594 --> 05:16:31,900
than other services
and data warehouse.
7276
05:16:31,900 --> 05:16:34,300
These are basically
consuming the messages
7277
05:16:34,300 --> 05:16:35,900
and are called consumers.
7278
05:16:36,100 --> 05:16:39,600
So your producer sends
the message to Kafka
7279
05:16:39,700 --> 05:16:42,781
and then Kafka stores
those messages, and consumers
7280
05:16:42,781 --> 05:16:45,000
who want those
messages can subscribe
7281
05:16:45,000 --> 05:16:47,607
and receive them now
you can also have
7282
05:16:47,607 --> 05:16:51,191
multiple subscribers to
a single category of messages.
7283
05:16:51,191 --> 05:16:52,623
So your database server
7284
05:16:52,623 --> 05:16:56,400
and your security system can
be consuming the same messages
7285
05:16:56,400 --> 05:16:58,600
which is produced
by application server
7286
05:16:58,600 --> 05:17:01,423
1 and again adding
a new consumer is very easy.
7287
05:17:01,423 --> 05:17:03,658
You can go ahead and
add a new consumer
7288
05:17:03,658 --> 05:17:06,268
and just subscribe
to the message categories
7289
05:17:06,268 --> 05:17:07,300
that is required.
7290
05:17:07,300 --> 05:17:10,700
So again, you can add
a new consumer say consumer one
7291
05:17:10,700 --> 05:17:13,100
and you can again
go ahead and subscribe
7292
05:17:13,100 --> 05:17:14,570
to the category of messages
7293
05:17:14,570 --> 05:17:17,100
which is produced by
application server one.
7294
05:17:17,100 --> 05:17:19,100
So, let's quickly move ahead.
7295
05:17:19,100 --> 05:17:21,606
Let's talk about
Apache Kafka. So Apache
7296
05:17:21,606 --> 05:17:24,853
Kafka is a distributed
publish/subscribe messaging
7297
05:17:24,853 --> 05:17:28,300
system. Messaging traditionally
has two models queuing
7298
05:17:28,300 --> 05:17:32,173
and publish/subscribe. In a queue,
a pool of consumers.
7299
05:17:32,173 --> 05:17:33,769
May read from a server
7300
05:17:33,769 --> 05:17:36,540
and each record only
goes to one of them
7301
05:17:36,540 --> 05:17:38,600
whereas in publish/subscribe.
7302
05:17:38,600 --> 05:17:41,313
The record is broadcasted
to all consumers.
7303
05:17:41,313 --> 05:17:43,722
So multiple consumers
can get the record.
7304
05:17:43,722 --> 05:17:45,700
the Kafka cluster is distributed
7305
05:17:45,700 --> 05:17:48,374
and has multiple machines
running in parallel.
7306
05:17:48,374 --> 05:17:50,700
And this is the reason
why Kafka is fast,
7307
05:17:50,700 --> 05:17:52,000
scalable, and fault-tolerant.
7308
05:17:52,300 --> 05:17:53,309
Now let me tell you
7309
05:17:53,309 --> 05:17:55,700
that Kafka was developed
at LinkedIn and later
7310
05:17:55,700 --> 05:17:57,700
It became a part
of Apache project.
7311
05:17:57,900 --> 05:18:01,100
Now, let us look at some
of the important terminologies.
7312
05:18:01,100 --> 05:18:03,499
So we'll first start with topic.
7313
05:18:03,499 --> 05:18:05,081
So topic is a category
7314
05:18:05,081 --> 05:18:08,100
or feed name to which
records are published
7315
05:18:08,100 --> 05:18:11,226
and topics in Kafka are
always multi-subscriber.
7316
05:18:11,226 --> 05:18:14,800
That is a topic can have
zero one or multiple consumers
7317
05:18:14,800 --> 05:18:16,600
that can subscribe the topic
7318
05:18:16,600 --> 05:18:19,300
and consume the data written
to it. For example,
7319
05:18:19,300 --> 05:18:21,850
You can have sales records
getting published in the sales
7320
05:18:21,850 --> 05:18:23,500
topic, you can
have product records
7321
05:18:23,500 --> 05:18:25,600
which is getting published
in product topic
7322
05:18:25,600 --> 05:18:28,965
and so on. This will actually
segregate your messages
7323
05:18:28,965 --> 05:18:31,756
and consumer will only
subscribe the topic
7324
05:18:31,756 --> 05:18:35,500
that they need. And again, your
consumer can also subscribe
7325
05:18:35,500 --> 05:18:37,300
to two or more topics.
7326
05:18:37,300 --> 05:18:40,100
Now, let's talk
about partitions.
7327
05:18:40,100 --> 05:18:44,253
So Kafka topics are divided
into a number of partitions
7328
05:18:44,253 --> 05:18:47,800
and partitions allow
you to paralyze a topic
7329
05:18:47,800 --> 05:18:49,284
by splitting the data
7330
05:18:49,284 --> 05:18:51,846
in a particular
topic across multiple
7331
05:18:51,846 --> 05:18:55,200
brokers, which means
each partition can be placed
7332
05:18:55,200 --> 05:18:58,869
on separate machine to allow
multiple consumers to read
7333
05:18:58,869 --> 05:19:00,500
from a topic parallelly.
7334
05:19:00,500 --> 05:19:02,700
So in case of the sales
topic, you can have
7335
05:19:02,700 --> 05:19:05,700
three partitions: partition
0 partition 1 and partition
7336
05:19:05,700 --> 05:19:09,400
2, from where three consumers
can read data parallelly.
7337
05:19:09,400 --> 05:19:10,481
Now moving ahead.
7338
05:19:10,481 --> 05:19:12,200
Let's talk about producers.
7339
05:19:12,200 --> 05:19:13,845
So producers are the one
7340
05:19:13,845 --> 05:19:17,000
who publishes the data
to topics of their choice.
7341
05:19:17,000 --> 05:19:18,600
Then you have consumers
7342
05:19:18,600 --> 05:19:21,786
so consumers can subscribe
to one or more topic.
7343
05:19:21,786 --> 05:19:22,910
And consume data
7344
05:19:22,910 --> 05:19:26,773
from those topics. Now, consumers
basically label themselves
7345
05:19:26,773 --> 05:19:28,600
with a consumer group name
7346
05:19:28,600 --> 05:19:31,900
and each record publish
to a topic is delivered
7347
05:19:31,900 --> 05:19:35,703
to one consumer instance within
each subscribing consumer group.
7348
05:19:35,703 --> 05:19:37,536
So suppose you have
a consumer group.
7349
05:19:37,536 --> 05:19:40,072
Let's say consumer Group
1 and then you have
7350
05:19:40,072 --> 05:19:41,900
three consumers residing in it.
7351
05:19:41,900 --> 05:19:45,400
that is, consumer A, consumer B,
and consumer C. Now,
7352
05:19:45,400 --> 05:19:47,015
from the sales topic,
7353
05:19:47,100 --> 05:19:51,600
Each record can be read once
by consumer group one, and it
7354
05:19:51,600 --> 05:19:56,200
can either be read by consumer A
or consumer B or consumer C,
7355
05:19:56,200 --> 05:20:00,337
but it can only be consumed once
by the single consumer group
7356
05:20:00,337 --> 05:20:02,200
that is consumer group one.
7357
05:20:02,200 --> 05:20:05,700
But again, you can have
multiple consumer groups
7358
05:20:05,700 --> 05:20:07,700
which can subscribe to a topic
7359
05:20:07,700 --> 05:20:11,260
where one record can be consumed
by multiple consumers.
7360
05:20:11,260 --> 05:20:14,226
That is one consumer
from each consumer group.
7361
05:20:14,226 --> 05:20:16,842
So now let's say
you have consumer group one
7362
05:20:16,842 --> 05:20:19,291
and consumer group
two. In consumer group
7363
05:20:19,291 --> 05:20:20,600
1, we have two consumers,
7364
05:20:20,600 --> 05:20:22,854
that is consumer 1A
and consumer 1B,
7365
05:20:22,854 --> 05:20:24,400
and in consumer group two we
7366
05:20:24,400 --> 05:20:27,819
have two consumers, consumer 2A
and consumer 2B. So
7367
05:20:27,819 --> 05:20:30,229
if consumer Group
1 and consumer group
7368
05:20:30,229 --> 05:20:32,900
2 are consuming messages
from topic sales.
7369
05:20:32,900 --> 05:20:36,000
So the single record will be
consumed by consumer group one
7370
05:20:36,000 --> 05:20:39,111
as well as consumer group
2 and a single consumer
7371
05:20:39,111 --> 05:20:43,000
from each of the consumer groups
will consume the record once so,
7372
05:20:43,000 --> 05:20:45,900
I guess you are clear
with the concept of consumer
7373
05:20:45,900 --> 05:20:49,124
and consumer group. Now,
consumer instances can be
7374
05:20:49,124 --> 05:20:51,800
a separate process
or separate machines.
7375
05:20:51,900 --> 05:20:55,918
Now talking about brokers: a broker
is nothing but a single machine
7376
05:20:55,918 --> 05:20:57,300
in the Kafka cluster,
7377
05:20:57,300 --> 05:21:00,800
and zookeeper is another Apache
open source project.
7378
05:21:00,800 --> 05:21:03,536
It stores metadata
information related
7379
05:21:03,536 --> 05:21:04,700
to the Kafka cluster,
7380
05:21:04,700 --> 05:21:08,100
like broker information,
topic details, etc.
7381
05:21:08,100 --> 05:21:09,933
Zookeeper is basically the one
7382
05:21:09,933 --> 05:21:12,316
who is managing
the whole Kafka cluster.
7383
05:21:12,316 --> 05:21:14,700
Now, let's quickly go
to the next slide.
7384
05:21:14,700 --> 05:21:16,900
So suppose you have a topic.
7385
05:21:16,900 --> 05:21:21,100
Let's assume this is topic sales
and you have four partitions,
7386
05:21:21,100 --> 05:21:23,900
so you have Partition
0 partition 1 partition
7387
05:21:23,900 --> 05:21:27,600
2, and partition 3. Now you
have five brokers over here.
7388
05:21:27,614 --> 05:21:30,768
Now, let's take the case
of partition 1 so
7389
05:21:30,850 --> 05:21:34,800
if the replication factor
is 3 it will have 3 copies
7390
05:21:34,800 --> 05:21:37,100
which will reside
on different Brokers.
7391
05:21:37,100 --> 05:21:40,121
So one replica is
on broker 2, the next is
7392
05:21:40,121 --> 05:21:43,000
on broker 3, and the next is
on broker 5, and
7393
05:21:43,000 --> 05:21:44,800
as you can see, replica 5,
7394
05:21:45,000 --> 05:21:47,800
so this 5 is from this broker 5.
7395
05:21:48,100 --> 05:21:52,500
So the ID of the replica
is the same as the ID of the broker
7396
05:21:52,500 --> 05:21:55,700
that hosts it now moving ahead.
7397
05:21:55,700 --> 05:21:57,100
One of the replica
7398
05:21:57,100 --> 05:22:00,800
of partition one will serve
as the leader replica.
7399
05:22:00,800 --> 05:22:02,074
So now the leader
7400
05:22:02,074 --> 05:22:06,200
of partition one is replica
five and any consumer coming
7401
05:22:06,200 --> 05:22:07,684
and consuming messages
7402
05:22:07,684 --> 05:22:10,944
from partition one will
be served by this replica.
7403
05:22:10,944 --> 05:22:14,635
And these two replicas are
basically for fault tolerance,
7404
05:22:14,635 --> 05:22:17,343
so that once your
broker 5 goes off
7405
05:22:17,343 --> 05:22:19,264
or your disc becomes corrupt,
7406
05:22:19,264 --> 05:22:21,115
so your replica 3 or replica
7407
05:22:21,115 --> 05:22:24,100
to to one of them
will again serve as a leader
7408
05:22:24,100 --> 05:22:26,938
and this is basically
decided on the basis
7409
05:22:26,938 --> 05:22:28,600
of the most in-sync replica.
7410
05:22:28,600 --> 05:22:30,587
So the replica
which will be most
7411
05:22:30,587 --> 05:22:34,100
in sync with this replica
will become the next leader.
7412
05:22:34,100 --> 05:22:36,700
So similarly this
partition 0 may reside
7413
05:22:36,700 --> 05:22:40,400
on broker 1, broker 2,
and broker 3. Again,
7414
05:22:40,400 --> 05:22:44,500
your partition 2 may
reside on broker 4, broker
7415
05:22:44,500 --> 05:22:46,800
5, and say broker 1,
7416
05:22:46,900 --> 05:22:49,500
and then your third
partition might reside
7417
05:22:49,500 --> 05:22:51,500
on these three brokers.
7418
05:22:51,700 --> 05:22:54,900
So suppose that this is
the leader for partition
7419
05:22:54,900 --> 05:22:56,378
2, this is the leader
7420
05:22:56,378 --> 05:22:59,900
for partition 0 this is
the leader for partition 3.
7421
05:22:59,900 --> 05:23:02,900
This is the leader
for partition 1 right
7422
05:23:02,900 --> 05:23:03,600
so you can see
7423
05:23:03,600 --> 05:23:08,300
that four consumers can consume
data parallelly from these brokers,
7424
05:23:08,300 --> 05:23:10,798
so it can consume
data from partition
7425
05:23:10,798 --> 05:23:14,200
2, this consumer can consume
data from partition 0
7426
05:23:14,200 --> 05:23:17,800
and similarly for partition
3 and partition 1.
7427
05:23:18,100 --> 05:23:21,500
Now, maintaining
the replicas basically helps
7428
05:23:21,500 --> 05:23:25,433
in fault tolerance, and keeping
different partition leaders
7429
05:23:25,433 --> 05:23:29,300
on different Brokers basically
helps in parallel execution
7430
05:23:29,300 --> 05:23:32,300
or you can say parallelly
consuming those messages.
7431
05:23:32,300 --> 05:23:34,391
So I hope that you
guys are clear
7432
05:23:34,391 --> 05:23:36,972
about topics, partitions,
and replicas. Now,
7433
05:23:36,972 --> 05:23:38,803
let's move to our next slide.
7434
05:23:38,803 --> 05:23:42,062
So this is how the whole
Kafka cluster looks. You
7435
05:23:42,062 --> 05:23:43,567
have multiple producers,
7436
05:23:43,567 --> 05:23:46,200
which is again producing
messages to Kafka.
7437
05:23:46,200 --> 05:23:48,600
Then this whole is
the Kafka cluster
7438
05:23:48,600 --> 05:23:51,590
where you have two nodes. Node
one has two brokers,
7439
05:23:51,590 --> 05:23:55,128
broker one and broker two,
and node two has two brokers,
7440
05:23:55,128 --> 05:23:58,600
which are broker three and broker
four. Again, consumers
7441
05:23:58,600 --> 05:24:01,434
will be consuming data
from these Brokers
7442
05:24:01,434 --> 05:24:03,135
and zookeeper is the one
7443
05:24:03,135 --> 05:24:05,900
who is managing
this whole Kafka cluster.
7444
05:24:06,200 --> 05:24:07,100
Now, let's look
7445
05:24:07,100 --> 05:24:10,688
at some basic commands of Kafka
and understand how Kafka Works
7446
05:24:10,688 --> 05:24:12,500
how to go ahead
and start zookeeper
7447
05:24:12,500 --> 05:24:14,708
how to go ahead
and start Kafka server
7448
05:24:14,708 --> 05:24:16,200
and how to again go ahead
7449
05:24:16,200 --> 05:24:19,141
and produce some messages
to Kafka and then consume
7450
05:24:19,141 --> 05:24:20,600
some messages from Kafka.
7451
05:24:20,600 --> 05:24:21,800
So let me quickly switch
7452
05:24:21,800 --> 05:24:27,200
to my VM. So let me
quickly open the terminal.
7453
05:24:28,600 --> 05:24:31,400
Let me quickly go ahead
and execute sudo jps
7454
05:24:31,400 --> 05:24:33,180
so that I can check
all the daemons
7455
05:24:33,180 --> 05:24:34,800
that are running in my system.
7456
05:24:35,400 --> 05:24:37,095
So you can see I have NameNode,
7457
05:24:37,095 --> 05:24:40,800
DataNode, ResourceManager,
NodeManager, and JobHistoryServer.
7458
05:24:42,000 --> 05:24:43,933
So now as all the HDFS daemons
7459
05:24:43,933 --> 05:24:46,200
are running, let us
quickly go ahead
7460
05:24:46,200 --> 05:24:48,100
and start Kafka services.
7461
05:24:48,100 --> 05:24:50,561
So first I will go
to Kafka home.
7462
05:24:51,400 --> 05:24:53,800
So let me show
you the directory.
7463
05:24:53,800 --> 05:24:56,200
So my Kafka is in /usr/lib.
7464
05:24:56,600 --> 05:24:56,900
Now.
7465
05:24:56,900 --> 05:25:00,088
Let me quickly go ahead
and start zookeeper service.
7466
05:25:00,088 --> 05:25:01,087
But before that,
7467
05:25:01,087 --> 05:25:03,900
let me show you
the zookeeper.properties file.
7468
05:25:06,415 --> 05:25:10,800
So the client port is 2181, so
my zookeeper will be running
7469
05:25:10,800 --> 05:25:12,300
on port 2181,
7470
05:25:12,600 --> 05:25:15,400
and the data directory
in which my zookeeper
7471
05:25:15,400 --> 05:25:19,700
will store all the metadata
is /tmp/zookeeper.
7472
05:25:20,000 --> 05:25:23,200
So let us quickly go ahead
and start zookeeper
7473
05:25:23,400 --> 05:25:28,300
and the command is bin/
zookeeper-server-start.sh.
7474
05:25:28,900 --> 05:25:30,500
So this is the script file
7475
05:25:30,500 --> 05:25:33,300
and then I'll pass
the properties file
7476
05:25:33,357 --> 05:25:37,988
which is inside config directory
and I'll hit enter. Meanwhile,
7477
05:25:37,988 --> 05:25:39,834
let me open another tab.
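The command being typed here, assuming we are inside the Kafka home directory:

```bash
bin/zookeeper-server-start.sh config/zookeeper.properties
```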
7478
05:25:40,403 --> 05:25:44,096
So here I will be starting
my first Kafka broker.
7479
05:25:44,200 --> 05:25:47,200
But before that let me show
you the properties file.
7480
05:25:47,576 --> 05:25:50,423
So we'll go
in config directory again,
7481
05:25:51,100 --> 05:25:53,700
and I have
the server.properties file.
7482
05:25:54,400 --> 05:25:58,300
So this is the properties
of my first Kafka broker.
7483
05:25:59,507 --> 05:26:01,892
So first we have server Basics.
7484
05:26:02,300 --> 05:26:06,400
So here the broker ID
of my first broker is 0 then
7485
05:26:06,400 --> 05:26:10,700
the port is 9092, on which
my first broker will be running.
7486
05:26:11,400 --> 05:26:14,500
So it contains all
the socket server settings
7487
05:26:14,657 --> 05:26:16,042
then moving ahead.
7488
05:26:16,049 --> 05:26:17,555
We have Log Basics.
7489
05:26:17,555 --> 05:26:21,139
So in that log Basics,
this is log directory,
7490
05:26:21,200 --> 05:26:23,500
which is /tmp/kafka-
7491
05:26:23,500 --> 05:26:26,400
logs, so over here
my Kafka will store
7492
05:26:26,400 --> 05:26:28,226
all those messages or records,
7493
05:26:28,226 --> 05:26:30,600
which will be produced
by The Producers.
7494
05:26:30,600 --> 05:26:31,799
So all the records
7495
05:26:31,799 --> 05:26:35,600
which belongs to broker 0
will be stored at this location.
7496
05:26:35,900 --> 05:26:39,200
Now, the next section is
internal topic settings
7497
05:26:39,200 --> 05:26:40,900
in which the offsets topic
7498
05:26:40,900 --> 05:26:42,500
replication factor is 1,
7499
05:26:42,500 --> 05:26:48,100
then the transaction state log
replication factor is 1. Next
7500
05:26:48,384 --> 05:26:50,615
we have log retention policy.
7501
05:26:50,900 --> 05:26:54,500
So the log retention
hours is 168.
7502
05:26:54,500 --> 05:26:58,319
So your records will be stored
for 168 hours by default
7503
05:26:58,319 --> 05:27:00,300
and then it will be deleted.
7504
05:27:00,300 --> 05:27:02,300
Then you have
zookeeper properties
7505
05:27:02,300 --> 05:27:05,100
where we have specified
zookeeper.connect, and
7506
05:27:05,100 --> 05:27:07,482
as we have seen
in the zookeeper.properties file
7507
05:27:07,482 --> 05:27:10,000
that our ZooKeeper
will be running on port 2181,
7508
05:27:10,000 --> 05:27:12,000
so we are giving
the address of Zookeeper
7509
05:27:12,000 --> 05:27:13,900
that is
localhost:2181,
7510
05:27:14,300 --> 05:27:15,911
and at last we have group
7511
05:27:15,911 --> 05:27:18,700
coordinator settings. So
let us quickly go ahead
7512
05:27:18,700 --> 05:27:20,700
and start the first broker.
7513
05:27:21,457 --> 05:27:24,842
So the script file is
kafka-server-start.sh
7514
05:27:24,900 --> 05:27:27,100
and then we have to give
the properties file,
7515
05:27:27,200 --> 05:27:31,000
which is server.properties
for the first broker.
7516
05:27:31,200 --> 05:27:35,276
I'll hit enter and meanwhile,
let me open another tab.
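So the full command for the first broker, run from the Kafka home directory, looks like this:

```bash
bin/kafka-server-start.sh config/server.properties
```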
7517
05:27:36,234 --> 05:27:39,865
now I'll show you
the next properties file,
7518
05:27:40,200 --> 05:27:43,400
which is server-1.
7519
05:27:43,400 --> 05:27:44,600
properties.
7520
05:27:45,300 --> 05:27:46,400
So the things
7521
05:27:46,400 --> 05:27:50,700
which you have to change
for creating a new broker
7522
05:27:51,000 --> 05:27:54,700
is first you have
to change the broker ID.
7523
05:27:54,900 --> 05:27:59,100
So my earlier broker ID was 0;
the new broker ID is 1 again,
7524
05:27:59,100 --> 05:28:02,255
you can replicate this file
and for a new server,
7525
05:28:02,255 --> 05:28:05,059
you have to change
the broker ID to 2. Then
7526
05:28:05,059 --> 05:28:08,513
you have to change the port
because on 9092 already
7527
05:28:08,513 --> 05:28:11,200
my first broker is running,
that is broker 0,
7528
05:28:11,200 --> 05:28:12,019
so my new broker
7529
05:28:12,019 --> 05:28:14,099
should connect to
a different port
7530
05:28:14,099 --> 05:28:17,000
and here I have specified
9093.
7531
05:28:17,700 --> 05:28:21,600
The next thing you have
to change is the log directory.
7532
05:28:21,600 --> 05:28:25,830
So here I have appended -1
to the default log directory.
7533
05:28:25,830 --> 05:28:27,400
So all these records
7534
05:28:27,400 --> 05:28:30,600
which are stored by my broker
one will be going
7535
05:28:30,600 --> 05:28:32,505
to this particular directory
7536
05:28:32,505 --> 05:28:35,500
that is
/tmp/kafka-logs-
7537
05:28:35,500 --> 05:28:39,500
1. And the rest of the
things are similar,
7538
05:28:39,700 --> 05:28:42,900
so let me quickly go ahead
and start second broker as well.
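To summarize, these are the settings changed in server-1.properties relative to server.properties, as described above (the exact property keys are assumed to follow the standard Kafka config names):

```properties
broker.id=1
# second broker listens on its own port, since 9092 is taken by broker 0
listeners=PLAINTEXT://localhost:9093
# separate log directory so each broker keeps its own records
log.dirs=/tmp/kafka-logs-1
```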
7539
05:28:45,800 --> 05:28:48,000
And let me open
one more terminal.
7540
05:28:51,569 --> 05:28:54,030
And I'll start
the third broker as well.
7541
05:29:01,400 --> 05:29:06,475
So the ZooKeeper started, then
broker one is also started,
7542
05:29:06,475 --> 05:29:09,700
and this is broker
two, which is also started,
7543
05:29:09,702 --> 05:29:11,472
and this is broker three.
7544
05:29:12,600 --> 05:29:14,600
So now let me
quickly minimize this
7545
05:29:15,200 --> 05:29:17,300
and I'll open a new terminal.
7546
05:29:18,000 --> 05:29:20,800
Now first, let us look
at some commands related
7547
05:29:20,800 --> 05:29:21,900
to Kafka topics.
7548
05:29:21,900 --> 05:29:24,900
So I'll quickly go ahead
and create a topic.
7549
05:29:25,250 --> 05:29:29,250
So again, let me first go
to my Kafka home directory.
7550
05:29:31,700 --> 05:29:36,000
Then the script file
is kafka-topics.sh,
7551
05:29:36,000 --> 05:29:37,762
then the first parameter
7552
05:29:37,762 --> 05:29:41,800
is --create, then we have to give
the address of ZooKeeper,
7553
05:29:41,800 --> 05:29:43,327
because zookeeper is the one
7554
05:29:43,327 --> 05:29:46,000
who is actually containing
all the details related
7555
05:29:46,000 --> 05:29:47,000
to your topic.
7556
05:29:47,700 --> 05:29:50,600
So the address of my zookeeper
is localhost:2181,
7557
05:29:50,700 --> 05:29:53,000
then we'll give the topic name.
7558
05:29:53,000 --> 05:29:56,076
So let me name the topic
as Kafka -
7559
05:29:56,076 --> 05:30:00,000
spark. Next we have to specify
the replication factor
7560
05:30:00,000 --> 05:30:01,100
of the topic.
7561
05:30:01,300 --> 05:30:04,900
So it will replicate all
the partitions inside the topic
7562
05:30:04,900 --> 05:30:05,700
that many times.
7563
05:30:06,600 --> 05:30:08,300
So --replication-
7564
05:30:08,300 --> 05:30:10,900
factor: as we
have three Brokers,
7565
05:30:10,900 --> 05:30:15,600
so let me keep it as 3
and then we have partitions.
7566
05:30:15,800 --> 05:30:17,074
So I will keep it as
7567
05:30:17,074 --> 05:30:19,746
three because we have
three Brokers running
7568
05:30:19,746 --> 05:30:21,689
and our consumer can go ahead
7569
05:30:21,689 --> 05:30:23,700
and consume messages parallely
7570
05:30:23,700 --> 05:30:27,010
from three Brokers and
let me press enter.
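Putting the whole command together as dictated:

```bash
bin/kafka-topics.sh --create \
  --zookeeper localhost:2181 \
  --replication-factor 3 \
  --partitions 3 \
  --topic kafka-spark
```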
7571
05:30:29,300 --> 05:30:32,000
So now you can see
the topic is created.
7572
05:30:32,000 --> 05:30:35,100
Now, let us quickly go ahead
and list all the topics.
7573
05:30:35,100 --> 05:30:36,100
So the command
7574
05:30:36,100 --> 05:30:40,200
for listing all the topics
is ./bin again;
7575
05:30:40,200 --> 05:30:44,200
we'll open the kafka-
topics script file, then -
7576
05:30:44,200 --> 05:30:48,300
- list, and again we'll provide
the address of Zookeeper.
7577
05:30:48,700 --> 05:30:50,000
So to again list the topics,
7578
05:30:50,000 --> 05:30:53,674
we have to first go to
the kafka-topics script file.
7579
05:30:53,674 --> 05:30:55,200
Then we have to give -
7580
05:30:55,200 --> 05:30:59,300
- list parameter and next we
have to give the zookeeper address,
7581
05:30:59,576 --> 05:31:02,423
which is localhost
2181, and I'll hit enter.
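So the assembled list command is roughly:

  ./bin/kafka-topics.sh --list --zookeeper localhost:2181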
7582
05:31:04,100 --> 05:31:07,000
And you can see
I have this Kafka -
7583
05:31:07,000 --> 05:31:11,000
spark. The kafka-spark
topic has been created.
7584
05:31:11,100 --> 05:31:11,407
Now.
7585
05:31:11,407 --> 05:31:14,176
Let me show you
one more thing again.
7586
05:31:14,176 --> 05:31:18,900
We'll go to bin kafka-
topics dot sh
7587
05:31:19,000 --> 05:31:21,100
and we'll describe this topic.
7588
05:31:21,900 --> 05:31:24,600
I will pass the address
of zookeeper,
7589
05:31:24,800 --> 05:31:26,300
which is localhost
7590
05:31:26,600 --> 05:31:30,600
2181, and then
I'll pass the topic name,
7591
05:31:31,000 --> 05:31:34,700
which is kafka-spark.
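Assembled, the describe command is roughly:

  ./bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic kafka-spark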
7592
05:31:36,400 --> 05:31:37,600
So now you can see here.
7593
05:31:37,600 --> 05:31:40,100
The topic is kafka-spark.
7594
05:31:40,100 --> 05:31:43,400
The partition count is
3 the replication factor is 3
7595
05:31:43,400 --> 05:31:45,600
and the config is as follows.
7596
05:31:45,700 --> 05:31:49,900
So here you can see all the
three partitions of the topic
7597
05:31:49,900 --> 05:31:54,400
that is partition 0 partition 1
and partition 2 then the leader
7598
05:31:54,400 --> 05:31:57,400
for partition 0 is
broker 2, the leader
7599
05:31:57,400 --> 05:31:59,417
for partition one is broker 0
7600
05:31:59,417 --> 05:32:02,200
and leader for partition
2 is broker 1,
7601
05:32:02,200 --> 05:32:06,194
so you can see we have different
partition leaders residing on
7602
05:32:06,194 --> 05:32:09,600
different brokers. So this is
basically for load balancing.
7603
05:32:09,600 --> 05:32:11,900
so that different partitions
could be served
7604
05:32:11,900 --> 05:32:13,000
from different Brokers
7605
05:32:13,000 --> 05:32:15,413
and it could be
consumed parallely again,
7606
05:32:15,413 --> 05:32:16,800
you can see the replica
7607
05:32:16,800 --> 05:32:20,512
of this partition is residing
in all the three brokers, same
7608
05:32:20,512 --> 05:32:23,200
with partition 1 and same
with partition 2,
7609
05:32:23,200 --> 05:32:25,700
and it's showing you
the in-sync replicas.
7610
05:32:25,700 --> 05:32:27,100
So in the in-sync replicas,
7611
05:32:27,100 --> 05:32:30,600
the first is 2, then you have 0
and then you have 1
7612
05:32:30,600 --> 05:32:33,600
and similarly with
partition 1 and 2.
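Laid out, the describe output being read here looks roughly like this (values that are not read out are left as ...):

  Topic: kafka-spark  PartitionCount: 3  ReplicationFactor: 3  Configs:
    Partition: 0  Leader: 2  Replicas: ...  Isr: 2,0,1
    Partition: 1  Leader: 0  Replicas: ...  Isr: ...
    Partition: 2  Leader: 1  Replicas: ...  Isr: ...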
7613
05:32:33,900 --> 05:32:35,100
So now let us quickly
7614
05:32:35,100 --> 05:32:35,900
go ahead.
7615
05:32:36,500 --> 05:32:38,346
I'll reduce this to half.
7616
05:32:40,000 --> 05:32:42,200
Let me open one more terminal.
7617
05:32:43,300 --> 05:32:45,200
The reason why I'm doing this is
7618
05:32:45,200 --> 05:32:48,600
that we can actually produce
messages from one console
7619
05:32:48,600 --> 05:32:51,700
and then we can receive
the message in another console.
7620
05:32:51,707 --> 05:32:56,092
So for that I'll start the Kafka
console producer first.
7621
05:32:56,396 --> 05:32:57,703
So the command is
7622
05:32:58,000 --> 05:33:04,400
dot slash bin kafka-
console-producer dot sh
7623
05:33:04,400 --> 05:33:06,100
and then in case
7624
05:33:06,100 --> 05:33:11,400
of producer you have to give
the parameter as broker-list,
7625
05:33:11,800 --> 05:33:18,000
which is localhost 9092. You
can provide any of the brokers
7626
05:33:18,000 --> 05:33:19,000
that is running
7627
05:33:19,000 --> 05:33:22,400
and it will again take the rest
of the brokers from there.
7628
05:33:22,400 --> 05:33:25,794
So you just have to provide
the address of one broker.
7629
05:33:25,794 --> 05:33:28,100
You can also provide
a set of brokers,
7630
05:33:28,100 --> 05:33:30,000
so you can give it
as localhost:
7631
05:33:30,000 --> 05:33:33,800
9092, localhost:
9093 and similarly.
7632
05:33:33,800 --> 05:33:35,800
So here I am passing the address
7633
05:33:35,800 --> 05:33:39,700
of the first broker now next
I have to mention the topic.
7634
05:33:39,700 --> 05:33:41,900
So the topic is kafka-spark.
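So the full producer command is roughly:

  ./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic kafka-spark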
7635
05:33:43,700 --> 05:33:45,161
And I'll hit enter.
7636
05:33:45,500 --> 05:33:47,900
So my console
producer is started.
7637
05:33:47,900 --> 05:33:50,600
Let me produce
a message saying hi.
7638
05:33:51,000 --> 05:33:53,376
Now in the second terminal
I will go ahead
7639
05:33:53,376 --> 05:33:55,200
and start the console consumer.
7640
05:33:55,500 --> 05:34:00,700
So again, the command is
kafka-console-consumer dot sh
7641
05:34:00,800 --> 05:34:03,000
and then in case of consumer,
7642
05:34:03,000 --> 05:34:06,600
you have to give the parameter
as bootstrap server.
7643
05:34:07,800 --> 05:34:10,400
So this is the thing
to notice guys that in case
7644
05:34:10,400 --> 05:34:13,600
of producer you have to give
the broker list, but in
7645
05:34:13,600 --> 05:34:14,725
case of consumer,
7646
05:34:14,725 --> 05:34:19,000
you have to give bootstrap
server and it is again the same
7647
05:34:19,000 --> 05:34:23,389
that is localhost 9092, which is
the address of my broker 0,
7648
05:34:23,500 --> 05:34:25,807
and then I will give the topic
7649
05:34:25,807 --> 05:34:30,700
which is kafka-spark.
Now adding this parameter
7650
05:34:30,700 --> 05:34:32,100
that is from -
7651
05:34:32,100 --> 05:34:35,800
beginning will basically
give me messages stored
7652
05:34:35,800 --> 05:34:37,926
in that topic from beginning.
7653
05:34:37,926 --> 05:34:41,300
Otherwise, if I'm not giving
this parameter - -
7654
05:34:41,300 --> 05:34:43,200
from beginning I'll only
7655
05:34:43,200 --> 05:34:44,630
get the recent messages
7656
05:34:44,630 --> 05:34:48,300
that have been produced after
starting this console consumer.
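Assembled, the consumer command is roughly:

  ./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
    --topic kafka-spark --from-beginning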
7657
05:34:48,300 --> 05:34:49,484
So let me hit enter
7658
05:34:49,484 --> 05:34:52,600
and you can see I'll get
a message saying hi first.
7659
05:34:55,700 --> 05:34:57,267
Well, I'm sorry guys.
7660
05:34:57,267 --> 05:35:00,400
The topic name I
have given is not correct.
7661
05:35:00,400 --> 05:35:01,784
Sorry for my typo.
7662
05:35:01,784 --> 05:35:03,707
Let me quickly correct it.
7663
05:35:04,300 --> 05:35:05,800
And let me hit enter.
7664
05:35:06,800 --> 05:35:10,300
So as you can see,
I am receiving the messages.
7665
05:35:10,300 --> 05:35:13,900
I received 'hi'. Then let
me produce some more messages.
7666
05:35:19,200 --> 05:35:21,600
So now you can see
all the messages
7667
05:35:21,600 --> 05:35:22,858
that I am producing
7668
05:35:22,858 --> 05:35:26,900
from console producer is getting
consumed by console consumer.
7669
05:35:26,900 --> 05:35:30,466
Now this console producer
as well as console consumer
7670
05:35:30,466 --> 05:35:31,838
is basically used by
7671
05:35:31,838 --> 05:35:35,200
the developers to actually
test the Kafka cluster.
7672
05:35:35,200 --> 05:35:37,100
So what happens if you are
7673
05:35:37,100 --> 05:35:38,300
if there is a producer
7674
05:35:38,300 --> 05:35:40,300
which is running and
which is producing
7675
05:35:40,300 --> 05:35:43,196
those messages to Kafka
then you can go ahead
7676
05:35:43,196 --> 05:35:45,558
and you can start console
consumer and check
7677
05:35:45,558 --> 05:35:47,500
whether the producer
is producing
7678
05:35:47,500 --> 05:35:49,900
messages or not,
or you can again go ahead
7679
05:35:49,900 --> 05:35:50,900
and check the format
7680
05:35:50,900 --> 05:35:53,860
in which your message are
getting produced to the topic.
7681
05:35:53,860 --> 05:35:56,988
Those kind of testing part
is done using console consumer
7682
05:35:56,988 --> 05:35:59,000
and similarly using
console producer.
7683
05:35:59,000 --> 05:36:01,500
You do something
like you are creating a consumer
7684
05:36:01,500 --> 05:36:04,900
so you can go ahead you can
produce a message to Kafka topic
7685
05:36:04,900 --> 05:36:06,000
and then you can check
7686
05:36:06,000 --> 05:36:08,700
whether your consumer is
consuming that message or not.
7687
05:36:08,700 --> 05:36:11,049
This is basically used
for testing now,
7688
05:36:11,049 --> 05:36:13,400
let us quickly go ahead
and close this.
7689
05:36:15,700 --> 05:36:18,700
Now let us get back
to our slides now.
7690
05:36:18,700 --> 05:36:20,605
I have briefly covered Kafka
7691
05:36:20,605 --> 05:36:24,300
and the concepts of Kafka so
here basically I'm giving
7692
05:36:24,300 --> 05:36:27,200
you a small brief idea
about what Kafka is
7693
05:36:27,200 --> 05:36:29,100
and how Kafka works now
7694
05:36:29,100 --> 05:36:32,100
as we have understood why
we need messaging systems.
7695
05:36:32,100 --> 05:36:33,100
What is Kafka?
7696
05:36:33,100 --> 05:36:35,000
What are different
terminologies in Kafka,
7697
05:36:35,000 --> 05:36:36,657
how Kafka architecture works
7698
05:36:36,657 --> 05:36:39,513
and we have seen some
of the basic Kafka commands.
7699
05:36:39,513 --> 05:36:41,000
So let us now understand.
7700
05:36:41,000 --> 05:36:42,600
What is Apache spark.
7701
05:36:42,800 --> 05:36:44,900
So basically Apache spark
7702
05:36:44,900 --> 05:36:47,802
is an open-source cluster
computing framework
7703
05:36:47,802 --> 05:36:51,300
for near real-time processing.
Now Spark provides
7704
05:36:51,300 --> 05:36:54,205
an interface for programming
the entire cluster
7705
05:36:54,205 --> 05:36:56,047
with implicit data parallelism
7706
05:36:56,047 --> 05:36:59,300
and fault tolerance. We'll talk
about how Spark provides
7707
05:36:59,300 --> 05:37:02,900
fault tolerance but talking
about implicit data parallelism.
7708
05:37:02,900 --> 05:37:06,600
That means you do not need
any special directives, operators,
7709
05:37:06,600 --> 05:37:09,000
or functions to enable
parallel execution.
7710
05:37:09,000 --> 05:37:12,600
Spark by default provides
the data parallelism. Spark
7711
05:37:12,600 --> 05:37:15,628
is designed to cover
a wide range of workloads such
7712
05:37:15,628 --> 05:37:16,919
as batch applications,
7713
05:37:16,919 --> 05:37:20,400
iterative algorithms, interactive
queries, machine learning
7714
05:37:20,400 --> 05:37:22,000
algorithms and streaming.
7715
05:37:22,000 --> 05:37:24,174
So basically the main feature
7716
05:37:24,174 --> 05:37:27,500
of Spark is its
in-memory cluster computing
7717
05:37:27,500 --> 05:37:30,900
that increases the processing
speed of the application.
7718
05:37:30,900 --> 05:37:34,763
So what Spark does: Spark does
not store the data in disks,
7719
05:37:34,763 --> 05:37:36,950
but it
transforms the data
7720
05:37:36,950 --> 05:37:38,700
and keeps the data in memory.
7721
05:37:38,700 --> 05:37:39,616
So that quickly
7722
05:37:39,616 --> 05:37:42,500
multiple operations can
be applied over the data
7723
05:37:42,500 --> 05:37:45,500
and the final result
is only stored in the disk
7724
05:37:45,500 --> 05:37:49,629
Now on the other side, Spark can also do
batch processing a hundred times
7725
05:37:49,629 --> 05:37:51,108
faster than MapReduce.
7726
05:37:51,108 --> 05:37:54,400
And this is the reason why
Apache Spark is the go-
7727
05:37:54,400 --> 05:37:57,324
to tool for big data processing
in the industry.
7728
05:37:57,324 --> 05:38:00,000
Now, let's quickly move
ahead and understand
7729
05:38:00,000 --> 05:38:01,461
how spark does this
7730
05:38:01,600 --> 05:38:03,617
so the answer is RDD,
7731
05:38:03,617 --> 05:38:07,700
that is resilient distributed
data sets. Now an RDD is
7732
05:38:07,700 --> 05:38:11,406
a read-only partitioned
collection of records and you
7733
05:38:11,406 --> 05:38:14,897
can say it is a fundamental
data structure of Spark.
7734
05:38:14,897 --> 05:38:16,312
So basically, RDD is
7735
05:38:16,312 --> 05:38:19,522
an immutable distributed
collection of objects.
7736
05:38:19,522 --> 05:38:21,709
So each data set
in rdd is divided
7737
05:38:21,709 --> 05:38:23,300
into logical partitions,
7738
05:38:23,300 --> 05:38:25,639
which may be computed
on different nodes
7739
05:38:25,639 --> 05:38:28,400
of the cluster. Now an RDD
can contain any type
7740
05:38:28,400 --> 05:38:30,800
of Python, Java or Scala objects.
7741
05:38:30,800 --> 05:38:33,900
Now talking about
the fault tolerance, RDD
7742
05:38:33,900 --> 05:38:37,900
is a fault-tolerant collection
of elements that can be operated
7743
05:38:37,900 --> 05:38:39,000
on in parallel.
7744
05:38:39,000 --> 05:38:40,500
Now, how does RDD do
7745
05:38:40,500 --> 05:38:43,380
that? If an RDD is lost,
it will automatically
7746
05:38:43,380 --> 05:38:45,609
be recomputed by using the original
7747
05:38:45,609 --> 05:38:49,300
transformations, and this is how Spark
provides fault tolerance.
7748
05:38:49,300 --> 05:38:51,255
So I hope that you
guys are clear
7749
05:38:51,255 --> 05:38:53,700
how Spark
provides fault tolerance.
7750
05:38:54,132 --> 05:38:57,500
Now let's talk about
how we can create rdds.
7751
05:38:57,500 --> 05:39:01,600
So there are two ways to create
RDDs. First is parallelizing
7752
05:39:01,600 --> 05:39:04,474
an existing collection
in your driver program,
7753
05:39:04,474 --> 05:39:06,200
or you can refer to a data set
7754
05:39:06,200 --> 05:39:09,300
in an external storage system
such as a shared file system.
7755
05:39:09,300 --> 05:39:11,300
It can be HDFS, HBase,
7756
05:39:11,300 --> 05:39:15,200
or any other data source
offering a Hadoop input format
7757
05:39:15,200 --> 05:39:16,800
Now Spark makes use
7758
05:39:16,800 --> 05:39:20,200
of the concept of rdd to achieve
fast and efficient operations.
7759
05:39:20,200 --> 05:39:22,600
Now, let's quickly move ahead
7760
05:39:22,600 --> 05:39:27,200
and look at how RDD works. So
first we create an RDD,
7761
05:39:27,200 --> 05:39:29,600
which you can create
either by referring
7762
05:39:29,600 --> 05:39:31,800
to an external storage system.
7763
05:39:31,800 --> 05:39:35,400
And then once you create
an RDD you can go ahead
7764
05:39:35,400 --> 05:39:37,800
and you can apply
multiple Transformations
7765
05:39:37,800 --> 05:39:38,800
over that RDD.
7766
05:39:39,100 --> 05:39:43,100
Like we'll perform
filter, map, union, etc.
7767
05:39:43,100 --> 05:39:44,219
And then again,
7768
05:39:44,219 --> 05:39:48,400
it gives you a new RDD, or you
can say, the transformed RDD,
7769
05:39:48,400 --> 05:39:51,500
and at last you apply
some action and get
7770
05:39:51,500 --> 05:39:55,100
the result. Now this action
can be count, first,
7771
05:39:55,100 --> 05:39:57,149
or collect, all those kinds
7772
05:39:57,149 --> 05:39:58,100
of functions.
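As a small illustrative sketch (not the instructor's code), the whole create-transform-action workflow in Java looks roughly like this:

  import java.util.Arrays;
  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;

  public class RddWorkflow {
      public static void main(String[] args) {
          SparkConf conf = new SparkConf().setAppName("rdd-workflow").setMaster("local[*]");
          JavaSparkContext sc = new JavaSparkContext(conf);

          // Way 1: parallelize an existing collection in the driver program.
          JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
          // Way 2: refer to a data set in external storage such as HDFS (path assumed).
          // JavaRDD<String> lines = sc.textFile("hdfs://localhost:9000/data/input.txt");

          // A transformation (filter, map, union, ...) returns a new, transformed RDD.
          JavaRDD<Integer> evens = numbers.filter(n -> n % 2 == 0);

          // An action (count, first, collect, ...) triggers execution and returns a result.
          System.out.println("Even count: " + evens.count());
          sc.close();
      }
  }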
7773
05:39:58,100 --> 05:40:01,151
So now this is a brief idea
about what is rdd
7774
05:40:01,151 --> 05:40:02,400
and how rdd works.
7775
05:40:02,400 --> 05:40:04,570
So now let's quickly
move ahead and look
7776
05:40:04,570 --> 05:40:06,100
at the different workloads
7777
05:40:06,100 --> 05:40:08,200
that can be handled
by Apache spark.
7778
05:40:08,200 --> 05:40:10,883
So we have interactive
streaming analytics.
7779
05:40:10,883 --> 05:40:12,800
Then we have machine learning.
7780
05:40:12,800 --> 05:40:14,158
We have data integration.
7781
05:40:14,158 --> 05:40:16,207
We have spark
streaming and processing.
7782
05:40:16,207 --> 05:40:17,944
So let us talk about them one
7783
05:40:17,944 --> 05:40:20,400
by one. First is Spark
streaming and processing.
7784
05:40:20,400 --> 05:40:21,400
So now basically,
7785
05:40:21,400 --> 05:40:24,007
you know data arrives
at a steady rate.
7786
05:40:24,007 --> 05:40:27,000
Or you can say
as continuous streams, right?
7787
05:40:27,000 --> 05:40:29,300
And then what you can do
you can again go ahead
7788
05:40:29,300 --> 05:40:30,829
and store the data set in disk
7789
05:40:30,829 --> 05:40:34,299
and then you can actually go
ahead and apply some processing
7790
05:40:34,299 --> 05:40:36,007
over it some analytics over it
7791
05:40:36,007 --> 05:40:38,000
and then get
some results out of it,
7792
05:40:38,000 --> 05:40:41,200
but this is not the scenario
with each and every case.
7793
05:40:41,200 --> 05:40:44,100
Let's take an example
of financial transactions
7794
05:40:44,100 --> 05:40:46,343
where you have to go
ahead and identify
7795
05:40:46,343 --> 05:40:48,931
and refuse potential
fraudulent transactions.
7796
05:40:48,931 --> 05:40:50,297
Now if you will go ahead
7797
05:40:50,297 --> 05:40:53,197
and store the data stream
and then you will go ahead
7798
05:40:53,197 --> 05:40:55,800
and apply some processing,
it would be too late
7799
05:40:55,800 --> 05:40:58,287
and someone would have got
away with the money.
7800
05:40:58,287 --> 05:41:00,386
So in that scenario
what you need to do.
7801
05:41:00,386 --> 05:41:03,183
So you need to quickly take
that input data stream.
7802
05:41:03,183 --> 05:41:05,700
You need to apply
some Transformations over it
7803
05:41:05,700 --> 05:41:08,300
and then you have
to take actions accordingly.
7804
05:41:08,300 --> 05:41:10,015
Like you can send
some notification
7805
05:41:10,015 --> 05:41:11,322
or you can actually reject
7806
05:41:11,322 --> 05:41:13,972
that fraudulent transaction
something like that.
7807
05:41:13,972 --> 05:41:15,200
And then you can go ahead
7808
05:41:15,200 --> 05:41:17,686
and if you want you
can store those results
7809
05:41:17,686 --> 05:41:19,700
or data set in some
of the database
7810
05:41:19,700 --> 05:41:21,700
or you can see some
of the file system.
7811
05:41:21,800 --> 05:41:24,000
So we have some scenarios.
7812
05:41:24,026 --> 05:41:27,873
where we have to actually
process the stream of data
7813
05:41:27,900 --> 05:41:29,300
and then we have to go ahead
7814
05:41:29,300 --> 05:41:30,358
and store the data
7815
05:41:30,358 --> 05:41:34,008
or perform some analysis on it
or take some necessary actions.
7816
05:41:34,008 --> 05:41:37,000
So this is where Spark
streaming comes into picture
7817
05:41:37,000 --> 05:41:38,575
and Spark is a best fit
7818
05:41:38,575 --> 05:41:42,000
for processing those continuous
input data streams.
7819
05:41:42,000 --> 05:41:45,500
Now moving to next
that is machine learning now,
7820
05:41:45,500 --> 05:41:46,314
as you know,
7821
05:41:46,314 --> 05:41:47,730
that first we create
7822
05:41:47,730 --> 05:41:51,182
a machine learning model
then we continuously feed
7823
05:41:51,182 --> 05:41:54,011
those incoming data
streams to the model.
7824
05:41:54,011 --> 05:41:56,700
And we get some
continuous output based
7825
05:41:56,700 --> 05:41:58,144
on the input values.
7826
05:41:58,144 --> 05:42:00,453
Now, we reuse
intermediate results
7827
05:42:00,453 --> 05:42:04,300
across multiple computations
in multi-stage applications,
7828
05:42:04,300 --> 05:42:07,600
which basically includes
substantial overhead due to
7829
05:42:07,600 --> 05:42:10,500
data replication, disk
I/O and serialization,
7830
05:42:10,500 --> 05:42:12,200
which makes the system slow.
7831
05:42:12,200 --> 05:42:16,200
Now what Spark does: Spark RDD
will store intermediate results
7832
05:42:16,200 --> 05:42:19,446
in a distributed memory
instead of a stable storage
7833
05:42:19,446 --> 05:42:21,200
and make the system faster.
7834
05:42:21,200 --> 05:42:24,800
So as we saw, in Spark RDD
all the transformations
7835
05:42:24,800 --> 05:42:26,482
will be applied over there
7836
05:42:26,482 --> 05:42:29,200
and all the transformed
rdds will be stored
7837
05:42:29,200 --> 05:42:31,999
in the memory itself
so we can quickly go ahead
7838
05:42:31,999 --> 05:42:35,037
and apply some more
iterative algorithms over there
7839
05:42:35,037 --> 05:42:37,508
and it does not take
much time in functions
7840
05:42:37,508 --> 05:42:39,333
like data replication or disk
7841
05:42:39,333 --> 05:42:42,164
I/O so all those overheads
will be reduced now
7842
05:42:42,164 --> 05:42:45,500
you might be wondering
that memory is always very less.
7843
05:42:45,500 --> 05:42:48,000
So what if the memory
gets over so
7844
05:42:48,000 --> 05:42:50,600
if the distributed memory
is not sufficient
7845
05:42:50,600 --> 05:42:52,100
to store intermediate results,
7846
05:42:52,300 --> 05:42:54,300
then it will
store those results.
7847
05:42:54,300 --> 05:42:55,100
on the disk.
7848
05:42:55,100 --> 05:42:58,000
So I hope that you guys are
clear how Spark performs
7849
05:42:58,000 --> 05:43:00,000
this iterative machine
learning algorithms
7850
05:43:00,000 --> 05:43:01,500
and why spark is fast,
7851
05:43:01,819 --> 05:43:04,280
let's look at the next workload.
7852
05:43:04,400 --> 05:43:08,200
So next workload is
interactive streaming analytics.
7853
05:43:08,200 --> 05:43:10,900
Now as we already discussed
about streaming data
7854
05:43:10,900 --> 05:43:15,300
so user runs ad hoc queries
on the same subset of data
7855
05:43:15,300 --> 05:43:19,127
and each query will do a disk
I/O on the stable storage
7856
05:43:19,127 --> 05:43:22,386
which can dominate
application's execution time.
7857
05:43:22,386 --> 05:43:24,300
So, let me take an example
7858
05:43:24,300 --> 05:43:25,400
of a data scientist.
7859
05:43:25,400 --> 05:43:27,800
So basically you have
continuous streams of data,
7860
05:43:27,800 --> 05:43:28,800
which is coming in.
7861
05:43:28,800 --> 05:43:30,650
So what would your data
scientist do?
7862
05:43:30,650 --> 05:43:32,900
So your data scientist
will either ask
7863
05:43:32,900 --> 05:43:34,274
some questions execute
7864
05:43:34,274 --> 05:43:37,208
some queries over the data
then view the result
7865
05:43:37,208 --> 05:43:40,563
and then he might alter
the initial question slightly
7866
05:43:40,563 --> 05:43:41,804
by seeing the output
7867
05:43:41,804 --> 05:43:44,332
or he might also drill
deeper into results
7868
05:43:44,332 --> 05:43:47,757
and execute some more queries
over the gathered result.
7869
05:43:47,757 --> 05:43:51,500
So there are multiple scenarios
in which your data scientist
7870
05:43:51,500 --> 05:43:54,265
would be running
some interactive queries.
7871
05:43:54,265 --> 05:43:57,569
on the streaming data.
Now how Spark helps
7872
05:43:57,569 --> 05:44:00,200
in this interactive
streaming analytics.
7873
05:44:00,200 --> 05:44:04,453
So each transformed RDD
may be recomputed each time.
7874
05:44:04,453 --> 05:44:06,838
You run an action on it, right?
7875
05:44:06,838 --> 05:44:10,692
And when you persist an rdd
in memory in which case
7876
05:44:10,692 --> 05:44:13,430
Spark will keep all
the elements around
7877
05:44:13,430 --> 05:44:15,800
on the cluster for faster access
7878
05:44:15,800 --> 05:44:18,296
and whenever you will execute
the query next time
7879
05:44:18,296 --> 05:44:19,077
over the data,
7880
05:44:19,077 --> 05:44:21,200
then the query will
be executed quickly
7881
05:44:21,200 --> 05:44:23,700
and it will give you
an instant result, right?
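A minimal sketch of that caching idea, with the data set path assumed:

  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;
  import org.apache.spark.storage.StorageLevel;

  public class InteractiveQueries {
      public static void main(String[] args) {
          JavaSparkContext sc = new JavaSparkContext(
                  new SparkConf().setAppName("interactive").setMaster("local[*]"));
          JavaRDD<String> events = sc.textFile("hdfs://localhost:9000/streams/events");

          // Keep the elements around on the cluster; spills to disk if memory runs out.
          events.persist(StorageLevel.MEMORY_AND_DISK());

          // The first action computes and caches the RDD...
          System.out.println(events.filter(l -> l.contains("ERROR")).count());
          // ...follow-up ad hoc queries on the same data come back much faster.
          System.out.println(events.filter(l -> l.contains("login")).count());
          sc.close();
      }
  }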
7882
05:44:24,100 --> 05:44:26,090
So I hope that you
guys are clear
7883
05:44:26,090 --> 05:44:29,200
how spark helps in
interactive streaming analytics.
7884
05:44:29,400 --> 05:44:32,000
Now, let's talk
about data integration.
7885
05:44:32,000 --> 05:44:33,570
So basically as you know,
7886
05:44:33,570 --> 05:44:36,900
that in large organizations data
is basically produced
7887
05:44:36,900 --> 05:44:39,400
from different systems
across the business
7888
05:44:39,400 --> 05:44:42,000
and basically you
need a framework
7889
05:44:42,000 --> 05:44:45,800
which can actually integrate
different data sources.
7890
05:44:45,800 --> 05:44:46,900
So Spark is the one
7891
05:44:46,900 --> 05:44:49,382
which actually integrates
different data sources
7892
05:44:49,382 --> 05:44:50,500
so you can go ahead
7893
05:44:50,500 --> 05:44:53,800
and you can take the data
from Kafka, Cassandra, Flume,
7894
05:44:53,800 --> 05:44:55,518
HBase, then Amazon S3.
7895
05:44:55,518 --> 05:44:59,300
Then you can perform some real
time analytics over there
7896
05:44:59,300 --> 05:45:02,000
or even say some near
real-time analytics over there.
7897
05:45:02,000 --> 05:45:04,250
You can apply some machine
learning algorithms
7898
05:45:04,250 --> 05:45:05,700
and then you can go ahead
7899
05:45:05,700 --> 05:45:08,500
and store the processed
result in Apache HBase,
7900
05:45:08,500 --> 05:45:10,600
then MySQL, HDFS.
7901
05:45:10,600 --> 05:45:12,100
It could be your Kafka.
7902
05:45:12,100 --> 05:45:15,500
So spark basically gives
you multiple options
7903
05:45:15,500 --> 05:45:16,600
where you can go ahead
7904
05:45:16,600 --> 05:45:18,500
and pick the data
from and again,
7905
05:45:18,500 --> 05:45:21,200
you can go ahead
and write the data into now.
7906
05:45:21,200 --> 05:45:23,620
Let's quickly move ahead
and we'll talk.
7907
05:45:23,620 --> 05:45:27,013
about different Spark components.
So you can see here,
7908
05:45:27,013 --> 05:45:28,500
I have a Spark Core engine.
7909
05:45:28,500 --> 05:45:30,376
So basically this
is the core engine
7910
05:45:30,376 --> 05:45:32,200
and on top of this core engine.
7911
05:45:32,200 --> 05:45:35,574
You have Spark SQL, Spark
Streaming, then MLlib,
7912
05:45:35,900 --> 05:45:38,100
then you have GraphX
and then SparkR.
7913
05:45:38,200 --> 05:45:41,087
Let's talk about them one
by one and we'll start
7914
05:45:41,087 --> 05:45:42,500
with the Spark Core engine.
7915
05:45:42,500 --> 05:45:45,200
So the Spark Core engine
is the base engine
7916
05:45:45,200 --> 05:45:46,800
for large-scale parallel
7917
05:45:46,800 --> 05:45:50,026
and distributed data processing.
Additional libraries,
7918
05:45:50,026 --> 05:45:52,200
which are built on top
of the core, allow
7919
05:45:52,200 --> 05:45:53,700
diverse workloads for
7920
05:45:53,700 --> 05:45:57,300
streaming, SQL, machine learning;
then you can go ahead
7921
05:45:57,300 --> 05:45:59,300
and execute R on Spark,
7922
05:45:59,300 --> 05:46:01,731
or you can go ahead
and execute Python on Spark,
7923
05:46:01,731 --> 05:46:03,000
those kind of workloads.
7924
05:46:03,000 --> 05:46:04,700
You can easily go
ahead and execute.
7925
05:46:04,700 --> 05:46:07,800
So basically your Spark
Core engine is the one
7926
05:46:07,800 --> 05:46:10,040
who is managing all your memory,
7927
05:46:10,040 --> 05:46:13,084
then all your fault
recovery your scheduling
7928
05:46:13,084 --> 05:46:14,755
your distributing of jobs
7929
05:46:14,755 --> 05:46:16,078
and monitoring jobs
7930
05:46:16,078 --> 05:46:19,700
on a cluster and interacting
with the storage system.
7931
05:46:19,700 --> 05:46:22,400
So in short, we
can say the Spark
7932
05:46:22,400 --> 05:46:24,501
Core engine is the heart of Spark,
7933
05:46:24,501 --> 05:46:25,951
and on top of this all
7934
05:46:25,951 --> 05:46:28,389
of these libraries
are there so first,
7935
05:46:28,389 --> 05:46:30,429
let's talk about
Spark Streaming.
7936
05:46:30,429 --> 05:46:33,088
So Spark Streaming is
the component of Spark
7937
05:46:33,088 --> 05:46:36,273
which is used to process
real-time streaming data
7938
05:46:36,273 --> 05:46:37,600
as we just discussed
7939
05:46:37,600 --> 05:46:41,061
and it is a useful addition
to spark core API.
7940
05:46:41,200 --> 05:46:43,600
Now it enables high
throughput and fault
7941
05:46:43,600 --> 05:46:46,554
tolerant stream processing
for live data streams.
7942
05:46:46,554 --> 05:46:47,700
So you can go ahead
7943
05:46:47,700 --> 05:46:51,338
and you can perform all
the streaming data analytics
7944
05:46:51,338 --> 05:46:55,800
using this Spark Streaming. Then
you have Spark SQL over here.
7945
05:46:55,800 --> 05:46:58,900
So basically Spark SQL is
a new module in Spark
7946
05:46:58,900 --> 05:47:02,200
which integrates relational
processing with Spark's functional
7947
05:47:02,200 --> 05:47:06,900
programming API and it supports
querying data either via SQL
7948
05:47:06,900 --> 05:47:08,315
or HQL, that is Hive
7949
05:47:08,315 --> 05:47:09,469
Query Language.
7950
05:47:09,500 --> 05:47:11,500
So basically for those of you
7951
05:47:11,500 --> 05:47:15,615
who are familiar with RDBMS,
Spark SQL is an easy transition
7952
05:47:15,615 --> 05:47:17,100
from your earlier tool
7953
05:47:17,100 --> 05:47:19,511
where you can go ahead
and extend the boundaries
7954
05:47:19,511 --> 05:47:22,100
of traditional relational
data processing now
7955
05:47:22,100 --> 05:47:23,700
talking about GraphX.
7956
05:47:23,700 --> 05:47:24,900
So GraphX is
7957
05:47:24,900 --> 05:47:28,500
the Spark API for graphs
and graph-parallel computation.
7958
05:47:28,500 --> 05:47:30,800
It extends the Spark RDD
7959
05:47:30,800 --> 05:47:34,309
with a resilient distributed
property graph. Talking
7960
05:47:34,309 --> 05:47:35,213
at a high level,
7961
05:47:35,213 --> 05:47:38,700
basically GraphX extends
the Spark RDD abstraction
7962
05:47:38,700 --> 05:47:41,758
by introducing the resilient
distributed property graph,
7963
05:47:41,758 --> 05:47:42,778
which is nothing
7964
05:47:42,778 --> 05:47:45,900
but a directed multigraph
with properties attached
7965
05:47:45,900 --> 05:47:49,700
to each vertex and edge.
Next we have SparkR, so
7966
05:47:49,700 --> 05:47:52,394
basically it provides you
packages for the R language,
7967
05:47:52,394 --> 05:47:54,100
and then you can go ahead and
7968
05:47:54,100 --> 05:47:55,399
leverage Spark power
7969
05:47:55,399 --> 05:47:58,000
with the R shell. Next
you have Spark MLlib.
7970
05:47:58,000 --> 05:48:01,849
So MLlib basically stands
for machine learning library.
7971
05:48:01,849 --> 05:48:05,200
So Spark MLlib is used
to perform machine learning
7972
05:48:05,200 --> 05:48:06,500
in Apache spark.
7973
05:48:06,500 --> 05:48:08,773
Now many common machine learning
7974
05:48:08,773 --> 05:48:11,784
and statistical algorithms
have been implemented
7975
05:48:11,784 --> 05:48:13,700
and are shipped with MLlib,
7976
05:48:13,700 --> 05:48:16,935
which simplifies large scale
machine learning pipelines,
7977
05:48:16,935 --> 05:48:18,347
which basically includes
7978
05:48:18,347 --> 05:48:20,994
summary statistics,
correlations, classification
7979
05:48:20,994 --> 05:48:23,800
and regression, collaborative
filtering techniques.
7980
05:48:23,800 --> 05:48:25,700
then cluster analysis methods,
7981
05:48:25,700 --> 05:48:28,582
then you have dimensionality
reduction techniques.
7982
05:48:28,582 --> 05:48:31,400
You have feature extraction
and transformation functions.
7983
05:48:31,400 --> 05:48:33,700
Then you have
optimization algorithms,
7984
05:48:33,700 --> 05:48:35,900
it is basically an MLlib package,
7985
05:48:35,900 --> 05:48:39,000
or you can say a machine
learning package, on top of Spark.
7986
05:48:39,000 --> 05:48:41,639
Then you also have
something called PySpark,
7987
05:48:41,639 --> 05:48:43,979
which is the Python package
for Spark, where
7988
05:48:43,979 --> 05:48:46,800
You can go ahead
and leverage Python over Spark.
7989
05:48:46,800 --> 05:48:47,376
So I hope
7990
05:48:47,376 --> 05:48:50,900
that you guys are clear
with different spark components.
7991
05:48:51,100 --> 05:48:53,200
So before moving
to the Kafka-Spark
7992
05:48:53,200 --> 05:48:54,524
streaming demo,
7993
05:48:54,524 --> 05:48:58,075
So I have just given you
a brief intro to Apache spark.
7994
05:48:58,075 --> 05:49:01,100
If you want a detailed tutorial
on Apache spark
7995
05:49:01,100 --> 05:49:02,600
or different components
7996
05:49:02,600 --> 05:49:06,753
of Apache Spark like Apache
Spark SQL, Spark DataFrames,
7997
05:49:06,800 --> 05:49:10,200
or Spark Streaming,
Spark GraphX, Spark MLlib,
7998
05:49:10,200 --> 05:49:13,200
so you can go to edureka's
YouTube channel again.
7999
05:49:13,200 --> 05:49:14,800
So now we are here guys.
8000
05:49:14,800 --> 05:49:18,252
I know that you guys are waiting
for this demo for a while.
8001
05:49:18,252 --> 05:49:21,900
So now let's go ahead and look
at the Kafka-Spark streaming demo.
8002
05:49:21,900 --> 05:49:23,700
So let me quickly go
ahead and open.
8003
05:49:23,700 --> 05:49:28,000
my virtual machine
and I'll open a terminal.
8004
05:49:28,600 --> 05:49:30,658
So let me first check
all the daemons
8005
05:49:30,658 --> 05:49:32,400
that are running in my system.
8006
05:49:33,800 --> 05:49:35,341
So my zookeeper is running
8007
05:49:35,341 --> 05:49:37,753
name node is running
data node is running.
8008
05:49:37,753 --> 05:49:39,130
Then my resource manager
8009
05:49:39,130 --> 05:49:42,714
is running, all the three Kafka
brokers are running, then
8010
05:49:42,714 --> 05:49:44,088
node manager is running
8011
05:49:44,088 --> 05:49:46,000
and the job history server is running.
8012
05:49:46,200 --> 05:49:49,200
So now I have to start
my Spark daemons.
8013
05:49:49,200 --> 05:49:51,900
So let me first go
to the spark home
8014
05:49:52,600 --> 05:49:54,600
and start the Spark daemons.
8015
05:49:54,600 --> 05:49:57,800
The command is
dot slash sbin start-all dot
8016
05:49:57,800 --> 05:49:58,900
sh.
8017
05:50:01,400 --> 05:50:03,400
So let me quickly go ahead
8018
05:50:03,400 --> 05:50:06,861
and execute sudo JPS
to check my Spark daemons.
8019
05:50:08,500 --> 05:50:12,200
So you can see master
and worker daemons are running.
8020
05:50:12,596 --> 05:50:14,903
So let me close this terminal.
8021
05:50:16,300 --> 05:50:18,700
Let me go to
the project directory.
8022
05:50:20,600 --> 05:50:22,808
So basically, I
have two projects.
8023
05:50:22,808 --> 05:50:25,376
This is the Kafka
transaction producer.
8024
05:50:25,376 --> 05:50:28,852
And the next one is the Spark
streaming Kafka master.
8025
05:50:28,852 --> 05:50:31,327
So first we will
be producing messages
8026
05:50:31,327 --> 05:50:33,400
from Kafka transaction producer
8027
05:50:33,400 --> 05:50:36,200
and then we'll be
streaming those records
8028
05:50:36,200 --> 05:50:39,670
which is basically produced by
this producer using the spark
8029
05:50:39,670 --> 05:50:41,025
streaming Kafka master.
8030
05:50:41,025 --> 05:50:42,494
So first, let me take you
8031
05:50:42,494 --> 05:50:45,100
through this Kafka
transaction producer.
8032
05:50:45,100 --> 05:50:47,244
So this is
our pom.xml file.
8033
05:50:47,244 --> 05:50:49,004
Let me open it with gedit.
8034
05:50:49,004 --> 05:50:50,700
So basically this is a Maven
8035
05:50:50,700 --> 05:50:54,400
project, and I have used
the Spring Boot server.
8036
05:50:54,800 --> 05:50:57,071
So I have given Java version
8037
05:50:57,071 --> 05:51:00,456
as 8. You can see
the Kafka client over here
8038
05:51:00,500 --> 05:51:02,900
and the version of Kafka client,
8039
05:51:03,780 --> 05:51:07,719
then you can see I'm putting
Jackson Databind.
8040
05:51:08,800 --> 05:51:13,500
Then Gson, and then I
am packaging it as a war file
8041
05:51:13,600 --> 05:51:15,500
that is web archive file.
8042
05:51:15,500 --> 05:51:20,000
And here I am again specifying
the spring boot Maven plugins,
8043
05:51:20,000 --> 05:51:21,300
which is to be downloaded.
8044
05:51:21,300 --> 05:51:23,258
So let me quickly go ahead
8045
05:51:23,258 --> 05:51:27,100
and close this and we'll go
to this Source directory
8046
05:51:27,100 --> 05:51:29,125
and then we'll go inside main.
8047
05:51:29,125 --> 05:51:32,972
So basically this is the file
that is sales Jan 2009 file.
8048
05:51:32,972 --> 05:51:35,200
So let me show you
the file first.
8049
05:51:37,300 --> 05:51:38,860
So these are the records
8050
05:51:38,860 --> 05:51:41,200
which I'll be producing
to the Kafka.
8051
05:51:41,200 --> 05:51:43,600
So the fields
are transaction date
8052
05:51:43,600 --> 05:51:45,500
then product, price, payment
8053
05:51:45,500 --> 05:51:49,767
type, then name, city, state,
country, account created,
8054
05:51:49,800 --> 05:51:51,646
then last login latitude
8055
05:51:51,646 --> 05:51:52,846
and longitude.
8056
05:51:52,846 --> 05:51:57,400
So let me close this file
and then the application dot
8057
05:51:57,400 --> 05:51:59,778
yml is the main property file.
8058
05:51:59,900 --> 05:52:02,654
So in this application
dot yml I am specifying
8059
05:52:02,654 --> 05:52:04,000
the bootstrap server,
8060
05:52:04,000 --> 05:52:07,900
which is localhost 9092,
then I am specifying the brokers,
8061
05:52:07,900 --> 05:52:11,500
which again reside
on localhost 9092. So here
8062
05:52:11,500 --> 05:52:16,200
I have specified the broker list
now next I have product topic.
8063
05:52:16,200 --> 05:52:19,000
So the topic of the
product is transaction.
8064
05:52:19,000 --> 05:52:21,230
Then the partition count is 1
8065
05:52:21,500 --> 05:52:25,800
so basically your acks
config controls the criteria
8066
05:52:25,800 --> 05:52:29,100
under which requests
are considered complete
8067
05:52:29,100 --> 05:52:32,900
and the all setting we
have specified will result
8068
05:52:32,900 --> 05:52:35,828
in blocking on the full
commit of the record.
8069
05:52:35,828 --> 05:52:37,225
It is the slowest but
8070
05:52:37,225 --> 05:52:40,900
the most durable setting
Now talking about retries.
8071
05:52:40,900 --> 05:52:44,600
So it will retry thrice.
Then we have minimum pool size
8072
05:52:44,600 --> 05:52:46,587
and we have maximum pool size,
8073
05:52:46,587 --> 05:52:49,700
which is basically
for implementing Java threads
8074
05:52:49,700 --> 05:52:52,000
and at last we
have the file path.
8075
05:52:52,000 --> 05:52:53,900
So this is the path of the file,
8076
05:52:53,900 --> 05:52:57,900
which I have shown you just now
so messages will be consumed
8077
05:52:57,900 --> 05:52:58,800
from this file.
8078
05:52:58,800 --> 05:53:02,600
Let me quickly close this file
and we'll look at application
8079
05:53:02,600 --> 05:53:06,792
dot properties. So here we
have specified the properties
8080
05:53:06,792 --> 05:53:08,600
for the Spring Boot server.
8081
05:53:08,700 --> 05:53:10,877
So we have server context path.
8082
05:53:10,877 --> 05:53:12,185
That is /edureka.
8083
05:53:12,185 --> 05:53:14,607
Then we have
spring application name
8084
05:53:14,607 --> 05:53:16,301
that is Kafka producer.
8085
05:53:16,301 --> 05:53:17,700
We have server Port
8086
05:53:17,700 --> 05:53:22,200
specified over here, and
the spring events timeout is 20.
8087
05:53:22,200 --> 05:53:24,430
So let me close this as well.
8088
05:53:24,430 --> 05:53:25,530
Let's go back.
8089
05:53:25,800 --> 05:53:29,500
Let's go inside java, com
dot edureka dot kafka.
8090
05:53:29,700 --> 05:53:33,400
So we'll explore
the important files one by one.
8091
05:53:33,400 --> 05:53:36,800
So let me first take you
through this dto directory.
8092
05:53:36,900 --> 05:53:39,617
And over here,
we have transaction dot Java.
8093
05:53:39,617 --> 05:53:42,253
So basically here we
are storing the model.
8094
05:53:42,253 --> 05:53:45,871
So basically you can see these
are the fields from the file,
8095
05:53:45,871 --> 05:53:47,372
which I have shown you.
8096
05:53:47,372 --> 05:53:49,200
So we have transaction date.
8097
05:53:49,200 --> 05:53:53,600
We have product price payment
type name city state country
8098
05:53:53,600 --> 05:53:57,700
and so on so we have created
variable for each field.
8099
05:53:57,700 --> 05:54:01,101
So what we are doing we
are basically creating a getter
8100
05:54:01,101 --> 05:54:03,766
and Setter function for
all these variables.
8101
05:54:03,766 --> 05:54:05,702
So we have get transaction ID,
8102
05:54:05,702 --> 05:54:08,800
which will basically
return the transaction ID, then
8103
05:54:08,800 --> 05:54:10,600
we have set transaction ID,
8104
05:54:10,600 --> 05:54:13,300
which will basically
set the transaction ID.
8105
05:54:13,300 --> 05:54:13,809
Similarly.
8106
05:54:13,809 --> 05:54:17,036
We have get transaction date for
getting the transaction date.
8107
05:54:17,036 --> 05:54:19,100
Then we have set
transaction date and it
8108
05:54:19,100 --> 05:54:21,900
will set the transaction date
using this variable.
8109
05:54:21,900 --> 05:54:25,532
Then we have get product
and set product, get price, set price,
8110
05:54:25,532 --> 05:54:26,700
and all the getter
8111
05:54:26,700 --> 05:54:29,900
and setter methods
for each of the variables.
8112
05:54:32,000 --> 05:54:34,000
This is the Constructor.
8113
05:54:34,100 --> 05:54:35,615
So here we are taking
8114
05:54:35,615 --> 05:54:39,513
all the parameters like
transaction date product price.
8115
05:54:39,513 --> 05:54:42,295
And then we are setting
the value of each
8116
05:54:42,295 --> 05:54:44,800
of the variables
using this operator.
8117
05:54:44,800 --> 05:54:48,295
So we are setting the value for
transaction date product price
8118
05:54:48,295 --> 05:54:51,500
payment and all of the fields
that is present over there.
8119
05:54:51,515 --> 05:54:51,900
Next.
8120
05:54:51,900 --> 05:54:55,053
We are also creating
a default Constructor
8121
05:54:55,200 --> 05:54:56,616
and then over here.
8122
05:54:56,616 --> 05:54:59,300
We are overriding
the toString method
8123
05:54:59,300 --> 05:55:01,600
and what we are doing
we are basically
8124
05:55:02,400 --> 05:55:04,500
printing the transaction details
8125
05:55:04,500 --> 05:55:06,600
and we are
returning transaction date
8126
05:55:06,600 --> 05:55:09,100
and then the value
of transaction date product
8127
05:55:09,100 --> 05:55:12,300
then the value of product, price,
then the value of price
8128
05:55:12,300 --> 05:55:14,900
and so on for all the fields.
8129
05:55:15,300 --> 05:55:18,800
So basically this is the model
of the transaction
8130
05:55:18,800 --> 05:55:20,000
so we can go ahead
8131
05:55:20,000 --> 05:55:22,529
and we can create object
of this transaction
8132
05:55:22,529 --> 05:55:24,400
and then we can easily go ahead
8133
05:55:24,400 --> 05:55:27,700
and send the transaction
object as the value.
8134
05:55:27,700 --> 05:55:29,900
So this is the main
reason of creating
8135
05:55:29,900 --> 05:55:31,588
this transaction model. Now let
8136
05:55:31,588 --> 05:55:34,000
me quickly go ahead
and close this file.
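A bare-bones sketch of such a model class, with only two of the fields shown (the real file has one per field of the record):

  public class Transaction {
      private String transactionDate;
      private String product;

      public Transaction() { }                    // default constructor

      public Transaction(String transactionDate, String product) {
          this.transactionDate = transactionDate; // set via the "this" operator
          this.product = product;
      }

      public String getTransactionDate() { return transactionDate; }
      public void setTransactionDate(String d) { this.transactionDate = d; }
      public String getProduct() { return product; }
      public void setProduct(String p) { this.product = p; }

      @Override
      public String toString() {                  // overridden to print the details
          return "Transaction date: " + transactionDate + ", product: " + product;
      }
  }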
8137
05:55:34,000 --> 05:55:38,400
Let's go back and let's first
take a look at this config.
8138
05:55:38,615 --> 05:55:41,384
So this is Kafka
properties dot Java.
8139
05:55:41,500 --> 05:55:43,202
So what we did again
8140
05:55:43,202 --> 05:55:46,894
as I have shown you
the application dot yml file.
8141
05:55:46,942 --> 05:55:48,500
So we have taken all
8142
05:55:48,500 --> 05:55:51,500
the parameters that we
have specified over there.
8143
05:55:51,600 --> 05:55:54,600
That is your bootstrap,
product topic, partition count,
8144
05:55:54,600 --> 05:55:57,700
then brokers, file name
and thread count.
8145
05:55:57,700 --> 05:55:59,322
So all these properties
8146
05:55:59,322 --> 05:56:02,367
then you have file path
then for all of these,
8147
05:56:02,367 --> 05:56:04,300
we have taken we have created
8148
05:56:04,300 --> 05:56:07,100
a variable and then
what we are doing again,
8149
05:56:07,100 --> 05:56:08,700
we are doing the same thing
8150
05:56:08,700 --> 05:56:11,039
as we did with
our transaction model.
8151
05:56:11,039 --> 05:56:12,600
We are creating a getter
8152
05:56:12,600 --> 05:56:15,247
and Setter method for each
of these variables.
8153
05:56:15,247 --> 05:56:17,305
So you can see we
have get file path
8154
05:56:17,305 --> 05:56:19,300
and we are returning
the file path.
8155
05:56:19,300 --> 05:56:20,924
Then we have set file path
8156
05:56:20,924 --> 05:56:24,300
where we are setting the file
path using this operator.
8157
05:56:24,300 --> 05:56:24,800
Similarly.
8158
05:56:24,800 --> 05:56:26,600
We have get product topic,
8159
05:56:26,600 --> 05:56:29,567
set product topic, then we
have getter and setter
8160
05:56:29,567 --> 05:56:30,400
for thread count.
8161
05:56:30,400 --> 05:56:31,700
We have getter and setter
8162
05:56:31,700 --> 05:56:36,000
for bootstrap and all
those properties. Now,
8163
05:56:36,100 --> 05:56:37,522
we can again go ahead
8164
05:56:37,522 --> 05:56:40,300
and call this Kafka
properties anywhere
8165
05:56:40,300 --> 05:56:41,400
and then we can easily
8166
05:56:41,400 --> 05:56:44,000
extract those values
using getter methods.
8167
05:56:44,100 --> 05:56:48,400
So let me quickly close
this file and I'll take you
8168
05:56:48,400 --> 05:56:50,500
to the configurations.
8169
05:56:50,900 --> 05:56:52,100
So in this configuration
8170
05:56:52,100 --> 05:56:54,700
what we are doing we
are creating the object
8171
05:56:54,700 --> 05:56:56,700
of Kafka properties
as you can see,
8172
05:56:57,000 --> 05:56:59,800
so what we are doing then we
are again creating a properties
8173
05:56:59,800 --> 05:57:02,600
object and then we
are setting the properties
8174
05:57:02,700 --> 05:57:03,800
so you can see
8175
05:57:03,800 --> 05:57:06,800
that we are setting
the bootstrap server config
8176
05:57:06,800 --> 05:57:08,400
and then we are retrieving
8177
05:57:08,400 --> 05:57:11,900
the value using the Kafka
properties object.
8178
05:57:11,900 --> 05:57:14,300
And this is the get
bootstrap server function.
8179
05:57:14,300 --> 05:57:17,500
Then you can see we are setting
the acknowledgement config
8180
05:57:17,500 --> 05:57:18,400
and we are getting
8181
05:57:18,400 --> 05:57:22,100
the acknowledgement from this
get acknowledgement function.
8182
05:57:22,100 --> 05:57:24,900
And then we are using
this get retries method.
8183
05:57:24,900 --> 05:57:27,300
So from all these
Kafka properties object.
8184
05:57:27,300 --> 05:57:29,000
We are calling
those getter methods
8185
05:57:29,000 --> 05:57:30,700
and retrieving those values
8186
05:57:30,700 --> 05:57:34,100
and setting those values
in this property object.
8187
05:57:34,100 --> 05:57:36,900
So we have the partitioner class.
8188
05:57:37,000 --> 05:57:40,294
So we are basically implementing
this default partitioner
8189
05:57:40,294 --> 05:57:41,400
which is present in
8190
05:57:41,400 --> 05:57:45,700
the org.apache.kafka.clients
.producer.internals package.
8191
05:57:45,700 --> 05:57:48,600
Then we are creating
a producer over here
8192
05:57:48,600 --> 05:57:50,756
and we are passing this props
8193
05:57:50,756 --> 05:57:54,400
object which will set
the properties so over here.
8194
05:57:54,400 --> 05:57:56,684
We are passing
the key serializer,
8195
05:57:56,684 --> 05:57:58,900
which is the
String serializer.
8196
05:57:58,900 --> 05:58:00,100
And then this is
8197
05:58:00,100 --> 05:58:04,400
the value serializer in which
we are creating a new custom
8198
05:58:04,400 --> 05:58:07,500
JSON serializer, and then
we are passing transaction
8199
05:58:07,500 --> 05:58:10,400
over here and then it
will return the producer
8200
05:58:10,500 --> 05:58:13,735
and then we are implementing
thread we are again getting
8201
05:58:13,735 --> 05:58:15,200
the get minimum pool size
8202
05:58:15,200 --> 05:58:17,700
from Kafka properties and get
maximum pool size
8203
05:58:17,700 --> 05:58:18,700
from Kafka properties.
8204
05:58:18,700 --> 05:58:19,600
So over here,
8205
05:58:19,600 --> 05:58:22,000
We are implementing
Java threads now.
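As a rough Java sketch of this producer configuration (the custom serializer's class name is an assumption; Transaction is the model from the dto package):

  import java.util.Properties;
  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.ProducerConfig;

  public class ProducerFactory {
      public static KafkaProducer<String, Transaction> create() {
          Properties props = new Properties();
          props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
          props.put(ProducerConfig.ACKS_CONFIG, "all");   // slowest but most durable
          props.put(ProducerConfig.RETRIES_CONFIG, 3);    // retry thrice
          props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
          props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "com.edureka.kafka.serializer.CustomJsonSerializer"); // assumed name
          return new KafkaProducer<>(props);
      }
  }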
8206
05:58:22,000 --> 05:58:25,534
Let me quickly close this Kafka
producer configuration,
8207
05:58:25,534 --> 05:58:28,200
where we are configuring
our Kafka producer.
8208
05:58:28,461 --> 05:58:29,538
Let's go back.
8209
05:58:30,400 --> 05:58:32,800
Let's quickly go to this API
8210
05:58:32,946 --> 05:58:36,253
which has the event producer
API dot Java file.
8211
05:58:36,300 --> 05:58:40,130
So here we are basically
creating an event producer API
8212
05:58:40,130 --> 05:58:42,400
which has this
dispatch function.
8213
05:58:42,400 --> 05:58:46,900
So we'll use this dispatch
function to send the records.
8214
05:58:47,180 --> 05:58:49,719
So let me quickly
close this file.
8215
05:58:50,061 --> 05:58:51,138
Let's go back.
8216
05:58:51,300 --> 05:58:53,475
We have already seen this config
8217
05:58:53,475 --> 05:58:54,700
and configurations
8218
05:58:54,700 --> 05:58:57,100
in which we are basically
retrieving those values
8219
05:58:57,100 --> 05:58:58,984
from application dot yml file
8220
05:58:58,984 --> 05:59:02,300
and then we are setting
the producer configurations,
8221
05:59:02,300 --> 05:59:04,000
then we have constants.
8222
05:59:04,000 --> 05:59:07,100
So in Kafka constants dot Java,
8223
05:59:07,200 --> 05:59:09,900
we have created this Kafka
constant interface
8224
05:59:09,900 --> 05:59:11,393
where we have specified
8225
05:59:11,393 --> 05:59:14,925
the batch size account limit
checksum limit, then read
8226
05:59:14,925 --> 05:59:17,494
batch size, minimum
balance, maximum balance,
8227
05:59:17,494 --> 05:59:19,500
minimum account, maximum account.
8228
05:59:19,500 --> 05:59:22,604
Then we are also implementing
a date-time formatter.
8229
05:59:22,604 --> 05:59:25,643
So we are specifying all
the constants over here.
8230
05:59:25,643 --> 05:59:27,100
Let me close this file.
8231
05:59:27,100 --> 05:59:31,300
Let's go back. Now we
will not look in detail
8232
05:59:31,300 --> 05:59:32,506
at these two files,
8233
05:59:32,506 --> 05:59:35,300
but let me tell you what
these two files do.
8234
05:59:35,300 --> 05:59:39,400
So these two files are
basically to record the metrics
8235
05:59:39,400 --> 05:59:42,000
of your Kafka like time in which
8236
05:59:42,000 --> 05:59:44,889
your thousand records have
been produced in Kafka, or
8237
05:59:44,889 --> 05:59:45,781
You can say time
8238
05:59:45,781 --> 05:59:48,400
in which records
are getting published to Kafka.
8239
05:59:48,400 --> 05:59:51,936
It will be monitored and then
you can record those stats.
8240
05:59:51,936 --> 05:59:53,292
So basically it helps
8241
05:59:53,292 --> 05:59:57,100
in optimizing the performance
of your Kafka producer, right?
8242
05:59:57,100 --> 05:59:59,863
You can actually know
how to reconfigure,
8243
05:59:59,863 --> 06:00:03,000
how to adjust
those configuration factors
8244
06:00:03,000 --> 06:00:05,041
and then you can
see the difference
8245
06:00:05,041 --> 06:00:07,159
or you can actually
monitor the stats
8246
06:00:07,159 --> 06:00:08,259
and then understand
8247
06:00:08,259 --> 06:00:11,612
or how you can actually make
your producer more efficient.
8248
06:00:11,612 --> 06:00:13,039
So these are basically
8249
06:00:13,039 --> 06:00:16,800
for those factors but let's
not worry about this right now.
8250
06:00:16,900 --> 06:00:18,600
Let's go back next.
8251
06:00:18,600 --> 06:00:21,500
Let me quickly take you
through this file utility.
8252
06:00:21,500 --> 06:00:24,000
So you have file
utility dot Java.
8253
06:00:24,000 --> 06:00:26,600
So basically what we
are doing over here,
8254
06:00:26,600 --> 06:00:28,550
we are reading each record
8255
06:00:28,550 --> 06:00:32,200
from the file using the
buffered reader. So over here,
8256
06:00:32,200 --> 06:00:36,900
you can see we have this list
and then we have bufferedreader.
8257
06:00:36,900 --> 06:00:38,700
Then we have file reader.
8258
06:00:38,700 --> 06:00:41,000
So first we are reading the file
8259
06:00:41,000 --> 06:00:44,105
and then we are trying
to split each of the fields
8260
06:00:44,105 --> 06:00:45,500
present in the record.
8261
06:00:45,500 --> 06:00:49,500
And then we are setting the
value of those fields over here.
8262
06:00:49,700 --> 06:00:52,407
Then we are specifying
some of the exceptions
8263
06:00:52,407 --> 06:00:54,900
that may occur like
number format exception
8264
06:00:54,900 --> 06:00:57,500
or parse exception, all
those kinds of exceptions
8265
06:00:57,500 --> 06:01:00,900
we have specified over here
and then we are closing this.
8266
06:01:00,900 --> 06:01:01,959
So in this file,
8267
06:01:01,959 --> 06:01:04,746
we are basically
reading the records. Now,
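A bare-bones sketch of that reading loop, with only two fields shown and the comma delimiter assumed:

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;

  public class FileUtility {
      public static List<Transaction> read(String path) throws IOException {
          List<Transaction> records = new ArrayList<>();
          try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
              String line;
              while ((line = reader.readLine()) != null) {
                  String[] fields = line.split(",");   // split the record into fields
                  Transaction t = new Transaction();
                  t.setTransactionDate(fields[0]);     // set each field's value
                  t.setProduct(fields[1]);
                  records.add(t);
              }
          } catch (NumberFormatException e) {          // parse problems reported, as in the demo
              System.err.println("Bad record: " + e.getMessage());
          }
          return records;
      }
  }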
8268
06:01:04,746 --> 06:01:06,000
Let me close this.
8269
06:01:06,000 --> 06:01:07,100
Let's go back.
8270
06:01:07,500 --> 06:01:07,766
Now.
8271
06:01:07,766 --> 06:01:10,500
Let's take a quick look
at the serializer.
8272
06:01:10,500 --> 06:01:13,100
So this is custom
JSON serializer.
8273
06:01:13,500 --> 06:01:15,100
So in serializer,
8274
06:01:15,100 --> 06:01:18,000
we have created
a custom JSON serializer.
8275
06:01:18,000 --> 06:01:22,023
Now, this is basically
to write the values as bytes.
8276
06:01:22,100 --> 06:01:26,082
So the data which you will be
passing will be written in bytes
8277
06:01:26,082 --> 06:01:27,197
because as we know
8278
06:01:27,197 --> 06:01:29,800
that data is sent to Kafka
in the form of bytes.
8279
06:01:29,800 --> 06:01:32,000
And this is the reason
why we have created
8280
06:01:32,000 --> 06:01:33,700
this custom JSON serializer.
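A minimal sketch of such a custom JSON serializer, assuming Jackson (which the pom pulls in via jackson-databind); the class name is illustrative:

  import java.util.Map;
  import com.fasterxml.jackson.databind.ObjectMapper;
  import org.apache.kafka.common.serialization.Serializer;

  public class CustomJsonSerializer<T> implements Serializer<T> {
      private final ObjectMapper mapper = new ObjectMapper();

      @Override public void configure(Map<String, ?> configs, boolean isKey) { }

      @Override
      public byte[] serialize(String topic, T data) {
          try {
              // Kafka transports messages as bytes, so write the object as JSON bytes.
              return mapper.writeValueAsBytes(data);
          } catch (Exception e) {
              throw new RuntimeException("JSON serialization failed", e);
          }
      }

      @Override public void close() { }
  }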
8281
06:01:33,930 --> 06:01:37,469
So now let me quickly close
this. Let's go back.
8282
06:01:37,700 --> 06:01:41,800
This file is basically for
my spring boot web application.
8283
06:01:41,900 --> 06:01:44,200
So let's not get into this.
8284
06:01:44,300 --> 06:01:47,100
Let's look at events
REST dot Java.
8285
06:01:47,865 --> 06:01:51,634
So basically over here we
have event producer API.
8286
06:01:52,300 --> 06:01:57,100
So now we are trying to dispatch
those events and to show you
8287
06:01:57,100 --> 06:01:58,988
how dispatch function works.
8288
06:01:58,988 --> 06:02:00,000
Let me go back.
8289
06:02:00,000 --> 06:02:01,691
Let me open services
8290
06:02:01,700 --> 06:02:05,000
and the event producer
IMPL, that is the implementation.
8291
06:02:05,000 --> 06:02:08,100
So let me show you
how this dispatch works.
8292
06:02:08,100 --> 06:02:10,400
So basically over here
what we are doing first.
8293
06:02:10,400 --> 06:02:11,576
We are initializing.
8294
06:02:11,576 --> 06:02:13,047
So using the file utility.
8295
06:02:13,047 --> 06:02:16,000
We are basically reading
the file, and to read the file,
8296
06:02:16,000 --> 06:02:19,356
We are getting the path using
this Kafka properties object
8297
06:02:19,356 --> 06:02:22,300
and we are calling
this getter method of file path.
8298
06:02:22,300 --> 06:02:24,900
Then what we are doing
we are basically taking
8299
06:02:24,900 --> 06:02:25,900
the product list
8300
06:02:25,900 --> 06:02:28,700
and then we are trying
to dispatch it so
8301
06:02:28,700 --> 06:02:32,800
in dispatch we are basically
using Kafka producer
8302
06:02:33,600 --> 06:02:37,000
and then we are creating the
object of the producer record.
8303
06:02:37,000 --> 06:02:41,594
Then we are using the get topic
from this Kafka properties.
8304
06:02:41,594 --> 06:02:44,004
We are getting
this transaction ID
8305
06:02:44,004 --> 06:02:45,459
from the transaction
8306
06:02:45,459 --> 06:02:49,540
and then we are using event
producer send to send the data.
8307
06:02:49,540 --> 06:02:51,300
And finally we are trying
8308
06:02:51,300 --> 06:02:54,827
to monitor this but let's
not worry about the monitoring
8309
06:02:54,827 --> 06:02:57,200
and the stats part,
8310
06:02:57,200 --> 06:02:59,661
so we can ignore this part. Next,
8311
06:02:59,800 --> 06:03:03,700
Let's quickly go back
and look at the last file
8312
06:03:03,700 --> 06:03:05,100
which is producer.
8313
06:03:05,600 --> 06:03:07,835
So let me show you
this event producer.
8314
06:03:07,835 --> 06:03:09,300
So what we are doing here,
8315
06:03:09,300 --> 06:03:11,500
we are actually
creating a logger.
8316
06:03:11,900 --> 06:03:13,500
So in this on completion method,
8317
06:03:13,500 --> 06:03:16,300
we are basically passing
the record metadata.
8318
06:03:16,300 --> 06:03:20,838
And if your exception is
not null then it will basically
8319
06:03:20,838 --> 06:03:25,200
log an error with this
and the record metadata, else
8320
06:03:25,400 --> 06:03:29,700
it will give you the sent
message to topic partition
8321
06:03:29,700 --> 06:03:32,300
offset, and then
the record metadata
8322
06:03:32,300 --> 06:03:34,564
and topic and then it will give
8323
06:03:34,564 --> 06:03:38,800
you all the details regarding
topic partitions and offsets.
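A minimal sketch of that send-with-callback pattern (method and class names illustrative; the transaction ID is passed in here rather than read from the model):

  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.ProducerRecord;

  public class EventSender {
      static void dispatch(KafkaProducer<String, Transaction> producer,
                           String transactionId, Transaction txn) {
          ProducerRecord<String, Transaction> record =
                  new ProducerRecord<>("transaction", transactionId, txn);
          producer.send(record, (metadata, exception) -> {
              if (exception != null) {
                  // exception not null: log the failure
                  System.err.println("Send failed: " + exception.getMessage());
              } else {
                  System.out.println("Sent message to topic " + metadata.topic()
                          + ", partition " + metadata.partition()
                          + ", offset " + metadata.offset());
              }
          });
      }
  }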
8324
06:03:38,800 --> 06:03:40,888
So I hope that you
guys have understood
8325
06:03:40,888 --> 06:03:44,110
how this Kafka producer
is working. Now is the time we
8326
06:03:44,110 --> 06:03:47,169
need to go ahead and we need
to quickly execute this.
8327
06:03:47,169 --> 06:03:49,200
So let me open
a terminal over here.
8328
06:03:49,500 --> 06:03:51,653
Now first build this project.
8329
06:03:51,653 --> 06:03:54,423
We need to execute
mvn clean install.
8330
06:03:54,900 --> 06:03:56,800
This will install
all the dependencies.
8331
06:04:01,600 --> 06:04:04,100
So as you can see
our build is successful.
8332
06:04:04,100 --> 06:04:08,111
So let me minimize this and
this target directory is created
8333
06:04:08,111 --> 06:04:10,394
after you build
a Maven project.
8334
06:04:10,394 --> 06:04:11,778
So let me quickly go
8335
06:04:11,778 --> 06:04:16,000
inside this target directory and
this is the root dot war file
8336
06:04:16,000 --> 06:04:18,300
that is root dot
web archive file
8337
06:04:18,300 --> 06:04:19,897
which we need to execute.
8338
06:04:19,897 --> 06:04:22,900
So let's quickly go ahead
and execute this file.
8339
06:04:23,100 --> 06:04:24,755
But before this to verify
8340
06:04:24,755 --> 06:04:27,800
whether the data
is getting produced in our Kafka
8341
06:04:27,800 --> 06:04:29,900
topic, so for testing
8342
06:04:29,900 --> 06:04:33,300
as I already told you
We need to go ahead
8343
06:04:33,300 --> 06:04:36,200
and we need to open
a console consumer
8344
06:04:36,500 --> 06:04:37,500
so that we can check
8345
06:04:37,500 --> 06:04:40,200
whether the data
is getting published or not.
8346
06:04:42,400 --> 06:04:45,100
So let me quickly minimize this.
8347
06:04:48,300 --> 06:04:52,700
So let's quickly go to
Kafka directory and the command
8348
06:04:52,700 --> 06:04:59,300
is ./bin/kafka-console-consumer.sh
with the
8349
06:04:59,300 --> 06:05:01,500
--bootstrap-server option set to localhost
8350
06:05:14,800 --> 06:05:21,964
9092. Okay,
now let me check the topic.
8351
06:05:21,964 --> 06:05:23,271
What's the topic?
8352
06:05:24,000 --> 06:05:27,000
Let's go to our
application dot yml file.
8353
06:05:27,000 --> 06:05:31,000
So the topic
name is transaction.
8354
06:05:31,000 --> 06:05:35,100
Let me quickly minimize
this, specify the topic name,
8355
06:05:35,100 --> 06:05:36,500
and I'll hit enter.
8356
06:05:36,500 --> 06:05:41,300
So now let me place
this console aside.
8357
06:05:41,300 --> 06:05:45,900
And now let's quickly go ahead
and execute our project.
8358
06:05:45,900 --> 06:05:49,400
So for that
the command is Java -
8359
06:05:49,400 --> 06:05:52,938
jar and then we'll provide
the path of the file
8360
06:05:52,938 --> 06:05:54,100
that is inside target.
8361
06:05:54,300 --> 06:05:59,700
Right, and the file is
root dot war, and here we go.
8362
06:06:18,100 --> 06:06:20,955
So now you can see
in the console consumer.
8363
06:06:20,955 --> 06:06:23,200
The records are
getting published.
8364
06:06:23,200 --> 06:06:23,700
Right?
8365
06:06:24,000 --> 06:06:25,903
So there are multiple records
8366
06:06:25,903 --> 06:06:29,118
which have been published
in our transaction topic
8367
06:06:29,118 --> 06:06:32,400
and you can verify this
using the console consumer.
8368
06:06:32,400 --> 06:06:33,145
So this is
8369
06:06:33,145 --> 06:06:36,500
where the developers use
the console consumer.
8370
06:06:38,000 --> 06:06:40,980
So now we have successfully
verified our producer.
8371
06:06:40,980 --> 06:06:43,900
So let me quickly go ahead
and stop the producer.
8372
06:06:45,500 --> 06:06:48,200
Next, let me stop
the consumer as well.
8373
06:06:49,400 --> 06:06:51,370
Let's quickly minimize this
8374
06:06:51,370 --> 06:06:54,144
and now let's go
to the second project.
8375
06:06:54,144 --> 06:06:56,700
That is spark
streaming Kafka master.
8376
06:06:56,900 --> 06:06:57,200
Again.
8377
06:06:57,200 --> 06:06:59,667
We have specified
all the dependencies
8378
06:06:59,667 --> 06:07:00,800
that is required.
8379
06:07:01,000 --> 06:07:03,700
Let me quickly show
you those dependencies.
8380
06:07:07,700 --> 06:07:09,800
Now again, you
can see, over here
8381
06:07:09,800 --> 06:07:12,400
We have specified
Java version then we
8382
06:07:12,400 --> 06:07:16,600
have specified the artifacts,
or you can say the dependencies.
8383
06:07:16,796 --> 06:07:18,796
So we have Scala compiler.
8384
06:07:18,796 --> 06:07:21,411
Then we have
spark streaming Kafka.
8385
06:07:21,900 --> 06:07:24,200
Then we have
Kafka clients.
8386
06:07:24,400 --> 06:07:28,400
Then Json data binding then we
have Maven compiler plug-in.
8387
06:07:28,400 --> 06:07:30,600
So all those dependencies
which are required.
8388
06:07:30,600 --> 06:07:32,300
we have specified over here.
8389
06:07:32,500 --> 06:07:35,500
So let me quickly go
ahead and close it.
8390
06:07:36,200 --> 06:07:40,503
Let's quickly move to the source
directory main then let's look
8391
06:07:40,503 --> 06:07:42,100
at the resources again.
8392
06:07:42,203 --> 06:07:44,896
So this is application
dot yml file.
8393
06:07:45,700 --> 06:07:46,700
So we have port
8394
06:07:46,700 --> 06:07:49,600
8080, then we
have the bootstrap server over here.
8395
06:07:49,600 --> 06:07:51,100
Then we have the broker over here.
8396
06:07:51,100 --> 06:07:53,200
Then we have the topic
as transaction.
8397
06:07:53,200 --> 06:07:56,000
The group is transaction
partition count is one
8398
06:07:56,000 --> 06:07:57,273
and then the file name
8399
06:07:57,273 --> 06:07:59,664
so we won't be using
this file name then.
8400
06:07:59,664 --> 06:08:01,900
Let me quickly go ahead
and close this.
8401
06:08:01,900 --> 06:08:02,984
Let's go back.
8402
06:08:02,984 --> 06:08:06,600
Let's go back to the Java
directory, com spark demo;
8403
06:08:06,600 --> 06:08:08,200
then this is the model.
8404
06:08:08,200 --> 06:08:10,100
So it's the same,
8405
06:08:10,600 --> 06:08:13,011
so these are all the fields
that are there
8406
06:08:13,011 --> 06:08:15,800
in the transaction:
you have transaction
8407
06:08:15,800 --> 06:08:18,100
date, product, price, payment type,
8408
06:08:18,100 --> 06:08:22,500
the name city state country
account created and so on.
8409
06:08:22,500 --> 06:08:25,100
And again, we have
specified all the getter
8410
06:08:25,100 --> 06:08:29,285
and Setter methods over here
and similarly again,
8411
06:08:29,285 --> 06:08:32,600
we have created
this transaction dto Constructor
8412
06:08:32,600 --> 06:08:34,900
where we have taken
all the parameters
8413
06:08:34,900 --> 06:08:38,200
and then we are setting
the values using the this operator.
8414
06:08:38,200 --> 06:08:39,100
Next.
8415
06:08:39,100 --> 06:08:42,400
We are again overriding
this toString function,
8416
06:08:42,400 --> 06:08:43,414
and over here.
8417
06:08:43,414 --> 06:08:47,500
We are again returning the
details like transaction date
8418
06:08:47,500 --> 06:08:49,700
and then value of
transaction date, product,
8419
06:08:49,700 --> 06:08:53,200
and then value of product
and similarly all the fields.
8420
06:08:53,411 --> 06:08:55,488
So let me close this model.
8421
06:08:55,900 --> 06:08:57,100
Let's go back.
8422
06:08:57,200 --> 06:09:00,500
Let's look at the Kafka package;
then we have the serializer.
8423
06:09:00,500 --> 06:09:02,294
So this is the JSON serializer
8424
06:09:02,294 --> 06:09:06,187
which was there in our producer
and this is transaction decoder.
8425
06:09:06,187 --> 06:09:07,300
Let's take a look.
8426
06:09:07,780 --> 06:09:09,319
Now you have decoder
8427
06:09:09,400 --> 06:09:12,600
which is again implementing
decoder and we're passing
8428
06:09:12,600 --> 06:09:14,800
this transaction dto then again,
8429
06:09:14,800 --> 06:09:17,339
you can see we have this fromBytes
method
8430
06:09:17,339 --> 06:09:18,800
which we are overriding
8431
06:09:18,800 --> 06:09:22,022
and we are reading
the values using these bytes
8432
06:09:22,022 --> 06:09:24,600
and the transaction
DTO class; again,
8433
06:09:24,600 --> 06:09:28,600
if it is failing to parse, we are
giving JSON processing failed
8434
06:09:28,600 --> 06:09:29,799
for object this
8435
06:09:30,200 --> 06:09:31,573
and you can see we have
8436
06:09:31,573 --> 06:09:34,200
this transaction decoder
constructor over here.
8437
06:09:34,200 --> 06:09:37,200
So let me quickly
again close this file.
8438
06:09:37,200 --> 06:09:38,892
Let's quickly go back.
8439
06:09:39,400 --> 06:09:42,500
And now let's take a look
at the Spark streaming app,
8440
06:09:42,500 --> 06:09:44,200
where basically the data
8441
06:09:44,200 --> 06:09:48,100
which the producer project
will be producing to Kafka
8442
06:09:48,100 --> 06:09:51,900
will be actually consumed by
spark streaming application.
8443
06:09:51,900 --> 06:09:55,071
So spark streaming will stream
the data in real time
8444
06:09:55,071 --> 06:09:57,000
and then will display the data.
8445
06:09:57,000 --> 06:09:59,600
So in this Spark
streaming application,
8446
06:09:59,600 --> 06:10:03,189
we are creating conf object
and then we are setting
8447
06:10:03,189 --> 06:10:05,900
the application name
as Kafka sandbox.
8448
06:10:05,900 --> 06:10:09,331
The master is local[*],
then we have the Java
8449
06:10:09,331 --> 06:10:13,100
Spark context. So here we
are specifying the Spark context,
8450
06:10:13,100 --> 06:10:16,700
and then next we are specifying
the Java streaming context.
8451
06:10:16,700 --> 06:10:18,500
So this object will basically
8452
06:10:18,500 --> 06:10:21,100
be used to take
the streaming data.
8453
06:10:21,100 --> 06:10:25,946
So we are passing this Java Spark
context over here as a parameter,
8454
06:10:25,946 --> 06:10:29,900
and then we are specifying
the duration that is 2000.
8455
06:10:29,900 --> 06:10:30,200
Next.
8456
06:10:30,200 --> 06:10:32,600
We have Kafka parameters,
so to connect
8457
06:10:32,600 --> 06:10:35,555
to Kafka you need
to specify these parameters.
8458
06:10:35,555 --> 06:10:37,100
So in Kafka parameters,
8459
06:10:37,100 --> 06:10:39,500
we are specifying
the metadata broker
8460
06:10:39,500 --> 06:10:44,292
list, that is localhost:9092,
then we have auto offset reset,
8461
06:10:44,292 --> 06:10:45,600
that is smallest.
8462
06:10:45,600 --> 06:10:49,200
Then in topics the name
of the topic from which we
8463
06:10:49,200 --> 06:10:53,300
will be consuming messages
is transaction. Next,
8464
06:10:53,300 --> 06:10:56,200
we're creating a Java
pair input D stream.
8465
06:10:56,200 --> 06:10:59,300
So basically this D stream
is a discretized stream,
8466
06:10:59,300 --> 06:11:02,300
which is the basic abstraction
of spark streaming
8467
06:11:02,300 --> 06:11:04,290
and is a continuous sequence
8468
06:11:04,290 --> 06:11:07,104
of rdds representing
a continuous stream
8469
06:11:07,104 --> 06:11:11,200
of data. Now the D stream can either
be created from live data
8470
06:11:11,200 --> 06:11:13,000
from Kafka, HDFS or Flume,
8471
06:11:13,000 --> 06:11:14,457
or it can be generated
8472
06:11:14,457 --> 06:11:17,900
from transforming existing D
streams using operations.
8473
06:11:17,900 --> 06:11:18,828
So over here,
8474
06:11:18,828 --> 06:11:21,700
We are again creating
a Java input D stream.
8475
06:11:21,700 --> 06:11:24,700
We are passing string
and transaction DTO as parameters,
8476
06:11:24,700 --> 06:11:27,504
and we are creating
direct Kafka stream object.
8477
06:11:27,504 --> 06:11:29,700
Then we're using
this KafkaUtils
8478
06:11:29,700 --> 06:11:33,000
and we are calling
the method create direct stream
8479
06:11:33,000 --> 06:11:35,885
where we are passing
the parameters as SSC
8480
06:11:35,885 --> 06:11:38,700
that is your spark
streaming context then
8481
06:11:38,700 --> 06:11:40,341
you have String dot class
8482
06:11:40,341 --> 06:11:42,829
which is basically
your key serializer.
8483
06:11:42,829 --> 06:11:45,322
Then TransactionDTO
dot class,
8484
06:11:45,322 --> 06:11:46,500
that is basically
8485
06:11:46,500 --> 06:11:49,700
your value serializer
then string decoder
8486
06:11:49,700 --> 06:11:52,868
that is to decode your key
and then transaction
8487
06:11:52,868 --> 06:11:55,900
decoder, basically to
decode your transaction.
8488
06:11:55,900 --> 06:11:57,784
Then you have Kafka parameters,
8489
06:11:57,784 --> 06:11:59,501
which you have created here
8490
06:11:59,501 --> 06:12:02,300
where you have specified
broker list and auto
8491
06:12:02,300 --> 06:12:05,900
offset reset and then you
are specifying the topics
8492
06:12:05,900 --> 06:12:10,500
which is your transaction. So
next, using this direct Kafka stream,
8493
06:12:10,500 --> 06:12:14,000
you're actually continuously
iterating over the rdd
8494
06:12:14,000 --> 06:12:17,345
and then you are trying
to print your new rdd
8495
06:12:17,345 --> 06:12:19,400
with the rdd partitions
8496
06:12:19,400 --> 06:12:21,200
and size, then rdd count
8497
06:12:21,200 --> 06:12:24,600
and the records. So
for each record of the rdd,
8498
06:12:24,900 --> 06:12:26,400
you are printing the record,
8499
06:12:26,500 --> 06:12:30,400
and then you are starting
the Spark streaming context,
8500
06:12:30,400 --> 06:12:32,800
and then you are waiting
for the termination.
8501
06:12:32,800 --> 06:12:35,500
So this is the spark
streaming application.
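(The demo's streaming job is written in Java; purely as a hedged sketch, a PySpark equivalent — assuming a pre-3.0 Spark, where the pyspark.streaming.kafka module with KafkaUtils was still shipped — would look roughly like this:)

    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    conf = SparkConf().setAppName("kafka-sandbox").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, 2)  # 2-second batches, like the 2000 ms duration above

    kafka_params = {"metadata.broker.list": "localhost:9092",
                    "auto.offset.reset": "smallest"}
    stream = KafkaUtils.createDirectStream(ssc, ["transaction"], kafka_params)

    def show(rdd):
        # Print the partition count, record count and every record of the micro-batch.
        print("partitions:", rdd.getNumPartitions(), "count:", rdd.count())
        for record in rdd.collect():
            print(record)

    stream.foreachRDD(show)
    ssc.start()
    ssc.awaitTermination()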
8502
06:12:35,500 --> 06:12:39,200
So let's first quickly go ahead
and execute this application.
8503
06:12:39,200 --> 06:12:40,900
Let me close this file.
8504
06:12:41,000 --> 06:12:43,400
Let's go to the source.
8505
06:12:44,900 --> 06:12:49,000
Now, let me quickly go ahead and
delete this target directory.
8506
06:12:49,000 --> 06:12:53,615
So now let me quickly open the
terminal: mvn clean install.
8507
06:12:58,400 --> 06:13:01,800
So now as you can see the target
directory is again created
8508
06:13:01,800 --> 06:13:05,307
and this spark streaming Kafka
snapshot jar is created.
8509
06:13:05,307 --> 06:13:07,300
So we need to execute this jar.
8510
06:13:07,700 --> 06:13:10,800
So let me quickly go ahead
and minimize it.
8511
06:13:12,500 --> 06:13:14,300
Let me close this terminal.
8512
06:13:14,400 --> 06:13:18,000
So now first I'll start
this Spark streaming job.
8513
06:13:18,600 --> 06:13:24,100
So the command is Java -
jar inside the target directory.
8514
06:13:24,600 --> 06:13:31,500
We have this spark streaming
Kafka jar, so let's hit enter.
8515
06:13:34,500 --> 06:13:38,100
So let me now quickly go ahead
and start producing messages.
8516
06:13:41,000 --> 06:13:44,100
So I will minimize this and I
will wait for the messages.
8517
06:13:50,019 --> 06:13:53,480
So let me quickly close
this Spark streaming job,
8518
06:13:53,600 --> 06:13:56,900
and then I will show
you the consumed records
8519
06:13:59,000 --> 06:14:00,400
so you can see the record
8520
06:14:00,400 --> 06:14:02,673
that is consumed
from spark streaming.
8521
06:14:02,673 --> 06:14:05,500
So here you have got record
and transaction dto
8522
06:14:05,500 --> 06:14:08,561
and then transaction date
products all the details,
8523
06:14:08,561 --> 06:14:09,969
which we have specified.
8524
06:14:09,969 --> 06:14:11,500
You can see it over here.
8525
06:14:11,500 --> 06:14:15,400
So this is how spark
streaming works with Kafka now,
8526
06:14:15,400 --> 06:14:17,600
it's just a basic job again.
8527
06:14:17,600 --> 06:14:20,900
You can go ahead and you
can take those transactions, you
8528
06:14:20,900 --> 06:14:23,651
can perform some real-time
analytics over there
8529
06:14:23,651 --> 06:14:27,406
and then you can go ahead and
write those results so over here
8530
06:14:27,406 --> 06:14:29,500
we have just given
you a basic demo
8531
06:14:29,500 --> 06:14:32,401
in which we are producing
the records to Kafka
8532
06:14:32,401 --> 06:14:34,400
and then using spark streaming.
8533
06:14:34,400 --> 06:14:37,533
We are streaming those records
from Kafka again.
8534
06:14:37,533 --> 06:14:38,600
You can go ahead
8535
06:14:38,600 --> 06:14:41,083
and you can perform
multiple Transformations
8536
06:14:41,083 --> 06:14:42,848
over the data multiple actions
8537
06:14:42,848 --> 06:14:45,500
and produce some real-time
results using this data.
8538
06:14:45,500 --> 06:14:48,975
So this is just a basic demo
where we have shown you
8539
06:14:48,975 --> 06:14:51,700
how to basically
produce records to Kafka
8540
06:14:51,700 --> 06:14:55,000
and then consume those records
using spark streaming.
8541
06:14:55,000 --> 06:14:57,846
So let's quickly go
back to our slide.
8542
06:14:58,600 --> 06:15:00,526
Now, as this was a basic project,
8543
06:15:00,526 --> 06:15:01,669
let me explain to you
8544
06:15:01,669 --> 06:15:04,390
one of the Kafka
Spark streaming projects,
8545
06:15:04,390 --> 06:15:05,754
which is at Edureka.
8546
06:15:05,754 --> 06:15:09,100
So basically there is a company
called techreview.com.
8547
06:15:09,100 --> 06:15:11,900
So this techreview.com
basically provides reviews
8548
06:15:11,900 --> 06:15:14,481
for recent
and different technologies,
8549
06:15:14,481 --> 06:15:17,800
like a smart watches phones
different operating systems
8550
06:15:17,800 --> 06:15:20,100
and anything new
that is coming into Market.
8551
06:15:20,100 --> 06:15:23,409
So what happens is the company
decided to include a new feature
8552
06:15:23,409 --> 06:15:26,883
which will basically allow
users to compare the popularity
8553
06:15:26,883 --> 06:15:29,200
or trend of multiple
Technologies based
8554
06:15:29,200 --> 06:15:32,400
on the Twitter feeds,
and second, for the USP,
8555
06:15:32,400 --> 06:15:33,500
they basically want
8556
06:15:33,500 --> 06:15:36,200
this comparison
to happen in real time.
8557
06:15:36,200 --> 06:15:38,788
So basically they
have assigned you this task
8558
06:15:38,788 --> 06:15:41,299
so that you have to go
ahead you have to take
8559
06:15:41,299 --> 06:15:42,752
the real-time Twitter feeds
8560
06:15:42,752 --> 06:15:45,400
then you have to show
the real time comparison
8561
06:15:45,400 --> 06:15:46,900
of various Technologies.
8562
06:15:46,900 --> 06:15:50,500
So again, the company
is asking you to identify
8563
06:15:50,500 --> 06:15:51,684
the minutely trend
8564
06:15:51,684 --> 06:15:55,500
between different Technologies
by consuming Twitter streams
8565
06:15:55,500 --> 06:15:58,900
and writing the aggregated minutely
counts to Cassandra,
8566
06:15:58,900 --> 06:16:00,200
from where, again, a dash-
8567
06:16:00,200 --> 06:16:02,700
boarding team will come
into the picture, and then they
8568
06:16:02,700 --> 06:16:06,700
will try to dashboard that data
and it can show you a graph
8569
06:16:06,700 --> 06:16:07,800
where you can see
8570
06:16:07,800 --> 06:16:09,892
how the trend of two different
8571
06:16:09,892 --> 06:16:13,656
or you can say various
technologies is going ahead. Now,
8572
06:16:13,656 --> 06:16:16,157
the solution strategy
which is there
8573
06:16:16,157 --> 06:16:20,083
so you have to continuously
stream the data from Twitter.
8574
06:16:20,083 --> 06:16:21,689
Then you will be storing
8575
06:16:21,689 --> 06:16:24,322
those tweets
inside a Kafka topic;
8576
06:16:24,322 --> 06:16:25,567
then second again.
8577
06:16:25,567 --> 06:16:27,987
You have to
perform spark streaming.
8578
06:16:27,987 --> 06:16:31,009
So you will be continuously
streaming the data
8579
06:16:31,009 --> 06:16:34,300
and then you will be
applying some Transformations
8580
06:16:34,300 --> 06:16:36,900
which will basically
give you the minutely trend
8581
06:16:36,900 --> 06:16:38,361
of the two technologies.
8582
06:16:38,361 --> 06:16:41,747
And again, you'll write it back
to a Kafka topic, and at last
8583
06:16:41,747 --> 06:16:42,992
you'll write a consumer
8584
06:16:42,992 --> 06:16:46,051
that will be consuming messages
from the Kafka topic,
8585
06:16:46,051 --> 06:16:49,200
and that will write the data
in your Cassandra database.
8586
06:16:49,200 --> 06:16:51,018
So First you have
to write a program
8587
06:16:51,018 --> 06:16:53,049
that will be consuming
data from Twitter
8588
06:16:53,049 --> 06:16:54,696
and write it to a Kafka topic.
8589
06:16:54,696 --> 06:16:56,999
Then you have to write
a spark streaming job,
8590
06:16:56,999 --> 06:17:00,200
which will be continuously
streaming the data from Kafka
8591
06:17:00,300 --> 06:17:03,300
and perform analytics
to identify the minutely trend,
8592
06:17:03,300 --> 06:17:06,200
and then it will write the data
back to a Kafka topic,
8593
06:17:06,200 --> 06:17:08,282
and then you have
to write the third job
8594
06:17:08,282 --> 06:17:10,114
which will be
basically a consumer
8595
06:17:10,114 --> 06:17:12,668
that will consume data
from the Kafka topic
8596
06:17:12,668 --> 06:17:15,000
and write the data
to a Cassandra database.
8597
06:17:19,800 --> 06:17:21,709
PySpark is
a powerful framework,
8598
06:17:21,709 --> 06:17:23,960
which has been heavily
used in the industry
8599
06:17:23,960 --> 06:17:26,800
for real-time analytics
and machine learning purposes.
8600
06:17:26,800 --> 06:17:28,689
So before I proceed
with the session,
8601
06:17:28,689 --> 06:17:30,489
let's have a quick
look at the topics
8602
06:17:30,489 --> 06:17:31,968
which will be covering today.
8603
06:17:31,968 --> 06:17:33,600
So I'm starting
off by explaining
8604
06:17:33,600 --> 06:17:35,900
what exactly PySpark is
and how it works.
8605
06:17:35,900 --> 06:17:36,900
When we go ahead.
8606
06:17:36,900 --> 06:17:39,819
We'll find out the various
advantages provided by spark.
8607
06:17:39,819 --> 06:17:41,200
Then I will be showing you
8608
06:17:41,200 --> 06:17:43,400
how to install
PySpark in your systems.
8609
06:17:43,400 --> 06:17:45,300
Once we are done
with the installation.
8610
06:17:45,300 --> 06:17:48,200
I will talk about the
fundamental concepts of PySpark,
8611
06:17:48,200 --> 06:17:49,800
like the Spark context,
8612
06:17:49,900 --> 06:17:53,900
data frames, MLlib, RDDs
and much more. And finally,
8613
06:17:53,900 --> 06:17:57,100
I'll close of the session with
the demo in which I'll show you
8614
06:17:57,100 --> 06:18:00,200
how to implement PySpark
to solve real life use cases.
8615
06:18:00,200 --> 06:18:01,791
So without any further Ado,
8616
06:18:01,791 --> 06:18:04,621
let's quickly embark
on our journey to PySpark. Now,
8617
06:18:04,621 --> 06:18:06,558
before I start off
with PySpark,
8618
06:18:06,558 --> 06:18:09,500
let me first brief you
about the PySpark ecosystem.
8619
06:18:09,500 --> 06:18:13,154
as you can see from the diagram
the spark ecosystem is composed
8620
06:18:13,154 --> 06:18:16,400
of various components like
Spark SQL, Spark Streaming,
8621
06:18:16,400 --> 06:18:19,800
MLlib, GraphX and the core
API component. The Spark
8622
06:18:19,800 --> 06:18:22,000
SQL component is used
to leverage the power
8623
06:18:22,000 --> 06:18:23,320
of declarative queries
8624
06:18:23,320 --> 06:18:26,281
and optimize storage
by executing sql-like queries
8625
06:18:26,281 --> 06:18:27,124
on spark data,
8626
06:18:27,124 --> 06:18:28,654
which is presented in rdds
8627
06:18:28,654 --> 06:18:31,589
and other external sources.
The Spark Streaming component
8628
06:18:31,589 --> 06:18:33,882
allows developers
to perform batch processing
8629
06:18:33,882 --> 06:18:36,714
and streaming of data with ease
in the same application.
8630
06:18:36,714 --> 06:18:39,345
The machine learning library
eases the development
8631
06:18:39,345 --> 06:18:41,600
and deployment of
scalable machine learning
8632
06:18:41,600 --> 06:18:43,600
pipelines. The GraphX component
8633
06:18:43,600 --> 06:18:47,100
lets the data scientists work
with graph and non-graph sources
8634
06:18:47,100 --> 06:18:49,982
to achieve flexibility
and resilience in graph.
8635
06:18:49,982 --> 06:18:51,775
construction and transformations.
8636
06:18:51,775 --> 06:18:54,000
and finally the
spark core component.
8637
06:18:54,000 --> 06:18:56,723
It is the most vital component
of spark ecosystem,
8638
06:18:56,723 --> 06:18:57,900
which is responsible
8639
06:18:57,900 --> 06:19:00,644
for basic input output
functions scheduling
8640
06:19:00,644 --> 06:19:04,172
and monitoring the entire
spark ecosystem is built on top
8641
06:19:04,172 --> 06:19:06,014
of this core execution engine,
8642
06:19:06,014 --> 06:19:09,000
which has extensible apis
in different languages
8643
06:19:09,000 --> 06:19:12,300
like Scala Python and Java
and in today's session,
8644
06:19:12,300 --> 06:19:13,915
I will specifically discuss
8645
06:19:13,915 --> 06:19:16,967
about the spark API
in Python programming languages,
8646
06:19:16,967 --> 06:19:19,600
which is more popularly
known as PySpark.
8647
06:19:19,700 --> 06:19:22,839
You might be wondering,
why PySpark? Well, to get
8648
06:19:22,839 --> 06:19:24,000
a better Insight.
8649
06:19:24,000 --> 06:19:26,400
let me give you a brief
intro to PySpark.
8650
06:19:26,400 --> 06:19:29,300
Now as you already know,
PySpark is the collaboration
8651
06:19:29,300 --> 06:19:31,050
of two powerful Technologies,
8652
06:19:31,050 --> 06:19:32,500
which are spark which is
8653
06:19:32,500 --> 06:19:35,459
an open-source clustering
Computing framework built
8654
06:19:35,459 --> 06:19:38,300
around speed ease of use
and streaming analytics.
8655
06:19:38,300 --> 06:19:40,707
And the other one is python,
of course python,
8656
06:19:40,707 --> 06:19:43,900
which is a general purpose
high-level programming language.
8657
06:19:43,900 --> 06:19:46,900
It provides wide range
of libraries and is majorly used
8658
06:19:46,900 --> 06:19:50,000
for machine learning
and real-time analytics now,
8659
06:19:50,000 --> 06:19:52,000
which gives us PySpark,
8660
06:19:52,000 --> 06:19:53,852
which is a Python
API for Spark
8661
06:19:53,852 --> 06:19:56,581
that lets you harness
the Simplicity of Python
8662
06:19:56,581 --> 06:19:58,400
and The Power of Apache spark.
8663
06:19:58,400 --> 06:20:01,059
in order to tame
big data. PySpark
8664
06:20:01,059 --> 06:20:03,398
also lets you use
the RDDs and comes
8665
06:20:03,398 --> 06:20:06,700
with a default integration
of the Py4J library.
8666
06:20:06,700 --> 06:20:10,397
We'll learn about RDDs later
in this video. Now that you know
8667
06:20:10,397 --> 06:20:11,500
what PySpark is,
8668
06:20:11,500 --> 06:20:14,400
Let's now see the advantages
of using spark with python
8669
06:20:14,400 --> 06:20:17,700
as we all know python
itself is very simple and easy.
8670
06:20:17,700 --> 06:20:20,700
So when Spark is written
in Python, it makes PySpark
8671
06:20:20,700 --> 06:20:22,837
quite easy to learn
and use moreover.
8672
06:20:22,837 --> 06:20:24,737
it's a dynamically typed language,
8673
06:20:24,737 --> 06:20:28,300
which means RDDs can hold
objects of multiple data types.
8674
06:20:28,300 --> 06:20:30,711
Not only this, it also
makes the API simple
8675
06:20:30,711 --> 06:20:32,400
and comprehensive and talking
8676
06:20:32,400 --> 06:20:34,700
about the readability
of code maintenance
8677
06:20:34,700 --> 06:20:36,700
and familiarity with
the python API
8678
06:20:36,700 --> 06:20:38,577
for Apache Spark is far better
8679
06:20:38,577 --> 06:20:41,000
than other programming
languages python also
8680
06:20:41,000 --> 06:20:43,100
provides various options
for visualization,
8681
06:20:43,100 --> 06:20:46,180
which is not possible using
Scala or Java moreover.
8682
06:20:46,180 --> 06:20:49,200
You can conveniently call
R directly from Python.
8683
06:20:49,200 --> 06:20:50,800
And above this, Python comes
8684
06:20:50,800 --> 06:20:52,300
with a wide range of libraries
8685
06:20:52,300 --> 06:20:55,800
like NumPy, Pandas,
Scikit-learn, Seaborn, Matplotlib,
8686
06:20:55,800 --> 06:20:57,912
and these libraries aid
in data analysis
8687
06:20:57,912 --> 06:20:59,300
and also provide mature
8688
06:20:59,300 --> 06:21:02,564
and time-tested statistics.
With all these features,
8689
06:21:02,564 --> 06:21:04,100
You can effortlessly program
8690
06:21:04,100 --> 06:21:06,700
in PySpark. In case
you get stuck somewhere
8691
06:21:06,700 --> 06:21:07,600
or have a doubt,
8692
06:21:07,600 --> 06:21:08,835
there is a huge PySpark
8693
06:21:08,835 --> 06:21:12,600
community out there whom you
can reach out to and put your query to,
8694
06:21:12,600 --> 06:21:13,800
and that is very active.
8695
06:21:13,800 --> 06:21:16,647
So I will make good use
of this opportunity to show you
8696
06:21:16,647 --> 06:21:18,000
how to install PySpark
8697
06:21:18,000 --> 06:21:20,900
in a system now here
I'm using Red Hat Linux
8698
06:21:20,900 --> 06:21:24,400
based CentOS system;
the same steps can be applied
8699
06:21:24,400 --> 06:21:26,000
for other Linux systems as well.
8700
06:21:26,200 --> 06:21:28,500
So in order to install
Pi spark first,
8701
06:21:28,500 --> 06:21:31,100
make sure that you have
Hadoop installed in your system.
8702
06:21:31,100 --> 06:21:33,700
So if you want to know more
about how to install Hadoop,
8703
06:21:33,700 --> 06:21:36,500
please check out
our new playlist on YouTube
8704
06:21:36,500 --> 06:21:39,909
or you can check out our blog on
the Edureka website. Then first
8705
06:21:39,909 --> 06:21:43,100
of all you need to go to the
Apache spark official website,
8706
06:21:43,100 --> 06:21:44,750
which is
spark.apache.org,
8707
06:21:44,750 --> 06:21:48,025
and in the download section you
can download the latest version
8708
06:21:48,025 --> 06:21:48,907
of spark release
8709
06:21:48,907 --> 06:21:51,500
which supports
the latest version of Hadoop,
8710
06:21:51,500 --> 06:21:53,800
that is, Hadoop version
2.7 or above. Now,
8711
06:21:53,800 --> 06:21:55,429
Once you have downloaded it,
8712
06:21:55,429 --> 06:21:57,900
all you need to do is
extract it, or rather say,
8713
06:21:57,900 --> 06:21:59,400
untar the file contents.
8714
06:21:59,400 --> 06:22:01,400
And after that you
need to put in the path
8715
06:22:01,400 --> 06:22:04,200
where the spark is installed
in the bash RC file.
8716
06:22:04,200 --> 06:22:06,082
Now, you also need
to install pip
8717
06:22:06,082 --> 06:22:09,300
and Jupyter notebook using
the pip command, and make sure
8718
06:22:09,300 --> 06:22:11,700
that the version
of Python is 3 or above. So,
8719
06:22:11,700 --> 06:22:12,820
as you can see here,
8720
06:22:12,820 --> 06:22:16,114
this is what our bashrc file
looks like. Here you can see
8721
06:22:16,114 --> 06:22:17,700
that we have put in the path
8722
06:22:17,700 --> 06:22:20,700
for Hadoop, Spark, as
well as the Spark driver Python,
8723
06:22:20,700 --> 06:22:22,200
which is The jupyter Notebook.
8724
06:22:22,200 --> 06:22:23,087
What it'll do
8725
06:22:23,087 --> 06:22:25,939
is that the moment you
run the PySpark shell,
8726
06:22:25,939 --> 06:22:29,300
it will automatically open
a jupyter notebook for you.
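(For reference, the driver lines being described usually look something like this in the bashrc file — the install paths here are illustrative, not the exact ones used in the video:)

    export SPARK_HOME=/usr/lib/spark              # illustrative Spark path
    export PATH=$PATH:$SPARK_HOME/bin
    export PYSPARK_DRIVER_PYTHON=jupyter          # launch Jupyter as the driver
    export PYSPARK_DRIVER_PYTHON_OPTS='notebook'  # open it in notebook mode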
8727
06:22:29,300 --> 06:22:29,551
Now.
8728
06:22:29,551 --> 06:22:32,000
I find jupyter notebook
very easy to work
8729
06:22:32,000 --> 06:22:35,700
with, rather than the shell;
it's a personal choice. Now
8730
06:22:35,700 --> 06:22:37,899
that we are done
with the installation part,
8731
06:22:37,899 --> 06:22:40,100
let's now dive deeper
into PySpark and learn a few
8732
06:22:40,100 --> 06:22:41,100
of its fundamentals,
8733
06:22:41,100 --> 06:22:43,770
which you need to know
in order to work with PySpark.
8734
06:22:43,770 --> 06:22:45,870
Now this timeline shows
the various topics,
8735
06:22:45,870 --> 06:22:48,600
which we will be covering under
the PySpark fundamentals.
8736
06:22:48,700 --> 06:22:49,650
So let's start off.
8737
06:22:49,650 --> 06:22:51,500
with the very first
topic in our list,
8738
06:22:51,500 --> 06:22:53,100
that is the Spark context.
8739
06:22:53,100 --> 06:22:56,335
The spark context is the heart
of any spark application.
8740
06:22:56,335 --> 06:22:59,518
It sets up internal services
and establishes a connection
8741
06:22:59,518 --> 06:23:03,300
to a spark execution environment
through a spark context object.
8742
06:23:03,300 --> 06:23:05,357
You can create RDDs, accumulators
8743
06:23:05,357 --> 06:23:09,000
and broadcast variables,
access Spark services, run jobs
8744
06:23:09,000 --> 06:23:11,362
and much more
the spark context allows
8745
06:23:11,362 --> 06:23:14,094
the spark driver application
to access the cluster
8746
06:23:14,094 --> 06:23:15,600
through a resource manager,
8747
06:23:15,600 --> 06:23:16,600
which can be yarn
8748
06:23:16,600 --> 06:23:19,600
or Spark's cluster manager.
The driver program then runs
8749
06:23:19,700 --> 06:23:23,044
operations inside the executors
on the worker nodes,
8750
06:23:23,044 --> 06:23:26,478
and Spark context uses Py4J
to launch a JVM,
8751
06:23:26,478 --> 06:23:29,200
which in turn creates
a Java spark context.
8752
06:23:29,200 --> 06:23:30,884
Now there are
various parameters,
8753
06:23:30,884 --> 06:23:33,200
which can be used
with a spark context object
8754
06:23:33,200 --> 06:23:34,700
like the master, app name,
8755
06:23:34,700 --> 06:23:37,366
Spark home, the py
files, the environment,
8756
06:23:37,366 --> 06:23:41,600
the batch size, the
serializer, configuration, gateway
8757
06:23:41,600 --> 06:23:44,267
and much more
among these parameters
8758
06:23:44,267 --> 06:23:47,700
the master and app name
are the most commonly used. Now,
8759
06:23:47,700 --> 06:23:51,061
to give you a basic Insight
on how Spark program works.
8760
06:23:51,061 --> 06:23:53,807
I have listed down
its basic lifecycle phases
8761
06:23:53,807 --> 06:23:56,903
the typical life cycle
of a spark program includes
8762
06:23:56,903 --> 06:23:59,367
creating rdds from
external data sources
8763
06:23:59,367 --> 06:24:02,400
or parallelizing a collection
in your driver program.
8764
06:24:02,400 --> 06:24:05,361
Then we have the lazy
transformations, that is lazily
8765
06:24:05,361 --> 06:24:07,064
transforming the base rdds
8766
06:24:07,064 --> 06:24:10,600
into new RDDs using
transformations, then caching a few
8767
06:24:10,600 --> 06:24:12,700
of those rdds for future reuse
8768
06:24:12,800 --> 06:24:15,800
and finally performing action
to execute parallel computation
8769
06:24:15,800 --> 06:24:17,500
and to produce the results.
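(A minimal sketch of that life cycle — the names are illustrative, and in the shell or notebook a Spark context sc usually already exists:)

    from pyspark import SparkContext

    sc = SparkContext(master="local[*]", appName="lifecycle-demo")  # skip if sc exists

    base = sc.parallelize(range(1, 101))       # 1. create an RDD from a collection
    squares = base.map(lambda x: x * x)        # 2. lazy transformation
    squares.cache()                            # 3. cache a few RDDs for reuse
    print(squares.reduce(lambda a, b: a + b))  # 4. the action triggers computation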
8770
06:24:17,500 --> 06:24:19,800
The next topic
in our list is RDD,
8771
06:24:19,800 --> 06:24:20,700
And I'm sure people
8772
06:24:20,700 --> 06:24:23,700
who have already worked with
spark a familiar with this term,
8773
06:24:23,700 --> 06:24:25,582
but for people
who are new to it,
8774
06:24:25,582 --> 06:24:26,900
let me just explain it.
8775
06:24:26,900 --> 06:24:29,782
Now RDD stands for
resilient distributed data set.
8776
06:24:29,782 --> 06:24:32,000
It is considered to be
the building block
8777
06:24:32,000 --> 06:24:33,433
of any spark application.
8778
06:24:33,433 --> 06:24:35,900
The reason behind this
is these elements run
8779
06:24:35,900 --> 06:24:38,600
and operate on multiple nodes
to do parallel processing
8780
06:24:38,600 --> 06:24:39,400
on a cluster.
8781
06:24:39,400 --> 06:24:40,952
And once you create an RDD,
8782
06:24:40,952 --> 06:24:43,273
it becomes immutable,
and by immutable,
8783
06:24:43,273 --> 06:24:46,637
I mean that it is an object
whose State cannot be modified
8784
06:24:46,637 --> 06:24:47,700
after its created,
8785
06:24:47,700 --> 06:24:49,654
but we can transform
its values by
8786
06:24:49,654 --> 06:24:51,438
applying certain transformations.
8787
06:24:51,438 --> 06:24:53,500
They have good
fault tolerance ability
8788
06:24:53,500 --> 06:24:56,700
and can automatically recover
from almost any failures.
8789
06:24:56,700 --> 06:25:00,700
This adds an added advantage.
Now, to achieve a certain task,
8790
06:25:00,700 --> 06:25:03,205
multiple operations can
be applied on these RDDs,
8791
06:25:03,205 --> 06:25:05,675
which are categorized
in two ways: the first
8792
06:25:05,675 --> 06:25:06,800
one is the transformations
8793
06:25:06,800 --> 06:25:09,900
and the second one is
the actions the Transformations
8794
06:25:09,900 --> 06:25:10,800
are the operations
8795
06:25:10,800 --> 06:25:13,800
which are applied on an RDD
to create a new RDD.
8796
06:25:14,000 --> 06:25:15,300
Now these transformations work
8797
06:25:15,300 --> 06:25:17,300
on the principle
of lazy evaluation
8798
06:25:17,700 --> 06:25:19,900
and transformation
are lazy in nature.
8799
06:25:19,900 --> 06:25:22,927
Meaning when we call
some operation on our RDD,
8800
06:25:22,927 --> 06:25:25,758
it does not execute
immediately. Spark maintains
8801
06:25:25,758 --> 06:25:28,602
the record of the operations
it is being called
8802
06:25:28,602 --> 06:25:31,324
through with the help
of directed acyclic graphs,
8803
06:25:31,324 --> 06:25:33,100
which is also known as the DAG,
8804
06:25:33,100 --> 06:25:35,900
and since the Transformations
are lazy in nature.
8805
06:25:35,900 --> 06:25:37,604
So when we execute operation
8806
06:25:37,604 --> 06:25:40,100
any time by calling
an action on the data,
8807
06:25:40,100 --> 06:25:42,371
due to lazy evaluation
the data is not loaded
8808
06:25:42,371 --> 06:25:43,547
until it's necessary
8809
06:25:43,547 --> 06:25:46,900
and the moment we call out
the action all the computations
8810
06:25:46,900 --> 06:25:49,900
are performed parallely to give
you the desired output.
8811
06:25:49,900 --> 06:25:52,400
Now, a few important
Transformations are
8812
06:25:52,400 --> 06:25:53,944
the map, flatMap, filter,
8813
06:25:53,944 --> 06:25:55,360
distinct, reduceBy-
8814
06:25:55,360 --> 06:25:59,000
Key, mapPartitions, sortBy.
Actions are the operations
8815
06:25:59,000 --> 06:26:02,058
which are applied on an RDD
to instruct Apache Spark
8816
06:26:02,058 --> 06:26:03,188
to apply computation
8817
06:26:03,188 --> 06:26:05,600
and pass the result back
to the driver few
8818
06:26:05,600 --> 06:26:09,100
of these actions include
collect, collectAsMap, reduce,
8819
06:26:09,100 --> 06:26:10,300
take, first. Now,
8820
06:26:10,300 --> 06:26:13,600
let me Implement few of these
for your better understanding.
8821
06:26:14,600 --> 06:26:17,000
So first of all,
let me show you the bash
8822
06:26:17,000 --> 06:26:18,800
rc file which I
was talking about.
8823
06:26:25,100 --> 06:26:27,196
So here you can see
in the bashrc file,
8824
06:26:27,196 --> 06:26:29,400
We provide the path
for all the Frameworks
8825
06:26:29,400 --> 06:26:31,250
which we have installed
in the system.
8826
06:26:31,250 --> 06:26:32,800
So for example,
you can see here.
8827
06:26:32,800 --> 06:26:35,100
we have installed
Hadoop. The moment we
8828
06:26:35,100 --> 06:26:38,100
installed and unzipped it,
or rather say untarred it,
8829
06:26:38,100 --> 06:26:41,300
I have shifted all my Frameworks
to one particular location
8830
06:26:41,300 --> 06:26:43,492
as you can see, it's
the usr, the user directory,
8831
06:26:43,492 --> 06:26:46,140
and inside this we have
the lib directory, and inside
8832
06:26:46,140 --> 06:26:49,217
that I have installed the Hadoop
and also the Spark. Now,
8833
06:26:49,217 --> 06:26:50,400
as you can see here,
8834
06:26:50,400 --> 06:26:51,300
we have two lines.
8835
06:26:51,300 --> 06:26:54,800
I'll highlight this one for
you: the PySpark driver
8836
06:26:54,800 --> 06:26:56,392
Python, which is the Jupyter,
8837
06:26:56,392 --> 06:26:59,700
and we have given it as
a notebook, the options available
8838
06:26:59,700 --> 06:27:02,100
as notebook. Now what it'll do
is that the moment
8839
06:27:02,100 --> 06:27:04,731
I start PySpark, it will
automatically redirect me
8840
06:27:04,731 --> 06:27:06,200
to The jupyter Notebook.
8841
06:27:10,200 --> 06:27:14,500
So let me just rename
this notebook as rdd tutorial.
8842
06:27:15,200 --> 06:27:16,900
So let's get started.
8843
06:27:17,800 --> 06:27:21,000
So here, to load any file
into an RDD — suppose
8844
06:27:21,000 --> 06:27:23,795
I'm loading a text file —
you need to use the sc,
8845
06:27:23,795 --> 06:27:26,700
that is the Spark context:
sc dot textFile,
8846
06:27:26,700 --> 06:27:28,952
and you need to provide
the path of the data
8847
06:27:28,952 --> 06:27:30,600
which you are going to load.
8848
06:27:30,600 --> 06:27:33,300
So one thing to keep
in mind is that the default path
8849
06:27:33,300 --> 06:27:35,483
which the RDD takes,
or the Jupyter
8850
06:27:35,483 --> 06:27:37,365
Notebook takes is the hdfs path.
8851
06:27:37,365 --> 06:27:39,456
So in order to use
the local file system,
8852
06:27:39,456 --> 06:27:41,311
you need to mention
the file colon
8853
06:27:41,311 --> 06:27:42,900
and double forward slashes.
8854
06:27:43,800 --> 06:27:46,100
So once our sample data is
8855
06:27:46,100 --> 06:27:49,076
inside the RDD, now to
have a look at it,
8856
06:27:49,076 --> 06:27:52,000
we need to invoke
the take action.
8857
06:27:52,000 --> 06:27:54,900
So let's go ahead and take
a look at the first five objects
8858
06:27:54,900 --> 06:27:59,400
or rather say the first five
elements of this particular RDD.
8859
06:27:59,700 --> 06:28:02,776
The sample data I have taken
here is about blockchain,
8860
06:28:02,776 --> 06:28:03,700
as you can see.
8861
06:28:03,700 --> 06:28:05,000
We have one two,
8862
06:28:05,030 --> 06:28:07,569
three four and
five elements here.
8863
06:28:08,500 --> 06:28:12,080
Suppose I need to convert
all the data into a lowercase
8864
06:28:12,080 --> 06:28:14,600
and split it according
to word by word.
8865
06:28:14,600 --> 06:28:17,000
So for that I will
create a function
8866
06:28:17,000 --> 06:28:20,000
and in the function
I'll pass on this RDD.
8867
06:28:20,000 --> 06:28:21,700
So I'm creating
as you can see here.
8868
06:28:21,700 --> 06:28:22,990
I'm creating rdd one
8869
06:28:22,990 --> 06:28:25,700
that is a new RDD,
and using the map function
8870
06:28:25,700 --> 06:28:29,200
or rather say the transformation
and passing on the function,
8871
06:28:29,200 --> 06:28:32,100
which I just created to lower
and to split it.
8872
06:28:32,496 --> 06:28:35,803
So if we have a look
at the output of rdd1:
8873
06:28:37,800 --> 06:28:39,059
As you can see here,
8874
06:28:39,059 --> 06:28:41,200
all the words are
in the lower case
8875
06:28:41,200 --> 06:28:44,300
and all of them are separated
with the help of a space bar.
8876
06:28:44,700 --> 06:28:47,000
Now there's another transformation,
8877
06:28:47,000 --> 06:28:50,216
which is known as the flatMap,
to give you a flattened output,
8878
06:28:50,216 --> 06:28:52,157
and I am passing
the same function
8879
06:28:52,157 --> 06:28:53,569
which I created earlier.
8880
06:28:53,569 --> 06:28:54,500
So let's go ahead
8881
06:28:54,500 --> 06:28:56,800
and have a look
at the output for this one.
8882
06:28:56,800 --> 06:28:58,200
So as you can see here,
8883
06:28:58,200 --> 06:29:00,189
we got the first five elements
8884
06:29:00,189 --> 06:29:04,355
which are the same ones as we got
here — contracts, transactions,
8885
06:29:04,355 --> 06:29:05,700
and, the, records.
8886
06:29:05,700 --> 06:29:07,523
So just one thing
to keep in mind.
8887
06:29:07,523 --> 06:29:09,700
is that the flatMap
is a transformation,
8888
06:29:09,700 --> 06:29:11,664
whereas take is the action. Now,
8889
06:29:11,664 --> 06:29:13,614
as you can see
that the contents
8890
06:29:13,614 --> 06:29:16,007
of the sample data
contains stop words.
8891
06:29:16,007 --> 06:29:18,762
So if I want to remove
all the stop words, all you
8892
06:29:18,762 --> 06:29:19,900
need to do is go ahead
8893
06:29:19,900 --> 06:29:23,351
and create a list of stop words
in which I have mentioned here
8894
06:29:23,351 --> 06:29:24,200
as you can see.
8895
06:29:24,200 --> 06:29:26,200
We have a, all, the, as, is,
8896
06:29:26,200 --> 06:29:28,700
and now these are
not all the stop words.
8897
06:29:28,700 --> 06:29:31,701
So I've chosen only a few
of them just to show you
8898
06:29:31,701 --> 06:29:33,600
what exactly the output will be
8899
06:29:33,600 --> 06:29:36,100
and now we are using here
the filter transformation
8900
06:29:36,100 --> 06:29:37,800
and with the help of the lambda
8901
06:29:37,800 --> 06:29:40,800
function, in which we have
specified x as x not
8902
06:29:40,800 --> 06:29:43,360
in stop words, and we
have created another RDD,
8903
06:29:43,360 --> 06:29:44,465
which is rdd3,
8904
06:29:44,465 --> 06:29:46,000
which will take the input
8905
06:29:46,000 --> 06:29:48,800
from rdd2. So
let's go ahead and see
8906
06:29:48,800 --> 06:29:51,700
whether 'and' and 'the'
are removed or not.
8907
06:29:51,700 --> 06:29:55,600
So as you can see: contracts,
transaction, records and so on.
8908
06:29:55,600 --> 06:29:57,500
If you look at the output 5,
8909
06:29:57,500 --> 06:30:00,979
we have contracts, transaction,
and 'and', 'the' and 'in'
8910
06:30:00,979 --> 06:30:02,337
are not in this list,
8911
06:30:02,337 --> 06:30:04,600
but suppose I want
to group the data
8912
06:30:04,600 --> 06:30:07,523
according to the first
three characters of any element.
8913
06:30:07,523 --> 06:30:08,756
So for that I'll use
8914
06:30:08,756 --> 06:30:11,900
the group by and I'll use
the Lambda function again.
8915
06:30:11,900 --> 06:30:14,000
So let's have a look
at the output
8916
06:30:14,000 --> 06:30:16,769
so you can see we
have edge and edges.
8917
06:30:16,900 --> 06:30:20,638
So the first three letters of
both words are same similarly.
8918
06:30:20,638 --> 06:30:23,300
We can find it using
the first two letters.
8919
06:30:23,300 --> 06:30:27,800
Also, let me just change it
to two, so you can see we have gu
8920
06:30:27,800 --> 06:30:29,800
and guide and guid.
8921
06:30:30,000 --> 06:30:32,200
Now these are
the basic transformations
8922
06:30:32,200 --> 06:30:33,785
and actions but suppose.
8923
06:30:33,785 --> 06:30:37,400
I want to find out the sum
of the first thousand numbers.
8924
06:30:37,400 --> 06:30:39,436
or rather say first
10,000 numbers.
8925
06:30:39,436 --> 06:30:42,300
All I need to do
is initialize another RDD,
8926
06:30:42,300 --> 06:30:44,400
which is the
num underscore rdd.
8927
06:30:44,400 --> 06:30:47,512
And we use the sc dot
parallelize, and the range
8928
06:30:47,512 --> 06:30:49,500
we have given is one to 10,000
8929
06:30:49,500 --> 06:30:51,600
and we'll use the reduce action
8930
06:30:51,600 --> 06:30:54,532
here to see the output
you can see here.
8931
06:30:54,532 --> 06:30:56,840
We have the sum
of the numbers ranging
8932
06:30:56,840 --> 06:30:58,400
from one to ten thousand.
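(Gathering the notebook commands just demonstrated into one hedged sketch — the file path and the exact stop-word list are illustrative:)

    rdd = sc.textFile("file:///home/edureka/sample.txt")   # hypothetical path
    rdd.take(5)                                            # first five elements

    rdd1 = rdd.map(lambda line: line.lower().split())      # lowercase + split
    rdd2 = rdd.flatMap(lambda line: line.lower().split())  # same, but flattened

    stopwords = ["a", "all", "the", "as", "is"]
    rdd3 = rdd2.filter(lambda x: x not in stopwords)       # drop the stop words
    rdd2.groupBy(lambda w: w[:3]).take(5)                  # group by first 3 chars

    num_rdd = sc.parallelize(range(1, 10001))
    num_rdd.reduce(lambda x, y: x + y)                     # sum of 1 to 10,000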
8933
06:30:58,400 --> 06:31:00,900
Now this was all about rdd.
8934
06:31:00,900 --> 06:31:01,699
The next topic
8935
06:31:01,699 --> 06:31:03,711
that we have
on our list is broadcast
8936
06:31:03,711 --> 06:31:07,181
and accumulators now in spark
we perform parallel processing
8937
06:31:07,181 --> 06:31:09,100
through the Help
of shared variables
8938
06:31:09,100 --> 06:31:11,672
Now when the driver sends
any task to the executor
8939
06:31:11,672 --> 06:31:14,900
present on the cluster a copy of
the shared variable is also sent
8940
06:31:14,900 --> 06:31:15,700
to each node
8941
06:31:15,700 --> 06:31:18,100
of the cluster thus
maintaining High availability
8942
06:31:18,100 --> 06:31:19,400
and fault tolerance.
8943
06:31:19,400 --> 06:31:22,223
Now, this is done in order
to accomplish the task
8944
06:31:22,223 --> 06:31:25,341
and Apache Spark supports
two types of shared variables.
8945
06:31:25,341 --> 06:31:26,711
One of them is broadcast.
8946
06:31:26,711 --> 06:31:28,861
And the other one is
the accumulator now
8947
06:31:28,861 --> 06:31:31,735
broadcast variables are used
to save the copy of data
8948
06:31:31,735 --> 06:31:33,334
on all the nodes in a cluster.
8949
06:31:33,334 --> 06:31:36,117
Whereas the accumulator is
the variable that is used
8950
06:31:36,117 --> 06:31:37,700
for aggregating the incoming
8951
06:31:37,700 --> 06:31:40,056
information via different
associative and commutative operations.
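(A quick hedged illustration of the two shared-variable types — the word lists are made up:)

    bc = sc.broadcast(["spark", "kafka"])   # read-only copy shipped to every node
    acc = sc.accumulator(0)                 # aggregate written to from the executors

    words = sc.parallelize(["spark", "flink", "kafka"])
    words.foreach(lambda w: acc.add(1) if w in bc.value else None)
    print(acc.value)  # 2 — the counts were aggregated back on the driver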
8952
06:31:40,056 --> 06:31:43,500
Now, moving on to our next topic,
8953
06:31:43,500 --> 06:31:47,094
which is the Spark configuration.
The Spark configuration class
8954
06:31:47,094 --> 06:31:49,800
provides a set
of configurations and parameters
8955
06:31:49,800 --> 06:31:52,300
that are needed to execute
a spark application
8956
06:31:52,300 --> 06:31:54,300
on the local system
or any cluster.
8957
06:31:54,300 --> 06:31:56,800
Now when you use
spark configuration object
8958
06:31:56,800 --> 06:31:59,112
to set the values
to these parameters,
8959
06:31:59,112 --> 06:32:02,800
they automatically take priority
over the system properties.
8960
06:32:02,800 --> 06:32:05,035
Now this class
contains various getter
8961
06:32:05,035 --> 06:32:07,800
and setter methods, some
of which are the set method,
8962
06:32:07,800 --> 06:32:10,323
which is used to set
a configuration property.
8963
06:32:10,323 --> 06:32:11,555
We have the set master
8964
06:32:11,555 --> 06:32:13,605
which is used for setting
the master URL.
8965
06:32:13,605 --> 06:32:14,840
Then the set app name,
8966
06:32:14,840 --> 06:32:17,421
which is used to set
an application name and we
8967
06:32:17,421 --> 06:32:20,900
have the get method to retrieve
a configuration value of a key.
8968
06:32:20,900 --> 06:32:23,000
And finally we
have set spark home
8969
06:32:23,000 --> 06:32:25,600
which is used for setting
the spark installation path
8970
06:32:25,600 --> 06:32:26,700
on worker nodes.
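(A small sketch of those getter and setter methods — the values are illustrative:)

    from pyspark import SparkConf

    conf = (SparkConf()
            .set("spark.executor.memory", "1g")   # generic set(key, value)
            .setMaster("local[*]")                # setMaster
            .setAppName("conf-demo")              # setAppName
            .setSparkHome("/usr/lib/spark"))      # setSparkHome, illustrative path
    print(conf.get("spark.app.name"))             # get retrieves a value by key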
8971
06:32:26,700 --> 06:32:28,800
Now coming to the next
topic on our list
8972
06:32:28,800 --> 06:32:31,600
which is a spark files
the spark file class
8973
06:32:31,600 --> 06:32:33,264
contains only the class methods
8974
06:32:33,264 --> 06:32:36,500
so that the user cannot create
any spark files instance.
8975
06:32:36,500 --> 06:32:39,200
Now this helps in resolving
the path of the files
8976
06:32:39,200 --> 06:32:41,500
that are added using
the spark context add
8977
06:32:41,500 --> 06:32:44,600
file method. The class SparkFiles
contains two class methods,
8978
06:32:44,600 --> 06:32:47,798
which are the get method and
the get root directory method.
8979
06:32:47,798 --> 06:32:50,500
Now, the get is used
to retrieve the absolute path
8980
06:32:50,500 --> 06:32:53,900
of a file added through
SparkContext dot addFile,
8981
06:32:54,000 --> 06:32:55,300
and the get root directory
8982
06:32:55,300 --> 06:32:57,076
is used to retrieve
the root directory
8983
06:32:57,076 --> 06:32:58,900
that contains the files
that are added.
8984
06:32:58,900 --> 06:33:00,700
through this SparkContext
dot addFile method.
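(Since SparkFiles exposes only class methods, it is used without creating an instance — a short sketch with an illustrative file:)

    from pyspark import SparkFiles

    sc.addFile("file:///tmp/lookup.txt")     # file added via SparkContext.addFile
    print(SparkFiles.get("lookup.txt"))      # absolute path of the added file
    print(SparkFiles.getRootDirectory())     # root directory holding added files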
8985
06:33:00,700 --> 06:33:03,022
Now, these were small topics,
and the next topic
8986
06:33:03,022 --> 06:33:04,257
that we will be covering
8987
06:33:04,257 --> 06:33:07,600
in our list is data frames.
Now a data frame in Apache
8988
06:33:07,600 --> 06:33:09,655
Spark is a distributed
collection of rows
8989
06:33:09,655 --> 06:33:10,831
under named columns,
8990
06:33:10,831 --> 06:33:13,400
which is similar to
the relational database tables
8991
06:33:13,400 --> 06:33:14,700
or Excel sheets.
8992
06:33:14,700 --> 06:33:16,812
It also shares common attributes
8993
06:33:16,812 --> 06:33:19,800
with the rdds few
characteristics of data frames
8994
06:33:19,800 --> 06:33:21,300
are immutable in nature.
8995
06:33:21,300 --> 06:33:23,500
That is,
you can create a data frame,
8996
06:33:23,500 --> 06:33:24,900
but you cannot change it.
8997
06:33:24,900 --> 06:33:26,500
It allows lazy evaluation.
8998
06:33:26,500 --> 06:33:28,300
That is, the task is not executed
8999
06:33:28,300 --> 06:33:30,500
unless and until
an action is triggered
9000
06:33:30,500 --> 06:33:33,000
and moreover data frames
are distributed in nature,
9001
06:33:33,000 --> 06:33:34,900
which are designed
for processing large
9002
06:33:34,900 --> 06:33:37,400
collection of structure
or semi-structured data.
9003
06:33:37,400 --> 06:33:39,953
Can be created using
different data formats,
9004
06:33:39,953 --> 06:33:41,200
like loading the data
9005
06:33:41,200 --> 06:33:43,650
from source files
such as Json or CSV,
9006
06:33:43,650 --> 06:33:46,100
or you can load it
from an existing RDD;
9007
06:33:46,100 --> 06:33:48,842
you can use databases
like Hive, Cassandra;
9008
06:33:48,842 --> 06:33:50,600
You can use Parquet files.
9009
06:33:50,600 --> 06:33:52,800
You can use CSV XML files.
9010
06:33:52,800 --> 06:33:53,900
There are many sources
9011
06:33:53,900 --> 06:33:56,448
through which you can create
a particular data frame. Now,
9012
06:33:56,448 --> 06:33:59,200
let me show you how to create
a data frame in PySpark
9013
06:33:59,200 --> 06:34:02,100
and perform various actions
and Transformations on it.
9014
06:34:02,300 --> 06:34:05,065
So let's continue this
in the same notebook
9015
06:34:05,065 --> 06:34:07,700
which we have here now
here we have taken
9016
06:34:07,700 --> 06:34:09,300
In the NYC Flight data,
9017
06:34:09,300 --> 06:34:12,561
and I'm creating a data frame
which is the NYC flights
9018
06:34:12,561 --> 06:34:13,300
underscore
9019
06:34:13,300 --> 06:34:14,959
df. Now, to load the data,
9020
06:34:14,959 --> 06:34:18,340
we are using the spark dot
read dot csv method, and you need
9021
06:34:18,340 --> 06:34:19,600
to provide the path
9022
06:34:19,600 --> 06:34:21,900
which is the local path here;
by default,
9023
06:34:21,900 --> 06:34:24,200
it takes the HDFS path, same as RDD,
9024
06:34:24,200 --> 06:34:26,208
and one thing
to note down here is
9025
06:34:26,208 --> 06:34:28,886
that I've provided
two parameters extra here,
9026
06:34:28,886 --> 06:34:31,400
which are the inferSchema
and the header options;
9027
06:34:31,400 --> 06:34:34,700
if we do not provide
these as true, or we skip it,
9028
06:34:34,700 --> 06:34:35,800
what will happen.
9029
06:34:35,800 --> 06:34:39,300
is that if your data set has
the names of the columns
9030
06:34:39,300 --> 06:34:42,863
on the first row it will take
those as data as well.
9031
06:34:42,863 --> 06:34:45,100
and it will not infer
the schema. Now,
9032
06:34:45,100 --> 06:34:49,023
Once we have loaded the data
in our data frame we need to use
9033
06:34:49,023 --> 06:34:51,900
the show action to have
a look at the output.
9034
06:34:51,900 --> 06:34:53,223
So as you can see here,
9035
06:34:53,223 --> 06:34:55,399
we have the output
which is exactly it
9036
06:34:55,399 --> 06:34:58,600
gives us the top 20 rows
or the particular data set.
9037
06:34:58,600 --> 06:35:02,600
We have the year, month, day,
departure time, departure delay,
9038
06:35:02,600 --> 06:35:07,000
arrival time arrival delay
and so many more attributes.
9039
06:35:07,300 --> 06:35:08,500
To print the schema
9040
06:35:08,500 --> 06:35:11,500
of the particular data frame
you need the transformation
9041
06:35:11,500 --> 06:35:13,762
or rather say the action,
printSchema.
9042
06:35:13,762 --> 06:35:15,900
So let's have a look
at the schema.
9043
06:35:15,900 --> 06:35:19,117
As you can see here, we have year
which is integer, month integer.
9044
06:35:19,117 --> 06:35:21,000
Almost half of them are integer.
9045
06:35:21,000 --> 06:35:23,600
We have the carrier as
string the tail number
9046
06:35:23,600 --> 06:35:26,625
a string the origin
string destination string
9047
06:35:26,625 --> 06:35:28,123
and so on now suppose.
9048
06:35:28,123 --> 06:35:29,075
I want to know
9049
06:35:29,075 --> 06:35:31,786
how many records are
there in my database
9050
06:35:31,786 --> 06:35:33,685
or the data frame rather say
9051
06:35:33,685 --> 06:35:36,600
so you need the count
function for this one.
9052
06:35:36,600 --> 06:35:40,600
It will provide us the results,
so as you can see here,
9053
06:35:40,600 --> 06:35:42,992
we have three point
three million records
9054
06:35:42,992 --> 06:35:44,097
here three million
9055
06:35:44,097 --> 06:35:46,800
thirty six thousand
seven hundred seventy six
9056
06:35:46,800 --> 06:35:48,400
to be exact. Now suppose
9057
06:35:48,400 --> 06:35:51,153
I want to have a look
at the flight name the origin
9058
06:35:51,153 --> 06:35:52,400
and the destination
9059
06:35:52,400 --> 06:35:55,400
of just these three columns
from the particular data frame.
9060
06:35:55,400 --> 06:35:57,800
We need to use
the select option.
9061
06:35:58,200 --> 06:36:00,882
So as you can see here,
we have the top 20 rows.
9062
06:36:00,882 --> 06:36:03,128
Now, what we saw
was the select query
9063
06:36:03,128 --> 06:36:05,000
on this particular data frame,
9064
06:36:05,000 --> 06:36:07,240
but if I wanted
to see or rather,
9065
06:36:07,240 --> 06:36:09,200
I want to check the summary.
9066
06:36:09,200 --> 06:36:11,400
Of any particular
column suppose.
9067
06:36:11,400 --> 06:36:14,500
I want to check the
what is the lowest count
9068
06:36:14,500 --> 06:36:18,100
or the highest count in
the particular distance column.
9069
06:36:18,100 --> 06:36:20,500
I need to use
the describe function here.
9070
06:36:20,500 --> 06:36:23,100
So I'll show you what
the summary looks like.
9071
06:36:23,500 --> 06:36:25,142
So for the distance, the count
9072
06:36:25,142 --> 06:36:27,900
is the number of rows
total number of rows.
9073
06:36:27,900 --> 06:36:30,800
We have the mean, the standard
deviation, then the minimum value,
9074
06:36:30,800 --> 06:36:32,900
which is 17
and the maximum value,
9075
06:36:32,900 --> 06:36:34,500
which is 4983.
9076
06:36:34,900 --> 06:36:38,100
Now this gives you a summary
of the particular column
9077
06:36:38,100 --> 06:36:39,856
if you want it. So now
that we know
9078
06:36:39,856 --> 06:36:41,838
that the minimum distance is 17,
9079
06:36:41,838 --> 06:36:44,500
Let's go ahead and filter
out our data using
9080
06:36:44,500 --> 06:36:47,700
the filter function
in which the distance is 17.
9081
06:36:48,700 --> 06:36:49,978
So you can see here.
9082
06:36:49,978 --> 06:36:51,000
We have one record,
9083
06:36:51,000 --> 06:36:55,700
in which, in the year 2013,
the minimum distance here is 17.
9084
06:36:55,700 --> 06:36:59,100
But similarly, suppose I want
to have a look at the flights
9085
06:36:59,100 --> 06:37:01,600
which are originating from EWR.
9086
06:37:01,900 --> 06:37:02,400
Similarly.
9087
06:37:02,400 --> 06:37:04,600
We use the filter
function here as well.
9088
06:37:04,600 --> 06:37:06,599
Now, another clause here,
9089
06:37:06,599 --> 06:37:09,300
which is the where
Clause is also used
9090
06:37:09,300 --> 06:37:11,236
for filtering. Suppose
9091
06:37:11,236 --> 06:37:12,800
I want to have a look
9092
06:37:12,815 --> 06:37:16,046
at the flight data
and filter it out to see
9093
06:37:16,046 --> 06:37:17,507
if the day at
9094
06:37:17,507 --> 06:37:22,000
which the flight took off was
the second of any month.
9095
06:37:22,000 --> 06:37:23,589
So here, instead of filter,
9096
06:37:23,589 --> 06:37:25,422
We can also use a where clause
9097
06:37:25,422 --> 06:37:27,500
which will give us
the same output.
9098
06:37:29,200 --> 06:37:33,100
Now, we can also pass
on multiple parameters
9099
06:37:33,100 --> 06:37:36,000
and rather say
the multiple conditions.
9100
06:37:36,000 --> 06:37:39,866
So suppose I want the day
of the flight should be seventh
9101
06:37:39,866 --> 06:37:41,839
and the origin should be JFK
9102
06:37:41,839 --> 06:37:45,292
and the arrival delay
should be less than 0 I mean
9103
06:37:45,292 --> 06:37:47,900
that is, none
of the postponed flights.
9104
06:37:48,000 --> 06:37:49,600
So just to have a look
9105
06:37:49,600 --> 06:37:52,314
at these numbers,
we'll use the where clause
9106
06:37:52,314 --> 06:37:55,600
and separate all the conditions
using the & symbol
9107
06:37:56,100 --> 06:37:57,800
so you can see
here all the data.
9108
06:37:57,800 --> 06:38:00,700
The day is 7 the origin is JFK
9109
06:38:01,100 --> 06:38:04,900
and the arrival delay
is less than 0.
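A sketch of that where clause; in PySpark, multiple column conditions are combined with the & operator (the column names are assumptions):

    nyc_flights_df.where(
        (nyc_flights_df.day == 7) &
        (nyc_flights_df.origin == "JFK") &
        (nyc_flights_df.arr_delay < 0)
    ).show()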
9110
06:38:04,900 --> 06:38:07,621
Now, these were the basic
transformations and actions
9111
06:38:07,621 --> 06:38:09,300
on the particular data frame.
9112
06:38:09,300 --> 06:38:12,900
Now one thing we can also do
is create a temporary table
9113
06:38:12,900 --> 06:38:14,100
for SQL queries
9114
06:38:14,100 --> 06:38:15,100
if someone is
9115
06:38:15,100 --> 06:38:19,000
not comfortable or not used
to all these transformations
9116
06:38:19,000 --> 06:38:22,400
and actions, and would rather
use SQL queries on the data.
9117
06:38:22,400 --> 06:38:26,006
They can use the registerTempTable
method to create a table
9118
06:38:26,006 --> 06:38:27,925
for their particular data frame.
9119
06:38:27,925 --> 06:38:30,129
What we'll do is
convert the NYC flights
9120
06:38:30,129 --> 06:38:33,600
underscore df data frame
into an nyc_flights table,
9121
06:38:33,600 --> 06:38:36,700
which can be used later
and SQL queries can be performed
9122
06:38:36,700 --> 06:38:38,500
on this particular table.
9123
06:38:38,600 --> 06:38:43,000
So you remember in the beginning
we used the nyc_flights_d
9124
06:38:43,000 --> 06:38:47,600
f dot show; now we can use
SELECT * (asterisk) FROM
9125
06:38:47,600 --> 06:38:51,600
nyc_flights to get
the same output. Now suppose
9126
06:38:51,600 --> 06:38:55,011
we want to look at the minimum
air time of any flight.
9127
06:38:55,011 --> 06:38:58,217
We use the select minimum
air time from NYC flights.
9128
06:38:58,217 --> 06:38:59,600
That is the SQL query.
9129
06:38:59,600 --> 06:39:02,400
We pass the SQL query
to the SQLContext's
9130
06:39:02,400 --> 06:39:03,700
sql function.
9131
06:39:03,700 --> 06:39:04,800
So you can see here.
9132
06:39:04,800 --> 06:39:07,900
We have the minimum air time
as 20 now to have a look
9133
06:39:07,900 --> 06:39:11,400
at the records in which
the air time is the minimum, 20.
9134
06:39:11,600 --> 06:39:14,693
Now we can also use
nested SQL queries. Suppose
9135
06:39:14,693 --> 06:39:15,847
if I want to check
9136
06:39:15,847 --> 06:39:19,328
which all flights have
the Minimum air time as 20 now
9137
06:39:19,328 --> 06:39:20,553
that cannot be done
9138
06:39:20,553 --> 06:39:24,132
in a simple SQL query we need
nested query for that one.
9139
06:39:24,132 --> 06:39:26,800
So we select asterisk
from nyc_flights
9140
06:39:26,800 --> 06:39:29,500
where the air time is IN,
and inside
9141
06:39:29,500 --> 06:39:30,913
that we have another query,
9142
06:39:30,913 --> 06:39:33,477
which is Select minimum air time
from NYC flights.
9143
06:39:33,477 --> 06:39:35,100
Let's see if this works or not.
9144
06:39:37,200 --> 06:39:38,497
So as you can see here,
9145
06:39:38,497 --> 06:39:41,600
we have two flights which have
the minimum air time as 20.
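A sketch of the whole temporary-table workflow just described, using the older SQLContext API shown in the video (newer Spark versions use createOrReplaceTempView; table and column names are assumptions):

    nyc_flights_df.registerTempTable("nyc_flights")
    sqlContext.sql("SELECT * FROM nyc_flights").show()
    sqlContext.sql("SELECT MIN(air_time) FROM nyc_flights").show()
    # nested query: which flights have the minimum air time?
    sqlContext.sql("SELECT * FROM nyc_flights WHERE air_time IN "
                   "(SELECT MIN(air_time) FROM nyc_flights)").show()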
9146
06:39:42,200 --> 06:39:44,400
So guys this is it
for data frames.
9147
06:39:44,400 --> 06:39:46,147
So let's get back
to our presentation
9148
06:39:46,147 --> 06:39:48,697
and have a look at the list
which we were following.
9149
06:39:48,697 --> 06:39:49,966
We completed data frames.
9150
06:39:49,966 --> 06:39:52,600
Next we have storage levels.
Now StorageLevel
9151
06:39:52,600 --> 06:39:55,200
in PySpark is a class
which helps in deciding
9152
06:39:55,200 --> 06:39:56,991
how the rdds should be stored
9153
06:39:56,991 --> 06:39:59,400
Now, based on this, RDDs
are either stored
9154
06:39:59,400 --> 06:40:01,400
on disk or in memory or in
9155
06:40:01,400 --> 06:40:04,300
both. The class
StorageLevel also decides
9156
06:40:04,300 --> 06:40:06,594
whether the RDDs
should be serialized
9157
06:40:06,594 --> 06:40:09,480
or their partitions replicated.
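A minimal sketch of picking a storage level for an RDD (the file path is an assumption; sc is the shell's SparkContext):

    from pyspark import StorageLevel

    rdd = sc.textFile("data.txt")
    # keep partitions in memory and spill to disk when memory runs out
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    # other levels: MEMORY_ONLY, DISK_ONLY, MEMORY_ONLY_2 (replicated), ...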
9158
06:40:09,480 --> 06:40:12,000
Now, for the final and last topic
on today's list,
9159
06:40:12,000 --> 06:40:15,100
which is MLlib: MLlib is
the machine learning API
9160
06:40:15,100 --> 06:40:17,000
which is provided by spark,
9161
06:40:17,000 --> 06:40:18,600
which is also present in Python.
9162
06:40:18,700 --> 06:40:21,180
And this library
is heavily used in Python
9163
06:40:21,180 --> 06:40:22,597
for machine learning as
9164
06:40:22,597 --> 06:40:26,094
well as real-time streaming
analytics. The various algorithms
9165
06:40:26,094 --> 06:40:28,773
supported by this library
are first of all,
9166
06:40:28,773 --> 06:40:30,600
we have spark.mllib.
9167
06:40:30,600 --> 06:40:33,482
Now, recently, PySpark
MLlib supports model-
9168
06:40:33,482 --> 06:40:37,500
based collaborative filtering
by a small set of latent factors
9169
06:40:37,500 --> 06:40:40,500
and here all the users
and the products are described
9170
06:40:40,500 --> 06:40:42,300
which we can use
to predict them.
9171
06:40:42,300 --> 06:40:45,909
missing entries. However,
to learn these latent factors
9172
06:40:45,909 --> 06:40:48,886
spark.mllib uses
alternating least squares,
9173
06:40:48,886 --> 06:40:50,755
which is the ALS algorithm.
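A sketch of model-based collaborative filtering with ALS in spark.mllib (the tiny ratings RDD is made up purely for illustration):

    from pyspark.mllib.recommendation import ALS, Rating

    ratings = sc.parallelize([Rating(1, 10, 4.0), Rating(1, 20, 3.0), Rating(2, 10, 5.0)])
    model = ALS.train(ratings, rank=10, iterations=10)  # learn the latent factors
    model.predict(2, 20)  # predict a missing user/product entry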
9174
06:40:50,755 --> 06:40:52,900
Next we have the MLlib clustering.
9175
06:40:52,900 --> 06:40:53,852
An unsupervised
9176
06:40:53,852 --> 06:40:57,700
learning problem is clustering;
here we try to group subsets
9177
06:40:57,700 --> 06:40:59,989
of entities with one
another on the basis
9178
06:40:59,989 --> 06:41:02,000
of some notion of similarity.
9179
06:41:02,200 --> 06:41:02,500
Next.
9180
06:41:02,500 --> 06:41:04,500
We have the frequent
pattern matching,
9181
06:41:04,500 --> 06:41:08,400
which is FPM. Now, frequent
pattern mining is mining
9182
06:41:08,400 --> 06:41:12,800
frequent items, itemsets,
subsequences or other substructures
9183
06:41:12,800 --> 06:41:13,600
that are usually
9184
06:41:13,600 --> 06:41:16,900
among the first steps to analyze
a large-scale data set.
9185
06:41:16,900 --> 06:41:20,600
This has been an active research
topic in data mining for years.
9186
06:41:20,600 --> 06:41:22,800
We have the linear algebra.
9187
06:41:23,000 --> 06:41:25,032
Now PySpark MLlib
supports
9188
06:41:25,032 --> 06:41:27,403
utilities
for linear algebra.
9189
06:41:27,403 --> 06:41:29,300
We have collaborative filtering.
9190
06:41:29,400 --> 06:41:30,900
We have classification
9191
06:41:30,900 --> 06:41:34,000
for binary classification
various methods are available
9192
06:41:34,000 --> 06:41:37,700
in the Spark MLlib package, such as
multi-class classification as
9193
06:41:37,700 --> 06:41:40,912
well as regression analysis
in classification some
9194
06:41:40,912 --> 06:41:44,067
of the most popular algorithms
used are Naive Bayes, Random
9195
06:41:44,067 --> 06:41:45,457
Forest, Decision Tree
9196
06:41:45,457 --> 06:41:48,600
and so on. And finally we
have the linear regression
9197
06:41:48,600 --> 06:41:51,300
Now, basically, linear regression
comes from the family
9198
06:41:51,300 --> 06:41:54,064
of regression algorithms;
to find relationships
9199
06:41:54,064 --> 06:41:56,812
and dependencies between
variables is the main goal
9200
06:41:56,812 --> 06:41:58,594
of regression. The PySpark
9201
06:41:58,594 --> 06:42:01,400
MLlib package also covers
other algorithm classes
9202
06:42:01,400 --> 06:42:02,100
and functions.
9203
06:42:02,400 --> 06:42:04,591
Let's now try to implement
all the concepts
9204
06:42:04,591 --> 06:42:07,200
which we have learned
in pie spark tutorial session
9205
06:42:07,200 --> 06:42:10,600
now here we are going to use
a heart disease prediction model
9206
06:42:10,600 --> 06:42:13,278
and we are going to predict
Using the decision tree
9207
06:42:13,278 --> 06:42:16,599
with the help of classification
as well as regression.
9208
06:42:16,599 --> 06:42:16,800
Now.
9209
06:42:16,800 --> 06:42:19,600
These all are part
of the ml Live library here.
9210
06:42:19,600 --> 06:42:21,800
Let's see how we
can perform these types
9211
06:42:21,800 --> 06:42:23,300
of functions and queries.
9212
06:42:39,800 --> 06:42:40,600
First of all,
9213
06:42:40,600 --> 06:42:43,700
what we need to do
is initialize the spark context.
9214
06:42:45,100 --> 06:42:48,300
Next we are going
to read the UCI data set
9215
06:42:48,400 --> 06:42:50,500
of the heart disease prediction
9216
06:42:50,600 --> 06:42:52,600
and we are going
to clean the data.
9217
06:42:52,600 --> 06:42:55,700
So let's import the pandas
and the numpy library here.
9218
06:42:56,000 --> 06:42:58,852
Let's create a data frame
as heart_disease_df and,
9219
06:42:58,852 --> 06:43:00,100
as mentioned earlier,
9220
06:43:00,100 --> 06:43:03,544
we are going to use
the read CSV method here
9221
06:43:03,700 --> 06:43:05,300
and here we don't have a header.
9222
06:43:05,300 --> 06:43:07,500
So we have provided
header as none.
9223
06:43:07,700 --> 06:43:10,800
Now the original data set
contains 303 rows
9224
06:43:10,800 --> 06:43:12,100
and 14 columns.
9225
06:43:12,600 --> 06:43:15,800
Now the categories
of diagnosis of heart disease
9226
06:43:15,900 --> 06:43:17,000
that we are predicting
9227
06:43:17,300 --> 06:43:22,400
are: the value 0 is for less than
50% diameter narrowing, and the value
9228
06:43:22,400 --> 06:43:24,900
1 which we are giving
is for the values
9229
06:43:24,900 --> 06:43:27,500
which have more than 50%
diameter narrowing.
9230
06:43:28,700 --> 06:43:31,623
So here we are using
the numpy library.
9231
06:43:32,700 --> 06:43:35,921
These are particularly
old methods which is showing
9232
06:43:35,921 --> 06:43:39,400
the deprecation warning,
but no issues — it will work fine.
9233
06:43:40,900 --> 06:43:42,500
So as you can see here,
9234
06:43:42,500 --> 06:43:45,300
we have the categories
of diagnosis of heart disease
9235
06:43:45,300 --> 06:43:48,100
that we are predicting
the value 0 is for less than 50%
9236
06:43:48,100 --> 06:43:50,000
and value 1 is for greater than 50%.
9237
06:43:50,400 --> 06:43:53,014
So what we did here
was clean the rows
9238
06:43:53,014 --> 06:43:57,500
which had the question marks
or the empty spaces.
9239
06:43:58,700 --> 06:44:00,900
Now to get a look
at the data set here.
9240
06:44:00,900 --> 06:44:02,200
Now, you can see here.
9241
06:44:02,200 --> 06:44:06,086
We have zero at many places
instead of the question mark
9242
06:44:06,086 --> 06:44:07,500
which we had earlier
9243
06:44:08,600 --> 06:44:11,300
and now we are saving
it to a txt file.
9244
06:44:12,000 --> 06:44:14,200
And you can see here,
after dropping the rows
9245
06:44:14,200 --> 06:44:15,494
with any empty values.
9246
06:44:15,494 --> 06:44:18,000
we have 297 rows
and 14 columns.
9247
06:44:18,300 --> 06:44:20,800
This is what the new
clean data set looks
9248
06:44:20,800 --> 06:44:24,400
like. Now we are importing
the MLlib library
9249
06:44:24,400 --> 06:44:26,500
and the regression here now here
9250
06:44:26,500 --> 06:44:29,077
what we are going to do
is create a label point,
9251
06:44:29,077 --> 06:44:31,900
which is a local Vector
associated with a label
9252
06:44:31,900 --> 06:44:33,100
or a response.
9253
06:44:33,100 --> 06:44:36,600
So for that we need to import
the mllib.regression module.
9254
06:44:37,800 --> 06:44:39,600
So for that we are
taking the text file
9255
06:44:39,600 --> 06:44:43,000
which we just created now
without the missing values.
9256
06:44:43,000 --> 06:44:43,665
Now next.
9257
06:44:43,665 --> 06:44:47,678
What we are going to do is
parse the MLlib data line by line
9258
06:44:47,678 --> 06:44:49,900
into the MLlib LabeledPoint object,
9259
06:44:49,900 --> 06:44:51,671
and we are going
to convert the -
9260
06:44:51,671 --> 06:44:53,000
1 labels to 0. Now,
9261
06:44:53,000 --> 06:44:56,200
let's have a look after parsing
the first few lines.
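A sketch of that parsing step (the file name and column layout are assumptions; the last column is taken as the diagnosis label):

    from pyspark.mllib.regression import LabeledPoint

    def parse_point(line):
        values = [float(x) for x in line.split(",")]
        label = 0.0 if values[-1] <= 0 else 1.0  # map -1 (and 0) labels to 0.0
        return LabeledPoint(label, values[:-1])

    parsed_data = sc.textFile("heart_disease_clean.txt").map(parse_point)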
9262
06:44:57,800 --> 06:45:00,200
Okay, we have the labels 0.0 and 1.0.
9263
06:45:00,600 --> 06:45:01,700
That's cool.
9264
06:45:01,700 --> 06:45:04,700
Now next what we are going to do
is perform classification using
9265
06:45:04,700 --> 06:45:05,800
the decision tree.
9266
06:45:05,800 --> 06:45:09,300
So for that we need to import
the PySpark mllib.tree module.
9267
06:45:09,600 --> 06:45:13,200
So next what we have to do is
split the data into the training
9268
06:45:13,200 --> 06:45:14,300
and testing data
9269
06:45:14,300 --> 06:45:18,500
and we split here the data
into the 70:30 standard ratio,
9270
06:45:18,600 --> 06:45:20,672
70 being the training data set
9271
06:45:20,672 --> 06:45:24,541
and the 30% being the testing
data set next what we do is
9272
06:45:24,541 --> 06:45:26,200
that we train the model.
9273
06:45:26,200 --> 06:45:28,600
which we created here
using the training set.
9274
06:45:29,100 --> 06:45:31,100
We have created
a training model with DecisionTree
9275
06:45:31,100 --> 06:45:32,400
dot trainClassifier.
9276
06:45:32,400 --> 06:45:34,400
We have used
the training data, the number
9277
06:45:34,400 --> 06:45:36,947
of classes,
the categorical features
9278
06:45:36,947 --> 06:45:38,104
which we have given
9279
06:45:38,104 --> 06:45:40,600
maximum depth to which
we are classifying.
9280
06:45:40,600 --> 06:45:42,000
It is 3. The next
9281
06:45:42,000 --> 06:45:45,505
what we are going to do is
evaluate the model based
9282
06:45:45,505 --> 06:45:49,000
on the test data set now
and evaluate the error.
9283
06:45:49,300 --> 06:45:50,800
So here we are creating
9284
06:45:50,800 --> 06:45:53,211
predictions and we
are using the test data
9285
06:45:53,211 --> 06:45:55,800
to get the predictions
through the model
9286
06:45:55,800 --> 06:45:58,200
which we built, and we
are also going to find
9287
06:45:58,200 --> 06:45:59,500
the test errors here.
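A sketch of the classification step, reusing the hypothetical parsed_data RDD from above:

    from pyspark.mllib.tree import DecisionTree

    train_data, test_data = parsed_data.randomSplit([0.7, 0.3])
    model = DecisionTree.trainClassifier(train_data, numClasses=2,
                                         categoricalFeaturesInfo={}, maxDepth=3)
    predictions = model.predict(test_data.map(lambda p: p.features))
    labels_and_preds = test_data.map(lambda p: p.label).zip(predictions)
    test_err = (labels_and_preds.filter(lambda lp: lp[0] != lp[1]).count()
                / float(test_data.count()))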
9288
06:45:59,700 --> 06:46:00,900
So as you can see here,
9289
06:46:00,900 --> 06:46:04,507
the test error is
0.2297. We
9290
06:46:04,507 --> 06:46:08,200
have created a classification
decision tree model
9291
06:46:08,200 --> 06:46:11,100
in which, for instance, feature
12 is split at 3, and the value
9292
06:46:11,100 --> 06:46:13,225
of feature
0 is split at 54.
9293
06:46:13,225 --> 06:46:16,014
So as you can see
our model is pretty good.
9294
06:46:16,014 --> 06:46:19,700
So now next we'll use regression
for the same purposes.
9295
06:46:19,700 --> 06:46:22,300
So let's perform the regression
using decision tree.
9296
06:46:22,500 --> 06:46:24,500
So as you can see
we have the train model
9297
06:46:24,500 --> 06:46:26,400
and we are using
the decision tree to
9298
06:46:26,400 --> 06:46:29,460
train the regressor using
the training data — the same
9299
06:46:29,460 --> 06:46:33,200
which we created using the
decision tree model up there.
9300
06:46:33,200 --> 06:46:34,811
There we used classification;
9301
06:46:34,811 --> 06:46:37,440
now we are using
regression. Now, similarly,
9302
06:46:37,440 --> 06:46:38,921
We are going to evaluate
9303
06:46:38,921 --> 06:46:42,500
our model using our test data
set and find that test errors
9304
06:46:42,500 --> 06:46:45,600
which is the mean squared error
here for regression.
9305
06:46:45,600 --> 06:46:48,200
So let's have a look
at the mean square error here.
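A sketch of the regression variant and its mean squared error, under the same assumptions:

    model = DecisionTree.trainRegressor(train_data, categoricalFeaturesInfo={}, maxDepth=3)
    predictions = model.predict(test_data.map(lambda p: p.features))
    labels_and_preds = test_data.map(lambda p: p.label).zip(predictions)
    mse = labels_and_preds.map(lambda lp: (lp[0] - lp[1]) ** 2).mean()
    print(model.toDebugString())  # the learned regression tree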
9306
06:46:48,200 --> 06:46:50,584
The mean square error is 0.168.
9307
06:46:50,800 --> 06:46:52,100
That is good.
9308
06:46:52,100 --> 06:46:53,318
Now finally if we have
9309
06:46:53,318 --> 06:46:55,700
a look at the Learned
regression tree model.
9310
06:46:56,800 --> 06:47:00,300
You can see we have created
the regression tree model
9311
06:47:00,300 --> 06:47:02,800
till the depth
of 3 with 15 nodes.
9312
06:47:02,800 --> 06:47:04,577
And here we have
all the features
9313
06:47:04,577 --> 06:47:06,300
and classification of the tree.
9314
06:47:11,000 --> 06:47:11,675
Hello folks.
9315
06:47:11,675 --> 06:47:13,700
Welcome to Spark
interview questions.
9316
06:47:13,800 --> 06:47:16,949
The session has been planned
collectively to have commonly
9317
06:47:16,949 --> 06:47:19,988
asked interview questions related
to the Spark technology,
9318
06:47:19,988 --> 06:47:22,400
and their general answers,
and the expectation
9319
06:47:22,400 --> 06:47:25,594
is that you are already aware
of this particular technology
9320
06:47:25,594 --> 06:47:29,200
To some extent and in general
the common questions being asked
9321
06:47:29,200 --> 06:47:31,500
as well as my
interaction with the technology.
9322
06:47:31,500 --> 06:47:33,600
So let's get this started.
9323
06:47:33,600 --> 06:47:36,023
So the agenda for
this particular session is
9324
06:47:36,023 --> 06:47:38,197
the basic questions
we are going to cover,
9325
06:47:38,197 --> 06:47:41,138
and questions related
to the Spark core technologies.
9326
06:47:41,138 --> 06:47:42,400
When I say Spark
Core, that's going to be
06:47:42,400 --> 06:47:44,900
or that's going to be
the base, and on top
of Spark Core we have
06:47:44,900 --> 06:47:48,075
of spark or we have
four important components,
which are
06:47:48,075 --> 06:47:50,669
which work that
Streaming, GraphX,
9330
06:47:50,669 --> 06:47:53,100
MLlib and SQL.
All these components
9331
06:47:53,100 --> 06:47:57,500
have been created to satisfy
specific requirements. Again, we'll have interaction
9332
06:47:57,500 --> 06:47:59,495
with these Technologies and get
9333
06:47:59,495 --> 06:48:02,200
into the commonly
asked interview questions
9334
06:48:02,300 --> 06:48:04,500
and the questions also
framed in such a way that
9335
06:48:04,500 --> 06:48:07,200
they cover the spectrum
of the doubts as well
9336
06:48:07,200 --> 06:48:10,600
as the features available
within that specific technology.
9337
06:48:10,600 --> 06:48:12,512
So let's take the first question
9338
06:48:12,512 --> 06:48:15,800
and look into the answer like
how commonly this is covered.
9339
06:48:15,800 --> 06:48:19,800
What is Apache Spark? Spark
is with the Apache Foundation now;
9340
06:48:20,000 --> 06:48:21,000
it's open source.
9341
06:48:21,000 --> 06:48:22,809
It's a cluster
Computing framework
9342
06:48:22,809 --> 06:48:24,280
for real-time processing.
9343
06:48:24,280 --> 06:48:25,750
So three main keywords over.
9344
06:48:25,750 --> 06:48:28,151
here: Apache Spark is
an open source project.
9345
06:48:28,151 --> 06:48:29,856
It's used for cluster Computing.
9346
06:48:29,856 --> 06:48:33,272
And for in-memory processing
along with real-time processing.
9347
06:48:33,272 --> 06:48:35,485
It's going to support
in memory Computing.
9348
06:48:35,485 --> 06:48:36,672
So there are lots of projects
which support cluster computing;
06:48:36,672 --> 06:48:38,400
which supports cluster Computing
9350
06:48:38,400 --> 06:48:42,100
along with that spark
differentiate Itself by doing
9351
06:48:42,100 --> 06:48:43,839
the in-memory Computing.
9352
06:48:43,839 --> 06:48:46,231
It has a very active
community, and out
9353
06:48:46,231 --> 06:48:50,000
of the Hadoop ecosystem
technologies, Apache Spark is
9354
06:48:50,000 --> 06:48:51,500
very active; multiple releases
9355
06:48:51,500 --> 06:48:52,800
we got last year.
9356
06:48:52,800 --> 06:48:56,750
It's a very active project
among the Apache projects. Basically,
9357
06:48:56,750 --> 06:49:00,072
it's a framework that supports
in-memory computing
9358
06:49:00,072 --> 06:49:04,100
and cluster Computing and you
may face this specific question
9359
06:49:04,100 --> 06:49:05,700
how spark is different
9360
06:49:05,700 --> 06:49:08,085
than mapreduce on
how you can compare it
9361
06:49:08,085 --> 06:49:11,400
with MapReduce. MapReduce
is the processing methodology
9362
06:49:11,400 --> 06:49:12,900
within the Hadoop ecosystem
9363
06:49:12,900 --> 06:49:14,400
and within Hadoop ecosystem.
9364
06:49:14,400 --> 06:49:18,700
we have HDFS, the Hadoop distributed
file system; MapReduce is going
9365
06:49:18,700 --> 06:49:23,300
to support distributed computing
and how spark is different.
9366
06:49:23,300 --> 06:49:25,900
So how can we compare
Spark with
9367
06:49:25,900 --> 06:49:28,907
MapReduce? In a way,
this comparison is going
9368
06:49:28,907 --> 06:49:32,400
to help us to understand
the technology better.
9369
06:49:32,400 --> 06:49:33,100
But definitely
9370
06:49:33,100 --> 06:49:36,600
like, we cannot compare these two;
they are two different methodologies
9371
06:49:36,600 --> 06:49:40,200
by which it's going to work
spark is very simple to program
9372
06:49:40,200 --> 06:49:42,700
but mapreduce there
is no abstraction
9373
06:49:42,700 --> 06:49:44,118
in the sense that all
9374
06:49:44,118 --> 06:49:47,900
the implementations we have
to provide. And interactivity:
9375
06:49:47,900 --> 06:49:52,200
Spark has an interactive mode to
work with; in MapReduce,
9376
06:49:52,200 --> 06:49:53,800
there is no interactive mode.
9377
06:49:53,800 --> 06:49:55,900
There are some
components like Apache
9378
06:49:55,900 --> 06:49:56,800
Pig and Hive
9379
06:49:56,800 --> 06:50:00,400
which facilitate us to do
the interactive Computing
9380
06:50:00,400 --> 06:50:02,145
or interactive programming
9381
06:50:02,145 --> 06:50:05,100
and Spark supports
real-time stream processing
9382
06:50:05,100 --> 06:50:07,700
and to precisely
say, within Spark,
9383
06:50:07,700 --> 06:50:11,000
the stream processing is called
a near real-time processing.
9384
06:50:11,000 --> 06:50:13,600
There's nothing in the world
that is true real-time processing;
9385
06:50:13,600 --> 06:50:15,100
It's near real-time processing.
9386
06:50:15,100 --> 06:50:18,200
It's going to do the processing
and micro batches.
9387
06:50:18,200 --> 06:50:19,200
I'll cover in detail
9388
06:50:19,200 --> 06:50:21,400
when we are moving
onto the streaming concept
9389
06:50:21,400 --> 06:50:22,600
and you're going to do
9390
06:50:22,600 --> 06:50:25,700
the batch processing on
the historical data in MapReduce.
9391
06:50:25,700 --> 06:50:28,300
When I say stream
processing I will get the data
9392
06:50:28,300 --> 06:50:31,025
that is getting processed
in real time and do
9393
06:50:31,025 --> 06:50:33,849
the processing and get
the result either store it
9394
06:50:33,849 --> 06:50:35,772
or publish it.
9395
06:50:35,772 --> 06:50:37,697
We will be doing that. Latency-
9396
06:50:37,697 --> 06:50:40,149
wise, MapReduce will have
very high latency
9397
06:50:40,149 --> 06:50:42,915
because it has to read
the data from hard disk,
9398
06:50:42,915 --> 06:50:45,200
but spark it will have
very low latency
9399
06:50:45,200 --> 06:50:47,200
because it can reprocess
9400
06:50:47,200 --> 06:50:50,500
or reuse the data
already cached in memory.
9401
06:50:50,500 --> 06:50:53,786
but there is a small catch
over here in spark first time
9402
06:50:53,786 --> 06:50:56,600
when the data gets loaded it
has to read it
9403
06:50:56,600 --> 06:50:59,100
from the hard disk
same as mapreduce.
9404
06:50:59,100 --> 06:51:01,600
So once it is read, it
will be there in the memory.
9405
06:51:01,692 --> 06:51:03,000
So spark is good.
9406
06:51:03,000 --> 06:51:05,100
whenever we need to do iterative
9407
06:51:05,100 --> 06:51:08,900
computing. So with Spark, whenever
you do iterative computing — again
9408
06:51:08,900 --> 06:51:11,400
and again doing the processing
on the same data,
9409
06:51:11,400 --> 06:51:14,200
especially in machine learning
and deep learning, where we will be
9410
06:51:14,200 --> 06:51:17,900
using iterative computing —
Spark performs much better.
9411
06:51:17,900 --> 06:51:19,805
You will see
the raw performance
9412
06:51:19,805 --> 06:51:22,651
improvement — a hundred times
faster than MapReduce.
9413
06:51:22,651 --> 06:51:25,800
But if it is one time processing
and fire-and-forget,
9414
06:51:25,800 --> 06:51:28,805
kind of processing, with Spark
9415
06:51:28,805 --> 06:51:30,600
maybe the same latency,
9416
06:51:30,600 --> 06:51:32,699
you will be getting as with
MapReduce — maybe
9417
06:51:32,699 --> 06:51:35,900
like some improvements because
of the building blocks of Spark
9418
06:51:35,900 --> 06:51:38,800
— that is, the DAG — you may get
some additional advantage.
9419
06:51:38,800 --> 06:51:43,000
So those are the key features or
the key comparison factors
9420
06:51:43,300 --> 06:51:45,200
of Spark and MapReduce.
9421
06:51:45,800 --> 06:51:50,100
Now, let's get on to the key
features of Spark.
9422
06:51:50,200 --> 06:51:52,200
We discussed over
the Speed and Performance.
9423
06:51:52,200 --> 06:51:54,200
It's going to use
the in-memory Computing
9424
06:51:54,200 --> 06:51:55,559
so Speed and Performance.
9425
06:51:55,559 --> 06:51:57,300
It's going to be much better
9426
06:51:57,300 --> 06:52:00,900
when we do iterative computing.
And polyglot, in the sense that
9427
06:52:00,900 --> 06:52:03,810
the programming language
to be used with Spark
9428
06:52:03,810 --> 06:52:06,700
can be any of these languages:
it can be Python,
9429
06:52:06,700 --> 06:52:08,400
Java, R or Scala.
9430
06:52:08,400 --> 06:52:08,570
Mm.
9431
06:52:08,570 --> 06:52:11,300
We can do programming
with any of these languages
9432
06:52:11,300 --> 06:52:14,200
and data formats
to give as input:
9433
06:52:14,200 --> 06:52:17,172
we can give any data formats
like JSON or Parquet —
9434
06:52:17,172 --> 06:52:18,900
whatever the data format,
9435
06:52:18,900 --> 06:52:21,888
it can be given as input,
and the key selling point
9436
06:52:21,888 --> 06:52:24,400
with Spark is its
lazy evaluation, in the
9437
06:52:24,400 --> 06:52:25,575
sense that it's going
9438
06:52:25,575 --> 06:52:29,100
to calculate the DAG cycle,
directed acyclic graph
9439
06:52:29,100 --> 06:52:32,700
D-A-G. Based on the DAG,
it's going to calculate
9440
06:52:32,700 --> 06:52:35,300
what all steps needs
to be executed to achieve
9441
06:52:35,300 --> 06:52:36,400
the final result.
9442
06:52:36,400 --> 06:52:38,969
So we need to give all
the steps as well as
9443
06:52:38,969 --> 06:52:40,519
what final result I want.
9444
06:52:40,519 --> 06:52:42,983
It's going to calculate
the optimal cycle
9445
06:52:42,983 --> 06:52:44,400
or optimal calculation:
9446
06:52:44,400 --> 06:52:46,400
what all steps need
to be calculated,
9447
06:52:46,400 --> 06:52:49,100
or what all steps need
to be executed; only those steps
9448
06:52:49,100 --> 06:52:50,500
it will be executing it.
9449
06:52:50,500 --> 06:52:52,900
So basically it's
a lazy execution only
9450
06:52:52,900 --> 06:52:54,450
if the results needs
to be processed,
9451
06:52:54,450 --> 06:52:55,800
it will be processing that.
9452
06:52:55,800 --> 06:52:58,623
And it's
about real-time computing:
9453
06:52:58,623 --> 06:53:00,200
it's through Spark Streaming —
9454
06:53:00,200 --> 06:53:02,200
that is a component
called spark streaming
9455
06:53:02,200 --> 06:53:04,700
which supports real-time
Computing and it gels
9456
06:53:04,700 --> 06:53:07,115
with the Hadoop ecosystem very well.
9457
06:53:07,115 --> 06:53:09,500
It can run on top of Hadoop YARN,
9458
06:53:09,500 --> 06:53:12,562
or it can Leverage The hdfs
to do the processing.
9459
06:53:12,562 --> 06:53:16,300
So when it leverages the hdfs
the Hadoop cluster container
9460
06:53:16,300 --> 06:53:19,400
can be used to do
the distributed computing
9461
06:53:19,400 --> 06:53:23,707
as well as it can leverage
the resource manager to manage
9462
06:53:23,707 --> 06:53:25,400
the resources. So Spark
9463
06:53:25,400 --> 06:53:28,426
can gel with the HDFS very
well as well as it can leverage
9464
06:53:28,426 --> 06:53:29,642
the resource manager
9465
06:53:29,642 --> 06:53:32,500
to share the resources
as well as data locality.
9466
06:53:32,500 --> 06:53:34,699
It can leverage data locality:
9467
06:53:34,699 --> 06:53:36,900
it can do the processing where
9468
06:53:36,900 --> 06:53:41,200
the data is located
within the HDFS. And it has a fleet
9469
06:53:41,200 --> 06:53:43,700
of machine learning
algorithms already implemented
9470
06:53:43,700 --> 06:53:46,100
right from clustering
classification regression.
9471
06:53:46,100 --> 06:53:48,238
All this logic
already implemented
9472
06:53:48,238 --> 06:53:49,600
and machine learning.
9473
06:53:49,600 --> 06:53:52,400
It's achieved using
MLlib within Spark,
9474
06:53:52,400 --> 06:53:54,800
and there is a component
called GraphX,
9475
06:53:54,800 --> 06:53:58,600
which supports graph processing.
We can solve the problems using
9476
06:53:58,600 --> 06:54:02,600
graph Theory using the component
GraphX within Spark.
9477
06:54:02,700 --> 06:54:04,700
So these are the things
we can consider as
9478
06:54:04,700 --> 06:54:06,700
the key features of spark.
9479
06:54:06,700 --> 06:54:09,400
So when you discuss
with the installation
9480
06:54:09,400 --> 06:54:10,300
of the spark,
9481
06:54:10,300 --> 06:54:13,581
you may come across this question
on what YARN is, and do you
9482
06:54:13,581 --> 06:54:16,765
need to install spark
on all nodes of a YARN cluster?
9483
06:54:16,765 --> 06:54:19,700
So yarn is nothing
but Yet Another Resource Negotiator.
9484
06:54:19,700 --> 06:54:22,500
That's the resource manager
within the Hadoop ecosystem.
9485
06:54:22,500 --> 06:54:25,529
So that's going to provide the
resource management platform.
9486
06:54:25,529 --> 06:54:28,200
YARN is going to provide
the resource management platform
9487
06:54:28,200 --> 06:54:29,500
across all the Clusters
9488
06:54:29,600 --> 06:54:33,200
and Spark It's going
to provide the data processing.
9489
06:54:33,200 --> 06:54:35,300
So wherever there is
a resource being used,
9490
06:54:35,300 --> 06:54:38,049
that location's resources will be
used to do the data processing.
9491
06:54:38,049 --> 06:54:39,056
And of course, yes,
9492
06:54:39,056 --> 06:54:41,600
we need to have spark
installed on all the nodes.
9493
06:54:41,800 --> 06:54:43,900
wherever Spark executors are located.
9494
06:54:43,900 --> 06:54:47,100
Basically, we need
those libraries in addition
9495
06:54:47,100 --> 06:54:50,200
to the installation of spark
on all the worker nodes.
9496
06:54:50,200 --> 06:54:52,106
We need to increase
the ram capacity
9497
06:54:52,106 --> 06:54:53,283
on the worker machines,
9498
06:54:53,283 --> 06:54:55,800
as Spark is going
to consume huge amounts of
9499
06:54:56,100 --> 06:55:00,500
memory to do the processing. It
will not do the mapreduce way
9500
06:55:00,500 --> 06:55:01,600
of working internally.
9501
06:55:01,600 --> 06:55:04,191
It's going to generate
the DAG cycle and do
9502
06:55:04,191 --> 06:55:06,000
the processing on top of YARN.
9503
06:55:06,000 --> 06:55:09,900
So YARN, at a high level, is
like a resource manager,
9504
06:55:09,900 --> 06:55:13,100
or like an operating system
for the distributed computing.
9505
06:55:13,100 --> 06:55:15,500
It's going to coordinate
all the resource management
9506
06:55:15,500 --> 06:55:17,900
across the fleet
of servers on top of it.
9507
06:55:17,900 --> 06:55:20,100
I can have multiple components
9508
06:55:20,100 --> 06:55:25,100
like Spark, Tez or Giraph.
Spark especially is going
9509
06:55:25,100 --> 06:55:27,800
to help us achieve
in-memory computing.
9510
06:55:27,800 --> 06:55:30,900
So Spark-on-YARN is nothing
but this: YARN is a resource manager
9511
06:55:30,900 --> 06:55:33,600
to manage the resource
across the cluster on top of it.
9512
06:55:33,600 --> 06:55:35,470
we can have Spark. And yes,
9513
06:55:35,470 --> 06:55:37,700
we need to have spark installed
9514
06:55:37,700 --> 06:55:41,800
on all the nodes where
the Spark YARN cluster is used,
9515
06:55:41,800 --> 06:55:43,581
and also additional to that.
9516
06:55:43,581 --> 06:55:45,809
We need to have
the memory increased
9517
06:55:45,809 --> 06:55:47,400
in all the worker nodes.
9518
06:55:47,600 --> 06:55:48,870
The next question goes
9519
06:55:48,870 --> 06:55:51,400
like this: what file
systems does Spark support?
9520
06:55:52,300 --> 06:55:55,779
What is a file system? When
we work on an individual system,
9521
06:55:55,779 --> 06:55:58,100
We will be having
a file system to work
9522
06:55:58,100 --> 06:56:01,000
within that particular
operating system. In a
9523
06:56:01,000 --> 06:56:04,900
distributed cluster, or in
the distributed architecture.
9524
06:56:04,900 --> 06:56:06,744
We need a file system with which
9525
06:56:06,744 --> 06:56:09,800
where we can store the data
in a distributed mechanism.
9526
06:56:09,800 --> 06:56:12,900
Hadoop comes with
the file system called hdfs.
9527
06:56:13,100 --> 06:56:15,800
It's called Hadoop
distributed file system
9528
06:56:15,800 --> 06:56:19,131
where data gets distributed
across multiple systems
9529
06:56:19,131 --> 06:56:21,400
and it will be coordinated by two
9530
06:56:21,400 --> 06:56:24,500
different types of components,
called NameNode and DataNode,
9531
06:56:24,500 --> 06:56:27,800
and Spark it can use
this hdfs directly.
9532
06:56:27,800 --> 06:56:30,900
So you can have any files
in hdfs and start using it
9533
06:56:30,900 --> 06:56:34,800
within the spark ecosystem
and it gives another advantage
9534
06:56:34,800 --> 06:56:35,900
of data locality
9535
06:56:35,900 --> 06:56:38,415
when it does the distributed
processing wherever
9536
06:56:38,415 --> 06:56:39,700
the data is distributed.
9537
06:56:39,700 --> 06:56:42,400
The processing could be done
locally to that particular
9538
06:56:42,400 --> 06:56:44,300
Mission way data is located
9539
06:56:44,300 --> 06:56:47,223
and to start with as
a standalone mode.
9540
06:56:47,223 --> 06:56:49,500
You can use the local
file system as well.
9541
06:56:49,600 --> 06:56:51,508
So this could be used especially
9542
06:56:51,508 --> 06:56:53,818
when we are doing
the development or any
9543
06:56:53,818 --> 06:56:56,390
POC — you can use
the local file system
9544
06:56:56,390 --> 06:56:59,500
and Amazon Cloud provides
another file system called
9545
06:56:59,500 --> 06:57:02,119
S3 — Simple
Storage Service; we call
9546
06:57:02,119 --> 06:57:03,100
that the S3.
9547
06:57:03,100 --> 06:57:04,998
It's an object storage service.
9548
06:57:04,998 --> 06:57:06,700
This can also be leveraged
9549
06:57:06,700 --> 06:57:09,238
or used within Spark
for the storage.
9550
06:57:09,800 --> 06:57:11,100
And a lot of other file systems
9551
06:57:11,100 --> 06:57:14,700
it also supports; there are
some file systems like Alluxio
9552
06:57:14,700 --> 06:57:17,700
which provide
in-memory storage,
9553
06:57:17,700 --> 06:57:20,800
so we can leverage that
particular file system as well.
9554
06:57:21,100 --> 06:57:22,796
So we have seen
all the features.
9555
06:57:22,796 --> 06:57:25,580
What are the functionalities
available within Spark.
9556
06:57:25,580 --> 06:57:27,600
We're going to look
at the limitations
9557
06:57:27,600 --> 06:57:28,800
of using spark.
9558
06:57:28,800 --> 06:57:30,252
Of course every component
9559
06:57:30,252 --> 06:57:33,000
when it comes with
a huge power and Advantage.
9560
06:57:33,000 --> 06:57:35,200
It will have its own
limitations as well.
9561
06:57:35,300 --> 06:57:38,900
This question illustrates
some limitations of using
9562
06:57:38,900 --> 06:57:41,900
Spark. Spark utilizes
more storage space
9563
06:57:41,900 --> 06:57:43,400
compared to Hadoop
9564
06:57:43,400 --> 06:57:44,715
when it comes
to the installation.
9565
06:57:44,715 --> 06:57:47,600
It's going to consume more space
but in the Big Data world,
9566
06:57:47,600 --> 06:57:49,500
that's not a
very huge constraint
9567
06:57:49,500 --> 06:57:52,206
because storage cost is
not very high
9568
06:57:52,206 --> 06:57:55,504
in the big data space. And a
developer needs to be careful
9569
06:57:55,504 --> 06:57:58,275
while running the apps
in Spark, the reason
9570
06:57:58,275 --> 06:58:00,300
being that it uses
in-memory Computing.
9571
06:58:00,400 --> 06:58:02,870
Of course, it handles
the memory very well.
9572
06:58:02,870 --> 06:58:05,400
But if you try to load
a huge amount of data
9573
06:58:05,400 --> 06:58:08,700
in the distributed environment,
and if you try to do a join —
9574
06:58:08,700 --> 06:58:09,903
when you try to do join
9575
06:58:09,903 --> 06:58:13,491
within the distributed world, the
data is going to get transferred
9576
06:58:13,491 --> 06:58:14,700
over the network. Network
9577
06:58:14,700 --> 06:58:18,100
is really a costly
resource. So the plan
9578
06:58:18,200 --> 06:58:20,800
or design should be such
a way as to reduce or minimize
9579
06:58:20,800 --> 06:58:23,500
the data transferred
over the network
9580
06:58:23,500 --> 06:58:27,103
And wherever
possible, with all possible means,
9581
06:58:27,103 --> 06:58:30,000
we should facilitate
distribution of data
9582
06:58:30,000 --> 06:58:32,200
over multiple machines. The more
9583
06:58:32,200 --> 06:58:34,600
we distribute the more
parallelism we can achieve
9584
06:58:34,600 --> 06:58:38,500
and the more results we can get
and cost efficiency.
9585
06:58:38,500 --> 06:58:40,700
If you try to compare the cost
9586
06:58:40,700 --> 06:58:42,800
how much cost involved
9587
06:58:42,800 --> 06:58:45,700
to do a particular
processing take any unit
9588
06:58:45,700 --> 06:58:48,545
in terms of processing
1 GB of data with say
9589
06:58:48,545 --> 06:58:50,200
like iterative processing —
9590
06:58:50,200 --> 06:58:53,800
if you compare cost-wise, in-memory
computing is always considered
9591
06:58:53,800 --> 06:58:57,088
costlier, because memory is
relatively costlier
9592
06:58:57,088 --> 06:58:58,200
than the storage
9593
06:58:58,400 --> 06:59:00,000
so that may act
like a bottleneck
9594
06:59:00,000 --> 06:59:01,400
and we cannot increase
9595
06:59:01,400 --> 06:59:05,200
the memory capacity of
the machine beyond some limit.
9596
06:59:05,900 --> 06:59:07,500
So we have to grow horizontally.
9597
06:59:07,800 --> 06:59:10,042
So when we have
the data distributed
9598
06:59:10,042 --> 06:59:11,900
in memory across the cluster,
9599
06:59:12,000 --> 06:59:13,337
of course the network transfer
9600
06:59:13,337 --> 06:59:15,300
all those bottlenecks
will come into picture.
9601
06:59:15,300 --> 06:59:17,400
So we have to strike
the right balance
9602
06:59:17,400 --> 06:59:20,700
which will help us to achieve
the in-memory computing.
9603
06:59:20,700 --> 06:59:22,775
Whatever in-memory
computing we require, it
9604
06:59:22,775 --> 06:59:24,000
will help us to achieve
9605
06:59:24,000 --> 06:59:25,757
And it consumes a huge amount
9606
06:59:25,757 --> 06:59:28,400
of memory for data processing
compared to Hadoop
9607
06:59:28,600 --> 06:59:30,600
And Spark performs
9608
06:59:30,600 --> 06:59:33,800
better when used for
iterative computing,
9609
06:59:33,800 --> 06:59:36,700
because — alike for both Spark
and the other technologies,
9610
06:59:36,700 --> 06:59:37,699
it has to read data
9611
06:59:37,699 --> 06:59:39,700
for the first time
from the hard disk
9612
06:59:39,700 --> 06:59:43,300
or from another data source — Spark
performance is really better
9613
06:59:43,300 --> 06:59:46,114
when it reads the data once and
does the processing
9614
06:59:46,114 --> 06:59:48,500
when the data is available
in the cache,
9615
06:59:48,723 --> 06:59:50,800
Of course, there is the DAG cycle.
9616
06:59:50,800 --> 06:59:53,094
It's going to give
us a lot of advantage
9617
06:59:53,094 --> 06:59:54,400
while doing the processing
9618
06:59:54,400 --> 06:59:56,802
but the in-memory
Computing processing
9619
06:59:56,802 --> 06:59:59,400
that's going to give
us lots of Leverage.
9620
06:59:59,400 --> 07:00:01,605
The next question
list some use cases
9621
07:00:01,605 --> 07:00:04,300
where Spark outperforms
Hadoop in processing.
9622
07:00:04,400 --> 07:00:06,300
The first thing is
the real time processing.
9623
07:00:06,300 --> 07:00:08,629
Hadoop cannot handle
real-time processing,
9624
07:00:08,629 --> 07:00:10,884
but Spark can handle
real time processing.
9625
07:00:10,884 --> 07:00:13,843
So any data that's coming in
in the Lambda architecture,
9626
07:00:13,843 --> 07:00:15,300
you will have three layers.
9627
07:00:15,300 --> 07:00:17,210
Most of the big
Data projects will be
9628
07:00:17,210 --> 07:00:18,500
in the Lambda architecture.
9629
07:00:18,500 --> 07:00:21,500
You will have the speed layer,
batch layer and serving layer,
9630
07:00:21,500 --> 07:00:23,900
and in the speed layer,
whenever the data comes
9631
07:00:23,900 --> 07:00:26,900
in, that needs to be processed,
stored and handled.
9632
07:00:26,900 --> 07:00:27,975
So in those type
9633
07:00:27,975 --> 07:00:30,800
of real-time processing, Spark
is the best fit.
9634
07:00:30,800 --> 07:00:32,500
Of course, within the
Hadoop ecosystem,
9635
07:00:32,500 --> 07:00:33,837
we have other components
9636
07:00:33,837 --> 07:00:36,400
which do the real-time
processing, like Storm.
9637
07:00:36,400 --> 07:00:39,000
But when you want to Leverage
The Machine learning
9638
07:00:39,000 --> 07:00:40,500
along with Spark Streaming
9639
07:00:40,500 --> 07:00:43,200
for such computation, Spark
will be much better.
9640
07:00:43,200 --> 07:00:44,243
So that's why I like
9641
07:00:44,243 --> 07:00:45,621
when you have architecture
9642
07:00:45,621 --> 07:00:47,900
like a Lambda architecture
you want to have
9643
07:00:47,900 --> 07:00:51,100
all three layers — batch layer,
speed layer and serving layer —
9644
07:00:51,100 --> 07:00:54,800
Spark can gel with the speed layer
and serving layer far better,
9645
07:00:54,800 --> 07:00:56,800
and it's going to provide
better performance.
9646
07:00:56,800 --> 07:00:59,400
And whenever you do
iterative processing,
9647
07:00:59,400 --> 07:01:02,400
especially like doing
a machine learning processing,
9648
07:01:02,400 --> 07:01:04,501
we will leverage
iterative computing
9649
07:01:04,501 --> 07:01:06,210
and can perform a hundred times
9650
07:01:06,210 --> 07:01:08,800
faster than Hadoop. The more
iterative processing
9651
07:01:08,800 --> 07:01:11,600
that we do the more data
will be read from the memory
9652
07:01:11,600 --> 07:01:14,700
and it's going to give us
much faster performance
9653
07:01:14,700 --> 07:01:16,700
than with MapReduce.
9654
07:01:16,700 --> 07:01:20,100
So again, remember whenever you
do the processing only once —
9655
07:01:20,100 --> 07:01:23,000
that is, you're going to do
the processing only once:
9656
07:01:23,000 --> 07:01:24,900
read, process it and deliver
9657
07:01:24,900 --> 07:01:27,516
the result — Spark
may not be the best fit;
9658
07:01:27,516 --> 07:01:30,200
that can be done
with MapReduce itself.
9659
07:01:30,200 --> 07:01:32,773
And there is another component
called Akka. It's
9660
07:01:32,773 --> 07:01:35,600
a messaging system, or a message
coordination
9661
07:01:35,600 --> 07:01:38,500
system. Spark
internally uses Akka
9662
07:01:38,500 --> 07:01:40,500
for scheduling of any task
9663
07:01:40,500 --> 07:01:43,100
that needs to be assigned
by the master to the worker
9664
07:01:43,700 --> 07:01:45,700
and the follow-up
of that particular task
9665
07:01:45,700 --> 07:01:49,000
by the master — basically an
asynchronous coordination system,
9666
07:01:49,000 --> 07:01:51,000
and that's achieved using Akka.
9667
07:01:51,400 --> 07:01:55,100
Akka programming internally
is used by Spark;
9668
07:01:55,100 --> 07:01:56,551
as such for the developers.
9669
07:01:56,551 --> 07:01:59,358
we don't need to worry
about Akka programming.
9670
07:01:59,358 --> 07:02:00,900
Of course we can leverage it
9671
07:02:00,900 --> 07:02:04,500
but the car is used internally
by the spawn for scheduling
9672
07:02:04,500 --> 07:02:08,800
and coordination between master
and the burqa and with inspark.
9673
07:02:08,800 --> 07:02:10,700
We have few major components.
9674
07:02:10,700 --> 07:02:13,200
Let's see, what are
the major components
9675
07:02:13,200 --> 07:02:14,500
of a possessed man.
9676
07:02:14,500 --> 07:02:18,069
The lay the components
of spot ecosystem start comes
9677
07:02:18,069 --> 07:02:19,319
with a core engine.
9678
07:02:19,319 --> 07:02:20,700
So that has the core.
9679
07:02:20,700 --> 07:02:23,570
Realities of what is required
from by the spark
9680
07:02:23,570 --> 07:02:26,600
of all this Punk Oddities
are the building blocks
9681
07:02:26,600 --> 07:02:29,361
of the spark core engine
on top of spark
9682
07:02:29,361 --> 07:02:31,300
or the basic functionalities are
9683
07:02:31,300 --> 07:02:34,600
file interaction file system
coordination all that's done
9684
07:02:34,600 --> 07:02:36,400
by the spark core engine
9685
07:02:36,400 --> 07:02:38,432
on top of spark core engine.
9686
07:02:38,432 --> 07:02:40,900
We have a number
of other offerings
9687
07:02:40,900 --> 07:02:44,700
to do machine learning to do
graph Computing to do streaming.
9688
07:02:44,700 --> 07:02:47,000
We have n number
of other components.
9689
07:02:47,000 --> 07:02:48,800
So the majorly used ones
9690
07:02:48,800 --> 07:02:51,000
of these components are
Spark SQL,
9691
07:02:51,000 --> 07:02:52,037
Spark Streaming,
9692
07:02:52,037 --> 07:02:55,520
MLlib, GraphX
and SparkR, at a high level.
9693
07:02:55,520 --> 07:02:58,400
We will see what are
these components. Spark
9694
07:02:58,400 --> 07:03:02,000
SQL especially is designed
to do the processing
9695
07:03:02,000 --> 07:03:03,729
against structured data,
9696
07:03:03,729 --> 07:03:07,400
so we can write SQL queries
and we can handle
9697
07:03:07,400 --> 07:03:08,854
or we can do the processing.
9698
07:03:08,854 --> 07:03:11,400
So it's going to give us
the interface to interact
9699
07:03:11,400 --> 07:03:12,100
with the data,
9700
07:03:12,300 --> 07:03:15,900
especially structured data. And the
language
9701
07:03:15,900 --> 07:03:18,700
that we can use
it's more similar to
9702
07:03:18,700 --> 07:03:20,600
what we use within the SQL.
9703
07:03:20,600 --> 07:03:22,700
Well, I can say it is
99 percent the same,
9704
07:03:22,700 --> 07:03:25,934
and most of the commonly used
functionalities within the SQL
9705
07:03:25,934 --> 07:03:28,111
have been implemented
within Spark SQL,
9706
07:03:28,111 --> 07:03:31,700
and Spark streaming is going to
support the stream processing.
9707
07:03:31,700 --> 07:03:34,000
That's the offering
available to handle
9708
07:03:34,000 --> 07:03:35,920
the stream processing. And MLlib is
9709
07:03:35,920 --> 07:03:38,900
the offering
to handle machine learning.
9710
07:03:38,900 --> 07:03:42,700
So the component name
is called MLlib, and it has a list
9711
07:03:42,700 --> 07:03:44,300
of components — a list
9712
07:03:44,300 --> 07:03:47,300
of machine learning
algorithms already defined
9713
07:03:47,300 --> 07:03:50,700
we can leverage and use any
of those machine learning algorithms.
9714
07:03:51,400 --> 07:03:54,944
GraphX, again, is
the graph processing offering
9715
07:03:54,944 --> 07:03:56,200
within Spark.
9716
07:03:56,200 --> 07:03:59,141
It's going to support us
to achieve graph Computing
9717
07:03:59,141 --> 07:04:02,330
against the data that we have
like pagerank calculation.
9718
07:04:02,330 --> 07:04:04,107
how many connected entities,
9719
07:04:04,107 --> 07:04:07,600
how many triangles — all those are
going to provide us a meaning
9720
07:04:07,600 --> 07:04:09,300
to that particular data
9721
07:04:09,300 --> 07:04:12,500
And SparkR is the component
that is going to interact
9722
07:04:12,500 --> 07:04:14,371
with, or help us leverage,
9723
07:04:14,371 --> 07:04:17,856
the language R
within the Spark environment.
9724
07:04:18,100 --> 07:04:20,600
R is a statistical
programming language.
9725
07:04:20,600 --> 07:04:23,170
with which we can do
statistical Computing,
9726
07:04:23,170 --> 07:04:24,700
within the Spark environment,
9727
07:04:24,700 --> 07:04:28,306
and we can leverage the R language
by using SparkR to get
9728
07:04:28,306 --> 07:04:32,194
that executed within the SparkR
environment. In addition to that,
9729
07:04:32,194 --> 07:04:35,675
there are other components
as well, like the approximate query engine —
9730
07:04:35,675 --> 07:04:39,118
it's called BlinkDB — and a few
other such components.
9731
07:04:39,118 --> 07:04:42,541
So these are the majorly used
components within Spark.
9732
07:04:42,541 --> 07:04:43,561
So next question.
9733
07:04:43,561 --> 07:04:45,944
How can Spark be used
alongside Hadoop?
9734
07:04:45,944 --> 07:04:49,000
So even when we see Spark
performing much better, it's
9735
07:04:49,000 --> 07:04:51,000
not a replacement for Hadoop; it's
9736
07:04:51,000 --> 07:04:52,100
going to coexist
9737
07:04:52,100 --> 07:04:55,488
with Hadoop, right?
Leveraging Spark
9738
07:04:55,488 --> 07:04:56,900
and Hadoop together.
9739
07:04:56,900 --> 07:05:00,000
It's going to help us
to achieve the best result.
9740
07:05:00,000 --> 07:05:00,268
Yes.
9741
07:05:00,268 --> 07:05:04,300
Spark can do in-memory computing
or can handle the speed layer,
9742
07:05:04,300 --> 07:05:06,600
and Hadoop comes
with the resource manager
9743
07:05:06,600 --> 07:05:08,500
so we can leverage
the resource manager
9744
07:05:08,500 --> 07:05:10,900
of Hadoop to make Spark work.
9745
07:05:11,000 --> 07:05:13,529
And for a few processing tasks we
don't need to leverage
9746
07:05:13,529 --> 07:05:14,904
The in-memory Computing.
9747
07:05:14,904 --> 07:05:18,500
For example, one time processing
— do the processing and forget,
9748
07:05:18,500 --> 07:05:20,773
just store it — we
can use MapReduce.
9749
07:05:20,773 --> 07:05:24,700
There the processing cost, the
computing cost, will be much less
9750
07:05:24,700 --> 07:05:26,100
compared to Spark,
9751
07:05:26,100 --> 07:05:29,400
so we can amalgamate and
strike the right balance
9752
07:05:29,400 --> 07:05:31,700
between the batch processing
and stream processing
9753
07:05:31,700 --> 07:05:34,507
when we have spark
along with Hadoop.
9754
07:05:34,507 --> 07:05:38,100
Let's have some detailed questions
related to Spark Core.
9755
07:05:38,100 --> 07:05:39,100
Within Spark Core,
9756
07:05:39,100 --> 07:05:41,900
as I mentioned earlier
the core building block
9757
07:05:41,900 --> 07:05:45,600
of Spark Core is the RDD — resilient
distributed data set.
9758
07:05:45,600 --> 07:05:46,654
It's virtual:
9759
07:05:46,654 --> 07:05:48,442
It's not a physical entity.
9760
07:05:48,442 --> 07:05:49,900
It's a logical entity.
9761
07:05:49,900 --> 07:05:52,400
You will not see
this RDD existing.
9762
07:05:52,400 --> 07:05:54,700
The existence of the RDD
will come into the picture
9763
07:05:54,900 --> 07:05:56,474
when you take some action.
9764
07:05:56,474 --> 07:05:59,200
So the RDD
will be used or referred to
9765
07:05:59,200 --> 07:06:00,800
create the DAG cycle,
9766
07:06:00,943 --> 07:06:05,500
and RDDs will be optimized
to transform from one form
9767
07:06:05,500 --> 07:06:07,264
to another form to make a plan
9768
07:06:07,264 --> 07:06:09,400
how the data set needs
to be transformed
9769
07:06:09,400 --> 07:06:11,500
from one structure
to another structure.
9770
07:06:11,700 --> 07:06:14,817
And finally, when you take some action
against an RDD, the existence
9771
07:06:14,817 --> 07:06:15,924
of the data structure
9772
07:06:15,924 --> 07:06:18,200
— the resulting data —
will come into the picture,
9773
07:06:18,200 --> 07:06:20,500
and that can be stored
in any file system
9774
07:06:20,500 --> 07:06:22,000
whether it's HDFS, S3
9775
07:06:22,000 --> 07:06:24,568
or any other file system
it can be stored in. And
9776
07:06:24,568 --> 07:06:27,900
the RDD can exist
in a partitioned form, in the sense that
9777
07:06:27,900 --> 07:06:30,600
it can get distributed
across multiple systems
9778
07:06:30,600 --> 07:06:33,800
And it's fault tolerant:
9779
07:06:33,800 --> 07:06:36,494
if any of the RDDs
is lost — if any partition
9780
07:06:36,494 --> 07:06:37,742
of the RDD is lost —
9781
07:06:37,742 --> 07:06:40,700
it can regenerate only
that specific partition
9782
07:06:40,700 --> 07:06:41,700
it can regenerate
9783
07:06:41,900 --> 07:06:43,900
so that's a huge
advantage of the RDD.
9784
07:06:43,900 --> 07:06:46,600
So that's, like, the first
huge advantage of the RDD:
9785
07:06:46,600 --> 07:06:47,900
it's fault-tolerant,
9786
07:06:47,900 --> 07:06:50,600
where it can regenerate
the lost RDDs.
9787
07:06:50,600 --> 07:06:53,606
And it can exist
in a distributed fashion
9788
07:06:53,606 --> 07:06:55,165
and it is immutable, in the
9789
07:06:55,165 --> 07:06:59,300
sense that once the RDD is defined
or loaded, it cannot be changed.
9790
07:06:59,300 --> 07:07:01,500
The next question is
how do we create RDDs
9791
07:07:01,500 --> 07:07:04,500
in Spark? There are two ways we
can create RDDs. One:
9792
07:07:04,664 --> 07:07:09,700
using the SparkContext, we
can use any of the collections
9793
07:07:09,700 --> 07:07:12,700
that are available within
Scala or Java, and using
9794
07:07:12,700 --> 07:07:14,000
the parallelize function,
9795
07:07:14,000 --> 07:07:17,049
we can create the RDD.
And it's going to use
9796
07:07:17,049 --> 07:07:20,474
the underlying file
systems distribution mechanism
9797
07:07:20,474 --> 07:07:23,900
if the data is located
in a distributed file system,
9798
07:07:23,900 --> 07:07:24,700
like hdfs.
9799
07:07:25,000 --> 07:07:27,154
It will leverage
that and it will make
9800
07:07:27,154 --> 07:07:30,331
those RDDs available
across a number of systems.
9801
07:07:30,331 --> 07:07:33,696
So it's going to leverage
and follow the same distribution
9802
07:07:33,696 --> 07:07:34,700
in the RDD as well.
9803
07:07:34,700 --> 07:07:37,200
Or we can create the RDD
by loading the data
9804
07:07:37,200 --> 07:07:39,835
from external sources
as well, like HBase;
9805
07:07:39,835 --> 07:07:42,900
and HDFS we may not consider
as an external source —
9806
07:07:42,900 --> 07:07:45,300
it will be considered as
the file system of Hadoop.
9807
07:07:45,400 --> 07:07:47,300
So when Spark is working
9808
07:07:47,300 --> 07:07:49,743
with Hadoop mostly
the file system,
9809
07:07:49,743 --> 07:07:51,900
we will be using will be HDFS;
9810
07:07:51,900 --> 07:07:53,782
we can read
from HDFS,
9811
07:07:53,782 --> 07:07:55,900
or we can even read
from other sources,
9812
07:07:55,900 --> 07:07:59,781
like a Parquet file, or S3;
from different sources like Avro
9813
07:07:59,781 --> 07:08:02,000
you can read and create the RDD.
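A sketch of both creation paths (the paths are assumptions):

    rdd1 = sc.parallelize([1, 2, 3, 4, 5])        # from an in-memory collection
    rdd2 = sc.textFile("hdfs:///data/input.txt")  # from HDFS
    rdd3 = sc.textFile("s3a://bucket/input.txt")  # from an external source such as S3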
9814
07:08:02,200 --> 07:08:03,000
Next question is
9815
07:08:03,000 --> 07:08:05,800
what is executed memory
in a Spark application?
9816
07:08:05,800 --> 07:08:08,100
Every Spark application
will have a fixed
9817
07:08:08,100 --> 07:08:09,900
heap size and a fixed number
9818
07:08:09,900 --> 07:08:13,196
of cores for the Spark
executor. An executor is nothing
9819
07:08:13,196 --> 07:08:16,500
but the execution unit
available in every machine
9820
07:08:16,500 --> 07:08:19,600
and that's going to facilitate
to do the processing to do
9821
07:08:19,600 --> 07:08:21,654
the tasks in the Water machine,
9822
07:08:21,654 --> 07:08:25,300
so irrespective of whether you
use yarn resource manager
9823
07:08:25,300 --> 07:08:26,800
or any other resource
managers like Mesos,
07:08:26,800 --> 07:08:29,600
like resource manager
every worker machine
9825
07:08:29,600 --> 07:08:31,200
will have an executor,
9826
07:08:31,200 --> 07:08:34,400
and within the executor
the task will be handled
9827
07:08:34,400 --> 07:08:38,700
and the memory to be allocated
for that particular executor is
9828
07:08:38,700 --> 07:08:41,893
what we define as the heap size.
And we can define
9829
07:08:41,893 --> 07:08:42,775
how much amount
9830
07:08:42,775 --> 07:08:45,788
of memory should be used
for that particular executor
9831
07:08:45,788 --> 07:08:47,700
within the worker
machine, as well
9832
07:08:47,700 --> 07:08:50,900
as the number of cores that
can be used within the executor
9833
07:08:51,000 --> 07:08:53,988
by the Spark application,
9834
07:08:53,988 --> 07:08:55,600
and that can be controlled
9835
07:08:55,600 --> 07:08:58,100
through the configuration
files of spark.
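A sketch of setting those executor knobs through the configuration (the values are illustrative only):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .set("spark.executor.memory", "4g")  # heap size per executor
            .set("spark.executor.cores", "2"))   # cores per executor
    sc = SparkContext(conf=conf)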
9836
07:08:58,100 --> 07:09:01,300
Next question: define
partitions in Apache Spark.
9837
07:09:01,300 --> 07:09:03,100
So any data irrespective of
9838
07:09:03,100 --> 07:09:05,478
whether it is small
data or large data,
9839
07:09:05,478 --> 07:09:07,213
we can divide those data sets
9840
07:09:07,213 --> 07:09:10,708
across multiple systems.
The process of dividing the data
9841
07:09:10,708 --> 07:09:11,961
into multiple pieces
9842
07:09:11,961 --> 07:09:13,310
and making it to store
9843
07:09:13,310 --> 07:09:16,500
across multiple systems as
a different logical units.
9844
07:09:16,500 --> 07:09:17,549
It's called partitioning.
9845
07:09:17,549 --> 07:09:20,600
So in simple terms partitioning
is nothing but the process
9846
07:09:20,600 --> 07:09:21,700
of Dividing the data
9847
07:09:21,700 --> 07:09:24,800
and storing in multiple systems
is called partitions
9848
07:09:24,800 --> 07:09:26,600
and by default the conversion
9849
07:09:26,600 --> 07:09:29,700
of the data into R. TD
will happen in the system
9850
07:09:29,700 --> 07:09:31,400
where the partition is existing.
9851
07:09:31,400 --> 07:09:33,830
So the more the partition
the more parallelism
9852
07:09:33,830 --> 07:09:36,049
they are going to get
at the same time.
9853
07:09:36,049 --> 07:09:38,500
We have to be careful
not to trigger huge amount
9854
07:09:38,500 --> 07:09:40,100
of network data transfer as well
9855
07:09:40,300 --> 07:09:43,455
and every a DD can
be partitioned with inspark
9856
07:09:43,455 --> 07:09:45,700
and the panel
is the partitioning
9857
07:09:45,700 --> 07:09:49,559
going to help us to achieve
parallelism more the partition
9858
07:09:49,559 --> 07:09:50,685
that we have more.
9859
07:09:50,685 --> 07:09:52,000
Solutions can be done
9860
07:09:52,000 --> 07:09:54,300
and that the key thing
about the success
9861
07:09:54,300 --> 07:09:58,200
of the spark program is
minimizing the network traffic
9862
07:09:58,200 --> 07:10:00,900
while doing the parallel
processing and minimizing
9863
07:10:00,900 --> 07:10:04,247
the data transfer
within the systems of spark.
9864
07:10:04,247 --> 07:10:08,000
What operations does already
support so I can operate
9865
07:10:08,000 --> 07:10:10,228
multiple operations
against our GD.
9866
07:10:10,228 --> 07:10:13,900
So there are two type of things
we can do we can group it
9867
07:10:13,900 --> 07:10:16,000
into two one is transformations
9868
07:10:16,000 --> 07:10:18,800
in Transformations are did he
will get transformed
9869
07:10:18,800 --> 07:10:20,600
from one form to another form.
9870
07:10:20,600 --> 07:10:22,600
Select filtering grouping all
9871
07:10:22,600 --> 07:10:25,000
that like it's going
to get transformed
9872
07:10:25,000 --> 07:10:28,000
from one form to another form
one small example,
9873
07:10:28,000 --> 07:10:31,470
like reduced by key filter all
that will be Transformations.
9874
07:10:31,470 --> 07:10:33,700
The resultant of
the transformation will be
9875
07:10:33,700 --> 07:10:35,300
another rdd the same time.
9876
07:10:35,300 --> 07:10:37,700
We can take some actions
against the rdd
9877
07:10:37,700 --> 07:10:40,245
that's going to give
us the final result.
9878
07:10:40,245 --> 07:10:41,262
I can say count
9879
07:10:41,262 --> 07:10:43,500
how many records
or they are store
9880
07:10:43,500 --> 07:10:45,700
that result into the hdfs.
9881
07:10:46,100 --> 07:10:49,541
They all our actions so
multiple actions can be taken
9882
07:10:49,541 --> 07:10:50,600
against the RTD.
9883
07:10:50,600 --> 07:10:53,700
The existence of the data
will come into picture only
9884
07:10:53,700 --> 07:10:56,200
if I take some action
against not ready.
9885
07:10:56,200 --> 07:10:56,515
Okay.
9886
07:10:56,515 --> 07:10:57,400
Next question.
9887
07:10:57,400 --> 07:11:01,000
What do you understand
by transformations in spark?
9888
07:11:01,100 --> 07:11:03,679
So Transformations are
nothing but functions
9889
07:11:03,679 --> 07:11:06,800
mostly it will be higher
order functions within scale
9890
07:11:06,800 --> 07:11:09,400
and we have something
like a higher order functions
9891
07:11:09,400 --> 07:11:12,356
which will be applied
against the tardy.
9892
07:11:12,356 --> 07:11:14,100
Mostly against the list
9893
07:11:14,100 --> 07:11:16,407
of elements that we
have within the rdd
9894
07:11:16,407 --> 07:11:19,314
that function will get
applied by the existence
9895
07:11:19,314 --> 07:11:21,875
of the arditi will Come
into picture one lie
9896
07:11:21,875 --> 07:11:25,597
if we take some action against
it in this particular example,
9897
07:11:25,597 --> 07:11:26,900
I am reading the file
9898
07:11:26,900 --> 07:11:30,536
and having it within the rdd
Control Data then I am doing
9899
07:11:30,536 --> 07:11:32,500
some transformation using a map.
9900
07:11:32,500 --> 07:11:34,382
So it's going
to apply a function
9901
07:11:34,382 --> 07:11:35,623
so we can map I have
9902
07:11:35,623 --> 07:11:39,100
some function which will split
each record using the tab.
9903
07:11:39,100 --> 07:11:41,632
So the spit with the app
will be applied
9904
07:11:41,632 --> 07:11:44,300
against each record
within the raw data
9905
07:11:44,300 --> 07:11:48,200
and the resultant movies data
will again be another rdd,
9906
07:11:48,200 --> 07:11:50,644
but of course,
this will be a lazy operation.
9907
07:11:50,644 --> 07:11:53,700
The existence of movies data
will come into picture only
9908
07:11:53,700 --> 07:11:57,700
if I take some action
against it like count or print
9909
07:11:57,726 --> 07:12:01,573
or store only those actions
will generate the data.
9910
07:12:01,800 --> 07:12:04,600
So next question
Define functions of spark code.
9911
07:12:04,600 --> 07:12:07,100
So that's going to take care
of the memory management
9912
07:12:07,100 --> 07:12:09,400
and fault tolerance of rdds.
9913
07:12:09,400 --> 07:12:12,700
It's going to help us
to schedule distribute the task
9914
07:12:12,700 --> 07:12:15,400
and manage the jobs running
within the cluster
9915
07:12:15,400 --> 07:12:17,700
and so we're going to help
us to or store the rear
9916
07:12:17,700 --> 07:12:20,700
in the storage system as well
as reads data from the storage.
9917
07:12:20,700 --> 07:12:23,905
System that's to do the file
system level operations.
9918
07:12:23,905 --> 07:12:25,200
It's going to help us
9919
07:12:25,200 --> 07:12:27,500
and Spark core programming
can be done in any
9920
07:12:27,500 --> 07:12:30,347
of these languages
like Java scalar python
9921
07:12:30,347 --> 07:12:32,500
as well as using our so core is
9922
07:12:32,500 --> 07:12:35,600
that the horizontal level
on top of spark or we can have
9923
07:12:35,600 --> 07:12:37,500
a number of components
9924
07:12:37,600 --> 07:12:41,000
and there are different type
of rdds available one such
9925
07:12:41,000 --> 07:12:42,923
a special type is parody.
9926
07:12:42,923 --> 07:12:43,800
So next question.
9927
07:12:43,800 --> 07:12:46,100
What do you understand
by pay an rdd?
9928
07:12:46,100 --> 07:12:49,792
It's going to exist
in peace as a keys and values
9929
07:12:49,800 --> 07:12:51,906
so I can Some special functions
9930
07:12:51,906 --> 07:12:55,400
within the parodies
are special Transformations,
9931
07:12:55,400 --> 07:12:58,900
like connect all the values
corresponding to the same key
9932
07:12:58,900 --> 07:13:00,200
like solder Shuffle
9933
07:13:00,300 --> 07:13:02,800
what happens within
the shortened Shuffle of Hadoop
9934
07:13:02,900 --> 07:13:04,356
those type of operations
9935
07:13:04,356 --> 07:13:05,161
like you want
9936
07:13:05,161 --> 07:13:08,339
to consolidate our group
all the values corresponding
9937
07:13:08,339 --> 07:13:10,792
to the same key are
apply some functions
9938
07:13:10,792 --> 07:13:14,400
against all the values
corresponding to the same key.
9939
07:13:14,400 --> 07:13:16,200
Like I want to get the sum
9940
07:13:16,200 --> 07:13:20,400
of the value of all the keys
we can use the parody.
9941
07:13:20,400 --> 07:13:23,600
D and get that a cheat so
it's going to the data
9942
07:13:23,600 --> 07:13:29,300
within the re going to exist
in Pace keys and right.
9943
07:13:29,300 --> 07:13:31,376
Okay a question from Jason.
9944
07:13:31,376 --> 07:13:33,223
What are our Vector rdds
9945
07:13:33,300 --> 07:13:36,300
in machine learning you
will have huge amount
9946
07:13:36,300 --> 07:13:38,700
of processing handled by vectors
9947
07:13:38,700 --> 07:13:42,812
and matrices and we do lots
of operations Vector operations,
9948
07:13:42,812 --> 07:13:44,200
like effective actor
9949
07:13:44,200 --> 07:13:47,700
or transforming any data
into a vector form so vectors
9950
07:13:47,700 --> 07:13:50,755
like as the normal way
it will have a Direction.
9951
07:13:50,755 --> 07:13:51,624
And magnitude
9952
07:13:51,624 --> 07:13:54,900
so we can do some operations
like some two vectors
9953
07:13:54,900 --> 07:13:58,622
and what is the difference
between the vector A
9954
07:13:58,622 --> 07:14:00,500
and B as well as a and see
9955
07:14:00,500 --> 07:14:02,400
if the difference
between Vector A
9956
07:14:02,400 --> 07:14:04,200
and B is less compared to a
9957
07:14:04,200 --> 07:14:06,487
and C we can say the vector A
9958
07:14:06,487 --> 07:14:10,825
and B is somewhat similar
in terms of features.
9959
07:14:11,100 --> 07:14:13,815
So the vector R GD
will be used to represent
9960
07:14:13,815 --> 07:14:17,100
the vector directly and
that will be used extensively
9961
07:14:17,100 --> 07:14:19,500
while doing the
measuring and Jason.
9962
07:14:19,700 --> 07:14:20,500
Thank you other.
9963
07:14:20,500 --> 07:14:21,400
Is another question.
9964
07:14:21,400 --> 07:14:22,900
What is our GD lineage?
9965
07:14:22,900 --> 07:14:25,800
So here I any data
processing any Transformations
9966
07:14:25,800 --> 07:14:28,811
that we do it maintains
something called a lineage.
9967
07:14:28,811 --> 07:14:31,100
So what how data
is getting transformed
9968
07:14:31,100 --> 07:14:33,543
when the data is available
in the partition form
9969
07:14:33,543 --> 07:14:36,300
in multiple systems and
when we do the transformation,
9970
07:14:36,300 --> 07:14:39,800
it will undergo multiple steps
and in the distributed word.
9971
07:14:39,800 --> 07:14:42,700
It's very common to have
failures of machines
9972
07:14:42,700 --> 07:14:45,200
or machines going
out of the network
9973
07:14:45,200 --> 07:14:47,000
and the system our framework
9974
07:14:47,000 --> 07:14:47,800
as it should be
9975
07:14:47,800 --> 07:14:50,800
in a position to handle
small handles it through.
9976
07:14:50,858 --> 07:14:55,800
Did he leave eh it can restore
the last partition only assume
9977
07:14:55,800 --> 07:14:59,004
like out of ten machines
data is distributed
9978
07:14:59,004 --> 07:15:00,828
across five machines out of
9979
07:15:00,828 --> 07:15:03,800
that those five machines
One mission is lost.
9980
07:15:03,800 --> 07:15:06,500
So whatever the
latest transformation
9981
07:15:06,500 --> 07:15:07,807
that had the data
9982
07:15:08,000 --> 07:15:10,100
for that particular
partition the partition
9983
07:15:10,100 --> 07:15:13,924
in the last mission alone
can be regenerated and it knows
9984
07:15:13,924 --> 07:15:16,700
how to regenerate that data
on how to get that result
9985
07:15:16,700 --> 07:15:18,384
and data using the concept
9986
07:15:18,384 --> 07:15:21,153
of rdd lineage so
from which Each data source,
9987
07:15:21,153 --> 07:15:22,200
it got generated.
9988
07:15:22,200 --> 07:15:23,800
What was its previous step.
9989
07:15:23,800 --> 07:15:26,300
So the completely
is will be available
9990
07:15:26,300 --> 07:15:29,724
and it's maintained by
the spark framework internally.
9991
07:15:29,724 --> 07:15:31,700
We call that as Oddities in eh,
9992
07:15:31,700 --> 07:15:34,682
what is point driver to put
it simply for those
9993
07:15:34,682 --> 07:15:37,600
who are from her
do background yarn back room.
9994
07:15:37,600 --> 07:15:40,000
We can compare this
to at muster.
9995
07:15:40,100 --> 07:15:43,300
Every application will
have a spark driver
9996
07:15:43,300 --> 07:15:44,900
that will have a spot context
9997
07:15:44,900 --> 07:15:47,550
which is going to moderate
the complete execution
9998
07:15:47,550 --> 07:15:50,200
of the job that will connect
to the spark master.
9999
07:15:50,500 --> 07:15:52,300
Delivers the RTD graph
10000
07:15:52,300 --> 07:15:54,900
that is the lineage
for the master
10001
07:15:54,900 --> 07:15:56,810
and the coordinate the tasks.
10002
07:15:56,810 --> 07:15:57,817
What are the tasks
10003
07:15:57,817 --> 07:16:00,700
that gets executed
in the distributed environment?
10004
07:16:00,700 --> 07:16:01,500
It can do
10005
07:16:01,500 --> 07:16:04,400
the parallel processing
do the Transformations
10006
07:16:04,600 --> 07:16:06,900
and actions against the RTD.
10007
07:16:06,900 --> 07:16:08,551
So it's a single
point of contact
10008
07:16:08,551 --> 07:16:10,100
for that specific application.
10009
07:16:10,100 --> 07:16:12,500
So smart driver
is a short linked
10010
07:16:12,500 --> 07:16:15,300
and the spawn context
within this part driver
10011
07:16:15,300 --> 07:16:18,558
is going to be the coordinator
between the master and the tasks
10012
07:16:18,558 --> 07:16:20,694
that are running
and smart driver.
10013
07:16:20,694 --> 07:16:23,100
I can get started
in any of the executor
10014
07:16:23,100 --> 07:16:26,800
with inspark name types
of custom managers in spark.
10015
07:16:26,800 --> 07:16:28,800
So whenever you have
a group of machines,
10016
07:16:28,800 --> 07:16:30,247
you need a manager to manage
10017
07:16:30,247 --> 07:16:33,415
the resources the different type
of the store manager already.
10018
07:16:33,415 --> 07:16:35,700
We have seen the yarn
yet another assist ago.
10019
07:16:35,700 --> 07:16:39,400
She later which manages
the resources of Hadoop on top
10020
07:16:39,400 --> 07:16:43,000
of yarn we can make
Spock to book sometimes I
10021
07:16:43,000 --> 07:16:46,700
may want to have sparkle
own my organization
10022
07:16:46,700 --> 07:16:49,594
and not along with the Hadoop
or any other technology.
10023
07:16:49,594 --> 07:16:50,297
Then I can go
10024
07:16:50,297 --> 07:16:53,100
with the And alone spawn
has built-in cluster manager.
10025
07:16:53,100 --> 07:16:55,547
So only spawn can get
executed multiple systems.
10026
07:16:55,547 --> 07:16:57,423
But generally if we
have a cluster we
10027
07:16:57,423 --> 07:16:58,600
will try to leverage
10028
07:16:58,600 --> 07:17:01,600
various other Computing
platforms Computing Frameworks,
10029
07:17:01,600 --> 07:17:04,601
like graph processing
giraffe these on that.
10030
07:17:04,601 --> 07:17:07,000
We will try to
leverage that case.
10031
07:17:07,000 --> 07:17:08,321
We will go with yarn
10032
07:17:08,321 --> 07:17:10,700
or some generalized
resource manager,
10033
07:17:10,700 --> 07:17:12,000
like masseuse Ian.
10034
07:17:12,000 --> 07:17:14,400
It's very specific to Hadoop
and it comes along
10035
07:17:14,400 --> 07:17:18,500
with Hadoop measures is the
cluster level resource manager
10036
07:17:18,500 --> 07:17:20,600
and I have multiple clusters.
10037
07:17:20,600 --> 07:17:23,700
Within organization,
then you can use mrs.
10038
07:17:23,800 --> 07:17:25,883
Mrs. Is also a resource manager.
10039
07:17:25,883 --> 07:17:29,400
It's a separate table project
within Apache X question.
10040
07:17:29,400 --> 07:17:30,600
What do you understand
10041
07:17:30,600 --> 07:17:34,200
by worker node in a cluster
redistribute environment.
10042
07:17:34,200 --> 07:17:36,252
We will have n number
of workers we call
10043
07:17:36,252 --> 07:17:38,200
that is a worker node
or a slave node,
10044
07:17:38,200 --> 07:17:40,665
which does the actual
processing going to get
10045
07:17:40,665 --> 07:17:43,300
the data do the processing
and get us the result
10046
07:17:43,300 --> 07:17:45,100
and masternode going to assign
10047
07:17:45,100 --> 07:17:48,000
what has to be done by
one person own and it's going
10048
07:17:48,000 --> 07:17:50,551
to read the data available
in the specific work on.
10049
07:17:50,551 --> 07:17:53,196
Generally, the tasks assigned
to the worker node,
10050
07:17:53,196 --> 07:17:55,900
or the task will be assigned
to the output node data
10051
07:17:55,900 --> 07:17:57,500
is located in vigorous Pace.
10052
07:17:57,500 --> 07:18:00,100
Especially Hadoop always
it will try to achieve
10053
07:18:00,100 --> 07:18:01,183
the data locality.
10054
07:18:01,183 --> 07:18:04,391
That's what we can't is
the resource availability as
10055
07:18:04,391 --> 07:18:05,900
well as the availability
10056
07:18:05,900 --> 07:18:08,900
of the resource in terms
of CPU memory as well
10057
07:18:08,900 --> 07:18:10,000
will be considered
10058
07:18:10,000 --> 07:18:13,601
as you might have some data
in replicated in three missions.
10059
07:18:13,601 --> 07:18:16,884
All three machines are busy
doing the work and no CPU
10060
07:18:16,884 --> 07:18:19,414
or memory available
to start the other task.
10061
07:18:19,414 --> 07:18:20,400
It will not wait.
10062
07:18:20,400 --> 07:18:23,300
For those missions to complete
the job and get the resource
10063
07:18:23,300 --> 07:18:25,900
and do the processing it
will start the processing
10064
07:18:25,900 --> 07:18:27,000
and some other machine
10065
07:18:27,000 --> 07:18:28,200
which is going to be near
10066
07:18:28,200 --> 07:18:31,300
to that the missions having
the data and read the data
10067
07:18:31,300 --> 07:18:32,400
over the network.
10068
07:18:32,600 --> 07:18:35,100
So to answer straight
or commissions are nothing but
10069
07:18:35,100 --> 07:18:36,600
which does the actual work
10070
07:18:36,600 --> 07:18:37,755
and going to report
10071
07:18:37,755 --> 07:18:41,315
to the master in terms of what
is the resource utilization
10072
07:18:41,315 --> 07:18:42,627
and the tasks running
10073
07:18:42,627 --> 07:18:46,000
within the work emissions
will be doing the actual work
10074
07:18:46,000 --> 07:18:49,049
and what ways as past Vector
just few minutes back.
10075
07:18:49,049 --> 07:18:50,656
I was answering a question.
10076
07:18:50,656 --> 07:18:52,697
What is a vector
vector is nothing
10077
07:18:52,697 --> 07:18:55,500
but representing the data
in multi dimensional form?
10078
07:18:55,500 --> 07:18:57,500
The vector can
be multi-dimensional
10079
07:18:57,500 --> 07:18:58,500
Vector as well.
10080
07:18:58,500 --> 07:19:02,400
As you know, I am going
to represent a point in space.
10081
07:19:02,400 --> 07:19:04,938
I need three dimensions
the X y&z.
10082
07:19:05,000 --> 07:19:08,076
So the vector will
have three dimensions.
10083
07:19:08,300 --> 07:19:10,934
If I need to represent
a line in the species.
10084
07:19:10,934 --> 07:19:14,107
Then I need two points
to represent the starting point
10085
07:19:14,107 --> 07:19:17,700
of the line and the endpoint
of the line then I need a vector
10086
07:19:17,700 --> 07:19:18,800
which can hold
10087
07:19:18,800 --> 07:19:21,049
so it will have two Dimensions
the first First Dimension
10088
07:19:21,049 --> 07:19:23,121
will have one point
the second dimension
10089
07:19:23,121 --> 07:19:24,400
will have another Point
10090
07:19:24,400 --> 07:19:25,429
let us say point B
10091
07:19:25,429 --> 07:19:29,200
if I have to represent a plane
then I need another dimension
10092
07:19:29,200 --> 07:19:30,702
to represent two lines.
10093
07:19:30,702 --> 07:19:31,510
So each line
10094
07:19:31,510 --> 07:19:34,203
will be representing
two points same way.
10095
07:19:34,203 --> 07:19:37,200
I can represent any data
using a vector form
10096
07:19:37,200 --> 07:19:40,217
as you might have
huge number of feedback
10097
07:19:40,217 --> 07:19:43,500
or ratings of products
across an organization.
10098
07:19:43,500 --> 07:19:46,327
Let's take a simple example
Amazon Amazon have
10099
07:19:46,327 --> 07:19:47,632
millions of products.
10100
07:19:47,632 --> 07:19:50,498
Not every user not even
a single user would have
10101
07:19:50,498 --> 07:19:53,461
It was millions of all
the products within Amazon.
10102
07:19:53,461 --> 07:19:55,341
The only hardly
we would have used
10103
07:19:55,341 --> 07:19:58,400
like a point one percent
or like even less than that,
10104
07:19:58,400 --> 07:20:00,200
maybe like few hundred products.
10105
07:20:00,200 --> 07:20:02,600
We would have used
and rated the products
10106
07:20:02,600 --> 07:20:04,600
within amazing for
the complete lifetime.
10107
07:20:04,600 --> 07:20:07,700
If I have to represent
all ratings of the products
10108
07:20:07,700 --> 07:20:10,194
with director and see
the first position
10109
07:20:10,194 --> 07:20:13,400
of the rating it's going
to refer to the product
10110
07:20:13,400 --> 07:20:15,200
with ID 1 second position.
10111
07:20:15,200 --> 07:20:17,600
It's going to refer
to the product with ID 2.
10112
07:20:17,600 --> 07:20:20,700
So I will have million values
within that particular vector.
10113
07:20:20,700 --> 07:20:22,645
After out of million values,
10114
07:20:22,645 --> 07:20:25,493
I'll have only values
400 products where I
10115
07:20:25,493 --> 07:20:27,300
have provided the ratings.
10116
07:20:27,400 --> 07:20:30,947
So it may vary from number
1 to 5 for all others.
10117
07:20:30,947 --> 07:20:34,200
It will say 0 sparse
pins thinly distributed.
10118
07:20:34,800 --> 07:20:38,774
So to represent the huge amount
of data with the position
10119
07:20:38,774 --> 07:20:41,900
and saying this particular
position is having
10120
07:20:41,900 --> 07:20:43,800
a 0 value we can mention
10121
07:20:43,800 --> 07:20:45,900
that with a key and value.
10122
07:20:45,900 --> 07:20:47,415
So what position having
10123
07:20:47,415 --> 07:20:51,500
what value rather than storing
all Zero seconds told one lie
10124
07:20:51,500 --> 07:20:55,471
non-zeros the position of it and
that the corresponding value.
10125
07:20:55,471 --> 07:20:58,400
That means all others going
to be a zero value
10126
07:20:58,400 --> 07:21:01,400
so we can mention
this particular space
10127
07:21:01,400 --> 07:21:05,400
Vector mentioning it
to representa nonzero entities.
10128
07:21:05,400 --> 07:21:08,300
So to store only
the nonzero entities
10129
07:21:08,300 --> 07:21:10,364
this Mass Factor will be used
10130
07:21:10,364 --> 07:21:12,500
so that we don't need to based
10131
07:21:12,500 --> 07:21:15,550
on additional space was
during this past Vector.
10132
07:21:15,550 --> 07:21:18,600
Let's discuss some questions
on spark streaming.
10133
07:21:18,600 --> 07:21:21,422
How is streaming Dad
in sparking explained
10134
07:21:21,422 --> 07:21:23,900
with examples smart
streaming is used
10135
07:21:23,900 --> 07:21:25,452
for processing real-time
10136
07:21:25,452 --> 07:21:29,500
streaming data to precisely say
it's a micro batch processing.
10137
07:21:29,500 --> 07:21:32,852
So data will be collected
between every small interval say
10138
07:21:32,852 --> 07:21:35,128
maybe like .5 seconds
or every seconds
10139
07:21:35,128 --> 07:21:36,200
until you get processed.
10140
07:21:36,200 --> 07:21:36,900
So internally,
10141
07:21:36,900 --> 07:21:40,100
it's going to create
micro patches the data created
10142
07:21:40,100 --> 07:21:43,800
out of that micro batch we call
there is a d stream the stream
10143
07:21:43,800 --> 07:21:45,500
is like a and ready
10144
07:21:45,500 --> 07:21:48,200
so I can do
Transformations and actions.
10145
07:21:48,200 --> 07:21:50,691
Whatever that I do
with our DD I can do
10146
07:21:50,691 --> 07:21:52,200
With the stream as well
10147
07:21:52,500 --> 07:21:57,100
and Spark streaming can read
data from Flume hdfs are
10148
07:21:57,100 --> 07:21:59,500
other streaming services Aspen
10149
07:21:59,800 --> 07:22:02,565
and store the data
in the dashboard or in
10150
07:22:02,565 --> 07:22:06,300
any other database and it
provides very high throughput
10151
07:22:06,400 --> 07:22:09,200
as it can be processed with
a number of different systems
10152
07:22:09,200 --> 07:22:11,800
in a distributed
fashion again streaming.
10153
07:22:11,800 --> 07:22:14,858
This stream will be partitioned
internally and it has
10154
07:22:14,858 --> 07:22:17,100
the built-in feature
of fault tolerance,
10155
07:22:17,100 --> 07:22:18,700
even if any data is lost
10156
07:22:18,700 --> 07:22:22,100
and it's transformed already
is Lost it can regenerate
10157
07:22:22,100 --> 07:22:23,930
those rdds from the existing
10158
07:22:23,930 --> 07:22:25,500
or from the source data.
10159
07:22:25,500 --> 07:22:28,100
So these three is going
to be the building block
10160
07:22:28,100 --> 07:22:32,748
of streaming and it has
the fault tolerance mechanism
10161
07:22:32,748 --> 07:22:34,902
what we have within the RTD.
10162
07:22:35,000 --> 07:22:38,600
So this stream are specialized
on Didi specialized form
10163
07:22:38,600 --> 07:22:42,000
of our GD specifically to use it
within this box dreaming.
10164
07:22:42,000 --> 07:22:42,253
Okay.
10165
07:22:42,253 --> 07:22:42,963
Next question.
10166
07:22:42,963 --> 07:22:45,600
What is the significance
of sliding window operation?
10167
07:22:45,600 --> 07:22:48,700
That's a very interesting one
in the streaming data whenever
10168
07:22:48,700 --> 07:22:50,600
we do the Computing the data.
10169
07:22:50,600 --> 07:22:53,218
Density are the
business implications
10170
07:22:53,218 --> 07:22:56,500
of that specific data
May oscillate a lot.
10171
07:22:56,500 --> 07:22:58,400
For example within Twitter.
10172
07:22:58,400 --> 07:23:01,455
We used to say the trending
tweet hashtag just
10173
07:23:01,455 --> 07:23:03,900
because that hashtag
is very popular.
10174
07:23:03,900 --> 07:23:06,200
Maybe someone might have hacked
into the system
10175
07:23:06,200 --> 07:23:09,500
and used a number of tweets
maybe for that particular
10176
07:23:09,500 --> 07:23:12,202
our it might have appeared
millions of times just
10177
07:23:12,202 --> 07:23:15,123
because it appear billions
of times for that specific
10178
07:23:15,123 --> 07:23:16,107
and minute duration
10179
07:23:16,107 --> 07:23:18,800
or like say to three minute
duration each not getting
10180
07:23:18,800 --> 07:23:20,200
to the trending tank.
10181
07:23:20,200 --> 07:23:22,286
Trending hashtag for
that particular day
10182
07:23:22,286 --> 07:23:23,992
or for that particular month.
10183
07:23:23,992 --> 07:23:26,700
So what we will do we
will try to do an average.
10184
07:23:26,700 --> 07:23:29,357
So like a window
this current time frame
10185
07:23:29,357 --> 07:23:32,500
and T minus 1 T minus 2 all
the data we will consider
10186
07:23:32,500 --> 07:23:34,807
and we will try to find
the average or some
10187
07:23:34,807 --> 07:23:37,276
so the complete business logic
will be applied
10188
07:23:37,276 --> 07:23:39,100
against that particular window.
10189
07:23:39,200 --> 07:23:43,400
So any drastic changes
on to precisely say the spike
10190
07:23:43,500 --> 07:23:46,200
or deep very
drastic spinal cords
10191
07:23:46,200 --> 07:23:50,300
drastic deep in the pattern
of the data will be normalized.
10192
07:23:50,300 --> 07:23:51,100
So that's the
10193
07:23:51,100 --> 07:23:54,452
because significance of using
the sliding window operation
10194
07:23:54,452 --> 07:23:55,800
with inspark streaming
10195
07:23:55,800 --> 07:23:59,600
and smart can handle this
sliding window automatically.
10196
07:23:59,600 --> 07:24:04,000
It can store the prior data
the T minus 1 T minus 2 and
10197
07:24:04,000 --> 07:24:06,300
how big the window
needs to be maintained
10198
07:24:06,300 --> 07:24:09,192
or that can be handled easily
within the program
10199
07:24:09,192 --> 07:24:11,100
and it's at the abstract level.
10200
07:24:11,300 --> 07:24:12,100
Next question is
10201
07:24:12,100 --> 07:24:15,600
what is destroying the expansion
is discretized stream.
10202
07:24:15,600 --> 07:24:17,600
So that's the abstract form
10203
07:24:17,600 --> 07:24:20,500
or the which will form
of representation of the data.
10204
07:24:20,500 --> 07:24:22,494
For the spark
streaming the same way,
10205
07:24:22,494 --> 07:24:25,200
how are ready getting
transformed from one form
10206
07:24:25,200 --> 07:24:26,200
to another form?
10207
07:24:26,200 --> 07:24:27,504
We will have series
10208
07:24:27,504 --> 07:24:30,800
of oddities all put together
called as a d string
10209
07:24:30,800 --> 07:24:32,100
so this term is nothing
10210
07:24:32,100 --> 07:24:34,000
but it's another representation
10211
07:24:34,000 --> 07:24:36,593
of our GD are like
to group of oddities
10212
07:24:36,593 --> 07:24:38,223
because there is a stream
10213
07:24:38,223 --> 07:24:41,100
and I can apply
the streaming functions
10214
07:24:41,100 --> 07:24:43,921
or any of the functions
Transformations are actions
10215
07:24:43,921 --> 07:24:47,200
available within the streaming
against this D string
10216
07:24:47,300 --> 07:24:49,674
So within that
particular micro badge,
10217
07:24:49,674 --> 07:24:51,600
so I will Define What interval
10218
07:24:51,600 --> 07:24:54,377
the data should be collected
on should be processed
10219
07:24:54,377 --> 07:24:56,100
because there is a micro batch.
10220
07:24:56,100 --> 07:24:59,900
It could be every 1 second
or every hundred milliseconds
10221
07:24:59,900 --> 07:25:01,000
or every five seconds.
10222
07:25:01,300 --> 07:25:02,300
I can Define that page
10223
07:25:02,300 --> 07:25:04,300
particular period so
all the data is used
10224
07:25:04,300 --> 07:25:07,300
in that particular duration
will be considered
10225
07:25:07,300 --> 07:25:08,400
as a piece of data
10226
07:25:08,400 --> 07:25:09,600
and that will be called
10227
07:25:09,600 --> 07:25:13,400
as ADI string s question explain
casing in spark streaming.
10228
07:25:13,400 --> 07:25:14,000
Of course.
10229
07:25:14,000 --> 07:25:15,000
Yes Mark internally.
10230
07:25:15,000 --> 07:25:16,300
It uses in memory Computing.
10231
07:25:16,600 --> 07:25:18,700
So any data when it
is doing the Computing
10232
07:25:18,900 --> 07:25:21,600
that's killing generated
will be there in Mary but find
10233
07:25:21,600 --> 07:25:25,100
that if you do more and more
processing with other jobs
10234
07:25:25,100 --> 07:25:27,190
when there is a need
for more memory,
10235
07:25:27,190 --> 07:25:30,500
the least used on DDS will be
clear enough from the memory
10236
07:25:30,500 --> 07:25:34,100
or the least used data
available out of actions
10237
07:25:34,100 --> 07:25:36,700
from the arditi will be cleared
of from the memory.
10238
07:25:36,700 --> 07:25:40,000
Sometimes I may need
that data forever in memory,
10239
07:25:40,000 --> 07:25:41,800
very simple example,
like dictionary.
10240
07:25:42,100 --> 07:25:43,600
I want the dictionary words
10241
07:25:43,600 --> 07:25:45,658
should be always
available in memory
10242
07:25:45,658 --> 07:25:48,900
because I may do a spell check
against the Tweet comments
10243
07:25:48,900 --> 07:25:51,500
or feedback comments
and our of nines.
10244
07:25:51,500 --> 07:25:54,900
So what I can do I
can say KH those any data
10245
07:25:54,900 --> 07:25:57,036
that comes in we can cash it.
10246
07:25:57,036 --> 07:25:59,100
What possessed it in memory.
10247
07:25:59,100 --> 07:26:02,100
So even when there is a need
for memory by other applications
10248
07:26:02,100 --> 07:26:05,800
this specific data will
not be remote and especially
10249
07:26:05,800 --> 07:26:08,800
that will be used to do
the further processing
10250
07:26:08,800 --> 07:26:11,500
and the casing
also can be defined
10251
07:26:11,500 --> 07:26:15,200
whether it should be in memory
only I in memory and hard disk
10252
07:26:15,200 --> 07:26:17,000
that also we can Define it.
10253
07:26:17,000 --> 07:26:20,100
Let's discuss some questions
on spark graphics.
10254
07:26:20,300 --> 07:26:24,000
The next question is is there
an APA for implementing collapse
10255
07:26:24,000 --> 07:26:26,200
and Spark in graph Theory?
10256
07:26:26,600 --> 07:26:28,100
Everything will be represented
10257
07:26:28,100 --> 07:26:33,200
as a graph is a graph it
will have nodes and edges.
10258
07:26:33,419 --> 07:26:36,880
So all will be represented
using the arteries.
10259
07:26:37,000 --> 07:26:40,300
So it's going to extend
the RTD and there is
10260
07:26:40,300 --> 07:26:42,482
a component called graphics
10261
07:26:42,500 --> 07:26:44,983
and it exposes
the functionalities
10262
07:26:44,983 --> 07:26:49,800
to represent a graph we can have
H RG D buttocks rdd by creating.
10263
07:26:49,800 --> 07:26:51,700
During the edges and vertex.
10264
07:26:51,700 --> 07:26:53,239
I can create a graph
10265
07:26:53,500 --> 07:26:57,400
and this graph can exist
in a distributed environment.
10266
07:26:57,400 --> 07:27:00,208
So same way we will be
in a position to do
10267
07:27:00,208 --> 07:27:02,400
the parallel processing as well.
10268
07:27:02,700 --> 07:27:06,300
So Graphics, it's just
a form of representing
10269
07:27:06,400 --> 07:27:11,200
the data paragraphs with edges
and the traces and of course,
10270
07:27:11,200 --> 07:27:14,299
yes, it provides the APA
to implement out create
10271
07:27:14,299 --> 07:27:17,400
the graph do the processing
on the graph the APA
10272
07:27:17,400 --> 07:27:19,900
so divided what is Page rank?
10273
07:27:20,100 --> 07:27:24,600
Graphics we didn't have sex
once the graph is created.
10274
07:27:24,600 --> 07:27:28,900
We can calculate the page rank
for a particular note.
10275
07:27:29,100 --> 07:27:32,000
So that's very similar to
how we have the page rank
10276
07:27:32,100 --> 07:27:35,635
for the websites within Google
the higher the page rank.
10277
07:27:35,635 --> 07:27:38,774
That means it's more important
within that particular graph.
10278
07:27:38,774 --> 07:27:40,547
It's going to
show the importance
10279
07:27:40,547 --> 07:27:41,900
of that particular node
10280
07:27:41,900 --> 07:27:45,154
or Edge within that particular
graph is a graph is
10281
07:27:45,154 --> 07:27:46,700
a connected set of data.
10282
07:27:46,800 --> 07:27:49,600
All right, I will be connected
using the property
10283
07:27:49,600 --> 07:27:51,100
and How much important
10284
07:27:51,100 --> 07:27:55,300
that property makes we will have
a value Associated to it.
10285
07:27:55,500 --> 07:27:57,900
So within pagerank
we can calculate
10286
07:27:57,900 --> 07:27:59,100
like a static page rank.
10287
07:27:59,300 --> 07:28:00,703
It will run a number
10288
07:28:00,703 --> 07:28:03,300
of iterations or there
is another page
10289
07:28:03,300 --> 07:28:06,600
and code anomic page rank
that will get executed
10290
07:28:06,600 --> 07:28:09,200
till we reach
a particular saturation level
10291
07:28:09,300 --> 07:28:13,600
and the saturation level can be
defined with multiple criterias
10292
07:28:14,100 --> 07:28:15,200
and the APA is
10293
07:28:15,200 --> 07:28:17,500
because there is
a graph operations.
10294
07:28:17,700 --> 07:28:20,289
And be direct executed
against those graph
10295
07:28:20,289 --> 07:28:23,700
and they all are available
as a PA within the graphics.
10296
07:28:24,103 --> 07:28:25,796
What is lineage graph?
10297
07:28:26,000 --> 07:28:28,400
So the audit is very similar
10298
07:28:28,500 --> 07:28:32,800
to the graphics how the
graph representation every rtt.
10299
07:28:32,800 --> 07:28:33,800
Internally.
10300
07:28:33,800 --> 07:28:36,400
It will have the relation saying
10301
07:28:36,500 --> 07:28:39,157
how that particular
rdd got created.
10302
07:28:39,157 --> 07:28:42,725
And from where how
that got transformed argit is
10303
07:28:42,725 --> 07:28:44,700
how their got transformed.
10304
07:28:44,700 --> 07:28:47,600
So the complete lineage
or the complete history
10305
07:28:47,600 --> 07:28:50,587
or the complete path
will be recorded
10306
07:28:50,587 --> 07:28:51,900
within the lineage.
10307
07:28:52,100 --> 07:28:53,517
That will be used in case
10308
07:28:53,517 --> 07:28:56,400
if any particular partition
of the target is lost.
10309
07:28:56,400 --> 07:28:57,900
It can be regenerated.
10310
07:28:58,000 --> 07:28:59,899
Even if the complete
artery is lost.
10311
07:28:59,899 --> 07:29:00,900
We can regenerate
10312
07:29:00,900 --> 07:29:03,149
so it will have the complete
information on what are
10313
07:29:03,149 --> 07:29:06,193
the partitions where it is
existing water Transformations.
10314
07:29:06,193 --> 07:29:07,119
It had undergone.
10315
07:29:07,119 --> 07:29:08,747
What is the resultant and you
10316
07:29:08,747 --> 07:29:10,600
if anything is lost
in the middle,
10317
07:29:10,600 --> 07:29:12,511
it knows where to recalculate
10318
07:29:12,511 --> 07:29:16,400
from and what are essential
things needs to be recalculated.
10319
07:29:16,400 --> 07:29:19,817
It's going to save us a lot
of time and if that Audrey
10320
07:29:19,817 --> 07:29:21,762
is never being used it will now.
10321
07:29:21,762 --> 07:29:23,100
Ever get recalculated.
10322
07:29:23,100 --> 07:29:26,500
So they recalculation also
triggers based on the action
10323
07:29:26,500 --> 07:29:27,799
only on need basis.
10324
07:29:27,799 --> 07:29:29,100
It will recalculate
10325
07:29:29,200 --> 07:29:32,500
that's why it's going
to use the memory optimally
10326
07:29:32,700 --> 07:29:36,087
does Apache spark provide
checkpointing officially
10327
07:29:36,087 --> 07:29:38,300
like the example
like a streaming
10328
07:29:38,600 --> 07:29:43,600
and if any data is lost within
that particular sliding window,
10329
07:29:43,600 --> 07:29:47,492
we cannot get back the data are
like the data will be lost
10330
07:29:47,492 --> 07:29:50,103
because Jim I'm making
a window of say 24
10331
07:29:50,103 --> 07:29:51,800
asks to do some averaging.
10332
07:29:51,800 --> 07:29:55,270
Each I'm making a sliding window
of 24 hours every 24 hours.
10333
07:29:55,270 --> 07:29:59,100
It will keep on getting slider
and if you lose any system
10334
07:29:59,100 --> 07:30:01,500
as in there is a complete
failure of the cluster.
10335
07:30:01,500 --> 07:30:02,562
I may lose the data
10336
07:30:02,562 --> 07:30:04,800
because it's all available
in the memory.
10337
07:30:04,900 --> 07:30:06,400
So how to recalculate
10338
07:30:06,400 --> 07:30:08,902
if the data system is lost
it follows something
10339
07:30:08,902 --> 07:30:10,100
called a checkpointing
10340
07:30:10,100 --> 07:30:12,831
so we can check point
the data and directly.
10341
07:30:12,831 --> 07:30:14,800
It's provided by the spark APA.
10342
07:30:14,800 --> 07:30:16,600
We have to just
provide the location
10343
07:30:16,600 --> 07:30:19,700
where it should get checked
pointed and you can read
10344
07:30:19,700 --> 07:30:23,200
that particular data back
when you Not the system again,
10345
07:30:23,200 --> 07:30:24,866
whatever the state it was
10346
07:30:24,866 --> 07:30:27,600
in be can regenerate
that particular data.
10347
07:30:27,700 --> 07:30:29,454
So yes to answer the question
10348
07:30:29,454 --> 07:30:32,300
straight about this path
points check monitoring
10349
07:30:32,300 --> 07:30:35,300
and it will help us
to regenerate the state
10350
07:30:35,300 --> 07:30:37,010
what it was earlier.
10351
07:30:37,200 --> 07:30:40,000
Let's move on to the next
component spark ml it.
10352
07:30:40,300 --> 07:30:41,515
How is machine learning
10353
07:30:41,515 --> 07:30:44,600
implemented in spark
the machine learning again?
10354
07:30:44,600 --> 07:30:46,800
It's a very huge ocean by itself
10355
07:30:46,900 --> 07:30:49,800
and it's not a technology
specific to spark
10356
07:30:49,800 --> 07:30:51,800
which learning is
a common data science.
10357
07:30:51,800 --> 07:30:55,235
It's a Set of data science work
where we have different type
10358
07:30:55,235 --> 07:30:57,983
of algorithms different
categories of algorithm,
10359
07:30:57,983 --> 07:31:01,100
like clustering regression
dimensionality reduction
10360
07:31:01,100 --> 07:31:02,100
or that we have
10361
07:31:02,300 --> 07:31:05,600
and all these algorithms
are most of the algorithms
10362
07:31:05,600 --> 07:31:08,070
have been implemented
in spark and smart is
10363
07:31:08,070 --> 07:31:09,481
the preferred framework
10364
07:31:09,481 --> 07:31:12,910
or before preferred application
component to do the machine
10365
07:31:12,910 --> 07:31:14,500
learning algorithm nowadays
10366
07:31:14,500 --> 07:31:16,500
or machine learning
processing the reason
10367
07:31:16,500 --> 07:31:19,700
because most of the machine
learning algorithms needs
10368
07:31:19,700 --> 07:31:21,890
to be executed i3t real number.
10369
07:31:21,890 --> 07:31:25,000
Of times till we get
the optimal result maybe
10370
07:31:25,000 --> 07:31:27,700
like say twenty five
iterations are 58 iterations
10371
07:31:27,700 --> 07:31:29,900
or till we get
that specific accuracy.
10372
07:31:29,900 --> 07:31:33,100
You will keep on running
the processing again and again
10373
07:31:33,100 --> 07:31:36,092
and smog is very good fit
whenever you want to do
10374
07:31:36,092 --> 07:31:37,900
the processing again and again
10375
07:31:37,900 --> 07:31:40,400
because the data
will be available in memory.
10376
07:31:40,400 --> 07:31:43,600
I can read it faster store
the data back into the memory
10377
07:31:43,600 --> 07:31:44,700
again reach faster
10378
07:31:44,700 --> 07:31:47,500
and all this machine learning
algorithms have been provided
10379
07:31:47,500 --> 07:31:50,800
within the spark a separate
component called ml lip
10380
07:31:50,900 --> 07:31:53,096
and within mlsp We
have other components
10381
07:31:53,096 --> 07:31:55,800
like feature Association
to extract the features.
10382
07:31:55,800 --> 07:31:58,575
You may be wondering
how they can process
10383
07:31:58,575 --> 07:32:02,600
the images the core thing about
processing a image or audio
10384
07:32:02,600 --> 07:32:04,922
or video is about
extracting the feature
10385
07:32:04,922 --> 07:32:08,363
and comparing the future
how much they are related.
10386
07:32:08,363 --> 07:32:10,300
So that's where
vectors matrices all
10387
07:32:10,300 --> 07:32:13,500
that will come into picture
and we can have pipeline
10388
07:32:13,500 --> 07:32:16,144
of processing as well
to the processing
10389
07:32:16,144 --> 07:32:18,800
one then take the result
and do the processing
10390
07:32:18,800 --> 07:32:21,700
to and it has persistence
algorithm as well.
10391
07:32:21,700 --> 07:32:24,234
The result of it
the generator process
10392
07:32:24,234 --> 07:32:25,999
the result it can be persisted
10393
07:32:25,999 --> 07:32:27,010
and reloaded back
10394
07:32:27,010 --> 07:32:29,421
into the system to
continue the processing
10395
07:32:29,421 --> 07:32:32,245
from that particular Point
onwards next question.
10396
07:32:32,245 --> 07:32:34,605
What are categories
of machine learning machine
10397
07:32:34,605 --> 07:32:38,000
learning assets different
categories available supervised
10398
07:32:38,000 --> 07:32:41,001
or unsupervised and
reinforced learning supervised
10399
07:32:41,001 --> 07:32:42,900
and surprised it's very popular
10400
07:32:43,200 --> 07:32:46,700
where we will know some
I'll give an example.
10401
07:32:47,200 --> 07:32:50,123
I'll know well
in advance what category
10402
07:32:50,123 --> 07:32:54,800
that belongs to Z. Want
to do a character recognition
10403
07:32:55,400 --> 07:32:57,185
while training the data,
10404
07:32:57,185 --> 07:33:01,800
I can give information saying
this particular image belongs
10405
07:33:01,800 --> 07:33:04,160
to this particular
category character
10406
07:33:04,160 --> 07:33:05,800
or this particular number
10407
07:33:05,800 --> 07:33:10,100
and I can train sometimes I
will not know well in advance
10408
07:33:10,100 --> 07:33:14,478
assume like I may have
different type of images
10409
07:33:14,700 --> 07:33:19,200
like it may have
cars bikes cat dog all that.
10410
07:33:19,400 --> 07:33:21,920
I want to know
how many category available.
10411
07:33:21,920 --> 07:33:25,279
No, I will not know well
in advance so I want to group it
10412
07:33:25,279 --> 07:33:26,900
how many category available
10413
07:33:26,900 --> 07:33:29,100
and then I'll
realize saying okay,
10414
07:33:29,100 --> 07:33:31,600
they're all this belongs
to a particular category.
10415
07:33:31,600 --> 07:33:33,800
I'll identify the pattern
within the category
10416
07:33:33,800 --> 07:33:36,333
and I'll give
a category named say
10417
07:33:36,333 --> 07:33:39,751
like all these images
belongs to boot category
10418
07:33:39,751 --> 07:33:41,300
on looks like a boat.
10419
07:33:41,500 --> 07:33:45,400
So leaving it to the system
by providing this value or not.
10420
07:33:45,400 --> 07:33:48,400
Let's say the cat is different
type of machine learning comes
10421
07:33:48,400 --> 07:33:49,503
into picture and
10422
07:33:49,503 --> 07:33:53,160
as such machine learning is
not specific to It's going
10423
07:33:53,160 --> 07:33:57,300
to help us to achieve to run
this machine learning algorithms
10424
07:33:57,400 --> 07:34:00,700
what our spark ml lead
tools MLA business thing
10425
07:34:00,700 --> 07:34:02,300
but machine learning library
10426
07:34:02,300 --> 07:34:03,700
or machine learning offering
10427
07:34:03,700 --> 07:34:07,200
within this Mark and has a
number of algorithms implemented
10428
07:34:07,200 --> 07:34:09,800
and it provides very
good feature to persist
10429
07:34:09,800 --> 07:34:12,306
the result generally
in machine learning.
10430
07:34:12,306 --> 07:34:14,509
We will generate
a model the pattern
10431
07:34:14,509 --> 07:34:17,089
of the data recorder
is a model the model
10432
07:34:17,089 --> 07:34:20,688
will be persisted either in
different forms Like Pat.
10433
07:34:20,688 --> 07:34:23,087
Quit I have
Through different forms,
10434
07:34:23,087 --> 07:34:26,700
it can be stored opposite
district and has methodologies
10435
07:34:26,700 --> 07:34:29,600
to extract the features
from a set of data.
10436
07:34:29,600 --> 07:34:31,353
I may have million images.
10437
07:34:31,353 --> 07:34:32,500
I want to extract
10438
07:34:32,500 --> 07:34:36,300
the common features available
within those millions of images
10439
07:34:36,300 --> 07:34:40,170
and other utilities
available to process to define
10440
07:34:40,170 --> 07:34:43,607
or like to define the seed
the randomizing it so
10441
07:34:43,607 --> 07:34:47,441
different utilities are
available as well as pipelines.
10442
07:34:47,441 --> 07:34:49,500
That's very specific to spark
10443
07:34:49,800 --> 07:34:53,300
where I can Channel
Arrange the sequence
10444
07:34:53,300 --> 07:34:56,700
of steps to be undergone by
the machine learning submission
10445
07:34:56,700 --> 07:34:58,100
learning one algorithm first
10446
07:34:58,100 --> 07:34:59,863
and then the result
of it will be fed
10447
07:34:59,863 --> 07:35:02,163
into a machine learning
algorithm to like that.
10448
07:35:02,163 --> 07:35:03,400
We can have a sequence
10449
07:35:03,400 --> 07:35:06,500
of execution and
that will be defined using
10450
07:35:06,500 --> 07:35:10,562
the pipeline's is Honorable
features of spark Emily.
10451
07:35:11,000 --> 07:35:15,100
What are some popular algorithms
and Utilities in spark Emily.
10452
07:35:15,500 --> 07:35:18,382
So these are some popular
algorithms like regression
10453
07:35:18,382 --> 07:35:22,000
classification basic statistics
recommendation system.
10454
07:35:22,000 --> 07:35:24,678
It's a comedy system is
like well implemented.
10455
07:35:24,678 --> 07:35:27,000
All we have to provide
is give the data.
10456
07:35:27,000 --> 07:35:30,579
If you give the ratings and
products within an organization,
10457
07:35:30,579 --> 07:35:32,400
if you have the complete damp,
10458
07:35:32,400 --> 07:35:35,800
we can build the recommendation
system in no time.
10459
07:35:35,800 --> 07:35:39,283
And if you give any user you
can give a recommendation.
10460
07:35:39,283 --> 07:35:41,600
These are the products
the user may like
10461
07:35:41,600 --> 07:35:42,500
and those products
10462
07:35:42,500 --> 07:35:45,900
can be displayed in the search
result recommendation system
10463
07:35:45,900 --> 07:35:48,017
really works on the basis
of the feedback
10464
07:35:48,017 --> 07:35:50,400
that we are providing
for the earlier products
10465
07:35:50,400 --> 07:35:51,500
that we had bought.
10466
07:35:51,600 --> 07:35:54,225
Bustling dimensionality
reduction whenever
10467
07:35:54,225 --> 07:35:57,300
we do transitioning
with the huge amount of data,
10468
07:35:57,600 --> 07:35:59,511
it's very very compute-intensive
10469
07:35:59,511 --> 07:36:01,900
and we may have
to reduce the dimensions,
10470
07:36:01,900 --> 07:36:03,752
especially the matrix dimensions
10471
07:36:03,752 --> 07:36:07,000
within them early
without losing the features.
10472
07:36:07,000 --> 07:36:09,538
What are the features
available without losing it?
10473
07:36:09,538 --> 07:36:11,308
We should reduce
the dimensionality
10474
07:36:11,308 --> 07:36:13,580
and there are
some algorithms available to do
10475
07:36:13,580 --> 07:36:16,660
that dimensionality reduction
and feature extraction.
10476
07:36:16,660 --> 07:36:19,486
So what are the common features
are features available
10477
07:36:19,486 --> 07:36:22,227
within that particular image
and I can Compare
10478
07:36:22,227 --> 07:36:23,300
what are the common
10479
07:36:23,300 --> 07:36:26,600
across common features
available within those images?
10480
07:36:26,600 --> 07:36:29,106
That's how we
will group those images.
10481
07:36:29,106 --> 07:36:29,716
So get me
10482
07:36:29,716 --> 07:36:32,900
whether this particular image
the person looking
10483
07:36:32,900 --> 07:36:35,300
like this image available
in the database or not.
10484
07:36:35,700 --> 07:36:37,524
For example,
assume the organization
10485
07:36:37,524 --> 07:36:40,600
or the police department crime
Department maintaining a list
10486
07:36:40,600 --> 07:36:44,400
of persons committed crime
and if we get a new photo
10487
07:36:44,400 --> 07:36:48,161
when they do a search they
may not have the exact photo bit
10488
07:36:48,161 --> 07:36:49,200
by bit the photo
10489
07:36:49,200 --> 07:36:51,600
might have been taken
with a different background.
10490
07:36:51,600 --> 07:36:55,000
Front lighting's different
locations different time.
10491
07:36:55,000 --> 07:36:57,754
So a hundred percent the data
will be different on bits
10492
07:36:57,754 --> 07:37:00,520
and bytes will be different
but look nice.
10493
07:37:00,520 --> 07:37:03,767
Yes, they are going to be seeing
so I'm going to search
10494
07:37:03,767 --> 07:37:05,100
the photo looking similar
10495
07:37:05,100 --> 07:37:07,500
to this particular
photograph as the input.
10496
07:37:07,500 --> 07:37:09,033
I'll provide to achieve
10497
07:37:09,033 --> 07:37:11,976
that we will be extracting
the features in each
10498
07:37:11,976 --> 07:37:13,000
of those photos.
10499
07:37:13,000 --> 07:37:15,717
We will extract the features
and we will try to match
10500
07:37:15,717 --> 07:37:17,697
the feature rather than the bits
10501
07:37:17,697 --> 07:37:21,015
and bytes and optimization as
well in terms of processing
10502
07:37:21,015 --> 07:37:22,200
or doing the piping.
10503
07:37:22,200 --> 07:37:25,100
There are a number of algorithms
to do the optimization.
10504
07:37:25,400 --> 07:37:27,000
Let's move on to spark SQL.
10505
07:37:27,100 --> 07:37:29,811
Is there a module
to implement sequence Park?
10506
07:37:29,811 --> 07:37:32,475
How does it work so
directly not the sequel
10507
07:37:32,475 --> 07:37:36,300
may be very similar to high
whatever the structure data
10508
07:37:36,300 --> 07:37:37,300
that we have.
10509
07:37:37,400 --> 07:37:38,800
We can read the data
10510
07:37:38,800 --> 07:37:42,000
or extract the meaning
out of the data using SQL
10511
07:37:42,400 --> 07:37:44,600
and it exposes the APA
10512
07:37:44,700 --> 07:37:48,700
and we can use those API to read
the data or create data frames
10513
07:37:48,834 --> 07:37:51,065
and spunk SQL has four major.
10514
07:37:51,500 --> 07:37:55,800
Degrees data source
data Frame data frame is
10515
07:37:55,800 --> 07:37:58,900
like the representation
of X and Y data
10516
07:37:59,300 --> 07:38:02,800
or like Excel data
multi-dimensional structure data
10517
07:38:03,000 --> 07:38:06,000
and abstract form
on top of dataframe.
10518
07:38:06,000 --> 07:38:08,541
I can do the
query and internally,
10519
07:38:08,541 --> 07:38:11,700
it has interpreter
and Optimizer any query
10520
07:38:11,700 --> 07:38:15,100
I fire that will
get interpreted or optimized
10521
07:38:15,100 --> 07:38:18,500
and get executed using
the SQL services and get
10522
07:38:18,500 --> 07:38:20,300
the data from the data frame
10523
07:38:20,300 --> 07:38:22,900
or it An read the data
from the data source
10524
07:38:22,900 --> 07:38:24,000
and do the processing.
10525
07:38:24,265 --> 07:38:26,034
What is a package file?
10526
07:38:26,100 --> 07:38:27,800
It's a format of the file
10527
07:38:27,800 --> 07:38:30,361
where the data
in some structured form,
10528
07:38:30,361 --> 07:38:33,800
especially the result
of the Spock SQL can be stored
10529
07:38:33,800 --> 07:38:37,350
or returned in some persistence
and the packet again.
10530
07:38:37,350 --> 07:38:41,317
It is a open source from Apache
its data serialization technique
10531
07:38:41,317 --> 07:38:44,833
where we can serialize the data
using the pad could form
10532
07:38:44,833 --> 07:38:46,078
and to precisely say,
10533
07:38:46,078 --> 07:38:47,500
it's a columnar storage.
10534
07:38:47,500 --> 07:38:49,900
It's going to consume
less space it will use
10535
07:38:49,900 --> 07:38:51,200
the keys and values.
10536
07:38:51,300 --> 07:38:55,500
Store the data and also it helps
you to access a specific data
10537
07:38:55,500 --> 07:38:59,100
from that packaged form
using the query so backward.
10538
07:38:59,100 --> 07:39:02,200
It's another open source format
data serialization format
10539
07:39:02,200 --> 07:39:03,267
to store the data
10540
07:39:03,267 --> 07:39:04,900
on purses the data as well
10541
07:39:04,900 --> 07:39:08,700
as to retrieve the data list
the functions of Sparks equal.
10542
07:39:08,700 --> 07:39:10,800
You can be used
to load the varieties
10543
07:39:10,800 --> 07:39:12,300
of structured data, of course,
10544
07:39:12,300 --> 07:39:15,600
yes monks equal can work only
with the structure data.
10545
07:39:15,600 --> 07:39:17,900
It can be used to load varieties
10546
07:39:17,900 --> 07:39:20,900
of structured data
and you can use SQL
10547
07:39:20,900 --> 07:39:23,600
like it's to query
against the program
10548
07:39:23,600 --> 07:39:25,000
and it can be used
10549
07:39:25,000 --> 07:39:27,839
with external tools to connect
to this park as well.
10550
07:39:27,839 --> 07:39:30,400
It gives very good
the integration with the SQL
10551
07:39:30,400 --> 07:39:32,900
and using python
Java Scala code.
10552
07:39:33,000 --> 07:39:35,831
We can create an rdd
from the structure data
10553
07:39:35,831 --> 07:39:38,400
available directly using
this box equal.
10554
07:39:38,400 --> 07:39:40,300
I can generate the TD.
10555
07:39:40,500 --> 07:39:42,600
So it's going to
facilitate the people
10556
07:39:42,600 --> 07:39:46,400
from database background to make
the program faster and quicker.
10557
07:39:47,100 --> 07:39:48,100
Next question is
10558
07:39:48,100 --> 07:39:50,700
what do you understand
by lazy evaluation?
10559
07:39:50,900 --> 07:39:54,400
So whenever you do any operation
within the spark word,
10560
07:39:54,400 --> 07:39:57,281
it will not do the processing
immediately it look
10561
07:39:57,281 --> 07:40:00,100
for the final results
that we are asking for it.
10562
07:40:00,100 --> 07:40:02,000
If it doesn't ask
for the final result.
10563
07:40:02,000 --> 07:40:04,660
It doesn't need to do
the processing So based
10564
07:40:04,660 --> 07:40:07,200
on the final action
until we do the action.
10565
07:40:07,200 --> 07:40:08,990
There will not be
any Transformations.
10566
07:40:08,990 --> 07:40:11,700
I will there will not be
any actual processing happening.
10567
07:40:11,700 --> 07:40:13,141
It will just understand
10568
07:40:13,141 --> 07:40:15,900
what our Transformations
it has to do finally
10569
07:40:15,900 --> 07:40:18,900
if you ask The action
then in optimized way,
10570
07:40:18,900 --> 07:40:22,200
it's going to complete
the data processing and get
10571
07:40:22,200 --> 07:40:23,553
us the final result.
10572
07:40:23,553 --> 07:40:26,600
So to answer straight
lazy evaluation is doing
10573
07:40:26,600 --> 07:40:30,300
the processing one Leon need
of the resultant data.
10574
07:40:30,300 --> 07:40:32,100
The data is not required.
10575
07:40:32,100 --> 07:40:34,757
It's not going
to do the processing.
10576
07:40:34,757 --> 07:40:36,726
Can you use Funk to access
10577
07:40:36,726 --> 07:40:40,200
and analyze data stored
in Cassandra data piece?
10578
07:40:40,200 --> 07:40:41,600
Yes, it is possible.
10579
07:40:41,600 --> 07:40:44,400
Okay, not only Cassandra
any of the nosql database it
10580
07:40:44,400 --> 07:40:46,100
can very well do the processing
10581
07:40:46,100 --> 07:40:49,700
and Sandra also works
in a distributed architecture.
10582
07:40:49,700 --> 07:40:51,200
It's a NoSQL database
10583
07:40:51,200 --> 07:40:53,800
so it can leverage
the data locality.
10584
07:40:53,800 --> 07:40:56,000
The query can
be executed locally
10585
07:40:56,000 --> 07:40:58,200
where the Cassandra
nodes are available.
10586
07:40:58,200 --> 07:41:01,100
It's going to make
the query execution faster
10587
07:41:01,100 --> 07:41:04,326
and reduce the network load,
and the Spark executors
10588
07:41:04,326 --> 07:41:06,009
will try to get started
10589
07:41:06,009 --> 07:41:08,242
on the machines
10590
07:41:08,242 --> 07:41:10,600
where the Cassandra
nodes are available
10591
07:41:10,600 --> 07:41:13,900
or the data is available, and do
the processing locally.
10592
07:41:13,900 --> 07:41:16,450
So it's going to leverage
the data locality.
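A hedged Scala sketch of such a read, assuming the DataStax spark-cassandra-connector is on the classpath and using a hypothetical keyspace and table:

```scala
import org.apache.spark.sql.SparkSession

object CassandraReadSketch {
  def main(args: Array[String]): Unit = {
    // Assumes a reachable Cassandra node and the spark-cassandra-connector jar.
    val spark = SparkSession.builder()
      .appName("CassandraReadSketch")
      .master("local[*]")
      .config("spark.cassandra.connection.host", "127.0.0.1")
      .getOrCreate()

    // Hypothetical keyspace "shop" and table "orders".
    val orders = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "shop", "table" -> "orders"))
      .load()

    // With executors co-located on Cassandra nodes, filters like this
    // can be served locally, exploiting data locality.
    orders.filter("amount > 100").show()

    spark.stop()
  }
}
```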
10593
07:41:16,450 --> 07:41:17,426
The next question:
10594
07:41:17,426 --> 07:41:19,500
how can you
minimize data transfers
10595
07:41:19,500 --> 07:41:21,200
when working with Spark?
10596
07:41:21,200 --> 07:41:23,636
If you look at the core
design, the success
10597
07:41:23,636 --> 07:41:25,514
of the Spark program depends on
10598
07:41:25,514 --> 07:41:28,300
how much you are reducing
the network transfer.
10599
07:41:28,300 --> 07:41:30,900
This network transfer
is a very costly operation,
10600
07:41:30,900 --> 07:41:32,300
and you cannot parallelize it.
10601
07:41:32,400 --> 07:41:35,600
There are multiple ways,
especially two ways, to avoid
10602
07:41:35,600 --> 07:41:37,664
this: one is called
broadcast variables,
10603
07:41:37,664 --> 07:41:40,300
and the other accumulators.
Broadcast variables:
10604
07:41:40,300 --> 07:41:43,536
It will help us
to transfer any static data
10605
07:41:43,536 --> 07:41:46,428
or any information
we keep on publishing
10606
07:41:46,500 --> 07:41:48,300
to multiple systems.
10607
07:41:48,300 --> 07:41:49,300
So if I see
10608
07:41:49,300 --> 07:41:52,257
any data has to be transferred
to multiple executors
10609
07:41:52,257 --> 07:41:53,500
to be used in common,
10610
07:41:53,500 --> 07:41:55,016
I can broadcast it
10611
07:41:55,200 --> 07:41:58,800
and I might want to consolidate
the values happening
10612
07:41:58,800 --> 07:42:02,172
in multiple workers in
a single centralized location.
10613
07:42:02,172 --> 07:42:03,600
I can use an accumulator.
10614
07:42:03,600 --> 07:42:06,412
So these will help us to achieve
the data consolidation
10615
07:42:06,412 --> 07:42:08,800
and data distribution
in the distributed world.
10616
07:42:08,800 --> 07:42:11,800
The APIs are at an abstract level
10617
07:42:11,800 --> 07:42:14,351
where we don't need
to do the heavy lifting
10618
07:42:14,351 --> 07:42:16,600
that's taken care of
by Spark for us.
10619
07:42:16,800 --> 07:42:19,275
What are broadcast
variables? Just now,
10620
07:42:19,275 --> 07:42:22,300
as we discussed, the common value
10621
07:42:22,300 --> 07:42:23,200
that we need,
10622
07:42:23,200 --> 07:42:27,300
I may want that to be available
in multiple executors
10623
07:42:27,300 --> 07:42:31,000
multiple workers. A simple example:
you want to do a spell check
10624
07:42:31,000 --> 07:42:33,500
on the tweet
comments; the dictionary
10625
07:42:33,500 --> 07:42:36,100
which has the right
list of words.
10626
07:42:36,200 --> 07:42:37,800
I'll have the complete list.
10627
07:42:37,800 --> 07:42:40,300
I want that particular
dictionary to be available
10628
07:42:40,300 --> 07:42:41,400
in each executor
10629
07:42:41,400 --> 07:42:43,944
so that the tasks
that are running locally
10630
07:42:43,944 --> 07:42:46,600
in those executors can refer
to that particular
10631
07:42:46,600 --> 07:42:49,900
dictionary and get the processing
done by avoiding
10632
07:42:49,900 --> 07:42:51,616
the network data transfer.
10633
07:42:51,616 --> 07:42:55,485
So the process of distributing
the data from the Spark context
10634
07:42:55,485 --> 07:42:56,500
to the executors
10635
07:42:56,500 --> 07:42:58,700
where the task is going
to run is achieved
10636
07:42:58,700 --> 07:43:00,400
using broadcast variables
10637
07:43:00,400 --> 07:43:03,952
and it's built into the
Spark API; using this API
10638
07:43:03,952 --> 07:43:06,000
we can create
the broadcast variable,
10639
07:43:06,200 --> 07:43:09,500
and the process of making
this data available
10640
07:43:09,500 --> 07:43:13,524
in all executors is taken care of
by the Spark framework.
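A minimal Scala sketch of the spell-check idea, with a hypothetical dictionary; one read-only copy is shipped to each executor instead of travelling with every task:

```scala
import org.apache.spark.sql.SparkSession

object BroadcastSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("BroadcastSketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical dictionary of correctly spelled words.
    val dictionary = Set("spark", "streaming", "cluster", "data", "is", "fast")

    // Ship one read-only copy per executor.
    val dictBc = sc.broadcast(dictionary)

    val tweets = sc.parallelize(Seq("spark is fast", "strming data"))
    val misspelled = tweets
      .flatMap(_.split("\\s+"))
      .filter(word => !dictBc.value.contains(word)) // read-only access

    misspelled.collect().foreach(println) // prints: strming

    spark.stop()
  }
}
```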
10641
07:43:13,524 --> 07:43:15,000
Explain accumulators in Spark.
10642
07:43:15,100 --> 07:43:18,500
Similar to how we
have broadcast variables,
10643
07:43:18,500 --> 07:43:21,290
we have accumulators
as well. A simple example:
10644
07:43:21,290 --> 07:43:25,100
you want to count how many
error records are available
10645
07:43:25,100 --> 07:43:26,600
in the distributed environment
10646
07:43:26,800 --> 07:43:28,400
as your data is distributed
10647
07:43:28,400 --> 07:43:31,300
across multiple systems,
multiple executors.
10648
07:43:31,400 --> 07:43:34,784
Each executor will do
the processing and count
10649
07:43:34,784 --> 07:43:37,200
the records locally.
10650
07:43:37,200 --> 07:43:38,978
I may want the total count.
10651
07:43:38,978 --> 07:43:42,600
So what I will do is ask Spark
to maintain an accumulator;
10652
07:43:42,600 --> 07:43:45,250
of course, it will be maintained
in the Spark context,
10653
07:43:45,250 --> 07:43:48,500
in the driver program;
the driver program is going
10654
07:43:48,500 --> 07:43:50,100
to be one per application.
10655
07:43:50,100 --> 07:43:52,200
It will keep on
getting accumulated
10656
07:43:52,200 --> 07:43:54,900
and whenever I want I
can read those values
10657
07:43:54,900 --> 07:43:57,100
and take any appropriate action.
10658
07:43:57,200 --> 07:44:00,300
So it's like, more or less,
accumulators and broadcast variables
10659
07:44:00,300 --> 07:44:01,600
look opposite to each other,
10660
07:44:02,000 --> 07:44:03,800
but the purpose
is totally different.
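A minimal Scala sketch of that error-counting idea, with hypothetical log lines; executors only add to the accumulator, and the driver reads the consolidated value:

```scala
import org.apache.spark.sql.SparkSession

object AccumulatorSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("AccumulatorSketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Driver-side counter; tasks on executors can only add to it.
    val errorCount = sc.longAccumulator("errorRecords")

    val logs = sc.parallelize(Seq("INFO ok", "ERROR disk", "ERROR net", "INFO ok"))
    logs.foreach { line =>
      if (line.startsWith("ERROR")) errorCount.add(1)
    }

    // Reading the consolidated value happens in the driver program.
    println(s"Errors across all executors: ${errorCount.value}") // 2

    spark.stop()
  }
}
```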
10661
07:44:04,200 --> 07:44:06,531
Why is there a need
for broadcast variables
10662
07:44:06,531 --> 07:44:10,400
when working with Apache Spark?
It's a read-only variable,
10663
07:44:10,400 --> 07:44:13,800
and it will be cached in memory
in a distributed fashion
10664
07:44:13,800 --> 07:44:15,789
and it eliminates the work
10665
07:44:15,789 --> 07:44:19,012
of moving the data
from a centralized location
10666
07:44:19,012 --> 07:44:20,400
that is, the Spark driver,
10667
07:44:20,400 --> 07:44:24,200
or from a particular program
to all the executors
10668
07:44:24,200 --> 07:44:26,830
within the cluster where
the task is going to get executed.
10669
07:44:26,830 --> 07:44:29,700
We don't need to worry about
where the task will get executed
10670
07:44:29,700 --> 07:44:31,100
within the cluster.
10671
07:44:31,100 --> 07:44:32,138
So when compared
10672
07:44:32,138 --> 07:44:34,900
with the accumulators,
broadcast variables
10673
07:44:34,900 --> 07:44:37,256
are going to have
a read-only operation.
10674
07:44:37,256 --> 07:44:38,903
The executors cannot change
10675
07:44:38,903 --> 07:44:41,100
the value; they can only
read those values,
10676
07:44:41,100 --> 07:44:44,900
they cannot update, so mostly
it will be used like a cache
10677
07:44:44,900 --> 07:44:47,400
for read-only data.
Next question:
10678
07:44:47,400 --> 07:44:50,327
how can you trigger
automatic cleanups in Spark
10679
07:44:50,327 --> 07:44:52,300
to handle accumulated metadata?
10680
07:44:52,700 --> 07:44:54,500
So there is a parameter
10681
07:44:54,500 --> 07:44:57,900
that we can set, the TTL; it
will get triggered along
10682
07:44:57,900 --> 07:45:00,900
with the running jobs
and intermediately
10683
07:45:00,900 --> 07:45:04,000
it's going to write the result
data into the disk,
10684
07:45:04,000 --> 07:45:07,155
clean unnecessary data,
or clean the RDDs
10685
07:45:07,155 --> 07:45:08,600
that are not being used.
10686
07:45:08,600 --> 07:45:09,800
The least-used RDDs
10687
07:45:09,800 --> 07:45:10,987
will get cleaned,
10688
07:45:10,987 --> 07:45:14,800
and it will keep the metadata as
well as the memory clean.
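As a sketch only: in older Spark releases this was the spark.cleaner.ttl property, set in seconds on the SparkConf (newer releases clean up stale state automatically through the context cleaner):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CleanerTtlSketch {
  def main(args: Array[String]): Unit = {
    // Legacy setting: periodically forget metadata and unused, least-recently
    // used RDDs older than the TTL, keeping long-running jobs lean.
    val conf = new SparkConf()
      .setAppName("CleanerTtlSketch")
      .setMaster("local[*]")
      .set("spark.cleaner.ttl", "3600") // clean state older than one hour

    val sc = new SparkContext(conf)
    // ... a long-running (e.g. streaming) job would go here ...
    sc.stop()
  }
}
```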
10689
07:45:14,800 --> 07:45:17,800
What are the various levels
of persistence in Apache Spark?
10690
07:45:17,800 --> 07:45:20,200
When you say data
should be stored,
10691
07:45:20,200 --> 07:45:23,000
it can be in different ways
that you can persist it:
10692
07:45:23,000 --> 07:45:27,100
it can be in memory only,
or memory and disk, or disk only,
10693
07:45:27,200 --> 07:45:30,500
and when it is getting stored
we can ask it to store it
10694
07:45:30,500 --> 07:45:31,800
in a serialized form.
10695
07:45:31,900 --> 07:45:35,300
So the reason why we may store
or persist it:
10696
07:45:35,303 --> 07:45:36,996
I want this particular RDD
10697
07:45:37,100 --> 07:45:40,200
in a persisted form
to read it back
10698
07:45:40,200 --> 07:45:42,038
for reuse later; so I can read it
10699
07:45:42,038 --> 07:45:45,200
back, but maybe I may not need
it very immediately.
10700
07:45:45,400 --> 07:45:48,477
So I don't want that to keep
occupying my memory.
10701
07:45:48,477 --> 07:45:50,400
I'll write it to the hard disk
10702
07:45:50,400 --> 07:45:52,700
and I'll read it back
whenever there is a need.
10703
07:45:52,700 --> 07:45:55,300
I'll read it back.
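A minimal Scala sketch of choosing among these persistence levels, with a hypothetical RDD; the commented lines show the alternatives just described:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistenceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PersistenceSketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val data = sc.parallelize(1 to 100000).map(_ * 2)

    // Pick the level to match how soon and how often the RDD is reused:
    data.persist(StorageLevel.MEMORY_ONLY)        // fastest, may recompute
    // data.persist(StorageLevel.MEMORY_AND_DISK) // spill to disk if needed
    // data.persist(StorageLevel.DISK_ONLY)       // keep memory free
    // data.persist(StorageLevel.MEMORY_ONLY_SER) // serialized, more compact

    println(data.count()) // first action materializes and caches the RDD
    println(data.sum())   // reuses the persisted copy

    data.unpersist()
    spark.stop()
  }
}
```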
10704
07:45:55,300 --> 07:45:58,069
The next question: what do you
understand by schema RDD?
10705
07:45:58,200 --> 07:46:01,900
So a schema RDD will be used
mostly within Spark SQL.
10706
07:46:01,900 --> 07:46:05,300
So the RDD will have the meta
information built into it;
10707
07:46:05,300 --> 07:46:07,919
it will have the schema
as well, very similar to
10708
07:46:07,919 --> 07:46:10,642
what we have as a database
schema: the structure
10709
07:46:10,642 --> 07:46:11,976
of the particular data
10710
07:46:11,976 --> 07:46:14,994
and when I have a structure, it
will be easy for me
10711
07:46:14,994 --> 07:46:16,081
to handle the data,
10712
07:46:16,081 --> 07:46:19,100
so the data and the structure
will exist together.
10713
07:46:19,100 --> 07:46:20,360
And the schema RDD
10714
07:46:20,360 --> 07:46:20,550
now
10715
07:46:20,550 --> 07:46:22,100
is called a DataFrame
10716
07:46:22,100 --> 07:46:25,009
in Spark, and the DataFrame
term is very popular
10717
07:46:25,009 --> 07:46:27,616
in languages like R;
in other languages also
10718
07:46:27,616 --> 07:46:28,700
it's very popular.
10719
07:46:28,700 --> 07:46:31,700
So it's going to have the data
and the meta information
10720
07:46:31,700 --> 07:46:34,700
about that data, saying
what columns and structure it has.
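A tiny Scala sketch showing the data and its schema travelling together in a DataFrame, with hypothetical rows:

```scala
import org.apache.spark.sql.SparkSession

object SchemaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SchemaSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Data plus its structure (column names and types) in one object.
    val people = Seq(("Ann", 34), ("Raj", 28)).toDF("name", "age")

    people.printSchema() // name: string, age: int
    people.show()

    spark.stop()
  }
}
```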
10721
07:46:34,700 --> 07:46:36,300
Explain a scenario
10722
07:46:36,300 --> 07:46:38,656
where you will be
using Spark Streaming.
10723
07:46:38,656 --> 07:46:41,200
You may want to do
a sentiment analysis
10724
07:46:41,200 --> 07:46:44,200
of tweets, so the tweets
will be streamed;
10725
07:46:44,400 --> 07:46:49,200
we will use Flume or a similar tool
to harvest the information
10726
07:46:49,300 --> 07:46:52,700
from Twitter and feed it
into spark streaming.
10727
07:46:52,700 --> 07:46:56,300
It will extract or identify
the sentiment of each
10728
07:46:56,300 --> 07:46:58,300
and every tweet and mark
10729
07:46:58,300 --> 07:47:00,899
whether it is positive
or negative and accordingly
10730
07:47:00,899 --> 07:47:02,900
the data will be
structured data
10731
07:47:02,900 --> 07:47:03,700
that says
10732
07:47:03,700 --> 07:47:05,742
whether it is positive
or negative, maybe the
10733
07:47:05,742 --> 07:47:06,856
percentage of positive
10734
07:47:06,856 --> 07:47:09,088
and percentage of negative
sentiment, and store it
10735
07:47:09,088 --> 07:47:10,500
in some structured form.
10736
07:47:10,500 --> 07:47:14,111
Then you can leverage Spark
SQL and do grouping
10737
07:47:14,111 --> 07:47:16,403
or filtering based
on the sentiment
10738
07:47:16,403 --> 07:47:19,587
and maybe I can use
a machine learning algorithm to find
10739
07:47:19,587 --> 07:47:22,107
what drives that
particular tweet to be
10740
07:47:22,107 --> 07:47:23,500
on the negative side.
10741
07:47:23,500 --> 07:47:26,700
Is there any similarity between
all these negative-sentiment
10742
07:47:26,700 --> 07:47:28,812
tweets? Maybe they are specific
10743
07:47:28,812 --> 07:47:32,700
to a product, a specific time
when the tweet was tweeted,
10744
07:47:32,700 --> 07:47:34,421
or from a specific region
10745
07:47:34,421 --> 07:47:36,900
that it was
tweeted from. Those analyses
10746
07:47:36,900 --> 07:47:40,194
could be done by leveraging
the MLlib of Spark.
10747
07:47:40,194 --> 07:47:43,700
So MLlib, Streaming, and Core are
all going to work together;
10748
07:47:43,700 --> 07:47:45,200
all these are like different
10749
07:47:45,200 --> 07:47:48,500
offerings available to
solve different problems.
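A hedged Scala sketch of that pipeline, using a socket source as a stand-in for a Flume/Twitter feed and a toy word list in place of a real MLlib sentiment model:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SentimentSketch {
  def main(args: Array[String]): Unit = {
    // local[2]: one core for the receiver, one for processing.
    val conf = new SparkConf().setAppName("SentimentSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Stand-in source; a real job would ingest tweets via Flume or Kafka.
    val tweets = ssc.socketTextStream("localhost", 9999)

    // Toy rule-based sentiment; a real pipeline would apply an MLlib model.
    val negativeWords = Set("bad", "slow", "broken")
    val labelled = tweets.map { text =>
      val negative = text.split("\\s+").exists(w => negativeWords(w.toLowerCase))
      (text, if (negative) "negative" else "positive")
    }

    // Structured (text, label) pairs per batch; these could be stored
    // and then grouped or filtered with Spark SQL.
    labelled.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```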
10750
07:47:48,600 --> 07:47:51,100
So with this we are coming
to the end of this interview
10751
07:47:51,100 --> 07:47:53,100
questions discussion on Spark.
10752
07:47:53,100 --> 07:47:54,465
I hope you all enjoyed.
10753
07:47:54,465 --> 07:47:56,913
I hope it was a constructive
and useful one.
10754
07:47:56,913 --> 07:47:59,600
More information
about Edureka is available
10755
07:47:59,600 --> 07:48:02,183
on the Edureka website,
10756
07:48:02,183 --> 07:48:05,900
and keep visiting the website
for blogs and the latest updates.
10757
07:48:05,900 --> 07:48:07,000
Thank you folks.
10758
07:48:07,500 --> 07:48:10,400
I hope you have enjoyed
listening to this video.
10759
07:48:10,400 --> 07:48:12,450
Please be kind enough to like it
10760
07:48:12,450 --> 07:48:15,600
and you can comment any
of your doubts and queries
10761
07:48:15,600 --> 07:48:17,078
and we will reply to them
10762
07:48:17,078 --> 07:48:20,923
at the earliest. Do look out
for more videos in our playlist
10763
07:48:20,923 --> 07:48:24,105
and subscribe to the Edureka
channel to learn more.
10764
07:48:24,105 --> 07:48:25,100
Happy learning.