0
00:00:06,800 --> 00:00:10,102
For the past five years Spark
has been on an absolute tear
1
00:00:10,102 --> 00:00:13,700
becoming one of the most widely
used Technologies in big data
2
00:00:13,700 --> 00:00:17,226
and AI. Today's cutting-edge
companies like Facebook, Apple,
3
00:00:17,226 --> 00:00:18,300
Netflix, Uber,
4
00:00:18,300 --> 00:00:19,965
and many more have deployed
5
00:00:19,965 --> 00:00:23,366
spark at massive scale
processing petabytes of data
6
00:00:23,366 --> 00:00:25,192
to deliver Innovations ranging
7
00:00:25,192 --> 00:00:27,212
from detecting
fraudulent Behavior
8
00:00:27,212 --> 00:00:30,103
to delivering personalized
experiences in real
9
00:00:30,103 --> 00:00:32,741
time, and many such
innovations that are
10
00:00:32,741 --> 00:00:34,500
transforming every industry.
11
00:00:34,800 --> 00:00:37,300
Hi all, I welcome you
all to this full course session
12
00:00:37,300 --> 00:00:40,408
on Apache spark a complete
crash course consisting
13
00:00:40,408 --> 00:00:43,200
of everything you need
to know to get started
14
00:00:43,200 --> 00:00:45,500
with Apache Spark from scratch.
15
00:00:45,700 --> 00:00:47,410
But before we get into details,
16
00:00:47,410 --> 00:00:51,000
let's look at our agenda for
today for better understanding
17
00:00:51,000 --> 00:00:52,300
and ease of learning.
18
00:00:52,300 --> 00:00:55,400
The entire crash course
is divided into 12 modules
19
00:00:55,400 --> 00:00:59,200
In the first module, Introduction
to Spark, we'll try to understand
20
00:00:59,200 --> 00:01:03,100
what exactly Spark is and how it
performs real-time processing.
21
00:01:03,200 --> 00:01:06,741
In the second module, we'll dive deep
into the different components
22
00:01:06,741 --> 00:01:10,600
that constitute Spark. We'll also
learn about the Spark architecture
23
00:01:10,600 --> 00:01:13,800
and its ecosystem next up
in the third module.
24
00:01:13,800 --> 00:01:15,594
We will learn what exactly
25
00:01:15,594 --> 00:01:18,700
resilient distributed data
sets are in Spark.
26
00:01:19,100 --> 00:01:22,427
Fourth module is all about
data frames in this module.
27
00:01:22,427 --> 00:01:25,000
We will learn what
exactly data frames are
28
00:01:25,000 --> 00:01:28,300
and how to perform different
operations in data frames
29
00:01:28,400 --> 00:01:29,940
moving on in the fifth.
30
00:01:29,940 --> 00:01:32,446
Module we will
discuss different ways
31
00:01:32,446 --> 00:01:35,300
that spark provides
to perform SQL queries
32
00:01:35,300 --> 00:01:39,000
for accessing and processing
data in the six module.
33
00:01:39,000 --> 00:01:39,847
We will learn
34
00:01:39,847 --> 00:01:43,500
how to perform streaming
on live data streams using spark
35
00:01:43,500 --> 00:01:46,029
Streaming, and in the seventh
module we'll discuss
36
00:01:46,029 --> 00:01:49,200
how to execute different machine
learning algorithms using
37
00:01:49,200 --> 00:01:52,469
Spark's machine learning library.
The eighth module is all
38
00:01:52,469 --> 00:01:54,917
about Spark GraphX
in this module.
39
00:01:54,917 --> 00:01:57,800
We are going to learn what
graph processing is and
40
00:01:57,800 --> 00:02:01,700
how to perform graph processing
using Spark's GraphX library
41
00:02:01,700 --> 00:02:05,500
in the ninth module will discuss
the key differences between
42
00:02:05,500 --> 00:02:08,800
two popular data processing
paradigms, MapReduce
43
00:02:08,800 --> 00:02:12,500
and Spark. Talking about
the tenth module, we will integrate
44
00:02:12,500 --> 00:02:14,400
two popular technologies, Spark
45
00:02:14,400 --> 00:02:19,400
and Kafka. The 11th module is
all about PySpark. In this module,
46
00:02:19,400 --> 00:02:21,000
we'll try to understand
47
00:02:21,000 --> 00:02:24,281
how PySpark exposes the
Spark programming model
48
00:02:24,281 --> 00:02:26,800
to python lastly
in the 12 module.
49
00:02:26,800 --> 00:02:30,100
We'll take a look at the most
frequently asked interview
50
00:02:30,100 --> 00:02:31,200
questions on Spark,
51
00:02:31,200 --> 00:02:33,200
which will help you
Ace your interview
52
00:02:33,200 --> 00:02:34,200
with flying colors.
53
00:02:34,200 --> 00:02:35,900
Thank you guys
while you are at it,
54
00:02:35,900 --> 00:02:37,600
please do not
forget to subscribe
55
00:02:37,600 --> 00:02:39,173
to the Edureka YouTube channel
56
00:02:39,173 --> 00:02:42,200
to stay updated with
trending technologies.
57
00:02:47,200 --> 00:02:48,400
There has been a buzz
58
00:02:48,400 --> 00:02:51,576
around the world that Spark is
the future of the big data platform,
59
00:02:51,576 --> 00:02:53,400
which is hundred times faster
60
00:02:53,400 --> 00:02:57,250
than mapreduce and is also
a go-to tool for all solutions.
61
00:02:57,250 --> 00:03:00,019
But what exactly is
Apache Spark and what makes
62
00:03:00,019 --> 00:03:01,100
it so popular?
63
00:03:01,100 --> 00:03:03,700
And in the session I will give
you a complete Insight
64
00:03:03,700 --> 00:03:04,600
of Apache spark
65
00:03:04,600 --> 00:03:07,500
and its fundamentals.
Without any further ado,
66
00:03:07,500 --> 00:03:08,200
let's quickly
67
00:03:08,200 --> 00:03:09,898
look at the topics to be covered
68
00:03:09,898 --> 00:03:12,198
in this session
first and foremost.
69
00:03:12,198 --> 00:03:13,000
I will tell you
70
00:03:13,000 --> 00:03:15,724
what is Apache spark
and its features next.
71
00:03:15,724 --> 00:03:17,773
I will take you
to the components
72
00:03:17,773 --> 00:03:18,948
of spark ecosystem
73
00:03:18,948 --> 00:03:21,932
that make Spark the future
of the big data platform.
74
00:03:21,932 --> 00:03:22,600
After that.
75
00:03:22,600 --> 00:03:23,300
I will talk
76
00:03:23,300 --> 00:03:26,100
about the fundamental
data structure of spark
77
00:03:26,100 --> 00:03:28,400
that is rdd I will also tell you
78
00:03:28,400 --> 00:03:32,400
about its features, its operations,
the ways to create RDDs, etc.,
79
00:03:32,400 --> 00:03:35,500
and at the last, I'll wrap
up the session by giving
80
00:03:35,500 --> 00:03:37,351
a real-time use case of spark.
81
00:03:37,351 --> 00:03:38,505
So let's get started
82
00:03:38,505 --> 00:03:40,800
with the very first
topic and understand
83
00:03:40,800 --> 00:03:43,400
what Spark is. Spark
is an open-source,
84
00:03:43,400 --> 00:03:45,100
scalable, massively parallel,
85
00:03:45,100 --> 00:03:47,700
in memory execution
environment for running
86
00:03:47,700 --> 00:03:49,300
analytics applications.
87
00:03:49,300 --> 00:03:52,085
You can just think
of it as an in-memory layer
88
00:03:52,085 --> 00:03:54,507
that sits above
multiple data stores,
89
00:03:54,507 --> 00:03:56,929
where data can be loaded
into the memory
90
00:03:56,929 --> 00:03:59,600
and analyzed in parallel
across the cluster.
91
00:03:59,800 --> 00:04:03,189
Coming to big data processing, much
like MapReduce, Spark works
92
00:04:03,189 --> 00:04:05,700
to distribute the data
across the cluster
93
00:04:05,700 --> 00:04:08,118
and then process
that data in parallel.
94
00:04:08,118 --> 00:04:10,833
The difference here is
that unlike mapreduce
95
00:04:10,833 --> 00:04:14,867
which shuffles the files around
the disk, Spark works in memory,
96
00:04:14,867 --> 00:04:17,600
and that makes it much
faster at processing
97
00:04:17,600 --> 00:04:19,300
the data than mapreduce.
98
00:04:19,300 --> 00:04:20,663
It is also said to be
99
00:04:20,663 --> 00:04:24,235
the Lightning Fast unified
analytics engine for big data
100
00:04:24,235 --> 00:04:25,600
and machine learning.
101
00:04:25,600 --> 00:04:28,680
So now let's look
at the interesting features
102
00:04:28,680 --> 00:04:29,800
of Apache Spark.
103
00:04:29,800 --> 00:04:32,181
Coming to speed, you
can call Spark
104
00:04:32,181 --> 00:04:34,100
a swift processing framework.
105
00:04:34,100 --> 00:04:37,500
Why because it is
hundred times faster in memory
106
00:04:37,500 --> 00:04:40,900
and 10 times faster on the disk
when comparing it with Hadoop.
107
00:04:40,900 --> 00:04:41,700
Not only
108
00:04:41,700 --> 00:04:45,100
that it also provides
High data processing speed
109
00:04:45,200 --> 00:04:46,900
Next, powerful caching.
110
00:04:46,900 --> 00:04:48,809
It has a simple
programming layer
111
00:04:48,809 --> 00:04:50,600
that provides powerful caching
112
00:04:50,600 --> 00:04:53,341
and disk persistence
capabilities and Spark
113
00:04:53,341 --> 00:04:55,300
can be deployed through Mesos,
114
00:04:55,300 --> 00:04:58,600
Hadoop YARN, or Spark's
own cluster manager.
115
00:04:58,700 --> 00:04:59,700
As you all know,
116
00:04:59,700 --> 00:05:01,370
Spark itself was designed
117
00:05:01,370 --> 00:05:03,900
and developed for
real-time data processing.
118
00:05:03,900 --> 00:05:05,239
So it's obvious fact
119
00:05:05,239 --> 00:05:07,584
that it offers
real-time computation
120
00:05:07,584 --> 00:05:10,800
and low latency because of
in-memory computation.
121
00:05:10,900 --> 00:05:14,700
Next, polyglot: Spark
provides high-level APIs
122
00:05:14,700 --> 00:05:16,700
in Java Scala Python
123
00:05:16,700 --> 00:05:19,536
and R. Spark code
can be written in any
124
00:05:19,536 --> 00:05:21,281
of these four languages.
125
00:05:21,281 --> 00:05:25,500
Not only that it also provides
a shell in Scala and python.
126
00:05:25,692 --> 00:05:29,000
These are the various
features of spark now,
127
00:05:29,000 --> 00:05:32,700
let's see the various
components of the Spark ecosystem.
128
00:05:32,700 --> 00:05:36,100
Let me first tell you
about the Spark Core component.
129
00:05:36,100 --> 00:05:39,385
It is the most vital component
of the Spark ecosystem,
130
00:05:39,385 --> 00:05:40,700
which is responsible
131
00:05:40,700 --> 00:05:44,400
for basic I/O functions
scheduling monitoring Etc.
132
00:05:44,400 --> 00:05:47,800
The entire Apache spark
ecosystem is built on the top
133
00:05:47,800 --> 00:05:49,670
of this core execution engine
134
00:05:49,670 --> 00:05:52,700
which has extensible apis
in different languages
135
00:05:52,700 --> 00:05:55,100
like Scala, Python, R, and Java.
136
00:05:55,100 --> 00:05:57,442
As I have already
mentioned, Spark
137
00:05:57,442 --> 00:05:59,200
can be deployed through Mesos,
138
00:05:59,200 --> 00:06:02,800
Hadoop YARN, or Spark's
own cluster manager.
139
00:06:02,800 --> 00:06:05,433
The Spark ecosystem
library is composed
140
00:06:05,433 --> 00:06:06,888
of various components
141
00:06:06,888 --> 00:06:10,700
like spark SQL spark streaming
machine learning library.
142
00:06:10,700 --> 00:06:13,200
Now, let me explain
you each of them.
143
00:06:13,200 --> 00:06:16,573
The spark SQL component
is used to Leverage The Power
144
00:06:16,573 --> 00:06:18,000
of declarative queries
145
00:06:18,000 --> 00:06:21,034
and optimize storage
by executing SQL queries
146
00:06:21,034 --> 00:06:22,000
on spark data,
147
00:06:22,000 --> 00:06:23,778
which is present in the rdds
148
00:06:23,778 --> 00:06:27,100
and other external sources.
Next, the Spark Streaming
149
00:06:27,100 --> 00:06:29,617
component allows developers
to perform batch
150
00:06:29,617 --> 00:06:31,395
processing and streaming of data
151
00:06:31,395 --> 00:06:35,042
in the same application. Coming
to the machine learning library,
152
00:06:35,042 --> 00:06:36,313
it eases the deployment
153
00:06:36,313 --> 00:06:39,300
and development of scalable
machine learning pipelines,
154
00:06:39,300 --> 00:06:43,000
like summary statistics
correlations feature extraction
155
00:06:43,000 --> 00:06:46,200
transformation functions
optimization algorithms Etc
156
00:06:46,200 --> 00:06:49,365
and graph x component lets
the data scientist to work
157
00:06:49,365 --> 00:06:52,584
with graph and non-graph sources
to achieve flexibility
158
00:06:52,584 --> 00:06:55,820
and resilience and graph
construction and transformation
159
00:06:55,820 --> 00:06:56,784
and now talking
160
00:06:56,784 --> 00:07:00,000
about the programming
languages, Spark supports Scala,
161
00:07:00,000 --> 00:07:02,851
which is a functional
programming language in which
162
00:07:02,851 --> 00:07:04,100
Spark is written.
163
00:07:04,100 --> 00:07:08,200
So Spark supports Scala as
the interface. Then Spark also
164
00:07:08,200 --> 00:07:10,100
supports python interface.
165
00:07:10,100 --> 00:07:13,066
You can write the program
in Python and execute it
166
00:07:13,066 --> 00:07:14,408
over the spark again.
167
00:07:14,408 --> 00:07:16,899
If you see the code
in Python and Scala,
168
00:07:16,899 --> 00:07:20,858
both are very similar. Then R
is very famous for data analysis
169
00:07:20,858 --> 00:07:22,200
and machine learning.
170
00:07:22,200 --> 00:07:25,081
So Spark has also added
the support for R,
171
00:07:25,081 --> 00:07:26,717
and it also supports Java
172
00:07:26,717 --> 00:07:27,961
so you can go ahead
173
00:07:27,961 --> 00:07:31,300
and write the code in Java
and execute it over Spark.
174
00:07:31,300 --> 00:07:33,300
next the data can be stored
175
00:07:33,300 --> 00:07:36,400
in hdfs local file
system Amazon S3 cloud
176
00:07:36,700 --> 00:07:39,700
and it also supports SQL
and nosql database as well.
177
00:07:39,700 --> 00:07:43,645
So this is all about the various
components of spark ecosystem.
178
00:07:43,645 --> 00:07:45,300
Now, let's see what's next
179
00:07:45,300 --> 00:07:48,064
when it comes to iterative
distributed computing
180
00:07:48,064 --> 00:07:50,600
that is processing the data
over multiple jobs
181
00:07:50,600 --> 00:07:51,600
and computations.
182
00:07:51,700 --> 00:07:52,776
We need to reuse
183
00:07:52,776 --> 00:07:55,200
or share the data
among multiple jobs
184
00:07:55,200 --> 00:07:58,258
in earlier Frameworks
like Hadoop there were problems
185
00:07:58,258 --> 00:07:59,950
while dealing with multiple.
186
00:07:59,950 --> 00:08:01,400
Operations or jobs here.
187
00:08:01,400 --> 00:08:02,900
We need to store the data
188
00:08:02,900 --> 00:08:07,053
in some intermediate stable
distributed storage such as HDFS,
189
00:08:07,053 --> 00:08:11,003
and multiple I/O operations
makes the overall computations
190
00:08:11,003 --> 00:08:13,976
of jobs much slower,
and there were replications
191
00:08:13,976 --> 00:08:15,100
and serializations
192
00:08:15,100 --> 00:08:17,955
which in turn made
the process even more slower
193
00:08:17,955 --> 00:08:20,500
and our goal here was
to reduce the number
194
00:08:20,500 --> 00:08:22,400
of I/O operations to hdfs
195
00:08:22,400 --> 00:08:26,350
and this can be achieved only
through in-memory data sharing
196
00:08:26,350 --> 00:08:29,900
In-memory data sharing
is 10 to 100 times faster
197
00:08:29,900 --> 00:08:31,966
than network and disk sharing,
198
00:08:31,966 --> 00:08:35,138
and rdds try to solve all
the problems by enabling
199
00:08:35,138 --> 00:08:38,447
fault-tolerant distributed
in-memory computation.
200
00:08:38,447 --> 00:08:40,000
So now let's understand
201
00:08:40,000 --> 00:08:44,000
what RDDs are. RDD stands for
Resilient Distributed Dataset.
202
00:08:44,000 --> 00:08:46,509
They are considered to be
the backbone of spark
203
00:08:46,509 --> 00:08:49,419
and is one of the fundamental
data structure of spark.
204
00:08:49,419 --> 00:08:51,782
It is also known as
the schema-less structures
205
00:08:51,782 --> 00:08:54,900
that can handle both structured
and unstructured data.
206
00:08:54,900 --> 00:08:57,900
So in spark anything
you do is around rdd.
207
00:08:57,900 --> 00:08:59,700
You're reading the
data in spark.
208
00:08:59,700 --> 00:09:01,500
Then it is read
into an RDD. Again,
209
00:09:01,500 --> 00:09:04,300
when you're transforming
the data, then you're performing
210
00:09:04,300 --> 00:09:07,268
Transformations on old rdd
and creating a new one.
211
00:09:07,268 --> 00:09:10,378
Then at last you will perform
some actions on the rdd
212
00:09:10,378 --> 00:09:12,533
and store that data
present in an rdd
213
00:09:12,533 --> 00:09:15,906
in persistent storage.
A resilient distributed dataset
214
00:09:15,906 --> 00:09:18,900
is an immutable distributed
collection of objects.
215
00:09:18,900 --> 00:09:20,300
Your objects can be anything
216
00:09:20,300 --> 00:09:23,200
like strings lines
Rose objects collections
217
00:09:23,200 --> 00:09:26,400
Etc rdds can contain
any type of python Java
218
00:09:26,400 --> 00:09:27,533
or Scala objects.
219
00:09:27,533 --> 00:09:30,000
Even including
user-defined classes as well.
220
00:09:30,000 --> 00:09:32,900
And talking about
the distributed environment.
221
00:09:32,900 --> 00:09:35,612
Each data set present
in an rdd is divided
222
00:09:35,612 --> 00:09:37,200
into logical partitions,
223
00:09:37,200 --> 00:09:39,353
which may be computed
on different nodes
224
00:09:39,353 --> 00:09:42,500
of the cluster due to this you
can perform Transformations
225
00:09:42,500 --> 00:09:44,190
or actions on the complete data
226
00:09:44,190 --> 00:09:47,300
in parallel, and you don't have
to worry about the distribution
227
00:09:47,300 --> 00:09:49,400
because Spark takes care of that.
228
00:09:49,400 --> 00:09:52,100
RDDs are
highly resilient, that is,
229
00:09:52,100 --> 00:09:55,141
they are able to recover
quickly from any issues
230
00:09:55,141 --> 00:09:56,500
as a same data chunks
231
00:09:56,500 --> 00:09:59,700
are replicated across
multiple executor nodes.
232
00:09:59,700 --> 00:10:02,564
so even if one executor
fails another will still
233
00:10:02,564 --> 00:10:03,600
process the data.
234
00:10:03,600 --> 00:10:06,482
This allows you to perform
functional calculations
235
00:10:06,482 --> 00:10:08,287
against a data set very quickly
236
00:10:08,287 --> 00:10:10,699
by harnessing the power
of multiple nodes.
237
00:10:10,699 --> 00:10:12,472
So this is all about rdd now.
238
00:10:12,472 --> 00:10:14,000
Let's have a look at some
239
00:10:14,000 --> 00:10:17,847
of the important features of
RDDs. RDDs have a provision
240
00:10:17,847 --> 00:10:19,327
for in-memory computation,
241
00:10:19,327 --> 00:10:21,300
and all transformations
are lazy.
242
00:10:21,300 --> 00:10:24,044
That is it does not compute
the results right away
243
00:10:24,044 --> 00:10:25,679
until an action is applied.
244
00:10:25,679 --> 00:10:27,800
So it supports
in-memory computation
245
00:10:27,800 --> 00:10:30,034
and lazy evaluation
as well next.
246
00:10:30,034 --> 00:10:32,200
Fault tolerant in case of rdds.
247
00:10:32,200 --> 00:10:34,454
They track the data
lineage information
248
00:10:34,454 --> 00:10:37,341
to rebuild the lost data
automatically and this is
249
00:10:37,341 --> 00:10:40,000
how it provides fault tolerance
to the system.
250
00:10:40,000 --> 00:10:42,600
Next immutability data
can be created
251
00:10:42,600 --> 00:10:43,800
or received any time
252
00:10:43,800 --> 00:10:46,388
and once defined its value
cannot be changed.
253
00:10:46,388 --> 00:10:47,900
And that is the reason why
254
00:10:47,900 --> 00:10:51,235
I said RDDs are
immutable. Next, partitioning:
255
00:10:51,235 --> 00:10:53,774
it is the fundamental
unit of parallelism
256
00:10:53,774 --> 00:10:54,605
in Spark RDDs,
257
00:10:54,605 --> 00:10:57,800
and all the data chunks
are divided into partitions
258
00:10:57,800 --> 00:10:59,960
in an RDD. Next, persistence:
259
00:10:59,960 --> 00:11:01,600
So users can reuse rdd
260
00:11:01,600 --> 00:11:05,400
and choose a storage strategy for
them. Coarse-grained operations:
261
00:11:05,400 --> 00:11:08,493
these apply to all elements
in datasets through Maps
262
00:11:08,493 --> 00:11:10,600
or filter or
group by operations.
263
00:11:10,700 --> 00:11:13,000
So these are the various
features of RDDs.
264
00:11:13,300 --> 00:11:15,800
Now, let's see
the ways to create rdd.
265
00:11:15,800 --> 00:11:19,117
There are three ways to create
rdds one can create rdd
266
00:11:19,117 --> 00:11:22,800
from parallelized collections
and one can also create rdd
267
00:11:22,800 --> 00:11:24,367
from an existing RDD
268
00:11:24,367 --> 00:11:27,100
or other RDDs
and it can also be created
269
00:11:27,100 --> 00:11:30,000
from external data sources
as well like hdfs.
270
00:11:30,000 --> 00:11:31,900
Amazon S3 hbase Etc.
271
00:11:32,000 --> 00:11:34,600
Now let me show you
how to create rdds.
272
00:11:34,800 --> 00:11:37,199
I'll open my terminal
and first check
273
00:11:37,199 --> 00:11:39,600
whether my daemons
are running or not.
274
00:11:40,500 --> 00:11:41,300
Cool here.
275
00:11:41,300 --> 00:11:42,757
I can see that Hadoop
276
00:11:42,757 --> 00:11:45,041
and Spark daemons
both are running.
277
00:11:45,041 --> 00:11:47,186
So now at the first let's start
278
00:11:47,186 --> 00:11:51,200
the spark shell it will take
a bit time to start the shell.
279
00:11:52,500 --> 00:11:52,900
Cool.
280
00:11:52,900 --> 00:11:54,800
Now the Spark shell has started
281
00:11:54,800 --> 00:11:58,329
and I can see the version of
spark as two point one point one
282
00:11:58,329 --> 00:12:00,500
and we have a Scala
shell over here.
283
00:12:00,500 --> 00:12:00,759
Now.
284
00:12:00,759 --> 00:12:02,888
I will tell you
how to create rdds
285
00:12:02,888 --> 00:12:06,557
in three different ways using
Scala language at the first.
286
00:12:06,557 --> 00:12:08,450
Let's see how to create an rdd
287
00:12:08,450 --> 00:12:12,178
from parallelized collections.
sc.parallelize is the method
288
00:12:12,178 --> 00:12:15,600
that I use to create a parallelized
collection of RDDs,
289
00:12:15,600 --> 00:12:16,733
and this method is
290
00:12:16,733 --> 00:12:20,700
the Spark context's parallelize method
to create a parallelized collection.
291
00:12:20,700 --> 00:12:22,500
So I will give sc.parallelize,
292
00:12:22,500 --> 00:12:26,200
and here I will parallelize
the numbers 1 to 100
293
00:12:27,300 --> 00:12:31,371
in five different partitions,
and I will apply collect
294
00:12:31,371 --> 00:12:33,500
as action to start the process.
295
00:12:34,900 --> 00:12:36,592
So here in the result,
296
00:12:36,592 --> 00:12:39,600
you can see an array
of the numbers 1 to 100.
297
00:12:39,600 --> 00:12:40,100
Okay.
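(As a rough sketch, the command being typed here looks like the following, assuming the Scala spark-shell, where sc is the pre-created SparkContext; the name rdd1 is only for illustration, since in the demo the result is collected directly:)

    // Create an RDD from a parallelized collection of 1 to 100, in 5 partitions,
    // then trigger execution with the collect action
    val rdd1 = sc.parallelize(1 to 100, 5)
    rdd1.collect()    // Array(1, 2, 3, ..., 100)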
298
00:12:40,300 --> 00:12:41,635
Now let me show you
299
00:12:41,635 --> 00:12:45,010
how the partitions appear
in the web UI of spark.
300
00:12:45,010 --> 00:12:49,300
So the web UI port for Spark is
localhost:4040.
301
00:12:50,700 --> 00:12:53,630
So here you have just
completed one task.
302
00:12:53,630 --> 00:12:55,903
That is, the
sc.parallelize collect.
303
00:12:55,903 --> 00:12:56,800
Correct here.
304
00:12:56,800 --> 00:13:00,114
You can see all the five stages
that are succeeded
305
00:13:00,114 --> 00:13:03,700
because we have divided the task
into five partitions.
306
00:13:03,700 --> 00:13:06,000
So let me show you the partitions.
307
00:13:06,000 --> 00:13:08,100
So this is the DAG
visualization,
308
00:13:08,100 --> 00:13:11,558
that is the directed acyclic
graph visualization wherein
309
00:13:11,558 --> 00:13:14,200
you have applied only
parallelize as a method,
310
00:13:14,200 --> 00:13:16,200
so you can see only
one stage here.
311
00:13:16,800 --> 00:13:20,291
So here you can see the rdd
that is been created
312
00:13:20,291 --> 00:13:24,032
and coming to the event timeline
you can see the task
313
00:13:24,032 --> 00:13:27,400
that has been executed
in five different stages
314
00:13:27,400 --> 00:13:29,011
and the different colors imply
315
00:13:29,011 --> 00:13:30,632
the scheduler delay,
316
00:13:30,632 --> 00:13:34,300
the task deserialization time, shuffle
read time, shuffle write time,
317
00:13:34,300 --> 00:13:36,612
the executor computing
time, etc.
318
00:13:36,612 --> 00:13:40,227
You can see the summary metrics
for the created rdd here.
319
00:13:40,227 --> 00:13:41,000
You can see
320
00:13:41,000 --> 00:13:44,300
that the maximum time it
took to execute the tasks
321
00:13:44,300 --> 00:13:48,400
in five partitions parallely is
just 45 milliseconds.
322
00:13:49,000 --> 00:13:53,300
You can also see the executor ID
the host ID the status
323
00:13:53,300 --> 00:13:56,800
that is succeeded
duration launch time Etc.
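(If you prefer to verify the partition count from the shell rather than the web UI, a small sketch, reusing the hypothetical rdd1 from the step above:)

    // Number of partitions backing the RDD (5 in this demo)
    rdd1.getNumPartitions
    // Equivalent check:
    rdd1.partitions.length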
324
00:13:57,000 --> 00:13:59,255
So this is one way
of creating an rdd
325
00:13:59,255 --> 00:14:01,061
from parallelized collections.
326
00:14:01,061 --> 00:14:02,400
Now, let me show you
327
00:14:02,400 --> 00:14:05,900
how to create an RDD
from an existing RDD. Okay,
328
00:14:06,000 --> 00:14:08,770
here I'll create
an array called a1
329
00:14:08,770 --> 00:14:11,077
and assign numbers one to ten.
330
00:14:11,800 --> 00:14:14,900
One two, three,
four five six seven.
331
00:14:16,200 --> 00:14:18,900
Okay, so I got the result here.
332
00:14:18,900 --> 00:14:22,300
That is I have created
an integer array of 1 to 10
333
00:14:22,300 --> 00:14:25,200
and now I will parallelize
this array a1.
334
00:14:31,303 --> 00:14:32,996
Sorry, I got an error.
335
00:14:33,300 --> 00:14:37,300
It is sc.parallelize
of a1.
336
00:14:38,200 --> 00:14:42,800
Okay, so I created an RDD
called parallelCollection. Cool.
337
00:14:42,800 --> 00:14:46,600
Now I will create a new RDD
from the existing RDD.
338
00:14:46,600 --> 00:14:51,000
That is, val newRDD is equal
339
00:14:51,000 --> 00:14:55,900
to a1.map. Mapping the data
present in an RDD,
340
00:14:56,061 --> 00:14:59,138
I will create a new ID
from existing rdd.
341
00:14:59,200 --> 00:15:01,200
So here I will take a1
342
00:15:01,200 --> 00:15:05,800
as a reference and map
the data and multiply
343
00:15:05,800 --> 00:15:07,300
that data into two.
344
00:15:07,573 --> 00:15:09,726
So what should be our output
345
00:15:10,019 --> 00:15:13,480
if I multiply the data present
in the RDD by two,
346
00:15:13,700 --> 00:15:18,600
so it would be like
2 4 6 8 up to 20, correct?
347
00:15:18,600 --> 00:15:20,400
So, let's see how it works.
348
00:15:20,700 --> 00:15:24,500
Yes, we got the output
that is multiple of 1 to 10.
349
00:15:24,500 --> 00:15:26,691
That is two four
six eight up to 20.
350
00:15:26,691 --> 00:15:28,357
So this is one of the method
351
00:15:28,357 --> 00:15:30,500
of creating a new RDD
from an old RDD.
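(Put together, the commands in this part of the demo look roughly like this sketch; the names a1 and newRDD follow the narration, and parallelCollection is assumed from the earlier step:)

    // An integer array of 1 to 10, parallelized into an RDD
    val a1 = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
    val parallelCollection = sc.parallelize(a1)

    // Derive a new RDD from the existing one by doubling every element
    val newRDD = parallelCollection.map(data => data * 2)
    newRDD.collect()    // Array(2, 4, 6, 8, ..., 20)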
352
00:15:30,500 --> 00:15:34,088
And I have one more method that
is from external file sources.
353
00:15:34,088 --> 00:15:37,500
So what I will do here is I
will give val test is equal
354
00:15:37,500 --> 00:15:39,780
to sc.textFile. Here,
355
00:15:40,790 --> 00:15:43,800
I will give the path
to hdfs file location
356
00:15:43,800 --> 00:15:48,900
and link the path, that is,
hdfs://localhost:9000 is the path,
357
00:15:48,900 --> 00:15:50,800
and I have a folder
358
00:15:50,800 --> 00:15:54,600
called example, and in that
I have a file called sample.
359
00:15:57,300 --> 00:16:01,500
Cool, so I got one
more RDD created here.
360
00:16:02,000 --> 00:16:02,281
Now.
361
00:16:02,281 --> 00:16:04,042
Let me show you this file
362
00:16:04,042 --> 00:16:07,000
that I have already kept
in hdfs directory.
363
00:16:08,100 --> 00:16:09,897
I will browse the file system
364
00:16:09,897 --> 00:16:12,500
and I will show you
the / example directory
365
00:16:12,500 --> 00:16:13,800
that I have created.
366
00:16:14,800 --> 00:16:16,867
So here you can see the example
367
00:16:16,867 --> 00:16:19,800
that I have created as
a directory and here I
368
00:16:19,800 --> 00:16:23,000
have sample as input file
that I have been given.
369
00:16:23,000 --> 00:16:25,800
So here you can see
the same path location.
370
00:16:25,800 --> 00:16:26,300
So this is
371
00:16:26,300 --> 00:16:29,633
how I can create an rdd
from external file sources.
372
00:16:29,633 --> 00:16:30,484
In this case.
373
00:16:30,484 --> 00:16:33,300
I have used hdfs as
an external file source.
374
00:16:33,300 --> 00:16:36,757
So this is how we can create
rdds from three different ways
375
00:16:36,757 --> 00:16:39,700
that is, from parallelized collections,
from external sources,
376
00:16:39,700 --> 00:16:41,600
and from existing RDDs.
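(A sketch of the external-source variant just shown; the HDFS URI and file name follow the narration and may differ on your setup:)

    // Create an RDD from a file kept in HDFS
    val test = sc.textFile("hdfs://localhost:9000/example/sample")
    test.collect()    // returns the lines of the file as an Array[String]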
377
00:16:41,700 --> 00:16:44,900
So let's move further and see
the various RDD operations.
378
00:16:44,900 --> 00:16:46,500
RDDs actually support
379
00:16:46,500 --> 00:16:50,100
two main operations, namely
transformations and actions.
380
00:16:50,100 --> 00:16:51,419
As I have already said,
381
00:16:51,419 --> 00:16:53,200
RDDs are immutable.
382
00:16:53,200 --> 00:16:54,900
So once you create an rdd,
383
00:16:54,900 --> 00:16:57,500
you cannot change
any content in the RDD,
384
00:16:57,500 --> 00:16:58,913
so you might be wondering
385
00:16:58,913 --> 00:17:01,400
how an RDD applies
those transformations?
386
00:17:01,400 --> 00:17:02,200
Correct?
387
00:17:02,200 --> 00:17:04,299
When you run
any Transformations,
388
00:17:04,299 --> 00:17:07,062
it runs those transformations
on the old RDD
389
00:17:07,062 --> 00:17:08,445
and creates a new RDD.
390
00:17:08,445 --> 00:17:11,400
This is basically done
for optimization reasons.
391
00:17:11,400 --> 00:17:13,446
Transformations are
the operations
392
00:17:13,446 --> 00:17:14,500
which are applied
393
00:17:14,500 --> 00:17:18,815
on an RDD to create a new RDD.
Now these transformations work
394
00:17:18,815 --> 00:17:21,221
on the principle
of lazy evaluations.
395
00:17:21,221 --> 00:17:23,075
So what does it mean it means
396
00:17:23,075 --> 00:17:25,500
that when we call
some operation in rdd
397
00:17:25,500 --> 00:17:28,888
it does not execute immediately,
and Spark maintains
398
00:17:28,888 --> 00:17:31,704
the record of the operation
that is being called
399
00:17:31,704 --> 00:17:34,127
since Transformations
are lazy in nature
400
00:17:34,127 --> 00:17:36,052
so we can execute the operation
401
00:17:36,052 --> 00:17:38,600
any time by calling
an action on the data.
402
00:17:38,800 --> 00:17:42,200
Hence in lazy evaluation
data is not loaded
403
00:17:42,200 --> 00:17:44,525
until it is necessary. Now, the actions
404
00:17:44,525 --> 00:17:46,100
analyze the RDD
405
00:17:46,100 --> 00:17:49,103
and produce a result.
A simple action can be count,
406
00:17:49,103 --> 00:17:52,800
which will count the rows in an
RDD and then produce a result,
407
00:17:52,800 --> 00:17:53,583
so I can say
408
00:17:53,583 --> 00:17:57,700
that transformations produce new
RDDs and actions produce results.
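(A minimal sketch of lazy transformations versus actions, assuming the same hypothetical HDFS file as before:)

    // Transformation: map is only recorded, nothing executes yet
    val lineLengths = sc.textFile("hdfs://localhost:9000/example/sample")
                        .map(line => line.length)

    // Action: count triggers the actual computation and returns a result
    lineLengths.count()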
409
00:17:57,700 --> 00:18:00,058
before moving further
with the discussion.
410
00:18:00,058 --> 00:18:03,000
Let me tell you about
the three different workloads
411
00:18:03,000 --> 00:18:06,500
that Spark caters to. They are
batch mode, interactive mode,
412
00:18:06,500 --> 00:18:09,052
and streaming mode
in case of batch mode.
413
00:18:09,052 --> 00:18:10,839
we run a batch job.
Here you write a job
414
00:18:10,839 --> 00:18:13,427
and then schedule it
it works through a queue
415
00:18:13,427 --> 00:18:14,703
or a batch of separate.
416
00:18:14,703 --> 00:18:17,292
Jobs without manual
intervention then in case
417
00:18:17,292 --> 00:18:18,400
of interactive mode.
418
00:18:18,400 --> 00:18:19,700
It is an interactive shell
419
00:18:19,700 --> 00:18:22,100
where you go and execute
the commands one by one.
420
00:18:22,300 --> 00:18:24,844
So you will execute
one command check the result
421
00:18:24,844 --> 00:18:26,902
and then execute
other command based
422
00:18:26,902 --> 00:18:28,400
on the output result and so
423
00:18:28,400 --> 00:18:30,754
on it works similar
to the SQL shell
424
00:18:30,754 --> 00:18:32,100
so the shell is the one
425
00:18:32,100 --> 00:18:35,221
which executes a driver program
and in the Shell mode.
426
00:18:35,221 --> 00:18:37,096
You can run it
on the cluster mode.
427
00:18:37,096 --> 00:18:39,449
It is generally used
for development work
428
00:18:39,449 --> 00:18:41,159
or it is used
for ad hoc queries,
429
00:18:41,159 --> 00:18:42,708
then comes the streaming mode
430
00:18:42,708 --> 00:18:44,900
where the program
is continuously running.
431
00:18:44,900 --> 00:18:47,300
As and when the data
comes, it takes the data
432
00:18:47,300 --> 00:18:48,818
and do some Transformations
433
00:18:48,818 --> 00:18:51,300
and actions on the data
and get some results.
434
00:18:51,300 --> 00:18:53,800
So these are the three
different workloads
435
00:18:53,800 --> 00:18:55,600
that Spark caters to.
436
00:18:55,600 --> 00:18:58,100
Let's see a real-time
use case here.
437
00:18:58,100 --> 00:18:59,600
I'm considering Yahoo!
438
00:18:59,600 --> 00:19:00,600
As an example.
439
00:19:00,600 --> 00:19:02,716
So what are
the problems of Yahoo!
440
00:19:02,716 --> 00:19:03,128
Yahoo!
441
00:19:03,128 --> 00:19:04,062
Properties are
442
00:19:04,062 --> 00:19:06,800
highly personalized
to maximize relevance.
443
00:19:06,800 --> 00:19:09,600
The algorithms used
to provide personalization.
444
00:19:09,600 --> 00:19:11,692
That is the
targeted advertisement
445
00:19:11,692 --> 00:19:14,800
and personalized content
are highly sophisticated.
446
00:19:14,800 --> 00:19:18,300
And the relevance model
must be updated frequently
447
00:19:18,300 --> 00:19:22,745
because stories news feed and
ads change in time and Yahoo,
448
00:19:22,745 --> 00:19:24,967
has over 150 petabytes of data
449
00:19:24,967 --> 00:19:28,300
that is stored
on a 35,000-node Hadoop cluster,
450
00:19:28,300 --> 00:19:31,391
which should be accessed
efficiently to avoid latency
451
00:19:31,391 --> 00:19:33,150
caused by the data movement
452
00:19:33,150 --> 00:19:35,300
and to gain insights
from the data
453
00:19:35,300 --> 00:19:37,000
in a cost-effective manner.
454
00:19:37,000 --> 00:19:39,600
So to overcome
these problems Yahoo!
455
00:19:39,600 --> 00:19:42,171
looked to Spark to
improve the performance
456
00:19:42,171 --> 00:19:44,687
of this iterative
model training here.
457
00:19:44,687 --> 00:19:48,700
Machine learning algorithm for
news personalization required
458
00:19:48,700 --> 00:19:51,200
15,000 lines of C++ code
459
00:19:51,300 --> 00:19:55,000
on the other hand the machine
learning algorithm has just
460
00:19:55,000 --> 00:19:57,076
120 lines of Scala code.
461
00:19:57,100 --> 00:19:59,600
So that is
the advantage of spark
462
00:19:59,800 --> 00:20:02,600
and this algorithm was ready
for production use
463
00:20:02,600 --> 00:20:06,700
in just 30 minutes of training
on a hundred million datasets
464
00:20:06,700 --> 00:20:08,900
and Sparks Rich API is available
465
00:20:08,900 --> 00:20:12,201
in several programming
languages and has resilient
466
00:20:12,201 --> 00:20:14,588
in-memory storage
options and is
467
00:20:14,588 --> 00:20:18,567
compatible with Hadoop through YARN
and the spark-yarn project.
468
00:20:18,567 --> 00:20:21,400
It uses Apache spark
for personalizing It's
469
00:20:21,400 --> 00:20:24,490
News web pages and for
targeted advertising.
470
00:20:24,490 --> 00:20:28,300
Not only that it also uses
machine learning algorithms
471
00:20:28,300 --> 00:20:31,375
that run on Apache Spark
to find out what kind
472
00:20:31,375 --> 00:20:33,700
of news users are
interested in reading,
473
00:20:33,700 --> 00:20:36,714
and also for categorizing
the news stories to find
474
00:20:36,714 --> 00:20:39,290
out what kind of users
would be interested
475
00:20:39,290 --> 00:20:41,300
in Reading each category of news
476
00:20:41,524 --> 00:20:44,524
and Spark runs over Hadoop YARN
to use existing data
477
00:20:44,600 --> 00:20:47,800
and clusters, and
the extensive API of Spark
478
00:20:47,800 --> 00:20:50,605
and the machine learning library
eases the development
479
00:20:50,605 --> 00:20:54,276
of machine learning algorithms
and Spark reduces the latency
480
00:20:54,276 --> 00:20:55,400
of model training
481
00:20:55,400 --> 00:20:56,800
via in-memory RDDs.
482
00:20:56,800 --> 00:21:00,855
So this is how spark has helped
Yahoo to improve the performance
483
00:21:00,855 --> 00:21:02,431
and achieve the targets.
484
00:21:02,431 --> 00:21:05,320
So I hope you understood
the concept of spark
485
00:21:05,320 --> 00:21:06,700
and its fundamentals.
486
00:21:11,500 --> 00:21:14,000
Now, let me just give
you an overview
487
00:21:14,000 --> 00:21:17,600
of the Spark architecture.
Apache Spark has a well-defined
488
00:21:17,600 --> 00:21:18,711
layered architecture
489
00:21:18,711 --> 00:21:22,017
where all the components
and layers are Loosely coupled
490
00:21:22,017 --> 00:21:25,200
and integrated with various
extensions and libraries.
491
00:21:25,200 --> 00:21:28,600
This architecture is based
on two main abstractions.
492
00:21:28,600 --> 00:21:31,500
First one resilient
distributed data sets
493
00:21:31,500 --> 00:21:32,419
that is rdd
494
00:21:32,419 --> 00:21:36,108
and the next one, the directed
acyclic graph called DAG,
495
00:21:36,108 --> 00:21:40,100
or DAG. In order to understand
the Spark architecture,
496
00:21:40,100 --> 00:21:43,400
You need to first know
the components of the spark
497
00:21:43,400 --> 00:21:44,500
that is, the Spark
498
00:21:44,500 --> 00:21:47,700
ecosystem, and its fundamental
data structure, RDD.
499
00:21:47,700 --> 00:21:51,100
So let's start by understanding
the spark ecosystem
500
00:21:51,100 --> 00:21:53,080
as you can see from the diagram.
501
00:21:53,080 --> 00:21:56,300
The spark ecosystem is composed
of various components
502
00:21:56,300 --> 00:21:57,812
like Spark SQL, Spark
503
00:21:57,812 --> 00:22:01,400
Streaming, the machine learning
library, GraphX, Spark
504
00:22:01,400 --> 00:22:05,600
R, and the Core API component.
Talking about Spark SQL.
505
00:22:05,600 --> 00:22:08,700
It is used to Leverage The Power
of declarative queries
506
00:22:08,700 --> 00:22:11,827
and optimize storage
by executing SQL queries
507
00:22:11,827 --> 00:22:12,817
on spark data,
508
00:22:12,817 --> 00:22:14,520
which is present in rdds.
509
00:22:14,520 --> 00:22:18,600
And other external sources.
Next, the Spark Streaming component
510
00:22:18,600 --> 00:22:21,400
allows developers
to perform batch processing
511
00:22:21,400 --> 00:22:22,600
and streaming of the data
512
00:22:22,600 --> 00:22:26,300
in the same application. Coming
to the machine learning library.
513
00:22:26,300 --> 00:22:27,745
It eases the development
514
00:22:27,745 --> 00:22:30,862
and deployment of scalable
machine learning pipelines,
515
00:22:30,862 --> 00:22:33,765
like summary statistics
cluster analysis methods
516
00:22:33,765 --> 00:22:36,709
correlations dimensionality
reduction techniques
517
00:22:36,709 --> 00:22:37,900
feature extractions
518
00:22:37,900 --> 00:22:40,500
and many more. Now,
the GraphX component
519
00:22:40,500 --> 00:22:42,100
lets the data scientist work
520
00:22:42,100 --> 00:22:44,689
with graph and non-graph
sources to achieve
521
00:22:44,689 --> 00:22:47,400
flexibility and resilience
in graph construction
522
00:22:47,400 --> 00:22:51,000
and transformation. Coming
to SparkR, it is an R package
523
00:22:51,000 --> 00:22:54,818
that provides a lightweight
front end to use Apache Spark.
524
00:22:54,818 --> 00:22:58,000
It provides a distributed
data frame implementation
525
00:22:58,000 --> 00:23:01,994
that supports operations like
selection filtering aggregation,
526
00:23:01,994 --> 00:23:03,500
but on large data sets,
527
00:23:03,500 --> 00:23:06,198
it also supports
distributed machine learning
528
00:23:06,198 --> 00:23:08,100
using machine learning library.
529
00:23:08,157 --> 00:23:10,542
Finally, the Spark Core component.
530
00:23:10,600 --> 00:23:13,600
That is the most vital component
of spark ecosystem,
531
00:23:13,600 --> 00:23:14,800
which is responsible
532
00:23:14,800 --> 00:23:17,621
for basic
I/O functions, scheduling,
533
00:23:17,621 --> 00:23:21,517
and monitoring. The entire Spark
ecosystem is built on top
534
00:23:21,517 --> 00:23:23,456
of this core execution engine,
535
00:23:23,456 --> 00:23:26,600
which has extensible apis
in different languages
536
00:23:26,600 --> 00:23:29,400
like Scala, Python,
R, and Java. Now,
537
00:23:29,400 --> 00:23:32,200
let me tell you
about the programming languages
538
00:23:32,200 --> 00:23:33,977
at the first, Spark supports
539
00:23:33,977 --> 00:23:37,190
Scala. Scala is a functional
programming language
540
00:23:37,190 --> 00:23:38,900
in which spark is written
541
00:23:39,092 --> 00:23:42,400
and Spark supports Scala
as an interface. Then
542
00:23:42,400 --> 00:23:44,400
Spark also supports a Python
543
00:23:44,400 --> 00:23:48,012
interface. You can write a program
in Python and execute it
544
00:23:48,012 --> 00:23:49,500
over Spark. Again,
545
00:23:49,500 --> 00:23:52,166
if you see the code
in Scala and Python,
546
00:23:52,166 --> 00:23:56,166
both are very similar. Then
coming to R, it is very famous
547
00:23:56,166 --> 00:23:58,700
for data analysis
and machine learning.
548
00:23:58,700 --> 00:24:01,708
So Spark has also added
the support for R,
549
00:24:01,708 --> 00:24:03,500
and it also supports Java
550
00:24:03,500 --> 00:24:06,280
so you can go ahead
and write the Java code
551
00:24:06,280 --> 00:24:08,200
and execute it over the spark
552
00:24:08,200 --> 00:24:11,100
Again, Spark also provides
you an interactive shell
553
00:24:11,100 --> 00:24:14,005
for Scala, Python, and R,
where you can go ahead
554
00:24:14,005 --> 00:24:16,230
and Execute the commands
one by one.
555
00:24:16,230 --> 00:24:18,700
So this is all about
the Spark ecosystem.
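(As one concrete illustration of the Spark SQL component described above, a minimal sketch assuming spark-shell 2.x, where spark is the pre-created SparkSession; people.json is a hypothetical input file with name and age fields:)

    val df = spark.read.json("hdfs://localhost:9000/example/people.json")
    df.createOrReplaceTempView("people")
    // Declarative query over the DataFrame
    spark.sql("SELECT name FROM people WHERE age > 21").show()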
556
00:24:18,700 --> 00:24:19,500
Next.
557
00:24:19,500 --> 00:24:22,600
Let's discuss the fundamental
data structure of spark
558
00:24:22,600 --> 00:24:26,400
that is, RDD, called
resilient distributed datasets.
559
00:24:26,784 --> 00:24:30,015
So in Spark, anything
you do is around RDDs:
560
00:24:30,200 --> 00:24:33,200
you're reading the data
in Spark, then it is read
561
00:24:33,200 --> 00:24:34,400
into an RDD. Again,
562
00:24:34,400 --> 00:24:37,200
When you're transforming
the data, then you're performing
563
00:24:37,200 --> 00:24:40,509
Transformations on an old rdd
and creating a new one.
564
00:24:40,509 --> 00:24:43,200
Then at the last you
will perform some actions
565
00:24:43,200 --> 00:24:44,643
on the data and store the
566
00:24:44,643 --> 00:24:46,288
dataset present in the RDD
567
00:24:46,288 --> 00:24:49,764
in persistent storage.
A resilient distributed data
568
00:24:49,764 --> 00:24:53,300
set is an immutable distributed
collection of objects.
569
00:24:53,300 --> 00:24:55,200
Your objects can be anything
570
00:24:55,200 --> 00:24:58,910
like string lines
Rose objects collections Etc.
571
00:24:59,600 --> 00:25:02,704
Now talking about
the distributed environment.
572
00:25:02,704 --> 00:25:06,500
Each data set in rdd is divided
into logical partitions,
573
00:25:06,500 --> 00:25:08,709
which may be computed
on different nodes
574
00:25:08,709 --> 00:25:12,062
of the cluster. Due to this, you
can perform transformations
575
00:25:12,062 --> 00:25:14,416
and actions on the
complete data in parallel,
576
00:25:14,416 --> 00:25:17,100
and you don't have to worry
about the distribution,
577
00:25:17,100 --> 00:25:18,700
because Spark takes care
578
00:25:18,700 --> 00:25:22,200
of that. Next, as I said,
RDDs are immutable.
579
00:25:22,200 --> 00:25:25,000
So once you create
an rdd you cannot change
580
00:25:25,000 --> 00:25:26,500
any content in the RDD,
581
00:25:26,500 --> 00:25:28,102
so you might be wondering
582
00:25:28,102 --> 00:25:31,500
how an RDD applies
those transformations, correct?
583
00:25:31,600 --> 00:25:35,845
When you run any transformations,
it runs those transformations
584
00:25:35,845 --> 00:25:38,300
on the old RDD
and creates a new RDD.
585
00:25:38,300 --> 00:25:41,700
This is basically done
for optimization reasons.
586
00:25:41,700 --> 00:25:44,609
So, let me tell you
one thing here: RDDs can
587
00:25:44,609 --> 00:25:46,205
be cached and persisted.
588
00:25:46,205 --> 00:25:49,270
If you want to save an RDD
for future work,
589
00:25:49,270 --> 00:25:50,218
you can cache it
590
00:25:50,218 --> 00:25:53,000
and it will improve
the Spark performance. An RDD
591
00:25:53,000 --> 00:25:55,589
is a fault-tolerant
collection of elements
592
00:25:55,589 --> 00:25:57,800
that can be operated
on in parallel.
593
00:25:57,800 --> 00:26:00,400
If an RDD is lost,
it will automatically
594
00:26:00,400 --> 00:26:03,400
be recomputed by using
the original Transformations.
595
00:26:03,500 --> 00:26:06,500
This is how Spark
provides fault tolerance.
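(A small sketch of caching and persistence as just described, using the hypothetical file path from earlier:)

    import org.apache.spark.storage.StorageLevel

    val logs = sc.textFile("hdfs://localhost:9000/example/sample")
    logs.cache()                                  // shorthand for persist(MEMORY_ONLY)
    // logs.persist(StorageLevel.MEMORY_AND_DISK) // or pick a storage level explicitly

    logs.count()    // first action reads from HDFS and caches the RDD
    logs.count()    // second action reuses the cached data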
596
00:26:06,500 --> 00:26:10,300
There are two ways to create
RDDs: the first one, by parallelizing
597
00:26:10,300 --> 00:26:13,100
an existing collection
in your driver program
598
00:26:13,100 --> 00:26:15,809
and the second one
by Referencing a data set
599
00:26:15,809 --> 00:26:17,700
in the external storage system
600
00:26:17,700 --> 00:26:21,200
such as shared file
system hdfs hbase Etc.
601
00:26:21,400 --> 00:26:23,852
Now Transformations
are the operations
602
00:26:23,852 --> 00:26:27,300
that you perform on an RDD,
which will create a new RDD.
603
00:26:27,300 --> 00:26:30,346
For example, you
can perform filter on an rdd
604
00:26:30,346 --> 00:26:31,800
and create a new rdd.
605
00:26:31,800 --> 00:26:34,577
Then there are actions
which analyze the RDD
606
00:26:34,577 --> 00:26:37,717
and produce a result.
A simple action can be count,
607
00:26:37,717 --> 00:26:39,900
which will count
the rows in an RDD
608
00:26:39,900 --> 00:26:42,100
and produce a result, so I can say
609
00:26:42,100 --> 00:26:46,200
that transformations produce new
RDDs and actions produce results.
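(A short sketch of the filter and count example mentioned here, in the Scala shell:)

    val nums = sc.parallelize(1 to 10)

    // Transformation: filter builds a new RDD, nothing runs yet
    val evens = nums.filter(n => n % 2 == 0)

    // Action: count triggers the computation and produces a result
    evens.count()    // 5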
610
00:26:46,200 --> 00:26:47,011
So this is all
611
00:26:47,011 --> 00:26:49,600
about the fundamental
data structure of spark
612
00:26:49,600 --> 00:26:51,000
that is, RDD.
613
00:26:51,000 --> 00:26:54,300
Let's dive into the core topic
of today's discussion
614
00:26:54,300 --> 00:26:56,120
that is, the Spark architecture.
615
00:26:56,120 --> 00:26:58,100
So this is
the Spark architecture
616
00:26:58,100 --> 00:26:59,300
in your master node.
617
00:26:59,300 --> 00:27:02,681
You have the driver program
which drives your application.
618
00:27:02,681 --> 00:27:06,300
So the code that you're writing
behaves as a driver program or
619
00:27:06,300 --> 00:27:08,752
if you are using
the interactive shell the shell
620
00:27:08,752 --> 00:27:12,017
acts as the driver program.
Inside the driver program,
621
00:27:12,017 --> 00:27:12,900
the first thing
622
00:27:12,900 --> 00:27:16,134
that you do is you create
a Spark context. Assume
623
00:27:16,134 --> 00:27:19,300
that the Spark context
is a gateway to all Spark
624
00:27:19,300 --> 00:27:22,800
functionality. It is similar
to your database connection.
625
00:27:22,800 --> 00:27:25,800
So any command you execute
in a database goes
626
00:27:25,800 --> 00:27:29,600
through the database connection.
Similarly, anything you do
627
00:27:29,600 --> 00:27:32,600
on spark goes through
the spark context.
628
00:27:32,700 --> 00:27:34,800
Now this Spark context works
629
00:27:34,800 --> 00:27:37,652
with the cluster manager
to manage various jobs,
630
00:27:37,652 --> 00:27:38,783
the driver program
631
00:27:38,783 --> 00:27:42,050
and the Spark context take care
of executing the job
632
00:27:42,050 --> 00:27:44,700
across the cluster.
A job is split into tasks
633
00:27:45,161 --> 00:27:46,700
And then these tasks
634
00:27:46,700 --> 00:27:48,500
are distributed over
the worker nodes.
635
00:27:48,500 --> 00:27:50,417
So anytime you create an RDD
636
00:27:50,417 --> 00:27:53,562
in the Spark context,
that RDD can be distributed
637
00:27:53,562 --> 00:27:54,900
across various nodes
638
00:27:54,900 --> 00:27:58,711
and can be cached there. So an RDD
is said to be partitioned
639
00:27:58,711 --> 00:28:02,426
and distributed across various
nodes. Now, worker nodes are
640
00:28:02,426 --> 00:28:06,268
the slave nodes whose job is
to basically execute the tasks.
641
00:28:06,268 --> 00:28:07,895
The task is then executed
642
00:28:07,895 --> 00:28:10,500
on the partitioned RDDs
in the worker nodes
643
00:28:10,500 --> 00:28:14,327
and then returns the result back
to the Spark context.
644
00:28:14,327 --> 00:28:17,892
The Spark context takes the job, breaks
the job into tasks,
645
00:28:17,892 --> 00:28:20,400
and distributes them
to the worker nodes,
646
00:28:20,400 --> 00:28:23,900
and these tasks work
on the partitioned RDDs, perform
647
00:28:23,900 --> 00:28:26,252
whatever operations you
wanted to perform
648
00:28:26,252 --> 00:28:27,800
and then collect the result
649
00:28:27,800 --> 00:28:30,300
and give it back
to the main Spark context.
650
00:28:30,300 --> 00:28:32,690
If you increase
the number of workers,
651
00:28:32,690 --> 00:28:34,199
then you can divide jobs
652
00:28:34,199 --> 00:28:38,100
into more partitions and execute
them in parallel over multiple systems.
653
00:28:38,100 --> 00:28:40,600
This will actually be
a lot faster.
654
00:28:40,600 --> 00:28:42,900
Also if you increase
the number of workers,
655
00:28:42,900 --> 00:28:44,700
it will also
increase your memory.
656
00:28:44,900 --> 00:28:46,746
And you can cache the jobs
657
00:28:46,746 --> 00:28:49,800
so that they can be executed
much faster.
658
00:28:49,800 --> 00:28:52,231
So this is all
about Spark architecture.
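(For a standalone application, as opposed to the interactive shell, the driver program creates the SparkContext itself. A minimal sketch; the app name and local master are chosen only for illustration:)

    import org.apache.spark.{SparkConf, SparkContext}

    object ArchitectureSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("ArchitectureSketch").setMaster("local[*]")
        val sc   = new SparkContext(conf)   // gateway to all Spark functionality

        // The job is split into tasks that run on executors over the partitions
        val total = sc.parallelize(1 to 1000, 4).map(_ * 2).reduce(_ + _)
        println(total)

        sc.stop()
      }
    }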
659
00:28:52,231 --> 00:28:52,491
Now.
660
00:28:52,491 --> 00:28:54,709
Let me give you
an infographic idea
661
00:28:54,709 --> 00:28:56,600
about the Spark architecture.
662
00:28:56,600 --> 00:28:59,397
It follows master-slave
architecture here.
663
00:28:59,397 --> 00:29:02,400
The client submits the
Spark user application code.
664
00:29:02,400 --> 00:29:05,189
When an application code
is submitted, the driver
665
00:29:05,189 --> 00:29:07,200
implicitly converts a user code
666
00:29:07,200 --> 00:29:09,000
that contains Transformations
667
00:29:09,000 --> 00:29:12,700
and actions into a logical
directed acyclic graph called DAG.
668
00:29:12,700 --> 00:29:14,200
At this stage, it also
669
00:29:14,200 --> 00:29:18,172
performs optimizations such as
pipelining transformations.
670
00:29:18,172 --> 00:29:21,165
Then it converts
the logical graph called DAG
671
00:29:21,165 --> 00:29:23,032
into a physical execution plan
672
00:29:23,032 --> 00:29:24,100
with many stages
673
00:29:24,100 --> 00:29:26,972
After converting it into the
physical execution plan,
674
00:29:26,972 --> 00:29:30,100
it creates physical
execution units called tasks
675
00:29:30,100 --> 00:29:31,100
under each stage.
676
00:29:31,200 --> 00:29:33,300
Then these tasks are bundled
677
00:29:33,300 --> 00:29:36,300
and sent to the cluster.
Now the driver talks
678
00:29:36,300 --> 00:29:39,523
to the cluster manager
and negotiates the resources,
679
00:29:39,523 --> 00:29:42,727
and the cluster manager launches
the needed executors.
680
00:29:42,727 --> 00:29:45,392
At this point, the driver
will also send the tasks
681
00:29:45,392 --> 00:29:47,828
to the executors based
on data placement.
682
00:29:47,828 --> 00:29:51,610
When executors start, they register
themselves with the driver,
683
00:29:51,610 --> 00:29:55,147
so that the driver will have
a complete view of the executors,
684
00:29:55,147 --> 00:29:57,815
and executors now start
executing the tasks
685
00:29:57,815 --> 00:30:00,099
that are assigned by
the driver program.
686
00:30:00,099 --> 00:30:01,300
At any point of time
687
00:30:01,300 --> 00:30:04,800
when the application is running,
the driver program will monitor
688
00:30:04,800 --> 00:30:06,000
the set of executors
689
00:30:06,000 --> 00:30:07,848
that run, and the driver node
690
00:30:07,848 --> 00:30:11,100
also schedules future tasks
Based on data placement.
691
00:30:11,100 --> 00:30:14,600
So this is how the internal
working takes place in the Spark
692
00:30:14,600 --> 00:30:17,400
architecture. There are
three different types
693
00:30:17,400 --> 00:30:18,968
of workloads that spark
694
00:30:18,968 --> 00:30:22,282
can cater to: first, batch mode.
In case of batch mode,
695
00:30:22,282 --> 00:30:24,800
we run a batch job. Here
you write the job
696
00:30:24,800 --> 00:30:26,100
and then schedule it.
697
00:30:26,100 --> 00:30:28,989
It works through a queue
or batch of separate jobs
698
00:30:28,989 --> 00:30:31,804
without manual intervention.
Next, interactive mode:
699
00:30:31,804 --> 00:30:33,460
This is an interactive shell
700
00:30:33,460 --> 00:30:36,300
where you go and execute
the commands one by one.
701
00:30:36,300 --> 00:30:39,100
So you'll execute
one command check the result
702
00:30:39,100 --> 00:30:41,177
and then execute
the other command based
703
00:30:41,177 --> 00:30:42,700
on the output result and so
704
00:30:42,700 --> 00:30:44,600
on. It works similar to the SQL
705
00:30:44,600 --> 00:30:48,200
shell. So the shell is the one
which executes the driver program.
706
00:30:48,200 --> 00:30:50,833
So it is generally used
for development work
707
00:30:50,833 --> 00:30:53,100
or it is also used
for ad hoc queries,
708
00:30:53,100 --> 00:30:54,670
then comes the streaming mode
709
00:30:54,670 --> 00:30:57,200
where the program
is continuously running. As
710
00:30:57,200 --> 00:30:59,400
and when the data
comes, it takes the data
711
00:30:59,500 --> 00:31:02,000
and does some transformations
and actions on the data
712
00:31:02,300 --> 00:31:04,200
and then produces output results.
713
00:31:04,400 --> 00:31:06,900
So these are the three
different types of workloads
714
00:31:06,900 --> 00:31:09,000
that Spark actually caters to. Now,
715
00:31:09,000 --> 00:31:11,866
let's move ahead and see
a simple demo here.
716
00:31:11,866 --> 00:31:14,600
let's understand how
to create a Spark
717
00:31:14,600 --> 00:31:17,000
application in the Spark
shell using Scala.
718
00:31:17,000 --> 00:31:18,266
So let's understand
719
00:31:18,266 --> 00:31:21,400
how to create a spark
application in spark shell
720
00:31:21,400 --> 00:31:22,700
using Scala assume
721
00:31:22,700 --> 00:31:25,700
that we have a text file
in the hdfs directory
722
00:31:25,700 --> 00:31:28,900
and we are counting the number
of words in that text file.
723
00:31:28,900 --> 00:31:30,421
So, let's see how to do it.
724
00:31:30,421 --> 00:31:32,900
So before I start running,
let me first check
725
00:31:32,900 --> 00:31:34,900
whether all my daemons
are running or not.
726
00:31:35,200 --> 00:31:37,100
So I'll type sudo JPS
727
00:31:37,200 --> 00:31:40,600
so all my Spark daemons
and Hadoop daemons are running:
728
00:31:40,600 --> 00:31:44,353
I have Master and Worker
as Spark daemons, and NameNode,
729
00:31:44,353 --> 00:31:47,400
NodeManager, ResourceManager, everything
as Hadoop daemons.
730
00:31:47,400 --> 00:31:48,749
So the first thing
731
00:31:48,749 --> 00:31:51,600
that I do here is
I run the spark shell
732
00:31:51,700 --> 00:31:54,700
so it takes a bit of time
to start. In the meanwhile,
733
00:31:54,700 --> 00:31:56,700
let me tell you the web UI port
734
00:31:56,700 --> 00:31:59,623
for the Spark shell is
localhost:4040.
735
00:32:00,300 --> 00:32:02,900
So this is the web
UI for Spark.
736
00:32:02,900 --> 00:32:06,400
If you click on jobs right now,
we have not executed anything,
737
00:32:06,400 --> 00:32:08,861
so there are
no details over here.
738
00:32:09,400 --> 00:32:11,900
So there you have job stages.
739
00:32:12,100 --> 00:32:14,200
So once you execute the jobs,
740
00:32:14,200 --> 00:32:16,300
you'll be having
the records of the tasks
741
00:32:16,300 --> 00:32:17,700
that you have executed here.
742
00:32:17,700 --> 00:32:20,400
So here you can see
the stages of various jobs
743
00:32:20,400 --> 00:32:21,706
and tasks executed.
744
00:32:21,706 --> 00:32:22,943
So now let's check
745
00:32:22,943 --> 00:32:25,900
whether our Spark
shell has started or not.
746
00:32:25,900 --> 00:32:26,500
Yes.
747
00:32:26,500 --> 00:32:30,074
So you have your Spark version
as 2.1.1,
748
00:32:30,074 --> 00:32:32,500
and you have a Scala
shell over here.
749
00:32:32,600 --> 00:32:34,300
So before I start the code,
750
00:32:34,300 --> 00:32:36,300
let's check the content
that is present
751
00:32:36,300 --> 00:32:38,600
in the input text file
by running this command.
752
00:32:38,933 --> 00:32:39,933
So I'll write
753
00:32:39,933 --> 00:32:44,000
val test is equal
to sc.textFile,
754
00:32:44,000 --> 00:32:46,700
because I have saved
a text file over there
755
00:32:46,700 --> 00:32:49,300
and I'll give
the HDFS path location.
756
00:32:50,000 --> 00:32:52,900
I've stored my text file
in this location.
757
00:32:53,300 --> 00:32:55,600
And Sample is the name
of the text file.
758
00:32:55,600 --> 00:32:58,400
So now let me give
test.collect,
759
00:32:58,400 --> 00:32:59,834
so that it collects the data
760
00:32:59,834 --> 00:33:02,600
and displays the data that
is present in the text file.
761
00:33:02,600 --> 00:33:04,500
So in my text file,
762
00:33:04,500 --> 00:33:08,500
I have Hadoop, research, analyst,
data, science, and science.
763
00:33:08,500 --> 00:33:10,500
So this is my input data.
764
00:33:10,500 --> 00:33:12,200
So now let me map
765
00:33:12,200 --> 00:33:15,600
the functions and apply
the Transformations and actions.
766
00:33:15,600 --> 00:33:20,000
So I'll give val map is equal
to sc.textFile
767
00:33:20,000 --> 00:33:22,600
and I will specify
768
00:33:22,600 --> 00:33:28,800
my path location. So this
is my input path location,
769
00:33:29,073 --> 00:33:32,226
and I'll apply
the flat map transformation
770
00:33:32,457 --> 00:33:33,842
to split the data.
771
00:33:36,100 --> 00:33:38,100
that are separated by space,
772
00:33:38,900 --> 00:33:44,330
and then map the word count to
be given as (word, 1). Now,
773
00:33:44,330 --> 00:33:46,100
This would be executed.
774
00:33:46,100 --> 00:33:46,600
Yes.
775
00:33:47,100 --> 00:33:49,000
Now, let me apply the action
776
00:33:49,000 --> 00:33:52,000
for this to start
the execution of the task.
777
00:33:52,900 --> 00:33:56,100
So let me tell you one thing
here before applying an action.
778
00:33:56,100 --> 00:33:58,600
Spark will not start
the execution process.
779
00:33:58,600 --> 00:34:00,600
So here I have applied
reduceByKey
780
00:34:00,600 --> 00:34:02,800
as the operation to start
counting the number
781
00:34:02,800 --> 00:34:04,100
of words in the text file.
782
00:34:04,500 --> 00:34:07,100
So now we are done
with applying Transformations
783
00:34:07,100 --> 00:34:08,300
and actions as well.
784
00:34:08,300 --> 00:34:09,774
So now the next step is
785
00:34:09,774 --> 00:34:13,300
to specify the output location
to store the output file.
786
00:34:13,300 --> 00:34:16,400
So I will give
counts.saveAsTextFile
787
00:34:16,400 --> 00:34:19,500
and then specify
the location for the output file.
788
00:34:19,500 --> 00:34:21,398
I'll store it
in the same location
789
00:34:21,398 --> 00:34:23,000
where I have my input file.
790
00:34:23,700 --> 00:34:28,400
Now I'll specify my output
file name as output9. Cool.
791
00:34:29,000 --> 00:34:31,200
I forgot to give
the double quotes.
792
00:34:31,800 --> 00:34:33,200
And I will run this.
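Put together, the word-count session dictated above looks roughly like this in the Scala spark-shell. This is a sketch: the hdfs://localhost:9000 prefix is an assumption, while the example directory, sample file and output9 name come from the narration.

    // load the input file from HDFS and inspect it
    val test = sc.textFile("hdfs://localhost:9000/example/sample")
    test.collect()
    // split on spaces, pair each word with 1, then sum the pairs per word
    val map = sc.textFile("hdfs://localhost:9000/example/sample")
                .flatMap(line => line.split(" "))
                .map(word => (word, 1))
    val counts = map.reduceByKey(_ + _)
    // saveAsTextFile is the action that actually triggers the job
    counts.saveAsTextFile("hdfs://localhost:9000/example/output9")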
793
00:34:36,603 --> 00:34:38,296
So it's completed now.
794
00:34:38,473 --> 00:34:40,626
So now let's see the output.
795
00:34:41,000 --> 00:34:42,900
I will open my Hadoop web UI
796
00:34:42,900 --> 00:34:45,750
by giving localhost:50070
797
00:34:45,750 --> 00:34:48,600
and browse the file system
to check the output.
798
00:34:48,900 --> 00:34:50,284
So as I have said,
799
00:34:50,284 --> 00:34:54,000
I have example as my directory
that I have created,
800
00:34:54,000 --> 00:34:57,600
and in that I have specified
output9 as my output.
801
00:34:57,600 --> 00:35:00,300
So the two part
files have been created.
802
00:35:00,300 --> 00:35:02,600
Let's check each
of them one by one.
803
00:35:04,800 --> 00:35:06,512
So we have the data count
804
00:35:06,512 --> 00:35:09,116
as one analyst count
as one and science
805
00:35:09,116 --> 00:35:12,200
count as two. So this is
the first part file. Now,
806
00:35:12,200 --> 00:35:14,200
Let me open the second
part file for you.
807
00:35:18,500 --> 00:35:20,800
So this is the second
part file. There you
808
00:35:20,800 --> 00:35:23,800
have Hadoop count as one
and the research count as one.
809
00:35:24,500 --> 00:35:26,558
So now let me show
you the text file
810
00:35:26,558 --> 00:35:28,600
that we have specified
as the input.
811
00:35:30,200 --> 00:35:31,363
So as I have told
812
00:35:31,363 --> 00:35:34,076
you, Hadoop count is
one, research count is
813
00:35:34,076 --> 00:35:37,400
one, analyst one, data one, science
and science as one and one. So
814
00:35:37,400 --> 00:35:39,600
you might be thinking
data science is one word,
815
00:35:39,600 --> 00:35:40,969
but no, in the program code
816
00:35:40,969 --> 00:35:44,600
we have asked to count the words
that are separated by a space.
817
00:35:44,600 --> 00:35:47,600
So that is why we have
science count as two.
818
00:35:47,600 --> 00:35:51,100
I hope you got an idea
about how word count works.
819
00:35:51,515 --> 00:35:54,900
Similarly, I will now
parallelize numbers 1 to 100
820
00:35:54,900 --> 00:35:56,200
and divide the tasks
821
00:35:56,200 --> 00:36:00,100
into five partitions to show
you what partitions of tasks are.
822
00:36:00,100 --> 00:36:04,400
So I will write sc.parallelize
with numbers 1 to 100
823
00:36:04,403 --> 00:36:07,096
and divide them
into five partitions
824
00:36:07,115 --> 00:36:10,900
and apply collect action
to collect the numbers
825
00:36:10,900 --> 00:36:12,700
and start the execution.
826
00:36:12,784 --> 00:36:16,015
So it displays you
an array of 100 numbers.
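A minimal sketch of that command, assuming the spark-shell's predefined SparkContext sc:

    // distribute the numbers 1 to 100 across 5 partitions and pull them back
    val nums = sc.parallelize(1 to 100, 5)
    nums.partitions.length   // 5
    nums.collect()           // Array(1, 2, 3, ..., 100)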
827
00:36:16,300 --> 00:36:20,900
Now, let me explain the job
stages, partitions, event timeline,
828
00:36:20,900 --> 00:36:23,100
DAG representation
and everything.
829
00:36:23,100 --> 00:36:26,023
So now let me go
to the web UI of spark
830
00:36:26,023 --> 00:36:27,437
and click on jobs.
831
00:36:27,601 --> 00:36:29,294
So these are the tasks
832
00:36:29,294 --> 00:36:33,217
that have submitted so
coming to word count example.
833
00:36:33,700 --> 00:36:36,300
So this is the
DAG visualization.
834
00:36:36,300 --> 00:36:38,700
I hope you can see
it clearly first
835
00:36:38,700 --> 00:36:40,401
you collected the text file,
836
00:36:40,401 --> 00:36:42,709
then you applied
flatmap transformation
837
00:36:42,709 --> 00:36:45,139
and mapped it to count
the number of words
838
00:36:45,139 --> 00:36:47,333
and then applied
the reduceByKey operation
839
00:36:47,333 --> 00:36:49,100
and then save the output file
840
00:36:49,100 --> 00:36:50,500
using saveAsTextFile.
841
00:36:50,500 --> 00:36:52,900
So this is the entire
DAG visualization
842
00:36:52,900 --> 00:36:54,000
of the number of steps
843
00:36:54,000 --> 00:36:56,000
that we have covered
in our program.
844
00:36:56,000 --> 00:36:58,271
So here it shows
the completed stages
845
00:36:58,271 --> 00:37:01,900
that is two stages
and it also shows the duration
846
00:37:01,900 --> 00:37:03,284
that is 2 seconds.
847
00:37:03,400 --> 00:37:05,800
And if you click
on the event timeline,
848
00:37:05,800 --> 00:37:08,482
it just shows the executor
that is added.
849
00:37:08,482 --> 00:37:11,500
And in this case you
cannot see any partitions
850
00:37:11,500 --> 00:37:15,300
because you have not split the
jobs into various partitions.
851
00:37:15,500 --> 00:37:19,200
So this is how you can see
the event timeline and the DAG
852
00:37:19,200 --> 00:37:21,700
visualization. Here
you can also see
853
00:37:21,700 --> 00:37:24,759
the stage ID descriptions
when you have submitted
854
00:37:24,759 --> 00:37:26,800
that I have just
submitted it now
855
00:37:26,800 --> 00:37:29,294
and in this it also
shows the duration
856
00:37:29,294 --> 00:37:32,800
that it took to execute the task,
the output bytes,
857
00:37:32,800 --> 00:37:35,500
the shuffle
read, shuffle write
858
00:37:35,500 --> 00:37:39,100
and many more. Now to show
you the partitions, see,
859
00:37:39,100 --> 00:37:42,500
in this you just applied
sc.parallelize, right?
860
00:37:42,500 --> 00:37:45,151
So it is just showing
one stage where you
861
00:37:45,151 --> 00:37:48,400
have applied the parallelized
transformation here.
862
00:37:48,400 --> 00:37:51,300
It shows the succeeded
tasks as five by five (5/5).
863
00:37:51,300 --> 00:37:54,700
That is, you have divided
the job into five tasks
864
00:37:54,700 --> 00:37:58,762
and all the five tasks have been
executed successfully. Now here
865
00:37:58,762 --> 00:38:02,300
you can see the partitions
of the five different tasks
866
00:38:02,300 --> 00:38:04,112
that are executed in parallel.
867
00:38:04,112 --> 00:38:05,800
So depending on the colors,
868
00:38:05,800 --> 00:38:07,500
it shows the scheduler delay
869
00:38:07,500 --> 00:38:10,500
the shuffle read time,
executor computing time, result
870
00:38:10,500 --> 00:38:11,500
serialization time
871
00:38:11,500 --> 00:38:13,921
and getting result time
and many more
872
00:38:13,921 --> 00:38:15,836
so you can see that duration
873
00:38:15,836 --> 00:38:19,252
that it took to execute
the five tasks in parallel
874
00:38:19,252 --> 00:38:21,263
at the same time is a maximum
875
00:38:21,263 --> 00:38:22,700
of one millisecond.
876
00:38:22,700 --> 00:38:26,200
So in-memory, Spark has
much faster computation,
877
00:38:26,200 --> 00:38:27,810
and you can see the IDS
878
00:38:27,810 --> 00:38:31,100
of all the five different
tasks; all are a success.
879
00:38:31,100 --> 00:38:33,166
You can see the locality level.
880
00:38:33,166 --> 00:38:37,033
You can see the executor and
the host IP ID the launch time
881
00:38:37,033 --> 00:38:39,100
the duration it take everything
882
00:38:39,200 --> 00:38:40,631
so you can also see
883
00:38:40,631 --> 00:38:44,978
that we have created our RDD
and parallelized it. Similarly, here
884
00:38:44,978 --> 00:38:47,000
also for word count example,
885
00:38:47,000 --> 00:38:48,306
you can see the rdd
886
00:38:48,306 --> 00:38:51,324
that has been created
and also the Actions
887
00:38:51,324 --> 00:38:53,800
that have applied
to execute the task
888
00:38:54,000 --> 00:38:57,401
and you can see the duration
that it took even here also,
889
00:38:57,401 --> 00:38:58,980
it's just one millisecond
890
00:38:58,980 --> 00:39:02,200
that it took to execute
the entire word count example,
891
00:39:02,200 --> 00:39:05,900
and you can see the IDs,
locality level, executor ID.
892
00:39:05,900 --> 00:39:06,916
So in this case,
893
00:39:06,916 --> 00:39:09,712
we have just executed
the task in two stages.
894
00:39:09,712 --> 00:39:11,900
So it is just showing
the two stages.
895
00:39:11,900 --> 00:39:13,100
So this is all about
896
00:39:13,100 --> 00:39:16,266
how web UI looks and what are
the features and information
897
00:39:16,266 --> 00:39:18,435
that you can see
in the web UI of spark
898
00:39:18,435 --> 00:39:21,200
after executing the program
in the Scala shell.
899
00:39:21,200 --> 00:39:22,271
So in this program,
900
00:39:22,271 --> 00:39:25,635
you can see that we first gave
the path to the input location
901
00:39:25,635 --> 00:39:26,700
and check the data
902
00:39:26,700 --> 00:39:29,063
that is present
in the input file.
903
00:39:29,063 --> 00:39:31,900
And then we applied
flatmap Transformations
904
00:39:31,900 --> 00:39:33,100
and created rdd
905
00:39:33,100 --> 00:39:36,800
and then applied action to start
the execution of the task
906
00:39:36,800 --> 00:39:39,500
and save the output file
in this location.
907
00:39:39,500 --> 00:39:41,643
So I hope you got a clear idea
908
00:39:41,643 --> 00:39:45,054
of how to execute a word
count example and check
909
00:39:45,054 --> 00:39:46,861
for the various features
910
00:39:46,861 --> 00:39:50,700
in the Spark web UI, like
partitions and DAG visualizations,
911
00:39:50,700 --> 00:39:59,900
and I hope you found the session
interesting. Apache Spark.
912
00:40:00,000 --> 00:40:03,900
This word can generate a spark
in every Hadoop engineer's mind.
913
00:40:03,900 --> 00:40:06,188
It is a big data
processing framework,
914
00:40:06,188 --> 00:40:08,805
which is lightning fast
in cluster computing.
915
00:40:08,805 --> 00:40:12,300
And the core reason behind
its outstanding performance is
916
00:40:12,300 --> 00:40:15,500
the resilient distributed
data set or in short.
917
00:40:15,500 --> 00:40:17,779
the RDD, and today I'll focus
918
00:40:17,779 --> 00:40:20,200
on the topic called
rdd using spark
919
00:40:20,200 --> 00:40:21,723
Before we get started,
920
00:40:21,723 --> 00:40:23,900
Let's have a quick look
on the agenda.
921
00:40:23,900 --> 00:40:24,900
For today's session.
922
00:40:25,100 --> 00:40:28,213
We shall start with
understanding the need for rdds
923
00:40:28,213 --> 00:40:29,272
where we'll learn
924
00:40:29,272 --> 00:40:32,200
the reasons behind which
the rdds were required.
925
00:40:32,200 --> 00:40:34,700
Then we shall learn
what RDDs are,
926
00:40:34,700 --> 00:40:37,871
where will understand
what exactly an rdd is
927
00:40:37,871 --> 00:40:39,800
and how they work. Later,
928
00:40:39,800 --> 00:40:42,400
I'll walk you through
the fascinating features
929
00:40:42,400 --> 00:40:46,300
of RDDs such as in-memory
computation, partitioning,
930
00:40:46,374 --> 00:40:48,475
persistence fault tolerance
931
00:40:48,475 --> 00:40:49,475
and many more
932
00:40:49,600 --> 00:40:51,200
Once I finish the theory,
933
00:40:51,300 --> 00:40:53,200
I'll get your hands on rdds
934
00:40:53,200 --> 00:40:55,100
where we'll practically create
935
00:40:55,100 --> 00:40:58,141
and perform all possible
operations on RDDs,
936
00:40:58,141 --> 00:40:59,500
and finally I'll wind
937
00:40:59,500 --> 00:41:02,677
up this session with
an interesting Pokémon use case,
938
00:41:02,677 --> 00:41:06,100
which will help you understand
rdds in a much better way.
939
00:41:06,100 --> 00:41:08,100
Let's get started. Spark is one
940
00:41:08,100 --> 00:41:10,792
of the top mandatory skills
required by each
941
00:41:10,792 --> 00:41:12,518
and every Big Data developer.
942
00:41:12,518 --> 00:41:14,687
It is used
in multiple applications,
943
00:41:14,687 --> 00:41:17,800
which need real-time processing
such as Google's
944
00:41:17,800 --> 00:41:21,066
recommendation engine credit
card fraud detection.
945
00:41:21,066 --> 00:41:23,713
And many more to understand
this in depth.
946
00:41:23,713 --> 00:41:27,200
We shall consider Amazon's
recommendation engine assume
947
00:41:27,200 --> 00:41:29,500
that you are searching
for a mobile phone
948
00:41:29,500 --> 00:41:33,126
on Amazon, and you have certain
specifications of your choice.
949
00:41:33,126 --> 00:41:36,742
Then the Amazon search engine
understands your requirements
950
00:41:36,742 --> 00:41:38,450
and provides you the products
951
00:41:38,450 --> 00:41:41,155
which match the specifications
of your choice.
952
00:41:41,155 --> 00:41:43,800
All this is made possible
because of the most
953
00:41:43,800 --> 00:41:46,717
powerful tool existing
in Big Data environment,
954
00:41:46,717 --> 00:41:49,000
which is none other
than Apache spark
955
00:41:49,000 --> 00:41:51,000
and the resilient distributed data set
956
00:41:51,000 --> 00:41:53,946
is considered to be
the heart of Apache Spark.
957
00:41:53,946 --> 00:41:56,735
So with this let's begin
our first question.
958
00:41:56,735 --> 00:41:58,300
Why do we need RDDs?
959
00:41:58,300 --> 00:42:01,410
Well, the current world
is expanding the technology
960
00:42:01,410 --> 00:42:02,903
and artificial intelligence
961
00:42:02,903 --> 00:42:06,891
is the face of this evolution.
The machine learning algorithms
962
00:42:06,891 --> 00:42:09,300
and the data needed
to train these computers
963
00:42:09,300 --> 00:42:10,453
are huge. The logic
964
00:42:10,453 --> 00:42:13,378
behind all these algorithms
are very complicated
965
00:42:13,378 --> 00:42:17,300
and mostly run in a distributed
and iterative computation method
966
00:42:17,300 --> 00:42:19,800
the machine learning
algorithms could not use
967
00:42:19,800 --> 00:42:21,053
the older MapReduce programs,
968
00:42:21,053 --> 00:42:24,500
because the traditional
MapReduce programs needed
969
00:42:24,500 --> 00:42:26,733
a stable State hdfs and we know
970
00:42:26,733 --> 00:42:31,200
that hdfs generates redundancy
during intermediate computations
971
00:42:31,200 --> 00:42:34,800
which resulted in a major
latency in data processing
972
00:42:34,800 --> 00:42:36,900
and in hdfs gathering data
973
00:42:36,900 --> 00:42:39,400
for multiple processing units
at a single instance.
974
00:42:39,400 --> 00:42:42,752
was time consuming. Along
with this, the major issue
975
00:42:42,752 --> 00:42:46,600
was that HDFS did not have
random read and write ability.
976
00:42:46,600 --> 00:42:49,000
So using this old
mapreduce programs
977
00:42:49,000 --> 00:42:52,000
for machine learning
problems would be inefficient. Then
978
00:42:52,000 --> 00:42:53,700
Spark was introduced.
979
00:42:53,700 --> 00:42:55,318
Compared to MapReduce, Spark
980
00:42:55,318 --> 00:42:58,435
is an advanced big data
processing framework. Resilient
981
00:42:58,435 --> 00:42:59,503
distributed data set
982
00:42:59,503 --> 00:43:02,423
which is a fundamental
and most crucial data structure
983
00:43:02,423 --> 00:43:03,600
of spark was the one
984
00:43:03,600 --> 00:43:06,900
which made it all possible. RDDs
are effortless to create,
985
00:43:06,900 --> 00:43:09,205
and the mind-blowing
property with which it solved
986
00:43:09,205 --> 00:43:12,500
the problem was its in-memory
data processing capability.
987
00:43:12,500 --> 00:43:15,600
An RDD is not a distributed
file system; instead,
988
00:43:15,600 --> 00:43:17,894
It is a distributed
collection of memory
989
00:43:17,894 --> 00:43:19,905
where the data needed
is always stored
990
00:43:19,905 --> 00:43:21,057
and kept available
991
00:43:21,057 --> 00:43:24,269
in RAM, and because of
this property the elevation it
992
00:43:24,269 --> 00:43:27,300
gave to the memory
accessing speed was unbelievable
993
00:43:27,300 --> 00:43:29,250
The RDDs are fault tolerant,
994
00:43:29,250 --> 00:43:32,900
and this property brought it
a dignity of a whole new level.
995
00:43:32,900 --> 00:43:35,074
So our next question would be
996
00:43:35,074 --> 00:43:38,522
what are RDDs? The resilient
distributed data sets,
997
00:43:38,522 --> 00:43:39,600
or the rdds are
998
00:43:39,600 --> 00:43:42,600
the primary underlying
data structures of spark.
999
00:43:42,600 --> 00:43:44,311
They are highly fault tolerant
1000
00:43:44,311 --> 00:43:46,900
and they store data
amongst multiple computers
1001
00:43:46,900 --> 00:43:51,000
in a network. The data is written
into multiple executable nodes,
1002
00:43:51,000 --> 00:43:54,800
So that in case of a Calamity
if any executing node fails,
1003
00:43:54,800 --> 00:43:57,459
then within a fraction
of second it gets back up
1004
00:43:57,459 --> 00:43:59,100
from the next executable node
1005
00:43:59,100 --> 00:44:02,200
with the same processing speeds
of the current node,
1006
00:44:02,300 --> 00:44:04,900
the fault-tolerant property
enables them to roll back
1007
00:44:04,900 --> 00:44:06,876
their data to the original state
1008
00:44:06,876 --> 00:44:09,038
by applying simple
Transformations on
1009
00:44:09,038 --> 00:44:11,225
to the lost part
in the lineage. RDDs
1010
00:44:11,225 --> 00:44:13,696
do not need
anything called a hard disk
1011
00:44:13,696 --> 00:44:15,489
or any other secondary storage
1012
00:44:15,489 --> 00:44:17,700
all that they need
is the main memory,
1013
00:44:17,700 --> 00:44:18,700
which is RAM. Now
1014
00:44:18,700 --> 00:44:21,100
that we have understood
the need for RDDs
1015
00:44:21,100 --> 00:44:22,482
and what exactly
1016
00:44:22,482 --> 00:44:25,204
an RDD is, let us see
the different sources
1017
00:44:25,204 --> 00:44:28,223
from which the data
can be ingested into an rdd.
1018
00:44:28,223 --> 00:44:30,600
The data can be loaded
from any Source
1019
00:44:30,600 --> 00:44:33,700
like HDFS, HBase, Hive, SQL,
1020
00:44:33,700 --> 00:44:34,658
you name it?
1021
00:44:34,658 --> 00:44:35,582
They got it.
1022
00:44:35,700 --> 00:44:36,200
Hence.
1023
00:44:36,200 --> 00:44:39,000
The collected data
is dropped into an rdd.
1024
00:44:39,000 --> 00:44:42,000
And guess what, the RDDs
are free-spirited; they
1025
00:44:42,000 --> 00:44:44,051
can process any type of data.
1026
00:44:44,051 --> 00:44:47,800
They won't care if the data
is structured unstructured
1027
00:44:47,800 --> 00:44:49,500
or semi-structured now,
1028
00:44:49,500 --> 00:44:51,200
let me walk you
through the features.
1029
00:44:51,200 --> 00:44:52,300
of RDDs,
1030
00:44:52,300 --> 00:44:54,700
which give it an edge
over the other Alternatives
1031
00:44:54,900 --> 00:44:57,100
In-memory computation: the idea
1032
00:44:57,100 --> 00:45:00,632
of in-memory computation brought
the groundbreaking progress
1033
00:45:00,632 --> 00:45:03,800
in cluster computing. It
increased the processing speed
1034
00:45:03,800 --> 00:45:07,877
when compared with HDFS.
Moving on to lazy evaluation,
1035
00:45:07,877 --> 00:45:08,827
the phrase lazy
1036
00:45:08,827 --> 00:45:09,527
Explains It
1037
00:45:09,527 --> 00:45:12,564
all. Spark logs all
the Transformations you apply
1038
00:45:12,564 --> 00:45:16,056
onto it and will not throw
any output onto the display
1039
00:45:16,056 --> 00:45:17,900
until an action is provoked.
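A small illustration of that laziness, assuming a spark-shell session:

    // map and filter only record the lineage; nothing runs yet
    val evenSquares = sc.parallelize(1 to 10).map(x => x * x).filter(_ % 2 == 0)
    // the count action finally forces the whole chain to execute
    evenSquares.count()   // 5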
1040
00:45:17,900 --> 00:45:22,200
Next is fault tolerance. RDDs
are absolutely fault tolerant.
1041
00:45:22,200 --> 00:45:26,008
Any lost partition of an rdd
can be rolled back by applying
1042
00:45:26,008 --> 00:45:28,700
simple transformations on
to the lost part
1043
00:45:28,700 --> 00:45:30,286
in the lineage. Speaking
1044
00:45:30,286 --> 00:45:34,700
about immutability the data once
dropped into an rdd is immutable
1045
00:45:34,700 --> 00:45:38,016
because the access provided
by an RDD is read-
1046
00:45:38,016 --> 00:45:39,920
only. The only way to access
or modify it is by applying
00:45:39,920 --> 00:45:43,800
or modified is by applying
a transformation on to an rdd
1048
00:45:43,800 --> 00:45:45,400
which is prior
to the present one
1049
00:45:45,400 --> 00:45:47,200
discussing about partitioning.
1050
00:45:47,200 --> 00:45:48,923
The important reason for Spark's
1051
00:45:48,923 --> 00:45:51,100
parallel processing is
its partitioning.
1052
00:45:51,300 --> 00:45:54,163
By default, Spark determines
the number of partitions
1053
00:45:54,163 --> 00:45:56,200
into which your data is divided,
1054
00:45:56,200 --> 00:45:59,652
but you can override this
and decide the number of blocks.
1055
00:45:59,652 --> 00:46:01,200
You want to split your data.
1056
00:46:01,200 --> 00:46:03,193
Let's see what persistence is
1057
00:46:03,193 --> 00:46:05,600
Spark's RDDs are
totally reusable.
1058
00:46:05,600 --> 00:46:06,757
The users can apply
1059
00:46:06,757 --> 00:46:09,502
certain number of
Transformations on to an rdd
1060
00:46:09,502 --> 00:46:11,302
and preserve the final RDD
1061
00:46:11,302 --> 00:46:14,383
for future use this avoids
all the hectic process
1062
00:46:14,383 --> 00:46:17,369
of applying all
the Transformations from scratch
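For example, an RDD that is expensive to rebuild can be kept around like this (a sketch; the input path is hypothetical and the storage level is one of several choices):

    import org.apache.spark.storage.StorageLevel

    val cleaned = sc.textFile("hdfs:///data/logs").filter(_.nonEmpty)   // hypothetical input
    cleaned.persist(StorageLevel.MEMORY_ONLY)   // keep the partitions in RAM after the first action
    cleaned.count()   // computed once and cached
    cleaned.first()   // served from the cached partitions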
1063
00:46:17,369 --> 00:46:20,867
And now, last but not the least:
coarse-grained operations.
1064
00:46:20,867 --> 00:46:24,300
The operations performed
on rdds using Transformations
1065
00:46:24,300 --> 00:46:28,069
like map, filter, flatMap,
etc. change the RDDs
1066
00:46:28,069 --> 00:46:29,300
and update them.
1067
00:46:29,300 --> 00:46:29,686
Hence.
1068
00:46:29,686 --> 00:46:33,100
Every operation applied
onto an RDD is coarse-grained.
1069
00:46:33,100 --> 00:46:36,800
These are the features of rdds
and moving on to the next stage.
1070
00:46:36,800 --> 00:46:37,800
We shall understand.
1071
00:46:37,800 --> 00:46:39,700
the creation of RDDs. RDDs
1072
00:46:39,700 --> 00:46:42,500
can be created
using three methods.
1073
00:46:42,500 --> 00:46:46,000
The first method is using
parallelized collections.
1074
00:46:46,000 --> 00:46:50,400
The next method is by using external
storage like HDFS, HBase,
1075
00:46:50,400 --> 00:46:51,100
Hive
1076
00:46:51,100 --> 00:46:54,700
and many more. The third one
is using an existing RDD,
1077
00:46:54,700 --> 00:46:56,800
which is prior
to the present one.
1078
00:46:56,800 --> 00:46:58,800
Now, let us understand
1079
00:46:58,800 --> 00:47:02,300
and create an RDD
through each method. Now,
1080
00:47:02,300 --> 00:47:05,600
Spark can be run on virtual
machines like the Spark VM,
1081
00:47:05,600 --> 00:47:08,300
or you can install
a Linux operating system
1082
00:47:08,300 --> 00:47:10,774
like Ubuntu and
run it Standalone,
1083
00:47:10,774 --> 00:47:14,600
but we here at Edureka use
the best-in-class CloudLab,
1084
00:47:14,600 --> 00:47:16,900
which comprises of
all the Frameworks.
1085
00:47:16,900 --> 00:47:19,400
You needed a single
stop Cloud framework.
1086
00:47:19,400 --> 00:47:20,776
No need of any hectic
1087
00:47:20,776 --> 00:47:22,323
process of downloading any file
1088
00:47:22,323 --> 00:47:24,632
or setting up
an environment variables
1089
00:47:24,632 --> 00:47:27,289
and looking for
a hardware specification Etc.
1090
00:47:27,289 --> 00:47:28,890
All you need is a login ID
1091
00:47:28,890 --> 00:47:32,091
and password to the all-in-one
ready to use cloud lab
1092
00:47:32,091 --> 00:47:34,800
where you can run
and save all your programs.
1093
00:47:35,400 --> 00:47:39,600
Let us fire up our Spark shell
using the command spark2-
1094
00:47:39,600 --> 00:47:42,446
shell. Now, as the Spark shell
has been fired up,
1095
00:47:42,446 --> 00:47:44,215
Let's create a new rdd.
1096
00:47:44,800 --> 00:47:48,400
So here we are creating
a new RDD with the first method,
1097
00:47:48,400 --> 00:47:51,500
which is using the
parallelized collections here.
1098
00:47:51,500 --> 00:47:52,954
We are creating a new RDD
1099
00:47:52,954 --> 00:47:55,800
by the name
parallelizedCollectionsRDD.
1100
00:47:55,800 --> 00:47:57,705
We are starting a spark context
1101
00:47:57,705 --> 00:48:00,321
and we are parallelizing
an array into the RDD,
1102
00:48:00,321 --> 00:48:03,300
which consists of the data
of the days of a week,
1103
00:48:03,300 --> 00:48:04,875
which is Monday Tuesday,
1104
00:48:04,875 --> 00:48:07,500
Wednesday, Thursday,
Friday and Saturday.
1105
00:48:07,500 --> 00:48:10,600
Now, let's create
this. Our new RDD
1106
00:48:10,600 --> 00:48:13,841
parallelizedCollectionsRDD
is successfully created. Now,
1107
00:48:13,841 --> 00:48:16,900
let's display the data
which is present in our RDD.
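A sketch of this first method; the RDD name is the assumed spelling of the name spoken above:

    // method 1: parallelize a local collection into an RDD
    val parallelizedCollectionsRDD =
      sc.parallelize(Seq("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
    parallelizedCollectionsRDD.collect()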
1108
00:48:19,400 --> 00:48:23,630
So this was the data
which is present in our RDD. Now,
1109
00:48:23,630 --> 00:48:27,038
let's create a new RDD
using the second method.
1110
00:48:28,200 --> 00:48:30,892
The second method
of creating an rdd
1111
00:48:30,892 --> 00:48:35,400
was using an external storage
such as HDFS, Hive, SQL
1112
00:48:35,600 --> 00:48:37,100
and many more here.
1113
00:48:37,100 --> 00:48:40,200
I'm creating a new RDD
by the name sparkFile,
1114
00:48:40,200 --> 00:48:43,312
where I'll be loading
a text document into the rdd
1115
00:48:43,312 --> 00:48:44,900
from an external storage,
1116
00:48:44,900 --> 00:48:45,900
which is hdfs.
1117
00:48:45,900 --> 00:48:49,700
And this is the location
where my text file is located.
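A sketch of this second method; the HDFS path is a placeholder:

    // method 2: load a text file from external storage (HDFS) into an RDD
    val sparkFile = sc.textFile("hdfs://localhost:9000/path/to/alphabets.txt")
    sparkFile.collect()   // the A to Z contents mentioned in the narration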
1118
00:48:49,800 --> 00:48:53,600
So the new RDD sparkFile
is successfully created. Now,
1119
00:48:53,600 --> 00:48:55,054
let's display the data
1120
00:48:55,054 --> 00:48:57,500
which is present
in our sparkFile RDD.
1121
00:48:58,700 --> 00:48:59,620
The data
1122
00:48:59,620 --> 00:49:02,241
which is present in
our sparkFile RDD is
1123
00:49:02,241 --> 00:49:05,500
a collection of alphabets
starting from A to Z.
1124
00:49:05,500 --> 00:49:05,900
Now.
1125
00:49:05,900 --> 00:49:08,851
Let's create a new RDD
using the third method,
1126
00:49:08,851 --> 00:49:10,946
which is using
an existing RDD
1127
00:49:10,946 --> 00:49:14,201
which is prior to the present
one. In the third method,
1128
00:49:14,201 --> 00:49:16,900
I'm creating a new RDD
by the name words, and
1129
00:49:16,900 --> 00:49:18,700
I'm creating a spark context
1130
00:49:18,700 --> 00:49:21,803
and parallelizing a statement
into the RDD words,
1131
00:49:21,803 --> 00:49:24,700
which is spark is
a very powerful language.
1132
00:49:24,800 --> 00:49:26,517
So this is
a collection of Words,
1133
00:49:26,517 --> 00:49:28,400
which I have passed
into the new
1134
00:49:28,400 --> 00:49:29,400
RDD words.
1135
00:49:29,400 --> 00:49:29,900
Now.
1136
00:49:29,900 --> 00:49:31,700
Let us apply a transformation
1137
00:49:31,700 --> 00:49:34,800
onto the RDD and create
a new RDD through that.
1138
00:49:35,100 --> 00:49:37,656
So here I'm applying
map transformation
1139
00:49:37,656 --> 00:49:39,140
on to the previous rdd
1140
00:49:39,140 --> 00:49:42,717
that is words, and I'm storing
the data into the new RDD,
1141
00:49:42,717 --> 00:49:44,000
which is wordPair.
1142
00:49:44,000 --> 00:49:46,500
So here we are applying
map transformation in order
1143
00:49:46,500 --> 00:49:49,645
to display the first letter
of each and every word
1144
00:49:49,645 --> 00:49:51,700
which is stored
in the RDD words.
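A sketch of this third method; wordPair is the assumed spelling of the new RDD's name:

    // method 3: derive a new RDD from an existing one
    val words = sc.parallelize(Seq("spark", "is", "a", "very", "powerful", "language"))
    // pair each word with its first character
    val wordPair = words.map(w => (w.charAt(0), w))
    wordPair.collect()   // Array((s,spark), (i,is), (a,a), (v,very), (p,powerful), (l,language))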
1145
00:49:51,700 --> 00:49:53,200
Now, let's continue.
1146
00:49:53,200 --> 00:49:56,093
The transformation has been
applied successfully. Now,
1147
00:49:56,093 --> 00:49:59,300
let's display the contents
which are present in the new RDD,
1148
00:49:59,300 --> 00:50:01,800
which is wordPair. So,
00:50:01,800 --> 00:50:05,100
as explained we have displayed
the starting letter of each
1150
00:50:05,100 --> 00:50:06,100
and every word
1151
00:50:06,100 --> 00:50:10,888
s is the starting letter of spark,
i is the starting letter of is, and
1152
00:50:10,888 --> 00:50:13,700
so on; l is the starting
letter of language.
1153
00:50:13,900 --> 00:50:17,000
Now, we have understood
the creation of RDDs.
1154
00:50:17,000 --> 00:50:17,823
Let us move on
1155
00:50:17,823 --> 00:50:21,000
to the next stage where we'll
understand the operations
1156
00:50:21,000 --> 00:50:23,716
that are performed
on rdds Transformations
1157
00:50:23,716 --> 00:50:26,300
and actions are
the two major operations
1158
00:50:26,300 --> 00:50:27,700
that are performed on RDDs.
1159
00:50:27,700 --> 00:50:31,677
Let us understand what
transformations are. We apply
1160
00:50:31,677 --> 00:50:35,575
transformations in order to access,
filter and modify the data
1161
00:50:35,575 --> 00:50:37,470
which is present in an rdd.
1162
00:50:37,470 --> 00:50:41,087
Now Transformations are further
divided into two types
1163
00:50:41,087 --> 00:50:44,500
narrow transformations and
wide transformations. Now,
1164
00:50:44,500 --> 00:50:47,500
let us understand what
narrow transformations are.
1165
00:50:47,500 --> 00:50:50,200
we apply narrow Transformations
onto a single partition
1166
00:50:50,200 --> 00:50:51,400
of the parent RDD,
1167
00:50:51,400 --> 00:50:54,886
because the data required
to process the RTD is available
1168
00:50:54,886 --> 00:50:56,200
on a single partition
1169
00:50:56,200 --> 00:50:58,200
of parent additi the examples
1170
00:50:58,200 --> 00:51:01,125
for narrow transformations
are map, filter,
1171
00:51:01,500 --> 00:51:04,300
flatMap, mapPartition
and mapPartitions.
1172
00:51:04,400 --> 00:51:06,940
Let us move on to the next
type of Transformations
1173
00:51:06,940 --> 00:51:08,511
which is wide transformations.
1174
00:51:08,511 --> 00:51:11,600
We apply wide transformations
onto the multiple partitions
1175
00:51:11,600 --> 00:51:12,698
of the parent RDD,
1176
00:51:12,698 --> 00:51:16,080
because the data required
to process an rdd is available
1177
00:51:16,080 --> 00:51:17,514
on multiple partitions
1178
00:51:17,514 --> 00:51:19,600
of the parent
RDD. The examples
1179
00:51:19,600 --> 00:51:23,000
for wide transformations
are reduceByKey and union. Now,
1180
00:51:23,000 --> 00:51:24,823
let us move on to the next part
1181
00:51:24,823 --> 00:51:27,200
which is actions. Actions,
on the other hand,
1182
00:51:27,200 --> 00:51:29,802
are considered to be
the next part of operations,
1183
00:51:29,802 --> 00:51:31,700
which are used
to display the final results.
1184
00:51:32,200 --> 00:51:35,793
The examples for actions
are collect count take
1185
00:51:35,800 --> 00:51:38,479
and first. Till now,
we have discussed
1186
00:51:38,479 --> 00:51:40,700
the theory part of RDDs.
1187
00:51:40,700 --> 00:51:42,870
Let us start
executing the operations
1188
00:51:42,870 --> 00:51:44,800
that are performed on RDDs.
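Before the IPL example, here is a compact sketch of the two kinds of operations just described; reduceByKey stands in for the wide (shuffling) case and collect, count, take and first are actions:

    val nums     = sc.parallelize(1 to 10, 2)
    val doubled  = nums.map(_ * 2)                                   // narrow transformation
    val bigOnes  = doubled.filter(_ > 10)                            // narrow transformation
    val byParity = doubled.map(n => (n % 2, n)).reduceByKey(_ + _)   // wide: shuffles by key
    bigOnes.collect()    // action
    byParity.count()     // action
    doubled.take(3)      // action
    doubled.first()      // action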
1189
00:51:46,500 --> 00:51:49,100
In the practical part, we
will be dealing with an example
1190
00:51:49,100 --> 00:51:50,600
of IPL match data.
1191
00:51:50,900 --> 00:51:52,900
So here I have a CSV file
1192
00:51:52,900 --> 00:51:57,158
which has the IPL match records
and this CSV file is stored
1193
00:51:57,158 --> 00:51:59,081
in my HDFS, and I'm loading
1194
00:51:59,081 --> 00:52:01,956
my matches.csv file
into the new RDD,
1195
00:52:01,956 --> 00:52:04,200
which is CKfile, as a text file.
1196
00:52:04,200 --> 00:52:07,909
So the matches.csv file
has been successfully loaded
1197
00:52:07,909 --> 00:52:09,990
as a text file into the new RDD,
which is CKfile. Now,
00:52:09,990 --> 00:52:11,400
which is CK file now,
1199
00:52:11,400 --> 00:52:13,759
let us display the data
which is present
1200
00:52:13,759 --> 00:52:16,300
in our CKfile RDD
using an action command.
1201
00:52:16,400 --> 00:52:18,170
So collect is the action command
1202
00:52:18,170 --> 00:52:20,700
which I'm using in order
to display the data
1203
00:52:20,700 --> 00:52:23,100
which is present
in my CKfile RDD.
1204
00:52:23,600 --> 00:52:27,569
So here we have in total
six hundred and thirty six rows
1205
00:52:27,569 --> 00:52:30,600
of data which consists
of IPL match records
1206
00:52:30,600 --> 00:52:33,500
from the year 2008 to 2017.
1207
00:52:33,711 --> 00:52:36,788
Now, let us see the schema
of a CSV file.
1208
00:52:37,300 --> 00:52:40,561
I am using the action command
first in order to display
1209
00:52:40,561 --> 00:52:42,800
the schema of the
matches.csv file.
1210
00:52:42,800 --> 00:52:45,300
So this command will display
the starting line
1211
00:52:45,300 --> 00:52:46,400
of the CSV file.
1212
00:52:46,400 --> 00:52:48,005
We have so the schema
1213
00:52:48,005 --> 00:52:51,600
of the CSV file is the ID
of the match, season, city
1214
00:52:51,600 --> 00:52:54,386
where the IPL match
was conducted date
1215
00:52:54,386 --> 00:52:57,700
of the match, team one, team
two, and so on. Now,
1216
00:52:57,700 --> 00:53:01,100
let's perform the further
operations on a CSV file.
1217
00:53:02,000 --> 00:53:04,300
Now moving on
to the further operations.
1218
00:53:04,300 --> 00:53:07,800
I'm about to split
the second column of my CSV file
1219
00:53:07,800 --> 00:53:10,787
which consists of the information
regarding the states
1220
00:53:10,787 --> 00:53:12,700
which conducted the IPL matches.
1221
00:53:12,700 --> 00:53:15,467
So I am using this operation
in order to display
1222
00:53:15,467 --> 00:53:18,000
the states where
the matches were conducted.
1223
00:53:18,700 --> 00:53:21,600
So the transformation
is been successfully applied
1224
00:53:21,600 --> 00:53:24,600
and the data has been stored
into the new ID which is States.
1225
00:53:24,600 --> 00:53:26,700
Now, let's display the data
which is stored
1226
00:53:26,700 --> 00:53:30,100
in our states RDD using
the collect action command.
1227
00:53:30,400 --> 00:53:31,890
So these were the states
where the matches
00:53:31,890 --> 00:53:34,500
where the matches
were being conducted now,
1229
00:53:34,500 --> 00:53:35,817
let's find out the city
1230
00:53:35,817 --> 00:53:38,700
which conducted the maximum
number of IPL matches.
1231
00:53:39,400 --> 00:53:41,700
Here, I'm creating
a new RDD again,
1232
00:53:41,700 --> 00:53:45,017
which is statesCount,
and I'm using map transformation
1233
00:53:45,017 --> 00:53:47,799
and I am counting each
and every city and the number
1234
00:53:47,799 --> 00:53:50,200
of matches conducted
in that particular City.
1235
00:53:50,500 --> 00:53:52,776
The transformation
is successfully applied
1236
00:53:52,776 --> 00:53:55,600
and the data has been stored
into the statesCount RDD.
1237
00:53:56,400 --> 00:53:56,900
Now.
1238
00:53:56,900 --> 00:54:00,097
Let us create a new RDD
by name stateCountM
1239
00:54:00,097 --> 00:54:01,414
and apply reduceByKey
00:54:01,414 --> 00:54:04,572
transformation and map
transformation together,
1241
00:54:04,572 --> 00:54:07,900
and consider tuple one as
the city name and tuple
1242
00:54:07,900 --> 00:54:09,500
two as the number of matches
1243
00:54:09,500 --> 00:54:11,876
which were conducted
in that particular city,
1244
00:54:11,876 --> 00:54:12,701
and apply sortByKey
00:54:12,701 --> 00:54:15,000
transformation
to find out the city
1246
00:54:15,000 --> 00:54:17,700
which conducted maximum number
of IPL matches.
1247
00:54:17,900 --> 00:54:20,317
The Transformations
are successfully applied
1248
00:54:20,317 --> 00:54:23,200
and the data is being stored
into the stateCountM
1249
00:54:23,200 --> 00:54:25,200
RDD. Now let's
display the data
1250
00:54:25,200 --> 00:54:26,800
which is present in the stateCountM
1251
00:54:26,800 --> 00:54:29,600
RDD. Here I am using
1252
00:54:29,600 --> 00:54:33,320
take action command in order
to take the top 10 results
1253
00:54:33,320 --> 00:54:35,800
which are stored
in the stateCountM RDD.
1254
00:54:36,100 --> 00:54:38,600
So according to the results
we have Mumbai
1255
00:54:38,600 --> 00:54:41,300
which got the maximum number
of IPL matches,
1256
00:54:41,300 --> 00:54:45,700
which is 85 since the year
2008 to the year 2017.
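Put together, a sketch of the pipeline dictated above; the HDFS path is illustrative and the city column is assumed to be index 2 of matches.csv:

    val CKfile = sc.textFile("hdfs://localhost:9000/ipl/matches.csv")
    CKfile.first()                                    // the header / schema line
    val states      = CKfile.map(_.split(",")(2))     // city column
    val statesCount = states.map(city => (city, 1))
    val stateCountM = statesCount.reduceByKey(_ + _)
                                 .map { case (city, n) => (n, city) }
                                 .sortByKey(false)    // highest count first
    stateCountM.take(10)                              // Mumbai on top with 85 matches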
1257
00:54:46,400 --> 00:54:50,300
Now let us create a new RDD
by name filRDD and use
1258
00:54:50,300 --> 00:54:53,144
a filter in order to exclude
the match data
1259
00:54:53,144 --> 00:54:55,800
that was conducted
in the city Hyderabad,
1260
00:54:55,800 --> 00:54:58,500
and store the remaining data
into the filRDD.
1261
00:54:58,500 --> 00:55:01,617
The transformation has been
successfully applied. Now,
1262
00:55:01,617 --> 00:55:04,600
let us display the data
which is present in our filRDD,
1263
00:55:04,600 --> 00:55:06,161
which consists of the matches
1264
00:55:06,161 --> 00:55:08,800
which were conducted
excluding the city Hyderabad.
1265
00:55:09,900 --> 00:55:11,126
So this is the data
1266
00:55:11,126 --> 00:55:15,000
which is present in our filRDD,
which excludes the matches
1267
00:55:15,000 --> 00:55:18,000
which are played
in the city Hyderabad now,
1268
00:55:18,000 --> 00:55:19,768
let us create another RDD
by name fil and store
00:55:19,768 --> 00:55:22,773
by name fil and store
the data of the matches
1270
00:55:22,773 --> 00:55:25,300
which were conducted
in the year 2017.
1271
00:55:25,300 --> 00:55:27,394
We shall use
filter transformation
1272
00:55:27,394 --> 00:55:28,600
for this operation.
1273
00:55:28,700 --> 00:55:31,000
The transformation is
been applied successfully
1274
00:55:31,000 --> 00:55:34,100
and the data has been stored
into the fil RDD. Now,
1275
00:55:34,100 --> 00:55:36,600
let us display the data
which is present there.
1276
00:55:37,200 --> 00:55:38,588
I shall use the collect
1277
00:55:38,588 --> 00:55:42,545
action command and now we have
the data of all the matches
1278
00:55:42,545 --> 00:55:45,600
which were played, especially
in the year 2017.
1279
00:55:47,100 --> 00:55:49,400
similarly, we can find
out the matches
1280
00:55:49,400 --> 00:55:52,000
which were played
in the year 2016 and we
1281
00:55:52,000 --> 00:55:54,600
can save the same data
into the new rdd
1282
00:55:54,600 --> 00:55:57,500
which is fil2. Similarly,
1283
00:55:57,500 --> 00:55:59,823
we can find out the data
of the matches
1284
00:55:59,823 --> 00:56:03,100
which were conducted in the year
2016 and we can store
1285
00:56:03,100 --> 00:56:05,061
the same data into our new rdd
1286
00:56:05,061 --> 00:56:08,200
which is fil2. I
have used filter transformation
1287
00:56:08,200 --> 00:56:10,800
in order to filter out
the data of the matches
1288
00:56:10,800 --> 00:56:13,581
which were conducted
in the year 2016 and I
1289
00:56:13,581 --> 00:56:15,900
have saved the data
into the new RDD,
1290
00:56:15,900 --> 00:56:18,300
which is fil2. Now,
1291
00:56:18,300 --> 00:56:20,889
let us understand
the union transformation
1292
00:56:20,889 --> 00:56:21,900
where we'll apply
00:56:21,900 --> 00:56:26,400
the union transformation on
to the filRDD and fil2 RDD
1294
00:56:26,400 --> 00:56:29,100
in order to combine
the data present
1295
00:56:29,100 --> 00:56:30,816
in both the RDDs. Here,
1296
00:56:30,816 --> 00:56:32,232
I'm creating a new rdd
1297
00:56:32,232 --> 00:56:35,931
by the name unionRDD, and I'm
applying the union transformation
1298
00:56:35,931 --> 00:56:38,600
on the two RDDs
that we created before.
1299
00:56:38,600 --> 00:56:42,400
The first one is filRDD,
which consists of the data
1300
00:56:42,400 --> 00:56:44,818
of the matches played
in the year 2017.
1301
00:56:44,818 --> 00:56:46,633
And the second one is fil2,
1302
00:56:46,633 --> 00:56:49,295
which consists of
the data of the matches
1303
00:56:49,295 --> 00:56:52,469
which were played in the year
2016. Here I'll be clubbing
1304
00:56:52,469 --> 00:56:53,921
both the RDDs together
1305
00:56:53,921 --> 00:56:56,700
and I'll be saving the data
into the new rdd.
1306
00:56:56,701 --> 00:56:58,163
which is unionRDD.
1307
00:56:58,600 --> 00:57:02,600
Now let us display the data
which is present in the new RDD,
1308
00:57:02,600 --> 00:57:04,100
which is unionRDD.
1309
00:57:04,100 --> 00:57:06,100
I am using collect
action command in order
1310
00:57:06,100 --> 00:57:07,100
to display the data.
1311
00:57:07,300 --> 00:57:09,800
So here we have the data
of the matches
1312
00:57:09,800 --> 00:57:11,400
which were played in the years
1313
00:57:11,400 --> 00:57:13,400
2016 and 2017.
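A sketch of the Hyderabad exclusion and the season filters plus union described above, reusing CKfile; the season is assumed to be column index 1:

    val filRDD   = CKfile.filter(line => !line.contains("Hyderabad"))    // drop Hyderabad matches
    val fil      = CKfile.filter(line => line.split(",")(1) == "2017")   // 2017 season
    val fil2     = CKfile.filter(line => line.split(",")(1) == "2016")   // 2016 season
    val unionRDD = fil.union(fil2)                                       // club both years together
    unionRDD.collect()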
1314
00:57:13,900 --> 00:57:16,306
And now let's continue
with our operations
1315
00:57:16,306 --> 00:57:19,188
and find out the player
with maximum number of man
1316
00:57:19,188 --> 00:57:21,603
of the match awards
for this operation.
1317
00:57:21,603 --> 00:57:23,293
I am applying map transformation
1318
00:57:23,293 --> 00:57:25,345
and splitting out
the column number 13,
1319
00:57:25,345 --> 00:57:28,314
which consists of the data
of the players who won the man
1320
00:57:28,314 --> 00:57:30,800
of the match awards
for that particular match.
1321
00:57:30,800 --> 00:57:33,252
So the transformation
is been successfully applied
1322
00:57:33,252 --> 00:57:35,752
and column number
13 has been successfully split,
1323
00:57:35,752 --> 00:57:37,700
and the data has been
stored into the
1324
00:57:37,700 --> 00:57:39,238
of the match our DD now.
1325
00:57:39,238 --> 00:57:42,155
We are creating a new rdd
by the named man
1326
00:57:42,155 --> 00:57:45,600
of the match count me applying
map Transformations on
1327
00:57:45,600 --> 00:57:46,800
to a previous rdd
1328
00:57:46,800 --> 00:57:48,300
and we are counting the number
1329
00:57:48,300 --> 00:57:51,300
of awards won by each and
every particular player.
1330
00:57:51,700 --> 00:57:55,733
Now, we shall create a new ID
by the named man of the match
1331
00:57:55,733 --> 00:57:59,500
and we are applying reduced
by K. Under the previous added
1332
00:57:59,500 --> 00:58:01,311
which is man of the match count.
1333
00:58:01,311 --> 00:58:03,765
And again, we are applying
map transformation
1334
00:58:03,765 --> 00:58:06,600
and considering topple one
as the name of the player
1335
00:58:06,600 --> 00:58:08,843
and topple to as
the number of matches.
1336
00:58:08,843 --> 00:58:11,500
He played and won the man
of the match Awards,
1337
00:58:11,500 --> 00:58:14,794
let us use take action command
in order to print the data
1338
00:58:14,794 --> 00:58:18,000
which is stored in our new RTD
which is man of the match.
1339
00:58:18,200 --> 00:58:21,400
So according to the result
we have a bws
1340
00:58:21,400 --> 00:58:24,000
who won the maximum number
of man of the matches,
1341
00:58:24,000 --> 00:58:24,923
which is 15.
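A sketch of that man-of-the-match pipeline; column index 13 follows the narration, and the final RDD is renamed topPlayers here to avoid clashing with the earlier name:

    val manOfTheMatch      = CKfile.map(_.split(",")(13))            // player_of_match column
    val manOfTheMatchCount = manOfTheMatch.map(player => (player, 1))
    val topPlayers = manOfTheMatchCount.reduceByKey(_ + _)
                                       .map { case (player, n) => (n, player) }
                                       .sortByKey(false)
    topPlayers.take(10)   // AB de Villiers leads with 15 awards in this data set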
1342
00:58:25,800 --> 00:58:29,129
So these are the few operations
that were performed on rdds.
1343
00:58:29,129 --> 00:58:31,600
Now, let us move on
to our Pokémon use case
1344
00:58:31,600 --> 00:58:34,800
so that we can understand
our duties in a much better way.
1345
00:58:35,800 --> 00:58:39,300
So the steps to be performed
in the Pokémon use case are loading
1346
00:58:39,300 --> 00:58:41,164
the Pokemon data.csv file
1347
00:58:41,164 --> 00:58:44,624
from an external storage
into an RDD, removing the schema
1348
00:58:44,624 --> 00:58:46,700
from the Pokémon
data dot CSV file
1349
00:58:46,700 --> 00:58:49,730
and finding out the total number
of water type Pokemon
1350
00:58:49,730 --> 00:58:52,117
finding the total number
of fire type Pokemon.
1351
00:58:52,117 --> 00:58:53,882
I know it's getting interesting.
1352
00:58:53,882 --> 00:58:57,000
So let me explain you each
and every step practically.
1353
00:58:57,700 --> 00:59:00,200
So here I am creating
a new RDD
1354
00:59:00,200 --> 00:59:02,400
by name PokemonDataRDD1,
1355
00:59:02,400 --> 00:59:05,700
and I'm loading my CSV file
from an external storage.
1356
00:59:05,700 --> 00:59:08,100
That is my hdfs as a text file.
1357
00:59:08,100 --> 00:59:11,800
So the Pokemon data.csv file
has been successfully loaded
1358
00:59:11,800 --> 00:59:12,800
into our new rdd.
1359
00:59:12,800 --> 00:59:14,100
So let us display the data
1360
00:59:14,100 --> 00:59:17,100
which is present
in our PokemonDataRDD1.
1361
00:59:17,200 --> 00:59:19,700
I am using collect
action command for this.
1362
00:59:20,000 --> 00:59:23,900
So here we have 721 rows
of data of all the types
1363
00:59:23,900 --> 00:59:28,979
of Pokemons we have. So now
let us display the schema
1364
00:59:28,979 --> 00:59:30,441
of the data we have
1365
00:59:30,700 --> 00:59:33,900
I have used the action command
first in order to display
1366
00:59:33,900 --> 00:59:35,727
the first line of a CSV file
1367
00:59:35,727 --> 00:59:38,600
which happens to be
the schema of a CSV file.
1368
00:59:38,600 --> 00:59:40,000
So we have index
1369
00:59:40,000 --> 00:59:42,100
of the Pokemon name
of the Pokémon.
1370
00:59:42,100 --> 00:59:46,700
its type, total points,
HP, attack points, defense points,
1371
00:59:46,992 --> 00:59:50,607
special attack, special
defense, speed, generation,
1372
00:59:50,700 --> 00:59:51,938
and we can also find
1373
00:59:51,938 --> 00:59:54,600
if a particular Pokemon
is legendary or not.
1374
00:59:55,773 --> 00:59:57,926
Here, I'm creating a new RDD
1375
00:59:58,000 --> 00:59:59,400
which is NoHeader,
1376
00:59:59,400 --> 01:00:02,800
and I'm using filter operation
in order to remove the schema
1377
01:00:02,800 --> 01:00:04,900
of a Pokemon data dot CSV file.
1378
01:00:04,900 --> 01:00:08,407
The schema of the Pokemon
data.csv file has been removed
1379
01:00:08,407 --> 01:00:10,705
because Spark
considers the schema
1380
01:00:10,705 --> 01:00:12,300
as a data to be processed.
1381
01:00:12,300 --> 01:00:13,480
So for this reason,
1382
01:00:13,480 --> 01:00:16,500
we remove the schema now,
let's display the data
1383
01:00:16,500 --> 01:00:19,000
which is present
in the NoHeader RDD.
1384
01:00:19,000 --> 01:00:20,441
I am using action command
1385
01:00:20,441 --> 01:00:22,500
collect in order
to display the data
1386
01:00:22,500 --> 01:00:24,700
which is present
in no header rdd.
1387
01:00:24,900 --> 01:00:26,104
So this is the data
1388
01:00:26,104 --> 01:00:28,195
which is stored
in the NoHeader RDD
1389
01:00:28,195 --> 01:00:29,400
without the schema.
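A sketch of those two steps; the HDFS path and exact file name are illustrative:

    val PokemonDataRDD1 = sc.textFile("hdfs://localhost:9000/pokemon/PokemonData.csv")
    PokemonDataRDD1.collect()
    val header   = PokemonDataRDD1.first()                        // the schema line
    val NoHeader = PokemonDataRDD1.filter(line => line != header)
    NoHeader.collect()
    NoHeader.partitions.length                                    // 2 partitions in this run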
1390
01:00:31,200 --> 01:00:33,978
So now let us find out
the number of partitions
1391
01:00:33,978 --> 01:00:37,300
into which our NoHeader RDD
has been split.
1392
01:00:37,300 --> 01:00:40,320
So I am using partitions
transformation in order to find
1393
01:00:40,320 --> 01:00:42,060
out the number of partitions.
1394
01:00:42,060 --> 01:00:45,000
the data was split into.
According to the result,
1395
01:00:45,000 --> 01:00:48,300
the NoHeader RDD has been split
into two partitions.
1396
01:00:48,600 --> 01:00:52,000
I am here creating a new RDD
by name WaterRDD
1397
01:00:52,000 --> 01:00:55,100
and I'm using filter
transformation in order to find
1398
01:00:55,100 --> 01:00:59,000
out the water-type Pokemons in
our Pokemon data.csv file.
1399
01:00:59,600 --> 01:01:02,800
I'm using action command collect
in order to print the data
1400
01:01:02,800 --> 01:01:04,900
which is present in WaterRDD.
1401
01:01:05,200 --> 01:01:08,000
So these are the total number
of water type Pokemon
1402
01:01:08,000 --> 01:01:10,528
that we have in our
Pokémon data dot CSV.
1403
01:01:10,528 --> 01:01:11,160
Similarly.
1404
01:01:11,160 --> 01:01:13,500
Let's find out
the fire type Pokemons.
1405
01:01:14,600 --> 01:01:17,500
I'm creating a new RDD
by the name FireRDD
1406
01:01:17,500 --> 01:01:20,523
and applying filter operation
in order to find out
1407
01:01:20,523 --> 01:01:23,300
the fire type Pokemon
present in our CSV file.
1408
01:01:24,200 --> 01:01:27,200
I'm using collect action command
in order to print the data
1409
01:01:27,200 --> 01:01:29,200
which is present in FireRDD.
1410
01:01:29,400 --> 01:01:32,100
So these are the fire type
Pokemon which are present
1411
01:01:32,100 --> 01:01:34,400
in our Pokémon
data dot CSV file.
1412
01:01:34,600 --> 01:01:37,600
Now, let us count the total
number of water type Pokemon
1413
01:01:37,600 --> 01:01:40,400
which are present
in a Pokemon data dot CSV file.
1414
01:01:40,400 --> 01:01:44,500
I am using count action for this
and we have 112 water type
1415
01:01:44,500 --> 01:01:47,400
Pokemons present in
our Pokemon data.csv file.
1416
01:01:47,400 --> 01:01:47,924
Similarly.
1417
01:01:47,924 --> 01:01:50,600
Let's find out the total number
of fire-type Pokémon
1418
01:01:50,600 --> 01:01:54,300
we have. I'm using the count
action command for the same.
1419
01:01:54,300 --> 01:01:56,178
So we have a total of 52
1420
01:01:56,178 --> 01:01:59,800
fire-type Pokemons in our
Pokemon data.csv file.
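A sketch of those filters and counts, reusing the NoHeader RDD from the sketch above:

    // keep only the rows whose type column mentions Water / Fire
    val WaterRDD = NoHeader.filter(line => line.contains("Water"))
    val FireRDD  = NoHeader.filter(line => line.contains("Fire"))
    WaterRDD.count()   // 112 water-type Pokemon in this file
    FireRDD.count()    // 52 fire-type Pokemon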
1421
01:01:59,800 --> 01:02:01,992
Let's continue with
our further operations
1422
01:02:01,992 --> 01:02:05,200
where we'll find out the highest
defense strength of a Pokémon.
1423
01:02:05,300 --> 01:02:08,400
I am creating a new RDD
by the name DefenseList
1424
01:02:08,400 --> 01:02:10,400
and I'm applying
map transformation
1425
01:02:10,400 --> 01:02:12,935
and splitting out
column number six in order
1426
01:02:12,935 --> 01:02:14,500
to extract the defense points
1427
01:02:14,500 --> 01:02:18,100
of all the Pokemons present in
our Pokémon data dot CSV file.
1428
01:02:18,300 --> 01:02:21,400
So the data has been stored
successfully into the new RDD,
1429
01:02:21,400 --> 01:02:23,100
which is DefenseList.
1430
01:02:23,500 --> 01:02:23,700
Now.
1431
01:02:23,700 --> 01:02:26,249
I'm using the max action command
in order to print out
1432
01:02:26,249 --> 01:02:29,100
the maximum defense strength
out of all the Pokemons.
1433
01:02:29,200 --> 01:02:32,576
So we have 230 points as
the maximum defense strength
1434
01:02:32,576 --> 01:02:34,200
amongst all the Pokemons.
1435
01:02:34,200 --> 01:02:35,702
So in our further operations,
1436
01:02:35,702 --> 01:02:38,502
let's find out the Pokemons
which come under the category
1437
01:02:38,502 --> 01:02:40,600
of having the highest
defense strength,
1438
01:02:40,600 --> 01:02:42,400
which is 230 points.
1439
01:02:43,100 --> 01:02:45,456
In order to find out
the name of the Pokemon
1440
01:02:45,456 --> 01:02:47,100
with highest defense strength.
1441
01:02:47,100 --> 01:02:49,182
I'm creating a new RDD
with the name
1442
01:02:49,182 --> 01:02:51,717
DefenseWithPokemonName,
and I'm applying
1443
01:02:51,717 --> 01:02:54,000
map transformation on
to the previous RDD,
1444
01:02:54,000 --> 01:02:55,000
which is NoHeader,
1445
01:02:55,000 --> 01:02:56,062
and I'm splitting out
1446
01:02:56,062 --> 01:02:59,100
column number six, which happens
to be the defense strength,
1447
01:02:59,100 --> 01:03:02,300
in order to extract the data
from that particular row,
1448
01:03:02,300 --> 01:03:05,100
which has the defense
strength as 230 points.
1449
01:03:05,769 --> 01:03:08,230
Now I'm creating a new RDD again
1450
01:03:08,300 --> 01:03:11,500
with the name MaxDefensePokemon,
and I'm applying
1451
01:03:11,500 --> 01:03:15,100
groupByKey transformation
in order to display the Pokemons
1452
01:03:15,100 --> 01:03:18,675
which have the maximum defense
points that is 230 points.
1453
01:03:18,675 --> 01:03:20,400
So according to the result.
1454
01:03:20,400 --> 01:03:23,400
we have Steelix, Steelix
Mega, Shuckle, Aggron
1455
01:03:23,400 --> 01:03:24,500
and Aggron Mega
1456
01:03:24,500 --> 01:03:27,200
as the Pokemons with
the highest defense strength,
1457
01:03:27,200 --> 01:03:28,800
which is 230 points.
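A sketch of the defense analysis; column index 6 for defense and index 1 for the name are taken from the narration, and the RDD names are assumed spellings:

    // pull out the Defense column as doubles and find the strongest value
    val DefenseList = NoHeader.map(line => line.split(",")(6).toDouble)
    DefenseList.max()                                            // 230.0
    // pair (defense, name) and group the names that share the top defense value
    val DefenseWithPokemonName = NoHeader.map(_.split(","))
                                         .map(cols => (cols(6).toDouble, cols(1)))
    val MaxDefensePokemon = DefenseWithPokemonName.groupByKey()
                                                  .sortByKey(false)
                                                  .take(1)       // (230.0, the names listed above)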
1458
01:03:28,800 --> 01:03:31,100
Now we shall find
out the Pokemon
1459
01:03:31,100 --> 01:03:33,600
which has the least
defense strength.
1460
01:03:34,200 --> 01:03:35,900
So before we find
out the Pokemon
1461
01:03:35,900 --> 01:03:37,580
with the least defense strength,
1462
01:03:37,580 --> 01:03:39,694
let us find out
the least defense points
1463
01:03:39,694 --> 01:03:41,700
which are present
in the defense list.
1464
01:03:42,900 --> 01:03:45,100
So in order to find
out the Pokémon
1465
01:03:45,100 --> 01:03:46,788
with the least defense strength,
1466
01:03:46,788 --> 01:03:48,200
I have created a new RDD
1467
01:03:48,200 --> 01:03:51,654
by name MinDefensePokemon
and I have applied distinct
1468
01:03:51,654 --> 01:03:54,900
and sortBy transformations
onto the DefenseList RDD
1469
01:03:54,900 --> 01:03:57,900
in order to extract
the least defense points present
1470
01:03:57,900 --> 01:03:58,955
in the defense list
1471
01:03:58,955 --> 01:04:01,484
and I have used take
action command in order
1472
01:04:01,484 --> 01:04:02,600
to display the data
1473
01:04:02,600 --> 01:04:05,300
which is present
in the MinDefensePokemon RDD.
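A sketch of that minimum-defense lookup:

    // the smallest distinct defense value in the file
    val MinDefense = DefenseList.distinct().sortBy(x => x)
    MinDefense.take(1)   // Array(5.0)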
1474
01:04:05,300 --> 01:04:06,700
So according to the results,
1475
01:04:06,700 --> 01:04:09,300
we have five points as
the least defense strength
1476
01:04:09,300 --> 01:04:11,053
of a particular Pokémon now,
1477
01:04:11,053 --> 01:04:13,148
let us find out
the name of the Pokemon
1478
01:04:13,148 --> 01:04:16,650
which comes under the category
of having Five Points as
1479
01:04:16,650 --> 01:04:18,290
its defense strength. Now,
1480
01:04:18,290 --> 01:04:19,808
let us create a new RDD,
which is DefenseWithPokemonName2,
01:04:19,808 --> 01:04:23,956
which is difference Pokemon name
and apply map transformation
1482
01:04:23,956 --> 01:04:27,217
and split the column number 6
and store the data
1483
01:04:27,217 --> 01:04:28,259
into our new rdd
1484
01:04:28,259 --> 01:04:30,800
which is defense
with Pokemon name, too.
1485
01:04:32,000 --> 01:04:34,500
The transformation has
been successfully applied
1486
01:04:34,500 --> 01:04:36,970
and the data is now
stored into the new rdd
1487
01:04:36,970 --> 01:04:37,900
which is defense
1488
01:04:37,900 --> 01:04:41,900
DefenseWithPokemonName2. The data
has been successfully loaded.
1489
01:04:41,900 --> 01:04:45,500
Now, let us apply
the further operations here.
1490
01:04:45,538 --> 01:04:50,000
I am creating another rdd with
name minimum defense Pokemon
1491
01:04:50,000 --> 01:04:53,400
and I'm applying groupByKey
transformation in order
1492
01:04:53,400 --> 01:04:55,500
to extract the data from the row
1493
01:04:55,500 --> 01:04:58,206
which has the defense
points as 5.0.
1494
01:04:58,500 --> 01:05:01,829
The data has been successfully
loaded. Now let us display
1495
01:05:01,829 --> 01:05:03,300
The data which is present
1496
01:05:03,300 --> 01:05:07,307
in the minimum defense Pokemon RDD.
Now, according to the results,
1497
01:05:07,307 --> 01:05:09,073
we have two Pokemons
which come under the category
01:05:09,073 --> 01:05:12,098
which come under the category
of having Five Points
1499
01:05:12,098 --> 01:05:15,400
as their defense strength:
the Pokemons Chansey
1500
01:05:15,400 --> 01:05:17,500
and Happiny are
the two Pokemons
1501
01:05:17,500 --> 01:05:24,500
which have the least
defense strength. The world
1502
01:05:24,500 --> 01:05:26,100
of Information Technology
1503
01:05:26,100 --> 01:05:29,786
and big data processing started
to see multiple potentialities
1504
01:05:29,786 --> 01:05:31,600
from spark coming into action.
1505
01:05:31,700 --> 01:05:34,685
One such pinnacle in Spark's
technology advancements is
1506
01:05:34,685 --> 01:05:35,600
the data frame.
1507
01:05:35,600 --> 01:05:38,200
And today we shall
understand the technicalities
1508
01:05:38,200 --> 01:05:39,000
of data frames
1509
01:05:39,000 --> 01:05:42,500
in Spark. A data frame in Spark
is all about performance.
1510
01:05:42,500 --> 01:05:46,300
It is a powerful multifunctional
and an integrated data structure
1511
01:05:46,300 --> 01:05:49,100
where the programmer can work
with different libraries
1512
01:05:49,100 --> 01:05:52,000
and perform numerous
functionalities without breaking
1513
01:05:52,000 --> 01:05:53,529
a sweat, and to understand the APIs
and libraries involved
01:05:53,529 --> 01:05:54,823
and libraries involved
1515
01:05:54,823 --> 01:05:57,500
in the process
without wasting any time.
1516
01:05:57,500 --> 01:06:00,000
Let us understand a topic
for today's discussion.
1517
01:06:00,000 --> 01:06:01,900
I have lined up the docket
for understanding
1518
01:06:01,900 --> 01:06:03,800
data frames in Spark as below,
1519
01:06:03,800 --> 01:06:06,962
which will begin with
what data frames are. Here,
1520
01:06:06,962 --> 01:06:09,700
We will learn what
exactly a data frame is.
1521
01:06:09,700 --> 01:06:13,706
What does it look like and what
are its functionalities? Then we
1522
01:06:13,706 --> 01:06:16,400
shall see why do we need
data frames here?
1523
01:06:16,400 --> 01:06:18,900
We shall understand
the requirements which led us
1524
01:06:18,900 --> 01:06:21,200
to the invention
of data frames later.
1525
01:06:21,200 --> 01:06:23,400
I'll walk you through
the important features
1526
01:06:23,400 --> 01:06:24,282
of data frames.
1527
01:06:24,282 --> 01:06:25,400
Then we should look
1528
01:06:25,400 --> 01:06:28,000
into the sources from which
the data frames and Spark
1529
01:06:28,000 --> 01:06:31,000
get their data from. Once
the theory part is finished.
1530
01:06:31,000 --> 01:06:33,400
I will get us involved
into the Practical part
1531
01:06:33,400 --> 01:06:35,700
where the creation
of a dataframe happens to be
1532
01:06:35,700 --> 01:06:39,400
a first step next we shall work
with an interesting example,
1533
01:06:39,400 --> 01:06:41,100
which is related to football
1534
01:06:41,100 --> 01:06:43,237
and finally to understand
the data frames
1535
01:06:43,237 --> 01:06:44,200
in spark in a much
1536
01:06:44,200 --> 01:06:46,980
better way we should work
with the most trending topic
1537
01:06:46,980 --> 01:06:47,711
as our use case,
1538
01:06:47,711 --> 01:06:50,300
which is none other
than the Game of Thrones.
1539
01:06:50,400 --> 01:06:52,100
So let's get started.
1540
01:06:52,200 --> 01:06:55,500
What is a data frame? In simple terms, a data frame
1541
01:06:55,500 --> 01:06:58,617
can be considered as a
distributed collection of data.
1542
01:06:58,617 --> 01:07:01,156
The data is organized
under named columns,
1543
01:07:01,156 --> 01:07:04,500
which provide us the operations to filter, group, process
1544
01:07:04,500 --> 01:07:08,205
and aggregate the available data
data frames can also be used
1545
01:07:08,205 --> 01:07:11,100
with Spark SQL, and we
can construct data frames
1546
01:07:11,100 --> 01:07:14,800
from structured data files rdds
or from an external storage
1547
01:07:14,800 --> 01:07:17,500
like HDFS, Hive, Cassandra, HBase
1548
01:07:17,500 --> 01:07:19,676
and many more with
this we should look
1549
01:07:19,676 --> 01:07:21,500
into a more simplified example,
1550
01:07:21,500 --> 01:07:24,455
which will give us a basic
description of a data frame.
1551
01:07:24,455 --> 01:07:26,700
So we shall deal
with an employee database
1552
01:07:26,700 --> 01:07:29,229
where we have entities
and their data types.
1553
01:07:29,229 --> 01:07:31,817
So the name of the employee
is a first entity
1554
01:07:31,817 --> 01:07:33,500
And its respective data type
1555
01:07:33,500 --> 01:07:37,102
is string data type similarly
employee ID has data type
1556
01:07:37,102 --> 01:07:39,004
of string employee phone number
1557
01:07:39,004 --> 01:07:40,646
which is integer data type
1558
01:07:40,646 --> 01:07:43,642
and employee address happens
to be string data type.
1559
01:07:43,642 --> 01:07:46,700
And finally the employee salary
is float data type.
1560
01:07:46,700 --> 01:07:49,500
All this data is stored
into an external storage,
1561
01:07:49,500 --> 01:07:51,093
which may be hdfs Hive
1562
01:07:51,093 --> 01:07:53,700
or Cassandra using
the data frame API
1563
01:07:53,700 --> 01:07:55,200
with their respective schema,
1564
01:07:55,200 --> 01:07:56,500
which consists of the name
1565
01:07:56,500 --> 01:07:58,913
of the entity along
with this data type now
1566
01:07:58,913 --> 01:08:01,900
that we have understood what
exactly a data frame is.
1567
01:08:01,900 --> 01:08:03,910
Let us quickly move on
to our next stage
1568
01:08:03,910 --> 01:08:06,900
where we shall understand the
requirement for a data frame.
1569
01:08:07,000 --> 01:08:07,806
It provides us
1570
01:08:07,806 --> 01:08:10,400
multiple programming
language supportability.
1571
01:08:10,400 --> 01:08:13,670
It has the capacity to work
with multiple data sources,
1572
01:08:13,670 --> 01:08:16,904
it can process both structured
and unstructured data.
1573
01:08:16,904 --> 01:08:19,455
And finally it is
well versed with slicing
1574
01:08:19,455 --> 01:08:20,681
and dicing the data.
1575
01:08:20,681 --> 01:08:21,723
So the first one is
1576
01:08:21,723 --> 01:08:24,900
the support ability for
multiple programming languages.
1577
01:08:24,900 --> 01:08:26,937
The IT industry
required a powerful
1578
01:08:26,937 --> 01:08:28,700
and an integrated data structure
1579
01:08:28,700 --> 01:08:29,500
which could support
1580
01:08:29,500 --> 01:08:31,800
multiple programming languages
at the same
1581
01:08:31,800 --> 01:08:33,900
time without
the requirement of
1582
01:08:33,900 --> 01:08:36,900
an additional API. Data frame was the one-stop solution
1583
01:08:36,900 --> 01:08:39,900
which supported multiple
languages along with a single
1584
01:08:39,900 --> 01:08:41,982
API. The most popular languages
1585
01:08:41,982 --> 01:08:45,046
that a data frame could support are R, Python,
1586
01:08:45,046 --> 01:08:48,777
Scala, Java and many more. The next requirement
1587
01:08:48,777 --> 01:08:51,500
was to support
the multiple data sources.
1588
01:08:51,500 --> 01:08:53,608
We all know that in
a real-time approach
1589
01:08:53,608 --> 01:08:55,700
to data processing
will never end up
1590
01:08:55,700 --> 01:08:57,700
at a single data
source data frame is
1591
01:08:57,700 --> 01:08:59,057
one such data structure,
1592
01:08:59,057 --> 01:09:02,000
which has the capability
to support and process data.
1593
01:09:02,000 --> 01:09:05,615
From a variety of data
sources Hadoop Cassandra.
1594
01:09:05,615 --> 01:09:07,207
Json files hbase.
1595
01:09:07,207 --> 01:09:10,284
CSV files are the examples
to name a few.
1596
01:09:10,300 --> 01:09:12,947
The next requirement was
to process structured
1597
01:09:12,947 --> 01:09:14,200
and unstructured data.
1598
01:09:14,200 --> 01:09:17,400
The Big Data environment was
designed to store huge amount
1599
01:09:17,400 --> 01:09:18,487
of data regardless
1600
01:09:18,487 --> 01:09:19,755
of which type exactly
1601
01:09:19,755 --> 01:09:22,827
it is now Sparks data frame
is designed in such a way
1602
01:09:22,827 --> 01:09:25,994
that it can store a huge
collection of both structured
1603
01:09:25,994 --> 01:09:27,249
and unstructured data
1604
01:09:27,249 --> 01:09:29,900
in a tabular format
along with its schema.
1605
01:09:29,900 --> 01:09:33,300
The next requirement was slicing and dicing data. Now,
1606
01:09:33,300 --> 01:09:34,300
the humongous amount
1607
01:09:34,300 --> 01:09:37,400
of data stored in Sparks
data frame can be sliced
1608
01:09:37,400 --> 01:09:40,975
and diced using the operations
like filter select group
1609
01:09:40,975 --> 01:09:42,300
by order by and many
1610
01:09:42,300 --> 01:09:45,100
more these operations
are applied upon the data
1611
01:09:45,100 --> 01:09:47,456
which are stored in form
of rows and columns
1612
01:09:47,456 --> 01:09:50,388
in a data frame. These were a few crucial requirements
1613
01:09:50,388 --> 01:09:52,700
which led to the invention
of data frames.
1614
01:09:52,800 --> 01:09:55,173
Now, let us get
into the important features
1615
01:09:55,173 --> 01:09:55,997
of data frames
1616
01:09:55,997 --> 01:09:58,700
which bring it an edge
over the other alternatives.
1617
01:09:59,100 --> 01:10:02,400
Immutability, lazy evaluation, fault tolerance,
1618
01:10:02,400 --> 01:10:04,400
and distributed memory storage,
1619
01:10:04,400 --> 01:10:07,800
let us discuss about each
and every feature in detail.
1620
01:10:07,800 --> 01:10:10,600
So the first one is
immutability similar to
1621
01:10:10,600 --> 01:10:13,295
the resilient distributed data
sets the data frames
1622
01:10:13,295 --> 01:10:16,688
in Spark are also immutable. The term immutable depicts
1623
01:10:16,688 --> 01:10:18,100
that the data was stored
1624
01:10:18,100 --> 01:10:20,300
into a data frame
will not be altered.
1625
01:10:20,300 --> 01:10:23,100
The only way to alter the data
present in a data frame
1626
01:10:23,100 --> 01:10:25,700
would be by applying
simple transformation operations
1627
01:10:25,700 --> 01:10:26,600
on to them.
1628
01:10:26,600 --> 01:10:28,900
So the next feature
is lazy evaluation.
1629
01:10:28,900 --> 01:10:32,126
Lazy evaluation
is the key to the remarkable
1630
01:10:32,126 --> 01:10:36,100
performance offered by spark
similar to the rdds data frames
1631
01:10:36,100 --> 01:10:38,999
in spark will not throw
any output onto the screen
1632
01:10:38,999 --> 01:10:41,900
until and unless an action
command is encountered.
1633
01:10:41,900 --> 01:10:44,300
The next feature
is Fault tolerance.
1634
01:10:44,300 --> 01:10:45,182
There is no way
1635
01:10:45,182 --> 01:10:47,900
that the Sparks data frames
can lose their data.
1636
01:10:47,900 --> 01:10:50,300
They follow the principle
of being fault tolerant
1637
01:10:50,300 --> 01:10:51,782
to the unexpected calamities
1638
01:10:51,782 --> 01:10:53,900
which tend to destroy
the available data.
1639
01:10:53,900 --> 01:10:55,893
The next feature is distributed
1640
01:10:55,893 --> 01:10:58,590
storage. Spark's data frames distribute the data
1641
01:10:58,590 --> 01:11:00,000
across multiple locations
1642
01:11:00,000 --> 01:11:03,294
so that in case of a node
failure the next available node
1643
01:11:03,294 --> 01:11:05,900
can take its place to continue
the data processing.
1644
01:11:05,900 --> 01:11:08,700
The next stage will be
about the multiple data source
1645
01:11:08,700 --> 01:11:12,204
that the spark dataframe
can support the spark API
1646
01:11:12,204 --> 01:11:13,690
can integrate itself
1647
01:11:13,690 --> 01:11:17,700
with multiple programming
languages such as Scala, Java,
1648
01:11:17,700 --> 01:11:19,300
Python, R, MySQL
1649
01:11:19,300 --> 01:11:22,600
and many more making
itself capable to handle
1650
01:11:22,600 --> 01:11:26,700
a variety of data sources
such as Hadoop Hive hbase
1651
01:11:26,800 --> 01:11:28,500
Cassandra, Json file.
1652
01:11:28,600 --> 01:11:31,600
As CSV files my SQL
and many more.
1653
01:11:32,200 --> 01:11:33,726
So this was the theory part
1654
01:11:33,726 --> 01:11:36,100
and now let us move
into the Practical part
1655
01:11:36,100 --> 01:11:37,000
where the creation
1656
01:11:37,000 --> 01:11:39,500
of a dataframe happens
to be a first step.
1657
01:11:40,100 --> 01:11:42,412
So before we begin
the Practical part,
1658
01:11:42,412 --> 01:11:43,975
let us load the libraries
1659
01:11:43,975 --> 01:11:47,600
which are required in order to
process the data in data frames.
1660
01:11:48,200 --> 01:11:50,822
So these are the few libraries
which we required
1661
01:11:50,822 --> 01:11:53,600
before we process the data
using our data frames.
1662
01:11:54,200 --> 01:11:56,300
Now that we have loaded
all the libraries
1663
01:11:56,300 --> 01:11:59,393
which we required to process
the data using the data frames.
1664
01:11:59,393 --> 01:12:01,914
Let us begin with the creation
of our data frame.
1665
01:12:01,914 --> 01:12:05,000
So we shall create a new data
frame with the name employee
1666
01:12:05,000 --> 01:12:05,935
and load the data
1667
01:12:05,935 --> 01:12:08,300
of the employees present
in an organization.
1668
01:12:08,300 --> 01:12:11,400
The details of the employees
will consist the first name
1669
01:12:11,400 --> 01:12:14,968
the last name and their mail ID
along with their salary.
1670
01:12:14,968 --> 01:12:18,500
So the first data frame has been successfully created. Now,
1671
01:12:18,500 --> 01:12:20,700
let us design the schema
for this data frame.
1672
01:12:21,600 --> 01:12:24,100
So the schema for this data frame has been described
1673
01:12:24,100 --> 01:12:27,900
as shown the first name is of
string data type and similarly.
1674
01:12:27,900 --> 01:12:29,900
The last name is
a string data type
1675
01:12:29,900 --> 01:12:31,500
along with the mail address.
1676
01:12:31,500 --> 01:12:34,500
And finally the salary
is integer data type
1677
01:12:34,500 --> 01:12:37,000
or you can give
float data type also,
1678
01:12:37,000 --> 01:12:39,882
so the schema has been successfully defined. Now,
1679
01:12:39,882 --> 01:12:41,600
let us create
the data frame using
1680
01:12:41,600 --> 01:12:43,700
the createDataFrame function here.
1681
01:12:43,700 --> 01:12:47,260
I'm creating a new data frame
by starting a spark context
1682
01:12:47,260 --> 01:12:50,200
and using the create
data frame method and loading
1683
01:12:50,200 --> 01:12:52,800
the data from Employee
and the employee schema.
1684
01:12:52,800 --> 01:12:55,200
The data frame is
successfully created now,
1685
01:12:55,200 --> 01:12:56,200
let's print the data
1686
01:12:56,200 --> 01:12:59,353
which is existing
in the dataframe EMP DF.
1687
01:13:00,273 --> 01:13:02,426
I am using show method here.
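A minimal Scala sketch of the steps just described, assuming spark is the SparkSession of the shell; the sample rows and field names are only illustrative:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
// Illustrative employee records: first name, last name, mail ID, salary
val employee = Seq(
  Row("John", "Smith", "john.smith@example.com", 30000),
  Row("Asha", "Rao", "asha.rao@example.com", 45000))
// Schema as described: three string fields and an integer (or float) salary
val employeeSchema = StructType(List(
  StructField("firstName", StringType, nullable = true),
  StructField("lastName", StringType, nullable = true),
  StructField("email", StringType, nullable = true),
  StructField("salary", IntegerType, nullable = true)))
val empDf = spark.createDataFrame(
  spark.sparkContext.parallelize(employee), employeeSchema)
empDf.show()   // prints the employee data in tabular form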
1688
01:13:03,200 --> 01:13:03,907
So the data
1689
01:13:03,907 --> 01:13:07,700
which is present in EMP DF has been successfully printed. Now,
1690
01:13:07,700 --> 01:13:09,600
let us move on to the next step.
1691
01:13:09,800 --> 01:13:12,800
So the next step for our today's
discussion is working
1692
01:13:12,800 --> 01:13:15,500
with an example related
to the FIFA data set.
1693
01:13:16,100 --> 01:13:18,217
So the first step
in our FIFA example
1694
01:13:18,217 --> 01:13:20,772
would be loading the schema
for the CSV file.
1695
01:13:20,772 --> 01:13:22,000
We are working with so
1696
01:13:22,000 --> 01:13:24,400
the schema has been
successfully loaded now.
1697
01:13:24,400 --> 01:13:28,066
Now let us load the CSV file
from our external storage
1698
01:13:28,066 --> 01:13:30,600
which is hdfs
into our data frame,
1699
01:13:30,600 --> 01:13:31,907
which is FIFA DF.
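As a hedged Scala sketch of this loading step (the HDFS path, the fifaSchema name and the handful of columns shown are assumptions; the real file has many more columns):
import org.apache.spark.sql.types._
val fifaSchema = StructType(List(
  StructField("ID", IntegerType, nullable = true),
  StructField("Name", StringType, nullable = true),
  StructField("Age", IntegerType, nullable = true),
  StructField("Nationality", StringType, nullable = true),
  StructField("Club", StringType, nullable = true),
  StructField("Value", StringType, nullable = true)))
val fifaDf = spark.read
  .option("header", "true")              // the CSV carries a header row
  .schema(fifaSchema)                    // apply the schema defined above
  .csv("hdfs:///data/fifa_players.csv")  // hypothetical HDFS location
fifaDf.printSchema()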
1700
01:13:32,100 --> 01:13:34,394
The CSV file has been
successfully loaded
1701
01:13:34,394 --> 01:13:35,800
into our new data frame,
1702
01:13:35,800 --> 01:13:37,100
which is FIFA DF now,
1703
01:13:37,100 --> 01:13:39,300
let us print the schema
of a data frame using
1704
01:13:39,300 --> 01:13:40,900
the print schema command.
1705
01:13:41,900 --> 01:13:43,400
So the schema
has been successfully
1706
01:13:43,400 --> 01:13:46,000
displayed here and we have
the following credentials.
1707
01:13:46,000 --> 01:13:49,300
Of each and every player
in our CSV file now,
1708
01:13:49,300 --> 01:13:51,900
let's move on to a further
operations on a dataframe.
1709
01:13:53,100 --> 01:13:56,200
We will count the total number
of records of the players
1710
01:13:56,200 --> 01:13:59,100
as we have in our CSV file
using count command.
1711
01:13:59,300 --> 01:14:01,500
So we have a total
of eighteen thousand
1712
01:14:01,500 --> 01:14:04,300
two hundred and seven players in our CSV file.
1713
01:14:04,300 --> 01:14:06,091
Now, let us find out the details
1714
01:14:06,091 --> 01:14:08,500
of the columns on which
we are working with.
1715
01:14:08,500 --> 01:14:11,300
So these were the columns
which we are working with which
1716
01:14:11,300 --> 01:14:15,466
consist of the ID of the player, name, age, nationality, potential
1717
01:14:15,466 --> 01:14:16,400
and many more.
1718
01:14:17,100 --> 01:14:19,600
Now let us use the column value
1719
01:14:19,600 --> 01:14:21,282
which has the value of each
1720
01:14:21,282 --> 01:14:23,900
and every player
for a particular team, and let
1721
01:14:23,900 --> 01:14:27,399
us use describe command in order
to see the highest value
1722
01:14:27,399 --> 01:14:29,900
and the least value
provided to a player.
1723
01:14:29,900 --> 01:14:33,000
So we have a count of a total number of eighteen thousand
1724
01:14:33,000 --> 01:14:34,400
two hundred and seven players
1725
01:14:34,400 --> 01:14:37,612
and the minimum worth
given to a player is 0
1726
01:14:37,612 --> 01:14:40,900
and the maximum is given
as 9 million pounds.
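Continuing the hypothetical fifaDf above, the count, column listing and describe steps could be sketched as:
fifaDf.count()                    // total number of player records, e.g. 18207
fifaDf.columns.foreach(println)   // the columns we are working with
fifaDf.describe("Value").show()   // count, min and max of the assumed Value column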
1727
01:14:41,100 --> 01:14:43,100
Now, let us use
the select command
1728
01:14:43,100 --> 01:14:46,216
in order to extract
the column name and nationality.
1729
01:14:46,216 --> 01:14:48,172
How to find out the name of each
1730
01:14:48,172 --> 01:14:50,800
and every player along
with his nationality.
1731
01:14:51,000 --> 01:14:54,226
So here we can display
the top 20 rows of each
1732
01:14:54,226 --> 01:14:55,200
and every player
1733
01:14:55,200 --> 01:14:58,900
which we have in our CSV file
along with his nationality.
1734
01:14:59,000 --> 01:14:59,700
Similarly.
1735
01:14:59,700 --> 01:15:03,200
Let us find out the players
playing for a particular Club.
1736
01:15:03,200 --> 01:15:05,500
So here we have
the top 20 players playing
1737
01:15:05,500 --> 01:15:07,029
for their respective clubs
1738
01:15:07,029 --> 01:15:08,300
along with their names
1739
01:15:08,300 --> 01:15:10,800
for example, Messi
playing for Barcelona
1740
01:15:10,800 --> 01:15:13,100
and Ronaldo for
Juventus and Etc.
1741
01:15:13,100 --> 01:15:15,100
Now, let's move
to the next stages.
1742
01:15:15,999 --> 01:15:17,900
Now, let us find out the players
1743
01:15:18,000 --> 01:15:21,000
who are found to be most active
in a particular national team
1744
01:15:21,000 --> 01:15:24,500
or a particular club
with age less than 30 years.
1745
01:15:24,500 --> 01:15:25,300
We shall use
1746
01:15:25,300 --> 01:15:28,300
filter transformation
to apply this operation.
1747
01:15:28,600 --> 01:15:30,500
So here we have the details
1748
01:15:30,500 --> 01:15:33,300
of the Players whose age
is less than 30 years
1749
01:15:33,300 --> 01:15:37,200
and their club and nationality
along with their jersey numbers.
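A minimal Scala sketch of these select and filter steps, still against the hypothetical fifaDf and its assumed column names:
import org.apache.spark.sql.functions.col
fifaDf.select("Name", "Nationality").show()   // each player with his nationality
fifaDf.select("Name", "Club").show()          // each player with his club
fifaDf.filter(col("Age") < 30)                // players younger than 30
  .select("Name", "Age", "Club", "Nationality")
  .show()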
1750
01:15:37,700 --> 01:15:40,700
So with this we have finished
our FIFA example now
1751
01:15:40,700 --> 01:15:43,466
to understand the data frames
in a much better way,
1752
01:15:43,466 --> 01:15:45,300
let us move on
into our use case,
1753
01:15:45,300 --> 01:15:48,400
which is about the hottest topic, the Game of Thrones.
1754
01:15:49,100 --> 01:15:51,319
Similar to our previous example,
1755
01:15:51,319 --> 01:15:54,300
let us design the schema
of a CSV file first.
1756
01:15:54,300 --> 01:15:56,600
So this is the schema
for a CSV file
1757
01:15:56,600 --> 01:15:59,300
which consists the data
about the Game of Thrones.
1758
01:15:59,800 --> 01:16:02,800
So, this is a schema
for our first CSV file.
1759
01:16:02,800 --> 01:16:06,200
Now, let us create the schema
for our next CSV file.
1760
01:16:06,700 --> 01:16:09,991
I have named the schema
for our next CSV file as schema2,
1761
01:16:09,991 --> 01:16:12,667
and I've defined
the data types for each
1762
01:16:12,667 --> 01:16:16,300
and every entity. The schema
has been successfully designed
1763
01:16:16,300 --> 01:16:18,300
for the second CSV file also.
1764
01:16:18,300 --> 01:16:21,700
Now let us load our CSV files
from our external storage,
1765
01:16:21,700 --> 01:16:23,200
which is our hdfs.
1766
01:16:24,000 --> 01:16:28,100
The location of the first CSV
file character deaths dot CSV
1767
01:16:28,100 --> 01:16:29,076
is our hdfs,
1768
01:16:29,076 --> 01:16:31,000
which is defined as above
1769
01:16:31,000 --> 01:16:33,303
and the schema has been
provided as schema.
1770
01:16:33,303 --> 01:16:35,919
And the header true option
has also been provided.
1771
01:16:35,919 --> 01:16:38,100
We are using spark
read function for this
1772
01:16:38,100 --> 01:16:40,789
and we are loading this data
into our new data frame,
1773
01:16:40,789 --> 01:16:42,600
which is Game
of Thrones data frame.
1774
01:16:42,800 --> 01:16:43,700
Similarly.
1775
01:16:43,700 --> 01:16:45,743
Let's load the other CSV file
1776
01:16:45,743 --> 01:16:49,232
which is battles dot CSV
into another data frame,
1777
01:16:49,232 --> 01:16:53,000
which is the Game of Thrones battles data frame. The CSV file
1778
01:16:53,000 --> 01:16:54,792
has been successfully
loaded now.
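A hedged Scala sketch of loading both files; the HDFS paths and the two schema values (schema and schema2) defined earlier are assumed:
val gotCharacterDf = spark.read
  .option("header", "true")
  .schema(schema)                                  // schema for character_deaths.csv
  .csv("hdfs:///data/character_deaths.csv")        // hypothetical path
val gotBattlesDf = spark.read
  .option("header", "true")
  .schema(schema2)                                 // schema for battles.csv
  .csv("hdfs:///data/battles.csv")                 // hypothetical path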
1779
01:16:54,792 --> 01:16:57,200
Let us continue
with the further operations.
1780
01:16:57,900 --> 01:17:00,207
Now let us print
the schema of our Game
1781
01:17:00,207 --> 01:17:03,200
of Thrones data frame using
print schema command.
1782
01:17:03,300 --> 01:17:04,962
So here we have the schema
1783
01:17:04,962 --> 01:17:07,200
which consists of
the name, allegiances,
1784
01:17:07,200 --> 01:17:10,821
death year, book of death and many more. Similarly,
1785
01:17:10,821 --> 01:17:15,100
Let's print the schema of Game
of Thrones battles data frame.
1786
01:17:16,300 --> 01:17:18,600
So this is a schema
for our new data frame,
1787
01:17:18,600 --> 01:17:20,700
which is Game of Thrones
battle data frame.
1788
01:17:20,900 --> 01:17:23,600
Now, let's continue
the further operations.
1789
01:17:24,100 --> 01:17:26,000
Now, let us display
the data frame
1790
01:17:26,000 --> 01:17:29,500
which we have created using
the following command data frame
1791
01:17:29,500 --> 01:17:32,188
has been successfully printed
and this is the data
1792
01:17:32,188 --> 01:17:33,813
which we have in our data frame.
1793
01:17:33,813 --> 01:17:36,200
Now, let's continue
with the further operations.
1794
01:17:36,400 --> 01:17:38,449
We know that there are
a multiple number
1795
01:17:38,449 --> 01:17:41,100
of houses present in the story
of Game of Thrones.
1796
01:17:41,100 --> 01:17:42,211
Now, let us find out
1797
01:17:42,211 --> 01:17:45,100
each and every individual house
present in the story.
1798
01:17:45,300 --> 01:17:48,200
Let us use the following command
in order to display each
1799
01:17:48,200 --> 01:17:51,400
and every house present
in the Game of Thrones story.
1800
01:17:51,600 --> 01:17:54,600
So we have the following houses
in the Game of Thrones story.
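Sketched in Scala under the same assumptions (the allegiances column name is illustrative):
// List every individual house appearing in the character data
gotCharacterDf.select("allegiances").distinct().show(false)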
1801
01:17:54,600 --> 01:17:57,064
Now, let's continue
with the further operations
1802
01:17:57,064 --> 01:18:00,299
the battles in the Game
of Thrones were fought for ages.
1803
01:18:00,299 --> 01:18:02,000
Let us classify the wars waged
1804
01:18:02,000 --> 01:18:04,300
with their occurrence
according to the years.
1805
01:18:04,300 --> 01:18:06,800
We shall use select
and filter transformation
1806
01:18:06,800 --> 01:18:09,750
and we shall access The Columns
of the details of the battle
1807
01:18:09,750 --> 01:18:11,600
and the year in which
they were fought.
1808
01:18:12,100 --> 01:18:13,800
Let us first find
out the battles
1809
01:18:13,800 --> 01:18:15,300
which were fought in the year.
1810
01:18:15,300 --> 01:18:18,000
298. The following
code consists of
1811
01:18:18,000 --> 01:18:19,300
filter transformation
1812
01:18:19,300 --> 01:18:22,000
which will provide the details
for which we are looking.
1813
01:18:22,000 --> 01:18:23,350
So according to the result.
1814
01:18:23,350 --> 01:18:25,400
These were the battles
were fought in the year
1815
01:18:25,400 --> 01:18:28,700
298 and we have the details
of the attacker Kings
1816
01:18:28,700 --> 01:18:30,002
and the defender Kings
1817
01:18:30,002 --> 01:18:33,648
and the outcome of the attacker
along with their commanders
1818
01:18:33,648 --> 01:18:36,400
and the location
where the war was fought now,
1819
01:18:36,400 --> 01:18:39,861
let us find out the wars
waged in the year 299.
1820
01:18:40,400 --> 01:18:41,764
So these were the details
1821
01:18:41,764 --> 01:18:45,293
of the wars which were fought
in the year 299 and similarly,
1822
01:18:45,293 --> 01:18:48,600
let us also find out the wars which were waged in the year 300.
1823
01:18:48,600 --> 01:18:49,952
So these were the wars
1824
01:18:49,952 --> 01:18:51,700
which were fought
in the year 300.
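A rough Scala sketch of these year filters, assuming the battles data exposes columns such as year, attacker_king, defender_king, attacker_outcome, attacker_commander, defender_commander and location:
import org.apache.spark.sql.functions.col
val battleDetails = gotBattlesDf.select("name", "year", "attacker_king",
  "defender_king", "attacker_outcome", "attacker_commander",
  "defender_commander", "location")
battleDetails.filter(col("year") === 298).show(false)   // battles of the year 298
battleDetails.filter(col("year") === 299).show(false)   // battles of the year 299
battleDetails.filter(col("year") === 300).show(false)   // battles of the year 300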
1825
01:18:51,700 --> 01:18:53,700
Now, let's move on
to the next operations
1826
01:18:53,700 --> 01:18:54,700
in our use case.
1827
01:18:55,000 --> 01:18:58,005
Now, let us find out the tactics
used in the wars waged
1828
01:18:58,005 --> 01:19:01,343
and also find out the total
number of wars waged by using
1829
01:19:01,343 --> 01:19:05,200
each type of those tactics. The following code must help us.
1830
01:19:05,800 --> 01:19:07,200
Here we are using select
1831
01:19:07,200 --> 01:19:10,196
and group by operations
in order to find out each
1832
01:19:10,196 --> 01:19:12,500
and every type of tactics
used in the war.
1833
01:19:12,600 --> 01:19:16,221
So they have used ambush, siege, razing and pitched battle types
1834
01:19:16,221 --> 01:19:17,500
of tactics in wars,
1835
01:19:17,500 --> 01:19:20,300
and most of the times they
have used pitched battle type
1836
01:19:20,300 --> 01:19:21,600
of tactics in wars.
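A minimal Scala sketch of that select and group by step, assuming a battle_type column:
// Count how many wars were waged with each kind of tactic
gotBattlesDf.select("battle_type").groupBy("battle_type").count().show()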
1837
01:19:21,600 --> 01:19:24,600
Now, let us continue
with the further operations
1838
01:19:24,600 --> 01:19:27,300
the Ambush type of battles are
the deadliest now,
1839
01:19:27,300 --> 01:19:28,650
let us find out the Kings
1840
01:19:28,650 --> 01:19:31,397
who fought the battles
using these kind of tactics
1841
01:19:31,397 --> 01:19:34,200
and also let us find out
the outcome of the battles
1842
01:19:34,200 --> 01:19:37,425
fought here. The following code
will help us extract the data
1843
01:19:37,425 --> 01:19:38,600
which we need here.
1844
01:19:38,600 --> 01:19:40,962
We are using select
and where commands
1845
01:19:40,962 --> 01:19:43,900
and we are selecting
the columns year, attacker king,
1846
01:19:43,900 --> 01:19:48,181
defender king, attacker outcome, battle type, attacker commander,
1847
01:19:48,181 --> 01:19:49,840
defender commander. Now,
1848
01:19:49,840 --> 01:19:51,500
let us print the details.
1849
01:19:51,900 --> 01:19:54,700
So these were the battles
fought using the Ambush tactics
1850
01:19:54,700 --> 01:19:56,300
and these were
the attacker Kings
1851
01:19:56,300 --> 01:19:59,300
and the defender Kings along
with their respective commanders
1852
01:19:59,300 --> 01:20:01,641
and the wars waged in a particular year.
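A hedged Scala sketch of the select and where step, with the same assumed column names:
gotBattlesDf.select("year", "attacker_king", "defender_king", "attacker_outcome",
    "battle_type", "attacker_commander", "defender_commander")
  .where(col("battle_type") === "ambush")   // only the ambush battles
  .show(false)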
1853
01:20:01,641 --> 01:20:03,700
Let's move on
to the next operation.
1854
01:20:04,300 --> 01:20:06,000
Now let us focus on the houses
1855
01:20:06,000 --> 01:20:08,600
and extract the deadliest house
amongst the rest.
1856
01:20:08,600 --> 01:20:11,893
The following code will help us
to find out the deadliest house
1857
01:20:11,893 --> 01:20:13,700
and the number
of battles they waged.
1858
01:20:13,700 --> 01:20:16,600
So here we have the details
of each and every house
1859
01:20:16,600 --> 01:20:19,383
and the battles they waged
according to the results.
1860
01:20:19,383 --> 01:20:20,033
We have Stark
1861
01:20:20,033 --> 01:20:22,883
and Lannister houses to be
the deadliest among the others.
1862
01:20:22,883 --> 01:20:25,400
Now, let's continue
with the rest of the operations.
1863
01:20:25,900 --> 01:20:28,100
Now, let us find out
the deadliest king
1864
01:20:28,100 --> 01:20:29,100
among the others
1865
01:20:29,100 --> 01:20:31,400
which will use the following
command in order to find
1866
01:20:31,400 --> 01:20:33,600
the deadliest king
amongst the other kings
1867
01:20:33,600 --> 01:20:35,600
who fought the highest number of wars.
1868
01:20:35,600 --> 01:20:38,000
So according to the results
we have Joffrey as
1869
01:20:38,000 --> 01:20:38,900
the deadliest King
1870
01:20:38,900 --> 01:20:41,200
who fought a total number
of 14 battles.
1871
01:20:41,200 --> 01:20:44,000
Now, let us continue
with the further operations.
1872
01:20:44,500 --> 01:20:46,323
Now, let us find out the houses
1873
01:20:46,323 --> 01:20:49,400
which defended most number
of wars waged against them.
1874
01:20:49,400 --> 01:20:52,500
So the following code must help
us find out the details.
1875
01:20:52,600 --> 01:20:54,223
So according to the results.
1876
01:20:54,223 --> 01:20:57,400
We have Lannister house
to be defending the most number
1877
01:20:57,400 --> 01:20:59,009
of wars waged against them.
1878
01:20:59,009 --> 01:21:01,682
Now, let us find out
the defender king who defended
1879
01:21:01,682 --> 01:21:04,900
the most number of battles which were waged against him.
1880
01:21:05,400 --> 01:21:08,405
So according to the result, Robb Stark is the king
1881
01:21:08,405 --> 01:21:10,597
who defended the most number of battles
1882
01:21:10,597 --> 01:21:12,100
which were waged against him.
1883
01:21:12,100 --> 01:21:12,300
Now.
1884
01:21:12,300 --> 01:21:14,600
Let's continue with
the further operations.
1885
01:21:14,800 --> 01:21:17,300
Since Lannister house
is my personal favorite.
1886
01:21:17,300 --> 01:21:18,800
Let me find out the details
1887
01:21:18,800 --> 01:21:20,800
of the characters
in Lannister house.
1888
01:21:20,800 --> 01:21:22,921
This code will
describe their name
1889
01:21:22,921 --> 01:21:24,400
and gender one for male
1890
01:21:24,400 --> 01:21:27,700
and 0 for female along with
their respective population.
1891
01:21:27,700 --> 01:21:29,830
So let me find out
the male characters
1892
01:21:29,830 --> 01:21:31,500
in The Lannister house first.
1893
01:21:32,300 --> 01:21:34,899
So here we have used select
and where commands
1894
01:21:34,900 --> 01:21:37,600
in order to find out
the details of the characters
1895
01:21:37,600 --> 01:21:39,100
present in Lannister house
1896
01:21:39,100 --> 01:21:42,300
and the data has been stored into the DF1 data frame.
1897
01:21:42,300 --> 01:21:44,700
Let us print the data
which is present in the DF1
1898
01:21:44,700 --> 01:21:46,900
data frame
using show command.
1899
01:21:47,800 --> 01:21:49,000
So these are the details
1900
01:21:49,000 --> 01:21:51,400
of the characters
present in Lannister house,
1901
01:21:51,400 --> 01:21:53,100
which are male. Now similarly,
1902
01:21:53,100 --> 01:21:55,400
Let us find out the female characters present
1903
01:21:55,400 --> 01:21:56,800
in Lannister house.
1904
01:21:57,500 --> 01:22:00,000
So these are the characters
present in Lannister house
1905
01:22:00,000 --> 01:22:01,100
who are females
1906
01:22:01,300 --> 01:22:05,028
so we have a total number of
69 male characters and 12
1907
01:22:05,028 --> 01:22:07,900
female characters
in The Lannister house.
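A hedged Scala sketch of those two queries, assuming allegiances and gender columns where 1 stands for male and 0 for female:
val df1 = gotCharacterDf.select("name", "allegiances", "gender")
  .where(col("allegiances").contains("Lannister") && col("gender") === 1)
df1.show(false)
df1.count()                                  // e.g. 69 male characters
gotCharacterDf
  .where(col("allegiances").contains("Lannister") && col("gender") === 0)
  .count()                                   // e.g. 12 female characters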
1908
01:22:07,900 --> 01:22:11,311
Now, let us continue with
the next operations. At the end
1909
01:22:11,311 --> 01:22:12,800
of the day every episode
1910
01:22:12,800 --> 01:22:14,800
of Game of Thrones had
a noble character.
1911
01:22:15,000 --> 01:22:17,365
Let us now find out all
the noble characters
1912
01:22:17,365 --> 01:22:18,664
amongst all the houses
1913
01:22:18,664 --> 01:22:21,193
that we have in our Game
of Thrones CSV file
1914
01:22:21,193 --> 01:22:24,100
the following code must help
us find out the details.
1915
01:22:25,600 --> 01:22:26,300
So the details
1916
01:22:26,300 --> 01:22:28,500
of all the characters
from all the houses
1917
01:22:28,500 --> 01:22:30,050
who are considered to be Noble.
1918
01:22:30,050 --> 01:22:32,200
have been saved
into the new data frame,
1919
01:22:32,200 --> 01:22:33,427
which is DF 3 now,
1920
01:22:33,427 --> 01:22:36,800
let us print the details
from the df3 data frame.
1921
01:22:37,500 --> 01:22:40,000
So these are the top 20 members
from all the houses
1922
01:22:40,000 --> 01:22:42,900
who are considered to be Noble
along with their genders.
1923
01:22:42,900 --> 01:22:45,400
Now, let us count the total
number of noble characters
1924
01:22:45,400 --> 01:22:47,600
from the entire game
of thrones stories.
1925
01:22:48,300 --> 01:22:50,500
So there are a total
of four hundred and thirty
1926
01:22:50,500 --> 01:22:53,300
number of noble characters
existing in the whole game
1927
01:22:53,300 --> 01:22:54,300
of throne story.
1928
01:22:54,800 --> 01:22:56,211
Nonetheless, we have also
1929
01:22:56,211 --> 01:22:59,086
faced a few commoners
whose role in The Game
1930
01:22:59,086 --> 01:23:01,700
of Thrones is found
to be exceptional. We shall
1931
01:23:01,700 --> 01:23:04,219
find out the details
of all those commoners
1932
01:23:04,219 --> 01:23:07,300
who were highly dedicated
to their roles in each episode
1933
01:23:07,600 --> 01:23:08,700
the data of all,
1934
01:23:08,700 --> 01:23:10,700
the commoners has been successfully loaded
1935
01:23:10,700 --> 01:23:11,900
into the new data frame,
1936
01:23:11,900 --> 01:23:14,202
which is DF4. Now let
us print the data
1937
01:23:14,202 --> 01:23:17,500
which is present in DF4 using the show command.
1938
01:23:17,900 --> 01:23:20,396
So these are the top
20 characters identified as
1939
01:23:20,396 --> 01:23:23,004
commoners amongst all the Game
of Thrones stories.
1940
01:23:23,004 --> 01:23:25,400
Now, let us find out
the count of total number
1941
01:23:25,400 --> 01:23:26,600
of common characters.
1942
01:23:26,700 --> 01:23:27,649
So there are a total
1943
01:23:27,649 --> 01:23:30,099
of four hundred and
eighty seven common characters
1944
01:23:30,099 --> 01:23:32,000
amongst all stories
of Game of Thrones.
1945
01:23:32,000 --> 01:23:34,100
Let us continue
with the further operations.
1946
01:23:34,100 --> 01:23:35,700
Now there were a few characters
1947
01:23:35,700 --> 01:23:37,700
who were considered
to be important
1948
01:23:37,700 --> 01:23:39,210
and equally Noble, hence.
1949
01:23:39,210 --> 01:23:41,526
They were carried on until the last book.
1950
01:23:41,526 --> 01:23:43,644
So let us filter
out those characters
1951
01:23:43,644 --> 01:23:46,100
and find out the details
of each one of them.
1952
01:23:46,400 --> 01:23:49,520
The data of all the characters
who are considered to be Noble
1953
01:23:49,520 --> 01:23:50,300
and carried out
1954
01:23:50,300 --> 01:23:53,300
until the last book are being
stored into the new data frame,
1955
01:23:53,300 --> 01:23:55,629
which is DF4. Now let
us print the data
1956
01:23:55,629 --> 01:23:56,652
which is existing
1957
01:23:56,652 --> 01:23:59,600
in the data frame for so
according to the results.
1958
01:23:59,600 --> 01:24:00,650
We have two candidates
1959
01:24:00,650 --> 01:24:03,300
who are considered to be
the noble and their character
1960
01:24:03,300 --> 01:24:05,200
has been carried on
until the last book
1961
01:24:05,700 --> 01:24:06,900
amongst all the battles.
1962
01:24:06,900 --> 01:24:09,068
I found the battles
of the last books
1963
01:24:09,068 --> 01:24:11,900
to be generating more
adrenaline in the readers.
1964
01:24:11,900 --> 01:24:14,500
Let us find out the details
of those battles using
1965
01:24:14,500 --> 01:24:15,600
the following code.
1966
01:24:16,000 --> 01:24:18,700
So the following code will help
us to find out the wars
1967
01:24:18,700 --> 01:24:20,500
which were fought
in the last year's
1968
01:24:20,500 --> 01:24:21,700
of the Game of Thrones.
1969
01:24:22,100 --> 01:24:24,799
So these are the details
of the wars which were fought
1970
01:24:24,799 --> 01:24:26,800
in the last year's
of the Game of Thrones
1971
01:24:26,800 --> 01:24:28,200
and the details of the Kings
1972
01:24:28,300 --> 01:24:30,067
and the details
of their commanders
1973
01:24:30,067 --> 01:24:32,200
and the location
where the war was fought.
1974
01:24:36,700 --> 01:24:40,579
Welcome to this interesting
session of Spark SQL tutorial
1975
01:24:40,579 --> 01:24:41,600
from Edureka.
1976
01:24:41,600 --> 01:24:42,700
So in today's session,
1977
01:24:42,700 --> 01:24:46,100
we are going to learn about
how we will be working.
1978
01:24:46,100 --> 01:24:48,500
Spark SQL. Now, what all you
1979
01:24:48,500 --> 01:24:51,944
can expect from this course
from this particular session
1980
01:24:51,944 --> 01:24:53,300
so you can expect that.
1981
01:24:53,300 --> 01:24:56,400
We will be first learning why Spark SQL.
1982
01:24:56,500 --> 01:24:58,139
What are the libraries
1983
01:24:58,139 --> 01:25:00,600
which are present
in Spark SQL,
1984
01:25:00,600 --> 01:25:03,600
What are the important
features of Spark SQL?
1985
01:25:03,600 --> 01:25:06,400
We will also be doing
some Hands-On example
1986
01:25:06,400 --> 01:25:10,323
and in the end we will see
some interesting use case
1987
01:25:10,323 --> 01:25:13,300
of stock market analysis now
1988
01:25:13,400 --> 01:25:15,042
Why Spark SQL? Is it
1989
01:25:15,042 --> 01:25:19,200
like Why we are learning it
why it is really important
1990
01:25:19,200 --> 01:25:22,067
for us to know about
this Spark SQL?
1991
01:25:22,067 --> 01:25:24,200
Is it like really hot in Market?
1992
01:25:24,200 --> 01:25:27,700
If yes, then why we want
all those answer from this.
1993
01:25:27,700 --> 01:25:30,500
So if you're coming
from a Hadoop background,
1994
01:25:30,500 --> 01:25:34,102
you must have heard a lot
about Apache Hive now
1995
01:25:34,300 --> 01:25:36,100
what happens in Apache
1996
01:25:36,100 --> 01:25:39,061
Hive. In Apache
Hive SQL developers
1997
01:25:39,061 --> 01:25:41,430
can write the queries in SQL way
1998
01:25:41,430 --> 01:25:43,800
and it will be getting converted
1999
01:25:43,800 --> 01:25:45,800
to your mapreduce
and giving you the output.
2000
01:25:46,400 --> 01:25:47,600
Now we all know
2001
01:25:47,600 --> 01:25:50,000
that mapreduce is
slower in nature.
2002
01:25:50,000 --> 01:25:52,726
And since mapreduce
is going to be slower
2003
01:25:52,726 --> 01:25:54,500
and nature then definitely
2004
01:25:54,500 --> 01:25:58,000
your overall Hive query
is going to be slower in nature.
2005
01:25:58,000 --> 01:25:59,537
So that was one challenge.
2006
01:25:59,537 --> 01:26:02,361
So if you have let's say
less than 200 GB of data
2007
01:26:02,361 --> 01:26:04,400
or if you have
a smaller set of data.
2008
01:26:04,400 --> 01:26:06,800
This was actually
a big challenge
2009
01:26:06,800 --> 01:26:10,400
that in Hive your performance
was not that great.
2010
01:26:10,400 --> 01:26:13,900
It also does not have any resuming capability; if it gets stuck,
2011
01:26:13,900 --> 01:26:15,900
you can just restart it. Also,
2012
01:26:15,900 --> 01:26:19,200
Hive cannot even drop
your encrypted data bases.
2013
01:26:19,200 --> 01:26:21,082
That's was also one
of the challenge
2014
01:26:21,082 --> 01:26:23,200
when you deal with
the security side.
2015
01:26:23,200 --> 01:26:25,082
Now, what Spark SQL has done:
2016
01:26:25,082 --> 01:26:28,300
Spark SQL has solved almost all of these problems.
2017
01:26:28,300 --> 01:26:31,064
So in the last sessions
you have already learned
2018
01:26:31,064 --> 01:26:34,500
about the Spark way, right, how Spark is faster than MapReduce
2019
01:26:34,500 --> 01:26:36,200
and not we have already learned
2020
01:26:36,200 --> 01:26:38,800
that in the previous
few sessions now.
2021
01:26:38,800 --> 01:26:39,917
So in this session,
2022
01:26:39,917 --> 01:26:43,000
we are going to kind of take
a live range of all that so
2023
01:26:43,000 --> 01:26:44,800
definitely in this case
2024
01:26:44,800 --> 01:26:47,500
since Spark is
faster because of
2025
01:26:47,500 --> 01:26:49,200
the in-memory computation.
2026
01:26:49,200 --> 01:26:50,866
What is in-memory computation?
2027
01:26:50,866 --> 01:26:52,200
We have already seen it.
2028
01:26:52,200 --> 01:26:55,105
So in memory computations
is like whenever we
2029
01:26:55,105 --> 01:26:57,700
are Computing anything
in memory directly.
2030
01:26:57,700 --> 01:27:01,165
So because of the in-memory computation capability,
2031
01:27:01,165 --> 01:27:02,800
Apache Spark is faster.
2032
01:27:02,800 --> 01:27:07,500
So definitely your Spark SQL is also bound to become faster. Now,
2033
01:27:07,500 --> 01:27:08,600
so if I talk
2034
01:27:08,600 --> 01:27:11,900
about the advantages
of Spark SQL over Hive,
2035
01:27:11,900 --> 01:27:14,970
definitely number one it
is going to be faster
2036
01:27:14,970 --> 01:27:17,900
in comparison to your Hive. So a Hive query
2037
01:27:17,900 --> 01:27:20,900
which is let's say
you're taking around 10 minutes
2038
01:27:20,900 --> 01:27:21,905
in Spark SQL,
2039
01:27:21,905 --> 01:27:25,300
You can finish that same query
in less than one minute.
2040
01:27:25,300 --> 01:27:27,400
Don't you think it's
an awesome capability
2041
01:27:27,400 --> 01:27:31,400
of Spark SQL? Definitely, right. Now, the second thing is
2042
01:27:31,400 --> 01:27:34,400
if, let's say, you are writing something in Hive;
2043
01:27:34,400 --> 01:27:36,148
now you can take an example
2044
01:27:36,148 --> 01:27:39,751
of let's say a company
who is, let's say, developing Hive
2045
01:27:39,751 --> 01:27:41,467
queries from last 10 years.
2046
01:27:41,467 --> 01:27:42,900
Now they were doing it.
2047
01:27:42,900 --> 01:27:44,000
They were all happy
2048
01:27:44,000 --> 01:27:46,000
that they were able
to process picture.
2049
01:27:46,100 --> 01:27:48,200
That they were worried
about the performance
2050
01:27:48,200 --> 01:27:50,778
that Hive is not able
to give them that level
2051
01:27:50,778 --> 01:27:53,273
of processing speed what
they are looking for.
2052
01:27:53,273 --> 01:27:54,160
Now this fossil.
2053
01:27:54,160 --> 01:27:56,600
It's a challenge
for that particular company.
2054
01:27:56,600 --> 01:27:58,801
Now, there's a challenge right?
2055
01:27:58,801 --> 01:28:01,397
The challenge is
they came to know
2056
01:28:01,397 --> 01:28:02,900
about Spark SQL, fine.
2057
01:28:02,900 --> 01:28:04,685
Let's say we came
to know about it,
2058
01:28:04,685 --> 01:28:05,853
but they came to know
2059
01:28:05,853 --> 01:28:08,300
that we can execute
everything in Spark SQL
2060
01:28:08,300 --> 01:28:10,700
and it is going to be
faster as well fine.
2061
01:28:10,700 --> 01:28:12,281
But don't you think that
2062
01:28:12,281 --> 01:28:15,708
if this company has been working for, let's say, the past 10 years
2063
01:28:15,708 --> 01:28:19,200
in Hive, they must have already written a lot of code in Hive;
2064
01:28:19,200 --> 01:28:23,100
now if you ask them to migrate
to Spark SQL, will it be
2065
01:28:23,100 --> 01:28:24,400
an easy task?
2066
01:28:24,400 --> 01:28:25,200
No, right.
2067
01:28:25,200 --> 01:28:25,982
Definitely.
2068
01:28:25,982 --> 01:28:28,384
It is not going
to be an easy task.
2069
01:28:28,384 --> 01:28:32,200
Why? Because Hive syntax and Spark SQL syntax, though
2070
01:28:32,200 --> 01:28:35,800
they both follow the SQL way
of writing the things
2071
01:28:35,800 --> 01:28:39,346
but at the same time
2072
01:28:39,346 --> 01:28:41,500
it carries a big difference,
2073
01:28:41,500 --> 01:28:44,300
so there will be a good
difference whenever we talk
2074
01:28:44,300 --> 01:28:45,905
about the syntax between them.
2075
01:28:45,905 --> 01:28:48,100
So it will take a very
good amount of time
2076
01:28:48,100 --> 01:28:51,017
for that company to change
all of the query mode
2077
01:28:51,017 --> 01:28:54,052
to the Spark SQL way. Now, Spark SQL came up
2078
01:28:54,052 --> 01:28:55,426
with a smart solution:
2079
01:28:55,426 --> 01:28:56,899
what they said is even
2080
01:28:56,899 --> 01:28:58,900
if you are writing
the query in Hive,
2081
01:28:58,900 --> 01:29:01,300
you can execute
that Hive query directly
2082
01:29:01,300 --> 01:29:03,500
through Spark SQL. Don't you
think it's again
2083
01:29:03,500 --> 01:29:06,600
a very important
and awesome facility, right?
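As a hedged Scala sketch of that idea (the table and query are made up; enableHiveSupport is what lets Spark SQL reuse an existing Hive deployment):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("HiveOnSparkSQL")
  .enableHiveSupport()      // reuse the existing Hive metastore and warehouse
  .getOrCreate()
// The HiveQL you already have can be submitted as-is through spark.sql
spark.sql("SELECT department, AVG(salary) FROM employees GROUP BY department").show()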
2084
01:29:06,600 --> 01:29:09,900
Because even now
if you're a good Hive developer,
2085
01:29:09,900 --> 01:29:12,000
you need not worry about
2086
01:29:12,000 --> 01:29:15,600
how you will now be migrating to Spark SQL.
2087
01:29:15,600 --> 01:29:18,658
Well, you can still keep on writing the Hive query
2088
01:29:18,658 --> 01:29:20,900
and can your query
will automatically be
2089
01:29:20,900 --> 01:29:24,767
getting converted to Spark SQL. Similarly, in Apache Spark,
2090
01:29:24,767 --> 01:29:27,200
as we have learned
in the past sessions,
2091
01:29:27,200 --> 01:29:30,100
especially through Spark Streaming, that Spark
2092
01:29:30,100 --> 01:29:33,600
Streaming is going to give you real-time processing, right?
2093
01:29:33,600 --> 01:29:36,000
You can also perform
your real-time processing
2094
01:29:36,000 --> 01:29:37,615
using Apache Spark now.
2095
01:29:37,615 --> 01:29:39,500
This sort of facility is you
2096
01:29:39,500 --> 01:29:41,800
can take leverage of even in Spark SQL.
2097
01:29:41,800 --> 01:29:44,235
So let's say you can do
a real-time processing
2098
01:29:44,235 --> 01:29:46,400
and at the same time
you can also Perform
2099
01:29:46,400 --> 01:29:47,860
your SQL queries. Now, in Hive,
2100
01:29:47,860 --> 01:29:49,120
that was the problem.
2101
01:29:49,120 --> 01:29:49,900
You cannot do
2102
01:29:49,900 --> 01:29:52,900
that because when we talk
about Hive, now in Hive
2103
01:29:52,900 --> 01:29:54,320
it's all about Hadoop is
2104
01:29:54,320 --> 01:29:56,663
all about batch
processing batch processing
2105
01:29:56,663 --> 01:29:58,509
where you keep historical data
2106
01:29:58,509 --> 01:30:00,736
and then later you
process it, right?
2107
01:30:00,736 --> 01:30:03,699
So it definitely Hive also
follow the same approach
2108
01:30:03,699 --> 01:30:05,300
in this case also, Hive is
2109
01:30:05,300 --> 01:30:07,850
going to just only follow
the batch processing mode,
2110
01:30:07,850 --> 01:30:09,600
but when it comes to Apache Spark,
2111
01:30:09,600 --> 01:30:13,500
but it will also be taking care
of the real-time processing.
2112
01:30:13,500 --> 01:30:15,499
So how all these things happens
2113
01:30:15,499 --> 01:30:18,400
So Spark SQL always uses the metastore
2114
01:30:18,400 --> 01:30:21,350
services of your Hive
to query the data stored
2115
01:30:21,350 --> 01:30:22,400
and managed by Hive.
2116
01:30:22,400 --> 01:30:24,728
So when you were learning about Hive,
2117
01:30:24,728 --> 01:30:28,123
so we have learned at that time
that in Hive everything
2118
01:30:28,123 --> 01:30:30,711
What we do is always
stored in the metastore,
2119
01:30:30,711 --> 01:30:33,491
so that metastore was
The crucial point, right?
2120
01:30:33,491 --> 01:30:35,200
Because using that metastore
2121
01:30:35,200 --> 01:30:37,600
only you are able
to do everything up.
2122
01:30:37,600 --> 01:30:41,100
So like when you are doing
let's say or any sort of query
2123
01:30:41,100 --> 01:30:42,707
when you're creating a table,
2124
01:30:42,707 --> 01:30:45,700
everything was getting stored
in that same metastore.
2125
01:30:45,700 --> 01:30:47,559
What happens is, Spark SQL
2126
01:30:47,559 --> 01:30:51,800
also uses the same metastore. Now, whatever metastore
2127
01:30:51,800 --> 01:30:55,051
You have created with respect
to Hive, the same metastore
2128
01:30:55,051 --> 01:30:56,219
You can also use it
2129
01:30:56,219 --> 01:30:58,900
for your Spark SQL,
and that is something
2130
01:30:58,900 --> 01:31:02,000
which is really awesome
about this Spark SQL:
2131
01:31:02,000 --> 01:31:04,000
that you need not create
a new meta store.
2132
01:31:04,000 --> 01:31:06,300
You need not worry
about a new storage space
2133
01:31:06,300 --> 01:31:07,404
and not everything
2134
01:31:07,404 --> 01:31:10,820
what you have done with respect
to your Hive, the same metastore
2135
01:31:10,820 --> 01:31:11,620
you can use it.
2136
01:31:11,620 --> 01:31:11,833
Now.
2137
01:31:11,833 --> 01:31:13,700
You can ask me then
how it is faster
2138
01:31:13,700 --> 01:31:15,700
if they're using the same metastore? Don't you remember
2139
01:31:15,700 --> 01:31:18,500
the processing part? Why Hive was slower:
2140
01:31:18,500 --> 01:31:20,301
because of its processing way
2141
01:31:20,301 --> 01:31:23,519
because it is converting
everything to the mapreduce
2142
01:31:23,519 --> 01:31:26,782
and this was making
the processing very very slow.
2143
01:31:26,782 --> 01:31:28,100
But here in this case
2144
01:31:28,100 --> 01:31:31,452
since the processing is going
to be in memory computation.
2145
01:31:31,452 --> 01:31:32,705
So in Spark SQL's case,
2146
01:31:32,705 --> 01:31:35,588
it is always going to be
the faster now definitely
2147
01:31:35,588 --> 01:31:37,545
it is just because of the metastore side that
2148
01:31:37,545 --> 01:31:39,600
we are able to fetch the data, right,
2149
01:31:39,600 --> 01:31:42,129
not but at the same time
for any other thing
2150
01:31:42,129 --> 01:31:44,100
of the processing related stuff,
2151
01:31:44,100 --> 01:31:46,200
2152
01:31:46,200 --> 01:31:48,180
when we talk about
the processing stage
2153
01:31:48,180 --> 01:31:51,200
it is going to be in memory
thus it's going to be faster.
2154
01:31:51,300 --> 01:31:54,335
So let's talk about some success
stories of Spark SQL.
2155
01:31:54,335 --> 01:31:57,550
Let's see some use cases
Twitter sentiment analysis.
2156
01:31:57,550 --> 01:31:58,844
If you go through over
2157
01:31:58,844 --> 01:32:01,699
if you want to exactly remember
our spark streaming session,
2158
01:32:01,700 --> 01:32:04,300
we have done a Twitter
sentiment analysis, right?
2159
01:32:04,300 --> 01:32:05,400
So there you have seen
2160
01:32:05,400 --> 01:32:08,497
that we have first initially
got the data from Twitter and
2161
01:32:08,497 --> 01:32:10,400
that to we have got
it with the help
2162
01:32:10,400 --> 01:32:11,911
of Spark Streaming, and later
2163
01:32:11,911 --> 01:32:13,000
what we did later.
2164
01:32:13,000 --> 01:32:15,600
we just analyzed everything with the help of Spark
2165
01:32:15,600 --> 01:32:18,080
SQL, so you can see the advantage of Spark SQL.
2166
01:32:18,080 --> 01:32:19,761
So in Twitter sentiment analysis
2167
01:32:19,761 --> 01:32:21,600
where let's say
you want to find out
2168
01:32:21,600 --> 01:32:23,200
about the Donald Trump, right?
2169
01:32:23,200 --> 01:32:24,509
You are fetching the data
2170
01:32:24,509 --> 01:32:26,547
every tweet related
to the Donald Trump
2171
01:32:26,547 --> 01:32:28,900
and then kind of bring
analysis in checking
2172
01:32:28,900 --> 01:32:31,200
whether it's a positive or negative
2173
01:32:31,200 --> 01:32:32,475
tweet neutral tweet,
2174
01:32:32,475 --> 01:32:34,900
very negative or very positive tweet.
2175
01:32:34,900 --> 01:32:37,257
Okay, so we have already
seen the same example there
2176
01:32:37,257 --> 01:32:38,607
in that particular session.
2177
01:32:38,607 --> 01:32:39,549
So in this session,
2178
01:32:39,549 --> 01:32:40,499
as you are noticing
2179
01:32:40,499 --> 01:32:42,600
what we are doing: we just want to kind of show
2180
01:32:42,600 --> 01:32:44,202
that once you're
streaming the data
2181
01:32:44,202 --> 01:32:45,900
and the real time
you can also do it.
2182
01:32:45,900 --> 01:32:47,977
also using Spark SQL; just you
2183
01:32:47,977 --> 01:32:50,724
are doing all the processing
at the real time similarly
2184
01:32:50,724 --> 01:32:52,270
in the stock market analysis.
2185
01:32:52,270 --> 01:32:54,295
You can use Spark SQL in a lot of places.
2186
01:32:54,295 --> 01:32:57,400
You can adopt it in the banking fraud case, transactions and all;
2187
01:32:57,400 --> 01:32:58,400
you can use that.
2188
01:32:58,400 --> 01:33:01,000
So let's say your credit
card is getting swiped
2189
01:33:01,000 --> 01:33:02,580
in India and in next 10 minutes
2190
01:33:02,580 --> 01:33:04,429
if your credit card
is getting swiped
2191
01:33:04,429 --> 01:33:05,456
in let's say in u.s.
2192
01:33:05,456 --> 01:33:07,100
Definitely that is not possible.
2193
01:33:07,100 --> 01:33:07,400
Right?
2194
01:33:07,400 --> 01:33:09,872
So let's say you are doing all
that processing real-time.
2195
01:33:09,872 --> 01:33:12,300
You're detecting everything
with respect to Spark Streaming.
2196
01:33:12,300 --> 01:33:15,400
Then you are let's say applying
your Spark SQL to verify
2197
01:33:15,400 --> 01:33:18,000
that Whether it's
a user Trend or not, right?
2198
01:33:18,000 --> 01:33:20,600
So all those things you want
to match up with Spark SQL.
2199
01:33:20,600 --> 01:33:21,960
So you can do that similarly
2200
01:33:21,960 --> 01:33:23,750
the medical domain
you can use that.
2201
01:33:23,750 --> 01:33:25,949
Let's talk about
some Spark SQL features.
2202
01:33:25,949 --> 01:33:28,200
So there will be
some features related to it.
2203
01:33:28,400 --> 01:33:30,200
Now, you can use
2204
01:33:30,200 --> 01:33:33,700
what happens when this SQL got combined with this Spark:
2205
01:33:33,700 --> 01:33:34,830
We started calling it
2206
01:33:34,830 --> 01:33:35,825
as Spark SQL. Now,
2207
01:33:35,825 --> 01:33:38,700
when definitely we are talking
about SQL be a talking
2208
01:33:38,700 --> 01:33:40,405
about either a structure data
2209
01:33:40,405 --> 01:33:41,800
or a semi-structured data now
2210
01:33:41,800 --> 01:33:44,231
SQL queries cannot deal
with the unstructured data,
2211
01:33:44,231 --> 01:33:47,300
so that is definitely one thing you need to keep in mind.
2212
01:33:47,300 --> 01:33:51,000
Now, your Spark SQL also supports various data formats.
2213
01:33:51,000 --> 01:33:52,800
You can get data from Parquet.
2214
01:33:52,800 --> 01:33:54,500
You must have heard about Parquet,
2215
01:33:54,500 --> 01:33:56,911
that it is a columnar
based storage and it
2216
01:33:56,911 --> 01:33:59,884
is kind of very much
compressed format of the data
2217
01:33:59,884 --> 01:34:02,300
what you have but it's
not human readable.
2218
01:34:02,300 --> 01:34:02,800
Similarly.
2219
01:34:02,800 --> 01:34:04,800
You must have heard
about JSON, Avro,
2220
01:34:04,800 --> 01:34:07,200
where we keep the value
as a key value pair.
2221
01:34:07,200 --> 01:34:08,482
HBase, Cassandra, right?
2222
01:34:08,482 --> 01:34:09,700
These are NoSQL DBs,
2223
01:34:09,700 --> 01:34:12,800
so you can get all the data
from these sources now.
2224
01:34:12,800 --> 01:34:15,114
You can also convert
your SQL queries
2225
01:34:15,114 --> 01:34:16,400
to your RDDs,
2226
01:34:16,400 --> 01:34:18,650
so you
will be able to perform
2227
01:34:18,650 --> 01:34:20,113
all the transformation steps.
2228
01:34:20,113 --> 01:34:21,800
So that is one thing you can do.
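A hedged Scala sketch of pulling data from a few of these sources; every path and connection detail below is a placeholder:
val parquetDf = spark.read.parquet("hdfs:///data/events.parquet")   // columnar Parquet
val jsonDf = spark.read.json("hdfs:///data/events.json")            // key-value JSON
val jdbcDf = spark.read.format("jdbc")                              // e.g. a MySQL table
  .option("url", "jdbc:mysql://dbhost:3306/sales")
  .option("dbtable", "orders")
  .option("user", "reporting")
  .option("password", "secret")
  .load()
// Any of these data frames can be dropped down to an RDD for raw transformations
val ordersRdd = jdbcDf.rdd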
2229
01:34:21,800 --> 01:34:23,500
Now if we talk about performance
2230
01:34:23,500 --> 01:34:26,700
and scalability definitely
on this red color graph.
2231
01:34:26,700 --> 01:34:29,431
If you notice this
is related to your Hadoop,
2232
01:34:29,431 --> 01:34:30,300
you can notice
2233
01:34:30,300 --> 01:34:34,000
that red color graph is much
more encompassing to blue color
2234
01:34:34,000 --> 01:34:36,617
and blue color denotes
my performance with respect
2235
01:34:36,617 --> 01:34:37,503
to Spark SQL,
2236
01:34:37,503 --> 01:34:40,856
so you can notice that spark
SQL is performing much better
2237
01:34:40,856 --> 01:34:42,684
in comparison to your Hadoop.
2238
01:34:42,684 --> 01:34:44,260
So we are on this Y axis.
2239
01:34:44,260 --> 01:34:45,900
We are taking the running.
2240
01:34:46,000 --> 01:34:47,200
On the x-axis.
2241
01:34:47,200 --> 01:34:50,119
We were considering
the number of iteration
2242
01:34:50,119 --> 01:34:53,000
when we talk about
Sparks equal features.
2243
01:34:53,000 --> 01:34:56,000
Now few more features
we have for example,
2244
01:34:56,000 --> 01:34:59,200
you can create a connection
with simple your jdbc driver
2245
01:34:59,200 --> 01:35:00,494
or odbc driver, right?
2246
01:35:00,494 --> 01:35:02,482
These are simple
drivers being present.
2247
01:35:02,482 --> 01:35:03,600
Now, you can create
2248
01:35:03,600 --> 01:35:06,700
your connection with Spark SQL using all these drivers.
2249
01:35:06,700 --> 01:35:10,000
You can also create a user
defined function means let's say
2250
01:35:10,000 --> 01:35:12,200
if any function is
not available to you
2251
01:35:12,200 --> 01:35:14,600
and in that case you can create
your own functions.
2252
01:35:14,600 --> 01:35:16,900
Let's say if function
Is available use
2253
01:35:16,900 --> 01:35:18,639
that if it is not available,
2254
01:35:18,639 --> 01:35:21,497
you can create a UDF means
user-defined function
2255
01:35:21,497 --> 01:35:23,235
and you can directly execute
2256
01:35:23,235 --> 01:35:26,478
that user-defined function
and get your desired result.
2257
01:35:26,478 --> 01:35:28,900
So this is one example
where we have shown
2258
01:35:28,900 --> 01:35:30,100
that you can convert.
2259
01:35:30,100 --> 01:35:33,000
Let's say if you don't have
an uppercase API present
2260
01:35:33,000 --> 01:35:36,405
in Spark SQL, how you can create a simple UDF for it
2261
01:35:36,405 --> 01:35:37,700
and can execute it.
2262
01:35:37,700 --> 01:35:38,850
So if you notice there
2263
01:35:38,850 --> 01:35:41,200
what we are doing
let's get this is my data.
2264
01:35:41,200 --> 01:35:42,700
So if you notice in this case,
2265
01:35:43,069 --> 01:35:45,530
this is data set is
my data part.
2266
01:35:45,800 --> 01:35:48,100
So this is I'm generating
as a sequence.
2267
01:35:48,100 --> 01:35:51,800
I'm creating it as a data frame
see this toDF part here.
2268
01:35:51,800 --> 01:35:55,100
Now after that we are creating an upper UDF here,
2269
01:35:55,100 --> 01:35:58,217
and notice we are converting
any value which is coming
2270
01:35:58,217 --> 01:35:59,600
to my upper case, right?
2271
01:35:59,600 --> 01:36:02,000
We are using this toUpperCase
API to convert it.
2272
01:36:02,100 --> 01:36:05,800
We are importing this function
and then what we did now
2273
01:36:05,800 --> 01:36:08,100
when we came here,
we are telling that okay.
2274
01:36:08,100 --> 01:36:09,236
This is my UDF.
2275
01:36:09,236 --> 01:36:10,600
So the UDF is "upper",
2276
01:36:10,600 --> 01:36:12,719
because we have created
it here also as "upper".
2277
01:36:12,719 --> 01:36:13,569
So we are telling
2278
01:36:13,569 --> 01:36:16,100
that this is my UDF
in the first step and then Then
2279
01:36:16,100 --> 01:36:17,153
when we are using it,
2280
01:36:17,153 --> 01:36:20,253
let's say with our datasets
what we are doing so data sets.
2281
01:36:20,253 --> 01:36:22,100
We are passing year
that okay, whatever.
2282
01:36:22,100 --> 01:36:23,393
We are doing convert it
2283
01:36:23,393 --> 01:36:26,600
with our upper UDF —
convert it to my upper case.
2284
01:36:26,600 --> 01:36:29,100
So see we are telling you
we have created our upper UDF;
2285
01:36:29,100 --> 01:36:31,500
that is what we are passing
inside this text value.
2286
01:36:31,800 --> 01:36:34,600
So now it is just
getting converted
2287
01:36:34,600 --> 01:36:37,600
and giving you all the output
in your upper case way
2288
01:36:37,600 --> 01:36:40,400
so you can notice
that this is your last value
2289
01:36:40,400 --> 01:36:42,700
and this is your
uppercase value, right?
2290
01:36:42,700 --> 01:36:43,841
So this got converted
2291
01:36:43,841 --> 01:36:45,900
to my upper case
in this particular.
2292
01:36:45,900 --> 01:36:46,500
Love it.
2293
01:36:46,500 --> 01:36:46,900
Now.
2294
01:36:46,900 --> 01:36:49,123
If you notice here
also same steps.
2295
01:36:49,123 --> 01:36:52,000
Here is how we
can register all of our UDFs.
2296
01:36:52,000 --> 01:36:53,620
This is not being shown here.
2297
01:36:53,620 --> 01:36:55,800
So now this is
how you can do that — spark
2298
01:36:55,800 --> 01:36:57,354
.udf.register.
2299
01:36:57,354 --> 01:36:58,574
So using this API,
2300
01:36:58,574 --> 01:37:02,100
you can just register
your UDFs. Now similarly,
2301
01:37:02,100 --> 01:37:03,870
if you want to get the output
2302
01:37:03,870 --> 01:37:06,800
after that you can get
it in the following way —
2303
01:37:06,800 --> 01:37:09,900
so you can use the show API
to get the output
2304
01:37:09,900 --> 01:37:12,100
for this.
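(A rough Scala sketch of the uppercase UDF flow just described, for the spark-shell; the variable names and sample values here are illustrative, not taken from the video:)

  import spark.implicits._
  import org.apache.spark.sql.functions.udf

  // sample data turned into a DataFrame with a single "text" column
  val dataset = Seq("hello", "world").toDF("text")

  // define the UDF: convert any incoming value to upper case
  val upper = udf((s: String) => s.toUpperCase)

  // apply the UDF and show both the original and the upper-cased column
  dataset.select($"text", upper($"text").as("upper_text")).show()

  // register the UDF so it can also be used inside spark.sql(...) queries
  spark.udf.register("upperUDF", (s: String) => s.toUpperCase)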
2305
01:37:12,100 --> 01:37:13,800
Now let's see the Spark
2306
01:37:13,800 --> 01:37:16,400
SQL architecture. So what is
the Spark SQL architecture?
2307
01:37:16,400 --> 01:37:18,100
If we talk about it,
what happens is: let
2308
01:37:18,100 --> 01:37:19,900
's say you are getting the data
by using
2309
01:37:19,900 --> 01:37:21,500
your various formats, right?
2310
01:37:21,500 --> 01:37:23,911
So let's say you can get
it from your CSV.
2311
01:37:23,911 --> 01:37:26,056
You can get it
from your Json format.
2312
01:37:26,056 --> 01:37:28,475
You can also get it
from your jdbc format.
2313
01:37:28,475 --> 01:37:30,400
Now, they will be
a data source API.
2314
01:37:30,400 --> 01:37:31,708
So using data source API,
2315
01:37:31,708 --> 01:37:34,273
you can fetch the data
after fetching the data
2316
01:37:34,273 --> 01:37:36,300
you will be converting
to a data frame
2317
01:37:36,300 --> 01:37:38,000
where so what is data frame.
2318
01:37:38,000 --> 01:37:39,833
So in the last one
we have learned
2319
01:37:39,833 --> 01:37:42,892
that that when we were creating
everything is already
2320
01:37:42,892 --> 01:37:43,900
what we were doing.
2321
01:37:43,900 --> 01:37:46,437
So, let's say this was
my Cluster, right?
2322
01:37:46,437 --> 01:37:48,358
So let's say this is machine.
2323
01:37:48,358 --> 01:37:49,860
This is another machine.
2324
01:37:49,860 --> 01:37:51,800
This is another machine, right?
2325
01:37:51,800 --> 01:37:53,757
So let's say these are
all my clusters.
2326
01:37:53,757 --> 01:37:55,703
So what we were doing
in this case now
2327
01:37:55,703 --> 01:37:58,700
when we were creating all
these things are as were cluster
2328
01:37:58,700 --> 01:38:00,000
what was happening here.
2329
01:38:00,000 --> 01:38:02,600
We were passing
all our values in, right?
2330
01:38:02,600 --> 01:38:04,739
So let's say we
were keeping all the data.
2331
01:38:04,739 --> 01:38:06,200
Let's say block B1 was there
2332
01:38:06,200 --> 01:38:08,850
so we were passing all
the values and work creating it
2333
01:38:08,850 --> 01:38:11,400
in the form of in the memory
and we were calling
2334
01:38:11,400 --> 01:38:12,800
that as rdd now
2335
01:38:12,800 --> 01:38:16,094
when we are working in SQL,
we have to store the data,
2336
01:38:16,094 --> 01:38:17,900
which is a table of data, right?
2337
01:38:17,900 --> 01:38:19,200
So let's say there is a table
2338
01:38:19,200 --> 01:38:21,200
which is let's say
having column details.
2339
01:38:21,200 --> 01:38:23,200
Let's say name age.
2340
01:38:23,200 --> 01:38:24,024
Let's say here.
2341
01:38:24,024 --> 01:38:26,236
I have some value here
are some value here.
2342
01:38:26,236 --> 01:38:28,506
I have some value here
at some value, right?
2343
01:38:28,506 --> 01:38:31,200
So let's say I have some value
of this table format.
2344
01:38:31,200 --> 01:38:34,200
Now if I have to keep
this data into my cluster
2345
01:38:34,200 --> 01:38:35,200
what you need to do,
2346
01:38:35,200 --> 01:38:37,962
so you will be keeping first
of all into the memory.
2347
01:38:37,962 --> 01:38:39,100
So you will be having
2348
01:38:39,100 --> 01:38:42,418
let's say name H this column
to test first of all year
2349
01:38:42,418 --> 01:38:45,767
and after that you will be
having some details of this.
2350
01:38:45,767 --> 01:38:46,210
Perfect.
2351
01:38:46,210 --> 01:38:47,804
So let's say this much data,
2352
01:38:47,804 --> 01:38:49,900
you have some part
in the similar kind
2353
01:38:49,900 --> 01:38:52,572
of table with some other values
will be here also,
2354
01:38:52,572 --> 01:38:55,300
but here also you are going
to have column details.
2355
01:38:55,300 --> 01:38:58,500
You will be having name H
some more data here.
2356
01:38:58,600 --> 01:39:02,600
Now if you notice this
is sounding similar to an RDD,
2357
01:39:02,700 --> 01:39:06,000
but this is not exactly
like an RDD, right?
2358
01:39:06,000 --> 01:39:09,400
because here we are not only
keeping just the data but we
2359
01:39:09,400 --> 01:39:12,500
are also storing something
like the column details in storage,
2360
01:39:12,500 --> 01:39:12,861
right?
2361
01:39:12,861 --> 01:39:15,400
We are also keeping
the column details in all of our
2362
01:39:15,400 --> 01:39:18,500
data nodes, or we can call them
worker nodes, right?
2363
01:39:18,500 --> 01:39:20,653
So we are also keeping
the column vectors
2364
01:39:20,653 --> 01:39:22,000
along with the row details.
2365
01:39:22,000 --> 01:39:24,700
So this thing is called
as data frames.
2366
01:39:24,700 --> 01:39:26,600
Okay, so that is called
your data frame.
2367
01:39:26,600 --> 01:39:29,400
So that is what we are going to
do is we are going to convert it
2368
01:39:29,400 --> 01:39:31,057
to a data frame API then
2369
01:39:31,057 --> 01:39:35,200
using the DataFrame DSL or by
using Spark SQL queries,
2370
01:39:35,200 --> 01:39:37,550
or you will be processing
the results and giving
2371
01:39:37,550 --> 01:39:40,300
the output we will learn about
all these things in detail.
2372
01:39:40,600 --> 01:39:44,100
So, let's see the Spark SQL
libraries. Now there are
2373
01:39:44,100 --> 01:39:45,800
multiple apis available.
2374
01:39:45,800 --> 01:39:48,700
Like we have
data source API we
2375
01:39:48,700 --> 01:39:50,500
have data frame API.
2376
01:39:50,500 --> 01:39:53,510
We have interpreter
and Optimizer and SQL service.
2377
01:39:53,510 --> 01:39:55,600
We will explore
all this in detail.
2378
01:39:55,600 --> 01:39:58,000
So let's talk about
the Data Source API.
2379
01:39:58,000 --> 01:40:02,787
if we talk about data source API
what happens in data source API,
2380
01:40:02,787 --> 01:40:04,133
it is used to read
2381
01:40:04,133 --> 01:40:07,364
and store the structured
and unstructured data
2382
01:40:07,364 --> 01:40:08,800
into your spark SQL.
2383
01:40:08,800 --> 01:40:12,200
So as you can notice in Sparks
equal we can give fetch the data
2384
01:40:12,200 --> 01:40:13,437
using multiple sources
2385
01:40:13,437 --> 01:40:15,800
like you can get it
from Hive, Cassandra,
2386
01:40:15,800 --> 01:40:18,800
CSV, Apache
HBase, Oracle DB — so
2387
01:40:18,800 --> 01:40:20,300
many formats available, right?
2388
01:40:20,300 --> 01:40:21,427
So this API is going
2389
01:40:21,427 --> 01:40:24,956
to help you to get all the data
to read all the data store it
2390
01:40:24,956 --> 01:40:26,700
where ever you want to use it.
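(A hedged sketch of reading through the Data Source API from a few of these formats; the paths and JDBC connection details below are placeholders, not values from the video:)

  // CSV, JSON and JDBC sources all arrive as DataFrames
  val csvDF  = spark.read.option("header", "true").csv("/data/employee.csv")
  val jsonDF = spark.read.json("/data/employee.json")
  val jdbcDF = spark.read.format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/testdb")   // assumed connection string
    .option("dbtable", "employee")
    .option("user", "root")
    .option("password", "secret")
    .load()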
2391
01:40:26,700 --> 01:40:28,387
Now after that your data
2392
01:40:28,387 --> 01:40:31,200
frame API is going
to help you to convert
2393
01:40:31,200 --> 01:40:33,100
that into named columns,
2394
01:40:33,100 --> 01:40:34,700
and remember I
just explained you
2395
01:40:34,800 --> 01:40:36,902
that how you store
the data in that
2396
01:40:36,902 --> 01:40:39,793
because here you are not keeping
it like an RDD.
2397
01:40:39,793 --> 01:40:42,100
You're also keeping
the named column as
2398
01:40:42,100 --> 01:40:45,500
well as the row details. That is
the difference coming up here.
2399
01:40:45,500 --> 01:40:47,382
So that is
what it is converting.
2400
01:40:47,382 --> 01:40:48,100
In this case.
2401
01:40:48,100 --> 01:40:50,561
We are using data
frame API to convert it
2402
01:40:50,561 --> 01:40:52,900
into your named column
and rows, right?
2403
01:40:52,900 --> 01:40:54,600
So that is what you
will be doing.
2404
01:40:54,600 --> 01:40:57,700
So it also follows the same
properties as your RDDs,
2405
01:40:57,700 --> 01:40:59,993
like your RDDs it is
lazily evaluated,
2406
01:40:59,993 --> 01:41:02,500
in all same properties
will also follow up here.
2407
01:41:02,500 --> 01:41:06,000
Okay, now the Interpreter
and Optimizer. In the Interpreter
2408
01:41:06,000 --> 01:41:08,485
and Optimizer step
what we are going to do.
2409
01:41:08,485 --> 01:41:11,184
So, let's see if we have
this data frame API,
2410
01:41:11,184 --> 01:41:13,700
so we are going to first
create this name.
2411
01:41:13,700 --> 01:41:17,800
Column then after that we
will be now creating an rdd.
2412
01:41:17,800 --> 01:41:20,400
We will be applying
our transformation step.
2413
01:41:20,400 --> 01:41:23,877
We will be doing over action
step right to Output the value.
2414
01:41:23,877 --> 01:41:25,040
So all those things
2415
01:41:25,040 --> 01:41:28,100
are happening
in The Interpreter
2416
01:41:28,100 --> 01:41:29,400
and Optimizer step.
2417
01:41:29,400 --> 01:41:33,500
So this is all happening
in the Interpreter and Optimizer.
2418
01:41:33,600 --> 01:41:36,000
So this is what all
the features you have.
2419
01:41:36,000 --> 01:41:39,500
Now, let's talk about
SQL service now in SQL service
2420
01:41:39,500 --> 01:41:41,934
what happens it is going
to again help you
2421
01:41:41,934 --> 01:41:43,698
so after doing the trans-
2422
01:41:43,698 --> 01:41:45,200
formation and action, at the last stage,
2423
01:41:45,200 --> 01:41:47,567
after that using
spark SQL service,
2424
01:41:47,567 --> 01:41:50,700
you will be getting
your Spark SQL outputs.
2425
01:41:50,700 --> 01:41:54,200
So now in this case whatever
processing you have done right
2426
01:41:54,200 --> 01:41:57,500
in terms of transformations
in all of that so you can see
2427
01:41:57,500 --> 01:42:01,600
that your Spark SQL service
is an entry point for working
2428
01:42:01,600 --> 01:42:04,486
with the structured data
in your Apache Spark.
2429
01:42:04,486 --> 01:42:04,800
Okay.
2430
01:42:04,800 --> 01:42:07,611
So it is going to kind of
help you to fetch the results
2431
01:42:07,611 --> 01:42:08,700
from your optimize data
2432
01:42:08,700 --> 01:42:10,900
or maybe whatever you
have interpreted before
2433
01:42:10,900 --> 01:42:12,100
so that is what it's doing.
2434
01:42:12,100 --> 01:42:13,400
So this kind of completes.
2435
01:42:13,500 --> 01:42:15,400
This whole diagram now,
2436
01:42:15,400 --> 01:42:18,082
let us see that how we
can perform our queries
2437
01:42:18,082 --> 01:42:19,200
using Spark SQL.
2438
01:42:19,200 --> 01:42:21,435
Now if we talk
about spark SQL queries,
2439
01:42:21,435 --> 01:42:22,376
so first of all,
2440
01:42:22,376 --> 01:42:25,348
we can go to the spark shell itself
and execute everything there.
2441
01:42:25,348 --> 01:42:27,253
You can also execute
your program using
2442
01:42:27,253 --> 01:42:29,500
Spark in your Eclipse IDE,
directly from there.
2443
01:42:29,500 --> 01:42:30,600
Also, you can do that.
2444
01:42:30,600 --> 01:42:33,249
So if you are let's say log in
with your spark shell session.
2445
01:42:33,249 --> 01:42:34,200
So what you can do,
2446
01:42:34,200 --> 01:42:36,700
so let's say you have first
you need to import this
2447
01:42:36,700 --> 01:42:38,464
because in 2.x
you must have heard
2448
01:42:38,464 --> 01:42:40,500
that there is something
called SparkSession,
2449
01:42:40,500 --> 01:42:42,197
which came so that is
what we are doing.
2450
01:42:42,197 --> 01:42:44,200
So in our last session
we have already learned
2451
01:42:44,200 --> 01:42:47,077
about all these things;
now SparkSession is something
2452
01:42:47,077 --> 01:42:48,700
but we're importing after that.
2453
01:42:48,700 --> 01:42:51,940
We are creating a SparkSession
using a builder function.
2454
01:42:51,940 --> 01:42:52,704
Look at this.
2455
01:42:52,704 --> 01:42:55,822
So This Builder API you we
are using this Builder API,
2456
01:42:55,822 --> 01:42:57,458
then we are using the app name.
2457
01:42:57,458 --> 01:43:00,256
We are providing a configuration
and then we are telling
2458
01:43:00,256 --> 01:43:02,860
that we are going to create
our values here, right?
2459
01:43:02,860 --> 01:43:05,100
So that's why
we are calling getOrCreate. Okay,
2460
01:43:05,100 --> 01:43:07,987
then we are importing
all these things right
2461
01:43:07,987 --> 01:43:09,800
once we imported after that
2462
01:43:09,800 --> 01:43:10,900
we can say that okay.
2463
01:43:10,900 --> 01:43:12,731
We want to read
this Json file.
2464
01:43:12,731 --> 01:43:15,400
So this employee.json
we want to read up here,
2465
01:43:15,400 --> 01:43:18,398
and in the end we want
to Output this value, right?
2466
01:43:18,398 --> 01:43:21,700
So this d f becomes my data
frame containing store value
2467
01:43:21,700 --> 01:43:23,188
of my employee.json.
2468
01:43:23,188 --> 01:43:25,655
So this JSON value
will get converted
2469
01:43:25,655 --> 01:43:26,710
to my data frame.
2470
01:43:26,710 --> 01:43:30,000
Now in the end we're just
outputting the result. Now,
2471
01:43:30,000 --> 01:43:32,100
if you notice here
what we are doing,
2472
01:43:32,100 --> 01:43:33,312
so here we are first
2473
01:43:33,312 --> 01:43:36,100
of all importing your spark
session same story.
2474
01:43:36,100 --> 01:43:37,200
We just executing it.
2475
01:43:37,200 --> 01:43:39,500
Then we are building
our things better in that.
2476
01:43:39,500 --> 01:43:41,000
We're going to
create that again.
2477
01:43:41,000 --> 01:43:44,243
We are importing it then
we are reading Json file
2478
01:43:44,243 --> 01:43:46,000
by using the read.json API.
2479
01:43:46,000 --> 01:43:47,900
We are reading
our employee.json.
2480
01:43:47,900 --> 01:43:50,428
Okay, which is present
in this particular directory
2481
01:43:50,428 --> 01:43:52,400
and we are outputting
so can you can see
2482
01:43:52,400 --> 01:43:55,300
that the JSON format will be
a key-value format,
2483
01:43:55,300 --> 01:43:59,200
But when I'm doing this df
.show, it is just showing
2484
01:43:59,200 --> 01:44:00,700
up all my values here.
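(A minimal sketch of these spark-shell steps; the app name, config key and JSON path are placeholders:)

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("SparkSQLBasics")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()

  import spark.implicits._

  val df = spark.read.json("examples/src/main/resources/employee.json")
  df.show()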
2485
01:44:00,700 --> 01:44:00,935
Now.
2486
01:44:00,935 --> 01:44:03,138
Let's see how we
can create our data set.
2487
01:44:03,138 --> 01:44:04,900
Now when we talk about data set,
2488
01:44:04,900 --> 01:44:06,500
you can notice
what we're doing.
2489
01:44:06,500 --> 01:44:06,700
Now.
2490
01:44:06,700 --> 01:44:09,200
We have understood all
of this already. Now let's see how we
can create a data set. Now,
01:44:09,200 --> 01:44:12,300
can create a data set now
first of all in data set
2492
01:44:12,300 --> 01:44:14,800
what do we do? So
in a data set we can create
2493
01:44:14,800 --> 01:44:17,900
the class — you can see we
are creating a case class Employee,
2494
01:44:17,900 --> 01:44:19,600
right now in case class
2495
01:44:19,600 --> 01:44:22,400
what we are doing: we are
just creating a sequence
2496
01:44:22,400 --> 01:44:25,600
and putting in the values
for the name and age columns.
2497
01:44:25,600 --> 01:44:28,076
Then we are displaying
our output all this data
2498
01:44:28,076 --> 01:44:28,803
set right now.
2499
01:44:28,803 --> 01:44:32,010
We are creating a primitive data
set also to demonstrate mapping
2500
01:44:32,010 --> 01:44:33,894
of this data frames
to your data sets.
2501
01:44:33,894 --> 01:44:34,200
Right?
2502
01:44:34,200 --> 01:44:36,200
So you can notice
that we are using
2503
01:44:36,200 --> 01:44:37,700
toDS instead of toDF.
2504
01:44:37,700 --> 01:44:39,500
We are using toDS
in this case.
2505
01:44:39,500 --> 01:44:42,293
Now, you may ask me what's
the difference with respect
2506
01:44:42,293 --> 01:44:43,400
to data frame, right?
2507
01:44:43,400 --> 01:44:45,100
With respect to data frame
2508
01:44:45,100 --> 01:44:46,700
in data frame
what we were doing.
2509
01:44:46,700 --> 01:44:48,682
We were creating
the data frame,
2510
01:44:48,682 --> 01:44:50,800
and data set both
look exactly the same.
2511
01:44:50,800 --> 01:44:53,228
It will also be having
the name column in rows
2512
01:44:53,228 --> 01:44:54,200
and everything up.
2513
01:44:54,200 --> 01:44:57,334
It was introduced
in version 1.6 and later.
2514
01:44:57,334 --> 01:44:58,196
And what is it
2515
01:44:58,196 --> 01:45:01,100
provide? It provides
an encoder mechanism, using
2516
01:45:01,100 --> 01:45:02,000
which you can get
2517
01:45:02,000 --> 01:45:04,208
when you are let's say
reading the data back.
2518
01:45:04,208 --> 01:45:06,200
Let's say when deserializing,
you're not doing
2519
01:45:06,200 --> 01:45:06,968
that step, right?
2520
01:45:06,968 --> 01:45:08,300
It is going to be faster.
2521
01:45:08,300 --> 01:45:10,400
So the performance
wise data set is better.
2522
01:45:10,400 --> 01:45:13,000
That's the reason it
is introduced later nowadays.
2523
01:45:13,000 --> 01:45:15,794
people are moving from
data frames to data sets. Okay.
2524
01:45:15,794 --> 01:45:17,500
So now we are just outputting
2525
01:45:17,500 --> 01:45:19,703
in the end see the same
thing in the output.
2526
01:45:19,703 --> 01:45:21,623
So we are creating
an Employee case class.
2527
01:45:21,623 --> 01:45:24,684
Then we are putting the value
inside it creating a data set.
2528
01:45:24,684 --> 01:45:26,500
We are looking
at the values, right?
2529
01:45:26,500 --> 01:45:29,200
So these are the steps we
have just understood them now
2530
01:45:29,200 --> 01:45:32,000
how we can read a file — so
we want to read the file.
2531
01:45:32,000 --> 01:45:35,300
So we will use read.json
as[Employee]. Employee was —
2532
01:45:35,300 --> 01:45:38,026
remember — the case class which
we have created last time.
2533
01:45:38,026 --> 01:45:39,700
This was the case class
we have created
2534
01:45:39,700 --> 01:45:40,900
your case class employee.
2535
01:45:40,900 --> 01:45:43,300
So we are telling
that we are creating like this.
2536
01:45:43,500 --> 01:45:45,200
We are just out
putting this value.
2537
01:45:45,200 --> 01:45:47,612
We are just doing a show;
you can see it this way.
2538
01:45:47,612 --> 01:45:49,000
We can see this output.
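(A sketch of the Dataset steps, following the standard Spark documentation pattern; the sample values and paths here are assumptions:)

  case class Employee(name: String, age: Long)

  // Dataset from a case class
  val caseClassDS = Seq(Employee("Andrew", 28)).toDS()
  caseClassDS.show()

  // Dataset of primitives, to demonstrate the mapping
  val primitiveDS = Seq(1, 2, 3).toDS()
  primitiveDS.map(_ + 1).collect()          // Array(2, 3, 4)

  // DataFrame -> Dataset via the Employee encoder
  val path = "examples/src/main/resources/employee.json"   // placeholder path
  val employeeDS = spark.read.json(path).as[Employee]
  employeeDS.show()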
2539
01:45:49,000 --> 01:45:50,700
Also now, let's see
2540
01:45:50,700 --> 01:45:53,900
how we can add the schema
to rdd now in order
2541
01:45:53,900 --> 01:45:57,300
to add the schema to rdd
what we are going to do.
2542
01:45:57,300 --> 01:45:59,100
So in this case also,
2543
01:45:59,200 --> 01:46:01,500
you can look at we
are importing all the values
2544
01:46:01,500 --> 01:46:03,700
that we are importing all
the libraries whatever
2545
01:46:03,700 --> 01:46:04,779
are required then
2546
01:46:04,779 --> 01:46:07,622
after that we are using
this spark context text
2547
01:46:07,622 --> 01:46:09,600
by reading the data splitting it
2548
01:46:09,600 --> 01:46:12,400
with respect to comma then
mapping the attributes.
2549
01:46:12,400 --> 01:46:14,750
to the Employee case class —
that's what we have done,
2550
01:46:14,750 --> 01:46:17,041
and converting
these values to integer.
2551
01:46:17,041 --> 01:46:19,891
And then we are converting it
with toDF, right? After that,
2552
01:46:19,891 --> 01:46:22,378
We are going to create
a temporary viewer table.
2553
01:46:22,378 --> 01:46:24,600
So let's create
this temporary view employ.
2554
01:46:24,600 --> 01:46:26,800
Then we are going
to use spark.sql
2555
01:46:26,800 --> 01:46:28,570
and pass in our SQL query.
2556
01:46:28,570 --> 01:46:31,500
Can you notice that we
have now passing the value
2557
01:46:31,500 --> 01:46:33,900
and we are accessing
this employee, right?
2558
01:46:33,900 --> 01:46:36,000
We are accessing
this employee here.
2559
01:46:36,000 --> 01:46:38,500
Now, what is this employ
this employee was
2560
01:46:38,500 --> 01:46:40,500
of a temporary view
which we have created
2561
01:46:40,500 --> 01:46:43,128
because the challenge
in Spark SQL is that
2562
01:46:43,128 --> 01:46:46,329
whenever you want
to execute any SQL query you
2563
01:46:46,329 --> 01:46:49,400
cannot say SELECT *
from the data frame.
2564
01:46:49,400 --> 01:46:50,439
You cannot do that.
2565
01:46:50,439 --> 01:46:52,300
There's this is
not even supported.
2566
01:46:52,300 --> 01:46:55,547
So you cannot do SELECT *
from your data frame.
2567
01:46:55,547 --> 01:46:56,508
So instead of that
2568
01:46:56,508 --> 01:46:59,500
what we need to do is we need
to create a temporary table
2569
01:46:59,500 --> 01:47:01,732
or a temporary view
so you can notice here.
2570
01:47:01,732 --> 01:47:04,456
We are using this create
OrReplaceTempView — "replace"
2571
01:47:04,456 --> 01:47:07,349
because if it is already
existing, it overrides it.
2572
01:47:07,349 --> 01:47:09,400
So now we are creating
a temporary table
2573
01:47:09,400 --> 01:47:12,900
which will be exactly similar
to mine this data frame now
2574
01:47:12,900 --> 01:47:15,605
you can just directly
execute all the queries
2575
01:47:15,605 --> 01:47:18,100
on your temp view
or temporary table.
2576
01:47:18,100 --> 01:47:21,258
So you can notice here
instead of using employ DF
2577
01:47:21,258 --> 01:47:22,800
which was our data frame.
2578
01:47:22,800 --> 01:47:24,730
I am using here temporary view.
2579
01:47:24,730 --> 01:47:26,100
Okay, then in the end,
2580
01:47:26,100 --> 01:47:28,000
we are just mapping
the names and ages, right?
2581
01:47:28,000 --> 01:47:29,669
and we are outputting the values.
2582
01:47:29,669 --> 01:47:30,200
That's it.
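(Putting the reflection-based steps above together as one hedged sketch; the file path is a placeholder:)

  import spark.implicits._

  case class Employee(name: String, age: Long)

  // read the text file, split on comma, map onto the Employee case class, convert to a DataFrame
  val employeeDF = spark.sparkContext
    .textFile("examples/src/main/resources/employee.txt")
    .map(_.split(","))
    .map(attrs => Employee(attrs(0), attrs(1).trim.toInt))
    .toDF()

  // register a temporary view so SQL can be run against it
  employeeDF.createOrReplaceTempView("employee")

  val youngstersDF = spark.sql("SELECT name, age FROM employee WHERE age BETWEEN 18 AND 30")
  youngstersDF.map(row => "Name: " + row(0)).show()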
2583
01:47:30,200 --> 01:47:31,000
Same thing.
2584
01:47:31,000 --> 01:47:33,300
This is just
an execution part of it.
2585
01:47:33,300 --> 01:47:35,350
So we are just showing
all the steps here.
2586
01:47:35,350 --> 01:47:36,500
You can see in the end.
2587
01:47:36,500 --> 01:47:38,500
We are outputting
all this value now
2588
01:47:38,600 --> 01:47:40,800
how we can add
the schema to rdd.
2589
01:47:40,800 --> 01:47:43,850
Let's see this transformation
step now in this case you Notice
2590
01:47:43,850 --> 01:47:45,404
that we can map
this youngstersDF data frame;
2591
01:47:45,404 --> 01:47:46,900
the we're converting
this map name
2592
01:47:46,900 --> 01:47:49,211
into the string for
the transformation part, right?
2593
01:47:49,211 --> 01:47:51,200
So we are checking all
this value that okay.
2594
01:47:51,200 --> 01:47:53,500
This is the string type name.
2595
01:47:53,500 --> 01:47:55,900
We are just showing up
this value right now.
2596
01:47:55,900 --> 01:47:56,900
What were you doing?
2597
01:47:56,900 --> 01:48:00,400
We are using this map encoder
from the implicit class,
2598
01:48:00,400 --> 01:48:03,717
which is available to us
to map the names and ages.
2599
01:48:03,717 --> 01:48:04,000
Okay.
2600
01:48:04,000 --> 01:48:05,529
So this is
what we're going to do
2601
01:48:05,529 --> 01:48:07,579
because remember in
the Employee case class
2602
01:48:07,579 --> 01:48:10,400
We have the name and age column
that we want to map now.
2603
01:48:10,400 --> 01:48:11,272
Now in this case,
2604
01:48:11,272 --> 01:48:13,164
we are mapping
the names to the ages.
2605
01:48:13,164 --> 01:48:14,400
So you can notice
2606
01:48:14,400 --> 01:48:17,600
that we are doing for ages
of our youngstersDF data frame
2607
01:48:17,600 --> 01:48:19,335
that what we
have created earlier
2608
01:48:19,335 --> 01:48:20,800
and the result is an array.
2609
01:48:20,800 --> 01:48:23,400
So the result but you're going
to get will be an array
2610
01:48:23,400 --> 01:48:25,700
with the name map
to your respective ages.
2611
01:48:25,700 --> 01:48:27,800
You can see this output
here so you can see
2612
01:48:27,800 --> 01:48:29,100
that this is getting map.
2613
01:48:29,100 --> 01:48:29,426
Right.
2614
01:48:29,426 --> 01:48:32,201
So we are getting seeing
this output like name is John
2615
01:48:32,201 --> 01:48:34,402
it is 28 that is what
we are talking about.
2616
01:48:34,402 --> 01:48:36,300
So here in this case,
you can notice
2617
01:48:36,300 --> 01:48:38,900
that it was representing
like this in this case.
2618
01:48:38,900 --> 01:48:42,200
The output is coming out
in this particular format now,
2619
01:48:42,200 --> 01:48:44,568
let's talk about
how we can add the schema —
2620
01:48:44,568 --> 01:48:47,674
how we can read the file
and add the schema to it.
2621
01:48:47,674 --> 01:48:50,702
so we will be first
of all importing the type class
2622
01:48:50,702 --> 01:48:51,706
into your Spark shell.
2623
01:48:51,706 --> 01:48:52,588
So with this is
2624
01:48:52,588 --> 01:48:54,815
what we have done
by using import statement.
2625
01:48:54,815 --> 01:48:58,286
Then we are going to import
the Row class into the Spark shell.
2626
01:48:58,286 --> 01:49:00,500
So Row will be used
in mapping our RDD to the schema.
2627
01:49:00,500 --> 01:49:00,813
Right?
2628
01:49:00,813 --> 01:49:01,700
So you can notice
2629
01:49:01,700 --> 01:49:05,100
we're importing this also then
we are creating an rdd called
2630
01:49:05,000 --> 01:49:06,200
as employeeRDD.
2631
01:49:06,200 --> 01:49:07,900
So in case this case
you can notice
2632
01:49:07,900 --> 01:49:09,809
that the same RDD
we are creating
2633
01:49:09,809 --> 01:49:12,700
and we are creating this
with the help of this text file.
2634
01:49:12,700 --> 01:49:15,700
So once we have create this we
are going to Define our schema.
2635
01:49:15,700 --> 01:49:17,300
So this is the schema part.
2636
01:49:17,300 --> 01:49:17,572
Okay.
2637
01:49:17,572 --> 01:49:18,452
So in this case,
2638
01:49:18,452 --> 01:49:21,050
we are going to Define it
like name, then a space,
2639
01:49:21,050 --> 01:49:21,800
then age. Okay,
2640
01:49:21,800 --> 01:49:24,700
because they these were
the two I have in my data also
2641
01:49:24,700 --> 01:49:26,129
in this employee.txt.
If you look, these
01:49:26,129 --> 01:49:27,305
if you look at these
2643
01:49:27,305 --> 01:49:29,600
are the two data which
we have: name and age.
2644
01:49:29,600 --> 01:49:31,635
Now what we can do
once we have done
2645
01:49:31,635 --> 01:49:34,100
that then we can split it
with respect to space.
2646
01:49:34,100 --> 01:49:34,600
We can say
2647
01:49:34,600 --> 01:49:37,082
that our mapping value
and we are passing it
2648
01:49:37,082 --> 01:49:39,200
all this value inside
of a structure.
2649
01:49:39,200 --> 01:49:42,200
Okay, so we are defining
our fields variable.
2650
01:49:42,200 --> 01:49:43,500
That is what we are doing.
2651
01:49:43,500 --> 01:49:45,200
See, this fields variable,
2652
01:49:45,200 --> 01:49:49,500
which is going to now output
after mapping the employee ID.
2653
01:49:49,500 --> 01:49:51,200
Okay, so that is
what we are doing.
2654
01:49:51,200 --> 01:49:54,413
So we want to map this
onto my schema string;
2655
01:49:54,413 --> 01:49:55,375
then in the end.
2656
01:49:55,375 --> 01:49:57,300
We will be obtaining this field.
2657
01:49:57,300 --> 01:49:59,940
If you notice this field
what we have created here.
2658
01:49:59,940 --> 01:50:01,788
We are obtaining
this into a schema.
2659
01:50:01,788 --> 01:50:03,900
So we are passing this
into a struct type
2660
01:50:03,900 --> 01:50:06,400
and it is getting converted
to be our scheme of it.
2661
01:50:06,500 --> 01:50:08,200
So that is what we will do.
2662
01:50:08,200 --> 01:50:10,768
You can see all
this execution same steps.
2663
01:50:10,768 --> 01:50:13,357
We are just executing
in this terminal now,
2664
01:50:13,357 --> 01:50:16,500
Let's see how we are going
to transform the results.
2665
01:50:16,500 --> 01:50:18,300
Now, whatever we
have done, right?
2666
01:50:18,300 --> 01:50:21,229
So now we are going to create
an RDD called rowRDD.
2667
01:50:21,229 --> 01:50:22,000
So let's create
2668
01:50:22,000 --> 01:50:25,088
that rowRDD — we are going
to create it, and we want
2669
01:50:25,088 --> 01:50:28,500
to transform the employeeRDD
using the map function
2670
01:50:28,500 --> 01:50:29,513
into rowRDD.
2671
01:50:29,513 --> 01:50:30,564
So let's do that.
2672
01:50:30,564 --> 01:50:30,837
Okay.
2673
01:50:30,837 --> 01:50:31,717
So in this case
2674
01:50:31,717 --> 01:50:34,483
what we are doing so look
at this employeeRDD:
2675
01:50:34,483 --> 01:50:36,797
we are splitting it
with respect to comma,
2676
01:50:36,797 --> 01:50:40,000
and after that we are telling
see remember we have name
2677
01:50:40,000 --> 01:50:41,400
and then H like this so
2678
01:50:41,400 --> 01:50:43,500
that's what we are telling —
that attribute
2679
01:50:43,500 --> 01:50:44,737
zero and attribute
2680
01:50:44,737 --> 01:50:47,796
one. And why are we trimming
it? Just to ensure
2681
01:50:47,796 --> 01:50:49,900
there are no spaces
and other such things,
2682
01:50:49,900 --> 01:50:52,600
so those things we don't want
to unnecessarily keep up.
2683
01:50:52,600 --> 01:50:55,400
So that's the reason we are
defining this term statement.
2684
01:50:55,400 --> 01:50:58,300
Now after that after we
once we are done with this,
2685
01:50:58,300 --> 01:51:01,100
we are going to define
a data frame employeeDF
2686
01:51:01,100 --> 01:51:03,874
and we are going to store
that rdd schema into it.
2687
01:51:03,874 --> 01:51:05,764
So now if you notice
this row ID,
2688
01:51:05,764 --> 01:51:07,300
which we have defined here
2689
01:51:07,300 --> 01:51:11,124
and schema which we have defined
in the last case right now
2690
01:51:11,124 --> 01:51:13,300
if you'll go back
and notice here.
2691
01:51:13,300 --> 01:51:16,300
Schema, we have created here
right with respect to my Fields.
2692
01:51:16,600 --> 01:51:19,100
So that schema and this value
2693
01:51:19,100 --> 01:51:21,900
what we have just
created here, rowRDD.
2694
01:51:21,900 --> 01:51:23,450
We are going to pass it and say
2695
01:51:23,450 --> 01:51:25,200
that we are going
to create a data frame.
2696
01:51:25,200 --> 01:51:27,900
So this will help us
in creating a data frame now,
2697
01:51:27,900 --> 01:51:31,135
we can create our temporary view
on the base of employee
2698
01:51:31,135 --> 01:51:33,900
of let's create an employee
or temporary View and then
2699
01:51:33,900 --> 01:51:36,900
what we can do we can execute
any SQL queries on top of it.
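(A hedged sketch of this programmatic, StructType-based approach; the file path and the "name age" fields follow the example being described:)

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{StringType, StructField, StructType}

  val employeeRDD = spark.sparkContext.textFile("examples/src/main/resources/employee.txt")

  // build the schema from a "name age" string
  val schemaString = "name age"
  val fields = schemaString.split(" ")
    .map(fieldName => StructField(fieldName, StringType, nullable = true))
  val schema = StructType(fields)

  // transform the raw lines into Rows, then attach the schema
  val rowRDD = employeeRDD
    .map(_.split(","))
    .map(attrs => Row(attrs(0), attrs(1).trim))

  val employeeDF = spark.createDataFrame(rowRDD, schema)
  employeeDF.createOrReplaceTempView("employee")
  spark.sql("SELECT name FROM employee").show()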
2700
01:51:36,900 --> 01:51:38,700
So as you can see
in Spark SQL we
2701
01:51:38,700 --> 01:51:42,000
can create all the SQL queries
and can directly execute
2702
01:51:42,000 --> 01:51:43,200
that now what we can do.
2703
01:51:43,300 --> 01:51:45,700
We want to Output the values
we can quickly do that.
2704
01:51:45,800 --> 01:51:46,000
Now.
2705
01:51:46,000 --> 01:51:48,500
We want to let's say display
the names of we can say Okay,
2706
01:51:48,500 --> 01:51:51,600
attribute 0 contains the name
we can use the show command.
2707
01:51:51,600 --> 01:51:54,662
So this is how we
will be performing the operation
2708
01:51:54,662 --> 01:51:56,100
in the schema way. Now,
2709
01:51:56,100 --> 01:51:58,900
so this is the same output way
means we're just executing
2710
01:51:58,900 --> 01:51:59,914
this whole thing up.
2711
01:51:59,914 --> 01:52:01,100
You can notice here.
2712
01:52:01,100 --> 01:52:03,400
Also, we are just
saying attribute(0).
2713
01:52:03,400 --> 01:52:06,205
It is representing
my output. Now,
2714
01:52:06,205 --> 01:52:08,200
let's talk about Json data.
2715
01:52:08,200 --> 01:52:10,085
Now when we talk
about Json data,
2716
01:52:10,085 --> 01:52:13,261
let's talk about how we
can load our files and work on.
2717
01:52:13,261 --> 01:52:15,496
This so in this case,
we will be first.
2718
01:52:15,496 --> 01:52:17,338
Let's say importing
our libraries.
2719
01:52:17,338 --> 01:52:18,800
Once we are done with that.
2720
01:52:18,800 --> 01:52:20,300
Now after that we can just say
2721
01:52:20,300 --> 01:52:23,587
read.json — we are
just bringing up our employee
2722
01:52:23,587 --> 01:52:25,611
.json. You see,
this is the execution
2723
01:52:25,611 --> 01:52:27,200
of this part now similarly,
2724
01:52:27,200 --> 01:52:29,042
we can also write
it back in Parquet,
2725
01:52:29,042 --> 01:52:31,282
or we can also read
the value from Parquet.
2726
01:52:31,282 --> 01:52:32,400
You can notice this
2727
01:52:32,400 --> 01:52:35,600
if you want to write
let's say this value employee
2728
01:52:35,600 --> 01:52:37,730
data frame to my Parquet file,
2729
01:52:37,730 --> 01:52:40,500
so I can simply use
write.parquet.
2730
01:52:40,500 --> 01:52:43,143
So this will create
employee.parquet.
2731
01:52:43,143 --> 01:52:46,504
It will be created, and here all
the values will be converted
2732
01:52:46,504 --> 01:52:47,900
to employee.parquet.
2733
01:52:47,900 --> 01:52:49,133
Only thing is the data.
2734
01:52:49,133 --> 01:52:51,600
If you go and see
in this particular directory,
2735
01:52:51,600 --> 01:52:52,717
this will be a directory.
2736
01:52:52,717 --> 01:52:53,954
We should be getting created.
2737
01:52:53,954 --> 01:52:55,400
So in this data,
you will notice
2738
01:52:55,400 --> 01:52:57,500
that you will not be able
to read the data.
2739
01:52:57,500 --> 01:53:00,100
So in that case
because it's not human readable.
2740
01:53:00,100 --> 01:53:02,200
So that's the reason you
will not be able to do that.
2741
01:53:02,200 --> 01:53:04,299
So, let's say you want
to read it now so you
2742
01:53:04,299 --> 01:53:05,449
can again bring it back
2743
01:53:05,449 --> 01:53:08,600
by using read.parquet you are
reading this employee.parquet,
2744
01:53:08,600 --> 01:53:09,600
which I just created
2745
01:53:09,600 --> 01:53:11,700
then you are creating
a temporary view
2746
01:53:11,700 --> 01:53:12,775
or temporary table
2747
01:53:12,775 --> 01:53:15,488
and then By using
standard SQL you can execute
2748
01:53:15,488 --> 01:53:16,903
on your temporary table.
2749
01:53:16,903 --> 01:53:17,844
Now in this way.
2750
01:53:17,844 --> 01:53:21,000
you can read your Parquet file
data, and in the end we are just
2751
01:53:21,000 --> 01:53:24,284
displaying the result see
the similar output of this.
2752
01:53:24,284 --> 01:53:24,600
Okay.
2753
01:53:24,600 --> 01:53:27,100
This is how we can execute
all these things up now.
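(A short sketch of the Parquet round trip just described, reusing the employeeDF from before; the directory name is an assumption:)

  // writes a directory of Parquet part files (not human-readable)
  employeeDF.write.parquet("employee.parquet")

  // read it back, register a view and query it with standard SQL
  val parquetDF = spark.read.parquet("employee.parquet")
  parquetDF.createOrReplaceTempView("parquetTable")
  spark.sql("SELECT name FROM parquetTable WHERE age BETWEEN 18 AND 30").show()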
2754
01:53:27,100 --> 01:53:28,670
Once we have done all this,
2755
01:53:28,670 --> 01:53:31,200
let's see how we
can create our data frames.
2756
01:53:31,200 --> 01:53:33,100
So let's create this file path.
2757
01:53:33,100 --> 01:53:36,390
So let's say we have created
this file employee.json;
2758
01:53:36,390 --> 01:53:38,508
after that we can
create a data frame
2759
01:53:38,508 --> 01:53:39,943
from our Json path, right?
2760
01:53:39,943 --> 01:53:42,884
So we are creating this
by using read.json, then
2761
01:53:42,884 --> 01:53:44,420
we can Print the schema.
2762
01:53:44,420 --> 01:53:47,300
What does it do? This is going
to print the schema
2763
01:53:47,300 --> 01:53:49,300
of my employee data frame?
2764
01:53:49,300 --> 01:53:52,500
Okay, so we are going to use
this printSchema to print
2765
01:53:52,500 --> 01:53:55,795
up all the values then we
can create a temporary view
2766
01:53:55,795 --> 01:53:57,000
of this data frame.
2767
01:53:57,000 --> 01:53:58,100
So we are doing
2768
01:53:58,100 --> 01:54:00,618
that — see createOrReplace-
TempView — we are creating
2769
01:54:00,618 --> 01:54:02,860
that which we have seen
it last time also now
2770
01:54:02,860 --> 01:54:04,888
after that we can
execute our SQL query.
2771
01:54:04,888 --> 01:54:07,800
So let's say we are executing
our SQL query from employee
2772
01:54:07,800 --> 01:54:10,000
where age is between 18
and 30, right?
2773
01:54:10,000 --> 01:54:11,300
So this kind of SQL query.
2774
01:54:11,300 --> 01:54:12,854
Let's say we want
to do we can get
2775
01:54:12,854 --> 01:54:14,989
that And in the end we
can see the output Also.
2776
01:54:14,989 --> 01:54:16,278
Let's see this execution.
2777
01:54:16,278 --> 01:54:17,000
So you can see
2778
01:54:17,000 --> 01:54:20,891
that all the employees whose ages
are, let's say, between 18 and 30 —
2779
01:54:20,891 --> 01:54:22,900
that is showing up
in the output.
2780
01:54:22,900 --> 01:54:23,147
Now.
2781
01:54:23,147 --> 01:54:25,176
Let's see this
rdd operation way.
2782
01:54:25,176 --> 01:54:26,369
Now what you can do
2783
01:54:26,369 --> 01:54:30,200
so we are going to create this
RDD, otherEmployeeRDD, now,
2784
01:54:30,200 --> 01:54:33,900
which is going to store
the content for the employee George
2785
01:54:33,900 --> 01:54:35,300
from New Delhi, Delhi.
2786
01:54:35,300 --> 01:54:36,433
So see this part,
2787
01:54:36,433 --> 01:54:39,500
so here we are creating this
by using makeRDD,
2788
01:54:39,500 --> 01:54:43,400
and this is going
to store the content for
2789
01:54:43,400 --> 01:54:45,000
George from New Delhi, right?
2790
01:54:45,000 --> 01:54:45,900
You can see this
2791
01:54:45,900 --> 01:54:48,300
so New Delhi is my city
name and Delhi is the state.
2792
01:54:48,300 --> 01:54:50,250
So that is what we
are passing inside it.
2793
01:54:50,250 --> 01:54:52,900
Now what we are doing we
are assigning the content
2794
01:54:52,900 --> 01:54:56,700
of this otherEmployeeRDD
into my otherEmployee data frame.
2795
01:54:56,700 --> 01:54:59,200
So we are using
this spark.read.json,
2796
01:54:59,200 --> 01:55:00,600
and we are reading at the value
2797
01:55:00,600 --> 01:55:02,800
and in the end we
are using this show API here.
2798
01:55:02,800 --> 01:55:04,857
You can notice
this output coming up now.
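(A sketch of creating a DataFrame from an RDD of JSON strings; the record content shown is an assumption based on the narration:)

  // one JSON document held as a string inside an RDD
  val otherEmployeeRDD = spark.sparkContext.makeRDD(
    """{"name":"George","address":{"city":"New Delhi","state":"Delhi"}}""" :: Nil)

  // the older read.json(RDD[String]) overload parses it into a DataFrame
  val otherEmployee = spark.read.json(otherEmployeeRDD)
  otherEmployee.show()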
2799
01:55:04,857 --> 01:55:06,400
Let's see with the hive table.
2800
01:55:06,400 --> 01:55:08,536
So with the hive table
if you want to read that,
2801
01:55:08,536 --> 01:55:10,186
so let's do it
with the case class
2802
01:55:10,186 --> 01:55:11,136
and Spark sessions.
2803
01:55:11,136 --> 01:55:11,900
So first of all,
2804
01:55:11,900 --> 01:55:14,713
we are going to import
the Row class, and we are going
2805
01:55:14,713 --> 01:55:16,700
to use SparkSession
in the Spark shell.
2806
01:55:16,700 --> 01:55:18,000
So let's do that for a way.
2807
01:55:18,000 --> 01:55:20,082
I'm importing this Row
and this SparkSession,
2808
01:55:20,082 --> 01:55:21,200
and now after that,
2809
01:55:21,200 --> 01:55:24,186
We are going to create a class
record containing this key
2810
01:55:24,186 --> 01:55:25,756
which is of integer data type
2811
01:55:25,756 --> 01:55:27,576
and a value which is
of string type.
2812
01:55:27,576 --> 01:55:29,426
Then we are going
to set our location
2813
01:55:29,426 --> 01:55:30,726
of the warehouse location.
2814
01:55:30,726 --> 01:55:31,948
Okay to this pathway rows.
2815
01:55:31,948 --> 01:55:33,400
So that is what we are doing.
2816
01:55:33,400 --> 01:55:33,629
Now.
2817
01:55:33,629 --> 01:55:36,100
We are going to build
a SparkSession object
2818
01:55:36,100 --> 01:55:39,200
to demonstrate the hive
example in Spark SQL.
2819
01:55:39,200 --> 01:55:40,100
Look at this now,
2820
01:55:40,100 --> 01:55:42,700
so we are creating Spark-
Session.builder again.
2821
01:55:42,700 --> 01:55:44,331
We are passing the app name
2822
01:55:44,331 --> 01:55:46,700
to it we have passing
the configuration to it.
2823
01:55:46,700 --> 01:55:48,968
And then we are saying
that we want to enable
2824
01:55:48,968 --> 01:55:50,000
The Hive support now
2825
01:55:50,000 --> 01:55:50,800
once we have done
2826
01:55:50,800 --> 01:55:53,800
that we are importing
this spark SQL library center.
2827
01:55:54,000 --> 01:55:56,612
And then you can notice
that we can use SQL
2828
01:55:56,612 --> 01:55:58,601
so we can create now a table SRC
2829
01:55:58,601 --> 01:56:01,336
so you can see create table
if not exists src
2830
01:56:01,336 --> 01:56:04,800
with columns to store the data
as key-value pairs.
2831
01:56:04,800 --> 01:56:06,399
So that is what we
are doing here.
2832
01:56:06,400 --> 01:56:09,000
Now, you can see all
this execution of the same step.
2833
01:56:09,000 --> 01:56:09,209
Now.
2834
01:56:09,209 --> 01:56:12,430
Let's see the sequel operation
happening here now in this case
2835
01:56:12,430 --> 01:56:13,229
what we can do.
2836
01:56:13,229 --> 01:56:15,700
We can now load the data
from this example,
2837
01:56:15,700 --> 01:56:17,500
which is present to succeed.
2838
01:56:17,500 --> 01:56:19,400
this kv1.txt file,
2839
01:56:19,400 --> 01:56:20,869
which is available to us
2840
01:56:20,869 --> 01:56:23,281
and we want to store it
into the table SRC
2841
01:56:23,281 --> 01:56:25,225
which we have just
created and now
2842
01:56:25,225 --> 01:56:28,872
if you want to just view all
this output, we can say
2843
01:56:28,872 --> 01:56:30,305
SELECT * FROM src,
2844
01:56:30,305 --> 01:56:31,764
and it is going to show up
2845
01:56:31,764 --> 01:56:34,005
all the values you
can see this output.
2846
01:56:34,005 --> 01:56:34,300
Okay.
2847
01:56:34,300 --> 01:56:37,341
So this is the way you can show
up the values. Now similarly we
2848
01:56:37,341 --> 01:56:38,899
can perform the count operation.
2849
01:56:38,899 --> 01:56:40,993
Okay, so we can say
SELECT COUNT(*)
2850
01:56:40,993 --> 01:56:43,400
FROM src to count the number
of keys in the
2851
01:56:43,400 --> 01:56:45,858
src table, and now
select all the records,
2852
01:56:45,858 --> 01:56:48,800
right so we can say
that is, SELECT key, value,
2853
01:56:48,800 --> 01:56:49,500
so you can see
2854
01:56:49,500 --> 01:56:52,150
that we can perform all
over Hive operations here
2855
01:56:52,150 --> 01:56:53,562
on this right similarly.
2856
01:56:53,562 --> 01:56:56,300
We can create a data set
string DS from spark DF
2857
01:56:56,300 --> 01:56:58,623
so you can see this
also by using SQL DF
2858
01:56:58,623 --> 01:57:00,835
what we already have
we can just say map
2859
01:57:00,835 --> 01:57:01,730
and then provide
2860
01:57:01,730 --> 01:57:04,541
the case class and map
the key-value pair,
2861
01:57:04,541 --> 01:57:07,600
and then in the end we
can show up all this value see
2862
01:57:07,600 --> 01:57:10,644
this execution of this in then
you can notice this output
2863
01:57:10,644 --> 01:57:11,828
which we want it now.
2864
01:57:11,828 --> 01:57:13,288
Let's see the result.
2865
01:57:13,288 --> 01:57:15,700
But now we can create
our data frame here.
2866
01:57:15,700 --> 01:57:18,384
Right so we can create
our data frame recordsDF
2867
01:57:18,384 --> 01:57:19,848
and store all the results
2868
01:57:19,848 --> 01:57:21,900
which contains the value
between 1 and 100.
2869
01:57:21,900 --> 01:57:24,600
So we are storing all the values
between 1 and 100 in it.
2870
01:57:24,600 --> 01:57:26,700
Then we are creating
a temporary view,
2871
01:57:26,700 --> 01:57:28,900
Okay for the records,
that's what we are doing.
2872
01:57:28,900 --> 01:57:31,200
So for recordsDF we are
creating a temporary view
2873
01:57:31,200 --> 01:57:33,800
so that we can run
all of our SQL queries on it. Now,
2874
01:57:33,800 --> 01:57:35,336
we can execute all the values
2875
01:57:35,336 --> 01:57:38,400
so you can also notice we
are doing join operation here.
2876
01:57:38,400 --> 01:57:40,900
Okay, so we can display
the content of join
2877
01:57:40,900 --> 01:57:43,300
between the records
and this src table.
2878
01:57:43,600 --> 01:57:46,400
We can do a join on this part,
so we can also perform all
2879
01:57:46,400 --> 01:57:48,300
the joint operations
and get the output.
2880
01:57:48,300 --> 01:57:48,500
Now.
2881
01:57:48,500 --> 01:57:50,356
Let's see our use case for it.
2882
01:57:50,356 --> 01:57:51,908
If we talk about use case.
2883
01:57:51,908 --> 01:57:55,071
We are going to analyze
our stock market with the help
2884
01:57:55,071 --> 01:57:57,100
of Spark SQL.
So let's understand
2885
01:57:57,100 --> 01:57:58,500
the problem statement first.
2886
01:57:58,500 --> 01:58:00,382
So now in our problem statement,
2887
01:58:00,382 --> 01:58:04,029
So what do we want to do? Well,
definitely everybody
2888
01:58:04,029 --> 01:58:07,156
must be aware of the stock market.
In the stock market,
2889
01:58:07,156 --> 01:58:08,811
a lot
of activities happen.
2890
01:58:08,811 --> 01:58:10,400
You want to know analyze it
2891
01:58:10,400 --> 01:58:13,300
in order to make some profit
out of it and all those stuff.
2892
01:58:13,300 --> 01:58:15,200
Alright, so now
let's say our company
2893
01:58:15,200 --> 01:58:18,200
have collected a lot of data
for different 10 companies
2894
01:58:18,200 --> 01:58:20,000
and they want to do
some computation.
2895
01:58:20,000 --> 01:58:22,964
Let's say they want to compute
the average closing price.
2896
01:58:22,964 --> 01:58:26,300
They want to list the companies
with the highest closing prices.
2897
01:58:26,300 --> 01:58:29,749
They want to compute the average
closing price per month.
2898
01:58:29,749 --> 01:58:32,485
They want to list the number
of big price Rises
2899
01:58:32,485 --> 01:58:35,400
and fall and compute
some statistical correlation.
2900
01:58:35,400 --> 01:58:37,700
So these things we are going
to do with the help
2901
01:58:37,700 --> 01:58:39,158
of our spark SQL statement.
2902
01:58:39,158 --> 01:58:42,255
So this is very common: we want
to process huge data,
2903
01:58:42,255 --> 01:58:45,103
We want to handle The input
from the multiple sources,
2904
01:58:45,103 --> 01:58:47,200
we want to process
the data in real time
2905
01:58:47,200 --> 01:58:48,754
and it should be easy to use.
2906
01:58:48,754 --> 01:58:50,488
It should not be
very complicated.
2907
01:58:50,488 --> 01:58:53,800
So all this requirement will be
handled by my spots equal right?
2908
01:58:53,800 --> 01:58:55,700
So that's the reason
we are going to use
2909
01:58:55,700 --> 01:58:56,950
Spark SQL.
2910
01:58:56,950 --> 01:58:57,700
So as I said
2911
01:58:57,700 --> 01:58:59,600
that we are going
to use 10 companies.
2912
01:58:59,600 --> 01:59:02,076
So we are going to kind
of use this 10 companies
2913
01:59:02,076 --> 01:59:03,498
and on those ten companies.
2914
01:59:03,498 --> 01:59:04,500
We are going to see
2915
01:59:04,500 --> 01:59:07,200
that we are going to perform
our analysis on top of it.
2916
01:59:07,200 --> 01:59:09,100
So we will be using
this table data
2917
01:59:09,100 --> 01:59:11,800
from Yahoo finance
for all this following stocks.
2918
01:59:11,800 --> 01:59:14,300
So we have their ticker symbols.
2919
01:59:14,300 --> 01:59:15,400
So all these companies
2920
01:59:15,400 --> 01:59:17,600
we have on on which we
are going to perform.
2921
01:59:17,600 --> 01:59:20,800
So this is how my data will look
like which will be having date
2922
01:59:20,800 --> 01:59:25,046
opening price, high, low,
closing price, volume, and adjusted close.
2923
01:59:25,046 --> 01:59:27,700
All this data will
be presented now.
2924
01:59:27,700 --> 01:59:28,917
So, let's see how we
2925
01:59:28,917 --> 01:59:31,900
can Implement a stock analysis
using spark sequel.
2926
01:59:31,900 --> 01:59:33,497
So what we have to do for that,
2927
01:59:33,497 --> 01:59:36,278
so this is how my data
flow diagram will look:
2928
01:59:36,278 --> 01:59:38,811
we are going to initially
have a huge amount
2929
01:59:38,811 --> 01:59:40,000
of real-time stock data
2930
01:59:40,000 --> 01:59:42,400
that we are going to process it
through Spark SQL.
2931
01:59:42,400 --> 01:59:44,600
So we convert it into
a named-column base.
2932
01:59:44,600 --> 01:59:46,308
Then we are going
to create an rdd
2933
01:59:46,308 --> 01:59:47,658
for functional programming.
2934
01:59:47,658 --> 01:59:48,395
So let's do that.
2935
01:59:48,395 --> 01:59:50,354
Then we are going to use
our Spark SQL,
2936
01:59:50,354 --> 01:59:52,500
which will calculate
the average closing price
2937
01:59:52,500 --> 01:59:53,600
per year, calculate
2938
01:59:53,600 --> 01:59:56,188
the company with the highest closing
price per year, and then by hitting
2939
01:59:56,188 --> 01:59:59,000
some Spark SQL queries
we will be getting our outputs.
2940
01:59:59,000 --> 02:00:01,000
Okay, so that is
what we're going to do.
2941
02:00:01,000 --> 02:00:03,400
So all the queries
what we are getting generated,
2942
02:00:03,400 --> 02:00:05,500
so it's not only this we
are also going to compute
2943
02:00:05,500 --> 02:00:08,000
few other queries what we
have solve those queries.
2944
02:00:08,000 --> 02:00:09,200
We're going to execute him.
2945
02:00:09,200 --> 02:00:09,500
Now.
2946
02:00:09,500 --> 02:00:11,273
This is how the flow
will look like.
2947
02:00:11,273 --> 02:00:13,200
So we are going
to initially have this Data
2948
02:00:13,200 --> 02:00:16,000
what I have just shown you a now
what you're going to do.
2949
02:00:16,000 --> 02:00:17,700
You're going to create
a data frame you
2950
02:00:17,700 --> 02:00:19,990
are going to then create
a joined "close" RDD.
2951
02:00:19,990 --> 02:00:21,850
We will see what we
are going to do here.
2952
02:00:21,850 --> 02:00:23,900
Then we are going
to calculate the average
2953
02:00:23,900 --> 02:00:25,160
closing price per year.
2954
02:00:25,160 --> 02:00:27,900
We are going to hit
our Spark SQL query and get
2955
02:00:27,900 --> 02:00:29,314
the result in the table.
2956
02:00:29,314 --> 02:00:31,800
So this is how my execution
will look like.
2957
02:00:31,800 --> 02:00:33,445
So what we are going
to do in this case,
2958
02:00:33,445 --> 02:00:34,095
first of all,
2959
02:00:34,095 --> 02:00:36,839
we are going to initialize
Spark SQL. In this function,
2960
02:00:36,839 --> 02:00:39,600
We are going to import all
the required libraries then we
2961
02:00:39,600 --> 02:00:40,500
are going to start
2962
02:00:40,500 --> 02:00:43,216
our spark session after
importing all the required
2963
02:00:43,216 --> 02:00:44,473
libraries, we are going to create
2964
02:00:44,473 --> 02:00:47,251
our case class whatever
is required in the case class,
2965
02:00:47,251 --> 02:00:49,466
you can notice a then
we are going to Define
2966
02:00:49,466 --> 02:00:50,600
our parse-stocks schema.
2967
02:00:50,600 --> 02:00:53,350
So because we have already
learnt how to create a schema
2968
02:00:53,350 --> 02:00:55,500
we're going to create
this stocks table schema
2969
02:00:55,500 --> 02:00:56,800
by creating this way.
2970
02:00:56,800 --> 02:00:59,200
Well, then we are going
to Define our parts.
2971
02:00:59,200 --> 02:01:00,900
I DD so in parts are did
2972
02:01:00,900 --> 02:01:02,895
if you notice so
here we are creating.
2973
02:01:02,895 --> 02:01:04,289
This parts are ready mix.
2974
02:01:04,289 --> 02:01:05,708
We have going to create all
2975
02:01:05,708 --> 02:01:07,600
of that by using
this additive first.
2976
02:01:07,600 --> 02:01:10,300
We are going to remove
the header files also from it.
2977
02:01:10,300 --> 02:01:12,749
Then we are going
to read our CSV file
2978
02:01:12,749 --> 02:01:15,200
into Into stocks a a
on DF data frame.
2979
02:01:15,200 --> 02:01:17,500
So we are going to read
this as C dot txt file.
2980
02:01:17,500 --> 02:01:20,161
You can see we are reading
this file and we are going
2981
02:01:20,161 --> 02:01:21,800
to convert it into a data frame.
2982
02:01:21,800 --> 02:01:23,450
So we are passing
it as an oddity.
2983
02:01:23,450 --> 02:01:24,511
Once we are done then
2984
02:01:24,511 --> 02:01:26,697
if you want to print
the output we can do it
2985
02:01:26,697 --> 02:01:27,997
with the help of show API.
2986
02:01:27,997 --> 02:01:29,852
Once we are done
with this now we want
2987
02:01:29,852 --> 02:01:31,450
to let's say display the average
2988
02:01:31,450 --> 02:01:34,100
adjusted closing price
for a company for every month,
2989
02:01:34,100 --> 02:01:37,629
so if we can do all of that also
by using select query, right
2990
02:01:37,629 --> 02:01:40,300
so we can say this data frame
dot select and pass
2991
02:01:40,300 --> 02:01:43,100
whatever parameters are required
to get the average know,
2992
02:01:43,100 --> 02:01:44,000
You can notice are
2993
02:01:44,000 --> 02:01:47,200
inside this we are creating
aliases for the columns as well.
2994
02:01:47,200 --> 02:01:48,300
So for this dt column,
2995
02:01:48,300 --> 02:01:50,059
we are creating
an alias here, right?
2996
02:01:50,059 --> 02:01:52,538
So we are creating the alias
for it as well,
2997
02:01:52,538 --> 02:01:54,714
and we are showing
the output as well.
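(A rough sketch of the monthly average adjusted-close query; the DataFrame name and the dt/adjclose column names are assumptions:)

  import org.apache.spark.sql.functions.{avg, month, year}

  stockDF.select(year($"dt").as("yr"), month($"dt").as("mo"), $"adjclose")
    .groupBy("yr", "mo")
    .agg(avg("adjclose").as("avg_close"))
    .orderBy("yr", "mo")
    .show()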
2998
02:01:54,714 --> 02:01:56,307
So here is what we are going to do now:
2999
02:01:56,307 --> 02:01:57,400
we will be checking
3000
02:01:57,400 --> 02:01:59,669
that the closing
price for Microsoft.
3001
02:01:59,669 --> 02:02:03,300
So let's say they're going up
by 2 or with greater than 2
3002
02:02:03,300 --> 02:02:05,900
or wherever it is going
by greater than 2 and now we
3003
02:02:05,900 --> 02:02:08,039
want to get the output
and display the result
3004
02:02:08,039 --> 02:02:10,023
so you can notice
that wherever it is going
3005
02:02:10,023 --> 02:02:12,282
to be greater than 2 we
are getting the value.
3006
02:02:12,282 --> 02:02:14,383
So we are hitting
the SQL query to do that.
3007
02:02:14,383 --> 02:02:16,483
So we are hitting
the SQL query now on this
3008
02:02:16,483 --> 02:02:17,935
you can notice the SQL query
3009
02:02:17,935 --> 02:02:19,975
which we are hitting
on the stocks.
3010
02:02:19,975 --> 02:02:20,775
Msft.
3011
02:02:20,775 --> 02:02:21,128
Right?
3012
02:02:21,128 --> 02:02:22,768
This is the we have data frame
3013
02:02:22,768 --> 02:02:24,900
we have created now
on this we are doing
3014
02:02:24,900 --> 02:02:27,076
that and we are putting
our query that
3015
02:02:27,076 --> 02:02:29,395
where my condition
this to be true means
3016
02:02:29,395 --> 02:02:32,066
where my closing price
and my opening price
3017
02:02:32,066 --> 02:02:34,300
because let's say
at the closing price
3018
02:02:34,300 --> 02:02:36,852
the stock price by let's say
a hundred US Dollars
3019
02:02:36,852 --> 02:02:38,500
and at that time in the morning
3020
02:02:38,500 --> 02:02:40,800
when it opened at,
let's say, 98 US dollars or so —
3021
02:02:40,800 --> 02:02:43,131
wherever it is going
to have a difference
3022
02:02:43,131 --> 02:02:43,961
of two or greater
3023
02:02:43,961 --> 02:02:46,300
than two — only that output
we want to get so that is
3024
02:02:46,300 --> 02:02:47,400
what we're doing here.
3025
02:02:47,400 --> 02:02:47,600
Now.
3026
02:02:47,600 --> 02:02:50,600
Once we are done then after that
what we are going to do now,
3027
02:02:50,600 --> 02:02:52,628
we are going to use
the join operation.
3028
02:02:52,629 --> 02:02:55,500
So what we are going to do so
we will be joining the Annan
3029
02:02:55,500 --> 02:02:58,300
and except bestop's in order
to compare the closing price
3030
02:02:58,300 --> 02:03:00,200
because we want
to compare the prices
3031
02:03:00,200 --> 02:03:01,297
so we will be doing that.
3032
02:03:01,297 --> 02:03:02,000
So first of all,
3033
02:03:02,000 --> 02:03:04,600
we are going to create a union
of all these stocks
3034
02:03:04,600 --> 02:03:06,500
and then display
these joined rows.
3035
02:03:06,500 --> 02:03:07,259
So look at this
3036
02:03:07,259 --> 02:03:09,284
what we're going to do
we're going to use
3037
02:03:09,284 --> 02:03:10,200
Spark SQL, and
3038
02:03:10,200 --> 02:03:13,000
if you notice this closely
what we're doing in this case,
3039
02:03:13,000 --> 02:03:14,439
So now in this Spark SQL,
3040
02:03:14,439 --> 02:03:16,200
we are hitting
the SQL query
3041
02:03:16,200 --> 02:03:18,780
and all those stuff then
we are saying from this
3042
02:03:18,780 --> 02:03:21,192
and here we are using
this joint operation
3043
02:03:21,192 --> 02:03:22,704
may see this join oppression.
3044
02:03:22,704 --> 02:03:24,500
So this we are joining it on
3045
02:03:24,500 --> 02:03:26,500
and then in the end
we are outputting it.
3046
02:03:26,500 --> 02:03:28,700
So here you can see
you can do a comparison
3047
02:03:28,700 --> 02:03:31,300
of all these clothes price
for all these talks.
3048
02:03:31,300 --> 02:03:34,000
You can also include no
for more companies right now.
3049
02:03:34,000 --> 02:03:36,280
We have just shown you
an example with to complete
3050
02:03:36,280 --> 02:03:38,480
but you can do it
for more companies as well.
3051
02:03:38,480 --> 02:03:39,188
Now in this case
3052
02:03:39,188 --> 02:03:41,800
if you notice what we're doing
were writing this in the park
3053
02:03:41,800 --> 02:03:44,928
a file format and Save Being
into this particular location.
3054
02:03:44,928 --> 02:03:47,135
So we are creating
this joint stock market.
3055
02:03:47,135 --> 02:03:49,869
So we are storing it as
a packet file format and here
3056
02:03:49,869 --> 02:03:51,705
if you want to read
it we can read
3057
02:03:51,705 --> 02:03:52,800
that and showed output
3058
02:03:52,800 --> 02:03:55,300
but whatever file you
have saved it as a pocket
3059
02:03:55,300 --> 02:03:57,900
while definitely you
will not be able to read that up
3060
02:03:57,900 --> 02:04:00,700
because that file is going
to be the perfect way
3061
02:04:00,800 --> 02:04:03,900
and park it way are the files
which you can never read.
3062
02:04:03,900 --> 02:04:05,900
You will not be able
to read them up now,
3063
02:04:05,900 --> 02:04:08,382
so you will be seeing this
average closing price per year.
3064
02:04:08,382 --> 02:04:10,631
I'm going to show you all
these things running also some
3065
02:04:10,631 --> 02:04:13,181
just right to explaining you
how things will be run.
3066
02:04:13,181 --> 02:04:13,900
We're doing up here.
3067
02:04:13,900 --> 02:04:15,900
So I will be showing
you all these things
3068
02:04:15,900 --> 02:04:17,100
in execution as well.
3069
02:04:17,200 --> 02:04:18,200
Now in this case,
3070
02:04:18,200 --> 02:04:20,100
if you notice
what we are doing again,
3071
02:04:20,100 --> 02:04:21,907
we are creating
our data frame here.
3072
02:04:21,907 --> 02:04:24,800
Again, we are executing our
query whatever table we have.
3073
02:04:24,800 --> 02:04:26,300
We are executing on top of it.
3074
02:04:26,300 --> 02:04:27,050
So in this case
3075
02:04:27,050 --> 02:04:29,650
because we want to find
the average closing per year.
3076
02:04:29,650 --> 02:04:31,300
So what we are doing
in this case,
3077
02:04:31,300 --> 02:04:33,800
we are going to create
a new table containing
3078
02:04:33,800 --> 02:04:37,700
the average closing price
of let's say an and fxn first
3079
02:04:37,700 --> 02:04:40,319
and then we are going
to display all this new table.
3080
02:04:40,319 --> 02:04:41,369
So we are in the end.
3081
02:04:41,369 --> 02:04:42,800
We are going to
register this table
3082
02:04:42,800 --> 02:04:43,900
or The temporary table
3083
02:04:43,900 --> 02:04:46,515
so that we can execute
our SQL queries on top of it.
3084
02:04:46,515 --> 02:04:47,328
So in this case,
3085
02:04:47,328 --> 02:04:49,828
you can notice that we
are creating this new table.
3086
02:04:49,828 --> 02:04:50,900
And in this new table,
3087
02:04:50,900 --> 02:04:52,900
we have putting
our SQL query right
3088
02:04:52,900 --> 02:04:53,711
that SQL query
3089
02:04:53,711 --> 02:04:56,300
is going to contains
the average closing Paso
3090
02:04:56,300 --> 02:05:00,100
the SQL queries finding out
the average closing price of N
3091
02:05:00,100 --> 02:05:03,100
and all these companies
then whatever we have now.
3092
02:05:03,100 --> 02:05:05,688
We are going to apply
the transformation step
3093
02:05:05,688 --> 02:05:07,488
not transformation
of this new table,
3094
02:05:07,488 --> 02:05:09,188
which we have created
with the year
3095
02:05:09,188 --> 02:05:11,100
and the corresponding
three company data
3096
02:05:11,100 --> 02:05:13,400
what we have created
into the The company
3097
02:05:13,400 --> 02:05:15,103
or table select
which you can notice
3098
02:05:15,103 --> 02:05:17,100
that we are creating
this company or table
3099
02:05:17,100 --> 02:05:18,247
and here first of all,
3100
02:05:18,247 --> 02:05:20,725
we are going to create
a transform table company
3101
02:05:20,725 --> 02:05:23,413
or and going to display
the output so you can notice
3102
02:05:23,413 --> 02:05:25,100
that we are hitting
the SQL query
3103
02:05:25,100 --> 02:05:27,900
and in the end we have printing
this output similarly
3104
02:05:27,900 --> 02:05:29,975
if we want to let's say
compute the best
3105
02:05:29,975 --> 02:05:31,597
of average close we can do that.
3106
02:05:31,597 --> 02:05:33,618
So in this case again
the same way now,
3107
02:05:33,618 --> 02:05:35,800
if once they have learned
the basic stuff,
3108
02:05:35,800 --> 02:05:37,426
you can notice that everything
3109
02:05:37,426 --> 02:05:40,400
is following a similar approach
now in this case also,
3110
02:05:40,400 --> 02:05:43,200
we want to find out let's say
the best of the average
3111
02:05:43,200 --> 02:05:46,100
So we are creating
this best company here now.
3112
02:05:46,100 --> 02:05:49,500
It should contain the best
average closing price of an MX
3113
02:05:49,500 --> 02:05:52,700
and first so we can just get
this greatest and all battery.
3114
02:05:52,700 --> 02:05:53,400
So we creating
3115
02:05:53,400 --> 02:05:56,675
that then after that we
are going to display this output
3116
02:05:56,675 --> 02:05:59,846
and we will be again registering
it as a temporary table now,
3117
02:05:59,846 --> 02:06:02,700
once we have done that then
we can hit our queries now,
3118
02:06:02,700 --> 02:06:04,350
so we want to check
let's say best
3119
02:06:04,350 --> 02:06:05,600
performing company per year.
3120
02:06:05,600 --> 02:06:07,200
Now what we have to do for that.
3121
02:06:07,200 --> 02:06:09,319
So we are creating
the final table in which
3122
02:06:09,319 --> 02:06:10,400
we are going to compute
3123
02:06:10,400 --> 02:06:13,200
all the things we are going
to perform the join or not.
3124
02:06:13,200 --> 02:06:16,082
So although SQL query we
are going to perform here
3125
02:06:16,082 --> 02:06:17,200
in order to compute
3126
02:06:17,200 --> 02:06:19,500
that which company
is doing the best
3127
02:06:19,500 --> 02:06:21,250
and then we are going
to display the output.
3128
02:06:21,250 --> 02:06:23,800
So this is what the output
is going showing up here.
3129
02:06:23,800 --> 02:06:25,850
We are again storing
as a comparative View
3130
02:06:25,850 --> 02:06:28,000
and here again the same
story of correlation
3131
02:06:28,000 --> 02:06:29,400
what we're going to do here.
3132
02:06:29,400 --> 02:06:32,843
So now we will be using
our statistics libraries to find
3133
02:06:32,843 --> 02:06:36,400
the correlation between Anand
epochs companies closing price.
3134
02:06:36,400 --> 02:06:38,300
So that is what we
are going to do now.
3135
02:06:38,300 --> 02:06:41,088
So correlation in finance
and the investment
3136
02:06:41,088 --> 02:06:43,079
and industries is a statistics.
3137
02:06:43,079 --> 02:06:44,300
Measures the degree
3138
02:06:44,300 --> 02:06:47,564
to which to Securities move
in relation to each other.
3139
02:06:47,564 --> 02:06:49,625
So the closer the correlation is
3140
02:06:49,625 --> 02:06:52,200
to be 1 this is going
to be a better one.
3141
02:06:52,200 --> 02:06:53,722
So it is always like
3142
02:06:53,722 --> 02:06:57,300
how to variables are correlated
with each other.
3143
02:06:57,300 --> 02:07:01,400
Let's say your H is highly
correlated to your salary,
3144
02:07:01,400 --> 02:07:05,000
but you're earning like
when you are young you usually
3145
02:07:05,000 --> 02:07:06,400
unless and when you
3146
02:07:06,400 --> 02:07:09,500
are more Edge definitely
you will be earning more
3147
02:07:09,500 --> 02:07:12,811
because you will be more mature
similar way I can say that.
3148
02:07:12,811 --> 02:07:16,400
Your salary is also dependent
on your education qualification.
3149
02:07:16,400 --> 02:07:18,815
And also on the premium
Institute from where you
3150
02:07:18,815 --> 02:07:20,149
have done your education.
3151
02:07:20,149 --> 02:07:21,751
Let's say if you are from IIT,
3152
02:07:21,751 --> 02:07:24,100
or I am definitely
your salary will be higher
3153
02:07:24,100 --> 02:07:25,300
from any other campuses.
3154
02:07:25,300 --> 02:07:26,100
Right Miss.
3155
02:07:26,100 --> 02:07:27,072
It's a probability.
3156
02:07:27,072 --> 02:07:28,300
We what I'm telling you.
3157
02:07:28,300 --> 02:07:28,900
So let's say
3158
02:07:28,900 --> 02:07:32,132
if I have to correlate now
in this case the education
3159
02:07:32,132 --> 02:07:35,600
and the salary but I can easily
create a correlation, right?
3160
02:07:35,600 --> 02:07:37,300
So that is
what the correlation go.
3161
02:07:37,300 --> 02:07:38,589
So we are going to do all
3162
02:07:38,589 --> 02:07:40,573
that with respect
to Overstock analysis.
3163
02:07:40,573 --> 02:07:41,869
Now now what we are doing
3164
02:07:41,869 --> 02:07:45,185
in this case, so You can notice
we are creating this series one
3165
02:07:45,185 --> 02:07:47,188
where we heading
the select query now,
3166
02:07:47,188 --> 02:07:49,401
we are mapping all
this an enclosed price.
3167
02:07:49,401 --> 02:07:52,400
We are converting to a DD
similar way for Series 2.
3168
02:07:52,400 --> 02:07:53,691
Also we are doing that right.
3169
02:07:53,691 --> 02:07:55,832
So this is we are doing
for rabbits or earlier.
3170
02:07:55,832 --> 02:07:58,600
We have done it for an enclosed
and then in the end we
3171
02:07:58,600 --> 02:08:00,911
are using the statistics
dot core to create
3172
02:08:00,911 --> 02:08:02,500
a correlation between them.
3173
02:08:02,600 --> 02:08:06,200
So you can notice this is how we
can execute everything now.
3174
02:08:06,200 --> 02:08:10,353
Let's go to our VM and see
everything in our execution.
3175
02:08:11,142 --> 02:08:12,757
Question from at all.
3176
02:08:12,900 --> 02:08:15,300
So this VM how we
will be getting you
3177
02:08:15,300 --> 02:08:17,659
will be getting all
this VM from a director.
3178
02:08:17,659 --> 02:08:19,815
So you need not worry
about all that but
3179
02:08:19,815 --> 02:08:21,930
that how I will be
getting all this p.m.
3180
02:08:21,930 --> 02:08:24,100
In a so a once you
enroll for the courses
3181
02:08:24,100 --> 02:08:27,300
and also you will be getting all
this came from that Erika said
3182
02:08:27,300 --> 02:08:28,541
so even if I am working
3183
02:08:28,541 --> 02:08:30,711
on Mac operating system
my VM will work.
3184
02:08:30,711 --> 02:08:32,300
Yes every operating system.
3185
02:08:32,300 --> 02:08:33,535
It will be supported.
3186
02:08:33,535 --> 02:08:35,592
So no trouble you
can just use any sort
3187
02:08:35,592 --> 02:08:38,428
of VM in all means
any operating system to do that.
3188
02:08:38,428 --> 02:08:41,000
So what I would occur do
is they just don't want
3189
02:08:41,000 --> 02:08:43,900
You to be troubled
in any sort of stuff here.
3190
02:08:43,900 --> 02:08:46,076
So what they do is
they kind of ensure
3191
02:08:46,076 --> 02:08:48,342
that whatever is required
for your practicals.
3192
02:08:48,342 --> 02:08:49,400
They take care of it.
3193
02:08:49,400 --> 02:08:51,700
That's the reason they
have created their own VM,
3194
02:08:51,700 --> 02:08:54,600
which is also going to be
a lower size and compassion
3195
02:08:54,600 --> 02:08:56,100
to Cloudera hortonworks VM
3196
02:08:56,100 --> 02:08:58,997
and this is going to definitely
be more helpful for you.
3197
02:08:58,997 --> 02:09:01,000
So all these things
will be provided to
3198
02:09:01,000 --> 02:09:02,524
you question from nothing.
3199
02:09:02,524 --> 02:09:05,900
So all this project I am going
to learn from the sessions.
3200
02:09:05,900 --> 02:09:06,200
Yes.
3201
02:09:06,200 --> 02:09:09,650
So once you enroll for so
right now whatever we have seen
3202
02:09:09,650 --> 02:09:13,100
definitely we have just Otten
upper level of view of this
3203
02:09:13,100 --> 02:09:15,350
how the session looks
like for a purchase.
3204
02:09:15,350 --> 02:09:18,700
But but when we actually teach
all these things in the course,
3205
02:09:18,700 --> 02:09:21,587
it's usually are much more
in the detailed format.
3206
02:09:21,587 --> 02:09:22,700
So in detail format,
3207
02:09:22,700 --> 02:09:25,300
we kind of keep on showing
you each step in detail
3208
02:09:25,300 --> 02:09:28,299
that how the things are working
even including the project.
3209
02:09:28,299 --> 02:09:30,900
So you will be also learning
with the help of project
3210
02:09:30,900 --> 02:09:32,157
on each different topic.
3211
02:09:32,157 --> 02:09:34,200
So that is the way
we kind of go for it.
3212
02:09:34,200 --> 02:09:36,605
Now if I am stuck
in any other project then
3213
02:09:36,605 --> 02:09:37,985
who will be helping me
3214
02:09:37,985 --> 02:09:40,308
so they will be
a support team 24 by 7
3215
02:09:40,308 --> 02:09:42,046
if Get stuck at any moment.
3216
02:09:42,046 --> 02:09:44,300
You need to just
give a call and kit
3217
02:09:44,300 --> 02:09:45,900
and a call or email.
3218
02:09:45,900 --> 02:09:49,076
There is a support ticket
and immediately the technical
3219
02:09:49,076 --> 02:09:52,100
team will be helping across
the support team is 24 by 7.
3220
02:09:52,100 --> 02:09:53,900
They are they are
all technical people
3221
02:09:53,900 --> 02:09:55,821
and they will be assisting
you across on all
3222
02:09:55,821 --> 02:09:58,100
that even the trainers
will be assisting you for any
3223
02:09:58,100 --> 02:10:00,000
of the technical query great.
3224
02:10:00,000 --> 02:10:00,400
Awesome.
3225
02:10:00,800 --> 02:10:01,900
Thank you now.
3226
02:10:01,900 --> 02:10:03,700
So if you notice this is my data
3227
02:10:03,700 --> 02:10:06,446
we have we were executing
all the things on this data.
3228
02:10:06,446 --> 02:10:08,726
Now what we want to do
if you notice this is
3229
02:10:08,726 --> 02:10:10,900
the same code which I
have just shown you.
3230
02:10:10,900 --> 02:10:13,800
Earlier also now let us
just execute this code.
3231
02:10:13,800 --> 02:10:15,481
So in order to execute this
3232
02:10:15,481 --> 02:10:18,345
what we can do we can connect
to my spa action.
3233
02:10:18,345 --> 02:10:20,400
So let's get
connected to suction.
3234
02:10:21,700 --> 02:10:23,970
Someone's will be connected
to Spur action.
3235
02:10:23,970 --> 02:10:25,382
We will go step by step.
3236
02:10:25,382 --> 02:10:27,700
So first we will be
importing our package.
3237
02:10:31,400 --> 02:10:34,861
This take some time let
it just get connected.
3238
02:10:36,300 --> 02:10:38,400
Once this is connected now,
3239
02:10:38,400 --> 02:10:39,400
you can notice
3240
02:10:39,400 --> 02:10:42,400
that I'm just importing all
the all the important libraries
3241
02:10:42,400 --> 02:10:44,400
we have already
learned about that.
3242
02:10:45,800 --> 02:10:49,137
After that, you will be
initialising your spark session.
3243
02:10:49,137 --> 02:10:49,805
So let's do
3244
02:10:49,805 --> 02:10:52,900
that again the same steps
what you have done before.
3245
02:10:58,600 --> 02:10:59,922
Once we will be done.
3246
02:10:59,922 --> 02:11:02,000
We will be creating
a stock class.
3247
02:11:07,000 --> 02:11:09,900
We could have also directly
executed from Eclipse.
3248
02:11:09,900 --> 02:11:11,400
Also, this is just I want
3249
02:11:11,400 --> 02:11:13,800
to show you step-by-step
whatever we have learnt.
3250
02:11:13,800 --> 02:11:15,700
So now you can see
for company one and then
3251
02:11:15,700 --> 02:11:16,700
if you want to do
3252
02:11:16,700 --> 02:11:20,000
some computation we want to even
see the values and all right,
3253
02:11:20,000 --> 02:11:21,600
so that's what we're doing here.
3254
02:11:21,700 --> 02:11:24,700
So if we are just getting
the files creating another did,
3255
02:11:24,700 --> 02:11:26,800
you know, so let's execute this.
3256
02:11:28,500 --> 02:11:31,200
Similarly for your a back
similarly for your fast
3257
02:11:31,200 --> 02:11:34,050
for all this so I'm just copying
all these things together
3258
02:11:34,050 --> 02:11:36,100
because there are a lot
of companies for which we
3259
02:11:36,100 --> 02:11:37,400
have to do all this step.
3260
02:11:37,400 --> 02:11:39,625
So let's bring it
for all the 10 companies
3261
02:11:39,625 --> 02:11:41,200
which we are going to create.
3262
02:11:49,000 --> 02:11:49,900
So as you can see,
3263
02:11:49,900 --> 02:11:52,400
this print scheme has giving
it output right now.
3264
02:11:52,400 --> 02:11:52,900
Similarly.
3265
02:11:52,900 --> 02:11:55,800
I can execute for a rest
of the things as well.
3266
02:11:55,800 --> 02:11:57,800
So this is just giving
you the similar way.
3267
02:11:57,800 --> 02:12:01,702
All the outputs will be shown
up here company for company V
3268
02:12:01,702 --> 02:12:05,000
all these companies you
can see this in execution.
3269
02:12:08,000 --> 02:12:11,000
After that, we will be creating
our temporary view
3270
02:12:11,000 --> 02:12:13,800
so that we can execute
our SQL queries.
3271
02:12:16,500 --> 02:12:19,700
So let's do it for complaint
and also then after that we
3272
02:12:19,700 --> 02:12:22,900
can just create a work all
over temporary table for it.
3273
02:12:22,900 --> 02:12:25,200
Once we are done now
we can do our queries.
3274
02:12:25,200 --> 02:12:27,357
Like let's say we
can display the average
3275
02:12:27,357 --> 02:12:30,000
of existing closing price
for and and for each one
3276
02:12:30,000 --> 02:12:31,400
so we can hit this query.
3277
02:12:34,700 --> 02:12:37,500
So all these queries will happen
on your temporary view
3278
02:12:37,600 --> 02:12:39,800
because we cannot anyway
to all these queries
3279
02:12:39,800 --> 02:12:41,471
on our data frames are out
3280
02:12:41,471 --> 02:12:44,300
so you can see this this
is getting executed.
3281
02:12:45,500 --> 02:12:49,200
Trying it out to Tulsa now
because they've done dot shoe.
3282
02:12:49,200 --> 02:12:51,237
That's the reason
you're getting this output.
3283
02:12:51,237 --> 02:12:51,700
Similarly.
3284
02:12:51,700 --> 02:12:55,600
If we want to let's say list
the closing price for msft
3285
02:12:55,600 --> 02:12:57,600
which went up more than $2 way.
3286
02:12:57,600 --> 02:12:58,794
So that query also we
3287
02:12:58,794 --> 02:13:02,500
can execute now we have already
understood this query in detail.
3288
02:13:03,100 --> 02:13:05,300
It is seeing is
execution partner
3289
02:13:05,500 --> 02:13:08,100
so that you can appreciate
whatever you have learned.
3290
02:13:08,300 --> 02:13:10,700
See this is the output
showing up to you.
3291
02:13:10,800 --> 02:13:12,300
Now after that
3292
02:13:12,300 --> 02:13:15,723
how you can join all the stack
closing price right similar way
3293
02:13:15,723 --> 02:13:18,966
how we can save the joint view
in the packet for table.
3294
02:13:18,966 --> 02:13:20,435
You want to read that back.
3295
02:13:20,435 --> 02:13:22,157
You want to create a new table
3296
02:13:22,157 --> 02:13:25,275
like so let's execute all
these three queries together
3297
02:13:25,275 --> 02:13:27,100
because we have
already seen this.
3298
02:13:29,700 --> 02:13:30,502
Look at this.
3299
02:13:30,502 --> 02:13:31,800
So this in this case,
3300
02:13:31,800 --> 02:13:34,300
we are doing the drawing class
basing this output.
3301
02:13:34,300 --> 02:13:36,499
Then we want to save it
in the package files.
3302
02:13:36,499 --> 02:13:39,100
We are saving it and we want
to again reiterate back.
3303
02:13:39,100 --> 02:13:40,893
Then we are creating
our new table, right?
3304
02:13:40,893 --> 02:13:42,043
We were doing that join
3305
02:13:42,043 --> 02:13:44,200
and on so that is
what we are doing in this case.
3306
02:13:44,200 --> 02:13:45,900
Then you want
to see this output.
3307
02:13:47,700 --> 02:13:50,400
Then we are against touring
as a temp table or not.
3308
02:13:50,499 --> 02:13:50,700
Now.
3309
02:13:50,700 --> 02:13:53,700
Once we are done with this step
also then what so we
3310
02:13:53,700 --> 02:13:55,400
have done it in Step 6.
3311
02:13:55,400 --> 02:13:56,900
Now we want to perform.
3312
02:13:56,900 --> 02:13:58,488
Let's have a transformation
3313
02:13:58,488 --> 02:14:01,000
on new table corresponding
to the three companies
3314
02:14:01,000 --> 02:14:03,411
so that we can compare
we want to create
3315
02:14:03,411 --> 02:14:06,305
the best company containing
the best average closing price
3316
02:14:06,305 --> 02:14:07,748
for all these three companies.
3317
02:14:07,748 --> 02:14:09,300
We want to find the companies
3318
02:14:09,300 --> 02:14:11,600
but the best closing
price average per year.
3319
02:14:11,600 --> 02:14:13,200
So let's do all that as well.
3320
02:14:18,800 --> 02:14:22,343
So you can see best company
of the year now here also
3321
02:14:22,343 --> 02:14:26,500
the same stuff we are doing to
be registering over temp table.
3322
02:14:34,100 --> 02:14:35,700
Okay, so there's a mistake here.
3323
02:14:35,700 --> 02:14:38,096
So if you notice here it is 1
3324
02:14:38,100 --> 02:14:40,722
but here we are doing
a show of all right,
3325
02:14:40,722 --> 02:14:42,129
so there is a mistake.
3326
02:14:42,129 --> 02:14:43,600
I'm just correcting it.
3327
02:14:45,000 --> 02:14:48,300
So here also it should be
1 I'm just updating
3328
02:14:48,300 --> 02:14:51,300
in the sheet itself so
that it will start working now.
3329
02:14:51,300 --> 02:14:53,102
So here I have just made it one.
3330
02:14:53,102 --> 02:14:55,300
So now after that it
will start working.
3331
02:14:55,300 --> 02:14:59,600
Okay, wherever it is going
to be all I have to make it one.
3332
02:15:00,400 --> 02:15:03,500
So that is the change
which I need to do here also.
3333
02:15:04,400 --> 02:15:06,700
And you will notice
it will start working.
3334
02:15:06,900 --> 02:15:09,433
So here also you
need to make it one.
3335
02:15:09,433 --> 02:15:10,748
So all those places
3336
02:15:10,748 --> 02:15:14,363
where ever it was so just
kind of a good point to make
3337
02:15:14,363 --> 02:15:18,388
so wherever you are working
on this we need to always ensure
3338
02:15:18,388 --> 02:15:21,800
that all these values
what you are putting up here.
3339
02:15:21,800 --> 02:15:25,900
Okay, so I could have also
done it like this one second.
3340
02:15:26,300 --> 02:15:27,876
In fact in this place.
3341
02:15:27,876 --> 02:15:30,600
I need not do all
this step one second.
3342
02:15:30,600 --> 02:15:33,842
Let me explain you also
why no in this place.
3343
02:15:33,842 --> 02:15:37,600
It's So see from here
this error started opening why
3344
02:15:37,600 --> 02:15:38,758
because my data frame
3345
02:15:38,758 --> 02:15:40,500
what I have created
here most one.
3346
02:15:40,500 --> 02:15:41,500
Let's execute it.
3347
02:15:41,500 --> 02:15:43,500
Now, you will notice
this Quest artwork.
3348
02:15:44,340 --> 02:15:45,659
See this is working.
3349
02:15:46,000 --> 02:15:46,300
Now.
3350
02:15:46,300 --> 02:15:47,000
After that.
3351
02:15:47,000 --> 02:15:49,493
I am creating a temp table
that temp table.
3352
02:15:49,493 --> 02:15:52,400
What we are creating is
let's say company on okay.
3353
02:15:52,400 --> 02:15:55,100
So this is the temp table
which we have created.
3354
02:15:55,100 --> 02:15:57,808
You can see this company
now in this case
3355
02:15:57,808 --> 02:16:01,300
if I am keeping this company
on itself it is going to work.
3356
02:16:02,000 --> 02:16:03,195
Because here anyway,
3357
02:16:03,195 --> 02:16:05,897
I'm going to use
the whatever temporary table
3358
02:16:05,897 --> 02:16:07,310
we have created, right?
3359
02:16:07,310 --> 02:16:08,600
So now let's execute.
3360
02:16:10,800 --> 02:16:12,700
So you can see now
it started book.
3361
02:16:14,000 --> 02:16:15,900
No further to that now,
3362
02:16:15,900 --> 02:16:18,500
we want to create
a correlation between them
3363
02:16:18,500 --> 02:16:19,600
so we can do that.
3364
02:16:23,700 --> 02:16:26,400
See this is going to give
me the correlation
3365
02:16:26,400 --> 02:16:30,500
between the two column names
and so that we can see here.
3366
02:16:30,700 --> 02:16:34,445
So this is the correlation the
more it is closer to 1 means the
3367
02:16:34,445 --> 02:16:37,950
better it is it means definitely
it is near to 1 it is 0.9,
3368
02:16:37,950 --> 02:16:39,400
which is a bigger value.
3369
02:16:39,400 --> 02:16:42,700
So definitely it is going
to be much they both are
3370
02:16:42,700 --> 02:16:45,700
highly correlated means
definitely they are impacting
3371
02:16:45,700 --> 02:16:47,300
each other stock price.
3372
02:16:47,400 --> 02:16:49,700
So this is all about the project
3373
02:16:49,700 --> 02:16:58,500
but Welcome to this interesting
session of spots remaining
3374
02:16:58,673 --> 02:16:59,826
from and Erica.
3375
02:17:00,800 --> 02:17:02,261
What is pathogenic?
3376
02:17:02,261 --> 02:17:04,415
Is it like really important?
3377
02:17:04,500 --> 02:17:05,400
Definitely?
3378
02:17:05,400 --> 02:17:05,704
Yes.
3379
02:17:05,704 --> 02:17:07,001
Is it really hot?
3380
02:17:07,001 --> 02:17:07,600
Definitely?
3381
02:17:07,600 --> 02:17:08,100
Yes.
3382
02:17:08,100 --> 02:17:10,900
That's the reason we
are learning this technology.
3383
02:17:10,900 --> 02:17:14,600
And this is one of the very
sort things in the market
3384
02:17:14,600 --> 02:17:16,272
when it's a hot thing means
3385
02:17:16,272 --> 02:17:18,750
in terms of job market
I'm talking about.
3386
02:17:18,750 --> 02:17:21,600
So let's see what will be
our agenda for today.
3387
02:17:21,900 --> 02:17:25,500
So we are going to Gus
about spark ecosystem
3388
02:17:25,500 --> 02:17:27,900
where we are going
to see that okay,
3389
02:17:27,900 --> 02:17:28,700
what is pop
3390
02:17:28,700 --> 02:17:32,100
how smarts the main threats
in the West Park ecosystem
3391
02:17:32,100 --> 02:17:35,631
wise path streaming we
are going to have overview
3392
02:17:35,631 --> 02:17:39,900
of stock streaming kind of
getting into the basics of that.
3393
02:17:39,900 --> 02:17:41,832
We will learn about these cream.
3394
02:17:41,832 --> 02:17:44,890
We will learn also about
these theme Transformations.
3395
02:17:44,890 --> 02:17:46,800
We will be
learning about caching
3396
02:17:46,800 --> 02:17:51,200
and persistence accumulators
broadcast variables checkpoints.
3397
02:17:51,200 --> 02:17:53,600
These are like Advanced
concept of paths.
3398
02:17:54,100 --> 02:17:55,600
And then in the end,
3399
02:17:55,600 --> 02:17:59,900
we will walk through a use case
of Twitter sentiment analysis.
3400
02:18:00,500 --> 02:18:04,700
Now, what is streaming
let's understand that.
3401
02:18:04,800 --> 02:18:08,000
So let me start
by us example to you.
3402
02:18:08,600 --> 02:18:12,300
So let's see if there is
a bank and in Bank.
3403
02:18:12,500 --> 02:18:13,082
Definitely.
3404
02:18:13,082 --> 02:18:14,200
I'm pretty sure all
3405
02:18:14,200 --> 02:18:18,700
of you must have views credit
card debit card all those karts
3406
02:18:18,700 --> 02:18:20,900
what dance provide now,
3407
02:18:20,900 --> 02:18:23,500
let's say you
have done a transaction.
3408
02:18:23,500 --> 02:18:27,300
From India just now
and within an art
3409
02:18:27,300 --> 02:18:30,260
and edit your card
is getting swept in u.s.
3410
02:18:30,260 --> 02:18:31,600
Is it even possible
3411
02:18:31,600 --> 02:18:35,801
for your car to vision
and arduous definitely know now
3412
02:18:35,900 --> 02:18:38,100
how that bank will realize
3413
02:18:38,700 --> 02:18:41,000
that it is a fraud connection
3414
02:18:41,000 --> 02:18:44,600
because Bank cannot let
that transition happen.
3415
02:18:44,700 --> 02:18:46,238
They need to stop it
3416
02:18:46,238 --> 02:18:49,771
at the time of when it
is getting swiped either.
3417
02:18:49,771 --> 02:18:51,000
You can block it.
3418
02:18:51,000 --> 02:18:52,800
Give a call to you ask you
3419
02:18:52,800 --> 02:18:55,394
whether It is a genuine
transaction or not.
3420
02:18:55,394 --> 02:18:57,000
Do something of that sort.
3421
02:18:57,692 --> 02:18:58,000
Now.
3422
02:18:58,000 --> 02:19:00,300
Do you think they will put
some manual person
3423
02:19:00,300 --> 02:19:01,127
behind the scene
3424
02:19:01,127 --> 02:19:03,300
that will be looking
at all the transaction
3425
02:19:03,300 --> 02:19:05,100
and you will block it manually.
3426
02:19:05,100 --> 02:19:08,315
No, so they require
something of the sort
3427
02:19:08,315 --> 02:19:11,100
where the data will
be getting stream.
3428
02:19:11,100 --> 02:19:12,500
And at the real time
3429
02:19:12,500 --> 02:19:16,113
they should be able to catch
with the help of some pattern.
3430
02:19:16,113 --> 02:19:17,851
They will do some processing
3431
02:19:17,851 --> 02:19:20,575
and they will get
some pattern out of it with
3432
02:19:20,575 --> 02:19:23,305
if it is not sounding
like a genuine transition.
3433
02:19:23,305 --> 02:19:26,649
They will immediately add
a block it I'll give you a call
3434
02:19:26,649 --> 02:19:28,565
maybe send me an OTP to confirm
3435
02:19:28,565 --> 02:19:31,100
whether it's a genuine
connection dot they
3436
02:19:31,100 --> 02:19:32,050
will not wait
3437
02:19:32,050 --> 02:19:36,000
till the next day to kind of
complete that transaction.
3438
02:19:36,000 --> 02:19:38,941
Otherwise if what happened
nobody is going to touch
3439
02:19:38,941 --> 02:19:40,000
that that right.
3440
02:19:40,000 --> 02:19:43,000
So that is the how we
work on stomach.
3441
02:19:43,100 --> 02:19:46,300
Now someone have mentioned
3442
02:19:46,500 --> 02:19:51,400
that without stream processing
of data is not even possible.
3443
02:19:51,400 --> 02:19:52,435
In fact, we can see
3444
02:19:52,435 --> 02:19:55,200
that there is no And big data
which is possible.
3445
02:19:55,200 --> 02:19:57,900
We cannot even talk
about internet of things.
3446
02:19:57,900 --> 02:20:00,800
Right and this this is
a very famous statement
3447
02:20:00,800 --> 02:20:01,900
from Donna Saint
3448
02:20:01,900 --> 02:20:05,600
do from C equals
3 lot of companies
3449
02:20:05,700 --> 02:20:13,500
like YouTube Netflix Facebook
Twitter iTunes topped Pandora.
3450
02:20:13,769 --> 02:20:17,230
All these companies
are using spark screaming.
3451
02:20:17,700 --> 02:20:18,100
Now.
3452
02:20:19,100 --> 02:20:20,400
What is this?
3453
02:20:20,400 --> 02:20:23,580
We have just seen with an
example to kind of got an idea.
3454
02:20:23,580 --> 02:20:25,000
Idea about steaming pack.
3455
02:20:25,100 --> 02:20:30,300
Now as I said with the time
growing with the internet doing
3456
02:20:30,453 --> 02:20:35,146
these three main Technologies
are becoming popular day by day.
3457
02:20:35,500 --> 02:20:39,300
It's a technique
to transfer the data
3458
02:20:39,500 --> 02:20:45,000
so that it can be processed
as a steady and continuous
3459
02:20:45,000 --> 02:20:47,000
drip means immediately
3460
02:20:47,000 --> 02:20:49,500
as and when the data is coming
3461
02:20:49,600 --> 02:20:52,900
you are continuously
processing it as well.
3462
02:20:53,600 --> 02:20:54,400
In fact,
3463
02:20:54,400 --> 02:20:58,938
this real-time streaming is
what is driving to this big data
3464
02:20:59,100 --> 02:21:02,000
and also internet of things now,
3465
02:21:02,000 --> 02:21:04,786
they will be lot of things
like fundamental unit
3466
02:21:04,786 --> 02:21:06,387
of streaming media streams.
3467
02:21:06,387 --> 02:21:08,700
We will also be
Transforming Our screen.
3468
02:21:08,700 --> 02:21:09,700
We will be doing it.
3469
02:21:09,700 --> 02:21:10,994
In fact, the companies
3470
02:21:10,994 --> 02:21:13,400
are using it with
their business intelligence.
3471
02:21:13,400 --> 02:21:16,200
We will see more details
in further of the slides.
3472
02:21:16,300 --> 02:21:20,900
But before that we will be
talking about spark ecosystem
3473
02:21:21,200 --> 02:21:23,500
when we talk about Spark mmm,
3474
02:21:23,500 --> 02:21:25,653
there are multiple libraries
3475
02:21:25,653 --> 02:21:29,565
which are present in a first one
is pop frequent now
3476
02:21:29,565 --> 02:21:31,100
in spark SQL is like
3477
02:21:31,100 --> 02:21:35,000
when you can SQL Developer
can write the query in SQL way
3478
02:21:35,000 --> 02:21:38,600
and it is going to get converted
into a spark way
3479
02:21:38,600 --> 02:21:42,828
and then going to give you
output kind of analogous to hide
3480
02:21:42,828 --> 02:21:46,400
but it is going to be faster
in comparison to hide
3481
02:21:46,400 --> 02:21:48,252
when we talk about sports clinic
3482
02:21:48,252 --> 02:21:50,900
that is what we are going
to learn it is going
3483
02:21:50,900 --> 02:21:55,300
to enable all the analytical
and Practical applications
3484
02:21:55,600 --> 02:21:59,400
for your live
streaming data M11.
3485
02:21:59,700 --> 02:22:02,400
Ml it is mostly
for machine learning.
3486
02:22:02,400 --> 02:22:03,546
And in fact,
3487
02:22:03,546 --> 02:22:06,007
the interesting part
about MLA is
3488
02:22:06,200 --> 02:22:11,100
that it is completely replacing
mom invited are almost replaced.
3489
02:22:11,100 --> 02:22:13,500
Now all the core contributors
3490
02:22:13,500 --> 02:22:17,700
of Mahal have moved
in two words the
3491
02:22:18,184 --> 02:22:19,800
towards the MLF thing
3492
02:22:19,800 --> 02:22:23,500
because of the faster response
performance is really good.
3493
02:22:23,500 --> 02:22:26,707
In MLA Graphics Graphics.
3494
02:22:26,707 --> 02:22:27,005
Okay.
3495
02:22:27,005 --> 02:22:29,794
Let me give you example
everybody must have used
3496
02:22:29,794 --> 02:22:31,100
Google Maps right now.
3497
02:22:31,100 --> 02:22:34,082
What you doing Google Map
you search for the path.
3498
02:22:34,082 --> 02:22:36,600
You put your Source you
put your destination.
3499
02:22:36,600 --> 02:22:38,900
Now when you just
search for the part,
3500
02:22:39,000 --> 02:22:40,500
it's certainly different paths
3501
02:22:40,800 --> 02:22:45,100
and then provide you
an optimal path right now
3502
02:22:45,300 --> 02:22:47,300
how it providing
the optimal party.
3503
02:22:47,300 --> 02:22:50,500
These things can be done
with the help of Graphics.
3504
02:22:50,500 --> 02:22:53,500
So wherever you can create
a kind of a graphical stuff.
3505
02:22:53,500 --> 02:22:54,500
Up, we will say
3506
02:22:54,500 --> 02:22:56,997
that we can use
Graphics spark up.
3507
02:22:56,997 --> 02:22:57,300
Now.
3508
02:22:57,300 --> 02:23:00,600
This is the kind
of a package provided for art.
3509
02:23:00,600 --> 02:23:02,538
So R is of Open Source,
3510
02:23:02,538 --> 02:23:05,000
which is mostly used by analysts
3511
02:23:05,000 --> 02:23:08,300
and now spark committee
won't infect all
3512
02:23:08,300 --> 02:23:11,594
the analysts kind of to move
towards the sparkling water.
3513
02:23:11,594 --> 02:23:12,900
And that's the reason
3514
02:23:12,900 --> 02:23:15,615
they have recently
stopped supporting spark
3515
02:23:15,615 --> 02:23:17,226
on we are all the analysts
3516
02:23:17,226 --> 02:23:20,301
can now execute the query
using spark environment
3517
02:23:20,301 --> 02:23:22,800
that's getting better
performance and we
3518
02:23:22,800 --> 02:23:25,000
can also work on Big Data.
3519
02:23:25,200 --> 02:23:27,800
That's that's all
about the ecosystem point
3520
02:23:27,800 --> 02:23:31,061
below this we are going to have
a core engine for engine
3521
02:23:31,061 --> 02:23:34,500
is the one which defines all
the basics of the participants
3522
02:23:34,500 --> 02:23:36,363
all the RGV related stuff
3523
02:23:36,363 --> 02:23:38,600
and not is going to be defined
3524
02:23:38,600 --> 02:23:43,300
in your staff for Engine
moving further now,
3525
02:23:43,300 --> 02:23:46,227
so as we have just
discussed this part we
3526
02:23:46,227 --> 02:23:49,767
are going to now discuss
past screaming indicate
3527
02:23:49,767 --> 02:23:53,500
which is going to enable
analytical and Interactive.
3528
02:23:53,600 --> 02:23:58,300
For live streaming data
know Y is positive
3529
02:23:58,800 --> 02:24:01,400
if I talk about bias
past him indefinitely.
3530
02:24:01,400 --> 02:24:04,230
We have just gotten after
different is very important.
3531
02:24:04,230 --> 02:24:06,100
That's the reason
we are learning it
3532
02:24:06,200 --> 02:24:09,804
but this is so powerful
that it is used now
3533
02:24:09,804 --> 02:24:14,169
for the by lot of companies
to perform their marketing they
3534
02:24:14,169 --> 02:24:15,900
kind of getting an idea
3535
02:24:15,900 --> 02:24:18,250
that what a customer
is looking for.
3536
02:24:18,250 --> 02:24:22,094
In fact, we are going to learn
a use case of similar to that
3537
02:24:22,094 --> 02:24:24,700
where we are going
to to use pasta me now
3538
02:24:24,700 --> 02:24:28,283
where we are going to use
a Twitter sentimental analysis,
3539
02:24:28,283 --> 02:24:31,100
which can be used
for your crisis management.
3540
02:24:31,100 --> 02:24:33,680
Maybe you want to check
all your products
3541
02:24:33,680 --> 02:24:35,100
on our behave service.
3542
02:24:35,100 --> 02:24:37,420
I just think target marketing
3543
02:24:37,500 --> 02:24:40,342
by all the companies
around the world.
3544
02:24:40,342 --> 02:24:42,800
This is getting used
in this way.
3545
02:24:42,817 --> 02:24:46,355
And that's the reason
spark steaming is gaining
3546
02:24:46,355 --> 02:24:50,432
the popularity and because
of its performance as well.
3547
02:24:50,600 --> 02:24:53,200
It is beeping
on other platforms.
3548
02:24:53,600 --> 02:24:57,400
At the moment
now moving further.
3549
02:24:57,600 --> 02:25:01,300
Let's eat Sparks training
features when we talk
3550
02:25:01,300 --> 02:25:03,300
about Sparks training teachers.
3551
02:25:03,400 --> 02:25:05,100
It's very easy to scale.
3552
02:25:05,100 --> 02:25:07,420
You can scale
to even multiple nodes
3553
02:25:07,420 --> 02:25:11,083
which can even run till hundreds
of most speed is going
3554
02:25:11,083 --> 02:25:14,000
to be very quick means
in a very short time.
3555
02:25:14,000 --> 02:25:17,900
You can scream as well as
processor data soil tolerant,
3556
02:25:17,900 --> 02:25:19,300
even it made sure
3557
02:25:19,300 --> 02:25:23,100
that even you're not losing
your data integration.
3558
02:25:23,100 --> 02:25:26,600
You with your bash time and
real-time processing is possible
3559
02:25:26,600 --> 02:25:30,446
and it can also be used
for your business analytics
3560
02:25:30,500 --> 02:25:34,800
which is used to track
the behavior of your customer.
3561
02:25:34,900 --> 02:25:38,700
So as you can see this
is super polite and it's
3562
02:25:38,700 --> 02:25:43,000
like we are kind of getting to
know so many interesting things
3563
02:25:43,000 --> 02:25:48,000
about this pasta me now next
quickly have an overview
3564
02:25:48,000 --> 02:25:50,900
so that we can get
some basics of spots.
3565
02:25:50,900 --> 02:25:53,200
Don't know let's understand.
3566
02:25:53,200 --> 02:25:54,300
Which box?
3567
02:25:55,100 --> 02:25:59,200
So as we have just discussed it
is for real-time streaming data.
3568
02:25:59,600 --> 02:26:04,100
It is useful addition
in your spark for API.
3569
02:26:04,100 --> 02:26:06,500
So we have already seen
at the base level.
3570
02:26:06,500 --> 02:26:07,400
We have that spark
3571
02:26:07,400 --> 02:26:10,700
or in our ecosystem on top
of that we have passed we
3572
02:26:10,700 --> 02:26:14,700
will impact Sparks claiming
is kind of adding a lot
3573
02:26:14,700 --> 02:26:18,000
of advantage to spark Community
3574
02:26:18,000 --> 02:26:22,349
because a lot of people are only
joining spark Community to kind
3575
02:26:22,349 --> 02:26:23,800
of use this pasta me.
3576
02:26:23,800 --> 02:26:25,000
It's so powerful.
3577
02:26:25,000 --> 02:26:26,344
Everyone wants to come
3578
02:26:26,344 --> 02:26:29,478
and want to use it
because all the other Frameworks
3579
02:26:29,478 --> 02:26:30,809
which we already have
3580
02:26:30,809 --> 02:26:33,469
which are existing are
not as good in terms
3581
02:26:33,469 --> 02:26:34,783
of performance in all
3582
02:26:34,783 --> 02:26:36,311
and and it's the easiness
3583
02:26:36,311 --> 02:26:38,482
of moving Sparks
coming is also great
3584
02:26:38,482 --> 02:26:41,482
if you compare your program
for let's say two orbits
3585
02:26:41,482 --> 02:26:44,100
from which is used
for real-time processing.
3586
02:26:44,100 --> 02:26:46,356
You will notice
that it is much easier
3587
02:26:46,356 --> 02:26:49,100
in terms of from
a developer point of your ass
3588
02:26:49,100 --> 02:26:52,400
that that's the reason a lot
of regular showing interest
3589
02:26:52,400 --> 02:26:53,800
in this domain now,
3590
02:26:53,800 --> 02:26:56,800
it will also enable Table
of high throughput
3591
02:26:56,800 --> 02:26:58,187
and fault-tolerant
3592
02:26:58,187 --> 02:27:02,725
so that you to stream your data
to process all the things up
3593
02:27:02,900 --> 02:27:06,900
and the fundamental unit
Force past dreaming is going
3594
02:27:06,900 --> 02:27:08,200
to be District.
3595
02:27:08,300 --> 02:27:09,700
What is this thing?
3596
02:27:09,700 --> 02:27:10,600
Let me explain it.
3597
02:27:11,100 --> 02:27:14,200
So this dream is
basically a series
3598
02:27:14,200 --> 02:27:18,900
of bodies to process
the real-time data.
3599
02:27:19,400 --> 02:27:21,100
What we generally do is
3600
02:27:21,100 --> 02:27:23,678
if you look
at this light inside you
3601
02:27:23,678 --> 02:27:25,300
when you get the data,
3602
02:27:25,400 --> 02:27:29,800
It is a continuous data you
divide it in two batches
3603
02:27:29,800 --> 02:27:31,200
of input data.
3604
02:27:31,400 --> 02:27:35,700
We are going to call it
as micro batch and then
3605
02:27:35,700 --> 02:27:39,447
we are going to get that is
of processed data though.
3606
02:27:39,447 --> 02:27:40,600
It is real time.
3607
02:27:40,600 --> 02:27:42,300
But still how come it is back
3608
02:27:42,300 --> 02:27:44,547
because definitely you
are doing processing
3609
02:27:44,547 --> 02:27:46,258
on some part of the data, right?
3610
02:27:46,258 --> 02:27:48,300
Even if it is coming
at real time.
3611
02:27:48,300 --> 02:27:52,500
And that is what we are going
to call it as micro batch.
3612
02:27:53,600 --> 02:27:55,700
Moving further now.
3613
02:27:56,600 --> 02:27:59,100
Let's see few more
details on it.
3614
02:27:59,223 --> 02:28:02,300
Now from where you
can get all your data.
3615
02:28:02,300 --> 02:28:04,600
What can be your
data sources here.
3616
02:28:04,600 --> 02:28:09,000
So if we talk about data sources
here now we can steal the data
3617
02:28:09,000 --> 02:28:13,700
from multiple sources
like Market of the past events.
3618
02:28:13,700 --> 02:28:16,586
You have statuses
like at based mongodb,
3619
02:28:16,586 --> 02:28:20,051
which are you know,
SQL babies elasticsearch post
3620
02:28:20,051 --> 02:28:24,600
Vis equal pocket file format you
can Get all the data from here.
3621
02:28:24,600 --> 02:28:27,700
Now after that you can also
don't do processing
3622
02:28:27,700 --> 02:28:29,553
with the help
of machine learning.
3623
02:28:29,553 --> 02:28:32,700
You can do the processing
with the help of your spark SQL
3624
02:28:32,700 --> 02:28:34,800
and then give the output.
3625
02:28:34,900 --> 02:28:37,000
So this is a very strong thing
3626
02:28:37,000 --> 02:28:40,100
that you are bringing
the data using spot screaming
3627
02:28:40,100 --> 02:28:41,964
but processing you can do
3628
02:28:41,964 --> 02:28:44,800
by using some other
Frameworks as well.
3629
02:28:44,800 --> 02:28:47,514
Right like machine learning
you can apply on the data
3630
02:28:47,514 --> 02:28:49,549
what you're getting
fatter years time.
3631
02:28:49,549 --> 02:28:51,966
You can also apply
your spots equal on the data,
3632
02:28:51,966 --> 02:28:53,200
which you're getting at.
3633
02:28:53,200 --> 02:28:56,300
the real time Moving further.
3634
02:28:57,100 --> 02:29:00,089
So this is a single thing now
in Sparks giving you
3635
02:29:00,089 --> 02:29:03,200
what you can just get the data
from multiple sources
3636
02:29:03,200 --> 02:29:07,600
like from cough cough prove
sefs kinases Twitter bringing it
3637
02:29:07,600 --> 02:29:10,300
to this path screaming
doing the processing
3638
02:29:10,300 --> 02:29:12,500
and storing it back
to your hdfs.
3639
02:29:12,500 --> 02:29:15,900
Maybe you can bring it to
your DB you can also publish
3640
02:29:15,900 --> 02:29:17,400
to your UI dashboard.
3641
02:29:17,400 --> 02:29:21,402
Next Tableau angularjs lot
of UI dashboards are there
3642
02:29:21,700 --> 02:29:25,100
in which you can publish
your output now.
3643
02:29:25,500 --> 02:29:26,346
Holly quotes,
3644
02:29:26,346 --> 02:29:29,782
let us just break down
into more fine-grained gutters.
3645
02:29:29,782 --> 02:29:32,700
Now we are going to get
our input data stream.
3646
02:29:32,700 --> 02:29:34,500
We are going to put it inside
3647
02:29:34,500 --> 02:29:38,200
of a spot screaming going to get
the batches of input data.
3648
02:29:38,200 --> 02:29:40,772
Once it executes
to his path engine.
3649
02:29:40,772 --> 02:29:44,300
We are going to get that chest
of processed data.
3650
02:29:44,300 --> 02:29:47,146
We have just seen
the same diagram before so
3651
02:29:47,146 --> 02:29:49,000
the same explanation for it.
3652
02:29:49,000 --> 02:29:52,400
Now again breaking it down
into more glamour part.
3653
02:29:52,400 --> 02:29:55,060
We are getting a d
string B string was
3654
02:29:55,060 --> 02:29:58,800
what Vulnerabilities of data
multiple set of Harmony,
3655
02:29:58,800 --> 02:30:00,500
so we are getting a d string.
3656
02:30:00,500 --> 02:30:03,400
So let's say we are getting
an rdd and the rate of time but
3657
02:30:03,400 --> 02:30:06,200
because now we are getting
real steam data, right?
3658
02:30:06,200 --> 02:30:07,936
So let's say in today right now.
3659
02:30:07,936 --> 02:30:08,872
I got one second.
3660
02:30:08,872 --> 02:30:11,399
Maybe now I got some one second
in one second.
3661
02:30:11,399 --> 02:30:14,600
I got more data now I got
more data in the next not Frank.
3662
02:30:14,600 --> 02:30:16,300
So that is what
we're talking about.
3663
02:30:16,300 --> 02:30:17,602
So we are creating data.
3664
02:30:17,602 --> 02:30:20,322
We are getting from time
0 to time what we get say
3665
02:30:20,322 --> 02:30:22,171
that we have an RGB at the rate
3666
02:30:22,171 --> 02:30:24,556
of Timbre similarly
it is this proceeding
3667
02:30:24,556 --> 02:30:27,300
with the time that He's
getting proceeded here.
3668
02:30:27,400 --> 02:30:30,683
Now in the next thing
we extracting the words
3669
02:30:30,683 --> 02:30:32,400
from an input Stream So
3670
02:30:32,400 --> 02:30:33,300
if you can notice
3671
02:30:33,300 --> 02:30:35,550
what we are doing here
from where let's say,
3672
02:30:35,550 --> 02:30:37,700
we started applying
doing our operations
3673
02:30:37,700 --> 02:30:40,419
as we started doing
our any sort of processing.
3674
02:30:40,419 --> 02:30:43,200
So as in when we get the data
in this timeframe,
3675
02:30:43,200 --> 02:30:44,707
we started being subversive.
3676
02:30:44,707 --> 02:30:46,307
It can be a flat map operation.
3677
02:30:46,307 --> 02:30:49,300
It can be any sort of operation
you're doing it can be even
3678
02:30:49,300 --> 02:30:51,800
a machine-learning opposite
of whatever you are doing
3679
02:30:51,800 --> 02:30:55,600
and then you are generating
the words in that kind of thing.
3680
02:30:55,700 --> 02:30:58,700
So this is how we
as we're seeing
3681
02:30:58,700 --> 02:31:02,700
that how gravity we can kind
of see all these part
3682
02:31:02,700 --> 02:31:04,620
at a very high level this work.
3683
02:31:04,620 --> 02:31:06,738
We again went into
detail then again,
3684
02:31:06,738 --> 02:31:08,249
we went into more detail.
3685
02:31:08,249 --> 02:31:09,700
And finally we have seen
3686
02:31:09,700 --> 02:31:13,600
that how we can even process
the data along the time
3687
02:31:13,600 --> 02:31:16,594
when we are screaming
our data as well.
3688
02:31:17,100 --> 02:31:21,500
Now one important point is just
like spark context is
3689
02:31:21,853 --> 02:31:25,700
mean entry point for
any spark application similar.
3690
02:31:25,700 --> 02:31:28,300
Need to work on streaming a spot
3691
02:31:28,300 --> 02:31:31,600
screaming you require
a streaming context.
3692
02:31:31,700 --> 02:31:35,800
What is that when you're passing
your input data stream you
3693
02:31:35,800 --> 02:31:38,400
when you are working
on the Spark engine
3694
02:31:38,400 --> 02:31:41,000
when you're walking
on this path screaming engine,
3695
02:31:41,000 --> 02:31:42,900
you have to use your system
3696
02:31:42,900 --> 02:31:46,289
in context of its using
screaming context only
3697
02:31:46,289 --> 02:31:48,700
you are going to get the batches
3698
02:31:48,700 --> 02:31:52,300
of your input data now
so streaming context
3699
02:31:52,300 --> 02:31:57,000
is going to consume a stream
of data in In Apache spark,
3700
02:31:57,300 --> 02:31:58,800
it is registers
3701
02:31:58,800 --> 02:32:04,000
and input D string to produce
or receiver object.
3702
02:32:04,500 --> 02:32:08,200
Now it is the main entry point
as we discussed
3703
02:32:08,200 --> 02:32:11,011
that like spark context is
the main entry point
3704
02:32:11,011 --> 02:32:12,600
for the spark application.
3705
02:32:12,600 --> 02:32:13,400
Similarly.
3706
02:32:13,400 --> 02:32:16,110
Your streaming context
is an entry point
3707
02:32:16,110 --> 02:32:17,500
for yourself Paxton.
3708
02:32:17,500 --> 02:32:20,800
Now does that mean
now Spa context is
3709
02:32:20,800 --> 02:32:22,569
not an entry point know
3710
02:32:22,569 --> 02:32:25,779
when you creates pastrini
it is dependent.
3711
02:32:25,779 --> 02:32:27,600
On your spots community.
3712
02:32:27,600 --> 02:32:30,007
So when you create
this thing in context
3713
02:32:30,007 --> 02:32:33,509
it is going to be dependent
on your spark of context only
3714
02:32:33,509 --> 02:32:36,732
because you will not be able
to create swimming contest
3715
02:32:36,732 --> 02:32:38,000
without spot Pockets.
3716
02:32:38,000 --> 02:32:41,000
So that's the reason it
is definitely required spark
3717
02:32:41,000 --> 02:32:45,600
also provide a number of default
implementations of sources,
3718
02:32:45,800 --> 02:32:50,000
like looking in the data
from Critter a factor 0 mq
3719
02:32:50,100 --> 02:32:53,100
which are accessible
from the context.
3720
02:32:53,100 --> 02:32:55,800
So it is supporting
so many things, right?
3721
02:32:55,800 --> 02:32:58,600
now If you notice this
3722
02:32:58,600 --> 02:33:01,000
what we are doing
in streaming contact,
3723
02:33:01,000 --> 02:33:03,497
this is just to give
you an idea about
3724
02:33:03,497 --> 02:33:06,500
how we can initialize
our system in context.
3725
02:33:06,500 --> 02:33:09,971
So we will be importing
these two libraries after that.
3726
02:33:09,971 --> 02:33:12,923
Can you see I'm passing
spot context SE right son
3727
02:33:12,923 --> 02:33:14,400
passing it every second.
3728
02:33:14,400 --> 02:33:17,323
We are collecting the data
means collect the data
3729
02:33:17,323 --> 02:33:18,400
for every 1 second.
3730
02:33:18,400 --> 02:33:21,500
You can increase this number
if you want and then this
3731
02:33:21,500 --> 02:33:24,028
is your SSC means
in every one second
3732
02:33:24,028 --> 02:33:25,482
what ever gonna happen?
3733
02:33:25,482 --> 02:33:27,000
I'm going to process it.
3734
02:33:27,000 --> 02:33:28,800
And what we're doing
in this place,
3735
02:33:28,900 --> 02:33:33,100
let's go to the D string topic
now now in these three
3736
02:33:33,500 --> 02:33:37,000
it is the full form
is discretized stream.
3737
02:33:37,053 --> 02:33:38,900
It's a basic abstraction
3738
02:33:38,900 --> 02:33:41,679
provided by your spa
streaming framework.
3739
02:33:41,679 --> 02:33:46,400
It's appointing a stream of data
and it is going to be received
3740
02:33:46,400 --> 02:33:47,630
from your source
3741
02:33:47,630 --> 02:33:52,200
and from processed
steaming context is related
3742
02:33:52,200 --> 02:33:56,900
to your response living
Fun Spot context is belonging.
3743
02:33:56,900 --> 02:33:57,974
To your spark or
3744
02:33:57,974 --> 02:34:01,600
if you remember the ecosystem
radical in the ecosystem,
3745
02:34:01,600 --> 02:34:06,400
we have that spark context right
now streaming context is built
3746
02:34:06,400 --> 02:34:08,784
with the help of spark context.
3747
02:34:08,800 --> 02:34:11,800
And in fact using
streaming context only
3748
02:34:11,800 --> 02:34:15,604
you will be able to perform
your sponsoring just like
3749
02:34:15,604 --> 02:34:17,722
without spark context you will
3750
02:34:17,722 --> 02:34:19,700
not able to execute anything
3751
02:34:19,700 --> 02:34:22,482
in spark application
just park application
3752
02:34:22,482 --> 02:34:25,100
will not be able
to do anything similarly
3753
02:34:25,100 --> 02:34:27,200
without streaming content.
3754
02:34:27,200 --> 02:34:31,500
You're streaming application
will not be able to do anything.
3755
02:34:31,500 --> 02:34:34,838
It just that screaming
context is built on top
3756
02:34:34,838 --> 02:34:36,100
of spark context.
3757
02:34:36,500 --> 02:34:39,700
Okay, so it now it's
a continuous stream
3758
02:34:39,700 --> 02:34:42,400
of data we can talk
about these three.
3759
02:34:42,400 --> 02:34:46,200
It is received from source
of on the processed data speed
3760
02:34:46,200 --> 02:34:49,000
generated by the
transformation of interesting.
3761
02:34:49,300 --> 02:34:53,800
If you look at this part
internally a these thing
3762
02:34:53,800 --> 02:34:57,389
can be represented by
a continuous series of I
3763
02:34:57,389 --> 02:34:59,620
need these this is important.
3764
02:34:59,946 --> 02:35:04,400
Now what we're doing is
every second remember last time
3765
02:35:04,400 --> 02:35:05,800
we have just seen an example
3766
02:35:05,900 --> 02:35:08,335
of like every second
whatever going to happen.
3767
02:35:08,335 --> 02:35:10,100
We are going to do processing.
3768
02:35:10,200 --> 02:35:13,700
So in that every second
whatever data you
3769
02:35:13,700 --> 02:35:17,300
are collecting and you're
performing your operation.
3770
02:35:17,300 --> 02:35:18,010
So the data
3771
02:35:18,010 --> 02:35:21,500
what you're getting here is
will be your District means
3772
02:35:21,500 --> 02:35:23,129
it's a Content you can say
3773
02:35:23,129 --> 02:35:26,200
that all these things
will be your D string point.
3774
02:35:26,200 --> 02:35:29,800
It's our Representation
by a continuous series
3775
02:35:29,800 --> 02:35:32,300
of kinetic energy so
many hundred is getting more
3776
02:35:32,300 --> 02:35:34,500
because let's say right
knocking one second.
3777
02:35:34,500 --> 02:35:36,000
What data I got collected.
3778
02:35:36,000 --> 02:35:37,100
I executed it.
3779
02:35:37,100 --> 02:35:40,500
I in the second second
this data is happening here.
3780
02:35:40,715 --> 02:35:41,100
Okay?
3781
02:35:41,100 --> 02:35:41,800
Okay.
3782
02:35:41,800 --> 02:35:42,700
Sorry for that.
3783
02:35:42,700 --> 02:35:46,300
Now in the second time
also the it is happening
3784
02:35:46,300 --> 02:35:47,400
a third second.
3785
02:35:47,400 --> 02:35:49,000
Also it is happening here.
3786
02:35:49,700 --> 02:35:50,500
No problem.
3787
02:35:50,500 --> 02:35:53,100
No, I'm not going
to do it now fine.
3788
02:35:53,100 --> 02:35:54,727
So in the third second Auto
3789
02:35:54,727 --> 02:35:57,200
if I did something
I'm processing it here.
3790
02:35:57,200 --> 02:35:57,500
Right.
3791
02:35:57,500 --> 02:35:59,800
So if you see
that this diagram itself,
3792
02:35:59,800 --> 02:36:03,600
so it is every second whatever
data is getting collected.
3793
02:36:03,600 --> 02:36:05,400
We are doing the processing
3794
02:36:05,400 --> 02:36:09,250
on top of it and the whole
countenance series of RDV
3795
02:36:09,250 --> 02:36:13,100
what we are seeing here
will be called as the strip.
3796
02:36:13,100 --> 02:36:13,500
Okay.
3797
02:36:13,500 --> 02:36:18,100
So this is what your distinct
moving further now
3798
02:36:18,600 --> 02:36:22,300
we are going to understand
the operation on these three.
3799
02:36:22,300 --> 02:36:24,500
So let's say you are doing
3800
02:36:24,500 --> 02:36:27,300
this operation on this dream
that you are getting.
3801
02:36:27,300 --> 02:36:30,000
The data from 0 to 1 again,
3802
02:36:30,000 --> 02:36:32,300
you are applying some operation
3803
02:36:32,300 --> 02:36:36,108
on that then whatever output
you get you're going to call
3804
02:36:36,108 --> 02:36:39,200
it the words DStream, meaning
this is the thing
3805
02:36:39,200 --> 02:36:41,166
that you're doing: you're doing
a flatMap operation.
3806
02:36:41,166 --> 02:36:42,700
That's the reason
we're calling it
3807
02:36:42,700 --> 02:36:46,058
the words DStream. Now similarly,
whatever thing you're doing,
3808
02:36:46,058 --> 02:36:48,000
So you're going
to get accordingly
3809
02:36:48,000 --> 02:36:50,569
an output DStream
for it as well.
3810
02:36:50,569 --> 02:36:55,100
So this is what is happening
in this particular example now.
3811
02:36:56,700 --> 02:36:59,700
Flat map: flatMap is an API.
3812
02:37:00,000 --> 02:37:02,100
It is very similar to map.
3813
02:37:02,100 --> 02:37:04,089
It kind of flattens
up your values.
3814
02:37:04,089 --> 02:37:04,400
Okay.
3815
02:37:04,400 --> 02:37:06,400
So let me explain you
with an example.
3816
02:37:06,400 --> 02:37:07,300
what flatMap is.
3817
02:37:07,500 --> 02:37:10,100
So let's say
if I say that hi,
3818
02:37:10,400 --> 02:37:13,200
this is edureka.
3819
02:37:14,500 --> 02:37:15,600
Welcome.
3820
02:37:16,200 --> 02:37:18,100
Okay, let's say this is the data.
3821
02:37:18,222 --> 02:37:18,723
Now.
3822
02:37:18,723 --> 02:37:20,800
I want to apply a flatMap.
3823
02:37:20,800 --> 02:37:22,900
So let's say this is
in the form of an RDD.
3824
02:37:22,900 --> 02:37:24,600
Now on this RDD,
3825
02:37:24,600 --> 02:37:28,200
let's say I apply flatMap
to, let's say, our RDD — this is
3826
02:37:28,200 --> 02:37:30,000
the RDD: rdd.flatMap.
3827
02:37:31,600 --> 02:37:35,000
It's not map,
it's flatMap.
3828
02:37:35,100 --> 02:37:38,467
And then let's say you want
to define something for it.
3829
02:37:38,467 --> 02:37:40,400
So let's say you say that okay,
3830
02:37:41,100 --> 02:37:43,400
you are defining
a variable, say a.
3831
02:37:43,700 --> 02:37:48,300
So let's say a, then a dot — now
3832
02:37:48,400 --> 02:37:53,300
after that you are defining
your dot split: split.
3833
02:37:55,300 --> 02:37:58,417
We're splitting with respect
to whitespace. Now in this case,
3834
02:37:58,417 --> 02:38:00,106
what is going to happen now?
3835
02:38:00,106 --> 02:38:03,966
I'm not showing the exact syntax here,
just an example of flatMap,
3836
02:38:03,966 --> 02:38:06,500
just to kind of give
you an idea about it.
3837
02:38:06,503 --> 02:38:09,196
It is going to flatten
up this file
3838
02:38:09,200 --> 02:38:11,200
with respect to the split
3839
02:38:11,200 --> 02:38:15,200
that you have mentioned here,
meaning it is now going to
3840
02:38:15,200 --> 02:38:18,500
create each element as one word.
3841
02:38:18,684 --> 02:38:21,915
It is going to create it
like this: 'hi' as one
3842
02:38:22,200 --> 02:38:26,100
word, one element; 'this'
as one element;
3843
02:38:26,100 --> 02:38:27,515
'is' as another word,
3844
02:38:27,515 --> 02:38:30,939
one element; 'edureka' as
one word, one element;
3845
02:38:30,939 --> 02:38:33,200
'welcome' as one
word, for example.
3846
02:38:33,200 --> 02:38:33,841
So this is
3847
02:38:33,841 --> 02:38:37,558
how your flatMap works: it kind
of flattens up your whole file.
3848
02:38:37,558 --> 02:38:40,700
So this is what we are doing
in our DStream as well.
3849
02:38:40,700 --> 02:38:43,400
So this is
how this will work.
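(A minimal Scala sketch of the flatMap example just described; the RDD name and the sample line are assumptions.)

    // assumption: sc is the SparkContext available in the shell
    val lines = sc.parallelize(Seq("hi this is edureka welcome"))
    val words = lines.flatMap(a => a.split(" "))   // each line is flattened into words
    words.collect()                                // Array(hi, this, is, edureka, welcome)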
3850
02:38:44,100 --> 02:38:47,143
Now so we have just
understood this part.
3851
02:38:47,143 --> 02:38:51,100
Now, let's understand input
DStreams and receivers.
3852
02:38:51,100 --> 02:38:52,500
Okay, what are these things?
3853
02:38:52,500 --> 02:38:53,900
Let's understand this part.
3854
02:38:54,800 --> 02:38:55,200
Okay.
3855
02:38:55,200 --> 02:38:57,700
So what are the input
DStreams possible?
3856
02:38:57,700 --> 02:39:00,900
They can be basic sources or
advanced sources. In basic sources
3857
02:39:00,900 --> 02:39:04,500
we can have file systems
and socket connections;
3858
02:39:04,600 --> 02:39:08,400
in advanced sources we
can have Kafka, Flume, Kinesis.
3859
02:39:08,800 --> 02:39:09,200
Okay.
3860
02:39:09,300 --> 02:39:10,800
So your input DStreams are
3861
02:39:10,800 --> 02:39:14,000
under these things
representing the stream
3862
02:39:14,300 --> 02:39:19,200
of input data received
from streaming sources.
3863
02:39:19,400 --> 02:39:20,865
This is again the same thing.
3864
02:39:20,865 --> 02:39:21,136
Okay.
3865
02:39:21,136 --> 02:39:23,198
So there are
two types of sources,
3866
02:39:23,198 --> 02:39:24,500
which we just discussed.
3867
02:39:24,600 --> 02:39:27,676
first is your basic and second
is your advanced.
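(A hedged sketch of the two kinds of sources; the host, port and batch interval are placeholders. Advanced sources such as Kafka, Flume or Kinesis need their own connector libraries, so they are only named here.)

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(1))
    // basic source: a socket connection on an assumed host and port
    val socketStream = ssc.socketTextStream("localhost", 9999)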
3868
02:39:28,400 --> 02:39:29,800
Let's move further.
3869
02:39:30,700 --> 02:39:33,700
Now, what are we going
to see here?
3870
02:39:33,700 --> 02:39:35,870
So if you notice let's see here.
3871
02:39:35,870 --> 02:39:39,600
There are some events coming; they
go to your receiver
3872
02:39:39,600 --> 02:39:44,158
and then into the DStream. Now
the RDDs are getting created
3873
02:39:44,158 --> 02:39:47,082
and we are performing
some steps on it.
3874
02:39:47,300 --> 02:39:52,300
So the receiver sends
the data into the DStream,
3875
02:39:52,500 --> 02:39:57,100
where each batch is going
to contain an RDD.
3876
02:39:57,200 --> 02:40:00,800
So this is what
the receiver
3877
02:40:00,800 --> 02:40:02,500
is doing here. Now,
3878
02:40:03,500 --> 02:40:07,200
moving further: transformations
on the DStream.
3879
02:40:07,200 --> 02:40:08,384
Let's understand that.
3880
02:40:08,384 --> 02:40:10,500
What are the
Transformations available?
3881
02:40:10,500 --> 02:40:13,000
There are multiple
Transformations, which are
3882
02:40:13,000 --> 02:40:14,700
possible; the most popular ones,
3883
02:40:14,700 --> 02:40:16,100
Let's talk about that.
3884
02:40:16,100 --> 02:40:20,700
we have map, flatMap, filter,
reduce, groupBy. So there
3885
02:40:20,700 --> 02:40:23,992
are multiple transformations
available. Now
3886
02:40:23,992 --> 02:40:27,500
It is like you are getting
your input data now you
3887
02:40:27,500 --> 02:40:30,400
will be applying any
of these operations.
3888
02:40:30,400 --> 02:40:33,700
meaning any transformation
that is going to happen,
3889
02:40:33,700 --> 02:40:37,700
and then a new DStream
is going to be created.
3890
02:40:37,700 --> 02:40:39,900
Okay, so that is
what's going to happen.
3891
02:40:39,900 --> 02:40:41,851
So let's explore it one by one.
3892
02:40:41,851 --> 02:40:43,344
So let's start with now
3893
02:40:43,344 --> 02:40:46,200
if I start with map,
what happens with map:
3894
02:40:46,200 --> 02:40:48,600
it is going to create
batches of data.
3895
02:40:48,600 --> 02:40:49,100
Okay.
3896
02:40:49,100 --> 02:40:51,386
So let's say it is going
to create a map value
3897
02:40:51,386 --> 02:40:52,200
of it like this.
3898
02:40:52,200 --> 02:40:55,600
So let's say X is mapped
to give the output Z,
3899
02:40:55,600 --> 02:40:57,600
and Z is giving
the output X, right?
3900
02:40:57,600 --> 02:41:00,700
So in this similar format,
this is going to get mapped.
3901
02:41:00,700 --> 02:41:02,887
Whatever
you're performing,
3902
02:41:02,887 --> 02:41:05,394
It is just going to create
batches of input data,
3903
02:41:05,394 --> 02:41:06,700
which you can execute.
3904
02:41:06,700 --> 02:41:10,800
So it returns a new DStream
by passing each element
3905
02:41:10,800 --> 02:41:13,946
of the source DStream
through a function,
3906
02:41:13,946 --> 02:41:15,600
which you have defined.
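(A one-line sketch of map on a DStream, assuming a DStream[String] called lines exists.)

    val lineLengths = lines.map(line => line.length)   // every element passes through the function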
3907
02:41:16,300 --> 02:41:17,789
Let's discuss flatMap,
3908
02:41:17,789 --> 02:41:20,074
that we have just
discussed it is going
3909
02:41:20,074 --> 02:41:21,565
to flatten up the things.
3910
02:41:21,565 --> 02:41:22,805
So in this case, also,
3911
02:41:22,805 --> 02:41:25,400
if you notice, we are just
kind of flattening it. It
3912
02:41:25,400 --> 02:41:27,169
is very similar to map,
3913
02:41:27,169 --> 02:41:31,100
but each input item
can be mapped to zero
3914
02:41:31,200 --> 02:41:34,200
or more output items here.
3915
02:41:34,200 --> 02:41:38,400
Okay, and it is going to return
a new DStream by passing
3916
02:41:38,400 --> 02:41:41,700
each source element
through a function. For this part,
3917
02:41:41,700 --> 02:41:44,600
we have just seen an example
of flatMap anyway,
3918
02:41:44,600 --> 02:41:47,300
so if you can remember
that, it will be more easy
3919
02:41:47,300 --> 02:41:49,200
for you to kind of
see the difference
3920
02:41:49,200 --> 02:41:55,260
between map and flatMap.
Now moving further: filter.
3921
02:41:55,360 --> 02:41:58,593
As the name states, you
can now filter out the values.
3922
02:41:58,593 --> 02:41:59,876
So let's say you have
3923
02:41:59,876 --> 02:42:03,701
a huge data set and you kind of
want to filter out some values.
3924
02:42:03,701 --> 02:42:06,900
You just want to kind of work
with some filtered data.
3925
02:42:06,900 --> 02:42:09,700
Maybe you want to remove
some part of it.
3926
02:42:09,700 --> 02:42:11,900
Maybe you are trying
to put some Logic on it.
3927
02:42:11,900 --> 02:42:15,800
Does this line contain
this word or not?
3928
02:42:16,100 --> 02:42:16,900
If that's so,
3929
02:42:16,900 --> 02:42:20,169
in that case you keep only the data
with that particular criteria.
3930
02:42:20,169 --> 02:42:21,691
So this is what we do here,
3931
02:42:21,691 --> 02:42:25,300
but definitely most of the time
the output is going to be smaller
3932
02:42:25,300 --> 02:42:31,000
in comparison to your input.
Reduce: reduce is just
3933
02:42:31,000 --> 02:42:34,500
like it's going to do a kind
of aggregation on the whole data.
3934
02:42:34,500 --> 02:42:37,400
Let's say in the end you want
to sum up all the data
3935
02:42:37,400 --> 02:42:38,200
what you have
3936
02:42:38,200 --> 02:42:41,500
that is going to be done
with the help of reduce.
3937
02:42:42,100 --> 02:42:43,800
Now after that group
3938
02:42:43,800 --> 02:42:48,600
by. GroupBy is like it's going
to combine all the common values;
3939
02:42:48,600 --> 02:42:50,600
that is what group
by is going to do.
3940
02:42:50,600 --> 02:42:53,112
So as you can see
in this example all the things
3941
02:42:53,112 --> 02:42:55,196
which are starting
with the same letter get grouped:
3942
02:42:55,196 --> 02:42:56,786
all the things starting
3943
02:42:56,786 --> 02:42:59,300
with J get grouped,
all the names starting
3944
02:42:59,300 --> 02:43:00,761
with C get grouped.
3945
02:43:00,800 --> 02:43:01,600
Right?
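(A hedged sketch of filter, a reduce-style aggregation and groupBy on a stream of words; the DStream name words is an assumption.)

    // assumption: words is a DStream[String] of individual words
    val longWords = words.filter(w => w.length > 3)              // keep only matching elements
    val counts    = words.map(w => (w, 1)).reduceByKey(_ + _)    // aggregate counts per word
    val byLetter  = words.map(w => (w.head, w)).groupByKey()     // group words by first letter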
3946
02:43:02,000 --> 02:43:03,300
So again, what is
the DStream window? Now to give
02:43:03,300 --> 02:43:07,500
this screen window now to give
you an example of this window?
3948
02:43:07,500 --> 02:43:10,108
Everybody must be
knowing Twitter, right?
3949
02:43:10,108 --> 02:43:12,000
So now, what happens in Twitter?
3950
02:43:12,000 --> 02:43:13,700
Let me go to my paint.
3951
02:43:14,100 --> 02:43:16,100
So in this example,
3952
02:43:16,100 --> 02:43:19,853
let's understand how
this windowing operation works. So,
3953
02:43:19,853 --> 02:43:21,400
let's say in the initial
3954
02:43:21,400 --> 02:43:24,600
10 seconds — in the initial
10 seconds —
3955
02:43:24,600 --> 02:43:27,200
Let's say the tweets
are happening in this way.
3956
02:43:27,200 --> 02:43:32,200
Let's say hash A,
hash A, hash A. Now,
3957
02:43:32,200 --> 02:43:35,773
which is the trending tweet?
Definitely hash A is
3958
02:43:35,773 --> 02:43:38,900
my trending one. Maybe
in the next 10 seconds —
3959
02:43:40,600 --> 02:43:46,500
in the next 10 seconds,
now again hash A, hash B.
3960
02:43:47,200 --> 02:43:48,400
Hash B is coming up.
Which is the trending one?
02:43:48,400 --> 02:43:51,400
which is the trending
Hash B is happening here.
3962
02:43:51,400 --> 02:43:51,800
Now.
3963
02:43:51,800 --> 02:43:54,261
Let's say in another 10 seconds.
3964
02:43:54,900 --> 02:43:56,700
Now this time let's say
3965
02:43:56,700 --> 02:44:03,266
hash B, hash B — so actually it
is hash B happening now,
3966
02:44:03,266 --> 02:44:05,266
which is trending? B only.
3967
02:44:05,500 --> 02:44:07,776
But now I want to find out
3968
02:44:07,776 --> 02:44:10,546
which is the trending
one in the last 30 seconds.
3969
02:44:11,400 --> 02:44:15,100
Hash A, right? Because
if I combine the windows, I can find it easily.
3970
02:44:15,400 --> 02:44:19,900
Now this is your windowing
operation example, meaning you
3971
02:44:19,900 --> 02:44:23,300
are not only looking
at your current window,
3972
02:44:23,300 --> 02:44:24,800
but you're also looking
3973
02:44:24,800 --> 02:44:27,516
at your previous windows.
When I say current window,
3974
02:44:27,516 --> 02:44:30,008
I'm talking about, let's say,
a 10-second slot.
3975
02:44:30,008 --> 02:44:32,600
In this 10-second slot,
let's say you are doing
3976
02:44:32,600 --> 02:44:35,431
this operation on hash B,
hash B, hash B —
3977
02:44:35,431 --> 02:44:37,456
so this is the current
window. Now you are
3978
02:44:37,456 --> 02:44:40,282
not only computing with respect
to your current window,
3979
02:44:40,282 --> 02:44:42,800
but you are also considering
your previous windows.
3980
02:44:42,800 --> 02:44:44,055
Now, let's say in this case.
3981
02:44:44,055 --> 02:44:44,681
If I ask you,
3982
02:44:44,681 --> 02:44:46,900
can you give me the output
of which is trending
3983
02:44:46,900 --> 02:44:48,361
in last 17 seconds?
3984
02:44:48,361 --> 02:44:50,900
Will you be able
to answer? No. Why?
3985
02:44:50,900 --> 02:44:54,900
Because you don't have
information for a partial 7 seconds;
3986
02:44:54,900 --> 02:44:56,400
you have information
3987
02:44:56,400 --> 02:45:01,000
for 10, 20, 30 seconds —
multiples of your window —
3988
02:45:01,200 --> 02:45:03,500
but not an intermediate one.
3989
02:45:03,500 --> 02:45:04,711
So keep this in mind.
3990
02:45:04,711 --> 02:45:07,365
Okay, so you will be able
to perform windowing
3991
02:45:07,365 --> 02:45:10,207
operation only with respect
to your window size.
3992
02:45:10,207 --> 02:45:11,900
It's not like you can create
3993
02:45:11,900 --> 02:45:15,085
any partial value and do
the window operation. Now,
3994
02:45:15,085 --> 02:45:16,800
let's get back to the slides.
3995
02:45:21,800 --> 02:45:23,203
Now it's a similar thing.
3996
02:45:23,203 --> 02:45:24,350
So now it is shown here
3997
02:45:24,350 --> 02:45:27,100
that we are not only considering
the current window,
3998
02:45:27,100 --> 02:45:30,200
but we are also considering
the previous windows.
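(A hedged sketch of the trending-hashtag idea as a windowed count: a 30-second window sliding every 10 seconds. Both durations must be multiples of the batch interval, which is exactly why a 17-second window is not possible. The DStream name hashTags is an assumption.)

    import org.apache.spark.streaming.Seconds

    val last30s = hashTags
      .map(tag => (tag, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))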
3999
02:45:30,200 --> 02:45:31,604
Now next, let's understand
4000
02:45:31,604 --> 02:45:35,300
the output operations
of DStreams.
4001
02:45:35,700 --> 02:45:38,434
when we talk
about output operations.
4002
02:45:38,434 --> 02:45:41,400
The output operations
are going to allow
4003
02:45:41,400 --> 02:45:45,853
the DStream data to be pushed
out to your external systems.
4004
02:45:45,853 --> 02:45:47,700
If you notice here, it means
4005
02:45:47,700 --> 02:45:51,300
whatever processing
you have done with respect to
4006
02:45:51,300 --> 02:45:54,300
whatever data you are working on
here, now your output you
4007
02:45:54,300 --> 02:45:57,100
can store in multiple ways:
again, in your file system,
4008
02:45:57,100 --> 02:45:58,600
You can keep in your database.
4009
02:45:58,600 --> 02:46:01,800
You can keep it even
in your external systems
4010
02:46:01,800 --> 02:46:04,200
so you can keep
in multiple places.
4011
02:46:04,200 --> 02:46:06,400
So that is
what being reflected here.
4012
02:46:07,500 --> 02:46:10,600
Now, so if I talk
about output operation,
4013
02:46:10,600 --> 02:46:11,653
these are the one
4014
02:46:11,653 --> 02:46:15,495
which are supported: we can print
out the values, we can use save
4015
02:46:15,495 --> 02:46:17,700
as text files. When you use
saveAsTextFiles,
4016
02:46:17,700 --> 02:46:19,500
it saves it into your HDFS.
4017
02:46:19,500 --> 02:46:21,736
If you want you can
also use it to save it
4018
02:46:21,736 --> 02:46:23,100
in the local file system.
4019
02:46:23,100 --> 02:46:25,174
You can save it as
an object file.
4020
02:46:25,174 --> 02:46:27,500
Also, you can save
it as a Hadoop file
4021
02:46:27,500 --> 02:46:30,800
or you can also apply
the foreachRDD function.
4022
02:46:31,200 --> 02:46:34,500
Now what is the
foreachRDD function?
4023
02:46:34,500 --> 02:46:35,956
Let's see this example.
4024
02:46:35,956 --> 02:46:39,700
So we will explain this
part in detail when we teach
4025
02:46:39,700 --> 02:46:41,600
you in the edureka sessions,
4026
02:46:41,600 --> 02:46:43,927
but just to give
you an idea now.
4027
02:46:43,927 --> 02:46:46,310
This is a very
powerful primitive
4028
02:46:46,310 --> 02:46:49,608
that is going to allow
your data to be sent out
4029
02:46:49,608 --> 02:46:51,400
to your external systems.
4030
02:46:51,400 --> 02:46:53,700
So using this you
can send it across
4031
02:46:53,700 --> 02:46:55,500
to your web server system.
4032
02:46:55,500 --> 02:46:57,385
We have just seen
our external system
4033
02:46:57,385 --> 02:46:58,904
that we can give file system.
4034
02:46:58,904 --> 02:46:59,900
It can be anything.
4035
02:46:59,900 --> 02:47:02,800
So using this you
will be able to transfer it.
4036
02:47:02,800 --> 02:47:05,100
You will be
able to send it out
4037
02:47:05,100 --> 02:47:07,162
to your external systems.
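(A hedged sketch of these output operations; the path is a placeholder and the foreachRDD body only hints at where the external-system code would go.)

    counts.print()                                // print the first elements of every batch
    counts.saveAsTextFiles("hdfs:///tmp/counts")  // one output directory per batch

    counts.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // placeholder: open a connection to your external system here
        // and write out each record in the partition
      }
    }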
4038
02:47:07,500 --> 02:47:11,500
Now, let's understand caching
and persistence. Now
4039
02:47:11,500 --> 02:47:14,300
when we talk
about caching and persistence,
4040
02:47:14,300 --> 02:47:18,900
DStreams also allow
the developers to cache
4041
02:47:19,000 --> 02:47:22,100
or to persist the streams data
4042
02:47:22,100 --> 02:47:27,023
in memory, meaning you
can keep your data in memory.
4043
02:47:27,023 --> 02:47:31,100
You can cache your data
in memory for a longer time.
4044
02:47:31,200 --> 02:47:33,200
Even after your
action is complete.
4045
02:47:33,200 --> 02:47:36,000
It is not going to delete it
4046
02:47:36,100 --> 02:47:38,946
so you can just Use
this as many times
4047
02:47:38,946 --> 02:47:39,800
as you want
4048
02:47:39,800 --> 02:47:42,900
so you can simply use
the persist method to do that.
4049
02:47:42,900 --> 02:47:44,485
So for your input streams
4050
02:47:44,485 --> 02:47:48,100
which are receiving the data
over the network may be using
4051
02:47:48,100 --> 02:47:50,000
Kafka, Flume, or sockets,
4052
02:47:50,400 --> 02:47:54,500
the default persistence level
is set to replicate
4053
02:47:54,500 --> 02:47:57,331
the data to two nodes
for fault tolerance,
4054
02:47:57,331 --> 02:48:00,500
like it is also going
to be replicating the data
4055
02:48:00,502 --> 02:48:01,600
onto two nodes,
4056
02:48:01,600 --> 02:48:04,800
so you can see the same thing
in this diagram.
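(A minimal sketch of caching a DStream; socketStream is the assumed input stream from earlier.)

    socketStream.persist()   // keep the stream's RDDs in memory for reuse;
                             // receiver-based streams already default to a replicated level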
4057
02:48:05,300 --> 02:48:07,979
Let's understand
accumulators, broadcast
4058
02:48:07,979 --> 02:48:09,600
variables and checkpoints.
4059
02:48:09,700 --> 02:48:12,553
Now, these are mostly
for your performance.
4060
02:48:12,553 --> 02:48:16,626
So this is going to help you
4061
02:48:16,626 --> 02:48:18,444
in the performance part.
4062
02:48:18,444 --> 02:48:20,600
So accumulators are nothing
4063
02:48:20,600 --> 02:48:25,200
but variables that are only
added through an associative
4064
02:48:25,300 --> 02:48:27,400
and commutative operation.
4065
02:48:28,000 --> 02:48:31,100
Usually if you're coming
from a Hadoop background,
4066
02:48:31,100 --> 02:48:32,678
if you have done, let's say,
4067
02:48:32,678 --> 02:48:35,400
mapreduce programming you
must have seen something.
4068
02:48:35,400 --> 02:48:36,900
Counters like that,
4069
02:48:36,900 --> 02:48:38,749
they'll be used
as error counters,
4070
02:48:38,749 --> 02:48:42,000
which kind of helps us to debug
the program as well and you
4071
02:48:42,000 --> 02:48:44,700
can perform some analysis
in the console itself.
4072
02:48:44,700 --> 02:48:46,600
Now something similar
you can do
4073
02:48:46,600 --> 02:48:48,100
with the accumulators as well.
4074
02:48:48,100 --> 02:48:50,152
So you can implement
your counters with accumulators;
4075
02:48:50,152 --> 02:48:52,800
apart from this, you can
also sum up things
4076
02:48:52,800 --> 02:48:54,800
with this. Now,
4077
02:48:54,800 --> 02:48:57,800
if you want to track them
through the UI, you can also do
4078
02:48:57,800 --> 02:49:00,402
that as you can see
in this UI itself.
4079
02:49:00,402 --> 02:49:02,500
You can see all your accumulators
4080
02:49:02,500 --> 02:49:05,400
as well. Now similarly,
we have broadcast
4081
02:49:05,400 --> 02:49:10,300
variables. Broadcast variables
allow the programmer to keep
4082
02:49:10,300 --> 02:49:14,787
a read-only variable cached
on all the machines
4083
02:49:14,787 --> 02:49:16,325
which are available.
4084
02:49:16,838 --> 02:49:19,838
Now it is going
to be kind of caching it
4085
02:49:19,838 --> 02:49:21,684
on all the machines now,
4086
02:49:22,000 --> 02:49:25,900
They can be used to give
every node a copy
4087
02:49:26,200 --> 02:49:29,000
of a large input data set
4088
02:49:29,300 --> 02:49:35,028
in an efficient manner, so you
can just use that. Spark
4089
02:49:35,028 --> 02:49:39,643
also attempts to distribute
broadcast variables
4090
02:49:39,643 --> 02:49:41,700
using efficient broadcast
4091
02:49:41,700 --> 02:49:44,907
algorithms to reduce
the communication cost.
4092
02:49:44,907 --> 02:49:46,100
So as you can see here,
4093
02:49:46,100 --> 02:49:47,800
we are passing
this broadcast value
4094
02:49:47,800 --> 02:49:50,700
it is going to the spark context
and then it is broadcasting
4095
02:49:50,700 --> 02:49:51,700
to these places.
4096
02:49:51,700 --> 02:49:55,500
So this is how it
is working in this application.
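(A hedged sketch of an accumulator used as a counter and a broadcast variable used as a read-only lookup; the names and values are assumptions, using the 1.x-style accumulator API seen in this demo environment.)

    val errorCount = sc.accumulator(0, "errors")          // only added to, associatively
    val stopWords  = sc.broadcast(Set("a", "an", "the"))  // read-only copy cached on every node

    // assumption: words is an RDD[String]
    // words.foreach(w => if (w.isEmpty) errorCount.add(1))
    // val filtered = words.filter(w => !stopWords.value.contains(w))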
4097
02:49:55,700 --> 02:49:58,582
Generally when we teach
this in the classes, and also
4098
02:49:58,582 --> 02:50:00,600
since these are
advanced concepts,
4099
02:50:00,600 --> 02:50:02,953
we kind
of try to explain them to you
4100
02:50:02,953 --> 02:50:05,189
with the practicals,
which is not possible right now.
4101
02:50:05,189 --> 02:50:08,915
I just want to give you an idea
about what are these things?
4102
02:50:08,915 --> 02:50:09,764
So when you go
4103
02:50:09,764 --> 02:50:12,009
with the practicals
of all these things
4104
02:50:12,009 --> 02:50:13,367
like how accumulators work,
4105
02:50:13,367 --> 02:50:16,700
how data
is getting broadcasted — things
4106
02:50:16,700 --> 02:50:19,941
become more and more clear
at that time. Right now,
4107
02:50:19,941 --> 02:50:20,683
I just want
4108
02:50:20,683 --> 02:50:24,600
everybody to get a
high-level overview of things.
4109
02:50:25,246 --> 02:50:28,400
Now moving further:
what are checkpoints?
4110
02:50:28,400 --> 02:50:30,257
so checkpoints are similar
4111
02:50:30,257 --> 02:50:32,900
to your checkpoints
in gaming. Now,
4112
02:50:32,900 --> 02:50:37,200
they make the application
run 24/7, make it resilient
4113
02:50:37,200 --> 02:50:41,400
to failures unrelated
to the application logic.
4114
02:50:41,500 --> 02:50:43,214
So if you can see this diagram,
4115
02:50:43,214 --> 02:50:45,296
we are just
creating the checkpoint.
4116
02:50:45,296 --> 02:50:47,200
So in the
metadata checkpointing,
4117
02:50:47,200 --> 02:50:50,279
you can see it is the saving
of the information
4118
02:50:50,279 --> 02:50:53,827
which defines the streaming
computation. If we talk
4119
02:50:53,827 --> 02:50:55,300
about data checkpointing,
4120
02:50:55,600 --> 02:51:01,000
it is the saving of the generated
RDDs to reliable storage.
4121
02:51:01,100 --> 02:51:03,400
So both of these
are generating
4122
02:51:03,400 --> 02:51:06,900
the checkpoints.
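(A one-line sketch of enabling checkpointing; the HDFS directory is a placeholder.)

    ssc.checkpoint("hdfs:///checkpoints/streaming-app")   // metadata and generated RDDs are saved here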
4123
02:51:06,900 --> 02:51:09,815
Now moving forward, we are going
to move towards our project,
4124
02:51:09,815 --> 02:51:14,300
where we are going to perform
our Twitter sentiment analysis.
4125
02:51:14,400 --> 02:51:17,413
Let's discuss a very
important use case
4126
02:51:17,413 --> 02:51:19,600
of Twitter sentiment analysis.
4127
02:51:19,600 --> 02:51:21,500
This is going to
be very interesting
4128
02:51:21,500 --> 02:51:24,600
because we will just
do a real-time analysis
4129
02:51:24,900 --> 02:51:28,588
on Twitter sentiment,
and there can be a
4130
02:51:28,588 --> 02:51:31,900
lot of possibilities
for this sentiment analysis;
4131
02:51:31,900 --> 02:51:33,631
we will
be taking something
4132
02:51:33,631 --> 02:51:36,000
for this tutorial, and it's going
to be very interesting.
4133
02:51:36,100 --> 02:51:39,900
So generally when we do
all this in our course,
4134
02:51:39,900 --> 02:51:41,070
it is more detailed
4135
02:51:41,070 --> 02:51:44,582
because right now
going in depth is
4136
02:51:44,582 --> 02:51:46,000
not very much possible,
4137
02:51:46,000 --> 02:51:48,600
but during the edureka
training,
4138
02:51:48,600 --> 02:51:51,470
you will learn all these things
in depth,
4139
02:51:51,470 --> 02:51:52,994
right? That's something
4140
02:51:52,994 --> 02:51:55,100
which we learned
during the session.
4141
02:51:55,100 --> 02:51:59,061
Now, we talked
about some use cases of Twitter.
4142
02:51:59,300 --> 02:52:01,300
As I said there can be
multiple use cases
4143
02:52:01,300 --> 02:52:02,300
which are possible
4144
02:52:02,300 --> 02:52:04,156
because there is so much data
4145
02:52:04,156 --> 02:52:07,100
behind whatever people continue
doing on so much
4146
02:52:07,100 --> 02:52:08,700
of social media right now
4147
02:52:08,700 --> 02:52:11,288
these days everyone is
very active on it, right?
4148
02:52:11,288 --> 02:52:12,400
You must be noticing
4149
02:52:12,400 --> 02:52:15,300
that even politicians
have started using Twitter
4150
02:52:15,300 --> 02:52:18,000
and all
their tweets are being shown
4151
02:52:18,000 --> 02:52:21,200
in the news channels, and statistics
of what is trending on it,
4152
02:52:21,200 --> 02:52:23,900
because they are talking
about positive negative
4153
02:52:23,900 --> 02:52:26,100
whenever any politician
says something, right?
4154
02:52:26,100 --> 02:52:27,900
And if we talk
about anything else — even
4155
02:52:27,900 --> 02:52:29,100
if we talk about, let's say,
4156
02:52:29,100 --> 02:52:32,260
any Sports FIFA World Cup
is going on then you will notice
4157
02:52:32,260 --> 02:52:35,200
Twitter will always be filled up
with a lot of tweets.
4158
02:52:35,200 --> 02:52:38,435
So how we can make use of it
how we can do some analysis
4159
02:52:38,435 --> 02:52:41,400
on top of it — that is what we
are going to learn in this.
4160
02:52:41,400 --> 02:52:44,600
So there can be multiple sorts
of sentiment analysis;
4161
02:52:44,600 --> 02:52:47,595
it can be done for
your crisis management, for
4162
02:52:47,595 --> 02:52:50,900
services, target marketing —
we can keep on talking about it.
4163
02:52:50,900 --> 02:52:52,716
When a new movie releases now,
4164
02:52:52,716 --> 02:52:55,200
even the moviemakers
kind of analyze:
4165
02:52:55,200 --> 02:52:57,628
okay, how is this movie
going to perform?
4166
02:52:57,628 --> 02:53:00,356
so they can easily make
out of it beforehand.
4167
02:53:00,356 --> 02:53:04,200
Okay, this movie is going to go
in this kind of range of profit
4168
02:53:04,200 --> 02:53:05,800
or not. Interesting, right?
4169
02:53:05,800 --> 02:53:08,200
I let us explore
not to Impossible even
4170
02:53:08,200 --> 02:53:10,500
in the political campaign
in 50 must have heard
4171
02:53:10,600 --> 02:53:11,400
that in u.s.
4172
02:53:11,400 --> 02:53:13,600
When the president
election was happening.
4173
02:53:13,600 --> 02:53:15,676
They have used in fact role
4174
02:53:15,676 --> 02:53:19,600
of social media of all
this analysis at all and then
4175
02:53:19,600 --> 02:53:22,400
that have ever played
a major role in winning
4176
02:53:22,400 --> 02:53:23,880
that election similarly,
4177
02:53:23,880 --> 02:53:26,100
how weather investors
want to predict
4178
02:53:26,100 --> 02:53:28,950
whether they should invest
in a particular company or not,
4179
02:53:28,950 --> 02:53:30,300
whether they want to check
4180
02:53:30,300 --> 02:53:33,715
that whether like we
should Target which customers
4181
02:53:33,715 --> 02:53:34,900
for advertisement
4182
02:53:34,900 --> 02:53:38,000
because we cannot Target
everyone problem with targeting
4183
02:53:38,000 --> 02:53:40,580
everyone is and if we try
to Target element,
4184
02:53:40,580 --> 02:53:43,032
it will be very costly
so we want to kind
4185
02:53:43,032 --> 02:53:44,333
of set it a little bit
4186
02:53:44,333 --> 02:53:46,178
because maybe my set
of people whom I
4187
02:53:46,178 --> 02:53:48,954
should send this advertisement
to be more effective
4188
02:53:48,954 --> 02:53:52,000
and Wells as well as a queen
is going to be cost effective
4189
02:53:52,000 --> 02:53:54,100
as well if you wanted
to do the products
4190
02:53:54,100 --> 02:53:57,200
and services also include
I guess we can also do this.
4191
02:53:57,200 --> 02:53:57,500
Now.
4192
02:53:57,500 --> 02:54:00,900
Let's see some use cases
like the him terms of use case.
4193
02:54:00,900 --> 02:54:03,100
I will show you a practical
how it comes.
4194
02:54:03,100 --> 02:54:04,000
So first of all,
4195
02:54:04,000 --> 02:54:06,724
we will be importing all
the required packages
4196
02:54:06,724 --> 02:54:08,725
because we are going
to not perform
4197
02:54:08,725 --> 02:54:10,400
or Twitter sentiment analysis.
4198
02:54:10,400 --> 02:54:12,824
So we will be requiring
some packages for that.
4199
02:54:12,824 --> 02:54:15,700
So we will be doing that as
a first step then we need
4200
02:54:15,700 --> 02:54:18,641
to set all of our authentication;
without authentication,
4201
02:54:18,641 --> 02:54:21,405
we cannot do anything.
Now here the challenge is
4202
02:54:21,405 --> 02:54:23,201
we cannot directly
put our username
4203
02:54:23,201 --> 02:54:24,431
and password — don't you think
4204
02:54:24,431 --> 02:54:27,100
it will get hacked if we directly put
our username and password?
4205
02:54:27,200 --> 02:54:28,800
So Twitter came up with something,
4206
02:54:28,800 --> 02:54:30,400
a very smart thing.
4207
02:54:30,500 --> 02:54:33,100
What they did is they came
up with something
4208
02:54:33,100 --> 02:54:35,080
called OAuth authentication tokens.
4209
02:54:35,080 --> 02:54:37,100
So you have to go
to dev dot
4210
02:54:37,100 --> 02:54:39,100
twitter.com, log in from there,
4211
02:54:39,100 --> 02:54:42,972
and you will find kind of all
this authentication tokens
4212
02:54:42,972 --> 02:54:44,100
available to you
4213
02:54:44,100 --> 02:54:47,900
which will be required; take
them and put them here. Then,
4214
02:54:47,900 --> 02:54:50,335
as we have learned
the DStream transformations,
4215
02:54:50,335 --> 02:54:52,294
you will be doing
all that computation
4216
02:54:52,294 --> 02:54:55,100
so you will be having
your DStream transformations.
4217
02:54:55,100 --> 02:54:58,100
Then you will be
generating your tweet data.
4218
02:54:58,100 --> 02:55:01,472
I'm going to save it
in this particular directory.
4219
02:55:01,472 --> 02:55:03,400
Once you are done with this.
4220
02:55:03,400 --> 02:55:06,200
Then you are going
to extract your sentiment
4221
02:55:06,200 --> 02:55:07,600
once you extract it.
4222
02:55:07,600 --> 02:55:08,400
And you're done.
4223
02:55:08,400 --> 02:55:11,900
Let me show you quickly
how it works in our VM here.
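(A hedged skeleton of the steps just listed. The OAuth values, filter keyword, batch interval and output prefix are placeholders, and TwitterUtils comes from the spark-streaming-twitter package typically used in this kind of demo; the actual sentiment-scoring step is omitted.)

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.twitter.TwitterUtils

    // placeholders: the four tokens come from dev.twitter.com
    System.setProperty("twitter4j.oauth.consumerKey", "<consumerKey>")
    System.setProperty("twitter4j.oauth.consumerSecret", "<consumerSecret>")
    System.setProperty("twitter4j.oauth.accessToken", "<accessToken>")
    System.setProperty("twitter4j.oauth.accessTokenSecret", "<accessTokenSecret>")

    val ssc    = new StreamingContext(sc, Seconds(5))
    val tweets = TwitterUtils.createStream(ssc, None, Seq("trump"))  // keyword is a placeholder
    val texts  = tweets.map(status => status.getText)
    texts.saveAsTextFiles("output/tweets")   // sentiment extraction would be applied before saving

    ssc.start()
    ssc.awaitTermination()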
4224
02:55:12,000 --> 02:55:15,226
Now one more interesting thing
about edureka would be
4225
02:55:15,226 --> 02:55:18,247
that you will be getting all
these preconfigured machines.
4226
02:55:18,247 --> 02:55:19,482
So you need not worry
4227
02:55:19,482 --> 02:55:21,892
about from where I
will be getting all this.
4228
02:55:21,892 --> 02:55:25,100
or whether it is very difficult
to install. When I was learning,
4229
02:55:25,100 --> 02:55:26,400
this open-source software
4230
02:55:26,400 --> 02:55:29,061
was not working for me
in my operating system;
4231
02:55:29,061 --> 02:55:30,179
It was not working.
4232
02:55:30,179 --> 02:55:32,400
So many things we
have generally seen
4233
02:55:32,400 --> 02:55:34,700
people face issues to resolve
4234
02:55:34,700 --> 02:55:36,600
everything up be we kind
4235
02:55:36,600 --> 02:55:40,000
of provide all this fear
question from Rockville.
4236
02:55:40,000 --> 02:55:41,900
This pm has priest but yes,
4237
02:55:41,900 --> 02:55:44,300
that's what it has
everything pre-installed.
4238
02:55:44,300 --> 02:55:46,700
Whichever will be required
for your training.
4239
02:55:46,700 --> 02:55:49,133
So that's the best part
what we also provide.
4240
02:55:49,133 --> 02:55:51,700
So in this case your Eclipse
will already be there.
4241
02:55:51,700 --> 02:55:53,900
You need to just go
to your Eclipse location.
4242
02:55:53,900 --> 02:55:55,300
Let me show you how you can.
4243
02:55:55,300 --> 02:55:56,700
So cold that if you want
4244
02:55:57,200 --> 02:56:00,600
because it gives you it gives
you just need to go inside it
4245
02:56:00,600 --> 02:56:02,200
and double-click on it at that.
4246
02:56:02,200 --> 02:56:04,400
You need not go and kind
of installed eclipse
4247
02:56:04,400 --> 02:56:07,400
and even Spark will
already be installed for you.
4248
02:56:07,400 --> 02:56:09,900
Let us go in our project.
4249
02:56:09,900 --> 02:56:12,895
So this is our project
which is in front of you.
4250
02:56:12,895 --> 02:56:15,674
This is my project
on which we are going to work.
4251
02:56:15,674 --> 02:56:16,653
Now you can see
4252
02:56:16,653 --> 02:56:19,522
that we have first
imported all the libraries
4253
02:56:19,522 --> 02:56:22,146
then we have set
our authentication tokens,
4254
02:56:22,146 --> 02:56:24,806
and then we have applied
4255
02:56:24,806 --> 02:56:27,900
the DStream transformations,
extracted the sentiment,
4256
02:56:27,900 --> 02:56:29,900
and then saved
the output finally.
4257
02:56:29,900 --> 02:56:32,100
So these are the things
which we have done
4258
02:56:32,100 --> 02:56:36,000
in this program. Now let's
execute it. To run this program,
4259
02:56:36,000 --> 02:56:39,900
it's very simple: go
to Run As, and from Run
4260
02:56:39,900 --> 02:56:42,700
As click on Scala Application.
4261
02:56:43,200 --> 02:56:45,276
You will notice in the end.
4262
02:56:45,276 --> 02:56:48,600
it is running —
that's good to see —
4263
02:56:48,886 --> 02:56:51,286
so it is executing the program.
4264
02:56:51,286 --> 02:56:52,440
Let us execute.
4265
02:56:55,700 --> 02:56:57,800
We are bringing tweets for Trump.
4266
02:56:57,800 --> 02:57:01,292
So tweets for Trump anyway
tend to be negative.
4267
02:57:01,292 --> 02:57:01,629
Right?
4268
02:57:01,629 --> 02:57:02,654
It's an achievement
4269
02:57:02,654 --> 02:57:06,036
because anything you get for Trump
will tend to be negative; Trump is
4270
02:57:06,036 --> 02:57:07,563
anyway the hot topic for us.
4271
02:57:07,563 --> 02:57:09,200
Maybe make it a little bigger.
4272
02:57:14,100 --> 02:57:17,200
You will notice a lot
of negative tweets coming up on.
4273
02:57:24,700 --> 02:57:26,900
Yes, now, I'm just stopping it
4274
02:57:26,900 --> 02:57:28,742
so that I can
show you something.
4275
02:57:28,742 --> 02:57:28,972
Yes.
4276
02:57:28,972 --> 02:57:30,700
It's filtering the tweets;
4277
02:57:30,800 --> 02:57:33,700
we have actually written
that in the program itself.
4278
02:57:33,700 --> 02:57:36,300
We have given
'trump' at one location; using
4279
02:57:36,300 --> 02:57:38,087
that we were kind of asking
4280
02:57:38,087 --> 02:57:41,200
for the tweets of Trump. Now
here we are doing the analysis,
4281
02:57:41,200 --> 02:57:43,064
and it is also going to tell us
4282
02:57:43,064 --> 02:57:46,264
whether it's a positive tweet or a
negative tweet.
4283
02:57:46,264 --> 02:57:47,500
It is showing it as negative
4284
02:57:47,500 --> 02:57:50,444
because tweets about Trump generally
will not be positive, right?
4285
02:57:50,444 --> 02:57:51,454
So that's something
4286
02:57:51,454 --> 02:57:53,790
which is why — that's
the reason you're finding
4287
02:57:53,790 --> 02:57:54,800
This is a negative.
4288
02:57:54,900 --> 02:57:56,412
Similarly if there
will be any other
4289
02:57:56,412 --> 02:57:57,964
that we should
be getting a static.
4290
02:57:57,964 --> 02:58:00,200
So right now if I keep on
moving ahead we will see
4291
02:58:00,200 --> 02:58:02,300
multiple negative tweets
which will come up.
4292
02:58:02,300 --> 02:58:04,600
So that's how this program runs.
4293
02:58:04,900 --> 02:58:07,000
So this is how our program
4294
02:58:07,000 --> 02:58:09,403
will be executing;
we can also stop it.
4295
02:58:09,403 --> 02:58:13,100
Even the output results will be
getting stored at a location,
4296
02:58:13,100 --> 02:58:16,500
as you can see in this
if I go to my location here,
4297
02:58:16,500 --> 02:58:19,100
this is my actual project
where it is running
4298
02:58:19,100 --> 02:58:20,533
so you can just come
4299
02:58:20,533 --> 02:58:23,400
to this location — here
is your output.
4300
02:58:23,400 --> 02:58:24,982
All your output
is getting stored there,
4301
02:58:24,982 --> 02:58:26,200
so you can just take a look as
4302
02:58:26,200 --> 02:58:28,200
but yes, so it's
everything is done
4303
02:58:28,200 --> 02:58:29,971
by using Spark Streaming.
4304
02:58:29,971 --> 02:58:30,300
Okay.
4305
02:58:30,300 --> 02:58:31,900
That's what we've
seen, right?
4306
02:58:31,900 --> 02:58:33,653
We were doing
it with respect
4307
02:58:33,653 --> 02:58:35,200
to the DStream transformations,
4308
02:58:35,200 --> 02:58:38,300
and so we have done all that
with the help of Spark Streaming.
4309
02:58:38,400 --> 02:58:41,200
So that is one
of those awesome part about this
4310
02:58:41,200 --> 02:58:44,700
that you can do such
a powerful things with respect
4311
02:58:44,700 --> 02:58:47,279
to your streams this way.
4312
02:58:47,279 --> 02:58:49,500
Now, let's analyze the results.
4313
02:58:49,800 --> 02:58:51,152
So as we have just seen
4314
02:58:51,152 --> 02:58:53,400
that it is showing
whether it's a positive tweet
4315
02:58:53,400 --> 02:58:54,800
or a negative tweet.
4316
02:58:55,000 --> 02:58:57,200
So this is where your output
is getting stored;
4317
02:58:57,200 --> 02:59:00,000
as shown, your output
will appear like this.
4318
02:59:00,000 --> 02:59:00,300
Okay.
4319
02:59:00,300 --> 02:59:02,700
This is just how
your output is explicitly
4320
02:59:02,700 --> 02:59:03,762
printed; it will also tell
4321
02:59:03,762 --> 02:59:05,848
whether it's a neutral
one positive one
4322
02:59:05,848 --> 02:59:07,277
negative one everything.
4323
02:59:07,277 --> 02:59:09,600
We have done it
with the help of Spark
4324
02:59:09,600 --> 02:59:12,000
Streaming only. Now we
have done it for Trump,
4325
02:59:12,000 --> 02:59:14,000
as I just explained you
that we have put
4326
02:59:14,000 --> 02:59:15,555
in our program itself 'trump';
4327
02:59:15,555 --> 02:59:17,589
like we have put
everything up here
4328
02:59:17,589 --> 02:59:21,000
and based on that only we
are getting all this output. Now
4329
02:59:21,000 --> 02:59:23,498
we can apply all
the sentiment analysis
4330
02:59:23,498 --> 02:59:24,403
and like this.
4331
02:59:24,403 --> 02:59:25,731
Like we have learned.
4332
02:59:25,731 --> 02:59:28,754
So I hope you have found
all this this specially
4333
02:59:28,754 --> 02:59:30,593
this use case very much useful
4334
02:59:30,593 --> 02:59:32,800
for you kind of
getting you that yes,
4335
02:59:32,800 --> 02:59:34,388
it can be done like this.
4336
02:59:34,388 --> 02:59:36,200
But right now we
have put from here,
4337
02:59:36,200 --> 02:59:38,550
but if you want you can keep
on putting the hashtag as
4338
02:59:38,550 --> 02:59:40,286
well because that's
how we are doing it.
4339
02:59:40,286 --> 02:59:41,886
You can keep on
changing the tax.
4340
02:59:41,886 --> 02:59:44,335
Maybe you can kind of code
for, let's say, FIFA —
4341
02:59:44,335 --> 02:59:45,200
for stuff is going
4342
02:59:45,200 --> 02:59:49,000
on a cricket match will be going
on we can just put the tweets
4343
02:59:49,000 --> 02:59:52,300
according to that. In
that case, instead of trump,
4344
02:59:52,300 --> 02:59:53,980
You can put any player named
4345
02:59:53,980 --> 02:59:56,432
or maybe a Team name
and you will see all
4346
02:59:56,432 --> 02:59:58,300
those trends coming up.
4347
02:59:58,300 --> 03:00:00,700
Okay, so that's
how you can play with this.
4348
03:00:01,000 --> 03:00:01,500
Now.
4349
03:00:01,800 --> 03:00:04,400
This is there are
multiple examples with it,
4350
03:00:04,400 --> 03:00:05,400
which we can play
4351
03:00:05,500 --> 03:00:09,500
and this use case can even be
evolved into multiple other types
4352
03:00:09,500 --> 03:00:10,250
of use cases.
4353
03:00:10,250 --> 03:00:12,200
You can just keep
on transforming it
4354
03:00:12,200 --> 03:00:14,300
according to your own use cases.
4355
03:00:14,400 --> 03:00:17,800
So that's it about Spark Streaming,
which is what I wanted to discuss.
4356
03:00:17,800 --> 03:00:21,000
So I hope you must
have found it useful.
4357
03:00:26,000 --> 03:00:28,228
So in classification generally
4358
03:00:28,228 --> 03:00:31,200
what happens just
to give you an example.
4359
03:00:31,300 --> 03:00:33,867
You must have notice
the spam email box.
4360
03:00:33,867 --> 03:00:36,500
I hope everybody
must be having have seen
4361
03:00:36,500 --> 03:00:39,700
that folder in your spam
email box in Gmail.
4362
03:00:39,800 --> 03:00:45,000
Now when any new email comes up,
how does Google decide
4363
03:00:45,165 --> 03:00:49,134
whether it's a spam email
or a non-spam email?
4364
03:00:49,300 --> 03:00:53,400
That is an example
of classification. Clustering:
4365
03:00:53,576 --> 03:00:56,423
let's say you go
to Google News;
4366
03:00:56,500 --> 03:00:58,794
when you type
something, it groups
4367
03:00:58,794 --> 03:01:00,300
All the news together
4368
03:01:00,300 --> 03:01:04,700
that is called clustering.
Regression is also one
4369
03:01:04,700 --> 03:01:07,300
of the very important
techniques here.
4370
03:01:07,500 --> 03:01:11,700
The regression is let's say
you have a house
4371
03:01:11,900 --> 03:01:14,100
and you want to sell that house
4372
03:01:14,400 --> 03:01:16,500
and you have no idea.
4373
03:01:16,700 --> 03:01:18,715
What is the optimal price?
4374
03:01:18,715 --> 03:01:21,100
You should keep for your house.
4375
03:01:21,100 --> 03:01:24,400
Now this regression
will help you too.
4376
03:01:24,400 --> 03:01:28,534
to achieve that. Collaborative
filtering: you might have seen,
4377
03:01:28,534 --> 03:01:31,000
when you go
to your Amazon web page
4378
03:01:31,000 --> 03:01:33,400
that they show you
a recommendation, right?
4379
03:01:33,400 --> 03:01:34,430
You can buy this
4380
03:01:34,430 --> 03:01:38,400
because you are buying this
but this is done with the help
4381
03:01:38,400 --> 03:01:40,900
of collaborative filtering.
4382
03:01:42,028 --> 03:01:44,315
Before I move to the project,
4383
03:01:44,315 --> 03:01:47,700
I want to show you
the practical part of how we
4384
03:01:47,700 --> 03:01:50,300
will be executing Spark things.
4385
03:01:50,503 --> 03:01:53,196
So let me take you
to the VM machine
4386
03:01:53,300 --> 03:01:55,300
which will be provided
by edureka.
4387
03:01:55,300 --> 03:01:57,928
So this machines are also
provided by the Rekha.
4388
03:01:57,928 --> 03:02:00,222
So you need not worry
about from where I
4389
03:02:00,222 --> 03:02:01,963
will be getting the software.
4390
03:02:01,963 --> 03:02:04,421
What I will be doing
recite It Roll there.
4391
03:02:04,421 --> 03:02:07,300
Everything is taken care back
into they come now.
4392
03:02:07,300 --> 03:02:08,957
Once you will be coming
4393
03:02:08,957 --> 03:02:12,059
to this, you will see
a machine like this.
4394
03:02:12,059 --> 03:02:13,300
let me close this.
4395
03:02:13,300 --> 03:02:16,970
So what will happen you will see
a blank machine like this.
4396
03:02:16,970 --> 03:02:18,300
Let me show you this.
4397
03:02:18,300 --> 03:02:20,500
So this is how your machine
will look like.
4398
03:02:20,500 --> 03:02:24,100
Now what you are going to do
in order to start working.
4399
03:02:24,100 --> 03:02:26,600
You will be opening
this terminal by clicking
4400
03:02:26,600 --> 03:02:27,800
on this black option.
4401
03:02:28,000 --> 03:02:29,300
Now after that,
4402
03:02:29,400 --> 03:02:34,400
what you can do is you
can now go to your Spark. Now,
4403
03:02:34,400 --> 03:02:39,300
how can I work with Spark
in order to execute any program
4404
03:02:39,300 --> 03:02:43,000
in Spark? Using the
Scala shell, you
4405
03:02:43,000 --> 03:02:46,700
will be entering it as spark-
4406
03:02:46,700 --> 03:02:49,400
shell. If you type spark-shell,
4407
03:02:49,500 --> 03:02:52,500
it will take you
to the Scala prompt,
4408
03:02:52,800 --> 03:02:55,800
where you can write
your Spark programs
4409
03:02:56,100 --> 03:03:00,020
by using the Scala
programming language.
4410
03:03:00,020 --> 03:03:01,558
you can notice this.
4411
03:03:02,200 --> 03:03:06,300
Now, can you see that it
is also giving me the 1.5.2 version?
4412
03:03:06,300 --> 03:03:09,200
So that is the version
of your Spark.
4413
03:03:09,800 --> 03:03:11,400
Now you can see here.
4414
03:03:11,400 --> 03:03:15,200
You can also see the Spark
context available as sc
4415
03:03:15,200 --> 03:03:17,752
when you get connected
to your spark-shell.
4416
03:03:17,752 --> 03:03:21,441
You can see this will be
by default available to you.
4417
03:03:21,441 --> 03:03:22,800
Let us get connected.
4418
03:03:22,800 --> 03:03:23,800
It takes some time.
4419
03:03:39,207 --> 03:03:40,746
Now, we got connected.
4420
03:03:40,746 --> 03:03:43,900
So we got connected
to this Scala prompt. Now,
4421
03:03:43,900 --> 03:03:45,894
if I want to come out of it,
4422
03:03:45,894 --> 03:03:49,300
I will just type exit;
it will just let me come
4423
03:03:49,300 --> 03:03:51,400
out of this prompt. Now,
4424
03:03:52,100 --> 03:03:56,176
secondly, I can also write
my programs with Python.
4425
03:03:56,176 --> 03:03:57,407
So what I can do
4426
03:03:57,500 --> 03:04:00,200
if I want to do
programming and Spark,
4427
03:04:00,200 --> 03:04:03,040
but with the
Python programming language,
4428
03:04:03,040 --> 03:04:05,300
I will be connecting
with PySpark.
4429
03:04:05,300 --> 03:04:09,148
So I just need to type pyspark
in order to get connected
4430
03:04:09,148 --> 03:04:09,912
to your Python.
4431
03:04:09,912 --> 03:04:10,206
Okay.
4432
03:04:10,206 --> 03:04:11,791
I'm not getting connected now
4433
03:04:11,791 --> 03:04:13,576
because I'm not
going to require it;
4434
03:04:13,576 --> 03:04:16,700
I think I will be explaining
everything with Scala itself.
4435
03:04:16,700 --> 03:04:19,700
But if you want to get connected,
you can type pyspark.
4436
03:04:19,700 --> 03:04:21,100
So let's again get connected
4437
03:04:21,100 --> 03:04:23,900
to my spark-shell
now. Meanwhile,
4438
03:04:23,900 --> 03:04:25,800
this is getting connected.
4439
03:04:25,800 --> 03:04:27,800
let us create a small file.
4440
03:04:27,800 --> 03:04:29,823
So let us create
a file so currently
4441
03:04:29,823 --> 03:04:31,897
if you notice I
don't have any file.
4442
03:04:31,897 --> 03:04:32,281
Okay.
4443
03:04:32,284 --> 03:04:34,300
I already have a.txt.
4444
03:04:34,300 --> 03:04:37,300
So let's say cat a.txt.
4445
03:04:37,400 --> 03:04:38,958
So I have some data: one,
4446
03:04:38,958 --> 03:04:40,200
two, three, four, five.
4447
03:04:40,200 --> 03:04:42,362
This is my data,
which is with me.
4448
03:04:42,362 --> 03:04:44,000
Now what I'm going to do,
4449
03:04:44,000 --> 03:04:47,900
let me push this file
into HDFS. But first let me check
4450
03:04:47,900 --> 03:04:49,900
if it is already available
4451
03:04:49,900 --> 03:04:55,000
in my HDFS. That means:
hadoop dfs -
4452
03:04:55,000 --> 03:04:57,900
cat a.txt, just
to quickly check
4453
03:04:57,900 --> 03:04:59,700
if it is already available.
4454
03:05:06,100 --> 03:05:09,400
There is no such file, so let
me first put this file
4455
03:05:09,400 --> 03:05:12,700
into my HDFS: hadoop dfs -put a.txt.
4456
03:05:14,200 --> 03:05:16,300
So this will put it
in the default location
4457
03:05:16,300 --> 03:05:17,200
of HDFS.
4458
03:05:17,200 --> 03:05:19,700
Now if I want to read it,
I can use -cat again.
4459
03:05:19,700 --> 03:05:20,922
So again, I'm assuming
4460
03:05:20,922 --> 03:05:23,700
that you're aware of these
HDFS commands, so you
4461
03:05:23,700 --> 03:05:25,300
can see now this one two,
4462
03:05:25,300 --> 03:05:28,500
three, four, five is coming
from the Hadoop file system.
4463
03:05:28,500 --> 03:05:30,192
Now what I want to do,
4464
03:05:30,192 --> 03:05:36,400
I want to use this file
in my Spark system now.
4465
03:05:36,400 --> 03:05:39,200
How can I do that? So let's
come back here.
4466
03:05:39,200 --> 03:05:42,500
So in Scala —
4467
03:05:42,500 --> 03:05:46,000
we do not declare types like int or
float the way we do in Java,
4468
03:05:46,000 --> 03:05:48,700
where we write,
like this, right: integer
4469
03:05:48,700 --> 03:05:49,907
K is equal to 10
4470
03:05:49,907 --> 03:05:53,000
— like this is used
to define. But in Scala,
4471
03:05:53,000 --> 03:05:55,400
we do not use these data types.
4472
03:05:55,473 --> 03:05:58,626
In fact, what we do
is we call it var.
4473
03:05:58,700 --> 03:06:02,000
So if I use
var a is equal to 10,
4474
03:06:02,100 --> 03:06:04,700
it will automatically identify
4475
03:06:04,900 --> 03:06:08,100
that it is
an integer value. Notice,
4476
03:06:08,900 --> 03:06:13,303
it will tell me that
a is of integer type. Now,
4477
03:06:13,303 --> 03:06:16,072
if I want to update
this value to 20,
4478
03:06:16,072 --> 03:06:17,149
I can do that.
4479
03:06:17,400 --> 03:06:17,800
Now.
4480
03:06:17,900 --> 03:06:20,900
Let's say if I want to update
this to ABC like this.
4481
03:06:21,200 --> 03:06:23,700
This will throw an error. Why?
4482
03:06:23,900 --> 03:06:27,400
Because a is already
defined as an integer
4483
03:06:27,600 --> 03:06:31,300
and you're trying to assign
some ABC string value to it.
4484
03:06:31,300 --> 03:06:34,000
So that is the reason
you got this error.
4485
03:06:34,000 --> 03:06:34,900
Similarly.
4486
03:06:35,000 --> 03:06:38,000
There is one more thing
called val.
4487
03:06:38,300 --> 03:06:40,300
val b is equal to 10,
4488
03:06:40,300 --> 03:06:44,200
let's say. If I do it, it works
exactly similar to that,
4489
03:06:44,200 --> 03:06:47,500
but there is one difference
now. In this case,
4490
03:06:47,500 --> 03:06:51,600
if I do b is equal
to 20, you will see an error.
4491
03:06:51,800 --> 03:06:57,000
And why this error? Because when
you define something as val,
4492
03:06:57,200 --> 03:06:59,200
it is a constant.
4493
03:06:59,300 --> 03:07:02,400
It is not going
to be variable anymore.
4494
03:07:02,430 --> 03:07:04,046
It will be a constant
4495
03:07:04,046 --> 03:07:08,300
and that is the reason,
if you define something as val,
4496
03:07:08,300 --> 03:07:10,700
it will not be updatable.
4497
03:07:10,700 --> 03:07:14,400
You will not be able
to update that value.
4498
03:07:14,400 --> 03:07:19,400
So this is how in Scala you
will be doing your programming:
4499
03:07:19,700 --> 03:07:23,969
var for your variable part,
val for your constant part.
4500
03:07:23,969 --> 03:07:27,200
So you will be
doing it like this.
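(A minimal sketch of the var/val behaviour just described.)

    var a = 10       // type Int is inferred
    a = 20           // fine: var is mutable
    // a = "ABC"     // error: type mismatch, a is an Int

    val b = 10       // val is a constant
    // b = 20        // error: reassignment to val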
4501
03:07:27,200 --> 03:07:31,664
Now, let's use it for the example
we have learned.
4502
03:07:31,664 --> 03:07:34,971
Let's say if I want
to create an RDD.
4503
03:07:35,100 --> 03:07:40,100
So: val number is equal
to sc.textFile.
4504
03:07:40,100 --> 03:07:43,000
Remember this API? We
have learned this API
4505
03:07:43,000 --> 03:07:45,500
already: sc.textFile. Now
4506
03:07:45,500 --> 03:07:49,300
let me give this file: a.txt.
4507
03:07:49,500 --> 03:07:52,000
If I give this file a.txt,
4508
03:07:52,300 --> 03:07:55,900
it will be creating
an RDD from this file.
4509
03:07:55,900 --> 03:07:57,000
It is telling
4510
03:07:57,000 --> 03:08:00,800
that it created an RDD
of string type.
4511
03:08:01,100 --> 03:08:01,300
Now.
4512
03:08:01,300 --> 03:08:06,600
If I want to read this data,
I will call number.collect.
4513
03:08:06,800 --> 03:08:10,415
This will print the values
that were available.
4514
03:08:10,415 --> 03:08:14,261
Can you see, now this line
that you are seeing here
4515
03:08:14,300 --> 03:08:17,300
is coming from your memory.
4516
03:08:17,400 --> 03:08:19,382
This is coming from my RDD.
4517
03:08:19,382 --> 03:08:23,500
It is reading it, and that is
the reason it is showing up
4518
03:08:23,500 --> 03:08:25,800
in this particular manner.
4519
03:08:25,842 --> 03:08:29,457
So this is how you
will be performing your step.
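(A minimal sketch of the two shell statements just demonstrated.)

    val number = sc.textFile("a.txt")   // creates an RDD[String] from the file in HDFS
    number.collect()                    // action: brings the values back to the driver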
4520
03:08:29,484 --> 03:08:30,715
Now, the second thing:
4521
03:08:31,100 --> 03:08:36,000
I told you that Spark can work
on standalone systems as well,
4522
03:08:36,100 --> 03:08:36,400
Right?
4523
03:08:36,400 --> 03:08:38,400
So right now
what was happening was
4524
03:08:38,400 --> 03:08:42,000
that we have executed this part
on our HDFS. Now,
4525
03:08:42,000 --> 03:08:46,283
if I want to execute this
on our local file system,
4526
03:08:46,283 --> 03:08:47,338
Can I do that?
4527
03:08:47,338 --> 03:08:49,300
Yes, it can still do that.
4528
03:08:49,300 --> 03:08:51,300
What you need to do for that.
4529
03:08:51,300 --> 03:08:54,700
So is in that case
the difference will come here.
4530
03:08:54,700 --> 03:08:57,000
Now what the file you are giving
4531
03:08:57,000 --> 03:08:59,748
here would be instead
of giving like that.
4532
03:08:59,748 --> 03:09:03,100
You will be denoting
this file keyword before that.
4533
03:09:03,100 --> 03:09:06,300
And after that you need
to give you a local file.
4534
03:09:06,300 --> 03:09:09,200
For example, what is
this part slash home slash.
4535
03:09:09,200 --> 03:09:09,900
Advocacy.
4536
03:09:09,900 --> 03:09:12,400
This is a local park
not as deep as possible.
4537
03:09:12,400 --> 03:09:14,400
So you will be
writing / foam.
4538
03:09:14,400 --> 03:09:17,400
/schedule Erica a DOT PSD.
4539
03:09:17,500 --> 03:09:19,100
Now if you give this
4540
03:09:19,300 --> 03:09:22,700
this will be loading
the file into memory,
4541
03:09:23,000 --> 03:09:26,300
but not from your hdfs instead.
4542
03:09:26,300 --> 03:09:29,100
What does that is this loaded it
4543
03:09:29,100 --> 03:09:33,000
from your just loaded it
formula looks like this
4544
03:09:33,200 --> 03:09:34,921
so that is the defensive.
4545
03:09:34,921 --> 03:09:37,600
So as you can see
in the second case,
4546
03:09:37,600 --> 03:09:41,600
I am not even using my HDFS.
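(A hedged sketch of the same read from the local file system; the path mirrors the one used in the demo.)

    val localNumber = sc.textFile("file:///home/edureka/a.txt")  // file:// points at the local disk
    localNumber.collect()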
4547
03:09:41,700 --> 03:09:43,000
Which means what now?
4548
03:09:43,000 --> 03:09:46,000
Can you tell me why this
error? This is interesting.
4549
03:09:46,000 --> 03:09:49,300
Why this error — input path
does not exist?
4550
03:09:49,300 --> 03:09:51,600
Because I have given
a typo here.
4551
03:09:51,600 --> 03:09:52,400
Okay.
4552
03:09:52,400 --> 03:09:53,595
Now if you notice
4553
03:09:53,595 --> 03:09:58,555
why did I not get this error here?
Why did I not get this error earlier?
4554
03:09:58,555 --> 03:10:00,200
This file does not exist,
4555
03:10:00,200 --> 03:10:02,500
but still I did not get
4556
03:10:02,500 --> 03:10:07,300
any error, because of
lazy evaluation. Lazy
4557
03:10:07,300 --> 03:10:11,500
evaluation kind
of made sure that even
4558
03:10:11,500 --> 03:10:14,400
if you have given
the wrong path in creating
4559
03:10:14,400 --> 03:10:18,200
an RDD, it
has not executed anything.
4560
03:10:18,400 --> 03:10:19,900
So all the output
4561
03:10:19,900 --> 03:10:22,800
or the error you are
able to receive only
4562
03:10:22,800 --> 03:10:25,600
when you hit that action
of collect. Now,
4563
03:10:25,600 --> 03:10:27,997
in order to correct this value,
4564
03:10:27,997 --> 03:10:32,890
I need to correct this RDD,
and this time if I execute it,
4565
03:10:32,975 --> 03:10:33,975
it will work.
4566
03:10:34,050 --> 03:10:37,050
Okay, you can see
this output: 1 2 3 4 5.
4567
03:10:37,100 --> 03:10:40,500
So this time it works.
So now we should be
4568
03:10:40,500 --> 03:10:44,200
more clear about lazy
evaluation: even
4569
03:10:44,200 --> 03:10:46,375
if you are giving
the wrong file name
4570
03:10:46,375 --> 03:10:47,628
it doesn't matter until an action runs.
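(A minimal sketch of the lazy-evaluation behaviour just shown; the wrong file name is deliberate.)

    val bad = sc.textFile("a_typo.txt")  // no error yet: textFile is evaluated lazily
    // bad.collect()                     // the "input path does not exist" error surfaces only here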
4571
03:10:47,628 --> 03:10:49,804
Suppose I want to use Spark
in a production unit,
4572
03:10:49,804 --> 03:10:51,155
but not on top of Hadoop.
4573
03:10:51,155 --> 03:10:52,007
Is it possible?
4574
03:10:52,007 --> 03:10:53,200
Yes, you can do that.
4575
03:10:53,200 --> 03:10:54,500
You can do that Sonny,
4576
03:10:54,500 --> 03:10:56,900
but usually that's
not what you do.
4577
03:10:56,900 --> 03:10:58,958
But yes, if you
want to can do that,
4578
03:10:58,958 --> 03:11:00,299
there are a lot of things
4579
03:11:00,299 --> 03:11:02,239
which you can view
can also deploy it
4580
03:11:02,239 --> 03:11:05,611
on your Amazon clusters as that
lot of things you can do that.
4581
03:11:05,611 --> 03:11:07,900
How will it provided
distribute in that case?
4582
03:11:07,900 --> 03:11:10,186
We'll be using
some other distribution system.
4583
03:11:10,186 --> 03:11:12,425
So in that case you
are not using HDFS;
4584
03:11:12,425 --> 03:11:14,300
you can deploy it, it
will just be that
4585
03:11:14,300 --> 03:11:16,400
you will not be able
to kind of go across
4586
03:11:16,400 --> 03:11:17,698
and distribute in that manner.
4587
03:11:17,698 --> 03:11:19,849
You will not be able to leverage
that redundancy,
4588
03:11:19,849 --> 03:11:22,500
but you can use something
on Amazon for that.
4589
03:11:22,500 --> 03:11:23,700
Okay, so that is
4590
03:11:23,700 --> 03:11:28,089
how you will be using this now
you're going to get so this is
4591
03:11:28,089 --> 03:11:31,600
how you will be performing
your practicals,
4592
03:11:31,600 --> 03:11:33,643
how you will be working
on Spark.
4593
03:11:33,643 --> 03:11:35,800
I will be training you,
as I told you.
4594
03:11:35,800 --> 03:11:37,500
So this is how things work.
4595
03:11:37,700 --> 03:11:41,600
Now, let us see
an interesting use case.
4596
03:11:41,800 --> 03:11:43,900
So for that let us go back.
4597
03:11:43,900 --> 03:11:47,900
back to our presentation. This
is going to be very interesting.
4598
03:11:48,161 --> 03:11:50,238
So let's see this use case.
4599
03:11:50,600 --> 03:11:51,600
Look at this.
4600
03:11:51,900 --> 03:11:53,500
This is very interesting.
4601
03:11:53,500 --> 03:11:57,600
Now this use case is for
earthquake detection using Spark.
4602
03:11:57,600 --> 03:12:00,200
So in Japan you
might have already seen
4603
03:12:00,200 --> 03:12:02,450
that there are so many
earthquakes coming; you
4604
03:12:02,450 --> 03:12:03,800
might have thought about it.
4605
03:12:03,800 --> 03:12:05,591
I definitely you
might have not seen it
4606
03:12:05,591 --> 03:12:07,100
but you must have heard about it
4607
03:12:07,100 --> 03:12:09,200
that there are
so many earthquakes
4608
03:12:09,200 --> 03:12:13,700
which happen in Japan. Now,
how do we solve that problem with
4609
03:12:13,700 --> 03:12:16,111
Spark? I'm just going
to give you a glimpse
4610
03:12:16,111 --> 03:12:17,400
of what kind of problems
4611
03:12:17,400 --> 03:12:18,563
we solve in the sessions;
4612
03:12:18,563 --> 03:12:21,600
definitely we are not going to
walk through in detail of this
4613
03:12:21,600 --> 03:12:24,500
but you will get an idea
of how Spark is used here.
4614
03:12:24,500 --> 03:12:27,300
Okay, just to give you
a little bit of brief here.
4615
03:12:27,300 --> 03:12:30,500
But all these projects
you will learn at the time
4616
03:12:30,500 --> 03:12:31,900
of sessions now.
4617
03:12:32,000 --> 03:12:35,300
So let's see this part,
how we will be using Spark here.
4618
03:12:35,300 --> 03:12:38,500
So everybody must be knowing
what an earthquake is.
4619
03:12:38,500 --> 03:12:39,800
So an earthquake is
4620
03:12:40,200 --> 03:12:44,028
like a shaking of the surface
of the Earth,
4621
03:12:44,028 --> 03:12:46,900
owing to all those events
that happen in the tectonic plates.
4622
03:12:46,900 --> 03:12:48,050
If you're from India,
4623
03:12:48,050 --> 03:12:51,400
you might have seen recently
there was an earthquake incident
4624
03:12:51,400 --> 03:12:54,600
which came from Nepal
by even recently two days back.
4625
03:12:54,600 --> 03:12:56,900
Also there was upset incident.
4626
03:12:57,053 --> 03:12:59,900
So these earthquakes keep
on coming. Now,
4627
03:12:59,900 --> 03:13:02,300
very important part is let's say
4628
03:13:02,300 --> 03:13:06,100
if the earthquake is
a major earthquake,
4629
03:13:06,100 --> 03:13:08,992
or maybe tsunami
maybe forest fires,
4630
03:13:08,992 --> 03:13:10,600
maybe a volcano now,
4631
03:13:10,600 --> 03:13:14,000
it's very important
for them to kind of assess
4632
03:13:15,100 --> 03:13:19,600
that the earthquake is going to come;
they should be able to kind
4633
03:13:19,600 --> 03:13:21,600
of predict it beforehand.
4634
03:13:21,600 --> 03:13:23,776
It should not happen
that at the last moment
4635
03:13:23,776 --> 03:13:25,254
they get to know that, okay,
4636
03:13:25,254 --> 03:13:27,862
the earthquake is here,
after it has already struck. No,
4637
03:13:27,862 --> 03:13:29,700
it should not happen like that.
4638
03:13:29,700 --> 03:13:34,000
They should be able to estimate
all these things beforehand.
4639
03:13:34,000 --> 03:13:36,600
They should be able
to predict beforehand.
4640
03:13:36,688 --> 03:13:40,611
So this is the system
which Japan is using already.
4641
03:13:40,700 --> 03:13:44,300
So this is a real-time kind of
use case what I am presenting.
4642
03:13:44,300 --> 03:13:47,300
So Japan is already
using this Spark thing
4643
03:13:47,300 --> 03:13:49,770
in order to solve
this earthquake problem.
4644
03:13:49,770 --> 03:13:52,482
We are going to see
that how they're using it.
4645
03:13:52,482 --> 03:13:52,866
Okay.
4646
03:13:52,900 --> 03:13:56,900
Now let's see what happens
in the Japan earthquake model.
4647
03:13:57,000 --> 03:14:00,000
So whenever there is
an earthquake coming
4648
03:14:00,000 --> 03:14:02,000
for example at 2:46 p.m.
4649
03:14:02,000 --> 03:14:04,800
on March 11, 2011, the
4650
03:14:04,800 --> 03:14:08,300
Japan earthquake early
warning system detected it.
4651
03:14:08,600 --> 03:14:12,800
Now the thing was as
soon as it detected immediately,
4652
03:14:12,800 --> 03:14:16,999
they started sending
notifications to the schools, to the lifts,
4653
03:14:17,000 --> 03:14:20,700
to the factories every station
through TV stations.
4654
03:14:20,700 --> 03:14:23,300
They immediately kind
of told everyone
4655
03:14:23,300 --> 03:14:26,315
so that all the students
who were there in school
4656
03:14:26,315 --> 03:14:29,800
got the time to go
under the desk. Bullet trains,
4657
03:14:29,800 --> 03:14:30,900
which were running.
4658
03:14:30,900 --> 03:14:31,571
They stop.
4659
03:14:31,571 --> 03:14:35,200
Otherwise, once the ground
starts shaking, and
4660
03:14:35,200 --> 03:14:38,200
the bullet trains are already
running at the very high speed.
4661
03:14:38,200 --> 03:14:39,432
They want to ensure
4662
03:14:39,432 --> 03:14:43,000
that there should be no sort
of casualty because of that
4663
03:14:43,000 --> 03:14:46,600
so all the bullet trains stopped;
all the elevators, the lifts
4664
03:14:46,600 --> 03:14:47,825
which were running.
4665
03:14:47,825 --> 03:14:50,600
were stopped, otherwise
some incident could happen.
4666
03:14:50,700 --> 03:14:53,930
In 60 seconds, just 60 seconds
03:14:53,930 --> 03:14:55,700
before this number they
4668
03:14:55,700 --> 03:14:59,100
were able to inform
almost everyone.
4669
03:14:59,300 --> 03:15:01,212
They have send the message.
4670
03:15:01,212 --> 03:15:02,698
They have a broadcast
4671
03:15:02,698 --> 03:15:05,949
on TV all those things
they have done immediately
4672
03:15:05,949 --> 03:15:07,100
to all the people
4673
03:15:07,100 --> 03:15:09,856
so that they can send
at least this message
4674
03:15:09,856 --> 03:15:11,300
whoever can receive it
4675
03:15:11,300 --> 03:15:13,600
and that has saved millions
4676
03:15:13,600 --> 03:15:17,300
of lives. That is how powerful a thing
they were able to achieve,
4677
03:15:17,300 --> 03:15:22,100
and they have done all this
with the help of Apache Spark.
4678
03:15:22,192 --> 03:15:24,500
That is the most important part:
4679
03:15:24,500 --> 03:15:27,900
notice that
everything
4680
03:15:27,900 --> 03:15:29,800
they are doing there,
4681
03:15:29,800 --> 03:15:33,600
they are doing it
in a real-time system, right?
4682
03:15:33,700 --> 03:15:35,690
They cannot just
collect the data
4683
03:15:35,690 --> 03:15:39,100
and then process it later;
they did everything as
4684
03:15:39,100 --> 03:15:40,300
a real-time system.
4685
03:15:40,300 --> 03:15:43,300
So they collected the data
immediately process it
4686
03:15:43,300 --> 03:15:45,004
and as soon as they detected
4687
03:15:45,004 --> 03:15:47,484
the earthquake, they
immediately informed the people.
4688
03:15:47,484 --> 03:15:49,381
in fact this happened in 2011.
4689
03:15:49,381 --> 03:15:52,100
Now they have started
using it very frequently,
4690
03:15:52,100 --> 03:15:54,318
because Japan is one of the area
4691
03:15:54,318 --> 03:15:58,200
which is very frequently
of kind of affected by all this.
4692
03:15:58,200 --> 03:15:58,900
So as I said,
4693
03:15:58,900 --> 03:16:01,548
the main thing is we should be
able to process the data
4694
03:16:01,548 --> 03:16:02,449
and we are finding
4695
03:16:02,449 --> 03:16:04,900
that the bigger thing you
should be able to handle
4696
03:16:04,900 --> 03:16:06,400
the data from multiple sources
4697
03:16:06,400 --> 03:16:07,789
because data may be coming
4698
03:16:07,789 --> 03:16:10,882
from multiple sources may be
different different sources.
4699
03:16:10,882 --> 03:16:13,600
They might be suggesting some
of the other events.
4700
03:16:13,600 --> 03:16:16,305
based on which we
are predicting that, okay,
4701
03:16:16,305 --> 03:16:17,770
this earthquake can happen.
4702
03:16:17,770 --> 03:16:19,729
It should be very
easy to use because
4703
03:16:19,729 --> 03:16:22,500
if it is very complicated
then in that case
4704
03:16:22,500 --> 03:16:23,500
for a user to use it
4705
03:16:23,500 --> 03:16:25,549
it will become very difficult
to use, and
4706
03:16:25,549 --> 03:16:27,600
You will not be able
to solve the problem.
4707
03:16:27,700 --> 03:16:29,200
Now even in the end
4708
03:16:29,200 --> 03:16:32,100
how to send the alert
message is important.
4709
03:16:32,100 --> 03:16:32,900
Okay.
4710
03:16:32,900 --> 03:16:36,000
So all those things
are taken care by your spark.
4711
03:16:36,000 --> 03:16:39,923
Now there are two kinds
of waves in your earthquake.
4712
03:16:40,100 --> 03:16:42,633
The number one wave
is the primary wave,
4713
03:16:42,633 --> 03:16:43,900
and the second is the secondary wave.
4714
03:16:43,900 --> 03:16:44,864
That is, the P wave and the S wave.
4715
03:16:44,864 --> 03:16:46,600
There are two kinds of wave
4716
03:16:46,600 --> 03:16:49,100
in an earthquake.
The primary wave is like
4717
03:16:49,100 --> 03:16:52,261
when the earthquake is
just about to start; it starts
4718
03:16:52,261 --> 03:16:53,400
at the epicenter,
4719
03:16:53,400 --> 03:16:55,200
and it signals that the earthquake
4720
03:16:55,200 --> 03:16:59,100
is going to start. The secondary wave
is more severe,
4721
03:16:59,100 --> 03:17:01,400
and it comes after the primary wave.
4722
03:17:01,400 --> 03:17:03,912
Now what happens
in the secondary wave is,
4723
03:17:03,912 --> 03:17:06,900
when it's that start it
can do maximum damage
4724
03:17:06,900 --> 03:17:09,605
because the primary wave, you
can say, is the initial wave,
4725
03:17:09,605 --> 03:17:11,900
but the secondary wave
will be on top of that.
4726
03:17:11,900 --> 03:17:14,800
so they will be some details
with respect to I 'm not going
4727
03:17:14,800 --> 03:17:15,800
in detail of that.
4728
03:17:15,800 --> 03:17:17,600
But yeah, there
will be some details
4729
03:17:17,600 --> 03:17:18,700
with respect to that.
4730
03:17:18,700 --> 03:17:21,700
Now what we are going
to do using Spark is,
4731
03:17:21,700 --> 03:17:23,907
we will be computing our ROC.
4732
03:17:23,907 --> 03:17:26,799
So let's go and see
that in our machine
4733
03:17:26,799 --> 03:17:30,600
how we will be
calculating our ROC, using
4734
03:17:30,600 --> 03:17:33,600
which we will be solving
this problem later
4735
03:17:33,600 --> 03:17:36,524
and we will be calculating
this Roc with the help
4736
03:17:36,524 --> 03:17:37,500
of Apache Spark.
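(Just to illustrate how such an area under ROC can be computed with Spark MLlib; the score/label pairs below are made up and this is not the exact code of the project shown here.)

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
// RDD of (prediction score, actual label)
val scoreAndLabels = sc.parallelize(Seq((0.9, 1.0), (0.7, 1.0), (0.4, 0.0), (0.2, 0.0)))
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
println("Area under ROC = " + metrics.areaUnderROC())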
4737
03:17:37,500 --> 03:17:39,729
Let us again come back
to this machine now
4738
03:17:39,729 --> 03:17:41,369
in order to work on that,
4739
03:17:41,369 --> 03:17:43,600
Let's first exit
from this console.
4740
03:17:43,800 --> 03:17:48,300
Once you exit from this console
now what you're going to do.
4741
03:17:48,300 --> 03:17:51,900
I have already created
this project and kept it here,
4742
03:17:51,900 --> 03:17:55,563
because we just want to give
you an overview of this.
4743
03:17:55,563 --> 03:17:57,900
Let me go to
my downloads section.
4744
03:17:57,900 --> 03:18:01,400
There is a project called
Earth2, so this is
4745
03:18:01,400 --> 03:18:03,400
your project initially
4746
03:18:03,500 --> 03:18:06,400
what all things you
will be having you
4747
03:18:06,400 --> 03:18:08,839
will not be having all
the things initial part.
4748
03:18:08,839 --> 03:18:09,900
So what will happen.
4749
03:18:09,900 --> 03:18:12,990
So let's say if I go
to my downloads from here,
4750
03:18:12,990 --> 03:18:14,200
I have this Earth2
4751
03:18:14,200 --> 03:18:16,800
project. Okay.
4752
03:18:16,800 --> 03:18:19,000
Now initially I
will not be having
4753
03:18:19,000 --> 03:18:22,300
this target directory project
directory bin directory.
4754
03:18:22,300 --> 03:18:25,400
We will be using
our SBT framework.
4755
03:18:25,400 --> 03:18:28,900
If you do not know SBT, this
is the Scala build tool
4756
03:18:28,900 --> 03:18:32,400
which takes care of all
your dependencies takes care
4757
03:18:32,400 --> 03:18:36,700
of all your dependencies and so on;
it is very similar to Maven,
4758
03:18:36,700 --> 03:18:40,577
if you already know Maven,
this is very similar,
4759
03:18:40,577 --> 03:18:42,900
but at the same time
I prefer SBT
4760
03:18:42,900 --> 03:18:46,100
because SBT is
easier to write in comparison,
4761
03:18:46,100 --> 03:18:47,700
even if you have never used it.
4762
03:18:47,700 --> 03:18:50,700
So you will be writing
this build.sbt file.
4763
03:18:50,700 --> 03:18:55,800
So finally you will have this
build.sbt. Now in this file,
4764
03:18:55,800 --> 03:18:57,255
you will be giving the name
4765
03:18:57,255 --> 03:18:59,700
of your project, the
version of it,
4766
03:18:59,700 --> 03:19:02,800
the version of Scala
which you are using,
4767
03:19:02,800 --> 03:19:05,385
What are the dependencies
you have with
4768
03:19:05,385 --> 03:19:09,400
what versions. For example,
for Spark we are using
4769
03:19:09,400 --> 03:19:11,194
the 1.5.2 version of Spark.
4770
03:19:11,200 --> 03:19:15,100
So you are telling
that whatever in my program,
4771
03:19:15,150 --> 03:19:16,150
I am writing.
4772
03:19:16,200 --> 03:19:22,100
So if I require anything related
to spark-core, go and get it
4773
03:19:22,100 --> 03:19:27,400
from this repository, org.apache.spark;
download it and install it.
4774
03:19:27,800 --> 03:19:29,900
If I require any dependency
4775
03:19:29,900 --> 03:19:34,700
for spark streaming program for
this particular version 1.5.2.
4776
03:19:35,000 --> 03:19:37,700
Go to this website or this link
4777
03:19:37,700 --> 03:19:41,200
and get it; a similar thing
for the other Spark libraries.
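(A rough sketch of what such a build.sbt could look like; the project name, the Scala version and the exact list of dependencies are assumptions, only the Spark version 1.5.2 is taken from above.)

name := "Earth2"
version := "1.0"
scalaVersion := "2.10.6"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "1.5.2",
  "org.apache.spark" %% "spark-streaming" % "1.5.2",
  "org.apache.spark" %% "spark-mllib"     % "1.5.2"
)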
4778
03:19:41,200 --> 03:19:43,353
So you are just telling it that. Now,
4779
03:19:43,400 --> 03:19:47,200
once you have done this you will
be creating a Folder structure.
4780
03:19:47,200 --> 03:19:49,200
Your folder structure
would be you need
4781
03:19:49,200 --> 03:19:50,722
to create a src folder.
4782
03:19:50,722 --> 03:19:51,393
After that.
4783
03:19:51,393 --> 03:19:54,612
you will be creating
a main folder; from the main folder,
4784
03:19:54,612 --> 03:19:57,200
You will be creating
again a folder called
4785
03:19:57,200 --> 03:19:58,800
scala. Now inside
4786
03:19:58,800 --> 03:20:01,100
that you will be
keeping your program.
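(So the layout being described is roughly the following; the file name ROC.scala is only an assumed example.)

Earth2/
  build.sbt
  src/
    main/
      scala/
        ROC.scala   <- your Spark program with the main function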
4787
03:20:01,100 --> 03:20:03,300
So now here you will
be writing a program.
4788
03:20:03,300 --> 03:20:04,500
So you are writing your programs here.
4789
03:20:04,500 --> 03:20:07,499
You can see these Scala files,
for example a streaming program and
4790
03:20:07,499 --> 03:20:08,500
this ROC Scala file.
4791
03:20:08,500 --> 03:20:10,623
So let's keep it as
a black box for now.
4792
03:20:10,623 --> 03:20:12,730
So you will be writing
the code to achieve
4793
03:20:12,730 --> 03:20:14,083
this problem statement.
4794
03:20:14,083 --> 03:20:15,500
Now what we are going to do
4795
03:20:15,500 --> 03:20:20,200
is, come out of this, into
the main project folder,
4796
03:20:20,400 --> 03:20:21,500
and from here?
4797
03:20:21,700 --> 03:20:24,400
we will be running sbt package.
4798
03:20:24,500 --> 03:20:26,400
It will start downloading
4799
03:20:26,400 --> 03:20:29,700
with respect to your build.sbt;
it will check your program.
4800
03:20:29,700 --> 03:20:31,900
Whatever dependency you require
4801
03:20:31,900 --> 03:20:35,750
for spark-core, spark-streaming
or spark-mllib, whatever it is.
4802
03:20:35,750 --> 03:20:36,895
It will download
4803
03:20:36,895 --> 03:20:39,400
and install it it
will just download
4804
03:20:39,400 --> 03:20:42,200
and install it so we
are not going to execute it
4805
03:20:42,200 --> 03:20:43,900
because I've already
done it before
4806
03:20:43,900 --> 03:20:45,300
and it also takes some time.
4807
03:20:45,300 --> 03:20:48,453
So that's the reason
I'm not doing it now.
4808
03:20:48,500 --> 03:20:50,689
Once you have run this sbt package,
4809
03:20:50,700 --> 03:20:53,788
you will find all
these directories: the target directory
4810
03:20:53,788 --> 03:20:55,400
and the project directory.
4811
03:20:55,400 --> 03:20:58,100
These got created
later on. Now,
4812
03:20:58,100 --> 03:20:59,600
what is going to happen.
4813
03:20:59,600 --> 03:21:03,400
Once you have created this
you will go to your Eclipse.
4814
03:21:03,400 --> 03:21:04,900
So your Eclipse will open.
4815
03:21:04,900 --> 03:21:06,600
So let me open my Eclipse.
4816
03:21:06,900 --> 03:21:08,995
So this is how your
Eclipse looks.
4817
03:21:08,995 --> 03:21:09,200
Now.
4818
03:21:09,200 --> 03:21:11,300
I already have this program
in front of me,
4819
03:21:11,300 --> 03:21:14,900
but let me tell you how you
will be bringing this program.
4820
03:21:14,900 --> 03:21:17,800
You will be going
to your import option
4821
03:21:17,800 --> 03:21:18,934
Within the import, you
4822
03:21:18,934 --> 03:21:22,400
will be selecting Existing
Projects into Workspace.
4823
03:21:22,400 --> 03:21:23,700
Next once you do
4824
03:21:23,700 --> 03:21:26,400
that you need to select
your main project.
4825
03:21:26,400 --> 03:21:29,000
For example, you need
to select this Earth2 project
4826
03:21:29,000 --> 03:21:31,900
what you have created
and click on OK
4827
03:21:31,900 --> 03:21:32,709
once you do
4828
03:21:32,709 --> 03:21:35,872
that, there will be
a project directory coming;
4829
03:21:35,872 --> 03:21:38,300
this Earth2
project will come here.
4830
03:21:38,300 --> 03:21:41,700
Now what we need to do is go
to your src/main,
4831
03:21:41,700 --> 03:21:43,628
and ignore all these other programs.
4832
03:21:43,628 --> 03:21:46,400
I require only this ROC Scala file,
because this is
4833
03:21:46,400 --> 03:21:48,500
where I've written
my main function.
4834
03:21:48,500 --> 03:21:50,260
Now after that,
4835
03:21:50,260 --> 03:21:52,900
once you reach
to this you need to go
4836
03:21:52,900 --> 03:21:55,900
to Run As, Scala Application,
4837
03:21:56,100 --> 03:21:59,600
and your Spark code
will start to execute. Now,
4838
03:21:59,600 --> 03:22:01,800
this will return me an ROC.
4839
03:22:02,000 --> 03:22:02,314
Okay.
4840
03:22:02,314 --> 03:22:03,700
Let's see this output.
4841
03:22:06,600 --> 03:22:08,200
Now if I see this,
4842
03:22:08,200 --> 03:22:11,800
this will show me
once it's finished executing.
4843
03:22:22,900 --> 03:22:26,300
See, this is our area
under the ROC;
4844
03:22:26,300 --> 03:22:29,107
so this is all computed
with the help of our Spark program.
4845
03:22:29,107 --> 03:22:29,695
Similarly.
4846
03:22:29,695 --> 03:22:32,100
There are other programs
also that will help you
4847
03:22:32,100 --> 03:22:33,400
to process the data and so on.
4848
03:22:33,509 --> 03:22:35,010
I'm not walking over all that.
4849
03:22:35,160 --> 03:22:39,000
Now, let's come back
to my presentation and see
4850
03:22:39,000 --> 03:22:40,900
that what is the next step
4851
03:22:40,900 --> 03:22:44,500
what we will be doing. So you
can see this ROC will be
4852
03:22:44,500 --> 03:22:48,200
getting created; now
I'm keeping my ROC here.
4853
03:22:48,200 --> 03:22:53,100
Now after you have created
your ROC, you will be creating your graph.
4854
03:22:53,100 --> 03:22:56,200
now in Japan there is
one important thing.
4855
03:22:56,200 --> 03:22:59,771
Japan is already
an earthquake-affected area.
4856
03:22:59,771 --> 03:23:01,714
And now the trouble here is
4857
03:23:01,714 --> 03:23:05,600
that it's not that even
for a minor earthquake
4858
03:23:05,600 --> 03:23:07,852
I should start sending
the alert right?
4859
03:23:07,852 --> 03:23:11,300
I don't want to do all that
for the minor, minor tremors.
4860
03:23:11,300 --> 03:23:14,100
In fact, the buildings
and the infrastructure.
4861
03:23:14,100 --> 03:23:17,300
which are created there
are built in such a way that
4862
03:23:17,300 --> 03:23:18,600
if any earthquake
4863
03:23:18,600 --> 03:23:21,700
below magnitude six
comes there,
4864
03:23:22,000 --> 03:23:25,713
the homes are designed in a way
that there will be no damage.
4865
03:23:25,713 --> 03:23:27,400
There will be no damage to them.
4866
03:23:27,400 --> 03:23:29,400
So this is the major thing
4867
03:23:29,400 --> 03:23:33,300
when you work with the Japan
earthquake problem. Now in Japan,
4868
03:23:33,300 --> 03:23:36,000
that means below six
they are not even worried,
4869
03:23:36,000 --> 03:23:37,300
but above six they
4870
03:23:37,300 --> 03:23:40,668
are worried. Now for that there
will be a graph simulation;
4871
03:23:40,668 --> 03:23:43,600
what you can do, you can do it
with Spark as well.
4872
03:23:43,600 --> 03:23:47,800
Once you generate this graph you
will be seeing that anything
4873
03:23:47,800 --> 03:23:49,449
which is going above 6
4874
03:23:49,449 --> 03:23:52,000
if anything which
is going above 6,
4875
03:23:52,000 --> 03:23:55,400
should immediately start
the alert. Now ignore all
4876
03:23:55,400 --> 03:23:56,700
this programming side
4877
03:23:56,700 --> 03:23:59,800
because that is what we
have just created and showing
4878
03:23:59,800 --> 03:24:01,411
you this execution part. Now,
4879
03:24:01,411 --> 03:24:03,800
if you have to visualize
the same result,
4880
03:24:03,800 --> 03:24:05,200
this is what is happening.
4881
03:24:05,200 --> 03:24:07,300
This is showing my ROC, but
4882
03:24:07,300 --> 03:24:11,800
if my earthquake is going
to be greater than 6, then only
4883
03:24:11,800 --> 03:24:16,415
raise the alert; only then
send the alert to all the people.
4884
03:24:16,415 --> 03:24:18,400
Otherwise, keep calm.
4885
03:24:18,600 --> 03:24:22,000
That is the project
that we generally show
4886
03:24:22,000 --> 03:24:25,563
in our Spark sessions. Now,
it is not the only project;
4887
03:24:25,563 --> 03:24:28,900
we also kind of create
multiple other projects as well.
4888
03:24:28,900 --> 03:24:31,600
For example, I kind
of create a model just
4889
03:24:31,600 --> 03:24:33,204
like how Walmart does it,
4890
03:24:33,204 --> 03:24:35,100
how Walmart may be tracking
4891
03:24:35,100 --> 03:24:38,241
whatever sales are happening;
with respect to that,
4892
03:24:38,241 --> 03:24:39,743
They're using Apache spark
4893
03:24:39,743 --> 03:24:43,000
and at the end they are kind of
making you visualize the output
4894
03:24:43,000 --> 03:24:45,400
of doing whatever
analytics they're doing.
4895
03:24:45,400 --> 03:24:46,900
So all that is done using Spark.
4896
03:24:46,900 --> 03:24:48,900
So all those things
we walking through
4897
03:24:48,900 --> 03:24:52,252
when we do the full sessions; all
those things you will learn there.
4898
03:24:52,252 --> 03:24:55,100
I feel that for all these projects
I am showing right now,
4899
03:24:55,100 --> 03:24:56,700
since you do not know the topic
4900
03:24:56,700 --> 03:24:59,400
you are not able to get
hundred percent of the project.
4901
03:24:59,400 --> 03:25:00,434
But at that time
4902
03:25:00,434 --> 03:25:03,366
once you know each
and every topic in detail,
4903
03:25:03,366 --> 03:25:07,100
you will have a clearer picture
of how spark is handling.
4904
03:25:07,100 --> 03:25:15,000
all these use cases. Graphs
are very attractive
4905
03:25:15,000 --> 03:25:17,900
when it comes to modeling
real world data
4906
03:25:17,900 --> 03:25:19,900
because they are
intuitive flexible
4907
03:25:19,900 --> 03:25:23,100
and the theory supporting
them has Been maturing
4908
03:25:23,100 --> 03:25:25,209
for centuries. Welcome everyone
4909
03:25:25,209 --> 03:25:27,600
to today's session
on Spark GraphX.
4910
03:25:27,700 --> 03:25:30,700
So without any further delay,
let's look at the agenda first.
4911
03:25:31,500 --> 03:25:34,561
We start by understanding
the basics of graph theory
4912
03:25:34,561 --> 03:25:36,229
and different types of graphs.
4913
03:25:36,229 --> 03:25:38,806
Then we'll look
at the features of GraphX;
4914
03:25:38,806 --> 03:25:40,170
further will understand
4915
03:25:40,170 --> 03:25:43,820
what a property graph is and look
at various graph operations.
4916
03:25:43,820 --> 03:25:44,594
Moving ahead.
4917
03:25:44,594 --> 03:25:48,258
We'll look at different graph
processing algorithms at last.
4918
03:25:48,258 --> 03:25:49,500
We'll look at a demo
4919
03:25:49,500 --> 03:25:52,400
where we will try
to analyze Ford GoBike
4920
03:25:52,400 --> 03:25:54,700
data using pagerank algorithm.
4921
03:25:54,700 --> 03:25:56,800
Let's move to the first topic.
4922
03:25:57,200 --> 03:25:59,845
So we'll start
with basics of graph.
4923
03:25:59,845 --> 03:26:03,661
So graphs are basically
made up of two sets, called
4924
03:26:03,661 --> 03:26:05,089
vertices and edges.
4925
03:26:05,089 --> 03:26:08,704
The vertices are drawn
from some underlying type
4926
03:26:08,704 --> 03:26:11,550
and the set can be
finite or infinite.
4927
03:26:11,550 --> 03:26:12,900
Now each element
4928
03:26:12,900 --> 03:26:17,035
of the edge set is a pair
consisting of two elements
4929
03:26:17,035 --> 03:26:18,728
from the vertices set.
4930
03:26:18,900 --> 03:26:21,400
So your vertex is V1.
4931
03:26:21,403 --> 03:26:23,173
Then your vertex is V3.
4932
03:26:23,173 --> 03:26:25,480
Then your vertex is V2 and V4.
4933
03:26:25,700 --> 03:26:29,300
And your edges are
(V1, V3), then next
4934
03:26:29,300 --> 03:26:33,500
is (V1, V2), then
you have (V2, V3),
4935
03:26:33,500 --> 03:26:34,961
and then you have
4936
03:26:34,961 --> 03:26:38,807
(V2, V4). So basically
we represent the vertices set
4937
03:26:38,807 --> 03:26:43,000
as all the names of the vertices
enclosed in curly braces.
4938
03:26:43,100 --> 03:26:45,561
So we have V 1 we have V 2
4939
03:26:45,561 --> 03:26:48,176
we have V3, and then
we have V4,
4940
03:26:48,300 --> 03:26:53,073
and we'll close the curly braces
and to represent the edge set.
4941
03:26:53,073 --> 03:26:56,600
We use curly braces again
and then in curly braces,
4942
03:26:56,600 --> 03:27:00,907
we specify those two vertex
which are joined by the edge.
4943
03:27:01,000 --> 03:27:02,600
So for this Edge,
4944
03:27:02,600 --> 03:27:07,700
we will use (V1, V3),
and then for this edge
4945
03:27:07,700 --> 03:27:12,700
we'll use (V1, V2),
and then for this edge again,
4946
03:27:12,700 --> 03:27:15,000
we'll use V 2 comma V 4.
4947
03:27:16,088 --> 03:27:19,011
And then at last
for this Edge will use
4948
03:27:19,300 --> 03:27:23,700
(V2, V3), and at last I
will close the curly braces.
4949
03:27:24,100 --> 03:27:26,400
So this is your vertices set.
4950
03:27:26,500 --> 03:27:28,900
And this is your edge set.
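(So, written out, the two sets for this example are:)

V = {V1, V2, V3, V4}
E = {(V1, V3), (V1, V2), (V2, V4), (V2, V3)}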
4951
03:27:29,400 --> 03:27:31,958
Now one, very
important thing that is
4952
03:27:31,958 --> 03:27:35,476
if the edge set contains
(u, v), or you can say
4953
03:27:35,476 --> 03:27:38,700
that our edge set
contains (V1, V3),
4954
03:27:38,700 --> 03:27:42,000
So V1 is basically
a adjacent to V 3.
4955
03:27:42,200 --> 03:27:45,100
Similarly your V
1 is adjacent to V 2.
4956
03:27:45,200 --> 03:27:48,427
then V2 is adjacent
to V4, and looking at this edge,
4957
03:27:48,427 --> 03:27:50,900
as you can say V2
is adjacent to V 3.
4958
03:27:50,900 --> 03:27:53,686
Now, let's quickly move
ahead and we'll look
4959
03:27:53,686 --> 03:27:55,500
at different types of graphs.
4960
03:27:55,500 --> 03:27:58,300
So first we have
undirected graphs.
4961
03:27:58,500 --> 03:28:00,936
So basically in
an undirected graph,
4962
03:28:00,936 --> 03:28:04,000
we use straight lines
to represent the edges.
4963
03:28:04,000 --> 03:28:08,350
Now the order of the vertices
in the edge set does not matter
4964
03:28:08,350 --> 03:28:09,800
in undirected graph.
4965
03:28:09,800 --> 03:28:14,040
So the undirected graph usually
are drawn using straight lines
4966
03:28:14,040 --> 03:28:15,500
between the vertices.
4967
03:28:15,500 --> 03:28:18,300
Now it is almost
similar to the graph
4968
03:28:18,300 --> 03:28:20,763
which we have seen
in the last slide.
4969
03:28:20,763 --> 03:28:21,563
Similarly.
4970
03:28:21,563 --> 03:28:25,000
We can again represent
the vertices set as 5 comma
4971
03:28:25,000 --> 03:28:27,500
6 comma 7 comma 8 and the edge
4972
03:28:27,500 --> 03:28:32,000
set as 5 comma 6 then
5 comma 7 now talking
4973
03:28:32,000 --> 03:28:33,643
about directed graphs.
4974
03:28:33,643 --> 03:28:37,605
So basically in a directed graph
the order of vertices
4975
03:28:37,605 --> 03:28:39,400
in the edge set matters.
4976
03:28:39,700 --> 03:28:43,100
So we use Arrow
to represent the edges
4977
03:28:43,300 --> 03:28:45,014
as you can see in the image
4978
03:28:45,014 --> 03:28:48,000
which was not the case
with the undirected graph,
4979
03:28:48,000 --> 03:28:49,900
where we were using
the straight lines.
4980
03:28:50,000 --> 03:28:51,400
So in directed graph,
4981
03:28:51,400 --> 03:28:56,000
we use Arrow to denote the edges
and the important thing is
4982
03:28:56,000 --> 03:28:58,214
the edge is an ordered pair.
4983
03:28:58,214 --> 03:29:00,500
It will contain
the source vertex
4984
03:29:00,500 --> 03:29:04,200
that is five in this case
and the destination vertex,
4985
03:29:04,200 --> 03:29:09,400
which is 6 in this case, and this
is never the same as (6,
4986
03:29:09,400 --> 03:29:13,300
5); you cannot represent
this edge as (6, 5),
4987
03:29:13,400 --> 03:29:17,100
because the direction always
matters in a directed graph.
4988
03:29:17,100 --> 03:29:18,500
similarly you can see
4989
03:29:18,500 --> 03:29:20,556
that 5 is adjacent to 6,
4990
03:29:20,556 --> 03:29:23,787
but you cannot say
that 6 is adjacent to 5.
4991
03:29:24,200 --> 03:29:29,000
So for this graph the vertices
set would be the same, {5,
4992
03:29:29,000 --> 03:29:32,620
6 comma 7 comma 8
which was similar
4993
03:29:32,620 --> 03:29:34,158
in undirected graph,
4994
03:29:34,200 --> 03:29:38,700
but in a directed graph your edge
set is made of ordered pairs. The first pair,
4995
03:29:38,700 --> 03:29:42,835
this one, will be (5, 6);
then your second edge,
4996
03:29:42,835 --> 03:29:46,528
which is this one, would be
(5, 7),
4997
03:29:47,000 --> 03:29:53,300
and at last this one
would be (7, 8). But in case
4998
03:29:53,300 --> 03:29:56,166
of undirected graph
you can write this as
4999
03:29:56,166 --> 03:29:57,600
8 comma 7 or in case
5000
03:29:57,600 --> 03:30:00,400
of undirected graph you can
write this one as seven comma
5001
03:30:00,400 --> 03:30:03,369
5 but this is not the case
with the directed graph.
5002
03:30:03,369 --> 03:30:05,428
You have to follow
the source vertex
5003
03:30:05,428 --> 03:30:08,100
and the destination vertex
to represent the edge.
5004
03:30:08,100 --> 03:30:10,642
So I hope you guys are clear
with the undirected
5005
03:30:10,642 --> 03:30:11,846
and directed graph.
5006
03:30:11,846 --> 03:30:12,100
Now.
5007
03:30:12,100 --> 03:30:15,200
Let's talk about
vertex-labeled graphs now.
5008
03:30:15,200 --> 03:30:18,840
In a vertex-labeled graph,
each vertex is labeled
5009
03:30:18,840 --> 03:30:21,650
with some data
in addition to the data
5010
03:30:21,650 --> 03:30:23,700
that identifies the vertex.
5011
03:30:23,700 --> 03:30:28,100
So basically we call this 6
or this 5 the vertex ID.
5012
03:30:28,200 --> 03:30:29,500
So there will be data
5013
03:30:29,500 --> 03:30:31,800
that would be added
to this vertex.
5014
03:30:32,000 --> 03:30:35,200
So let's say this vertex
would be 6 comma
5015
03:30:35,200 --> 03:30:37,500
and then we are adding the color
5016
03:30:37,500 --> 03:30:39,700
so it would be purple next.
5017
03:30:39,800 --> 03:30:42,100
This vertex would be 8 comma
5018
03:30:42,100 --> 03:30:44,700
and the color
would be green next.
5019
03:30:44,700 --> 03:30:50,400
we'll see this as (7,
red), and then this one as
5020
03:30:50,400 --> 03:30:54,400
five comma blue now
the six or this five
5021
03:30:54,400 --> 03:30:55,639
or seven or eight.
5022
03:30:55,639 --> 03:30:58,800
These are vertex ID
and the additional data,
5023
03:30:58,800 --> 03:31:03,500
which is attached is the color
like blue purple green or red.
5024
03:31:03,900 --> 03:31:08,696
But only the identifying data
is present in the pair of edges
5025
03:31:08,696 --> 03:31:12,543
or you can say only the ID
of the vertex is present
5026
03:31:12,543 --> 03:31:13,773
in the edge set.
5027
03:31:14,100 --> 03:31:15,322
So here the edge set
5028
03:31:15,322 --> 03:31:17,700
is again similar to
your directed graph:
5029
03:31:17,700 --> 03:31:19,587
that is your Source ID this
5030
03:31:19,587 --> 03:31:21,992
which is 5 and
then destination ID,
5031
03:31:21,992 --> 03:31:25,274
which is 6 in this case
then for this case.
5032
03:31:25,274 --> 03:31:28,785
It's similar as five comma
7 then in for this case.
5033
03:31:28,785 --> 03:31:30,469
It's similar as 7 comma 8
5034
03:31:30,469 --> 03:31:33,600
so we are not specifying
this additional data,
5035
03:31:33,600 --> 03:31:35,699
which is attached
to the vertices.
5036
03:31:35,699 --> 03:31:36,878
That is the color.
5037
03:31:36,878 --> 03:31:40,121
If you only specify
the identifiers of the vertex
5038
03:31:40,121 --> 03:31:41,300
that is the number
5039
03:31:41,300 --> 03:31:44,700
but your vertex set
would be something
5040
03:31:44,700 --> 03:31:46,300
like so this vertex
5041
03:31:46,300 --> 03:31:50,100
would be 5 comma blue
then your next vertex
5042
03:31:50,100 --> 03:31:52,600
will become 6 comma purple
5043
03:31:53,100 --> 03:31:56,700
then your next vertex
will become 8 comma green
5044
03:31:57,000 --> 03:31:59,800
and at last your last
vertex will be written
5045
03:31:59,800 --> 03:32:01,100
as (7, red).
5046
03:32:01,100 --> 03:32:04,808
So basically when you
are specifying the vertices set
5047
03:32:04,808 --> 03:32:07,305
in the vertex label
graph you attach
5048
03:32:07,305 --> 03:32:10,683
the additional information
in the vertices are set
5049
03:32:10,683 --> 03:32:12,200
but while representing
5050
03:32:12,200 --> 03:32:16,183
the edge set it is represented
similarly as A directed graph
5051
03:32:16,183 --> 03:32:19,900
where you have to just specify
the source vertex identifier
5052
03:32:19,900 --> 03:32:20,900
and then you have
5053
03:32:20,900 --> 03:32:24,300
to specify the destination
vertex identifier now.
5054
03:32:24,300 --> 03:32:27,500
I hope that you guys are clear
with undirected, directed,
5055
03:32:27,500 --> 03:32:29,000
and vertex-labeled graphs.
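(Written out for this vertex-labeled example, with the colors as the attached data and only the vertex IDs appearing in the edges:)

V = {(5, blue), (6, purple), (7, red), (8, green)}
E = {(5, 6), (5, 7), (7, 8)}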
5056
03:32:29,184 --> 03:32:33,615
So let's quickly move forward
next we have cyclic graph.
5057
03:32:33,800 --> 03:32:36,800
So a cyclic graph
is a directed graph
5058
03:32:36,900 --> 03:32:38,900
with at least one cycle
5059
03:32:39,000 --> 03:32:43,153
and the cycle is the path
along with the directed edges
5060
03:32:43,153 --> 03:32:44,933
from a Vertex to itself.
5061
03:32:44,933 --> 03:32:47,000
So so once you see over here,
5062
03:32:47,000 --> 03:32:47,708
you can see
5063
03:32:47,708 --> 03:32:50,541
that from this vertex
5, it's moving to vertex
5064
03:32:50,541 --> 03:32:51,700
7 then it's moving
5065
03:32:51,700 --> 03:32:54,700
to vertex 8, then with the arrows
moving to vertex 6,
5066
03:32:54,700 --> 03:32:57,539
and then again
it's moving to vertex 5.
5067
03:32:57,539 --> 03:33:01,600
So there should be at least
one cycle in a cyclic graph.
5068
03:33:01,600 --> 03:33:04,000
There might be a new component.
5069
03:33:04,000 --> 03:33:08,400
It's a Vertex 9 which is
attached over here again,
5070
03:33:08,400 --> 03:33:10,401
so it would be a cyclic graph
5071
03:33:10,401 --> 03:33:13,300
because it has
one complete cycle over here
5072
03:33:13,300 --> 03:33:15,500
and the important
thing to notice is
5073
03:33:15,500 --> 03:33:20,300
That the arrow should make
the cycle like from 5 to 7
5074
03:33:20,300 --> 03:33:23,300
and then from 7 to 8
and then 8 to 6
5075
03:33:23,300 --> 03:33:25,300
and 6 to 5 and let's say
5076
03:33:25,300 --> 03:33:26,831
that there is an arrow
5077
03:33:26,831 --> 03:33:30,281
from 5 to 6 and then there
is an arrow from 6 to 8.
5078
03:33:30,281 --> 03:33:32,233
So we have flipped the arrows.
5079
03:33:32,233 --> 03:33:33,600
So in that situation,
5080
03:33:33,600 --> 03:33:36,372
this is not a cyclic graph
because the arrows
5081
03:33:36,372 --> 03:33:38,200
are not completing the cycle.
5082
03:33:38,200 --> 03:33:41,370
So once you move from 5 to 7
and then from 7 to 8,
5083
03:33:41,370 --> 03:33:44,452
you cannot move from 8
to 6, and similarly
5084
03:33:44,452 --> 03:33:47,167
once you move from 5 to 6
and then 6 to 8.
5085
03:33:47,167 --> 03:33:49,020
You cannot move from 8 to 7.
5086
03:33:49,020 --> 03:33:52,000
So in that situation,
it's not a cyclic graph.
5087
03:33:52,000 --> 03:33:54,307
So let's clear all this thing.
5088
03:33:54,307 --> 03:33:56,461
So will represent this cycle
5089
03:33:56,461 --> 03:34:00,300
as 5; then, following
the arrows, we'll go to 7,
5090
03:34:00,300 --> 03:34:05,300
and then we'll move to 8
and then we'll move to 6
5091
03:34:05,300 --> 03:34:09,774
and at last we'll
come back to 5 now.
5092
03:34:09,774 --> 03:34:11,851
we have edge-labeled graphs.
5093
03:34:12,000 --> 03:34:15,030
So basically an edge-labeled
graph is a graph where
5094
03:34:15,030 --> 03:34:17,752
the edges are
associated with labels.
5095
03:34:17,752 --> 03:34:22,059
So one can basically indicate
this by making the edge set
5096
03:34:22,059 --> 03:34:23,906
a set of triplets.
5097
03:34:23,906 --> 03:34:25,600
So for example,
5098
03:34:25,600 --> 03:34:26,900
let's say this edge
5099
03:34:26,900 --> 03:34:30,875
in this Edge label graph
will be denoted as the source
5100
03:34:30,875 --> 03:34:33,200
which is 6 then the destination
5101
03:34:33,200 --> 03:34:38,000
which is 7 and then the label
of the edge which is blue.
5102
03:34:38,000 --> 03:34:41,400
So this Edge would
be defined something
5103
03:34:41,400 --> 03:34:44,700
like 6 comma 7 comma blue
and then for this
5104
03:34:44,700 --> 03:34:47,100
similarly, the source vertex,
5105
03:34:47,100 --> 03:34:49,414
that is 7 the
destination vertex,
5106
03:34:49,414 --> 03:34:52,100
which is 8 then
the label of the edge,
5107
03:34:52,100 --> 03:34:55,400
which is white like
similarly for this Edge.
5108
03:34:55,400 --> 03:35:00,200
it's (5, 7) and
then its label, which is red.
5109
03:35:01,000 --> 03:35:03,076
And at last, for this edge,
5110
03:35:03,076 --> 03:35:09,200
it's (5, 6) and then it
would be yellow-green,
5111
03:35:09,200 --> 03:35:11,362
which is the label of the edge.
5112
03:35:11,362 --> 03:35:14,665
So all these four edges
will become the edge set
5113
03:35:14,665 --> 03:35:18,400
for this graph and the vertices
set is almost similar
5114
03:35:18,400 --> 03:35:21,200
that is 5 comma
6 comma 7 comma 8 now
5115
03:35:21,200 --> 03:35:24,200
to generalize this I would say x
5116
03:35:24,200 --> 03:35:26,400
comma y so X here is
5117
03:35:26,400 --> 03:35:30,700
the source vertex, then y
here is the destination vertex,
5118
03:35:30,700 --> 03:35:33,914
and then a here is
the label of the edge.
5119
03:35:33,914 --> 03:35:36,900
then Edge label graph
are usually drawn
5120
03:35:36,900 --> 03:35:39,573
with the labels written
adjacent to the arcs
5121
03:35:39,573 --> 03:35:40,902
specifying the edges
5122
03:35:40,902 --> 03:35:41,900
as you can see.
5123
03:35:41,900 --> 03:35:43,900
We have mentioned blue white
5124
03:35:43,900 --> 03:35:46,695
and all those label
addition to the edges.
5125
03:35:46,695 --> 03:35:50,400
So I hope you guys a player
with the edge label graph,
5126
03:35:50,400 --> 03:35:51,561
which is nothing
5127
03:35:51,561 --> 03:35:54,900
but labels attached
to each and every Edge now,
5128
03:35:54,900 --> 03:35:57,200
let's talk about weighted graph.
5129
03:35:57,200 --> 03:36:00,310
So a weighted graph is
an edge-labeled graph
5130
03:36:00,700 --> 03:36:03,700
Where the labels
can be operated on by
5131
03:36:03,700 --> 03:36:06,921
usually automatic operators
or comparison operators,
5132
03:36:06,921 --> 03:36:09,700
like less than or greater
than symbol usually
5133
03:36:09,700 --> 03:36:12,900
these are integers
or floats and the idea is
5134
03:36:12,900 --> 03:36:15,534
that some edges
may be more expensive
5135
03:36:15,534 --> 03:36:18,900
and this cost is represented
by the edge labels
5136
03:36:18,900 --> 03:36:22,992
or weights now in short weighted
graphs are a special kind
5137
03:36:22,992 --> 03:36:24,500
of Edgley build rafts
5138
03:36:24,500 --> 03:36:27,200
where your Edge
is attached to a weight.
5139
03:36:27,200 --> 03:36:29,800
Generally, which is
a integer or a float
5140
03:36:29,800 --> 03:36:33,100
so that we can perform
some addition or subtraction
5141
03:36:33,100 --> 03:36:35,452
or different kind
of automatic operations
5142
03:36:35,452 --> 03:36:36,689
or it can be some kind
5143
03:36:36,689 --> 03:36:39,500
of conditional operations
like less than or greater
5144
03:36:39,500 --> 03:36:40,800
than so we'll again
5145
03:36:40,800 --> 03:36:45,700
represent this Edge as 5 comma
6 and then the weight as 3
5146
03:36:46,100 --> 03:36:49,900
and similarly will represent
this Edge as 6 comma
5147
03:36:49,900 --> 03:36:55,351
7 and the weight is again
6 so similarly we represent
5148
03:36:55,351 --> 03:36:57,197
these two edges as well.
5149
03:36:57,300 --> 03:36:57,900
So I hope
5150
03:36:57,900 --> 03:37:00,500
that you guys are clear
with the weighted graphs.
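(Written out for the two edges mentioned above, with the weight as the third element: (5, 6, 3) and (6, 7, 6).)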
5151
03:37:00,500 --> 03:37:02,300
Now let's quickly
move ahead and look
5152
03:37:02,300 --> 03:37:04,200
at this directed acyclic graph.
5153
03:37:04,200 --> 03:37:06,900
So this is
a directed acyclic graph,
5154
03:37:07,100 --> 03:37:09,500
which is basically a directed
graph without cycles.
5155
03:37:09,500 --> 03:37:12,445
So as we just discussed
in cyclic graphs here,
5156
03:37:12,445 --> 03:37:13,151
you can see
5157
03:37:13,151 --> 03:37:16,601
that it is not completing
a cycle following the directions,
5158
03:37:16,601 --> 03:37:19,607
or you can say the direction
of the edges, right?
5159
03:37:19,607 --> 03:37:21,011
We can move from 5 to 7,
5160
03:37:21,011 --> 03:37:22,164
then seven to eight
5161
03:37:22,164 --> 03:37:25,500
but we cannot move from 8 to 6
and similarly we can move
5162
03:37:25,500 --> 03:37:27,600
from 5 to 6,
then 6 to 8,
5163
03:37:27,600 --> 03:37:29,700
but we cannot move from 8 to 7.
5164
03:37:29,700 --> 03:37:32,962
So this is Not forming
a cycle and these kind
5165
03:37:32,962 --> 03:37:36,300
of graphs are known as
directed acyclic graphs.
5166
03:37:36,300 --> 03:37:39,914
Now, they appear as special
cases in CS application all
5167
03:37:39,914 --> 03:37:41,855
the time and the vertices set
5168
03:37:41,855 --> 03:37:44,600
and the edge set
are represented similarly
5169
03:37:44,700 --> 03:37:46,700
as we have seen
earlier. Now talking
5170
03:37:46,700 --> 03:37:48,670
about the disconnected graph.
5171
03:37:48,670 --> 03:37:51,972
So vertices in a graph
do not need to be connected
5172
03:37:51,972 --> 03:37:53,100
to other vertices.
5173
03:37:53,100 --> 03:37:54,466
It is basically legal
5174
03:37:54,466 --> 03:37:57,200
for a graph to have
disconnected components
5175
03:37:57,200 --> 03:38:00,466
or even lone vertices
without a single connection.
5176
03:38:00,466 --> 03:38:04,400
So basically this disconnected
graph which has four vertices
5177
03:38:04,400 --> 03:38:05,300
but no edges.
5178
03:38:05,300 --> 03:38:05,543
Now.
5179
03:38:05,543 --> 03:38:08,100
Let me tell you something
important that is
5180
03:38:08,100 --> 03:38:10,176
what our sources and sinks.
5181
03:38:10,200 --> 03:38:13,738
So let's say we have
one Arrow from five to six
5182
03:38:13,738 --> 03:38:18,233
and one arrow from 5 to 7.
Now, vertices with only
5183
03:38:18,233 --> 03:38:20,233
in arrows are called sink.
5184
03:38:20,600 --> 03:38:25,200
So the 7 and 6 are known
as sinks and the vertices
5185
03:38:25,307 --> 03:38:28,400
with only out arrows
are called sources.
5186
03:38:28,400 --> 03:38:32,500
So as you can see in the image
this Five only have out arrows
5187
03:38:32,500 --> 03:38:33,800
to six and seven.
5188
03:38:33,800 --> 03:38:36,200
So these are called sources now.
5189
03:38:36,200 --> 03:38:38,506
We'll talk about this
in a while guys.
5190
03:38:38,506 --> 03:38:41,500
Once we are going
through the pagerank algorithm.
5191
03:38:41,500 --> 03:38:45,228
So I hope that you guys know
what vertices are, what edges are,
5192
03:38:45,228 --> 03:38:48,149
how vertices and edges
represents the graph then
5193
03:38:48,149 --> 03:38:50,200
what are different
kinds of graph?
5194
03:38:50,384 --> 03:38:52,615
Let's move to the next topic.
5195
03:38:52,800 --> 03:38:54,236
So next let's know.
5196
03:38:54,236 --> 03:38:55,900
what Spark GraphX is.
5197
03:38:55,900 --> 03:38:58,616
So talking about
GraphX: GraphX is
5198
03:38:58,616 --> 03:39:00,519
a new component in spark.
5199
03:39:00,519 --> 03:39:03,843
for graphs and
graph-parallel computation. Now,
5200
03:39:03,843 --> 03:39:06,170
at a high level, GraphX extends
5201
03:39:06,170 --> 03:39:09,954
The Spark rdd by introducing
a new graph abstraction
5202
03:39:09,954 --> 03:39:12,046
that is directed multigraph
5203
03:39:12,046 --> 03:39:15,122
that is properties
attached to each vertex
5204
03:39:15,122 --> 03:39:18,800
and edge. Now, to support
graph computation, GraphX
5205
03:39:18,800 --> 03:39:22,320
basically exposes a set
of fundamental operators,
5206
03:39:22,320 --> 03:39:25,400
like finding sub graph
for joining vertices
5207
03:39:25,400 --> 03:39:30,253
or aggregating messages, as well;
it also exposes an optimized
5208
03:39:30,253 --> 03:39:34,713
variant of the Pregel
API. In addition, GraphX also
5209
03:39:34,713 --> 03:39:37,987
provides you a collection
of graph algorithms
5210
03:39:37,987 --> 03:39:41,700
and Builders to simplify
your spark analytics tasks.
5211
03:39:41,700 --> 03:39:45,600
So basically your GraphX
is extending your Spark RDD.
5212
03:39:45,600 --> 03:39:48,800
Then you have Graphics
is providing an abstraction
5213
03:39:48,800 --> 03:39:50,614
that is directed multigraph
5214
03:39:50,614 --> 03:39:53,800
with properties attached
to each vertex and Edge.
5215
03:39:53,800 --> 03:39:56,800
So we'll look at this
property graph in a while.
5216
03:39:56,800 --> 03:40:00,200
Then again Graphics gives you
some fundamental operators
5217
03:40:00,200 --> 03:40:01,000
and Then it also
5218
03:40:01,000 --> 03:40:03,800
provides you some graph
algorithms and Builders
5219
03:40:03,800 --> 03:40:07,260
which makes your analytics
easier now to get started
5220
03:40:07,260 --> 03:40:11,400
you first need to import Spark
and GraphX into your project.
5221
03:40:11,400 --> 03:40:12,550
So as you can see,
5222
03:40:12,550 --> 03:40:15,875
we are first importing Spark
and then we are importing
5223
03:40:15,875 --> 03:40:19,200
Spark GraphX to get
those GraphX functionalities.
5224
03:40:19,200 --> 03:40:21,150
And at last we are importing
5225
03:40:21,150 --> 03:40:25,400
Spark RDD to use those RDD
functionalities in our program.
5226
03:40:25,400 --> 03:40:28,098
But let me tell you
that if you are not using
5227
03:40:28,098 --> 03:40:30,400
spark shell then you
will need a spark.
5228
03:40:30,400 --> 03:40:31,807
Context in your program.
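(A minimal sketch of the imports just mentioned, plus a SparkContext for a standalone program; the app name and master are assumptions.)

import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Only needed outside the Spark shell:
val conf = new SparkConf().setAppName("GraphXDemo").setMaster("local[*]")
val sc = new SparkContext(conf)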
5229
03:40:31,807 --> 03:40:32,341
So I hope
5230
03:40:32,341 --> 03:40:35,400
that you guys are clear
with the features of GraphX
5231
03:40:35,400 --> 03:40:36,400
and the libraries
5232
03:40:36,400 --> 03:40:39,200
which you need to import
in order to use GraphX.
5233
03:40:39,300 --> 03:40:43,500
So let us quickly move ahead
and look at the property graph.
5234
03:40:43,500 --> 03:40:45,800
Now property graph is something
5235
03:40:45,800 --> 03:40:50,300
as the name suggests property
graph have properties attached
5236
03:40:50,300 --> 03:40:52,400
to each vertex and Edge.
5237
03:40:52,500 --> 03:40:54,115
So the property graph
5238
03:40:54,115 --> 03:40:58,653
is a directed multigraph with
user-defined objects attached
5239
03:40:58,653 --> 03:41:00,500
to each vertex and Edge.
5240
03:41:00,500 --> 03:41:03,700
Now you might be wondering
what a directed multigraph is.
5241
03:41:03,700 --> 03:41:08,123
So a directed multi graph is a
directed graph with potentially
5242
03:41:08,123 --> 03:41:11,137
multiple parallel edges
sharing same source
5243
03:41:11,137 --> 03:41:13,050
and same destination vertex.
5244
03:41:13,050 --> 03:41:15,102
So as you can see in the image
5245
03:41:15,102 --> 03:41:17,700
that from San Francisco
to Los Angeles,
5246
03:41:17,700 --> 03:41:22,106
we have two edges and similarly
from Los Angeles to Chicago.
5247
03:41:22,106 --> 03:41:23,600
There are two edges.
5248
03:41:23,600 --> 03:41:26,019
So basically in
a directed multigraph,
5249
03:41:26,019 --> 03:41:28,400
the first thing is
the directed graph,
5250
03:41:28,400 --> 03:41:30,386
so it should have a Direction.
5251
03:41:30,386 --> 03:41:33,300
attached to the edges,
and then talking
5252
03:41:33,300 --> 03:41:36,100
about multigraph so
between Source vertex
5253
03:41:36,100 --> 03:41:37,850
and a destination vertex,
5254
03:41:37,850 --> 03:41:39,600
there could be two edges.
5255
03:41:39,800 --> 03:41:42,886
So the ability to
support parallel edges
5256
03:41:42,886 --> 03:41:46,100
basically simplifies
the modeling scenarios
5257
03:41:46,100 --> 03:41:49,054
where there can be
multiple relationships
5258
03:41:49,054 --> 03:41:51,997
between the same vertices
for an example.
5259
03:41:51,997 --> 03:41:54,200
Let's say these are two persons
5260
03:41:54,200 --> 03:41:56,644
so they can be friends
as well as they
5261
03:41:56,644 --> 03:41:58,361
can be co-workers, right?
5262
03:41:58,361 --> 03:42:02,000
So these kind of scenarios
can be Easily modeled using
5263
03:42:02,000 --> 03:42:03,900
directed multigraph now.
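(A tiny sketch of how two parallel relationships between the same pair of people could be expressed as GraphX edges, assuming the imports shown earlier; the vertex IDs and labels are made up.)

val relationships: RDD[Edge[String]] = sc.parallelize(Seq(
  Edge(1L, 2L, "friend"),     // person 1 -> person 2
  Edge(1L, 2L, "co-worker")   // a second, parallel edge between the same vertices
))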
5264
03:42:03,900 --> 03:42:08,700
Each vertex is keyed by
a unique 64-bit long identifier,
5265
03:42:08,800 --> 03:42:12,700
which is basically the vertex ID
and it helps an indexing.
5266
03:42:12,700 --> 03:42:16,500
So each of your vertex
contains a Vertex ID,
5267
03:42:16,600 --> 03:42:20,000
which is a unique
64-bit long identifier
5268
03:42:20,200 --> 03:42:21,900
and similarly edges
5269
03:42:21,900 --> 03:42:26,600
have corresponding source and
destination vertex identifiers.
5270
03:42:26,700 --> 03:42:28,174
So this Edge would have
5271
03:42:28,174 --> 03:42:31,647
this vertex identifier as
well as This vertex identifier
5272
03:42:31,647 --> 03:42:35,620
or you can say Source vertex ID
and the destination vertex ID.
5273
03:42:35,620 --> 03:42:37,900
So as we discuss
this property graph
5274
03:42:37,900 --> 03:42:42,300
is basically parameterised
over the vertex and Edge types,
5275
03:42:42,300 --> 03:42:45,684
and these are the types
of objects associated
5276
03:42:45,684 --> 03:42:47,700
with each vertex and Edge.
5277
03:42:48,400 --> 03:42:51,792
So your GraphX basically
optimizes the representation
5278
03:42:51,792 --> 03:42:53,300
of vertex and Edge types
5279
03:42:53,300 --> 03:42:56,900
and it reduces the in
memory footprint by storing
5280
03:42:56,900 --> 03:43:00,400
the primitive data types
in a specialized array.
5281
03:43:00,400 --> 03:43:04,400
In some cases it might be
desirable to have vertices
5282
03:43:04,400 --> 03:43:07,200
with different property types
in the same graph.
5283
03:43:07,200 --> 03:43:10,400
Now this can be accomplished
through inheritance.
5284
03:43:10,400 --> 03:43:14,000
So for an example to model
a user and product
5285
03:43:14,000 --> 03:43:15,300
in a bipartite graph,
5286
03:43:15,300 --> 03:43:17,676
or you can see
that we have user property
5287
03:43:17,676 --> 03:43:19,400
and we have product property.
5288
03:43:19,400 --> 03:43:19,762
Okay.
5289
03:43:19,762 --> 03:43:23,400
So let me first tell you
what is a bipartite graph.
5290
03:43:23,400 --> 03:43:26,861
So a bipartite graph
is also called a bigraph,
5291
03:43:27,000 --> 03:43:29,500
which is a set
of graph vertices.
5292
03:43:30,300 --> 03:43:35,400
decomposed into two disjoint sets
such that no two graph vertices
5293
03:43:35,469 --> 03:43:37,930
within the same
set are adjacent.
5294
03:43:38,100 --> 03:43:39,700
So as you can see over here,
5295
03:43:39,700 --> 03:43:43,000
we have user property and then
we have product property
5296
03:43:43,000 --> 03:43:46,282
but no two user properties
can be adjacent, or you
5297
03:43:46,282 --> 03:43:48,592
can say there should be no edges
5298
03:43:48,592 --> 03:43:51,707
that is joining any
of the to user property or
5299
03:43:51,707 --> 03:43:53,300
there should be no Edge
5300
03:43:53,300 --> 03:43:56,000
that should be joining
product property.
5301
03:43:56,400 --> 03:44:00,000
So in this scenario
we use inheritance.
5302
03:44:00,200 --> 03:44:01,757
So as you can see here,
5303
03:44:01,757 --> 03:44:04,600
we have a class VertexProperty.
Now basically
5304
03:44:04,600 --> 03:44:07,400
what we are doing we
are creating another class
5305
03:44:07,400 --> 03:44:08,900
with user property.
5306
03:44:08,900 --> 03:44:10,700
And here we have name,
5307
03:44:10,700 --> 03:44:13,500
which is again a string
and we are extending
5308
03:44:13,500 --> 03:44:17,038
or you can say we are inheriting
the vertex property class.
5309
03:44:17,038 --> 03:44:19,600
Now again, in the case
of product property.
5310
03:44:19,600 --> 03:44:22,100
We have name that is
name of the product
5311
03:44:22,100 --> 03:44:25,000
which is again string and then
we have price of the product
5312
03:44:25,000 --> 03:44:25,985
which is double
5313
03:44:25,985 --> 03:44:29,400
and we are again extending
this vertex property graph
5314
03:44:29,400 --> 03:44:32,900
and at last you're creating a
graph with this VertexProperty
5315
03:44:32,900 --> 03:44:33,900
and then string.
5316
03:44:33,900 --> 03:44:37,045
So this is how we
can basically model user
5317
03:44:37,045 --> 03:44:39,500
and product as
a bipartite graph.
5318
03:44:39,500 --> 03:44:41,430
So we have created user property
5319
03:44:41,430 --> 03:44:44,265
as well as we have created
this product property
5320
03:44:44,265 --> 03:44:47,100
and we are extending
this vertex property class.
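(A sketch of the kind of classes being described here, along the lines of the GraphX documentation example:)

class VertexProperty()
case class UserProperty(val name: String) extends VertexProperty
case class ProductProperty(val name: String, val price: Double) extends VertexProperty
// The graph might then have the type:
var graph: Graph[VertexProperty, String] = null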
5321
03:44:47,400 --> 03:44:50,076
Now, talking about
this property graph,
5322
03:44:50,076 --> 03:44:51,907
It's similar to your rdd.
5323
03:44:51,907 --> 03:44:55,900
So like your rdd property graph
are immutable distributed
5324
03:44:55,900 --> 03:44:57,200
and fault tolerant.
5325
03:44:57,200 --> 03:45:00,491
So changes to the values
or structure of the graph.
5326
03:45:00,491 --> 03:45:01,908
Basically accomplished
5327
03:45:01,908 --> 03:45:04,900
by producing a new graph
with the desired changes
5328
03:45:04,900 --> 03:45:07,700
and the substantial part
of the original graph
5329
03:45:07,700 --> 03:45:09,900
which can be your structure
of the graph
5330
03:45:09,900 --> 03:45:11,800
or attributes or indices.
5331
03:45:11,800 --> 03:45:15,081
These are basically reused
in the new graph reducing
5332
03:45:15,081 --> 03:45:18,040
the cost of inherent
functional data structure.
5333
03:45:18,040 --> 03:45:20,100
So basically your property graph
5334
03:45:20,100 --> 03:45:22,500
once you're trying to change
values of structure.
5335
03:45:22,500 --> 03:45:26,024
So it creates a new graph
with changed structure
5336
03:45:26,024 --> 03:45:27,300
or changed values
5337
03:45:27,300 --> 03:45:30,182
and a substantial part
of the original graph is
5338
03:45:30,182 --> 03:45:33,300
reused multiple times
to improve the performance
5339
03:45:33,300 --> 03:45:35,900
and it can be
your structure of the graph
5340
03:45:35,900 --> 03:45:38,600
which is getting reuse
or it can be your attributes
5341
03:45:38,600 --> 03:45:41,000
or indices of the graph
which is getting reused.
5342
03:45:41,000 --> 03:45:44,400
So this is how your property
graph provides efficiency.
5343
03:45:44,400 --> 03:45:46,400
Now, the graph is partitioned
5344
03:45:46,400 --> 03:45:48,800
across the executors
using a range
5345
03:45:48,800 --> 03:45:50,500
of vertex partitioning rules,
5346
03:45:50,500 --> 03:45:52,700
which are basically
Loosely defined
5347
03:45:52,700 --> 03:45:56,514
and similar to RDDs,
each partition of the graph
5348
03:45:56,514 --> 03:45:57,800
can be recreated
5349
03:45:57,800 --> 03:46:01,100
on different machines
in the event of Failure.
5350
03:46:01,100 --> 03:46:05,000
So this is how your property
graph provides fault tolerance.
5351
03:46:05,000 --> 03:46:07,643
So as we already
discussed logically
5352
03:46:07,643 --> 03:46:12,174
the property graph corresponds
to a pair of type collections,
5353
03:46:12,174 --> 03:46:15,800
including the properties
for each vertex and Edge
5354
03:46:15,800 --> 03:46:17,338
and as a consequence
5355
03:46:17,338 --> 03:46:21,492
the graph class contains
members to access the vertices
5356
03:46:21,492 --> 03:46:22,569
and the edges.
5357
03:46:22,800 --> 03:46:24,067
So as you can see we
5358
03:46:24,067 --> 03:46:27,300
have graphed class then you
can see we have vertices
5359
03:46:27,307 --> 03:46:28,692
and we have edges.
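As a hedged sketch of the signature being described, the following Scala shows the shape of a graph class with its vertices and edges members; it is a simplified stand-in (deliberately named MyGraph), not GraphX's actual definition.

```scala
import org.apache.spark.graphx.{EdgeRDD, VertexRDD}

// Simplified sketch of the members just described; the real Graph class
// carries more members and internal optimizations.
abstract class MyGraph[VD, ED] {
  val vertices: VertexRDD[VD] // behaves like an RDD[(VertexId, VD)]
  val edges: EdgeRDD[ED]      // behaves like an RDD[Edge[ED]]
}
```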
5360
03:46:29,500 --> 03:46:34,400
Now this VertexRDD
is extending your RDD,
5361
03:46:34,600 --> 03:46:41,100
which is your RDD
with your vertex ID
5362
03:46:41,500 --> 03:46:43,807
and then your vertex property.
5363
03:46:44,600 --> 03:46:45,100
Similarly.
5364
03:46:45,100 --> 03:46:47,600
Your EdgeRDD is extending
5365
03:46:47,600 --> 03:46:53,500
your RDD with your Edge
property, so the classes
5366
03:46:53,500 --> 03:46:54,900
that is vertex rdd
5367
03:46:54,900 --> 03:47:00,100
and EdgeRDD extend an
optimized version of your RDD,
5368
03:47:00,100 --> 03:47:03,810
which includes vertex ID and
vertex property, and your RDD
5369
03:47:03,810 --> 03:47:06,746
which includes your Edge
property, and both
5370
03:47:06,746 --> 03:47:07,795
this vertex rdd
5371
03:47:07,795 --> 03:47:11,501
and EdgeRDD provide additional
functionality built on top
5372
03:47:11,501 --> 03:47:12,876
of graph computation
5373
03:47:12,876 --> 03:47:15,900
and leverages internal
optimizations as well.
5374
03:47:15,900 --> 03:47:19,159
So this is the reason we use
this VertexRDD or EdgeRDD,
5375
03:47:19,159 --> 03:47:22,500
because it already extends your
RDD containing your vertex
5376
03:47:22,500 --> 03:47:23,888
ID and vertex property
5377
03:47:23,888 --> 03:47:26,700
or your Edge property
it also provides you
5378
03:47:26,700 --> 03:47:30,100
additional functionalities built
on top of graph computation.
5379
03:47:30,100 --> 03:47:33,700
And again, it gives you some
internal optimizations as well.
5380
03:47:34,100 --> 03:47:37,715
Now, let me clear
this and let's take an example
5381
03:47:37,715 --> 03:47:39,000
of property graph
5382
03:47:39,000 --> 03:47:40,633
where the vertex property
5383
03:47:40,633 --> 03:47:43,300
might contain the user
name and occupation.
5384
03:47:43,300 --> 03:47:47,200
So as you can see in this table
that we have ID of the vertex
5385
03:47:47,200 --> 03:47:50,000
and then we have property
attached to each vertex.
5386
03:47:50,000 --> 03:47:52,602
That is the username
as well as the occupation
5387
03:47:52,602 --> 03:47:55,700
of the user or you can see
the profession of the user
5388
03:47:55,700 --> 03:47:58,715
and we can annotate
the edges with the string
5389
03:47:58,715 --> 03:48:01,800
describing the relationship
between the users.
5390
03:48:01,800 --> 03:48:04,400
So as you can see,
first is Thomas
5391
03:48:04,400 --> 03:48:06,300
who is a professor
then second is Frank
5392
03:48:06,300 --> 03:48:08,000
who is also a professor then
5393
03:48:08,000 --> 03:48:09,900
as you can see third is Jenny.
5394
03:48:09,900 --> 03:48:12,241
She's a student and forth is Bob
5395
03:48:12,241 --> 03:48:15,997
who is a doctor now Thomas is
a colleague of Frank.
5396
03:48:15,997 --> 03:48:17,200
Then you can see
5397
03:48:17,200 --> 03:48:21,000
that Thomas is academic
advisor of Jenny again.
5398
03:48:21,000 --> 03:48:23,153
Frank is also an academic advisor
5399
03:48:23,153 --> 03:48:27,692
of Jenny and then the doctor
is the health advisor of Jenny.
5400
03:48:27,700 --> 03:48:31,200
So the resulting graph
would have a signature
5401
03:48:31,200 --> 03:48:32,800
of something like this.
5402
03:48:32,800 --> 03:48:34,800
So I'll explain this in a while.
5403
03:48:34,900 --> 03:48:38,300
So there are numerous ways
to construct the property graph
5404
03:48:38,300 --> 03:48:39,300
from raw files
5405
03:48:39,300 --> 03:48:43,400
or RDDs or even synthetic
generators and we'll discuss it
5406
03:48:43,400 --> 03:48:44,766
in graph Builders,
5407
03:48:44,766 --> 03:48:46,313
but the very probable
5408
03:48:46,313 --> 03:48:49,700
and most General method
is to use graph object.
5409
03:48:49,700 --> 03:48:52,129
So let's take a look
at the code first.
5410
03:48:52,129 --> 03:48:53,651
And so first over here,
5411
03:48:53,651 --> 03:48:55,900
we are assuming
that the SparkContext
5412
03:48:55,900 --> 03:48:58,100
has already been constructed.
5413
03:48:58,100 --> 03:49:01,700
Then we are taking
sc as the SparkContext. Next,
5414
03:49:01,700 --> 03:49:04,600
We are creating an rdd
for the vertices.
5415
03:49:04,600 --> 03:49:06,689
So as you can see for users,
5416
03:49:06,689 --> 03:49:09,600
we have specified RDD
and then vertex ID
5417
03:49:09,600 --> 03:49:11,393
and then these are two strings.
5418
03:49:11,393 --> 03:49:12,605
So first one would be
5419
03:49:12,605 --> 03:49:15,900
your username and the second one
will be your profession.
5420
03:49:15,900 --> 03:49:19,612
Then we are using sc.parallelize
and we are creating an array
5421
03:49:19,612 --> 03:49:22,300
where we are specifying
all the vertices so
5422
03:49:22,300 --> 03:49:23,838
And that is this one
5423
03:49:23,900 --> 03:49:25,900
and you are getting
the name as Thomas
5424
03:49:25,900 --> 03:49:26,800
and the profession
5425
03:49:26,800 --> 03:49:30,646
is Professor. Similarly
for 2L we have Frank, Professor.
5426
03:49:30,646 --> 03:49:34,600
Then 3L, Jenny, she's a student,
and 4L, Bob, a doctor.
5427
03:49:34,600 --> 03:49:37,746
So here we have created
the vertex next.
5428
03:49:37,746 --> 03:49:40,207
We are creating
an rdd for edges.
5429
03:49:40,500 --> 03:49:43,400
So first we are giving
the values relationship.
5430
03:49:43,400 --> 03:49:46,400
Then we are creating
an rdd with Edge string
5431
03:49:46,400 --> 03:49:50,000
and then we're using
sc.parallelize to create the edges
5432
03:49:50,000 --> 03:49:52,948
and in the array we are
specifying the source vertex,
5433
03:49:52,948 --> 03:49:55,595
then we are specifying
the destination vertex.
5434
03:49:55,595 --> 03:49:57,400
And then we are
giving the relation
5435
03:49:57,400 --> 03:50:01,000
that is colleague. Similarly
for the next edge we give the source
5436
03:50:01,000 --> 03:50:02,800
and the destination
5437
03:50:02,800 --> 03:50:06,131
and then the relation
is academic advisor
5438
03:50:06,165 --> 03:50:07,934
and then it goes so on.
5439
03:50:08,242 --> 03:50:11,857
So in this line we
are defining a default user
5440
03:50:12,200 --> 03:50:16,276
in case there is a relationship
between missing users.
5441
03:50:16,300 --> 03:50:18,900
Now we have given
the name as default user
5442
03:50:18,900 --> 03:50:20,800
and the profession is missing.
5443
03:50:21,400 --> 03:50:24,000
Next we are trying to build
an initial graph.
5444
03:50:24,000 --> 03:50:27,100
So for that we are using
this graph object.
5445
03:50:27,100 --> 03:50:30,100
So we have specified users
that is your vertices.
5446
03:50:30,100 --> 03:50:34,300
Then we are specifying the
relations that is your edges.
5447
03:50:34,400 --> 03:50:36,867
And then we are giving
the default user
5448
03:50:36,867 --> 03:50:39,400
which is basically
for any missing user.
5449
03:50:39,400 --> 03:50:41,800
So now as you can see over here,
5450
03:50:41,800 --> 03:50:46,700
we are using Edge case class
and edges have a source ID
5451
03:50:46,700 --> 03:50:48,300
and a destination ID,
5452
03:50:48,300 --> 03:50:51,300
which is basically
corresponding to your source
5453
03:50:51,300 --> 03:50:52,800
and destination vertex.
5454
03:50:52,800 --> 03:50:55,100
And in addition
to the Edge class.
5455
03:50:55,100 --> 03:50:56,900
We have an attribute member
5456
03:50:56,900 --> 03:51:00,600
which stores The Edge property
which is the relation over here
5457
03:51:00,600 --> 03:51:01,600
that is colleague
5458
03:51:01,600 --> 03:51:06,138
or it is academic advisor or it
is Health advisor and so on.
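A runnable Scala sketch of the construction just walked through is shown below. The vertex IDs and the exact edge list are assumptions chosen to match the example data (Thomas, Frank, Jenny, Bob), and `sc` is assumed to be an existing SparkContext.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

def buildGraph(sc: SparkContext): Graph[(String, String), String] = {
  // Vertex RDD: (VertexId, (username, profession))
  val users: RDD[(VertexId, (String, String))] = sc.parallelize(Array(
    (1L, ("Thomas", "professor")), (2L, ("Frank", "professor")),
    (3L, ("Jenny", "student")),    (4L, ("Bob", "doctor"))))

  // Edge RDD: Edge(srcId, dstId, relationship)
  val relationships: RDD[Edge[String]] = sc.parallelize(Array(
    Edge(1L, 2L, "colleague"),
    Edge(1L, 3L, "academic advisor"),
    Edge(2L, 3L, "academic advisor"),
    Edge(4L, 3L, "health advisor")))

  // Default attribute for any vertex referenced by an edge but missing above.
  val defaultUser = ("Default User", "Missing")

  Graph(users, relationships, defaultUser)
}
```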
5459
03:51:06,200 --> 03:51:06,900
So, I hope
5460
03:51:06,900 --> 03:51:10,287
that you guys are clear
about creating a property graph
5461
03:51:10,287 --> 03:51:13,800
how to specify the vertices
how to specify edges and then
5462
03:51:13,800 --> 03:51:17,763
how to create a graph Now
we can deconstruct a graph
5463
03:51:17,763 --> 03:51:19,461
into respective vertex
5464
03:51:19,461 --> 03:51:23,000
and Edge views by using
the graph dot vertices
5465
03:51:23,000 --> 03:51:24,900
and graph dot edges members.
5466
03:51:25,000 --> 03:51:27,041
So as you can see
we are using graph
5467
03:51:27,041 --> 03:51:30,100
dot vertices over here
and graph dot edges over here.
5468
03:51:30,100 --> 03:51:32,100
Now what we are trying to do.
5469
03:51:32,100 --> 03:51:35,900
So first over here the graph
which we have created earlier.
5470
03:51:35,900 --> 03:51:37,291
So we have graph dot
5471
03:51:37,300 --> 03:51:40,700
vertices dot filter. Now
using this case class.
5472
03:51:40,700 --> 03:51:42,300
We have this vertex ID.
5473
03:51:42,300 --> 03:51:45,378
We have the name and then
we have the position.
5474
03:51:45,378 --> 03:51:48,322
And we are specifying
the position as doctor.
5475
03:51:48,322 --> 03:51:51,400
So first we are trying
to filter the profession
5476
03:51:51,400 --> 03:51:53,600
of the user as doctor.
5477
03:51:53,600 --> 03:51:55,400
And then we are trying to count.
5478
03:51:55,400 --> 03:51:55,630
It.
5479
03:51:55,900 --> 03:51:56,900
Next.
5480
03:51:56,900 --> 03:51:59,700
We are specifying
graph edges filter
5481
03:51:59,900 --> 03:52:03,270
and we are basically
trying to filter the edges
5482
03:52:03,270 --> 03:52:07,300
where the source ID is greater
than your destination ID.
5483
03:52:07,300 --> 03:52:09,800
And then we are trying
to count those edges.
5484
03:52:09,800 --> 03:52:12,600
We are using
a Scala case expression
5485
03:52:12,600 --> 03:52:15,400
as you can see to
deconstruct the tuple.
5486
03:52:15,500 --> 03:52:17,400
You can say to deconstruct
5487
03:52:17,400 --> 03:52:23,358
the result. On the other hand,
graph dot edges returns an EdgeRDD,
5488
03:52:23,358 --> 03:52:26,282
which is containing
Edge[String] objects.
5489
03:52:26,400 --> 03:52:30,800
So we could also have used
the case Class Type Constructor
5490
03:52:30,900 --> 03:52:32,200
as you can see here.
5491
03:52:32,200 --> 03:52:34,832
So again over here we
are using graph dot edges
5492
03:52:34,832 --> 03:52:36,400
dot filter and over here.
5493
03:52:36,400 --> 03:52:40,400
We have given case Edge and then
we are specifying the property
5494
03:52:40,400 --> 03:52:43,900
that is Source destination
and then property of the edge
5495
03:52:43,900 --> 03:52:45,000
which is attached.
5496
03:52:45,000 --> 03:52:48,800
And then we are filtering it and
then we are trying to count it.
5497
03:52:48,800 --> 03:52:53,547
So this is how using Edge class
either you can see with edges
5498
03:52:53,547 --> 03:52:55,603
or you can see with vertices.
5499
03:52:55,603 --> 03:52:59,191
This is how you can go ahead
and deconstruct them.
5500
03:52:59,191 --> 03:53:01,900
Right, because your
graph dot vertices
5501
03:53:01,900 --> 03:53:06,300
or your graph dot edges returns
a VertexRDD or EdgeRDD.
5502
03:53:06,400 --> 03:53:07,947
So to deconstruct them,
5503
03:53:07,947 --> 03:53:10,100
we basically use
this case class.
5504
03:53:10,100 --> 03:53:11,000
So I hope you
5505
03:53:11,000 --> 03:53:13,742
guys are clear about
transforming property graph.
5506
03:53:13,742 --> 03:53:15,400
And how to use this case
5507
03:53:15,400 --> 03:53:19,300
class to deconstruct
the VertexRDD or EdgeRDD.
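A minimal Scala sketch of the two counts just described, assuming `graph` was built as in the construction example above:

```scala
// Count users whose profession is "doctor", deconstructing the vertex tuple.
val doctorCount = graph.vertices.filter {
  case (id, (name, profession)) => profession == "doctor"
}.count()

// Count edges whose source ID is greater than their destination ID,
// either with a Scala case expression on the Edge class...
val backwardEdges = graph.edges.filter { case Edge(src, dst, prop) => src > dst }.count()
// ...or with the plain field accessors.
val backwardEdges2 = graph.edges.filter(e => e.srcId > e.dstId).count()
```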
5508
03:53:20,169 --> 03:53:22,630
So now let's quickly move ahead.
5509
03:53:22,700 --> 03:53:24,875
Now in addition to the vertex
5510
03:53:24,875 --> 03:53:27,406
and Edge views
of the property graph
5511
03:53:27,406 --> 03:53:30,300
GraphX also exposes
a triplet view. Now,
5512
03:53:30,300 --> 03:53:32,700
you might be wondering
what is a triplet view.
5513
03:53:32,700 --> 03:53:35,977
So the triplet view
logically joins the vertex
5514
03:53:35,977 --> 03:53:39,600
and Edge properties
yielding an rdd edge triplet
5515
03:53:39,600 --> 03:53:42,700
with vertex property
and your Edge property.
5516
03:53:42,700 --> 03:53:45,174
So as you can see
it gives an RDD
5517
03:53:45,174 --> 03:53:47,217
with EdgeTriplet and then it
5518
03:53:47,217 --> 03:53:51,523
has vertex property as well as
edge property associated with it
5519
03:53:51,523 --> 03:53:55,100
and it contains an instance
of the EdgeTriplet class.
5520
03:53:55,200 --> 03:53:55,700
Now.
5521
03:53:55,700 --> 03:53:57,800
I am taking example of a join.
5522
03:53:57,800 --> 03:54:01,603
So in this join we are trying
to select Source ID destination
5523
03:54:01,603 --> 03:54:03,100
ID Source attribute then
5524
03:54:03,100 --> 03:54:04,635
this is your Edge attribute
5525
03:54:04,635 --> 03:54:07,400
and then at last you
have destination attribute.
5526
03:54:07,400 --> 03:54:11,200
So basically your edges has
Alias e then your vertices
5527
03:54:11,200 --> 03:54:12,907
has Alias as source.
5528
03:54:12,907 --> 03:54:16,516
And again your vertices
has alias as destination, so we
5529
03:54:16,516 --> 03:54:19,900
are trying to select
Source ID destination ID,
5530
03:54:19,900 --> 03:54:23,155
then Source, attribute
and destination attribute,
5531
03:54:23,155 --> 03:54:25,800
and we also selecting
The Edge attribute
5532
03:54:25,800 --> 03:54:28,200
and we are performing left join.
5533
03:54:28,400 --> 03:54:31,900
The edge Source ID should
be equal to Source ID
5534
03:54:31,900 --> 03:54:35,600
and the edge destination ID should
be equal to destination ID.
5535
03:54:36,400 --> 03:54:39,700
And now your Edge
triplet class basically
5536
03:54:39,700 --> 03:54:43,090
extends your Edge class
by adding your Source attribute
5537
03:54:43,090 --> 03:54:45,100
and destination
attribute members
5538
03:54:45,100 --> 03:54:48,100
which contains the source
and destination properties
5539
03:54:48,200 --> 03:54:49,155
and we can use
5540
03:54:49,155 --> 03:54:52,500
the triplet view of a graph
to render a collection
5541
03:54:52,500 --> 03:54:55,804
of strings describing
relationship between users.
5542
03:54:55,804 --> 03:54:59,521
This is vertex 1 which is again
denoting your user one.
5543
03:54:59,521 --> 03:55:01,986
That is Thomas,
who is a professor,
5544
03:55:01,986 --> 03:55:03,081
and this is vertex 3,
5545
03:55:03,081 --> 03:55:06,400
which is denoting Jenny
and she's a student.
5546
03:55:06,400 --> 03:55:07,994
And this is your Edge,
5547
03:55:07,994 --> 03:55:11,400
which is defining
the relationship between them.
5548
03:55:11,400 --> 03:55:13,600
So this is an edge triplet
5549
03:55:13,600 --> 03:55:17,300
which is denoting
the both vertex as well
5550
03:55:17,300 --> 03:55:20,900
as the edge which denote
the relation between them.
5551
03:55:20,900 --> 03:55:23,600
So now looking at this code
first we have already
5552
03:55:23,600 --> 03:55:26,377
created the graph then we
are taking this graph.
5553
03:55:26,377 --> 03:55:27,979
We are finding the triplets
5554
03:55:27,979 --> 03:55:30,194
and then we are
mapping each triplet.
5555
03:55:30,194 --> 03:55:33,700
We are trying to find out
the triplet dot Source attribute
5556
03:55:33,700 --> 03:55:36,155
in which we are picking
up the username.
5557
03:55:36,155 --> 03:55:37,100
Then over here.
5558
03:55:37,100 --> 03:55:39,800
We are trying to pick up
the triplet attribute,
5559
03:55:39,800 --> 03:55:42,400
which is nothing
but the edge attribute
5560
03:55:42,400 --> 03:55:44,400
which is your academic advisor.
5561
03:55:44,400 --> 03:55:45,800
Then we are trying
5562
03:55:45,800 --> 03:55:48,800
to pick up the triplet
destination attribute.
5563
03:55:48,800 --> 03:55:50,904
It will again pick
up the username
5564
03:55:50,904 --> 03:55:52,500
of destination attribute,
5565
03:55:52,500 --> 03:55:54,766
which is username
of this vertex 3.
5566
03:55:54,766 --> 03:55:57,100
So for an example
in this situation,
5567
03:55:57,100 --> 03:56:01,000
it will print Thomas is
the academic advisor of Jenny.
5568
03:56:01,000 --> 03:56:03,211
So then we are trying
to take this facts.
5569
03:56:03,211 --> 03:56:04,726
We are collecting the facts
5570
03:56:04,726 --> 03:56:07,900
using this foreach we are
printing each of the triplets
5571
03:56:07,900 --> 03:56:09,812
that is present in this graph.
5572
03:56:09,812 --> 03:56:10,385
So I hope
5573
03:56:10,385 --> 03:56:13,700
that you guys are clear
with the concepts of triplet.
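A hedged Scala sketch of the triplet walk-through above, again assuming the `graph` built in the earlier construction example, with (username, profession) vertex attributes:

```scala
// One "fact" per edge triplet: source username, edge relation, destination username.
val facts = graph.triplets.map { triplet =>
  triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1
}
facts.collect().foreach(println)
// e.g. prints "Thomas is the academic advisor of Jenny"
```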
5574
03:56:14,600 --> 03:56:17,300
So now let's quickly take
a look at graph Builders.
5575
03:56:17,353 --> 03:56:19,200
So as I already told you
5576
03:56:19,200 --> 03:56:22,700
that GraphX provides
several ways of building a graph
5577
03:56:22,700 --> 03:56:25,551
from a collection of vertices
and edges either.
5578
03:56:25,551 --> 03:56:28,900
They can be stored in an RDD
or they can be stored on disk.
5579
03:56:28,900 --> 03:56:32,600
So in this graph object first,
we have this apply method.
5580
03:56:32,600 --> 03:56:36,300
So basically this apply
method allows creating a graph
5581
03:56:36,300 --> 03:56:37,773
from rdd of vertices
5582
03:56:37,773 --> 03:56:42,000
and edges and duplicate vertices
are picked arbitrarily
5583
03:56:42,000 --> 03:56:43,139
and the vertices
5584
03:56:43,139 --> 03:56:46,700
which are found in the Edge rdd
and are not present
5585
03:56:46,700 --> 03:56:50,522
in the vertices rdd are assigned
a default attribute.
5586
03:56:50,522 --> 03:56:52,653
So in this apply method first,
5587
03:56:52,653 --> 03:56:55,100
we are providing
the vertex rdd then
5588
03:56:55,100 --> 03:56:57,000
we are providing the edge rdd
5589
03:56:57,000 --> 03:57:00,311
and then we are providing
the default vertex attribute.
5590
03:57:00,311 --> 03:57:03,613
So it will create the vertex
which we have specified.
5591
03:57:03,613 --> 03:57:05,400
Then it will create the edges
5592
03:57:05,400 --> 03:57:08,700
which are specified and
if there is a vertex
5593
03:57:08,700 --> 03:57:11,173
which is being referred
by The Edge,
5594
03:57:11,173 --> 03:57:14,000
but it is not present
in this vertex rdd.
5595
03:57:14,000 --> 03:57:16,763
So what it does is it
creates that vertex
5596
03:57:16,763 --> 03:57:20,900
and assigns them the value of
this default vertex attribute.
5597
03:57:20,900 --> 03:57:22,700
Next we have from edges.
5598
03:57:22,700 --> 03:57:27,000
So Graph dot fromEdges
allows creating a graph only
5599
03:57:27,000 --> 03:57:28,900
from the rdd of edges
5600
03:57:29,000 --> 03:57:32,266
which automatically creates
any vertices mentioned
5601
03:57:32,266 --> 03:57:35,400
in the edges and assigns
them the default value.
5602
03:57:35,500 --> 03:57:39,000
So what happens over here
you provide the edge rdd
5603
03:57:39,000 --> 03:57:40,496
and all the vertices
5604
03:57:40,496 --> 03:57:44,385
that are present in the edge RDD
are automatically created
5605
03:57:44,385 --> 03:57:48,500
and Default value is assigned
to each of those vertices.
5606
03:57:48,500 --> 03:57:49,522
So Graph dot
5607
03:57:49,522 --> 03:57:53,100
fromEdgeTuples basically
allows creating a graph
5608
03:57:53,100 --> 03:57:55,484
from only the RDD of edge tuples
5609
03:57:55,500 --> 03:58:00,100
and it assigns the edges the
value 1, and again the vertices
5610
03:58:00,100 --> 03:58:04,200
which are specified by the edges
are automatically created
5611
03:58:04,200 --> 03:58:05,788
and the default value which
5612
03:58:05,788 --> 03:58:09,005
we are specifying over here
will be allocated to them.
5613
03:58:09,005 --> 03:58:10,100
So basically your
5614
03:58:10,100 --> 03:58:12,980
fromEdgeTuples supports
deduplicating of edges,
5615
03:58:12,980 --> 03:58:15,800
which means you can remove
the duplicate edges,
5616
03:58:15,800 --> 03:58:19,373
but for that you have
to provide a partition strategy
5617
03:58:19,373 --> 03:58:23,953
in the unique edges parameter
as it is necessary to co-locate
5618
03:58:23,953 --> 03:58:25,277
The Identical edges
5619
03:58:25,277 --> 03:58:28,900
on the same partition duplicate
edges can be removed.
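The three builders just discussed can be sketched as follows; `users`, `relationships` and `defaultUser` are assumed to be the values from the earlier construction example, and the partition strategy is just one of the built-in options.

```scala
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Graph.apply: explicit vertices plus edges plus a default attribute for
// vertices that only appear in the edge RDD.
val g1: Graph[(String, String), String] = Graph(users, relationships, defaultUser)

// Graph.fromEdges: only edges are given; every vertex they mention is
// created with the supplied default value.
val g2: Graph[Int, String] = Graph.fromEdges(relationships, defaultValue = 1)

// Graph.fromEdgeTuples: edges are bare (srcId, dstId) pairs, edge attributes
// are set to 1; passing a PartitionStrategy in uniqueEdges co-locates and
// merges duplicate edges.
val rawEdges: RDD[(VertexId, VertexId)] = relationships.map(e => (e.srcId, e.dstId))
val g3: Graph[Int, Int] = Graph.fromEdgeTuples(rawEdges, defaultValue = 1,
  uniqueEdges = Some(PartitionStrategy.RandomVertexCut))
```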
5620
03:58:29,100 --> 03:58:33,000
So moving ahead, none of the graph
builders repartitions
5621
03:58:33,000 --> 03:58:37,146
the graph's edges by default;
instead edges are left
5622
03:58:37,146 --> 03:58:39,300
in their default partitions.
5623
03:58:39,300 --> 03:58:42,540
So as you can see,
we have a graph loader object,
5624
03:58:42,540 --> 03:58:44,700
which is basically used to load
5625
03:58:44,700 --> 03:58:46,776
graphs from the file system,
5626
03:58:46,900 --> 03:58:51,571
so graph dot groupEdges requires
the graph to be repartitioned
5627
03:58:51,571 --> 03:58:52,956
because it assumes
5628
03:58:53,000 --> 03:58:55,900
that identical edges
will be co-located
5629
03:58:55,900 --> 03:58:57,378
on the same partition.
5630
03:58:57,378 --> 03:59:00,200
And so you must call
graph dot Partition by
5631
03:59:00,200 --> 03:59:02,200
before calling group edges.
5632
03:59:02,900 --> 03:59:07,500
So now you can see the edge
list file method over here
5633
03:59:07,538 --> 03:59:12,000
which provides a way to load
a graph from the list of edges
5634
03:59:12,000 --> 03:59:14,577
which is present
on the disk and it
5635
03:59:14,577 --> 03:59:18,900
It parses the adjacency list,
that is your Source vertex ID
5636
03:59:18,900 --> 03:59:22,900
and the destination vertex ID
Pairs and it creates a graph.
5637
03:59:23,200 --> 03:59:24,300
So now for an example,
5638
03:59:24,300 --> 03:59:29,600
let's say we have 2 1,
which is one edge, then you have
5639
03:59:29,600 --> 03:59:31,533
4 1 which is another edge
5640
03:59:31,533 --> 03:59:34,600
and then you have 1 2
which is another edge.
5641
03:59:34,600 --> 03:59:36,700
So it will load these edges
5642
03:59:36,900 --> 03:59:39,300
and then it will
create the graph.
5643
03:59:39,300 --> 03:59:40,792
So it will create 2,
5644
03:59:40,792 --> 03:59:44,600
then it will create
4 and then it will create 1.
5645
03:59:44,900 --> 03:59:46,100
And for 2 1 it
5646
03:59:46,100 --> 03:59:49,757
will create the edge, and then
for 4 1 it will create the edge,
5647
03:59:49,757 --> 03:59:52,500
and at last it will create
an edge for 1 and 2.
5648
03:59:52,700 --> 03:59:55,300
So it will create a graph
something like this.
5649
03:59:56,000 --> 03:59:59,100
It creates a graph
from specified edges
5650
03:59:59,300 --> 04:00:01,929
where automatically
vertices are created
5651
04:00:01,929 --> 04:00:05,751
which are mentioned by the edges
and all the vertex
5652
04:00:05,751 --> 04:00:08,465
and Edge attribute
are set by default one
5653
04:00:08,465 --> 04:00:10,907
and as well as one
will be associated
5654
04:00:10,907 --> 04:00:12,400
with all the vertices.
5655
04:00:12,543 --> 04:00:15,900
So it will be 4 comma
1 then again for this.
5656
04:00:15,900 --> 04:00:19,200
It would be 1 comma
1 and similarly it would be
5657
04:00:19,200 --> 04:00:21,201
2 comma 1 for this vertex.
5658
04:00:21,800 --> 04:00:24,184
Now, let's go back to the code.
5659
04:00:24,184 --> 04:00:27,800
So then we have
this canonical orientation.
5660
04:00:28,200 --> 04:00:31,655
So this argument
allows reorienting edges
5661
04:00:31,655 --> 04:00:33,500
in the positive direction
5662
04:00:33,500 --> 04:00:35,100
that is from the lower Source ID
5663
04:00:35,100 --> 04:00:38,000
to the higher
destination ID now,
5664
04:00:38,000 --> 04:00:40,800
which is basically required
by your connected components
5665
04:00:40,800 --> 04:00:41,782
algorithm will talk
5666
04:00:41,782 --> 04:00:43,800
about this algorithm
in a while you guys
5667
04:00:44,100 --> 04:00:47,069
but before that,
this basically helps
5668
04:00:47,069 --> 04:00:49,300
in reorienting your edges,
5669
04:00:49,300 --> 04:00:51,500
which means your source vertex
5670
04:00:51,500 --> 04:00:55,400
ID should always be less
than your destination vertex.
5671
04:00:55,400 --> 04:00:58,700
So in that situation it
might reorient this Edge.
5672
04:00:58,700 --> 04:01:01,970
So it will reorient this Edge
and basically to reverse
5673
04:01:01,970 --> 04:01:04,862
direction of the edge
similarly over here.
5674
04:01:04,862 --> 04:01:06,000
So the edge
5675
04:01:06,000 --> 04:01:08,896
which is coming from 2 to 1
will be reoriented
5676
04:01:08,896 --> 04:01:10,700
and will be again reversed.
5677
04:01:10,700 --> 04:01:11,754
Now talking
5678
04:01:11,754 --> 04:01:16,300
about the minimum Edge partition
this minimum Edge partition
5679
04:01:16,300 --> 04:01:18,858
basically specifies
the minimum number
5680
04:01:18,858 --> 04:01:21,900
of edge partitions
to generate. There might be
5681
04:01:21,900 --> 04:01:24,242
more Edge partitions
than a specified.
5682
04:01:24,242 --> 04:01:26,900
So let's say the hdfs
file has more blocks.
5683
04:01:26,900 --> 04:01:29,300
So obviously more partitions
will be created
5684
04:01:29,300 --> 04:01:32,182
but this will give you
the minimum Edge partitions
5685
04:01:32,182 --> 04:01:33,651
that should be created.
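A hedged sketch of the GraphLoader usage just described; the file path is a placeholder, each line of the file holds "srcId dstId", and the minimum/number-of-edge-partitions parameter is left out here because its name has varied across Spark versions.

```scala
import org.apache.spark.graphx._

// Load an edge list from disk. canonicalOrientation = true reorients edges
// so that srcId < dstId, as required by the connected components algorithm.
val loaded: Graph[Int, Int] =
  GraphLoader.edgeListFile(sc, "hdfs://path/to/edge_list.txt",
    canonicalOrientation = true)
// All vertex and edge attributes default to 1, e.g. vertex 4 becomes (4, 1).
```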
5686
04:01:33,651 --> 04:01:34,192
So I hope
5687
04:01:34,192 --> 04:01:36,900
that you guys are clear
with this graph loader
5688
04:01:36,900 --> 04:01:38,358
how this graph loader Works
5689
04:01:38,358 --> 04:01:41,300
how you can go ahead
and provide the edge list file
5690
04:01:41,300 --> 04:01:43,300
and how it will create the graph
5691
04:01:43,300 --> 04:01:47,124
from this Edge list file and
then this canonical orientation
5692
04:01:47,124 --> 04:01:50,300
where we are again going
and reorienting the graph
5693
04:01:50,300 --> 04:01:52,299
and then we have
Minimum Edge partition
5694
04:01:52,299 --> 04:01:54,900
which is giving the minimum
number of edge partitions
5695
04:01:54,900 --> 04:01:56,300
that should be created.
5696
04:01:56,300 --> 04:02:00,000
So now I guess you guys are
clear with the graph Builder.
5697
04:02:00,000 --> 04:02:03,400
So how to go ahead and use
this graph object
5698
04:02:03,400 --> 04:02:06,900
and how to create graph
using apply from edges
5699
04:02:06,900 --> 04:02:09,200
and fromEdgeTuples methods
5700
04:02:09,400 --> 04:02:11,700
and then I guess
you might be clear
5701
04:02:11,700 --> 04:02:13,586
with the graph loader object
5702
04:02:13,586 --> 04:02:17,715
and where you can go ahead and
create a graph from Edge list.
5703
04:02:17,715 --> 04:02:17,990
Now.
5704
04:02:17,990 --> 04:02:21,500
Let's move ahead and talk
about vertex and Edge rdd.
5705
04:02:21,900 --> 04:02:23,561
So as I already told you
5706
04:02:23,561 --> 04:02:27,007
that GraphX exposes
RDD views of the vertices
5707
04:02:27,007 --> 04:02:30,056
and edges stored
within the graph. However,
5708
04:02:30,056 --> 04:02:33,798
because GraphX again
maintains the vertices and edges
5709
04:02:33,798 --> 04:02:35,600
in optimize data structure
5710
04:02:35,600 --> 04:02:36,979
and these data structure
5711
04:02:36,979 --> 04:02:39,499
provide additional
functionalities as well.
5712
04:02:39,499 --> 04:02:42,679
Now, let us see some of
the additional functionalities
5713
04:02:42,679 --> 04:02:44,300
which are provided by them.
5714
04:02:44,465 --> 04:02:47,234
So let's first talk
about vertex rdd.
5715
04:02:47,600 --> 04:02:51,100
So I already told
you that VertexRDD
5716
04:02:51,100 --> 04:02:54,800
is basically extending
this RDD with vertex ID
5717
04:02:54,800 --> 04:02:59,338
and the vertex property and it
adds an additional constraint
5718
04:02:59,338 --> 04:03:05,600
that each vertex ID occurs only
once. Now moreover, VertexRDD
5719
04:03:05,800 --> 04:03:10,000
represents a set of vertices
each with an attribute
5720
04:03:10,000 --> 04:03:12,600
of type A now internally
5721
04:03:12,700 --> 04:03:17,600
what happens this is achieved
by storing the vertex attribute
5722
04:03:17,700 --> 04:03:19,184
in a reusable
5723
04:03:19,184 --> 04:03:21,030
hash map data structure.
5724
04:03:24,200 --> 04:03:27,700
So suppose, this is
our hash map data structure.
5725
04:03:27,700 --> 04:03:30,200
So suppose two vertex RDDs
5726
04:03:30,200 --> 04:03:34,840
are derived from the same
base VertexRDD. Suppose
5727
04:03:35,280 --> 04:03:37,600
These are two vertex rdd
5728
04:03:37,600 --> 04:03:41,200
which are basically derived
from this vertex rdd
5729
04:03:41,200 --> 04:03:44,400
so they can be joined
in constant time
5730
04:03:44,400 --> 04:03:46,100
without hash evaluations.
5731
04:03:46,100 --> 04:03:49,400
So you don't have to go ahead
and evaluate the properties
5732
04:03:49,400 --> 04:03:52,400
of both the vertices
you can easily go ahead
5733
04:03:52,400 --> 04:03:55,398
and you can join them
without hash evaluations,
5734
04:03:55,400 --> 04:03:58,288
and this is one of the ways
in which this vertex
5735
04:03:58,288 --> 04:04:00,800
RDD provides you
the optimization. Now,
5736
04:04:00,800 --> 04:04:03,900
to leverage this
indexed data structure
5737
04:04:04,200 --> 04:04:08,700
the vertex rdd exposes multiple
additional functionalities.
5738
04:04:09,000 --> 04:04:11,000
So it gives you
all these functions
5739
04:04:11,000 --> 04:04:12,000
as you can see here.
5740
04:04:12,300 --> 04:04:15,300
It gives you filter,
mapValues, then minus,
5741
04:04:15,300 --> 04:04:16,663
diff, leftJoin,
5742
04:04:16,663 --> 04:04:19,800
innerJoin and aggregateUsingIndex
functions.
5743
04:04:19,800 --> 04:04:22,600
So let us first discuss
about these functions.
5744
04:04:22,600 --> 04:04:26,800
So basically filter a function
filters the vertex set
5745
04:04:26,800 --> 04:04:31,700
but preserves the internal index
So based on some condition.
5746
04:04:31,700 --> 04:04:33,405
It filters the vertices
5747
04:04:33,405 --> 04:04:36,300
that are present
then in map values.
5748
04:04:36,300 --> 04:04:39,200
It is basically used
to transform the values
5749
04:04:39,200 --> 04:04:41,000
without changing the IDS
5750
04:04:41,000 --> 04:04:44,461
and which again preserves
your internal index.
5751
04:04:44,461 --> 04:04:49,399
So it does not change the IDs
of the vertices and it helps
5752
04:04:49,399 --> 04:04:53,100
in transforming those values
now talking about the minus
5753
04:04:53,100 --> 04:04:55,900
method, it returns what is unique
5754
04:04:55,900 --> 04:04:58,500
in the set based
on their vertex IDs.
5755
04:04:58,500 --> 04:04:59,500
So what happens
5756
04:04:59,500 --> 04:05:03,300
if you are providing two sets
of vertices, first contains V1, V2
5757
04:05:03,300 --> 04:05:06,100
and V3 and second
one contains V3,
5758
04:05:06,200 --> 04:05:08,276
so it will return V1 and V2
5759
04:05:08,276 --> 04:05:11,366
because they are unique
in both the sets
5760
04:05:11,700 --> 04:05:14,700
and it is basically done
with the help of vertex ID.
5761
04:05:14,900 --> 04:05:17,053
So next we have the diff function.
5762
04:05:17,100 --> 04:05:20,900
So it basically removes
the vertices from this set
5763
04:05:20,900 --> 04:05:25,800
that appear in another set. Then
we have leftJoin and innerJoin.
5764
04:05:25,800 --> 04:05:28,300
So join operators
basically take advantage
5765
04:05:28,300 --> 04:05:30,900
of the internal indexing
to accelerate join.
5766
04:05:30,900 --> 04:05:32,900
So you can go ahead
and you can perform left join
5767
04:05:32,900 --> 04:05:34,400
or you can perform inner join.
5768
04:05:34,453 --> 04:05:37,246
Next you have
aggregate using index.
5769
04:05:37,700 --> 04:05:40,800
So basically this aggregate
using index is nothing
5770
04:05:40,800 --> 04:05:42,400
but reduceByKey,
5771
04:05:42,500 --> 04:05:44,200
but it uses index
5772
04:05:44,300 --> 04:05:48,000
on this rdd to accelerate
the Reduce by key function
5773
04:05:48,000 --> 04:05:50,500
or you can say reduced
by key operation.
5774
04:05:50,700 --> 04:05:54,900
So again filter is actually
implemented using a BitSet, thereby
5775
04:05:54,900 --> 04:05:56,500
reusing the index
5776
04:05:56,500 --> 04:05:58,800
and preserving the ability to do
5777
04:05:58,800 --> 04:06:02,220
fast joins with other
vertex RDDs. Now similarly,
5778
04:06:02,220 --> 04:06:04,600
the mapValues operator as well
5779
04:06:04,600 --> 04:06:08,200
does not allow the map function
to change the vertex ID
5780
04:06:08,200 --> 04:06:09,600
and this again helps
5781
04:06:09,600 --> 04:06:13,120
in reusing the same
hash map data structure now both
5782
04:06:13,120 --> 04:06:14,533
of your left join as
5783
04:06:14,533 --> 04:06:17,900
well as your inner join
is able to identify
5784
04:06:17,900 --> 04:06:20,400
that whether the two vertex rdd
5785
04:06:20,400 --> 04:06:23,169
which are joining
are derived from the same.
5786
04:06:23,169 --> 04:06:24,208
Hash map or not.
5787
04:06:24,208 --> 04:06:28,300
And for this they basically use
a linear scan and again don't have
5788
04:06:28,300 --> 04:06:31,900
to go ahead and search
for costly Point lookups.
5789
04:06:31,900 --> 04:06:35,300
So this is the benefit
of using vertex rdd.
5790
04:06:35,500 --> 04:06:36,571
So to summarize
5791
04:06:36,571 --> 04:06:40,300
your vertex RDD uses a
hash map data structure,
5792
04:06:40,426 --> 04:06:42,273
which is again reusable.
5793
04:06:42,300 --> 04:06:44,700
They try to
preserve your indexes
5794
04:06:44,700 --> 04:06:48,500
so that it would be easier
to create a new vertex RDD or
5795
04:06:48,500 --> 04:06:51,404
derive a new vertex RDD
from them. Then again,
5796
04:06:51,404 --> 04:06:54,000
while performing some
join operations,
5797
04:06:54,000 --> 04:06:57,900
it is pretty much easy to go
ahead perform a linear scan
5798
04:06:57,900 --> 04:07:01,500
and then you can go ahead
and join those two vertex rdd.
5799
04:07:01,500 --> 04:07:05,423
So it actually helps
in optimizing your performance.
5800
04:07:05,700 --> 04:07:06,700
Now moving ahead.
5801
04:07:06,700 --> 04:07:10,200
Let's talk about
EdgeRDD. Now again,
5802
04:07:10,200 --> 04:07:13,900
as you can see your Edge
RDD is extending your RDD
5803
04:07:13,900 --> 04:07:15,400
with property Edge.
5804
04:07:15,400 --> 04:07:18,792
Now it organizes the edges
in block partitions using
5805
04:07:18,792 --> 04:07:21,700
one of the various
partitioning strategies,
5806
04:07:21,700 --> 04:07:25,608
which is again defined in Your
partition strategies attribute
5807
04:07:25,608 --> 04:07:28,800
or you can say partition
strategy parameter within
5808
04:07:28,800 --> 04:07:30,865
each partition each attribute
5809
04:07:30,865 --> 04:07:34,100
and adjacency structure
are stored separately
5810
04:07:34,100 --> 04:07:36,200
which enables the maximum reuse
5811
04:07:36,200 --> 04:07:38,200
when changing the
attribute values.
5812
04:07:38,600 --> 04:07:42,900
So basically what it does while
storing your Edge attributes
5813
04:07:42,900 --> 04:07:46,400
and your Source vertex
and destination vertex,
5814
04:07:46,400 --> 04:07:48,400
they are stored separately so
5815
04:07:48,400 --> 04:07:51,200
that changing the values
of the attributes
5816
04:07:51,200 --> 04:07:54,200
either of the source
vertex or destination vertex
5817
04:07:54,200 --> 04:07:55,500
or Edge attribute
5818
04:07:55,500 --> 04:07:58,300
so that it can be
reused as many times
5819
04:07:58,300 --> 04:08:01,600
as we need by changing
the attribute values itself.
5820
04:08:01,600 --> 04:08:04,713
So that once the vertex ID
is changed of an edge.
5821
04:08:04,713 --> 04:08:06,400
It could be easily changed
5822
04:08:06,400 --> 04:08:09,196
and the earlier part
can be reused now
5823
04:08:09,196 --> 04:08:10,314
as you can see,
5824
04:08:10,314 --> 04:08:13,518
we have three additional
functions over here
5825
04:08:13,518 --> 04:08:16,500
that is mapValues,
reverse and innerJoin.
5826
04:08:16,700 --> 04:08:19,000
So in EdgeRDD basically map
5827
04:08:19,000 --> 04:08:21,400
values is to transform
the edge attributes
5828
04:08:21,400 --> 04:08:23,200
while preserving the structure.
5829
04:08:23,200 --> 04:08:25,029
So it is helpful in transforming,
5830
04:08:25,029 --> 04:08:28,500
so you can use mapValues and
map the values of your EdgeRDD.
5831
04:08:28,800 --> 04:08:31,300
Then you can go ahead and use
this reverse function
5832
04:08:31,300 --> 04:08:35,400
which reverses the edges reusing
both attributes and structure.
5833
04:08:35,400 --> 04:08:37,531
So the source
becomes destination.
5834
04:08:37,531 --> 04:08:40,179
The destination becomes
source. Now talking
5835
04:08:40,179 --> 04:08:41,600
about this inner join.
5836
04:08:41,700 --> 04:08:43,600
So it basically joins
5837
04:08:43,600 --> 04:08:48,500
two edge RDDs partitioned using
the same partitioning strategy.
5838
04:08:49,100 --> 04:08:52,900
Now as we already discussed,
the same partition strategy is
5839
04:08:52,900 --> 04:08:55,585
required because again
to co-locate you need
5840
04:08:55,585 --> 04:08:57,600
to use same partition strategy
5841
04:08:57,600 --> 04:08:59,682
and your identical
edges should reside
5842
04:08:59,682 --> 04:09:02,800
in same partition to perform
join operation over them.
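A hedged sketch of the EdgeRDD-specific operations just listed; `edgesA` and `edgesB` are assumed names for two EdgeRDD[String] values partitioned with the same strategy.

```scala
// Transform edge attributes while keeping the structure.
val weighted = edgesA.mapValues(e => e.attr.length)
// Swap source and destination of every edge, reusing attributes and structure.
val flipped  = edgesA.reverse
// Join two edge RDDs on their (srcId, dstId) pairs.
val joinedE  = edgesA.innerJoin(edgesB) { (src, dst, a, b) => a + "/" + b }
```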
5843
04:09:02,800 --> 04:09:03,092
Now.
5844
04:09:03,092 --> 04:09:07,290
Let me quickly give you an idea
about optimization performed
5845
04:09:07,290 --> 04:09:08,500
in GraphX.
5846
04:09:08,536 --> 04:09:10,151
So GraphX basically
5847
04:09:10,151 --> 04:09:14,844
adopts a vertex-cut approach to
distributed graph partitioning.
5848
04:09:15,500 --> 04:09:20,700
So suppose you have five vertex
and then they are connected.
5849
04:09:20,800 --> 04:09:23,100
Let's not worry
about the arrows, right?
5850
04:09:23,100 --> 04:09:26,200
Now or let's not worry
about Direction right now.
5851
04:09:26,200 --> 04:09:29,200
So either it can be divided
from the edges,
5852
04:09:29,200 --> 04:09:32,287
which is one approach or again.
5853
04:09:32,287 --> 04:09:34,825
It can be divided
from the vertex.
5854
04:09:35,300 --> 04:09:36,840
So in that situation,
5855
04:09:36,840 --> 04:09:39,700
it would be divided
something like this.
5856
04:09:41,200 --> 04:09:43,500
So rather than splitting graphs
5857
04:09:43,500 --> 04:09:47,900
along edges, GraphX partitions
the graph along vertices,
5858
04:09:47,900 --> 04:09:50,305
which can again
reduce the communication
5859
04:09:50,305 --> 04:09:51,600
and storage overhead.
5860
04:09:51,600 --> 04:09:53,523
So logically what happens
5861
04:09:53,523 --> 04:09:56,500
that your edges
are assigned to machines
5862
04:09:56,500 --> 04:10:00,200
and allowing your vertices
to span multiple machines.
5863
04:10:00,200 --> 04:10:03,500
So the vertices are basically
divided across multiple machines
5864
04:10:03,500 --> 04:10:06,900
and each edge is assigned
to a single machine, right?
5865
04:10:06,900 --> 04:10:09,600
then the exact method
of assigning edges.
5866
04:10:09,600 --> 04:10:11,800
Depends on the
partition strategy.
5867
04:10:11,800 --> 04:10:15,400
So the partition strategy is
the one which basically decides
5868
04:10:15,400 --> 04:10:16,800
how to assign the edges
5869
04:10:16,800 --> 04:10:20,300
to different machines or you
can say different partitions.
5870
04:10:20,300 --> 04:10:21,400
So user can choose
5871
04:10:21,400 --> 04:10:24,900
between different strategies
by partitioning the graph
5872
04:10:24,900 --> 04:10:28,200
with the help of this graph dot
partitionBy operator.
5873
04:10:28,200 --> 04:10:29,500
Now as we discussed
5874
04:10:29,500 --> 04:10:31,329
that this graph dot partition
5875
04:10:31,329 --> 04:10:34,400
by operator re-partitions
and then it divides
5876
04:10:34,400 --> 04:10:36,900
or relocates the edges
5877
04:10:37,000 --> 04:10:39,900
and basically we try
to put the identical edges.
5878
04:10:39,900 --> 04:10:41,500
On a single partition
5879
04:10:41,500 --> 04:10:43,827
so that different
operations like join
5880
04:10:43,827 --> 04:10:45,400
can be performed on them.
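A minimal sketch of the re-partitioning just described, assuming the earlier `graph`; EdgePartition2D is just one of the built-in partition strategies.

```scala
import org.apache.spark.graphx._

// Re-partition the edges so identical edges land in the same partition,
// which operations like groupEdges rely on.
val partitioned = graph.partitionBy(PartitionStrategy.EdgePartition2D)
```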
5881
04:10:45,400 --> 04:10:49,629
So once the edges have been
partitioned, the main challenge
5882
04:10:49,629 --> 04:10:52,690
is efficiently joining
the vertex attributes
5883
04:10:52,690 --> 04:10:54,400
with the edges right now
5884
04:10:54,400 --> 04:10:56,000
because real world graphs
5885
04:10:56,000 --> 04:10:58,600
typically have more
edges than vertices.
5886
04:10:58,600 --> 04:11:03,300
So we move vertex attributes
to the edges and because not all
5887
04:11:03,300 --> 04:11:07,800
the partitions will contain
edges adjacent to all vertices.
5888
04:11:07,800 --> 04:11:09,755
We internally maintain a
5889
04:11:09,755 --> 04:11:10,700
routing table.
5890
04:11:10,700 --> 04:11:14,400
So the routing table is the one
who will broadcast the vertices
5891
04:11:14,400 --> 04:11:18,146
and then will implement the join
required for the operations.
5892
04:11:18,146 --> 04:11:18,946
So, I hope
5893
04:11:18,946 --> 04:11:22,200
that you guys are clear
how VertexRDD and EdgeRDD
5894
04:11:22,200 --> 04:11:23,338
works and then
5895
04:11:23,338 --> 04:11:25,800
how the optimizations take place
5896
04:11:25,800 --> 04:11:29,900
and how vertex cut optimizes
the operations in GraphX.
5897
04:11:30,100 --> 04:11:32,600
Now, let's talk
about graph operators.
5898
04:11:32,600 --> 04:11:35,480
So just as RDDs
have basic operations
5899
04:11:35,480 --> 04:11:37,400
like map, filter, reduceByKey,
5900
04:11:37,400 --> 04:11:41,300
property graphs also have a
collection of basic operators
5901
04:11:41,300 --> 04:11:44,530
that take user-defined functions
and produce new graphs
5902
04:11:44,530 --> 04:11:48,029
that transform properties and
structure. Now the core operators
5903
04:11:48,029 --> 04:11:50,900
that have optimized
implementations are basically
5904
04:11:50,900 --> 04:11:54,061
defined in the Graph class,
and convenient operators
5905
04:11:54,061 --> 04:11:55,262
that are expressed
5906
04:11:55,262 --> 04:11:57,600
as a composition
of the core operators
5907
04:11:57,600 --> 04:12:00,500
are basically defined
in the GraphOps class.
5908
04:12:00,500 --> 04:12:03,346
But in Scala, due to implicits,
the operators
5909
04:12:03,346 --> 04:12:04,800
in the GraphOps class
5910
04:12:04,800 --> 04:12:08,500
they are automatically available
as members of the Graph class
5911
04:12:08,600 --> 04:12:09,600
so you can use
5912
04:12:09,700 --> 04:12:12,450
them using the Graph
class as well. Now
5913
04:12:12,500 --> 04:12:14,593
as you can see we have
list of operators
5914
04:12:14,593 --> 04:12:15,858
like property operator,
5915
04:12:15,858 --> 04:12:17,800
then you have
structural operator.
5916
04:12:17,800 --> 04:12:19,300
Then you have join operator
5917
04:12:19,300 --> 04:12:22,000
and then you have something
called neighborhood operator.
5918
04:12:22,000 --> 04:12:24,700
So let's talk about them one
by one now talking
5919
04:12:24,700 --> 04:12:26,400
about property operators,
5920
04:12:26,400 --> 04:12:30,016
like RDDs have the map operator,
the property graph contains
5921
04:12:30,016 --> 04:12:34,168
mapVertices, mapEdges and mapTriplets
operators. Now,
5922
04:12:34,168 --> 04:12:38,445
each of these operators basically
yields a new graph with the vertex
5923
04:12:38,445 --> 04:12:39,600
or Edge property.
5924
04:12:39,600 --> 04:12:42,600
modified by the user-defined
map function. Based
5925
04:12:42,600 --> 04:12:46,366
on the user-defined map function
it basically transforms
5926
04:12:46,366 --> 04:12:47,915
or modifies the vertices
5927
04:12:47,915 --> 04:12:49,202
if it's map vertices
5928
04:12:49,202 --> 04:12:51,489
or it transform
or modify the edges
5929
04:12:51,489 --> 04:12:53,170
if it is map edges method
5930
04:12:53,170 --> 04:12:56,600
or mapEdges operator, and so
on for mapTriplets as well.
5931
04:12:56,600 --> 04:13:00,053
Now the important thing
to note is that in each case.
5932
04:13:00,053 --> 04:13:02,700
The graph structure
is unaffected and this
5933
04:13:02,700 --> 04:13:04,968
is a key feature
of these operators.
5934
04:13:04,968 --> 04:13:07,513
Basically which allows
the resulting graph
5935
04:13:07,513 --> 04:13:09,500
to reuse the structural indices.
5936
04:13:09,500 --> 04:13:10,300
Of the original graph
5937
04:13:10,300 --> 04:13:12,600
each and every time you
apply a transformation,
5938
04:13:12,600 --> 04:13:14,700
so it creates a new graph
5939
04:13:14,700 --> 04:13:17,500
and the original
graph is unaffected
5940
04:13:17,500 --> 04:13:19,200
so that it can be used
5941
04:13:19,200 --> 04:13:22,500
so you can see it can be reused
in creating new graphs.
5942
04:13:22,500 --> 04:13:22,800
Right?
5943
04:13:22,800 --> 04:13:24,600
So your structure indices
5944
04:13:24,600 --> 04:13:27,700
can be used from the original
graph. Now talking
5945
04:13:27,700 --> 04:13:29,400
about this map vertices.
5946
04:13:29,400 --> 04:13:31,152
Let me use the highlighter.
5947
04:13:31,152 --> 04:13:32,900
So first we have map vertices.
5948
04:13:32,900 --> 04:13:34,200
So basically it maps the vertices
5949
04:13:34,200 --> 04:13:36,100
or you can say
transforms the vertices.
5950
04:13:36,100 --> 04:13:39,300
So you provide vertex ID
and then vertex.
5951
04:13:40,100 --> 04:13:43,400
And you apply some of the
transformation function using
5952
04:13:43,400 --> 04:13:46,600
which so it will give you
a graph with a new vertex property
5953
04:13:46,600 --> 04:13:49,500
as you can see now same is
the case with map edges.
5954
04:13:49,500 --> 04:13:53,800
So again you provide the edges
then you transform the edges.
5955
04:13:53,800 --> 04:13:57,600
So initially it was ED and then
you transform it to ED2,
5956
04:13:57,700 --> 04:13:58,600
and then the graph
5957
04:13:58,600 --> 04:14:01,000
which is given or you
can see the graph
5958
04:14:01,000 --> 04:14:04,947
which is returned is the graph
for the changed each attribute.
5959
04:14:04,947 --> 04:14:07,535
So you can see here
the attribute is ed2.
5960
04:14:07,535 --> 04:14:09,800
Same is the case
with mapTriplets.
5961
04:14:09,900 --> 04:14:11,500
So using mapTriplets,
5962
04:14:11,500 --> 04:14:14,657
you can use the edge triplet
where you can go ahead
5963
04:14:14,657 --> 04:14:18,700
and Target the vertex Properties
or you can say vertex attributes
5964
04:14:18,700 --> 04:14:21,817
or to be more specific
Source vertex attribute as well
5965
04:14:21,817 --> 04:14:23,641
as destination vertex attribute
5966
04:14:23,641 --> 04:14:26,900
and the edge attribute and then
you can apply transformation
5967
04:14:26,900 --> 04:14:28,654
over those Source attributes
5968
04:14:28,654 --> 04:14:31,600
or destination attributes
or the edge attributes
5969
04:14:31,600 --> 04:14:34,500
so you can change them and then
it will again return a graph
5970
04:14:34,500 --> 04:14:36,300
with the transformed values now,
5971
04:14:36,300 --> 04:14:39,000
I guess you guys are clear with
the property operators.
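A hedged sketch of the three property operators on the example graph from earlier (vertex attribute (username, profession), edge attribute a relation string); each call returns a new graph and leaves the original graph's structure and indices untouched.

```scala
// Transform vertex attributes without changing vertex IDs.
val upperNames  = graph.mapVertices((id, attr) => (attr._1.toUpperCase, attr._2))
// Transform edge attributes while keeping the structure.
val edgeLengths = graph.mapEdges(e => e.attr.length)
// Transform edge attributes using the whole triplet (source, edge, destination).
val labelled    = graph.mapTriplets(t => t.srcAttr._1 + "->" + t.dstAttr._1 + ":" + t.attr)
```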
5972
04:14:39,000 --> 04:14:40,819
So let's move to the next operator
5973
04:14:40,819 --> 04:14:44,958
that is structural operators. So
currently GraphX supports only
5974
04:14:44,958 --> 04:14:48,200
a simple set of commonly
use structural operators.
5975
04:14:48,200 --> 04:14:50,712
And we expect more
to be added in future.
5976
04:14:50,712 --> 04:14:53,220
Now you can see
in structural operator.
5977
04:14:53,220 --> 04:14:54,800
We have reversed operator.
5978
04:14:54,800 --> 04:14:56,464
Then we have subgraph operator.
5979
04:14:56,464 --> 04:14:57,923
Then we have masks operator
5980
04:14:57,923 --> 04:15:00,100
and then we have
group edges operator.
5981
04:15:00,100 --> 04:15:04,096
So let's talk about them one by
one so first reverse operator,
5982
04:15:04,096 --> 04:15:05,640
so as the name suggests,
5983
04:15:05,640 --> 04:15:09,500
it returns a new graph with all
the edge directions reversed.
5984
04:15:09,500 --> 04:15:11,750
So basically it will change
your Source vertex
5985
04:15:11,750 --> 04:15:12,950
into destination vertex,
5986
04:15:12,950 --> 04:15:15,108
and then it will change
your destination vertex
5987
04:15:15,108 --> 04:15:16,000
into Source vertex.
5988
04:15:16,000 --> 04:15:18,500
So it will reverse
the direction of your edges.
5989
04:15:18,500 --> 04:15:21,600
And the reverse operation
does not modify Vertex
5990
04:15:21,600 --> 04:15:23,300
or Edge Properties or change.
5991
04:15:23,300 --> 04:15:24,300
The number of edges.
5992
04:15:24,400 --> 04:15:25,739
It can be implemented
5993
04:15:25,739 --> 04:15:28,800
efficiently without
data movement or duplication.
5994
04:15:28,800 --> 04:15:31,400
So next we have
subgraph operator.
5995
04:15:31,400 --> 04:15:34,615
So basically subgraph
operator takes the vertex
5996
04:15:34,615 --> 04:15:35,967
and Edge predicates
5997
04:15:35,967 --> 04:15:38,577
or you can say Vertex
or edge condition
5998
04:15:38,577 --> 04:15:41,600
and returns the graph
containing only the vertices
5999
04:15:41,600 --> 04:15:44,835
that satisfy those vertex
predicates and then it Returns
6000
04:15:44,835 --> 04:15:47,306
the edges that satisfy
the edge predicates.
6001
04:15:47,306 --> 04:15:50,200
So basically we'll give
a condition about edges and
6002
04:15:50,200 --> 04:15:51,954
vertices and those predicates
6003
04:15:51,954 --> 04:15:54,009
which are fulfilled
or those vertex
6004
04:15:54,009 --> 04:15:57,303
which are fulfilling the
predicates will be only returned
6005
04:15:57,303 --> 04:15:59,302
and again same is the case
with your edges
6006
04:15:59,302 --> 04:16:01,237
and then your graph
will be connected.
6007
04:16:01,237 --> 04:16:03,800
Now, the subgraph operator
can be used in a number
6008
04:16:03,800 --> 04:16:06,953
of situations to restrict
the graph to the vertices
6009
04:16:06,953 --> 04:16:08,245
and edges of interest
6010
04:16:08,245 --> 04:16:10,615
and eliminate the Rest
of the components,
6011
04:16:10,615 --> 04:16:13,450
right so you can see
this is The Edge predicate.
6012
04:16:13,450 --> 04:16:15,200
This is the vertex predicate.
6013
04:16:15,200 --> 04:16:18,900
Then we are providing
the extra plate with the vertex
6014
04:16:18,900 --> 04:16:20,500
and Edge attributes
6015
04:16:20,500 --> 04:16:21,567
and we are waiting
6016
04:16:21,567 --> 04:16:24,700
for the Boolean value then
same is the case with vertex.
6017
04:16:24,700 --> 04:16:27,100
We're providing the vertex
properties over here
6018
04:16:27,100 --> 04:16:29,150
or you can say vertex
attribute over here.
6019
04:16:29,150 --> 04:16:29,925
And then again,
6020
04:16:29,925 --> 04:16:32,126
it will yield a graph
which is a sub graph
6021
04:16:32,126 --> 04:16:35,400
of the original graph which will
fulfill those predicates now,
6022
04:16:35,400 --> 04:16:37,600
the next operator
is mask operator.
6023
04:16:37,600 --> 04:16:39,746
So the mask operator constructs a
6024
04:16:39,746 --> 04:16:43,466
graph by returning a graph
that contains the vertices
6025
04:16:43,466 --> 04:16:46,888
and edges that are also found
in the input graph.
6026
04:16:46,888 --> 04:16:48,637
Basically, you can treat
6027
04:16:48,637 --> 04:16:52,500
this mask operator as
a comparison between two graphs.
6028
04:16:52,500 --> 04:16:53,314
So suppose.
6029
04:16:53,314 --> 04:16:54,500
We are comparing
6030
04:16:54,500 --> 04:16:58,100
graph 1 and graph 2 and it
will return this sub graph
6031
04:16:58,100 --> 04:17:00,800
which is common in both
the graphs again.
6032
04:17:00,800 --> 04:17:04,600
This can be used in conjunction
with the subgraph operator.
6033
04:17:04,600 --> 04:17:05,900
Basically to restrict
6034
04:17:05,900 --> 04:17:09,400
a graph based on properties
in another related graph, right.
6035
04:17:09,400 --> 04:17:12,280
And so I guess you guys are
clear with the mask operator.
6036
04:17:12,280 --> 04:17:13,000
So we're here.
6037
04:17:13,000 --> 04:17:14,233
We're providing a graph
6038
04:17:14,233 --> 04:17:16,776
and then we are providing
the input graph as well.
6039
04:17:16,776 --> 04:17:18,671
And then it will return a graph
6040
04:17:18,671 --> 04:17:21,700
which is basically a subset
of both of these graph
6041
04:17:21,700 --> 04:17:23,600
not talking about group edges.
6042
04:17:23,600 --> 04:17:26,796
So the group edges operator
merges the parallel edges
6043
04:17:26,796 --> 04:17:28,446
in the multigraph, right?
6044
04:17:28,446 --> 04:17:29,683
So what it does it,
6045
04:17:29,683 --> 04:17:33,244
the duplicate edges between pair
of vertices are merged
6046
04:17:33,244 --> 04:17:35,800
or you can say they
can be aggregated
6047
04:17:35,800 --> 04:17:37,325
or perform some action
6048
04:17:37,325 --> 04:17:41,000
and in many numerical
applications edges can be added
6049
04:17:41,000 --> 04:17:43,702
and their weights can be
combined into a single edge,
6050
04:17:43,702 --> 04:17:46,804
right which will again
reduce the size of the graph.
6051
04:17:46,804 --> 04:17:47,900
So for an example,
6052
04:17:47,900 --> 04:17:51,400
you have two vertices V1 and V2
and there are two edges
6053
04:17:51,400 --> 04:17:53,100
with weight 10 and 15.
6054
04:17:53,100 --> 04:17:56,291
So actually what you can do is
you can merge those two edges
6055
04:17:56,291 --> 04:17:59,700
if they have the same direction and
you can represent the weight as 25.
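The four structural operators just covered can be sketched as follows on the earlier example graph; the predicate and the choice of partition strategy are illustrative assumptions.

```scala
import org.apache.spark.graphx._

// reverse: flip the direction of every edge.
val reversed  = graph.reverse
// subgraph: keep only vertices that satisfy the vertex predicate
// (and the edges whose endpoints both survive).
val validGraph = graph.subgraph(vpred = (id, attr) => attr._2 != "Missing")
// mask: keep only the vertices and edges that also exist in validGraph.
val restricted = graph.mask(validGraph)
// groupEdges: merge parallel edges; it assumes identical edges are already
// co-located, hence the partitionBy call first.
val merged = graph.partitionBy(PartitionStrategy.RandomVertexCut)
                  .groupEdges((a, b) => a + ", " + b)
```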
6056
04:17:59,700 --> 04:18:02,100
So this will actually
reduce the size
6057
04:18:02,100 --> 04:18:05,144
of the graph now looking
at the next operator,
6058
04:18:05,144 --> 04:18:06,700
which is join operator.
6059
04:18:06,700 --> 04:18:09,400
So in many cases
it is necessary.
6060
04:18:09,400 --> 04:18:13,151
To join data from external
collection with graphs, right?
6061
04:18:13,151 --> 04:18:13,909
For example.
6062
04:18:13,909 --> 04:18:16,100
We might have
an extra user property
6063
04:18:16,100 --> 04:18:18,855
that we want to merge
with the existing graph
6064
04:18:18,855 --> 04:18:21,186
or we might want
to pull vertex property
6065
04:18:21,186 --> 04:18:23,100
from one graph to another right.
6066
04:18:23,100 --> 04:18:24,700
So these are some
of the situations
6067
04:18:24,700 --> 04:18:27,000
where you go ahead and use
this join operators.
6068
04:18:27,000 --> 04:18:28,900
So now as you can see over here,
6069
04:18:28,900 --> 04:18:31,100
the first operator
is joined vertices.
6070
04:18:31,100 --> 04:18:34,792
So the joinVertices operator
joins the vertices
6071
04:18:34,792 --> 04:18:36,176
with the input rdd
6072
04:18:36,200 --> 04:18:39,516
and returns a new graph
with the vertex properties
6073
04:18:39,516 --> 04:18:42,700
obtained after applying
the user-defined map function
6074
04:18:42,700 --> 04:18:45,400
now the vertices
without a matching value
6075
04:18:45,400 --> 04:18:49,500
in the rdd basically retains
their original value not talking
6076
04:18:49,500 --> 04:18:51,400
about outer join vertices.
6077
04:18:51,400 --> 04:18:55,100
So it behaves similar
to join vertices except that
6078
04:18:55,100 --> 04:18:59,586
which user-defined map function
is applied to all the vertices
6079
04:18:59,586 --> 04:19:02,200
and can change
the vertex property type.
6080
04:19:02,200 --> 04:19:05,600
So suppose that you have
a old graph which has
6081
04:19:05,600 --> 04:19:08,100
a Vertex attribute as old price
6082
04:19:08,200 --> 04:19:10,700
and then you created
a new a graph from it
6083
04:19:10,700 --> 04:19:13,735
and then it has the vertex
attribute as new price.
6084
04:19:13,735 --> 04:19:16,645
So you can go ahead
and join two of these graphs
6085
04:19:16,645 --> 04:19:19,249
and you can perform
an aggregation of both
6086
04:19:19,249 --> 04:19:21,725
the Old and New prices
in the new graph.
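A hedged sketch of the two join operators on the example graph; `extraInfo` is an assumed RDD of extra per-user data (say, follower counts) keyed by vertex ID, and `sc` is the SparkContext.

```scala
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val extraInfo: RDD[(VertexId, Int)] = sc.parallelize(Seq((1L, 120), (3L, 45)))

// joinVertices: update matching vertices; unmatched vertices keep their
// original value and the vertex property type stays the same.
val updated = graph.joinVertices(extraInfo) {
  (id, attr, followers) => (attr._1, attr._2 + s" ($followers followers)")
}

// outerJoinVertices: the map function sees every vertex (Option for missing
// rows) and may change the vertex property type entirely.
val followerGraph: Graph[Int, String] =
  graph.outerJoinVertices(extraInfo) { (id, attr, opt) => opt.getOrElse(0) }
```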
6087
04:19:21,725 --> 04:19:25,265
So in this kind of situation
join vertices are used
6088
04:19:25,265 --> 04:19:26,389
now moving ahead.
6089
04:19:26,389 --> 04:19:29,814
Let's talk about neighborhood
aggregation now key step
6090
04:19:29,814 --> 04:19:33,239
in many graph analytics
is aggregating the information
6091
04:19:33,239 --> 04:19:36,600
about the neighborhood
of each vertex for an example.
6092
04:19:36,600 --> 04:19:39,500
We might want to know the number
of followers each user has
6093
04:19:39,700 --> 04:19:41,200
Or the average age
6094
04:19:41,200 --> 04:19:45,600
of the follower of each user now
many iterative graph algorithms,
6095
04:19:45,600 --> 04:19:47,416
like pagerank shortest path,
6096
04:19:47,416 --> 04:19:50,501
then connected components
repeatedly aggregate
6097
04:19:50,501 --> 04:19:52,893
the properties of
neighboring vertices.
6098
04:19:52,893 --> 04:19:56,200
Now, it has four operators
in neighborhood aggregation.
6099
04:19:56,200 --> 04:19:58,803
So the first one is
your aggregate messages.
6100
04:19:58,803 --> 04:20:01,500
So the core aggregation
operation in graphics
6101
04:20:01,500 --> 04:20:02,900
is aggregate messages.
6102
04:20:02,900 --> 04:20:04,090
Now this operator
6103
04:20:04,090 --> 04:20:07,100
applies a user-defined
send message function
6104
04:20:07,100 --> 04:20:10,799
as you can see over here
to Each of the edge triplet
6105
04:20:10,799 --> 04:20:11,600
in the graph
6106
04:20:11,600 --> 04:20:14,230
and then it uses
merge message function
6107
04:20:14,230 --> 04:20:17,900
to aggregate those messages
at the destination vertex.
6108
04:20:18,000 --> 04:20:19,900
Now the user-defined
6109
04:20:19,900 --> 04:20:23,150
send message function
takes an edge context
6110
04:20:23,150 --> 04:20:26,200
as you can see and
which exposes the source
6111
04:20:26,200 --> 04:20:29,892
and destination attributes
along with the edge attribute
6112
04:20:29,892 --> 04:20:32,399
and functions like send
to Source or send
6113
04:20:32,399 --> 04:20:35,303
to destination is used
to send messages to source
6114
04:20:35,303 --> 04:20:37,013
and destination attributes.
6115
04:20:37,013 --> 04:20:39,800
Now you can think
of send message as the map.
6116
04:20:39,800 --> 04:20:43,592
Function in mapreduce and
the user-defined merge function
6117
04:20:43,592 --> 04:20:46,000
which actually takes
the two messages
6118
04:20:46,000 --> 04:20:48,200
which are present
on the same Vertex
6119
04:20:48,200 --> 04:20:50,784
or you can see
the same destination vertex
6120
04:20:50,784 --> 04:20:52,090
and it again combines
6121
04:20:52,090 --> 04:20:55,662
or aggregate those messages
and produces a single message.
6122
04:20:55,662 --> 04:20:58,146
Now, you can think
of the merge message
6123
04:20:58,146 --> 04:21:00,500
as the reduce function
in mapreduce. Now,
6124
04:21:00,500 --> 04:21:05,100
the aggregate messages operator
returns a Vertex rdd.
6125
04:21:05,100 --> 04:21:08,128
Basically, it contains
the aggregated messages at each
6126
04:21:08,128 --> 04:21:09,657
of the destination vertices,
6127
04:21:09,657 --> 04:21:10,600
and the vertices
6128
04:21:10,600 --> 04:21:13,815
that did not receive
a message are not included
6129
04:21:13,815 --> 04:21:15,693
in the returned vertex rdd.
6130
04:21:15,693 --> 04:21:17,028
So only those vertex
6131
04:21:17,028 --> 04:21:20,500
are returned which actually
have received the message
6132
04:21:20,500 --> 04:21:22,956
and then those messages
have been merged.
6133
04:21:22,956 --> 04:21:25,250
Any vertex
which hasn't received
6134
04:21:25,250 --> 04:21:28,437
the message will not be included
in the returned rdd,
6135
04:21:28,437 --> 04:21:31,500
or you can say the returned
vertex rdd. Now in addition,
6136
04:21:31,500 --> 04:21:34,000
as you can see we have
the triplet fields.
6137
04:21:34,000 --> 04:21:37,519
So aggregate messages takes
an optional triplet fields,
6138
04:21:37,519 --> 04:21:39,400
which indicates what data is
6139
04:21:39,400 --> 04:21:41,304
accessed in the edge context.
6140
04:21:41,304 --> 04:21:42,752
So the possible options
6141
04:21:42,752 --> 04:21:45,900
for the triplet fields
are defined in TripletFields.
6142
04:21:45,900 --> 04:21:48,600
The default value
of triplet fields is Triplet
6143
04:21:48,600 --> 04:21:52,300
Fields.All, as you can see over
here. This basically indicates
6144
04:21:52,300 --> 04:21:55,600
that the user-defined send
message function may access
6145
04:21:55,600 --> 04:21:58,074
any of the fields
in the edge context.
6146
04:21:58,074 --> 04:22:01,982
So this triplet field argument
can be used to notify GraphX
6147
04:22:01,982 --> 04:22:05,549
that only this part of
the edge context will be needed,
6148
04:22:05,549 --> 04:22:09,491
which basically allows GraphX
to select an optimized joining
6149
04:22:09,491 --> 04:22:10,700
strategy. So I hope
6150
04:22:10,700 --> 04:22:13,500
that you guys are clear
with the aggregate messages.
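For reference, a minimal Scala sketch of the idea, assuming a Graph called graph is already in scope: re-computing each vertex's in-degree with aggregateMessages, where sendMsg plays the role of the map function, mergeMsg the role of the reduce function, and TripletFields.None tells GraphX that no attributes are needed.

import org.apache.spark.graphx.{TripletFields, VertexRDD}

val inDeg: VertexRDD[Int] = graph.aggregateMessages[Int](
  ctx => ctx.sendToDst(1),   // send one message along each incoming edge (the "map" step)
  _ + _,                     // merge the messages at each destination vertex (the "reduce" step)
  TripletFields.None         // no vertex or edge attributes are accessed
)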
6151
04:22:13,500 --> 04:22:16,794
Let's quickly move ahead
and look at the second operator.
6152
04:22:16,794 --> 04:22:20,019
So the second operator is
mapreduce triplet transition.
6153
04:22:20,019 --> 04:22:21,400
Now in earlier versions
6154
04:22:21,400 --> 04:22:24,700
of Graphics neighborhood
aggregation was accomplished
6155
04:22:24,700 --> 04:22:27,272
using the mapreduce
triplets operator.
6156
04:22:27,272 --> 04:22:29,802
This mapreduce triplet
operator is used
6157
04:22:29,802 --> 04:22:31,814
in older versions of Graphics.
6158
04:22:31,814 --> 04:22:35,100
This operator takes
the user-defined map function,
6159
04:22:35,100 --> 04:22:38,900
which is applied to each triplet
and can yield messages
6160
04:22:38,900 --> 04:22:42,300
which are aggregated using the
user-defined reduce function.
6161
04:22:42,300 --> 04:22:44,300
This one is the user-defined
map function.
6162
04:22:44,300 --> 04:22:46,600
And this one is your user
defined reduce function.
6163
04:22:46,600 --> 04:22:49,081
So it basically applies
the map function
6164
04:22:49,081 --> 04:22:50,305
to all the triplets
6165
04:22:50,305 --> 04:22:53,654
and then the aggregate
those messages using this user
6166
04:22:53,654 --> 04:22:55,171
defined reduce function.
6167
04:22:55,171 --> 04:22:58,900
Now the newer version of this
mapreduce triplets operator
6168
04:22:58,900 --> 04:23:01,770
is aggregate messages.
Now moving ahead,
6169
04:23:01,770 --> 04:23:04,900
Let's talk about Computing
degree information operator.
6170
04:23:04,900 --> 04:23:07,900
So one of the common
aggregation task is Computing
6171
04:23:07,900 --> 04:23:09,579
the degree of each vertex.
6172
04:23:09,579 --> 04:23:12,842
That is the number of edges
adjacent to each vertex.
6173
04:23:12,842 --> 04:23:15,072
Now in the context
of directed graph.
6174
04:23:15,072 --> 04:23:18,400
It is often necessary to know
the in degree out degree.
6175
04:23:18,400 --> 04:23:20,300
Then the total degree of vertex.
6176
04:23:20,300 --> 04:23:22,800
These kind of things are
pretty much important
6177
04:23:22,800 --> 04:23:25,389
and the graph Ops class
contain a collection
6178
04:23:25,389 --> 04:23:28,400
of operators to compute
the degrees of each vertex.
6179
04:23:28,500 --> 04:23:29,800
So as you can see,
6180
04:23:29,800 --> 04:23:33,100
we have maximum in degree,
then maximum out degree,
6181
04:23:33,100 --> 04:23:36,100
then maximum degrees.
Maximum in degree will tell
6182
04:23:36,100 --> 04:23:39,400
us the maximum number of
incoming edges, then maximum out
6183
04:23:39,400 --> 04:23:42,325
degree will tell us the
maximum number of outgoing edges,
6184
04:23:42,325 --> 04:23:43,510
and this Max degree
6185
04:23:43,510 --> 04:23:46,685
will actually tell us the number
of input as well as output edges.
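A minimal sketch of these degree operators, assuming a Graph called graph is in scope (this mirrors the usual GraphOps usage; the helper max is defined here just for the example):

// Degree information from GraphOps, each a VertexRDD[Int].
val inDegrees  = graph.inDegrees
val outDegrees = graph.outDegrees
val degrees    = graph.degrees

// Pick the (vertexId, degree) pair with the largest degree value.
def max(a: (Long, Int), b: (Long, Int)) = if (a._2 > b._2) a else b

val maxInDegree  = inDegrees.reduce(max)
val maxOutDegree = outDegrees.reduce(max)
val maxDegree    = degrees.reduce(max)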
6186
04:23:46,685 --> 04:23:49,572
Now moving
ahead to the next operator,
6187
04:23:49,572 --> 04:23:52,300
that is collecting
neighbors. In some cases,
6188
04:23:52,300 --> 04:23:54,182
It may be easier to express
6189
04:23:54,182 --> 04:23:57,600
the computation by collecting
neighboring vertices
6190
04:23:57,600 --> 04:24:00,000
and their attribute
at each vertex.
6191
04:24:00,000 --> 04:24:02,624
Now, this can be easily
accomplished using
6192
04:24:02,624 --> 04:24:06,400
the collect neighbors ID and
the collect neighbors operator.
6193
04:24:06,400 --> 04:24:09,600
So basically your collect
neighbor ID takes
6194
04:24:09,600 --> 04:24:12,200
The Edge direction
as the parameter
6195
04:24:12,300 --> 04:24:14,400
and it returns a Vertex rdd
6196
04:24:14,400 --> 04:24:17,400
that contains the array
of vertex ID
6197
04:24:17,500 --> 04:24:20,000
that is neighboring
to the particular vertex
6198
04:24:20,000 --> 04:24:23,400
Now similarly, the collect
neighbors again takes
6199
04:24:23,400 --> 04:24:25,717
the edge directions as the input
6200
04:24:25,717 --> 04:24:28,000
and it will return you the array
6201
04:24:28,000 --> 04:24:31,600
with the vertex ID and
the vertex attribute, both.
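A minimal sketch of both operators, assuming a Graph called graph is in scope; each takes an edge direction and returns one entry per vertex:

import org.apache.spark.graphx.{EdgeDirection, VertexId, VertexRDD}

// Array of neighboring vertex IDs for every vertex.
val neighborIds: VertexRDD[Array[VertexId]] =
  graph.collectNeighborIds(EdgeDirection.Either)

// Array of (neighbor ID, neighbor attribute) pairs for every vertex.
val neighbors = graph.collectNeighbors(EdgeDirection.In)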
6202
04:24:31,600 --> 04:24:32,717
Now, let me quickly open
my VM and let us go through
04:24:32,717 --> 04:24:35,700
my VM and let us go through
the spark directory first.
6204
04:24:35,900 --> 04:24:38,600
Let me first open
my terminal so first
6205
04:24:38,600 --> 04:24:41,800
I'll start the Hadoop daemons, so
for that I will go
6206
04:24:41,800 --> 04:24:46,358
to the Hadoop home directory
and inside sbin I'll run the
6207
04:24:46,358 --> 04:24:48,282
start-all.sh script file.
6208
04:24:52,000 --> 04:24:53,400
So let me check
6209
04:24:53,400 --> 04:24:55,700
if the Hadoop daemons
are running or not.
6210
04:24:58,700 --> 04:25:00,706
So as you can see, the name
6211
04:25:00,706 --> 04:25:03,000
node, data node,
secondary name node,
6212
04:25:03,000 --> 04:25:05,848
the node manager
and resource manager.
6213
04:25:05,848 --> 04:25:08,400
All the daemons
of Hadoop are up. Now
6214
04:25:08,400 --> 04:25:10,661
I will navigate to spark home.
6215
04:25:10,661 --> 04:25:13,300
Let me first start
the Spark daemons.
6216
04:25:17,600 --> 04:25:19,700
I see the Spark daemons are running,
6217
04:25:19,700 --> 04:25:24,000
so I'll first minimize this and let
me take you to the Spark home.
6218
04:25:24,900 --> 04:25:27,309
And this is my Spark directory.
6219
04:25:27,309 --> 04:25:28,712
I'll go inside now.
6220
04:25:28,712 --> 04:25:30,926
Let me first show you the data
6221
04:25:30,926 --> 04:25:34,100
which is by default present
with your spark.
6222
04:25:34,400 --> 04:25:36,700
So we'll open this in a new tab.
6223
04:25:36,700 --> 04:25:38,865
So you can see
we have two files
6224
04:25:38,865 --> 04:25:41,100
in this Graphics data directory.
6225
04:25:41,100 --> 04:25:44,638
Meanwhile, let me take you
to the example code.
6226
04:25:44,638 --> 04:25:48,900
So this is examples
and inside src, main, scala,
6227
04:25:49,600 --> 04:25:50,500
You can find
6228
04:25:50,500 --> 04:25:54,700
the graphics directory and
inside this Graphics directory
6229
04:25:54,700 --> 04:25:59,000
you'll find some of the sample codes
which are present over here.
6230
04:25:59,000 --> 04:26:01,692
So I will take you
to this aggregate
6231
04:26:01,692 --> 04:26:05,100
messages example dot
scala. Now meanwhile,
6232
04:26:05,100 --> 04:26:07,287
let me open the data as well.
6233
04:26:07,287 --> 04:26:09,700
So you'll be able to understand.
6234
04:26:10,500 --> 04:26:12,967
Now this is
followers dot txt file.
6235
04:26:12,967 --> 04:26:15,000
So basically you can imagine
6236
04:26:15,000 --> 04:26:18,545
these are the edges which
are representing the vertex.
6237
04:26:18,545 --> 04:26:21,580
So this is vertex 2
and this is vertex 1, then
6238
04:26:21,580 --> 04:26:25,100
this is Vertex 4 and this
is vertex 1 and similarly.
6239
04:26:25,100 --> 04:26:28,400
So on these are representing
those vertex and
6240
04:26:28,400 --> 04:26:30,900
if you can remember I
have already told you
6241
04:26:30,900 --> 04:26:33,200
that inside graph loader class.
6242
04:26:33,200 --> 04:26:35,818
There is a function
called Edge list file
6243
04:26:35,818 --> 04:26:37,200
which takes the edges
6244
04:26:37,200 --> 04:26:40,500
from a file and then it
constructs the graph based
6245
04:26:40,500 --> 04:26:43,800
on that. Now second, you
have this users dot txt.
6246
04:26:43,800 --> 04:26:47,550
So these are basically the vertices
with the vertex ID.
6247
04:26:47,550 --> 04:26:51,200
So vertex ID for this vertex
is 1 then for this is 2
6248
04:26:51,200 --> 04:26:53,539
and so on and then
this is the data
6249
04:26:53,539 --> 04:26:57,600
which is attached, or you can say
the attribute of the vertices.
6250
04:26:57,600 --> 04:26:59,800
So these are the vertex ID
6251
04:26:59,958 --> 04:27:03,700
which is 1 2 3 respectively
and this is the data
6252
04:27:03,700 --> 04:27:06,800
which is associated
with your each vertex.
6253
04:27:06,800 --> 04:27:10,500
So this is username and this
might be the name of your user.
6254
04:27:10,500 --> 04:27:13,100
and so on. Now
you can also see
6255
04:27:13,100 --> 04:27:16,900
that in some of the cases
the name of the user is missing.
6256
04:27:16,900 --> 04:27:18,800
So as in this case the name
6257
04:27:18,800 --> 04:27:22,100
of the user is missing
these are the vertices
6258
04:27:22,100 --> 04:27:26,300
or you can see the vertex ID
and vertex attributes.
6259
04:27:26,600 --> 04:27:30,500
Now, let me take you through
this aggregate messages example,
6260
04:27:30,600 --> 04:27:32,400
so as you can see,
we are giving the name
6261
04:27:32,400 --> 04:27:36,100
of the package over here, org dot apache dot
spark dot examples dot graphx,
6262
04:27:36,300 --> 04:27:40,306
then we are importing graphx,
and in that we are importing
6263
04:27:40,306 --> 04:27:41,764
the Graph class as well as
this vertex rdd. Next we
04:27:41,764 --> 04:27:45,700
this vertex rdd next we
are using this graph generator.
6265
04:27:45,700 --> 04:27:48,500
I'll tell you why we
are using this graph generator
6266
04:27:48,700 --> 04:27:52,400
and then we are using
the spark session over here.
6267
04:27:52,400 --> 04:27:54,105
So this is an example
6268
04:27:54,163 --> 04:27:58,778
where we are using the aggregate
messages operator to compute
6269
04:27:58,778 --> 04:28:03,163
the average age of the more
senior followers of each user.
6270
04:28:03,200 --> 04:28:03,700
Okay.
6271
04:28:03,928 --> 04:28:06,929
So this is the object
of aggregate messages example.
6272
04:28:07,000 --> 04:28:10,000
Now, this is the main function
where we are first.
6273
04:28:10,100 --> 04:28:13,600
initializing the Spark session, then
the name of the application.
6274
04:28:13,600 --> 04:28:16,400
So you have to provide the name
of the application
6275
04:28:16,400 --> 04:28:17,400
and this is get
6276
04:28:17,400 --> 04:28:20,600
or create method now
next you are initializing
6277
04:28:20,600 --> 04:28:24,338
the spark context as SC
now coming to the code.
6278
04:28:24,400 --> 04:28:27,400
So we are specifying
a graph then this graph
6279
04:28:27,400 --> 04:28:30,300
is containing Double and Int. Now,
6280
04:28:30,400 --> 04:28:33,200
I just told you that we
are importing graph generator.
6281
04:28:33,200 --> 04:28:35,023
So this graph generator is
6282
04:28:35,023 --> 04:28:37,900
to generate a random
graph for Simplicity.
6283
04:28:37,900 --> 04:28:40,400
So you would have multiple
number of edges and vertices.
6284
04:28:40,400 --> 04:28:43,047
Then you are using
this log normal graph.
6285
04:28:43,047 --> 04:28:44,900
You're passing the spark context
6286
04:28:44,900 --> 04:28:47,677
and you're specifying the number
of vertices as hundred.
6287
04:28:47,677 --> 04:28:49,956
So it will generate
hundred vertices for you.
6288
04:28:49,956 --> 04:28:51,200
Then what you are doing.
6289
04:28:51,200 --> 04:28:53,400
You are specifying
the map vertices
6290
04:28:53,400 --> 04:28:56,815
and you're trying
to map ID to double so
6291
04:28:56,815 --> 04:28:58,200
what this would do
6292
04:28:58,200 --> 04:29:02,100
this will basically map
your ID to double then
6293
04:29:02,100 --> 04:29:05,700
Next, we are trying
to calculate the older followers,
6294
04:29:05,700 --> 04:29:08,300
where you have given
it as vertex rdd
6295
04:29:08,300 --> 04:29:10,494
and the type is Int and Double. Also,
6296
04:29:10,494 --> 04:29:13,900
your vertex rdd
has Long as your vertex ID
6297
04:29:13,900 --> 04:29:15,200
and your data is double
6298
04:29:15,200 --> 04:29:17,533
which is associated
with each of the vertex
6299
04:29:17,533 --> 04:29:19,604
or you can say
the vertex attribute.
6300
04:29:19,604 --> 04:29:20,900
So you have this graph
6301
04:29:20,900 --> 04:29:23,178
which is basically
generated randomly
6302
04:29:23,178 --> 04:29:26,189
and then you are performing
aggregate messages.
6303
04:29:26,189 --> 04:29:29,200
So this is the aggregate
messages operator now,
6304
04:29:29,200 --> 04:29:33,353
if you can remember we first
have the send messages, right?
6305
04:29:33,353 --> 04:29:35,000
So inside this triplet,
6306
04:29:35,000 --> 04:29:38,620
we are specifying a function
that if the source attribute
6307
04:29:38,620 --> 04:29:40,100
of the triplet is more than the
6308
04:29:40,100 --> 04:29:42,300
destination attribute
of the triplet,
6309
04:29:42,300 --> 04:29:43,900
So basically it will return
6310
04:29:43,900 --> 04:29:47,144
if the followers age
is greater than the age
6311
04:29:47,144 --> 04:29:48,452
of person whom he
6312
04:29:48,452 --> 04:29:52,259
is following. This tells
if the follower's age is greater
6313
04:29:52,259 --> 04:29:55,000
than the age of whom
he is following.
6314
04:29:55,000 --> 04:29:56,462
So in that situation,
6315
04:29:56,462 --> 04:29:59,200
it will send message
to the destination
6316
04:29:59,200 --> 04:30:01,400
with vertex containing counter
6317
04:30:01,400 --> 04:30:05,000
that is 1 and the age
of the source attribute
6318
04:30:05,000 --> 04:30:07,700
that is the age
of the follower so first
6319
04:30:07,700 --> 04:30:10,800
so you can see the age
of the destination one is less
6320
04:30:10,800 --> 04:30:12,807
than the age
of source attribute.
6321
04:30:12,807 --> 04:30:14,000
So it will tell you
6322
04:30:14,000 --> 04:30:17,293
if the follower is older
than the user or not.
6323
04:30:17,293 --> 04:30:21,100
So in that situation will send
one to the destination
6324
04:30:21,100 --> 04:30:23,900
and we'll send the age
of the source
6325
04:30:23,900 --> 04:30:26,900
or you can say the age
of the follower. Then second,
6326
04:30:26,900 --> 04:30:29,400
I have told you
that we have merged messages.
6327
04:30:29,500 --> 04:30:32,500
So here we are adding
the counter and the age
6328
04:30:32,600 --> 04:30:33,800
in this reduce function.
6329
04:30:33,900 --> 04:30:37,515
So now what we are doing we
are dividing the total age
6330
04:30:37,515 --> 04:30:38,421
by the number
6331
04:30:38,421 --> 04:30:41,439
of older followers
to get the average age
6332
04:30:41,439 --> 04:30:42,700
of older followers.
6333
04:30:42,700 --> 04:30:45,400
So this is the reason why
we have passed the attribute
6334
04:30:45,400 --> 04:30:47,200
of source vertex firstly
6335
04:30:47,200 --> 04:30:49,300
if we are specifying
this variable that is
6336
04:30:49,300 --> 04:30:51,194
average age of older followers.
6337
04:30:51,194 --> 04:30:53,700
And then we are specifying
the vertex rdd.
6338
04:30:53,888 --> 04:30:58,211
So this will be double
and then this older followers
6339
04:30:58,292 --> 04:30:59,600
that is the graph
6340
04:30:59,600 --> 04:31:02,349
which we are picking up
from here and then we
6341
04:31:02,349 --> 04:31:04,100
are trying to map the value.
6342
04:31:04,100 --> 04:31:05,400
So in the vertex,
6343
04:31:05,400 --> 04:31:10,100
we have ID and we have value so
in this situation We
6344
04:31:10,100 --> 04:31:13,600
are using this case class
about count and total age.
6345
04:31:13,600 --> 04:31:16,000
So what we are doing we
are taking this total age
6346
04:31:16,000 --> 04:31:19,246
and we are dividing it by count
which we have gathered from this
6347
04:31:19,246 --> 04:31:20,011
send message.
6348
04:31:20,011 --> 04:31:22,800
And then we have aggregated
using this reduce function.
6349
04:31:22,800 --> 04:31:26,400
We are again taking the total
age of the older followers.
6350
04:31:26,400 --> 04:31:28,994
And then we are trying
to divide it by count
6351
04:31:28,994 --> 04:31:30,377
to get the average age
6352
04:31:30,377 --> 04:31:33,900
Then at last we are trying
to display the result and then
6353
04:31:33,900 --> 04:31:35,600
we are stopping the Spark session.
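For reference, a condensed Scala sketch of the logic just walked through (the actual example file shipped with Spark may differ in details); it assumes a SparkContext sc, for example inside spark-shell:

import org.apache.spark.graphx.{Graph, VertexRDD}
import org.apache.spark.graphx.util.GraphGenerators

// A random graph whose vertex attribute plays the role of an age.
val graph: Graph[Double, Int] =
  GraphGenerators.logNormalGraph(sc, numVertices = 100).mapVertices((id, _) => id.toDouble)

val olderFollowers: VertexRDD[(Int, Double)] = graph.aggregateMessages[(Int, Double)](
  triplet => {
    if (triplet.srcAttr > triplet.dstAttr) {
      // The follower is older: send a counter of 1 and the follower's age.
      triplet.sendToDst((1, triplet.srcAttr))
    }
  },
  // Merge messages: add the counters and add the ages.
  (a, b) => (a._1 + b._1, a._2 + b._2)
)

// Divide the total age by the count to get the average age of older followers.
val avgAgeOfOlderFollowers: VertexRDD[Double] =
  olderFollowers.mapValues((id, value) =>
    value match { case (count, totalAge) => totalAge / count })

avgAgeOfOlderFollowers.collect.foreach(println)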
6354
04:31:35,600 --> 04:31:38,385
So let me quickly open
the terminal so I
6355
04:31:38,385 --> 04:31:39,742
will go to examples
6356
04:31:39,742 --> 04:31:43,600
so I'd examples I took you
through the source directory
6357
04:31:43,600 --> 04:31:46,400
where the code is
present inside scala.
6358
04:31:46,400 --> 04:31:49,154
And then inside there
is a spark directory
6359
04:31:49,154 --> 04:31:51,975
where you will find
the code but to execute
6360
04:31:51,975 --> 04:31:55,200
the example you need to go
to the jars directory.
6361
04:31:56,100 --> 04:31:58,392
Now, this is
the scale example jar
6362
04:31:58,392 --> 04:32:00,200
which you need to execute.
6363
04:32:00,200 --> 04:32:03,100
But before this,
let me take you to the hdfs.
6364
04:32:03,400 --> 04:32:05,600
So the URL is localhost.
6365
04:32:05,600 --> 04:32:07,400
Colon 5 0 0 7 0
6366
04:32:08,500 --> 04:32:10,800
And we'll go to utilities then
6367
04:32:10,800 --> 04:32:12,800
we'll go to browse
the file system.
6368
04:32:13,000 --> 04:32:14,137
So as you can see,
6369
04:32:14,137 --> 04:32:16,849
I have created a user
directory in which I
6370
04:32:16,849 --> 04:32:18,700
have specified the username.
6371
04:32:18,700 --> 04:32:22,000
That is edureka
and inside edureka,
6372
04:32:22,000 --> 04:32:24,200
I have placed my data directory
6373
04:32:24,200 --> 04:32:27,500
where we have this graphics
and inside the graphics.
6374
04:32:27,500 --> 04:32:30,100
We have both the file
that is followers Dot txt
6375
04:32:30,100 --> 04:32:31,600
and users dot txt.
6376
04:32:31,600 --> 04:32:32,854
So in this program,
6377
04:32:32,854 --> 04:32:35,100
we are not referring
to these files
6378
04:32:35,100 --> 04:32:38,500
but incoming examples will
be referring to these files.
6379
04:32:38,500 --> 04:32:42,700
So I would request you to first
move it to this hdfs directory.
6380
04:32:42,700 --> 04:32:46,800
So that spark can refer
the files in data Graphics.
6381
04:32:47,000 --> 04:32:50,300
Now, let me quickly minimize
this and the command
6382
04:32:50,300 --> 04:32:53,000
to execute is spark-
6383
04:32:53,000 --> 04:32:56,900
submit and then I'll pass
the jars parameter
6384
04:32:56,900 --> 04:32:59,900
and I'll provide
the spark example jar.
6385
04:33:01,200 --> 04:33:05,100
So this is the jar then
I'll specify the class name.
6386
04:33:05,100 --> 04:33:06,900
So to get the class name.
6387
04:33:06,900 --> 04:33:08,900
I will go to the code.
6388
04:33:09,200 --> 04:33:12,000
I'll first take
the package name from here.
6389
04:33:12,700 --> 04:33:14,100
And then I'll take
6390
04:33:14,100 --> 04:33:17,935
the class name which is
aggregate messages example,
6391
04:33:17,935 --> 04:33:19,400
so this is my class.
6392
04:33:19,400 --> 04:33:21,928
And as I told you have
to provide the name
6393
04:33:21,928 --> 04:33:23,100
of the application.
6394
04:33:23,100 --> 04:33:26,600
So let me keep it as example
and I'll hit enter.
6395
04:33:31,946 --> 04:33:34,253
So now you can see the result.
6396
04:33:36,000 --> 04:33:37,700
So this is the followers
6397
04:33:37,700 --> 04:33:40,500
and this is the average
age of followers.
6398
04:33:40,500 --> 04:33:41,827
So it is 34. Then
6399
04:33:41,827 --> 04:33:45,038
We have 52 which is
the count of follower.
6400
04:33:45,038 --> 04:33:48,500
And the average age is
seventy six point eight
6401
04:33:48,500 --> 04:33:51,100
that is it has
96 senior followers.
6402
04:33:51,100 --> 04:33:52,900
And then the average age
6403
04:33:52,900 --> 04:33:56,000
of the followers is
ninety nine point zero,
6404
04:33:56,100 --> 04:33:58,600
then it has
four senior followers
6405
04:33:58,600 --> 04:34:00,520
and the average age is 51.
6406
04:34:00,520 --> 04:34:03,400
Then this vertex has
16 senior followers
6407
04:34:03,400 --> 04:34:06,003
with the average age
of 57 point five.
6408
04:34:06,003 --> 04:34:09,024
five and so on. You can see
the result over here.
6409
04:34:09,024 --> 04:34:12,800
So I hope now you guys are clear
with aggregate messages
6410
04:34:12,800 --> 04:34:14,748
how to use aggregate messages
6411
04:34:14,748 --> 04:34:17,100
how to specify
the send message then
6412
04:34:17,100 --> 04:34:19,200
how to write the merge message.
6413
04:34:19,200 --> 04:34:21,788
So let's quickly go back
to the presentation.
6414
04:34:21,788 --> 04:34:23,500
Now, let us quickly move ahead
6415
04:34:23,500 --> 04:34:26,014
and look at some
of the graph algorithms.
6416
04:34:26,014 --> 04:34:27,959
So the first one is Page rank.
6417
04:34:27,959 --> 04:34:31,200
So page rank measures
the importance of each vertex
6418
04:34:31,200 --> 04:34:32,706
in a graph assuming
6419
04:34:32,800 --> 04:34:35,900
that an edge from u
to v represents
6420
04:34:36,000 --> 04:34:37,453
an endorsement
6421
04:34:37,453 --> 04:34:41,300
or support of v's importance
by u. For example,
6422
04:34:41,300 --> 04:34:45,468
let's say if a Twitter user
is followed by many other users,
6423
04:34:45,468 --> 04:34:48,200
it will obviously rank
high. GraphX comes
6424
04:34:48,200 --> 04:34:51,919
with the static and dynamic
implementation of pagerank as
6425
04:34:51,919 --> 04:34:53,780
methods on page rank object
6426
04:34:53,780 --> 04:34:57,500
and static page rank runs
a fixed number of iterations,
6427
04:34:57,500 --> 04:35:02,200
which can be specified by you
while the dynamic page rank runs
6428
04:35:02,200 --> 04:35:04,100
until the ranks converge
6429
04:35:04,500 --> 04:35:08,300
what we mean by that is
the ranks stop changing by more
6430
04:35:08,300 --> 04:35:10,400
than a specified tolerance.
6431
04:35:10,500 --> 04:35:11,300
So it runs
6432
04:35:11,300 --> 04:35:14,500
until it has optimized
the page rank of each
6433
04:35:14,500 --> 04:35:19,400
of the vertices. Now, the graph ops class
allows calling these algorithms
6434
04:35:19,400 --> 04:35:22,100
directly as methods
on the graph class.
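A minimal Scala sketch of both variants, assuming a SparkContext sc and the same followers.txt edge list used in this demo:

import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt")

val staticRanks  = graph.staticPageRank(10).vertices   // a fixed number of iterations
val dynamicRanks = graph.pageRank(0.0001).vertices     // runs until the ranks converge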
6435
04:35:22,200 --> 04:35:24,800
Now, let's quickly go
back to the VM.
6436
04:35:25,000 --> 04:35:27,469
So this is the pagerank example.
6437
04:35:27,469 --> 04:35:29,161
Let me open this file.
6438
04:35:29,600 --> 04:35:32,595
So first we are specifying
this Graphics package,
6439
04:35:32,595 --> 04:35:35,065
then we are importing
the graph loader.
6440
04:35:35,065 --> 04:35:37,600
So as you can Remember
inside this graph
6441
04:35:37,600 --> 04:35:41,000
loader class we have
that edge list file operator,
6442
04:35:41,000 --> 04:35:43,600
which will basically create
the graph using the edges
6443
04:35:43,600 --> 04:35:46,575
and we have those edges
in our followers
6444
04:35:46,575 --> 04:35:50,542
dot txt file now coming back
to pagerank example now,
6445
04:35:50,542 --> 04:35:53,900
we're importing the spark
SQL Sparks session.
6446
04:35:54,100 --> 04:35:56,619
Now, this is Page
rank example object
6447
04:35:56,619 --> 04:35:59,700
and inside which we
have created a main class
6448
04:35:59,700 --> 04:36:04,000
and we have similarly created
the Spark session, then the builder
6449
04:36:04,000 --> 04:36:05,600
and we're specifying
the app name
6450
04:36:05,600 --> 04:36:09,800
which is to be provided, then
we have the get or create method.
6451
04:36:09,800 --> 04:36:10,415
So this is
6452
04:36:10,415 --> 04:36:12,800
where we are initializing
the spark context
6453
04:36:12,800 --> 04:36:13,800
as you can remember.
6454
04:36:13,800 --> 04:36:16,900
I told you that using
this Edge list file method.
6455
04:36:16,900 --> 04:36:19,115
We are basically
creating the graph
6456
04:36:19,115 --> 04:36:21,200
from the followers dot txt file.
6457
04:36:21,200 --> 04:36:24,223
Now, we are running
the page rank over here.
6458
04:36:24,223 --> 04:36:28,421
So in rank it will give you all
the page rank of the vertices
6459
04:36:28,421 --> 04:36:30,104
that is inside this graph
6460
04:36:30,104 --> 04:36:33,400
which we have just
created using the graph loader class.
6461
04:36:33,400 --> 04:36:36,575
So if you're passing
an integer as an argument
6462
04:36:36,575 --> 04:36:37,700
to the page rank,
6463
04:36:37,700 --> 04:36:40,018
it will run
that number iterations.
6464
04:36:40,018 --> 04:36:43,000
Otherwise, if you're
passing a double value,
6465
04:36:43,000 --> 04:36:45,495
it will run
until the convergence.
6466
04:36:45,495 --> 04:36:48,400
So we are running
page rank on this graph
6467
04:36:48,400 --> 04:36:50,861
and we have passed the vertices.
6468
04:36:50,900 --> 04:36:55,300
Now after this we are trying
to load the users dot txt file
6469
04:36:55,500 --> 04:36:58,400
and then we are trying to split
6470
04:36:58,400 --> 04:37:02,400
the line by comma, then
cast the field zero to long
6471
04:37:02,400 --> 04:37:04,571
and we are storing
the field one.
6472
04:37:04,571 --> 04:37:06,200
So basically field zero.
6473
04:37:06,300 --> 04:37:09,376
In your user txt is
your vertex ID or you
6474
04:37:09,376 --> 04:37:13,790
can see the ID of the user
and field one is your username.
6475
04:37:13,790 --> 04:37:17,252
So we are trying to load
these two Fields now.
6476
04:37:17,280 --> 04:37:19,819
We are trying
to rank by username.
6477
04:37:19,969 --> 04:37:24,600
So we are taking the users
and we are joining the ranks.
6478
04:37:24,600 --> 04:37:28,000
So this is where we
are using the join operation.
6479
04:37:28,000 --> 04:37:29,670
So in ranks by username,
6480
04:37:29,670 --> 04:37:32,562
We are trying to
attach those username
6481
04:37:32,562 --> 04:37:35,793
or put those username
with the page rank value.
6482
04:37:35,793 --> 04:37:37,641
So we are taking the users
6483
04:37:37,641 --> 04:37:40,554
then we are joining
the ranks it is again,
6484
04:37:40,554 --> 04:37:42,900
we are getting
from this page Rank
6485
04:37:43,300 --> 04:37:47,700
and then we are mapping
the ID user name and rank.
6486
04:37:56,500 --> 04:38:00,517
So it will take some time to run
some iterations over the graph
6487
04:38:00,517 --> 04:38:02,600
and will try to converge it.
6488
04:38:08,000 --> 04:38:11,700
So after converging you
can see the user and the rank.
6489
04:38:11,700 --> 04:38:14,300
So the maximum rank is
with Barack Obama,
6490
04:38:14,300 --> 04:38:18,000
which is 1.45 then
with Lady Gaga.
6491
04:38:18,100 --> 04:38:22,200
It's 1.39 and then with
odersky and so on.
6492
04:38:22,261 --> 04:38:24,338
Let's go back to the slide.
6493
04:38:25,200 --> 04:38:27,000
So now after page rank,
6494
04:38:27,200 --> 04:38:28,856
let's quickly move ahead
6495
04:38:28,856 --> 04:38:32,200
to Connected components
the connected components
6496
04:38:32,200 --> 04:38:34,923
algorithm labels each
connected component
6497
04:38:34,923 --> 04:38:38,600
of the graph with the ID
of its lowest numbered vertex.
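For reference, a minimal Scala sketch of this algorithm on the demo data (assuming a SparkContext sc and the followers.txt and users.txt files referenced in this demo):

import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt")
val cc = graph.connectedComponents().vertices   // (vertexId, component label)

// Attach the username to each component label.
val users = sc.textFile("data/graphx/users.txt").map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}
val ccByUsername = users.join(cc).map {
  case (id, (username, component)) => (username, component)
}
println(ccByUsername.collect().mkString("\n"))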
6498
04:38:38,600 --> 04:38:40,700
So let us quickly go
back to the VM.
6499
04:38:42,000 --> 04:38:45,200
Now let's go inside
the graphics directory
6500
04:38:45,200 --> 04:38:48,300
and now we'll open
this connect components example.
6501
04:38:48,400 --> 04:38:51,818
So again, it's the same, we are
importing graph loader
6502
04:38:51,818 --> 04:38:53,100
and Spark session.
6503
04:38:53,300 --> 04:38:56,600
Now, this is the connected
components example object. Next,
6504
04:38:56,600 --> 04:39:00,176
this is the main function
and inside the main function.
6505
04:39:00,176 --> 04:39:01,800
We are again specifying all
6506
04:39:01,800 --> 04:39:04,500
those Sparks session
then app name,
6507
04:39:04,500 --> 04:39:06,389
then we have spark context.
6508
04:39:06,389 --> 04:39:07,509
So it's similar.
6509
04:39:07,509 --> 04:39:10,100
So again using
this graph loader class
6510
04:39:10,130 --> 04:39:11,669
and using this edge
6511
04:39:11,900 --> 04:39:15,700
list file we are loading
the followers dot txt file.
6512
04:39:15,700 --> 04:39:16,733
Now in this graph.
6513
04:39:16,733 --> 04:39:19,706
We are using this connected
components algorithm.
6514
04:39:19,706 --> 04:39:23,300
And then we are trying to find
the connected components now
6515
04:39:23,300 --> 04:39:26,600
at last we are trying
to again load this user file
6516
04:39:26,600 --> 04:39:28,300
that is users Dot txt.
6517
04:39:28,500 --> 04:39:31,312
And we are trying to join
the connected components
6518
04:39:31,312 --> 04:39:34,387
with the username so over
here it is also the same thing
6519
04:39:34,387 --> 04:39:36,504
which we have discussed
in page rank,
6520
04:39:36,504 --> 04:39:38,000
which is taking the field 0
6521
04:39:38,000 --> 04:39:41,100
and field one
of your user dot txt file
6522
04:39:41,400 --> 04:39:45,100
and a at last we
are joining this users
6523
04:39:45,100 --> 04:39:49,200
and at last we are trying to join
these users to the connected components
6524
04:39:49,200 --> 04:39:50,584
that is from here.
6525
04:39:50,584 --> 04:39:50,882
Now.
6526
04:39:50,882 --> 04:39:54,008
We are printing the CC
by username collect.
6527
04:39:54,008 --> 04:39:58,400
So let us quickly go ahead and
execute this example as well.
6528
04:39:58,600 --> 04:40:01,400
So let me first copy
this object name.
6529
04:40:03,800 --> 04:40:17,300
Let's name this
as example 2. So
6530
04:40:17,300 --> 04:40:20,100
as you can see Justin Bieber has
one connected component,
6531
04:40:20,100 --> 04:40:23,300
then you can see this has
three connected component.
6532
04:40:23,300 --> 04:40:25,100
Then this has
one connected component
6533
04:40:25,100 --> 04:40:28,600
than Barack Obama has one
connected component and so on.
6534
04:40:28,600 --> 04:40:30,464
So this basically
gives you an idea
6535
04:40:30,464 --> 04:40:32,200
about the connected components.
6536
04:40:32,200 --> 04:40:33,900
Now, let's quickly move back
6537
04:40:33,900 --> 04:40:37,300
to the slide will discuss
about the third algorithm
6538
04:40:37,300 --> 04:40:39,100
that is triangle counting.
6539
04:40:39,100 --> 04:40:43,177
So basically a Vertex is a part
of a triangle when it has
6540
04:40:43,177 --> 04:40:46,900
two adjacent vertices
with an edge between them.
6541
04:40:46,900 --> 04:40:49,100
So it will form
a triangle, right?
6542
04:40:49,100 --> 04:40:52,313
And then that vertex
is a part of a triangle
6543
04:40:52,313 --> 04:40:56,092
now GraphX implements
a triangle counting algorithm
6544
04:40:56,092 --> 04:40:58,200
in the Triangle count object.
6545
04:40:58,200 --> 04:41:01,200
Now that determines the number
of triangles passing
6546
04:41:01,200 --> 04:41:04,600
through each vertex providing
a measure of clustering
6547
04:41:04,600 --> 04:41:07,400
so we can compute
the triangle count
6548
04:41:07,400 --> 04:41:09,875
of the social network data set
6549
04:41:09,875 --> 04:41:13,675
from the pagerank section.
One more thing to note is
6550
04:41:13,675 --> 04:41:16,598
that triangle count
requires the edges
6551
04:41:16,600 --> 04:41:18,800
to be in
a canonical orientation.
6552
04:41:18,800 --> 04:41:21,364
That is your Source ID
should always be less
6553
04:41:21,364 --> 04:41:22,868
than your destination ID
6554
04:41:22,868 --> 04:41:25,500
and the graph has to be
partitioned using the Graph partitionBy method.
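A minimal Scala sketch of these two requirements on the demo data (assuming a SparkContext sc and the followers.txt edge list):

import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

// Load the edges in canonical orientation, then partition the graph.
val graph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt",
  canonicalOrientation = true).partitionBy(PartitionStrategy.RandomVertexCut)

// Number of triangles passing through each vertex.
val triCounts = graph.triangleCount().vertices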
6555
04:41:25,500 --> 04:41:27,318
Now,
6556
04:41:27,318 --> 04:41:28,800
let's quickly go back.
6557
04:41:28,800 --> 04:41:32,000
So let me open
the graphics directory again,
6558
04:41:32,000 --> 04:41:35,200
and we'll see
the triangle counting example.
6559
04:41:36,500 --> 04:41:38,100
So again, it's the same
6560
04:41:38,100 --> 04:41:40,900
and the object is
triangle counting example,
6561
04:41:40,900 --> 04:41:43,400
then the main function
is same as well.
6562
04:41:43,400 --> 04:41:46,400
Now we are again using
this graph loader class
6563
04:41:46,400 --> 04:41:50,183
and we are loading
the followers dot txt
6564
04:41:50,183 --> 04:41:52,000
which contains the edges
6565
04:41:52,000 --> 04:41:53,000
as you can see here.
6566
04:41:53,000 --> 04:41:54,600
We are using this Partition
6567
04:41:54,600 --> 04:41:58,800
by argument and we are passing
the random vertex cut,
6568
04:41:58,800 --> 04:42:01,000
which is the partition strategy.
6569
04:42:01,000 --> 04:42:03,165
So this is how you can go ahead
6570
04:42:03,165 --> 04:42:06,100
and you can Implement
a partition strategy.
6571
04:42:06,123 --> 04:42:09,277
Here we are loading the edges
in canonical order
6572
04:42:09,400 --> 04:42:11,900
and partitioning the graph
for triangle count.
6573
04:42:11,900 --> 04:42:12,129
Now.
6574
04:42:12,129 --> 04:42:14,600
We are trying to find
out the triangle count
6575
04:42:14,600 --> 04:42:15,830
for each vertex.
6576
04:42:15,830 --> 04:42:18,000
So we have this try count
6577
04:42:18,000 --> 04:42:22,600
variable and then we are using
this triangle count algorithm
6578
04:42:22,600 --> 04:42:25,074
and then we are
specifying the vertices
6579
04:42:25,074 --> 04:42:28,200
so it will execute
triangle count over this graph
6580
04:42:28,200 --> 04:42:31,900
which we have just loaded
from the followers dot txt file.
6581
04:42:31,900 --> 04:42:35,074
And again, we are basically
joining usernames.
6582
04:42:35,074 --> 04:42:38,320
So first we are getting
the usernames. Again here
6583
04:42:38,320 --> 04:42:42,600
We are performing the join
between users and try counts.
6584
04:42:42,900 --> 04:42:45,300
So try counts is from here.
6585
04:42:45,300 --> 04:42:48,806
And then we are again
printing the value from here.
6586
04:42:48,806 --> 04:42:50,700
So again, this is the same.
6587
04:42:50,700 --> 04:42:52,844
Let us quickly go
ahead and execute
6588
04:42:52,844 --> 04:42:54,800
this triangle counting example.
6589
04:42:54,800 --> 04:42:56,338
So let me copy this.
6590
04:42:56,500 --> 04:42:58,300
I'll go back to the terminal.
6591
04:42:58,400 --> 04:43:02,300
I'll name it as example
3 and change the class name.
6592
04:43:04,134 --> 04:43:05,365
And I hit enter.
6593
04:43:14,100 --> 04:43:16,900
So now you can see
the triangle associated
6594
04:43:16,900 --> 04:43:20,100
with Justin Bieber 0 then
Barack Obama is one
6595
04:43:20,100 --> 04:43:21,600
with odersky it's one
6596
04:43:21,661 --> 04:43:23,200
and with jeresig
it's one.
04:43:23,200 --> 04:43:24,100
It's fun.
6598
04:43:24,300 --> 04:43:27,800
So for better understanding I
would recommend you to go ahead
6599
04:43:27,800 --> 04:43:30,136
and take this followers dot txt
6600
04:43:30,136 --> 04:43:33,000
And you can create
a graph by yourself.
6601
04:43:33,000 --> 04:43:36,227
And then you can attach
these users names with them
6602
04:43:36,227 --> 04:43:38,100
and then you will get an idea
6603
04:43:38,100 --> 04:43:41,700
about why it is giving
the number as 1 or 0.
6604
04:43:41,700 --> 04:43:44,065
So again, the graph
which is connecting
6605
04:43:44,065 --> 04:43:45,000
one, two and four
is disconnected and it
04:43:45,000 --> 04:43:47,600
is disconnect and it
is not completing any triangles.
6607
04:43:47,600 --> 04:43:52,900
So the values of these three are 0,
and next here's the second graph
6608
04:43:52,900 --> 04:43:54,600
which is connecting
6609
04:43:54,600 --> 04:43:59,400
your vertex 3 6 & 7
is completing one triangle.
6610
04:43:59,400 --> 04:44:01,323
So this is the reason why
6611
04:44:01,323 --> 04:44:05,300
these three vertices
have the value one. Now,
6612
04:44:05,400 --> 04:44:06,952
Let me quickly go back.
6613
04:44:06,952 --> 04:44:07,875
So now I hope
6614
04:44:07,875 --> 04:44:11,000
that you guys are clear
with all the concepts
6615
04:44:11,000 --> 04:44:14,011
of graph operators
then graph algorithms.
6616
04:44:14,011 --> 04:44:17,400
So now is the right
time and let us look
6617
04:44:17,400 --> 04:44:19,200
at a Spark GraphX demo
6618
04:44:19,300 --> 04:44:20,838
where we'll go ahead
6619
04:44:20,838 --> 04:44:24,300
and we'll try to analyze
the Ford GoBike data.
6620
04:44:24,800 --> 04:44:27,800
So let me quickly go
back to my VM.
6621
04:44:28,000 --> 04:44:29,699
So let me first show
you the website
6622
04:44:29,699 --> 04:44:32,500
where you can go ahead and
download the Ford GoBike data.
6623
04:44:38,600 --> 04:44:40,350
So over here you can go
6624
04:44:40,350 --> 04:44:43,700
to download the Ford Go
Bike trip history data.
6625
04:44:46,480 --> 04:44:51,019
So you can go ahead and download
this 2017 Ford GoBike trip data.
6626
04:44:51,100 --> 04:44:53,000
So I have already downloaded it.
6627
04:44:55,300 --> 04:44:56,696
So to avoid the typos,
6628
04:44:56,696 --> 04:44:59,300
I have already written
all the commands so
6629
04:44:59,300 --> 04:45:07,100
first let me go ahead and start
the Spark shell. So I'm inside
6630
04:45:07,100 --> 04:45:09,700
the Spark shell now.
6631
04:45:09,700 --> 04:45:13,300
Let me first import graphx
and Spark rdd.
6632
04:45:15,800 --> 04:45:19,200
So I've successfully
imported graphics and Spark rdd.
6633
04:45:20,180 --> 04:45:23,719
Now, let me create
a spark SQL context as well.
6634
04:45:25,100 --> 04:45:28,900
So I have successfully
created the Spark SQL context.
6635
04:45:28,900 --> 04:45:31,520
So this is basically
for running SQL queries
6636
04:45:31,520 --> 04:45:32,800
over the data frames.
6637
04:45:34,100 --> 04:45:37,176
Now, let me go ahead
and import the data.
6638
04:45:37,826 --> 04:45:40,673
So I'm loading the data
in data frame.
6639
04:45:40,800 --> 04:45:43,623
So the format of file is CSV,
6640
04:45:43,623 --> 04:45:46,853
then an option the header
is already added.
6641
04:45:46,853 --> 04:45:48,700
So that's why it's true.
6642
04:45:48,800 --> 04:45:51,600
Then it will automatically
infer this schema
6643
04:45:51,600 --> 04:45:53,332
and then in the load parameter,
6644
04:45:53,332 --> 04:45:55,400
I have specified
the path of the file.
6645
04:45:55,400 --> 04:45:57,100
So I'll quickly hit enter.
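A minimal Scala sketch of this load, assuming the SQL context created above is called sqlContext; the local path below is hypothetical and stands in for wherever you saved the downloaded trip-data CSV:

val df = sqlContext.read
  .format("csv")
  .option("header", "true")        // the file already has a header row
  .option("inferSchema", "true")   // let Spark infer the column types
  .load("/home/edureka/2017-fordgobike-tripdata.csv")  // hypothetical path

println(df.count())
df.printSchema()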
6646
04:45:59,100 --> 04:46:02,500
So the data is loaded
in the data frame. To check,
6647
04:46:02,500 --> 04:46:07,000
I'll use d f dot count
so it will give me the count.
6648
04:46:09,900 --> 04:46:16,553
So you can see it has
5 lakh 19 thousand 700 rows. Now
6649
04:46:16,553 --> 04:46:20,092
let me quickly go back
and I'll print the schema.
6650
04:46:21,400 --> 04:46:25,010
So this is the schema
the duration in second,
6651
04:46:25,010 --> 04:46:27,625
then we have
the start time end time.
6652
04:46:27,625 --> 04:46:29,876
Then you have start station ID.
6653
04:46:29,876 --> 04:46:32,200
Then you have
start station name.
6654
04:46:32,300 --> 04:46:35,761
Then you have start
station latitude longitude
6655
04:46:35,761 --> 04:46:37,207
then end station ID
6656
04:46:37,207 --> 04:46:40,360
and station name then
end station latitude
6657
04:46:40,360 --> 04:46:42,007
and station longitude.
6658
04:46:42,007 --> 04:46:46,500
Then your bike ID user type then
the birth year of the member
6659
04:46:46,500 --> 04:46:48,650
and the gender
of the member now,
6660
04:46:48,650 --> 04:46:50,800
I'm trying to create
a data frame
6661
04:46:50,800 --> 04:46:52,306
that is just stations
6662
04:46:52,306 --> 04:46:56,300
so it will only create
the station ID and station name
6663
04:46:56,300 --> 04:46:58,607
which I'll be using as vertex.
6664
04:46:58,800 --> 04:47:02,000
So here I am trying
to create a data frame
6665
04:47:02,000 --> 04:47:03,500
with the name of just stations
6666
04:47:03,658 --> 04:47:07,120
where I am just selecting
the start station ID
6667
04:47:07,120 --> 04:47:09,600
and I'm casting it as float
6668
04:47:09,600 --> 04:47:12,400
and then I'm selecting
the start station name
6669
04:47:12,400 --> 04:47:15,400
and then I'm using
the distinct function to only
6670
04:47:15,400 --> 04:47:17,169
keep the unique values.
6671
04:47:17,169 --> 04:47:19,864
So I quickly go
ahead and hit enter.
6672
04:47:20,100 --> 04:47:21,600
So again, let me go
6673
04:47:21,600 --> 04:47:27,000
ahead and use this just stations
and I will print the schema.
6674
04:47:28,300 --> 04:47:31,531
So you can see
there is station ID,
6675
04:47:31,531 --> 04:47:34,000
and then there is
start station name.
6676
04:47:34,569 --> 04:47:36,800
It contains the unique values
6677
04:47:36,800 --> 04:47:40,600
of stations in this just
station data frame.
6678
04:47:40,800 --> 04:47:41,735
So now again,
6679
04:47:41,735 --> 04:47:44,900
I am taking this stations
where I'm selecting
6680
04:47:44,900 --> 04:47:47,971
the start station ID
and end station ID.
6681
04:47:47,971 --> 04:47:49,900
Then I am using the distinct
6682
04:47:49,900 --> 04:47:52,700
which will again give
me the unique values
6683
04:47:52,700 --> 04:47:54,600
and I'm using this flat map
6684
04:47:54,600 --> 04:47:56,200
where I am specifying
6685
04:47:56,200 --> 04:47:59,700
the iterables where we
are taking the x0
6686
04:47:59,700 --> 04:48:01,700
that is your start station ID,
6687
04:48:01,700 --> 04:48:04,405
and I am taking x 1
which is your end
6688
04:48:04,405 --> 04:48:05,700
station ID, and then again,
6689
04:48:05,700 --> 04:48:07,800
I'm applying this
distinct function
6690
04:48:07,800 --> 04:48:12,200
that it will keep only
the unique values and then
6691
04:48:12,400 --> 04:48:14,600
at last we have the toDF function
6692
04:48:14,600 --> 04:48:16,619
which will convert
it to data frame.
6693
04:48:16,619 --> 04:48:19,100
So let me quickly go ahead
and execute this.
6694
04:48:19,500 --> 04:48:21,376
So I am printing this schema.
6695
04:48:21,376 --> 04:48:23,576
So as you can see
it has one column
6696
04:48:23,576 --> 04:48:26,100
that is value and it
has data type long.
6697
04:48:26,100 --> 04:48:29,715
So I have taken all
the start and end station ID
6698
04:48:29,715 --> 04:48:31,561
and using this flat map.
6699
04:48:31,561 --> 04:48:34,200
I have iterated
over all the start
6700
04:48:34,200 --> 04:48:37,705
and end station IDs and then
using the distinct function
6701
04:48:37,705 --> 04:48:41,600
and taking the unique values
and converting it to data frames
6702
04:48:41,600 --> 04:48:44,800
so I can use the stations
and using the station.
6703
04:48:44,800 --> 04:48:49,000
I will basically keep each
of the stations in a Vertex.
6704
04:48:49,000 --> 04:48:52,500
So this is the reason why
I'm taking the stations
6705
04:48:52,500 --> 04:48:55,300
or you can say I am taking
the unique stations
6706
04:48:55,300 --> 04:48:58,107
from the start station ID
and station ID
6707
04:48:58,107 --> 04:48:59,691
so that I can go ahead
6708
04:48:59,691 --> 04:49:02,500
and I can define
vertex as the stations.
6709
04:49:03,100 --> 04:49:06,400
So now we are creating
our set of vertices
6710
04:49:06,400 --> 04:49:09,804
and attaching a bit
of metadata to each one of them
6711
04:49:09,804 --> 04:49:12,800
which in our case is
the name of the station.
6712
04:49:12,800 --> 04:49:16,035
So as you can see we are
creating this station vertices,
6713
04:49:16,035 --> 04:49:18,679
which is again an rdd
with vertex ID and string.
6714
04:49:18,679 --> 04:49:21,700
So we are using the station's
which we have just created.
6715
04:49:21,700 --> 04:49:24,500
We are joining it
with just stations
6716
04:49:24,500 --> 04:49:27,100
at the station value
should be equal
6717
04:49:27,100 --> 04:49:29,300
to just station station ID.
6718
04:49:29,600 --> 04:49:32,400
So as we have created stations,
6719
04:49:32,400 --> 04:49:35,200
And just station
so we are joining it.
6720
04:49:36,600 --> 04:49:39,061
And then selecting
the station ID
6721
04:49:39,061 --> 04:49:43,000
and start station name
then we are mapping row 0.
6722
04:49:44,700 --> 04:49:48,600
And Row 1 so your row
0 will basically be
6723
04:49:48,600 --> 04:49:51,088
your vertex ID and Row
1 will be the string.
6724
04:49:51,088 --> 04:49:55,100
That is the name of your station.
So let me quickly go ahead
6725
04:49:55,100 --> 04:49:56,300
and execute this.
6726
04:49:57,357 --> 04:50:01,742
So let us quickly print this
using collect forage println.
6727
04:50:19,500 --> 04:50:20,366
So over here,
6728
04:50:20,366 --> 04:50:23,900
we are basically attaching
the edges or you can see we
6729
04:50:23,900 --> 04:50:27,500
are creating the trip edges
for all our individual rides
6730
04:50:27,500 --> 04:50:29,900
and then we'll get
the station values
6731
04:50:30,350 --> 04:50:33,350
and then we'll add
a dummy value of one.
6732
04:50:33,800 --> 04:50:34,900
So as you can see
6733
04:50:34,900 --> 04:50:37,200
that I am selecting
the start station and
6734
04:50:37,200 --> 04:50:38,600
and station from the DF
6735
04:50:38,600 --> 04:50:41,300
which is the first data frame
which we have loaded
6736
04:50:41,300 --> 04:50:46,200
and then I am mapping
it to row 0 + Row 1,
6737
04:50:46,400 --> 04:50:49,000
which is your source
and destination.
6738
04:50:49,100 --> 04:50:53,500
And then I'm attaching
a value of one to each of them.
6739
04:50:53,600 --> 04:50:55,000
So I'll hit enter.
6740
04:50:57,500 --> 04:51:00,900
Now, let me quickly go ahead
and print this station edges.
6741
04:51:07,500 --> 04:51:10,300
So just taking the source
ID of the vertex
6742
04:51:10,300 --> 04:51:12,182
and destination ID of the vertex
6743
04:51:12,182 --> 04:51:14,800
or you can say source station ID
and destination station ID,
6744
04:51:14,800 --> 04:51:17,900
and it is attaching value
one to each one of them.
6745
04:51:17,900 --> 04:51:20,700
So now you can go ahead
and build your graph.
6746
04:51:20,700 --> 04:51:23,854
But again as we discuss
that we need a default station
6747
04:51:23,854 --> 04:51:25,700
so you can have some situations
6748
04:51:25,700 --> 04:51:29,033
where your edges might be
indicating some vertices,
6749
04:51:29,033 --> 04:51:31,500
but that vertices
might not be present
6750
04:51:31,500 --> 04:51:33,107
in your vertex rdd.
6751
04:51:33,107 --> 04:51:34,764
So for that situation,
6752
04:51:34,764 --> 04:51:37,400
we need to create
a default station.
6753
04:51:37,400 --> 04:51:40,651
So I created a default station
as missing station.
6754
04:51:40,651 --> 04:51:42,100
So now we are all set.
6755
04:51:42,100 --> 04:51:44,400
We can go ahead
and create the graph.
6756
04:51:44,400 --> 04:51:46,700
So the name of the graph
is station graph.
6757
04:51:46,700 --> 04:51:49,000
Then the vertices
are stationed vertices
6758
04:51:49,000 --> 04:51:50,485
which we have created
6759
04:51:50,485 --> 04:51:54,247
which basically contains
the station ID and station name
6760
04:51:54,247 --> 04:51:56,300
and then we have station edges
6761
04:51:56,300 --> 04:51:58,600
and at last we
have default station.
6762
04:51:58,600 --> 04:52:01,500
So let me quickly go ahead
and execute this.
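For reference, a minimal Scala sketch of the construction just described, assuming the vertex and edge RDDs built earlier are called stationVertices and stationEdges (the names here are illustrative):

import org.apache.spark.graphx.Graph

val defaultStation = "Missing Station"
val stationGraph = Graph(stationVertices, stationEdges, defaultStation)
stationGraph.cache()   // cache the graph for faster repeated access

println(stationGraph.numVertices)
println(stationGraph.numEdges)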
6763
04:52:03,100 --> 04:52:06,500
So now I need to cache this graph
for faster access.
6764
04:52:06,500 --> 04:52:08,700
So I'll use the cache function.
6765
04:52:09,500 --> 04:52:13,300
So let us quickly go ahead and
check the number of vertices.
6766
04:52:24,700 --> 04:52:28,600
So these are the number
of vertices again,
6767
04:52:28,900 --> 04:52:31,600
we can check the number
of edges as well.
6768
04:52:35,700 --> 04:52:37,300
So these are
the number of edges.
6769
04:52:38,405 --> 04:52:40,400
And to get a sanity check.
6770
04:52:40,400 --> 04:52:43,500
So let's go ahead
and check the number of records
6771
04:52:43,500 --> 04:52:45,500
that are present
in the data frame.
6772
04:52:48,000 --> 04:52:50,900
So as you can see
that the number of edges
6773
04:52:50,900 --> 04:52:55,100
in our graph and the count
in our data frame is similar,
6774
04:52:55,100 --> 04:52:56,900
or you can see the same.
6775
04:52:56,900 --> 04:53:00,702
So now let's go ahead and run
page rank on our data
6776
04:53:00,702 --> 04:53:04,200
so we can either run
a set number of iterations
6777
04:53:04,200 --> 04:53:06,700
or we can run it
until the convergence.
6778
04:53:06,700 --> 04:53:10,400
So in my case,
I'll run it till convergence.
6779
04:53:11,700 --> 04:53:15,000
So it's rank then
station graph then page rank.
6780
04:53:15,000 --> 04:53:17,133
So has specified
the double value
6781
04:53:17,133 --> 04:53:21,000
so it will run till convergence,
so let's wait for some time.
6782
04:53:51,600 --> 04:53:55,400
So now that we have executed
the pagerank algorithm.
6783
04:53:55,700 --> 04:53:57,300
So we got the ranks
6784
04:53:57,300 --> 04:53:59,700
which are attached
to each vertices.
6785
04:54:00,100 --> 04:54:03,700
So now let us quickly go ahead
and look at the ranks.
6786
04:54:03,700 --> 04:54:06,601
So we are joining ranks
with station vertices
6787
04:54:06,601 --> 04:54:09,675
and then we are sorting it
in descending order
6788
04:54:09,675 --> 04:54:11,900
and we are taking
the first 10 rows
6789
04:54:11,900 --> 04:54:13,500
and then we are printing them.
6790
04:54:13,500 --> 04:54:16,700
So let's quickly go
ahead and hit enter.
6791
04:54:21,700 --> 04:54:26,000
So you can see these are
the top 10 stations which have
6792
04:54:26,000 --> 04:54:27,800
the most pagerank values
6793
04:54:27,800 --> 04:54:30,800
so you can say it has
more number of incoming trips.
6794
04:54:30,800 --> 04:54:32,270
Now one question would be
6795
04:54:32,270 --> 04:54:35,000
what are the most common
destinations in the data set
6796
04:54:35,000 --> 04:54:36,598
from location to location
6797
04:54:36,598 --> 04:54:40,500
so we can do this by performing
a grouping operator and adding
6798
04:54:40,500 --> 04:54:42,218
The Edge counts together.
6799
04:54:42,218 --> 04:54:46,000
So basically this will give
a new graph except each Edge
6800
04:54:46,000 --> 04:54:50,300
will now be the sum of all
the semantically same edges.
6801
04:54:51,500 --> 04:54:53,700
So again, we are taking
the station graph.
6802
04:54:53,700 --> 04:54:56,800
We are performing group
edges with e1 and e2.
6803
04:54:56,800 --> 04:55:00,197
So we are basically
grouping edges H1 and H2.
6804
04:55:00,200 --> 04:55:01,629
So we are aggregating them.
6805
04:55:01,629 --> 04:55:03,100
Then we are using triplet
6806
04:55:03,100 --> 04:55:06,099
and then we are sorting them
in descending order again.
6807
04:55:06,099 --> 04:55:08,200
And then we are
printing the triplets
6808
04:55:08,200 --> 04:55:10,908
from The Source vertex
and the number of trips
6809
04:55:10,908 --> 04:55:13,864
and then we are taking
the destination attribute
6810
04:55:13,864 --> 04:55:15,500
or you can see destination
6811
04:55:15,500 --> 04:55:18,100
Vertex or you can see
destination station.
6812
04:55:26,526 --> 04:55:28,373
So you can see there are
6813
04:55:28,500 --> 04:55:32,300
1933 trips from San
Francisco Ferry Building
6814
04:55:32,300 --> 04:55:34,100
to the station then again,
6815
04:55:34,100 --> 04:55:36,700
you can see there are
fourteen hundred and eleven
6816
04:55:36,700 --> 04:55:39,900
trips from San Francisco
to this location.
6817
04:55:39,900 --> 04:55:42,200
Then there are 1,025 trips
6818
04:55:42,200 --> 04:55:45,300
from this station
to San Francisco
6819
04:55:45,500 --> 04:55:49,100
and so on. So now we
have got a directed graph,
6820
04:55:49,100 --> 04:55:50,885
that means our
trips are directional
6821
04:55:50,885 --> 04:55:52,400
from one location to another
6822
04:55:52,600 --> 04:55:55,787
so now we can go ahead
and find the number of trips
6823
04:55:55,787 --> 04:55:57,725
that went to a specific station
6824
04:55:57,725 --> 04:56:00,100
and then leave
from a specific station.
6825
04:56:00,100 --> 04:56:01,806
So basically we are trying
6826
04:56:01,806 --> 04:56:04,300
to find the inbound
and outbound values
6827
04:56:04,300 --> 04:56:07,829
or you can say we are trying
to find in degree and out degree
6828
04:56:07,829 --> 04:56:08,723
of the stations.
6829
04:56:08,723 --> 04:56:12,300
So let us first calculate the in
degrees using the station graph
6830
04:56:12,300 --> 04:56:14,364
and I am using
the in degrees operator.
6831
04:56:14,364 --> 04:56:17,298
Then I'm joining it
with the station vertices
6832
04:56:17,298 --> 04:56:20,435
and then I'm sorting it again
in descending order
6833
04:56:20,435 --> 04:56:22,852
and then I'm taking
the top 10 values.
6834
04:56:22,852 --> 04:56:25,400
So let's quickly go
ahead and hit enter.
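A minimal Scala sketch of this in-degree ranking, assuming the stationGraph and stationVertices built earlier (the out-degree version is the same with outDegrees):

val topInbound = stationGraph.inDegrees
  .join(stationVertices)                 // attach the station name to each vertex
  .sortBy(_._2._1, ascending = false)    // sort by in-degree, descending
  .take(10)

topInbound.foreach { case (_, (inDeg, name)) => println(s"$name: $inDeg") }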
6835
04:56:30,900 --> 04:56:34,815
So these are the top 10 station
and you can see the in degrees.
6836
04:56:34,815 --> 04:56:36,600
So there are these many trips
6837
04:56:36,600 --> 04:56:38,797
which are coming
into these stations.
6838
04:56:38,797 --> 04:56:39,651
Now similarly,
6839
04:56:39,651 --> 04:56:41,300
We can find the out degree.
6840
04:56:48,200 --> 04:56:51,400
Now again, you can see
the out degrees as well.
6841
04:56:51,400 --> 04:56:54,896
So these are the stations
and these are the out degrees.
6842
04:56:54,896 --> 04:56:58,439
So again, you can go ahead
and perform some more operations
6843
04:56:58,439 --> 04:56:59,400
over this graph.
6844
04:56:59,400 --> 04:57:01,635
So you can go ahead
and find the station
6845
04:57:01,635 --> 04:57:03,700
which has the most number
of trips,
6846
04:57:03,700 --> 04:57:07,241
that is, the most number of people
coming into that station,
6847
04:57:07,241 --> 04:57:09,758
but fewer people are
leaving that station;
6848
04:57:09,758 --> 04:57:13,320
and again on the contrary
you can find out the stations
6849
04:57:13,320 --> 04:57:15,538
where there are
more edges,
6850
04:57:15,538 --> 04:57:18,240
or you can say trips,
leaving those stations,
6851
04:57:18,240 --> 04:57:19,848
But there are less number
6852
04:57:19,848 --> 04:57:22,100
of trips coming
into those stations.
6853
04:57:22,100 --> 04:57:25,800
So I guess you guys are
now clear with Spark GraphX.
6854
04:57:25,800 --> 04:57:27,810
Then we discuss
the different types
6855
04:57:27,810 --> 04:57:29,398
of graphs. Then, moving ahead,
6856
04:57:29,398 --> 04:57:31,100
we discussed the
features of GraphX.
6857
04:57:31,100 --> 04:57:33,675
Then we discussed something
about the property graph.
6858
04:57:33,675 --> 04:57:35,500
We understood what
is property graph
6859
04:57:35,500 --> 04:57:38,200
how you can create vertex
how you can create edges
6860
04:57:38,200 --> 04:57:40,800
how to use VertexRDD and EdgeRDD.
6861
04:57:40,800 --> 04:57:44,500
Then we looked at some of
the important vertex operations
6862
04:57:44,500 --> 04:57:48,500
and at last we understood some
of the graph algorithms.
6863
04:57:48,500 --> 04:57:51,349
So I guess now you
guys are clear about
6864
04:57:51,349 --> 04:57:53,600
how to work with Spark GraphX.
6865
04:57:58,300 --> 04:58:01,300
Today's video is
on Hadoop versus Spark.
6866
04:58:01,400 --> 04:58:04,683
Now as we know organizations
from different domains
6867
04:58:04,683 --> 04:58:07,400
are investing in big
data analytics today.
6868
04:58:07,400 --> 04:58:10,400
They're analyzing large
data sets to uncover
6869
04:58:10,400 --> 04:58:11,730
all hidden patterns
6870
04:58:11,730 --> 04:58:15,510
unknown correlations market
trends customer preferences
6871
04:58:15,510 --> 04:58:18,100
and other useful
business information.
6872
04:58:18,100 --> 04:58:20,800
These analytical findings
are helping organizations
6873
04:58:20,800 --> 04:58:24,100
in more effective marketing,
new Revenue opportunities
6874
04:58:24,100 --> 04:58:25,973
and better customer service
6875
04:58:25,973 --> 04:58:29,241
and they're trying
to get competitive advantages
6876
04:58:29,241 --> 04:58:30,947
over rival organizations
6877
04:58:30,947 --> 04:58:33,920
and other business benefits
and Apache spark
6878
04:58:33,920 --> 04:58:38,000
and Hadoop are two of the most
prominent Big Data Frameworks
6879
04:58:38,000 --> 04:58:41,289
and I see people often comparing
these two technologies
6880
04:58:41,289 --> 04:58:44,700
and that is what exactly
we're going to do in this video.
6881
04:58:44,700 --> 04:58:48,100
Now, we'll compare these two big
data Frame Works based
6882
04:58:48,100 --> 04:58:49,800
on different parameters,
6883
04:58:49,800 --> 04:58:52,487
but first it is important
to get an overview
6884
04:58:52,487 --> 04:58:53,800
about what is Hadoop.
6885
04:58:53,800 --> 04:58:55,600
And what is Apache spark?
6886
04:58:55,600 --> 04:58:58,900
So let me just tell you a little
bit about Hadoop Hadoop is
6887
04:58:58,900 --> 04:59:00,200
a framework to store
6888
04:59:00,200 --> 04:59:04,200
and process large sets of data
across computer clusters
6889
04:59:04,200 --> 04:59:07,100
and Hadoop can scale
from single computer system
6890
04:59:07,100 --> 04:59:09,710
up to thousands
of commodity systems
6891
04:59:09,710 --> 04:59:11,500
that offer local storage
6892
04:59:11,500 --> 04:59:14,801
and compute power and Hadoop
is composed of modules
6893
04:59:14,801 --> 04:59:18,500
that work together to create
the entire Hadoop framework.
6894
04:59:18,500 --> 04:59:20,557
These are some of the components
6895
04:59:20,557 --> 04:59:23,254
that we have in the
entire Hadoop framework
6896
04:59:23,254 --> 04:59:24,800
or the Hadoop ecosystem.
6897
04:59:24,800 --> 04:59:27,500
For example, let
me tell you about hdfs,
6898
04:59:27,500 --> 04:59:30,856
which is the storage unit
of Hadoop, and YARN, which is
6899
04:59:30,856 --> 04:59:32,500
for resource management.
6900
04:59:32,500 --> 04:59:34,600
There are different
analytical tools
6901
04:59:34,600 --> 04:59:39,500
like Apache Hive, Pig, and NoSQL
databases like Apache HBase.
6902
04:59:39,900 --> 04:59:40,900
Even Apache spark
6903
04:59:40,900 --> 04:59:43,893
and Apache Storm fit
in the Hadoop ecosystem
6904
04:59:43,893 --> 04:59:45,399
for processing big data
6905
04:59:45,399 --> 04:59:49,200
in real time. For ingesting data,
we have tools like Flume
6906
04:59:49,200 --> 04:59:52,082
and Sqoop. Flume is used
to ingest unstructured data
6907
04:59:52,082 --> 04:59:53,600
or semi-structured data
6908
04:59:53,600 --> 04:59:57,135
whereas Sqoop is used to ingest
structured data into HDFS.
6909
04:59:57,135 --> 04:59:59,900
If you want to learn more
about these tools,
6910
04:59:59,900 --> 05:00:01,470
you can go to Edureka's
6911
05:00:01,470 --> 05:00:04,000
YouTube channel and look
for Hadoop tutorial
6912
05:00:04,000 --> 05:00:06,600
where everything has
been explained in detail.
6913
05:00:06,600 --> 05:00:08,171
Now, let's move to spark
6914
05:00:08,171 --> 05:00:12,100
Apache spark is a lightning-fast
cluster Computing technology
6915
05:00:12,100 --> 05:00:14,400
that is designed
for fast computation.
6916
05:00:14,400 --> 05:00:18,223
The main feature of Spark
is its in-memory
6917
05:00:18,223 --> 05:00:19,400
cluster computing
6918
05:00:19,400 --> 05:00:23,482
that increases the processing
speed of an application. Spark
6919
05:00:23,482 --> 05:00:27,100
performs similar operations
to that of Hadoop modules,
6920
05:00:27,100 --> 05:00:30,365
but it uses an in-memory
processing and optimizes
6921
05:00:30,365 --> 05:00:33,791
the steps. The primary
difference between Hadoop MapReduce
6922
05:00:33,791 --> 05:00:35,400
and Spark is
6923
05:00:35,400 --> 05:00:38,500
that MapReduce uses
persistent storage
6924
05:00:38,500 --> 05:00:42,100
and Spark uses resilient
distributed data sets,
6925
05:00:42,100 --> 05:00:44,920
which are known as
RDDs, which reside
6926
05:00:44,920 --> 05:00:48,458
in memory. The different
components in Spark are:
6927
05:00:48,800 --> 05:00:52,000
first, the Spark Core. The Spark
Core is the base engine
6928
05:00:52,000 --> 05:00:53,600
for large-scale parallel
6929
05:00:53,600 --> 05:00:57,463
and distributed data processing.
Further, additional libraries
6930
05:00:57,463 --> 05:01:01,100
which are built on top of
the core allow diverse workloads
6931
05:01:01,100 --> 05:01:02,381
for streaming SQL
6932
05:01:02,381 --> 05:01:06,000
and machine learning. Spark Core
is also responsible
6933
05:01:06,000 --> 05:01:09,500
for memory management
and fault recovery scheduling
6934
05:01:09,500 --> 05:01:12,749
distributing and monitoring
jobs on a cluster,
6935
05:01:12,749 --> 05:01:16,000
and interacting with
the storage systems as well.
6936
05:01:16,100 --> 05:01:16,649
Next up.
6937
05:01:16,649 --> 05:01:18,300
We have spark streaming.
6938
05:01:18,300 --> 05:01:20,906
Spark streaming is
the component of spark
6939
05:01:20,906 --> 05:01:24,100
which is used to process
real-time streaming data.
6940
05:01:24,100 --> 05:01:25,822
It enables high throughput
6941
05:01:25,822 --> 05:01:29,600
and fault-tolerant stream
processing of live data streams.
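For illustration, here is a small, self-contained Scala sketch of Spark Streaming (not the course code): it processes a live socket stream in micro-batches.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))      // 5-second micro-batches

    val lines  = ssc.socketTextStream("localhost", 9999)   // a live data stream
    val counts = lines.flatMap(_.split("\\s+"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)                  // word counts per batch
    counts.print()

    ssc.start()            // start the fault-tolerant stream processing
    ssc.awaitTermination()
  }
}
```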
6942
05:01:29,600 --> 05:01:33,500
We have Spark SQL. Spark
SQL is a new module in Spark
6943
05:01:33,500 --> 05:01:36,800
which integrates relational
processing with Spark's
6944
05:01:36,800 --> 05:01:38,800
functional programming API.
6945
05:01:38,800 --> 05:01:41,700
It supports querying
data either via SQL
6946
05:01:41,700 --> 05:01:44,000
or via the hive query language.
6947
05:01:44,000 --> 05:01:46,381
For those of you
familiar with RDBMS,
6948
05:01:46,381 --> 05:01:48,300
Spark SQL will be an easy
6949
05:01:48,300 --> 05:01:51,637
transition from your earlier
tools where you can extend
6950
05:01:51,637 --> 05:01:55,100
the boundaries of traditional
relational data processing.
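As a quick illustration (not from this course), the same data can be queried via SQL or via the DataFrame API in Scala:

```scala
import org.apache.spark.sql.SparkSession

object SqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SqlSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    val sales = Seq(("US", 100), ("IN", 250), ("UK", 80)).toDF("country", "amount")
    sales.createOrReplaceTempView("sales")

    // Querying via SQL...
    spark.sql("SELECT country, SUM(amount) AS total FROM sales GROUP BY country").show()
    // ...or via the functional DataFrame API, both return the same result.
    sales.groupBy("country").sum("amount").show()

    spark.stop()
  }
}
```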
6951
05:01:55,200 --> 05:02:00,092
Next up is GraphX. GraphX is
the Spark API for graphs
6952
05:02:00,092 --> 05:02:02,400
and graph parallel computation
6953
05:02:02,400 --> 05:02:04,867
and thus it extends
the spark resilient
6954
05:02:04,867 --> 05:02:08,700
Datasets with a
Resilient Distributed Property
6955
05:02:08,700 --> 05:02:09,500
Graph.
6956
05:02:09,900 --> 05:02:13,000
Next is Spark MLlib
for machine learning.
6957
05:02:13,000 --> 05:02:16,500
MLlib stands for Machine
Learning Library. Spark
6958
05:02:16,500 --> 05:02:18,300
MLlib is used
to perform machine
6959
05:02:18,400 --> 05:02:20,900
learning in Apache Spark. Now,
6960
05:02:20,900 --> 05:02:24,200
since you've got an overview
of both these two Frameworks,
6961
05:02:24,200 --> 05:02:25,985
I believe that the ground
6962
05:02:25,985 --> 05:02:29,200
is all set to compare
Apache spark and Hadoop.
6963
05:02:29,200 --> 05:02:32,617
Let's move ahead and compare
Apache spark with Hadoop
6964
05:02:32,617 --> 05:02:36,100
on different parameters
to understand their strengths.
6965
05:02:36,100 --> 05:02:38,887
We will be comparing
these two Frameworks
6966
05:02:38,887 --> 05:02:40,700
based on these parameters.
6967
05:02:40,700 --> 05:02:44,400
Let's start with performance
first. Spark is fast
6968
05:02:44,400 --> 05:02:45,476
because it has
6969
05:02:45,476 --> 05:02:49,000
in-memory processing. It
can also use disk for data
6970
05:02:49,000 --> 05:02:51,774
that doesn't fit
into memory. Spark's
6971
05:02:51,774 --> 05:02:55,851
in-memory processing delivers
near real-time analytics
6972
05:02:56,000 --> 05:02:57,771
and this makes Spark suitable
6973
05:02:57,771 --> 05:03:00,300
for credit card
processing systems, machine
6974
05:03:00,300 --> 05:03:02,300
learning, security analytics,
6975
05:03:02,300 --> 05:03:05,100
and processing data
from IoT sensors.
6976
05:03:05,200 --> 05:03:07,700
Now, let's talk
about hadoop's performance.
6977
05:03:07,700 --> 05:03:10,700
Now Hadoop was originally
designed to continuously
6978
05:03:10,700 --> 05:03:13,700
gather data from multiple
sources without worrying
6979
05:03:13,700 --> 05:03:14,800
about the type of data
6980
05:03:14,800 --> 05:03:15,687
and storing it
6981
05:03:15,687 --> 05:03:18,544
across a distributed
environment, and MapReduce
6982
05:03:18,544 --> 05:03:22,185
uses batch processing.
MapReduce was never built for
6983
05:03:22,185 --> 05:03:24,108
real-time processing. The main idea
6984
05:03:24,108 --> 05:03:27,751
behind YARN is parallel
processing over a distributed data
6985
05:03:27,751 --> 05:03:30,400
set. The problem
with comparing the two is
6986
05:03:30,400 --> 05:03:33,400
that they have different
way of processing
6987
05:03:33,400 --> 05:03:37,400
and the idea behind the
development is also Divergent
6988
05:03:37,700 --> 05:03:40,300
Next, ease of use. Spark comes
6989
05:03:40,300 --> 05:03:44,400
with user-friendly APIs
for Scala, Java, Python
6990
05:03:44,400 --> 05:03:48,300
and Spark SQL. Spark SQL
is very similar to SQL,
6991
05:03:48,600 --> 05:03:50,047
So it becomes easier
6992
05:03:50,047 --> 05:03:53,202
for SQL developers
to learn it. Spark also
6993
05:03:53,202 --> 05:03:55,272
provides an interactive shell
6994
05:03:55,272 --> 05:03:58,700
for developers to query
and perform other actions
6995
05:03:58,700 --> 05:04:00,800
and have immediate feedback.
6996
05:04:00,900 --> 05:04:02,762
Now, let's talk about Hadoop.
6997
05:04:02,762 --> 05:04:06,544
You can ingest data in Hadoop
easily either by using shell
6998
05:04:06,544 --> 05:04:09,000
or integrating it
with multiple tools,
6999
05:04:09,000 --> 05:04:10,353
like Sqoop and Flume,
7000
05:04:10,353 --> 05:04:13,021
and yarn is just
a processing framework
7001
05:04:13,021 --> 05:04:15,900
that can be integrated
with multiple tools
7002
05:04:15,900 --> 05:04:18,200
like Hive and pig for Analytics.
7003
05:04:18,200 --> 05:04:20,353
Hive is a data
warehousing component
7004
05:04:20,353 --> 05:04:22,381
which performs Reading Writing
7005
05:04:22,381 --> 05:04:26,058
and managing large data set
in a distributed environment
7006
05:04:26,058 --> 05:04:29,100
using an SQL-like interface.
To conclude here,
7007
05:04:29,100 --> 05:04:31,700
both of them have
their own ways to make
7008
05:04:31,700 --> 05:04:33,500
themselves user-friendly.
7009
05:04:33,826 --> 05:04:36,365
Now, let's come
to the cost. Hadoop
7010
05:04:36,365 --> 05:04:39,903
and Spark are both Apache
open source projects.
7011
05:04:40,000 --> 05:04:43,900
So there's no cost for the
software; cost is only associated
7012
05:04:43,900 --> 05:04:47,433
with the infrastructure. Both
the products are designed
7013
05:04:47,433 --> 05:04:48,300
in such a way
7014
05:04:48,300 --> 05:04:50,800
that they can run
on commodity Hardware
7015
05:04:50,800 --> 05:04:54,100
with low TCO or total
cost of ownership.
7016
05:04:54,800 --> 05:04:56,895
Well now you might
be wondering the ways
7017
05:04:56,895 --> 05:04:58,400
in which they are different,
7018
05:04:58,400 --> 05:05:02,117
if they're all the same. Storage
and processing in Hadoop is
7019
05:05:02,117 --> 05:05:05,700
disk-based, and Hadoop uses
standard amounts of memory.
7020
05:05:05,700 --> 05:05:06,717
So with Hadoop,
7021
05:05:06,717 --> 05:05:07,600
we need a lot
7022
05:05:07,600 --> 05:05:12,200
of disk space as well as
a faster transfer speed. Hadoop
7023
05:05:12,200 --> 05:05:15,300
also requires multiple
systems to distribute
7024
05:05:15,300 --> 05:05:17,000
the disk input output,
7025
05:05:17,000 --> 05:05:18,900
but in case of Apache spark
7026
05:05:18,900 --> 05:05:22,800
due to its in-memory processing
it requires a lot of memory,
7027
05:05:22,800 --> 05:05:24,900
but it can deal
with a standard
7028
05:05:24,900 --> 05:05:28,400
speed and amount of disk, as
disk space is a relatively
7029
05:05:28,400 --> 05:05:29,855
inexpensive commodity
7030
05:05:29,855 --> 05:05:32,985
and since Spark does not use
disk input/output
7031
05:05:32,985 --> 05:05:34,591
for processing; instead,
7032
05:05:34,591 --> 05:05:36,337
it requires large amounts
7033
05:05:36,337 --> 05:05:39,200
of RAM for executing
everything in memory.
7034
05:05:39,300 --> 05:05:42,000
So Spark systems
incur more cost,
7035
05:05:42,300 --> 05:05:45,314
but yes one important thing
to keep in mind is
7036
05:05:45,314 --> 05:05:49,400
that Sparks technology reduces
the number of required systems,
7037
05:05:49,400 --> 05:05:52,900
it needs significantly
fewer systems that cost more
7038
05:05:52,900 --> 05:05:55,991
so there will be a point
at which Spark reduces
7039
05:05:55,991 --> 05:05:57,134
the cost per unit
7040
05:05:57,134 --> 05:06:01,100
of the computation even with
the additional RAM requirement.
7041
05:06:01,200 --> 05:06:04,500
There are two types of
data processing: batch processing
7042
05:06:04,500 --> 05:06:08,344
and stream processing. Batch
processing has been crucial
7043
05:06:08,344 --> 05:06:09,904
to the Big Data World
7044
05:06:09,904 --> 05:06:13,100
In the simplest terms, batch
processing is working
7045
05:06:13,100 --> 05:06:16,500
with high data volumes
collected over a period
7046
05:06:16,500 --> 05:06:20,423
in batch processing data is
first collected then processed
7047
05:06:20,423 --> 05:06:21,800
and then the results
7048
05:06:21,800 --> 05:06:24,624
are produced at a later
stage, and batch
7049
05:06:24,624 --> 05:06:26,000
processing is an efficient way
7050
05:06:26,000 --> 05:06:28,769
of processing large
static data sets.
7051
05:06:28,800 --> 05:06:30,300
Generally we perform
7052
05:06:30,300 --> 05:06:34,300
batch processing for archived
data sets. For example,
7053
05:06:34,300 --> 05:06:36,887
calculating average income
of a country
7054
05:06:36,887 --> 05:06:40,700
or evaluating the change
in e-commerce in the last decade
7055
05:06:40,900 --> 05:06:45,000
Now stream processing: stream
processing is the current trend
7056
05:06:45,000 --> 05:06:48,258
in the big data world. The need
of the hour is speed
7057
05:06:48,258 --> 05:06:50,100
and real-time information,
7058
05:06:50,100 --> 05:06:52,100
which is what stream processing
7059
05:06:52,100 --> 05:06:54,500
does. Batch processing
does not allow
7060
05:06:54,500 --> 05:06:57,700
businesses to quickly react
to changing business needs
7061
05:06:57,700 --> 05:07:01,900
and real-time stream processing
has seen a rapid growth
7062
05:07:01,900 --> 05:07:05,188
in that demand. Now coming
back to Apache Spark
7063
05:07:05,188 --> 05:07:09,420
versus Hadoop: YARN is basically
a batch processing framework.
7064
05:07:09,420 --> 05:07:11,500
When we submit a job to YARN,
7065
05:07:11,500 --> 05:07:14,827
it reads data from
the cluster, performs the operation
7066
05:07:14,827 --> 05:07:17,539
and writes the results
back to the cluster
7067
05:07:17,539 --> 05:07:19,100
and then it again reads
7068
05:07:19,100 --> 05:07:21,900
the updated data performs
the next operation
7069
05:07:21,900 --> 05:07:25,500
and writes the results back
to the cluster, and so on.
7070
05:07:25,700 --> 05:07:29,678
On the other hand, Spark is
designed to cover a wide range
7071
05:07:29,678 --> 05:07:31,100
of workloads such as
7072
05:07:31,100 --> 05:07:35,429
batch applications, iterative
algorithms, interactive queries
7073
05:07:35,429 --> 05:07:37,100
and streaming as well.
7074
05:07:37,400 --> 05:07:40,899
Now, let's come to fault
tolerance. Hadoop and Spark
7075
05:07:40,899 --> 05:07:43,000
both provide fault tolerance,
7076
05:07:43,000 --> 05:07:45,716
but have different
approaches. For HDFS
7077
05:07:45,716 --> 05:07:47,673
and YARN, both master daemons,
7078
05:07:47,673 --> 05:07:49,700
that is, the NameNode in HDFS
7079
05:07:49,700 --> 05:07:53,285
and the ResourceManager
in YARN, check the heartbeat
7080
05:07:53,285 --> 05:07:54,651
of the slave daemons.
7081
05:07:54,651 --> 05:07:58,000
The slave daemons are DataNodes
and NodeManagers.
7082
05:07:58,000 --> 05:08:00,100
So if any slave daemon fails,
7083
05:08:00,100 --> 05:08:03,800
the master daemons reschedule
all pending and in-progress
7084
05:08:03,800 --> 05:08:07,900
operations to another slave
now this method is effective
7085
05:08:07,900 --> 05:08:11,300
but it can significantly
increase the completion time
7086
05:08:11,300 --> 05:08:14,000
for operations with
single failure also
7087
05:08:14,000 --> 05:08:16,400
and as Hadoop uses
commodity hardware
7088
05:08:16,400 --> 05:08:20,200
and another way in which hdfs
ensures fault tolerance is
7089
05:08:20,200 --> 05:08:21,797
by replicating data.
7090
05:08:22,200 --> 05:08:24,200
Now let's talk about Spark.
7091
05:08:24,200 --> 05:08:29,094
As we discussed earlier, RDDs,
or resilient distributed data sets,
7092
05:08:29,094 --> 05:08:31,710
are building blocks
of Apache spark
7093
05:08:32,000 --> 05:08:34,100
and RDDs are the ones
7094
05:08:34,226 --> 05:08:37,073
which provide fault
tolerance to Spark.
7095
05:08:37,073 --> 05:08:38,000
They can refer
7096
05:08:38,000 --> 05:08:41,600
to any data set present
in an external storage system
7097
05:08:41,600 --> 05:08:45,200
like HDFS, HBase, a
shared file system, etc.
7098
05:08:45,300 --> 05:08:47,100
They can also be operated
7099
05:08:47,100 --> 05:08:49,869
on in parallel. RDDs can
persist a data set
7100
05:08:49,869 --> 05:08:52,100
in memory across operations,
7101
05:08:52,100 --> 05:08:56,061
which makes future actions
up to 10 times faster.
7102
05:08:56,061 --> 05:08:58,731
If an RDD is lost,
it will automatically
7103
05:08:58,731 --> 05:09:02,700
get recomputed by using
the original Transformations.
7104
05:09:02,700 --> 05:09:06,720
And this is how Spark provides
fault tolerance.
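A minimal Scala sketch of this idea (illustrative names and path, not the course code): a persisted RDD is reused across actions, and a lost partition is recomputed from its lineage of transformations.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object RddFaultToleranceSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RddSketch").setMaster("local[*]"))

    val lines   = sc.textFile("hdfs:///data/events.log")      // assumed external data set (e.g. HDFS)
    val cleaned = lines.filter(_.nonEmpty).map(_.toLowerCase) // transformations recorded as lineage
    cleaned.persist(StorageLevel.MEMORY_ONLY)                 // keep the data set in memory

    println(cleaned.count())             // first action materialises and caches the RDD
    println(cleaned.distinct().count())  // later actions reuse the cache; lost partitions
                                         // are recomputed from the original transformations
    sc.stop()
  }
}
```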
7105
05:09:06,720 --> 05:09:08,500
At the end, let us talk about security.
7106
05:09:08,500 --> 05:09:11,100
Well Hadoop has
multiple ways of providing
7107
05:09:11,100 --> 05:09:14,806
security. Hadoop supports
Kerberos for authentication,
7108
05:09:14,806 --> 05:09:17,800
but it is difficult
to handle. Nevertheless,
7109
05:09:17,800 --> 05:09:21,800
it also supports
third-party vendors like LDAP
7110
05:09:22,000 --> 05:09:23,441
for authentication;
7111
05:09:23,441 --> 05:09:26,400
they also offer
encryption. HDFS supports
7112
05:09:26,400 --> 05:09:30,600
traditional file permissions as
well as Access Control lists,
7113
05:09:30,600 --> 05:09:34,222
Hadoop provides service level
authorization which guarantees
7114
05:09:34,222 --> 05:09:36,800
that clients have
the right permissions for
7115
05:09:36,800 --> 05:09:40,400
job submission. Spark currently
supports authentication
7116
05:09:40,400 --> 05:09:44,600
via a shared secret. Spark
can integrate with HDFS
7117
05:09:44,600 --> 05:09:46,900
and it can use HDFS ACLs
7118
05:09:46,900 --> 05:09:50,652
or Access Control lists
and file level permissions
7119
05:09:50,652 --> 05:09:52,024
Spark can also run on
7120
05:09:52,024 --> 05:09:55,100
YARN, leveraging the
capability of Kerberos.
7121
05:09:55,100 --> 05:09:55,900
Now.
7122
05:09:55,900 --> 05:09:59,100
This was the comparison
of these two Frameworks based
7123
05:09:59,100 --> 05:10:00,600
on these following parameters.
7124
05:10:00,600 --> 05:10:03,300
Now, let us understand use cases
7125
05:10:03,300 --> 05:10:06,900
where these technologies
fit best. Use cases where
7126
05:10:06,900 --> 05:10:07,900
Hadoop fits best.
7127
05:10:07,900 --> 05:10:09,300
For example,
7128
05:10:09,300 --> 05:10:12,500
when you're analyzing
archived data, YARN
7129
05:10:12,500 --> 05:10:14,300
allows parallel processing
7130
05:10:14,300 --> 05:10:18,657
over huge amounts of data. Parts
of data are processed parallelly
7131
05:10:18,657 --> 05:10:21,300
and separately on
different data nodes
7132
05:10:21,300 --> 05:10:25,825
and the result is gathered
from each NodeManager. In cases
7133
05:10:25,825 --> 05:10:29,000
when instant results
are not required,
7134
05:10:29,000 --> 05:10:32,319
Hadoop mapreduce is a good
and economical solution
7135
05:10:32,319 --> 05:10:33,700
for batch processing.
7136
05:10:33,700 --> 05:10:35,546
However, it is incapable
7137
05:10:35,900 --> 05:10:39,015
of processing data
in real time. Use cases
7138
05:10:39,015 --> 05:10:43,400
where Spark fits best:
in real-time big data analysis.
7139
05:10:43,400 --> 05:10:46,600
real-time data analysis
means processing data
7140
05:10:46,600 --> 05:10:50,300
that is getting generated by
the real-time event streams
7141
05:10:50,300 --> 05:10:53,000
coming in at the rate
of Billions of events
7142
05:10:53,000 --> 05:10:55,000
per second. The strength
7143
05:10:55,000 --> 05:10:58,277
of Spark lies in its ability
to support streaming
7144
05:10:58,277 --> 05:11:00,900
of data along with
distributed processing
7145
05:11:00,900 --> 05:11:04,700
and Spark claims to process
data a hundred times faster
7146
05:11:04,700 --> 05:11:09,100
than MapReduce in memory, and 10 times
faster with disks.
7147
05:11:09,100 --> 05:11:13,000
It is used in graph
processing. Spark contains
7148
05:11:13,000 --> 05:11:15,720
a graph computation
library called GraphX,
7149
05:11:15,720 --> 05:11:18,700
which simplifies our life.
In-memory computation,
7150
05:11:18,700 --> 05:11:22,100
along with inbuilt graph support,
improves the performance
7151
05:11:22,100 --> 05:11:24,700
of the algorithm
by a magnitude
7152
05:11:24,700 --> 05:11:28,516
of one or two degrees over
traditional mapreduce programs.
7153
05:11:28,516 --> 05:11:32,200
It is also used in iterative
machine learning algorithms
7154
05:11:32,200 --> 05:11:35,900
Almost all machine learning
algorithms work iteratively.
7155
05:11:35,900 --> 05:11:39,039
As we have seen earlier,
iterative algorithms
7156
05:11:39,039 --> 05:11:41,389
involve input/output bottlenecks
7157
05:11:41,389 --> 05:11:44,400
in the MapReduce
implementations. MapReduce
7158
05:11:44,400 --> 05:11:46,400
uses coarse-grained tasks
7159
05:11:46,400 --> 05:11:47,600
that are too heavy
7160
05:11:47,600 --> 05:11:51,926
for iterative algorithms. Spark
caches the intermediate data
7161
05:11:51,926 --> 05:11:53,972
set after each iteration
7162
05:11:53,972 --> 05:11:57,586
and runs multiple iterations
on the cached data set,
7163
05:11:57,586 --> 05:12:01,200
which eventually reduces
the input output overhead
7164
05:12:01,200 --> 05:12:03,142
and executes the algorithm
7165
05:12:03,142 --> 05:12:07,400
faster in a fault-tolerant
manner. At the end, which one is
7166
05:12:07,400 --> 05:12:10,900
the best? The answer
to this is that Hadoop
7167
05:12:10,900 --> 05:12:14,800
and Apache spark are
not competing with one another.
7168
05:12:15,000 --> 05:12:18,100
In fact, they complement
each other quite well,
7169
05:12:18,100 --> 05:12:20,745
Hadoop brings huge
data sets under control
7170
05:12:20,745 --> 05:12:22,100
using commodity
7171
05:12:22,100 --> 05:12:26,100
systems, and Spark provides
real-time in-memory processing
7172
05:12:26,100 --> 05:12:27,700
for those data sets.
7173
05:12:27,900 --> 05:12:30,600
When we combine
Apache Spark's abilities,
7174
05:12:30,600 --> 05:12:34,200
that is, the high processing
speed and advanced analytics
7175
05:12:34,200 --> 05:12:38,600
and multiple integration support,
with Hadoop's low-cost operation
7176
05:12:38,600 --> 05:12:40,200
on commodity Hardware.
7177
05:12:40,200 --> 05:12:42,091
it gives the best results.
7178
05:12:42,091 --> 05:12:45,800
Hadoop complements Apache
Spark's capabilities. Spark
7179
05:12:45,800 --> 05:12:48,737
does not completely replace Hadoop,
but the good news is
7180
05:12:48,737 --> 05:12:52,079
that the demand for Spark is
currently at an all-time
7181
05:12:52,079 --> 05:12:55,849
high. If you want to learn more
about the Hadoop ecosystem tools
7182
05:12:55,849 --> 05:12:56,900
and Apache spark,
7183
05:12:56,900 --> 05:12:59,106
don't forget to take
a look at Edureka's
7184
05:12:59,106 --> 05:13:01,700
YouTube channel
and check out the big data
7185
05:13:01,700 --> 05:13:03,000
and Hadoop playlist.
7186
05:13:07,600 --> 05:13:09,776
Welcome, everyone, to
today's session on
7187
05:13:09,776 --> 05:13:11,100
Kafka Spark Streaming.
7188
05:13:11,100 --> 05:13:14,400
So without any further delay,
let's look at the agenda first.
7189
05:13:14,400 --> 05:13:16,128
We will start by understanding.
7190
05:13:16,128 --> 05:13:17,310
What is Apache Kafka?
7191
05:13:17,310 --> 05:13:19,900
Then we will discuss
about different components
7192
05:13:19,900 --> 05:13:22,000
of Apache Kafka
and its architecture.
7193
05:13:22,000 --> 05:13:24,899
Further we will look
at different Kafka commands.
7194
05:13:24,899 --> 05:13:25,546
After that.
7195
05:13:25,546 --> 05:13:27,994
We'll take a brief overview
of Apache spark
7196
05:13:27,994 --> 05:13:30,700
and we'll understand
different Spark components.
7197
05:13:30,700 --> 05:13:31,201
Finally.
7198
05:13:31,201 --> 05:13:32,579
We'll look at the demo
7199
05:13:32,579 --> 05:13:35,900
where we will use spark
Streaming with Apache Kafka.
7200
05:13:36,100 --> 05:13:37,600
Let's move to our first slide.
7201
05:13:37,900 --> 05:13:39,323
So in a real time scenario,
7202
05:13:39,323 --> 05:13:41,500
we have different
systems or services
7203
05:13:41,500 --> 05:13:43,000
which will be communicating
7204
05:13:43,000 --> 05:13:46,200
with each other and
the data pipelines are the ones
7205
05:13:46,200 --> 05:13:48,800
which are establishing
connection between two servers
7206
05:13:48,800 --> 05:13:49,953
or two systems.
7207
05:13:50,000 --> 05:13:52,100
Now, let's take
an example of an e-commerce
7208
05:13:52,100 --> 05:13:55,255
website, where it can have
multiple servers at the front end,
7209
05:13:55,255 --> 05:13:58,161
like a web or application server
for hosting the application.
7210
05:13:58,161 --> 05:13:59,530
It can have a chat server
7211
05:13:59,530 --> 05:14:01,958
for the customers
to provide chat facilities.
7212
05:14:01,958 --> 05:14:04,900
Then it can have a separate
server for payment Etc.
7213
05:14:04,900 --> 05:14:08,145
Similarly, an organization can also
have multiple servers
7214
05:14:08,145 --> 05:14:09,100
at the back end
7215
05:14:09,100 --> 05:14:11,900
which will be receiving messages
from different front end servers
7216
05:14:11,900 --> 05:14:13,200
based on the requirements.
7217
05:14:13,400 --> 05:14:15,600
Now they can have
a database server
7218
05:14:15,600 --> 05:14:17,700
which will be storing
the records then they
7219
05:14:17,700 --> 05:14:20,100
can have security systems
for user authentication
7220
05:14:20,100 --> 05:14:21,916
and authorization then
they can have
7221
05:14:21,916 --> 05:14:23,368
Real-time monitoring server,
7222
05:14:23,368 --> 05:14:25,600
which is basically
used for recommendations.
7223
05:14:25,600 --> 05:14:28,100
So all these data
pipelines become complex
7224
05:14:28,100 --> 05:14:30,200
with the increase
in number of systems
7225
05:14:30,200 --> 05:14:31,594
and adding a new system
7226
05:14:31,594 --> 05:14:33,900
or server requires
more data pipelines,
7227
05:14:33,900 --> 05:14:35,900
which will again
make the data flow
7228
05:14:35,900 --> 05:14:37,800
more complicated and complex.
7229
05:14:37,800 --> 05:14:38,662
Now managing.
7230
05:14:38,662 --> 05:14:41,646
these data pipelines also
becomes very difficult,
7231
05:14:41,646 --> 05:14:45,100
as each data pipeline has
its own set of requirements.
7232
05:14:45,100 --> 05:14:46,700
For example, data pipelines
7233
05:14:46,700 --> 05:14:49,700
which handle transactions
should be more fault-tolerant
7234
05:14:49,700 --> 05:14:51,700
and robust; on the other hand, a
7235
05:14:51,700 --> 05:14:54,372
clickstream data pipeline
can be more fragile.
7236
05:14:54,372 --> 05:14:55,784
So adding some pipelines
7237
05:14:55,784 --> 05:14:58,400
or removing some pipelines
becomes more difficult
7238
05:14:58,400 --> 05:14:59,600
from the complex system.
7239
05:14:59,800 --> 05:15:02,800
So now I hope that you would
have understood the problem
7240
05:15:02,800 --> 05:15:05,400
due to which messaging
systems originated.
7241
05:15:05,400 --> 05:15:08,200
Let's move to the next slide
and we'll understand
7242
05:15:08,200 --> 05:15:11,970
how Kafka solves this problem.
Now, a messaging system reduces
7243
05:15:11,970 --> 05:15:13,835
the complexity of data pipelines
7244
05:15:13,835 --> 05:15:16,600
and makes the communication
between systems more
7245
05:15:16,600 --> 05:15:19,780
simple and manageable.
Using a messaging system,
7246
05:15:19,780 --> 05:15:22,500
now you can easily
establish remote communication
7247
05:15:22,500 --> 05:15:25,063
and send your data
easily across the network.
7248
05:15:25,063 --> 05:15:26,536
Now, different systems
7249
05:15:26,536 --> 05:15:29,100
may use different
platforms and languages
7250
05:15:29,200 --> 05:15:30,200
and messaging system
7251
05:15:30,200 --> 05:15:32,852
provides you a common
paradigm, independent
7252
05:15:32,852 --> 05:15:34,560
of any platform or language.
7253
05:15:34,560 --> 05:15:36,900
So basically it
decouples the platform
7254
05:15:36,900 --> 05:15:39,800
on which a front end server as
well as your back-end server
7255
05:15:39,800 --> 05:15:43,600
is running. You can also establish
asynchronous communication
7256
05:15:43,600 --> 05:15:44,800
and send messages
7257
05:15:44,800 --> 05:15:47,000
so that the sender
does not have to wait
7258
05:15:47,000 --> 05:15:49,000
for the receiver
to process the messages.
7259
05:15:49,200 --> 05:15:51,300
Now one of the benefits
of a messaging system is
7260
05:15:51,300 --> 05:15:53,295
that you can have
reliable communication.
7261
05:15:53,295 --> 05:15:56,600
So even when the receiver or
network is not working properly,
7262
05:15:56,600 --> 05:15:59,272
your messages wouldn't
get lost. Now talking
7263
05:15:59,272 --> 05:16:02,900
about Kafka: Kafka
decouples the data pipelines
7264
05:16:02,900 --> 05:16:06,205
and solves the complexity
problem. The applications
7265
05:16:06,205 --> 05:16:10,050
which are producing messages
to Kafka are called producers
7266
05:16:10,050 --> 05:16:11,400
and the applications
7267
05:16:11,400 --> 05:16:13,600
which are consuming
those messages from Kafka
7268
05:16:13,600 --> 05:16:14,706
are called consumers.
7269
05:16:14,706 --> 05:16:17,500
Now, as you can see in the image
the front end server,
7270
05:16:17,500 --> 05:16:20,200
then your application server
1, application server
7271
05:16:20,200 --> 05:16:21,500
2 and the chat server
7272
05:16:21,500 --> 05:16:25,500
are producing messages to Kafka,
and these are called producers,
7273
05:16:25,500 --> 05:16:26,985
and your database server
7274
05:16:26,985 --> 05:16:29,594
security systems real-time
monitoring server
7275
05:16:29,594 --> 05:16:31,900
then other services
and the data warehouse.
7276
05:16:31,900 --> 05:16:34,300
These are basically
consuming the messages
7277
05:16:34,300 --> 05:16:35,900
and are called consumers.
7278
05:16:36,100 --> 05:16:39,600
So your producer sends
the message to Kafka
7279
05:16:39,700 --> 05:16:42,781
and then Kafka stores
those messages, and consumers
7280
05:16:42,781 --> 05:16:45,000
who want those
messages can subscribe
7281
05:16:45,000 --> 05:16:47,607
and receive them. Now
you can also have
7282
05:16:47,607 --> 05:16:51,191
multiple subscribers to
a single category of messages.
7283
05:16:51,191 --> 05:16:52,623
So your database server
7284
05:16:52,623 --> 05:16:56,400
and your security system can
be consuming the same messages
7285
05:16:56,400 --> 05:16:58,600
which is produced
by application server
7286
05:16:58,600 --> 05:17:01,423
1 and again adding
a new consumer is very easy.
7287
05:17:01,423 --> 05:17:03,658
You can go ahead and
add a new consumer
7288
05:17:03,658 --> 05:17:06,268
and just subscribe
to the message categories
7289
05:17:06,268 --> 05:17:07,300
that is required.
7290
05:17:07,300 --> 05:17:10,700
So again, you can add
a new consumer say consumer one
7291
05:17:10,700 --> 05:17:13,100
and you can again
go ahead and subscribe
7292
05:17:13,100 --> 05:17:14,570
to the category of messages
7293
05:17:14,570 --> 05:17:17,100
which is produced by
application server one.
7294
05:17:17,100 --> 05:17:19,100
So, let's quickly move ahead.
7295
05:17:19,100 --> 05:17:21,606
Let's talk about
Apache Kafka. So Apache
7296
05:17:21,606 --> 05:17:24,853
Kafka is a distributed
publish/subscribe messaging
7297
05:17:24,853 --> 05:17:28,300
system. Messaging traditionally
has two models: queuing
7298
05:17:28,300 --> 05:17:32,173
and publish/subscribe. In a queue,
a pool of consumers
7299
05:17:32,173 --> 05:17:33,769
may read from a server
7300
05:17:33,769 --> 05:17:36,540
and each record only
goes to one of them
7301
05:17:36,540 --> 05:17:38,600
whereas in publish/subscribe,
7302
05:17:38,600 --> 05:17:41,313
the record is broadcast
to all consumers.
7303
05:17:41,313 --> 05:17:43,722
So multiple consumers
can get the record
7304
05:17:43,722 --> 05:17:45,700
The Kafka cluster is distributed
7305
05:17:45,700 --> 05:17:48,374
and has multiple machines
running in parallel.
7306
05:17:48,374 --> 05:17:50,700
And this is the reason
why Kafka is fast,
7307
05:17:50,700 --> 05:17:52,000
scalable and fault-tolerant.
7308
05:17:52,300 --> 05:17:53,309
Now let me tell you
7309
05:17:53,309 --> 05:17:55,700
that Kafka was developed
at LinkedIn, and later
7310
05:17:55,700 --> 05:17:57,700
it became a part
of the Apache project.
7311
05:17:57,900 --> 05:18:01,100
Now, let us look at some
of the important terminologies.
7312
05:18:01,100 --> 05:18:03,499
So we'll first start with topic.
7313
05:18:03,499 --> 05:18:05,081
So topic is a category
7314
05:18:05,081 --> 05:18:08,100
or feed name to which
records are published
7315
05:18:08,100 --> 05:18:11,226
and topics in Kafka are
always multi-subscriber.
7316
05:18:11,226 --> 05:18:14,800
That is, a topic can have
zero, one or multiple consumers
7317
05:18:14,800 --> 05:18:16,600
that can subscribe the topic
7318
05:18:16,600 --> 05:18:19,300
and consume the data written
to it. For example,
7319
05:18:19,300 --> 05:18:21,850
you can have sales records
getting published in the sales
7320
05:18:21,850 --> 05:18:23,500
topic; you can
have product records
7321
05:18:23,500 --> 05:18:25,600
which are getting published
in the product topic,
7322
05:18:25,600 --> 05:18:28,965
and so on. This will actually
segregate your messages,
7323
05:18:28,965 --> 05:18:31,756
and a consumer will only
subscribe to the topics
7324
05:18:31,756 --> 05:18:35,500
that they need, and again your
consumer can also subscribe
7325
05:18:35,500 --> 05:18:37,300
to two or more topics.
7326
05:18:37,300 --> 05:18:40,100
Now, let's talk
about partitions.
7327
05:18:40,100 --> 05:18:44,253
So Kafka topics are divided
into a number of partitions
7328
05:18:44,253 --> 05:18:47,800
and partitions allow
you to parallelize a topic
7329
05:18:47,800 --> 05:18:49,284
by splitting the data
7330
05:18:49,284 --> 05:18:51,846
in a particular
topic across multiple
7331
05:18:51,846 --> 05:18:55,200
brokers, which means
each partition can be placed
7332
05:18:55,200 --> 05:18:58,869
on separate machine to allow
multiple consumers to read
7333
05:18:58,869 --> 05:19:00,500
from a topic parallelly.
7334
05:19:00,500 --> 05:19:02,700
So in case of serious
topic you can have
7335
05:19:02,700 --> 05:19:05,700
three partition partition
0 partition 1 and partition
7336
05:19:05,700 --> 05:19:09,400
to from where three consumers
can read data parallel.
7337
05:19:09,400 --> 05:19:10,481
Now moving ahead.
7338
05:19:10,481 --> 05:19:12,200
Let's talk about producers.
7339
05:19:12,200 --> 05:19:13,845
So producers are the ones
7340
05:19:13,845 --> 05:19:17,000
who publish the data
to topics of their choice.
7341
05:19:17,000 --> 05:19:18,600
Then you have consumers
7342
05:19:18,600 --> 05:19:21,786
so consumers can subscribe
to one or more topic.
7343
05:19:21,786 --> 05:19:22,910
And consume data
7344
05:19:22,910 --> 05:19:26,773
from those topics. Now, consumers
basically label themselves
7345
05:19:26,773 --> 05:19:28,600
with a consumer group name
7346
05:19:28,600 --> 05:19:31,900
and each record published
to a topic is delivered
7347
05:19:31,900 --> 05:19:35,703
to one consumer instance within
each subscribing consumer group.
7348
05:19:35,703 --> 05:19:37,536
So suppose you have
a consumer group.
7349
05:19:37,536 --> 05:19:40,072
Let's say consumer Group
1 and then you have
7350
05:19:40,072 --> 05:19:41,900
three consumers residing in it.
7351
05:19:41,900 --> 05:19:45,400
that is, consumer A, consumer B
and consumer C. Now,
7352
05:19:45,400 --> 05:19:47,015
from the sales topic,
7353
05:19:47,100 --> 05:19:51,600
each record can be read once
by consumer group 1, and it
7354
05:19:51,600 --> 05:19:56,200
can either be read by consumer A
or consumer B or consumer C,
7355
05:19:56,200 --> 05:20:00,337
but it can only be consumed once
by the single consumer group
7356
05:20:00,337 --> 05:20:02,200
that is consumer group one.
7357
05:20:02,200 --> 05:20:05,700
But again, you can have
multiple consumer groups
7358
05:20:05,700 --> 05:20:07,700
which can subscribe to a topic
7359
05:20:07,700 --> 05:20:11,260
where one record can be consumed
by multiple consumers.
7360
05:20:11,260 --> 05:20:14,226
That is one consumer
from each consumer group.
7361
05:20:14,226 --> 05:20:16,842
So now let's say
you have a consumer group 1
7362
05:20:16,842 --> 05:20:19,291
and a consumer group
2. In consumer group
7363
05:20:19,291 --> 05:20:20,600
1 we have two consumers,
7364
05:20:20,600 --> 05:20:22,854
that is, consumer 1A
and consumer 1B,
7365
05:20:22,854 --> 05:20:24,400
and in consumer group 2 we
7366
05:20:24,400 --> 05:20:27,819
have two consumers, consumer 2A
and consumer 2B. So
7367
05:20:27,819 --> 05:20:30,229
if consumer Group
1 and consumer group
7368
05:20:30,229 --> 05:20:32,900
2 are consuming messages
from topic sales.
7369
05:20:32,900 --> 05:20:36,000
So the single record will be
consumed by consumer group one
7370
05:20:36,000 --> 05:20:39,111
as well as consumer group
2 and a single consumer
7371
05:20:39,111 --> 05:20:43,000
from both the consumer groups
will consume the record once. So
7372
05:20:43,000 --> 05:20:45,900
I guess you are clear
with the concept of consumer
7373
05:20:45,900 --> 05:20:49,124
and consumer group. Now,
consumer instances can be
7374
05:20:49,124 --> 05:20:51,800
a separate process
or separate machines.
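The demo later uses the console tools, but as a hedged sketch, this is how a consumer instance labels itself with a consumer group using the standard Kafka client API (2.x) from Scala; the topic name "sales" and the group name are just the example values above:

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

object GroupConsumerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("group.id", "consumer-group-1")   // the consumer group this instance belongs to
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("sales"))   // subscribe to the sales topic

    while (true) {
      val records = consumer.poll(Duration.ofMillis(500))
      val it = records.iterator()
      while (it.hasNext) {
        val r = it.next()
        // Each record of the topic is delivered to only one consumer in this group.
        println(s"partition ${r.partition()} offset ${r.offset()}: ${r.value()}")
      }
    }
  }
}
```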
7375
05:20:51,900 --> 05:20:55,918
Now talking about brokers: brokers
are nothing but a single machine
7376
05:20:55,918 --> 05:20:57,300
in the Kafka cluster,
7377
05:20:57,300 --> 05:21:00,800
and zookeeper is another Apache
open source project.
7378
05:21:00,800 --> 05:21:03,536
It stores metadata
information related
7379
05:21:03,536 --> 05:21:04,700
to the Kafka cluster,
7380
05:21:04,700 --> 05:21:08,100
like brokers' information,
topic details, etc.
7381
05:21:08,100 --> 05:21:09,933
ZooKeeper is basically the one
7382
05:21:09,933 --> 05:21:12,316
who is managing
the whole Kafka cluster.
7383
05:21:12,316 --> 05:21:14,700
Now, let's quickly go
to the next slide.
7384
05:21:14,700 --> 05:21:16,900
So suppose you have a topic.
7385
05:21:16,900 --> 05:21:21,100
Let's assume this is the sales topic
and you have four partitions,
7386
05:21:21,100 --> 05:21:23,900
so you have partition
0, partition 1, partition
7387
05:21:23,900 --> 05:21:27,600
2 and partition 3. Now you
have five brokers over here.
7388
05:21:27,614 --> 05:21:30,768
Now, let's take the case
of partition 1 so
7389
05:21:30,850 --> 05:21:34,800
if the replication factor
is 3 it will have 3 copies
7390
05:21:34,800 --> 05:21:37,100
which will reside
on different Brokers.
7391
05:21:37,100 --> 05:21:40,121
So one replica is
on broker 2, the next is
7392
05:21:40,121 --> 05:21:43,000
on broker 3 and the next is
on broker 5, and
7393
05:21:43,000 --> 05:21:44,800
as you can see, replica 5;
7394
05:21:45,000 --> 05:21:47,800
so this 5 is from this broker 5.
7395
05:21:48,100 --> 05:21:52,500
So the ID of the replica
is the same as the ID of the broker
7396
05:21:52,500 --> 05:21:55,700
that hosts it. Now moving ahead,
7397
05:21:55,700 --> 05:21:57,100
one of the replicas
7398
05:21:57,100 --> 05:22:00,800
of partition one will serve
as the leader replica.
7399
05:22:00,800 --> 05:22:02,074
So now the leader
7400
05:22:02,074 --> 05:22:06,200
of partition one is replica
five and any consumer coming
7401
05:22:06,200 --> 05:22:07,684
and consuming messages
7402
05:22:07,684 --> 05:22:10,944
from partition 1 will
be served by this replica.
7403
05:22:10,944 --> 05:22:14,635
And these two replicas are
basically for fault tolerance,
7404
05:22:14,635 --> 05:22:17,343
so that once your
broker 5 goes off
7405
05:22:17,343 --> 05:22:19,264
or your disk becomes corrupt,
7406
05:22:19,264 --> 05:22:21,115
so your replica 3 or replica
7407
05:22:21,115 --> 05:22:24,100
2; one of them
will again serve as the leader,
7408
05:22:24,100 --> 05:22:26,938
and this is basically
decided on the basis
7409
05:22:26,938 --> 05:22:28,600
of the most in-sync replica.
7410
05:22:28,600 --> 05:22:30,587
So the replica
which will be most
7411
05:22:30,587 --> 05:22:34,100
in sync with this replica
will become the next leader.
7412
05:22:34,100 --> 05:22:36,700
So similarly, this
partition 0 may reside
7413
05:22:36,700 --> 05:22:40,400
on broker 1, broker 2
and broker 3. Again,
7414
05:22:40,400 --> 05:22:44,500
your partition 2 may
reside on broker 4, broker
7415
05:22:44,500 --> 05:22:46,800
5 and say broker 1,
7416
05:22:46,900 --> 05:22:49,500
and then your third
partition might reside
7417
05:22:49,500 --> 05:22:51,500
on these three brokers.
7418
05:22:51,700 --> 05:22:54,900
So suppose that this is
the leader for partition
7419
05:22:54,900 --> 05:22:56,378
2, this is the leader
7420
05:22:56,378 --> 05:22:59,900
for partition 0 this is
the leader for partition 3.
7421
05:22:59,900 --> 05:23:02,900
and this is the leader
for partition 1. Right,
7422
05:23:02,900 --> 05:23:03,600
so you can see
7423
05:23:03,600 --> 05:23:08,300
that four consumers can consume
data parallelly from these brokers.
7424
05:23:08,300 --> 05:23:10,798
So this one can consume
data from partition
7425
05:23:10,798 --> 05:23:14,200
2, this consumer can consume
data from partition 0,
7426
05:23:14,200 --> 05:23:17,800
and similarly for partition
3 and partition 1.
7427
05:23:18,100 --> 05:23:21,500
Now, maintaining
the replicas basically helps
7428
05:23:21,500 --> 05:23:25,433
in fault tolerance, and keeping
different partition leaders
7429
05:23:25,433 --> 05:23:29,300
on different Brokers basically
helps in parallel execution
7430
05:23:29,300 --> 05:23:32,300
or you can say parallelly
consuming those messages.
7431
05:23:32,300 --> 05:23:34,391
So I hope that you
guys are clear
7432
05:23:34,391 --> 05:23:36,972
about topics, partitions
and replicas. Now,
7433
05:23:36,972 --> 05:23:38,803
let's move to our next slide.
7434
05:23:38,803 --> 05:23:42,062
So this is how the whole
Kafka cluster looks. You
7435
05:23:42,062 --> 05:23:43,567
have multiple producers,
7436
05:23:43,567 --> 05:23:46,200
which is again producing
messages to Kafka.
7437
05:23:46,200 --> 05:23:48,600
Then this whole is
the Kafka cluster
7438
05:23:48,600 --> 05:23:51,590
where you have two nodes: node
1 has two brokers,
7439
05:23:51,590 --> 05:23:55,128
broker 1 and broker 2,
and node 2 has two brokers,
7440
05:23:55,128 --> 05:23:58,600
which are broker 3 and broker
4. Again, consumers
7441
05:23:58,600 --> 05:24:01,434
will be consuming data
from these Brokers
7442
05:24:01,434 --> 05:24:03,135
and zookeeper is the one
7443
05:24:03,135 --> 05:24:05,900
who is managing
this whole Kafka cluster.
7444
05:24:06,200 --> 05:24:07,100
Now, let's look
7445
05:24:07,100 --> 05:24:10,688
at some basic commands of Kafka
and understand how Kafka works:
7446
05:24:10,688 --> 05:24:12,500
how to go ahead
and start ZooKeeper,
7447
05:24:12,500 --> 05:24:14,708
how to go ahead
and start Kafka server
7448
05:24:14,708 --> 05:24:16,200
and how to again go ahead
7449
05:24:16,200 --> 05:24:19,141
and produce some messages
to Kafka and then consume
7450
05:24:19,141 --> 05:24:20,600
some messages from Kafka.
7451
05:24:20,600 --> 05:24:21,800
So let me quickly switch
7452
05:24:21,800 --> 05:24:27,200
to my VM. So let me
quickly open the terminal.
7453
05:24:28,600 --> 05:24:31,400
Let me quickly go ahead
and execute sudo jps
7454
05:24:31,400 --> 05:24:33,180
so that I can check
all the daemons
7455
05:24:33,180 --> 05:24:34,800
that are running in my system.
7456
05:24:35,400 --> 05:24:37,095
So you can see I have NameNode,
7457
05:24:37,095 --> 05:24:40,800
DataNode, ResourceManager,
NodeManager and JobHistoryServer.
7458
05:24:42,000 --> 05:24:43,933
So now, as all the HDFS daemons
7459
05:24:43,933 --> 05:24:46,200
are running, let us
quickly go ahead
7460
05:24:46,200 --> 05:24:48,100
and start Kafka services.
7461
05:24:48,100 --> 05:24:50,561
So first I will go
to Kafka home.
7462
05:24:51,400 --> 05:24:53,800
So let me show
you the directory.
7463
05:24:53,800 --> 05:24:56,200
So my Kafka is in /usr/lib.
7464
05:24:56,600 --> 05:24:56,900
Now.
7465
05:24:56,900 --> 05:25:00,088
Let me quickly go ahead
and start the ZooKeeper service.
7466
05:25:00,088 --> 05:25:01,087
But before that,
7467
05:25:01,087 --> 05:25:03,900
let me show you
the zookeeper.properties file.
7468
05:25:06,415 --> 05:25:10,800
So the client port is 2181, so
my ZooKeeper will be running
7469
05:25:10,800 --> 05:25:12,300
on port 2181,
7470
05:25:12,600 --> 05:25:15,400
and the data directory
in which my zookeeper
7471
05:25:15,400 --> 05:25:19,700
will store all the metadata
is /tmp/zookeeper.
7472
05:25:20,000 --> 05:25:23,200
So let us quickly go ahead
and start zookeeper
7473
05:25:23,400 --> 05:25:28,300
and the command is
bin/zookeeper-server-start.sh.
7474
05:25:28,900 --> 05:25:30,500
So this is the script file
7475
05:25:30,500 --> 05:25:33,300
and then I'll pass
the properties file
7476
05:25:33,357 --> 05:25:37,988
which is inside the config directory,
and I'll hit enter. Meanwhile,
7477
05:25:37,988 --> 05:25:39,834
let me open another tab.
7478
05:25:40,403 --> 05:25:44,096
So here I will be starting
my first Kafka broker.
7479
05:25:44,200 --> 05:25:47,200
But before that let me show
you the properties file.
7480
05:25:47,576 --> 05:25:50,423
So we'll go
in config directory again,
7481
05:25:51,100 --> 05:25:53,700
and I have
server.properties.
7482
05:25:54,400 --> 05:25:58,300
So this is the properties
of my first Kafka broker.
7483
05:25:59,507 --> 05:26:01,892
So first we have the server basics.
7484
05:26:02,300 --> 05:26:06,400
So here the broker ID
of my first broker is 0, then
7485
05:26:06,400 --> 05:26:10,700
the port is 9092, on which
my first broker will be running.
7486
05:26:11,400 --> 05:26:14,500
So it contains all
the socket server settings
7487
05:26:14,657 --> 05:26:16,042
then moving ahead.
7488
05:26:16,049 --> 05:26:17,555
we have the log basics.
7489
05:26:17,555 --> 05:26:21,139
So in the log basics,
this is the log directory,
7490
05:26:21,200 --> 05:26:23,500
which is /tmp/kafka-logs.
So over here,
05:26:23,500 --> 05:26:26,400
logs so over here
my Kafka will store
7492
05:26:26,400 --> 05:26:28,226
all those messages or records,
7493
05:26:28,226 --> 05:26:30,600
which will be produced
by The Producers.
7494
05:26:30,600 --> 05:26:31,799
So all the records
7495
05:26:31,799 --> 05:26:35,600
which belongs to broker 0
will be stored at this location.
7496
05:26:35,900 --> 05:26:39,200
Now, the next section is
internal topic settings
7497
05:26:39,200 --> 05:26:40,900
in which the offsets topic
7498
05:26:40,900 --> 05:26:42,500
replication factor is 1,
7499
05:26:42,500 --> 05:26:48,100
then the transaction state log
replication factor is 1. Next,
7500
05:26:48,384 --> 05:26:50,615
we have log retention policy.
7501
05:26:50,900 --> 05:26:54,500
So the log retention
hours is 168.
7502
05:26:54,500 --> 05:26:58,319
So your records will be stored
for 168 hours by default
7503
05:26:58,319 --> 05:27:00,300
and then it will be deleted.
7504
05:27:00,300 --> 05:27:02,300
Then you have
zookeeper properties
7505
05:27:02,300 --> 05:27:05,100
where we have specified
zookeeper.connect, and
7506
05:27:05,100 --> 05:27:07,482
as we have seen
in the zookeeper.properties file
7507
05:27:07,482 --> 05:27:10,000
that our ZooKeeper
will be running on port 2181,
7508
05:27:10,000 --> 05:27:12,000
so we are giving
the address of Zookeeper
7509
05:27:12,000 --> 05:27:13,900
that is,
localhost:2181,
7510
05:27:14,300 --> 05:27:15,911
and at last we have the group
7511
05:27:15,911 --> 05:27:18,700
coordinator settings. So
let us quickly go ahead
7512
05:27:18,700 --> 05:27:20,700
and start the first broker.
7513
05:27:21,457 --> 05:27:24,842
So the script file is
kafka-server-start.sh,
7514
05:27:24,900 --> 05:27:27,100
and then we have to give
the properties file,
7515
05:27:27,200 --> 05:27:31,000
which is server.properties
for the first broker.
7516
05:27:31,200 --> 05:27:35,276
I'll hit enter and meanwhile,
let me open another tab.
7517
05:27:36,234 --> 05:27:39,865
now I'll show you
the next properties file,
7518
05:27:40,200 --> 05:27:43,400
which is server-1.
7519
05:27:43,400 --> 05:27:44,600
properties.
7520
05:27:45,300 --> 05:27:46,400
So the things
7521
05:27:46,400 --> 05:27:50,700
which you have to change
for creating a new broker
7522
05:27:51,000 --> 05:27:54,700
is first you have
to change the broker ID.
7523
05:27:54,900 --> 05:27:59,100
So my earlier broker ID was 0;
the new broker ID is 1. Again,
7524
05:27:59,100 --> 05:28:02,255
you can replicate this file
and for a new server,
7525
05:28:02,255 --> 05:28:05,059
you have to change
the broker ID to 2, then
7526
05:28:05,059 --> 05:28:08,513
you have to change the port,
because on 9092 already
7527
05:28:08,513 --> 05:28:11,200
my first broker is running,
that is broker 0,
7528
05:28:11,200 --> 05:28:12,019
so my broker
7529
05:28:12,019 --> 05:28:14,099
should connect to
a different port
7530
05:28:14,099 --> 05:28:17,000
and here I have specified
9093.
7531
05:28:17,700 --> 05:28:21,600
Next thing what you have
to change is the log directory.
7532
05:28:21,600 --> 05:28:25,830
So here I have added a -1
to the default log directory.
7533
05:28:25,830 --> 05:28:27,400
So all these records
7534
05:28:27,400 --> 05:28:30,600
which are stored to my broker
1 will be going
7535
05:28:30,600 --> 05:28:32,505
to this particular directory
7536
05:28:32,505 --> 05:28:35,500
that is
/tmp/kafka-logs-
7537
05:28:35,500 --> 05:28:39,500
1. And the rest of the
things are similar,
7538
05:28:39,700 --> 05:28:42,900
so let me quickly go ahead
and start second broker as well.
7539
05:28:45,800 --> 05:28:48,000
And let me open
one more terminal.
7540
05:28:51,569 --> 05:28:54,030
And I'll start
broker 2 as well.
7541
05:29:01,400 --> 05:29:06,475
So the ZooKeeper started, then
broker 1 is also started,
7542
05:29:06,475 --> 05:29:09,700
and this is broker
2 which is also started,
7543
05:29:09,702 --> 05:29:11,472
and this is broker 3.
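Putting the dictated commands together (the Kafka home path is assumed from "user lib"; the third broker's properties file name is an assumption following the same pattern):

```bash
cd /usr/lib/kafka   # assumed Kafka home from the demo

# Terminal 1: start ZooKeeper (client port 2181, data dir /tmp/zookeeper)
./bin/zookeeper-server-start.sh config/zookeeper.properties

# Terminals 2-4: start the three brokers, each with its own properties file
./bin/kafka-server-start.sh config/server.properties     # broker.id=0, port 9092, /tmp/kafka-logs
./bin/kafka-server-start.sh config/server-1.properties   # broker.id=1, port 9093, /tmp/kafka-logs-1
./bin/kafka-server-start.sh config/server-2.properties   # broker.id=2 (assumed file name)
```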
7544
05:29:12,600 --> 05:29:14,600
So now let me
quickly minimize this
7545
05:29:15,200 --> 05:29:17,300
and I'll open a new terminal.
7546
05:29:18,000 --> 05:29:20,800
Now first, let us look
at some commands related
7547
05:29:20,800 --> 05:29:21,900
to Kafka topics.
7548
05:29:21,900 --> 05:29:24,900
So I'll quickly go ahead
and create a topic.
7549
05:29:25,250 --> 05:29:29,250
So again, let me first go
to my Kafka home directory.
7550
05:29:31,700 --> 05:29:36,000
Then the script file
is kafka-topics.sh,
7551
05:29:36,000 --> 05:29:37,762
then the first parameter
7552
05:29:37,762 --> 05:29:41,800
is --create, then we have to give
the address of ZooKeeper,
7553
05:29:41,800 --> 05:29:43,327
because zookeeper is the one
7554
05:29:43,327 --> 05:29:46,000
who actually contains
all the details related
7555
05:29:46,000 --> 05:29:47,000
to your topic.
7556
05:29:47,700 --> 05:29:50,600
So the address of my ZooKeeper
is localhost:2181,
7557
05:29:50,700 --> 05:29:53,000
then we'll give the topic name.
7558
05:29:53,000 --> 05:29:56,076
So let me name the topic
as kafka-
7559
05:29:56,076 --> 05:30:00,000
spark. Next we have to specify
the replication factor
7560
05:30:00,000 --> 05:30:01,100
of the topic.
7561
05:30:01,300 --> 05:30:04,900
So it will replicate all
the partitions inside the topic
7562
05:30:04,900 --> 05:30:05,700
that many times.
7563
05:30:06,600 --> 05:30:08,300
So --replication-
7564
05:30:08,300 --> 05:30:10,900
factor: as we
have three brokers,
7565
05:30:10,900 --> 05:30:15,600
so let me keep it as 3
and then we have partitions.
7566
05:30:15,800 --> 05:30:17,074
So I will keep it as
7567
05:30:17,074 --> 05:30:19,746
three because we have
three Brokers running
7568
05:30:19,746 --> 05:30:21,689
and our consumer can go ahead
7569
05:30:21,689 --> 05:30:23,700
and consume messages parallely
7570
05:30:23,700 --> 05:30:27,010
from three Brokers and
let me press enter.
7571
05:30:29,300 --> 05:30:32,000
So now you can see
the topic is created.
7572
05:30:32,000 --> 05:30:35,100
Now, let us quickly go ahead
and list all the topics.
7573
05:30:35,100 --> 05:30:36,100
So the command
7574
05:30:36,100 --> 05:30:40,200
for listing all the topics
is ./bin again.
7575
05:30:40,200 --> 05:30:44,200
We'll open the kafka-topics
script file, then --
7576
05:30:44,200 --> 05:30:48,300
list, and again we'll provide
the address of ZooKeeper.
7577
05:30:48,700 --> 05:30:50,000
So to again list the topics,
7578
05:30:50,000 --> 05:30:53,674
we have to first go to
the kafka-topics script file.
7579
05:30:53,674 --> 05:30:55,200
Then we have to give the --
7580
05:30:55,200 --> 05:30:59,300
list parameter, and next we
have to give the ZooKeeper address,
7581
05:30:59,576 --> 05:31:02,423
which is localhost:2181.
I'll hit enter.
7582
05:31:04,100 --> 05:31:07,000
And you can see
I have this kafka-
7583
05:31:07,000 --> 05:31:11,000
spark; the kafka-spark
topic has been created.
7584
05:31:11,100 --> 05:31:11,407
Now.
7585
05:31:11,407 --> 05:31:14,176
Let me show you
one more thing again.
7586
05:31:14,176 --> 05:31:18,900
We'll go to bin/
kafka-topics.sh
7587
05:31:19,000 --> 05:31:21,100
and we'll describe this topic.
7588
05:31:21,900 --> 05:31:24,600
I will pass the address
of ZooKeeper,
7589
05:31:24,800 --> 05:31:26,300
which is localhost
7590
05:31:26,600 --> 05:31:30,600
:2181, and then
I'll pass the topic name,
7591
05:31:31,000 --> 05:31:34,700
which is kafka-spark.
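The three topic commands dictated above, written out in full:

```bash
# Create the topic with replication factor 3 and 3 partitions
./bin/kafka-topics.sh --create --zookeeper localhost:2181 \
  --topic kafka-spark --replication-factor 3 --partitions 3

# List all topics
./bin/kafka-topics.sh --list --zookeeper localhost:2181

# Describe the topic: shows partitions, leaders and in-sync replicas
./bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic kafka-spark
```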
7592
05:31:36,400 --> 05:31:37,600
So now you can see here.
7593
05:31:37,600 --> 05:31:40,100
The topic is kafka-spark.
7594
05:31:40,100 --> 05:31:43,400
The partition count is
3, the replication factor is 3,
7595
05:31:43,400 --> 05:31:45,600
and the config is as follows.
7596
05:31:45,700 --> 05:31:49,900
So here you can see all the
three partitions of the topic
7597
05:31:49,900 --> 05:31:54,400
that is partition 0 partition 1
and partition 2 then the leader
7598
05:31:54,400 --> 05:31:57,400
for partition 0 is
broker 2, the leader
7599
05:31:57,400 --> 05:31:59,417
for partition one is broker 0
7600
05:31:59,417 --> 05:32:02,200
and leader for partition
2 is broker 1,
7601
05:32:02,200 --> 05:32:06,194
so you can see we have different
partition leaders residing on
7602
05:32:06,194 --> 05:32:09,600
different brokers, so this is
basically for load balancing.
7603
05:32:09,600 --> 05:32:11,900
So that different partition
could be served
7604
05:32:11,900 --> 05:32:13,000
from different Brokers
7605
05:32:13,000 --> 05:32:15,413
and it could be
consumed in parallel. Again,
7606
05:32:15,413 --> 05:32:16,800
you can see the replica
7607
05:32:16,800 --> 05:32:20,512
of this partition is residing
in all the three Brokers same
7608
05:32:20,512 --> 05:32:23,200
with Partition 1 and same
with partition 2,
7609
05:32:23,200 --> 05:32:25,700
and it's showing you
the insync replica.
7610
05:32:25,700 --> 05:32:27,100
So in synch replica,
7611
05:32:27,100 --> 05:32:30,600
the first is 2, then you have 0,
and then you have 1
7612
05:32:30,600 --> 05:32:33,600
and similarly with
Partition 1 and 2.
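For reference, the same create, list and describe steps can also be done programmatically. This is only a rough sketch, assuming the kafka-clients AdminClient and a broker on localhost:9092; the video itself uses the kafka-topics.sh script against ZooKeeper.

```java
// Sketch only: create/list/describe the kafka-spark topic with AdminClient.
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.admin.TopicDescription;

public class TopicAdminSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        try (AdminClient admin = AdminClient.create(props)) {
            // create kafka-spark with 3 partitions and replication factor 3 (like --create)
            NewTopic topic = new NewTopic("kafka-spark", 3, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
            // list all topics (like --list)
            System.out.println(admin.listTopics().names().get());
            // describe the topic: partitions, leaders, replicas, ISR (like --describe)
            TopicDescription desc = admin.describeTopics(Collections.singleton("kafka-spark"))
                    .all().get().get("kafka-spark");
            desc.partitions().forEach(p ->
                System.out.println("partition " + p.partition()
                        + " leader " + p.leader().id()
                        + " replicas " + p.replicas()
                        + " isr " + p.isr()));
        }
    }
}
```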
7613
05:32:33,900 --> 05:32:35,100
So now let us quickly.
7614
05:32:35,100 --> 05:32:35,900
Go ahead.
7615
05:32:36,500 --> 05:32:38,346
I'll reduce this window to half.
7616
05:32:40,000 --> 05:32:42,200
Let me open up one more terminal.
7617
05:32:43,300 --> 05:32:45,200
The reason why I'm doing this is
7618
05:32:45,200 --> 05:32:48,600
that we can actually produce
message from One console
7619
05:32:48,600 --> 05:32:51,700
and then we can receive
the message in another console.
7620
05:32:51,707 --> 05:32:56,092
So for that I'll start the Kafka
console producer first.
7621
05:32:56,396 --> 05:32:57,703
So the command is
7622
05:32:58,000 --> 05:33:04,400
./bin/kafka-console-producer.sh
7623
05:33:04,400 --> 05:33:06,100
and then in case
7624
05:33:06,100 --> 05:33:11,400
of producer you have to give
the parameter as --broker-list,
7625
05:33:11,800 --> 05:33:18,000
which is localhost:9092. You
can provide any of the Brokers
7626
05:33:18,000 --> 05:33:19,000
that is running
7627
05:33:19,000 --> 05:33:22,400
and it will again take the rest
of the Brokers from there.
7628
05:33:22,400 --> 05:33:25,794
So you just have to provide
the address of one broker.
7629
05:33:25,794 --> 05:33:28,100
You can also provide
a set of Brokers
7630
05:33:28,100 --> 05:33:30,000
so you can give it
as localhost:9092,
7631
05:33:30,000 --> 05:33:33,800
comma localhost:9093,
and so on.
7632
05:33:33,800 --> 05:33:35,800
So here I am passing the address
7633
05:33:35,800 --> 05:33:39,700
of the first broker now next
I have to mention the topic.
7634
05:33:39,700 --> 05:33:41,900
So the topic is kafka-spark.
7635
05:33:43,700 --> 05:33:45,161
And I'll hit enter.
7636
05:33:45,500 --> 05:33:47,900
So my console
producer is started.
7637
05:33:47,900 --> 05:33:50,600
Let me produce
a message saying hi.
7638
05:33:51,000 --> 05:33:53,376
Now in the second terminal
I will go ahead
7639
05:33:53,376 --> 05:33:55,200
and start the console consumer.
7640
05:33:55,500 --> 05:34:00,700
So again, the command is
kafka-console-consumer.sh
7641
05:34:00,800 --> 05:34:03,000
and then in case of consumer,
7642
05:34:03,000 --> 05:34:06,600
you have to give the parameter
as --bootstrap-server.
7643
05:34:07,800 --> 05:34:10,400
So this is the thing
to notice guys that in case
7644
05:34:10,400 --> 05:34:13,600
of producer you have to give
the broker list, but in
7645
05:34:13,600 --> 05:34:14,725
case of consumer,
7646
05:34:14,725 --> 05:34:19,000
you have to give --bootstrap-
server, and it is again the same
7647
05:34:19,000 --> 05:34:23,389
that is localhost:9092, which is
the address of my broker 0,
7648
05:34:23,500 --> 05:34:25,807
and then I will give the topic
7649
05:34:25,807 --> 05:34:30,700
which is kafka-spark.
Now, adding this parameter
7650
05:34:30,700 --> 05:34:32,100
that is --from-
7651
05:34:32,100 --> 05:34:35,800
beginning, which will basically
give me messages stored
7652
05:34:35,800 --> 05:34:37,926
in that topic from beginning.
7653
05:34:37,926 --> 05:34:41,300
Otherwise, if I'm not giving
this --from-beginning parameter,
7654
05:34:41,300 --> 05:34:43,200
then I'll only get
7655
05:34:43,200 --> 05:34:44,630
the recent messages
7656
05:34:44,630 --> 05:34:48,300
that have been produced after
starting this console consumer.
7657
05:34:48,300 --> 05:34:49,484
So let me hit enter
7658
05:34:49,484 --> 05:34:52,600
and you can see I'll get
a message saying hi first.
7659
05:34:55,700 --> 05:34:57,267
Well, I'm sorry guys.
7660
05:34:57,267 --> 05:35:00,400
The topic name I
have given is not correct.
7661
05:35:00,400 --> 05:35:01,784
Sorry for my typo.
7662
05:35:01,784 --> 05:35:03,707
Let me quickly correct it.
7663
05:35:04,300 --> 05:35:05,800
And let me hit enter.
7664
05:35:06,800 --> 05:35:10,300
So as you can see,
I am receiving the messages.
7665
05:35:10,300 --> 05:35:13,900
I received the 'hi', then let
me produce some more messages.
7666
05:35:19,200 --> 05:35:21,600
So now you can see
all the messages
7667
05:35:21,600 --> 05:35:22,858
that I am producing
7668
05:35:22,858 --> 05:35:26,900
from console producer is getting
consumed by console consumer.
7669
05:35:26,900 --> 05:35:30,466
Now this console producer
as well as console consumer
7670
05:35:30,466 --> 05:35:31,838
is basically used by
7671
05:35:31,838 --> 05:35:35,200
the developers to actually
test the Kafka cluster.
7672
05:35:35,200 --> 05:35:37,100
So what happens if you are
7673
05:35:37,100 --> 05:35:38,300
if there is a producer
7674
05:35:38,300 --> 05:35:40,300
which is running and
which is producing
7675
05:35:40,300 --> 05:35:43,196
those messages to Kafka
then you can go ahead
7676
05:35:43,196 --> 05:35:45,558
and you can start console
consumer and check
7677
05:35:45,558 --> 05:35:47,500
whether the producer
is producing
7678
05:35:47,500 --> 05:35:49,900
messages or not,
or you can again go ahead
7679
05:35:49,900 --> 05:35:50,900
and check the format
7680
05:35:50,900 --> 05:35:53,860
in which your messages are
getting produced to the topic.
7681
05:35:53,860 --> 05:35:56,988
Those kind of testing part
is done using console consumer
7682
05:35:56,988 --> 05:35:59,000
and similarly using
console producer.
7683
05:35:59,000 --> 05:36:01,500
You do something
like you are creating a consumer
7684
05:36:01,500 --> 05:36:04,900
so you can go ahead you can
produce a message to Kafka topic
7685
05:36:04,900 --> 05:36:06,000
and then you can check
7686
05:36:06,000 --> 05:36:08,700
whether your consumer is
consuming that message or not.
7687
05:36:08,700 --> 05:36:11,049
This is basically used
for testing now,
7688
05:36:11,049 --> 05:36:13,400
let us quickly go ahead
and close this.
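As a rough Java sketch of what that console consumer is doing under the hood (assuming the kafka-clients library; the group id here is made up): --bootstrap-server maps to the bootstrap.servers property, and --from-beginning roughly corresponds to auto.offset.reset=earliest for a new consumer group.

```java
// Sketch only: a programmatic equivalent of the console consumer used above.
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ConsoleConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "console-test");          // hypothetical group id
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");     // ~ --from-beginning
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("kafka-spark"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.println(r.value());   // prints "hi" and the later messages
                }
            }
        }
    }
}
```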
7689
05:36:15,700 --> 05:36:18,700
Now let us get back
to our slides now.
7690
05:36:18,700 --> 05:36:20,605
I have briefly covered Kafka
7691
05:36:20,605 --> 05:36:24,300
and the concepts of Kafka so
here basically I'm giving
7692
05:36:24,300 --> 05:36:27,200
you a small brief idea
about what Kafka is
7693
05:36:27,200 --> 05:36:29,100
and how Kafka works now
7694
05:36:29,100 --> 05:36:32,100
as we have understood why
we need messaging systems,
7695
05:36:32,100 --> 05:36:33,100
what Kafka is,
7696
05:36:33,100 --> 05:36:35,000
What are different
terminologies in Kafka,
7697
05:36:35,000 --> 05:36:36,657
how Kafka architecture works
7698
05:36:36,657 --> 05:36:39,513
and we have seen some
of the basic Kafka commands.
7699
05:36:39,513 --> 05:36:41,000
So let us now understand.
7700
05:36:41,000 --> 05:36:42,600
What is Apache spark.
7701
05:36:42,800 --> 05:36:44,900
So basically Apache spark
7702
05:36:44,900 --> 05:36:47,802
is an open-source cluster
computing framework
7703
05:36:47,802 --> 05:36:51,300
for near real-time processing
now spark provides
7704
05:36:51,300 --> 05:36:54,205
an interface for programming
the entire cluster
7705
05:36:54,205 --> 05:36:56,047
with implicit data parallelism
7706
05:36:56,047 --> 05:36:59,300
and fault tolerance will talk
about how spark provides
7707
05:36:59,300 --> 05:37:02,900
fault tolerance but talking
about implicit data parallelism.
7708
05:37:02,900 --> 05:37:06,600
That means you do not need
any special directives operators
7709
05:37:06,600 --> 05:37:09,000
or functions to enable
parallel execution.
7710
05:37:09,000 --> 05:37:12,600
Spark by default provides
the data parallelism. Spark
7711
05:37:12,600 --> 05:37:15,628
is designed to cover
a wide range of workloads such.
7712
05:37:15,628 --> 05:37:16,919
As batch applications
7713
05:37:16,919 --> 05:37:20,400
iterative algorithms interactive
queries machine learning
7714
05:37:20,400 --> 05:37:22,000
algorithms and streaming.
7715
05:37:22,000 --> 05:37:24,174
So basically the main feature
7716
05:37:24,174 --> 05:37:27,500
of spark is it's
in memory cluster Computing
7717
05:37:27,500 --> 05:37:30,900
that increases the processing
speed of the application.
7718
05:37:30,900 --> 05:37:34,763
So what Spark does: Spark does
not store the data on disks,
7719
05:37:34,763 --> 05:37:36,950
but it transforms the data
7720
05:37:36,950 --> 05:37:38,700
and keeps the data in memory.
7721
05:37:38,700 --> 05:37:39,616
So that quickly
7722
05:37:39,616 --> 05:37:42,500
multiple operations can
be applied over the data
7723
05:37:42,500 --> 05:37:45,500
and the final result
is only stored in the disk
7724
05:37:45,500 --> 05:37:49,629
Now, on the other side, Spark can also do
batch processing hundred times
7725
05:37:49,629 --> 05:37:51,108
faster than mapreduce.
7726
05:37:51,108 --> 05:37:54,400
And this is the reason why
Apache Spark is the go-
7727
05:37:54,400 --> 05:37:57,324
to tool for big data processing
in the industry.
7728
05:37:57,324 --> 05:38:00,000
Now, let's quickly move
ahead and understand
7729
05:38:00,000 --> 05:38:01,461
how spark does this
7730
05:38:01,600 --> 05:38:03,617
so the answer is rdd
7731
05:38:03,617 --> 05:38:07,700
that is resilient distributed
data sets now an rdd is
7732
05:38:07,700 --> 05:38:11,406
a read-only partitioned
collection of records and you
7733
05:38:11,406 --> 05:38:14,897
can see it is a fundamental
data structure of Spark.
7734
05:38:14,897 --> 05:38:16,312
So basically, RDD is
7735
05:38:16,312 --> 05:38:19,522
an immutable distributed
collection of objects.
7736
05:38:19,522 --> 05:38:21,709
So each data set
in rdd is divided
7737
05:38:21,709 --> 05:38:23,300
into logical partitions,
7738
05:38:23,300 --> 05:38:25,639
which may be computed
on different nodes
7739
05:38:25,639 --> 05:38:28,400
of the cluster. Now RDDs
can contain any type
7740
05:38:28,400 --> 05:38:30,800
of Python, Java or Scala objects.
7741
05:38:30,800 --> 05:38:33,900
Now talking about
the fault tolerance rdd
7742
05:38:33,900 --> 05:38:37,900
is a fault-tolerant collection
of elements that can be operated
7743
05:38:37,900 --> 05:38:39,000
on in parallel.
7744
05:38:39,000 --> 05:38:40,500
Now, how RDD does
7745
05:38:40,500 --> 05:38:43,380
that: if an RDD is lost,
it will automatically
7746
05:38:43,380 --> 05:38:45,609
be recomputed by using the original
7747
05:38:45,609 --> 05:38:49,300
transformations, and this is how Spark
provides fault tolerance.
7748
05:38:49,300 --> 05:38:51,255
So I hope that you
guys are clear
7749
05:38:51,255 --> 05:38:53,700
how Spark
provides fault tolerance.
7750
05:38:54,132 --> 05:38:57,500
Now let's talk about
how we can create rdds.
7751
05:38:57,500 --> 05:39:01,600
So there are two ways to create
RDDs. First is parallelizing
7752
05:39:01,600 --> 05:39:04,474
an existing collection
in your driver program,
7753
05:39:04,474 --> 05:39:06,200
or you can refer a data set
7754
05:39:06,200 --> 05:39:09,300
in an external storage systems
such as shared file system.
7755
05:39:09,300 --> 05:39:11,300
It can be HDFS, HBase
7756
05:39:11,300 --> 05:39:15,200
or any other data source
offering a Hadoop input format
7757
05:39:15,200 --> 05:39:16,800
now spark makes use
7758
05:39:16,800 --> 05:39:20,200
of the concept of rdd to achieve
fast and efficient operations.
7759
05:39:20,200 --> 05:39:22,600
Now, let's quickly move ahead
7760
05:39:22,600 --> 05:39:27,200
and look at how RDDs work. So
first we create an rdd
7761
05:39:27,200 --> 05:39:29,600
which you can create
either by referring
7762
05:39:29,600 --> 05:39:31,800
to an external storage system.
7763
05:39:31,800 --> 05:39:35,400
And then once you create
an rdd you can go ahead
7764
05:39:35,400 --> 05:39:37,800
and you can apply
multiple Transformations
7765
05:39:37,800 --> 05:39:38,800
over that RDD.
7766
05:39:39,100 --> 05:39:43,100
Like we will perform
filter, map, union, etc.
7767
05:39:43,100 --> 05:39:44,219
And then again,
7768
05:39:44,219 --> 05:39:48,400
it gives you a new rdd or you
can see the transformed rdd
7769
05:39:48,400 --> 05:39:51,500
and at last you apply
some action and get
7770
05:39:51,500 --> 05:39:55,100
the result now this action
can be count, first,
7771
05:39:55,100 --> 05:39:57,149
take, collect, all those kind
7772
05:39:57,149 --> 05:39:58,100
of functions.
7773
05:39:58,100 --> 05:40:01,151
So now this is a brief idea
about what is rdd
7774
05:40:01,151 --> 05:40:02,400
and how rdd works.
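A minimal sketch of that RDD flow in Spark's Java API (assuming a local master; the numbers and the HDFS path are just examples): create an RDD, apply transformations such as filter and map, then run an action like count or collect.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // way 1: parallelize an existing collection in the driver program
            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
            // transformations return a new (immutable) RDD; nothing runs yet
            JavaRDD<Integer> doubledEvens = numbers.filter(n -> n % 2 == 0).map(n -> n * 2);
            // actions trigger the computation and return results
            System.out.println(doubledEvens.count());     // 2
            System.out.println(doubledEvens.collect());   // [4, 8]
            // way 2: refer to a dataset in external storage, e.g. HDFS (hypothetical path)
            // JavaRDD<String> lines = sc.textFile("hdfs://localhost:9000/data/input.txt");
        }
    }
}
```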
7775
05:40:02,400 --> 05:40:04,570
So now let's quickly
move ahead and look
7776
05:40:04,570 --> 05:40:06,100
at the different workloads
7777
05:40:06,100 --> 05:40:08,200
that can be handled
by Apache spark.
7778
05:40:08,200 --> 05:40:10,883
So we have interactive
streaming analytics.
7779
05:40:10,883 --> 05:40:12,800
Then we have machine learning.
7780
05:40:12,800 --> 05:40:14,158
We have data integration.
7781
05:40:14,158 --> 05:40:16,207
We have spark
streaming and processing.
7782
05:40:16,207 --> 05:40:17,944
So let us talk about them one
7783
05:40:17,944 --> 05:40:20,400
by one first is spark
streaming and processing.
7784
05:40:20,400 --> 05:40:21,400
So now basically,
7785
05:40:21,400 --> 05:40:24,007
you know data arrives
at a steady rate.
7786
05:40:24,007 --> 05:40:27,000
Or you can say
as continuous streams, right?
7787
05:40:27,000 --> 05:40:29,300
And then what you can do
you can again go ahead
7788
05:40:29,300 --> 05:40:30,829
and store the data set in disk
7789
05:40:30,829 --> 05:40:34,299
and then you can actually go
ahead and apply some processing
7790
05:40:34,299 --> 05:40:36,007
over it some analytics over it
7791
05:40:36,007 --> 05:40:38,000
and then get
some results out of it,
7792
05:40:38,000 --> 05:40:41,200
but this is not the scenario
with each and every case.
7793
05:40:41,200 --> 05:40:44,100
Let's take an example
of financial transactions
7794
05:40:44,100 --> 05:40:46,343
where you have to go
ahead and identify
7795
05:40:46,343 --> 05:40:48,931
and refuse potential
fraudulent transactions.
7796
05:40:48,931 --> 05:40:50,297
Now if you will go ahead
7797
05:40:50,297 --> 05:40:53,197
and store the data stream
and then you will go ahead
7798
05:40:53,197 --> 05:40:55,800
and apply some processing,
it would be too late
7799
05:40:55,800 --> 05:40:58,287
and someone would have got
away with the money.
7800
05:40:58,287 --> 05:41:00,386
So in that scenario
what you need to do.
7801
05:41:00,386 --> 05:41:03,183
So you need to quickly take
that input data stream.
7802
05:41:03,183 --> 05:41:05,700
You need to apply
some Transformations over it
7803
05:41:05,700 --> 05:41:08,300
and then you have
to take actions accordingly.
7804
05:41:08,300 --> 05:41:10,015
Like you can send
some notification
7805
05:41:10,015 --> 05:41:11,322
or you can actually reject
7806
05:41:11,322 --> 05:41:13,972
that fraudulent transaction
something like that.
7807
05:41:13,972 --> 05:41:15,200
And then you can go ahead
7808
05:41:15,200 --> 05:41:17,686
and if you want you
can store those results
7809
05:41:17,686 --> 05:41:19,700
or data set in some
of the database
7810
05:41:19,700 --> 05:41:21,700
or you can see some
of the file system.
7811
05:41:21,800 --> 05:41:24,000
So we have some scenarios.
7812
05:41:24,026 --> 05:41:27,873
where we have to actually
process the stream of data
7813
05:41:27,900 --> 05:41:29,300
and then we have to go ahead
7814
05:41:29,300 --> 05:41:30,358
and store the data
7815
05:41:30,358 --> 05:41:34,008
or perform some analysis on it
or take some necessary actions.
7816
05:41:34,008 --> 05:41:37,000
So this is where Spark
streaming comes into picture
7817
05:41:37,000 --> 05:41:38,575
and Spark is a best fit
7818
05:41:38,575 --> 05:41:42,000
for processing those continuous
input data streams.
7819
05:41:42,000 --> 05:41:45,500
Now moving to next
that is machine learning now,
7820
05:41:45,500 --> 05:41:46,314
as you know,
7821
05:41:46,314 --> 05:41:47,730
that first we create
7822
05:41:47,730 --> 05:41:51,182
a machine learning model
then we continuously feed
7823
05:41:51,182 --> 05:41:54,011
those incoming data
streams to the model.
7824
05:41:54,011 --> 05:41:56,700
And we get some
continuous output based
7825
05:41:56,700 --> 05:41:58,144
on the input values.
7826
05:41:58,144 --> 05:42:00,453
Now, we reuse
intermediate results
7827
05:42:00,453 --> 05:42:04,300
across multiple computation
in multi-stage applications,
7828
05:42:04,300 --> 05:42:07,600
which basically includes
substantial overhead due to
7829
05:42:07,600 --> 05:42:10,500
data replication disk
I/O and serialization,
7830
05:42:10,500 --> 05:42:12,200
which makes the system slow.
7831
05:42:12,200 --> 05:42:16,200
Now what Spark does: Spark RDD
will store intermediate results
7832
05:42:16,200 --> 05:42:19,446
in a distributed memory
instead of a stable storage
7833
05:42:19,446 --> 05:42:21,200
and make the system faster.
7834
05:42:21,200 --> 05:42:24,800
So as we saw in spark rdd
all the Transformations
7835
05:42:24,800 --> 05:42:26,482
will be applied over there
7836
05:42:26,482 --> 05:42:29,200
and all the transformed
rdds will be stored
7837
05:42:29,200 --> 05:42:31,999
in the memory itself
so we can quickly go ahead
7838
05:42:31,999 --> 05:42:35,037
and apply some more
iterative algorithms over there
7839
05:42:35,037 --> 05:42:37,508
and it does not take
much time in functions
7840
05:42:37,508 --> 05:42:39,333
like data replication or disk
7841
05:42:39,333 --> 05:42:42,164
I/O so all those overheads
will be reduced now
7842
05:42:42,164 --> 05:42:45,500
you might be wondering
that memory is always very limited.
7843
05:42:45,500 --> 05:42:48,000
So what if the memory
gets over so
7844
05:42:48,000 --> 05:42:50,600
if the distributed memory
is not sufficient
7845
05:42:50,600 --> 05:42:52,100
to store intermediate results,
7846
05:42:52,300 --> 05:42:54,300
then it will
store those results.
7847
05:42:54,300 --> 05:42:55,100
on the disk.
7848
05:42:55,100 --> 05:42:58,000
So I hope that you guys are
clear how Spark performs
7849
05:42:58,000 --> 05:43:00,000
this iterative machine
learning algorithms
7850
05:43:00,000 --> 05:43:01,500
and why spark is fast,
7851
05:43:01,819 --> 05:43:04,280
let's look at the next workload.
7852
05:43:04,400 --> 05:43:08,200
So next workload is
interactive streaming analytics.
7853
05:43:08,200 --> 05:43:10,900
Now as we already discussed
about streaming data
7854
05:43:10,900 --> 05:43:15,300
so user runs ad hoc queries
on the same subset of data
7855
05:43:15,300 --> 05:43:19,127
and each query will do a disk
I/O on the stable storage
7856
05:43:19,127 --> 05:43:22,386
which can dominate
applications execution time.
7857
05:43:22,386 --> 05:43:24,300
So, let me take an example.
7858
05:43:24,300 --> 05:43:25,400
of a data scientist.
7859
05:43:25,400 --> 05:43:27,800
So basically you have
continuous streams of data,
7860
05:43:27,800 --> 05:43:28,800
which is coming in.
7861
05:43:28,800 --> 05:43:30,650
So what your data
scientists would do.
7862
05:43:30,650 --> 05:43:32,900
So your data scientist
will either ask
7863
05:43:32,900 --> 05:43:34,274
some questions execute
7864
05:43:34,274 --> 05:43:37,208
some queries over the data
then view the result
7865
05:43:37,208 --> 05:43:40,563
and then he might alter
the initial question slightly
7866
05:43:40,563 --> 05:43:41,804
by seeing the output
7867
05:43:41,804 --> 05:43:44,332
or he might also drill
deeper into results
7868
05:43:44,332 --> 05:43:47,757
and execute some more queries
over the gathered result.
7869
05:43:47,757 --> 05:43:51,500
So there are multiple scenarios
in which your data scientist
7870
05:43:51,500 --> 05:43:54,265
would be running
some interactive queries.
7871
05:43:54,265 --> 05:43:57,569
on the streaming analytics.
Now how Spark helps
7872
05:43:57,569 --> 05:44:00,200
in this interactive
streaming analytics.
7873
05:44:00,200 --> 05:44:04,453
So each transformed RDD
may be recomputed each time.
7874
05:44:04,453 --> 05:44:06,838
You run an action on it, right?
7875
05:44:06,838 --> 05:44:10,692
And when you persist an rdd
in memory in which case
7876
05:44:10,692 --> 05:44:13,430
Spark will keep all
the elements around
7877
05:44:13,430 --> 05:44:15,800
on the cluster for faster access
7878
05:44:15,800 --> 05:44:18,296
and whenever you will execute
the query next time
7879
05:44:18,296 --> 05:44:19,077
over the data,
7880
05:44:19,077 --> 05:44:21,200
then the query will
be executed quickly
7881
05:44:21,200 --> 05:44:23,700
and it will give you
an instant result, right?
7882
05:44:24,100 --> 05:44:26,090
So I hope that you
guys are clear
7883
05:44:26,090 --> 05:44:29,200
how spark helps in
interactive streaming analytics.
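A small illustrative sketch of that caching idea (the names and the ERROR/payment filters are made up): persisting an RDD keeps its elements in memory, so repeated ad hoc queries avoid recomputing from stable storage.

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.storage.StorageLevel;

public class PersistSketch {
    static void cacheForRepeatedQueries(JavaRDD<String> events) {
        // keep the transformed RDD in memory so each new query reuses it
        JavaRDD<String> errors = events.filter(line -> line.contains("ERROR"))
                                       .persist(StorageLevel.MEMORY_ONLY());
        long total = errors.count();                                        // first action computes and caches
        long payment = errors.filter(l -> l.contains("payment")).count();   // later queries hit the cache
        System.out.println(total + " errors, " + payment + " about payments");
    }
}
```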
7884
05:44:29,400 --> 05:44:32,000
Now, let's talk
about data integration.
7885
05:44:32,000 --> 05:44:33,570
So basically as you know,
7886
05:44:33,570 --> 05:44:36,900
that in large organizations data
is basically produced
7887
05:44:36,900 --> 05:44:39,400
from different systems
across the business
7888
05:44:39,400 --> 05:44:42,000
and basically you
need a framework
7889
05:44:42,000 --> 05:44:45,800
which can actually integrate
different data sources.
7890
05:44:45,800 --> 05:44:46,900
So Spark is the one
7891
05:44:46,900 --> 05:44:49,382
which actually integrates
different data sources
7892
05:44:49,382 --> 05:44:50,500
so you can go ahead
7893
05:44:50,500 --> 05:44:53,800
and you can take the data
from Kafka, Cassandra, Flume,
7894
05:44:53,800 --> 05:44:55,518
HBase, then Amazon S3.
7895
05:44:55,518 --> 05:44:59,300
Then you can perform some real
time analytics over there
7896
05:44:59,300 --> 05:45:02,000
or even say some near
real-time analytics over there.
7897
05:45:02,000 --> 05:45:04,250
You can apply some machine
learning algorithms
7898
05:45:04,250 --> 05:45:05,700
and then you can go ahead
7899
05:45:05,700 --> 05:45:08,500
and store the processed
result in Apache HBase,
7900
05:45:08,500 --> 05:45:10,600
then MySQL, HDFS.
7901
05:45:10,600 --> 05:45:12,100
It could be your Kafka.
7902
05:45:12,100 --> 05:45:15,500
So spark basically gives
you multiple options
7903
05:45:15,500 --> 05:45:16,600
where you can go ahead
7904
05:45:16,600 --> 05:45:18,500
and pick the data
from and again,
7905
05:45:18,500 --> 05:45:21,200
you can go ahead
and write the data into now.
7906
05:45:21,200 --> 05:45:23,620
Let's quickly move ahead
and we'll talk.
7907
05:45:23,620 --> 05:45:27,013
About different spark components
so you can see here.
7908
05:45:27,013 --> 05:45:28,500
I have the Spark Core engine.
7909
05:45:28,500 --> 05:45:30,376
So basically this
is the core engine
7910
05:45:30,376 --> 05:45:32,200
and on top of this core engine.
7911
05:45:32,200 --> 05:45:35,574
You have Spark SQL, Spark
Streaming, then MLlib,
7912
05:45:35,900 --> 05:45:38,100
then you have GraphX
and then you have SparkR.
7913
05:45:38,200 --> 05:45:41,087
Let's talk about them one
by one and we'll start
7914
05:45:41,087 --> 05:45:42,500
with spark core engine.
7915
05:45:42,500 --> 05:45:45,200
So the Spark Core engine
is the base engine
7916
05:45:45,200 --> 05:45:46,800
for large-scale parallel
7917
05:45:46,800 --> 05:45:50,026
and distributed data processing.
Additional libraries,
7918
05:45:50,026 --> 05:45:52,200
which are built on top
of the core allows
7919
05:45:52,200 --> 05:45:53,700
diverse workloads for
7920
05:45:53,700 --> 05:45:57,300
streaming, SQL, machine learning,
then you can go ahead
7921
05:45:57,300 --> 05:45:59,300
and execute R on Spark
7922
05:45:59,300 --> 05:46:01,731
or you can go ahead
and execute python on spark
7923
05:46:01,731 --> 05:46:03,000
those kind of workloads.
7924
05:46:03,000 --> 05:46:04,700
You can easily go
ahead and execute.
7925
05:46:04,700 --> 05:46:07,800
So basically your Spark
Core engine is the one
7926
05:46:07,800 --> 05:46:10,040
who is managing all your memory,
7927
05:46:10,040 --> 05:46:13,084
then all your fault
recovery your scheduling
7928
05:46:13,084 --> 05:46:14,755
your Distributing of jobs
7929
05:46:14,755 --> 05:46:16,078
and monitoring jobs
7930
05:46:16,078 --> 05:46:19,700
on a cluster and interacting
with the storage system.
7931
05:46:19,700 --> 05:46:22,400
So in short we
can say the Spark
7932
05:46:22,400 --> 05:46:24,501
Core engine is the heart of Spark
7933
05:46:24,501 --> 05:46:25,951
and on top of this all
7934
05:46:25,951 --> 05:46:28,389
of these libraries
are there so first,
7935
05:46:28,389 --> 05:46:30,429
let's talk about
spark streaming.
7936
05:46:30,429 --> 05:46:33,088
So Spark Streaming is
the component of Spark
7937
05:46:33,088 --> 05:46:36,273
which is used to process
real-time streaming data
7938
05:46:36,273 --> 05:46:37,600
as we just discussed
7939
05:46:37,600 --> 05:46:41,061
and it is a useful addition
to spark core API.
7940
05:46:41,200 --> 05:46:43,600
Now it enables high
throughput and fault
7941
05:46:43,600 --> 05:46:46,554
tolerance stream processing
for live data streams.
7942
05:46:46,554 --> 05:46:47,700
So you can go ahead
7943
05:46:47,700 --> 05:46:51,338
and you can perform all
the streaming data analytics
7944
05:46:51,338 --> 05:46:55,800
using Spark Streaming. Then
you have Spark SQL over here.
7945
05:46:55,800 --> 05:46:58,900
So basically spark SQL is
a new module in spark
7946
05:46:58,900 --> 05:47:02,200
which integrates relational
processing with Spark's functional
7947
05:47:02,200 --> 05:47:06,900
programming API and it supports
querying data either via SQL
7948
05:47:06,900 --> 05:47:08,315
or HQL, that is, Hive
7949
05:47:08,315 --> 05:47:09,469
query language.
7950
05:47:09,500 --> 05:47:11,500
So basically for those of you
7951
05:47:11,500 --> 05:47:15,615
who are familiar with RDBMS,
Spark SQL is an easy transition
7952
05:47:15,615 --> 05:47:17,100
from your earlier tool
7953
05:47:17,100 --> 05:47:19,511
where you can go ahead
and extend the boundaries
7954
05:47:19,511 --> 05:47:22,100
of traditional relational
data processing.
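A brief sketch of that idea with the SparkSession API (the people.json file and its schema are assumptions): the same data can be queried with plain SQL or the DataFrame API.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("sql-sketch").master("local[*]").getOrCreate();
        // load a JSON file into a DataFrame (hypothetical path)
        Dataset<Row> people = spark.read().json("people.json");
        people.createOrReplaceTempView("people");
        // relational processing via a plain SQL query
        Dataset<Row> adults = spark.sql("SELECT name, age FROM people WHERE age >= 18");
        adults.show();
        spark.stop();
    }
}
```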
7955
05:47:22,100 --> 05:47:23,700
Now, talking about GraphX.
7956
05:47:23,700 --> 05:47:24,900
So GraphX is
7957
05:47:24,900 --> 05:47:28,500
the Spark API for graphs
and graph-parallel computation.
7958
05:47:28,500 --> 05:47:30,800
It extends the spark rdd
7959
05:47:30,800 --> 05:47:34,309
with a resilient distributed
property graph. Talking
7960
05:47:34,309 --> 05:47:35,213
at high level.
7961
05:47:35,213 --> 05:47:38,700
basically GraphX extends
the Spark RDD abstraction
7962
05:47:38,700 --> 05:47:41,758
by introducing the resilient
distributed property graph,
7963
05:47:41,758 --> 05:47:42,778
which is nothing
7964
05:47:42,778 --> 05:47:45,900
but a directed multigraph
with properties attached
7965
05:47:45,900 --> 05:47:49,700
to each vertex and edge.
Next we have SparkR, so
7966
05:47:49,700 --> 05:47:52,394
basically it provides you
packages for the R language
7967
05:47:52,394 --> 05:47:54,100
and then you can go ahead and
7968
05:47:54,100 --> 05:47:55,399
leverage Spark's power
7969
05:47:55,399 --> 05:47:58,000
with the R shell. Next
you have Spark MLlib.
7970
05:47:58,000 --> 05:48:01,849
So MLlib basically stands
for machine learning library.
7971
05:48:01,849 --> 05:48:05,200
So Spark MLlib is used
to perform machine learning
7972
05:48:05,200 --> 05:48:06,500
in Apache spark.
7973
05:48:06,500 --> 05:48:08,773
Now many common machine learning
7974
05:48:08,773 --> 05:48:11,784
and statistical algorithms
have been implemented
7975
05:48:11,784 --> 05:48:13,700
and are shipped with MLlib,
7976
05:48:13,700 --> 05:48:16,935
which simplifies large scale
machine learning pipelines,
7977
05:48:16,935 --> 05:48:18,347
which basically includes
7978
05:48:18,347 --> 05:48:20,994
summary statistics
correlations classification
7979
05:48:20,994 --> 05:48:23,800
and regression collaborative
filtering techniques.
7980
05:48:23,800 --> 05:48:25,700
then cluster analysis methods,
7981
05:48:25,700 --> 05:48:28,582
then you have dimensionality
reduction techniques.
7982
05:48:28,582 --> 05:48:31,400
You have feature extraction
and transformation functions.
7983
05:48:31,400 --> 05:48:33,700
Then you have
optimization algorithms,
7984
05:48:33,700 --> 05:48:35,900
It is basically an MLlib package,
7985
05:48:35,900 --> 05:48:39,000
or you can say a machine
learning package on top of Spark.
7986
05:48:39,000 --> 05:48:41,639
Then you also have
something called PySpark,
7987
05:48:41,639 --> 05:48:43,979
which is the Python package
for Spark.
7988
05:48:43,979 --> 05:48:46,800
You can go ahead
and leverage Python over Spark.
7989
05:48:46,800 --> 05:48:47,376
So I hope
7990
05:48:47,376 --> 05:48:50,900
that you guys are clear
with different spark components.
7991
05:48:51,100 --> 05:48:53,200
So before moving
to the Kafka-Spark
7992
05:48:53,200 --> 05:48:54,524
streaming demo,
7993
05:48:54,524 --> 05:48:58,075
So I have just given you
a brief intro to Apache spark.
7994
05:48:58,075 --> 05:49:01,100
If you want a detailed tutorial
on Apache spark
7995
05:49:01,100 --> 05:49:02,600
or different components
7996
05:49:02,600 --> 05:49:06,753
of Apache spark like Apache
spark SQL spark data frames
7997
05:49:06,800 --> 05:49:10,200
or spark streaming
Spark GraphX, Spark MLlib,
7998
05:49:10,200 --> 05:49:13,200
so you can go to Edureka's
YouTube channel again.
7999
05:49:13,200 --> 05:49:14,800
So now we are here guys.
8000
05:49:14,800 --> 05:49:18,252
I know that you guys are waiting
for this demo from a while.
8001
05:49:18,252 --> 05:49:21,900
So now let's go ahead and look
at the Kafka-Spark streaming demo.
8002
05:49:21,900 --> 05:49:23,700
So let me quickly go
ahead and open.
8003
05:49:23,700 --> 05:49:28,000
my virtual machine
and I'll open a terminal.
8004
05:49:28,600 --> 05:49:30,658
So let me first check
all the daemons
8005
05:49:30,658 --> 05:49:32,400
that are running in my system.
8006
05:49:33,800 --> 05:49:35,341
So my zookeeper is running
8007
05:49:35,341 --> 05:49:37,753
name node is running
data node is running.
8008
05:49:37,753 --> 05:49:39,130
The my resource manager
8009
05:49:39,130 --> 05:49:42,714
is running, all the three Kafka
brokers are running, then
8010
05:49:42,714 --> 05:49:44,088
node manager is running
8011
05:49:44,088 --> 05:49:46,000
and the job history server is running.
8012
05:49:46,200 --> 05:49:49,200
So now I have to start
my Spark daemons.
8013
05:49:49,200 --> 05:49:51,900
So let me first go
to the spark home
8014
05:49:52,600 --> 05:49:54,600
and start the Spark daemons.
8015
05:49:54,600 --> 05:49:57,800
The command is
./sbin/start-all
8016
05:49:57,800 --> 05:49:58,900
.sh
8017
05:50:01,400 --> 05:50:03,400
So let me quickly go ahead
8018
05:50:03,400 --> 05:50:06,861
and execute sudo JPS
to check my Spark daemons.
8019
05:50:08,500 --> 05:50:12,200
So you can see master
and worker daemons are running.
8020
05:50:12,596 --> 05:50:14,903
So let me close this terminal.
8021
05:50:16,300 --> 05:50:18,700
Let me go to
the project directory.
8022
05:50:20,600 --> 05:50:22,808
So basically, I
have two projects.
8023
05:50:22,808 --> 05:50:25,376
This is the Kafka
transaction producer.
8024
05:50:25,376 --> 05:50:28,852
And the next one is the spark
streaming Kafka master.
8025
05:50:28,852 --> 05:50:31,327
So first we will
be producing messages
8026
05:50:31,327 --> 05:50:33,400
from Kafka transaction producer
8027
05:50:33,400 --> 05:50:36,200
and then we'll be
streaming those records
8028
05:50:36,200 --> 05:50:39,670
which is basically produced by
this producer using the spark
8029
05:50:39,670 --> 05:50:41,025
streaming Kafka master.
8030
05:50:41,025 --> 05:50:42,494
So first, let me take you
8031
05:50:42,494 --> 05:50:45,100
through this Kafka
transaction producer.
8032
05:50:45,100 --> 05:50:47,244
So this is
our pom.xml file.
8033
05:50:47,244 --> 05:50:49,004
Let me open it with G edit.
8034
05:50:49,004 --> 05:50:50,700
So basically this is a Maven
8035
05:50:50,700 --> 05:50:54,400
project and I have used
the Spring Boot server.
8036
05:50:54,800 --> 05:50:57,071
So I have given Java version
8037
05:50:57,071 --> 05:51:00,456
as 8. You can see
the Kafka client over here
8038
05:51:00,500 --> 05:51:02,900
and the version of Kafka client,
8039
05:51:03,780 --> 05:51:07,719
then you can see I'm putting
Jackson Databind.
8040
05:51:08,800 --> 05:51:13,500
Then Gson, and then I
am packaging it as a war file
8041
05:51:13,600 --> 05:51:15,500
that is web archive file.
8042
05:51:15,500 --> 05:51:20,000
And here I am again specifying
the spring boot Maven plugins,
8043
05:51:20,000 --> 05:51:21,300
which is to be downloaded.
8044
05:51:21,300 --> 05:51:23,258
So let me quickly go ahead
8045
05:51:23,258 --> 05:51:27,100
and close this and we'll go
to this Source directory
8046
05:51:27,100 --> 05:51:29,125
and then we'll go inside main.
8047
05:51:29,125 --> 05:51:32,972
So basically this is the file
that is sales Jan 2009 file.
8048
05:51:32,972 --> 05:51:35,200
So let me show you
the file first.
8049
05:51:37,300 --> 05:51:38,860
So these are the records
8050
05:51:38,860 --> 05:51:41,200
which I'll be producing
to the Kafka.
8051
05:51:41,200 --> 05:51:43,600
So the fields
are transaction date
8052
05:51:43,600 --> 05:51:45,500
then product, price, payment
8053
05:51:45,500 --> 05:51:49,767
type the name city state
country account created
8054
05:51:49,800 --> 05:51:51,646
then last login latitude
8055
05:51:51,646 --> 05:51:52,846
and longitude.
8056
05:51:52,846 --> 05:51:57,400
So let me close this file
and then the application dot
8057
05:51:57,400 --> 05:51:59,778
yml is the main property file.
8058
05:51:59,900 --> 05:52:02,654
So in this application
dot yml I am specifying
8059
05:52:02,654 --> 05:52:04,000
the bootstrap server,
8060
05:52:04,000 --> 05:52:07,900
which is localhost:9092,
then I am specifying the brokers,
8061
05:52:07,900 --> 05:52:11,500
which again resides
on localhost:9092. So here
8062
05:52:11,500 --> 05:52:16,200
I have specified the broker list
now next I have product topic.
8063
05:52:16,200 --> 05:52:19,000
So the topic of the
product is transaction.
8064
05:52:19,000 --> 05:52:21,230
Then the partition count is 1
8065
05:52:21,500 --> 05:52:25,800
so basically your acks
config controls the criteria
8066
05:52:25,800 --> 05:52:29,100
under which requests
are considered complete
8067
05:52:29,100 --> 05:52:32,900
and the all setting we
have specified will result
8068
05:52:32,900 --> 05:52:35,828
in blocking on the full
commit of the record.
8069
05:52:35,828 --> 05:52:37,225
It is the slowest but
8070
05:52:37,225 --> 05:52:40,900
the most durable setting
Now talking about retries:
8071
05:52:40,900 --> 05:52:44,600
So it will retry thrice,
then we have min pool size
8072
05:52:44,600 --> 05:52:46,587
and we have maximum pool size,
8073
05:52:46,587 --> 05:52:49,700
which is basically
for implementing Java threads
8074
05:52:49,700 --> 05:52:52,000
and at last we
have the file path.
8075
05:52:52,000 --> 05:52:53,900
So this is the path of the file,
8076
05:52:53,900 --> 05:52:57,900
which I have shown you just now
so messages will be consumed
8077
05:52:57,900 --> 05:52:58,800
from this file.
8078
05:52:58,800 --> 05:53:02,600
Let me quickly close this file
and we'll look at application
8079
05:53:02,600 --> 05:53:06,792
dot properties, so here we
have specified the properties
8080
05:53:06,792 --> 05:53:08,600
for the Spring Boot server.
8081
05:53:08,700 --> 05:53:10,877
So we have server context path.
8082
05:53:10,877 --> 05:53:12,185
That is /edureka.
8083
05:53:12,185 --> 05:53:14,607
Then we have
spring application name
8084
05:53:14,607 --> 05:53:16,301
that is Kafka producer.
8085
05:53:16,301 --> 05:53:17,700
We have server Port
8086
05:53:17,700 --> 05:53:22,200
that is double line W8 and
the spring events timeout is 20.
8087
05:53:22,200 --> 05:53:24,430
So let me close this as well.
8088
05:53:24,430 --> 05:53:25,530
Let's go back.
8089
05:53:25,800 --> 05:53:29,500
Let's go inside java/com/
edureka/kafka.
8090
05:53:29,700 --> 05:53:33,400
So we'll explore
the important files one by one.
8091
05:53:33,400 --> 05:53:36,800
So let me first take you
through this dto directory.
8092
05:53:36,900 --> 05:53:39,617
And over here,
we have transaction dot Java.
8093
05:53:39,617 --> 05:53:42,253
So basically here we
are storing the model.
8094
05:53:42,253 --> 05:53:45,871
So basically you can see these
are the fields from the file,
8095
05:53:45,871 --> 05:53:47,372
which I have shown you.
8096
05:53:47,372 --> 05:53:49,200
So we have transaction date.
8097
05:53:49,200 --> 05:53:53,600
We have product price payment
type name city state country
8098
05:53:53,600 --> 05:53:57,700
and so on so we have created
variable for each field.
8099
05:53:57,700 --> 05:54:01,101
So what we are doing we
are basically creating a getter
8100
05:54:01,101 --> 05:54:03,766
and Setter function for
all these variables.
8101
05:54:03,766 --> 05:54:05,702
So we have get transaction ID,
8102
05:54:05,702 --> 05:54:08,800
which will basically
return the transaction ID, then
8103
05:54:08,800 --> 05:54:10,600
we have set transaction ID,
8104
05:54:10,600 --> 05:54:13,300
which will basically
set the transaction ID.
8105
05:54:13,300 --> 05:54:13,809
Similarly.
8106
05:54:13,809 --> 05:54:17,036
We have get transaction date for
getting the transaction date.
8107
05:54:17,036 --> 05:54:19,100
Then we have set
transaction date and it
8108
05:54:19,100 --> 05:54:21,900
will set the transaction date
using this variable.
8109
05:54:21,900 --> 05:54:25,532
Then we have get product
and set product, get price, set price,
8110
05:54:25,532 --> 05:54:26,700
and all the getter
8111
05:54:26,700 --> 05:54:29,900
and Setter methods
for each of the variable.
8112
05:54:32,000 --> 05:54:34,000
This is the Constructor.
8113
05:54:34,100 --> 05:54:35,615
So here we are taking
8114
05:54:35,615 --> 05:54:39,513
all the parameters like
transaction date product price.
8115
05:54:39,513 --> 05:54:42,295
And then we are setting
the value of each
8116
05:54:42,295 --> 05:54:44,800
of the variables
using this operator.
8117
05:54:44,800 --> 05:54:48,295
So we are setting the value for
transaction date product price
8118
05:54:48,295 --> 05:54:51,500
payment and all of the fields
that is present over there.
8119
05:54:51,515 --> 05:54:51,900
Next.
8120
05:54:51,900 --> 05:54:55,053
We are also creating
a default Constructor
8121
05:54:55,200 --> 05:54:56,616
and then over here.
8122
05:54:56,616 --> 05:54:59,300
We are overriding
the tostring method
8123
05:54:59,300 --> 05:55:01,600
and what we are doing
we are basically
8124
05:55:02,400 --> 05:55:04,500
The transaction details
8125
05:55:04,500 --> 05:55:06,600
and we are
returning transaction date
8126
05:55:06,600 --> 05:55:09,100
and then the value
of transaction date product
8127
05:55:09,100 --> 05:55:12,300
then body of product price
then value of price
8128
05:55:12,300 --> 05:55:14,900
and so on for all the fields.
8129
05:55:15,300 --> 05:55:18,800
So basically this is the model
of the transaction
8130
05:55:18,800 --> 05:55:20,000
so we can go ahead
8131
05:55:20,000 --> 05:55:22,529
and we can create object
of this transaction
8132
05:55:22,529 --> 05:55:24,400
and then we can easily go ahead
8133
05:55:24,400 --> 05:55:27,700
and send the transaction
object as the value.
8134
05:55:27,700 --> 05:55:29,900
So this is the main
reason of creating
8135
05:55:29,900 --> 05:55:31,588
this transaction model. Now let
8136
05:55:31,588 --> 05:55:34,000
me quickly go ahead
and close this file.
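A trimmed sketch of such a transaction model (only a few fields shown; the real class described above has getters, setters and a toString for every CSV column):

```java
public class Transaction {
    private String transactionDate;
    private String product;
    private double price;

    public Transaction() { }                        // default constructor

    public Transaction(String transactionDate, String product, double price) {
        this.transactionDate = transactionDate;     // set each field via "this"
        this.product = product;
        this.price = price;
    }

    public String getTransactionDate() { return transactionDate; }
    public void setTransactionDate(String transactionDate) { this.transactionDate = transactionDate; }

    @Override
    public String toString() {
        return "Transaction details: transactionDate=" + transactionDate
                + ", product=" + product + ", price=" + price;
    }
}
```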
8137
05:55:34,000 --> 05:55:38,400
Let's go back and let's first
take a look at this config.
8138
05:55:38,615 --> 05:55:41,384
So this is Kafka
properties dot Java.
8139
05:55:41,500 --> 05:55:43,202
So what we did again
8140
05:55:43,202 --> 05:55:46,894
as I have shown you
the application dot yml file.
8141
05:55:46,942 --> 05:55:48,500
So we have taken all
8142
05:55:48,500 --> 05:55:51,500
the parameters that we
have specified over there.
8143
05:55:51,600 --> 05:55:54,600
That is your bootstrap
product topic partition count
8144
05:55:54,600 --> 05:55:57,700
then Brokers filename
and thread count.
8145
05:55:57,700 --> 05:55:59,322
So all these properties
8146
05:55:59,322 --> 05:56:02,367
then you have file path
then all these Days,
8147
05:56:02,367 --> 05:56:04,300
we have taken we have created
8148
05:56:04,300 --> 05:56:07,100
a variable and then
what we are doing again,
8149
05:56:07,100 --> 05:56:08,700
we are doing the same thing
8150
05:56:08,700 --> 05:56:11,039
as we did with
our transaction model.
8151
05:56:11,039 --> 05:56:12,600
We are creating a getter
8152
05:56:12,600 --> 05:56:15,247
and Setter method for each
of these variables.
8153
05:56:15,247 --> 05:56:17,305
So you can see we
have get file path
8154
05:56:17,305 --> 05:56:19,300
and we are returning
the file path.
8155
05:56:19,300 --> 05:56:20,924
Then we have set file path
8156
05:56:20,924 --> 05:56:24,300
where we are setting the file
path using this operator.
8157
05:56:24,300 --> 05:56:24,800
Similarly.
8158
05:56:24,800 --> 05:56:26,600
We have get product topics
8159
05:56:26,600 --> 05:56:29,567
and set product topic, then we
have getter and setter
8160
05:56:29,567 --> 05:56:30,400
for thread count.
8161
05:56:30,400 --> 05:56:31,700
We have getter and setter
8162
05:56:31,700 --> 05:56:36,000
for bootstrap and all
those properties. Now,
8163
05:56:36,100 --> 05:56:37,522
we can again go ahead
8164
05:56:37,522 --> 05:56:40,300
and call this Kafka
properties anywhere
8165
05:56:40,300 --> 05:56:41,400
and then we can easily
8166
05:56:41,400 --> 05:56:44,000
extract those values
using getter methods.
8167
05:56:44,100 --> 05:56:48,400
So let me quickly close
this file and I'll take you
8168
05:56:48,400 --> 05:56:50,500
to the configurations.
8169
05:56:50,900 --> 05:56:52,100
So in this configuration
8170
05:56:52,100 --> 05:56:54,700
what we are doing we
are creating the object
8171
05:56:54,700 --> 05:56:56,700
of Kafka properties
as you can see,
8172
05:56:57,000 --> 05:56:59,800
so what we are doing then we
are again creating a Properties
8173
05:56:59,800 --> 05:57:02,600
object and then we
are setting the properties
8174
05:57:02,700 --> 05:57:03,800
so you can see
8175
05:57:03,800 --> 05:57:06,800
that we are setting
the bootstrap server config
8176
05:57:06,800 --> 05:57:08,400
and then we are retrieving
8177
05:57:08,400 --> 05:57:11,900
the value using the Kafka
properties object.
8178
05:57:11,900 --> 05:57:14,300
And this is the get
bootstrap server function.
8179
05:57:14,300 --> 05:57:17,500
Then you can see we are setting
the acknowledgement config
8180
05:57:17,500 --> 05:57:18,400
and we are getting
8181
05:57:18,400 --> 05:57:22,100
the acknowledgement from this
get acknowledgement function.
8182
05:57:22,100 --> 05:57:24,900
And then we are using
this get retries method.
8183
05:57:24,900 --> 05:57:27,300
So from all these
Kafka properties object.
8184
05:57:27,300 --> 05:57:29,000
We are calling
those getter methods
8185
05:57:29,000 --> 05:57:30,700
and retrieving those values
8186
05:57:30,700 --> 05:57:34,100
and setting those values
in this property object.
8187
05:57:34,100 --> 05:57:36,900
So We have partitioner class.
8188
05:57:37,000 --> 05:57:40,294
So we are basically implementing
this default partitioner
8189
05:57:40,294 --> 05:57:41,400
which is present in
8190
05:57:41,400 --> 05:57:45,700
org.apache.kafka.clients.
producer.internals package.
8191
05:57:45,700 --> 05:57:48,600
Then we are creating
a producer over here
8192
05:57:48,600 --> 05:57:50,756
and we are passing this props
8193
05:57:50,756 --> 05:57:54,400
object which will set
the properties so over here.
8194
05:57:54,400 --> 05:57:56,684
We are passing
the key serializer,
8195
05:57:56,684 --> 05:57:58,900
which is the
StringSerializer.
8196
05:57:58,900 --> 05:58:00,100
And then this is
8197
05:58:00,100 --> 05:58:04,400
the value serializer in which
we are creating a new custom
8198
05:58:04,400 --> 05:58:07,500
JSON serializer, and then
we are passing transaction
8199
05:58:07,500 --> 05:58:10,400
over here and then it
will return the producer
8200
05:58:10,500 --> 05:58:13,735
and then we are implementing
thread we are again getting
8201
05:58:13,735 --> 05:58:15,200
the get minimum pool size
8202
05:58:15,200 --> 05:58:17,700
from Kafka properties and get
maximum pool size
8203
05:58:17,700 --> 05:58:18,700
from Kafka property.
8204
05:58:18,700 --> 05:58:19,600
So over here,
8205
05:58:19,600 --> 05:58:22,000
We are implementing
Java threads now.
8206
05:58:22,000 --> 05:58:25,534
Let me quickly close this Kafka
producer configuration
8207
05:58:25,534 --> 05:58:28,200
where we are configuring
our Kafka producer.
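A condensed sketch of that producer configuration (the real project pulls these values from application.yml through its Kafka properties bean; the literal values and the plain StringSerializer here are just for illustration, since the demo swaps in its custom JSON serializer for Transaction objects):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerConfigSketch {
    static KafkaProducer<String, String> buildProducer() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // bootstrap server
        props.put(ProducerConfig.ACKS_CONFIG, "all");   // block on the full commit of the record
        props.put(ProducerConfig.RETRIES_CONFIG, 3);    // retry thrice
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // the demo uses its custom JSON serializer here instead of StringSerializer
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        return new KafkaProducer<>(props);
    }
}
```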
8208
05:58:28,461 --> 05:58:29,538
Let's go back.
8209
05:58:30,400 --> 05:58:32,800
Let's quickly go to this API
8210
05:58:32,946 --> 05:58:36,253
which has the event producer
API dot Java file.
8211
05:58:36,300 --> 05:58:40,130
So here we are basically
creating an event producer API
8212
05:58:40,130 --> 05:58:42,400
which has this
dispatch function.
8213
05:58:42,400 --> 05:58:46,900
So we'll use this dispatch
function to send the records.
8214
05:58:47,180 --> 05:58:49,719
So let me quickly
close this file.
8215
05:58:50,061 --> 05:58:51,138
Let's go back.
8216
05:58:51,300 --> 05:58:53,475
We have already seen this config
8217
05:58:53,475 --> 05:58:54,700
and configurations
8218
05:58:54,700 --> 05:58:57,100
in which we are basically
retrieving those values
8219
05:58:57,100 --> 05:58:58,984
from application dot yml file
8220
05:58:58,984 --> 05:59:02,300
and then we are setting
the producer configurations,
8221
05:59:02,300 --> 05:59:04,000
then we have constants.
8222
05:59:04,000 --> 05:59:07,100
So in Kafka constants dot Java,
8223
05:59:07,200 --> 05:59:09,900
we have created this Kafka
constant interface
8224
05:59:09,900 --> 05:59:11,393
where we have specified
8225
05:59:11,393 --> 05:59:14,925
the batch size, count limit,
checksum limit, then read
8226
05:59:14,925 --> 05:59:17,494
batch size minimum
balance maximum balance
8227
05:59:17,494 --> 05:59:19,500
minimum account maximum account.
8228
05:59:19,500 --> 05:59:22,604
Then we are also implementing
a datetime formatter.
8229
05:59:22,604 --> 05:59:25,643
So we are specifying all
the constants over here.
8230
05:59:25,643 --> 05:59:27,100
Let me close this file.
8231
05:59:27,100 --> 05:59:31,300
Let's go back. Now, we will not look
8232
05:59:31,300 --> 05:59:32,506
at these two files,
8233
05:59:32,506 --> 05:59:35,300
but let me tell you what
these two files do.
8234
05:59:35,300 --> 05:59:39,400
These two files are
basically to record the metrics
8235
05:59:39,400 --> 05:59:42,000
of your Kafka like time in which
8236
05:59:42,000 --> 05:59:44,889
your thousand records have
been produced to Kafka, or
8237
05:59:44,889 --> 05:59:45,781
You can say time
8238
05:59:45,781 --> 05:59:48,400
in which records
are getting published to Kafka.
8239
05:59:48,400 --> 05:59:51,936
It will be monitored and then
you can record those stats.
8240
05:59:51,936 --> 05:59:53,292
So basically it helps
8241
05:59:53,292 --> 05:59:57,100
in optimizing the performance
of your Kafka producer, right?
8242
05:59:57,100 --> 05:59:59,863
You can actually know
how to reconfigure,
8243
05:59:59,863 --> 06:00:03,000
how to adjust
those configuration factors
8244
06:00:03,000 --> 06:00:05,041
and then you can
see the difference
8245
06:00:05,041 --> 06:00:07,159
or you can actually
monitor the stats
8246
06:00:07,159 --> 06:00:08,259
and then understand
8247
06:00:08,259 --> 06:00:11,612
or how you can actually make
your producer more efficient.
8248
06:00:11,612 --> 06:00:13,039
So these are basically
8249
06:00:13,039 --> 06:00:16,800
for those factors but let's
not worry about this right now.
8250
06:00:16,900 --> 06:00:18,600
Let's go back next.
8251
06:00:18,600 --> 06:00:21,500
Let me quickly take you
through this file utility.
8252
06:00:21,500 --> 06:00:24,000
So you have File
Utility dot Java.
8253
06:00:24,000 --> 06:00:26,600
So basically what we
are doing over here,
8254
06:00:26,600 --> 06:00:28,550
we are reading each record
8255
06:00:28,550 --> 06:00:32,200
from the file using a
buffered reader, so over here,
8256
06:00:32,200 --> 06:00:36,900
you can see we have this list
and then we have bufferedreader.
8257
06:00:36,900 --> 06:00:38,700
Then we have file reader.
8258
06:00:38,700 --> 06:00:41,000
So first we are reading the file
8259
06:00:41,000 --> 06:00:44,105
and then we are trying
to split each of the fields
8260
06:00:44,105 --> 06:00:45,500
present in the record.
8261
06:00:45,500 --> 06:00:49,500
And then we are setting the
value of those fields over here.
8262
06:00:49,700 --> 06:00:52,407
Then we are specifying
some of the exceptions
8263
06:00:52,407 --> 06:00:54,900
that may occur like
number format exception
8264
06:00:54,900 --> 06:00:57,500
or pass exception all
those kind of exception
8265
06:00:57,500 --> 06:01:00,900
we have specified over here
and then we are closing this,
8266
06:01:00,900 --> 06:01:01,959
so in this file.
8267
06:01:01,959 --> 06:01:04,746
We are basically
reading the records now.
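A simplified sketch of that file-reading idea (the column handling is reduced to a price check; the real utility maps every CSV field into the transaction model and handles a few more exception types):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class FileUtilitySketch {
    static List<String[]> readRecords(String path) {
        List<String[]> records = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",");   // transaction date, product, price, ...
                try {
                    Double.parseDouble(fields[2]);   // e.g. validate the price column
                    records.add(fields);
                } catch (NumberFormatException | ArrayIndexOutOfBoundsException bad) {
                    // skip malformed rows, as the real utility does in its exception handling
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return records;
    }
}
```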
8268
06:01:04,746 --> 06:01:06,000
Let me close this.
8269
06:01:06,000 --> 06:01:07,100
Let's go back.
8270
06:01:07,500 --> 06:01:07,766
Now.
8271
06:01:07,766 --> 06:01:10,500
Let's take a quick look
at the serializer.
8272
06:01:10,500 --> 06:01:13,100
So this is custom
JSON serializer.
8273
06:01:13,500 --> 06:01:15,100
So in serializer,
8274
06:01:15,100 --> 06:01:18,000
we have created
a custom JSON serializer.
8275
06:01:18,000 --> 06:01:22,023
Now, this is basically
to write the values as bytes.
8276
06:01:22,100 --> 06:01:26,082
So the data which you will be
passing will be written in bytes
8277
06:01:26,082 --> 06:01:27,197
because as we know
8278
06:01:27,197 --> 06:01:29,800
that data is sent to Kafka
in the form of bytes.
8279
06:01:29,800 --> 06:01:32,000
And this is the reason
why we have created
8280
06:01:32,000 --> 06:01:33,700
this custom JSON serializer.
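A sketch of what such a custom JSON serializer usually looks like, assuming Jackson's ObjectMapper from the jackson-databind dependency in the pom: the object is turned into JSON and written as bytes, since Kafka only ever transports byte arrays.

```java
import java.util.Map;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.serialization.Serializer;

public class CustomJsonSerializer<T> implements Serializer<T> {
    private final ObjectMapper mapper = new ObjectMapper();

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) { }

    @Override
    public byte[] serialize(String topic, T data) {
        try {
            return data == null ? null : mapper.writeValueAsBytes(data);  // object -> JSON bytes
        } catch (Exception e) {
            throw new RuntimeException("Failed to serialize record for topic " + topic, e);
        }
    }

    @Override
    public void close() { }
}
```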
8281
06:01:33,930 --> 06:01:37,469
So now let me quickly close
this let's go back.
8282
06:01:37,700 --> 06:01:41,800
This file is basically for
my spring boot web application.
8283
06:01:41,900 --> 06:01:44,200
So let's not get into this.
8284
06:01:44,300 --> 06:01:47,100
Let's look at events
thread dot Java.
8285
06:01:47,865 --> 06:01:51,634
So basically over here we
have event producer API.
8286
06:01:52,300 --> 06:01:57,100
So now we are trying to dispatch
those events and to show you
8287
06:01:57,100 --> 06:01:58,988
how dispatch function works.
8288
06:01:58,988 --> 06:02:00,000
Let me go back.
8289
06:02:00,000 --> 06:02:01,691
Let me open services
8290
06:02:01,700 --> 06:02:05,000
and event producer
Impl, that is, the implementation.
8291
06:02:05,000 --> 06:02:08,100
So let me show you
how this dispatch works.
8292
06:02:08,100 --> 06:02:10,400
So basically over here
what we are doing first.
8293
06:02:10,400 --> 06:02:11,576
We are initializing.
8294
06:02:11,576 --> 06:02:13,047
So using the file utility.
8295
06:02:13,047 --> 06:02:16,000
We are basically reading
the files and read the file.
8296
06:02:16,000 --> 06:02:19,356
We are getting the path using
this Kafka properties object
8297
06:02:19,356 --> 06:02:22,300
and we are calling
this getter method of file path.
8298
06:02:22,300 --> 06:02:24,900
Then what we are doing
we are basically taking
8299
06:02:24,900 --> 06:02:25,900
the product list
8300
06:02:25,900 --> 06:02:28,700
and then we are trying
to dispatch it so
8301
06:02:28,700 --> 06:02:32,800
in dispatch we are basically
using Kafka producer
8302
06:02:33,600 --> 06:02:37,000
and then we are creating the
object of the producer record.
8303
06:02:37,000 --> 06:02:41,594
Then we are using the get topic
from this Kafka properties.
8304
06:02:41,594 --> 06:02:44,004
We are getting
this transaction ID
8305
06:02:44,004 --> 06:02:45,459
from the transaction
8306
06:02:45,459 --> 06:02:49,540
and then we are using event
producer send to send the data.
8307
06:02:49,540 --> 06:02:51,300
And finally we are trying
8308
06:02:51,300 --> 06:02:54,827
to monitor this but let's
not worry about the monitoring
8309
06:02:54,827 --> 06:02:57,200
because the monitoring
and stats part
8310
06:02:57,200 --> 06:02:59,661
we can ignore for now. Next,
8311
06:02:59,800 --> 06:03:03,700
Let's quickly go back
and look at the last file
8312
06:03:03,700 --> 06:03:05,100
which is producer.
8313
06:03:05,600 --> 06:03:07,835
So let me show you
this event producer.
8314
06:03:07,835 --> 06:03:09,300
So what we are doing here,
8315
06:03:09,300 --> 06:03:11,500
we are actually
creating a logger.
8316
06:03:11,900 --> 06:03:13,500
So in this on completion method,
8317
06:03:13,500 --> 06:03:16,300
we are basically passing
the record metadata.
8318
06:03:16,300 --> 06:03:20,838
And if your exception is
not null then it will basically
8319
06:03:20,838 --> 06:03:25,200
throw an error saying this
and the record metadata, else
8320
06:03:25,400 --> 06:03:29,700
It will give you the send
message to topic partition.
8321
06:03:29,700 --> 06:03:32,300
offset, and then
the record metadata
8322
06:03:32,300 --> 06:03:34,564
and topic and then it will give
8323
06:03:34,564 --> 06:03:38,800
you all the details regarding
topic partitions and offsets.
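A compact sketch of that send-with-callback pattern (the producer and topic names are illustrative): the Callback's onCompletion either logs the error or logs the topic, partition and offset taken from the RecordMetadata.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class SendWithCallbackSketch {
    static void dispatch(KafkaProducer<String, String> producer, String key, String value) {
        ProducerRecord<String, String> record = new ProducerRecord<>("transaction", key, value);
        producer.send(record, (RecordMetadata metadata, Exception exception) -> {
            if (exception != null) {
                System.err.println("Error while producing: " + exception.getMessage());
            } else {
                System.out.printf("Sent message to topic %s partition %d offset %d%n",
                        metadata.topic(), metadata.partition(), metadata.offset());
            }
        });
    }
}
```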
8324
06:03:38,800 --> 06:03:40,888
So I hope that you
guys have understood
8325
06:03:40,888 --> 06:03:44,110
how this Kafka producer
is working now is the time we
8326
06:03:44,110 --> 06:03:47,169
need to go ahead and we need
to quickly execute this.
8327
06:03:47,169 --> 06:03:49,200
So let me open
a terminal over here.
8328
06:03:49,500 --> 06:03:51,653
Now, to first build this project,
8329
06:03:51,653 --> 06:03:54,423
We need to execute
mvn clean install.
8330
06:03:54,900 --> 06:03:56,800
This will install
all the dependencies.
8331
06:04:01,600 --> 06:04:04,100
So as you can see
our build is successful.
8332
06:04:04,100 --> 06:04:08,111
So let me minimize this and
this target directory is created
8333
06:04:08,111 --> 06:04:10,394
after you build
a Maven project.
8334
06:04:10,394 --> 06:04:11,778
So let me quickly go
8335
06:04:11,778 --> 06:04:16,000
inside this target directory and
this is the ROOT dot war file
8336
06:04:16,000 --> 06:04:18,300
that is root dot
web archive file
8337
06:04:18,300 --> 06:04:19,897
which we need to execute.
8338
06:04:19,897 --> 06:04:22,900
So let's quickly go ahead
and execute this file.
8339
06:04:23,100 --> 06:04:24,755
But before this to verify
8340
06:04:24,755 --> 06:04:27,800
whether the data
is getting produced to our
8341
06:04:27,800 --> 06:04:29,900
Kafka topic, so for testing,
8342
06:04:29,900 --> 06:04:33,300
as I already told you
We need to go ahead
8343
06:04:33,300 --> 06:04:36,200
and we need to open
a console consumer
8344
06:04:36,500 --> 06:04:37,500
so that we can check
8345
06:04:37,500 --> 06:04:40,200
that whether data
is getting published or not.
8346
06:04:42,400 --> 06:04:45,100
So let me quickly minimize this.
8347
06:04:48,300 --> 06:04:52,700
So let's quickly go to
Kafka directory and the command
8348
06:04:52,700 --> 06:04:59,300
is ./bin/kafka-
console-consumer.sh and then
8349
06:04:59,300 --> 06:05:01,500
--bootstrap-server
8350
06:05:14,800 --> 06:05:21,964
localhost:9092. Okay,
let me check the topic.
8351
06:05:21,964 --> 06:05:23,271
What's the topic?
8352
06:05:24,000 --> 06:05:27,000
Let's go to our
application dot yml file.
8353
06:05:27,000 --> 06:05:31,000
So the topic
name is transaction.
8354
06:05:31,000 --> 06:05:35,100
Let me quickly minimize
this specify the topic name
8355
06:05:35,100 --> 06:05:36,500
and I'll hit enter.
8356
06:05:36,500 --> 06:05:41,300
So now let me place
this console aside.
8357
06:05:41,300 --> 06:05:45,900
And now let's quickly go ahead
and execute our project.
8358
06:05:45,900 --> 06:05:49,400
So for that
the command is Java -
8359
06:05:49,400 --> 06:05:52,938
jar and then we'll provide
the path of the file
8360
06:05:52,938 --> 06:05:54,100
that is inside.
8361
06:05:54,300 --> 06:05:59,700
great, and the file is
root dot war, and here we go.
8362
06:06:18,100 --> 06:06:20,955
So now you can see
in the console consumer.
8363
06:06:20,955 --> 06:06:23,200
The records are
getting published.
8364
06:06:23,200 --> 06:06:23,700
Right?
8365
06:06:24,000 --> 06:06:25,903
So there are multiple records
8366
06:06:25,903 --> 06:06:29,118
which have been published
in our transaction topic
8367
06:06:29,118 --> 06:06:32,400
and you can verify this
using the console consumer.
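The video verifies this with Kafka's built-in console consumer script; an equivalent quick check in Python (assuming the kafka-python client, which the video does not use) might look like:

```python
# Hedged sketch: read everything published to the "transaction" topic and
# print it, the same verification the console consumer performs.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "transaction",                       # topic name from application.yml
    bootstrap_servers="localhost:9092",  # assumed broker address
    auto_offset_reset="earliest",
)

for message in consumer:
    print(message.value.decode("utf-8"))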
8368
06:06:32,400 --> 06:06:33,145
So this is
8369
06:06:33,145 --> 06:06:36,500
where the developers use
the console consumer.
8370
06:06:38,000 --> 06:06:40,980
So now we have successfully
verified our producer.
8371
06:06:40,980 --> 06:06:43,900
So let me quickly go ahead
and stop the producer.
8372
06:06:45,500 --> 06:06:48,200
Now let me stop the
consumer as well.
8373
06:06:49,400 --> 06:06:51,370
Let's quickly minimize this
8374
06:06:51,370 --> 06:06:54,144
and now let's go
to the second project.
8375
06:06:54,144 --> 06:06:56,700
That is spark
streaming Kafka master.
8376
06:06:56,900 --> 06:06:57,200
Again.
8377
06:06:57,200 --> 06:06:59,667
We have specified
all the dependencies
8378
06:06:59,667 --> 06:07:00,800
that is required.
8379
06:07:01,000 --> 06:07:03,700
Let me quickly show
you those dependencies.
8380
06:07:07,700 --> 06:07:09,800
Now again, you
can see over here
8381
06:07:09,800 --> 06:07:12,400
We have specified
Java version then we
8382
06:07:12,400 --> 06:07:16,600
have specified the artifacts
or you can see the dependencies.
8383
06:07:16,796 --> 06:07:18,796
So we have Scala compiler.
8384
06:07:18,796 --> 06:07:21,411
Then we have
spark streaming Kafka.
8385
06:07:21,900 --> 06:07:24,200
Then we have
Kafka clients.
8386
06:07:24,400 --> 06:07:28,400
Then Json data binding then we
have Maven compiler plug-in.
8387
06:07:28,400 --> 06:07:30,600
So all those dependencies
which are required.
8388
06:07:30,600 --> 06:07:32,300
We are specified over here.
8389
06:07:32,500 --> 06:07:35,500
So let me quickly go
ahead and close it.
8390
06:07:36,200 --> 06:07:40,503
Let's quickly move to the source
directory main then let's look
8391
06:07:40,503 --> 06:07:42,100
at the resources again.
8392
06:07:42,203 --> 06:07:44,896
So this is application
dot yml file.
8393
06:07:45,700 --> 06:07:46,700
So we have put
8394
06:07:46,700 --> 06:07:49,600
8080 as the port, then we
have the bootstrap server over here.
8395
06:07:49,600 --> 06:07:51,100
Then we have proven over here.
8396
06:07:51,100 --> 06:07:53,200
Then we have topic
is as transaction.
8397
06:07:53,200 --> 06:07:56,000
The group is transaction
partition count is one
8398
06:07:56,000 --> 06:07:57,273
and then the file name
8399
06:07:57,273 --> 06:07:59,664
so we won't be using
this file name then.
8400
06:07:59,664 --> 06:08:01,900
Let me quickly go ahead
and close this.
8401
06:08:01,900 --> 06:08:02,984
Let's go back.
8402
06:08:02,984 --> 06:08:06,600
Let's go back to the Java
directory, com dot spark demo,
8403
06:08:06,600 --> 06:08:08,200
then this is the model.
8404
06:08:08,200 --> 06:08:10,100
So it's the same.
8405
06:08:10,600 --> 06:08:13,011
so these are all the fields
that are there
8406
06:08:13,011 --> 06:08:15,800
in the transaction:
you have transaction
8407
06:08:15,800 --> 06:08:18,100
date, product, price, payment type,
8408
06:08:18,100 --> 06:08:22,500
the name city state country
account created and so on.
8409
06:08:22,500 --> 06:08:25,100
And again, we have
specified all the getter
8410
06:08:25,100 --> 06:08:29,285
and Setter methods over here
and similarly again,
8411
06:08:29,285 --> 06:08:32,600
we have created
this transaction dto Constructor
8412
06:08:32,600 --> 06:08:34,900
where we have taken
all the parameters
8413
06:08:34,900 --> 06:08:38,200
and then we are setting
the values using the 'this' operator.
8414
06:08:38,200 --> 06:08:39,100
Next.
8415
06:08:39,100 --> 06:08:42,400
We are again overriding
this toString function,
8416
06:08:42,400 --> 06:08:43,414
and over here.
8417
06:08:43,414 --> 06:08:47,500
We are again returning the
details like transaction date
8418
06:08:47,500 --> 06:08:49,700
and then the value of
transaction date, product,
8419
06:08:49,700 --> 06:08:53,200
and then value of product
and similarly all the fields.
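The model in the project is a Java DTO with getters and setters; a hedged Python sketch of the same kind of record (field names are illustrative, taken from the narration, not the exact Java class) could be:

```python
# Hedged sketch of the transaction model as a Python dataclass.
from dataclasses import dataclass

@dataclass
class TransactionDTO:
    transaction_date: str
    product: str
    price: float
    payment_type: str
    name: str
    city: str
    state: str
    country: str
    account_created: str

    def __str__(self) -> str:
        # Plays the role of the overridden toString() in the Java model.
        return (f"TransactionDTO(transaction_date={self.transaction_date}, "
                f"product={self.product}, price={self.price})")
```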
8420
06:08:53,411 --> 06:08:55,488
So let me close this model.
8421
06:08:55,900 --> 06:08:57,100
Let's go back.
8422
06:08:57,200 --> 06:09:00,500
Let's look at the Kafka folder;
there we have the serializer.
8423
06:09:00,500 --> 06:09:02,294
So this is the JSON serializer
8424
06:09:02,294 --> 06:09:06,187
which was there in our producer
and this is transaction decoder.
8425
06:09:06,187 --> 06:09:07,300
Let's take a look.
8426
06:09:07,780 --> 06:09:09,319
Now you have decoder
8427
06:09:09,400 --> 06:09:12,600
which is again implementing
decoder and we're passing
8428
06:09:12,600 --> 06:09:14,800
this transaction dto then again,
8429
06:09:14,800 --> 06:09:17,339
you can see this fromBytes
method
8430
06:09:17,339 --> 06:09:18,800
which we are overriding
8431
06:09:18,800 --> 06:09:22,022
and we are reading
the values using these bytes
8432
06:09:22,022 --> 06:09:24,600
and the transaction
DTO class. Again,
8433
06:09:24,600 --> 06:09:28,600
if it is failing to parse, we are
giving 'JSON processing failed
8434
06:09:28,600 --> 06:09:29,799
for object' for this,
8435
06:09:30,200 --> 06:09:31,573
and you can see we have
8436
06:09:31,573 --> 06:09:34,200
this transaction decoder
constructor over here.
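The decoder class is Java; a hedged Python sketch of the same decode-or-fail behaviour (reusing a trimmed-down version of the dataclass sketched earlier, which is an illustration, not the project's code) might look like:

```python
# Hedged sketch: turn raw Kafka bytes back into a transaction object,
# raising a "JSON processing failed" style error when parsing fails.
import json
from dataclasses import dataclass

@dataclass
class TransactionDTO:          # trimmed illustration of the earlier model
    product: str
    price: float
    payment_type: str

def decode_transaction(raw_bytes: bytes) -> TransactionDTO:
    try:
        fields = json.loads(raw_bytes.decode("utf-8"))
        return TransactionDTO(**fields)
    except (ValueError, TypeError) as err:
        # Mirrors the "JSON processing failed for object" branch in the video.
        raise RuntimeError("JSON processing failed for object") from err
```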
8437
06:09:34,200 --> 06:09:37,200
So let me quickly
again close this file.
8438
06:09:37,200 --> 06:09:38,892
Let's quickly go back.
8439
06:09:39,400 --> 06:09:42,500
And now let's take a look
at the Spark streaming app,
8440
06:09:42,500 --> 06:09:44,200
where basically the data
8441
06:09:44,200 --> 06:09:48,100
which the producer project
will be producing to Kafka,
8442
06:09:48,100 --> 06:09:51,900
will be actually consumed by
spark streaming application.
8443
06:09:51,900 --> 06:09:55,071
So spark streaming will stream
the data in real time
8444
06:09:55,071 --> 06:09:57,000
and then will display the data.
8445
06:09:57,000 --> 06:09:59,600
So in this Spark
streaming application,
8446
06:09:59,600 --> 06:10:03,189
we are creating conf object
and then we are setting
8447
06:10:03,189 --> 06:10:05,900
the application name
as Kafka sandbox.
8448
06:10:05,900 --> 06:10:09,331
The master is local[*],
then we have the Java
8449
06:10:09,331 --> 06:10:13,100
Spark context, so here we
are specifying the Spark context,
8450
06:10:13,100 --> 06:10:16,700
and then next we are specifying
the Java streaming context.
8451
06:10:16,700 --> 06:10:18,500
So this object will basically
8452
06:10:18,500 --> 06:10:21,100
be used to take
the streaming data.
8453
06:10:21,100 --> 06:10:25,946
So we are passing this Java Spark
context over here as a parameter,
8454
06:10:25,946 --> 06:10:29,900
and then we are specifying
the duration that is 2000.
8455
06:10:29,900 --> 06:10:30,200
Next.
8456
06:10:30,200 --> 06:10:32,600
we have the Kafka parameters.
So to connect
8457
06:10:32,600 --> 06:10:35,555
to Kafka you need
to specify this parameters.
8458
06:10:35,555 --> 06:10:37,100
So in Kafka parameters,
8459
06:10:37,100 --> 06:10:39,500
we are specifying
the metadata broker list,
8460
06:10:39,500 --> 06:10:44,292
which is localhost 9092,
then we have auto offset reset,
8461
06:10:44,292 --> 06:10:45,600
that is smallest.
8462
06:10:45,600 --> 06:10:49,200
Then in topics the name
of the topic from which we
8463
06:10:49,200 --> 06:10:53,300
will be consuming messages
is transaction. Next,
8464
06:10:53,300 --> 06:10:56,200
we're creating a Java
pair input DStream.
8465
06:10:56,200 --> 06:10:59,300
So basically this DStream
is a discretized stream,
8466
06:10:59,300 --> 06:11:02,300
which is the basic abstraction
of spark streaming
8467
06:11:02,300 --> 06:11:04,290
and is a continuous sequence
8468
06:11:04,290 --> 06:11:07,104
of rdds representing
a continuous stream
8469
06:11:07,104 --> 06:11:11,200
of data. Now, the DStream can either
be created from live data
8470
06:11:11,200 --> 06:11:13,000
from Kafka, HDFS or Flume,
8471
06:11:13,000 --> 06:11:14,457
or it can be generated
8472
06:11:14,457 --> 06:11:17,900
from transforming existing
DStreams using operations.
8473
06:11:17,900 --> 06:11:18,828
So over here,
8474
06:11:18,828 --> 06:11:21,700
We are again creating
a Java input D stream.
8475
06:11:21,700 --> 06:11:24,700
We are passing String
and transaction DTO as parameters,
8476
06:11:24,700 --> 06:11:27,504
and we are creating
direct Kafka stream object.
8477
06:11:27,504 --> 06:11:29,700
Then we're using
this KafkaUtils
8478
06:11:29,700 --> 06:11:33,000
and we are calling
the method create direct stream
8479
06:11:33,000 --> 06:11:35,885
where we are passing
the parameters as SSC
8480
06:11:35,885 --> 06:11:38,700
that is your spark
streaming context then
8481
06:11:38,700 --> 06:11:40,341
you have String dot class
8482
06:11:40,341 --> 06:11:42,829
which is basically
your key serializer.
8483
06:11:42,829 --> 06:11:45,322
Then transaction DTO
dot class,
8484
06:11:45,322 --> 06:11:46,500
that is basically
8485
06:11:46,500 --> 06:11:49,700
your value serializer
then string decoder
8486
06:11:49,700 --> 06:11:52,868
that is to decode your key
and then transaction
8487
06:11:52,868 --> 06:11:55,900
decoder, basically to
decode your transaction.
8488
06:11:55,900 --> 06:11:57,784
Then you have Kafka parameters,
8489
06:11:57,784 --> 06:11:59,501
which you have created here
8490
06:11:59,501 --> 06:12:02,300
where you have specified
the broker list and auto
8491
06:12:02,300 --> 06:12:05,900
offset reset and then you
are specifying the topics
8492
06:12:05,900 --> 06:12:10,500
which is your transaction. So
next, using this direct Kafka stream,
8493
06:12:10,500 --> 06:12:14,000
you're actually continuously
iterating over the rdd
8494
06:12:14,000 --> 06:12:17,345
and then you are trying
to print your new rdd
8495
06:12:17,345 --> 06:12:19,400
with the RDD partitions
8496
06:12:19,400 --> 06:12:21,200
and size, then the RDD count
8497
06:12:21,200 --> 06:12:24,600
and the records. So then,
for each record,
8498
06:12:24,900 --> 06:12:26,400
So you are printing the record
8499
06:12:26,500 --> 06:12:30,400
and then you are starting
the Spark streaming context,
8500
06:12:30,400 --> 06:12:32,800
and then you are waiting
for the termination.
8501
06:12:32,800 --> 06:12:35,500
So this is the spark
streaming application.
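The job walked through above is Java; as a hedged sketch, roughly the same pipeline in PySpark (assuming Spark 2.x, where pyspark.streaming.kafka is still available, and a local broker) could be:

```python
# Hedged sketch of the demo's streaming consumer, not the project's code.
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

conf = SparkConf().setAppName("kafka-sandbox").setMaster("local[*]")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 2)  # 2-second batches, like the 2000 ms duration above

kafka_params = {
    "metadata.broker.list": "localhost:9092",
    "auto.offset.reset": "smallest",
}

# Direct stream over the "transaction" topic; each element is a (key, value) pair.
stream = KafkaUtils.createDirectStream(ssc, ["transaction"], kafka_params)

def print_batch(rdd):
    # Mirrors the Java job: print partition count, record count, then each record.
    print("partitions:", rdd.getNumPartitions(), "count:", rdd.count())
    for record in rdd.collect():
        print(record)

stream.foreachRDD(print_batch)

ssc.start()
ssc.awaitTermination()
```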
8502
06:12:35,500 --> 06:12:39,200
So let's first quickly go ahead
and execute this application.
8503
06:12:39,200 --> 06:12:40,900
Then let me close this file.
8504
06:12:41,000 --> 06:12:43,400
Let's go to the source.
8505
06:12:44,900 --> 06:12:49,000
Now, let me quickly go ahead and
delete this target directory.
8506
06:12:49,000 --> 06:12:53,615
So now let me quickly open the
terminal and run mvn clean install.
8507
06:12:58,400 --> 06:13:01,800
So now as you can see the target
directory is again created
8508
06:13:01,800 --> 06:13:05,307
and this spark streaming Kafka
snapshot jar is created.
8509
06:13:05,307 --> 06:13:07,300
So we need to execute this jar.
8510
06:13:07,700 --> 06:13:10,800
So let me quickly go ahead
and minimize it.
8511
06:13:12,500 --> 06:13:14,300
Let me close this terminal.
8512
06:13:14,400 --> 06:13:18,000
So now first I'll start
this Spark streaming job.
8513
06:13:18,600 --> 06:13:24,100
So the command is Java -
jar inside the target directory.
8514
06:13:24,600 --> 06:13:31,500
we have this spark streaming
Kafka jar, so let's hit enter.
8515
06:13:34,500 --> 06:13:38,100
So let me know quickly go ahead
and start producing messages.
8516
06:13:41,000 --> 06:13:44,100
So I will minimize this and I
will wait for the messages.
8517
06:13:50,019 --> 06:13:53,480
So let me quickly close
this Spark streaming job
8518
06:13:53,600 --> 06:13:56,900
and then I will show
you the consumed records
8519
06:13:59,000 --> 06:14:00,400
so you can see the record
8520
06:14:00,400 --> 06:14:02,673
that is consumed
from spark streaming.
8521
06:14:02,673 --> 06:14:05,500
So here you have got record
and transaction dto
8522
06:14:05,500 --> 06:14:08,561
and then transaction date
products all the details,
8523
06:14:08,561 --> 06:14:09,969
which we are specified.
8524
06:14:09,969 --> 06:14:11,500
You can see it over here.
8525
06:14:11,500 --> 06:14:15,400
So this is how spark
streaming works with Kafka now,
8526
06:14:15,400 --> 06:14:17,600
it's just a basic job again.
8527
06:14:17,600 --> 06:14:20,900
You can go ahead and you
can take those transactions, you
8528
06:14:20,900 --> 06:14:23,651
can perform some real-time
analytics over there
8529
06:14:23,651 --> 06:14:27,406
and then you can go ahead and
write those results so over here
8530
06:14:27,406 --> 06:14:29,500
we have just given
you a basic demo
8531
06:14:29,500 --> 06:14:32,401
in which we are producing
the records to Kafka
8532
06:14:32,401 --> 06:14:34,400
and then using spark streaming.
8533
06:14:34,400 --> 06:14:37,533
We are streaming those records
from Kafka again.
8534
06:14:37,533 --> 06:14:38,600
You can go ahead
8535
06:14:38,600 --> 06:14:41,083
and you can perform
multiple Transformations
8536
06:14:41,083 --> 06:14:42,848
over the data multiple actions
8537
06:14:42,848 --> 06:14:45,500
and produce some real-time
results using this data.
8538
06:14:45,500 --> 06:14:48,975
So this is just a basic demo
where we have shown you
8539
06:14:48,975 --> 06:14:51,700
how to basically
produce records to Kafka
8540
06:14:51,700 --> 06:14:55,000
and then consume those records
using spark streaming.
8541
06:14:55,000 --> 06:14:57,846
So let's quickly go
back to our slide.
8542
06:14:58,600 --> 06:15:00,526
Now as this was a basic project.
8543
06:15:00,526 --> 06:15:01,669
Let me explain you
8544
06:15:01,669 --> 06:15:04,390
one of the Kafka
Spark streaming projects,
8545
06:15:04,390 --> 06:15:05,754
which is at Edureka.
8546
06:15:05,754 --> 06:15:09,100
So basically there is a company
called techreview.com.
8547
06:15:09,100 --> 06:15:11,900
So this techreview.com
basically provides reviews
8548
06:15:11,900 --> 06:15:14,481
for recent
and different technologies,
8549
06:15:14,481 --> 06:15:17,800
like a smart watches phones
different operating systems
8550
06:15:17,800 --> 06:15:20,100
and anything new
that is coming into Market.
8551
06:15:20,100 --> 06:15:23,409
So what happens is the company
decided to include a new feature
8552
06:15:23,409 --> 06:15:26,883
which will basically allow
users to compare the popularity
8553
06:15:26,883 --> 06:15:29,200
or trend of multiple
Technologies based
8554
06:15:29,200 --> 06:15:32,400
on the Twitter feeds.
And second, for the USP,
8555
06:15:32,400 --> 06:15:33,500
They are basically
8556
06:15:33,500 --> 06:15:36,200
trying to make this comparison
happen in real time.
8557
06:15:36,200 --> 06:15:38,788
So basically they
have assigned you this task
8558
06:15:38,788 --> 06:15:41,299
so that you have to go
ahead you have to take
8559
06:15:41,299 --> 06:15:42,752
the real-time Twitter feeds
8560
06:15:42,752 --> 06:15:45,400
then you have to show
the real time comparison
8561
06:15:45,400 --> 06:15:46,900
of various Technologies.
8562
06:15:46,900 --> 06:15:50,500
So again, the company is
asking you to identify
8563
06:15:50,500 --> 06:15:51,684
the minutely trend
8564
06:15:51,684 --> 06:15:55,500
between different Technologies
by consuming Twitter streams
8565
06:15:55,500 --> 06:15:58,900
and writing the aggregated minutely
count to Cassandra,
8566
06:15:58,900 --> 06:16:00,200
from where again the dash-
8567
06:16:00,200 --> 06:16:02,700
boarding team will come
into picture and then they
8568
06:16:02,700 --> 06:16:06,700
will try to dashboard that data
and it can show you a graph
8569
06:16:06,700 --> 06:16:07,800
where you can see
8570
06:16:07,800 --> 06:16:09,892
how the trend of two different
8571
06:16:09,892 --> 06:16:13,656
or you can see various
Technologies are going ahead now
8572
06:16:13,656 --> 06:16:16,157
the solution strategy
which is there
8573
06:16:16,157 --> 06:16:20,083
so you have to continuously
stream the data from Twitter.
8574
06:16:20,083 --> 06:16:21,689
Then you will be storing
8575
06:16:21,689 --> 06:16:24,322
those tweets
inside a Kafka topic,
8576
06:16:24,322 --> 06:16:25,567
then second again.
8577
06:16:25,567 --> 06:16:27,987
You have to
perform spark streaming.
8578
06:16:27,987 --> 06:16:31,009
So you will be continuously
streaming the data
8579
06:16:31,009 --> 06:16:34,300
and then you will be
applying some Transformations
8580
06:16:34,300 --> 06:16:36,900
which will basically
give you the minute trend
8581
06:16:36,900 --> 06:16:38,361
of the two technologies.
8582
06:16:38,361 --> 06:16:41,747
And again, you'll write it back
to a Kafka topic, and at last
8583
06:16:41,747 --> 06:16:42,992
you'll write a consumer
8584
06:16:42,992 --> 06:16:46,051
that will be consuming messages
from the Kafka topic
8585
06:16:46,051 --> 06:16:49,200
and that will write the data
in your Cassandra database.
8586
06:16:49,200 --> 06:16:51,018
So First you have
to write a program
8587
06:16:51,018 --> 06:16:53,049
that will be consuming
data from Twitter
8588
06:16:53,049 --> 06:16:54,696
and write it to a Kafka topic.
8589
06:16:54,696 --> 06:16:56,999
Then you have to write
a spark streaming job,
8590
06:16:56,999 --> 06:17:00,200
which will be continuously
streaming the data from Kafka
8591
06:17:00,300 --> 06:17:03,300
and perform analytics
to identify the minutely trend,
8592
06:17:03,300 --> 06:17:06,200
and then it will write the data
back to a Kafka topic,
8593
06:17:06,200 --> 06:17:08,282
and then you have
to write the third job
8594
06:17:08,282 --> 06:17:10,114
which will be
basically a consumer
8595
06:17:10,114 --> 06:17:12,668
that will consume data
from the Kafka topic
8596
06:17:12,668 --> 06:17:15,000
and write the data
to a Cassandra database.
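As a hedged sketch of the middle job in that strategy (the minutely aggregation in Spark Streaming), assuming Spark 2.x, a local broker, a "tweets" topic and an illustrative pair of technologies, none of which are specified in the video:

```python
# Hedged sketch: count mentions of each technology per minute from a Kafka
# stream of tweets. The real solution would write these counts back to Kafka
# and on to Cassandra; here we simply print them.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext("local[*]", "tweet-trend")
ssc = StreamingContext(sc, 60)  # 60-second batches give minutely counts

tweets = KafkaUtils.createDirectStream(
    ssc, ["tweets"], {"metadata.broker.list": "localhost:9092"})

technologies = ["iphone", "pixel"]          # illustrative comparison pair

counts = (tweets
          .map(lambda kv: kv[1].lower())                          # tweet text
          .flatMap(lambda text: [t for t in technologies if t in text])
          .map(lambda tech: (tech, 1))
          .reduceByKey(lambda a, b: a + b))

counts.pprint()
ssc.start()
ssc.awaitTermination()
```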
8597
06:17:19,800 --> 06:17:21,709
PySpark is
a powerful framework,
8598
06:17:21,709 --> 06:17:23,960
which has been heavily
used in the industry
8599
06:17:23,960 --> 06:17:26,800
for real-time analytics
and machine learning purposes.
8600
06:17:26,800 --> 06:17:28,689
So before I proceed
with the session,
8601
06:17:28,689 --> 06:17:30,489
let's have a quick
look at the topics
8602
06:17:30,489 --> 06:17:31,968
which will be covering today.
8603
06:17:31,968 --> 06:17:33,600
So I'm starting
off by explaining
8604
06:17:33,600 --> 06:17:35,900
what exactly PySpark is
and how it works.
8605
06:17:35,900 --> 06:17:36,900
When we go ahead.
8606
06:17:36,900 --> 06:17:39,819
We'll find out the various
advantages provided by spark.
8607
06:17:39,819 --> 06:17:41,200
Then I will be showing you
8608
06:17:41,200 --> 06:17:43,400
how to install
PySpark in your systems.
8609
06:17:43,400 --> 06:17:45,300
Once we are done
with the installation.
8610
06:17:45,300 --> 06:17:48,200
I will talk about the
fundamental concepts of PySpark
8611
06:17:48,200 --> 06:17:49,800
like the Spark context,
8612
06:17:49,900 --> 06:17:53,900
data frames, MLlib, RDDs
and much more. And finally,
8613
06:17:53,900 --> 06:17:57,100
I'll close of the session with
the demo in which I'll show you
8614
06:17:57,100 --> 06:18:00,200
how to implement PySpark
to solve real-life use cases.
8615
06:18:00,200 --> 06:18:01,791
So without any further Ado,
8616
06:18:01,791 --> 06:18:04,621
let's quickly embark
on our journey to PySpark. Now,
8617
06:18:04,621 --> 06:18:06,558
before I start off
with PySpark,
8618
06:18:06,558 --> 06:18:09,500
let me first brief you
about the PySpark ecosystem.
8619
06:18:09,500 --> 06:18:13,154
as you can see from the diagram
the spark ecosystem is composed
8620
06:18:13,154 --> 06:18:16,400
of various components like
Spark SQL, Spark Streaming,
8621
06:18:16,400 --> 06:18:19,800
MLlib, GraphX and the core
API component. The Spark
8622
06:18:19,800 --> 06:18:22,000
SQL component is used
to leverage the power
8623
06:18:22,000 --> 06:18:23,320
of declarative queries
8624
06:18:23,320 --> 06:18:26,281
and optimize storage
by executing sql-like queries
8625
06:18:26,281 --> 06:18:27,124
on spark data,
8626
06:18:27,124 --> 06:18:28,654
which is presented in rdds
8627
06:18:28,654 --> 06:18:31,589
and other external sources
spark streaming component
8628
06:18:31,589 --> 06:18:33,882
allows developers
to perform batch processing
8629
06:18:33,882 --> 06:18:36,714
and streaming of data with ease
in the same application.
8630
06:18:36,714 --> 06:18:39,345
The machine learning library
eases the development
8631
06:18:39,345 --> 06:18:41,600
and deployment of
scalable machine learning
8632
06:18:41,600 --> 06:18:43,600
pipelines. The GraphX component
8633
06:18:43,600 --> 06:18:47,100
lets the data scientists work
with graph and non-graph sources
8634
06:18:47,100 --> 06:18:49,982
to achieve flexibility
and resilience in graph.
8635
06:18:49,982 --> 06:18:51,775
construction and transformations,
8636
06:18:51,775 --> 06:18:54,000
and finally the
spark core component.
8637
06:18:54,000 --> 06:18:56,723
It is the most vital component
of spark ecosystem,
8638
06:18:56,723 --> 06:18:57,900
which is responsible
8639
06:18:57,900 --> 06:19:00,644
for basic input output
functions scheduling
8640
06:19:00,644 --> 06:19:04,172
and monitoring the entire
spark ecosystem is built on top
8641
06:19:04,172 --> 06:19:06,014
of this core execution engine,
8642
06:19:06,014 --> 06:19:09,000
which has extensible apis
in different languages
8643
06:19:09,000 --> 06:19:12,300
like Scala Python and Java
and in today's session,
8644
06:19:12,300 --> 06:19:13,915
I will specifically discuss
8645
06:19:13,915 --> 06:19:16,967
about the spark API
in Python programming languages,
8646
06:19:16,967 --> 06:19:19,600
which is more popularly
known as PySpark.
8647
06:19:19,700 --> 06:19:22,839
You might be wondering,
why PySpark? Well, to get
8648
06:19:22,839 --> 06:19:24,000
a better Insight.
8649
06:19:24,000 --> 06:19:26,400
let me give you a brief
intro to PySpark.
8650
06:19:26,400 --> 06:19:29,300
Now, as you already know,
PySpark is the collaboration
8651
06:19:29,300 --> 06:19:31,050
of two powerful Technologies,
8652
06:19:31,050 --> 06:19:32,500
which are spark which is
8653
06:19:32,500 --> 06:19:35,459
an open-source clustering
Computing framework built
8654
06:19:35,459 --> 06:19:38,300
around speed ease of use
and streaming analytics.
8655
06:19:38,300 --> 06:19:40,707
And the other one is python,
of course python,
8656
06:19:40,707 --> 06:19:43,900
which is a general purpose
high-level programming language.
8657
06:19:43,900 --> 06:19:46,900
It provides wide range
of libraries and is majorly used
8658
06:19:46,900 --> 06:19:50,000
for machine learning
and real-time analytics now,
8659
06:19:50,000 --> 06:19:52,000
Now, this gives us PySpark,
8660
06:19:52,000 --> 06:19:53,852
which is a Python
API for Spark
8661
06:19:53,852 --> 06:19:56,581
that lets you harness
the Simplicity of Python
8662
06:19:56,581 --> 06:19:58,400
and The Power of Apache spark.
8663
06:19:58,400 --> 06:20:01,059
in order to tame
big data. PySpark
8664
06:20:01,059 --> 06:20:03,398
also lets you use
the RDDs and comes
8665
06:20:03,398 --> 06:20:06,700
with a default integration
of the Py4J library.
8666
06:20:06,700 --> 06:20:10,397
We'll learn about RDDs later
in this video. Now that you know
8667
06:20:10,397 --> 06:20:11,500
what PySpark is,
8668
06:20:11,500 --> 06:20:14,400
Let's now see the advantages
of using spark with python
8669
06:20:14,400 --> 06:20:17,700
as we all know python
itself is very simple and easy.
8670
06:20:17,700 --> 06:20:20,700
so when Spark is written
in Python, it makes PySpark
8671
06:20:20,700 --> 06:20:22,837
quite easy to learn
and use moreover.
8672
06:20:22,837 --> 06:20:24,737
It's a dynamically typed language,
8673
06:20:24,737 --> 06:20:28,300
which means RDDs can hold
objects of multiple data types.
8674
06:20:28,300 --> 06:20:30,711
Not only that, it also
makes the API simple
8675
06:20:30,711 --> 06:20:32,400
and comprehensive and talking
8676
06:20:32,400 --> 06:20:34,700
about the readability
of code maintenance
8677
06:20:34,700 --> 06:20:36,700
and familiarity with
the python API
8678
06:20:36,700 --> 06:20:38,577
for Apache Spark is far better
8679
06:20:38,577 --> 06:20:41,000
than other programming
languages python also
8680
06:20:41,000 --> 06:20:43,100
provides various options
for visualization,
8681
06:20:43,100 --> 06:20:46,180
which is not possible using
Scala or Java moreover.
8682
06:20:46,180 --> 06:20:49,200
You can conveniently call
R directly from Python.
8683
06:20:49,200 --> 06:20:50,800
And above this, Python comes
8684
06:20:50,800 --> 06:20:52,300
with a wide range of libraries
8685
06:20:52,300 --> 06:20:55,800
like NumPy, pandas,
scikit-learn, Seaborn, Matplotlib,
8686
06:20:55,800 --> 06:20:57,912
and these libraries help
in data analysis
8687
06:20:57,912 --> 06:20:59,300
and also provide mature
8688
06:20:59,300 --> 06:21:02,564
and time test statistics
with all these feature.
8689
06:21:02,564 --> 06:21:04,100
you can effortlessly program
8690
06:21:04,100 --> 06:21:06,700
in PySpark. In case
you get stuck somewhere
8691
06:21:06,700 --> 06:21:07,600
or have a doubt,
8692
06:21:07,600 --> 06:21:08,835
there is a huge PySpark
8693
06:21:08,835 --> 06:21:12,600
community out there whom you
can reach out to and post your query,
8694
06:21:12,600 --> 06:21:13,800
and that is very active.
8695
06:21:13,800 --> 06:21:16,647
So I will make good use
of this opportunity to show you
8696
06:21:16,647 --> 06:21:18,000
how to install Pi spark
8697
06:21:18,000 --> 06:21:20,900
in a system now here
I'm using Red Hat Linux
8698
06:21:20,900 --> 06:21:24,400
based CentOS system;
the same steps can be applied
8699
06:21:24,400 --> 06:21:26,000
for other Linux systems as well.
8700
06:21:26,200 --> 06:21:28,500
So in order to install
Pi spark first,
8701
06:21:28,500 --> 06:21:31,100
make sure that you have
Hadoop installed in your system.
8702
06:21:31,100 --> 06:21:33,700
So if you want to know more
about how to install Hadoop,
8703
06:21:33,700 --> 06:21:36,500
please check out
our new playlist on YouTube
8704
06:21:36,500 --> 06:21:39,909
or you can check out our blog on
the Edureka website. First
8705
06:21:39,909 --> 06:21:43,100
of all you need to go to the
Apache spark official website,
8706
06:21:43,100 --> 06:21:44,750
which is spark
dot apache dot org,
8707
06:21:44,750 --> 06:21:48,025
and in the download section you
can download the latest version
8708
06:21:48,025 --> 06:21:48,907
of spark release
8709
06:21:48,907 --> 06:21:51,500
which supports
the latest version of Hadoop,
8710
06:21:51,500 --> 06:21:53,800
or Hadoop version
2.7 or above now.
8711
06:21:53,800 --> 06:21:55,429
Once you have downloaded it,
8712
06:21:55,429 --> 06:21:57,900
all you need to do is
extract it, or rather say,
8713
06:21:57,900 --> 06:21:59,400
untar the file contents.
8714
06:21:59,400 --> 06:22:01,400
And after that you
need to put in the path
8715
06:22:01,400 --> 06:22:04,200
where the spark is installed
in the bash RC file.
8716
06:22:04,200 --> 06:22:06,082
Now, you also need
to install pip
8717
06:22:06,082 --> 06:22:09,300
and Jupyter notebook using
the pip command, and make sure
8718
06:22:09,300 --> 06:22:11,700
that the version
of Python you have is supported. So,
8719
06:22:11,700 --> 06:22:12,820
as you can see here,
8720
06:22:12,820 --> 06:22:16,114
this is what our bash RC file
looks like here you can see
8721
06:22:16,114 --> 06:22:17,700
that we have put in the path
8722
06:22:17,700 --> 06:22:20,700
for Hadoop, Spark, as
well as the Spark driver Python,
8723
06:22:20,700 --> 06:22:22,200
which is The jupyter Notebook.
8724
06:22:22,200 --> 06:22:23,087
What we'll do.
8725
06:22:23,087 --> 06:22:25,939
is that the moment you
run the PySpark shell,
8726
06:22:25,939 --> 06:22:29,300
it will automatically open
a jupyter notebook for you.
8727
06:22:29,300 --> 06:22:29,551
Now.
8728
06:22:29,551 --> 06:22:32,000
I find jupyter notebook
very easy to work
8729
06:22:32,000 --> 06:22:35,700
with, rather than the shell;
that is a personal choice. Now
8730
06:22:35,700 --> 06:22:37,899
that we are done
with the installation path.
8731
06:22:37,899 --> 06:22:40,100
let's now dive deeper
into PySpark and learn a few
8732
06:22:40,100 --> 06:22:41,100
of its fundamentals,
8733
06:22:41,100 --> 06:22:43,770
which you need to know
in order to work with PySpark.
8734
06:22:43,770 --> 06:22:45,870
Now this timeline shows
the various topics,
8735
06:22:45,870 --> 06:22:48,600
which we will be covering under
the pie spark fundamentals.
8736
06:22:48,700 --> 06:22:49,650
So let's start off.
8737
06:22:49,650 --> 06:22:51,500
With the very first
Topic in our list.
8738
06:22:51,500 --> 06:22:53,100
That is the spark context.
8739
06:22:53,100 --> 06:22:56,335
The spark context is the heart
of any spark application.
8740
06:22:56,335 --> 06:22:59,518
It sets up internal services
and establishes a connection
8741
06:22:59,518 --> 06:23:03,300
to a spark execution environment
through a spark context object.
8742
06:23:03,300 --> 06:23:05,357
You can create rdds accumulators
8743
06:23:05,357 --> 06:23:09,000
and broadcast variable
access Park service's run jobs
8744
06:23:09,000 --> 06:23:11,362
and much more
the spark context allows
8745
06:23:11,362 --> 06:23:14,094
the spark driver application
to access the cluster
8746
06:23:14,094 --> 06:23:15,600
through a resource manager,
8747
06:23:15,600 --> 06:23:16,600
which can be yarn
8748
06:23:16,600 --> 06:23:19,600
or Sparks cluster manager
the driver program then runs.
8749
06:23:19,700 --> 06:23:23,044
Operations inside the executors
on the worker nodes
8750
06:23:23,044 --> 06:23:26,478
and Spark context uses the pie
for Jay to launch a jvm
8751
06:23:26,478 --> 06:23:29,200
which in turn creates
a Java spark context.
8752
06:23:29,200 --> 06:23:30,884
Now there are
various parameters,
8753
06:23:30,884 --> 06:23:33,200
which can be used
with a spark context object
8754
06:23:33,200 --> 06:23:34,700
like the Master app name
8755
06:23:34,700 --> 06:23:37,366
spark home the pie
files the environment
8756
06:23:37,366 --> 06:23:41,600
in which it is set, the batch size,
the serializer, configuration, gateway
8757
06:23:41,600 --> 06:23:44,267
and much more
among these parameters
8758
06:23:44,267 --> 06:23:47,700
the master and app name
are the most commonly used now
8759
06:23:47,700 --> 06:23:51,061
to give you a basic Insight
on how Spark program works.
8760
06:23:51,061 --> 06:23:53,807
I have listed down
its basic lifecycle phases
8761
06:23:53,807 --> 06:23:56,903
the typical life cycle
of a spark program includes
8762
06:23:56,903 --> 06:23:59,367
creating rdds from
external data sources
8763
06:23:59,367 --> 06:24:02,400
or parallelizing a collection
in your driver program.
8764
06:24:02,400 --> 06:24:05,361
Then we have the lazy
transformations, that is, lazily
8765
06:24:05,361 --> 06:24:07,064
transforming the base rdds
8766
06:24:07,064 --> 06:24:10,600
into new RDDs using
transformations, then caching a few
8767
06:24:10,600 --> 06:24:12,700
of those rdds for future reuse
8768
06:24:12,800 --> 06:24:15,800
and finally performing action
to execute parallel computation
8769
06:24:15,800 --> 06:24:17,500
and to produce the results.
8770
06:24:17,500 --> 06:24:19,800
The next topic
on our list is RDD.
8771
06:24:19,800 --> 06:24:20,700
And I'm sure people
8772
06:24:20,700 --> 06:24:23,700
who have already worked with
spark a familiar with this term,
8773
06:24:23,700 --> 06:24:25,582
but for people
who are new to it,
8774
06:24:25,582 --> 06:24:26,900
let me just explain it.
8775
06:24:26,900 --> 06:24:29,782
Now, RDD stands for
resilient distributed data set.
8776
06:24:29,782 --> 06:24:32,000
It is considered to be
the building block
8777
06:24:32,000 --> 06:24:33,433
of any spark application.
8778
06:24:33,433 --> 06:24:35,900
The reason behind this
is these elements run
8779
06:24:35,900 --> 06:24:38,600
and operate on multiple nodes
to do parallel processing
8780
06:24:38,600 --> 06:24:39,400
on a cluster.
8781
06:24:39,400 --> 06:24:40,952
And once you create an RDD,
8782
06:24:40,952 --> 06:24:43,273
it becomes immutable,
and by immutable,
8783
06:24:43,273 --> 06:24:46,637
I mean that it is an object
whose State cannot be modified
8784
06:24:46,637 --> 06:24:47,700
after its created,
8785
06:24:47,700 --> 06:24:49,654
but we can transform
its values by
8786
06:24:49,654 --> 06:24:51,438
applying certain transformations.
8787
06:24:51,438 --> 06:24:53,500
They have good
fault tolerance ability
8788
06:24:53,500 --> 06:24:56,700
and can automatically recover
for almost any failures.
8789
06:24:56,700 --> 06:25:00,700
This is an added advantage.
Now, to achieve a certain task,
8790
06:25:00,700 --> 06:25:03,205
multiple operations can
be applied on these RDDs,
8791
06:25:03,205 --> 06:25:05,675
which are categorized
in two ways the first
8792
06:25:05,675 --> 06:25:06,800
in the transformation
8793
06:25:06,800 --> 06:25:09,900
and the second one is
the actions the Transformations
8794
06:25:09,900 --> 06:25:10,800
are the operations
8795
06:25:10,800 --> 06:25:13,800
which are applied on an RDD
to create a new RDD.
8796
06:25:14,000 --> 06:25:15,300
Now these transformation work
8797
06:25:15,300 --> 06:25:17,300
on the principle
of lazy evaluation
8798
06:25:17,700 --> 06:25:19,900
and transformation
are lazy in nature.
8799
06:25:19,900 --> 06:25:22,927
Meaning, when we call
some operation on an RDD,
8800
06:25:22,927 --> 06:25:25,758
it does not execute
immediately. Spark maintains
8801
06:25:25,758 --> 06:25:28,602
the record of the operations
it is being called
8802
06:25:28,602 --> 06:25:31,324
through, with the help
of directed acyclic graphs,
8803
06:25:31,324 --> 06:25:33,100
which is also known as the DAG,
8804
06:25:33,100 --> 06:25:35,900
and since the Transformations
are lazy in nature.
8805
06:25:35,900 --> 06:25:37,604
So when we execute operation
8806
06:25:37,604 --> 06:25:40,100
any time by calling
an action on the data,
8807
06:25:40,100 --> 06:25:42,371
the lazy evaluation
data is not loaded
8808
06:25:42,371 --> 06:25:43,547
until it's necessary
8809
06:25:43,547 --> 06:25:46,900
and the moment we call out
the action all the computations
8810
06:25:46,900 --> 06:25:49,900
are performed parallely to give
you the desired output.
8811
06:25:49,900 --> 06:25:52,400
Now, a few important
transformations are
8812
06:25:52,400 --> 06:25:53,944
the map flatmap filter
8813
06:25:53,944 --> 06:25:55,360
distinct, reduceBy
8814
06:25:55,360 --> 06:25:59,000
Key, mapPartitions, sortBy.
actions are the operations
8815
06:25:59,000 --> 06:26:02,058
which are applied on an RDD
to instruct Apache Spark
8816
06:26:02,058 --> 06:26:03,188
to apply computation
8817
06:26:03,188 --> 06:26:05,600
and pass the result back
to the driver few
8818
06:26:05,600 --> 06:26:09,100
of these actions include
collect, collectAsMap, reduce,
8819
06:26:09,100 --> 06:26:10,300
take, first. Now,
8820
06:26:10,300 --> 06:26:13,600
let me Implement few of these
for your better understanding.
8821
06:26:14,600 --> 06:26:17,000
So first of all,
let me show you the bash
8822
06:26:17,000 --> 06:26:18,800
rc file which I
was talking about.
8823
06:26:25,100 --> 06:26:27,196
So here you can see
in the bash RC file.
8824
06:26:27,196 --> 06:26:29,400
We provide the path
for all the Frameworks
8825
06:26:29,400 --> 06:26:31,250
which we have installed
in the system.
8826
06:26:31,250 --> 06:26:32,800
So for example,
you can see here.
8827
06:26:32,800 --> 06:26:35,100
We have installed
Hadoop the moment we
8828
06:26:35,100 --> 06:26:38,100
install and unzip it,
or rather say untar it,
8829
06:26:38,100 --> 06:26:41,300
I have shifted all my Frameworks
to one particular location
8830
06:26:41,300 --> 06:26:43,492
as you can see, it is
the usr, the user directory,
8831
06:26:43,492 --> 06:26:46,140
and inside this we have
the library and inside
8832
06:26:46,140 --> 06:26:49,217
that I have installed the Hadoop
and also the Spark. Now,
8833
06:26:49,217 --> 06:26:50,400
as you can see here,
8834
06:26:50,400 --> 06:26:51,300
we have two lines.
8835
06:26:51,300 --> 06:26:54,800
I'll highlight this one for
you: the PySpark driver
8836
06:26:54,800 --> 06:26:56,392
Python, which is the Jupyter,
8837
06:26:56,392 --> 06:26:59,700
and we have given it as
a notebook the option available
8838
06:26:59,700 --> 06:27:02,100
So now, what it will do
is that the moment
8839
06:27:02,100 --> 06:27:04,731
I start PySpark, it will
automatically redirect me
8840
06:27:04,731 --> 06:27:06,200
to The jupyter Notebook.
8841
06:27:10,200 --> 06:27:14,500
So let me just rename
this notebook as rdd tutorial.
8842
06:27:15,200 --> 06:27:16,900
So let's get started.
8843
06:27:17,800 --> 06:27:21,000
So here, to load any file
into an RDD, suppose
8844
06:27:21,000 --> 06:27:23,795
I'm loading a text file,
you need to use the sc,
8845
06:27:23,795 --> 06:27:26,700
that is, the Spark context,
sc dot textFile,
8846
06:27:26,700 --> 06:27:28,952
and you need to provide
the path of the data
8847
06:27:28,952 --> 06:27:30,600
which you are going to load.
8848
06:27:30,600 --> 06:27:33,300
So one thing to keep
in mind is that the default path
8849
06:27:33,300 --> 06:27:35,483
which the RDD takes,
or the Jupyter
8850
06:27:35,483 --> 06:27:37,365
Notebook takes is the hdfs path.
8851
06:27:37,365 --> 06:27:39,456
So in order to use
the local file system,
8852
06:27:39,456 --> 06:27:41,311
you need to mention
the file colon
8853
06:27:41,311 --> 06:27:42,900
and double forward slashes.
8854
06:27:43,800 --> 06:27:46,100
So once our sample data is
8855
06:27:46,100 --> 06:27:49,076
inside the RDD. Now, to
have a look at it,
8856
06:27:49,076 --> 06:27:52,000
we need to invoke
an action on it.
8857
06:27:52,000 --> 06:27:54,900
So let's go ahead and take
a look at the first five objects
8858
06:27:54,900 --> 06:27:59,400
or rather say the first five
elements of this particular RDD.
8859
06:27:59,700 --> 06:28:02,776
The sample data I have taken
here is about blockchain,
8860
06:28:02,776 --> 06:28:03,700
as you can see.
8861
06:28:03,700 --> 06:28:05,000
We have one two,
8862
06:28:05,030 --> 06:28:07,569
three four and
five elements here.
8863
06:28:08,500 --> 06:28:12,080
Suppose I need to convert
all the data into a lowercase
8864
06:28:12,080 --> 06:28:14,600
and split it according
to word by word.
8865
06:28:14,600 --> 06:28:17,000
So for that I will
create a function
8866
06:28:17,000 --> 06:28:20,000
and in the function
I'll pass on this Oddity.
8867
06:28:20,000 --> 06:28:21,700
So I'm creating
as you can see here.
8868
06:28:21,700 --> 06:28:22,990
I'm creating rdd one
8869
06:28:22,990 --> 06:28:25,700
that is, a new RDD,
and using the map function
8870
06:28:25,700 --> 06:28:29,200
or rather say the transformation
and passing on the function,
8871
06:28:29,200 --> 06:28:32,100
which I just created to lower
and to split it.
8872
06:28:32,496 --> 06:28:35,803
So if we have a look
at the output of our D1
8873
06:28:37,800 --> 06:28:39,059
As you can see here,
8874
06:28:39,059 --> 06:28:41,200
all the words are
in the lower case
8875
06:28:41,200 --> 06:28:44,300
and all of them are separated
with the help of a space bar.
8876
06:28:44,700 --> 06:28:47,000
Now this another transformation,
8877
06:28:47,000 --> 06:28:50,216
which is known as the flat map
to give you a flat and output
8878
06:28:50,216 --> 06:28:52,157
and I am passing
the same function
8879
06:28:52,157 --> 06:28:53,569
which I created earlier.
8880
06:28:53,569 --> 06:28:54,500
So let's go ahead
8881
06:28:54,500 --> 06:28:56,800
and have a look
at the output for this one.
8882
06:28:56,800 --> 06:28:58,200
So as you can see here,
8883
06:28:58,200 --> 06:29:00,189
we got the first five elements
8884
06:29:00,189 --> 06:29:04,355
which are the same ones as we got
here: the contracts, transactions,
8885
06:29:04,355 --> 06:29:05,700
and and the records.
8886
06:29:05,700 --> 06:29:07,523
So just one thing
to keep in mind.
8887
06:29:07,523 --> 06:29:09,700
is that the flatMap
is a transformation
8888
06:29:09,700 --> 06:29:11,664
whereas take is the action. Now,
8889
06:29:11,664 --> 06:29:13,614
as you can see
that the contents
8890
06:29:13,614 --> 06:29:16,007
of the sample data
contains stop words.
8891
06:29:16,007 --> 06:29:18,762
So if I want to remove
all the stop words, all you
8892
06:29:18,762 --> 06:29:19,900
need to do is start
8893
06:29:19,900 --> 06:29:23,351
and create a list of stop words
in which I have mentioned here
8894
06:29:23,351 --> 06:29:24,200
as you can see.
8895
06:29:24,200 --> 06:29:26,200
We have 'a', 'all', 'the', 'as', 'is',
8896
06:29:26,200 --> 06:29:28,700
and now these are
not all the stop words.
8897
06:29:28,700 --> 06:29:31,701
So I've chosen only a few
of them just to show you
8898
06:29:31,701 --> 06:29:33,600
what exactly the output will be
8899
06:29:33,600 --> 06:29:36,100
and now we are using here
the filter transformation
8900
06:29:36,100 --> 06:29:37,800
and with the help of a lambda
8901
06:29:37,800 --> 06:29:40,800
function, in which we have
specified x as x not
8902
06:29:40,800 --> 06:29:43,360
in stop words, and we
have created another rdd
8903
06:29:43,360 --> 06:29:44,465
which is RDD3,
8904
06:29:44,465 --> 06:29:46,000
which will take the input
8905
06:29:46,000 --> 06:29:48,800
from RDD2. So
let's go ahead and see
8906
06:29:48,800 --> 06:29:51,700
whether 'and' and 'the'
are removed or not.
8907
06:29:51,700 --> 06:29:55,600
As you can see: contracts,
transaction, records, of, them.
8908
06:29:55,600 --> 06:29:57,500
If you look at the output 5,
8909
06:29:57,500 --> 06:30:00,979
we have contracts, transaction,
and the words 'and', 'the' and 'in'
8910
06:30:00,979 --> 06:30:02,337
are not in this list,
8911
06:30:02,337 --> 06:30:04,600
but suppose I want
to group the data
8912
06:30:04,600 --> 06:30:07,523
according to the first
three characters of any element.
8913
06:30:07,523 --> 06:30:08,756
So for that I'll use
8914
06:30:08,756 --> 06:30:11,900
the group by and I'll use
the Lambda function again.
8915
06:30:11,900 --> 06:30:14,000
So let's have a look
at the output
8916
06:30:14,000 --> 06:30:16,769
so you can see we
have EDG and edges.
8917
06:30:16,900 --> 06:30:20,638
So the first three letters of
both words are the same. Similarly,
8918
06:30:20,638 --> 06:30:23,300
We can find it using
the first two letters.
8919
06:30:23,300 --> 06:30:27,800
Also, let me just change it
to two, so you can see we have 'gu'
8920
06:30:27,800 --> 06:30:29,800
and guid just a guide
8921
06:30:30,000 --> 06:30:32,200
Now, these are
the basic transformations
8922
06:30:32,200 --> 06:30:33,785
and actions. But suppose
8923
06:30:33,785 --> 06:30:37,400
I want to find out the sum
of the first thousand numbers.
8924
06:30:37,400 --> 06:30:39,436
or rather say the first
10,000 numbers,
8925
06:30:39,436 --> 06:30:42,300
All I need to do
is initialize another RDD,
8926
06:30:42,300 --> 06:30:44,400
which is the
num underscore RDD.
8927
06:30:44,400 --> 06:30:47,512
And we use the sc dot
parallelize, and the range
8928
06:30:47,512 --> 06:30:49,500
we have given is one to 10,000
8929
06:30:49,500 --> 06:30:51,600
and we'll use the reduce action
8930
06:30:51,600 --> 06:30:54,532
here to see the output.
So you can see here,
8931
06:30:54,532 --> 06:30:56,840
We have the sum
of the numbers ranging
8932
06:30:56,840 --> 06:30:58,400
from one to ten thousand.
8933
06:30:58,400 --> 06:31:00,900
Now this was all about rdd.
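As a hedged sketch of the notebook steps just walked through (the file path and the stop-word list here are illustrative, not the exact ones used in the video, and sc is assumed to be the notebook's Spark context):

```python
# Hedged sketch of the RDD demo: load, map, flatMap, filter, groupBy, reduce.
rdd = sc.textFile("file:///home/user/sample_blockchain.txt")  # illustrative path
print(rdd.take(5))

def to_words(line):
    return line.lower().split(" ")

rdd1 = rdd.map(to_words)        # a list of lowercase words per line
rdd2 = rdd.flatMap(to_words)    # one flat sequence of words

stop_words = ["a", "all", "the", "as", "is", "and", "in"]   # illustrative list
rdd3 = rdd2.filter(lambda x: x not in stop_words)
print(rdd3.take(10))

grouped = rdd3.groupBy(lambda w: w[0:3])        # group by first three characters
print([(key, list(words)) for key, words in grouped.take(2)])

num_rdd = sc.parallelize(range(1, 10000))
print(num_rdd.reduce(lambda x, y: x + y))       # sum of the range
```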
8934
06:31:00,900 --> 06:31:01,699
The next topic
8935
06:31:01,699 --> 06:31:03,711
that we have
on our list is broadcast
8936
06:31:03,711 --> 06:31:07,181
and accumulators now in spark
we perform parallel processing
8937
06:31:07,181 --> 06:31:09,100
through the Help
of shared variables
8938
06:31:09,100 --> 06:31:11,672
or when the driver sends
any tasks with the executor
8939
06:31:11,672 --> 06:31:14,900
present on the cluster a copy of
the shared variable is also sent
8940
06:31:14,900 --> 06:31:15,700
to the each node
8941
06:31:15,700 --> 06:31:18,100
of the cluster thus
maintaining High availability
8942
06:31:18,100 --> 06:31:19,400
and fault tolerance.
8943
06:31:19,400 --> 06:31:22,223
Now, this is done in order
to accomplish the task
8944
06:31:22,223 --> 06:31:25,341
and Apache Spark supports
two types of shared variables.
8945
06:31:25,341 --> 06:31:26,711
One of them is broadcast.
8946
06:31:26,711 --> 06:31:28,861
And the other one is
the accumulator now
8947
06:31:28,861 --> 06:31:31,735
broadcast variables are used
to save the copy of data
8948
06:31:31,735 --> 06:31:33,334
on all the nodes in a cluster.
8949
06:31:33,334 --> 06:31:36,117
Whereas the accumulator is
the variable that is used
8950
06:31:36,117 --> 06:31:37,700
for aggregating the incoming.
8951
06:31:37,700 --> 06:31:40,056
information via
different associative
8952
06:31:40,056 --> 06:31:43,500
and commutative operations now
moving on to our next topic
8953
06:31:43,500 --> 06:31:47,094
which is a spark configuration
the spark configuration class
8954
06:31:47,094 --> 06:31:49,800
provides a set
of configurations and parameters
8955
06:31:49,800 --> 06:31:52,300
that are needed to execute
a spark application
8956
06:31:52,300 --> 06:31:54,300
on the local system
or any cluster.
8957
06:31:54,300 --> 06:31:56,800
Now when you use
spark configuration object
8958
06:31:56,800 --> 06:31:59,112
to set the values
to these parameters,
8959
06:31:59,112 --> 06:32:02,800
they automatically take priority
over the system properties.
8960
06:32:02,800 --> 06:32:05,035
Now this class
contains various Getters
8961
06:32:05,035 --> 06:32:07,800
and Setters methods some
of which are Set method
8962
06:32:07,800 --> 06:32:10,323
which is used to set
a configuration property.
8963
06:32:10,323 --> 06:32:11,555
We have the set master
8964
06:32:11,555 --> 06:32:13,605
which is used for setting
the master URL.
8965
06:32:13,605 --> 06:32:14,840
Then the set app name,
8966
06:32:14,840 --> 06:32:17,421
which is used to set
an application name and we
8967
06:32:17,421 --> 06:32:20,900
have the get method to retrieve
a configuration value of a key.
8968
06:32:20,900 --> 06:32:23,000
And finally we
have set spark home
8969
06:32:23,000 --> 06:32:25,600
which is used for setting
the spark installation path
8970
06:32:25,600 --> 06:32:26,700
on worker nodes.
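As a hedged sketch of those getter and setter methods (the values used here are illustrative, not ones given in the video):

```python
# Hedged sketch of the SparkConf setters and getters described above.
from pyspark import SparkConf

conf = (SparkConf()
        .setMaster("local[*]")                   # set the master URL
        .setAppName("config-demo")               # set an application name
        .setSparkHome("/usr/lib/spark")          # assumed installation path
        .set("spark.executor.memory", "1g"))     # generic set(key, value)

print(conf.get("spark.app.name"))                # retrieve a configuration value
```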
8971
06:32:26,700 --> 06:32:28,800
Now coming to the next
topic on our list
8972
06:32:28,800 --> 06:32:31,600
which is a spark files
the spark file class
8973
06:32:31,600 --> 06:32:33,264
contains only the class methods
8974
06:32:33,264 --> 06:32:36,500
so that the user cannot create
any spark files instance.
8975
06:32:36,500 --> 06:32:39,200
Now, this helps in resolving
the path of the files
8976
06:32:39,200 --> 06:32:41,500
that are added using
the spark context add
8977
06:32:41,500 --> 06:32:44,600
file method. The class SparkFiles
contains two class methods,
8978
06:32:44,600 --> 06:32:47,798
which are the get method and
the get root directory method.
8979
06:32:47,798 --> 06:32:50,500
Now, the get is used
to retrieve the absolute path
8980
06:32:50,500 --> 06:32:53,900
of a file added through
spark context to add file
8981
06:32:54,000 --> 06:32:55,300
and the get root directory
8982
06:32:55,300 --> 06:32:57,076
is used to retrieve
the root directory
8983
06:32:57,076 --> 06:32:58,900
that contains the files
that are added.
8984
06:32:58,900 --> 06:33:00,700
So this park context
dot add file.
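A hedged sketch of those two class methods (the file name is purely illustrative, and sc is assumed to be an existing Spark context):

```python
# Hedged sketch of SparkFiles.get and SparkFiles.getRootDirectory.
from pyspark import SparkFiles

sc.addFile("file:///tmp/lookup.csv")    # distribute a file to the executors
print(SparkFiles.get("lookup.csv"))     # absolute path of the added file
print(SparkFiles.getRootDirectory())    # root directory holding added files
```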
8985
06:33:00,700 --> 06:33:03,022
Now, these were small topics,
and the next topic
8986
06:33:03,022 --> 06:33:04,257
that we will covering
8987
06:33:04,257 --> 06:33:07,600
in our list are the data frames
Now, data frames in Apache
8988
06:33:07,600 --> 06:33:09,655
Spark is a distributed
collection of rows
8989
06:33:09,655 --> 06:33:10,831
under named columns,
8990
06:33:10,831 --> 06:33:13,400
which is similar to
the relational database tables
8991
06:33:13,400 --> 06:33:14,700
or Excel sheets.
8992
06:33:14,700 --> 06:33:16,812
It also shares common attributes
8993
06:33:16,812 --> 06:33:19,800
with the RDDs. A few
characteristics of data frames
8994
06:33:19,800 --> 06:33:21,300
are immutable in nature.
8995
06:33:21,300 --> 06:33:23,500
That is,
you can create a data frame,
8996
06:33:23,500 --> 06:33:24,900
but you cannot change it.
8997
06:33:24,900 --> 06:33:26,500
It allows lazy evaluation.
8998
06:33:26,500 --> 06:33:28,300
That is, the task is not executed
8999
06:33:28,300 --> 06:33:30,500
unless and until
an action is triggered
9000
06:33:30,500 --> 06:33:33,000
and moreover data frames
are distributed in nature,
9001
06:33:33,000 --> 06:33:34,900
which are designed
for processing large
9002
06:33:34,900 --> 06:33:37,400
collection of structure
or semi-structured data.
9003
06:33:37,400 --> 06:33:39,953
Data frames can be created using
different data formats,
9004
06:33:39,953 --> 06:33:41,200
like loading the data
9005
06:33:41,200 --> 06:33:43,650
from source files
such as Json or CSV,
9006
06:33:43,650 --> 06:33:46,100
or you can load it
from an existing RDD,
9007
06:33:46,100 --> 06:33:48,842
you can use databases
like Hive, Cassandra.
9008
06:33:48,842 --> 06:33:50,600
You can use Parquet files.
9009
06:33:50,600 --> 06:33:52,800
You can use CSV XML files.
9010
06:33:52,800 --> 06:33:53,900
There are many sources
9011
06:33:53,900 --> 06:33:56,448
through which you can create
a particular data frame. Now,
9012
06:33:56,448 --> 06:33:59,200
let me show you how to create
a data frame in PySpark
9013
06:33:59,200 --> 06:34:02,100
and perform various actions
and Transformations on it.
9014
06:34:02,300 --> 06:34:05,065
So let's continue this
in the same notebook
9015
06:34:05,065 --> 06:34:07,700
which we have here now
here we have taken
9016
06:34:07,700 --> 06:34:09,300
the NYC flight data,
9017
06:34:09,300 --> 06:34:12,561
and I'm creating a data frame
which is the NYC flights
9018
06:34:12,561 --> 06:34:13,300
underscore
9019
06:34:13,300 --> 06:34:14,959
df. Now, to load the data,
9020
06:34:14,959 --> 06:34:18,340
we are using the spark dot
read dot csv method, and you need
9021
06:34:18,340 --> 06:34:19,600
to provide the path
9022
06:34:19,600 --> 06:34:21,900
which is the local path.
By default,
9023
06:34:21,900 --> 06:34:24,200
it takes the HDFS path, same as the RDD,
9024
06:34:24,200 --> 06:34:26,208
and one thing
to note down here is
9025
06:34:26,208 --> 06:34:28,886
that I've provided
two parameters extra here,
9026
06:34:28,886 --> 06:34:31,400
which are the inferSchema
and the header;
9027
06:34:31,400 --> 06:34:34,700
if we do not provide
these as true or we skip them,
9028
06:34:34,700 --> 06:34:35,800
what will happen.
9029
06:34:35,800 --> 06:34:39,300
is that if your data set has
the names of the columns
9030
06:34:39,300 --> 06:34:42,863
on the first row it will take
those as data as well.
9031
06:34:42,863 --> 06:34:45,100
It will not infer
the schema now.
9032
06:34:45,100 --> 06:34:49,023
Once we have loaded the data
in our data frame we need to use
9033
06:34:49,023 --> 06:34:51,900
the show action to have
a look at the output.
9034
06:34:51,900 --> 06:34:53,223
So as you can see here,
9035
06:34:53,223 --> 06:34:55,399
we have the output
which is exactly it
9036
06:34:55,399 --> 06:34:58,600
gives us the top 20 rows
or the particular data set.
9037
06:34:58,600 --> 06:35:02,600
We have the year month day
departure time, departure delay,
9038
06:35:02,600 --> 06:35:07,000
arrival time arrival delay
and so many more attributes.
9039
06:35:07,300 --> 06:35:08,500
To print the schema
9040
06:35:08,500 --> 06:35:11,500
of the particular data frame
you need the transformation
9041
06:35:11,500 --> 06:35:13,762
or rather say the action
of printSchema.
9042
06:35:13,762 --> 06:35:15,900
So let's have a look
at the schema.
9043
06:35:15,900 --> 06:35:19,117
As you can see here, we have year,
which is integer, month integer.
9044
06:35:19,117 --> 06:35:21,000
Almost half of them are integer.
9045
06:35:21,000 --> 06:35:23,600
We have the carrier as
string the tail number
9046
06:35:23,600 --> 06:35:26,625
a string the origin
string destination string
9047
06:35:26,625 --> 06:35:28,123
and so on now suppose.
9048
06:35:28,123 --> 06:35:29,075
I want to know
9049
06:35:29,075 --> 06:35:31,786
how many records are
there in my database
9050
06:35:31,786 --> 06:35:33,685
or the data frame rather say
9051
06:35:33,685 --> 06:35:36,600
so you need the count
function for this one.
9052
06:35:36,600 --> 06:35:40,600
It will provide us the results.
So, as you can see here,
9053
06:35:40,600 --> 06:35:42,992
we have three point
three million records
9054
06:35:42,992 --> 06:35:44,097
here three million
9055
06:35:44,097 --> 06:35:46,800
thirty six thousand
seven hundred seventy six
9056
06:35:46,800 --> 06:35:48,400
to be exact now suppose.
9057
06:35:48,400 --> 06:35:51,153
I want to have a look
at the flight name the origin
9058
06:35:51,153 --> 06:35:52,400
and the destination
9059
06:35:52,400 --> 06:35:55,400
of just these three columns
from the particular data frame.
9060
06:35:55,400 --> 06:35:57,800
We need to use
the select option.
9061
06:35:58,200 --> 06:36:00,882
So as you can see here,
we have the top 20 rows.
9062
06:36:00,882 --> 06:36:03,128
Now, what we saw
was the select query
9063
06:36:03,128 --> 06:36:05,000
on this particular data frame,
9064
06:36:05,000 --> 06:36:07,240
but if I wanted
to see or rather,
9065
06:36:07,240 --> 06:36:09,200
I want to check the summary.
9066
06:36:09,200 --> 06:36:11,400
Of any particular
column suppose.
9067
06:36:11,400 --> 06:36:14,500
I want to check the
what is the lowest count
9068
06:36:14,500 --> 06:36:18,100
or the highest count in
the particular distance column.
9069
06:36:18,100 --> 06:36:20,500
I need to use
the describe function here.
9070
06:36:20,500 --> 06:36:23,100
So I'll show you what
the summary looks like.
9071
06:36:23,500 --> 06:36:25,142
So the distance the count
9072
06:36:25,142 --> 06:36:27,900
is the number of rows
total number of rows.
9073
06:36:27,900 --> 06:36:30,800
We have the mean the standard
deviation, then the minimum value,
9074
06:36:30,800 --> 06:36:32,900
which is 17
and the maximum value,
9075
06:36:32,900 --> 06:36:34,500
which is 4983.
9076
06:36:34,900 --> 06:36:38,100
Now this gives you a summary
of the particular column
9077
06:36:38,100 --> 06:36:39,856
if you want to. So, now
that we know
9078
06:36:39,856 --> 06:36:41,838
that the minimum distance is 17,
9079
06:36:41,838 --> 06:36:44,500
Let's go ahead and filter
out our data using
9080
06:36:44,500 --> 06:36:47,700
the filter function
in which the distance is 17.
9081
06:36:48,700 --> 06:36:49,978
So you can see here.
9082
06:36:49,978 --> 06:36:51,000
We have one data
9083
06:36:51,000 --> 06:36:55,700
in which in the 2013 year
the minimum distance here is 17
9084
06:36:55,700 --> 06:36:59,100
but similarly suppose I want
to have a look at the flights
9085
06:36:59,100 --> 06:37:01,600
which are originating from EWR.
9086
06:37:01,900 --> 06:37:02,400
Similarly.
9087
06:37:02,400 --> 06:37:04,600
We use the filter
function here as well.
9088
06:37:04,600 --> 06:37:06,599
Now, there is another clause here,
9089
06:37:06,599 --> 06:37:09,300
which is the where
Clause is also used
9090
06:37:09,300 --> 06:37:11,236
for filtering. Suppose
9091
06:37:11,236 --> 06:37:12,800
I want to have a look
9092
06:37:12,815 --> 06:37:16,046
at the flight data
and filter it out to see
9093
06:37:16,046 --> 06:37:17,507
if the day at
9094
06:37:17,507 --> 06:37:22,000
Which the flight took off was
the second of any month suppose.
9095
06:37:22,000 --> 06:37:23,589
So here instead of filter.
9096
06:37:23,589 --> 06:37:25,422
We can also use a where clause
9097
06:37:25,422 --> 06:37:27,500
which will give us
the same output.
9098
06:37:29,200 --> 06:37:33,100
Now, we can also pass
on multiple parameters
9099
06:37:33,100 --> 06:37:36,000
and rather say
the multiple conditions.
9100
06:37:36,000 --> 06:37:39,866
So suppose I want the day
of the flight should be seventh
9101
06:37:39,866 --> 06:37:41,839
and the origin should be JFK
9102
06:37:41,839 --> 06:37:45,292
and the arrival delay
should be less than 0 I mean
9103
06:37:45,292 --> 06:37:47,900
that is for none
of the postponed flights.
9104
06:37:48,000 --> 06:37:49,600
So just to have a look
9105
06:37:49,600 --> 06:37:52,314
at these numbers,
we'll use the where clause
9106
06:37:52,314 --> 06:37:55,600
and separate all the conditions
using the & symbol.
9107
06:37:56,100 --> 06:37:57,800
so you can see
here all the data.
9108
06:37:57,800 --> 06:38:00,700
The day is 7 the origin is JFK
9109
06:38:01,100 --> 06:38:04,900
and the arrival delay
is less than 0 now.
9110
06:38:04,900 --> 06:38:07,621
These were the basic
Transformations and actions
9111
06:38:07,621 --> 06:38:09,300
on the particular data frame.
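For reference, a minimal PySpark sketch of the operations narrated above; the DataFrame name nyc_flights_df and the column names (origin, distance, day, arr_delay) are assumptions taken from the narration, not a verified schema.

```python
# A minimal sketch, not the instructor's exact notebook.
nyc_flights_df.select("origin", "distance").show()            # shows the top 20 rows
nyc_flights_df.describe("distance").show()                    # count, mean, stddev, min, max
nyc_flights_df.filter(nyc_flights_df.distance == 17).show()   # rows with the minimum distance
nyc_flights_df.filter(nyc_flights_df.origin == "EWR").show()  # flights originating from EWR
nyc_flights_df.where(nyc_flights_df.day == 2).show()          # where works like filter

# Multiple conditions are combined with &, each wrapped in parentheses
nyc_flights_df.where(
    (nyc_flights_df.day == 7) &
    (nyc_flights_df.origin == "JFK") &
    (nyc_flights_df.arr_delay < 0)
).show()
```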
9112
06:38:09,300 --> 06:38:12,900
Now one thing we can also do
is create a temporary table
9113
06:38:12,900 --> 06:38:14,100
for SQL queries
9114
06:38:14,100 --> 06:38:15,100
if someone is
9115
06:38:15,100 --> 06:38:19,000
not comfortable with or does not want
to use all these transformations
9116
06:38:19,000 --> 06:38:22,400
and actions and would rather
use SQL queries on the data.
9117
06:38:22,400 --> 06:38:26,006
They can use the registerTempTable
method to create a table
9118
06:38:26,006 --> 06:38:27,925
for their particular data frame.
9119
06:38:27,925 --> 06:38:30,129
What we'll do is
convert the nyc_flights_df
9120
06:38:30,129 --> 06:38:33,600
data frame
into an nyc_flights table,
9121
06:38:33,600 --> 06:38:36,700
which can be used later
and SQL queries can be performed
9122
06:38:36,700 --> 06:38:38,500
on this particular table.
9123
06:38:38,600 --> 06:38:43,000
So you remember in the beginning
we used the nyc_flights_df
9124
06:38:43,000 --> 06:38:47,600
dot show. Now we can use
select asterisk from
9125
06:38:47,600 --> 06:38:51,600
nyc_flights to get
the same output. Now suppose
9126
06:38:51,600 --> 06:38:55,011
we want to look at the minimum
air time of any flight.
9127
06:38:55,011 --> 06:38:58,217
We use the select minimum
air time from NYC flights.
9128
06:38:58,217 --> 06:38:59,600
That is the SQL query.
9129
06:38:59,600 --> 06:39:02,400
We pass all the SQL query
in the sequel context
9130
06:39:02,400 --> 06:39:03,700
or SQL function.
9131
06:39:03,700 --> 06:39:04,800
So you can see here.
9132
06:39:04,800 --> 06:39:07,900
We have the minimum air time
as 20 now to have a look
9133
06:39:07,900 --> 06:39:11,400
at the records in which
the air time is the minimum, 20.
9134
06:39:11,600 --> 06:39:14,693
Now we can also use
nested SQL queries a suppose
9135
06:39:14,693 --> 06:39:15,847
if I want to check
9136
06:39:15,847 --> 06:39:19,328
which all flights have
the Minimum air time as 20 now
9137
06:39:19,328 --> 06:39:20,553
that cannot be done
9138
06:39:20,553 --> 06:39:24,132
in a simple SQL query we need
nested query for that one.
9139
06:39:24,132 --> 06:39:26,800
So select asterisk
from NYC flights
9140
06:39:26,800 --> 06:39:29,500
where the airtime
is in and inside
9141
06:39:29,500 --> 06:39:30,913
that we have another query,
9142
06:39:30,913 --> 06:39:33,477
which is Select minimum air time
from NYC flights.
9143
06:39:33,477 --> 06:39:35,100
Let's see if this works or not.
9144
06:39:37,200 --> 06:39:38,497
So as you can see here,
9145
06:39:38,497 --> 06:39:41,600
we have two flights which have
the minimum air time as 20.
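A hedged sketch of this temp-table route follows; the table name "nyc_flights" and the column name "air_time" are assumptions based on the narration.

```python
nyc_flights_df.registerTempTable("nyc_flights")

sqlContext.sql("SELECT * FROM nyc_flights").show()              # same as nyc_flights_df.show()
sqlContext.sql("SELECT MIN(air_time) FROM nyc_flights").show()  # minimum air time

# Nested query: which flights have that minimum air time
sqlContext.sql("""
    SELECT * FROM nyc_flights
    WHERE air_time IN (SELECT MIN(air_time) FROM nyc_flights)
""").show()
```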
9146
06:39:42,200 --> 06:39:44,400
So guys this is it
for data frames.
9147
06:39:44,400 --> 06:39:46,147
So let's get back
to our presentation
9148
06:39:46,147 --> 06:39:48,697
and have a look at the list
which we were following.
9149
06:39:48,697 --> 06:39:49,966
We completed data frames.
9150
06:39:49,966 --> 06:39:52,600
Next we have storage levels.
Now, StorageLevel
9151
06:39:52,600 --> 06:39:55,200
in pie spark is a class
which helps in deciding
9152
06:39:55,200 --> 06:39:56,991
how the rdds should be stored
9153
06:39:56,991 --> 06:39:59,400
Now, based on this, RDDs
are either stored
9154
06:39:59,400 --> 06:40:01,400
on disk or in memory or in
9155
06:40:01,400 --> 06:40:04,300
both. The StorageLevel
class also decides
9156
06:40:04,300 --> 06:40:06,594
whether the RDDs
should be serialized
9157
06:40:06,594 --> 06:40:09,480
or replicate their partitions.
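A minimal sketch of choosing a storage level in PySpark, assuming the pyspark shell where sc is already available; the RDD here is purely illustrative.

```python
from pyspark import StorageLevel

rdd = sc.parallelize(range(1000))
rdd.persist(StorageLevel.MEMORY_AND_DISK_2)  # memory and disk, 2 replicas
rdd.count()      # the first action materializes and caches the RDD
rdd.unpersist()  # release the storage when it is no longer needed
```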
9158
06:40:09,480 --> 06:40:12,000
Now, the final and last topic
for today's list
9159
06:40:12,000 --> 06:40:15,100
is MLlib. MLlib is
the machine learning API
9160
06:40:15,100 --> 06:40:17,000
which is provided by spark,
9161
06:40:17,000 --> 06:40:18,600
which is also present in Python.
9162
06:40:18,700 --> 06:40:21,180
And this library
is heavily used in Python
9163
06:40:21,180 --> 06:40:22,597
for machine learning as
9164
06:40:22,597 --> 06:40:26,094
well as real-time streaming
analytics. The various algorithms
9165
06:40:26,094 --> 06:40:28,773
supported by this libraries
are first of all,
9166
06:40:28,773 --> 06:40:30,600
we have spark.mllib.
9167
06:40:30,600 --> 06:40:33,482
Now recently, PySpark
MLlib supports model-
9168
06:40:33,482 --> 06:40:37,500
based collaborative filtering
by a small set of latent factors
9169
06:40:37,500 --> 06:40:40,500
and here all the users
and the products are described
9170
06:40:40,500 --> 06:40:42,300
which we can use
to predict them.
9171
06:40:42,300 --> 06:40:45,909
missing entries. However,
to learn these latent factors,
9172
06:40:45,909 --> 06:40:48,886
spark.mllib uses
the alternating least squares,
9173
06:40:48,886 --> 06:40:50,755
which is the ALS algorithm.
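A hedged sketch of model-based collaborative filtering with ALS; the input path "ratings.csv" and its (user, product, rating) layout are assumptions, not part of the tutorial.

```python
from pyspark.mllib.recommendation import ALS, Rating

raw = sc.textFile("ratings.csv")
ratings = raw.map(lambda line: line.split(",")) \
             .map(lambda p: Rating(int(p[0]), int(p[1]), float(p[2])))

model = ALS.train(ratings, rank=10, iterations=10)  # learn the latent factors
print(model.predict(1, 2))                          # predict a missing (user, product) entry
```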
9174
06:40:50,755 --> 06:40:52,900
Next we have MLlib clustering.
9175
06:40:52,900 --> 06:40:53,852
An unsupervised
9176
06:40:53,852 --> 06:40:57,700
learning problem is clustering
now here we try to group subsets
9177
06:40:57,700 --> 06:40:59,989
of entities with one
another on the basis
9178
06:40:59,989 --> 06:41:02,000
of some notion of similarity.
9179
06:41:02,200 --> 06:41:02,500
Next.
9180
06:41:02,500 --> 06:41:04,500
We have frequent
pattern mining,
9181
06:41:04,500 --> 06:41:08,400
which is FPM. Now, frequent
pattern mining is mining
9182
06:41:08,400 --> 06:41:12,800
frequent items, itemsets,
subsequences or other substructures
9183
06:41:12,800 --> 06:41:13,600
that are usually
9184
06:41:13,600 --> 06:41:16,900
among the first steps to analyze
a large-scale data set.
9185
06:41:16,900 --> 06:41:20,600
This has been an active research
topic in data mining for years.
9186
06:41:20,600 --> 06:41:22,800
We have the linear algebra.
9187
06:41:23,000 --> 06:41:25,032
Now PySpark MLlib
9188
06:41:25,032 --> 06:41:27,403
provides utilities
for linear algebra.
9189
06:41:27,403 --> 06:41:29,300
We have collaborative filtering.
9190
06:41:29,400 --> 06:41:30,900
We have classification
9191
06:41:30,900 --> 06:41:34,000
for binary classification
various methods are available
9192
06:41:34,000 --> 06:41:37,700
in the Spark MLlib package, such as
multi-class classification as
9193
06:41:37,700 --> 06:41:40,912
well as regression analysis
in classification. Some
9194
06:41:40,912 --> 06:41:44,067
of the most popular algorithms
used are Naive Bayes, random
9195
06:41:44,067 --> 06:41:45,457
forest, decision tree
9196
06:41:45,457 --> 06:41:48,600
and so on. And finally we
have linear regression.
9197
06:41:48,600 --> 06:41:51,300
Now basically, linear regression
comes from the family
9198
06:41:51,300 --> 06:41:54,064
of regression algorithms.
To find relationships
9199
06:41:54,064 --> 06:41:56,812
and dependencies between
variables is the main goal
9200
06:41:56,812 --> 06:41:58,594
of regression. The PySpark
MLlib package also covers
06:41:58,594 --> 06:42:01,400
MLA package also covers
other algorithm classes
9202
06:42:01,400 --> 06:42:02,100
and functions.
9203
06:42:02,400 --> 06:42:04,591
Let's now try to implement
all the concepts
9204
06:42:04,591 --> 06:42:07,200
which we have learned
in pie spark tutorial session
9205
06:42:07,200 --> 06:42:10,600
now here we are going to use
a heart disease prediction model
9206
06:42:10,600 --> 06:42:13,278
and we are going to predict
Using the decision tree
9207
06:42:13,278 --> 06:42:16,599
with the help of classification
as well as regression.
9208
06:42:16,599 --> 06:42:16,800
Now.
9209
06:42:16,800 --> 06:42:19,600
These all are part
of the MLlib library here.
9210
06:42:19,600 --> 06:42:21,800
Let's see how we
can perform these types
9211
06:42:21,800 --> 06:42:23,300
of functions and queries.
9212
06:42:39,800 --> 06:42:40,600
The first of all
9213
06:42:40,600 --> 06:42:43,700
what we need to do
is initialize the spark context.
9214
06:42:45,100 --> 06:42:48,300
Next we are going
to read the UCI data set
9215
06:42:48,400 --> 06:42:50,500
of the heart disease prediction
9216
06:42:50,600 --> 06:42:52,600
and we are going
to clean the data.
9217
06:42:52,600 --> 06:42:55,700
So let's import the pandas
and the numpy library here.
9218
06:42:56,000 --> 06:42:58,852
Let's create a data frame
as heart disease TF and
9219
06:42:58,852 --> 06:43:00,100
as mentioned earlier,
9220
06:43:00,100 --> 06:43:03,544
we are going to use
the read CSV method here
9221
06:43:03,700 --> 06:43:05,300
and here we don't have a header.
9222
06:43:05,300 --> 06:43:07,500
So we have provided
header as none.
9223
06:43:07,700 --> 06:43:10,800
Now the original data set
contains 303 rows
9224
06:43:10,800 --> 06:43:12,100
and 14 columns.
9225
06:43:12,600 --> 06:43:15,800
Now the categories
of diagnosis of heart disease
9226
06:43:15,900 --> 06:43:17,000
that we are predicting:
9227
06:43:17,300 --> 06:43:22,400
the value 0 is for less than 50%
diameter narrowing, and the value
9228
06:43:22,400 --> 06:43:24,900
1 which we are giving
is for the values
9229
06:43:24,900 --> 06:43:27,500
which have more than 50%
diameter narrowing.
9230
06:43:28,700 --> 06:43:31,623
So here we are using
the numpy library.
9231
06:43:32,700 --> 06:43:35,921
These are particularly
old methods which is showing
9232
06:43:35,921 --> 06:43:39,400
the deprecated warning
but no issues it will work fine.
9233
06:43:40,900 --> 06:43:42,500
So as you can see here,
9234
06:43:42,500 --> 06:43:45,300
we have the categories
of diagnosis of heart disease
9235
06:43:45,300 --> 06:43:48,100
that we are predicting
the value 0 is for less than 50%
9236
06:43:48,100 --> 06:43:50,000
and value 1 is for greater than 50%.
9237
06:43:50,400 --> 06:43:53,014
So what we did here
was clean the rows
9238
06:43:53,014 --> 06:43:57,500
which have the question mark
or which have the empty spaces.
9239
06:43:58,700 --> 06:44:00,900
Now to get a look
at the data set here.
9240
06:44:00,900 --> 06:44:02,200
Now, you can see here.
9241
06:44:02,200 --> 06:44:06,086
We have zero at many places
instead of the question mark
9242
06:44:06,086 --> 06:44:07,500
which we had earlier
9243
06:44:08,600 --> 06:44:11,300
and now we are saving
it to a txt file.
9244
06:44:12,000 --> 06:44:14,200
And you can see here,
after dropping the rows
9245
06:44:14,200 --> 06:44:15,494
with any empty values.
9246
06:44:15,494 --> 06:44:18,000
we have 297 rows
and 14 columns.
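A sketch of the kind of cleaning narrated above; the file name "processed.cleveland.data" and the exact replacement strategy are assumptions based on the standard UCI distribution.

```python
import pandas as pd
import numpy as np

heart_disease_df = pd.read_csv("processed.cleveland.data", header=None)

# Collapse the diagnosis column (0-4) to 0/1: 0 = <50% narrowing, 1 = >50%
heart_disease_df[13] = np.where(heart_disease_df[13] > 0, 1, 0)

# Replace '?' placeholders with NaN, drop those rows (303 -> 297), and save
heart_disease_df = heart_disease_df.replace("?", np.nan).dropna()
heart_disease_df.to_csv("heart_disease_clean.txt", header=False, index=False)
```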
9247
06:44:18,300 --> 06:44:20,800
But this is what the new
clean data set looks
9248
06:44:20,800 --> 06:44:24,400
like. Now we are importing
the MLlib library
9249
06:44:24,400 --> 06:44:26,500
and the regression here now here
9250
06:44:26,500 --> 06:44:29,077
what we are going to do
is create a label point,
9251
06:44:29,077 --> 06:44:31,900
which is a local Vector
associated with a label
9252
06:44:31,900 --> 06:44:33,100
or a response.
9253
06:44:33,100 --> 06:44:36,600
So for that we need to import
the mllib.regression module.
9254
06:44:37,800 --> 06:44:39,600
So for that we are
taking the text file
9255
06:44:39,600 --> 06:44:43,000
which we just created now
without the missing values.
9256
06:44:43,000 --> 06:44:43,665
Now next.
9257
06:44:43,665 --> 06:44:47,678
What we are going to do is
parse the MLlib data line by line
9258
06:44:47,678 --> 06:44:49,900
into the MLlib LabeledPoint object
9259
06:44:49,900 --> 06:44:51,671
and we are going
to convert the -
9260
06:44:51,671 --> 06:44:53,000
1 labels to 0 now.
9261
06:44:53,000 --> 06:44:56,200
Let's have a look after parsing
a few of these lines.
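A hedged sketch of parsing the cleaned file into LabeledPoint objects; the file name and the label being the last column are assumptions.

```python
from pyspark.mllib.regression import LabeledPoint

def parse_line(line):
    values = [float(x) for x in line.split(",")]
    label = 0.0 if values[-1] <= 0 else 1.0   # fold -1/0 style labels into 0
    return LabeledPoint(label, values[:-1])

parsed_data = sc.textFile("heart_disease_clean.txt").map(parse_line)
print(parsed_data.take(2))
```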
9262
06:44:57,800 --> 06:45:00,200
Okay, we have the labels 0.0 and 1.0.
9263
06:45:00,600 --> 06:45:01,700
That's cool.
9264
06:45:01,700 --> 06:45:04,700
Now next what we are going to do
is perform classification using
9265
06:45:04,700 --> 06:45:05,800
the decision tree.
9266
06:45:05,800 --> 06:45:09,300
So for that we need to import
the pyspark.mllib.tree module.
9267
06:45:09,600 --> 06:45:13,200
So next what we have to do is
split the data into the training
9268
06:45:13,200 --> 06:45:14,300
and testing data
9269
06:45:14,300 --> 06:45:18,500
and we split the data here
into the standard 70:30 ratio,
9270
06:45:18,600 --> 06:45:20,672
70 being the training data set
9271
06:45:20,672 --> 06:45:24,541
and the 30% being the testing
data set next what we do is
9272
06:45:24,541 --> 06:45:26,200
that we train the model.
9273
06:45:26,200 --> 06:45:28,600
which we created here
using the training set.
9274
06:45:29,100 --> 06:45:31,100
We have created
a training model with DecisionTree
9275
06:45:31,100 --> 06:45:32,400
dot trainClassifier.
9276
06:45:32,400 --> 06:45:34,400
We have used
the training data, the number
9277
06:45:34,400 --> 06:45:36,947
of classes, the categorical
features,
9278
06:45:36,947 --> 06:45:38,104
and we have given the
9279
06:45:38,104 --> 06:45:40,600
maximum depth to which
we are classifying.
9280
06:45:40,600 --> 06:45:42,000
which is 3. Next,
9281
06:45:42,000 --> 06:45:45,505
what we are going to do is
evaluate the model based
9282
06:45:45,505 --> 06:45:49,000
on the test data set now
and evaluate the error.
9283
06:45:49,300 --> 06:45:50,800
So here we are creating
9284
06:45:50,800 --> 06:45:53,211
predictions and we
are using the test data
9285
06:45:53,211 --> 06:45:55,800
to get the predictions
through the model
9286
06:45:55,800 --> 06:45:58,200
which we built, and we
are also going to find
9287
06:45:58,200 --> 06:45:59,500
the test errors here.
9288
06:45:59,700 --> 06:46:00,900
So as you can see here,
9289
06:46:00,900 --> 06:46:04,507
the test error is
0.2297. We
9290
06:46:04,507 --> 06:46:08,200
have created a classification
decision tree model
9291
06:46:08,200 --> 06:46:11,100
in which, for example, feature
12 is compared with 3 and the value
9292
06:46:11,100 --> 06:46:13,225
of feature
0 with 54.
9293
06:46:13,225 --> 06:46:16,014
So as you can see
our model is pretty good.
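A sketch of this classification step with the standard MLlib DecisionTree API; maxDepth=3 and the 70:30 split follow the narration, numClasses=2 matches the 0/1 labels, and the rest are defaults.

```python
from pyspark.mllib.tree import DecisionTree

train_data, test_data = parsed_data.randomSplit([0.7, 0.3])

model = DecisionTree.trainClassifier(train_data, numClasses=2,
                                     categoricalFeaturesInfo={}, maxDepth=3)

predictions = model.predict(test_data.map(lambda lp: lp.features))
labels_and_preds = test_data.map(lambda lp: lp.label).zip(predictions)
test_err = labels_and_preds.filter(lambda lp: lp[0] != lp[1]).count() \
    / float(test_data.count())
print("Test error = %s" % test_err)
print(model.toDebugString())   # the learned classification tree
```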
9294
06:46:16,014 --> 06:46:19,700
So now next we'll use regression
for the same purposes.
9295
06:46:19,700 --> 06:46:22,300
So let's perform the regression
using decision tree.
9296
06:46:22,500 --> 06:46:24,500
So as you can see
we have the train model
9297
06:46:24,500 --> 06:46:26,400
and we are using
DecisionTree dot
9298
06:46:26,400 --> 06:46:29,460
trainRegressor using
the training data the same
9299
06:46:29,460 --> 06:46:33,200
which we created using the
decision tree model up there.
9300
06:46:33,200 --> 06:46:34,811
There we used classification;
9301
06:46:34,811 --> 06:46:37,440
now we are using
regression. Now similarly,
9302
06:46:37,440 --> 06:46:38,921
We are going to evaluate
9303
06:46:38,921 --> 06:46:42,500
our model using our test data
set and find that test errors
9304
06:46:42,500 --> 06:46:45,600
which is the mean squared error
here for regression.
9305
06:46:45,600 --> 06:46:48,200
So let's have a look
at the mean square error here.
9306
06:46:48,200 --> 06:46:50,584
The mean square error is 0.168.
9307
06:46:50,800 --> 06:46:52,100
That is good.
9308
06:46:52,100 --> 06:46:53,318
Now finally if we have
9309
06:46:53,318 --> 06:46:55,700
a look at the Learned
regression tree model.
9310
06:46:56,800 --> 06:47:00,300
You can see we have created
the regression tree model
9311
06:47:00,300 --> 06:47:02,800
till the depth
of 3 with 15 nodes.
9312
06:47:02,800 --> 06:47:04,577
And here we have
all the features
9313
06:47:04,577 --> 06:47:06,300
and classification of the tree.
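A comparable sketch for the regression step; the parameters mirror the classification example and are assumptions where the narration is silent.

```python
from pyspark.mllib.tree import DecisionTree

reg_model = DecisionTree.trainRegressor(train_data,
                                        categoricalFeaturesInfo={}, maxDepth=3)

reg_predictions = reg_model.predict(test_data.map(lambda lp: lp.features))
labels_and_preds = test_data.map(lambda lp: lp.label).zip(reg_predictions)
mse = labels_and_preds.map(lambda lp: (lp[0] - lp[1]) ** 2).mean()
print("Mean squared error = %s" % mse)
print(reg_model.toDebugString())   # the learned regression tree
```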
9314
06:47:11,000 --> 06:47:11,675
Hello folks.
9315
06:47:11,675 --> 06:47:13,700
Welcome to Spark
interview questions.
9316
06:47:13,800 --> 06:47:16,949
The session has been planned
collectively to have commonly
9317
06:47:16,949 --> 06:47:19,988
asked interview questions related
to the Spark technology
9318
06:47:19,988 --> 06:47:22,400
and the general answers,
and the expectation
9319
06:47:22,400 --> 06:47:25,594
is that you are already aware
of this particular technology.
9320
06:47:25,594 --> 06:47:29,200
To some extent and in general
the common questions being asked
9321
06:47:29,200 --> 06:47:31,500
as well as I will give
an introduction to the technology
9322
06:47:31,500 --> 06:47:33,600
as well. So let's get this started.
9323
06:47:33,600 --> 06:47:36,023
So the agenda for
this particular session is
9324
06:47:36,023 --> 06:47:38,197
we are going to cover
the basic questions
9325
06:47:38,197 --> 06:47:41,138
and questions related
to the Spark core technologies.
9326
06:47:41,138 --> 06:47:42,400
That's when I say spark
9327
06:47:42,400 --> 06:47:44,900
core, that's going to be
the base, and on top
9328
06:47:44,900 --> 06:47:48,075
of Spark core we have
four important components
9329
06:47:48,075 --> 06:47:50,669
which work on it, that
is Streaming, GraphX,
9330
06:47:50,669 --> 06:47:53,100
MLlib and SQL.
all these components
9331
06:47:53,100 --> 06:47:57,500
have been created to satisfy
a particular requirement. Again, we'll have interaction
9332
06:47:57,500 --> 06:47:59,495
with these Technologies and get
9333
06:47:59,495 --> 06:48:02,200
into the commonly
asked interview questions
9334
06:48:02,300 --> 06:48:04,500
and the questions are also
framed in such a way.
9335
06:48:04,500 --> 06:48:07,200
It covers the spectrum
of the doubts as well
9336
06:48:07,200 --> 06:48:10,600
as the features available
within that specific technology.
9337
06:48:10,600 --> 06:48:12,512
So let's take the first question
9338
06:48:12,512 --> 06:48:15,800
and look into the answer like
how commonly this covered.
9339
06:48:15,800 --> 06:48:19,800
What is Apache Spark? Spark,
it's with the Apache Foundation now;
9340
06:48:20,000 --> 06:48:21,000
it's open source.
9341
06:48:21,000 --> 06:48:22,809
It's a cluster
Computing framework
9342
06:48:22,809 --> 06:48:24,280
for real-time processing.
9343
06:48:24,280 --> 06:48:25,750
So three main keywords over.
9344
06:48:25,750 --> 06:48:28,151
here: Apache Spark is
an open source project,
9345
06:48:28,151 --> 06:48:29,856
It's used for cluster Computing.
9346
06:48:29,856 --> 06:48:33,272
And for a memory processing
along with real-time processing.
9347
06:48:33,272 --> 06:48:35,485
It's going to support
in memory Computing.
9348
06:48:35,485 --> 06:48:36,672
So there are lots of projects
9349
06:48:36,672 --> 06:48:38,400
which support cluster computing;
9350
06:48:38,400 --> 06:48:42,100
along with that spark
differentiates itself by doing
9351
06:48:42,100 --> 06:48:43,839
the in-memory Computing.
9352
06:48:43,839 --> 06:48:46,231
It has a very active
community, and out
9353
06:48:46,231 --> 06:48:50,000
of the Hadoop ecosystem
technologies, Apache Spark is
9354
06:48:50,000 --> 06:48:51,500
very active; multiple releases
9355
06:48:51,500 --> 06:48:52,800
We got last year.
9356
06:48:52,800 --> 06:48:56,750
It's a very active project
among them. Basically,
9357
06:48:56,750 --> 06:49:00,072
it's a framework which can support
in memory Computing
9358
06:49:00,072 --> 06:49:04,100
and cluster Computing and you
may face this specific question
9359
06:49:04,100 --> 06:49:05,700
how spark is different
9360
06:49:05,700 --> 06:49:08,085
than MapReduce, or
how you can compare it
9361
06:49:08,085 --> 06:49:11,400
with MapReduce. MapReduce
is the processing paradigm
9362
06:49:11,400 --> 06:49:12,900
within the Hadoop ecosystem
9363
06:49:12,900 --> 06:49:14,400
and within Hadoop ecosystem.
9364
06:49:14,400 --> 06:49:18,700
We have hdfs Hadoop distributed
file system mapreduce going
9365
06:49:18,700 --> 06:49:23,300
to support distributed computing
and how spark is different.
9366
06:49:23,300 --> 06:49:25,900
So how can we compare
Spark with MapReduce?
9367
06:49:25,900 --> 06:49:28,907
Mapreduce in a way
this comparison going
9368
06:49:28,907 --> 06:49:32,400
to help us to understand
the technology better.
9369
06:49:32,400 --> 06:49:33,100
But definitely
9370
06:49:33,100 --> 06:49:36,600
like we cannot compare these two;
they are two different methodologies
9371
06:49:36,600 --> 06:49:40,200
by which it's going to work
spark is very simple to program
9372
06:49:40,200 --> 06:49:42,700
but in MapReduce there
is no abstraction,
9373
06:49:42,700 --> 06:49:44,118
in the sense that all
9374
06:49:44,118 --> 06:49:47,900
the implementations we have
to provide and interactivity.
9375
06:49:47,900 --> 06:49:52,200
Spark has an interactive mode to
work with; in MapReduce,
9376
06:49:52,200 --> 06:49:53,800
there is no interactive mode.
9377
06:49:53,800 --> 06:49:55,900
There are some
components like Apache
9378
06:49:55,900 --> 06:49:56,800
Pig and Hive
9379
06:49:56,800 --> 06:50:00,400
which facilitate us to do
the interactive Computing
9380
06:50:00,400 --> 06:50:02,145
or interactive programming
9381
06:50:02,145 --> 06:50:05,100
and Spark supports
real-time stream processing
9382
06:50:05,100 --> 06:50:07,700
and to precisely
say, within Spark
9383
06:50:07,700 --> 06:50:11,000
the stream processing is called
a near real-time processing.
9384
06:50:11,000 --> 06:50:13,600
There's nothing in the world
is Real Time processing.
9385
06:50:13,600 --> 06:50:15,100
It's near real-time processing.
9386
06:50:15,100 --> 06:50:18,200
It's going to do the processing
in micro batches.
9387
06:50:18,200 --> 06:50:19,200
I'll cover in detail
9388
06:50:19,200 --> 06:50:21,400
when we are moving
onto the streaming concept
9389
06:50:21,400 --> 06:50:22,600
and you're going to do
9390
06:50:22,600 --> 06:50:25,700
the batch processing on
the historical data in MapReduce.
9391
06:50:25,700 --> 06:50:28,300
So when I say stream
processing I will get the data
9392
06:50:28,300 --> 06:50:31,025
that is getting processed
in real time and do
9393
06:50:31,025 --> 06:50:33,849
the processing and get
the result either store it
9394
06:50:33,849 --> 06:50:35,772
on publish to publish Community.
9395
06:50:35,772 --> 06:50:37,697
We will be doing that. Latency-
9396
06:50:37,697 --> 06:50:40,149
wise, MapReduce will have
very high latency
9397
06:50:40,149 --> 06:50:42,915
because it has to read
the data from hard disk,
9398
06:50:42,915 --> 06:50:45,200
but spark it will have
very low latency
9399
06:50:45,200 --> 06:50:47,200
because it can reprocess
9400
06:50:47,200 --> 06:50:50,500
or reuse the data
already cached in memory,
9401
06:50:50,500 --> 06:50:53,786
but there is a small catch
over here in spark first time
9402
06:50:53,786 --> 06:50:56,600
when the data gets loaded it
has to read it
9403
06:50:56,600 --> 06:50:59,100
from the hard disk
same as mapreduce.
9404
06:50:59,100 --> 06:51:01,600
So once it is read, it
will be there in the memory.
9405
06:51:01,692 --> 06:51:03,000
So spark is good.
9406
06:51:03,000 --> 06:51:05,100
whenever we need to do iterative
computing. So with Spark, whenever
06:51:05,100 --> 06:51:08,900
a Computing so spark whenever
you do iterative computing again
9408
06:51:08,900 --> 06:51:11,400
and again to the processing
on the same data,
9409
06:51:11,400 --> 06:51:14,200
especially in machine learning
deep learning all we will be
9410
06:51:14,200 --> 06:51:17,900
using the iterative Computing
Spark performs much better.
9411
06:51:17,900 --> 06:51:19,805
You will see
a drastic performance
9412
06:51:19,805 --> 06:51:22,651
Improvement hundred times
faster than mapreduce.
9413
06:51:22,651 --> 06:51:25,800
But if it is one time processing
and fire-and-forget,
9414
06:51:25,800 --> 06:51:28,805
that type
of processing, Spark's latency
9415
06:51:28,805 --> 06:51:30,600
maybe the same latency,
9416
06:51:30,600 --> 06:51:32,699
you will be getting
as with MapReduce, maybe
9417
06:51:32,699 --> 06:51:35,900
with some improvements because
of the building blocks of Spark.
9418
06:51:35,900 --> 06:51:38,800
That's the idea; you may get
some additional Advantage.
9419
06:51:38,800 --> 06:51:43,000
So those are the key features or
the key comparison factor
9420
06:51:43,300 --> 06:51:45,200
of Spark and MapReduce.
9421
06:51:45,800 --> 06:51:50,100
Now, let's get on to the key
features, the key features of Spark.
9422
06:51:50,200 --> 06:51:52,200
We discussed over
the Speed and Performance.
9423
06:51:52,200 --> 06:51:54,200
It's going to use
the in-memory Computing
9424
06:51:54,200 --> 06:51:55,559
so Speed and Performance.
9425
06:51:55,559 --> 06:51:57,300
Place it's going to much better.
9426
06:51:57,300 --> 06:52:00,900
when we do iterative computing.
And polyglot, in the sense that
9427
06:52:00,900 --> 06:52:03,810
the programming language
to be used with Spark
9428
06:52:03,810 --> 06:52:06,700
It can be any of these languages
can be python.
9429
06:52:06,700 --> 06:52:08,400
Java, R or Scala.
9430
06:52:08,400 --> 06:52:08,570
Mm.
9431
06:52:08,570 --> 06:52:11,300
We can do programming
with any of these languages
9432
06:52:11,300 --> 06:52:14,200
and data formats
to give us a input.
9433
06:52:14,200 --> 06:52:17,172
We can give any data formats
like JSON, Parquet;
9434
06:52:17,172 --> 06:52:18,900
with a data formats began
9435
06:52:18,900 --> 06:52:21,888
if there is a input
and the key selling point
9436
06:52:21,888 --> 06:52:24,400
with Spark is its
lazy evaluation, in the
9437
06:52:24,400 --> 06:52:25,575
since it's going
9438
06:52:25,575 --> 06:52:29,100
to calculate the DAG cycle,
directed acyclic graph
9439
06:52:29,100 --> 06:52:32,700
DAG. Because of the DAG,
it's going to calculate
9440
06:52:32,700 --> 06:52:35,300
what all steps needs
to be executed to achieve
9441
06:52:35,300 --> 06:52:36,400
the final result.
9442
06:52:36,400 --> 06:52:38,969
So we need to give all
the steps as well as
9443
06:52:38,969 --> 06:52:40,519
what final result I want.
9444
06:52:40,519 --> 06:52:42,983
It's going to calculate
the optimal cycle
9445
06:52:42,983 --> 06:52:44,400
on optimal calculation.
9446
06:52:44,400 --> 06:52:46,400
What else tips needs
to be calculated
9447
06:52:46,400 --> 06:52:49,100
or what all steps need
to be executed; only those steps
9448
06:52:49,100 --> 06:52:50,500
it will be executing it.
9449
06:52:50,500 --> 06:52:52,900
So basically it's
a lazy execution only
9450
06:52:52,900 --> 06:52:54,450
if the results needs
to be processed,
9451
06:52:54,450 --> 06:52:55,800
it will be processing that.
9452
06:52:55,800 --> 06:52:58,623
because of it. And about
real-time computing:
9453
06:52:58,623 --> 06:53:00,200
It's through spark streaming
9454
06:53:00,200 --> 06:53:02,200
that is a component
called spark streaming
9455
06:53:02,200 --> 06:53:04,700
which supports real-time
Computing and it gels
9456
06:53:04,700 --> 06:53:07,115
with the Hadoop ecosystem very well.
9457
06:53:07,115 --> 06:53:09,500
It can run on top of Hadoop YARN
9458
06:53:09,500 --> 06:53:12,562
or it can Leverage The hdfs
to do the processing.
9459
06:53:12,562 --> 06:53:16,300
So when it leverages the hdfs
the Hadoop cluster container
9460
06:53:16,300 --> 06:53:19,400
can be used to do
the distributed computing
9461
06:53:19,400 --> 06:53:23,707
as well as it can leverage
the resource manager to manage
9462
06:53:23,707 --> 06:53:25,400
the resources. So Spark
9463
06:53:25,400 --> 06:53:28,426
can gel with HDFS very
well as well as it can leverage
9464
06:53:28,426 --> 06:53:29,642
the resource manager
9465
06:53:29,642 --> 06:53:32,500
to share the resources
as well as data locality.
9466
06:53:32,500 --> 06:53:34,699
It can achieve data locality:
9467
06:53:34,699 --> 06:53:36,900
it can do the processing where
9468
06:53:36,900 --> 06:53:41,200
the data is located
within HDFS. And Spark has a fleet
9469
06:53:41,200 --> 06:53:43,700
of machine learning
algorithms already implemented
9470
06:53:43,700 --> 06:53:46,100
right from clustering
classification regression.
9471
06:53:46,100 --> 06:53:48,238
All this logic
already implemented
9472
06:53:48,238 --> 06:53:49,600
and machine learning.
9473
06:53:49,600 --> 06:53:52,400
It's achieved using
MLlib within Spark,
9474
06:53:52,400 --> 06:53:54,800
and there is a component
called GraphX
9475
06:53:54,800 --> 06:53:58,600
which supports graphs. We
can solve the problems using
9476
06:53:58,600 --> 06:54:02,600
graph theory using the component
GraphX within Spark.
9477
06:54:02,700 --> 06:54:04,700
So these are the things
we can consider as
9478
06:54:04,700 --> 06:54:06,700
the key features of spark.
9479
06:54:06,700 --> 06:54:09,400
So when you discuss
with the installation
9480
06:54:09,400 --> 06:54:10,300
of the spark,
9481
06:54:10,300 --> 06:54:13,581
you may come across this question
on what is YARN, and do you
9482
06:54:13,581 --> 06:54:16,765
need to install spark
on all nodes of the YARN cluster?
9483
06:54:16,765 --> 06:54:19,700
So yarn is nothing
but Yet Another Resource Negotiator.
9484
06:54:19,700 --> 06:54:22,500
That's the resource manager
within the Hadoop ecosystem.
9485
06:54:22,500 --> 06:54:25,529
So that's going to provide the
resource management platform.
9486
06:54:25,529 --> 06:54:28,200
YARN is going to provide
the resource management platform
9487
06:54:28,200 --> 06:54:29,500
across all the Clusters
9488
06:54:29,600 --> 06:54:33,200
and Spark It's going
to provide the data processing.
9489
06:54:33,200 --> 06:54:35,300
So wherever there is
a host being used,
9490
06:54:35,300 --> 06:54:38,049
that location's resources will be
used to do the data processing.
9491
06:54:38,049 --> 06:54:39,056
And of course, yes,
9492
06:54:39,056 --> 06:54:41,600
we need to have spark
installed on all the nodes.
9493
06:54:41,800 --> 06:54:43,900
where Spark executors are located.
9494
06:54:43,900 --> 06:54:47,100
That's basically we need
those libraries an additional
9495
06:54:47,100 --> 06:54:50,200
to the installation of spark
and all the worker nodes.
9496
06:54:50,200 --> 06:54:52,106
We need to increase
the ram capacity
9497
06:54:52,106 --> 06:54:53,283
on the worker machines
as well, as Spark is going
06:54:53,283 --> 06:54:55,800
as well as far going
to consume huge amounts of
9499
06:54:56,100 --> 06:55:00,500
memory to do the processing. It
will not do the mapreduce way
9500
06:55:00,500 --> 06:55:01,600
of working internally.
9501
06:55:01,600 --> 06:55:04,191
It's going to generate
the DAG cycle and do
9502
06:55:04,191 --> 06:55:06,000
the processing on top of YARN.
9503
06:55:06,000 --> 06:55:09,900
So YARN, at the high level, it's
like resource manager
9504
06:55:09,900 --> 06:55:13,100
or like an operating system
for the distributed computing.
9505
06:55:13,100 --> 06:55:15,500
It's going to coordinate
all the resource management
9506
06:55:15,500 --> 06:55:17,900
across the fleet
of servers on top of it.
9507
06:55:17,900 --> 06:55:20,100
I can have multiple components
9508
06:55:20,100 --> 06:55:25,100
like Spark or Giraph;
Spark especially, it's going
9509
06:55:25,100 --> 06:55:27,800
to help us to achieve
in memory Computing.
9510
06:55:27,800 --> 06:55:30,900
So Spark on YARN: YARN is nothing
but a resource manager
9511
06:55:30,900 --> 06:55:33,600
to manage the resource
across the cluster on top of it.
9512
06:55:33,600 --> 06:55:35,470
we can have Spark. And yes,
9513
06:55:35,470 --> 06:55:37,700
we need to have spark installed
9514
06:55:37,700 --> 06:55:41,800
on all the nodes where
the Spark YARN cluster is used,
9515
06:55:41,800 --> 06:55:43,581
and also additional to that.
9516
06:55:43,581 --> 06:55:45,809
We need to have
the memory increased
9517
06:55:45,809 --> 06:55:47,400
in all the worker nodes.
9518
06:55:47,600 --> 06:55:48,870
The next question goes
9519
06:55:48,870 --> 06:55:51,400
like this: what file
systems does Spark support?
9520
06:55:52,300 --> 06:55:55,779
What is a file system? When
we work on an individual system,
9521
06:55:55,779 --> 06:55:58,100
We will be having
a file system to work
9522
06:55:58,100 --> 06:56:01,000
within that particular
operating system. In a
9523
06:56:01,000 --> 06:56:04,900
redistributed cluster or in
the distributed architecture.
9524
06:56:04,900 --> 06:56:06,744
We need a file system with which
9525
06:56:06,744 --> 06:56:09,800
where we can store the data
in a distribute mechanism.
9526
06:56:09,800 --> 06:56:12,900
Hadoop comes with
the file system called hdfs.
9527
06:56:13,100 --> 06:56:15,800
It's called Hadoop
distributed file system
9528
06:56:15,800 --> 06:56:19,131
by data gets distributed
across multiple systems
9529
06:56:19,131 --> 06:56:21,400
and it will be coordinated by two
9530
06:56:21,400 --> 06:56:24,500
Different type of components
called name node and data node
9531
06:56:24,500 --> 06:56:27,800
and Spark it can use
this hdfs directly.
9532
06:56:27,800 --> 06:56:30,900
So you can have any files
in hdfs and start using it
9533
06:56:30,900 --> 06:56:34,800
within the spark ecosystem
and it gives another advantage
9534
06:56:34,800 --> 06:56:35,900
of data locality
9535
06:56:35,900 --> 06:56:38,415
when it does the distributed
processing wherever
9536
06:56:38,415 --> 06:56:39,700
the data is distributed.
9537
06:56:39,700 --> 06:56:42,400
The processing could be done
locally to that particular
9538
06:56:42,400 --> 06:56:44,300
machine where the data is located,
9539
06:56:44,300 --> 06:56:47,223
and to start with as
a standalone mode.
9540
06:56:47,223 --> 06:56:49,500
You can use the local
file system aspect.
9541
06:56:49,600 --> 06:56:51,508
So this could be used especially
9542
06:56:51,508 --> 06:56:53,818
when we are doing
the development or any
9543
06:56:53,818 --> 06:56:56,390
POC, you can use
the local file system
9544
06:56:56,390 --> 06:56:59,500
and Amazon Cloud provides
another file system called.
9545
06:56:59,500 --> 06:57:02,119
S3, Simple
Storage Service; we call
9546
06:57:02,119 --> 06:57:03,100
that is the S3.
9547
06:57:03,100 --> 06:57:04,998
It's an object storage service.
9548
06:57:04,998 --> 06:57:06,700
This can also be leveraged
9549
06:57:06,700 --> 06:57:09,238
or used within Spark
for the storage
9550
06:57:09,800 --> 06:57:11,100
and lot other file system.
9551
06:57:11,100 --> 06:57:14,700
Also, it supports there are
some file systems like Alluxio,
9552
06:57:14,700 --> 06:57:17,700
which provides
in memory storage
9553
06:57:17,700 --> 06:57:20,800
so we can leverage that
particular file system as well.
9554
06:57:21,100 --> 06:57:22,796
So we have seen
all the features.
9555
06:57:22,796 --> 06:57:25,580
What are the functionalities
available within Spark?
9556
06:57:25,580 --> 06:57:27,600
We're going to look
at the limitations
9557
06:57:27,600 --> 06:57:28,800
of using spark.
9558
06:57:28,800 --> 06:57:30,252
Of course every component
9559
06:57:30,252 --> 06:57:33,000
when it comes with
a huge power and Advantage.
9560
06:57:33,000 --> 06:57:35,200
It will have its own
limitations as well.
9561
06:57:35,300 --> 06:57:38,900
This question illustrates
some limitations of using
9562
06:57:38,900 --> 06:57:41,900
Spark. Spark utilizes
more storage space
9563
06:57:41,900 --> 06:57:43,400
compared to Hadoop
9564
06:57:43,400 --> 06:57:44,715
and it comes
to the installation.
9565
06:57:44,715 --> 06:57:47,600
It's going to consume more space
but in the Big Data world,
9566
06:57:47,600 --> 06:57:49,500
that's not a
very huge constraint
9567
06:57:49,500 --> 06:57:52,206
because storage cost is
not very high
9568
06:57:52,206 --> 06:57:55,504
and our big data space and
developer needs to be careful
9569
06:57:55,504 --> 06:57:58,275
while running the apps
in Spark, the reason
9570
06:57:58,275 --> 06:58:00,300
because it uses
in-memory Computing.
9571
06:58:00,400 --> 06:58:02,870
Of course, it handles
the memory very well.
9572
06:58:02,870 --> 06:58:05,400
But if you try to load
a huge amount of data
9573
06:58:05,400 --> 06:58:08,700
in the distributed environment
and if you try to do is join
9574
06:58:08,700 --> 06:58:09,903
when you try to do join
9575
06:58:09,903 --> 06:58:13,491
within the distributed world the
data going to get transferred
9576
06:58:13,491 --> 06:58:14,700
over the network. The network
9577
06:58:14,700 --> 06:58:18,100
is really a costly
resource So the plan
9578
06:58:18,200 --> 06:58:20,800
or design should be such
a way to reduce or minimize.
9579
06:58:20,800 --> 06:58:23,500
As the data transferred
over the network
9580
06:58:23,500 --> 06:58:27,103
and however the way
possible with all possible means
9581
06:58:27,103 --> 06:58:30,000
we should facilitate
distribution of data
9582
06:58:30,000 --> 06:58:32,200
over multiple machines. The more
9583
06:58:32,200 --> 06:58:34,600
we distribute the more
parallelism we can achieve
9584
06:58:34,600 --> 06:58:38,500
and the more results we can get
and cost efficiency.
9585
06:58:38,500 --> 06:58:40,700
If you try to compare the cost
9586
06:58:40,700 --> 06:58:42,800
how much cost involved
9587
06:58:42,800 --> 06:58:45,700
to do a particular
processing take any unit
9588
06:58:45,700 --> 06:58:48,545
in terms of processing
1 GB of data with say
9589
06:58:48,545 --> 06:58:50,200
like iterative processing;
9590
06:58:50,200 --> 06:58:53,800
if you compare cost-wise, in-memory
computing is always considered costlier
9591
06:58:53,800 --> 06:58:57,088
because memory is
relatively costlier
9592
06:58:57,088 --> 06:58:58,200
than the storage
9593
06:58:58,400 --> 06:59:00,000
so that may act
like a bottleneck
9594
06:59:00,000 --> 06:59:01,400
and we cannot increase
9595
06:59:01,400 --> 06:59:05,200
the memory capacity of
the machine beyond some limit.
9596
06:59:05,900 --> 06:59:07,500
So we have to grow horizontally.
9597
06:59:07,800 --> 06:59:10,042
So when we have
the data distributed
9598
06:59:10,042 --> 06:59:11,900
in memory across the cluster,
9599
06:59:12,000 --> 06:59:13,337
of course the network transfer
9600
06:59:13,337 --> 06:59:15,300
all those bottlenecks
will come into picture.
9601
06:59:15,300 --> 06:59:17,400
So we have to strike
the right balance
9602
06:59:17,400 --> 06:59:20,700
which will help us to achieve
the in-memory computing.
9603
06:59:20,700 --> 06:59:22,775
Whatever in-memory
computing we require, it
9604
06:59:22,775 --> 06:59:24,000
will help us to achieve
9605
06:59:24,000 --> 06:59:25,757
And it consumes a huge amount
9606
06:59:25,757 --> 06:59:28,400
of memory for data processing
compared to Hadoop.
9607
06:59:28,600 --> 06:59:30,600
and Spark it performs
9608
06:59:30,600 --> 06:59:33,800
better when we use it for
iterative computing,
9609
06:59:33,800 --> 06:59:36,700
because, for both Spark
and the other Technologies.
9610
06:59:36,700 --> 06:59:37,699
It has to read data
9611
06:59:37,699 --> 06:59:39,700
for the first time
from the hard disk
9612
06:59:39,700 --> 06:59:43,300
from other data source and Spark
performance is really better
9613
06:59:43,300 --> 06:59:46,114
when it reads the data
once and does the processing
9614
06:59:46,114 --> 06:59:48,500
when the data is available
in the cache,
9615
06:59:48,723 --> 06:59:50,800
of course is the DAC cycle.
9616
06:59:50,800 --> 06:59:53,094
It's going to give
us a lot of advantage
9617
06:59:53,094 --> 06:59:54,400
while doing the processing
9618
06:59:54,400 --> 06:59:56,802
but the in-memory
Computing processing
9619
06:59:56,802 --> 06:59:59,400
that's going to give
us lots of Leverage.
9620
06:59:59,400 --> 07:00:01,605
The next question
list some use cases
9621
07:00:01,605 --> 07:00:04,300
where Spark outperforms
Hadoop in processing.
9622
07:00:04,400 --> 07:00:06,300
The first thing is
the real time processing.
9623
07:00:06,300 --> 07:00:08,629
Hadoop cannot handle
real time processing
9624
07:00:08,629 --> 07:00:10,884
but Spark can handle
real time processing.
9625
07:00:10,884 --> 07:00:13,843
So any data that's coming in
in the Lambda architecture.
9626
07:00:13,843 --> 07:00:15,300
You will have three layers.
9627
07:00:15,300 --> 07:00:17,210
The most of the Big
Data projects will be
9628
07:00:17,210 --> 07:00:18,500
in the Lambda architecture.
9629
07:00:18,500 --> 07:00:21,500
You will have the speed layer,
batch layer and serving layer,
9630
07:00:21,500 --> 07:00:23,900
and the speed layer
whenever the data comes
9631
07:00:23,900 --> 07:00:26,900
in that needs to be processed
stored and handled.
9632
07:00:26,900 --> 07:00:27,975
So in those type
9633
07:00:27,975 --> 07:00:30,800
of real-time processing, Spark
is the best fit.
9634
07:00:30,800 --> 07:00:32,500
Of course, we can
Hadoop ecosystem.
9635
07:00:32,500 --> 07:00:33,837
We have other components
9636
07:00:33,837 --> 07:00:36,400
which does the real-time
processing like storm.
9637
07:00:36,400 --> 07:00:39,000
But when you want to Leverage
The Machine learning
9638
07:00:39,000 --> 07:00:40,500
along with Spark Streaming
9639
07:00:40,500 --> 07:00:43,200
on such computation spark
will be much better.
9640
07:00:43,200 --> 07:00:44,243
So that's why I like
9641
07:00:44,243 --> 07:00:45,621
when you have architecture
9642
07:00:45,621 --> 07:00:47,900
like a Lambda architecture
you want to have
9643
07:00:47,900 --> 07:00:51,100
all three layers: batch layer,
speed layer and serving layer.
9644
07:00:51,100 --> 07:00:54,800
Spark can gel with the speed layer
and serving layer far better
9645
07:00:54,800 --> 07:00:56,800
and it's going to provide
better performance.
9646
07:00:56,800 --> 07:00:59,400
And whenever you do
the iterative processing,
9647
07:00:59,400 --> 07:01:02,400
especially like doing
a machine learning processing,
9648
07:01:02,400 --> 07:01:04,501
we will leverage
iterative computing
9649
07:01:04,501 --> 07:01:06,210
and can perform a hundred times
9650
07:01:06,210 --> 07:01:08,800
faster than Hadoop
the more iterative processing
9651
07:01:08,800 --> 07:01:11,600
that we do the more data
will be read from the memory
9652
07:01:11,600 --> 07:01:14,700
and it's going to get as
much faster performance
9653
07:01:14,700 --> 07:01:16,700
than I did with mapreduce.
9654
07:01:16,700 --> 07:01:20,100
So again, remember whenever you
do the processing only once,
9655
07:01:20,100 --> 07:01:23,000
so you're going to to do
the processing finally bonds
9656
07:01:23,000 --> 07:01:24,900
read process it and deliver.
9657
07:01:24,900 --> 07:01:27,516
the result, then Spark
may not be the best fit;
9658
07:01:27,516 --> 07:01:30,200
that can be done
with a mapreduce itself.
9659
07:01:30,200 --> 07:01:32,773
And there is another component
called Akka. It's
9660
07:01:32,773 --> 07:01:35,600
a messaging system
or message coordination
9661
07:01:35,600 --> 07:01:38,500
system. Spark
internally uses Akka
9662
07:01:38,500 --> 07:01:40,500
for scheduling of any task
9663
07:01:40,500 --> 07:01:43,100
that needs to be assigned
by the master to the worker
9664
07:01:43,700 --> 07:01:45,700
and the follow-up
of that particular task
9665
07:01:45,700 --> 07:01:49,000
by the master basically
asynchronous coordination system
9666
07:01:49,000 --> 07:01:51,000
and that's achieved using akka
9667
07:01:51,400 --> 07:01:55,100
Akka programming internally
is used by Spark;
9668
07:01:55,100 --> 07:01:56,551
as such for the developers.
9669
07:01:56,551 --> 07:01:59,358
we don't need to worry
about Akka programming.
9670
07:01:59,358 --> 07:02:00,900
Of course we can leverage it
9671
07:02:00,900 --> 07:02:04,500
but Akka is used internally
by Spark for scheduling
9672
07:02:04,500 --> 07:02:08,800
and coordination between master
and the worker. And within Spark,
9673
07:02:08,800 --> 07:02:10,700
We have few major components.
9674
07:02:10,700 --> 07:02:13,200
Let's see, what are
the major components
9675
07:02:13,200 --> 07:02:14,500
of the Spark ecosystem.
9676
07:02:14,500 --> 07:02:18,069
Okay, the components
of the Spark ecosystem: Spark comes
9677
07:02:18,069 --> 07:02:19,319
with a core engine.
9678
07:02:19,319 --> 07:02:20,700
So that has the core.
9679
07:02:20,700 --> 07:02:23,570
functionalities of what is required
by Spark.
9680
07:02:23,570 --> 07:02:26,600
All these functionalities
are the building blocks
9681
07:02:26,600 --> 07:02:29,361
of the spark core engine
In Spark
9682
07:02:29,361 --> 07:02:31,300
core, the basic functionalities like
9683
07:02:31,300 --> 07:02:34,600
file interaction, file system
coordination, all that's done
9684
07:02:34,600 --> 07:02:36,400
by the spark core engine
9685
07:02:36,400 --> 07:02:38,432
on top of spark core engine.
9686
07:02:38,432 --> 07:02:40,900
We have a number
of other offerings
9687
07:02:40,900 --> 07:02:44,700
to do machine learning to do
graph Computing to do streaming.
9688
07:02:44,700 --> 07:02:47,000
We have n number
of other components.
9689
07:02:47,000 --> 07:02:48,800
So the majorly used components
9690
07:02:48,800 --> 07:02:51,000
are components
like Spark SQL,
9691
07:02:51,000 --> 07:02:52,037
Spark Streaming,
9692
07:02:52,037 --> 07:02:55,520
MLlib, GraphX
and SparkR at a higher level.
9693
07:02:55,520 --> 07:02:58,400
We will see what are
these components. Spark
9694
07:02:58,400 --> 07:03:02,000
SQL especially is designed
to do the processing
9695
07:03:02,000 --> 07:03:03,729
against structured data,
9696
07:03:03,729 --> 07:03:07,400
so we can write SQL queries
and we can handle
9697
07:03:07,400 --> 07:03:08,854
or we can do the processing.
9698
07:03:08,854 --> 07:03:11,400
So it's going to give us
the interface to interact
9699
07:03:11,400 --> 07:03:12,100
with the data,
9700
07:03:12,300 --> 07:03:15,900
especially structured data,
and the query language
9701
07:03:15,900 --> 07:03:18,700
that we can use
it's more similar to
9702
07:03:18,700 --> 07:03:20,600
what we use within the SQL.
9703
07:03:20,600 --> 07:03:22,700
Well, I can say
99 percent is the same,
9704
07:03:22,700 --> 07:03:25,934
and most of the commonly used
functionalities within the SQL
9705
07:03:25,934 --> 07:03:28,111
have been implemented
within Spark SQL,
9706
07:03:28,111 --> 07:03:31,700
and Spark streaming is going to
support the stream processing.
9707
07:03:31,700 --> 07:03:34,000
That's the offering
available to handle
9708
07:03:34,000 --> 07:03:35,920
the stream processing. And MLlib is
9709
07:03:35,920 --> 07:03:38,900
based the offering
to handle machine learning.
9710
07:03:38,900 --> 07:03:42,700
So the component name
is called MLlib and it has a list
9711
07:03:42,700 --> 07:03:44,300
of components a list
9712
07:03:44,300 --> 07:03:47,300
of machine learning
algorithms already defined
9713
07:03:47,300 --> 07:03:50,700
we can leverage and use any
of those machine learning.
9714
07:03:51,400 --> 07:03:54,944
GraphX, again, it's
a graph processing offering
9715
07:03:54,944 --> 07:03:56,200
within the spark.
9716
07:03:56,200 --> 07:03:59,141
It's going to support us
to achieve graph Computing
9717
07:03:59,141 --> 07:04:02,330
against the data that we have
like pagerank calculation.
9718
07:04:02,330 --> 07:04:04,107
how many connected entities,
9719
07:04:04,107 --> 07:04:07,600
how many triangles all those
going to provide us a meaning
9720
07:04:07,600 --> 07:04:09,300
to that particular data
9721
07:04:09,300 --> 07:04:12,500
And SparkR is the component
that is going to interact
9722
07:04:12,500 --> 07:04:14,371
or help us to leverage
9723
07:04:14,371 --> 07:04:17,856
the R language
within the Spark environment.
9724
07:04:18,100 --> 07:04:20,600
R is a statistical
programming language.
9725
07:04:20,600 --> 07:04:23,170
Each where we can do
statistical Computing,
9726
07:04:23,170 --> 07:04:24,700
within the Spark environment,
9727
07:04:24,700 --> 07:04:28,306
and we can leverage the R language
by using SparkR to get
9728
07:04:28,306 --> 07:04:32,194
that executed within the Spark
environment. In addition to that,
9729
07:04:32,194 --> 07:04:35,675
There are other components
as well like approximative is
9730
07:04:35,675 --> 07:04:39,118
it's called BlinkDB, and other
things as well.
9731
07:04:39,118 --> 07:04:42,541
So these are the majorly used
components within spark.
9732
07:04:42,541 --> 07:04:43,561
So next question.
9733
07:04:43,561 --> 07:04:45,944
How can Spark be used
alongside Hadoop?
9734
07:04:45,944 --> 07:04:49,000
So even when we see that Spark
performs much better, it's
9735
07:04:49,000 --> 07:04:51,000
not a replacement for Hadoop. It's
9736
07:04:51,000 --> 07:04:52,100
Going to coexist
9737
07:04:52,100 --> 07:04:55,488
with Hadoop, right?
So leveraging Spark
9738
07:04:55,488 --> 07:04:56,900
and Hadoop together.
9739
07:04:56,900 --> 07:05:00,000
It's going to help us
to achieve the best result.
9740
07:05:00,000 --> 07:05:00,268
Yes.
9741
07:05:00,268 --> 07:05:04,300
Spark can do in-memory computing
or can handle the speed layer
9742
07:05:04,300 --> 07:05:06,600
and Hadoop comes
with the resource manager
9743
07:05:06,600 --> 07:05:08,500
so we can leverage
the resource manager
9744
07:05:08,500 --> 07:05:10,900
of Hadoop to make Spark work,
9745
07:05:11,000 --> 07:05:13,529
and for a few processing tasks we
don't need to leverage
9746
07:05:13,529 --> 07:05:14,904
The in-memory Computing.
9747
07:05:14,904 --> 07:05:18,500
For example, one time processing
to the processing and forget.
9748
07:05:18,500 --> 07:05:20,773
just store it; we
can use MapReduce there.
9749
07:05:20,773 --> 07:05:24,700
He's so the processing cost
Computing cost will be much less
9750
07:05:24,700 --> 07:05:26,100
compared to Spa
9751
07:05:26,100 --> 07:05:29,400
so we can amalgam eyes and get
strike the right balance
9752
07:05:29,400 --> 07:05:31,700
between the batch processing
and stream processing
9753
07:05:31,700 --> 07:05:34,507
when we have spark
along with Hadoop.
9754
07:05:34,507 --> 07:05:38,100
Let's have some detailed questions
related to Spark core.
9755
07:05:38,100 --> 07:05:39,100
Within Spark core,
9756
07:05:39,100 --> 07:05:41,900
as I mentioned earlier
the core building block
9757
07:05:41,900 --> 07:05:45,600
of Spark core is RDD, resilient
distributed data set.
9758
07:05:45,600 --> 07:05:46,654
It's a virtual.
9759
07:05:46,654 --> 07:05:48,442
It's not a physical entity.
9760
07:05:48,442 --> 07:05:49,900
It's a logical entity.
9761
07:05:49,900 --> 07:05:52,400
You will not see
this RDD existing.
9762
07:05:52,400 --> 07:05:54,700
The existence of the RDD
will come into picture
9763
07:05:54,900 --> 07:05:56,474
when you take some action.
9764
07:05:56,474 --> 07:05:59,200
So this RDD
will be used or referred to
9765
07:05:59,200 --> 07:06:00,800
to create the DAG cycle
9766
07:06:00,943 --> 07:06:05,500
and RDDs will be optimized
to transform from one form
9767
07:06:05,500 --> 07:06:07,264
to another form to make a plan
9768
07:06:07,264 --> 07:06:09,400
how the data set needs
to be transformed
9769
07:06:09,400 --> 07:06:11,500
from one structure
to another structure.
9770
07:06:11,700 --> 07:06:14,817
And finally, when you take some action
against an RDD, the existence
9771
07:06:14,817 --> 07:06:15,924
of the data structure
9772
07:06:15,924 --> 07:06:18,200
or the resultant data
will come into picture
9773
07:06:18,200 --> 07:06:20,500
and that can be stored
in any file system
9774
07:06:20,500 --> 07:06:22,000
whether it's HDFS, S3
9775
07:06:22,000 --> 07:06:24,568
or any other file system
can be stored and
9776
07:06:24,568 --> 07:06:27,900
that RDDs can exist
in a partitioned form, in the sense
9777
07:06:27,900 --> 07:06:30,600
It can get distributed
across multiple systems
9778
07:06:30,600 --> 07:06:33,800
and it's fault tolerant.
And being fault tolerant,
9779
07:06:33,800 --> 07:06:36,494
if any of the RDDs
is lost, or any partition
9780
07:06:36,494 --> 07:06:37,742
of the RDD is lost,
9781
07:06:37,742 --> 07:06:40,700
It can regenerate only
that specific partition
9782
07:06:40,700 --> 07:06:41,700
it can regenerate
9783
07:06:41,900 --> 07:06:43,900
so that's a huge
advantage of RDD.
9784
07:06:43,900 --> 07:06:46,600
So first,
the huge advantage of RDD:
9785
07:06:46,600 --> 07:06:47,900
It's a fault-tolerant
9786
07:06:47,900 --> 07:06:50,600
where it can regenerate
the lost RDDs.
9787
07:06:50,600 --> 07:06:53,606
And it can exist
in a distributed fashion
9788
07:06:53,606 --> 07:06:55,165
and it is immutable, in the
sense that once the RDD is defined,
07:06:55,165 --> 07:06:59,300
since once the RTD is defined on
it cannot be changed.
9790
07:06:59,300 --> 07:07:01,500
The next question is
how do we create rdds
9791
07:07:01,500 --> 07:07:04,500
in Spark? There are two ways we
can create the RDDs. One
9792
07:07:04,664 --> 07:07:09,700
is, using the SparkContext we
can use any of the collections
9793
07:07:09,700 --> 07:07:12,700
that's available within
Scala or in Java, and using
9794
07:07:12,700 --> 07:07:14,000
the parallelize function.
9795
07:07:14,000 --> 07:07:17,049
we can create the RDD,
and it's going to use
9796
07:07:17,049 --> 07:07:20,474
the underlying file
systems distribution mechanism
9797
07:07:20,474 --> 07:07:23,900
if The data is located
in distributed file system,
9798
07:07:23,900 --> 07:07:24,700
like hdfs.
9799
07:07:25,000 --> 07:07:27,154
It will leverage
that and it will make
9800
07:07:27,154 --> 07:07:30,331
those arteries available
in a number of systems.
9801
07:07:30,331 --> 07:07:33,696
So it's going to leverage
and follow the same distribution
9802
07:07:33,696 --> 07:07:34,700
for the RDD as well.
9803
07:07:34,700 --> 07:07:37,200
Or we can create the RDD
by loading the data
9804
07:07:37,200 --> 07:07:39,835
from external sources
as well, like HBase,
9805
07:07:39,835 --> 07:07:42,900
and HDFS we may not consider
as an external source.
9806
07:07:42,900 --> 07:07:45,300
It will be consider as
a file system of Hadoop.
9807
07:07:45,400 --> 07:07:47,300
So when Spark is working
9808
07:07:47,300 --> 07:07:49,743
with Hadoop mostly
the file system,
9809
07:07:49,743 --> 07:07:51,900
we will be using will be Hdfs,
9810
07:07:51,900 --> 07:07:53,782
We can read
from HBase,
9811
07:07:53,782 --> 07:07:55,900
or even we can do
from other sources,
9812
07:07:55,900 --> 07:07:59,781
like a Parquet file, or from S3;
from different sources like Avro
9813
07:07:59,781 --> 07:08:02,000
you can read and create the RDD.
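A minimal sketch of the two ways to create RDDs, assuming the pyspark shell where sc already exists; the HDFS path is hypothetical.

```python
numbers_rdd = sc.parallelize([1, 2, 3, 4, 5])        # 1. parallelize a collection
lines_rdd = sc.textFile("hdfs:///data/sample.txt")   # 2. load from an external source
print(numbers_rdd.count())
```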
9814
07:08:02,200 --> 07:08:03,000
Next question is
9815
07:08:03,000 --> 07:08:05,800
what is executor memory
in a Spark application?
9816
07:08:05,800 --> 07:08:08,100
Every Spark application
will have fixed.
9817
07:08:08,100 --> 07:08:09,900
heap size and fixed number
9818
07:08:09,900 --> 07:08:13,196
of cores for the Spark
executor executor is nothing
9819
07:08:13,196 --> 07:08:16,500
but the execution unit
available in every machine
9820
07:08:16,500 --> 07:08:19,600
and that's going to facilitate
to do the processing to do
9821
07:08:19,600 --> 07:08:21,654
the tasks in the worker machine,
9822
07:08:21,654 --> 07:08:25,300
so irrespective of whether you
use yarn resource manager
9823
07:08:25,300 --> 07:08:26,800
or any other measures
9824
07:08:26,800 --> 07:08:29,600
like resource manager
every worker machine.
9825
07:08:29,600 --> 07:08:31,200
We will have an Executor
9826
07:08:31,200 --> 07:08:34,400
and within the executor
the task will be handled
9827
07:08:34,400 --> 07:08:38,700
and the memory to be allocated
for that particular executor is
9828
07:08:38,700 --> 07:08:41,893
what we define as the heap size,
and we can Define
9829
07:08:41,893 --> 07:08:42,775
how much amount
9830
07:08:42,775 --> 07:08:45,788
of memory should be used
for that particular executor
9831
07:08:45,788 --> 07:08:47,700
within the worker
machine as well.
9832
07:08:47,700 --> 07:08:50,900
as the number of cores that
can be used within the executor
9833
07:08:51,000 --> 07:08:53,988
by the executor
for this Spark application,
9834
07:08:53,988 --> 07:08:55,600
and that can be controlled
9835
07:08:55,600 --> 07:08:58,100
through the configuration
files of spark.
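Executor heap size and cores are set through Spark configuration; the values below are illustrative, not recommendations.

```python
# For example, on submit:
#   spark-submit --executor-memory 4g --executor-cores 2 --num-executors 10 app.py
# or programmatically:
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("executor-memory-demo")
        .set("spark.executor.memory", "4g")
        .set("spark.executor.cores", "2"))
sc = SparkContext(conf=conf)
```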
9836
07:08:58,100 --> 07:09:01,300
Next question: define
partitions in Apache Spark.
9837
07:09:01,300 --> 07:09:03,100
So any data irrespective of
9838
07:09:03,100 --> 07:09:05,478
whether it is small
data or large data,
9839
07:09:05,478 --> 07:09:07,213
we can divide those data sets
9840
07:09:07,213 --> 07:09:10,708
across multiple systems
the process of dividing the data
9841
07:09:10,708 --> 07:09:11,961
into multiple pieces
9842
07:09:11,961 --> 07:09:13,310
and making it to store
9843
07:09:13,310 --> 07:09:16,500
across multiple systems as
a different logical units.
9844
07:09:16,500 --> 07:09:17,549
It's called partitioning.
9845
07:09:17,549 --> 07:09:20,600
So in simple terms partitioning
is nothing but the process
9846
07:09:20,600 --> 07:09:21,700
of Dividing the data
9847
07:09:21,700 --> 07:09:24,800
and storing in multiple systems
is called partitions
9848
07:09:24,800 --> 07:09:26,600
and by default the conversion
9849
07:09:26,600 --> 07:09:29,700
of the data into RDD
will happen in the system
9850
07:09:29,700 --> 07:09:31,400
where the partition is existing.
9851
07:09:31,400 --> 07:09:33,830
So the more the partitions,
the more parallelism
9852
07:09:33,830 --> 07:09:36,049
they are going to get
at the same time.
9853
07:09:36,049 --> 07:09:38,500
We have to be careful
not to trigger huge amount
9854
07:09:38,500 --> 07:09:40,100
of network data transfer as well
9855
07:09:40,300 --> 07:09:43,455
And every RDD can
be partitioned within Spark,
9856
07:09:43,455 --> 07:09:45,700
and the panel
is the partitioning
9857
07:09:45,700 --> 07:09:49,559
going to help us to achieve
parallelism more the partition
9858
07:09:49,559 --> 07:09:50,685
that we have more.
9859
07:09:50,685 --> 07:09:52,000
Solutions can be done
9860
07:09:52,000 --> 07:09:54,300
and that's the key thing
about the success
9861
07:09:54,300 --> 07:09:58,200
of the spark program is
minimizing the network traffic
9862
07:09:58,200 --> 07:10:00,900
while doing the parallel
processing and minimizing
9863
07:10:00,900 --> 07:10:04,247
the data transfer
within the systems of spark.
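A minimal sketch of inspecting and changing partitioning, assuming a SparkContext sc as in the earlier sketch; the file path and partition counts are placeholders.

```scala
val data = sc.textFile("hdfs:///data/events.log", 8)   // ask for at least 8 partitions
println(data.getNumPartitions)                         // how many logical pieces the data is split into

val widened  = data.repartition(16)   // more partitions -> more parallelism (causes a full shuffle)
val narrowed = widened.coalesce(4)    // fewer partitions, avoiding a full shuffle where possible
```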
9864
07:10:04,247 --> 07:10:08,000
What operations does an RDD
support? So we can run
9865
07:10:08,000 --> 07:10:10,228
multiple operations
against an RDD.
9866
07:10:10,228 --> 07:10:13,900
So there are two type of things
we can do we can group it
9867
07:10:13,900 --> 07:10:16,000
into two one is transformations
9868
07:10:16,000 --> 07:10:18,800
In transformations, an RDD
will get transformed
9869
07:10:18,800 --> 07:10:20,600
from one form to another form.
9870
07:10:20,600 --> 07:10:22,600
Select filtering grouping all
9871
07:10:22,600 --> 07:10:25,000
that like it's going
to get transformed
9872
07:10:25,000 --> 07:10:28,000
from one form to another form
one small example,
9873
07:10:28,000 --> 07:10:31,470
like reduceByKey, filter; all
those will be transformations.
9874
07:10:31,470 --> 07:10:33,700
The resultant of
the transformation will be
9875
07:10:33,700 --> 07:10:35,300
another rdd the same time.
9876
07:10:35,300 --> 07:10:37,700
We can take some actions
against the rdd
9877
07:10:37,700 --> 07:10:40,245
that's going to give
us the final result.
9878
07:10:40,245 --> 07:10:41,262
I can say count
9879
07:10:41,262 --> 07:10:43,500
how many records
or they are store
9880
07:10:43,500 --> 07:10:45,700
that result into the hdfs.
9881
07:10:46,100 --> 07:10:49,541
Those are all actions, so
multiple actions can be taken
9882
07:10:49,541 --> 07:10:50,600
against the RDD.
9883
07:10:50,600 --> 07:10:53,700
The existence of the data
will come into picture only
9884
07:10:53,700 --> 07:10:56,200
if I take some action
against that RDD.
9885
07:10:56,200 --> 07:10:56,515
Okay.
9886
07:10:56,515 --> 07:10:57,400
Next question.
9887
07:10:57,400 --> 07:11:01,000
What do you understand
by transformations in spark?
9888
07:11:01,100 --> 07:11:03,679
So Transformations are
nothing but functions
9889
07:11:03,679 --> 07:11:06,800
mostly these will be higher-
order functions, as in Scala,
9890
07:11:06,800 --> 07:11:09,400
and we have something
like a higher order functions
9891
07:11:09,400 --> 07:11:12,356
which will be applied
against the RDD,
9892
07:11:12,356 --> 07:11:14,100
Mostly against the list
9893
07:11:14,100 --> 07:11:16,407
of elements that we
have within the rdd
9894
07:11:16,407 --> 07:11:19,314
that function will get
applied by the existence
9895
07:11:19,314 --> 07:11:21,875
of the RDD will come
into picture only
9896
07:11:21,875 --> 07:11:25,597
if we take some action against
it in this particular example,
9897
07:11:25,597 --> 07:11:26,900
I am reading the file
9898
07:11:26,900 --> 07:11:30,536
and having it within the RDD
called rawData; then I am doing
9899
07:11:30,536 --> 07:11:32,500
some transformation using a map.
9900
07:11:32,500 --> 07:11:34,382
So it's going
to apply a function
9901
07:11:34,382 --> 07:11:35,623
so we can map I have
9902
07:11:35,623 --> 07:11:39,100
some function which will split
each record using the tab.
9903
07:11:39,100 --> 07:11:41,632
So the split with the tab
will be applied
9904
07:11:41,632 --> 07:11:44,300
against each record
within the raw data
9905
07:11:44,300 --> 07:11:48,200
and the resultant moviesData
will again be another RDD,
9906
07:11:48,200 --> 07:11:50,644
but of course,
this will be a lazy operation.
9907
07:11:50,644 --> 07:11:53,700
The existence of movies data
will come into picture only
9908
07:11:53,700 --> 07:11:57,700
if I take some action
against it like count or print
9909
07:11:57,726 --> 07:12:01,573
or store only those actions
will generate the data.
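A minimal sketch of the flow just described, assuming a SparkContext sc; the file path and RDD names (rawData, moviesData) are illustrative.

```scala
val rawData    = sc.textFile("hdfs:///data/movies.tsv")       // RDD of lines
val moviesData = rawData.map(line => line.split("\t"))        // lazy transformation: split each record by tab

// Nothing actually runs until an action is called:
println(moviesData.count())                                   // action: triggers the processing
moviesData.saveAsTextFile("hdfs:///out/movies")               // action: store the result
```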
9910
07:12:01,800 --> 07:12:04,600
So next question:
define functions of Spark Core.
9911
07:12:04,600 --> 07:12:07,100
So that's going to take care
of the memory management
9912
07:12:07,100 --> 07:12:09,400
and fault tolerance of rdds.
9913
07:12:09,400 --> 07:12:12,700
It's going to help us
to schedule distribute the task
9914
07:12:12,700 --> 07:12:15,400
and manage the jobs running
within the cluster
9915
07:12:15,400 --> 07:12:17,700
and also it's going to help
us to store the RDD
9916
07:12:17,700 --> 07:12:20,700
in the storage system as well
as reads data from the storage.
9917
07:12:20,700 --> 07:12:23,905
System that's to do the file
system level operations.
9918
07:12:23,905 --> 07:12:25,200
It's going to help us
9919
07:12:25,200 --> 07:12:27,500
and Spark core programming
can be done in any
9920
07:12:27,500 --> 07:12:30,347
of these languages,
like Java, Scala, Python,
9921
07:12:30,347 --> 07:12:32,500
as well as using R. So Core is
9922
07:12:32,500 --> 07:12:35,600
at the horizontal level;
on top of Spark Core we can have
9923
07:12:35,600 --> 07:12:37,500
a number of components
9924
07:12:37,600 --> 07:12:41,000
and there are different type
of rdds available one such
9925
07:12:41,000 --> 07:12:42,923
a special type is the pair RDD.
9926
07:12:42,923 --> 07:12:43,800
So next question.
9927
07:12:43,800 --> 07:12:46,100
What do you understand
by a pair RDD?
9928
07:12:46,100 --> 07:12:49,792
The data is going to exist
in pairs, as keys and values,
9929
07:12:49,800 --> 07:12:51,906
so I can Some special functions
9930
07:12:51,906 --> 07:12:55,400
within the pair RDDs,
or special transformations,
9931
07:12:55,400 --> 07:12:58,900
like connect all the values
corresponding to the same key
9932
07:12:58,900 --> 07:13:00,200
like sort and shuffle,
9933
07:13:00,300 --> 07:13:02,800
what happens within
the sort and shuffle of Hadoop;
9934
07:13:02,900 --> 07:13:04,356
those type of operations
9935
07:13:04,356 --> 07:13:05,161
like you want
9936
07:13:05,161 --> 07:13:08,339
to consolidate our group
all the values corresponding
9937
07:13:08,339 --> 07:13:10,792
to the same key are
apply some functions
9938
07:13:10,792 --> 07:13:14,400
against all the values
corresponding to the same key.
9939
07:13:14,400 --> 07:13:16,200
Like I want to get the sum
9940
07:13:16,200 --> 07:13:20,400
of the value of all the keys
we can use the parody.
9941
07:13:20,400 --> 07:13:23,600
D and get that a cheat so
it's going to the data
9942
07:13:23,600 --> 07:13:29,300
within the re going to exist
in Pace keys and right.
9943
07:13:29,300 --> 07:13:31,376
Okay a question from Jason.
9944
07:13:31,376 --> 07:13:33,223
What are our Vector rdds
9945
07:13:33,300 --> 07:13:36,300
in machine learning you
will have huge amount
9946
07:13:36,300 --> 07:13:38,700
of processing handled by vectors
9947
07:13:38,700 --> 07:13:42,812
and matrices and we do lots
of operations Vector operations,
9948
07:13:42,812 --> 07:13:44,200
like effective actor
9949
07:13:44,200 --> 07:13:47,700
or transforming any data
into a vector form so vectors
9950
07:13:47,700 --> 07:13:50,755
like as the normal way
it will have a Direction.
9951
07:13:50,755 --> 07:13:51,624
And magnitude
9952
07:13:51,624 --> 07:13:54,900
so we can do some operations
like some two vectors
9953
07:13:54,900 --> 07:13:58,622
and what is the difference
between the vector A
9954
07:13:58,622 --> 07:14:00,500
and B as well as a and see
9955
07:14:00,500 --> 07:14:02,400
if the difference
between Vector A
9956
07:14:02,400 --> 07:14:04,200
and B is less compared to a
9957
07:14:04,200 --> 07:14:06,487
and C we can say the vector A
9958
07:14:06,487 --> 07:14:10,825
and B is somewhat similar
in terms of features.
9959
07:14:11,100 --> 07:14:13,815
So the vector R GD
will be used to represent
9960
07:14:13,815 --> 07:14:17,100
the vector directly and
that will be used extensively
9961
07:14:17,100 --> 07:14:19,500
while doing the
measuring and Jason.
9962
07:14:19,700 --> 07:14:20,500
Thank you other.
9963
07:14:20,500 --> 07:14:21,400
Is another question.
9964
07:14:21,400 --> 07:14:22,900
What is RDD lineage?
9965
07:14:22,900 --> 07:14:25,800
So here I any data
processing any Transformations
9966
07:14:25,800 --> 07:14:28,811
that we do it maintains
something called a lineage.
9967
07:14:28,811 --> 07:14:31,100
So what how data
is getting transformed
9968
07:14:31,100 --> 07:14:33,543
when the data is available
in the partition form
9969
07:14:33,543 --> 07:14:36,300
in multiple systems and
when we do the transformation,
9970
07:14:36,300 --> 07:14:39,800
it will undergo multiple steps
and in the distributed word.
9971
07:14:39,800 --> 07:14:42,700
It's very common to have
failures of machines
9972
07:14:42,700 --> 07:14:45,200
or machines going
out of the network
9973
07:14:45,200 --> 07:14:47,000
and the system our framework
9974
07:14:47,000 --> 07:14:47,800
as it should be
9975
07:14:47,800 --> 07:14:50,800
in a position to handle
small handles it through.
9976
07:14:50,858 --> 07:14:55,800
RDD lineage; it can restore
the lost partition alone. Assume
9977
07:14:55,800 --> 07:14:59,004
like out of ten machines
data is distributed
9978
07:14:59,004 --> 07:15:00,828
across five machines out of
9979
07:15:00,828 --> 07:15:03,800
of those five machines
one machine is lost.
9980
07:15:03,800 --> 07:15:06,500
So whatever the
latest transformation
9981
07:15:06,500 --> 07:15:07,807
that had the data
9982
07:15:08,000 --> 07:15:10,100
for that particular
partition the partition
9983
07:15:10,100 --> 07:15:13,924
in the lost machine alone
can be regenerated, and it knows
9984
07:15:13,924 --> 07:15:16,700
how to regenerate that data
on how to get that result
9985
07:15:16,700 --> 07:15:18,384
and data using the concept
9986
07:15:18,384 --> 07:15:21,153
of rdd lineage so
from which Each data source,
9987
07:15:21,153 --> 07:15:22,200
it got generated.
9988
07:15:22,200 --> 07:15:23,800
What was its previous step.
9989
07:15:23,800 --> 07:15:26,300
So the complete lineage
will be available,
9990
07:15:26,300 --> 07:15:29,724
and it's maintained by
the spark framework internally.
9991
07:15:29,724 --> 07:15:31,700
We call that RDD lineage.
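A minimal sketch of looking at that lineage, assuming a SparkContext sc; the path and the chain of transformations are illustrative.

```scala
val raw      = sc.textFile("hdfs:///data/movies.tsv")
val fields   = raw.map(_.split("\t"))
val filtered = fields.filter(_.length > 1)

// toDebugString prints the chain of parent RDDs Spark keeps internally,
// which is what it uses to recompute a lost partition.
println(filtered.toDebugString)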
9992
07:15:31,700 --> 07:15:34,682
What is a Spark driver? To put
it simply, for those
9993
07:15:34,682 --> 07:15:37,600
who are from a Hadoop
background, the YARN background,
9994
07:15:37,600 --> 07:15:40,000
We can compare this
to at muster.
9995
07:15:40,100 --> 07:15:43,300
Every application will
have a spark driver
9996
07:15:43,300 --> 07:15:44,900
that will have a SparkContext,
9997
07:15:44,900 --> 07:15:47,550
which is going to moderate
the complete execution
9998
07:15:47,550 --> 07:15:50,200
of the job that will connect
to the spark master.
9999
07:15:50,500 --> 07:15:52,300
It delivers the RDD graph,
10000
07:15:52,300 --> 07:15:54,900
that is the lineage
for the master
10001
07:15:54,900 --> 07:15:56,810
and the coordinate the tasks.
10002
07:15:56,810 --> 07:15:57,817
What are the tasks
10003
07:15:57,817 --> 07:16:00,700
that gets executed
in the distributed environment?
10004
07:16:00,700 --> 07:16:01,500
It can do
10005
07:16:01,500 --> 07:16:04,400
the parallel processing
do the Transformations
10006
07:16:04,600 --> 07:16:06,900
and actions against the RTD.
10007
07:16:06,900 --> 07:16:08,551
So it's a single
point of contact
10008
07:16:08,551 --> 07:16:10,100
for that specific application.
10009
07:16:10,100 --> 07:16:12,500
So the Spark driver
is short-lived,
10010
07:16:12,500 --> 07:16:15,300
and the SparkContext
within this Spark driver
10011
07:16:15,300 --> 07:16:18,558
is going to be the coordinator
between the master and the tasks
10012
07:16:18,558 --> 07:16:20,694
that are running
and smart driver.
10013
07:16:20,694 --> 07:16:23,100
I can get started
in any of the executor
10014
07:16:23,100 --> 07:16:26,800
within Spark. Name the types
of cluster managers in Spark.
10015
07:16:26,800 --> 07:16:28,800
So whenever you have
a group of machines,
10016
07:16:28,800 --> 07:16:30,247
you need a manager to manage
10017
07:16:30,247 --> 07:16:33,415
the resources the different type
of the store manager already.
10018
07:16:33,415 --> 07:16:35,700
We have seen YARN,
Yet Another Resource Negotiator,
10019
07:16:35,700 --> 07:16:39,400
She later which manages
the resources of Hadoop on top
10020
07:16:39,400 --> 07:16:43,000
of YARN we can make
Spark work. Sometimes I
10021
07:16:43,000 --> 07:16:46,700
may want to have Spark alone
in my organization
10022
07:16:46,700 --> 07:16:49,594
and not along with the Hadoop
or any other technology.
10023
07:16:49,594 --> 07:16:50,297
Then I can go
10024
07:16:50,297 --> 07:16:53,100
with the standalone mode; Spark
has a built-in cluster manager.
10025
07:16:53,100 --> 07:16:55,547
So only spawn can get
executed multiple systems.
10026
07:16:55,547 --> 07:16:57,423
But generally if we
have a cluster we
10027
07:16:57,423 --> 07:16:58,600
will try to leverage
10028
07:16:58,600 --> 07:17:01,600
various other Computing
platforms Computing Frameworks,
10029
07:17:01,600 --> 07:17:04,601
like graph processing
giraffe these on that.
10030
07:17:04,601 --> 07:17:07,000
We will try to
leverage that case.
10031
07:17:07,000 --> 07:17:08,321
We will go with yarn
10032
07:17:08,321 --> 07:17:10,700
or some generalized
resource manager,
10033
07:17:10,700 --> 07:17:12,000
like Mesos. YARN,
10034
07:17:12,000 --> 07:17:14,400
It's very specific to Hadoop
and it comes along
10035
07:17:14,400 --> 07:17:18,500
with Hadoop. Mesos is a
cluster-level resource manager,
10036
07:17:18,500 --> 07:17:20,600
and if I have multiple clusters
10037
07:17:20,600 --> 07:17:23,700
within the organization,
then you can use Mesos.
10038
07:17:23,800 --> 07:17:25,883
Mesos is also a resource manager.
10039
07:17:25,883 --> 07:17:29,400
It's a separate top-level project
within Apache. Next question:
10040
07:17:29,400 --> 07:17:30,600
What do you understand
10041
07:17:30,600 --> 07:17:34,200
by worker node in a cluster
redistribute environment.
10042
07:17:34,200 --> 07:17:36,252
We will have n number
of workers we call
10043
07:17:36,252 --> 07:17:38,200
that is a worker node
or a slave node,
10044
07:17:38,200 --> 07:17:40,665
which does the actual
processing going to get
10045
07:17:40,665 --> 07:17:43,300
the data do the processing
and get us the result
10046
07:17:43,300 --> 07:17:45,100
and masternode going to assign
10047
07:17:45,100 --> 07:17:48,000
what has to be done by
one person own and it's going
10048
07:17:48,000 --> 07:17:50,551
to read the data available
in the specific work on.
10049
07:17:50,551 --> 07:17:53,196
Generally, the tasks assigned
to the worker node,
10050
07:17:53,196 --> 07:17:55,900
or the task will be assigned
to the output node data
10051
07:17:55,900 --> 07:17:57,500
is located in vigorous Pace.
10052
07:17:57,500 --> 07:18:00,100
Especially Hadoop always
it will try to achieve
10053
07:18:00,100 --> 07:18:01,183
the data locality.
10054
07:18:01,183 --> 07:18:04,391
That's what we can't is
the resource availability as
10055
07:18:04,391 --> 07:18:05,900
well as the availability
10056
07:18:05,900 --> 07:18:08,900
of the resource in terms
of CPU memory as well
10057
07:18:08,900 --> 07:18:10,000
will be considered
10058
07:18:10,000 --> 07:18:13,601
Say you might have some data
replicated in three machines.
10059
07:18:13,601 --> 07:18:16,884
All three machines are busy
doing the work and no CPU
10060
07:18:16,884 --> 07:18:19,414
or memory available
to start the other task.
10061
07:18:19,414 --> 07:18:20,400
It will not wait.
10062
07:18:20,400 --> 07:18:23,300
for those machines to complete
the job and get the resource
10063
07:18:23,300 --> 07:18:25,900
and do the processing it
will start the processing
10064
07:18:25,900 --> 07:18:27,000
and some other machine
10065
07:18:27,000 --> 07:18:28,200
which is going to be near
10066
07:18:28,200 --> 07:18:31,300
to that the missions having
the data and read the data
10067
07:18:31,300 --> 07:18:32,400
over the network.
10068
07:18:32,600 --> 07:18:35,100
So to answer straight:
worker machines are nothing but
10069
07:18:35,100 --> 07:18:36,600
which does the actual work
10070
07:18:36,600 --> 07:18:37,755
and going to report
10071
07:18:37,755 --> 07:18:41,315
to the master in terms of what
is the resource utilization
10072
07:18:41,315 --> 07:18:42,627
and the tasks running
10073
07:18:42,627 --> 07:18:46,000
within the work emissions
will be doing the actual work
10074
07:18:46,000 --> 07:18:49,049
What is a sparse vector?
Just a few minutes back
10075
07:18:49,049 --> 07:18:50,656
I was answering a question.
10076
07:18:50,656 --> 07:18:52,697
What is a vector
vector is nothing
10077
07:18:52,697 --> 07:18:55,500
but representing the data
in multi dimensional form?
10078
07:18:55,500 --> 07:18:57,500
The vector can
be multi-dimensional
10079
07:18:57,500 --> 07:18:58,500
Vector as well.
10080
07:18:58,500 --> 07:19:02,400
As you know, I am going
to represent a point in space.
10081
07:19:02,400 --> 07:19:04,938
I need three dimensions
the X y&z.
10082
07:19:05,000 --> 07:19:08,076
So the vector will
have three dimensions.
10083
07:19:08,300 --> 07:19:10,934
If I need to represent
a line in the space,
10084
07:19:10,934 --> 07:19:14,107
Then I need two points
to represent the starting point
10085
07:19:14,107 --> 07:19:17,700
of the line and the endpoint
of the line then I need a vector
10086
07:19:17,700 --> 07:19:18,800
which can hold
10087
07:19:18,800 --> 07:19:21,049
so it will have two Dimensions
the first First Dimension
10088
07:19:21,049 --> 07:19:23,121
will have one point
the second dimension
10089
07:19:23,121 --> 07:19:24,400
will have another Point
10090
07:19:24,400 --> 07:19:25,429
let us say point B
10091
07:19:25,429 --> 07:19:29,200
if I have to represent a plane
then I need another dimension
10092
07:19:29,200 --> 07:19:30,702
to represent two lines.
10093
07:19:30,702 --> 07:19:31,510
So each line
10094
07:19:31,510 --> 07:19:34,203
will be representing
two points same way.
10095
07:19:34,203 --> 07:19:37,200
I can represent any data
using a vector form
10096
07:19:37,200 --> 07:19:40,217
as you might have
huge number of feedback
10097
07:19:40,217 --> 07:19:43,500
or ratings of products
across an organization.
10098
07:19:43,500 --> 07:19:46,327
Let's take a simple example
Amazon Amazon have
10099
07:19:46,327 --> 07:19:47,632
millions of products.
10100
07:19:47,632 --> 07:19:50,498
Not every user not even
a single user would have
10101
07:19:50,498 --> 07:19:53,461
It was millions of all
the products within Amazon.
10102
07:19:53,461 --> 07:19:55,341
The only hardly
we would have used
10103
07:19:55,341 --> 07:19:58,400
like a point one percent
or like even less than that,
10104
07:19:58,400 --> 07:20:00,200
maybe like few hundred products.
10105
07:20:00,200 --> 07:20:02,600
We would have used
and rated the products
10106
07:20:02,600 --> 07:20:04,600
within Amazon for
the complete lifetime.
10107
07:20:04,600 --> 07:20:07,700
If I have to represent
all ratings of the products
10108
07:20:07,700 --> 07:20:10,194
with director and see
the first position
10109
07:20:10,194 --> 07:20:13,400
of the rating it's going
to refer to the product
10110
07:20:13,400 --> 07:20:15,200
with ID 1 second position.
10111
07:20:15,200 --> 07:20:17,600
It's going to refer
to the product with ID 2.
10112
07:20:17,600 --> 07:20:20,700
So I will have million values
within that particular vector.
10113
07:20:20,700 --> 07:20:22,645
After out of million values,
10114
07:20:22,645 --> 07:20:25,493
I'll have only values
400 products where I
10115
07:20:25,493 --> 07:20:27,300
have provided the ratings.
10116
07:20:27,400 --> 07:20:30,947
So it may vary from number
1 to 5 for all others.
10117
07:20:30,947 --> 07:20:34,200
It will be 0. Sparse
means thinly distributed.
10118
07:20:34,800 --> 07:20:38,774
So to represent the huge amount
of data with the position
10119
07:20:38,774 --> 07:20:41,900
and saying this particular
position is having
10120
07:20:41,900 --> 07:20:43,800
a 0 value we can mention
10121
07:20:43,800 --> 07:20:45,900
that with a key and value.
10122
07:20:45,900 --> 07:20:47,415
So what position having
10123
07:20:47,415 --> 07:20:51,500
what value, rather than storing
all zeros. So we store only
10124
07:20:51,500 --> 07:20:55,471
the non-zeros: the position of it and
the corresponding value.
10125
07:20:55,471 --> 07:20:58,400
That means all others going
to be a zero value
10126
07:20:58,400 --> 07:21:01,400
so we can mention
this particular space
10127
07:21:01,400 --> 07:21:05,400
Vector mentioning it
to representa nonzero entities.
10128
07:21:05,400 --> 07:21:08,300
So to store only
the nonzero entities
10129
07:21:08,300 --> 07:21:10,364
this sparse vector will be used
10130
07:21:10,364 --> 07:21:12,500
so that we don't need to waste
10131
07:21:12,500 --> 07:21:15,550
any additional space while
storing this sparse vector.
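A minimal sketch of a sparse vector in Spark MLlib; the vector length and the positions/ratings are made-up, matching the million-product example above.

```scala
import org.apache.spark.mllib.linalg.Vectors

// Length 1,000,000, but only three positions carry a rating; everything else is implicitly 0.0
val ratings = Vectors.sparse(1000000, Seq((0, 4.0), (42, 5.0), (999, 3.0)))

println(ratings(42))   // 5.0
println(ratings(100))  // 0.0, without ever storing that zero
```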
10132
07:21:15,550 --> 07:21:18,600
Let's discuss some questions
on spark streaming.
10133
07:21:18,600 --> 07:21:21,422
How is streaming handled
in Spark? Explain
10134
07:21:21,422 --> 07:21:23,900
with examples. Spark
Streaming is used
10135
07:21:23,900 --> 07:21:25,452
for processing real-time
10136
07:21:25,452 --> 07:21:29,500
streaming data to precisely say
it's a micro batch processing.
10137
07:21:29,500 --> 07:21:32,852
So data will be collected
between every small interval say
10138
07:21:32,852 --> 07:21:35,128
maybe like .5 seconds
or every seconds
10139
07:21:35,128 --> 07:21:36,200
until you get processed.
10140
07:21:36,200 --> 07:21:36,900
So internally,
10141
07:21:36,900 --> 07:21:40,100
it's going to create
micro patches the data created
10142
07:21:40,100 --> 07:21:43,800
out of that micro-batch we call
a DStream. The DStream
10143
07:21:43,800 --> 07:21:45,500
is like an RDD,
10144
07:21:45,500 --> 07:21:48,200
so I can do
Transformations and actions.
10145
07:21:48,200 --> 07:21:50,691
Whatever that I do
with our DD I can do
10146
07:21:50,691 --> 07:21:52,200
With the stream as well
10147
07:21:52,500 --> 07:21:57,100
and Spark Streaming can read
data from Flume, HDFS, or
10148
07:21:57,100 --> 07:21:59,500
other streaming services Aspen
10149
07:21:59,800 --> 07:22:02,565
and store the data
in the dashboard or in
10150
07:22:02,565 --> 07:22:06,300
any other database and it
provides very high throughput
10151
07:22:06,400 --> 07:22:09,200
as it can be processed with
a number of different systems
10152
07:22:09,200 --> 07:22:11,800
in a distributed
fashion again streaming.
10153
07:22:11,800 --> 07:22:14,858
This stream will be partitioned
internally and it has
10154
07:22:14,858 --> 07:22:17,100
the built-in feature
of fault tolerance,
10155
07:22:17,100 --> 07:22:18,700
even if any data is lost
10156
07:22:18,700 --> 07:22:22,100
and a transformed RDD
is lost, it can regenerate
10157
07:22:22,100 --> 07:22:23,930
those rdds from the existing
10158
07:22:23,930 --> 07:22:25,500
or from the source data.
10159
07:22:25,500 --> 07:22:28,100
So the DStream is going
to be the building block
10160
07:22:28,100 --> 07:22:32,748
of streaming and it has
the fault tolerance mechanism
10161
07:22:32,748 --> 07:22:34,902
that we have within the RDD.
10162
07:22:35,000 --> 07:22:38,600
So DStreams are a specialized
RDD, a specialized form
10163
07:22:38,600 --> 07:22:42,000
of RDD, specifically to be used
within Spark Streaming.
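A minimal Spark Streaming sketch, assuming a SparkContext sc; the socket source (localhost:9999) and the 1-second batch interval are placeholders.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc   = new StreamingContext(sc, Seconds(1))          // batch interval = size of each micro-batch
val lines = ssc.socketTextStream("localhost", 9999)       // DStream of lines from a live source

// Transformations and an output action, exactly as with RDDs
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                            // runs on every micro-batch

ssc.start()
ssc.awaitTermination()
```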
10164
07:22:42,000 --> 07:22:42,253
Okay.
10165
07:22:42,253 --> 07:22:42,963
Next question.
10166
07:22:42,963 --> 07:22:45,600
What is the significance
of sliding window operation?
10167
07:22:45,600 --> 07:22:48,700
That's a very interesting one
in the streaming data whenever
10168
07:22:48,700 --> 07:22:50,600
we do the Computing the data.
10169
07:22:50,600 --> 07:22:53,218
Density are the
business implications
10170
07:22:53,218 --> 07:22:56,500
of that specific data
May oscillate a lot.
10171
07:22:56,500 --> 07:22:58,400
For example within Twitter.
10172
07:22:58,400 --> 07:23:01,455
We used to say the trending
tweet hashtag just
10173
07:23:01,455 --> 07:23:03,900
because that hashtag
is very popular.
10174
07:23:03,900 --> 07:23:06,200
Maybe someone might have hacked
into the system
10175
07:23:06,200 --> 07:23:09,500
and used a number of tweets
maybe for that particular
10176
07:23:09,500 --> 07:23:12,202
our it might have appeared
millions of times just
10177
07:23:12,202 --> 07:23:15,123
because it appear billions
of times for that specific
10178
07:23:15,123 --> 07:23:16,107
and minute duration
10179
07:23:16,107 --> 07:23:18,800
or like say to three minute
duration each not getting
10180
07:23:18,800 --> 07:23:20,200
to the trending tank.
10181
07:23:20,200 --> 07:23:22,286
Trending hashtag for
that particular day
10182
07:23:22,286 --> 07:23:23,992
or for that particular month.
10183
07:23:23,992 --> 07:23:26,700
So what we will do we
will try to do an average.
10184
07:23:26,700 --> 07:23:29,357
So like a window
this current time frame
10185
07:23:29,357 --> 07:23:32,500
and T minus 1 T minus 2 all
the data we will consider
10186
07:23:32,500 --> 07:23:34,807
and we will try to find
the average or some
10187
07:23:34,807 --> 07:23:37,276
so the complete business logic
will be applied
10188
07:23:37,276 --> 07:23:39,100
against that particular window.
10189
07:23:39,200 --> 07:23:43,400
So any drastic changes
on to precisely say the spike
10190
07:23:43,500 --> 07:23:46,200
or deep very
drastic spinal cords
10191
07:23:46,200 --> 07:23:50,300
drastic deep in the pattern
of the data will be normalized.
10192
07:23:50,300 --> 07:23:51,100
So that's the
10193
07:23:51,100 --> 07:23:54,452
because significance of using
the sliding window operation
10194
07:23:54,452 --> 07:23:55,800
within Spark Streaming,
10195
07:23:55,800 --> 07:23:59,600
and Spark can handle this
sliding window automatically.
10196
07:23:59,600 --> 07:24:04,000
It can store the prior data
the T minus 1 T minus 2 and
10197
07:24:04,000 --> 07:24:06,300
how big the window
needs to be maintained
10198
07:24:06,300 --> 07:24:09,192
or that can be handled easily
within the program
10199
07:24:09,192 --> 07:24:11,100
and it's at the abstract level.
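A minimal sliding-window sketch, assuming the `counts` pair DStream from the earlier streaming sketch; the 10-minute window and 1-minute slide are illustrative values, not the Twitter example's actual settings.

```scala
import org.apache.spark.streaming.Minutes

val windowedCounts = counts.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,   // aggregate values across the whole window
  Minutes(10),                 // window length (how far back we look)
  Minutes(1))                  // slide interval (how often the window moves)

windowedCounts.print()         // spikes in a single batch get averaged out over the window
```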
10200
07:24:11,300 --> 07:24:12,100
Next question is
10201
07:24:12,100 --> 07:24:15,600
what is a DStream? The expansion
is Discretized Stream.
10202
07:24:15,600 --> 07:24:17,600
So that's the abstract form
10203
07:24:17,600 --> 07:24:20,500
or the which will form
of representation of the data.
10204
07:24:20,500 --> 07:24:22,494
For the spark
streaming the same way,
10205
07:24:22,494 --> 07:24:25,200
how RDDs get
transformed from one form
10206
07:24:25,200 --> 07:24:26,200
to another form?
10207
07:24:26,200 --> 07:24:27,504
We will have series
10208
07:24:27,504 --> 07:24:30,800
of RDDs all put together,
called a DStream.
10209
07:24:30,800 --> 07:24:32,100
so this term is nothing
10210
07:24:32,100 --> 07:24:34,000
but it's another representation
10211
07:24:34,000 --> 07:24:36,593
of RDDs, or like
a group of RDDs,
10212
07:24:36,593 --> 07:24:38,223
because there is a stream
10213
07:24:38,223 --> 07:24:41,100
and I can apply
the streaming functions
10214
07:24:41,100 --> 07:24:43,921
or any of the functions
Transformations are actions
10215
07:24:43,921 --> 07:24:47,200
available within the streaming
against this D string
10216
07:24:47,300 --> 07:24:49,674
So within that
particular micro badge,
10217
07:24:49,674 --> 07:24:51,600
so I will Define What interval
10218
07:24:51,600 --> 07:24:54,377
the data should be collected
on should be processed
10219
07:24:54,377 --> 07:24:56,100
because there is a micro batch.
10220
07:24:56,100 --> 07:24:59,900
It could be every 1 second
or every hundred milliseconds
10221
07:24:59,900 --> 07:25:01,000
or every five seconds.
10222
07:25:01,300 --> 07:25:02,300
I can Define that page
10223
07:25:02,300 --> 07:25:04,300
particular period so
all the data is used
10224
07:25:04,300 --> 07:25:07,300
in that particular duration
will be considered
10225
07:25:07,300 --> 07:25:08,400
as a piece of data
10226
07:25:08,400 --> 07:25:09,600
and that will be called
10227
07:25:09,600 --> 07:25:13,400
as a DStream. Next question: explain
caching in Spark Streaming.
10228
07:25:13,400 --> 07:25:14,000
Of course.
10229
07:25:14,000 --> 07:25:15,000
Yes; Spark internally
10230
07:25:15,000 --> 07:25:16,300
It uses in memory Computing.
10231
07:25:16,600 --> 07:25:18,700
So any data when it
is doing the Computing
10232
07:25:18,900 --> 07:25:21,600
that is getting generated
will be there in memory, but mind
10233
07:25:21,600 --> 07:25:25,100
that if you do more and more
processing with other jobs
10234
07:25:25,100 --> 07:25:27,190
when there is a need
for more memory,
10235
07:25:27,190 --> 07:25:30,500
the least used RDDs will be
cleared off from the memory,
10236
07:25:30,500 --> 07:25:34,100
or the least used data
available out of actions
10237
07:25:34,100 --> 07:25:36,700
from the RDD will be cleared
off from the memory.
10238
07:25:36,700 --> 07:25:40,000
Sometimes I may need
that data forever in memory,
10239
07:25:40,000 --> 07:25:41,800
very simple example,
like dictionary.
10240
07:25:42,100 --> 07:25:43,600
I want the dictionary words
10241
07:25:43,600 --> 07:25:45,658
should be always
available in memory
10242
07:25:45,658 --> 07:25:48,900
because I may do a spell check
against the Tweet comments
10243
07:25:48,900 --> 07:25:51,500
or feedback comments
and our of nines.
10244
07:25:51,500 --> 07:25:54,900
So what I can do I
can say KH those any data
10245
07:25:54,900 --> 07:25:57,036
that comes in we can cache it,
10246
07:25:57,036 --> 07:25:59,100
or persist it in memory.
10247
07:25:59,100 --> 07:26:02,100
So even when there is a need
for memory by other applications
10248
07:26:02,100 --> 07:26:05,800
this specific data will
not be removed, and especially
10249
07:26:05,800 --> 07:26:08,800
that will be used to do
the further processing
10250
07:26:08,800 --> 07:26:11,500
and the casing
also can be defined
10251
07:26:11,500 --> 07:26:15,200
whether it should be in memory
only I in memory and hard disk
10252
07:26:15,200 --> 07:26:17,000
that also we can Define it.
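A minimal caching sketch along the lines of the dictionary example above, assuming a SparkContext sc; the path is a placeholder.

```scala
import org.apache.spark.storage.StorageLevel

// Keep the dictionary around across jobs; spill to disk when memory is tight
val dictionary = sc.textFile("hdfs:///data/dictionary.txt")
  .persist(StorageLevel.MEMORY_AND_DISK)

// dictionary.cache() would be the shorthand for the default MEMORY_ONLY level
println(dictionary.count())   // first action materializes and caches it
```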
10253
07:26:17,000 --> 07:26:20,100
Let's discuss some questions
on Spark GraphX.
10254
07:26:20,300 --> 07:26:24,000
The next question is: is there
an API for implementing graphs
10255
07:26:24,000 --> 07:26:26,200
and Spark in graph Theory?
10256
07:26:26,600 --> 07:26:28,100
Everything will be represented
10257
07:26:28,100 --> 07:26:33,200
as a graph is a graph it
will have nodes and edges.
10258
07:26:33,419 --> 07:26:36,880
So all will be represented
using the RDDs.
10259
07:26:37,000 --> 07:26:40,300
So it's going to extend
the RDD, and there is
10260
07:26:40,300 --> 07:26:42,482
a component called GraphX,
10261
07:26:42,500 --> 07:26:44,983
and it exposes
the functionalities
10262
07:26:44,983 --> 07:26:49,800
to represent a graph. We can have
an edge RDD and a vertex RDD; by creating
10263
07:26:49,800 --> 07:26:51,700
the edges and vertices,
10264
07:26:51,700 --> 07:26:53,239
I can create a graph
10265
07:26:53,500 --> 07:26:57,400
and this graph can exist
in a distributed environment.
10266
07:26:57,400 --> 07:27:00,208
So same way we will be
in a position to do
10267
07:27:00,208 --> 07:27:02,400
the parallel processing as well.
10268
07:27:02,700 --> 07:27:06,300
So GraphX, it's just
a form of representing
10269
07:27:06,400 --> 07:27:11,200
the data as graphs with edges
and vertices, and of course,
10270
07:27:11,200 --> 07:27:14,299
yes, it provides the API
to implement or create
10271
07:27:14,299 --> 07:27:17,400
the graph do the processing
on the graph the APA
10272
07:27:17,400 --> 07:27:19,900
so divided what is Page rank?
10273
07:27:20,100 --> 07:27:24,600
Graphics we didn't have sex
once the graph is created.
10274
07:27:24,600 --> 07:27:28,900
We can calculate the PageRank
for a particular node.
10275
07:27:29,100 --> 07:27:32,000
So that's very similar to
how we have the page rank
10276
07:27:32,100 --> 07:27:35,635
for the websites within Google
the higher the page rank.
10277
07:27:35,635 --> 07:27:38,774
That means it's more important
within that particular graph.
10278
07:27:38,774 --> 07:27:40,547
It's going to
show the importance
10279
07:27:40,547 --> 07:27:41,900
of that particular node
10280
07:27:41,900 --> 07:27:45,154
or Edge within that particular
graph is a graph is
10281
07:27:45,154 --> 07:27:46,700
a connected set of data.
10282
07:27:46,800 --> 07:27:49,600
All right, I will be connected
using the property
10283
07:27:49,600 --> 07:27:51,100
and How much important
10284
07:27:51,100 --> 07:27:55,300
that property makes we will have
a value Associated to it.
10285
07:27:55,500 --> 07:27:57,900
So within pagerank
we can calculate
10286
07:27:57,900 --> 07:27:59,100
like a static page rank.
10287
07:27:59,300 --> 07:28:00,703
It will run a number
10288
07:28:00,703 --> 07:28:03,300
of iterations or there
is another page
10289
07:28:03,300 --> 07:28:06,600
and code anomic page rank
that will get executed
10290
07:28:06,600 --> 07:28:09,200
till we reach
a particular saturation level
10291
07:28:09,300 --> 07:28:13,600
and the saturation level can be
defined with multiple criterias
10292
07:28:14,100 --> 07:28:15,200
and the APA is
10293
07:28:15,200 --> 07:28:17,500
because there is
a graph operations.
10294
07:28:17,700 --> 07:28:20,289
And be direct executed
against those graph
10295
07:28:20,289 --> 07:28:23,700
and they all are available
as APIs within GraphX.
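A minimal GraphX sketch, assuming a SparkContext sc; the vertices, edges, and tolerance value are made-up sample data.

```scala
import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
val graph    = Graph(vertices, edges)                 // property graph built from two RDDs

// Dynamic PageRank: iterate until the change per vertex drops below the tolerance
val ranks = graph.pageRank(0.001).vertices
ranks.collect().foreach(println)

// graph.staticPageRank(20) would instead run a fixed number of iterations
```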
10296
07:28:24,103 --> 07:28:25,796
What is lineage graph?
10297
07:28:26,000 --> 07:28:28,400
So the RDD lineage is very similar
10298
07:28:28,500 --> 07:28:32,800
to GraphX, how the
graph representation is. Every RDD,
10299
07:28:32,800 --> 07:28:33,800
Internally.
10300
07:28:33,800 --> 07:28:36,400
It will have the relation saying
10301
07:28:36,500 --> 07:28:39,157
how that particular
rdd got created.
10302
07:28:39,157 --> 07:28:42,725
And from where how
that got transformed argit is
10303
07:28:42,725 --> 07:28:44,700
how their got transformed.
10304
07:28:44,700 --> 07:28:47,600
So the complete lineage
or the complete history
10305
07:28:47,600 --> 07:28:50,587
or the complete path
will be recorded
10306
07:28:50,587 --> 07:28:51,900
within the lineage.
10307
07:28:52,100 --> 07:28:53,517
That will be used in case
10308
07:28:53,517 --> 07:28:56,400
if any particular partition
of the RDD is lost,
10309
07:28:56,400 --> 07:28:57,900
It can be regenerated.
10310
07:28:58,000 --> 07:28:59,899
Even if the complete
RDD is lost,
10311
07:28:59,899 --> 07:29:00,900
We can regenerate
10312
07:29:00,900 --> 07:29:03,149
so it will have the complete
information on what are
10313
07:29:03,149 --> 07:29:06,193
the partitions where it is
existing, what transformations
10314
07:29:06,193 --> 07:29:07,119
It had undergone.
10315
07:29:07,119 --> 07:29:08,747
What is the resultant and you
10316
07:29:08,747 --> 07:29:10,600
if anything is lost
in the middle,
10317
07:29:10,600 --> 07:29:12,511
it knows where to recalculate
10318
07:29:12,511 --> 07:29:16,400
from and what are essential
things needs to be recalculated.
10319
07:29:16,400 --> 07:29:19,817
It's going to save us a lot
of time, and if that RDD
10320
07:29:19,817 --> 07:29:21,762
is never being used, it will never
10321
07:29:21,762 --> 07:29:23,100
get recalculated.
10322
07:29:23,100 --> 07:29:26,500
So they recalculation also
triggers based on the action
10323
07:29:26,500 --> 07:29:27,799
only on need basis.
10324
07:29:27,799 --> 07:29:29,100
It will recalculate
10325
07:29:29,200 --> 07:29:32,500
that's why it's going
to use the memory optimally
10326
07:29:32,700 --> 07:29:36,087
does Apache spark provide
checkpointing officially
10327
07:29:36,087 --> 07:29:38,300
like the example
like a streaming
10328
07:29:38,600 --> 07:29:43,600
and if any data is lost within
that particular sliding window,
10329
07:29:43,600 --> 07:29:47,492
we cannot get back the data are
like the data will be lost
10330
07:29:47,492 --> 07:29:50,103
because Jim I'm making
a window of say 24
10331
07:29:50,103 --> 07:29:51,800
hours to do some averaging.
10332
07:29:51,800 --> 07:29:55,270
Each I'm making a sliding window
of 24 hours every 24 hours.
10333
07:29:55,270 --> 07:29:59,100
It will keep on getting slider
and if you lose any system
10334
07:29:59,100 --> 07:30:01,500
as in there is a complete
failure of the cluster.
10335
07:30:01,500 --> 07:30:02,562
I may lose the data
10336
07:30:02,562 --> 07:30:04,800
because it's all available
in the memory.
10337
07:30:04,900 --> 07:30:06,400
So how to recalculate
10338
07:30:06,400 --> 07:30:08,902
if the data system is lost
it follows something
10339
07:30:08,902 --> 07:30:10,100
called a checkpointing
10340
07:30:10,100 --> 07:30:12,831
so we can check point
the data and directly.
10341
07:30:12,831 --> 07:30:14,800
It's provided by the Spark API.
10342
07:30:14,800 --> 07:30:16,600
We have to just
provide the location
10343
07:30:16,600 --> 07:30:19,700
where it should get checked
pointed and you can read
10344
07:30:19,700 --> 07:30:23,200
that particular data back
when you Not the system again,
10345
07:30:23,200 --> 07:30:24,866
whatever the state it was
10346
07:30:24,866 --> 07:30:27,600
in be can regenerate
that particular data.
10347
07:30:27,700 --> 07:30:29,454
So yes to answer the question
10348
07:30:29,454 --> 07:30:32,300
straight: yes, Apache Spark
provides checkpointing,
10349
07:30:32,300 --> 07:30:35,300
and it will help us
to regenerate the state
10350
07:30:35,300 --> 07:30:37,010
what it was earlier.
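A minimal checkpointing sketch for streaming, assuming a SparkContext sc; the checkpoint directory is a placeholder and the DStream logic is elided.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(sc, Seconds(1))
  ssc.checkpoint("hdfs:///checkpoints/app")   // where state and metadata get checkpointed
  // ... define the DStream processing here ...
  ssc
}

// On restart, recover the earlier state from the checkpoint if it exists,
// otherwise build a fresh context.
val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/app", () => createContext())
```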
10351
07:30:37,200 --> 07:30:40,000
Let's move on to the next
component, Spark MLlib.
10352
07:30:40,300 --> 07:30:41,515
How is machine learning
10353
07:30:41,515 --> 07:30:44,600
implemented in spark
the machine learning again?
10354
07:30:44,600 --> 07:30:46,800
It's a very huge ocean by itself
10355
07:30:46,900 --> 07:30:49,800
and it's not a technology
specific to spark
10356
07:30:49,800 --> 07:30:51,800
Machine learning is
a common data science
10357
07:30:51,800 --> 07:30:55,235
It's a Set of data science work
where we have different type
10358
07:30:55,235 --> 07:30:57,983
of algorithms different
categories of algorithm,
10359
07:30:57,983 --> 07:31:01,100
like clustering regression
dimensionality reduction
10360
07:31:01,100 --> 07:31:02,100
or that we have
10361
07:31:02,300 --> 07:31:05,600
and all these algorithms
are most of the algorithms
10362
07:31:05,600 --> 07:31:08,070
have been implemented
in Spark, and Spark is
10363
07:31:08,070 --> 07:31:09,481
the preferred framework
10364
07:31:09,481 --> 07:31:12,910
or before preferred application
component to do the machine
10365
07:31:12,910 --> 07:31:14,500
learning algorithm nowadays
10366
07:31:14,500 --> 07:31:16,500
or machine learning
processing the reason
10367
07:31:16,500 --> 07:31:19,700
because most of the machine
learning algorithms needs
10368
07:31:19,700 --> 07:31:21,890
to be executed i3t real number.
10369
07:31:21,890 --> 07:31:25,000
Of times till we get
the optimal result maybe
10370
07:31:25,000 --> 07:31:27,700
like say twenty five
iterations are 58 iterations
10371
07:31:27,700 --> 07:31:29,900
or till we get
that specific accuracy.
10372
07:31:29,900 --> 07:31:33,100
You will keep on running
the processing again and again
10373
07:31:33,100 --> 07:31:36,092
and Spark is a very good fit
whenever you want to do
10374
07:31:36,092 --> 07:31:37,900
the processing again and again
10375
07:31:37,900 --> 07:31:40,400
because the data
will be available in memory.
10376
07:31:40,400 --> 07:31:43,600
I can read it faster store
the data back into the memory
10377
07:31:43,600 --> 07:31:44,700
again reach faster
10378
07:31:44,700 --> 07:31:47,500
and all this machine learning
algorithms have been provided
10379
07:31:47,500 --> 07:31:50,800
within Spark, in a separate
component called MLlib,
10380
07:31:50,900 --> 07:31:53,096
and within MLlib we
have other components
10381
07:31:53,096 --> 07:31:55,800
like feature Association
to extract the features.
10382
07:31:55,800 --> 07:31:58,575
You may be wondering
how they can process
10383
07:31:58,575 --> 07:32:02,600
the images the core thing about
processing a image or audio
10384
07:32:02,600 --> 07:32:04,922
or video is about
extracting the feature
10385
07:32:04,922 --> 07:32:08,363
and comparing the features,
how much they are related.
10386
07:32:08,363 --> 07:32:10,300
So that's where
vectors matrices all
10387
07:32:10,300 --> 07:32:13,500
that will come into picture
and we can have pipeline
10388
07:32:13,500 --> 07:32:16,144
of processing as well
to the processing
10389
07:32:16,144 --> 07:32:18,800
one then take the result
and do the processing
10390
07:32:18,800 --> 07:32:21,700
to and it has persistence
algorithm as well.
10391
07:32:21,700 --> 07:32:24,234
The result of it
the generator process
10392
07:32:24,234 --> 07:32:25,999
the result it can be persisted
10393
07:32:25,999 --> 07:32:27,010
and reloaded back
10394
07:32:27,010 --> 07:32:29,421
into the system to
continue the processing
10395
07:32:29,421 --> 07:32:32,245
from that particular Point
onwards next question.
10396
07:32:32,245 --> 07:32:34,605
What are categories
of machine learning machine
10397
07:32:34,605 --> 07:32:38,000
learning has different
categories available: supervised,
10398
07:32:38,000 --> 07:32:41,001
or unsupervised and
reinforced learning supervised
10399
07:32:41,001 --> 07:32:42,900
and surprised it's very popular
10400
07:32:43,200 --> 07:32:46,700
where we will know some
I'll give an example.
10401
07:32:47,200 --> 07:32:50,123
I'll know well
in advance what category
10402
07:32:50,123 --> 07:32:54,800
that belongs to Z. Want
to do a character recognition
10403
07:32:55,400 --> 07:32:57,185
while training the data,
10404
07:32:57,185 --> 07:33:01,800
I can give information saying
this particular image belongs
10405
07:33:01,800 --> 07:33:04,160
to this particular
category character
10406
07:33:04,160 --> 07:33:05,800
or this particular number
10407
07:33:05,800 --> 07:33:10,100
and I can train sometimes I
will not know well in advance
10408
07:33:10,100 --> 07:33:14,478
assume like I may have
different type of images
10409
07:33:14,700 --> 07:33:19,200
like it may have
cars bikes cat dog all that.
10410
07:33:19,400 --> 07:33:21,920
I want to know
how many category available.
10411
07:33:21,920 --> 07:33:25,279
No, I will not know well
in advance so I want to group it
10412
07:33:25,279 --> 07:33:26,900
how many category available
10413
07:33:26,900 --> 07:33:29,100
and then I'll
realize saying okay,
10414
07:33:29,100 --> 07:33:31,600
they're all this belongs
to a particular category.
10415
07:33:31,600 --> 07:33:33,800
I'll identify the pattern
within the category
10416
07:33:33,800 --> 07:33:36,333
and I'll give
a category named say
10417
07:33:36,333 --> 07:33:39,751
like all these images
belongs to boot category
10418
07:33:39,751 --> 07:33:41,300
on looks like a boat.
10419
07:33:41,500 --> 07:33:45,400
So leaving it to the system
by providing this value or not.
10420
07:33:45,400 --> 07:33:48,400
Let's say the cat is different
type of machine learning comes
10421
07:33:48,400 --> 07:33:49,503
into picture and
10422
07:33:49,503 --> 07:33:53,160
as such machine learning is
not specific to Spark. Spark is going
10423
07:33:53,160 --> 07:33:57,300
to help us to achieve to run
this machine learning algorithms
10424
07:33:57,400 --> 07:34:00,700
What are Spark MLlib
tools? MLlib is nothing
10425
07:34:00,700 --> 07:34:02,300
but machine learning library
10426
07:34:02,300 --> 07:34:03,700
or machine learning offering
10427
07:34:03,700 --> 07:34:07,200
within Spark, and it has a
number of algorithms implemented,
10428
07:34:07,200 --> 07:34:09,800
and it provides very
good feature to persist
10429
07:34:09,800 --> 07:34:12,306
the result generally
in machine learning.
10430
07:34:12,306 --> 07:34:14,509
We will generate
a model the pattern
10431
07:34:14,509 --> 07:34:17,089
of the data recorder
is a model the model
10432
07:34:17,089 --> 07:34:20,688
will be persisted in
different forms, like Parquet,
10433
07:34:20,688 --> 07:34:23,087
Quit I have
Through different forms,
10434
07:34:23,087 --> 07:34:26,700
it can be stored opposite
district and has methodologies
10435
07:34:26,700 --> 07:34:29,600
to extract the features
from a set of data.
10436
07:34:29,600 --> 07:34:31,353
I may have million images.
10437
07:34:31,353 --> 07:34:32,500
I want to extract
10438
07:34:32,500 --> 07:34:36,300
the common features available
within those millions of images
10439
07:34:36,300 --> 07:34:40,170
and other utilities
available to process to define
10440
07:34:40,170 --> 07:34:43,607
or like to define the seed
the randomizing it so
10441
07:34:43,607 --> 07:34:47,441
different utilities are
available as well as pipelines.
10442
07:34:47,441 --> 07:34:49,500
That's very specific to spark
10443
07:34:49,800 --> 07:34:53,300
where I can Channel
Arrange the sequence
10444
07:34:53,300 --> 07:34:56,700
of steps to be undergone by
the machine learning submission
10445
07:34:56,700 --> 07:34:58,100
learning one algorithm first
10446
07:34:58,100 --> 07:34:59,863
and then the result
of it will be fed
10447
07:34:59,863 --> 07:35:02,163
into a machine learning
algorithm to like that.
10448
07:35:02,163 --> 07:35:03,400
We can have a sequence
10449
07:35:03,400 --> 07:35:06,500
of execution and
that will be defined using
10450
07:35:06,500 --> 07:35:10,562
the pipelines. These are some notable
features of Spark MLlib.
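A minimal sketch of chaining steps with the spark.ml Pipeline API, which is one way to express the sequencing described above; the column names, stages, and the trainingDF DataFrame are assumed for illustration.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr        = new LogisticRegression().setMaxIter(10)

// Stage 1 feeds stage 2, which feeds the algorithm
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model    = pipeline.fit(trainingDF)                 // trainingDF: DataFrame with "text" and "label"

model.write.overwrite().save("hdfs:///models/demo")     // persist the fitted model for later reuse
```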
10451
07:35:11,000 --> 07:35:15,100
What are some popular algorithms
and utilities in Spark MLlib?
10452
07:35:15,500 --> 07:35:18,382
So these are some popular
algorithms like regression
10453
07:35:18,382 --> 07:35:22,000
classification basic statistics
recommendation system.
10454
07:35:22,000 --> 07:35:24,678
The recommendation system is
well implemented;
10455
07:35:24,678 --> 07:35:27,000
All we have to provide
is give the data.
10456
07:35:27,000 --> 07:35:30,579
If you give the ratings and
products within an organization,
10457
07:35:30,579 --> 07:35:32,400
if you have the complete dump,
10458
07:35:32,400 --> 07:35:35,800
we can build the recommendation
system in no time.
10459
07:35:35,800 --> 07:35:39,283
And if you give any user you
can give a recommendation.
10460
07:35:39,283 --> 07:35:41,600
These are the products
the user may like
10461
07:35:41,600 --> 07:35:42,500
and those products
10462
07:35:42,500 --> 07:35:45,900
can be displayed in the search
result recommendation system
10463
07:35:45,900 --> 07:35:48,017
really works on the basis
of the feedback
10464
07:35:48,017 --> 07:35:50,400
that we are providing
for the earlier products
10465
07:35:50,400 --> 07:35:51,500
that we had bought.
10466
07:35:51,600 --> 07:35:54,225
Bustling dimensionality
reduction whenever
10467
07:35:54,225 --> 07:35:57,300
we do transitioning
with the huge amount of data,
10468
07:35:57,600 --> 07:35:59,511
it's very very compute-intensive
10469
07:35:59,511 --> 07:36:01,900
and we may have
to reduce the dimensions,
10470
07:36:01,900 --> 07:36:03,752
especially the matrix dimensions
10471
07:36:03,752 --> 07:36:07,000
within them early
without losing the features.
10472
07:36:07,000 --> 07:36:09,538
What are the features
available without losing it?
10473
07:36:09,538 --> 07:36:11,308
We should reduce
the dimensionality
10474
07:36:11,308 --> 07:36:13,580
and there are
some algorithms available to do
10475
07:36:13,580 --> 07:36:16,660
that dimensionality reduction
and feature extraction.
10476
07:36:16,660 --> 07:36:19,486
So what are the common features
are features available
10477
07:36:19,486 --> 07:36:22,227
within that particular image
and I can Compare
10478
07:36:22,227 --> 07:36:23,300
what are the common
10479
07:36:23,300 --> 07:36:26,600
across common features
available within those images?
10480
07:36:26,600 --> 07:36:29,106
That's how we
will group those images.
10481
07:36:29,106 --> 07:36:29,716
So get me
10482
07:36:29,716 --> 07:36:32,900
whether this particular image
the person looking
10483
07:36:32,900 --> 07:36:35,300
like this image available
in the database or not.
10484
07:36:35,700 --> 07:36:37,524
For example,
assume the organization
10485
07:36:37,524 --> 07:36:40,600
or the police department crime
Department maintaining a list
10486
07:36:40,600 --> 07:36:44,400
of persons committed crime
and if we get a new photo
10487
07:36:44,400 --> 07:36:48,161
when they do a search they
may not have the exact photo bit
10488
07:36:48,161 --> 07:36:49,200
by bit the photo
10489
07:36:49,200 --> 07:36:51,600
might have been taken
with a different background.
10490
07:36:51,600 --> 07:36:55,000
Front lighting's different
locations different time.
10491
07:36:55,000 --> 07:36:57,754
So a hundred percent the data
will be different on bits
10492
07:36:57,754 --> 07:37:00,520
and bytes will be different
but look nice.
10493
07:37:00,520 --> 07:37:03,767
Yes, they are going to be seeing
so I'm going to search
10494
07:37:03,767 --> 07:37:05,100
the photo looking similar
10495
07:37:05,100 --> 07:37:07,500
to this particular
photograph as the input.
10496
07:37:07,500 --> 07:37:09,033
I'll provide to achieve
10497
07:37:09,033 --> 07:37:11,976
that we will be extracting
the features in each
10498
07:37:11,976 --> 07:37:13,000
of those photos.
10499
07:37:13,000 --> 07:37:15,717
We will extract the features
and we will try to match
10500
07:37:15,717 --> 07:37:17,697
the feature rather than the bits
10501
07:37:17,697 --> 07:37:21,015
and bytes and optimization as
well in terms of processing
10502
07:37:21,015 --> 07:37:22,200
or doing the piping.
10503
07:37:22,200 --> 07:37:25,100
There are a number of algorithms
to do the optimization.
10504
07:37:25,400 --> 07:37:27,000
Let's move on to spark SQL.
10505
07:37:27,100 --> 07:37:29,811
Is there a module
to implement SQL in Spark?
10506
07:37:29,811 --> 07:37:32,475
How does it work so
directly not the sequel
10507
07:37:32,475 --> 07:37:36,300
may be very similar to Hive;
whatever the structured data
10508
07:37:36,300 --> 07:37:37,300
that we have.
10509
07:37:37,400 --> 07:37:38,800
We can read the data
10510
07:37:38,800 --> 07:37:42,000
or extract the meaning
out of the data using SQL
10511
07:37:42,400 --> 07:37:44,600
and it exposes the APA
10512
07:37:44,700 --> 07:37:48,700
and we can use those API to read
the data or create data frames
10513
07:37:48,834 --> 07:37:51,065
and Spark SQL has four major
10514
07:37:51,500 --> 07:37:55,800
Degrees data source
data Frame data frame is
10515
07:37:55,800 --> 07:37:58,900
like the representation
of X and Y data
10516
07:37:59,300 --> 07:38:02,800
or like Excel data
multi-dimensional structure data
10517
07:38:03,000 --> 07:38:06,000
and abstract form
on top of dataframe.
10518
07:38:06,000 --> 07:38:08,541
I can do the
query and internally,
10519
07:38:08,541 --> 07:38:11,700
it has interpreter
and Optimizer any query
10520
07:38:11,700 --> 07:38:15,100
I fire that will
get interpreted or optimized
10521
07:38:15,100 --> 07:38:18,500
and get executed using
the SQL services and get
10522
07:38:18,500 --> 07:38:20,300
the data from the data frame
10523
07:38:20,300 --> 07:38:22,900
or it can read the data
from the data source
10524
07:38:22,900 --> 07:38:24,000
and do the processing.
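A minimal Spark SQL sketch covering the DataFrame, SQL query, and Parquet pieces discussed here; the SparkSession, paths, and column names are placeholders.

```scala
import org.apache.spark.sql.SparkSession

val spark  = SparkSession.builder().appName("sql-demo").getOrCreate()
val movies = spark.read.json("hdfs:///data/movies.json")     // structured source -> DataFrame

movies.createOrReplaceTempView("movies")
val topRated = spark.sql("SELECT title, rating FROM movies WHERE rating >= 4")

topRated.write.parquet("hdfs:///out/top_rated")               // columnar, compact Parquet output
val reloaded = spark.read.parquet("hdfs:///out/top_rated")    // read it back as a DataFrame
```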
10525
07:38:24,265 --> 07:38:26,034
What is a Parquet file?
10526
07:38:26,100 --> 07:38:27,800
It's a format of the file
10527
07:38:27,800 --> 07:38:30,361
where the data
in some structured form,
10528
07:38:30,361 --> 07:38:33,800
especially the result
of Spark SQL, can be stored
10529
07:38:33,800 --> 07:38:37,350
or returned in some persistence
and the packet again.
10530
07:38:37,350 --> 07:38:41,317
It is a open source from Apache
its data serialization technique
10531
07:38:41,317 --> 07:38:44,833
where we can serialize the data
using the Parquet format,
10532
07:38:44,833 --> 07:38:46,078
and to precisely say,
10533
07:38:46,078 --> 07:38:47,500
it's a columnar storage.
10534
07:38:47,500 --> 07:38:49,900
It's going to consume
less space it will use
10535
07:38:49,900 --> 07:38:51,200
the keys and values.
10536
07:38:51,300 --> 07:38:55,500
Store the data and also it helps
you to access a specific data
10537
07:38:55,500 --> 07:38:59,100
from that Parquet form
using a query. So Parquet:
10538
07:38:59,100 --> 07:39:02,200
It's another open source format
data serialization format
10539
07:39:02,200 --> 07:39:03,267
to store the data
10540
07:39:03,267 --> 07:39:04,900
on purses the data as well
10541
07:39:04,900 --> 07:39:08,700
as to retrieve the data. List
the functions of Spark SQL.
10542
07:39:08,700 --> 07:39:10,800
You can be used
to load the varieties
10543
07:39:10,800 --> 07:39:12,300
of structured data, of course,
10544
07:39:12,300 --> 07:39:15,600
yes, Spark SQL can work only
with structured data.
10545
07:39:15,600 --> 07:39:17,900
It can be used to load varieties
10546
07:39:17,900 --> 07:39:20,900
of structured data
and you can use SQL
10547
07:39:20,900 --> 07:39:23,600
like it's to query
against the program
10548
07:39:23,600 --> 07:39:25,000
and it can be used
10549
07:39:25,000 --> 07:39:27,839
with external tools to connect
to Spark as well.
10550
07:39:27,839 --> 07:39:30,400
It gives very good
the integration with the SQL
10551
07:39:30,400 --> 07:39:32,900
and using python
Java Scala code.
10552
07:39:33,000 --> 07:39:35,831
We can create an rdd
from the structure data
10553
07:39:35,831 --> 07:39:38,400
available, directly using
Spark SQL.
10554
07:39:38,400 --> 07:39:40,300
I can generate the RDD.
10555
07:39:40,500 --> 07:39:42,600
So it's going to
facilitate the people
10556
07:39:42,600 --> 07:39:46,400
from database background to make
the program faster and quicker.
10557
07:39:47,100 --> 07:39:48,100
Next question is
10558
07:39:48,100 --> 07:39:50,700
what do you understand
by lazy evaluation?
10559
07:39:50,900 --> 07:39:54,400
So whenever you do any operation
within the spark word,
10560
07:39:54,400 --> 07:39:57,281
it will not do the processing
immediately it look
10561
07:39:57,281 --> 07:40:00,100
for the final results
that we are asking for it.
10562
07:40:00,100 --> 07:40:02,000
If it doesn't ask
for the final result.
10563
07:40:02,000 --> 07:40:04,660
It doesn't need to do
the processing So based
10564
07:40:04,660 --> 07:40:07,200
on the final action
until we do the action.
10565
07:40:07,200 --> 07:40:08,990
There will not be
any Transformations.
10566
07:40:08,990 --> 07:40:11,700
I will there will not be
any actual processing happening.
10567
07:40:11,700 --> 07:40:13,141
It will just understand
10568
07:40:13,141 --> 07:40:15,900
what our Transformations
it has to do finally
10569
07:40:15,900 --> 07:40:18,900
if you ask The action
then in optimized way,
10570
07:40:18,900 --> 07:40:22,200
it's going to complete
the data processing and get
10571
07:40:22,200 --> 07:40:23,553
us the final result.
10572
07:40:23,553 --> 07:40:26,600
So to answer straight
lazy evaluation is doing
10573
07:40:26,600 --> 07:40:30,300
the processing only on need
of the resultant data;
10574
07:40:30,300 --> 07:40:32,100
The data is not required.
10575
07:40:32,100 --> 07:40:34,757
It's not going
to do the processing.
10576
07:40:34,757 --> 07:40:36,726
Can you use Spark to access
10577
07:40:36,726 --> 07:40:40,200
and analyze data stored
in Cassandra databases?
10578
07:40:40,200 --> 07:40:41,600
Yes, it is possible.
10579
07:40:41,600 --> 07:40:44,400
Okay, not only Cassandra
any of the nosql database it
10580
07:40:44,400 --> 07:40:46,100
can very well do the processing
10581
07:40:46,100 --> 07:40:49,700
Cassandra also works
in a distributed architecture;
10582
07:40:49,700 --> 07:40:51,200
It's a nosql database
10583
07:40:51,200 --> 07:40:53,800
so it can leverage
the data locality.
10584
07:40:53,800 --> 07:40:56,000
The query can
be executed locally
10585
07:40:56,000 --> 07:40:58,200
where the Cassandra
nodes are available.
10586
07:40:58,200 --> 07:41:01,100
It's going to make
the query execution faster
10587
07:41:01,100 --> 07:41:04,326
and reduce the network load
and Spark executors.
10588
07:41:04,326 --> 07:41:06,009
It will try to get started
10589
07:41:06,009 --> 07:41:08,242
or the spark executors
in the mission
10590
07:41:08,242 --> 07:41:10,600
where the Cassandra
notes are available
10591
07:41:10,600 --> 07:41:13,900
or data is available going
to do the processing locally.
10592
07:41:13,900 --> 07:41:16,450
So it's going to leverage
the data locality.
10593
07:41:16,450 --> 07:41:17,426
T next question,
10594
07:41:17,426 --> 07:41:19,500
how can you
minimize data transfers
10595
07:41:19,500 --> 07:41:21,200
when working with spark
10596
07:41:21,200 --> 07:41:23,636
If you ask about the core
design, the success
10597
07:41:23,636 --> 07:41:25,514
of the spark program depends on
10598
07:41:25,514 --> 07:41:28,300
how much you are reducing
the network transfer.
10599
07:41:28,300 --> 07:41:30,900
This network transfer
is a very costly operation
10600
07:41:30,900 --> 07:41:32,300
and it cannot be parallelized.
10601
07:41:32,400 --> 07:41:35,600
There are multiple ways,
especially two ways, to avoid this.
10602
07:41:35,600 --> 07:41:37,664
One is called
the broadcast variable
10603
07:41:37,664 --> 07:41:40,300
and the other is the accumulator.
A broadcast variable:
10604
07:41:40,300 --> 07:41:43,536
It will help us
to transfer any static data
10605
07:41:43,536 --> 07:41:46,428
or any information
we keep on publishing
10606
07:41:46,500 --> 07:41:48,300
to multiple systems.
10607
07:41:48,300 --> 07:41:49,300
So let's say
10608
07:41:49,300 --> 07:41:52,257
if any data has to be transferred
to multiple executors
10609
07:41:52,257 --> 07:41:53,500
to be used in common,
10610
07:41:53,500 --> 07:41:55,016
I can broadcast it.
10611
07:41:55,200 --> 07:41:58,800
And if I want to consolidate
the values produced
10612
07:41:58,800 --> 07:42:02,172
in multiple workers into
a single centralized location,
10613
07:42:02,172 --> 07:42:03,600
I can use an accumulator.
10614
07:42:03,600 --> 07:42:06,412
So these will help us to achieve
data consolidation
10615
07:42:06,412 --> 07:42:08,800
and data distribution
in the distributed world.
10616
07:42:08,800 --> 07:42:11,800
The APIs are at an abstract level
10617
07:42:11,800 --> 07:42:14,351
where we don't need
to do the heavy lifting
10618
07:42:14,351 --> 07:42:16,600
that's taken care of
by Spark for us.
10619
07:42:16,800 --> 07:42:19,275
What are broadcast
variables? Just now,
10620
07:42:19,275 --> 07:42:22,300
as we discussed, there is
some common value
10621
07:42:22,300 --> 07:42:23,200
that we need.
10622
07:42:23,200 --> 07:42:27,300
I want that to be available
in multiple executors
10623
07:42:27,300 --> 07:42:31,000
or multiple workers. A simple example:
you want to do a spell check
10624
07:42:31,000 --> 07:42:33,500
on tweet
comments. The dictionary
10625
07:42:33,500 --> 07:42:36,100
which has the right
list of words.
10626
07:42:36,200 --> 07:42:37,800
I'll have the complete list.
10627
07:42:37,800 --> 07:42:40,300
I want that particular
dictionary to be available
10628
07:42:40,300 --> 07:42:41,400
in each executor
10629
07:42:41,400 --> 07:42:43,944
so that the tasks
that are running locally
10630
07:42:43,944 --> 07:42:46,600
in those executors can refer
to that particular
10631
07:42:46,600 --> 07:42:49,900
dictionary and get the processing
done by avoiding
10632
07:42:49,900 --> 07:42:51,616
the network data transfer.
10633
07:42:51,616 --> 07:42:55,485
So the process of distributing
the data from the Spark context
10634
07:42:55,485 --> 07:42:56,500
to the executors
10635
07:42:56,500 --> 07:42:58,700
where the tasks are going
to run is achieved
10636
07:42:58,700 --> 07:43:00,400
using broadcast variables
10637
07:43:00,400 --> 07:43:03,952
and it is built into the
Spark API. Using this API,
10638
07:43:03,952 --> 07:43:06,000
we can create
the broadcast variable,
10639
07:43:06,200 --> 07:43:09,500
and the process of distributing and
making this data available
10640
07:43:09,500 --> 07:43:13,524
in all executors is taken care of
by the Spark framework.
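A minimal Scala sketch of the spell-check example with a broadcast variable, assuming an existing SparkContext sc; the word list and tweets are made up:

// Ship the dictionary once to every executor instead of with every task.
val dictionary = Set("spark", "scala", "cluster", "stream")
val bcDict = sc.broadcast(dictionary)

val tweets = sc.parallelize(Seq("spark strem", "scala cluster"))

// Each task reads the local broadcast copy; no repeated network transfer.
val misspelled = tweets
  .flatMap(_.split("\\s+"))
  .filter(word => !bcDict.value.contains(word))

misspelled.collect().foreach(println)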
10641
07:43:13,524 --> 07:43:15,000
Explain accumulators in Spark.
10642
07:43:15,100 --> 07:43:18,500
Similar to the way we
have broadcast variables,
10643
07:43:18,500 --> 07:43:21,290
we have accumulators
as well. A simple example:
10644
07:43:21,290 --> 07:43:25,100
you want to count how many
error records are available
10645
07:43:25,100 --> 07:43:26,600
in the distributed environment
10646
07:43:26,800 --> 07:43:28,400
as your data is distributed
10647
07:43:28,400 --> 07:43:31,300
across multiple systems
and multiple executors.
10648
07:43:31,400 --> 07:43:34,784
Each executor will do
the processing and count
10649
07:43:34,784 --> 07:43:37,200
the records locally.
10650
07:43:37,200 --> 07:43:38,978
I may want the total count.
10651
07:43:38,978 --> 07:43:42,600
So what I will do is ask Spark
to maintain an accumulator;
10652
07:43:42,600 --> 07:43:45,250
of course, it will be maintained
in the Spark context
10653
07:43:45,250 --> 07:43:48,500
in the driver program.
The driver program is going
10654
07:43:48,500 --> 07:43:50,100
to be one per application.
10655
07:43:50,100 --> 07:43:52,200
The value will keep on
getting accumulated,
10656
07:43:52,200 --> 07:43:54,900
and whenever I want I
can read those values
10657
07:43:54,900 --> 07:43:57,100
and take any appropriate action.
10658
07:43:57,200 --> 07:44:00,300
So, more or less, the
accumulators and broadcast variables
10659
07:44:00,300 --> 07:44:01,600
look opposite to each other,
10660
07:44:02,000 --> 07:44:03,800
but the purpose
is totally different.
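A minimal Scala sketch of counting error records with an accumulator, assuming an existing SparkContext sc; the log lines are made up:

// The accumulator lives in the driver; executors only add to it.
val errorCount = sc.longAccumulator("errors")

val logs = sc.parallelize(Seq("INFO ok", "ERROR disk", "ERROR net", "INFO ok"))

logs.foreach { line =>
  if (line.startsWith("ERROR")) errorCount.add(1)   // partial counts are merged for us
}

println(s"total errors = ${errorCount.value}")       // read the total on the driver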
10661
07:44:04,200 --> 07:44:06,531
Why is there a need
for broadcast variables
10662
07:44:06,531 --> 07:44:10,400
when working with Apache Spark?
It's a read-only variable
10663
07:44:10,400 --> 07:44:13,800
and it will be cached in memory
in a distributed fashion
10664
07:44:13,800 --> 07:44:15,789
and it eliminates the work
10665
07:44:15,789 --> 07:44:19,012
of moving the data
from a centralized location
10666
07:44:19,012 --> 07:44:20,400
that is, the Spark driver,
10667
07:44:20,400 --> 07:44:24,200
or from a particular program
to all the executors
10668
07:44:24,200 --> 07:44:26,830
within the cluster where
the tasks are going to get executed.
10669
07:44:26,830 --> 07:44:29,700
We don't need to worry about
where the task will get executed
10670
07:44:29,700 --> 07:44:31,100
within the cluster.
10671
07:44:31,100 --> 07:44:32,138
So when compared
10672
07:44:32,138 --> 07:44:34,900
with accumulators,
broadcast variables
10673
07:44:34,900 --> 07:44:37,256
are going to support
only read operations.
10674
07:44:37,256 --> 07:44:38,903
The executors cannot change
10675
07:44:38,903 --> 07:44:41,100
the value; they can only
read those values.
10676
07:44:41,100 --> 07:44:44,900
They cannot update it, so mostly
it will be used like a cache
10677
07:44:44,900 --> 07:44:47,400
for lookup data.
Next question:
10678
07:44:47,400 --> 07:44:50,327
how can you trigger
automatic clean-ups in Spark
10679
07:44:50,327 --> 07:44:52,300
to handle accumulated metadata.
10680
07:44:52,700 --> 07:44:54,500
So there is a parameter
10681
07:44:54,500 --> 07:44:57,900
that we can set, the TTL; the clean-up
will get triggered along
10682
07:44:57,900 --> 07:45:00,900
with the running jobs,
and in between
10683
07:45:00,900 --> 07:45:04,000
it's going to write the intermediate
results into the disk,
10684
07:45:04,000 --> 07:45:07,155
or clean up unnecessary data,
or clean the RDDs
10685
07:45:07,155 --> 07:45:08,600
that are not being used.
10686
07:45:08,600 --> 07:45:09,800
The least used RDDs
10687
07:45:09,800 --> 07:45:10,987
will get cleaned,
10688
07:45:10,987 --> 07:45:14,800
and this keeps the metadata as
well as the memory clean.
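A small Scala sketch of that idea; note that spark.cleaner.ttl is the classic setting from older Spark releases, while recent versions clean unreferenced RDDs and metadata automatically, so take this as illustrative only:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cleanup-sketch")
  .set("spark.cleaner.ttl", "3600")   // seconds; periodic metadata clean-up (older releases)
val sc = new SparkContext(conf)

// Explicitly dropping an RDD you no longer need also keeps memory clean.
val cached = sc.parallelize(1 to 100).cache()
cached.count()
cached.unpersist()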
10689
07:45:14,800 --> 07:45:17,800
What are the various levels
of persistence in Apache Spark?
10690
07:45:17,800 --> 07:45:20,200
When you say data
should be stored in memory,
10691
07:45:20,200 --> 07:45:23,000
it can be done in different ways;
you can persist it
10692
07:45:23,000 --> 07:45:27,100
so it can be in memory only,
or memory and disk, or disk only,
10693
07:45:27,200 --> 07:45:30,500
and when it is getting stored
we can ask it to store it
10694
07:45:30,500 --> 07:45:31,800
in a serialized form.
10695
07:45:31,900 --> 07:45:35,300
So the reason why we may store
or persist it is this:
10696
07:45:35,303 --> 07:45:36,996
I want this particular
10697
07:45:37,100 --> 07:45:40,200
RDD in this form to be read back
10698
07:45:40,200 --> 07:45:42,038
later for use, so I can read it
10699
07:45:42,038 --> 07:45:45,200
back. Maybe I may not need
it very immediately.
10700
07:45:45,400 --> 07:45:48,477
So I don't want that to keep
occupying my memory.
10701
07:45:48,477 --> 07:45:50,400
I'll write it to the hard disk
10702
07:45:50,400 --> 07:45:52,700
and I'll read it back
whenever there is a need.
10703
07:45:52,700 --> 07:45:55,300
I'll read it back.
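A minimal Scala sketch of the persistence levels just described, assuming an existing SparkContext sc:

import org.apache.spark.storage.StorageLevel

val data = sc.parallelize(1 to 100000)

// Keep it in memory only (same as cache()).
data.persist(StorageLevel.MEMORY_ONLY)

// Other levels mentioned above:
// data.persist(StorageLevel.MEMORY_AND_DISK)     // spill to disk when memory is short
// data.persist(StorageLevel.DISK_ONLY)           // disk only
// data.persist(StorageLevel.MEMORY_ONLY_SER)     // stored in serialized form

data.count()        // first action materializes the persisted copy
data.unpersist()    // release it when no longer needed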
10704
07:45:55,300 --> 07:45:58,069
The next question: what do you
understand by a SchemaRDD?
10705
07:45:58,200 --> 07:46:01,900
So a SchemaRDD is used
mainly within Spark SQL.
10706
07:46:01,900 --> 07:46:05,300
The RDD will have the meta
information built into it.
10707
07:46:05,300 --> 07:46:07,919
It will have the schema
as well, very similar to
10708
07:46:07,919 --> 07:46:10,642
what we have as a database
schema: the structure
10709
07:46:10,642 --> 07:46:11,976
of the particular data
10710
07:46:11,976 --> 07:46:14,994
and when I have a structure, it
will be easy for me
10711
07:46:14,994 --> 07:46:16,081
to handle the data,
10712
07:46:16,081 --> 07:46:19,100
so the data and the structure
will exist together
10713
07:46:19,100 --> 07:46:20,360
in the SchemaRDD.
10714
07:46:20,360 --> 07:46:20,550
Now,
10715
07:46:20,550 --> 07:46:22,100
it's called a DataFrame
10716
07:46:22,100 --> 07:46:25,009
in Spark, and the DataFrame
term is very popular
10717
07:46:25,009 --> 07:46:27,616
in languages like R
and other languages.
10718
07:46:27,616 --> 07:46:28,700
It's very popular.
10719
07:46:28,700 --> 07:46:31,700
So it's going to have the data
and the meta information
10720
07:46:31,700 --> 07:46:34,700
about that data: what columns
it has and what its structure is.
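A minimal Scala sketch of a DataFrame carrying both the data and its schema, assuming an existing SparkSession named spark (as in spark-shell); the rows are made up:

import spark.implicits._

case class Person(name: String, age: Int)

val df = Seq(Person("Asha", 31), Person("Ravi", 27)).toDF()

df.printSchema()   // the meta information: column names and types
df.show()          // the data itself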
10721
07:46:34,700 --> 07:46:36,300
Explain a scenario
10722
07:46:36,300 --> 07:46:38,656
where you will be
using Spark Streaming.
10723
07:46:38,656 --> 07:46:41,200
Say you may want to do
a sentiment analysis
10724
07:46:41,200 --> 07:46:44,200
of tweets; so the tweets
will be streamed,
10725
07:46:44,400 --> 07:46:49,200
so we will use Flume or a similar tool
to harvest the information
10726
07:46:49,300 --> 07:46:52,700
from Twitter and feed it
into Spark Streaming.
10727
07:46:52,700 --> 07:46:56,300
It will extract or identify
the sentiment of each
10728
07:46:56,300 --> 07:46:58,300
and every tweet and mark
10729
07:46:58,300 --> 07:47:00,899
whether it is positive
or negative and accordingly
10730
07:47:00,899 --> 07:47:02,900
the data will be
structured data:
10731
07:47:02,900 --> 07:47:03,700
the tweet ID,
10732
07:47:03,700 --> 07:47:05,742
whether it is positive
or negative, maybe the
10733
07:47:05,742 --> 07:47:06,856
percentage of positive
10734
07:47:06,856 --> 07:47:09,088
and percentage of negative
sentiment, and store it
10735
07:47:09,088 --> 07:47:10,500
in some structured form.
10736
07:47:10,500 --> 07:47:14,111
Then you can leverage Spark
SQL and do grouping
10737
07:47:14,111 --> 07:47:16,403
or filtering based
on the sentiment,
10738
07:47:16,403 --> 07:47:19,587
and maybe I can use
a machine learning algorithm to find
10739
07:47:19,587 --> 07:47:22,107
what drives that
particular tweet to be
10740
07:47:22,107 --> 07:47:23,500
in the negative side.
10741
07:47:23,500 --> 07:47:26,700
Is there any similarity between
all these negative-sentiment
10742
07:47:26,700 --> 07:47:28,812
tweets? Maybe they are specific
10743
07:47:28,812 --> 07:47:32,700
to a product, or a specific time
at which the tweet was tweeted,
10744
07:47:32,700 --> 07:47:34,421
or from a specific region
10745
07:47:34,421 --> 07:47:36,900
that the tweet was sent from.
Those analyses
10746
07:47:36,900 --> 07:47:40,194
could be done by leveraging
the MLlib of Spark.
10747
07:47:40,194 --> 07:47:43,700
So MLlib, Streaming, and Core are
all going to work together.
10748
07:47:43,700 --> 07:47:45,200
All these are like different
10749
07:47:45,200 --> 07:47:48,500
offerings available to
solve different problems.
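A minimal Scala sketch of the streaming part of that pipeline. It assumes tweets arrive as text lines on a local socket (standing in for a Flume/Twitter source) and uses a toy word list instead of a real sentiment model:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("tweet-sentiment-sketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))

val negativeWords = Set("bad", "broken", "slow")   // toy sentiment lexicon

val tweets = ssc.socketTextStream("localhost", 9999)
val labelled = tweets.map { text =>
  val negative = text.split("\\s+").exists(w => negativeWords.contains(w.toLowerCase))
  (text, if (negative) "negative" else "positive")
}

// Structured output per batch; it could be stored and queried with Spark SQL.
labelled.print()

ssc.start()
ssc.awaitTermination()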
10750
07:47:48,600 --> 07:47:51,100
So with this we are coming
to the end of this interview
10751
07:47:51,100 --> 07:47:53,100
questions discussion on Spark.
10752
07:47:53,100 --> 07:47:54,465
I hope you all enjoyed.
10753
07:47:54,465 --> 07:47:56,913
I hope it was a constructive
and useful one.
10754
07:47:56,913 --> 07:47:59,600
More information
about Edureka is available
10755
07:47:59,600 --> 07:48:02,183
on the Edureka website,
10756
07:48:02,183 --> 07:48:05,900
and keep visiting the website
for blogs and latest updates.
10757
07:48:05,900 --> 07:48:07,000
Thank you folks.
10758
07:48:07,500 --> 07:48:10,400
I hope you have enjoyed
listening to this video.
10759
07:48:10,400 --> 07:48:12,450
Please be kind enough to like it
10760
07:48:12,450 --> 07:48:15,600
and you can comment any
of your doubts and queries
10761
07:48:15,600 --> 07:48:17,078
and we will reply to them
10762
07:48:17,078 --> 07:48:20,923
at the earliest. Do look out
for more videos in our playlist
10763
07:48:20,923 --> 07:48:24,105
And subscribe to Edureka
channel to learn more.
10764
07:48:24,105 --> 07:48:25,100
Happy learning.