This Apache Spark training covers basic, intermediate, and advanced Spark, including its implementation and application development. The training is built around practical, real-time data sets and analytics, and develops an understanding of Big Data, data ingestion, streaming, and processing. Scala is the primary language; a few examples are shown in Python, but the core examples and implementation are provided in Scala only. The language is secondary when learning Spark, as we cover the fundamental and advanced concepts of Spark.
Training Summary:
Training Mode | 80% hands-on with reference data sets, working applications, and architecture |
Use Case | Implement an e-commerce use case with a data lake (PostgreSQL DB, Hadoop HDFS (optional), Kafka, and complete Spark). The use case involves a million-plus products (Amazon dataset), millions of ratings and reviews, and 2000+ categories. Orders are simulated for streaming; the order throughput can range from low volume to 1000s of orders per second. |
Level | Intermediate & Advanced Level |
Number of Days | 3 days or 24 hours |
Language | Scala 2.12 |
Scala | An introduction to Scala is given, and Scala is used for demonstration. 100% of Spark features are covered; Scala is only the language choice for the RDD, Dataset, DataFrame, Streaming, and SQL implementations. |
Python vs Scala | Python examples can be given; however, the Python portion shall be less than 10% of the training. Scala shall be used for 90% of the use cases when calling the Spark API. |
Software Pre-requisite:
IDE | IntelliJ Community Edition with Scala support or Scala Eclipse IDE |
JDK Version | Java JDK 1.8 64 Bit |
Scala Version | 2.12 |
Java | 1.8 |
Hadoop Version | 2.7.x |
Spark | 2.4 |
Kafka Version | 2.0 |
Database | PostgreSQL 11 |
Basic |
Apache Spark Introduction |
Apache Spark introduction, Spark architecture, Big Data introduction, in-memory data model, distributed computing, analytics, Map/Reduce, Spark driver introduction, how Spark works with the Java/Scala/Python/R languages |
Spark Setup |
Setting up Spark in standalone mode, Java JDK, Spark development environment (Java/Scala), Spark REPL |
Scala |
Scala intermediate topics; why Scala for Spark?; Scala in other frameworks; introduction to the Scala REPL; basic Scala operations; variable types in Scala; control structures in Scala; foreach loop, functions, and procedures; collections in Scala (Array, ArrayBuffer, Map, Tuples, Lists, and more); classes in Scala; objects; getters and setters; extending a class; overriding methods; traits as interfaces and layered traits; functional programming; higher-order functions, anonymous functions, and more |
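A few of the Scala topics listed above (layered traits, collections, higher-order and anonymous functions) can be illustrated with a minimal sketch. All names here (`Pricing`, `WithDiscount`, `WithTax`, `Order`, `ScalaBasics`) are hypothetical and only the Scala standard library is used; prices are kept in integer cents to avoid floating-point rounding.

```scala
import scala.collection.mutable.ArrayBuffer

// A trait used as an interface with a default implementation.
trait Pricing {
  def price(baseCents: Long): Long = baseCents
}

// Layered traits: each layer adjusts the result of the layer below it via super.
trait WithDiscount extends Pricing {
  override def price(baseCents: Long): Long = super.price(baseCents) - 500
}

trait WithTax extends Pricing {
  override def price(baseCents: Long): Long = super.price(baseCents) * 110 / 100
}

// Traits are applied right to left: the discount is taken first, then tax is added.
class Order extends Pricing with WithDiscount with WithTax

object ScalaBasics {
  // A higher-order function: it takes another function as a parameter.
  def applyToAll(xs: List[Long], f: Long => Long): List[Long] = xs.map(f)

  def main(args: Array[String]): Unit = {
    println(new Order().price(10500L)) // prints 11000

    // ArrayBuffer is a growable, mutable collection.
    val quantities = ArrayBuffer(1L, 2L)
    quantities += 3L

    // An anonymous function passed to the higher-order function.
    println(applyToAll(quantities.toList, x => x * 2)) // prints List(2, 4, 6)

    // An immutable Map filtered with a pattern-matching anonymous function.
    val stock = Map("book" -> 3, "pen" -> 0)
    println(stock.filter { case (_, qty) => qty > 0 }.keys.toList) // prints List(book)
  }
}
```

The same style of passing small functions to collection methods carries over directly to the Spark RDD and Dataset APIs.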
Scala and SBT |
Scala SBT setup, IDE setup |
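As a rough illustration of the SBT setup, a `build.sbt` for the versions named in the prerequisites (Scala 2.12, Spark 2.4, Kafka, PostgreSQL) might look like the sketch below. The project name and the exact patch versions are assumptions, not part of the course material; adjust them to whatever the training environment actually uses.

```scala
// build.sbt (sketch; project name and patch versions are assumptions)
name := "spark-training"
version := "0.1.0"
scalaVersion := "2.12.10"

val sparkVersion = "2.4.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"           % sparkVersion,
  "org.apache.spark" %% "spark-sql"            % sparkVersion,
  "org.apache.spark" %% "spark-streaming"      % sparkVersion,
  "org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion, // Kafka source/sink for Spark SQL
  "org.postgresql"   %  "postgresql"           % "42.2.5"      // PostgreSQL JDBC driver
)
```

The `%%` operator appends the Scala binary version (`_2.12`) to the artifact name, which is why the Java-only PostgreSQL driver uses a single `%`.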
Spark Architecture |
Elements and features of Spark, Resilient Distributed Datasets (RDD), Data Frames, Stream, Driver application, Map Reduce, interactive with Map Reduce, Spark shell, Spark in standalone mode, Spark in distributed mode, Spark with Hadoop and YARN |
Overview of Functional programming |
Immutable data state, threading, functional programming advantages, higher-order functions, stateless data processing over a distributed network |
RDD |
Spark RDD; deep dive into Spark RDDs; creating RDDs; RDD data loading; RDD partitioning; RDD transformations and functions; caching intermediate RDDs; general RDD operations; the RDD as a read-only partitioned collection of records; RDDs for faster and more efficient data processing; RDD actions and functions for collect, count, collection map, and list; saving RDD results as text files; pair RDD functions; RDD lineage; key-value pairs in RDDs; Spark MapReduce with RDDs; Spark internals with RDDs; immutability |
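The RDD topics above (creating an RDD, lazy transformations, caching, actions, lineage) could be sketched as follows. This is a minimal example that assumes Spark 2.4 on the classpath and a local master; the object name, sample data, and commented output path are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RddBasics {
  // Transformations (filter, map) are lazy: they only extend the lineage.
  def highRatings(ratings: RDD[Int]): RDD[(Int, Int)] =
    ratings.filter(_ >= 4).map(r => (r, 1))

  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM; on a cluster the master is supplied at submit time.
    val conf = new SparkConf().setAppName("rdd-basics").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Create an RDD from a local collection, split into 4 partitions.
      val ratings = sc.parallelize(Seq(5, 3, 4, 1, 5, 2), numSlices = 4)

      // Mark the intermediate RDD for caching so later actions can reuse it.
      val high = highRatings(ratings).cache()

      println(high.count())          // action: triggers the computation, prints 3
      println(high.collect().toList) // action: brings the records to the driver
      println(high.toDebugString)    // shows the lineage of the RDD

      // high.saveAsTextFile("out/high-ratings") // hypothetical output path
    } finally sc.stop()
  }
}
```

Because RDDs are immutable, each transformation returns a new RDD, and the lineage is what allows a lost partition to be recomputed.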
RDD Persistence |
RDD persistence overview, Spark execution flow and Spark terminology, distributed shared memory vs. RDD, RDD limitations, distributed persistence, RDD lineage, key/value pairs for sorting, implicit conversions that enable countByKey, reduceByKey, sortByKey, and aggregateByKey |
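The key/value operations named above could be sketched as below, assuming Spark 2.4 and a local master. The object name and the product/rating sample data are hypothetical. The pair functions become available on any `RDD[(K, V)]` through an implicit conversion provided by Spark.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object PairRddOps {
  // Average per key, using aggregateByKey with a (sum, count) accumulator.
  def averageByKey(pairs: RDD[(String, Int)]): RDD[(String, Double)] =
    pairs
      .aggregateByKey((0, 0))(
        (acc, v) => (acc._1 + v, acc._2 + 1),  // merge a value into a partition-local accumulator
        (a, b) => (a._1 + b._1, a._2 + b._2))  // merge accumulators across partitions
      .mapValues { case (sum, n) => sum.toDouble / n }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pair-rdd").setMaster("local[*]"))
    try {
      val ratings = sc.parallelize(Seq(("p1", 5), ("p2", 3), ("p1", 4), ("p3", 1), ("p2", 5)))

      // Persist with an explicit storage level: memory first, spilling to disk.
      ratings.persist(StorageLevel.MEMORY_AND_DISK)

      // Sum per key, then sort by key: p1 -> 9, p2 -> 8, p3 -> 1
      println(ratings.reduceByKey(_ + _).sortByKey().collect().toList)

      // countByKey is an action that returns a local Map to the driver.
      println(ratings.countByKey())

      // Averages per key: p1 -> 4.5, p2 -> 4.0, p3 -> 1.0
      println(averageByKey(ratings).sortByKey().collect().toList)
    } finally sc.stop()
  }
}
```

`reduceByKey` and `aggregateByKey` combine values within each partition before shuffling, which is usually why they are preferred over `groupByKey` for aggregations.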
Intermediate |
Spark SQL Context |
Working with SQL Context, reading data from PostgreSQL, writing data to PostgreSQL |
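Reading from and writing to PostgreSQL could look roughly like the sketch below. It assumes Spark 2.4 with the PostgreSQL JDBC driver on the classpath; the JDBC URL, table names, and credentials are placeholders. In Spark 2.x, `SparkSession` is the usual entry point, and the older `SQLContext` remains reachable as `spark.sqlContext`.

```scala
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object PostgresIo {
  // JDBC connection properties; user and password are supplied by the caller.
  def connectionProps(user: String, password: String): Properties = {
    val props = new Properties()
    props.setProperty("user", user)
    props.setProperty("password", password)
    props.setProperty("driver", "org.postgresql.Driver")
    props
  }

  // Load a whole table as a DataFrame.
  def readTable(spark: SparkSession, url: String, table: String, props: Properties): DataFrame =
    spark.read.jdbc(url, table, props)

  // Append the rows of a DataFrame to a table.
  def writeTable(df: DataFrame, url: String, table: String, props: Properties): Unit =
    df.write.mode(SaveMode.Append).jdbc(url, table, props)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("postgres-io").master("local[*]").getOrCreate()
    try {
      // Hypothetical database, tables, and credentials.
      val url = "jdbc:postgresql://localhost:5432/ecommerce"
      val props = connectionProps("spark_user", "change-me")

      val products = readTable(spark, url, "products", props)
      products.printSchema()
      writeTable(products.limit(10), url, "products_sample", props)
    } finally spark.stop()
  }
}
```

Running `main` requires a reachable PostgreSQL instance with a `products` table; the helper functions themselves do not depend on any particular schema.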
Data Frame and Spark SQL |
Spark SQL overview; Spark SQL architecture; SQL Context in Spark SQL; Data Frames and Datasets; architecture of data frameworks; JSON support in Spark SQL; working with XML data and Parquet files; creating a HiveContext; writing Data Frames to Hive; reading JDBC data; understanding Data Frames in Spark; creating Data Frames; manually inferring the schema; working with CSV files; reading JDBC tables; writing Data Frames to JDBC; user-defined functions in Spark SQL; shared variables and accumulators; querying and transforming data in Data Frames; how Data Frames provide the benefits of both Spark RDDs and Spark SQL; deploying Hive on Spark as the execution engine |
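Creating a Data Frame, applying a user-defined function, and querying it with SQL could be sketched as follows, assuming Spark 2.4 and a local master. The `Review` case class, the labelling rule, and the view name are all hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object DataFrameSketch {
  // The case class fields become the DataFrame columns (schema inferred by reflection).
  case class Review(productId: String, rating: Int)

  // Add a "label" column computed by a user-defined function.
  def labelled(spark: SparkSession, reviews: Seq[Review]): DataFrame = {
    import spark.implicits._
    val toLabel = udf((rating: Int) => if (rating >= 4) "positive" else "other")
    reviews.toDF().withColumn("label", toLabel(col("rating")))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dataframe-sketch").master("local[*]").getOrCreate()
    try {
      val df = labelled(spark, Seq(Review("p1", 5), Review("p2", 2), Review("p1", 4)))
      df.printSchema()

      // Register the DataFrame as a temporary view and query it with SQL.
      df.createOrReplaceTempView("reviews")
      spark.sql("SELECT label, COUNT(*) AS n FROM reviews GROUP BY label").show()

      // The same DataFrame API reads files, e.g. CSV with a header (hypothetical path):
      // val csv = spark.read.option("header", "true").csv("data/reviews.csv")
    } finally spark.stop()
  }
}
```

Built-in column functions are generally preferable to UDFs where they exist, since the optimizer cannot look inside a UDF.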
Advanced Partitioning |
Scheduling and partitioning in Spark; hash partitioning and range partitioning; scheduling within and across applications; static partitioning; dynamic sharing; fair scheduling; map partitions with index; zip and groupByKey; Spark master high availability; standby masters with ZooKeeper; single-node recovery with the local file system; higher-order functions |
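Hash partitioning, range partitioning, and mapping over partitions with their index could be sketched as below, assuming Spark 2.4 and a local master. The object name and the customer/order sample data are hypothetical.

```scala
import org.apache.spark.{HashPartitioner, RangePartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PartitionSketch {
  // Count the records in each partition, keyed by partition index.
  def partitionSizes(rdd: RDD[(String, Int)]): Map[Int, Int] =
    rdd
      .mapPartitionsWithIndex((index, records) => Iterator((index, records.size)))
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partitions").setMaster("local[*]"))
    try {
      val orders = sc.parallelize((1 to 100).map(i => (s"customer-${i % 10}", i)))

      // Hash partitioning: records with the same key land in the same partition.
      val byHash = orders.partitionBy(new HashPartitioner(4))
      println(byHash.partitioner)
      println(partitionSizes(byHash))

      // Range partitioning: keys are sampled and split into sorted, contiguous ranges.
      val byRange = orders.partitionBy(new RangePartitioner(4, orders))
      println(partitionSizes(byRange))
    } finally sc.stop()
  }
}
```

Printing the sizes per partition is a simple way to spot skew, where a few partitions hold most of the records.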
Spark Streaming |
Spark Streaming introduction, batch processing, micro-batches, windows and time slices, Spark Streaming architecture, creating streams, creating a simple Spark Streaming application, stream operations, applying stream operations, using Spark SQL to query streams, defining window operations, how streams are fault-tolerant |
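A simple Spark Streaming application with micro-batches and a window operation could be sketched as below. It assumes Spark 2.4 with the `spark-streaming` module and uses the DStream API. To stay self-contained it feeds the stream from an in-memory queue rather than a real source; the object name, categories, and durations are hypothetical.

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  // Turn an order's category into a (category, 1) pair for counting.
  def toPair(category: String): (String, Int) = (category, 1)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1)) // 1-second micro-batches

    // Each RDD pushed onto the queue is consumed as one micro-batch.
    val queue = new mutable.Queue[RDD[String]]()
    val orders = ssc.queueStream(queue)

    // Count orders per category over a 4-second window that slides every 2 seconds.
    val perCategory = orders
      .map(toPair)
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(4), Seconds(2))
    perCategory.print()

    ssc.start()
    // Simulate five seconds of incoming orders.
    for (_ <- 1 to 5) {
      queue.synchronized {
        queue += ssc.sparkContext.makeRDD(Seq("books", "toys", "books"))
      }
      Thread.sleep(1000)
    }
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}
```

The window and slide durations must be multiples of the batch interval; in a real deployment the queue would be replaced by a source such as Kafka or a socket.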
Monitor Spark Application |
Use the Spark UI to monitor a Spark application, debug and tune Spark applications |
Advanced |
Spark Catalyst |
Understanding the Spark Catalyst engine, Spark query planning, Spark logical plans, physical plans, and optimized plans |
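One way to look at the plans Catalyst produces is to ask a Data Frame to explain itself. The sketch below assumes Spark 2.4 and a local master; the query and the object name are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object CatalystSketch {
  // A small query: keep the even ids below 1000 and double them.
  def evenDoubled(spark: SparkSession): DataFrame =
    spark.range(0, 1000).toDF("id")
      .filter(col("id") % 2 === 0)
      .select((col("id") * 2).as("doubled"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("catalyst-sketch").master("local[*]").getOrCreate()
    try {
      val df = evenDoubled(spark)

      // Prints the parsed, analyzed, and optimized logical plans and the physical plan.
      df.explain(true)

      // The individual plans are also available programmatically.
      println(df.queryExecution.optimizedPlan)
      println(df.queryExecution.executedPlan)
    } finally spark.stop()
  }
}
```

Comparing the analyzed and optimized logical plans for a query is a quick way to see which rewrites the optimizer applied.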
Spark MLlib [Optional, introductory level] |
What is machine learning?; where is machine learning used?; different types of machine learning techniques; understanding MLlib; distributed architecture for MLlib; features of MLlib and MLlib tools; various ML algorithms supported by MLlib |
Kafka and Spark |
Kafka overview, integrating Kafka streams into Spark, Kafka Connect, ingesting from a Kafka stream, writing from Spark to a Kafka stream |
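Ingesting from Kafka and writing back to Kafka could be sketched with Structured Streaming as below. It assumes Spark 2.4 with the `spark-sql-kafka-0-10` module on the classpath and a reachable broker. The broker address, topic names, and checkpoint path are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object KafkaSketch {
  // Connection options shared by the Kafka source and the Kafka sink.
  def kafkaOptions(servers: String): Map[String, String] =
    Map("kafka.bootstrap.servers" -> servers)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-sketch").master("local[*]").getOrCreate()
    val servers = "localhost:9092" // hypothetical broker address

    // Source: subscribe to a topic; key and value arrive as binary columns.
    val orders = spark.readStream
      .format("kafka")
      .options(kafkaOptions(servers))
      .option("subscribe", "orders") // hypothetical input topic
      .load()
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

    // Sink: write the key/value columns to another topic.
    val query = orders.writeStream
      .format("kafka")
      .options(kafkaOptions(servers))
      .option("topic", "orders-processed") // hypothetical output topic
      .option("checkpointLocation", "/tmp/spark-training/checkpoints/orders")
      .start()

    query.awaitTermination()
  }
}
```

The Kafka sink expects a `value` column (and optionally `key`), and a checkpoint location is required so the query can recover its offsets after a restart.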
Hadoop |
Hadoop integration, HDFS file system access, reading directories from Hadoop, creating directories, reading/writing files |
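Creating directories and reading or writing files through the Hadoop `FileSystem` API could be sketched as follows. It assumes the Hadoop client libraries are on the classpath (they come with Spark); the object name and paths are hypothetical. With no cluster configuration present, `FileSystem.get` falls back to the local file system, so the same code can be tried without HDFS.

```scala
import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

object HdfsSketch {
  // Write a text file, overwriting it if it already exists.
  def writeText(fs: FileSystem, file: Path, text: String): Unit = {
    val out = fs.create(file, true)
    try out.write(text.getBytes(StandardCharsets.UTF_8)) finally out.close()
  }

  // Read a whole text file into a String.
  def readText(fs: FileSystem, file: Path): String = {
    val in = fs.open(file)
    try Source.fromInputStream(in, "UTF-8").mkString finally in.close()
  }

  // List the names of the entries in a directory.
  def listNames(fs: FileSystem, dir: Path): Seq[String] =
    fs.listStatus(dir).map(_.getPath.getName).toSeq

  def main(args: Array[String]): Unit = {
    // Picks up fs.defaultFS from core-site.xml when one is on the classpath.
    val fs = FileSystem.get(new Configuration())
    val dir = new Path("/tmp/spark-training/orders") // hypothetical directory

    fs.mkdirs(dir)
    writeText(fs, new Path(dir, "sample.txt"), "order-1,books\n")
    println(listNames(fs, dir))
    println(readText(fs, new Path(dir, "sample.txt")))
  }
}
```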
YARN |
Detailed introduction to YARN, running YARN with a Spark cluster, YARN resource management and optimization |
Spark Performance Tuning |
Partitions, YARN, memory management, executors, cores, Java optimization |
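The tuning areas above map onto Spark configuration properties. The sketch below shows where such settings go, assuming Spark 2.4. Every value is an illustrative placeholder rather than a recommendation, since suitable numbers depend on the cluster and the workload.

```scala
import org.apache.spark.SparkConf

object TuningSketch {
  // All values below are placeholders chosen only to show the shape of the configuration.
  def tunedConf(): SparkConf =
    new SparkConf()
      .setAppName("tuning-sketch")
      .set("spark.executor.instances", "4")       // number of executors (e.g. on YARN)
      .set("spark.executor.cores", "4")           // concurrent tasks per executor
      .set("spark.executor.memory", "8g")         // JVM heap per executor
      .set("spark.executor.memoryOverhead", "1g") // off-heap overhead requested from YARN
      .set("spark.sql.shuffle.partitions", "64")  // partitions used by DataFrame shuffles
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC") // JVM-level tuning

  def main(args: Array[String]): Unit =
    // Print the effective settings; the master is expected to come from spark-submit.
    tunedConf().getAll.sortBy(_._1).foreach { case (k, v) => println(s"$k=$v") }
}
```

The same properties can be passed on the command line with `spark-submit --conf key=value`, which avoids hard-coding cluster-specific values in the application.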
KPI Consulting is one of the fastest-growing e-learning and consulting firms (with 1000+ tech workshops), providing objective-based, innovative, and effective learning solutions for the entire spectrum of technical and domain skills.