
Apache Spark - Basic to Advanced with Scala

This Apache Spark training covers Spark at the basic, intermediate, and advanced levels, including its implementation and application development. The training is built around practical, real-time data sets and analytics, and develops an understanding of Big Data, data ingestion, streaming, and processing. Scala is the primary language of the program: Python may appear in occasional examples, but the core examples and implementations are provided in Scala only. The language is secondary when learning Spark, since the course covers both the fundamentals and the advanced concepts of Spark itself.

Created by KPI Consulting. Last updated: Mon, 08-Jun-2020. Language: English.
What will I learn?
  • How to work with practical, real-time data sets and analytics while building an understanding of Big Data, data ingestion, streaming, and processing.
  • The fundamentals and advanced concepts of Spark, with Scala as the primary language. Python may appear in occasional examples, but the core examples and implementations are in Scala only; the language itself is secondary to learning Spark.

Curriculum for this course


Training Summary:

Training Mode: 80% hands-on, with reference data sets, working applications, and architecture

Implement a complete e-commerce use case with a data lake (PostgreSQL DB, Hadoop HDFS (optional), Kafka, and Spark).

The use case involves Million+ products (Amazon dataset), millions of ratings and reviews, and 2000+ categories. Orders are simulated for streaming, and the order throughput can range from low volume to thousands of orders per second.

Intermediate & Advanced Level

Number of Days: 3 days or 24 hours

Scala 2.12

An introduction to Scala is given, and Scala is used for the demonstrations. Spark features are covered in full; Scala is only the language chosen for the RDD, Dataset, DataFrame, Streaming, and SQL implementations.

Python vs Scala: Python examples can be given; however, the Python portion will be less than 10% of the training. Scala is used for 90% of the use cases that call the Spark API.

Software Pre-requisite:

IntelliJ Community Edition with Scala support, or Scala Eclipse IDE

JDK Version: Java JDK 1.8, 64-bit

Scala Version

Hadoop Version

Kafka Version

PostgreSQL 11

Apache Spark Introduction

Apache Spark Introduction

Spark Architecture

Big Data Introduction

In Memory Data Model

Distributed Computing



Spark Driver introduction

How Spark works with Java, Scala, Python, and R

Spark Setup

Setting up Spark in standalone mode

Java JDK

Spark Development Environment (Java/Scala)

Spark REPL




Scala Intermediate Topics

Why Scala for Spark?

Scala in other frameworks

Introduction to Scala REPL

Basic Scala operations

Variable Types in Scala

Control Structures in Scala

Foreach loop, Functions and Procedures

Collections in Scala: Array, ArrayBuffer, Map, Tuples, Lists, and more

Class in Scala


Getters and Setters

Extending a Class

Overriding Methods

Traits as Interfaces and Layered Traits

Functional Programming

Higher Order Functions

Anonymous Functions, and more
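
As an illustration of the higher-order and anonymous function topics above, here is a minimal plain-Scala sketch. It needs no Spark dependency. The object name, the `applyTwice` and `multiplier` helpers, and the sample prices are hypothetical and chosen only for demonstration.

```scala
object HigherOrderDemo {
  // A higher-order function: it takes another function `f` as a parameter
  // and applies it twice to the input.
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  // A higher-order function that returns a function.
  def multiplier(factor: Int): Int => Int = (n: Int) => n * factor

  // Applies a 10% discount using integer arithmetic.
  def discount(prices: List[Int]): List[Int] = prices.map(p => p - p / 10)

  def main(args: Array[String]): Unit = {
    val prices = List(120, 45, 300, 80) // hypothetical sample data

    // Anonymous functions passed to the collection's higher-order methods.
    val expensive = prices.filter(_ > 100)
    val total     = prices.foldLeft(0)(_ + _)

    println(discount(prices))              // List(108, 41, 270, 72)
    println(expensive)                     // List(120, 300)
    println(total)                         // 545
    println(applyTwice(multiplier(3), 2))  // 18
  }
}
```

The same `map`, `filter`, and fold style carries over to Spark, where RDD and Dataset transformations also take functions as arguments.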

Scala and SBT


Scala SBT Setup

IDE setup

Spark Architecture

Elements and Features of Spark

Resilient Distributed Datasets (RDD)

Data Frames


Driver Application

MapReduce

Interacting with MapReduce

Spark Shell

Spark in Standalone

Spark in Distributed mode

Spark with Hadoop and YARN

Overview of Functional programming


Data State


Functional programming advantages

Higher order functions

Stateless data processing over distributed network


Spark RDD

Deep dive into Spark RDDs

Creating RDDs

RDD Data Loading

RDD partitioning

RDD Transformation & functions

Cache intermediate RDDs

The RDD general operations

A read-only partitioned collection of records

RDD for faster and efficient data processing

RDD actions and functions: Collect, Count, Collection Map, List

Save RDD results as text files

Pair RDD functions

RDD Lineage

Key-value pairs in RDDs

Spark MapReduce with RDD

Spark internals with RDD, immutability

RDD Persistence

RDD persistence overview

Spark execution flow & Spark terminology

Distributed shared memory vs. RDD

RDD limitations

Distributed persistence

RDD lineage

Key/value pairs for sorting, with implicit conversions that provide countByKey, reduceByKey, sortByKey, and aggregateByKey
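
The key/value operations listed above can be sketched roughly as follows. This is a minimal example, assuming Spark is on the classpath and running in local mode. The object name, the `averageByKey` helper, and the (category, rating) sample pairs are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object PairRddDemo {
  // aggregateByKey builds a (sum, count) pair per key; the average follows from it.
  def averageByKey(ratings: RDD[(String, Int)]): RDD[(String, Double)] =
    ratings
      .aggregateByKey((0, 0))(
        (acc, v) => (acc._1 + v, acc._2 + 1),   // merge a value into a partition-local accumulator
        (a, b) => (a._1 + b._1, a._2 + b._2)    // merge accumulators across partitions
      )
      .mapValues { case (sum, n) => sum.toDouble / n }

  def main(args: Array[String]): Unit = {
    // Local mode is an assumption so that the sketch runs without a cluster.
    val spark = SparkSession.builder()
      .appName("PairRddDemo")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical (category, rating) pairs.
    val ratings = sc.parallelize(Seq(
      ("books", 4), ("toys", 5), ("books", 2), ("games", 3), ("toys", 1)
    ))

    // countByKey is an action: it returns a Map[String, Long] to the driver.
    println(ratings.countByKey())

    // reduceByKey is a transformation: values that share a key are merged.
    println(ratings.reduceByKey(_ + _).collect().toMap)

    // sortByKey orders the pairs by key before they are collected.
    averageByKey(ratings).sortByKey().collect().foreach(println)

    spark.stop()
  }
}
```

These methods are not defined on `RDD` itself; Spark makes them available on an `RDD` of pairs through an implicit conversion to `PairRDDFunctions` and `OrderedRDDFunctions`.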


Spark SQL Context

Working with SQL Context

Read Data from PostgreSQL

Write Data to PostgreSQL

Data Frame and Spark SQL

Spark SQL Overview

Spark SQL Architecture

SQL Context in Spark SQL

Data Frames & Datasets

Architecture of Data Frameworks


JSON support in Spark SQL, working with XML data

Parquet files

Creating HiveContext

Writing Data Frame to Hive

Reading JDBC sources

Understanding the Data Frames in Spark

Creating Data Frames, manually inferring the schema

Working with CSV files

Reading JDBC tables

Data Frame to JDBC

User defined functions in Spark SQL

Shared variables and accumulators

Learning to query and transform data in Data Frames

How Data Frames provide the benefits of both Spark RDD and Spark SQL

Deploying Hive on Spark as the execution engine

Advanced Partitioning

Learning about the scheduling and partitioning in Spark

Hash partition, range partition

Scheduling within and around applications

Static partitioning

Dynamic sharing

Fair scheduling

Map partition with index

Zip and GroupByKey

Spark master high availability

Standby Masters with Zookeeper

Single Node Recovery With Local File System

Higher-order functions
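
The hash partitioning, range partitioning, and map-partitions-with-index topics in this section can be explored with a sketch like the one below. It assumes Spark on the classpath in local mode. The object name, the `layout` helper, and the order data are hypothetical.

```scala
import org.apache.spark.{HashPartitioner, RangePartitioner}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object PartitioningDemo {
  // Lists the keys held by each partition. mapPartitionsWithIndex exposes the
  // partition index alongside the iterator over that partition's records.
  def layout(rdd: RDD[(Int, String)]): Map[Int, List[Int]] =
    rdd
      .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.map(_._1).toList)))
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    // Local mode is an assumption so that the sketch runs without a cluster.
    val spark = SparkSession.builder()
      .appName("PartitioningDemo")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical (orderId, label) pairs.
    val orders = sc.parallelize((1 to 20).map(id => (id, s"order-$id")))

    // Hash partitioning: the partition is the key's hashCode modulo the partition count.
    val hashed = orders.partitionBy(new HashPartitioner(4))

    // Range partitioning: keys are sampled so each partition holds a contiguous key range.
    val ranged = orders.partitionBy(new RangePartitioner(4, orders))

    println(s"hash:  ${layout(hashed)}")
    println(s"range: ${layout(ranged)}")

    spark.stop()
  }
}
```

Printing both layouts side by side shows the difference: hash partitioning scatters neighbouring keys across partitions, while range partitioning keeps them together, which is what sorted output relies on.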

Spark Streaming

Spark Streaming introduction

Batch Processing

Micro Batch

Window and Time Slice

Spark Streaming architecture

Create Streams

Create a simple Spark Streaming application

Stream operations

Apply Stream operations

Use Spark SQL to query Streams

Define window operations

Describe how Streams are fault-tolerant

Monitor Spark Application

Use the SparkUI to monitor a Spark application

Debug and tune Spark applications


Spark Catalyst

Understanding Spark Catalyst Engine

Spark Query Planning

Spark Logical, Physical plans, Optimized plans
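
One way to look at the logical, optimized, and physical plans that Catalyst produces is the `explain` method on a Data Frame. The sketch below assumes Spark on the classpath in local mode. The object name, the `expensiveNames` helper, and the product rows are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CatalystPlanDemo {
  // Builds a small query whose plans can be inspected.
  def expensiveNames(spark: SparkSession): DataFrame = {
    import spark.implicits._
    // Hypothetical product data.
    val products = Seq(
      (1, "laptop", 900.0), (2, "mouse", 20.0), (3, "monitor", 250.0)
    ).toDF("id", "name", "price")
    products.filter($"price" > 100).select($"name")
  }

  def main(args: Array[String]): Unit = {
    // Local mode is an assumption so that the sketch runs without a cluster.
    val spark = SparkSession.builder()
      .appName("CatalystPlanDemo")
      .master("local[*]")
      .getOrCreate()

    val query = expensiveNames(spark)

    // explain(true) prints the parsed, analyzed, and optimized logical plans,
    // followed by the physical plan.
    query.explain(true)

    // The same plans are also reachable programmatically.
    println(query.queryExecution.optimizedPlan)
    println(query.queryExecution.executedPlan)

    spark.stop()
  }
}
```

Nothing is executed by `explain`; the plans are produced from the query definition alone, so this is a cheap way to check what Catalyst has done before running a job.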

Spark MLlib [Optional, introductory level]

What is Machine Learning?

Where is Machine Learning Used?

Different Types of Machine Learning Techniques

Understanding MLlib

Distributed Architecture for MLlib

Features of MLlib and MLlib Tools

Various ML algorithms supported by MLlib

Kafka and Spark

Kafka Overview

Integrating Kafka Streams into Spark

Kafka Connect

Ingest from Kafka Stream

Spark to Kafka Stream


Hadoop Integration

HDFS File System Access

Read Directories from Hadoop

Create Directories

Read/Write Files


Detailed Introduction to YARN

Running YARN with Spark Cluster

YARN Resource Management and Optimization

Spark Performance Tuning



Memory Management



Java Optimization

About the instructor
This workshop is delivered by one of our leading industry faculty, with 10 to 15+ years of industry and training experience.

KPI Consulting is one of the fastest-growing e-learning and consulting firms (with 1000+ tech workshops), providing objective-based, innovative, and effective learning solutions across the entire spectrum of technical and domain skills.
