Course outline

Why Spark?

  • Problems with Traditional Large-Scale Systems

  • Introducing Spark

Spark Basics

  • What is Apache Spark?

  • Using the Spark Shell

  • Resilient Distributed Datasets (RDDs)

  • Functional Programming with Spark

Working with RDDs

  • RDD Operations

  • Key-Value Pair RDDs

  • MapReduce and Pair RDD Operations

The Hadoop Distributed File System

  • Why HDFS?

  • HDFS Architecture

  • Using HDFS

Running Spark on a Cluster

  • Overview

  • A Spark Standalone Cluster

  • The Spark Standalone Web UI

Parallel Programming with Spark

  • RDD Partitions and HDFS Data Locality

  • Working with Partitions

  • Executing Parallel Operations

Caching and Persistence

  • RDD Lineage

  • Caching Overview

  • Distributed Persistence

Writing Spark Applications

  • Spark Applications vs. Spark Shell

  • Creating the SparkContext

  • Configuring Spark Properties

  • Building and Running a Spark Application

  • Logging

Spark, Hadoop, and the Enterprise Data Center

  • Overview

  • Spark and the Hadoop Ecosystem

  • Spark and MapReduce

Spark Streaming

  • Spark Streaming Overview

  • Example: Streaming Word Count

  • Other Streaming Operations

  • Sliding Window Operations

  • Developing Spark Streaming Applications

Common Spark Algorithms

  • Iterative Algorithms

  • Graph Analysis

  • Machine Learning

Improving Spark Performance

  • Shared Variables: Broadcast Variables

  • Shared Variables: Accumulators

  • Common Performance Issues

Spark in the Cloud

  • Spark on OpenStack (Sahara Plugin)

  • Spark on AWS EC2

  • Spark on AWS S3

Spark on Hadoop

  • Spark on YARN

  • Spark on HDFS

Apache Spark

A hot technology to be working in right now!

Course details:

Duration: 4-6 weeks (4 hours a week)

Assignments: Every week

Support: Lifetime

Project: 3-6 months

Most of the projects will be placed on: