Hadoop Architecture

This course includes a hands-on project that runs throughout; here we use a “Book Store” application as the project.

**The project is subject to change; we adapt it to the audience’s background and interests.**

Big Data Overview

  • Traditional tools vs. Big Data tools

  • How is ETL done in Big Data?

  • What are the infrastructure tools?

  • Cloud infrastructure

  • Physical Infrastructure

  • Hybrid model

  • What is the Big Data ecosystem?

  • Hadoop / Cassandra / Apache Spark

  • NoSQL (MongoDB, HBase, etc.)

Hadoop Overview

  • Parallel Computing vs. Distributed Computing

  • RDBMS/SQL vs. Hadoop

  • Hadoop Architecture (V1 and V2): Name Node, Data Node, Job Tracker, Task Tracker, YARN

  • Vendor Comparison (Cloudera, Hortonworks, MapR, Amazon EMR)

  • Use cases

Planning Cluster

  • General Planning Considerations

  • Choosing The Right Hardware

  • Network Considerations

  • Configuring Nodes

HDFS Deep Dive

  • Name Node architecture (Edit Log, FsImage, location of replicas)

  • Secondary Name Node architecture

  • Data Node architecture

  • Write Pipeline

  • Read Pipeline

  • Heartbeats, Data Node commissioning/decommissioning, Rack Awareness, Block Scanner, Balancer, Trash, Health Check, Safe Mode

  • HDFS Federation (next gen)

  • HDFS HA (next gen)

  • HDFS Benchmarking

  • Exploring the HDFS Apache Web UI

  • Exploring the Cloudera Web UI for HDFS functions

  • LAB #1: HDFS commands using Hadoop cluster

Data Ingestion Tools

  • Flume

  • Sqoop

  • Kafka

  • LAB #2: Book Store: Ingest unstructured data using Flume

  • LAB #3: Book Store: Ingest the Books tables from MySQL using Sqoop

  • LAB #4: Book Store: Stream data using Kafka


Compute Using Map Reduce

  • Map Reduce Architecture (Gen-1)

  • Job Tracker/Task Tracker

  • Map Reduce Architecture (next gen)

  • YARN: Resource Manager, Node Manager, Application Master

  • How a client submits a next-gen MR job

  • Thinking in the Map Reduce way

  • Hadoop Streaming (with Python)

  • Combiner; Shuffle, Sort & Partition

  • Speculative Execution

  • Distributed Cache

  • Serialization and File-Based Data Structures

  • Input/output formats

  • Job Scheduling (FIFO, Fair Scheduler, Capacity Scheduler)

  • Counters

  • Exploring the Apache Map Reduce Web UI

  • LAB #5: Book Store: Data Analysis using Java
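The Hadoop Streaming topic above can be sketched in plain Python. Below is a minimal word-count mapper/reducer pair plus a single-machine driver that simulates the shuffle with a sort; the function names and the local driver are illustrative, not part of any Hadoop API.

```python
from itertools import groupby

def mapper(lines):
    """Emit one tab-separated "word<TAB>1" pair per input word."""
    for line in lines:
        for word in line.strip().split():
            yield f"{word.lower()}\t1"

def reducer(pairs):
    """Sum counts per word. Input must arrive sorted by key,
    which is exactly what Hadoop's shuffle/sort phase guarantees."""
    keyed = (pair.split("\t") for pair in pairs)
    for word, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

def local_run(lines):
    """Single-machine stand-in for map -> shuffle (sort) -> reduce."""
    return list(reducer(sorted(mapper(lines))))
```

On a real cluster the same mapper/reducer logic would read `sys.stdin` and be submitted via the hadoop-streaming jar; the local driver here only mimics the map → sort → reduce contract for study purposes.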


Hive & Impala

  • Philosophy and architecture

  • Hive vs. RDBMS

  • HiveQL and Hive Shell

  • Managing tables

  • Data types and schemas

  • Querying data

  • Partitions and Buckets

  • Intro to User Defined Functions

  • Hive Query Optimization

  • LAB#6: Book Store: Data analysis with Hive



Pig

  • Philosophy and architecture

  • Why Pig?

  • Pig Latin and the Grunt shell

  • Loading and analyzing structured/unstructured data

  • Data types and schemas

  • Pig Latin details: structure, functions, expressions, and relational operators

  • Intro to User Defined Functions and Scripts

  • LAB #7: Book Store: Data analysis with PIG



HBase

  • Architecture

  • Versions and origins

  • HBase vs. RDBMS

  • Master and Region Servers

  • Intro to Zookeeper

  • Data Modeling

  • Column Families and Regions

  • Bloom Filters and Block Indexes

  • Write Pipeline/ Read Pipeline

  • Catalog Tables

  • Compactions

  • Time series data processing using HBase (OpenTSDB)

  • LAB #8: Book Store: Data ingestion using Sqoop

  • LAB #9: Book Store: Data analysis with HBase
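The Bloom Filters bullet above deserves a concrete sketch: HBase keeps a Bloom filter per HFile so a read can skip files that definitely do not contain a row key. The class below is a simplified stand-in (fixed bit count, SHA-256-based hashing), not HBase's actual implementation.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter sketch. A membership test can return a
    false positive, but never a false negative, which is why HBase
    can safely skip an HFile when the filter says "absent"."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive several bit positions from one key by salting the hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))
```

In HBase, Bloom filters are enabled per column family (row or row+column granularity) and consulted before a block index lookup ever touches disk.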


Hadoop Security

  • Why Hadoop Security Is Important

  • Hadoop’s Security System Concepts

  • What Kerberos Is and How It Works

  • Securing a Hadoop Cluster with Kerberos

  • LAB #10: Enable Kerberos on the cluster


Workflow (Oozie)

  • How to define workflows?

  • LAB #11: Book Store: Workflow


BI Tools

  • Visualization

  • LAB #12: Book Store: Use Tableau to visualize


Introduction to Apache Spark

  • Introduction

  • Parallel computing

  • Introduction to Spark SQL

  • Introduction to Spark Streaming

  • Exercises using Scala and Python
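Spark's execution model (split data into partitions, transform each partition independently, then combine partial results) can be sketched without Spark itself. The helper names below are illustrative; a real exercise would use PySpark's `rdd.map(...).reduce(...)`.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def partition(data, num_partitions):
    """Split data into roughly equal chunks, as Spark splits an RDD."""
    size = max(1, len(data) // num_partitions)
    return [data[i:i + size] for i in range(0, len(data), size)]

def parallel_sum_of_squares(data, num_partitions=4):
    """Map each partition independently, then combine partial results."""
    parts = partition(data, num_partitions)
    with ThreadPoolExecutor() as pool:
        # Each partition is processed on its own worker, like Spark tasks.
        partials = list(pool.map(lambda p: sum(x * x for x in p), parts))
    return reduce(lambda a, b: a + b, partials, 0)
```

The point of the sketch is the shape of the computation: per-partition work is embarrassingly parallel, and only the small partial results cross worker boundaries, which is what makes the model scale.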


Introduction to Data Science

  • Data Science Fundamentals

  • Interface to R programming


Course Details

Price: $1999

Duration: 4-6 weeks (2 hrs per week)

Assignments: Every week

Support: Lifetime

Project: 3-6 months

Most projects will be placed on: