This Hadoop Programming on the Hortonworks Data Platform training course introduces the students to Apache Hadoop and key Hadoop ecosystem projects: Pig, Hive, Sqoop, Oozie, HBase, and Spark.

This intensive training course uses lectures and hands-on labs that help students learn theoretical knowledge and gain practical experience of Apache Hadoop and related Apache projects.

Topics:

Hadoop Ecosystem Overview
MapReduce
Pig Scripting Platform
Apache Hive
Apache Sqoop
Apache HBase
Spark
Spark SQL

Hadoop Programming on the Hortonworks Data Platform

Course Code

GTBD3H

Duration

5 Days

Course Fee

POA

Accreditation

N/A

Target Audience

Business Analysts, IT Architects, Technical Managers and Developers

Attendee Requirements

Participants should have the general knowledge of programming in Java and SQL as well as experience working in Unix environments (e.g. running shell commands, etc.)

Private Team Training is available for this course

Private Team Training | Contact us

Ways to Attend this Course

Private Training

+353 1 402 9423

hello@guruteamirl.com

Expand all

Course Description

This Hadoop Programming on the Hortonworks Data Platform training course introduces the students to Apache Hadoop and key Hadoop ecosystem projects: Pig, Hive, Sqoop, Oozie, HBase, and Spark.

This intensive training course uses lectures and hands-on labs that help students learn theoretical knowledge and gain practical experience of Apache Hadoop and related Apache projects.

Topics:

Hadoop Ecosystem Overview
MapReduce
Pig Scripting Platform
Apache Hive
Apache Sqoop
Apache HBase
Spark
Spark SQL

Course Outline

Chapter 1. MapReduce Overview

The Client – Server Processing Pattern
Distributed Computing Challenges
MapReduce Defined
Google's MapReduce
The Map Phase of MapReduce
The Reduce Phase of MapReduce
MapReduce Explained
MapReduce Word Count Job
MapReduce Shared-Nothing Architecture
Similarity with SQL Aggregation Operations
Example of Map & Reduce Operations using JavaScript
Problems Suitable for Solving with MapReduce
Typical MapReduce Jobs
Fault-tolerance of MapReduce
Distributed Computing Economics
MapReduce Systems

Chapter 2. Hadoop Overview

Apache Hadoop
Apache Hadoop Logo
Typical Hadoop Applications
Hadoop Clusters
Hadoop Design Principles
Hadoop Versions
Hadoop's Main Components
Hadoop Simple Definition
Side-by-Side Comparison: Hadoop 1 and Hadoop 2
Hadoop-based Systems for Data Analysis
Other Hadoop Ecosystem Projects
Hadoop Caveats
Hadoop Distributions
Cloudera Distribution of Hadoop (CDH)
Cloudera Distributions
Hortonworks Data Platform (HDP)
MapR

Chapter 3. Hadoop Distributed File System Overview

Hadoop Distributed File System (HDFS)
HDFS High Availability
HDFS "Fine Print"
Storing Raw Data in HDFS
Hadoop Security
HDFS Rack-awareness
Data Blocks
Data Block Replication Example
HDFS NameNode Directory Diagram
Accessing HDFS
Examples of HDFS Commands
Other Supported File Systems
WebHDFS
Examples of WebHDFS Calls
Client Interactions with HDFS for the Read Operation
Read Operation Sequence Diagram
Client Interactions with HDFS for the Write Operation
Communication inside HDFS

Chapter 4. MapReduce with Hadoop

Hadoop's MapReduce
MapReduce 1 and MapReduce 2
Why do I need Discussion of the Old MapReduce?
MapReduce v1 ("Classic MapReduce")
JobTracker and TaskTracker (the "Classic MapReduce")
YARN (MapReduce v2)
YARN vs MR1
YARN As Data Operating System
MapReduce Programming Options
Java MapReduce API
The Structure of a Java MapReduce Program
The Mapper Class
The Reducer Class
The Driver Class
Compiling Classes
Running the MapReduce Job
The Structure of a Single MapReduce Program
Combiner Pass (Optional)
Hadoop's Streaming MapReduce
Python Word Count Mapper Program Example
Python Word Count Reducer Program Example
Setting up Java Classpath for Streaming Support
Streaming Use Cases
The Streaming API vs Java MapReduce API
Amazon Elastic MapReduce
Apache Tez

Chapter 5. Apache Pig Scripting Platform

What is Pig?
Pig Latin
Apache Pig Logo
Pig Execution Modes
Local Execution Mode
MapReduce Execution Mode
Running Pig
Running Pig in Batch Mode
What is Grunt?
Pig Latin Statements
Pig Programs
Pig Latin Script Example
SQL Equivalent
Differences between Pig and SQL
Statement Processing in Pig
Comments in Pig
Supported Simple Data Types
Supported Complex Data Types
Arrays
Defining Relation's Schema
Not Matching the Defined Schema
The bytearray Generic Type
Using Field Delimiters
Loading Data with TextLoader()
Referencing Fields in Relations

Chapter 6. Apache Pig HDFS Interface

The HDFS Interface
FSShell Commands (Short List)
Grunt's Old File System Commands

Chapter 7. Apache Pig Relational and Eval Operators

Pig Relational Operators
Example of Using the JOIN Operator
Example of Using the Order By Operator
Caveats of Using Relational Operators
Pig Eval Functions
Caveats of Using Eval Functions (Operators)
Example of Using Single-column Eval Operations
Example of Using Eval Operators For Global Operations

Chapter 8. Apache Pig Miscellaneous Topics

Utility Commands
Handling Compression
User-Defined Functions
Filter UDF Skeleton Code

Chapter 9. Apache Pig Performance

Apache Pig Performance
Performance Enhancer - Use the Right Schema Type
Performance Enhancer - Apply Data Filters
Use the PARALLEL Clause
Examples of the PARALLEL Clause
Performance Enhancer - Limiting the Data Sets
Displaying Execution Plan
Compress the Results of Intermediate Jobs
Example of Running Pig with LZO Compression Codec

Chapter 10. Apache Oozie

What is Oozie?
Apache Oozie Logo
Oozie Terminology
Directed Acyclic Graph
Oozie Job Types
Oozie Architecture
Oozie Configuration
Oozie Workflows
The Flow Of Oozie Workflows
More on Oozie Workflows
Oozie Workflow Control Nodes
More Oozie Workflow Control Nodes
A Workflow Example
A More Complex Workflow Example
Oozie Coordinator
The Pig Action Template
The Pig Action Template

Chapter 11. Hive

What is Hive?
Apache Hive Logo
Hive's Value Proposition
Who uses Hive?
Hive's Main Sub-Systems
Hive Features
The "Classic" Hive Architecture
The New Hive Architecture
HiveQL
Where are the Hive Tables Located?
Hive Command-line Interface (CLI)
The Beeline Command Shell

Chapter 12. Hive Command-line Interface

Hive Command-line Interface (CLI)
The Hive Interactive Shell
Running Host OS Commands from the Hive Shell
Interfacing with HDFS from the Hive Shell
The Hive in Unattended Mode
The Hive CLI Integration with the OS Shell
Executing HiveQL Scripts
Comments in Hive Scripts
Variables and Properties in Hive CLI
Setting Properties in CLI
Example of Setting Properties in CLI
Hive Namespaces
Using the SET Command
Setting Properties in the Shell
Setting Properties for the New Shell Session
Setting Alternative Hive Execution Engines
The Beeline Shell
Connecting to the Hive Server in Beeline
Beeline Command Switches
Beeline Internal Commands

Chapter 13. Hive Data Definition Language

Hive Data Definition Language
Creating Databases in Hive
Using Databases
Creating Tables in Hive
Supported Data Type Categories
Common Numeric Types
String and Date / Time Types
Miscellaneous Types
Example of the CREATE TABLE Statement
Working with Complex Types
Table Partitioning
Table Partitioning
Table Partitioning on Multiple Columns
Viewing Table Partitions
Row Format
Data Serializers / Deserializers
File Format Storage
File Compression
More on File Formats
The ORC Data Format
Converting Text to ORC Data Format
The EXTERNAL DDL Parameter
Example of Using EXTERNAL
Creating an Empty Table
Dropping a Table
Table / Partition(s) Truncation
Alter Table/Partition/Column
Views
Create View Statement
Why Use Views?
Restricting Amount of Viewable Data
Examples of Restricting Amount of Viewable Data
Creating and Dropping Indexes
Describing Data

Chapter 14. Hive Data Manipulation Language

Hive Data Manipulation Language (DML)
Using the LOAD DATA statement
Example of Loading Data into a Hive Table
Loading Data with the INSERT Statement
Appending and Replacing Data with the INSERT Statement
Examples of Using the INSERT Statement
Multi Table Inserts
Multi Table Inserts Syntax
Multi Table Inserts Example

Chapter 15. Hive Select Statement

HiveQL
The SELECT Statement Syntax
The WHERE Clause
Examples of the WHERE Statement
Partition-based Queries
Example of an Efficient SELECT Statement
The DISTINCT Clause
Supported Numeric Operators
Built-in Mathematical Functions
Built-in Aggregate Functions
Built-in Statistical Functions
Other Useful Built-in Functions
The GROUP BY Clause
The HAVING Clause
The LIMIT Clause
The ORDER BY Clause
The JOIN Clause
The CASE … Clause
Example of CASE … Clause

Chapter 16. Apache Sqoop

What is Sqoop?
Apache Sqoop Logo
Sqoop Import / Export
Sqoop Help
Examples of Using Sqoop Commands
Data Import Example
Fine-tuning Data Import
Controlling the Number of Import Processes
Data Splitting
Helping Sqoop Out
Example of Executing Sqoop Load in Parallel
A Word of Caution: Avoid Complex Free-Form Queries
Using Direct Export from Databases
Example of Using Direct Export from MySQL
More on Direct Mode Import
Changing Data Types
Example of Default Types Overriding
File Formats
The Apache Avro Serialization System
Binary vs Text
More on the SequenceFile Binary Format
Generating the Java Table Record Source Code
Data Export from HDFS
Export Tool Common Arguments
Data Export Control Arguments
Data Export Example
Using a Staging Table
INSERT and UPDATE Statements
INSERT Operations
UPDATE Operations
Example of the Update Operation
Failed Exports
Sqoop2
Sqoop2 Architecture

Chapter 17. Apache HBase

What is HBase?
HBase Design
HBase Features
HBase High Availability
The Write-Ahead Log (WAL) and MemStore
HBase vs RDBS
HBase vs Apache Cassandra
Not Good Use Cases for HBase
Interfacing with HBase
HBase Thrift And REST Gateway
HBase Table Design
Column Families
A Cell's Value Versioning
Timestamps
Accessing Cells
HBase Table Design Digest
Table Horizontal Partitioning with Regions
HBase Compaction
Loading Data in HBase
Column Families Notes
Rowkey Notes
HBase Shell
HBase Shell Command Groups
Creating and Populating a Table in HBase Shell
Getting a Cell's Value
Counting Rows in an HBase Table

Chapter 18. Apache HBase Java API

HBase Java Client
HBase Scanners
Using ResultScanner Efficiently
The Scan Class
The KeyValue Class
The Result Class
Getting Versions of Cell Values Example
The Cell Interface
HBase Java Client Example
Scanning the Table Rows
Dropping a Table
The Bytes Utility Class

Chapter 19. Introduction to Functional Programming

What is Functional Programming (FP)?
Terminology: Higher-Order Functions
Terminology: Lambda vs Closure
A Short List of Languages that Support FP
FP with Java
FP With JavaScript
Imperative Programming in JavaScript
The JavaScript map (FP) Example
The JavaScript reduce (FP) Example
Using reduce to Flatten an Array of Arrays (FP) Example
The JavaScript filter (FP) Example
Common High-Order Functions in Python
Common High-Order Functions in Scala
Elements of FP in R

Chapter 20. Introduction to Apache Spark

What is Apache Spark
A Short History of Spark
Where to Get Spark?
The Spark Platform
Spark Logo
Common Spark Use Cases
Languages Supported by Spark
Running Spark on a Cluster
The Driver Process
Spark Applications
Spark Shell
The spark-submit Tool
The spark-submit Tool Configuration
The Executor and Worker Processes
The Spark Application Architecture
Interfaces with Data Storage Systems
Limitations of Hadoop's MapReduce
Spark vs MapReduce
Spark as an Alternative to Apache Tez
The Resilient Distributed Dataset (RDD)
Spark Streaming (Micro-batching)
Spark SQL
Example of Spark SQL
Spark Machine Learning Library
GraphX
Spark vs R

Chapter 21. The Spark Shell

The Spark Shell
The Spark Shell UI
Spark Shell Options
Getting Help
The Spark Context (sc) and SQL Context (sqlContext)
The Shell Spark Context
Loading Files
Saving Files
Basic Spark ETL Operations

Chapter 22. Spark RDDs

The Resilient Distributed Dataset (RDD)
Ways to Create an RDD
Custom RDDs
Supported Data Types
RDD Operations
RDDs are Immutable
Spark Actions
RDD Transformations
Other RDD Operations
Chaining RDD Operations
RDD Lineage
The Big Picture
What May Go Wrong
Checkpointing RDDs
Local Checkpointing
Parallelized Collections
More on parallelize() Method
The Pair RDD
Where do I use Pair RDDs?
Example of Creating a Pair RDD with Map
Example of Creating a Pair RDD with keyBy
Miscellaneous Pair RDD Operations
RDD Caching
RDD Persistence
The Tachyon Storage

Chapter 23. Parallel Data Processing with Spark

Running Spark on a Cluster
Spark Stand-alone Option
The High-Level Execution Flow in Stand-alone Spark Cluster
Data Partitioning
Data Partitioning Diagram
Single Local File System RDD Partitioning
Multiple File RDD Partitioning
Special Cases for Small-sized Files
Parallel Data Processing of Partitions
Spark Application, Jobs, and Tasks
Stages and Shuffles
The "Big Picture"

Chapter 24. Introduction to Spark SQL

What is Spark SQL?
Uniform Data Access with Spark SQL
Hive Integration
Hive Interface
Integration with BI Tools
Spark SQL is No Longer Experimental Developer API!
What is a DataFrame?
The SQLContext Object
The SQLContext API
Changes Between Spark SQL 1.3 to 1.4
Example of Spark SQL (Scala Example)
Example of Working with a JSON File
Example of Working with a Parquet File
Using JDBC Sources
JDBC Connection Example
Performance & Scalability of Spark SQL
Summary

Lab Exercises

    Lab 1. Learning the Lab Environment
    Lab 2. Getting Started with Apache Ambari
    Lab 3. The Hadoop Distributed File System
    Lab 4. Hadoop Streaming MapReduce
    Lab 5. Programming Java MapReduce Jobs on Hadoop
    Lab 6. Getting Started with Apache Pig
    Lab 7. Apache Pig HDFS Command-Line Interface
    Lab 8. Working with Data Sets in Apache Pig
    Lab 9. Using Relational Operators in Apache Pig
    Lab 10. Interacting with MapReduce Jobs
    Lab 11. The Hive and Beeline Shells
    Lab 12. Hive Data Definition Language
    Lab 13. Using Select Statement in HiveQL
    Lab 14. Table Partitioning in Hive
    Lab 15. Data Import with Sqoop
    Lab 16. Using HBase Shell
    Lab 17. Elements of Functional Programming with JavaScript
    Lab 18. The Spark Shell
    Lab 19. Spark ETL and HDFS Interface
    Lab 20. Common Map / Reduce Programs in Spark
    Lab 21. Spark SQL

Learning Path

There are no specific requirements for this course. Please contact us to discuss your suitability.

Ways to Attend

Attend a public course, if there is one available. Please check our schedule, or register your interest in joining a course in your area.
Private onsite Team training also available, please contact us to discuss. We can customise this course to suit your business requirements.

Private Team Training is available for this course

We deliver this course either on or off-site in various regions around the world, and can customise your delivery to suit your exact business needs. Talk to us about how we can fine-tune a course to suit your team's current skillset and ultimate learning objectives.

Private Team Training | Contact us

Hadoop Programming on the Hortonworks Data Platform

Course Code

Duration

Course Fee

Accreditation

Target Audience

Attendee Requirements

Ways to Attend this Course

Private Team Training is available for this course

Private Team Training

What Our Clients Say

Technical ICT learning & mentoring services

About GuruTeam

Download our eBrochure

Our Accreditation Partners

Upcoming Courses

Kubernetes Administration

RUST

Introduction to Python 3

GO LANG TRAINING