Contents
About the Big Data Hadoop Online Training:
This five-week training course delivers the key concepts and expertise participants need to ingest and process data on a Hadoop cluster using the most up-to-date tools and techniques, including Apache Spark, MapReduce, HDFS, Hive, Sqoop, and HBase.
Prerequisites:
- A programming language (Java, etc.)
- RDBMS concepts (SQL)
- Fundamentals of Linux (commands)
Hadoop Online Course Content
Hadoop Introduction
Introduction to Hadoop and the Hadoop Ecosystem
- Problems with Traditional Large-scale Systems
- Hadoop!
- The Hadoop Ecosystem
- Hadoop Architecture and HDFS
- Distributed Processing on a Cluster
- Storage: HDFS Architecture
- Storage: Using HDFS
- Resource Management: YARN Architecture
- Resource Management: Working with YARN
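The storage topics above can be illustrated with a toy model of how HDFS splits a file into fixed-size blocks and replicates each block across several datanodes. This is a simplified sketch only, not the real HDFS logic: the tiny block size, the round-robin placement, and the datanode names are all hypothetical (real HDFS defaults to 128 MB blocks, a replication factor of 3, and rack-aware placement):

```python
# Toy model of HDFS block placement (illustration only, not the real HDFS logic).
BLOCK_SIZE = 4          # bytes per block here; real HDFS default is 128 MB
REPLICATION = 3         # copies of each block; matches the real HDFS default
DATANODES = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical datanode names

def place_blocks(data: bytes):
    """Split `data` into blocks and assign each block to REPLICATION datanodes."""
    placement = {}
    for i in range(0, len(data), BLOCK_SIZE):
        block_id = i // BLOCK_SIZE
        # Round-robin placement; real HDFS is rack-aware and far more sophisticated.
        nodes = [DATANODES[(block_id + r) % len(DATANODES)] for r in range(REPLICATION)]
        placement[block_id] = {"bytes": data[i:i + BLOCK_SIZE], "replicas": nodes}
    return placement

layout = place_blocks(b"hello world!")   # 12 bytes -> 3 blocks of 4 bytes each
for block_id, info in layout.items():
    print(block_id, info["bytes"], info["replicas"])
```

The point of the sketch: losing any single datanode still leaves two replicas of every block, which is why HDFS tolerates node failure without data loss.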
Importing Relational Data with Apache Sqoop
- Sqoop Overview
- Basic Imports and Exports
- Limiting Results
- Improving Sqoop’s Performance
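The import topics above map directly onto Sqoop command-line flags. The sketch below assumes a running Hadoop cluster with Sqoop installed and a reachable MySQL database; the JDBC URL, credentials, table, and paths are hypothetical, while the flags shown (`--table`, `--columns`, `--where`, `--num-mappers`, `--split-by`) are standard Sqoop import options:

```shell
# Basic import of one table from an RDBMS into HDFS (hypothetical connection details)
sqoop import \
  --connect jdbc:mysql://dbhost/shop \
  --username analyst --password-file /user/analyst/.pw \
  --table orders \
  --target-dir /data/orders

# Limiting results: import only selected columns and rows
sqoop import \
  --connect jdbc:mysql://dbhost/shop \
  --username analyst --password-file /user/analyst/.pw \
  --table orders \
  --columns "id,total" \
  --where "order_date >= '2020-01-01'" \
  --target-dir /data/orders_2020

# Improving performance: raise the number of parallel map tasks, e.g.
#   --num-mappers 8   (needs a --split-by column if the table has no primary key)
```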
Introduction to Hive
- Why Use Hive?
- Comparing Hive to Traditional Databases
- Hive Use Cases
- Modeling and Managing Data with Hive
- Data Storage Overview
- Creating Databases and Tables
- Loading Data into Tables
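The table-creation and data-loading topics above can be sketched in HiveQL. This assumes a running Hive service; the database, table, column, and path names are hypothetical:

```sql
-- Create a database and a delimited text table (hypothetical names)
CREATE DATABASE IF NOT EXISTS sales;

CREATE TABLE IF NOT EXISTS sales.orders (
  id        INT,
  customer  STRING,
  total     DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Load a file already in HDFS into the table
-- (moves the file into Hive's warehouse directory)
LOAD DATA INPATH '/data/orders.csv' INTO TABLE sales.orders;

-- Query it like a traditional database table
SELECT customer, SUM(total) FROM sales.orders GROUP BY customer;
```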
Apache Spark
Apache Spark is the next-generation successor to MapReduce. Spark is a powerful, open-source processing engine for data in the Hadoop cluster, optimized for speed, ease of use, and sophisticated analytics. The Spark framework supports streaming data processing and complex, iterative algorithms, enabling applications to run up to 100x faster than traditional Hadoop MapReduce programs.
Parallel Programming with Spark
- Review: Spark on a Cluster
- RDD Partitions
- Partitioning of File-based RDDs
- HDFS and Data Locality
- Executing Parallel Operations
- Stages and Tasks
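The partitioning and parallel-operation topics above can be pictured with a plain-Python toy model (this is not the Spark API): a dataset is split into partitions, and a narrow transformation such as `map` becomes one independent task per partition, which is the basic unit Spark schedules within a stage:

```python
# Toy model of RDD partitioning and per-partition tasks (plain Python, not Spark).
def partition(data, num_partitions):
    """Split a list into roughly equal partitions, as Spark does for file-based RDDs."""
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def run_map(partitions, fn):
    """A narrow transformation: each 'task' processes one partition independently."""
    return [[fn(x) for x in part] for part in partitions]

parts = partition(list(range(10)), 3)
squared = run_map(parts, lambda x: x * x)
print(parts)     # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(squared)   # each inner list is the output of one independent task
```

Because no task needs data from another partition, all of them can run in parallel across the cluster; operations that do need data from other partitions (e.g. grouping by key) force a shuffle and start a new stage.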
Spark Caching and Persistence
- RDD Lineage
- Caching Overview
- Distributed Persistence
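The lineage and caching topics above boil down to one trade-off: an uncached dataset is rebuilt from its lineage every time an action touches it, while a cached one is materialized once and reused. A toy model (plain Python, not the Spark API) that counts recomputations makes the difference concrete:

```python
# Toy model of RDD lineage and caching (plain Python, not the Spark API).
class ToyRDD:
    """A dataset defined by its lineage: a recipe for (re)building its contents."""
    def __init__(self, compute):
        self._compute = compute   # how to rebuild this dataset from its lineage
        self._cache = None
        self.computations = 0     # how many times the lineage was actually run

    def collect(self):
        if self._cache is not None:        # cached: served without recomputation
            return self._cache
        self.computations += 1
        return self._compute()

    def cache(self):
        self._cache = self._compute()      # materialize once, reuse afterwards
        self.computations += 1
        return self

rdd = ToyRDD(lambda: [x * x for x in range(5)])
rdd.collect(); rdd.collect()
print(rdd.computations)    # 2 -> rebuilt from lineage on every action

cached = ToyRDD(lambda: [x * x for x in range(5)]).cache()
cached.collect(); cached.collect()
print(cached.computations) # 1 -> computed once, then served from the cache
```

Lineage is also Spark's fault-tolerance mechanism: if a cached partition is lost, it can always be recomputed from this recipe.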
Common Patterns in Spark Data Processing
- Common Spark Use Cases
- Iterative Algorithms in Spark
- Graph Processing and Analysis
- Machine Learning
- Example: k-means
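The k-means example listed above is the classic iterative algorithm that Spark handles well, since the same dataset is revisited on every pass. A minimal plain-Python version clustering 1-D points shows the two steps that repeat each iteration (the data and initial centers are illustrative; in Spark the loop body would run over a cached RDD):

```python
# Minimal k-means on 1-D points (plain Python; Spark would distribute the loop body).
def kmeans(points, centers, iterations=10):
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(ps) / len(ps) if ps else c for c, ps in clusters.items()]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 10.0, 10.2, 9.8]   # two obvious groups near 1 and 10
print(kmeans(data, centers=[0.0, 5.0]))    # converges near [1.0, 10.0]
```

Because each iteration rereads the full dataset, caching the points in memory (the previous section) is what makes this kind of algorithm fast in Spark.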
Preview: Spark SQL
- Spark SQL and the SQL Context
- Creating DataFrames
- Transforming and Querying DataFrames
- Saving DataFrames
- Comparing Spark SQL with Impala
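The DataFrame topics above can be sketched with PySpark. This fragment requires a Spark installation and is illustrative only; the data, column names, and output path are hypothetical. It uses the `SparkSession` entry point, which in current Spark versions subsumes the older `SQLContext` named above:

```python
# Requires a Spark installation (e.g. `pip install pyspark`); illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("preview").getOrCreate()

# Creating a DataFrame from local data (hypothetical columns)
df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])

# Transforming and querying
df.filter(df.age > 30).select("name").show()

# Saving a DataFrame
df.write.mode("overwrite").parquet("/tmp/people.parquet")
```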
Apache HBase
Apache HBase is a distributed, scalable, NoSQL database built on Apache Hadoop. HBase can store data in massive tables consisting of billions of rows and millions of columns, serve data to many users and applications in real time, and provide fast, random read/write access to users and applications.
HBase Concepts
- Use cases and scenarios for HBase, Hadoop, and RDBMS
- Using the HBase shell to directly manipulate HBase tables
- Designing optimal HBase schemas for efficient data storage and retrieval
- Connecting to HBase with the Java API to insert and retrieve data in real time
- Best practices for identifying and resolving performance bottlenecks
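The HBase shell bullet above can be sketched as a short session. This assumes a cluster node with HBase installed; the table name, column family, row keys, and values are hypothetical, while the commands themselves (`create`, `put`, `get`, `scan`, `disable`, `drop`) are standard HBase shell commands:

```
# Start the shell on a cluster node, then create and query a table
hbase shell

create 'users', 'info'                      # table 'users' with column family 'info'
put 'users', 'row1', 'info:name', 'Alice'   # insert one cell
put 'users', 'row1', 'info:email', 'alice@example.com'
get 'users', 'row1'                         # fast random read of a single row
scan 'users'                                # full-table scan
disable 'users'
drop 'users'                                # cleanup
```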