Apache Spark (Advanced) on Hadoop

About This Course

Course Code
QAASAH

Course Type
Premium

Vendor
Hortonworks

Duration
2 Days

RRP
£1,495.00

Overview

The purpose of this 2-day training course is to acquaint you with Spark 2.0 functionality and performance tuning techniques. It covers the fundamentals of the Spark project, including the basics of RDDs (Resilient Distributed Datasets) and the operators used on them (transformations and actions).

Audience
Data Analysts and Software Developers

Objectives

After successfully completing this course, you will be able to:

Course Outline

Module 0. Intro and Setup:

  • Zeppelin note

Module 1. Datasets and Catalogs:

  • What is a Dataset?
  • Dataset versus SQL/DataFrames
  • When to use which object
  • Serialization performance using Encoders
  • Encoders and semi-structured data
  • Dataset caching 1 of 2
  • 01: Dataset Caching 2 of 2
  • 02a/b/c: Common ways to create DS
  • 03: Creating DS from an RDD
  • Cannot create DS these ways
  • 04: Casting DS and converting DS to DF to RDD
  • 05a: map on DS means losing column names
  • 05b: map characteristics on Dataset
  • 06: select on DS
  • 07: filter and groupBy on DS
  • 08: joinWith on DS
  • 09: explain on DS
  • 10: Catalog: List Hive databases
  • 11: Catalog: List Hive tables, Spark Views
  • 12: Catalog: List column names on table
  • 13: Catalog: List Spark functions
  • Review Questions: Datasets/Catalog
  • In Review: Datasets/Catalog

Module 2. Catalyst and Tungsten functionalities:

  • Before we Begin: Open Zeppelin note
  • DataFrames, Datasets and Views use Catalyst/Tungsten
  • Catalyst optimizer overview
  • 01a: Catalyst: Join on 2 Spark Views demo
  • 01a: Catalyst demo: Join on 2 Spark views
  • But RDDs can’t use Catalyst
  • Loading data in Spark 2.x and Catalyst
  • 02a: Load data old way, then Join 1 of 3
  • Execution Plan from ‘old way’ loading 2 of 3
  • 02b: DataFrameReader: Load/Execution Plan 3 of 3
  • 03a: Dropping hints to Catalyst 1 of 2
  • 03b: Dropping hints to Catalyst 2 of 2
  • 04a: Catalyst: Column pruning demo
  • 04b: Catalyst: Column & Partition pruning
  • Catalyst: Predicate pushdown concepts
  • 05: Catalyst: Predicate pushdown 1 of 2
  • 05: Catalyst: Predicate pushdown 2 of 2
  • Tungsten overview
  • Tungsten: Binary processing
  • Tungsten: Improved Memory usage
  • 06: Tungsten: Improved Caching demo
  • 07: Tungsten: Whole-stage code gen
  • 08: Tungsten: Whole-stage code gen demo
  • Tungsten: Whole-stage code gen Vectorization
  • Review Questions: Catalyst/Tungsten
  • In Review: Catalyst/Tungsten

Module 3. Performance Tuning:

  • 2 types of Machine Learning
  • How Models are Created
  • Four common MLlib functions
  • What is Supervised Learning?
  • Spark Supervised Learning workflow
  • Walking the Workflow: Predicting SPAM 1 of 3
  • Walking the Workflow: Predicting SPAM 2 of 3
  • Walking the Workflow: Predicting SPAM 3 of 3
  • Unsupervised Learning
  • RDD – Machine Learning (MLlib)
  • KMeans scenario
  • 01a: KMeans – Load data
  • 01b: KMeans – Create Model and Predict
  • 01c: KMeans – Compare Actual to Predict
  • Collaborative Filtering (CF) recommender
  • Will Carl like ‘Star Wars’?
  • 02a: CF – Load Movie data
  • 02b: CF – Create Model and Factors
  • 02c: CF – Map MovieID to MovieName
  • 02d: CF – Make User recommendation
  • Classification Functions (Supervised)
  • Before we Begin: Classification uses LabeledPoint. So what is LabeledPoint?
  • CASTing X-var and Y-vars for LabeledPoint
  • Logistic Regression, Support Vector Machines, NaïveBayes and Decision Tree (Supervised)
  • 03a: Logistic Regression, Support Vector Machines, NaïveBayes, and Decision Tree
  • 03b: Logistic Regression, Support Vector Machines, NaïveBayes, and Decision Tree
  • 03c: Logistic Regression, Support Vector Machines, NaïveBayes, and Decision Tree
  • 03c: Logistic Regression, Support Vector Machines, NaïveBayes, and Decision Tree cont.
  • DataFrames – Machine Learning (ML)
  • ML Pipeline Terminology
  • How ML Pipeline Works
  • 02: Predict Bike Rentals (GBT Regression)
  • 02a: Know the Data
  • 02b: Load and View Data types
  • Clean the Data (remove columns)
  • 02c: Clean the Data (remove columns, cont.)
  • 02d: Clean the Data (change to Double)
  • 02e: Visualize the DataFrame
  • 02f: Create Train/Test Set from DataFrame
  • Train ML Pipeline – The Big Picture
  • 02g: Define Feature Processing Pipeline
  • 02h: Define Model Training of Pipeline
  • 02i: Add CrossValidation to Pipeline
  • 02j: Tie Features/Model Together in Pipeline
  • 02k: Train the Pipeline
  • 02l: Make Predictions, evaluate Results
  • 02l: Make Predictions, evaluate Results cont.
  • 02m/n: Visualize the Model’s DataFrame
  • Improving the Model
  • Predict Titanic Survivors (Random Forest)
  • 03a: Know the Data
  • 03b: Load and view Data types and Data
  • 03c: Clean data – Add column ‘FamilySize’
  • 03d: Clean data – Replace NULLs cont.
  • 03e: Clean data – Replace empty strings cont.
  • 03f: Split DataFrame into TrainDF / TestDF
  • 03g: IMPORT ML packages
  • 03h: Index Categorical and Label columns
  • 03i: Assemble all Features into Vector
  • 03j: Using Decision Tree classifier
  • 03k: Retrieve Original labels
  • 03l: Create Pipeline
  • 03m: Selecting the best Model
  • 03n: Make Prediction using TestDF
  • Review Questions: Machine Learning
  • In Review: Machine Learning
  • But wait, there’s more for MLlib (Appendix)
  • Linear Regression scenario (Supervised)
  • Linear Regression 1 of 6
  • Linear Regression 2 of 6
  • Linear Regression 3 of 6
  • Linear Regression 4 of 6
  • Linear Regression 5 of 6
  • Linear Regression 6 of 6

Prerequisites

To get the most out of this training, you should have the following knowledge or experience, as it builds the foundation for this advanced course.
