

Apache Spark Programming

Course Details
Code: DB105
Tuition (USD): $2,500.00 • Classroom (3 days)
$2,500.00 • Virtual (3 days)

This 3-day course is equally applicable to data engineers, data scientists, analysts, architects, software engineers, and technical managers interested in a thorough, hands-on overview of Apache Spark. The course covers the fundamentals of Apache Spark, including Spark’s architecture and internals, the core APIs for using Spark, SQL and other high-level data access tools, and Spark’s streaming capabilities and machine learning APIs. Each topic combines lecture content with hands-on labs in the Databricks notebook environment. Students may keep the notebooks and continue to use them with the free Databricks Community Edition offering after the class ends; all examples are guaranteed to run in that environment.

Skills Gained

After taking this class, students will be able to:

  • Use the core Spark APIs to operate on data
  • Articulate and implement typical use cases for Spark
  • Build data pipelines and query large data sets using Spark SQL and DataFrames
  • Analyze Spark jobs using the administration UIs inside Databricks
  • Create Structured Streaming jobs
  • Work with relational data using the GraphFrames APIs
  • Understand how a Machine Learning pipeline works
  • Understand the basics of Spark’s internals

Who Can Benefit

Data engineers, analysts, architects, data scientists, software engineers, and technical managers who want to learn the fundamentals of programming with Apache Spark, streamline their big data processing, build production Spark jobs, and understand and debug running Spark applications.


Prerequisites

  • Some familiarity with Apache Spark is helpful but not required.
  • Knowledge of SQL is helpful.
  • Basic programming experience in an object-oriented or functional language is required. The class can be taught concurrently in Python and Scala.

Course Details

Lab Requirements

  • A computer or laptop
  • Chrome or Firefox web browser (Internet Explorer and Safari are not supported)
  • Internet access with unfettered connections to the following domains:
  • * - required
  • * - highly recommended
  • - required
  • - helpful but not required


Spark Overview

In-depth discussion of Spark SQL and DataFrames, including:

  • The DataFrames/Datasets API
  • Spark SQL
  • Data Aggregation
  • Column Operations
  • The Functions API: date/time, string manipulation, aggregation
  • Joins & Broadcasting
  • User Defined Functions
  • Caching and caching storage levels
  • Use of the Spark UI to analyze behavior and performance

In-depth discussion of Spark internals

  • Cluster Architecture
  • The Catalyst query optimizer
  • The Tungsten in-memory data format
  • How Spark schedules and executes jobs and tasks
  • Shuffling, shuffle files, and performance
  • How various data sources are partitioned
  • How Spark handles data reads and writes

Spark Structured Streaming

  • Sources and sinks
  • Structured Streaming APIs
  • Windowing & Aggregation
  • Checkpointing & Watermarking
  • Reliability and Fault Tolerance
  • Kafka Integration

Overview of Spark’s MLlib Pipeline API for Machine Learning

  • Transformer/Estimator/Pipeline API
  • Perform feature preprocessing
  • Evaluate and apply ML models

Graph processing with GraphFrames

  • Transforming DataFrames into a graph
  • Perform graph analysis, including Label Propagation, PageRank, and ShortestPaths