Data Engineering Vol2 AWS : Data Processing – Spark & Kafka

This is Volume 2 of the Data Engineering course. In this course I will talk about the open-source data processing technologies Spark and Kafka, which are among the most widely used frameworks for batch and stream processing. You will learn Spark from Level 100 to Level 400 through real-life hands-on exercises and projects. I will also introduce you to the Data Lake on AWS (that is, S3) and the Data Lakehouse using Apache Iceberg.

I will use AWS as the hosting platform and talk about the AWS services EMR, S3 and MSK. I will also cover Databricks as a Spark hosting platform and show you Spark integration with other services such as AWS RDS (MySQL or PostgreSQL) and Redshift.

You will get opportunities to practice hands-on with large datasets (100 GB – 300 GB or more of data). The hands-on exercises match real-world scenarios such as Spark batch processing, stream processing, performance tuning, streaming ingestion, window functions and ACID transactions on Iceberg.

Some other highlights:

7 Projects with different datasets. Total dataset size of 250 GB or more.
Other technologies covered – EC2, EBS, VPC and IAM.
Optional Python videos
Optional AWS and SQL Essentials videos

Deep dive on Spark and Kafka using AWS EMR, Databricks, MSK
Understand Data Engineering (Volume 2) on AWS using Spark and Kafka
Batch and Stream processing using Spark and Kafka
Production-level projects and hands-on exercises that give candidates on-the-job-like training
Get access to datasets of size 100 GB - 200 GB and practice on them
Learn Python for Data Engineering with HANDS-ON (Functions, Arguments, OOP (class, object, self), Modules, Packages, Multithreading, file handling etc.)
Learn SQL for Data Engineering with HANDS-ON (Database objects, CASE, Window Functions, CTE, CTAS, MERGE, Materialized View etc.)
AWS Data Analytics services - S3, EMR, Databricks, MSK

Course Content

Introduction to Data Engineering – Vol 2

Introduction – Data, Data Lifecycle & Data Engineering Pipeline

26:28
Data Engineering Volume 2 Course & Projects Overview, Roles in Data

25:19
AWS Resource Cost for the Course

06:26

Section 1 – Big Data Processing

Section 2 – Spark Introduction

Section 3 – Knowing Spark Up Close: Part 1

Spark Standalone Cluster on Laptop & EC2, Spark UI

29:24
Spark on AWS EMR

21:54
Spark Application Deployment Modes (--deploy-mode cluster|client)

03:11
DataFrames, DAG, PySpark Module

24:22
SparkSession

20:16
DataFrame Reader & Writer

28:39
Define DataFrame Schema – StructType, StructField, inferSchema

21:05
Partition, Split, Task, Executor Relation

17:50
Splits for Smaller Files

17:37

Section 4 – Spark Transformation & Action : Part 1

Introduction to Transformation

15:59
HandsOn : Intro to Transformation

17:59
Step-by-step Transformation

18:26
Chain Transformation

05:47
Transformation – describe, sort, limit, drop, dropDuplicates

19:16
Spark Action & Lazy Loading

08:37
HandsOn : Spark Action & Lazy Loading

15:12
HandsOn : Spark UI – Jobs, Executors, Environment

10:04
HandsOn : load() – Transformation or Action??

13:23
HandsOn : File Performance Comparison – CSV vs Parquet

11:37
Assignment Project 1 – Transformation & Action

10:47
Transformation – agg, groupBy, orderBy, selectExpr, subtract, withColumn, union

07:57
HandsOn : Transformation : agg(), alias(), groupby()

14:08
HandsOn : Transformation : join()

19:00
HandsOn : Function : col()

19:21
HandsOn : Transformation : orderBy(), selectExpr()

09:13
HandsOn : Transformation : subtract()

09:03
HandsOn : Transformation : union(), unionAll(), intersect(), withColumn()

10:10
Shuffle Process & External Shuffle Service (ESS)

23:53
HandsOn : Shuffle Process in groupBy() – Explanation using Spark UI

24:12
HandsOn : Shuffle Process in join() – Explanation using Spark UI

10:13
DO NOT use show() for analysis in Spark

06:10
What should be the value of Shuffle Partition No (spark.sql.shuffle.partitions)

13:21
Assignment Project 2 : Shuffle & Spark UI

19:59
Summary – Transformations & Actions Part 1

03:05

Section 5 – Spark Partition – Input, Shuffle & Output

Section 6 – Knowing Spark Up close: Part 2

Do Dataframes keep occupying Memory ??

09:39
How Spark Applications use Memory & CPU, and How to Calculate Them

12:53
How to decide the size of Spark PROD Cluster

14:09
Explore Spark UI – WholeStageCodegen, MapPartitions, Exchange etc.

22:09
Spark OnHeap (Executor) Memory Mgmt – Reserve, Execution, Storage & User Memory

14:44
HandsOn : Configure no. of Executors, their memory & cpu – application + spark-submit

11:24
HandsOn : Executor, Execution & Storage Memory in Spark UI

10:00
Memory Spilling, Unified Memory Management & Off-heap Memory

11:04
Cache, Persist, Unpersist, ClearCache

07:05
HandsOn : Monitor Memory Spill, Caching & Persist in Spark UI

13:55
Garbage Collection & Kryo Serialization in Spark

10:21
Assignment Project 3.1 – Flight Efficiency Analysis

21:38
Assignment Project 3.1 – Flight Efficiency Analysis Coding

17:16
Assignment Project 3 – Airport Operations, Same Route Performance, Freq Flier Prog

07:00

Section 7 – Transformation & Action Part 2 + Spark Functions

Dataframe collect() API

10:35
foreach() & foreachPartition() to apply custom logic on dataframe records

16:51
Loop/Iterate through Dataframe records – toLocalIterator(), df.na.fill()

16:53
Spark Functions – lit(), concat(), expr()

13:16
DATE Functions

15:22
when() & otherwise() Functions

06:25
WINDOW Functions

09:24
HandsOn : WINDOW Functions

07:32
Assignment : Spark WINDOW Functions

18:09
Assignment Project 4 – Layover Analysis, Travel Activity, Environment Impact

36:28
lead(), lag(), nth_value(), first_value(), last_value()

11:03
HandsOn : lead(), lag(), nth_value(), first_value(), last_value()

06:50

Section 8 – Knowing Spark Up Close : Part 3

Spark Catalyst Optimizer & Tungsten Execution Engine

16:57
Spark Internal Joins – Sort Merge Join (SMJ)

15:11
Spark Internal Join – Broadcast Hash Join (BHJ)

09:57
Spark Internal Join – Broadcast Nested Loop Join (BNLJ)

06:06
Spark Internal Join – Shuffled Hash Join (SHJ)

05:56
Spark Internal Join – Shuffled NL Join (SNLJ), Join Hints, Join Performance

05:11
SparkUI Operators – HashAggregate, SortAggregate, BroadcastExchange

12:35
SparkUI Operators – BroadcastHashJoin, BroadcastNLJoin

05:59
SparkUI Operators – ShuffledHashJoin, Limit, CollectLimit, Window, ReusedExchang

07:39
HandsOn : Spark Operators

07:28
Spark Explain & How to Read Explain Output

21:06
Data Skewness or Skewed Data

08:10
HandsOn : Analyse & Find Data Skewness using Spark UI

14:51
HandsOn : Spark Salting – Mitigate Data Skewness and Skew Joins

09:12
HandsOn : Implement Salting step-by-step

16:56
Adaptive Query Execution (AQE), AQE Parameters

17:40
HandsOn : AQE Coalesce Shuffle Partitions

09:41
HandsOn : AQE Optimize Data Skewness & Skewed Join

04:25
Dynamic Partition Pruning

07:24
HandsOn : Dynamic Resource Allocation (DRA)

17:41
SUPER SUMMARY

10:29

Section 9 – Hosting Platforms AWS EMR (Elastic MapReduce), Databricks

EMR Hosting Platform Introduction

23:20
EMR Cluster Creation with IAM Roles

13:46
EMR Considerations – Spark, YARN Configuration Parameters, Submit Spark code to

17:40
HandsOn : Prepare EMR environment to execute Spark Applications

12:03
HandsOn : Submit Spark application to EMR from Primary Node & Laptop

20:03
HandsOn : S3 vs HDFS as File Source & Target

05:58
HandsOn : EMR Steps

09:41
Relational (RDBMS) Database Sources & Targets

15:23
numPartitions, lowerBound, upperBound

04:50
HandsOn : Write to RDBMS (MySQL) Target

22:27
HandsOn : Read from RDBMS (MySQL) Source

05:32
Data Warehouse Source & Targets – Redshift

13:19
HandsOn : Read & Write from/to Redshift Data Warehouse

16:39
Filter and Aggregate Pushdown in Parquet

07:11
Handling JSON Files

23:19
Jupyter Notebook on EMR – Introduction

09:32
HandsOn : Jupyter Notebook Security Group & IAM pre-requisites

06:47
HandsOn : Create Jupyter Notebook and execute PySpark application

19:48

Section 10 – Spark Core API RDD, Spark Configurations

Section 11 – Spark SQL

Section 12 – Datalakehouse Using Open Table Format (OTF)

Lakehouse Introduction & Iceberg Architecture

19:29
HANDS-ON : Iceberg Architecture

13:13
Iceberg Configuration, COW, MOR, Table Operations & Properties

17:28
HandsOn : Iceberg Table Properties

11:21
HANDS-ON : Iceberg Delete files – copy-on-write (COW), merge-on-read (MOR)

09:43
COW or MOR??? Time Travel Queries, Iceberg Table History

16:04
HANDS-ON : Iceberg Metadata, Time Travel Queries

08:07
Iceberg Table Maintenance

07:25
HANDS-ON : Iceberg Table Maintenance

08:07
HANDS-ON : Iceberg Hidden Partition & Partition Evolution

22:55
Medallion & Lakehouse Architecture

13:49

Section 13 – Apache Kafka – The Streaming Ingestion

Introduction to Kafka

15:35
Create Kafka Cluster using AWS MSK (Provisioned & Serverless)

20:39
Create Kafka Cluster on Laptop

09:39
Kafka Components

09:28
Kafka Components – Topics & Partitions

14:21
Kafka Up Close – Topology, Replication, Message Distribution, Write & Read Data

18:55
Kafka Producer – Application, API, Configuration, Sync-Async Send, Python Library

21:15
HANDS-ON : Kafka Producer Application – Send Records to Topic

19:38
HANDS-ON : Kafka Producer Configuration

05:15
Kafka Consumer – Application, Configuration, Partition Rebalance

15:49
HANDS-ON : Kafka Consumer Application, Partition Ownership

14:44
HANDS-ON : Kafka poll() API

05:58
HANDS-ON : Kafka Offset Management

18:05
HANDS-ON : Auto Offset Reset – Earliest & Latest

09:18
Kafka Cluster (MSK) & Topic Throughput Calculation

14:50
Avro Data Format & Glue Schema Registry

13:11
HANDS-ON : Send AVRO data to Kafka Topic using Glue Schema Registry

21:43
Schema Evolution & Kafka Configuration

15:45
HANDS-ON : Schema Evolution using Glue Schema Registry

06:43

Section 14 – Spark Streaming – Stream Processing Using Spark

Spark Streaming Intro, Trigger, Output Mode

21:47
HANDS-ON : PySpark Streaming using Console Sink

17:03
HANDS-ON : Execute Streaming code from PyCharm, use Spark History Server

10:34
HANDS-ON : JSON data Streaming – Python Producer, Spark Consumer

24:58
HANDS-ON : Spark Streaming OUTPUT Mode

10:05
HANDS-ON : Using Files as Triggers in Streaming

13:53
HANDS-ON : Streaming State Store & Checkpointing with Output Modes

18:44
Introduction to Event Time in Stream Processing

13:59
Event Time – Tumbling Window

14:47
HANDS-ON : Event Time – Tumbling Window

13:57
HANDS-ON : Event Time – Sliding Window

12:13
Watermark – Handle Late Data & Manage State Store

22:35
HANDS-ON : Watermark – Handle Late Data & Manage State Store

12:47
HANDS-ON : Monitor Streaming Applications using Spark UI

08:21
Spark Streaming Conclusion

05:45

Section 15 – AWS Lambda for Data Processing

Intro to Serverless Compute & AWS Lambda Use Cases

09:49
Lambda Function & its Components

08:01
HANDS-ON : Configure Lambda Function & its Components

16:53
Lambda (Python Code) Execution Model

11:51
HANDS-ON : Write Python Code from Lambda Console & Explore ‘Event’ data

12:39
HANDS-ON : S3 Trigger -> Read File -> Copy to /tmp -> Print data

09:01
HANDS-ON : S3 Trigger -> Read File -> Process -> Write to S3

10:16
HANDS-ON : Deploy Python code from S3 using ZIP deployment package

09:48
HANDS-ON : Deploy Additional Python Packages & Modules using ZIP in Lambda

10:44
HANDS-ON : S3 Trigger to Database & Concurrent Lambda Execution

08:38
Architecture using Lambda

03:17

(Optional) AWS Essentials

AWS Cloud and EC2 Intro

19:13
EC2 Components & HandsOn 1

21:07
EC2 HandsOn 2

11:26
EBS Theory

19:33
EBS HandsOn

13:25
VPC Introduction & Components

18:36
VPC Components Hands On

23:04
Bastion Host

05:29
Security Groups

15:09
NAT Gateway & VPC Endpoint

20:13
VPC Peering

02:33
AWS IAM Intro & Hands On

26:01
IAM Service Role

19:38

(Optional) Python Essentials

Python Intro – Architecture, PyCharm, Virtual Env

39:33
PyCharm & CLI Walkthrough

08:51
Compiled vs Interpreted

07:42
Everything in Python is an Object

12:39
String

10:20
Number

04:02
List

11:36
Tuple

06:16
Set Dict Type Conversion

16:03
Memory Allocation & Operators

10:27
Set up Python interpreter in PyCharm

11:23
Print & Input Functions

16:04
IF Statement

14:36
For & While loops

15:57
Functions Intro

09:54
Function Scoping

15:20
Functions RETURN

07:53
Function Arguments

09:38
Modify Arguments

09:07
Positional & Keyword Arguments

09:42

(Optional) SQL Essentials

SQL Introduction

30:35
Client & Server Setup

12:18
Database Objects Theory

28:13
Database Objects Hands On

29:19
CRUD Operations

21:18
SELECT Operators

24:41
CASE COALESCE

13:19
DATE Functions

05:46
CTAS Cast Concat

14:11
Update Delete Truncate

12:31
HAVING Clause

07:43
Joins

19:27
Union Intersect View

17:35
Materialized View

08:19
Common Table Expression (CTE)

10:48
Window Functions

22:40
MERGE & Summary

10:52

A course by

Soumyadeep Dey

Section 1 – Big Data Processing

Distributed Compute & Storage, Big Data, MapReduce

Map & Reduce Tasks, Big Data Ecosystem

HandsOn : MapReduce using “mrjob” Python Library

HandsOn : HDFS Commands

YARN – Architecture & Usage

ZooKeeper, Big Data File Formats – Parquet & Avro

Section 2 – Spark Introduction

Introduction to Spark, Batch & Stream Processing

Spark Ecosystem, Development, Architecture & Execution

Spark Codebase, JVM, Setup & Configuration, PySpark

Section 5 – Spark Partition – Input, Shuffle & Output

Spark Input Partitions (spark.sql.files.maxPartitionBytes)

Output Partitions – repartition(), coalesce()

partitionBy(), Transformations that change the no. of Partitions

DataFrame Writer Output Modes
