This is Volume 2 of the Data Engineering course. In this course I will talk about the open-source data processing technologies Spark and Kafka, which are among the most widely used frameworks for batch and stream processing. You will learn Spark from Level 100 to Level 400 through real-life hands-on exercises and projects. I will also introduce you to the Data Lake on AWS (that is, S3) and the Data Lakehouse using Apache Iceberg.
I will use AWS as the hosting platform and talk about the AWS services EMR, S3 and MSK. I will also cover Databricks as a Spark hosting platform, and show you how Spark integrates with other services such as AWS RDS (MySQL or PostgreSQL) and Redshift.
You will get hands-on practice with large datasets (100 GB – 300 GB or more of data). The hands-on exercises mirror real-world scenarios such as Spark batch processing, stream processing, performance tuning, streaming ingestion, window functions and ACID transactions on Iceberg.
Some other highlights:
- 7 Projects with different datasets. Total dataset size of 250 GB or more.
- Other technologies covered – EC2, EBS, VPC and IAM.
- Optional Python videos
- Optional AWS and SQL Essentials videos
What Will You Learn?
- Deep dive on Spark and Kafka using AWS EMR, Databricks, MSK
- Understand Data Engineering (Volume 2) on AWS using Spark and Kafka
- Batch and Stream processing using Spark and Kafka
- Production-level projects and hands-on exercises that give candidates on-the-job-like training
- Get access to datasets of 100 GB - 200 GB in size and practice on them
- Learn Python for Data Engineering with HANDS-ON (Functions, Arguments, OOP (class, object, self), Modules, Packages, Multithreading, file handling etc.)
- Learn SQL for Data Engineering with HANDS-ON (Database objects, CASE, Window Functions, CTE, CTAS, MERGE, Materialized View etc.)
- AWS Data Analytics services - S3, EMR, Databricks, MSK
Requirements
- Good to have AWS and SQL knowledge
Audience
- Python developers, Application Developers, Big Data Developers
- Data Engineers, Data Scientists, Data Analysts
- Database Administrators, Big Data Administrators
- Data Engineering Aspirants
- Solutions Architects, Cloud Architects, Big Data Architects
- Technical Managers, Engineering Managers, Project Managers
Course Content
Introduction to Data Engineering – Vol 2
- 26:28
- 25:19
- AWS Resource Cost for the Course (06:26)
Section 1 – Big Data Processing
- Distributed Compute & Storage, Big Data, MapReduce (34:11)
- Map & Reduce Tasks, Big Data Ecosystem (30:55)
- HandsOn : MapReduce using “mrjob” Python Library (27:44)
- HandsOn : HDFS Commands (10:14)
- YARN – Architecture & Usage (26:05)
- ZooKeeper, Big Data File Formats – Parquet & Avro (20:18)
Section 2 – Spark Introduction
- Introduction to Spark, Batch & Stream Processing (27:22)
- Spark Ecosystem, Development, Architecture & Execution (23:56)
- Spark Codebase, JVM, Setup & Configuration, PySpark (34:01)
Section 3 – Knowing Spark Up Close: Part 1
- Spark Standalone Cluster on Laptop & EC2, Spark UI (29:24)
- Spark on AWS EMR (21:54)
- Spark Application Deployment Modes (--deploy-mode cluster|client) (03:11)
- DataFrames, DAG, PySpark Module (24:22)
- SparkSession (20:16)
- DataFrame Reader & Writer (28:39)
- Define DataFrame Schema – StructType, StructField, inferSchema (21:05)
- Partition, Split, Task, Executor Relation (17:50)
- Splits for Smaller Files (17:37)
Section 4 – Spark Transformation & Action : Part 1
- Introduction to Transformation (15:59)
- HandsOn : Intro to Transformation (17:59)
- Step-by-step Transformation (18:26)
- Chain Transformation (05:47)
- Transformation – describe, sort, limit, drop, dropDuplicates (19:16)
- Spark Action & Lazy Loading (08:37)
- HandsOn : Spark Action & Lazy Loading (15:12)
- HandsOn : Spark UI – Jobs, Executors, Environment (10:04)
- HandsOn : load() – Transformation or Action?? (13:23)
- HandsOn : File Performance Comparison – CSV vs Parquet (11:37)
- Assignment Project 1 – Transformation & Action (10:47)
- Transformation – agg, groupBy, orderBy, selectExpr, subtract, withColumn, union (07:57)
- HandsOn : Transformation : agg(), alias(), groupby() (14:08)
- HandsOn : Transformation : join() (19:00)
- HandsOn : Function : col() (19:21)
- HandsOn : Transformation : orderBy(), selectExpr() (09:13)
- HandsOn : Transformation : subtract() (09:03)
- HandsOn : Transformation : union(), unionAll(), intersect(), withColumn() (10:10)
- Shuffle Process & External Shuffle Service (ESS) (23:53)
- HandsOn : Shuffle Process in groupBy() – Explanation using Spark UI (24:12)
- HandsOn : Shuffle Process in join() – Explanation using Spark UI (10:13)
- DO NOT use show() for analysis in Spark (06:10)
- What should be the value of Shuffle Partition No (spark.sql.shuffle.partitions) (13:21)
- Assignment Project 2 : Shuffle & Spark UI (19:59)
- Summary – Transformations & Actions Part 1 (03:05)
Section 5 – Spark Partition – Input, Shuffle & Output
- Spark Input Partitions (spark.sql.files.maxPartitionBytes) (16:31)
- Output Partitions – repartition(), coalesce() (11:34)
- partitionBy(), Transformations that change the no. of Partitions (15:01)
- DataFrame Writer Output Modes (06:45)
Section 6 – Knowing Spark Up Close: Part 2
- Do DataFrames keep occupying Memory? (09:39)
- How Spark Applications use Memory CPU & its calculation (12:53)
- How to decide the size of Spark PROD Cluster (14:09)
- Explore Spark UI – WholeStageCodegen, MapPartitions, Exchange etc. (22:09)
- Spark OnHeap (Executor) Memory Mgmt – Reserve, Execution, Storage & User Memory (14:44)
- HandsOn : Configure no. of Executors, its memory & cpu – application+spark-submit (11:24)
- HandsOn : Executor, Execution & Storage Memory in Spark UI (10:00)
- Memory Spilling, Unified Memory Management & Off-heap Memory (11:04)
- Cache, Persist, Unpersist, ClearCache (07:05)
- HandsOn : Monitor Memory Spill, Caching & Persist in Spark UI (13:55)
- Garbage Collection & Kryo Serialization in Spark (10:21)
- Assignment Project 3.1 – Flight Efficiency Analysis (21:38)
- Assignment Project 3.1 – Flight Efficiency Analysis Coding (17:16)
- Assignment Project 3 – Airport Operations, Same Route Performance, Freq Flier Prog (07:00)
Section 7 – Transformation & Action Part 2 + Spark Functions
- Dataframe collect() API (10:35)
- foreach() & foreachPartition() to apply custom logic on dataframe records (16:51)
- Loop/Iterate through Dataframe records – toLocalIterator(), df.na.fill() (16:53)
- Spark Functions – lit(), concat(), expr() (13:16)
- DATE Functions (15:22)
- when() & otherwise() Functions (06:25)
- WINDOW Functions (09:24)
- HandsOn : WINDOW Functions (07:32)
- Assignment : Spark WINDOW Functions (18:09)
- Assignment Project 4 – Layover Analysis, Travel Activity, Environment Impact (36:28)
- lead(), lag(), nth_value(), first_value(), last_value() (11:03)
- HandsOn : lead(), lag(), nth_value(), first_value(), last_value() (06:50)
Section 8 – Knowing Spark Up Close : Part 3
- Spark Catalyst Optimizer & Tungsten Execution Engine (16:57)
- Spark Internal Joins – Sort Merge Join (SMJ) (15:11)
- Spark Internal Join – Broadcast Hash Join (BHJ) (09:57)
- Spark Internal Join – Broadcast Nested Loop Join (BNLJ) (06:06)
- Spark Internal Join – Shuffled Hash Join (SHJ) (05:56)
- Spark Internal Join – Shuffled NL Join (SNLJ), Join Hints, Join Performance (05:11)
- SparkUI Operators – HashAggregate, SortAggregate, BroadcastExchange (12:35)
- SparkUI Operators – BroadcastHashJoin, BroadcastNLJoin (05:59)
- SparkUI Operators – ShuffledHashJoin, Limit, CollectLimit, Window, ReusedExchange (07:39)
- HandsOn : Spark Operators (07:28)
- Spark Explain & How to Read Explain Output (21:06)
- Data Skewness or Skewed Data (08:10)
- HandsOn : Analyse & Find Data Skewness using Spark UI (14:51)
- HandsOn : Spark Salting – Mitigate Data Skewness and Skew Joins (09:12)
- HandsOn : Implement Salting step-by-step (16:56)
- Adaptive Query Execution (AQE), AQE Parameters (17:40)
- HandsOn : AQE Coalesce Shuffle Partitions (09:41)
- HandsOn : AQE Optimize Data Skewness & Skewed Join (04:25)
- Dynamic Partition Pruning (07:24)
- HandsOn : Dynamic Resource Allocation (DRA) (17:41)
- SUPER SUMMARY (10:29)
Section 9 – Hosting Platforms AWS EMR (Elastic MapReduce), Databricks
- EMR Hosting Platform Introduction (23:20)
- EMR Cluster Creation with IAM Roles (13:46)
- EMR Considerations – Spark, YARN Configuration Parameters, Submit Spark code to (17:40)
- HandsOn : Prepare EMR environment to execute Spark Applications (12:03)
- HandsOn : Submit Spark application to EMR from Primary Node & Laptop (20:03)
- HandsOn : S3 vs HDFS as File Source & Target (05:58)
- HandsOn : EMR Steps (09:41)
- Relational (RDBMS) Database Sources & Targets (15:23)
- numPartitions, lowerBound, upperBound (04:50)
- HandsOn : Write to RDBMS (MySQL) Target (22:27)
- HandsOn : Read from RDBMS (MySQL) Source (05:32)
- Data Warehouse Source & Targets – Redshift (13:19)
- HandsOn : Read & Write from/to Redshift Data Warehouse (16:39)
- Filter and Aggregate Pushdown in Parquet (07:11)
- Handling JSON Files (23:19)
- Jupyter Notebook on EMR – Introduction (09:32)
- HandsOn : Jupyter Notebook Security Group & IAM pre-requisites (06:47)
- HandsOn : Create Jupyter Notebook and execute PySpark application (19:48)
Section 10 – Spark Core API RDD, Spark Configurations
- Spark Configuration Parameters (19:08)
- Introduction to RDD – map(), filter(), flatMap() APIs (15:56)
- HandsOn : SELECT & FILTER using RDD (14:01)
Section 11 – Spark SQL
- Introduction to Spark SQL, SQL on DataFrames (14:27)
- HandsOn : DataFrame to SparkSQL (13:50)
- Spark SQL Objects (12:41)
- HandsOn : Spark SQL Objects – Database, Tables (13:15)
- HandsOn : Read & Write Spark SQL Objects (15:28)
- Iceberg ACID (32:58)
- HANDS-ON – SparkSQL on EMR using Jupyter Notebook (10:21)
Section 12 – Data Lakehouse Using Open Table Format (OTF)
- Lakehouse Introduction & Iceberg Architecture (19:29)
- HANDS-ON : Iceberg Architecture (13:13)
- Iceberg Configuration, COW, MOR, Table Operations & Properties (17:28)
- HandsOn : Iceberg Table Properties (11:21)
- HANDS-ON : Iceberg Delete files – copy-on-write (COW), merge-on-read (MOR) (09:43)
- COW or MOR??? Time Travel Queries, Iceberg Table History (16:04)
- HANDS-ON : Iceberg Metadata, Time Travel Queries (08:07)
- Iceberg Table Maintenance (07:25)
- HANDS-ON : Iceberg Table Maintenance (08:07)
- HANDS-ON : Iceberg Hidden Partition & Partition Evolution (22:55)
- Medallion & Lakehouse Architecture (13:49)
Section 13 – Apache Kafka – The Streaming Ingestion
- Introduction to Kafka (15:35)
- Create Kafka Cluster using AWS MSK (Provisioned & Serverless) (20:39)
- Create Kafka Cluster on Laptop (09:39)
- Kafka Components (09:28)
- Kafka Components – Topics & Partitions (14:21)
- Kafka Up Close – Topology, Replication, Message Distribution, Write & Read Data (18:55)
- Kafka Producer – Application, API, Configuration, Sync-Async Send, Python Library (21:15)
- HANDS-ON : Kafka Producer Application – Send Records to Topic (19:38)
- HANDS-ON : Kafka Producer Configuration (05:15)
- Kafka Consumer – Application, Configuration, Partition Rebalance (15:49)
- HANDS-ON : Kafka Consumer Application, Partition Ownership (14:44)
- HANDS-ON : Kafka poll() API (05:58)
- HANDS-ON : Kafka Offset Management (18:05)
- HANDS-ON : Auto Offset Reset – Earliest & Latest (09:18)
- Kafka Cluster (MSK) & Topic Throughput Calculation (14:50)
- Avro Data Format & Glue Schema Registry (13:11)
- HANDS-ON : Send AVRO data to Kafka Topic using Glue Schema Registry (21:43)
- Schema Evolution & Kafka Configuration (15:45)
- HANDS-ON : Schema Evolution using Glue Schema Registry (06:43)
Section 14 – Spark Streaming – Stream Processing Using Spark
- Spark Streaming Intro, Trigger, Output Mode (21:47)
- HANDS-ON : PySpark Streaming using Console Sink (17:03)
- HANDS-ON : Execute Streaming code from PyCharm, use Spark History Server (10:34)
- HANDS-ON : JSON data Streaming – Python Producer, Spark Consumer (24:58)
- 10:05
- HANDS-ON : Using Files as Triggers in Streaming (13:53)
- HANDS-ON : Streaming State Store & Checkpointing with Output Modes (18:44)
- Introduction to Event Time in Stream Processing (13:59)
- Event Time – Tumbling Window (14:47)
- HANDS-ON : Event Time – Tumbling Window (13:57)
- HANDS-ON : Event Time – Sliding Window (12:13)
- Watermark – Handle Late Data & Manage State Store (22:35)
- HANDS-ON : Watermark – Handle Late Data & Manage State Store (12:47)
- HANDS-ON : Monitor Streaming Applications using Spark UI (08:21)
- Spark Streaming Conclusion (05:45)
Section 15 – AWS Lambda for Data Processing
- Intro to Serverless Compute & AWS Lambda Use Cases (09:49)
- Lambda Function & its Components (08:01)
- HANDS-ON : Configure Lambda Function & its Components (16:53)
- Lambda (Python Code) Execution Model (11:51)
- HANDS-ON : Write Python Code from Lambda Console & Explore ‘Event’ data (12:39)
- HANDS-ON : S3 Trigger -> Read File -> Copy to /tmp -> Print data (09:01)
- HANDS-ON : S3 Trigger -> Read File -> Process -> Write to S3 (10:16)
- HANDS-ON : Deploy Python code from S3 using ZIP deployment package (09:48)
- HANDS-ON : Deploy Additional Python Packages & Modules using ZIP in Lambda (10:44)
- HANDS-ON : S3 Trigger to Database & Concurrent Lambda Execution (08:38)
- Architecture using Lambda (03:17)
(Optional) AWS Essentials
- AWS Cloud and EC2 Intro (19:13)
- EC2 Components & HandsOn 1 (21:07)
- EC2 HandsOn 2 (11:26)
- EBS Theory (19:33)
- EBS HandsOn (13:25)
- VPC Introduction & Components (18:36)
- VPC Components Hands On (23:04)
- Bastion Host (05:29)
- Security Groups (15:09)
- NAT Gateway & VPC Endpoint (20:13)
- VPC Peering (02:33)
- AWS IAM Intro & Hands On (26:01)
- IAM Service Role (19:38)
(Optional) Python Essentials
- Python Intro – Architecture, PyCharm, Virtual Env (39:33)
- PyCharm & CLI Walkthrough (08:51)
- Compiled vs Interpreted (07:42)
- Everything in Python is an Object (12:39)
- String (10:20)
- Number (04:02)
- List (11:36)
- Tuple (06:16)
- Set Dict Type Conversion (16:03)
- Memory Allocation & Operators (10:27)
- Set up Python interpreter in PyCharm (11:23)
- Print & Input Functions (16:04)
- IF Statement (14:36)
- For & While loops (15:57)
- Functions Intro (09:54)
- Function Scoping (15:20)
- Functions RETURN (07:53)
- Function Arguments (09:38)
- Modify Arguments (09:07)
- Positional & Keyword Arguments (09:42)
(Optional) SQL Essentials
- SQL Introduction (30:35)
- Client & Server Setup (12:18)
- Database Objects Theory (28:13)
- Database Objects Hands On (29:19)
- CRUD Operations (21:18)
- SELECT Operators (24:41)
- CASE COALESCE (13:19)
- DATE Functions (05:46)
- CTAS Cast Concat (14:11)
- Update Delete Truncate (12:31)
- HAVING Clause (07:43)
- Joins (19:27)
- Union Intersect View (17:35)
- Materialized View (08:19)
- Common Table Expression (CTE) (10:48)
- Window Functions (22:40)
- MERGE & Summary (10:52)






