
Data Engineering Vol2 AWS : Data Processing – Spark & Kafka

  • By Soumyadeep Dey
  • Data Engineering and Analytics
  • (0 Rating)
    • This is Volume 2 of the Data Engineering course. It covers the open-source data processing technologies Spark and Kafka, which are among the most widely used frameworks for batch and stream processing. You will learn Spark from Level 100 to Level 400 through real-life hands-on exercises and projects. The course also introduces the data lake on AWS (S3) and the data lakehouse using Apache Iceberg.

      I will use AWS as the hosting platform and cover the AWS services EMR, S3, and MSK, along with Databricks as a Spark hosting platform. I will also show Spark integration with other services such as AWS RDS (MySQL or PostgreSQL) and Redshift.

      You will work hands-on with large datasets (100 GB – 300 GB or more). The exercises mirror real-world scenarios such as Spark batch processing, stream processing, performance tuning, streaming ingestion, window functions, and ACID transactions on Iceberg.

      Some other highlights:

      • 7 Projects with different datasets. Total dataset size of 250 GB or more.

      • Other technologies covered – EC2, EBS, VPC and IAM.

      • Optional Python videos

      • Optional AWS and SQL Essentials videos

      What Will You Learn?
      • Deep dive on Spark and Kafka using AWS EMR, Databricks, MSK
      • Understand Data Engineering (Volume 2) on AWS using Spark and Kafka
      • Batch and Stream processing using Spark and Kafka
      • Production-level projects and hands-on exercises that give candidates on-the-job-style training
      • Get access to 100 GB - 200 GB datasets to practice with
      • Learn Python for Data Engineering with HANDS-ON (Functions, Arguments, OOP (class, object, self), Modules, Packages, Multithreading, file handling etc.)
      • Learn SQL for Data Engineering with HANDS-ON (Database objects, CASE, Window Functions, CTE, CTAS, MERGE, Materialized View etc.)
      • AWS Data Analytics services - S3, EMR, Databricks, MSK

      Requirements

      • Good to have AWS and SQL knowledge

      Audience

      • Python developers, Application Developers, Big Data Developers
      • Data Engineers, Data Scientists, Data Analysts
      • Database Administrators, Big Data Administrators
      • Data Engineering Aspirants
      • Solutions Architects, Cloud Architects, Big Data Architects
      • Technical Managers, Engineering Managers, Project Managers

      Course Content

      Introduction to Data Engineering – Vol 2

      • Introduction – Data, Data Lifecycle & Data Engineering Pipeline
        26:28
      • Data Engineering Volume 2 Course & Projects Overview, Roles in Data
        25:19
      • AWS Resource Cost for the Course
        06:26

      Section 1 – Big Data Processing

      • Distributed Compute & Storage, Big Data, MapReduce
        34:11
      • Map & Reduce Tasks, Big Data Ecosystem
        30:55
      • HandsOn : MapReduce using “mrjob” Python Library
        27:44
      • HandsOn : HDFS Commands
        10:14
      • YARN – Architecture & Usage
        26:05
      • ZooKeeper, Big Data File Formats – Parquet & Avro
        20:18

      Section 2 – Spark Introduction

      • Introduction to Spark, Batch & Stream Processing
        27:22
      • Spark Ecosystem, Development, Architecture & Execution
        23:56
      • Spark Codebase, JVM, Setup & Configuration, PySpark
        34:01

      Section 3 – Knowing Spark Up Close: Part 1

      • Spark Standalone Cluster on Laptop & EC2, Spark UI
        29:24
      • Spark on AWS EMR
        21:54
      • Spark Application Deployment Modes (--deploy-mode cluster|client)
        03:11
      • DataFrames, DAG, PySpark Module
        24:22
      • SparkSession
        20:16
      • DataFrame Reader & Writer
        28:39
      • Define DataFrame Schema – StructType, StructField, inferSchema
        21:05
      • Partition, Split, Task, Executor Relation
        17:50
      • Splits for Smaller Files
        17:37

      Section 4 – Spark Transformation & Action : Part 1

      • Introduction to Transformation
        15:59
      • HandsOn : Intro to Transformation
        17:59
      • Step-by-step Transformation
        18:26
      • Chain Transformation
        05:47
      • Transformation – describe, sort, limit, drop, dropDuplicates
        19:16
      • Spark Action & Lazy Loading
        08:37
      • HandsOn : Spark Action & Lazy Loading
        15:12
      • HandsOn : Spark UI – Jobs, Executors, Environment
        10:04
      • HandsOn : load() – Transformation or Action??
        13:23
      • HandsOn : File Performance Comparison – CSV vs Parquet
        11:37
      • Assignment Project 1 – Transformation & Action
        10:47
      • Transformation – agg, groupBy, orderBy, selectExpr, subtract, withColumn, union
        07:57
      • HandsOn : Transformation : agg(), alias(), groupby()
        14:08
      • HandsOn : Transformation : join()
        19:00
      • HandsOn : Function : col()
        19:21
      • HandsOn : Transformation : orderBy(), selectExpr()
        09:13
      • HandsOn : Transformation : subtract()
        09:03
      • HandsOn : Transformation : union(), unionAll(), intersect(), withColumn()
        10:10
      • Shuffle Process & External Shuffle Service (ESS)
        23:53
      • HandsOn : Shuffle Process in groupBy() – Explanation using Spark UI
        24:12
      • HandsOn : Shuffle Process in join() – Explanation using Spark UI
        10:13
      • DO NOT use show() for analysis in Spark
        06:10
      • What should be the value of Shuffle Partition No (spark.sql.shuffle.partitions)
        13:21
      • Assignment Project 2 : Shuffle & Spark UI
        19:59
      • Summary – Transformations & Actions Part 1
        03:05

      Section 5 – Spark Partition – Input, Shuffle & Output

      • Spark Input Partitions (spark.sql.files.maxPartitionBytes)
        16:31
      • Output Partitions – repartition(), coalesce()
        11:34
      • partitionBy(), Transformations that change the no. of Partitions
        15:01
      • DataFrame Writer Output Modes
        06:45

      Section 6 – Knowing Spark Up Close: Part 2

      • Do DataFrames Keep Occupying Memory?
        09:39
      • How Spark Applications Use Memory & CPU, and How to Calculate Them
        12:53
      • How to decide the size of Spark PROD Cluster
        14:09
      • Explore Spark UI – WholeStageCodegen, MapPartitions, Exchange etc.
        22:09
      • Spark OnHeap (Executor) Memory Mgmt – Reserve, Execution, Storage & User Memory
        14:44
      • HandsOn : Configure no. of Executors, their memory & CPU – application + spark-submit
        11:24
      • HandsOn : Executor, Execution & Storage Memory in Spark UI
        10:00
      • Memory Spilling, Unified Memory Management & Off-heap Memory
        11:04
      • Cache, Persist, Unpersist, ClearCache
        07:05
      • HandsOn : Monitor Memory Spill, Caching & Persist in Spark UI
        13:55
      • Garbage Collection & Kryo Serialization in Spark
        10:21
      • Assignment Project 3.1 – Flight Efficiency Analysis
        21:38
      • Assignment Project 3.1 – Flight Efficiency Analysis Coding
        17:16
      • Assignment Project 3 – Airport Operations, Same Route Performance, Freq Flier Prog
        07:00

      Section 7 – Transformation & Action Part 2 + Spark Functions

      • Dataframe collect() API
        10:35
      • foreach() & foreachPartition() to apply custom logic on DataFrame records
        16:51
      • Loop/Iterate through DataFrame records – toLocalIterator(), df.na.fill()
        16:53
      • Spark Functions – lit(), concat(), expr()
        13:16
      • DATE Functions
        15:22
      • when() & otherwise() Functions
        06:25
      • WINDOW Functions
        09:24
      • HandsOn : WINDOW Functions
        07:32
      • Assignment : Spark WINDOW Functions
        18:09
      • Assignment Project 4 – Layover Analysis, Travel Activity, Environment Impact
        36:28
      • lead(), lag(), nth_value(), first_value(), last_value()
        11:03
      • HandsOn : lead(), lag(), nth_value(), first_value(), last_value()
        06:50

      Section 8 – Knowing Spark Up Close : Part 3

      • Spark Catalyst Optimizer & Tungsten Execution Engine
        16:57
      • Spark Internal Join – Sort Merge Join (SMJ)
        15:11
      • Spark Internal Join – Broadcast Hash Join (BHJ)
        09:57
      • Spark Internal Join – Broadcast Nested Loop Join (BNLJ)
        06:06
      • Spark Internal Join – Shuffled Hash Join (SHJ)
        05:56
      • Spark Internal Join – Shuffled NL Join (SNLJ), Join Hints, Join Performance
        05:11
      • SparkUI Operators – HashAggregate, SortAggregate, BroadCastExchange
        12:35
      • SparkUI Operators – BroadcastHashJoin, BroadcastNLJoin
        05:59
      • SparkUI Operators – ShuffledHashJoin, Limit, CollectLimit, Window, ReusedExchange
        07:39
      • HandsOn : Spark Operators
        07:28
      • Spark Explain & How to Read Explain Output
        21:06
      • Data Skewness or Skewed Data
        08:10
      • HandsOn : Analyse & Find Data Skewness using Spark UI
        14:51
      • HandsOn : Spark Salting – Mitigate Data Skewness and Skew Joins
        09:12
      • HandsOn : Implement Salting step-by-step
        16:56
      • Adaptive Query Execution (AQE), AQE Parameters
        17:40
      • HandsOn : AQE Coalesce Shuffle Partitions
        09:41
      • HandsOn : AQE Optimize Data Skewness & Skewed Join
        04:25
      • Dynamic Partition Pruning
        07:24
      • HandsOn : Dynamic Resource Allocation (DRA)
        17:41
      • SUPER SUMMARY
        10:29

      Section 9 – Hosting Platforms AWS EMR (Elastic MapReduce), Databricks

      • EMR Hosting Platform Introduction
        23:20
      • EMR Cluster Creation with IAM Roles
        13:46
      • EMR Considerations – Spark, YARN Configuration Parameters, Submit Spark code to
        17:40
      • HandsOn : Prepare EMR environment to execute Spark Applications
        12:03
      • HandsOn : Submit Spark application to EMR from Primary Node & Laptop
        20:03
      • HandsOn : S3 vs HDFS as File Source & Target
        05:58
      • HandsOn : EMR Steps
        09:41
      • Relational (RDBMS) Database Sources & Targets
        15:23
      • numPartitions, lowerBound, upperBound
        04:50
      • HandsOn : Write to RDBMS (MySQL) Target
        22:27
      • HandsOn : Read from RDBMS (MySQL) Source
        05:32
      • Data Warehouse Source & Targets – Redshift
        13:19
      • HandsOn : Read & Write from/to Redshift Data Warehouse
        16:39
      • Filter and Aggregate Pushdown in Parquet
        07:11
      • Handling JSON Files
        23:19
      • Jupyter Notebook on EMR – Introduction
        09:32
      • HandsOn : Jupyter Notebook Security Group & IAM pre-requisites
        06:47
      • HandsOn : Create Jupyter Notebook and execute PySpark application
        19:48

      Section 10 – Spark Core API RDD, Spark Configurations

      • Spark Configuration Parameters
        19:08
      • Introduction to RDD – map(), filter(), flatMap() APIs
        15:56
      • HandsOn : SELECT & FILTER using RDD
        14:01

      Section 11 – Spark SQL

      • Introduction to Spark SQL, SQL on DataFrames
        14:27
      • HandsOn : DataFrame to SparkSQL
        13:50
      • Spark SQL Objects
        12:41
      • HandsOn : Spark SQL Objects – Database, Tables
        13:15
      • HandsOn : Read & Write Spark SQL Objects
        15:28
      • Iceberg ACID
        32:58
      • HANDS-ON – SparkSQL on EMR using Jupyter Notebook
        10:21

      Section 12 – Datalakehouse Using Open Table Format (OTF)

      • Lakehouse Introduction & Iceberg Architecture
        19:29
      • HANDS-ON : Iceberg Architecture
        13:13
      • Iceberg Configuration, COW, MOR, Table Operations & Properties
        17:28
      • HandsOn : Iceberg Table Properties
        11:21
      • HANDS-ON : Iceberg Delete files – copy-on-write (COW), merge-on-read (MOR)
        09:43
      • COW or MOR??? Time Travel Queries, Iceberg Table History
        16:04
      • HANDS-ON : Iceberg Metadata, Time Travel Queries
        08:07
      • Iceberg Table Maintenance
        07:25
      • HANDS-ON : Iceberg Table Maintenance
        08:07
      • HANDS-ON : Iceberg Hidden Partition & Partition Evolution
        22:55
      • Medallion & Lakehouse Architecture
        13:49

      Section 13 – Apache Kafka – The Streaming Ingestion

      • Introduction to Kafka
        15:35
      • Create Kafka Cluster using AWS MSK (Provisioned & Serverless)
        20:39
      • Create Kafka Cluster on Laptop
        09:39
      • Kafka Components
        09:28
      • Kafka Components – Topics & Partitions
        14:21
      • Kafka Up Close – Topology, Replication, Message Distribution, Write & Read Data
        18:55
      • Kafka Producer – Application, API, Configuration, Sync-Async Send, Python Library
        21:15
      • HANDS-ON : Kafka Producer Application – Send Records to Topic
        19:38
      • HANDS-ON : Kafka Producer Configuration
        05:15
      • Kafka Consumer – Application, Configuration, Partition Rebalance
        15:49
      • HANDS-ON : Kafka Consumer Application, Partition Ownership
        14:44
      • HANDS-ON : Kafka poll() API
        05:58
      • HANDS-ON : Kafka Offset Management
        18:05
      • HANDS-ON : Auto Offset Reset – Earliest & Latest
        09:18
      • Kafka Cluster (MSK) & Topic Throughput Calculation
        14:50
      • Avro Data Format & Glue Schema Registry
        13:11
      • HANDS-ON : Send AVRO data to Kafka Topic using Glue Schema Registry
        21:43
      • Schema Evolution & Kafka Configuration
        15:45
      • HANDS-ON : Schema Evolution using Glue Schema Registry
        06:43

      Section 14 – Spark Streaming – Stream Processing Using Spark

      • Spark Streaming Intro, Trigger, Output Mode
        21:47
      • HANDS-ON : PySpark Streaming using Console Sink
        17:03
      • HANDS-ON : Execute Streaming code from PyCharm, use Spark History Server
        10:34
      • HANDS-ON : JSON data Streaming – Python Producer, Spark Consumer
        24:58
      • HANDS-ON : Spark Streaming OUTPUT Mode
        10:05
      • HANDS-ON : Using Files as Triggers in Streaming
        13:53
      • HANDS-ON : Streaming State Store & Checkpointing with Output Modes
        18:44
      • Introduction to Event Time in Stream Processing
        13:59
      • Event Time – Tumbling Window
        14:47
      • HANDS-ON : Event Time – Tumbling Window
        13:57
      • HANDS-ON : Event Time – Sliding Window
        12:13
      • Watermark – Handle Late Data & Manage State Store
        22:35
      • HANDS-ON : Watermark – Handle Late Data & Manage State Store
        12:47
      • HANDS-ON : Monitor Streaming Applications using Spark UI
        08:21
      • Spark Streaming Conclusion
        05:45

      Section 15 – AWS Lambda for Data Processing

      • Intro to Serverless Compute & AWS Lambda Use Cases
        09:49
      • Lambda Function & its Components
        08:01
      • HANDS-ON : Configure Lambda Function & its Components
        16:53
      • Lambda (Python Code) Execution Model
        11:51
      • HANDS-ON : Write Python Code from Lambda Console & Explore ‘Event’ data
        12:39
      • HANDS-ON : S3 Trigger -> Read File -> Copy to /tmp -> Print data
        09:01
      • HANDS-ON : S3 Trigger -> Read File -> Process -> Write to S3
        10:16
      • HANDS-ON : Deploy Python code from S3 using ZIP deployment package
        09:48
      • HANDS-ON : Deploy Additional Python Packages & Modules using ZIP in Lambda
        10:44
      • HANDS-ON : S3 Trigger to Database & Concurrent Lambda Execution
        08:38
      • Architecture using Lambda
        03:17

      (Optional) AWS Essentials

      • AWS Cloud and EC2 Intro
        19:13
      • EC2 Components & HandsOn 1
        21:07
      • EC2 HandsOn 2
        11:26
      • EBS Theory
        19:33
      • EBS HandsOn
        13:25
      • VPC Introduction & Components
        18:36
      • VPC Components Hands On
        23:04
      • Bastion Host
        05:29
      • Security Groups
        15:09
      • NAT Gateway & VPC Endpoint
        20:13
      • VPC Peering
        02:33
      • AWS IAM Intro & Hands On
        26:01
      • IAM Service Role
        19:38

      (Optional) Python Essentials

      • Python Intro – Architecture, PyCharm, Virtual Env
        39:33
      • PyCharm & CLI Walkthrough
        08:51
      • Compiled vs Interpreted
        07:42
      • Everything in Python is an Object
        12:39
      • String
        10:20
      • Number
        04:02
      • List
        11:36
      • Tuple
        06:16
      • Set Dict Type Conversion
        16:03
      • Memory Allocation & Operators
        10:27
      • Set up Python interpreter in PyCharm
        11:23
      • Print & Input Functions
        16:04
      • IF Statement
        14:36
      • For & While loops
        15:57
      • Functions Intro
        09:54
      • Function Scoping
        15:20
      • Functions RETURN
        07:53
      • Function Arguments
        09:38
      • Modify Arguments
        09:07
      • Positional & Keyword Arguments
        09:42

      (Optional) SQL Essentials

      • SQL Introduction
        30:35
      • Client & Server Setup
        12:18
      • Database Objects Theory
        28:13
      • Database Objects Hands On
        29:19
      • CRUD Operations
        21:18
      • SELECT Operators
        24:41
      • CASE COALESCE
        13:19
      • DATE Functions
        05:46
      • CTAS Cast Concat
        14:11
      • Update Delete Truncate
        12:31
      • HAVING Clause
        07:43
      • Joins
        19:27
      • Union Intersect View
        17:35
      • Materialized View
        08:19
      • Common Table Expression (CTE)
        10:48
      • Window Functions
        22:40
      • MERGE & Summary
        10:52

      Tags

      • data engineering
      • data engineering and analytics
      • kafka
      • spark


      Course Includes:

      • Price: ₹1,499.00 ₹4,999.00
      • Instructor: Soumyadeep Dey
      • Duration: 54 hours 56 minutes
      • Lessons: 231
      • Students: 0
      • Level: Intermediate