Data Engineering Mastery Plan
Overview
Your Python proficiency + 3 months of focused study = Data Engineering Interview Ready
This plan integrates the complete Data Engineering Zoomcamp curriculum with deep theoretical foundations. You’ll learn both practical tools and the fundamental concepts that separate senior engineers from tool operators.
Month 1: Foundations & Workflow
Week 1-2: Docker, Terraform & SQL Mastery
Week 1: Environment Setup + SQL Intensive
Days 1-3: Zoomcamp Module 1 - Docker & Terraform
- Module 1: Docker & Terraform (6-8 hours)
- Docker fundamentals with PostgreSQL
- Running Postgres containers
- Infrastructure as Code with Terraform
- NYC taxi dataset introduction
- LeetCode Setup: Get Premium subscription ($35/month)
- Complete: SQL 50 Study Plan (10-12 hours)
- Focus on SELECT, WHERE, JOIN basics
- Target: 15-20 Easy problems
Days 4-7: SQL Pattern Mastery
- Daily Target: 5-6 LeetCode database problems (2-3 hours/day)
- Window Functions: ROW_NUMBER, RANK, DENSE_RANK
- LAG/LEAD for time-series
- DataLemur: Complete all free tier questions
- Study: Mode Analytics SQL Tutorial - Window Functions
Week 2: Advanced SQL + Module 1 Completion
Days 8-10: Complex Queries
- LeetCode Medium: 4-5 problems daily focusing on:
- Recursive CTEs (hierarchies, trees)
- Complex JOINs (self-joins, multiple tables)
- Subqueries vs CTEs
- HackerRank SQL: Advanced Select (2 hours/day)
- Read: PostgreSQL CTE Documentation
Days 11-14: Query Optimization
- EXPLAIN ANALYZE Practice:
- PostgreSQL EXPLAIN Visualizer
- PG Exercises - Complete all
- LeetCode Hard: 2-3 problems daily
- Complete: Module 1 Homework
Week 2 Target: 80+ LeetCode problems solved, comfortable with Medium difficulty in <20 minutes
Week 3-4: Workflow Orchestration & Database Internals
Week 3: Zoomcamp Module 2 + B-Trees Deep Dive
Days 15-17: Workflow Orchestration Fundamentals
- Module 2: Workflow Orchestration (8-10 hours)
- Kestra orchestration basics
- Building data pipelines
- Scheduling and dependencies
- Theory: CMU Database Systems Intro (3 hours)
- B-Tree operations and structure
- Page splits, node organization
- Take detailed notes
Days 18-21: Storage Engines Deep Dive
- Interactive Learning:
- Implementation: Code a simple B-Tree in Python (4-6 hours)
- Complete: Module 2 Homework
Week 4: Data Warehouse Foundations
Days 22-24: Zoomcamp Module 3 - Data Warehouse
- Module 3: Data Warehouse (10-12 hours)
- BigQuery fundamentals
- Partitioning and clustering strategies
- Cost optimization techniques
- OLAP vs OLTP understanding
- Reading: “Designing Data-Intensive Applications” Ch. 3 - Storage and Retrieval
Days 25-28: Index Strategies + Module 3 Homework
- Deep Dive: Use The Index, Luke! - Complete course
- Clustered vs Non-clustered indexes
- Covering indexes, Partial indexes
- Practical Lab:
- Load 10GB dataset into PostgreSQL
- Benchmark different index types
- Compare with BigQuery query performance
- Complete: Module 3 Homework
Month 2: Analytics Engineering & Distributed Systems
Week 5-6: dbt & MapReduce Paradigm
Week 5: Analytics Engineering with dbt
Days 29-31: Zoomcamp Module 4 - Analytics Engineering
- Module 4: Analytics Engineering (12-15 hours)
- dbt fundamentals and philosophy
- Building transformation layers
- Testing and documentation
- Deployment strategies
- Supplementary: dbt Learn Fundamentals
Days 32-35: MapReduce Theory + dbt Practice
- Essential Reading: Google MapReduce Paper (2004) (4-5 hours)
- Implement word count from scratch
- Understand Map, Shuffle, Reduce phases
- dbt Project: Build complete transformation pipeline
- Bronze/Silver/Gold layers
- Data quality tests
- Documentation
- Complete: Module 4 Homework
Week 6: Batch Processing Foundations
Days 36-38: File Formats Deep Dive
- Study:
- Apache Parquet Documentation
- Dremel Paper (Google) - Columnar storage foundation
- Hands-on (6 hours):
- Convert 1GB CSV to Parquet, ORC, Avro
- Measure: file size, write time, read time, query performance
- Use PyArrow, fastparquet
Days 39-42: Hadoop Context + Spark Introduction
- Understanding Hadoop:
- Hadoop Architecture
- HDFS: block size, replication, rack awareness
- Why Spark replaced MapReduce (10-100x faster)
- Reading: Spark RDD Paper (2012)
Week 7-8: Batch Processing Mastery
Week 7: Zoomcamp Module 5 - Batch Processing
Days 43-46: Spark Core Concepts
- Module 5: Batch Processing (15-20 hours)
- Spark architecture and internals
- RDDs, DataFrames, Datasets
- Transformations vs Actions
- DAG execution model
- Databricks: Sign up for Community Edition
- Academy: Spark Programming - Free courses
Days 47-49: Spark SQL & Performance
- Deep Dive: Catalyst Optimizer
- Logical Plan → Optimized Logical Plan → Physical Plan
- Code generation phase
- Practice: Implement same logic in:
- RDD API (low-level understanding)
- DataFrame API (optimization)
- Spark SQL (Catalyst)
- Complete: Module 5 Homework
Week 8: Streaming Foundations
Days 50-53: Zoomcamp Module 6 - Streaming
- Module 6: Streaming (12-15 hours)
- Kafka fundamentals
- Topics, Partitions, Consumer Groups
- Producer and Consumer APIs
- Stream processing concepts
- Confluent: Kafka 101 Course
Days 54-56: Structured Streaming
- Deep Dive: Spark Streaming Guide
- Micro-batching vs continuous processing
- Watermarks and late data handling
- Stateful operations
- Mini Project: Real-time word count from Kafka
- Complete: Module 6 Homework
Month 3: NoSQL, Cloud & Production Systems
Week 9-10: Advanced Topics & NoSQL
Week 9: LSM Trees & Cassandra
Days 57-60: LSM Tree Architecture
- Essential Reading:
- Key Concepts:
- Write amplification vs Read amplification
- Compaction strategies (Size-tiered, Leveled)
- Why Cassandra/ScyllaDB chose LSM
- Practical: Benchmark PostgreSQL (B+ Tree) vs RocksDB (LSM Tree)
Days 61-63: Cassandra & CAP Theorem
- Course: ScyllaDB University - ScyllaDB Essentials (Free)
- Data Modeling:
- Query-first design
- Partition key selection
- Denormalization patterns
- CAP in Practice:
- Consistency levels (ONE, QUORUM, ALL)
- Hinted handoff, Read repair
- Lab: Design time-series data model
Week 10: Document Stores & Module 7
Days 64-66: MongoDB Fundamentals
- Course: MongoDB M001 Basics (Free)
- Understanding:
- B+ Trees in MongoDB (not LSM)
- BSON format advantages
- Aggregation pipeline
- Schema Patterns:
- Embedding vs Referencing
- Bucket pattern for time-series
- MongoDB Design Patterns
Days 67-70: Zoomcamp Module 7 - Project
- Module 7: Project (8-10 hours initial planning)
- Choose project topic
- Design architecture
- Set up infrastructure
- Begin implementation
- Reading: Google Bigtable Paper (2006)
Week 11-12: Cloud Platforms & Final Project
Week 11: Cloud Services & Advanced Topics
Days 71-73: Cloud Data Services
Choose your focus based on job market:
AWS (Most companies):
- S3, Glue, EMR, Redshift, Kinesis
- AWS Data Analytics Fundamentals
GCP (Analytics-heavy companies):
- Cloud Storage, Dataflow, Dataproc, BigQuery, Pub/Sub
- GCP Data Engineering Path
Quick Lab: Replicate local pipeline in cloud
Days 74-77: Modern Data Stack & Project Development
- Continue Module 7 Project (15-20 hours)
- Implement core pipeline
- Add monitoring and data quality
- Document architecture decisions
- Data Quality:
- Great Expectations basics
- dbt tests implementation
- Data contracts concept
Week 12: Project Completion & Interview Prep
Days 78-81: Finalize Capstone Project
- Complete Module 7 Project Requirements:
- Ingest from 3+ sources (API, Database, Files)
- Process with Spark (batch) + Kafka (streaming)
- Store in Data Lake (Bronze/Silver/Gold)
- Serve via Data Warehouse
- Orchestrate with Kestra/Airflow
- Monitor with data quality checks
- Documentation:
- Architecture diagram (draw.io)
- Design decisions and trade-offs
- Performance metrics
- README with setup instructions
Days 82-84: SQL Interview Grind
- LeetCode Database: Solve all Hard problems
- Target: 150+ total problems
- Company-Specific: DataLemur Company Questions
- Practice: Solve Medium problems in <20 minutes consistently
Days 85-87: System Design Practice
Common Scenarios:
- Design YouTube Analytics Pipeline
- Build Uber’s Real-time Pricing System
- Create Netflix Recommendation Data Pipeline
- Design Twitter’s Tweet Processing System
Resources:
Days 88-90: Mock Interviews
Practice Sessions:
- SQL coding: 45 minutes, 3 problems
- Python/Spark coding: 45 minutes, 2 problems
- System design: 60 minutes whiteboard
- Behavioral: STAR method stories ready
Daily Schedule Template
Weekday (2-3 hours)
- 6:00-7:00 AM: Theory/Reading (Papers, Documentation)
- 8:00-9:00 PM: Practical (Coding, LeetCode)
- 9:00-9:30 PM: Project work or Zoomcamp videos
Weekend (4-5 hours)
- Saturday: Deep learning (Watch lectures, read papers)
- Sunday: Build projects, complete homework
Complete Zoomcamp Module Coverage
| Module | Week | Focus | Hours |
|---|---|---|---|
| Module 1 | Week 1-2 | Docker & Terraform | 6-8 |
| Module 2 | Week 3 | Workflow Orchestration | 8-10 |
| Module 3 | Week 4 | Data Warehouse | 10-12 |
| Module 4 | Week 5 | Analytics Engineering | 12-15 |
| Module 5 | Week 7 | Batch Processing | 15-20 |
| Module 6 | Week 8 | Streaming | 12-15 |
| Module 7 | Week 10-12 | Project | 30-40 |
Total Zoomcamp Time: 93-120 hours
Total Plan Time: 270-360 hours (includes theory, practice, interview prep)
Essential Papers (Reading Order)
- Google MapReduce (2004)
- Google Bigtable (2006)
- Amazon Dynamo (2007)
- Google Dremel (2010)
- Spark RDD Paper (2012)
Books (Priority Order)
- “Designing Data-Intensive Applications” - Martin Kleppmann ($40)
- Ch. 1-4 (Month 1), Ch. 5-9 (Month 2), Ch. 10-12 (Month 3)
- “Database Internals” - Alex Petrov ($35)
- Part 1: Storage Engines (Month 1)
- “Learning Spark” 2nd Edition - Free from Databricks
- Ch. 1-8 during Month 2
Courses & Platforms
- LeetCode Premium: $35/month (Essential for SQL)
- DataLemur: Free tier + Premium $19/month (optional)
- Data Engineering Zoomcamp: Free (Core curriculum)
- Databricks Academy: Free community edition
- ScyllaDB University: Free
- MongoDB University: Free
- Confluent Developer: Free tier
Success Metrics
Month 1 Checkpoint
- 100+ LeetCode SQL problems solved
- Can explain B-Tree operations and draw diagrams
- Understand OLTP vs OLAP trade-offs
- Completed Zoomcamp Modules 1-3
- Built orchestrated data pipeline with Kestra
Month 2 Checkpoint
- Read MapReduce, Spark RDD papers
- Built 3+ Spark applications
- Understand Parquet format internals
- Can explain Catalyst optimizer phases
- Completed Zoomcamp Modules 4-6
- Built streaming pipeline with Kafka
Month 3 Checkpoint
- Understand LSM vs B-Tree trade-offs
- Deployed full pipeline to cloud
- 150+ LeetCode problems total
- Completed Zoomcamp Module 7 (Final Project)
- Portfolio with production-quality end-to-end project
- Can design distributed systems on whiteboard
Interview Ready Checklist
Technical Skills
- Solve SQL medium problems in <20 minutes
- Explain database index types and use cases
- Design partition strategies for Spark jobs
- Choose appropriate NoSQL database for requirements
- Debug performance using execution plans
System Design
- Design batch ETL pipeline (sources → processing → warehouse)
- Design streaming pipeline (Kafka → processing → sink)
- Explain CAP theorem with real examples
- Calculate resource requirements for pipelines
- Design data models for different paradigms
Behavioral Stories (STAR Format)
- Debugging critical pipeline failure
- Optimizing slow queries/jobs
- Collaborating with stakeholders
- Learning new technology quickly
- Handling ambiguous requirements
Total Investment
Time: 270-360 hours over 3 months
- Month 1: 90-120 hours (SQL + Databases + Workflow + DW)
- Month 2: 90-120 hours (Analytics Eng + Distributed Systems + Streaming)
- Month 3: 90-120 hours (NoSQL + Cloud + Final Project)
Cost: $170-$300
- LeetCode Premium (3 months): $105
- DataLemur Premium (optional): $57
- Books: $75-$115
- Cloud credits: Free tier sufficient
- Zoomcamp: Free
Alternative Free Path
- LeetCode free tier + HackerRank
- Library books or shared subscriptions
- All courses have free tiers
- Total cost: $0-$50
Community & Support
- Slack: DataTalks.Club Slack -
#course-data-engineering - Reddit: r/dataengineering for questions
- Discord: Data Engineering Discord for real-time help
- GitHub: Star and watch Zoomcamp repo
Why This Approach Works
This plan follows the “fundamentalist approach” - you’re not just learning tools, you’re understanding why things work. By month 3, you’ll understand:
- Why Cassandra uses LSM trees
- Why Spark’s Catalyst beats hand-written code
- Why schemaless designs create complexity
- How to make informed architectural decisions
The complete Zoomcamp integration ensures you have hands-on experience with modern tools while the theoretical deep dives give you the knowledge to explain and defend your design choices in interviews.
Remember: “SQL leetcode will get you further than anything else” - Start there, build systematically, and by month 3 you’ll be designing distributed systems with confidence.
Reddit Comments
r/dataengineering
Certain_Leader9946 says:
the sql leetcode will teach you to solve the problems.
understanding how the hardware interacts with the data and its file formats will make painfully obvious why a solution like spark is even used.
you will then understand that… a lot of these big data products are just applications of map reduce algorithms over parallalisable and partitioned file reads, and that systems to query data lakes, are largely derivations of the same underlying concepts. i really don’t think reading those big chunky textbooks is as worth it as taking a fundementalist approach but, eh?
here’s what i would do, in order:
- learn sql to death, thats just leetcode, nothing will get you further than leetcode. www.leetcode.com get a premium subscription. you can access their data structures and algorithms content for 2. get comfortable with LC Medium questions for SQL.
- understand OLTP databases (btrees and b+ trees), i mean really understand it, this guy does a great. talk, in fact, just absorb everything he says: https://www.youtube.com/watch?v=aZjYr87r1b8
- read the map reduce paper and learn about parquet and understand how map reduce and parquet would work together for scaling a data problem horizontally. you will need to at least understand some basic functional programming idioms to get this far.
- learn apache spark and its relationship with map reduce (how the DAG works and how catalyst works for query planning). read up on classic hadoop.
- learn the apache spark api, spark sql, and pyspark .etc.
- now look at cloud technology implementations of data services to understand how cloud services make it easier to deploy 1 through 5, not that they are magic beans (except amazon athena, that’s actually really cool).
- learn no sql solutions and how they are built. look up the inner workings of scylla and cassandra. understand what trade offs their data structures make to achieve their ‘web scale’ like performance characteristics and the limitations involved there (SS tables and LSM trees). try to be able to compare them by their O complexity series and their query performance. understand MongoDB uses B+ trees in ways that don’t make it too dissimilar from RDBMS except it stores data in JSON documents (which has its own issues).
- (optional). accept in your heart that schemaless designs make life difficult for everyone involved when it comes to data-sanity, because you quickly run into problems where you want to do a migration and can’t depend on the contents on your data. e.g. ive ran into a problem this last week with a ‘schemaless’ application design where people were having a shouting match about ‘where did the data come from’. well if you designed your schemas up front instead of fronting for agile that wouldn’t be an issue but dont mind me :)
congratulations. you can now derive your understanding of almost any large data application under the sun from your fundamental understanding of the core technologies which will make each of them a breeze to work with.