GitHub - Mahsamehr/data-engineering-zoomcamp: My personal projects, notes, and progress as I complete the DataTalksClub Data Engineering Zoomcamp.

My Data Engineering Journey

Welcome to my personal data engineering portfolio! This repo contains my notes, homework, and projects as I work through the DataTalksClub Data Engineering Zoomcamp.

📚 Week-by-Week Breakdown:

🧱 Week 1: Containerization & Infrastructure Setup

  • Introduction to Google Cloud Platform (GCP)
  • Working with Docker and Docker Compose
  • Running PostgreSQL in Docker containers
  • Managing infrastructure using Terraform

🔁 Week 2: Orchestrating Workflows

  • Understanding Data Lakes and orchestration concepts
  • Building pipelines with Kestra
  • Exploring task scheduling and dependency management

⚙️ Workshop 1: Data Ingestion Techniques

  • Reading and ingesting data from APIs
  • Building scalable pipelines
  • Implementing data normalization and incremental loading
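As a rough illustration of the last bullet, here is a minimal sketch of normalization and incremental loading in plain Python. The function names (`normalize`, `load_incrementally`), the `id` cursor field, and the in-memory list standing in for a destination table are all assumptions made for this example, not code from the workshop.

```python
from typing import Iterable


def normalize(record: dict) -> dict:
    """Flatten one level of nesting, e.g. {"meta": {"city": "X"}} -> {"meta__city": "X"}."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            for sub_key, sub_value in value.items():
                flat[f"{key}__{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


def load_incrementally(records: Iterable[dict], destination: list, last_seen_id: int) -> int:
    """Append only records with an id above last_seen_id and return the new high-water mark."""
    newest = last_seen_id
    for record in records:
        # Assumption: "id" increases monotonically and can serve as the cursor.
        if record["id"] > last_seen_id:
            destination.append(normalize(record))
            newest = max(newest, record["id"])
    return newest
```

Calling `load_incrementally` again with the returned mark skips rows that were already loaded, so a rerun over overlapping API pages does not create duplicates.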

🏢 Week 3: Data Warehousing Essentials

  • Overview of Google BigQuery
  • Implementing table partitioning and clustering
  • Learning optimization best practices
  • Intro to ML features in BigQuery

🛠️ Week 4: Analytics Engineering

  • Building models using dbt (data build tool)
  • Testing, documenting, and deploying transformations
  • Creating dashboards with Metabase

⏱️ Week 5: Batch Processing Fundamentals

  • Introduction to Apache Spark
  • Working with DataFrames and Spark SQL
  • Exploring how groupBy and join operations work under the hood
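To make the "under the hood" bullet more concrete, below is a simplified model of a distributed groupBy count written in plain Python. It is not Spark code: the function name `group_by_count`, the `num_reducers` parameter, and the use of lists as partitions are assumptions chosen for illustration. The sketch mirrors the general pattern of partial aggregation per partition, a hash-based shuffle, and a final merge.

```python
from collections import defaultdict


def group_by_count(partitions, num_reducers=2):
    """Count keys across partitions in two stages, loosely modelling a shuffle."""
    # Stage 1: each input partition computes partial counts locally.
    partials = []
    for partition in partitions:
        counts = defaultdict(int)
        for key in partition:
            counts[key] += 1
        partials.append(counts)

    # Shuffle: route every key to one reducer chosen by its hash,
    # so all partial counts for the same key meet in the same place.
    shuffled = [defaultdict(list) for _ in range(num_reducers)]
    for counts in partials:
        for key, count in counts.items():
            shuffled[hash(key) % num_reducers][key].append(count)

    # Stage 2: each reducer merges the partial counts it received.
    result = {}
    for reducer in shuffled:
        for key, values in reducer.items():
            result[key] = sum(values)
    return result
```

For example, `group_by_count([["a", "b", "a"], ["b", "c"]])` combines the partial counts from both partitions into one total per key. Pre-aggregating before the shuffle keeps the amount of data moved between stages small.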

📡 Week 6: Real-Time Data Streaming

  • Getting started with Kafka
  • Using Kafka Streams and KSQL
  • Managing data schemas with Avro

🎓 Final Project

A final end-to-end project applying concepts learned throughout the course. Coming soon!