Data Pipelines with Python

Course Overview

This course provides an in-depth understanding of data pipelines using Python. Students will learn to extract, transform, and load (ETL) data, automate workflows, and optimize data processing for real-world applications.

Prerequisites

Basic Python programming knowledge

Familiarity with SQL and databases

Understanding of basic data structures

________________________________________

Module 1: Introduction to Data Pipelines

What is a data pipeline?

Key components: Extraction, Transformation, and Loading (ETL)

Batch vs. real-time data pipelines

Tools & frameworks overview: Pandas, Apache Airflow, Spark, Kafka

Module 2: Data Extraction

Reading data from CSV, JSON, and XML

Web scraping with BeautifulSoup and Scrapy

APIs and RESTful data extraction using requests

Connecting to databases (PostgreSQL, MySQL) using SQLAlchemy
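
For illustration, a minimal extraction sketch in the spirit of this module is shown below; the CSV path, API URL, and PostgreSQL connection string are placeholders, not course-provided resources.

import pandas as pd
import requests
from sqlalchemy import create_engine

# Extract from a flat file (path is a placeholder)
sales = pd.read_csv("data/sales.csv")

# Extract from a REST API with requests (URL is a placeholder)
response = requests.get("https://api.example.com/orders", timeout=30)
response.raise_for_status()
orders = pd.DataFrame(response.json())

# Extract from PostgreSQL through SQLAlchemy (connection string is a placeholder)
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/shop")
customers = pd.read_sql("SELECT id, name, email FROM customers", engine)

print(len(sales), len(orders), len(customers))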

Module 3: Data Transformation

Data cleaning with pandas

Handling missing values and outliers

Data aggregation and normalization

Using Dask for scalable data processing
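
A minimal transformation sketch is shown below, assuming a small hypothetical orders table with a missing value and an outlier; for larger-than-memory data, dask.dataframe exposes a very similar API.

import pandas as pd

# Hypothetical raw data with a missing value and an outlier
orders = pd.DataFrame({
    "region": ["north", "south", "north", "south", "north"],
    "amount": [120.0, None, 95.0, 10000.0, 110.0],
})

# Missing values: fill with the column median
orders["amount"] = orders["amount"].fillna(orders["amount"].median())

# Outliers: clip to the 1st-99th percentile range
low, high = orders["amount"].quantile([0.01, 0.99])
orders["amount"] = orders["amount"].clip(lower=low, upper=high)

# Aggregation: total and mean amount per region
summary = orders.groupby("region")["amount"].agg(["sum", "mean"]).reset_index()

# Normalization: min-max scale amount to [0, 1]
amount_min, amount_max = orders["amount"].min(), orders["amount"].max()
orders["amount_norm"] = (orders["amount"] - amount_min) / (amount_max - amount_min)

print(summary)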

Module 4: Data Loading

Writing data to CSV, JSON, and databases

Bulk inserts and performance optimization

Automating database interactions
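
The loading sketch below writes transformed data to flat files and appends it to a warehouse table; the connection string and table name are placeholders. Passing chunksize and method="multi" to to_sql batches rows into multi-row INSERT statements, which is usually much faster than inserting one row per round trip.

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical transformed data to load
summary = pd.DataFrame({"region": ["north", "south"], "total": [325.0, 295.0]})

# Load to flat files
summary.to_csv("output/daily_summary.csv", index=False)
summary.to_json("output/daily_summary.json", orient="records")

# Load to a database table in batched multi-row INSERTs
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/warehouse")
summary.to_sql(
    "daily_summary",
    engine,
    if_exists="append",
    index=False,
    chunksize=1000,
    method="multi",
)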

Module 5: Automating Pipelines with Apache Airflow

Setting up Apache Airflow

Defining DAGs (Directed Acyclic Graphs)

Scheduling and monitoring workflows

Integrating Python scripts with Airflow
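
A minimal Airflow 2.x-style DAG sketch is shown below; the dag_id and task bodies are placeholders, and in practice each callable would invoke the extraction, transformation, and loading code from the earlier modules.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    # Placeholder task body; would call the extraction code
    print("extracting...")

def transform():
    print("transforming...")

def load():
    print("loading...")

# The DAG ties the tasks into a directed acyclic graph and gives them a schedule
with DAG(
    dag_id="daily_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Dependencies: extract runs before transform, transform before load
    t_extract >> t_transform >> t_load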

Module 6: Handling Large-Scale Data with Apache Spark

Introduction to PySpark

DataFrame operations in Spark

Optimizing performance with partitioning

Integrating Spark with AWS S3 and Google Cloud Storage
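
A minimal PySpark sketch of this module's ideas, assuming a placeholder CSV input; the same reader also accepts s3a:// or gs:// paths once the S3 / Google Cloud Storage connectors and credentials are configured.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders_pipeline").getOrCreate()

# Read a (placeholder) CSV file into a Spark DataFrame
orders = spark.read.csv("data/orders.csv", header=True, inferSchema=True)

# DataFrame operations: filter, derive a column, aggregate per region
summary = (
    orders.filter(F.col("amount") > 0)
    .withColumn("amount_usd", F.col("amount") / 100)
    .groupBy("region")
    .agg(F.sum("amount_usd").alias("total_usd"))
)

# Repartition by the aggregation key to balance work across executors
summary = summary.repartition(8, "region")

summary.write.mode("overwrite").parquet("output/summary.parquet")
spark.stop()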

Module 7: Real-Time Data Processing with Kafka

Introduction to Apache Kafka

Setting up Kafka producers and consumers

Streaming data processing with kafka-python

Integrating Kafka with Spark Streaming
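
A minimal kafka-python sketch, assuming a broker on localhost:9092 and a placeholder "clicks" topic: the producer publishes JSON events and the consumer processes them as they arrive.

import json

from kafka import KafkaConsumer, KafkaProducer

# Producer: publish click events as JSON
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)
producer.send("clicks", {"user_id": 42, "page": "/pricing"})
producer.flush()

# Consumer: read the same topic and process each event as it arrives
consumer = KafkaConsumer(
    "clicks",
    bootstrap_servers="localhost:9092",
    group_id="click-processor",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
for message in consumer:
    print(message.value)  # e.g. {'user_id': 42, 'page': '/pricing'}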

Module 8: Data Pipeline Testing and Debugging

Writing unit tests with pytest

Debugging common pipeline failures

Monitoring and logging with the ELK Stack
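
The pytest sketch below tests a small hypothetical transform function; the function itself is illustrative rather than part of the course code.

import pandas as pd
import pytest

def clean_amounts(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transform under test: fill missing amounts with the median."""
    out = df.copy()
    out["amount"] = out["amount"].fillna(out["amount"].median())
    return out

def test_clean_amounts_fills_missing_values():
    raw = pd.DataFrame({"amount": [10.0, None, 30.0]})
    cleaned = clean_amounts(raw)
    assert cleaned["amount"].isna().sum() == 0
    assert cleaned["amount"].iloc[1] == pytest.approx(20.0)

def test_clean_amounts_does_not_mutate_input():
    raw = pd.DataFrame({"amount": [10.0, None]})
    clean_amounts(raw)
    assert raw["amount"].isna().sum() == 1  # original frame left untouched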

Module 9: Deploying Data Pipelines

CI/CD for data pipelines

Deploying on AWS Lambda and Google Cloud Functions

Containerizing pipelines with Docker

Orchestrating with Kubernetes
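
As one possible shape for a serverless deployment, the sketch below shows a hypothetical AWS Lambda handler that runs one pipeline cycle when triggered; the handler name and event structure are assumptions, and the same code could equally be packaged into a Docker image and run on Kubernetes.

import json

def handler(event, context):
    """Hypothetical Lambda entry point: run one ETL cycle when triggered,
    e.g. by an S3 upload event or a scheduled EventBridge rule."""
    records = event.get("Records", [])
    processed = 0
    for record in records:
        # In a real pipeline this would extract, transform, and load the
        # object referenced by the event; here we only count records.
        processed += 1
    return {
        "statusCode": 200,
        "body": json.dumps({"processed": processed}),
    }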

Module 10: Capstone Project

Building an end-to-end data pipeline

Automating ETL workflows

Processing real-time streaming data

Deploying and monitoring the pipeline

________________________________________

Tools and Technologies

Python (pandas, SQLAlchemy, requests, Dask)

Apache Airflow

Apache Spark (PySpark)

Apache Kafka

Docker, Kubernetes

PostgreSQL, MySQL

AWS Lambda, Google Cloud Functions

