Data Pipelines with Python

Course Overview

This course provides an in-depth understanding of data pipelines using Python. Students will learn to extract, transform, and load (ETL) data, automate workflows, and optimize data processing for real-world applications.

Prerequisites

Basic Python programming knowledge

Familiarity with SQL and databases

Understanding of basic data structures

________________________________________

Module 1: Introduction to Data Pipelines

What is a data pipeline?

Key components: Extraction, Transformation, and Loading (ETL)

Batch vs. real-time data pipelines

Tools & frameworks overview: Pandas, Apache Airflow, Spark, Kafka

Module 2: Data Extraction

Reading data from CSV, JSON, and XML

Web scraping with BeautifulSoup and Scrapy

APIs and RESTful data extraction using requests

Connecting to databases (PostgreSQL, MySQL) using SQLAlchemy
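
For illustration, a minimal extraction sketch in the spirit of this module is shown below; the CSV path, API URL, and PostgreSQL connection string are placeholders, not course-provided resources.

import pandas as pd
import requests
from sqlalchemy import create_engine

# Extract from a flat file (path is a placeholder)
sales = pd.read_csv("data/sales.csv")

# Extract from a REST API with requests (URL is a placeholder)
response = requests.get("https://api.example.com/orders", timeout=30)
response.raise_for_status()
orders = pd.DataFrame(response.json())

# Extract from PostgreSQL through SQLAlchemy (connection string is a placeholder)
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/shop")
customers = pd.read_sql("SELECT id, name, email FROM customers", engine)

print(len(sales), len(orders), len(customers))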

Module 3: Data Transformation

Data cleaning with pandas

Handling missing values and outliers

Data aggregation and normalization

Using Dask for scalable data processing
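
A minimal transformation sketch is shown below, assuming a small hypothetical orders table with a missing value and an outlier; for larger-than-memory data, dask.dataframe exposes a very similar API.

import pandas as pd

# Hypothetical raw data with a missing value and an outlier
orders = pd.DataFrame({
    "region": ["north", "south", "north", "south", "north"],
    "amount": [120.0, None, 95.0, 10000.0, 110.0],
})

# Missing values: fill with the column median
orders["amount"] = orders["amount"].fillna(orders["amount"].median())

# Outliers: clip to the 1st-99th percentile range
low, high = orders["amount"].quantile([0.01, 0.99])
orders["amount"] = orders["amount"].clip(lower=low, upper=high)

# Aggregation: total and mean amount per region
summary = orders.groupby("region")["amount"].agg(["sum", "mean"]).reset_index()

# Normalization: min-max scale amount to [0, 1]
amount_min, amount_max = orders["amount"].min(), orders["amount"].max()
orders["amount_norm"] = (orders["amount"] - amount_min) / (amount_max - amount_min)

print(summary)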

Module 4: Data Loading

Writing data to CSV, JSON, and databases

Bulk inserts and performance optimization

Automating database interactions
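
The loading sketch below writes transformed data to flat files and appends it to a warehouse table; the connection string and table name are placeholders. Passing chunksize and method="multi" to to_sql batches rows into multi-row INSERT statements, which is usually much faster than inserting one row per round trip.

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical transformed data to load
summary = pd.DataFrame({"region": ["north", "south"], "total": [325.0, 295.0]})

# Load to flat files
summary.to_csv("output/daily_summary.csv", index=False)
summary.to_json("output/daily_summary.json", orient="records")

# Load to a database table in batched multi-row INSERTs
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/warehouse")
summary.to_sql(
    "daily_summary",
    engine,
    if_exists="append",
    index=False,
    chunksize=1000,
    method="multi",
)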

Module 5: Automating Pipelines with Apache Airflow

Setting up Apache Airflow

Defining DAGs (Directed Acyclic Graphs)

Scheduling and monitoring workflows

Integrating Python scripts with Airflow
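
A minimal Airflow 2.x-style DAG sketch is shown below; the dag_id and task bodies are placeholders, and in practice each callable would invoke the extraction, transformation, and loading code from the earlier modules.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    # Placeholder task body; would call the extraction code
    print("extracting...")

def transform():
    print("transforming...")

def load():
    print("loading...")

# The DAG ties the tasks into a directed acyclic graph and gives them a schedule
with DAG(
    dag_id="daily_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Dependencies: extract runs before transform, transform before load
    t_extract >> t_transform >> t_load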

Module 6: Handling Large-Scale Data with Apache Spark

Introduction to PySpark

DataFrame operations in Spark

Optimizing performance with partitioning

Integrating Spark with AWS S3 and Google Cloud Storage
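
A minimal PySpark sketch of this module's ideas, assuming a placeholder CSV input; the same reader also accepts s3a:// or gs:// paths once the S3 / Google Cloud Storage connectors and credentials are configured.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders_pipeline").getOrCreate()

# Read a (placeholder) CSV file into a Spark DataFrame
orders = spark.read.csv("data/orders.csv", header=True, inferSchema=True)

# DataFrame operations: filter, derive a column, aggregate per region
summary = (
    orders.filter(F.col("amount") > 0)
    .withColumn("amount_usd", F.col("amount") / 100)
    .groupBy("region")
    .agg(F.sum("amount_usd").alias("total_usd"))
)

# Repartition by the aggregation key to balance work across executors
summary = summary.repartition(8, "region")

summary.write.mode("overwrite").parquet("output/summary.parquet")
spark.stop()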

Module 7: Real-Time Data Processing with Kafka

Introduction to Apache Kafka

Setting up Kafka producers and consumers

Streaming data processing with kafka-python

Integrating Kafka with Spark Streaming
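
A minimal kafka-python sketch, assuming a broker on localhost:9092 and a placeholder "clicks" topic: the producer publishes JSON events and the consumer processes them as they arrive.

import json

from kafka import KafkaConsumer, KafkaProducer

# Producer: publish click events as JSON
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)
producer.send("clicks", {"user_id": 42, "page": "/pricing"})
producer.flush()

# Consumer: read the same topic and process each event as it arrives
consumer = KafkaConsumer(
    "clicks",
    bootstrap_servers="localhost:9092",
    group_id="click-processor",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
for message in consumer:
    print(message.value)  # e.g. {'user_id': 42, 'page': '/pricing'}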

Module 8: Data Pipeline Testing and Debugging

Writing unit tests with pytest

Debugging common pipeline failures

Monitoring and logging with the ELK Stack
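
The pytest sketch below tests a small hypothetical transform function; the function itself is illustrative rather than part of the course code.

import pandas as pd
import pytest

def clean_amounts(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transform under test: fill missing amounts with the median."""
    out = df.copy()
    out["amount"] = out["amount"].fillna(out["amount"].median())
    return out

def test_clean_amounts_fills_missing_values():
    raw = pd.DataFrame({"amount": [10.0, None, 30.0]})
    cleaned = clean_amounts(raw)
    assert cleaned["amount"].isna().sum() == 0
    assert cleaned["amount"].iloc[1] == pytest.approx(20.0)

def test_clean_amounts_does_not_mutate_input():
    raw = pd.DataFrame({"amount": [10.0, None]})
    clean_amounts(raw)
    assert raw["amount"].isna().sum() == 1  # original frame left untouched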

Module 9: Deploying Data Pipelines

CI/CD for data pipelines

Deploying on AWS Lambda and Google Cloud Functions

Containerizing pipelines with Docker

Orchestrating with Kubernetes
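
As one possible shape for a serverless deployment, the sketch below shows a hypothetical AWS Lambda handler that runs one pipeline cycle when triggered; the handler name and event structure are assumptions, and the same code could equally be packaged into a Docker image and run on Kubernetes.

import json

def handler(event, context):
    """Hypothetical Lambda entry point: run one ETL cycle when triggered,
    e.g. by an S3 upload event or a scheduled EventBridge rule."""
    records = event.get("Records", [])
    processed = 0
    for record in records:
        # In a real pipeline this would extract, transform, and load the
        # object referenced by the event; here we only count records.
        processed += 1
    return {
        "statusCode": 200,
        "body": json.dumps({"processed": processed}),
    }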

Module 10: Capstone Project

Building an end-to-end data pipeline

Automating ETL workflows

Processing real-time streaming data

Deploying and monitoring the pipeline

________________________________________

Tools and Technologies

Python (pandas, SQLAlchemy, requests, Dask)

Apache Airflow

Apache Spark (PySpark)

Apache Kafka

Docker, Kubernetes

PostgreSQL, MySQL

AWS Lambda, Google Cloud Functions

