Course Overview
This course provides an in-depth understanding of data pipelines using Python. Students will learn to extract, transform, and load (ETL) data, automate workflows, and optimize data processing for real-world applications.
Prerequisites
• Basic Python programming knowledge
• Familiarity with SQL and databases
• Understanding of basic data structures
________________________________________
Module 1: Introduction to Data Pipelines
• What is a data pipeline?
• Key components: Extraction, Transformation, and Loading (ETL)
• Batch vs. real-time data pipelines
• Tools & frameworks overview: pandas, Apache Airflow, Apache Spark, Apache Kafka
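To make the ETL vocabulary concrete before the tool-specific modules, here is a minimal batch pipeline sketched in plain pandas. The file names and columns (orders.csv, quantity, unit_price) are illustrative, not course data.

```python
import pandas as pd

# Extract: read raw records from a source file (a hypothetical orders.csv).
df = pd.read_csv("orders.csv")

# Transform: drop rows missing key fields, then derive a line total.
df = df.dropna(subset=["quantity", "unit_price"])
df["total"] = df["quantity"] * df["unit_price"]

# Load: write the cleaned result to a destination file.
df.to_csv("orders_clean.csv", index=False)
```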
Module 2: Data Extraction
• Reading data from CSV, JSON, and XML
• Web scraping with BeautifulSoup and Scrapy
• APIs and RESTful data extraction using requests
• Connecting to databases (PostgreSQL, MySQL) using SQLAlchemy
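A minimal sketch of RESTful extraction with requests, one of the techniques this module covers. The endpoint and pagination parameter are hypothetical stand-ins for whatever API a pipeline actually targets.

```python
import requests

# Hypothetical endpoint; substitute the real API base URL.
BASE_URL = "https://api.example.com/v1/users"

def fetch_users(page: int = 1) -> list[dict]:
    """Fetch one page of records from a paginated JSON API."""
    response = requests.get(BASE_URL, params={"page": page}, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()

if __name__ == "__main__":
    records = fetch_users()
    print(f"Fetched {len(records)} records")
```

Setting an explicit timeout and calling raise_for_status() keeps extraction failures visible instead of letting a pipeline silently ingest an error page.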
Module 3: Data Transformation
• Data cleaning with pandas
• Handling missing values and outliers
• Data aggregation and normalization
• Using Dask for scalable data processing
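A short pandas sketch of the cleaning steps above, on made-up data. Median imputation and 95th-percentile clipping are illustrative choices here, not the only defensible ones.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Jakarta", "Bandung", None, "Jakarta"],
    "sales": [120.0, None, 95.0, 4000.0],  # 4000.0 plays the outlier
})

# Missing values: drop rows without a city, impute missing sales.
df = df.dropna(subset=["city"])
df["sales"] = df["sales"].fillna(df["sales"].median())

# Outliers: clip to the 95th percentile rather than deleting rows.
df["sales"] = df["sales"].clip(upper=df["sales"].quantile(0.95))

# Aggregation: total sales per city.
summary = df.groupby("city", as_index=False)["sales"].sum()
print(summary)
```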
Module 4: Data Loading
• Writing data to CSV, JSON, and databases
• Bulk inserts and performance optimization
• Automating database interactions
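A sketch of bulk-loading a DataFrame into PostgreSQL through SQLAlchemy; the connection string, table name, and input file are placeholders. The chunksize and method="multi" arguments are the pandas-level levers most relevant to the bulk-insert performance topic above.

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder credentials; requires the psycopg2 driver to be installed.
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/mydb")

# Hypothetical input, e.g. the output of an earlier transform step.
df = pd.read_csv("orders_clean.csv")

# Bulk insert: method="multi" batches rows into multi-row INSERT statements,
# and chunksize bounds how many rows go per round trip.
df.to_sql("orders", engine, if_exists="append", index=False,
          method="multi", chunksize=1000)
```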
Module 5: Automating Pipelines with Apache Airflow
• Setting up Apache Airflow
• Defining DAGs (Directed Acyclic Graphs)
• Scheduling and monitoring workflows
• Integrating Python scripts with Airflow
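A minimal DAG definition in the shape this module builds toward, assuming a recent Airflow 2.x release (the schedule argument replaced schedule_interval in 2.4). The DAG id, schedule, and task bodies are illustrative.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extracting...")  # stand-in for a real extraction step

def transform():
    print("transforming...")  # stand-in for a real transform step

with DAG(
    dag_id="etl_demo",               # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",               # run once per day
    catchup=False,                   # skip backfilling past runs
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)

    t_extract >> t_transform  # extract must finish before transform starts
```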
Module 6: Handling Large-Scale Data with Apache Spark
• Introduction to PySpark
• DataFrame operations in Spark
• Optimizing performance with partitioning
• Integrating Spark with AWS S3 and Google Cloud Storage
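A PySpark sketch of DataFrame operations plus write-time partitioning; the input path, column names, and partition key are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pipeline-demo").getOrCreate()

# Hypothetical input dataset.
df = spark.read.parquet("events.parquet")

# Typical DataFrame operations: filter, derive a column, aggregate.
daily = (
    df.filter(F.col("status") == "ok")
      .withColumn("day", F.to_date("timestamp"))
      .groupBy("day")
      .count()
)

# Repartition on the write key so output files line up with the
# on-disk partition layout readers will filter on.
(daily.repartition("day")
      .write.mode("overwrite")
      .partitionBy("day")
      .parquet("daily_counts"))
```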
Module 7: Real-Time Data Processing with Kafka
• Introduction to Apache Kafka
• Setting up Kafka producers and consumers
• Streaming data processing with kafka-python
• Integrating Kafka with Spark Streaming
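A minimal producer/consumer pair using the kafka-python package, assuming a broker at localhost:9092; the topic name and message shape are invented for the example.

```python
import json

from kafka import KafkaConsumer, KafkaProducer

# Producer: JSON-serialize each message before sending.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("events", {"user": "alice", "action": "login"})
producer.flush()  # block until buffered messages are delivered

# Consumer: read the topic from the beginning and decode each message.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.value)  # loops until interrupted
```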
Module 8: Data Pipeline Testing and Debugging
• Writing unit tests with pytest
• Debugging common pipeline failures
• Monitoring and logging with the ELK Stack
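A small pytest example in the spirit of this module. The transform under test, add_total, is a hypothetical helper; the tests check both the computation and that the input DataFrame is left unmodified.

```python
# test_transform.py -- run with: pytest test_transform.py
import pandas as pd

def add_total(df: pd.DataFrame) -> pd.DataFrame:
    """Transform under test: derive total = quantity * unit_price."""
    out = df.copy()
    out["total"] = out["quantity"] * out["unit_price"]
    return out

def test_add_total_computes_row_totals():
    df = pd.DataFrame({"quantity": [2, 3], "unit_price": [5.0, 1.5]})
    result = add_total(df)
    assert result["total"].tolist() == [10.0, 4.5]

def test_add_total_does_not_mutate_input():
    df = pd.DataFrame({"quantity": [1], "unit_price": [1.0]})
    add_total(df)
    assert "total" not in df.columns
```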
Module 9: Deploying Data Pipelines
• CI/CD for data pipelines
• Deploying on AWS Lambda and Google Cloud Functions
• Containerizing pipelines with Docker
• Orchestrating with Kubernetes
Module 10: Capstone Project
• Building an end-to-end data pipeline
• Automating ETL workflows
• Processing real-time streaming data
• Deploying and monitoring the pipeline
________________________________________
Tools and Technologies
• Python (pandas, SQLAlchemy, requests, Dask)
• Apache Airflow
• Apache Spark (PySpark)
• Apache Kafka
• Docker, Kubernetes
• PostgreSQL, MySQL
• AWS Lambda, Google Cloud Functions