A complete local data engineering pipeline combining streaming data ingestion, ETL processing, machine learning, and business intelligence tools. This project demonstrates modern data engineering practices using Kafka, Spark, MLflow, Airflow, and Superset.
```
Data Source → Kafka → Spark ETL → Data Lake → ML Pipeline → BI Dashboard
                                      ↓
                            (MinIO/Local Storage)
                                      ↓
                            MLflow Model Registry
```
```
data-engineering-ml-project/
│
├── docker-compose.yml        # Local container orchestration
├── .env                      # Environment variables
├── README.md                 # This file
│
├── config/
│   └── settings.yaml         # Global configs (paths, ML params)
│
├── data/
│   ├── raw/                  # Raw ingested data (from Kafka)
│   ├── processed/            # Cleaned and transformed data
│   └── analytics/            # Final aggregated data for BI
│
├── ingestion/
│   ├── kafka_producer.py     # Simulates streaming input
│   └── kafka_consumer.py     # Optional for testing without Spark
│
├── processing/
│   ├── spark_etl.py          # Spark job for ETL + feature engineering
│   └── spark_session.py      # Reusable SparkSession builder
│
├── ml/
│   ├── train_model.py        # ML training with MLflow
│   ├── evaluate_model.py     # Evaluation metrics
│   ├── register_model.py     # Register best model to MLflow
│   ├── inference.py          # Inference script
│   └── model/                # Exported models (optional)
│
├── dags/
│   └── pipeline_dag.py       # Airflow DAG for ETL + ML pipeline
│
├── bi/
│   ├── superset/             # Superset setup
│   ├── docker/               # Superset config in Docker
│   └── dashboards/           # Saved dashboard templates or queries
│
├── serving/
│   ├── app.py                # FastAPI/Flask app for model inference
│   └── Dockerfile
│
├── utils/
│   ├── logger.py             # Shared logging utility
│   └── helpers.py            # Common functions
│
└── tests/
    ├── test_etl.py
    └── test_ml.py
```
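As a concrete illustration of one of the components listed above, the reusable SparkSession builder in `processing/spark_session.py` might look something like this minimal sketch (the master URL, app name, and Kafka connector version are assumptions, not the project's actual values):

```python
# processing/spark_session.py (illustrative sketch)
from pyspark.sql import SparkSession


def get_spark_session(
    app_name: str = "data-engineering-ml",
    master: str = "spark://spark-master:7077",
) -> SparkSession:
    """Build (or reuse) a SparkSession with the Kafka connector available."""
    return (
        SparkSession.builder
        .appName(app_name)
        .master(master)
        # Kafka source for Spark SQL; package version is an assumption
        .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.1")
        .getOrCreate()
    )
```

ETL and ML jobs can then call `get_spark_session()` instead of repeating builder boilerplate.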
- Docker and Docker Compose
- Python 3.8+
- 8GB+ RAM recommended
```bash
git clone <your-repo>
cd data-engineering-ml-project
```

```bash
# Start all services
docker-compose up -d

# Check service status
docker-compose ps
```

- Kafka: localhost:9092
- Spark Master UI: http://localhost:8080
- MLflow UI: http://localhost:5001
- Airflow UI: http://localhost:8081 (admin/admin)
- Superset UI: http://localhost:8088 (admin/admin)
- MinIO Console: http://localhost:9001 (minioadmin/minioadmin)
- Model API: http://localhost:8000
Start the data stream:

```bash
cd ingestion
python kafka_producer.py --interval 0.5 --duration 60
```
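For reference, a minimal producer along these lines would satisfy the flags used above and the "sample e-commerce data" described in the data flow below; the event fields are an illustrative assumption, not the project's actual schema:

```python
# ingestion/kafka_producer.py (illustrative sketch)
import argparse
import json
import random
import time

from kafka import KafkaProducer  # kafka-python


def main() -> None:
    parser = argparse.ArgumentParser(description="Emit sample e-commerce events to Kafka")
    parser.add_argument("--interval", type=float, default=1.0, help="Seconds between events")
    parser.add_argument("--duration", type=int, default=60, help="Total seconds to run")
    args = parser.parse_args()

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",  # matches KAFKA_BOOTSTRAP_SERVERS in .env
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    end = time.time() + args.duration
    while time.time() < end:
        # Hypothetical e-commerce event shape
        event = {
            "user_id": random.randint(1, 1000),
            "product_id": random.randint(1, 50),
            "price": round(random.uniform(5.0, 500.0), 2),
            "event_time": time.time(),
        }
        producer.send("data_stream", value=event)  # topic from KAFKA_TOPIC_NAME
        time.sleep(args.interval)

    producer.flush()


if __name__ == "__main__":
    main()
```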
Test consumer (optional):

```bash
python kafka_consumer.py
```

Submit ETL job:
```bash
docker exec -it <spark-master-container> spark-submit \
  --master spark://spark-master:7077 \
  /opt/spark/processing/spark_etl.py
```
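Conceptually, `processing/spark_etl.py` reads the Kafka topic, cleans the records, and writes them to the data lake. A heavily simplified sketch, assuming the event schema, broker hostname, and output path shown here, and reusing the hypothetical `get_spark_session` helper sketched earlier:

```python
# processing/spark_etl.py (illustrative sketch)
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, LongType, StructField, StructType

from spark_session import get_spark_session  # hypothetical helper, see earlier sketch

# Assumed schema for the sample e-commerce events
schema = StructType([
    StructField("user_id", LongType()),
    StructField("product_id", LongType()),
    StructField("price", DoubleType()),
    StructField("event_time", DoubleType()),
])

spark = get_spark_session()

raw = (
    spark.read.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")  # in-cluster hostname, an assumption
    .option("subscribe", "data_stream")
    .option("startingOffsets", "earliest")
    .load()
)

events = (
    raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
    .dropna()
    .withColumn("event_date", F.to_date(F.from_unixtime("event_time")))
)

# Write cleaned data to the data lake as partitioned Parquet
events.write.mode("append").partitionBy("event_date").parquet("/data/processed/events")
```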
Train models:

```bash
cd ml
python train_model.py
```
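The actual training logic lives in `ml/train_model.py`; a minimal MLflow-tracked run could look roughly like this sketch (the feature columns, model choice, and experiment name are assumptions):

```python
# ml/train_model.py (illustrative sketch)
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

mlflow.set_tracking_uri("http://localhost:5001")
mlflow.set_experiment("ecommerce-demo")  # hypothetical experiment name

# Assumed feature table produced by the Spark ETL step
df = pd.read_parquet("data/processed/events")
X = df[["user_id", "product_id"]]
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

with mlflow.start_run():
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    mae = mean_absolute_error(y_test, model.predict(X_test))
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("mae", mae)
    mlflow.sklearn.log_model(model, artifact_path="model")
```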
View experiments:

- Visit http://localhost:5001
- Browse experiments and model registry
Access Airflow:
- Go to http://localhost:8081
- Login: admin/admin
- Enable the `pipeline_dag` DAG (an illustrative sketch of this DAG follows)
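For orientation, the `pipeline_dag` enabled above might chain the ETL and training steps roughly as follows (task commands, container paths, and schedule are assumptions):

```python
# dags/pipeline_dag.py (illustrative sketch)
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="pipeline_dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    spark_etl = BashOperator(
        task_id="spark_etl",
        bash_command=(
            "spark-submit --master spark://spark-master:7077 "
            "/opt/spark/processing/spark_etl.py"
        ),
    )

    train_model = BashOperator(
        task_id="train_model",
        bash_command="python /opt/airflow/ml/train_model.py",  # path is an assumption
    )

    spark_etl >> train_model
```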
Set up dashboards:
- Go to http://localhost:8088
- Login: admin/admin
- Connect to your data sources
- Create visualizations
- Ingestion: Kafka producer generates sample e-commerce data
- Streaming: Kafka topics buffer the real-time data
- Processing: Spark consumes from Kafka, cleans, and transforms data
- Storage: Processed data stored in data lake (local/MinIO)
- ML Pipeline: MLflow trains and tracks machine learning models
- Serving: FastAPI serves model predictions
- Analytics: Superset creates dashboards from processed data
- Orchestration: Airflow coordinates the entire pipeline
Environment variables (`.env`):

```bash
# Kafka
KAFKA_BOOTSTRAP_SERVERS=localhost:9092
KAFKA_TOPIC_NAME=data_stream

# Spark
SPARK_MASTER_URL=spark://localhost:7077

# MLflow
MLFLOW_TRACKING_URI=http://localhost:5001

# MinIO
MINIO_ENDPOINT=localhost:9000
MINIO_ACCESS_KEY=minioadmin
MINIO_SECRET_KEY=minioadmin
```

`config/settings.yaml` provides centralized configuration for the items below (an illustrative sketch follows the list):
- Data paths
- ML parameters
- Service endpoints
- Algorithm configurations
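As an illustration only, the layout of `config/settings.yaml` might look roughly like this; every key and value below is an assumption rather than the project's actual settings:

```yaml
# config/settings.yaml (illustrative sketch)
data:
  raw_path: data/raw
  processed_path: data/processed
  analytics_path: data/analytics

services:
  kafka_bootstrap_servers: localhost:9092
  spark_master_url: spark://localhost:7077
  mlflow_tracking_uri: http://localhost:5001

ml:
  test_size: 0.2
  random_state: 42
  algorithm: random_forest
  n_estimators: 100
```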
```bash
# All services
docker-compose logs

# Specific service
docker-compose logs kafka
docker-compose logs spark-master
docker-compose logs mlflow
```

- Out of Memory: Increase Docker memory allocation
- Port Conflicts: Check if ports are already in use
- Kafka Connection: Ensure Kafka is fully started before Spark jobs
- MLflow Storage: Check volume mounts and permissions
```bash
# Scale Spark workers
docker-compose up -d --scale spark-worker=3

# Scale Kafka (requires cluster setup)

# Scale Airflow workers
docker-compose up -d --scale airflow-worker=2
```

- Replace SQLite with PostgreSQL for MLflow (example command after this list)
- Use external Kafka cluster
- Implement proper authentication
- Add monitoring (Prometheus/Grafana)
- Use Kubernetes for orchestration
- Implement data lineage tracking
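For the first item above, the MLflow tracking server would be started against a PostgreSQL backend store instead of SQLite; a rough example, where the connection string and artifact root are placeholders:

```bash
# Illustrative only: point MLflow at PostgreSQL instead of SQLite
mlflow server \
  --backend-store-uri postgresql://mlflow:mlflow@postgres:5432/mlflow \
  --default-artifact-root s3://mlflow-artifacts/ \
  --host 0.0.0.0 \
  --port 5001
```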
```bash
# Run unit tests
python -m pytest tests/

# Test ETL pipeline
python tests/test_etl.py

# Test ML pipeline
python tests/test_ml.py
```

```bash
# Health check
curl http://localhost:8000/health

# Predict
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [1.0, 2.0, 3.0]}'
```
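Behind these endpoints, `serving/app.py` could be a small FastAPI app along the following lines; the model name, version, and loading approach are assumptions (the project may instead load a locally exported model):

```python
# serving/app.py (illustrative sketch)
from typing import List

import mlflow.pyfunc
import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Assumption: a model is registered in MLflow under this name/version.
# Requires MLFLOW_TRACKING_URI to point at the tracking server.
MODEL_URI = "models:/ecommerce-model/1"
model = mlflow.pyfunc.load_model(MODEL_URI)


class PredictRequest(BaseModel):
    features: List[float]


@app.get("/health")
def health() -> dict:
    return {"status": "ok"}


@app.post("/predict")
def predict(request: PredictRequest) -> dict:
    prediction = model.predict(np.array([request.features]))
    return {"prediction": prediction.tolist()}
```

Run locally with, e.g., `uvicorn app:app --host 0.0.0.0 --port 8000`.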
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Submit a pull request
This project is licensed under the MIT License.
- Check the logs first: `docker-compose logs <service>`
- Review configuration in `config/settings.yaml`
- Ensure all prerequisites are installed
- Check Docker resource allocation
```bash
# Stop all services
docker-compose down

# Rebuild specific service
docker-compose up -d --build model-api

# Access service shell
docker-compose exec spark-master bash

# View real-time logs
docker-compose logs -f kafka

# Clean up volumes (WARNING: deletes data)
docker-compose down -v
```

Happy Data Engineering!