A complete local data engineering pipeline combining streaming data ingestion, ETL processing, machine learning, and business intelligence tools. This project demonstrates modern data engineering practices using Kafka, Spark, MLflow, Airflow, and Superset.
```
Data Source → Kafka → Spark ETL → Data Lake → ML Pipeline → BI Dashboard
                                      ↓
                            (MinIO/Local Storage)
                                      ↓
                            MLflow Model Registry
```
```
data-engineering-ml-project/
│
├── docker-compose.yml        # Local container orchestration
├── .env                      # Environment variables
├── README.md                 # This file
│
├── config/
│   └── settings.yaml         # Global configs (paths, ML params)
│
├── data/
│   ├── raw/                  # Raw ingested data (from Kafka)
│   ├── processed/            # Cleaned and transformed data
│   └── analytics/            # Final aggregated data for BI
│
├── ingestion/
│   ├── kafka_producer.py     # Simulates streaming input
│   └── kafka_consumer.py     # Optional for testing without Spark
│
├── processing/
│   ├── spark_etl.py          # Spark job for ETL + feature engineering
│   └── spark_session.py      # Reusable SparkSession builder
│
├── ml/
│   ├── train_model.py        # ML training with MLflow
│   ├── evaluate_model.py     # Evaluation metrics
│   ├── register_model.py     # Register best model to MLflow
│   ├── inference.py          # Inference script
│   └── model/                # Exported models (optional)
│
├── dags/
│   └── pipeline_dag.py       # Airflow DAG for ETL + ML pipeline
│
├── bi/
│   ├── superset/             # Superset setup
│   ├── docker/               # Superset config in Docker
│   └── dashboards/           # Saved dashboard templates or queries
│
├── serving/
│   ├── app.py                # FastAPI/Flask app for model inference
│   └── Dockerfile
│
├── utils/
│   ├── logger.py             # Shared logging utility
│   └── helpers.py            # Common functions
│
└── tests/
    ├── test_etl.py
    └── test_ml.py
```
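As a concrete illustration of one of the components listed above, the reusable SparkSession builder in `processing/spark_session.py` might look something like this minimal sketch (the master URL, app name, and Kafka connector version are assumptions, not the project's actual values):

```python
# processing/spark_session.py (illustrative sketch)
from pyspark.sql import SparkSession


def get_spark_session(
    app_name: str = "data-engineering-ml",
    master: str = "spark://spark-master:7077",
) -> SparkSession:
    """Build (or reuse) a SparkSession with the Kafka connector available."""
    return (
        SparkSession.builder
        .appName(app_name)
        .master(master)
        # Kafka source for Spark SQL; package version is an assumption
        .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.1")
        .getOrCreate()
    )
```

ETL and ML jobs can then call `get_spark_session()` instead of repeating builder boilerplate.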
- Docker and Docker Compose
- Python 3.8+
- 8GB+ RAM recommended
```bash
git clone <your-repo>
cd data-engineering-ml-project
```

```bash
# Start all services
docker-compose up -d

# Check service status
docker-compose ps
```

- Kafka: localhost:9092
- Spark Master UI: http://localhost:8080
- MLflow UI: http://localhost:5001
- Airflow UI: http://localhost:8081 (admin/admin)
- Superset UI: http://localhost:8088 (admin/admin)
- MinIO Console: http://localhost:9001 (minioadmin/minioadmin)
- Model API: http://localhost:8000
Start the data stream:

```bash
cd ingestion
python kafka_producer.py --interval 0.5 --duration 60
```
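For reference, a minimal producer along these lines would satisfy the flags used above and the "sample e-commerce data" described in the data flow below; the event fields are an illustrative assumption, not the project's actual schema:

```python
# ingestion/kafka_producer.py (illustrative sketch)
import argparse
import json
import random
import time

from kafka import KafkaProducer  # kafka-python


def main() -> None:
    parser = argparse.ArgumentParser(description="Emit sample e-commerce events to Kafka")
    parser.add_argument("--interval", type=float, default=1.0, help="Seconds between events")
    parser.add_argument("--duration", type=int, default=60, help="Total seconds to run")
    args = parser.parse_args()

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",  # matches KAFKA_BOOTSTRAP_SERVERS in .env
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    end = time.time() + args.duration
    while time.time() < end:
        # Hypothetical e-commerce event shape
        event = {
            "user_id": random.randint(1, 1000),
            "product_id": random.randint(1, 50),
            "price": round(random.uniform(5.0, 500.0), 2),
            "event_time": time.time(),
        }
        producer.send("data_stream", value=event)  # topic from KAFKA_TOPIC_NAME
        time.sleep(args.interval)

    producer.flush()


if __name__ == "__main__":
    main()
```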
Test consumer (optional):

```bash
python kafka_consumer.py
```

Submit ETL job:
```bash
docker exec -it <spark-master-container> spark-submit \
  --master spark://spark-master:7077 \
  /opt/spark/processing/spark_etl.py
```
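Conceptually, `processing/spark_etl.py` reads the Kafka topic, cleans the records, and writes them to the data lake. A heavily simplified sketch, assuming the event schema, broker hostname, and output path shown here, and reusing the hypothetical `get_spark_session` helper sketched earlier:

```python
# processing/spark_etl.py (illustrative sketch)
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, LongType, StructField, StructType

from spark_session import get_spark_session  # hypothetical helper, see earlier sketch

# Assumed schema for the sample e-commerce events
schema = StructType([
    StructField("user_id", LongType()),
    StructField("product_id", LongType()),
    StructField("price", DoubleType()),
    StructField("event_time", DoubleType()),
])

spark = get_spark_session()

raw = (
    spark.read.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")  # in-cluster hostname, an assumption
    .option("subscribe", "data_stream")
    .option("startingOffsets", "earliest")
    .load()
)

events = (
    raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
    .dropna()
    .withColumn("event_date", F.to_date(F.from_unixtime("event_time")))
)

# Write cleaned data to the data lake as partitioned Parquet
events.write.mode("append").partitionBy("event_date").parquet("/data/processed/events")
```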
Train models:

```bash
cd ml
python train_model.py
```
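The actual training logic lives in `ml/train_model.py`; a minimal MLflow-tracked run could look roughly like this sketch (the feature columns, model choice, and experiment name are assumptions):

```python
# ml/train_model.py (illustrative sketch)
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

mlflow.set_tracking_uri("http://localhost:5001")
mlflow.set_experiment("ecommerce-demo")  # hypothetical experiment name

# Assumed feature table produced by the Spark ETL step
df = pd.read_parquet("data/processed/events")
X = df[["user_id", "product_id"]]
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

with mlflow.start_run():
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    mae = mean_absolute_error(y_test, model.predict(X_test))
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("mae", mae)
    mlflow.sklearn.log_model(model, artifact_path="model")
```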
View experiments:

- Visit http://localhost:5001
- Browse experiments and model registry
Access Airflow:
- Go to http://localhost:8081
- Login: admin/admin
- Enable the `pipeline_dag` DAG (an illustrative sketch of this DAG follows)
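For orientation, the `pipeline_dag` enabled above might chain the ETL and training steps roughly as follows (task commands, container paths, and schedule are assumptions):

```python
# dags/pipeline_dag.py (illustrative sketch)
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="pipeline_dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    spark_etl = BashOperator(
        task_id="spark_etl",
        bash_command=(
            "spark-submit --master spark://spark-master:7077 "
            "/opt/spark/processing/spark_etl.py"
        ),
    )

    train_model = BashOperator(
        task_id="train_model",
        bash_command="python /opt/airflow/ml/train_model.py",  # path is an assumption
    )

    spark_etl >> train_model
```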
Set up dashboards:
- Go to http://localhost:8088
- Login: admin/admin
- Connect to your data sources
- Create visualizations
- Ingestion: Kafka producer generates sample e-commerce data
- Streaming: Kafka topics buffer the real-time data
- Processing: Spark consumes from Kafka, cleans, and transforms data
- Storage: Processed data stored in data lake (local/MinIO)
- ML Pipeline: MLflow trains and tracks machine learning models
- Serving: FastAPI serves model predictions
- Analytics: Superset creates dashboards from processed data
- Orchestration: Airflow coordinates the entire pipeline
Environment variables (`.env`):

```bash
# Kafka
KAFKA_BOOTSTRAP_SERVERS=localhost:9092
KAFKA_TOPIC_NAME=data_stream

# Spark
SPARK_MASTER_URL=spark://localhost:7077

# MLflow
MLFLOW_TRACKING_URI=http://localhost:5001

# MinIO
MINIO_ENDPOINT=localhost:9000
MINIO_ACCESS_KEY=minioadmin
MINIO_SECRET_KEY=minioadmin
```

`config/settings.yaml` provides centralized configuration for the items below (an illustrative sketch follows the list):
- Data paths
- ML parameters
- Service endpoints
- Algorithm configurations
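As an illustration only, the layout of `config/settings.yaml` might look roughly like this; every key and value below is an assumption rather than the project's actual settings:

```yaml
# config/settings.yaml (illustrative sketch)
data:
  raw_path: data/raw
  processed_path: data/processed
  analytics_path: data/analytics

services:
  kafka_bootstrap_servers: localhost:9092
  spark_master_url: spark://localhost:7077
  mlflow_tracking_uri: http://localhost:5001

ml:
  test_size: 0.2
  random_state: 42
  algorithm: random_forest
  n_estimators: 100
```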
```bash
# All services
docker-compose logs

# Specific service
docker-compose logs kafka
docker-compose logs spark-master
docker-compose logs mlflow
```

- Out of Memory: Increase Docker memory allocation
- Port Conflicts: Check if ports are already in use
- Kafka Connection: Ensure Kafka is fully started before Spark jobs
- MLflow Storage: Check volume mounts and permissions
```bash
# Scale Spark workers
docker-compose up -d --scale spark-worker=3

# Scale Kafka (requires cluster setup)

# Scale Airflow workers
docker-compose up -d --scale airflow-worker=2
```

- Replace SQLite with PostgreSQL for MLflow (example command after this list)
- Use external Kafka cluster
- Implement proper authentication
- Add monitoring (Prometheus/Grafana)
- Use Kubernetes for orchestration
- Implement data lineage tracking
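For the first item above, the MLflow tracking server would be started against a PostgreSQL backend store instead of SQLite; a rough example, where the connection string and artifact root are placeholders:

```bash
# Illustrative only: point MLflow at PostgreSQL instead of SQLite
mlflow server \
  --backend-store-uri postgresql://mlflow:mlflow@postgres:5432/mlflow \
  --default-artifact-root s3://mlflow-artifacts/ \
  --host 0.0.0.0 \
  --port 5001
```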
```bash
# Run unit tests
python -m pytest tests/

# Test ETL pipeline
python tests/test_etl.py

# Test ML pipeline
python tests/test_ml.py
```

```bash
# Health check
curl http://localhost:8000/health

# Predict
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [1.0, 2.0, 3.0]}'
```
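Behind these endpoints, `serving/app.py` could be a small FastAPI app along the following lines; the model name, version, and loading approach are assumptions (the project may instead load a locally exported model):

```python
# serving/app.py (illustrative sketch)
from typing import List

import mlflow.pyfunc
import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Assumption: a model is registered in MLflow under this name/version.
# Requires MLFLOW_TRACKING_URI to point at the tracking server.
MODEL_URI = "models:/ecommerce-model/1"
model = mlflow.pyfunc.load_model(MODEL_URI)


class PredictRequest(BaseModel):
    features: List[float]


@app.get("/health")
def health() -> dict:
    return {"status": "ok"}


@app.post("/predict")
def predict(request: PredictRequest) -> dict:
    prediction = model.predict(np.array([request.features]))
    return {"prediction": prediction.tolist()}
```

Run locally with, e.g., `uvicorn app:app --host 0.0.0.0 --port 8000`.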
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Submit a pull request
This project is licensed under the MIT License.
- Check the logs first: `docker-compose logs <service>`
- Review configuration in `config/settings.yaml`
- Ensure all prerequisites are installed
- Check Docker resource allocation
```bash
# Stop all services
docker-compose down

# Rebuild specific service
docker-compose up -d --build model-api

# Access service shell
docker-compose exec spark-master bash

# View real-time logs
docker-compose logs -f kafka

# Clean up volumes (WARNING: deletes data)
docker-compose down -v
```

Happy Data Engineering!