Data Engineering ML Pipeline with Data Lake

A complete local data engineering pipeline combining streaming data ingestion, ETL processing, machine learning, and business intelligence tools. This project demonstrates modern data engineering practices using Kafka, Spark, MLflow, Airflow, and Superset.

πŸ—οΈ Architecture Overview

Data Source β†’ Kafka β†’ Spark ETL β†’ Data Lake β†’ ML Pipeline β†’ BI Dashboard
                                     ↓
                              (MinIO/Local Storage)
                                     ↓
                            MLflow Model Registry

πŸ“ Project Structure

data-engineering-ml-project/
β”‚
β”œβ”€β”€ docker-compose.yml          # Local container orchestration
β”œβ”€β”€ .env                        # Environment variables
β”œβ”€β”€ README.md                   # This file
β”‚
β”œβ”€β”€ config/
β”‚   └── settings.yaml           # Global configs (paths, ML params)
β”‚
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ raw/                    # Raw ingested data (from Kafka)
β”‚   β”œβ”€β”€ processed/              # Cleaned and transformed data
β”‚   └── analytics/              # Final aggregated data for BI
β”‚
β”œβ”€β”€ ingestion/
β”‚   β”œβ”€β”€ kafka_producer.py       # Simulates streaming input
β”‚   └── kafka_consumer.py       # Optional for testing without Spark
β”‚
β”œβ”€β”€ processing/
β”‚   β”œβ”€β”€ spark_etl.py            # Spark job for ETL + feature engineering
β”‚   └── spark_session.py        # Reusable SparkSession builder
β”‚
β”œβ”€β”€ ml/
β”‚   β”œβ”€β”€ train_model.py          # ML training with MLflow
β”‚   β”œβ”€β”€ evaluate_model.py       # Evaluation metrics
β”‚   β”œβ”€β”€ register_model.py       # Register best model to MLflow
β”‚   β”œβ”€β”€ inference.py            # Inference script
β”‚   └── model/                  # Exported models (optional)
β”‚
β”œβ”€β”€ dags/
β”‚   └── pipeline_dag.py         # Airflow DAG for ETL + ML pipeline
β”‚
β”œβ”€β”€ bi/
β”‚   β”œβ”€β”€ superset/               # Superset setup
β”‚   β”‚   └── docker/             # Superset config in Docker
β”‚   └── dashboards/             # Saved dashboard templates or queries
β”‚
β”œβ”€β”€ serving/
β”‚   β”œβ”€β”€ app.py                  # FastAPI/Flask app for model inference
β”‚   └── Dockerfile
β”‚
β”œβ”€β”€ utils/
β”‚   β”œβ”€β”€ logger.py               # Shared logging utility
β”‚   └── helpers.py              # Common functions
β”‚
└── tests/
    β”œβ”€β”€ test_etl.py
    └── test_ml.py

πŸš€ Quick Start

Prerequisites

  • Docker and Docker Compose
  • Python 3.8+
  • 8GB+ RAM recommended

1. Clone and Setup

git clone <your-repo>
cd data-engineering-ml-project

2. Start Infrastructure

# Start all services
docker-compose up -d

# Check service status
docker-compose ps

3. Verify Services
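
Once the containers are up, confirm that the web UIs respond. The snippet below is a minimal sketch that probes the ports used elsewhere in this README (Airflow 8081, Superset 8088, MLflow 5001, the model API 8000); adjust the URLs if your docker-compose.yml maps different ports.

# verify_services.py -- minimal sketch; ports assumed from this README's examples
import requests

SERVICES = {
    "Airflow":   "http://localhost:8081/health",
    "Superset":  "http://localhost:8088/health",
    "MLflow":    "http://localhost:5001",
    "Model API": "http://localhost:8000/health",
}

for name, url in SERVICES.items():
    try:
        status = requests.get(url, timeout=5).status_code
        print(f"{name:<10} {url:<40} HTTP {status}")
    except requests.RequestException as exc:
        print(f"{name:<10} {url:<40} UNREACHABLE ({exc.__class__.__name__})")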

πŸ”§ Component Details

1. Data Ingestion (Kafka)

Start the data stream:

cd ingestion
python kafka_producer.py --interval 0.5 --duration 60

Test consumer (optional):

python kafka_consumer.py
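
For reference, here is a minimal sketch of what a producer like ingestion/kafka_producer.py can look like, assuming the kafka-python client and the KAFKA_* values from .env; the script in this repo may differ, and the event fields below are purely illustrative.

# minimal producer sketch (illustrative; the real kafka_producer.py may differ)
import argparse, json, random, time
from kafka import KafkaProducer  # pip install kafka-python

parser = argparse.ArgumentParser()
parser.add_argument("--interval", type=float, default=0.5)  # seconds between events
parser.add_argument("--duration", type=int, default=60)     # total seconds to run
args = parser.parse_args()

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                      # KAFKA_BOOTSTRAP_SERVERS
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

end = time.time() + args.duration
while time.time() < end:
    event = {"user_id": random.randint(1, 100),              # illustrative fields
             "amount": round(random.uniform(1.0, 500.0), 2),
             "ts": time.time()}
    producer.send("data_stream", event)                      # KAFKA_TOPIC_NAME
    time.sleep(args.interval)

producer.flush()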

2. Data Processing (Spark)

Submit ETL job:

docker exec -it <spark-master-container> spark-submit \
  --master spark://spark-master:7077 \
  /opt/spark/processing/spark_etl.py
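
The ETL logic lives in processing/spark_etl.py. As a rough illustration of the pattern (read raw events from Kafka, parse and clean them, write Parquet to the data lake), here is a simplified batch version; it assumes the spark-sql-kafka package is available, the Kafka service is reachable as kafka:9092 inside Docker, and data/ is mounted under /opt/spark.

# simplified batch-read sketch; the real spark_etl.py may stream and do more
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, LongType, StructField, StructType

spark = SparkSession.builder.appName("spark_etl_sketch").getOrCreate()

schema = StructType([                                     # illustrative event schema
    StructField("user_id", LongType()),
    StructField("amount", DoubleType()),
    StructField("ts", DoubleType()),
])

raw = (spark.read.format("kafka")
       .option("kafka.bootstrap.servers", "kafka:9092")   # service name inside Docker (assumed)
       .option("subscribe", "data_stream")
       .load())

events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("e"))
          .select("e.*")
          .dropna())

events.write.mode("overwrite").parquet("/opt/spark/data/processed/events")  # mount path assumed
spark.stop()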

3. Machine Learning (MLflow)

Train models:

cd ml
python train_model.py
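
The training script logs parameters, metrics, and the fitted model to MLflow. A compressed sketch of that pattern follows, assuming scikit-learn and the MLFLOW_TRACKING_URI from .env; the feature/target split and model choice here are illustrative, and the real train_model.py presumably pulls its paths and parameters from config/settings.yaml.

# minimal MLflow training sketch (illustrative; see ml/train_model.py for the real script)
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

mlflow.set_tracking_uri("http://localhost:5001")          # MLFLOW_TRACKING_URI
mlflow.set_experiment("data-engineering-ml")              # experiment name is illustrative

df = pd.read_parquet("../data/processed/events")          # output of the Spark ETL step
X, y = df[["user_id", "ts"]], df["amount"]                # illustrative features/target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

with mlflow.start_run():
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    mae = mean_absolute_error(y_test, model.predict(X_test))
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("mae", mae)
    mlflow.sklearn.log_model(model, artifact_path="model")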

View experiments in the MLflow UI at http://localhost:5001 (the MLFLOW_TRACKING_URI from .env).

4. Orchestration (Airflow)

Access Airflow:

  1. Go to http://localhost:8081
  2. Login: admin/admin
  3. Enable the pipeline_dag
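
dags/pipeline_dag.py chains the ETL and ML steps. The sketch below shows the general shape of such a DAG using BashOperator, assuming Airflow 2.4+; the commands, schedule, and container paths are illustrative, not necessarily what the repo's DAG runs.

# minimal DAG sketch (illustrative; see dags/pipeline_dag.py for the real pipeline)
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="pipeline_dag",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",              # illustrative schedule
    catchup=False,
) as dag:
    etl = BashOperator(
        task_id="spark_etl",
        bash_command="spark-submit --master spark://spark-master:7077 /opt/spark/processing/spark_etl.py",
    )
    train = BashOperator(
        task_id="train_model",
        bash_command="python /opt/airflow/ml/train_model.py",   # path inside the Airflow container is assumed
    )
    etl >> train                     # run ETL first, then training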

5. Business Intelligence (Superset)

Setup dashboards:

  1. Go to http://localhost:8088
  2. Login: admin/admin
  3. Connect to your data sources
  4. Create visualizations

πŸ“Š Data Flow

  1. Ingestion: Kafka producer generates sample e-commerce data
  2. Streaming: Kafka topics buffer the real-time data
  3. Processing: Spark consumes from Kafka, cleans, and transforms data
  4. Storage: Processed data stored in data lake (local/MinIO)
  5. ML Pipeline: MLflow trains and tracks machine learning models
  6. Serving: FastAPI serves model predictions
  7. Analytics: Superset creates dashboards from processed data
  8. Orchestration: Airflow coordinates the entire pipeline

πŸ› οΈ Configuration

Environment Variables (.env)

# Kafka
KAFKA_BOOTSTRAP_SERVERS=localhost:9092
KAFKA_TOPIC_NAME=data_stream

# Spark
SPARK_MASTER_URL=spark://localhost:7077

# MLflow
MLFLOW_TRACKING_URI=http://localhost:5001

# MinIO
MINIO_ENDPOINT=localhost:9000
MINIO_ACCESS_KEY=minioadmin
MINIO_SECRET_KEY=minioadmin
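
Scripts can pick these values up from the environment. A small sketch assuming python-dotenv (whether the repo uses dotenv or plain os.environ is an implementation detail):

# reading .env values (sketch; assumes python-dotenv is installed)
import os
from dotenv import load_dotenv

load_dotenv()                                                       # reads .env from the project root
BOOTSTRAP = os.getenv("KAFKA_BOOTSTRAP_SERVERS", "localhost:9092")
TOPIC = os.getenv("KAFKA_TOPIC_NAME", "data_stream")
MLFLOW_URI = os.getenv("MLFLOW_TRACKING_URI", "http://localhost:5001")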

Global Settings (config/settings.yaml)

Centralized configuration for:

  • Data paths
  • ML parameters
  • Service endpoints
  • Algorithm configurations
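
A hypothetical loader for that file is shown below; the key names are illustrative, so check the repo's config/settings.yaml for the real layout.

# loading config/settings.yaml (sketch; key names are hypothetical)
import yaml

with open("config/settings.yaml") as fh:
    settings = yaml.safe_load(fh)

processed_path = settings["data"]["processed_path"]   # hypothetical key
ml_params = settings["ml"]["params"]                   # hypothetical key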

πŸ” Monitoring and Troubleshooting

Check Service Logs

# All services
docker-compose logs

# Specific service
docker-compose logs kafka
docker-compose logs spark-master
docker-compose logs mlflow

Common Issues

  1. Out of Memory: Increase Docker memory allocation
  2. Port Conflicts: Check if ports are already in use
  3. Kafka Connection: Ensure Kafka is fully started before Spark jobs
  4. MLflow Storage: Check volume mounts and permissions

πŸ“ˆ Scaling and Production

Horizontal Scaling

# Scale Spark workers
docker-compose up -d --scale spark-worker=3

# Scale Kafka (requires cluster setup)
# Scale Airflow workers
docker-compose up -d --scale airflow-worker=2

Production Considerations

  • Replace SQLite with PostgreSQL for MLflow
  • Use external Kafka cluster
  • Implement proper authentication
  • Add monitoring (Prometheus/Grafana)
  • Use Kubernetes for orchestration
  • Implement data lineage tracking

πŸ§ͺ Testing

# Run unit tests
python -m pytest tests/

# Test ETL pipeline
python tests/test_etl.py

# Test ML pipeline
python tests/test_ml.py
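
The actual tests ship with the repo under tests/. As an illustration of the style such a test might take (the helper below is a hypothetical stand-in, not the repo's API):

# illustrative pytest sketch (hypothetical helper; see tests/ for the real tests)
import pandas as pd

def clean_events(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for an ETL helper: drop rows with missing values."""
    return df.dropna()

def test_clean_events_drops_nulls():
    df = pd.DataFrame({"user_id": [1, None], "amount": [10.0, 20.0]})
    assert len(clean_events(df)) == 1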

πŸ“š API Documentation

Model Serving API

# Health check
curl http://localhost:8000/health

# Predict
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [1.0, 2.0, 3.0]}'
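
serving/app.py exposes these two endpoints. Below is a compressed sketch of a FastAPI version that serves a registered MLflow model; the model URI and feature handling are assumptions, and the real app may load its model differently.

# minimal serving sketch (illustrative; see serving/app.py for the real service)
from typing import List

import mlflow.pyfunc
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
# requires MLFLOW_TRACKING_URI to be set; registry URI below is an assumption
model = mlflow.pyfunc.load_model("models:/my_model/latest")

class PredictRequest(BaseModel):
    features: List[float]

@app.get("/health")
def health():
    return {"status": "ok"}

@app.post("/predict")
def predict(req: PredictRequest):
    df = pd.DataFrame([req.features])        # column layout must match training
    return {"prediction": model.predict(df).tolist()}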

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Submit a pull request

πŸ“„ License

This project is licensed under the MIT License.

πŸ†˜ Support

  • Check the logs first: docker-compose logs <service>
  • Review configuration in config/settings.yaml
  • Ensure all prerequisites are installed
  • Check Docker resource allocation

πŸ”— Useful Commands

# Stop all services
docker-compose down

# Rebuild specific service
docker-compose up -d --build model-api

# Access service shell
docker-compose exec spark-master bash

# View real-time logs
docker-compose logs -f kafka

# Clean up volumes (WARNING: deletes data)
docker-compose down -v

Happy Data Engineering! πŸš€
