This project demonstrates an end-to-end real-time streaming analytics pipeline, designed to process high-volume clickstream data and display insights on a live dashboard.


πŸš€ E-commerce Real-Time Streaming Analytics Dashboard

🎯 Business Case

We have an e-commerce platform that needs to monitor user clickstream events as they happen. The goal is to gain real-time insights into user engagement, identify popular products, and quickly detect anomalies (e.g., a sudden spike in errors or unusual activity). This dashboard provides that live monitoring capability.

πŸ—οΈ Architecture & Tech Stack

The project is built with the following components:

  • Data Ingestion: Apache Kafka (using kafka-python in a Python producer script) acts as a high-throughput message bus for raw clickstream events (an example event payload is sketched after this list).
  • Stream Processing: Apache Spark Structured Streaming (using PySpark) consumes data from Kafka, performs real-time transformations, and aggregates metrics (e.g., page views per minute).
  • Serving Layer (Conceptual): While not fully implemented in this demo for simplicity, in a production environment, aggregated metrics from Spark would be written to a data warehouse like Amazon Redshift for persistent storage or a low-latency cache like Redis for dashboard lookups.
  • Visualization: A simple, live-updating dashboard built with Streamlit queries the processed data to display the latest metrics.
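Each clickstream event flowing through the pipeline is a small JSON document. The exact fields are defined in kafka_producer.py; the payload below is an illustrative assumption rather than the authoritative schema:

    {"user_id": "user_42", "page": "/product/123", "event_type": "click", "timestamp": "2024-01-01 12:00:00"}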

πŸ’‘ Key Areas

This project will demonstrate:

  • Real-time data ingestion using Apache Kafka.
  • Distributed stream processing with Apache Spark Structured Streaming.
  • Stateful aggregations (like windowing) on streaming data.
  • Connecting a data pipeline to a live visualization layer.

πŸ› οΈ Prerequisites

Before you begin, ensure you have the following installed on your system:

  • Python 3.x:
    • Check with python3 --version.
    • Recommended (macOS): Install via Homebrew: brew install python@3.11 (or newer).
  • Java Development Kit (JDK) 11 or 17: Spark requires Java.
    • Check with java -version.
    • Recommended (macOS): Install OpenJDK via Homebrew: brew install openjdk@11. Remember to follow Homebrew's instructions to link the JDK and set your JAVA_HOME environment variable (e.g., export JAVA_HOME=$(/usr/libexec/java_home -v 11) in your ~/.zshrc or ~/.bash_profile).
  • Apache Kafka:
    • Download a binary release from https://kafka.apache.org/downloads.
    • Extract the archive (e.g., tar -xzf kafka_2.13-3.x.x.tgz) and move it to a convenient location like ~/kafka; the setup steps below assume this path.
  • Apache Spark:
    • Download from https://spark.apache.org/downloads.html. Choose a pre-built package for Hadoop 3.3 or later.
    • Extract the archive (e.g., tar -xzf spark-3.x.x-bin-hadoop3.tgz) and move it to a convenient location like ~/spark.
    • Set SPARK_HOME and add Spark's bin directory to your PATH in your shell profile (e.g., ~/.zshrc or ~/.bash_profile):
      export SPARK_HOME=~/spark
      export PATH=$SPARK_HOME/bin:$PATH
      export PYSPARK_PYTHON=python3 # Ensure PySpark uses Python 3
      Remember to source your profile file after making changes.
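With the environment variables in place, a quick sanity check (both are standard version commands) confirms that Java and Spark are on your PATH:

    java -version
    spark-submit --version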

Install Python Libraries 🐍

In a new terminal, install the required Python libraries:

pip install kafka-python pyspark streamlit

βš™οΈ Setup Instructions

Follow these steps to get the environment ready:

1. Start Apache Kafka πŸ“₯

Open three separate terminal windows for Kafka operations.

  • Terminal 1: Start ZooKeeper. Navigate to your Kafka directory (e.g., cd ~/kafka) and run:

    bin/zookeeper-server-start.sh config/zookeeper.properties

    Keep this terminal open.


  • Terminal 2: Start the Kafka Broker. Open a new terminal, navigate to your Kafka directory, and run:

    bin/kafka-server-start.sh config/server.properties

    Keep this terminal open.


  • Terminal 3: Create the Kafka Topic. Open another new terminal, navigate to your Kafka directory, and create the topic:

    bin/kafka-topics.sh --create --topic user_clickstream --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1

    You should see a confirmation message.
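    (Optional) To double-check that the topic is reachable, you can attach a console consumer; it will sit idle until the producer is started later, and can be stopped with Ctrl+C:

    bin/kafka-console-consumer.sh --topic user_clickstream --bootstrap-server localhost:9092 --from-beginning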



2. Prepare Spark Kafka Connector JARs πŸ”—

The PySpark consumer script will automatically download the necessary Kafka connector JARs when you submit the job using the --packages argument. You don't need to manually download them beforehand.


πŸ“‚ Project Files

  • kafka_producer.py: Python script to simulate and send clickstream events to Kafka.

  • spark_consumer.py: PySpark Structured Streaming application to consume, process, and aggregate data from Kafka.

  • dashboard_app.py: Streamlit application to visualize the real-time analytics.



▢️ Running the Project

Open three separate terminal windows for running the project components.


1. Run the Kafka Producer πŸ“€

In Terminal 1, navigate to the directory where you saved kafka_producer.py and run:

python3 kafka_producer.py

This script will start generating and sending simulated user click events to your user_clickstream Kafka topic. You'll see "Sent: {...}" messages in the terminal.

Keep this running.
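For orientation, a minimal producer in this spirit looks roughly like the sketch below. It is not a verbatim copy of kafka_producer.py; the event fields and values are illustrative assumptions.

    import json
    import random
    import time

    from kafka import KafkaProducer

    # Serialize each event dict to JSON bytes before sending to Kafka.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    pages = ["/home", "/product/123", "/cart", "/checkout"]

    while True:
        # Hypothetical event shape; adjust to match the real producer.
        event = {
            "user_id": f"user_{random.randint(1, 100)}",
            "page": random.choice(pages),
            "event_type": random.choice(["view", "click", "purchase"]),
            "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
        }
        producer.send("user_clickstream", event)
        print(f"Sent: {event}")
        time.sleep(1)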



2. Run the PySpark Structured Streaming Consumer πŸ“ˆ


In Terminal 2, navigate to the directory where you saved spark_consumer.py and run the Spark job. Make sure to include the --packages argument so Spark can connect to Kafka:

spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0,org.apache.kafka:kafka-clients:2.8.0 spark_consumer.py

Spark will start consuming messages from Kafka, processing them, and printing the aggregated results like page views per minute and user engagement to this terminal.
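The essential shape of such a Structured Streaming job is sketched below, assuming the illustrative event fields shown earlier (user_id, page, event_type, timestamp); spark_consumer.py may differ in details such as the output sink and window size.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json, to_timestamp, window
    from pyspark.sql.types import StringType, StructField, StructType

    spark = SparkSession.builder.appName("ClickstreamConsumer").getOrCreate()

    # Assumed JSON schema of the incoming events.
    schema = StructType([
        StructField("user_id", StringType()),
        StructField("page", StringType()),
        StructField("event_type", StringType()),
        StructField("timestamp", StringType()),
    ])

    # Read raw records from the user_clickstream topic.
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "localhost:9092")
           .option("subscribe", "user_clickstream")
           .load())

    events = (raw
              .select(from_json(col("value").cast("string"), schema).alias("e"))
              .select("e.*")
              .withColumn("event_time", to_timestamp(col("timestamp"))))

    # Stateful aggregation: page views per 1-minute window, with a watermark
    # so that old state can be dropped.
    page_views = (events
                  .withWatermark("event_time", "2 minutes")
                  .groupBy(window(col("event_time"), "1 minute"), col("page"))
                  .count())

    query = (page_views.writeStream
             .outputMode("update")
             .format("console")
             .option("truncate", False)
             .start())
    query.awaitTermination()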



3. Run the Streamlit Dashboard πŸ“Š

In Terminal 3, navigate to the directory where you saved dashboard_app.py and run the Streamlit application:

streamlit run dashboard_app.py

This command will open a new tab in your web browser (usually http://localhost:8501) displaying the real-time analytics dashboard. The dashboard currently uses simulated data, but in a real setup, it would fetch data from your serving layer (Redshift/Redis).
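A minimal Streamlit loop in the same spirit is sketched below; it renders simulated numbers, as the current demo does, and the metric and chart names are illustrative assumptions rather than the exact contents of dashboard_app.py.

    import time

    import numpy as np
    import pandas as pd
    import streamlit as st

    st.title("E-commerce Real-Time Analytics")
    placeholder = st.empty()

    # Simulated metrics; a production version would query the serving layer
    # (e.g., Redis or Redshift) instead of generating random numbers.
    while True:
        page_views = pd.DataFrame(
            {"page_views": np.random.randint(50, 200, size=10)},
            index=pd.date_range(end=pd.Timestamp.now(), periods=10, freq="min"),
        )
        with placeholder.container():
            st.metric("Active users (last minute)", int(np.random.randint(20, 80)))
            st.line_chart(page_views)
        time.sleep(2)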

