Imagine a typical e-commerce platform that needs to monitor user clickstream events as they happen. The goal is to gain real-time insight into user engagement, identify popular products, and quickly detect anomalies (e.g., a sudden spike in errors or unusual activity). This dashboard provides that live monitoring capability.
The project is built with the following components:
- Data Ingestion: Apache Kafka (using `kafka-python` in a Python producer script) acts as a high-throughput message bus for raw clickstream events.
- Stream Processing: Apache Spark Structured Streaming (using PySpark) consumes data from Kafka, performs real-time transformations, and aggregates metrics (e.g., page views per minute).
- Serving Layer (Conceptual): While not fully implemented in this demo for simplicity, in a production environment aggregated metrics from Spark would be written to a data warehouse like Amazon Redshift for persistent storage, or to a low-latency cache like Redis for dashboard lookups (see the sketch after this list).
- Visualization: A simple, live-updating dashboard built with Streamlit queries the processed data to display the latest metrics.
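As a hint of what that serving layer could look like, here is a minimal sketch of pushing each micro-batch of Spark aggregates into Redis via `foreachBatch`. The `redis` package, the `page_views` key, and the column names are illustrative assumptions, not part of this demo:

```python
import redis  # assumes the redis-py package: pip install redis

# Hypothetical foreachBatch sink: push each micro-batch of aggregates
# into Redis so a dashboard can read the latest metrics with low latency.
def write_batch_to_redis(batch_df, batch_id):
    r = redis.Redis(host="localhost", port=6379)
    # Aggregated batches are small, so collecting to the driver is acceptable.
    for row in batch_df.collect():
        # e.g., store page-view counts in a hash keyed by page name
        r.hset("page_views", row["page"], row["count"])

# Wired into a streaming query in place of a console sink:
# query = aggregates.writeStream.foreachBatch(write_batch_to_redis).start()
```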
This project will demonstrate:
- Real-time data ingestion using Apache Kafka.
- Distributed stream processing with Apache Spark Structured Streaming.
- Stateful aggregations (like windowing) on streaming data (see the snippet after this list).
- Connecting a data pipeline to a live visualization layer.
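To preview the windowing point above, a tumbling-window count in Structured Streaming looks roughly like this. The `events_df` DataFrame and its column names are illustrative; the real logic lives in `spark_consumer.py`:

```python
from pyspark.sql.functions import col, window

# events_df: a streaming DataFrame with event_time (timestamp) and page columns.
# Count events per page over 1-minute tumbling windows, tolerating events
# that arrive up to 2 minutes late (the watermark bounds the kept state).
page_views_per_minute = (
    events_df
    .withWatermark("event_time", "2 minutes")
    .groupBy(window(col("event_time"), "1 minute"), col("page"))
    .count()
)
```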
Before you begin, ensure you have the following installed on your system:
- Python 3.x:
  - Check with `python3 --version`.
  - Recommended (macOS): install via Homebrew: `brew install python@3.11` (or newer).
- Java Development Kit (JDK) 11 or 17: Spark requires Java.
  - Check with `java -version`.
  - Recommended (macOS): install OpenJDK via Homebrew: `brew install openjdk@11`. Remember to follow Homebrew's instructions to link the JDK and set your `JAVA_HOME` environment variable (e.g., `export JAVA_HOME=$(/usr/libexec/java_home -v 11)` in your `~/.zshrc` or `~/.bash_profile`).
- Apache Kafka:
  - Download from https://kafka.apache.org/downloads.
  - Extract the archive (e.g., `tar -xzf kafka_2.13-3.x.x.tgz`) and move it to a convenient location like `~/kafka`.
- Apache Spark:
  - Download from https://spark.apache.org/downloads.html. Choose a pre-built package for Hadoop 3.3 or later.
  - Extract the archive (e.g., `tar -xzf spark-3.x.x-bin-hadoop3.tgz`) and move it to a convenient location like `~/spark`.
  - Set `SPARK_HOME` and add Spark's `bin` directory to your `PATH` in your shell profile (e.g., `~/.zshrc` or `~/.bash_profile`):

    ```bash
    export SPARK_HOME=~/spark
    export PATH=$SPARK_HOME/bin:$PATH
    export PYSPARK_PYTHON=python3  # Ensure PySpark uses Python 3
    ```

    Remember to `source` your profile file after making changes.
In a new terminal, install the required Python libraries:

```bash
pip install kafka-python pyspark streamlit
```

Follow these steps to get the environment ready:
Open three separate terminal windows for Kafka operations.
- Terminal 1: Start ZooKeeper. Navigate to your Kafka directory (e.g., `cd ~/kafka`) and run:

  ```bash
  bin/zookeeper-server-start.sh config/zookeeper.properties
  ```

  Keep this terminal open.

- Terminal 2: Start the Kafka broker. Open a new terminal, navigate to your Kafka directory, and run:

  ```bash
  bin/kafka-server-start.sh config/server.properties
  ```

  Keep this terminal open.

- Terminal 3: Create the Kafka topic. Open another new terminal, navigate to your Kafka directory, and create the topic:

  ```bash
  bin/kafka-topics.sh --create --topic user_clickstream --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
  ```

  You should see a confirmation message (you can also verify from Python; see the snippet after this list).
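Optionally, you can confirm the topic is visible from Python with kafka-python. This is a one-off sanity check, not one of the project scripts:

```python
from kafka import KafkaConsumer

# Connect to the local broker and confirm the topic is registered.
consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
print("user_clickstream" in consumer.topics())  # expect: True
consumer.close()
```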
The PySpark consumer script will automatically download the necessary Kafka connector JARs when you submit the job with the `--packages` argument; you don't need to download them manually beforehand.
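Those JARs provide the `kafka` source format used on the read side of the consumer. A minimal sketch of that setup follows; the options and names are illustrative of what `spark_consumer.py` does, not a verbatim copy:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ClickstreamConsumer").getOrCreate()

# The "kafka" format below is what the --packages JARs provide.
raw_events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "user_clickstream")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers keys/values as bytes; cast the value to a string
# before parsing the JSON payload and aggregating.
json_events = raw_events.selectExpr("CAST(value AS STRING) AS json")
```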
- `kafka_producer.py`: Python script to simulate and send clickstream events to Kafka.
- `spark_consumer.py`: PySpark Structured Streaming application to consume, process, and aggregate data from Kafka.
- `dashboard_app.py`: Streamlit application to visualize the real-time analytics.
Open three separate terminal windows for running the project components.
In Terminal 1, navigate to the directory where you saved `kafka_producer.py` and run:

```bash
python3 kafka_producer.py
```

This script will start generating and sending simulated user click events to your `user_clickstream` Kafka topic; you'll see `Sent: {...}` messages in the terminal. Keep this running.
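For orientation, the core loop of a producer like `kafka_producer.py` can be sketched as below. The event fields are illustrative of the simulated clickstream, not a spec:

```python
import json
import random
import time

from kafka import KafkaProducer

# Serialize each event dict to JSON bytes before sending to Kafka.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    event = {
        "user_id": random.randint(1, 100),
        "page": random.choice(["/home", "/product/42", "/cart", "/checkout"]),
        "event_time": time.strftime("%Y-%m-%d %H:%M:%S"),
    }
    producer.send("user_clickstream", value=event)
    print(f"Sent: {event}")
    time.sleep(0.5)  # throttle to a readable event rate
```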
In Terminal 2, navigate to the directory where you saved `spark_consumer.py` and run the Spark job. Make sure to include the `--packages` argument so Spark can connect to Kafka:

```bash
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0,org.apache.kafka:kafka-clients:2.8.0 spark_consumer.py
```

Spark will start consuming messages from Kafka, processing them, and printing aggregated results (such as page views per minute and user engagement) to this terminal.
In Terminal 3, navigate to the directory where you saved `dashboard_app.py` and run the Streamlit application:

```bash
streamlit run dashboard_app.py
```

This command will open a new tab in your web browser (usually http://localhost:8501) displaying the real-time analytics dashboard. The dashboard currently uses simulated data; in a real setup it would fetch data from your serving layer (Redshift/Redis).
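In outline, such a simulated-data dashboard can be as simple as the sketch below. The metric names and refresh loop are illustrative assumptions, not the exact contents of `dashboard_app.py`:

```python
import random
import time

import pandas as pd
import streamlit as st

st.title("Real-Time Clickstream Analytics")

placeholder = st.empty()  # container rewritten on every refresh

while True:
    # A real deployment would query the serving layer (Redis/Redshift);
    # here we simulate the aggregated metrics instead.
    data = pd.DataFrame({
        "page": ["/home", "/product/42", "/cart", "/checkout"],
        "views_last_minute": [random.randint(0, 50) for _ in range(4)],
    })
    with placeholder.container():
        st.metric("Total views (last minute)", int(data["views_last_minute"].sum()))
        st.bar_chart(data.set_index("page"))
    time.sleep(2)  # refresh interval in seconds
```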