ChemPredictAI

A chemical reaction prediction system that combines ML models with LLMs to predict reaction outcomes, products, and safety hazards.

What this does

I started this project to solve a real problem in chemistry research: predicting reaction outcomes before running expensive lab experiments. I trained ML models on 31k+ chemical reactions and integrated them with Google's Gemini API.

The system takes two reactants (like "ethanol" and "acetic acid") and predicts:

What product will form
What type of reaction it is
How dangerous it is
Detailed explanation of the mechanism

Tech stack

Backend:

FastAPI for the API
Python + scikit-learn for ML models
ChromaDB for chemical knowledge storage
LangChain + Gemini API for research chat

Frontend:

React + TypeScript
Tailwind CSS
Real-time chat interface

ML Pipeline:

3 separate models trained on 31k+ chemical reactions
SMILES molecular representation
Feature engineering for chemical properties
Ensemble methods for better accuracy

Data & Models

Collected 31k+ chemical reactions from various sources. Spent way too much time cleaning and preprocessing the data. The biggest pain was standardizing SMILES notation and handling different naming conventions.
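To give a feel for the naming-convention cleanup, here is a minimal sketch of name normalization. It is not the project's actual cleaning code: the `SYNONYMS` table and both function names are made up for illustration, and SMILES canonicalization itself (which would normally go through RDKit) is left out.

```python
import re

# Hypothetical synonym table mapping alternative names to one preferred name.
# The entries are ordinary chemical synonyms; the project's real table is not shown.
SYNONYMS = {
    "ethyl alcohol": "ethanol",
    "ethanoic acid": "acetic acid",
}


def normalize_name(raw: str) -> str:
    """Lowercase, collapse whitespace, and resolve known synonyms."""
    cleaned = re.sub(r"\s+", " ", raw.strip().lower())
    return SYNONYMS.get(cleaned, cleaned)


def deduplicate(names: list[str]) -> list[str]:
    """Return normalized names with duplicates removed, preserving first-seen order."""
    seen: dict[str, None] = {}
    for name in names:
        seen.setdefault(normalize_name(name), None)
    return list(seen)


if __name__ == "__main__":
    print(deduplicate(["Ethanol", " ethyl  alcohol", "Ethanoic Acid", "acetic acid"]))
```

Run on the four spellings above, this collapses them to two entries, which is the kind of consistency the multi-source data needed before training.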

Trained dataset available here: Google Drive Link

Models trained:

Reaction Type Classifier - substitution, elimination, addition, etc. (85% accuracy)
Hazard Level Predictor - safety risk Low/Medium/High (80% accuracy)
Product Predictor - what the main product will be (75% accuracy)

Used molecular fingerprints and chemical descriptors as features. I tried several algorithms, and ensemble methods worked best for this data.
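As a rough illustration of the two ideas above, the sketch below builds a toy hashed fingerprint from a SMILES string and combines several model outputs by majority vote. Both functions are assumptions for illustration only: a real fingerprint would be computed from the molecular graph (for example with RDKit), and the actual ensemble method used in this project is not specified here.

```python
import hashlib
from collections import Counter


def toy_fingerprint(smiles: str, n_bits: int = 64, ngram: int = 2) -> list[int]:
    """Hash every character n-gram of a SMILES string into a fixed-length bit vector.

    Toy stand-in for a molecular fingerprint: it only sees the text of the
    SMILES string, not the molecular graph.
    """
    bits = [0] * n_bits
    for i in range(len(smiles) - ngram + 1):
        fragment = smiles[i:i + ngram]
        # hashlib is stable across runs, unlike the built-in hash() for strings.
        digest = hashlib.sha256(fragment.encode("utf-8")).digest()
        bits[int.from_bytes(digest[:4], "big") % n_bits] = 1
    return bits


def majority_vote(predictions: list[str]) -> str:
    """Combine labels from several models by picking the most common one."""
    return Counter(predictions).most_common(1)[0][0]


if __name__ == "__main__":
    fp = toy_fingerprint("CCO")  # ethanol
    print(len(fp), sum(fp))
    print(majority_vote(["addition", "substitution", "substitution"]))
```

The fixed-length bit vector is what makes molecules of different sizes usable as input to the same scikit-learn model.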

How it works

User enters two chemicals (e.g., "ethanol" + "acetic acid")
Convert names to SMILES, extract molecular features
Run through trained ML models to get reaction type, hazard level, and product
Return complete prediction with mechanism explanation
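The steps above can be sketched roughly as follows. Everything here is a simplified assumption rather than the project's real code: `NAME_TO_SMILES` is a two-entry lookup table, `extract_features` uses placeholder character counts, and the three models are passed in as plain callables. The mechanism explanation step is omitted.

```python
from dataclasses import dataclass
from typing import Callable

# Tiny hypothetical lookup table; the SMILES strings themselves are standard.
NAME_TO_SMILES = {
    "ethanol": "CCO",
    "acetic acid": "CC(=O)O",
}

Model = Callable[[list[int]], str]


@dataclass
class Prediction:
    reaction_type: str
    hazard_level: str
    product: str


def extract_features(smiles_a: str, smiles_b: str) -> list[int]:
    """Placeholder features: counts of the characters C, O and N in both strings."""
    combined = smiles_a + smiles_b
    return [combined.count("C"), combined.count("O"), combined.count("N")]


def predict(name_a: str, name_b: str,
            type_model: Model, hazard_model: Model, product_model: Model) -> Prediction:
    """Names -> SMILES -> features -> three model outputs."""
    smiles_a = NAME_TO_SMILES[name_a.strip().lower()]  # KeyError for unknown names
    smiles_b = NAME_TO_SMILES[name_b.strip().lower()]
    features = extract_features(smiles_a, smiles_b)
    return Prediction(
        reaction_type=type_model(features),
        hazard_level=hazard_model(features),
        product=product_model(features),
    )


if __name__ == "__main__":
    # Hard-coded stand-ins for the trained models, not real predictions.
    result = predict("ethanol", "acetic acid",
                     lambda f: "substitution", lambda f: "Low", lambda f: "ethyl acetate")
    print(result)
```

Passing the models in as callables keeps the pipeline independent of how each model was trained, so the stubs can be swapped for the trained scikit-learn models.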

The ML models are trained on 31k+ chemical reactions and reach roughly 85% accuracy for reaction type, 80% for hazard level, and 75% for product prediction.

Current status

Working:

Backend API deployed and running
Frontend is live
Research chat system works
ML model predictions are operational locally

Still working on: the ML models work well locally, but I'm having deployment issues. Chemistry libraries like RDKit are large and don't play nicely with cloud deployment, so I'm working on optimizing the build process.

For now: you can test everything locally, where the ML models run with full functionality. In the deployed version the ML models are still being built, but the core prediction system is functional.
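One way to handle this split between local and deployed environments is to check at startup whether the heavy chemistry libraries are installed and fall back to an LLM-only path when they are not. The sketch below is an assumption about how that could look, not the project's actual logic; `has_module` and `choose_backend` are hypothetical names.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(name) is not None


def choose_backend(required_modules=("rdkit", "sklearn")) -> str:
    """Pick 'ml' when every required library is installed, otherwise 'llm'."""
    if all(has_module(module) for module in required_modules):
        return "ml"
    return "llm"


if __name__ == "__main__":
    print(f"Prediction backend: {choose_backend()}")
```

Checking with `find_spec` avoids actually importing RDKit just to find out whether it is there, which keeps startup cheap on a constrained host.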

Try It Locally

Setup:

# Backend
cd backend
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt

# Add your API keys to .env file
echo "GOOGLE_API_KEY=your_key_here" > .env

# Run backend
uvicorn main:app --reload

# Frontend (in another terminal)
cd frontend
npm install
npm run dev

Test the ML models:

# This will work locally with full ML functionality
from ML_Model.predict.predict_reaction import predict_reaction
result = predict_reaction("ethanol", "acetic acid", input_type="name")
print(result)

What I Learned

Technical Challenges:

SMILES notation standardization was a nightmare
Getting RDKit to work with different Python versions
Balancing ML accuracy with response time
Integrating multiple APIs (Gemini, RXN4Chemistry)

ML Insights:

Chemical data is messy - lots of cleaning required
Ensemble methods work better than single models
Feature engineering is crucial for chemical properties
LLMs can actually help explain ML predictions

Deployment Lessons:

Chemistry libraries are huge and complex
Cloud deployment has different constraints than local
Sometimes simpler solutions (LLM-only) work just as well

API Usage

Main endpoints:

/predict_all - ML model prediction (reaction type, product, hazard level)
/chat - Research chat with chemistry questions

Example:

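import requests

# Send two reactant names to the local prediction endpoint
response = requests.post(
    "http://localhost:8000/predict_all",
    json={"reactant1": "benzene", "reactant2": "nitric acid"},
    timeout=30,
)
response.raise_for_status()  # fail loudly on HTTP errors

result = response.json()
print(f"Product: {result['product']}")
print(f"Reaction Type: {result['reaction_type']}")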

Technical Details

Data Processing: I spent a lot of time cleaning the chemical data - standardizing SMILES notation, handling different naming conventions, and dealing with incomplete reactions. The data came from multiple sources so consistency was a big challenge.

Model Architecture:

Random Forest for reaction type classification
Gradient Boosting for hazard level prediction
Neural network for product prediction
Cross-validation to avoid overfitting
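For readers unfamiliar with cross-validation, here is a minimal pure-Python sketch of k-fold evaluation using a majority-class baseline. It is only an illustration of the idea: the real models would be evaluated with scikit-learn's own tools (for example `cross_val_score`), and the function names and toy labels below are made up.

```python
import random
from collections import Counter


def k_fold_indices(n_samples: int, k: int, seed: int = 0):
    """Yield (train_idx, test_idx) pairs; every sample is test data exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_val_accuracy(labels: list[str], k: int = 5, seed: int = 0) -> float:
    """Mean held-out accuracy of a baseline that always predicts the majority class."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(labels), k, seed):
        # "Train" on the training folds only, then score on the held-out fold.
        majority = Counter(labels[j] for j in train_idx).most_common(1)[0][0]
        correct = sum(labels[j] == majority for j in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    toy_labels = ["Low"] * 8 + ["High"] * 2
    print(cross_val_accuracy(toy_labels, k=5))
```

A baseline like this is also a useful sanity check: a trained classifier should beat the majority-class score on the same folds.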

Performance:

Reaction type: 85% accuracy
Hazard level: 80% accuracy
Product prediction: 75% accuracy

Why I Built This

I wanted to solve a real problem in chemistry research - predicting reaction outcomes before running expensive lab experiments. This could help:

Researchers predict if a reaction will work
Students understand reaction mechanisms
Industry assess safety risks before scaling up
Anyone learn about chemical reactions

The chat feature lets you ask questions like "What happens if I mix X and Y?" and get detailed explanations.

Next Steps

What I'm working on:

Fixing the deployment issues with the ML models
Adding more reaction types to the dataset
Improving the chat interface
Maybe adding 3D molecule visualization

Try it out: The system works great locally with full ML functionality. The deployed version uses Gemini for predictions, which actually gives really good results too.
