A chemical reaction prediction system that combines ML models with LLMs to predict reaction outcomes, products, and safety hazards.
I started this project to solve a real problem in chemistry research: predicting reaction outcomes before running expensive lab experiments. I trained ML models on 31k+ chemical reactions and integrated them with Google's Gemini API.
The system takes two reactants (like "ethanol" and "acetic acid") and predicts:
- What product will form
- What type of reaction it is
- How dangerous it is
- Detailed explanation of the mechanism
Backend:
- FastAPI for the API
- Python + scikit-learn for ML models
- ChromaDB for chemical knowledge storage
- LangChain + Gemini API for research chat
Frontend:
- React + TypeScript
- Tailwind CSS
- Real-time chat interface
ML Pipeline:
- 3 separate models trained on 31k+ chemical reactions
- SMILES molecular representation
- Feature engineering for chemical properties
- Ensemble methods for better accuracy
Collected 31k+ chemical reactions from various sources. Spent way too much time cleaning and preprocessing the data. The biggest pain was standardizing SMILES notation and handling different naming conventions.
Trained dataset available here: Google Drive Link
Models trained:
- Reaction Type Classifier - substitution, elimination, addition, etc. (85% accuracy)
- Hazard Level Predictor - safety risk Low/Medium/High (80% accuracy)
- Product Predictor - what the main product will be (75% accuracy)
Used molecular fingerprints and chemical descriptors as features. Tried different algorithms; ensemble methods worked best for this data.
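To make the feature step a bit more concrete, here is a minimal sketch of a hashed fingerprint computed straight from a SMILES string. It is a simplified stand-in, not the project's actual feature code: real molecular fingerprints (for example Morgan fingerprints from RDKit) work on the molecular graph, while this toy version only hashes short character fragments. The name `toy_fingerprint` and the parameters `n_bits` and `max_len` are made up for illustration.

```python
import hashlib


def toy_fingerprint(smiles: str, n_bits: int = 64, max_len: int = 3) -> list[int]:
    """Hash every substring of length 1..max_len into a fixed-size bit vector.

    Simplified stand-in for a real molecular fingerprint: it works on the
    SMILES text, not on the molecular graph.
    """
    bits = [0] * n_bits
    for size in range(1, max_len + 1):
        for start in range(len(smiles) - size + 1):
            fragment = smiles[start:start + size]
            # hashlib gives a hash that is stable across runs (unlike hash()).
            digest = hashlib.sha256(fragment.encode("utf-8")).digest()
            index = int.from_bytes(digest[:4], "big") % n_bits
            bits[index] = 1
    return bits


ethanol = toy_fingerprint("CCO")
acetic_acid = toy_fingerprint("CC(=O)O")

# One feature row for a reactant pair: concatenate both fingerprints.
pair_features = ethanol + acetic_acid
print(len(pair_features), sum(pair_features))
```

Concatenating the two vectors is only one possible way to represent a reactant pair; it keeps the row length fixed regardless of molecule size, which is what tree-based models expect.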
- User enters two chemicals (e.g., "ethanol" + "acetic acid")
- Convert names to SMILES, extract molecular features
- Run through trained ML models to get reaction type, hazard level, and product
- Return complete prediction with mechanism explanation
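The steps above can be sketched roughly as follows. This is an illustration of the control flow only, under stated assumptions: the lookup table `NAME_TO_SMILES`, the toy `extract_features` function, and the placeholder models are all hypothetical and stand in for the real name resolution, feature engineering, and trained models.

```python
from typing import Callable, Dict, List

# Hypothetical lookup table; a real system would query a name resolver.
NAME_TO_SMILES: Dict[str, str] = {
    "ethanol": "CCO",
    "acetic acid": "CC(=O)O",
}


def name_to_smiles(name: str) -> str:
    """Resolve a chemical name to SMILES, ignoring case and outer whitespace."""
    key = name.strip().lower()
    if key not in NAME_TO_SMILES:
        raise KeyError(f"unknown chemical name: {name!r}")
    return NAME_TO_SMILES[key]


def extract_features(smiles: str) -> List[float]:
    """Toy features: string length plus counts of 'C', 'O' and '=' characters."""
    return [float(len(smiles)), float(smiles.count("C")),
            float(smiles.count("O")), float(smiles.count("="))]


Model = Callable[[List[float]], str]


def predict(reactant1: str, reactant2: str, models: Dict[str, Model]) -> Dict[str, str]:
    """Run both reactants through every model and collect the outputs."""
    features = (extract_features(name_to_smiles(reactant1))
                + extract_features(name_to_smiles(reactant2)))
    return {target: model(features) for target, model in models.items()}


# Placeholder models that return fixed labels; trained models would go here.
stub_models: Dict[str, Model] = {
    "reaction_type": lambda features: "placeholder-type",
    "hazard_level": lambda features: "placeholder-level",
    "product": lambda features: "placeholder-product",
}

result = predict("Ethanol", " acetic acid ", stub_models)
print(result)
```

Passing the models in as a dictionary of callables keeps the pipeline independent of how each model was trained, so a stub, a scikit-learn estimator wrapper, or an LLM call could be swapped in per target.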
The ML models are trained on 31k+ chemical reactions and predict reaction type, hazard level, and main product, with reported accuracies of 85%, 80%, and 75% respectively.
Working:
- Backend API deployed and running
- Frontend is live
- Research chat system works
- ML model predictions are operational locally
Still working on: the ML models work great locally, but I'm having deployment issues. Chemistry libraries (like RDKit) are huge and don't play nice with cloud deployment, so I'm working on optimizing the build process.
For now: you can test everything locally, where the ML models run with full functionality. In the deployed version the ML models are still in the build phase, but the core prediction system is functional.
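One way to handle the split between local and deployed environments is to check at startup whether the heavy chemistry packages are installed and pick a prediction backend accordingly. The sketch below is an assumption about how that could look, not the project's actual code: `pick_backend` and the backend labels "ml" and "llm" are hypothetical names.

```python
import importlib.util
from typing import Sequence


def pick_backend(required: Sequence[str] = ("rdkit", "sklearn")) -> str:
    """Return "ml" if every required top-level package is installed, else "llm"."""
    for package in required:
        # find_spec returns None for a missing top-level package.
        if importlib.util.find_spec(package) is None:
            return "llm"
    return "ml"


backend = pick_backend()
print(f"Using backend: {backend}")
```

Because `find_spec` only looks the package up without importing it, the check stays cheap even when RDKit is installed.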
Setup:
# Backend
cd backend
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
# Add your API keys to .env file
echo "GOOGLE_API_KEY=your_key_here" > .env
# Run backend
uvicorn main:app --reload
# Frontend (in another terminal)
cd frontend
npm install
npm run dev
Test the ML models:
# This will work locally with full ML functionality
from ML_Model.predict.predict_reaction import predict_reaction
result = predict_reaction("ethanol", "acetic acid", input_type="name")
print(result)
Technical Challenges:
- SMILES notation standardization was a nightmare
- Getting RDKit to work with different Python versions
- Balancing ML accuracy with response time
- Integrating multiple APIs (Gemini, RXN4Chemistry)
ML Insights:
- Chemical data is messy - lots of cleaning required
- Ensemble methods work better than single models
- Feature engineering is crucial for chemical properties
- LLMs can actually help explain ML predictions
Deployment Lessons:
- Chemistry libraries are huge and complex
- Cloud deployment has different constraints than local
- Sometimes simpler solutions (LLM-only) work just as well
Main endpoints:
- /predict_all - ML model prediction (reaction type, product, hazard level)
- /chat - Research chat with chemistry questions
Example:
import requests
# Assumes the backend is running locally on port 8000
response = requests.post("http://localhost:8000/predict_all", json={
    "reactant1": "benzene",
    "reactant2": "nitric acid"
})
response.raise_for_status()  # fail early on HTTP errors
result = response.json()
print(f"Product: {result['product']}")
print(f"Reaction Type: {result['reaction_type']}")
Data Processing: I spent a lot of time cleaning the chemical data: standardizing SMILES notation, handling different naming conventions, and dealing with incomplete reactions. The data came from multiple sources, so consistency was a big challenge.
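As a rough illustration of that cleaning work, the sketch below drops incomplete reactions, maps differently named chemicals to one name, and removes duplicates. Everything here is assumed for the example: the record layout (`reactant1`, `reactant2`, `product` keys), the tiny `SYNONYMS` map, and the function names are hypothetical, and real SMILES standardization would need a chemistry toolkit rather than string handling.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]

# Hypothetical synonym map; the real project would need a much larger one.
SYNONYMS = {"ethanoic acid": "acetic acid", "ethyl alcohol": "ethanol"}


def normalize_name(name: str) -> str:
    """Lower-case, collapse whitespace, and map known synonyms to one name."""
    cleaned = " ".join(name.lower().split())
    return SYNONYMS.get(cleaned, cleaned)


def clean_reactions(records: List[Record]) -> List[Dict[str, str]]:
    """Drop incomplete reactions, normalize names, and remove duplicates."""
    seen = set()
    cleaned: List[Dict[str, str]] = []
    for record in records:
        fields = [record.get("reactant1"), record.get("reactant2"), record.get("product")]
        if any(not value for value in fields):
            continue  # incomplete reaction: a reactant or the product is missing
        r1, r2, product = (normalize_name(value) for value in fields)
        # Sort the reactants so "A + B" and "B + A" count as the same reaction.
        key = (tuple(sorted((r1, r2))), product)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"reactant1": r1, "reactant2": r2, "product": product})
    return cleaned


raw = [
    {"reactant1": "Ethanol", "reactant2": "Acetic  Acid", "product": "ethyl acetate"},
    {"reactant1": "ethanoic acid", "reactant2": "ethyl alcohol", "product": "Ethyl Acetate"},
    {"reactant1": "benzene", "reactant2": "nitric acid", "product": None},
]
cleaned_reactions = clean_reactions(raw)
print(cleaned_reactions)
```

Here the second record collapses into the first after normalization, and the third is dropped because its product is missing.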
Model Architecture:
- Random Forest for reaction type classification
- Gradient Boosting for hazard level prediction
- Neural network for product prediction
- Cross-validation to avoid overfitting
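For reference, cross-validating two of these model families with scikit-learn might look like the sketch below. It runs on synthetic data from `make_classification`, not on the reaction dataset, and the hyperparameters (`n_estimators=50`, 5 folds) are arbitrary choices for the example rather than the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 300 samples, 16 features, 3 classes.
X, y = make_classification(
    n_samples=300, n_features=16, n_informative=6, n_classes=3, random_state=0
)

models = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, model in models.items():
    # 5-fold cross-validation: each fold is held out once for scoring.
    scores[name] = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores[name].mean():.3f}")
```

Comparing the mean fold accuracy (and its spread) across models gives a fairer picture than a single train/test split, which is the point of using cross-validation to guard against overfitting.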
Performance:
- Reaction type: 85% accuracy
- Hazard level: 80% accuracy
- Product prediction: 75% accuracy
I wanted to solve a real problem in chemistry research - predicting reaction outcomes before running expensive lab experiments. This could help:
- Researchers predict if a reaction will work
- Students understand reaction mechanisms
- Industry assess safety risks before scaling up
- Anyone learn about chemical reactions
The chat feature lets you ask questions like "What happens if I mix X and Y?" and get detailed explanations.
What I'm working on:
- Fixing the deployment issues with the ML models
- Adding more reaction types to the dataset
- Improving the chat interface
- Maybe adding 3D molecule visualization
Try it out: The system works great locally with full ML functionality. The deployed version uses Gemini for predictions, which actually gives really good results too.