A phishing URL detection system that combines a trusted-domain whitelist with a BERT-based machine learning model for accurate and efficient detection.
- Whitelist Database
  - Multiple trusted domain sources (Umbrella, Tranco, Majestic, DomCop)
  - Fast database lookups
  - High confidence for legitimate domains
  - Reduces false positives
 
- AI Model
  - BERT-based deep learning detection
  - Works completely offline
  - Catches sophisticated phishing attempts
  - High accuracy for unknown domains
 
 
- Speed: Quick whitelist checks for known domains
- Accuracy: AI model for unknown domains
- Reliability: Trusted sources (Umbrella, Tranco, Majestic, DomCop) for whitelist
- Efficiency: Optimized database for fast lookups
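
The two-stage flow described above (whitelist check first, model only for unknown domains) can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: `TRUSTED_DOMAINS`, `extract_hostname`, `dummy_classifier`, the `source` field, and the choice to report a whitelist hit with confidence 1.0 are all assumptions made for the example. Only the `is_phishing` and `confidence` keys come from the usage section.

```python
from urllib.parse import urlsplit

# Hypothetical stand-in for the whitelist database: a small in-memory set.
TRUSTED_DOMAINS = {"example.com", "wikipedia.org"}


def extract_hostname(url: str) -> str:
    """Return the lowercased hostname of a URL, without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def check_url(url: str, classify) -> dict:
    """Check the whitelist first; fall back to the classifier otherwise."""
    host = extract_hostname(url)
    if host in TRUSTED_DOMAINS:
        # Assumption: a whitelist hit is reported with full confidence.
        return {"is_phishing": False, "confidence": 1.0, "source": "whitelist"}
    # Assumption: classify() returns the probability that the URL is phishing.
    score = classify(url)
    return {
        "is_phishing": score >= 0.5,
        "confidence": max(score, 1.0 - score),
        "source": "model",
    }


def dummy_classifier(url: str) -> float:
    """Placeholder for the BERT model; flags URLs that contain 'login'."""
    return 0.9 if "login" in url else 0.1


print(check_url("https://www.example.com/account", dummy_classifier))
print(check_url("http://examp1e-login.test/verify", dummy_classifier))
```

Exact hostname matching is used here on purpose: treating every subdomain of a trusted domain as trusted would be a separate design decision.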
 
- Clone the repository

git clone https://github.com/yourusername/phishing-url-detector-ai.git
cd phishing-url-detector-ai

- Download the AI Model
  - Download the model from: Hugging Face Model
  - Create a models directory in the project root
  - Extract the model files into models/bert-finetuned-phishing

- Set up the environment

# Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt

- Initialize the Whitelist Database

# Create database with schema
sqlite3 data/whitelist.db < schema.sql
# Import whitelist data (choose sources as needed)
# For Umbrella Top 1M:
wget -O data/top-1m.csv.zip https://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip
unzip -o data/top-1m.csv.zip -d data/
sqlite3 data/whitelist.db ".mode csv" ".import --skip 1 data/top-1m.csv umbrella"
# For DomCop Top 10M (optional, large file):
# wget -O data/DomCoptop10milliondomains.csv.zip https://example.com/path/to/DomCoptop10milliondomains.csv
# unzip -o data/DomCoptop10milliondomains.csv.zip -d data/
# sqlite3 data/whitelist.db ".mode csv" ".import --skip 1 data/DomCoptop10milliondomains.csv domcop"

from phishing_detector import PhishingDetector
# Initialize detector (will use offline model)
detector = PhishingDetector(use_offline=True)
# Check a URL
result = detector.check_url("https://example.com")
print(f"Is phishing: {result['is_phishing']}")
print(f"Confidence: {result['confidence']:.2%}")

- AI Model: BERT Finetuned for Phishing Detection
- Whitelist Sources:
  - Cisco Umbrella Top 1M
  - Tranco Top 1M
  - Majestic Million
  - DomCop Top 10M (optional)
 
 
- The model (1.34GB) and whitelist databases should be kept in the models and data directories respectively
- Add these directories to your .gitignore to avoid committing large files
- For production use, consider using a more robust database like PostgreSQL
 
- Whitelist Manager
  - SQLite database
  - Optimized for fast lookups
  - Multiple trusted domain sources
  - Automatic updates
 
- AI Model
  - BERT-based deep learning architecture
  - Feature extraction
  - Real-time prediction
  - Confidence scoring
 
- Web Interface
  - Modern, responsive design
  - Real-time URL checking
  - Detailed analysis view
  - Batch processing
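
As an illustration of the confidence scoring listed under the AI Model component, the sketch below turns raw two-class logits into an `is_phishing` flag and a `confidence` value with a softmax. It assumes the classifier emits one logit per class and that index 1 is the phishing class. Both are assumptions, since the model's label order and the project's scoring code are not shown here. `softmax` and `score_from_logits` are hypothetical helper names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_from_logits(logits, phishing_index=1):
    """Map class logits to an is_phishing flag and a confidence value.

    Assumption: phishing_index identifies the phishing class.
    """
    probs = softmax(logits)
    predicted = max(range(len(probs)), key=probs.__getitem__)
    return {
        "is_phishing": predicted == phishing_index,
        "confidence": probs[predicted],
    }


result = score_from_logits([-2.0, 3.0])
print(f"Is phishing: {result['is_phishing']}")
print(f"Confidence: {result['confidence']:.2%}")
```

The confidence here is the probability of the predicted class, so it never drops below 50% for a two-class model.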
 
 
umbrella: Trusted domains from Cisco Umbrella

- Optimized indexes for fast lookups
- Views for common queries
- Automatic timestamp updates
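
The schema features listed above could look something like the following sketch, written against Python's built-in `sqlite3` module with an in-memory database. The contents of the project's `schema.sql` are not shown in this README, so every column name, the `idx_umbrella_rank` index, the `top_umbrella` view, the `umbrella_touch` trigger, and the `is_whitelisted` helper are assumptions for illustration only. Only the `umbrella` table name comes from the text.

```python
import sqlite3

# Hypothetical schema; the real schema.sql may differ.
SCHEMA = """
CREATE TABLE umbrella (
    domain          TEXT PRIMARY KEY,  -- primary key gives an index for lookups
    popularity_rank INTEGER,
    updated_at      TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_umbrella_rank ON umbrella(popularity_rank);

-- A view for a common query: the highest-ranked domains.
CREATE VIEW top_umbrella AS
    SELECT domain, popularity_rank FROM umbrella WHERE popularity_rank <= 1000;

-- Refresh the timestamp whenever a domain's rank changes.
CREATE TRIGGER umbrella_touch AFTER UPDATE OF popularity_rank ON umbrella
BEGIN
    UPDATE umbrella SET updated_at = CURRENT_TIMESTAMP
    WHERE domain = NEW.domain;
END;
"""


def is_whitelisted(conn, domain):
    """Return True if the domain is present in the umbrella table."""
    row = conn.execute(
        "SELECT 1 FROM umbrella WHERE domain = ? LIMIT 1", (domain,)
    ).fetchone()
    return row is not None


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO umbrella (domain, popularity_rank) VALUES (?, ?)",
    [("google.com", 1), ("example.com", 2500)],
)
conn.commit()

print(is_whitelisted(conn, "google.com"))
print(is_whitelisted(conn, "not-in-the-list.test"))
```

Using a parameterized query keeps the lookup safe when the domain string comes from an untrusted URL.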
 
- Whitelist lookup: < 1ms
- AI model prediction: ~100ms
- Batch processing: ~50ms per URL
- Database size: ~100MB
 
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
 
MIT License - See LICENSE for details.