Feat/Introduce AI semantic search! #980
Will review later, but this will need infra changes before being merged.
…-k to threshold=4.5
Hey guys! I've implemented semantic search on Berkeleytime using BGE embeddings. It already works pretty well, but I’d like to fine-tune it specifically for Berkeley courses.
🎯 I need your help building a small training dataset. Please actually try some searches on Berkeleytime (using the new semantic search), and whenever you see results that look clearly wrong or surprising, send me an example in this format:
Where:
What I’m especially looking for:
Goal: ~50–100 examples total.
Summary
Finally bringing semantic search to reality!
This PR integrates a FAISS-backed semantic search service using BGE (BAAI General Embedding) models (based on Jacky’s work from last semester), along with backend proxy endpoints and updated catalog UX, enabling AI course search.
System Architecture
    flowchart LR
      %% ---------- Frontend ----------
      subgraph Frontend
        FE_UI["Search Bar + AI Search Toggle"]
      end

      %% ---------- Node Backend ----------
      subgraph NodeBackend
        ProxyRouter["/api/semantic-search/* (proxy router)"]
        CoursesAPI["/api/semantic-search/courses (lightweight endpoint)"]
        GraphQLResolvers["GraphQL resolvers + hasCatalogData"]
      end

      %% ---------- Python Semantic Service ----------
      subgraph SemanticService["Semantic Search Service (FastAPI)"]
        Health["/health"]
        Refresh["/refresh (rebuild FAISS index)"]
        Search["/search (threshold-based semantic query)"]
        BGE["BGE Embedding Model"]
        FAISS["FAISS Index (cosine similarity)"]
      end

      %% ---------- Catalog Data Puller ----------
      subgraph CatalogData
        DataPuller["GraphQL Catalog Datapuller"]
      end

      %% ---------- Data Flow ----------
      FE_UI -->|Search Query| CoursesAPI
      CoursesAPI -->|Forward to Python| Search
      Search -->|Generate Query Embedding| BGE
      Search -->|Vector Similarity Search| FAISS
      FAISS -->|Threshold-filtered Results| Search
      Search --> CoursesAPI --> FE_UI

      %% Index refresh / data ingestion
      DataPuller --> GraphQLResolvers --> |TODO:|Refresh
      Refresh -->|Fetch Catalog via GraphQL| GraphQLResolvers
      Refresh -->|Generate Embeddings| BGE --> FAISS

Examples
Input: “Memory models in concurrent programming”
→ Should return courses like databases, operating systems, etc.
→ Should not return biology or psychology courses just because of the word “memory.”
Input: “how to shot a hot vlog”

Implementation Details
Python Semantic Search Service (FastAPI)
FastAPI microservice (apps/semantic-search).
Key endpoints:
- /health: readiness probe showing index status
- /refresh: rebuild FAISS index for a given year/semester
- /search: semantic query with threshold filtering
Model Architecture:
- "Represent this sentence for searching relevant passages: {query}"
- SUBJECT: {subj} NUMBER: {num}\nTITLE: {title}\nDESCRIPTION: {desc}
Example: manually refreshing an index

    curl -X POST http://localhost:8000/refresh \
      -H 'Content-Type: application/json' \
      -d '{"year": 2026, "semester": "Spring"}' | jq

Example: running a semantic search
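What a search does end to end can be sketched offline, without the service running. The two templates are the ones quoted in this PR; everything else is an assumption made for illustration: `toy_embed` is a keyword-count stand-in for the BGE model, plain-Python cosine similarity stands in for the FAISS index, and the demo courses and the 0.7 threshold are made up. The result shape mirrors the `{subject, courseNumber, score}` objects described for the proxy endpoint.

```python
import math
import re
from typing import Callable, Dict, List, Sequence

# Templates quoted from the PR description.
QUERY_TEMPLATE = "Represent this sentence for searching relevant passages: {query}"
COURSE_TEMPLATE = "SUBJECT: {subj} NUMBER: {num}\nTITLE: {title}\nDESCRIPTION: {desc}"

# Hypothetical vocabulary for the toy embedding below (not part of the PR).
VOCAB = ("concurrent", "memory", "threads", "neurons")


def toy_embed(text: str) -> List[float]:
    """Stand-in for the BGE model: counts a few fixed keywords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [float(words.count(w)) for w in VOCAB]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(
    query: str,
    courses: List[Dict[str, str]],
    embed: Callable[[str], Sequence[float]],
    threshold: float,
) -> List[Dict[str, object]]:
    """Return every course whose similarity to the query meets the threshold."""
    query_vec = embed(QUERY_TEMPLATE.format(query=query))
    hits = []
    for course in courses:
        score = cosine(query_vec, embed(COURSE_TEMPLATE.format(**course)))
        if score >= threshold:
            hits.append(
                {"subject": course["subj"], "courseNumber": course["num"], "score": score}
            )
    return sorted(hits, key=lambda h: h["score"], reverse=True)


# Made-up demo courses, not real catalog entries.
COURSES = [
    {"subj": "CS-DEMO", "num": "1", "title": "Operating Systems",
     "desc": "Threads, concurrent programming, and memory management."},
    {"subj": "BIO-DEMO", "num": "2", "title": "Neuroscience of Memory",
     "desc": "How neurons store memory."},
]

hits = search("Memory models in concurrent programming", COURSES, toy_embed, threshold=0.7)
for hit in hits:
    print(hit["subject"], hit["courseNumber"], round(hit["score"], 3))
```

Because `embed` is passed in, the same function works unchanged with a real embedding model; only the threshold would need retuning for that model's score range.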
Backend Integration (Node / Express)
Added SEMANTIC_SEARCH_URL environment variable pointing to the Python service
Implemented lightweight proxy endpoint /api/semantic-search/courses:
- {subject, courseNumber, score} for efficient frontend filtering
Updated GraphQL behavior:
- hasCatalogData field for term filtering
- terms(withCatalogData: true)
Frontend (Catalog UI)
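The kind of filtering that the `{subject, courseNumber, score}` results make possible can be sketched as follows. This is only an illustration of the logic, written in Python for brevity; the actual frontend code is not shown here, and `filter_catalog` plus the shape of the `catalog` entries are hypothetical.

```python
from typing import Dict, List


def filter_catalog(catalog: List[Dict], results: List[Dict]) -> List[Dict]:
    """Keep only catalog courses present in the search results, best score first."""
    # Index scores by (subject, courseNumber) so each lookup is O(1).
    scores = {(r["subject"], r["courseNumber"]): r["score"] for r in results}
    matched = [c for c in catalog if (c["subject"], c["courseNumber"]) in scores]
    return sorted(
        matched,
        key=lambda c: scores[(c["subject"], c["courseNumber"])],
        reverse=True,
    )


# Hypothetical catalog entries and search results.
catalog = [
    {"subject": "A", "courseNumber": "1", "title": "First"},
    {"subject": "B", "courseNumber": "2", "title": "Second"},
    {"subject": "C", "courseNumber": "3", "title": "Third"},
]
results = [
    {"subject": "B", "courseNumber": "2", "score": 0.9},
    {"subject": "A", "courseNumber": "1", "score": 0.7},
]

filtered = filter_catalog(catalog, results)
print([c["title"] for c in filtered])
```

Returning only the identifying pair and a score keeps the response small, since the full course objects are already loaded in the catalog.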
Technical Decisions
Why BGE over other models?
Why threshold instead of top-k?
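One commonly cited trade-off between the two strategies can be sketched with made-up scores: top-k always returns exactly k items, even when most are weak matches, while a threshold returns as many or as few as clear the bar. The course names, scores, and the 0.6 cutoff below are hypothetical, and this is a generic illustration rather than the rationale given in this PR.

```python
from typing import List, Tuple

Scored = Tuple[str, float]  # (course id, similarity score)


def top_k(scored: List[Scored], k: int) -> List[Scored]:
    """Return the k highest-scoring items, regardless of how low the scores are."""
    return sorted(scored, key=lambda p: p[1], reverse=True)[:k]


def above_threshold(scored: List[Scored], threshold: float) -> List[Scored]:
    """Return every item whose score meets the threshold, best first."""
    kept = [p for p in scored if p[1] >= threshold]
    return sorted(kept, key=lambda p: p[1], reverse=True)


# A narrow query: one strong match and two weak ones.
narrow = [("course-a", 0.82), ("course-b", 0.41), ("course-c", 0.38)]
# A broad query: many courses are genuinely relevant.
broad = [("c1", 0.91), ("c2", 0.88), ("c3", 0.80), ("c4", 0.75), ("c5", 0.70)]

print(top_k(narrow, 3))             # includes the two weak matches
print(above_threshold(narrow, 0.6)) # keeps only the strong match
print(len(top_k(broad, 3)), len(above_threshold(broad, 0.6)))
```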
Model Options Available (hardcoded in main.py)
Next Steps
Datapuller Integration: TOP PRIORITY!
Automatically trigger the /refresh endpoint when new catalog data is pulled
Fine-tuning for Berkeley Courses
Collect user feedback dataset (query + relevant/irrelevant courses) to fine-tune BGE specifically for Berkeley course search
Query Expansion
Handle abbreviations (NLP → Natural Language Processing) and synonyms
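A minimal sketch of the abbreviation half of this idea, assuming a hand-maintained lookup table applied to the query before it is embedded. Only the NLP entry comes from this PR; the other entry, the `expand_query` name, and the choice to replace rather than append the expansion are assumptions.

```python
import re
from typing import Dict

# Only "NLP" is taken from the PR; "ML" is a hypothetical extra entry.
ABBREVIATIONS: Dict[str, str] = {
    "NLP": "Natural Language Processing",
    "ML": "Machine Learning",
}


def expand_query(query: str, abbreviations: Dict[str, str] = ABBREVIATIONS) -> str:
    """Replace whole-word abbreviations (case-insensitive) with their expansions."""
    if not abbreviations:
        return query
    # Longest keys first so a longer abbreviation wins over a shorter prefix.
    keys = sorted(abbreviations, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(lambda m: abbreviations[m.group(1).upper()], query)


print(expand_query("intro to nlp"))
```

Synonym handling is not covered here; appending the expansion while keeping the original token is another reasonable design, since the embedding model may already know some abbreviations.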
Based on: Initial prototype by Jacky (last semester)
Frontend integration: @PineND