Feat/Introduce AI semantic search! #980

vaclisinc · 2025-11-16T22:06:48Z

Summary

Finally bringing semantic search to reality!
This PR integrates a FAISS-backed semantic search service using BGE (BAAI General Embedding) models (based on Jacky’s work from last semester), along with backend proxy endpoints and updated catalog UX, enabling AI course search.

⚠️ IMPORTANT: Remember to add SEMANTIC_SEARCH_URL=http://semantic-search:8000 to your .env!

⚠️ IMPORTANT: Building the index for the first time may take 2-3 minutes. In the future, index building will be triggered automatically when the datapuller runs.
For now, run the command below first to build the index, changing the year and semester as needed!

curl -X POST http://localhost:8000/refresh \
       -H 'Content-Type: application/json' \
       -d '{"year": 2026, "semester": "Spring"}' | jq

System Architecture

flowchart LR

    %% ---------- Frontend ----------
    subgraph Frontend
        FE_UI["Search Bar + AI Search Toggle"]
    end

    %% ---------- Node Backend ----------
    subgraph NodeBackend
        ProxyRouter["/api/semantic-search/*  (proxy router)"]
        CoursesAPI["/api/semantic-search/courses  (lightweight endpoint)"]
        GraphQLResolvers["GraphQL resolvers + hasCatalogData"]
    end

    %% ---------- Python Semantic Service ----------
    subgraph SemanticService["Semantic Search Service (FastAPI)"]
        Health["/health"]
        Refresh["/refresh  (rebuild FAISS index)"]
        Search["/search  (threshold-based semantic query)"]
        BGE["BGE Embedding Model"]
        FAISS["FAISS Index (cosine similarity)"]
    end

    %% ---------- Catalog Data Puller ----------
    subgraph CatalogData
        DataPuller["GraphQL Catalog Datapuller"]
    end

    %% ---------- Data Flow ----------
    FE_UI -->|Search Query| CoursesAPI

    CoursesAPI -->|Forward to Python| Search

    Search -->|Generate Query Embedding| BGE
    Search -->|Vector Similarity Search| FAISS
    FAISS -->|Threshold-filtered Results| Search

    Search --> CoursesAPI --> FE_UI

    %% Index refresh / data ingestion
    DataPuller --> GraphQLResolvers --> |TODO:|Refresh
    Refresh -->|Fetch Catalog via GraphQL| GraphQLResolvers
    Refresh -->|Generate Embeddings| BGE --> FAISS

Examples

Input: “Memory models in concurrent programming”
→ Should return courses like databases, operating systems, etc.
→ Should not return biology or psychology courses just because of the word “memory.”

Input: “how to shot a hot vlog”

Implementation Details

Python Semantic Search Service (FastAPI)

FastAPI microservice (apps/semantic-search) that:
- Uses BGE (BAAI/bge-base-en-v1.5) embedding model optimized for retrieval tasks
- Builds term-specific embeddings + FAISS indices from GraphQL catalog data
- Implements threshold-based filtering (returns all results above similarity threshold, not just top-k)
- Searches top 500 candidates for performance, then filters by threshold (default: 0.45)
Key endpoints:
- /health — readiness probe showing index status
- /refresh — rebuild FAISS index for a given year/semester
- /search — semantic query with threshold filtering
Model Architecture:
- Uses instruction prefix for queries: "Represent this sentence for searching relevant passages: {query}"
- Course text format: SUBJECT: {subj} NUMBER: {num}\nTITLE: {title}\nDESCRIPTION: {desc}
- FAISS IndexFlatIP with L2-normalized embeddings (cosine similarity)
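
The text formats and the cosine-similarity trick listed above can be sketched in plain Python. This is only an illustration, not the code in the service: the function names are made up for this example, and the real service uses the BGE model and FAISS rather than hand-rolled math. It shows why an inner-product index over L2-normalized vectors yields cosine similarity.

```python
import math

# Instruction prefix quoted in the PR description; applied to queries only.
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "


def format_course(subj, num, title, desc):
    """Build the course text in the format described above (hypothetical helper)."""
    return f"SUBJECT: {subj} NUMBER: {num}\nTITLE: {title}\nDESCRIPTION: {desc}"


def format_query(query):
    """Prepend the retrieval instruction to a user query (hypothetical helper)."""
    return QUERY_PREFIX + query


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]


def inner_product(a, b):
    """Plain dot product; on unit vectors this equals cosine similarity."""
    return sum(x * y for x, y in zip(a, b))
```

Two vectors pointing in the same direction score 1.0 after normalization regardless of their original lengths, which is the property the threshold relies on.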

Example: manually refreshing an index

curl -X POST http://localhost:8000/refresh \
     -H 'Content-Type: application/json' \
     -d '{"year": 2026, "semester": "Spring"}' | jq
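
For scripting the same call (for example from a future datapuller hook), a rough Python equivalent of the curl command might look like this. The function names and the 300-second timeout are assumptions for the sketch; only the URL, method, header, and body come from the command above.

```python
import json
import urllib.request


def build_refresh_request(base_url, year, semester):
    """Build (but do not send) the POST /refresh request shown in the curl example."""
    body = json.dumps({"year": year, "semester": semester}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/refresh",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def refresh_index(base_url, year, semester, timeout=300):
    """Send the request and decode the JSON reply.

    The timeout is an assumed value, chosen to be generous because the first
    index build may take a few minutes.
    """
    request = build_refresh_request(base_url, year, semester)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Usage would be `refresh_index("http://localhost:8000", 2026, "Spring")` with the service running locally.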

Example: running a semantic search

# Threshold-based search (returns all courses with similarity > 0.45)
curl "http://localhost:8000/search?query=deep%20reinforcement%20learning&year=2026&semester=Spring&threshold=0.45" | jq

# Response includes similarity scores for ranking
{
  "query": "deep reinforcement learning",
  "threshold": 0.45,
  "count": 12,
  "results": [
    {
      "subject": "COMPSCI",
      "courseNumber": "285",
      "score": 0.713,
      "title": "Deep Reinforcement Learning, Decision Making, and Control"
    },
    ...
  ]
}
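
A small client-side sketch of the same request and response handling, in case it helps when testing. The helper names are hypothetical; the query parameters and response fields are the ones shown in the example above.

```python
from urllib.parse import quote, urlencode


def build_search_url(base_url, query, year, semester, threshold=0.45):
    """Build the GET /search URL with the parameters used in the curl example."""
    params = urlencode(
        {"query": query, "year": year, "semester": semester, "threshold": threshold},
        quote_via=quote,  # encode spaces as %20, matching the curl example
    )
    return f"{base_url.rstrip('/')}/search?{params}"


def summarize_results(payload):
    """Return 'SUBJECT NUMBER (score)' strings, highest similarity first."""
    ranked = sorted(payload["results"], key=lambda r: r["score"], reverse=True)
    return [f'{r["subject"]} {r["courseNumber"]} ({r["score"]:.3f})' for r in ranked]
```

The resulting URL can be fetched with any HTTP client, and the decoded JSON body passed to `summarize_results` for a quick look at the ranking.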

Backend Integration (Node / Express)

Added SEMANTIC_SEARCH_URL environment variable pointing to Python service
Implemented lightweight proxy endpoint /api/semantic-search/courses:
- Forwards requests to Python service
- Returns only {subject, courseNumber, score} for efficient frontend filtering
- Frontend maintains API response order (sorted by semantic similarity)
Updated GraphQL behavior:
- Introduced hasCatalogData field for term filtering
- Updated resolver to use terms(withCatalogData: true)
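
The trimming done by the lightweight endpoint amounts to a projection over the Python service's results. The real proxy is Node/Express; the sketch below uses Python only to stay consistent with the other examples in this thread, and the function name is made up.

```python
def to_lightweight(results):
    """Keep only the fields the frontend needs, preserving similarity order."""
    return [
        {
            "subject": r["subject"],
            "courseNumber": r["courseNumber"],
            "score": r["score"],
        }
        for r in results
    ]
```

Dropping titles and any other fields keeps the payload small, since the frontend already holds the full course objects.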

Frontend (Catalog UI)

AI Search toggle (✨ sparkle button) to activate semantic search mode
Semantic results preserve backend ordering (by similarity score)
Frontend maps semantic results to full course objects for display
Graceful fallback to fuzzy search when semantic search unavailable
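
The mapping and fallback behavior described above could be sketched as follows. The actual frontend is TypeScript; this Python version only illustrates the logic, and `semantic_search` and `fuzzy_search` are stand-in callables, not real functions from the codebase.

```python
def order_by_semantic(semantic_results, catalog):
    """Map lightweight results to full course objects, keeping backend order.

    Results with no matching course in the local catalog are skipped.
    """
    by_key = {(c["subject"], c["courseNumber"]): c for c in catalog}
    ordered = []
    for result in semantic_results:
        course = by_key.get((result["subject"], result["courseNumber"]))
        if course is not None:
            ordered.append(course)
    return ordered


def search_courses(query, catalog, semantic_search, fuzzy_search):
    """Try semantic search first; fall back to fuzzy search if it fails."""
    try:
        results = semantic_search(query)
    except Exception:
        # Semantic service unavailable: degrade to the existing fuzzy search.
        return fuzzy_search(query, catalog)
    return order_by_semantic(results, catalog)
```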

Technical Decisions

Why BGE over other models?

BGE (BAAI General Embedding) is specifically optimized for retrieval tasks
Better semantic understanding on retrieval-style queries than general-purpose models (all-MiniLM-L6-v2, all-mpnet-base-v2)
Supports instruction prefixes for improved query understanding
109M parameters - good balance of accuracy and speed

Why threshold instead of top-k?

Threshold-based filtering returns all relevant results, not arbitrary top-k
More flexible - can return 5 results for specific queries, 50 for broad queries
Similarity score threshold (0.45) ensures quality over quantity
Searches top 500 candidates for performance, then applies threshold
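
To make the difference concrete, here is a minimal sketch of threshold filtering over a capped candidate set. The function name and the input shape (a list of `(course_id, score)` pairs) are assumptions, and the strict `>` comparison follows the "similarity > 0.45" wording in the search example; the service may differ in detail.

```python
def threshold_search(scored, threshold=0.45, max_candidates=500):
    """Keep every candidate above the threshold, looking only at the top N.

    `scored` is a list of (course_id, similarity) pairs. Capping the candidate
    set bounds the work per query; the threshold then decides how many of
    those candidates are actually returned.
    """
    candidates = sorted(scored, key=lambda pair: pair[1], reverse=True)
    candidates = candidates[:max_candidates]
    return [(cid, score) for cid, score in candidates if score > threshold]
```

A specific query with few strong matches returns a short list, while a broad query can return many more, up to the candidate cap.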

Model Options Available (hardcoded in `main.py`)

# Current: BAAI/bge-base-en-v1.5 (best for retrieval)
# Alternatives:
#   BAAI/bge-small-en-v1.5       (faster, 33M params)
#   BAAI/bge-large-en-v1.5       (most accurate, 335M params)
#   all-mpnet-base-v2            (general purpose, 110M params)
#   all-MiniLM-L6-v2             (fastest, 22M params)

Next Steps

Datapuller Integration: TOP PRIORITY!
Automatically trigger /refresh endpoint when new catalog data is pulled
Fine-tuning for Berkeley Courses
Collect user feedback dataset (query + relevant/irrelevant courses) to fine-tune BGE specifically for Berkeley course search

Query Expansion
Handle abbreviations (NLP → Natural Language Processing) and synonyms
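
One possible shape for the abbreviation part of query expansion, purely as a sketch since nothing like this exists in the PR yet. The table contents beyond NLP and the function name are hypothetical.

```python
import re

# Hypothetical lookup table; only the NLP entry comes from the PR description.
ABBREVIATIONS = {
    "nlp": "natural language processing",
    "ml": "machine learning",
}


def expand_query(query, table=ABBREVIATIONS):
    """Append the long form after each known abbreviation, keeping the original."""

    def replace(match):
        word = match.group(0)
        expansion = table.get(word.lower())
        return f"{word} ({expansion})" if expansion else word

    return re.sub(r"[A-Za-z]+", replace, query)
```

Keeping the original token alongside its expansion means the embedding still sees the abbreviation, which matters when course descriptions use it verbatim.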

Based on: Initial prototype by Jacky (last semester)
Frontend integration: @PineND

maxmwang · 2025-11-17T04:31:33Z

Will review later, but this will need infra changes before being merged.

vaclisinc · 2025-11-18T02:31:22Z

Hey guys! I've implemented semantic search on Berkeleytime using BGE embeddings. It already works pretty well, but I’d like to fine-tune it specifically for Berkeley courses. 🎯

I need your help building a small training dataset.

Please actually try some searches on Berkeleytime (using the new semantic search), and whenever you see results that look clearly wrong or surprising, send me an example in this format:

{
  "query": "planning about my career",
  "good_results": ["MBA 209P", "MCELLBI 295"],
  "bad_results": ["LDARCH 205", "CYPLAN 116", "CYPLAN 208"],
  "missing_courses": ["IAFIRCAM 198BC", "ARCH 198BC", "MUSIC 198BC", "COMLIT 198BC"] // optional
}

Where:

query = what you searched for (it can be a full sentence, not just keywords)
good_results = courses from the search results that ARE relevant (should rank high)
bad_results = courses from the search results that feel clearly NOT related
missing_courses (optional) = courses you strongly expected to see but that did NOT show up at all
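
If it helps when collecting submissions, a quick sanity check for examples in this format could look like the sketch below. The function name and error messages are made up. Note that the `// optional` comment in the sample above is not valid JSON and needs to be removed before parsing.

```python
def validate_example(example):
    """Return a list of problems with a feedback example; empty means it looks fine."""
    problems = []

    query = example.get("query")
    if not isinstance(query, str) or not query.strip():
        problems.append("query must be a non-empty string")

    for field in ("good_results", "bad_results", "missing_courses"):
        if field not in example:
            if field != "missing_courses":  # missing_courses is optional
                problems.append(f"{field} is required")
            continue
        value = example[field]
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            problems.append(f"{field} must be a list of course codes")

    # Only check for contradictions once the basic shape is valid.
    if not problems:
        overlap = set(example["good_results"]) & set(example["bad_results"])
        if overlap:
            problems.append(f"courses listed as both good and bad: {sorted(overlap)}")

    return problems
```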

What I’m especially looking for:

Natural language queries like:
- “planning about my career”
- “I want to get into AI research from a non-CS background”
- “I like math but hate proofs, what should I take?”
Any query where the results feel clearly off, noisy, or surprising

Goal: ~50–100 examples total.
Even 3–5 examples from you would be super helpful. 🙏

vaclisinc requested review from ARtheboss, PineND and maxmwang November 16, 2025 22:06

github-code-quality bot found potential problems Nov 16, 2025

vaclisinc self-assigned this Nov 16, 2025

vaclisinc marked this pull request as draft November 16, 2025 22:17

vaclisinc marked this pull request as ready for review November 16, 2025 22:21

vaclisinc force-pushed the feat/semantic-search-vaclis branch from 4fa576b to a4db7f3 Compare November 16, 2025 23:12

vaclisinc had a problem deploying to development November 16, 2025 23:15 — with GitHub Actions Failure

vaclisinc added 7 commits November 17, 2025 13:07

feat: add semantic search service container

ea1e69e

feat: proxy semantic search through backend

28ced4f

feat: add natural language catalog search UI

86aac3e

fix: format

691d0ff

fix: remove hardcode AI_KEYWORDS and switching from top-k to threshold based result

4098304

fix: use the score from transformer model to sort and change from top-k to threshold=4.5

171a34d

fix: minor format

031bb9e

vaclisinc force-pushed the feat/semantic-search-vaclis branch from a4db7f3 to 031bb9e Compare November 18, 2025 00:39

fix: changing the model in python instead of modifing the infra

0fe511c

vaclisinc changed the title ~~Feat/Introduce semantic search!~~ Feat/Introduce AI semantic search! Nov 18, 2025

ARtheboss and others added 3 commits November 17, 2025 23:42

semantic search tests

6ce8596

fix: modify to a more fancy search bar name - AI-Native Course Search!

ecd4fd7

Merge remote-tracking branch 'origin/main' into feat/semantic-search-vaclis

3e0ecce

PineND added the Catalog pod label Nov 24, 2025


5 participants
