
@vaclisinc
Contributor

@vaclisinc vaclisinc commented Nov 16, 2025

Summary

Finally bringing semantic search to reality!
This PR integrates a FAISS-backed semantic search service using BGE (BAAI General Embedding) models (based on Jacky’s work from last semester), along with backend proxy endpoints and updated catalog UX, enabling AI course search.

⚠️ IMPORTANT: Remember to add SEMANTIC_SEARCH_URL=http://semantic-search:8000 to your .env!

⚠️ IMPORTANT: First-time index building may take 2-3 minutes. In the future, index building will be triggered automatically when the datapuller runs.
Until then, build the index first by running the command below, adjusting the year and semester as needed:

curl -X POST http://localhost:8000/refresh \
       -H 'Content-Type: application/json' \
       -d '{"year": 2026, "semester": "Spring"}' | jq

System Architecture

flowchart LR

    %% ---------- Frontend ----------
    subgraph Frontend
        FE_UI["Search Bar + AI Search Toggle"]
    end

    %% ---------- Node Backend ----------
    subgraph NodeBackend
        ProxyRouter["/api/semantic-search/*  (proxy router)"]
        CoursesAPI["/api/semantic-search/courses  (lightweight endpoint)"]
        GraphQLResolvers["GraphQL resolvers + hasCatalogData"]
    end

    %% ---------- Python Semantic Service ----------
    subgraph SemanticService["Semantic Search Service (FastAPI)"]
        Health["/health"]
        Refresh["/refresh  (rebuild FAISS index)"]
        Search["/search  (threshold-based semantic query)"]
        BGE["BGE Embedding Model"]
        FAISS["FAISS Index (cosine similarity)"]
    end

    %% ---------- Catalog Data Puller ----------
    subgraph CatalogData
        DataPuller["GraphQL Catalog Datapuller"]
    end

    %% ---------- Data Flow ----------
    FE_UI -->|Search Query| CoursesAPI

    CoursesAPI -->|Forward to Python| Search

    Search -->|Generate Query Embedding| BGE
    Search -->|Vector Similarity Search| FAISS
    FAISS -->|Threshold-filtered Results| Search

    Search --> CoursesAPI --> FE_UI

    %% Index refresh / data ingestion
    DataPuller --> GraphQLResolvers --> |TODO:|Refresh
    Refresh -->|Fetch Catalog via GraphQL| GraphQLResolvers
    Refresh -->|Generate Embeddings| BGE --> FAISS

Examples

Input: “Memory models in concurrent programming”
→ Should return courses on topics such as databases and operating systems.
→ Should not return biology or psychology courses just because of the word “memory.”

image

Input: “how to shot a hot vlog”
image


Implementation Details

Python Semantic Search Service (FastAPI)

  • FastAPI microservice (apps/semantic-search) that:

    • Uses BGE (BAAI/bge-base-en-v1.5) embedding model optimized for retrieval tasks
    • Builds term-specific embeddings + FAISS indices from GraphQL catalog data
    • Implements threshold-based filtering (returns all results above similarity threshold, not just top-k)
    • Searches top 500 candidates for performance, then filters by threshold (default: 0.45)
  • Key endpoints:

    • /health — readiness probe showing index status
    • /refresh — rebuild FAISS index for a given year/semester
    • /search — semantic query with threshold filtering
  • Model Architecture:

    • Uses instruction prefix for queries: "Represent this sentence for searching relevant passages: {query}"
    • Course text format: SUBJECT: {subj} NUMBER: {num}\nTITLE: {title}\nDESCRIPTION: {desc}
    • FAISS IndexFlatIP with L2-normalized embeddings (cosine similarity)
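To make the bullets above concrete, here is a minimal sketch in plain Python, so it runs without the model or FAISS installed. The function names (`format_course`, `format_query`, `l2_normalize`, `inner_product`) are illustrative and not taken from `main.py`, and the toy vectors stand in for real BGE embeddings. It shows why an inner-product index over L2-normalized vectors yields cosine similarity.

```python
import math

# Instruction prefix quoted in the list above.
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "


def format_course(subj: str, num: str, title: str, desc: str) -> str:
    """Build the course text in the format listed above."""
    return f"SUBJECT: {subj} NUMBER: {num}\nTITLE: {title}\nDESCRIPTION: {desc}"


def format_query(query: str) -> str:
    """Prepend the retrieval instruction to a user query."""
    return QUERY_PREFIX + query


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def inner_product(a, b):
    """Dot product; on unit vectors this equals cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


# Toy 2-d vectors (made up), standing in for real embeddings.
course_vec = l2_normalize([3.0, 4.0])
query_vec = l2_normalize([4.0, 3.0])
score = inner_product(course_vec, query_vec)  # cosine of the angle between them
```

In the service itself, the vectors would come from the embedding model and the dot products from the FAISS index; only the normalization step is what turns the inner product into a cosine score.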

Example: manually refreshing an index

curl -X POST http://localhost:8000/refresh \
     -H 'Content-Type: application/json' \
     -d '{"year": 2026, "semester": "Spring"}' | jq

Example: running a semantic search

# Threshold-based search (returns all courses with similarity > 0.45)
curl "http://localhost:8000/search?query=deep%20reinforcement%20learning&year=2026&semester=Spring&threshold=0.45" | jq

# Response includes similarity scores for ranking
{
  "query": "deep reinforcement learning",
  "threshold": 0.45,
  "count": 12,
  "results": [
    {
      "subject": "COMPSCI",
      "courseNumber": "285",
      "score": 0.713,
      "title": "Deep Reinforcement Learning, Decision Making, and Control"
    },
    ...
  ]
}
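
For scripted checks, the same request can be issued from Python. This is a small sketch using only the standard library; `build_search_url` and `semantic_search` are hypothetical helper names, and the base URL assumes the service is running locally on port 8000 as in the curl example.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def build_search_url(base_url, query, year, semester, threshold=0.45):
    """Build the /search URL with the same query parameters as the curl example."""
    params = {
        "query": query,
        "year": year,
        "semester": semester,
        "threshold": threshold,
    }
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return f"{base_url}/search?{urlencode(params, quote_via=quote)}"


def semantic_search(base_url, query, year, semester, threshold=0.45, timeout=10.0):
    """Call the service and return the decoded JSON response (requires a running service)."""
    url = build_search_url(base_url, query, year, semester, threshold)
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


url = build_search_url(
    "http://localhost:8000", "deep reinforcement learning", 2026, "Spring"
)
```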

Backend Integration (Node / Express)

  • Added SEMANTIC_SEARCH_URL environment variable pointing to Python service

  • Implemented lightweight proxy endpoint /api/semantic-search/courses:

    • Forwards requests to Python service
    • Returns only {subject, courseNumber, score} for efficient frontend filtering
    • Frontend maintains API response order (sorted by semantic similarity)
  • Updated GraphQL behavior:

    • Introduced hasCatalogData field for term filtering
    • Updated resolver to use terms(withCatalogData: true)
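
The trimming done by the lightweight endpoint can be illustrated as follows. The real backend is Node/Express; this sketch is in Python only to stay consistent with the service examples, and `to_slim_results` is a hypothetical name. The payload shape is taken from the example `/search` response above.

```python
def to_slim_results(payload: dict) -> list:
    """Keep only subject, courseNumber, and score, preserving the received order."""
    return [
        {
            "subject": r["subject"],
            "courseNumber": r["courseNumber"],
            "score": r["score"],
        }
        for r in payload.get("results", [])
    ]


# Payload shaped like the example /search response.
example_payload = {
    "query": "deep reinforcement learning",
    "threshold": 0.45,
    "count": 1,
    "results": [
        {
            "subject": "COMPSCI",
            "courseNumber": "285",
            "score": 0.713,
            "title": "Deep Reinforcement Learning, Decision Making, and Control",
        }
    ],
}
slim = to_slim_results(example_payload)
```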

Frontend (Catalog UI)

  • AI Search toggle (✨ sparkle button) to activate semantic search mode
  • Semantic results preserve backend ordering (by similarity score)
  • Frontend maps semantic results to full course objects for display
  • Graceful fallback to fuzzy search when semantic search unavailable
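
The ordering and fallback behavior described above can be sketched like this. The actual frontend is not written in Python; the sketch only illustrates the logic, and every name in it (`order_by_semantic_results`, `search_catalog`, `semantic_fn`, `fuzzy_fn`) is hypothetical.

```python
def order_by_semantic_results(courses, slim_results):
    """Map slim results to full course objects, keeping the backend's ranking."""
    by_key = {(c["subject"], c["courseNumber"]): c for c in courses}
    ordered = []
    for r in slim_results:
        course = by_key.get((r["subject"], r["courseNumber"]))
        if course is not None:  # skip results with no matching catalog entry
            ordered.append(course)
    return ordered


def search_catalog(query, courses, semantic_fn, fuzzy_fn):
    """Try semantic search first; fall back to fuzzy search if it is unavailable."""
    try:
        slim_results = semantic_fn(query)
    except Exception:
        return fuzzy_fn(query, courses)
    return order_by_semantic_results(courses, slim_results)


# Toy catalog and stand-in search functions (made up for illustration).
catalog = [
    {"subject": "COMPSCI", "courseNumber": "61A", "title": "A"},
    {"subject": "COMPSCI", "courseNumber": "285", "title": "B"},
]


def fake_semantic(query):
    return [
        {"subject": "COMPSCI", "courseNumber": "285", "score": 0.7},
        {"subject": "COMPSCI", "courseNumber": "61A", "score": 0.5},
    ]


def failing_semantic(query):
    raise ConnectionError("semantic search unavailable")


def fake_fuzzy(query, courses):
    return courses[:1]


ranked = search_catalog("rl", catalog, fake_semantic, fake_fuzzy)
fallback = search_catalog("rl", catalog, failing_semantic, fake_fuzzy)
```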

Technical Decisions

Why BGE over other models?

  • BGE (BAAI General Embedding) is specifically optimized for retrieval tasks
  • Better semantic understanding than general-purpose models (all-MiniLM-L6-v2, all-mpnet-base-v2)
  • Supports instruction prefixes for improved query understanding
  • 109M parameters - good balance of accuracy and speed

Why threshold instead of top-k?

  • Threshold-based filtering returns all relevant results, not arbitrary top-k
  • More flexible - can return 5 results for specific queries, 50 for broad queries
  • Similarity score threshold (0.45) ensures quality over quantity
  • Searches top 500 candidates for performance, then applies threshold
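
A small sketch of the difference, using made-up similarity scores. The helper names are hypothetical, and the strict `>` comparison is an assumption based on the "similarity > 0.45" comment in the search example; the service may handle the boundary differently.

```python
def top_k(scored, k):
    """Return the k highest-scoring (course, score) pairs."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def threshold_filter(scored, threshold=0.45, max_candidates=500):
    """Take the top candidates first, then keep only those above the threshold."""
    candidates = top_k(scored, max_candidates)
    return [(course, s) for course, s in candidates if s > threshold]


# Made-up scores: a specific query has few strong matches, a broad one has many.
specific = [("A", 0.71), ("B", 0.52), ("C", 0.30), ("D", 0.20)]
broad = [("A", 0.60), ("B", 0.55), ("C", 0.50), ("D", 0.48), ("E", 0.46), ("F", 0.30)]

specific_hits = threshold_filter(specific)  # 2 results
broad_hits = threshold_filter(broad)        # 5 results
fixed_top_3 = top_k(specific, 3)            # always 3, including the weak 0.30 match
```

With a fixed k of 3, the specific query would surface a weak match and the broad query would drop relevant courses; the threshold lets the result count follow the query.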

Model Options Available (hardcoded in main.py)

# Current: BAAI/bge-base-en-v1.5 (best for retrieval)
# Alternatives:
#   BAAI/bge-small-en-v1.5       (faster, 33M params)
#   BAAI/bge-large-en-v1.5       (most accurate, 335M params)
#   all-mpnet-base-v2            (general purpose, 110M params)
#   all-MiniLM-L6-v2             (fastest, 22M params)

Next Steps

  1. Datapuller Integration: TOP PRIORITY!
    Automatically trigger /refresh endpoint when new catalog data is pulled

  2. Fine-tuning for Berkeley Courses
    Collect user feedback dataset (query + relevant/irrelevant courses) to fine-tune BGE specifically for Berkeley course search

  3. Query Expansion
    Handle abbreviations (NLP → Natural Language Processing) and synonyms
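
One possible shape for abbreviation handling, as a rough sketch only. The lookup table and the `expand_query` helper are hypothetical; nothing like this exists in the PR yet, and a real table would need to be curated for the catalog.

```python
import re

# Hypothetical abbreviation table; entries are illustrative only.
ABBREVIATIONS = {
    "nlp": "natural language processing",
    "ml": "machine learning",
}


def expand_query(query, table=ABBREVIATIONS):
    """Append the long form after each known abbreviation, keeping the original word."""

    def replace(match):
        word = match.group(0)
        full = table.get(word.lower())
        return f"{word} ({full})" if full else word

    return re.sub(r"[A-Za-z]+", replace, query)


expanded = expand_query("intro to NLP")
```

Keeping the original token alongside the expansion means the embedding still sees the abbreviation in case course descriptions use it verbatim.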


Based on: Initial prototype by Jacky (last semester)
Frontend integration: @PineND

@vaclisinc vaclisinc self-assigned this Nov 16, 2025
@vaclisinc vaclisinc marked this pull request as draft November 16, 2025 22:17
@vaclisinc vaclisinc marked this pull request as ready for review November 16, 2025 22:21
@vaclisinc vaclisinc force-pushed the feat/semantic-search-vaclis branch from 4fa576b to a4db7f3 Compare November 16, 2025 23:12
@maxmwang
Contributor

Will review later, but this will need infra changes before being merged.

@vaclisinc vaclisinc force-pushed the feat/semantic-search-vaclis branch from a4db7f3 to 031bb9e Compare November 18, 2025 00:39
@vaclisinc
Contributor Author

vaclisinc commented Nov 18, 2025

Hey guys! I've implemented semantic search on Berkeleytime using BGE embeddings. It already works pretty well, but I’d like to fine-tune it specifically for Berkeley courses. 🎯

I need your help building a small training dataset.

Please actually try some searches on Berkeleytime (using the new semantic search), and whenever you see results that look clearly wrong or surprising, send me an example in this format:

{
  "query": "planning about my career",
  "good_results": ["MBA 209P", "MCELLBI 295"],
  "bad_results": ["LDARCH 205", "CYPLAN 116", "CYPLAN 208"],
  "missing_courses": ["IAFIRCAM 198BC", "ARCH 198BC", "MUSIC 198BC", "COMLIT 198BC"] // optional
}

Where:

  • query = what you searched for (it can be a full sentence, not just keywords)
  • good_results = courses from the search results that ARE relevant (should rank high)
  • bad_results = courses from the search results that feel clearly NOT related
  • missing_courses (optional) = courses you strongly expected to see but that did NOT show up at all
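
For context on how such examples could be used, here is a minimal sketch that turns one example into (query, positive, negative) triplets, a common input shape for contrastive fine-tuning of embedding models. The `to_triplets` helper is hypothetical, and treating `missing_courses` as positives is an assumption, not a settled design.

```python
from itertools import product


def to_triplets(example: dict) -> list:
    """Pair every relevant course with every irrelevant one for the same query."""
    query = example["query"]
    # Assumption: courses the user expected but did not see count as positives.
    positives = example.get("good_results", []) + example.get("missing_courses", [])
    negatives = example.get("bad_results", [])
    return [(query, pos, neg) for pos, neg in product(positives, negatives)]


example = {
    "query": "planning about my career",
    "good_results": ["MBA 209P", "MCELLBI 295"],
    "bad_results": ["LDARCH 205", "CYPLAN 116", "CYPLAN 208"],
    "missing_courses": ["IAFIRCAM 198BC", "ARCH 198BC", "MUSIC 198BC", "COMLIT 198BC"],
}
triplets = to_triplets(example)
```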

What I’m especially looking for:

  • Natural language queries like:
    • “planning about my career”
    • “I want to get into AI research from a non-CS background”
    • “I like math but hate proofs, what should I take?”
  • Any query where the results feel clearly off, noisy, or surprising

Goal: ~50–100 examples total.
Even 3–5 examples from you would be super helpful. 🙏

@vaclisinc vaclisinc changed the title Feat/Introduce semantic search! Feat/Introduce AI semantic search! Nov 18, 2025

5 participants