Persian Culinary RAG: Multimodal Retrieval and Generation for Text–Image Food Queries

Mohammad Hossein Eslami, Arshia Izadyari, Sadegh Mohammadian, Fateme Asgari, Ali RahimiAkbar, Mohammad Mahdi Vahedi

You can test an interactive demo of our model on Hugging Face Spaces at the following link:

https://huggingface.co/spaces/parsi-ai-nlpclass/Persian-Food-RAG
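If you prefer to query the demo programmatically, something like the sketch below should work with the `gradio_client` package. The endpoint name and argument order are assumptions; check the Space's "Use via API" page for the actual signature.

```python
# A minimal sketch of calling the Space with gradio_client.
# The api_name and input format are assumptions, not verified against the Space.
from gradio_client import Client

client = Client("parsi-ai-nlpclass/Persian-Food-RAG")
answer = client.predict(
    "What are the main ingredients of ghormeh sabzi?",  # hypothetical text query
    api_name="/predict",                                # assumed endpoint name
)
print(answer)
```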

Abstract

We present a Persian, food-domain retrieval-augmented generation (RAG) system that combines a dual-modality retriever with a lightweight generator. Building on our prior corpus and an added Kaggle recipe collection (1,737 entries; 1,393 unique dishes), we expand the index with web-sourced photos of dishes *and* systematically collected images of key ingredients to strengthen image-grounded queries. The retriever pairs a Persian text encoder (Glot-500) with a fine-tuned CLIP vision–text encoder (vision-fa-clip/SajjadAyoubi) trained with a multi-positive contrastive objective to handle multiple instructions per dish. Cross-modal embeddings enable answering both text-only and image+text questions by retrieving pertinent evidence and conditioning the generator. On held-out multiple-choice sets, the RAG setup improves performance for ingredient-triggered and image-grounded queries with a lighter generator, while gains are mixed for a stronger generator.

Methods Summary

Our system is built on a Retrieval-Augmented Generation (RAG) architecture tailored to the Persian culinary domain. At its core is a shared representation-learning module in which images and text passages are mapped into a unified vector space by a CLIP-style vision tower and a fine-tuned Glot-500 text encoder, respectively. For any given query, whether text- or image-based, we use these embeddings to perform an efficient similarity search against a FAISS index of our entire culinary text corpus. The top-ranked documents are pooled into an evidence block, which is injected directly into the prompt so the generative model produces a contextually grounded answer (sketched in the first snippet below).

A key innovation in our methodology is an ingredient-aware training strategy: by treating images of a dish's main ingredients as additional positive examples during fine-tuning, we encourage the model to develop a more holistic concept of each dish, enhancing the retriever's robustness, especially for ingredient-based queries (see the second snippet below).
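As a rough illustration of the retrieve-then-generate loop, the sketch below embeds the corpus with a Glot-500 checkpoint, searches a FAISS inner-product index over normalized embeddings, and assembles the evidence block into a prompt. The checkpoint name (`cis-lmu/glot500-base`), the pooling scheme, and the corpus snippets are assumptions, not the project's exact code.

```python
# A minimal retrieve-then-generate sketch: Glot-500 text embeddings + FAISS.
import faiss
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cis-lmu/glot500-base")  # assumed checkpoint
encoder = AutoModel.from_pretrained("cis-lmu/glot500-base")

def embed(texts):
    """Mean-pool the last hidden state and L2-normalize for cosine search."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state      # (B, T, D)
    mask = batch["attention_mask"].unsqueeze(-1)         # (B, T, 1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)        # masked mean over tokens
    return torch.nn.functional.normalize(pooled, dim=-1).numpy()

corpus = [
    "Ghormeh sabzi is a herb stew with kidney beans and dried lime ...",
    "Tahchin is a baked saffron rice cake layered with chicken ...",
]
index = faiss.IndexFlatIP(encoder.config.hidden_size)    # cosine via inner product
index.add(embed(corpus))

def retrieve(query, k=2):
    _, ids = index.search(embed([query]), k)
    return [corpus[i] for i in ids[0]]

question = "What goes into ghormeh sabzi?"
evidence = "\n".join(retrieve(question))
prompt = f"Context:\n{evidence}\n\nQuestion: {question}\nAnswer:"  # fed to the generator
```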

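The ingredient-aware objective can be read as a multi-positive variant of the usual CLIP InfoNCE loss: each dish text has several positive images (the plated dish plus its ingredients) rather than exactly one. A minimal sketch, with illustrative shapes and temperature:

```python
# Multi-positive contrastive loss: for each text, every image marked in
# positive_mask (dish photo + ingredient photos) counts as a positive.
# Shapes and the temperature value are illustrative assumptions.
import torch
import torch.nn.functional as F

def multi_positive_info_nce(text_emb, image_emb, positive_mask, temperature=0.07):
    """
    text_emb:      (N, D) normalized text embeddings
    image_emb:     (M, D) normalized image embeddings (dishes + ingredients)
    positive_mask: (N, M) boolean, True where image j is a positive for text i
    """
    logits = text_emb @ image_emb.T / temperature   # (N, M) similarity logits
    log_prob = F.log_softmax(logits, dim=1)         # normalize over all images
    # Average log-probability over each text's positives, then negate.
    pos_log_prob = (log_prob * positive_mask).sum(1) / positive_mask.sum(1)
    return -pos_log_prob.mean()
```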
Dataset Properties

Our project is built upon a novel multimodal dataset of Iranian foods, which we created from scratch. This dataset contains 1,737 total entries covering 1,393 unique dishes, each featuring descriptive passages and corresponding question-answer pairs generated by an LLM. A key property of this corpus is its multimodal nature; every entry is enriched with representative images of both the final plated dish and its three main ingredients, enabling a more holistic visual understanding. The dataset was constructed through web scraping and LLM-powered cleaning, and we deliberately avoided aggressive text preprocessing to preserve its full linguistic richness for our RAG system.
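For concreteness, a single entry might look like the following; the field names and file paths are hypothetical, chosen only to mirror the structure described above.

```python
# One hypothetical dataset entry (schema is illustrative, not the exact one).
entry = {
    "dish": "Ghormeh Sabzi",
    "passage": "A slow-cooked herb stew with kidney beans ...",  # Persian in the corpus
    "qa_pairs": [
        {"question": "Which herbs are used?",
         "answer": "Parsley, cilantro, and fenugreek."},
    ],
    "images": {
        "dish": "images/ghormeh_sabzi/plated.jpg",
        "ingredients": [                      # three main-ingredient photos per entry
            "images/ghormeh_sabzi/herbs.jpg",
            "images/ghormeh_sabzi/kidney_beans.jpg",
            "images/ghormeh_sabzi/lamb.jpg",
        ],
    },
}
```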

Evaluation and Results

We evaluated our system on a custom set of 50 multiple-choice questions (30 text-only, 20 image-based). Our key finding was the differential impact of RAG on generators of varying strengths: retrieval improved accuracy for the lighter generator, particularly on ingredient-triggered and image-grounded queries, whereas gains for the stronger generator were mixed.
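A minimal sketch of how such a multiple-choice set can be scored, assuming each question carries a gold choice index and `answer_fn` wraps the RAG pipeline (both the question format and the hook are hypothetical):

```python
# Score a multiple-choice set: answer_fn is assumed to return a choice index.
def mcq_accuracy(questions, answer_fn):
    """questions: list of dicts with 'prompt', 'choices', 'answer' (gold index)."""
    correct = 0
    for q in questions:
        choice_list = "\n".join(f"{i}) {c}" for i, c in enumerate(q["choices"]))
        pred = answer_fn(f"{q['prompt']}\n{choice_list}")  # RAG pipeline hook
        correct += int(pred == q["answer"])
    return correct / len(questions)
```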
