Mohammad Hossein Eslami, Arshia Izadyari, Sadegh Mohammadian, Fateme Asgari, Ali RahimiAkbar, Mohammad Mahdi Vahedi
You can test an interactive demo of our model on Hugging Face Spaces at the following link:
https://huggingface.co/spaces/parsi-ai-nlpclass/Persian-Food-RAG
We present a Persian, food-domain retrieval-augmented generation (RAG) system that combines a dual-modality retriever with a lightweight generator. Building on our prior corpus and an added Kaggle recipe collection (1,737 entries; 1,393 unique dishes), we expand the index with web-sourced photos of dishes \emph{and} systematically collected images of key ingredients to strengthen image-grounded queries. The retriever pairs a Persian text encoder (Glot-500) with a fine-tuned CLIP vision--text encoder (SajjadAyoubi/clip-fa-vision) trained with a multi-positive contrastive objective to handle multiple instructions per dish. Cross-modal embeddings enable answering both text-only and image+text questions by retrieving pertinent evidence and conditioning the generator. On held-out multiple-choice sets, the RAG setup improves performance for ingredient-triggered and image-grounded queries with a lighter generator, while gains are mixed for a stronger generator.
Our system is built on a retrieval-augmented generation (RAG) architecture tailored to the Persian culinary domain. At its core is a shared representation module in which images and text passages are mapped into a unified vector space by a CLIP-style vision tower and a fine-tuned Glot-500 text encoder, respectively. For any query, whether text- or image-based, we use these embeddings to run an efficient similarity search against a FAISS index of the entire culinary text corpus. The top-ranked documents are pooled into an evidence block that is injected directly into the prompt, so the generative model produces a contextually grounded answer. A key innovation in our methodology is an ingredient-aware training strategy: by treating images of a dish's main ingredients as additional positive examples during fine-tuning, we encourage the model to form a more holistic concept of each dish, which improves the retriever's robustness, especially on ingredient-based queries.
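The retrieve-then-generate loop can be summarized with a minimal sketch. The function names (`build_index`, `retrieve`, `build_prompt`), the top-k value, and the prompt template below are illustrative assumptions rather than the project's exact API; the sketch assumes passage embeddings have already been produced by the text encoder.

```python
import numpy as np
import faiss

def build_index(passage_vectors: np.ndarray) -> faiss.IndexFlatIP:
    """Index L2-normalized passage embeddings so inner product = cosine similarity."""
    vecs = np.ascontiguousarray(passage_vectors, dtype="float32")
    faiss.normalize_L2(vecs)  # in-place L2 normalization
    index = faiss.IndexFlatIP(vecs.shape[1])
    index.add(vecs)
    return index

def retrieve(index, query_vec, passages, k=5):
    """Return the k passages closest to a query embedding (text or image)."""
    q = np.ascontiguousarray(query_vec.reshape(1, -1), dtype="float32")
    faiss.normalize_L2(q)
    _, ids = index.search(q, k)
    return [passages[i] for i in ids[0]]

def build_prompt(question, evidence):
    """Pool retrieved passages into an evidence block placed before the question."""
    block = "\n".join(f"- {p}" for p in evidence)
    return f"Evidence:\n{block}\n\nQuestion: {question}\nAnswer:"
```

Because both encoders share one embedding space, the same `retrieve` call serves text queries (embedded with Glot-500) and image queries (embedded with the CLIP vision tower).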
Our project is built upon a novel multimodal dataset of Iranian foods, which we created from scratch. This dataset contains 1,737 total entries covering 1,393 unique dishes, each featuring descriptive passages and corresponding question-answer pairs generated by an LLM. A key property of this corpus is its multimodal nature; every entry is enriched with representative images of both the final plated dish and its three main ingredients, enabling a more holistic visual understanding. The dataset was constructed through web scraping and LLM-powered cleaning, and we deliberately avoided aggressive text preprocessing to preserve its full linguistic richness for our RAG system.
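Since each entry pairs its recipe text with a photo of the plated dish and photos of its main ingredients, the ingredient-aware objective described above amounts to a multi-positive variant of the InfoNCE loss. The PyTorch sketch below is an assumption: the paper's exact formulation is not reproduced here, and the names `multi_positive_info_nce`, `dish_ids`, and the temperature value are ours. All image rows sharing a dish id are treated as positives for that dish's text.

```python
import torch
import torch.nn.functional as F

def multi_positive_info_nce(img_emb, txt_emb, dish_ids, temperature=0.07):
    """img_emb, txt_emb: (N, d) embeddings from the vision and text towers.
    dish_ids: (N,) integer labels; rows that share a dish id (e.g. the plated
    dish and its ingredient photos, all paired with the same recipe text)
    count as positives for one another."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature  # (N, N) image-to-text similarities
    pos_mask = (dish_ids.unsqueeze(0) == dish_ids.unsqueeze(1)).float()
    log_prob = logits.log_softmax(dim=1)
    # Average the log-likelihood over every positive of each image anchor.
    loss_per_anchor = -(log_prob * pos_mask).sum(dim=1) / pos_mask.sum(dim=1)
    return loss_per_anchor.mean()
```

When every dish appears exactly once in a batch, `pos_mask` is the identity and the loss reduces to the image-to-text direction of the standard CLIP InfoNCE objective; the multi-positive mask is what lets ingredient images reinforce, rather than compete with, their dish.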
We evaluated our system on a custom set of 50 multiple-choice questions (30 text-only, 20 image-based). Our key finding is that the impact of RAG depends on generator strength: retrieval consistently helps the lighter generator, especially on ingredient-triggered and image-grounded queries, whereas gains for the stronger generator are mixed.
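Per-modality accuracy on such a benchmark can be tallied with a few lines; the record fields below (`modality`, `predicted`, `gold`) are hypothetical and do not describe the actual evaluation harness.

```python
from collections import defaultdict

def accuracy_by_modality(results):
    """results: iterable of dicts, e.g.
    {"modality": "text", "predicted": "B", "gold": "B"}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["modality"]] += 1
        hits[r["modality"]] += int(r["predicted"] == r["gold"])
    # Fraction of correct answers per query modality (text vs. image).
    return {m: hits[m] / totals[m] for m in totals}
```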


