Description
We are working on a project involving the evaluation of hallucination detection methods in retrieval-augmented generation models. Your work, "RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models," has been instrumental in guiding our research. We deeply appreciate the comprehensive dataset and insightful analyses you have provided.
We are particularly interested in the detailed model-level results presented in Table 5 of your paper, which summarizes the response-level hallucination detection performance of each baseline method across different tasks and models. The overall results are extremely helpful, but for our work, having access to the detailed results for each model (i.e., Llama-2-7B-chat, Llama-2-13B-chat, Llama-2-70B-chat†, Mistral-7B-Instruct) would significantly enhance our analysis and help us avoid unnecessary duplication of effort.
Request:
Could you kindly provide the detailed experimental results for each model included in the RAGTruth dataset? Specifically, we are looking for the hallucination detection performance metrics (precision, recall, and F1 score) broken down by each model used in your experiments:
Llama-2-7B-chat
Llama-2-13B-chat
Llama-2-70B-chat†
Mistral-7B-Instruct
Having this detailed information would greatly aid us in advancing our research and in building upon your findings more effectively. We understand the effort that goes into compiling and sharing such data, and we are immensely grateful for any assistance you can provide.