Blinded by Generated Contexts: How Language Models Merge Generated and Retrieved Contexts for Open-Domain QA?
Official Review of Paper316 by Reviewer #1
The paper investigates how Large Language Models (LLMs) integrate generated and retrieved contexts in open-domain question answering (QA) tasks. Despite the increasing reliance on auxiliary information to enhance LLMs, there's limited understanding of how LLMs merge these differing contexts, particularly when they conflict. Through a systematic framework, the authors explore whether LLMs' responses derive more from generated or retrieved contexts, constructing datasets with intentionally conflicting information to trace the origin of responses. The experiments reveal a significant bias in several LLMs, including GPT-4 and Llama2, favoring generated contexts even when they convey incorrect information. This preference is attributed to the greater text similarity of generated contexts to the questions and the disruption of context completeness in retrieved contexts. The study highlights the challenges in effectively merging internal and external knowledge sources within LLMs, offering insights for improving current LLM augmentation methods.
The paper introduces a systematic framework to dissect how Large Language Models (LLMs) integrate generated and retrieved contexts.
The research uncovers a significant bias in several state-of-the-art LLMs (GPT-4/3.5 and Llama2) toward favoring generated contexts over retrieved ones, even when the generated context contains incorrect information. This finding is crucial for understanding the behavior of LLMs in real-world applications. It also identifies key factors influencing LLMs' preference for generated contexts, such as text similarity and the completeness of context.
The study uses a meticulously designed dataset where each question is paired with both a generated and a retrieved context, but only one contains the correct answer. This setup allows for precise tracing of the origin of LLMs' responses, enhancing the study's methodological robustness.
The dataset includes quintets where only one context (generated or retrieved) contains the correct answer. While this design is beneficial for testing the LLM's ability to discern and choose the correct context, it might not fully represent real-world scenarios where both contexts could provide valuable or complementary information. The strict exclusivity requirement might oversimplify the complexity of real-world information retrieval, where multiple sources can offer partially correct or complementary information, which is a common scenario that LLMs need to navigate.
In Section 5.1, the study attempts to disrupt the alignment between generated contexts and the LLM's parametric knowledge to test for confirmation bias. However, the methodology to disrupt this alignment is not fully detailed here. The effectiveness of creating a "counter-memory" context to challenge the LLM's biases needs a clear explanation of how it differs significantly from the model's internal knowledge to ensure the validity of the experiment.
Please see the Weaknesses.
N/A
Dear Reviewer #1
First of all, thank you very much for recognizing the significance and value of our findings.
Q1: While this design is beneficial for testing the LLM's ability to discern and choose the correct context, it might not fully represent real-world scenarios where both contexts could provide valuable or complementary information. The strict exclusivity requirement might oversimplify the complexity of real-world information retrieval, where multiple sources can offer partially correct or complementary information, which is a common scenario that LLMs need to navigate.
A1: How LLMs integrate and utilize partially complementary information from various sources is an interesting question. This work primarily focuses on how LLMs handle conflicts among different input information. Indeed, we simplified the setting, which partly deviates from real-world scenarios. However, even in this simple setting, LLMs still exhibit bias, highlighting the severity of the problem. We acknowledge that our work is only a preliminary exploration, and we hope it will inspire the community to pay attention to how LLMs process information from multiple sources.
Your suggestion is very meaningful, and the scenario you mention is exactly what we aim to explore next. Although it may go beyond the scope of our current research, in future work we would be very happy to investigate how the bias discovered in this study manifests in more realistic scenarios.
Q2: In Section 5.1, the study attempts to disrupt the alignment between generated contexts and the LLM's parametric knowledge to test for confirmation bias. However, the methodology to disrupt this alignment is not fully detailed here. The effectiveness of creating a "counter-memory" context to challenge the LLM's biases needs a clear explanation of how it differs significantly from the model's internal knowledge to ensure the validity of the experiment.
A2: Due to space constraints, we have placed the detailed process of constructing the counter-memory context in Appendix A.6.1. We apologize for any confusion this has caused. We will streamline this content and incorporate it into the main text later. Below is a brief summary of our construction method:
● 1. Select the “counter-memory” answer: We first convert the memory answer (e.g., “Canada”) to a same-type yet distinct entity (e.g., “United States”), which then serves as the “counter-memory” answer. We conduct several checks to ensure that the counter-memory answer does not match any of the following: the answer in the original generated context, the answer generated by the LLM without any given context, and the golden answer.
● 2. Generate the “counter-memory” context: We employ LLMs to fabricate a context that supports the counter-memory answer using the following prompt, which has been shown to work well in previous work on misinformation [1]. The length constraint is included to mitigate the effect of length.
Prompt: “Generate a background document in support of the given opinion to the question. Keep the length of the document around n words. Question: {question} Opinion: {counter memory answer} Document:”
● 3. Answer Consistency Checking: To ensure the effectiveness of the counter-memory context, we retain only those instances where the predicted answer, derived exclusively from the counter-memory context, exactly matches the counter-memory answer.
[1] Pan Y, Pan L, Chen W, et al. On the Risk of Misinformation Pollution with Large Language Models[C]//Findings of the Association for Computational Linguistics: EMNLP 2023. 2023: 1389-1403.
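For concreteness, the three-step construction pipeline above could be sketched roughly as follows. This is a minimal illustration, not the implementation in Appendix A.6.1; `generate_fn`, `answer_fn`, and the sample field names are placeholders of our own.

```python
# Minimal sketch of the counter-memory construction described above.
# `generate_fn(prompt)` and `answer_fn(question, context)` stand in for
# calls to an LLM; they and the dict keys are illustrative placeholders.

COUNTER_MEMORY_PROMPT = (
    "Generate a background document in support of the given opinion to the "
    "question. Keep the length of the document around {n} words. "
    "Question: {question} Opinion: {answer} Document:"
)

def build_counter_memory(sample, counter_answer, generate_fn, answer_fn, n_words=100):
    """Return a counter-memory context, or None if any check fails."""
    # Step 1: the counter-memory answer must differ from all known answers.
    forbidden = {sample["generated_answer"], sample["closed_book_answer"], sample["golden_answer"]}
    if counter_answer in forbidden:
        return None

    # Step 2: ask the LLM to fabricate a supporting context of bounded length.
    prompt = COUNTER_MEMORY_PROMPT.format(
        n=n_words, question=sample["question"], answer=counter_answer
    )
    context = generate_fn(prompt)

    # Step 3: consistency check -- keep the instance only if answering from
    # this context alone reproduces the counter-memory answer exactly.
    predicted = answer_fn(sample["question"], context)
    if predicted.strip().lower() != counter_answer.strip().lower():
        return None
    return context
```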
Official Review of Paper316 by Reviewer #2
The paper presents an empirical study to determine the extent to which a large language model (LLM) relies on generated or retrieved contexts when answering a question. The LLM is provided with a question and a corresponding context. The context could be self-generated, generated by another LLM, or retrieved by a retriever model. The authors also consider a hybrid approach where both generated and retrieved contexts are provided simultaneously. The key observation is that LLMs tend to rely on generated contexts over retrieved contexts when answering questions, even when the answer present in that context is wrong. The authors demonstrate that the presence of such biases could be attributed to confirmation bias.
The paper is mostly well-written and easy to follow.
It contributes a new framework for constructing datasets for measuring biases.
The observation that the LLMs prefer generated contexts over retrieved context is interesting.
While the paper is well-written and easy to follow, a significant part of the text has been moved to appendix, which often breaks the flow. I understand this to be because of the page limit but putting related work section in the appendix is a bit unnecessary.
One important thing which could impact the results is contamination. TriviaQA and NaturalQA datasets are publicly available and it is highly likely that the LLM has already seen these datasets during pre-training which might influence the answers that it generates. So it is a bit unclear what the LLM is referring to when answering a question. The authors do allude to it in section 5.1 but still it does not sufficiently address the issue. What if the retrieved context matches the parametric knowledge?
I also don't completely understand the utility of the results. Numerous studies have demonstrated the effectiveness of retrieval augmented generation in obtaining the correct answer. I fail to see an application where both generated and retrieval context are provided simultaneously to LLMs. I mean, if the retrieval augmented systems are accurate, why would one need generated contexts.
The generated content is also often sensitive to the prompt. It is not mentioned in the paper what prompt was used to generate the answers. What happens when one explicitly prompts the model to only use the retrieved context to answer the question. Also, how did you ensure that the generated answer exactly matches the ground-truth answer. I mean, the model could have answered the question in a phrase which might contain the right answer. However, the exact match metric might not reflect that.
See weaknesses.
N/A
N/A
Part (1/2)
We sincerely appreciate your valuable comments.
Q1: Putting related work section in the appendix is a bit unnecessary.
A1: Regarding the related work, we currently introduce only the most relevant works in Section 2.1. We apologize for any inconvenience this may have caused. We will streamline the related work section and incorporate it into the main text.
Q2: One important thing which could impact the results is contamination. TriviaQA and NaturalQA datasets are publicly available and it is highly likely that the LLM has already seen these datasets during pre-training which might influence the answers that it generates. So it is a bit unclear what the LLM is referring to when answering a question. The authors do allude to it in section 5.1 but still it does not sufficiently address the issue. What if the retrieved context matches the parametric knowledge?
A2: Thanks for your valuable suggestions.
(1) In our work, we examine which answers (those provided by the generated or retrieved contexts) LLMs tend to select, in order to reveal how LLMs use both types of contexts. The fact that LLMs have seen these questions does not affect the conclusions of the paper, because we observe that LLMs tend to trust generated contexts even when they provide incorrect information. Assuming LLMs have seen these questions and know the correct answers, the fact that they still select the incorrect answer from the generated contexts on AIR only indicates that the revealed bias is even more serious.
(2) Regarding the assumption “In Section 5.1, what if the retrieved context matches the parametric knowledge?”: only a small number of samples present this scenario. Concretely, we split the samples based on whether they satisfy Ret-Ans = LLM-Ans, where:
- LLM-Ans refers to the answer produced by the LLM when only the question is input without any context, which could reflect the LLM's own knowledge.
- Gen-Ans and Ret-Ans are the answers provided by the generated and retrieved contexts, respectively.
The table below shows the LLMs' bias towards generated contexts in each subset:
- The proportion of samples where the retrieved context matches the parametric knowledge is relatively small: 131 out of 883 instances. Even when the retrieved context matches the parametric knowledge (Ret-Ans = LLM-Ans), LLMs still show significant bias towards generated contexts. This indicates that the bias towards generated contexts is very serious.
- When excluding the samples where the retrieved context matches the parametric knowledge (Ret-Ans = LLM-Ans), the LLMs' bias also changes very little.
| | Bias towards generated contexts | Number of samples |
|---|---|---|
| Original result in Table 3 | 0.6468 | 883 |
| Ret-Ans ≠ LLM-Ans | 0.6248 | 752 |
| Ret-Ans = LLM-Ans | 0.7460 | 131 |
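A rough sketch of how this split-and-recompute analysis could be implemented is shown below; the field names (`output`, `gen_ans`, `ret_ans`, `llm_ans`) are illustrative assumptions rather than our actual data schema.

```python
def _norm(ans):
    return ans.strip().lower()

def gen_context_bias(samples):
    """Fraction of samples whose final output exactly matches the generated-context answer."""
    if not samples:
        return 0.0
    return sum(_norm(s["output"]) == _norm(s["gen_ans"]) for s in samples) / len(samples)

def split_by_ret_matches_parametric(samples):
    """Partition samples by whether Ret-Ans equals the closed-book LLM-Ans."""
    match = [s for s in samples if _norm(s["ret_ans"]) == _norm(s["llm_ans"])]
    differ = [s for s in samples if _norm(s["ret_ans"]) != _norm(s["llm_ans"])]
    return match, differ

# Usage (with a hypothetical list of sample dicts):
# match, differ = split_by_ret_matches_parametric(samples)
# print(gen_context_bias(samples), gen_context_bias(differ), gen_context_bias(match))
```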
(3) To quantify the effect of internal parametric knowledge more clearly, we add a further analysis. We select the cases where the answers from the retrieved context, the generated context, and the LLM's internal knowledge are all different from one another (Gen-Ans ≠ Ret-Ans ≠ LLM-Ans).
Result: The table below shows the proportion of the LLM's output that exactly matches Gen-Ans, Ret-Ans, or LLM-Ans. We can see that:
- ( i ) The proportion of choosing the LLMs' internal knowledge (LLM-Ans) is very small (~1%). This result indicates that, given external context, LLMs do indeed rely heavily on the external context.
- ( ii ) LLMs still show a significant preference for generated contexts, with the selection ratio ordered Gen-Ans > Ret-Ans > LLM-Ans. This means that even excluding the influence of internal knowledge, the bias discovered in our paper remains very significant.
Conclusion: The conclusion of our paper does not change after excluding the influence of internal knowledge.
| Generator | Reader | Gen-Ans | Ret-Ans | LLM-Ans | Number of samples |
|---|---|---|---|---|---|
| GPT-3.5 | Llama2-13b | 62.39% | 26.76% | 1.10% | 553 |
| Llama2-7b | Llama2-13b | 69.16% | 18.96% | 1.86% | 1018 |
| Llama2-13b | GPT-3.5 | 67.22% | 15.34% | 1.20% | 665 |
| Llama2-7b | GPT-3.5 | 65.55% | 16.24% | 1.13% | 708 |
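For completeness, a minimal sketch of the attribution behind this table (matching each output against the three candidate answers via exact match) might look like the following; again, the field names are assumed for illustration.

```python
from collections import Counter

def attribute_outputs(samples):
    """Attribute each LLM output to Gen-Ans, Ret-Ans, LLM-Ans, or none, via exact match."""
    counts = Counter()
    for s in samples:
        out = s["output"].strip().lower()
        for key in ("gen_ans", "ret_ans", "llm_ans"):
            if out == s[key].strip().lower():
                counts[key] += 1
                break  # the three candidate answers are mutually distinct in this subset
        else:
            counts["other"] += 1
    total = len(samples)
    return {key: count / total for key, count in counts.items()} if total else {}
```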
Part (2/2)
Q3: I also don't completely understand the utility of the results. Numerous studies have demonstrated the effectiveness of retrieval augmented generation in obtaining the correct answer. I fail to see an application where both generated and retrieval context are provided simultaneously to LLMs. I mean, if the retrieval augmented systems are accurate, why would one need generated contexts.
A3: Thanks for this question. Several works [1][2][3] already input both types of contexts simultaneously and achieve better results than retrieval alone. Furthermore, retrieval is not perfect: a considerable number of questions (on average 10.8%, Table 1 in our paper) cannot be addressed with retrieved contexts but can be addressed with the generated context. The combined use of both has the potential to yield significant performance improvements.
Moreover, as the amount of LLM-generated content on the internet increases, even when using RAG alone, the retrieved information may include generated content [4][5]. At this point, how LLMs treat retrieved (human-written) and generated content becomes a critical issue. This work finds that current LLMs exhibit bias when processing these two types of information.
[1] Yu W, Iter D, Wang S, et al. Generate rather than Retrieve: Large Language Models are Strong Context Generators[C]//The Eleventh International Conference on Learning Representations. 2022.
[2] Zhang Y, Khalifa M, Logeswaran L, et al. Merging Generated and Retrieved Knowledge for Open-Domain QA[C]//Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023: 4710-4728.
[3] Abdallah A, Jatowt A. Generator-retriever-generator: A novel approach to open-domain question answering[J]. arXiv preprint arXiv:2307.11278, 2023.
[4] Pan Y, Pan L, Chen W, et al. On the Risk of Misinformation Pollution with Large Language Models[C]//Findings of the Association for Computational Linguistics: EMNLP 2023. 2023: 1389-1403.
Q4: (1) It is not mentioned in the paper what prompt was used to generate the answers. What happens when one explicitly prompts the model to only use the retrieved context to answer the question.
A4: (1) We present the prompt used for generating answers in Figure 4. We apologize for the prompt not being sufficiently prominent; we will modify it to be displayed more conspicuously within the main text.
Specifically, when answering, we restrict the generated answers to be a single entity: "Refer to the context below and answer the following question with just one entity. context: {#contexts} question: {#question} The answer is".
(2) Specifying in the prompt to use only the retrieved context may indeed affect the model's behavior. However, this may cause the model to overlook correct information when only the generated context is correct (the AIG subsets in the paper). We believe an ideal LLM should be capable of utilizing correct information from both sources. We are preparing to experiment with different prompts to mitigate the impact of this bias, and if space permits, we will include the results in subsequent revisions.
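For reference, the single-entity answering prompt quoted above can be expressed as a template, together with a purely hypothetical "retrieved-only" variant of the kind suggested in the question; the variant is our illustration and is not a prompt used in the paper.

```python
# The answering prompt from the paper, as a format template.
ANSWER_PROMPT = (
    "Refer to the context below and answer the following question with just "
    "one entity. context: {contexts} question: {question} The answer is"
)

# Hypothetical variant restricting the model to the retrieved passage only.
RETRIEVED_ONLY_PROMPT = (
    "Refer only to the retrieved passage below and answer the following "
    "question with just one entity. retrieved passage: {retrieved} "
    "question: {question} The answer is"
)

def build_answer_prompt(question, contexts):
    return ANSWER_PROMPT.format(contexts=contexts, question=question)
```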
Q5: Also, how did you ensure that the generated answer exactly matches the ground-truth answer. I mean, the model could have answered the question in a phrase which might contain the right answer. However, the exact match metric might not reflect that.
A5: We use exact match because it is a commonly used QA evaluation metric. We acknowledge that exact match has some drawbacks, such as the case you mention where the response contains the right answer within a longer phrase. However, since we instruct the model to answer with a single short entity, this situation is relatively rare.
Furthermore, we employed ChatGPT to determine whether the generated answer matches the correct answer, using the prompt shown below:
Determine if the meaning of 'Answer' is exactly the same as any of the 'Golden Answers'. Answer: yes or no. Question: {question} Answer: {response} Golden Answers: {answer}
Then, we compare ChatGPT's judgments with those of exact match. In a random sample of 500 instances, 94% of the results are consistent. This indicates that exact match is acceptable in our scenario.
By the way, we only use exact match against the ground truth to filter the dataset. The very small number of responses that cannot be matched to the ground truth only affects the number of samples in the dataset and does not affect the findings of this paper. Our main metric instead measures whether the LLM's output matches the answer from the generated or the retrieved context.
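As an illustration, a standard normalized exact-match check of the kind discussed above could be sketched as follows; the normalization steps (lowercasing, stripping punctuation and articles) follow common open-domain QA practice and are an assumption, not a description of our exact code.

```python
import re
import string

def normalize_answer(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, golden_answers):
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(g) for g in golden_answers)

# Example: exact_match("The United States", ["united states"]) -> True
```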
Official Review of Paper316 by Reviewer #3
This paper aims to answer which context LLMs rely more on when doing open-domain QA, generated or retrieved, when both are present but contain contradictory answers. This is tested by creating a new dataset, CC, through filtering, which consists of two parts: AIG (generated context is correct, retrieved is not) and AIR (the opposite). An interesting observation is a strong tendency for LLMs to provide answers similar to what's provided in the generated context. The authors further examine possible causes: confirmation bias, text similarity, and context completeness.
- This paper reiterates LLMs' preference for relying on parametric knowledge / generated context and ignoring retrieved context, highlighting that RAG alone may be insufficient for improving factuality.
- Furthermore, findings of the paper (the correlation between completeness & preference and similarity & preference) provide guidance on improving RAG systems.
- It is unclear, and contradictory how the authors see the relation between internal parametric knowledge and generated context. Line 097 states these two are equivalent, yet the discussion of traceability (section 3.1) seems to be talking about finding instances where LLM relies solely on generated context rather than internal parametric knowledge.
- If treating the two as equivalent, the problem is reduced to which source, external context or intrinsic knowledge, matters - a question already answered by Xie et al. (2023) that was cited and other prior work.
- If treating the two as different - the effect of internal parametric knowledge needs to be more carefully controlled - while in the paper it’s either ignored or inadequately controlled.
- “Which source do LLMs rely on to answer questions” cannot be answered without studying three factors: whether they rely on internal parametric knowledge, generated context, and/or retrieved context. The paper downplays / ignores the effect of internal parametric knowledge:
- lines 284-292 claim that previous works conclude that internal parametric knowledge does not matter in the presence of contexts, which is not what Xie et al. (2023) - the work cited - say in the abstract.
- The conclusion “LLMs prefer self-generated context” (section 4.1) can not be reached without ruling out the case that LLMs rely solely on parametric knowledge.
- Section 5.1: what about the case where the parametric knowledge has the correct answer, but generated context is hallucinated?
- The effect of context completeness (section 5.3) is not decoupled from the effect of text similarity (section 5.2). These two are likely correlated and the readers are left unsure which of them actually caused LLMs to prefer generated contexts.
- Are the two causes (context completeness and text similarity) not covered in Xie et al. (2023) - their first point in the abstract?
On the one hand, different from prior wisdom, we find that LLMs can be highly receptive to external evidence even when that conflicts with their parametric memory, given that the external evidence is coherent and convincing. On the other hand, LLMs also demonstrate a strong confirmation bias when the external evidence contains some information that is consistent with their parametric memory, despite being presented with conflicting evidence at the same time. - Abstract of Xie et al. (2023)
The author response mostly settled the concerns here. I still believe the responses and discussions need to be incorporated in a revision to make the paper sound.
- Font sizes in figures are really small - it’s hard to read at 100% scale.
- Line 260: besides temperature, seed also needs to be fixed
- Line 289, 480, multiple instances in the appendix : According to https://acl-org.github.io/ACLPUB/formatting.html#citations
Refrain from using full citations as sentence constituents. Instead of
(Gusfield, 1997) showed that … In (Gusfield, 1997), …
write
Gusfield (1997) showed that … In Gusfield (1997), …
None
Part (1/4)
Thank you for your patient and detailed comments. We deeply value your feedback.
Q1: It is unclear, and contradictory how the authors see the relation between internal parametric knowledge and generated context. Line 097 states these two are equivalent, yet the discussion of traceability (section 3.1) seems to be talking about finding instances where LLM relies solely on generated context rather than internal parametric knowledge.
- If treating the two as equivalent, the problem is reduced to which source, external context or intrinsic knowledge, matters - a question already answered by Xie et al. (2023) that was cited and other prior work.
- If treating the two as different - the effect of internal parametric knowledge needs to be more carefully controlled - while in the paper it’s either ignored or inadequately controlled.
A1: (1) We appreciate the reviewer's insightful feedback and the critical issue you've highlighted regarding the relation between internal parametric knowledge and generated context. We apologize for any confusion caused by our current writing.
To clarify, our work does not equate parametric knowledge with generated context. As you rightly noted, ensuring consistency between the generated context and the LLM's internal parametric knowledge is challenging.
Our research aims to explore whether LLMs exhibit bias when integrating two types of external input contexts (generated and retrieved); the generated context is simply a special type of input context. This question is becoming increasingly critical as LLMs generate more and more of the content accessible on the World Wide Web. In this setting, our findings indicate a tendency for LLMs to favor generated content, regardless of its correctness or origin (generated by themselves or by other LLMs).
(2) To quantify the effect of internal parametric knowledge more clearly, we add a further analysis. We select the cases where the answers from the retrieved context, the generated context, and the LLM's internal knowledge are all different from one another (Gen-Ans ≠ Ret-Ans ≠ LLM-Ans), where:
- LLM-Ans refers to the answer produced by the LLM when only the question is input without any context, which could reflect the LLM's own knowledge.
- Gen-Ans and Ret-Ans are the answers provided by the generated and retrieved contexts, respectively.
Result: The table below shows the proportion of the LLM's output that exactly matches Gen-Ans, Ret-Ans, or LLM-Ans. We can see that:
- ( i ) The proportion of choosing the LLMs' internal knowledge (LLM-Ans) is very small (~1%). This result indicates that, given external context, LLMs do indeed rely heavily on the external context.
- ( ii ) LLMs still show a significant preference for generated contexts, with the selection ratio ordered Gen-Ans > Ret-Ans > LLM-Ans. This means that even after excluding these samples potentially influenced by internal knowledge, the bias discovered in our paper remains very significant.
Conclusion: The conclusion of our paper does not change after excluding the influence of internal knowledge, i.e., LLMs still blindly trust the generated contexts.
| Generator | Reader | Gen-Ans | Ret-Ans | LLM-Ans | Number of samples |
|---|---|---|---|---|---|
| GPT-3.5 | Llama2-13b | 62.39% | 26.76% | 1.10% | 553 |
| Llama2-7b | Llama2-13b | 69.16% | 18.96% | 1.86% | 1018 |
| Llama2-13b | GPT-3.5 | 67.22% | 15.34% | 1.20% | 665 |
| Llama2-7b | GPT-3.5 | 65.55% | 16.24% | 1.13% | 708 |
We acknowledge that our previous results contain some interference from the LLMs' internal knowledge, and we apologize for any confusion this caused. However, after excluding this interference, our findings do not change (LLMs still prefer generated contexts). We will update the paper with results that exclude the influence of internal parametric knowledge, and we will add this discussion of the effect of internal knowledge to the paper. If you have any other concerns regarding this issue, we would be happy to address them.
Part (2/4)
Q2: “Which source do LLMs rely on to answer questions” cannot be answered without studying three factors: whether they rely on internal parametric knowledge, generated context, and/or retrieved context. The paper downplays / ignores the effect of internal parametric knowledge.
A2: As shown in the reply to Q1(2), we discuss the impact of the three factors more clearly. The results indicate that the proportion of LLMs relying on internal parametric knowledge is very small (~1%) in our scenario. Our conclusion, "LLMs show significant preference for generated contexts," still holds, even when the influence of internal parametric knowledge is excluded.
Q3: lines 284-292 claim that previous works conclude that internal parametric knowledge does not matter in the presence of contexts, which is not what Xie et al. (2023) - the work cited - say in the abstract.
A3: Thank you very much for your detailed feedback. The following is the original statement from Xie et al. (2023):
Experiment conclusion in Xie et al. (2023) (page 6, line 5 from the bottom): “LLMs are actually highly receptive to external evidence if it is presented in a coherent way, even though it conflicts with their parametric memory.”
Abstract in Xie et al. (2023): “On the one hand, different from prior wisdom, we find that LLMs can be highly receptive to external evidence even when that conflicts with their parametric memory, given that the external evidence is coherent and convincing.”
Our understanding of Xie et al. (2023) is that when external evidence is introduced, LLMs tend to rely more on this external evidence, even when it is inconsistent with their internal knowledge. To clarify, we do not claim that internal parametric knowledge is unimportant; rather, its impact on our conclusion ( LLMs prefer generated contexts ) is relatively minor.
We quantify "the effect of internal parametric knowledge" more clearly in our reply to Q1(2). The results show that the vast majority of LLMs' answers are derived from the context, with only about 1% of the answers originating from the LLMs' internal knowledge. Even excluding its impact does not affect our findings and conclusions (LLMs still prefer generated contexts).
Q4: The conclusion “LLMs prefer self-generated context” (section 4.1) can not be reached without ruling out the case that LLMs rely solely on parametric knowledge.
A4: We appreciate this important question. We specifically discuss it in Section 5.1 by disrupting the consistency between generated contexts and the LLMs' parametric knowledge. Concretely, we have the LLMs fabricate a special generated context that supports a same-type yet different answer from the LLM's parametric knowledge (with several checks to ensure this). Our results show that LLMs still have a significant bias towards generated contexts, even when the generated contexts contradict their parametric knowledge. This suggests that LLMs' preference for generated contexts is not because they rely solely on parametric knowledge.
In fact, the generated context is not limited to self-generated context. LLMs also prefer contexts generated by other LLMs, which are not directly related to the reading LLM's parametric knowledge. The reply to Q1(2) further supports this point by selecting the cases where the generated context, the LLM's parametric knowledge, and the retrieved context support three different answers. Even in this situation, LLMs still mostly tend to rely on generated contexts, with only about 1% of the answers stemming from the LLMs' parametric knowledge.
Part (3/4)
Q5: Section 5.1: what about the case where the parametric knowledge has the correct answer, but generated context is hallucinated?
A5:
We select the data that conforms to this scenario based on Section 5.1. Specifically, we select the questions that LLMs can correctly answer without any context; these questions reflect the situation where "the parametric knowledge has the correct answer", as you mentioned. Since the counter-memory contexts in this section provide an incorrect answer (Ctr-Ans), they partly resemble the case where "the generated context is hallucinated". Finally, we obtain only 136 such cases (the original dataset size is 883). In these cases, LLMs still tend to select the incorrect answer from the counter-memory context (Ctr-Ans) while disregarding the correct answer provided by both the retrieved context and their internal knowledge. This suggests that our findings still hold in this situation. The phenomenon also illustrates how serious the LLMs' preference for generated contexts is: even if the model is capable of providing the correct answer, it can still be misled by incorrect information in generated contexts.
| | Incorrect and exactly matching Ctr-Ans | Correct |
|---|---|---|
| Ratio | 83.09% | 13.24% |
Q6: The effect of context completeness (section 5.3) is not decoupled from the effect of text similarity (section 5.2). These two are likely correlated and the readers are left unsure which of them actually caused LLMs to prefer generated contexts.
A6: When studying context completeness in Section 5.3, similarity was kept constant and only completeness was varied. As shown in Table 4 (also shown below), the three types of generated contexts ("Nature", "Trunc.", and "S-Trunc.") have almost equivalent similarity and length, with the principal disparities lying in semantic and sentence completeness. We compare the changes in the degree of LLMs' bias towards these three types of generated contexts to assess the impact of completeness. When comparing the bias with "Nature vs. Ret" and "S-Trunc. vs. Ret" inputs, the only difference lies in the semantic completeness of the two types of generated contexts ("Nature" and "S-Trunc."). Similarly, when comparing the bias with "Trunc. vs. Ret" and "S-Trunc. vs. Ret" inputs, the only change is in sentence completeness. (The table below shows that the three types of generated contexts have similar average similarity to the question, measured by both Jaccard and BERTScore.)
| Context | Length | Semantic Completeness | Sentence Completeness | Similarity (Jaccard) | Similarity (BERTScore) |
|---|---|---|---|---|---|
| Retrieved | 107.4 | No | No | 0.114 | 0.801 |
| Nature | 109.7 | Yes | Yes | 0.202 | 0.879 |
| Trunc. | 107.4 | No | No | 0.196 | 0.877 |
| S-Trunc. | 105.9 | No | Yes | 0.193 | 0.876 |
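For reference, the question-context Jaccard similarity reported in the table can be computed over word sets as sketched below; the simple whitespace tokenization is our own assumption for illustration.

```python
def jaccard_similarity(question, context):
    """Jaccard similarity between the lowercased word sets of question and context."""
    q = set(question.lower().split())
    c = set(context.lower().split())
    if not q or not c:
        return 0.0
    return len(q & c) / len(q | c)

# Usage: jaccard_similarity("who wrote Hamlet", generated_context)
```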
Part (4/4)
Q7: Are the two causes (context completeness and text similarity) not covered in Xie et al. (2023) - their first point in the abstract?
A7: After carefully checking Xie et al. (2023) again, we believe that the two causes (context completeness and text similarity) are not covered in Xie et al. (2023).
In our work, text similarity refers to the similarity of the context to the question. Completeness investigates the effect of the incomplete sentences and semantics caused by fixed-length truncation in retrieval systems.
Xie et al. (2023) first consider the effect of coherence. Coherence refers to the internal consistency within a context, focusing on inconsistencies caused by entity substitution and negation injection. In our work, both retrieved and generated contexts are coherent, but they may provide incomplete information or have varying degrees of similarity to the question. Their experiments therefore do not cover our findings. Xie et al. (2023) also discuss the effects of context length, the number of supporting pieces of evidence, noisy passages, and input order. None of these factors covers similarity or completeness.
- Font sizes in figures are really small - it’s hard to read at 100% scale.
- Line 289, 480, multiple instances in the appendix : According to https://acl-org.github.io/ACLPUB/formatting.html#citations
We are very grateful for your detailed suggestions and apologize for the oversight. We greatly value your time and will make careful revisions and corrections.
Firstly, we will increase the font size to make the images clearer.
Then, we will meticulously check the citation format and revise all references, including the appendix.
Line 260: besides temperature, seed also needs to be fixed
In our code, we have fixed the seed for all experiments. We will add an explanation about this in the main text in our subsequent revisions. Thank you very much for your suggestion.
Should our responses satisfactorily address your concerns, we would greatly appreciate it if you consider increasing our score. Alternatively, if there are any other issues, we would be happy to answer them.
Meta Review of Paper316 by Area Chair
This paper compares whether LLMs prefer LLM-generated or retrieved contexts for open-domain question answering, especially when the contexts are contradictory. The experiments show that models have a strong tendency to prefer generated contexts to human-written ones, even when the generated context contains information that goes against the model's parametric knowledge. The preference for generated contexts becomes weaker when they are less similar to the question or are semantically or syntactically incomplete (as the retrieved contexts often are), but still persists.
Reviewers have pointed out that:
Most of the reviewers' concerns (breakdown provided below) have been addressed in the rebuttal, either with additional results/discussion or with pointers to results/discussion in the Appendix. Based on this, my recommendations are minor revisions primarily related to editing:
Concerns addressed by the author response (in most cases explicitly acknowledged by the reviewers; I omit the issues that were resolved by pointing to information that was already in the paper):
Suggestions for the authors: