Blinded by Generated Contexts: How Language Models Merge Generated and Retrieved Contexts for Open-Domain QA?

Anonymous
17 Feb 2024 · ACL ARR 2024 February Blind Submission · Readers: Everyone
Abstract: While auxiliary information has become a key to enhancing Large Language Models (LLMs), relatively little is known about how LLMs merge these contexts, specifically contexts generated by LLMs and those retrieved from external sources. To investigate this, we formulate a systematic framework to identify whether LLMs' responses, derived from the integration of generated and retrieved contexts, are attributed to either generated or retrieved contexts. To easily trace the origin of the response, we construct datasets with conflicting contexts, i.e., each question is paired with both generated and retrieved contexts, yet only one of them contains the correct answer. Our experiments reveal a significant bias in several LLMs (GPT-4/3.5 and Llama2) to favor generated contexts, even when they provide incorrect information. We further identify two key factors contributing to this bias: i) contexts generated by LLMs typically show greater similarity to the questions, increasing their likelihood of being selected; ii) the segmentation process used in retrieved contexts disrupts their completeness, thereby hindering their full utilization in LLMs. Our analysis enhances the understanding of how LLMs merge diverse contexts, offering valuable insights for advancing current augmentation methods for LLMs.
Paper Type: long
Research Area: Interpretability and Analysis of Models for NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English


Meta Review of Paper316 by Area Chair

ACL ARR 2024 February Paper316 Area Chair
08 Apr 2024, 05:46 · ACL ARR 2024 February Paper316 Meta Review · Readers: Paper316 Senior Area Chairs, Paper316 Area Chairs, Paper316 Authors, Paper316 Reviewers Submitted, Program Chairs
Paper Summary:

This paper compares whether LLMs prefer LLM-generated or retrieved contexts for open-domain question answering, especially when the contexts are contradictory. The experiments show that models have a strong tendency to prefer generated contexts to human-written ones, even when the generated context contains information that goes against the model's parametric knowledge. The preference for generated contexts becomes weaker when they are less similar to the question or are semantically or syntactically incomplete (as the retrieved contexts often are), but still persists.

Summary Of Strengths:

Reviewers have pointed out that:

  • The paper presents a novel approach to measuring context biases in LLMs (#2).
  • The experimental design is well thought out (#1).
  • The findings of the paper are interesting (#2) and highlight a deficiency in current RAG methods (#3).
  • The additional analyses focused on the factors that make LLMs prefer generated contexts can be turned into actionable steps for developing better RAG systems (#3).
  • The paper is well-written and easy to follow (#2).
Summary Of Weaknesses:

Most of the reviewers' concerns (breakdown provided below) have been addressed in the rebuttal, either with additional results/discussion or with pointers to results/discussion in the Appendix. Based on this, my recommendation is minor revisions, primarily related to editing:

  • Some content should be moved from the Appendix to the main paper, especially related work (see comment by Reviewer #2). Specifically, I would explicitly rename the "Background" section to "Background and related work" and try to merge these sections further. If the information in the Appendix is important for understanding a section in the main paper (e.g. Section 5.1 -- see comment by Reviewer #1), it might be helpful to have the pointer to the Appendix at the top of that section.
  • Both the new results (e.g. the Gen-Ans ≠ Ret-Ans ≠ LLM-Ans evaluation) and the discussion (e.g. why identifying overreliance on generated context is important for real-world applications) from the rebuttal should be included in the paper. I agree with Reviewer #3 that it will increase the paper's soundness, and some arguments from the discussion will help strengthen the motivation too.

Concerns addressed by the author response (in most cases explicitly acknowledged by the reviewers; I omit the issues that were resolved by pointing to information that was already in the paper):

  • Data contamination might pose issues (Reviewer #2) -- the discussion points out that even if there is data contamination, it does not invalidate the results.
  • There should be a more controlled setup for disentangling the model's parametric knowledge and the information contained in the contexts, both generated and retrieved (Reviewers #2, #3) -- addressed with additional evaluations on specific subsets of the data.
  • The novelty of the findings highlighting the two factors (Reviewer #3) -- addressed in the discussion.
  • The impact of the findings given the artificial nature of the two-context scenario (Reviewer #2) -- addressed in the discussion, I recommend making this more prominent in the paper.
  • The potential faults of the exact match metric (Reviewer #2) -- addressed in the discussion. I would also recommend doing a manual evaluation of a sample of predicted answers, so it is corroborated by humans and not just ChatGPT.
  • The controlled setting where only one context is correct is overly simplistic (Reviewer #1) -- the response acknowledges this, but I personally think this is not really a weakness but rather a design choice that allows for a controlled analysis.

Suggestions for the authors:

  • It is interesting that the Ret-Ans = LLM-Ans condition (response to Reviewer #2, A2.2), where the retrieved context aligns with the model's internal knowledge, produces an even bigger bias towards generated contexts. I would be curious to see the next revision put forward some hypotheses for why this might be the case.
  • Figure 8 was hard for me to read -- maybe it would be easier to follow if instead of splitting each dataset into n equal slices, you split it into (potentially unequal) buckets by Δsim, so the two curves are aligned on the x-axis.
  • In the Gen-Ans ≠ Ret-Ans ≠ LLM-Ans setting, where the LLM does not know which answer is correct, what would the ideal behavior of the LLM look like? Would it be choosing between Gen-Ans and Ret-Ans with equal probability? I would have liked to see some more discussion on this.
Overall Assessment: 4 = There are minor points that may be revised
Best Paper AE: No
Needs Ethics Review: No

Official Review of Paper316 by Reviewer #1

ACL ARR 2024 February Paper316 Reviewer #1
26 Mar 2024, 10:19 · ACL ARR 2024 February Paper316 Official Review · Readers: Program Chairs, Paper316 Senior Area Chairs, Paper316 Area Chairs, Paper316 Reviewers Submitted, Paper316 Authors
Recommended Process Of Reviewing: I have read the instructions above
Paper Summary:

The paper investigates how Large Language Models (LLMs) integrate generated and retrieved contexts in open-domain question answering (QA) tasks. Despite the increasing reliance on auxiliary information to enhance LLMs, there's limited understanding of how LLMs merge these differing contexts, particularly when they conflict. Through a systematic framework, the authors explore whether LLMs' responses derive more from generated or retrieved contexts, constructing datasets with intentionally conflicting information to trace the origin of responses. The experiments reveal a significant bias in several LLMs, including GPT-4 and Llama2, favoring generated contexts even when they convey incorrect information. This preference is attributed to the greater text similarity of generated contexts to the questions and the disruption of context completeness in retrieved contexts. The study highlights the challenges in effectively merging internal and external knowledge sources within LLMs, offering insights for improving current LLM augmentation methods.

Summary Of Strengths:
  • The paper introduces a systematic framework to dissect how Large Language Models (LLMs) integrate generated and retrieved contexts.

  • The research uncovers a significant bias in several state-of-the-art LLMs (GPT-4/3.5 and Llama2) toward favoring generated contexts over retrieved ones, even when the generated context contains incorrect information. This finding is crucial for understanding the behavior of LLMs in real-world applications. It also identifies key factors influencing LLMs' preference for generated contexts, such as text similarity and the completeness of context.

  • The study uses a meticulously designed dataset where each question is paired with both a generated and a retrieved context, but only one contains the correct answer. This setup allows for precise tracing of the origin of LLMs' responses, enhancing the study's methodological robustness.

Summary Of Weaknesses:
  • The dataset includes quintets where only one context (generated or retrieved) contains the correct answer. While this design is beneficial for testing the LLM's ability to discern and choose the correct context, it might not fully represent real-world scenarios where both contexts could provide valuable or complementary information. The strict exclusivity requirement might oversimplify the complexity of real-world information retrieval, where multiple sources can offer partially correct or complementary information, which is a common scenario that LLMs need to navigate.

  • In Section 5.1, the study attempts to disrupt the alignment between generated contexts and the LLM's parametric knowledge to test for confirmation bias. However, the methodology to disrupt this alignment is not fully detailed here. The effectiveness of creating a "counter-memory" context to challenge the LLM's biases needs a clear explanation of how it differs significantly from the model's internal knowledge to ensure the validity of the experiment.

Comments, Suggestions And Typos:

Please see the Weaknesses.

Soundness: 3 = Acceptable: This study provides sufficient support for its major claims/arguments. Some minor points may need extra support or details.
Overall Assessment: 3 = Good: This paper makes a reasonable contribution, and might be of interest for some (broad or narrow) sub-communities, possibly with minor revisions.
Confidence: 4 = Quite sure. I tried to check the important points carefully. It's unlikely, though conceivable, that I missed something that should affect my ratings.
Best Paper: No
Ethical Concerns:

N/A

Needs Ethics Review: No
Reproducibility: 3 = They could reproduce the results with some difficulty. The settings of parameters are underspecified or subjectively determined, and/or the training/evaluation data are not widely available.
Datasets: 4 = Useful: I would recommend the new datasets to other researchers or developers for their ongoing work.
Software: 3 = Potentially useful: Someone might find the new software useful for their work.
Knowledge Of Or Educated Guess At Author Identity: No
Knowledge Of Paper: N/A, I do not know anything about the paper from outside sources
Knowledge Of Paper Source: N/A, I do not know anything about the paper from outside sources
Impact Of Knowledge Of Paper: N/A, I do not know anything about the paper from outside sources
Reviewer Certification: #1

Dear Reviewer #1

ACL ARR 2024 February Paper316 Authors · Hexiang Tan (privately revealed to you)
29 Mar 2024, 20:02 (modified: 30 Mar 2024, 12:28) · ACL ARR 2024 February Paper316 Official Comment · Readers: Program Chairs, Paper316 Senior Area Chairs, Paper316 Area Chairs, Paper316 Reviewers Submitted, Paper316 Authors
Comment:

First of all, thank you very much for recognizing the significance and value of our findings.

Q1: While this design is beneficial for testing the LLM's ability to discern and choose the correct context, it might not fully represent real-world scenarios where both contexts could provide valuable or complementary information. The strict exclusivity requirement might oversimplify the complexity of real-world information retrieval, where multiple sources can offer partially correct or complementary information, which is a common scenario that LLMs need to navigate.

A1: How LLMs integrate and utilize partially complementary information from various sources is an interesting question. This work primarily focuses on how LLMs handle conflicts among different input information. Indeed, we simplified the setting, which partly deviates from real-world scenarios. However, even in this simple setting, LLMs still exhibit bias, highlighting the severity of the problem. We acknowledge that our work is merely a preliminary exploration, and we hope it will inspire the community to pay attention to how LLMs process information from multiple sources.

Your suggestion is very meaningful, and the scenario you mention is exactly what we aim to explore next. Although it goes beyond the scope of our current research, in future work we would be very happy to investigate how the bias discovered in this study manifests in more realistic scenarios.

Q2: In Section 5.1, the study attempts to disrupt the alignment between generated contexts and the LLM's parametric knowledge to test for confirmation bias. However, the methodology to disrupt this alignment is not fully detailed here. The effectiveness of creating a "counter-memory" context to challenge the LLM's biases needs a clear explanation of how it differs significantly from the model's internal knowledge to ensure the validity of the experiment.

A2: Due to space constraints, we have placed the detailed process of constructing the counter-memory context in Appendix A.6.1. We apologize for any confusion this has caused. We will streamline this content and incorporate it into the main text later. Below is a brief summary of our construction method:

  • 1. Select the “counter-memory” answer: We first convert the memory answer (e.g., “Canada”) to a same-type yet distinct entity (e.g., “United States”), which then serves as the “counter-memory” answer. We conducted several checks to ensure that the counter-memory answer does not match any of the following: answers in the original generated contexts, answers generated by LLMs without any given context, and the golden answer.

  • 2. Generate the “counter-memory” context: We employ LLMs to make up a context supporting the counter-memory answer with the following prompt, which has been shown to work well in prior work on misinformation [1]. A length constraint is also included to mitigate the effect of length.

Prompt: “Generate a background document in support of the given opinion to the question. Keep the length of the document around n words. Question: {question} Opinion: {counter memory answer} Document:”

  • 3. Answer consistency checking: To ensure the effectiveness of the counter-memory context, we retain only those instances where the predicted answer, derived exclusively from the counter-memory context, exactly matches the counter-memory answer.

[1] Pan Y, Pan L, Chen W, et al. On the Risk of Misinformation Pollution with Large Language Models[C]//Findings of the Association for Computational Linguistics: EMNLP 2023. 2023: 1389-1403.
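For concreteness, below is a minimal sketch of this construction-and-filtering pipeline. The helper names (`call_llm`, `convert_entity`, `generate_answer`) are hypothetical placeholders, not the actual implementation.

```python
# Sketch of the counter-memory construction pipeline described above.
# `call_llm`, `convert_entity`, and `generate_answer` are hypothetical helpers.

COUNTER_CTX_PROMPT = (
    "Generate a background document in support of the given opinion to the question. "
    "Keep the length of the document around {n} words. "
    "Question: {question} Opinion: {opinion} Document:"
)

def build_counter_memory(question, memory_answer, golden_answers,
                         gen_ctx_answer, closed_book_answer, n_words,
                         call_llm, convert_entity, generate_answer):
    # Step 1: pick a same-type yet distinct entity as the counter-memory answer
    # and make sure it collides with none of the existing answers.
    ctr_answer = convert_entity(memory_answer)
    forbidden = {gen_ctx_answer, closed_book_answer, *golden_answers}
    if ctr_answer in forbidden:
        return None

    # Step 2: let the LLM fabricate a context supporting the counter-memory answer.
    ctr_context = call_llm(COUNTER_CTX_PROMPT.format(
        n=n_words, question=question, opinion=ctr_answer))

    # Step 3: answer-consistency check -- keep the instance only if reading the
    # counter-memory context alone yields exactly the counter-memory answer.
    predicted = generate_answer(question, context=ctr_context)
    if predicted.strip().lower() != ctr_answer.strip().lower():
        return None
    return ctr_answer, ctr_context
```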


Official Review of Paper316 by Reviewer #2

ACL ARR 2024 February Paper316 Reviewer #2
19 Mar 2024, 19:07 (modified: 31 Mar 2024, 14:45) · ACL ARR 2024 February Paper316 Official Review · Readers: Program Chairs, Paper316 Senior Area Chairs, Paper316 Area Chairs, Paper316 Reviewers Submitted, Paper316 Authors
Recommended Process Of Reviewing: I have read the instructions above
Paper Summary:

The paper presents an empirical study to determine the extent to which a large language model (LLM) relies on generated or retrieved contexts when answering a question. The LLM is provided with a question and a corresponding context. The context could be self-generated, generated by another LLM, or retrieved by a retriever model. The authors also consider a hybrid approach where both generated and retrieved contexts are provided simultaneously. The key observation is that LLMs tend to rely on generated contexts over retrieved contexts when answering questions, even when the answer present in the generated context is wrong. The authors demonstrate that the presence of such biases could be attributed to confirmation bias.

Summary Of Strengths:

The paper is mostly well-written and easy to follow.

It contributes a new framework for constructing datasets for measuring biases.

The observation that LLMs prefer generated contexts over retrieved contexts is interesting.

Summary Of Weaknesses:

While the paper is well-written and easy to follow, a significant part of the text has been moved to the appendix, which often breaks the flow. I understand this to be because of the page limit, but putting the related work section in the appendix is a bit unnecessary.

One important thing which could impact the results is contamination. TriviaQA and NaturalQA datasets are publicly available and it is highly likely that the LLM has already seen these datasets during pre-training which might influence the answers that it generates. So it is a bit unclear what the LLM is referring to when answering a question. The authors do allude to it in section 5.1 but still it does not sufficiently address the issue. What if the retrieved context matches the parametric knowledge?

I also don't completely understand the utility of the results. Numerous studies have demonstrated the effectiveness of retrieval augmented generation in obtaining the correct answer. I fail to see an application where both generated and retrieval context are provided simultaneously to LLMs. I mean, if the retrieval augmented systems are accurate, why would one need generated contexts.

The generated content is also often sensitive to the prompt. It is not mentioned in the paper what prompt was used to generate the answers. What happens when one explicitly prompts the model to only use the retrieved context to answer the question? Also, how did you ensure that the generated answer exactly matches the ground-truth answer? I mean, the model could have answered the question with a phrase which might contain the right answer. However, the exact match metric might not reflect that.

Comments, Suggestions And Typos:

See weaknesses.

Soundness: 3 = Acceptable: This study provides sufficient support for its major claims/arguments. Some minor points may need extra support or details.
Overall Assessment: 3 = Good: This paper makes a reasonable contribution, and might be of interest for some (broad or narrow) sub-communities, possibly with minor revisions.
Confidence: 4 = Quite sure. I tried to check the important points carefully. It's unlikely, though conceivable, that I missed something that should affect my ratings.
Best Paper: No
Limitations And Societal Impact:

N/A

Ethical Concerns:

N/A

Needs Ethics Review: No
Reproducibility: 5 = They could easily reproduce the results.
Datasets: 3 = Potentially useful: Someone might find the new datasets useful for their work.
Software: 3 = Potentially useful: Someone might find the new software useful for their work.
Knowledge Of Or Educated Guess At Author Identity: No
Knowledge Of Paper: N/A, I do not know anything about the paper from outside sources
Knowledge Of Paper Source: N/A, I do not know anything about the paper from outside sources
Impact Of Knowledge Of Paper: N/A, I do not know anything about the paper from outside sources

Part (1/2)

ACL ARR 2024 February Paper316 Authors · Hexiang Tan (privately revealed to you)
29 Mar 2024, 20:06 (modified: 29 Mar 2024, 23:47) · ACL ARR 2024 February Paper316 Official Comment · Readers: Program Chairs, Paper316 Senior Area Chairs, Paper316 Area Chairs, Paper316 Reviewers Submitted, Paper316 Authors
Comment:

We sincerely appreciate your valuable comments.

Q1: Putting the related work section in the appendix is a bit unnecessary.

A1: Regarding related work, we currently introduce only the few most relevant works in Section 2.1. We apologize for any inconvenience this may have caused. We will streamline the related work section and incorporate it into the main text.

Q2: One important thing which could impact the results is contamination. TriviaQA and NaturalQA datasets are publicly available and it is highly likely that the LLM has already seen these datasets during pre-training which might influence the answers that it generates. So it is a bit unclear what the LLM is referring to when answering a question. The authors do allude to it in section 5.1 but still it does not sufficiently address the issue. What if the retrieved context matches the parametric knowledge?

A2: Thanks for your valuable suggestions.

(1) In our work, we examine which answers (those provided by the generated or retrieved contexts) LLMs tend to select, to reveal how LLMs use both types of contexts. The fact that LLMs have seen these questions does not affect the conclusions of the paper, because we observe that LLMs tend to trust generated contexts even when the generated contexts provide incorrect information. Assuming LLMs have seen these questions and know the correct answers, the fact that they still select the incorrect answer from generated contexts on AIR only indicates that the revealed bias is even more serious.

(2) About the assumption “In Section 5.1, what if the retrieved context matches the parametric knowledge?”: only a small number of samples present this scenario. Concretely, we select samples from the AIR dataset in Section 5.1 based on whether they satisfy “Ret-Ans = LLM-Ans”.

  • LLM-Ans refers to the answer produced by the LLM when only the question is input without any context, which reflects the LLM's own knowledge.
  • Gen-Ans and Ret-Ans are the answers provided by the generated context and retrieved contexts, respectively.

The table below shows the LLMs' DiffGR when both counter-memory and retrieved contexts are input, on the AIR dataset (with Llama2-13b).

  • The proportion of samples where the retrieved context matches the parametric knowledge is relatively small, with 131 out of a total of 883 instances. Even when the retrieved context matches the parametric knowledge (Ret-Ans = LLM-Ans), LLMs still show significant bias towards generated contexts. This indicates that the bias towards generated contexts is very serious.
  • When excluding the samples where the retrieved context matches the parametric knowledge (Ret-Ans ≠ LLM-Ans), the LLMs' bias also changes very little.
| Subset | DiffGR | Number of samples |
| --- | --- | --- |
| Original result in Table 3 | 0.6468 | 883 |
| Ret-Ans ≠ LLM-Ans | 0.6248 | 752 |
| Ret-Ans = LLM-Ans | 0.7460 | 131 |
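As an illustration, this subset analysis can be sketched as below. Note that the DiffGR formula in the sketch is only an approximation for illustration (the difference between the rates of matching Gen-Ans and Ret-Ans, normalized by their sum); the exact definition is the one given in the paper.

```python
# Sketch of the subset analysis above: split AIR samples by whether the retrieved
# answer matches the LLM's closed-book answer, then recompute DiffGR per subset.
# The diff_gr() formula is an illustrative approximation, not necessarily the
# paper's exact definition.

def diff_gr(samples):
    gen = sum(s["response"] == s["gen_ans"] for s in samples)
    ret = sum(s["response"] == s["ret_ans"] for s in samples)
    return (gen - ret) / (gen + ret) if (gen + ret) else 0.0

def split_by_parametric_match(samples):
    match = [s for s in samples if s["ret_ans"] == s["llm_ans"]]
    nomatch = [s for s in samples if s["ret_ans"] != s["llm_ans"]]
    return {
        "Ret-Ans = LLM-Ans": (diff_gr(match), len(match)),
        "Ret-Ans != LLM-Ans": (diff_gr(nomatch), len(nomatch)),
    }
```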

(3) To quantify the effect of internal parametric knowledge more clearly, we add further analysis. We select the cases where the answers from the retrieved context, the generated context, and the LLM's internal knowledge are all different from one another (Gen-Ans ≠ Ret-Ans ≠ LLM-Ans). These cases are selected from the AIR datasets in Section 4.2, where the generator differs from the reader. The two conditions, “generator ≠ reader” and “Gen-Ans ≠ Ret-Ans ≠ LLM-Ans”, isolate the influence of internal parametric knowledge. Based on these cases, we can clearly check which knowledge LLMs really rely on.

Result: The table below shows the proportion of the LLM's outputs that exactly match Gen-Ans, Ret-Ans, or LLM-Ans. We can see that:

  • (i) The proportion of choosing the LLM's internal knowledge (LLM-Ans) is very small (~1%). This result indicates that, given external context, LLMs do indeed rely heavily on external context.
  • (ii) LLMs still show a significant preference for generated contexts, with the selection ratio Gen-Ans > Ret-Ans > LLM-Ans. This means that even excluding the influence of internal knowledge, the bias discovered in our paper remains very significant.

Conclusion: The conclusion of our paper does not change after excluding the influence of internal knowledge.

| Generator | Reader | Gen-Ans | Ret-Ans | LLM-Ans | Number of samples |
| --- | --- | --- | --- | --- | --- |
| GPT-3.5 | Llama2-13b | 62.39% | 26.76% | 1.10% | 553 |
| Llama2-7b | Llama2-13b | 69.16% | 18.96% | 1.86% | 1018 |
| Llama2-13b | GPT-3.5 | 67.22% | 15.34% | 1.20% | 665 |
| Llama2-7b | GPT-3.5 | 65.55% | 16.24% | 1.13% | 708 |
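For reference, the answer-source attribution behind this table can be sketched as follows; the normalization details are illustrative assumptions rather than the paper's exact code.

```python
# Sketch of the attribution above: each LLM response is matched (exact match
# after light normalization) against the three candidate answers.

def normalize(ans: str) -> str:
    return ans.strip().lower().rstrip(".")

def attribute_responses(samples):
    counts = {"Gen-Ans": 0, "Ret-Ans": 0, "LLM-Ans": 0, "Other": 0}
    for s in samples:  # every sample satisfies Gen-Ans != Ret-Ans != LLM-Ans
        resp = normalize(s["response"])
        if resp == normalize(s["gen_ans"]):
            counts["Gen-Ans"] += 1
        elif resp == normalize(s["ret_ans"]):
            counts["Ret-Ans"] += 1
        elif resp == normalize(s["llm_ans"]):
            counts["LLM-Ans"] += 1
        else:
            counts["Other"] += 1
    total = len(samples)
    return {k: v / total for k, v in counts.items()}
```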

Part (2/2)

ACL ARR 2024 February Paper316 Authors · Hexiang Tan (privately revealed to you)
29 Mar 2024, 20:05 (modified: 29 Mar 2024, 23:52) · ACL ARR 2024 February Paper316 Official Comment · Readers: Program Chairs, Paper316 Senior Area Chairs, Paper316 Area Chairs, Paper316 Reviewers Submitted, Paper316 Authors
Comment:

Q3: I also don't completely understand the utility of the results. Numerous studies have demonstrated the effectiveness of retrieval augmented generation in obtaining the correct answer. I fail to see an application where both generated and retrieval context are provided simultaneously to LLMs. I mean, if the retrieval augmented systems are accurate, why would one need generated contexts.

A3: Thanks for this question. There are already several works [1][2][3] that input both types of contexts simultaneously and achieve better results than retrieval alone. Furthermore, retrieval is not perfect, and a considerable number of questions (on average 10.8%, Table 1 in our paper) cannot be addressed with retrieved contexts but can be addressed with the generated context. The combined use of both therefore has the potential to achieve significant performance improvements.

Moreover, as the amount of LLM-generated content on the internet increases, even when using RAG alone, the retrieved information may include generated content [4][5]. At this point, how LLMs treat retrieved (human-written) and generated content becomes a critical issue. This work finds that current LLMs exhibit bias when processing these two types of information.

[1] Yu W, Iter D, Wang S, et al. Generate rather than Retrieve: Large Language Models are Strong Context Generators[C]//The Eleventh International Conference on Learning Representations. 2022.

[2] Zhang Y, Khalifa M, Logeswaran L, et al. Merging Generated and Retrieved Knowledge for Open-Domain QA[C]//Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023: 4710-4728.

[3] Abdallah A, Jatowt A. Generator-retriever-generator: A novel approach to open-domain question answering[J]. arXiv preprint arXiv:2307.11278, 2023.

[4] Pan Y, Pan L, Chen W, et al. On the Risk of Misinformation Pollution with Large Language Models[C]//Findings of the Association for Computational Linguistics: EMNLP 2023. 2023: 1389-1403.

Q4: (1) It is not mentioned in the paper what prompt was used to generate the answers. What happens when one explicitly prompts the model to only use the retrieved context to answer the question.

A4: (1) We present the prompt used for generating answers in Figure 4. We apologize for the prompt not being sufficiently prominent; we will modify it to be displayed more conspicuously within the main text.

Specifically, when answering, we restrict the generated answers to be a single entity: "Refer to the context below and answer the following question with just one entity. context: {#contexts} question: {#question} The answer is".

(2) Specifying in the prompt to only use the retrieved context may indeed affect the model's behavior. However, this may cause the model to overlook correct information when only the generated context is correct (the AIG subsets in the paper). We think ideal LLMs should be capable of utilizing correct information from both sources. We next plan to use different prompts to try to mitigate the impact of this bias, and if space permits, we will include the results in subsequent revisions.

Q5: Also, how did you ensure that the generated answer exactly matches the ground-truth answer. I mean, the model could have answered the question in a phrase which might contain the right answer. However, the exact match metric might not reflect that.

A5: We use exact match because it is a commonly used QA evaluation metric. We also acknowledge that the exact match metric has some drawbacks, e.g., the response may contain the right answer within a longer phrase, as you mentioned. However, since we instruct the model to answer with a very short phrase when generating answers, this situation is relatively rare.

Furthermore, we employ ChatGPT to determine whether the generated answer matches the correct answer, using the prompt shown below:

Determine if the meaning of 'Answer' is exactly the same as any of the 'Golden Answers'. Answer: yes or no. Question: {question} Answer: {response} Golden Answers: {answer}

Then, we compare ChatGPT's judgments with those of exact match. In a random sample of 500 instances, 94% of the results were consistent. This indicates that exact match is acceptable in our scenario.

Note that we only use exact match against the ground truth to filter the dataset. The very small number of responses that cannot be matched to the ground truth only affects the number of samples in the dataset and does not affect the findings of this paper. Our metric DiffGR compares the LLM's answer with the candidate answers provided by the generated and retrieved contexts; this process does not involve comparison with the ground truth. Also, regarding the effectiveness of matching the answer with the candidate answers, Table 14 shows that most answers exactly match one of the candidate answers (the proportion of others is very small, which does not affect the conclusion).
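A minimal sketch of this agreement check is shown below; `ask_chatgpt` is a hypothetical helper that returns the judge's "yes"/"no" verdict, and the normalization is illustrative.

```python
# Sketch of the exact-match vs. ChatGPT-as-judge agreement check described above.
import random

JUDGE_PROMPT = (
    "Determine if the meaning of 'Answer' is exactly the same as any of the "
    "'Golden Answers'. Answer: yes or no. "
    "Question: {question} Answer: {response} Golden Answers: {answers}"
)

def exact_match(response, golden_answers):
    return any(response.strip().lower() == g.strip().lower() for g in golden_answers)

def judge_agreement(samples, ask_chatgpt, k=500, seed=0):
    random.seed(seed)
    subset = random.sample(samples, k)
    agree = 0
    for s in subset:
        em_verdict = exact_match(s["response"], s["golden_answers"])
        judge = ask_chatgpt(JUDGE_PROMPT.format(
            question=s["question"], response=s["response"],
            answers=s["golden_answers"]))
        judge_verdict = judge.strip().lower().startswith("yes")
        agree += int(em_verdict == judge_verdict)
    return agree / k  # ~0.94 in the 500-sample check reported above
```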


Dear Reviewer #2

ACL ARR 2024 February Paper316 Authors · Hexiang Tan (privately revealed to you)
29 Mar 2024, 23:52 · ACL ARR 2024 February Paper316 Official Comment · Readers: Program Chairs, Paper316 Senior Area Chairs, Paper316 Area Chairs, Paper316 Reviewers Submitted, Paper316 Authors
Comment:

Should our responses satisfactorily address your concerns, we would greatly appreciate it if you consider increasing our score. Alternatively, if there are any other issues, we would be happy to answer them.


After rebuttal

ACL ARR 2024 February Paper316 Reviewer #2
31 Mar 2024, 14:46 · ACL ARR 2024 February Paper316 Official Comment · Readers: Program Chairs, Paper316 Senior Area Chairs, Paper316 Area Chairs, Paper316 Reviewers Submitted, Paper316 Authors
Comment:

I thank the authors for their reply. I have changed the scores accordingly.


Thanks for reviewer

ACL ARR 2024 February Paper316 Authors · Hexiang Tan (privately revealed to you)
31 Mar 2024, 15:46 · ACL ARR 2024 February Paper316 Official Comment · Readers: Program Chairs, Paper316 Senior Area Chairs, Paper316 Area Chairs, Paper316 Reviewers Submitted, Paper316 Authors
Comment:

Thank you very much for your reply. We're really encouraged by your positive feedback and your willingness to recommend acceptance, which means a lot to us.


Official Review of Paper316 by Reviewer #3

ACL ARR 2024 February Paper316 Reviewer #3
14 Mar 2024, 11:25 (modified: 31 Mar 2024, 08:50) · ACL ARR 2024 February Paper316 Official Review · Readers: Program Chairs, Paper316 Senior Area Chairs, Paper316 Area Chairs, Paper316 Reviewers Submitted, Paper316 Authors
Recommended Process Of Reviewing: I have read the instructions above
Paper Summary:

This paper aims to answer which context LLMs rely more on when doing open-domain QA, generated or retrieved, when both are present but contain contradictory answers. This is tested by creating a new dataset, CC, through filtering, which consists of two parts: AIG (generated context is correct, retrieved not) and AIR (the opposite). An interesting observation is a strong tendency for LLMs to provide answers similar to what is provided in the generated context. The authors further examine possible causes: confirmation bias, text similarity, and context completeness.

Summary Of Strengths:
  • This paper reiterates LLMs' preference for relying on parametric knowledge / generated context and ignoring retrieved context, highlighting that RAG alone may be insufficient for improving factuality.
  • Furthermore, findings of the paper (the correlation between completeness & preference and similarity & preference) provide guidance on improving RAG systems.
Summary Of Weaknesses:
  • It is unclear, and contradictory how the authors see the relation between internal parametric knowledge and generated context. Line 097 states these two are equivalent, yet the discussion of traceability (section 3.1) seems to be talking about finding instances where LLM relies solely on generated context rather than internal parametric knowledge.
    • If treating the two as equivalent, the problem is reduced to which source, external context or intrinsic knowledge, matters - a question already answered by Xie et al. (2023) that was cited and other prior work.
    • If treating the two as different - the effect of internal parametric knowledge needs to be more carefully controlled - while in the paper it’s either ignored or inadequately controlled.
  • “Which source do LLMs rely on to answer questions” cannot be answered without studying three factors: whether they rely on internal parametric knowledge, generated context, and/or retrieved context. The paper downplays / ignores the effect of internal parametric knowledge:
    • lines 284-292 claim that previous works conclude that internal parametric knowledge does not matter in the presence of contexts, which is not what Xie et al. (2023) - the work cited - says in the abstract.
    • The conclusion “LLMs prefer self-generated context” (section 4.1) can not be reached without ruling out the case that LLMs rely solely on parametric knowledge.
    • Section 5.1: what about the case where the parametric knowledge has the correct answer, but generated context is hallucinated?
  • The effect of context completeness (section 5.3) is not decoupled from the effect of text similarity (section 5.2). These two are likely correlated and the readers are left unsure which of them actually caused LLMs to prefer generated contexts.
  • Are the two causes (context completeness and text similarity) not covered in Xie et al. (2023) - their first point in the abstract?

On the one hand, different from prior wisdom, we find that LLMs can be highly receptive to external evidence even when that conflicts with their parametric memory, given that the external evidence is coherent and convincing. On the other hand, LLMs also demonstrate a strong confirmation bias when the external evidence contains some information that is consistent with their parametric memory, despite being presented with conflicting evidence at the same time. - Abstract of Xie et al. (2023)

The author response mostly settled the concerns here. I still believe the responses and discussions need to be incorporated in a revision to make the paper sound.

Comments, Suggestions And Typos:

Refrain from using full citations as sentence constituents. Instead of

(Gusfield, 1997) showed that …
In (Gusfield, 1997), …

write

Gusfield (1997) showed that …
In Gusfield (1997), …

Soundness: 3 = Acceptable: This study provides sufficient support for its major claims/arguments. Some minor points may need extra support or details.
Overall Assessment: 2.5
Confidence: 3 = Pretty sure, but there's a chance I missed something. Although I have a good feel for this area in general, I did not carefully check the paper's details, e.g., the math or experimental design.
Best Paper: No
Ethical Concerns:

None

Reproducibility: 4 = They could mostly reproduce the results, but there may be some variation because of sample variance or minor variations in their interpretation of the protocol or method.
Datasets: 3 = Potentially useful: Someone might find the new datasets useful for their work.
Software: 3 = Potentially useful: Someone might find the new software useful for their work.
Knowledge Of Or Educated Guess At Author Identity: No
Knowledge Of Paper: N/A, I do not know anything about the paper from outside sources
Knowledge Of Paper Source: N/A, I do not know anything about the paper from outside sources
Impact Of Knowledge Of Paper: N/A, I do not know anything about the paper from outside sources

Part (1/4)

ACL ARR 2024 February Paper316 Authors · Hexiang Tan (privately revealed to you)
29 Mar 2024, 19:58 (modified: 29 Mar 2024, 22:45) · ACL ARR 2024 February Paper316 Official Comment · Readers: Program Chairs, Paper316 Senior Area Chairs, Paper316 Area Chairs, Paper316 Reviewers Submitted, Paper316 Authors
Comment:

Thank you for your thorough comments. We deeply value your feedback.

Q1: It is unclear, and contradictory how the authors see the relation between internal parametric knowledge and generated context. Line 097 states these two are equivalent, yet the discussion of traceability (section 3.1) seems to be talking about finding instances where LLM relies solely on generated context rather than internal parametric knowledge.

  • If treating the two as equivalent, the problem is reduced to which source, external context or intrinsic knowledge, matters - a question already answered by Xie et al. (2023) that was cited and other prior work.
  • If treating the two as different - the effect of internal parametric knowledge needs to be more carefully controlled - while in the paper it’s either ignored or inadequately controlled.

A1: (1) We appreciate the reviewer's insightful feedback and the critical issue you've highlighted regarding the relation between internal parametric knowledge and generated context. We apologize for any confusion caused by our current writing.

To clarify, our work does not equate parametric knowledge with generated context. As you rightly noted, ensuring consistency between the generated context and the LLM's internal parametric knowledge is challenging.

Our research aims to explore whether LLMs exhibit bias when integrating two types of external input contexts (generated and retrieved); the generated context is itself a special type of input context. This question is becoming increasingly critical as LLMs generate more and more of the content accessible on the World Wide Web. In this setting, our findings indicate a tendency for LLMs to favor generated content, regardless of its correctness or origin (generated by themselves or by other LLMs).

(2) To quantify the effect of internal parametric knowledge more clearly, we add further analysis. We select the cases where the answers from the retrieved context, the generated context, and the LLM's internal knowledge are all different from one another (Gen-Ans ≠ Ret-Ans ≠ LLM-Ans). These cases are selected from the AIR datasets in Section 4.2, where the generator differs from the reader. The two conditions, “generator ≠ reader” and “Gen-Ans ≠ Ret-Ans ≠ LLM-Ans”, isolate the influence of internal parametric knowledge. Based on these cases, we can clearly check which knowledge LLMs really rely on.

  • LLM-Ans refers to the answer produced by the LLM when only the question is input without any context, which reflects the LLM's own knowledge.
  • Gen-Ans and Ret-Ans are the answers provided by the generated context and retrieved contexts, respectively.

Result: The table below shows the proportion of the LLM's outputs that exactly match Gen-Ans, Ret-Ans, or LLM-Ans. We can see that:

  • (i) The proportion of choosing the LLM's internal knowledge (LLM-Ans) is very small (~1%). This result indicates that, given external context, LLMs do indeed rely heavily on external context.
  • (ii) LLMs still show a significant preference for generated contexts, with the selection ratio Gen-Ans > Ret-Ans > LLM-Ans. This means that even excluding these samples potentially influenced by internal knowledge, the bias discovered in our paper remains very significant.

Conclusion: The conclusion of our paper does not change after excluding the influence of internal knowledge, i.e., LLMs still blindly trust the generated contexts.

| Generator | Reader | Gen-Ans | Ret-Ans | LLM-Ans | Number of samples |
| --- | --- | --- | --- | --- | --- |
| GPT-3.5 | Llama2-13b | 62.39% | 26.76% | 1.10% | 553 |
| Llama2-7b | Llama2-13b | 69.16% | 18.96% | 1.86% | 1018 |
| Llama2-13b | GPT-3.5 | 67.22% | 15.34% | 1.20% | 665 |
| Llama2-7b | GPT-3.5 | 65.55% | 16.24% | 1.13% | 708 |

We acknowledge that our previous results have some interference from the LLMs' internal knowledge, and we apologize for any confusion this caused. However, after excluding this interference, our findings do not change (LLMs still prefer generated contexts). We will update the paper with results that exclude the influence of internal parametric knowledge, and we will add this discussion about the effect of internal knowledge to the paper. Regarding this issue, if you have any other concerns, we would be happy to address them.


Part (2/4)

ACL ARR 2024 February Paper316 Authors · Hexiang Tan (privately revealed to you)
29 Mar 2024, 19:58 (modified: 29 Mar 2024, 20:35) · ACL ARR 2024 February Paper316 Official Comment · Readers: Program Chairs, Paper316 Senior Area Chairs, Paper316 Area Chairs, Paper316 Reviewers Submitted, Paper316 Authors
Comment:

Q2: “Which source do LLMs rely on to answer questions” cannot be answered without studying three factors: whether they rely on internal parametric knowledge, generated context, and/or retrieved context. The paper downplays / ignores the effect of internal parametric knowledge.

A2: As shown in the reply to Q1(2), we discuss the impact of the three factors more clearly. The results indicate that the proportion of LLMs relying on internal parametric knowledge is very small (~1%) in our scenario. Our conclusion, "LLMs show significant preference for generated contexts," still holds, even when the influence of internal parametric knowledge is excluded.

Q3: lines 284-292 claim that previous works conclude that internal parametric knowledge does not matter in the presence of contexts, which is not what Xie et al. (2023) - the work cited - says in the abstract.

A3: Thank you very much for your detailed feedback. The following is the original statement from Xie et al. (2023):

  • Experiment conclusion in Xie et al. (2023) (page 6, line 5 from the bottom): “LLMs are actually highly receptive to external evidence if it is presented in a coherent way, even though it conflicts with their parametric memory.”

  • Abstract in Xie et al. (2023): “On the one hand, different from prior wisdom, we find that LLMs can be highly receptive to external evidence even when that conflicts with their parametric memory, given that the external evidence is coherent and convincing.”

Our understanding of Xie et al. (2023) is that when external evidence is introduced, LLMs tend to rely more on this external evidence, even when it is inconsistent with their internal knowledge. To clarify, we do not claim that internal parametric knowledge is unimportant; rather, its impact on our conclusion (LLMs prefer generated contexts) is relatively minor.

We quantify "the effect of internal parametric knowledge" more clearly in our reply to Q1(2). The results show that the vast majority of LLMs' answers are derived from the context, with only about 1% of the answers originating from the LLMs' internal knowledge. Even excluding its impact does not affect our findings and conclusions (LLMs still prefer generated contexts).

Q4: The conclusion “LLMs prefer self-generated context” (section 4.1) can not be reached without ruling out the case that LLMs rely solely on parametric knowledge.

A4: We appreciate this important question. We specifically discuss it in Section 5.1 by disrupting the consistency between generated contexts and LLMs' parametric knowledge. Concretely, we prompt LLMs to make up a special generated context that supports a same-type yet different answer compared to the LLM's parametric knowledge (with several checks to ensure this). Our results show that LLMs still have a significant bias towards generated contexts, even when the generated contexts contradict parametric knowledge. This suggests that LLMs' preference for generated contexts is not because LLMs rely solely on parametric knowledge.

In fact, the generated context is not limited to self-generated context. LLMs also prefer contexts generated by other LLMs, which are not directly related to this LLM's parametric knowledge. The reply to Q1(2) further supports this point by selecting the cases where the generated context, the LLM's parametric knowledge, and the retrieved context support three different answers. Even in this situation, LLMs still mostly tend to rely on generated contexts, with only about 1% of the answers stemming from the LLMs' parametric knowledge.


Part (3/4)

ACL ARR 2024 February Paper316 Authors · Hexiang Tan (privately revealed to you)
29 Mar 2024, 19:57 (modified: 29 Mar 2024, 22:59) · ACL ARR 2024 February Paper316 Official Comment · Readers: Program Chairs, Paper316 Senior Area Chairs, Paper316 Area Chairs, Paper316 Reviewers Submitted, Paper316 Authors
Comment:

Q5: Section 5.1: what about the case where the parametric knowledge has the correct answer, but generated context is hallucinated?

A5:
We select the data that conforms to this hypothesis based on Section 5.1. Specifically, we select the questions that LLMs can answer correctly without any contexts; these questions reflect the situation where "the parametric knowledge has the correct answer", as you mentioned. Since the counter-memory contexts in this section provide an incorrect answer (Ctr-Ans), they partly resemble the case where "the generated context is hallucinated", as you mentioned. In the end, we obtain only 136 such cases (the original dataset size is 883). In these cases, LLMs still tend to select incorrect answers from the counter-memory contexts (Ctr-Ans) while disregarding the correct answers provided by both the retrieved contexts and internal knowledge. This suggests that our findings still hold in this situation. This phenomenon also illustrates that LLMs' preference for generated contexts is a serious issue: even if the model is capable of providing the correct answer, it can still be misled by incorrect information in generated contexts.

| | Incorrect and exactly matches Ctr-Ans | Correct |
| --- | --- | --- |
| Ratio | 83.09% | 13.24% |

Q6: The effect of context completeness (section 5.3) is not decoupled from the effect of text similarity (section 5.2). These two are likely correlated and the readers are left unsure which of them actually caused LLMs to prefer generated contexts.

A6: When studying context completeness in Section 5.3, similarity is kept constant, with only completeness being varied. As shown in Table 4 (also shown below), the three types of generated contexts ("Nature", "Trunc.", and "S-Trunc.") have almost equivalent similarity and lengths, with the principal disparities in semantic and sentence completeness. We compare the changes in the degree of bias of LLMs towards these three types of generated contexts to assess the impact of completeness. When comparing the bias with "Nature vs. Ret" and "S-Trunc. vs. Ret" inputs, the only difference lies in the semantic completeness between the two types of generated contexts ("Nature" and "S-Trunc."). Similarly, when comparing the bias with "Trunc. vs. Ret" and "S-Trunc. vs. Ret" inputs, the only change is in sentence completeness. (The table below shows that the three types of generated contexts have similar average similarity, measured by both Jaccard and BERTScore.)

| Context | Length | Semantic Completeness | Sentence Completeness | Similarity (Jaccard) | Similarity (BERTScore) |
| --- | --- | --- | --- | --- | --- |
| Retrieved | 107.4 | No | No | 0.114 | 0.801 |
| Nature | 109.7 | Yes | Yes | 0.202 | 0.879 |
| Trunc. | 107.4 | No | No | 0.196 | 0.877 |
| S-Trunc. | 105.9 | No | Yes | 0.193 | 0.876 |
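As an aside, the Jaccard similarity in the table can be computed roughly as below; the whitespace tokenization is an assumption and may differ from the preprocessing actually used.

```python
# Illustrative question-context Jaccard similarity, used to verify that the
# "Nature", "Trunc.", and "S-Trunc." variants stay at nearly the same similarity.

def jaccard_similarity(question: str, context: str) -> float:
    q_tokens = set(question.lower().split())
    c_tokens = set(context.lower().split())
    if not q_tokens or not c_tokens:
        return 0.0
    return len(q_tokens & c_tokens) / len(q_tokens | c_tokens)
```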

Part (4/4)

ACL ARR 2024 February Paper316 Authors · Hexiang Tan (privately revealed to you)
29 Mar 2024, 19:56 (modified: 29 Mar 2024, 22:37) · ACL ARR 2024 February Paper316 Official Comment · Readers: Program Chairs, Paper316 Senior Area Chairs, Paper316 Area Chairs, Paper316 Reviewers Submitted, Paper316 Authors
Comment:

Q7: Are the two causes (context completeness and text similarity) not covered in Xie et al. (2023) - their first point in the abstract?

A7: After carefully checking Xie et al. (2023) again, we believe that the two causes (context completeness and text similarity) are not covered in Xie et al. (2023).

In our work, text similarity refers to the similarity of the context to the question. Completeness investigates the effect of incomplete sentences and semantics caused by fixed-length truncation in retrieval systems.

Xie et al. (2023) first consider the effect of coherence. Coherence refers to the internal consistency within a context, focusing on inconsistencies caused by entity substitution and negation injection. In our work, both retrieved and generated contexts are coherent, but they may provide incomplete information or show varying degrees of similarity to the question; their experiments therefore do not cover our findings. Xie et al. (2023) also discuss the effects of context length, the number of supporting pieces of evidence, noisy passages, and input order. None of these factors covers similarity or completeness.

We are very grateful for your detailed suggestions and apologize for the oversights. We greatly value your time and will make careful revisions and corrections.

  • Firstly, we will increase the font size to make the images clearer.

  • Then, we will meticulously check the citation format and revise all references, including the appendix.

Line 260: besides temperature, seed also needs to be fixed

In our code, we have fixed the seed for all experiments. We will add an explanation about this in the main text in our subsequent revisions. Thank you very much for your suggestion.
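For reference, a typical way to fix seeds in a Python/PyTorch experiment stack is sketched below; which of these libraries the experiments actually use is an assumption made here.

```python
# Typical seed fixing for reproducibility. Whether all of these libraries are
# involved in the experiments is an assumption made for this sketch.
import random
import numpy as np
import torch

def fix_seed(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```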

Should our responses satisfactorily address your concerns, we would greatly appreciate it if you consider increasing our score. Alternatively, if there are any other issues, we would be happy to answer them.


Comments after Author Response

ACL ARR 2024 February Paper316 Reviewer #3
31 Mar 2024, 08:53 · ACL ARR 2024 February Paper316 Official Comment · Readers: Program Chairs, Paper316 Senior Area Chairs, Paper316 Area Chairs, Paper316 Reviewers Submitted, Paper316 Authors
Comment:

Dear authors,

Thank you for your detailed response and experiments. The experiment results mostly resolved my concerns. I believe these questions and the discussions here are essential to the soundness of the paper.


Dear Reviewer #3

ACL ARR 2024 February Paper316 Authors · Hexiang Tan (privately revealed to you)
31 Mar 2024, 09:24 · ACL ARR 2024 February Paper316 Official Comment · Readers: Program Chairs, Paper316 Senior Area Chairs, Paper316 Area Chairs, Paper316 Reviewers Submitted, Paper316 Authors
Comment:

We are deeply grateful for your revised score. Also, it is encouraging to learn that our efforts to address your concerns were well-received. We are more than happy to discuss any further questions you may have.