As large language models (LLMs) often generate plausible but incorrect content, error detection has become increasingly critical to ensure truthfulness. However, existing detection methods often overlook a critical problem we term self-consistent error, where LLMs repeatedly generate the same incorrect response across multiple stochastic samples. This work formally defines self-consistent errors and evaluates mainstream detection methods on them. Our investigation reveals two key findings: (1) Unlike inconsistent errors, whose frequency diminishes significantly as LLM scale increases, the frequency of self-consistent errors remains stable or even increases. (2) All four types of detection methods significantly struggle to detect self-consistent errors. These findings reveal critical limitations in current detection methods and underscore the need for improved methods. Motivated by the observation that self-consistent errors often differ across LLMs, we propose a simple but effective cross-model probe method that fuses hidden state evidence from an external verifier LLM. Our method significantly enhances performance on self-consistent errors across three LLM families.
Meta Review of Submission435 by Area Chair
The paper studies self-consistent errors in LLMs, defined as cases where an LLM generates the same (semantically equivalent) incorrect response across multiple stochastic samples. The paper observes that, unlike inconsistent errors, the frequency of these errors remains stable or even increases with larger models. They also observe that the four types of detection methods they investigate in the paper significantly struggle to detect self-consistent errors. Based on the observation that self-consistent errors often differ across LLMs, they propose a method called "cross-model probe" that uses hidden state evidence from an external verifier LLM and significantly enhances error-detection performance on self-consistent errors across three LLM families.
The paper clearly defines and investigates an important and underexplored phenomenon, self-consistent errors, and shows that existing error detection methods have some way to go in handling them. The paper also proposes a practical method (cross-model probe) that improves error detection. The additional information and experimental results provided during the rebuttal address the reviewers' concerns, particularly in terms of (1) explaining some potential causes of self-consistent errors across different models (including qualitative examples of these errors), (2) analysis of the frequency of errors that are self-consistent across multiple different LLMs, and (3) robustness of the cross-model probe to the choice of verifier.
The paper should incorporate the additional information and results requested and provided during the rebuttal period, as well as clarity and readability fixes based on the reviewers' comments.
Summary of the Discussion Phase
Dear (S)ACs,
Thank you for your valuable guidance and dedication throughout the review process.
We are encouraged that all reviewers acknowledged the strengths of our work, including the clear problem formulation, rigorous experimental design, effective methodology, and valuable insights for the error detection field. By uncovering the limitation of current methods in detecting self-consistent errors and introducing an effective solution, we believe our work has highlighted a critical blind spot and will inspire the design of future methods in this field.
We have provided detailed clarifications and supplementary experiments addressing all raised concerns, most of which are minor in nature. ALL reviewers confirmed that their feedback was resolved by our responses, as reflected in their improved scores and positive follow-up comments.
However, some of Reviewer #3's comments may not fully align with the reviewing guidelines:
(1) Reviewer #3’s Weakness #3 appears inconsistent with "Reviewer Guidelines H3", as it questions the novelty of our method without providing citations or specific justification.
(2) Reviewer #3’s Weakness #2 may not fully align with "Reviewer Guidelines H16: Limitations ≠ Weaknesses", as it closely parallels points we explicitly acknowledged in our Limitations section.
As a short paper, our primary contribution lies in revealing the problem of self-consistent errors, demonstrating the failure of existing methods to detect them, and proposing an effective solution. While a comprehensive investigation into the root causes would be valuable, it would require access to pre-training data and substantially more pages. This is beyond the scope of this short paper, which, as noted in the Call For Papers, is intended for "a small, focused contribution that can be made in a few pages".
Nevertheless, during the rebuttal we have added extensive experiments and case studies to uncover several possible causes: (1) easily confusable concepts, (2) widespread misconceptions, (3) long-tail knowledge, and (4) effects of training stages. These additions have addressed Reviewer #3's concerns, as evidenced by the improved score.
Given these points, we respectfully request that our detailed rebuttal and additional contributions be considered in the meta-review process. We sincerely trust the (S)ACs’ judgment in ensuring a fair evaluation, and greatly appreciate your attention to these matters.
Best regards,
Authors
Request for Facilitating Discussion from Reviewers with Us
Dear (S)ACs,
We hope this message finds you well.
We are writing to respectfully request your help in encouraging reviewers to check our response. We have provided detailed clarifications and supplementary experiments to address their concerns, questions, and misunderstandings. However, as the discussion period draws to a close, we have not received any feedback from them on our rebuttal.
We deeply appreciate your consideration of our request and your valuable guidance throughout the review process.
Best regards,
Authors
Official Review of Submission435 by Reviewer #1
This paper investigates the self-consistent errors made by LLMs when performing QA tasks. They found that the number of such errors remained roughly constant as the model was scaled up. A simple detection method is presented, involving the use of an additional representation from a different LLM.
- interesting experiment with useful insight
- a simple method for improving error-detection
- nicely presented
The authors note that self-consistent errors are more important since their number remains roughly constant as the model size increases. I believe it is important to investigate two further issues: What is the transferability of self-consistent errors between different models? The authors provide one such result (lines 256–259), but more thorough investigations are needed. What about errors that are self-consistent across models? What is their frequency? Are they distinctive in any way? The existence of such errors could highlight some unresolved issues with LLM training.
The evaluation would benefit from an additional baseline: an unsupervised method (such as SE) that samples responses from different LLMs. This would verify the effectiveness of the Cross-Model Probe compared to the direct application of the 'Cross-Model' concept.
Minor
- "methshods" & other typos
- the authors apply an NLI criterion, but do not specify which NLI model was used
There are no concerns with this submission
Seeking your valuable feedback on our response
Dear Reviewer #1,
We hope this message finds you well.
We are deeply grateful for your thorough review and acknowledgment of our work. We have provided detailed clarifications and additional experiments in response to your valuable feedback.
As the discussion period draws to a close, we would like to hear your thoughts on our response, including whether it adequately addresses your concerns. If you have any updated thoughts, we would be grateful to hear them.
Thank you again for your time and insightful engagement.
Best regards,
Authors
Official Comment by Reviewer #1
Thank you for the rebuttal. I'll keep my score.
Replying to Official Comment by Reviewer #1
Heartfelt Gratitude to Reviewer #1
Dear Reviewer #1,
Thank you very much for your kind follow-up. We're glad to hear that our responses have addressed your concerns. If there are any remaining aspects you would like to discuss, we would be more than happy to engage further.
Best regards,
Authors
Response to reviewer #1
Dear Reviewer #1,
We sincerely appreciate your constructive review and supportive score. We are encouraged that you found our work insightful and nicely presented, particularly in providing a simple method for improving error-detection.
W #1. Transferability of self-consistent errors between different models. Frequency and distinctiveness of self-consistent errors across models.
(1) We analyze the frequency of shared self-consistent errors across different models, as shown in the table below. CE_A and CE_B denote the number of self-consistent errors for models A and B, respectively. Shared refers to questions where both models produce the same self-consistent error. The table shows that:
- Models from the same family tend to have a relatively high proportion of shared self-consistent errors (22.4%–38.2% on SciQ, 14.2%–24.9% on TriviaQA).
- Models from different families rarely share self-consistent errors (as low as 5.4% on TriviaQA).
| Model Pair (A vs B) | Dataset | CE_A | CE_B | Shared | Shared / CE_A | Shared / CE_B |
|---|---|---|---|---|---|---|
| Llama3.1-8b vs Llama3.1-70b | SciQ | 1038 | 800 | 272 | 26.2 % | 34.0 % |
| Qwen2.5-7b vs Qwen2.5-72b | SciQ | 952 | 557 | 213 | 22.4 % | 38.2 % |
| Qwen2.5-7b vs Llama3.1-8b | SciQ | 952 | 1038 | 244 | 25.6 % | 23.5 % |
| Qwen2.5-7b vs Llama3.1-70b | SciQ | 952 | 800 | 151 | 15.9 % | 18.9 % |
| Llama3.1-8b vs Llama3.1-70b | TQA | 3077 | 2665 | 663 | 21.5 % | 24.9 % |
| Qwen2.5-7b vs Qwen2.5-72b | TQA | 4638 | 3379 | 657 | 14.2 % | 19.4 % |
| Qwen2.5-7b vs Llama3.1-8b | TQA | 4638 | 3077 | 464 | 10.0 % | 15.1 % |
| Qwen2.5-7b vs Llama3.1-70b | TQA | 4638 | 2665 | 251 | 5.4 % | 9.4 % |
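As an illustration of how the "Shared" column above could be computed (helper names are hypothetical; `semantically_equivalent` denotes the NLI-based mutual-entailment check used in the paper):

```python
def count_shared_errors(errors_a: dict[str, str], errors_b: dict[str, str]) -> int:
    """Count questions on which models A and B make the same self-consistent error.

    errors_a / errors_b map a question id to the (incorrect) consistent answer
    that model A / model B produces for that question.
    """
    shared = 0
    for qid, answer_a in errors_a.items():
        answer_b = errors_b.get(qid)
        if answer_b is not None and semantically_equivalent(answer_a, answer_b):
            shared += 1
    return shared
```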
(2) We manually examined a set of self-consistent errors shared across models. Some of these errors are associated with widespread misconceptions on the internet, especially in informal sources such as blogs. If such content is included in an LLM’s training data, it may lead to self-consistent errors. Here is a representative example:
- Question: Which is the lightest of the widely used structural metals?
- Gold answer: magnesium
- Self-Consistent Error: Aluminum.
- Explanation: Misconceptions supporting this error can be found on many blogs and informal articles. For example, one article states, "Aluminum is the lightest structural metal, with a density of just one-third that of steel." Another blog, with an imprecise title, refers to both as the lightest structural metals: "Aluminum vs Magnesium: The Lightest Structural Metals Compared."
This is merely one possible cause we observed through case studies, and other factors may also exist. Moreover, rigorously verifying and quantifying how self-consistent errors stem from misconceptions would require access to the full training corpus. This falls beyond the scope of this short paper, and we will explore it in future work.
Response to reviewer #1
W #2. Comparing with an unsupervised method (such as SE) that samples responses from different LLMs.
Thank you for this helpful suggestion. In the error detection task, supervised methods tend to significantly outperform unsupervised ones[1]. For this reason, we use the strong supervised probe as our primary baseline in the paper.
Following your suggestion, we also compare with the cross-model SE method.
- (1) Our method significantly outperforms the cross-model SE in detecting self-consistent errors (>10% absolute AUROC), as shown in the table below.
- (2) Our method also shows a significant efficiency advantage: cross-model SE requires 10 additional inferences from both the original LLM and the verifier, while our method only needs one extra inference from the verifier.
Experimental details: We implement cross-model SE by sampling 10 responses from both the original LLM and the verifier, then combining these 20 samples to compute the SE score (a minimal sketch of this computation follows the table). For a fair comparison, all methods use Qwen2.5-14B as the verifier.
| Model | Method | SciQ-CE | TriviaQA-CE |
|---|---|---|---|
| Llama3.1-8b | SE | 0.4608 | 0.5216 |
| Llama3.1-8b | cross-model SE | 0.7306 | 0.7401 |
| Llama3.1-8b | ours | 0.8659 (+13.53%) | 0.8470 (+10.69%) |
| Qwen2.5-7b | SE | 0.4782 | 0.4453 |
| Qwen2.5-7b | cross-model SE | 0.6567 | 0.7771 |
| Qwen2.5-7b | ours | 0.8399 (+18.32%) | 0.9088 (+13.17%) |
[1] Factual confidence of LLMs: On reliability and robustness of current estimators. ACL. 2024
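For concreteness, the following is a minimal sketch of the cross-model SE computation described above, assuming the standard semantic-entropy recipe of clustering samples by NLI-based mutual entailment (`semantically_equivalent` denotes that check) and using the discrete, frequency-based entropy variant; names and details such as likelihood weighting are illustrative, not the exact implementation.

```python
import math
from collections import Counter

def cluster_by_meaning(responses: list[str]) -> list[int]:
    # Greedily assign each response to the first cluster whose representative
    # it is mutually entailed with (the NLI check described in the paper).
    representatives, labels = [], []
    for response in responses:
        for idx, rep in enumerate(representatives):
            if semantically_equivalent(response, rep):
                labels.append(idx)
                break
        else:
            representatives.append(response)
            labels.append(len(representatives) - 1)
    return labels

def discrete_semantic_entropy(responses: list[str]) -> float:
    # Entropy over semantic clusters using empirical cluster frequencies.
    # (The original SE formulation weights clusters by sequence likelihoods.)
    labels = cluster_by_meaning(responses)
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Cross-model SE: pool 10 samples from the response LLM with 10 from the verifier.
# score = discrete_semantic_entropy(samples_from_response_llm + samples_from_verifier)
```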
S #1: "methshods" & other typos. Which NLI model was used specifically.
- (1) We will correct these typos and carefully double-check the entire paper. We sincerely apologize for the confusion arising from the typos.
- (2) For the NLI model, we specifically use 'microsoft/deberta-v2-xlarge-mnli', and will include this detail in the revised version.
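For reference, the following is a minimal sketch of how mutual entailment could be checked with this NLI model (a sketch assuming the standard Hugging Face interface; the exact input formatting used in our pipeline, e.g., whether the question is prepended to each answer, is omitted here):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NLI_MODEL = "microsoft/deberta-v2-xlarge-mnli"
tokenizer = AutoTokenizer.from_pretrained(NLI_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(NLI_MODEL).eval()

def entails(premise: str, hypothesis: str) -> bool:
    # Predict the NLI label for (premise, hypothesis) and check for entailment.
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        pred = model(**inputs).logits.argmax(dim=-1).item()
    # Read the entailment class index from the model config rather than hard-coding it.
    entail_id = {label.lower(): idx for idx, label in model.config.id2label.items()}["entailment"]
    return pred == entail_id

def semantically_equivalent(a: str, b: str) -> bool:
    # Two responses are treated as equivalent iff each entails the other.
    return entails(a, b) and entails(b, a)
```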
We hope these clarifications have fully addressed your concerns. If you have any further suggestions, please feel free to reach out.
Official Review of Submission435 by Reviewer #2
This paper identifies and studies "self-consistent errors" in LLMs—cases where models repeatedly generate semantically equivalent incorrect answers across multiple samples—and shows that unlike inconsistent errors, these remain stable or even increase with model scale. The authors demonstrate that all existing error detection methods struggle with self-consistent errors and propose a cross-model probe that leverages an external verifier LLM's hidden states to significantly improve detection performance.
- The problem formulation of self-consistent errors is clear.
- Human-verified LLM-as-a-judge approach to detect semantically equivalent responses (Appendix A.5).
Major concern:
- The λ parameter varies dramatically across verifiers (0.25 to 1.00 in Table 2), indicating high sensitivity to verifier choice. This requires labeled development data for tuning each model-verifier pair, significantly limiting practical applicability.
Needs more work during rebuttal:
- The paper identifies self-consistent errors but doesn't investigate their root causes. Are these due to annotation errors in the ground truth for SciQ and TriviaQA? Model-specific knowledge misconceptions? What are the cases where different LLM families share the same self-consistent errors (as a subset of model-specific cases)? Showing some case study would be more helpful.
- The writing in Section 4 (experiment results section) is not clear enough. Table 2 and its corresponding interpretation are hard to follow.
Minor concerns:
- Only two QA datasets (SciQ and TriviaQA) are used. How do self-consistent errors manifest in reasoning, code generation, mathematics, or creative tasks? For example, RealMistake [1] provides more fine-grained error categories for other types of tasks.
[1] Kamoi, R., Das, S. S. S., Lou, R., Ahn, J. J., Zhao, Y., Lu, X., ... & Zhang, R. Evaluating LLMs at Detecting Errors in LLM Responses. In First Conference on Language Modeling.
line 17: methshods
Table 2 needs to be polished, only one score has been highlighted.
Why are the "res only" scores in Table 2 all the same?
There are no concerns with this submission
Seeking your valuable feedback on our response
Dear Reviewer #2,
We hope this message finds you well.
We are deeply grateful for your thorough review. We have provided detailed clarifications and additional experiments in response to your valuable feedback.
As the discussion period draws to a close, we would like to hear your thoughts on our response, including whether it adequately addresses your concerns. If you have any updated thoughts, we would be grateful to hear them.
Thank you again for your time and insightful engagement.
Best regards,
Authors
Official Comment by Reviewer #2
Thanks. The additional experiment results have resolved my concerns. I've raised my scores.
Replying to Official Comment by Reviewer #2
Heartfelt Gratitude to Reviewer #2
Dear Reviewer #2,
Thank you very much for your positive feedback and for updating your assessment. We are particularly encouraged by your recognition of the value of our work.
Best regards,
The Authors
Response to reviewer #2
Dear Reviewer #2,
We sincerely appreciate your time and efforts on our work! We are grateful that you acknowledge the strengths of our work, particularly in the clear problem formulation and human-verified evaluation.
W #1. Sensitivity of λ to verifier choice and requirement of labeled data for tuning λ.
(1) Across most λ values, our method significantly outperforms the best baseline, without requiring careful λ optimization. We analyze AUROC across λ ∈ [0,1] on the SciQ-CE validation set using Qwen2.5-7B as the response LLM. Values above the best baseline probe (AUROC = 0.7695) indicate an improvement over it. Results show:
- Our method consistently outperforms the baseline across a broad range of λ values. Even with Qwen2.5-3B (a weaker verifier that we explicitly discourage), our method still outperforms the baseline over a wide λ range (0 < λ ≤ 0.6). When λ = 0, our method degenerates into the baseline probe.
- Our results suggest that simply fixing λ = 0.5 provides strong gains across all tested verifier sizes, which makes deployment easy without extensive tuning and avoids the need for additional supervision. We will highlight this robust default in the final manuscript. (An illustrative sketch of the score fusion follows the table.)
| λ | Llama3.2-3b | Llama3.1-8b | Llama3.1-70b | Qwen2.5-3b | Qwen2.5-14b | Qwen2.5-72b |
|---|---|---|---|---|---|---|
| 0.00 | 0.7695 | 0.7695 | 0.7695 | 0.7695 | 0.7695 | 0.7695 |
| 0.10 | 0.7742 | 0.7827 | 0.7881 | 0.7717 | 0.7824 | 0.7934 |
| 0.20 | 0.7775 | 0.7898 | 0.8042 | 0.7750 | 0.7885 | 0.8165 |
| 0.30 | 0.7799 | 0.7950 | 0.8182 | 0.7739 | 0.7990 | 0.8346 |
| 0.40 | 0.7815 | 0.8027 | 0.8293 | 0.7749 | 0.8042 | 0.8473 |
| 0.50 | 0.7829 | 0.8070 | 0.8375 | 0.7734 | 0.8080 | 0.8622 |
| 0.60 | 0.7800 | 0.8079 | 0.8463 | 0.7706 | 0.8128 | 0.8712 |
| 0.70 | 0.7819 | 0.8067 | 0.8512 | 0.7665 | 0.8140 | 0.8815 |
| 0.80 | 0.7763 | 0.8039 | 0.8556 | 0.7609 | 0.8152 | 0.8894 |
| 0.90 | 0.7697 | 0.8024 | 0.8557 | 0.7523 | 0.8168 | 0.8947 |
| 1.00 | 0.7618 | 0.7996 | 0.8540 | 0.7409 | 0.8133 | 0.9001 |
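As a reading aid for the sweep above, one simple interpretation of λ is a convex combination of the two probe scores, which is consistent with λ = 0 recovering the baseline probe and λ = 1 relying on the verifier-side probe alone. This is an illustrative sketch only; the actual fusion may instead operate on the hidden states themselves.

```python
def fused_error_score(p_response: float, p_verifier: float, lam: float = 0.5) -> float:
    """Illustrative score-level fusion (an assumption, not necessarily the exact formulation).

    p_response: error score from the probe on the response LLM's hidden states.
    p_verifier: error score from the probe on the verifier LLM's hidden states.
    lam = 0 recovers the response-only baseline probe; lam = 1 uses only the verifier.
    """
    return (1.0 - lam) * p_response + lam * p_verifier
```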
(2) Supervised approaches have demonstrated the strongest performance [1, 2] and have become mainstream in the field of error detection. A wide range of supervised data is now available, including many open-source datasets such as TriviaQA and SciQ. Additionally, we find that selecting the optimal λ requires only a small labeled set (150–200 examples), so the impact of tuning λ on practical applicability is minimal.
Overall, our method achieves strong performance with either a fixed λ or minimal tuning, which we believe addresses your concern.
- [1] Factual confidence of LLMs: On reliability and robustness of current estimators. ACL. 2024
- [2] Llms know more than they show: On the intrinsic representation of llm hallucinations. ICLR. 2025
Response to reviewer #2
W #2. The root causes of self-consistent errors: Are these due to annotation errors in the ground truth? Knowledge misconceptions? What are the cases where different LLM families share the same self-consistent errors? Showing some case study would be more helpful.
(1) Annotation errors in the ground truth are not the cause. We manually examined 40 examples from SciQ and TriviaQA and found no annotation errors in the ground truth. All gold answers are supported by reliable sources such as Wikipedia.
(2) Self-consistent errors shared across different LLM families are relatively rare. In the table below, "Shared" refers to questions where both models produce the same self-consistent error and "Ratio" is calculated with respect to the entire dataset.
| Model Pair (A vs B) | Dataset | Shared | Ratio |
|---|---|---|---|
| Qwen2.5-7b vs Llama3.1-8b | SciQ | 244 | 2.08% |
| Qwen2.5-7b vs Llama3.1-70b | SciQ | 151 | 1.29% |
| Qwen2.5-7b vs Llama3.1-8b | TQA | 464 | 0.66% |
| Qwen2.5-7b vs Llama3.1-70b | TQA | 251 | 0.35% |
We manually examined a set of self-consistent errors shared across models. Some of these errors are associated with widespread misconceptions on the internet, especially in informal sources such as blogs. If such content is included in an LLM’s training data, it may lead to self-consistent errors. Here is a representative example:
- Question: Which is the lightest of the widely used structural metals?
- Gold answer: magnesium
- Self-Consistent Error: Aluminum.
- Explanation: Misconceptions supporting this error can be found on many blogs and informal articles. For example, one article states, "Aluminum is the lightest structural metal, with a density of just one-third that of steel." Another blog, with an imprecise title, refers to both as the lightest structural metals: "Aluminum vs Magnesium: The Lightest Structural Metals Compared."
(3) Easily confusable concepts may be one of the causes. We also observed this potential cause among the model-specific self-consistent error cases. As illustrated in the following example, the LLM confused two closely related concepts that differ only subtly.
- Question: When the earth is between the moon and the sun, what type of moon shows?
- Gold answer: A full moon
- Self-Consistent Error: A New Moon.
- Explanation: A full moon occurs when the earth is between the moon and the sun. A new moon occurs when the moon is between the earth and the sun.
(4) We acknowledge that uncovering the precise causes of self-consistent errors is crucial. However, such analysis is highly challenging because it would require access to the full pre-training data and a detailed understanding of the learning process. This goes far beyond the scope of this short paper. Our contribution primarily focuses on revealing the problem of self-consistent errors and proposing an effective solution. We will incorporate these case studies in the revised version and leave a more comprehensive investigation to future work.
W #3. The presentation of Table 2
We sincerely apologize for any confusion caused by the unclear writing due to space limitations. Here, we briefly clarify the experimental content of Table 2.
Table 2 presents an ablation study examining how different choices of verifier models affect our method’s performance. We test our method with four different verifier models, selected to span different model series and sizes.
For clarity, we reorganize Table 2 into the following table. The first row (Probe) is the best baseline, while the remaining rows correspond to our method with different verifiers.
| Method | SciQ-CE | SciQ-IE | TriviaQA-CE | TriviaQA-IE |
|---|---|---|---|---|
| Probe (Qwen2.5-7B) | 0.8250 | 0.8786 | 0.8662 | 0.9468 |
| + Qwen2.5-3B Verifier | 0.8357 (+1.3 %) | 0.8834 (+0.5 %) | 0.8712 (+0.6 %) | 0.9495 (+0.3 %) |
| + Llama3.2-3B Verifier | 0.8453 (+2.5 %) | 0.8851 (+0.7 %) | 0.8828 (+1.9 %) | 0.9569 (+1.1 %) |
| + Qwen2.5-72B Verifier | 0.8689 (+5.3 %) | 0.9290 (+5.7 %) | 0.9377 (+8.3 %) | 0.9815 (+3.7 %) |
| + Llama3.1-70B Verifier | 0.8794 (+6.6 %) | 0.9353 (+6.5 %) | 0.9511 (+9.8 %) | 0.9852 (+4.1 %) |
The table shows that:
- Our method outperforms the best baseline across all verifiers, including the smallest 3B model.
- We provide empirical suggestions for verifier selection: verifiers from a different model family outperform same-family verifiers (the Qwen series here), and larger-scale verifiers outperform smaller-scale ones.
We greatly appreciate your valuable feedback and will revise the writing in Section 4 accordingly.
Response to reviewer #2
W #4. More diverse tasks.
(1) The goal of error detection is to determine whether an LLM's response is factually correct, so the primary focus is on factual QA. The datasets we selected are widely used benchmarks in this field [1, 2], making them appropriate and representative for evaluating error detection.
(2) We appreciate this suggestion and additionally test the frequency of self-consistent errors on the math dataset GSM8K [3]. Results show that self-consistent errors still occur with notable frequency on math problems; for Llama3.1-8B their frequency even exceeds that of inconsistent errors. (A rough sketch of the categorization used for these counts is given below, after the references.)
| LLM | Task | Correct | Self-Consistent Errors | Inconsistent Errors |
|---|---|---|---|---|
| Qwen2.5-7b | GSM8K | 814 (62.18%) | 172 (13.14%) | 323 (24.68%) |
| Llama3.1-8b | GSM8K | 1044 (85.50%) | 127 (10.40%) | 50 (4.10%) |
[1] Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. EMNLP. 2023
[2] Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation. ICLR. 2023
[3] Training verifiers to solve math word problems
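For completeness, a rough sketch of the correct / self-consistent error / inconsistent error categorization assumed by these counts; the exact criteria (e.g., whether correctness is judged on a greedy response or on the samples, and whether all samples or only a majority must agree) are assumptions here:

```python
def categorize(samples: list[str], gold_answer: str) -> str:
    """Assign one question to 'correct', 'self-consistent error', or 'inconsistent error'.

    samples: stochastic responses from the LLM for this question.
    Assumes `is_correct(response, gold_answer)` checks factual correctness and
    `semantically_equivalent` is the NLI-based mutual-entailment check.
    """
    if is_correct(samples[0], gold_answer):
        return "correct"
    # Incorrect case: the error is self-consistent if every sample expresses
    # the same (wrong) meaning as the first one.
    if all(semantically_equivalent(samples[0], s) for s in samples[1:]):
        return "self-consistent error"
    return "inconsistent error"
```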
S #1: line 17 methshods
We appreciate the reviewer pointing out these typos, and have thoroughly corrected them in the revised version.
S #2: Table 2 needs to be polished, only one score has been highlighted. Why res only scores in table 2 are all the same?
We apologize for the unclear presentation. We have reorganized and clarified Table 2 in our reply to Weakness 3, and will revise it in the paper. Here, we briefly clarify two points:
- Only the “fused” column is highlighted, since the purpose of the table is to compare our method (the “fused” column) across verifier choices. The other columns only serve as baselines.
- “Res only” denotes the probe baseline, which relies solely on the response LLM’s hidden states. Since no verifier is involved in this method, its scores remain unchanged across verifier rows.
We hope these clarifications have fully addressed your concerns. If so, we would deeply appreciate your consideration in raising your score. If you have any further suggestions, please feel free to reach out.
Official Review of Submission435 by Reviewer #3
This paper defines and investigates self-consistent errors in large language models (LLMs), referring to semantically equivalent but incorrect outputs that are consistently reproduced across multiple stochastic generations. The authors provide a systematic empirical evaluation of four mainstream error detection methods and find that all perform significantly worse on self-consistent errors. To address this, the paper proposes a cross-model probe that incorporates hidden representations from an external verifier model, leading to improved detection performance.
Clear and timely problem formulation: The paper identifies a critical yet underexplored phenomenon—self-consistent errors—and formally defines it with precision.
Thorough experimental design: The authors carefully control evaluation settings (e.g., balancing CE and IE subsets) and provide implementation details, which enhances reproducibility.
The paper uses NLI-based mutual entailment to determine semantic equivalence. However, this could miss subtle factual differences, especially when response lengths differ. It would be valuable to include example pairs where the NLI-based method fails or is borderline, to justify the robustness of the CE/IE split. And the paper does not discuss the impact of generation length on consistency evaluation.
Although the limitations section briefly mentions training data bias and fine-tuning artifacts as potential causes, the paper lacks concrete empirical evidence for these claims. I strongly recommend adding a case study or data comparison to explore what types of questions or topics are more prone to self-consistent errors, and whether these correlate with known dataset artifacts.
While the cross-model probing strategy is effective, it is conceptually a straightforward extension of supervised probing. Its primary value lies in empirical performance, not methodological innovation.
The paper contains a number of typos that should be corrected:
“repeately” → “repeatedly” (line 007)
“methshods” → “methods” (line 017)
“are then integrated” → “is then integrated” (line 083)
There are no concerns with this submission
Appreciate Your Feedback – Any Remaining Concerns?
Dear Reviewer #3,
As the discussion period draws to a close, we are encouraged by the overall positive tenor of the reviews, including your assessment.
We genuinely hope to hear your thoughts on our response. Are there any specific concerns you still see as barriers to a higher assessment? We would truly value your perspective and would be glad to further clarify or improve any remaining points.
Thank you again for your time and thoughtful engagement throughout the review process.
Best regards,
The Authors
Heartfelt Gratitude to Reviewer #3
Dear Reviewer #3,
Thank you very much for your reconsideration and for increasing your overall assessment. We’re truly grateful that our clarifications and additional analyses helped address your key concerns.
We noticed you’ve currently assessed the paper as borderline. We genuinely want to further improve this work and would sincerely value your guidance. Are there any specific concerns or aspects you still see as barriers to a higher assessment? We would be very eager to understand them and explore how we might address these points.
Thank you again for your valuable time and consideration.
Best regards,
The Authors
Seeking your valuable feedback on our response
Dear Reviewer #3,
We hope this message finds you well.
We are deeply grateful for your thorough review. We have provided detailed clarifications and additional experiments in response to your valuable feedback.
As the discussion period draws to a close, we would like to hear your thoughts on our response, including whether it adequately addresses your concerns. If you have any updated thoughts, we would be grateful to hear them.
Thank you again for your time and insightful engagement.
Best regards,
Authors
Response to reviewer #3 (1/3)
Dear Reviewer #3,
We sincerely appreciate your thorough review and constructive feedback. We are grateful that you acknowledge the strengths of our work, particularly in the clear and timely problem formulation and thorough experimental design.
W #1. Length difference may affect the NLI-based semantic equivalence.
(1) We analyzed the lengths of all response pairs evaluated by the NLI model and found that most response pairs have similar lengths: 91.03% of pairs differ by at most 5 words (see table below).
| length difference | ratio of response pairs |
|---|---|
| ≤ 3 | 88.66% |
| ≤ 5 | 91.03% |
| ≤ 10 | 95.81% |
(2) We conduct a human evaluation on 100 sampled response pairs and find that NLI-based mutual entailment achieves 98% agreement with human judgments, demonstrating its effectiveness. Additionally, NLI has become the mainstream method for assessing consistency in this field [1,2,3,4,5], and prior work has also demonstrated its effectiveness through human evaluation[1].
Overall, the above analysis and our human evaluation with 98% agreement demonstrate the effectiveness of NLI-based assessment, which we believe adequately addresses your concern.
[1] Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation. ICLR. 2023
[2] SelfcheckGPT: Zero-resource black-box hallucination detection for generative large language models. EMNLP. 2023
[3] Generating with confidence: Uncertainty quantification for black-box large language models. TMLR. 2025
[4] Kernel language entropy: Fine-grained uncertainty quantification for LLMs from semantic similarities. NIPS. 2024
[5] Semantic entropy probes: Robust and cheap hallucination detection in LLMs.
Response to reviewer #3 (2/3)
W #2. The limitations mention possible causes without evidence; a case study or data comparison is recommended to show the causes of such errors.
(1) We thank the reviewer for highlighting the importance of exploring the underlying causes of self-consistent errors. While we note that the ACL guidelines (H16, “Limitations ≠ Weaknesses”) generally place such broader inquiries under the scope of limitations, we fully agree that this is an important direction.
Our short paper primarily focuses on identifying this phenomenon and proposing an effective solution, but we sincerely appreciate the reviewer’s suggestion, which has motivated us to perform the additional case studies and data comparisons presented below.
(2) Through additional case studies and data comparisons, we observe several potential causes of self-consistent errors, including easily confusable concepts, widespread misconceptions, long-tail knowledge, and effects of training stages.
Insight 1. Easily confusable concepts and widespread misconceptions may be causes. We conducted case studies and observed some phenomena that may contribute to such errors. Below are some representative examples:
- (a) Easily Confusable Concepts: The LLM confused two closely related concepts with only subtle differences.
- Question: When the earth is between the moon and the sun, what type of moon shows?
- Gold answer: A full moon
- Self-Consistent Error: A New Moon.
- Explanation: A full moon occurs when the earth is between the moon and the sun. A new moon occurs when the moon is between the earth and the sun.
- (b) Widespread Misconceptions: Some of these errors are associated with widespread misconceptions on the internet, especially in informal sources such as blogs. When such misconceptions are included in training data, they may lead to self-consistent errors.
- Question: Which is the lightest of the widely used structural metals?
- Gold answer: magnesium
- Self-Consistent Error: Aluminum.
- Explanation: Misconceptions supporting this error can be found on many blogs and informal articles. For example, one article states, "Aluminum is the lightest structural metal, with a density of just one-third that of steel." Another blog, with an imprecise title, refers to both as the lightest structural metals: "Aluminum vs Magnesium: The Lightest Structural Metals Compared."
Insight 2. Higher ratio of self-consistent errors on long-tail knowledge. We evaluate Llama3.1-8b on the PopQA dataset, which contains questions spanning different popularity magnitudes, to assess how self-consistent errors vary with popularity. As shown in the following table, low-popularity (long-tail) questions exhibit a higher ratio of self-consistent errors.
| Popularity magnitude (lowest → highest) | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Self-consistent error rate | 19.59% | 21.09% | 16.01% | 11.51% | 12.80% |
Insight 3. Impact of training stages on self-consistent errors. We utilize the fully open-source OLMo2 models, which release checkpoints after every training stage: OLMo2-7b-base, OLMo2-7b-sft, OLMo2-7b-dpo, and OLMo2-7b-rl. The table below reveals that self-consistent errors are already frequent after pre-training.
| Model | self-consistent errors ratio |
|---|---|
| olmo2-7b-base | 24.60% |
| olmo2-7b-sft | 20.40% |
| olmo2-7b-dpo | 17.00% |
| olmo2-7b-rl | 19.80% |
(3) The above insights are preliminary observations based on case studies and data comparisons. More rigorous and detailed analysis of the causes of self-consistent errors would require access to the full training data and costly controlled training experiments, which go beyond the scope of this short paper. This work primarily focuses on revealing the problem of self-consistent errors and proposing an effective solution. We will incorporate the above insights into the revised version and leave a deeper investigation to future work.
Response to reviewer #3 (3/3)
W #3. The innovation of the proposed method.
We appreciate your recognition of the effectiveness of our cross-model probing strategy. Beyond empirical gains, we respectfully argue that it represents a significant conceptual innovation and offers key methodological insights.
(1) Prior work assumed that a model's own hidden states were sufficient for error detection [1,2,3,4]. Distinct from this, we are the first to incorporate hidden states from external verifiers into the probe. The core innovation of our approach lies in challenging the assumption that a model's own hidden states are sufficient for error detection and demonstrating that incorporating cross-model hidden states is necessary for detecting self-consistent errors.
(2) Furthermore, our work extends beyond empirical gains by providing actionable design principles for choosing external verifiers. By systematically analyzing the impact of verifier choice, we demonstrate that verifiers from a “different model family + larger size” yield significantly better improvements. This insight offers valuable inspiration for future research in cross-model verification and related areas.
We hope this clarifies why our method offers not just empirical improvements but also contributes novel conceptual insights to the error detection field.
[1] LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations. ICLR. 2025
[2] Inference-time intervention: Eliciting truthful answers from a language model. NIPS. 2023
[3] The Geometry of Truth: Emergent Linear Structure in LLM Representations of True/False Datasets. COLM. 2024
[4] The Internal State of an LLM Knows When It's Lying. EMNLP. 2023
S #1 A number of typos that should be corrected
We sincerely thank the reviewer for catching these typos. We have carefully proofread the entire paper to fix all typos.
In summary, we have conducted additional empirical analyses, provided new case studies and data-driven insights into the causes of self-consistent errors, clarified the conceptual innovation of our method, and committed to incorporating these enhancements into the revised paper.
We sincerely hope this thorough response fully addresses your concerns, and we would be grateful for your reconsideration of the paper’s evaluation.