As large language models (LLMs) often generate plausible but incorrect content, error detection has become increasingly critical to ensure truthfulness. However, existing detection methods often overlook a critical problem we term self-consistent error, where LLMs repeatedly generate the same incorrect response across multiple stochastic samples. This work formally defines self-consistent errors and evaluates mainstream detection methods on them. Our investigation reveals two key findings: (1) Unlike inconsistent errors, whose frequency diminishes significantly as LLM scale increases, the frequency of self-consistent errors remains stable or even increases. (2) All four types of detection methods significantly struggle to detect self-consistent errors. These findings reveal critical limitations in current detection methods and underscore the need for improved methods. Motivated by the observation that self-consistent errors often differ across LLMs, we propose a simple but effective cross-model probe method that fuses hidden state evidence from an external verifier LLM. Our method significantly enhances performance on self-consistent errors across three LLM families.
Meta Review of Submission435 by Area Chair
The paper studies self-consistent errors in LLMs, defined as cases where an LLM generates the same (semantically equivalent) incorrect response across multiple stochastic samples. The paper observes that, unlike inconsistent errors, the frequency of these errors remains stable or even increases with larger models. They also observe that the four types of detection methods they investigate in the paper significantly struggle to detect self-consistent errors. Based on the observation that self-consistent errors often differ across LLMs, they propose a method called "cross-model probe" that uses hidden state evidence from an external verifier LLM and significantly enhances error-detection performance on self-consistent errors across three LLM families.
The paper clearly defines and investigates an important and underexplored phenomenon, self-consistent errors, and shows that existing error detection methods have some way to go in handling them. The paper also proposes a practical method (cross-model probe) that improves error detection. The additional information and experimental results provided during the rebuttal address the reviewers' concerns, particularly in terms of (1) explaining some potential causes of self-consistent errors across different models (including qualitative examples of these errors), (2) analysis of the frequency of errors that are self-consistent across multiple different LLMs, and (3) robustness of the cross-model probe to the choice of verifier.
The paper should incorporate the additional information and results requested and provided during the rebuttal period, as well as clarity and readability fixes based on the reviewers' comments.
Summary of the Discussion Phase
Dear (S)ACs,
Thank you for your valuable guidance and dedication throughout the review process.
We are encouraged that all reviewers acknowledged the strengths of our work, including the clear problem formulation, rigorous experimental design, effective methodology, and valuable insights for the error detection field. By uncovering the limitation of current methods in detecting self-consistent errors and introducing an effective solution, we believe our work has highlighted a critical blind spot and will inspire the design of future methods in this field.
We have provided detailed clarifications and supplementary experiments addressing all raised concerns, most of which are minor in nature. ALL reviewers confirmed that their feedback was resolved by our responses, as reflected in their improved scores and positive follow-up comments.
However, some of Reviewer #3's comments may not fully align with the reviewing guidelines:
(1) Reviewer #3’s Weakness #3 appears inconsistent with "Reviewer Guidelines H3", as it questions the novelty of our method without providing citations or specific justification.
(2) Reviewer #3’s Weakness #2 may not fully align with "Reviewer Guidelines H16: Limitations ≠ Weaknesses", as it closely parallels points we explicitly acknowledged in our Limitations section.
As a short paper, our primary contribution lies in revealing the problem of self-consistent errors, demonstrating the failure of existing methods to detect them, and proposing an effective solution. While a comprehensive investigation into the root causes would be valuable, it would require access to pre-training data and substantially more pages. This is beyond the scope of this short paper, which, as noted in the Call For Papers, is intended for "a small, focused contribution that can be made in a few pages".
Nevertheless, during the rebuttal we have added extensive experiments and case studies to uncover several possible causes: (1) easily confusable concepts, (2) widespread misconceptions, (3) long-tail knowledge, and (4) effects of training stages. These additions have addressed Reviewer #3's concerns, as evidenced by the improved score.
Given these points, we respectfully request that our detailed rebuttal and additional contributions be considered in the meta-review process. We sincerely trust the (S)ACs’ judgment in ensuring a fair evaluation, and greatly appreciate your attention to these matters.
Best regards,
Authors
Request for Facilitating Discussion from Reviewers with Us
Dear (S)ACs,
We hope this message finds you well.
We are writing to respectfully request your help in encouraging reviewers to check our response. We have provided detailed clarifications and supplementary experiments to address their concerns, questions, and misunderstandings. However, as the discussion period draws to a close, we have not received any feedback from them on our rebuttal.
We deeply appreciate your consideration of our request and your valuable guidance throughout the review process.
Best regards,
Authors
Official Review of Submission435 by Reviewer #1
This paper investigates the self-consistent errors made by LLMs when performing QA tasks. They found that the number of such errors remained roughly constant as the model was scaled up. A simple detection method is presented, involving the use of an additional representation from a different LLM.
- interesting experiment with useful insight
- a simple method for improving error-detection
- nicely presented
The authors note that self-consistent errors are more important since their number remains roughly constant as the model size increases. I believe it is important to investigate two further issues: What is the transferability of self-consistent errors between different models? The authors provide one such result (lines 256–259), but more thorough investigations are needed. What about errors that are self-consistent across models? What is their frequency? Are they distinctive in any way? The existence of such errors could highlight some unresolved issues with LLM training.
The evaluation would benefit from an additional baseline: an unsupervised method (such as SE) that samples responses from different LLMs. This would verify the effectiveness of the Cross-Model Probe compared to the direct application of the 'Cross-Model' concept.
Minor
- "methshods" & other typos
- the authors apply an NLI criterion, but do not specify which NLI model was used
There are no concerns with this submission
Seeking your valuable feedback on our response
Dear Reviewer #1,
We hope this message finds you well.
We are deeply grateful for your thorough review and acknowledgment of our work. We have provided detailed clarifications and additional experiments in response to your valuable feedback.
As the discussion period draws to a close, we would like to hear your thoughts on our response, including whether it adequately addresses your concerns. If you have any updated thoughts, we would be grateful to hear them.
Thank you again for your time and insightful engagement.
Best regards,
Authors
Official Comment by Reviewer #1
Thank you for the rebuttal. I'll keep my score.
Replying to Official Comment by Reviewer #1
Heartfelt Gratitude to Reviewer #1
Dear Reviewer #1,
Thank you very much for your kind follow-up. We're glad to hear that our responses have addressed your concerns. If there are any remaining aspects you would like to discuss, we would be more than happy to engage further.
Best regards,
Authors
Response to reviewer #1
Dear Reviewer #1,
We sincerely appreciate your constructive review and supportive score. We are encouraged that you found our work insightful and nicely presented, particularly in providing a simple method for improving error-detection.
W #1. Transferability of self-consistent errors between different models. Frequency and distinctiveness of self-consistent errors across models.
(1) We analyze the frequency of shared self-consistent errors across different models, as shown in the table below. CE_A and CE_B denote the number of self-consistent errors for models A and B, respectively. Shared refers to questions where both models produce the same self-consistent error. The table shows that:
- Models from the same family tend to have a relatively high proportion of shared self-consistent errors (22.4%–38.2% on SciQ, 14.2%–24.9% on TriviaQA).
- Models from different families rarely share self-consistent errors (as low as 5.4% on TriviaQA).
| Model Pair (A vs B) | Dataset | CE_A | CE_B | Shared | Shared / CE_A | Shared / CE_B |
|---|---|---|---|---|---|---|
| Llama3.1-8b vs Llama3.1-70b | SciQ | 1038 | 800 | 272 | 26.2 % | 34.0 % |
| Qwen2.5-7b vs Qwen2.5-72b | SciQ | 952 | 557 | 213 | 22.4 % | 38.2 % |
| Qwen2.5-7b vs Llama3.1-8b | SciQ | 952 | 1038 | 244 | 25.6 % | 23.5 % |
| Qwen2.5-7b vs Llama3.1-70b | SciQ | 952 | 800 | 151 | 15.9 % | 18.9 % |
| Llama3.1-8b vs Llama3.1-70b | TQA | 3077 | 2665 | 663 | 21.5 % | 24.9 % |
| Qwen2.5-7b vs Qwen2.5-72b | TQA | 4638 | 3379 | 657 | 14.2 % | 19.4 % |
| Qwen2.5-7b vs Llama3.1-8b | TQA | 4638 | 3077 | 464 | 10.0 % | 15.1 % |
| Qwen2.5-7b vs Llama3.1-70b | TQA | 4638 | 2665 | 251 | 5.4 % | 9.4 % |
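As an illustration of how the "Shared" column above could be computed (helper names are hypothetical; `semantically_equivalent` denotes the NLI-based mutual-entailment check used in the paper):

```python
def count_shared_errors(errors_a: dict[str, str], errors_b: dict[str, str]) -> int:
    """Count questions on which models A and B make the same self-consistent error.

    errors_a / errors_b map a question id to the (incorrect) consistent answer
    that model A / model B produces for that question.
    """
    shared = 0
    for qid, answer_a in errors_a.items():
        answer_b = errors_b.get(qid)
        if answer_b is not None and semantically_equivalent(answer_a, answer_b):
            shared += 1
    return shared
```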
(2) We manually examined a set of self-consistent errors shared across models. Some of these errors are associated with widespread misconceptions on the internet, especially in informal sources such as blogs. If such content is included in an LLM’s training data, it may lead to self-consistent errors. Here is a representative example:
- Question: Which is the lightest of the widely used structural metals?
- Gold answer: magnesium
- Self-Consistent Error: Aluminum.
- Explanation: Misconceptions supporting this error can be found on many blogs and informal articles. For example, one article states, "Aluminum is the lightest structural metal, with a density of just one-third that of steel." Another blog, with an imprecise title, refers to both as the lightest structural metals: "Aluminum vs Magnesium: The Lightest Structural Metals Compared."
This is merely one possible cause we observed through case studies, and other factors may also exist. Moreover, rigorously verifying and quantifying how self-consistent errors stem from misconceptions would require access to the full training corpus. This falls beyond the scope of this short paper, and we will explore it in future work.
Response to reviewer #1
W #2. Comparing with an unsupervised method (such as SE) that samples responses from different LLMs.
Thank you for this helpful suggestion. In the error detection task, supervised methods tend to significantly outperform unsupervised ones[1]. For this reason, we use the strong supervised probe as our primary baseline in the paper.
Following your suggestion, we also compare with the cross-model SE method.
- (1) Our method significantly outperforms the cross-model SE in detecting self-consistent errors (>10% absolute AUROC), as shown in the table below.
- (2) Our method also shows a significant efficiency advantage: cross-model SE requires 10 additional inferences from both the original LLM and the verifier, while our method only needs one extra inference from the verifier.
Experimental details: We implement cross-model SE by sampling 10 responses from both the original LLM and the verifier, then combining these 20 samples to compute the SE score (a minimal sketch of this computation follows the table). For a fair comparison, all methods use Qwen2.5-14B as the verifier.
| Model | Method | SciQ-CE | TriviaQA-CE |
|---|---|---|---|
| Llama3.1-8b | SE | 0.4608 | 0.5216 |
| Llama3.1-8b | cross-model SE | 0.7306 | 0.7401 |
| Llama3.1-8b | ours | 0.8659 (+13.53%) | 0.8470 (+10.69%) |
| Qwen2.5-7b | SE | 0.4782 | 0.4453 |
| Qwen2.5-7b | cross-model SE | 0.6567 | 0.7771 |
| Qwen2.5-7b | ours | 0.8399 (+18.32%) | 0.9088 (+13.17%) |
[1] Factual confidence of LLMs: On reliability and robustness of current estimators. ACL. 2024
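For concreteness, the following is a minimal sketch of the cross-model SE computation described above, assuming the standard semantic-entropy recipe of clustering samples by NLI-based mutual entailment (`semantically_equivalent` denotes that check) and using the discrete, frequency-based entropy variant; names and details such as likelihood weighting are illustrative, not the exact implementation.

```python
import math
from collections import Counter

def cluster_by_meaning(responses: list[str]) -> list[int]:
    # Greedily assign each response to the first cluster whose representative
    # it is mutually entailed with (the NLI check described in the paper).
    representatives, labels = [], []
    for response in responses:
        for idx, rep in enumerate(representatives):
            if semantically_equivalent(response, rep):
                labels.append(idx)
                break
        else:
            representatives.append(response)
            labels.append(len(representatives) - 1)
    return labels

def discrete_semantic_entropy(responses: list[str]) -> float:
    # Entropy over semantic clusters using empirical cluster frequencies.
    # (The original SE formulation weights clusters by sequence likelihoods.)
    labels = cluster_by_meaning(responses)
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Cross-model SE: pool 10 samples from the response LLM with 10 from the verifier.
# score = discrete_semantic_entropy(samples_from_response_llm + samples_from_verifier)
```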
S #1: "methshods" & other typos. Which NLI model was used specifically.
- (1) We will correct these typos and carefully double-check the entire paper. We sincerely apologize for the confusion arising from the typos.
- (2) For the NLI model, we specifically use 'microsoft/deberta-v2-xlarge-mnli', and will include this detail in the revised version.
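For reference, the following is a minimal sketch of how mutual entailment could be checked with this NLI model (a sketch assuming the standard Hugging Face interface; the exact input formatting used in our pipeline, e.g., whether the question is prepended to each answer, is omitted here):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NLI_MODEL = "microsoft/deberta-v2-xlarge-mnli"
tokenizer = AutoTokenizer.from_pretrained(NLI_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(NLI_MODEL).eval()

def entails(premise: str, hypothesis: str) -> bool:
    # Predict the NLI label for (premise, hypothesis) and check for entailment.
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        pred = model(**inputs).logits.argmax(dim=-1).item()
    # Read the entailment class index from the model config rather than hard-coding it.
    entail_id = {label.lower(): idx for idx, label in model.config.id2label.items()}["entailment"]
    return pred == entail_id

def semantically_equivalent(a: str, b: str) -> bool:
    # Two responses are treated as equivalent iff each entails the other.
    return entails(a, b) and entails(b, a)
```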
We hope these clarifications have fully addressed your concerns. If you have any further suggestions, please feel free to reach out.
Official Review of Submission435 by Reviewer #2
This paper identifies and studies "self-consistent errors" in LLMs—cases where models repeatedly generate semantically equivalent incorrect answers across multiple samples—and shows that unlike inconsistent errors, these remain stable or even increase with model scale. The authors demonstrate that all existing error detection methods struggle with self-consistent errors and propose a cross-model probe that leverages an external verifier LLM's hidden states to significantly improve detection performance.
- The problem formulation of self-consistent errors is clear.
- Human-verified LLM-as-a-judge approach to detect semantically equivalent responses (Appendix A.5).
Major concern:
- The λ parameter varies dramatically across verifiers (0.25 to 1.00 in Table 2), indicating high sensitivity to verifier choice. This requires labeled development data for tuning each model-verifier pair, significantly limiting practical applicability.
Needs more work during rebuttal:
- The paper identifies self-consistent errors but doesn't investigate their root causes. Are these due to annotation errors in the ground truth for SciQ and TriviaQA? Model-specific knowledge misconceptions? What are the cases where different LLM families share the same self-consistent errors (as a subset of model-specific cases)? Showing some case study would be more helpful.
- The writing in Section 4 (experiment results section) is not clear enough. Table 2 and its corresponding interpretation are hard to follow.
Minor concerns:
- Only two QA datasets (SciQ and TriviaQA) are used. How do self-consistent errors manifest in reasoning, code generation, mathematics, or creative tasks? For example, RealMistake [1] provides more fine-grained error categories for other types of tasks.
[1] Kamoi, R., Das, S. S. S., Lou, R., Ahn, J. J., Zhao, Y., Lu, X., ... & Zhang, R. Evaluating LLMs at Detecting Errors in LLM Responses. In First Conference on Language Modeling.
line 17: methshods
Table 2 needs to be polished, only one score has been highlighted.
Why are the "res only" scores in Table 2 all the same?
There are no concerns with this submission
Seeking your valuable feedback on our response
Dear Reviewer #2,
We hope this message finds you well.
We are deeply grateful for your thorough review. We have provided detailed clarifications and additional experiments in response to your valuable feedback.
As the discussion period draws to a close, we would like to hear your thoughts on our response, including whether it adequately addresses your concerns. If you have any updated thoughts, we would be grateful to hear them.
Thank you again for your time and insightful engagement.
Best regards,
Authors
Official Comment by Reviewer #2
Thanks. The additional experiment results have resolved my concerns. I've raised my scores.
Replying to Official Comment by Reviewer #2
Heartfelt Gratitude to Reviewer #2
Dear Reviewer #2,
Thank you very much for your positive feedback and for updating your assessment. We are particularly encouraged by your recognition of the value of our work.
Best regards,
The Authors
Response to reviewer #2
Dear Reviewer #2,
We sincerely appreciate your time and efforts on our work! We are grateful that you acknowledge the strengths of our work, particularly in the clear problem formulation and human-verified evaluation.
W #1. Sensitivity of λ to verifier choice and requirement of labeled data for tuning λ.
(1) Across most λ values, our method significantly outperforms the best baseline, without requiring careful λ optimization. We analyze AUROC across λ ∈ [0,1] on the SciQ-CE validation set using Qwen2.5-7B as the response LLM. Values above the best baseline probe (AUROC = 0.7695) indicate an improvement over it. Results show:
- Our method consistently outperforms the baseline across a broad range of λ values. Even with Qwen2.5-3B (a weaker verifier that we explicitly discourage), our method still outperforms the baseline over a wide λ range (0 < λ ≤ 0.6). When λ = 0, our method degenerates into the baseline probe.
- Our results suggest that simply fixing λ = 0.5 provides strong gains across all tested verifier sizes, which makes deployment easy without extensive tuning and avoids the need for additional supervision. We will highlight this robust default in the final manuscript. (An illustrative sketch of the score fusion follows the table.)
| λ | Llama3.2-3b | Llama3.1-8b | Llama3.1-70b | Qwen2.5-3b | Qwen2.5-14b | Qwen2.5-72b |
|---|---|---|---|---|---|---|
| 0.00 | 0.7695 | 0.7695 | 0.7695 | 0.7695 | 0.7695 | 0.7695 |
| 0.10 | 0.7742 | 0.7827 | 0.7881 | 0.7717 | 0.7824 | 0.7934 |
| 0.20 | 0.7775 | 0.7898 | 0.8042 | 0.7750 | 0.7885 | 0.8165 |
| 0.30 | 0.7799 | 0.7950 | 0.8182 | 0.7739 | 0.7990 | 0.8346 |
| 0.40 | 0.7815 | 0.8027 | 0.8293 | 0.7749 | 0.8042 | 0.8473 |
| 0.50 | 0.7829 | 0.8070 | 0.8375 | 0.7734 | 0.8080 | 0.8622 |
| 0.60 | 0.7800 | 0.8079 | 0.8463 | 0.7706 | 0.8128 | 0.8712 |
| 0.70 | 0.7819 | 0.8067 | 0.8512 | 0.7665 | 0.8140 | 0.8815 |
| 0.80 | 0.7763 | 0.8039 | 0.8556 | 0.7609 | 0.8152 | 0.8894 |
| 0.90 | 0.7697 | 0.8024 | 0.8557 | 0.7523 | 0.8168 | 0.8947 |
| 1.00 | 0.7618 | 0.7996 | 0.8540 | 0.7409 | 0.8133 | 0.9001 |
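As a reading aid for the sweep above, one simple interpretation of λ is a convex combination of the two probe scores, which is consistent with λ = 0 recovering the baseline probe and λ = 1 relying on the verifier-side probe alone. This is an illustrative sketch only; the actual fusion may instead operate on the hidden states themselves.

```python
def fused_error_score(p_response: float, p_verifier: float, lam: float = 0.5) -> float:
    """Illustrative score-level fusion (an assumption, not necessarily the exact formulation).

    p_response: error score from the probe on the response LLM's hidden states.
    p_verifier: error score from the probe on the verifier LLM's hidden states.
    lam = 0 recovers the response-only baseline probe; lam = 1 uses only the verifier.
    """
    return (1.0 - lam) * p_response + lam * p_verifier
```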
(2) Supervised approaches have demonstrated the strongest performance [1, 2] and have become mainstream in the field of error detection. A wide range of supervised data is now available, including many open-source datasets such as TriviaQA and SciQ. Additionally, we find that selecting the optimal λ requires only a small labeled set (150–200 examples), so the impact of tuning λ on practical applicability is minimal.
Overall, our method achieves strong performance with either a fixed λ or minimal tuning, which we believe addresses your concern.
- [1] Factual confidence of LLMs: On reliability and robustness of current estimators. ACL. 2024
- [2] Llms know more than they show: On the intrinsic representation of llm hallucinations. ICLR. 2025
Response to reviewer #2
W #2. The root causes of self-consistent errors: Are these due to annotation errors in the ground truth? Knowledge misconceptions? What are the cases where different LLM families share the same self-consistent errors? Showing some case study would be more helpful.
(1) Annotation errors in the ground truth are not the cause. We manually examined 40 examples from SciQ and TriviaQA and found no annotation errors in the ground truth. All gold answers are supported by reliable sources such as Wikipedia.
(2) Self-consistent errors shared across different LLM families are relatively rare. In the table below, "Shared" refers to questions where both models produce the same self-consistent error and "Ratio" is calculated with respect to the entire dataset.
| Model Pair (A vs B) | Dataset | Shared | Ratio |
|---|---|---|---|
| Qwen2.5-7b vs Llama3.1-8b | SciQ | 244 | 2.08% |
| Qwen2.5-7b vs Llama3.1-70b | SciQ | 151 | 1.29% |
| Qwen2.5-7b vs Llama3.1-8b | TQA | 464 | 0.66% |
| Qwen2.5-7b vs Llama3.1-70b | TQA | 251 | 0.35% |
We manually examined a set of self-consistent errors shared across models. Some of these errors are associated with widespread misconceptions on the internet, especially in informal sources such as blogs. If such content is included in an LLM’s training data, it may lead to self-consistent errors. Here is a representative example:
- Question: Which is the lightest of the widely used structural metals?
- Gold answer: magnesium
- Self-Consistent Error: Aluminum.
- Explanation: Misconceptions supporting this error can be found on many blogs and informal articles. For example, one article states, "Aluminum is the lightest structural metal, with a density of just one-third that of steel." Another blog, with an imprecise title, refers to both as the lightest structural metals: "Aluminum vs Magnesium: The Lightest Structural Metals Compared."
(3) Easily confusable concepts may be one of the causes. We also observed this potential cause among the model-specific self-consistent error cases. As illustrated in the following example, the LLM confused two closely related concepts that differ only subtly.
- Question: When the earth is between the moon and the sun, what type of moon shows?
- Gold answer: A full moon
- Self-Consistent Error: A New Moon.
- Explanation: A full moon occurs when the earth is between the moon and the sun. A new moon occurs when the moon is between the earth and the sun.
(4) We acknowledge that uncovering the precise causes of self-consistent errors is crucial. However, such analysis is highly challenging because it would require access to the full pre-training data and a detailed understanding of the learning process. This goes far beyond the scope of this short paper. Our contribution primarily focuses on revealing the problem of self-consistent errors and proposing an effective solution. We will incorporate these case studies in the revised version and leave a more comprehensive investigation to future work.
W #3. The presentation of Table 2
We sincerely apologize for any confusion caused by the unclear writing due to space limitations. Here, we briefly clarify the experimental content of Table 2.
Table 2 presents an ablation study examining how different choices of verifier models affect our method’s performance. We test our method with four different verifier models, selected to span different model series and sizes.
For clarity, we reorganize Table 2 into the following table. The first row (Probe) is the best baseline, while the remaining rows correspond to our method with different verifiers.
| Method | SciQ-CE | SciQ-IE | TriviaQA-CE | TriviaQA-IE |
|---|---|---|---|---|
| Probe (Qwen2.5-7B) | 0.8250 | 0.8786 | 0.8662 | 0.9468 |
| + Qwen2.5-3B Verifier | 0.8357 (+1.3 %) | 0.8834 (+0.5 %) | 0.8712 (+0.6 %) | 0.9495 (+0.3 %) |
| + Llama3.2-3B Verifier | 0.8453 (+2.5 %) | 0.8851 (+0.7 %) | 0.8828 (+1.9 %) | 0.9569 (+1.1 %) |
| + Qwen2.5-72B Verifier | 0.8689 (+5.3 %) | 0.9290 (+5.7 %) | 0.9377 (+8.3 %) | 0.9815 (+3.7 %) |
| + Llama3.1-70B Verifier | 0.8794 (+6.6 %) | 0.9353 (+6.5 %) | 0.9511 (+9.8 %) | 0.9852 (+4.1 %) |
The table shows that:
- Our method outperforms the best baseline across all verifiers, including the smallest 3B model.
- We provide empirical suggestions for verifier selection: verifiers from a different model family outperform same-family verifiers (the Qwen series here), and larger-scale verifiers outperform smaller-scale ones.
We greatly appreciate your valuable feedback and will revise the writing in Section 4 accordingly.
Response to reviewer #2
W #4. More diverse tasks.
(1) The goal of error detection is to determine whether an LLM's response is factually correct, so the primary focus is on factual QA. The datasets we selected are widely used benchmarks in this field [1, 2], making them appropriate and representative for evaluating error detection.
(2) We appreciate this suggestion and additionally test the frequency of self-consistent errors on the math dataset GSM8K [3]. Results show that self-consistent errors still occur with notable frequency on math problems; for Llama3.1-8B their frequency even exceeds that of inconsistent errors. (A rough sketch of the categorization used for these counts is given below, after the references.)
| LLM | Task | Correct | Self-Consistent Errors | Inconsistent Errors |
|---|---|---|---|---|
| Qwen2.5-7b | GSM8K | 814 (62.18%) | 172 (13.14%) | 323 (24.68%) |
| Llama3.1-8b | GSM8K | 1044 (85.50%) | 127 (10.40%) | 50 (4.10%) |
[1] Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. EMNLP. 2023
[2] Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation. ICLR. 2023
[3] Training verifiers to solve math word problems
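For completeness, a rough sketch of the correct / self-consistent error / inconsistent error categorization assumed by these counts; the exact criteria (e.g., whether correctness is judged on a greedy response or on the samples, and whether all samples or only a majority must agree) are assumptions here:

```python
def categorize(samples: list[str], gold_answer: str) -> str:
    """Assign one question to 'correct', 'self-consistent error', or 'inconsistent error'.

    samples: stochastic responses from the LLM for this question.
    Assumes `is_correct(response, gold_answer)` checks factual correctness and
    `semantically_equivalent` is the NLI-based mutual-entailment check.
    """
    if is_correct(samples[0], gold_answer):
        return "correct"
    # Incorrect case: the error is self-consistent if every sample expresses
    # the same (wrong) meaning as the first one.
    if all(semantically_equivalent(samples[0], s) for s in samples[1:]):
        return "self-consistent error"
    return "inconsistent error"
```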
S #1: line 17 methshods
We appreciate the reviewer pointing out these typos, and have thoroughly corrected them in the revised version.
S #2: Table 2 needs to be polished, only one score has been highlighted. Why res only scores in table 2 are all the same?
We apologize for the unclear presentation. We have reorganized and clarified Table 2 in our reply to Weakness 3, and will revise it in the paper. Here, we briefly clarify two points:
- Only the “fused” column is highlighted, since the purpose of the table is to compare our method (the “fused” column) across verifier choices. The other columns only serve as baselines.
- “Res only” denotes the probe baseline, which relies solely on the response LLM’s hidden states. Since no verifier is involved in this method, its scores remain unchanged across verifier rows.
We hope these clarifications have fully addressed your concerns. If so, we would deeply appreciate your consideration in raising your score. If you have any further suggestions, please feel free to reach out.
Official Review of Submission435 by Reviewer #3
This paper defines and investigates self-consistent errors in large language models (LLMs), referring to semantically equivalent but incorrect outputs that are consistently reproduced across multiple stochastic generations. The authors provide a systematic empirical evaluation of four mainstream error detection methods and find that all perform significantly worse on self-consistent errors. To address this, the paper proposes a cross-model probe that incorporates hidden representations from an external verifier model, leading to improved detection performance.
Clear and timely problem formulation: The paper identifies a critical yet underexplored phenomenon—self-consistent errors—and formally defines it with precision.
Thorough experimental design: The authors carefully control evaluation settings (e.g., balancing CE and IE subsets) and provide implementation details, which enhances reproducibility.
The paper uses NLI-based mutual entailment to determine semantic equivalence. However, this could miss subtle factual differences, especially when response lengths differ. It would be valuable to include example pairs where the NLI-based method fails or is borderline, to justify the robustness of the CE/IE split. And the paper does not discuss the impact of generation length on consistency evaluation.
Although the limitations section briefly mentions training data bias and fine-tuning artifacts as potential causes, the paper lacks concrete empirical evidence for these claims. I strongly recommend adding a case study or data comparison to explore what types of questions or topics are more prone to self-consistent errors, and whether these correlate with known dataset artifacts.
While the cross-model probing strategy is effective, it is conceptually a straightforward extension of supervised probing. Its primary value lies in empirical performance, not methodological innovation.
The paper contains a number of typos that should be corrected:
“repeately” → “repeatedly” (line 007)
“methshods” → “methods” (line 017)
“are then integrated” → “is then integrated” (line 083)
There are no concerns with this submission
Appreciate Your Feedback – Any Remaining Concerns?
Dear Reviewer #3,
As the discussion period draws to a close, we are encouraged by the overall positive tenor of the reviews, including your assessment.
We genuinely hope to hear your thoughts on our response. Are there any specific concerns you still see as barriers to a higher assessment? We would truly value your perspective and would be glad to further clarify or improve any remaining points.
Thank you again for your time and thoughtful engagement throughout the review process.
Best regards,
The Authors
Heartfelt Gratitude to Reviewer #3
Dear Reviewer #3,
Thank you very much for your reconsideration and for increasing your overall assessment. We’re truly grateful that our clarifications and additional analyses helped address your key concerns.
We noticed you’ve currently assessed the paper as borderline. We genuinely want to further improve this work and would sincerely value your guidance. Are there any specific concerns or aspects you still see as barriers to a higher assessment? We would be very eager to understand them and explore how we might address these points.
Thank you again for your valuable time and consideration.
Best regards,
The Authors
Seeking your valuable feedback on our response
Dear Reviewer #3,
We hope this message finds you well.
We are deeply grateful for your thorough review. We have provided detailed clarifications and additional experiments in response to your valuable feedback.
As the discussion period draws to a close, we would like to hear your thoughts on our response, including whether it adequately addresses your concerns. If you have any updated thoughts, we would be grateful to hear them.
Thank you again for your time and insightful engagement.
Best regards,
Authors
Response to reviewer #3 (1/3)
Dear Reviewer #3,
We sincerely appreciate your thorough review and constructive feedback. We are grateful that you acknowledge the strengths of our work, particularly in the clear and timely problem formulation and thorough experimental design.
W #1. Length difference may affect the NLI-based semantic equivalence.
(1) We analyzed the lengths of all response pairs evaluated by the NLI model and found that most response pairs have similar lengths: 91.03% of pairs differ by at most 5 words (see table below).
| length difference | ratio of response pairs |
|---|---|
| ≤ 3 | 88.66% |
| ≤ 5 | 91.03% |
| ≤ 10 | 95.81% |
(2) We conduct a human evaluation on 100 sampled response pairs and find that NLI-based mutual entailment achieves 98% agreement with human judgments, demonstrating its effectiveness. Additionally, NLI has become the mainstream method for assessing consistency in this field [1,2,3,4,5], and prior work has also demonstrated its effectiveness through human evaluation[1].
Overall, the above analysis and our human evaluation with 98% agreement demonstrate the effectiveness of NLI-based assessment, which we believe adequately addresses your concern.
[1] Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation. ICLR. 2023
[2] SelfcheckGPT: Zero-resource black-box hallucination detection for generative large language models. EMNLP. 2023
[3] Generating with confidence: Uncertainty quantification for black-box large language models. TMLR. 2025
[4] Kernel language entropy: Fine-grained uncertainty quantification for LLMs from semantic similarities. NIPS. 2024
[5] Semantic entropy probes: Robust and cheap hallucination detection in LLMs.
Response to reviewer #3 (2/3)
W #2. The limitations mention possible causes without evidence; a case study or data comparison is recommended to show the causes of such errors.
(1) We thank the reviewer for highlighting the importance of exploring the underlying causes of self-consistent errors. While we note that the ACL guidelines (H16, “Limitations ≠ Weaknesses”) generally place such broader inquiries under the scope of limitations, we fully agree that this is an important direction.
Our short paper primarily focuses on identifying this phenomenon and proposing an effective solution, but we sincerely appreciate the reviewer’s suggestion, which has motivated us to perform the additional case studies and data comparisons presented below.
(2) Through additional case studies and data comparisons, we observe several potential causes of self-consistent errors, including easily confusable concepts, widespread misconceptions, long-tail knowledge, and effects of training stages.
Insight 1. Easily confusable concepts and widespread misconceptions may be causes. We conducted case studies and observed some phenomena that may contribute to such errors. Below are some representative examples:
- (a) Easily Confusable Concepts: The LLM confused two closely related concepts with only subtle differences.
- Question: When the earth is between the moon and the sun, what type of moon shows?
- Gold answer: A full moon
- Self-Consistent Error: A New Moon.
- Explanation: A full moon occurs when the earth is between the moon and the sun. A new moon occurs when the moon is between the earth and the sun.
- (b) Widespread Misconceptions: Some of these errors are associated with widespread misconceptions on the internet, especially in informal sources such as blogs. When such misconceptions are included in training data, they may lead to self-consistent errors.
- Question: Which is the lightest of the widely used structural metals?
- Gold answer: magnesium
- Self-Consistent Error: Aluminum.
- Explanation: Misconceptions supporting this error can be found on many blogs and informal articles. For example, one article states, "Aluminum is the lightest structural metal, with a density of just one-third that of steel." Another blog, with an imprecise title, refers to both as the lightest structural metals: "Aluminum vs Magnesium: The Lightest Structural Metals Compared."
Insight 2. Higher ratio of self-consistent errors on long-tail knowledge. We evaluate Llama3.1-8b on the PopQA dataset, which contains questions spanning different popularity magnitudes, to assess how self-consistent errors vary with popularity. As shown in the following table, low-popularity (long-tail) questions exhibit a higher ratio of self-consistent errors.
| Popularity magnitude (lowest → highest) | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Self-consistent error rate | 19.59% | 21.09% | 16.01% | 11.51% | 12.80% |
Insight 3. Impact of training stages on self-consistent errors. We utilize the fully open-source OLMo2 models, which release checkpoints after every training stage: OLMo2-7b-base, OLMo2-7b-sft, OLMo2-7b-dpo, and OLMo2-7b-rl. The table below reveals that self-consistent errors are already frequent after pre-training.
| Model | self-consistent errors ratio |
|---|---|
| olmo2-7b-base | 24.60% |
| olmo2-7b-sft | 20.40% |
| olmo2-7b-dpo | 17.00% |
| olmo2-7b-rl | 19.80% |
(3) The above insights are preliminary observations based on case studies and data comparisons. More rigorous and detailed analysis of the causes of self-consistent errors would require access to the full training data and costly controlled training experiments, which go beyond the scope of this short paper. This work primarily focuses on revealing the problem of self-consistent errors and proposing an effective solution. We will incorporate the above insights into the revised version and leave a deeper investigation to future work.
Response to reviewer #3 (3/3)
W #3. The innovation of the proposed method.
We appreciate your recognition of the effectiveness of our cross-model probing strategy. Beyond empirical gains, we respectfully argue that it represents a significant conceptual innovation and offers key methodological insights.
(1) Prior work assumed that a model's own hidden states were sufficient for error detection [1,2,3,4]. Distinct from this, we are the first to incorporate hidden states from external verifiers into the probe. The core innovation of our approach lies in challenging the assumption that a model's own hidden states are sufficient for error detection and demonstrating that incorporating cross-model hidden states is necessary for detecting self-consistent errors.
(2) Furthermore, our work extends beyond empirical gains by providing actionable design principles for choosing external verifiers. By systematically analyzing the impact of verifier choice, we demonstrate that verifiers from a “different model family + larger size” yield significantly better improvements. This insight offers valuable inspiration for future research in cross-model verification and related areas.
We hope this clarifies why our method offers not just empirical improvements but also contributes novel conceptual insights to the error detection field.
[1] LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations. ICLR. 2025
[2] Inference-time intervention: Eliciting truthful answers from a language model. NIPS. 2023
[3] The Geometry of Truth: Emergent Linear Structure in LLM Representations of True/False Datasets. COLM. 2024
[4] The Internal State of an LLM Knows When It's Lying. EMNLP. 2023
S #1 A number of typos that should be corrected
We sincerely thank the reviewer for catching these typos. We have carefully proofread the entire paper to fix all typos.
In summary, we have conducted additional empirical analyses, provided new case studies and data-driven insights into the causes of self-consistent errors, clarified the conceptual innovation of our method, and committed to incorporating these enhancements into the revised paper.
We sincerely hope this thorough response fully addresses your concerns, and we would be grateful for your reconsideration of the paper’s evaluation.