Despite near-perfect results in artificial evaluations, the effectiveness of model editing in real-world applications remains unexplored. To bridge this gap, we propose to study model editing in question answering (QA) by establishing a rigorous evaluation practice to assess the effectiveness of editing methods in correcting LLMs' errors. It consists of QAEdit, a new benchmark derived from popular QA datasets, and a standardized evaluation framework. Our single editing experiments indicate that current editing methods perform substantially worse than previously reported (38.5% vs. ~96%). Through module analysis and controlled experiments, we demonstrate that this performance decline stems from issues in the evaluation practices of prior editing research. One key issue is the inappropriate use of teacher forcing in testing, which prevents error propagation by feeding ground truth tokens (inaccessible in real-world scenarios) as input. Furthermore, we simulate real-world deployment via sequential editing, revealing that current approaches fail drastically with only 1000 edits. Our analysis provides a fundamental reexamination of both the real-world applicability of existing model editing methods and their evaluation practices, and establishes a rigorous evaluation framework with key insights to advance reliable and practical model editing research.
Meta Review of Submission348
This paper introduces QAEdit, a benchmark designed for systematic analysis of model editing methods. The experiments highlight a significant discrepancy between current reported performance metrics and real-world scenarios. The results underscore the urgent need for more rigorous evaluation approaches in model editing research.
Reviewers identified several key strengths of the paper, including: 1/ The proposed QAEdit benchmark is timely, novel, and effective. 2/ The paper is clearly written, with detailed and precise technical descriptions. 3/ The experimental design is thorough, covering six editing techniques across four categories. The results validate the hypothesis of performance degradation when evaluated with QAEdit, providing critical insights and contributing to higher evaluation fidelity in model editing research.
The authors should address reviewers’ feedback by clarifying certain descriptions, improving the related work section (e.g., including RAG-style editing), providing deeper analysis of existing model failures, and refining the overall presentation of the paper.
Summary of the Discussion Phase
Dear (S)ACs,
Thank you for your valuable guidance and dedication throughout the review process.
We are encouraged that all reviewers recognized the strengths of our work, including the well-motivated and timely problem, thorough analysis of evaluation pitfalls, and the valuable insights offered for future model editing research. We believe that our more realistic and general evaluation framework and benchmark, which challenge the illusion of significant progress in the field, can provide key insights for designing more effective model editing techniques.
We have provided detailed clarifications and supplementary experiments addressing all raised concerns, most of which are minor in nature. Both Reviewers #2 and #3 confirmed that their feedback was fully resolved by our responses. Although Reviewer #1 did not participate in the discussion, we have thoroughly addressed their concerns, which closely align with those raised by other reviewers whose feedback was positively acknowledged. This gives us confidence that all concerns have been satisfactorily resolved.
We hope that these clarifications and efforts will be considered during the meta-review phase. Thank you again for your time and support.
Best regards,
Authors
General Response
Dear Reviewers,
Thank you for your time, efforts, and thoughtful feedback. We are encouraged that our core contributions are consistently recognized across all reviewers. We have carefully considered each comment and concern raised, and provided detailed clarifications and additional support to address them.
To summarize the strengths recognized by the reviewers:
- Well-motivated and Thorough Investigation (#1, #2)
- Valuable and Realistic Benchmark (#1, #2, #3)
- Address Significant and Timely Problem (#2)
- Identify Evaluation Pitfalls (#1, #2, #3)
- Expose Methods Limitations (#1, #2, #3)
- Bridge Gap between Editing and Applications (#2)
- Provide Valuable Insights (#1, #3)
- Well-structured and Clear Presentation (#3)
While we have responded to each reviewer individually, we would like to take this opportunity to further clarify several common suggestions. Before doing so, we would like to briefly highlight the positioning and contribution of our paper.
As famously noted by Lord Kelvin, "If you can't measure it, you can't improve it."
Evaluation is foundational to scientific progress. In the context of model editing, we show that widely adopted evaluation protocols seriously overestimate actual performance, leading to the illusion that current editing methods have already made substantial progress, thereby hindering the development of truly effective model editing techniques.
Thus, our goal is to revisit and correct long-standing issues in the evaluation of model editing research. We construct a new benchmark and conduct systematic controlled experiments to reveal the sources of overestimation in existing works. Our new evaluation framework and findings challenge the illusion of significant progress in the field and also provide key insights for designing more effective model editing techniques.
We now turn to address several shared concerns raised in the reviews.
On the Scope of ICL/RAG-based Editing Methods:
Model editing has been widely studied in recent years and constitutes a substantial research area. We focus on this line of work because it aims to enable compact, self-contained models without reliance on external retrieval. In contrast, RAG-style knowledge editing follows a fundamentally different paradigm from parameter-based model editing and falls outside the scope of our study. Nevertheless, to address reviewer concerns, we provide additional experiments with the IKE method.
On Complex Task Settings:
Our study focuses on improving the realism of evaluation, rather than expanding task complexity. While multi-hop or unstructured editing reflect more challenging tasks, our concern is orthogonal: even simple QA edits are significantly overestimated under current evaluation. If evaluation is flawed at the basic level, then complex tasks built upon it—such as multi-hop editing—inevitably inherit the same weaknesses. To verify this, we include additional multi-hop experiments and observe even greater performance degradation under realistic evaluation settings. Notably, our proposed evaluation framework is task-agnostic and can be easily applied to more complex scenarios, including multi-hop and unstructured editing.
On Suggestions for Improving Model Editing Methods:
Although our primary focus is on evaluation, our findings also provide clear guidance for future method development (Sections 6 and 7). We identify specific failure modes, such as the inability to terminate generation appropriately (often resulting in irrelevant or incorrect information) and interference across sequential edits. These insights point to concrete directions for improving the robustness and reliability of future model editing methods. That said, as argued throughout this paper, we believe that fixing the evaluation foundation is more critical than proposing incremental algorithmic improvements. Without reliable evaluation, such improvements may not reflect real progress.
As noted in the ACL Review Policy, "no paper is perfect" (H16). We fully agree that these suggestions are valuable, but they are "nice-to-have" (H13) rather than essential to the core contribution of this paper. Indeed, we have acknowledged many of these points in the Limitations section. However, given the space constraint of a conference submission, we consciously focus on what we believe is the most critical issue in model editing research: the evaluation gap. By systematically analyzing how current evaluation misrepresents actual performance, our work aims to help steer the field toward more realistic and reliable model editing practices.
We hope these clarifications have fully addressed your concerns. If you have any further suggestions, please feel free to reach out.
Request for Help in Facilitating Discussion with Reviewers
Dear (S)ACs,
We hope this message finds you well.
We are writing to respectfully request your help in encouraging reviewers #2 and #1 to check our response. We have provided detailed clarifications and supplementary experiments to address their concerns, questions, and misunderstandings. However, as the discussion period draws to a close, we have not received any feedback from them on our rebuttal.
We deeply appreciate your consideration of our request and your valuable guidance throughout the review process.
Best regards,
Authors
Official Review of Submission348 by Reviewer #1
This paper presents QAEdit, which addresses the problem that using teacher forcing for knowledge editing is unrealistic. To tackle this, the authors design real-world editing modules across input, generation strategy, output truncation, and evaluation metrics. Experiments with QAEdit show that existing editing methods fall short in these real-world editing settings, exhibiting substantial performance degradation.
- The development of QAEdit, which newly addresses the reliance on teacher forcing in existing editing tasks, is both interesting and well-motivated.
- The experimental results demonstrate that popular editing methods suffer substantial performance degradation on QAEdit, providing valuable insights to the literature.
- As a major issue, while the editing methods considered in the paper are limited to locate-then-edit and memory-based approaches, other methods such as in-context editing and RAG-style editing have not been explored.
- It is not fully convincing that the proposed editing settings encompass most major real-world evaluation scenarios. More discussion about the coverage of the editing settings needs to be presented. In addition, it would be helpful to discuss how the proposed framework could be extended to handle multi-hop editing situations or other general cases.
- While these results of the paper are valuable, a deeper discussion is needed to explain why existing editing methods fail under the proposed settings.
Please see weaknesses.
There are no concerns with this submission
Respectfully seeking your valuable feedback on our response
Dear Reviewer #1,
We hope this message finds you well.
Since we have not received any feedback following our previous messages, we would like to respectfully seek your valuable thoughts on our response.
Given that the discussion period is nearing its end, your feedback would be particularly valuable in helping us understand whether our clarifications have adequately addressed your concerns. If you have any updated thoughts, we would be grateful to hear them.
Thank you again for your time, efforts, and thoughtful engagement.
Best regards,
Authors
Seeking your valuable feedback on our response
Dear Reviewer #1,
We hope this message finds you well.
We are deeply grateful for your thorough review and acknowledgment of our work. We have provided detailed clarifications and additional experiments in response to your valuable feedback.
As the discussion period draws to a close, we would like to hear your thoughts on our response, including whether it adequately addresses your concerns. If you have any updated thoughts, we would be grateful to hear them.
Thank you again for your time and insightful engagement.
Best regards,
Authors
Response to reviewer #1
Dear Reviewer #1,
We sincerely appreciate your constructive review and supportive score. We are encouraged that you found our work well-motivated and insightful, particularly in developing QAEdit, addressing teacher forcing limitations and providing valuable insights by revealing performance degradation.
W #1: explore in-context and RAG-style editing
(1) As clarified in the Limitations (Lines 616-628), our work focuses on identifying evaluation pitfalls in model editing research, demonstrating that previously reported performance is significantly overestimated and providing valuable guidance for the field. While our evaluation framework captures relative strengths and weaknesses among methods, comparing different editing algorithms is not the primary focus of our study. A comparative analysis between parameter-based model editing and RAG/ICL-based knowledge editing extends beyond our current scope but remains an important direction for future research.
(2) To further address the reviewer's concern, we investigate a RAG/ICL-based method IKE [1] in our proposed evaluation framework. Specifically, we employ IKE on Llama3-8b and Mistral-7b in sequential editing across 1000 instances randomly and separately sampled from ZsRE, CounterFact, and QAEdit. The experimental results are presented in the following table.
| | ZsRE (Rel, Gen, Loc) | CounterFact (Rel, Gen, Loc) | QAEdit (Rel, Gen, Loc) |
|---|---|---|---|
| Mistral-7b | 0.999, 1.000, 0.818 | 1.000, 0.536, 0.358 | 0.968, 0.959, 0.477 |
| Llama3-8b | 1.000, 1.000, 0.535 | 0.999, 0.398, 0.322 | 0.952, 0.959, 0.699 |
(3) Although RAG/ICL-based methods perform well in the existing idealized setting, they may struggle to maintain robustness and effectiveness in real-world scenarios involving numerous edits and noisy information. In contrast, parameter-based editing methods do not rely on retrievers, and a single parameter update yields a permanent model change. As noted above, while such a comparative analysis would be valuable, it remains outside our current scope, which focuses on fundamental evaluation issues in model editing.
Furthermore, existing evaluation reports near-perfect success for both parameter-based and RAG-based editing methods, hiding critical shortcomings of current model editing methods. Our evaluation reveals the substantial performance gap between them. We hope our findings will encourage the research community to openly acknowledge these limitations and inspire efforts to bridge this gap.
[1] Can We Edit Factual Knowledge by In-Context Learning?
Response to reviewer #1
W #2: encompass most real-world evaluation scenarios
(1) We agree with the reviewer on the importance of comprehensive real-world evaluations. In this work, we focus on evaluation realism rather than task diversity. The concept of "real-world" encompasses two distinct dimensions: task and evaluation. Multi-hop, unstructured-text, and multilingual editing represent real-world task challenges, whereas our paper primarily targets a real-world evaluation setup, such as moving from teacher forcing to autoregressive generation. We acknowledge that these two aspects could indeed cause confusion; we appreciate the reviewer highlighting this issue and commit to clarifying the distinction in the revised version to prevent misunderstanding.
(2) In this paper, we focus on the most fundamental and general editing task rather than more complex real-world scenarios such as multi-hop editing. Since these complex tasks are typically built upon the foundation of basic QA, studying model editing in this setting provides a solid basis for identifying the key limitations of current approaches. If current editing methods struggle even in this fundamental setting, as our paper demonstrates, their limitations are likely to worsen in more challenging scenarios. This claim is further supported by our subsequent multi-hop editing experiments. We believe that identifying and addressing these foundational issues is a necessary first step toward practical and scalable model editing and provides valuable insights to the broader field.
(3) It is worth noting that our proposed evaluation framework is general and can be easily applied to more complex tasks. To further address the reviewer's concern, we assess multi-hop editing (using Portability as the metric) on the ZsRE dataset (1000 samples). Specifically, we apply various methods to Mistral-7b and Llama3-8b under the single editing setup and report both the existing editing evaluation and our real-world evaluation. The results in the following table show that existing evaluation also overestimates editing performance on the multi-hop task, and that current editing techniques exhibit an even larger decline in this complex setup than on the basic editing task. A toy illustration of the Portability check is provided after the table.
| | FT-M (Edit, Real) | ROME (Edit, Real) | MEMIT (Edit, Real) | WISE (Edit, Real) |
|---|---|---|---|---|
| Mistral-7b | 0.320, 0.005 | 0.594, 0.024 | 0.617, 0.020 | 0.476, 0.000 |
| Llama3-8b | 0.499, 0.007 | 0.559, 0.057 | 0.584, 0.068 | 0.275, 0.002 |
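For clarity, the toy illustration below shows what this Portability check amounts to; the record fields, the `answer`/`judge` helpers, and the fact itself are hypothetical placeholders rather than our actual data or implementation.

```python
# Toy illustration (hypothetical record fields and helpers) of the Portability /
# multi-hop check: after an edit is injected, the model must answer a question
# that composes the edited fact with an additional reasoning hop.

record = {
    "edit_prompt": "The headquarters of ExampleCorp is located in",  # fictional fact
    "edit_target": "Lyon",
    "hop_question": "In which country is the headquarters of ExampleCorp?",
    "hop_answer": "France",
}

def portability_success(edited_model, record, answer, judge):
    # answer(model, question) -> autoregressively generated text (no truncation)
    # judge(prediction, gold) -> bool, e.g., the LLM-as-a-Judge metric
    prediction = answer(edited_model, record["hop_question"])
    return judge(prediction, record["hop_answer"])
```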
W #3: explain why existing editing methods fail
(1) We sincerely appreciate the reviewer's acknowledgment of our paper's value. We would like to clarify that our evaluation does not cause existing editing methods to fail; rather, it exposes inherent limitations that consistently exist in these methods. Our evaluation framework serves to illuminate these existing issues that have been overlooked in previous evaluation.
(2) In Section 6, we conduct four modules of controlled experiments (i.e., input, generation strategy, output truncation, metric) to analyze why existing editing methods fail in our evaluation. Below, we present the most significant contributing factors to this performance gap:
- Section 6.2 reveals that current evaluation artificially prevents error propagation by leaking ground truth tokens through teacher forcing. However, in real-world applications, these edited models generate text autoregressively, allowing errors to cascade, which exposes the limited practical effectiveness of existing methods.
- Section 6.3 demonstrates that current editing evaluation inappropriately terminates generation at ground truth length, thereby masking subsequent errors generated by edited models. However, in real-world applications where ground truth lengths are unavailable, these concealed errors will emerge during subsequent generation.
Overall, these findings indicate that current achievements rely heavily on evaluation shortcuts (e.g., teacher forcing, length control) that mask the limitations of editing methods. By removing these artificial constraints, our evaluation framework reveals fundamental challenges inherent in existing editing techniques. We believe our work highlights the need for realistic evaluation practices in model editing, and we hope it serves as a foundation for moving towards more practical editing techniques.
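To make this contrast concrete, below is a minimal, library-agnostic sketch of the two protocols; `next_token` (a single greedy decoding step) and the token-ID lists are illustrative placeholders rather than our actual evaluation code.

```python
# Conceptual sketch of the two evaluation protocols. `next_token(model, ids)`
# is a hypothetical helper returning the model's argmax next-token prediction.

def teacher_forcing_accuracy(model, prompt_ids, target_ids, next_token):
    """Prior practice: at step t the model conditions on the *gold* prefix
    target_ids[:t], so an early mistake cannot propagate to later steps, and
    scoring stops exactly at the ground-truth length."""
    correct = 0
    for t in range(len(target_ids)):
        context = prompt_ids + target_ids[:t]   # ground-truth tokens leaked as input
        correct += int(next_token(model, context) == target_ids[t])
    return correct / len(target_ids)

def real_world_generation(model, prompt_ids, next_token, eos_id, max_new_tokens=64):
    """Our setting: the model conditions only on its own previous outputs and
    decides when to stop; the full generation is then judged against the target
    (e.g., by exact match or LLM-as-a-Judge)."""
    context, generated = list(prompt_ids), []
    for _ in range(max_new_tokens):
        tok = next_token(model, context)
        if tok == eos_id:
            break
        generated.append(tok)
        context.append(tok)                     # errors can cascade from here on
    return generated
```

Under the first protocol, a wrong token at step t has no effect on step t+1 because the gold prefix is re-injected; under the second, the same mistake alters every subsequent step and remains fully visible to the metric.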
We hope these clarifications have fully addressed your concerns. If so, we would deeply appreciate your consideration in raising your score. If you have any further suggestions, please feel free to reach out.
Official Review of Submission348 by Reviewer #2
This paper investigates the gap between the reported performance of model editing techniques for LLMs in research settings versus their effectiveness in real-world applications. The authors introduce QAEdit, a benchmark derived from standard QA datasets, and perform a systematic analysis by decomposing evaluation frameworks into four components: input, generation strategy, output truncation, and metric. Through controlled experiments, they reveal critical flaws in current evaluation practices, particularly teacher forcing and target length truncation, that artificially inflate performance metrics.
- The paper addresses a significant and timely problem: the discrepancy between laboratory performance and practical utility of model editing techniques.
- The modular analysis framework provides an elegant structure for isolating factors contributing to performance gaps, making the investigation systematic and thorough.
- QAEdit represents a valuable contribution as a more realistic benchmark for future research, bridging the gap between artificial editing tasks and real-world application scenarios.
- While the paper has done a good job at identifying problems, it offers relatively little guidance for moving forward beyond general suggestions for developing more robust methods. And I believe some findings have been mentioned in previous model editing work [1].
- The distribution of knowledge domains in QAEdit appears imbalanced (Table 10), with some categories significantly overrepresented, potentially limiting generalizability across diverse knowledge types
- The reliance on GPT-4 as a judge for evaluation may introduce potential systematic biases in the evaluation process.
[1] Gu, Jia-Chen, et al. "Model editing harms general abilities of large language models: Regularization to the rescue." EMNLP (2024).
- RAG-based editing methods are not discussed in this paper. Have you explored whether RAG approaches might address some of the limitations you've identified in parameter-based editing methods, particularly for sequential editing scenarios?
- The paper shows FT-M performs relatively better in sequential editing scenarios. Could you elaborate on the specific characteristics that contribute to this relative robustness and what lessons might be drawn for developing improved methods?
- Your results show dramatically different performance patterns for batch editing between MEMIT (better with larger batches) and FT-M (better with smaller batches). What architectural or algorithmic differences explain these opposing patterns?
There are no concerns with this submission
Official Comment by Reviewer #2
I have no further questions. Thank you for the rebuttal.
Thank you for the supportive feedback.
Dear Reviewer #2,
Thank you very much for your kind follow-up. We're glad to hear that our responses have addressed your concerns. We sincerely appreciate your thoughtful feedback and engagement throughout the review process, and we will be incorporating your valuable suggestions into our revised manuscript to further enhance its quality.
Best regards,
Authors
Respectfully seeking your valuable feedback on our response
Dear Reviewer #2,
We hope this message finds you well.
Since we have not received any feedback following our previous messages, we would like to respectfully seek your valuable thoughts on our response.
Given that the discussion period is nearing its end, your feedback would be particularly valuable in helping us understand whether our clarifications have sufficiently addressed your concerns. If you have any updated thoughts, we would be grateful to hear them.
Thank you again for your time and thoughtful engagement.
Best regards,
Authors
Seeking your valuable feedback on our response
Dear Reviewer #2,
We hope this message finds you well.
We are deeply grateful for your thorough review and acknowledgment of our work. We have provided detailed clarifications and additional experiments in response to your valuable feedback.
As the discussion period draws to a close, we would like to hear your thoughts on our response, including whether it adequately addresses your concerns. If you have any updated thoughts, we would be grateful to hear them.
Thank you again for your time and thoughtful engagement.
Best regards,
Authors
Response to reviewer #2
Dear Reviewer #2,
We sincerely appreciate your thorough review and constructive feedback. We are grateful that you acknowledge the strengths of our work, particularly in addressing a significant and timely problem, providing systematic and thorough analysis, and bridging the research gap.
W #1: guidance for moving forward:
(1) We agree with the reviewer that offering guidance for the field to move forward is important. However, we believe that clarifying fundamental misunderstandings within the field is more crucial, as it establishes a solid foundation for future development. Prior to our work, many methods appeared near-perfect under existing evaluations, creating the illusion that basic model editing tasks were largely solved and shifting the community's focus to more complex tasks. Our work reveals that significant challenges persist even in the most basic scenarios. These identified gaps provide essential guidance for the field's next stage of development. Due to the space constraints of a conference paper, we cannot comprehensively cover all aspects; in this work, we mainly focus on a critical analysis and revisiting of existing evaluation in model editing.
(2) While we focus on evaluation, our findings yield insights that can directly guide the improvement of model editing:
- Our single editing experiments (Section 6.3) show that existing techniques fail to stop after generating the target answer, introducing irrelevant or incorrect information. This highlights the need for future research to improve response termination and the consistency of generated content, for example via dynamic termination based on token-level uncertainty.
- Our sequential editing experiments (Section 7.2) reveal that current methods perform poorly in sequential settings because new edits disrupt previously injected knowledge, motivating future research on minimizing interference between sequentially injected edits (see the protocol sketch below).
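For reference, a minimal sketch of the sequential editing protocol used in Section 7 follows; `apply_edit`, `answer`, and `judge` are hypothetical placeholders for the editing call, autoregressive generation, and the LLM-as-a-Judge metric, not our actual implementation.

```python
# Minimal sketch (hypothetical helper names) of the sequential editing protocol:
# edits are applied one after another to the same model, and every edited fact
# is re-checked only after all edits are finished, so later edits can disrupt
# knowledge injected by earlier ones.

def sequential_editing_eval(model, edits, apply_edit, answer, judge):
    for edit in edits:                      # e.g., 1000 sampled QAEdit records
        model = apply_edit(model, edit)     # permanent parameter update
    successes = [
        judge(answer(model, e["question"]), e["target"])
        for e in edits                      # earlier edits may have been overwritten
    ]
    return sum(successes) / len(successes)  # edit success rate after the full sequence
```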
Findings mentioned in previous work:
As discussed in the Related Works (Lines 192-197), there is a fundamental difference between our work and previous research (Gu, Jia-Chen, et al. EMNLP 2024). Gu et al. examine how editing impacts the edited model's general capabilities on downstream tasks, whereas our work specifically investigates the effectiveness (success rates) of editing itself, a more fundamental issue of model editing research.
W #2: distribution of knowledge domains in QAEdit
(1) We wish to clarify that our research focus is not on evaluating how editing methods perform on specific categories. Instead, we systematically examine how different evaluation frameworks impact the reported results, an analysis that is agnostic to data categories. By comparing different evaluation frameworks on the same datasets, we have uncovered that existing evaluation approaches consistently overestimate performance across all three datasets and all knowledge types, highlighting a fundamental problem in the field of model editing.
(2) While the knowledge domains in QAEdit show a distribution that favors categories such as "Art & Culture" and "History & Politics", these data are derived from widely adopted QA benchmarks (i.e., Natural Questions, TriviaQA, and SimpleQA), which accurately reflect the distribution in real-world applications. Meanwhile, mainstream model editing datasets such as ZsRE (with 24% "Art & Culture" samples) and CounterFact (with 30% "People & Biographies" samples) exhibit a similar distribution pattern. Compared to these counterfactual editing datasets, our dataset originates from realistic QA scenarios and better reflects real-world requirements.
W #3: biases of GPT-4 as a judge
(1) We acknowledge the reviewer's concern regarding potential biases in LLM-as-a-Judge. We manually inspected 200 instances and found that LLM-as-a-Judge aligns closely with human judgment (96% agreement rate), without exhibiting any discernible bias patterns. Bias concerns with LLM-as-a-Judge primarily emerge in complex judgment tasks; model editing evaluation, however, is a simple task for GPT-4, requiring only verification of whether a generated answer matches the target answer, which significantly reduces the likelihood of biased behavior.
(2) During our investigation, we carefully considered multiple evaluation metrics, including exact match (EM), BERTScore (for semantic similarity), and LLM-as-a-Judge, and our proposed evaluation framework offers the flexibility to incorporate diverse metrics. We found that LLM-as-a-Judge provides the best balance of reliability, semantic understanding, and practical implementation.
Response to reviewer #2
S #1: discussion of RAG-based methods
(1) As noted in the Limitations (Lines 616-628), our study focuses exclusively on parameter-based model editing methods rather than in-context-learning-based knowledge editing approaches. Our primary objective is to critically revisit the pitfalls in current editing evaluation and provide a foundation and insights for the future development of model editing technologies.
(2) To address the reviewer's concern, we investigate a RAG/ICL-based method IKE [1] in our proposed evaluation framework. Specifically, we employ IKE on Llama3-8b and Mistral-7b in sequential editing across 1000 instances randomly and separately sampled from ZsRE, CounterFact, and QAEdit. The experimental results are presented in the following table.
| | ZsRE (Rel, Gen, Loc) | CounterFact (Rel, Gen, Loc) | QAEdit (Rel, Gen, Loc) |
|---|---|---|---|
| Mistral-7b | 0.999, 1.000, 0.818 | 1.000, 0.536, 0.358 | 0.968, 0.959, 0.477 |
| Llama3-8b | 1.000, 1.000, 0.535 | 0.999, 0.398, 0.322 | 0.952, 0.959, 0.699 |
While RAG-based methods inherently avoid sequential editing challenges through independent retrieval and demonstrate favorable performance, they require additional storage modules and introduce inference latency from repeated retrieval for each input. Rather than thoroughly comparing these two types of editing approaches, our focus is to critically examine and address fundamental issues in existing model editing evaluation.
Furthermore, existing evaluation reports near-perfect success for both parameter-based and RAG-based editing methods, concealing critical shortcomings of current model editing methods. Our evaluation reveals the substantial performance gap between them. We hope our findings will encourage the research community to openly acknowledge these limitations and inspire efforts to bridge this gap.
[1] Can We Edit Factual Knowledge by In-Context Learning?
S #2: specific characteristics of FT-M for relative robustness
(1) Characteristics
The reviewer's observation that FT-M performs relatively better in sequential editing scenarios is insightful; we have also noticed and analyzed this phenomenon. Other baselines rely on trained hypernetworks or covariance matrices computed from the original LLM, so their effectiveness degrades significantly as the model state evolves during sequential editing. In contrast, FT-M directly optimizes the current model parameters toward the target answer at each step of sequential editing, ensuring effective knowledge injection throughout the process. However, FT-M also disrupts unrelated knowledge in the LLM, resulting in poor Locality performance, as demonstrated in our experiments (Table 8).
(2) Lessons
The relatively better performance of FT-M suggests that simply and directly optimizing LLMs toward target answers may provide an effective and robust approach. Future research could explore ways to reduce the interference with unrelated knowledge caused by this direct optimization strategy. The strong performance of fine-tuning in massive editing has also been discussed in previous works [1, 2]; if the reviewer is interested in this topic, these works provide more detailed insights.
[1] Model Editing by Standard Fine-tuning
[2] Time Sensitive Knowledge Editing through Efficient Finetuning
S #3: different batch performance between MEMIT and FT-M
(1) We appreciate the reviewer's insightful observation regarding the divergent performance patterns in batch editing between MEMIT (better with larger batches) and FT-M (better with smaller batches). We provided a brief explanation of this phenomenon in Lines 543-548, but due to space constraints we could not elaborate on this finding in greater detail.
(2) Specifically, the discrepancy arises from their distinct implementations of batch editing. FT-M optimizes an aggregated batch-level loss, potentially sacrificing the memorization of individual facts for global convergence. Conversely, MEMIT estimates the parameter alteration for each piece of knowledge in isolation and then combines these alterations into the model, so completing all edits in a single step yields the most accurate estimation.
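To illustrate, here is a conceptual sketch of the two batch-editing strategies; the update rules and helper functions (`loss_fn`, `solve_delta`) are simplified placeholders, not the actual FT-M or MEMIT implementations.

```python
import torch

def ftm_style_batch_edit(params, batch, loss_fn, lr=1e-4, steps=10):
    """FT-M style: one aggregated loss over the whole batch, so individual
    facts compete within the same update and larger batches dilute each edit."""
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        loss = sum(loss_fn(params, fact) for fact in batch) / len(batch)
        opt.zero_grad()
        loss.backward()
        opt.step()

def memit_style_batch_edit(params, batch, solve_delta):
    """MEMIT style: an update is estimated per fact against the current weights,
    then all updates are merged at once, so applying every edit in a single
    large batch keeps each per-fact estimate accurate."""
    per_fact_deltas = [solve_delta(params, fact) for fact in batch]  # list (facts) of lists (params)
    with torch.no_grad():
        for param, deltas in zip(params, zip(*per_fact_deltas)):
            param.add_(sum(deltas))
```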
We commit to incorporating a detailed explanation of this performance discrepancy in the revised manuscript.
We hope these clarifications have fully addressed your concerns. If so, we would deeply appreciate your consideration in raising your score. If you have any further suggestions, please feel free to reach out.
Official Review of Submission348 by Reviewer #3
This paper critically examines the real-world effectiveness of model editing methods, revealing that their performance is significantly overestimated in prior studies. The authors introduce QAEdit, a benchmark for evaluating edits in QA tasks, and identify key flaws in existing evaluation practices, such as teacher forcing and output truncation. Their experiments show that current methods degrade quickly under sequential edits, highlighting the need for more rigorous evaluation frameworks.
Introduction of a New Benchmark and Dataset – The paper presents QAEdit, a new benchmark specifically designed for real-world QA tasks, providing valuable new research data for evaluating model editing methods.
Reevaluation of Existing Methods – It critically examines performance gaps in prior studies, identifying key flaws in evaluation practices and exposing scalability limitations in current model editing approaches.
Clarity in Presentation – The paper is well-structured and clearly articulates its findings, making complex issues in model editing and evaluation easy to understand.
- The paper effectively identifies the limitations of existing model editing methods and provides a critical analysis of current evaluation frameworks. However, it does not offer concrete suggestions for improving model editing techniques.
- Additionally, while the study highlights the challenges of sequential model editing, it lacks an in-depth discussion of Lifelong Knowledge Editing research, particularly overlooking key contributions from prior works, like [1-4]. These studies propose essential methods for long-term knowledge updating and continuous model adaptation, and their omission may limit the paper’s practical implications and comprehensiveness.
[1] Hartvigsen T, Sankaranarayanan S, Palangi H, et al. Aging with grace: Lifelong model editing with discrete key-value adaptors[J]. Advances in Neural Information Processing Systems, 2023, 36: 47934-47959.
[2] Hu C, Cao P, Chen Y, et al. Wilke: Wise-layer knowledge editor for lifelong knowledge editing[J]. arXiv preprint arXiv:2402.10987, 2024.
[3] Chen Q, Zhang T, He X, et al. Lifelong knowledge editing for llms with retrieval-augmented continuous prompt learning[J]. arXiv preprint arXiv:2405.03279, 2024.
[4] Gupta A, Baskaran S, Anumanchipalli G. Rebuilding rome: Resolving model collapse during sequential model editing[J]. arXiv preprint arXiv:2403.07175, 2024.
While the paper discusses the real-world limitations of current knowledge editing methods from a model perspective, has it considered the dataset perspective? In practical applications, knowledge is not strictly structured in subject-relation-object triples but rather expressed in free-text sentences. However, in QAEdit, the data remains formatted as triples (Figure 2). Would this representation limit the benchmark’s ability to reflect real-world scenarios accurately?
As far as I know, in the CounterFact dataset, the object (edit target) consists of only a single token. Therefore, teacher forcing may not be applicable in this dataset. This aspect might require clarification in the paper.
Although LLM-as-a-Judge offers an evaluation approach that aligns more closely with human judgment, it may be worth exploring whether additional evaluation metrics, such as semantic similarity or multi-dimensional scoring, should be incorporated to provide a more comprehensive assessment of model performance across different tasks.
There are no concerns with this submission
Heartfelt Gratitude to Reviewer #3
Dear Reviewer #3,
We sincerely appreciate you for carefully considering our responses and for raising the score in recognition of our clarifications. We are encouraged by this positive update, which suggests that our responses have addressed the key concerns raised in the initial review.
If there are any remaining aspects you would like to discuss, we would be more than happy to engage further. We deeply value the thoughtful feedback and believe it can help enhance the quality and clarity of our paper.
Response to reviewer #3
Dear Reviewer #3,
We sincerely appreciate your time and efforts on our work! We are grateful that you recognize the strengths of our work, particularly in identifying evaluation flaws, providing valuable data and evaluation, exposing method limitations, and presenting clearly.
W #1: concrete suggestions for improving model editing techniques
(1) We agree that providing suggestions for improving model editing techniques is important. While we focus on evaluation, our findings yield insights that can directly inform the advancement of model editing:
- In the single editing experiments (Section 6.3), we find that existing techniques fail to stop after generating the target answer, introducing irrelevant or incorrect information. This points to a clear direction for future research: improving termination and the consistency of subsequently generated content.
- The sequential editing experiments (Section 7.2) reveal that the poor sequential editing performance of current techniques stems from new edits disrupting previously injected knowledge. This motivates future research on approaches that minimize interference between sequentially injected knowledge.
(2) As noted in the Limitations (Lines 605-615), we believe that a sound evaluation framework provides far greater value to the field than proposing specific method improvements. A well-designed evaluation framework can expose critical weaknesses in current approaches and guide the entire research community toward truly effective solutions. Prior to our work, many methods appeared near-perfect under existing evaluations, creating an illusion that the core challenges of model editing had been largely solved. As a result, the community began to focus on more complex tasks such as multi-hop reasoning or free-text editing. However, our findings reveal that even on the most basic and controlled editing tasks, current methods still fall short of being practically reliable.
W #2: discussion of Lifelong Knowledge Editing research
(1) We acknowledge the importance of Lifelong Knowledge Editing and wish to clarify that our study does incorporate several representative methods in this domain:
- As noted in Line 328-330, we adopt R-ROME (Rebuild-ROME) [1] in our experiments, as the reviewer recommended.
- We employ the widely-adopted lifelong knowledge editing methods GRACE [2] and WISE [3] as baselines for both single and sequential editing scenarios, demonstrating their limitations in real-world evaluation.
(2) Our experiments in Section 7 reveal that even these lifelong knowledge editing methods exhibit significant limitations in just 1000 sequential edits under our proposed evaluation. The result suggests that achieving true lifelong knowledge editing remains a substantial challenge.
We would like to clarify that our work does investigate settings and methods of lifelong knowledge editing, though we did not explicitly use this keyword. We appreciate the reviewer highlighting its importance. We commit to explicitly discussing this concept and incorporating the recommended literature in the revised manuscript.
[1] Rebuilding rome: Resolving model collapse during sequential model editing
[2] Aging with grace: Lifelong model editing with discrete key-value adaptors
[3] Wise: Rethinking the knowledge memory for lifelong model editing of large language models
Response to reviewer #3
S #1: free-text editing
(1) We agree with the reviewer that free-text (unstructured) editing represents an important practical setting. As discussed in the Related Works (Lines 201-207), free-text editing is a more practical task setting, while our research focuses on developing more realistic evaluation frameworks. These two research directions are orthogonal and can be readily integrated. Indeed, our proposed real-world evaluation can be directly applied to free-text editing scenarios.
(2) While free-text editing represents practical task requirements, it remains a complex and cutting-edge task setting with relatively limited existing research. Our work focuses on identifying critical issues in existing evaluation. To provide a clear and rigorous analysis, we select the most fundamental and mainstream triple-based setting.
S #2: teacher forcing on single-token targets in CounterFact
(1) We appreciate the reviewer's thoughtful observation regarding teacher forcing and the CounterFact dataset. We would like to clarify that while single-token targets avoid ground truth leakage under the teacher forcing paradigm, the existing evaluation still artificially truncates the generated content to the ground truth length (a single token). Therefore, as Table 3 shows, existing evaluation also overestimates performance on the CounterFact dataset.
(2) While CounterFact contains single-token targets, in practical applications target answers typically comprise multiple tokens, where the limitations of teacher forcing become significant, as discussed in Section 6.2.
We sincerely appreciate the reviewer for this valuable feedback and will incorporate a clearer explanation of this nuance in our revised manuscript.
S #3: exploring additional evaluation metrics
(1) Our proposed evaluation framework offers the flexibility to incorporate diverse evaluation metrics. During our investigation, we carefully considered multiple metrics for assessing the correctness of edited LLMs' answers, including exact match (EM), BERTScore (for semantic similarity), and LLM-as-a-Judge. We found that LLM-as-a-Judge provides superior semantic understanding and evaluation capability compared to these traditional metrics and is widely adopted in the research community. Due to space constraints, we reported only LLM-as-a-Judge results in the manuscript.
(2) To address the reviewer's concern, we provide additional experimental results in the following table. Specifically, we present comparative results between exact match and LLM-as-a-Judge for Reliability in real-world evaluation on Llama3-8b using the ZsRE dataset (10,000 records) across various editing methods:
| | EM | LLM-as-a-Judge |
|---|---|---|
| FT-M | 0.651 | 0.706 |
| R-ROME | 0.706 | 0.820 |
| MEMIT | 0.729 | 0.803 |
| GRACE | 0.019 | 0.036 |
| WISE | 0.073 | 0.091 |
Through manual inspection of 200 instances, we found that LLM-as-a-Judge effectively identifies not only exact matches but also semantically consistent responses, which aligns more closely with human judgment.
Overall, for assessing answer accuracy in editing tasks, LLM-as-a-Judge provides an optimal balance of reliability, semantic understanding, and practical implementation.
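For concreteness, a minimal sketch of the two metrics is shown below; the judge prompt wording and the `call_llm` placeholder are illustrative assumptions rather than our exact implementation.

```python
import re

def exact_match(prediction: str, target: str) -> bool:
    """Strict string comparison after light normalization (lowercase, strip punctuation)."""
    norm = lambda s: re.sub(r"\W+", " ", s.lower()).strip()
    return norm(prediction) == norm(target)

JUDGE_TEMPLATE = (
    "Question: {question}\n"
    "Reference answer: {target}\n"
    "Model answer: {prediction}\n"
    "Does the model answer convey the same fact as the reference answer? "
    "Reply with Yes or No."
)

def llm_judge(question: str, prediction: str, target: str, call_llm) -> bool:
    """`call_llm` stands in for any chat-completion call (e.g., to GPT-4); the judge
    only needs to verify semantic equivalence with the target answer."""
    verdict = call_llm(JUDGE_TEMPLATE.format(
        question=question, target=target, prediction=prediction))
    return verdict.strip().lower().startswith("yes")
```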
We hope these clarifications have fully addressed your valuable feedback. If so, we would deeply appreciate your consideration in raising your score. Thank you! If you have any further suggestions, please feel free to reach out.