🦋🌪️ The Butterfly Effect of Model Editing: Few Edits Can Trigger Large Language Models Collapse

Anonymous
17 Feb 2024 · ACL ARR 2024 February Blind Submission · Readers: Everyone
Abstract: Although model editing has shown promise in revising knowledge in Large Language Models (LLMs), its impact on the inherent capabilities of LLMs is often overlooked. In this work, we reveal a critical phenomenon: even a single edit can trigger model collapse, manifesting as significant performance degradation in various benchmark tasks. However, benchmarking LLMs after each edit, while necessary to prevent such collapses, is impractically time-consuming and resource-intensive. To mitigate this, we propose using perplexity as a surrogate metric, validated by extensive experiments demonstrating its strong correlation with downstream task performance. We further conduct an in-depth study on sequential editing, a practical setting for real-world scenarios, across various editing methods and LLMs, focusing on hard cases from our previous single-edit studies. The results indicate that nearly all examined editing methods result in model collapse after only a few edits. To facilitate further research, we have utilized ChatGPT to develop a new dataset, HardEdit, based on those hard cases. This dataset aims to establish the foundation for pioneering research in reliable model editing and the mechanisms underlying editing-induced model collapse. We hope this work can draw the community's attention to the potential risks inherent in model editing practices.
Paper Type: long
Research Area: Interpretability and Analysis of Models for NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis
Languages Studied: English


Meta Review of Paper315 by Area Chair

ACL ARR 2024 February Paper315 Area Chair
06 Apr 2024, 15:54 · ACL ARR 2024 February Paper315 Meta Review · Readers: Paper315 Senior Area Chairs, Paper315 Area Chairs, Paper315 Authors, Paper315 Reviewers Submitted, Program Chairs
Paper Summary:

The paper reveals a phenomenon in model editing whereby a single edit may cause model collapse. Accordingly, the authors propose using perplexity as a surrogate metric to evaluate representative model editing algorithms in both single and sequential editing settings. In addition, the authors construct a new dataset, HardEdit, based on hard instances that may induce model collapse.

Summary Of Strengths:
  1. The paper is well-written and the problem is clearly described.

  2. The paper raises awareness about the current limitations of model editing. Besides, the authors conduct an analysis to demonstrate the rationality of using perplexity as an evaluation metric.

Summary Of Weaknesses:
  1. As mentioned by the reviewer, some previous work [1][2] has also explored themes similar to this paper's. Although the authors have responded to this question, the contribution of this paper is still limited.

  2. The authors claim that this work can effectively detect potential model collapse via perplexity; however, according to the results in Figure 3, the performance on discriminative tasks is not as sensitive to perplexity as that on generation tasks. It would be better if the authors could introduce the specific detection methods.

[1] Model Editing at Scale leads to Gradual and Catastrophic Forgetting

[2] Model Editing Can Hurt General Abilities of Large Language Models.

Overall Assessment: 3 = There are major points that may be revised
Best Paper AE: No
Information Regarding The New ACL Policy On Deanonymized Preprints: I confirm I have read the information above about changes to the anonymity policy.

Official Review of Paper315 by Reviewer #1

ACL ARR 2024 February Paper315 Reviewer #1
22 Mar 2024, 10:54 · ACL ARR 2024 February Paper315 Official Review · Readers: Program Chairs, Paper315 Senior Area Chairs, Paper315 Area Chairs, Paper315 Reviewers Submitted, Paper315 Authors
Recommended Process Of Reviewing: I have read the instructions above
Paper Summary:

This work reveals a critical phenomenon observed in LLMs: a single edit may cause the collapse of an LLM. Accordingly, the authors propose a metric to evaluate such a phenomenon systematically and curate a dataset for a fine-grained analysis of this metric. Furthermore, a rigorous dataset, HardEdit, is constructed for a comprehensive evaluation of model editing techniques.

Summary Of Strengths:
  1. The research problems and target tasks are clearly described.
  2. This study highlights a critical issue within LLMs, which is worth investigating for the research community.
  3. This study introduces a metric (perplexity) to assess its severity across different LLMs. Specifically, it conducts a detailed analysis to justify the proposed metric.
Summary Of Weaknesses:
  1. For the question “Is model collapse a common issue across different language models and editing methods?”, the three target open-source LLMs may not be sufficient to support the assertion. Incorporating a broader range of LLMs such as ChatGPT, Llama2-13B, Phi-2, etc., can be more convincing.
  2. Several datasets (e.g., HardCF, ME-PPL, HardEdit, etc.) have been introduced in this work. Their roles may not be clearly delineated, leading to potential confusion. Providing more detailed explanations regarding their respective purposes would enhance clarity and understanding.
Comments, Suggestions And Typos:

Suggestions:

  1. Please refer to some recent works (two examples listed below) relevant to this topic. It is worthwhile to make comparisons with those works and highlight your uniqueness.
    • Model Editing at Scale leads to Gradual and Catastrophic Forgetting: https://arxiv.org/html/2401.07453v2
    • UNVEILING THE PITFALLS OF KNOWLEDGE EDITING FOR LARGE LANGUAGE MODELS: https://arxiv.org/pdf/2202.05262.pdf

  2. I am curious about whether this issue has been well addressed in more powerful LLMs. Have you observed such a phenomenon on ChatGPT or other powerful open-source LLMs like Llama2-70b?

  3. When first mentioning "ROME" (L049), please provide the citation properly. Similarly, it would be friendlier for readers to use the full name of "ME-PPL" at L086.

Soundness: 3.5
Overall Assessment: 4 = This paper represents solid work, and is of significant interest for the (broad or narrow) sub-communities that might build on it.
Confidence: 2 = Willing to defend my evaluation, but it is fairly likely that I missed some details, didn't understand some central points, or can't be sure about the novelty of the work.
Best Paper: No
Ethical Concerns:

None

Needs Ethics Review: No
Reproducibility: 4 = They could mostly reproduce the results, but there may be some variation because of sample variance or minor variations in their interpretation of the protocol or method.
Datasets: 3 = Potentially useful: Someone might find the new datasets useful for their work.
Software: 1 = No usable software released.
Knowledge Of Or Educated Guess At Author Identity: No
Knowledge Of Paper: N/A, I do not know anything about the paper from outside sources
Knowledge Of Paper Source: N/A, I do not know anything about the paper from outside sources
Impact Of Knowledge Of Paper: N/A, I do not know anything about the paper from outside sources
Reviewer Certification: #1

response to reviewer #1

ACL ARR 2024 February Paper315 Authors
29 Mar 2024, 16:55 · ACL ARR 2024 February Paper315 Official Comment · Readers: Program Chairs, Paper315 Senior Area Chairs, Paper315 Area Chairs, Paper315 Reviewers Submitted, Paper315 Authors
Comment:

Dear reviewer #1,

We sincerely appreciate your constructive review and the supportive score.

Weakness 1: experiments on a broader range of LLMs

We agree with the reviewer's perspective that experimenting with a wider range of LLMs would make our work more convincing; it is indeed part of our plan for future versions of the manuscript. However, we must emphasize that such experiments are substantially resource- and time-intensive. We sincerely request the reviewer's understanding and patience with these practical limitations.

Besides, we also wish to clarify that:

  • Accessibility for model editing: model editing studied in this paper involves modifying model parameters directly. Given that ChatGPT is a proprietary LLM and does not offer the open access required for such modifications, it falls outside the scope of our current study's methodology.
  • Representativeness of selected models: our chosen models are among the most representative and extensively used in current model editing research, with Llama2-7B being one of the largest models to which editing has been practically applied. This careful selection ensures that our experiments are sufficiently persuasive within the field.
  • Primary Study Objective: we want to emphasize that our study's main goal is to unveil and highlight the potential risks associated with editing methods, rather than to perform an exhaustive evaluation across all models, which would exceed the scope of a typical conference paper.

Weakness 2: detailed introduction of the datasets

In response to the insightful comments regarding the clarity of the datasets used in our study, we provide the following clarifications:

Dataset Overview:

  • ME-PPL: This is a text dataset we developed, consisting of sentences used to calculate a model's perplexity on normal text. It serves to quickly identify whether an edited model is prone to collapse, thereby avoiding time-consuming evaluations on downstream tasks.
  • Editing Datasets: The remaining datasets serve as editing datasets. Each sample within these datasets represents an edit request. Specifically:
    • COUNTERFACT and ZsRE: as introduced in Sec. 3.1 and Appendix A.2.2, these are widely recognized datasets in the model editing domain.
    • HardCF is a curated subset of COUNTERFACT that we developed, featuring challenging samples that have been observed to induce model collapse upon a single edit.
    • HardEdit: building on the foundation laid by HardCF, we use GPT-3.5 to create a refined collection of samples designed to rigorously evaluate current editing methodologies.

We will take this valuable feedback into account and commit to providing more detailed explanations of each dataset's respective purpose in the revised version of our manuscript.

Suggestion 1: discussion of recent works

We thank the reviewer for the insightful comments and suggested references. Due to space constraints, we had to move the discussion of related work to the appendix, which may have caused some confusion. We apologize for this and will strive to improve the presentation of related work in our revised manuscript.

  1. Model Editing at Scale leads to Gradual and Catastrophic Forgetting: we have in fact cited this contemporaneous work in Appendix A.1.3 (L957-L970) and made the comparison. While that study focuses on the impact of large-scale edits on models, we focus on the possibility of model collapse caused by a small number of edits and on how to efficiently detect potential collapses in practical applications.
  2. UNVEILING THE PITFALLS OF KNOWLEDGE EDITING FOR LARGE LANGUAGE MODELS: Their research focuses on Knowledge Conflict and Knowledge Distortion within edits, which are less related to the impact of edits on the downstream task performance of models that we study. We will include a citation and discussion of their findings in the context of our research focus in the revised manuscript.

Suggestion 2: has this issue been addressed in more powerful LLMs?

As explained in our response to Weakness 1, we are actively pursuing this research direction. However, based on our research over the past months, we believe that the model collapse issues we have identified might be inherent to the editing methods themselves, suggesting that these challenges are likely to persist even in more advanced LLMs. Due to resource limitations, we hope to obtain results on such larger models in a subsequent version of our work.

We value the reviewer's understanding as we navigate these limitations and areas for future exploration.

Suggestion 3: writing issues

We sincerely appreciate your highlighting these issues regarding the clarity of terminology and the need for proper citations. We will fix them in the next revision.

We hope we have addressed all your concerns. If you have any further suggestions, please do not hesitate to reach out.


Response to authors

ACL ARR 2024 February Paper315 Reviewer #1
01 Apr 2024, 10:05 · ACL ARR 2024 February Paper315 Official Comment · Readers: Program Chairs, Paper315 Senior Area Chairs, Paper315 Area Chairs, Paper315 Reviewers Submitted, Paper315 Authors
Comment:

Thanks very much for your detailed clarification on the questions above! As for the selection of target models, I agree that it is time- and resource-intensive to conduct similar experiments on more LLMs. However, I think small-sized LLMs such as Phi-2 (or Tinyllama) could be considered, as they do not take much computational resources.


Thanks for your feedback

ACL ARR 2024 February Paper315 Authors
01 Apr 2024, 11:40 · ACL ARR 2024 February Paper315 Official Comment · Readers: Program Chairs, Paper315 Senior Area Chairs, Paper315 Area Chairs, Paper315 Reviewers Submitted, Paper315 Authors
Comment:

Thank you once again for your responsive replies and the suggestion to explore smaller LLMs like Phi-2 or Tinyllama.

In fact, as noted in our response to reviewer #2, we've extended our experiments to include other models such as T5 (encoder-decoder architecture). Since ROME and MEMIT cannot be applied to architectures other than decoder-only, we opted to try KN, a method in the same category, recommended by reviewer #2.

In the experiments where KN edited T5, we observed both single-edit collapse and sequential-edit collapse (five sequential edits on the HardCF dataset). The results are shown in the following table:

                         PIQA     Hellaswag   LAMBADA
Original T5              0.6659   0.3752      0.0526
Single-edited T5         0.5185   0.2536      0.0076
Sequentially edited T5   0.5250   0.2582      0.0064
Random guessing          0.5000   0.2500      0.0000

The results of these supplementary experiments further corroborate the findings presented in our paper.

Regarding smaller LLMs, we agree on their potential for reducing computational demands, and we plan to include such models in future versions of our manuscript. However, it is important to note that mainstream editing algorithms usually require specific model architectures and may not apply directly to all models. We kindly ask for your patience as we work on adding these results.


Official Review of Paper315 by Reviewer #2

ACL ARR 2024 February Paper315 Reviewer #2
20 Mar 2024, 12:59 · ACL ARR 2024 February Paper315 Official Review · Readers: Program Chairs, Paper315 Senior Area Chairs, Paper315 Area Chairs, Paper315 Reviewers Submitted, Paper315 Authors
Recommended Process Of Reviewing: I have read the instructions above
Paper Summary:

The paper addresses a significant gap in the evaluation of large language models (LLMs), specifically focusing on the phenomenon of model collapse during model editing. The authors convincingly argue that the widely used metric of locality is inadequate for assessing model collapse due to its limited scope and the trivial nature of the QA tasks it employs. They propose the use of perplexity as a more effective metric for evaluating model collapse, demonstrating its utility through extensive experiments.

A key strength of the paper lies in the creation of the ME-PPL dataset, which provides a diverse and high-quality resource for perplexity calculations, and the HardEdit dataset, designed to challenge model editing algorithms with samples that are likely to trigger model collapse.

However, the paper could further benefit from a deeper theoretical exploration of why perplexity is a more suitable metric compared to others, including a discussion on its limitations and potential biases. Additionally, while the extensive experiments provide valuable insights, the methodologies employed in constructing the HardEdit dataset using GPT-3.5 could be detailed more thoroughly to ensure reproducibility and to understand the selection criteria for challenging samples.

In summary, this paper makes a significant contribution to the field by highlighting the limitations of current evaluation metrics, proposing a novel approach to assess model collapse, and encouraging the advancement of more resilient model editing techniques. Future work could expand on this foundation by exploring alternative metrics, refining the proposed datasets, and developing methodologies to mitigate model collapse in LLMs.

Summary Of Strengths:
  1. Identification of the inadequacy of locality as a metric for evaluating model collapse in model editing. Proposal of perplexity as a more comprehensive metric for assessing model collapse, supported by extensive experiments. Creation of the ME-PPL and HardEdit datasets to facilitate the evaluation of model editing techniques and to challenge current methodologies with samples likely to induce model collapse.
  2. Illumination of the prevalence of model collapse across various editing methods and LLMs, highlighting the need for the research community to prioritize the development of robust model editing techniques.
Summary Of Weaknesses:

The paper lacks a discussion over the following points:

  1. It is important to highlight the significance of the effect on dependent facts (if the president of the USA is the fact to be edited, then the hometown or date of birth of the president is a dependent fact). In other words, an assessment of portability before and after the edit is lacking in the paper; the Portability metric (as shown in EasyEdit) is important, but it is missing from the current work.
  2. It is quite obvious that perplexity can be used to evaluate the LLM, but more promising metrics or approaches should be used, say, benchmarking scores over different tasks compared before and after an edit.
  3. In the paper's contribution number 3 (lines 128-130), the said contribution is already made by the MEMIT technique. In continuation of critical question 3 (lines 60-61), it would be more interesting to see whether the work also proposes to mitigate model collapse, as previous work such as MEMIT has shown it usually happens in almost all METs.
  4. A comparison with PEFT, such as LoRA or QLoRA, would be beneficial as an additional baseline to support the claims made in the paper. Additionally, KE, KN, Parrot, and other METs should be considered; currently, the set of METs seems limited.
  5. It would be interesting to see if the claims hold true for encoder-only, decoder-only, and encoder-decoder models. The content seems to have jargon, repetitive definitions, and unwanted text (say, lines 340-351).
  6. Future work needs to highlight the research directions more clearly; currently, given previous works, the stated future direction is merely intuitive.
Comments, Suggestions And Typos:

Please look at the weaknesses section for the detailed comments. Please look for jargon (line 269)! The content seems repetitive, as seen in lines 309-312 (the definition is already clear from the Introduction section). Why does Figure 7 have extra space on the right?

Soundness: 3 = Acceptable: This study provides sufficient support for its major claims/arguments. Some minor points may need extra support or details.
Overall Assessment: 2.5
Confidence: 5 = Positive that my evaluation is correct. I read the paper very carefully and am familiar with related work.
Best Paper: No
Limitations And Societal Impact:

The authors have written the limitations of the proposed work.

Ethical Concerns:

None

Needs Ethics Review: No
Reproducibility: 4 = They could mostly reproduce the results, but there may be some variation because of sample variance or minor variations in their interpretation of the protocol or method.
Datasets: 4 = Useful: I would recommend the new datasets to other researchers or developers for their ongoing work.
Software: 4 = Useful: I would recommend the new software to other researchers or developers for their ongoing work.
Knowledge Of Or Educated Guess At Author Identity: No
Knowledge Of Paper: N/A, I do not know anything about the paper from outside sources
Knowledge Of Paper Source: N/A, I do not know anything about the paper from outside sources
Impact Of Knowledge Of Paper: N/A, I do not know anything about the paper from outside sources
Reviewer Certification: #2

Reminder: Further Feedback Needed on the Authors' Rebuttals

ACL ARR 2024 February Paper315 Area Chair
01 Apr 2024, 13:32 · ACL ARR 2024 February Paper315 Official Comment · Readers: Program Chairs, Paper315 Senior Area Chairs, Paper315 Area Chairs, Paper315 Reviewers Submitted, Paper315 Authors
Comment:

Dear Reviewer,

I hope this message finds you well, and I greatly appreciate your willingness to contribute to the reviewing process. The authors have submitted their response to your comments and are unsure whether it has dispelled your doubts. If possible, please respond promptly to the authors' rebuttals. Your feedback is crucial for the next steps of the review process.

Best, AC


Thanks for your kind reminder to the reviewer

ACL ARR 2024 February Paper315 Authors
03 Apr 2024, 13:24 (modified: 04 Apr 2024, 07:47) · ACL ARR 2024 February Paper315 Author-Editors Confidential Comment · Readers: Program Chairs, Paper315 Senior Area Chairs, Paper315 Area Chairs, Paper315 Authors
Comment:

Dear (S)ACs,

Thank you for your kind reminder to reviewer #2.

Given that reviewer #2 remained unresponsive throughout the entire rebuttal process, contrary to the review policy, we respectfully request a careful reevaluation of their inaccurate criticisms in light of our detailed response during the meta-review process. We hope that special consideration can be given to the insights and recommendations of the other reviewers, such as Reviewer #3, who expressed a 'happy to recommend acceptance' stance.

We also wish to express our gratitude for the efforts and dedication of the AC team in managing and facilitating the review process.

Best regards,

Authors


Request for Facilitating Discussion from Reviewers with us

ACL ARR 2024 February Paper315 Authors
01 Apr 2024, 12:02 (modified: 01 Apr 2024, 13:06) · ACL ARR 2024 February Paper315 Author-Editors Confidential Comment · Readers: Program Chairs, Paper315 Senior Area Chairs, Paper315 Area Chairs, Paper315 Authors
Comment:

Dear (S)ACs,

We hope this message finds you well.

We are writing to respectfully request your help in encouraging reviewer #2 to check our response. We have provided detailed clarifications and conducted supplementary experiments to address their concerns comprehensively. Yet, as the discussion period draws to a close, we have not received reviewer #2's feedback on our rebuttal.

We would also like to bring to the (S)ACs' attention the quality of the review by #2, as its stated reasons for rejection (weaknesses) go against the reviewing guidelines.

  • 2. shortcuts (The authors could also do [extra experiment X])

    • Regarding Weaknesses 4-5, the call for further experimental validation may be more appropriately considered as suggestions rather than weaknesses. Our manuscript proactively discusses these limitations, especially in terms of expanding the range of models. It is important to note that our paper has already conducted very extensive experiments, incorporating four SOTA editing methods and three LLMs widely recognized in the field of model editing research. This experimental setup already represents one of the most thorough configurations in the field, effectively illuminating the critical risks associated with current model editing techniques. We note that the other reviewers (#1 and #3) acknowledge the value of our work and the validity of our experiments. Furthermore, to proactively address these considerations, we have conducted additional experiments, whose results further corroborate the findings presented in our paper.
  • E. The review does not evince expertise (comments seem to be not based on a deep understanding of the submission)

    We noticed that Weaknesses 1-3 seem to arise from misunderstandings not just about our work, but also about the model editing field at large, such as misconceptions regarding the portability metric and the MEMIT method.

    • For weakness 1: the portability metric measures the performance of an edited model on specific facts that are related to the editing request and require reasoning; it has no direct relation to the downstream task performance that is the focus of our study.
    • For weakness 2 (lack of benchmarking): we rigorously test the downstream task performance of these models across a comprehensive suite of benchmarks, presented in Figures 1(b), 2(b), and 3, along with Tables 3 and 4.
    • For weakness 3: MEMIT is recognized for improving the performance of multiple edits on traditional editing metrics, not for addressing the model collapse we identify in this paper. In fact, our research distinctively points out MEMIT's vulnerability to model collapse. This particular focus of our study diverges from the objectives of MEMIT and therefore does not undermine the contributions of our paper. This perspective has been acknowledged by the other two reviewers, #1 and #3.

Overall, we firmly believe that these weaknesses do not justify a 2.5 overall score. We have tried our best to invite reviewer #2 to discuss. Unfortunately, we did not receive any further engagement or discussion from Reviewer #2 in return.

Given these unfair criticisms, especially considering the low score and high confidence rating assigned, we respectfully request a careful reevaluation of this feedback in conjunction with our detailed rebuttal during the meta-review process.

Thank you for considering our request and for your guidance throughout this process.

Best regards


response to the summary comments of reviewer #2

ACL ARR 2024 February Paper315 Authors
30 Mar 2024, 17:56 · ACL ARR 2024 February Paper315 Official Comment · Readers: Program Chairs, Paper315 Senior Area Chairs, Paper315 Area Chairs, Paper315 Reviewers Submitted, Paper315 Authors
Comment:

Dear reviewer #2,

We sincerely appreciate your recognition of our efforts in bridging the significant gap in evaluating model editing algorithms through the creation of the ME-PPL and HardEdit datasets, and for acknowledging our work in highlighting the limitations of current evaluation metrics.

Allow us to further address two concerns raised in the summary:

"However, the paper could further benefit from a deeper theoretical exploration of why perplexity is a more suitable metric compared to others, including a discussion on its limitations and potential biases."

We value your suggestion on the need for a more comprehensive theoretical exploration of perplexity as a superior metric for evaluating model collapse. Due to space constraints, we only managed to briefly touch upon this in our paper (L347-351). We described how perplexity, by exhibiting an exponential relationship with the unsupervised pre-training loss, acts as a surrogate metric for monitoring a model's status. The comparison with other metrics, such as locality, was aimed at highlighting perplexity's advantages but, as noted, might benefit from a deeper discussion.

  • Sensitivity and Potential Bias: We acknowledge the sensitivity of perplexity, where minor variations might not directly correlate with shifts in model performance. Yet, in the context of model collapse, its utility is undeniable, as it sharply distinguishes collapsed models from functioning ones: collapsed models usually exhibit extremely high perplexity, not just subtle variations. Regarding potential biases due to the composition of the text dataset ME-PPL, we have made concerted efforts to ensure its diversity and representativeness, drawing from widely used pre-training corpora to mitigate bias.

We will include a discussion on this topic in our revised manuscript.
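
For concreteness, here is a minimal sketch of such a surrogate check: perplexity on reference sentences as the exponential of the average token-level pre-training loss. This is an illustration only, not the paper's exact implementation; the checkpoint name, function name, and sample sentence are placeholders, and it assumes the HuggingFace transformers library with any causal LM and the ME-PPL sentences substituted in.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholders: any causal LM checkpoint and any list of reference sentences.
model_name = "gpt2-xl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def corpus_perplexity(sentences):
    """Exponential of the token-averaged cross-entropy (the unsupervised
    pre-training loss) over the reference sentences."""
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in sentences:
            enc = tokenizer(text, return_tensors="pt")
            # With labels=input_ids, the model returns the mean cross-entropy
            # over the n-1 predicted tokens of the sequence.
            out = model(**enc, labels=enc["input_ids"])
            n_pred = enc["input_ids"].size(1) - 1
            total_nll += out.loss.item() * n_pred
            total_tokens += n_pred
    return float(torch.exp(torch.tensor(total_nll / total_tokens)))

# A collapsed edited model yields a value orders of magnitude above the
# pre-edit baseline on the same sentences.
print(corpus_perplexity(["The quick brown fox jumps over the lazy dog."]))
```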

"Additionally, while the extensive experiments provide valuable insights, the methodologies employed in constructing the HardEdit dataset using GPT-3.5 could be detailed more thoroughly to ensure reproducibility and to understand the selection criteria for challenging samples."

We apologize for the overly concise discussion of the construction of the HardEdit dataset, a limitation imposed by space constraints. In fact, we have elaborated on this in Appendix A.6 and Figures 10 and 12, providing a thorough presentation of the dataset creation process, including the rationale behind our prompt designs and the selection criteria for challenging samples. These appendices aim to facilitate replication and offer insights into our methodologies, further contributing to the field's understanding of model editing challenges. In future versions, when more space is available, we intend to enrich this discussion in the main body of the text to make the information more accessible and clear.

  • Reproducibility: For the HardCF and HardEdit datasets, we are committed to their release to ensure reproducibility and support research on editing-induced model collapse and the development of robust model editing methods. Currently, we are actively expanding the HardEdit dataset, resulting in its continuous evolution. It will be larger than initially described in our manuscript. To ensure the datasets' completeness and reliability, we've decided to delay their public release until the end of the review period.

We thank you again for your recognition of the contribution and potential impact of our paper, and hope we have addressed all your concerns. If you have any additional concerns or suggestions, we welcome you to reach out.


Response to reviewer #2

ACL ARR 2024 February Paper315 Authors
29 Mar 2024, 17:08 (modified: 02 Apr 2024, 15:04) · ACL ARR 2024 February Paper315 Official Comment · Readers: Program Chairs, Paper315 Senior Area Chairs, Paper315 Area Chairs, Paper315 Reviewers Submitted, Paper315 Authors
Comment:

Dear reviewer #2,

We sincerely appreciate your valuable comments.

Weakness 1: lack of the Portability metric

According to the definition in EasyEdit, Portability examines the performance of an edited model on facts related to the editing request that require reasoning, aiming to assess robust generalization. However, our study focuses on the downstream task performance of edited LLMs to determine their usability in practical scenarios. This is crucial for the deployment of model editing in real-world applications, where the general performance of edited models is essential. While Portability certainly offers insights into an editing method's effectiveness, it does not align with the specific objectives of our research; therefore, this metric was not included in our study.

Weakness 2: lack of benchmarking scores

First, we agree with the reviewer's perspective that perplexity alone is not sufficient for evaluating LLMs.

In our paper, we primarily used perplexity as a quick way to identify potential model collapses from the editing process. However, our analysis does not stop there; we rigorously test the downstream task performance of these models across a comprehensive suite of benchmarks. As detailed in our paper:

  • Figures 1(b) and 2(b), along with Table 3, present the models' benchmarking scores before and after a single edit.
  • Table 4 offers a direct comparison between the models after sequential edits and their original states, highlighting the issues in sequential edits.
  • Figure 3 illustrates how varying levels of perplexity influence the edited models' benchmarking scores, providing insight into the relationship between perplexity and task performance.

We apologize for any misunderstanding that may have arisen regarding the evaluation metrics used in our study. We will refine our presentation to ensure our experimental outcomes are clear.

Weakness 3: contribution already made in MEMIT & mitigating model collapse

We appreciate the reviewer’s attention to the contributions outlined in our paper and the comparison with MEMIT.

MEMIT is recognized for improving the performance of multiple edits on traditional editing metrics, not for addressing the model collapse we identify in this paper. In contrast, our third contribution uncovers model collapse as a critical issue in sequential editing settings, an insight not recognized by MEMIT. In fact, as demonstrated in Table 4, our research distinctively points out MEMIT's vulnerability to model collapse. This particular focus of our study diverges from the objectives of MEMIT and therefore does not undermine the contributions of our paper.

Furthermore, while developing strategies to mitigate model collapse is undoubtedly important, our study currently focuses on identifying and understanding the model collapse within existing editing methods, including MEMIT. Our exploration into efficient detection mechanisms aims to pave the way for future solutions. Addressing and resolving model collapse, although beyond the scope of our present work, remains a critical objective for our further research.


Response to reviewer #2

ACL ARR 2024 February Paper315 Authors
29 Mar 2024, 17:07 (modified: 01 Apr 2024, 17:38) · ACL ARR 2024 February Paper315 Official Comment · Readers: Program Chairs, Paper315 Senior Area Chairs, Paper315 Area Chairs, Paper315 Reviewers Submitted, Paper315 Authors
Comment:

Weakness 4: experiments on more editing methods

We acknowledge and appreciate the reviewer's suggestion that incorporating a broader range of editing methods could make our work more comprehensive. However, given the considerable resource and time demands of our experiments, we prioritized the most commonly used setups in the field. We selected four representative and widely adopted editing methods in the current literature, spanning three key categories of model editing techniques.

  • PEFT Methods: Our focus on these selected methods stems from their proven impact and the limited effectiveness of fine-tuning in model editing contexts. PEFT approaches, designed for enhancing fine-tuning efficiency, do not directly address the core challenges of editing with constrained data volumes, hence their exclusion from our primary analysis.

  • Other METs: The absence of methods like KN and KE from our study is due to their comparatively lower performance within their respective categories (compared with ROME and MEND), as evidenced by recent research findings. However, to address concerns regarding the diversity of METs explored, we conducted additional tests with KN on three LLMs (GPT-2-XL, GPT-J, and Llama2-7b):

    • A selected edit sample from the COUNTERFACT dataset resulted in significantly diminished downstream task performance for Llama2-7b.

      Model                PIQA     LAMBADA   Hellaswag
      Original_Llama2-7b   0.7845   0.6814    0.5706
      Edited_Llama2-7b     0.5256   0.0000    0.2583
      Random_guessing      0.5000   0.0000    0.2500
    • The three models sequentially edited by KN on HardCF all exhibit severe collapse:

      Model                PIQA     LAMBADA   Hellaswag
      Original_GPT-2-XL    0.7084   0.4461    0.4004
      Edited_GPT-2-XL      0.5332   0.0000    0.2610
      Original_GPT-J       0.7541   0.6136    0.4953
      Edited_GPT-J         0.5174   0.0000    0.2561
      Original_Llama2-7b   0.7845   0.6814    0.5706
      Edited_Llama2-7b     0.5152   0.0000    0.2591
      Random_guessing      0.5000   0.0000    0.2500

In summary, our comprehensive approach ensures a robust examination of model editing's impacts, aligning with the methodologies and models frequently discussed in published works within our domain.

Weakness 5: more backbone models

We recognize and value the reviewer's interest in assessing the applicability of our findings across different model architectures.

Our focus on decoder-only models, prominently represented by the GPT and LLaMA series, is driven by their prevailing status as the mainstay in both model editing research and the broader NLP field. Their wide applicability and strong performance underscore the relevance and importance of our findings within this mainstream context.

While encoder-only models are indeed instrumental for specific tasks like classification, their deployment is comparatively narrower than that of decoder-only models.

To address the reviewer's concern, we take the encoder-decoder T5 model as an example. Since ROME and MEMIT cannot be applied to architectures other than decoder-only, we opted to try KN, the same-category method recommended by the reviewer. In the experiments where KN edited T5, we observed both single-edit collapse and sequential-edit collapse (five sequential edits on the HardCF dataset). The results are shown in the following table:

                         PIQA     Hellaswag   LAMBADA
Original T5              0.6659   0.3752      0.0526
Single-edited T5         0.5185   0.2536      0.0076
Sequentially edited T5   0.5250   0.2582      0.0064
Random guessing          0.5000   0.2500      0.0000

Weakness 6: future work

We appreciate your suggestions on future work writing, and we will polish this part accordingly.

Suggestions

Thank you for pointing out the issues related to jargon and repetition within the manuscript, as well as the formatting concern regarding Figure 7. We will refine our writing in subsequent revisions to ensure clarity and conciseness throughout the paper.

The extra space beside Figure 7 arises from the layout constraints of LaTeX dual-column formatting. Since the subsequent figures are all large, cross-column figures, there was no suitable content to fill the space adjacent to Figure 7. We will adjust the layout to better utilize the space and prevent such formatting inconsistencies in future versions.


We hope we have addressed all your concerns. If things are clearer now, we kindly request that you reconsider the score you have assigned. Thanks!


Seeking your valuable feedback on our response

ACL ARR 2024 February Paper315 Authors
31 Mar 2024, 08:27 (modified: 31 Mar 2024, 08:28) · ACL ARR 2024 February Paper315 Official Comment · Readers: Program Chairs, Paper315 Senior Area Chairs, Paper315 Area Chairs, Paper315 Reviewers Submitted, Paper315 Authors
Comment:

Dear Reviewer,

We hope this message finds you well.

We sincerely appreciate the time and effort you've dedicated to reviewing our submission. In response to your valuable feedback, we have provided detailed clarifications to the questions raised and supplemented with important additional experiments.

As we are nearing the end of the discussion period, we would love to hear your thoughts on our response, including whether it sufficiently addresses your concerns. If you see potential for a score increase based on our revisions and discussions, we would greatly appreciate your consideration in this matter.

We are committed to including all your suggestions in our revision to enhance the quality of our manuscript. We hope we have addressed all your concerns and look forward to your further comments and discussions.

Best regards,

Authors


Official Review of Paper315 by Reviewer #3

ACL ARR 2024 February Paper315 Reviewer #3
17 Mar 2024, 22:53 (modified: 30 Mar 2024, 05:41) · ACL ARR 2024 February Paper315 Official Review · Readers: Program Chairs, Paper315 Senior Area Chairs, Paper315 Area Chairs, Paper315 Reviewers Submitted, Paper315 Authors
Recommended Process Of Reviewing: I have read the instructions above
Paper Summary:

This paper suggests that perplexity on reference texts could serve as an easy-to-compute proxy for downstream evaluations of model editing's impact on capability. The authors demonstrate that perplexity generally does reflect a phenomenon they term model collapse, where very high perplexity post-edit correlates with very poor downstream performance. They flesh this out by constructing a dataset of samples that induce model collapse and performing sequential editing to understand how collapse occurs in this setting.

Summary Of Strengths:

This paper addresses an important problem: how do we evaluate the impact of model editing on general model qualities post-editing? To my knowledge this is the first attempt at developing a computationally efficient proxy evaluation as a predictor of downstream task performance post-editing. This paper raises awareness about the current limitations of model editing, vis-a-vis global impact on downstream performance, even in cases that don't cause model collapse.

Summary Of Weaknesses:

One minor weakness: isn't perplexity as a proxy already suggested by the ROME paper (Fluency)? Shouldn't this be acknowledged? I understand that it isn't connected to downstream evaluation, which is this paper's novel contribution.

One weakness that I think could be addressed easily, and needs to be addressed before I can recommend acceptance, is the statistical validity of the correlation experiment. It seems as though only 7 data points are sampled per model. I have a really hard time believing this resulted in significant correlations (402 - yet the authors neglected to show the significance test). In order to accept the validity of ME-PPL as a surrogate for these datasets I expect to see a proper correlation experiment with Spearman's Rho or something equivalent showing the correlation scores per model with their statistical significance. I acknowledge the expense of running LM-eval; the authors can use a power analysis to estimate beforehand how many samples they will need to do a sample-efficient, well-powered correlation experiment. Without this analysis, I don't think the community can trust ME-PPL as a surrogate measure.

Another issue that may stem from my misunderstanding, which hopefully the authors can address, is the prevalence of model collapse. From my reading, model collapse only occurs 0.01% of the time in Llama2 (21 out of 21k samples) in a single editing setting; this is a direct contradiction of the paper's main claims and narrative. I wouldn't recommend rejection on this alone, as understanding model collapse even in the worst-case scenario is very important, but I would need to see significant revisions of the narrative and claims of the paper before I would be comfortable accepting if this is indeed the case (model collapse is extremely rare, which is in line with the contradictory results in the next paragraph). One worry, if this is the case, is that there are other confounding factors that explain 0.01% model collapse - for example, this could be explained by a hardware glitch or simple randomness of initialization or optimization in ROME. Perhaps then running the experiment multiple times with random seeds could address this concern? (I acknowledge that this is "proven" out in Figure 5, but sequential editing is potentially a confounding factor here.)

I am struggling a bit to trust the results of this paper for single editing, given both the results of the ROME paper on Fluency and Consistency and the more recent work of "Rosati, D., Gonzales, R., Chen, J., Yu, X., Erkan, M., Kayani, Y., ... & Sajjad, H. (2024). Long-form evaluation of model editing.", which does find an impact of model editing on long-form generation (lexical cohesion issues) but doesn't find model collapse in single-edit settings; they didn't evaluate sequential edits. Perhaps it is that they didn't evaluate on LM-eval and only used a sample of Counterfact, which may or may not overlap with HardCF? Can the authors add commentary here? My suggestion would be to perhaps add an appendix showing what a range of text generation outputs across perplexities on ME-PPL looks like. This would additionally help us trust the perplexity threshold of 1000 as a model collapse indicator.

Related: Decoding strategy can have a large impact on text generation. When reading this paper I worry a lot that if we use different decoding strategies then text generation quality could be quite good despite “model collapse” which might explain the discrepancy with the above. I think my suggestions above would address this worry but perhaps this is a limitation you can mention.

Finally, I am a little baffled still (maybe I am missing something) about what the common pattern to HardCF and hard examples is. The authors make the observation that these have common words as objects, but I am quite familiar with Counterfact and the examples they provide don't seem to me to be any different from most other Counterfact examples, which also use common words. Since this is the key insight of the paper, I think the authors need to really clarify what makes hard examples that cause model collapse different from normal examples. Perhaps they can provide an analysis of how many other Counterfact samples use those words, or provide a list of words from the hard and normal samples and explain how they are different in a table?

To summarize, I’d like to see the following things before I can recommend acceptance:

(1) Statistical significance tests and proper correlation analysis for ME-PPL

(2) Address my concern that model collapse appears to be extremely rare while the narrative suggests model collapse is a major risk of model editing

(3) Provide a much clearer explanation of the differences between samples that cause model collapse and samples that do not.

If these are satisfactorily addressed in the discussion I am happy to raise my scores.

Comments, Suggestions And Typos:

145: I find this notation confusing - The way I am reading it is - we find the parameters such that an edit algorithm equals those parameters. I think what we want here is to find the parameters using the edit algorithm? Or that makes an edit condition true?

159: portability is an important property that model editing methods are evaluated on.

167: A constraint such as what?

191: It is probably worth mentioning either here or in the Appendix A.1: “Hase, P., Bansal, M., Kim, B., & Ghandeharioun, A. (2023). Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models.” which shows that localization generally doesn’t find the optimal locations for editing.

A.3 + Figure 7 could use more clarity. What is Case ID? What is the perplexity being measured on? The caption indicates ZsRE, but I think you are measuring perplexity on LM-eval, right? If so, what datasets are these from? And what is the baseline perplexity? If not, why does it make sense to evaluate perplexity on the original ZsRE questions? What does this tell us?

Similar problems with Figure 2a: what is each point's perplexity calculated on? The ROME edit statement or some sentences from section 5 (283)? I assume the red bar is the perplexity of the baseline model, but I don't know from the caption.

298: Maybe “top 30 post-edited models” for clarity

304: Under what generation settings was this done (temperature, etc.)?

1030-1031: This is a bit of an empty statement - what were these innovations and optimizations specifically?

322: "Proven crucial" is maybe too strong for me at this point. "Shows promise" is more appropriate.

325: I don't think locality is the only measure of capability here that seeks to understand model collapse. I mentioned portability before, but the ROME paper itself uses Fluency and Consistency over generated texts and doesn't observe model collapse on those generated texts…

Table 1+327: I don’t know what locality of 1 means. Like locality was correctly scored?

333: I don't see locality as a QA task? It seems like a token completion task. I agree with this statement, but I think it's more appropriate to say that token completion tasks don't assess the entire range of functionality.

348: What is the theoretical perspective you are referring to from Radford?

377-384: I understand what you are saying here but I think it could be clearer that you are selecting models which achieve these perplexity values on ME-PPL and then benchmarking them according to LM-eval - it took me a bit to get to this understanding

430-440: Maybe I am missing something here, but since CounterFact is over 21k samples, these are extremely low numbers (i.e., model collapse occurs 0.1% of the time for Llama2). I would not call this "consistently causes all three LLMs to collapse"; based on these results I would say it is extremely unlikely that ROME causes model collapse.

433: I think it is critical that you provide a set of tables for these experiments; you can use the appendix. Since 2(a) is a pilot experiment, I don't think it's fair to say it resembles this experiment.

448-449: but it seems like this is all of Counterfact? How is this different from the rest of the samples, like what were the particular formats? It isn't obvious to me what is unique about the samples in Table 2 versus typical Counterfact examples…

Table 3: I think it would enhance the paper if you also presented lowest perplexity edit and potentially a perplexity between lowest and highest - that will help the community figure out what the relationship between perplexity and downstream evaluation is.

498: What is occurring a few times? Less than 60 in the normal cases?

Figure 5: I am not sure what to do here, but I can see how this chart could be a bit misleading since the y-axes are all different. Maybe you can mention this in the caption and caution the reader to pay careful attention to the differences.

Table 4 - 520-536: I really like this analysis. I think my worry is the phrasing "pose a substantial risk of collapsing under sequential editing", since the hard samples are a very, very small part of Counterfact. I do think the normal-cases analysis shows that you could say something like "Sequential editing poses a risk for model quality degradation", but I'd worry about any stronger claim…

555: How was ii) determined? Surely this is model specific - can you say more here about what models this applies to and how you measured this?

Figure 6: in an ideal world this graph has a control line on “normal” edits.

Soundness: 3 = Acceptable: This study provides sufficient support for its major claims/arguments. Some minor points may need extra support or details.
Overall Assessment: 3 = Good: This paper makes a reasonable contribution, and might be of interest for some (broad or narrow) sub-communities, possibly with minor revisions.
Confidence: 4 = Quite sure. I tried to check the important points carefully. It's unlikely, though conceivable, that I missed something that should affect my ratings.
Best Paper: No
Limitations And Societal Impact:

Yes, I added some suggestions for more.

Ethical Concerns:

None

Needs Ethics Review: No
Reproducibility: 2 = They would be hard pressed to reproduce the results: The contribution depends on data that are simply not available outside the author's institution or consortium and/or not enough details are provided.
Datasets: 3 = Potentially useful: Someone might find the new datasets useful for their work.
Software: 1 = No usable software released.
Knowledge Of Or Educated Guess At Author Identity: No
Knowledge Of Paper: N/A, I do not know anything about the paper from outside sources
Knowledge Of Paper Source: N/A, I do not know anything about the paper from outside sources
Impact Of Knowledge Of Paper: N/A, I do not know anything about the paper from outside sources
Reviewer Certification: #3

Response to reviewer #3

ACL ARR 2024 February Paper315 Authors
29 Mar 2024, 17:28 · ACL ARR 2024 February Paper315 Official Comment · Readers: Program Chairs, Paper315 Senior Area Chairs, Paper315 Area Chairs, Paper315 Reviewers Submitted, Paper315 Authors
Comment:

Dear reviewer #3,

We sincerely appreciate your time and effort in providing insightful feedback on our manuscript.

Weakness 1: Isn’t perplexity as a proxy already suggested by the ROME paper (Fluency)?

We thank the reviewer for recognizing the distinction between our use of perplexity as an evaluation metric and its application in the ROME paper, and for highlighting the importance of this discussion within our field.

It is critical to clarify that the fluency metric employed by ROME focuses on identifying repetitive word patterns via bi- and tri-gram entropies, which is distinct from our use of perplexity. The fluency metric in ROME can be considered a simplified version of perplexity, not a direct measurement thereof.

In contrast, we use perplexity as a proxy for the overall downstream task performance, in order to quickly identify the collapse caused by editing methods.

We appreciate the opportunity to further delineate this distinction and will make appropriate enhancements to our manuscript to reflect this discussion.
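
To make the contrast concrete, below is a minimal sketch of an n-gram entropy check in the spirit of ROME's fluency metric (our simplified reading, not ROME's exact weighting of bi- and tri-grams; the function name and whitespace tokenization are illustrative assumptions): it scores the degeneracy of a generated text, whereas perplexity scores the model's own likelihood of reference text.

```python
import math
from collections import Counter

def ngram_entropy(text, n=2):
    """Entropy (bits) of the n-gram distribution of a generated text.
    Repetitive, degenerate generations collapse toward zero entropy."""
    tokens = text.split()
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# A repetitive (collapse-style) generation scores near zero bits,
# while a varied sentence scores higher.
print(ngram_entropy("the the the the the the"))
print(ngram_entropy("model editing revises factual knowledge stored in model weights"))
```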

Weakness 2: the statistical validity of the correlation experiment

We are deeply grateful for the reviewer's constructive feedback, which we acknowledge as highly valuable in strengthening our manuscript. Our initial exclusion of a detailed correlation analysis was due to the clear patterns in Figure 3. However, we recognize that this omission may have led to concerns regarding the empirical rigor of our claims and the statistical foundation of our analysis.

To address this concern, we conducted a correlation experiment using Spearman's rank correlation (scipy.stats.spearmanr) between perplexity (PPL) and performance on three downstream tasks.

  1. We add a correlation analysis of the results in Figure 3, as shown in the following tables:

    GPT-J
                     PPL vs PIQA   PPL vs LAMBADA   PPL vs Hellaswag
    Spearman Rho     -1.0          -0.991           -1.0
    p-value          0.0           1.456e-05        0.0

    GPT-2-XL
                     PPL vs PIQA   PPL vs LAMBADA   PPL vs Hellaswag
    Spearman Rho     -1.0          -1.0             -1.0
    p-value          0.0           0.0              0.0

    Llama2-7b
                     PPL vs PIQA   PPL vs LAMBADA   PPL vs Hellaswag
    Spearman Rho     -0.929        -0.929           -0.964
    p-value          2.519e-03     2.519e-03        4.541e-04
  2. Furthermore, we have enhanced our analysis by expanding to 20 data points (edited models), chosen to cover a broad perplexity range up to 1000, as models beyond this threshold showed diminished downstream capabilities.

    Llama2-7b

    Perplexity 37.25 92.25 131.00 199.44 241.96 296.25 342.81 403.34 445.54 477.37 566.90 601.56 636.18 708.46 738.16 796.80 834.88 911.18 948.06 988.97
    PIQA 0.7845 0.7334 0.7116 0.6861 0.6757 0.6730 0.6643 0.6572 0.6540 0.6502 0.6360 0.6355 0.6306 0.6251 0.6219 0.6197 0.6181 0.6066 0.6050 0.6039
    LAMBADA 0.6814 0.1285 0.0417 0.0126 0.0078 0.0054 0.0041 0.0029 0.0021 0.0017 0.0008 0.0010 0.0006 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002
    Hellaswag 0.5706 0.4837 0.4505 0.4113 0.4036 0.3942 0.3827 0.3674 0.3609 0.3586 0.3448 0.3449 0.3450 0.3372 0.3349 0.3332 0.3323 0.3278 0.3253 0.3238

                     PPL vs PIQA   PPL vs LAMBADA   PPL vs Hellaswag
    Spearman Rho     -1.0          -0.977           -0.994
    p-value          0.0           1.465e-13        9.578e-19

Additional results are currently underway. Running LM-eval is time-intensive; we kindly ask for your patience as we complete these experiments.

The analysis clearly reveals a very strong correlation between perplexity and downstream task performance, with Spearman's Rho values approaching -1 for all tasks, underscoring a significant inverse relationship: as perplexity increases, performance on downstream tasks decreases. These findings serve as a robust validation of perplexity on ME-PPL as a surrogate measure for downstream task performance.
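
For reference, the correlation computation itself reduces to a single SciPy call; below is a minimal sketch using the first few Llama2-7b points from the table above (an illustrative subset, not the full 20-point analysis).

```python
from scipy.stats import spearmanr

# Illustrative subset of the Llama2-7b results reported above:
# perplexity on ME-PPL vs. PIQA accuracy of the edited models.
ppl  = [37.25, 92.25, 131.00, 199.44, 241.96, 296.25]
piqa = [0.7845, 0.7334, 0.7116, 0.6861, 0.6757, 0.6730]

rho, p_value = spearmanr(ppl, piqa)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3g}")
# The relationship is strictly monotone decreasing, so rho is -1.0 here.
```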

We will add these additional experiments into subsequent versions of our manuscript. We appreciate the opportunity to clarify these aspects and thank the reviewer for prompting this significant enhancement to our work.


Thank you for your thorough revisions

ACL ARR 2024 February Paper315 Reviewer #3
30 Mar 2024, 05:43 · ACL ARR 2024 February Paper315 Official Comment · Readers: Program Chairs, Paper315 Senior Area Chairs, Paper315 Area Chairs, Paper315 Reviewers Submitted, Paper315 Authors
Comment:

I have updated my score to reflect my deep appreciation for your addressing my comments. I do believe the authors have made significant efforts to update the clarity and validity of their contribution, and I am happy to recommend acceptance.


Thank you for the responsive replies and supportive feedback.

ACL ARR 2024 February Paper315 Authors
30 Mar 2024, 10:59 · ACL ARR 2024 February Paper315 Official Comment · Readers: Program Chairs, Paper315 Senior Area Chairs, Paper315 Area Chairs, Paper315 Reviewers Submitted, Paper315 Authors
Comment:

We are deeply grateful for your revised score and your recommendation for acceptance. It is encouraging to learn that our efforts to address your concerns were well-received, and we appreciate the acknowledgment of the enhancements made to the clarity and validity of our contribution.

We have noted your reservations regarding aspects such as Soundness, Reproducibility, and Software. Allow us to clarify these points further:

  • Code: We have provided an anonymous GitHub link (https://anonymous.4open.science/r/C341) in the manuscript for our experimental code, primarily based on the open-source toolkit EasyEdit, to ensure transparency and facilitate reproduction.

  • Data Availability: Regarding the datasets HardCF and HardEdit, we are committed to making them fully public. We are actively expanding the HardEdit dataset, so it will be larger than initially described in our manuscript. This effort aims to offer a comprehensive resource that facilitates the exploration of the mechanisms behind editing-induced model collapse and the development of reliable model editing approaches. To ensure the datasets' completeness and reliability, we have decided to delay their public release until after the review period.

We understand the significance of these aspects for the soundness and reproducibility of our research and are committed to making these resources available to the research community soon.

Thank you once again for your responsive replies. We are truly encouraged by your positive feedback and your willingness to recommend acceptance, which means a lot to us. Through our constructive discussions, we believe we both recognize the value and potential impact of this work, as it unveils critical potential risks in the rapidly evolving field of model editing and introduces a new dataset to the community, encouraging further research. If you see potential for a score increase based on our revisions and discussions, we would greatly appreciate your consideration, and we would welcome any further feedback and questions.


Response to reviewer #3

ACL ARR 2024 February Paper315 Authors
29 Mar 2024, 17:24ACL ARR 2024 February Paper315 Official CommentReaders: Program Chairs, Paper315 Senior Area Chairs, Paper315 Area Chairs, Paper315 Reviewers Submitted, Paper315 Authors
Comment:

Weakness 3: the prevalence of model collapse

We apologize for any confusion caused by our narrative and appreciate the opportunity to clarify the prevalence of model collapse as observed in our experiments. Our intention is to present our findings accurately and to contribute meaningful insights to the field. In light of the reviewer's feedback, we will refine our wording to better articulate the conditions under which model collapse occurs and its implications for large-scale model editing. Below, we address and clarify these concerns:

  • prevalence of model collapse:
    • First, we would like to clarify and precisely state: model collapse is rare in single editing but common in sequential editing.
    • In the single editing setting, for Llama2-7b, the model collapse rate is 0.1%, not 0.01% (21 out of 21k samples); for GPT-J, the collapse rate is 0.4%; for GPT-2-XL, it is 0.35%. These ratios, while seemingly small, represent a significant concern in real-world applications where the volume of edits can be vast.
    • The single editing setting is mainly designed as an idealized investigation into the effect of each edit, isolated from the impact of other edits. It is important to recognize the limitations of this setup: in real-world usage, models are not edited merely once but are continuously edited based on evolving requests.
    • More critically, in the sequential editing scenario, the likelihood of collapse becomes markedly higher, as shown in Figures 5 and 6. This setup more accurately reflects real-world editing practices and demonstrates that, under these conditions, model collapse is not just a possibility but a common outcome.
    • We claim the prevalence of model collapse mainly within the context of sequential editing. We apologize for any confusion this may have caused and will clarify our text to prevent such misunderstandings in the future.
  • worry about confounding factors:
    • For the collapse samples of ROME, we ran the experiments multiple times to ensure that these findings are stable and reproducible, eliminating influences such as hardware or random seeds.
    • Meanwhile, similar phenomena have also been independently discovered in the contemporaneous work Model Editing at Scale leads to Gradual and Catastrophic Forgetting (https://arxiv.org/html/2401.07453v2). It focuses on the impact of large-scale edits on models, while we focus on exploring the possibility of model collapse caused by a small number of edits and on how to efficiently detect potential collapses in practical applications.

In summary, we acknowledge that model collapse in the single-edit setting is rare. However, our findings demonstrate that certain edits, such as the altered fact "Twitter was acquired by Elon Musk," can indeed induce collapse in GPT-J, reflecting a real risk in real-world editing scenarios. Crucially, when extensive consecutive edits are necessary for practical applications, model collapse can become very common. The primary objective of our paper is to unveil the existence of this risk, as we believe the likelihood of such a risk occurring is much higher than expected in real-world scenarios.


Response to reviewer #3

ACL ARR 2024 February Paper315 Authors
29 Mar 2024, 17:23ACL ARR 2024 February Paper315 Official CommentReaders: Program Chairs, Paper315 Senior Area Chairs, Paper315 Area Chairs, Paper315 Reviewers Submitted, Paper315 Authors
Comment:

Weakness4: concerns about the credibility of single-edit results

We fully appreciate and understand the reviewer's concerns and doubts regarding the results of the single editing experiments. When we first encountered the collapse phenomenon, our reaction mirrored yours. Here, we address the points raised:

  • Regarding the “Long-form evaluation of model editing” study: we carefully reviewed their experimental setup and noted that their analysis was based on a relatively small sample of 100 instances from the COUNTERFACT dataset, which is likely insufficient to include the hard samples we identified, as detailed below:

    "We perform these evaluations on 100 randomly sampled edits from Counterfact and zSRE. For the zSRE setting, ...... . In total, we assess 300 samples (600 passages)" in page 5.

  • Fluency and Consistency in ROME:

    • As we stated before, these metrics cannot capture the full spectrum of model behavior after editing.
    • Notably, the fluency metric in ROME is calculated by the edited models themselves, assessing the coherence of the texts they generate. We found that collapsed models tend to assign lower perplexity scores to their own incoherent outputs, as these are consistent with their compromised state. This discrepancy is a significant factor in why the collapse phenomenon was not detected in previous studies (a minimal sketch of sentence-level perplexity computation is given after this list).
    • Moreover, the evaluation metrics in current works are averaged across all results. Such averaging masks the collapse on individual samples, failing to highlight instances of complete failure.
  • the credibility of single-edit results: as highlighted in our response to Weakness 3, the contemporaneous work Model Editing at Scale leads to Gradual and Catastrophic Forgetting also independently discovered model collapse:

    "Finally, we take a deeper look at the specific edits that cause the inflection point in ROME. We call these edits disabling edits, as **they disable the model and make it unusable for downstream tasks**. ...... This shows that the disabling edits in ROME are not a result of continuous sequential editing of the model, but a fundamental limitation of ROME." in sec 3.4.1

    Unlike their work, which merely observed the phenomenon of collapse caused by a single edit, we conducted extensive experiments to collect such hard samples and build the HardEdit dataset. This dataset enables a thorough evaluation of mainstream editing methods and LLMs.
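
For concreteness, here is a minimal sketch of how the sentence-level perplexity of a (possibly edited) model can be computed with Hugging Face transformers; this helper is illustrative only and is not our exact released implementation:

```python
import math
import torch

def sentence_perplexity(model, tokenizer, text):
    # Perplexity of `text` under `model`: the exponential of the mean
    # token-level negative log-likelihood (cross-entropy) of the sentence.
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss  # mean cross-entropy
    return math.exp(loss.item())

# Usage (hypothetical): average sentence_perplexity over ME-PPL sentences for
# an edited model, or score a model's own generations as discussed above.
```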

Regarding the suggestion to "add an appendix to show what a range of text generation outputs across perplexities on ME-PPL": we will incorporate more detailed cases showing text generation quality at varying perplexity levels in the appendix of the revised manuscript. The following is a simple example:

Perplexity 1053.38 4466.30 8101.94 10694.47 13401.50
Generated Texts The is The new year isFL is a year town is The 8 The 2 is The 79 89 when the 9999 when the year is . AF and AF and modelsinsionet om AF and H A.Ъ nobodyjahr nobodý ла a modelś ла no a .́ лаOAF de Deś ла a no nobody elton a a tern L-- --- ---,--his--toЉ--cht--isto kwiet.Љ,aiskihkenkihhettais,ais,hh _hukan,hh _ packan isa _hukan kaan 10,Љ _lukkiiski D > Dil D dat m 1nen nen pr-1 > 1 ris-net-nets prsw pr- best und alsw ysm 10 егоiдар The one good AA царatic the way everybody register on nobodyitisibriol333iatic in the iaticromisante val10022landi

Regarding the selection of a threshold of 1000, it is based on Figure 3, which shows that the LLMs evaluated perform poorest on downstream tasks when perplexity surpasses 1000.


Response to reviewer #3

ACL ARR 2024 February Paper315 Authors
29 Mar 2024, 17:21ACL ARR 2024 February Paper315 Official CommentReaders: Program Chairs, Paper315 Senior Area Chairs, Paper315 Area Chairs, Paper315 Reviewers Submitted, Paper315 Authors
Comment:

Weakness5: the impact of Decoding strategy on text generation

We thank the reviewer for the insightful comments on the impact of the decoding strategy. Decoding strategies indeed play a role in the performance of language models. However, their impact is typically limited and unlikely to dramatically improve the downstream performance of a collapsed model.

Here, we address the concerns raised:

  • Our experiments use standard settings prevalent in the field, specifically temperature=0.0 and do_sample=False (greedy decoding); a minimal configuration sketch is given after this list.
  • We have tested various parameter settings, including temperature, top_k, and num_beams, and observed that models experiencing collapse maintain their collapsed state regardless of the decoding strategy employed. This confirms that the observed model collapses are not mitigated by altering decoding strategies.
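
For reference, a minimal sketch of this greedy-decoding configuration with Hugging Face transformers; the model name and prompt below are placeholders rather than our exact experimental setup:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2-xl"  # placeholder: any of the evaluated (edited) models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Twitter was acquired by", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=False,    # greedy decoding; the sampling temperature is not used
    max_new_tokens=50,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```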

We will discuss the impact of decoding strategy in the revised manuscript.

Weakness 6: the differences between samples that cause model collapse and normal samples

We thank the reviewer for the constructive feedback. To better elucidate the difference between the "hard examples" that cause model collapse and normal counterfactual examples, we plan to make the following additions in the revised manuscript:

  1. Comparison between Hard and Normal Samples: Our paper features Table 2 and Figure 11 to showcase hard examples from the COUNTERFACT dataset for clarity. To elucidate the difference between hard and normal samples, we offer a direct comparison:

    • Hard Samples often involve subjects that are single, commonly used words, such as 'France', 'Scotland', 'DVD', 'iPhone', and 'Xbox'.
    • Normal Samples, in contrast, typically relate to more specific entities or less common terms like 'Kieran Millan', 'Battle of Arausio', 'Microsoft Expression Blend', and 'Madhan Bob'.

    Constructing our dataset with GPT-3.5 using patterns of commonly used single words revealed that a notable quantity of these inputs could induce model collapse, thereby illustrating the effectiveness of such patterns. We will release the HardCF and HardEdit datasets we developed, to enable analysis and reproducibility by the research community.

  2. Spatial Distribution and Editing Challenges: Further analysis, using GPT-2-XL as an example, shows that the "keys" of hard samples (as defined in the ROME paper) are spatially dispersed and significantly distant from those of normal samples, which tend to cluster more closely in the space.

Our primary goal is to unveil this phenomenon through extensive exploration. Understanding why the key differences between these two types of samples cause editing methods to trigger model collapse is what we are currently studying; however, it is beyond the scope of the current paper.


Response to reviewer #3

ACL ARR 2024 February Paper315 Authors
29 Mar 2024, 17:19ACL ARR 2024 February Paper315 Official CommentReaders: Program Chairs, Paper315 Senior Area Chairs, Paper315 Area Chairs, Paper315 Reviewers Submitted, Paper315 Authors
Comment:

Suggestions And Typos

Firstly, we sincerely appreciate the reviewer's detailed suggestions on our paper. We apologize for the excessive abbreviation of the text due to space limitations, which resulted in unclear expressions.

  • 145: I find this notation confusing
    • Your understanding is correct; what we intended to convey is that we obtain the parameters of the edited model by applying the editing algorithm. We will revise the expression here.
  • 159: portability is an important property
    • We appreciate the reviewer pointing out this deficiency; we will add an introduction to portability.
  • 167: A constraint such as what?
    • We apologize for the lack of clarity; the explanation of the constraint has been placed in the appendix (L996-L999) due to space limitations. For example, a norm constraint on the fine-tuning loss that limits the difference between the original and edited model's parameters, so as to reduce side effects.
  • “Does Localization Inform Editing?...”
    • Thank you for your supplement; indeed, an analysis of the localization methods should be incorporated into the discussion.
  • Case ID? What is the perplexity measured on? the baseline perplexity?
    • The Case ID refers to the index of each edit sample.
    • The perplexity of the edited model was measured on sentences from the ME-PPL dataset; the baseline perplexity is about 65.60.
    • This figure demonstrates that, in the single editing setting, the edit samples from ZsRE do not lead the model to exhibit high perplexity on normal text, indicating that the model remains stable. We will refine our writing to make this clear.
  • What is perplexity calculated on? the red bar is perplexity of the baseline model?
    • Each point in Figure 2a shows a model edited by one sample from COUNTERFACT, with its perplexity calculated on sentences from the ME-PPL dataset.
    • The red line is not the baseline; it only appears as a line because many points are plotted close together.
  • 304: what generation settings?
    • We set the temperature to 0 (greedy decoding) to generate text, which is a standard setting in the field. We will polish this part to make it clear.
  • 325: I don’t think locality is the only measure of ... model collapse. I mentioned portability ... Fluency and Consistency ...
    • For Fluency and Consistency, we have answered in response to Weakness4.
    • For portability, it examines the performance of an edited model on facts that are related to the editing request and require reasoning, aiming to assess robust generalization.
    • We will add discussion about these metrics in the revised manuscript.
  • Table 1+327: I don’t know what locality of 1 means?
    • You're correct: 1 means the edited model provides the correct answer to the question corresponding to the locality metric. However, this model performs poorly on PIQA. This indicates that locality alone is not sufficient to identify model collapse.
  • 348: What is the theoretical perspective you are referring to from Radford?
    • The theoretical perspective is that the unsupervised pre-training loss has an exponential relationship with the perplexity metric (perplexity is the exponential of the cross-entropy loss). We will make this clear in the revised manuscript.
  • 433: I think it is critical that you do provide a set of tables for these experiments you can use the appendix...
    • We will put all the perplexity results on two datasets in the appendix in the next version of the paper.
  • 448-449 How is this different from the rest of the samples like what were the particular formats?...
    • Please refer to our response to Weakness 6.
  • Table 3: I think it would enhance the paper if you also presented lowest perplexity edit and potentially a perplexity between lowest and highest...
    • Actually, the results depicted in Figure 3, which illustrate downstream task performance at various levels of perplexity, address this question. Due to space constraints, we presented only a limited set of results in Table 3. With more space available in the future, we intend to include more experimental results following this suggestion.
  • 498: What is occuring few times? Less than 60 in the normal cases?
    • Your understanding is correct. 60 edits is indeed a small number in practical applications.
  • Table 4 - 520-536: I really like this analysis. I think my worry is the phrasing “pose a substantial risk of collapsing under sequential editing” since the hard samples are a very very very small part of Counterfact. I do think ...
    • We thank the reviewer for acknowledging the value of the analysis. For this concern, we believe we have provided sufficient discussion in response to Weakness 3.
  • How was ii) determined?
    • To ensure that the generated data meets condition ii), we instructed GPT-3.5 to produce statements that are clearly counterfactual. The detailed prompt used for data construction and examples of the generated data can be found in the appendix; a purely illustrative sketch of such a call is shown below.
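
For illustration only, a hypothetical sketch of such a data-construction call with the OpenAI API; the prompt wording below is not our actual prompt, which is given in the appendix:

```python
# Purely illustrative: generating a counterfactual statement about a commonly
# used single-word subject with the OpenAI API (prompt wording is hypothetical).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt = (
    "Write one short factual-style statement about the subject 'iPhone' that "
    "clearly contradicts well-known real-world facts."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```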

Response to reviewer #3

ACL ARR 2024 February Paper315 Authors
29 Mar 2024, 17:15ACL ARR 2024 February Paper315 Official CommentReaders: Program Chairs, Paper315 Senior Area Chairs, Paper315 Area Chairs, Paper315 Reviewers Submitted, Paper315 Authors
Comment:

For the other writing suggestions, such as:

298: Maybe “top 30 post-edited models” for clarity
322: “Proven crucial” is maybe to strong for me at this point. “Shows promise” is more appropriate
333: I don’t see locality as a QA task?...
377-384: I understand what you are saying here but I think it could be clearer that you are selecting models which achieve these perplexity values on ME-PPL and then benchmarking them according to LM-eval - it took me a bit to get to this understanding
Figure 5: I am not sure what to do here but I can see how this chart could be a bit misleading since the y axis are all different. Maybe you can mention this in the caption and caution the reader to pay careful attention to the differences.

We will fix them in the revised manuscript.


We hope we have addressed all of your concerns. If things are clearer now, we kindly ask that you consider raising the score. Thanks!