Official Review of Paper315 by Reviewer #1
This work reveals a critical phenomenon observed in LLMs that a single edit may cause the collapse of an LLM. Accordingly, the authors propose a metric to evaluate such a phenomenon systematically and curate a dataset for a fine-grained analysis of this metric. Furthermore, a rigorous dataset HardEdit is constructed for a comprehensive evaluation of model editing techniques.
- The research problems and target tasks are clearly described.
- This study highlights a critical issue within LLMs, which is worth investigating for the research community.
- This study introduces a metric (perplexity) to assess its severity across different LLMs. Specifically, it conducts a detailed analysis to justify the proposed metric.
- For the question “Is model collapse a common issue across different language models and editing methods?”, the three target open-source LLMs may not be sufficient to support the assertion. Incorporating a broader range of LLMs such as ChatGPT, Llama2-13B, Phi-2, etc., can be more convincing.
- Several datasets (e.g., HardCF, ME-PPL, HardEdit, etc.) have been introduced in this work. Their roles may not be clearly delineated, leading to potential confusion. Providing more detailed explanations regarding their respective purposes would enhance clarity and understanding.
Suggestions:
Please refer to some recent works (two examples listed below) relevant to this topic. It is worthwhile to make comparisons with those works and highlight your uniqueness.
- Model Editing at Scale leads to Gradual and Catastrophic Forgetting: https://arxiv.org/html/2401.07453v2
- UNVEILING THE PITFALLS OF KNOWLEDGE EDITING FOR LARGE LANGUAGE MODELS: https://arxiv.org/pdf/2202.05262.pdf
I am curious about whether this issue has been well addressed in powerful LLMs. Have you observed such a phenomenon on ChatGPT or on powerful open-source LLMs like Llama2-70b?
When first mentioning “ROME” (L049), please provide the citation properly. Similarly, it will be more friendly for readers to use the full name of “ME-PPL” at L086.
None
response to reviewer #1
Dear reviewer #1,
We sincerely appreciate your constructive review and the supportive score.
Weakness 1: experiments on a broader range of LLMs
We agree with the reviewer's perspective that experimenting with a wider range of LLMs would make our work more convincing. It is indeed part of our plan for future versions of the manuscript. However, we must emphasize that such experiments are substantially resource- and time-intensive. We sincerely request the reviewer's understanding and patience with these practical limitations.
Besides, we also wish to clarify that:
- Accessibility for model editing: model editing studied in this paper involves modifying model parameters directly. Given that ChatGPT is a proprietary LLM and does not offer the open access required for such modifications, it falls outside the scope of our current study's methodology.
- Representativeness of selected models: our chosen models are among the most representative and extensively used in current model editing research, with Llama2-7B being one of the largest models that has been practically applied. This careful selection ensures that our experiments are sufficiently persuasive within the field.
- Primary Study Objective: we want to emphasize that our study's main goal is to unveil and highlight the potential risks associated with editing methods, rather than to perform an exhaustive evaluation across all models, which would exceed the scope of a typical conference paper.
Weakness 2: detailed introduction of the datasets
In response to the insightful comments regarding the clarity of the datasets used in our study, we provide the following clarifications:
Dataset Overview:
- ME-PPL: This is a text dataset we developed, consisting of sentences to calculate the models' perplexity for normal text. It serves to quickly identify whether the edited model is prone to collapse, thereby preventing the need for time-consuming evaluations in downstream tasks.
- Editing Datasets: The remaining datasets serve as editing datasets. Each sample within these datasets represents an edit request. Specifically:
- COUNTERFACT and ZsRE, introduced in Sec. 3.1 and Appendix A.2.2, are widely recognized editing datasets in the model editing domain.
- HardCF is a curated subset of COUNTERFACT that we developed, featuring challenging samples that have been observed to induce model collapse upon a single edit.
- HardEdit: building on the foundation laid by HardCF, we use GPT-3.5 to create a refined collection of samples designed to rigorously evaluate current editing methodologies.
We will take this valuable feedback into account and commit to providing more detailed explanations of each dataset's respective purpose in the revised version of our manuscript.
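To make the role of ME-PPL concrete, the following is a minimal sketch of how perplexity on reference text could be used as a quick collapse check. It is an illustration only, not the implementation used in the paper: the function names, the toy log-probabilities, and the threshold value are assumptions introduced here, and in practice the per-token log-probabilities would come from scoring the reference sentences with the edited model.

```python
import math

# Hypothetical cut-off for flagging a collapsed model; an assumption for this
# sketch, to be tuned against downstream results.
COLLAPSE_PPL_THRESHOLD = 1000.0


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token), natural log."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def corpus_perplexity(sentences_log_probs):
    """Token-weighted perplexity over a list of sentences."""
    all_log_probs = [lp for sent in sentences_log_probs for lp in sent]
    return perplexity(all_log_probs)


def looks_collapsed(ppl, threshold=COLLAPSE_PPL_THRESHOLD):
    """Flag a model whose perplexity on normal text exceeds the threshold."""
    return ppl > threshold


# Toy, made-up log-probabilities (not measured values from any model).
healthy = [[-2.0, -3.0, -1.0], [-2.5, -1.5]]
collapsed = [[-9.0, -10.0, -8.0], [-9.5, -8.5]]

for name, data in (("healthy", healthy), ("collapsed", collapsed)):
    ppl = corpus_perplexity(data)
    print(name, round(ppl, 2), looks_collapsed(ppl))
```

Because the check needs only one forward pass over a small text set, it can run after every edit, with the full downstream benchmarks reserved for models that pass it.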
Suggestion 1: discussion of recent works
We thank the reviewer for the insightful comments and suggested references. Due to space constraints, we had to move the discussion of related work to the appendix, which may have caused some confusion. We apologize for this and will strive to improve the presentation of related work in our revised manuscript.
- Model Editing at Scale leads to Gradual and Catastrophic Forgetting: We have in fact cited this contemporaneous work in Appendix A.1.3 (L957-L970) and made the comparison. While that study focuses on the impact of large-scale edits on models, we focus on exploring the possibility of model collapse caused by a small number of edits and how to efficiently detect potential collapses in practical applications.
- UNVEILING THE PITFALLS OF KNOWLEDGE EDITING FOR LARGE LANGUAGE MODELS: Their research focuses on Knowledge Conflict and Knowledge Distortion within edits, which are less related to the impact of edits on the downstream task performance of models that we study. We will include a citation and discussion of their findings in the context of our research focus in the revised manuscript.
Suggestion 2: whether this issue has been well addressed in powerful LLMs
As explained in our response to Weakness 1, we are actively pursuing this research direction.
However, based on our research over the past months, we believe that the model collapse issues we have identified might be inherent to the editing methods themselves, suggesting that these challenges are likely to persist even in more advanced LLMs. Due to resource limitations, we hope to obtain results on such larger models in a subsequent version of our work.
We value the reviewer's understanding as we navigate these limitations and areas for future exploration.
Suggestion 3: writing issues
We sincerely appreciate your highlighting these issues regarding the clarity of terminology and the need for proper citations. We will fix them in the next revision.
We hope we have addressed all your concerns. If you have any further suggestions, please do not hesitate to reach out.
Respond to authors
Thanks very much for your detailed clarification on the questions above! As for the selection of target models, I agree that it is time- and resource-intensive to conduct similar experiments on more LLMs. However, I think small-sized LLMs such as Phi-2 (or Tinyllama) can be considered, as they do not take much computational resources.
Thanks for your feedback
Thank you once again for your responsive replies and the suggestion to explore smaller LLMs like Phi-2 or Tinyllama.
In fact, as in the response to reviewer #2, we have extended our experiments to include other models such as T5 (encoder-decoder architecture). Since ROME and MEMIT cannot be applied to architectures other than decoder-only, we opted to try KN, a method from the same category that was recommended by reviewer #2.
In the experiments where KN edited T5, we observed both single edit collapse and sequential edit collapse (five sequential edits on the HardCF dataset). The results are shown in the following table:
Model | PIQA | Hellaswag | LAMBADA |
---|---|---|---|
Original T5 | 0.6659 | 0.3752 | 0.0526 |
single edited T5 | 0.5185 | 0.2536 | 0.0076 |
sequential edit T5 | 0.5250 | 0.2582 | 0.0064 |
Random guessing | 0.5000 | 0.2500 | 0.0000 |
The results of these supplementary experiments further corroborate the findings presented in our paper.
Regarding smaller LLMs, we agree on their potential for reducing computational demands. We're planning to include such models in future versions of our manuscript. However, it's crucial to note that mainstream editing algorithms usually require specific model frameworks and may not directly apply to all models. We kindly ask for your patience as we work on adding these results.
Official Review of Paper315 by Reviewer #2
The paper addresses a significant gap in the evaluation of large language models (LLMs), specifically focusing on the phenomenon of model collapse during model editing. The authors convincingly argue that the widely used metric of locality is inadequate for assessing model collapse due to its limited scope and the trivial nature of the QA tasks it employs. They propose the use of perplexity as a more effective metric for evaluating model collapse, demonstrating its utility through extensive experiments. A key strength of the paper lies in the creation of the ME-PPL dataset, which provides a diverse and high-quality resource for perplexity calculations, and the HardEdit dataset, designed to challenge model editing algorithms with samples that are likely to trigger model collapse. However, the paper could further benefit from a deeper theoretical exploration of why perplexity is a more suitable metric compared to others, including a discussion on its limitations and potential biases. Additionally, while the extensive experiments provide valuable insights, the methodologies employed in constructing the HardEdit dataset using GPT-3.5 could be detailed more thoroughly to ensure reproducibility and to understand the selection criteria for challenging samples. In summary, this paper makes a significant contribution to the field by highlighting the limitations of current evaluation metrics, proposing a novel approach to assess model collapse, and encouraging the advancement of more resilient model editing techniques. Future work could expand on this foundation by exploring alternative metrics, refining the proposed datasets, and developing methodologies to mitigate model collapse in LLMs.
- Identification of the inadequacy of locality as a metric for evaluating model collapse in model editing. Proposal of perplexity as a more comprehensive metric for assessing model collapse, supported by extensive experiments. Creation of the ME-PPL and HardEdit datasets to facilitate the evaluation of model editing techniques and to challenge current methodologies with samples likely to induce model collapse.
- Illumination of the prevalence of model collapse across various editing methods and LLMs, highlighting the need for the research community to prioritize the development of robust model editing techniques.
The paper lacks a discussion over the following points:
- It is important to highlight the significance of the effect on dependent facts (if the president of the USA is the fact to be edited, then the hometown or date of birth of the president is a dependent fact). In other words, portability before and after the edit is not examined in the paper. The Portability metric is important (as shown in EasyEdit) and is lacking in the current work.
- It is quite obvious that perplexity can be used to evaluate the LLM, but more promising metrics or approaches should be used. Say any benchmarking score over different tasks that is compared before and after an edit.
- In the paper’s contribution number 3 (lines 128-130), the said contribution is already made in the MEMIT technique. In continuation of critical question 3 (lines 60-61), it would be more interesting to see whether the work also proposes a way to mitigate model collapse, since previous work such as MEMIT has shown that it usually happens in almost all the METs.
- A comparison with PEFT, such as LoRA or QLoRA, would be beneficial as an additional baseline to support the claims made in the paper. Additionally, KE, KN, Parrot, and other METs. Currently, the set space of METs seems limited.
- It would be interesting to see if the claims stand true for the encoder-only models, decoder-only models, and encoder-decoder models. Content seems to have jargon, repetitive definitions, and unwanted text (say, lines 340-351).
- Future works need to highlight the research directions more clearly. Currently, from the previous works, it is intuitive to state the current direction in the future.
Please look at the comments in the weakness section for the detailed comments. Please look for jargon (line 269)! The content seems repetitive, as seen in lines 309-312 (the definition is already clear from the Introduction section). Why does Figure 7 tend to have extra space on the right?
The authors have written the limitations of the proposed work.
None
Reminder: Further Feedback Needed for Authors' Rebuttals
Dear Reviewer,
I hope this message finds you well, and I greatly appreciate your willingness to contribute to the reviewing process. The authors have submitted their response to your comments and are unsure whether it has dispelled your doubts. If possible, please respond promptly to the authors' rebuttals. Your feedback is crucial to the next steps in the review process.
Best, AC
Thanks for your kind reminder to the reviewer
Dear (S)ACs,
Thank you for your kind reminder to reviewer #2.
Given reviewer #2 remained unresponsive throughout the entire rebuttal process, contrary to the review policy, we respectfully request a careful reevaluation of their inaccurate criticisms in light of our detailed response during the meta-review process. We hope that special consideration can be given to the insights and recommendations of the other reviewers, such as Reviewer #3, who expressed a 'happy to recommend acceptance' stance.
We also wish to express our gratitude for the efforts and dedication of the AC team in managing and facilitating the review process.
Best regards,
Authors
Request for Facilitating Discussion from Reviewers with us
Dear (S)ACs,
We hope this message finds you well.
We are writing to respectfully request your help in encouraging reviewer #2 to check our response. We have provided detailed clarifications and conducted supplementary experiments to address their concerns comprehensively. Yet, as the discussion period draws to a close, we have not received reviewer #2's feedback on our rebuttal.
We would also like to draw the attention of the (S)ACs to the quality of the review by #2, as their reasons for rejection (weaknesses) go against the reviewing guidelines.
2. shortcuts (The authors could also do [extra experiment X])
- Regarding Weaknesses 4-5, the call for further experimental validation may be more appropriately considered a suggestion rather than a weakness. Our manuscript proactively discusses these limitations, especially in terms of expanding the range of models. It is important to note that our paper has already conducted very extensive experiments, incorporating four SOTA editing methods and three LLMs widely recognized in the field of model editing research. This experimental setup already represents one of the most thorough configurations in the field, effectively illuminating the critical risks associated with current model editing techniques. We note that other reviewers (#1 and #3) acknowledge the value of our work and the validity of our experiments. Furthermore, to proactively address these considerations, we have conducted additional experiments. The results of these supplementary experiments further corroborate the findings presented in our paper.
E. The review does not evince expertise (comments seem to be not based on a deep understanding of the submission)
We noticed that Weaknesses 1-3 seem to arise from misunderstandings not just about our work, but also about the model editing field at large, such as misconceptions regarding the portability metric and the MEMIT method.
- For weakness 1, the portability metric measures the performance of an edited model on specific facts related to the editing request and the required reasoning. It has no direct relation to downstream task performance, which is the focus of our study.
- For weakness 2, the lack of benchmarking, we rigorously test the downstream task performance of these models across a comprehensive suite of benchmarks, presented in Figures 1(b) and 2(b), Figure 3, and Tables 3 and 4.
- For weakness 3, MEMIT's contribution is recognized for improving the performance of multiple edits on traditional editing metrics, not the model collapse we identified in this paper. In fact, our research distinctively points out MEMIT's vulnerability to model collapse. This particular focus of our study diverges from the objectives of MEMIT, thereby not undermining the contributions of our paper. This perspective has been acknowledged by the other two reviewers, #1 and #3.
Overall, we firmly believe that these weaknesses do not justify a 2.5 overall score. We have tried our best to engage reviewer #2 in discussion. Unfortunately, we did not receive any further engagement or discussion from Reviewer #2 in return.
Given these unfair criticisms, especially considering the low score and high confidence rating they assigned, we respectfully request a careful reevaluation of their feedback in conjunction with our detailed rebuttal during the meta-review process.
Thank you for considering our request and for your guidance throughout this process.
Best regards
response to the summary comments of reviewer #2
Dear reviewer #2,
We sincerely appreciate your recognition of our efforts in bridging the significant gap in evaluating model editing algorithms through the creation of the ME-PPL and HardEdit datasets, and for acknowledging our work in highlighting the limitations of current evaluation metrics.
Allow us to further address two concerns raised in the summary:
"However, the paper could further benefit from a deeper theoretical exploration of why perplexity is a more suitable metric compared to others, including a discussion on its limitations and potential biases."
We value your suggestion on the need for a more comprehensive theoretical exploration of perplexity as a superior metric for evaluating model collapse. Due to space constraints, we only managed to briefly touch upon this in our paper (L347-351). We described how perplexity, by exhibiting an exponential relationship with the unsupervised pre-training loss, acts as a surrogate metric for monitoring a model's status. The comparison with other metrics, such as locality, was aimed at highlighting perplexity's advantages but, as noted, might benefit from a deeper discussion.
- Sensitivity and Potential Bias: We acknowledge the sensitivity of perplexity, where minor variations might not directly correlate with model performance shifts. Yet, in the context of model collapse, its utility is clear, as it sharply distinguishes collapsed models from functioning ones. This is because collapsed models usually exhibit dramatically higher perplexity, not just subtle variations. Regarding potential biases due to the composition of the text dataset ME-PPL, we have made concerted efforts to ensure its diversity and representativeness, drawing from widely used pre-training corpora to mitigate bias.
We will include a discussion on this topic in our revised manuscript.
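For reference, the exponential relationship mentioned above follows directly from the standard definition of perplexity. The notation below (a token sequence $x_1, \dots, x_N$ and a model distribution $p_\theta$) is introduced here for illustration and is not taken from the manuscript:

```latex
\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta\left(x_i \mid x_{<i}\right),
\qquad
\mathrm{PPL}(\theta) = \exp\left(\mathcal{L}(\theta)\right)
```

Since $\mathcal{L}$ is the average next-token negative log-likelihood, i.e. the same quantity minimized during unsupervised pre-training, a large increase in loss after an edit appears as an exponentially larger increase in perplexity.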
Additionally, while the extensive experiments provide valuable insights, the methodologies employed in constructing the HardEdit dataset using GPT-3.5 could be detailed more thoroughly to ensure reproducibility and to understand the selection criteria for challenging samples.
We apologize for the overly concise discussion on the construction of the HardEdit dataset, a limitation imposed by space constraints. In fact, we have elaborated on this in Appendix A.6 and Figures 10 and 12, providing a thorough presentation of the dataset creation process, including the rationale behind our prompt designs and the selection criteria for challenging samples. These appendices aim to facilitate replication and offer insights into our methodologies, further contributing to the field's understanding of model editing challenges. In future versions, when more space is available, we intend to enrich this discussion in the main body of the text to make the information more accessible and clear.
- Reproducibility: For the HardCF and HardEdit datasets, we are committed to their release to ensure reproducibility and support research on editing-induced model collapse and the development of robust model editing methods. Currently, we are actively expanding the HardEdit dataset, resulting in its continuous evolution. It will be larger than initially described in our manuscript. To ensure the datasets' completeness and reliability, we've decided to delay their public release until the end of the review period.
We thank you again for your recognition of the contribution and potential impact of our paper, and hope we have addressed all your concerns. If you have any additional concerns or suggestions, we welcome you to reach out.
Response to reviewer #2
Dear reviewer #2,
We sincerely appreciate your valuable comments.
Weakness 1: lack of the Portability metric
According to the definition in EasyEdit, Portability examines the performance of an edited model on facts related to the editing request and required reasoning, aiming to assess robust generalization. However, our study focuses on examining the downstream task performance of edited LLMs to determine their usability in practical scenarios. This is crucial for the deployment of model editing in real-world applications, where performance of edited models is essential. Certainly, while Portability offers insights into an editing method's effectiveness, it does not align with the specific objectives of our research. Therefore, this metric was not included in our study.
Weakness 2: lack of benchmarking scores
First, we agree with the reviewer's perspective that perplexity alone is not sufficient for evaluating LLMs.
In our paper, we primarily used perplexity as a quick way to identify potential model collapses from the editing process. However, our analysis does not stop there; we rigorously test the downstream task performance of these models across a comprehensive suite of benchmarks. As detailed in our paper:
- Figures 1(b) and 2(b), along with Table 3, present the models' benchmarking scores before and after a single edit.
- Table 4 offers a direct comparison between the models after sequential edits and their original states, highlighting the issues in sequential edits.
- Figure 3 illustrates how varying levels of perplexity influence the edited models' benchmarking scores, providing insight into the relationship between perplexity and task performance.
We apologize for any misunderstanding that may have arisen regarding the evaluation metrics used in our study. We will refine our presentation to ensure our experimental outcomes are clear.
Weakness 3: contribution is already made in MEMIT & mitigate the model collapse
We appreciate the reviewer’s attention to the contributions outlined in our paper and the comparison with MEMIT.
MEMIT's contribution is recognized for improving the performance of multiple edits on traditional editing metrics, not for addressing the model collapse we identified in this paper. In contrast, our third contribution uncovers model collapse as a critical issue in sequential editing settings, an insight not recognized by MEMIT. In fact, as demonstrated in Table 4, our research distinctively points out MEMIT's vulnerability to model collapse. This particular focus of our study diverges from the objectives of MEMIT, and MEMIT therefore does not undermine the contributions of our paper.
Furthermore, while developing strategies to mitigate model collapse is undoubtedly important, our study currently focuses on identifying and understanding the model collapse within existing editing methods, including MEMIT. Our exploration into efficient detection mechanisms aims to pave the way for future solutions. Addressing and resolving model collapse, although beyond the scope of our present work, remains a critical objective for our further research.
Response to reviewer #2
Weakness 4: experiments on more editing methods
We acknowledge and appreciate the reviewer's suggestion that incorporating a broader range of editing methods could make our work more comprehensive. However, given the considerable resource and time demands of our experiments, we prioritized the most commonly used setups in the field. We selected four representative and widely adopted editing methods from the current literature, spanning three key categories of model editing techniques.
PEFT Methods: Our focus on these selected methods stems from their proven impact and the limited effectiveness of fine-tuning in model editing contexts. PEFT approaches, designed for enhancing fine-tuning efficiency, do not directly address the core challenges of editing with constrained data volumes, hence their exclusion from our primary analysis.
Other METs: The absence of methods like KN and KE from our study is due to their comparatively lower performance within their respective categories (compared with ROME and MEND), as evidenced by recent research findings. However, to address concerns regarding the diversity of METs explored, we conducted additional tests with KN on three LLMs (GPT-2-XL, GPT-J, and Llama2-7b):
A selected edit sample from the COUNTERFACT dataset resulted in significantly diminished downstream task performance for Llama2-7b:
Model | PIQA | LAMBADA | Hellaswag |
---|---|---|---|
Original_Llama2-7b | 0.7845 | 0.6814 | 0.5706 |
Edited_Llama2-7b | 0.5256 | 0.0000 | 0.2583 |
Random_guessing | 0.5000 | 0.0000 | 0.2500 |
The three models sequentially edited by KN on HardCF all exhibit severe collapse:
Model | PIQA | LAMBADA | Hellaswag |
---|---|---|---|
Original_GPT-2-XL | 0.7084 | 0.4461 | 0.4004 |
Edited_GPT-2-XL | 0.5332 | 0.0000 | 0.2610 |
Original_GPT-J | 0.7541 | 0.6136 | 0.4953 |
Edited_GPT-J | 0.5174 | 0.0000 | 0.2561 |
Original_Llama2-7b | 0.7845 | 0.6814 | 0.5706 |
Edited_Llama2-7b | 0.5152 | 0.0000 | 0.2591 |
Random_guessing | 0.5000 | 0.0000 | 0.2500 |
In summary, our comprehensive approach ensures a robust examination of model editing's impacts, aligning with the methodologies and models frequently discussed in published works within our domain.
Weakness 5: more backbone models
We recognize and value the reviewer's interest in assessing the applicability of our findings across different model architectures.
Our focus on decoder-only models, prominently represented by the GPT and LLaMA series, is driven by their prevailing status as the mainstay in both model editing research and the broader NLP field. Their wide applicability and strong performance underscore the relevance and importance of our findings within this mainstream context.
While encoder-only models are indeed instrumental for specific tasks like classification, their deployment is comparatively narrower than that of decoder-only models.
To address the reviewer's concern, we take the encoder-decoder T5 model as an example. Since ROME and MEMIT cannot be applied to architectures other than decoder-only, we opted to try KN, a method from the same category that was recommended by the reviewer. In the experiments where KN edited T5, we observed both single edit collapse and sequential edit collapse (five sequential edits on the HardCF dataset). The results are shown in the following table:
Model | PIQA | Hellaswag | LAMBADA |
---|---|---|---|
Original T5 | 0.6659 | 0.3752 | 0.0526 |
single edited T5 | 0.5185 | 0.2536 | 0.0076 |
sequential edit T5 | 0.5250 | 0.2582 | 0.0064 |
Random guessing | 0.5000 | 0.2500 | 0.0000 |
Weakness 6: future work
We appreciate your suggestions on future work writing, and we will polish this part accordingly.
Suggestions
Thank you for pointing out the issues related to jargon and repetition within the manuscript, as well as the formatting concern regarding Figure 7. We will refine our writing in subsequent revisions to ensure clarity and conciseness throughout the paper.
The extra space beside Figure 7 arises from the layout constraints of LaTeX dual-column formatting. Since the subsequent figures are all large cross-column figures, there was no suitable content to fill the adjacent space in the layout for Figure 7. We will adjust the layout to better utilize the space and prevent such formatting inconsistencies in future versions.
We hope we have addressed all your concerns. If things are clearer now, we kindly request that you reconsider the score you have assigned. Thanks!
Seeking your valuable feedback on our response
Dear Reviewer,
We hope this message finds you well.
We sincerely appreciate the time and effort you've dedicated to reviewing our submission. In response to your valuable feedback, we have provided detailed clarifications to the questions raised and supplemented with important additional experiments.
As we are nearing the end of the discussion period, we would love to hear your thoughts on our response, including whether it sufficiently addresses your concerns. If you see potential for a score increase based on our revisions and discussions, we would greatly appreciate your consideration in this matter.
We are committed to including all your suggestions in our revision to enhance the quality of our manuscript. We hope we have addressed all your concerns and look forward to your further comments and discussions.
Best regards,
Authors
Official Review of Paper315 by Reviewer #3
This paper suggests that perplexity on reference texts could serve as an easy-to-compute proxy for downstream evaluations of model editing impact on capability. The authors demonstrate that perplexity generally does reflect a phenomenon they term model collapse, where very high perplexity post-edit could correlate with very poor downstream performance. They flesh this out by constructing a dataset of samples that induce model collapse and perform sequential editing to understand how collapse occurs in this setting.
This paper addresses an important problem: How do we evaluate the impact of model editing on general model qualities post-editing. To my knowledge this is the first attempt at developing a computationally efficient proxy evaluation as a predictor of downstream task performance post-editing. This paper raises awareness about the current limitations of model editing even in cases that don’t cause model collapse vis-a-vis global impact on downstream performance.
One minor weakness is that: Isn’t perplexity as a proxy already suggested by the ROME paper (Fluency)? Shouldn’t this be acknowledged? I understand that it isn’t connected to downstream evaluation, which is this paper’s novel contribution.
One weakness that I think could be addressed easily and needs to be addressed before I can recommend acceptance is the statistical validity of the correlation experiment. It seems as though only 7 data points are sampled per model. I have a really hard time believing this resulted in significant correlations (402 - Yet the authors neglected to show the significance test). In order to accept the validity of ME-PPL as a surrogate for these datasets I expect to see a proper correlation experiment with Spearman’s Rho or something equivalent showing the correlation scores per model with their statistical significance. I acknowledge the expense of running LM-eval. The authors can use a power analysis to estimate beforehand how many samples they will need to do a sample-efficient, well-powered correlation experiment. Without this analysis, I don’t think the community can trust ME-PPL as a surrogate measure.
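As a rough sketch of the kind of analysis requested above, the snippet below computes Spearman's rho with an exact two-sided permutation p-value (feasible for 7 points, since 7! = 5040) and estimates the sample size needed for a given effect size using the Fisher z approximation. Everything here is illustrative: the function names are hypothetical, the data are made up rather than taken from the paper, and the Fisher z formula is derived for Pearson correlation, so it is only an approximation for Spearman's rho.

```python
import math
from itertools import permutations
from statistics import NormalDist


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def exact_p_value(x, y):
    """Two-sided exact permutation p-value; only practical for small n."""
    rx, ry = ranks(x), ranks(y)
    observed = abs(pearson(rx, ry))
    hits = total = 0
    for perm in permutations(ry):
        total += 1
        if abs(pearson(rx, perm)) >= observed - 1e-12:
            hits += 1
    return hits / total


def required_n(rho, alpha=0.05, power=0.8):
    """Approximate sample size to detect correlation rho (Fisher z method)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) / math.atanh(abs(rho))) ** 2 + 3)


# Made-up example: 7 edited models, perplexity vs. a downstream accuracy.
toy_ppl = [5.6, 12.0, 40.0, 150.0, 900.0, 4000.0, 20000.0]
toy_acc = [0.78, 0.77, 0.74, 0.70, 0.61, 0.53, 0.50]

print("rho:", spearman_rho(toy_ppl, toy_acc))
print("exact p:", exact_p_value(toy_ppl, toy_acc))
print("n needed for |rho| = 0.5:", required_n(0.5))
```

With only 7 points the exact test can reach significance only for near-monotone relationships, which is why a per-model rho together with its p-value, plus a prior power calculation, would make the surrogate claim easier to trust.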
Another issue that may stem from my misunderstanding, which hopefully the authors can address is the prevalence of model collapse. From my reading, model collapse only occurs 0.01% of the time in Llama2 (21 out of 21k samples) in a single editing setting, this is a direct contradiction of the paper’s main claims and narrative. I wouldn’t recommend rejection on this alone as understanding model collapse even in the worst case scenario is very important but I would need to see significant revisions of the narrative and claims of the paper before I would be comfortable accepting if this is indeed the case (Model Collapse is extremely rare which is in line with contradictory results in the next paragraph). One worry if this is the case is that there are other confounding factors that explain 0.01% model collapse - For example this could be explained by a hardware glitch or simple randomness of initialization or optimization in ROME. Perhaps then running the experiment multiple times with random seeds could address this concern? (I acknowledge that this is “proven” out in Figure 5 but sequential editing is potentially a confounding factor here).
I am struggling a bit trusting the results of this paper for single editing given both the results of the ROME paper on Fluency and Consistency as well as the more recent work of “Rosati, D., Gonzales, R., Chen, J., Yu, X., Erkan, M., Kayani, Y., ... & Sajjad, H. (2024). Long-form evaluation of model editing.” which does find an impact of model editing on longform generation (lexical cohesion issues) but doesn’t find model collapse in single edit settings, they didn’t evaluate sequential edits. Perhaps it is that they didn’t evaluate on LM-eval and only used a sample of Counterfact which may or may not overlap with HardCF? Can the authors add commentary here? My suggestion would perhaps add an appendix to show what a range of text generation outputs across perplexities on ME-PPL looks like? This would additionally help us trust the perplexity threshold of 1000 as a model collapse indicator.
Related: Decoding strategy can have a large impact on text generation. When reading this paper I worry a lot that if we use different decoding strategies then text generation quality could be quite good despite “model collapse” which might explain the discrepancy with the above. I think my suggestions above would address this worry but perhaps this is a limitation you can mention.
Finally, I am still a little baffled (maybe I am missing something) about what the common pattern is between HardCF and hard examples. The authors make the observation that these are common words as objects, but I am quite familiar with Counterfact and the examples they provide don’t seem to me to be any different from most other Counterfact examples, which also use common words. Since this is the key insight of the paper, I think the authors need to really clarify what makes Hard examples that cause model collapse different from Normal examples. Perhaps they can provide an analysis, such as how many other Counterfact samples use those words, or provide a list of words from the hard and normal samples and explain how they are different in a table?
To summarize, I’d like to see the following things before I can recommend acceptance:
(1) Statistical significance tests and proper correlation analysis for ME-PPL
(2) Address my concern that model collapse appears to be extremely rare, while the narrative suggests model collapse is a major risk of model editing
(3) Provide a much clearer explanation of the differences between samples that cause model collapse and samples that do not.
If these are satisfactorily addressed in the discussion I am happy to raise my scores.
145: I find this notation confusing - The way I am reading it is - we find the parameters such that an edit algorithm equals those parameters. I think what we want here is to find the parameters using the edit algorithm? Or that makes an edit condition true?
159: portability is an important property that model editing methods are evaluated on.
167: A constraint such as what?
191: It is probably worth mentioning either here or in the Appendix A.1: “Hase, P., Bansal, M., Kim, B., & Ghandeharioun, A. (2023). Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models.” which shows that localization generally doesn’t find the optimal locations for editing.
A.3 + Figure 7 could use more clarity. What is Case ID? What is the perplexity being measured on? The caption indicates zSRE but I think you are measuring perplexity on LM-eval, right? If so, what datasets are these from? And what is the baseline perplexity? If not, why does it make sense to evaluate perplexity on the original zSRE questions? What does this tell us?
Similar problems with Figure 2a - What is each point's perplexity calculated on? The ROME edit statement or some sentences from section 5 (283)? I assume the red bar is the perplexity of the baseline model but I don’t know from the caption.
298: Maybe “top 30 post-edited models” for clarity
304: Under what generation settings was this done (temperature, etc.)?
1030-1031: This is a bit of an empty statement - what were these innovations and optimizations specifically?
322: “Proven crucial” is maybe too strong for me at this point. “Shows promise” is more appropriate
325: I don’t think locality is the only measure of capability here that seeks to understand model collapse. I mentioned portability before, but the ROME paper itself uses Fluency and Consistency over generated texts and doesn’t observe model collapse on those generated texts…
Table 1+327: I don’t know what locality of 1 means. Like locality was correctly scored?
333: I don’t see locality as a QA task? It seems like a token completion task. I agree with this statement but I think it’s more appropriate to say token completion tasks don’t assess the entire range of functionality
348: What is the theoretical perspective you are referring to from Radford?
377-384: I understand what you are saying here but I think it could be clearer that you are selecting models which achieve these perplexity values on ME-PPL and then benchmarking them according to LM-eval - it took me a bit to get to this understanding
430-440: Maybe I am missing something here but since CounterFact is over 21k samples - these are extremely low numbers (i.e. Model collapse occurs 0.1% of the time for Llama2). I would not call this “consistently causes all three LLMs to collapse”; based on these results I would say it is extremely unlikely that ROME causes model collapse.
433: I think it is critical that you do provide a set of tables for these experiments you can use the appendix. Since 2(a) is a pilot experiment I don’t think its fair to say it resembles this experiment.
448-449: but it seems like this is all of Counterfact? How is this different from the rest of the samples, like what were the particular formats? It isn’t obvious to me what is unique about the samples in Table 2 versus typical Counterfact examples…
Table 3: I think it would enhance the paper if you also presented lowest perplexity edit and potentially a perplexity between lowest and highest - that will help the community figure out what the relationship between perplexity and downstream evaluation is.
498: What is occurring few times? Less than 60 in the normal cases?
Figure 5: I am not sure what to do here but I can see how this chart could be a bit misleading since the y axis are all different. Maybe you can mention this in the caption and caution the reader to pay careful attention to the differences.
Table 4 - 520-536: I really like this analysis. I think my worry is the phrasing “pose a substantial risk of collapsing under sequential editing” since the hard samples are a very very very small part of Counterfact. I do think the Normal cases analysis shows that you could say something like “Sequential editing poses a risk for model quality degradation” but I’d worry about any claim stronger…
555: How was ii) determined? Surely this is model specific - can you say more here about what models this applies to and how you measured this?
Figure 6: in an ideal world this graph has a control line on “normal” edits.
Yes, I added some suggestions for more.
None
Response to reviewer #3
Dear reviewer #3,
We sincerely appreciate your time and effort in providing insightful feedback on our manuscript.
Weakness 1: Isn’t perplexity as a proxy already suggested by the ROME paper (Fluency)?
We thank the reviewer for recognizing the distinction between our use of perplexity as an evaluation metric and its application in the ROME paper, and for highlighting the importance of this discussion within our field.
It is critical to clarify that the fluency metric employed by ROME focuses on identifying repetitive word patterns via bi- and tri-gram entropies, which is distinct from our use of perplexity. The fluency metric in ROME can be considered a simplified version of perplexity, not a direct measurement thereof.
In contrast, we use perplexity as a proxy to monitor overall downstream task performance, so that collapse caused by editing methods can be identified quickly.
We appreciate the opportunity to further delineate this distinction and will make appropriate enhancements to our manuscript to reflect this discussion.
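To make the distinction concrete, here is a minimal sketch of an n-gram entropy score of the kind described above: it measures how repetitive a generated text is, which is different from measuring perplexity on held-out text. The whitespace tokenizer, the function names, and the equal weighting of bi- and tri-gram entropies are assumptions for illustration; this is not ROME's actual implementation.

```python
import math
from collections import Counter


def ngram_entropy(text: str, n: int) -> float:
    """Shannon entropy (in bits) of the n-gram distribution of a text."""
    tokens = text.split()  # assumption: simple whitespace tokenization
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    total = len(ngrams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def repetition_score(text: str, weights=(0.5, 0.5)) -> float:
    """Weighted sum of bi- and tri-gram entropies (weights are hypothetical)."""
    return weights[0] * ngram_entropy(text, 2) + weights[1] * ngram_entropy(text, 3)


if __name__ == "__main__":
    repetitive = "the year is the year is the year is the year is"
    varied = "model editing can silently damage unrelated capabilities of a language model"
    print(repetition_score(repetitive), repetition_score(varied))
```

A low score flags repetitive output, but the score is computed only on the model's own generations, so it does not directly indicate how the model performs on external text or downstream tasks.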
Weakness 2: the statistical validity of the correlation experiment
We are deeply grateful for the reviewer's constructive feedback, which we acknowledge as highly valuable in strengthening our manuscript. We initially omitted a detailed correlation analysis because of the clear patterns in Figure 3. However, we recognize that this omission may have led to concerns regarding the empirical rigor of our claims and the statistical foundation of our analysis.
To address this concern, we conducted a correlation experiment using Spearman's rank correlation (computed with scipy.stats.spearmanr) between perplexity (ppl) and performance across three downstream tasks.
We add the correlation analysis of the results in Figure 3, as shown in the following tables:
GPT-J

Statistic | ppl and PIQA | ppl and LAMBADA | ppl and Hellaswag |
---|---|---|---|
Spearman Rho | -1.0 | -0.991 | -1.0 |
p-value | 0.0 | 1.456e-05 | 0.0 |

GPT-2-XL

Statistic | ppl and PIQA | ppl and LAMBADA | ppl and Hellaswag |
---|---|---|---|
Spearman Rho | -1.0 | -1.0 | -1.0 |
p-value | 0.0 | 0.0 | 0.0 |

Llama2-7b

Statistic | ppl and PIQA | ppl and LAMBADA | ppl and Hellaswag |
---|---|---|---|
Spearman Rho | -0.929 | -0.929 | -0.964 |
p-value | 2.519e-03 | 2.519e-03 | 4.541e-04 |

Furthermore, we have enhanced our analysis by expanding to 20 data points (edited models), chosen to cover a broad perplexity range up to 1000, as models beyond this threshold showed diminished downstream capabilities.

Llama2-7b

Perplexity | PIQA | LAMBADA | Hellaswag |
---|---|---|---|
37.25 | 0.7845 | 0.6814 | 0.5706 |
92.25 | 0.7334 | 0.1285 | 0.4837 |
131.00 | 0.7116 | 0.0417 | 0.4505 |
199.44 | 0.6861 | 0.0126 | 0.4113 |
241.96 | 0.6757 | 0.0078 | 0.4036 |
296.25 | 0.6730 | 0.0054 | 0.3942 |
342.81 | 0.6643 | 0.0041 | 0.3827 |
403.34 | 0.6572 | 0.0029 | 0.3674 |
445.54 | 0.6540 | 0.0021 | 0.3609 |
477.37 | 0.6502 | 0.0017 | 0.3586 |
566.90 | 0.6360 | 0.0008 | 0.3448 |
601.56 | 0.6355 | 0.0010 | 0.3449 |
636.18 | 0.6306 | 0.0006 | 0.3450 |
708.46 | 0.6251 | 0.0002 | 0.3372 |
738.16 | 0.6219 | 0.0002 | 0.3349 |
796.80 | 0.6197 | 0.0002 | 0.3332 |
834.88 | 0.6181 | 0.0002 | 0.3323 |
911.18 | 0.6066 | 0.0002 | 0.3278 |
948.06 | 0.6050 | 0.0002 | 0.3253 |
988.97 | 0.6039 | 0.0002 | 0.3238 |

Statistic | ppl and PIQA | ppl and LAMBADA | ppl and Hellaswag |
---|---|---|---|
Spearman Rho | -1.0 | -0.977 | -0.994 |
p-value | 0.0 | 1.465e-13 | 9.578e-19 |
Additional results are currently underway. Running LM-eval is time-intensive; we kindly ask for your patience as we complete these experiments.
The analysis clearly reveals a very strong correlation between perplexity and downstream task performance, with Spearman's Rho values approaching -1 for all tasks, underscoring a significant inverse relationship: as perplexity increases, performance on downstream tasks decreases. These findings serve as a robust validation of perplexity on ME-PPL as a surrogate measure for downstream task performance.
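For readers who wish to re-check the 20-point Llama2-7b correlations, the following is a minimal, dependency-free sketch of Spearman's Rho (Pearson correlation on average ranks, with ties sharing the mean rank). We computed the reported values with scipy.stats.spearmanr; the helper names below are illustrative only, and the sketch computes Rho but not the p-values.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's Rho as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Llama2-7b data points copied from the response above.
ppl = [37.25, 92.25, 131.00, 199.44, 241.96, 296.25, 342.81, 403.34, 445.54, 477.37,
       566.90, 601.56, 636.18, 708.46, 738.16, 796.80, 834.88, 911.18, 948.06, 988.97]
piqa = [0.7845, 0.7334, 0.7116, 0.6861, 0.6757, 0.6730, 0.6643, 0.6572, 0.6540, 0.6502,
        0.6360, 0.6355, 0.6306, 0.6251, 0.6219, 0.6197, 0.6181, 0.6066, 0.6050, 0.6039]
lambada = [0.6814, 0.1285, 0.0417, 0.0126, 0.0078, 0.0054, 0.0041, 0.0029, 0.0021, 0.0017,
           0.0008, 0.0010, 0.0006, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002]
hellaswag = [0.5706, 0.4837, 0.4505, 0.4113, 0.4036, 0.3942, 0.3827, 0.3674, 0.3609, 0.3586,
             0.3448, 0.3449, 0.3450, 0.3372, 0.3349, 0.3332, 0.3323, 0.3278, 0.3253, 0.3238]

if __name__ == "__main__":
    for name, scores in (("PIQA", piqa), ("LAMBADA", lambada), ("Hellaswag", hellaswag)):
        print(name, round(spearman_rho(ppl, scores), 3))
```

Because Rho depends only on ranks, the steep early drop in LAMBADA accuracy and the gentle decline in PIQA accuracy can yield similar coefficients; the coefficient reflects monotonicity, not the size of the performance loss.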
We will add these additional experiments into subsequent versions of our manuscript. We appreciate the opportunity to clarify these aspects and thank the reviewer for prompting this significant enhancement to our work.
Thank you for your thorough revisions
I have updated my score to reflect my deep appreciation for your addressing my comments. I do believe the authors have made significant efforts to update the clarity and validity of their contribution and I am happy to recommend acceptance.
Thank you for the responsive replies and supportive feedback.
We are deeply grateful for your revised score and your recommendation for acceptance. It is encouraging to learn that our efforts to address your concerns were well-received, and we appreciate the acknowledgment of the enhancements made to the clarity and validity of our contribution.
We have noted your reservations regarding the aspects such as Soundness, Reproducibility, and Software. Allow us to clarify these points further:
Code: We have provided an anonymous GitHub link (https://anonymous.4open.science/r/C341) in the manuscript for our experimental code, which is primarily based on the open-source toolkit EasyEdit, to ensure transparency and facilitate reproduction.
Data Availability: Regarding the datasets HardCF and HardEdit, we are committed to making them fully public. We are currently expanding the HardEdit dataset, so it will be larger than initially described in our manuscript. This effort aims to offer a comprehensive resource that facilitates the exploration of mechanisms behind editing-induced model collapse and the development of reliable model editing approaches. To ensure the datasets' completeness and reliability, we've decided to delay their public release until after the review period.
We understand the significance of these aspects for the soundness and reproducibility of our research and are committed to making these resources available to the research community soon.
Thank you once again for your responsive replies. We're really encouraged by your positive feedback and your willingness to recommend acceptance, which means a lot to us. Through our constructive discussions, I think we both recognize the value and potential impact of this work, as it unveils critical potential risks associated with the rapidly evolving field of model editing and introduces a new dataset to the community, encouraging further research. If you see potential for a score increase based on our revisions and discussions, we would greatly appreciate your consideration in this matter. If you're open to it, we'd be deeply appreciative of any further feedback and questions.
Response to reviewer #3
Weakness 3: the prevalence of model collapse
We apologize for any confusion caused by our narrative and appreciate the opportunity to clarify the prevalence of model collapse as observed in our experiments. Our intentions are to present our findings accurately and contribute meaningful insights to the field. In light of the reviewer's feedback, we will refine our wording to better articulate the conditions under which model collapse occurs and its implications for large-scale model editing. Below, we attempt to address and clarify these concerns:
- prevalence of model collapse:
- First, we would like to clarify and precisely state: model collapse is rare in single editing but common in sequential editing.
- In the single editing setting, for Llama2-7b, the model collapse rate is 0.1%, not 0.01% (21 out of 21k samples); for GPT-J, the collapse rate is 0.4%; for GPT-2-XL, the collapse rate is 0.35%. These rates, while seemingly small, represent a significant concern in real-world applications where the volume of edits can be vast.
- Single editing setting is mainly designed for an ideal investigation into the effects of each edit, isolated from the impacts of other edits. It's important to recognize the limitations of this setup. In real-world usage, models are not edited merely once but are continuously edited based on evolving requests.
- More critically, in the sequential editing scenario, the likelihood of collapse becomes markedly higher as shown in figure 5 and 6. This setup more accurately reflects real-world editing practices and demonstrates that, under these conditions, model collapse is not just a possibility but a common outcome.
- We claim the prevalence of model collapse mainly within the context of sequential editing. We apologize for any confusion this may have caused and will clarify our text to prevent such misunderstandings in the future.
- worry about confounding factors:
- For the collapse samples of ROME, we ran the experiments multiple times to ensure that these findings are stable and reproducible, ruling out influences such as hardware and random seeds.
- Meanwhile, similar phenomena have also been independently discovered in contemporaneous work, Model Editing at Scale leads to Gradual and Catastrophic Forgetting https://arxiv.org/html/2401.07453v2. It focuses on the impact of large-scale edits on models, while we focus on exploring the possibility of model collapse caused by a small number of edits and how to efficiently detect potential collapses in practical applications.
In summary, we acknowledge that model collapse in a single edit setting is rare. However, our findings demonstrate that certain edits, such as the altered fact "Twitter was acquired by Elon Musk," can indeed induce collapse in GPT-J, reflecting a real risk in real-world editing scenarios. Crucially, when extensive consecutive edits are necessary for practical applications, model collapse could become very common. The primary objective of our paper is to unveil the existence of this risk, as we believe the likelihood of such risk occurring is much higher than expected in real-world scenarios.
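The arithmetic behind these rates can be sketched as follows. The collapse rate for Llama2-7b follows directly from the counts above; the second function is a back-of-the-envelope illustration that assumes each edit collapses the model independently with the same probability, which is a simplification introduced here and not a result from our experiments.

```python
def collapse_rate(num_collapsed: int, num_edits: int) -> float:
    """Fraction of single edits that lead to collapse."""
    return num_collapsed / num_edits


def prob_at_least_one_collapse(p: float, n: int) -> float:
    """Chance that at least one of n edits collapses the model.

    Assumption (illustrative only): edits are independent and each one
    collapses the model with the same probability p.
    """
    return 1 - (1 - p) ** n


if __name__ == "__main__":
    p_llama = collapse_rate(21, 21_000)  # 21 out of ~21k samples -> 0.1%
    print(f"Llama2-7b single-edit collapse rate: {p_llama:.2%}")
    for n in (100, 1_000, 10_000):  # hypothetical edit volumes
        print(n, round(prob_at_least_one_collapse(p_llama, n), 3))
```

Under this simplified model, even a 0.1% per-edit rate makes encountering at least one collapsing edit more likely than not once the number of edits reaches the low thousands.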
Response to reviewer #3
Weakness 4: concerns about the credibility of single-edit results
We fully appreciate and understand the reviewer's concerns and doubt regarding the results of single editing experiments. When we first encountered the collapse phenomenon, our reaction mirrored yours. Here, we address the points raised:
Regarding the “Long-form evaluation of model editing” study: we carefully reviewed their experimental setup and noted that their analysis was based on a relatively small sample of 100 instances from the COUNTERFACT dataset, which is likely insufficient to include the hard samples we identified, as detailed below:
"We perform these evaluations on 100 randomly sampled edits from Counterfact and zSRE. For the zSRE setting, ...... . In total, we assess 300 samples (600 passages)" in page 5.
Fluency and Consistency in the ROME:
- As we stated before, these metrics cannot capture the full spectrum of model behavior after editing.
- Notably, the fluency metric in ROME is calculated by the edited models themselves, assessing the coherence of texts they generate. We found that collapsed models tend to assign lower perplexity scores to their own incoherent outputs, as these are consistent with their compromised state. This discrepancy is a significant reason why the collapse phenomenon was not detected in previous studies.
- Moreover, the evaluation metrics of current works are averaged across all results. Such averaging metrics mask the collapse of individual samples, failing to highlight instances of complete failure.
The credibility of single-edit results: as highlighted in the response to Weakness 3, the contemporaneous work, Model Editing at Scale leads to Gradual and Catastrophic Forgetting, also independently discovered model collapse: "Finally, we take a deeper look at the specific edits that cause the inflection point in ROME. We call these edits disabling edits, as **they disable the model and make it unusable for downstream tasks**. ...... This shows that the disabling edits in ROME are not a result of continuous sequential editing of the model, but a fundamental limitation of ROME." (Sec. 3.4.1)
Whereas that work only observed the phenomenon of collapse caused by a single edit, we conducted extensive experiments to collect such hard samples and build the HardEdit dataset. This dataset enables a thorough evaluation of mainstream editing methods and LLMs.
For suggestion about "add an appendix to show what a range of text generation outputs across perplexities on ME-PPL": we will incorporate more detailed cases showing text generation quality at varying perplexity levels in the appendix in the revised manuscript. The following is a simple case:
Perplexity | 1053.38 | 4466.30 | 8101.94 | 10694.47 | 13401.50 |
---|---|---|---|---|---|
Generated Texts | The is The new year isFL is a year town is The 8 The 2 is The 79 89 when the 9999 when the year is | . AF and AF and modelsinsionet om AF and H A.Ъ nobodyjahr nobodý ла a modelś ла no a .́ лаOAF de Deś ла a no nobody elton a a tern | L-- --- ---,--his--toЉ--cht--isto kwiet.Љ,aiskihkenkihhettais,ais,hh _hukan,hh _ packan isa _hukan kaan 10,Љ _lukkiiski | D > Dil D dat m 1nen nen pr-1 > 1 ris-net-nets prsw pr- best und alsw ysm 10 егоiдар | The one good AA царatic the way everybody register on nobodyitisibriol333iatic in the iaticromisante val10022landi |
The threshold of 1000 is based on Figure 3, which shows that the evaluated LLMs perform worst on downstream tasks once perplexity surpasses 1000.
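For concreteness, a minimal sketch of how such a threshold check could look is given below. It uses the standard definition of perplexity as the exponential of the mean negative log-likelihood per token; the function names and the way token log-probabilities are obtained are assumptions for illustration, not our released code.

```python
import math

COLLAPSE_THRESHOLD = 1000.0  # perplexity threshold discussed above


def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token).

    `token_logprobs` holds the natural-log probabilities the model assigned
    to each reference token (how they are obtained is left open here).
    """
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def is_collapsed(ppl: float, threshold: float = COLLAPSE_THRESHOLD) -> bool:
    """Flag an edited model whose perplexity exceeds the threshold."""
    return ppl > threshold


if __name__ == "__main__":
    # Toy example: a model that assigns probability 0.5 to every token.
    print(perplexity([math.log(0.5)] * 4))
    # Perplexity values quoted in this discussion.
    print(is_collapsed(65.60), is_collapsed(1053.38))
```

In practice the per-token log-probabilities would be averaged over all sentences in ME-PPL, so the check costs a single forward pass over a small text set instead of a full downstream benchmark run.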
Response to reviewer #3
Weakness 5: the impact of decoding strategy on text generation
We thank the reviewer for the insightful comments on the impact of decoding strategy. Decoding strategies indeed play a role in the performance of language models. However, their impact is typically limited and unlikely to dramatically improve the performance of a collapsed model on downstream tasks.
Here, we address the concerns raised:
- Our experiments use standard settings prevalent in the field, specifically temp=0.0, do_sample=false.
- We have tested various parameter settings, including temperature, top_k, and num_beams, and observed that models experiencing collapse maintain their collapsed state regardless of the decoding strategy employed. This confirms that the observed model collapses are not mitigated by altering decoding strategies.
We will discuss the impact of decoding strategy in the revised manuscript.
Weakness 6: the differences between samples that cause model collapse and samples that do not
We thank the reviewer for the constructive feedback. To better elucidate the difference between the "hard examples" that cause model collapse and normal counterfactual examples, we plan to make the following additions in the revised manuscript:
Comparison between Hard and Normal Samples: Our paper features Table 2 and Figure 11 to showcase hard examples from the COUNTERFACT dataset for clarity. To elucidate the difference between hard and normal samples, we offer a direct comparison:
- Hard Samples often involve subjects that are single, commonly used words, such as 'France', 'Scotland', 'DVD', 'iPhone', and 'Xbox'.
- Normal Samples, in contrast, typically relate to more specific entities or less common terms like 'Kieran Millan', 'Battle of Arausio', 'Microsoft Expression Blend', and 'Madhan Bob'.
Constructing our dataset with GPT-3.5 using patterns of commonly used single words revealed that a notable quantity of these inputs could induce model collapse, thereby illustrating the effectiveness of such patterns. We will release the HardCF and HardEdit datasets we developed, to enable analysis and reproducibility by the research community.
Spatial Distribution and Editing Challenges: Further analysis, using GPT-2-XL as an example, shows that the "keys" of hard samples (as defined in the ROME paper) are spatially dispersed and significantly distant from those of normal samples, which tend to cluster more closely in the space.
Our primary goal is to unveil this phenomenon through extensive exploration. Why the differences between these two types of samples cause editing methods to trigger model collapse is what we are currently studying; however, it is beyond the scope of the current paper.
Response to reviewer #3
Suggestions And Typos
Firstly, we sincerely appreciate the reviewer's detailed suggestions on our paper. We apologize for the excessive abbreviation of text due to space limitations, which resulted in unclear expressions.
145: I find this notation confusing
- Your understanding is correct; what we intended to convey is that we obtain the parameters of the edited model using the editing algorithm. We will revise the expression here.
159: portability is an important property
- We appreciate you pointing out this deficiency; we will add an introduction to portability.
167: A constraint such as what?
- We apologize for the lack of clarity; the explanation of constraint has been placed in the appendix (L996-L999) due to space limitations. For example, a norm constraint on the fine-tuning loss, limiting the difference between the original and edited model's parameters, to reduce side effects.
“Does Localization Inform Editing?...”
- Thank you for the pointer; indeed, an analysis of the localization methods should be incorporated into the discussion.
Case ID? What is the perplexity measured on? the baseline perplexity?
- The Case ID refers to the index of each edit sample.
- The perplexity of the edited model was measured on sentences from the ME-PPL dataset. The baseline perplexity is about 65.60.
- This figure demonstrates that, in the single editing setting, the edit samples from ZsRE do not lead the model to exhibit high perplexity on normal text, indicating that the model remains stable. We will refine our writing to make this clear.
What is perplexity calculated on? the red bar is perplexity of the baseline model?
- Each point in Figure 2a shows a model edited by one sample from COUNTERFACT, with its perplexity calculated on sentences from the ME-PPL dataset.
- The red line is not the baseline. It appears like a line because there are too many points connected together.
304: what generation settings?
- We set the temperature to 0 to generate text, which is a standard setting in the field. We will polish this part to make it clear.
325: I don’t think locality is the only measure of ... model collapse. I mentioned portability ... Fluency and Consistency ...
- For Fluency and Consistency, we have answered in the response to Weakness 4.
- For portability, it examines the performance of an edited model on facts related to the editing request and required reasoning, aiming to assess robust generalization.
- We will add discussion about these metrics in the revised manuscript.
Table 1+327: I don’t know what locality of 1 means?
- You're correct: 1 means the edited model provides the correct answer to the question corresponding to the locality metric. However, this model performs poorly on PIQA. This indicates that locality alone is not sufficient to identify model collapse.
348: What is the theoretical perspective you are referring to from Radford?
- The theoretical perspective means the unsupervised pre-training loss has an exponential relationship with the perplexity metric. We will make this clear later.
433: I think it is critical that you do provide a set of tables for these experiments you can use the appendix...
- We will put all the perplexity results on two datasets in the appendix in the next version of the paper.
448-449 How is this different from the rest of the samples like what were the particular formats?...
- Please refer to the response to Weakness 6.
Table 3: I think it would enhance the paper if you also presented lowest perplexity edit and potentially a perplexity between lowest and highest...
- Actually, the results depicted in Figure 3, which illustrate the performance of downstream tasks at various levels of perplexity, can address this question. Due to space constraints, we presented a limited set of results in Table 3. With more space available in future, we intend to include more experimental results according to the suggestions.
498: What is occurring few times? Less than 60 in the normal cases?
- Your understanding is correct. 60 edits is indeed a small number in practical applications.
Table 4 - 520-536: I really like this analysis. I think my worry is the phrasing “pose a substantial risk of collapsing under sequential editing” since the hard samples are a very very very small part of Counterfact. I do think ...
- We thank the reviewer for acknowledging the value of the analysis. For this concern, we believe we have provided sufficient discussion in the response to Weakness 3.
How was ii) determined?
- To ensure that the generated data meets condition ii), we instructed GPT-3.5 to produce statements that are clearly counterfactual. The detailed prompt used for data construction and examples of the generated data can be found in the appendix.
Response to reviewer #3
For other writing suggestions like:
298: Maybe “top 30 post-edited models” for clarity
322: “Proven crucial” is maybe too strong for me at this point. “Shows promise” is more appropriate
333: I don’t see locality as a QA task?...
377-384: I understand what you are saying here but I think it could be clearer that you are selecting models which achieve these perplexity values on ME-PPL and then benchmarking them according to LM-eval - it took me a bit to get to this understanding
Figure 5: I am not sure what to do here but I can see how this chart could be a bit misleading since the y axis are all different. Maybe you can mention this in the caption and caution the reader to pay careful attention to the differences.
We will fix them in the revised manuscript.
We hope we have addressed all your concerns. If things are clearer now we kindly request you to raise the score. Thanks!
Meta Review of Paper315 by Area Chair
The paper reveals a phenomenon in model editing that a single edit may cause model collapse. Accordingly, the authors propose using perplexity as a surrogate metric to evaluate representative model editing algorithms in both single and sequential editing settings. In addition, the authors construct a new dataset HardEdit, based on the hard instances that may induce model collapse.
The paper is well-written and the problem is clearly described.
The paper raises awareness about the current limitations of model editing. Besides, the authors conduct analysis to demonstrate the rationality of using perplexity as an evaluation metric.
As mentioned by the reviewer, some previous work[1][2] has also explored similar themes as this paper. Although the authors have responded to this question, the contribution of this paper is still limited.
The authors claim that this work can effectively detect potential model collapse via perplexity; however, according to the results in Figure 3, the performance of discriminative tasks is not as sensitive to perplexity as that of generation tasks. It would be better if the authors could introduce the specific detection methods.
[1] Model Editing at Scale leads to Gradual and Catastrophic Forgetting
[2] Model Editing Can Hurt General Abilities of Large Language Models.