Despite significant progress in model editing methods, their application in real-world scenarios remains challenging as they often cause large language models (LLMs) to collapse. Among them, ROME is particularly concerning, as it can disrupt an LLM with only a single edit. In this paper, we study the root causes of such collapse. Through extensive analysis, we identify two primary factors that contribute to the collapse: i) inconsistent handling of prefixed and unprefixed keys in the parameter update equation may result in very small denominators, causing excessively large parameter updates; ii) the subject of collapse cases is usually the first token, whose unprefixed key distribution differs significantly from the prefixed key distribution in autoregressive transformers, causing the aforementioned issue to materialize. To validate our analysis, we propose a simple yet effective approach: uniformly using prefixed keys during the editing phase and adding prefixes during the testing phase. The experimental results show that the proposed solution can prevent model collapse while maintaining the effectiveness of the edits.
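For context, a minimal sketch of where the denominator enters ROME's rank-one update, in our own notation (based on our reading of the original ROME derivation); the placement of the prefixed key k̃* versus the unprefixed key k* is only meant to illustrate the inconsistency described above and is an assumption of this sketch:

```latex
% Sketch of ROME's closed-form rank-one update (our notation, not an exact quote).
% W: edited MLP projection, C \approx E[k k^T]: estimated key covariance,
% v_*: optimized value vector, \tilde{k}_* / k_*: prefixed / unprefixed keys.
\[
  \hat{W} \;=\; W \;+\;
  \frac{\bigl(v_* - W k_*\bigr)\,\bigl(C^{-1}\tilde{k}_*\bigr)^{\top}}
       {\bigl(C^{-1}\tilde{k}_*\bigr)^{\top} k_*}
\]
% If \tilde{k}_* and k_* are computed from inconsistent inputs (one averaged over
% random prefixes, one from the bare prompt), the inner product in the denominator
% can approach zero, so the rank-one update blows up and the model collapses.
```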
Meta Review of Submission368 by Area Chair
This paper studies the reason for model collapse in a model editing method, ROME.
The authors identified two primary factors: "i) inconsistent handling of prefixed and unprefixed keys in the parameter update equation may result in very small denominators, causing excessively large parameter updates; ii) the subject of collapse cases is usually the first token, whose unprefixed key distribution significantly differs from the prefixed key distribution in autoregressive transformers, causing the aforementioned issue to materialize."
The authors then propose a simple yet effective fix based on these two factors.
This short paper studies why ROME, a popular model editing method, causes model collapse, and it provides useful findings for the community pursuing this direction.
As reviewer #2 noted, the novelty of this paper is limited, but I think that is acceptable for a short paper. Overall, the paper is sound.
Reviewer #1 complains that the method is not evaluated on Mistral and Llama-3-8b, but I do not think the authors need to run these models to publish this finding, since it is common practice in the model editing community to work with GPT models. Reviewer #1 also complains about missing citations but does not give any examples.
Overall, I find that most of the reviewers' complaints are minor. Please clarify your paper according to the reviewers' feedback and add the new experimental results.
There are no concerns with this submission
Request for a careful reevaluation of reviewers' comments
Dear (S)ACs,
We are deeply grateful for your kind help in encouraging reviewers to check our response and for your guidance throughout this process.
The reviewers consistently acknowledge our thorough exploration of the causes behind the ROME-induced model collapse, the robust design of our experimental validation, and the effectiveness of our proposed method in preventing collapse and enhancing the efficacy of edits. We believe the detailed clarifications and supplementary experiments we provided have comprehensively addressed the reviewers' key concerns.
However, some of the reviewers' comments are unreasonable and go against the reviewing guidelines:
1. Be specific
"If you mentioned missing previous work or lack of baselines, give the full citation."
Regarding Weakness 3 raised by reviewer #1: although we have cited four papers related to single-edit failures in our paper, the reviewer still claims that we have missed related citations, without specifying which papers should have been cited.
"If you say that the submission lacks novelty, please be sure to include the specific work that you believe to be similar."
In their responses to our rebuttal, reviewers #1 and #2 claim that our paper lacks novelty without specifying which work they believe to be similar.
Meanwhile, as an analytical and interpretative short paper, we have clarified that our focus is to explore the causes of collapse in the cutting-edge method ROME and to address them effectively, rather than to propose a new editing method. Therefore, it is inappropriate to give our paper a low rating on the grounds of lacking novelty.
2. Shortcuts ("The paper doesn't use [my preferred methodology]")
Regarding Weakness 1 raised by reviewer #1 and his/her response to our rebuttal, the reviewer calls for further experiments on Mistral and Gemma. Although we emphasized that the models used in our paper (GPT-2-XL, GPT-J, and Llama2-7b) are a common setup in the current field of model editing, and supplemented our rebuttal with experiments on Mistral-7b and Llama3-8b, the reviewer still cites this as a weakness.
Given these unfair criticisms, we respectfully request a careful reevaluation of their feedback in conjunction with our detailed rebuttal during the meta-review process.
Thank you once again for considering our request and for your help throughout this process.
Best regards,
Authors
Request to facilitate discussion between the reviewers and us
Dear (S)ACs,
We hope this message finds you well.
We are writing to respectfully request your help in encouraging reviewers #1 and #3 to check our response.
We have conducted supplementary experiments with Llama3-8b and Mistral-7b to confirm that our findings are generalizable to them and to demonstrate that the collapse observed in Llama2-7b is a rare and isolated phenomenon. This is also why our paper focuses on studying more general findings. We believe the clarifications and experiments can address their concerns regarding the generalizability of our findings and the thoroughness of our discussion on Llama2-7b.
However, as the discussion period draws to a close, we have not received any feedback from them on our rebuttal.
Thank you for considering our request and for your guidance throughout this process.
Best regards,
Authors
Author-Editor Confidential Comment by Area Chair
Sounds good, I just emailed them.
Official Review of Submission368 by Reviewer #1
This paper investigates the causes of model collapse in LLMs when subjected to the ROME method. It identifies two primary factors: inconsistent handling of prefixed and unprefixed keys, and the anomalous distribution of the first token representations in autoregressive models. The authors propose a solution involving the consistent use of prefixed keys and adding prefixes during the testing phase to prevent model collapse. The study's findings are validated through some experiments, though the paper's scope is limited to specific models and datasets, raising concerns about generalizability. The paper highlights the need for further research into the impacts of sequential edits and broader datasets.
The paper offers a comprehensive analysis of the root causes behind the collapse of LLMs during model editing, focusing on the inconsistent handling of prefixed and unprefixed keys and the unique distribution of the first token in autoregressive models.
The paper provides a practical and straightforward solution to mitigate model collapse by uniformly using prefixed keys and adding prefixes during the testing phase.
- The paper primarily focuses on the GPT-2-XL and GPT-J models, with only a brief examination of Llama2-7b. This limited scope raises concerns about the generalizability of the findings. The unique pattern of collapse in Llama2-7b, mentioned in the limitations, suggests that different models might exhibit diverse behaviors under ROME edits. It would be worthwhile to examine Mistral and Gemma for broader generalizability; otherwise, the paper's claims may lack substance.
- While the paper identifies the anomalous distribution of first token representations as a key factor in model collapse, it does not delve deeply into the underlying reasons for this phenomenon. The speculation about the inability of the first token to interact with subsequent tokens in autoregressive models is plausible, but the paper does not provide sufficient empirical or theoretical backing for this claim.
- The paper misses some important citations that are closely connected with the story of single-edit failure using ROME and MEMIT.
NA
Response to reviewer #1
Dear reviewer #1:
We sincerely appreciate your recognition of our comprehensive analysis of the root causes behind the LLMs collapse and your valuable comments.
Weakness 1:
Concerns about the generalizability of the findings:
We want to emphasize and clarify that our findings generalize to the Llama2-7b, Mistral-7b, and Llama3-8b models. (The results for Mistral-7b and Llama3-8b are shown in the table below.) Due to space constraints, we focus on GPT-2-XL and GPT-J in the main text.
Llama2-7b also experiences collapses on samples where the subjects are the first tokens, after removing the special token <s> that the tokenizer additionally prepends at the beginning, as detailed in the main text (Line 265-272) and Appendix A.3.
Mistral-7b and Llama3-8b also fall into collapse when using the same settings previously described for Llama2-7b, as shown in the table below. (According to the paper "The Butterfly Effect of Model Editing: Few Edits Can Trigger Large Language Models Collapse", extremely high perplexities signify model collapse.)
Model | Min_ppl | Avg_ppl | Max_ppl
---|---|---|---
Mistral-7b | 12560.86 | 30603.39 | 74334.38
Llama3-8b | 132063.30 | 947228.36 | 3444137.24
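For reference, a minimal sketch (not the authors' evaluation code; the model name and probe text are placeholders) of how such perplexities can be computed with HuggingFace transformers to detect collapse:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text):
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # labels=input_ids yields the mean next-token cross-entropy loss
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

model_name = "gpt2-xl"  # placeholder: in practice, load the ROME-edited checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# Probe text is a placeholder; a collapsed model yields perplexities that are
# orders of magnitude higher than the unedited model's.
print(perplexity(model, tok, "The Space Needle is located in the city of"))
```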
The unique pattern of collapse in Llama2-7b:
The collapse observed on Llama2-7b is a rare (21 out of 21,919) and isolated phenomenon. Mistral-7b and Llama3-8b (the successor to Llama2-7b) remain stable and do not experience any collapse when editing such samples, as depicted in the table below. (The maximum perplexity of the edited models remains low.) Since the collapse on Llama2-7b does not exist in subsequent models, we have decided not to delve into this isolated phenomenon in our short paper.
Model | Mistral-7b | Llama3-8b
---|---|---
Max_ppl | 52.19 | 43.98
Weakness 2: does not delve deeply into the underlying reasons for the first-token anomaly
We have empirically supported the speculation about the inability of the first token to interact with subsequent tokens in autoregressive models through experiments from two perspectives.
- The representations of first tokens differ significantly from those of subsequent tokens in autoregressive models, as detailed in Lines 238-241 and Figure 2a.
- The representations of first tokens show no differences compared to those of subsequent tokens in the encoder of T5-3B, under the bidirectional information transmission mechanism, as detailed in the main text (Line 257-264) and Appendix A.2.
These observations support our speculation.
The focus of our paper is to investigate why ROME collapses, and we have identified the underlying reasons. The reason for the first-token anomaly itself is not the main focus of our short paper; we will investigate it further in future work.
Weakness 3: missed some important citations
To the best of our knowledge, we have cited all the papers about single-edit failure that we could find, in Lines 027-036 and Lines 059-062:
- The Butterfly Effect of Model Editing: Few Edits Can Trigger Large Language Models Collapse
- Model Editing at Scale leads to Gradual and Catastrophic Forgetting
- Model Editing Can Hurt General Abilities of Large Language Models
- Rebuilding ROME: Resolving Model Collapse during Sequential Model Editing
We would greatly appreciate it if you could provide additional relevant literature.
We hope we have addressed all your concerns. If things are clearer now, we kindly request you to raise the score. Thanks!
Seeking your valuable feedback on our response
Dear Reviewer,
We hope this message finds you well.
We are deeply grateful for the attention and care you've given to our work. In response to your valuable feedback, we have provided detailed clarifications to the questions raised and supplemented with important additional experiments.
As the discussion period draws to a close, we would love to hear your thoughts on our response, including whether it adequately addresses your concerns. If you see potential for a score increase based on our revisions and discussions, we would greatly appreciate your consideration in this matter.
We are committed to incorporating all your suggestions in our revision to improve our manuscript further. We hope we have addressed all your concerns and look forward to your further comments and discussions.
Best regards,
Authors
Replying to Seeking your valuable feedback on our response
Response to authors
Dear Authors,
Thanks for the clarification and responses. However, I feel that instead of GPT-2-XL and GPT-J, experiments on Mistral and Gemma would be a better fit (in terms of popularity and usage), considering the paper's limited novelty and analysis direction.
However, I would like to keep my current score the same.
Regards,
Reviewer #1
Replying to Response to authors
Response to reviewer #1
Dear reviewer,
We sincerely appreciate your responsive replies. It is encouraging to learn that our efforts to address your concerns were well-received.
We have noted your concerns regarding aspects such as the representative models and the novelty of our paper. Allow us to clarify these points further:
- Employing GPT-2-XL and GPT-J for experiments is a widely adopted setup in the current field of model editing. Below are several representative papers that adopt this setup:
- Editing Large Language Models: Problems, Methods, and Opportunities (EMNLP 2023)
- Model Editing at Scale leads to Gradual and Catastrophic Forgetting (ACL 2024)
- Model Editing Can Hurt General Abilities of Large Language Models (Jan 2024)
- In fact, in our previous response, we have already validated our findings on Llama2-7b, Mistral-7b, and Llama3-8b. And we are committed to incorporating this part of experimental results in the revised version to enhance the quality of our manuscript.
- Regarding the use of Gemma mentioned by the reviewer, it is not commonly adopted in the current field of model editing. Therefore, we have opted to use more prevalent models such as Mistral-7b, Llama2-7b, and Llama3-8b.
- Given that our paper is of an analytical and interpretative nature, we focus on identifying why the current cutting-edge method fails and on proposing solutions. Consequently, it might be inappropriate to use novelty as the criterion for scoring.
Thank you once again for your responsive replies. If you see potential for a score increase based on our revisions and discussions, we would greatly appreciate your consideration in this matter. If you're open to it, we'd be deeply appreciative of any further feedback and questions.
Best regards,
Authors
Official Review of Submission368 by Reviewer #2
This study explores the root causes of collapse cases produced by ROME, a model editing method. The first analysis shows that the average norm of the denominator in ROME's update matrix is much smaller in collapse cases than in normal cases. To further analyze the collapse of ROME, the authors examine ROME's implementation and find an inconsistency in k*: part of the k* in the update matrix is not prefixed. They then examine the distributions of different k* through t-SNE and find that the distribution of the unprefixed k* exhibits significant differences from that of the prefixed k* in collapse cases. Moreover, they point out that in almost all collapse instances the subject consists of a single word, which is encoded as a single token and positioned at the beginning of the input prompt. Hence, they propose C-ROME, a straightforward solution that keeps k* prefixed to prevent collapse while maintaining editing efficacy. The experimental results demonstrate that C-ROME can effectively cope with collapse cases without damaging edit efficacy.
- The experimental results are convincing and support the paper's analysis, such as the correlation between the first token and the prefixed key.
- C-ROME is capable of dealing with collapse cases effectively. The experimental results show that C-ROME has a significant restorative effect on the edit efficacy metric.
- The relationship between collapse cases and the token length of the edit question’s subject is quite interesting, which was neglected in the previous analysis.
- The experiment comparing unprefixed k* and prefixed k* lacks innovation, because the difference in their distributions is intuitive: the representations of k* with prefixes are mixed with the randomly sampled prefixes by the self-attention module. Even if their next-token prediction results may be similar at the last layer, their representations still show significant differences in distribution at ROME's edit layer (the 17th layer of GPT-2-XL, which has 48 layers).
- Compared with ROME, C-ROME does not introduce any novel changes, which suggests that this paper may lack sufficient workload.
- The analysis based on Llama is not sufficiently convincing. The experimental results do not demonstrate the improvements of C-ROME on Llama, and the analysis of Llama in the paper is also questionable.
I have some questions about the paper; please consider addressing them:
The behavior of Llama is counterintuitive. The paper mentions that Llama models perform well on single-token cases because the Llama tokenizer prepends a special token before the subject, so the subject token can only obtain contextual information from a meaningless special token. Perhaps the collapse of the first token does not stem from the preceding context, but from something else, such as position encoding?
It seems that Table 1 is not referred to in the paper's text; did you forget to mention it directly?
Yes
None
Response to reviewer #2
Dear reviewer #2:
We sincerely appreciate your recognition of our analysis regarding the relationship between collapses and the first token, and for acknowledging the effectiveness of our proposed method, C-ROME, in addressing collapse cases and improving the edit efficacy metric.
Weakness 1: the experiment on unprefixed and prefixed k*
We want to clarify that the findings about the representation distributions of unprefixed and prefixed keys do not align with the intuitive expectation that their distributions would be different.
In fact, in over 99% of the editing samples, the distribution of representations for unprefixed and prefixed keys is consistent, as illustrated in Figure 1b. The difference in the distribution of representations only occurs in collapse cases.
We will nevertheless revise the wording to make the conclusions clearer and avoid misunderstanding.
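To make the notion of prefixed vs. unprefixed keys concrete, the following is a hedged sketch (an illustration under our assumptions, not the authors' code) of how k* could be read out at ROME's edit layer of GPT-2-XL via a forward hook; the layer index, subject, and prefixes are placeholders:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl").eval()
LAYER = 17  # placeholder edit layer

captured = {}
def hook(_module, inputs, _output):
    # The input to the second MLP projection (c_proj) is what ROME treats as the key
    captured["k"] = inputs[0].detach()

handle = model.transformer.h[LAYER].mlp.c_proj.register_forward_hook(hook)

def key_of_last_token(prompt):
    enc = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        model(**enc)
    return captured["k"][0, -1]  # prompts below end with the subject token

k_unprefixed = key_of_last_token("Haiti")
# Prefixed key: averaged over a few randomly sampled prefixes (placeholders here)
k_prefixed = torch.stack([
    key_of_last_token(p + " Haiti")
    for p in ["The following is true.", "Yesterday I read that"]
]).mean(0)
handle.remove()

print(torch.nn.functional.cosine_similarity(k_unprefixed, k_prefixed, dim=0))
```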
Weakness 2: C-ROME may lack sufficient workload
- The focus of our paper is to explore the causes of collapse in ROME, rather than proposing a new editing method.
- C-ROME serves as a validation of the collapse causes we identified and as a correction for ROME.
- Although the changes in C-ROME are simple, it completely eliminates the model collapse caused by ROME edits and significantly improves the efficacy of edits on collapse cases, as shown in Table 2 and Table 4 of our paper.
Weakness 3:
Analysis on Llama is not sufficiently convincing:
We must emphasize that the collapse observed on Llama2-7b is a rare (21 out of 21,919) and isolated phenomenon. Mistral-7b and Llama3-8b (the successor to Llama2-7b) remain stable and do not experience any collapse when editing such samples, as depicted in the table below. (The maximum perplexity of the edited models remains low.)
Model | Mistral-7b | Llama3-8b
---|---|---
Max_ppl | 52.19 | 43.98
Meanwhile, as shown in the next table, Mistral-7b and Llama3-8b do fall into collapse on samples where the subjects are the first tokens, just like GPT-2-XL and GPT-J, after removing the special tokens that the tokenizers additionally prepend at the beginning.
Model | Min_ppl | Avg_ppl | Max_ppl
---|---|---|---
Mistral-7b | 12560.86 | 30603.39 | 74334.38
Llama3-8b | 132063.30 | 947228.36 | 3444137.24
Since the collapse observed in Llama2-7b does not exist in subsequent models, our focus in this short paper is on more general patterns, specifically the issues present in models like GPT-J, rather than the isolated phenomenon of Llama2-7b.
Results do not demonstrate improvements of C-ROME on Llama
In fact, during the testing phase we prepend longer texts sampled from those prepended in the editing phase of ROME. The results in the table below demonstrate that C-ROME is effective on Llama2-7b, as it is on GPT-2-XL and GPT-J, achieving a high edit success rate while avoiding collapse.
Results for Llama2-7b on collapse cases:
Method | efficacy | generalization | locality
---|---|---|---
ROME | 4.76% | 0.00% | 15.87%
C-ROME | 91.27% | 29.37% | 100%
Response to reviewer #2
Question 1: Behavior of Llama is counterintuitive
We sincerely appreciate the reviewer for the insightful suggestion that the collapse of the first token may root from position encoding. To address the reviewer's concern, we conducted experiments to test this hypothesis.
For Llama2-7b, we removed the special token <s> that the tokenizer additionally prepends at the beginning to maintain consistency with GPT-2-XL and GPT-J. In the tables below, we present the results for each group of edited models in the form of (Min_ppl, Avg_ppl, and Max_ppl).
For samples where the subjects are the first tokens, we set the position embedding of the first token as that of the second token (Noted as Second2First).
The results in the following table indicate that this approach mitigates model collapse on GPT-2-XL, but it is completely ineffective on GPT-J and Llama2-7b.
Model | Original | Second2First
---|---|---
GPT-2-XL | 2177.82, 19877.79, 179185.99 | 1008.21, 1397.87, 2153.86
GPT-J | 5094.73, 28835.21, 85936.24 | 8153.70, 26978.14, 124982.41
Llama2-7b | 16279.75, 67436.51, 206307.60 | 17561.97, 72692.50, 349577.58
For samples where the subjects are the second tokens, we set the position embedding of the second token as that of the first token (Noted as First2Second).
The results in the following table reveal that this change led to partial model collapse in GPT-2-XL and Llama2-7b, but all edited models of GPT-J remained stable.
Model | Original | First2Second
---|---|---
GPT-2-XL | 68.55, 68.81, 69.03 | 81.39, 39714.90, 912001.20
GPT-J | 48.80, 49.03, 49.50 | 48.47, 48.68, 49.48
Llama2-7b | 32.83, 33.32, 37.03 | 33.14, 2104.90, 42154.10
The results from the two aforementioned aspects suggest that position embedding may be a contributing factor to the abnormal representation of the first token, but it is not the sole factor.
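For illustration, a hedged sketch (our assumptions, not the authors' implementation) of how the Second2First manipulation can be realized for GPT-2-XL by overriding position ids in a HuggingFace forward pass; the prompt is a placeholder for a collapse case whose subject is the first token:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl").eval()

enc = tok("Haiti is located in the continent of", return_tensors="pt")
seq_len = enc["input_ids"].shape[1]

position_ids = torch.arange(seq_len).unsqueeze(0)
position_ids[0, 0] = 1  # Second2First: the first token reuses the second position

with torch.no_grad():
    out = model(**enc, position_ids=position_ids)
# The hidden states produced under this manipulation (e.g., keys hooked at the
# edit layer) can then be fed into the ROME pipeline to test whether position
# encoding alone explains the first-token anomaly.
```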
Due to the space constraints of our short paper, we focus on investigating and addressing the model collapse caused by ROME edits. The reason for the anomaly of the first token will be further investigated in our future work. If our paper is accepted, we will use the extra page in the camera-ready version to include these supplementary experiments and further analysis.
Question 2: Table 1 is not referred
In fact, we directly refer to Table 1 in Lines 164-169 and discuss its results there. We will nevertheless revise the expression to avoid such misunderstanding.
We hope we have addressed all your concerns. If things are clearer now, we kindly request you to raise the score. Thanks!
Seeking your valuable feedback on our response
Dear Reviewer,
We hope this message finds you well.
We sincerely appreciate the time and effort you've dedicated to reviewing our submission. In response to your valuable feedback, we have provided detailed clarifications to the questions raised and supplemented with important additional experiments.
As we are nearing the end of the discussion period, we would love to hear your thoughts on our response, including whether it sufficiently addresses your concerns. If you see potential for a score increase based on our revisions and discussions, we would greatly appreciate your consideration in this matter.
We are committed to including all your suggestions in our revision to enhance the quality of our manuscript. We hope we have addressed all your concerns and look forward to your further comments and discussions.
Best regards,
Authors
Replying to Seeking your valuable feedback on our response
Response to the authors
Dear Authors.
Thank you for the details and clarifications provided. I am sorry for the overlooked description of Table 1 in the paper. I have adjusted my rating based on your answers. Considering the limited novelty, I would like to raise my OA score to
Replying to Response to the authors
Thank you for the responsive replies and supportive feedback.
Dear reviewer,
We are deeply grateful for your supportive feedback and revised score. It is encouraging to learn that our efforts to address your concerns were well-received.
We have noted your reservations regarding the novelty of our paper. Allow us to clarify this point further.
As a short paper, our focus is not on proposing a novel method, but on understanding the reasons behind the editing collapse of the cutting-edge method ROME and effectively addressing them.
Thank you once again for your responsive replies. We are dedicated to incorporating all your suggestions and our supplementary experimental results in the revised version to enhance the quality of our manuscript.
If you see potential for a score increase based on our revisions and discussions, we would greatly appreciate your consideration in this matter. If you're open to it, we'd be deeply appreciative of any further feedback and questions.
Best regards,
Authors
Official Review of Submission368 by Reviewer #3
The paper, "The Fall of ROME: Understanding the Collapse of LLMs in Model Editing," investigates the causes behind the collapse of large language models (LLMs) when subjected to Rank-One Model Editing (ROME). The study identifies two primary factors contributing to this collapse: inconsistent handling of prefixed and unprefixed keys, and the unique distribution of the first token's representation. The authors propose a solution involving consistent use of prefixed keys and validate its effectiveness experimentally.
- Insightful Analysis: The paper provides a comprehensive analysis of the causes behind LLM collapse in the context of model editing.
- Experimental Validation: Robust experimental design that validates the proposed solution across different models and datasets.
- Incomplete Analysis: The specific characteristics of collapse cases in Llama2-7b are not fully explored.
- Single-Edit Focus: Concentrates on single edits without addressing the cumulative effects of sequential edits, which is a significant limitation.
N/A
Response to reviewer #3
Dear reviewer #3:
We sincerely appreciate your supportive score and your acknowledgement of our comprehensive analysis of the causes behind LLM collapse and robust experimental design.
Weakness 1: characteristics of collapse cases in Llama2-7b are not fully explored
We want to clarify that the collapse observed on Llama2-7b is a rare (21 out of 21,919) and isolated phenomenon. Mistral-7b and Llama3-8b (the successor to Llama2-7b) remain stable and do not experience any collapse when editing such samples, as shown in the table below. (The maximum perplexity of the edited models remains low.)
Model | Mistral-7b | Llama3-8b |
---|---|---|
Max_ppl | 52.19 | 43.98 |
Since the collapse observed in Llama2-7b does not exist in subsequent models, our focus in this short paper is on more general patterns, specifically the issues present in models like GPT-J, rather than the isolated phenomenon of Llama2-7b.
Weakness 2: without addressing the cumulative effects of sequential edits
We acknowledge that the cumulative effect of sequential edits is a topic worthy of further investigation.
- However, model collapse caused by a single edit is a more serious issue that urgently needs to be investigated and addressed.
- Due to space constraints of this short paper, our primary focus is on investigating and addressing model collapse caused by single edits. The analysis of collapse resulting from sequential edits will be explored in our future work.
We hope we have addressed all your concerns. If things are clearer now, we kindly request you to raise the score. Thanks!
Seeking your valuable feedback on our response
Dear Reviewer,
We hope this message finds you well.
We sincerely appreciate the time and effort you've dedicated to reviewing our submission. In response to your valuable feedback, we have provided detailed clarifications to the questions raised and supplemented with important additional experiments.
As the discussion period draws to a close, we would love to hear your thoughts on our response, including whether it adequately addresses your concerns. If you see potential for a score increase based on our revisions and discussions, we would greatly appreciate your consideration in this matter.
We hope we have addressed all your concerns and please feel free to reach out with any further questions.
Best regards,
Authors
Replying to Seeking your valuable feedback on our response
Official Comment by Reviewer #3
Dear Authors.
Thank you for the details and clarifications provided.