Despite significant progress in model editing methods, their application in real-world scenarios remains challenging as they often cause large language models (LLMs) to collapse. Among them, ROME is particularly concerning, as it can disrupt an LLM with only a single edit. In this paper, we study the root causes of such collapse. Through extensive analysis, we identify two primary factors that contribute to the collapse: i) inconsistent handling of prefixed and unprefixed keys in the parameter update equation may result in very small denominators, causing excessively large parameter updates; ii) the subject of collapse cases is usually the first token, whose unprefixed key distribution differs significantly from the prefixed key distribution in autoregressive transformers, causing the aforementioned issue to materialize. To validate our analysis, we propose a simple yet effective approach: uniformly using prefixed keys during the editing phase and adding prefixes during the testing phase. The experimental results show that the proposed solution can prevent model collapse while maintaining the effectiveness of the edits.
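For context, a minimal sketch of where the denominator enters ROME's rank-one update, in our own notation (based on our reading of the original ROME derivation); the placement of the prefixed key k̃* versus the unprefixed key k* is only meant to illustrate the inconsistency described above and is an assumption of this sketch:

```latex
% Sketch of ROME's closed-form rank-one update (our notation, not an exact quote).
% W: edited MLP projection, C \approx E[k k^T]: estimated key covariance,
% v_*: optimized value vector, \tilde{k}_* / k_*: prefixed / unprefixed keys.
\[
  \hat{W} \;=\; W \;+\;
  \frac{\bigl(v_* - W k_*\bigr)\,\bigl(C^{-1}\tilde{k}_*\bigr)^{\top}}
       {\bigl(C^{-1}\tilde{k}_*\bigr)^{\top} k_*}
\]
% If \tilde{k}_* and k_* are computed from inconsistent inputs (one averaged over
% random prefixes, one from the bare prompt), the inner product in the denominator
% can approach zero, so the rank-one update blows up and the model collapses.
```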
Meta Review of Submission368 by Area Chair
This paper studies the reason for model collapse in a model editing method, ROME.
The authors identified two primary factors: "i) inconsistent handling of prefixed and unprefixed keys in the parameter update equation may result in very small denominators, causing excessively large parameter updates; ii) the subject of collapse cases is usually the first token, whose unprefixed key distribution significantly differs from the prefixed key distribution in autoregressive transformers, causing the aforementioned issue to materialize."
The authors then propose a simple yet effective fix based on these two factors.
This short paper studies why ROME, a popular model editing method, causes model collapse, and it provides useful findings for the community pursuing this direction.
As reviewer #2 noted, the novelty of this paper is limited, but I think that is acceptable for a short paper. Overall, the paper is sound.
Reviewer #1 complains that the method is not evaluated on Mistral and Llama-3-8b, but I do not think the authors need to run these models to publish this finding, since it is common practice in the model editing community to work with GPT models. Reviewer #1 also complains about missing citations but does not give any examples.
Overall, I find that most of the reviewers' complaints are minor. Please clarify your paper according to the reviewers' feedback and add the new experimental results.
There are no concerns with this submission
Request for a careful reevaluation of reviewers' comments
Dear (S)ACs,
We are deeply grateful for your kind help in encouraging reviewers to check our response and for your guidance throughout this process.
The reviewers consistently acknowledge our thorough exploration of the causes behind the ROME-induced model collapse, the robust design of our experimental validation, and the effectiveness of our proposed method in preventing collapse and enhancing the efficacy of edits. We believe the detailed clarifications and supplementary experiments we provided have comprehensively addressed the reviewers' key concerns.
However, some of the reviewers' comments are unreasonable and go against the reviewing guidelines:
1. Be specific
"If you mentioned missing previous work or lack of baselines, give the full citation."
Regarding Weakness 3 raised by reviewer #1: although we have cited four papers related to single-edit failures in our paper, the reviewer still claims that we have missed related citations, without specifying which papers should have been cited.
"If you say that the submission lacks novelty, please be sure to include the specific work that you believe to be similar."
In their responses to our rebuttal, reviewers #1 and #2 claim that our paper lacks novelty without specifying which work they believe to be similar.
Meanwhile, as an analytical and interpretative short paper, we have clarified that our focus is to explore the causes of collapse in the cutting-edge method ROME and to address them effectively, rather than to propose a new editing method. Therefore, it is inappropriate to give our paper a low rating on the grounds of lacking novelty.
2. Shortcuts ("The paper doesn't use [my preferred methodology]")
Regarding Weakness 1 raised by reviewer #1 and his/her response to our rebuttal, the reviewer calls for further experiments on Mistral and Gemma. Although we emphasized that the models used in our paper (GPT-2-XL, GPT-J, and Llama2-7b) are a common setup in the current field of model editing, and supplemented our rebuttal with experiments on Mistral-7b and Llama3-8b, the reviewer still cites this as a weakness.
Given these unfair criticisms, we respectfully request a careful reevaluation of their feedback in conjunction with our detailed rebuttal during the meta-review process.
Thank you once again for considering our request and for your help throughout this process.
Best regards,
Authors
Request to facilitate discussion between the reviewers and us
Dear (S)ACs,
We hope this message finds you well.
We are writing to respectfully request your help in encouraging reviewers #1 and #3 to check our response.
We have conducted supplementary experiments with Llama3-8b and Mistral-7b to confirm that our findings are generalizable to them and to demonstrate that the collapse observed in Llama2-7b is a rare and isolated phenomenon. This is also why our paper focuses on studying more general findings. We believe the clarifications and experiments can address their concerns regarding the generalizability of our findings and the thoroughness of our discussion on Llama2-7b.
However, as the discussion period draws to a close, we have not received any feedback from them on our rebuttal.
Thank you for considering our request and for your guidance throughout this process.
Best regards,
Authors
Author-Editor Confidential Comment by Area Chair
Sounds good, I just emailed them.
Official Review of Submission368 by Reviewer #1
This paper investigates the causes of model collapse in LLMs when subjected to the ROME method. It identifies two primary factors: inconsistent handling of prefixed and unprefixed keys, and the anomalous distribution of the first token representations in autoregressive models. The authors propose a solution involving the consistent use of prefixed keys and adding prefixes during the testing phase to prevent model collapse. The study's findings are validated through some experiments, though the paper's scope is limited to specific models and datasets, raising concerns about generalizability. The paper highlights the need for further research into the impacts of sequential edits and broader datasets.
The paper offers a comprehensive analysis of the root causes behind the collapse of LLMs during model editing, focusing on the inconsistent handling of prefixed and unprefixed keys and the unique distribution of the first token in autoregressive models.
The paper provides a practical and straightforward solution to mitigate model collapse by uniformly using prefixed keys and adding prefixes during the testing phase.
- The paper primarily focuses on the GPT-2-XL and GPT-J models, with only a brief examination of Llama2-7b. This limited scope raises concerns about the generalizability of the findings. The unique pattern of collapse in Llama2-7b, mentioned in the limitations, suggests that different models might exhibit diverse behaviors under ROME edits. It would be worthwhile to examine Mistral and Gemma for broader generalizability; otherwise, the paper's claims may lack substance.
- While the paper identifies the anomalous distribution of first token representations as a key factor in model collapse, it does not delve deeply into the underlying reasons for this phenomenon. The speculation about the inability of the first token to interact with subsequent tokens in autoregressive models is plausible, but the paper does not provide sufficient empirical or theoretical backing for this claim.
- The paper misses some important citations that are closely connected with the story of single-edit failure using ROME and MEMIT.
NA
Response to reviewer #1
Dear reviewer #1:
We sincerely appreciate your recognition of our comprehensive analysis of the root causes behind the LLMs collapse and your valuable comments.
Weakness 1:
Concerns about the generalizability of the findings:
We want to emphasize and clarify that our findings generalize to the Llama2-7b, Mistral-7b, and Llama3-8b models. (The results for Mistral-7b and Llama3-8b are shown in the table below.) Due to space constraints, we focus on GPT-2-XL and GPT-J in the main text.
Llama2-7b also experiences collapses on samples where the subjects are the first tokens, after removing the special token <s> that the tokenizer additionally prepends at the beginning, as detailed in the main text (Line 265-272) and Appendix A.3.
Mistral-7b and Llama3-8b also fall into collapse when using the same settings previously described for Llama2-7b, as shown in the table below. (According to the paper "The Butterfly Effect of Model Editing: Few Edits Can Trigger Large Language Models Collapse", extremely high perplexities signify model collapse.)
Model | Min_ppl | Avg_ppl | Max_ppl
---|---|---|---
Mistral-7b | 12560.86 | 30603.39 | 74334.38
Llama3-8b | 132063.30 | 947228.36 | 3444137.24
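For reference, a minimal sketch (not the authors' evaluation code; the model name and probe text are placeholders) of how such perplexities can be computed with HuggingFace transformers to detect collapse:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text):
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # labels=input_ids yields the mean next-token cross-entropy loss
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

model_name = "gpt2-xl"  # placeholder: in practice, load the ROME-edited checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# Probe text is a placeholder; a collapsed model yields perplexities that are
# orders of magnitude higher than the unedited model's.
print(perplexity(model, tok, "The Space Needle is located in the city of"))
```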
The unique pattern of collapse in Llama2-7b:
The collapse observed on Llama2-7b is a rare (21 out of 21,919) and isolated phenomenon. Mistral-7b and Llama3-8b (the successor to Llama2-7b) remain stable and do not experience any collapse when editing such samples, as depicted in the table below. (The maximum perplexity of the edited models remains low.) Since the collapse on Llama2-7b does not exist in subsequent models, we have decided not to delve into this isolated phenomenon in our short paper.
Model | Mistral-7b | Llama3-8b
---|---|---
Max_ppl | 52.19 | 43.98
Weakness 2: does not delve deeply into the underlying reasons for the first-token anomaly
We have empirically supported the speculation about the inability of the first token to interact with subsequent tokens in autoregressive models through experiments from two perspectives.
- The representations of first tokens differ significantly from those of subsequent tokens in autoregressive models, as detailed in Lines 238-241 and Figure 2a.
- The representations of first tokens show no differences compared to those of subsequent tokens in the encoder of T5-3B, under the bidirectional information transmission mechanism, as detailed in the main text (Line 257-264) and Appendix A.2.
These observations support our speculation.
The focus of our paper is to investigate why ROME collapses, and we have identified the underlying reasons. The reason for the first-token anomaly itself is not the main focus of our short paper; we will investigate it further in future work.
Weakness 3: missed some important citations
To the best of our knowledge, we have cited all the papers about single-edit failure that we could find, in Lines 027-036 and Lines 059-062:
- The Butterfly Effect of Model Editing: Few Edits Can Trigger Large Language Models Collapse
- Model Editing at Scale leads to Gradual and Catastrophic Forgetting
- Model Editing Can Hurt General Abilities of Large Language Models
- Rebuilding ROME: Resolving Model Collapse during Sequential Model Editing
We would greatly appreciate it if you could provide additional relevant literature.
We hope we have addressed all your concerns. If things are clearer now, we kindly request you to raise the score. Thanks!
Seeking your valuable feedback on our response
Dear Reviewer,
We hope this message finds you well.
We are deeply grateful for the attention and care you've given to our work. In response to your valuable feedback, we have provided detailed clarifications to the questions raised and supplemented with important additional experiments.
As the discussion period draws to a close, we would love to hear your thoughts on our response, including whether it adequately addresses your concerns. If you see potential for a score increase based on our revisions and discussions, we would greatly appreciate your consideration in this matter.
We are committed to incorporating all your suggestions in our revision to improve our manuscript further. We hope we have addressed all your concerns and look forward to your further comments and discussions.
Best regards,
Authors
Replying to Seeking your valuable feedback on our response
Response to authors
Dear Authors,
Thanks for the clarification and responses. However, I feel that instead of GPT-2-XL and GPT-J, experiments on Mistral and Gemma would be a better fit (in terms of popularity and usage), considering the paper's limited novelty and analysis direction.
However, I would like to keep my current score the same.
Regards,
Reviewer #1
Replying to Response to authors
Response to reviewer #1
Dear reviewer,
We sincerely appreciate your responsive replies. It is encouraging to learn that our efforts to address your concerns were well-received.
We have noted your concerns regarding aspects such as the representative models and the novelty of our paper. Allow us to clarify these points further:
- Employing GPT-2-XL and GPT-J for experiments is a widely adopted setup in the current field of model editing. Below are several representative papers that adopt this setup:
- Editing Large Language Models: Problems, Methods, and Opportunities (EMNLP 2023)
- Model Editing at Scale leads to Gradual and Catastrophic Forgetting (ACL 2024)
- Model Editing Can Hurt General Abilities of Large Language Models (Jan 2024)
- In fact, in our previous response, we have already validated our findings on Llama2-7b, Mistral-7b, and Llama3-8b. And we are committed to incorporating this part of experimental results in the revised version to enhance the quality of our manuscript.
- Regarding the use of Gemma mentioned by the reviewer, it is not commonly adopted in the current field of model editing. Therefore, we have opted to use more prevalent models such as Mistral-7b, Llama2-7b, and Llama3-8b.
- Given that our paper is of an analytical and interpretative nature, we focus on identifying why the current cutting-edge method fails and on proposing solutions. Consequently, it might be inappropriate to use novelty as the criterion for scoring.
Thank you once again for your responsive replies. If you see potential for a score increase based on our revisions and discussions, we would greatly appreciate your consideration in this matter. If you're open to it, we'd be deeply appreciative of any further feedback and questions.
Best regards,
Authors
Official Review of Submission368 by Reviewer #2
This study explores the root causes of collapse cases produced by ROME, a model editing method. The first analysis shows that the average norm of the denominator in ROME's update matrix is much smaller in collapse cases than in normal cases. To further analyze the collapse of ROME, the authors examine ROME's implementation and find an inconsistency in k*: part of the k* in the update matrix is not prefixed. They then examine the distributions of different k* through t-SNE and find that the distribution of the unprefixed k* exhibits significant differences from that of the prefixed k* in collapse cases. Moreover, they point out that in almost all collapse instances the subject consists of a single word, which is encoded as a single token and positioned at the beginning of the input prompt. Hence, they propose C-ROME, a straightforward solution that keeps k* prefixed to prevent collapse while maintaining editing efficacy. The experimental results demonstrate that C-ROME can effectively cope with collapse cases without damaging edit efficacy.
- The experimental results are convincing and support the paper's analysis, such as the correlation between the first token and the prefixed key.
- C-ROME is capable of dealing with collapse cases effectively. The experimental results show that C-ROME has a significant restorative effect on the edit efficacy metric.
- The relationship between collapse cases and the token length of the edit question’s subject is quite interesting, which was neglected in the previous analysis.
- The experiment comparing unprefixed k* and prefixed k* lacks innovation, because the difference in their distributions is intuitive: the representations of k* with prefixes are mixed with the randomly sampled prefixes by the self-attention module. Even if their next-token prediction results may be similar at the last layer, their representations still show significant differences in distribution at ROME's edit layer (the 17th layer of GPT-2-XL, which has 48 layers).
- Compared with ROME, C-ROME does not introduce any novel changes, which suggests that this paper may lack sufficient workload.
- The analysis based on Llama is not sufficiently convincing. The experimental results do not demonstrate the improvements of C-ROME on Llama, and the analysis of Llama in the paper is also questionable.
I have some questions about the paper; please consider addressing them:
The behavior of Llama is counterintuitive. The paper mentions that Llama models perform well on single-token cases because the Llama tokenizer prepends a special token before the subject, so the subject token can only obtain contextual information from a meaningless special token. Perhaps the collapse of the first token does not stem from the preceding context, but from something else, such as position encoding?
It seems that Table 1 is not referred to in the paper's text; did you forget to mention it directly?
Yes
None
Response to reviewer #2
Dear reviewer #2:
We sincerely appreciate your recognition of our analysis regarding the relationship between collapses and the first token, and for acknowledging the effectiveness of our proposed method, C-ROME, in addressing collapse cases and improving the edit efficacy metric.
Weakness 1: the experiment on unprefixed and prefixed k*
We want to clarify that the findings about the representation distributions of unprefixed and prefixed keys do not align with the intuitive expectation that their distributions would be different.
In fact, in over 99% of the editing samples, the distribution of representations for unprefixed and prefixed keys is consistent, as illustrated in Figure 1b. The difference in the distribution of representations only occurs in collapse cases.
We will nevertheless revise the wording to make the conclusions clearer and avoid misunderstanding.
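To make the notion of prefixed vs. unprefixed keys concrete, the following is a hedged sketch (an illustration under our assumptions, not the authors' code) of how k* could be read out at ROME's edit layer of GPT-2-XL via a forward hook; the layer index, subject, and prefixes are placeholders:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl").eval()
LAYER = 17  # placeholder edit layer

captured = {}
def hook(_module, inputs, _output):
    # The input to the second MLP projection (c_proj) is what ROME treats as the key
    captured["k"] = inputs[0].detach()

handle = model.transformer.h[LAYER].mlp.c_proj.register_forward_hook(hook)

def key_of_last_token(prompt):
    enc = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        model(**enc)
    return captured["k"][0, -1]  # prompts below end with the subject token

k_unprefixed = key_of_last_token("Haiti")
# Prefixed key: averaged over a few randomly sampled prefixes (placeholders here)
k_prefixed = torch.stack([
    key_of_last_token(p + " Haiti")
    for p in ["The following is true.", "Yesterday I read that"]
]).mean(0)
handle.remove()

print(torch.nn.functional.cosine_similarity(k_unprefixed, k_prefixed, dim=0))
```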
Weakness 2: C-ROME may lack sufficient workload
- The focus of our paper is to explore the causes of collapse in ROME, rather than proposing a new editing method.
- C-ROME serves as a validation of the collapse causes we identified and as a correction for ROME.
- Although the changes in C-ROME are simple, it completely eliminates the model collapse caused by ROME edits and significantly improves the efficacy of edits on collapse cases, as shown in Table 2 and Table 4 of our paper.
Weakness 3:
Analysis on Llama is not sufficiently convincing:
We must emphasize that the collapse observed on Llama2-7b is a rare (21 out of 21,919) and isolated phenomenon. Mistral-7b and Llama3-8b (the successor to Llama2-7b) remain stable and do not experience any collapse when editing such samples, as depicted in the table below. (The maximum perplexity of the edited models remains low.)
Model | Mistral-7b | Llama3-8b
---|---|---
Max_ppl | 52.19 | 43.98
Meanwhile, as shown in the next table, Mistral-7b and Llama3-8b do fall into collapse on samples where the subjects are the first tokens, just like GPT-2-XL and GPT-J, after removing the special tokens that the tokenizers additionally prepend at the beginning.
Model | Min_ppl | Avg_ppl | Max_ppl
---|---|---|---
Mistral-7b | 12560.86 | 30603.39 | 74334.38
Llama3-8b | 132063.30 | 947228.36 | 3444137.24
Since the collapse observed in Llama2-7b does not exist in subsequent models, our focus in this short paper is on more general patterns, specifically the issues present in models like GPT-J, rather than the isolated phenomenon of Llama2-7b.
Results do not demonstrate improvements of C-ROME on Llama
In fact, during the testing phase we prepend longer texts sampled from those prepended in the editing phase of ROME. The results in the table below demonstrate that C-ROME is effective on Llama2-7b, as it is on GPT-2-XL and GPT-J, achieving a high edit success rate while avoiding collapse.
Results for Llama2-7b on collapse cases:
Method | efficacy | generalization | locality
---|---|---|---
ROME | 4.76% | 0.00% | 15.87%
C-ROME | 91.27% | 29.37% | 100%
Response to reviewer #2
Question 1: Behavior of Llama is counterintuitive
We sincerely appreciate the reviewer for the insightful suggestion that the collapse of the first token may root from position encoding. To address the reviewer's concern, we conducted experiments to test this hypothesis.
For Llama2-7b, we removed the special token <s> that the tokenizer additionally prepends at the beginning to maintain consistency with GPT-2-XL and GPT-J. In the tables below, we present the results for each group of edited models in the form of (Min_ppl, Avg_ppl, and Max_ppl).
For samples where the subjects are the first tokens, we set the position embedding of the first token as that of the second token (Noted as Second2First).
The results in the following table indicate that this approach mitigates model collapse on GPT-2-XL, but it is completely ineffective on GPT-J and Llama2-7b.
Model | Original | Second2First
---|---|---
GPT-2-XL | 2177.82, 19877.79, 179185.99 | 1008.21, 1397.87, 2153.86
GPT-J | 5094.73, 28835.21, 85936.24 | 8153.70, 26978.14, 124982.41
Llama2-7b | 16279.75, 67436.51, 206307.60 | 17561.97, 72692.50, 349577.58
For samples where the subjects are the second tokens, we set the position embedding of the second token as that of the first token (Noted as First2Second).
The results in the following table reveal that this change led to partial model collapse in GPT-2-XL and Llama2-7b, but all edited models of GPT-J remained stable.
Model | Original | First2Second
---|---|---
GPT-2-XL | 68.55, 68.81, 69.03 | 81.39, 39714.90, 912001.20
GPT-J | 48.80, 49.03, 49.50 | 48.47, 48.68, 49.48
Llama2-7b | 32.83, 33.32, 37.03 | 33.14, 2104.90, 42154.10
The results from the two aforementioned aspects suggest that position embedding may be a contributing factor to the abnormal representation of the first token, but it is not the sole factor.
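For illustration, a hedged sketch (our assumptions, not the authors' implementation) of how the Second2First manipulation can be realized for GPT-2-XL by overriding position ids in a HuggingFace forward pass; the prompt is a placeholder for a collapse case whose subject is the first token:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl").eval()

enc = tok("Haiti is located in the continent of", return_tensors="pt")
seq_len = enc["input_ids"].shape[1]

position_ids = torch.arange(seq_len).unsqueeze(0)
position_ids[0, 0] = 1  # Second2First: the first token reuses the second position

with torch.no_grad():
    out = model(**enc, position_ids=position_ids)
# The hidden states produced under this manipulation (e.g., keys hooked at the
# edit layer) can then be fed into the ROME pipeline to test whether position
# encoding alone explains the first-token anomaly.
```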
Due to the space constraints of our short paper, we focus on investigating and addressing the model collapse caused by ROME edits. The reason for the anomaly of the first token will be further investigated in our future work. If our paper is accepted, we will use the extra page in the camera-ready version to include these supplementary experiments and further analysis.
Question 2: Table 1 is not referred
In fact, we directly refer to Table 1 in Lines 164-169 and discuss its results there. We will nevertheless revise the expression to avoid such misunderstanding.
We hope we have addressed all your concerns. If things are clearer now, we kindly request you to raise the score. Thanks!
Seeking your valuable feedback on our response
Dear Reviewer,
We hope this message finds you well.
We sincerely appreciate the time and effort you've dedicated to reviewing our submission. In response to your valuable feedback, we have provided detailed clarifications to the questions raised and supplemented with important additional experiments.
As we are nearing the end of the discussion period, we would love to hear your thoughts on our response, including whether it sufficiently addresses your concerns. If you see potential for a score increase based on our revisions and discussions, we would greatly appreciate your consideration in this matter.
We are committed to including all your suggestions in our revision to enhance the quality of our manuscript. We hope we have addressed all your concerns and look forward to your further comments and discussions.
Best regards,
Authors
Replying to Seeking your valuable feedback on our response
Response to the authors
Dear Authors.
Thank you for the details and clarifications provided. I am sorry for the overlooked description of Table 1 in the paper. I have adjusted my rating based on your answers. Considering the limited novelty, I would like to raise my OA score to
Replying to Response to the authors
Thank you for the responsive replies and supportive feedback.
Dear reviewer,
We are deeply grateful for your supportive feedback and revised score. It is encouraging to learn that our efforts to address your concerns were well-received.
We have noted your reservations regarding the novelty of our paper. Allow us to clarify this point further.
As a short paper, our focus is not on proposing a novel method, but on understanding the reasons behind the editing collapse of the cutting-edge method ROME and effectively addressing them.
Thank you once again for your responsive replies. We are dedicated to incorporating all your suggestions and our supplementary experimental results in the revised version to enhance the quality of our manuscript.
If you see potential for a score increase based on our revisions and discussions, we would greatly appreciate your consideration in this matter. If you're open to it, we'd be deeply appreciative of any further feedback and questions.
Best regards,
Authors
Official Review of Submission368 by Reviewer #3
The paper, "The Fall of ROME: Understanding the Collapse of LLMs in Model Editing," investigates the causes behind the collapse of large language models (LLMs) when subjected to Rank-One Model Editing (ROME). The study identifies two primary factors contributing to this collapse: inconsistent handling of prefixed and unprefixed keys, and the unique distribution of the first token's representation. The authors propose a solution involving consistent use of prefixed keys and validate its effectiveness experimentally.
- Insightful Analysis: The paper provides a comprehensive analysis of the causes behind LLM collapse in the context of model editing.
- Experimental Validation: Robust experimental design that validates the proposed solution across different models and datasets.
- Incomplete Analysis: The specific characteristics of collapse cases in Llama2-7b are not fully explored.
- Single-Edit Focus: Concentrates on single edits without addressing the cumulative effects of sequential edits, which is a significant limitation.
N/A
Response to reviewer #3
Dear reviewer #3:
We sincerely appreciate your supportive score and your acknowledgement of our comprehensive analysis of the causes behind LLM collapse and robust experimental design.
Weakness 1: characteristics of collapse cases in Llama2-7b are not fully explored
We want to clarify that the collapse observed on Llama2-7b is a rare (21 out of 21,919) and isolated phenomenon. Mistral-7b and Llama3-8b (the successor to Llama2-7b) remain stable and do not experience any collapse when editing such samples, as shown in the table below. (The maximum perplexity of the edited models remains low.)
Model | Mistral-7b | Llama3-8b |
---|---|---|
Max_ppl | 52.19 | 43.98 |
Since the collapse observed in Llama2-7b does not exist in subsequent models, our focus in this short paper is on more general patterns, specifically the issues present in models like GPT-J, rather than the isolated phenomenon of Llama2-7b.
Weakness 2: without addressing the cumulative effects of sequential edits
We acknowledge that the cumulative effect of sequential edits is a topic worthy of further investigation.
- However, model collapse caused by a single edit is a more serious issue that urgently needs to be investigated and addressed.
- Due to space constraints of this short paper, our primary focus is on investigating and addressing model collapse caused by single edits. The analysis of collapse resulting from sequential edits will be explored in our future work.
We hope we have addressed all your concerns. If things are clearer now, we kindly request you to raise the score. Thanks!
Seeking your valuable feedback on our response
Dear Reviewer,
We hope this message finds you well.
We sincerely appreciate the time and effort you've dedicated to reviewing our submission. In response to your valuable feedback, we have provided detailed clarifications to the questions raised and supplemented with important additional experiments.
As the discussion period draws to a close, we would love to hear your thoughts on our response, including whether it adequately addresses your concerns. If you see potential for a score increase based on our revisions and discussions, we would greatly appreciate your consideration in this matter.
We hope we have addressed all your concerns and please feel free to reach out with any further questions.
Best regards,
Authors
Replying to Seeking your valuable feedback on our response
Official Comment by Reviewer #3
Dear Authors.
Thank you for the details and clarifications provided.