Multimodal Continual Instruction Tuning aims to continually enhance Large Vision Language Models (LVLMs) by learning from new data without forgetting previously acquired knowledge. Mixture of Experts (MoE) architectures naturally facilitate this by incrementally adding new experts and expanding routers while keeping the existing ones frozen.
However, despite expert isolation, MoE-based continual learners still suffer from forgetting due to routing-drift: old-task tokens are mistakenly attracted to newly added experts, degrading performance on prior tasks. We analyze this failure mode at the token level and reveal the token's dilemma: ambiguous and old tokens in new-task data offer minimal learning benefit, yet, because their routing assignment is unstable during training, they induce forgetting when routed to new experts.
Motivated by this, we propose LLaVA-DyMoE, a dynamic MoE framework that incrementally expands the MoE with drift-aware token assignment. We characterize token types via their routing score distributions and apply targeted regularization. Specifically, a token-level assignment guidance steers ambiguous and old tokens away from new experts to preserve established routing patterns and alleviate routing-drift, while complementary routing score regularizations enforce expert-group separation and promote new-expert specialization.
Extensive experiments demonstrate that our LLaVA-DyMoE effectively mitigates routing-drift-induced forgetting, achieving over a 7% gain in mean final accuracy and a 12% reduction in forgetting compared to baselines.
The post-training landscape of LVLMs. Large Vision-Language Models (LVLMs) follow a two-phase development pipeline: large-scale pre-training for vision-language alignment, followed by instruction tuning (SFT) to adapt the model to downstream tasks and improve instruction-following ability — optionally refined further by RLHF or DPO for preference alignment. However, these pipelines assume a static world: train once, deploy forever. In reality, new tasks and domains emerge continuously, and full retraining is prohibitively expensive. This gap motivates Multimodal Continual Instruction Tuning (MCIT): incrementally instruction-tuning LVLMs on sequential tasks without forgetting previously acquired knowledge.
MoE-LoRA for continual learning. Mixture-of-Experts (MoE) architectures are a natural fit for MCIT. MoE replaces a dense FFN with multiple expert sub-networks and a learned router that assigns each input token to its top-K experts via sparse, normalized routing weights — enabling efficient, modular representations. In our setting, each expert is a lightweight LoRA module. The key insight behind MoE-based CL is that parameter isolation enables knowledge isolation: by dedicating separate experts to separate tasks, catastrophic overwriting can be reduced.
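To make the MoE-LoRA structure concrete, the following is a minimal sketch of sparse top-K routing over LoRA experts; the class names, rank, and routing details are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAExpert(nn.Module):
    """A lightweight low-rank adapter acting as one MoE expert."""
    def __init__(self, d_model, rank=8):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)
        self.up = nn.Linear(rank, d_model, bias=False)
        nn.init.zeros_(self.up.weight)  # standard LoRA init: residual starts at zero

    def forward(self, x):
        return self.up(self.down(x))

class MoELoRA(nn.Module):
    """Sparse top-K routing over LoRA experts, added alongside a frozen FFN."""
    def __init__(self, d_model, num_experts=4, top_k=2, rank=8):
        super().__init__()
        self.experts = nn.ModuleList(LoRAExpert(d_model, rank) for _ in range(num_experts))
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, d_model)
        scores = self.router(x)                         # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # keep top-K experts per token
        weights = F.softmax(weights, dim=-1)            # normalize over selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out
```

The per-token routing scores produced by `self.router` are the quantities the later analysis categorizes tokens by.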
Incremental expansion: freeze old, train new. For each arriving task, we add a fresh set of LoRA experts and expand the router with new output dimensions, while freezing all previously learned experts and router parameters. New-task tokens may access both frozen old experts and the new trainable experts, allowing reuse of prior knowledge. Despite this isolation, forgetting still occurs — raising the key question this work addresses: why does freezing old experts not prevent forgetting?
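The freeze-old/train-new expansion step can be sketched as follows. This is a simplified, self-contained illustration (the helper names are assumptions): the router grows by one row per new expert, old rows are copied verbatim, and a gradient hook zeroes their updates so only the new rows train.

```python
import torch
import torch.nn as nn

class DyMoERouter(nn.Module):
    """Minimal stand-in for the MoE router: one Linear whose rows score the experts."""
    def __init__(self, d_model, num_experts):
        super().__init__()
        self.proj = nn.Linear(d_model, num_experts, bias=False)

def expand_router(router, num_new_experts):
    """Grow the router with rows for newly added experts while freezing old rows."""
    old = router.proj
    grown = nn.Linear(old.in_features, old.out_features + num_new_experts, bias=False)
    with torch.no_grad():
        grown.weight[: old.out_features].copy_(old.weight)  # keep old routing intact
    mask = torch.ones_like(grown.weight)
    mask[: old.out_features] = 0.0  # zero out gradients for old-expert rows
    grown.weight.register_hook(lambda g: g * mask)
    router.proj = grown
    return router
```

Even under this scheme the new rows shift the softmax normalization over all experts, which is exactly how routing-drift arises despite the frozen parameters.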
Although DyMoE keeps old experts frozen, catastrophic forgetting still occurs due to routing-drift: the newly added router parameters, trained on new-task tokens, distort the routing probabilities for old experts, causing old-task tokens to be misrouted to new experts at inference time.
To understand why this happens, we conduct a controlled two-task experiment: we categorize each new-task token by its routing score relative to the old and new expert groups, and test three targeted masking strategies during new-task training.
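The token categorization used in this analysis can be sketched as follows; the softmax-mass comparison and the `margin` threshold are assumptions for illustration, not the paper's exact criterion.

```python
import torch

def categorize_tokens(scores, num_old, margin=0.1):
    """Label each token by which expert group its routing mass favors.

    scores: (tokens, num_experts) routing logits; the first `num_old` columns
    belong to the frozen old experts. Tokens whose old-group and new-group
    probability masses differ by less than `margin` are labeled ambiguous.
    Returns a LongTensor with 0 = old, 1 = new, 2 = ambiguous.
    """
    probs = torch.softmax(scores, dim=-1)
    old_mass = probs[:, :num_old].sum(dim=-1)
    new_mass = probs[:, num_old:].sum(dim=-1)
    labels = torch.full((scores.size(0),), 2, dtype=torch.long)
    labels[old_mass - new_mass > margin] = 0  # clearly prefers old experts
    labels[new_mass - old_mass > margin] = 1  # clearly prefers new experts
    return labels
```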
This is the token's dilemma: both ambiguous and old tokens offer minimal benefit for new-task learning, yet when routed to new experts, they inadvertently train the new router to attract old-task patterns — causing old-task tokens to be mis-routed at inference time and inducing forgetting. Ambiguous tokens are especially challenging, as their balanced affinity for both expert groups makes them difficult to identify and prone to unstable routing. The key insight is that selective, token-type-aware regularization is needed — exposing the link between the plasticity–stability dilemma in CL and the inherent assignment ambiguity at the token level.
Motivated by the analysis above, LLaVA-DyMoE mitigates forgetting in dynamic MoE expansion through a two-fold regularization comprising Token Assignment Guidance (TAG) and Routing Score Regularization (RSR). TAG identifies token types from their routing scores and guides their assignment by adjusting routing scores during training, directly tackling the token's dilemma. As a complementary soft regularization, RSR encourages exclusive token-to-group routing and promotes new-expert specialization on genuinely new-task tokens.
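A minimal sketch of how TAG-style guidance could adjust routing scores is shown below; the logit `penalty` and its value are assumed hyperparameters for illustration, not the paper's formulation.

```python
import torch

def apply_tag(scores, labels, num_old, penalty=2.0):
    """Token Assignment Guidance sketch: suppress new-expert logits for tokens
    labeled old (0) or ambiguous (2), steering top-K selection back to old experts.

    scores: (tokens, num_experts) routing logits; first `num_old` columns are
    old experts. `labels` follows the 0/1/2 = old/new/ambiguous convention.
    """
    guided = scores.clone()
    steer = (labels == 0) | (labels == 2)   # tokens to keep away from new experts
    guided[steer, num_old:] -= penalty      # demote new-expert logits for them
    return guided
```

Tokens labeled as genuinely new (label 1) pass through unchanged, so plasticity on new-task content is preserved.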
TAG and RSR (ℒ_exc + ℒ_spe) together form a two-fold regularization framework that balances stability (retaining old knowledge) and plasticity (acquiring new knowledge). The method regularizes the router behavior during training and imposes no constraints at inference time. It is inherently orthogonal to and compatible with other MCIT paradigms, including data-based methods (replay, ASD) and task-specific routing approaches, and can be combined with them for further performance gains.
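One plausible form of the two RSR terms is sketched below; these specific loss expressions are assumptions chosen to match the stated goals (exclusive group routing, new-expert specialization), not the paper's exact definitions.

```python
import torch

def rsr_losses(scores, labels, num_old):
    """Routing Score Regularization sketch with two terms.

    L_exc: discourage split routing by pushing each token's mass toward one group.
    L_spe: for tokens labeled new (1), push routing mass onto the new-expert group.
    """
    probs = torch.softmax(scores, dim=-1)
    old_mass = probs[:, :num_old].sum(dim=-1)
    new_mass = probs[:, num_old:].sum(dim=-1)
    # Exclusive routing: the dominant group's mass should approach 1.
    l_exc = (1.0 - torch.maximum(old_mass, new_mass)).mean()
    # Specialization: genuinely new tokens should route to new experts.
    is_new = labels == 1
    l_spe = (1.0 - new_mass[is_new]).mean() if is_new.any() else scores.new_zeros(())
    return l_exc, l_spe
```

Both terms act only on routing scores during training, consistent with the claim that no constraints are imposed at inference time.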
| Method | SQA | TextVQA | ImgNet | GQA | VizWiz | REF | VQAv2 | OCR-VQA | MFN↑ | MAA↑ | BWT↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LoRA | 52.56 | 48.12 | 39.27 | 44.47 | 37.46 | 1.22 | 56.10 | 55.11 | 41.79 | 43.99 | -23.12 |
| MoELoRA | 72.01 | 46.89 | 44.75 | 42.79 | 28.22 | 3.31 | 55.74 | 57.72 | 43.93 | 43.92 | -22.18 |
| EWC | 59.11 | 47.21 | 39.88 | 45.12 | 35.33 | 2.72 | 56.29 | 41.21 | 40.86 | 43.75 | -21.76 |
| LWF | 62.32 | 48.66 | 51.45 | 45.84 | 43.76 | 0.24 | 54.96 | 44.63 | 43.98 | 44.89 | -19.69 |
| IncLoRA | 73.33 | 44.32 | 54.59 | 44.07 | 25.93 | 4.49 | 54.91 | 58.55 | 45.02 | 43.12 | -23.21 |
| O-LoRA | 75.61 | 49.98 | 78.24 | 44.18 | 30.70 | 4.66 | 55.51 | 57.37 | 49.53 | 46.65 | -17.54 |
| IncMoELoRA | 68.43 | 50.31 | 68.42 | 47.97 | 39.46 | 4.56 | 57.31 | 60.95 | 49.68 | 49.50 | -16.67 |
| **LLaVA-DyMoE (Ours)** | **76.25** | **53.86** | **95.80** | **48.40** | **52.35** | **9.25** | **58.30** | **62.00** | **57.03** | **57.70** | **-4.67** |
MFN: Mean Final accuracy (after the last task), MAA: Mean Average Accuracy, BWT: Backward Transfer. Higher is better for all metrics. Best results in bold.
@inproceedings{zhao2026dymoe,
author = {Zhao, Chongyang and Li, Mingsong and Lu, Haodong and Gong, Dong},
title = {On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment
for Continual Learning of Large Vision Language Models},
booktitle = {CVPR},
year = {2026},
}