On Token's Dilemma:
Dynamic MoE with Drift-Aware Token Assignment
for Continual Learning of Large Vision Language Models

The University of New South Wales (UNSW Sydney)
{chongyang.zhao, dong.gong}@unsw.edu.au
*Corresponding author
CVPR 2026

Abstract

Multimodal Continual Instruction Tuning aims to continually enhance Large Vision Language Models (LVLMs) by learning from new data without forgetting previously acquired knowledge. Mixture of Experts (MoE) architectures naturally facilitate this by incrementally adding new experts and expanding routers while keeping the existing ones frozen.

However, despite expert isolation, MoE-based continual learners still suffer from forgetting due to routing-drift: old-task tokens become mistakenly attracted to newly added experts, degrading performance on prior tasks. We analyze the failure mode at the token level and reveal the token's dilemma: ambiguous and old tokens in new-task data offer minimal learning benefit yet induce forgetting when routed to new experts, due to their ambiguous routing assignment during training.

Motivated by this, we propose LLaVA-DyMoE, a dynamic MoE framework that incrementally expands the MoE with drift-aware token assignment. We characterize token types via their routing score distributions and apply targeted regularization. Specifically, a token-level assignment guidance steers ambiguous and old tokens away from new experts to preserve established routing patterns and alleviate routing-drift, while complementary routing score regularizations enforce expert-group separation and promote new-expert specialization.

Extensive experiments demonstrate that our LLaVA-DyMoE effectively mitigates routing-drift-induced forgetting, achieving over a 7% gain in mean final accuracy and a 12% reduction in forgetting compared to baselines.

Background

The post-training landscape of LVLMs. Large Vision-Language Models (LVLMs) follow a two-phase development pipeline: large-scale pre-training for vision-language alignment, followed by instruction tuning (SFT) to adapt the model to downstream tasks and improve instruction-following ability — optionally refined further by RLHF or DPO for preference alignment. However, these pipelines assume a static world: train once, deploy forever. In reality, new tasks and domains emerge continuously, and full retraining is prohibitively expensive. This gap motivates Multimodal Continual Instruction Tuning (MCIT): incrementally instruction-tuning LVLMs on sequential tasks without forgetting previously acquired knowledge.

MoE-LoRA for continual learning. Mixture-of-Experts (MoE) architectures are a natural fit for MCIT. MoE replaces a dense FFN with multiple expert sub-networks and a learned router that assigns each input token to its top-K experts via sparse, normalized routing weights — enabling efficient, modular representations. In our setting, each expert is a lightweight LoRA module. The key insight behind MoE-based CL is that parameter isolation enables knowledge isolation: by dedicating separate experts to separate tasks, catastrophic overwriting can be reduced.
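The routing mechanics described above can be sketched in a few lines. This is a minimal, illustrative single-token implementation, not the paper's code: the names `moe_lora_forward` and `lora_expert` are our own, experts are plain low-rank matrix pairs, and the router is a simple linear scoring layer with top-K softmax normalization.

```python
import numpy as np

def lora_expert(x, A, B):
    # A LoRA expert applies a low-rank update: x -> x A^T B^T (rank-r bottleneck).
    return x @ A.T @ B.T

def moe_lora_forward(x, W_router, experts, top_k=2):
    """Sparse MoE-LoRA forward for a single token x of shape (d,).

    W_router: (num_experts, d) router weights.
    experts:  list of (A, B) LoRA pairs, A: (rank, d), B: (d, rank).
    """
    scores = W_router @ x                         # one routing logit per expert
    top = np.argsort(scores)[-top_k:]             # indices of the top-K experts
    w = np.exp(scores[top] - scores[top].max())
    w = w / w.sum()                               # softmax over selected experts only
    out = sum(wi * lora_expert(x, *experts[i]) for wi, i in zip(w, top))
    return out, scores
```

The sparse top-K step is what makes the representation modular: each token touches only a few experts, so later experts can be added without disturbing the computation paths of earlier ones.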

Incremental expansion: freeze old, train new. For each arriving task, we add a fresh set of LoRA experts and expand the router with new output dimensions, while freezing all previously learned experts and router parameters. New-task tokens may access both frozen old experts and the new trainable experts, allowing reuse of prior knowledge. Despite this isolation, forgetting still occurs — raising the key question this work addresses: why does freezing old experts not prevent forgetting?
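The freeze-old/train-new expansion step can be sketched as follows. This is a schematic under our own naming (`expand_moe`, the `trainable` bookkeeping dict); the LoRA init convention shown (A small random, B zero, so new experts start as a no-op) is the standard one, but the paper's exact initialization may differ.

```python
import numpy as np

def expand_moe(W_router, experts, num_new, d, rank, rng):
    """Grow the MoE for a new task: append num_new router rows and LoRA experts.

    Existing router rows and experts are kept unchanged and treated as frozen;
    only the entries listed in `trainable` would receive gradients.
    """
    new_rows = 0.01 * rng.standard_normal((num_new, d))    # new router output dims
    W_router = np.concatenate([W_router, new_rows], axis=0)
    new_experts = [
        (0.01 * rng.standard_normal((rank, d)),  # LoRA A: small random init
         np.zeros((d, rank)))                    # LoRA B: zero init -> new expert starts as a no-op
        for _ in range(num_new)
    ]
    experts = experts + new_experts
    trainable = {
        "router_rows": list(range(len(experts) - num_new, len(experts))),
        "experts": new_experts,
    }
    return W_router, experts, trainable
```

Because the router still scores all experts, new-task tokens can reuse frozen old experts; the drift risk comes entirely from how the new rows learn to compete with the old ones.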

Token's Dilemma: Analyzing Routing-Drift

Controlled experiment showing roles of new, old, and ambiguous tokens in routing-drift

Figure 1. Routing-drift analysis in a controlled two-task learning experiment. After learning on the 1st task (SQA), we conduct the 2nd task (TextVQA) training using the baseline (default training) and three different token masking strategies based on token type. Throughout the training stages, we evaluate forgetting (decrease in task 1 accuracy) and new-task learning (improvement of task 2 accuracy). (a) Retaining only new-affinity tokens yields strong new-task performance with minimal forgetting. (b) Masking out old-affinity tokens yields similar performance to the baseline — they do not contribute to new-task learning. (c) Retaining only ambiguous tokens neither improves new-task acquisition nor preserves old-task performance — confirming their minimal learning value and direct forgetting risk.

Although DyMoE keeps old experts frozen, catastrophic forgetting still occurs due to routing-drift: the newly added router parameters, trained on new-task tokens, distort the routing probabilities for old experts, causing old-task tokens to be misrouted to new experts at inference time.

To understand why this happens, we conduct a controlled two-task experiment. We categorize each new-task token by its routing score relative to the old and new expert groups, and test three targeted masking strategies during new-task training:

Obs. 1 — Only new tokens (high affinity to new router): Training only on these tokens yields strong new-task performance with substantially less forgetting than the baseline. New tokens carry patterns distinct from the old task and naturally route to new experts — they are beneficial and safe.
Obs. 2 — Mask out old tokens (high affinity to old router): Masking old tokens from accessing the newly added parameters yields new-task performance and forgetting similar to the baseline, suggesting they are best handled by the frozen old experts and do not contribute to new-task learning. Yet when an under-optimized router assigns them small but non-negligible weight toward new experts, these tokens inadvertently bias the new router toward old-task patterns, causing routing-drift despite their limited learning value.
Obs. 3 — Only ambiguous tokens (small affinity difference between old and new): Ambiguous tokens offer minimal new-task learning benefit while posing a direct forgetting risk. Identified by their small affinity difference between old and new expert groups, these tokens capture ambiguous patterns across tasks. Training solely on these tokens neither improves new-task acquisition nor preserves old-task performance, confirming their minimal learning value and direct forgetting risk.
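The three token types above can be separated with a simple rule on the routing scores. The sketch below follows the relative-ambiguity criterion used by the method (|c_new − c_old| normalized by the larger magnitude, thresholded at τ); the function name and the example threshold value are ours, not the paper's.

```python
import numpy as np

def categorize_token(scores, num_old, tau=0.3):
    """Classify a token as "new", "old", or "ambiguous" from its routing logits.

    scores:  (num_experts,) routing logits; the first num_old entries belong to
             the frozen old expert group.
    tau:     illustrative ambiguity threshold on the relative score gap.
    """
    c_old = scores[:num_old].max()          # confidence toward the old group
    c_new = scores[num_old:].max()          # confidence toward the new group
    d_rel = abs(c_new - c_old) / max(abs(c_new), abs(c_old), 1e-8)
    if d_rel <= tau:
        return "ambiguous"                  # balanced affinity: the risky case
    return "new" if c_new > c_old else "old"
```

Running the controlled experiment amounts to applying this classifier to every new-task token and masking the categories as in Obs. 1-3.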

This is the token's dilemma: both ambiguous and old tokens offer minimal benefit for new-task learning, yet when routed to new experts, they inadvertently train the new router to attract old-task patterns — causing old-task tokens to be mis-routed at inference time and inducing forgetting. Ambiguous tokens are especially challenging, as their balanced affinity for both expert groups makes them difficult to identify and prone to unstable routing. The key insight is that selective, token-type-aware regularization is needed — exposing the link between the plasticity–stability dilemma in CL and the inherent assignment ambiguity at the token level.

Method: LLaVA-DyMoE

LLaVA-DyMoE method overview

Figure 2. Overview of LLaVA-DyMoE. As each new task arrives, the router and experts expand, creating a frozen "old group" and a trainable "new group". Our Token Assignment Guidance (TAG) prevents routing-drift (red dashed arrow) by directing tokens to appropriate expert–router groups, complemented by our Routing Score Regularization (RSR) that encourages exclusive token-to-group routing and new-expert specialization. Our method regularizes the router behavior during training and imposes no constraints at inference, allowing seamless combination with other continual learning methods.

Motivated by the analysis above, LLaVA-DyMoE mitigates forgetting in dynamic MoE expansion through a two-fold regularization comprising Token Assignment Guidance (TAG) and Routing Score Regularization (RSR). TAG identifies token types from their routing scores and guides their assignment by adjusting routing scores during training, directly tackling the token's dilemma. As a complementary soft regularization, RSR encourages exclusive token-to-group routing and promotes new-expert specialization on genuinely new-task tokens.

Token Assignment Guidance (TAG)
  • Computes group-wise routing confidence: c_old (max score over old experts), c_new (max score over new experts)
  • Measures relative ambiguity D_rel = |c_new − c_old| / max(|c_new|, |c_old|); a token is ambiguous if D_rel ≤ τ
  • Allows new-expert routing only if the token is unambiguous (D_rel > τ) and new-dominant (c_new > c_old)
  • Old-dominant and ambiguous tokens are redirected to the frozen old experts — preventing routing corruption while suppressing old tokens' residual affinity for new experts
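The TAG rule above reduces to a conditional mask on the routing scores during training. A minimal sketch, assuming the function name `apply_tag` and the example τ value (both ours); blocking is done here by setting new-expert logits to −∞ so the subsequent top-K/softmax can only pick old experts.

```python
import numpy as np

def apply_tag(scores, num_old, tau=0.3):
    """Token Assignment Guidance: only unambiguous, new-dominant tokens may
    route to new experts; all others fall back to the frozen old group.
    """
    c_old = scores[:num_old].max()
    c_new = scores[num_old:].max()
    d_rel = abs(c_new - c_old) / max(abs(c_new), abs(c_old), 1e-8)
    guided = scores.copy()
    if not (d_rel > tau and c_new > c_old):
        guided[num_old:] = -np.inf   # block new experts for old/ambiguous tokens
    return guided
```

Since the mask is applied only during training, inference-time routing is left unconstrained, which is what makes the method composable with other continual learning techniques.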
Routing Score Regularization (RSR) — Exclusivity Loss (ℒ_exc)
  • Enforces clean routing separation by preventing a token from strongly activating both expert groups simultaneously
  • Minimizes the product ℒ_exc = g_old · g_new, encouraging exclusive routing to one group and reinforcing the conditional routing decision of TAG
Routing Score Regularization (RSR) — Specialization Loss (ℒ_spe)
  • Complements TAG and ℒ_exc for plasticity: encourages higher routing weight toward new experts when old experts are inactive
  • Defines soft target y = 1 − max_{i∈old} w_i; a BCE loss between g_new and y pushes up new-expert activation
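The two RSR terms can be written out per token. A sketch under stated assumptions: the group activations g_old and g_new are taken here as the summed normalized routing weight per group (one plausible choice; the paper's exact definition may differ), and `rsr_losses` is our own name.

```python
import numpy as np

def rsr_losses(w, num_old, eps=1e-8):
    """Routing Score Regularization for one token.

    w: (num_experts,) normalized routing weights (e.g. softmax over scores);
       the first num_old entries form the frozen old group.
    Returns (exclusivity loss, specialization loss).
    """
    g_old = w[:num_old].sum()
    g_new = w[num_old:].sum()
    l_exc = g_old * g_new                     # exclusivity: penalize co-activation
    y = 1.0 - w[:num_old].max()               # soft target: high when old experts idle
    p = np.clip(g_new, eps, 1.0 - eps)
    l_spe = -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))   # BCE(g_new, y)
    return l_exc, l_spe
```

ℒ_exc is minimized when one group's activation is near zero, while ℒ_spe pulls g_new up exactly when the old group is inactive, so the two terms push in compatible directions.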

TAG and RSR (ℒ_exc + ℒ_spe) together form a two-fold regularization framework that balances stability (retaining old knowledge) and plasticity (acquiring new knowledge). The method regularizes the router behavior during training and imposes no constraints at inference time. It is inherently orthogonal to and compatible with other MCIT paradigms, including data-based methods (replay, ASD) and task-specific routing approaches, and can be combined with them for further performance gains.

Results

Radar chart comparing LLaVA-DyMoE vs baselines

Figure 3. Per-task accuracy radar chart on the CoIN benchmark. LLaVA-DyMoE (ours) consistently outperforms all baselines across the 8 evaluation tasks.

Comparison on CoIN Benchmark

| Method | SQA | TextVQA | ImgNet | GQA | VizWiz | REF | VQAv2 | OCR-VQA | MFN↑ | MAA↑ | BWT↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LoRA | 52.56 | 48.12 | 39.27 | 44.47 | 37.46 | 1.22 | 56.10 | 55.11 | 41.79 | 43.99 | -23.12 |
| MoELoRA | 72.01 | 46.89 | 44.75 | 42.79 | 28.22 | 3.31 | 55.74 | 57.72 | 43.93 | 43.92 | -22.18 |
| EWC | 59.11 | 47.21 | 39.88 | 45.12 | 35.33 | 2.72 | 56.29 | 41.21 | 40.86 | 43.75 | -21.76 |
| LWF | 62.32 | 48.66 | 51.45 | 45.84 | 43.76 | 0.24 | 54.96 | 44.63 | 43.98 | 44.89 | -19.69 |
| IncLoRA | 73.33 | 44.32 | 54.59 | 44.07 | 25.93 | 4.49 | 54.91 | 58.55 | 45.02 | 43.12 | -23.21 |
| O-LoRA | 75.61 | 49.98 | 78.24 | 44.18 | 30.70 | 4.66 | 55.51 | 57.37 | 49.53 | 46.65 | -17.54 |
| IncMoELoRA | 68.43 | 50.31 | 68.42 | 47.97 | 39.46 | 4.56 | 57.31 | 60.95 | 49.68 | 49.50 | -16.67 |
| **LLaVA-DyMoE (Ours)** | **76.25** | **53.86** | **95.80** | **48.40** | **52.35** | **9.25** | **58.30** | **62.00** | **57.03** | **57.70** | **-4.67** |

MFN: Mean Final accuracy (evaluated after the last task), MAA: Mean Average Accuracy, BWT: Backward Transfer. Higher is better for all metrics; best results in bold.

BibTeX

@inproceedings{zhao2026dymoe,
  author    = {Zhao, Chongyang and Li, Mingsong and Lu, Haodong and Gong, Dong},
  title     = {On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment
               for Continual Learning of Large Vision Language Models},
  booktitle = {CVPR},
  year      = {2026},
}