✨ TL;DR
BAR (Branch-Adapt-Route) trains separate domain experts independently and combines them via Mixture-of-Experts, enabling modular updates to language models without retraining everything or degrading existing capabilities. This approach matches monolithic retraining performance while scaling linearly instead of quadratically when adding new domains.
Extending post-trained language models with new capabilities faces a fundamental trade-off: retraining from scratch on all domains together is computationally expensive and scales poorly (cumulative cost grows quadratically as domains are added), while continued training on new domains often causes catastrophic forgetting that degrades existing capabilities. Monolithic training paradigms require reprocessing all data whenever any single domain is updated, making iterative development impractical at scale. This creates a significant barrier to efficiently maintaining and extending large language models as new domain requirements emerge.
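The scaling gap can be made concrete with a back-of-envelope sketch. Assuming (for illustration only) that each domain contributes one unit of training data, monolithic retraining reprocesses all accumulated data at every release, while a modular approach trains only the new domain's expert:

```python
# Toy cost model (illustrative assumption: one unit of data per domain).

def monolithic_cost(n_domains: int) -> int:
    """Release k retrains on all k domains: total cost 1 + 2 + ... + n = O(n^2)."""
    return sum(k for k in range(1, n_domains + 1))

def modular_cost(n_domains: int) -> int:
    """Each release trains one new expert on its own data only: O(n)."""
    return n_domains

for n in (4, 8, 16):
    print(f"{n} domains: monolithic={monolithic_cost(n)}, modular={modular_cost(n)}")
```

For four domains this already means 10 units of reprocessing under monolithic retraining versus 4 under the modular scheme, and the gap widens quadratically from there.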
BAR trains independent domain experts, each undergoing its own complete post-training pipeline of mid-training, supervised finetuning, and reinforcement learning. These separately trained experts are then composed in a Mixture-of-Experts (MoE) architecture, with lightweight router training to direct inputs to the appropriate experts. The modular design allows individual experts to be updated or added independently without affecting other domains. The authors evaluate the approach at 7B scale with four domain experts (math, code, tool use, and safety), comparing against monolithic retraining baselines both with and without mid-training.
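The composition step can be sketched as follows. This is a minimal, illustrative MoE forward pass, not BAR's actual architecture: the "experts" are stand-in frozen linear maps, and the router is a single linear layer whose weights would normally be the only trained component.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_EXPERTS = 8, 4  # four domains: math, code, tool use, safety

# Each "expert" is a frozen linear map standing in for a fully post-trained model.
experts = [rng.normal(size=(DIM, DIM)) for _ in range(N_EXPERTS)]

# Lightweight router: one linear layer scoring experts per input (the only
# part that would be trained during composition).
router_w = rng.normal(size=(DIM, N_EXPERTS))

def softmax(x: np.ndarray) -> np.ndarray:
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def moe_forward(x: np.ndarray, top_k: int = 1) -> np.ndarray:
    """Route each input to its top-k experts and mix outputs by router weight."""
    probs = softmax(x @ router_w)                 # (batch, N_EXPERTS)
    top = np.argsort(-probs, axis=-1)[:, :top_k]  # indices of chosen experts
    out = np.zeros_like(x)
    for b in range(x.shape[0]):
        chosen = top[b]
        w = probs[b, chosen] / probs[b, chosen].sum()  # renormalize over top-k
        for wi, e in zip(w, chosen):
            out[b] += wi * (x[b] @ experts[e])
    return out

x = rng.normal(size=(2, DIM))
y = moe_forward(x, top_k=1)
print(y.shape)
```

Adding a new domain under this scheme means appending one more frozen expert and retraining only the small router, which is what makes the per-domain update cost flat.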