✨ TL;DR
BAR (Branch-Adapt-Route) trains separate domain experts independently and combines them via Mixture-of-Experts, enabling modular updates to language models without retraining everything or degrading existing capabilities. This approach matches monolithic retraining performance while scaling linearly instead of quadratically when adding new domains.
Extending post-trained language models with new capabilities faces a fundamental trade-off: retraining from scratch on all domains together is computationally expensive and scales poorly (cumulative cost grows quadratically as domains are added), while continued training on new domains often causes catastrophic forgetting that degrades existing capabilities. Monolithic training paradigms require reprocessing all data whenever any single domain is updated, making iterative development impractical at scale. This creates a significant barrier to efficiently maintaining and extending large language models as new domain requirements emerge.
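The scaling gap can be made concrete with a back-of-envelope sketch. Assuming (for illustration only) that each domain contributes one unit of training data, monolithic retraining reprocesses all accumulated data at every release, while a modular approach trains only the new domain's expert:

```python
# Toy cost model (illustrative assumption: one unit of data per domain).

def monolithic_cost(n_domains: int) -> int:
    """Release k retrains on all k domains: total cost 1 + 2 + ... + n = O(n^2)."""
    return sum(k for k in range(1, n_domains + 1))

def modular_cost(n_domains: int) -> int:
    """Each release trains one new expert on its own data only: O(n)."""
    return n_domains

for n in (4, 8, 16):
    print(f"{n} domains: monolithic={monolithic_cost(n)}, modular={modular_cost(n)}")
```

For four domains this already means 10 units of reprocessing under monolithic retraining versus 4 under the modular scheme, and the gap widens quadratically from there.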
BAR trains independent domain experts, each undergoing its own complete post-training pipeline of mid-training, supervised finetuning, and reinforcement learning. These separately trained experts are then composed in a Mixture-of-Experts (MoE) architecture, with lightweight router training to direct inputs to the appropriate experts. The modular design allows individual experts to be updated or added independently without affecting other domains. The authors evaluate the approach at 7B scale with four domain experts (math, code, tool use, and safety), comparing against monolithic retraining baselines both with and without mid-training.
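The composition step can be sketched as follows. This is a minimal, illustrative MoE forward pass, not BAR's actual architecture: the "experts" are stand-in frozen linear maps, and the router is a single linear layer whose weights would normally be the only trained component.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_EXPERTS = 8, 4  # four domains: math, code, tool use, safety

# Each "expert" is a frozen linear map standing in for a fully post-trained model.
experts = [rng.normal(size=(DIM, DIM)) for _ in range(N_EXPERTS)]

# Lightweight router: one linear layer scoring experts per input (the only
# part that would be trained during composition).
router_w = rng.normal(size=(DIM, N_EXPERTS))

def softmax(x: np.ndarray) -> np.ndarray:
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def moe_forward(x: np.ndarray, top_k: int = 1) -> np.ndarray:
    """Route each input to its top-k experts and mix outputs by router weight."""
    probs = softmax(x @ router_w)                 # (batch, N_EXPERTS)
    top = np.argsort(-probs, axis=-1)[:, :top_k]  # indices of chosen experts
    out = np.zeros_like(x)
    for b in range(x.shape[0]):
        chosen = top[b]
        w = probs[b, chosen] / probs[b, chosen].sum()  # renormalize over top-k
        for wi, e in zip(w, chosen):
            out[b] += wi * (x[b] @ experts[e])
    return out

x = rng.normal(size=(2, DIM))
y = moe_forward(x, top_k=1)
print(y.shape)
```

Adding a new domain under this scheme means appending one more frozen expert and retraining only the small router, which is what makes the per-domain update cost flat.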