Tying the Loop: Tied Expert Layers in Mixture-of-Experts Language Models

Reference implementation and experiments for Expert Tying: sharing expert FFN weights across consecutive transformer layers while keeping routers, attention, and normalization layer-specific. Tying experts in groups of g reduces the number of expert parameters by a factor of g.
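
As a rough illustration of the idea, the sketch below aliases one expert block across each group of g consecutive layers while giving every layer its own router, then counts expert parameters once per distinct block. It is a toy model rather than the code in `model.py`: `ExpertBlock`, `Layer`, `build_layers`, and `unique_expert_params` are hypothetical names, and it assumes every layer is tied and that g divides the depth.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ExpertBlock:
    """Stand-in for the expert FFN weights of one MoE layer."""
    num_params: int


@dataclass
class Layer:
    index: int
    experts: ExpertBlock  # shared within a tying group
    router_id: int        # routers stay layer-specific


def build_layers(num_layers: int, g: int, params_per_block: int) -> list[Layer]:
    """Create layers whose expert blocks are aliased in consecutive groups of g."""
    if g < 1 or num_layers % g != 0:
        raise ValueError("g must be a positive divisor of num_layers")
    layers: list[Layer] = []
    block = None
    for i in range(num_layers):
        if i % g == 0:  # the first layer of each group allocates the shared block
            block = ExpertBlock(params_per_block)
        layers.append(Layer(index=i, experts=block, router_id=i))
    return layers


def unique_expert_params(layers: list[Layer]) -> int:
    """Count expert parameters once per distinct (aliased) block."""
    distinct = {id(layer.experts): layer.experts for layer in layers}
    return sum(block.num_params for block in distinct.values())


untied = build_layers(num_layers=32, g=1, params_per_block=1_000)
tied = build_layers(num_layers=32, g=4, params_per_block=1_000)
print(unique_expert_params(untied) // unique_expert_params(tied))  # 4
```

Layers 0 to 3 of `tied` hold the same `ExpertBlock` object, so an update to that block is seen by all four, while each layer keeps a distinct `router_id`.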

This repository reproduces the paper's experiments across three MoE architectures (OLMoE, Qwen3-MoE, DeepSeekMoE), along with the controlled component ablation.

Repository contents

File	Role
`train.py`	Component ablation (paper Section 3): vanilla depth-32 transformer, tying modes and topologies.
`model.py`	Expert-tying variants: expert-tensor aliasing and LR handling.
`moe_train.py`	Main experiments (paper Section 4): production OLMoE / Qwen3-MoE / DeepSeekMoE.
`data.py`	Streaming loader for the 75:25 DCLM-edu / FinePhrase mixture.
`eval_downstream.py`	3-shot downstream accuracy via `lm-evaluation-harness`.
`submit_ablation_runs.sh`	Launches the full Section 3 ablation grid (43 runs).
`QUICKSTART.md`	Training commands for Section 4 architectures and configurations.

Installation

git clone https://github.com/epfml/looped-moe.git
cd looped-moe
pip install -r requirements.txt

The ablation study uses plain PyTorch, whereas the main experiments use the Hugging Face transformers reference implementations of OLMoE, Qwen3-MoE, and DeepSeekMoE; install exactly the versions pinned in requirements.txt. Training uses Muon for 2D hidden weights and AdamW for embeddings, the output head, norm gains, and routers.
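
The optimizer split described above can be pictured as a simple routing rule over parameter names and shapes. The sketch below is an assumption about how such a rule might look, not the repository's logic: the name fragments in `ADAMW_MARKERS`, the helper names, and the example parameter names are all hypothetical.

```python
from __future__ import annotations

# Hypothetical name fragments marking parameters that stay on AdamW.
ADAMW_MARKERS = ("embed", "lm_head", "router", "norm")


def pick_optimizer(name: str, shape: tuple[int, ...]) -> str:
    """Return "muon" for 2D hidden weights and "adamw" for everything else."""
    if len(shape) != 2:
        return "adamw"  # gains and biases are not 2D matrices
    if any(marker in name for marker in ADAMW_MARKERS):
        return "adamw"  # embeddings, output head, and routers are 2D but excluded
    return "muon"


def partition(params: dict[str, tuple[int, ...]]) -> dict[str, list[str]]:
    """Group parameter names by the optimizer that would update them."""
    groups: dict[str, list[str]] = {"muon": [], "adamw": []}
    for name, shape in params.items():
        groups[pick_optimizer(name, shape)].append(name)
    return groups


# Illustrative parameter names and shapes only.
example = {
    "embed_tokens.weight": (50_000, 1024),
    "layers.0.attn.q_proj.weight": (1024, 1024),
    "layers.0.mlp.router.weight": (64, 1024),
    "layers.0.mlp.experts.0.up_proj.weight": (2048, 1024),
    "layers.0.input_norm.weight": (1024,),
    "lm_head.weight": (50_000, 1024),
}
print(partition(example)["muon"])
```

In a real training script the two name lists would be turned into separate parameter groups, one per optimizer.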

Reproducing the paper

Component ablation (Section 3)

The vanilla depth-32 ablation establishes which components can be tied. Each run takes ~3 hours on a single H100, so the full 43-run grid costs roughly 130 H100-hours. To launch the full grid (fine + coarse granularity, all topologies and tying modes, LR-divisor and dense controls), run:

bash submit_ablation_runs.sh

Main experiments (Section 4)

Production MoEs are trained at g=1 (untied baseline), g=2, and g=4, with optional width expansion, across all three architectures. The exact commands for every configuration are in QUICKSTART.md. The core flags of moe_train.py are --arch {olmoe, qwen3moe, deepseek}, --scale {regular, small, tiny}, --tie-group-size (1 = untied), --expand-tied-experts N (experts per tied middle layer), and --tied-lr-divisor (set to √g: 1.0 for g=1, 1.41 for g=2, 2.0 for g=4).
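
To make the √g rule concrete, the sketch below computes the divisor for a given group size and assembles an argument list from the flags named above. The helpers `tied_lr_divisor` and `build_command` are hypothetical, and the resulting command may omit flags that a real run needs; QUICKSTART.md remains the authoritative source for exact commands.

```python
import math

ARCHS = ("olmoe", "qwen3moe", "deepseek")
SCALES = ("regular", "small", "tiny")


def tied_lr_divisor(g: int) -> float:
    """Return sqrt(g), rounded to two decimals as in the values listed above."""
    if g < 1:
        raise ValueError("tie group size must be >= 1")
    return round(math.sqrt(g), 2)


def build_command(arch: str, scale: str, g: int) -> list:
    """Assemble an argument list using only the flags documented in this README."""
    if arch not in ARCHS or scale not in SCALES:
        raise ValueError("unknown --arch or --scale value")
    return [
        "python", "moe_train.py",
        "--arch", arch,
        "--scale", scale,
        "--tie-group-size", str(g),
        "--tied-lr-divisor", str(tied_lr_divisor(g)),
    ]


for g in (1, 2, 4):
    print(g, tied_lr_divisor(g))  # 1 1.0, then 2 1.41, then 4 2.0

print(" ".join(build_command("olmoe", "small", 2)))
```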

Downstream evaluation

python eval_downstream.py --checkpoint <path-to-checkpoint>

Reports macro-average 3-shot accuracy on ARC-Easy, ARC-Challenge, HellaSwag, PIQA, WinoGrande, and OpenBookQA.
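
A macro-average weights every task equally regardless of how many examples it contains. The sketch below shows that computation; the `macro_average` helper is hypothetical, the dictionary keys are the task names as written above rather than harness task identifiers, and the accuracy values are placeholders, not results from the paper.

```python
TASKS = (
    "ARC-Easy",
    "ARC-Challenge",
    "HellaSwag",
    "PIQA",
    "WinoGrande",
    "OpenBookQA",
)


def macro_average(per_task_accuracy: dict) -> float:
    """Unweighted mean of per-task accuracies over the six tasks."""
    missing = [task for task in TASKS if task not in per_task_accuracy]
    if missing:
        raise KeyError(f"missing accuracies for: {missing}")
    return sum(per_task_accuracy[task] for task in TASKS) / len(TASKS)


# Placeholder accuracies for illustration only.
placeholder = {
    "ARC-Easy": 0.75,
    "ARC-Challenge": 0.25,
    "HellaSwag": 0.75,
    "PIQA": 0.25,
    "WinoGrande": 0.75,
    "OpenBookQA": 0.25,
}
print(macro_average(placeholder))  # 0.5
```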

Citation

@article{jaggi2026tying,
  title   = {Tying the Loop: Tied Expert Layers in Mixture-of-Experts Language Models},
  author  = {Martin Jaggi},
  journal = {arXiv preprint arXiv:2606.16825},
  year    = {2026}
}
