Introducing NARE-1: Latent Fission Architecture
Today we're excited to announce NARE-1, our first production language model built on a novel architecture we call Latent Fission.
The Problem
Traditional language models process every token through the entire network, even when the task is simple. This wastes compute on easy predictions and underutilizes capacity on hard ones.
Our Solution: Latent Fission
NARE-1 dynamically routes computation based on semantic complexity:
- Early Exit: Simple tokens (such as punctuation and common words) exit after 12 layers instead of 28
- Expert Routing: Complex reasoning activates specialized LoRA experts (code, math, logic)
- Single Signal: The choice between exiting early and activating experts is driven by one metric, the entropy of the output distribution
Architecture Highlights
Base Model: Qwen2.5-1.5B (28 layers)
Router Layer: 12 (of 28, just before the midpoint of the network)
Experts: 3 specialized LoRA adapters (r=16)
Early Exit Threshold: entropy < 2.5
Top-K Experts: 2 active simultaneously
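To make these settings concrete, here is a minimal sketch of how they might be gathered into one configuration object. The class name, field names, and the validate helper are hypothetical and chosen for illustration; only the values come from the list above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LatentFissionConfig:
    """Illustrative container for the settings listed above (names are assumptions)."""
    base_model: str = "Qwen2.5-1.5B"
    num_layers: int = 28
    router_layer: int = 12
    expert_names: Tuple[str, ...] = ("code", "math", "logic")
    lora_rank: int = 16
    exit_entropy_threshold: float = 2.5
    top_k_experts: int = 2

    def validate(self) -> None:
        # The router must sit inside the network for early exit to save work.
        if not 0 < self.router_layer < self.num_layers:
            raise ValueError("router_layer must be between 1 and num_layers - 1")
        # Cannot activate more experts than exist.
        if not 0 < self.top_k_experts <= len(self.expert_names):
            raise ValueError("top_k_experts must be in 1..len(expert_names)")
        if self.lora_rank <= 0:
            raise ValueError("lora_rank must be positive")

    @property
    def layers_skipped_on_exit(self) -> int:
        # Layers that an early-exiting token does not run.
        return self.num_layers - self.router_layer


config = LatentFissionConfig()
config.validate()
```

With these values, a token that exits at the router layer skips 16 of the 28 layers.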
Performance
Compared to the base Qwen2.5-1.5B model:
- 2.8x faster on average workloads (early exit rate: 45%)
- Same quality on standard benchmarks (GSM8K, HumanEval, MMLU)
- Better on specialized tasks when experts activate (code: +12%, math: +8%)
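For intuition about how the early-exit rate relates to compute, the sketch below counts the average number of layers executed per token. It assumes every layer costs the same and ignores router, expert, and output-head overhead, so it is a back-of-the-envelope estimate and not the benchmark methodology; the function names are hypothetical. Under these assumptions a 45% exit rate at layer 12 gives about 20.8 layers per token, or roughly 1.35x from layer skipping alone, so layer skipping by itself does not account for the full measured 2.8x.

```python
def expected_layers_per_token(exit_rate: float, exit_layer: int, total_layers: int) -> float:
    """Average layers run per token when a fraction exit_rate stops at exit_layer."""
    if not 0.0 <= exit_rate <= 1.0:
        raise ValueError("exit_rate must be in [0, 1]")
    if not 0 < exit_layer <= total_layers:
        raise ValueError("exit_layer must be in 1..total_layers")
    return exit_rate * exit_layer + (1.0 - exit_rate) * total_layers


def layer_count_speedup(exit_rate: float, exit_layer: int, total_layers: int) -> float:
    """Speedup implied by layer count alone (assumes uniform per-layer cost)."""
    return total_layers / expected_layers_per_token(exit_rate, exit_layer, total_layers)


avg_layers = expected_layers_per_token(0.45, 12, 28)
speedup = layer_count_speedup(0.45, 12, 28)
print(f"{avg_layers:.1f} layers per token, {speedup:.2f}x from layer skipping alone")
```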
How It Works
- Entropy Measurement: After layer 12, we compute the Shannon entropy of the next-token distribution (the softmax of the logits)
- Early Exit Decision: If entropy < 2.5, we skip remaining layers and output the token
- Expert Activation: Otherwise (entropy of 2.5 or higher), a softmax gate scores the experts and we activate the top 2
- In-Place Fusion: Expert deltas are added to FFN outputs without copying activations
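The four steps above can be sketched in plain Python on toy lists. This is an illustration under stated assumptions, not the NARE-1 implementation: the function names are hypothetical, entropy is measured in nats (the unit of the 2.5 threshold is not stated), and the gate weights of the two selected experts are renormalized to sum to 1, which is one common choice and not confirmed by the description.

```python
import math


def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def shannon_entropy(probs):
    # Entropy in nats (assumption); zero-probability terms contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def top_k_gate(gate_logits, k=2):
    # Pick the k highest-scoring experts and renormalize their weights (assumption).
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def route(logits, gate_logits, threshold=2.5, k=2):
    # Steps 1-3: measure entropy, then either exit early or choose experts.
    entropy = shannon_entropy(softmax(logits))
    if entropy < threshold:
        return "exit", entropy, []
    return "experts", entropy, top_k_gate(gate_logits, k)


def fuse_in_place(ffn_out, expert_deltas, gates):
    # Step 4: add weighted expert deltas into ffn_out without allocating a copy.
    for expert_idx, weight in gates:
        delta = expert_deltas[expert_idx]
        for j in range(len(ffn_out)):
            ffn_out[j] += weight * delta[j]
    return ffn_out


# A sharply peaked distribution exits early; a flat one over 20 tokens does not.
confident = route([10.0] + [0.0] * 19, gate_logits=[2.0, 1.0, 0.1])
uncertain = route([0.0] * 20, gate_logits=[2.0, 1.0, 0.1])
```

In a real model the same logic would run on tensors, and the early-exit branch would need an output head applied to the layer-12 hidden state; both are outside the scope of this sketch.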