Brain Surgery for LLMs: Scaling Transformers with Embedding Modules
Wed Jan 21 2026
The paper introduces STEM (Scaling Transformers with Embedding Modules), a novel architecture designed to enhance the efficiency and knowledge capacity of large language models. By replacing the traditional FFN up-projection with a token-indexed embedding lookup, the system decouples a model's total parameter count from its per-token computational cost. This static sparsity approach eliminates the need for complex runtime routing, allowing for CPU offloading and reducing inter-node communication overhead. Experiments at various scales demonstrate that STEM improves accuracy on knowledge-intensive benchmarks and strengthens performance in long-context reasoning. Furthermore, the architecture offers unique interpretability, enabling direct knowledge editing and injection simply by modifying specific embedding vectors. Ultimately, STEM provides a stable, scalable method for increasing parametric memory while maintaining high efficiency during both training and inference.
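To make the core idea concrete, here is a minimal sketch of an FFN-style block whose up-projection is replaced by a token-indexed embedding lookup. This is a hypothetical reconstruction based only on the summary above, not the authors' reference implementation; the module name `STEMStyleFFN`, the gating path, and all dimensions are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

class STEMStyleFFN(nn.Module):
    """Sketch of an FFN whose up-projection is a token-indexed lookup.

    Standard FFN:   y = W_down @ act(W_up @ x)
    Sketch here:    y = W_down @ (act(gate(x)) * E[token_id])
    where E is a large embedding table indexed by the input token id, so
    total parameters grow with the table size while per-token compute stays
    roughly constant (one lookup plus the small dense gate/down matmuls).
    """

    def __init__(self, d_model: int, d_ff: int, vocab_size: int):
        super().__init__()
        # Large parametric memory: one d_ff-dim vector per vocabulary token.
        # Because indexing is static (by token id), there is no learned router,
        # and this table could in principle be offloaded to CPU memory.
        self.token_memory = nn.Embedding(vocab_size, d_ff)
        # Small dense path that gates the looked-up vector (assumed design).
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)
        self.act = nn.SiLU()

    def forward(self, hidden: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model); token_ids: (batch, seq)
        looked_up = self.token_memory(token_ids)          # (batch, seq, d_ff), routing-free sparsity
        gated = self.act(self.gate(hidden)) * looked_up   # elementwise gate replaces the dense up-projection
        return self.down(gated)                           # project back to model width


# Usage: per-token FLOPs are dominated by the gate/down matmuls and stay fixed
# even if vocab_size (and hence the embedding table) grows much larger.
layer = STEMStyleFFN(d_model=512, d_ff=2048, vocab_size=50_000)
x = torch.randn(2, 16, 512)
ids = torch.randint(0, 50_000, (2, 16))
print(layer(x, ids).shape)  # torch.Size([2, 16, 512])
```

Under this reading, the interpretability claim also becomes tangible: editing `layer.token_memory.weight[token_id]` would directly rewrite the knowledge associated with that token, which is one plausible way the described knowledge editing and injection could work.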