Training a large artificial intelligence model is expensive, not just in dollars, but in time, energy, and computational resources. Traditionally, obtaining a smaller, faster model either requires training a massive one first and then trimming it down, or training a small one from scratch and accepting weaker performance.
The technique, called CompreSSM, targets a family of AI architectures known as state-space models, which power applications ranging from language processing to audio generation and robotics. By borrowing mathematical tools from control theory, the researchers can identify which parts of a model are pulling their weight and which are dead weight, before surgically removing the unnecessary components early in the training process.
"It's essentially a technique to make models grow smaller and faster as they are training," says Makram Chahine, a PhD student in electrical engineering and computer science, CSAIL affiliate, and lead author of the paper. "During learning, they're also getting rid of parts that are not useful to their development."
The key insight is that the relative importance of different components within these models stabilizes surprisingly early during training. Using a mathematical quantity called Hankel singular values, which measure how much each internal state contributes to the model's overall behavior, the team showed they can reliably rank which dimensions matter and which don't after only about 10 percent of the training process. Once those rankings are established, the less-important components can be safely discarded, and the remaining 90 percent of training proceeds at the speed of a much smaller model.
"What's exciting about this work is that it turns compression from an afterthought into part of the learning process itself," says senior author Daniela Rus, MIT professor and director of CSAIL. "Instead of training a large model and then figuring out how to make it smaller, CompreSSM lets the model discover its own efficient structure as it learns. That's a fundamentally different way to think about building AI systems."
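To make the idea of Hankel singular values concrete, here is a minimal sketch for a small, stable, discrete-time linear system. It is not the CompreSSM code: the article does not name the reduction procedure, so classical balanced truncation (the textbook control-theory method built on these values) is used as a stand-in, and the function names and the toy matrices are hypothetical. The sketch also assumes the system is controllable and observable, so that both Gramians admit a Cholesky factorization.

```python
import numpy as np
from scipy.linalg import cholesky, solve_discrete_lyapunov, svd


def gramians(A, B, C):
    """Gramians of the stable system x[k+1] = A x[k] + B u[k], y[k] = C x[k]."""
    P = solve_discrete_lyapunov(A, B @ B.T)    # solves A P A^T - P + B B^T = 0
    Q = solve_discrete_lyapunov(A.T, C.T @ C)  # solves A^T Q A - Q + C^T C = 0
    return P, Q


def hankel_singular_values(A, B, C):
    """Square roots of the eigenvalues of P Q, sorted from largest to smallest."""
    P, Q = gramians(A, B, C)
    eigvals = np.linalg.eigvals(P @ Q).real
    return np.sort(np.sqrt(np.clip(eigvals, 0.0, None)))[::-1]


def balanced_truncation(A, B, C, r):
    """Keep the r balanced states with the largest Hankel singular values."""
    P, Q = gramians(A, B, C)
    Lp = cholesky(P, lower=True)                 # P = Lp Lp^T
    Lq = cholesky(Q, lower=True)                 # Q = Lq Lq^T
    U, s, Vt = svd(Lq.T @ Lp)                    # s holds the Hankel singular values
    T = Lp @ Vt.T[:, :r] / np.sqrt(s[:r])        # reduced state -> original coordinates
    Tinv = (U[:, :r] / np.sqrt(s[:r])).T @ Lq.T  # left inverse, Tinv @ T = I_r
    return Tinv @ A @ T, Tinv @ B, C @ T


# Toy 3-state system (hypothetical values, chosen only for illustration).
A = np.diag([0.9, 0.5, 0.1])
B = np.ones((3, 1))
C = np.array([[1.0, 0.5, 0.1]])

hsv = hankel_singular_values(A, B, C)              # one value per balanced state
A_r, B_r, C_r = balanced_truncation(A, B, C, r=1)  # keep only the dominant state
```

Note that each Hankel singular value is attached to a state in the balanced coordinate system, not to a coordinate of the original state vector, which is why the truncation first changes basis through `T` and `Tinv`.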
The results are striking. On image classification benchmarks, compressed models maintained nearly the same accuracy as their full-sized counterparts while training up to 1.5 times faster. A compressed model reduced to roughly a quarter of its original state dimension achieved 85.7 percent accuracy on the CIFAR-10 benchmark, compared to just 81.8 percent for a model trained at that smaller size from scratch. On Mamba, one of the most widely used state-space architectures, the method achieved approximately 4x training speedups, compressing a 128-dimensional model down to around 12 dimensions while maintaining competitive performance.
"You get the performance of the larger model, because you capture most of the complex dynamics during the warm-up phase, then only keep the most-useful states," Chahine says. "The model is still able to perform at a higher level than training a small model from the start."
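A back-of-the-envelope calculation helps relate the warm-up fraction to the speedups reported above. The cost model below is an assumption made for illustration, not something taken from the paper: it treats every full-size step as costing 1, every post-compression step as costing a fixed fraction of that, and ignores the cost of the compression step itself. The function name `overall_speedup` is hypothetical.

```python
def overall_speedup(warmup_fraction, relative_step_cost):
    """Speedup over full-size training under a simple two-phase cost model.

    warmup_fraction: share of training steps run at full size (e.g. 0.1).
    relative_step_cost: cost of one compressed step divided by the cost
        of one full-size step (assumed constant after compression).
    """
    # Total cost, measured in units of "one full-size training run".
    compressed_total = warmup_fraction + (1.0 - warmup_fraction) * relative_step_cost
    return 1.0 / compressed_total


# If nothing gets cheaper after compression, there is no speedup.
no_gain = overall_speedup(0.1, 1.0)
# Upper bound for a 10 percent warm-up: compressed steps cost nothing.
ceiling = overall_speedup(0.1, 0.0)
```

Under this simplified model, a 10 percent warm-up caps the achievable speedup at 10x, and a 4x overall speedup would correspond to compressed steps costing roughly one-sixth of a full-size step.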
What makes CompreSSM distinct from existing approaches is its theoretical grounding. Conventional pruning methods train a full model and then strip away parameters after the fact, meaning you still pay the full computational cost of training the big model. Knowledge distillation, another popular technique, requires training a large "teacher" model to completion and then training a second, smaller "student" model on top of it, essentially doubling the training effort. CompreSSM avoids both of these costs by making informed compression decisions mid-stream.
The team benchmarked CompreSSM head-to-head against both alternatives. Compared to Hankel nuclear norm regularization, a recently proposed spectral technique for encouraging compact state-space models, CompreSSM was more than 40 times faster, while also achieving higher accuracy. The regularization approach slowed training by roughly 16 times because it required expensive eigenvalue computations at every single gradient step, and even then, the resulting models underperformed. Against knowledge distillation on CIFAR-10, CompreSSM held a clear advantage for heavily compressed models: At smaller state dimensions, distilled models saw significant accuracy drops, while CompreSSM-compressed models maintained near-full performance. And because distillation requires a forward pass through both the teacher and student at every training step, even its smaller student models trained slower than the full-sized baseline.
The researchers proved mathematically that the importance of individual model states changes smoothly during training, thanks to an application of Weyl's theorem, and showed empirically that the relative rankings of those states remain stable. Together, these findings give practitioners confidence that dimensions identified as negligible early on won't suddenly become critical later.
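The article does not say which form of Weyl's theorem the proof relies on. For reference, the standard statement for Hermitian matrices, often called Weyl's inequality, bounds how far each eigenvalue can move when the matrix is perturbed. How the authors apply it to their models is not described in the source, so the reading offered after the formula is an assumption.

```latex
\left| \lambda_i(M + E) - \lambda_i(M) \right| \le \lVert E \rVert_2,
\qquad i = 1, \dots, n
```

Here $M$ and $E$ are $n \times n$ Hermitian matrices, $\lambda_i(\cdot)$ denotes the $i$-th eigenvalue in decreasing order, and $\lVert E \rVert_2$ is the spectral norm of the perturbation. Read this way, a training update that changes the relevant matrix only slightly can shift each eigenvalue by at most the size of that change, which is one way to see why the importance scores would drift smoothly instead of jumping.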
The method also comes with a pragmatic safety net. If a compression step causes an unexpected performance drop, practitioners can revert to a previously saved checkpoint. "It gives people control over how much they're willing to pay in terms of performance, rather than having to define a less-intuitive energy threshold," Chahine explains.
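The safety net described above can be sketched in a few lines. This is an illustrative outline under stated assumptions, not the authors' implementation: `compress_with_rollback`, its `compress` and `evaluate` callables, and the `max_drop` tolerance are all hypothetical names, and the "model" in the usage example is just a list of numbers standing in for real parameters.

```python
import copy


def compress_with_rollback(model, compress, evaluate, max_drop):
    """Try one compression step; fall back to the checkpoint if it costs too much.

    compress: callable returning a smaller model (hypothetical).
    evaluate: callable returning a score where higher is better (hypothetical).
    max_drop: largest score decrease the practitioner is willing to accept.
    Returns (model_to_keep, accepted_flag).
    """
    checkpoint = copy.deepcopy(model)            # saved state to revert to
    baseline = evaluate(model)
    candidate = compress(copy.deepcopy(model))   # never mutate the original
    if baseline - evaluate(candidate) > max_drop:
        return checkpoint, False                 # drop too large: revert
    return candidate, True                       # drop acceptable: keep it


# Toy usage: the "model" is a list of per-state contributions (made-up numbers).
toy_model = [0.5, 0.3, 0.01]
mild, mild_ok = compress_with_rollback(toy_model, lambda m: m[:2], sum, max_drop=0.05)
harsh, harsh_ok = compress_with_rollback(toy_model, lambda m: m[:1], sum, max_drop=0.05)
```

Expressing the tolerance as an acceptable performance drop mirrors Chahine's point: the practitioner sets a budget in terms of the metric they care about instead of picking an energy threshold.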
There are some practical boundaries to the technique. CompreSSM works best on models that exhibit a strong correlation between the internal state dimension and overall performance, a property that varies across tasks and architectures. The method is particularly effective on multi-input, multi-output (MIMO) models, where the relationship between state size and expressivity is strongest. For per-channel, single-input, single-output architectures, the gains are more modest, since those models are less sensitive to state dimension changes in the first place.
The theory applies most cleanly to linear time-invariant systems, although the team has developed extensions for the increasingly popular input-dependent, time-varying architectures. And because the family of state-space models extends to architectures like linear attention, a growing area of interest as an alternative to traditional transformers, the potential scope of application is broad.
Chahine and his collaborators see the work as a stepping stone. The team has already demonstrated an extension to linear time-varying systems like Mamba, and future directions include pushing CompreSSM further into matrix-valued dynamical systems used in linear attention mechanisms, which would bring the technique closer to the transformer architectures that underpin most of today's largest AI systems.
"This had to be the first step, because this is where the theory is neat and the approach can stay principled," Chahine says. "It's the stepping stone to then extend to other architectures that people are using in industry today."
"The work of Chahine and his colleagues provides an intriguing, theoretically grounded perspective on compression for modern state-space models (SSMs)," says Antonio Orvieto, ELLIS Institute Tübingen principal investigator and MPI for Intelligent Systems independent group leader, who wasn't involved in the research. "The method provides evidence that the state dimension of these models can be effectively reduced during training and that a control-theoretic perspective can successfully guide this procedure. The work opens new avenues for future research, and the proposed algorithm has the potential to become a standard approach when pre-training large SSM-based models."
The work, which was accepted as a conference paper at the International Conference on Learning Representations 2026, will be presented later this month. It was supported, in part, by the Max Planck ETH Center for Learning Systems, the Hector Foundation, Boeing, and the U.S. Office of Naval Research.