On-Device Model Inference

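The On-Device Inference group develops OmniInfer, its in-house on-device inference framework. Its research covers model compression and acceleration, on-device model architectures, heterogeneous SoC execution, and operator optimization. Together these form an integrated technology stack spanning model compression, on-device architecture design, and hardware acceleration, which supports on-device large-model deployment and intelligent mobile applications. The group's work has been published at top-tier venues such as ASPLOS and OSDI.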

Research Directions

On-Device Inference Acceleration Based on Model Sparsity and KV Cache Compression

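To address the compute, memory, and long-context bottlenecks of on-device large-model inference, this direction studies sparsity and KV cache compression methods. Techniques such as sparse computation and attention cache reuse are used to optimize model parameters and the cache jointly.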
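As an illustration of what KV cache compression can look like, the sketch below keeps a fixed token budget by retaining the most recent tokens plus the older tokens with the highest accumulated attention scores. This is a generic eviction heuristic chosen for the example, not a description of OmniInfer's method; the function name `compress_kv_cache` and the parameters `budget` and `recent` are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def compress_kv_cache(
    keys: List[Vector],
    values: List[Vector],
    attn_scores: List[float],
    budget: int,
    recent: int,
) -> Tuple[List[Vector], List[Vector], List[int]]:
    """Keep at most `budget` cached tokens (hypothetical helper).

    The `recent` newest tokens are always kept; the remaining slots go to the
    older tokens with the highest accumulated attention score.
    Returns the compressed keys, values, and the kept token indices.
    """
    n = len(keys)
    if not (len(values) == n and len(attn_scores) == n):
        raise ValueError("keys, values, and attn_scores must have equal length")
    if budget <= 0 or recent < 0 or recent > budget:
        raise ValueError("need budget > 0 and 0 <= recent <= budget")
    if n <= budget:
        return keys, values, list(range(n))  # nothing to evict

    recent_idx = list(range(n - recent, n))
    older = list(range(n - recent))
    # Rank older tokens by score, highest first; ties favor earlier tokens.
    older.sort(key=lambda i: (-attn_scores[i], i))
    kept = sorted(older[: budget - recent] + recent_idx)
    return [keys[i] for i in kept], [values[i] for i in kept], kept


if __name__ == "__main__":
    ks = [[float(i)] for i in range(6)]
    vs = [[float(10 * i)] for i in range(6)]
    scores = [0.9, 0.1, 0.5, 0.2, 0.3, 0.4]
    _, _, kept = compress_kv_cache(ks, vs, scores, budget=4, recent=2)
    print(kept)
```

Keeping a recent window alongside the high-score tokens is a common design choice, since the newest tokens have had little time to accumulate attention and would otherwise be evicted first.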

Efficient MoE Model Architectures for On-Device Deployment

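To resolve the tension between limited on-device model capacity and task complexity, this direction studies efficient on-device MoE architectures. It combines mechanisms such as sparse activation and expert routing to improve expert organization and load balancing.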
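Sparse activation and expert routing can be made concrete with a small top-k gating sketch: each token activates only its k highest-probability experts, and the fraction of assignments per expert gives a simple view of load balance. This is a textbook-style illustration under assumed names (`route_top_k`, `expert_load`), not the group's architecture.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k most probable experts for one token.

    Returns (expert_index, gate_weight) pairs; the gate weights are
    renormalized over the selected experts so they sum to 1.
    """
    if not 0 < k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: (-probs[i], i))[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def expert_load(batch_logits: List[List[float]], k: int) -> List[float]:
    """Fraction of all token-to-expert assignments received by each expert."""
    counts = [0] * len(batch_logits[0])
    for logits in batch_logits:
        for idx, _ in route_top_k(logits, k):
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]


if __name__ == "__main__":
    batch = [[2.0, 0.0, 1.0, -1.0], [0.0, 3.0, 0.0, 1.0]]
    print(route_top_k(batch[0], k=2))
    print(expert_load(batch, k=2))
```

A perfectly balanced router would give every expert the same load fraction; a skewed distribution means a few experts dominate compute and memory traffic, which matters on devices where only some experts fit in fast memory.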

CPU/GPU/NPU Cooperative Acceleration and Operator Optimization for Heterogeneous SoCs

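To address the low coordination efficiency and complex adaptation of heterogeneous SoCs, this direction studies cross-unit cooperation mechanisms and low-level operator optimization, improving hardware utilization through task scheduling and operator design.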
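One way to picture task scheduling across processing units is a greedy earliest-finish-time heuristic: each independent operator goes to whichever unit would finish it soonest, given that unit's current queue. The sketch below assumes independent operators with unique names and made-up per-unit latency estimates; the function `schedule_ops` and the cost table are hypothetical and ignore data-transfer costs, which a real scheduler would need to model.

```python
from typing import Dict, List, Tuple

# op name -> {unit name -> estimated latency in ms}; a unit that cannot run
# an operator is simply absent from that operator's inner dict.
CostTable = Dict[str, Dict[str, float]]


def schedule_ops(
    ops: List[str], cost_ms: CostTable
) -> Tuple[Dict[str, Tuple[str, float, float]], float]:
    """Greedy earliest-finish-time assignment of independent operators.

    Returns a plan mapping op -> (unit, start_ms, end_ms) and the makespan.
    """
    free_at: Dict[str, float] = {}  # unit -> time at which it becomes idle
    plan: Dict[str, Tuple[str, float, float]] = {}
    # Place the most expensive operators first (by best-case latency).
    order = sorted(ops, key=lambda op: -min(cost_ms[op].values()))
    for op in order:
        # Choose the unit with the earliest finish time; break ties by name.
        unit = min(
            cost_ms[op],
            key=lambda u: (free_at.get(u, 0.0) + cost_ms[op][u], u),
        )
        start = free_at.get(unit, 0.0)
        end = start + cost_ms[op][unit]
        free_at[unit] = end
        plan[op] = (unit, start, end)
    makespan = max(free_at.values()) if free_at else 0.0
    return plan, makespan


if __name__ == "__main__":
    costs = {
        "matmul": {"cpu": 8.0, "gpu": 3.0, "npu": 2.0},
        "softmax": {"cpu": 2.0, "gpu": 1.5},  # assumed unsupported on the NPU
        "conv": {"cpu": 9.0, "gpu": 4.0, "npu": 2.5},
    }
    plan, makespan = schedule_ops(["matmul", "softmax", "conv"], costs)
    print(plan, makespan)
```

In this toy example the three operators end up on three different units and run concurrently, which is the kind of utilization gain that cooperative scheduling aims for.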

Publications

2025.08.06

Fast on-Device LLM Inference with Neuron Co-Activation Linking

Large Language Models (LLMs) have achieved remarkable success across various domains, yet deploying them on mobile devices remains an arduous challenge due to their extensive computational and memory demands. While lightweight LLMs have been developed to fit mobile environments, they suffer from...

2025.12.03

Enhancing LLM Long-Context Fine-tuning with Contextual Token Sparsity

The escalating demand for long-context applications has intensified the necessity of extending the LLM context windows. Despite recent fine-tuning approaches successfully expanding context lengths, their high memory footprints, especially for activations, present a critical practical limitation. Current ...

2025.11.04

From Wasted Compute to Quality Gains: LLM Test-Time Scaling on Mobile NPUs

With advancements in the performance of small-sized LLMs and Mobile SoCs, deploying LLMs on edge devices such as mobile phones is becoming a reality. The emerging NPU (Neural Processing Unit) units on mobile SoCs, with their specialized neural network acceleration capabilities and lower power ...