
OmniInfer: A High-Speed Inference Engine
OmniInfer is a unified inference foundation for on-device AI applications, supporting multi-platform deployment and multiple pluggable backends. Through hardware resource pre-checks, automatic backend selection, an OpenAI-compatible API, tool calling, output parsing, and lifecycle management, it lowers cross-platform development cost while improving the stability, extensibility, and deployment efficiency of model execution.
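An OpenAI-compatible API means existing clients can target the engine without code changes. Below is a minimal sketch of a chat-completions request body; the endpoint URL, port, and model id are illustrative assumptions, not details confirmed by this page — only the request schema itself is the standard OpenAI shape.

```python
import json

# Illustrative local endpoint (assumption; actual host/port depend on deployment).
BASE_URL = "http://127.0.0.1:8000/v1/chat/completions"

# Standard OpenAI chat-completions request shape: any client library
# that speaks this schema can point at an OpenAI-compatible server unchanged.
request_body = {
    "model": "local-model",  # placeholder model id (assumption)
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello"},
    ],
    "stream": False,
}

# Serialize exactly as an HTTP client would before POSTing to BASE_URL.
payload = json.dumps(request_body)
```

Because the wire format is unchanged, swapping a cloud endpoint for a local OmniInfer deployment is a matter of pointing the client at a different base URL.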
Technical Highlights
OmniInfer delivers unified large-model inference across Windows, macOS, Linux, Android, and iOS, handling model loading, inference, and management through one consistent abstract interface that smooths over platform-specific differences.


The inference runtime is rebuilt for on-device constraints: dynamic sparse activation, KV-cache optimization, and trading storage for memory unlock more inference performance from resource-limited hardware.
2.49x throughput improvement
1.93x memory savings
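The dynamic sparse activation mentioned above can be illustrated with a toy top-k neuron selection: only the most strongly activated hidden neurons of an FFN layer contribute to the output, and the rest are skipped. This is a pure-Python sketch of the general technique; the function names are illustrative and not OmniInfer's actual API.

```python
def top_k_indices(scores, k):
    """Indices of the k largest activation scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def sparse_ffn(x, w_up, w_down, k):
    """Toy sparsely-activated FFN: only the top-k hidden neurons are used.

    x      : input vector (list of floats)
    w_up   : hidden_dim rows of in_dim up-projection weights
    w_down : hidden_dim rows of out_dim down-projection weights
    k      : number of neurons actually kept
    """
    # Score each hidden neuron (computed in full here for clarity; real
    # systems use a lightweight predictor to skip this pass as well).
    hidden = [sum(wi * xi for wi, xi in zip(row, x)) for row in w_up]
    active = top_k_indices(hidden, k)
    # Accumulate the output using only the selected neurons.
    out = [0.0] * len(w_down[0])
    for i in active:
        h = max(hidden[i], 0.0)  # ReLU activation
        for j, w in enumerate(w_down[i]):
            out[j] += h * w
    return out
```

With k much smaller than the hidden dimension, the down-projection work (and the weights that must be resident in memory) shrinks proportionally, which is the intuition behind the throughput and memory figures above.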
Research Results
From Wasted Compute to Quality Gains: LLM Test-Time Scaling on Mobile NPUs
With advancements in the performance of small-sized LLMs and Mobile SoCs, deploying LLMs on edge devices such as mobile phones is becoming a reality.
Fast on-Device LLM Inference with Neuron Co-Activation Linking
Large Language Models have achieved remarkable success across domains, yet deploying them on mobile devices remains challenging due to compute and memory demands.
Enhancing LLM Long-Context Fine-tuning with Contextual Token Sparsity
Long-context applications require extended context windows, while memory footprints remain a critical challenge for efficient fine-tuning.