
OmniInfer: A High-Speed Inference Engine
OmniInfer is a unified inference foundation for on-device AI applications, supporting multi-platform deployment and multiple pluggable backends. Through hardware resource pre-checks, automatic backend selection, an OpenAI-compatible API, tool calling, output parsing, and lifecycle management, it lowers cross-platform development cost while improving the stability, extensibility, and deployment efficiency of model execution.
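An OpenAI-compatible API means existing clients can target the engine without code changes. Below is a minimal sketch of a chat-completions request body; the endpoint URL, port, and model id are illustrative assumptions, not details confirmed by this page — only the request schema itself is the standard OpenAI shape.

```python
import json

# Illustrative local endpoint (assumption; actual host/port depend on deployment).
BASE_URL = "http://127.0.0.1:8000/v1/chat/completions"

# Standard OpenAI chat-completions request shape: any client library
# that speaks this schema can point at an OpenAI-compatible server unchanged.
request_body = {
    "model": "local-model",  # placeholder model id (assumption)
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello"},
    ],
    "stream": False,
}

# Serialize exactly as an HTTP client would before POSTing to BASE_URL.
payload = json.dumps(request_body)
```

Because the wire format is unchanged, swapping a cloud endpoint for a local OmniInfer deployment is a matter of pointing the client at a different base URL.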
Technical Highlights
OmniInfer delivers unified large-model inference across Windows, macOS, Linux, Android, and iOS, handling model loading, inference, and management through one consistent abstract interface that smooths over platform-specific differences.


The inference runtime is rebuilt for on-device constraints: dynamic sparse activation, KV-cache optimization, and trading storage for memory unlock more inference performance from resource-limited hardware.
2.49x throughput improvement
1.93x memory savings
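The dynamic sparse activation mentioned above can be illustrated with a toy top-k neuron selection: only the most strongly activated hidden neurons of an FFN layer contribute to the output, and the rest are skipped. This is a pure-Python sketch of the general technique; the function names are illustrative and not OmniInfer's actual API.

```python
def top_k_indices(scores, k):
    """Indices of the k largest activation scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def sparse_ffn(x, w_up, w_down, k):
    """Toy sparsely-activated FFN: only the top-k hidden neurons are used.

    x      : input vector (list of floats)
    w_up   : hidden_dim rows of in_dim up-projection weights
    w_down : hidden_dim rows of out_dim down-projection weights
    k      : number of neurons actually kept
    """
    # Score each hidden neuron (computed in full here for clarity; real
    # systems use a lightweight predictor to skip this pass as well).
    hidden = [sum(wi * xi for wi, xi in zip(row, x)) for row in w_up]
    active = top_k_indices(hidden, k)
    # Accumulate the output using only the selected neurons.
    out = [0.0] * len(w_down[0])
    for i in active:
        h = max(hidden[i], 0.0)  # ReLU activation
        for j, w in enumerate(w_down[i]):
            out[j] += h * w
    return out
```

With k much smaller than the hidden dimension, the down-projection work (and the weights that must be resident in memory) shrinks proportionally, which is the intuition behind the throughput and memory figures above.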
Research Results
From Wasted Compute to Quality Gains: LLM Test-Time Scaling on Mobile NPUs
With advancements in the performance of small-sized LLMs and Mobile SoCs, deploying LLMs on edge devices such as mobile phones is becoming a reality.
Fast on-Device LLM Inference with Neuron Co-Activation Linking
Large Language Models have achieved remarkable success across domains, yet deploying them on mobile devices remains challenging due to compute and memory demands.
Enhancing LLM Long-Context Fine-tuning with Contextual Token Sparsity
Long-context applications require extended context windows, while memory footprints remain a critical challenge for efficient fine-tuning.