
FunASR

Industrial-grade speech recognition toolkit: 170x realtime, 50+ languages, speaker diarization, emotion detection, streaming, and OpenAI-compatible API.

README

From the repo.

FunASR

Industrial speech recognition. 170x realtime, 13x faster than Whisper-large-v3. 50+ languages.
Speaker diarization · Emotion detection · Streaming · One API call


Quick Start

No local setup? Open the Colab quickstart to transcribe a public sample or upload your own audio in a browser.

pip install torch torchaudio
pip install funasr

from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model = AutoModel(model="iic/SenseVoiceSmall", vad_model="fsmn-vad", spk_model="cam++", device="cuda")
result = model.generate(input="https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav")

# One call returns VAD segments with speaker id + timestamps — render them however you like:
for seg in result[0]["sentence_info"]:
    print(f"[{seg['start']/1000:.1f}s] Speaker {seg['spk']}: {rich_transcription_postprocess(seg['sentence'])}")

Output — structured text with speaker labels, timestamps, and punctuation:

[0.6s] Speaker 0: 欢迎大家来体验达摩院推出的语音识别模型

That's it. One pipeline, one call: VAD segmentation, speech recognition, punctuation, and speaker diarization all happen automatically.
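The loop above prints one line per VAD segment. As a minimal sketch, assuming each entry of `sentence_info` carries the `start` (milliseconds), `spk`, and `sentence` keys used in the Quick Start snippet and that the text has already been post-processed, consecutive segments from the same speaker can be merged into turns before rendering. The helper names and the sample data are hypothetical and not part of FunASR.

```python
from typing import Dict, List


def merge_speaker_turns(segments: List[Dict]) -> List[Dict]:
    """Merge consecutive segments spoken by the same speaker into one turn."""
    turns: List[Dict] = []
    for seg in segments:
        if turns and turns[-1]["spk"] == seg["spk"]:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["sentence"] += " " + seg["sentence"]
        else:
            # New speaker: start a new turn at this segment's start time.
            turns.append({"start": seg["start"], "spk": seg["spk"], "sentence": seg["sentence"]})
    return turns


def render(turns: List[Dict]) -> List[str]:
    """Format turns as '[12.3s] Speaker N: text' lines."""
    return [f"[{t['start'] / 1000:.1f}s] Speaker {t['spk']}: {t['sentence']}" for t in turns]


# Hypothetical sample shaped like result[0]["sentence_info"].
sample = [
    {"start": 600, "spk": 0, "sentence": "Hello."},
    {"start": 2100, "spk": 0, "sentence": "Welcome to the demo."},
    {"start": 5200, "spk": 1, "sentence": "Thanks."},
]
lines = render(merge_speaker_turns(sample))
print("\n".join(lines))
```

The sketch joins merged text with a space, which suits English; for Chinese or Japanese transcripts, joining with an empty string is likely the better choice.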

LLM-powered ASR: Fun-ASR-Nano

For the highest accuracy across 31 languages (including Chinese dialects), use Fun-ASR-Nano, an LLM-based ASR model that combines a SenseVoice encoder with a Qwen3-0.6B decoder:

from funasr import AutoModel

model = AutoModel(model="FunAudioLLM/Fun-ASR-Nano-2512", vad_model="fsmn-vad", device="cuda")
result = model.generate(input="https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav")

With vLLM acceleration (16x faster, batch processing):

from funasr.auto.auto_model_vllm import AutoModelVLLM

model = AutoModelVLLM(model="FunAudioLLM/Fun-ASR-Nano-2512", tensor_parallel_size=1)
results = model.generate(["audio1.wav", "audio2.wav"], language="auto")

Deploy as API server: funasr-server --device cuda → OpenAI-compatible endpoint at localhost:8000

Use with AI agents: MCP Server for Claude/Cursor · OpenAI API for LangChain/Dify/AutoGen

Why FunASR?

| | FunASR | Whisper | Cloud APIs |
|---|---|---|---|
| Speed | 170x realtime | 13x realtime | ~1x realtime |
| Speaker ID | ✅ Built-in | ❌ Needs pyannote | ✅ Extra cost |
| Emotion | ✅ Happy/Sad/Angry | | |
| Languages | 50+ | 57 | Varies |
| Streaming | ✅ WebSocket | | |
| vLLM Acceleration | ✅ 2-3x faster | | N/A |
| Self-hosted | ✅ MIT license | ✅ MIT license | ❌ Cloud only |
| Cost | Free | Free | $0.006/min+ |
| CPU viable | ✅ 17x realtime | ❌ Too slow | N/A |

Trying FunASR for the first time? Use the Colab quickstart before setting up a local environment. Choosing a first model? Start with the model selection guide. Planning a switch from Whisper or a cloud ASR provider? Use the migration guide and benchmark example to test representative audio, map features, and roll out safely.


Benchmark

184 long-form audio files (192 min). Full report →

| Model | GPU Speed | CPU Speed | vs Whisper-large-v3 |
|---|---|---|---|
| SenseVoice-Small | 170x realtime | 17x realtime | 🚀 13x faster |
| Paraformer-Large | 120x realtime | 15x realtime | 🚀 9x faster |
| Whisper-large-v3-turbo | 46x realtime | | 3.4x faster |
| Fun-ASR-Nano | 17x realtime | 3.6x realtime | 1.3x faster |
| Whisper-large-v3 | 13x realtime | | baseline |

Key takeaway: SenseVoice-Small and Paraformer-Large run faster on CPU (17x and 15x realtime) than Whisper-large-v3 runs on GPU (13x realtime).
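To make the benchmark figures concrete, the sketch below converts a realtime factor into wall-clock time for the 192-minute benchmark set and recomputes each model's speedup over Whisper-large-v3. It assumes that "Nx realtime" means N seconds of audio transcribed per second of compute; the GPU figures are copied from the benchmark table, and the function names are illustrative.

```python
# GPU realtime factors copied from the benchmark table.
GPU_SPEED = {
    "SenseVoice-Small": 170,
    "Paraformer-Large": 120,
    "Whisper-large-v3-turbo": 46,
    "Fun-ASR-Nano": 17,
    "Whisper-large-v3": 13,
}
AUDIO_MINUTES = 192  # total length of the benchmark set


def processing_seconds(audio_minutes: float, realtime_factor: float) -> float:
    """Wall-clock seconds needed to transcribe the audio at a given realtime factor."""
    return audio_minutes * 60 / realtime_factor


def speedup_vs_baseline(model: str, baseline: str = "Whisper-large-v3") -> float:
    """Ratio of a model's realtime factor to the baseline's."""
    return GPU_SPEED[model] / GPU_SPEED[baseline]


for name, factor in GPU_SPEED.items():
    secs = processing_seconds(AUDIO_MINUTES, factor)
    print(f"{name}: {secs:.0f} s, {speedup_vs_baseline(name):.1f}x vs Whisper-large-v3")
```

Because the table lists rounded realtime factors, ratios recomputed this way can differ slightly from the published "vs Whisper-large-v3" column.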


What's new

  • 2026/05/24: vLLM Inference Engine — 2-3x faster LLM decoding for Fun-ASR-Nano. Streaming WebSocket service with VAD + Speaker Diarization. Guide →
  • 2026/05/24: Dynamic VAD — adaptive silence threshold (default on). Short sentences stay intact, long segments get auto-split. Details →
  • 2026/05/24: v1.3.3: funasr-server CLI, OpenAI-compatible API, MCP Server for AI agents. Upgrade with pip install --upgrade funasr
  • 2026/05/20: Added Qwen3-ASR (0.6B/1.7B) — 52 languages, auto detection. usage
  • 2026/05/20: Added GLM-ASR-Nano (1.5B) — 17 languages, dialect support. usage
  • 2026/05/19: Fun-ASR-Nano and SenseVoice now support speaker diarization.
  • 2025/12/15: Fun-ASR-Nano-2512 — 31 languages, tens of millions of hours training.
Older
  • 2024/10/10: Whisper-large-v3-turbo support added.
  • 2024/07/04: SenseVoice — ASR + emotion + audio events.
  • 2024/01/30: FunASR 1.0 released.

Installation

pip install funasr
From source / Requirements
git clone https://github.com/modelscope/FunASR.git && cd FunASR
pip install -e ./

Requirements: Python ≥ 3.8. Install PyTorch + torchaudio first (pytorch.org), then pip install funasr.
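Before loading a model, it can help to confirm that the environment meets the requirements listed above. The following is a minimal sketch using only the standard library; `missing_requirements` is a hypothetical helper, and it only checks that packages are importable, not which versions are installed.

```python
import importlib.util
import sys
from typing import List


def missing_requirements(packages: List[str], min_python=(3, 8)) -> List[str]:
    """Return a list of unmet requirements (empty when the environment looks ready)."""
    problems = []
    if sys.version_info < min_python:
        problems.append(f"python>={min_python[0]}.{min_python[1]}")
    for name in packages:
        # find_spec returns None when a top-level package is not importable.
        if importlib.util.find_spec(name) is None:
            problems.append(name)
    return problems


problems = missing_requirements(["torch", "torchaudio", "funasr"])
print("ready" if not problems else "missing: " + ", ".join(problems))
```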


Model Zoo

| Model | Task | Languages | Params | Links |
|---|---|---|---|---|
| Fun-ASR-Nano | ASR + timestamps | 31 languages | 800M | 🤗 |
| SenseVoiceSmall | ASR + emotion + events | zh/en/ja/ko/yue | 234M | 🤗 |
| Paraformer-zh | ASR + timestamps | zh/en | 220M | 🤗 |
| Paraformer-zh-streaming | Streaming ASR | zh/en | 220M | 🤗 |
| Qwen3-ASR | ASR, 52 languages | multilingual | 1.7B | usage |
| GLM-ASR-Nano | ASR, 17 languages | multilingual | 1.5B | usage |
| Whisper-large-v3 | ASR + translation | multilingual | 1550M | usage |
| Whisper-large-v3-turbo | ASR + translation | multilingual | 809M | usage |
| ct-punc | Punctuation | zh/en | 290M | 🤗 |
| fsmn-vad | VAD | zh/en | 0.4M | 🤗 |
| cam++ | Speaker diarization | | 7.2M | 🤗 |
| emotion2vec+large | Emotion recognition | | 300M | 🤗 |

Usage

Full examples with parameter docs: Tutorial →

from funasr import AutoModel

# Chinese production (VAD + ASR + punctuation + speaker)
model = AutoModel(model="paraformer-zh", vad_model="fsmn-vad", punc_model="ct-punc", spk_model="cam++", device="cuda")
result = model.generate(input="https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav", hotword="关键词 20")

# 31 languages with timestamps
model = AutoModel(model="FunAudioLLM/Fun-ASR-Nano-2512", hub="hf", trust_remote_code=True,
                  vad_model="fsmn-vad", vad_kwargs={"max_single_segment_time": 30000}, device="cuda")
result = model.generate(input="audio.wav", batch_size=1)

# Streaming real-time (feed audio chunk by chunk)
import soundfile as sf
model = AutoModel(model="paraformer-zh-streaming", device="cuda")
audio, sr = sf.read("speech.wav", dtype="float32")   # 16 kHz mono
chunk_size = [0, 10, 5]                               # 600 ms chunks
chunk_stride = chunk_size[1] * 960
cache = {}
n_chunks = (len(audio) - 1) // chunk_stride + 1
for i in range(n_chunks):
    chunk = audio[i * chunk_stride : (i + 1) * chunk_stride]
    res = model.generate(input=chunk, cache=cache, is_final=(i == n_chunks - 1),
                         chunk_size=chunk_size, encoder_chunk_look_back=4, decoder_chunk_look_back=1)
    if res[0]["text"]:
        print(res[0]["text"], end="", flush=True)

# Emotion recognition
model = AutoModel(model="emotion2vec_plus_large", device="cuda")
result = model.generate(input="audio.wav", granularity="utterance")
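The chunk arithmetic in the streaming example above can be checked in isolation. This sketch assumes 16 kHz mono audio, so that one unit of `chunk_size[1]` corresponds to 960 samples (60 ms), which is what makes `[0, 10, 5]` a 600 ms chunk. `chunk_bounds` is a hypothetical helper that reproduces the loop's slicing without loading a model.

```python
from typing import List, Tuple

SAMPLE_RATE = 16000     # Hz, as in the streaming example
SAMPLES_PER_UNIT = 960  # 60 ms at 16 kHz; one unit of chunk_size[1]


def chunk_bounds(n_samples: int, chunk_size: List[int]) -> List[Tuple[int, int, bool]]:
    """Return (start, end, is_final) sample ranges for streaming an audio buffer."""
    stride = chunk_size[1] * SAMPLES_PER_UNIT
    n_chunks = (n_samples - 1) // stride + 1  # ceiling division, 0 for an empty buffer
    bounds = []
    for i in range(n_chunks):
        start = i * stride
        end = min(start + stride, n_samples)  # the last chunk may be shorter
        bounds.append((start, end, i == n_chunks - 1))
    return bounds


chunk_size = [0, 10, 5]
stride_ms = chunk_size[1] * SAMPLES_PER_UNIT * 1000 // SAMPLE_RATE
print(f"chunk length: {stride_ms} ms")
for start, end, is_final in chunk_bounds(2 * SAMPLE_RATE, chunk_size):
    print(start, end, is_final)
```

Only the last range is flagged `is_final`, matching the `is_final=(i == n_chunks - 1)` argument that tells the model to flush its cache.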

CLI (Agent-Friendly)

# Transcribe audio (simplest)
funasr audio.wav

# JSON output (for AI agents)
funasr audio.wav --output-format json

# SRT subtitles
funasr audio.wav --output-format srt --output-dir ./subs

# Speaker diarization + timestamps
funasr audio.wav --spk --timestamps -f json

# Choose model and language
funasr audio.wav --model paraformer --language zh

# Batch transcribe
funasr *.wav --output-format srt --output-dir ./output

Available models: sensevoice (default), paraformer, paraformer-en, fun-asr-nano
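For agents that call the CLI from Python, a small wrapper can assemble the argument list before handing it to `subprocess`. This is a sketch under assumptions: `build_funasr_command` is a hypothetical helper, it uses only the long flags shown in the examples above, and it assumes those flags can be combined in a single invocation.

```python
import shlex
from typing import List, Optional


def build_funasr_command(
    audio_paths: List[str],
    output_format: Optional[str] = None,
    output_dir: Optional[str] = None,
    model: Optional[str] = None,
    language: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for the funasr CLI using only flags shown above."""
    if not audio_paths:
        raise ValueError("at least one audio path is required")
    cmd = ["funasr", *audio_paths]
    if output_format:
        cmd += ["--output-format", output_format]
    if output_dir:
        cmd += ["--output-dir", output_dir]
    if model:
        cmd += ["--model", model]
    if language:
        cmd += ["--language", language]
    return cmd


cmd = build_funasr_command(["audio.wav"], output_format="json", model="paraformer", language="zh")
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, capture_output=True, text=True, check=True)
```

Passing a list rather than a shell string avoids quoting problems with file names that contain spaces.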


Deploy

# OpenAI-compatible API (recommended)
pip install torch torchaudio
pip install funasr vllm fastapi uvicorn python-multipart
funasr-server --device cuda
# → POST /v1/audio/transcriptions at localhost:8000

Verify it with a public sample:

curl -L https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0764W0121.wav -o sample.wav
curl http://localhost:8000/v1/audio/transcriptions \
  -F file=@sample.wav \
  -F model=sensevoice \
  -F response_format=verbose_json

# Docker streaming service
docker pull registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:funasr-runtime-sdk-online-cpu-0.1.12
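The curl check against /v1/audio/transcriptions shown earlier in this section can also be issued from Python. The sketch below builds the same multipart/form-data POST with the standard library only. The URL and the form fields (`file`, `model`, `response_format`) are taken from the curl command; the helper names are hypothetical, and because the response schema is not described here, the sending step is left commented out and would print the raw body.

```python
import uuid
import urllib.request
from typing import Dict, Tuple


def encode_multipart(fields: Dict[str, str], filename: str, file_bytes: bytes) -> Tuple[bytes, str]:
    """Encode form fields plus one file as multipart/form-data; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    parts.append(
        (
            f'--{boundary}\r\nContent-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode()
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_request(
    audio_path: str, url: str = "http://localhost:8000/v1/audio/transcriptions"
) -> urllib.request.Request:
    """Build (but do not send) the POST request equivalent to the curl command."""
    with open(audio_path, "rb") as f:
        audio = f.read()
    fields = {"model": "sensevoice", "response_format": "verbose_json"}
    body, content_type = encode_multipart(fields, "sample.wav", audio)
    return urllib.request.Request(url, data=body, headers={"Content-Type": content_type}, method="POST")


# Sending requires a running funasr-server:
# with urllib.request.urlopen(build_request("sample.wav")) as resp:
#     print(resp.read().decode())
```

Since the endpoint is described as OpenAI-compatible, an OpenAI client library pointed at the local base URL may work as well; that route is not shown here because it has not been verified against this server.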

CPU / edge (no GPU, no Python): run Fun-ASR-Nano / SenseVoice / Paraformer via llama.cpp / GGUF — a single self-contained binary, like whisper.cpp. See runtime/llama.cpp/.

OpenAI API example → · Gradio demo → · Client recipes → · JavaScript/TypeScript recipes → · Kubernetes template → · Workflow recipes → · Postman collection → · OpenAPI spec → · Security guide → · Deployment matrix → · Deployment docs → · Agent integration →


Community

📖 Documentation · 🐛 Issues · 💬 Discussions · 🤗 HuggingFace · 🤝 Contributing · 📈 20k growth plan

License

MIT License

Citations

@inproceedings{gao2023funasr,
  author={Zhifu Gao and others},
  title={FunASR: A Fundamental End-to-End Speech Recognition Toolkit},
  booktitle={INTERSPEECH},
  year={2023}
}

Collected info

  • 18,355 stars
  • 1,867 forks
  • Language: Python
  • Source updated: 6/20/2026

Config for your environment

Replace {MCP_ENDPOINT_URL} with this MCP’s endpoint URL (from its repo or docs above). No API key — you connect directly.

Config file: ~/.cursor/mcp.json

{
  "mcpServers": {
    "mcp-server": {
      "url": "{MCP_ENDPOINT_URL}"
    }
  }
}

Paste into mcpServers in the config file. Restart Cursor after saving.
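If the config file already lists other servers, pasting by hand risks overwriting them. As a hedged sketch, the entry can be merged programmatically instead. `add_mcp_server` is a hypothetical helper, the server name `mcp-server` and the `{MCP_ENDPOINT_URL}` placeholder are taken from the config above, and the demo writes to a temporary directory rather than the real config file.

```python
import json
import tempfile
from pathlib import Path


def add_mcp_server(config_path: Path, name: str, url: str) -> dict:
    """Insert or update one entry under "mcpServers", keeping existing entries."""
    config = {}
    if config_path.exists():
        text = config_path.read_text(encoding="utf-8")
        config = json.loads(text) if text.strip() else {}
    config.setdefault("mcpServers", {})[name] = {"url": url}
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Demo on a throwaway file that already contains another (made-up) server.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "mcp.json"
    path.write_text('{"mcpServers": {"other": {"url": "http://example.invalid"}}}', encoding="utf-8")
    merged = add_mcp_server(path, "mcp-server", "{MCP_ENDPOINT_URL}")
print(sorted(merged["mcpServers"]))
```

For Cursor, the same call would target `Path.home() / ".cursor" / "mcp.json"` with the real endpoint URL in place of the placeholder.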

If this MCP is also published on mcpchannel.ai, you can subscribe from Browse and use the gateway config there instead.