WhisperPlus: A Complete Practical Guide to Speech-to-Text and Text-to-Speech

白来存 · Published 2025-06-27 09:12:46

Project repository (whisper-plus, "WhisperPlus: Advancing Speech-to-Text Processing"): https://gitcode.com/gh_mirrors/wh/whisper-plus

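Project Overview

WhisperPlus is a speech-processing toolkit built on deep learning models. It covers both directions of the pipeline: speech-to-text (ASR) and text-to-speech (TTS). The project brings together several models and techniques, including:

Speech recognition models
Text summarization
Speaker diarization
A chatbot for interacting with video content
Several quantization options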

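Installation and Setup

Before using WhisperPlus, install the package. The commands below assume a Jupyter or Colab notebook, where the ! prefix runs a shell command, and nest_asyncio lets asyncio code run inside the notebook's already-running event loop: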

!pip install -U whisperplus
import nest_asyncio 
nest_asyncio.apply()

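If you need the optional features, such as the flash_attention_2=True setting used in the transcription example, install these extras: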

!pip install whisperplus transformers
!pip install flash-attn --no-build-isolation
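
Before moving on, it can help to confirm that the packages actually import in the current environment. The snippet below is a minimal sketch that uses only the standard library. The helper name check_packages is made up for this guide, and the module names (whisperplus, transformers, flash_attn, nest_asyncio) are the import names assumed to correspond to the pip packages above.

```python
from importlib.util import find_spec


def check_packages(module_names):
    """Return {module_name: bool} telling whether each module can be imported."""
    return {name: find_spec(name) is not None for name in module_names}


if __name__ == "__main__":
    # Import names assumed to match the pip packages installed above.
    status = check_packages(["whisperplus", "transformers", "flash_attn", "nest_asyncio"])
    for name, ok in status.items():
        print(f"{name}: {'installed' if ok else 'missing'}")
```

A missing flash_attn only matters if you plan to pass flash_attention_2=True; the other features do not depend on it.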

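Core Features

1. Converting Online Video to Audio and Transcribing It

WhisperPlus can download the audio track of an online video and transcribe it: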

from whisperplus import SpeechToTextPipeline, download_online_video_to_mp3
from transformers import HqqConfig

# Download the video's audio as an MP3 file
url = "https://example.com/video/di3rHkEZuUw"
audio_path = download_online_video_to_mp3(url, output_dir="downloads", filename="test")

# Configure HQQ quantization (4-bit weights, group size 64)
hqq_config = HqqConfig(
    nbits=4,
    group_size=64,
    quant_zero=False,
    quant_scale=False,
    axis=0,
    offload_meta=False,
)

# Build the speech recognition pipeline
pipeline = SpeechToTextPipeline(
    model_id="distil-whisper/distil-large-v3",
    quant_config=hqq_config,
    flash_attention_2=True,
)

# Run the transcription
transcript = pipeline(
    audio_path=audio_path,
    chunk_length_s=30,
    stride_length_s=5,
    max_new_tokens=128,
    batch_size=100,
    language="english",
    return_timestamps=False,
)
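
The chunk_length_s and stride_length_s arguments control how long audio is cut into pieces before decoding. As a rough illustration (this is not WhisperPlus's own code), the sketch below computes overlapping windows under the assumption that the stride applies to both sides of each chunk, which is how chunked ASR pipelines commonly work. Each window is chunk_length_s long and the next one starts chunk_length_s minus twice the stride later, so neighbouring windows share audio at their edges. The function name chunk_windows is hypothetical.

```python
def chunk_windows(total_s, chunk_length_s=30.0, stride_length_s=5.0):
    """Return (start, end) times in seconds of overlapping windows covering the audio."""
    step = chunk_length_s - 2 * stride_length_s
    if step <= 0:
        raise ValueError("stride_length_s must be less than half of chunk_length_s")

    windows = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_length_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break  # the last window already reaches the end of the audio
        start += step
    return windows


if __name__ == "__main__":
    for start, end in chunk_windows(70):
        print(f"{start:5.1f}s -> {end:5.1f}s")
```

With the settings used above (30-second chunks, 5-second stride), a 70-second clip produces windows starting at 0 s, 20 s, and 40 s.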

2. Apple MLX Support

A version optimized for the MLX framework on Apple devices:

from whisperplus.pipelines import mlx_whisper
from whisperplus import download_online_video_to_mp3

url = "https://example.com/video/1__CAdTJ5JU"
audio_path = download_online_video_to_mp3(url)

text = mlx_whisper.transcribe(
    audio_path, path_or_hf_repo="mlx-community/whisper-large-v3-mlx"
)["text"]
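
Since MLX targets Apple silicon, a script that has to run on several machines may want to check the platform before choosing this path. The following is a small sketch using only the standard library. The function names and the "mlx" / "transformers" labels are hypothetical and not part of WhisperPlus.

```python
import platform


def is_apple_silicon():
    """True when running natively on an ARM-based Mac."""
    return platform.system() == "Darwin" and platform.machine() == "arm64"


def pick_backend():
    """Choose a label for which transcription path to use."""
    return "mlx" if is_apple_silicon() else "transformers"


if __name__ == "__main__":
    print(f"Selected backend: {pick_backend()}")
```

Note that a Python interpreter running under Rosetta reports x86_64, so this check would select the non-MLX path there.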

3. Text Summarization

WhisperPlus provides two summarization pipelines:

Basic summarization:

from whisperplus.pipelines.summarization import TextSummarizationPipeline

summarizer = TextSummarizationPipeline(model_id="facebook/bart-large-cnn")
summary = summarizer.summarize(transcript)

Summarization with long-text support:

from whisperplus.pipelines.long_text_summarization import LongTextSummarizationPipeline

summarizer = LongTextSummarizationPipeline(model_id="facebook/bart-large-cnn")
summary_text = summarizer.summarize(transcript)
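
A common way to summarize text that exceeds a model's input limit is to split it into pieces, summarize each piece, and join the partial summaries. Whether LongTextSummarizationPipeline works exactly this way is not stated in this guide, so treat the sketch below as an illustration of the general idea only. The names split_into_chunks and summarize_long, the 400-word budget, and the summarize_fn callback are all assumptions.

```python
def split_into_chunks(text, max_words=400):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_long(text, summarize_fn, max_words=400):
    """Summarize each chunk with summarize_fn and join the partial summaries."""
    partial = [summarize_fn(chunk) for chunk in split_into_chunks(text, max_words)]
    return " ".join(partial)


if __name__ == "__main__":
    # Stand-in summarizer: keeps the first five words of each chunk.
    def first_words(chunk):
        return " ".join(chunk.split()[:5])

    long_text = "word " * 1000
    print(summarize_long(long_text, first_words))
```

In practice summarize_fn would wrap a real model call; splitting on sentence boundaries instead of raw word counts usually gives cleaner chunks.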

4. Speaker Diarization

Identify what each speaker says in the audio:

from whisperplus.pipelines.whisper_diarize import ASRDiarizationPipeline
from whisperplus import download_online_video_to_mp3, format_speech_to_dialogue

audio_path = download_online_video_to_mp3("https://example.com/video/mRB14sFHw2E")

pipeline = ASRDiarizationPipeline.from_pretrained(
    asr_model="openai/whisper-large-v3",
    diarizer_model="pyannote/speaker-diarization-3.1",
    use_auth_token=False,
    chunk_length_s=30,
    device="cuda",
)

output_text = pipeline(audio_path, num_speakers=2)
dialogue = format_speech_to_dialogue(output_text)
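
To make the idea of turning diarized segments into a dialogue concrete, here is a minimal sketch. It does not reproduce format_speech_to_dialogue. It assumes each segment is a dict with "speaker", "text", and a "timestamp" pair of (start, end) seconds, which is an assumption about the output shape. The helper names merge_turns and render_dialogue are hypothetical.

```python
def merge_turns(segments):
    """Merge consecutive segments from the same speaker into one turn."""
    turns = []
    for seg in segments:
        text = seg["text"].strip()
        start, end = seg["timestamp"]
        if turns and turns[-1]["speaker"] == seg["speaker"]:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["text"] += " " + text
            turns[-1]["end"] = end
        else:
            turns.append({"speaker": seg["speaker"], "text": text, "start": start, "end": end})
    return turns


def render_dialogue(turns):
    """Format turns as one line per speaker change, with time ranges."""
    return "\n".join(
        f"[{t['start']:.1f}s-{t['end']:.1f}s] {t['speaker']}: {t['text']}" for t in turns
    )


if __name__ == "__main__":
    # Made-up segments in the assumed shape.
    demo = [
        {"speaker": "SPEAKER_00", "text": " Hi", "timestamp": (0.0, 1.0)},
        {"speaker": "SPEAKER_00", "text": " there", "timestamp": (1.0, 2.5)},
        {"speaker": "SPEAKER_01", "text": " Hello", "timestamp": (2.5, 3.0)},
    ]
    print(render_dialogue(merge_turns(demo)))
```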

5. Chatting with Video Content (RAG)

LanceDB-based implementation:

from whisperplus.pipelines.chatbot import ChatWithVideo

chat = ChatWithVideo(
    input_file="transcript.txt",
    llm_model_name="TheBloke/Mistral-7B-v0.1-GGUF",
    llm_model_file="mistral-7b-v0.1.Q4_K_M.gguf",
    llm_model_type="mistral",
    embedding_model_name="sentence-transformers/all-MiniLM-L6-v2",
)

response = chat.run_query("what is this video about ?")
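
The retrieval step in a RAG setup picks the transcript passages most relevant to the question and hands them to the language model. ChatWithVideo's internals are not shown in this guide, so the sketch below is only a toy stand-in: it replaces the embedding model and LanceDB with word-count vectors and cosine similarity from the standard library. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, passages, top_k=1):
    """Return the top_k passages most similar to the query."""
    query_vec = Counter(tokenize(query))
    ranked = sorted(
        passages,
        key=lambda p: cosine(query_vec, Counter(tokenize(p))),
        reverse=True,
    )
    return ranked[:top_k]


if __name__ == "__main__":
    transcript_passages = [
        "The speaker explains how neural networks learn.",
        "Lunch was pasta with tomato sauce.",
    ]
    print(retrieve("how do neural networks learn", transcript_passages))
```

A real pipeline would use dense embeddings (such as the all-MiniLM-L6-v2 model configured above) so that paraphrases match even without shared words.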

AutoLLM-based implementation:

from whisperplus.pipelines.autollm_chatbot import AutoLLMChatWithVideo

chat = AutoLLMChatWithVideo(
    input_file="input_dir",
    llm_model="gpt-3.5-turbo",
    llm_max_tokens="256",
    llm_temperature="0.1",
    embed_model="huggingface/BAAI/bge-large-zh",
)

response = chat.run_query("what is this video about ?")

6. Text-to-Speech (TTS)

from whisperplus.pipelines.text2speech import TextToSpeechPipeline

tts = TextToSpeechPipeline(model_id="suno/bark")
audio = tts(text="Hello World", voice_preset="v2/en_speaker_6")
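
The guide does not say what type the pipeline returns. If you end up with a sequence of float samples in the range -1.0 to 1.0, a sketch like the one below writes them to a 16-bit mono WAV file using only the standard library. The save_wav helper is hypothetical, and the 24000 Hz default is an assumption based on Bark's usual output rate, so check the actual return value and sample rate before relying on it.

```python
import math
import struct
import wave


def save_wav(samples, path, sample_rate=24000):
    """Write float samples in [-1.0, 1.0] to a 16-bit mono WAV file."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # guard against overflow
        frames += struct.pack("<h", int(clipped * 32767))

    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))


if __name__ == "__main__":
    # Stand-in for model output: one second of a 440 Hz tone.
    rate = 24000
    tone = [0.3 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
    save_wav(tone, "hello_world.wav", sample_rate=rate)
```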