Whisper Speech Recognition: Usage Notes
WhisperProcessor Speech-to-Text: Usage Notes
Model list:
1. Whisper multilingual model comparison
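| Model | Parameters | VRAM | Speed | Languages | Chinese WER | Use case |
|---|---|---|---|---|---|---|
| tiny | 39M | ~1GB | ⚡ Fastest | 99 | ~25% | Embedded devices / real-time use where low accuracy is acceptable |
| base | 74M | ~1GB | Fast | 99 | ~20% | Balance of speed and accuracy |
| small | 244M | ~2GB | Medium | 99 | ~15% | Mainstream servers / best value for money |
| medium | 769M | ~5GB | Slow | 99 | ~10% | Professional transcription (recommended main model for Chinese) |
| large (v1) | 1550M | ~10GB | 🐢 Slowest | 99 | ~8% | Research-grade needs |
| large-v2 | 1550M | ~10GB | 🐢 Slowest | 99 | ~7% | Best general-purpose model at the time |
| large-v3 | 1550M | ~10GB | 🐢 Slowest | 99+ | ~6.5% | Latest version (improved on low-resource languages) |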
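The table can be read as a lookup for choosing a model under a memory budget. The following is a minimal sketch, assuming the approximate VRAM figures from the table; `MODEL_VRAM_GB` and `pick_model` are hypothetical names used only for illustration and are not part of any Whisper API.

```python
# Approximate VRAM needs in GB, copied from the comparison table (rough figures).
MODEL_VRAM_GB = [
    ("tiny", 1), ("base", 1), ("small", 2), ("medium", 5), ("large-v3", 10),
]

def pick_model(vram_budget_gb: float) -> str:
    """Return the largest listed model whose approximate VRAM need fits the budget."""
    fitting = [name for name, need in MODEL_VRAM_GB if need <= vram_budget_gb]
    if not fitting:
        raise ValueError(f"no listed model fits in {vram_budget_gb} GB")
    return fitting[-1]  # the list is ordered from smallest to largest

print(pick_model(6))   # a 6 GB card fits "medium" but not the large models
print(pick_model(24))
```

Real memory use also depends on precision, batch size and audio length, so the figures should be treated as a starting point only.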
Model checkpoint URLs:
_MODELS = {
"tiny.en": "https://openaipublic.azureedge.net/main/whisper/models/d3dd57d32accea0b295c96e26691aa14d8822fac7d9d27d5dc00b4ca2826dd03/tiny.en.pt",
"tiny": "https://openaipublic.azureedge.net/main/whisper/models/65147644a518d12f04e32d6f3b26facc3f8dd46e5390956a9424a650c0ce22b9/tiny.pt",
"base.en": "https://openaipublic.azureedge.net/main/whisper/models/25a8566e1d0c1e2231d1c762132cd20e0f96a85d16145c3a00adf5d1ac670ead/base.en.pt",
"base": "https://openaipublic.azureedge.net/main/whisper/models/ed3a0b6b1c0edf879ad9b11b1af5a0e6ab5db9205f891f668f8b0e6c6326e34e/base.pt",
"small.en": "https://openaipublic.azureedge.net/main/whisper/models/f953ad0fd29cacd07d5a9eda5624af0f6bcf2258be67c92b79389873d91e0872/small.en.pt",
"small": "https://openaipublic.azureedge.net/main/whisper/models/9ecf779972d90ba49c06d968637d720dd632c55bbf19d441fb42bf17a411e794/small.pt",
"medium.en": "https://openaipublic.azureedge.net/main/whisper/models/d7440d1dc186f76616474e0ff0b3b6b879abc9d1a4926b7adfa41db2d497ab4f/medium.en.pt",
"medium": "https://openaipublic.azureedge.net/main/whisper/models/345ae4da62f9b3d59415adc60127b97c714f32e89e936602e85993674d08dcb1/medium.pt",
"large-v1": "https://openaipublic.azureedge.net/main/whisper/models/e4b87e7e0bf463eb8e6956e646f1e277e901512310def2c24bf0e11bd3c28e9a/large-v1.pt",
"large-v2": "https://openaipublic.azureedge.net/main/whisper/models/81f7c96c852ee8fc832187b0132e569d6c3065a3252ed18e56effd0b6a73e524/large-v2.pt",
"large-v3": "https://openaipublic.azureedge.net/main/whisper/models/e5b1a55b89c1367dacf97e3e19bfd829a01529dbfdeefa8caeb59b3f1b81dadb/large-v3.pt",
"large": "https://openaipublic.azureedge.net/main/whisper/models/e5b1a55b89c1367dacf97e3e19bfd829a01529dbfdeefa8caeb59b3f1b81dadb/large-v3.pt",
"large-v3-turbo": "https://openaipublic.azureedge.net/main/whisper/models/aff26ae408abcba5fbf8813c21e62b0941638c5f6eebfb145be0c9839262a19a/large-v3-turbo.pt",
"turbo": "https://openaipublic.azureedge.net/main/whisper/models/aff26ae408abcba5fbf8813c21e62b0941638c5f6eebfb145be0c9839262a19a/large-v3-turbo.pt",
}
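Each URL in the dictionary carries a 64-character hex string as its second-to-last path segment. This is the SHA-256 digest of the checkpoint file, which makes it possible to verify a download. A minimal sketch of such a check, using only the standard library; `expected_sha256` and `file_matches` are hypothetical helper names, not functions from the whisper package.

```python
import hashlib

def expected_sha256(url: str) -> str:
    """Return the second-to-last path segment of a checkpoint URL (the SHA-256 hex digest)."""
    return url.split("/")[-2]

def file_matches(path: str, url: str) -> bool:
    """Hash the file in 1 MiB chunks and compare with the digest taken from the URL."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256(url)

url = ("https://openaipublic.azureedge.net/main/whisper/models/"
       "65147644a518d12f04e32d6f3b26facc3f8dd46e5390956a9424a650c0ce22b9/tiny.pt")
print(expected_sha256(url))
```

Calling `file_matches("tiny.pt", url)` on a downloaded file would then tell whether the file is complete and uncorrupted.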
large-v3-turbo
Source: "Deploying Whisper-large-v3-turbo on Linux" (CSDN blog)
| Metric | Whisper-large-v3 | v3-turbo | Change |
|---|---|---|---|
| LibriSpeech WER | 3.1% | 2.9% | ↓6.5% |
| Inference speed (words/s) | 78 | 624 | 8x |
| VRAM (FP16) | 10.4GB | 5.8GB | ↓44% |
| Multilingual average WER | 8.7% | 7.2% | ↓17% |
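The last column of the table is derived from the two columns before it. The short sketch below recomputes it, taking the figures exactly as reported in the cited post (they are not measured here); `relative_change` is a hypothetical helper name.

```python
def relative_change(old: float, new: float) -> float:
    """Signed relative change from old to new, as a percentage."""
    return (new - old) / old * 100

rows = {
    "LibriSpeech WER (%)": (3.1, 2.9),
    "VRAM, FP16 (GB)": (10.4, 5.8),
    "Multilingual average WER (%)": (8.7, 7.2),
}
for name, (old, new) in rows.items():
    print(f"{name}: {relative_change(old, new):+.1f}%")

# The speed row is a ratio rather than a percentage change.
print(f"speed-up: {624 / 78:.0f}x")
```

The computed values agree with the table's rounded percentages and with the 8x speed-up.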
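- Parameter efficiency: the parameter count is cut from 1.55 billion to 809 million while accuracy stays close to Whisper-large-v3.
- Where the savings come from: the cited post attributes them to a 16-layer depthwise-separable convolution + Transformer hybrid encoder (40% less compute than a pure Transformer) and to dynamic sparse activation (a gate that activates only 30% of the neurons during decoding). OpenAI's own description of the turbo model is different: it is large-v3 with the decoder reduced from 32 layers to 4 and then fine-tuned, with the encoder left unchanged.
- Resource usage, as claimed in the post:
  - Mixed-precision training: AMP (Automatic Mixed Precision) reduces training memory from 32GB to 18GB.
  - Quantization: QAT (Quantization-Aware Training) enables INT8 inference, about 3x faster on CPU.
The snippet given in the post does not quantize anything: BetterTransformer only swaps in fused attention kernels, and it expects a Hugging Face transformers model rather than the object returned by whisper.load_model. A corrected version of what the snippet actually does:
from optimum.bettertransformer import BetterTransformer
from transformers import WhisperForConditionalGeneration

# BetterTransformer works on Hugging Face models, so load the transformers checkpoint.
# Note: BetterTransformer may be unavailable in newer optimum releases.
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3-turbo")
model = BetterTransformer.transform(
    model,
    keep_original_model=False,  # convert in place instead of keeping a copy of the original
)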
Fine-tuning notes
Fine-tuning covers two tasks: language identification (LID) and speech transcription.
How WhisperWithLIDTrainer handles language identification and transcription
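Audio preprocessing
import torch
import torch.nn.functional as F
from whisper.audio import HOP_LENGTH, N_FFT, mel_filters


def log_mel_spectrogram(audio, n_mels=80, padding=0, device=None):
    # Move the audio tensor to the target device (e.g. a GPU) if one is given
    if device is not None:
        audio = audio.to(device)
    # Zero-pad the end of the audio if padding is requested
    if padding > 0:
        audio = F.pad(audio, (0, padding))  # (0, padding): append `padding` values at the end
    # Create a Hann window on the same device as the audio
    window = torch.hann_window(N_FFT).to(audio.device)
    # Short-time Fourier transform (STFT), returned as a complex tensor
    stft = torch.stft(audio, N_FFT, HOP_LENGTH, window=window, return_complex=True)
    # Power spectrum (squared magnitude); the last frame is dropped
    magnitudes = stft[..., :-1].abs() ** 2
    # Mel filterbank, created on the same device as the audio
    filters = mel_filters(audio.device, n_mels)
    # Project the power spectrum onto the mel filterbank to get the mel spectrogram
    mel_spec = filters @ magnitudes
    # Take log10, clamping from below to avoid log(0)
    log_spec = torch.clamp(mel_spec, min=1e-10).log10()
    # Limit the dynamic range: nothing may be more than 8.0 below the maximum
    log_spec = torch.maximum(log_spec, log_spec.max() - 8.0)
    # Shift and scale; the result spans a range of width 2, roughly [-1, 1]
    log_spec = (log_spec + 4.0) / 4.0
    return log_spec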
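The last three steps of the preprocessing (log, dynamic-range limit, rescale) are easy to follow on a few toy numbers, and the hop length explains where the 3000-frame input size comes from. A minimal pure-Python sketch, assuming the constants used by openai-whisper (16 kHz audio, hop length 160, 30 s windows); `normalize` is a hypothetical helper that mirrors those steps on a flat list rather than a tensor.

```python
import math

SAMPLE_RATE = 16000   # Hz, the rate Whisper expects
HOP_LENGTH = 160      # samples between frames, i.e. 10 ms
CHUNK_SECONDS = 30    # Whisper works on 30 s windows

# Number of spectrogram frames in one 30 s window
n_frames = CHUNK_SECONDS * SAMPLE_RATE // HOP_LENGTH

def normalize(mel_power):
    """Clamp, take log10, floor at (max - 8), then apply (x + 4) / 4 to a flat list of values."""
    log_spec = [math.log10(max(v, 1e-10)) for v in mel_power]
    floor = max(log_spec) - 8.0
    log_spec = [max(v, floor) for v in log_spec]
    return [(v + 4.0) / 4.0 for v in log_spec]

print(n_frames)
print(normalize([1.0, 1e-4, 1e-12]))
```

With a maximum power of 1.0 the loudest bin maps to 1 and anything 80 dB or more below it maps to -1, which is where the "roughly [-1, 1]" range comes from.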
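Parameter notes
Disable forced decoding so that the language and task are detected automatically:
model.config.forced_decoder_ids = None
- Disable forced decoding: the model generates freely with no forced tokens, so the language and task tokens are predicted from the input audio.
- When to use: when you want the model to pick the language and task itself, for example when feeding in audio without presetting a language.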
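To make the effect of forcing concrete, here is a toy decoder loop. It is only an illustration under simplifying assumptions: real `forced_decoder_ids` are pairs of (position, integer token ID), readable strings are used here instead, and a real model's later choices would also change once an earlier token is forced. `toy_decode` is a hypothetical function, not a transformers API.

```python
def toy_decode(model_choices, forced_decoder_ids=None):
    """Toy decoder loop.

    model_choices[i] is the token the model would pick at position i + 1;
    a forced entry for that position overrides the model's pick.
    """
    forced = dict(forced_decoder_ids or [])
    tokens = ["<|startoftranscript|>"]  # position 0
    for choice in model_choices:
        position = len(tokens)
        tokens.append(forced.get(position, choice))
    return tokens

# What a hypothetical model would choose by itself for some audio clip
model_choices = ["<|en|>", "<|transcribe|>", "<|notimestamps|>", "hello"]

free = toy_decode(model_choices, forced_decoder_ids=None)
forced = toy_decode(model_choices, forced_decoder_ids=[(1, "<|zh|>"), (2, "<|transcribe|>")])
print(free)
print(forced)
```

With no forced entries the language token is whatever the model predicts; with forced entries positions 1 and 2 are fixed regardless of the audio.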
Computing accuracy
Accuracy is computed on the transcribed text, by comparing it with the reference transcript.
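The notes do not say which text metric is used. For Chinese a common choice is the character error rate (CER), the edit distance between reference and hypothesis divided by the reference length. The sketch below assumes that metric; `edit_distance` and `cer` are hypothetical helper names.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,              # deletion
                cur[j - 1] + 1,           # insertion
                prev[j - 1] + (r != h),   # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)

# One substituted character out of four
print(cer("你好世界", "你好视界"))
```

Accuracy in this sense is then `1 - cer(...)`. Applying the same function to lists of words instead of strings gives a WER.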
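Training loss
The transcript is converted to tokens, and the model is trained as a per-token classification task over the vocabulary (cross-entropy).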
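As a worked example of per-token classification, the sketch below computes the mean cross-entropy over a short label sequence. It assumes the usual PyTorch convention that padded label positions are marked with -100 and skipped; the notes do not state this, and `token_cross_entropy` is a hypothetical name.

```python
import math

IGNORE_INDEX = -100  # assumed padding label (the common PyTorch cross-entropy convention)

def token_cross_entropy(logits, labels):
    """Mean negative log-likelihood of the label tokens, skipping IGNORE_INDEX positions."""
    losses = []
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue
        log_norm = math.log(sum(math.exp(x) for x in row))  # log of the softmax denominator
        losses.append(log_norm - row[label])
    return sum(losses) / len(losses)

# Three decoder positions over a toy 4-token vocabulary; the last position is padding.
logits = [[2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 3.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
labels = [0, 2, IGNORE_INDEX]
print(token_cross_entropy(logits, labels))
```

Each position is a classification over the whole vocabulary, and the loss falls as the logit of the correct token grows relative to the others.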
Fine-tuning notes
Data loading and splitting; setting the language and prefix tokens
processor = WhisperProcessor.from_pretrained(args.base_model,
                                             language=args.language,
                                             task=args.task,
                                             no_timestamps=not args.timestamps,
                                             local_files_only=args.local_files_only)
sample, sample_rate, transcript, language = self._get_list_data(idx=idx)
# The language can be set per sample
self.processor.tokenizer.set_prefix_tokens(language=language if language is not None else self.language)
# Compute the log-Mel features and the label IDs
data = self.processor(audio=sample, sampling_rate=self.sample_rate, text=transcript)
Empty samples
# If there is no transcript, label the sample with the <|nospeech|> token
data = self.processor(audio=sample, sampling_rate=self.sample_rate)
data['labels'] = [self.startoftranscript, self.nospeech, self.endoftext]
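WhisperProcessor: preprocessing and tokenizer
WhisperProcessor is a utility class provided by Hugging Face. It takes no part in training and wraps two core components:
- Feature Extractor: converts raw audio into a log-Mel spectrogram, which is the model's input feature.
  - It applies classic audio processing steps: the short-time Fourier transform (STFT) followed by a mel filterbank.
  - Its parameters (number of mel filters, frame length, hop length) are fixed hyperparameters that Whisper was pretrained with. They are not learned, and they must stay the same at fine-tuning and inference time.
- Tokenizer: encodes text labels into token IDs, and decodes the token IDs output by the model back into text.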
WhisperProcessor usage example
import soundfile
from addict import Dict
from transformers import WhisperProcessor


def slice_from_file(file, start, end):
    """Read the [start, end] span (in seconds) of an audio file; negative values count from the end."""
    sndfile = soundfile.SoundFile(file)
    sample_rate = sndfile.samplerate
    duration = round(float(len(sndfile)) / sample_rate, 3)
    start = round(start, 3)
    end = round(end, 3)
    # Negative positions count from the end of the file
    if start < 0.0: start += duration
    if end < 0.0: end += duration
    # Clamp to the bounds of the file
    if start < 0.0: start = 0.0
    if end > duration: end = duration
    if end < 0.0:
        raise ValueError("slice end (%f s) is out of bounds" % end)
    if start > end:
        raise ValueError("slice start (%f s) is later than slice end (%f s)" % (start, end))
    start_frame = int(start * sample_rate)
    end_frame = int(end * sample_rate)
    sndfile.seek(start_frame)
    sample = sndfile.read(frames=end_frame - start_frame, dtype='float32')
    sndfile.close()
    # Down-mix multi-channel audio to mono, which is what the processor expects
    if sample.ndim > 1:
        sample = sample.mean(axis=1)
    return sample, sample_rate


args = Dict()
args.base_model = "models/whisper-large-v3-finetune"
args.language = 'zh'
args.task = "transcribe"  # one of 'transcribe', 'translate'
args.timestamps = False
args.local_files_only = True
processor = WhisperProcessor.from_pretrained(args.base_model,
                                             language=args.language,
                                             task=args.task,
                                             no_timestamps=not args.timestamps,
                                             local_files_only=args.local_files_only)
file = "/nas/lbg/project/Whisper-Finetune/eval_data/am/liveaudio (5).mp3"
start = 0
end = 500
audio, sample_rate = slice_from_file(file, start, end)
# Audio → log-Mel spectrogram. Pass the file's real sample rate: the feature extractor
# rejects anything other than 16 kHz, and by default it keeps only the first 30 s of the input.
inputs1 = processor(audio, sampling_rate=sample_rate, return_tensors="pt").input_features
# Text → token IDs
text = "你好,世界"
inputs = processor.tokenizer(
    text,
    return_tensors="pt",  # return PyTorch tensors
    padding=True,         # pad to the longest sequence in the batch
    truncation=True       # truncate to the maximum length
)
labels = inputs.input_ids  # token IDs
print("Token ID:", labels)  # shape [1, seq_len]: prefix special tokens, text tokens, <|endoftext|>
print("Decoded:", processor.tokenizer.decode(labels[0]))  # check the round trip
# Token IDs (e.g. model output) → text
decoded_text = processor.batch_decode(labels, skip_special_tokens=True)
print("batch_decode", decoded_text)
WhisperTokenizer usage demo
import torch
from addict import Dict
from transformers import WhisperTokenizer


class CustomWhisperTokenizer:
    def __init__(self, args, task="transcribe"):
        language = args.language
        base_model = args.base_model
        self.tokenizer = WhisperTokenizer.from_pretrained(base_model)
        self.language = language
        self.task = task

    def encode(self, text: str) -> torch.Tensor:
        """Convert text to token IDs, with the special tokens added by hand."""
        # Add the language and task markers (e.g. "<|en|>", "<|transcribe|>")
        full_text = f"<|startoftranscript|><|{self.language}|><|{self.task}|>{text}<|endoftext|>"
        # add_special_tokens=False keeps the tokenizer from adding its own prefix and <|endoftext|> again
        input_ids = self.tokenizer(full_text, add_special_tokens=False).input_ids
        return torch.tensor(input_ids).unsqueeze(0)  # [1, seq_len]

    def decode(self, token_ids: torch.Tensor) -> str:
        """Convert token IDs back to text, dropping the special tokens."""
        text = self.tokenizer.decode(token_ids[0], skip_special_tokens=True)
        return text


args = Dict()
args.base_model = "models/whisper-large-v3-finetune"
args.language = 'zh'
# Usage example
tokenizer = CustomWhisperTokenizer(args, task="transcribe")
input_ids = tokenizer.encode("你好,世界")  # tensor of shape [1, seq_len], special tokens included
decoded_text = tokenizer.decode(input_ids)  # expected to give back "你好,世界"
print(input_ids)
print(decoded_text)
Speech recognition example
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Pick a model (other options: "openai/whisper-base", "openai/whisper-medium", "openai/whisper-large")
model_name = "openai/whisper-small"
# Load the model and the processor
print("Loading model and processor...")
processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name)
# Use the GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
# Load the audio file
audio_path = "example.wav"  # replace with your own file path
speech_array, sr = torchaudio.load(audio_path)  # shape [channels, samples]
# Resample to 16 kHz (required by Whisper)
if sr != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)
    speech_array = resampler(speech_array)
# Down-mix to a 1-D mono signal (required by Whisper); squeeze() alone would leave stereo files 2-D
speech = speech_array.mean(dim=0)
# Extract the audio features
print("Extracting audio features...")
inputs = processor(speech.numpy(), sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to(device)
# Add the language prompt (Chinese here; change language to "en"/"fr"/"ja" for other languages)
forced_decoder_ids = processor.get_decoder_prompt_ids(language="zh", task="transcribe")
# Inference (the model outputs token IDs)
print("Running inference...")
predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
# Decode to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print("Transcription:", transcription)
Speech recognition example 2
import soundfile
from addict import Dict
from transformers import WhisperForConditionalGeneration, WhisperProcessor


def slice_from_file(file, start, end):
    """Read the [start, end] span (in seconds) of an audio file; negative values count from the end."""
    sndfile = soundfile.SoundFile(file)
    sample_rate = sndfile.samplerate
    duration = round(float(len(sndfile)) / sample_rate, 3)
    start = round(start, 3)
    end = round(end, 3)
    # Negative positions count from the end of the file
    if start < 0.0: start += duration
    if end < 0.0: end += duration
    # Clamp to the bounds of the file
    if start < 0.0: start = 0.0
    if end > duration: end = duration
    if end < 0.0:
        raise ValueError("slice end (%f s) is out of bounds" % end)
    if start > end:
        raise ValueError("slice start (%f s) is later than slice end (%f s)" % (start, end))
    start_frame = int(start * sample_rate)
    end_frame = int(end * sample_rate)
    sndfile.seek(start_frame)
    sample = sndfile.read(frames=end_frame - start_frame, dtype='float32')
    sndfile.close()
    # Down-mix multi-channel audio to mono, which is what the processor expects
    if sample.ndim > 1:
        sample = sample.mean(axis=1)
    return sample, sample_rate


args = Dict()
args.model_path = "models/whisper-large-v3-finetune"
args.language = 'zh'
args.task = "transcribe"  # one of 'transcribe', 'translate'
args.timestamps = False
args.local_files_only = True
processor = WhisperProcessor.from_pretrained(args.model_path,
                                             language=args.language,
                                             task=args.task,
                                             no_timestamps=not args.timestamps,
                                             local_files_only=args.local_files_only)
model = WhisperForConditionalGeneration.from_pretrained(args.model_path,
                                                        # device_map=device_map,  # device_map="auto",
                                                        local_files_only=args.local_files_only)
file = "/nas/lbg/project/Whisper-Finetune/eval_data/am/liveaudio (5).mp3"
start = 0
end = 500
audio, sample_rate = slice_from_file(file, start, end)
# Audio → log-Mel spectrogram. Pass the file's real sample rate: the feature extractor
# rejects anything other than 16 kHz, and by default it keeps only the first 30 s of the input.
inputs1 = processor(audio, sampling_rate=sample_rate, return_tensors="pt").input_features
# Generate text
# inputs1: [batch_size, n_mels, 3000]; n_mels is 128 for large-v3 and 80 for earlier models
generated_ids = model.generate(input_features=inputs1)
# Decode to text
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print("Transcription:", transcription)
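The example slices up to 500 s of audio, but by default the Whisper feature extractor pads or truncates each input to a single 30 s window, so only the first 30 s would be transcribed. One simple way around this is to cut the audio into 30 s chunks and run the processor and `generate` on each chunk. A minimal sketch of the chunking step only; `split_into_chunks` is a hypothetical helper, and plain cuts at fixed positions may split a word at a chunk boundary.

```python
def split_into_chunks(samples, sample_rate=16000, chunk_seconds=30):
    """Split a 1-D sequence of samples into consecutive windows of at most chunk_seconds."""
    chunk_len = int(sample_rate * chunk_seconds)
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]

# 70 s of silence at 16 kHz gives windows of 30 s, 30 s and 10 s
audio = [0.0] * (70 * 16000)
chunks = split_into_chunks(audio)
print([len(c) / 16000 for c in chunks])
```

The same slicing works on a NumPy array returned by `slice_from_file`. The transcriptions of the individual chunks can then be concatenated. The transformers `pipeline("automatic-speech-recognition", ...)` offers a `chunk_length_s` argument that handles chunking with overlap.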