Qwen-Audio-Chat Multimodal Large Language Model Tutorial

Qwen-Audio: the official repo of Qwen-Audio (通义千问-Audio), the chat and pretrained large audio language model proposed by Alibaba Cloud. Project URL: https://gitcode.com/gh_mirrors/qw/Qwen-Audio

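Model Overview

Qwen-Audio-Chat is a multimodal large language model designed to process and understand many kinds of audio content. Unlike a conventional speech recognition system, it not only transcribes speech but also interprets environmental sounds and musical characteristics, and it supports multi-turn dialogue and cross-lingual processing. The model performs well on speech recognition, speech translation, environmental sound understanding, multimodal audio understanding, and speech grounding (locating words in time).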

Environment Setup

Before you begin, make sure the required Python libraries are installed:

pip install torch transformers

A CUDA-capable GPU is recommended for best performance.
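Before loading the model, it can help to confirm that the packages named above are importable. The following is a minimal sketch using only the standard library. The helper name `missing_packages` is hypothetical, and the list is limited to the two packages this tutorial installs; the model's remote code may require additional packages that are not listed here.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be found as importable modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Only the two packages installed by the pip command above are checked.
    missing = missing_packages(["torch", "transformers"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All checked packages are importable.")
```

Checking with `find_spec` avoids actually importing `torch`, which can be slow, so the check stays cheap.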

Model Initialization

Before using Qwen-Audio-Chat, initialize the tokenizer and the model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

# Optional: uncomment the next line to fix the random seed for reproducible results
# torch.manual_seed(1234)

# Initialize the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-Audio-Chat", trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-Audio-Chat", 
    device_map="cuda", 
    trust_remote_code=True
).eval()

# Load the generation config
model.generation_config = GenerationConfig.from_pretrained(
    "Qwen/Qwen-Audio-Chat", 
    trust_remote_code=True
)

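Core Features

1. Basic Speech Recognition

Qwen-Audio-Chat can recognize a wide range of speech content. The following example recognizes an English utterance:

# Prepare the audio and the question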
query = tokenizer.from_list_format([
    {'audio': 'assets/audio/1272-128104-0000.flac'},
    {'text': 'what is that sound?'},
])

# Get the model's response
response, history = model.chat(tokenizer, query=query, history=None)
print(response)

The model returns a result similar to the following:

The sound is of a man speaking, in English, saying, "mister quilter is the apostle of the middle classes and we are glad to welcome his gospel".

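2. Multi-Turn Dialogue and Timestamp Localization

The model supports follow-up questions that build on earlier turns, and it can locate the time span of a word or phrase in the speech:

# First turn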
query = tokenizer.from_list_format([
    {'audio': 'assets/audio/1272-128104-0000.flac'},
    {'text': 'what is that sound?'},
])
response, history = model.chat(tokenizer, query=query, history=None)

# Second turn: locate a specific phrase in time (history carries the first turn's context)
query = tokenizer.from_list_format([
    {'text': 'Find the start time and end time of the word "middle classes"'},
])
response, history = model.chat(tokenizer, query=query, history=history)
print(response)

Sample output:

The word "middle classes" starts at <|2.33|> seconds and ends at <|3.26|> seconds.
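To use these timestamps in downstream code, the `<|...|>` tokens have to be pulled out of the response text. The following is a minimal sketch that assumes responses follow the format of the sample output above. The names `extract_timestamps` and `extract_span` are hypothetical helpers, not part of the model's API.

```python
import re

# Matches timestamp tokens such as <|2.33|> as they appear in the sample output.
TIMESTAMP_PATTERN = re.compile(r"<\|(\d+(?:\.\d+)?)\|>")


def extract_timestamps(response):
    """Return every <|seconds|> token in `response` as a float, in order of appearance."""
    return [float(value) for value in TIMESTAMP_PATTERN.findall(response)]


def extract_span(response):
    """Return (start, end) from the first two timestamps, or None if fewer than two exist."""
    stamps = extract_timestamps(response)
    if len(stamps) < 2:
        return None
    return stamps[0], stamps[1]


sample = 'The word "middle classes" starts at <|2.33|> seconds and ends at <|3.26|> seconds.'
span = extract_span(sample)
if span is not None:
    start, end = span
    print(f"start={start}s end={end}s duration={end - start:.2f}s")
```

Returning `None` when fewer than two timestamps are found lets the caller handle responses where the model did not locate the phrase.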

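3. Multilingual Support

The model supports speech recognition in multiple languages, including Chinese, English, Japanese, Korean, German, Spanish, and Italian:

# Spanish recognition example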
query = tokenizer.from_list_format([
    {'audio': 'assets/audio/es.mp3'},
    {'text': 'Recognize the speech'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)

Sample output:

The speech is of a man speaking, in Spanish, saying, "Bueno, también podemos considerar algunas actividades divertidas como los deportes acuáticos.".
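The sample responses so far share a common shape: a description of the speaker, the detected language, and the transcript in double quotes. The sketch below splits such a response into its parts. It assumes that shape, which is inferred only from the sample outputs in this tutorial; the model produces free-form text and may phrase its answer differently, so the hypothetical helper `parse_asr_response` returns `None` when the pattern does not match.

```python
import re

# Assumed response shape, inferred from the sample outputs in this tutorial:
#   ... speaking, in <language>, saying, "<transcript>".
ASR_PATTERN = re.compile(
    r', in (?P<language>[^,]+), saying, "(?P<transcript>.*)"',
    re.DOTALL,
)


def parse_asr_response(response):
    """Return {'language', 'transcript'} if the response matches the assumed shape, else None."""
    match = ASR_PATTERN.search(response)
    if match is None:
        return None
    return {
        "language": match.group("language").strip(),
        "transcript": match.group("transcript"),
    }


sample = (
    'The speech is of a man speaking, in Spanish, saying, '
    '"Bueno, también podemos considerar algunas actividades divertidas '
    'como los deportes acuáticos.".'
)
print(parse_asr_response(sample))
```

The greedy `.*` runs to the last double quote, so a transcript that itself ends with a period, as in the Spanish sample, is captured whole.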

4. Dialect Recognition

The model can also recognize Chinese dialects, such as the Chongqing dialect:

query = tokenizer.from_list_format([
    {'audio': 'assets/audio/example-重庆话.wav'},
    {'text': 'Recognize the speech'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)

Sample output:

The speech is of a man speaking, in Southwestern Mandarin, saying, "对了我还想提议我们可以租一些自行车骑行一下既锻炼身体又心情愉悦".

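5. Environmental Sound Understanding and Reasoning

The model can identify environmental sounds and reason about them:

# Recognize the sound of breaking glass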
query = tokenizer.from_list_format([
    {'audio': 'assets/audio/glass-breaking-151256.mp3'},
    {'text': 'What is it'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)

Sample output:

This is a sound effect of breaking glass.

It can also suggest how to handle the situation based on the sound:

query = tokenizer.from_list_format([
    {'audio': 'assets/audio/glass-breaking-151256.mp3'},
    {'text': 'Recognize the sound and provide handling suggestions.'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)

6. Music Analysis

The model can analyze musical genre and instruments:

query = tokenizer.from_list_format([
    {'audio': 'assets/audio/music.wav'},
    {'text': 'what is the instrument'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)

Sample output:

The instrument is the piano.

7. Multiple Audio Inputs

The model can take several audio inputs in a single query and compare them:

query = tokenizer.from_list_format([
    {'audio': 'assets/audio/你没事吧-轻松.wav'},
    {'audio': 'assets/audio/你没事吧-消极.wav'},
    {'text': 'Is there any difference in the emotions of these two audio?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)

Sample output:

Based on the voice, it sounds like this person is happy in the first audio, but sad in the second audio.
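Every example in this tutorial passes `tokenizer.from_list_format` the same structure: zero or more `{'audio': ...}` entries followed by one `{'text': ...}` entry. The sketch below builds that list from a sequence of paths and a question. The helper name `build_query_items` is hypothetical, and the ordering (audio first, text last) simply mirrors the examples here rather than any documented requirement.

```python
def build_query_items(audio_paths, text):
    """Build the list of dicts used in this tutorial: audio entries first, then the text entry."""
    if not text:
        raise ValueError("text must be a non-empty question or instruction")
    items = [{"audio": path} for path in audio_paths]
    items.append({"text": text})
    return items


items = build_query_items(
    ["assets/audio/你没事吧-轻松.wav", "assets/audio/你没事吧-消极.wav"],
    "Is there any difference in the emotions of these two audio?",
)
print(items)

# With the tokenizer and model from the initialization step, the list would be used as:
#   query = tokenizer.from_list_format(items)
#   response, history = model.chat(tokenizer, query=query, history=None)
```

Passing an empty `audio_paths` yields a text-only query, which matches the follow-up turns shown earlier.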

Advanced Feature: Speech Grounding

Qwen-Audio-Chat supports word-level timestamp localization:

query = tokenizer.from_list_format([
    {'audio': 'assets/audio/1089_134686_000007_000004.wav'},
    {'text': 'Find the word "companionless"'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)

Sample output:

The word "companionless" starts at <|6.28|> seconds and ends at <|7.15|> seconds.

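It can also use semantic understanding to locate a particular kind of word. The query below reuses the history returned by the previous call: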

query = tokenizer.from_list_format([
    {'text': 'find the person name'},
])
response, history = model.chat(tokenizer, query=query, history=history)
print(response)

Sample output:

The person name "shelley's" is mentioned. "shelley's" starts at <|3.79|> seconds and ends at <|4.33|> seconds.
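Once a start and end time are known, one natural next step is to cut that span out of the recording. The sketch below does this with Python's standard `wave` module. It is an illustration under stated assumptions: the input must be an uncompressed PCM WAV file (other formats such as FLAC or MP3 need a different library), and the helper name `cut_wav_segment` is hypothetical.

```python
import wave


def cut_wav_segment(src_path, dst_path, start_s, end_s):
    """Copy the audio between start_s and end_s (in seconds) into dst_path.

    Returns the number of frames written. Times past the end of the file are clamped.
    """
    if start_s < 0 or end_s <= start_s:
        raise ValueError("require 0 <= start_s < end_s")
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        # round() avoids losing a frame to floating-point error in seconds * rate.
        first = min(round(start_s * rate), total)
        last = min(round(end_s * rate), total)
        src.setpos(first)
        frames = src.readframes(last - first)
    with wave.open(dst_path, "wb") as dst:
        # Reuse the source format; the frame count in the header is corrected on close.
        dst.setparams(params)
        dst.writeframes(frames)
    return last - first


# Example call with the timestamps from the sample output above (paths are illustrative):
#   cut_wav_segment("assets/audio/1089_134686_000007_000004.wav", "shelleys.wav", 3.79, 4.33)
```

Clamping to the file length keeps the function from failing when the reported end time slightly exceeds the actual duration.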

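Summary

Qwen-Audio-Chat is a versatile audio understanding model covering speech recognition, multilingual processing, environmental sound analysis, music understanding, and speech grounding. This tutorial has covered its basic usage; you can now try applying it to your own audio processing scenarios.

For more complex applications, consider the following:

  1. Provide clear audio input for the best recognition results
  2. Keep the context coherent across turns in multi-turn dialogue
  3. For specialized domains, consider domain-adaptation fine-tuning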
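For the second suggestion, one way to keep context coherent is to let a small wrapper own the `history` value instead of threading it through every call by hand. The following is a minimal sketch. The class `ChatSession` is hypothetical; it relies only on the two calls used throughout this tutorial, `tokenizer.from_list_format` and `model.chat(tokenizer, query=..., history=...)`.

```python
class ChatSession:
    """Store the history returned by model.chat and pass it into the next turn."""

    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.history = None  # None starts a fresh conversation, as in the examples above

    def ask(self, items):
        """Send one turn, given a list of {'audio': ...} / {'text': ...} dicts."""
        query = self.tokenizer.from_list_format(items)
        response, self.history = self.model.chat(
            self.tokenizer, query=query, history=self.history
        )
        return response

    def reset(self):
        """Drop the accumulated context so the next turn starts a new conversation."""
        self.history = None


# Usage with the model and tokenizer from the initialization step:
#   session = ChatSession(model, tokenizer)
#   session.ask([{'audio': 'assets/audio/1272-128104-0000.flac'}, {'text': 'what is that sound?'}])
#   session.ask([{'text': 'Find the start time and end time of the word "middle classes"'}])
```

Calling `reset()` before switching to an unrelated audio file avoids carrying earlier turns into the new conversation.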
