SenseVoice-small Use Case: Packaging a Voice-Interaction SDK for Smart Hardware and Integrating It on Android/iOS
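This article describes how to automate deployment of the "sensevoice-small-轻量级多任务语音模型的 ONNX 量化版WebUI V1.0" image (the ONNX-quantized WebUI V1.0 build of the lightweight multi-task speech model) on the 星图 GPU platform, as a starting point for quickly building an on-device speech recognition SDK. The image packages lightweight, offline, multilingual speech recognition. Its typical use case is giving smart hardware such as smart speakers and in-car devices real-time, privacy-preserving offline voice interaction.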
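1. Introduction: When Smart Hardware "Speaks Up"
Imagine a smart speaker at home that understands commands in your dialect, an in-car head unit that transcribes navigation prompts into text in real time, or a pair of smart glasses that quietly records what is said in a meeting. All of these scenarios rest on one core capability: offline, real-time, accurate speech recognition.
SenseVoice-small, the subject of this article, is one technology that can make these scenarios real. It is a lightweight speech recognition model that has been quantized in ONNX format, which keeps it small while remaining capable. More importantly, it is well suited to being packaged as an SDK and integrated into phones, tablets, embedded devices, and other smart hardware, so that the device can actually understand speech.
This article does not teach you how to use the model's web interface. Instead, it looks at how to turn the model into an integrable voice-interaction SDK and ship it on the two major mobile platforms, Android and iOS. Whether you are a smart-hardware developer, a mobile engineer, or an enthusiast interested in on-device AI, you should find practical ideas and approaches here.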
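2. Why Choose SenseVoice-small for On-Device Integration?
Before getting hands-on, one question needs answering: with so many speech recognition options available, why SenseVoice-small?
2.1 Core Strengths: Built for the Device
SenseVoice-small has several characteristics that make it a particularly good fit for smart hardware:
First, it really is "small"
- ONNX quantization and optimization shrink the model substantially compared with the unquantized weights
- Memory and storage requirements are low, so even older embedded devices may be able to run it
- That translates into lower hardware cost and a wider range of deployments
Second, it works "offline"
- No cloud server is needed; all computation happens on the device
- There is no network round trip, so latency depends only on local compute
- User audio never leaves the device, which protects privacy
Third, it is a "multi-tasker"
- Multilingual recognition: the SenseVoice family advertises more than 50 languages, and the language options used throughout this article are Chinese, English, Japanese, Korean, and Cantonese
- The language can be detected automatically, with no manual switching
- It includes emotion recognition and can estimate the speaker's emotional state
- It supports inverse text normalization (ITN), which turns the spoken form "一百二十" into "120"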
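To make the inverse text normalization bullet concrete, here is a toy sketch of the idea for Chinese integers below 10,000. It is not how SenseVoice implements ITN (the model handles that internally); the function name `chinese_to_int` and the lookup tables are hypothetical, and colloquial forms such as "一百二" (meaning 120) are deliberately not handled.

```python
# Toy inverse text normalization for Chinese integers below 10,000.
# This is NOT SenseVoice's ITN; it only illustrates what "spoken form to
# written form" means. All names here are hypothetical.
DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
UNITS = {"十": 10, "百": 100, "千": 1000}


def chinese_to_int(text: str) -> int:
    """Convert a spoken-form Chinese integer (< 10,000) to an int."""
    total = 0
    current = 0  # most recent digit, still waiting for its unit
    for ch in text:
        if ch in DIGITS:
            current = DIGITS[ch]
        elif ch in UNITS:
            # A bare unit, such as the leading "十" in "十五", means 1 x unit.
            total += (current or 1) * UNITS[ch]
            current = 0
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    # A trailing digit without a unit is the ones place.
    return total + current


if __name__ == "__main__":
    print(chinese_to_int("一百二十"))
```

A production ITN also has to cover dates, currency, decimals, and mixed-language text, which is why it is convenient that the model exposes it as a switch rather than leaving it to the application.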
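2.2 Typical Use Cases
Given these characteristics, SenseVoice-small fits a range of smart-hardware scenarios:
Scenario 1: Offline voice assistants
- Smart speakers and smart-home control hubs
- In-car voice control systems
- Voice control panels for industrial equipment
Scenario 2: Real-time captioning
- Live transcription on video-conferencing devices
- Classroom captions on education tablets
- Live caption streams on broadcasting devices
Scenario 3: Privacy-sensitive settings
- Voice-dictated medical records on medical devices
- Spoken command confirmation on financial devices
- Confidential meeting records for government and enterprise
Scenario 4: Low-resource environments
- Communication devices in remote areas
- Locations with poor mobile coverage
- Low-end devices with limited compute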
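3. SDK Design: From Model to Interface
Turning SenseVoice-small into a usable SDK takes several layers of wrapping. It is like fitting a powerful engine with a steering wheel, a throttle, and brakes so that developers can drive it easily.
3.1 Core Engine Layer: Wrapping Model Inference
The lowest layer is model inference. SenseVoice-small already ships as an ONNX model, and we need to wrap it in a uniform inference interface.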
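# Example: core inference wrapper on the Python side
# Note: the input names ("audio", "language"), the output order, and the
# language ids below are illustrative assumptions. Check
# session.get_inputs() and session.get_outputs() against the exported model.
import numpy as np
import onnxruntime as ort

# Same mapping as the Android and iOS code later in this article
LANGUAGE_IDS = {"auto": -1, "zh": 0, "en": 1, "ja": 2, "ko": 3, "yue": 4}


class SenseVoiceEngine:
    def __init__(self, model_path: str):
        """
        Initialize the speech recognition engine.
        :param model_path: path to the ONNX model file
        """
        self.session = ort.InferenceSession(model_path)
        self.sample_rate = 16000  # expected input sample rate

    def preprocess_audio(self, audio_data: np.ndarray) -> np.ndarray:
        """
        Preprocess audio: downmix to mono, peak-normalize, add a batch axis.
        The input must already be sampled at self.sample_rate; this method
        does not resample.
        """
        # 1. Downmix (samples, channels) audio to mono
        if audio_data.ndim > 1:
            audio_data = audio_data.mean(axis=1)
        # 2. Normalize to [-1, 1]
        audio_data = audio_data.astype(np.float32)
        peak = float(np.abs(audio_data).max()) if audio_data.size else 0.0
        if peak > 0:
            audio_data = audio_data / peak
        # 3. Add the batch dimension
        return np.expand_dims(audio_data, axis=0)

    def recognize(self, audio_data: np.ndarray,
                  language: str = "auto") -> dict:
        """
        Run speech recognition.
        :return: dict with text, language, emotion, and confidence
        """
        # Preprocess the audio
        processed_audio = self.preprocess_audio(audio_data)
        # Prepare inputs; the language string is mapped to an integer id,
        # since a string cannot be stored in an int64 array
        inputs = {
            "audio": processed_audio,
            "language": np.array([LANGUAGE_IDS.get(language, -1)],
                                 dtype=np.int64),
        }
        # Run inference
        outputs = self.session.run(None, inputs)
        # Parse the results (output order is an assumption)
        return {
            "text": outputs[0],        # recognized text
            "language": outputs[1],    # detected language
            "emotion": outputs[2],     # emotion analysis result
            "confidence": outputs[3],  # confidence score
        }

    def stream_recognize(self, audio_stream):
        """
        Streaming recognition interface (for real-time speech).
        """
        # Left as a stub in this sketch
        raise NotImplementedError("streaming is not implemented here")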
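This core engine provides several key capabilities:
- A uniform audio preprocessing pipeline
- A synchronous recognition interface (suited to file processing)
- A streaming recognition interface (suited to real-time use; left as a stub here)
- A complete result (text, language, emotion, confidence)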
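Since the streaming interface is only a stub, one way to feed it is to cut the incoming audio into fixed-size, overlapping windows and pass each window to the synchronous `recognize` call. The sketch below shows only that windowing step. The function name `sliding_windows` and its parameters are hypothetical, and the one-second window with a half-second hop simply mirrors the numbers used in the Android code later in the article.

```python
from typing import Iterable, Iterator, List

SAMPLE_RATE = 16000  # samples per second, as assumed throughout the article


def sliding_windows(chunks: Iterable[List[float]],
                    window: int = SAMPLE_RATE,
                    hop: int = SAMPLE_RATE // 2) -> Iterator[List[float]]:
    """Yield fixed-size windows from a stream of variable-size chunks.

    Consecutive windows overlap by (window - hop) samples, so a word that
    straddles a window boundary is seen in full by at least one window.
    """
    if not 0 < hop <= window:
        raise ValueError("hop must be in the range (0, window]")
    buffer: List[float] = []
    for chunk in chunks:
        buffer.extend(chunk)
        # Emit every complete window that the buffer now contains.
        while len(buffer) >= window:
            yield buffer[:window]
            del buffer[:hop]  # drop the oldest samples, keep the overlap


if __name__ == "__main__":
    # Tiny numbers so the behaviour is easy to follow by eye.
    stream = [[0.0, 1.0, 2.0, 3.0, 4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]
    for w in sliding_windows(stream, window=4, hop=2):
        print(w)
```

Unlike a "take one second, then clear" loop, this keeps any samples that arrive beyond the current window, so no audio is dropped when the capture callback delivers more than one window's worth at once.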
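3.2 Platform Adaptation Layer: Cross-Platform Abstraction
Different hardware platforms behave differently, so an adaptation layer is needed to hide those differences.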
# Platform abstraction interface
# Note: this is design pseudocode in Python syntax. AudioRecord, AudioFormat,
# AudioSource, and AVAudioEngine are platform classes (Android / iOS), not
# Python modules, so the two subclasses are not runnable as written.
class PlatformAdapter:
    """Base class for platform adapters"""

    def get_audio_input(self):
        """Return the audio input device"""
        raise NotImplementedError

    def allocate_buffer(self, size: int):
        """Allocate an audio buffer"""
        raise NotImplementedError

    def get_optimal_config(self) -> dict:
        """Return the recommended configuration for this platform"""
        raise NotImplementedError


# Android implementation
class AndroidAdapter(PlatformAdapter):
    def __init__(self):
        self.audio_manager = None

    def get_audio_input(self):
        # Uses Android's AudioRecord API
        config = {
            "source": AudioSource.MIC,
            "sample_rate": 16000,
            "channel_config": AudioFormat.CHANNEL_IN_MONO,
            "audio_format": AudioFormat.ENCODING_PCM_16BIT,
            "buffer_size": 4096
        }
        return AudioRecord(**config)

    def get_optimal_config(self):
        return {
            "threads": 4,           # suggested thread count
            "use_gpu": False,       # CPU is the usual choice on Android
            "memory_limit_mb": 100  # memory budget
        }


# iOS implementation
class IOSAdapter(PlatformAdapter):
    def __init__(self):
        self.audio_session = None

    def get_audio_input(self):
        # Uses AVAudioEngine
        engine = AVAudioEngine()
        input_node = engine.inputNode
        return {
            "engine": engine,
            "input_node": input_node
        }

    def get_optimal_config(self):
        return {
            "threads": 2,          # keep the thread count low on iOS
            "use_ane": True,       # use the Apple Neural Engine
            "memory_limit_mb": 50  # memory is tighter on iOS
        }
3.3 Application Interface Layer: A Developer-Friendly API
Finally, we need to expose a concise, easy-to-use API to application developers.
// Android SDK接口示例
public class SenseVoiceSDK {
// Singleton (synchronized so that concurrent first calls create only one instance)
private static SenseVoiceSDK instance;
public static synchronized SenseVoiceSDK getInstance() {
if (instance == null) {
instance = new SenseVoiceSDK();
}
return instance;
}
// 初始化SDK
public void initialize(Context context, String modelPath) {
// 加载模型
// 初始化音频系统
// 预热模型
}
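// File recognition
public RecognitionResult recognizeFile(String filePath,
RecognitionConfig config) {
// Read the audio file
// Call the recognition engine
// Return the result
throw new UnsupportedOperationException("not implemented in this interface sketch");
}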
// 实时识别
public void startRealtimeRecognition(RecognitionCallback callback) {
// 开始录音
// 实时处理音频流
// 通过回调返回结果
}
// 停止识别
public void stopRealtimeRecognition() {
// 停止录音
// 清理资源
}
// 配置接口
public void setLanguage(String language) {
// 设置识别语言
}
public void enableITN(boolean enable) {
// 启用/禁用逆文本标准化
}
}
// 回调接口
public interface RecognitionCallback {
void onPartialResult(String text); // 中间结果
void onFinalResult(RecognitionResult result); // 最终结果
void onError(int errorCode, String message); // 错误回调
}
// 结果对象
public class RecognitionResult {
private String text; // 识别文本
private String language; // 检测到的语言
private String emotion; // 情感分析
private float confidence; // 置信度
private long duration; // 音频时长
// getters and setters
}
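4. Android Integration in Practice: Letting Your App Understand the User
Now let's look at how to integrate this SDK into an Android app. I will use a concrete example, a voice notepad app, and walk through it step by step.
4.1 Environment Setup and Dependencies
First, configure the required dependencies in the Android project.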
// app/build.gradle
android {
defaultConfig {
ndk {
abiFilters 'armeabi-v7a', 'arm64-v8a', 'x86', 'x86_64'
}
}
aaptOptions {
noCompress "onnx" // 不压缩模型文件
}
}
dependencies {
// ONNX Runtime for Android
implementation 'com.microsoft.onnxruntime:onnxruntime-android:1.15.0'
// 音频处理库
implementation 'com.arthenica:mobile-ffmpeg-min:4.4.LTS'
// 权限管理
implementation 'com.guolindev.permissionx:permissionx:1.7.1'
}
4.2 Model Files and Resource Management
Place the SenseVoice-small ONNX model in a suitable location and make sure the app can access it.
// ModelManager.kt - 模型文件管理
class ModelManager(private val context: Context) {
companion object {
private const val MODEL_NAME = "sensevoice-small.onnx"
private const val MODEL_ASSETS_PATH = "models/$MODEL_NAME"
}
// 检查模型文件是否存在
fun checkModelExists(): Boolean {
return try {
context.assets.open(MODEL_ASSETS_PATH).close()
true
} catch (e: IOException) {
false
}
}
// 将模型从assets复制到应用私有目录
fun copyModelToInternal(): File {
val modelDir = File(context.filesDir, "models")
if (!modelDir.exists()) {
modelDir.mkdirs()
}
val modelFile = File(modelDir, MODEL_NAME)
// 如果已经存在且是最新的,直接返回
if (modelFile.exists() && isModelUpToDate()) {
return modelFile
}
// 从assets复制
context.assets.open(MODEL_ASSETS_PATH).use { input ->
FileOutputStream(modelFile).use { output ->
input.copyTo(output)
}
}
// 保存版本信息
saveModelVersion()
return modelFile
}
private fun isModelUpToDate(): Boolean {
// 检查模型版本
val prefs = context.getSharedPreferences("model_info", Context.MODE_PRIVATE)
val savedVersion = prefs.getString("model_version", "")
return savedVersion == getCurrentModelVersion()
}
private fun saveModelVersion() {
val prefs = context.getSharedPreferences("model_info", Context.MODE_PRIVATE)
prefs.edit().putString("model_version", getCurrentModelVersion()).apply()
}
private fun getCurrentModelVersion(): String {
// 从模型文件或配置中获取版本号
return "1.0.0"
}
}
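The ModelManager above decides whether to recopy the model by comparing a hard-coded version string, so a model file that changes without a version bump would not be refreshed. One alternative, sketched here in Python for brevity, is to compare content hashes and to copy through a temporary file so that a crash mid-copy never leaves a truncated model in place. The names `file_sha256` and `install_model` are hypothetical, and hashing a large model on every launch has a startup cost that a real app might avoid by caching the digest.

```python
import hashlib
import shutil
from pathlib import Path


def file_sha256(path: Path, block_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def install_model(bundled: Path, target_dir: Path) -> Path:
    """Copy the bundled model into target_dir unless an identical copy exists."""
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / bundled.name
    # Skip the copy when the installed file already has the same content.
    if target.exists() and file_sha256(target) == file_sha256(bundled):
        return target
    # Copy to a temporary name first, then rename over the final path.
    tmp = target.with_suffix(target.suffix + ".tmp")
    shutil.copyfile(bundled, tmp)
    tmp.replace(target)
    return target
```

The same two ideas (content comparison and copy-then-rename) carry over directly to the Kotlin code using `MessageDigest` and `File.renameTo`.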
4.3 Implementing the Core Recognition Service
This is the central piece: the speech recognition service.
// SenseVoiceService.kt - 语音识别服务
class SenseVoiceService(
private val context: Context,
private val modelPath: String
) {
private var ortSession: OrtSession? = null
private var isInitialized = false
private var recognitionCallback: RecognitionCallback? = null
// 初始化识别引擎
fun initialize(): Boolean {
return try {
// 创建ONNX Runtime环境
val env = OrtEnvironment.getEnvironment()
val sessionOptions = OrtSession.SessionOptions()
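// Configuration tuned for Android
sessionOptions.setOptimizationLevel(OrtSession.SessionOptions.OptLevel.ALL_OPT)
sessionOptions.setIntraOpNumThreads(4) // use 4 threads
sessionOptions.setMemoryPatternOptimization(true)
// Load the model
ortSession = env.createSession(modelPath, sessionOptions)
// Mark the service as initialized before warm-up, because recognizeAudio checks this flag
isInitialized = true
// Warm up the model
warmUpModel()
true
} catch (e: Exception) {
isInitialized = false
Log.e("SenseVoice", "Initialization failed: ${e.message}")
false
}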
}
// 预热模型(减少首次识别延迟)
private fun warmUpModel() {
val dummyAudio = FloatArray(16000) // 1秒的静音
recognizeAudio(dummyAudio, "auto")
}
// 识别音频数据
fun recognizeAudio(
audioData: FloatArray,
language: String = "auto"
): RecognitionResult {
if (!isInitialized) {
throw IllegalStateException("服务未初始化")
}
return try {
// 预处理音频
val processedAudio = preprocessAudio(audioData)
// Prepare inputs. createTensor takes a java.nio buffer plus an explicit shape
// (requires imports of java.nio.FloatBuffer and java.nio.LongBuffer).
val env = OrtEnvironment.getEnvironment()
val inputTensor = OnnxTensor.createTensor(
env,
FloatBuffer.wrap(processedAudio),
longArrayOf(1, processedAudio.size.toLong())
)
val languageCode = when (language) {
"zh" -> 0
"en" -> 1
"ja" -> 2
"ko" -> 3
"yue" -> 4
else -> -1 // auto
}
val languageTensor = OnnxTensor.createTensor(
env,
LongBuffer.wrap(longArrayOf(languageCode.toLong())),
longArrayOf(1)
)
// 执行推理
val inputs = mapOf(
"audio" to inputTensor,
"language" to languageTensor
)
val outputs = ortSession!!.run(inputs)
// 解析结果
val text = outputs[0].value as String
val detectedLang = outputs[1].value as String
val emotion = outputs[2].value as String
val confidence = (outputs[3].value as FloatArray)[0]
// 清理资源
inputTensor.close()
languageTensor.close()
outputs.forEach { it.value.close() }
RecognitionResult(
text = text,
language = detectedLang,
emotion = emotion,
confidence = confidence,
duration = (audioData.size / 16000f * 1000).toLong() // 毫秒
)
} catch (e: Exception) {
Log.e("SenseVoice", "识别失败: ${e.message}")
RecognitionResult(error = e.message ?: "识别失败")
}
}
// Audio preprocessing
private fun preprocessAudio(
audioData: FloatArray,
currentSampleRate: Int = 16000 // AudioRecord in this class is opened at 16 kHz
): FloatArray {
// 1. Make sure the sample rate is 16 kHz
val targetSampleRate = 16000
val processed = if (currentSampleRate != targetSampleRate) {
resampleAudio(audioData, currentSampleRate, targetSampleRate)
} else {
audioData.copyOf()
}
// 2. 归一化
val maxVal = processed.maxOrNull() ?: 1f
val minVal = processed.minOrNull() ?: -1f
val absMax = maxOf(kotlin.math.abs(maxVal), kotlin.math.abs(minVal))
if (absMax > 0) {
for (i in processed.indices) {
processed[i] = processed[i] / absMax
}
}
return processed
}
// 重采样
private fun resampleAudio(
audio: FloatArray,
fromRate: Int,
toRate: Int
): FloatArray {
// 简化版重采样,实际项目中建议使用专业音频库
val ratio = fromRate.toFloat() / toRate.toFloat()
val newLength = (audio.size / ratio).toInt()
val resampled = FloatArray(newLength)
for (i in 0 until newLength) {
val srcIndex = (i * ratio).toInt()
if (srcIndex < audio.size) {
resampled[i] = audio[srcIndex]
}
}
return resampled
}
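// Recording flag shared between the caller and the capture thread
@Volatile private var isRecording = false
// Start real-time recognition
fun startRealtimeRecognition(callback: RecognitionCallback) {
this.recognitionCallback = callback
isRecording = true
// Start the audio capture thread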
val audioThread = Thread {
val bufferSize = AudioRecord.getMinBufferSize(
16000,
AudioFormat.CHANNEL_IN_MONO,
AudioFormat.ENCODING_PCM_16BIT
)
val audioRecord = AudioRecord(
MediaRecorder.AudioSource.MIC,
16000,
AudioFormat.CHANNEL_IN_MONO,
AudioFormat.ENCODING_PCM_16BIT,
bufferSize
)
audioRecord.startRecording()
val buffer = ShortArray(bufferSize / 2)
val audioBuffer = mutableListOf<Float>()
while (isRecording) {
val bytesRead = audioRecord.read(buffer, 0, buffer.size)
if (bytesRead > 0) {
// 转换到float
val floatBuffer = FloatArray(bytesRead)
for (i in 0 until bytesRead) {
floatBuffer[i] = buffer[i] / 32768.0f
}
audioBuffer.addAll(floatBuffer.toList())
// 每1秒处理一次
if (audioBuffer.size >= 16000) {
val audioData = audioBuffer.take(16000).toFloatArray()
val result = recognizeAudio(audioData, "auto")
// 回调结果
callback.onPartialResult(result.text)
// 保留最后0.5秒数据用于连续识别
val keepSize = 8000 // 0.5秒
audioBuffer.clear()
if (audioData.size > keepSize) {
audioBuffer.addAll(
audioData.sliceArray(audioData.size - keepSize until audioData.size).toList()
)
}
}
}
}
audioRecord.stop()
audioRecord.release()
}
audioThread.start()
}
// 停止实时识别
fun stopRealtimeRecognition() {
isRecording = false
recognitionCallback = null
}
// 释放资源
fun release() {
ortSession?.close()
ortSession = null
isInitialized = false
}
}
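The `resampleAudio` helper above picks the nearest earlier sample, which the code itself flags as a simplification. As a small step up, the sketch below (in Python, to keep it short) shows the 16-bit PCM to float conversion used in the capture loop together with linear-interpolation resampling. The function names are hypothetical. Linear interpolation still applies no low-pass filter, so downsampling can alias; a production SDK would more likely rely on a dedicated audio library or on capturing at 16 kHz directly, as the `AudioRecord` setup above already does.

```python
from typing import List, Sequence


def pcm16_to_float(samples: Sequence[int]) -> List[float]:
    """Scale signed 16-bit PCM samples to floats in [-1.0, 1.0)."""
    return [s / 32768.0 for s in samples]


def resample_linear(audio: Sequence[float],
                    from_rate: int, to_rate: int) -> List[float]:
    """Resample by linear interpolation between neighbouring samples."""
    if from_rate <= 0 or to_rate <= 0:
        raise ValueError("sample rates must be positive")
    if from_rate == to_rate or not audio:
        return list(audio)
    out_len = len(audio) * to_rate // from_rate
    step = from_rate / to_rate  # source samples advanced per output sample
    last = len(audio) - 1
    out: List[float] = []
    for i in range(out_len):
        pos = i * step
        left = int(pos)
        if left >= last:
            # Past the final source sample: hold the last value.
            out.append(audio[last])
            continue
        frac = pos - left
        out.append(audio[left] * (1.0 - frac) + audio[left + 1] * frac)
    return out


if __name__ == "__main__":
    print(pcm16_to_float([-32768, 0, 16384]))
    print(resample_linear([0.0, 1.0, 2.0, 3.0], from_rate=2, to_rate=4))
```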
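4.4 Permissions and UI Integration
Using the microphone on Android requires handling the RECORD_AUDIO runtime permission, and we also want a convenient UI component.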
// VoiceRecognitionView.kt - 语音识别UI组件
class VoiceRecognitionView @JvmOverloads constructor(
context: Context,
attrs: AttributeSet? = null,
defStyleAttr: Int = 0
) : FrameLayout(context, attrs, defStyleAttr) {
// UI组件
private lateinit var recordButton: ImageButton
private lateinit var resultTextView: TextView
private lateinit var languageSpinner: Spinner
private lateinit var progressBar: ProgressBar
// 回调接口
var onRecognitionResult: ((RecognitionResult) -> Unit)? = null
var onRecognitionError: ((String) -> Unit)? = null
// 识别服务
private lateinit var senseVoiceService: SenseVoiceService
init {
initView(context)
initService()
}
private fun initView(context: Context) {
// 加载布局
LayoutInflater.from(context).inflate(R.layout.view_voice_recognition, this, true)
recordButton = findViewById(R.id.btn_record)
resultTextView = findViewById(R.id.tv_result)
languageSpinner = findViewById(R.id.spinner_language)
progressBar = findViewById(R.id.progress_bar)
// 设置语言选项
val languages = arrayOf("自动检测", "中文", "英文", "日语", "韩语", "粤语")
val adapter = ArrayAdapter(context, android.R.layout.simple_spinner_item, languages)
adapter.setDropDownViewResource(android.R.layout.simple_spinner_dropdown_item)
languageSpinner.adapter = adapter
// 录音按钮点击事件
recordButton.setOnClickListener {
if (isRecording) {
stopRecording()
} else {
startRecording()
}
}
}
private fun initService() {
// 初始化识别服务
senseVoiceService = SenseVoiceService(context, getModelPath())
// 在后台线程初始化
GlobalScope.launch(Dispatchers.IO) {
val success = senseVoiceService.initialize()
withContext(Dispatchers.Main) {
if (success) {
recordButton.isEnabled = true
Toast.makeText(context, "语音识别服务就绪", Toast.LENGTH_SHORT).show()
} else {
recordButton.isEnabled = false
onRecognitionError?.invoke("语音识别服务初始化失败")
}
}
}
}
private fun getModelPath(): String {
val modelManager = ModelManager(context)
return modelManager.copyModelToInternal().absolutePath
}
private fun startRecording() {
// 检查权限
if (!hasRecordPermission()) {
requestRecordPermission()
return
}
// 更新UI
recordButton.setImageResource(R.drawable.ic_stop)
resultTextView.text = "正在聆听..."
progressBar.visibility = View.VISIBLE
// Start recognition
senseVoiceService.startRealtimeRecognition(object : RecognitionCallback {
override fun onPartialResult(text: String) {
// Show the partial result. View.post hops to the UI thread
// (runOnUiThread is an Activity method and is not available on a View).
post {
resultTextView.text = text
}
}
override fun onFinalResult(result: RecognitionResult) {
post {
progressBar.visibility = View.GONE
onRecognitionResult?.invoke(result)
}
}
override fun onError(errorCode: Int, message: String) {
post {
progressBar.visibility = View.GONE
recordButton.setImageResource(R.drawable.ic_mic)
onRecognitionError?.invoke(message)
}
}
})
isRecording = true
}
private fun stopRecording() {
senseVoiceService.stopRealtimeRecognition()
post {
recordButton.setImageResource(R.drawable.ic_mic)
progressBar.visibility = View.GONE
}
isRecording = false
}
private fun hasRecordPermission(): Boolean {
return ContextCompat.checkSelfPermission(
context,
Manifest.permission.RECORD_AUDIO
) == PackageManager.PERMISSION_GRANTED
}
private fun requestRecordPermission() {
ActivityCompat.requestPermissions(
context as Activity,
arrayOf(Manifest.permission.RECORD_AUDIO),
RECORD_AUDIO_REQUEST_CODE
)
}
fun onRequestPermissionsResult(
requestCode: Int,
permissions: Array<out String>,
grantResults: IntArray
) {
if (requestCode == RECORD_AUDIO_REQUEST_CODE) {
if (grantResults.isNotEmpty() && grantResults[0] == PackageManager.PERMISSION_GRANTED) {
startRecording()
} else {
onRecognitionError?.invoke("需要麦克风权限才能使用语音识别")
}
}
}
fun release() {
senseVoiceService.release()
}
companion object {
private const val RECORD_AUDIO_REQUEST_CODE = 1001
private var isRecording = false
}
}
4.5 Using It in an Activity
Finally, here is how to use the voice recognition component in an Activity.
// MainActivity.kt - 主界面
class MainActivity : AppCompatActivity() {
private lateinit var voiceRecognitionView: VoiceRecognitionView
private lateinit var resultRecyclerView: RecyclerView
private lateinit var adapter: RecognitionResultAdapter
override fun onCreate(savedInstanceState: Bundle?) {
super.onCreate(savedInstanceState)
setContentView(R.layout.activity_main)
// 初始化组件
voiceRecognitionView = findViewById(R.id.voice_recognition_view)
resultRecyclerView = findViewById(R.id.recycler_results)
// 设置RecyclerView
adapter = RecognitionResultAdapter()
resultRecyclerView.layoutManager = LinearLayoutManager(this)
resultRecyclerView.adapter = adapter
// 设置语音识别回调
voiceRecognitionView.onRecognitionResult = { result ->
// 添加到历史记录
adapter.addResult(result)
// 显示识别结果
showResultDialog(result)
}
voiceRecognitionView.onRecognitionError = { error ->
Toast.makeText(this, "识别错误: $error", Toast.LENGTH_LONG).show()
}
}
private fun showResultDialog(result: RecognitionResult) {
val dialog = AlertDialog.Builder(this)
.setTitle("识别结果")
.setMessage("""
文本:${result.text}
语言:${result.language}
情感:${result.emotion}
置信度:${String.format("%.2f", result.confidence * 100)}%
时长:${result.duration}ms
""".trimIndent())
.setPositiveButton("确定", null)
.setNegativeButton("复制") { _, _ ->
// 复制到剪贴板
val clipboard = getSystemService(Context.CLIPBOARD_SERVICE) as ClipboardManager
val clip = ClipData.newPlainText("识别结果", result.text)
clipboard.setPrimaryClip(clip)
Toast.makeText(this, "已复制到剪贴板", Toast.LENGTH_SHORT).show()
}
.create()
dialog.show()
}
override fun onRequestPermissionsResult(
requestCode: Int,
permissions: Array<out String>,
grantResults: IntArray
) {
super.onRequestPermissionsResult(requestCode, permissions, grantResults)
voiceRecognitionView.onRequestPermissionsResult(requestCode, permissions, grantResults)
}
override fun onDestroy() {
super.onDestroy()
voiceRecognitionView.release()
}
}
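5. iOS Integration Guide: A Voice Experience for the Apple Ecosystem
The integration approach on iOS mirrors Android, but the implementation details differ. We use Swift and SwiftUI to build the speech recognition app.
5.1 Creating the iOS Framework
First, create a Framework that wraps the core SenseVoice-small functionality.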
// SenseVoiceFramework.swift
import Foundation
import AVFoundation
import CoreML
public class SenseVoiceRecognizer {
private var onnxSession: ORTSession?
private var isInitialized = false
private var audioEngine: AVAudioEngine?
private var recognitionCallback: ((RecognitionResult) -> Void)?
// Initialize the recognizer
public func initialize(modelPath: String) throws -> Bool {
do {
// Create the ONNX Runtime environment (the Objective-C/Swift binding names this class ORTEnv)
let env = try ORTEnv(loggingLevel: .warning)
// Create session options
let options = try ORTSessionOptions()
try options.setIntraOpNumThreads(2) // 2 threads is this article's suggestion for iOS
try options.setGraphOptimizationLevel(.all)
// Load the model from its file path
onnxSession = try ORTSession(env: env,
modelPath: modelPath,
sessionOptions: options)
// Mark the recognizer as initialized before warm-up, because recognizeAudio checks this flag
isInitialized = true
// Warm up the model
try warmUpModel()
return true
} catch {
isInitialized = false
onnxSession = nil
print("SenseVoice initialization failed: \(error)")
throw error
}
}
// 预热模型
private func warmUpModel() throws {
let dummyAudio = [Float](repeating: 0, count: 16000)
_ = try recognizeAudio(dummyAudio, language: "auto")
}
// 识别音频数据
public func recognizeAudio(_ audioData: [Float],
language: String = "auto") throws -> RecognitionResult {
guard isInitialized, let session = onnxSession else {
throw SenseVoiceError.notInitialized
}
do {
// 预处理音频
let processedAudio = preprocessAudio(audioData)
// 创建输入Tensor
let inputShape: [NSNumber] = [1, NSNumber(value: processedAudio.count)]
let inputTensor = try ORTValue(
tensorData: NSMutableData(bytes: processedAudio,
length: processedAudio.count * MemoryLayout<Float>.size),
elementType: ORTTensorElementDataType.float,
shape: inputShape
)
// Language code; it must be a var so that it can be passed by address
var languageCode = languageCodeForString(language)
let languageTensor = try ORTValue(
tensorData: NSMutableData(bytes: &languageCode,
length: MemoryLayout<Int64>.size),
elementType: ORTTensorElementDataType.int64,
shape: [1]
)
// Run inference (the input and output names are assumptions about the exported model)
let inputs = ["audio": inputTensor, "language": languageTensor]
let outputs = try session.run(withInputs: inputs,
outputNames: ["text", "language", "emotion", "confidence"],
runOptions: nil)
// 解析结果
let textTensor = outputs["text"]!
let text = try textTensor.tensorDataAsString()
let languageTensorOutput = outputs["language"]!
let detectedLang = try languageTensorOutput.tensorDataAsString()
let emotionTensor = outputs["emotion"]!
let emotion = try emotionTensor.tensorDataAsString()
let confidenceTensor = outputs["confidence"]!
let confidenceData = try confidenceTensor.tensorData() as Data
let confidence = confidenceData.withUnsafeBytes { $0.load(as: Float.self) }
return RecognitionResult(
text: text,
language: detectedLang,
emotion: emotion,
confidence: confidence,
duration: Int64(Double(audioData.count) / 16000.0 * 1000)
)
} catch {
print("识别失败: \(error)")
throw error
}
}
// 音频预处理
private func preprocessAudio(_ audio: [Float]) -> [Float] {
var processed = audio
// 归一化
let maxVal = processed.max() ?? 1.0
let minVal = processed.min() ?? -1.0
let absMax = max(abs(maxVal), abs(minVal))
if absMax > 0 {
processed = processed.map { $0 / absMax }
}
return processed
}
// 开始实时识别
public func startRealtimeRecognition(callback: @escaping (RecognitionResult) -> Void) throws {
guard AVAudioSession.sharedInstance().recordPermission == .granted else {
throw SenseVoiceError.permissionDenied
}
recognitionCallback = callback
audioEngine = AVAudioEngine()
guard let audioEngine = audioEngine else {
throw SenseVoiceError.audioEngineFailed
}
let inputNode = audioEngine.inputNode
let inputFormat = inputNode.outputFormat(forBus: 0)
// 设置录音格式
let recordingFormat = AVAudioFormat(
commonFormat: .pcmFormatFloat32,
sampleRate: 16000,
channels: 1,
interleaved: false
)
guard let format = recordingFormat else {
throw SenseVoiceError.audioFormatFailed
}
// 安装Tap
inputNode.installTap(onBus: 0,
bufferSize: 4096,
format: inputFormat) { [weak self] buffer, time in
guard let self = self else { return }
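// Convert to the 16 kHz mono format (creating the converter once, outside the tap, would be cheaper)
guard let converter = AVAudioConverter(from: inputFormat, to: format) else { return }
// Size the output for the sample-rate ratio instead of reusing the input capacity
let ratio = format.sampleRate / inputFormat.sampleRate
let capacity = AVAudioFrameCount(Double(buffer.frameLength) * ratio) + 1
let convertedBuffer = AVAudioPCMBuffer(pcmFormat: format,
frameCapacity: capacity)
var error: NSError?
// Hand the tap buffer to the converter exactly once, then report that no more data is available,
// so the same samples are not converted repeatedly
var consumed = false
let inputBlock: AVAudioConverterInputBlock = { inNumPackets, outStatus in
if consumed {
outStatus.pointee = .noDataNow
return nil
}
consumed = true
outStatus.pointee = .haveData
return buffer
}
converter.convert(to: convertedBuffer!,
error: &error,
withInputFrom: inputBlock)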
if let convertedBuffer = convertedBuffer,
let channelData = convertedBuffer.floatChannelData {
let frames = convertedBuffer.frameLength
let audioData = Array(UnsafeBufferPointer(start: channelData[0],
count: Int(frames)))
// 每1秒处理一次
self.processAudioBuffer(audioData)
}
}
// 启动音频引擎
try audioEngine.start()
}
// 处理音频缓冲区
private var audioBuffer: [Float] = []
private func processAudioBuffer(_ newData: [Float]) {
audioBuffer.append(contentsOf: newData)
// 每1秒(16000个样本)处理一次
while audioBuffer.count >= 16000 {
let chunk = Array(audioBuffer.prefix(16000))
audioBuffer.removeFirst(16000)
do {
let result = try recognizeAudio(chunk, language: "auto")
DispatchQueue.main.async {
self.recognitionCallback?(result)
}
} catch {
print("实时识别错误: \(error)")
}
}
}
// 停止识别
public func stopRealtimeRecognition() {
audioEngine?.stop()
audioEngine?.inputNode.removeTap(onBus: 0)
audioEngine = nil
recognitionCallback = nil
}
// 语言编码映射
private func languageCodeForString(_ language: String) -> Int64 {
switch language.lowercased() {
case "zh": return 0
case "en": return 1
case "ja": return 2
case "ko": return 3
case "yue": return 4
default: return -1 // auto
}
}
// 释放资源
public func release() {
stopRealtimeRecognition()
onnxSession = nil
isInitialized = false
}
}
// 识别结果结构体
public struct RecognitionResult {
public let text: String
public let language: String
public let emotion: String
public let confidence: Float
public let duration: Int64 // 毫秒
public let error: String?
public init(text: String = "",
language: String = "",
emotion: String = "",
confidence: Float = 0,
duration: Int64 = 0,
error: String? = nil) {
self.text = text
self.language = language
self.emotion = emotion
self.confidence = confidence
self.duration = duration
self.error = error
}
}
// 错误类型
public enum SenseVoiceError: Error {
case notInitialized
case permissionDenied
case audioEngineFailed
case audioFormatFailed
case modelNotFound
case recognitionFailed(String)
}
5.2 SwiftUI Interface
Build the speech recognition interface with SwiftUI.
// VoiceRecognitionView.swift
import SwiftUI
import AVFoundation
struct VoiceRecognitionView: View {
@StateObject private var viewModel = VoiceRecognitionViewModel()
@State private var isRecording = false
@State private var recognizedText = ""
@State private var selectedLanguage = "auto"
let languages = [
("auto", "自动检测"),
("zh", "中文"),
("en", "英文"),
("ja", "日语"),
("ko", "韩语"),
("yue", "粤语")
]
var body: some View {
VStack(spacing: 20) {
// 标题
Text("语音识别")
.font(.largeTitle)
.fontWeight(.bold)
.padding(.top)
// 语言选择
Picker("选择语言", selection: $selectedLanguage) {
ForEach(languages, id: \.0) { code, name in
Text(name).tag(code)
}
}
.pickerStyle(SegmentedPickerStyle())
.padding(.horizontal)
// 录音按钮
Button(action: toggleRecording) {
Circle()
.fill(isRecording ? Color.red : Color.blue)
.frame(width: 100, height: 100)
.overlay(
Image(systemName: isRecording ? "stop.fill" : "mic.fill")
.font(.system(size: 40))
.foregroundColor(.white)
)
.shadow(radius: 10)
}
.padding(.vertical, 30)
// 状态提示
if isRecording {
VStack {
Text("正在聆听...")
.font(.headline)
.foregroundColor(.green)
// 录音动画
HStack(spacing: 4) {
ForEach(0..<5) { i in
RoundedRectangle(cornerRadius: 2)
.fill(Color.green)
.frame(width: 4, height: CGFloat.random(in: 10...30))
.animation(
Animation.easeInOut(duration: 0.5)
.repeatForever()
.delay(Double(i) * 0.1),
value: isRecording
)
}
}
.frame(height: 30)
}
}
// 识别结果
ScrollView {
VStack(alignment: .leading, spacing: 10) {
Text("识别结果")
.font(.headline)
if recognizedText.isEmpty {
Text("点击上方按钮开始录音")
.foregroundColor(.gray)
.italic()
} else {
Text(recognizedText)
.padding()
.frame(maxWidth: .infinity, alignment: .leading)
.background(Color.gray.opacity(0.1))
.cornerRadius(10)
}
}
.padding()
}
.frame(maxHeight: 200)
.background(Color.gray.opacity(0.05))
.cornerRadius(10)
.padding(.horizontal)
// 识别信息
if let result = viewModel.lastResult {
VStack(alignment: .leading, spacing: 8) {
HStack {
Label("语言: \(result.language)", systemImage: "globe")
Spacer()
Label("置信度: \(Int(result.confidence * 100))%",
systemImage: "chart.bar.fill")
}
.font(.caption)
.foregroundColor(.secondary)
HStack {
Label("情感: \(result.emotion)", systemImage: "face.smiling")
Spacer()
Label("时长: \(result.duration)ms", systemImage: "clock")
}
.font(.caption)
.foregroundColor(.secondary)
}
.padding()
.background(Color.blue.opacity(0.1))
.cornerRadius(10)
.padding(.horizontal)
}
Spacer()
}
.padding()
.onAppear {
viewModel.initialize()
}
.onDisappear {
viewModel.release()
}
.alert("错误", isPresented: $viewModel.showError) {
Button("确定", role: .cancel) { }
} message: {
Text(viewModel.errorMessage)
}
}
private func toggleRecording() {
if isRecording {
viewModel.stopRecording()
} else {
viewModel.startRecording(language: selectedLanguage) { result in
recognizedText = result.text
}
}
isRecording.toggle()
}
}
// ViewModel
class VoiceRecognitionViewModel: ObservableObject {
private var recognizer: SenseVoiceRecognizer?
@Published var lastResult: RecognitionResult?
@Published var showError = false
@Published var errorMessage = ""
func initialize() {
do {
recognizer = SenseVoiceRecognizer()
// 获取模型路径
guard let modelPath = Bundle.main.path(forResource: "sensevoice-small",
ofType: "onnx") else {
throw SenseVoiceError.modelNotFound
}
let success = try recognizer?.initialize(modelPath: modelPath)
if success != true {
throw SenseVoiceError.recognitionFailed("初始化失败")
}
} catch {
showError(message: "初始化失败: \(error.localizedDescription)")
}
}
func startRecording(language: String,
onResult: @escaping (RecognitionResult) -> Void) {
guard let recognizer = recognizer else {
showError(message: "识别器未初始化")
return
}
// 请求录音权限
AVAudioSession.sharedInstance().requestRecordPermission { granted in
if granted {
do {
try recognizer.startRealtimeRecognition { result in
DispatchQueue.main.async {
self.lastResult = result
onResult(result)
}
}
} catch {
DispatchQueue.main.async {
self.showError(message: "开始录音失败: \(error.localizedDescription)")
}
}
} else {
DispatchQueue.main.async {
self.showError(message: "需要麦克风权限才能使用语音识别")
}
}
}
}
func stopRecording() {
recognizer?.stopRealtimeRecognition()
}
func release() {
recognizer?.release()
recognizer = nil
}
private func showError(message: String) {
errorMessage = message
showError = true
}
}
5.3 Info.plist Configuration
On iOS, using the microphone requires a usage description in Info.plist.
<?xml version="1.0" encoding="UTF-8"?>
<!-- Info.plist -->
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<!-- Microphone usage permission -->
<key>NSMicrophoneUsageDescription</key>
<string>Microphone access is required for speech recognition</string>
<!-- App display name -->
<key>CFBundleDisplayName</key>
<string>Voice Recognition Assistant</string>
<!-- Supported interface orientations -->
<key>UISupportedInterfaceOrientations</key>
<array>
<string>UIInterfaceOrientationPortrait</string>
<string>UIInterfaceOrientationLandscapeLeft</string>
<string>UIInterfaceOrientationLandscapeRight</string>
</array>
<!-- Indirect input event support (for example pointer devices) -->
<key>UIApplicationSupportsIndirectInputEvents</key>
<true/>
<!-- Background audio -->
<key>UIBackgroundModes</key>
<array>
<string>audio</string>
</array>
</dict>
</plist>
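6. Performance Optimization and Best Practices
Once integration is done, performance tuning and best practices still need attention so that the SDK behaves well in real use.
6.1 Memory Optimization
Speech recognition is compute-intensive and allocates audio buffers continuously, so memory management matters.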
// MemoryOptimizer.kt - Android内存优化
class MemoryOptimizer {
companion object {
// 音频缓冲区池
private val audioBufferPool = mutableListOf<FloatArray>()
private const val BUFFER_SIZE = 16000 // 1秒的音频
// 获取缓冲区
fun getAudioBuffer(): FloatArray {
synchronized(audioBufferPool) {
return if (audioBufferPool.isNotEmpty()) {
audioBufferPool.removeAt(0)
} else {
FloatArray(BUFFER_SIZE)
}
}
}
// 回收缓冲区
fun recycleAudioBuffer(buffer: FloatArray) {
if (buffer.size == BUFFER_SIZE) {
synchronized(audioBufferPool) {
if (audioBufferPool.size < 5) { // 最多缓存5个
// 清空缓冲区内容
buffer.fill(0f)
audioBufferPool.add(buffer)
}
}
}
}
// 监控内存使用
fun monitorMemoryUsage(context: Context) {
val activityManager = context.getSystemService(Context.ACTIVITY_SERVICE)
as ActivityManager
val memoryInfo = ActivityManager.MemoryInfo()
activityManager.getMemoryInfo(memoryInfo)
val usedMemory = Runtime.getRuntime().totalMemory() -
Runtime.getRuntime().freeMemory()
val maxMemory = Runtime.getRuntime().maxMemory()
Log.d("MemoryOptimizer",
"内存使用: ${usedMemory / 1024 / 1024}MB / ${maxMemory / 1024 / 1024}MB")
Log.d("MemoryOptimizer",
"系统剩余内存: ${memoryInfo.availMem / 1024 / 1024}MB")
// 如果内存紧张,清理缓存
if (memoryInfo.lowMemory) {
clearCaches()
}
}
private fun clearCaches() {
synchronized(audioBufferPool) {
audioBufferPool.clear()
}
System.gc()
}
}
}
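The pooling idea in MemoryOptimizer is independent of Android, so here is the same pattern as a compact Python sketch for comparison. The class name `BufferPool` and its methods are hypothetical. Two small differences from the Kotlin version are worth noting: buffers are taken from the end of the free list, which avoids shifting the remaining elements, and a buffer is zeroed before it is cached so that a later borrower never sees stale audio.

```python
import threading
from typing import List


class BufferPool:
    """A small thread-safe pool of fixed-size audio buffers."""

    def __init__(self, buffer_size: int = 16000, max_cached: int = 5):
        self._buffer_size = buffer_size  # 16000 samples = 1 s at 16 kHz
        self._max_cached = max_cached    # cap on idle buffers kept around
        self._free: List[List[float]] = []
        self._lock = threading.Lock()

    def acquire(self) -> List[float]:
        """Return a zeroed buffer, reusing a cached one when available."""
        with self._lock:
            if self._free:
                return self._free.pop()
        return [0.0] * self._buffer_size

    def release(self, buf: List[float]) -> None:
        """Give a buffer back; wrong-sized or surplus buffers are dropped."""
        if len(buf) != self._buffer_size:
            return
        for i in range(len(buf)):
            buf[i] = 0.0  # clear old audio before the buffer is reused
        with self._lock:
            if len(self._free) < self._max_cached:
                self._free.append(buf)

    def cached(self) -> int:
        """Number of idle buffers currently held by the pool."""
        with self._lock:
            return len(self._free)
```

Whether pooling pays off depends on the allocator and garbage collector involved, so it is worth measuring before adopting it.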
6.2 Power Optimization
On mobile devices, power consumption matters just as much.
// PowerOptimizer.swift - iOS功耗优化
class PowerOptimizer {
private var energyMonitor: EnergyMonitor?
private var isLowPowerMode = false
// 监控设备能耗状态
func startMonitoring() {
// 监听低电量模式
NotificationCenter.default.addObserver(
self,
selector: #selector(lowPowerModeChanged),
name: NSNotification.Name.NSProcessInfoPowerStateDidChange,
object: nil
)
// 检查当前状态
isLowPowerMode = ProcessInfo.processInfo.isLowPowerModeEnabled
adjustStrategyForPowerMode()
// 启动能耗监控
energyMonitor = EnergyMonitor()
energyMonitor?.startMonitoring()
}
@objc private func lowPowerModeChanged() {
isLowPowerMode = ProcessInfo.processInfo.isLowPowerModeEnabled
adjustStrategyForPowerMode()
}
// 根据电量模式调整策略
private func adjustStrategyForPowerMode() {
if isLowPowerMode {
// 低电量模式下的优化策略
SenseVoiceConfig.shared.maxThreads = 1
SenseVoiceConfig.shared.enableGPU = false
SenseVoiceConfig.shared.audioBufferSize = 32000 // 2秒缓冲
SenseVoiceConfig.shared.processingInterval = 2000 // 2秒处理一次
} else {
// 正常模式
SenseVoiceConfig.shared.maxThreads = 2
SenseVoiceConfig.shared.enableGPU = true
SenseVoiceConfig.shared.audioBufferSize = 16000 // 1秒缓冲
SenseVoiceConfig.shared.processingInterval = 1000 // 1秒处理一次
}
}
// 动态调整识别精度
func adjustAccuracyBasedOnBattery(level: Float) {
if level < 0.2 { // 电量低于20%
SenseVoiceConfig.shared.recognitionAccuracy = .low
} else if level < 0.5 { // 电量低于50%
SenseVoiceConfig.shared.recognitionAccuracy = .medium
} else {
SenseVoiceConfig.shared.recognitionAccuracy = .high
}
}
// 清理资源
func stopMonitoring() {
NotificationCenter.default.removeObserver(self)
energyMonitor?.stopMonitoring()
energyMonitor = nil
}
}
// Energy monitor
class EnergyMonitor {
    private var monitoringTimer: Timer?
    private var energyUsage: [Date: Double] = [:]

    func startMonitoring() {
        // Capture self weakly so the repeating timer does not keep the monitor alive
        monitoringTimer = Timer.scheduledTimer(withTimeInterval: 10.0, repeats: true) { [weak self] _ in
            self?.recordEnergyUsage()
        }
    }

    private func recordEnergyUsage() {
        // A system energy-monitoring API could be integrated here.
        // A real project may need a more specialized energy profiling tool.
        let usage = Double.random(in: 0.1...0.5) // simulated energy data, not a real measurement
        energyUsage[Date()] = usage
        // Post a warning when energy usage is too high
        if usage > 0.3 {
            NotificationCenter.default.post(
                name: Notification.Name("HighEnergyUsageWarning"),
                object: nil,
                userInfo: ["usage": usage]
            )
        }
    }

    func stopMonitoring() {
        monitoringTimer?.invalidate()
        monitoringTimer = nil
    }
}
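The buffer sizes and processing intervals set in `adjustStrategyForPowerMode` describe the same quantity twice: at the 16 kHz sample rate used throughout this article, 16000 samples is one second of audio and 32000 samples is two. The following is a minimal Python sketch of that arithmetic, deriving the buffer size from the interval so the two cannot drift apart. `PowerProfile` and `profile_for` are hypothetical names introduced only for illustration; the values mirror the Swift listing.

```python
from dataclasses import dataclass

SAMPLE_RATE_HZ = 16_000  # sample rate assumed by the engine in this article


@dataclass(frozen=True)
class PowerProfile:
    """Hypothetical container mirroring the SenseVoiceConfig fields above."""
    max_threads: int
    enable_gpu: bool
    processing_interval_ms: int

    @property
    def audio_buffer_size(self) -> int:
        # One buffer holds exactly the audio captured during one interval.
        return SAMPLE_RATE_HZ * self.processing_interval_ms // 1000


def profile_for(low_power_mode: bool) -> PowerProfile:
    """Return the settings the Swift code applies for each power mode."""
    if low_power_mode:
        return PowerProfile(max_threads=1, enable_gpu=False, processing_interval_ms=2000)
    return PowerProfile(max_threads=2, enable_gpu=True, processing_interval_ms=1000)
```

Calling `profile_for(True).audio_buffer_size` yields the 2-second buffer, and `profile_for(False).audio_buffer_size` yields the 1-second buffer.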
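6.3 Network Fallback Strategy
SenseVoice-small works offline, but some scenarios can still benefit from network assistance, for example when the on-device result has low confidence.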
// NetworkFallback.kt - network fallback strategy
import android.content.Context
import android.net.ConnectivityManager
import android.net.NetworkCapabilities
import android.os.Build
import com.google.gson.Gson
import kotlinx.coroutines.CancellationException
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext

class NetworkFallbackStrategy(private val context: Context) {
    private val localRecognizer = SenseVoiceService(context, getLocalModelPath())
    private val cloudRecognizer = CloudRecognitionService()

    // Hybrid recognition: on-device first, cloud as a fallback
    suspend fun hybridRecognize(
        audioData: FloatArray,
        language: String = "auto"
    ): RecognitionResult {
        return try {
            // Try on-device recognition first
            val localResult = withContext(Dispatchers.IO) {
                localRecognizer.recognizeAudio(audioData, language)
            }
            // If local confidence is low, try the cloud as well
            if (localResult.confidence < 0.7 && isNetworkAvailable()) {
                val cloudResult = withContext(Dispatchers.IO) {
                    cloudRecognizer.recognize(audioData, language)
                }
                // Merge the results (adjust the strategy to your business logic)
                return mergeResults(localResult, cloudResult)
            }
            localResult
        } catch (e: CancellationException) {
            // Let coroutine cancellation propagate instead of reporting it as a failure
            throw e
        } catch (e: Exception) {
            // Recognition failed (locally, or in the cloud call above): try the cloud
            if (isNetworkAvailable()) {
                try {
                    return withContext(Dispatchers.IO) {
                        cloudRecognizer.recognize(audioData, language)
                    }
                } catch (e2: CancellationException) {
                    throw e2
                } catch (e2: Exception) {
                    // The cloud failed too: return an error
                    return RecognitionResult(error = "Recognition failed: ${e2.message}")
                }
            }
            RecognitionResult(error = "Recognition failed: ${e.message}")
        }
    }

    // Check network connectivity
    private fun isNetworkAvailable(): Boolean {
        val connectivityManager =
            context.getSystemService(Context.CONNECTIVITY_SERVICE) as ConnectivityManager
        if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.M) {
            val network = connectivityManager.activeNetwork
            val capabilities = connectivityManager.getNetworkCapabilities(network)
            return capabilities != null &&
                (capabilities.hasTransport(NetworkCapabilities.TRANSPORT_WIFI) ||
                    capabilities.hasTransport(NetworkCapabilities.TRANSPORT_CELLULAR))
        } else {
            @Suppress("DEPRECATION")
            val networkInfo = connectivityManager.activeNetworkInfo
            return networkInfo != null && networkInfo.isConnected
        }
    }

    // Merge the local and cloud results
    private fun mergeResults(
        local: RecognitionResult,
        cloud: RecognitionResult
    ): RecognitionResult {
        // Simple strategy: keep the result with the higher confidence
        return if (local.confidence >= cloud.confidence) {
            local
        } else {
            cloud
        }
        // More elaborate strategies could use:
        // 1. A weighted average
        // 2. Language-model-based post-processing
        // 3. Learning from user feedback
    }

    // Cache a result for a duration that depends on its quality
    fun cacheRecognitionResult(
        audioHash: String,
        result: RecognitionResult,
        source: RecognitionSource
    ) {
        val cache = RecognitionCache(context)
        // Higher-confidence results are kept longer
        val cacheTime = when {
            result.confidence > 0.9 -> 7 * 24 * 60 * 60 * 1000L // 7 days
            result.confidence > 0.7 -> 24 * 60 * 60 * 1000L // 1 day
            else -> 60 * 60 * 1000L // 1 hour
        }
        cache.save(audioHash, result, cacheTime, source)
    }

    // Get the local model path
    private fun getLocalModelPath(): String {
        // Implementation omitted
        return ""
    }
}
// Recognition result cache
class RecognitionCache(context: Context) {
    private val sharedPrefs = context.getSharedPreferences("recognition_cache",
        Context.MODE_PRIVATE)

    fun save(
        key: String,
        result: RecognitionResult,
        ttl: Long,
        source: RecognitionSource
    ) {
        val json = Gson().toJson(CacheEntry(result, System.currentTimeMillis() + ttl, source))
        sharedPrefs.edit().putString(key, json).apply()
    }

    fun get(key: String): CacheEntry? {
        val json = sharedPrefs.getString(key, null)
        return if (json != null) {
            val entry = Gson().fromJson(json, CacheEntry::class.java)
            if (entry.expiry > System.currentTimeMillis()) {
                entry
            } else {
                // The entry has expired: remove it
                sharedPrefs.edit().remove(key).apply()
                null
            }
        } else {
            null
        }
    }

    data class CacheEntry(
        val result: RecognitionResult,
        val expiry: Long,
        val source: RecognitionSource
    )
}

enum class RecognitionSource {
    LOCAL, CLOUD, HYBRID
}
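`cacheRecognitionResult` takes an `audioHash` key, but the listing does not show how that key is computed. One possible approach, sketched here in Python under stated assumptions, is to hash the raw sample bytes with SHA-256 and pair that with the same confidence-to-TTL mapping as the Kotlin code. The function names `audio_hash` and `cache_ttl_ms` are hypothetical and not part of the SDK.

```python
import hashlib
import struct
from typing import Sequence

HOUR_MS = 60 * 60 * 1000


def audio_hash(samples: Sequence[float]) -> str:
    """Hash float samples via their little-endian float32 byte representation."""
    payload = struct.pack(f"<{len(samples)}f", *samples)
    return hashlib.sha256(payload).hexdigest()


def cache_ttl_ms(confidence: float) -> int:
    """Mirror the Kotlin thresholds: better results are cached longer."""
    if confidence > 0.9:
        return 7 * 24 * HOUR_MS  # 7 days
    if confidence > 0.7:
        return 24 * HOUR_MS      # 1 day
    return HOUR_MS               # 1 hour
```

An exact hash only matches bit-identical audio. Two live microphone captures of the same phrase will almost never collide, so a cache keyed this way mainly helps with replayed files or fixed prompts.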
7. Testing and Deployment
7.1 Unit Tests
Make sure every SDK component works as expected.
// SenseVoiceServiceTest.kt
import android.content.Context
import androidx.test.core.app.ApplicationProvider
import androidx.test.ext.junit.runners.AndroidJUnit4
import kotlin.math.sin
import org.junit.After
import org.junit.Assert.assertNotNull
import org.junit.Assert.assertTrue
import org.junit.Before
import org.junit.Test
import org.junit.runner.RunWith

@RunWith(AndroidJUnit4::class)
class SenseVoiceServiceTest {
    private lateinit var context: Context
    private lateinit var service: SenseVoiceService

    @Before
    fun setup() {
        context = ApplicationProvider.getApplicationContext()
        // Copy the test model into internal storage
        val modelManager = ModelManager(context)
        val modelPath = modelManager.copyModelToInternal().absolutePath
        service = SenseVoiceService(context, modelPath)
        service.initialize()
    }

    @Test
    fun testInitialization() {
        assertTrue(service.isInitialized)
    }

    @Test
    fun testAudioPreprocessing() {
        // Build test audio: a 1-second 440 Hz sine wave
        val sampleRate = 16000
        val frequency = 440.0
        val duration = 1.0 // seconds
        val audioData = FloatArray((sampleRate * duration).toInt())
        for (i in audioData.indices) {
            val time = i.toDouble() / sampleRate
            audioData[i] = sin(2 * Math.PI * frequency * time).toFloat()
        }
        val result = service.recognizeAudio(audioData, "auto")
        assertNotNull(result)
        assertTrue(result.confidence > 0)
    }

    @Test
    fun testLanguageDetection() {
        // Pre-recorded audio in different languages can be used here.
        // A real project should prepare a test dataset.
        assertTrue(true) // placeholder
    }

    @Test
    fun testEmptyAudio() {
        val emptyAudio = FloatArray(16000) // 1 second of silence
        val result = service.recognizeAudio(emptyAudio, "auto")
        // Silence should produce an empty result or a low confidence
        assertTrue(result.text.isEmpty() || result.confidence < 0.3)
    }

    @After
    fun tearDown() {
        service.release()
    }
}
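The unit test above synthesizes its 440 Hz tone in code. For tests that need audio files on disk, the same signal can be written out as a fixture. The sketch below uses only the Python standard library and assumes mono 16-bit PCM at 16 kHz, matching the sample rate used in this article; `sine_wave` and `write_wav` are hypothetical helper names.

```python
import math
import struct
import wave
from typing import List

SAMPLE_RATE_HZ = 16_000


def sine_wave(frequency_hz: float, duration_s: float, amplitude: float = 1.0) -> List[float]:
    """Generate a sine tone as floats in [-amplitude, amplitude]."""
    n = int(SAMPLE_RATE_HZ * duration_s)
    return [
        amplitude * math.sin(2 * math.pi * frequency_hz * i / SAMPLE_RATE_HZ)
        for i in range(n)
    ]


def write_wav(path: str, samples: List[float]) -> None:
    """Write samples as mono 16-bit PCM at 16 kHz, clipping to [-1, 1]."""
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(SAMPLE_RATE_HZ)
        wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
```

For example, `write_wav("tone_440.wav", sine_wave(440.0, 1.0))` produces a one-second tone, and `write_wav("silence.wav", [0.0] * 16000)` produces the silent input used in `testEmptyAudio`.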
7.2 Integration Tests
Test how the SDK behaves end to end.
// SenseVoiceIntegrationTests.swift
import XCTest
@testable import SenseVoiceFramework

final class SenseVoiceIntegrationTests: XCTestCase {
    var recognizer: SenseVoiceRecognizer!

    override func setUp() async throws {
        recognizer = SenseVoiceRecognizer()
        // Locate the test model in the test bundle
        let bundle = Bundle(for: type(of: self))
        guard let modelPath = bundle.path(forResource: "sensevoice-small-test",
                                          ofType: "onnx") else {
            throw SenseVoiceError.modelNotFound
        }
        let initialized = try recognizer.initialize(modelPath: modelPath)
        XCTAssertTrue(initialized, "The recognizer should initialize successfully")
    }

    func testChineseRecognition() async throws {
        // Load the test audio ("你好,世界")
        let audioData = try loadTestAudio(name: "chinese_hello")
        let result = try recognizer.recognizeAudio(audioData, language: "zh")
        XCTAssertFalse(result.text.isEmpty, "The recognition result should not be empty")
        XCTAssertEqual(result.language, "zh", "Chinese should be detected")
        XCTAssertGreaterThan(result.confidence, 0.7, "Confidence should be above 0.7")
    }

    func testEnglishRecognition() async throws {
        // Load the test audio ("Hello, world")
        let audioData = try loadTestAudio(name: "english_hello")
        let result = try recognizer.recognizeAudio(audioData, language: "en")
        XCTAssertFalse(result.text.isEmpty, "The recognition result should not be empty")
        XCTAssertEqual(result.language, "en", "English should be detected")
        XCTAssertGreaterThan(result.confidence, 0.7, "Confidence should be above 0.7")
    }

    func testAutoLanguageDetection() async throws {
        let chineseAudio = try loadTestAudio(name: "chinese_hello")
        let chineseResult = try recognizer.recognizeAudio(chineseAudio, language: "auto")
        XCTAssertEqual(chineseResult.language, "zh", "Chinese should be detected automatically")
        let englishAudio = try loadTestAudio(name: "english_hello")
        let englishResult = try recognizer.recognizeAudio(englishAudio, language: "auto")
        XCTAssertEqual(englishResult.language, "en", "English should be detected automatically")
    }

    func testPerformance() {
        let audioData = [Float](repeating: 0, count: 16000) // 1 second of silence
        measure {
            for _ in 0..<10 {
                _ = try? recognizer.recognizeAudio(audioData, language: "auto")
            }
        }
    }

    private func loadTestAudio(name: String) throws -> [Float] {
        // A real project should load the audio file from the test resources.
        // This placeholder returns one second of silence, so the recognition
        // assertions above only become meaningful once real fixtures are loaded.
        return [Float](repeating: 0, count: 16000)
    }

    override func tearDown() {
        recognizer.release()
        recognizer = nil
    }
}
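`loadTestAudio` above is a placeholder that returns silence. The decoding it stands in for is the same on any platform: read 16-bit PCM frames and scale them into the [-1, 1] float range the recognizer expects. The following Python sketch shows that step with the standard library only; `load_wav_as_floats` is a hypothetical name, and the mono, 16-bit, 16 kHz requirements are assumptions based on the sample rate used in this article.

```python
import struct
import wave
from typing import List


def load_wav_as_floats(path: str, expected_rate_hz: int = 16_000) -> List[float]:
    """Read a mono 16-bit PCM WAV file and return samples scaled to [-1.0, 1.0)."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit PCM audio")
        if wav.getframerate() != expected_rate_hz:
            raise ValueError(
                f"expected {expected_rate_hz} Hz, got {wav.getframerate()} Hz"
            )
        raw = wav.readframes(wav.getnframes())
    # Each sample is a little-endian signed 16-bit integer.
    pcm = struct.unpack(f"<{len(raw) // 2}h", raw)
    return [s / 32768.0 for s in pcm]
```

Rejecting files with the wrong sample rate or channel count up front keeps a mismatched fixture from showing up later as a confusing recognition failure.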
7.3 Continuous Integration and Deployment
Set up a CI/CD pipeline to safeguard code quality.
# .github/workflows/android-ci.yml
name: Android CI

on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main ]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up JDK 11
        uses: actions/setup-java@v3
        with:
          java-version: '11'
          distribution: 'temurin'
      - name: Setup Android SDK
        uses: android-actions/setup-android@v2
      - name: Grant execute permission for gradlew
        run: chmod +x gradlew
      - name: Run unit tests
        run: ./gradlew testDebugUnitTest
      - name: Run instrumented tests
        uses: reactivecircus/android-emulator-runner@v2
        with:
          api-level: 29
          script: ./gradlew connectedDebugAndroidTest
      - name: Upload test results
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: test-results
          path: app/build/reports/

  build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up JDK 11
        uses: actions/setup-java@v3
        with:
          java-version: '11'
          distribution: 'temurin'
      - name: Setup Android SDK
        uses: android-actions/setup-android@v2
      - name: Grant execute permission for gradlew
        run: chmod +x gradlew
      - name: Build APK
        run: ./gradlew assembleRelease
      - name: Upload APK
        uses: actions/upload-artifact@v4
        with:
          name: app-release
          path: app/build/outputs/apk/release/
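8. Summary
This article walked through how to package the SenseVoice-small speech recognition model as an SDK and integrate it into Android and iOS apps. From technology selection and architecture design to concrete code and optimization strategies, it covered the key points of integrating on-device speech recognition.
8.1 Key Takeaways
On the technical side, we covered:
- How to wrap an ONNX model in a cross-platform inference engine
- How to design an SDK architecture for mobile
- How to implement real-time speech recognition on Android and iOS
- How to optimize performance and control power consumption
On the engineering side, we covered:
- Modular design: breaking a complex system into maintainable components
- Platform differences: providing an adaptation layer for each operating system
- Error handling and resource management to keep the SDK stable
- Testing strategy: a complete flow from unit tests to integration tests
8.2 Practical Recommendations
When applying this SDK in a real project, I have a few suggestions:
First, choose the configuration to fit the scenario
- For scenarios that demand low latency (such as voice assistants), use streaming recognition
- For scenarios that demand high accuracy (such as meeting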