SenseVoice-small in Practice: Packaging a Voice-Interaction SDK for Smart Hardware and Integrating It on Android/iOS

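1. Introduction: When Smart Hardware Finds Its Voice

Imagine a smart speaker at home that understands commands in your dialect, a car head unit that transcribes navigation prompts in real time, or smart glasses that quietly take notes during a meeting. All of these rely on one core capability: offline, real-time, accurate speech recognition.

SenseVoice-small, the subject of this article, is a technology that can make these scenarios real. It is a lightweight speech recognition model which, once exported to ONNX and quantized, is compact yet capable. More importantly, it lends itself to being wrapped in an SDK and embedded in phones, tablets, embedded boards, and other smart hardware, so that devices can actually understand spoken language.

This article is not about the model's web demo. It looks at how to turn the model into an integrable voice-interaction SDK and ship it on the two major mobile platforms, Android and iOS. Whether you build smart hardware, write mobile apps, or are simply curious about on-device AI, you should find practical ideas and approaches here.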

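2. Why SenseVoice-small for On-Device Integration?

Before writing any code, one question needs an answer: with so many speech recognition options available, why SenseVoice-small?

2.1 Core Strengths: Built for the Edge

Several properties make SenseVoice-small a good fit for smart hardware:

First, it really is small

  • After ONNX export, quantization, and optimization, the model file shrinks substantially
  • Memory and storage requirements are modest, so older or low-end embedded devices are realistic targets (measure on your own hardware before committing)
  • That translates into lower hardware cost and a wider range of products it can ship in

Second, it works offline

  • No cloud server is needed; all computation happens on the device
  • There is no network round trip, so latency depends only on local inference
  • Audio never leaves the device, which keeps user data private

Third, it is versatile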

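  • Multilingual recognition, including Chinese, English, Japanese, Korean, and Cantonese (the SenseVoice family advertises more than 50 languages; check the language list of the specific checkpoint you ship)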
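  • Automatic language detection, so there is no manual switching
  • Emotion recognition, which estimates the speaker's emotional state
  • Inverse text normalization (ITN), which turns a spoken form such as "一百二十" into "120"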
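
To make the ITN bullet concrete, here is a tiny rule-based sketch of the input/output relationship. This is not how the model does it (ITN in SenseVoice is part of the model's output, not a lookup table); the function name and the supported range (plain numerals below 10,000) are assumptions chosen for illustration.

```python
# Illustrative only: a rule-based converter for Chinese numerals below 10,000.
DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
UNITS = {"十": 10, "百": 100, "千": 1000}


def chinese_numeral_to_int(text: str) -> int:
    """Convert a spoken-form numeral such as "一百二十" to an integer."""
    total = 0
    pending = 0  # digit that is still waiting for its unit
    for ch in text:
        if ch in DIGITS:
            pending = DIGITS[ch]
        elif ch in UNITS:
            # A bare unit, as in "十五", implies a leading one.
            total += (pending or 1) * UNITS[ch]
            pending = 0
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return total + pending


print(chinese_numeral_to_int("一百二十"))
```

Colloquial shortenings and larger units such as "万" are deliberately left out, which is exactly the kind of long tail that makes a learned ITN attractive.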

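2.2 Typical Application Scenarios

These properties open up a range of uses for SenseVoice-small in smart hardware:

Scenario 1: Offline voice assistants

  • Smart speakers and smart-home control panels
  • In-car voice control systems
  • Voice control panels for industrial equipment

Scenario 2: Real-time captioning

  • Live transcription on video-conferencing devices
  • Classroom captions on educational tablets
  • Live caption streams on broadcasting equipment

Scenario 3: Privacy-sensitive settings

  • Dictated medical records on clinical devices
  • Spoken command confirmation on financial terminals
  • Confidential meeting notes for government and enterprise

Scenario 4: Resource-constrained environments

  • Communication devices in remote areas
  • Places with poor mobile coverage
  • Low-end devices with limited compute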

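3. SDK Design: From Model to Interface

Turning SenseVoice-small into a usable SDK takes several layers of wrapping. Think of it as fitting a steering wheel, throttle, and brakes to a powerful engine so that developers can drive it comfortably.

3.1 Core Engine Layer: Wrapping Model Inference

The lowest layer is model inference. SenseVoice-small is already available in ONNX format, so the job is to wrap it behind a uniform inference interface.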

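# Example: core inference wrapper on the Python side
import numpy as np
import onnxruntime as ort

# Illustrative language IDs, matching the mapping used in the mobile code later on;
# confirm the real values against the model you export.
LANGUAGE_IDS = {"auto": -1, "zh": 0, "en": 1, "ja": 2, "ko": 3, "yue": 4}


class SenseVoiceEngine:
    def __init__(self, model_path: str):
        """
        Initialize the speech recognition engine.
        :param model_path: path to the ONNX model file
        """
        self.session = ort.InferenceSession(model_path)
        self.sample_rate = 16000  # expected input sample rate in Hz
        
    def preprocess_audio(self, audio_data: np.ndarray) -> np.ndarray:
        """
        Preprocess audio: downmix to mono, peak-normalize, add a batch axis.
        Input is assumed to be sampled at 16 kHz already; no resampling happens here.
        """
        # 1. Downmix stereo to mono (expects shape [samples, channels])
        if audio_data.ndim > 1:
            audio_data = audio_data.mean(axis=1)
            
        # 2. Normalize to [-1, 1]
        audio_data = audio_data.astype(np.float32)
        peak = float(np.abs(audio_data).max()) if audio_data.size else 0.0
        if peak > 0:
            audio_data = audio_data / peak
            
        # 3. Add the batch dimension
        return np.expand_dims(audio_data, axis=0)
    
    def recognize(self, audio_data: np.ndarray, 
                  language: str = "auto") -> dict:
        """
        Run speech recognition.
        :return: dict with text, language, emotion, and confidence
        """
        processed_audio = self.preprocess_audio(audio_data)
        
        # Input and output names follow the simplified interface used in this
        # article; inspect self.session.get_inputs() / get_outputs() for the
        # names and shapes of your actual export.
        inputs = {
            "audio": processed_audio,
            # The tensor carries an integer ID, so map the language string first
            "language": np.array([LANGUAGE_IDS.get(language, -1)], dtype=np.int64)
        }
        
        outputs = self.session.run(None, inputs)
        
        return {
            "text": outputs[0],        # recognized text
            "language": outputs[1],    # detected language
            "emotion": outputs[2],     # emotion label
            "confidence": outputs[3]   # confidence score
        }
    
    def stream_recognize(self, audio_stream, language: str = "auto"):
        """
        Streaming interface for real-time speech.
        :param audio_stream: iterable of 16 kHz sample arrays, one per chunk
        """
        # Minimal strategy: recognize each chunk independently and yield the result
        for chunk in audio_stream:
            yield self.recognize(chunk, language)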

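The core engine provides a few key capabilities:

  • A single audio preprocessing pipeline
  • A synchronous recognition interface (suited to file processing)
  • A streaming recognition interface (suited to real-time use)
  • A complete result (text, language, emotion, confidence)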
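
The streaming interface needs some way to cut a continuous microphone feed into model-sized pieces. A minimal sketch of one option, fixed-size windows with overlap, is shown below. The function name `sliding_chunks` and the default sizes (1 s windows with 0.5 s overlap at 16 kHz, the same numbers the Android code later uses) are assumptions, not part of any SenseVoice API.

```python
from typing import Iterable, Iterator, List


def sliding_chunks(frames: Iterable[List[float]],
                   chunk_size: int = 16000,
                   overlap: int = 8000) -> Iterator[List[float]]:
    """Yield fixed-size windows from a stream of variable-size frames.

    Consecutive windows share `overlap` samples, so a word that straddles a
    window boundary appears whole in at least one window.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    buffer: List[float] = []
    for frame in frames:
        buffer.extend(frame)
        while len(buffer) >= chunk_size:
            yield buffer[:chunk_size]
            del buffer[:step]  # keep the last `overlap` samples for the next window


# Tiny demonstration with a window of 4 samples and an overlap of 2
for window in sliding_chunks([[0, 1, 2, 3], [4, 5, 6, 7, 8, 9]], chunk_size=4, overlap=2):
    print(window)
```

The output of this generator could be passed straight to `stream_recognize`. Overlap costs extra inference work and produces duplicated text at the seams, so the caller still has to decide how to merge neighbouring results.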

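3.2 Platform Adaptation Layer: A Cross-Platform Abstraction

Hardware platforms differ in their characteristics, so an adaptation layer is needed to hide those differences.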

# Platform abstraction interface.
# Pseudocode: AudioRecord, AVAudioEngine and related names are native platform APIs,
# written in Python syntax here only to show the shape of each adapter.
class PlatformAdapter:
    """Base class for platform adapters"""
    
    def get_audio_input(self):
        """Return the audio input device"""
        raise NotImplementedError
        
    def allocate_buffer(self, size: int):
        """Allocate an audio buffer"""
        raise NotImplementedError
        
    def get_optimal_config(self) -> dict:
        """Return the recommended configuration for this platform"""
        raise NotImplementedError


# Android implementation
class AndroidAdapter(PlatformAdapter):
    def __init__(self):
        self.audio_manager = None
        
    def get_audio_input(self):
        # Uses Android's AudioRecord API
        config = {
            "source": AudioSource.MIC,
            "sample_rate": 16000,
            "channel_config": AudioFormat.CHANNEL_IN_MONO,
            "audio_format": AudioFormat.ENCODING_PCM_16BIT,
            "buffer_size": 4096
        }
        return AudioRecord(**config)
    
    def get_optimal_config(self):
        return {
            "threads": 4,  # suggested thread count
            "use_gpu": False,  # CPU is the usual choice on Android
            "memory_limit_mb": 100  # memory budget
        }


# iOS implementation
class IOSAdapter(PlatformAdapter):
    def __init__(self):
        self.audio_session = None
        
    def get_audio_input(self):
        # Uses AVAudioEngine
        engine = AVAudioEngine()
        input_node = engine.inputNode
        return {
            "engine": engine,
            "input_node": input_node
        }
    
    def get_optimal_config(self):
        return {
            "threads": 2,  # keep the thread count low on iOS
            "use_ane": True,  # use the Apple Neural Engine where available
            "memory_limit_mb": 50  # iOS memory budgets are tighter
        }
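
One way the engine layer might consume these per-platform settings is a small lookup that merges platform defaults with caller overrides. The sketch below copies the numbers from the adapters above; `resolve_config` and the fallback values for unknown platforms are hypothetical.

```python
from typing import Dict, Optional

# Values copied from the adapters above.
PLATFORM_CONFIGS: Dict[str, dict] = {
    "android": {"threads": 4, "use_gpu": False, "memory_limit_mb": 100},
    "ios": {"threads": 2, "use_ane": True, "memory_limit_mb": 50},
}

# Hypothetical conservative fallback for platforms without an adapter.
FALLBACK_CONFIG = {"threads": 1, "memory_limit_mb": 50}


def resolve_config(platform: str, overrides: Optional[dict] = None) -> dict:
    """Merge platform defaults with caller overrides (overrides win)."""
    # Copy first so the shared defaults are never mutated.
    config = dict(PLATFORM_CONFIGS.get(platform.lower(), FALLBACK_CONFIG))
    config.update(overrides or {})
    return config


print(resolve_config("android", {"threads": 2}))
```

Keeping the defaults in data rather than in code makes it easy to tune them per device model later without touching the adapters.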

3.3 Application Interface Layer: A Developer-Friendly API

Finally, application developers need a simple, easy-to-use API.

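// Android SDK interface outline (method bodies are left as comments)
public class SenseVoiceSDK {
    
    // Singleton
    private static SenseVoiceSDK instance;
    
    // synchronized so that concurrent first calls cannot create two instances
    public static synchronized SenseVoiceSDK getInstance() {
        if (instance == null) {
            instance = new SenseVoiceSDK();
        }
        return instance;
    }
    
    // Initialize the SDK
    public void initialize(Context context, String modelPath) {
        // Load the model
        // Set up the audio system
        // Warm up the model
    }
    
    // Recognize an audio file
    public RecognitionResult recognizeFile(String filePath, 
                                          RecognitionConfig config) {
        // Read the audio file
        // Call the recognition engine
        // Return the result
        throw new UnsupportedOperationException("Outline only: not implemented");
    }
    
    // Real-time recognition
    public void startRealtimeRecognition(RecognitionCallback callback) {
        // Start recording
        // Process the audio stream as it arrives
        // Deliver results through the callback
    }
    
    // Stop recognition
    public void stopRealtimeRecognition() {
        // Stop recording
        // Release resources
    }
    
    // Configuration
    public void setLanguage(String language) {
        // Set the recognition language
    }
    
    public void enableITN(boolean enable) {
        // Enable or disable inverse text normalization
    }
}

// Callback interface
public interface RecognitionCallback {
    void onPartialResult(String text);  // intermediate result
    void onFinalResult(RecognitionResult result);  // final result
    void onError(int errorCode, String message);  // error callback
}

// Result object
public class RecognitionResult {
    private String text;  // recognized text
    private String language;  // detected language
    private String emotion;  // emotion label
    private float confidence;  // confidence score
    private long duration;  // audio duration
    
    // getters and setters
}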
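
The callback interface implies an ordering contract: zero or more partial results, then either one final result or one error. A minimal sketch of a driver that honours that contract is shown below. `run_session`, the error code, and the choice to build the final text by concatenating partials (which assumes non-overlapping chunks) are all assumptions for illustration.

```python
from typing import Callable, Iterable, Protocol, Sequence


class RecognitionCallback(Protocol):
    def on_partial_result(self, text: str) -> None: ...
    def on_final_result(self, text: str) -> None: ...
    def on_error(self, error_code: int, message: str) -> None: ...


ERROR_RECOGNITION_FAILED = 1  # hypothetical error code


def run_session(chunks: Iterable[Sequence[float]],
                recognize: Callable[[Sequence[float]], str],
                callback: RecognitionCallback) -> None:
    """Emit one partial result per chunk, then exactly one final result or error."""
    pieces = []
    try:
        for chunk in chunks:
            text = recognize(chunk)
            pieces.append(text)
            callback.on_partial_result(text)
    except Exception as exc:
        # After an error, no final result is delivered.
        callback.on_error(ERROR_RECOGNITION_FAILED, str(exc))
        return
    callback.on_final_result("".join(pieces))
```

Writing the contract down this explicitly makes it easier to keep the Android and iOS implementations behaving the same way.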

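4. Android Integration in Practice: Teaching an App to Understand Speech

Now let's look at how to integrate the SDK into an Android app. I'll use a concrete example, a voice notepad, and build it step by step.

4.1 Environment Setup and Dependencies

First, add the required dependencies to the Android project.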

// app/build.gradle
android {
    defaultConfig {
        ndk {
            abiFilters 'armeabi-v7a', 'arm64-v8a', 'x86', 'x86_64'
        }
    }
    
    aaptOptions {
        noCompress "onnx"  // store the model file uncompressed in the APK
    }
}

dependencies {
    // ONNX Runtime for Android
    implementation 'com.microsoft.onnxruntime:onnxruntime-android:1.15.0'
    
    // Audio processing library
    implementation 'com.arthenica:mobile-ffmpeg-min:4.4.LTS'
    
    // Permission handling
    implementation 'com.guolindev.permissionx:permissionx:1.7.1'
}

4.2 Model File and Resource Management

Place the SenseVoice-small ONNX model where the app can reach it, and make sure it can be loaded reliably.

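// ModelManager.kt - model file management
import android.content.Context
import java.io.File
import java.io.FileOutputStream
import java.io.IOException
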
class ModelManager(private val context: Context) {
    
    companion object {
        private const val MODEL_NAME = "sensevoice-small.onnx"
        private const val MODEL_ASSETS_PATH = "models/$MODEL_NAME"
    }
    
    // Check whether the bundled model file exists
    fun checkModelExists(): Boolean {
        return try {
            context.assets.open(MODEL_ASSETS_PATH).close()
            true
        } catch (e: IOException) {
            false
        }
    }
    
    // Copy the model from assets into the app's private storage
    fun copyModelToInternal(): File {
        val modelDir = File(context.filesDir, "models")
        if (!modelDir.exists()) {
            modelDir.mkdirs()
        }
        
        val modelFile = File(modelDir, MODEL_NAME)
        
        // If the file already exists and is current, return it as is
        if (modelFile.exists() && isModelUpToDate()) {
            return modelFile
        }
        
        // Copy from assets
        context.assets.open(MODEL_ASSETS_PATH).use { input ->
            FileOutputStream(modelFile).use { output ->
                input.copyTo(output)
            }
        }
        
        // Record the version that was copied
        saveModelVersion()
        
        return modelFile
    }
    
    private fun isModelUpToDate(): Boolean {
        // Compare the stored version with the bundled one
        val prefs = context.getSharedPreferences("model_info", Context.MODE_PRIVATE)
        val savedVersion = prefs.getString("model_version", "")
        return savedVersion == getCurrentModelVersion()
    }
    
    private fun saveModelVersion() {
        val prefs = context.getSharedPreferences("model_info", Context.MODE_PRIVATE)
        prefs.edit().putString("model_version", getCurrentModelVersion()).apply()
    }
    
    private fun getCurrentModelVersion(): String {
        // Read the version from the model file or a config entry (hard-coded in this sample)
        return "1.0.0"
    }
}
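
The hard-coded version string works, but it is easy to forget to bump. One alternative is to compare content hashes instead. The sketch below shows the idea with ordinary files in Python; on Android the bundled side would be read through the asset stream, and in practice the expected hash would be computed at build time rather than on every launch. `file_sha256` and `ensure_model` are hypothetical names.

```python
import hashlib
import shutil
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hash a file in 1 MiB blocks so large models are never loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def ensure_model(bundled: Path, target_dir: Path) -> Path:
    """Copy `bundled` into `target_dir` unless an identical copy is already there."""
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / bundled.name
    if target.exists() and file_sha256(target) == file_sha256(bundled):
        return target  # up to date, nothing to do
    shutil.copyfile(bundled, target)
    return target
```

A hash check also catches a partially written copy left behind by a crash, which a version string alone would miss.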

4.3 Implementing the Core Recognition Service

This is the heart of the integration: the speech recognition service itself.

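// SenseVoiceService.kt - speech recognition service
import ai.onnxruntime.OnnxTensor
import ai.onnxruntime.OrtEnvironment
import ai.onnxruntime.OrtSession
import android.content.Context
import android.media.AudioFormat
import android.media.AudioRecord
import android.media.MediaRecorder
import android.util.Log

class SenseVoiceService(
    private val context: Context,
    private val modelPath: String
) {
    
    private var ortSession: OrtSession? = null
    private var isInitialized = false
    @Volatile private var isRecording = false  // read by the recording thread
    private var recognitionCallback: RecognitionCallback? = null
    
    // Initialize the recognition engine
    fun initialize(): Boolean {
        return try {
            // Create the ONNX Runtime environment
            val env = OrtEnvironment.getEnvironment()
            val sessionOptions = OrtSession.SessionOptions()
            
            // Session settings tuned for Android
            sessionOptions.setOptimizationLevel(OrtSession.SessionOptions.OptLevel.ALL_OPT)
            sessionOptions.setIntraOpNumThreads(4)  // use 4 threads
            sessionOptions.setMemoryPatternOptimization(true)
            
            // Load the model
            ortSession = env.createSession(modelPath, sessionOptions)
            
            // Set the flag before warming up, because recognizeAudio() checks it
            isInitialized = true
            warmUpModel()
            true
        } catch (e: Exception) {
            Log.e("SenseVoice", "Initialization failed: ${e.message}")
            isInitialized = false
            false
        }
    }
    
    // Warm up the model to reduce first-call latency
    private fun warmUpModel() {
        val dummyAudio = FloatArray(16000)  // 1 second of silence
        recognizeAudio(dummyAudio, "auto")
    }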
    
    // Recognize a buffer of audio samples
    fun recognizeAudio(
        audioData: FloatArray,
        language: String = "auto"
    ): RecognitionResult {
        val session = ortSession
        if (!isInitialized || session == null) {
            throw IllegalStateException("Service not initialized")
        }
        
        return try {
            val env = OrtEnvironment.getEnvironment()
            
            // Preprocess the audio
            val processedAudio = preprocessAudio(audioData)
            
            // Build the input tensors; createTensor takes an NIO buffer plus a shape
            val inputTensor = OnnxTensor.createTensor(
                env,
                java.nio.FloatBuffer.wrap(processedAudio),
                longArrayOf(1, processedAudio.size.toLong())
            )
            
            val languageCode = when (language) {
                "zh" -> 0L
                "en" -> 1L
                "ja" -> 2L
                "ko" -> 3L
                "yue" -> 4L
                else -> -1L  // auto
            }
            
            val languageTensor = OnnxTensor.createTensor(
                env,
                java.nio.LongBuffer.wrap(longArrayOf(languageCode)),
                longArrayOf(1)
            )
            
            // Input/output names and output types follow this article's simplified
            // interface; check session.inputNames / session.outputNames for your export.
            val inputs = mapOf(
                "audio" to inputTensor,
                "language" to languageTensor
            )
            
            try {
                // use {} closes the result, which also releases the output values
                session.run(inputs).use { outputs ->
                    RecognitionResult(
                        text = outputs[0].value as String,
                        language = outputs[1].value as String,
                        emotion = outputs[2].value as String,
                        confidence = (outputs[3].value as FloatArray)[0],
                        duration = (audioData.size / 16000f * 1000).toLong()  // milliseconds
                    )
                }
            } finally {
                // Release the input tensors
                inputTensor.close()
                languageTensor.close()
            }
        } catch (e: Exception) {
            Log.e("SenseVoice", "Recognition failed: ${e.message}")
            RecognitionResult(error = e.message ?: "Recognition failed")
        }
    }
    
    // Audio preprocessing
    private fun preprocessAudio(
        audioData: FloatArray,
        sourceSampleRate: Int = 16000  // the AudioRecord in this class records at 16 kHz
    ): FloatArray {
        // 1. Make sure the sample rate is 16 kHz
        val targetSampleRate = 16000
        
        val processed = if (sourceSampleRate != targetSampleRate) {
            resampleAudio(audioData, sourceSampleRate, targetSampleRate)
        } else {
            audioData.copyOf()
        }
        
        // 2. Peak-normalize to [-1, 1]
        var absMax = 0f
        for (sample in processed) {
            absMax = maxOf(absMax, kotlin.math.abs(sample))
        }
        
        if (absMax > 0f) {
            for (i in processed.indices) {
                processed[i] = processed[i] / absMax
            }
        }
        
        return processed
    }
    
    // Resampling
    private fun resampleAudio(
        audio: FloatArray,
        fromRate: Int,
        toRate: Int
    ): FloatArray {
        // Simplified nearest-sample resampling with no low-pass filter; use a proper audio library in real projects
        val ratio = fromRate.toFloat() / toRate.toFloat()
        val newLength = (audio.size / ratio).toInt()
        val resampled = FloatArray(newLength)
        
        for (i in 0 until newLength) {
            val srcIndex = (i * ratio).toInt()
            if (srcIndex < audio.size) {
                resampled[i] = audio[srcIndex]
            }
        }
        
        return resampled
    }
    
    // Start real-time recognition
    fun startRealtimeRecognition(callback: RecognitionCallback) {
        this.recognitionCallback = callback
        isRecording = true  // must be set before the loop below starts
        
        // Start the audio recording thread
        val audioThread = Thread {
            val bufferSize = AudioRecord.getMinBufferSize(
                16000,
                AudioFormat.CHANNEL_IN_MONO,
                AudioFormat.ENCODING_PCM_16BIT
            )
            
            val audioRecord = AudioRecord(
                MediaRecorder.AudioSource.MIC,
                16000,
                AudioFormat.CHANNEL_IN_MONO,
                AudioFormat.ENCODING_PCM_16BIT,
                bufferSize
            )
            
            audioRecord.startRecording()
            
            val buffer = ShortArray(bufferSize / 2)
            val audioBuffer = ArrayList<Float>()
            val chunkSize = 16000  // 1 second
            val keepSize = 8000    // 0.5 second of overlap
            
            while (isRecording) {
                // read() returns the number of 16-bit samples, not bytes
                val samplesRead = audioRecord.read(buffer, 0, buffer.size)
                
                if (samplesRead > 0) {
                    // Convert 16-bit PCM to float in [-1, 1)
                    for (i in 0 until samplesRead) {
                        audioBuffer.add(buffer[i] / 32768.0f)
                    }
                    
                    // Run recognition once a full second has accumulated
                    if (audioBuffer.size >= chunkSize) {
                        val audioData = audioBuffer.take(chunkSize).toFloatArray()
                        val result = recognizeAudio(audioData, "auto")
                        
                        // Deliver the result
                        callback.onPartialResult(result.text)
                        
                        // Drop the oldest 0.5 s; the last 0.5 s of this chunk and any
                        // newer samples stay in the buffer for the next pass
                        audioBuffer.subList(0, chunkSize - keepSize).clear()
                    }
                }
            }
            
            audioRecord.stop()
            audioRecord.release()
        }
        
        audioThread.start()
    }
    
    // Stop real-time recognition
    fun stopRealtimeRecognition() {
        isRecording = false
        recognitionCallback = null
    }
    
    // Release resources
    fun release() {
        ortSession?.close()
        ortSession = null
        isInitialized = false
    }
}
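
Two small pieces of arithmetic in this service are worth seeing in isolation: the 16-bit PCM to float conversion, and resampling. The service's `resampleAudio` simply picks the nearest source sample; the sketch below uses linear interpolation instead, which is a different and slightly smoother technique. Neither applies a low-pass filter, so both can alias when downsampling. The function names are hypothetical.

```python
from typing import List, Sequence


def pcm16_to_float(samples: Sequence[int]) -> List[float]:
    """Map signed 16-bit PCM samples to floats in [-1.0, 1.0)."""
    return [s / 32768.0 for s in samples]


def resample_linear(audio: Sequence[float], from_rate: int, to_rate: int) -> List[float]:
    """Resample by linear interpolation between neighbouring samples."""
    if from_rate <= 0 or to_rate <= 0:
        raise ValueError("sample rates must be positive")
    if from_rate == to_rate or len(audio) == 0:
        return list(audio)
    new_length = len(audio) * to_rate // from_rate
    ratio = from_rate / to_rate
    last = len(audio) - 1
    out: List[float] = []
    for i in range(new_length):
        pos = i * ratio                 # fractional position in the source
        left = int(pos)
        right = min(left + 1, last)     # clamp at the final sample
        frac = pos - left
        out.append(audio[left] * (1.0 - frac) + audio[right] * frac)
    return out


print(pcm16_to_float([-32768, 0, 16384]))
print(resample_linear([0.0, 1.0, 2.0, 3.0], from_rate=2, to_rate=1))
```

For production audio a dedicated resampler with proper filtering is still the safer choice, as the comment in the service code already suggests.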

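4.4 Permission Handling and UI Integration

Using the microphone on Android requires a runtime permission, and the app also needs a convenient UI component.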

// VoiceRecognitionView.kt - speech recognition UI component
class VoiceRecognitionView @JvmOverloads constructor(
    context: Context,
    attrs: AttributeSet? = null,
    defStyleAttr: Int = 0
) : FrameLayout(context, attrs, defStyleAttr) {
    
    // UI widgets
    private lateinit var recordButton: ImageButton
    private lateinit var resultTextView: TextView
    private lateinit var languageSpinner: Spinner
    private lateinit var progressBar: ProgressBar
    
    // Callbacks
    var onRecognitionResult: ((RecognitionResult) -> Unit)? = null
    var onRecognitionError: ((String) -> Unit)? = null
    
    // Recognition service
    private lateinit var senseVoiceService: SenseVoiceService
    
    init {
        initView(context)
        initService()
    }
    
    private fun initView(context: Context) {
        // Inflate the layout
        LayoutInflater.from(context).inflate(R.layout.view_voice_recognition, this, true)
        
        recordButton = findViewById(R.id.btn_record)
        resultTextView = findViewById(R.id.tv_result)
        languageSpinner = findViewById(R.id.spinner_language)
        progressBar = findViewById(R.id.progress_bar)
        
        // Populate the language options
        val languages = arrayOf("Auto-detect", "Chinese", "English", "Japanese", "Korean", "Cantonese")
        val adapter = ArrayAdapter(context, android.R.layout.simple_spinner_item, languages)
        adapter.setDropDownViewResource(android.R.layout.simple_spinner_dropdown_item)
        languageSpinner.adapter = adapter
        
        // Toggle recording when the button is tapped
        recordButton.setOnClickListener {
            if (isRecording) {
                stopRecording()
            } else {
                startRecording()
            }
        }
    }
    
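    private fun initService() {
        // Create the recognition service. Note that getModelPath() copies the model
        // on the calling thread the first time it runs.
        senseVoiceService = SenseVoiceService(context, getModelPath())
        
        // Initialize on a background thread (GlobalScope keeps the sample short;
        // prefer a lifecycle-aware scope in production code)
        GlobalScope.launch(Dispatchers.IO) {
            val success = senseVoiceService.initialize()
            
            withContext(Dispatchers.Main) {
                if (success) {
                    recordButton.isEnabled = true
                    Toast.makeText(context, "Speech recognition is ready", Toast.LENGTH_SHORT).show()
                } else {
                    recordButton.isEnabled = false
                    onRecognitionError?.invoke("Failed to initialize speech recognition")
                }
            }
        }
    }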
    
    private fun getModelPath(): String {
        val modelManager = ModelManager(context)
        return modelManager.copyModelToInternal().absolutePath
    }
    
    private fun startRecording() {
        // Check the microphone permission
        if (!hasRecordPermission()) {
            requestRecordPermission()
            return
        }
        
        // Update the UI
        recordButton.setImageResource(R.drawable.ic_stop)
        resultTextView.text = "Listening..."
        progressBar.visibility = View.VISIBLE
        
        // Start recognition. Callbacks arrive on the recording thread, so UI updates
        // are posted to the main thread with View.post (a View has no runOnUiThread).
        senseVoiceService.startRealtimeRecognition(object : RecognitionCallback {
            override fun onPartialResult(text: String) {
                // Show the partial result
                post {
                    resultTextView.text = text
                }
            }
            
            override fun onFinalResult(result: RecognitionResult) {
                post {
                    progressBar.visibility = View.GONE
                    onRecognitionResult?.invoke(result)
                }
            }
            
            override fun onError(errorCode: Int, message: String) {
                post {
                    progressBar.visibility = View.GONE
                    recordButton.setImageResource(R.drawable.ic_mic)
                    onRecognitionError?.invoke(message)
                }
            }
        })
        
        isRecording = true
    }
    
    private fun stopRecording() {
        senseVoiceService.stopRealtimeRecognition()
        
        // Called from the click listener, so this already runs on the main thread
        recordButton.setImageResource(R.drawable.ic_mic)
        progressBar.visibility = View.GONE
        
        isRecording = false
    }
    
    private fun hasRecordPermission(): Boolean {
        return ContextCompat.checkSelfPermission(
            context,
            Manifest.permission.RECORD_AUDIO
        ) == PackageManager.PERMISSION_GRANTED
    }
    
    private fun requestRecordPermission() {
        ActivityCompat.requestPermissions(
            context as Activity,
            arrayOf(Manifest.permission.RECORD_AUDIO),
            RECORD_AUDIO_REQUEST_CODE
        )
    }
    
    fun onRequestPermissionsResult(
        requestCode: Int,
        permissions: Array<out String>,
        grantResults: IntArray
    ) {
        if (requestCode == RECORD_AUDIO_REQUEST_CODE) {
            if (grantResults.isNotEmpty() && grantResults[0] == PackageManager.PERMISSION_GRANTED) {
                startRecording()
            } else {
                onRecognitionError?.invoke("Microphone permission is required for speech recognition")
            }
        }
    }
    
    fun release() {
        senseVoiceService.release()
    }
    
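    // Recording state belongs to each view instance, not to the companion object,
    // so that two views never share it
    private var isRecording = false
    
    companion object {
        private const val RECORD_AUDIO_REQUEST_CODE = 1001
    }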
}

4.5 Using the Component in an Activity

Finally, here is how to use the speech recognition component from an Activity.

// MainActivity.kt - main screen
class MainActivity : AppCompatActivity() {
    
    private lateinit var voiceRecognitionView: VoiceRecognitionView
    private lateinit var resultRecyclerView: RecyclerView
    private lateinit var adapter: RecognitionResultAdapter
    
    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        setContentView(R.layout.activity_main)
        
        // Look up the views
        voiceRecognitionView = findViewById(R.id.voice_recognition_view)
        resultRecyclerView = findViewById(R.id.recycler_results)
        
        // Set up the RecyclerView
        adapter = RecognitionResultAdapter()
        resultRecyclerView.layoutManager = LinearLayoutManager(this)
        resultRecyclerView.adapter = adapter
        
        // Register the recognition callbacks
        voiceRecognitionView.onRecognitionResult = { result ->
            // Add to the history list
            adapter.addResult(result)
            
            // Show the recognition result
            showResultDialog(result)
        }
        
        voiceRecognitionView.onRecognitionError = { error ->
            Toast.makeText(this, "Recognition error: $error", Toast.LENGTH_LONG).show()
        }
    }
    
    private fun showResultDialog(result: RecognitionResult) {
        val dialog = AlertDialog.Builder(this)
            .setTitle("Recognition result")
            .setMessage("""
                Text: ${result.text}
                
                Language: ${result.language}
                Emotion: ${result.emotion}
                Confidence: ${String.format("%.2f", result.confidence * 100)}%
                Duration: ${result.duration}ms
            """.trimIndent())
            .setPositiveButton("OK", null)
            .setNegativeButton("Copy") { _, _ ->
                // Copy to the clipboard
                val clipboard = getSystemService(Context.CLIPBOARD_SERVICE) as ClipboardManager
                val clip = ClipData.newPlainText("Recognition result", result.text)
                clipboard.setPrimaryClip(clip)
                Toast.makeText(this, "Copied to clipboard", Toast.LENGTH_SHORT).show()
            }
            .create()
        
        dialog.show()
    }
    
    override fun onRequestPermissionsResult(
        requestCode: Int,
        permissions: Array<out String>,
        grantResults: IntArray
    ) {
        super.onRequestPermissionsResult(requestCode, permissions, grantResults)
        voiceRecognitionView.onRequestPermissionsResult(requestCode, permissions, grantResults)
    }
    
    override fun onDestroy() {
        super.onDestroy()
        voiceRecognitionView.release()
    }
}

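5. iOS Integration Guide: A Voice Experience for the Apple Ecosystem

Integration on iOS follows the same approach as on Android, but the implementation details differ. We use Swift and SwiftUI to build a modern speech recognition app.

5.1 Creating the iOS Framework

First, create a Framework that wraps the core SenseVoice-small functionality.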

// SenseVoiceFramework.swift
import Foundation
import AVFoundation
// ONNX Runtime Objective-C bindings. The module name depends on how the dependency
// is added (assumed here: `onnxruntime_objc` from the CocoaPods pod); adjust as needed.
import onnxruntime_objc

public class SenseVoiceRecognizer {
    
    private var ortEnv: ORTEnv?
    private var onnxSession: ORTSession?
    private var isInitialized = false
    private var audioEngine: AVAudioEngine?
    private var recognitionCallback: ((RecognitionResult) -> Void)?
    
    // Initialize the recognizer
    public func initialize(modelPath: String) throws -> Bool {
        do {
            // Create the ONNX Runtime environment
            let env = try ORTEnv(loggingLevel: .warning)
            
            // Create the session options
            let options = try ORTSessionOptions()
            try options.setIntraOpNumThreads(2)  // 2 threads suggested on iOS
            try options.setGraphOptimizationLevel(.all)
            
            // Load the model from its file path
            onnxSession = try ORTSession(env: env, 
                                        modelPath: modelPath, 
                                        sessionOptions: options)
            ortEnv = env  // keep the environment alive as long as the session
            
            // Set the flag before warming up, because recognizeAudio() guards on it
            isInitialized = true
            try warmUpModel()
            
            return true
        } catch {
            isInitialized = false
            print("SenseVoice initialization failed: \(error)")
            throw error
        }
    }
    
    // Warm up the model
    private func warmUpModel() throws {
        let dummyAudio = [Float](repeating: 0, count: 16000)
        _ = try recognizeAudio(dummyAudio, language: "auto")
    }
    
    // Recognize a buffer of audio samples
    public func recognizeAudio(_ audioData: [Float], 
                              language: String = "auto") throws -> RecognitionResult {
        guard isInitialized, let session = onnxSession else {
            throw SenseVoiceError.notInitialized
        }
        
        do {
            // Preprocess the audio
            let processedAudio = preprocessAudio(audioData)
            
            // Build the input tensor (NSMutableData copies the sample bytes)
            let inputShape: [NSNumber] = [1, NSNumber(value: processedAudio.count)]
            let inputTensor = try ORTValue(
                tensorData: NSMutableData(bytes: processedAudio, 
                                         length: processedAudio.count * MemoryLayout<Float>.size),
                elementType: ORTTensorElementDataType.float,
                shape: inputShape
            )
            
            // Language ID; declared with `var` so it can be passed by address
            var languageCode = languageCodeForString(language)
            let languageTensor = try ORTValue(
                tensorData: NSMutableData(bytes: &languageCode, 
                                         length: MemoryLayout<Int64>.size),
                elementType: ORTTensorElementDataType.int64,
                shape: [1]
            )
            
            // Run inference. Input and output names follow this article's simplified interface.
            let inputs = ["audio": inputTensor, "language": languageTensor]
            let outputs = try session.run(withInputs: inputs, 
                                        outputNames: ["text", "language", "emotion", "confidence"],
                                        runOptions: nil)
            
            // Parse the results
            guard let textTensor = outputs["text"],
                  let languageTensorOutput = outputs["language"],
                  let emotionTensor = outputs["emotion"],
                  let confidenceTensor = outputs["confidence"] else {
                throw SenseVoiceError.recognitionFailed("Missing model output")
            }
            
            // tensorDataAsString() is assumed here as a string-decoding helper you
            // would write yourself on top of ORTValue.
            let text = try textTensor.tensorDataAsString()
            let detectedLang = try languageTensorOutput.tensorDataAsString()
            let emotion = try emotionTensor.tensorDataAsString()
            
            let confidenceData = try confidenceTensor.tensorData() as Data
            let confidence = confidenceData.withUnsafeBytes { $0.load(as: Float.self) }
            
            return RecognitionResult(
                text: text,
                language: detectedLang,
                emotion: emotion,
                confidence: confidence,
                duration: Int64(Double(audioData.count) / 16000.0 * 1000)
            )
        } catch {
            print("Recognition failed: \(error)")
            throw error
        }
    }
    
    // Audio preprocessing
    private func preprocessAudio(_ audio: [Float]) -> [Float] {
        var processed = audio
        
        // Peak-normalize to [-1, 1]
        let maxVal = processed.max() ?? 1.0
        let minVal = processed.min() ?? -1.0
        let absMax = max(abs(maxVal), abs(minVal))
        
        if absMax > 0 {
            processed = processed.map { $0 / absMax }
        }
        
        return processed
    }
    
    // Start real-time recognition
    public func startRealtimeRecognition(callback: @escaping (RecognitionResult) -> Void) throws {
        guard AVAudioSession.sharedInstance().recordPermission == .granted else {
            throw SenseVoiceError.permissionDenied
        }
        
        recognitionCallback = callback
        let audioEngine = AVAudioEngine()
        self.audioEngine = audioEngine
        
        let inputNode = audioEngine.inputNode
        let inputFormat = inputNode.outputFormat(forBus: 0)
        
        // Target format: 16 kHz, mono, 32-bit float
        guard let format = AVAudioFormat(
            commonFormat: .pcmFormatFloat32,
            sampleRate: 16000,
            channels: 1,
            interleaved: false
        ) else {
            throw SenseVoiceError.audioFormatFailed
        }
        
        // Create the converter once, outside the tap callback
        guard let converter = AVAudioConverter(from: inputFormat, to: format) else {
            throw SenseVoiceError.audioFormatFailed
        }
        
        // Install the tap
        inputNode.installTap(onBus: 0, 
                            bufferSize: 4096, 
                            format: inputFormat) { [weak self] buffer, _ in
            guard let self = self else { return }
            
            // Size the output buffer for the sample-rate ratio (for example 48 kHz to 16 kHz)
            let ratio = format.sampleRate / inputFormat.sampleRate
            let capacity = AVAudioFrameCount(Double(buffer.frameLength) * ratio) + 1
            guard let convertedBuffer = AVAudioPCMBuffer(pcmFormat: format, 
                                                        frameCapacity: capacity) else { return }
            
            // Hand the tap buffer to the converter exactly once per callback
            var consumed = false
            var error: NSError?
            converter.convert(to: convertedBuffer, error: &error) { _, outStatus in
                if consumed {
                    outStatus.pointee = .noDataNow
                    return nil
                }
                consumed = true
                outStatus.pointee = .haveData
                return buffer
            }
            
            guard error == nil, let channelData = convertedBuffer.floatChannelData else { return }
            let audioData = Array(UnsafeBufferPointer(start: channelData[0], 
                                                     count: Int(convertedBuffer.frameLength)))
            
            // Accumulate samples; recognition runs once a full second is buffered
            self.processAudioBuffer(audioData)
        }
        
        // Start the audio engine
        try audioEngine.start()
    }
    
    // 处理音频缓冲区
    private var audioBuffer: [Float] = []
    
    private func processAudioBuffer(_ newData: [Float]) {
        audioBuffer.append(contentsOf: newData)
        
        // 每1秒(16000个样本)处理一次
        while audioBuffer.count >= 16000 {
            let chunk = Array(audioBuffer.prefix(16000))
            audioBuffer.removeFirst(16000)
            
            do {
                let result = try recognizeAudio(chunk, language: "auto")
                DispatchQueue.main.async {
                    self.recognitionCallback?(result)
                }
            } catch {
                print("实时识别错误: \(error)")
            }
        }
    }
    
    // 停止识别
    public func stopRealtimeRecognition() {
        audioEngine?.stop()
        audioEngine?.inputNode.removeTap(onBus: 0)
        audioEngine = nil
        recognitionCallback = nil
    }
    
    // 语言编码映射
    private func languageCodeForString(_ language: String) -> Int64 {
        switch language.lowercased() {
        case "zh": return 0
        case "en": return 1
        case "ja": return 2
        case "ko": return 3
        case "yue": return 4
        default: return -1  // auto
        }
    }
    
    // Release resources
    public func release() {
        stopRealtimeRecognition()
        onnxSession = nil
        isInitialized = false
    }
}

// Recognition result
public struct RecognitionResult {
    public let text: String
    public let language: String
    public let emotion: String
    public let confidence: Float
    public let duration: Int64  // milliseconds
    public let error: String?
    
    public init(text: String = "", 
                language: String = "", 
                emotion: String = "", 
                confidence: Float = 0, 
                duration: Int64 = 0, 
                error: String? = nil) {
        self.text = text
        self.language = language
        self.emotion = emotion
        self.confidence = confidence
        self.duration = duration
        self.error = error
    }
}

// Error types
public enum SenseVoiceError: Error {
    case notInitialized
    case permissionDenied
    case audioEngineFailed
    case audioFormatFailed
    case modelNotFound
    case recognitionFailed(String)
}
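
The chunking logic in processAudioBuffer does not depend on Swift, so it can be prototyped and tested off-device. Below is a minimal Python sketch of the same idea, assuming 16 kHz mono float samples. `ChunkAccumulator` and its method names are hypothetical and not part of the SDK.

```python
from typing import List


class ChunkAccumulator:
    """Collects audio samples and emits fixed-size chunks (hypothetical helper)."""

    def __init__(self, chunk_size: int = 16000) -> None:
        # 16000 samples correspond to 1 second at 16 kHz.
        self.chunk_size = chunk_size
        self._pending: List[float] = []

    def push(self, samples: List[float]) -> List[List[float]]:
        """Append new samples and return every complete chunk now available."""
        self._pending.extend(samples)
        chunks: List[List[float]] = []
        while len(self._pending) >= self.chunk_size:
            chunks.append(self._pending[: self.chunk_size])
            # Keep the remainder for the next callback.
            self._pending = self._pending[self.chunk_size:]
        return chunks

    def pending_count(self) -> int:
        """Number of samples waiting for the next full chunk."""
        return len(self._pending)

    def reset(self) -> None:
        """Drop leftover samples, for example when recognition stops."""
        self._pending.clear()


# Small chunk size so the behaviour is easy to follow.
acc = ChunkAccumulator(chunk_size=4)
first = acc.push([0.1, 0.2, 0.3])                   # not enough for a chunk yet
second = acc.push([0.4, 0.5, 0.6, 0.7, 0.8, 0.9])   # 9 samples in total
```

In this run `first` is empty, `second` holds two chunks of four samples, and one sample stays pending. Calling `reset()` when recognition stops keeps stale audio from leaking into the next session.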

5.2 SwiftUI Interface

Build a modern speech recognition interface with SwiftUI.

// VoiceRecognitionView.swift
import SwiftUI
import AVFoundation

struct VoiceRecognitionView: View {
    @StateObject private var viewModel = VoiceRecognitionViewModel()
    @State private var isRecording = false
    @State private var recognizedText = ""
    @State private var selectedLanguage = "auto"
    
    let languages = [
        ("auto", "Auto-detect"),
        ("zh", "Chinese"),
        ("en", "English"),
        ("ja", "Japanese"),
        ("ko", "Korean"),
        ("yue", "Cantonese")
    ]
    
    var body: some View {
        VStack(spacing: 20) {
            // Title
            Text("Speech Recognition")
                .font(.largeTitle)
                .fontWeight(.bold)
                .padding(.top)
            
            // Language selection
            Picker("Select language", selection: $selectedLanguage) {
                ForEach(languages, id: \.0) { code, name in
                    Text(name).tag(code)
                }
            }
            .pickerStyle(SegmentedPickerStyle())
            .padding(.horizontal)
            
            // Record button
            Button(action: toggleRecording) {
                Circle()
                    .fill(isRecording ? Color.red : Color.blue)
                    .frame(width: 100, height: 100)
                    .overlay(
                        Image(systemName: isRecording ? "stop.fill" : "mic.fill")
                            .font(.system(size: 40))
                            .foregroundColor(.white)
                    )
                    .shadow(radius: 10)
            }
            .padding(.vertical, 30)
            
            // Status indicator
            if isRecording {
                VStack {
                    Text("Listening...")
                        .font(.headline)
                        .foregroundColor(.green)
                    
                    // Recording animation
                    HStack(spacing: 4) {
                        ForEach(0..<5) { i in
                            RoundedRectangle(cornerRadius: 2)
                                .fill(Color.green)
                                .frame(width: 4, height: CGFloat.random(in: 10...30))
                                .animation(
                                    Animation.easeInOut(duration: 0.5)
                                        .repeatForever()
                                        .delay(Double(i) * 0.1),
                                    value: isRecording
                                )
                        }
                    }
                    .frame(height: 30)
                }
            }
            
            // Recognition result
            ScrollView {
                VStack(alignment: .leading, spacing: 10) {
                    Text("Recognition Result")
                        .font(.headline)
                    
                    if recognizedText.isEmpty {
                        Text("Tap the button above to start recording")
                            .foregroundColor(.gray)
                            .italic()
                    } else {
                        Text(recognizedText)
                            .padding()
                            .frame(maxWidth: .infinity, alignment: .leading)
                            .background(Color.gray.opacity(0.1))
                            .cornerRadius(10)
                    }
                }
                .padding()
            }
            .frame(maxHeight: 200)
            .background(Color.gray.opacity(0.05))
            .cornerRadius(10)
            .padding(.horizontal)
            
            // Recognition details
            if let result = viewModel.lastResult {
                VStack(alignment: .leading, spacing: 8) {
                    HStack {
                        Label("Language: \(result.language)", systemImage: "globe")
                        Spacer()
                        Label("Confidence: \(Int(result.confidence * 100))%", 
                              systemImage: "chart.bar.fill")
                    }
                    .font(.caption)
                    .foregroundColor(.secondary)
                    
                    HStack {
                        Label("Emotion: \(result.emotion)", systemImage: "face.smiling")
                        Spacer()
                        Label("Duration: \(result.duration)ms", systemImage: "clock")
                    }
                    .font(.caption)
                    .foregroundColor(.secondary)
                }
                .padding()
                .background(Color.blue.opacity(0.1))
                .cornerRadius(10)
                .padding(.horizontal)
            }
            
            Spacer()
        }
        .padding()
        .onAppear {
            viewModel.initialize()
        }
        .onDisappear {
            viewModel.release()
        }
        .alert("Error", isPresented: $viewModel.showError) {
            Button("OK", role: .cancel) { }
        } message: {
            Text(viewModel.errorMessage)
        }
    }
    
    private func toggleRecording() {
        if isRecording {
            viewModel.stopRecording()
        } else {
            viewModel.startRecording(language: selectedLanguage) { result in
                recognizedText = result.text
            }
        }
        isRecording.toggle()
    }
}

// ViewModel
class VoiceRecognitionViewModel: ObservableObject {
    private var recognizer: SenseVoiceRecognizer?
    @Published var lastResult: RecognitionResult?
    @Published var showError = false
    @Published var errorMessage = ""
    
    func initialize() {
        do {
            recognizer = SenseVoiceRecognizer()
            
            // Locate the model file in the app bundle
            guard let modelPath = Bundle.main.path(forResource: "sensevoice-small", 
                                                  ofType: "onnx") else {
                throw SenseVoiceError.modelNotFound
            }
            
            let success = try recognizer?.initialize(modelPath: modelPath)
            
            if success != true {
                throw SenseVoiceError.recognitionFailed("Initialization failed")
            }
        } catch {
            showError(message: "Initialization failed: \(error.localizedDescription)")
        }
    }
    
    func startRecording(language: String, 
                       onResult: @escaping (RecognitionResult) -> Void) {
        guard let recognizer = recognizer else {
            showError(message: "Recognizer is not initialized")
            return
        }
        
        // Request microphone permission
        AVAudioSession.sharedInstance().requestRecordPermission { granted in
            if granted {
                do {
                    try recognizer.startRealtimeRecognition { result in
                        DispatchQueue.main.async {
                            self.lastResult = result
                            onResult(result)
                        }
                    }
                } catch {
                    DispatchQueue.main.async {
                        self.showError(message: "Failed to start recording: \(error.localizedDescription)")
                    }
                }
            } else {
                DispatchQueue.main.async {
                    self.showError(message: "Microphone permission is required for speech recognition")
                }
            }
        }
    }
    
    func stopRecording() {
        recognizer?.stopRealtimeRecognition()
    }
    
    func release() {
        recognizer?.release()
        recognizer = nil
    }
    
    private func showError(message: String) {
        errorMessage = message
        showError = true
    }
}

5.3 Info.plist Configuration

On iOS, microphone access requires a usage description in Info.plist.

<?xml version="1.0" encoding="UTF-8"?>
<!-- Info.plist -->
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <!-- Microphone usage permission -->
    <key>NSMicrophoneUsageDescription</key>
    <string>Microphone access is required for speech recognition</string>
    
    <!-- App display name -->
    <key>CFBundleDisplayName</key>
    <string>Speech Recognition Assistant</string>
    
    <!-- Supported interface orientations -->
    <key>UISupportedInterfaceOrientations</key>
    <array>
        <string>UIInterfaceOrientationPortrait</string>
        <string>UIInterfaceOrientationLandscapeLeft</string>
        <string>UIInterfaceOrientationLandscapeRight</string>
    </array>
    
    <!-- Support indirect input events (e.g. pointer and keyboard) -->
    <key>UIApplicationSupportsIndirectInputEvents</key>
    <true/>
    
    <!-- Background audio -->
    <key>UIBackgroundModes</key>
    <array>
        <string>audio</string>
    </array>
</dict>
</plist>

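6. Performance Optimization and Best Practices

Once integration is complete, performance tuning and best practices determine whether the SDK behaves well in real use.

6.1 Memory Optimization

Speech recognition is compute-intensive, so memory management matters a great deal.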

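// MemoryOptimizer.kt - Android memory optimization
import android.app.ActivityManager
import android.content.Context
import android.util.Log
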
class MemoryOptimizer {
    
    companion object {
        // Pool of reusable audio buffers
        private val audioBufferPool = mutableListOf<FloatArray>()
        private const val BUFFER_SIZE = 16000  // 1 second of audio at 16 kHz
        
        // Take a buffer from the pool, or allocate one if the pool is empty
        fun getAudioBuffer(): FloatArray {
            synchronized(audioBufferPool) {
                return if (audioBufferPool.isNotEmpty()) {
                    audioBufferPool.removeAt(audioBufferPool.lastIndex)
                } else {
                    FloatArray(BUFFER_SIZE)
                }
            }
        }
        
        // Return a buffer to the pool
        fun recycleAudioBuffer(buffer: FloatArray) {
            if (buffer.size == BUFFER_SIZE) {
                synchronized(audioBufferPool) {
                    if (audioBufferPool.size < 5) {  // cache at most 5 buffers
                        // Zero the buffer before reuse
                        buffer.fill(0f)
                        audioBufferPool.add(buffer)
                    }
                }
            }
        }
        
        // Monitor memory usage
        fun monitorMemoryUsage(context: Context) {
            val activityManager = context.getSystemService(Context.ACTIVITY_SERVICE) 
                    as ActivityManager
            val memoryInfo = ActivityManager.MemoryInfo()
            activityManager.getMemoryInfo(memoryInfo)
            
            val usedMemory = Runtime.getRuntime().totalMemory() - 
                           Runtime.getRuntime().freeMemory()
            val maxMemory = Runtime.getRuntime().maxMemory()
            
            Log.d("MemoryOptimizer", 
                 "Heap usage: ${usedMemory / 1024 / 1024}MB / ${maxMemory / 1024 / 1024}MB")
            Log.d("MemoryOptimizer", 
                 "System available memory: ${memoryInfo.availMem / 1024 / 1024}MB")
            
            // Clear caches when memory is running low
            if (memoryInfo.lowMemory) {
                clearCaches()
            }
        }
        
        private fun clearCaches() {
            synchronized(audioBufferPool) {
                audioBufferPool.clear()
            }
            // Only a hint to the runtime; clearing the pool is what releases the buffers
            System.gc()
        }
    }
}
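
To judge whether the cap of five cached buffers is reasonable, it helps to work out how much memory the pool can hold. The Python sketch below does that arithmetic under the values used in the listing (16 kHz audio, 1-second buffers, 4-byte floats). `pool_footprint_bytes` is a hypothetical helper, and the figure ignores per-array object headers.

```python
SAMPLE_RATE_HZ = 16_000   # sample rate used throughout the article
BYTES_PER_FLOAT = 4       # size of one 32-bit float sample


def pool_footprint_bytes(buffer_seconds: float, cached_buffers: int) -> int:
    """Approximate payload size of the cached buffers, ignoring object headers."""
    samples_per_buffer = int(buffer_seconds * SAMPLE_RATE_HZ)
    return samples_per_buffer * BYTES_PER_FLOAT * cached_buffers


one_buffer = pool_footprint_bytes(1.0, 1)   # a single 1-second buffer
full_pool = pool_footprint_bytes(1.0, 5)    # the pool at its cap of 5 buffers
print(f"one buffer: {one_buffer / 1000:.0f} kB, full pool: {full_pool / 1000:.0f} kB")
```

One buffer is 64,000 bytes and a full pool is 320,000 bytes, so the pool mostly saves repeated allocations in the audio path rather than a large amount of memory.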

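6.2 Power Optimization

On mobile devices, power consumption matters just as much.

// PowerOptimizer.swift - iOS power optimization
// SenseVoiceConfig is assumed to be a shared configuration singleton defined elsewhere in the SDK.
import Foundation
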
class PowerOptimizer {
    
    private var energyMonitor: EnergyMonitor?
    private var isLowPowerMode = false
    
    // Monitor the device power state
    func startMonitoring() {
        // Observe Low Power Mode changes
        NotificationCenter.default.addObserver(
            self,
            selector: #selector(lowPowerModeChanged),
            name: NSNotification.Name.NSProcessInfoPowerStateDidChange,
            object: nil
        )
        
        // Check the current state
        isLowPowerMode = ProcessInfo.processInfo.isLowPowerModeEnabled
        adjustStrategyForPowerMode()
        
        // Start energy monitoring
        energyMonitor = EnergyMonitor()
        energyMonitor?.startMonitoring()
    }
    
    @objc private func lowPowerModeChanged() {
        isLowPowerMode = ProcessInfo.processInfo.isLowPowerModeEnabled
        adjustStrategyForPowerMode()
    }
    
    // Adjust the strategy according to the power mode
    private func adjustStrategyForPowerMode() {
        if isLowPowerMode {
            // Optimizations for Low Power Mode
            SenseVoiceConfig.shared.maxThreads = 1
            SenseVoiceConfig.shared.enableGPU = false
            SenseVoiceConfig.shared.audioBufferSize = 32000  // 2-second buffer
            SenseVoiceConfig.shared.processingInterval = 2000  // process every 2 seconds
        } else {
            // Normal mode
            SenseVoiceConfig.shared.maxThreads = 2
            SenseVoiceConfig.shared.enableGPU = true
            SenseVoiceConfig.shared.audioBufferSize = 16000  // 1-second buffer
            SenseVoiceConfig.shared.processingInterval = 1000  // process every second
        }
    }
    
    // Dynamically adjust recognition accuracy
    func adjustAccuracyBasedOnBattery(level: Float) {
        if level < 0.2 {  // battery below 20%
            SenseVoiceConfig.shared.recognitionAccuracy = .low
        } else if level < 0.5 {  // battery below 50%
            SenseVoiceConfig.shared.recognitionAccuracy = .medium
        } else {
            SenseVoiceConfig.shared.recognitionAccuracy = .high
        }
    }
    
    // Clean up resources
    func stopMonitoring() {
        NotificationCenter.default.removeObserver(self)
        energyMonitor?.stopMonitoring()
        energyMonitor = nil
    }
}

// Energy monitor
class EnergyMonitor {
    private var monitoringTimer: Timer?
    private var energyUsage: [Date: Double] = [:]
    
    func startMonitoring() {
        // Capture self weakly so the repeating timer does not keep the monitor alive
        monitoringTimer = Timer.scheduledTimer(withTimeInterval: 10.0, repeats: true) { [weak self] _ in
            self?.recordEnergyUsage()
        }
    }
    
    private func recordEnergyUsage() {
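        // A system energy-monitoring API could be integrated here.
        // A real project may need more specialized energy-profiling tools.
        let usage = Double.random(in: 0.1...0.5)  // simulated energy data
        energyUsage[Date()] = usage
        
        // Post a warning if energy usage is too high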
        if usage > 0.3 {
            NotificationCenter.default.post(
                name: Notification.Name("HighEnergyUsageWarning"),
                object: nil,
                userInfo: ["usage": usage]
            )
        }
    }
    
    func stopMonitoring() {
        monitoringTimer?.invalidate()
        monitoringTimer = nil
    }
}
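
The power policy above mutates a shared configuration object, which makes it awkward to test. As an alternative for prototyping, the same rules can be written as pure functions. This is a minimal Python sketch using the thresholds and values from the Swift listing. `RecognitionConfig`, `config_for_power_mode` and `accuracy_for_battery` are hypothetical names, not SDK APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecognitionConfig:
    max_threads: int
    enable_gpu: bool
    audio_buffer_size: int        # samples
    processing_interval_ms: int


def config_for_power_mode(low_power: bool) -> RecognitionConfig:
    """Return the settings the article uses for each power mode."""
    if low_power:
        # Fewer threads, no GPU, larger buffer, less frequent processing.
        return RecognitionConfig(1, False, 32000, 2000)
    return RecognitionConfig(2, True, 16000, 1000)


def accuracy_for_battery(level: float) -> str:
    """Map a battery level in [0.0, 1.0] to an accuracy tier."""
    if not 0.0 <= level <= 1.0:
        raise ValueError("battery level must be between 0.0 and 1.0")
    if level < 0.2:
        return "low"
    if level < 0.5:
        return "medium"
    return "high"


low_power_config = config_for_power_mode(True)
tiers = [accuracy_for_battery(x) for x in (0.1, 0.2, 0.49, 0.5, 1.0)]
```

Because the functions return values instead of writing to a singleton, the boundary cases (exactly 20% and exactly 50%) can be checked in a plain unit test.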

6.3 Network Fallback Strategy

Although SenseVoice-small works offline, some scenarios may still benefit from network assistance.

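// NetworkFallback.kt - network fallback strategy
import android.content.Context
import android.net.ConnectivityManager
import android.net.NetworkCapabilities
import android.os.Build
import com.google.gson.Gson
import kotlinx.coroutines.CancellationException
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext

class NetworkFallbackStrategy(private val context: Context) {
    
    private val localRecognizer = SenseVoiceService(context, getLocalModelPath())
    private val cloudRecognizer = CloudRecognitionService()
    
    // Hybrid recognition strategy
    suspend fun hybridRecognize(
        audioData: FloatArray,
        language: String = "auto"
    ): RecognitionResult {
        return try {
            // Try on-device recognition first
            val localResult = withContext(Dispatchers.IO) {
                localRecognizer.recognizeAudio(audioData, language)
            }
            
            // If local confidence is low, try the cloud service
            if (localResult.confidence < 0.7 && isNetworkAvailable()) {
                val cloudResult = withContext(Dispatchers.IO) {
                    cloudRecognizer.recognize(audioData, language)
                }
                
                // Merge the results (adjust the strategy to the business logic)
                return mergeResults(localResult, cloudResult)
            }
            
            localResult
        } catch (e: CancellationException) {
            // Never swallow coroutine cancellation
            throw e
        } catch (e: Exception) {
            // Local recognition failed; fall back to the cloud
            if (isNetworkAvailable()) {
                try {
                    return withContext(Dispatchers.IO) {
                        cloudRecognizer.recognize(audioData, language)
                    }
                } catch (e2: CancellationException) {
                    throw e2
                } catch (e2: Exception) {
                    // The cloud also failed; return an error
                    return RecognitionResult(error = "Recognition failed: ${e2.message}")
                }
            }
            
            RecognitionResult(error = "Recognition failed: ${e.message}")
        }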
    }
    
    // Check network availability
    private fun isNetworkAvailable(): Boolean {
        val connectivityManager = context.getSystemService(Context.CONNECTIVITY_SERVICE) 
                as ConnectivityManager
        
        if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.M) {
            val network = connectivityManager.activeNetwork
            val capabilities = connectivityManager.getNetworkCapabilities(network)
            return capabilities != null && 
                   (capabilities.hasTransport(NetworkCapabilities.TRANSPORT_WIFI) ||
                    capabilities.hasTransport(NetworkCapabilities.TRANSPORT_CELLULAR))
        } else {
            @Suppress("DEPRECATION")
            val networkInfo = connectivityManager.activeNetworkInfo
            return networkInfo != null && networkInfo.isConnected
        }
    }
    
    // Merge the local and cloud results
    private fun mergeResults(
        local: RecognitionResult,
        cloud: RecognitionResult
    ): RecognitionResult {
        // Simple merge strategy: pick the result with the higher confidence
        return if (local.confidence >= cloud.confidence) {
            local
        } else {
            cloud
        }
        
        // More sophisticated strategies could use:
        // 1. Weighted averaging
        // 2. Language-model-based post-processing
        // 3. Learning from user feedback
    }
    
    // Confidence-aware caching strategy
    fun cacheRecognitionResult(
        audioHash: String,
        result: RecognitionResult,
        source: RecognitionSource
    ) {
        val cache = RecognitionCache(context)
        
        // Choose the cache lifetime based on result quality
        val cacheTime = when {
            result.confidence > 0.9 -> 7 * 24 * 60 * 60 * 1000L  // 7 days
            result.confidence > 0.7 -> 24 * 60 * 60 * 1000L      // 1 day
            else -> 60 * 60 * 1000L                             // 1 hour
        }
        
        cache.save(audioHash, result, cacheTime, source)
    }
    
    // Get the local model path
    private fun getLocalModelPath(): String {
        // Implementation omitted
        return ""
    }
}

// Recognition result cache
class RecognitionCache(context: Context) {
    private val sharedPrefs = context.getSharedPreferences("recognition_cache", 
                                                          Context.MODE_PRIVATE)
    
    fun save(
        key: String,
        result: RecognitionResult,
        ttl: Long,
        source: RecognitionSource
    ) {
        val json = Gson().toJson(CacheEntry(result, System.currentTimeMillis() + ttl, source))
        sharedPrefs.edit().putString(key, json).apply()
    }
    
    fun get(key: String): CacheEntry? {
        val json = sharedPrefs.getString(key, null)
        return if (json != null) {
            val entry = Gson().fromJson(json, CacheEntry::class.java)
            if (entry.expiry > System.currentTimeMillis()) {
                entry
            } else {
                // The entry has expired; remove it
                sharedPrefs.edit().remove(key).apply()
                null
            }
        } else {
            null
        }
    }
    
    data class CacheEntry(
        val result: RecognitionResult,
        val expiry: Long,
        val source: RecognitionSource
    )
}

enum class RecognitionSource {
    LOCAL, CLOUD, HYBRID
}

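7. Testing and Deployment

7.1 Unit Tests

Make sure every component of the SDK works correctly.

// SenseVoiceServiceTest.kt
import android.content.Context
import androidx.test.core.app.ApplicationProvider
import androidx.test.ext.junit.runners.AndroidJUnit4
import kotlin.math.sin
import org.junit.After
import org.junit.Assert.assertNotNull
import org.junit.Assert.assertNull
import org.junit.Assert.assertTrue
import org.junit.Before
import org.junit.Test
import org.junit.runner.RunWith
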
@RunWith(AndroidJUnit4::class)
class SenseVoiceServiceTest {
    
    private lateinit var context: Context
    private lateinit var service: SenseVoiceService
    
    @Before
    fun setup() {
        context = ApplicationProvider.getApplicationContext()
        
        // Copy the test model to internal storage
        val modelManager = ModelManager(context)
        val modelPath = modelManager.copyModelToInternal().absolutePath
        
        service = SenseVoiceService(context, modelPath)
        service.initialize()
    }
    
    @Test
    fun testInitialization() {
        assertTrue(service.isInitialized)
    }
    
    @Test
    fun testAudioPreprocessing() {
        // Create test audio data (a 1-second 440 Hz sine wave)
        val sampleRate = 16000
        val frequency = 440.0
        val duration = 1.0  // seconds
        
        val audioData = FloatArray((sampleRate * duration).toInt())
        for (i in audioData.indices) {
            val time = i.toDouble() / sampleRate
            audioData[i] = sin(2 * Math.PI * frequency * time).toFloat()
        }
        
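        val result = service.recognizeAudio(audioData, "auto")
        
        // A pure tone contains no speech, so only check that the pipeline
        // runs end to end and reports a valid confidence value
        assertNull(result.error)
        assertTrue(result.confidence >= 0)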
    }
    
    @Test
    fun testLanguageDetection() {
        // Pre-recorded audio in different languages can be used here.
        // A real project should prepare a test dataset.
        assertTrue(true)  // placeholder
    }
    
    @Test
    fun testEmptyAudio() {
        val emptyAudio = FloatArray(16000)  // 1 second of silence
        val result = service.recognizeAudio(emptyAudio, "auto")
        
        // Silence should yield an empty result or a low confidence
        assertTrue(result.text.isEmpty() || result.confidence < 0.3)
    }
    
    @After
    fun tearDown() {
        service.release()
    }
}
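
The sine-wave test builds its input in memory, but the same signal is also handy as a file fixture that both the Android and iOS test suites can share. Below is a minimal Python sketch that writes a 440 Hz tone as a 16 kHz, 16-bit mono WAV file using only the standard library. `write_sine_wav` and the file name are hypothetical, and the 0.5 amplitude is an arbitrary choice to leave headroom.

```python
import math
import os
import struct
import tempfile
import wave


def write_sine_wav(path: str, frequency_hz: float = 440.0, duration_s: float = 1.0,
                   sample_rate: int = 16000, amplitude: float = 0.5) -> int:
    """Write a mono 16-bit PCM sine tone and return the number of frames written."""
    frame_count = int(sample_rate * duration_s)
    samples = [
        int(round(amplitude * 32767 * math.sin(2 * math.pi * frequency_hz * i / sample_rate)))
        for i in range(frame_count)
    ]
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)           # mono
        wav.setsampwidth(2)           # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{frame_count}h", *samples))
    return frame_count


# Write the fixture to a temporary directory and read its header back.
with tempfile.TemporaryDirectory() as tmp:
    fixture = os.path.join(tmp, "tone_440hz.wav")
    frames_written = write_sine_wav(fixture)
    with wave.open(fixture, "rb") as wav:
        info = (wav.getnchannels(), wav.getsampwidth(),
                wav.getframerate(), wav.getnframes())
```

A tone is useful for checking that the audio pipeline runs end to end. It says nothing about transcription quality, which needs recorded speech.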

7.2 Integration Tests

Test how the SDK behaves as an integrated whole.

// SenseVoiceIntegrationTests.swift
import XCTest
@testable import SenseVoiceFramework

final class SenseVoiceIntegrationTests: XCTestCase {
    
    var recognizer: SenseVoiceRecognizer!
    
    override func setUp() async throws {
        recognizer = SenseVoiceRecognizer()
        
        // Locate the test model in the test bundle
        let bundle = Bundle(for: type(of: self))
        guard let modelPath = bundle.path(forResource: "sensevoice-small-test", 
                                         ofType: "onnx") else {
            throw SenseVoiceError.modelNotFound
        }
        
        let initialized = try recognizer.initialize(modelPath: modelPath)
        XCTAssertTrue(initialized, "The recognizer should initialize successfully")
    }
    
    func testChineseRecognition() async throws {
        // Load the test audio ("你好,世界", Chinese for "Hello, world")
        let audioData = try loadTestAudio(name: "chinese_hello")
        let result = try recognizer.recognizeAudio(audioData, language: "zh")
        
        XCTAssertFalse(result.text.isEmpty, "The recognition result should not be empty")
        XCTAssertEqual(result.language, "zh", "Chinese should be detected")
        XCTAssertGreaterThan(result.confidence, 0.7, "Confidence should be greater than 0.7")
    }
    
    func testEnglishRecognition() async throws {
        // Load the test audio ("Hello, world")
        let audioData = try loadTestAudio(name: "english_hello")
        let result = try recognizer.recognizeAudio(audioData, language: "en")
        
        XCTAssertFalse(result.text.isEmpty, "The recognition result should not be empty")
        XCTAssertEqual(result.language, "en", "English should be detected")
        XCTAssertGreaterThan(result.confidence, 0.7, "Confidence should be greater than 0.7")
    }
    
    func testAutoLanguageDetection() async throws {
        let chineseAudio = try loadTestAudio(name: "chinese_hello")
        let chineseResult = try recognizer.recognizeAudio(chineseAudio, language: "auto")
        
        XCTAssertEqual(chineseResult.language, "zh", "Chinese should be detected automatically")
        
        let englishAudio = try loadTestAudio(name: "english_hello")
        let englishResult = try recognizer.recognizeAudio(englishAudio, language: "auto")
        
        XCTAssertEqual(englishResult.language, "en", "English should be detected automatically")
    }
    
    func testPerformance() {
        let audioData = [Float](repeating: 0, count: 16000)  // 1 second of silence
        
        measure {
            for _ in 0..<10 {
                _ = try? recognizer.recognizeAudio(audioData, language: "auto")
            }
        }
    }
    
    private func loadTestAudio(name: String) throws -> [Float] {
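        // A real project should load the audio file from the test resources.
        // This placeholder returns one second of silence, so the recognition
        // assertions above will only pass once real audio is loaded here.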
        return [Float](repeating: 0, count: 16000)
    }
    
    override func tearDown() {
        recognizer.release()
        recognizer = nil
    }
}
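
The loadTestAudio helper above is a placeholder. To illustrate what a real loader has to do, here is a minimal Python sketch that decodes a 16-bit PCM WAV file into floats in the range [-1, 1), downmixing to mono by averaging channels. `load_wav_as_floats` is a hypothetical helper, and the 16 kHz check reflects the sample rate assumed throughout the article.

```python
import io
import struct
import wave
from typing import BinaryIO, List, Union


def load_wav_as_floats(source: Union[str, BinaryIO],
                       expected_rate: int = 16000) -> List[float]:
    """Decode 16-bit PCM WAV audio to mono floats in [-1.0, 1.0)."""
    with wave.open(source, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("only 16-bit PCM is supported in this sketch")
        if wav.getframerate() != expected_rate:
            raise ValueError(f"expected {expected_rate} Hz audio")
        channels = wav.getnchannels()
        raw = wav.readframes(wav.getnframes())
    samples = struct.unpack(f"<{len(raw) // 2}h", raw)
    # Average interleaved channels to get a mono signal.
    mono = [sum(samples[i:i + channels]) / channels
            for i in range(0, len(samples), channels)]
    return [value / 32768.0 for value in mono]


# Round-trip a tiny in-memory WAV file to show the scaling.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(16000)
    wav.writeframes(struct.pack("<4h", 0, 16384, -32768, 32767))
buffer.seek(0)
decoded = load_wav_as_floats(buffer)
```

The four samples decode to 0.0, 0.5, -1.0 and a value just under 1.0. Rejecting files with an unexpected sample rate in the loader surfaces fixture mistakes early instead of as confusing recognition failures.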

7.3 Continuous Integration and Deployment

Configure a CI/CD pipeline to safeguard code quality.

# .github/workflows/android-ci.yml
name: Android CI

on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main ]

jobs:
  test:
    runs-on: ubuntu-latest
    
    steps:
    - uses: actions/checkout@v3
    
    - name: Set up JDK 11
      uses: actions/setup-java@v3
      with:
        java-version: '11'
        distribution: 'temurin'
    
    - name: Setup Android SDK
      uses: android-actions/setup-android@v2
    
    - name: Grant execute permission for gradlew
      run: chmod +x gradlew
    
    - name: Run unit tests
      run: ./gradlew testDebugUnitTest
    
    - name: Run instrumented tests
      uses: reactivecircus/android-emulator-runner@v2
      with:
        api-level: 29
        script: ./gradlew connectedDebugAndroidTest
    
    - name: Upload test results
      uses: actions/upload-artifact@v4
      if: always()
      with:
        name: test-results
        path: app/build/reports/

  build:
    needs: test
    runs-on: ubuntu-latest
    
    steps:
    - uses: actions/checkout@v3
    
    - name: Set up JDK 11
      uses: actions/setup-java@v3
      with:
        java-version: '11'
        distribution: 'temurin'
    
    - name: Setup Android SDK
      uses: android-actions/setup-android@v2
    
    - name: Grant execute permission for gradlew
      run: chmod +x gradlew
    
    - name: Build APK
      run: ./gradlew assembleRelease
    
    - name: Upload APK
      uses: actions/upload-artifact@v4
      with:
        name: app-release
        path: app/build/outputs/apk/release/

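8. Summary

This article walked through how to package the SenseVoice-small speech recognition model as an SDK and integrate it into Android and iOS apps. From technology selection and architecture design to concrete code and optimization strategies, it covered the key points of on-device speech recognition integration.

8.1 Key Takeaways

On the technical side, we covered:

  • How to wrap an ONNX model in a cross-platform inference engine
  • How to design an SDK architecture for mobile platforms
  • How to implement real-time speech recognition on Android and iOS
  • How to optimize performance and control power consumption

On the engineering side, we covered:

  • Modular design, breaking a complex system into maintainable components
  • Handling platform differences with an adaptation layer for each operating system
  • Error handling and resource management to keep the SDK stable
  • A testing strategy that runs from unit tests through integration tests

8.2 Practical Recommendations

When applying this SDK in a real project, I have a few suggestions:

First, choose the configuration to fit the scenario

  • For scenarios that demand low latency (such as voice assistants), use streaming recognition
  • For scenarios that demand high accuracy (such as meetings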