iOS Audio and Video (1) AVFoundation core classes

iOS Audio and Video (2) AVFoundation video capture

iOS Audio and Video (3) AVFoundation playback and recording

iOS Audio and Video (43) AVFoundation audio session

iOS Audio Queue Services for AVFoundation

iOS Audio and Video (45) HTTPS self-signed certificates implement side play

iOS Audio and Video (46) Offline and online speech recognition solution

iOS Audio and Video (46) Offline and online speech recognition solution

Recently I did some research on speech recognition because the company needed an offline speech recognition feature. To balance performance and price, the final choice was to use Siri for online recognition and Baidu for offline recognition.

The Baidu library libBaiduSpeechSDK.a is not included in the demo because it is over 100 MB and cannot be uploaded to Git; download it from the official SDK and drop it into the demo to replace the missing file.

Here is a summary of the existing offline speech recognition options:

  • Easy-to-use third-party SDK solutions (solution / advantages / disadvantages / price / success rate):

  • iFlytek: Advantages: success rate above 95%. Disadvantages: expensive; increases the IPA package size. Price: buy-out or pay-per-traffic, roughly 4 yuan per device. Success rate: about 95%.
  • Baidu AI: Advantages: success rate above 90%, cheap, offline recognition is free, and custom model training is provided, which improves the recognition rate after training. Disadvantages: increases the IPA package size; the recognition rate is not as high; offline mode only supports command words and only Chinese and English; when a network is available the SDK forces online recognition; the offline engine must connect to the Internet at least once. Price: offline command words are free; online recognition is billed by traffic, e.g. a 1600 yuan package (2 million Chinese recognitions, 50 thousand English recognitions). Success rate: about 95% online; offline recognition does not reach 90%.
  • Siri: Advantages: high success rate, free, built into the Apple system, does not increase the package size. Disadvantages: offline recognition requires iOS 13 or higher and does not support Chinese; offline English recognition is more accurate than Baidu's. Price: completely free. Success rate: online roughly on par with iFlytek (about 95%); offline English about 90%.
  • Open-source framework solutions (framework / notes):
  • KALDI: a well-known open-source automatic speech recognition (ASR) toolkit that provides training tools for the ASR models most commonly used in industry today, plus pipelines for sub-tasks such as speaker verification and language identification. KALDI is maintained by Daniel Povey, who did ASR research at Cambridge before moving to JHU to develop KALDI and is currently head of voice at Xiaomi's headquarters in Beijing; he is also one of the lead authors of HTK, another well-known ASR tool.
  • CMU Sphinx: features include keyword wake-up, grammar-based recognition, n-gram recognition, and so on. This framework is better suited to application development than Kaldi: its features are encapsulated in a straightforward way, the decoding code is easy to understand, and besides the PC platform the author also supports embedded platforms, so Android development is convenient. There is already a demo and a wiki for speech evaluation based on PocketSphinx, whose real-time performance is much better than Kaldi's. Downside: it uses the GMM-HMM framework, so accuracy may be lower, and it has fewer auxiliary tools (pitch extraction, etc.) than Kaldi.
  • HTK (Cambridge): written in C; supports Windows, Linux and iOS.

Scheme 1: Siri speech recognition

Introduction to Siri Voice Recognition

The core of the Siri recognition API is the SFSpeechRecognizer speech processor, which is only available on iOS 10 and above. From iOS 10 through iOS 12, Apple only offered online recognition; iOS 13 added offline recognition. However, the offline mode does not support Chinese: although the documentation says Chinese is supported, testing shows that offline Chinese recognition does not work at all.
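For example, on iOS 13 and above you can check whether a given locale supports on-device recognition before forcing offline mode. A minimal Objective-C sketch (the en-US locale is just a placeholder, and #import <Speech/Speech.h> is assumed):

// Check whether on-device (offline) recognition is available for a locale before forcing it.
// supportsOnDeviceRecognition and requiresOnDeviceRecognition are iOS 13+ APIs.
SFSpeechRecognizer *recognizer =
    [[SFSpeechRecognizer alloc] initWithLocale:[NSLocale localeWithLocaleIdentifier:@"en-US"]];
if (@available(iOS 13.0, *)) {
    SFSpeechAudioBufferRecognitionRequest *request = [[SFSpeechAudioBufferRecognitionRequest alloc] init];
    if (recognizer.supportsOnDeviceRecognition) {
        request.requiresOnDeviceRecognition = YES; // force offline recognition for this request
    } else {
        NSLog(@"Only online recognition is available for this locale");
    }
}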

Key classes for Siri speech recognition

  • Import the system framework Speech.
  • SFSpeechRecognizer: the speech processor. This class drives speech recognition: it is used to request the user's speech recognition permission, set the locale, set the recognition mode, and send recognition requests to Apple's service. For example, the following code returns a recognizer for the language identifier passed in, or nil if that locale is not supported. More details can be found in the official documentation.
SFSpeechRecognizer(locale: Locale(identifier: langugeSimple))

Recognition results are obtained through the following method:

open func recognitionTask(with request: SFSpeechRecognitionRequest, resultHandler: @escaping (SFSpeechRecognitionResult?, Error?) -> Void) -> SFSpeechRecognitionTask
  • AVAudioEngine: dedicated to capturing and processing audio data.
 lazy var audioEngine: AVAudioEngine = {
        let audioEngine = AVAudioEngine()
        audioEngine.inputNode.installTap(onBus: 0, bufferSize: 1024, format: audioEngine.inputNode.outputFormat(forBus: 0)) { (buffer, audioTime) in
            // Add an AudioPCMBuffer to the speech recognition request object to get the sound data
            self.recognitionRequest.append(buffer)
        }
        return audioEngine
    }()
  • SFSpeechAudioBufferRecognitionRequest: a recognition request fed from an audio buffer stream:
 // Voice recognizer
    lazy var recognitionRequest: SFSpeechAudioBufferRecognitionRequest = {
        let recognitionRequest = SFSpeechAudioBufferRecognitionRequest()
        return recognitionRequest
    }()
  • SFSpeechRecognitionTask: the recognition task manager, used to start and cancel recognition. Each recognition request sent to the speech recognition service is abstracted as an SFSpeechRecognitionTask instance, and the SFSpeechRecognitionTaskDelegate protocol defines callbacks for monitoring the request throughout its lifecycle.
public enum SFSpeechRecognitionTaskState : Int {

    case starting // Speech processing (potentially including recording) has not yet begun

    case running // Speech processing (potentially including recording) is running

    case finishing // No more audio is being recorded, but more recognition results may arrive

    case canceling // No more recognition results will arrive, but recording may not have stopped yet

    case completed // No more results will arrive, and recording is stopped.
}

There are also some important classes:

SFSpeechRecognitionRequest: the base class for recognition requests; instantiate one of its subclasses. SFSpeechURLRecognitionRequest: creates a recognition request from the URL of an audio file. SFSpeechRecognitionResult: the result of a recognition request. SFTranscription: a single transcription hypothesis.
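As a quick illustration of SFSpeechURLRecognitionRequest, here is a minimal Objective-C sketch that transcribes a prerecorded file. The file name sample.wav and the en-US locale are placeholders, and authorization is assumed to have been granted already:

// Minimal sketch: recognize a bundled audio file with SFSpeechURLRecognitionRequest.
SFSpeechRecognizer *recognizer =
    [[SFSpeechRecognizer alloc] initWithLocale:[NSLocale localeWithLocaleIdentifier:@"en-US"]];
NSURL *fileURL = [[NSBundle mainBundle] URLForResource:@"sample" withExtension:@"wav"];
SFSpeechURLRecognitionRequest *request = [[SFSpeechURLRecognitionRequest alloc] initWithURL:fileURL];
[recognizer recognitionTaskWithRequest:request
                         resultHandler:^(SFSpeechRecognitionResult *result, NSError *error) {
    if (error != nil) {
        NSLog(@"Recognition failed: %@", error);
        return;
    }
    if (result.isFinal) {
        // bestTranscription is the highest-confidence SFTranscription.
        NSLog(@"Transcription: %@", result.bestTranscription.formattedString);
    }
}];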

For more details, please refer to the official Apple documentation, which also provides a Swift demo: download the official Apple demo here.

Siri voice recognition integration

  • Objective-C integration:
//
// KSiriRecognizer.m
// KSpeechRecognition
//
// Created by yulu kong on 2020/4/3.
// Copyright © 2020 Yulu Kong. All rights reserved
//

#import "KSiriRecognizer.h"
#import <Speech/Speech.h>
#import "KHelper.h"
#import "KError.h"

@interface KSiriRecognizer() <SFSpeechRecognizerDelegate>

@property (nonatomic, strong) AVAudioEngine *audioEngine;
@property (nonatomic, strong) SFSpeechRecognizer *recognizer;

@property (nonatomic, assign) BOOL isAvaliable;

@property (nonatomic, strong, nullable) SFSpeechRecognitionTask *currentTask;
@property (nonatomic, strong, nullable) SFSpeechAudioBufferRecognitionRequest *request;

@end

@implementation KSiriRecognizer

+ (void)requestAuthorizationWithResultHandler:(KSiriAuthorizationResultHandler)resultHandler
{
    [SFSpeechRecognizer requestAuthorization:^(SFSpeechRecognizerAuthorizationStatus status) {
        resultHandler([KHelper convertSiriAuthorizationStatus:status]);
    }];
}

- (instancetype)initWithLanguage:(KLanguage)language
{
    if (self = [super initWithLanguage:language]) {
        NSLocale *local = [KHelper localForLanguage:language];
        _recognizer = [[SFSpeechRecognizer alloc] initWithLocale:local];
        _recognizer.delegate = self;
    }
    return self;
}

- (KAuthorizationStatus)authorizationStatus
{
    return [KHelper convertSiriAuthorizationStatus:[SFSpeechRecognizer authorizationStatus]];
}

- (void)startWithResultHandler:(KRecognitionResultHandler)resultHandler errorHandler:(KErrorHandler _Nullable)errorHandler
{
    if (_currentTask != nil) {
        NSLog(@"Identification in progress, please wait.");
        return;
    }
    
    if (self.authorizationStatus != KAuthorizationStatusAuthorized) {
        errorHandler([KError notAuthorizationError]);
        return;
    }
    
    if (!_isAvaliable) {
        NSString *message = [NSString stringWithFormat:@"%@ voice recognizer not available", [KHelper nameForLanguage:self.language]];
        errorHandler([KError errorWithCode:-1 message:message]);
        return;
    }
    
    AVAudioSession *audioSession = AVAudioSession.sharedInstance;
    NSError *error = nil;
    [audioSession setCategory:AVAudioSessionCategoryRecord mode:AVAudioSessionModeMeasurement options:AVAudioSessionCategoryOptionDuckOthers error:&error];
    if (error != nil) {
        errorHandler(error);
        return;
    }
    
    __block typeof(self) weakSelf = self;
    _request = [[SFSpeechAudioBufferRecognitionRequest alloc] init];
    _request.shouldReportPartialResults = YES;
    
// Enable offline recognition. This property is only supported on iOS 13 and above.
#if __IPHONE_OS_VERSION_MAX_ALLOWED >= __IPHONE_13_0
     _request.requiresOnDeviceRecognition = self.forceOffline;
#else

#endif
    
    _currentTask = [self.recognizer recognitionTaskWithRequest:_request resultHandler:^(SFSpeechRecognitionResult *result, NSError *error) {
       
        if (error != nil) {
            [weakSelf stop];
            errorHandler(error);
            return;
        }
        
        if (result != nil && !result.isFinal) {
            resultHandler([[KRecognitionResult alloc] initWithText:result.bestTranscription.formattedString isFinal:NO]);
            return;
        }
        
        if (result.isFinal) {
            [weakSelf stop];
            resultHandler([[KRecognitionResult alloc] initWithText:result.bestTranscription.formattedString isFinal:YES]);
        }
    }];
    
    // Configure the microphone input.
    AVAudioFormat *recordingFormat = [_audioEngine.inputNode outputFormatForBus:0];
    [_audioEngine.inputNode installTapOnBus:0 bufferSize:1024 format:recordingFormat block:^(AVAudioPCMBuffer *buffer, AVAudioTime *when) {
        [weakSelf.request appendAudioPCMBuffer:buffer];
    }];
    
    [_audioEngine prepare];
    
    if (![_audioEngine startAndReturnError:&error]) {
        _currentTask = nil;
        errorHandler(error);
    }
}

- (void)stop {
    
    if (_currentTask == nil || !_isAvaliable) {
        return;
    }
    
    [_currentTask cancel];
    [_audioEngine stop];
    [_request endAudio];
    _currentTask = nil;
}



- (void)speechRecognizer:(SFSpeechRecognizer *)speechRecognizer availabilityDidChange:(BOOL)available
{
    _isAvaliable = available;
}

@end

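For reference, a usage sketch of the KSiriRecognizer wrapper above might look like the following. This is only a sketch: KLanguageEnglish, the authorization block signature, and the text/isFinal property names on KRecognitionResult are assumptions based on the code shown, not a confirmed API.

// Hypothetical usage of the KSiriRecognizer wrapper defined above.
// KLanguageEnglish and the KRecognitionResult properties are assumed names.
[KSiriRecognizer requestAuthorizationWithResultHandler:^(KAuthorizationStatus status) {
    if (status != KAuthorizationStatusAuthorized) {
        NSLog(@"Speech recognition was not authorized");
        return;
    }
    KSiriRecognizer *recognizer = [[KSiriRecognizer alloc] initWithLanguage:KLanguageEnglish];
    [recognizer startWithResultHandler:^(KRecognitionResult *result) {
        NSLog(@"Recognized text: %@ (final: %d)", result.text, result.isFinal);
    } errorHandler:^(NSError *error) {
        NSLog(@"Recognition error: %@", error);
    }];
    // Call [recognizer stop] when you are done listening.
}];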
  • Swift5 Code integration:
//
// JPSpeechRecognition.swift
// JimuPro
//
// Created by bog on 2020/3/7
// Copyright © 2020 UBTech. All rights reserved
//

import Foundation
import UIKit
import Speech

enum JPSpeechType: Int {
    case start
    case stop
    case finished
    case authDenied
}

typealias JPSpeechBlock = (_ speechType: JPSpeechType, _ finalText: String?) -> Void

@available(iOS 10.0, *)
class JPSpeechRecognition: NSObject {

    //private var parentVc: UIViewController!
    private var speechTask: SFSpeechRecognitionTask?
    // Sound processor
    private var speechRecognizer: SFSpeechRecognizer?
    
    private var block: JPSpeechBlock?
    
    // Voice recognizer
    var recognitionRequest: SFSpeechAudioBufferRecognitionRequest?
    
 
    
    lazy var audioEngine: AVAudioEngine = {
        let audioEngine = AVAudioEngine()
        audioEngine.inputNode.installTap(onBus: 0, bufferSize: 1024, format: audioEngine.inputNode.outputFormat(forBus: 0)) { (buffer, audioTime) in
            // Add an AudioPCMBuffer to the speech recognition request object to get the sound data
            if let recognitionRequest = self.recognitionRequest {
                recognitionRequest.append(buffer)
            }
        }
        return audioEngine
    }()
    
    
    func startSpeech(languge: String, speechBlock: @escaping JPSpeechBlock) {
        //parentVc = speechVc
        block = speechBlock
        setAudioActive()
        checkmicroPhoneAuthorization { (microStatus) in
            if microStatus {
                self.checkRecognizerAuthorization(recongStatus: { (recStatus) in
                    if recStatus {
                        // Get the audio engine ready (allocates the resources audioEngine needs to start up)
                        self.audioEngine.prepare()
                        if self.speechTask?.state == .running {
                            // A recognition task is already running, so stop it
                            self.stopDictating()
                        } else {
                            // No task is running: create a recognizer for the requested locale
                            self.speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: languge))
                            guard self.speechRecognizer != nil else {
                                self.showAlert("Sorry, voice input is not supported for the current region.")
                                return
                            }
                            self.setCallBack(type: .start, text: nil)
                            // Enable voice recognition
                            self.startDictating()
                        }
                    } else {
                        self.showAlert("You have not authorized speech recognition. If you want to use it, you can turn it on again in Settings.")
                        self.setCallBack(type: .authDenied, text: nil)
                    }
                })
            } else {
                // The microphone is not authorized
                self.showAlert("You have not authorized use of the microphone. If you want to use speech recognition, you can turn it back on in Settings.")
                self.setCallBack(type: .authDenied, text: nil)
            }
        }
    }
}

@available(iOS 10.0, *)
extension JPSpeechRecognition: SFSpeechRecognitionTaskDelegate {
    
    // Check speech recognition permission
    private func checkRecognizerAuthorization(recongStatus: @escaping (_ resType: Bool) -> Void) {
        let authorStatus = SFSpeechRecognizer.authorizationStatus()
        if authorStatus == .authorized {
            recongStatus(true)
        } else if authorStatus == .notDetermined {
            SFSpeechRecognizer.requestAuthorization { (status) in
                if status == .authorized {
                    recongStatus(true)
                } else {
                    recongStatus(false)
                }
            }
        } else {
            recongStatus(false)
        }
    }

    // Check microphone permission
    private func checkmicroPhoneAuthorization(authoStatus: @escaping (_ resultStatus: Bool) -> Void) {
        let microPhoneStatus = AVCaptureDevice.authorizationStatus(for: .audio)

        if microPhoneStatus == .authorized {
            authoStatus(true)
        } else if microPhoneStatus == .notDetermined {
            AVCaptureDevice.requestAccess(for: .audio, completionHandler: { (res) in
                if res {
                    authoStatus(true)
                } else {
                    authoStatus(false)
                }
            })
        } else {
            authoStatus(false)
        }
    }

    // Start dictation
    private func startDictating() {
        do {
            // Recreate the recognition request object
            recognitionRequest = SFSpeechAudioBufferRecognitionRequest()
            guard let recognitionRequest = recognitionRequest else {
                fatalError("Unable to create a SFSpeechAudioBufferRecognitionRequest object")
            }
            // Enable offline recognition (keep speech recognition data on device).
            // This property is only supported on iOS 13 and above.
            if #available(iOS 13, *) {
                recognitionRequest.requiresOnDeviceRecognition = true
            }
            try audioEngine.start()
            speechTask = speechRecognizer!.recognitionTask(with: recognitionRequest) { (speechResult, error) in
                // Handle the recognition result
                if speechResult == nil {
                    return
                }
                self.setCallBack(type: .finished, text: speechResult!.bestTranscription.formattedString)
            }
        } catch {
            print(error)
            self.setCallBack(type: .finished, text: nil)
        }
    }

    // Stop the voice processor and end the recognition request
    func stopDictating() {
        setCallBack(type: .stop, text: nil)
        audioEngine.stop()
        recognitionRequest?.endAudio()
        recognitionRequest = nil
        
        if audioEngine.inputNode.numberOfInputs > 0 {
            audioEngine.inputNode.removeTap(onBus: 0)
        }
        speechTask?.cancel()
    }

    private func setCallBack(type: JPSpeechType, text: String?) {
        if block != nil {
            block!(type, text)
        }
    }

    private func setAudioActive() {
        let audioSession = AVAudioSession.sharedInstance()

        do {
            try audioSession.setCategory(AVAudioSession.Category.playAndRecord, mode: .default)
            try audioSession.setMode(AVAudioSession.Mode.spokenAudio)
            
            try audioSession.setActive(true, options: .notifyOthersOnDeactivation)
            try audioSession.overrideOutputAudioPort(AVAudioSession.PortOverride.speaker)
            
        } catch {
            debugPrint("Audio session initialization error: \(error.localizedDescription)")
        }
    }

    private func showAlert(_ message: String) {
//        let alertVC = UIAlertController(title: nil, message: message, preferredStyle: .alert)
//        let firstAction = UIAlertAction(title: "OK", style: .default, handler: { (action) in
//        })
//        alertVC.addAction(firstAction)
//        parentVc.present(alertVC, animated: true, completion: nil)
        JMProgressHUD.showInfo(message)
    }
}


Scheme 2: Baidu speech recognition

Introduction to Baidu Speech Recognition

Baidu speech recognition offers many features; here I will only briefly cover the speech recognition part. Baidu speech recognition has the following characteristics:

  • Online speech recognition supports arbitrary phrases; offline speech recognition only supports command words (grammar mode).
  • The first time offline recognition is used, the SDK downloads an offline authorization file from the platform. Once this succeeds, no network connection is needed for three years; when the license is about to expire, the SDK automatically tries several times to renew it online.
  • There is no purely offline recognition: offline mode can only recognize fixed phrases.
  • Offline recognition currently does not support arbitrary sentences. You can define a grammar at yuyin.baidu.com/asr, download the resulting BSG file, and replace bds_easr_gramm.dat with your own file (see the snippet after this list). The more entries you define, the worse the accuracy; at most about 100 lines are recommended.
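Once you have downloaded a custom BSG grammar from yuyin.baidu.com/asr, you point the offline engine at it with the parameter key used in the integration code later in this article. A minimal sketch (the file name my_commands.bsg is a placeholder, and asrEventManager is the BDSEventManager instance shown below):

// Point the Baidu offline engine at a custom grammar (.bsg) file downloaded from yuyin.baidu.com/asr.
NSString *grammarPath = [[NSBundle mainBundle] pathForResource:@"my_commands" ofType:@"bsg"];
[self.asrEventManager setParameter:grammarPath forKey:BDS_ASR_OFFLINE_ENGINE_GRAMMER_FILE_PATH];
// Load the offline engine so the grammar takes effect (as in the offline-parallel configuration below).
[self.asrEventManager sendCommand:BDS_ASR_CMD_LOAD_ENGINE];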

Baidu voice recognition SDK integration steps

Step 1: register on the Baidu Voice open platform and create an application, which generates an API_KEY, SECRET_KEY and APP_ID. When creating the application, use the project's Bundle Identifier as the package name. For the Baidu speech recognition registration page, click here.

Step 2: download the SDK, run the official demo first to see it working, and replace the API_KEY, SECRET_KEY and APP_ID with the ones generated for your application.

Note the resource bundle that must be imported during integration:

In the offline SDK demo above I simply wrap the Baidu SDK in a thin class; the offline configuration is done as follows.

The steps of Baidu SDK integration are as follows:

  1. Drag the following files from the official SDK into your project:

  2. Add required system framework, dynamic library:

  3. Encapsulate their own classes to achieve API calls to Baidu:

  • Import the required header files for speech recognition:
#import "BDSASRDefines.h"
#import "BDSASRParameters.h"
#import "BDSEventManager.h"
  • Define the APP_ID, API_KEY and SECRET_KEY associated with your application. You obtain these when registering on the Baidu platform:
const NSString* APP_ID = @"18569855";
const NSString* API_KEY = @"2qrMX1TgfTGslRMd3TcDuuBq";
const NSString* SECRET_KEY = @"xatUjET5NLNDXYNghNCnejt28MGpRYP2";
  • Initialize the SDK and create a BDSEventManager object, then set up the short-speech service (productId = "1537"):
- (instancetype)initWithLanguage:(KLanguage)language offlineGrammarDATFileURL:(NSURL * _Nullable)datFileURL {
    if (self = [super initWithLanguage:language]) {
        _offlineGrammarDATFileURL = [datFileURL copy];
        _asrEventManager = [BDSEventManager createEventManagerWithName:BDS_ASR_NAME];
        NSString *productId = [KHelper identifierForBaiduLanguage:language];
        [_asrEventManager setParameter:productId forKey:BDS_ASR_PRODUCT_ID];
    }
    return self;
}
  • Configure the offline engine and associated model resource files
- (void)configOfflineMode {
    [self.asrEventManager setDelegate:self];
    [self.asrEventManager setParameter:@(EVRDebugLogLevelError) forKey:BDS_ASR_DEBUG_LOG_LEVEL];
    
    // Parameter configuration: online authentication
    [self.asrEventManager setParameter:@[API_KEY, SECRET_KEY] forKey:BDS_ASR_API_SECRET_KEYS];
    
    NSBundle *bundle = [NSBundle bundleForClass:[KBaiduRecognizer class]];
    NSString *basicModelPath = [bundle pathForResource:@"bds_easr_basic_model" ofType:@"dat"];
    
    [self.asrEventManager setParameter:basicModelPath forKey:BDS_ASR_MODEL_VAD_DAT_FILE];
    [self.asrEventManager setParameter:@(YES) forKey:BDS_ASR_ENABLE_MODEL_VAD];
    
    [self.asrEventManager setParameter:APP_ID forKey:BDS_ASR_OFFLINE_APP_CODE];
    [self.asrEventManager setParameter:@(EVR_STRATEGY_BOTH) forKey:BDS_ASR_STRATEGY];
    [self.asrEventManager setParameter:@(EVR_OFFLINE_ENGINE_GRAMMER) forKey:BDS_ASR_OFFLINE_ENGINE_TYPE];
    [self.asrEventManager setParameter:basicModelPath forKey:BDS_ASR_OFFLINE_ENGINE_DAT_FILE_PATH];
    
    // Offline mode can only recognize phrases covered by the custom grammar rules
    NSString *grammarFilePath = [[NSBundle mainBundle] pathForResource:@"baidu_speech_grammar" ofType:@"bsg"];
    if (_offlineGrammarDATFileURL != nil) {
        if (![[NSFileManager defaultManager] fileExistsAtPath:_offlineGrammarDATFileURL.path]) {
            NSLog(@"!!! Error: The offline grammar file you provided does not exist: %@", _offlineGrammarDATFileURL.path);
        } else {
            grammarFilePath = _offlineGrammarDATFileURL.path;
        }
    }
    [self.asrEventManager setParameter:grammarFilePath forKey:BDS_ASR_OFFLINE_ENGINE_GRAMMER_FILE_PATH];
}

In addition to the basic configuration required for offline recognition, you can also set the following options:

  1. Recognition language: @0 Mandarin, @1 Cantonese, @2 English, @3 Sichuanese
    // Recognition language: @0 Mandarin, @1 Cantonese, @2 English, @3 Sichuanese
    [self.asrEventManager setParameter:@(EVoiceRecognitionLanguageChinese) forKey:BDS_ASR_LANGUAGE];
  2. Sampling rate: @"adaptive", @"8K", @"16K"
    // Sampling rate: @"adaptive", @"8K", @"16K"
    [self.asrEventManager setParameter:@(EVoiceRecognitionRecordSampleRateAuto) forKey:BDS_ASR_SAMPLE_RATE];
  3. Whether to enable long speech recognition
    // Whether to enable long speech recognition
    [self.asrEventManager setParameter:@(YES) forKey:BDS_ASR_ENABLE_LONG_SPEECH];
    // Prompt tone: @(0) off, @(EVRPlayToneAll) on.
    // To use long speech, the prompt tone must be turned off
    [self.asrEventManager setParameter:@(0) forKey:BDS_ASR_PLAY_TONE];
    // Endpoint detection (VAD) automatically detects the start and end points of audio input.
    // The SDK enables VAD by default and automatically stops recognition when silence is detected.
    // To use long speech, local VAD must remain enabled:
    //[self.asrEventManager setParameter:@(YES) forKey:BDS_ASR_ENABLE_LOCAL_VAD];
    // To disable VAD completely, disable both the server VAD and the on-device VAD:
    // Disable the server VAD
    [self.asrEventManager setParameter:@(NO) forKey:BDS_ASR_ENABLE_EARLY_RETURN];
    // Disable the local VAD
    [self.asrEventManager setParameter:@(NO) forKey:BDS_ASR_ENABLE_LOCAL_VAD];
  • ModelVAD endpoint detection: more accurate detection and stronger noise resistance, but slower response
- (void)configModelVAD {
    NSString *modelVAD_filepath = [[NSBundle mainBundle] pathForResource:@"bds_easr_basic_model" ofType:@"dat"];
    // Path of the resource file required by ModelVAD
    [self.asrEventManager setParameter:modelVAD_filepath forKey:BDS_ASR_MODEL_VAD_DAT_FILE];
}
  • DNNMFE endpoint detection provides basic detection functions with high performance and fast response
//DNNMFE endpoint detection mode provides basic detection function, high performance, fast response
- (void)configDNNMFE {
    // Set the MFE model file
    NSString *mfe_dnn_filepath = [[NSBundle mainBundle] pathForResource:@"bds_easr_mfe_dnn" ofType:@"dat"];
    [self.asrEventManager setParameter:mfe_dnn_filepath forKey:BDS_ASR_MFE_DNN_DAT_FILE];
    // Set the MFE CMVN file path
    NSString *cmvn_dnn_filepath = [[NSBundle mainBundle] pathForResource:@"bds_easr_mfe_cmvn" ofType:@"dat"];
    [self.asrEventManager setParameter:cmvn_dnn_filepath forKey:BDS_ASR_MFE_CMVN_DAT_FILE];
    // Whether to use ModelVAD to open resource file parameters to be configured
    [self.asrEventManager setParameter:@(NO) forKey:BDS_ASR_ENABLE_MODEL_VAD];
    // MFE supports custom mute duration
    // [self.asrEventManager setParameter:@(500.f) forKey:BDS_ASR_MFE_MAX_SPEECH_PAUSE];
    // [self.asrEventManager setParameter:@(500.f) forKey:BDS_ASR_MFE_MAX_WAIT_DURATION];
}
  • Offline parallel configuration
     // Parameter Settings: The identification policy is off-line parallel
        [self.asrEventManager setParameter:@(EVR_STRATEGY_BOTH) forKey:BDS_ASR_STRATEGY];
        // Parameter Settings: Offline engine type EVR_OFFLINE_ENGINE_INPUT Input mode EVR_OFFLINE_ENGINE_GRAMMER Offline engine syntax mode
        // Offline speech recognition only supports command word recognition (syntax mode).
        //[self.asrEventManager setParameter:@(EVR_OFFLINE_ENGINE_INPUT) forKey:BDS_ASR_OFFLINE_ENGINE_TYPE];
        [self.asrEventManager setParameter:@(EVR_OFFLINE_ENGINE_GRAMMER) forKey:BDS_ASR_OFFLINE_ENGINE_TYPE];
        // Generate the BSG file. After downloading the syntax file, set the BDS_ASR_OFFLINE_ENGINE_GRAMMER_FILE_PATH parameter
        NSString* gramm_filepath = [[NSBundle mainBundle] pathForResource:@"bds_easr_gramm" ofType:@"dat"];
        // Define your grammar following the template at http://speech.baidu.com/asr, download the grammar file, and use its path as the BDS_ASR_OFFLINE_ENGINE_GRAMMER_FILE_PATH parameter
        [self.asrEventManager setParameter:gramm_filepath forKey:BDS_ASR_OFFLINE_ENGINE_GRAMMER_FILE_PATH];
        // Identify the resource file path offline
        NSString* lm_filepath = [[NSBundle mainBundle] pathForResource:@"bds_easr_basic_model" ofType:@"dat"];
        [self.asrEventManager setParameter:lm_filepath forKey:BDS_ASR_OFFLINE_ENGINE_DAT_FILE_PATH];
         // Load the offline engine
        [self.asrEventManager sendCommand:BDS_ASR_CMD_LOAD_ENGINE];   
  • Implement the delegate callbacks
#pragma mark - Callbacks for speech recognition status, recording data, etc. all arrive in this delegate method
- (void)VoiceRecognitionClientWorkStatus:(int)workStatus obj:(id)aObj {
    switch (workStatus) {
        case EVoiceRecognitionClientWorkStatusNewRecordData: {
            [self.fileHandler writeData:(NSData *)aObj];
            NSLog(@"Recording data callback");
            break;
        }
            
        case EVoiceRecognitionClientWorkStatusStartWorkIng: {
            NSLog(@"Identification begins and data is collected and processed.");
            NSDictionary *logDic = [self parseLogToDic:aObj];
            [self printLogTextView:[NSString stringWithFormat:@"Start identifying -log: %@\n", logDic]];
            break;
        }
        case EVoiceRecognitionClientWorkStatusStart: {
            NSLog(@"User started speaking detected");
            [self printLogTextView:@"Detects user starts talking.\n"];
            break;
        }
        case EVoiceRecognitionClientWorkStatusEnd: {
            NSLog(@"User finished speaking, but server has not returned results.");
            [self printLogTextView:@"User finished speaking, but the server has not returned results.\n"];
            self.contentTextView.text = @"No recognized result";
            break;
        }
        case EVoiceRecognitionClientWorkStatusFlushData: {
            // The server has returned an intermediate (partial) result. Showing these partial results
            // sentence by sentence gives a continuous "typing" effect that improves the dictation experience.
            // Each time a new message of this type arrives, clear the text in the display area first to avoid duplication.
            NSLog(@"Line by line");
            [self printLogTextView:[NSString stringWithFormat:@"Server returned intermediate junction - %@.\n\n"[self getDescriptionForDic:aObj]]];
            
            self.contentTextView.text = @"";
            NSArray *contentArr = aObj[@"results_recognition"];
            NSString *contentStr = contentArr[0];
            self.contentTextView.text = contentStr;
            
            break;
        }
        case EVoiceRecognitionClientWorkStatusFinish: {
            //// This status value indicates that the speech recognition server has returned the final result, which is stored as an array in an aObj object
            // Upon receiving this message, you should clear the text in the display area to avoid duplication
            NSLog(@"Returns final results.");
            /* "origin_result" = { "corpus_no" = 6643061564690340286; "err_no" = 0; result = { word = ( "\U597d\U7684" ); }; sn = "5EEAC770-DDD2-4D35-8ABF-F407276A7934"; "voice_energy" = "29160.45703125"; }; "results_recognition" = ( "\U597d\U7684" ); */
            
            [self printLogTextView:[NSString stringWithFormat:@"Final result - %@.\n"[self getDescriptionForDic:aObj]]];
            if (aObj) {
                
                // NSArray *contentArr = aObj[@"results_recognition"];
                // NSString *contentStr = contentArr[0];
                // NSLog(@"contentStr = %@",contentStr);
                self.contentTextView.text =  [self getDescriptionForDic:aObj];
                
            }
            
            break;
        }
        case EVoiceRecognitionClientWorkStatusMeterLevel: {
            NSLog(@"Current volume callback");
            break;
        }
        case EVoiceRecognitionClientWorkStatusCancel: {
            NSLog(@"User cancels");
            [self printLogTextView:@"User cancels.\n"];
            
            break;
        }
        case EVoiceRecognitionClientWorkStatusError: {
            // Error status no voice input
            NSLog(@"Error state");
            NSError * error = (NSError *)aObj;
            
            if (error.code == 2228236) {
                //// Offline engine error status:
                // Failed to identify. (In syntax mode, speech may not be under custom syntax rules)
                 [self printLogTextView:[NSString stringWithFormat:@"Error state - in syntax mode, speech may not be under custom syntax rules \n %@.\n", (NSError *)aObj]];
            }else if (error.code == 2228230){
                 [self printLogTextView:[NSString stringWithFormat:@"Error status -dat model file unavailable, please set BDS_ASR_OFFLINE_ENGINE_DAT_FILE_PATH\n %@.\n", (NSError *)aObj]];
            }else if (error.code == 2228231){
                 [self printLogTextView:[NSString stringWithFormat:@"Error status-grammar file is invalid, please set BDS_ASR_OFFLINE_ENGINE_GRAMMER_FILE_PATH\n %@.\n", (NSError *)aObj]];
            }else if (error.code == 2225219){
                [self printLogTextView:[NSString stringWithFormat:@"Error status - Audio quality too low to recognize \n %@.\n", (NSError *)aObj]];
            } else {
                [self printLogTextView:[NSString stringWithFormat:@"Error status - %@.\n", (NSError *)aObj]];
            }
           
            break;
        }
        case EVoiceRecognitionClientWorkStatusLoaded: {
            NSLog(@"Offline engine loading completed");
            [self printLogTextView:@"Offline engine loading completed.\n"];
            break;
        }
        case EVoiceRecognitionClientWorkStatusUnLoaded: {
            NSLog(@"Offline engine uninstallation completed.");
            [self printLogTextView:@"Offline engine uninstallation completed.\n"];
            break;
        }
        case EVoiceRecognitionClientWorkStatusChunkThirdData: {
            NSLog(@"Identifying third party data in results");
            [self printLogTextView:[NSString stringWithFormat:@"Identify third party data in results: %lu\n", (unsigned long)[(NSData *)aObj length]]];
            break;
        }
        case EVoiceRecognitionClientWorkStatusChunkNlu: {
            NSLog(@"Semantic consequences of other consequences.");
            NSString *nlu = [[NSString alloc] initWithData:(NSData *)aObj encoding:NSUTF8StringEncoding];
            [self printLogTextView:[NSString stringWithFormat:@"Semantic result in recognition result: %@\n", nlu]];
            NSLog(@"% @", nlu);
            break;
        }
        case EVoiceRecognitionClientWorkStatusChunkEnd: {
            NSLog(@"Completion of identification process.");
            [self printLogTextView:[NSString stringWithFormat:@"End of identification process, sn: %@.\n", aObj]];
            
            break;
        }
        case EVoiceRecognitionClientWorkStatusFeedback: {
            NSLog(@"Identifying process feedback dotting data");
            NSDictionary *logDic = [self parseLogToDic:aObj];
            [self printLogTextView:[NSString stringWithFormat:@"Identifying process feedback dotting data: %@\n", logDic]];
            break;
        }
        case EVoiceRecognitionClientWorkStatusRecorderEnd: {
            // When the recorder is off, the page should be detected to avoid the status bar (iOS).
            NSLog(@"Recorder off");
            [self printLogTextView:@"Recorder off.\n"];
            break;
        }
        case EVoiceRecognitionClientWorkStatusLongSpeechEnd: {
            NSLog(@"Long voice end state");
            [self printLogTextView:@"Long voice end state.\n"];
            
            break;
        }
        default:
            break;
    }
}
  • Provides start and stop identification methods

// Start identifying
- (void) startWithResultHandler:(KRecognitionResultHandler)resultHandler errorHandler:(KErrorHandler)errorHandler {
    self.resultHandler = resultHandler;
    self.errorHandler = errorHandler;
    [self.asrEventManager sendCommand:BDS_ASR_CMD_START];
}

// Stop identifying
- (void)stop {
    [self.asrEventManager sendCommand:BDS_ASR_CMD_STOP];
}

  • The complete wrapper code is as follows:
//
// KBaiduRecognizer.m
// KSpeechRecognition
//
// Created by yulu kong on 2020/4/3.
// Copyright © 2020 Yulu Kong. All rights reserved
//

#import "KBaiduRecognizer.h"
#import "BDSASRDefines.h"
#import "BDSASRParameters.h"
#import "BDSEventManager.h"
#import "KHelper.h"
#import "KError.h"

const NSString* APP_ID = @"18569855";
const NSString* API_KEY = @"2qrMX1TgfTGslRMd3TcDuuBq";
const NSString* SECRET_KEY = @"xatUjET5NLNDXYNghNCnejt28MGpRYP2";

@interface KBaiduRecognizer() <BDSClientASRDelegate>
@property (strong, nonatomic) BDSEventManager *asrEventManager;
@property (nonatomic, strong) NSURL *offlineGrammarDATFileURL;

@end

@implementation KBaiduRecognizer


- (KAuthorizationStatus)authorizationStatus {
    return KAuthorizationStatusAuthorized;
}

- (instancetype)initWithLanguage:(KLanguage)language {
    return [self initWithLanguage:language offlineGrammarDATFileURL:nil];
}

- (instancetype)initWithLanguage:(KLanguage)language offlineGrammarDATFileURL:(NSURL * _Nullable)datFileURL {
    if (self = [super initWithLanguage:language]) {
        _offlineGrammarDATFileURL = [datFileURL copy];
        // Create a speech recognition object
        _asrEventManager = [BDSEventManager createEventManagerWithName:BDS_ASR_NAME];
        NSString *productId = [KHelper identifierForBaiduLanguage:language];
        [_asrEventManager setParameter:productId forKey:BDS_ASR_PRODUCT_ID];
    }
    return self;
}

- (void) startWithResultHandler:(KRecognitionResultHandler)resultHandler errorHandler:(KErrorHandler)errorHandler {
    self.resultHandler = resultHandler;
    self.errorHandler = errorHandler;
    [self.asrEventManager sendCommand:BDS_ASR_CMD_START];
}

- (void)stop {
    [self.asrEventManager sendCommand:BDS_ASR_CMD_STOP];
}

- (void)configOfflineMode {
    // Set the voice recognition proxy
    [self.asrEventManager setDelegate:self];
    [self.asrEventManager setParameter:@(EVRDebugLogLevelError) forKey:BDS_ASR_DEBUG_LOG_LEVEL];
    
    
    // Parameter configuration: online authentication
    [self.asrEventManager setParameter:@[API_KEY, SECRET_KEY] forKey:BDS_ASR_API_SECRET_KEYS];
    
    NSBundle *bundle = [NSBundle bundleForClass:[KBaiduRecognizer class]];
    NSString *basicModelPath = [bundle pathForResource:@"bds_easr_basic_model" ofType:@"dat"];
    
    [self.asrEventManager setParameter:basicModelPath forKey:BDS_ASR_MODEL_VAD_DAT_FILE];
    [self.asrEventManager setParameter:@(YES) forKey:BDS_ASR_ENABLE_MODEL_VAD];
    
    // Offline engine authentication: the APPCODE (APP_ID) is required for offline authorization; remove any temporary authorization file
    [self.asrEventManager setParameter:APP_ID forKey:BDS_ASR_OFFLINE_APP_CODE];
    // Strategy: @0 online recognition only, @4 (EVR_STRATEGY_BOTH) offline and online in parallel
    [self.asrEventManager setParameter:@(EVR_STRATEGY_BOTH) forKey:BDS_ASR_STRATEGY];
    [self.asrEventManager setParameter:@(EVR_OFFLINE_ENGINE_GRAMMER) forKey:BDS_ASR_OFFLINE_ENGINE_TYPE];
    [self.asrEventManager setParameter:basicModelPath forKey:BDS_ASR_OFFLINE_ENGINE_DAT_FILE_PATH];
    
    // Offline mode can only recognize phrases covered by the custom grammar rules
    NSString *grammarFilePath = [[NSBundle mainBundle] pathForResource:@"baidu_speech_grammar" ofType:@"bsg"];
    if (_offlineGrammarDATFileURL != nil) {
        if (![[NSFileManager defaultManager] fileExistsAtPath:_offlineGrammarDATFileURL.path]) {
            NSLog(@"!!! Error: The offline grammar file you provided does not exist: %@", _offlineGrammarDATFileURL.path);
        } else {
            grammarFilePath = _offlineGrammarDATFileURL.path;
        }
    }
    [self.asrEventManager setParameter:grammarFilePath forKey:BDS_ASR_OFFLINE_ENGINE_GRAMMER_FILE_PATH];
}


// MARK: - BDSClientASRDelegate

- (void)VoiceRecognitionClientWorkStatus:(int)workStatus obj:(id)aObj {
    switch (workStatus) {
        case EVoiceRecognitionClientWorkStatusStartWorkIng: {
            break;
        }
        case EVoiceRecognitionClientWorkStatusStart:
            break;
            
        case EVoiceRecognitionClientWorkStatusEnd: {
            break;
        }
        case EVoiceRecognitionClientWorkStatusFlushData: {
            [self receiveRecognitionResult:aObj isFinal:NO];
            break;
        }
        case EVoiceRecognitionClientWorkStatusFinish: {
            [self receiveRecognitionResult:aObj isFinal:YES];
            break;
        }
        case EVoiceRecognitionClientWorkStatusError: {
            self.errorHandler([KError errorWithCode:-1 message:@"Speech recognition failure."]);
            break;
        }
        case EVoiceRecognitionClientWorkStatusRecorderEnd: {
            break;
        }
        default:
            break;
    }
}

- (void)receiveRecognitionResult:(id)resultInfo isFinal:(BOOL)isFinal {
    if (resultInfo == nil || ![resultInfo isKindOfClass:[NSDictionary class]]) {
        return;
    }
    
    NSDictionary *info = (NSDictionary *)resultInfo;
    NSString *text = info[@"results_recognition"];
    if (text != nil && [text length] > 0) {
        self.resultHandler([[KRecognitionResult alloc] initWithText:text isFinal:isFinal]);
    }
}

@end


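Finally, a usage sketch of the KBaiduRecognizer wrapper. Again this is only a sketch: KLanguageChinese, the KRecognitionResult property names, and calling configOfflineMode from outside the class are assumptions based on the code above, not a confirmed API.

// Hypothetical usage of the KBaiduRecognizer wrapper defined above.
// KLanguageChinese and the KRecognitionResult properties are assumed names.
NSURL *grammarURL = [[NSBundle mainBundle] URLForResource:@"baidu_speech_grammar" withExtension:@"bsg"];
KBaiduRecognizer *recognizer = [[KBaiduRecognizer alloc] initWithLanguage:KLanguageChinese
                                             offlineGrammarDATFileURL:grammarURL];
[recognizer configOfflineMode];
[recognizer startWithResultHandler:^(KRecognitionResult *result) {
    NSLog(@"Recognized text: %@ (final: %d)", result.text, result.isFinal);
} errorHandler:^(NSError *error) {
    NSLog(@"Recognition error: %@", error);
}];
// Later, when finished:
// [recognizer stop];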

Scheme 3: Use an open-source framework

1. KALDI

Kaldi source code download: after installing Git, run

git clone https://github.com/kaldi-asr/kaldi.git kaldi-trunk --origin golden

If the download is too slow, look up how to speed up GitHub downloads.

Alternatively, download from the domestic mirror (http://git.oschina.net/):

git clone https://gitee.com/cocoon_zz/kaldi.git kaldi-trunk --origin golden
  • The Kaldi documentation (kaldi-asr.org/doc/index.h…) contains many explanations of the principles and tools; if you have questions, read this first.
  • The Kaldi lectures (www.danielpovey.com/kaldi-lectu…) give a more concise introduction to the principles and workflow than the full documentation.
  • Kaldi Chinese translation 1: if reading English is a headache, search for this translation of the official documentation; it comes from a QQ group for learning and discussing Kaldi.
  • Kaldi Chinese translation 2: shiweipku.gitbooks.io/chinese-doc…

KALDI overview

KALDI is a well-known open-source automatic speech recognition (ASR) toolkit. It provides training tools for building the ASR models most commonly used in industry today, as well as pipelines for other sub-tasks such as speaker verification and language identification. KALDI is currently maintained by Daniel Povey, who did ASR research at Cambridge before moving to JHU to develop KALDI and is now head of voice at Xiaomi's headquarters in Beijing; he is also one of the lead authors of HTK, another well-known ASR tool. KALDI is popular in the ASR field because it provides neural network models (DNN, TDNN, LSTM) that other ASR tools lack and that can be used in industry. In contrast to general-purpose neural network libraries with Python interfaces (TensorFlow, PyTorch, etc.), KALDI's interface is a set of command-line tools, which demands solid shell scripting skills from learners and users. To simplify building speech recognition pipelines, KALDI also ships a large number of general-purpose scripts, written mainly in shell, Perl and Python; you mostly need to understand the shell and Python ones, since the Perl scripts can be replaced by Python and are not worth learning just for this. The overall structure of the KALDI toolbox is shown below.

The KALDI libraries (Matrix, Utils, GMM, SGMM, Transforms, LM, Tree, FST ext, HMM, Decoder, and so on) are compiled into command-line tools, shown as the C++ executables in the diagram. On top of these, KALDI provides example recipes and general-purpose scripts that show how to use the tools to build ASR pipelines, and these are great learning material. Besides the scripts that ship with KALDI, the documents on its official website are also worth studying. Of course, beyond the toolkit itself you also need to learn the principles of ASR; only by combining both can you really master this difficult field.

KALDI source code compiled and installed

How to install: see the INSTALL file in the downloaded directory.

This is the official Kaldi INSTALL. Look also at INSTALL.md for the git mirror installation.
[for native Windows install, see windows/INSTALL]

(1)
go to tools/  and follow INSTALL instructions there.

(2)
go to src/ and follow INSTALL instructions there.

If a problem occurs, check the INSTALL file under each directory. Some configuration problems (such as the version of GCC and CUDA) can be solved by checking this document.

Dependent file compilation

First check the dependencies

cd extras
sudo bash check_dependencies.sh

Note: make -j 4 runs the build with multiple processes.

cd kaldi-trunk/tools
make

Configure Kaldi source code

cd ../src
# If you don't have a GPU, check the configure options to disable CUDA
./configure --shared

Compile Kaldi source code

make all
#Note that CUDA compilation may fail because the configure script does not locate CUDA correctly. You need to manually edit the path that Configure finds and change the CUDA library location

Test whether the installation is successful

cd ../egs/yesno/s5
./run.sh

The decoding results are placed under the exp (experiment) directory. For example, to inspect a log under ~/kaldi-trunk/egs/yesno/s5/exp/mono0a/log, run: vim align.38.1.log

Information on the principle of speech recognition

The principle of speech recognition www.zhihu.com/question/20…

HTK Book www.ee.columbia.edu/ln/LabROSA/…

How to explain hidden Markov models with simple and understandable examples? www.zhihu.com/question/20…

Interpreting some of the Kaldi scripts

1: run.sh is the top-level script that ties all the other scripts together. {run.sh >>> path.sh >>> directory (where the training data is stored) >>> mono-phone >>> triphone >>> LDA_MLLT >>> SAT >>> quick}

Data preparation:

1: Generate text, wav.scp, utt2spk and spk2utt (produced by local/data_prep.sh).
   text: contains the transcription of each utterance, e.g. sw02001-A_002736-002893 IS
   wav.scp: maps utterance IDs to the actual audio via an extended filename, e.g. sw02001-A /home/dpovey/kaldi-trunk/tools/sph2pipe_v2.5/sph2pipe -f wav -p -c 1 /export/corpora3/LDC/LDC97S62/swb1/sw02001.sph |
   utt2spk: indicates which speaker produced each utterance (the speaker ID only needs to be consistent, it does not have to match the speaker's real name), e.g. sw02001-A_000098-001156 2001-A
   spk2utt: the reverse mapping from speaker to utterances.
2: Produce MFCC features.
3: Prepare the language resources (build a large lexicon that covers the words used in both training and decoding).
4: Train the models: tri3b performs SAT (speaker adaptive training, using feature-space maximum likelihood linear regression, fMLLR); tri4b performs quick LDA-MLLT (Linear Discriminant Analysis plus Maximum Likelihood Linear Transform). LDA builds HMM states from dimensionality-reduced feature vectors; MLLT then derives a per-speaker transform from the LDA-reduced feature space, which is effectively a speaker normalization. SAT also normalizes speaker and noise.
5: DNN training.

2: cmd.sh usually needs to be modified. For running locally it is typically set to something like:

export train_cmd="run.pl --mem 4G"
export decode_cmd="run.pl --mem 8G"
export cuda_cmd="run.pl --gpu 1"   # adjust or drop --gpu 1 if there is no GPU

3:path.sh (setting environment variables)

export KALDI_ROOT=`pwd`/../../..
[ -f $KALDI_ROOT/tools/env.sh ] && . $KALDI_ROOT/tools/env.sh
export PATH=$PWD/utils/:$KALDI_ROOT/tools/openfst/bin:$PWD:$PATH
[ ! -f $KALDI_ROOT/tools/config/common_path.sh ] && echo >&2 "The standard file $KALDI_ROOT/tools/config/common_path.sh is not present -> Exit!" && exit 1

The line [ -f $KALDI_ROOT/tools/env.sh ] && . $KALDI_ROOT/tools/env.sh means "execute the environment script if it exists" (I did not find it at this path). The PATH export then adds the utils directory of the current recipe, the tools/openfst/bin directory under the Kaldi root, and the current directory to PATH. Finally, if tools/config/common_path.sh does not exist under the Kaldi root, a message is printed saying the file is missing and the script exits.

Kaldi's training scripts need their data preparation part rewritten for each new corpus. These scripts are generally kept in the conf and local folders. conf holds configuration files, such as the parameters for extracting MFCC and filterbank features and the decoding parameters (mainly the sampling frequency, which must be set to match the corpus). local holds the corpus-specific data preparation: the pronunciation lexicon, the transcripts corresponding to the audio files, and a basic usable language model (used during decoding). After training, the exp directory contains final.mdl, the trained model, and the graph_word directory contains the FST (a finite state transducer built from the pronunciation lexicon: input phonemes, output words).

Kaldi compiles to the iOS static library

Compile the script as follows:

#! /bin/bash

if [ ! \( -x "./configure" \) ] ; then
    echo "This script must be run in the folder containing the \"configure\" script."
    exit 1
fi

export DEVROOT=`xcode-select --print-path`
export SDKROOT=$DEVROOT/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS.sdk

# Set up relevant environment variables
export CPPFLAGS="-I$SDKROOT/usr/include/c++/4.2.1/ -I$SDKROOT/usr/include/ -miphoneos-version-min=10.0 -arch arm64"
export CFLAGS="$CPPFLAGS -arch arm64 -pipe -no-cpp-precomp -isysroot $SDKROOT"
#export CXXFLAGS="$CFLAGS"
export CXXFLAGS="$CFLAGS -std=c++11 -stdlib=libc++"

MODULES="online2 ivector nnet2 lat decoder feat transform gmm hmm tree matrix util base itf cudamatrix fstext"
INCLUDE_DIR=include/kaldi

mkdir -p $INCLUDE_DIR

echo "Copying include files"

LIBS=""
for m in $MODULES
do
    cd $m
    echo
    echo "BUILDING MODULE $m"
    echo
    if [[ -f Makefile ]]
    then
        make
        lib=$(ls *.a)  # this will fail (gracefully) for the itf module since it only contains .h files
        LIBS+=" $m/$lib"
    fi

    echo "Creating module folder: $INCLUDE_DIR/$m"
    cd ..
    mkdir -p $INCLUDE_DIR/$m
    cp -v $m/*.h $INCLUDE_DIR/$m/
done

echo "LIBS: $LIBS"

LIBNAME="kaldi-ios.a"
libtool -static -o $LIBNAME $LIBS

cat >&2 << EOF
Build succeeded!
Library is in $LIBNAME
h files are in $INCLUDE_DIR
EOF

This script only builds a static library that supports the ARM64 architecture. If you want to support other architectures, you can add it directly:

export CPPFLAGS="-I$SDKROOT/usr/include/c++/4.2.1/ -I$SDKROOT/usr/include/ -miphoneos-version-min=10.0 -arch arm64"

The script above comes from the blogger "Changfeng floating clouds". His Jianshu page: www.jianshu.com/p/faff2cd48… He has written a number of posts about Kaldi, so if you need to do further research, check out his blog.

iOS online and offline speech recognition demos built on the Kaldi source code

Here are the two demos he provides for iOS online and offline recognition:

  • Online Identification Demo
  • Offline Identification Demo

2. CMUSphinx

CMUSphinx official resources navigation:

  • Getting started: cmusphinx.github.io/wiki/tutori…

  • Advanced: cmusphinx.github.io/wiki/

  • Problem search: sourceforge.net/p/cmusphinx…

  • Advantages

  1. Compared with Kaldi, this open-source speech recognition framework is better suited to application development: its features are encapsulated in an easy-to-understand way, the decoding code is easy to follow, and besides the PC platform the author also considers embedded platforms, so Android development is convenient as well. There is already a demo and a wiki for speech evaluation based on PocketSphinx, and its real-time performance is much better than Kaldi's.
  2. Because it is development-friendly, there are many open-source programs and educational evaluation papers based on it.
  3. Overall, PocketSphinx is a good place to get started.
  • Disadvantages

Compared with Kaldi, it uses the GMM-HMM framework, so accuracy may be lower. It also has fewer auxiliary tools (pitch extraction, etc.) than Kaldi.

  • CMUSphinx installation under Windows (1): see cmusphinx.github.io/wiki/tutori…
  • Android demo: cmusphinx.github.io/wiki/tutori…

Sphinx tool introduction

Pocketsphinx – A lightweight recognition library written in C, used for recognition.

Sphinxbase – The support library for Pocketsphinx, mainly used for feature extraction from the speech signal.

Sphinx3 – A decoder written in C, for speech recognition research.

Sphinx4 – A decoder written in Java, for speech recognition research.

CMUclmtk – Language model training tool.

Sphinxtrain – Acoustic model training tool.

Download url: sourceforge.net/projects/cm…

Sphinx is a large-vocabulary, speaker-independent, continuous English speech recognition system developed by Carnegie Mellon University. Sphinx has been funded and supported by CMU, DARPA and other agencies since its inception, and later evolved into an open-source project. The CMU Sphinx group is currently developing the following decoders:

Sphinx-2 uses a semi-continuous hidden Markov model (SCHMM), which is relatively dated, so its recognition accuracy is lower than that of the other decoders.

PocketSphinx is an embedded speech recognition engine with very small computation and size requirements. Adapted from Sphinx-2 for embedded systems, it is the first open-source embedded-oriented medium-vocabulary continuous speech recognition project. Its recognition accuracy is about the same as Sphinx-2's.

Sphinx-3 is CMU's high-level large-vocabulary speech recognition system, modeled with continuous hidden Markov models (CHMM). It supports several modes of operation: a high-accuracy flat decoder, optimized from the original Sphinx-3, and a fast-search tree decoder. At present the two decoders are used together.

Sphinx-4 is a large-vocabulary speech recognition system written in Java, also modeled with continuous hidden Markov models. Compared with earlier versions it improves modularity, flexibility and algorithms: it adopts new search strategies and supports a variety of grammars and language models, acoustic models and feature streams. Innovative algorithms allow multiple information sources to be merged into an elegant knowledge rule that better matches actual semantics. Because it is written entirely in Java it is highly portable, supports multithreading, and offers a flexible multithreaded interface.

3.

Reference: www.jianshu.com/p/0f4a53450…