Preface

As technology advances and people's expectations for audio quality rise, audio processing schemes keep getting richer. I recently worked on an audio requirement: decode MP3, WAV, and other audio files into PCM at a 16000 Hz sample rate, then feed the PCM into a noise reduction SDK and play the result back in real time. This post is a record and summary of that work.

Plan

  • Audio noise reduction
  • Playback speed adjustment
  • Audio format conversion
  • Real-time playback

1. Audio noise reduction

We know that sound is a wave, composed of sine waves of different frequencies, so audio noise reduction is essentially an algorithm that removes the sine-wave components representing noise. The basic principle: analyze the spectrum of the digital audio signal to obtain the intensity and spectral distribution of the noise, then design a filter from that model to filter the noise out and make the main sound more prominent.

Three noise reduction schemes are in common use:

1. Speex is an open-source, free (BSD-licensed) software suite for speech that includes codec, VAD (voice activity detection), AEC (acoustic echo cancellation), NS (noise suppression), and other practical modules, so we can use Speex for noise reduction even in commercial applications.

2. With WebRTC now open source, its audio noise reduction module can also meet our needs. The noise reduction code in WebRTC is written in C, so it is well supported across platforms (ARM and x86). In practice, noise reduction is strongly tied to the scene: a quiet conference room and a noisy shopping mall call for noticeably different noise reduction algorithms and strategies.

3. With the development of machine learning and neural networks, there are also noise reduction algorithms that train on audio data from the business scenario and gradually improve the result, such as RNNoise. Some teams also use TensorFlow to optimize noise reduction.

Speex and WebRTC are both open source, so if there are no special customization requirements they cover the basics. Our project uses the third scheme and optimizes noise reduction through machine learning. Part of the code is shown below:

- (NSData *)getOutputDenoiseData:(NSData *)inputData {
    if (_nnse) {
        int perBufferLen = 16000;
        int bufferIndex = 0;
        NSMutableData *buffer = [[NSMutableData alloc] init];
        // Feed the input to the denoise SDK in fixed-size chunks.
        while ([inputData length] - bufferIndex > perBufferLen) {
            NSData *data = [inputData subdataWithRange:NSMakeRange(bufferIndex, perBufferLen)];
            bufferIndex += [data length];
            _packid++;
            char outBuffer[17024] = {0};
            int outSize = 0;
            int ret = [_nnseWrapper do_lstm_nnse:_nnse packId:_packid voiceData:data outputData:outBuffer outputDataLen:&outSize];
            if (!ret) {
                NSData *outPutData = [[NSData alloc] initWithBytes:outBuffer length:outSize];
                [buffer appendData:outPutData];
                NSLog(@"output %lu", (unsigned long)outPutData.length);
            }
        }
        // Process the remaining tail that is shorter than perBufferLen.
        NSData *data = [inputData subdataWithRange:NSMakeRange(bufferIndex, [inputData length] - bufferIndex)];
        char outBuffer[17024] = {0};
        int outSize = 0;
        int ret = [_nnseWrapper do_lstm_nnse:_nnse packId:_packid voiceData:data outputData:outBuffer outputDataLen:&outSize];
        if (!ret) {
            NSData *outPutData = [[NSData alloc] initWithBytes:outBuffer length:outSize];
            [buffer appendData:outPutData];
            NSLog(@"output %lu", (unsigned long)outPutData.length);
        }
        return buffer;
    }
    return NULL;
}

2. Playback speed adjustment

1. Speed change with constant pitch: keep the pitch and meaning unchanged while speaking more slowly or quickly. On a spectrogram this looks like an accordion-like compression or expansion along the time axis. The fundamental frequency stays almost constant, which corresponds to the unchanged pitch, while the whole time course is compressed or stretched: the number of glottal cycles increases or decreases, i.e. the rate of vocal tract movement changes, and the speech rate changes with it. In terms of the speech production model, the excitation and the system pass through nearly the same states as in the original utterance, only over a longer or shorter duration.

2. Pitch change with constant speed: change the speaker's fundamental frequency while keeping the speed and semantics unchanged, i.e. keep the short-time spectral envelope (formant positions and bandwidths) and the time course basically the same. In terms of the speech production model, the excitation source is modified while the formant parameters of the vocal tract model stay almost unchanged, which keeps the semantics and the speech rate constant.

iOS itself offers a way to change speed during playback with an Audio Unit: the Varispeed unit's kVarispeedParam_PlaybackRate parameter (a sketch follows below). However, we need the noise-reduced, speed-adjusted PCM as output data, not just faster playback, so this approach does not fit here.
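For reference, a minimal sketch of that system approach, assuming a Varispeed unit (kAudioUnitType_FormatConverter / kAudioUnitSubType_Varispeed) has already been created and wired up; the _varispeedUnit ivar is an illustrative name, not part of our project code:

#import <AudioToolbox/AudioToolbox.h>

// Assumes `_varispeedUnit` is an instantiated, connected Varispeed Audio Unit.
- (void)setSystemPlaybackRate:(Float32)rate {
    // kVarispeedParam_PlaybackRate lives in the Global scope of the Varispeed unit.
    OSStatus status = AudioUnitSetParameter(_varispeedUnit,
                                            kVarispeedParam_PlaybackRate,
                                            kAudioUnitScope_Global,
                                            0,      // element
                                            rate,   // e.g. 1.5 for 1.5x speed
                                            0);     // buffer offset in frames
    if (status != noErr) {
        NSLog(@"Failed to set playback rate: %d", (int)status);
    }
}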

The two common audio time-stretching libraries, SoundTouch and Sonic, are used in a similar way: both are packaged as libraries that take the original audio's PCM data and process it to the target speed through interface functions. Our project uses Sonic. First, set the playback rate:

- (void)setRate:(CGFloat)rate {
    self.audioRate = rate;
    if (_sonic) {
        sonicDestroyStream(_sonic);
    }
    _sonic = sonicCreateStream(self.config.outputFormat.mSampleRate, self.config.outputFormat.mChannelsPerFrame);
    sonicSetRate(_sonic, 1);
    sonicSetPitch(_sonic, 1);
    sonicSetSpeed(_sonic, self.audioRate != 0 ? self.audioRate : 1);
}

Then, when reading the audio, the data can be read back at the adjusted speed:

// Feed the raw PCM samples into the Sonic stream...
int ret = sonicWriteShortToStream(_sonic, _buffList->mBuffers[0].mData, 4096);
if (ret) {
       // ...then read the speed-adjusted samples back out.
       int new_buffer_size = sonicReadShortFromStream(_sonic, originTmpBuffer, 4096);
       tempData = [[NSData alloc] initWithBytes:&originTmpBuffer length:new_buffer_size];
       memset(originTmpBuffer, 0, 4096);
}

3. Audio format conversion

There are many audio format conversion schemes, but they generally fall into two categories:

1. Use an existing third-party decoder, such as FFmpeg, to do the conversion. FFmpeg is free software that can record, convert, and stream audio and video in many formats. It includes libavcodec, an audio/video codec library used by many projects, and libavformat, an audio/video container format library. However, our project only deals with audio and does not integrate FFmpeg, so we set this option aside.

2. Use Core Audio, provided by iOS, to process the audio. Core Audio includes the Audio Toolbox and Audio Unit frameworks we use most often, which can handle both the format conversion and the real-time playback we need, and it is much cheaper to integrate, so we use its sub-frameworks to meet the current requirements. We may look at parsing audio with FFmpeg in more detail later.

The Core Audio framework is rich and powerful, covering almost everything related to audio processing.

As can be seen from the framework diagram above, to convert audio files we need the Audio Converter and Codec Services.

Audio conversion process:

1. Use ExtAudioFileOpenURL to open audio files such as MP3 from local disk.
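A minimal sketch of this step, assuming filePath holds the path of a local MP3 or WAV file (the variable names are illustrative):

#import <AudioToolbox/AudioToolbox.h>

NSURL *fileURL = [NSURL fileURLWithPath:filePath];   // filePath: local audio file path (assumed)
ExtAudioFileRef extAudioFile = NULL;
OSStatus status = ExtAudioFileOpenURL((__bridge CFURLRef)fileURL, &extAudioFile);
if (status != noErr) {
    NSLog(@"ExtAudioFileOpenURL failed: %d", (int)status);
}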

2. Use ExtAudioFileGetProperty to read the format of the audio on disk; the output is an AudioStreamBasicDescription structure (a sketch of this call follows the field descriptions below):

struct AudioStreamBasicDescription
{
    Float64             mSampleRate;
    AudioFormatID       mFormatID;
    AudioFormatFlags    mFormatFlags;
    UInt32              mBytesPerPacket;
    UInt32              mFramesPerPacket;
    UInt32              mBytesPerFrame;
    UInt32              mChannelsPerFrame;
    UInt32              mBitsPerChannel;
    UInt32              mReserved;
};
typedef struct AudioStreamBasicDescription  AudioStreamBasicDescription;

AudioStreamBasicDescription describes the basic information of the audio stream, including:

  • mSampleRate: the sample rate
  • mFormatID: the data format type
  • mFormatFlags: supplementary flags for the data format, such as whether the data is interleaved and whether the samples are floats. These are easy to get wrong; some helper functions can fill in the correct value.
  • mFramesPerPacket: the number of frames in a packet, 1 for PCM data. Compressed formats differ slightly; some helper functions can fill in the correct value.
  • mChannelsPerFrame: the number of channels
  • mBitsPerChannel: the bit depth of each sample. Set to 0 for compressed formats.
  • mBytesPerFrame: the size of a frame in bytes. For PCM it is mChannelsPerFrame * mBitsPerChannel / 8. Set to 0 for compressed formats.
  • mBytesPerPacket: the size of a packet in bytes; for PCM it equals mBytesPerFrame. Set to 0 for compressed formats.
  • mReserved: always 0, used for padding/alignment
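A minimal sketch of reading the source file's format, assuming extAudioFile is the handle opened in step 1:

AudioStreamBasicDescription inputFormat;
UInt32 propSize = sizeof(inputFormat);
// Ask Extended Audio File Services for the on-disk (source) data format.
OSStatus status = ExtAudioFileGetProperty(extAudioFile,
                                          kExtAudioFileProperty_FileDataFormat,
                                          &propSize,
                                          &inputFormat);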

3. After deciding on the target PCM format, use ExtAudioFileSetProperty to apply it; the value is again an AudioStreamBasicDescription:

memset(&_outputFormat, 0, sizeof(_outputFormat));
_outputFormat.mSampleRate = 16000;                                // sample rate
_outputFormat.mFormatID = kAudioFormatLinearPCM;                  // target format: linear PCM
_outputFormat.mFormatFlags = kLinearPCMFormatFlagIsSignedInteger;
_outputFormat.mBytesPerPacket = 2;
_outputFormat.mFramesPerPacket = 1;                               // frames per packet
_outputFormat.mChannelsPerFrame = 1;                              // mono
_outputFormat.mBitsPerChannel = 16;                               // 16-bit samples
_outputFormat.mBytesPerFrame = (_outputFormat.mBitsPerChannel / 8) * _outputFormat.mChannelsPerFrame;
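The structure above only describes the desired client-side (output) format; a minimal sketch of actually applying it, assuming the same extAudioFile handle as before:

// Tell Extended Audio File Services what format we want to read the data back in.
OSStatus status = ExtAudioFileSetProperty(extAudioFile,
                                          kExtAudioFileProperty_ClientDataFormat,
                                          sizeof(_outputFormat),
                                          &_outputFormat);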

4. Use AudioBufferList to store temporary audio data.

struct AudioBuffer
{
    UInt32              mNumberChannels;
    UInt32              mDataByteSize;
    void* __nullable    mData;
};

struct AudioBufferList
{
    UInt32      mNumberBuffers;
    AudioBuffer mBuffers[1]; // this is a variable length array of mNumberBuffers elements
};

An AudioBuffer is a structure for storing audio data:

  • mNumberChannels: the number of channels
  • mDataByteSize: the size of the audio data in bytes
  • mData: a pointer to the audio data buffer

We can wrap mData and mDataByteSize into an NSData; that is where we obtain the converted audio data.
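A minimal sketch of pulling converted PCM out of the file, assuming the extAudioFile handle and the _outputFormat set above; the chunk size is illustrative:

UInt32 framesPerRead = 4096;                                             // illustrative chunk size
UInt32 bytesPerBuffer = framesPerRead * _outputFormat.mBytesPerFrame;

AudioBufferList bufferList;
bufferList.mNumberBuffers = 1;
bufferList.mBuffers[0].mNumberChannels = _outputFormat.mChannelsPerFrame;
bufferList.mBuffers[0].mDataByteSize = bytesPerBuffer;
bufferList.mBuffers[0].mData = malloc(bytesPerBuffer);

UInt32 frames = framesPerRead;
OSStatus status = ExtAudioFileRead(extAudioFile, &frames, &bufferList);  // frames == 0 means end of file
if (status == noErr && frames > 0) {
    // mDataByteSize is updated to the number of bytes actually read.
    NSData *pcmData = [NSData dataWithBytes:bufferList.mBuffers[0].mData
                                     length:bufferList.mBuffers[0].mDataByteSize];
    // pcmData now holds 16 kHz mono 16-bit PCM, ready for noise reduction and speed processing.
}
free(bufferList.mBuffers[0].mData);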

4. Real-time playback

After the audio data is obtained, it has to be played. Because the conversion pipeline includes time-consuming steps such as speed adjustment and noise reduction, we need an audio cache pool as the intermediary between format conversion and playback, which in turn demands high-performance, low-latency playback. Audio Unit is exactly what we need, for two main reasons:

  • The fastest response: the loop from capture to playback can be on the order of 10 ms.
  • Dynamic configuration: AUGraph nodes can be combined dynamically to meet a variety of needs.

As you can see from the figure above, Audio Unit is the lowest-level audio framework in iOS. iOS provides audio processing plug-ins that support mixing, equalization, format conversion, and real-time input/output for recording, playback, offline rendering, and real-time conversation, such as VoIP (Voice over Internet Protocol).

The following describes the Audio Unit playback process:

1. Create the Audio Unit:

OSStatus status = noErr;
AudioComponentDescription audioDesc;
audioDesc.componentType = kAudioUnitType_Output;
audioDesc.componentSubType = kAudioUnitSubType_RemoteIO;
audioDesc.componentManufacturer = kAudioUnitManufacturer_Apple;
audioDesc.componentFlags = 0;
audioDesc.componentFlagsMask = 0;
    
AudioComponent inputComponent = AudioComponentFindNext(NULL, &audioDesc);
status = AudioComponentInstanceNew(inputComponent, &_audioUnit);

2. Set the input scope and output scope of the Audio Unit respectively

UInt32 flag = 1;
if (flag) {
    status = AudioUnitSetProperty(_audioUnit,
                                  kAudioOutputUnitProperty_EnableIO,
                                  kAudioUnitScope_Output,
                                  OUTPUT_BUS,
                                  &flag,
                                  sizeof(flag));
    if (status != noErr && self.delegateAudioPlayer && [self.delegateAudioPlayer respondsToSelector:@selector(audioPlayError)]) {
        [self.delegateAudioPlayer audioPlayError];
        return;
    }
    _audioCanPlay = status == noErr ? YES : NO;
}
AudioStreamBasicDescription outputFormat = self.config.outputFormat;
status = AudioUnitSetProperty(_audioUnit,
                              kAudioUnitProperty_StreamFormat,
                              kAudioUnitScope_Input,
                              OUTPUT_BUS,
                              &outputFormat,
                              sizeof(_config.outputFormat));

  • Scope: a programming context inside the audio unit. The concept is a bit abstract: the input scope means every element within it takes an input, the output scope means every element within it sends its output somewhere, and the global scope is used for properties that have nothing to do with input or output.

  • Element: when an element is part of the input or output scope, it is analogous to a signal bus in a physical audio device, so in the audio unit context the terms "element" and "bus" mean the same thing. Apple's documentation uses "bus" when emphasizing signal flow and "element" when emphasizing a specific functional aspect of an audio unit, such as the input and output elements of an I/O unit.

3. Register the render callback that feeds audio to the Audio Unit:

AURenderCallbackStruct playCallbackStruct;
playCallbackStruct.inputProc = PlayCallback;
playCallbackStruct.inputProcRefCon = (__bridge void *)self;
status = AudioUnitSetProperty(_audioUnit,
                              kAudioUnitProperty_SetRenderCallback,
                              kAudioUnitScope_Input,
                              OUTPUT_BUS,
                              &playCallbackStruct,
                              sizeof(playCallbackStruct));

#pragma mark - audio callback
OSStatus PlayCallback(void *inRefCon,
                      AudioUnitRenderActionFlags *ioActionFlags,
                      const AudioTimeStamp *inTimeStamp,
                      UInt32 inBusNumber,
                      UInt32 inNumberFrames,
                      AudioBufferList *ioData)

The parameters of PlayCallback are as follows (a sketch of the callback body follows the list):

  • inRefCon: the pointer passed when the callback was registered. Typically this is an instance of the current class; because the callback is a C function and cannot access the class's properties and methods directly, passing the instance lets us reach them indirectly.

  • ioActionFlags: lets the callback tell the audio unit that there is no audio to process. For example, if the app is a synthesized guitar and the user is not currently playing a note, produce silence for that callback by setting *ioActionFlags |= kAudioUnitRenderAction_OutputIsSilence; in the callback body. When generating silence you must also explicitly zero the buffers that ioData points to.

  • inTimeStamp: the time at which the callback was invoked; it can be used as a timestamp for audio synchronization. On each invocation the mSampleTime field increases by the value of inNumberFrames. If your app is a sequencer or a drum machine, for example, you can use mSampleTime to schedule sounds.

  • inBusNumber: the audio unit bus that invoked the callback, which lets you branch inside the callback on this value. In addition, when registering the callback you can specify a different inRefCon for each bus.

  • inNumberFrames: the number of audio frames requested in this callback. The data for these frames is written into ioData.

  • ioData: the actual audio buffers. When producing silence, the buffers' contents must be set to 0.
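As an illustration, here is a minimal sketch of what the callback body might look like once it pulls data from the cache pool described below; the AudioPlayer class and the readFromCachePoolMaxLength: helper are hypothetical names, not the exact project code:

static OSStatus PlayCallback(void *inRefCon,
                             AudioUnitRenderActionFlags *ioActionFlags,
                             const AudioTimeStamp *inTimeStamp,
                             UInt32 inBusNumber,
                             UInt32 inNumberFrames,
                             AudioBufferList *ioData) {
    AudioPlayer *player = (__bridge AudioPlayer *)inRefCon;      // the instance passed as inRefCon
    UInt32 bytesNeeded = ioData->mBuffers[0].mDataByteSize;

    // Pull at most bytesNeeded bytes of processed PCM out of the cache pool (hypothetical accessor).
    NSData *chunk = [player readFromCachePoolMaxLength:bytesNeeded];
    if (chunk.length > 0) {
        memcpy(ioData->mBuffers[0].mData, chunk.bytes, chunk.length);
        if (chunk.length < bytesNeeded) {
            // Not enough cached data yet: pad the rest with silence.
            memset((char *)ioData->mBuffers[0].mData + chunk.length, 0, bytesNeeded - chunk.length);
        }
    } else {
        // Cache pool is empty: output silence and say so via the action flags.
        *ioActionFlags |= kAudioUnitRenderAction_OutputIsSilence;
        memset(ioData->mBuffers[0].mData, 0, bytesNeeded);
    }
    return noErr;
}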

With this in place we would normally be able to play the audio, but as described at the start of this section, the time-consuming speed and noise reduction steps mean we need an audio cache pool to act as the intermediary between format conversion and playback.

4. Cache playback

After opening the local audio with Extended Audio File Services, we start a timer that periodically seeks into the file, reads the audio as PCM, applies noise reduction and speed processing, and stores the result in the audio cache pool. CADisplayLink is used here to control the reading and caching precisely, and a cache pool threshold is set to control when audio is read. The specific process is as follows:

It can be summarized as the following way of reading and playing audio:
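A minimal sketch of this read-and-cache loop, assuming the ExtAudioFile handle, the denoise method from section 1, and the cache pool consumed by the Audio Unit callback; the threshold value and the readNextPCMChunk / speedProcessedData helpers are illustrative rather than the exact project code:

// Start a CADisplayLink that drives the decode -> denoise -> speed-change -> cache pipeline.
- (void)startReading {
    self.displayLink = [CADisplayLink displayLinkWithTarget:self selector:@selector(fillCachePool)];
    [self.displayLink addToRunLoop:[NSRunLoop mainRunLoop] forMode:NSRunLoopCommonModes];
}

- (void)fillCachePool {
    static const NSUInteger kCachePoolThreshold = 16000 * 2 * 2;   // illustrative: about 2 s of 16 kHz mono 16-bit PCM
    if (self.cachePool.length >= kCachePoolThreshold) {
        return;                                                    // enough buffered audio, let playback drain the pool
    }
    NSData *pcm = [self readNextPCMChunk];                         // ExtAudioFileRead, as sketched above (hypothetical helper)
    if (pcm.length == 0) {
        [self.displayLink invalidate];                             // end of file
        return;
    }
    NSData *denoised = [self getOutputDenoiseData:pcm];            // noise reduction (section 1)
    NSData *speeded = [self speedProcessedData:denoised];          // Sonic speed change (section 2, hypothetical wrapper)
    [self.lock lock];
    [self.cachePool appendData:speeded];                           // the Audio Unit callback consumes from this pool
    [self.lock unlock];
}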

With this approach, we can fetch audio from the cache pool in the Audio Unit's callback and release the data from the pool after handing it to the Audio Unit.

Here, I recorded an MP3 at random and played it four times; memory growth stayed basically within the cache pool threshold, so the real-time processing and playback do not cause memory problems.

References

developer.apple.com/documentati…

juejin.cn/post/684490…

juejin.cn/post/684490…

developer.apple.com/documentati…

developer.apple.com/videos/play…