Requirements

As we all know, raw audio and video data cannot be transmitted over the network directly. To push a stream, the encoded audio and video data must be muxed into a container stream, such as FLV, MOV or ASF, in whatever format the receiver requires. This article takes ASF as an example and walks through a complete push-stream pipeline: capturing audio and video, encoding them, and muxing them synchronously into an ASF stream that can then be transmitted. For convenience, this demo writes the muxed stream to an ASF file for testing.

Note: To test, play the file recorded by the demo with ffplay from the terminal; ASF is a Windows-only format, so the macOS player cannot play it directly.


Implementation principle

  • Capture: AVCaptureSession is used to capture video frames and Audio Unit is used to capture audio frames
  • Encoding: the video data is hardware-encoded with VTCompressionSession in VideoToolbox, and the audio data is software-encoded with AudioConverter
  • Synchronization: a sync strategy is applied based on timestamps
  • Muxing: FFmpeg muxes the encoded audio and video data into the output video stream
  • What's next: the resulting video stream can be sent over the network or recorded to a file

Prerequisites

  • Basic knowledge of audio and video
  • Recommended reading: H264, H265 hardware codec basics and bitstream analysis
  • iOS video capture in practice (AVCaptureSession)
  • Audio Unit audio capture in practice
  • Video encoding in practice
  • Audio encoding in practice
  • Setting up the iOS FFmpeg environment

Code address: iOS complete push stream

Juejin address: iOS complete push stream

Jianshu address: iOS complete push stream

Blog address: iOS complete push stream


The overall architecture

On iOS we can capture video and audio frames through the system APIs. Video frames are captured with AVCaptureSession in the AVFoundation framework, which can also capture audio; however, since we want the lowest latency and the highest audio quality, we capture audio with the lowest-level audio framework, Audio Unit. The video data is encoded with a VTCompressionSessionRef from the VideoToolbox framework, and the audio data with AudioConverter. The system time at which the first I-frame is produced is taken as the base timestamp for both audio and video during capture. Finally, the encoded audio and video data are muxed with FFmpeg, here into an ASF stream as an example, and the resulting ASF stream is written to a file for testing. The same stream could be used directly for network transmission.

Process overview

Capture video

  • Create an AVCaptureSession object
  • Specify the resolution: sessionPreset/activeFormat; specify the frame rate: setActiveVideoMinFrameDuration/setActiveVideoMaxFrameDuration
  • Specify the camera position: AVCaptureDevice
  • Specify other camera attributes: exposure, focus, flash, torch, etc.
  • Add the camera data source to the session
  • Specify the pixel format for video capture (YUV, RGB, ...): kCVPixelBufferPixelFormatTypeKey
  • Add the output source to the session
  • Create a queue to receive video frames: - (void)setSampleBufferDelegate:(nullable id<AVCaptureVideoDataOutputSampleBufferDelegate>)sampleBufferDelegate queue:(nullable dispatch_queue_t)sampleBufferCallbackQueue
  • Render the captured video data to the screen: AVCaptureVideoPreviewLayer
  • Get the video frame data in the callback function: CMSampleBufferRef (a minimal setup sketch follows this list)
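
For reference, here is a minimal AVCaptureSession setup sketch covering the steps above (error handling omitted; the queue name, preset and pixel format are illustrative assumptions, not the demo's exact configuration):

#import <AVFoundation/AVFoundation.h>

- (AVCaptureSession *)makeCaptureSessionWithDelegate:(id<AVCaptureVideoDataOutputSampleBufferDelegate>)delegate {
    AVCaptureSession *session = [[AVCaptureSession alloc] init];
    session.sessionPreset = AVCaptureSessionPreset1280x720;                          // resolution

    AVCaptureDevice *camera = [AVCaptureDevice defaultDeviceWithMediaType:AVMediaTypeVideo];
    AVCaptureDeviceInput *input = [AVCaptureDeviceInput deviceInputWithDevice:camera error:nil];
    if ([session canAddInput:input]) [session addInput:input];                       // camera data source

    AVCaptureVideoDataOutput *output = [[AVCaptureVideoDataOutput alloc] init];
    output.videoSettings = @{(id)kCVPixelBufferPixelFormatTypeKey :
                                 @(kCVPixelFormatType_420YpCbCr8BiPlanarVideoRange)}; // YUV (NV12)
    [output setSampleBufferDelegate:delegate
                              queue:dispatch_queue_create("capture.video.queue", DISPATCH_QUEUE_SERIAL)];
    if ([session canAddOutput:output]) [session addOutput:output];                   // output source

    [session startRunning];
    return session;
}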

Collecting audio

  • Configure the audio format (ASBD): sample rate, channel count, bits per sample, data format flags, bytes per packet, etc.
  • Set the sampling (I/O buffer) duration: setPreferredIOBufferDuration
  • Create an Audio Unit object and specify its type: AudioComponentInstanceNew
  • Set the Audio Unit properties: enable input, disable output, ...
  • Let the unit allocate buffers for the received audio data: kAudioUnitProperty_ShouldAllocateBuffer
  • Set the callback for received data
  • Start the audio unit: AudioOutputUnitStart
  • Get the audio data in the callback function: AudioUnitRender (an ASBD sketch follows this list)
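
As a reference, a sketch of the ASBD for 16-bit, mono, 44.1 kHz interleaved PCM (the values are illustrative, not the demo's exact configuration):

#import <AudioToolbox/AudioToolbox.h>

AudioStreamBasicDescription asbd = {0};
asbd.mSampleRate       = 44100;                        // sample rate
asbd.mFormatID         = kAudioFormatLinearPCM;
asbd.mFormatFlags      = kAudioFormatFlagIsSignedInteger | kAudioFormatFlagIsPacked;
asbd.mChannelsPerFrame = 1;                            // mono
asbd.mBitsPerChannel   = 16;                           // bits per sample
asbd.mFramesPerPacket  = 1;                            // PCM: one frame per packet
asbd.mBytesPerFrame    = asbd.mChannelsPerFrame * asbd.mBitsPerChannel / 8;
asbd.mBytesPerPacket   = asbd.mBytesPerFrame * asbd.mFramesPerPacket;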

Encoding video data

  • Specify the encoder width/height, codec type and callback, and create the context object: VTCompressionSessionCreate
  • Set the encoder attributes: number of cached frames, frame rate, average bit rate, maximum bit rate, real-time encoding, whether to allow frame reordering, configuration information, encoding mode, I-frame interval, etc.
  • Prepare to encode: VTCompressionSessionPrepareToEncodeFrames
  • Start encoding: VTCompressionSessionEncodeFrame
  • Retrieve the encoded data in the callback function: CMBlockBufferRef
  • Assemble the bitstream according to the target container format; ASF needs Annex B, so we assemble the SPS, PPS and start codes ourselves (a session-creation sketch follows this list)
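
A minimal sketch of creating and configuring the compression session (H.264, 1280x720; the callback name and property values are illustrative assumptions, not the demo's exact code):

#import <VideoToolbox/VideoToolbox.h>

VTCompressionSessionRef session = NULL;
VTCompressionSessionCreate(kCFAllocatorDefault, 1280, 720,
                           kCMVideoCodecType_H264,
                           NULL, NULL, NULL,
                           compressionOutputCallback,      // your VTCompressionOutputCallback
                           (__bridge void *)self,          // passed back to the callback
                           &session);

VTSessionSetProperty(session, kVTCompressionPropertyKey_RealTime, kCFBooleanTrue);
VTSessionSetProperty(session, kVTCompressionPropertyKey_AllowFrameReordering, kCFBooleanFalse);  // no B-frames
VTSessionSetProperty(session, kVTCompressionPropertyKey_AverageBitRate, (__bridge CFTypeRef)@(1024 * 1024));
VTSessionSetProperty(session, kVTCompressionPropertyKey_MaxKeyFrameIntervalDuration, (__bridge CFTypeRef)@(2)); // I-frame interval in seconds
VTCompressionSessionPrepareToEncodeFrames(session);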

Encoding audio data

  • Provide the ASBDs of the raw data type and the encoded data type
  • Specify the encoder type: kAudioEncoderComponentType
  • Create the encoder: AudioConverterNewSpecific
  • Set the encoder attributes: bit rate, encoding quality, etc.
  • Feed 1024 samples of raw PCM data into the encoder
  • Start encoding: AudioConverterFillComplexBuffer
  • Obtain the encoded AAC data (an encoder-creation sketch follows this list)
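
A sketch of creating the AAC encoder with AudioConverterNewSpecific (the helper name and bit rate are illustrative assumptions; error handling omitted):

#import <AudioToolbox/AudioToolbox.h>

static AudioConverterRef CreateAACEncoder(AudioStreamBasicDescription inPCMDesc) {
    AudioStreamBasicDescription outDesc = {0};
    outDesc.mSampleRate       = inPCMDesc.mSampleRate;
    outDesc.mFormatID         = kAudioFormatMPEG4AAC;
    outDesc.mChannelsPerFrame = inPCMDesc.mChannelsPerFrame;
    outDesc.mFramesPerPacket  = 1024;                           // AAC works on 1024-sample packets

    AudioClassDescription desc = {
        .mType         = kAudioEncoderComponentType,
        .mSubType      = kAudioFormatMPEG4AAC,
        .mManufacturer = kAppleSoftwareAudioCodecManufacturer,  // or kAppleHardwareAudioCodecManufacturer
    };

    AudioConverterRef converter = NULL;
    AudioConverterNewSpecific(&inPCMDesc, &outDesc, 1, &desc, &converter);

    UInt32 bitRate = 64000;                                     // encoder bit rate
    AudioConverterSetProperty(converter, kAudioConverterEncodeBitRate, sizeof(bitRate), &bitRate);
    return converter;
}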

Audio and video synchronization

The system time of the first encoded video frame is taken as the base timestamp for both audio and video; the timestamp of each captured audio or video frame then has this base subtracted from it to obtain its stream timestamp. There are two synchronization strategies. The first uses the audio timestamp as the reference: when the streams drift apart, the video timestamps are adjusted to catch up with the audio, so the picture may appear to fast-forward or rewind. The second uses the video timestamp as the reference: the audio timestamps are adjusted to catch up with the video, which can make the sound harsh, so it is not recommended. The first scheme is therefore generally used, estimating the timestamp of the next video frame to check whether it falls outside the allowed sync range.
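
A minimal sketch of this normalization (the base-time variable and millisecond conversion are illustrative, not the demo's exact code):

#import <CoreMedia/CoreMedia.h>

static int64_t g_baseTimeMs = -1;

// Both audio and video frames pass their capture time through this before muxing.
static int64_t NormalizedTimestampMs(CMTime presentationTime) {
    int64_t nowMs = (int64_t)(CMTimeGetSeconds(presentationTime) * 1000.0);
    if (g_baseTimeMs < 0) {
        g_baseTimeMs = nowMs;        // the first encoded video frame defines time zero
    }
    return nowMs - g_baseTimeMs;     // subtract the common base from every timestamp
}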

Muxing the data stream with FFmpeg

  • Initialize the FFmpeg objects: AVFormatContext, AVOutputFormat, AVStream...
  • Create the context object (AVFormatContext): avformat_alloc_context
  • Find the encoders for the data types (AVCodec): avcodec_find_encoder; video: AV_CODEC_ID_H264/AV_CODEC_ID_HEVC, audio: AV_CODEC_ID_AAC
  • Create the streams (AVStream): avformat_new_stream
  • Fill in the parameters of each stream: for video the data format, width, height, frame rate, bit rate, time base, extra data; for audio the sample rate, channel count, bits per sample, and so on
  • Set the audio and video encoder IDs in the context and in the output format: video_codec_id, audio_codec_id
  • Generate the stream header data: once the audio and video parameters fill the context object, FFmpeg can produce the header for this container type. The header carries the information needed to decode the audio and video data, so it must be synthesized correctly: avformat_write_header
  • Load the audio and video data into a dynamic array
  • Mux the audio and video data: another thread takes the audio and video data out of the dynamic array and muxes them in sync by comparing timestamps
  • Load each audio/video frame into an AVPacket (sketched after this list)
  • Produce the muxed data: av_write_frame
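
A sketch of the "load into AVPacket" step referenced above (the function and parameter names are illustrative; the demo's addVideoPacket/addAudioPacket do the equivalent):

#include <libavformat/avformat.h>

// Wrap one encoded frame (Annex B video or raw AAC audio) in an AVPacket and mux it.
static int MuxEncodedFrame(AVFormatContext *ctx, uint8_t *data, int size,
                           int streamIndex, int64_t ptsMs, int isKeyFrame) {
    AVPacket packet;
    av_init_packet(&packet);
    packet.data         = data;
    packet.size         = size;
    packet.stream_index = streamIndex;                     // 0 = video, 1 = audio, as created above
    packet.pts          = ptsMs;                           // stream time_base is {1, 1000}, i.e. milliseconds
    packet.dts          = packet.pts;                      // no B-frames, so dts == pts
    packet.flags        = isKeyFrame ? AV_PKT_FLAG_KEY : 0;
    return av_write_frame(ctx, &packet);
}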

File structure

Quick start

  • Initialize related modules
- (void)viewDidLoad {
    [super viewDidLoad];
    // Do any additional setup after loading the view.
    
    [self configureCamera];
    [self configureAudioCapture];
    [self configureAudioEncoder];
    [self configurevideoEncoder];
    [self configureAVMuxHandler];
    [self configureAVRecorder];
}
  • The raw YUV data is sent to the encoder in the camera callback
- (void)xdxCaptureOutput:(AVCaptureOutput *)output didOutputSampleBuffer:(CMSampleBufferRef)sampleBuffer fromConnection:(AVCaptureConnection *)connection {
    if ([output isKindOfClass:[AVCaptureVideoDataOutput class]] == YES) {
        if (self.videoEncoder) {
            [self.videoEncoder startEncodeDataWithBuffer:sampleBuffer isNeedFreeBuffer:NO];
        }
    }
}
  • The encoded video data is received via the callback function and sent to the composite stream class.
#pragma mark Video Encoder
- (void)receiveVideoEncoderData:(XDXVideEncoderDataRef)dataRef {
    [self.muxHandler addVideoData:dataRef->data size:(int)dataRef->size timestamp:dataRef->timestamp isKeyFrame:dataRef->isKeyFrame isExtraData:dataRef->isExtraData videoFormat:XDXMuxVideoFormatH264];
}

  • The audio data is received and encoded in the collection audio callback, and the encoded data is eventually fed into the composite stream class as well
#pragma mark Audio Capture and Audio Encode
- (void)receiveAudioDataByDevice:(XDXCaptureAudioDataRef)audioDataRef {
    [self.audioEncoder encodeAudioWithSourceBuffer:audioDataRef->data
                                  sourceBufferSize:audioDataRef->size
                                               pts:audioDataRef->pts
                                   completeHandler:^(XDXAudioEncderDataRef dataRef) {
                                       if (dataRef->size > 10) {
                                           [self.muxHandler addAudioData:(uint8_t *)dataRef->data
                                                                    size:dataRef->size
                                                              channelNum:1
                                                              sampleRate:44100
                                                               timestamp:dataRef->pts];                                           
                                       }
                                       free(dataRef->data);
                                   }];
}
  • When recording starts, the stream header is written first; the muxed data is then received through the callback and written to the file.
#pragma mark Mux
- (IBAction)startRecordBtnDidClicked:(id)sender {
    int size = 0;
    char *data = (char *)[self.muxHandler getAVStreamHeadWithSize:&size];
    [self.recorder startRecordWithIsHead:YES data:data size:size];
    self.isRecording = YES;
}


- (void)receiveAVStreamWithIsHead:(BOOL)isHead data:(uint8_t *)data size:(int)size {
    if (isHead) {
        return;
    }
    
    if (self.isRecording) {
        [self.recorder startRecordWithIsHead:NO data:(char *)data size:size];
    }
}

The specific implementation

The audio and video capture and encoding modules in this example have been covered in detail in previous articles and are not repeated here; if you need them, refer to the prerequisites above. Only the muxing flow is described below.

1. Initialize FFmpeg related objects.

  • AVFormatContext: the context object that manages the muxing
  • AVOutputFormat: the output container format, here an ASF data stream
  • AVStream: the audio and video streams and their detailed parameters
- (void)configureFFmpegWithFormat:(const char *)format {
    if (m_outputContext != NULL) {
        av_free(m_outputContext);
        m_outputContext = NULL;
    }
    
    m_outputContext = avformat_alloc_context();
    m_outputFormat  = av_guess_format(format, NULL, NULL);
    
    m_outputContext->oformat    = m_outputFormat;
    m_outputFormat->audio_codec = AV_CODEC_ID_NONE;
    m_outputFormat->video_codec = AV_CODEC_ID_NONE;
    m_outputContext->nb_streams = 0;
    
    m_video_stream     = avformat_new_stream(m_outputContext, NULL);
    m_video_stream->id = 0;
    m_audio_stream     = avformat_new_stream(m_outputContext, NULL);
    m_audio_stream->id = 1;
    
    log4cplus_info(kModuleName, "configure ffmpeg finish.");
}


2. Configure video stream details

Set the detailed information of the video stream for this encoding: encoder type, configuration information, raw video pixel format, width and height, bit rate, frame rate, time base, extra data, etc.

The most important part here is the extra data: the correct header can only be generated from it. The ASF stream needs Annex B format data, but the video data captured on Apple platforms is in AVCC format, so it has already been converted to Annex B in the encoding module and passed in through the parameters; here it can be used directly. For the difference between the two formats, refer to the bitstream introduction article in the prerequisites.
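
For reference, a minimal sketch of the AVCC-to-Annex B conversion (it assumes 4-byte NALU length prefixes and that SPS/PPS insertion is handled separately; the demo performs this conversion inside its encoder module):

#include <stdint.h>
#include <string.h>

// Replace each 4-byte AVCC length prefix with an Annex B start code.
// annexb must be at least `size` bytes; SPS/PPS insertion is done elsewhere.
static void AVCCToAnnexB(const uint8_t *avcc, int size, uint8_t *annexb) {
    static const uint8_t startCode[4] = {0x00, 0x00, 0x00, 0x01};
    int offset = 0;
    while (offset + 4 <= size) {
        uint32_t naluLen = ((uint32_t)avcc[offset]     << 24) |
                           ((uint32_t)avcc[offset + 1] << 16) |
                           ((uint32_t)avcc[offset + 2] << 8)  |
                            (uint32_t)avcc[offset + 3];
        memcpy(annexb + offset, startCode, 4);                     // start code instead of length
        memcpy(annexb + offset + 4, avcc + offset + 4, naluLen);   // copy the NALU payload
        offset += 4 + naluLen;
    }
}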

- (void)configureVideoStreamWithVideoFormat:(XDXMuxVideoFormat)videoFormat extraData:(uint8_t *)extraData extraDataSize:(int)extraDataSize {
    if (m_outputContext == NULL) {
        log4cplus_error(kModuleName, "%s: m_outputContext is null",__func__);
        return;
    }
    
    if(m_outputFormat == NULL){
        log4cplus_error(kModuleName, "%s: m_outputFormat is null",__func__);
        return;
    }

    AVFormatContext *formatContext = avformat_alloc_context();
    AVStream *stream = NULL;
    if(XDXMuxVideoFormatH264 == videoFormat) {
        AVCodec *codec = avcodec_find_encoder(AV_CODEC_ID_H264);
        stream = avformat_new_stream(formatContext, codec);
        stream->codecpar->codec_id = AV_CODEC_ID_H264;
    }else if(XDXMuxVideoFormatH265 == videoFormat) {
        AVCodec *codec = avcodec_find_encoder(AV_CODEC_ID_HEVC);
        stream = avformat_new_stream(formatContext, codec);
        stream->codecpar->codec_tag      = MKTAG('h', 'e', 'v', 'c');
        stream->codecpar->profile        = FF_PROFILE_HEVC_MAIN;
        stream->codecpar->format         = AV_PIX_FMT_YUV420P;
        stream->codecpar->codec_id       = AV_CODEC_ID_HEVC;
    }
    
    stream->codecpar->format             = AV_PIX_FMT_YUVJ420P;
    stream->codecpar->codec_type         = AVMEDIA_TYPE_VIDEO;
    stream->codecpar->width              = 1280;
    stream->codecpar->height             = 720;
    stream->codecpar->bit_rate           = 1024*1024;
    stream->time_base.den                = 1000;
    stream->time_base.num                = 1;
    stream->time_base                    = (AVRational){1, 1000};
    stream->codec->flags                |= AV_CODEC_FLAG_GLOBAL_HEADER;
    
    memcpy(m_video_stream, stream, sizeof(AVStream));
    
    if(extraData) {
        int newExtraDataSize = extraDataSize + AV_INPUT_BUFFER_PADDING_SIZE;
        m_video_stream->codecpar->extradata_size = extraDataSize;
        m_video_stream->codecpar->extradata      = (uint8_t *)av_mallocz(newExtraDataSize);
        memcpy(m_video_stream->codecpar->extradata, extraData, extraDataSize);
    }
    
    av_free(stream);

    m_outputContext->video_codec_id = m_video_stream->codecpar->codec_id;
    m_outputFormat->video_codec     = m_video_stream->codecpar->codec_id;
    
    self.isReadyForVideo = YES;
    
    [self productStreamHead];
}

3. Configure details about audio streams

First, the encoder and the stream object are created according to the encoded audio type. Then the audio stream details are configured: compressed data format, sample rate, channel count, bit rate, extra data, and so on. Note that the extra data is needed so that the player can decode correctly when the stream is saved as an MP4 file; see audio extra data 1 and audio extra data 2.

- (void)configureAudioStreamWithChannelNum:(int)channelNum sampleRate:(int)sampleRate {
    AVFormatContext *formatContext = avformat_alloc_context();
    AVCodec *codec   = avcodec_find_encoder(AV_CODEC_ID_AAC);
    AVStream *stream = avformat_new_stream(formatContext, codec);
    
    stream->index         = 1;
    stream->id            = 1;
    stream->duration      = 0;
    stream->time_base.num = 1;
    stream->time_base.den = 1000;
    stream->start_time    = 0;
    stream->priv_data     = NULL;
    
    stream->codecpar->codec_type     = AVMEDIA_TYPE_AUDIO;
    stream->codecpar->codec_id       = AV_CODEC_ID_AAC;
    stream->codecpar->format         = AV_SAMPLE_FMT_S16;
    stream->codecpar->sample_rate    = sampleRate;
    stream->codecpar->channels       = channelNum;
    stream->codecpar->bit_rate       = 0;
    stream->codecpar->extradata_size = 2;
    stream->codecpar->extradata      = (uint8_t *)malloc(2);
    stream->time_base.den            = 25;
    stream->time_base.num            = 1;
    
    /*
     * Why we put extra data here for audio: when saved to an MP4 file, the player can not decode it correctly otherwise.
     * http://ffmpeg-users.933282.n4.nabble.com/AAC-decoder-td1013071.html
     * http://ffmpeg.org/doxygen/trunk/mpeg4audio_8c.html#aa654ec3126f37f3b8faceae3b92df50e
     * extra data has 16 bits:
     * Audio object type - normally 5 bits, but 11 bits if AOT_ESCAPE
     * Sampling index - 4 bits
     * if (Sampling index == 15)
     *     Sample rate - 24 bits
     * Channel configuration - 4 bits
     * last reserved - 3 bits
     * for example: "Low Complexity, sampling frequency 44100Hz, 1 channel mono":
     * AOT_LC  == 2 -> 00010
     * 44.1kHz == 4 -> 0100      48kHz == 3 -> 0011
     * mono    == 1 -> 0001
     * so extra data: 00010 0100 0001 000 -> 0x12 0x08
     *                00010 0011 0001 000 -> 0x11 0x88
     */
    
    if (stream->codecpar->sample_rate == 44100) {
        stream->codecpar->extradata[0] = 0x12;
        //iRig mic HD have two chanel 0x11
        if(channelNum == 1)
            stream->codecpar->extradata[1] = 0x8;
        else
            stream->codecpar->extradata[1] = 0x10;
    }else if (stream->codecpar->sample_rate == 48000) {
        stream->codecpar->extradata[0] = 0x11;
        //iRig mic HD have two chanel 0x11
        if(channelNum == 1)
            stream->codecpar->extradata[1] = 0x88;
        else
            stream->codecpar->extradata[1] = 0x90;
    }else if (stream->codecpar->sample_rate == 32000){
        stream->codecpar->extradata[0] = 0x12;
        if (channelNum == 1)
            stream->codecpar->extradata[1] = 0x88;
        else
            stream->codecpar->extradata[1] = 0x90;
    }
    else if (stream->codecpar->sample_rate == 16000){
        stream->codecpar->extradata[0] = 0x14;
        if (channelNum == 1)
            stream->codecpar->extradata[1] = 0x8;
        else
            stream->codecpar->extradata[1] = 0x10;
    }else if(stream->codecpar->sample_rate == 8000){
        stream->codecpar->extradata[0] = 0x15;
        if (channelNum == 1)
            stream->codecpar->extradata[1] = 0x88;
        else
            stream->codecpar->extradata[1] = 0x90;
    }
    
    stream->codec->flags |= AV_CODEC_FLAG_GLOBAL_HEADER;
    
    memcpy(m_audio_stream, stream, sizeof(AVStream));
    
    av_free(stream);
    
    m_outputContext->audio_codec_id = stream->codecpar->codec_id;
    m_outputFormat->audio_codec     = stream->codecpar->codec_id;
    
    self.isReadyForAudio = YES;

    [self productStreamHead];
}


4. Generate the stream header data

After steps 2 and 3 are configured, we inject the audio and video streams into the context object and into the output format it holds, then generate the header data with avformat_write_header.

- (void)productStreamHead {
    log4cplus_debug("record"."%s,line:%d",__func__,__LINE__);
    
    if (m_outputFormat->video_codec == AV_CODEC_ID_NONE) {
        log4cplus_error(kModuleName, "%s: video codec is NULL.",__func__);
        return;
    }
    
    if(m_outputFormat->audio_codec == AV_CODEC_ID_NONE) {
        log4cplus_error(kModuleName, "%s: audio codec is NULL.",__func__);
        return;
    }
    
    /* prepare header and save header data in a stream */
    if (avio_open_dyn_buf(&m_outputContext->pb) < 0) {
        avio_close_dyn_buf(m_outputContext->pb, NULL);
        log4cplus_error(kModuleName, "%s: AVFormat_HTTP_FF_OPEN_DYURL_ERROR.",__func__);
        return;
    }
        
    /*
     * HACK to avoid mpeg ps muxer to spit many underflow errors
     * Default value from FFmpeg
     * Try to set it using the configuration option */
    m_outputContext->max_delay = (int)(0.7 * AV_TIME_BASE);
    
    int result = avformat_write_header(m_outputContext, NULL);
    if (result < 0) {
        log4cplus_error(kModuleName, "%s: Error writing output header, res:%d",__func__,result);
        return;
    }
        
    uint8_t * output = NULL;
    int len = avio_close_dyn_buf(m_outputContext->pb, (uint8_t **)(&output));
    if (len > 0 && output != NULL) {
        self.isReadyForHead = YES;
        
        if (m_avhead_data) {
            free(m_avhead_data);
        }
        m_avhead_data_size = len;
        m_avhead_data = (uint8_t *)malloc(len);
        memcpy(m_avhead_data, output, len);
        
        if ([self.delegate respondsToSelector:@selector(receiveAVStreamWithIsHead:data:size:)]) {
            [self.delegate receiveAVStreamWithIsHead:YES data:output size:len];
        }
        
        log4cplus_error(kModuleName, "%s: create head length = %d",__func__, len);
        
        // Free the dynamic buffer only after its contents have been copied and delivered.
        av_free(output);
    } else {
        self.isReadyForHead = NO;
        log4cplus_error(kModuleName, "%s: product stream header failed.",__func__);
    }
}

5. Load the incoming audio and video data into an array

This array is a lightweight data structure for caching, implemented by wrapping a C++ vector.

6. Mux the audio and video data

A dedicated thread muxes the audio and video data. The muxing strategy is to take whichever frame, audio or video, has the smaller timestamp and write it first. Since the overall drift between audio and video is small, ideally this alternates roughly one video frame, then one audio frame; because audio is sampled faster, there may occasionally be one or two extra audio frames in a row. If, for some reason, the audio and video fall out of sync, the thread waits until the timestamps are resynchronized before continuing to mux.

    int err = pthread_create(&m_muxThread, NULL, MuxAVPacket, (__bridge_retained void *)self);
    if (err != 0) {
        log4cplus_error(kModuleName, "%s: create thread failed: %s",__func__, strerror(err));
    }
    
void * MuxAVPacket(void *arg) {
    pthread_setname_np("XDX_MUX_THREAD");
    XDXAVStreamMuxHandler *instance = (__bridge_transfer XDXAVStreamMuxHandler *)arg;
    if (instance != nil) {
        [instance dispatchAVData];
    }
    return NULL;
}

#pragma mark Mux
- (void)dispatchAVData {
    XDXMuxMediaList audioPack;
    XDXMuxMediaList videoPack;
    
    memset(&audioPack, 0, sizeof(XDXMuxMediaList));
    memset(&videoPack, 0, sizeof(XDXMuxMediaList));
    
    [m_AudioListPack reset];
    [m_VideoListPack reset];

    while (true) {
        int videoCount = [m_VideoListPack count];
        int audioCount = [m_AudioListPack count];
        if(videoCount == 0 || audioCount == 0) {
            usleep(5*1000);
            log4cplus_debug(kModuleName, "%s: Mux dispatch list: v:%d, a:%d",__func__,videoCount, audioCount);
            continue;
        }
        
        if(audioPack.timeStamp == 0) {
            [m_AudioListPack popData:&audioPack];
        }
        
        if(videoPack.timeStamp == 0) {
            [m_VideoListPack popData:&videoPack];
        }
        
        if(audioPack.timeStamp >= videoPack.timeStamp) {
            log4cplus_debug(kModuleName, "%s: Mux dispatch input video time stamp = %llu",__func__,videoPack.timeStamp);
            
            if (videoPack.data != NULL && videoPack.data->data != NULL) {
                [self addVideoPacket:videoPack.data
                           timestamp:videoPack.timeStamp
                 extraDataHasChanged:videoPack.extraDataHasChanged];
                av_free(videoPack.data->data);
                av_free(videoPack.data);
            } else {
                log4cplus_error(kModuleName, "%s: Mux Video AVPacket data abnormal",__func__);
            }
            videoPack.timeStamp = 0;
        }else {
            log4cplus_debug(kModuleName, "%s: Mux dispatch input audio time stamp = %llu",__func__,audioPack.timeStamp);
            
            if (audioPack.data != NULL && audioPack.data->data != NULL) {
                [self addAudioPacket:audioPack.data timestamp:audioPack.timeStamp];
                av_free(audioPack.data->data);
                av_free(audioPack.data);
            } else {
                log4cplus_error(kModuleName, "%s: Mux audio AVPacket data abnormal",__func__);
            }
            audioPack.timeStamp = 0;
        }
    }
}

7. Retrieve the muxed video stream

The muxed data is produced by av_write_frame and retrieved from the dynamic buffer.

- (void)productAVDataPacket:(AVPacket *)packet extraDataHasChanged:(BOOL)extraDataHasChanged {
    BOOL    isVideoIFrame = NO;
    uint8_t *output       = NULL;
    int     len           = 0;
    
    if (avio_open_dyn_buf(&m_outputContext->pb) < 0) {
        return;
    }
    
    if (packet->stream_index == 0 && packet->flags != 0) {
        isVideoIFrame = YES;
    }
    
    if (av_write_frame(m_outputContext, packet) < 0) {
        avio_close_dyn_buf(m_outputContext->pb, (uint8_t **)(&output));
        if (output != NULL) free(output);
        log4cplus_error(kModuleName, "%s: Error writing output data",__func__);
        return;
    }
    
    
    len = avio_close_dyn_buf(m_outputContext->pb, (uint8_t **)(&output));
    
    if(len == 0 || output == NULL) {
        log4cplus_debug(kModuleName, "%s: mux len:%d or data abnormal",__func__,len);
        if (output != NULL) av_free(output);
        return;
    }
        
    if ([self.delegate respondsToSelector:@selector(receiveAVStreamWithIsHead:data:size:)]) {
        [self.delegate receiveAVStreamWithIsHead:NO data:output size:len];
    }
    
    if (output != NULL) av_free(output);
}