Introduction

I have been working at an AI audio company for about a year, and I am currently preparing to leave. I recently wrote an Android audio demo for a customer, which took me a whole day. It turned out fairly comprehensive, so I decided to put in the extra effort and write it up as a summary. The company is small, but it cooperates with the Institute of Acoustics of the Chinese Academy of Sciences and has its own series of speech recognition and transcription engines. As the saying goes, the sparrow may be small, but it has all the vital organs. Android audio itself is not that mysterious; the genuinely mysterious parts are handled by dedicated C++ and algorithm engineers, while I just lay the bricks.

This article mainly covers three points:

  • SpeechToText (STT): AudioRecord captures audio, which is uploaded as a local file and also streamed over a WebSocket.
  • TextToSpeech (TTS): the API returns an audio stream, which is played back with AudioTrack.
  • Speex compression

I will not discuss the underlying principles of TTS and STT here; even after staying this long I understand only a little. They involve human auditory perception, sound waves, Fourier analysis, and a series of complex functions. I dare not teach fish to swim; if you are interested, please Google it yourself.

Introducing AudioRecord

Recording with AudioRecord is an IPC process. The Java layer calls the native AudioRecord through JNI, which in turn calls AudioFlinger across processes through the IAudioRecord interface. AudioFlinger starts the recording thread and fills a shared-memory buffer with audio data collected from the recording source; the application side then copies the data from that shared buffer into its own buffer.

    public AudioRecord(int audioSource,       // sound source, e.g. MediaRecorder.AudioSource.MIC
                       int sampleRateInHz,    // sample rate
                       int channelConfig,     // channel config, e.g. AudioFormat.CHANNEL_IN_MONO
                       int audioFormat,       // quantization of the analog signal, e.g. ENCODING_PCM_16BIT
                       int bufferSizeInBytes) // buffer size, derived from sample rate, channels and format
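As a quick sanity check on how these parameters interact, here is a small Android-free sketch (the class and helper names are hypothetical, not part of the demo) computing the byte rate and frame size for the settings used later in this article (8000 Hz, mono, 16-bit PCM):

```java
public class PcmMath {
    // bytes per second = sampleRate * channels * (bitsPerSample / 8)
    static int pcmByteRate(int sampleRate, int channels, int bitsPerSample) {
        return sampleRate * channels * (bitsPerSample / 8);
    }

    // size in bytes of one frame of the given duration
    static int frameBytes(int sampleRate, int channels, int bitsPerSample, int frameMs) {
        return pcmByteRate(sampleRate, channels, bitsPerSample) * frameMs / 1000;
    }

    public static void main(String[] args) {
        System.out.println(pcmByteRate(8000, 1, 16));   // 16000 bytes per second
        System.out.println(frameBytes(8000, 1, 16, 20)); // 320 bytes per 20 ms frame
    }
}
```

Note that a 20 ms frame at these settings is 320 bytes, which is exactly the frame size the Speex encoder consumes in section 4.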

1. STT: record with AudioRecord and upload as a local file

    // Parameter initialization
    public final static int AUDIO_INPUT = MediaRecorder.AudioSource.MIC;  // audio input: microphone
    public final static int AUDIO_SAMPLE_RATE = 8000;                     // sample rate
    public final static int CHANNEL_CONFIG = AudioFormat.CHANNEL_IN_MONO; // mono
    public final static int AUDIO_FORMAT = AudioFormat.ENCODING_PCM_16BIT;
    private int bufferSizeInBytes = 0;
    private AudioRecord audioRecord;
    private volatile boolean isRecord = false; // volatile for cross-thread visibility of the recording state

// Create the AudioRecord

    private void creatAudioRecord() {
    // get the minimum buffer size in bytes
    bufferSizeInBytes = AudioRecord.getMinBufferSize(AudioFileUtils.AUDIO_SAMPLE_RATE,
            AudioFileUtils.CHANNEL_CONFIG, AudioFileUtils.AUDIO_FORMAT);
    // mono channel
    audioRecord = new AudioRecord(AudioFileUtils.AUDIO_INPUT, AudioFileUtils.AUDIO_SAMPLE_RATE,
            AudioFileUtils.CHANNEL_CONFIG, AudioFileUtils.AUDIO_FORMAT, bufferSizeInBytes);
}
// Touch listener: press and hold to record, release to stop and run STT
@Override
public boolean onTouch(View v, MotionEvent event) {

    AudioRecordUtils utils = AudioRecordUtils.getInstance();
    switch (event.getAction()) {
        case MotionEvent.ACTION_DOWN:
            utils.startRecordAndFile();
            break;
        case MotionEvent.ACTION_UP:
            utils.stopRecordAndFile();
            Log.d(TAG, "stopRecordAndFile");
            stt();
            break;
    }
    return false;
}

// Start recording
    public int startRecordAndFile() {
    Log.d("NLPService", "startRecordAndFile");

    // Check whether external storage (sdcard) is available
    if (AudioFileUtils.isSdcardExit()) {
        if (isRecord) {
            return ErrorCode.E_STATE_RECODING;
        } else {
            if (audioRecord == null) {
                creatAudioRecord();
            }
            audioRecord.startRecording();
            // set the recording state to true
            isRecord = true;
            // start the audio file writing thread
            new Thread(new AudioRecordThread()).start();
            return ErrorCode.SUCCESS;
        }

    } else {
        return ErrorCode.E_NOSDCARD;
    }

}
// Recording thread
    class AudioRecordThread implements Runnable {
    @Override
    public void run() {

        writeDateTOFile(); // write the raw PCM data to a file
        AudioFileUtils.raw2Wav(mAudioRaw, mAudioWav, bufferSizeInBytes); // add a WAV header to the raw data

    }
}
// Write raw data to the file
private void writeDateTOFile() {
    Log.d("NLPService", "writeDateTOFile");
    // byte array to hold the audio data, sized to the record buffer
    byte[] audiodata = new byte[bufferSizeInBytes];
    FileOutputStream fos = null;
    int readsize = 0;
    try {
        File file = new File(mAudioRaw);
        if (file.exists()) {
            file.delete();
        }
        fos = new FileOutputStream(file); // open a byte output stream for the raw file
    } catch (Exception e) {
        e.printStackTrace();
    }
    while (isRecord) {
        readsize = audioRecord.read(audiodata, 0, bufferSizeInBytes);
        if (readsize > 0 && fos != null) {
            try {
                // write only the bytes actually read, not the whole buffer
                fos.write(audiodata, 0, readsize);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
    try {
        if (fos != null)
            fos.close(); // close the output stream
    } catch (IOException e) {
        e.printStackTrace();
    }
}

//add wav header

    public static void raw2Wav(String inFilename, String outFilename, int bufferSizeInBytes) {
    Log.d("NLPService", "raw2Wav");
    FileInputStream in = null;
    RandomAccessFile out = null;
    byte[] data = new byte[bufferSizeInBytes];
    try {
        in = new FileInputStream(inFilename);
        out = new RandomAccessFile(outFilename, "rw");
        out.setLength(0);        // truncate any previous content
        out.write(new byte[44]); // reserve space for the 44-byte WAV header

        int len;
        while ((len = in.read(data)) != -1) {
            out.write(data, 0, len); // write only the bytes actually read
        }
        // now that the data length is known, go back and fill in the header
        fixWavHeader(out, AUDIO_SAMPLE_RATE, 1, AudioFormat.ENCODING_PCM_16BIT);
        in.close();
        out.close();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

private static void fixWavHeader(RandomAccessFile file, int rate, int channels, int format) {
    try {
        int blockAlign;
        if (format == AudioFormat.ENCODING_PCM_16BIT)
            blockAlign = channels * 2;
        else
            blockAlign = channels;

        int bitsPerSample;
        if (format == AudioFormat.ENCODING_PCM_16BIT)
            bitsPerSample = 16;
        else
            bitsPerSample = 8;

        long dataLen = file.length() - 44;

        // hard coding
        byte[] header = new byte[44];
        header[0] = 'R'; // RIFF/WAVE header
        header[1] = 'I';
        header[2] = 'F';
        header[3] = 'F';
        header[4] = (byte) ((dataLen + 36) & 0xff);
        header[5] = (byte) (((dataLen + 36) >> 8) & 0xff);
        header[6] = (byte) (((dataLen + 36) >> 16) & 0xff);
        header[7] = (byte) (((dataLen + 36) >> 24) & 0xff);
        header[8] = 'W';
        header[9] = 'A';
        header[10] = 'V';
        header[11] = 'E';
        header[12] = 'f'; // 'fmt ' chunk
        header[13] = 'm';
        header[14] = 't';
        header[15] = ' ';
        header[16] = 16; // 4 bytes: size of 'fmt ' chunk
        header[17] = 0;
        header[18] = 0;
        header[19] = 0;
        header[20] = 1; // format = 1
        header[21] = 0;
        header[22] = (byte) channels;
        header[23] = 0;
        header[24] = (byte) (rate & 0xff);
        header[25] = (byte) ((rate >> 8) & 0xff);
        header[26] = (byte) ((rate >> 16) & 0xff);
        header[27] = (byte) ((rate >> 24) & 0xff);
        header[28] = (byte) ((rate * blockAlign) & 0xff);
        header[29] = (byte) (((rate * blockAlign) >> 8) & 0xff);
        header[30] = (byte) (((rate * blockAlign) >> 16) & 0xff);
        header[31] = (byte) (((rate * blockAlign) >> 24) & 0xff);
        header[32] = (byte) (blockAlign); // block align
        header[33] = 0;
        header[34] = (byte) bitsPerSample; // bits per sample
        header[35] = 0;
        header[36] = 'd';
        header[37] = 'a';
        header[38] = 't';
        header[39] = 'a';
        header[40] = (byte) (dataLen & 0xff);
        header[41] = (byte) ((dataLen >> 8) & 0xff);
        header[42] = (byte) ((dataLen >> 16) & 0xff);
        header[43] = (byte) ((dataLen >> 24) & 0xff);

        file.seek(0);
        file.write(header, 0, 44);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
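For cross-checking the hand-rolled header above, here is a standalone sketch (class and method names are mine, not from the demo) that builds the same 44-byte PCM WAV header with `ByteBuffer` in little-endian mode, which is much harder to get wrong:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class WavHeader {
    // Build a 44-byte PCM WAV header, equivalent to fixWavHeader() above
    static byte[] build(int dataLen, int rate, int channels, int bitsPerSample) {
        int blockAlign = channels * bitsPerSample / 8;
        ByteBuffer b = ByteBuffer.allocate(44).order(ByteOrder.LITTLE_ENDIAN);
        b.put("RIFF".getBytes(StandardCharsets.US_ASCII));
        b.putInt(dataLen + 36);              // RIFF chunk size = file size - 8
        b.put("WAVE".getBytes(StandardCharsets.US_ASCII));
        b.put("fmt ".getBytes(StandardCharsets.US_ASCII));
        b.putInt(16);                        // 'fmt ' chunk size for PCM
        b.putShort((short) 1);               // audio format 1 = PCM
        b.putShort((short) channels);
        b.putInt(rate);                      // sample rate
        b.putInt(rate * blockAlign);         // byte rate
        b.putShort((short) blockAlign);      // block align
        b.putShort((short) bitsPerSample);   // bits per sample
        b.put("data".getBytes(StandardCharsets.US_ASCII));
        b.putInt(dataLen);                   // raw PCM data length
        return b.array();
    }

    public static void main(String[] args) {
        byte[] h = WavHeader.build(16000, 8000, 1, 16);
        System.out.println(h.length); // 44
    }
}
```

Each `putInt`/`putShort` lands on the same byte offsets as the manual `header[...]` assignments, so the two versions produce identical headers.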
// File upload and result callback

  public void stt() {

    File voiceFile = new File(AudioFileUtils.getWavFilePath());
    if (!voiceFile.exists()) {
        return;
    }
    RequestBody requestBody = RequestBody.create(MediaType.parse("multipart/form-data"), voiceFile);
    MultipartBody.Part file =
            MultipartBody.Part.createFormData("file", voiceFile.getName(), requestBody);


    NetRequest.sAPIClient.stt(RequestBodyUtil.getParams(), file)
            .observeOn(AndroidSchedulers.mainThread())
            .subscribe(new Action1<STT>() {
                @Override
                public void call(STT result) {
                    if (result != null && result.getCount() > 0) {
                        sttTv.setText("Result: " + result.getSegments().get(0).getContent());
                    }

                }
            });
}
// Remember to stop and release the AudioRecord


    private void stopRecordAndFile() {
    if (audioRecord != null) {
        isRecord = false; // stop the file-writing loop
        audioRecord.stop();
        audioRecord.release(); // release resources
        audioRecord = null;
    }

}

2. STT: stream AudioRecord data over WebSocket

A quick note on WebSocket, as far as I remember: it is an application-layer protocol like HTTP, but full-duplex; a plain socket, by contrast, is just an abstraction over TCP/IP, not a protocol. A WebSocket connection starts as an HTTP request that is upgraded into a persistent, long-lived connection, and that's about it.
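For the curious, the "upgrade over HTTP" part is a small handshake defined in RFC 6455: the client sends `Upgrade: websocket` plus a random `Sec-WebSocket-Key`, and the server proves it understood by replying with `Sec-WebSocket-Accept = base64(SHA-1(key + a fixed GUID))`. The class below is my own illustration of that computation, not part of the demo:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;

public class WsHandshake {
    // Sec-WebSocket-Accept = base64(SHA-1(key + magic GUID)), per RFC 6455
    static String acceptKey(String secWebSocketKey) throws Exception {
        String magic = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11";
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        byte[] digest = sha1.digest(
                (secWebSocketKey + magic).getBytes(StandardCharsets.US_ASCII));
        return Base64.getEncoder().encodeToString(digest);
    }

    public static void main(String[] args) throws Exception {
        // The example key from RFC 6455 itself
        System.out.println(acceptKey("dGhlIHNhbXBsZSBub25jZQ=="));
        // prints s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
    }
}
```

OkHttp does all of this for us in `newWebSocket()`; the sketch just shows there is no magic behind the "long connection".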

// MyWebSocketListener: the WebSocket callbacks

class MyWebSocketListener extends WebSocketListener {
    @Override
    public void onOpen(WebSocket webSocket, Response response) {
        output("onOpen: webSocket connect success");
        STTWebSocketActivity.this.webSocket = webSocket;
        startRecordAndFile();
        // On a second recording, data cannot be sent if the server has already
        // closed the WebSocket, so a new connection is created for each recording.
    }

    @Override
    public void onMessage(WebSocket webSocket, final String text) {
        runOnUiThread(new Runnable() {
            @Override
            public void run() {
                sttTv.setText("Stt result: " + text);
            }
        });
        output("onMessage1: " + text);
    }

    @Override
    public void onMessage(WebSocket webSocket, ByteString bytes) {
        output("onMessage2 byteString: " + bytes);
    }

    @Override
    public void onClosing(WebSocket webSocket, int code, String reason) {
        output("onClosing: " + code + "/" + reason);
    }

    @Override
    public void onClosed(WebSocket webSocket, int code, String reason) {
        output("onClosed: " + code + "/" + reason);
    }

    @Override
    public void onFailure(WebSocket webSocket, Throwable t, Response response) {
        output("onFailure: " + t.getMessage());
    }

    private void output(String s) {
        Log.d("NLPService", s);
    }
}

// Supplement: OkHttp creates the WebSocket and sets the listener
private void createWebSocket() {
    Request request = new Request.Builder().url(sttApi).build();
    NetRequest.getOkHttpClient().newWebSocket(request, socketListener);
}

class AudioRecordThread implements Runnable {
    @Override
    public void run() {
        // direct ByteBuffer in little-endian order (the PCM byte order)
        ByteBuffer audioBuffer = ByteBuffer.allocateDirect(bufferSizeInBytes)
                .order(ByteOrder.LITTLE_ENDIAN);
        int readSize = 0;
        Log.d(TAG, "isRecord=" + isRecord);
        while (isRecord) {
            readSize = audioRecord.read(audioBuffer, audioBuffer.capacity());
            if (readSize == AudioRecord.ERROR_INVALID_OPERATION
                    || readSize == AudioRecord.ERROR_BAD_VALUE) {
                Log.d("NLPService", "Could not read audio data.");
                break;
            }
            boolean send = webSocket.send(ByteString.of(audioBuffer));
            Log.d("NLPService", "send=" + send);
            audioBuffer.clear();
        }
        webSocket.send("close"); // agreed-upon field telling the server recording is done
    }
}

And then? Then the recognition results stream back as the data is sent. It's that simple.

An experienced reader will object: there's no encryption, and it's inefficient. To be fair, this is a transcription engine that works one sentence at a time, so the amount of data transmitted is small; the backend gods said there was no need to encrypt, so that's how I built it. Of course, encoding and transmission can also be combined.

3. TTS: AudioTrack plays the audio stream

This part is a little easier: OkHttp calls the API with the text, and the response bytes are played with AudioTrack. The response here is a raw audio stream; MediaPlayer felt like overkill for this (I haven't tried it), and MediaPlayer is itself an IPC pipeline that ultimately calls AudioTrack at the bottom. Straight to the code:

public boolean request() {
    OkHttpClient client = NetRequest.getOkHttpClient();
    Request request = new Request.Builder()
            .url(NetRequest.BASE_URL + "api/tts?text=Today is Wednesday")
            .build();
    client.newCall(request).enqueue(new Callback() {
        @Override
        public void onFailure(Call call, IOException e) {
        }

        @Override
        public void onResponse(Call call, Response response) throws IOException {
            play(response.body().bytes());
        }
    });
    return true;
}

public void play(byte[] data) {
    try {
        Log.d(TAG, "audioTrack start");
        AudioTrack audioTrack = new AudioTrack(mOutput, mSamplingRate,
                AudioFormat.CHANNEL_OUT_MONO, AudioFormat.ENCODING_PCM_16BIT,
                data.length, AudioTrack.MODE_STATIC);
        audioTrack.write(data, 0, data.length);
        audioTrack.play();
        // busy-wait until the playback head reaches the last frame
        while (audioTrack.getPlaybackHeadPosition() < (data.length / 2)) {
            Thread.yield();
        }
        audioTrack.stop();
        audioTrack.release();
    } catch (IllegalArgumentException e) {
        e.printStackTrace();
    } catch (IllegalStateException e) {
        e.printStackTrace();
    }
}
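A note on the busy-wait above: `data.length / 2` is the total number of frames, because 16-bit mono PCM uses 2 bytes per frame, and `getPlaybackHeadPosition()` is measured in frames. A tiny sketch of the math (class and helper names are mine, for illustration only):

```java
public class PlaybackMath {
    // total frames in a 16-bit mono PCM buffer (2 bytes per frame)
    static int totalFrames(int dataLenBytes) {
        return dataLenBytes / 2;
    }

    // playback duration in milliseconds at the given sample rate
    static long durationMs(int dataLenBytes, int sampleRate) {
        return 1000L * totalFrames(dataLenBytes) / sampleRate;
    }

    public static void main(String[] args) {
        System.out.println(totalFrames(16000));      // 8000 frames
        System.out.println(durationMs(16000, 8000)); // 1000 ms, i.e. one second of audio
    }
}
```

With stereo or 8-bit audio the divisor changes (bytes per frame = channels * bitsPerSample / 8), so the hard-coded `/ 2` only works for this mono 16-bit setup.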

4. Speex compression

Speex is an open-source, patent-free audio codec library written in C (it is often loosely called "encryption" because the compressed stream is unreadable without decoding, but strictly speaking it is compression). The demo ships a precompiled .so file; I personally spent a long time trying to compile it myself, fell into all kinds of pits, and never succeeded, so I could only borrow one. -_-|| The project includes a complete speexDemo; encoding and decoding both work, personally tested. I learned this part from a CSDN post and moved it over here to round things out.

public static void raw2spx(String inFileName, String outFileName) {

	FileInputStream rawFileInputStream = null;
	FileOutputStream fileOutputStream = null;
	try {
		rawFileInputStream = new FileInputStream(inFileName);
		fileOutputStream = new FileOutputStream(outFileName);
		byte[] rawbyte = new byte[320];
		byte[] encoded = new byte[160];
		// Convert the raw data into a Speex-compressed file. Speex encodes one
		// 160-sample (320-byte) frame at a time, so loop over frames.
		int readedtotal = 0;
		int size = 0;
		int encodedtotal = 0;
		while ((size = rawFileInputStream.read(rawbyte, 0, 320)) != -1) {
			readedtotal = readedtotal + size;
			short[] rawdata = ShortByteUtil.byteArray2ShortArray(rawbyte);
			int encodesize = SpeexUtil.getInstance().encode(rawdata, 0, encoded, rawdata.length);
			fileOutputStream.write(encoded, 0, encodesize);
			encodedtotal = encodedtotal + encodesize;
		}
		fileOutputStream.close();
		rawFileInputStream.close();
	} catch (Exception e) {
		e.printStackTrace();
	}

}
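`ShortByteUtil.byteArray2ShortArray` is not shown in the snippet above. A plausible implementation, assuming little-endian 16-bit PCM (which is what AudioRecord produces), could look like this; this is my own sketch, not the demo's actual code:

```java
public class ShortByteUtil {
    // Convert little-endian PCM bytes to 16-bit samples: low byte first
    static short[] byteArray2ShortArray(byte[] data) {
        short[] out = new short[data.length / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (short) ((data[2 * i] & 0xFF) | (data[2 * i + 1] << 8));
        }
        return out;
    }

    public static void main(String[] args) {
        // 0x0102 in little-endian byte order is {0x02, 0x01}
        short[] s = byteArray2ShortArray(new byte[]{0x02, 0x01, (byte) 0xFF, (byte) 0xFF});
        System.out.println(s[0]); // 258
        System.out.println(s[1]); // -1
    }
}
```

The `& 0xFF` on the low byte matters: without it, sign extension of negative bytes would corrupt every sample whose low byte is >= 0x80.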

Finally, the demo address is attached: Github.com/weiminsir/N…

Date:2018/10/17 Author:weimin