One night my girlfriend started a movie she liked, but it had no subtitles, which upset her quite a bit. Quick to look after her needs, I used Python to build software that can recognize speech and turn it into text.

I. The story

One night, my girlfriend put on a movie she liked, but it had no subtitles, which upset her quite a bit.

Quick to look after her needs, I came up with the idea of using Python to build software that could recognize speech and turn it into text.

The picture below shows the finished result. Ha, not bad, right?

If you find it interesting, please give it a thumbs-up, and I'll bring you more fun and interesting demos and implementation tutorials.

A short excerpt from the first episode of Legend of Zhen Huan:

And in action, it looks like this:

I've been short on dramas lately, so I dug out a TV series I downloaded long ago to savor again. A classic is a classic: the plot and the lines are both so charming. Wait... lines, lines... As an IT worker, I had an epiphany: with all the advances in speech recognition technology, could there be a way to save some of those great lines automatically? Maybe I could even be an amateur subtitler :P, and on that basis easily translate some of the harder lines too!

After a bit of thinking, I came up with an idea: a program that extracts the audio from a video and then sends it to an open speech recognition API to convert the speech into text. Given my pleasant experience calling Youdao Wisdom Cloud before, I decided to use it again and quickly put together this demo (please ignore the ugly interface layout, as long as it works...).

Welcome to follow me as I keep my earlier promise: finishing several articles within a month.

| No. | Estimated completion | Demo name, features & article content | Finished? | Article links |
| --- | --- | --- | --- | --- |
| 1 | September 3 | Text translation: single-text and batch translation demo | Done | CSDN · WeChat Official Account |
| 2 | September 11 | OCR demo with batch upload and recognition; in the demo you can choose different OCR types (handwriting / print / ID card / form / whole page / business card) and then call the platform's capabilities; concrete implementation steps, etc. | Done | CSDN · WeChat Official Account |
| 3 | October 27 | Speech recognition demo: upload a video, clip a segment, and run short-speech recognition on the extracted audio | | CSDN · WeChat Official Account |
| 4 | September 17 | Intelligent speech evaluation demo | | CSDN · WeChat Official Account |
| 5 | September 24 | Essay correction demo | | CSDN · WeChat Official Account |
| 6 | September 30 | Speech synthesis demo | | CSDN · WeChat Official Account |
| 7 | October 15 | Single-question photo search demo | | CSDN · WeChat Official Account |
| 8 | October 20 | Picture translation demo | | CSDN · WeChat Official Account |

II. Preparation before development

First of all, on your Youdao Wisdom Cloud personal page you need to create an instance, create an application, bind the application and the instance, and obtain the application's ID and key used to invoke the interface. The process of personal registration and application creation is described in detail in the article "Less than 100 lines of code to get Python to do OCR identification of ID cards, text and other fonts".
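For the code later on, it helps to keep the application ID and key in one shared place. A minimal sketch, assuming a config module; the module name and the placeholder values are mine, not from the article:

```python
# Hypothetical shared config module (e.g. app_config.py) holding the
# credentials obtained from the Youdao Wisdom Cloud console.
# Replace the placeholder values with your own application ID and key.
APP_KEY = 'your-application-id'
APP_SECRET = 'your-application-key'
```

Every module that talks to the API can then import these two names instead of repeating the credentials.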

III. The development process in detail

The following describes the specific code development process.

(I) Interface specification

First, let's analyze the input and output specification of the Youdao Wisdom Cloud API. According to the documentation, the interface is called as follows:

Youdao Speech Recognition API HTTPS address:

https://openapi.youdao.com/asrapi

Interface call parameters:

| Field | Type | Meaning | Required | Notes |
| --- | --- | --- | --- | --- |
| q | text | Base64-encoded string of the audio file to recognize | true | must be Base64-encoded |
| langType | text | Source language | true | see supported languages |
| appKey | text | Application ID | true | viewable in application management |
| salt | text | UUID | true | UUID |
| curtime | text | Timestamp (seconds) | true | number of seconds |
| sign | text | Signature, generated as MD5(application ID + q + salt + curtime + key) | true | MD5 value of appKey + q + salt + curtime + key |
| signType | text | Signature version | true | v2 |
| format | text | Audio file format | true | wav |
| rate | text | Sampling rate; 16000 recommended | true | 16000 |
| channel | text | Number of channels; only mono is supported | true | fixed value 1 |
| type | text | Upload type; only Base64 upload is supported | true | fixed value 1 |

q is the Base64-encoded audio file to be recognized. Note the constraint: "the uploaded audio must not exceed 120 s in length, and the file must not exceed 10 MB in size."
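As a small hedged sketch of that constraint (the function name and size check are mine, not the article's), checking the file size before Base64-encoding avoids sending a request that is doomed to fail:

```python
import base64
import os

def load_audio_base64(path, max_bytes=10 * 1024 * 1024):
    # The API caps uploads at 120 s of audio and 10 MB per file,
    # so reject oversized files before encoding.
    if os.path.getsize(path) > max_bytes:
        raise ValueError('audio file exceeds the 10 MB API limit')
    with open(path, 'rb') as f:
        return base64.b64encode(f.read()).decode('utf-8')
```

The 120 s duration limit still has to be enforced upstream (for example when clipping the video), since it cannot be read off the file size alone.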

The API's return value is simple:

| Field | Meaning |
| --- | --- |
| errorCode | Error code of the recognition result; always present. See the error-code list for details |
| result | Recognition result; present when recognition succeeds |
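On the caller's side, those two fields are enough to branch on. A minimal sketch (the helper name is mine, not from the article):

```python
def extract_text(response):
    # errorCode is always present; '0' signals success,
    # in which case 'result' holds the recognized text.
    if response.get('errorCode') == '0':
        return response.get('result')
    return None
```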

(II) Project development

This project is developed in Python 3 and consists of maindow.py, videoProcess.py and srbynetease.py:

maindow.py: the interface part, which uses Python's tkinter library to provide video file selection, time input boxes and a confirm button;

videoProcess.py: extracts the audio from the specified time interval of the video and processes the information returned by the API;

srbynetease.py: sends the processed audio to the short speech recognition API and returns the result.

1. Implementing the interface

Part of the interface code is shown below; it is relatively simple.

```python
root = tk.Tk()
root.title("netease youdao sr test")
frm = tk.Frame(root)
frm.grid(padx='50', pady='50')

btn_get_file = tk.Button(frm, text='Select video', command=get_file)
btn_get_file.grid(row=0, column=0, padx='10', pady='20')
path_text = tk.Entry(frm, width='40')
path_text.grid(row=0, column=1)

start_label = tk.Label(frm, text='Start time (s)')
start_label.grid(row=1, column=0)
start_input = tk.Entry(frm)
start_input.grid(row=1, column=1)

end_label = tk.Label(frm, text='End time (s)')
end_label.grid(row=2, column=0)
end_input = tk.Entry(frm)
end_input.grid(row=2, column=1)

sure_btn = tk.Button(frm, text='Start recognition', command=start_sr)
sure_btn.grid(row=3, column=0, columnspan=3)

root.mainloop()
```

The start_sr() event handler bound to sure_btn does some simple input checking and shows the final recognition result in a popup:

```python
def start_sr():
    print(video.video_full_path)
    if len(path_text.get()) == 0:
        sr_result = 'No file selected'
    else:
        video.start_time = int(start_input.get())
        video.end_time = int(end_input.get())
        sr_result = video.do_sr()
    tk.messagebox.showinfo("Result", sr_result)
```

2. Audio and video processing

(1) In videoProcess.py, I use Python's moviepy library to process the video: clip it to the specified start and end times, extract the audio, and convert it to Base64 as the API requires:

```python
def get_audio_base64(self):
    video_clip = VideoFileClip(self.video_full_path).subclip(self.start_time, self.end_time)
    audio = video_clip.audio
    result_path = self.video_full_path.split('.')[0] + '_clip.mp3'
    audio.write_audiofile(result_path)
    audio_base64 = base64.b64encode(open(result_path, 'rb').read()).decode('utf-8')
    return audio_base64
```

(2) The encoded audio is then passed to the wrapped Youdao Wisdom Cloud API calling method:

```python
def do_sr(self):
    audio_base64 = self.get_audio_base64()
    sr_result = srbynetease.connect(audio_base64)
    print(sr_result)
    if sr_result['errorCode'] == '0':
        return sr_result['result']
    else:
        return "Something wrong, errorCode:" + sr_result['errorCode']
```

3. Sending the data to the recognition API

The calling method wrapped in srbynetease.py is relatively simple: just "assemble" the data dict according to the API document and send it:

```python
def connect(audio_base64):
    data = {}
    curtime = str(int(time.time()))
    data['curtime'] = curtime
    salt = str(uuid.uuid1())
    signStr = APP_KEY + truncate(audio_base64) + salt + curtime + APP_SECRET
    sign = encrypt(signStr)
    data['appKey'] = APP_KEY
    data['q'] = audio_base64
    data['salt'] = salt
    data['sign'] = sign
    data['signType'] = "v2"
    data['langType'] = 'zh-CHS'
    data['rate'] = 16000
    data['format'] = 'mp3'
    data['channel'] = 1
    data['type'] = 1

    response = do_request(data)

    return json.loads(str(response.content, 'utf-8'))
```
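connect() relies on three helpers the article does not show: truncate(), encrypt(), and do_request(). A minimal sketch under stated assumptions: the truncation rule follows Youdao's published demo code, the MD5 digest follows the parameter table above, and requests is assumed as the HTTP client (since connect() reads response.content):

```python
import hashlib

API_URL = 'https://openapi.youdao.com/asrapi'

def truncate(q):
    # Shorten q before signing: the whole string if it has at most
    # 20 characters, otherwise first 10 chars + total length + last 10 chars.
    if q is None:
        return None
    size = len(q)
    return q if size <= 20 else q[0:10] + str(size) + q[size - 10:size]

def encrypt(sign_str):
    # The parameter table above describes the signature as an MD5 digest
    # of appKey + truncated q + salt + curtime + key.
    return hashlib.md5(sign_str.encode('utf-8')).hexdigest()

def do_request(data):
    # POST the assembled form data to the recognition endpoint.
    import requests  # assumed HTTP client; connect() reads response.content
    headers = {'Content-Type': 'application/x-www-form-urlencoded'}
    return requests.post(API_URL, data=data, headers=headers)
```

Note that because q can be megabytes of Base64 text, signing a truncated form keeps the signature string short while still binding it to the payload.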

IV. Results

Let's try it on a short excerpt from the first episode of Legend of Zhen Huan:

The result is decent; the small flaw in one sentence can be ignored. I didn't expect this short speech recognition API to handle both classical and modern speech: recognizing period-drama dialogue this smoothly is impressive!

V. Summary

This little experiment has opened the door to a new world. From today on, I can be an amateur subtitler who adds subtitles without typing. Next, I can try translating the recognized text into other languages.

Project address: github.com/LemonQH/SRF…