1. Introduction to Digital Humans

A virtual digital human is a comprehensive multi-modal AI capability that combines computer vision, emotion generation, voice cloning, semantic understanding, and other AI technologies. It is widely used in scenarios such as media news anchoring, financial customer service, and virtual games.

2. Applications of Digital Humans in the Industry

3. HMS ML Kit Digital Human

HMS ML Kit Digital Human is a newly launched comprehensive multi-modal AI capability built on Huawei's core AI technologies, such as image processing, speech synthesis, voice cloning, and semantic understanding. It provides education, news, and multimedia production enterprises with a high-quality, low-cost, and innovative content creation model. Compared with other digital human vendors, HMS ML Kit Digital Human has clear advantages:

Ultra-high-definition 4K cinematic effects

  • Supports large-screen display; texture details across the whole body remain equally sharp

  • The generated figure blends seamlessly with the real background image, with no visible fusion traces even at HD resolution

  • Lip details and lipstick reflections with clear texture

  • Teeth are clearly visible, with clear and realistic seam texture between them

High-fidelity compositing

  • Realistic reflections on teeth (non-textured), lips, and even lipstick.

  • Faithfully restores facial lighting, contrast, shadows, dimples, and other details.

  • The skin texture around the mouth connects seamlessly with the real texture.

  • Compared with 3D anchors, there is no animation stiffness.

HMS ML Kit Digital Human video demonstration:

As you can see from the image above, the HMS ML Kit Digital Human not only articulates clearly but also handles fine details well: lip details, lipstick reflections, realistic facial movement during pronunciation, and detailed facial lighting effects.

4. HMS ML Kit Digital Human Service Integration

4.1 Service Integration Process

4.1.1 Submitting the Text to Be Generated

Call the [custom text-to-virtual-digital-human video interface] and pass the configuration (config) and text (data) to the back end through this interface for processing. The back end first validates the character length of the submitted data: Chinese text must not exceed 1,000 characters, and English text must not exceed 3,000 characters or 3,000 words. It also performs a non-null check on the submitted config. The config and data are then submitted, and the text is converted into an audio file.
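
The validation rules above can be sketched as a small client-side pre-check. This is a hypothetical helper, not part of the HMS SDK; the field names (config, data) and the exact limits follow the description above:

```python
# Hypothetical client-side pre-check mirroring the back end's validation
# rules described above. Limits: 1,000 characters for Chinese text,
# 3,000 characters / 3,000 words for English text; config must be non-null.

def validate_submission(config: dict, data: str) -> None:
    """Raise ValueError if the submission would be rejected by the back end."""
    if not config:
        raise ValueError("config must not be null or empty")
    # Treat text containing any CJK character as Chinese text.
    has_chinese = any('\u4e00' <= ch <= '\u9fff' for ch in data)
    if has_chinese:
        if len(data) > 1000:
            raise ValueError("Chinese text must not exceed 1000 characters")
    else:
        if len(data) > 3000:
            raise ValueError("English text must not exceed 3000 characters")
        if len(data.split()) > 3000:
            raise ValueError("English text must not exceed 3000 words")

validate_submission({"speed": 1.0}, "Hello, world")  # passes silently
```

A caller would run this before submitting, so obviously invalid requests never reach the back end.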

4.1.2 Asynchronous Scheduled Task

A scheduled task runs asynchronously to process the submitted data. It calls the provided algorithm to generate the digital human video file and merges it with the audio file produced by TTS in the previous step.
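
The pipeline above can be sketched with a worker thread and a task queue. The three helper functions are stand-ins for the real TTS, video-generation, and muxing steps, which are not public:

```python
import queue
import threading

# Minimal sketch of the asynchronous pipeline described above. The three
# helpers are stubs standing in for the real TTS / video-generation /
# merging steps.

def text_to_audio(text):           # TTS step (stub)
    return f"audio({text})"

def generate_anchor_video(text):   # algorithm-model step (stub)
    return f"video({text})"

def merge(audio, video):           # combine audio and video (stub)
    return f"merged({audio}+{video})"

tasks = queue.Queue()
results = {}

def worker():
    while True:
        task_id, text = tasks.get()
        if task_id is None:        # sentinel: stop the worker
            break
        audio = text_to_audio(text)
        video = generate_anchor_video(text)
        results[task_id] = merge(audio, video)
        tasks.task_done()

t = threading.Thread(target=worker, daemon=True)
t.start()
tasks.put(("t1", "Hello"))
tasks.join()                       # wait until the submitted task finishes
tasks.put((None, None))
print(results["t1"])               # merged(audio(Hello)+video(Hello))
```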

4.1.3 Querying Whether the Text Conversion Succeeded

Call the [text-to-virtual-digital-human video result query interface] to query in real time whether the asynchronous text-to-video conversion has completed. If it has, the interface returns a link to the generated video.
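
A client would typically poll this interface until the conversion finishes. The sketch below assumes a response shape with `status` and `videoUrl` fields; these names are illustrative, not the documented schema:

```python
import time

# Hypothetical polling loop for the result-query interface. The response
# shape ("status" / "videoUrl" fields) is an assumption for illustration.

def extract_video_url(response: dict):
    """Return the video link if conversion finished, else None."""
    if response.get("status") == "SUCCESS":
        return response.get("videoUrl")
    return None

def poll(query_fn, text_id, interval=0.01, max_tries=10):
    """Repeatedly query until a video URL is available or retries run out."""
    for _ in range(max_tries):
        url = extract_video_url(query_fn(text_id))
        if url:
            return url
        time.sleep(interval)
    return None

# Simulated back end: the video becomes ready on the third query.
calls = {"n": 0}
def fake_query(text_id):
    calls["n"] += 1
    if calls["n"] < 3:
        return {"status": "RUNNING"}
    return {"status": "SUCCESS", "videoUrl": "https://example.com/v.mp4"}

print(poll(fake_query, "t1"))  # https://example.com/v.mp4
```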

4.1.4 Accessing the Video File via the Video Link

The generated video file can be accessed through the video link returned by the [text-to-virtual-digital-human video result query interface].

4.2 Main Interfaces for Service Integration

4.2.1 Custom Text-to-Video Submission Interface

URL: http://10.33.219.58:8888/v1/vup/text2vedio/submit

Request parameters:

Main functions: submits text to the text-to-virtual-digital-human video interface. This is an asynchronous interface: in the current version the conversion takes some time and is performed offline, so the final result must be retrieved through the [text-to-virtual-digital-human video query interface]. If the submitted text has already been synthesized, the play URL is returned directly.

Main logic: based on the text data submitted from the front-end page and the settings provided in config, the text is first converted into an audio file. A multithreaded asynchronous task then generates the anchor video file using the provided algorithm model and merges the video and audio files into the required digital human video. If the submitted text has already been synthesized, the play URL is returned directly.
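
A request body for this interface might be built as follows. The exact schema is not public, so every field name and default here is an assumption based on the config/data description above:

```python
import json

# Hypothetical submit payload for the text2vedio/submit endpoint.
# The "config" and "data" fields follow the description above; the
# nested field names are assumptions for illustration.

def build_submit_body(text: str, speaker: str = "zh-female",
                      speed: float = 1.0, volume: float = 1.0,
                      pitch: float = 1.0) -> str:
    body = {
        "config": {
            "speaker": speaker,   # which virtual anchor voice to use
            "speed": speed,       # adjustable speech speed
            "volume": volume,     # adjustable volume
            "pitch": pitch,       # adjustable pitch
        },
        "data": text,             # the text to be synthesized
    }
    return json.dumps(body, ensure_ascii=False)

print(build_submit_body("Hello HMS"))
```

The returned JSON string would then be POSTed to the submit URL above.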

4.2.2 Text-to-Virtual-Digital-Human Video Result Query Interface

URL: http://10.33.219.58:8888/v1/vup/text2vedio/query

Request parameters:

Main functions: batch-queries the conversion status based on the submitted text IDs.

Main logic: based on the list of synthesized text data IDs submitted from the front-end page (the textIds field), the service queries the task status of each synthesized video file, collects the status results into a set, and inserts them into the response as return parameters. If the requested text has already been synthesized, the play URL is returned directly.
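
The batch logic above can be sketched as a pure function: collect each textId's status into a set and return play URLs for the finished ones. The record shape is an assumption for illustration:

```python
# Sketch of the batch query logic described above: gather the status of
# each textId into a set and collect play URLs for finished conversions.
# The per-record fields ("status", "videoUrl") are assumptions.

def summarize_batch(records: dict, text_ids: list):
    """records maps textId -> {"status": ..., "videoUrl": ...}."""
    statuses = set()
    urls = {}
    for tid in text_ids:
        rec = records.get(tid, {"status": "NOT_FOUND"})
        statuses.add(rec["status"])
        if rec["status"] == "SUCCESS":
            urls[tid] = rec["videoUrl"]
    return statuses, urls

records = {
    "t1": {"status": "SUCCESS", "videoUrl": "https://example.com/t1.mp4"},
    "t2": {"status": "RUNNING"},
}
statuses, urls = summarize_batch(records, ["t1", "t2", "t3"])
print(statuses, urls)
```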

4.2.3 Text-to-Virtual-Digital-Human Batch Offline Interface

URL: http://10.33.219.58:8888/v1/vup/text2vedio/offline

Request parameters:

Main functions: takes videos offline in batches based on the submitted text IDs.

Main logic: based on the array of synthesized text data IDs submitted from the front-end page (the textIds field), the service takes all videos corresponding to those IDs offline, changes their status to offline, and deletes the video files at the same time. Offline videos can no longer be played or viewed.
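
The offline operation can be sketched the same way: mark each record offline and delete its file. The in-memory store below is an illustrative stand-in for the real storage layer:

```python
# Sketch of the batch-offline logic described above: mark each video
# offline and delete its file. The store layout is an assumption.

def take_offline(store: dict, text_ids: list) -> int:
    """store maps textId -> {"status": ..., "file": ...}; returns the
    number of videos actually taken offline by this call."""
    count = 0
    for tid in text_ids:
        rec = store.get(tid)
        if rec and rec["status"] != "OFFLINE":
            rec["status"] = "OFFLINE"
            rec["file"] = None   # stands in for deleting the video file
            count += 1
    return count

store = {
    "t1": {"status": "SUCCESS", "file": "t1.mp4"},
    "t2": {"status": "SUCCESS", "file": "t2.mp4"},
}
print(take_offline(store, ["t1", "t2"]))  # 2
print(store["t1"]["status"])              # OFFLINE
```

Making the operation idempotent (already-offline IDs are skipped) means a repeated batch request does no extra work.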

4.3 Main Functions of the HMS ML Kit Digital Human Service

The HMS ML Kit Digital Person service is very powerful:

  1. Bilingual pronunciation: the current system supports both Chinese and English, so Chinese and English text can be submitted as pronunciation data.
  2. Multiple virtual anchor images: different virtual anchor voices are supported. The system currently provides four virtual anchors: a Chinese female voice, Shanghai Daily, an English female voice, and an English male voice.
  3. Picture-in-picture video playback: in addition to the virtual anchor settings, playback supports picture-in-picture mode, in which the video plays in a small window. In this mode the video window follows as the screen scrolls, so the text remains visible while the video plays; the window can also be dragged to any position so that it does not block the text.
  4. Adjustable speed, volume, and pitch: the speed, volume, and pitch of the pronunciation can be tuned to different needs.
  5. Multiple background settings: different virtual anchor backgrounds can be set. The system currently has three built-in backgrounds: transparent, green screen, and a science-and-technology theme. You can also upload a picture to customize your own background.
  6. Subtitle settings: the system can configure subtitles automatically, with Chinese, English, or bilingual subtitles.
  7. Multiple layout settings: the position of the virtual anchor on the screen can be adjusted by parameters (left, right, or center), as can the anchor's size and whether the full body or half body is shown. When the anchor is placed on the left or right, you can also set a podium and its position, and display the video file to be played within the frame, achieving a picture-in-picture effect that recreates a real news broadcast scene.
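
Taken together, the settings above could be expressed as a single configuration object. Every field name and value below is an assumption for illustration, not the documented schema:

```python
import json

# Hypothetical config illustrating the settings described above.
# All field names and values are assumptions, not the documented API.

config = {
    "speaker": "zh-female",        # one of the four virtual anchors
    "speed": 1.2,                  # adjustable speech speed
    "volume": 1.0,                 # adjustable volume
    "pitch": 0.9,                  # adjustable pitch
    "background": "green-screen",  # transparent / green-screen / tech theme
    "subtitle": "bilingual",       # zh / en / bilingual
    "layout": {
        "position": "left",        # left / right / center
        "bodyRange": "half",       # full or half body
        "podium": True,            # podium shown for left/right layouts
    },
}
print(json.dumps(config, indent=2))
```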

Picture-in-picture video display:

5. Conclusion

As a developer, after using HMS ML Kit to create a digital human video, especially with the picture-in-picture function, I was amazed. It truly recreates the broadcast scene of a real news anchor, which makes one wonder: once digital humans are perfected, could they completely replace real human broadcasters?

For details about the development guide, see the official website of the Huawei Developer Alliance:

Developer.huawei.com/consumer/cn…


Original link: developer.huawei.com/consumer/cn… Author: Say Hi