Not long ago, Baidu unveiled the full panorama of "Wenxin," its industrial-grade knowledge-enhanced model family. Recently, ERNIE-ViLG, one of its cross-modal generation models, opened a trial experience on the official Baidu Wenxin website, and its paper was released.

Experience links: wenxin.baidu.com/wenxin/erni…

Links to papers: arxiv.org/pdf/2112.15…

ERNIE-ViLG reportedly reaches a scale of 10 billion parameters, making it the largest Chinese cross-modal generation model in the world to date. It is the first model to use an autoregressive algorithm to unify the modeling of image generation and text generation, which strengthens its cross-modal semantic alignment and significantly improves the quality of text-image generation.

Let's first experience ERNIE-ViLG's "image creation" ability.

For text-to-image generation, ERNIE-ViLG can automatically create images from text entered by users; the generated images not only match the text description but also look strikingly realistic.

Note: the images below are newly generated by the model, not existing images retrieved by search.

ERNIE-ViLG can not only create individual objects such as buildings and animals:

It can also create complex scenes containing multiple objects:

It can even improvise imaginatively on a user's input:

ERNIE-ViLG can also generate fitting images for evocative classical poems, and adjust them to the requested painting style:

Oil painting style

Chinese painting style

Watercolor style

In addition, it can complete a partial picture according to a text prompt:

For image-to-text generation, ERNIE-ViLG can understand a picture and describe its content in simple language:

Not only that, ERNIE-ViLG can also answer questions about the scene in the picture:

At present, the text-to-image demo on the official Baidu Wenxin website lets ERNIE-ViLG draw pictures based on classical poems, enhancing their pictorial quality.

What are the secrets of the AI technology behind these capabilities?

Cross-Modal Generation: A Challenging Problem in AI

Cross-modal generation refers to transforming content from one modality (text, image, or speech) into another while keeping the semantics consistent across modalities.

Text-image generation is one of the hardest problems in cross-modal generation. Take text-to-image generation as an example: a text description is usually very general, so generating an image from it means filling in a large number of details the text never mentions, which is extremely challenging. For example, the poem line "When the spring river warms, the duck is the first to know" mentions only two objects, the river and the duck, plus the season of spring; it does not specify the duck's color, the peach blossoms by the riverbank, or where each object should sit in the picture.

"When the spring river warms, the duck is the first to know"

In recent years, methods based on generative adversarial networks (GANs) have achieved good results on text-to-image generation in restricted domains such as faces and landscapes. DALL-E instead uses a very large autoregressive generation model to capture the sequential dependencies between image patches, giving it the capacity for diverse generation and achieving striking results on the more varied and difficult task of open-domain text-to-image generation.
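To illustrate the autoregressive idea, here is a minimal, self-contained sketch of sampling discrete image tokens one at a time with a tiny causal Transformer. Everything here (the codebook size, sequence length, and model) is a toy stand-in for illustration, not any real system's configuration.

```python
# Minimal sketch: autoregressive generation over discrete image tokens.
# Toy sizes only; a real model would be vastly larger and text-conditioned.
import torch
import torch.nn as nn

VOCAB = 1024      # hypothetical codebook size for discrete image tokens
SEQ_LEN = 16      # hypothetical 4x4 grid of image tokens
DIM = 64

class TinyImageTokenModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.pos = nn.Parameter(torch.zeros(SEQ_LEN, DIM))
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):
        # Causal mask: each position attends only to earlier tokens.
        n = tokens.size(1)
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        h = self.embed(tokens) + self.pos[:n]
        return self.head(self.blocks(h, mask=causal))

@torch.no_grad()
def sample(model, prefix):
    # Each new token is drawn conditioned on all previously generated
    # tokens: the "pre-and-post" dependency the autoregressive model learns.
    tokens = prefix
    while tokens.size(1) < SEQ_LEN:
        logits = model(tokens)[:, -1]                       # next-token logits
        nxt = torch.multinomial(logits.softmax(-1), num_samples=1)
        tokens = torch.cat([tokens, nxt], dim=1)
    return tokens

model = TinyImageTokenModel()
prefix = torch.randint(0, VOCAB, (1, 1))  # stand-in for a conditioning prefix
print(sample(model, prefix).shape)        # torch.Size([1, 16])
```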

Baidu's ERNIE-ViLG goes further by proposing a unified cross-modal bidirectional generation model: it models text-to-image and image-to-text generation in a single autoregressive framework, better capturing semantic alignment between the modalities and thereby improving both generation directions. On MS-COCO, the authoritative public benchmark for text-to-image generation, ERNIE-ViLG's image-quality score on FID (Fréchet Inception Distance) far surpasses comparable models such as OpenAI's DALL-E, and it sets new state-of-the-art results on multiple image captioning tasks. With its strong cross-modal understanding, ERNIE-ViLG also leads on generative visual question answering.

Wenxin ERNIE-ViLG Technical Principles: Unified Modeling of Bidirectional Text-Image Generation

Baidu Wenxin ERNIE-ViLG uses a Transformer with shared encoder and decoder parameters as the backbone network for autoregressive generation, and learns the text-to-image and image-to-text tasks simultaneously.

Using image vector quantization, ERNIE-ViLG represents an image as a sequence of discrete tokens, so that text and images can both be handled as unified autoregressive sequence generation. For text-to-image generation, the model's input is a sequence of text tokens and its output is a sequence of image tokens; for image-to-text generation, the text content is predicted from the input image sequence. Both directions use the same Transformer model. Generating the visual and linguistic modalities in the same fashion, under the same model parameters, encourages better semantic alignment across the modalities.

ERNIE-ViLG's unified modeling framework for bidirectional text-image generation
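To make the unified formulation concrete, here is a minimal sketch of how the two directions could reduce to one sequence task, assuming text and image tokens share a vocabulary space and only the concatenation order changes. The token ranges and sizes are hypothetical toy values, not details from the paper.

```python
# Minimal sketch: one autoregressive task serves both generation directions.
import torch

def make_training_sequence(text_tokens, image_tokens, direction):
    """Build one training sequence; loss applies only to the target segment."""
    if direction == "text2image":
        # Condition on text, predict image tokens.
        inputs = torch.cat([text_tokens, image_tokens], dim=1)
        loss_mask = torch.cat([torch.zeros_like(text_tokens),
                               torch.ones_like(image_tokens)], dim=1)
    else:
        # Condition on image, predict text tokens: same model, reversed order.
        inputs = torch.cat([image_tokens, text_tokens], dim=1)
        loss_mask = torch.cat([torch.zeros_like(image_tokens),
                               torch.ones_like(text_tokens)], dim=1)
    return inputs, loss_mask.bool()

text = torch.randint(0, 1000, (1, 8))       # hypothetical text token ids
image = torch.randint(1000, 2024, (1, 16))  # hypothetical image token ids
for d in ("text2image", "image2text"):
    seq, mask = make_training_sequence(text, image, d)
    print(d, seq.shape, int(mask.sum()))    # same weights train both tasks
```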

Existing text-to-image models based on discrete image representations mostly adopt two-stage training: first generating a visual token sequence from text, then reconstructing the image from that sequence. ERNIE-ViLG instead proposes an end-to-end training method: the hidden-layer image representations that the Transformer outputs during sequence generation are fed directly into the reconstruction model, supplying it with richer semantic features. The generation model, in turn, receives both the abstract supervision signal from its own token prediction and the raw pixel-level supervision signal from the reconstruction model, which helps it learn better image representations.
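A rough sketch of how such a combined objective could look, under our reading of the description above (toy stand-in modules, not ERNIE-ViLG's actual code): the reconstruction network consumes the generator's hidden states, so the pixel-level loss backpropagates into the generator alongside the token-level cross-entropy.

```python
# Minimal sketch: end-to-end training with two supervision signals.
import torch
import torch.nn as nn

DIM, VOCAB, N_IMG_TOKENS, PIXELS = 64, 1024, 16, 3 * 32 * 32

generator_head = nn.Linear(DIM, VOCAB)   # predicts discrete image tokens
reconstructor = nn.Sequential(           # maps hidden states to raw pixels
    nn.Flatten(), nn.Linear(N_IMG_TOKENS * DIM, PIXELS))

# Stand-in for the Transformer's hidden-layer image representations.
hidden = torch.randn(1, N_IMG_TOKENS, DIM, requires_grad=True)
target_tokens = torch.randint(0, VOCAB, (1, N_IMG_TOKENS))
target_pixels = torch.rand(1, PIXELS)

# Abstract supervision: autoregressive cross-entropy on image tokens.
ce = nn.functional.cross_entropy(
    generator_head(hidden).view(-1, VOCAB), target_tokens.view(-1))

# Raw supervision: the reconstruction loss also reaches the generator,
# because the reconstructor consumes the generator's hidden states.
recon = nn.functional.mse_loss(reconstructor(hidden), target_pixels)

loss = ce + recon
loss.backward()
print(float(loss), hidden.grad.shape)  # gradients arrive from both losses
```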

The Wenxin team built a large-scale cross-modal alignment dataset of 145 million high-quality Chinese text-image pairs, trained the ten-billion-parameter model on it using Baidu's PaddlePaddle deep learning platform, and evaluated the model on cross-modal generation tasks such as text-to-image generation and image captioning.

Results on the Text-to-Image Synthesis Task

Wenxin ERNIE-ViLG's text-to-image ability was verified on the open-domain public dataset MS-COCO, with FID as the evaluation metric (the lower the value, the better the result). In both the zero-shot and fine-tuned settings, ERNIE-ViLG achieved the best performance, far surpassing the DALL-E model released by OpenAI.

Results of Wenxin ERNIE-ViLG on the MS-COCO dataset
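For reference, FID measures the distance between the feature distributions of real and generated images in an Inception network's feature space. With feature means and covariances $(\mu_r, \Sigma_r)$ for real images and $(\mu_g, \Sigma_g)$ for generated images, it is defined as

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)$$

so a smaller FID means the statistics of the generated images are closer to those of real images.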

Results on the Image Captioning Task

For image-to-text generation, Wenxin ERNIE-ViLG achieved the best results on COCO-CN and AIC-ICC, two public Chinese image captioning datasets.

Results of ERNIE-ViLG on the AIC-ICC dataset

Results on the Generative VQA Task

Wenxin ERNIE-ViLG has also shown great strength in generative visual question answering (VQA), which requires the model to generate free-form answers based on the image content and the corresponding question. This demands deep understanding of visual content, cross-modal semantic alignment, and the ability to produce concise answer text, making it extremely difficult. ERNIE-ViLG achieved the best results on the FMIQA dataset, with a Turing test pass rate of 78.5%, 14 percentage points above the previous best method.

Results of Wenxin ERNIE-ViLG on the FMIQA dataset
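As a sketch of the task setup (an illustration only, not Baidu's implementation), generative VQA can be framed as conditional sequence generation: condition on image tokens plus question tokens, then decode free-form answer tokens until an end-of-answer symbol appears. All ids and the stand-in model below are hypothetical.

```python
# Minimal sketch: generative VQA as conditional sequence generation.
import torch

EOS = 0  # hypothetical end-of-answer token id

def answer(model, image_tokens, question_tokens, max_len=20):
    # Condition on the image and the question, then decode the answer
    # token by token, stopping at the end-of-answer symbol.
    seq = torch.cat([image_tokens, question_tokens], dim=1)
    out = []
    for _ in range(max_len):
        logits = model(seq)[:, -1]                  # next-token distribution
        nxt = logits.argmax(dim=-1, keepdim=True)   # greedy decoding for brevity
        if int(nxt) == EOS:
            break
        out.append(int(nxt))
        seq = torch.cat([seq, nxt], dim=1)
    return out

# Stand-in "model": random logits over a toy 2024-token vocabulary.
model = lambda seq: torch.randn(seq.size(0), seq.size(1), 2024)
image_tokens = torch.randint(1000, 2024, (1, 16))   # hypothetical image ids
question_tokens = torch.randint(1, 1000, (1, 6))    # hypothetical text ids
print(answer(model, image_tokens, question_tokens))
```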

Giving machines the ability to generate across modalities is one of the important goals of artificial intelligence. Large cross-modal models like ERNIE-ViLG have broad application prospects in art creation, virtual reality, image editing, AI-aided design, virtual digital humans, and beyond, offering these fields boundless creativity and possibility. As an important member of Baidu's "Wenxin" large-model panorama, ERNIE-ViLG marks Baidu's solid progress in the cross-modal large-model field, and continues to advance the development of AI in China through independent technological innovation and accelerated industrial application.

Click "Here" to try ERNIE-ViLG right away.