Author: Look at That Coder

Official account: Look at That Coder

1. Project Introduction

The image caption task is a combination of computer vision and natural language processing: it requires the computer to learn a mapping from data in the visual modality to the text modality, establishing a correspondence between vision and language.

In other words, the computer is asked to identify the contents of an image and generate its own description of that image.

In general, such a mapping task must meet two basic requirements:

  • Grammatical correctness: the mapping must follow natural-language grammar so that the result is readable;
  • Descriptive richness: the generated caption must accurately capture the details of the corresponding image, producing a sufficiently rich description.

The main subject of this article is im2txt, an image captioning model designed by Google in 2015.

Project paper link: arxiv.org/abs/1609.06…

2. Project Purpose

Have the computer identify the contents of an image and generate its own natural-language description of it.

3. Project Configuration

  • Python 3.6
  • TensorFlow 1.12

4. Project Principle

This project is built on a deep-learning encoder-decoder algorithm model, trained with massive amounts of data:

  • The encoder is a CNN model of the kind widely used in image recognition, object detection, and related fields; any common convolutional network can serve here, such as VGG, Inception, or ResNet.
  • The decoder is an LSTM model of the kind commonly used in language modeling and machine translation. The fixed-length vector output by the encoder is fed into the decoder, which produces the description of the image (a minimal sketch of this structure follows the list).
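To make the encoder-decoder structure concrete, here is a minimal tf.keras sketch. This is not the im2txt code itself: the sizes are illustrative assumptions, and where im2txt feeds the image embedding to the LSTM as its first input step, this sketch seeds the LSTM state with it instead, purely for brevity.

import tensorflow as tf

# Illustrative sizes only; the real im2txt configuration differs.
VOCAB_SIZE, EMBED_DIM, MAX_LEN = 12000, 512, 20

# Encoder: a pretrained CNN maps the image to one fixed-length vector.
image_in = tf.keras.layers.Input(shape=(299, 299, 3))
cnn = tf.keras.applications.InceptionV3(include_top=False, pooling="avg")
img_embed = tf.keras.layers.Dense(EMBED_DIM)(cnn(image_in))

# Decoder: an LSTM starts from the image vector and, given the previous
# words of the caption, predicts the next word at every step.
words_in = tf.keras.layers.Input(shape=(MAX_LEN,))
word_embed = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(words_in)
lstm_out = tf.keras.layers.LSTM(EMBED_DIM, return_sequences=True)(
    word_embed, initial_state=[img_embed, img_embed])
next_word = tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax")(lstm_out)

caption_model = tf.keras.Model([image_in, words_in], next_word)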

5. Project Process

1. Download the code accompanying the paper and configure it. The paper uses Inception-v3 as the encoder and an LSTM as the decoder.

2. Train the model on the MSCOCO image-caption dataset; after large-scale training, the trained algorithm model is obtained (a toy training sketch, continuing the example above, follows).
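Continuing the tf.keras sketch from the previous section (again an illustration, not the real im2txt training script): training minimizes the per-word cross-entropy between the predicted next word and the ground-truth caption.

import numpy as np

# Hypothetical stand-ins for preprocessed MSCOCO data; replace with real
# images and tokenized captions. In real data, captions_out would be
# captions_in shifted left by one position, so each step is supervised
# by the next ground-truth word.
images = np.zeros((8, 299, 299, 3), dtype=np.float32)
captions_in = np.zeros((8, MAX_LEN), dtype=np.int32)
captions_out = np.zeros((8, MAX_LEN), dtype=np.int32)

# Per-word cross-entropy over the vocabulary at every time step.
caption_model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy")
caption_model.fit([images, captions_in], captions_out,
                  batch_size=8, epochs=1)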

3. Use the following code to test the model and see the effect.

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import math
import os

import tensorflow as tf
from im2txt import configuration
from im2txt import inference_wrapper
from im2txt.inference_utils import caption_generator
from im2txt.inference_utils import vocabulary

# Paths to the trained model, the vocabulary file, and the test images.
checkpoint_path = "D:/Show-Tell/im2txt/model/model.ckpt-3000000"
vocab_file = "D:/Show-Tell/im2txt/im2txt/data/word_counts.txt"
input_files = "D:/Show-Tell/im2txt/images/"

# Build the inference graph from the trained model.
g = tf.Graph()
with g.as_default():
    model = inference_wrapper.InferenceWrapper()
    restore_fn = model.build_graph_from_config(configuration.ModelConfig(),
                                               checkpoint_path)
g.finalize()

# Load the vocabulary.
vocab = vocabulary.Vocabulary(vocab_file)

with tf.Session(graph=g) as sess:
    # Restore the trained weights into the session.
    restore_fn(sess)
    generator = caption_generator.CaptionGenerator(model, vocab)

    # Loop over every image in the input folder.
    for root, dirs, files in os.walk(input_files):
        for file in files:
            image_path = os.path.join(root, file)
            print(image_path)
            with tf.gfile.FastGFile(image_path, "rb") as f:
                image = f.read()
            # Generate candidate captions with beam search.
            captions = generator.beam_search(sess, image)
            for i, caption in enumerate(captions):
                sentence = [vocab.id_to_word(w) for w in caption.sentence[1:-1]]
                sentence = " ".join(sentence)
                if i == 0:
                    # Keep the top-ranked caption as the title.
                    title = sentence
                print("  %d) %s (p=%f)" % (i, sentence, math.exp(caption.logprob)))

4. Run the script and check the captioning results.

6. Project Reflections

To sum up, this model basically gives the computer the ability to look at a picture and describe it, but it still falls short in accuracy and applicability, leaving plenty of room for modification and improvement.

Since an encoder-decoder algorithm model is what gets trained on the dataset, innovation is possible at the encoder and decoder stages separately: for example, the encoder could be swapped for Inception-v4, or parts of the original network could be rewired with dense connections to strengthen feature fusion, as sketched below.
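As an illustration of how contained such a change is, here is a hedged sketch of swapping the encoder in the earlier tf.keras example. Note that tf.keras.applications does not ship Inception-v4 itself, so InceptionResNetV2 (Inception blocks combined with residual connections) stands in for it here:

import tensorflow as tf

EMBED_DIM = 512  # same illustrative size as in the earlier sketch

# Swap the encoder: InceptionResNetV2 stands in for Inception-v4, since
# tf.keras.applications does not include Inception-v4 directly.
image_in = tf.keras.layers.Input(shape=(299, 299, 3))
cnn = tf.keras.applications.InceptionResNetV2(include_top=False, pooling="avg")
img_embed = tf.keras.layers.Dense(EMBED_DIM)(cnn(image_in))
# The LSTM decoder from the earlier sketch can consume img_embed unchanged.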

Judging from the current development of image captioning, however, pushing the algorithm model toward GAN-based approaches looks like a promising direction and a good idea for a breakthrough.

I hope to discuss next steps and learning directions with you, the reader of this article.

If you find this helpful:

1. Like this article so more people can see it;

2. Follow the official account "Look at That Coder" so we can learn and progress together.
