
International Journal of Hybrid Intelligent Systems 14 (2017) 123–139 DOI 10.3233/HIS-170246 IOS Press

A survey of evolution of image captioning techniques

Akshi Kumar a,1,∗ and Shivali Goel b,1

a Faculty of Computer Science and Engineering, Delhi Technological University, Shahbad Daulatpur, Delhi – 42, India
b Department of Information Technology, Delhi Technological University, Shahbad Daulatpur, Delhi – 42, India

1 These authors have equal contribution.
∗ Corresponding author: Akshi Kumar, Faculty of Computer Science Engineering, Delhi Technological University, Shahbad Daulatpur, Main Bawana Road, Delhi – 42, India. E-mail: akshikumar@dce.ac.in.

Abstract. Automatic captioning of images has been explored extensively in the past 10 to 15 years. It is one of the elementary problems of Computer Vision and Natural Language Processing and has a vast array of real-world applications. In this survey, we study the different approaches used for generating image captions in chronological order, starting from basic template-based caption generation models and ending with neural networks combined with external world knowledge. We review existing models in detail, highlighting the methodologies involved and the improvements that have occurred over time. We give an overview of the standard image datasets and the evaluation measures developed to assess the quality of generated image captions. Apart from the basic benchmarks, we also note speed and accuracy improvements across the different approaches. Finally, we investigate further possibilities in automatic image caption generation.

Keywords: Computer Vision, image captioning, deep learning, object recognition, Natural Language Processing

1. Introduction

The last two decades have seen great improvements and enthusiasm in the fields of Computer Vision and Natural Language Processing. The main targets of these problems are studying and generating automatic text descriptions, and comprehending images and videos. These problems find their roots in AI and ML, which are themselves in early phases considering their vast potential. Moreover, these fields have been investigated and researched separately, making it extremely important to study their combined scope and to further investigate their possibilities. The image captioning problem has long been viewed as challenging because of the need to identify the different segments of an image correctly, identify the connections between them, and finally weave them together in a syntactically and semantically correct manner that describes the most salient aspects of the image. Therefore, to accomplish this task we need technology which can fully understand the image and is capable of applying external world knowledge to generate a description that is most true to the image. In other words, it is equivalent to mimicking the human capability of compressing the most salient features of an image into the most descriptive, but true-to-the-image, description. This seems a herculean task relative to normal Computer Vision evaluation systems that simply identify what is present in the image. However, with most of the world population connecting online, the need for an image-caption generation system has surged to attention-seeking levels. We have lots of multi-modal data arriving online every second in the form of millions of tagged photographs on Facebook and cloud storage; the image classifying and story making feature of Google Photos is a great example. Also, with lots of video content, the need for automatic subtitles for videos has grown. Automatic image captioning will help in organizing the millions of unstructured, unorganized and unclassified images on the Internet and, on humanitarian grounds, will also aid the visually impaired to sense the images.


Therefore, the dire need for image-caption generation cannot go unnoticed, and hence lots of research papers and conferences on the topic are appearing throughout the world. At the heart of this technology is the power of neural networks, which makes it seem like magic: a Convolutional Neural Network (CNN) performs the prime task of identifying the salient features of the image, while Recurrent Neural Networks make it possible to construct (almost) meaningful captions from the identified visual features. We will see how the technology advanced from basic tree parsing to neural networks utilizing hundreds of millions of features, making the results competent enough to be compared to human-generated captions. We unwind these technologies one by one, in brief, in this survey paper. Our aim is to present, analyze and compare the research of the last decade in this ground-breaking field in a matter of a few pages. We chronologically identify the technological developments, the involved approaches, their drawbacks and how well they perform on various metrics, and we further investigate the future scope of automatic image caption generation.

2. Evolution of image captioning techniques

The paper is structured in a chronological manner: we start from the basic techniques, discuss their methodology and shortcomings, and then move to a newer approach that solved the problems of the previous one. Figure 1 shows the timeline we have developed to help keep track of the advancements in this field and to better understand the need for and basis of newer models. The years 2005 to 2010 saw the birth of the major approaches dealing with computer vision, mapping the detected objects to words and weaving them into a meaningful or stylish description. Related work includes that of Li et al. [5], Farhadi et al. [9] and Li et al. [6]. This period also saw approaches that add domain knowledge which normal computer vision cannot see (Yang et al. [16]), as well as approaches that reuse existing captions of locally (Ordonez et al. [17]) and globally (Farhadi et al. [7]) similar images to avoid the need to structure self-made sentences. In 2010 and 2011, some work still focused on primitive techniques for text generation.

For example, Kulkarni et al. [15] used template-based description generation; Farhadi et al. [7] grouped the computer vision detections into triplets and then used them to generate descriptions based on templates; and Li et al. [14] generated descriptions by merging the computer-vision-detected objects using proper semantic relationships. The year 2012 was a highlight in the era of automated image captioning, as ImageNet classification with a deep CNN (convolutional neural network) was performed using 60 million parameters and about half a million neurons, with five convolutional layers. From 2013 onwards, techniques involving recurrent networks started gaining momentum: Mao et al. [28], Vinyals et al. [48], Kiros et al. [25], Fang et al. [30] and Chen et al. [47] used a recurrent NN, while Kiros et al. [23] used a feedforward one. Also, Kiros et al. [25] proposed to create a "multimodal embedding space" by using a vision model and an LSTM to encode text. The years 2014 and 2015 saw the evolution of rich feature hierarchies for accurate, high-quality object detection and of going deeper with convolutions; fast progress in object detection [18,31–33,49] was identified with models which labeled multiple regions of an image for image captioning. The years 2015 and 2016 saw the evolution of very deep convolutional networks for large-scale image recognition, multimodal R-CNNs and various other techniques [24,34–40,48]. We discuss in detail 8 prominent categories of approaches belonging to different time periods, starting with one of the very first techniques and ending with the most recent one.

2.1. Template Based (Tree parser)

The template-based technique is discussed using the work of Margaret Mitchell et al. [21], published in 2012.

2.1.1. Problems faced till now
Template-based caption generation was used earlier in Kulkarni et al. [15] and Yang et al. [16], whose system substitutes probable prepositions, verbs and interjections after parsing the UIUC PASCAL-VOC dataset (Farhadi et al. [9]) and choosing head nouns and their dependents using maximum likelihood calculated from the ratio of their individual logs. However, only predictable, consistent sentences can be generated using template-based techniques, not novel captions. Ordonez et al. [17] matched the query image against a much larger set of existing captioned photographs, followed by local reordering.


Fig. 1. Timeline indicating the progress in automated image captioning.

Although natural, these captions are not always true to the image: they mainly describe similar images and may miss unique features of the query image.

2.1.2. Overview
Midge uses syntactic knowledge of the probability distribution over the next words that should appear after a given sequence of words. The generator uses constraints to filter out noisy output from the vision system and generates syntactic trees that describe what computer vision detected in the image.

2.1.3. Model/Methodology
For training, 700,000 Flickr images with their respective descriptions were taken from the dataset used in Ordonez et al. [17]. Descriptions were normalized before parsing, and parsing was done with the Berkeley parser (Petrov [12]). Once a head noun was selected, probabilities were calculated for determiners (the, a, an) and pre-nominal modifiers/adjectives when formulating a description. Head nouns were identified and physical objects were distinguished among the detections of the vision system using WordNet (Miller [1]). A maximum of 3 objects were kept in a single-sentence description. The caption generation process was treated as the problem of growing a syntactically and semantically informed tree based on the detected object nouns, with tree growth achieved through lexicalized syntactic derivations anchored at the detected head nouns. A three-step process was followed (Reiter and Dale [3]): content determination grouped and ordered the object nouns, generated their local subtrees and filtered irregular detections; micro-planning generated full syntactic trees around the detected object nouns; and in the surface realisation step, modifiers were selected, classified as pre-nominal or post-nominal, and the final outputs were chosen.

The system grew multiple trees and then chose the best one as the result. Contexts (nouns) for adjectives were weighted using point-wise mutual information (PMI), and for any adjective only the best 1000 nouns were kept. In micro-planning, fully grown trees were generated by taking the intersection of the subtrees created during content determination; subtrees surrounding a noun in position 1 could be merged directly with subtrees surrounding a noun in position 2 because the nouns are ordered. In the surface realisation stage, the system chose the most probable single tree from all generated candidates and removed the mark-up to produce a final string. Different strings may be generated depending on the specifications given by the user; the final string is then the one with the most words.

2.1.4. Evaluation metrics
A 5-point Likert scale and human judgments collected using Amazon's Mechanical Turk (Amazon, 2011) were used as evaluation metrics. Midge was also evaluated against the Kulkarni et al. system, the Yang et al. system and human-generated descriptions on the same images. Other criteria include grammar, main aspects, correctness, order and human likeness. Result analysis was done using the non-parametric Wilcoxon signed-rank test, comparing systems by their median values.

2.1.5. Dataset
Training dataset: 700,000 images with their associated descriptions from the Flickr dataset of Ordonez et al. [17]. Testing dataset: 840 PASCAL images.

2.1.6. Results
Midge performed better than all earlier automatic approaches on the criteria of correctness and order, and additionally performed better than Yang et al. on the criterion of close proximity of its sentences to the human-generated ones.
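To make the adjective–noun weighting described in Section 2.1.3 concrete, the following minimal sketch (a generic illustration, not Midge's code; the toy counts and function name are assumptions) computes PMI scores for adjective–noun context pairs, which could then be used to keep only the best nouns per adjective:

```python
import math
from collections import Counter

def pmi(pair_counts, adj_counts, noun_counts, total_pairs):
    """Point-wise mutual information for (adjective, noun) pairs:
    PMI(a, n) = log( P(a, n) / (P(a) * P(n)) )."""
    total_adj = sum(adj_counts.values())
    total_noun = sum(noun_counts.values())
    scores = {}
    for (a, n), c in pair_counts.items():
        p_joint = c / total_pairs
        p_a = adj_counts[a] / total_adj
        p_n = noun_counts[n] / total_noun
        scores[(a, n)] = math.log(p_joint / (p_a * p_n))
    return scores

# Toy corpus of (adjective, noun) co-occurrences harvested from captions.
pairs = Counter({("blue", "sky"): 8, ("blue", "dog"): 1,
                 ("furry", "dog"): 6, ("clear", "sky"): 5})
adjs = Counter({"blue": 9, "furry": 6, "clear": 5})
nouns = Counter({"sky": 13, "dog": 7})

scores = pmi(pairs, adjs, nouns, sum(pairs.values()))
# Midge-style filtering would keep only the top-scoring nouns per adjective.
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```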


Fig. 2. Tree generated using Midge approach during the tree growth process.

Figure 3 shows captions generated by this method.

2.2. Encoder – Decoder based Caption Generation

This technique is discussed using the work of Kiros, Salakhutdinov and Zemel – Unifying visual-semantic embeddings with multimodal neural language models [25] – which was published in 2014.

2.2.1. Problems faced till now
Descriptions produced by earlier strategies were machine-like in nature and failed to capture the fluidity of captions written by humans. BLEU and ROUGE [22] evaluation methods were unreliable and did not match human perception.

2.2.2. Overview
The encoder (an LSTM) ranks captions and pictures and develops a sensible scoring function, while the decoder (an SC-NLM) uses the learnt representations to generate and score new descriptions.

2.2.3. Model/Methodology
Sentences were encoded with long short-term memory (LSTM) recurrent neural networks [2]. Image features from a deep CNN were projected into the embedding space of the LSTM hidden states. Joint image-sentence embeddings were learnt by minimizing a pairwise ranking loss, so as to learn to rank pictures and their descriptions.

A structure-content neural language model (SC-NLM) served as the decoder, disentangling sentence structure from content, while the encoder provided the conditioning on distributed representations. Sensible image captions were generated by sampling from the SC-NLM, i.e., the decoder generated new captions from scratch. The problem of ranking pictures and captions was used as a proxy for generation: optimising this task should improve the generation technique, because any generation system needs a scoring function to analyse how well a caption and a picture match.

2.2.4. Evaluation metrics
Med r; R@K.

2.2.5. Dataset
Flickr30K and Flickr8K.

2.2.6. Results
The methods described in this paper generated descriptions of higher quality than the then state-of-the-art composition-based strategies. The authors also worked on attention-based models which learn to align parts of captions to pictures and determine where to attend next by using these alignments, thus dynamically modifying the decoder conditioning vectors. Figure 4 shows captions generated by this method.
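To make the pairwise ranking objective of Section 2.2.3 concrete, here is a minimal sketch (not the implementation of [25]; the margin value, batch shapes and function name are illustrative assumptions) that scores every image against every sentence in a batch and penalises mismatched pairs that score within a margin of the matched pair:

```python
import numpy as np

def pairwise_ranking_loss(img_emb, sen_emb, margin=0.1):
    """Margin-based ranking loss over a batch of matched (image, sentence)
    embedding pairs; row i of each matrix describes the same item."""
    # Cosine similarity between every image and every sentence.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    sen = sen_emb / np.linalg.norm(sen_emb, axis=1, keepdims=True)
    scores = img @ sen.T                       # scores[i, j] = s(image_i, sentence_j)
    pos = np.diag(scores)                      # matching pairs sit on the diagonal
    # Hinge terms: contrastive sentences for each image, and vice versa.
    cost_s = np.maximum(0.0, margin - pos[:, None] + scores)  # image vs. wrong sentences
    cost_i = np.maximum(0.0, margin - pos[None, :] + scores)  # sentence vs. wrong images
    np.fill_diagonal(cost_s, 0.0)
    np.fill_diagonal(cost_i, 0.0)
    return cost_s.sum() + cost_i.sum()

# Toy usage with random 8-dimensional embeddings for 4 matched pairs.
rng = np.random.default_rng(0)
print(pairwise_ranking_loss(rng.normal(size=(4, 8)), rng.normal(size=(4, 8))))
```

Minimizing such a loss pushes matching image–sentence pairs above all mismatched pairs in the joint embedding space, which is exactly the ranking behaviour the encoder is trained for.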


Fig. 3. Captions generated as in [21] L to R: The bus by the road with a clear blue sky; People with a bottle at the table; A person in black with a black dog by potted plants.

Fig. 4. Captions generated as in [25] L to R: A parked car while driving down the road; A little boy with a bunch of friends on the street; There is a cat sitting on a shelf.

2.3. Extracting Visual Features (RNN) + Maximum Entropy Language Model Caption Generation

The approach is discussed using Mind's Eye: A Recurrent Visual Representation for Image Caption Generation [47].

2.3.1. Problems faced till now
Many previous papers experimented with projecting image features and their associated descriptions into a common space [6,7,13], which finds use in image search or in ranking image captions. To learn these projections, various approaches were used: Kernel Canonical Correlation Analysis (KCCA) [22], recursive neural networks [29] and deep neural networks [24]. While these techniques projected both visual features and associated semantics into a joint embedding, they could not perform the inverse projection; that is, they could not produce fresh sentences or visual depictions from those joint embeddings.

2.3.2. Overview
This paper explored the bi-directional mapping between images and their sentence-based descriptions using a recurrent neural network. A new recurrent visual memory was deployed that automatically learned to remember long-term visual concepts, helping both sentence generation and visual feature reconstruction.

2.3.3. Model/Methodology
To accomplish the bidirectional mapping, a set of latent variables U_{t−1} was introduced that encoded the visual features of the previously read/generated words W_{t−1}. The latent variable U played the important role of a long-term visual memory for the previously generated/read words, which was the heart of this paper. U was used to calculate P(w_t | V, W_{t−1}, U_{t−1}) and P(V | W_{t−1}, U_{t−1}). Combining these two probabilities, the authors aimed to maximize

P(w_t, V | W_{t−1}, U_{t−1}) = P(w_t | V, W_{t−1}, U_{t−1}) · P(V | W_{t−1}, U_{t−1}).

That is, given the previous words and their visual interpretation, the authors aimed to maximize the joint probability of the word w_t and the observed visual features V.

Language model: the system supported vocabularies of 3,000 to 20,000 words using a word-classing approach [11], factorizing P(w_t | ·) = P(c_t | ·) · P(w_t | c_t, ·), where P(w_t | ·) is the probability of the word and P(c_t | ·) is the probability of its class. Words were grouped into classes by clustering their frequency distribution.
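As an illustration of this class-factored output layer (a generic sketch, not the authors' code; the toy vocabulary, class assignments and array shapes are invented), the word probability can be computed as the product of a class probability and a within-class word probability:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical toy vocabulary split into frequency-based classes.
word_to_class = {"a": 0, "the": 0, "dog": 1, "cat": 1, "frisbee": 2}
class_words = {0: ["a", "the"], 1: ["dog", "cat"], 2: ["frisbee"]}

def word_probability(word, class_logits, word_logits_per_class):
    """P(w | h) = P(c(w) | h) * P(w | c(w), h): first pick the class,
    then pick the word inside that (much smaller) class."""
    c = word_to_class[word]
    p_class = softmax(class_logits)[c]
    in_class = class_words[c]
    p_word = softmax(word_logits_per_class[c])[in_class.index(word)]
    return p_class * p_word

rng = np.random.default_rng(1)
class_logits = rng.normal(size=3)                       # one logit per class
word_logits = {c: rng.normal(size=len(w)) for c, w in class_words.items()}
print(word_probability("dog", class_logits, word_logits))
```

The benefit is that the softmax over the full vocabulary is replaced by two much smaller softmaxes, which keeps large vocabularies tractable.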


In order to overcome the uncertainty about which word should be generated, the authors took the RNN model's output and combined it with the output of a Maximum Entropy language model [50], which was trained simultaneously. The context was kept short by limiting the Maximum Entropy model to look back only three words in all experiments. Learning used the BPTT (Backpropagation Through Time) algorithm to update the weights online. The activation function used for all units was the sigmoid, σ(s) = 1 / (1 + exp(−s)), and softmax was used for word prediction. The authors used the open-sourced RNN code of [11] and the Caffe framework [51] to implement the model, and started from a pre-trained 1000-class ImageNet [8] model rather than from scratch, to prevent overfitting.

The recurrent hidden state s supplies context based on the previously observed words. v represents the set of observed (assumed constant) visual features, which help in making an informed selection of words; for example, if a girl was detected, the probability of the word "girl" appearing next automatically gets higher. U represents the hidden recurrent layer used to reconstruct the visual features from the given words (so that the bi-directional mapping can be facilitated). w_t is the next word predicted using this visual hidden layer, i.e. the network utilises its visual memory u, along with the currently observed visual features v, to make the next word prediction. Before any words are observed, U makes essentially random guesses of the visual features; however, as more words are predicted, the visual feature estimates are revised and the predicted words more closely depict the actual scene.

Fig. 5. 1. The part of the model needed for generating sentences from visual features and vice versa. 2. Sentences to visual features. 3. Visual features to sentences.

2.3.4. Evaluation metrics
Perplexity (PPL), BLEU, METEOR (METR), Human Subjects, Recall@1,5,10.

2.3.5. Dataset
PASCAL 1K, Flickr 8K and 30K, MS COCO.

2.3.6. Results
Figure 6 shows captions generated by this method.

2.4. Object Detection (CNN) + Caption Generation Model (RNN)

This technique is discussed using the work of Vinyals, Toshev, Bengio and Erhan, Show and tell: A neural image caption generator [48], which was published in 2014.

2.4.1. Problems faced till now
Text generation in previous works was rigid and excessively handcrafted. It could not create descriptions of previously unobserved arrangements of objects, even if the individual objects had been detected in the training set.

2.4.2. Overview
An end-to-end system combining sub-networks for object detection and caption generation was proposed. This neural network was trained extensively using stochastic gradient descent and described the subject matter of an image using accurately built English sentences.

2.4.3. Model/Methodology
The approach is based on a neural, probabilistic architecture that produces a caption for an input image by applying the principle of translation (similar to how we translate text between two languages). The model maximizes the probability of the correct description:

θ∗ = arg max_θ Σ_{(I,S)} log p(S|I; θ)    (1)

Here, θ represents the model parameters, I the input image, and S the generated sentence. The chain rule is then applied to compute the joint probability over all the words of the sentence.


Fig. 6. Captions generated as in [47] L to R: A train is stopped at a train station; A group of people standing on a snow covered slope; A group of people that are standing in front of a building.

Fig. 7. Captions generated as in [48] L to R: A red motorcycle parked on the side of the road; A group of young people playing a game of frisbee; Two dogs play in the grass.

For a sentence S_0, …, S_N,

log p(S|I) = Σ_{t=0}^{N} log p(S_t | I, S_0, …, S_{t−1})    (2)

A fixed-length hidden state h_t summarizes the words observed up to step t − 1 and is updated by a non-linear function whenever a new input x_t is seen:

h_{t+1} = f(h_t, x_t)    (3)

This non-linear function f was specified by an LSTM network which took words and the image as inputs x_t, and was trained to predict one word of the sentence at a time given the observed image and all the preceding words, i.e. p(S_t | I, S_0, …, S_{t−1}). The loss is the sum of the negative log likelihoods of the correct word at each time step, and was minimized with respect to all parameters of the LSTM, the word embeddings W_e and the top layer of the convolutional neural network:

L(I, S) = − Σ_{t=1}^{N} log p_t(S_t)    (4)

This paper used beam search with a beam of size 20 for decoding. The authors also experimented with greedy search (beam size equal to 1), only to find that it degraded the results by an average of 2 BLEU points; the other decoding technique explored was sampling.

2.4.4. Evaluation metrics
BLEU-4, METEOR, CIDEr. Ranking metric: Recall@k (@1 and @10).

2.4.5. Dataset
PASCAL VOC 2008, Flickr8k, Flickr30k, MSCOCO, SBU.

2.4.6. Results
NIC performed better than various other approaches, e.g. Tri5Sem, Im2Text, BabyTalk, SOTA, etc., and was quite close to the ground truth. Figure 7 shows captions generated by this method.
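To illustrate the beam search decoding mentioned above, here is a minimal, generic sketch (not the NIC implementation; `log_prob_next`, the token names and the beam width are illustrative assumptions) that keeps the k most probable partial captions at each step:

```python
import math

def beam_search(log_prob_next, start="<s>", end="</s>", beam_size=3, max_len=10):
    """Generic beam search: log_prob_next(prefix) must return a
    dict {token: log_probability} for the next token given the prefix."""
    beams = [([start], 0.0)]                       # (token sequence, total log prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end:                     # finished captions are kept as-is
                candidates.append((seq, score))
                continue
            for tok, lp in log_prob_next(seq).items():
                candidates.append((seq + [tok], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0]

# Toy language "model": always prefers one fixed caption.
def toy_model(prefix):
    caption = ["<s>", "a", "dog", "plays", "</s>"]
    nxt = caption[len(prefix)] if len(prefix) < len(caption) else "</s>"
    return {nxt: math.log(0.7), "cat": math.log(0.2), "</s>": math.log(0.1)}

print(beam_search(toy_model))
```

With `beam_size=1` this reduces to the greedy search that the authors found to be about 2 BLEU points worse.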


Fig. 8. Captions generated as in [28] L to R: A square with burning street lamps and a street in the foreground; Tourists are sitting at a long table with a white table cloth and are eating; A blue sky in the background.

Fig. 9. Overview of the fully convolutional localization network for dense captioning. The localization layer proposes regions and smoothly extracts a batch of corresponding activations using bilinear interpolation.

2.5. Deep RNN + Deep CNN + multimodal layer interactions

This technique is discussed using the work of Mao et al. – Explain images with multimodal recurrent neural networks [28] – which was published in 2014.

2.5.1. Problems faced till now
Earlier works extracted features for sentences and pictures and mapped them into a common semantic embedding space. These strategies addressed tasks such as retrieving sentences given an image, or retrieving images given a sentence, but only when these already exist in the database; they lacked the flexibility to caption new pictures consisting of previously unseen objects and scenes.

2.5.2. Overview
The model contains two sub-networks: a deep RNN (Recurrent Neural Network) for sentences and a deep CNN (Convolutional Neural Network) [20] for images. These two sub-networks communicate with each other in a multimodal layer, and the complete model is known as the m-RNN model. It produces the probability distribution of the next word given the previous words and the picture; sampling from this distribution generates image descriptions.

2.5.3. Model/Methodology
Each time frame has six layers: the input word layer, two word-embedding layers, the recurrent layer, the multimodal layer (where the connection between the two sub-networks is made) and, finally, the softmax layer. The perplexity of a sentence and the cost of the model are

log_2 PPL(w_{1:L}|I) = −(1/L) Σ_{n=1}^{L} log_2 P(w_n | w_{1:n−1}, I)    (5)

C = (1/N) Σ_{i=1}^{N} L_i · log_2 PPL(w^{(i)}_{1:L_i} | I^{(i)}) + ‖θ‖²_2    (6)

where PPL is the perplexity of a sentence, C is the cost computed for the model, and the final term is a regularizer on the parameters θ.
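A minimal sketch of the per-sentence perplexity in Eq. (5) (an illustration only, not the authors' code; the toy probabilities are invented) computes the average negative log2-probability that a model assigns to each ground-truth word:

```python
import math

def sentence_log2_perplexity(word_probs):
    """Eq. (5): average negative log2-probability of the ground-truth words,
    where word_probs[n] = P(w_n | w_{1:n-1}, I) as produced by a captioning model."""
    L = len(word_probs)
    return -sum(math.log2(p) for p in word_probs) / L

# Toy example: probabilities the model assigned to each word of one caption.
probs = [0.4, 0.25, 0.6, 0.1, 0.5]
log2_ppl = sentence_log2_perplexity(probs)
print(f"log2 PPL = {log2_ppl:.3f}, PPL = {2 ** log2_ppl:.2f}")
```

Lower perplexity means the model finds the ground-truth caption less surprising, which is why it doubles as both a training signal and an evaluation metric here.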


Sentence generation starts from the start sign ##START##: the model calculates the probability distribution over the upcoming word, given the previous words and the picture, and the next word can be picked by sampling from this distribution. In practice, the word with the maximum probability was chosen, since this performed slightly better than sampling. The picked word was then fed back into the model, and the process continued until the end sign ##END## was output by the model. For image retrieval, the top-ranked images were returned, where ranking was based on their perplexity with respect to the query sentence; sentence retrieval used a normalized probability for each sentence.

2.5.4. Evaluation metrics
Sentence perplexity and BLEU scores (B-1, B-2, B-3), R@K (K = 1, 5, 10) and Med r. For IAPR TC-12: recall accuracy curve, R@K (K = 1, 5, 10) and Med r.

2.5.5. Dataset
Flickr 8K; Flickr 30K; IAPR TC-12.

2.5.6. Results
This was the first work to incorporate an RNN in a deep multimodal architecture. Figure 8 shows captions generated by this method.

2.6. Object Detection (R-CNN) + Localization Layer + Caption Generation Model (RNN)

This technique is discussed using the work of Johnson, Karpathy and Li – DenseCap: Fully Convolutional Localization Networks for Dense Captioning [46].

2.6.1. Problems faced till now
Predictions of earlier region-based CNN-RNN models did not include context outside each region, and they were inefficient because each region had to be forwarded through the network independently. The localization layer was proposed because of these difficulties.

2.6.2. Overview
The paper combines work on object detection, image captioning and the processing of particular regions of the image.

2.6.3. Model/Methodology
The fully convolutional localization network for dense captioning was based on CNN-RNN image captioning models but also included a differentiable localization layer that could be inserted into the neural network to enable localized predictions over region proposals.


The CNN consisted of 13 layers of 3 × 3 convolutions and formed the input to the localization layer. The localization layer scored convolutional anchor boxes through regressed box transformations and confidence scores, and aligned the proposed regions to the ground-truth boxes; matched (positive) proposals had their confidence scores pushed up during training, while negative proposals had theirs pushed down. The recognition network then processed the features of each region produced by the localization layer: the features of each region were flattened into a vector and passed through fully connected layers, after which the position and confidence score of each region were refined.

2.6.4. Evaluation metrics
METEOR, mean Average Precision (AP).

2.6.5. Dataset
MSCOCO, YFCC100M, Visual Genome (VG) dataset.

2.6.6. Results
The FCLN model performed better than the Region RNN in both ranking and localization under all metrics; for example, the median rank reduced from 7 to 5 and localization recall at 0.5 IoU rose to 0.153. Figure 10 shows how captions are generated by this method.
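To illustrate how region proposals can be matched to ground-truth boxes, as described in Section 2.6.3, here is a generic sketch (not the DenseCap code; the IoU thresholds and box format are illustrative assumptions) that labels proposals as positive or negative by intersection-over-union:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def label_proposals(proposals, ground_truth, pos_thr=0.7, neg_thr=0.3):
    """Mark each proposal positive (1), negative (-1) or ignored (0)
    based on its best IoU with any ground-truth box."""
    labels = []
    for p in proposals:
        best = max(iou(p, g) for g in ground_truth)
        labels.append(1 if best >= pos_thr else (-1 if best <= neg_thr else 0))
    return labels

gt = [(10, 10, 50, 50)]
props = [(12, 8, 48, 52), (100, 100, 140, 140), (20, 20, 60, 60)]
print(label_proposals(props, gt))   # [1, -1, 0]
```

Positives then contribute to the confidence and box-regression losses, while ignored proposals (intermediate overlap) are left out of training.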


Fig. 10. The sequence of images shows the dense image captioning task using a model that generates rich and dense captions.

2.7. Semantic Alignment Models (R-CNN and B-RNN) + Description Generation Model (M-RNN)

This technique is discussed using the work of Karpathy and Li – Deep Visual-Semantic Alignments for Generating Image Descriptions [41] – which was published in 2015.

2.7.1. Problems faced till now
The focus of most works so far has been on condensing the elaborate visual content of an image into one single sentence, which is an unnecessary restriction.

2.7.2. Overview
This approach consists of two separate models: an alignment model that infers the latent alignment between contiguous groups of words in a sentence and the image regions they correspond to, and a second model that is trained on the inferred correspondences.

2.7.3. Model/Methodology
To detect objects in an input image, the alignment model used a Region Convolutional Neural Network (R-CNN). The CNN was pre-trained on ImageNet images and then fine-tuned on the 200 classes of the ImageNet detection challenge. In addition to the whole image, the 19 top detected locations were used, and objects were identified from the pixels inside each bounding box. A bidirectional recurrent neural network (BRNN) computed the word representations of the sentence. An image-sentence score S_kl aligned every word of a sentence to its single best image region. Since the ultimate goal was to associate snippets of text, rather than single words, with each bounding box, a Markov Random Field (MRF) over latent alignment variables was used to produce image regions explained by segments of text (e.g. "wooden table" for a table, "messy pile of documents" for documents).
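The image-sentence score mentioned above can be illustrated with a small sketch (a generic illustration rather than the exact formulation of [41]; the embedding dimensions and the max-over-regions scoring rule are stated assumptions): each word embedding is matched to its best-scoring region embedding and the matches are summed.

```python
import numpy as np

def image_sentence_score(region_embs, word_embs):
    """Score an (image k, sentence l) pair by letting every word pick
    its best-matching region and summing those similarities."""
    sims = word_embs @ region_embs.T          # sims[t, i] = word_t . region_i
    return np.maximum(sims, 0).max(axis=1).sum()

rng = np.random.default_rng(2)
regions = rng.normal(size=(20, 64))           # 19 detected regions + the whole image
words = rng.normal(size=(7, 64))              # one embedding per word of the sentence
print(image_sentence_score(regions, words))
```

A high score therefore means that every fragment of the sentence finds some region of the image that supports it, which is the intuition behind the alignment model.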

The M-RNN model, trained on the region-level annotations produced by the alignment model, took as input a sequence of vectors together with the image I. Using a recurrence relation it computed a sequence of hidden states and, consequently, a sequence of outputs, thereby generating dense descriptions of images.

2.7.4. Evaluation metrics
BLEU-1,2,3,4, METEOR, CIDEr. Ranking metrics: Recall@1,5,10 and Med r.

2.7.5. Dataset
Flickr8k, Flickr30k, MSCOCO.

2.7.6. Results
This model used very few hard-coded assumptions to formulate captions for individual image regions using conventional datasets of images and sentences. Figure 12 shows captions generated by this method.

2.8. Object Detection (CNN) + Description Generation Model (RNN) + External Knowledge

This technique is discussed using the work Image Captioning and Visual Question Answering Based on Attributes and External Knowledge – Wu et al. [42] – which was published in 2016.


Fig. 11. Flowchart of proposed description generation model.

Fig. 12. Captions generated as in [41] L to R: A man in black shirt is playing guitar; Two young girls are playing with lego toy; Construction worker in orange safety vest is working on road.

2.8.1. Problems faced till now
Previous papers did not take external knowledge into account when generating captions. The importance of an intermediate attribute prediction layer was also neglected by almost all previous work.

2.8.2. Overview
An intermediate attribute prediction layer, neglected by almost all previous work, is introduced into the predominant CNN-LSTM framework.

2.8.3. Model/Methodology
Attributes predicted by the CNN-based attribute prediction model were used to generate captions for the image; in image captioning, the gaps in the caption templates were filled with the predicted attributes. The caption generation model was trained by maximizing the probability of the correct description of the image, using the semantic attribute prediction vector V_att rather than the image features directly. For question answering, the predicted attributes and generated captions were combined with external knowledge from a knowledge base and then fed to the LSTM to provide answers to various questions. Figure 13 shows captions generated by the attribute-based captioning model combined with external knowledge.

2.8.4. Evaluation metrics
BLEU, METEOR and CIDEr.

2.8.5. Dataset
Flickr8k, Flickr30k and Microsoft COCO.

2.8.6. Results
Att-RegionCNN + LSTM is so far one of the most suitable approaches for generating image captions. Figure 14 shows question answering examples using the method of [42].
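As a rough illustration of the pipeline in Section 2.8.3 (a hypothetical sketch, not the system of [42]; the attribute list, the toy knowledge base and all function names are invented), predicted attributes can be used both to condition the generator and to pull in related facts that are appended to its input:

```python
# Hypothetical toy components standing in for the CNN attribute predictor,
# the external knowledge base and the LSTM input of the real system.
attribute_scores = {"bat": 0.91, "umpire": 0.77, "pitch": 0.64, "dog": 0.05}

knowledge_base = {
    "pitch": "A pitch is a place used to play sports such as cricket.",
    "umpire": "An umpire reviews the match and enforces the rules.",
}

def top_attributes(scores, k=3, threshold=0.5):
    """Keep the k most confident attributes above a threshold (the V_att idea)."""
    kept = [a for a, s in sorted(scores.items(), key=lambda x: -x[1]) if s >= threshold]
    return kept[:k]

def build_generator_input(scores):
    """Concatenate attributes with retrieved facts, mimicking the idea of
    conditioning the caption/answer generator on both."""
    attrs = top_attributes(scores)
    facts = [knowledge_base[a] for a in attrs if a in knowledge_base]
    return {"attributes": attrs, "external_knowledge": " ".join(facts)}

print(build_generator_input(attribute_scores))
```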

3. Datasets

3.1. PASCAL 1K [43]
The images in this dataset are a subset of the images collected for the PASCAL VOC Challenge. It has 20 categories of images; for each category, 50 images are chosen at random, along with descriptions generated through Amazon's Mechanical Turk.


Fig. 13. Image Caption Generated: A man with bat readies to swing at the pitch while the umpire looks on. External Knowledge: A pitch is a place used to play various sports such as cricket. The umpire is present to review the match.

Fig. 14. Examples where the attribute-Region-CNN + LSTM gives the most appropriate answer while the baseline model gives wrong answer. Figure from paper [42].

3.2. Flickr8K & 30K [10]
There are 8,000 and 31,783 images, respectively, in the Flickr 8K and 30K datasets, which are gathered from Flickr. The majority of these images show human beings participating in various tasks. Every image has 5 sentences describing it. These datasets are split for training, testing and validation following commonly accepted standards.

3.3. MS COCO [27]
The Microsoft COCO dataset contains 82,000 training images and 40,000 validation images, each complemented by 5 descriptive sentences. These images are sourced from Flickr by searching for common object categories, and they generally contain a variety of objects along with important contextual information.

3.4. IAPR TC-12 [4]
This dataset contains 20,000 still natural pictures taken at various locations around the world, covering categories such as sports, actions, cities, shapes, animals, people and many other aspects of modern life. Captions accompany every image in three languages: English, German and Spanish. The 20,000 images are of high resolution, and strict image selection rules were followed while choosing images for this dataset.

3.5. Visual Genome (VG) [52]
This dataset was built by experts mainly from Stanford and Yahoo. It is a knowledge base representing a sustained effort to relate image concepts to their natural language descriptions in a structured manner. It is currently the largest dataset of image-based questions and answers, with approximately 1.7 million question-answer pairs; every image is supplemented with an average of 17 question-answer pairs.

4. Evaluation and ranking metrics

We have prepared tables for the various datasets, in chronological order, showing the approaches used for description generation and their corresponding scores on a variety of metrics such as BLEU-1, 2, 3, 4, METEOR and CIDEr [45].


Table 1 Image captioning techniques and their scores on IAPR TC-12 dataset Approaches

Year

BACK-OFF GT2 BACK-OFF GT3 LBL Gupta et al. [56] Gupta & Mannem [19] MLBL-B-DeCAF [13] MLBL-F-DeCAF [13] m-RNN Baseline [28] m-RNN [28]

2007 2007 2007 2012 2012 2014 2014 2014 2014

B-1 IAPRTC12 32.3 31.2 32.7 15 33 37.3 36.1 31.34 39.51

B-2

B-3

PPL

14.5 13.1 14.4 6 18 18.7 17.6 11.68 18.28

5.9 5.9 1 7 9.8 9.8 9.2 8.03 13.11

55.4 55.6 20.1

24.7 21.8 7.77 6.92

Table 2 Image captioning techniques and their scores on PASCAL dataset Approaches

Year

B-1 PASCAL 25 25 32.7

BabyTalk [15] Im2Text [17] LBL Midge [21] RNN RNN + IF RNN + IF + FT Tri5Sem [22] Microsoft Bidirectional Retrieval Microsoft Bidirectional Retrieval + FT TreeTalk [26] m-RNN [28] NIC [48]

2011 2011 2012 2012 2013 2013 2013 2013 2014

2014 2014 2015

25 25 59

HUMAN

2015

70

B-2

B-3

PPL

0.49

9.69

14.4 2.89 2.79 10.16 10.18

1 8.80 10.08 16.43 16.45

36.79 30.04 29.43

10.48

16.69

27.97

10.77

16.87

26.95

20.1

25

2014

Table 3 Image captioning techniques and their scores on Flikr8k dataset Approaches

Year

B-1 B-2 Flikr8k

RNN RNN + IF Tri5Sem [22] Microsoft Bidirectional Retrieval m-RNN [28] MNLM [25] Mao et al. [28] Google NIC [48] Chen and Zitnik [47] NIC [48] Karpathy et al. [Deep Visual Semantic] Xu et al. (Hard-Attention) Att-SVM + LSTM Att-GlobalCNN + LSTM Att-RegionCNN + LSTM Att-GT + LSTM

2013 2013 2013 2014 2014 2014 2014 2014 2014 2015 2015 2015 2016 2016 2016 2016

63 57.9 67 73 72 74 76

HUMAN

2016

70

B-3

B-4

BLEU

M

PPL

4.86 12.04

11.81 17.10

21.88 20.43

14.10

17.97

19.24

48 58 51 58 63

28 41

23 27

24.39 14.1

38.3 46 53 53 54 57

24.5 31 38 38 38 41

16.0 21 26 27 27 29

12.63 12.63 12.60 12.52 22.51

26.31


Table 4 Image captioning techniques and their scores on Flickr30k dataset Approaches

Year

B-1

RNN RNN + IF Microsoft Bidirectional Retrieval m-RNN [28] MNLM [25] Mao et al. [28] Google NIC [48] Chen and Zitnik [47] LRCN [35] NIC [48] Karpathy and Li [41] Xu et al. (Hard-Attention) Att-SVM + LSTM Att-GlobalCNN + LSTM Att-RegionCNN + LSTM Att-GT + LSTM

2013 2013 2014 2014 2014 2014 2014 2014 2014 2015 2015 2015 2016 2016 2016 2016

58.8 66 57.3 67 68 70 73 78

HUMAN

2016

68

B-2 Flikr30k

55 56 55 66.3

B-3

B-4

24 42.3

20 27.7

39.1

25.1

18.3 12.6 16.5

36.9 44 49 50 55 57

24 30 33 35 40 42

15.7 20 23 27 28 30

BLEU

M

PPL

6.29 12.59 12.60

12.34 15.56 16.42

26.94 23.74 22.51

35.11

19.62

23.76

Table 5 Image captioning techniques and their scores on MSCOCO dataset Approaches

Year

Random Nearest Neighbour RNN RNN + IF RNN + IF + FT Microsoft Bidirectional Retrieval Microsoft Bidirectional Retrieval + FT Google NIC [47] Chen and Zitnik [47] LRCN [35] NIC [48] Karpathy and Li [41] Xu et al. (Hard-Attention) Att-SVM + LSTM Att-GlobalCNN + LSTM Att-RegionCNN + LSTM Att-GT + LSTM

2005 2006 2013 2013 2013 2014 2014 2014 2014 2014 2015 2015 2015 2016 2016 2016 2016

HUMAN

2016

B-1 B-2 MSCOCO

B-3

B-4

48.0

16.6

4.6 9.9

28.1

66.6

46.1

32.9

62.8

44.2

30.4

62.5 72 69 72 74 80

45 50 52 54 56 64

32.1 36 38 40 42 50

The results compiled in this manner allow us to see clearly how image captioning techniques have evolved over the years and to observe the large positive change in evaluated scores. BLEU (bilingual evaluation understudy) [43] and METEOR (Metric for Evaluation of Translation with Explicit ORdering) [44] are metrics originally designed for evaluating machine translation output. R@K denotes the recall rate of the first retrieved ground-truth sentence or image. Some cells in the tables are left empty because the corresponding scores were not reported. Tables 1, 2, 3, 4, 5, 6 and 7 compare the scores obtained on the various datasets by the different techniques in chronological order.

BLEU

M

C 5.1 36.5

4.63 16.60 16.77 18.35 18.99

9.0 15.7 11.47 19.24 19.41 20.04 20.42

18.96 15.39 14.90 14.23 13.98

24.6 19

20.4

27.7 23 25 28 30 31 40

23.7 19.5 23 23 25 26 28

85.5 66.0

24.94

85.4

21.7

20.19

PPL

82 83 94 107

12.62 11.39 10.49 9.6

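For readers unfamiliar with how the BLEU-style scores reported in these tables are computed, the following simplified sketch (an illustration only; real BLEU as defined in [43] uses clipped n-gram counts up to n = 4, a corpus-level brevity penalty and smoothing) computes modified n-gram precision and a brevity penalty for a single candidate caption:

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(candidate, reference, max_n=2):
    """Very small BLEU-like score for one candidate/reference pair:
    geometric mean of clipped n-gram precisions times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c_counts, r_counts = Counter(ngrams(cand, n)), Counter(ngrams(ref, n))
        overlap = sum(min(c, r_counts[g]) for g, c in c_counts.items())
        total = max(1, sum(c_counts.values()))
        precisions.append(max(overlap, 1e-9) / total)   # avoid log(0)
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(1, len(cand)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(simple_bleu("a dog plays in the grass", "two dogs play in the grass"))
```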

5. Challenges

The possibility of developing intelligent computer programs that can correctly interpret and caption photos has intrigued machine learning experts for decades, yet significant progress in this field has been made only in the last few years. We have come a long way from template-based techniques to deep learning approaches with attention models, but many challenges remain.


Table 6 Image captioning techniques and their recall scores on Flickr8k dataset Flickr8k APPROACHES Random DeFrag [24] m-RNN [28] MNLM [25] Socher-decaf [53] Socher-avg-rcnn [53] DeViSE-avg-rcnn [55] DeepFE decaf [54] DeepFE-rcnn [24] m-RNN-decafe [28] NIC [48] MNLM [25] MNLM [25] (oxford-net)

YEAR 2005 2014 2014 2014 2014 2014 2014 2014 2014 2014 2015 2014 2014

R@1 0.1 13 15 18 4.5 6 4.8 5.9 12.6 14.5 20 13.5 18

Image Annotations R@5 R@10 0.5 1 33 44 37.2 49 55 18 28.6 22.7 34.0 16.5 27.3 19.2 27.3 32.9 44 37.2 48.5 61 36.2 45.7 40.9 55

med r 631 14 11 8 32 23 28 34 14 11 6 13 8

R@1 0.1 10 12 13 6.1 6.6 5.9 5.2 9.7 11.5 19 10.4 12.5

Image Search R@5 R@10 0.5 1 30 43 31 42 52 18.5 29 21.6 31.7 20.1 29.6 17.6 26.5 29.6 42.5 31.0 42.4 64 31 43.7 37 51.5

med r 500 15 15 10 29 25 29 32 15 15 5 14 10

Table 7 Image captioning techniques and their recall scores on Flikr30k dataset Flikr30k APPROACHES Random DeFrag [24] m-RNN [28] MNLM [25] DeViSE-avg-rcnn [54] DeepFE-rcnn [24] m-RNN-decafe [28] SDT-RNN (Socher et al. [57]) Kiros et al. [25] Donahue et al. [35] Vinyals et al. NIC [48] DeFrag (Karpathy et al. [24]) DepTree edges(Karpathy et al.) BRNN (Karpathy et al.) MNLM [25] MNLM [25](oxfordnet)

YEAR 2005 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 2015 2015 2015 2015 2014 2014

R@1 0.1 16 18 23 4.8 16.4 18.4 9.6 14.8 17.5 23 17 19.2 20 22.2 14.8 23

Image Annotations R@5 R@10 0.6 1.1 40.2 55 40.2 51 63 16.5 27.3 40.2 54.7 40.2 50.9 29.8 41.1 39.2 50.9 40.3 50.8 63 56 44.5 58.0 46.6 59.4 48.2 61.4 39.2 50.9 50.7 62.9

One challenge is the prudent use of an attention mechanism that describes individual components of an image, rather than just the image as a whole, in order to create a holistic description of the complete picture. A further challenge is to incorporate more knowledge than just what the model is trained on; this includes understanding the context of the image and incorporating world knowledge while generating captions, just as humans do. Only recently have a few researchers started working on this issue, and significant improvements have not yet surfaced. Better performance can also be expected by choosing a superior image encoder, fine-tuning it and setting up ensemble models. Finally, the performance of a system can be judged better if we have better evaluation and ranking metrics.

med r 631 8 10 5 28 8 10 16 10 9 5 7 6 5.4 4.8 10 5

R@1 0.1 10 13 17 5.9 10.3 12.6 8.9 11.8 17 17 12.9 15 15.2 11.8 16.8

Image Search R@5 R@10 0.5 1 31.4 45 31.2 42 57 20.1 29.6 31.4 44.5 31.2 41.5 29.8 41.1 34.0 46.3

35.4 36.5 37.7 34 42

57 57 47.5 48.2 50.5 46.3 56.5

med r 500 13 16 8 29 13 16 16 13 8 7 10.8 10.4 9.2 13 8

While most of the approaches discussed above use BLEU scores to compare their results to the ground truth, treating this metric as the benchmark of evaluation, and while it has some obvious advantages, a number of shortcomings have been noticed: BLEU cannot deal with languages lacking word boundaries, and it is biased towards shorter translations. We could use other automated metrics involving human effort, such as HyTER, but they are still only approximations.

6. Future scope

As we have seen, the field of image captioning has been researched for decades; however, there is still immense scope left to explore.


Though most recent studies have been fairly successful in describing images correctly, human-level accuracy and descriptiveness still seem a far-fetched idea. This all boils down to one thing: knowledge. Humans, while thinking of a caption, use the entire knowledge base they have been acquiring for years; the emotions, the extra worldly knowledge and the power of expression that humans possess are enough for any human to beat a machine at this so-simple-for-humans task. The need of the future is therefore an excellent knowledge base, together with the hardware power to train a model to use that entire knowledge feasibly, so that the machine can develop a rich multidimensional context and answer any open-ended question related to the image, rather than relying only on the attributes detected by a computer vision system. This is why the major search-engine corporations, such as Google and Microsoft (Bing), hold the best cards for turning their huge databases into knowledge bases and realizing the future of this technology. Microsoft's CaptionBot is an excellent example of this initiative, combining emotion recognition, computer vision and, most importantly, the power of Bing to give impressive results.


7. Conclusion


We classified and discussed 8 major approaches used for image captioning according to the order in which they developed. We discussed how and why each approach evolved so as to solve the shortcomings of the previous one. We then explained each of the approaches in detail with the help of a particular study and, lastly, compared the results of various experiments conducted so far using popular metrics such as BLEU, METEOR and CIDEr. We were able to clearly observe the large positive difference in the scores over time.


Acknowledgments


A very sincere thanks to Shubham Thakkar, Saumya Gupta and Shubham Singh who helped us throughout the formulation of this survey paper. This would not have been possible without their constant support.


References

[1] G.A. Miller, WordNet: A lexical database for English, Communications of the ACM 38(11) (1995), 39–41.
[2] S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Computation, 1997.
[3] E. Reiter and R. Dale, Building Natural Language Generation Systems, Cambridge University Press, 2000.
[4] M. Grubinger, P. Clough, H. Müller and T. Deselaers, The IAPR TC-12 benchmark: A new evaluation resource for visual information systems.
[5] L.-J. Li and L. Fei-Fei, What, where and who? Classifying events by scene and object recognition, ICCV, 2007.
[6] L.-J. Li, R. Socher and F.-F. Li, Towards total scene understanding: Classification, annotation and segmentation in an automatic framework, in: Computer Vision and Pattern Recognition, CVPR, IEEE Conference on, IEEE, 2009, pp. 2036–2043.
[7] A. Farhadi, I. Endres, D. Hoiem and D. Forsyth, Describing objects by their attributes, Proceedings of CVPR, 2009.
[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and F.-F. Li, ImageNet: A large-scale hierarchical image database, in: Computer Vision and Pattern Recognition, CVPR 2009, IEEE Conference on, IEEE, 2009, pp. 248–255.
[9] A. Farhadi, M. Hejrati, M.A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier and D. Forsyth, Every picture tells a story: Generating sentences from images, in ECCV, 2010.
[10] C. Rashtchian, P. Young, M. Hodosh and J. Hockenmaier, Collecting image annotations using Amazon's Mechanical Turk, in NAACL HLT Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, 2010, pp. 139–147.
[11] T. Mikolov, M. Karafiat, L. Burget, J. Cernocky and S. Khudanpur, Recurrent neural network based language model, in INTERSPEECH, 2010.
[12] S. Petrov, Berkeley parser, GNU General Public License v.2, 2010.
[13] R. Kiros, R. Zemel and R. Salakhutdinov, Multimodal neural language models, in NIPS Deep Learning Workshop, 2013.
[14] S. Li, G. Kulkarni, T.L. Berg, A.C. Berg and Y. Choi, Composing simple image descriptions using web-scale n-grams, in Conference on Computational Natural Language Learning, 2011.
[15] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A.C. Berg and T.L. Berg, Baby talk: Understanding and generating simple image descriptions, in CVPR, 2011.
[16] Y. Yang, C.L. Teo, H. Daume III and Y. Aloimonos, Corpus-guided sentence generation of natural images, in EMNLP, 2011.
[17] V. Ordonez, G. Kulkarni and T.L. Berg, Im2Text: Describing images using 1 million captioned photographs, in NIPS, 2011.
[18] Y. Feng and M. Lapata, Automatic caption generation for news images, IEEE Transactions on Pattern Analysis and Machine Intelligence 35(4) (April 2013), 797–812. doi: 10.1109/TPAMI.2012.118.
[19] A. Gupta and P. Mannem, From image annotation to image description, in Neural Information Processing, Springer, 2012.
[20] A. Krizhevsky, I. Sutskever and G.E. Hinton, ImageNet classification with deep convolutional neural networks, in NIPS, 2012.
[21] M. Mitchell, X. Han, J. Dodge, A. Mensch, A. Goyal, A. Berg, K. Yamaguchi, T. Berg, K. Stratos and H. Daume III, Midge: Generating image descriptions from computer vision detections, in EACL, Association for Computational Linguistics, 2012, pp. 747–756.



[22] M. Hodosh, P. Young and J. Hockenmaier, Framing image description as a ranking task: Data, models and evaluation metrics, JAIR 47 (2013).
[23] R. Kiros, R. Zemel and R. Salakhutdinov, Multimodal neural language models, in NIPS Deep Learning Workshop, 2013.
[24] A. Karpathy, A. Joulin and F.-F. Li, Deep fragment embeddings for bidirectional image sentence mapping, NIPS, 2014.
[25] R. Kiros, R. Salakhutdinov and R.S. Zemel, Unifying visual-semantic embeddings with multimodal neural language models, arXiv:1411.2539 (2014).
[26] P. Kuznetsova, V. Ordonez, T. Berg and Y. Choi, TreeTalk: Composition and compression of trees for image descriptions, ACL 2(10) (2014).
[27] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár and C.L. Zitnick, Microsoft COCO: Common objects in context, arXiv:1405.0312 (2014).
[28] J. Mao, W. Xu, Y. Yang, J. Wang and A. Yuille, Explain images with multimodal recurrent neural networks, arXiv:1410.1090 (2014).
[29] R. Socher, A. Karpathy, Q.V. Le, C.D. Manning and A.Y. Ng, Grounded compositional semantics for finding and describing images with sentences, TACL, 2014.
[30] H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Dollar, J. Gao, X. He, M. Mitchell, J. Platt et al., From captions to visual concepts and back, arXiv preprint arXiv:1411.4952 (2014).
[31] C. Szegedy, S. Reed, D. Erhan and D. Anguelov, Scalable, high-quality object detection, arXiv preprint arXiv:1412.1441 (2014).
[32] R. Girshick, J. Donahue, T. Darrell and J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, 2014.
[33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A.C. Berg and F.-F. Li, ImageNet Large Scale Visual Recognition Challenge, International Journal of Computer Vision (IJCV), April 2015, p. 142.
[34] X. Chen and C. Lawrence Zitnick, Mind's Eye: A recurrent visual representation for image caption generation, in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2015.
[35] J. Donahue, L.A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko and T. Darrell, Long-term recurrent convolutional networks for visual recognition and description, in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2015.
[36] J. Mao, W. Xu, Y. Yang, J. Wang and A. Yuille, Deep captioning with multimodal recurrent neural networks (m-RNN), in Proc. Int. Conf. Learn. Representations, 2015.
[37] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle and A. Courville, Describing videos by exploiting temporal structure, in Proc. IEEE Int. Conf. Comp. Vis., 2015.
[38] J. Devlin, H. Cheng, H. Fang, S. Gupta, L. Deng, X. He, G. Zweig and M. Mitchell, Language models for image captioning: The quirks and what works, arXiv preprint arXiv:1505.01809 (2015).
[39] K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, in Proc. Int. Conf. Learn. Representations, 2015.
[40] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich, Going deeper with convolutions, in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2015.



[41] A. Karpathy and F.-F. Li, Deep visual-semantic alignments for generating image descriptions, arXiv:1412.2306v2 (2015).
[42] Q. Wu, P. Wang, C. Shen, A. Dick and A.V.D. Hengel, Ask me anything: Free-form visual question answering based on knowledge from external sources, in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
[43] K. Papineni, S. Roukos, T. Ward and W.-J. Zhu, BLEU: A method for automatic evaluation of machine translation, in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2002, pp. 311–318.
[44] S. Banerjee and A. Lavie, METEOR: An automatic metric for MT evaluation with improved correlation with human judgments, in Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, pp. 65–72.
[45] R. Vedantam, C.L. Zitnick and D. Parikh, CIDEr: Consensus-based image description evaluation, CVPR, 2015.
[46] J. Johnson, A. Karpathy and F.-F. Li, DenseCap: Fully convolutional localization networks for dense captioning, CoRR abs/1511.07571 (2015).
[47] X. Chen and C.L. Zitnick, Mind's eye: A recurrent visual representation for image caption generation, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 2422–2431.
[48] O. Vinyals, A. Toshev, S. Bengio and D. Erhan, Show and tell: A neural image caption generator, arXiv preprint arXiv:1411.4555 (2014).
[49] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus and Y. LeCun, OverFeat: Integrated recognition, localization and detection using convolutional networks, ICLR, 2014.
[50] T. Mikolov, A. Deoras, D. Povey, L. Burget and J. Cernocky, Strategies for training large scale neural network language models, in Automatic Speech Recognition and Understanding (ASRU), 2011 IEEE Workshop on, IEEE, 2011, pp. 196–201.
[51] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama and T. Darrell, Caffe: Convolutional architecture for fast feature embedding, arXiv preprint arXiv:1408.5093 (2014).
[52] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D.A. Shamma, M. Bernstein and F.-F. Li, Visual Genome: Connecting language and vision using crowdsourced dense image annotations, arXiv:1602.07332 (2016).
[53] R. Socher, Q. Le, C. Manning and A. Ng, Grounded compositional semantics for finding and describing images with sentences, in NIPS Deep Learning Workshop, 2013.
[54] A. Frome, G.S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov et al., DeViSE: A deep visual-semantic embedding model, in Advances in Neural Information Processing Systems, 2013, pp. 2121–2129.
[55] A. Karpathy, A. Joulin and F.-F. Li, Deep fragment embeddings for bidirectional image sentence mapping, arXiv preprint arXiv:1406.5679 (2014).
[56] A. Gupta, Y. Verma and C. Jawahar, Choosing linguistics over vision to describe images, in AAAI, 2012.
[57] R. Socher, A. Karpathy, Q.V. Le, C.D. Manning and A.Y. Ng, Grounded compositional semantics for finding and describing images with sentences, TACL, 2014.
