REVIEW 1 IMAGE CAPTIONING USING NEURAL NETWORK

SUBMITTED BY – AISHNA MISHRA 17BEE0323 (EMAIL – [email protected] , Contact – 7999891718) KISLAY SINHA 17BEE0113 (EMAIL- [email protected], Contact- 9131887993) MELVIN 17BEE

GUIDED BY Prof. SHARMILA A

IMAGE CAPTIONING USING TENSORFLOW

Introduction

Through our project, we aim to develop code for image-to-sentence generation. Artificial neural networks have enabled computers to automatically generate captions for an image. In our project, we focus on an application of neural networks that bridges vision and natural language: using natural language processing technologies to describe the visual world captured in images. Image captioning not only identifies the different types of objects present in an image but also expresses the relationships between them in a natural language such as English.

What Does Image Captioning Entail?

What do you think about when you see this picture?

It could be:
1. A man and a girl sitting.
2. A man and a girl eating.
3. A man in a black shirt and a girl in an orange dress eating while sitting on a sidewalk.
The human brain can generate numerous captions for an image within a few seconds. When this is done with the help of neural networks and deep learning, we call it image captioning. However, it is not simple; it requires a combination of complex and advanced techniques.

WHY IMAGE CAPTIONING USING DEEP NEURAL NETWORKS?

Older approaches such as template-based and retrieval-based methods have been used to solve the problem of image captioning. Nevertheless, these methods fail to generalise because they caption images using a fixed set of visual categories and hence do not handle new images. This suggests that they have limited applicability compared to a human being, who can generate numerous captions for the same image. We therefore need methods general enough to create captions and descriptions for any image. With the introduction of machine learning into the arena of neural networks, computers have advanced in visual and language processing beyond the traditional methods. Several models have been used to implicitly learn a common embedding by directly encoding and decoding the two modalities:
1. Convolutional neural network (CNN)
2. Long short-term memory (LSTM)
3. Recurrent neural network (RNN)
Several datasets are used to test new methods (a small caption-loading sketch follows this list):
1. Flickr8k – consists of 8,000 images with 5 captions for each image.
2. Flickr30k – consists of 31,783 images with 5 full sentence-level captions for each image.
3. MSCOCO – consists of 82,783 images with 5 captions for each image.
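To make the structure of these datasets concrete, here is a small Python sketch that groups the five reference captions per image, assuming the common Flickr8k caption file layout (a text file where each line is `image_name#index<TAB>caption`); the file name used here is a placeholder.

```python
from collections import defaultdict

def load_flickr8k_captions(token_file="Flickr8k.token.txt"):
    """Group the five reference captions of each image, assuming the usual
    'image.jpg#idx<TAB>caption' line format of the Flickr8k token file."""
    captions = defaultdict(list)
    with open(token_file, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            image_id, caption = line.split("\t", 1)
            image_name = image_id.split("#")[0]   # drop the '#0'..'#4' suffix
            captions[image_name].append(caption.lower())
    return captions

# captions = load_flickr8k_captions()
# print(len(captions))                       # roughly 8000 images
# print(captions[next(iter(captions))])      # the five captions of one image
```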

MODELS FOR IMAGE CAPTIONING

The history of image captioning goes back many years. Early attempts were mostly based on detection: they first detect visual concepts (e.g. objects and their attributes) and then fill templates or retrieve nearest-neighbour sentences for caption generation. With the development of neural networks, the encoder-decoder framework appeared and later became the basic model. Most models use a CNN to represent the input image as a vector, and then apply an LSTM network on top of it to generate words; a minimal sketch of this pipeline is given below.
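As an illustration of this basic pipeline, the following is a minimal PyTorch sketch with a pretrained CNN encoder and an LSTM decoder. The class names, layer sizes, and the choice of a ResNet-18 backbone are assumptions made for brevity, not taken from any particular paper, and a recent torchvision is assumed for the pretrained-weights API.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """Encode an image into a fixed-length feature vector with a pretrained CNN."""
    def __init__(self, embed_size=256):
        super().__init__()
        resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):                       # images: (B, 3, 224, 224)
        with torch.no_grad():                        # keep the pretrained CNN frozen in this sketch
            feats = self.backbone(images).flatten(1)
        return self.fc(feats)                        # (B, embed_size)

class DecoderLSTM(nn.Module):
    """Generate a word sequence from the image feature with an LSTM."""
    def __init__(self, embed_size=256, hidden_size=512, vocab_size=5000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, image_feats, captions):        # captions: (B, T) word indices
        # The image feature is fed as the first "word" of the sequence.
        inputs = torch.cat([image_feats.unsqueeze(1), self.embed(captions)], dim=1)
        hiddens, _ = self.lstm(inputs)
        return self.fc(hiddens)                      # (B, T+1, vocab_size) word scores
```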

‘Based on the encoder-decoder framework, many variants have been proposed, among which the attention mechanism appears to be the most effective add-on. Specifically, the attention mechanism replaces the single feature vector with a set of feature vectors, such as features from different regions, or features under different conditions. It still uses the LSTM network to generate words one by one; the difference is that at each step a mixed guiding feature over the whole feature set is dynamically computed. In recent years, there have also been approaches combining the attention mechanism with detection. Instead of attending over features, they attend over a set of detected visual concepts, such as attributes and objects.’ A sketch of such a soft-attention module is given below.
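The following is an illustrative PyTorch sketch of a soft-attention module: given a set of region features and the decoder's current hidden state, it computes attention weights and the mixed guiding feature. The additive scoring form and all sizes are assumptions for illustration, not the exact formulation of any cited paper.

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Compute a mixed guiding feature over a set of region features,
    conditioned on the decoder's current hidden state."""
    def __init__(self, feat_size=512, hidden_size=512, attn_size=256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_size, attn_size)
        self.hidden_proj = nn.Linear(hidden_size, attn_size)
        self.score = nn.Linear(attn_size, 1)

    def forward(self, region_feats, hidden):
        # region_feats: (B, R, feat_size)  -- one feature per image region
        # hidden:       (B, hidden_size)   -- current LSTM hidden state
        scores = self.score(torch.tanh(
            self.feat_proj(region_feats) + self.hidden_proj(hidden).unsqueeze(1)
        ))                                            # (B, R, 1)
        weights = torch.softmax(scores, dim=1)        # attention over the R regions
        context = (weights * region_feats).sum(dim=1) # (B, feat_size), the mixed feature
        return context, weights.squeeze(-1)
```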

LITERATURE REVIEW

Image captioning approaches can generally be divided into two categories: top-down and bottom-up. Bottom-up approaches are the "classical" ones, which start with visual concepts, objects, attributes, words and phrases, and combine them into sentences using language models. [12] and [19] detect concepts and use templates to obtain sentences, while [23] pieces together detected concepts. [9] and [20] use more powerful language models. [11] and [22] are the latest attempts along this direction and they achieve close to state-of-the-art performance on various image captioning benchmarks. Top-down approaches are the "modern" ones, which formulate image captioning as a machine translation problem [29, 2, 5, 27]. Instead of translating between different languages, these approaches translate from a visual representation to its language counterpart. The visual representation comes from a convolutional neural network that is often pretrained for image classification on large-scale datasets [18]. Translation is accomplished through recurrent neural network based language models. The main advantage of this approach is that the entire system can be trained end to end, i.e., all the parameters can be learned from data. Representative works include [24, 26, 16, 8, 25]. The differences among the various approaches often lie in the kind of recurrent neural networks used. Top-down approaches represent the state of the art for this problem.

Visual attention has long been known in psychology and neuroscience but has only recently been studied in computer vision and related areas. In terms of models, [21, 13] approach it with Boltzmann machines while [28] does so with recurrent neural networks. In terms of applications, [6] studies it for image tracking, [1] studies it for image recognition of multiple objects, and [15] uses it for image generation. Finally, attention has already been considered for image captioning: in [30], Xu et al. propose a spatial attention model for image captioning.

SOME EXAMPLES OF MODERN TECHNIQUES USED IN NEURAL NETWORKS FOR IMAGE CAPTIONING

EXAMPLE 1 – DATASET: Flickr8k

(Figure: schematic diagram of the model, combining a CNN and an RNN.)

The method uses a probabilistic framework to caption an image with a neural network: it generates captions by maximising the probability of the correct description given the image, trained in an "end-to-end" fashion. The formulation is as follows:

Here $\theta$ is the parameter of the model, $I$ is the image, and $S$ is the correct transcription. Training maximises the log probability of the correct description over the training pairs:

$$\theta^{*} = \arg\max_{\theta} \sum_{(I,S)} \log p(S \mid I; \theta).$$

Since $S = (S_0, \ldots, S_N)$ is a word sequence and $N$ is its length, the chain rule gives

$$\log p(S \mid I) = \sum_{t=0}^{N} \log p(S_t \mid I, S_0, \ldots, S_{t-1}).$$

Further, we use a recurrent neural network (RNN) to model this probability from the input image $I$ and the preceding $t-1$ words, which are summarised by a fixed-length hidden state that is updated after each new input:

$$h_{t+1} = f_{\mathrm{LSTM}}(h_t, x_t),$$

where $h_t$ is the memory and $x_t$ is the input at step $t$ (the image feature at the first step, and word embeddings afterwards). The function $f_{\mathrm{LSTM}}$ is defined by the standard LSTM updates:

$$i_t = \sigma(W_i x_t + U_i h_{t-1}), \qquad f_t = \sigma(W_f x_t + U_f h_{t-1}), \qquad o_t = \sigma(W_o x_t + U_o h_{t-1}),$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1}), \qquad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t),$$

where $W_i, W_f, W_o, W_c \in \mathbb{R}^{D_h \times D_x}$ and $U_i, U_f, U_o, U_c \in \mathbb{R}^{D_h \times D_h}$.
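For illustration, the gate equations above can be written directly in code. The sketch below is a bare-bones, bias-free LSTM cell that mirrors those equations; production code would normally use `torch.nn.LSTMCell`, which also includes bias terms.

```python
import torch
import torch.nn as nn

class BareLSTMCell(nn.Module):
    """Bias-free LSTM cell following the gate equations above:
    the W_* matrices act on the input x_t, the U_* matrices on h_{t-1}."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.W = nn.Linear(input_size, 4 * hidden_size, bias=False)   # W_i, W_f, W_o, W_c stacked
        self.U = nn.Linear(hidden_size, 4 * hidden_size, bias=False)  # U_i, U_f, U_o, U_c stacked

    def forward(self, x_t, h_prev, c_prev):
        gates = self.W(x_t) + self.U(h_prev)
        i, f, o, g = gates.chunk(4, dim=-1)           # input, forget, output gates and candidate
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c_t = f * c_prev + i * torch.tanh(g)          # new memory cell
        h_t = o * torch.tanh(c_t)                     # new hidden state
        return h_t, c_t
```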

EXAMPLE 2 – CONTRASTIVE LEARNING

Contrastive Learning (CL) is a method for image captioning that encourages distinctiveness while maintaining the overall quality of the generated captions. The method is generic and can be used with models of various structures. By employing a state-of-the-art model as a reference, it maintains the optimality of the target model while encouraging it to learn distinctiveness, which is an important property of high-quality captions. On two challenging datasets, namely MSCOCO and InstaPIC-1.1M, the method improves the target model by significant margins and attains state-of-the-art results across multiple metrics. In comparative studies, it extends well to models with different structures, which clearly shows its generalization ability. "In Contrastive Learning (CL), we learn a target image captioning model $p_m(\cdot; \theta)$ with parameter $\theta$ by constraining its behaviour relative to a reference model $p_n(\cdot; \phi)$ with parameter $\phi$. The learning procedure requires two sets of data: (1) the observed data $X$, which is a set of ground-truth image-caption pairs $((c_1, I_1), (c_2, I_2), \ldots, (c_{T_m}, I_{T_m}))$ and is readily available in any image captioning dataset, and (2) the noise set $Y$, which contains mismatched pairs $((c_{/1}, I_1), (c_{/2}, I_2), \ldots, (c_{/T_n}, I_{T_n}))$ and can be generated by randomly sampling $c_{/t} \in C_{/I_t}$ for each image $I_t$, where $C_{/I_t}$ is the set of all ground-truth captions except the captions of image $I_t$. We refer to $X$ as positive pairs and $Y$ as negative pairs. For any pair $(c, I)$, the target model and the reference model respectively give their estimated conditional probabilities $p_m(c \mid I, \theta)$ and $p_n(c \mid I, \phi)$. We wish $p_m(c_t \mid I_t, \theta)$ to be greater than $p_n(c_t \mid I_t, \phi)$ for any positive pair $(c_t, I_t)$, and vice versa for any negative pair $(c_{/t}, I_t)$. Following this intuition, our initial attempt was to define $D((c, I); \theta, \phi)$, the difference between $p_m(c \mid I, \theta)$ and $p_n(c \mid I, \phi)$, as
$$D((c, I); \theta, \phi) = p_m(c \mid I, \theta) - p_n(c \mid I, \phi).$$

The loss function is then defined based on this difference."
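Since the loss itself is not reproduced above, the following PyTorch sketch only illustrates the intuition of the quoted passage: positive pairs should score higher under the target model than under the reference model, and negative pairs lower. The logistic surrogate used here is an assumption for illustration, not necessarily the exact loss of the CL paper; the inputs stand for per-pair log probabilities log p_m(c|I; θ) and log p_n(c|I; φ).

```python
import torch

def contrastive_caption_loss(log_p_target_pos, log_p_ref_pos,
                             log_p_target_neg, log_p_ref_neg):
    """Illustrative contrastive objective: reward the target model for assigning
    higher probability than the reference to matched (positive) caption-image
    pairs, and lower probability to mismatched (negative) pairs."""
    # Difference between target and reference log probabilities for each pair.
    d_pos = log_p_target_pos - log_p_ref_pos      # want this to be large
    d_neg = log_p_target_neg - log_p_ref_neg      # want this to be small
    # Logistic surrogate (an assumption for this sketch): push sigma(d_pos)
    # towards 1 for positive pairs and towards 0 for negative pairs.
    loss_pos = -torch.log(torch.sigmoid(d_pos) + 1e-12).mean()
    loss_neg = -torch.log(1.0 - torch.sigmoid(d_neg) + 1e-12).mean()
    return loss_pos + loss_neg
```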

EXAMPLE 3 – PyTorch

Let us look at a simple implementation of image captioning in PyTorch. We take an image as input and predict its description using a deep learning model. The code for this example can be found on GitHub. A pretrained ResNet-152 model is used as the encoder, and the decoder is an LSTM network; a greedy decoding sketch is given below.
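The decoding step of such a model can be sketched with a simple greedy loop: the image feature is fed as the first LSTM input, and the most probable word is fed back at each step until an end token appears. The sketch below reuses the hypothetical `EncoderCNN`/`DecoderLSTM` modules from the earlier example; the `end_idx` token index and the index-to-word mapping are assumptions.

```python
import torch

@torch.no_grad()
def greedy_caption(encoder, decoder, image, vocab_idx_to_word, end_idx, max_len=20):
    """Greedy decoding: at each step pick the most probable next word
    until the <end> token (or max_len) is reached."""
    feat = encoder(image.unsqueeze(0))                   # (1, embed_size)
    inputs = feat.unsqueeze(1)                           # first LSTM input is the image feature
    states = None
    words = []
    for _ in range(max_len):
        hiddens, states = decoder.lstm(inputs, states)   # (1, 1, hidden_size)
        scores = decoder.fc(hiddens.squeeze(1))          # (1, vocab_size)
        predicted = scores.argmax(dim=-1)                # most probable word index
        if predicted.item() == end_idx:
            break
        words.append(vocab_idx_to_word[predicted.item()])
        inputs = decoder.embed(predicted).unsqueeze(1)   # feed the chosen word back in
    return " ".join(words)
```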

EXAMPLE 4 – COMBINING TOP-DOWN AND BOTTOM-UP APPROACHES USING SEMANTIC ATTENTION

This paper proposes a new image captioning approach that combines the top-down and bottom-up approaches through a semantic attention model (Figure 1 of that paper gives an overview of the algorithm). Semantic attention in image captioning is the ability to provide a detailed, coherent description of semantically important objects exactly when they are needed. In particular, the semantic attention model has the following properties: 1) it can attend to a semantically important concept or region of interest in an image, 2) it can weight the relative strength of attention paid to multiple concepts, and 3) it can switch attention among concepts dynamically according to task status. Specifically, it detects semantic concepts or attributes as candidates for attention using a bottom-up approach, and employs a top-down visual feature to guide where and when attention should be activated. The model is built on top of a recurrent neural network (RNN), whose initial state captures global information from the top-down feature. As the RNN state transits, it gets feedback and interaction from the bottom-up attributes via an attention mechanism enforced on both the network state and the output nodes. This feedback allows the algorithm not only to predict new words more accurately but also to infer the semantic gap between existing predictions and image content more robustly. In this way, external image data can be leveraged for training visual concepts and external text data for learning semantics between words. A simplified sketch of attention over detected attributes is given below.
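As a rough illustration of the input-attention idea, the sketch below weights a set of detected attribute embeddings by their relevance to the current RNN state and mixes them into a single guiding vector. It is a simplified stand-in for the mechanism described above, not the exact model of the cited paper; all names and sizes are assumptions.

```python
import torch
import torch.nn as nn

class AttributeAttention(nn.Module):
    """Weight a set of detected attribute embeddings by their relevance to the
    current RNN hidden state and mix them into a single guiding vector."""
    def __init__(self, embed_size=256, hidden_size=512):
        super().__init__()
        self.query = nn.Linear(hidden_size, embed_size)

    def forward(self, attr_embeds, hidden):
        # attr_embeds: (B, K, embed_size) -- embeddings of K detected concepts
        # hidden:      (B, hidden_size)   -- current RNN state
        q = self.query(hidden).unsqueeze(-1)             # (B, embed_size, 1)
        scores = torch.bmm(attr_embeds, q).squeeze(-1)   # (B, K) relevance of each concept
        weights = torch.softmax(scores, dim=-1)
        mixed = torch.bmm(weights.unsqueeze(1), attr_embeds).squeeze(1)  # (B, embed_size)
        return mixed, weights
```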

References

[1] J. Ba, V. Mnih, and K. Kavukcuoglu. Multiple object recognition with visual attention. ICLR, 2015.
[2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. ICLR, 2014.
[3] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
[4] X. Chen and C. L. Zitnick. Mind's eye: A recurrent visual representation for image caption generation. In CVPR, pages 2422–2431, 2015.
[5] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. EMNLP, 2014.
[6] M. Denil, L. Bazzani, H. Larochelle, and N. de Freitas. Learning where to attend with deep architectures for image tracking. Neural Computation, 24(8):2151–2184, 2012.
[7] J. Devlin, S. Gupta, R. Girshick, M. Mitchell, and C. L. Zitnick. Exploring nearest neighbor approaches for image captioning. arXiv preprint arXiv:1505.04467, 2015.
[8] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, pages 2626–2634, 2015.
[9] D. Elliott and F. Keller. Image description using visual dependency representations. In EMNLP, pages 1292–1302, 2013.
[10] V. Escorcia, J. C. Niebles, and B. Ghanem. On the relationship between visual attributes and convolutional networks. In CVPR, pages 1256–1264, 2015.
[11] H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. Platt, et al. From captions to visual concepts and back. In CVPR, pages 1473–1482, 2015.
[12] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In ECCV, pages 15–29. Springer, 2010.
[13] Y. Gong, Y. Jia, T. Leung, A. Toshev, and S. Ioffe. Deep convolutional ranking for multilabel image annotation. ICLR, 2014.
[14] Y. Gong, L. Wang, M. Hodosh, J. Hockenmaier, and S. Lazebnik. Improving image-sentence embeddings using large weakly annotated photo collections. In ECCV, pages 529–545. Springer, 2014.
[15] K. Gregor, I. Danihelka, A. Graves, and D. Wierstra. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
[16] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, June 2015.
[17] C. Koch and S. Ullman. Shifts in selective visual attention: towards the underlying neural circuitry. In Matters of Intelligence, pages 115–141. Springer, 1987.
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
[19] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby talk: Understanding and generating image descriptions. In CVPR, 2011.
[20] P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and Y. Choi. Collective generation of natural image descriptions. In ACL, pages 359–368, 2012.
[21] H. Larochelle and G. E. Hinton. Learning to combine foveal glimpses with a third-order Boltzmann machine. In NIPS, pages 1243–1251, 2010.
[22] R. Lebret, P. O. Pinheiro, and R. Collobert. Simple image description generator via a linear phrase-based approach. ICLR, 2015.
[23] S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi. Composing simple image descriptions using web-scale n-grams. In CoNLL, pages 220–228, 2011.
[24] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, June 2015.
[25] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille. Learning like a child: Fast novel visual concept learning from sentence descriptions of images. In ICCV, 2015.
[26] J. Mao, W. Xu, Y. Yang, J. Wang, and A. Yuille. Deep captioning with multimodal recurrent neural networks (m-RNN). arXiv preprint arXiv:1412.6632, 2014.
[27] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119, 2013.
[28] V. Mnih, N. Heess, A. Graves, et al. Recurrent models of visual attention. In NIPS, pages 2204–2212, 2014.
[29] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. EMNLP, 12:1532–1543, 2014.
[30] K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.