SPEECH ENABLED MULTIPURPOSE VIRTUAL ASSISTANT USING IMAGE CAPTIONING A PROJECT REPORT Submitted by

GALI KAVYA SHREE SAI (312215106027) GANAPATHI RAMANATHAN (312215106028) MANNE MUDDU RESHMA PRIYA (312215106058) In partial fulfillment for the award of the degree

of

BACHELOR OF ENGINEERING IN ELECTRONICS AND COMMUNICATION ENGINEERING SSN COLLEGE OF ENGINEERING: KALAVAKKAM

ANNA UNIVERSITY: CHENNAI 600 025 APRIL 2019


ANNA UNIVERSITY: CHENNAI 600 025

BONAFIDE CERTIFICATE

Certified that this project report "SPEECH ENABLED VIRTUAL ASSISTANT USING IMAGE CAPTIONING" is the bonafide work of "GALI KAVYA SHREE SAI (312215106027), GANAPATHI RAMANATHAN (312215106028) and MANNE MUDDU RESHMA PRIYA (312215106058)", who carried out the project work under my supervision.

SIGNATURE
Dr. P. Vijayalakshmi
PROJECT SUPERVISOR & PROFESSOR
Department of Electronics and Communication Engineering
SSN College of Engineering
Kalavakkam 603110

SIGNATURE
Dr. S. Radha
PROFESSOR & HEAD OF THE DEPARTMENT
Department of Electronics and Communication Engineering
SSN College of Engineering
Kalavakkam 603110

Submitted for the Project viva-voce examination held on

INTERNAL EXAMINER

EXTERNAL EXAMINER

ACKNOWLEDGEMENTS

We would like to show our deepest respect and admiration to the founder of this esteemed institution, Dr. Shiv Nadar, Chairman, SSN Institutions, and to our Principal, Dr. S. Salivahanan, for their support.

We are thankful to Dr. S. Radha, Head of the Department, Electronics and Communication Engineering, SSN College of Engineering, for her guidance, and to our project coordinator Dr. P. Vijayalakshmi, Professor, Department of ECE, and our panel member Dr. S. Esther Florence, for giving us the autonomy, guidance and facilities to complete the project.

We would like to thank our guide, Dr. P. Vijayalakshmi, Professor, SSN College of Engineering, for her patience, guidance and support throughout the project, in addition to her valuable inputs, advice and suggestions.

We would also like to thank Varun Ranganathan, alumnus of SSN College of Engineering, and Ph.D. scholars Mrinalini and Rachel, Department of Information Technology, for their assistance whenever we found ourselves at a roadblock.

Last but not least, we would like to thank all the teaching and non-teaching staff of the college, without whose assistance we would not have been able to complete the project.

Gali Kavya Shree Sai
Ganapathi Ramanathan
Manne Muddu Reshma Priya

ABSTRACT

Autism Spectrum Disorder (ASD) is a developmental disorder that affects communication and behavior. ASD is a condition related to brain development that impacts how a person perceives and socializes with others, causing problems in social interaction and communication. A sizeable proportion of individuals with ASD show normal or even above-average intelligence, but have trouble communicating and applying what they know in social situations. This project aims to bridge that gap. The thesis presents a virtual assistant that translates an image into speech using machine learning algorithms. The system identifies an image or a sequence of images using pattern recognition with the help of a Convolutional Neural Network (CNN). The image is encoded and labelled using the CNN, then captioned using a specialized Recurrent Neural Network (RNN) called a Long Short Term Memory (LSTM) network, yielding a sentence associated with the image. The obtained sentence is then translated into speech using a Text-To-Speech (TTS) synthesis system designed using a Hidden Markov Model (HMM).

TABLE OF CONTENTS

CHAPTER NO.  TITLE                                                      PAGE NO.

I            ABSTRACT                                                   iv
II           LIST OF TABLES                                             viii
III          LIST OF FIGURES                                            ix
IV           LIST OF SYMBOLS, ABBREVIATIONS AND NOMENCLATURE            x

1.           INTRODUCTION                                               1
             1.1 OVERVIEW                                               1
             1.2 AUTISM SPECTRUM DISORDER                               1
             1.3 MOTIVATION AND OBJECTIVE                               3
             1.4 LITERATURE SURVEY                                      4
                 1.4.1 Image Classification and Captioning              4
                 1.4.2 Text-To-Speech Synthesis                         6
                 1.4.3 Summary                                          7
             1.5 ORGANISATION OF THE REPORT                             7

2.           IMAGE CLASSIFICATION                                       8
             2.1 INTRODUCTION                                           8
             2.2 FEATURE EXTRACTION                                     8
                 2.2.1 Introduction to CNNs                             9
                 2.2.2 Steps involved in feature extraction             10
             2.3 CONCLUSION                                             13

3.           IMAGE CAPTIONING                                           14
             3.1 INTRODUCTION                                           14
             3.2 MODEL OVERVIEW                                         14
             3.3 DATASET                                                16
             3.4 IMAGE ENCODER                                          16
                 3.4.1 Steps involved in encoding                       18
             3.5 IMAGE DECODER                                          19
                 3.5.1 Decoder Architecture                             21
             3.6 CONCLUSION                                             23

4.           TEXT TO SPEECH SYNTHESIS                                   25
             4.1 INTRODUCTION TO HMMs                                   25
             4.2 DATA COLLECTION                                        27
             4.3 DATA PREPARATION                                       27
             4.4 TRAINING PHASE                                         29
             4.5 SYNTHESIS PHASE                                        30
             4.6 CONCLUSION                                             30

5.           SPEECH ENABLED VIRTUAL ASSISTANT USING IMAGE CAPTIONING    31
             5.1 INTRODUCTION                                           31
             5.2 INTEGRATION                                            31
                 5.2.1 Image Captioning                                 32
                 5.2.2 Text-to-Speech synthesis system                  34
                 5.2.3 Integrated Model                                 37
             5.3 PERFORMANCE ANALYSIS                                   37
                 5.3.1 Performance of Image Captioning Model            37
                 5.3.2 Performance of TTS Model                         40
             5.4 CONCLUSION                                             40

6.           CONCLUSIONS AND FUTURE WORK                                41

             REFERENCES                                                 42

LIST OF TABLES

TABLE NO.  TITLE                                                PAGE NO.
3.1        Generated Captions                                   24
5.1        Table of Generated Image Captions with BLEU Scores   39

LIST OF FIGURES

FIG. NO.   TITLE                                                PAGE NO.
1.1        Charles Darwin and Nikola Tesla                      2
1.2        Block Diagram of Proposed Model                      3
2.1        A Typical Convolutional Neural Net                   9
2.2        A Typical Gray Scale Image Represented as Pixels     10
2.3        The Convolution Operation                            11
2.4        Max Pooling                                          12
3.1        Overview of the Image Captioning Model Used          15
3.2        VGG16 Architecture                                   17
3.3        A Typical Recurrent Neural Network                   19
3.4        A Typical Long Short Term Memory Network             20
3.5        The Modified Long Short Term Memory Network          21
3.6        Consolidated LSTM Network                            23
4.1        Typical HMM used to predict the weather              25
4.2        Block Diagram of a hidden Markov model               26
4.3        Manual Segmentation of Phonemes                      28
4.4        Labels of recorded speech                            28
5.1        Architecture of the model                            31
5.2        Image files and their captions                       32
5.3        Annotated captions and their image files             33
5.4        Manual Segmentation of phonemes                      35
5.5        Generated Labels                                     36

LIST OF SYMBOLS, ABBREVIATIONS AND NOMENCLATURE

ASD    Autism Spectrum Disorder
CNN    Convolutional Neural Network
TTS    Text-to-Speech
RNN    Recurrent Neural Network
LSTM   Long Short Term Memory
HMM    Hidden Markov Model
BLEU   Bilingual Evaluation Understudy
GPU    Graphics Processing Unit
HTK    Hidden Markov Model Toolkit
SPTK   Speech Signal Processing Toolkit
MLP    Multilayer Perceptron
MGC    Mel Generalized Cepstral Coefficients
MFCC   Mel Frequency Cepstral Coefficients
MLF    Master Label File

CHAPTER 1 INTRODUCTION

1.1 OVERVIEW

Autism Spectrum Disorder (ASD) impacts the nervous system and affects the overall cognitive, emotional, social and physical health of the affected individual. As a result, people with ASD tend to be reserved, lack social skills and find it difficult to communicate. Efforts have been made in the fields of science, medicine and technology to bring patients into the fold and to train them in communication. Early recognition, as well as behavioural, educational and family therapies, may reduce symptoms and support development and learning. A virtual assistant would therefore be very helpful to young children diagnosed with ASD in learning the basics of communication and language. We attempt to realize this by means of machine learning algorithms that enable individuals with ASD to identify and describe what they see.

1.2 AUTISM SPECTRUM DISORDER

Autism, or autism spectrum disorder (ASD), refers to a broad range of conditions characterized by challenges with social skills, repetitive behaviors, speech and nonverbal communication. People diagnosed with Autism Spectrum Disorder have trouble communicating with their peers and show anxiety in social situations. Many pioneers like Charles Darwin, Nikola Tesla and Eddie Redmayne were said to have shown symptoms characteristic of Autism Spectrum Disorder, as shown in Fig 1.1.


Fig 1.1 Charles Darwin, the father of evolution, and Nikola Tesla were said to have symptoms characteristic of ASD.

We know that there is not one type of autism but many subtypes, most influenced by a combination of genetic and environmental factors. Because autism is a spectrum disorder, each person with autism has a distinct set of strengths and challenges. The ways in which people with autism learn, think and solve problems can range from highly skilled to severely challenged. Some people with ASD may require significant support in their daily lives, while others may need less support and, in some cases, live entirely independently. Several factors may influence the development of autism, and it is often accompanied by sensory sensitivities and medical issues such as gastrointestinal (GI) disorders, seizures or sleep disorders, as well as mental health challenges such as anxiety, depression and attention issues. Indicators of ASD usually appear by age 2 or 3.


Some associated developmental delays can appear even earlier, and ASD can often be diagnosed as early as 18 months. Research shows that early intervention leads to positive outcomes later in life for people with autism.

1.3 MOTIVATION AND OBJECTIVE

Our project is a virtual assistant that translates an image into speech using machine learning algorithms. The project employs image processing to identify an image or a sequence of images using pattern recognition with the help of a Convolutional Neural Network (CNN), labels them using image captioning to obtain a sentence associated with the images, and translates that sentence into speech with the help of a Text-to-Speech (TTS) synthesizer based on hidden Markov model principles. The block diagram of the proposed model is shown in Fig 1.2.

Fig 1.2 Block Diagram of Proposed Model

This project consists of two modules. The first module involves image processing and image captioning using a CNN. A CNN is a special case of the neural network, consisting of one or more convolutional layers, often with a subsampling layer, followed by one or more fully connected layers as in a standard neural network.


A traditional pattern/image recognizer uses a hand-designed feature extractor. In a CNN-based model, the feature extractor is not hand-designed but is itself a convolutional layer. A training and testing module is included so that images can be trained and tested. The image captioning model is an end-to-end neural network consisting of a Convolutional Neural Network (VGG16 model) to encode images, followed by a language-generating Recurrent Neural Network (Long Short Term Memory model). It generates complete sentences in natural language based on the input image and gives a complete description of the image.

The second module is the TTS system that converts the written text (or sentence) associated with the image into a speech signal. This is done by building a Text-To-Speech synthesizer using the principles of the hidden Markov model. A hidden Markov model (HMM) is a finite state machine which generates a sequence of discrete-time observations.

1.4 LITERATURE SURVEY

The overall goal of this survey was to identify algorithms to perform image captioning and text-to-speech synthesis. The major approaches to TTS synthesis are formant synthesis, waveform concatenative speech synthesis, and HMM-based speech synthesis.

1.4.1 Image Classification and Captioning

Ali Farhadi et al. (2010) have demonstrated that automatic methods can prepare concise descriptions of images. They described a system that can compute a score linking an image to a sentence. This score can be used to attach a descriptive sentence to a given image, or to obtain images that illustrate a given sentence. The score is obtained by comparing an estimate of meaning obtained from the image to one obtained from the sentence.


Each estimate of meaning comes from a discriminative procedure that is learned using data. The system was evaluated on a novel dataset consisting of human-annotated images. While the underlying estimate of meaning was impoverished, it was sufficient to produce very good quantitative results, evaluated with a novel score that can account for synecdoche.

Jiang Wang et al. (2016) have utilized recurrent neural networks (RNNs) to address the problem of multi-label dependencies. Combined with CNNs, the proposed CNN-RNN framework learns a joint image-label embedding to characterize the semantic label dependency as well as the image-label relevance, and it can be trained end-to-end from scratch to integrate information in a unified framework. Experimental results on public benchmark datasets demonstrate that the proposed architecture achieves better performance than state-of-the-art multi-label classification models.

D. Barik and A. Mondal (2017) have proposed a segmentation model combining the GrabCut model and linear multi-scale smoothing in order to improve image segmentation performance. Multi-scale smoothing components, generated by a Gaussian kernel through an iterative scheme, provide image information at different levels that contributes to image segmentation. Each component is segmented by the GrabCut model, and the segmentation results differ because fine detail is smoothed away step by step, changing the appearance characteristics. A convergence condition, rooted in the significance level of segmentation sub-regions on adjacent scale components, is constructed based on the invariance of the object contour in each component.


Compared to the traditional GrabCut, as the experimental results show, the proposed model has superior performance on real images and achieves better robustness against noise.

O. Vinyals et al. (2015) have presented a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation, and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions.

1.4.2 Text-to-Speech Synthesis

Ramani Boothalingam et al. (2013) have developed and evaluated unit selection speech synthesis systems for the Tamil language. Five hours of Tamil speech data with two different sub-word units, namely the phoneme unit and the consonant-vowel (CV) unit, have been used. Given a text, the system partitions the text into the required subword units and then concatenates them based on the target cost and the concatenation cost. It has been observed that the phoneme-based system has a higher MOS than the CV-based system. In the later part of the work, an HMM-based TTS system with context-dependent phonemes as the subword unit has been built. Based on the MOS results, it can be seen that the HMM-based synthesis system performs better than the FestVox-based voice system, due to the absence of sonic glitches.


G. Anushiya Rachel et al. (2015) have developed a small-footprint context-independent HMM-based synthesizer for the Tamil language. An analysis of the amount of speech data required to build an HMM-based synthesizer was carried out. Based on the MOS results, it has been observed that the increase in quality of speech with additional data is not significant; therefore, one hour of data has been used. The effect of the contextual features on the quality of synthetic speech has also been analyzed using various context-dependent and context-independent phone units. From the results, it has been observed that the footprint size increases with additional context.

1.4.3 Summary

Thus, various existing image classification and captioning systems, and text-to-speech synthesis systems, have been analyzed in depth. It has been identified that an end-to-end model with a CNN encoder and an RNN decoder is best suited for the image captioning system. Also, HMM-based TTS systems have better naturalness and intelligibility when compared to other TTS systems. The following chapters present the design and implementation of the proposed system, SEVA.

1.5 ORGANISATION OF THE REPORT

This thesis is organized as follows: Chapter 2 discusses the classification of images. Chapter 3 focuses on image captioning. Chapter 4 describes text-to-speech synthesis. Chapter 5 presents the integrated virtual assistant and its results, and Chapter 6 presents conclusions and future work.


CHAPTER 2 IMAGE CLASSIFICATION

2.1 INTRODUCTION

For image captioning, it is essential to select a suitable model that can perform image classification. Therefore, this chapter discusses the basic principles of image classification, the most commonly used model and how it is implemented. Image classification refers to the task of extracting information classes from a multiband raster image; the resulting raster can be used to create thematic maps. Depending on the interaction between the analyst and the computer during classification, there are two types of classification: supervised and unsupervised. Various techniques are used for image classification, such as the K-Nearest Neighbour classifier, linear classifiers and Convolutional Neural Networks. We have used a Convolutional Neural Network for this project, and it is explained in the following subsections.

2.2 FEATURE EXTRACTION

In machine learning, pattern recognition and image processing, feature extraction starts from an initial set of measured data and builds derived values (features) intended to be informative and non-redundant, facilitating the subsequent learning and generalization steps, and in some cases leading to better human interpretation. Feature extraction is a dimensionality reduction process, where an initial set of raw variables is reduced to more manageable groups (features) for processing, while still accurately and completely describing the original dataset. When the input data to an algorithm is too large to be processed and is suspected to be redundant (e.g. the same measurement in both feet and meters, or the repetitiveness of images presented as pixels), it can be transformed into a reduced set of feature vectors.


Determining a subset of the initial features is called feature selection. The selected features are expected to contain the relevant information from the input data, so that the desired task can be performed using this reduced representation instead of the complete initial data.

2.2.1 Introduction to Convolutional Neural Networks

Convolutional Neural Networks (ConvNets or CNNs) are a category of artificial neural networks which have proven to be very effective in the field of image recognition and classification. They have been used extensively for tasks such as object detection, self-driving cars and image captioning. The first ConvNet was introduced around 1990 by Yann LeCun, and the architecture of the model was called the LeNet architecture. Feature extraction using a CNN is demonstrated in Fig 2.1.

Fig 2.1 A typical CNN

The basic principles and steps involved in image classification using a CNN are explained in the next subsection.


2.2.2 Steps involved in feature extraction using CNNs

The architecture of a CNN can be explained using four basic operations, namely:
1. Convolution
2. Non-linearity
3. Pooling
4. Classification

Essentially, every image can be represented as a matrix of pixel values. An image from a standard digital camera has three channels: red, green and blue, which can be imagined as three 2-D matrices stacked over each other (one for each color), each having pixel values in the range 0 to 255.

The Convolution Operation

The purpose of the convolution operation is to extract features from an image, which can be represented as a matrix of pixel values as shown in Fig 2.2. We consider filters of size smaller than the dimensions of the image. The entire operation of convolution can be understood with the example below.

Fig 2.2 A typical grayscale image represented in pixel values.


Consider a small 2-dimensional 5*5 image with binary pixel values and another 3*3 filter matrix, shown in Fig 2.3. We slide this 3*3 matrix over the original image one pixel at a time, compute the element-wise multiplication of the filter with the overlapped submatrix of the original image, and add the products to obtain a single integer, which forms one element of the output matrix. A minimal sketch of this operation is shown after Fig 2.3.

Fig 2.3 A typical convolution operation, where the image is represented in green, the convolution matrix in yellow and the output in pink.
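The sliding multiply-and-add described above can be written directly in a few lines of NumPy. This is only an illustrative sketch (stride 1, no padding); the example image and filter values are invented and are not necessarily the ones shown in Fig 2.3.

```python
import numpy as np

def convolve2d(image, kernel):
    """'Valid' 2-D convolution as used in CNNs (cross-correlation): stride 1, no padding."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1                # output size for a 'valid' convolution
    output = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i:i + kh, j:j + kw]        # submatrix under the filter
            output[i, j] = np.sum(patch * kernel)    # element-wise multiply and add
    return output

# A 5x5 binary image and a 3x3 filter, in the spirit of the example of Fig 2.3 (values are illustrative)
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])
print(convolve2d(image, kernel))   # 3x3 output feature map
```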

Bringing in Non-Linearity

An additional operation is applied after every convolution operation. The most commonly used non-linear function for images is the ReLU, which stands for Rectified Linear Unit. The ReLU operation is an element-wise operation which replaces negative pixel values with zero. It is needed because most real-life data is non-linear, whereas the output of the convolution operation is linear, since the operation applied is element-wise multiplication and addition.


Pooling

The pooling operation reduces the dimensionality of the image while preserving its important features. The most common pooling technique is max pooling, as shown in Fig 2.4. In max pooling, a window of size n*n, where n is smaller than the side of the image, is slid over the image and the maximum value within the window is taken. The window is then shifted by the given stride length.

Fig 2.4 Max Pooling

Full Connection

The fully connected layer is a multi-layer perceptron that uses the softmax activation function in the output layer. The term "fully connected" refers to the fact that all the neurons in the previous layer are connected to all the neurons of the next layer. The convolution and pooling operations generate the features of an image, and the task of the fully connected layer is to map these feature vectors to the classes in the training data.

We built a CNN to identify 5 different classes (Gulab Jamun, Ladoo, Water, Apple and Banana). The CNN was able to identify the classes with an accuracy of 87% after training for around 2 hours. The model was built on Keras, an open source neural network library written in Python that is capable of running on top of libraries like TensorFlow, Theano etc. Keras contains numerous implementations of commonly used neural network building blocks such as layers, objectives and activation functions, along with a host of tools to make working with image and text data easier. In addition to standard neural networks, Keras supports convolutional and recurrent neural networks, as well as common utility layers like dropout, batch normalization and pooling. A sketch of such a classifier is given below.
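The exact architecture and training script of the five-class classifier are not reproduced in this report, so the following is only a minimal Keras sketch of a comparable small CNN; the layer sizes, the directory name data/train and the training settings are illustrative assumptions.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Small CNN for 5 classes (Gulab Jamun, Ladoo, Water, Apple, Banana) -- illustrative sketch only
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(128, 128, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(5, activation='softmax'),           # one output per class
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Assumed directory layout: data/train/<class_name>/*.jpg
train_gen = ImageDataGenerator(rescale=1.0 / 255).flow_from_directory(
    'data/train', target_size=(128, 128), batch_size=32, class_mode='categorical')
model.fit(train_gen, epochs=20)
```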


2.3 CONCLUSION

Thus, this chapter explored image classification using a typical Convolutional Neural Network and explained the various steps involved. The next chapter explores the image captioning algorithm used, along with a brief explanation of its mathematical model and architecture.


CHAPTER 3 IMAGE CAPTIONING

3.1 INTRODUCTION

This chapter discusses the theory and architecture involved in building a consolidated Image Captioning Model that generates a meaningful caption for a given input image. The Image Captioning model is an end-to-end neural network consisting of a Convolutional Neural Network (VGG16 model) to encode images followed by a language-generating Recurrent Neural Network (Long Short Term Memory model). It generates complete sentences in natural language based on the input image.

3.2 MODEL OVERVIEW

The proposed model takes an image I as input and is trained to maximize the probability p(S|I), where S is the sequence of words generated by the model and each word St is drawn from a dictionary built from the training dataset. The input image I is fed into a deep vision Convolutional Neural Network (CNN), which helps in detecting the objects present in the image. The image encodings are passed on to the language-generating Recurrent Neural Network (RNN), which generates a meaningful sentence for the image. An analogy to the model can be drawn with a language-translation RNN; however, in our model the encoder RNN, which transforms an input sentence into a fixed-length vector, is replaced by a CNN encoder. The model is designed based on the following probabilistic equations 3.1-3.3:


    θ* = arg max_θ Σ_(I,S) log p(S|I; θ)                          (3.1)

where θ is the model parameter, S is the caption and I is the image. However, as the caption S is unbounded, we apply the chain rule to model the joint probability as follows:

    log p(S|I) = Σ_t log p(St|I, S0, ..., St-1)                   (3.2)

The joint probability is modeled using an RNN, where the variable number of words in the caption is handled by a fixed-length hidden state (memory) ht. The value of ht is updated with each new input xt using a nonlinear function f such that

    ht+1 = f(ht, xt)                                              (3.3)

As explained before, xt is fed into the system from the image encoding model (Vinyals et al., 2015). The function f is implemented using a specialized Recurrent Neural Network (Long Short Term Memory network). The proposed model is shown in Fig 3.1.

Fig 3.1 Overview of the Image Captioning model


3.3 DATASET

For the task of image captioning, the Flickr8k dataset (M. Hodosh, P. Young and J. Hockenmaier, 2013) was used. The dataset contains 8000 images with 5 captions per image, and is split by default into image and text folders. Each image has a unique id, and the captions for each image are stored against the corresponding id. The dataset contains 6000 training images, 1000 development images and 1000 test images. The model made use of the 6000 training images, consisting of a variety of scenes, and generates a vocabulary of 1352 unique words that describe an image accurately. Other image captioning datasets such as Flickr30k and MSCOCO exist, but both contain more than 30,000 images, so processing them becomes computationally very expensive. Captions generated using these datasets may prove to be better than the ones generated after training on Flickr8k, because the dictionary of words used by the RNN decoder would be larger for Flickr30k and MSCOCO.

3.4 IMAGE ENCODER

The details of the CNN were discussed in the previous chapter. Convolutional Neural Networks (CNNs) have improved the task of image classification significantly. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) has produced various open source deep learning architectures like ZFNet, AlexNet, VGG16, ResNet etc. For the task of image encoding in our model we use VGG16, a 16-layered network proposed at ILSVRC 2014. VGG16 significantly decreased the top-5 error rate in 2014, to 7.3%.


Fig 3.2 VGG16 Architecture (Simonyan et al., 2014)

Using the VGG16 architecture shown in Fig 3.2, we include a preprocessing layer that takes the RGB image with pixel values in the range 0-255 and subtracts the mean image values (calculated over the entire ImageNet training set). This network is characterized by its simplicity, using only 3*3 convolutional layers stacked on top of each other in increasing depth. Reducing the volume size is handled by max pooling. Two fully connected layers, each with 4,096 nodes, are then followed by a softmax classifier. The convolution layers use 3*3 filters with the stride length fixed at 1, and max pooling is done using a 2*2-pixel window with a stride length of 2. All input images need to be converted to 224*224 dimensions. A Rectified Linear Unit (ReLU) activation function follows every convolution layer; a ReLU computes the function f(x) = max(0, x). In order to make training easier, smaller versions of VGG with fewer weight layers (columns A and C) were trained first. These smaller networks converged and were then used as initializations for the larger, deeper networks, a process called pre-training.


While it makes logical sense, pre-training is a very time-consuming, tedious task, requiring an entire network to be trained before it can serve as an initialization for a deeper network. Pre-training is no longer used in most cases; Xavier/Glorot initialization or MSRA initialization is preferred instead. The advantage of using a ReLU layer over sigmoid and tanh is that it accelerates stochastic gradient descent. Also, unlike more expensive operations (exponentials etc.), the ReLU operation can be implemented simply by thresholding a matrix of activations at zero. For our purpose, however, we need not classify the image, and hence we remove the last 1*1*1000 classification layer. The output of our CNN encoder is thus a 1*1*4096 encoding, which is then passed to the language-generating RNN. There have been more successful CNN frameworks like ResNet, but they are computationally very expensive, since ResNet has 152 layers as compared to VGG16, which is only a 16-layered network.

3.4.1 Steps involved in encoding

1. Define the VGG16 model in Keras. We only go up to the last convolutional layer and do not include the fully connected layers, because adding the fully connected layers forces a fixed input size on the model (224x224, the original ImageNet format). By keeping only the convolutional modules, the model can be adapted to arbitrary input sizes.
2. Load a set of weights pre-trained on ImageNet into the model.
3. Define a loss function that seeks to maximize the activation of a specific filter (filter_index) in a specific layer (layer_name). This is done via a Keras backend function, which allows the code to run both on top of TensorFlow and Theano.
4. Normalize the gradient of the pixels of the input image, which avoids very small and very large gradients and ensures a smooth gradient ascent process.
5. Perform gradient ascent on the input space.
6. Extract and display the generated output.

A minimal sketch of the encoding step itself, extracting a 4096-dimensional feature vector from a pre-trained VGG16, is given after this list.
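The encoder code is not reproduced in the report, so the following is only a sketch of the encoding step described in Section 3.4: a pre-trained VGG16 with its final 1000-way classification layer removed, producing a 4096-dimensional feature vector per image. The Keras layer name 'fc2', the tensorflow.keras import path and the example file name are assumptions made for illustration.

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import image

# Pre-trained VGG16; keep everything up to the second fully connected layer ('fc2'),
# dropping the final 1*1*1000 softmax classification layer.
base = VGG16(weights='imagenet')
encoder = Model(inputs=base.input, outputs=base.get_layer('fc2').output)

def encode_image(path):
    """Return the 1*4096 VGG16 encoding of one image file."""
    img = image.load_img(path, target_size=(224, 224))    # resize to 224*224
    x = image.img_to_array(img)[np.newaxis, ...]          # shape (1, 224, 224, 3)
    x = preprocess_input(x)                               # subtract the ImageNet mean values
    return encoder.predict(x)                             # shape (1, 4096)

features = encode_image('example.jpg')                    # 'example.jpg' is a placeholder file name
print(features.shape)                                     # (1, 4096)
```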


3.5 DECODER

The decoder architecture makes use of a Recurrent Neural Network (RNN). Recurrent neural networks (shown in Fig 3.3) are a type of artificial neural network in which the connections between units form a directed cycle. The advantage of using an RNN over a conventional feed-forward network is that the RNN can process an arbitrary sequence of inputs using its memory. RNNs trace back to 1982, when John Hopfield proposed the famous Hopfield network. In simple terms, recurrent neural networks can be considered as networks with loops which allow information to persist in the network.

Fig 3.3 A typical RNN (Hopfield, 1982)


One of the problems with RNNs is that they do not take long-term dependencies into account. Consider a machine that tries to generate sentences on its own, for instance the sentence "I grew up in England, I speak fluent English". If the machine is trying to predict the last word of the sentence, i.e. "English", it needs to know that the language name following "fluent" depends on the context of the word "England". The gap between the relevant information and the point where it is needed can become very large, in which case conventional RNNs fail. To overcome this problem of long-term dependencies, Hochreiter and Schmidhuber proposed the Long Short-Term Memory (LSTM) network in 1997. Since then, LSTM networks have revolutionized fields such as speech recognition and machine translation. Like conventional RNNs, LSTMs have a chain-like structure, but the repeating modules have a different structure in an LSTM network. A typical Long Short Term Memory network is shown in Fig 3.4.

Fig 3.4 A typical LSTM network (S. Hochreiter et al., 1997)


The key to the LSTM network is the horizontal line running along the top, known as the cell state. The cell state runs through all the repeating modules and is modified at every module with the help of gates. This is what allows information in an LSTM network to persist.

3.5.1 Decoder Architecture

To generate captions from the encoded multi-labels, the typical LSTM network is modified. The architecture of the LSTM model used is shown in Fig 3.5.

Fig 3.5 Modified LSTM (O. Vinyals et al., 2015)


The entire network is governed by equations 3.4-3.9.

    it = σ(Wix xt + Wim mt-1)                                     (3.4)

where it is the input gate at time t and the W matrices are the trained parameters. The variable mt-1 denotes the output of the module at time t-1, and σ represents the sigmoid operation, which outputs numbers between zero and one, describing how much of each component should be let through.

    ft = σ(Wfx xt + Wfm mt-1)                                     (3.5)

where ft represents the forget gate, which controls whether to forget the current cell value.

    ot = σ(Wox xt + Wom mt-1)                                     (3.6)

where ot represents the output gate, which determines whether or not to output the new cell value.

    ct = ft ⊙ ct-1 + it ⊙ σ(Wcx xt + Wcm mt-1)                    (3.7)

where ct is the cell state that runs through all the modules and ⊙ represents element-wise multiplication with a gate value.

    mt = ot ⊙ ct                                                  (3.8)

where mt is the encoded vector which is then fed into the softmax function.

    pt+1 = Softmax(mt)                                            (3.9)

The output pt+1 of a module gives the word prediction. The same LSTM network is repeated until an end token (.) is encountered by the network, as shown in Fig 3.6. The series of these word predictions generates the caption for a given image. The LSTM model is trained to predict each word of the sentence after it has seen the image as well as all preceding words, as defined by P(St|I, S0, ..., St-1).

Fig 3.6 Consolidated LSTM Model (O. Vinyals et al., 2015)
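To make equations 3.4-3.9 concrete, the following is a small NumPy sketch of a single decoder step. The dimensions and random weights are placeholders, the gates are written exactly as in equations 3.4-3.8, and, as in equation 3.9, the softmax is applied directly to mt (in a full model, mt is first projected to vocabulary size).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def lstm_step(x_t, m_prev, c_prev, W):
    """One decoder step following equations 3.4-3.9 (W is a dict of weight matrices)."""
    i_t = sigmoid(W['ix'] @ x_t + W['im'] @ m_prev)                       # input gate,  eq. 3.4
    f_t = sigmoid(W['fx'] @ x_t + W['fm'] @ m_prev)                       # forget gate, eq. 3.5
    o_t = sigmoid(W['ox'] @ x_t + W['om'] @ m_prev)                       # output gate, eq. 3.6
    c_t = f_t * c_prev + i_t * sigmoid(W['cx'] @ x_t + W['cm'] @ m_prev)  # cell state,  eq. 3.7
    m_t = o_t * c_t                                                       # module output, eq. 3.8
    p_next = softmax(m_t)   # word distribution, eq. 3.9 (projected to vocabulary size in practice)
    return m_t, c_t, p_next

# Toy dimensions and random weights, for illustration only
d = 8
rng = np.random.default_rng(0)
W = {k: rng.normal(scale=0.1, size=(d, d))
     for k in ('ix', 'im', 'fx', 'fm', 'ox', 'om', 'cx', 'cm')}
m, c = np.zeros(d), np.zeros(d)
m, c, p = lstm_step(rng.normal(size=d), m, c, W)
print(p.argmax())   # index of the most probable next word
```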

3.6 CONCLUSION

As shown in Table 3.1, the model was able to predict meaningful captions for an image in most cases, using an end-to-end Image Captioning Model with a CNN encoder to extract multiple labels and an RNN decoder to generate the captions. However, the model got confused when generating captions for images that were largely monochromatic, blurred, or indigenous (locally captured). This problem is expected to be reduced by rigorous training on a more comprehensive dataset like PASCAL. The next chapter deals with the Text-to-Speech synthesis of the generated captions.


Table 3.1 Generated Captions (the corresponding images appear in the original report)

Image     Caption
[image]   A group of people are playing soccer in the street.
[image]   A white dog is running through a field.
[image]   A man in a red shirt is riding a bike.


CHAPTER 4 TEXT-TO-SPEECH SYNTHESIS

4.1 INTRODUCTION TO HMMs

HMM-based synthesis, also called statistical parametric synthesis, is used for text-to-speech synthesis. In this system, the frequency spectrum, fundamental frequency and duration of speech are modeled simultaneously by Hidden Markov Models, and speech waveforms are generated from the HMMs themselves based on the maximum likelihood criterion. The frequency spectrum depends on the vocal tract, the fundamental frequency on the voice source, and the duration relates to prosody. The context-dependent HMMs are concatenated and the resultant HMM is used as an observation sequence generator. Unlike the concatenative synthesis approach, voice characteristics such as prosody, speaker identity etc. can be modified by simply varying the HMM parameters, thereby reducing the constraint on the data requirement. A basic HMM is shown in Fig 4.1.

Fig. 4.1 A typical HMM used to predict the weather
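As an illustration of an HMM acting as an observation-sequence generator, in the spirit of the weather example of Fig 4.1, the following NumPy sketch samples a sequence from a two-state model. The state names, probabilities and observation symbols are invented for illustration and are not the parameters used in this project.

```python
import numpy as np

rng = np.random.default_rng(1)

states = ['Sunny', 'Rainy']                      # hidden states (illustrative)
observations = ['walk', 'shop', 'clean']         # discrete observations (illustrative)

A = np.array([[0.8, 0.2],                        # state transition probabilities
              [0.4, 0.6]])
B = np.array([[0.6, 0.3, 0.1],                   # emission probabilities per state
              [0.1, 0.4, 0.5]])
pi = np.array([0.7, 0.3])                        # initial state distribution

def sample_sequence(length):
    """Generate a sequence of observations from the HMM; the states themselves stay hidden."""
    s = rng.choice(len(states), p=pi)
    seq = []
    for _ in range(length):
        seq.append(observations[rng.choice(len(observations), p=B[s])])
        s = rng.choice(len(states), p=A[s])      # move to the next hidden state
    return seq

print(sample_sequence(5))    # e.g. ['walk', 'walk', 'shop', ...]
```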


Further, the footprint size of the resultant system is very small when compared to that of the unit-selection-based approach. However, in the unit selection approach, if all the contextually appropriate subword units for the utterance to be synthesized are present in the speech database, then the quality is expected to be very high, and better than the best of the HMM-based approach.

Fig 4.2 Block diagram of an HMM-based TTS synthesis system (Zen et al., 2007)

HMM-based speech synthesis consists of two phases, namely the training phase and the synthesis phase, and the block diagram of the system is shown in Fig 4.2. In the training phase, the spectral parameters, which are the mel generalized cepstral coefficients (mgc) and their dynamic features (delta and acceleration coefficients), and the excitation parameters, which are the log fundamental frequency (lf0) and its dynamic features, are extracted from the speech data.


Using these features and the time-aligned phonetic transcriptions, context-independent monophone HMMs are trained. The basic subword unit considered for the HMM-based system is the context-dependent pentaphone. These context-dependent models are built starting with a set of context-independent monophone HMMs and refined in sequential steps. These steps involve state-tying, a process in which acoustically similar states are tied in order to reduce the total number of parameters without degrading the performance of the models; tree-based clustering is used for state-tying. In the synthesis phase, spectral and excitation parameters are generated for the sentence and a speech waveform is synthesized.

4.2 DATA COLLECTION

For data collection, 2 hours of speech was recorded in an anechoic chamber in order to avoid noise and interference. Sampling was done at a rate of 16 kHz. A female voice was used to record the text, which comprised short stories in English. This recorded speech was then used to synthesize speech using a Hidden Markov Model, as explained in the following sections.

4.3 DATA PREPARATION

In order to build an HMM-based speech synthesis system, time-aligned phonetic transcriptions are required for the given speech data. In addition to the wave files, label files, which provide information about the occurrence of the speech units in the database, are also required to implement the speech synthesis system. It is also helpful to derive common acoustic models, which provide this alignment information. In order to train common acoustic models, two hours of phonetically balanced English speech data is considered.


Using this data, context-independent HMMs are trained. These models can then be used to derive time-aligned phonetic transcriptions for a new language. The manual segmentation and the corresponding labels are shown in Figs 4.3 and 4.4 respectively.

Fig 4.3 Manual segmentation of phonemes

Fig. 4.4 Labels of recorded speech


4.4 TRAINING PHASE

In the training stage, context-dependent phoneme HMMs are trained using a speech database. The spectrum and F0 are extracted at each analysis frame as the static features from the speech database and are modeled by multi-stream HMMs, in which the output distributions for the spectral and log F0 parts are modeled using a continuous probability distribution and the multi-space probability distribution respectively.

In the training phase, spectral and excitation features are first extracted from the input speech signal. The spectral features used are 105-dimensional and correspond to the mel generalized cepstral coefficients and their first and second derivatives. The 3-dimensional excitation features correspond to the log fundamental frequency and its dynamic features. These features are used to train four-stream context-independent and context-dependent HMMs. The models are trained with five states and a single mixture component per state. Duration Gaussian models (with one state and one mixture component) are also trained for each speech unit. The basic unit used in the HMM-based synthesizer is a pentaphone with 48 additional contextual features. Owing to the large amount of context information considered, it is not possible to create a database that covers all possible contexts/units. In order to develop an HMM-based speech synthesis system (HTS) capable of synthesizing high quality speech, the following are required to be derived accurately: segmented speech data, utterances derived in the FestVox framework, and the question set. A sketch of how the dynamic (delta and acceleration) features can be computed from a static parameter trajectory is given below.
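The report states that each static parameter stream is augmented with its delta and acceleration coefficients; the following NumPy sketch shows one common way to compute them with regression windows. The specific window coefficients are HTS-style defaults assumed here for illustration, and the 35-dimensional static mgc stream is inferred from the 105-dimensional figure quoted above (35 static + 35 delta + 35 acceleration).

```python
import numpy as np

def add_dynamic_features(static, delta_win=(-0.5, 0.0, 0.5), accel_win=(1.0, -2.0, 1.0)):
    """Append delta and acceleration coefficients to a (frames x dims) static feature matrix.

    The window coefficients are typical HTS-style defaults, assumed for illustration.
    """
    padded = np.pad(static, ((1, 1), (0, 0)), mode='edge')     # repeat the edge frames
    n = len(static)
    delta = sum(w * padded[i:i + n] for i, w in enumerate(delta_win))
    accel = sum(w * padded[i:i + n] for i, w in enumerate(accel_win))
    return np.hstack([static, delta, accel])                   # (frames x 3*dims)

# Example: a 35-dimensional mgc stream becomes 105-dimensional, matching the report
mgc = np.random.randn(200, 35)        # 200 frames of static mgc (random stand-in values)
features = add_dynamic_features(mgc)
print(features.shape)                 # (200, 105)
```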


4.5 SYNTHESIS PHASE

In the synthesis stage, an arbitrary input text is first transformed into a sequence of context-dependent phoneme labels. Based on the label sequence, a sentence HMM is constructed by concatenating context-dependent phoneme HMMs. Using an MLSA (Mel Log Spectral Approximation) filter, speech is synthesized from the generated mel-cepstral and F0 parameter sequences.

In the synthesis phase, given a test sentence in text format, the corresponding context-dependent label files are generated. According to the label sequence, a sentence-level HMM is generated by concatenating context-dependent HMMs. Then, using the speech parameter generation algorithm, a sequence of speech parameters, namely the spectral and excitation parameters, is determined in such a way that its output probability is maximized. Finally, speech is synthesized directly from the generated spectral and excitation parameters using a source-system synthesis filter, namely the Mel log spectral approximation filter.

4.6 CONCLUSION

Thus, a speech synthesis model was built using HMM principles, involving the various installation, splicing and segmentation steps. Two hours of speech was recorded, after which the wave files were spliced and segmented. Segmentation was done manually for the first 5 minutes, while splicing was done for the entire duration; Festival was used to perform automatic segmentation and generate the lab files. The performance of the integrated model is discussed in the following chapter.


CHAPTER 5 SPEECH ENABLED VIRTUAL ASSISTANT USING IMAGE CAPTIONING

5.1 INTRODUCTION

In the previous chapters, we introduced and discussed each individual model, i.e. the image captioning model and the TTS system, required to generate a speech signal that describes the input image. This chapter addresses the architecture and steps involved in building the Speech Enabled Virtual Assistant (SEVA) by integrating the two models. It also analyzes the performance of the virtual assistant using two major parameters: the Bilingual Evaluation Understudy (BLEU) score for image captioning and the Mean Opinion Score (MOS) for the TTS synthesis system.

5.2 INTEGRATION

The block diagram of the virtual assistant is given in Fig. 5.1. We construct the virtual assistant by integrating the two modules that act as the building blocks of the project, namely the Image Captioning module and the Text-to-Speech synthesis system.

Fig 5.1 Architecture of the model.


5.2.1 Image Captioning

The Image Captioning module is an end-to-end neural network consisting of a CNN-based encoder (VGG16) followed by an RNN decoder (LSTM) for language generation. Preprocessing is performed on the input image by resizing it to 224*224 and subtracting the mean RGB value from each pixel. Multiple labels are extracted from each image using a 16-layer pretrained VGG16 network, an image classification network proposed by the Visual Geometry Group at ILSVRC 2014. Convolution is performed using 3*3 filters with the stride length fixed at 1, and max pooling using a 2*2 window with the stride length fixed at 2. As we are only extracting the feature vectors from the image, we do not require the final fully connected layer that performs classification. Therefore, the output of the VGG16 network is a 4096*1*1 encoding that is passed on to the language-generating RNN. The annotations for the images and their corresponding captions in the Flickr8k dataset are shown in Fig 5.2 and Fig 5.3.

Fig 5.2 Image files and their captions


Fig 5.3 Annotated captions with their respective image files
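As an illustration of how these annotation files can be read, the sketch below parses a Flickr8k caption file into a dictionary mapping each image id to its five captions. The file name Flickr8k.token.txt and the tab-separated image_name#index / caption layout are assumptions based on the standard Flickr8k distribution rather than details given in this report.

```python
from collections import defaultdict

def load_captions(token_file='Flickr8k.token.txt'):
    """Parse the Flickr8k caption file into {image_id: [caption, ...]}.

    Assumed line format (standard Flickr8k distribution):
    <image_name>.jpg#<caption_index>\t<caption text>
    """
    captions = defaultdict(list)
    with open(token_file, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            image_key, caption = line.split('\t', 1)
            image_id = image_key.split('#')[0]       # drop the "#0".."#4" suffix
            captions[image_id].append(caption.lower())
    return captions

captions = load_captions()
print(len(captions))                                  # number of annotated images
```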

The language-generating LSTM is also implemented in Keras. A vocabulary was developed from the captions by transforming words into integers. The next step involves converting these integers into fixed-length vectors to be fed into the LSTM. This is done using a Dense layer in Keras, with the first parameter being the input dimension, which is the same size as the vocabulary, and the last parameter set to 256, which defines the output dimension of the layer. The LSTM layer is used to predict the word embeddings as explained in Chapter 3. The merge functionality is then used to merge the two branches, i.e. the text (vocabulary) branch and the image encodings, together. A sketch of such a merge-style model is given below.
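The report does not list the model definition, so the following is only a rough Keras sketch of the merge-style architecture described above: an image-feature branch and a text branch, combined and fed to a softmax over the vocabulary. The maximum caption length, the use of an Embedding layer and the exact layer sizes are illustrative assumptions.

```python
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, Dropout, add
from tensorflow.keras.models import Model

vocab_size = 1352    # vocabulary size reported for the Flickr8k training captions
max_length = 34      # assumed maximum caption length in tokens

# Image branch: the 4096-dimensional VGG16 encoding projected to 256 dimensions
image_input = Input(shape=(4096,))
image_branch = Dense(256, activation='relu')(Dropout(0.5)(image_input))

# Text branch: the partial caption (word indices) embedded and fed through an LSTM
caption_input = Input(shape=(max_length,))
caption_branch = LSTM(256)(Embedding(vocab_size, 256, mask_zero=True)(caption_input))

# Merge the two branches and predict the next word over the vocabulary
merged = add([image_branch, caption_branch])
output = Dense(vocab_size, activation='softmax')(Dense(256, activation='relu')(merged))

model = Model(inputs=[image_input, caption_input], outputs=output)
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()
```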

Equations 3.1-3.3 in Chapter 3 are implemented on the LSTM through the following steps (a sketch of the resulting greedy decoding loop is given after this list):

1. Feed the image into the VGG16 model, which produces an image embedding that is stored as a pickle file.
2. The image embedding is the input of the language-generating RNN (LSTM) at t=0. This yields the probability distribution for the first word.


3. The first word is chosen by selecting the word with the highest probability.
4. This word is fed into the consolidated network (Fig 3.6) to generate a word embedding.
5. This word embedding is fed back into the LSTM model at t=1, giving the probability distribution for the second word.
6. The process is continued until the end-of-sentence word is generated.
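The following sketch shows the greedy decoding loop implied by these steps. It assumes a trained model with the two-input signature sketched earlier in Section 5.2.1, a Keras tokenizer holding the word/index mapping, and 'startseq'/'endseq' boundary tokens; all of these names are assumptions made for illustration, not details given in the report.

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo_feature, max_length=34):
    """Greedy decoding: repeatedly pick the most probable next word (illustrative sketch)."""
    caption = 'startseq'                                    # assumed start-of-sentence token
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([caption])[0]    # words -> integers
        seq = pad_sequences([seq], maxlen=max_length)
        probs = model.predict([photo_feature, seq], verbose=0)[0]
        word = tokenizer.index_word.get(int(np.argmax(probs)))
        if word is None or word == 'endseq':                # assumed end-of-sentence token
            break
        caption += ' ' + word
    return caption.replace('startseq', '').strip()

# Usage (photo_feature is the (1, 4096) VGG16 encoding of the input image):
# print(generate_caption(model, tokenizer, photo_feature))
```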

The model was trained for 70 epochs on a ZOTAC GeForce GTX 1050 Ti GPU using a batch size of 512.

5.2.2 Text-to-Speech Synthesis System

The first step in building a TTS system is data collection. Two hours of speech was recorded in an anechoic chamber in order to prevent interference and nullify noise. Sampling was done at a rate of 16 kHz. A female voice was used to record the text, which comprised short stories in English.

In order to build an HMM-based speech synthesis system, time-aligned phonetic transcriptions are required for the given speech data. In addition to the wave files, label files, which provide information about the occurrence of the speech units in the database, are also required. Using the recorded data, context-independent HMMs are trained; these models can then be used to derive time-aligned phonetic transcriptions for a new language.


Five minutes of phonetically balanced speech data is considered first. Using the common acoustic models provided and the corresponding phonetic transcription, the time-aligned phonetic transcription is derived using the forced-Viterbi alignment procedure. Manual segmentation and the generated labels are shown in Figs 5.4 and 5.5. The automatically derived boundaries may not be accurate; using visual representations such as the speech waveforms and the corresponding spectrograms, the boundaries are corrected wherever required. Using this data, context-independent phoneme models are trained. Using these models and the phonetic transcriptions, the entire speech data is segmented using the forced-Viterbi alignment procedure. Using the newly derived time-aligned phonetic transcription (phone-level label files), new context-independent phoneme models are trained. These two steps were repeated five to six times. After N iterations, the resultant HMMs are used to segment the entire speech data again, and these boundaries are considered final.

Fig 5.4 Manual segmentation of phonemes


Fig. 5.5 Labels of recorded speech

A map table is created and the MFCCs are extracted. Forced-Viterbi alignment is performed for the first 5 minutes of data, and a master label file (MLF) is created from the text corresponding to those 5 minutes. Manual correction of the boundaries in the resulting output is done using WaveSurfer. The list of phones is obtained from the label files, and the context-independent models are trained with the corrected label files. After checking that context-independent models have been trained for all the phones in the language, the models are concatenated and forced-Viterbi alignment is performed on the entire database. The MLF is generated again and the process is repeated on the data folder until the segmentation is satisfactory. Next, the 105-dimensional mel generalized cepstral coefficients (mgc), the 3-dimensional log fundamental frequencies (lf0), the composite features (cmp), and the context-dependent label files are generated. Once the features and the context-dependent label files are obtained, context-dependent models are built and speech is synthesized using the trained models.


5.2.3 Integrated Model

The integration of the image captioning model and the TTS system is performed as follows. The image is given as input to the image captioning system. The image captioning system, running on a TensorFlow backend, generates a caption that describes the input image. This caption is stored as a string in a variable, Image_Caption. The variable is then fed into the hidden-Markov-model-based text-to-speech synthesis system. This system, running on the Festival Speech Synthesis System, generates a speech output for the corresponding text caption. In this manner, the Speech Enabled Virtual Assistant is designed by integrating the two individual modules. A sketch of this integration step is given below.
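The report does not show the glue code, so the following is only a sketch of one way this hand-off could be scripted, assuming the caption generator sketched in Section 5.2.1 and a local Festival installation whose text2wave utility converts a text file to a waveform; the file names are placeholders.

```python
import subprocess
import tempfile

def speak_caption(image_caption, wav_path='caption.wav'):
    """Hand the generated caption to Festival's text2wave utility (assumed to be installed)."""
    with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
        f.write(image_caption)
        text_file = f.name
    # text2wave reads the text file and writes a synthesized waveform
    subprocess.run(['text2wave', text_file, '-o', wav_path], check=True)
    return wav_path

# Integration: image -> caption -> speech (names refer to the earlier illustrative sketches)
# photo_feature = encode_image('input.jpg')                          # VGG16 encoder (Chapter 3)
# image_caption = generate_caption(model, tokenizer, photo_feature)  # LSTM decoder (Section 5.2.1)
# speak_caption(image_caption)
```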


5.3 PERFORMANCE ANALYSIS

In order to measure the performance of the consolidated system, each module was tested separately. The BLEU score was chosen as the parameter to evaluate the accuracy of the image captioning module, and the Mean Opinion Score was used for the TTS system.

5.3.1 Measuring the Performance of the Image Captioning Model

We define the accuracy of the Image Captioning model using BLEU scores. Bilingual Evaluation Understudy (BLEU) is an algorithm that evaluates the quality of text that has been translated by a machine, and it was one of the first metrics to achieve high correlation with human judgment. The BLEU score is always defined between 0 and 1, where 0 means the machine output is not at all related to the reference sentence. BLEU compares a candidate translation of text against one or more reference translations; although developed for translation, it can be used to evaluate text generated for a range of natural language processing tasks. The Python Natural Language Toolkit (NLTK) library provides an implementation of the BLEU score that can be used to evaluate generated text against a reference, with the candidate sentence provided as a list of tokens.

The Flickr8k dataset of 8000 images was split into 7000 training and 1000 test images. Captions were generated for the 1000 test images and the corresponding mean BLEU score was calculated. To calculate the BLEU score, we first generate captions for all the test images and use these machine-generated captions as candidate sentences. Each candidate sentence is compared with the 5 human-provided captions, and the BLEU score of the candidate is averaged over these references. Thus, for the 1000 test images, 1000 BLEU scores are calculated using NLTK and averaged. The net BLEU score of the model after training for 70 epochs with a batch size of 512 was found to be 0.43 (43.2%), while the state of the art on Flickr8k is around 66%. On increasing the number of epochs, we may approach state-of-the-art results, but that would require more computation. The net BLEU score can also be improved by decreasing the batch size. A sketch of the BLEU computation is given below.
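As an illustration of how NLTK's BLEU implementation can be applied per test image and then averaged, consider the sketch below. The example captions are invented, and a smoothing function is added because short hypotheses otherwise score zero on the higher-order n-grams; the report does not state which smoothing, if any, was used.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1

def mean_bleu(references_per_image, candidates):
    """Average sentence-level BLEU over all test images.

    references_per_image: list of lists of tokenized reference captions (5 per image)
    candidates:           list of tokenized generated captions
    """
    scores = [sentence_bleu(refs, cand, smoothing_function=smooth)
              for refs, cand in zip(references_per_image, candidates)]
    return sum(scores) / len(scores)

# Tiny illustrative example (the real evaluation uses the 1000 Flickr8k test images)
refs = [[['a', 'white', 'dog', 'runs', 'through', 'a', 'field'],
         ['a', 'dog', 'is', 'running', 'in', 'a', 'field']]]
cands = [['a', 'white', 'dog', 'is', 'running', 'through', 'a', 'field']]
print(mean_bleu(refs, cands))
```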

Table 5.1 demonstrates the results for images taken from the test dataset, the internet, and indigenous images taken on our phone camera. However, the system gets confused in the case of images dominated by a single colour, and it makes errors while determining gender. The system is expected to show better results when trained for a greater number of epochs on a more diverse dataset.

Table 5.1 Results of Image Captions with BLEU Scores (the corresponding images appear in the original report)

Image     Generated Caption                                  Reference Caption                          Corresponding BLEU Score
[image]   A boy is jumping off a ramp in the park.           A boy is jumping in a park.                0.55
[image]   A tennis player playing tennis.                    A woman tennis player is playing tennis.   0.47
[image]   A man in a black shirt is standing on the road.    A man in a blue shirt is standing.         0.71


5.3.2 Measuring the Performance of the Text-to-Speech Synthesis System

For measuring the quality of the Text-to-Speech model, the Mean Opinion Score (MOS) was used. MOS historically originates from subjective measurements where listeners would sit in a "quiet room" and score the quality of a telephone call as they perceived it. This test methodology has been in use in the telephony industry for decades and has been standardized: the talker should be seated in a quiet room with a volume between 30 and 120 cubic metres and a reverberation time of less than 500 ms (preferably in the range 200-300 ms), and the room noise level must be below 30 dBA with no dominant peaks in the spectrum. Requirements for other modalities were later specified similarly in ITU recommendations. The MOS of the TTS system was calculated to be 2.8 for naturalness and 3.1 for intelligibility.

5.4 CONCLUSION

This chapter described the overall integrated model and analyzed its performance. The next chapter presents conclusions and future work.


CHAPTER 6 CONCLUSIONS AND FUTURE WORK

6.1 CONCLUSIONS

A prototype of the Virtual Assistant capable of describing an image was developed. This was done using an Image Captioning System that made use of a CNN encoder and an RNN decoder, together with a Text-To-Speech system built using a Hidden Markov Model. The encoder was based on a pretrained VGG16 model, and the decoder was a modified LSTM capable of translating the encodings generated by the encoder into meaningful captions. The TTS system, modeled on an HMM, was used to generate intelligible speech from the given text input. The accuracy of the model was measured using two parameters: the Bilingual Evaluation Understudy score for image captioning and the Mean Opinion Score for the TTS synthesis system.

6.2 FUTURE WORK

Future work on this project would involve real-time testing in natural surroundings. The project has a lot of scope not only as a learning tool for people diagnosed with ASD, but also as a virtual assistant for people with visual impairments. Given the computational complexity of the project, it should be trained on a more powerful NVIDIA GeForce GPU. The project can also be extended to include other regional languages like Tamil, Hindi etc.


REFERENCES

[1] Boothalingam, R., Solomi, V. S., Gladston, A. R., Christina, S. L., Vijayalakshmi, P., Thangavelu, N., & Murthy, H. A. (2013), 'Development and evaluation of unit selection and HMM-based speech synthesis systems for Tamil', In 2013 National Conference on Communications (NCC), pp. 1-5.
[2] Barik, D., & Mondal, M. (2010), 'Object identification for computer vision using image segmentation', In 2010 2nd International Conference on Education Technology and Computer, Vol. 2, pp. 170-172.
[3] Farhadi, A., Hejrati, M., Sadeghi, M. A., Young, P., Rashtchian, C., Hockenmaier, J., & Forsyth, D. (2010), 'Every picture tells a story: Generating sentences from images', In European Conference on Computer Vision, pp. 15-29.
[4] Hochreiter, S., & Schmidhuber, J. (1997), 'Long short-term memory', Neural Computation, Vol. 9, No. 8, pp. 1735-1780.
[5] Hopfield, J. J. (1982), 'Neural networks and physical systems with emergent collective computational abilities', Proceedings of the National Academy of Sciences, Vol. 79, pp. 2554-2558.
[6] Hodosh, M., Young, P., & Hockenmaier, J. (2013), 'Framing image description as a ranking task: Data, models and evaluation metrics', Journal of Artificial Intelligence Research, Vol. 47, pp. 853-899.
[7] Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002), 'BLEU: a method for automatic evaluation of machine translation', In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311-318.
[8] Rachel, G. A., Solomi, V. S., Naveenkumar, K., Vijayalakshmi, P., & Nagarajan, T. (2015), 'A small-footprint context-independent HMM-based synthesizer for Tamil', International Journal of Speech Technology, Vol. 18, No. 3, pp. 405-418.


[9] Sulír, M., Juhár, J., & Rusko, M. (2017), 'Development of the Slovak HMM-based TTS system and evaluation of voices in respect to the used vocoding techniques', Computing and Informatics, Vol. 35, No. 6, pp. 1467-1490.
[10] Tokuda, K., Zen, H., & Black, A. W. (2002), 'An HMM-based speech synthesis system applied to English', In IEEE Speech Synthesis Workshop, pp. 227-230.
[11] Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015), 'Show and tell: A neural image caption generator', In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156-3164.
[12] Wang, J., Yang, Y., Mao, J., Huang, Z., Huang, C., & Xu, W. (2016), 'CNN-RNN: A unified framework for multi-label image classification', In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2285-2294.
[13] Zen, H., Tokuda, K., & Black, A. W. (2009), 'Statistical parametric speech synthesis', Speech Communication, Vol. 51, No. 11, pp. 1039-1064.

