CITIZEN SOCIAL MEDIA SENTIMENT ANALYSIS: BUILDING A MODEL TO MEASURE OPINIONS OF CITIZENS ON UK’S PLANNED HIGH SPEED 2 RAILWAY LINE
by
FAREED IDDRISU IBRAHIM
(w144913642) Supervised by PHILIP WORRALL
Submitted in partial fulfilment of the requirements of the Dept of Business Information Systems of the University of Westminster for award of the Master of Science
SEPTEMBER 2014
DECLARATION
I, (FAREED IDDRISU IBRAHIM) declare that I am the sole author of this Project; that all references cited have been consulted; that I have conducted all work of which this is a record, and that the finished work lies within the prescribed word limits.
This has not previously been accepted as part of any other degree submission.
Signed : Date :
FORM OF CONSENT
I, (FAREED IDDRISU IBRAHIM) hereby consent that this Project, submitted in partial fulfilment of the requirements for the award of the MSc degree, if successful, may be made available in paper or electronic format for inter-library loan or photocopying (subject to the law of copyright), and that the title and abstract may be made available to outside organisations.
Signed : Date :
ABSTRACT
Sentiment Analysis (SA) has been used widely as a text mining tool to find out the sentiment polarity of a given corpus. In this research a comprehensive study of sentiment analysis is undertaken with a view to applying it to the emerging field of government and citizen interaction via the Twitter social media platform, as a case study. The case study uses the UK government’s proposal for a planned high speed rail line, which has become a subject of public debate in the UK. The study therefore analyses the sentiments or opinions of a section of the citizens who are expressing their views on Twitter. SA is applied to the collected data with the aim of determining the general polarity of users’ views about the project. The study details the collection of the relevant data and the analyses applied to it: term frequency analysis, Latent Dirichlet Allocation (LDA) topic modelling, and a pre-built naïve Bayes classifier that classifies sentiment into three main polarities: positive, negative and neutral. The results are presented and a case is made for the application of SA by governments as a way of finding out the sentiment of their citizenry.
ACKNOWLEDGEMENT
All praise is due to God Almighty for giving me the opportunity to undertake a dissertation of this kind and for keeping me alive and healthy during this period. A special thank you to my parents, Mr and Mrs Iddrisu, as well as my siblings Asmau, Raqiba and Hajira, for their love and support in deciding to fund my master’s course solely from their own finances, and for reposing a lot of trust and confidence in my ability to undertake this academic endeavour. I would not have been in a position to start writing such a dissertation without the training I received from the various lecturers at the University of Westminster during the course of study. I would therefore like to acknowledge the contribution of all the lecturers and staff of the university who in one way or another have contributed to my successful stay at the university. My decision to undertake this particular topic was due to the interesting approach my lecturer and supervisor Philip Worrall took in his course, Web and Social Media Analytics. I would also like to acknowledge the effort Philip put into guiding my work and ensuring that I produce a good dissertation. Finally, to all who have in one way or another provided assistance to me during my course at the university, I would like to thank you all.
Table of Contents
DECLARATION
FORM OF CONSENT
ABSTRACT
ACKNOWLEDGEMENT
LIST OF FIGURES AND TABLES
LIST OF ABBREVIATIONS
CHAPTER ONE
1. INTRODUCTION
1.1 BACKGROUND
1.1.1 GOVERNMENT CITIZEN RELATIONSHIP
1.1.2 SOCIAL MEDIA DATA FOR ANALYSIS
1.1.3 SOCIAL NETWORKS
1.1.4 SOCIAL MEDIA VS SURVEY
1.1.5 TWITTER SOCIAL NETWORK AND DATA
1.1.6 SENTIMENT ANALYSIS / OPINION MINING
1.1.7 BUSINESS INTELLIGENCE
1.1.8 HIGH SPEED RAIL
1.1.9 PROJECT SCOPE AND OBJECTIVES
1.1.10 JUSTIFICATION AND CONTRIBUTION
CHAPTER TWO
2. LITERATURE REVIEW
2.1 GENERAL SENTIMENT ANALYSIS
2.2 THE OBJECTIVE/TASK OF SENTIMENT ANALYSIS
2.3 THE CHALLENGE OF SENTIMENT ANALYSIS
2.4 METHODOLOGIES USED IN SENTIMENT ANALYSIS
2.5 IDENTIFYING THE SEMANTIC ORIENTATION OF WORDS
2.5.1 THE LEXICONS APPROACH
2.5.2 USING TRAINING DOCUMENTS
2.5.3 IDENTIFYING SEMANTIC ORIENTATION OF SENTENCES AND PHRASES
2.5.4 IDENTIFYING THE SEMANTIC ORIENTATION OF DOCUMENTS
2.5.5 OBJECT FEATURE EXTRACTION
2.5.6 COMPARATIVE SENTENCE IDENTIFICATION
2.6 SENTIMENT ANALYSIS USING TWITTER DATA
2.7 GOVERNMENT CITIZEN SENTIMENT ANALYSIS
2.8 AN OVERVIEW OF DATA MINING (STRUCTURED) AND TEXT MINING (UNSTRUCTURED DATA)
2.9 SENTIMENT ANALYSIS AND MODELLING TECHNIQUES
2.10 OVERVIEW AND WAY FORWARD FROM THE LITERATURE REVIEW
CHAPTER THREE
3 PROBLEM SPECIFICATIONS
3.1 METHODOLOGY
3.2 METHODOLOGICAL JUSTIFICATION
3.3 SOFTWARE USE JUSTIFICATION
CHAPTER FOUR
4.0 PROJECT IMPLEMENTATION
4.1 UNDERTAKING SENTIMENT ANALYSIS
4.1.1 TERM FREQUENCY ANALYSIS
4.1.2 LATENT DIRICHLET ALLOCATION TOPIC MODELLING
4.1.3 SENTIMENT ANALYSIS SCORE
4.2 CHALLENGES AND ADJUSTMENTS
CHAPTER FIVE
5. RESULTS AND ANALYSIS
5.1 TERM FREQUENCY RESULTS AND ANALYSIS
5.2 LDA TOPIC MODELLING RESULTS AND ANALYSIS
5.2.1 MODEL EVALUATION
5.3 SUMMARY OF FINDINGS
5.4 FUTURE WORK
CHAPTER SIX
6 CONCLUSION
BIBLIOGRAPHY
APPENDIX
LIST OF FIGURES AND TABLES
FIGURE 1.1 WORD CLOUD FOR SENTIMENT ANALYSIS
TABLE 1.0 TYPES OF SOCIAL MEDIA
FIGURE 1.2 DEPICTING THE INTERCONNECTIVITY OF TWITTER
FIGURE 1.3 A SCREEN SHOT OF SENTIMENT RELATING TO HS2 FROM TWITTER
FIGURE 1.4 MAP OF PROPOSED ROUTE FOR HS2 (BBC)
FIGURE 2.1 AN EXAMPLE OF A SYSTEM ARCHITECTURE FOR SENTIMENT ANALYSER
FIGURE 3.1 METHODOLOGICAL STEPS
FIGURE 4 PROCESS DIAGRAM SHOWING THE STAGES OF ANALYSIS
FIGURE 4.1 LDA GRAPHICAL REPRESENTATION
FIGURE 5.1 DISPLAY OF TERMS FOR X = 100
TABLE 5.1 TERMS AND THEIR FREQUENT ASSOCIATED TERMS
FIGURE 5.2 LDA TOPIC MODELLING RESULTS
FIGURE 5.2 PERPLEXITY VALUE FOR LDA TOPIC MODELLING
FIGURE 5.3 SCATTER PLOT FOR LDA TOPIC MODELLING
FIGURE 5.3 RESULTS OF SENTIMENT ANALYSIS
FIGURE 5.4 A PLOT OF SENTIMENT SCORE
LIST OF ABBREVIATIONS
API : Application Programming Interface
CSV : Comma Separated Values
HS2 : High Speed Rail Network Phase 2
IBM : International Business Machines
LDA : Latent Dirichlet Allocation
NLP : Natural Language Processing
NLTK : Natural Language Toolkit
POS : Part of Speech Tagging
SA : Sentiment Analysis
UNPACS : United Nations Public Administration Studies
UK : United Kingdom
CHAPTER ONE
1. INTRODUCTION
Citizen social media sentiment analysis is a relatively new dimension of the broader field of social media sentiment analysis. It involves the engagement of governments, public institutions and the citizenry using social media as the common platform. Governments across the world increasingly face the challenge of serving the interests of their citizens rather than their own. Citizens across the world are demanding a greater say in the governance process than ever before. A 2012 UNPACS report notes that citizens are increasingly getting involved in the governance process of their communities and countries. This engagement implies the involvement of citizens in the decision-making process of the state through measures and institutional arrangements, so as to increase their influence on public policies and programmes and ensure a more positive impact on their social and economic lives (UN, 2012). The effects of citizen marginalisation were manifested in the events of the now infamous Arab Spring of 2010, which saw popular uprisings by citizens of some Arab countries against their governments. Governments of these countries were toppled by mass protests and demonstrations of their own citizens amidst the loss of lives and property. It is widely reported that these uprisings, protests and demonstrations were coordinated using social media such as Facebook and Twitter (Arunachalam & Sarkar, 2013). These events seem to have redefined the role of citizens in modern-day governance. New disruptive technologies such as mobile applications, cloud computing and social media have emerged as tools and conduits through which governments and citizens can communicate effectively, and so move closer towards a relationship in which the opinions of citizens are taken into account and those of the government are explained to citizens.
Recent trends have seen social media transform into a rich repository of data that can be analysed using data mining and analytic techniques to gain insights into the trends the data contains. The goal of any data mining process is to support critical decision making using a scientific approach rather than intuition. This gives credence to the application of data and text mining in analysing the opinions of citizens, which can help governments make better-informed choices based on the points of view of their citizens. Sentiment analysis or opinion mining, a subset of data mining, can provide the vehicle by which thousands of opinions can be analysed. This non-trivial technology has been successfully used as a business intelligence tool by many businesses in areas such as customer relationship management, targeted marketing, political campaigns, mass movements, disaster and crisis response and news reporting (Gundecha & Liu, 2012). The success of this technology is what has informed its application to the area of government’s relationship with its citizens, using a similar approach to derive better decision making for governments.
This research therefore seeks to explore the application of sentiment analysis to the relatively new dimension of government sentiment analysis. The research work involves finding how governments can put in place a system to mine data from social media. The focus of the research is to build a model that can be used to analyse the sentiments of citizens concerning government policies, programmes and projects. In order to demonstrate the importance of the topic, a case study is undertaken using the proposed construction of a high speed railway line (HS2) in the UK. HS2 is a planned modern high speed rail network that seeks to link up major cities of the UK from London through Birmingham to Manchester and Leeds, with a possible expansion to Scotland. This proposed project, however, has generated a lot of controversy within British society about its implementation. Various sentiments expressed in this debate are used to build a model for the sentiment analysis. The goal is to classify these sentiments as positive, negative or neutral based on data from the social media site Twitter. From the classification, it should be possible to measure the sentiments and opinions of UK citizens concerning the implementation of the HS2 project.
Figure 1.1 Word cloud for sentiment analysis
1.1 BACKGROUND
1.1.1 GOVERNMENT CITIZEN RELATIONSHIP
Government can be defined as the political system by which a country is administered and regulated (Encyclopedia Britannica, 2014). Governments therefore play a vital role in the lives of citizens by ensuring that their welfare is cared for. It is therefore imperative that governments build a relationship with their citizens in which the sentiment of citizens is taken into account when taking decisions that affect their lives (Schellong, 2008). According to Schellong (2008), sentiments offer policy makers information to:
Understand and establish public needs,
Develop, communicate and distribute public services,
Assess the degree of public service satisfaction.
1.1.2 SOCIAL MEDIA DATA FOR ANALYSIS
Social media can be defined as "a group of Internet-based applications that build on the ideological and technological foundations of Web 2.0 and that allow the creation and exchange of user-generated content" (Kaplan & Haenlein, 2010). Social media involves interaction amongst people in which they create, share or exchange information and ideas in virtual communities and networks. Social media has gained worldwide acclaim and popularity and is transforming the way people communicate, form relationships, connect to each other, live and work (Arunachalam & Sarkar, 2013). According to Search Engine Watch (2014), the following statistics give an idea of the fast pace at which the social media landscape is growing: almost a billion people are signed up to at least one social media type; 1.43 billion people worldwide visited a social networking site in 2012; nearly 1 in 8 people worldwide have their own Facebook page; 3 million new blogs come online every month; and 65% of social media users said they use it to learn more about brands, products and services. This explosive growth can be attributed to the ubiquity of the internet and of communication devices such as computers and smartphones.
The term social media is used to make a distinction from traditional forms of media such as TV, radio and the press. Traditional media follow a unidirectional delivery paradigm from business to consumer: information is produced by media sources or advertisers and transmitted to media consumers. In contrast, Web 2.0 technologies resemble consumer-to-consumer services. They allow users to interact and collaborate with each other in a social media dialogue of user-generated content in a virtual community. Social media is a generic term that encompasses the various platforms that engage people socially on the internet. As Table 1.0 shows, the various social media sites offer different types of service and thus create different formats of data, mainly text, image and video. For example, Twitter, Facebook and YouTube provide text, image and video services respectively, and though there may be overlap, each is specialised for its own type of service (Hu & Liu, 2012). The focus in this research, however, is on text data.
CATEGORY              REPRESENTATIVE WEBSITE
Blogging              Blogger, LiveJournal, WordPress, Huffington Post
Wiki                  Wikipedia, Wikihow, WikiTravel, Scholarpedia
Social News           Digg, Mixx, Slashdot, Reddit
Micro blogging        Twitter, Google Buzz, Tumblr, Jaiku, Plurk
Opinion & Reviews     Epinions, Yelp
Question Answering    Yahoo! Answers, Quora
Academic Networking   ResearchGate, Academia.edu, Slideshare
Media Sharing         Flickr, YouTube, Vimeo
Social Bookmarking    Delicious, CiteULike
Social Networking     Facebook, LinkedIn, Myspace, Google+
Table 1.0 Types of Social Media
1.1.3 SOCIAL NETWORKS
One of the most popular subsets of social media is the online social network. Online social networks are defined as networks of interactions or relationships, where the nodes consist of actors and the edges consist of the relationships or interactions between these actors (Aggarwal, 2011). Comparing the definitions of social media and social network shows that they differ to some degree. This indicates a difference between the two terms, though they are often used interchangeably. Examples of popular social networks include Facebook, Twitter, LinkedIn and Google+. A published Aberdeen Group Bench Report shows that more than 84% of best-in-class companies improved their overall performance, customer satisfaction, risk management and actionable insights through social media monitoring and analysis (Zabin & Jefferies, 2008). The way corporate businesses are involving social media in their operations provides a pathway for governments to follow suit in engaging the people on whose behalf they serve. A novel aspect of the use of social networks is the ability to analyse discussions and posts for the purpose of gaining valuable insights. These insights can then be used for decision making. Around 2007, researchers and analysts started to take notice of the importance and value of social media monitoring and sentiment analysis (Arunachalam & Sarkar, 2013). Analysing this data is a non-trivial process that involves using software to collect thousands of sentiments from social media platforms. After collection, various text and data mining algorithms are used to analyse the collected data.
1.1.4 SOCIAL MEDIA VS SURVEY
Compared to traditional survey polls, running an analysis on social media is attractive for a number of reasons. First, social media analysis is cheaper and faster than traditional surveys, and enables continuous monitoring of public opinion through real-time analysis (Xin, Gallagher, Cao, Luo, & Han, 2010). Offline surveys, by contrast, are by definition more static. Social media analysis therefore captures the reaction of public opinion in almost real time, and also allows trends and breaking points to be observed (Ceron, Curini, Mlacos, & Porro, 2013). In addition, traditional surveys pose solicited questions, and it is well known that this approach might inflate the share of strategic answers. Conversely, sentiment analysis does not use questionnaires and focuses only on listening to the stream of unsolicited opinions freely expressed on the Internet. In other words, sentiment analysis adopts a bottom-up approach, at least compared with the more traditional top-down approach of offline surveys (Ceron, Curini, Mlacos, & Porro, 2013). Without claiming that all of the comments posted on social networks contain the sincere opinion of the author, we can argue that the Internet may represent, to a large extent, an arena in which users are free to express themselves (Savigny, 2002). Thus, the social network should be less affected by the spiral of silence. Moreover, while web analysis must contend with the problem of silent users, surveys face the problem of low response rates (Ceron, Curini, Mlacos, & Porro, 2013).
1.1.5 TWITTER SOCIAL NETWORK AND DATA
Twitter is a microblogging service that allows communication through short messages of up to 140 characters, which roughly correspond to thoughts or ideas. Twitter is akin to a free, high-speed, global text-messaging service, which enables rapid and easy communication. What differentiates Twitter is its asymmetric following model, which satisfies human curiosity. It is the asymmetric following model that casts Twitter as more of an interest graph than a social network, and the Application Programming Interfaces (APIs) that provide just enough of a framework for structure and self-organising behaviour to emerge from the chaos. What this means is that whereas some social websites like Facebook and LinkedIn require the mutual acceptance of a connection between users, Twitter’s relationship model allows a user to keep up with the latest happenings of any other user, even though that other user may not choose to follow back or even know the follower exists. Because Twitter enables users to create, connect and explore a community of interest for an arbitrary topic, the power of Twitter and the insights that can be gained from mining its data become much more obvious (Russell, 2013). In June 2012 Twitter reported 340 million tweets from 140 million active users (Twitter, 2012), and in 2013 more than 400 million tweets per day were reported (Wickre, 2013). Inherent in these data are discussions about the like or dislike of products and services, the breaking and updating of news, and the use of Twitter as a public relations medium by some businesses, politicians and celebrities (Zhao, et al.). Twitter can therefore be considered a rich source of social data, due to its inherent openness for public consumption as well as the ease of access to the data through APIs.
This has led to very high interest among researchers. Research work done includes the topological characteristics of Twitter (Kwak, Lee, Park, & Moon, 2009), tweets as social sensors of real-time events (Sakaki, Okazaki, & Matsuo, 2010), and the forecasting of box-office revenues for movies (Asur & Huberman, 2010). In this project Twitter is the primary social network from which data is sourced for the case study, for the reasons explained above.
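The asymmetric following model described in this section can be illustrated with a small directed graph. The following is a minimal Python sketch and not code from this study: the account names and the helper functions `is_mutual` and `one_way_follows` are hypothetical and introduced purely for illustration. It shows that a follow on Twitter is a one-way edge, whereas a Facebook-style connection needs an edge in both directions.

```python
from collections import defaultdict

# Hypothetical follow relationships as (follower, followed) pairs.
# The account names are illustrative only, not real data.
follows = [
    ("alice", "hs2_news"),
    ("bob", "hs2_news"),
    ("alice", "bob"),
    ("bob", "alice"),
]

following = defaultdict(set)  # user -> accounts the user follows
followers = defaultdict(set)  # user -> accounts that follow the user

for follower, followed in follows:
    following[follower].add(followed)
    followers[followed].add(follower)


def is_mutual(a, b):
    """Return True when a and b follow each other (a two-way tie)."""
    return b in following[a] and a in following[b]


def one_way_follows(user):
    """Return the accounts that `user` follows without being followed back."""
    return {other for other in following[user] if user not in following[other]}
```

In this toy data `is_mutual("alice", "bob")` holds because each follows the other, while `one_way_follows("alice")` contains only `hs2_news`, the kind of one-directional interest link that the asymmetric model permits.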
Figure 1.2 Depicting the interconnectivity of Twitter
1.1.6 SENTIMENT ANALYSIS / OPINION MINING
Sentiment analysis or opinion mining is defined as the computational study of people’s opinions, appraisals, attitudes and emotions toward entities, individuals, issues, events, topics and their attributes (Liu & Zhang, 2012). It is a field of machine learning that employs computational power and well-designed software such as the Natural Language Toolkit (NLTK) to process large amounts of text data with a view to analysing the sentiments or opinions expressed in these text corpora. Humans have always made decisions based on the sentiments of others. For example, a prospective student applying to a university might base his or her choice on positive sentiments about the university, and the choice not to buy a product might be borne out of negative reviews the product has generated. These and many similar situations illustrate the importance of sentiment-based decision making (Liu & Zhang, 2012). However, decisions that are irreversible once made require not just a few sentiments but hundreds or thousands of varying opinions in order to reach the best decision, and this is the situation governments face daily in the governance process. Analysing thousands of sentiments is beyond human capability and requires computational processes. With computational resources now easily available and relatively cheap, the application of sentiment analysis is more practicable and easier to undertake now than ever before (Pang & Lee, 2008).
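To make the idea of classifying text into positive, negative and neutral polarity concrete, the sketch below counts hits against two small word lists. It is an illustration under stated assumptions only: it uses a simple lexicon-counting technique rather than the pre-built naïve Bayes classifier used in this study, and the word lists, function names and sample messages are invented for the example.

```python
import re

# Tiny illustrative word lists (assumptions for this example).
# A real study would use a published lexicon or a trained classifier.
POSITIVE = {"good", "great", "support", "benefit", "jobs", "faster"}
NEGATIVE = {"bad", "waste", "oppose", "destroy", "expensive", "scrap"}


def polarity(text):
    """Label text as positive, negative or neutral by counting lexicon hits."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def summarise(texts):
    """Count how many texts fall into each polarity class."""
    counts = {"positive": 0, "negative": 0, "neutral": 0}
    for text in texts:
        counts[polarity(text)] += 1
    return counts


# Invented example messages, not collected tweets.
sample = [
    "HS2 will bring jobs and faster journeys",
    "HS2 is an expensive waste of money",
    "HS2 phase two route announced",
]
print(summarise(sample))  # one message in each class for this sample
```

Counting words in this way ignores negation, sarcasm and context, which is one reason trained classifiers are generally preferred for real data; the sketch only shows how thousands of messages can be reduced to a polarity tally automatically.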
Sentiment analysis is a prominent and active area of research, spurred particularly by the rapid growth of web social media and the opportunity to access the valuable opinions of numerous participants on various business and social issues. (Ghiassi, Skinner, & Zimbra, 2013) The field of sentiment analysis forms part of the wider discipline of business intelligence.
Figure 1.3 A screenshot of sentiment relating to HS2 from Twitter
1.1.7 BUSINESS INTELLIGENCE
Since 2004 business intelligence has been a “top three” key information systems (IS) management issue and application development area (Luftman & Mclean, 2012). The term business intelligence (BI) was first used by IBM researcher Hans Peter Luhn (Azevedo & Santos, 2012), who based his definition on the Webster dictionary. Since then there have been a number of definitions of BI. According to Golfarelli et al., BI is the process by which businesses transform relatively meaningless data into useful, actionable information and then into knowledge (Golfarelli, Dario, & Rizzi, 2004). This knowledge can be used to guide the business in the running of its day-to-day activities, as well as serving as a basis on which strategic planning and decision-making processes can be efficiently and effectively carried out. Lonnqvist & Pirttimaki (2006) also define it as “an organised and systematic process by which organisations acquire, analyse, and disseminate information from both internal and external information sources significant for their business activities and for decision making”. BI, as an integral part of decision support systems (DSS) (Azevedo & Santos, 2012), has attracted a great deal of interest from both industry and research (Arnott & Pervan, 2005) because of the critical role it can play in helping organisations and businesses derive value from their data. BI is a broad term or concept, and there are many other similar and partially overlapping terms such as Competitive Intelligence, Customer Intelligence, Market Intelligence and Strategic Intelligence (Lonnqvist & Pirttimaki, 2006).
1.1.8 HIGH SPEED RAIL
High speed rail is the world standard for long distance inter-city rail travel. The standard speeds for a high speed rail today, is 250kph. High speed rails operate in about 13 countries such as Japan, China, Italy, France Germany, UK etc. The UK currently has an existing high speed line (HS1) connecting St. Pancras international station in London with Kent and a second channel connects Ebbsfleet station with St. Pancras. The full HS1 service was connected in December 2009. Like many other countries Britain is investing in high speed rail to create space on overcrowded networks and enable large numbers of people to move efficiently. This would be the biggest transport project undertaken for a generation. It will fundamentally improve rail infrast
structure in Britain, breaking the 21 century railway thinking and practices. (Hs2 story, 2014) TH
On 28
January 2013 the UK government released phase two of HS2. (IPSOS, 2013) This is
a new rail line that has been proposed to be built from London to as far as Leeds and beyond .Phase one is from London to Birmingham, whiles phase two would continue from the West midlands to Manchester and Leeds. (Hs2 story, 2014) The proposal to undertake the HS2 project by the UK government has been welcomed and met with some resistances by the UK public. According to the BBC, the £32 billion project is the subject of heated debate right from the political class to the everyday citizen. (BBC, 2014) Proponents of the project have hailed it as a move to among others, bridge the north south economic and developmental gap, provide jobs, and reduce the travel time between the major
cities of the UK, while catering for the increasing commuter population and handling more freight across the cities. (BBC, 2014) A formidable coalition of cities has united to press for the delivery of the “once in a century promise” of HS2. The City Council Leaders of Sheffield, Leeds, York, Newcastle, Nottingham, Liverpool, Derby and Manchester have launched a new Connected Cities campaign group, speaking with one clear voice to press key national decision makers to commit to delivering HS2 to the North. Connected Cities is built upon a strong local consensus amongst key business and political leaders from each city. (City of York, 2014) Opponents, on the other hand, have cited concerns such as negative environmental impact, waste of money, and the risk of a white elephant, especially in the age of telecommunications where people can communicate over long distances. Finally, some inhabitants of villages through which the line would pass have voiced concerns that the project would destroy the picturesque character of their villages. (BBC, 2014) The biggest opponent of this project is the campaign group “Stop HS2”. The civil society group’s mission is:
To stop HS2 by persuading the Government to scrap the HS2 proposal.
To facilitate local and national campaigning against High Speed Two.
Figure 1.4 Map of proposed route for HS2 (BBC)
Even within the political class, though in principle all political parties support the project, there are different opinions on how it should be implemented or which routes the line should take. (Butcher, 2014) (BBC, 2014) From research it is evident that citizens across the UK have divided opinions about the UK government undertaking the project. This makes it a good basis for use as a case study in this project.
1.1.9 PROJECT SCOPE AND OBJECTIVES
This research work falls under the academic discipline of business intelligence and analytics. Business intelligence is a vast field of study, of which data mining, text mining and sentiment analysis are part. This research aims to apply traditional sentiment analysis to the relatively new area of citizen sentiment analysis. The application of sentiment analysis using twitter data as a source is also a young field of research, as most sentiment analysis in the past has been applied to movie reviews, product reviews and blog data. The objectives of this research therefore are:
Apply sentiment analysis to the emerging field of citizen sentiment analysis.
Develop a proposed sentiment analysis model to generate analytical insights into the sentiments of citizens about the HS2 rail project.
1.1.10 JUSTIFICATION AND CONTRIBUTION
In the information age, social media is transforming the way people communicate, connect and form relationships, and even the way we live and work. (Arunachalam & Sarkar, 2013) The opportunity this provides is the vast amount of textual data that is generated on a daily basis.
Advances in computational processing have made it easy to apply data and text mining to huge data sets, so the opportunities to generate meaningful insights from these data sets are enormous and beneficial. Sentiment analysis has been applied successfully, mostly in commercial businesses (Pang & Lee, 2008), and it is possible to replicate such success in public administration and governance. Social media presents itself as a ‘big data’ source of citizen voice and opinion, providing a deep insight into what citizens want. (Arunachalam & Sarkar, 2013)
Analysing these data contributes valuable information that can enhance decision making. For example, in this research, gaining an insight into the various topics that are discussed around the HS2 project should give an indication of which aspects of the project are most talked about. Further analysis to find the overall sentiment polarity can also give an indication as to whether citizens are in favour of or against the project.
Making decisions based on proven scientific and intelligent methodologies is much better than taking such decisions on intuition, and that is what this research attempts to do by applying these methodologies to the vast amount of data that is generated daily concerning the HS2 project.
This research work adds to the body of knowledge, and future work can address any shortfalls that this research has not covered. Government and public institutions can also benefit from the work, especially because it is aimed at how they can interact more with their citizens.
Finally, undertaking this research work will contribute immensely to my own understanding of the academic discipline of sentiment analysis, and of how to apply it to a real world problem such as the sentiments about the HS2 project.
CHAPTER TWO
2. LITERATURE REVIEW
In this chapter we review literature connected to the aim of our research. This takes the form of summaries of the parts of each work under review that are relevant to this study. The review is centred on the discipline of sentiment analysis as a field of study and how it can be applied to the general objective of the government-citizen relationship. Literature on twitter-specific sentiment analysis is also reviewed. Literature was found mainly through the University’s online library search and library books that documented journals relevant to this study. A few sources were taken from the Web of Science and Google Scholar/Books. The references used are mostly from journals, articles, conference proceedings and reports contained in the ACM Digital Library and ScienceDirect. Some reading was also done on a few white paper reports, books and research work carried out in this field.
KEYWORDS USED IN SEARCH: Sentiment Analysis, Social media Analytics, Twitter, unstructured data
2.1 GENERAL SENTIMENT ANALYSIS
As a field of research, sentiment analysis can be said to be part of computational linguistics, natural language processing (NLP), and text mining. It is also called opinion mining, subjectivity analysis and appraisal extraction. (Mejova, 2009) Generally speaking, sentiment analysis aims to uncover an author’s view towards a subject or the overall contextual polarity of a text. (Mejova, 2009) Sentiment analysis has been applied to many corpora such as news blogs (Bautin, Ward, Patil, & Skiena, 2010), movie reviews (Pang, Lee, & Vaithyanathan, 2002), citizen political preference (Ceron, Curini, Mlacos, & Porro, 2013) etc. Research in this field has largely focused on two things: identifying whether a given textual entity is subjective (i.e. a sentence that expresses a personal view) or objective (i.e. a sentence that presents factual information), and identifying the polarity of subjective text (Pang & Lee, 2008).
2.2 THE OBJECTIVE/TASK OF SENTIMENT ANALYSIS
The objectives of sentiment analysis as described by (Liu & Zhang, 2012) include the following: entity extraction and grouping, aspect extraction and grouping, opinion holder and time extraction, aspect sentiment classification and opinion quintuple generation. A widely researched task is sentiment or opinion detection, which is viewed as classification of text as objective or subjective. Usually opinion detection is based on the examination of adjectives in sentences. For example, the polarity of the sentence “this is a nice car” can be determined easily by looking at the adjective. Hatzivassiloglou and Wiebe examined the effects of adjectives on sentence subjectivity. (Hatzivassiloglov & Wiebe, 2000) Later studies (Benamara, Cesarano, Picariello, Reforgiato, & Subrahmanian, 2007) have shown that adverbs may be used for a similar purpose. The second task is polarity classification. Given an opinionated piece of text, the goal is to classify the opinion as belonging to one of two opposing sentiment polarities, or to locate its position on the continuum between these two polarities. (Pang & Lee, 2008) When viewed as a binary feature, polarity classification is the binary classification task of labelling an opinionated document as expressing either an overall positive or an overall negative opinion. Most of this research was done on product reviews, where the definitions of “positive” and “negative” are clear. Other tasks, such as classifying news as “good” or “bad”, present some difficulty. A news article may contain “bad” news without actually using any subjective terms. Furthermore, these classes usually appear intermixed when a document expresses both positive and negative sentiments. Then the task can be to identify the main sentiment of the document.
(Mejova, 2009) To distinguish between different mixtures of the two opposites, polarity classification uses a multi-point scale (such as the number of stars for a movie review). This is where the task becomes a multi-class text categorization problem. But unlike topic-based multi-class classification problems, where vocabularies differ for each class (or overlap slightly), the vocabularies for positive, neutral, and negative classes can be very much alike, and differ only in a few crucial words. Since many documents have a “mixed” opinion, this class is actually a combination of positive and negative. Negations, which tend to be disregarded in much of text analysis as unimportant, play an important role in sentiment, flipping an originally positive term into a negative one, and vice versa. The above two tasks can be done at several levels: term, phrase, sentence, or document level. It is common to use the output of one level as the input for the higher layers (Dave, Lawrence, & Pennock, 2003). For instance, we may apply sentiment analysis to phrases, and then use this information to evaluate sentences, then paragraphs, etc. Different techniques are suitable for different levels. Techniques using n-gram
classifiers or lexicons usually work at term level, whereas Part-Of-Speech tagging is used for phrase and sentence analysis. Heuristics are often used to generalize the sentiment to document level. (Mejova, 2009) A third task that is complementary to sentiment identification is the discovery of the opinion’s target. The difficulty of this task depends largely on the domain of the analysis. (Mejova, 2009) It is usually safe to assume that product reviews talk about the specified product. On the other hand, general writing such as webpages and blogs does not always have a pre-defined topic, and often mentions many objects. Another lively area of research is feature extraction, given an object or topic of the text. (Liu, Hu, & Cheng, 2005) Liu defines features as either components or attributes of an object, which is the definition most used in practice. (Liu B., 2006) Sometimes there is more than one target in a sentiment sentence, which is the case in comparative sentences. A subjective comparative sentence ranks objects in order of preference, for example, “this camera is better than my old one”. These sentences can be identified using comparative adjectives and adverbs (more, less, better, longer), superlative adjectives (most, least, best) and other words such as same, differ, win, prefer, etc. (Liu B., 2006)
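The cue-word idea for spotting comparative sentences can be sketched as follows. This is a minimal illustration that uses only the cue words listed above; the function names are hypothetical, and a cue word only marks a sentence as a candidate, since the same words also occur in sentences that compare nothing.

```python
import re

# Cue words taken from the paragraph above. A real system would more likely
# rely on part-of-speech tags than on a fixed word list.
COMPARATIVE_CUES = {"more", "less", "better", "longer"}
SUPERLATIVE_CUES = {"most", "least", "best"}
OTHER_CUES = {"same", "differ", "win", "prefer"}
ALL_CUES = COMPARATIVE_CUES | SUPERLATIVE_CUES | OTHER_CUES


def comparative_cues(sentence):
    """Return the sorted cue words found in a sentence (case-insensitive)."""
    tokens = set(re.findall(r"[a-z']+", sentence.lower()))
    return sorted(tokens & ALL_CUES)


def is_candidate_comparative(sentence):
    """True if the sentence contains at least one cue word."""
    return bool(comparative_cues(sentence))
```

For example, `comparative_cues("This camera is better than my old one")` picks out the cue "better", while a sentence with no cue word is not flagged.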
2.3 THE CHALLENGE OF SENTIMENT ANALYSIS
Research indicates that sentiment analysis presents a much more complex challenge than traditional topic modelling. (Pang, Lee, & Vaithyanathan, 2002) This is despite the fact that sentiment analysis classifies text into 3 main classes, while topic modelling involves an arbitrary number of topics. (Pang & Lee, 2008) Sentiment classification classifies an opinion document, e.g. a product review, as expressing a positive, negative or neutral sentiment. The task is also commonly known as document-level sentiment classification because it considers the whole document as the basic information unit. (Liu & Zhang, 2012) The main reason why sentiment analysis is more difficult than topic-based classification is that topic-based classification can be done with the use of keywords, while this does not work well in sentiment analysis. (Turney, 2002) Other reasons that make sentiment analysis difficult include the difficulty of determining whether a given text is objective or subjective (there is always a thin line between the two) and the difficulty of determining the opinion holder. Sentiment can also be expressed in subtle ways without any ostensible use of negative words. For example, “how could anyone sit through this movie?” contains no single word that is obviously negative, yet it could be classified as a negative review of a movie. Thus sentiment requires more understanding than the usual topic-based classification. Other factors include dependency on domain and on other words (Pang & Lee, 2008), and opinions expressed with sarcasm, irony, and negation.
2.4 METHODOLOGIES USED IN SENTIMENT ANALYSIS
A wide range of tools and techniques can be employed to tackle the goals described in the previous section. This section therefore describes some of the most common and widely used ones.
Classification: Many of the tasks in Sentiment Analysis can be thought of as classification. (Mejova, 2009) Machine Learning offers many algorithms designed to undertake that, but this task of classifying text according to its sentiment presents many unique challenges. These can be formulated in one question: “What kinds of features do we use?”
Term Frequency or Presence: Traditional Information Retrieval systems have long emphasized the importance of term frequency. The TF-IDF (Term Frequency - Inverse Document Frequency) measure is widely used in modelling documents. (Jones, 1972) TF-IDF is a measure of how concentrated the occurrences of a given word are in relatively few documents. (Rajaraman & Ullman, 2011) The intuition is that terms that appear often in the document but seldom in the whole collection are more informative as to what the document is about than terms mentioned just once. (Mejova, 2009) TF-IDF has been shown to be quite effective in sentiment classification. (Liu & Zhang, 2012) In the field of Sentiment Analysis we find that instead of paying attention to the most frequent terms, it is more beneficial to seek out the most unique ones. Pang et al. improved the performance of their system using term presence instead of frequency. (Pang, Lee, & Vaithyanathan, 2002) Wiebe and Hoffmann state in their paper that “apparently people are creative when they are being opinionated”, implying the importance of low-frequency terms in opinionated texts. (Wiebe & Hoffmann, 2005)
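The difference between TF-IDF weighting and term presence can be illustrated with a short sketch. It assumes documents are already tokenised and non-empty, and it uses one common TF-IDF variant (relative term frequency multiplied by log(N / document frequency)); the cited works may use other weightings, and the function names are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenised, non-empty documents.

    TF is the term count divided by the document length;
    IDF is log(N / df), where df is the number of documents containing the term.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def term_presence(doc, vocabulary):
    """Binary presence features: 1 if the term occurs at all, otherwise 0."""
    present = set(doc)
    return [1 if term in present else 0 for term in vocabulary]
```

A term that occurs in every document receives a weight of zero under this variant, whereas a term concentrated in one document receives a positive weight. Term presence ignores how often a term is repeated.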
n-grams: Term positions are also important in document representation for Sentiment Analysis. The position of terms determines, and sometimes reverses, the polarity of the phrase. So, position information is sometimes encoded into the feature vector. (Pang, Lee, & Vaithyanathan, 2002) Wiebe and Hoffmann select n-grams (n=1,2,3,4) based on precision calculated using annotated documents. (Wiebe & Hoffmann, 2005) Each element of an n-gram is a word-stem, part-of-speech pair; for instance, (in-prep the-det can-noun) is a 3-gram.
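Extracting n-grams for n = 1 to 4 can be sketched as below. For simplicity the sketch works on plain tokens rather than the word-stem and part-of-speech pairs described above, and it does not reproduce the precision-based selection step; the function names are hypothetical.

```python
def ngrams(tokens, n):
    """Return the list of contiguous n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(tokens, max_n=4):
    """Collect all n-grams for n = 1..max_n."""
    features = []
    for n in range(1, max_n + 1):
        features.extend(ngrams(tokens, n))
    return features
```

A token list shorter than n simply yields no n-grams of that length.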
Part-of-Speech: Adjectives are a good indicator of sentiment in text, and in the past decade they have been commonly exploited in Sentiment Analysis (Whitelaw, Garg, & Argamon, 2005). This is true for other fields in textual analysis, since part-of-speech tags can be considered to be a crude form of word sense disambiguation. (Wilks & Stevenson, 1998) In his work, Turney used part-of-speech patterns that included an adjective, and went further to use adverbs as well, for sentiment detection at the document level. (Turney, 2002) Syntax information has also been used in feature sets, though there is still discussion about the advantages of this information in sentiment classification (Pang & Lee, 2008). This information may, however, include important text features such as negation, intensifiers, and diminishers. (Kennedy & Inkpen, 2006) A subtree-based boosting algorithm with dependency-tree-based features has also been used for polarity classification and shown to outperform the bag-of-words baseline.
Negations: Negations have long been known to be integral to Sentiment Analysis. The usual bag-of-words representation of text disconnects all of the words, and considers sentences like “I like this car” and “I don’t like this car” very similar, since only one word distinguishes one from the other. But when talking about sentiment, a negation changes the polarity of a whole phrase. Negations are often considered in post-processing of results, while the original representation of text ignores them (Hu & Liu, 2005). Alternatively, one could explicitly include the negation in the document representation by appending it to the terms that are close to negations; for example, the term “like-NOT” would be extracted from “I don’t like this book” (Pang & Lee, 2008). Using collocation in this way may, however, be too crude a technique. It would be incorrect to negate the sentiment in a sentence such as “No wonder everyone loves this car”. To handle such cases, Na et al. use specific part-of-speech tag patterns to identify the negations relevant to the sentiment polarity of a phrase. (Na, Sui, Khoo, Chan, & Zhou, 2004)
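The “like-NOT” representation can be sketched as follows. The negation word list and the choice to extend the negation scope up to the next punctuation mark are assumptions made for this illustration, not details taken from the works cited above.

```python
import re

# Assumed negation words; a fuller list would be needed in practice.
NEGATION_WORDS = {"not", "no", "never", "don't", "doesn't", "didn't",
                  "isn't", "can't", "won't"}
PUNCTUATION = set(".,!?;:")


def mark_negation(text):
    """Append '-NOT' to each token between a negation word and the next punctuation mark."""
    text = text.lower().replace("\u2019", "'")  # normalise curly apostrophes
    tokens = re.findall(r"[a-z']+|[.,!?;:]", text)
    marked, negating = [], False
    for token in tokens:
        if token in PUNCTUATION:
            negating = False            # punctuation closes the negation scope
            marked.append(token)
        elif token in NEGATION_WORDS:
            negating = True             # the negation word itself is dropped
        elif negating:
            marked.append(token + "-NOT")
        else:
            marked.append(token)
    return marked
```

Applied to “I don't like this book”, the sketch produces the term "like-NOT". Applied to “No wonder everyone loves this car”, it wrongly produces "loves-NOT", which is exactly the crudeness noted above.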
2.5 IDENTIFYING THE SEMANTIC ORIENTATION OF WORDS
According to (Mejova, 2009), one of the most basic tasks in Sentiment Analysis is identifying the semantic orientation (the polarity and objectivity) of a word. A variety of techniques have been used, which can be roughly categorized as follows:
using a lexicon, constructed manually or automatically.
using some statistical techniques such as looking at the co-occurrence of a word with a word of known polarity.
using training documents, labelled or unlabelled, as a source of knowledge about the polarity of terms within the collection
Hybrid Approach
Each of the techniques cited above has its advantages and difficulties, which will be reviewed here.
2.5.1 THE LEXICONS APPROACH
Extended lexicons are a fundamental part of Sentiment Analysis, but not all of them are alike. The simplest are those with a binary classification of words into positive vs. negative polarities or objective vs. subjective. A finer distinction between the classes can be made with fuzzy lexicons where each label has a score associated with it, conveying the “strength” of the label. (Mejova, 2009) A variety of lexicons have been created for use in Sentiment Analysis, often by extending existing general-purpose lexicons. For example, Subasic and Huettner manually constructed a lexicon associating words with affect categories, specifying an intensity (strength of affect level) and centrality (degree of relatedness to the category). (Subasic & Huettner, 2001) Besides manual annotation, other resources can be used to build lexicons. Existing lexicons can be augmented to include sentiment information. Princeton University’s WordNet lexicon has been one of the most popular ones to be used for Sentiment Analysis. As described on http://wordnet.princeton.edu/, WordNet® is a large lexical database of English. (Mejova, 2009) Taboada et al. used a semantic orientation calculator (SO-CAL). It uses dictionaries of words annotated with their semantic orientation (polarity and strength), and incorporates intensification and negation. SO-CAL is applied to the polarity classification task, the process of assigning a positive or negative label to a text that captures the text’s opinion towards its main subject matter. (Taboada, Brooke, Tofiloski, Voll, & Stede, 2011)
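A lexicon with strength scores, combined with intensification and negation, can be sketched as below. This is not SO-CAL: the lexicon entries, the multipliers and the rule that negation simply flips the sign are all hypothetical values chosen for illustration.

```python
# Hypothetical toy lexicon: word -> semantic orientation
# (the sign gives the polarity, the size gives the strength).
LEXICON = {"good": 2.0, "great": 3.0, "bad": -2.0, "terrible": -4.0}
# Hypothetical intensifiers and diminishers: multiplier for the next opinion word.
MODIFIERS = {"very": 1.5, "extremely": 2.0, "slightly": 0.5}
NEGATORS = {"not", "never", "no"}


def lexicon_score(tokens):
    """Sum the orientation of opinion words, adjusted by preceding modifiers and negators."""
    total, multiplier, negated = 0.0, 1.0, False
    for token in tokens:
        if token in MODIFIERS:
            multiplier *= MODIFIERS[token]
        elif token in NEGATORS:
            negated = True
        elif token in LEXICON:
            value = LEXICON[token] * multiplier
            total += -value if negated else value
            multiplier, negated = 1.0, False  # reset after each opinion word
    return total
```

A positive total suggests a positive text and a negative total a negative one. With the toy values above, “very good” scores higher than “good”, and “not good” scores below zero.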
2.5.2 USING TRAINING DOCUMENTS
It is possible to perform sentiment classification using statistical analysis and machine learning tools that take advantage of the vast resources of labelled documents available (labelled manually by annotators or by using a star/point system). (Mejova, 2009) Product review websites like C-NET, Ebay, RottenTomatoes and the Internet Movie Database (IMDB) have all been extensively used as sources of annotated data. The star (or tomato, as it were) system provides an explicit label of the overall polarity of the review, and it is often taken as a gold standard in algorithm evaluation. (Mejova, 2009) Manually labelled data is available through evaluation efforts such as the Text REtrieval Conference (TREC), NII Test Collection for IR Systems
(NTCIR), and Cross Language Evaluation Forum (CLEF). The datasets produced often serve as standards in the Information Retrieval community, including for Sentiment Analysis researchers. Individual researchers and research groups have also produced many interesting data sets. An example is the Congressional floor-debate transcripts published by Thomas and Pang, which contain political speeches that are labelled to indicate whether the speaker supported or opposed the legislation discussed. (Thomas & Pang, 2006) Once a desirable data set has been obtained, a variety of machine learning algorithms can be used to train sentiment classifiers. Some of the most popular algorithms are Support Vector Machines, Naive Bayes, and maximum entropy-based classifiers. (Mejova, 2009)
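Training a classifier from labelled documents can be sketched with a small multinomial Naive Bayes implementation. The class below is a minimal illustration with add-one smoothing, and the four training documents are invented for the example; it is not the classifier used in any of the works cited above.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words features with add-one smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, tokens):
        """Return log P(class) + sum of log P(word | class) for each class."""
        n_docs = sum(self.class_counts.values())
        scores = {}
        for label, doc_count in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(doc_count / n_docs)
            for token in tokens:
                if token not in self.vocabulary:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(self.vocabulary)))
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Invented training data, for illustration only.
train_docs = [
    "great project more jobs".split(),
    "faster travel great news".split(),
    "waste of money".split(),
    "terrible waste destroys villages".split(),
]
train_labels = ["positive", "positive", "negative", "negative"]
model = NaiveBayes().fit(train_docs, train_labels)
```

Calling `model.predict("great jobs".split())` returns the class whose log score is highest. With a realistic data set the same interface would be trained on many thousands of labelled documents.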
2.5.3 IDENTIFYING SEMANTIC ORIENTATION OF SENTENCES AND PHRASES
Given the semantic orientation of individual words, it is often desirable to extend this to the phrase or sentence the word appears in. One of the most straightforward ways to accomplish this is to take an average of the polarities of words in the sentence. Hu and Liu write: “if positive/negative opinion prevails, the opinion sentence is regarded as a positive/negative one”. In the case that the number of positive and negative opinion words is the same, they take the orientation of the closest opinion sentence. (Hu & Liu, 2005)
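The majority rule quoted above can be sketched as follows. The opinion-word lists are hypothetical, and the tie-breaking step is simplified: the sketch reuses the label of the preceding sentence as a stand-in for “the closest opinion sentence”, and it treats a sentence with no opinion words as a tie.

```python
# Hypothetical opinion-word lists, for illustration only.
POSITIVE_WORDS = {"good", "great", "fast", "love"}
NEGATIVE_WORDS = {"bad", "slow", "waste", "hate"}


def sentence_orientation(tokens, previous="neutral"):
    """Majority vote over opinion words; on a tie, reuse the previous sentence's label."""
    positives = sum(token in POSITIVE_WORDS for token in tokens)
    negatives = sum(token in NEGATIVE_WORDS for token in tokens)
    if positives > negatives:
        return "positive"
    if negatives > positives:
        return "negative"
    return previous


def label_sentences(sentences):
    """Label tokenised sentences in order, carrying the last label forward on ties."""
    labels, previous = [], "neutral"
    for tokens in sentences:
        previous = sentence_orientation(tokens, previous)
        labels.append(previous)
    return labels
```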
Another common way is to train a Naive Bayes classifier using sentences and documents labelled as opinionated or factual as examples of the two categories. (Yu & Hatzivassiloglou, 2003) The authors used features including words, bigrams, and trigrams, as well as the parts of speech in each sentence. They also use the presence of words with known polarities in a sentence as an indication that the sentence is subjective, and they take into consideration the effect of negation words such as “no”, “not”, and “yet” appearing in a window of 5 words around the word in question. Although simplistic, this heuristic has been shown to work in most cases. An even more sophisticated combination of sentiment labels is possible by taking advantage of syntactic relationships between words. For example, Popescu and Etzioni use an unsupervised classification technique, Relaxation Labelling, that extends the label attributed to the word to the sentence it appears in. This approach takes into account, among other things, the negation modifiers. (Popescu & Etzioni, 2005)
2.5.4 IDENTIFYING THE SEMANTIC ORIENTATION OF DOCUMENTS
Although most of the research work done is on determining the semantic orientation of words and phrases, some tasks like summarization and text retrieval may require semantic labelling of the whole document. (Mejova, 2009) It may not make much sense to do this for long documents such as articles or books, which have been a key form in traditional Information Retrieval. However, in the age of social networking and internet commerce, there is a vastly increasing number and variety of short documents, often containing only a few sentences. These may be product reviews, emails, blog posts, etc. (Mejova, 2009) Much like approaches for identifying the semantic orientation of words, those for documents also range from simple statistical ones to ones using elaborate knowledge structures to guide the process. One of the most popular, and simple, methods is a linear combination of all polarities. Dave et al. use averaging to determine the polarity of documents. (Dave, Lawrence, & Pennock, 2003)
2.5.5 OBJECT FEATURE EXTRACTION
Object feature extraction deals with finding out which features of an entity an opinion refers to. It is another important part of sentiment analysis. In shorter, more focused documents it is often safe to assume that the author is only talking about the topic of the document. Product reviews, for example, usually contain opinions about that product, and movie reviews talk about the movies in question. Yet it is often not enough to know the general topic of the writing. (Mejova, 2009) A company making a product would certainly want to know not only what people think about the product in general, but which features they like or dislike in particular. Thus, the task of feature extraction (where a feature can be any target of an opinionated statement) has been gaining popularity in the field of Sentiment Analysis. A common approach is to use part-of-speech (POS) tags to construct templates of how sentiment is applied to objects. (Hu & Liu, 2005)
2.5.6 COMPARATIVE SENTENCE IDENTIFICATION
Another important research area in Sentiment Analysis is the study of comparative sentences. Liu defines a comparative sentence as “a sentence that expresses a relation based on similarities or differences of more than one object”. (Liu B., 2006) These can be classified into types, such as gradable and non-gradable comparisons. A gradable comparison is based on the relationship of greater than, equal to, or less than. For example, “Intel chip is faster than the AMD one” ranks the objects in quality. In a non-gradable comparison the features are compared, but not ranked in order of preference: “Coke tastes differently from Pepsi”. Both types of sentences tell us something about the relationships between different objects. Thus, one of the outputs of a comparative sentence analysis system could be a ranking of products, as determined by the opinion holders. So far though, identification of comparative sentences has been the primary focus of the computational linguistics community. (Mejova, 2009)
2.6 SENTIMENT ANALYSIS USING TWITTER DATA
Undertaking sentiment analysis using twitter data presents its own opportunities and challenges compared with other data sources such as blogs and articles, and with general sentiment analysis studies. Twitter is unique in the following ways: twitter posts are short, with a maximum of 140 characters allowed. This makes users very efficient with their participation in social media discussions. (Hu & Liu, 2012) However, twitter messages are full of misspelled words and slang. (Go, Bhayani, & Huang, 2009) Notwithstanding, twitter still presents itself as a good source of data for machine learning and sentiment analysis, due to the availability of huge amounts of twitter data for training and testing and the easy access to twitter data using APIs. (Pak & Paroubek, 2010) Twitter enables users to use the “#” symbol, called a hashtag, to mark keywords or topics in a tweet (tag information). (Hu & Liu, 2012) A review of some work done on twitter is presented below. Go et al. carried out one of the earliest works on twitter data, undertaking sentiment classification. The authors focused on using emoticons to help construct large corpora of structured sets of texts, labelling the tweets according to the emoticons. The authors built models using Naïve Bayes, MaxEnt and Support Vector Machine (SVM) classifiers, with a mutual information measure for feature selection. This approach showed high performance on the two-class classification problem. However, the method shows unsatisfactory results with three classes (“negative”, “positive” and “neutral”). (Go, Bhayani, & Huang, 2009) Pak and Paroubek used twitter as a popular micro blogging platform to conduct sentiment analysis on an extensive collection of tweets. A corpus of tweets was analysed as positive, negative and neutral tweets.
The authors labelled a tweet as positive if the message includes a happy emoticon (“:-)”, “:)”, “=)”, “:D”, etc.) and as negative if a sad emoticon is used (“:-(”, “:(”, “=(”, “;(”, etc.). For the objective tweets, however, they retrieved posts from the Twitter accounts of popular newspapers and magazines. After the data collection, they did some linguistic analysis on the dataset, using POS tagging with the aim of finding any differences between subjective (positive and negative) and objective sentences. The authors noted that there were differences between the POS tags of subjective and objective Twitter posts. They also noted that there are differences in the POS tags of positive and negative posts. Data was cleaned by removing URL links, user names (those that are marked by @), RT (for retweet), the emoticons, and stop words. Finally they tokenized the dataset and constructed n-grams. Then they experimented with several classifiers including SVM, but Naive Bayes was found to give the best result. They trained two Naive Bayes classifiers. One of them uses n-gram presence, and the other, POS tag presence. The likelihood of a sentiment (positive, negative, neutral) for a Twitter post is obtained as the sum of the summed log-probabilities of the n-grams present and the summed log-probabilities of the POS tags of the n-grams, using the formula derived for naïve Bayes:
$$L(s \mid M) = \sum_{g \in G} \log P(g \mid s) + \sum_{t \in T} \log P(t \mid s)$$
where G is the set of n-grams of the tweet, T is the set of POS tags of the n-grams, M is the tweet and s is the sentiment (one of positive, negative, and neutral). The sentiment with the highest likelihood L(s|M) becomes the sentiment of the new tweet. The authors achieved the best result (highest accuracy) with bigram presence. Their explanation for this is that bi-grams provide a good balance between coverage (uni-grams) and capturing sentiment expression patterns (tri-grams). Negation ('not' and 'no') is handled by attaching it to the words that precede and follow it during tokenization. The handling of negation is found to improve accuracy. Moreover, they report that removing n-grams that are evenly distributed across the sentiment classes improves accuracy. Evaluation was done on the same test data used by (Go, Bhayani, & Huang, 2009). However, they do not state their accuracy explicitly as a number, showing it only in a graph. (Pak & Paroubek, 2010) Another study, by Barbosa and Feng, employed a two-phase approach to twitter sentiment analysis. The phases are: classifying the dataset into objective and subjective classes (subjectivity detection) and classifying subjective sentences into positive and negative classes (polarity detection). The authors felt that the use of n-grams for Twitter sentiment analysis might not be a good strategy since Twitter messages are short. Instead they opted for the use of two other kinds of features: meta-information about tweets and the syntax of tweets. For meta-information, they use POS tags (some tags are likely to show sentiment, e.g. adjectives and interjections) and map words to prior subjectivity (strong and weak) and prior polarity; the prior polarity is reversed when a negative expression precedes the word. For tweet syntax features, they use # (hashtag), @ (reply), RT (retweet), links, punctuation, emoticons, capitalized words etc. They create a feature set from both kinds of features and experiment with the machine learning techniques available in WEKA.
SVM performs best. For test data, 1000 tweets were manually annotated as positive, negative and neutral. The highest accuracy obtained was 81.9% on subjectivity detection followed by 81.3% on polarity detection. ( Barbosa & Feng, 2010)
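The emoticon-based labelling and the cleaning steps that Pak and Paroubek are described as using above can be sketched as follows. The regular expressions, the tiny stop-word list and the rule that a tweet containing both kinds of emoticon is left unlabelled are assumptions made for this illustration.

```python
import re

HAPPY = (":-)", ":)", "=)", ":D")
SAD = (":-(", ":(", "=(", ";(")
# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and"}


def emoticon_label(tweet):
    """Label a tweet from its emoticons; None if it has none or has both kinds."""
    happy = any(e in tweet for e in HAPPY)
    sad = any(e in tweet for e in SAD)
    if happy == sad:
        return None
    return "positive" if happy else "negative"


def clean_tweet(tweet):
    """Remove URLs, @usernames, the RT marker, emoticons and stop words; return tokens."""
    text = re.sub(r"https?://\S+", " ", tweet)   # URL links
    text = re.sub(r"@\w+", " ", text)            # user names
    text = re.sub(r"\bRT\b", " ", text)          # retweet marker
    for emoticon in HAPPY + SAD:
        text = text.replace(emoticon, " ")
    tokens = re.findall(r"[a-z0-9#']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]
```

The label is taken from the raw tweet before cleaning, since cleaning removes the emoticons. The cleaned tokens can then be passed on to n-gram construction.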
In contrast to the machine learning approaches, Bollen et al. performed Sentiment Analysis using a psychometric instrument (the Profile of Mood States) rather than a machine learning process. The psychometric instrument extracts six moods (tension, depression, anger, vigour, fatigue, confusion). The authors analysed 9,664,952 tweets posted between August 1 and December 20, 2008. The tweets covered political, cultural, social, economic, and natural events. Each tweet was then measured according to the six different moods. The results were compared with a timeline of notable events that took place in that period. In concluding, the authors stated: “We find that social, political, cultural and economic events are correlated with significant, even if delayed fluctuations of public mood levels along a range of different mood dimensions. To conclude, we bring about the following methodological contribution: we argue that sentiment analysis of minute text corpora (such as tweets) is efficiently obtained via a syntactic, term-based approach that requires no training or machine learning.” Sentiment analysis techniques rooted in machine learning yield accurate classification results when sufficiently large data is available for testing and training. However, minute texts such as microblogs may pose particular challenges for this approach. (Bollen, Mao, & Pepe, 2011) The aggregate of millions of tweets submitted to twitter at any given time may provide an accurate representation of public mood and sentiment. This led to the development of real-time sentiment tracking such as Northeastern University and Harvard University’s “Pulse of the Nation”, which used over 300 million tweets. (Mislove, Lehmann, Ahn, Onnela, & Rosenquist, 2010) Another real-time study was carried out by Sakaki et al., who investigated the real-time interaction of events during the occurrence of an earthquake. They considered each user as a sensor to monitor tweets posted about the earthquake.
To detect a target event, the work proceeded as follows. First, a classifier is trained using keywords, message length, and corresponding context as features to classify tweets into positive or negative cases. Second, the authors build a probabilistic spatio-temporal model for the target event to identify the location of the event. As an application, the authors constructed an earthquake-reporting system in Japan, where earthquakes are relatively frequent (Sakaki, Okazaki, & Matsuo, 2010).
2.7 GOVERNMENT CITIZEN SENTIMENT ANALYSIS
There are not many publications on applying sentiment analysis in the specific context of government and citizens. One of the few is Abbasi, who proposed an affect analysis approach for measuring the presence of hate and violence, and the resulting propaganda dissemination, across extremist group forums (Abbasi., 2007). In a similar application, Bermingham et al. proposed crawling and analysing social media sites, such as YouTube, to detect radicalism (Bermingham, Conway, McInerney, O’Hare, & Smeaton, 2009).
One of the first publications this project reviewed was (Arunachalam & Sarkar, 2013). The authors make a strong case for the importance of applying sentiment analysis in this context. Their approach was to use topic modelling, applying the TF-IDF model. Sentiment analysis was performed for one of the major social benefits organisations in the USA. Data was sourced from several social media sites, e.g. Twitter, Facebook and Flickr. A hotword affinity analysis was carried out within the topic model approach. Hotwords are parameters that are common across defined topics of interest. They can provide additional insight into how sentiments around a particular concept are perceived in the context of different hotwords. A tag cloud was also used to determine which words came up the most. After their analysis the authors were able to find out which social benefits programmes and services received positive and negative sentiments. The sentiments were classified into four polarities (positive, negative, neutral and ambivalent) and visualised using a bar chart. From the chart it was easy to see which social benefit programmes received which polarity of sentiment.
Governments analysing the sentiments of their people should not be a one-time event but, if possible, an ongoing daily process, so that governments can monitor the sentiments of citizens on a regular basis. According to (Arunachalam & Sarkar, 2013), Gartner instituted the Open Government Maturity Model in 2010 and proposed sentiment analysis as a means for governments to achieve collaboration in line with that model. Forrester Research observed that the US federal government was monitoring citizen sentiment on Twitter.
Gartner also called for governments to use social media to achieve collaborative budgeting and pattern discovery, where analysis of citizens’ sentiment on social media can play a significant role. It is therefore worth looking at the architecture proposed by the authors as a possible system for analysing sentiments on a regular basis.
2.3.7.1 SYSTEM ARCHITECTURE FOR CITIZEN SENTIMENT ANALYSIS
The architecture (Arunachalam & Sarkar, 2013) used is adopted from what IBM has proposed for use as an effective approach to both topic modelling and sentiment analysis. The various components for such a system are described below:
GPFS: The IBM General Parallel File System is a specialized file system targeted for high performance applications such as big data analytics.
HADOOP: Apache Hadoop is an open source software framework for running data-intensive applications in a distributed fashion over commodity hardware.
SYSTEMT: A rule-based information extraction (IE) system that was proposed in the works of (). It uses a declarative rule language, AQL, to define the Natural Language Processing (NLP) rules for information extraction from documents.
FLOWMANAGER: Based on the rules and configurations, this component orchestrates the execution of different tasks across the components in this system.
LUCENE: Apache Lucene is an open-source framework for information retrieval applications.
ADMINUI: The user interface used by administrators to configure this system and define AQL rules using simple interfaces.
ANALYSISUI: The user interface component that enables sentiment analysis execution and rendering using the Lucene component.
DATA FETCHER: The social media interfacing component that interacts with diverse sources, fetches information in different formats, produces a JSON representation of it and saves it into GPFS.
TOPICEXTRACTOR: With the help of natural language processing rules in SystemT, this component extracts information from the JSON data created by DataFetcher. It computes the term frequency and inverse document frequency values and produces the X matrix. This component runs as a Hadoop job.
TOPICMODELLER: This component computes the estimated matrices W and H. It employs the proximal Rank-One Residue Iterations (Proximal RRI) optimization algorithm as proposed by (). It also produces JSON documents annotated with topic information. This component also runs as a Hadoop job.
UPLOADER: This component picks up the annotated JSON documents produced by TopicModeller and uploads them into a staging area. Lucene indexes these documents so that they can be searched and analysed based on the extracted information, using traditional sentiment analysis techniques for subjectivity detection and sentiment classification.
Figure 2.1 An example of a system architecture for Sentiment Analyser
2.8 AN OVERVIEW OF DATA MINING (STRUCTURED) AND TEXT MINING (UNSTRUCTURED DATA)
Data mining is a field which has seen rapid advances in recent years (Han & Kamber, 2005) due to advances in hardware and software technology, which have led to the availability of different kinds of data. One such kind is text data, which resides in large repositories such as the web and, more specifically, social networks (Aggarwal & Zhai, 2012). Unstructured data refers to information that either does not have a pre-defined data model or is not organised in a predefined manner (Nemschoff, 2014); examples are books, emails and social media. Structured data is data that can be easily organised owing to its simplicity (Nemschoff, 2014). Structured data is normally ready for seamless integration into a database or a well-structured file format such as XML (Johnson, 2012). Examples of such data are sensor data, point of sale data, web server data and recorded data entries such as gender, age and post code. Structured data is generally less noisy and managed with a database system; text data, on the other hand, is relatively noisier and typically managed via a search engine due to its lack of structure (Gundecha & Liu, 2012).
Because text data differs in this way, the mining techniques that can be employed are also different. A key characteristic of text data is its sparsity and high dimensionality (Aggarwal & Zhai, 2012): a corpus may be drawn from a lexicon of 100 000 words, while a given text document may contain only a few hundred of them. Thus a corpus of text documents can be represented as a sparse term-document matrix of size $n \times d$, where n is the number of documents and d is the size of the lexicon vocabulary. The (i, j)th entry of this matrix is the normalised frequency of the jth word in the lexicon in document i (Aggarwal & Zhai, 2012). It is therefore desirable to transform text data into a structured format prior to applying traditional data mining tasks such as clustering and classification (Feinerer, Hornik, & Meyer, 2008).
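The sparse term-document representation described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: tokens are obtained by whitespace splitting, and the function name `term_document_matrix` and the three example documents are hypothetical, not taken from the project data.

```python
from collections import Counter


def term_document_matrix(documents):
    """Return (vocabulary, rows) for a list of text documents.

    rows[i] is a dict mapping column index j to the normalised
    frequency of vocabulary[j] in document i. Zero entries are not
    stored, which is what makes the representation sparse.
    """
    tokenised = [doc.lower().split() for doc in documents]
    vocabulary = sorted({tok for doc in tokenised for tok in doc})
    index = {term: j for j, term in enumerate(vocabulary)}
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens)
        rows.append({index[t]: c / total for t, c in counts.items()})
    return vocabulary, rows


# Hypothetical mini-corpus: n = 3 documents, d = 6 distinct terms.
docs = ["hs2 is a waste", "hs2 brings jobs", "jobs jobs jobs"]
vocab, matrix = term_document_matrix(docs)
```

Each row stores only the terms that actually occur in that document, so a tweet of a dozen words occupies a dozen entries regardless of how large the lexicon is.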
Text data can be analysed at different levels of representation. For example, text data can be treated as a bag-of-words, or it can be treated as a string of words. However, in most applications it would be desirable to represent text information semantically so that more meaningful analysis and mining can be done (Aggarwal & Zhai, 2012).
2.9 SENTIMENT ANALYSIS AND MODELLING TECHNIQUES
In the field of data mining and analytics a model is defined as simply an algorithm or set of rules that connect a collection of inputs (often in the form of fields in a corporate database) to a particular target or outcome. (Berry & Linoff, 2004)
A number of different techniques can be used in modelling sentiment analysis. These are machine learning techniques, and they can be supervised, unsupervised, or a combination of the two.
In the supervised technique the task is to build a classifier, which requires training data to build and train the model. Algorithms used here include support vector machines (SVM), the Naïve Bayes classifier and Multinomial Naïve Bayes. According to (Pang, Lee, & Vaithyanathan, 2002), supervised techniques can use one or a combination of the approaches stated earlier; e.g. a supervised technique can use a relationship-based approach, a language model approach, or a combination of them. For a supervised technique, the text to be analysed must be represented as a feature vector. (Pang, Lee, & Vaithyanathan, 2002) state that supervised techniques outperform unsupervised techniques.
In the unsupervised technique, classification is done by a function which compares the features of a given text against discriminatory-word lexicons whose polarity is determined prior to their use. For example, starting with positive and negative word lexicons, one can look for their entries in the text whose sentiment is being sought and register their count. If the document contains more positive lexicon words it is positive, otherwise it is negative. (Turney, 2002) uses a slightly different approach, employing a simple unsupervised technique to classify reviews as recommended (thumbs up) or not recommended (thumbs down) based on semantic information of phrases containing an adjective or adverb. He computes the semantic orientation of a phrase as the mutual information of the phrase with the word ‘excellent’ minus the mutual information of the phrase with the word ‘poor’. From the semantic orientations of the individual phrases, an average semantic orientation of a review is computed. A review is recommended if the average semantic orientation is positive, and not recommended otherwise.
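As a hedged illustration of the semantic orientation measure in (Turney, 2002), the sketch below computes it from co-occurrence counts. In that work the orientation is the pointwise mutual information of a phrase with ‘excellent’ minus its pointwise mutual information with ‘poor’, which reduces to a single log ratio of hit counts. The function names, the smoothing constant and all counts used here are illustrative assumptions, not figures from this project.

```python
import math


def semantic_orientation(near_excellent, near_poor,
                         hits_excellent, hits_poor, smoothing=0.01):
    """SO(phrase) = PMI(phrase, 'excellent') - PMI(phrase, 'poor').

    near_excellent / near_poor: number of times the phrase occurs
    near each reference word. hits_excellent / hits_poor: number of
    times each reference word occurs overall. The smoothing term
    (an assumption here) avoids division by zero for unseen pairs.
    """
    numerator = (near_excellent + smoothing) * hits_poor
    denominator = (near_poor + smoothing) * hits_excellent
    return math.log2(numerator / denominator)


def classify_review(phrase_orientations):
    """Recommend a review when its average phrase orientation is positive."""
    average = sum(phrase_orientations) / len(phrase_orientations)
    return "recommended" if average > 0 else "not recommended"
```

A phrase seen eight times more often near ‘excellent’ than near ‘poor’, with both reference words equally common, receives an orientation of about 3.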
DISCOURSE APPROACH: In this approach, the discourse relations between text components are used to guide the classification. According to (Pang, Lee, & Vaithyanathan, 2002), in movie reviews the overall sentiment is usually expressed at the end of the text. The approach to sentiment analysis in this case will therefore be discourse-driven: the sentiment of the whole review is obtained as a function of the sentiment of the different discourse components in the review and the discourse relations that exist between them. In such an approach, the sentiment of a paragraph at the end of the review might be given more weight in determining the sentiment of the whole review.
RELATIONSHIP-DRIVEN APPROACH: In this approach the classification task deals with the different relationships that may exist in or between features and components. These include relationships between discourse participants or between product features. For example, to know the sentiment of customers concerning a brand, one can compute it as a function of the sentiment on its different features or components.
LANGUAGE MODEL DRIVEN APPROACH: Classification in this approach is carried out by building n-gram language models. Either the presence or the frequency of n-grams might be used. In traditional information retrieval and topic-oriented classification, the frequency of n-grams is shown to deliver better results. Usually, the frequency is converted to TF-IDF to take a term’s importance for a document into account. (Pang, Lee, & Vaithyanathan, 2002), in their movie review classification, found that term presence gives better results than term frequency. They indicated that unigram presence is more suited to sentiment analysis. With product reviews, however, the authors concluded that bi-grams and tri-grams worked better than unigrams in sentiment analysis.
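The difference between n-gram frequency features and n-gram presence features can be made concrete with a short sketch. The helper names and the example sentence below are illustrative assumptions; tokenisation is a plain whitespace split.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequency_features(tokens, n=1):
    """Map each n-gram to the number of times it occurs."""
    return Counter(ngrams(tokens, n))


def presence_features(tokens, n=1):
    """Map each n-gram that occurs at least once to 1."""
    return {gram: 1 for gram in set(ngrams(tokens, n))}


# Hypothetical example sentence.
tokens = "great great film not great ending".split()
unigram_freq = frequency_features(tokens)         # ('great',) counted 3 times
unigram_presence = presence_features(tokens)      # ('great',) recorded once
bigrams = ngrams(tokens, 2)                       # includes ('not', 'great')
```

With presence features a repeated word counts once, while bigrams keep local context such as a negation next to the word it modifies.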
KEYWORD/KNOWLEDGE MODEL APPROACH: This approach sees sentiment as a function of some keywords. The main task is the construction of sentiment discriminatory-word lexicons that indicate a particular class, such as the positive class or the negative class. The polarity of the words in the lexicon is determined prior to the sentiment analysis work. There are variations in how the lexicon is created. Lexicons can be created by starting with some seed words and then using linguistic heuristics to add more words to them, or by starting with some seed words and adding other words to them based on frequency in a text (Turney, 2002).
2.10 OVERVIEW AND WAY FORWARD FROM THE LITERATURE REVIEW
An overview of similar works on sentiment analysis and Twitter-specific sentiment analysis has been presented in the preceding sections. From the review it can be noted that there are various ways of approaching sentiment analysis depending on the nature of the document to analyse; e.g. the approach to a movie review would be different from the approach to a product review. In the same way, Twitter sentiment analysis comes with its own approach and challenges.
In approaching Twitter sentiment analysis certain factors must be considered. One factor is that Twitter posts are short, usually 140 characters at most. This means that certain classification models, such as the discourse-based model, cannot be applied successfully to Twitter. Similarly, the relationship-based model becomes irrelevant because there is no whole-component relationship in tweets. This leaves the other two approaches, the language model and the knowledge-based model. These two approaches are what most of the reviewed studies have implemented, as the previous two are merely theoretical approaches and therefore hardly used. The choice between these two determines what technique to use. While the knowledge-based approach involves mostly unsupervised techniques, language models use supervised machine learning techniques. All the Twitter-specific sentiment analysis studies reviewed above used supervised techniques or achieved better results with them. (Pang, Lee, & Vaithyanathan, 2002) stated in their work that supervised techniques outperform unsupervised techniques.
From our review we have discovered various algorithms and features which have been applied, examples of which include TF-IDF, LDA, POS tags, n-grams, Naïve Bayes and SVM.
This research aims to classify Twitter-based sentiment by polarity, which indicates that the language-based approach should be adopted. However, since our general aim is to perform analysis, it makes sense to perform an exploratory analysis using the additional dimension of the knowledge model. This should give an idea of the various topics contained in our set of data at the time of collection, and a fair idea of those topics should help in the classification effort. Our primary focus, however, is still to classify the data into one of three polarities.
After a careful study of the various models, this research carries out the sentiment analysis in three phases or stages:
The first stage will be a term frequency analysis to find the most commonly occurring terms in the data. This will give us an insight into what terms are associated with the HS2 project.
The second stage will be topic modelling using the LDA algorithm. Topic modelling is one of the common approaches that some researchers have used, e.g. (Arunachalam & Sarkar, 2013). Grouping frequently occurring terms into topics should give us an additional view of what topics are inherent within the data set. Looking at the polarity of these topics can give an indication of the polarity of the entire corpus.
The final stage will be to apply sentiment analysis to the data (corpus). The review suggests a number of ways of doing this; this project, however, will use a pre-built classifier to classify the tweets into the three polarities already stated.
CHAPTER THREE
3 PROBLEM SPECIFICATIONS
Advances in data mining and text mining algorithms, coupled with the huge volume of text data generated daily on social media platforms such as Twitter, make it more feasible than ever before for governments to analyse opinions and sentiments. However, applying sentiment analysis to Twitter data is a relatively young field and presents several challenges: data retrieval (information extraction) (Arunachalam & Sarkar, 2013); the unstructured and noisy nature of the text data; linguistic semantics, since the informal and specialised language used to create content is ambiguous (Gundecha & Liu, 2012); and identifying which classification approach suits the particular domain and type of textual data to be classified. The research question is to find a suitable model which would best classify sentiments into three classes (positive, negative and neutral) with high accuracy, taking into account the nature of the data, i.e. Twitter data. Based on a high-accuracy model, conclusions could be drawn about general opinion towards the HS2 project. The solution to the stated problem is to employ the knowledge discovery from data (KDD) approach used in traditional data mining for similar problem specifications. This process is discussed in detail in the methodology.
3.1 METHODOLOGY
A case study is used as the primary method in applying sentiment analysis to public administration. In this case study a mixed method combining qualitative and quantitative approaches is used in carrying out the research design. The qualitative approach involves comprehensive research on the relevance of sentiment analysis in our lives and why governments should apply this area of business intelligence in the governance process. A study of the evolving paradigm of social media would be useful in understanding why social media provides a good platform for bringing the governing and the governed closer, as well as being a rich source of data.
The qualitative approach would be carried out by way of an extensive literature review (chapter 2), studying and citing sources from academic journals, reports and publications and, in a few instances, textbooks. This would be carried out using the university library, online libraries and other online sources. This study would serve as a guide in implementing a model for the task.
The quantitative approach, which is central to the research, would involve the detailed process of data/information extraction and collection, data storage, data processing, data analysis and data visualization. These processes would culminate in the eventual model for this research (chapter 4).
Figure 3.1 Methodological steps (Literature review, Model building, Analysis of results, Interpretation of results, Conclusion)
Model building: build a suitable model using a classification algorithm to classify sentiments as positive, negative or neutral. In this stage the same techniques such as natural language tool kit used in conventional opinion mining would also be applied. Model building is an entire process that consists of the steps illustrated below.
Figure 3.2 Modelling process (Define goals, Collect Twitter data, Process data, Exploratory data analysis, Build an appropriate model, Interpret results)
The only source of data for this research is Twitter. The data is obtained by requesting it through Twitter’s Application Programming Interface (API). The data used in this research is based on empirical primary data collection rather than on a publicly available dataset. The model building process hinges on the use of computational intelligence enabled by software. The software used in this study is Python and R.
The models chosen for the analysis are a topic (unsupervised) model and a classification (supervised) model that would classify sentiments into three major classes (positive, negative and neutral), so that we can measure public opinion on the HS2 project.
Analysis of results: The focus of our analysis would be to measure the general polarity of the processed corpus (tweets). This would be done by looking at the proportion of the three polarities. The polarity with the highest proportion or percentage will give an indication of the general feeling of the public towards the project. Attempts would be made to compare that with existing polls or surveys on the HS2 project.
Interpretation of results: In order to interpret the results, appropriate visualization methods would be used in presenting them. The visualization would be done using R.
Conclusions: The conclusion would provide a summary of how successfully the project has gone and which objectives were met, together with the challenges faced during the course of the work and recommendations for further areas of research.
3.2 METHODOLOGICAL JUSTIFICATION
This research uses a case study to attempt to demonstrate the relevance of sentiment analysis to government-citizen relations, using the real-life case of the HS2 project. Yin defines the case study research method as an empirical inquiry that investigates a contemporary phenomenon within its real-life context; when the boundaries between phenomenon and context are not clearly evident; and in which multiple sources of evidence are used (Yin, 1984). Yin further states that the importance of a case study is, among others, to bring an understanding of a complex issue or object and to extend or add strength to what is already known through previous research. Case studies also emphasise detailed contextual analysis of a limited number of events (Yin, 1984). Most case studies involve the steps already mentioned earlier, i.e. data collection, analysis and reporting. This methodology is a widely used and accepted form of research. It is therefore justified to employ it in this research by following its methods of inquiry or investigation.
3.3 SOFTWARE USE JUSTIFICATION
A number of software packages are available for use in this research, each having its merits and demerits. This research chooses two widely used open source packages, Python and R.
Python is a scripting language used to write quick and small programs or scripts. Among the interpreted languages, Python is distinguished by its large and active scientific computing community. In recent years Python’s improved library support has made it a strong alternative for data manipulation tasks. Together with Python’s strength in general purpose programming, this makes it an excellent choice as a single language for building data-centric applications (McKinney, 2013).
R is a free functional language and software environment for statistical computing and graphics. R has extensive and powerful graphics abilities that are tightly linked with its analytic abilities. The R system is developing rapidly, with new features and abilities appearing every few months. It is widely used among statisticians and data miners for data analysis (Maindonald, 2008).
Both packages are capable of collecting tweets using the Twitter API, and both can pre-process as well as analyse text data for the purpose of sentiment analysis. R has good tools for visualising results. Since both are widely used within the research community, they would serve the purpose of sentiment analysis.
CHAPTER FOUR
4.0 PROJECT IMPLEMENTATION
This section describes the practical implementation steps used in the text mining/sentiment analysis process.
DATA COLLECTION: The first step was to collect the data relevant to our interest. As stated, the source of data is tweets on the Twitter network that discuss the HS2 project. This process is carried out using an API. An API can be defined as a set of routines by which an application program allows another application program to work directly with it (Horak, 2007). Twitter has developed its RESTful API so that users can access tweet data. Tweets were collected using ActiveState Python version 2.7.6.9. There are two ways of obtaining data from Twitter: the historic/search data and the streaming data. The historic method returns a finite number of desired tweets from the last 7 days prior to making a request, while the streaming method returns a requested number of tweets in real time, i.e. just when a user tweets.
In order to obtain tweets that are relevant to the subject of study, the query must indicate either the subject name or the name preceded by the @ or # symbol, e.g. (HS2, @HS2 and #HS2). Another consideration is the number of data instances required for the task at hand. Data mining tasks thrive on a large number of data instances/records. Generally, larger datasets are better for training a given model, because models learn better and are more accurate with more data. Most mining tasks use tens of thousands of data instances. A total of 10833 tweets was successfully collected using both the historic (search) and streaming API methods: 2870 historic tweets were collected on 16/07/2014, and the remaining 8013 streaming tweets were collected between 20/07/2014 and 11/08/2014. The data was stored in MS Excel in order to be able to quantify the number of tweets. It was later saved as a CSV file for processing. (See Appendix for the code used to collect tweets.)
UNDERSTANDING YOUR DATA: It is important that, as a researcher, I understand and get to know the data that is being collected. This is very important in making meaningful analysis. Therefore, as an important step, the collected data was studied in the context of the public debate, in order to understand the various issues underlying the project.
DATA PRE-PROCESSING: As stated earlier in chapter 3, text data is unstructured and relatively noisier than numeric data. It is therefore essential that the data undergoes a pre-processing stage in which it is cleaned and transformed into a more structured format. This step was embedded as part of the analysis stage in R, for both term frequency and LDA topic modelling. Most sentiment analysis tasks do little pre-processing so as to preserve as much of the data as possible, because some pre-processing tasks such as stemming and stopword removal can affect the polarity of sentences; e.g. stopword removal may take out “not” and “no”, which are mainly used to negate adjectives in sentences.
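The point about stopword removal and negation can be illustrated with a minimal Python sketch. It is not the pre-processing used in this project, which was done in R; the stopword list, the regular expressions and the function name `preprocess` are illustrative assumptions.

```python
import re

# Small illustrative stopword list; a real list would be much longer.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "and",
             "in", "on", "for", "no", "not"}
NEGATIONS = {"no", "not"}


def preprocess(tweet, keep_negations=True):
    """Lower-case a tweet, strip URLs and punctuation, drop stopwords.

    When keep_negations is True, 'no' and 'not' are exempt from
    stopword removal so that negated adjectives keep their polarity.
    """
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)    # remove URLs
    text = re.sub(r"[^a-z0-9#@\s]", " ", text)   # keep letters, digits, # and @
    stop = STOPWORDS - NEGATIONS if keep_negations else STOPWORDS
    return [tok for tok in text.split() if tok not in stop]


cleaned = preprocess("HS2 is not a good idea http://t.co/x")
stripped = preprocess("HS2 is not a good idea http://t.co/x",
                      keep_negations=False)
```

With negations kept, the tokens still read "not good"; with full stopword removal the same tweet reduces to "hs2 good idea", which a classifier would likely read as positive.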
4.1 UNDERTAKING SENTIMENT ANALYSIS
Figure 4 Process diagram showing the stages of analysis (Stage 1: Term frequency analysis; Stage 2: LDA topic modelling; Stage 3: Sentiment analysis classifier and scoring)
4.1.1 TERM FREQUENCY ANALYSIS
Term frequency, also referred to as word probability, is a method of using the frequency of input words as an indicator of importance (Nenkova & McKeown, 2012). The probability of a word w, P(w), is calculated from an input as the number of occurrences of the word, C(w), divided by the number of all words in the input, N:
$$P(w) = \frac{C(w)}{N}$$
The aim of applying this method is to find out the frequently used words, and if these words can give an insight into frequent terms that are associated with the HS2 debate.
Since this is a debate conducted by the public, the view taken in this research is that frequently used terms should be given more priority than less used terms. For example, if people feel the economic benefits of HS2 are enormous, then one would expect the terms ‘economic’ and ‘benefit’ to feature prominently in the frequency analysis. This is the view taken, as against using TF-IDF, which gives preference to rare words. The term frequency analysis was carried out in R using the text mining library. A term frequency association analysis was also carried out to find which other terms were associated with the main frequent terms. In carrying out the term frequency process, a number X is specified; X can be considered a minimum frequency threshold for the occurrence of a term in the entire corpus, so that all terms occurring at least X times are displayed. Values used for X were 20, 100, 500, 1000 and 1500. The aim of increasing the value of X was to determine the single most used term in the corpus. The analysis of the terms, however, was done using X = 100.
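The word probability calculation and the threshold filter can be sketched as follows. This is an illustration in Python rather than the R code used in the project, and it assumes that the threshold is applied to raw counts, as values such as 20 to 1500 suggest. The function names and the three-tweet corpus are hypothetical.

```python
from collections import Counter


def term_frequencies(documents):
    """Return (counts, probabilities) over all tokens in the corpus.

    counts[w] is C(w), the number of occurrences of w, and
    probabilities[w] is P(w) = C(w) / N, with N the total token count.
    """
    counts = Counter(tok for doc in documents for tok in doc.lower().split())
    total = sum(counts.values())
    probabilities = {term: c / total for term, c in counts.items()}
    return counts, probabilities


def frequent_terms(counts, min_count):
    """Return, in alphabetical order, the terms occurring at least min_count times."""
    return sorted(term for term, c in counts.items() if c >= min_count)


# Hypothetical mini-corpus of three tweets (N = 10 tokens).
corpus = ["hs2 will create jobs", "stop hs2 waste", "hs2 jobs jobs"]
counts, probs = term_frequencies(corpus)
top = frequent_terms(counts, min_count=3)
```

Raising `min_count` shrinks the returned list, which mirrors the idea of increasing the threshold until only the most used terms remain.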
4.1.2 LATENT DIRICHLET ALLOCATION TOPIC MODELLING
LDA is a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words (Blei, Ng, & Jordan, 2003). LDA provides the mechanism for finding patterns of term co-occurrence and using those patterns to identify coherent topics. LDA results in topics in which the most probable terms frequently co-occur with each other in documents (Crain, Zhou, Yang, & Zha, 2012). The LDA model is usually illustrated graphically.
Figure 4.1 LDA graphical representation
The LDA model is represented as a probabilistic graphical model in Fig. 4.1. As the figure makes clear, there are three levels to the LDA representation. The parameters α and β are corpus-level parameters, assumed to be sampled once in the process of generating a corpus. The variables θ_d are document-level variables, sampled once per document. Finally, the variables z_dn and w_dn are word-level variables and are sampled once for each word in each document (Blei, Ng, & Jordan, 2003).
LDA is widely used for topic modelling in research, hence employing it is appropriate for the task of finding interesting topics and term co-occurrence in our corpus. Models with 20 and 50 topics were chosen, with each topic containing 20 terms (20x20 and 50x20). Our aim is to determine whether we can identify interesting topics within the corpus.
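For reference, the three levels described above correspond to the following generative process and joint distribution for a single document of N words, written in the notation of (Blei, Ng, & Jordan, 2003). The document index d is dropped for readability; this block restates the published model rather than anything specific to this project.

```latex
% Generative process for one document
\theta \sim \mathrm{Dirichlet}(\alpha), \qquad
z_n \mid \theta \sim \mathrm{Multinomial}(\theta), \qquad
w_n \mid z_n, \beta \sim p(w_n \mid z_n, \beta), \quad n = 1, \dots, N

% Joint distribution of the topic mixture, topic assignments and words
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)
```

The corpus-level parameters α and β appear only as conditioning variables, θ is drawn once per document, and z_n and w_n are drawn once per word.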
4.1.3 SENTIMENT ANALYSIS SCORE
The sentiment analysis is undertaken in two steps. The first is to use an inbuilt Naïve Bayes sentiment analyser in Python to classify our corpus into three polarities. The Naïve Bayes classifier is a commonly used generative classifier based on Bayes’ theorem. It models the distribution of the documents in each class using a probabilistic model with independence assumptions about the distribution of different terms. It computes the posterior probability of a class based on the distribution of the words in the document, and works with the bag-of-words assumption (Aggarwal & Zhai, 2012).
The second step was to develop a dictionary based on the term frequency results. The words in the dictionary were manually scored between -5 and 5, with the following polarity bands: (-5 and -4: very negative), (-3 and -2: negative), (-1 to 1: neutral), (2 and 3: positive), (4 and 5: very positive). This dictionary, together with a similar one developed by Finn Årup Nielsen (www.opendatacommons.org), will be used to score the entire corpus. The second method, though similar to the first, is expected to classify better because it uses lexicons specific to the HS2 discussions.
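A minimal sketch of the dictionary-based scoring is given below. The polarity bands follow the -5 to 5 scale described above, but the lexicon entries and their scores are hypothetical examples, and summing word scores to obtain a tweet score is an assumption made for illustration, not a confirmed detail of the project’s method.

```python
def polarity_band(word_score):
    """Map a single word score in [-5, 5] to its polarity band."""
    if word_score <= -4:
        return "very negative"
    if word_score <= -2:
        return "negative"
    if word_score <= 1:
        return "neutral"
    if word_score <= 3:
        return "positive"
    return "very positive"


def score_tweet(tweet, lexicon):
    """Sum the lexicon scores of the words in a tweet (unknown words score 0)."""
    return sum(lexicon.get(tok, 0) for tok in tweet.lower().split())


def classify(tweet, lexicon):
    """Classify a tweet by the sign of its total score (assumed rule)."""
    total = score_tweet(tweet, lexicon)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"


# Hypothetical lexicon entries; the scores are illustrative only.
lexicon = {"waste": -3, "stophs2": -4, "jobs": 2, "benefit": 3}
```

For example, `classify("HS2 is a waste of money", lexicon)` yields "negative" because only the word "waste" is found in the lexicon and its score is below zero.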
4.2 CHALLENGES AND ADJUSTMENTS
A number of challenges were faced in the analysis process. The first was collecting more data than was eventually used. The aim was to collect at least 15,000 tweets; however, the streaming method returns tweets as and when users tweet, which can make collection slow if few people are tweeting at a given time. The average rate of collection was about 250 tweets a day. It was therefore decided to scale the target down to at least 10,000 in order to keep to the project timelines. The major challenge was applying the dictionary-based approach to scoring the sentiment. The process applied in R produced coding errors and did not generate any results. A decision to use a widely known sentiment analysis method by Jeffrey Breen also did not present any meaningful result. Another challenge was the time required to manually label thousands of individual tweets into the three polarities in order to train a classifier with good accuracy. The adjustment made was therefore to focus on the results of the topic modelling and the inbuilt Naïve Bayes sentiment analyser for the classification task.
CHAPTER FIVE
5. RESULTS AND ANALYSIS
5.1 TERM FREQUENCY RESULTS AND ANALYSIS
Figure 5.1 shows the output for terms with X = 100. A total of 162 terms are returned from the entire corpus, in alphabetical order. With a good understanding of the issues and debates, we can draw some inferences from the terms displayed.
Figure 5.1 Display of terms for X = 100
I. Terms that relate to economic viability and benefits feature prominently. Terms such as ‘affordable’, ‘benefit(s)’, ‘billion’, ‘cost’, ‘economic’, ‘taxpayer’, ‘jobs’, ‘waste’ and ‘welfare’ indicate that the issues of economic impact are well debated. While terms like ‘waste’ can clearly be considered negative, ‘jobs’ can be seen as a positive term.
II. A number of cities where the project would take place also feature. ‘London’ and ‘Birmingham’ are the two main cities that phase one of HS2 is expected to connect. An interesting city that comes up is Liverpool, since Liverpool is not originally among the cities to be connected. This raises the question of why Liverpool features. One way to find out is to carry out a term association analysis.
III. Some notable politicians also feature as terms: Prime Minister ‘David Cameron’ and Chancellor of the Exchequer ‘George Osborne’. Of the several political parties, including those of the coalition government, only ‘uklabour’ and ‘ukip’ feature.
IV. Other significant terms that appear are ‘hs2aintgreen’, ‘hs2aa’, ‘hs2facts’, ‘stophs2’, ‘anti’ and ‘support’.
V. Also featuring among the terms are the people who contribute most to the debate, whom we can term authors. Five different authors feature with their Twitter names.
The most used term was found to be ‘stophs2’, at X = 1500. ‘stophs2’ is one of the many terms used to voice displeasure at the project, and there are numerous groups with the @stophs2 or #hs2 campaign.
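A minimal Python sketch of the frequent-term step is shown here for illustration, reading X as a minimum term frequency. The function name frequent_terms, the simple tokenisation and the toy tweets are assumptions for the example and do not reproduce the preprocessing used in this study.

```python
import re
from collections import Counter


def frequent_terms(tweets, min_count):
    """Return, in alphabetical order, the terms occurring at least min_count times."""
    counts = Counter()
    for tweet in tweets:
        counts.update(re.findall(r"[a-z0-9]+", tweet.lower()))
    return sorted(term for term, n in counts.items() if n >= min_count)


# Hypothetical tweets, for illustration only.
toy_tweets = ["stophs2 now", "stophs2 is a waste", "jobs jobs for London"]
print(frequent_terms(toy_tweets, min_count=2))
```

Raising min_count shortens the list, in the same way that only the most used terms remain at high values of X.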
5.1.1 TERM ASSOCIATION
A term association analysis was carried out for selected terms, which were used as queries, with 10% as the minimum threshold for associated terms. The top associated terms and the tweets from which they derive are displayed below. From Table 5.1, one can say that frequent term association gives a deeper insight into the most frequent terms. It can also help identify which tweets, and hence which issues, are being ‘tweeted’ the most. Using the term ‘Liverpool’ as an example, we can now infer that there is a strong campaign to include the city in the HS2 network.
TERM | ASSOCIATED TERMS (10%) | TWEET
-----------------------------------------------------------------
stophs2 | Hs2facts (37%), blocking (24%), illegally (24%), publication (24%), report (24%), scathing (24%) | The UK government is blocking the publication of a scathing hs2 report
hs2facts | blocking (42%), illegally (42%), publication (42%), report (42%), scathing (42%), stophs2 (37%) | The UK government is blocking the publication of a scathing hs2 report
Liverpool | destination (51%), link (51%), tourist (51%), uk’s (51%), visitors/year (48%), fastest (43%), growing (44%) | Liverpool is UK’s fastest growing destination an hs2 link would add 700k+ extra visitors/year
Government | Hs2facts (54%), blocking (54%), illegally (54%), publication (54%), report (54%), scathing (54%) | The UK government is blocking the publication of a scathing hs2 report
Economy | 26k (38%), 8.3bn (36%), Kent (36%), Transform (32%), Adding (32%), #osborne (26%) | No single tweet makes up these terms but a combination of several.
Jobs | 73% (64%), Created (62%), London (33%) | 73% of all jobs created by hs2 would be in London
Table 5.1 Terms and their frequent associated terms
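The association step can be sketched in Python as follows. The measure used in this study is not stated, so the sketch assumes that association is the Pearson correlation between term occurrence vectors across tweets, which is how the findAssocs function of the tm package in R defines it. The function names and toy documents are hypothetical.

```python
import math


def term_vector(term, docs):
    """Count of the term in each tokenised document."""
    return [doc.count(term) for doc in docs]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def associated_terms(query, docs, min_corr=0.10):
    """Terms whose correlation with the query term is at least min_corr, strongest first."""
    vocab = {w for doc in docs for w in doc} - {query}
    query_vec = term_vector(query, docs)
    found = {}
    for term in vocab:
        r = pearson(query_vec, term_vector(term, docs))
        if r >= min_corr:
            found[term] = round(r, 2)
    return dict(sorted(found.items(), key=lambda kv: (-kv[1], kv[0])))


# Hypothetical tokenised tweets, for illustration only.
toy_docs = [
    ["liverpool", "link", "tourist"],
    ["liverpool", "link"],
    ["stophs2", "waste"],
    ["stophs2", "cost"],
]
print(associated_terms("liverpool", toy_docs))
```

Under this reading, the 10% threshold corresponds to min_corr=0.10, and a term that appears in every tweet containing the query term, and in no others, has an association of 100%.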
5.2 LDA TOPIC MODELLING RESULTS AND ANALYSIS
Figure 5.2 LDA topic modelling results
From Figure 5.2, a number of interesting facts can be deduced about the various topics contained in the corpus:
I. The term ‘stophs2’ is grouped three times, in topics 4, 6 and 7. It also appears as a term in all the remaining 17 topics. As in the term frequency analysis, ‘stophs2’ is the dominant term in the corpus. From this result we can draw a number of inferences about the data:
- Most of the tweets use either @hs2 or #hs2 as the subject of discussion concerning the project.
- The stophs2 campaigners are actively using Twitter as one of the platforms on which to argue their point of view.
II. Again, as in the case of term frequency, another interesting observation is that Liverpool and London appear in topics 15 and 17 respectively. The explanations in Table 5.1 can be extended to explain why these two feature as topics in the corpus.
III. Another interesting result is the appearance of the Twitter names of two ‘authors’ in topics 11 and 14. Although identifying authors is not an objective of this study, this suggests that much of the discourse is driven by these two.
5.2.1 MODEL EVALUATION
Topic models are evaluated using perplexity (Blei, Ng, & Jordan, 2003). Perplexity measures how well a language model fits the word distribution of a corpus and is defined by:
$$\mathrm{Perplexity}(l) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\ln p_l(x_i)\right)$$
where $p_l(x_i)$ is the probability of the occurrence of word $x_i$ estimated by the language model $l$, and $N$ is the number of words in the document. Lower perplexity values are generally desired, and in this study values close to the chosen number of topics are taken to indicate a good fit. A perplexity value of 19.64 was recorded, which can be considered a good fit for the topic model. The scatter plot in Figure 5.3 shows a relatively uniform distribution of words across the corpus, which further suggests that LDA performed well in modelling the words into topics within the corpus.
Figure 5.2 perplexity value for LDA topic modelling
Figure 5.3 scatter plot for LDA topic modelling
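For illustration, a per-word perplexity calculation can be sketched in Python, assuming the standard definition: the exponential of the negative mean log-probability that the model assigns to the words. The function name perplexity and the example probabilities are hypothetical.

```python
import math


def perplexity(word_probs):
    """Perplexity of a model given the probability it assigns to each observed word."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)


# If a model assigns every word a probability of 1/20, its perplexity is 20.
print(perplexity([1 / 20] * 40))
```

A model that is as uncertain as a uniform choice among 20 alternatives therefore scores 20, and better fitting models score lower.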
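5.2 SENTIMENT ANALYSIS RESULTS
Probability value | 0.55 | 0.60 | 0.65 | 0.70 | 0.75
-----------------------------------------------------------------
Positive (% of tweets) | 76.6 | 71.4 | 66.9 | 59 | 53.1
Negative (% of tweets) | 14.7 | 11.8 | 9.2 | 7.5 | 5.3
Neutral (% of tweets) | 8.7 | 16.8 | 23.9 | 33.5 | 41.4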
Figure 5.3 Results of sentiment analysis
[Plot: positive, negative and neutral series (y-axis, 0 to 90) against probability value (x-axis, 0.55 to 0.75)]
Figure 5.4 A plot of sentiment score
Using a prebuilt analyser, each individual tweet in the corpus is classified into the three polarities stated. This works by setting a probability threshold between 0 and 1. For example, if 0.6 is chosen, a tweet is labelled positive only if its estimated probability of being positive is at least 0.6, and negative only if its estimated probability of being negative is at least 0.6; otherwise it is labelled neutral. The threshold values chosen are 0.55, 0.6, 0.65, 0.7 and 0.75. From the results in Figure 5.4, most of the tweets are classified as positive rather than negative or neutral. However, as the threshold increases from 0.55 to 0.75, the neutral share grows from 8.7% to 41.4%. Using a prebuilt classifier presents some challenges, including our inability to determine the accuracy of the classifier and to train it on the lexicon specific to our corpus. Unlike topic modelling, where results are based on the terms used in the corpus, sentiment analysis classifies each tweet according to how the classifier model or algorithm works. It is therefore difficult to validate the accuracy of the classifier used for this task. The results nevertheless indicate a generally positive polarity for the corpus.
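The effect of the probability value on the three polarities can be sketched in Python. The sketch assumes a binary classifier that returns a probability of the positive class, with the negative probability as its complement, and a threshold above 0.5. The function names and the toy probabilities are hypothetical and are not outputs of the analyser used in this study.

```python
def classify(p_pos, threshold):
    """Three-way label from a binary classifier's positive-class probability."""
    p_neg = 1.0 - p_pos
    if p_pos >= threshold:
        return "positive"
    if p_neg >= threshold:
        return "negative"
    return "neutral"


def polarity_shares(p_pos_values, threshold):
    """Percentage of tweets falling into each polarity at the given threshold."""
    labels = [classify(p, threshold) for p in p_pos_values]
    return {
        name: 100.0 * labels.count(name) / len(labels)
        for name in ("positive", "negative", "neutral")
    }


# Hypothetical positive-class probabilities for six tweets.
toy_probs = [0.90, 0.72, 0.58, 0.50, 0.41, 0.20]
for t in (0.55, 0.65, 0.75):
    print(t, polarity_shares(toy_probs, t))
```

As the threshold rises, fewer tweets clear it in either direction, so the neutral share can only stay the same or grow, which matches the pattern in the results above.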
5.3 SUMMARY OF FINDINGS
The analysis was conducted in three stages. While the term frequency and LDA models showed some interesting results, not much can be said about the sentiment analysis. In contrast to the seemingly negative terms and topics, such as ‘stophs2’, which featured prominently in the first two analyses, the general sentiment of the corpus is shown to be positive. This makes it quite challenging to draw a general conclusion about the overall sentiment expressed in the data. Based on the result of the prebuilt classifier, we can conclude only that the general sentiment expressed in the corpus is positive. This cannot, however, be extended into a conclusive statement that the citizenry are in favour of the project rather than against it. We must also bear in mind that the data collection was time-bound, so the results represent only the sentiment at the time of collection.
5.4 FUTURE WORK
A major limitation in implementing the sentiment analysis was the lack of a manually labelled corpus with which to train a classifier specific to our domain. Domain-specific lexicons are important for improving the accuracy of a classifier, so developing such a corpus is a natural way to improve upon this work in future. Other possible considerations are the use of other features such as part-of-speech (POS) tags and n-grams. A novel area for future research is to implement a real-time sentiment analyser in a mobile application. This would involve building an app that connects government agencies to their citizens, enabling citizens to send their sentiment to the relevant agencies. The app could then classify the sentiment. Other case studies involving public debates, such as fracking and immigration, could also be carried out.
CHAPTER SIX
6 CONCLUSION
Extensive research in the field of sentiment analysis has been presented. This research sought to apply this important field to an emerging discipline. SA has traditionally been researched and applied in product and movie reviews, and there are many publicly available datasets that make undertaking such SA tasks somewhat easier. Although the same methods can be applied in this research, the nature of the data makes a difference, owing to the domain-specific lexicons it contains. Most of the objectives set out were achieved, namely the general research on SA, the term frequency analysis and the topic modelling. The challenge, however, was using either a lexicon dictionary or our own classifier, rather than a general prebuilt classifier, to score and classify our corpus.
The research has presented a good understanding of SA and its importance as a business intelligence tool. For this particular subject area, it offers public administrators a very interesting tool for both decision making and customer relationship management. Social media has created a platform that should be taken advantage of by analysing the various public sentiments expressed on it. With respect to the case study, although interesting results were obtained from the data, such as the stophs2 campaign and the campaign to extend HS2 to Liverpool, no conclusive statements could be made about how the UK citizenry feels towards HS2. The topic modelling also suggests considerable concern over the cost of the project. Another interesting revelation from the term frequency analysis was excitement about the prospect of job creation, particularly in London. A very good insight into undertaking sentiment analysis has been gained. The work presented is a good basis for further work, either on this same case study or extended to other areas.
In conclusion, this research has been a useful exercise in understanding how best to apply SA in practice to a given task, especially in domain-specific areas. SA is a beneficial tool that governments should use in their public administration duties, so as to remain responsive to the concerns of citizens.
BIBLIOGRAPHY
Barbosa, L., & Feng, J. (2010). Robust Sentiment Detection on Twitter from Biased and Noisy Data. Proceedings of the 23rd International Conference on Computational Linguistics, (pp. 36-44).
Bermingham, A., Conway, M., McInerney, L., O’Hare, N., & Smeaton, A. F. (2009). Combining Social Network Analysis and Sentiment Analysis to Explore the Potential for Online Radicalisation. International Conference on Advances in Social Network Analysis and Mining.
Feinerer, I., Hornik, K., & Meyer, D. (2008). Text Mining Infrastructure in R. Journal of Statistical Software.
Maindonald, J. (2008). Using R for Data Analysis and Graphics Introduction, Code and Commentary.
Mislove, A., Lehmann, S., Ahn, Y.-Y., Onnela, J.-P., & Rosenquist, J. (2010). Pulse of the Nation. Retrieved 8 15, 2014, from Pulse of the Nation: http://www.ccs.neu.edu/home/amislove/twittermood/
Nemschoff, M. (2014, June 28). A Quick Guide to Structured and Unstructured Data. Retrieved 08 04, 2014, from Smart Data Collective: http://smartdatacollective.com/michelenemschoff/206391/quick-guide-structured-andunstructured-data
Pak, A., & Paroubek, P. (2010). Twitter as a Corpus for Sentiment Analysis and Opinion Mining.
Taboada, M., Brooke, J., Tofiloski, M., Voll, K., & Stede, M. (2011). Lexicon-Based Methods for Sentiment Analysis. Computational Linguistics, 267-307.
Whitelaw, C., Garg, N., & Argamon, S. (2005). Using appraisal groups for sentiment analysis. Proceedings of the ACM SIGIR Conference on Information and Knowledge, (pp. 625-631).
Data preprocessing. (2014, April 18). Retrieved April 18, 2014, from TechTarget: http://searchsqlserver.techtarget.com/definition/data-preprocessing
dictionary.com. (2014, April 5). Retrieved April 5, 2014, from dictionary.com: http://dictionary.reference.com/browse/algorithm
Hs2 story. (2014, 06 16). Retrieved 06 16, 2014, from HS2 engine for growth: http://www.hs2.org.uk/about-hs2/high-speed-rail-hs2/hs2-story
Abbasi, A. (2007). Affect intensity analysis of dark web forum. Proceedings of Intelligence and Security Informatics, (pp. 282–288).
Aggarwal, C. C. (2011). Social Network Data Analytics. New York, London: Springer.
Aggarwal, C. C., & Zhai, C. (2012). A survey of Text Classification Algorithms. In C. C. Aggarwal, & C. Zhai, Mining Text Data (pp. 163-213). Springer.
Aggarwal, C. C., & Zhai, C. (2012). Mining Text Data. New York: Springer.
Antonio do Prado, H., & Ferneda, E. (2008). Emerging Technologies of Text Mining: Techniques and Applications. London, Hershey PA: Information Science Reference.
Arnott, D., & Pervan, G. (2005). A critical analysis of Decision Support Systems research. Journal of Information Technology, 67-87.
Arunachalam, R., & Sarkar, S. (2013). The New Eye Of Government: Citizen Sentiment Analysis in Social Media. IJCNLP 2013 Workshop on Natural Language Processing for Social Media (SocialNLP), (pp. 23-28). Nagoya.
Arunachalam, R., & Sarkar, S. (2013). The New Eye of Government: Citizen Sentiment Analys. International Joint Conference on Natural Language Processing, (pp. 23-28). Nagoya.
Asur, S., & Huberman, B. (2010). Predicting the future with social media.
Azevedo, A., & Santos, M. F. (2012). Closing the Gap between Data Mining and Business Users of Business Intelligence Systems: A Design Science Approach. International Journal of Business Intelligence Research, 14-53.
Bautin, M., Ward, C. B., Patil, A., & Skiena, S. S. (2010). Access: News and Blog Analysis for the Social Sciences, (pp. 1229-1232).
BBC. (2014, 06 25). Reaction as HS2's second phase details unveiled. Retrieved June 25, 2014, from BBC news UK: http://www.bbc.co.uk/news/uk-21229602
Benamara, F., Cesarano, C., Picariello, A., Reforgiato, D., & Subrahmanian, V. (2007). Sentiment analysis: Adjectives and adverbs are better than adjectives alone. International Conference in Weblogs and Social Media.
Berry, M. J., & Linoff, G. S. (2004). Data Mining Techniques for Marketing, Sales, and Customer Relationship Management. In Data Mining Techniques for Marketing, Sales, and Customer Relationship Management (p. 8). Indianapolis, Indiana: Wiley Publishing.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 996.
Bollen, J., Mao, H., & Pepe, A. (2011). Modeling Public Mood and Emotion: Twitter Sentiment and Socio-Economic Phenomena. Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media, (pp. 450-453).
Butcher, L. (2014). Standard Notes House of Commons. Library House of Commons.
Ceron, A., Curini, L., Mlacos, S., & Porro, G. (2013). Every tweet counts? How sentiment analysis of social media can improve our knowledge of citizens' political preferences with an application to Italy and France. Italy: Sage.
City of York. (2014, April 3). City of York Council. Retrieved 08 07, 2014, from City of York Council: http://www.york.gov.uk/news/article/476/connected_cities_cities_united_voice_on_hs2
Crain, S. P., Zhou, K., Yang, S.-H., & Zha, H. (2012). Dimensionality Reduction And Topic Modelling: From Latent Semantic Indexing To Latent Dirichlet Allocation And Beyond. In Mining Text Data (pp. 129-156). Springer.
Dave, K., Lawrence, S., & Pennock, D. M. (2003). Mining the peanut gallery: Opinion extraction and semantic classification of product review. Proceedings of the 12th international world wide web conference, (pp. 519-528).
Encyclopaedia Britannica. (2014, June 9). Government: Encyclopaedia Britannica. Retrieved June 9, 2014, from Encyclopaedia Britannica: http://www.britannica.com/EBchecked/topic/240105/government
Fang, Y., Si, L., Somasundaram, N., & Yu, Z. (2012). Mining Contrastive Opinions on Political Texts using. Retrieved April 23, 2014, from Purdue.edu: https://www.cs.purdue.edu/homes/lsi/WSDM_2012.pdf
Ghiassi, M., Skinner, J., & Zimbra, D. (2013). Twitter Brand Sentiment Analysis. A hybrid system using n-gram analysis and dynamic atific. Expert Systems with Applications.
Go, A., Bhayani, R., & Huang, L. (2009). Twitter Sentiment Classification using Distant Supervision.
Golfarelli, M., Dario, M., & Rizzi, S. (2004). The dimensional fact model: a conceptual model for data warehouses. International Journal of Cooperative Information Systems, 215-247.
Grimes, S. (2014, May 15). Break Through analysis. Retrieved May 15, 2014, from breakthroughanalysis: http://breakthroughanalysis.com/2012/09/10/typesofsentimentanalysis/
Gundecha, P., & Liu, H. (2012). Mining social media: a brief introduction. Tutorials in Operations Research, 1-17.
Gundecha, P., & Liu, H. (2014, April). http://www.public.asu.edu/. Retrieved April 2014, from http://www.public.asu.edu/: http://www.public.asu.edu/~pgundech/book_chapter/smm.pdf
Han, J., & Kamber, M. (2005). Data Mining Concepts and Techniques. Morgan Kaufmann.
Hatzivassiloglou, V., & Wiebe, J. (2000). Effects of adjective orientation and gradability on sentence subjectivity. International Conference on Computational Linguistics.
Horak, R. (2007). Telecom Dictionary, A comprehensive reference for telecommunications terminology. Indianapolis: Wiley Publications.
Hu, M., & Liu, B. (2005). Mining and summarizing customer reviews. Proceedings of the conference on Human Language Technology and Empirical Methods in Natural.
Hu, X., & Liu, H. (2012). Text Analytics in Social Media. In C. C. Aggarwal, & C. Zhai, Mining Text Data (pp. 385-408). New York, London: Springer.
IPSOS. (2013). High Speed Two: Exceptional Hardship scheme for phase two. Social Research Institute.
Johnson, J. (2012, November 14). Structured Data vs. Unstructured Data. Retrieved 08 04, 2014, from KPI Partners: http://www.kpipartners.com/blog/bid/137981/StructuredData-vs-Unstructured-Data
Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 11-21.
Kaplan, A. M., & Haenlein, M. (2010). "Users of the world, unite! The challenges and opportunities of social media". Business Horizons, 53(1), p. 61.
Kennedy, A., & Inkpen, D. (2006). Sentiment classification of movie reviews using. Computational Intelligence, 22:110–125.
Kim, S. M., & Hovy, E. (2004). Determining the sentiment of opinions. Proceedings of the 20th International Conference on Computational Linguistics.
Kwak, H., Lee, C., Park, H., & Moon, S. (2009). What is Twitter, A social network or Network or News Media.
Liu, B. (2006). Web Data Mining Chapter Opinion Mining. Springer.
Liu, B., & Zhang, L. (2012). A Survey Of Opinion Mining And Sentiment Analysis. In C. C. Aggarwal, & C. Zhai, Mining Text Data (pp. 415-452). New York: Springer.
Liu, B., & Zhang, L. (2012). A Survey Of Opinion Mining And Sentiment Analysis. In C. C. Aggarwal, & C. Zhai, Mining Text Data (p. 41). New York, London: Springer.
Liu, B., Hu, M., & Cheng, J. (2005). Opinion Observer: analysing and comparing opinions on the web. Proceedings of the international conference on World Wide Web.
Lonnqvist, A., & Pirttimaki, V. (2006). The Measurement of Business Intelligence. Information Systems Management journal, 32-40.
Luftman, J., & Mclean, E. R. (2012). Key issues for IT executives. MISQ Executive, 89-104.
McKinney, W. (2013). Python for Data Analysis. Sebastopol: O'Reilly.
Mejova, Y. (2009). Sentiment Analysis: An Overview. Iowa.
Morstatter, F., Kumar, S., Liu, H., & Maciejewski, R. (n.d.). Public.asu.edu. Retrieved June 23, 2014, from public.asu.edu: http://www.public.asu.edu/~huanliu/papers/kdd2013demoFM.pdf
Na, C. J., Sui, H., Khoo, C., Chan, S., & Zhou, Y. (2004). Effectiveness of simple linguistic processing in automatic sentiment classification of product reviews. Conference of the International Society of Knowledge Organization, (pp. 49–54).
Nenkova, A., & McKeown, K. (2012). A survey of Text Summarization Techniques. In Mining Text Data (pp. 43-66). Springer.
Pang, B., & Lee, L. (2008). Opinion Mining and Sentiment Analysis.
Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment Classification using Machine Learning. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 79-86). Philadelphia: Association for Computational Linguistics.
Popescu, M. A., & Etzioni, O. (2005). Extracting product features and opinions from reviews. Proceedings of the conference on Human Language Technology and Empirical.
Rajaraman, A., & Ullman, J. D. (2011). Mining of Massive Datasets. Cambridge University Press.
Rehfeld, A. (2005). Towards a General Theory of Political Representation. Journal of Politics, 1-53.
Russell, M. A. (2013). Mining the social web. Sebastopol: O'Reilly Media.
Sakaki, T., Okazaki, M., & Matsuo, Y. (2010). Earthquake shakes Twitter Users: Real-Time Event Detection by social sensors.
Savigny, H. (2002). Public opinion, political communication and the Internet. pp. 1-8.
Schellong, A. (2008). Citizen Relationship Management. Frankfurt: Peter Lang.
Search Engine watch. (2014, July 3). Worldwide Social Media Usage Trends in 2012. Retrieved July 3, 2014, from Search Engine watch: http://searchenginewatch.com/article/2167518/Worldwide-Social-Media-UsageTrends-in-2012
Subasic, P., & Huettner, A. (2001). Affect analysis of text using fuzzy semantic typing. IEEE-FS, 483–496.
Tan, P.-N., Steinbach, M., & Kumar, V. (2006). Introduction to Data Mining. In Introduction to Data Mining (p. 2). Boston: Pearson Education Inc.
Thomas, M., & Pang, B. L. (2006). Get out the vote: Determining support or opposition from congressional floor-debate transcripts. Proceedings of the Conference on Empirical Methods in Natural Language Processing, (pp. 327–335).
Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley Publishing Company.
Turney, P. (2002). Thumbs up or thumbs down?: Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, (pp. 417-424).
Twitter. (2012, March 21). Blog.twitter. Retrieved June 24, 2014, from Blog.twitter: https://blog.twitter.com/2012/twitter-turns-six
UN. (2012, December 05). UN public administration programme. Retrieved June 04, 2014, from unpan1.un.org: http://unpan1.un.org/intradoc/groups/public/documents/un/unpan050896.pdf
Wickre, K. (2013, March). Celebrating Twitter 7. Retrieved 06 3, 2014, from Twitter.com: http://blogtwitter.com/2013/03/celebrating-twitter7.html
Wiebe, W. T., & Hoffmann, J. (2005). Recognising contextual Polarity in Phrase-level sentiment analysis. Proceeding of HLT-EMNLP.
Wilks, Y., & Stevenson, M. (1998). The grammar of sense: Using part-of-speech tags as a first step in semantic disambiguation. Journal of Natural Language Engineering, 135–144.
Xin, J., Gallagher, A., Cao, L., Luo, J., & Han, J. (2010). The wisdom of social multimedia. Proceedings of ACM multimedia 2010 international conference, (pp. 1235–1244). Firenze: ACM.
Yin, R. K. (1984). Case study research. Newbury Park, CA: Sage.
Yu, H., & Hatzivassiloglou, V. (2003). Towards answering opinion questions: Separating facts from opinions and identifying the polarity of opinion sentences. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
Zabin, J., & Jefferies, A. (2008). Social Media Monitoring and Analysis: Generating consumer insights from online conversation. Aberdeen group Benchmark Report.
Zarrella, D. (2010). The Social Media Marketing book. North Sebastopol CA: O'Reilly media.
Zhao, W. X., Jiang, J., Weng, J., He, J., Lim, E.-P., Yan, H., et al. Comparing Twitter and Traditional Media Using.
APPENDIX
1. CODE FOR EXTRACTING STREAMING TWEETS
2. LDA TOPIC MODELLING CODES