Sentimental Analysis framework for Twitter Streaming Data
Maunika Nittala Prajkta R. Bhandarwar
Dept. of Information Technology Dept. of Information Technology
Abstract: Twitter is an online social networking service with more than 300 million users, generating a huge amount of information every day. Twitter's most important characteristic is that it lets users tweet about events, situations, feelings, opinions, or even something entirely new. This study focuses on analysing the social activity that results from tweets. Social set analysis consists of a generative framework for combining big social data sets with organizational and societal data sets. Several workflows currently offer data analysis for Twitter, presenting general processing over streaming data. This study attempts to develop an analytical framework capable of in-memory processing to extract and analyse structured and unstructured Twitter data. Spark makes it possible to run sophisticated data processing and machine learning algorithms. We conduct a case study on tweets about politics and analyse people's reactions as expressed in those tweets. The proposed framework includes data ingestion, stream processing, and data visualization components.
The huge volume of digital data now being generated, known as big data, is difficult to store and difficult to handle, which is why big data analysis is needed. Analysing this growing data is possible with analytical tools. Big data tools and technologies make it possible to handle such volumes. Data management involves many such technologies for different types of data: real-time, structured, and unstructured. A product is a service or item provided to the user, and may be hardware or software. Every product has a monetary value. A product can be an innovation or a reinvention. This work covers the analysis of data using the tools and platforms currently available, as well as the concept of machine learning. This section gives an overview of the analytical platforms that have been used or studied in the past. Twitter is currently the most commonly used microblogging application, which is why we decided to work on it.
Twitter Application Programming Interface
The Twitter API is used to collect streaming Tweets from Twitter; the interface also stores tweet scores along with their timestamps.
Tweets posted publicly by users are extracted; only publicly published Tweets can be captured using the API. To create a POST request to the Twitter API and fetch the search results as a stream, the system uses the Create_Streaming_Connection() method. An application may submit up to 5,000 Twitter user IDs in one connection. The Streaming API searches for hashtags, keywords, and geographic bounding boxes simultaneously. The filter API performs the search and delivers a continuous stream of Tweets that match the filter tag. The POST method is preferred for creating the request because long URLs are truncated, and the GET method is used to retrieve the results.
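The filter request described above can be sketched as follows. This is a minimal illustration, not the system's actual code: the helper name build_filter_params is hypothetical, and the parameter names track, follow, and locations are assumptions modelled on the legacy Twitter v1.1 streaming filter endpoint. The sketch only assembles and validates the form body that would be sent with the POST request; it opens no connection.

```python
from typing import Dict, Iterable, Sequence, Tuple

# Per-connection limit on user IDs, as stated in the text.
MAX_FOLLOW_IDS = 5000

# Assumed box layout: (west longitude, south latitude, east longitude, north latitude).
BoundingBox = Tuple[float, float, float, float]


def build_filter_params(
    track: Sequence[str] = (),
    follow: Iterable[int] = (),
    locations: Sequence[BoundingBox] = (),
) -> Dict[str, str]:
    """Assemble the form body for a streaming filter request (hypothetical helper)."""
    follow_ids = list(follow)
    if len(follow_ids) > MAX_FOLLOW_IDS:
        raise ValueError(f"at most {MAX_FOLLOW_IDS} user IDs per connection")

    params: Dict[str, str] = {}
    if track:
        # Hashtags and keywords travel together as one comma-separated list.
        params["track"] = ",".join(track)
    if follow_ids:
        params["follow"] = ",".join(str(user_id) for user_id in follow_ids)
    if locations:
        params["locations"] = ",".join(
            str(coord) for box in locations for coord in box
        )
    if not params:
        raise ValueError("at least one filter predicate is required")
    return params
```

Sending these values in a POST body rather than in the query string keeps a long list of IDs or keywords out of the URL, which is the reason the text gives for preferring POST.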
Turney et al. used a bag-of-words method for sentiment analysis, in which the relationships between words are not considered at all and a sentence is treated simply as a collection of words. To determine the sentiment of the whole sentence, the sentiment of each individual word is determined separately and those values are combined using an aggregation function.
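A minimal sketch of this bag-of-words scoring is given below. The lexicon TOY_LEXICON and its scores are invented for illustration, and the mean is only one possible aggregation function; neither is taken from the cited work.

```python
import re
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical per-word sentiment scores, for illustration only.
TOY_LEXICON: Dict[str, float] = {
    "good": 1.0, "great": 1.0, "love": 1.0,
    "bad": -1.0, "terrible": -1.0, "hate": -1.0,
}


def tokenize(sentence: str) -> List[str]:
    """Lower-case the sentence and split it into words; word order is not used later."""
    return re.findall(r"[a-z']+", sentence.lower())


def sentence_score(
    sentence: str,
    lexicon: Dict[str, float] = TOY_LEXICON,
    aggregate: Callable[[List[float]], float] = mean,
) -> float:
    """Score each known word separately, then combine the scores with `aggregate`."""
    scores = [lexicon[word] for word in tokenize(sentence) if word in lexicon]
    return aggregate(scores) if scores else 0.0
```

Because only the collection of words matters, reordering the words of a sentence leaves its score unchanged, which is exactly the limitation the bag-of-words view accepts.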
Pak and Paroubek proposed a model to classify tweets as positive or negative. Using the Twitter API, they built a Twitter corpus by collecting tweets and automatically annotating them using emoticons. Using that corpus, they developed a multinomial Naive Bayes sentiment classifier that uses features such as POS tags and N-grams. The training set used in the experiment was less efficient because it considered only tweets that contain emoticons.
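The idea of emoticon-based annotation followed by multinomial Naive Bayes can be sketched as below. This is a simplified illustration under stated assumptions, not the cited authors' implementation: the emoticon lists are hypothetical, only unigram features are used (no POS tags or higher-order N-grams), and add-one smoothing is assumed.

```python
import math
import re
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical emoticon lists used as noisy labels.
POSITIVE_EMOTICONS = (":-)", ":)", ":D")
NEGATIVE_EMOTICONS = (":-(", ":(")

Model = Tuple[Counter, Dict[str, Counter]]


def label_by_emoticon(tweet: str) -> Optional[str]:
    """Return a label from the emoticons, or None if there are none or they conflict."""
    has_pos = any(e in tweet for e in POSITIVE_EMOTICONS)
    has_neg = any(e in tweet for e in NEGATIVE_EMOTICONS)
    if has_pos == has_neg:
        return None
    return "positive" if has_pos else "negative"


def tokens(tweet: str) -> List[str]:
    """Strip the emoticons (they are labels, not features) and return unigrams."""
    for emoticon in POSITIVE_EMOTICONS + NEGATIVE_EMOTICONS:
        tweet = tweet.replace(emoticon, " ")
    return re.findall(r"[a-z]+", tweet.lower())


def train(tweets: Iterable[str]) -> Model:
    """Count tweets per class and words per class, skipping unlabelled tweets."""
    class_counts: Counter = Counter()
    word_counts: Dict[str, Counter] = defaultdict(Counter)
    for tweet in tweets:
        label = label_by_emoticon(tweet)
        if label is None:
            continue
        class_counts[label] += 1
        word_counts[label].update(tokens(tweet))
    return class_counts, word_counts


def predict(text: str, model: Model) -> Optional[str]:
    """Pick the class with the highest log posterior, using add-one smoothing."""
    class_counts, word_counts = model
    vocabulary = set().union(*word_counts.values())
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        words_in_class = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total)
        for word in tokens(text):
            score += math.log(
                (word_counts[label][word] + 1) / (words_in_class + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Tweets without emoticons never enter the training set in this sketch, which mirrors the limitation noted above.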
Po-Wei Liang et al. used the Twitter API to collect data from Twitter. Tweets that contain opinions were filtered out. A unigram Naive Bayes model was developed for polarity identification. They also eliminated unwanted features using the Mutual Information and Chi-square feature extraction methods. However, this approach did not achieve better accuracy in predicting tweets as positive or negative.
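As an illustration of chi-square feature elimination, the sketch below ranks terms by the standard chi-square statistic for a 2x2 term/class contingency table. How the cited work computed and thresholded the statistic is not stated here, so the function names, the table layout, and the top-k cut-off are all assumptions.

```python
from typing import Dict, List, Tuple

# Assumed table layout (counts of tweets):
# (in class with term, outside class with term,
#  in class without term, outside class without term)
Table = Tuple[int, int, int, int]


def chi_square(n11: int, n10: int, n01: int, n00: int) -> float:
    """Chi-square statistic of one term against one class from a 2x2 table."""
    n = n11 + n10 + n01 + n00
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    if denominator == 0:
        return 0.0  # a row or column is empty, so the term carries no signal
    return n * (n11 * n00 - n10 * n01) ** 2 / denominator


def select_features(term_tables: Dict[str, Table], k: int) -> List[str]:
    """Keep the k terms most strongly associated with the class."""
    ranked = sorted(
        term_tables, key=lambda term: chi_square(*term_tables[term]), reverse=True
    )
    return ranked[:k]
```

A term that is spread evenly across both classes scores zero and is the first to be dropped.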
Thet proposes a linguistic approach to aspect-based opinion mining, which is a clause- or sentence-level sentiment analysis of opinionated texts. For every sentence in a message post, it generates a syntactic dependency tree and splits the sentence into clauses. It then determines the contextual sentiment score for each clause using the grammatical dependencies of words, drawing on SentiWordNet, which provides prior sentiment scores for words, and on domain-specific lexicons.
Hussein surveys previous work; the goal is to identify the most significant challenges in sentiment analysis and to explore how to improve the accuracy of results relative to the techniques used.
Twitter API keys are generated from a developer account after an application is created. The proposed system extracts data using Twitter's Streaming API. The extracted tweets are loaded into HDFS with the help of Flume and converted from an unstructured to a structured format, which is pre-processed using MapReduce. The numbers of all positive tweets, positive words, and negative words are considered. The probability of each word is then checked and used for classification: the word is positive if its probability is greater than 0.6, neutral if the probability is between 0.4 and 0.6, and negative if it is less than 0.4.
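The threshold rule can be sketched as follows. Two points are assumptions rather than statements from the text: the lower cut-off is taken to be 0.4 (the lower bound of the neutral band), and the word probability is estimated as the share of a word's occurrences that fall in positive tweets. Both function names are hypothetical.

```python
POSITIVE_THRESHOLD = 0.6  # from the text
NEGATIVE_THRESHOLD = 0.4  # assumed: the lower bound of the neutral band


def word_positive_probability(positive_count: int, negative_count: int) -> float:
    """Assumed estimate: fraction of the word's occurrences seen in positive tweets."""
    total = positive_count + negative_count
    if total == 0:
        return 0.5  # unseen word: treat as neutral
    return positive_count / total


def classify_probability(probability: float) -> str:
    """Map a probability in [0, 1] to positive, neutral, or negative."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if probability > POSITIVE_THRESHOLD:
        return "positive"
    if probability < NEGATIVE_THRESHOLD:
        return "negative"
    return "neutral"
```

For example, a word seen three times in positive tweets and once in negative tweets has an estimated probability of 0.75 and is classified as positive.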
The project proceeds in three phases:
1. Extraction and processing of streaming data
In the first phase, we will create a Twitter developer account. In the developer account, an application has to be created, which generates the API keys. These keys are required to extract Twitter's live streaming data. This data is processed using Flume.
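In the proposed system this phase is carried out by Flume. Purely as an illustration of what turning a raw tweet into a structured record involves, the Python sketch below flattens one JSON tweet into a fixed set of columns. The field names id_str, created_at, user.screen_name, and text are assumptions about the layout of the streamed tweet JSON, and the helper name is hypothetical.

```python
import json
from typing import Dict, Optional


def to_structured_row(raw_line: str) -> Optional[Dict[str, str]]:
    """Flatten one raw JSON tweet into fixed columns; return None if it is unusable."""
    try:
        tweet = json.loads(raw_line)
    except json.JSONDecodeError:
        return None  # malformed line in the stream
    if not isinstance(tweet, dict) or "text" not in tweet:
        return None  # not a tweet (for example a delete notice)
    user = tweet.get("user") or {}
    return {
        "id": str(tweet.get("id_str", "")),
        "created_at": str(tweet.get("created_at", "")),
        "user": str(user.get("screen_name", "")),
        # Collapse newlines and repeated spaces so each tweet fits on one row.
        "text": " ".join(str(tweet["text"]).split()),
    }
```

Rows of this shape are what the classification phase would consume.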
2. Classification of the processed data
In the second phase, the structured data obtained from the first phase is classified into three categories: 1) Positive, 2) Neutral, 3) Negative.
If the words are mapped as positive, the tweet is classified as positive. If the words are mapped as negative, the tweet is classified as negative. If the words are mapped as neither positive nor negative, the tweet is classified as neutral.
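A minimal sketch of this mapping rule follows. The word lists are hypothetical, and the handling of tweets that contain both positive and negative words (majority wins, ties count as neutral) is an assumption, since the rule above does not cover that case.

```python
import re

# Hypothetical word lists; a real system would load much larger ones.
POSITIVE_WORDS = {"good", "great", "win", "support", "hope"}
NEGATIVE_WORDS = {"bad", "corrupt", "fail", "lies", "scandal"}


def classify_tweet(text: str) -> str:
    """Label a tweet by which kind of mapped word dominates."""
    words = re.findall(r"[a-z']+", text.lower())
    positive_hits = sum(word in POSITIVE_WORDS for word in words)
    negative_hits = sum(word in NEGATIVE_WORDS for word in words)
    if positive_hits > negative_hits:
        return "positive"
    if negative_hits > positive_hits:
        return "negative"
    # No mapped words, or an even split (assumed tie rule).
    return "neutral"
```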
3. Visualization of the classified data
In the last phase, the categorized data is visualized with the help of Python. It is shown in the form of a pie chart or a graph.
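A sketch of the visualization step is shown below. It assumes matplotlib as the plotting library, which the text does not name; the function names, chart title, and output file name are hypothetical. The share computation is kept separate from the plotting so it can be checked on its own.

```python
from collections import Counter
from typing import Dict, Iterable

CATEGORIES = ("positive", "neutral", "negative")


def category_shares(labels: Iterable[str]) -> Dict[str, float]:
    """Percentage of tweets in each of the three categories."""
    counts = Counter(labels)
    total = sum(counts[category] for category in CATEGORIES)
    if total == 0:
        return {category: 0.0 for category in CATEGORIES}
    return {category: 100.0 * counts[category] / total for category in CATEGORIES}


def save_pie_chart(labels: Iterable[str], path: str = "sentiment_pie.png") -> None:
    """Draw the shares as a pie chart and write it to `path` (needs matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display required
    import matplotlib.pyplot as plt

    shares = category_shares(labels)
    figure, axes = plt.subplots()
    axes.pie(
        [shares[category] for category in CATEGORIES],
        labels=CATEGORIES,
        autopct="%1.1f%%",
    )
    axes.set_title("Tweet sentiment")
    figure.savefig(path)
    plt.close(figure)
```

Calling save_pie_chart with the list of labels produced in the classification phase writes a single image file with one slice per category.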
Prior Art Search
Naive Bayes and Support Vector Machines are techniques commonly used for building sentiment classifiers. Another approach uses natural language processing techniques to determine topics, extract the attributes of those topics, detect opinions about the attributes, and measure the sentiment value.
This analysis can help maximize profits in any field. Today, major business decisions are made using insights derived from data about the organization or its industry. As competition increases and customers are flooded with choices, it has become important to move quickly and accurately in the market, and this kind of analysis can help grow a business by bringing both speed and accuracy to its decisions. It can also help in politics, for analysing various issues and people's needs, and it can be used for analysis in technical fields as well.
B. Yadranjiaghdam, N. Pool, N. Tabrizi, A Survey on Real-time Big Data Analytics: Applications and Tools, in Proceedings of the International Conference on Computational Science and Computational
Babak Yadranjiaghdam, Seyedfaraz Yasrobi, Nasseh Tabrizi, Developing a Real-Time Data Analytics Framework for Twitter Streaming Data, Department of Computer Science, East Carolina University, Greenville, NC, 2017 IEEE 6th International Congress on Big Data.
M. Trupthi, Suresh Pabboju, G. Narasimha, Sentiment Analysis on Twitter Using Streaming API, 2017 IEEE 7th International Advance Computing Conference.