Abstract—This project explores Twitter hashtag analysis. The #MeToo movement on Twitter has drawn praise and dissent in roughly equal measure. Some laud its achievement in reducing the social stigma surrounding survivors of sexual assault and abuse. Others criticize it for fueling anti-male sentiment and for the unjust prejudgment of those accused. In this project I explore text mining techniques, working with MongoDB, working with R and Shiny Dashboard, and visualization using these tools.
The #MeToo movement was a social phenomenon that exploded on social media in the fall of 2017. It was an awareness campaign against the widespread sexual assault and harassment faced by women and men across the world.
The phrase was first coined by activist and community organizer Tarana Burke, who used it to promote empowerment and empathy among women of color who had survived sexual abuse. It was then popularized and made into a hashtag by actress Alyssa Milano.
Milano, along with many women in the film industry, used the hashtag to shed light on the sexual harassment and violence they faced in the workplace. The movement grew as more women published their own #MeToo stories on social media, causing a shift in public perception of sexual assault. The movement has also drawn criticism, leading to subversive use of the hashtag. In this project, I analyzed tweet data to explore the different sentiments toward the #MeToo movement, the words people use to describe their stories, the number of retweets, the most retweeted tweet, and so on.
A set of 6000 tweets from 4726 users was pulled from the Twitter API in RStudio, using the authentication credentials that Twitter provides. The R package “twitteR” was used to authenticate and download the tweets. These tweets were then stored in MongoDB.
The advent of social media as a new means of communication has brought about the ability to measure and quantify society like never before. Due to its accessibility, immediacy, and large audience, social media has become the platform where people discuss and share their ideas and where social movements find an audience. This has made social media a popular resource for the analysis and study of social movements.
New tools have been developed to make use of these new data sources, most prominently natural language processing toolkits, streamlined data collection methods such as APIs, and dedicated network analysis applications. These tools have been applied to study various social movements. A study conducted by Jenna Jacobson and Christopher Mascaro on “Movember” on Twitter, an annual month-long celebration of the mustache, found that while the movement had a large following, it was no longer true to its original goals. Movember was originally started to raise awareness of men’s health and facilitate conversation. The study found, through tweet text analysis, that the majority of the conversation taking place was not unique or original user-generated content, but rather advertising in the form of retweets pushing product URLs, and fundraising for the campaign.
Another study, by Li Tan, Suma Ponnam, Patrick Gillham, Bob Edwards, and Eric Johnson, dealt with the effect of social media on social movements, specifically on the Occupy Wall Street movement on Twitter. By looking at the volume of related tweet data and the presence of influential tweeters, the group was able to show a correlation between the development of a social movement and the activity taking place online. Techniques similar to those in the studies above were used for the analysis of my dataset and the exploration of the #MeToo movement.
R is a programming language and free software environment for statistical computing and graphics, supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software. RStudio is a free and open-source integrated development environment for R. It was founded by J.J. Allaire, creator of the programming language ColdFusion. Hadley Wickham is the Chief Scientist at RStudio. The version of RStudio used for this project.
Shiny Dashboard is one of the most useful tools available in R. Shiny is an R package that makes it easy to build interactive web apps straight from R. Dashboards are a natural fit wherever data is available, since they help present results in one place. In this project I used Shiny Dashboard to give all my visualization results a single, user-friendly platform where they can be viewed together.
MongoDB is a cross-platform, document-oriented database program. It is a NoSQL database that uses JSON-like documents with optional schemas. The version of MongoDB used in this project. All the data I collected from Twitter is stored in MongoDB, in a database named “twitterDB” and a collection named “twitterData”.
To run this project, you need a Twitter account and must register your application at apps.twitter.com to obtain the tokens required to authenticate with Twitter and use its data.
The tweets in this project were queried using the keyword “#MeToo”. The tweets were then cleaned using the “tm” package. Cleaning involved creating a corpus, which is a collection of text documents, over which I applied text mining and language processing techniques. I then removed all URLs and non-English words, converted all words to lowercase, removed extra white space, and removed stop words such as a, an, the, and is. After the data was cleaned, I created a “Term Document Matrix” (TDM). A TDM is a matrix that records the frequency of each term in each document of the collection. After creating the TDM, I found the unique words with the highest frequency and plotted them using “ggplot” (a library in R for data visualization).
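The cleaning and counting steps described above can be sketched as follows. The project itself used R and the “tm” package; this is only a minimal Python illustration using the standard library, not the code used in the project. The stop-word list, the sample tweets, and the function names are assumptions made for the example, and stripping everything except ASCII letters is a rough stand-in for removing non-English words.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "in", "it", "that"}

def clean_tweet(text):
    """Lowercase, drop URLs and non-letter characters, then remove stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # keep only ASCII letters and whitespace
    # split() with no arguments also collapses runs of whitespace
    return [word for word in text.split() if word not in STOP_WORDS]

def term_document_matrix(tweets):
    """Return (terms, matrix) where matrix[i][j] counts term i in tweet j."""
    docs = [Counter(clean_tweet(tweet)) for tweet in tweets]
    terms = sorted(set().union(*docs)) if docs else []
    matrix = [[doc[term] for doc in docs] for term in terms]
    return terms, matrix

def top_terms(terms, matrix, n=10):
    """Sum each term's row and return the n most frequent terms."""
    totals = {term: sum(row) for term, row in zip(terms, matrix)}
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

# Hypothetical sample tweets, not taken from the dataset.
sample_tweets = [
    "The #MeToo movement is changing the conversation https://t.co/abc",
    "RT @someone: I support the #MeToo movement",
    "#MeToo is a movement and a moment",
]
terms, matrix = term_document_matrix(sample_tweets)
print(top_terms(terms, matrix, n=3))
# [('metoo', 3), ('movement', 3), ('changing', 1)]
```

Each row of the matrix corresponds to one term and each column to one tweet, so summing a row gives the overall frequency that would be plotted.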
I also created a word network to explore words that occur together in tweets. To build this network I used n-grams, where n specifies how many consecutive words are connected together; I chose n = 2. Using the word frequencies and the TDM, I created a word cloud. A word cloud is an image of the words in a text, in which the size of each word indicates its frequency or importance.
Sentiment analysis is the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states in order to determine the writer’s attitude toward a particular topic or product. I categorized my tweets into ten categories: anger, anticipation, disgust, fear, joy, negative, positive, sadness, surprise, and trust.
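The two-word network and the category-based sentiment scoring can be illustrated with a small sketch. This is Python using only the standard library, not the R code used in the project. The lexicon below is a hypothetical toy example with a handful of made-up entries; the source does not say which lexicon or package produced the ten categories, so the scoring shown here (counting one point per category for every matching word) is an assumption.

```python
from collections import Counter

def bigrams(tokens):
    """Return adjacent word pairs (n-grams with n = 2)."""
    return list(zip(tokens, tokens[1:]))

def bigram_counts(tokenized_tweets):
    """Count how often each adjacent word pair occurs across all tweets."""
    counts = Counter()
    for tokens in tokenized_tweets:
        counts.update(bigrams(tokens))
    return counts

# Toy lexicon mapping words to categories (hypothetical entries, illustration only).
TOY_LEXICON = {
    "brave": {"positive", "trust"},
    "support": {"positive", "trust"},
    "abuse": {"negative", "anger", "fear", "sadness"},
    "afraid": {"negative", "fear"},
}

def emotion_scores(tokenized_tweets, lexicon):
    """Add one point to each category of every word found in the lexicon."""
    scores = Counter()
    for tokens in tokenized_tweets:
        for word in tokens:
            scores.update(lexicon.get(word, ()))
    return scores

# Hypothetical, already-cleaned tweets.
tokenized = [
    ["metoo", "movement", "support", "survivors"],
    ["metoo", "movement", "brave", "women"],
    ["afraid", "report", "abuse"],
]
edges = bigram_counts(tokenized)
scores = emotion_scores(tokenized, TOY_LEXICON)
print(edges.most_common(1))  # [(('metoo', 'movement'), 2)]
print(scores["positive"], scores["negative"])  # 2 2
```

Each counted pair is an edge in the word network, weighted by how often the two words appear next to each other.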
The tweets I collected are of three types: original tweets, retweets, and replies. We can distinguish them by looking at the text field of the data. If a tweet is a reply, its text field starts with “@”; if it is a retweet, it starts with “RT”; and if it is an original tweet, it starts with neither. I extracted only the retweets from my dataset using regular expressions; there were 134 distinct retweets. A retweet also carries a retweet count. I then took the ten most retweeted tweets, plotted them as a line graph, and created a table listing each of them with its retweet count. I also extracted all the hashtags used in my dataset, again using a regular expression, since a hashtag has the form “#someword”; there were 1100 different hashtags in this dataset. I then computed the frequency of each hashtag and plotted the ten most used hashtags.
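The prefix rules and the hashtag pattern described above can be sketched as follows. This is a Python illustration using the standard library, not the R code used in the project. The function names and sample tweets are hypothetical, and folding hashtags to lowercase before counting is an assumption; the source does not say whether “#MeToo” and “#metoo” were counted together.

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#\w+")  # "#" followed by one or more word characters

def tweet_type(text):
    """Classify a tweet by the start of its text field."""
    stripped = text.lstrip()
    if re.match(r"RT\b", stripped):  # "RT" as a whole word at the start
        return "retweet"
    if stripped.startswith("@"):
        return "reply"
    return "original"

def hashtag_counts(tweets):
    """Count hashtags across tweets, ignoring case (an assumption)."""
    counts = Counter()
    for text in tweets:
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(text))
    return counts

# Hypothetical sample tweets, not taken from the dataset.
sample = [
    "RT @user: #MeToo matters",
    "@friend thank you for sharing #metoo",
    "Sharing my story today #MeToo #TimesUp",
]
print([tweet_type(t) for t in sample])        # ['retweet', 'reply', 'original']
print(hashtag_counts(sample).most_common(2))  # [('#metoo', 3), ('#timesup', 1)]
```

Calling `most_common(10)` on the resulting counter gives the kind of top-ten table described above.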
I also created a table that lists the top ten hashtags and their frequencies. This analysis is used to determine what users tweet about alongside #MeToo. The visualizations showed that positive sentiment scored slightly higher than negative sentiment, with a positive score of more than 4500 and a negative score of around 4300. The top retweeted tweet is “RT NeverOnBrand: If a guy says he’s nervous about #MeToo, just remind him that we come down pretty hard on murderers too.” with 40306 retweets. The most used hashtag is, as expected, “#MeToo”, which is used 3187 times. The most used word in the dataset is, again as expected, “metoo”, used around 3200 times.
The word network also gave the result I expected: the two words most often used together are “metoo” and “movement”. Overall, the results matched my expectations. The upside of this project is that I got to explore, learn, and experiment with different visualization tools, work with Twitter data, and see how the #MeToo movement has affected the social media platform. This project can be used for any hashtag analysis; it is not limited to #MeToo. Hashtag analytics is a vital part of social media marketing. Analyzing, tracking, and viewing data helps in assessing the results of a current marketing campaign and in adjusting the campaign according to users’ reactions.
This project was made to explore Twitter analysis, specifically the “#MeToo” hashtag around the globe. While the movement originally began as a space for women to safely voice their experiences of sexual assault, it eventually became co-opted for political purposes. This subversion gave rise to many critics of the movement and consequently provided a space for people to express their negative feelings about #MeToo. I have also explored how powerful Twitter analytics is as a tool. It can be used to analyze what path the social media market is taking.