In their research, Abercrombie and Hovy address the problem of sarcasm in sentiment analysis of Twitter data. Spoken irony or sarcasm is comparatively easy to detect, as it is usually accompanied by intensified or altered tonal stress, hand and other gestures (such as an eye roll), exaggeration, and the use of positive words to describe negative occurrences. Written sarcasm, which lacks the tonal and gestural cues, cannot be detected so easily. In their study, Abercrombie and Hovy list written sarcasm as one of the main causes of errors in Twitter sentiment analysis and in NLP in general.
They also note that the corpus used in this study includes strictly two-way Twitter conversations that were manually annotated with the purpose of “comparing human versus machine learning classification performance under varying amounts of contextual information, and to evaluate machine performance on balanced and unbalanced datasets”.
The team gathered 64 million tweets and obtained approximately 650,000 English-language conversations. From those conversations, they created two datasets: the first contains 2,240 conversations that were manually annotated, and the second was obtained automatically using hashtags.
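To illustrate how a dataset can be obtained automatically using hashtags, here is a minimal sketch in Python. The marker set `SARCASM_TAGS` and the function name `label_by_hashtag` are assumptions made for illustration; the summary above does not state which hashtags the authors actually used, nor whether they removed them from the text.

```python
import re

# Hypothetical marker set: the hashtags actually used in the study are not given here.
SARCASM_TAGS = {"#sarcasm", "#sarcastic"}

HASHTAG_RE = re.compile(r"#\w+")


def label_by_hashtag(tweet):
    """Return (label, cleaned_text); label is 1 if a marker hashtag occurs, else 0."""
    tags = {t.lower() for t in HASHTAG_RE.findall(tweet)}
    label = int(bool(tags & SARCASM_TAGS))
    # Strip only the marker hashtags so a classifier cannot simply read the label.
    cleaned = HASHTAG_RE.sub(
        lambda m: "" if m.group(0).lower() in SARCASM_TAGS else m.group(0),
        tweet,
    )
    # Collapse the whitespace left behind by removed hashtags.
    return label, " ".join(cleaned.split())
```

Labels produced this way are noisy, since they rely on the author of the tweet tagging it, which is one reason to keep a manually annotated dataset alongside the automatic one.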
Since manual annotation can be time-consuming, the team recruited 60 native English speakers as volunteers to annotate 300 randomly chosen conversations. The researchers used inter-rater agreement measures both to evaluate how difficult sarcasm is to recognize and to assess the reliability of the participants.
They found that humans have a hard time recognizing written irony when not given additional contextual information. To reduce the number of incorrectly rated conversations, they removed the outliers, which included the highest- and lowest-performing raters.
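As an illustration of what an inter-rater agreement measure computes, the sketch below implements Fleiss' kappa in plain Python. Choosing Fleiss' kappa is an assumption made here: it is one common measure for more than two raters, and the summary above does not say which statistic the authors relied on. The function name and the input layout are likewise illustrative.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j.

    Assumes every item was rated by the same number of raters (at least two)
    and that the ratings are not all in a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])

    # Per-item agreement: share of rater pairs that agree on the item.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Agreement expected by chance, from the overall category proportions.
    totals = [sum(row[j] for row in counts) for j in range(n_cats)]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)

    return (p_bar - p_e) / (1 - p_e)


# Three raters, two categories (sarcastic, not sarcastic), four conversations.
perfect = [[3, 0], [0, 3], [3, 0], [0, 3]]
split = [[2, 1], [1, 2]]
```

A value of 1 means the raters agree on every item, a value near 0 means agreement is no better than chance, and negative values mean agreement is worse than chance. Low values on a sarcasm task would point to the task itself being hard, to unreliable raters, or to both.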
They trained binary classification models on both the balanced and the unbalanced dataset. More specifically, for the balanced dataset, they used the “standard logistic regression model evaluated via five-fold cross-validation”. For the unbalanced dataset, they evaluated performance using the F1 score and the area under the ROC curve, which are measures better suited to uneven class distributions.
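To see why the F1 score and the area under the ROC curve suit uneven class distributions better than plain accuracy, consider the following self-contained sketch. The function names and the toy labels are assumptions made for illustration and are not taken from the study.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Requires at least one example of each class.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy unbalanced data: one sarcastic conversation among ten.
y_true = [1] + [0] * 9
always_negative = [0] * 10      # a classifier that never predicts sarcasm
constant_scores = [0.0] * 10    # its (uninformative) confidence scores
```

On this toy data, the classifier that never predicts sarcasm reaches an accuracy of 0.9 while its F1 score is 0.0 and its AUC is 0.5, which is exactly the failure that accuracy alone would hide.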
Based on the study results, the research team concluded that both humans and machines have a hard time recognizing written sarcasm. However, they also found that human raters are more likely to improve their performance when provided with more context and additional information. This is understandable, considering that the experience human raters can draw on is much broader than the capabilities of a classifier trained on a small number of conversations. The results of this study indicate that a significant amount of work remains to be done in this area of NLP, but they also reflect impressive efforts towards that goal. As society becomes more digitalized, such models are expected to become increasingly attractive, or even a necessity.