This article first appeared in Data Science Briefings, the DataMiningApps newsletter.
Contributed by: Manon Reusens under the supervision of Prof. dr. Bart Baesens and Prof. dr. Seppe vanden Broucke.
Imagine that you are the CEO of a company and have just launched a new product. Hundreds of thousands of people start buying it. Your customers are sharing their opinions of the product on Twitter, using specific hashtags, and tagging your company in their posts. How would you go about analyzing their overall opinion of your product?
You can start by reading some of the comments, but that will not give you a general overview of what people actually think. Reading all tweets is not an option either, as this does not scale. This is where Natural Language Processing techniques come in. You start by collecting all tweets and then apply techniques that analyze their sentiment automatically.
The research field concerned with detecting people’s sentiments, opinions, or emotions from written text is called Sentiment Analysis. The field has boomed since 2001 (Pang & Lee, 2009), driven by the rise of machine learning techniques and the internet. With the arrival of the internet, an enormous amount of written text became available for analysis, and social media websites such as Twitter provide excellent data for this type of work.
In order to analyze the sentiment of a text, several steps should be taken. First, your data has to be transformed into a format that can serve as input for the model. The first phase of this transformation is preprocessing. Many different preprocessing steps can be applied, such as punctuation and stopword removal. Some classifiers are more sensitive to these steps than others (Jianqiang & Xiaolin, 2017), and the appropriate steps also depend heavily on the source of the text. For example, emoji removal or modification is usually only necessary for informal written texts.
Stemming and lemmatization are two preprocessing steps that reduce the number of distinct words in your vocabulary. The idea behind both is to make the model treat variations of a word as the same word; for example, singular and plural forms can both be mapped to the singular. This shrinks the vocabulary that the model has to consider and shifts the focus to what matters for predicting the text’s sentiment. Stemming strips affixes using heuristic rules and retains only the stem of a word, which is not necessarily an existing word: with the widely used Porter stemmer, for example, ‘studies’ becomes ‘studi’. Lemmatization works similarly, but it requires the result to be an existing word (the lemma), so ‘studies’ would become ‘study’.
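The preprocessing steps described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the pipeline used in the research: the stopword list, the example tweet, and the `naive_stem` helper are all hypothetical, and `naive_stem` is a toy suffix stripper rather than a real stemming algorithm such as Porter's.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "the", "is", "are", "this", "that", "and", "of", "to", "my", "i"}


def naive_stem(token: str) -> str:
    """Toy suffix stripper for illustration only; this is NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so that short words stay intact.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(tweet: str) -> list[str]:
    text = tweet.lower()
    # Drop URLs and @mentions before tokenizing.
    text = re.sub(r"https?://\S+|@\w+", " ", text)
    # Remove ASCII punctuation (this also strips the '#' from hashtags).
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    # Remove stopwords, then reduce each remaining token to its toy stem.
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


example = "@AcmeCorp I loved the new phones, they are working great! https://example.com"
print(preprocess(example))
# ['lov', 'new', 'phone', 'they', 'work', 'great']
```

Note that ‘loved’ is reduced to ‘lov’, which shows that a stem does not have to be an existing word. In practice you would likely rely on an established NLP library for stopword lists, stemming, and lemmatization instead of hand-written rules.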
Once the text has been preprocessed, you should decide which model to apply. In general, there are four types of techniques for sentiment analysis tasks: lexicon-based models, machine learning models, deep learning models, and large pretrained language models.
Lexicon-based techniques typically use a predefined lexicon or dictionary to calculate a sentiment score. This is the simplest technique, and it is intuitive and fully explainable. However, it is language-specific and may be too simple to grasp the subtleties of human language. Sarcastic statements such as ‘Wow what a great product. It only breaks down after two days…’ would be classified as positive because they contain positive words. Nevertheless, as this technique is intuitive and computationally inexpensive, it can be useful for sentiment prediction, depending on the application. Two widely used lexicon-based tools are VADER (Hutto & Gilbert, 2014) and TextBlob (Loria, 2018).
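To make the lexicon-based idea concrete, here is a minimal sketch that sums word polarities from a small dictionary. The lexicon and its scores are made up for illustration; it does not reproduce how VADER or TextBlob compute their scores, which involve much larger lexicons and additional rules.

```python
import re

# Hypothetical mini-lexicon with made-up polarity scores.
LEXICON = {
    "great": 1.0,
    "love": 1.0,
    "good": 0.5,
    "bad": -0.5,
    "awful": -1.0,
    "terrible": -1.0,
}


def lexicon_score(text: str) -> float:
    """Sum the polarity of every known word; unknown words contribute 0."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(token, 0.0) for token in tokens)


def classify(text: str) -> str:
    score = lexicon_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


sarcastic = "Wow what a great product. It only breaks down after two days…"
print(classify(sarcastic))                   # positive
print(classify("This product is terrible"))  # negative
```

The sarcastic tweet is labelled positive because ‘great’ is the only word the lexicon recognizes, which is exactly the weakness described above.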
Next, machine learning models can also be trained for sentiment analysis. However, as these models cannot use plain text as input, the text first has to be converted into vectors using representation techniques such as Bag of Words or TF-IDF. Both techniques are based on counting how often words occur in each text. Bag of Words represents each text by its raw word counts, whereas TF-IDF additionally down-weights words that occur in many documents. To clarify this, we revisit the example of the product launch. The product name probably occurs in many different tweets, yet it carries no real sentiment information. TF-IDF therefore puts less weight on such words that occur in many tweets. Besides these two techniques, pretrained embeddings can also be used as input. The best-known pretrained embeddings are Word2Vec (Mikolov et al., 2013), FastText (Grave et al., 2018), and GloVe (Pennington et al., 2014). These embeddings offer word representations that encode patterns and linguistic regularities, which can often be expressed as linear relations between vectors. Once these vector representations have been computed for the text, machine learning models such as logistic regression, support vector machines, or random forests can be trained on them.
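The difference between raw counts and TF-IDF weights can be illustrated with a short standard-library sketch. The tweets and the product name "zap" are invented, and the weighting uses the plain textbook variant (term frequency times the logarithm of the number of documents divided by the document frequency); libraries often apply smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical, already tokenized tweets about a made-up product called "zap".
docs = [
    ["zap", "is", "great"],
    ["zap", "broke", "after", "two", "days"],
    ["love", "my", "zap", "great", "battery"],
]


def bag_of_words(doc):
    """Raw term counts for a single document."""
    return Counter(doc)


def tf_idf(docs):
    """Textbook TF-IDF: (count / doc length) * log(N / document frequency)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = bag_of_words(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


weights = tf_idf(docs)
print(bag_of_words(docs[0])["zap"])  # raw count in the first tweet: 1
print(weights[0]["zap"])             # appears in every tweet, so the weight is 0.0
print(weights[0]["is"] > weights[0]["great"] > 0)  # rarer words weigh more: True
```

The product name receives a weight of zero because it occurs in every tweet, while ‘great’, which occurs in two of the three tweets, keeps a small positive weight. The resulting dictionaries could then be turned into fixed-length vectors and passed to a classifier such as logistic regression.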
The third category of techniques is deep learning. Models such as Convolutional Neural Networks and Bidirectional LSTMs are often used for sentiment analysis. These models can be trained using an embedding layer that is initialized with random weights or they can be used in combination with pretrained embeddings. This type of model is a black box and has many hyperparameters to tune.
Finally, pretrained language models have transformed the Natural Language Processing landscape. An example of such a model is the well-known BERT model (Devlin et al., 2018). This model is transformer-based and pretrained on enormous amounts of text. Consequently, it has gained some understanding of human language. To use it for sentiment analysis, the model is finetuned on your dataset. As such models are language-specific, many variations exist, such as CamemBERT (Martin et al., 2019) and RobBERT (Delobelle et al., 2020), which are pretrained language models for French and Dutch respectively.
In our research, we compared these four types of models for Flemish Twitter Sentiment Analysis. The goal was to see whether it is really necessary to train a complicated and computationally expensive model like BERT for a relatively simple task like sentiment analysis. When comparing the results of the different models, the lexicon-based models performed noticeably worse than the others. The machine learning and deep learning models showed similar performance, despite the extensive hyperparameter tuning of the latter. Finally, the large pretrained language models clearly outperformed all other models. Throughout the study, we noted that a tradeoff has to be made between computational cost and performance gains. For more information on the findings of our research, we refer to Reusens et al. (2022).
Pang, B., & Lee, L. (2009). Opinion mining and sentiment analysis. Comput. Linguist, 35(2), 311–312.
Jianqiang, Z., & Xiaolin, G. (2017). Comparison research on text pre-processing methods on Twitter sentiment analysis. IEEE Access, 5, 2870–2879.
Hutto, C., & Gilbert, E. (2014). VADER: A parsimonious rule-based model for sentiment analysis of social media text. Proc. Int. AAAI Conf. Web Soc. Media, 8(1), 216–225.
Loria, S. (2018). textblob Documentation. Release 0.15, 2. Available online: https://textblob.readthedocs.io/en/dev.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Grave, E., Bojanowski, P., Gupta, P., Joulin, A., & Mikolov, T. (2018). Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018.
Pennington, J., Socher, R., & Manning, C. D. (2014, October). GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532–1543).
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Martin, L., Muller, B., Suárez, P. J. O., Dupont, Y., Romary, L., de La Clergerie, É. V., … & Sagot, B. (2019). CamemBERT: a tasty French language model. arXiv preprint arXiv:1911.03894.
Delobelle, P., Winters, T., & Berendt, B. (2020). RobBERT: a Dutch RoBERTa-based language model. arXiv preprint arXiv:2001.06286.
Reusens, M., Reusens, M., Callens, M., vanden Broucke, S., & Baesens, B. (2022). Comparison of Different Modeling Techniques for Flemish Twitter Sentiment Analysis. Analytics, 1(2), 117–134.