Automated Text Summarization: A Short Overview of the Task and its Challenges

Posted on February 28, 2016

This article first appeared in Data Science Briefings, the DataMiningApps newsletter.

In an era of ever-growing data, the demand for automated text summarization is understandable, although the first ideas date back to the second half of the last century. Over time, different directions in automated text summarization have appeared and evolved. One distinction we can now draw is between generic and query-based text summarization. As its name suggests, query-based text summarization produces a summary that relates to (or, if you prefer, answers) a particular query. Generic text summarization, on the other hand, provides a summary that does not refer to any particular query but to the main topic of the original text, preserving its substance and leaving out irrelevant details.

Another classification of text summarization is based on the way the summary is constructed: extractive or abstractive. Extractive summarization generates summaries by identifying relevant pieces of text (e.g. sentences, paragraphs) and including them literally (as they are in the original text) in the summary. In contrast, abstractive summarization does not rely on existing formulations but rather paraphrases the original text using, for example, synonyms and different linguistic forms. While extractive summarization (also known in the literature as the “shallow approach”) is generally less complex, both approaches have their own challenges. One particular problem which often arises with extractive summaries is cohesion: as sentences are (usually) extracted and “just” concatenated, it is quite common that there is no smooth transition between the ideas, concepts or topics in different sentences. Overcoming these cohesion problems usually requires further human intervention. For its part, abstractive summarization requires deep linguistic knowledge to maintain proper language constructs when reformulating parts of the text (which also relates to the emerging field of natural language generation).

Yet another classification of summaries is based on the number of documents to be summarized: when there is only one, it is called single-document summarization; otherwise we are dealing with multi-document summarization. When the original texts are written in different languages and the aim is to produce a summary in one (target) language (usually the predominant one), we are talking about multilingual summarization.

Since the main task of summarization is to capture the most important concepts and ideas in the given text, the main challenge in automated summarization is identifying the most relevant (or most informative) parts of the text. And how can we distinguish between the relevant parts of the text to obtain its most relevant fragment(s)?

Depending on the particular summarization type, different approaches apply.

Some of the earliest ideas focused on finding key words, usually based on the frequency with which a term or phrase appears in the text (the intuition being that the more often it occurs, the more relevant it is). Additionally, if a sentence contains identified key words together with other information that does not appear in the preceding sentences, it has “novelty potential” and hence might be judged more relevant than others.
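
The frequency intuition can be sketched in a few lines of Python. This is only a minimal illustration, not a method taken from any particular paper: the function names, the tiny stop-word list and the naive sentence splitter are all assumptions made for the example, and the “novelty potential” idea is not modelled here.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real systems use larger ones).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "it", "on", "are", "was"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    """Lowercased word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP_WORDS]


def frequency_summary(text, n_sentences=1):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    # Term frequencies over the whole text.
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        tokens = content_words(sentence)
        # Average term frequency, so long sentences are not favoured by length alone.
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:n_sentences])
    return [sentences[i] for i in chosen]
```

Calling `frequency_summary(text, n_sentences=2)` on a short text returns the two sentences whose words are, on average, the most frequent in that text. Averaging rather than summing the frequencies is one possible design choice; summing would systematically prefer longer sentences.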

In fact, a significant amount of the existing literature on extractive summarization is based on statistical approaches. With these, sentences (or other parts of the original text) are typically ranked according to different criteria, such as stemmed word frequencies, term frequency-inverse document frequency (TF-IDF) weights, etc. Abstractive summarization, on the other hand, mostly relies on purely linguistic techniques. Nevertheless, there is also a plethora of research papers which combine statistical and linguistic features to rank sentences for potential inclusion in a summary. Many researchers have used machine learning techniques to tackle the text summarization task from a classification perspective. For example, Naïve Bayes, decision trees, maximum entropy classifiers (with weights trained by conjugate gradient descent), rule induction systems, neural networks and hidden Markov models have been used to distinguish between sentences that should and should not be included in a summary. Alongside this variety of classification algorithms, different feature selection and extraction methods have been used, with features ranging from fairly simple (known in the literature as “primitive”) to very complex (“composite”) ones.
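
As a rough illustration of ranking by TF-IDF weights, the sketch below treats each sentence as its own “document” and scores it by the average TF-IDF weight of its words. This is an assumption made for the example rather than a scheme prescribed by the literature mentioned here; the function names are hypothetical, and stemming and stop-word removal are left out for brevity.

```python
import math
import re
from collections import Counter


def words(sentence):
    """Lowercased alphabetic tokens (no stemming, for simplicity)."""
    return re.findall(r"[a-z]+", sentence.lower())


def tfidf_sentence_scores(sentences):
    """Score each sentence by the average TF-IDF weight of its words."""
    tokenized = [words(s) for s in sentences]
    n = len(tokenized)
    # Document frequency: in how many sentences does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # idf = log(N / df); a term present in every sentence contributes 0.
        total = sum(count * math.log(n / df[term]) for term, count in tf.items())
        scores.append(total / len(tokens) if tokens else 0.0)
    return scores
```

Sentences made up of words that are rare elsewhere in the text receive higher scores, while words shared by every sentence contribute nothing. The highest-scoring sentences would then be candidates for the summary.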

One of the most popular recently proposed approaches is based on first clustering sentences and then selecting for the summary either the cluster centroids or as many of the most representative sentences from each cluster as are needed to reach the desired summary length. Since clustering is based on the concept of similarity/dissimilarity, with the aim of increasing intra-cluster similarity and decreasing inter-cluster similarity, this has led to the expansion of a research area devoted to sentence similarity measures. Another group of works has exploited graph-based approaches, where the sentences of the text are the nodes of a graph and edges are constructed (and sometimes also weighted) based on the similarity of the corresponding sentences. Hybrid approaches combining graph-based methods with clustering have also been applied. Yet another stream of research has used latent semantic analysis for text summarization. Here the idea is to find hidden relationships between the words in the sentences in order to detect the most informative sentences.
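
To make the graph-based idea more concrete, here is a small sketch in which sentences are nodes, edges are weighted by word overlap, and a PageRank-style iteration assigns each sentence a centrality score. The choice of Jaccard similarity, the damping factor of 0.85 and the function names are all assumptions made for this illustration; published graph-based summarizers differ in their similarity measures and ranking details.

```python
import re


def word_set(sentence):
    """Set of lowercased alphabetic tokens in a sentence."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def jaccard(a, b):
    """Word-overlap similarity between two token sets, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_sentences(sentences, damping=0.85, iterations=50):
    """PageRank-style centrality scores over a sentence similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    sets = [word_set(s) for s in sentences]
    # Weighted adjacency matrix: edge weight = similarity, no self-loops.
    weights = [
        [jaccard(sets[i], sets[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on a share of its score proportional
            # to the weight of the edge between j and i.
            incoming = sum(
                scores[j] * weights[j][i] / out_sum[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores
```

A sentence that shares words with many other sentences ends up with a high score, while a sentence that shares no words with the rest of the text keeps only the baseline score `(1 - damping) / n`. The top-ranked sentences can then be extracted for the summary.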

Depending on the type of learning, text summarization can be divided into supervised and unsupervised approaches and, as can be concluded from the techniques already mentioned, both are well studied in the literature. The advantage of unsupervised approaches is that they do not require any external source for learning. Supervised approaches, on the other hand, require reliable human-made summaries, which are time-consuming and laborious to produce. And not only that: because judgments are subjective, it is often difficult to reach agreement on the quality of manual summaries. This also brings us to another very challenging topic: the evaluation of summaries, which is quite a broad research area as well.

What also makes the summarization task quite challenging is the complexity of human language and the way people express themselves, especially in written texts. They not only resort to long sentences, using adjectives and adverbs to describe their topic in more detail, but also use relative clauses, appositions etc. While these can provide valuable additional information, they are very rarely the main “pillar” of the information that we want to include in the summary. Additionally, people usually introduce the subject and then tend to refer to it using synonyms or pronouns later in the text. Working out which pronoun stands for which previously introduced term is known in the literature as the “anaphora problem”. Similarly, researchers have also faced the reverse problem (when ambiguous words and explanations referring to a particular term are used in the text before the term itself is introduced), which is known as the “cataphora problem”.

To conclude, text summarization is a feasible yet quite complex task with many challenging aspects to tackle. And, despite the fact that the task itself is not very novel and that a lot of related research has already been done, there are still emerging areas with open questions and research potential.