Keyword extraction with semantic surprisal from LDA and LSA: a comparison of topic models

in publications :: #CLIN

In this study on keyword extraction in German, we compare the performance of Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA) as part of the Topic Context Model (TCM). TCM calculates the information-theoretical measure surprisal as a context-based feature of words. Surprisal is a contextualised information measure based on conditional probabilities. As contexts for the calculation of surprisal, TCM evaluates the topic- and topic-word distributions in the preceding and following environment of words. This environment can be sentences, paragraphs, documents or an entire corpus.

TCM thus answers the question: how surprising is a word for a language processor given the topics within the word's environment?
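The abstract does not spell out the formula, but a contextualised surprisal of this kind can be sketched as the negative log of the word's probability under the topic mixture of its environment, i.e. marginalising the topic-word distributions over the environment's topic distribution. The function names and the tiny probability tables below are illustrative assumptions, not TCM's actual implementation:

```python
import math

def surprisal(word, context_topic_probs, topic_word_probs):
    """Surprisal of `word` given the topic mixture of its environment.

    context_topic_probs: {topic: P(topic | environment)}
    topic_word_probs:    {topic: {word: P(word | topic)}}
    """
    # Marginalise over topics: P(w | env) = sum_t P(w | t) * P(t | env).
    # A small floor avoids log(0) for out-of-vocabulary words.
    p = sum(p_t * topic_word_probs[t].get(word, 1e-12)
            for t, p_t in context_topic_probs.items())
    return -math.log2(p)

# Toy environment: topic 0 dominates, and "netz" is likely under topic 0,
# so "netz" should be less surprising here than "auto".
ctx = {0: 0.7, 1: 0.3}
tw = {0: {"netz": 0.5, "auto": 0.01},
      1: {"netz": 0.1, "auto": 0.2}}
```

A word that fits the environment's dominant topics thus receives a low surprisal value, and a word that does not fit receives a high one, which is what makes surprisal usable as a keyword-extraction feature.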

The point of departure of the study is that surprisal, as a lexical feature, can be useful for NLP applications. In previous studies, this has been shown both for keyword extraction and for determining the expressivity of intensifiers. So far, only LDA has been used for the topic-modelling component of TCM. In this study, TCM draws on the topics identified by LDA and by LSA in order to compare the performance of the two topic models.

This comparison is motivated by the fact that LSA and LDA are similar algorithms in some ways but quite different in others. Both assume that texts are a combination of abstract, invisible topics that can be determined by looking at word frequencies. Their respective approaches to determining these topics, however, differ significantly. While LSA is deterministic and models texts as linear combinations of topic vectors, the generative LDA model is probabilistic and assumes a Bayesian network linking topics to words and then to texts. In particular, LSA is computationally 'nicer' but assumes a linear structure that cannot be taken for granted. Whether or not this structure is actually present is interesting for two reasons: if it is, then, firstly, we have a working linear model of topics as vectors (and hence computational ways to compare the similarity of topics), and, secondly, choosing the well-behaved deterministic LSA over LDA is justified.
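The contrast between the two models can be made concrete with off-the-shelf implementations. A minimal sketch using scikit-learn (the toy documents and component counts are assumptions for illustration; the study itself works on German data): LSA is a truncated SVD of a (typically TF-IDF-weighted) term-document matrix, while LDA fits a probabilistic generative model over raw counts.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

docs = [
    "neural networks learn word representations",
    "stock markets fell sharply today",
    "deep learning models need training data",
    "investors fear rising interest rates",
]

# LSA: deterministic SVD over a TF-IDF matrix; topics are orthogonal vectors,
# and documents are linear combinations of them.
tfidf = TfidfVectorizer().fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0).fit(tfidf)

# LDA: probabilistic model over word counts; each document is a mixture of
# topics, each topic a distribution over words.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
```

Both models expose a topics-by-words matrix (`components_`), but only LSA's rows form a linear basis that supports direct vector-space comparisons between topics; LDA's rows are (unnormalised) word distributions.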

The data resource in this study is a subset of the Heise corpus.

The comparison of LDA and LSA proceeds in two steps: (i) the quality of the words' surprisal values for keyword extraction is compared directly, and (ii) the surprisal values serve as input to a Recurrent Neural Network (RNN). This operationalisation follows previous studies on TCM.
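Step (i) of this comparison amounts to ranking words by surprisal, taking the top-k as keywords, and scoring them against a gold standard. A minimal sketch of that evaluation (function names and the cutoff k are illustrative assumptions):

```python
def extract_top_k(surprisal_scores, k=5):
    """Take the k highest-surprisal words as keyword candidates."""
    return sorted(surprisal_scores, key=surprisal_scores.get, reverse=True)[:k]

def keyword_prf(predicted, gold):
    """Precision, recall and F1 of extracted keywords against a gold set."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Running this once with LDA-based and once with LSA-based surprisal scores yields the directly comparable precision, recall and F1 values reported below; step (ii) instead feeds the raw surprisal values to an RNN classifier.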

We observe that in the direct comparison, LSA slightly outperforms LDA in precision, recall and F1. In contrast, when the words' surprisal values are the input of the RNN, we see no clear winner. Given LSA's greater computational economy and 'niceness' compared to LDA, we conclude that LSA is a suitable building block of TCM for the NLP task of keyword extraction.