Keyword extraction with semantic surprisal from LDA and LSA: a comparison of topic models

in publications :: #CLIN by J. Nathanael Philipp, Max Kölbl, Yuki Kyogoku, Tariq Yousef, Michael Richter

In this study on keyword extraction in German, we compare the performance of Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA) as part of the Topic Context Model (TCM). TCM calculates the information-theoretical measure surprisal as a context-based feature of words. Surprisal is a contextualised information measure based on conditional probabilities. As contexts for the calculation of surprisal, TCM evaluates the topic- and topic-word distributions in the preceding and following environment of words. This environment can be sentences, paragraphs, documents or an entire corpus.

TCM thus answers the question: how surprising is a word for a language processor given the topics within the word's environment?
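As a minimal sketch of this idea, assume a topic model has already produced a topic distribution P(t | context) for a word's environment and a word distribution P(w | t) for each topic. One plausible way to turn these into a surprisal value is to marginalise over topics and take the negative log probability. This marginalisation is an assumption made here for illustration and not necessarily the exact TCM formula; the function name and the toy numbers are hypothetical.

```python
import math

def topic_surprisal(word, topic_given_context, word_given_topic):
    """Surprisal (in bits) of `word` given the topics of its environment.

    Assumes P(w | context) = sum_t P(w | t) * P(t | context). This is an
    illustrative assumption, not the published TCM formula.
    """
    p = sum(
        p_topic * word_given_topic[topic].get(word, 0.0)
        for topic, p_topic in topic_given_context.items()
    )
    if p <= 0.0:
        return math.inf  # the word has zero probability under every topic
    return -math.log2(p)

# Toy distributions, invented for illustration only.
topic_given_context = {"hardware": 0.75, "politics": 0.25}
word_given_topic = {
    "hardware": {"Prozessor": 0.5, "Wahl": 0.0},
    "politics": {"Prozessor": 0.0, "Wahl": 0.5},
}

s_expected = topic_surprisal("Prozessor", topic_given_context, word_given_topic)
s_unexpected = topic_surprisal("Wahl", topic_given_context, word_given_topic)
```

In this toy setting the word that fits the dominant topic receives the lower surprisal, which is the behaviour the question above asks about.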

The point of departure of the study is that surprisal as a lexical feature can be useful for NLP applications. In previous studies, this has been shown both for keyword extraction and for the determination of the expressivity of intensifiers. So far, only LDA has been used for the topic modelling part of TCM. In this study, TCM draws on the topics identified by LDA and LSA in order to compare the performance of the two topic models.

This comparison is motivated by the fact that LSA and LDA are similar algorithms in some ways but quite different in others. Both assume that texts are a combination of abstract, invisible topics that can be determined by looking at word frequencies. Their respective approaches to determining these topics, however, differ significantly. While LSA is deterministic and models texts as linear combinations of topic vectors, the generative LDA model is probabilistic and assumes a Bayesian network linking topics to words and then to texts. In particular, LSA is computationally 'nicer' but assumes a linear structure that cannot be taken for granted. Whether this structure is actually present is interesting for two reasons: first, if it is, it provides a working linear model of topics as vectors (hence offering computational ways to compare the similarity of topics); second, it would justify choosing the well-behaved deterministic LSA over LDA.
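The practical appeal of treating topics as vectors can be illustrated with a small sketch. Assuming each LSA topic is available as a vector of term weights (the vectors below are invented), the similarity of two topics can be measured with cosine similarity. This is one common choice, not a method confirmed by the abstract.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equally long vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical topic vectors over a three-term vocabulary.
topic_a = [3.0, 4.0, 0.0]
topic_b = [6.0, 8.0, 0.0]  # same direction as topic_a
topic_c = [0.0, 0.0, 2.0]  # shares no terms with topic_a

sim_ab = cosine_similarity(topic_a, topic_b)
sim_ac = cosine_similarity(topic_a, topic_c)
```

Topics pointing in the same direction score close to 1, while topics with no shared terms score 0.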

The data resource in this study is a subset of the Heise corpus.

The comparison of LDA and LSA proceeds in two steps: (i) the quality of the words’ surprisal values for keyword extraction is compared directly, and (ii) the surprisal values serve as input to a Recurrent Neural Network (RNN). This operationalisation follows previous studies on TCM.

We observe that in the direct comparison, LSA slightly outperforms LDA in precision, recall and F1 values. In contrast, when the surprisal of words is the input of the RNN, we see no clear winner. Because of LSA’s greater computational economy and 'niceness' compared to LDA, we conclude that LSA is a suitable building block of TCM for keyword extraction.
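For readers unfamiliar with the evaluation measures, the following sketch shows how precision, recall and F1 are conventionally computed for a set of extracted keywords against a gold set. It assumes exact string matching between predicted and gold keywords; the example keywords are made up and do not come from the Heise corpus.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall and F1 of predicted keywords against gold keywords."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical keyword sets for one document.
predicted = ["Prozessor", "Grafikkarte", "Update", "Server"]
gold = ["Prozessor", "Grafikkarte", "Linux"]

precision, recall, f1 = precision_recall_f1(predicted, gold)
```

Here two of four predictions are correct and two of three gold keywords are found, so precision is 0.5 and recall is 2/3.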

Poster


Perplexed by Idioms?

in publications :: #SEMANTICS by J. Nathanael Philipp, Max Kölbl, Erik Daas, Yuki Kyogoku, Michael Richter

The aim of this study is to identify idiomatic expressions in English using the measure perplexity. The assumption is that idiomatic expressions cause higher perplexity than literal expressions given a reference text. Perplexity in our study is calculated based on n-grams of (i) PoS tags, (ii) tokens, and (iii) thematic roles within the boundaries of a sentence. In the setting of our study, we observed that perplexity in none of the contexts (i), (ii) and (iii) manages to distinguish idiomatic expressions from literal ones. We postulate that larger, extra-sentential contexts should be used for the determination of perplexity. In addition, the number of thematic roles in (iii) should be reduced to a smaller number of basic roles in order to avoid a uniform distribution of n-grams.
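To make the measure concrete, here is a minimal sketch of sentence-level perplexity under a bigram model estimated from a reference text. The add-one smoothing, the sentence padding symbols and the toy PoS sequences are assumptions for illustration; the abstract does not specify the n-gram order or the smoothing used in the study.

```python
import math
from collections import Counter

def train_bigram_model(sentences):
    """Count bigrams and their histories over sentences padded with <s>, </s>."""
    histories, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence + ["</s>"]
        histories.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    vocab = {t for s in sentences for t in s} | {"</s>"}
    return histories, bigrams, vocab

def perplexity(sentence, model):
    """Perplexity of one sentence under an add-one smoothed bigram model."""
    histories, bigrams, vocab = model
    tokens = ["<s>"] + sentence + ["</s>"]
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, cur)] + 1) / (histories[prev] + len(vocab))
        log_prob += math.log2(p)
    return 2 ** (-log_prob / (len(tokens) - 1))

# Toy reference text as PoS-tag sequences (invented).
reference = [
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
    ["DET", "NOUN", "VERB", "ADV"],
    ["PRON", "VERB", "DET", "NOUN"],
]
model = train_bigram_model(reference)

ppl_typical = perplexity(["DET", "NOUN", "VERB", "DET", "NOUN"], model)
ppl_unusual = perplexity(["NOUN", "DET", "ADV", "VERB", "PRON"], model)
```

A sequence that follows the patterns of the reference text yields lower perplexity than one that violates them. The same functions work unchanged on tokens or thematic-role labels.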

BibTex | DOI: 10.3233/SSW230006


Are idioms surprising?

in publications :: #Konvens by J. Nathanael Philipp, Michael Richter, Erik Daas, Max Kölbl

This study focuses on the identification of English Idiomatic Expressions (IE) using an information-theoretic model. Only verb-noun constructions are considered. We observe significant differences in semantic surprisal and information density between the IE data and the literals data. Surprisingly, surprisal and information density in the IE data and in a large reference data set do not differ significantly, whereas we observe significant differences between the literals and the large reference data set.

Poster | BibTex | https://aclanthology.org/2023.konvens-main.15/


One Step Beyond: Keyword Extraction in German Utilising Surprisal from Topic Contexts

in publications :: #SAI, #ComputingConference, #Springer by J. Nathanael Philipp, Max Kölbl, Yuki Kyogoku, Tariq Yousef, Michael Richter

This paper describes a study on keyword extraction in German with a model that utilises Shannon information as a lexical feature. Lexical information content was derived from large, extra-sentential semantic contexts of words in the framework of the novel Topic Context Model. We observed that lexical information content increased the performance of a Recurrent Neural Network in keyword extraction, outperforming TextRank and two other models, i.e., Named Entity Recognition and Latent Dirichlet Allocation, used for comparison in this study.

BibTex | DOI: 10.1007/978-3-031-10464-0_53


Putting Users in the Loop: How User Research Can Guide AI Development for a Consumer-Oriented Self-service Portal

in publications :: #HCII by Frank Binder, Jana Diels, Julian Balling, Oliver Albrecht, Robert Sachunsky, J. Nathanael Philipp, Yvonne Scheurer, Marlene Münsch, Markus Otto, Andreas Niekler, Gerhard Heyer, Christian Thorun

This study investigates three challenges for developing machine learning-based self-service web apps for consumers. First, we argue that user research must accompany the development of ML-based products so that they better serve users’ needs at all stages of development. Second, we discuss the data sourcing dilemma in developing consumer-oriented ML-based apps and propose a way to solve it by implementing an interaction design that balances the workload between users and computers according to the ML component’s performance. To dynamically define the role of the user-in-the-loop, we monitor user success and ML performance over time. Finally, we propose a lightweight typology of ML-based systems to assess the generalizability of our findings to other ML use cases.

Our case study uses a newly developed web application that allows consumers to analyze their heating bills for potential energy and cost savings. Based on domain-specific data values extracted from user-provided document images, an assessment of potential savings is derived and reported back to the user.

BibTex | DOI: 10.1007/978-3-031-05434-1_1


Beyond the Failure of Direct-Matching in Keyword Evaluation: A Sketch of a Graph Based Solution

in publications :: #Frontiers by Max Kölbl, Yuki Kyogoku, J. Nathanael Philipp, Michael Richter, Clemens Rietdorf, Tariq Yousef

The starting point of this paper is the observation that methods based on the direct match of keywords are inadequate because they do not consider the cognitive ability of concept formation and abstraction. We argue that keyword evaluation needs to be based on a semantic model of language capturing the semantic relatedness of words to satisfy the claim of the human-like ability of concept formation and abstraction and achieve better evaluation results. Evaluation of keywords is difficult since semantic informedness is required for this purpose. This model must be capable of identifying semantic relationships such as synonymy, hypernymy, hyponymy, and location-based abstraction. For example, when gathering texts from online sources, one usually finds a few keywords with each text. Still, these keyword sets are neither complete for the text nor are they in themselves closed, i.e., in most cases, the keywords are a random subset of all possible keywords and not that informative w.r.t. the complete keyword set. Therefore all algorithms based on this cannot achieve good evaluation results and provide good/better keywords or even a complete keyword set for a text. As a solution, we propose a word graph that captures all these semantic relationships for a given language. The problem with the hyponym/hyperonym relationship is that, unlike synonyms, it is not bidirectional. Thus the space of keyword sets requires a metric that is non-symmetric, in other words, a quasi-metric. We sketch such a metric that works on our graph. Since it is nearly impossible to obtain such a complete word graph for a language, we propose for the keyword task a simpler graph based on the base text upon which the keyword sets should be evaluated. This reduction is usually sufficient for evaluating keyword sets.
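The notion of a quasi-metric can be illustrated with a small sketch. The shortest directed path length in a graph is zero from a node to itself and obeys the triangle inequality, but it need not be symmetric. The code below is not the metric sketched in the paper; it only assumes a hypothetical word graph in which synonymy is stored as edges in both directions and hyponym-to-hypernym links as edges in one direction.

```python
import math
from collections import deque

def directed_distance(graph, source, target):
    """Length of the shortest directed path from source to target (BFS)."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return math.inf  # target cannot be reached from source

# Hypothetical word graph: hyponym -> hypernym, synonyms in both directions.
graph = {
    "poodle": ["dog"],
    "dog": ["animal", "hound"],
    "hound": ["dog"],
    "animal": [],
}

d_up = directed_distance(graph, "poodle", "animal")
d_down = directed_distance(graph, "animal", "poodle")
d_syn = directed_distance(graph, "dog", "hound")
```

Moving from "poodle" up to "animal" takes two steps, while the reverse direction is unreachable in this toy graph, so the distance is asymmetric. The synonym pair stays symmetric.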

BibTex | DOI: 10.3389/frai.2022.801564


Uniform Density in Linguistic Information derived from Dependency Structures

in publications :: #ICAART, #NLPinAI by Michael Richter, Maria Bardají i Farré, Max Kölbl, Yuki Kyogoku, J. Nathanael Philipp, Tariq Yousef, Gerhard Heyer, Nikolaus P. Himmelmann

This pilot study addresses the question of whether the Uniform Information Density principle (UID) can be proved for eight typologically diverse languages. The lexical information of words is derived from dependency structures both in sentences preceding the sentences and within the sentence in which the target word occurs. Dependency structures are a realisation of extra-sentential contexts for deriving information as formulated in the surprisal model. Only subject, object and oblique, i.e., the level directly below the verbal root node, were considered. UID says that in natural language, the variance of information and information jumps from word to word should be small so as not to make the processing of a linguistic message an insurmountable hurdle. We observed cross-linguistically different information distributions but an almost identical UID, which provides evidence for the UID hypothesis and suggests that dependency structures can function as proxies for extra-sentential contexts. However, for the dependency structures chosen as contexts, the information distributions in some languages were not statistically significantly different from distributions from a random corpus. This might be an effect of too low complexity of our model's dependency structures, so lower hierarchical levels (e.g. phrases) should be considered.
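The two quantities named in the UID definition above, the variance of information and the jumps from word to word, can be sketched as follows. The population variance and the mean absolute difference between neighbouring words are assumptions about how these quantities might be operationalised; the per-word information values are invented.

```python
from statistics import pvariance

def uid_measures(information):
    """Variance of per-word information and mean absolute word-to-word jump."""
    if len(information) < 2:
        raise ValueError("need information values for at least two words")
    variance = pvariance(information)
    jumps = [abs(b - a) for a, b in zip(information, information[1:])]
    mean_jump = sum(jumps) / len(jumps)
    return variance, mean_jump

# Hypothetical per-word information values (in bits) for two sentences.
uniform = [3.0, 3.5, 3.0, 3.5]
bursty = [0.5, 6.0, 0.5, 6.0]

var_uniform, jump_uniform = uid_measures(uniform)
var_bursty, jump_bursty = uid_measures(bursty)
```

Both toy sentences carry the same average information, but the first distributes it far more evenly, which is what UID predicts for natural language.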

BibTex | DOI: 10.5220/0000155600003116


The Semantic Level of Shannon Information: Are Highly Informative Words Good Keywords? A Study on German

in publications :: #NLPinAI, #SCI, #Springer by Max Kölbl, Yuki Kyogoku, J. Nathanael Philipp, Michael Richter, Clemens Rietdorf, Tariq Yousef

This paper reports the results of a study on automatic keyword extraction in German. We employed in general two types of methods: (A) unsupervised, based on information theory, i.e., (i) a bigram model, (ii) a probabilistic parser model, and (iii) a novel model which considers topics within the discourse of the target word for the calculation of its information content, and (B) supervised, employing a recurrent neural network (RNN). As baselines, we employed TextRank and the TF-IDF ranking function. The topic model (A)(iii) clearly outperformed all remaining models, even TextRank and TF-IDF. In contrast, the RNN performed poorly. We take the results as first evidence that (i) information content can be employed for keyword extraction tasks and thus has a clear correspondence to the semantics of natural language, and (ii) that – as a cognitive principle – the information content of words is determined from extra-sentential contexts, i.e., from the discourse of words.
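As an illustration of the TF-IDF baseline mentioned above, the following sketch ranks the terms of one document by TF-IDF. The specific variant (term frequency normalised by document length, unsmoothed natural-log IDF) is an assumption, since the abstract does not state which weighting was used; the tiny corpus is invented.

```python
import math
from collections import Counter

def tfidf_keywords(documents, doc_index, top_k=3):
    """Return the top_k terms of one document ranked by TF-IDF."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document
    target = documents[doc_index]
    term_freq = Counter(target)
    scores = {
        term: (count / len(target)) * math.log(n_docs / doc_freq[term])
        for term, count in term_freq.items()
    }
    # Sort by descending score; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:top_k]

# Toy tokenised corpus (invented).
documents = [
    ["der", "prozessor", "ist", "schnell", "prozessor"],
    ["der", "server", "ist", "langsam"],
    ["der", "server", "ist", "schnell"],
]

keywords = tfidf_keywords(documents, doc_index=0, top_k=2)
```

Terms that occur in every document, such as "der" and "ist", receive a score of zero, so the document-specific term "prozessor" is ranked first.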

BibTex | DOI: 10.1007/978-3-030-63787-3_5


Keyword extraction in German: Information-theory vs. deep learning

in publications :: #ICAART, #NLPinAI by Max Kölbl, Yuki Kyogoku, J. Nathanael Philipp, Michael Richter, Tariq Yousef

This paper reports the results of a study on automatic keyword extraction in German. We employed in general two types of methods: (A) an unsupervised method based on information theory (Shannon, 1948), for which we employed (i) a bigram model, (ii) a probabilistic parser model (Hale, 2001) and (iii) an innovative model which utilises topics as extra-sentential contexts for the calculation of the information content of the words, and (B) a supervised method employing a recurrent neural network (RNN). As baselines, we employed TextRank and the TF-IDF ranking function. The topic model (A)(iii) clearly outperformed all remaining models, even TextRank and TF-IDF. In contrast, the RNN performed poorly. We take the results as first evidence that (i) information content can be employed for keyword extraction tasks and thus has a clear correspondence to the semantics of natural language, and (ii) that - as a cognitive principle - the information content of words is determined from extra-sentential contexts, that is to say, from the discourse of words.

BibTex | DOI: 10.5220/0009374704590464


Unsupervised pretraining for text classification using siamese transfer learning

in publications :: #CLEF by Maximilian Bryan, J. Nathanael Philipp

When training neural networks, huge amounts of training data typically lead to better results. When only a small amount of training data is available, it has proven useful to initialize a network with pretrained layers. For NLP tasks, networks are usually only given pretrained word embeddings; the rest of the network is not pretrained, since pretraining recurrent networks for NLP tasks is difficult. In this article, we present a siamese architecture for pretraining recurrent networks on textual data. The network has to map pairs of sentences onto a vector representation. When a sentence pair appears coherently in our corpus, the vector representations should be similar; if not, the representations should be dissimilar. After pretraining that network, we extend it and train it on a smaller dataset so that it can classify textual data. We show that, for a text classification task with only a small amount of training data, this kind of pretraining yields better results than no pretraining or using only pretrained embeddings. For evaluation, we use the bots and gender profiling dataset provided by PAN 2019.
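The training objective described above, similar vectors for coherent sentence pairs and dissimilar vectors otherwise, can be sketched with a common contrastive loss formulation. Whether the article uses this particular loss is not stated in the abstract, so the function, the Euclidean distance and the margin value are assumptions for illustration; the sentence vectors stand in for the outputs of the two siamese branches.

```python
import math

def contrastive_loss(vec_a, vec_b, coherent, margin=1.0):
    """Contrastive loss on the Euclidean distance of two sentence vectors.

    Coherent pairs are penalised for being far apart; incoherent pairs are
    penalised only while they are closer than `margin`.
    """
    distance = math.dist(vec_a, vec_b)
    if coherent:
        return distance ** 2
    return max(0.0, margin - distance) ** 2

# Hypothetical sentence vectors.
loss_coherent_close = contrastive_loss([0.0, 0.0], [0.3, 0.4], coherent=True)
loss_incoherent_close = contrastive_loss([0.0, 0.0], [0.3, 0.4], coherent=False)
loss_incoherent_far = contrastive_loss([0.0, 0.0], [3.0, 4.0], coherent=False)
```

An incoherent pair that is already further apart than the margin contributes no loss, so training effort concentrates on pairs that are still confusable.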

BibTex | Paper | Poster | PAN