Can information theory unravel the subtext in a Chekhovian short story?

in publications :: #SlavicNLP #ACL

In this study, we investigate whether information-theoretic measures such as surprisal can quantify the elusive notion of subtext in a Chekhovian short story. Specifically, we conduct a series of experiments in which we enrich the original text once with (different types of) meaningful glosses and once with fake glosses. For each of the texts thus created, we calculate surprisal values in two ways: with a bag-of-words model and with a large language model. We observe enrichment effects that depend on the method, but no interpretable subtext effect.
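
As a rough illustration of the two methods, here is a minimal Python sketch of both surprisal computations. The unigram frequency table and the choice of GPT-2 are assumptions for illustration; the models actually used in the study are not specified here.

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def bow_surprisal(tokens, counts, total):
    """Context-free surprisal under a bag-of-words (unigram) model:
    -log2 p(w), assuming every token occurs in the frequency table."""
    return [-math.log2(counts[t] / total) for t in tokens]

def llm_surprisal(text, model_name="gpt2"):
    """Per-token surprisal under a causal language model (GPT-2 here
    as a stand-in; the study's actual LLM may differ)."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Log-probability of each token given its left context.
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_lp = logprobs[torch.arange(ids.size(1) - 1), ids[0, 1:]]
    return (-token_lp / math.log(2)).tolist()  # nats -> bits
```

Glossed and un-glossed versions of the text can then be compared token by token or by mean surprisal.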

BibTeX | ACL Anthology


The role of information in modeling German intensifiers

in publications :: #LanguageSciencePress

In this study, context-free and context-dependent information measures are applied to a new corpus of tweets and blog posts. The aim is to account for the expressive meaning of intensifiers and to characterize the variability of the available intensifying items. We find that context-free and context-dependent information measures are highly correlated and account for the distribution of intensifiers in the data, giving credence to the notion that intensifiers form a common word class, even across syntactic and semantic differences.

Both information measures show that stacked intensifiers tend to be ordered from least to most expressive within a phrase, i.e., their information content tends to increase. We explain this ordering with the Uniform Information Density Hypothesis: the first, less expressive intensifier is used to introduce the phrase, ease the reader’s processing load, and smooth the information flow.
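
For concreteness, the following sketch contrasts the two measures on a stacked-intensifier phrase; the counts are invented for illustration and do not come from the study’s corpus.

```python
import math
from collections import Counter

# Invented counts standing in for the tweet/blog corpus.
unigrams = Counter({"sehr": 900, "extrem": 120, "gut": 2000})
bigrams = Counter({("sehr", "extrem"): 40, ("extrem", "gut"): 60})
N = sum(unigrams.values())

def context_free(word):
    """Surprisal from raw frequency: -log2 p(w)."""
    return -math.log2(unigrams[word] / N)

def context_dependent(prev, word):
    """Surprisal conditioned on the preceding word: -log2 p(w | prev)."""
    return -math.log2(bigrams[(prev, word)] / unigrams[prev])

# The UID-style prediction for "sehr extrem gut": the second
# intensifier is rarer, hence more informative than the first.
assert context_free("extrem") > context_free("sehr")
```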

BibTeX | DOI: 10.5281/zenodo.133837


Forschungsportal BACH

in publications :: #TUC

The aim of this project is to build a comprehensive online repository that provides access to all surviving documents of the Bach family of musicians – the most influential family dynasty in music history – from the late 16th to the early 19th century. For the first time in the history of Bach research, the written documents preserved in libraries, archives and private collections will be digitally recorded, indexed, processed, annotated and made available via an online portal. Interlinking with existing digital Bach projects is planned.

Poster


Forschungsportal BACH

in publications :: #DHDL

The aim of the “Forschungsportal Bach” project is to build a comprehensive online repository that provides access to all surviving documents of the Bach family of musicians – the most influential family dynasty in music history – from the late 16th to the early 19th century. For the first time in the history of Bach research, the material preserved in libraries, archives and private collections will be digitally recorded, indexed, processed, annotated and made accessible via an online portal. The digitised documents are automatically transcribed with “Transkribus”, annotated with the help of the TEI Publisher, and finally published as a digital edition. Among other things, the works mentioned in the documents as well as watermarks of selected archival items are linked to the “Bach digital” portal.

Poster | https://fdhl.info/dhdl-2023-projekte/#bach | https://nbn-resolving.org/urn:nbn:de:bsz:15-qucosa2-917725


Keyword extraction with semantic surprisal from LDA and LSA: a comparison of topic models

in publications :: #CLIN

In this study on keyword extraction in German, we compare the performance of Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA) as part of the Topic Context Model (TCM). TCM calculates the information-theoretic measure surprisal as a context-based feature of words; surprisal is a contextualised information measure based on conditional probabilities. As contexts for the calculation of surprisal, TCM evaluates the topic and topic-word distributions in the preceding and following environment of a word. This environment can be sentences, paragraphs, documents or an entire corpus.

TCM thus answers the question: how surprising is a word for a language processor given the topics within the word's environment?
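
A minimal sketch of this kind of topic-conditioned surprisal, using gensim’s LdaModel on toy documents. The marginalisation P(w | environment) = Σₖ P(w | topicₖ) · P(topicₖ | environment) is our reading of the TCM description, not necessarily the authors’ exact implementation.

```python
import math

from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy documents standing in for a word's environment (sentence,
# paragraph, document, or corpus).
docs = [
    ["topic", "model", "keyword", "extraction"],
    ["keyword", "surprisal", "topic", "context"],
    ["information", "measure", "surprisal", "word"],
]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=0)

def tcm_surprisal(word, environment):
    """s(w) = -log2 sum_k P(w | topic_k) * P(topic_k | environment)."""
    theta = lda.get_document_topics(dictionary.doc2bow(environment),
                                    minimum_probability=0.0)
    phi = lda.get_topics()                 # topic-word probability matrix
    wid = dictionary.token2id[word]
    p = sum(prob * phi[k, wid] for k, prob in theta)
    return -math.log2(p)

print(tcm_surprisal("keyword", ["topic", "context", "surprisal"]))
```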

The point of departure of the study is that surprisal as a lexical feature can be useful for NLP applications. In previous studies, this has been shown both for keyword extraction and for determining the expressivity of intensifiers. So far, only LDA has been used for the topic-modelling part of TCM. In this study, TCM draws on the topics identified by LDA and by LSA in order to compare the performance of the two topic models.

This comparison is motivated by the fact that LSA and LDA are similar algorithms in some ways but quite different in others. Both assume that texts are combinations of abstract, latent topics that can be determined by looking at word frequencies. Their respective approaches to determining these topics, however, differ significantly. While LSA is deterministic and models texts as linear combinations of topic vectors, the generative LDA model is probabilistic and assumes a Bayesian network linking topics to words and then to texts. In particular, LSA is computationally 'nicer' but assumes a linear structure that cannot be taken for granted. Whether this structure is actually present is interesting for two reasons: if it is, LSA provides a working linear model of topics as vectors (and hence computational means to compare the similarity of topics), and choosing the well-behaved, deterministic LSA over LDA is justified.
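
The contrast can be seen directly in gensim, where both models expose document-topic representations of a different nature; the toy corpus below is a placeholder for the study’s data.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, LsiModel

docs = [["topic", "model", "keyword"],
        ["keyword", "surprisal", "context"],
        ["information", "surprisal", "word"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# Probabilistic, generative: topic weights are probabilities.
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=0)
print(lda.get_document_topics(corpus[0]))

# Deterministic SVD: documents as linear combinations of topic
# vectors; coordinates are real-valued and may be negative.
lsa = LsiModel(corpus, id2word=dictionary, num_topics=2)
print(lsa[corpus[0]])
```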

The data resource in this study is a subset of the Heise corpus.

The comparison of LDA and LSA proceeds in two steps: (i) the quality of the words’ surprisal values for keyword extraction is compared directly, and (ii) the surprisal values are used as input to a Recurrent Neural Network (RNN). This operationalisation follows previous studies on TCM.
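
The abstract does not spell out the network, so the following PyTorch sketch only illustrates one plausible wiring of step (ii): a GRU tagger that reads one surprisal value per token and predicts keyword vs. non-keyword. The architecture and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class SurprisalTagger(nn.Module):
    """Hypothetical RNN over per-token surprisal features."""
    def __init__(self, hidden=32):
        super().__init__()
        self.rnn = nn.GRU(input_size=1, hidden_size=hidden, batch_first=True)
        self.out = nn.Linear(hidden, 2)   # keyword vs. non-keyword

    def forward(self, surprisal):         # (batch, seq_len, 1)
        h, _ = self.rnn(surprisal)
        return self.out(h)                # (batch, seq_len, 2) logits

model = SurprisalTagger()
toy = torch.tensor([[[4.2], [9.7], [6.1]]])  # toy surprisal values in bits
print(model(toy).shape)                      # torch.Size([1, 3, 2])
```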

We observe that in the direct comparison, LSA slightly outperforms LDA in precision, recall and F1. In contrast, when the words’ surprisal values are the input to the RNN, there is no clear winner. Given LSA’s greater computational economy and 'niceness' compared to LDA, we conclude that LSA is a suitable building block of TCM for keyword extraction in NLP.

Poster