Forschungsportal BACH

in publications :: #DHDL

The aim of the "Forschungsportal Bach" project is the development of a comprehensive online repository providing access to all surviving documents of the Bach family of musicians (the most influential family dynasty in the history of music) from the late 16th to the early 19th century. For the first time in the history of Bach research, the material preserved in libraries, archives and private collections is being digitally captured, indexed, processed, annotated and made accessible via an online portal. The digitised documents are transcribed automatically with "Transkribus", annotated with the help of the TEI Publisher and finally published as a digital edition. Among other things, the works mentioned in the documents as well as the watermarks of selected archival items are linked to the "Bach digital" portal.

Poster | https://fdhl.info/dhdl-2023-projekte/#bach | https://nbn-resolving.org/urn:nbn:de:bsz:15-qucosa2-917725


Keyword extraction with semantic surprisal from LDA and LSA: a comparison of topic models

in publications :: #CLIN

In this study on keyword extraction in German, we compare the performance of Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA) as part of the Topic Context Model (TCM). TCM calculates the information-theoretic measure of surprisal as a context-based feature of words. Surprisal is a contextualised information measure based on conditional probabilities. As contexts for the calculation of surprisal, TCM evaluates the topic and topic-word distributions in the environment preceding and following a word. This environment can comprise sentences, paragraphs, documents or an entire corpus.

TCM thus answers the question: how surprising is a word for a language processor given the topics within the word's environment?
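In its standard form, surprisal is the negative log-probability of a word given its context. Here is a minimal sketch; the marginalisation over topics below is our reading of the description above, not necessarily the exact TCM formulation:

```python
import numpy as np

def surprisal(p_topic_given_context: np.ndarray,
              p_word_given_topic: np.ndarray) -> float:
    """s(w) = -log2 P(w | context), where P(w | context) is obtained by
    marginalising over topics: sum_t P(w | t) * P(t | context)."""
    p_word = float(np.dot(p_topic_given_context, p_word_given_topic))
    return -np.log2(p_word)

# Toy example with three topics in the word's environment:
p_topics = np.array([0.7, 0.2, 0.1])     # P(t | context), e.g. from LDA
p_word_t = np.array([0.01, 0.30, 0.05])  # P(w | t) for the target word
print(round(surprisal(p_topics, p_word_t), 2))  # ~3.8 bits
```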

The point of departure of the study is that surprisal as a lexical feature can be useful for NLP applications. In previous studies, this has been shown both for keyword extraction and for determining the expressivity of intensifiers. So far, only LDA has been used for the topic-modelling part of TCM. In this study, TCM draws on the topics identified by LDA and LSA in order to compare the performance of the two topic models.

This comparison is motivated by the fact that LSA and LDA are similar algorithms in some ways but quite different in others. Both assume that texts are combinations of abstract, hidden topics that can be determined by looking at word frequencies. Their respective approaches to determining these topics, however, differ significantly. While LSA is deterministic and models texts as linear combinations of topic vectors, the generative LDA model is probabilistic and assumes a Bayesian network linking topics to words and then to texts. In particular, LSA is computationally 'nicer' but assumes a linear structure that cannot be taken for granted. Whether or not this structure is actually present is interesting for two reasons: if it is, then, firstly, LSA provides a working linear model of topics as vectors (and hence computational means to compare the similarity of topics), and secondly, choosing the well-behaved, deterministic LSA over LDA is justified.
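To make the contrast concrete, here is a minimal sketch with gensim (our choice of library; the study does not name its tooling) fitting both models on the same toy bag-of-words corpus:

```python
from gensim import corpora, models

# Toy corpus; the study itself uses a subset of the Heise corpus.
texts = [["server", "security", "update"],
         ["music", "concert", "review"],
         ["security", "patch", "server"]]
dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]

# LSA: deterministic SVD of the term-document matrix; documents are
# linear combinations of topic vectors.
lsa = models.LsiModel(bow, id2word=dictionary, num_topics=2)

# LDA: probabilistic generative model; documents are mixtures of topics,
# topics are distributions over words.
lda = models.LdaModel(bow, id2word=dictionary, num_topics=2, random_state=0)

print(lsa.print_topics())
print(lda.print_topics())
```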

The data resource in this study is a subset of the Heise corpus.

The comparison of LDA and LSA proceeds in two steps: (i) the quality of the words’ surprisal values for keyword extraction is compared directly, and (ii) the surprisal values serve as input to a Recurrent Neural Network (RNN). This operationalisation follows previous studies on TCM.
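A minimal sketch of step (ii), assuming PyTorch and treating per-word surprisal as the only input feature (the study's actual architecture and feature set may differ):

```python
import torch
import torch.nn as nn

class KeywordTagger(nn.Module):
    """Illustrative RNN that tags each word as keyword / non-keyword
    from its surprisal value alone (hyperparameters are assumptions)."""
    def __init__(self, hidden_size: int = 32):
        super().__init__()
        self.rnn = nn.RNN(input_size=1, hidden_size=hidden_size,
                          batch_first=True)
        self.out = nn.Linear(hidden_size, 1)

    def forward(self, surprisal_seq: torch.Tensor) -> torch.Tensor:
        # surprisal_seq: (batch, seq_len, 1) surprisal value per word
        hidden, _ = self.rnn(surprisal_seq)
        return torch.sigmoid(self.out(hidden))  # keyword probability per word

tagger = KeywordTagger()
scores = tagger(torch.rand(1, 10, 1))  # one toy sentence of 10 words
print(scores.shape)                    # torch.Size([1, 10, 1])
```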

We observe that in the direct comparison, LSA slightly outperforms LDA in precision, recall and F1 values. In contrast, when the surprisal of words is the input of the RNN, we see no clear winner. Because of LSA's greater computational economy and 'niceness' compared to LDA, we conclude that LSA is a suitable building block of TCM for keyword extraction in NLP.

Poster


Perplexed by Idioms?

in publications :: #SEMANTICS

The aim of this study is to identify idiomatic expressions in English using the perplexity measure. The assumption is that idiomatic expressions cause higher perplexity than literal expressions given a reference text. Perplexity in our study is calculated based on n-grams of (i) PoS tags, (ii) tokens, and (iii) thematic roles within the boundaries of a sentence. In the setting of our study, we observed that perplexity in none of the contexts (i), (ii) and (iii) manages to distinguish idiomatic expressions from literal ones. We postulate that larger, extra-sentential contexts should be used for the determination of perplexity. In addition, the number of thematic roles in (iii) should be reduced to a smaller set of basic roles in order to avoid a uniform distribution of n-grams.
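As a hedged illustration of the measure itself (the study's n-gram order, smoothing and reference text are not given here), token-level bigram perplexity against a reference can be computed like this:

```python
import math
from collections import Counter

def perplexity(sentence: list[str], reference: list[str]) -> float:
    """Bigram perplexity of a sentence against reference counts,
    with add-one smoothing (an illustrative choice)."""
    bigrams = Counter(zip(reference, reference[1:]))
    unigrams = Counter(reference)
    vocab = len(set(reference))
    pairs = list(zip(sentence, sentence[1:]))
    log_prob = 0.0
    for w1, w2 in pairs:
        p = (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab)
        log_prob += math.log2(p)
    return 2 ** (-log_prob / len(pairs))

reference = "he kicked the ball into the goal".split()
print(perplexity("he kicked the bucket".split(), reference))  # idiomatic
print(perplexity("he kicked the ball".split(), reference))    # literal
```

Under the study's working assumption, the idiomatic sentence should yield the higher value.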

BibTeX | DOI: 10.3233/SSW230006


Are idioms surprising?

in publications :: #Konvens

This study focuses on the identification of English Idiomatic Expressions (IE) using an information-theoretic model. The focus is on verb-noun constructions only. We notice significant differences in semantic surprisal and information density between the IE data and the literal data. Surprisingly, surprisal and information density in the IE data and in a large reference data set do not differ significantly, while, in contrast, we observe significant differences between the literal data and the large reference data set.
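As a terminological note (the paper's exact operationalisation may differ), information density is commonly taken as the mean surprisal over the words of a construction:

```latex
s(w_i) = -\log_2 P(w_i \mid \mathrm{context}), \qquad
\mathrm{ID}(w_1 \ldots w_n) = \frac{1}{n} \sum_{i=1}^{n} s(w_i)
```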

Poster | BibTeX | https://aclanthology.org/2023.konvens-main.15/


One Step Beyond: Keyword Extraction in German Utilising Surprisal from Topic Contexts

in publications :: #ComputingConference, #Springer

This paper describes a study on keyword extraction in German with a model that utilises Shannon information as a lexical feature. Lexical information content was derived from large, extra-sentential semantic contexts of words in the framework of the novel Topic Context Model. We observed that lexical information content increased the performance of a Recurrent Neural Network in keyword extraction, outperforming TextRank and two other models, i.e., Named Entity Recognition and Latent Dirichlet Allocation, which were used for comparison in this study.
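For context, a TextRank baseline can be reproduced with the summa package (our choice of implementation; the paper does not name its tooling):

```python
from summa import keywords  # pip install summa

# Hedged sketch of a TextRank baseline only; the study's own TextRank setup
# and German-language preprocessing are not specified here.
text = ("Keyword extraction identifies the most relevant words in a text. "
        "Topic models and surprisal provide additional lexical features "
        "for keyword extraction.")
print(keywords.keywords(text))
```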

BibTeX | DOI: 10.1007/978-3-031-10464-0_53