Beyond the Platform: APIs as part of the Workflow

in publications :: #TUC

The project 'Forschungsportal BACH' builds a comprehensive online repository that provides access to all surviving documents of the Bach family of musicians - the most influential family dynasty in music history - from the late 16th to the early 19th century. For the first time in the history of Bach research, the materials preserved in libraries, archives and private collections will be digitally recorded, indexed, processed, annotated and made available via an online portal.

To this end, we digitise not only the documents that will appear in the digital edition but, in most cases, the whole bundle or at least large sections of it. The pages are transcribed automatically with Transkribus and made available online through a full-text search.

The talk will cover our workflow from the image to the finished transcription, highlighting the different modes of interaction with Transkribus by both humans and computers.


Intensifying intensifiers: Variations in expressivity

in publications :: #DGfS by Roeland Van Hout, J. Nathanael Philipp, Michael Richter

Intensifiers define a larger word class that has fuzzy boundaries. Members of this class modify the degree of a gradable expression (typically a property, expressed by an adjective or adjective phrase) by strengthening that property. This word class contains a large set of (partly) paradigmatic words, originating from other word classes and various semantic origins, and it is marked by volatility. From a communicative point of view, intensifiers increase the expressivity of an adjective in an attributive or predicative function. Intensifiers do not only inform, they also perform. They function as expressive, emotive words that shift the attention to special, relevant spots in the information stream, as a kind of focus particle. We investigate three recurring, innovative strategies in German CMC from Twitter conversations, and additionally from blog and YouTube comments:

(1) a. Stacking, by combining multiple intensifiers: so mega gut ‘so mega good’
    b. Lengthening, mostly realised through repeated graphemes: soooooo cool ‘so cool’, seeeeehr schön ‘very nice / beautiful’
    c. (Modified) duplication, by repeating the same intensifier: sehr sehr gut ‘very very good’
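The three strategies in (1) can be illustrated with a minimal detection sketch. It is not the method used in the study: the intensifier lexicon below is a hypothetical toy list, and the rule "three or more repeated characters count as lengthening" is an assumption made only for this example.

```python
import re

# Hypothetical toy lexicon; the study's actual intensifier inventory is not given here.
INTENSIFIERS = {"so", "sehr", "mega", "total", "echt"}


def squeeze(token: str) -> str:
    """Collapse runs of three or more identical characters to one character."""
    return re.sub(r"(.)\1{2,}", r"\1", token)


def classify(phrase: str) -> set[str]:
    """Return the intensification strategies found in a whitespace-tokenised phrase."""
    tokens = phrase.lower().split()
    normalised = [squeeze(t) for t in tokens]
    labels = set()

    # Lengthening: the token shrinks under squeeze() and its short form is an intensifier.
    for raw, norm in zip(tokens, normalised):
        if raw != norm and norm in INTENSIFIERS:
            labels.add("lengthening")

    # Adjacent intensifiers: same item twice is duplication, different items is stacking.
    for first, second in zip(normalised, normalised[1:]):
        if first in INTENSIFIERS and second in INTENSIFIERS:
            labels.add("duplication" if first == second else "stacking")

    return labels


examples = ["so mega gut", "soooooo cool", "seeeeehr schön", "sehr sehr gut"]
results = {phrase: classify(phrase) for phrase in examples}
```

The squeeze rule would also shorten the rare words that legitimately contain a tripled letter, so a real pipeline would need a dictionary check at that step.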

The frequency distribution of intensifiers shows considerable variation, with a few frequent and many infrequent words. The referential meaning of intensifiers is relatively fixed, but their expressive meaning varies, depending in part on their surprisal or information value, which implies that more frequent intensifiers have lower surprisal values. We hypothesize that the above strategies, with their different form-function patterns in intensifying intensifiers, can be traced back to information values and the way these values are distributed over phrases and utterances. We apply alternative, more local information indices (information values from the Topic Context Model) instead of global Shannon entropy and information density measures to get a grip both on the tension between variation, convention and creativity in expressing intensification and on the pragmatic and structuring role of information values. At the same time, we claim a more established theoretical status for the concept of information in human communication.

References: • Philipp, J. N., Richter, M., Scheffler, T. & R. van Hout (2022). The role of information in modeling German intensifiers. In R. Lemke, L. Schäfer & I. Reich (eds.), Information structure and information theory, 117–145. Language Science Press.

Abstract | Presentation


Semantic Information: A Difference that Makes a Difference

in publications :: #LREC by J. Nathanael Philipp, Max Kölbl, Michael Richter

In the framework of distributional semantics, we introduce a novel notion and operationalisation of semantic information for natural language. The key idea is as follows: a linguistic sign carries semantic information about a document if it reduces the amount of surprisal for a language processor. We consider two systems, an informed one and an uninformed one, and describe semantic information in their terms. Processing effort is quantified via surprisal where the informed system is 'aware' of the linguistic sign and the uninformed one is not. On an English fairy tale corpus and on two German news corpora, we successfully tested the prediction that if the linguistic sign in question carries pre-information through semantic surprisal, the current level of surprisal for the language processor is reduced. The conclusion is that the degree of semantic information results from the degree of semantic prior information.
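The comparison between the informed and the uninformed system can be made concrete with a small arithmetic sketch. It assumes the standard definition of surprisal as the negative base-2 logarithm of a probability; the function names and the probabilities are hypothetical, and the paper's actual distributional operationalisation is not reproduced here.

```python
import math


def surprisal(probability: float) -> float:
    """Surprisal in bits: -log2(p)."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


def surprisal_reduction(p_uninformed: float, p_informed: float) -> float:
    """How many bits of surprisal the informed system saves over the uninformed one.

    A positive value means that knowing the linguistic sign made the
    document less surprising, i.e. the sign carried information about it.
    """
    return surprisal(p_uninformed) - surprisal(p_informed)


# Hypothetical probabilities assigned to the same document by both systems.
reduction = surprisal_reduction(p_uninformed=0.125, p_informed=0.5)
```

With these toy values the uninformed system faces 3 bits of surprisal and the informed system 1 bit, so the sign accounts for a reduction of 2 bits.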

Bibtex | PDF | LREC Proceedings


Vom Issue zur Edition: Der Workflow im Forschungsportal BACH

in publications :: #DHDL by Magdalena Auenmüller, Christine Blanken, Wolfram Enßlin, Nikolas Georgiades, Christiane Hausmann, Bernd Koska, Michael Maul, J. Nathanael Philipp, Nadine Quenouille, Till Reininghaus, Gregor Richter, Sophie Weber, Peter Wollny, Markus Zepf

As part of the long-term project 'Forschungsportal BACH', a structured workflow was developed that systematically maps and coordinates the many work steps from archival research to the publication of a digital edition. This workflow integrates automated processes, manual processing steps and quality-assurance review phases in order to ensure consistently high editorial quality in an interdisciplinary team with varying levels of expertise.

For the technical implementation, a GitLab-based infrastructure was established in which the individual work steps are modelled as issues and linked to the tools used in each case (Forschungsportal, Darktable, Transkribus, TEI Publisher annotation tool). The underlying GitLab repository serves not only as the central coordination platform but also as a redundant data backup system.
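How work steps might be modelled as GitLab issues can be sketched as follows. This is an illustration under stated assumptions, not the project's setup: the step names, labels, host, project ID and document identifier are all hypothetical. Only the endpoint (`POST /api/v4/projects/:id/issues`) and its `title`, `description` and `labels` parameters come from the GitLab REST API. The requests are built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical work steps, loosely based on the tools named in the abstract.
WORKFLOW_STEPS = [
    ("Image processing", "Darktable"),
    ("Automatic transcription", "Transkribus"),
    ("Annotation", "TEI Publisher"),
]


def build_issue_request(base_url, project_id, token, document, step, tool):
    """Build (but do not send) a request that would create one issue per work step."""
    payload = {
        "title": f"{document}: {step}",
        "description": f"Tool: {tool}",
        "labels": f"workflow,{tool}",  # GitLab accepts labels as a comma-separated string
    }
    return Request(
        f"{base_url}/api/v4/projects/{project_id}/issues",
        data=urlencode(payload).encode("utf-8"),
        headers={"PRIVATE-TOKEN": token},
        method="POST",
    )


# Placeholder host, project ID, token and document identifier.
requests_to_send = [
    build_issue_request("https://gitlab.example.org", 42, "<token>", "Doc-0001", step, tool)
    for step, tool in WORKFLOW_STEPS
]
```

Passing each request to `urllib.request.urlopen` would create the issue. Keeping construction separate from sending makes the mapping from work step to issue easy to inspect first.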

Poster | https://fdhl.info/dhdl25-projektuebersicht/#bach


Information theory unravels the subtext in Chekhov

in publications :: #DHQ by J. Nathanael Philipp, Michael Richter, Olav Mueller-Reichau, Matthias Irmer

The present paper reports a pilot study to approach the subtext in Chekhov's short story "Ward No. 6" by means of information theory. The original text is enriched by glosses by which we intend to make explicit the implicit knowledge conveyed by the original text, i.e., the subtext. We generated several text variants with meaningful enrichments and one fake variant that served as a baseline. We could not observe that semantic surprisal as a feature of words and uniform information density have a subtext effect throughout all text variants. However, it turned out that kurtosis and skewness are suitable classification criteria to distinguish meaningful enrichments from fake enrichments.
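Skewness and kurtosis of a surprisal distribution can be computed as in the following sketch. It uses the population (moment-based) definitions and reports excess kurtosis; whether the paper uses these exact variants is an assumption, and the surprisal values are made up for illustration.

```python
def central_moment(values, order):
    """Population central moment of the given order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Moment-based skewness: m3 / m2^1.5 (0 for a symmetric distribution)."""
    m2 = central_moment(values, 2)
    if m2 == 0:
        raise ValueError("skewness is undefined for constant data")
    return central_moment(values, 3) / m2 ** 1.5


def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2^2 - 3 (0 for a normal distribution)."""
    m2 = central_moment(values, 2)
    if m2 == 0:
        raise ValueError("kurtosis is undefined for constant data")
    return central_moment(values, 4) / m2 ** 2 - 3.0


# Made-up per-word surprisal values for one text variant (illustration only).
variant_surprisal = [1.0, 1.0, 1.0, 1.0, 10.0]
features = (skewness(variant_surprisal), excess_kurtosis(variant_surprisal))
```

Each text variant then yields one (skewness, kurtosis) pair, which can serve as a two-dimensional feature for separating variants.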

Bibtex | Digital Humanities Quarterly


Surprisal in Action: A Comparative Study of LDA and LSA for Keyword Extraction

in publications :: #Konvens by J. Nathanael Philipp, Max Kölbl, Michael Richter

This study compares two methods of topic detection, Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA), by using them in conjunction with the Topic Context Model (TCM) on the task of keyword extraction. The surprisal values that TCM outputs based on LDA and LSA are compared both directly and as inputs to a Recurrent Neural Network (RNN). While LSA slightly outperforms LDA in the direct comparison, the two perform on a par when an RNN is trained with surprisal values. In general, semantic surprisal as input to an RNN improves its performance.

BibTex | https://aclanthology.org/2025.konvens-1.1/


Can information theory unravel the subtext in a Chekhovian short story?

in publications :: #SlavicNLP, #ACL by J. Nathanael Philipp, Olav Mueller-Reichau, Matthias Irmer, Michael Richter, Max Kölbl

In this study, we investigate whether information-theoretic measures such as surprisal can quantify the elusive notion of subtext in a Chekhovian short story. Specifically, we conduct a series of experiments for which we enrich the original text once with (different types of) meaningful glosses and once with fake glosses. For the different texts thus created, we calculate the surprisal values using two methods: using either a bag-of-words model or a large language model. We observe enrichment effects depending on the method, but no interpretable subtext effect.

Bibtex | Poster | ACL Anthology


The role of information in modeling German intensifiers

in publications :: #LanguageSciencePress by J. Nathanael Philipp, Michael Richter, Tatjana Scheffler, Roeland van Hout

In this study, context-free and context-dependent information measures are applied to a new corpus of tweets and blog posts. The aim is to account for the expressive meaning and characterize the variability of available intensifying items. It comes to light that context-free and context-dependent information measures are highly correlated and account for the distribution of intensifiers in the data, giving credence to the notion that intensifiers form a common word class, even across syntactic and semantic differences.

Both information measures show that stacked intensifiers tend to be ordered from least to most expressive within a phrase, i.e., the information tends to increase. We explain this fact using the Uniform Information Density Hypothesis: The first, less expressive intensifier is used to introduce the phrase, ease the reader’s processing load, and smooth the information flow.
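The ordering claim can be checked mechanically, as in the sketch below. It assumes that the context-free measure is unigram surprisal, i.e. the negative base-2 logarithm of relative frequency; this is an assumption for the example, as are the frequency counts, which are invented and not taken from the corpus.

```python
import math


def unigram_surprisal(counts):
    """Context-free information per word: -log2 of its relative frequency."""
    total = sum(counts.values())
    return {word: -math.log2(count / total) for word, count in counts.items()}


def information_increases(stack, information):
    """True if information is non-decreasing from the first to the last intensifier."""
    values = [information[word] for word in stack]
    return all(earlier <= later for earlier, later in zip(values, values[1:]))


# Invented frequency counts for three intensifiers (illustration only).
toy_counts = {"so": 800, "sehr": 150, "mega": 50}
information = unigram_surprisal(toy_counts)

ordered = information_increases(["so", "mega"], information)
reversed_order = information_increases(["mega", "so"], information)
```

Under these toy counts the frequent item so carries little information and the rare item mega carries more, so so mega follows the least-to-most-expressive order and mega so does not.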

BibTex | DOI: 10.5281/zenodo.133837


Forschungsportal BACH

in publications :: #TUC by Peter Wollny, Christine Blanken, Wolfram Enßlin, Nikolas Georgiades, Christiane Hausmann, Bernd Koska, Michael Maul, J. Nathanael Philipp, Nadine Quenouille, Till Reininghaus, Gregor Richter, Markus Zepf

The aim of this project is to build a comprehensive online repository that provides access to all surviving documents of the Bach family of musicians – the most influential family dynasty in music history – from the late 16th to the early 19th century. For the first time in the history of Bach research, the written documents preserved in libraries, archives and private collections will be digitally recorded, indexed, processed, annotated and made available via an online portal. Interlinking with existing digital Bach projects is planned.

Poster


Forschungsportal BACH

in publications :: #DHDL by Peter Wollny, Christine Blanken, Wolfram Enßlin, Nikolas Georgiades, Christiane Hausmann, Bernd Koska, Michael Maul, J. Nathanael Philipp, Nadine Quenouille, Till Reininghaus, Gregor Richter, Markus Zepf

The aim of the project 'Forschungsportal Bach' is to build a comprehensive online repository that provides access to all surviving documents of the Bach family of musicians – the most influential family dynasty in music history – from the late 16th to the early 19th century. For the first time in the history of Bach research, the material preserved in libraries, archives and private collections will be digitally recorded, indexed, processed, annotated and made available via an online portal. The digitised documents are transcribed automatically with Transkribus, annotated with the help of TEI Publisher and finally published as a digital edition. In the process, the works mentioned in the documents and the watermarks of selected archival items, among other things, are linked to the portal 'Bach digital'.

Poster | https://fdhl.info/dhdl-2023-projekte/#bach | https://nbn-resolving.org/urn:nbn:de:bsz:15-qucosa2-917725