Beyond the Failure of Direct-Matching in Keyword Evaluation: A Sketch of a Graph Based Solution

posted 1 month, 4 weeks ago in #Frontiers, #publications

The starting point of this paper is the observation that methods based on the direct match of keywords are inadequate because they do not consider the cognitive ability of concept formation and abstraction. We argue that keyword evaluation needs to be based on a semantic model of language capturing the semantic relatedness of words to satisfy the claim of the human-like ability of concept formation and abstraction and achieve better evaluation results. This model must be capable of identifying semantic relationships such as synonymy, hypernymy, hyponymy, and location-based abstraction. Evaluation of keywords is difficult since semantic informedness is required for this purpose. For example, when gathering texts from online sources, one usually finds a few keywords with each text. Still, these keyword sets are neither complete for the text nor are they in themselves closed, i.e., in most cases, the keywords are a random subset of all possible keywords and not very informative with respect to the complete keyword set. Therefore, algorithms that rely on these keyword sets cannot achieve good evaluation results, nor can they provide better keywords or a complete keyword set for a text. As a solution, we propose a word graph that captures all these semantic relationships for a given language. The problem with the hyponym/hypernym relationship is that, unlike synonymy, it is not bidirectional. Thus, the space of keyword sets requires a metric that is non-symmetric, in other words, a quasi-metric. We sketch such a metric that works on our graph. Since it is nearly impossible to obtain such a complete word graph for a language, we propose for the keyword task a simpler graph based on the base text upon which the keyword sets should be evaluated. This reduction is usually sufficient for evaluating keyword sets.
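To make the idea of a non-symmetric distance on a word graph concrete, here is a minimal sketch. It is not the metric from the paper: the edge costs, the function names and the directed Hausdorff-style aggregation over keyword sets are all assumptions chosen for illustration. Synonym edges cost the same in both directions, while moving up to a hypernym is assumed to be cheaper than moving down to a hyponym, so shortest-path distances become asymmetric.

```python
import heapq

# Hypothetical edge costs (assumptions, not taken from the paper):
# synonyms are cheap in both directions, abstraction is cheaper than
# specialisation.
SYNONYM_COST = 0.1
UP_COST = 0.5    # hyponym -> hypernym
DOWN_COST = 2.0  # hypernym -> hyponym


def build_graph(synonyms, hypernym_of):
    """Return a directed, weighted adjacency dict {word: {neighbour: cost}}."""
    graph = {}

    def add_edge(src, dst, cost):
        graph.setdefault(src, {})
        graph.setdefault(dst, {})
        graph[src][dst] = min(cost, graph[src].get(dst, float("inf")))

    for a, b in synonyms:
        add_edge(a, b, SYNONYM_COST)
        add_edge(b, a, SYNONYM_COST)
    for hyponym, hypernym in hypernym_of:
        add_edge(hyponym, hypernym, UP_COST)
        add_edge(hypernym, hyponym, DOWN_COST)
    return graph


def word_distance(graph, source, target):
    """Cost of the cheapest directed path from source to target (Dijkstra)."""
    if source == target:
        return 0.0
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            return cost
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                heapq.heappush(queue, (new_cost, neighbour))
    return float("inf")  # target not reachable from source


def keyword_set_distance(graph, source_set, target_set):
    """Directed Hausdorff-style distance from source_set to target_set.

    For every source keyword take the cost to its nearest target keyword,
    then return the worst of these costs. Swapping the arguments generally
    gives a different value.
    """
    return max(
        min(word_distance(graph, s, t) for t in target_set)
        for s in source_set
    )


graph = build_graph(
    synonyms=[("car", "automobile")],
    hypernym_of=[("car", "vehicle"), ("bicycle", "vehicle")],
)
forward = keyword_set_distance(graph, {"automobile"}, {"vehicle"})
backward = keyword_set_distance(graph, {"vehicle"}, {"automobile"})
```

With these assumed costs, `forward` is smaller than `backward`: replacing a specific keyword by its hypernym is penalised less than the reverse. Because all edge costs are positive, the word-level shortest-path distance satisfies the triangle inequality without being symmetric.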

BibTex | DOI: 10.3389/frai.2022.801564

Uniform Density in Linguistic Information derived from Dependency Structures

posted 3 months, 2 weeks ago in #ICAART, #NLPinAI, #publications

This pilot study addresses the question of whether the Uniform Information Density principle (UID) can be proved for eight typologically diverse languages. The lexical information of words is derived from dependency structures both in the preceding sentences and within the sentence in which the target word occurs. Dependency structures are a realisation of extra-sentential contexts for deriving information as formulated in the surprisal model. Only subject, object and oblique, i.e., the level directly below the verbal root node, were considered. UID says that in natural language, the variance of information and the information jumps from word to word should be small so as not to make the processing of a linguistic message an insurmountable hurdle. We observed cross-linguistically different information distributions but an almost identical UID, which provides evidence for the UID hypothesis and supports the assumption that dependency structures can function as proxies for extra-sentential contexts. However, for the dependency structures chosen as contexts, the information distributions in some languages were not statistically significantly different from distributions from a random corpus. This might be an effect of the low complexity of our model's dependency structures, so lower hierarchical levels (e.g. phrases) should be considered.
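The two quantities that UID is concerned with, the variance of information and the size of the jumps from word to word, can be illustrated with a short sketch. The function names are hypothetical and the probabilities are simply passed in as numbers; in the study they would be estimated from dependency contexts, which is not shown here. Using the population variance and the mean squared difference between neighbouring words is an assumption made for this example.

```python
import math


def surprisal(probability):
    """Shannon information of an event in bits."""
    return -math.log2(probability)


def uid_measures(word_probabilities):
    """Return (variance, mean squared jump) of the per-word information."""
    info = [surprisal(p) for p in word_probabilities]
    mean = sum(info) / len(info)
    # Population variance of the information values across the sentence.
    variance = sum((x - mean) ** 2 for x in info) / len(info)
    # Squared change of information between neighbouring words.
    jumps = [(b - a) ** 2 for a, b in zip(info, info[1:])]
    mean_squared_jump = sum(jumps) / len(jumps) if jumps else 0.0
    return variance, mean_squared_jump


# Hypothetical contextual probabilities for a four-word sentence.
variance, mean_squared_jump = uid_measures([0.5, 0.25, 0.25, 0.5])
```

Smaller values of both measures correspond to a more uniform distribution of information across the sentence.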

BibTex | DOI: 10.5220/0000155600003116

The Semantic Level of Shannon Information: Are Highly Informative Words Good Keywords? A Study on German

posted 1 year, 1 month ago in #NLPinAI, #publications, #Springer

This paper reports the results of a study on automatic keyword extraction in German. We employed two general types of methods: (A) unsupervised, based on information theory, i.e., (i) a bigram model, (ii) a probabilistic parser model, and (iii) a novel model which considers topics within the discourse of the target word for the calculation of its information content, and (B) supervised, employing a recurrent neural network (RNN). As baselines, we employed TextRank and the TF-IDF ranking function. The topic model (A)(iii) clearly outperformed all remaining models, even TextRank and TF-IDF. In contrast, the RNN performed poorly. We take the results as first evidence that (i) information content can be employed for keyword extraction tasks and thus has a clear correspondence to the semantics of natural language, and (ii) that, as a cognitive principle, the information content of words is determined from extra-sentential contexts, i.e., from the discourse of words.
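As an illustration of the simplest of the information-theoretic models, the bigram model (A)(i), the following sketch ranks words by their average bigram information. It is not the implementation used in the study: the function names, the add-one smoothing and the toy corpus are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def bigram_information(sentences):
    """Average bigram information (in bits) per word type.

    Add-one smoothing is an assumption made for this sketch.
    """
    vocabulary = {word for sentence in sentences for word in sentence}
    contexts = Counter()
    bigrams = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))

    totals = defaultdict(float)
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence
        for prev, word in zip(tokens, tokens[1:]):
            prob = (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocabulary))
            totals[word] += -math.log2(prob)
            counts[word] += 1
    return {word: totals[word] / counts[word] for word in totals}


def top_keywords(sentences, k=3):
    """Return the k words with the highest average information."""
    info = bigram_information(sentences)
    return sorted(info, key=lambda w: (-info[w], w))[:k]


# Hypothetical toy corpus of tokenised German sentences.
corpus = [["der", "hund", "bellt"], ["der", "hund", "ruht"]]
keywords = top_keywords(corpus, k=2)
```

In the toy corpus, the predictable words "der" and "hund" carry less information than the sentence-final verbs, so the verbs are ranked first.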

BibTex | DOI: 10.1007/978-3-030-63787-3_5

Keyword extraction in German: Information-theory vs. deep learning

posted 2 years, 2 months ago in #ICAART, #NLPinAI, #publications

This paper reports the results of a study on automatic keyword extraction in German. We employed two general types of methods: (A) unsupervised methods based on information theory (Shannon, 1948), namely (i) a bigram model, (ii) a probabilistic parser model (Hale, 2001), and (iii) an innovative model which utilises topics as extra-sentential contexts for the calculation of the information content of the words, and (B) a supervised method employing a recurrent neural network (RNN). As baselines, we employed TextRank and the TF-IDF ranking function. The topic model (A)(iii) clearly outperformed all remaining models, even TextRank and TF-IDF. In contrast, the RNN performed poorly. We take the results as first evidence that (i) information content can be employed for keyword extraction tasks and thus has a clear correspondence to the semantics of natural language, and (ii) that, as a cognitive principle, the information content of words is determined from extra-sentential contexts, that is to say, from the discourse of words.
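For readers unfamiliar with the TF-IDF baseline mentioned above, here is a minimal sketch of keyword ranking with one common TF-IDF variant (length-normalised term frequency times the logarithm of the inverse document frequency). The exact variant used in the study is not stated in the abstract, so the formula, the function name and the toy documents are assumptions.

```python
import math
from collections import Counter


def tf_idf_keywords(documents, doc_index, k=3):
    """Rank the words of one tokenised document by TF-IDF."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    target = documents[doc_index]
    counts = Counter(target)
    scores = {
        word: (count / len(target)) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Hypothetical tokenised documents.
docs = [["netz", "netz", "bild"], ["bild", "text"], ["text", "bild"]]
ranked = tf_idf_keywords(docs, doc_index=0, k=2)
```

A word that occurs in every document, such as "bild" here, receives a score of zero and is ranked below words that are specific to the target document.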

BibTex | DOI: 10.5220/0009374704590464

Convolutional Attention on Images for Locating Macular Edema

posted 2 years, 3 months ago in #MIUA, #publications

Neural networks have become a standard for classifying images. However, by their very nature, their internal data representation remains opaque. To address this problem, attention mechanisms have recently been introduced. They help to highlight regions in the input data that have been used for a network’s classification decision. This article presents two attention architectures for the classification of medical images. First, we explain a simple architecture which creates one attention map that is used for all classes. Second, we introduce an architecture that creates an attention map for each class. This is done by creating two U-nets, one for attention and one for classification, and then multiplying these two maps together. We show that our architectures match the baseline of standard convolutional classifiers while at the same time increasing their explainability.
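The combination step of the second architecture, multiplying a per-class attention map with a per-class classification map, can be sketched without any deep learning framework. The two U-nets themselves are not shown; their outputs are represented by plain nested lists. Summing the product over all pixels to obtain a class score is an assumption made for this sketch, as are the function and parameter names.

```python
def classify_with_attention(class_maps, attention_maps):
    """Combine per-class classification maps with per-class attention maps.

    Both arguments are lists with one 2-D list per class, all of the same
    shape. Attention values are assumed to lie in [0, 1].
    Returns (scores, index of the predicted class).
    """
    scores = []
    for class_map, attention in zip(class_maps, attention_maps):
        weighted = 0.0
        for class_row, attention_row in zip(class_map, attention):
            for value, weight in zip(class_row, attention_row):
                # Pixel-wise product of classification and attention.
                weighted += value * weight
        scores.append(weighted)
    predicted = max(range(len(scores)), key=scores.__getitem__)
    return scores, predicted


# Hypothetical 2x2 outputs for two classes.
class_maps = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]
attention_maps = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]
scores, predicted = classify_with_attention(class_maps, attention_maps)
```

Because each class has its own attention map, the map of the predicted class can be inspected directly to see which image regions contributed to the decision.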

BibTex | DOI: 10.1007/978-3-030-39343-4_33

Unsupervised pretraining for text classification using siamese transfer learning

posted 2 years, 8 months ago in #CLEF, #publications

When training neural networks, huge amounts of training data typically lead to better results. When only a small amount of training data is available, it has proven useful to initialize a network with pretrained layers. For NLP tasks, networks are usually only given pretrained word embeddings; the rest of the network is not pretrained, since pretraining recurrent networks for NLP tasks is difficult. In this article, we present a siamese architecture for pretraining recurrent networks on textual data. The network has to map pairs of sentences onto a vector representation. When a sentence pair appears coherently in our corpus, the vector representations should be similar; if not, the representations should be dissimilar. After having pretrained that network, we enhance it and train it on a smaller dataset in order to have it classify textual data. We show that, for a text classification task with only a small amount of training data, this kind of pretraining yields better results than doing no pretraining or using only pretrained embeddings. For evaluation, we use the bots and gender profiling dataset provided by PAN 2019.
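The pretraining objective described above, similar vectors for coherent sentence pairs and dissimilar vectors otherwise, can be illustrated with a small loss function. The abstract does not state which similarity measure or loss was used, so the cosine similarity, the margin and all names below are assumptions; the recurrent encoder that would produce the vectors is not shown.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def siamese_pair_loss(vec_a, vec_b, coherent, margin=0.0):
    """Loss for one sentence pair encoded by the shared (siamese) network.

    Coherent pairs are pulled together (loss = 1 - similarity); other
    pairs are pushed apart until their similarity is at most the margin.
    """
    similarity = cosine_similarity(vec_a, vec_b)
    if coherent:
        return 1.0 - similarity
    return max(0.0, similarity - margin)


# Hypothetical sentence vectors.
loss_coherent = siamese_pair_loss([1.0, 0.0], [1.0, 0.0], coherent=True)
loss_random = siamese_pair_loss([1.0, 0.0], [1.0, 0.0], coherent=False)
```

Identical vectors give zero loss for a coherent pair and the maximum loss for a non-coherent pair, which is the behaviour the pretraining task asks for.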

BibTex | Paper | Poster | PAN

Transforming scholarship in the archives through handwritten text recognition: Transkribus as a case study

posted 2 years, 8 months ago in #Journal of Documentation, #publications

Purpose: An overview of the current use of handwritten text recognition (HTR) on archival manuscript material, as provided by the EU H2020 funded Transkribus platform. It explains HTR, demonstrates Transkribus, gives examples of use cases, highlights the effect HTR may have on scholarship, and evidences this turning point of the advanced use of digitised heritage content. The paper aims to discuss these issues.

Design/methodology/approach: This paper adopts a case study approach, using the development and delivery of the one openly available HTR platform for manuscript material.

Findings: Transkribus has demonstrated that HTR is now a useable technology that can be employed in conjunction with mass digitisation to generate accurate transcripts of archival material. Use cases are demonstrated, and a cooperative model is suggested as a way to ensure sustainability and scaling of the platform. However, funding and resourcing issues are identified.

Research limitations/implications: The paper presents results from projects; further user studies could be undertaken involving interviews, surveys, etc.

Practical implications: Only HTR provided via Transkribus is covered; however, this is the only publicly available platform for HTR on individual collections of historical documents at the time of writing, and it represents the current state of the art in this field.

Social implications: The increased access to information contained within historical texts has the potential to be transformational for both institutions and individuals.

Originality/value: This is the first published overview of how HTR is used by a wide archival studies community, reporting and showcasing current application of handwriting technology in the cultural heritage sector.

BibTex | DOI: 10.1108/JD-07-2018-0114

Evaluation of CNN architectures for text detection in historical maps

posted 3 years ago in #DATeCH, #publications

We evaluate different densely connected fully convolutional neural network architectures to find and extract text from maps. This is a necessary preprocessing step before OCR can be performed. In order to locate the text, we train a neural network to classify whether a given input is text or not. Our main focus is on the output level: either classifying text or no text for the whole input, or predicting the text position pixel-wise by outputting a mask. Acquiring enough training data, especially for pixel-wise prediction, is quite a time-consuming task, so we investigate a method to generate artificial training data. We compare three training scenarios: first, training with images from historical maps, which form quite a small dataset; second, adding artificially generated images; and third, training only with the artificially generated data.

Full Abstract | Poster

Objdetect: Eine Plattform zur Visualisierung von Vorhersagen objekterkennender neuronaler Netze

posted 3 years, 5 months ago in #DHDL, #publications

In the course of various projects and qualification theses, a number of neural networks have been and are being created. These networks serve either to classify images or to locate objects in images. To better compare these neural networks with one another with respect to different architectures, hyperparameters, or training data, and to make them available to a wider group of people, a website was developed on which one can upload trained neural networks and images, start an object detection run, and visualise the results in a uniform way. The website is intended both for developers of neural networks who want to compare their networks and for researchers who want to run object detection on their images.


Generierung von Trainingsdaten für die Handschrifterkennung aus TEI annotierten Dokumenten – Ein Erfahrungsbericht aus dem EU-Projekt READ

posted 3 years, 7 months ago in #INF-DH, #publications

Training machine learning methods for handwriting recognition requires text data with corresponding images. The text data is often available in TEI format, which offers diverse options for marking up textual and semantic phenomena; moreover, custom tags or markup types can be introduced. This contribution describes a parameterisable tool developed in the EU project READ that can handle different markup styles in TEI and delivers page-based text files, which can be used for matching text to image data (text-to-image) and thus serve to prepare training data for handwriting recognition models. The examples and applications shown all come from projects that made their data available to READ.

BibTex | DOI: 10.18420/infdh2018-11

Masterthesis: Objekterkennung mit Hilfe von Convolutional Neural Networks am Beispiel ägyptischer Hieroglyphen

posted 5 years, 2 months ago in #publications

In this thesis, different architectures of Convolutional Neural Networks (CNN) and their suitability for object recognition were investigated, using Egyptian hieroglyphs as an example.

First, basic principles of artificial neural networks and the components of CNNs, such as convolutional layers, are introduced and explained, followed by explanations of the data sets used and the associated difficulties. We then present libraries for the concrete implementation and use of artificial neural networks and describe the sequence of the evaluation of object recognition.

The CNNs were trained and evaluated with different numbers of classes and the associated number of images. The experiments are divided according to the three training methods used. For the first, the CNNs were trained with the help of autoencoders. For the second, the CNNs were trained block-wise, and for the third, deeper network architectures were investigated. Various network architectures described in the literature, such as Residual Networks (ResNet) and Densely Connected Convolutional Networks, were implemented and evaluated.

The results of the experiments show that it is possible to train CNNs with up to 69 convolutional layers, to classify about 6500 different Egyptian hieroglyphs and finally to carry out object recognition with very good results. The best object detection result, 0.92362, was achieved for a CNN with 6465 classes and 13 convolutional layers.

BibTex | PDF

Bachelorthesis: Multi-Label Klassifikation am Beispiel sozialwissenschaftlicher Texte

posted 5 years, 3 months ago in #publications

In this thesis, different machine learning algorithms are evaluated for the task of multi-label classification. The evaluation is done with the binary classifiers naive Bayes and support vector machine (SVM) and the multi-class classifier supervised latent Dirichlet allocation (SLDA). To enable naive Bayes and SVM to do multi-label classification, the RAkEL transformation is used, and for SLDA a topic model multi-label learner is developed and used.
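The idea behind the RAkEL (random k-labelsets) transformation can be sketched as follows: several random subsets of k labels are drawn, a single-label classifier is trained on each subset by treating every combination of labels as one class (label powerset), and the classifiers vote on each label at prediction time. This is a simplified illustration, not the thesis implementation. The toy `OverlapClassifier` stands in for naive Bayes or an SVM, subsets may repeat, and all names and example data are hypothetical.

```python
import random
from collections import Counter


class OverlapClassifier:
    """Toy base learner: returns the class of the training text that
    shares the most tokens with the input."""

    def fit(self, texts, classes):
        self.examples = [(set(text), cls) for text, cls in zip(texts, classes)]
        return self

    def predict(self, text):
        tokens = set(text)
        return max(self.examples, key=lambda ex: len(tokens & ex[0]))[1]


def train_rakel(texts, label_sets, all_labels, k=2, n_models=3, seed=0):
    """Train one label-powerset classifier per random k-labelset."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        subset = frozenset(rng.sample(sorted(all_labels), k))
        # Label powerset: each combination of subset labels is one class.
        classes = [frozenset(set(labels) & subset) for labels in label_sets]
        models.append((subset, OverlapClassifier().fit(texts, classes)))
    return models


def predict_rakel(models, text, threshold=0.5):
    """Average the votes per label and keep labels above the threshold."""
    votes = Counter()
    seen = Counter()
    for subset, model in models:
        predicted = model.predict(text)
        for label in subset:
            seen[label] += 1
            votes[label] += int(label in predicted)
    return {label for label in seen if votes[label] / seen[label] > threshold}


# Hypothetical training data: tokenised texts with their label sets.
texts = [["tax", "budget"], ["football", "goal"]]
label_sets = [{"economy", "politics"}, {"sport"}]
models = train_rakel(texts, label_sets, {"economy", "politics", "sport"}, k=3, n_models=2)
labels = predict_rakel(models, ["goal", "match"])
```

In the usage example, k equals the total number of labels, so every classifier sees the full label set; with smaller k, each classifier only handles a few labels and the vote combines them.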

The Reuters-21578 corpus is used. Since not all texts have labels and not all labels occur with sufficient frequency, a selection of texts was used. Two corpora were created and used for classification.

The classification results show that the best results are achieved with SVM. Naive Bayes and SLDA give very similar results, but SLDA has a very long runtime.

BibTex | PDF