This pilot study addresses the question of whether the Uniform Information Density (UID) principle can be confirmed for eight typologically diverse languages. The lexical information of words is derived from dependency structures both in the sentences preceding the target sentence and within the sentence in which the target word occurs. Dependency structures thus serve as a realisation of the extra-sentential contexts from which information is derived in the surprisal model. Only subjects, objects and obliques, i.e., the level directly below the verbal root node, were considered. UID states that in natural language the variance of information, and the information jumps from word to word, should be small so that processing a linguistic message does not become an insurmountable hurdle. We observed cross-linguistically different information distributions but an almost identical UID, which provides evidence for the UID hypothesis and suggests that dependency structures can function as proxies for extra-sentential contexts. However, for the dependency structures chosen as contexts, the information distributions in some languages did not differ statistically significantly from those of a random corpus. This might be an effect of the low complexity of our model's dependency structures, so lower hierarchical levels (e.g., phrases) should be considered.
BibTex | DOI: 10.5220/0000155600003116
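As an illustration only (not the study's implementation), the following sketch shows how per-word surprisal and its variance, the quantity that UID constrains, can be estimated; the toy corpus and the plain bigram conditioning are stand-ins for the dependency-based contexts described in the abstract.

```python
import math
from collections import Counter

# Toy corpus standing in for treebank sentences; the actual study conditions
# on dependency relations (subject, object, oblique), approximated here by a
# simple bigram context.
corpus = [
    "the cat chased the mouse".split(),
    "the mouse escaped the cat".split(),
    "the dog chased the cat".split(),
]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))
vocab = len(unigrams)

def surprisal(prev, word):
    """-log2 P(word | prev) with add-one smoothing."""
    p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)
    return -math.log2(p)

def uid_profile(sentence):
    """Per-word surprisal and its variance, which UID predicts to be small."""
    values = [surprisal(a, b) for a, b in zip(sentence, sentence[1:])]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return values, variance

values, var = uid_profile("the cat chased the mouse".split())
print([round(v, 2) for v in values], round(var, 2))
```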
This paper reports the results of a study on automatic keyword extraction in German. We employed two types of methods: (A) unsupervised methods based on information theory, namely (i) a bigram model, (ii) a probabilistic parser model, and (iii) a novel model which considers topics within the discourse of the target words when calculating their information content, and (B) a supervised method employing a recurrent neural network (RNN). As baselines, we employed TextRank and the TF-IDF ranking function. The topic model (A)(iii) clearly outperformed all remaining models, including TextRank and TF-IDF. In contrast, the RNN performed poorly. We take the results as first evidence that (i) information content can be employed for keyword extraction tasks and thus has a clear correspondence to the semantics of natural language, and (ii) that, as a cognitive principle, the information content of words is determined from extra-sentential contexts, i.e., from the discourse of words.
BibTex | DOI: 10.1007/978-3-030-63787-3_5
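The unsupervised idea in (A) can be sketched as follows; the hard-coded background counts and the unigram estimate are illustrative assumptions, not the paper's models or data. Candidate keywords are ranked by their information content relative to a background corpus.

```python
import math
from collections import Counter

# Illustrative sketch: score candidate keywords by their information content
# (surprisal) as estimated from assumed background corpus counts. These counts
# stand in for the paper's bigram / parser / topic models.
background = Counter({"der": 900, "die": 850, "das": 800, "extraktion": 4,
                      "schluesselwort": 3, "modell": 20, "sprache": 25})
total = sum(background.values())

def information_content(word):
    """-log2 of the word's (smoothed) background probability."""
    return -math.log2((background[word] + 1) / (total + len(background)))

document = "die extraktion von schluesselwort modell der sprache".split()
scores = {w: information_content(w) for w in set(document)}
keywords = sorted(scores, key=scores.get, reverse=True)[:3]
print(keywords)  # rare, high-information words surface as keyword candidates
```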
This paper reports the results of a study on automatic keyword extraction in German. We employed two types of methods: (A) unsupervised methods based on information theory (Shannon, 1948), namely (i) a bigram model, (ii) a probabilistic parser model (Hale, 2001) and (iii) an innovative model which utilises topics as extra-sentential contexts for calculating the information content of words, and (B) a supervised method employing a recurrent neural network (RNN). As baselines, we employed TextRank and the TF-IDF ranking function. The topic model (A)(iii) clearly outperformed all remaining models, including TextRank and TF-IDF. In contrast, the RNN performed poorly. We take the results as first evidence that (i) information content can be employed for keyword extraction tasks and thus has a clear correspondence to the semantics of natural language, and (ii) that, as a cognitive principle, the information content of words is determined from extra-sentential contexts, that is to say, from the discourse of words.
BibTex | DOI: 10.5220/0009374704590464
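The core idea of model (A)(iii), conditioning a word's information content on the topic of its discourse, can be sketched as below; the fixed topic counts and the trivial topic assignment are placeholders for how the paper derives topics from the surrounding discourse.

```python
import math
from collections import Counter

# Topic-conditioned information content: P(word | topic) replaces P(word), so
# a word that is rare within the discourse topic of its document carries more
# information. Topic counts here are assumed toy values.
topic_counts = {
    "sport": Counter({"spiel": 50, "tor": 30, "mannschaft": 25, "analyse": 2}),
    "politik": Counter({"wahl": 40, "partei": 35, "analyse": 10, "tor": 1}),
}

def conditional_information(word, topic):
    """-log2 P(word | topic) with add-one smoothing."""
    counts = topic_counts[topic]
    total = sum(counts.values())
    return -math.log2((counts[word] + 1) / (total + len(counts)))

# "tor" (goal) is unremarkable in a sports discourse but informative in a
# politics discourse, so the same word can or cannot be a keyword candidate.
print(round(conditional_information("tor", "sport"), 2),
      round(conditional_information("tor", "politik"), 2))
```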
When training neural networks, huge amounts of training data typically lead to better results. When only a small amount of training data is available, it has been proven useful to initialize a network with pretrained layers. For NLP tasks, networks are usually only given pretrained word embeddings, the rest of the network is not pretrained since pretraining recurrent networks for NLP tasks is difficult. In this article, we present a siamese architecture for pretraining recurrent networks on textual data. The network has to map pairs of sentences onto a vector representation. When a sentence pair is appearing coherently in our corpus, the vector representations should be similar, if not, the representations should be dissimilar. After having pretrained that network, we enhance it and train it on a smaller dataset in order to have it classify textual data. We show that using this kind of approach for pretraining results in better results comparing to doing no pretraining or only using pretrained embeddings when doing text classification for a task with only a small amount of training data. For evaluation, we are using the bots and gender profiling dataset provided by PAN 2019.
BibTex | Paper | Poster | PAN
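A minimal PyTorch sketch of the Siamese pretraining idea described above: a shared recurrent encoder maps both sentences of a pair to vectors, and a contrastive (cosine embedding) loss pulls coherent pairs together and pushes incoherent ones apart. The dimensions, the toy batch and the specific loss are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """Shared encoder: embeds token ids and returns the final GRU state."""
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        _, h_n = self.rnn(self.embed(token_ids))
        return h_n.squeeze(0)          # (batch, hidden_dim)

encoder = SentenceEncoder()
criterion = nn.CosineEmbeddingLoss(margin=0.5)
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)

# Toy batch: two sentence pairs as padded token ids; target +1 marks a
# coherent pair (appearing together in the corpus), -1 an incoherent one.
sent_a = torch.randint(1, 1000, (2, 12))
sent_b = torch.randint(1, 1000, (2, 12))
targets = torch.tensor([1.0, -1.0])

loss = criterion(encoder(sent_a), encoder(sent_b), targets)
loss.backward()
optimizer.step()
print(float(loss))
```

After such pretraining, the encoder weights would be reused and a classification head added and fine-tuned on the smaller labelled dataset.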
Purpose An overview of the current use of handwritten text recognition (HTR) on archival manuscript material, as provided by the EU H2020 funded Transkribus platform. It explains HTR, demonstrates Transkribus, gives examples of use cases, highlights the effect HTR may have on scholarship, and evidences this turning point in the advanced use of digitised heritage content. The paper aims to discuss these issues.
Design/methodology/approach This paper adopts a case study approach, using the development and delivery of the only openly available HTR platform for manuscript material.
Findings Transkribus has demonstrated that HTR is now a useable technology that can be employed in conjunction with mass digitisation to generate accurate transcripts of archival material. Use cases are demonstrated, and a cooperative model is suggested as a way to ensure sustainability and scaling of the platform. However, funding and resourcing issues are identified.
Research limitations/implications The paper presents results from projects: further user studies could be undertaken involving interviews, surveys, etc.
Practical implications Only HTR provided via Transkribus is covered; however, this is the only publicly available platform for HTR on individual collections of historical documents at the time of writing, and it represents the current state of the art in this field.
Social implications The increased access to information contained within historical texts has the potential to be transformational for both institutions and individuals.
Originality/value This is the first published overview of how HTR is used by a wide archival studies community, reporting and showcasing the current application of handwriting recognition technology in the cultural heritage sector.
BibTex | DOI: 10.1108/JD-07-2018-0114