This study focuses on the identification of English Idiomatic Expressions (IE) using an information theoretic model. In the focus are verb-noun constructions only. We notice significant differences in semantic surprisal and information density between IE-data and literals-data. Surprisingly, surprisal and information density in the IE-data and in a large reference data set do not differ significantly, while, in contrast, we observe significant differences between literals and a large reference data set.
Ziel des Projekts „Forschungsportal Bach“ ist der Aufbau eines umfassenden Online-Repositoriums, das Zugang zu allen erhaltenen Dokumenten der Musikerfamilie Bach – der einflussreichsten Familiendynastie in der Musikgeschichte – vom späten 16. bis zum frühen 19. Jahrhundert bietet. Zum ersten Mal in der Geschichte der Bachforschung wird das Material, das in Bibliotheken, Archiven und Privatsammlungen erhalten ist, digital erfasst, indiziert, verarbeitet, annotiert und via Online-Portal zugänglich gemacht. Die digitalisierten Dokumente werden mittels „Transkribus“ automatisch transkribiert, mit Hilfe des TEI Publishers annotiert und schließlich als digitale Edition veröffentlicht. Dabei werden u.a. die in den Dokumenten erwähnten Werke sowie Wasserzeichen ausgewählter Archivalien mit dem Portal „Bach digital“ verknüpft.
The aim of this project is to build a comprehensive online repository that provides access to all surviving documents of the Bach family of musicians – the most influential family dynasty in music history – from the late 16th to the early 19th century. For the first time in the history of Bach research, the written documents preserved in libraries, archives and private collections will be digitally recorded, indexed, processed, annotated and made available via an online portal. Interlinking with existing digital Bach projects is planned.
In this study on keyword extraction in German, we compare the performance of Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA) as part of the Topic Context Model (TCM).
TCM calculates the information-theoretical measure surprisal as a context-based feature of words.
Surprisal is a contextualised information measure based on conditional probabilities.
As contexts for the calculation of surprisal, TCM evaluates the topic- and topic-word distributions in the preceding and following environment of words.
This environment can be sentences, paragraphs, documents or an entire corpus.
TCM thus answers the question: how surprising is a word for a language processor given the topics within the word's environment?
The point of departure of the study is that surprisal as a lexical feature can be useful for NLP applications.
In previous studies, this has been shown both for keyword extraction and for the determination of the expressivity of intensifiers.
So far, exclusively LDA has been used for the topic modelling part of the TCM.
In this study, TCM draws on the topics identified by LDA and LSA in order to compare the performance of the two topic models.
This comparison is motivated by the fact that LSA and LDA are similar algorithms in some ways but quite different in others.
Both assume that texts are a combination of abstract, invisible topics that can be determined by looking at word frequencies.
Their respective approaches to determining these topics, however, differ significantly.
While LSA is deterministic and models texts as linear combinations of topic vectors, the generative LDA model is probabilistic and assumes a Bayesian network linking topics to words and then to texts.
In particular, LSA is computationally 'nicer' but assumes a linear structure that cannot be taken for granted.
Whether or not this structure is actually there, is interesting for two reasons: firstly, it provides a working linear model of topics as vectors (hence offering computational ways to compare the similarity of topics).
Secondly, choosing the well-behaved deterministic LSA over LDA is justified.
The data resource in this study is a subset of the Heise corpus corpus.
The comparison of LDA and LSA proceeds in two steps: (i) the quality of the words’ surprisal values for keyword extraction is directly compared, (ii) the surprisal values are input of a Recurrent Neural Network (RNN).
This operationalisation follows previous studies on TCM.
We observe that in the direct comparison, LSA slightly outperforms LDA in precision, recall and F1-values.
In contrast, when surprisal of words is the input of the RNN, we see no clear winner.
Because of the LSA’s greater computational economy and 'niceness' compared to LDA, our conclusion is that LSA is a suitable building block of TCM in the service of NLP's keyword extraction application.
The aim of this study is to identify idiomatic expressions in English using the measure perplexity. The assumption is that idiomatic expressions cause higher perplexity than literal expressions given a reference text. Perplexity in our study is calculated based on n-grams of (i) PoS tags, (ii) tokens, and (iii) thematic roles within the boundaries of a sentence. In the setting of our study, we observed that no perplexity in the contexts of (i), (ii) and (iii) manages to distinguish idiomatic expressions from literals. We postulate that larger, extra-sentential contexts should be used for the determination of perplexity. In addition, the number of thematic roles in (iii) should be reduced to a smaller number of basic roles in order to avaiod an uniform distribution of n-grams.
This paper describes a study on keyword extraction in German with a model that utilises Shannon information as a lexical feature. Lexical information content was derived from large, extra-sentential semantic contexts of words in the framework of the novel Topic Context Model. We observed that lexical information content increased the performance of a Recurrent Neural Network in keyword extraction, outperforming TexTRank and other two models, i.e., Named Entity Recognition and Latent Dirichlet Allocation used comparatively in this study.
This study investigates three challenges for developing machine learning-based self-service web apps for consumers. First, we argue that user research must accompany the development of ML-based products so that they better serve users’ needs at all stages of development. Second, we discuss the data sourcing dilemma in developing consumer-oriented ML-based apps and propose a way to solve it by implementing an interaction design that balances the workload between users and computers according to the ML component’s performance. To dynamically define the role of the user-in-the-loop, we monitor user success and ML performance over time. Finally, we propose a lightweight typology of ML-based systems to assess the generalizability of our findings to other ML use cases.
Our case study uses a newly developed web application that allows consumers to analyze their heating bills for potential energy and cost savings. Based on domain-specific data values extracted from user-provided document images, an assessment of potential savings is derived and reported back to the user.
The starting point of this paper is the observation that methods based on the direct match of keywords are inadequate because they do not consider the cognitive ability of concept formation and abstraction. We argue that keyword evaluation needs to be based on a semantic model of language capturing the semantic relatedness of words to satisfy the claim of the human-like ability of concept formation and abstraction and achieve better evaluation results. Evaluation of keywords is difficult since semantic informedness is required for this purpose. This model must be capable of identifying semantic relationships such as synonymy, hypernymy, hyponymy, and location-based abstraction. For example, when gathering texts from online sources, one usually finds a few keywords with each text. Still, these keyword sets are neither complete for the text nor are they in themselves closed, i.e., in most cases, the keywords are a random subset of all possible keywords and not that informative w.r.t. the complete keyword set. Therefore all algorithms based on this cannot achieve good evaluation results and provide good/better keywords or even a complete keyword set for a text. As a solution, we propose a word graph that captures all these semantic relationships for a given language. The problem with the hyponym/hyperonym relationship is that, unlike synonyms, it is not bidirectional. Thus the space of keyword sets requires a metric that is non-symmetric, in other words, a quasi-metric. We sketch such a metric that works on our graph. Since it is nearly impossible to obtain such a complete word graph for a language, we propose for the keyword task a simpler graph based on the base text upon which the keyword sets should be evaluated. This reduction is usually sufficient for evaluating keyword sets.
This pilot study addresses the question of whether the Uniform Information Density principle (UID) can be proved for eight typologically diverse languages. The lexical information of words is derived from dependency structures both in sentences preceding the sentences and within the sentence in which the target word occurs. Dependency structures are a realisation of extra-sentential contexts for deriving information as formulated in the surprisal model. Only subject, object and oblique, i.e., the level directly below the verbal root node, were considered. UID says that in natural language, the variance of information and information jumps from word to word should be small so as not to make the processing of a linguistic message an insurmountable hurdle. We observed cross-linguistically different information distributions but an almost identical UID, which provides evidence for the UID hypothesis and assumes that dependency structures can function as proxies for extra-sentential contexts. However, for the dependency structures chosen as contexts, the information distributions in some languages were not statistically significantly different from distributions from a random corpus. This might be an effect of too low complexity of our model's dependency structures, so lower hierarchical levels (e.g. phrases) should be considered.
This paper reports the results of a study on automatic keyword extraction in German. We employed in general two types of methods: (A) unsupervised, based on information theory, i.e., (i) a bigram model, (ii) a probabilistic parser model, and (iii) a novel model which considers topics within the discourse of target word for the calculation of their information content, and (B) supervised, employing a recurrent neural network (RNN). As baselines, we employed TextRank and the TF-IDF ranking function. The topic model (A)(iii) outperformed clearly all remaining models, even TextRank and TF-IDF. In contrast, RNN performed poorly. We take the results as first evidence that (i) information content can be employed for keyword extraction tasks and has thus a clear correspondence to semantics of natural language, and (ii) that – as a cognitive principle – the information content of words is determined from extra-sentential contexts, i.e., from the discourse of words.
This paper reports the results of a study on automatic keyword extraction in German. We employed in general two types of methods: (A) an unsupervised method based on information theory (Shannon, 1948). We employed (i) a bigram model, (ii) a probabilistic parser model (Hale, 2001) and (iii) an innovative model which utilises topics as extra-sentential contexts for the calculation of the information content of the words, and (B) a supervised method employing a recurrent neural network (RNN). As baselines, we employed TextRank and the TF-IDF ranking function. The topic model (A)(iii) outperformed clearly all remaining models, even TextRank and TF-IDF. In contrast, RNN performed poorly. We take the results as first evidence, that (i) information content can be employed for keyword extraction tasks and has thus a clear correspondence to semantics of natural language’s, and (ii) that - as a cognitive principle - the information content of words is determined from extra-sentential contexts, that is to say, from the discourse of words.
Neural networks have become a standard for classifying images. However, by their very nature, their internal data representation remains opaque. To solve this dilemma, attention mechanisms have recently been introduced. They help to highlight regions in input data that have been used for a network’s classification decision. This article presents two attention architectures for the classification of medical images. Firstly, we are explaining a simple architecture which creates one attention map that is used for all classes. Secondly, we introduce an architecture that creates an attention map for each class. This is done by creating two U-nets - one for attention and one for classification - and then multiplying these two maps together. We show that our architectures well meet the baseline of standard convolutional classifications while at the same time increasing their explainability.
I am running my own mail server for a while now. Since the beginning I was thinking about how to store the mails encrypted, so that no one can read the mails with access to the server. The solution I came up with is relative easy to setup and is based upon OpenPGP/GnuPGP.
The basic idea is to take incoming mail before it is stored and encrypt it. I'm running postfix, which has the option to filter queued mails with external content filters. A content filter gets a mail via stdin, does whatever it needs to do and either rejects a mail or put it back into the mail queue.
I wrote a relativ simple Python script that takes a mail from stdin, processes it and then writes it back to stdout. The script can either decrypt, encrypt, sign or sign and encrypt a mail. It also tries to protect the mail headers following the memoryhole specs and supports Thunderbirds/Enigmails encrypted subject feature. The drawback is that Enigmail only supports the encrypted header from the memoryhole specs and other mail clients don't support them at all. For the content_filter in postfix I wrote a Bash script, that will resend the encrypted mail to put it back into the mail queue. The scripts can be found on GitHub.
When training neural networks, huge amounts of training data typically lead to better results. When only a small amount of training data is available, it has been proven useful to initialize a network with pretrained layers. For NLP tasks, networks are usually only given pretrained word embeddings, the rest of the network is not pretrained since pretraining recurrent networks for NLP tasks is difficult. In this article, we present a siamese architecture for pretraining recurrent networks on textual data. The network has to map pairs of sentences onto a vector representation. When a sentence pair is appearing coherently in our corpus, the vector representations should be similar, if not, the representations should be dissimilar. After having pretrained that network, we enhance it and train it on a smaller dataset in order to have it classify textual data. We show that using this kind of approach for pretraining results in better results comparing to doing no pretraining or only using pretrained embeddings when doing text classification for a task with only a small amount of training data. For evaluation, we are using the bots and gender profiling dataset provided by PAN 2019.
Purpose An overview of the current use of handwritten text recognition (HTR) on archival manuscript material, as provided by the EU H2020 funded Transkribus platform. It explains HTR, demonstrates Transkribus, gives examples of use cases, highlights the affect HTR may have on scholarship, and evidences this turning point of the advanced use of digitised heritage content. The paper aims to discuss these issues.
Design/methodology/approach This paper adopts a case study approach, using the development and delivery of the one openly available HTR platform for manuscript material.
Findings Transkribus has demonstrated that HTR is now a useable technology that can be employed in conjunction with mass digitisation to generate accurate transcripts of archival material. Use cases are demonstrated, and a cooperative model is suggested as a way to ensure sustainability and scaling of the platform. However, funding and resourcing issues are identified.
Research limitations/implications The paper presents results from projects: further user studies could be undertaken involving interviews, surveys, etc.
Practical implications Only HTR provided via Transkribus is covered: however, this is the only publicly available platform for HTR on individual collections of historical documents at time of writing and it represents the current state-of-the-art in this field.
Social implications The increased access to information contained within historical texts has the potential to be transformational for both institutions and individuals.
Originality/value This is the first published overview of how HTR is used by a wide archival studies community, reporting and showcasing current application of handwriting technology in the cultural heritage sector.
Zum Beispiel könnte man alle nationalen Pässe durch einen Europäischen Pass ersetzen. Ein Pass der Europäischen Union, in dem der Geburtsort vermerkt ist, aber nicht die Nationalität. Ich glaube, dass allein dies etwas im Bewusstsein der Generation bewirken würde, die mit einem solchen Pass aufwächst. Und das würde nicht einmal etwas kosten. […] Aber das ist nicht genug, setzte er fort.
We evaluate different densely connected fully convolutional neural network architectures to find and extract text from maps. This is a necessary preprocessing step before OCR can be performed. In order to locate the text, we train a neural network to classify whether a given input is text or not. Our main focus is on the output level, either classifying text or no text for the whole input or predicting the text position pixel wise
by outputting a mask. Acquiring enough training data especially for pixel wise prediction is quite a time consuming task, so we investigate a method to generate artificial training data. We compare three training scenarios. First training with images from historical maps, which is quite a small dataset, second adding artificially generated images and third training just with the artificially generated data.
Im Rahmen unterschiedlicher Projekte und Qualifizierungsarbeiten entstanden und entstehen eine Reihe neuronaler Netze. Diese Netze dienen entweder der Klassifizierung von Bildern oder dem Auffinden von Objekten in Bildern. Um diese neuronalen Netze besser miteinander hinsichtlich unterschiedlicher Architekturen, unterschiedlicher Hyperparameter oder unterschiedlicher Trainingsdaten zu vergleichen und sie einem breiteren Personenkreis verfügbar zu machen, wurde eine Webseite entwickelt, auf der man trainierte neuronale Netze und Bilder hochladen, eine Objekterkennung starten und die Ergebnisse einheitlich visualisieren kann. Die Webseite soll sowohl für Entwickler neuronaler Netze sein, die Ihre Netze vergleichen wollen, als auch für Forscher, die auf ihren Bildern eine Objekterkennung durchführen wollen.
Zum Trainieren maschineller Lernverfahren zur Erkennung von Handschriften werden Textdaten mit korrespondierenden Bildern benötigt. Die Textdaten liegen häufig im TEI-Format das diverse Möglichkeiten eröffnet, um textuelle und semantische Phänomene auszuzeichnen, weiter können gar eigene Tags oder Auszeichnungsarten eingeführt werden. In diesem Beitrag wird ein im EU-Projekt READ entwickeltes parametrisierbares Tool beschrieben, das mit unterschiedlichen Auszeichnungsstilen in TEI umgehen kann und Textdateien auf Seitenbasis liefert, die zur Zuordnung von Text zu Bilddaten (text-to-image) genutzt werden können und somit zur Aufbereitung von Trainingsdaten für Modelle der Handschriftenerkennung dienen. Die gezeigten Beispiele und Anwendungen stammen alle aus Projekten, die ihre Daten für READ zur Verfügung stellten.
In this thesis different architectures of Convolutional Neural Networks (CNN) and their suitability for object recognition were investigated by using the example of Egyptian hieroglyphs.
First, basic principles for artificial neural networks and the components of CNNs, such as convolutional layers, are introduced and explained, followed by explanations of the used data sets and the associated difficulties. We then present libraries for the concrete implementation and use of artificial neural networks and describe the sequence of the evaluation of object recognition.
The CNNs were trained and evaluated with different numbers of classes and the associated number of images. The experiments are divided by the three used training methods. For the first, the CNNs were trained with the help of autoencoders. For the second, the CNNs were trained block-wise, and for the third deeper network architectures were investigated. Various network architectures such as Residual Networks (ResNet) and Densely Connected Convolutional Networks, described in the literature, were implemented and evaluated.
The results of the experiments show that it is possible to train CNNs with up to 69 convolutional layers, to classify about 6500 different Egyptian hieroglyphs and finally to carry out an object recognition with very good results. The best results for object detection 0.92362 were achieved for a CNN with 6465 classes and 13 convolutional layers.
In this thesis different machine learning algorithms are evaluated for the task of multi-label classification. The evaluation is done with the binary classifiers naive Bayes and support vector machine (SVM) and the multi-class classifier supervised latent Dirichlet allocation (SLDA). To enable naive Bayes and SVM to do multi-label classification the RAkEL transformation is used and for SLDA a topic model multi-label learner is developed and used.
The Reuters-21578 corpus is used. Since not all texts have labels and not all labels occur in sufficient frequency a selection of texts was used. Two corpora were created and used for classification.
The classification results show that the best results are archived with SVM. Naive Bayes and SLDA give very similar results, but SLDA has a very long runtime.
I must sadly announce the end of life for TIMA. Or at least the end of the TIMA website at https://tima.jnphilipp.org. This is due to the practically non existent traffic and my inability to maintain the site. The EOL will be at the end of the month, the 30th of September 2016. I will upload a database dump with the associations to this post after the shutdown.
Update: So the EOL of the TIMA website is reached. As promised a dump of the associations can be downloaded here as a JSON-file. For each word the language, count, identifier and associations are given, here the count indicates how often the word was answered. An association has the same informations, but here the count indicates how often the association was given to the word.
I recently read an interresting post about the target="_blank" vulnerability. This vulnerability leaves a user open to a very simple phishing attack and is quite unknown. When a link uses the target="_blank" attribute not accompanied with the rel="noopener" attribute or in the case of Firefox rel="noopener noreferrer" the opening site gives the new site access to the existing window through the window.opener API, allowing a few permissions. Some of these permissions are automatically negated by cross-domain restrictions, but window.location is fair game.
To see this vulnerability in action you can use this link. It'll open the post in a new tab/window and redirect this window to an other page.
The code below shows the necessary code for the window.opener API to redirect the opening site to a new location.
Because of that post, I removed all target="_blank" attributes from the links. I had also a few other changes that had pilled up and which I hadn't gotten around to put online. Most are on the back end side. On front end side I changed manly the color of the sidebar.
Over the last few weeks I added a few new features. The most extensive feature I added is the API. The API consists of two parts, the first is to retrieve the posts and projects as JSON. The other is an OAI-PMH endpoint, which returns XML. At the moment I only support the metadata in the Dublin Core format, but I plan to add CMDI. For details on the API I added a page to the project section. The second feature I added was inspired by this post about signing web content using PGP. I added signatures to the posts and projects which can be view in the source code and verified using my public key or with Keybase. On a side note, I got new certificates from Let’s Encrypt and forcing HTTPS now.
Since my last Post about TIMA a few thing have happened and changed. We added and FAQ page, and most noticeably a we added a section with games to the Website. Currently there is only one: AssociationChain.
AssociationChain is a simple game in which you and TIMA build an association chain together. The rules are as follows: You and TIMA alternately associate a word to the previous association. The goal is to build long chains.
As for the Apps it's a work in progress. The basic functionally of the Website is in the App and works we'll see that rest get's into it an that we can distribute it. As for the game Apps that will take some more time.
As for the TIMA itself. We currently support four languages: German, English, Spanish and Farsi. We have a total of over 3400 words and over 4000 unique associations.
TIMA short for "TIMA is my association" is a citizen science project I currently work on. The goal of the project is to build a large database of associations. To get the associations we need your help. Everyone who want can go to our website and start. First you need to select a language and then you get a word and asked to type in your association of said word. For each association you will receive points and a new word.
Besides the website we are in the process of building some apps. The first and most basic app follows the design concept of the website and gives you words and asks for your associations. In a later phase we have plan for apps that will have a more game like approach. One will be based on the concept of the German TV-show Familien-Duell. I will write more to that in a later post.
In addition to the collection of association we also publish them. On the website is a list of all the words and their association with some graphs and statistics. We have also an extensive API, over which the data can be exported. In addition we have included OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting). This is a low-barrier mechanism the expose metadata for repositories. Our base URL is https://tima.jnphilipp.org/oai2/.
This is one of the website I mentioned in my previous post about Bootstrap. The website is written in Python using the Django web framework. You can checkout the code on GitHub.
I recently had to build a few website, about which I'll write soon a bit, in which I used Bootstrap. Since the design I used when I build this site was somewhat crude I started to do some redesigning using Bootstrap. The result of these efforts are now online. Enjoy!
I'm sorry I'm a little late with this, but I finally came around to write this post. In the last term I took a course were we had to write a simulated web crawler and implement different crawling strategies. The complete code and detailed descriptions on inputs and how to compile and run it are on GitHub.
First we had to implement breadth-first search strategy and then two page level (backlink-count and OPIC) and two site level (round robin and max page-priority) strategies, which should be combianed as desired. And finally we should use OPIC, backlink-count and the ratio of good to bad pages to develop a formula to combine them to a strategy called OPTIMAl.
On the first run two input files need to be provided, the first on is the link graph and the second one the quality mapping. Bevor the actual crawling starts, the files will be read an stored in a MapDB for easy access. As long as the MapDB files exist there is no need to provide the link graph and quality mapping file. If they are provided the MapDB will be recreated.
For performance reason the crawling itself is done in threads via a ScheduledThreadPoolExecutor. A single thread performce the crawling of a single site.
For the course we had a link graph with about 230 million entries (including duplicates) on which we should run our tests. We should do 5000 steps with 200 URLs per step and a batch size of 100 and 500. The batch size dictates the update intervalls for backlink count and OPIC. The runtimes are in the table below and the performance in the graphs.
The last few weeks I worked with some people on a smart meter project. Our goal was to show how to receive the data on a large scale and to handle them. We divided it into three parts, the first was a generator for generating lifelike data. The second part was based on Apache Storm and Apache Accumulo received the data and stored them and the third part generated reports with Map Reduce.
Welcome to my new blog. It's been quite some time since my last blog. Back then I ran it on some old hardware I had. When it crashed I made a half hearted attempt on WordPress. Which quickly died down. Now I got a virtual server mainly to run some other projects. So I decided to start a new blog. Nothing fancy, just something to keep a record of my projects and ideas.