Unsupervised pretraining for text classification using siamese transfer learning

Friday, 13. September 2019 in publications :: #CLEF

When training neural networks, large amounts of training data typically lead to better results. When only a small amount of training data is available, it has proven useful to initialize a network with pretrained layers. For NLP tasks, networks are usually given only pretrained word embeddings; the rest of the network is not pretrained, since pretraining recurrent networks for NLP tasks is difficult. In this article, we present a siamese architecture for pretraining recurrent networks on textual data. The network has to map pairs of sentences onto vector representations. When a sentence pair appears coherently in our corpus, the vector representations should be similar; if not, they should be dissimilar. After pretraining that network, we extend it and train it on a smaller dataset to classify textual data. We show that this kind of pretraining yields better results than no pretraining or using only pretrained embeddings when doing text classification for a task with only a small amount of training data. For evaluation, we use the bots and gender profiling dataset provided by PAN 2019.
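The pretraining objective described above can be illustrated with a small sketch. The abstract does not state which loss the paper uses, so the cosine similarity, the margin value and the function names below are assumptions chosen for illustration only; the vectors stand in for the outputs of the two weight-sharing recurrent encoders.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(u, v, coherent, margin=0.5):
    """Loss for one sentence pair (assumed form, not taken from the paper).

    coherent=True:  the pair occurs together in the corpus, so low
                    similarity is penalised.
    coherent=False: the pair is unrelated, so similarity above the
                    margin is penalised.
    """
    sim = cosine_similarity(u, v)
    if coherent:
        return 1.0 - sim
    return max(0.0, sim - margin)


# Toy sentence vectors standing in for encoder outputs.
anchor = [1.0, 0.0]
neighbour = [1.0, 0.0]
unrelated = [0.0, 1.0]

loss_coherent = contrastive_loss(anchor, neighbour, coherent=True)
loss_unrelated = contrastive_loss(anchor, unrelated, coherent=False)
loss_wrongly_similar = contrastive_loss(anchor, neighbour, coherent=False)
```

In this toy setup only the last case is penalised: two identical vectors that are labelled as not coherent produce a positive loss, which is the signal that would push their representations apart.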

BibTex | Paper | Poster | PAN

Transforming scholarship in the archives through handwritten text recognition: Transkribus as a case study

Tuesday, 10. September 2019 in publications :: #JournalOfDocumentation

Purpose An overview of the current use of handwritten text recognition (HTR) on archival manuscript material, as provided by the EU H2020 funded Transkribus platform. It explains HTR, demonstrates Transkribus, gives examples of use cases, highlights the effect HTR may have on scholarship, and evidences this turning point of the advanced use of digitised heritage content. The paper aims to discuss these issues.

Design/methodology/approach This paper adopts a case study approach, using the development and delivery of the one openly available HTR platform for manuscript material.

Findings Transkribus has demonstrated that HTR is now a useable technology that can be employed in conjunction with mass digitisation to generate accurate transcripts of archival material. Use cases are demonstrated, and a cooperative model is suggested as a way to ensure sustainability and scaling of the platform. However, funding and resourcing issues are identified.

Research limitations/implications The paper presents results from projects: further user studies could be undertaken involving interviews, surveys, etc.

Practical implications Only HTR provided via Transkribus is covered: however, this is the only publicly available platform for HTR on individual collections of historical documents at time of writing and it represents the current state-of-the-art in this field.

Social implications The increased access to information contained within historical texts has the potential to be transformational for both institutions and individuals.

Originality/value This is the first published overview of how HTR is used by a wide archival studies community, reporting and showcasing current application of handwriting technology in the cultural heritage sector.

BibTex | DOI: 10.1108/JD-07-2018-0114

Convolutional Attention on Images for Locating Macular Edema

Friday, 24. January 2020 in publications :: #MIUA

Neural networks have become a standard for classifying images. However, by their very nature, their internal data representation remains opaque. To solve this dilemma, attention mechanisms have recently been introduced. They help to highlight the regions of the input data that a network used for its classification decision. This article presents two attention architectures for the classification of medical images. First, we explain a simple architecture which creates one attention map that is used for all classes. Second, we introduce an architecture that creates an attention map for each class. This is done by creating two U-nets, one for attention and one for classification, and then multiplying their two maps together. We show that our architectures match the baseline of standard convolutional classifiers while at the same time increasing their explainability.
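The per-class variant can be pictured with a small sketch: each class has an attention map and a classification map of the same size, and the two are multiplied element-wise. How the product is reduced to a single score is not stated in the summary, so the mean pooling and the function name below are assumptions, and the nested lists stand in for the outputs of the two U-nets.

```python
def class_scores(attention_maps, class_maps):
    """Combine per-class attention and classification maps.

    Both arguments are lists with one 2D map (list of rows) per class.
    Each pair of maps is multiplied element-wise and then averaged,
    which is an assumed pooling step used only for illustration.
    """
    scores = []
    for att, cls in zip(attention_maps, class_maps):
        weighted = [
            a * c
            for att_row, cls_row in zip(att, cls)
            for a, c in zip(att_row, cls_row)
        ]
        scores.append(sum(weighted) / len(weighted))
    return scores


# Two classes, 2x2 maps. The attention maps show which pixels count.
attention = [
    [[1.0, 0.0], [0.0, 0.0]],  # class 0 looks at the top-left pixel
    [[0.0, 0.0], [0.0, 1.0]],  # class 1 looks at the bottom-right pixel
]
classification = [
    [[0.8, 0.2], [0.2, 0.2]],
    [[0.1, 0.1], [0.1, 0.4]],
]

scores = class_scores(attention, classification)
predicted_class = scores.index(max(scores))
```

Because the attention map is an explicit factor in the score, it can be displayed next to the input image to show which region contributed to the decision for each class.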

BibTex | DOI: 10.1007/978-3-030-39343-4_33

Auto encrypt all Incoming Email with postfix

Monday, 30. December 2019 in projects :: #admin, #coding, #gpgmail, #pgp

I have been running my own mail server for a while now. Since the beginning I have been thinking about how to store the mails encrypted, so that no one with access to the server can read them. The solution I came up with is relatively easy to set up and is based on OpenPGP/GnuPG.

The basic idea is to take incoming mail before it is stored and encrypt it. I'm running postfix, which has the option to filter queued mails with external content filters. A content filter gets a mail via stdin, does whatever it needs to do and either rejects the mail or puts it back into the mail queue.

I wrote a relatively simple Python script that takes a mail from stdin, processes it and then writes it back to stdout. The script can either decrypt, encrypt, sign or sign and encrypt a mail. It also tries to protect the mail headers following the memoryhole specs and supports Thunderbird's/Enigmail's encrypted subject feature. The drawback is that Enigmail only supports the encrypted header from the memoryhole specs and other mail clients don't support them at all. For the content_filter in postfix I wrote a Bash script that resends the encrypted mail to put it back into the mail queue. The scripts can be found on GitHub.
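To make the stdin-to-stdout flow concrete, here is a minimal sketch of such a filter using only Python's standard email package. It is not the gpgmail script: the function names are made up for this example, placeholder_encrypt is a hypothetical stand-in that only marks the body instead of calling GnuPG, and anything other than a plain-text, non-multipart mail is passed through unchanged.

```python
import sys
from email import message_from_bytes, policy


def placeholder_encrypt(text):
    # Hypothetical stand-in for real OpenPGP encryption; it only marks the body.
    return "[would be encrypted] " + text


def filter_mail(raw, transform):
    """Parse a raw mail, transform its plain-text body, return raw bytes."""
    msg = message_from_bytes(raw, policy=policy.default)
    if msg.is_multipart() or msg.get_content_type() != "text/plain":
        # Real MIME handling is out of scope here; pass such mails through.
        return raw
    body = msg.get_content()
    # set_content replaces the body and the Content-* headers only;
    # From, To, Subject and the other headers stay in place.
    msg.set_content(transform(body))
    return msg.as_bytes()


def main():
    # Content-filter style wiring: mail in on stdin, processed mail on stdout.
    raw = sys.stdin.buffer.read()
    sys.stdout.buffer.write(filter_mail(raw, placeholder_encrypt))


# Small demonstration without touching stdin.
sample = b"From: a@example.org\nTo: b@example.org\nSubject: hi\n\nhello\n"
processed = filter_mail(sample, placeholder_encrypt)
```

Calling main() would connect the same function to stdin and stdout, which is the shape a content filter needs. A real filter would replace the placeholder with actual OpenPGP encryption and build a proper PGP/MIME structure.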

Setup

Install gpgmail

Add a new user:

adduser --shell /bin/false --home /home/gpgmail --disabled-password --disabled-login --gecos "" gpgmail

Create .gnupg folder and change permissions:

mkdir /home/gpgmail/.gnupg
chown gpgmail:gpgmail /home/gpgmail/.gnupg/
chmod 700 /home/gpgmail/.gnupg/
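GnuPG warns about unsafe permissions when its home directory is accessible to other users, so it can be worth verifying the result of the chmod. The following is an optional sketch of such a check in Python; the function name is made up for this example, it assumes a POSIX system, and the demonstration runs on a temporary directory rather than on /home/gpgmail/.gnupg.

```python
import os
import stat
import tempfile


def is_private_dir(path):
    """Return True if neither group nor others have any permission on path."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & 0o077 == 0


# Demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    os.chmod(tmp, 0o700)
    private_ok = is_private_dir(tmp)        # owner-only access
    os.chmod(tmp, 0o755)
    open_detected = not is_private_dir(tmp)  # group/others can read
```

Run as root or as the gpgmail user, is_private_dir("/home/gpgmail/.gnupg") should return True after the commands above.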

If mails should not just get encrypted but also signed, create a new key pair:

sudo -u gpgmail /usr/bin/gpg --homedir=/home/gpgmail/.gnupg --expert --full-gen-key

Import public keys and change trust:

sudo -u gpgmail /usr/bin/gpg --homedir=/home/gpgmail/.gnupg --import /home/gpgmail/pubkey.asc
sudo -u gpgmail /usr/bin/gpg --homedir=/home/gpgmail/.gnupg --edit-key <KEY> trust save
sudo -u gpgmail /usr/bin/gpg --homedir=/home/gpgmail/.gnupg --edit-key <KEY> trust quit

Edit /etc/postfix/master.cf

smtp          inet  n       -       y       -       -       smtpd -o content_filter=gpgmail-pipe
smtps         inet  n       -       y       -       -       smtpd -o content_filter=gpgmail-pipe
submission    inet  n       -       y       -       -       smtpd -o content_filter=gpgmail-pipe
gpgmail-pipe  unix  -       n       n       -       -       pipe
  flags=Rq user=gpgmail argv=/usr/bin/gpgmail-postfix sign-encrypt gnupghome=/home/gpgmail/.gnupg key=<KEY_ID> passphrase=<PASSPHRASE> encrypt-subject -oi -f ${sender} ${recipient}

Restart postfix.

Sources

Evaluation of CNN architectures for text detection in historical maps

Saturday, 11. May 2019 in publications :: #DATeCH

We evaluate different densely connected fully convolutional neural network architectures to find and extract text from maps. This is a necessary preprocessing step before OCR can be performed. In order to locate the text, we train a neural network to classify whether a given input is text or not. Our main focus is on the output level: either classifying text or no text for the whole input, or predicting the text position pixel-wise by outputting a mask. Acquiring enough training data, especially for pixel-wise prediction, is quite a time-consuming task, so we investigate a method to generate artificial training data. We compare three training scenarios: first, training with images from historical maps, which is quite a small dataset; second, adding artificially generated images; and third, training with only the artificially generated data.

Full Abstract | Poster