© 2019 jnphilipp.org Creative Commons License
[l] Objdetect: A platform for visualizing the predictions of object-detecting neural networks
posted on 21. December 2018

In the course of various projects and theses, a number of neural networks have been and are still being created. These networks either classify images or locate objects in images. To compare these networks more easily with respect to different architectures, hyperparameters, or training data, and to make them available to a wider audience, a website was developed on which trained neural networks and images can be uploaded, object detection can be started, and the results are visualized in a uniform way. The website is intended both for developers of neural networks who want to compare their networks and for researchers who want to run object detection on their own images.


[l] Generating training data for handwriting recognition from TEI-annotated documents: A field report from the EU project READ
posted on 04. October 2018

Training machine learning methods for handwriting recognition requires text data with corresponding images. The text data is often available in the TEI format, which offers various ways to mark up textual and semantic phenomena; custom tags or markup styles can even be introduced. This post describes a parameterizable tool developed in the EU project READ that can handle different TEI markup styles and produces page-level text files. These files can be used to align text with image data (text-to-image) and thus to prepare training data for handwriting recognition models. All examples and applications shown come from projects that made their data available to READ.


[l] Masterthesis: Objekterkennung mit Hilfe von Convolutional Neural Networks am Beispiel ägyptischer Hieroglyphen
posted on 27. February 2017

In this thesis different architectures of Convolutional Neural Networks (CNN) and their suitability for object recognition were investigated by using the example of Egyptian hieroglyphs.

First, the basic principles of artificial neural networks and the components of CNNs, such as convolutional layers, are introduced and explained, followed by a description of the data sets used and the difficulties associated with them. We then present libraries for implementing and using artificial neural networks and describe how the object recognition was evaluated.

The CNNs were trained and evaluated with different numbers of classes and the associated number of images. The experiments are grouped by the three training methods used. In the first, the CNNs were trained with the help of autoencoders. In the second, the CNNs were trained block-wise, and in the third, deeper network architectures were investigated. Various network architectures described in the literature, such as Residual Networks (ResNet) and Densely Connected Convolutional Networks, were implemented and evaluated.

The results of the experiments show that it is possible to train CNNs with up to 69 convolutional layers, to classify about 6500 different Egyptian hieroglyphs, and finally to carry out object recognition with very good results. The best object detection result, 0.92362, was achieved with a CNN with 6465 classes and 13 convolutional layers.

PDF | BibLaTex

[l] Bachelorthesis: Multi-Label Klassifikation am Beispiel sozialwissenschaftlicher Texte
posted on 04. February 2017

In this thesis different machine learning algorithms are evaluated for the task of multi-label classification. The evaluation is done with the binary classifiers naive Bayes and support vector machine (SVM) and the multi-class classifier supervised latent Dirichlet allocation (SLDA). To enable naive Bayes and SVM to do multi-label classification the RAkEL transformation is used and for SLDA a topic model multi-label learner is developed and used.

The Reuters-21578 corpus is used. Since not all texts have labels and not all labels occur in sufficient frequency a selection of texts was used. Two corpora were created and used for classification.

The classification results show that the best results are achieved with SVM. Naive Bayes and SLDA give very similar results, but SLDA has a very long runtime.

PDF | BibLaTex

posted on 03. October 2016

I must sadly announce the end of life for TIMA, or at least the end of the TIMA website at https://tima.jnphilipp.org. This is due to the practically nonexistent traffic and my inability to maintain the site. The EOL will be at the end of the month, the 30th of September 2016. After the shutdown I will upload a database dump with the associations to this post.

Update: The TIMA website has now reached its EOL. As promised, a dump of the associations can be downloaded here as a JSON file. For each word, the language, count, identifier and associations are given; here the count indicates how often the word was answered. An association carries the same information, but there the count indicates how often the association was given for the word.
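A minimal sketch of how such a dump might be processed. The key names (id, word, language, count, associations) and the array layout are assumptions made for illustration; the actual keys in the JSON file may differ.

```javascript
// Hypothetical sample in the assumed shape; the real dump's keys may differ.
const sampleDump = [
  {
    id: 1,
    word: "house",
    language: "en",
    count: 5, // how often this word was answered
    associations: [
      { id: 9, word: "roof", language: "en", count: 2 },
      { id: 7, word: "home", language: "en", count: 3 }, // given 3 times for "house"
    ],
  },
];

// Return the n most frequently given associations for one entry.
function topAssociations(entry, n) {
  return [...entry.associations]
    .sort((a, b) => b.count - a.count)
    .slice(0, n)
    .map((a) => a.word);
}

console.log(topAssociations(sampleDump[0], 1));
```

With the real file, the sample array would be replaced by the result of JSON.parse on the downloaded dump.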

[l] Vulnerabilities and other stuff
posted on 06. September 2016

I recently read an interesting post about the target="_blank" vulnerability. This vulnerability leaves a user open to a very simple phishing attack and is not widely known. When a link uses the target="_blank" attribute without the rel="noopener" attribute (or, in the case of Firefox, rel="noopener noreferrer"), the opening site gives the new site access to the existing window through the window.opener API, allowing a few permissions. Some of these permissions are automatically negated by cross-domain restrictions, but window.location is fair game.

To see this vulnerability in action you can use this link. It'll open the post in a new tab/window and redirect this window to another page.

The snippet below shows the code the newly opened page needs in order to redirect the opening site to a new location via the window.opener API.

    // Runs on the newly opened page; window.opener is the tab that opened it.
    if (window.opener) {
        window.opener.location = "https://jnphilipp.org/pages/page/gone-phishing/?referrer=" + encodeURIComponent(document.referrer);
    }

Because of that post, I removed all target="_blank" attributes from the links. I also had a few other changes that had piled up and which I hadn't gotten around to putting online. Most are on the back end. On the front end I mainly changed the color of the sidebar.
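For sites that would rather keep target="_blank", the other option is to add the rel values to every such link. The following is a minimal sketch, assuming a browser DOM; the helper name hardenRel is made up for this example.

```javascript
// Hypothetical helper: return a rel value that contains both
// "noopener" and "noreferrer", keeping any tokens already present.
function hardenRel(rel) {
  const tokens = new Set((rel || "").split(/\s+/).filter(Boolean));
  tokens.add("noopener");
  tokens.add("noreferrer");
  return [...tokens].join(" ");
}

// In a browser, apply it to every link that opens a new tab or window.
if (typeof document !== "undefined") {
  for (const link of document.querySelectorAll('a[target="_blank"]')) {
    link.setAttribute("rel", hardenRel(link.getAttribute("rel")));
  }
}
```

Keeping the token handling in a separate function means it can be checked without a browser, while the DOM loop only runs where a document exists.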

[l] New features
posted on 09. March 2016

Over the last few weeks I added a few new features. The most extensive one is the API. The API consists of two parts: the first retrieves the posts and projects as JSON, the other is an OAI-PMH endpoint, which returns XML. At the moment I only support metadata in the Dublin Core format, but I plan to add CMDI. For details on the API I added a page to the project section. The second feature was inspired by this post about signing web content using PGP. I added signatures to the posts and projects, which can be viewed in the source code and verified using my public key or with Keybase. On a side note, I got new certificates from Let’s Encrypt and am now forcing HTTPS.

[l] TIMA progress report
posted on 27. August 2015

Since my last post about TIMA a few things have happened and changed. We added an FAQ page and, most noticeably, a section with games to the website. Currently there is only one: AssociationChain.

AssociationChain is a simple game in which you and TIMA build an association chain together. The rules are as follows: You and TIMA alternately associate a word to the previous association. The goal is to build long chains.

The apps are a work in progress. The basic functionality of the website is in the app and works; we'll see to it that the rest gets in and that we can distribute it. The game apps will take some more time.

As for TIMA itself, we currently support four languages: German, English, Spanish and Farsi. We have a total of over 3400 words and over 4000 unique associations.

[l] TIMA
posted on 15. July 2015

TIMA, short for "TIMA is my association", is a citizen science project I am currently working on. The goal of the project is to build a large database of associations. To get the associations we need your help. Everyone who wants to can go to our website and start. First you select a language, then you are given a word and asked to type in your association with that word. For each association you receive points and a new word.

Besides the website we are in the process of building some apps. The first and most basic app follows the design concept of the website: it gives you words and asks for your associations. In a later phase we plan apps with a more game-like approach. One will be based on the concept of the German TV show Familien-Duell. I will write more about that in a later post.

In addition to collecting associations we also publish them. The website has a list of all the words and their associations, with some graphs and statistics. We also have an extensive API through which the data can be exported. In addition we have included OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting), a low-barrier mechanism to expose metadata for repositories. Our base URL is https://tima.jnphilipp.org/oai2/.
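OAI-PMH requests are plain HTTP requests against the base URL, with the verb and its arguments passed as query parameters. The sketch below only builds such URLs and does not send anything; the helper name oaiRequestUrl is made up for this example, while Identify, ListRecords and the oai_dc metadata prefix come from the OAI-PMH specification.

```javascript
// Base URL as given in the post.
const BASE_URL = "https://tima.jnphilipp.org/oai2/";

// Hypothetical helper: build an OAI-PMH request URL. The verb and any
// further arguments are passed as query parameters.
function oaiRequestUrl(verb, args = {}) {
  const url = new URL(BASE_URL);
  url.searchParams.set("verb", verb);
  for (const [key, value] of Object.entries(args)) {
    url.searchParams.set(key, value);
  }
  return url.toString();
}

console.log(oaiRequestUrl("Identify"));
console.log(oaiRequestUrl("ListRecords", { metadataPrefix: "oai_dc" }));
```

A harvester would fetch these URLs and parse the XML that comes back.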

This is one of the websites I mentioned in my previous post about Bootstrap. The website is written in Python using the Django web framework. You can check out the code on GitHub.

[l] Somewhat new design
posted on 01. July 2015

I recently had to build a few websites using Bootstrap, about which I'll write a bit soon. Since the design I used when I built this site was somewhat crude, I started redesigning it with Bootstrap. The result of these efforts is now online. Enjoy!

[l] SimCrawler
posted on 14. May 2015

I'm sorry I'm a little late with this, but I finally got around to writing this post. Last term I took a course where we had to write a simulated web crawler and implement different crawling strategies. The complete code, along with detailed descriptions of the inputs and of how to compile and run it, is on GitHub.

First we had to implement a breadth-first search strategy, and then two page-level strategies (backlink count and OPIC) and two site-level strategies (round robin and max page priority), which could be combined as desired. Finally, we were to develop a formula that combines OPIC, backlink count and the ratio of good to bad pages into a strategy called OPTIMAL.
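To give an idea of how a page-level strategy like OPIC works, here is a simplified sketch of its core idea: every page holds some "cash", crawling a page hands its cash on to the pages it links to, and the page with the most cash is crawled next. This is not SimCrawler's implementation; the link graph and all names are made up for the example.

```javascript
// Tiny hypothetical link graph: page -> pages it links to.
const links = {
  a: ["b", "c"],
  b: ["c"],
  c: ["a"],
};

// Every page starts with the same share of cash; nothing accumulated yet.
const pages = Object.keys(links);
const cash = Object.fromEntries(pages.map((p) => [p, 1 / pages.length]));
const accumulated = Object.fromEntries(pages.map((p) => [p, 0]));

// Crawl one page: record its cash as accumulated importance, then split
// that cash equally among the pages it links to and reset its own cash.
function crawl(page) {
  const amount = cash[page];
  accumulated[page] += amount;
  cash[page] = 0;
  for (const target of links[page]) {
    cash[target] += amount / links[page].length;
  }
}

// Pick the page holding the most cash as the next one to crawl
// (ties go to the page listed first).
function nextPage() {
  return pages.reduce((best, p) => (cash[p] > cash[best] ? p : best));
}

crawl("a");
console.log(nextPage());
```

The total amount of cash stays constant, so pages that many others link to collect more of it over time and are visited earlier.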

On the first run two input files need to be provided: the first one is the link graph and the second one the quality mapping. Before the actual crawling starts, the files are read and stored in a MapDB for easy access. As long as the MapDB files exist there is no need to provide the link graph and quality mapping files. If they are provided, the MapDB will be recreated. For performance reasons the crawling itself is done in threads via a ScheduledThreadPoolExecutor. A single thread performs the crawling of a single site.

For the course we had a link graph with about 230 million entries (including duplicates) on which we were to run our tests: 5000 steps with 200 URLs per step, once with a batch size of 100 and once with 500. The batch size dictates the update intervals for backlink count and OPIC. The runtimes are in the table below and the performance in the graphs.

Strategy                                              Runtime
breadth-first search                                  1:11.241s
round robin, backlink count, batch size: 100          3:03.407s
round robin, backlink count, batch size: 500          2:59.153s
round robin, OPIC, batch size: 100                    3:35.574s
round robin, OPIC, batch size: 500                    3:06.375s
max page priority, OPIC, batch size: 100              3:30.296s
max page priority, OPIC, batch size: 500              3:03.940s
max page priority, backlink count, batch size: 100    3:14.608s
max page priority, backlink count, batch size: 500    2:17.481s

[Figure: SimCrawler performance]

[Figure: SimCrawler OPTIMAL 100]

[Figure: SimCrawler OPTIMAL 500]

[l] Smart meter: A case study
posted on 06. June 2014

Over the last few weeks I worked with some people on a smart meter project. Our goal was to show how to receive the data on a large scale and how to handle it. We divided the project into three parts. The first was a generator for lifelike data. The second part, based on Apache Storm and Apache Accumulo, received and stored the data, and the third part generated reports with MapReduce.

The code can be found on GitHub.

[l] Welcome
posted on 25. March 2014

Welcome to my new blog. It's been quite some time since my last blog. Back then I ran it on some old hardware I had. When it crashed I made a half-hearted attempt with WordPress, which quickly died down. Now I have a virtual server, mainly to run some other projects, and in the course of setting it up I decided to start a new blog. Nothing fancy, just something to keep a record of my projects and ideas.