We are committed to reproducible and replicable research

Thus, we generally make all research software publicly available. We are also involved in the development of several language technology tools.
We only give a short overview here; please refer to the project websites for more information.

Code - Tools

  • c-test builder is a web-based tool that allows the easy creation of c-test style language tests.
  • DKPro Core is a collection of software components for natural language processing (NLP) based on the Apache UIMA framework.
  • DKPro Lab is a lightweight framework for parameter sweeping experiments.
  • DKPro Similarity is an open source framework for word and text similarity.
  • DKPro TC is a UIMA-based text classification framework built on top of DKPro Core and DKPro Lab.
  • DKPro Toolbox aims to provide simplified access to linguistic processing of text in a Java environment.
  • escrito is a toolkit for scoring student writing using NLP techniques. It addresses two main user groups: teachers and NLP researchers.
  • jWeb1T is an open source Java tool for efficiently searching n-gram data in the Web 1T 5-gram corpus format.
  • JWPL (Java Wikipedia Library) is a free, Java-based application programming interface that provides access to all information in Wikipedia.

Code - Models

  • German chunker model for OpenNLP, trained on the TIGER corpus. The TIGER annotations were used with a number of systematic modifications to match them to TreeTagger chunks. The model reaches an F-score of 96%.

Code - Experiments

Data

  • Good-Bad news dataset includes around 7K highly reliable, manually annotated good and bad news tweets from 5 different domains: Health, Environment and Geography, Terrorism, Technology, and Natural Disaster.
  • News-Not News dataset includes around 3K manually annotated news and not-news tweets from 5 different domains: Health, Environment and Geography, Terrorism, Technology, and Natural Disaster.
  • ASAP spelling: this repository contains gold-standard spelling correction annotations for the test data section of the ASAP short answer scoring corpus.
  • Diacritization Data (Quran, RDI & Tashkeela) (link is coming soon)
  • Hate speech datasets: Ross et al., Benikova et al., and Wojatzki et al.
  • Semantic Relatedness Datasets

Demos