We are committed to reproducible and replicable research. We therefore generally make our research software publicly available. We are also involved in the development of several language technology tools. This page gives only a short overview; please refer to the project websites for more information.
DKPro Core – Linguistic Preprocessing
DKPro Core is a collection of software components for natural language processing (NLP) based on the Apache UIMA framework.
Many NLP tools are already freely available in the NLP research community. DKPro Core provides UIMA components that wrap these tools (along with some original tools) so that they can be used interchangeably in UIMA processing pipelines. DKPro Core builds heavily on uimaFIT, which allows for rapid and easy development of NLP processing pipelines, for wrapping existing tools, and for creating original UIMA components.
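The idea of interchangeable components can be illustrated without any UIMA dependency. The following is a minimal sketch in plain Java, not the DKPro Core or uimaFIT API: `Document`, `Component`, the two tokenizers, and `run` are all hypothetical names chosen for illustration. It only shows why wrapping tools behind a common contract lets one tool be swapped for another without touching the rest of the pipeline.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical names throughout; this is NOT the DKPro Core or uimaFIT API.
final class PipelineSketch {

    /** Shared data container that components enrich (a much-simplified stand-in for a UIMA CAS). */
    static final class Document {
        final String text;
        final List<String> tokens = new ArrayList<>();

        Document(String text) {
            this.text = text;
        }
    }

    /** Common contract implemented by every wrapped tool. */
    interface Component {
        void process(Document doc);
    }

    /** Tokenizer variant 1: splits on whitespace only. */
    static final class WhitespaceTokenizer implements Component {
        @Override
        public void process(Document doc) {
            doc.tokens.clear();
            for (String token : doc.text.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    doc.tokens.add(token);
                }
            }
        }
    }

    /** Tokenizer variant 2: splits on anything that is not a letter or digit. */
    static final class RegexTokenizer implements Component {
        @Override
        public void process(Document doc) {
            doc.tokens.clear();
            for (String token : doc.text.split("[^\\p{L}\\p{N}]+")) {
                if (!token.isEmpty()) {
                    doc.tokens.add(token);
                }
            }
        }
    }

    /** Downstream component that works with either tokenizer. */
    static final class Lowercaser implements Component {
        @Override
        public void process(Document doc) {
            doc.tokens.replaceAll(token -> token.toLowerCase(Locale.ROOT));
        }
    }

    /** Runs the components in order over a fresh document. */
    static Document run(String text, Component... components) {
        Document doc = new Document(text);
        for (Component component : components) {
            component.process(doc);
        }
        return doc;
    }

    public static void main(String[] args) {
        // Swapping the tokenizer does not require changes to the Lowercaser.
        System.out.println(run("Hello, World!", new WhitespaceTokenizer(), new Lowercaser()).tokens);
        System.out.println(run("Hello, World!", new RegexTokenizer(), new Lowercaser()).tokens);
    }
}
```

In an actual UIMA pipeline, the shared container and the component contract are provided by the framework itself; the sketch only mirrors the design principle.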
DKPro TC – Text Classification
DKPro TC is a UIMA-based text classification framework built on top of DKPro Core and DKPro Lab. It is intended to facilitate supervised machine learning experiments with any kind of textual data.
DKPro Similarity – Word and Text Similarity
DKPro Similarity is an open source framework for word and text similarity. It comprises a wide variety of measures ranging from ones based on simple n-grams and common subsequences to high-dimensional vector comparisons and structural, stylistic, and phonetic measures.
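To give a feel for the simpler end of this range, here is a minimal sketch of two string-based measures: a Jaccard overlap of character n-grams and the length of the longest common subsequence. The class and method names are hypothetical and are not part of the DKPro Similarity API; the measures in the framework may be defined and normalized differently.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical names; this is NOT the DKPro Similarity API.
final class SimpleSimilarity {

    /** Collects the distinct character n-grams of a string. */
    static Set<String> charNGrams(String text, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    /** Jaccard overlap of the two n-gram sets: shared n-grams divided by all distinct n-grams. */
    static double nGramJaccard(String a, String b, int n) {
        Set<String> gramsA = charNGrams(a, n);
        Set<String> gramsB = charNGrams(b, n);
        if (gramsA.isEmpty() && gramsB.isEmpty()) {
            return 1.0; // convention chosen for this sketch: no n-grams on either side counts as identical
        }
        Set<String> intersection = new HashSet<>(gramsA);
        intersection.retainAll(gramsB);
        Set<String> union = new HashSet<>(gramsA);
        union.addAll(gramsB);
        return (double) intersection.size() / union.size();
    }

    /** Length of the longest common subsequence, computed by dynamic programming. */
    static int lcsLength(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                } else {
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
                }
            }
        }
        return dp[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(nGramJaccard("night", "nacht", 2)); // one shared bigram ("ht") out of seven
        System.out.println(lcsLength("night", "nacht"));       // common subsequence "nht"
    }
}
```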
DKPro Toolbox – Teaching NLP in Java
DKPro Toolbox aims to provide simplified access to linguistic processing of text in a Java environment. It was inspired by NLTK, which provides similar functionality for Python. DKPro Toolbox builds upon DKPro Core in order to provide state-of-the-art preprocessing capabilities in a simplified setup.
JWPL – Wikipedia API
JWPL (Java Wikipedia Library) is a free, Java-based application programming interface that provides access to all information in Wikipedia. In addition to the core functionality, JWPL provides access to Wikipedia’s edit history through the Wikipedia Revision Toolkit.
jWeb1T – Frequency Count API
jWeb1T is an open source Java tool for efficiently searching n-gram data in the Web 1T 5-gram corpus format. It is based on a binary search algorithm that finds the n-grams and returns their frequency counts in logarithmic time. Because the corpus is stored in many files, a simple index is used to locate the files containing the requested n-grams.
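The two-step lookup described above (index to find the file, then binary search within it) can be sketched as follows. This is an in-memory illustration under stated assumptions, not the jWeb1T implementation or API: each "file" is modelled as a sorted array, the index maps the first n-gram of each file to that file, and all class and method names are hypothetical. It also assumes that Java's natural string ordering matches the sort order of the data.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical names; this is NOT the jWeb1T API. Files are simulated by in-memory arrays.
final class NGramLookupSketch {

    /** One sorted chunk of the corpus, standing in for a single file on disk. */
    static final class Chunk {
        final String[] ngrams; // sorted in natural string order
        final long[] counts;   // counts[i] belongs to ngrams[i]

        Chunk(String[] ngrams, long[] counts) {
            this.ngrams = ngrams;
            this.counts = counts;
        }
    }

    /** Simple index: first n-gram of each chunk mapped to the chunk itself. */
    private final TreeMap<String, Chunk> index = new TreeMap<>();

    /** Registers a chunk; the n-grams must already be sorted and non-empty. */
    void addChunk(String[] sortedNGrams, long[] counts) {
        index.put(sortedNGrams[0], new Chunk(sortedNGrams, counts));
    }

    /** Returns the frequency count of an n-gram, or 0 if it is not in the corpus. */
    long frequency(String ngram) {
        // Step 1: the only chunk that can contain the n-gram is the one whose
        // first entry is the greatest key less than or equal to the n-gram.
        Map.Entry<String, Chunk> entry = index.floorEntry(ngram);
        if (entry == null) {
            return 0; // sorts before the first n-gram of the corpus
        }
        // Step 2: binary search inside that chunk, logarithmic in the chunk size.
        Chunk chunk = entry.getValue();
        int position = Arrays.binarySearch(chunk.ngrams, ngram);
        return position >= 0 ? chunk.counts[position] : 0;
    }

    public static void main(String[] args) {
        NGramLookupSketch lookup = new NGramLookupSketch();
        // Made-up toy counts for illustration only.
        lookup.addChunk(new String[] {"a cat", "a dog"}, new long[] {120, 95});
        lookup.addChunk(new String[] {"the cat", "the dog"}, new long[] {300, 410});
        System.out.println(lookup.frequency("the dog"));
        System.out.println(lookup.frequency("a zebra"));
    }
}
```

With data on disk, step 2 would typically seek within the selected file instead of searching an array, but the division of labour between the index and the binary search stays the same.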