Statistical Natural Language Processing Group

The Statistical Natural Language Processing Group is part of the Department of Computational Linguistics.
Our research uses statistical learning techniques to address various aspects of the problem of the confusion of languages.
Research topics include the following:

Statistical machine translation, statistical parsing, question answering, information retrieval, learning-to-rank.
Statistical machine learning methods, especially unsupervised, semi-supervised and discriminative learning techniques.

1 to 10 of 19 Results

LibriVoxDeEn - A Corpus for German-to-English Speech Translation and Speech Recognition
Jun 13, 2020
Beilharz, Benjamin; Sun, Xin, 2019, "LibriVoxDeEn - A Corpus for German-to-English Speech Translation and Speech Recognition", https://doi.org/10.11588/data/TMEDTX, heiDATA, V2
This dataset is a corpus of sentence-aligned triples of German audio, German text, and English translation, based on German audio books. The corpus consists of over 100 hours of audio material and over 50k parallel sentences. The speech data are low in disfluencies because of the...
librivoxdeen-1.01_part1.tar.gz - Jun 13, 2020 - Gzip Archive - 20.3 GB - MD5: 9fb23ee878584f4cab717e348cdeeaaf - Data
librivoxdeen-1.01_part2.tar.gz - Jun 13, 2020 - Gzip Archive - 17.2 GB - MD5: daf33d0f1242bad5a623b061fbaa426d - Data
README.md - Oct 21, 2019 - Markdown Text - 4.3 KB - MD5: 971f62ef7dc31254dfc0e25f14347bc1 - Documentation

WikiCLIR: A Cross-Lingual Retrieval Dataset from Wikipedia
Jun 18, 2014
Hieber, Felix; Schamoni, Shigehiko; Sokolov, Artem; Riezler, Stefan, 2014, "WikiCLIR: A Cross-Lingual Retrieval Dataset from Wikipedia", https://doi.org/10.11588/data/10003, heiDATA, V1
WikiCLIR is a large-scale (German-English) retrieval data set for Cross-Language Information Retrieval (CLIR). It contains a total of 245,294 German single-sentence queries with 3,200,393 automatically extracted relevance judgments for 1,226,741 English Wikipedia articles as docu...
README_WikiCLIR.txt - Jun 18, 2014 - Plain Text - 1.8 KB - MD5: f2d15639b962977ea19a20308bccbfc4 - README
wikiclir.tar.gz - Jun 18, 2014 - Gzip Archive - 846.8 MB - MD5: 8f51894ff1c6ba2987d07dde62b3143d - data, data set

BoostCLIR: JP-EN Relevance Marked Patent Corpus
Jun 16, 2014
Sokolov, Artem; Jehl, Laura; Hieber, Felix; Ruppert, Eugen; Riezler, Stefan, 2014, "BoostCLIR: JP-EN Relevance Marked Patent Corpus", https://doi.org/10.11588/data/10001, heiDATA, V1
BoostCLIR is a bilingual (Japanese-English) corpus of patent abstracts, extracted from the MAREC patent data and the data from the NTCIR PatentMT workshop collections, accompanied by relevance judgements for the task of patent prior-art search. Important: The English side of t...
boostclir.tar.gz - Jun 16, 2014 - Gzip Archive - 241.8 MB - MD5: 35fde8d24e6e80bf932490549c991a3f - data, data set
README_BoostCLIR.txt - Jun 16, 2014 - Plain Text - 1.5 KB - MD5: 544fa4db045f692d07a7d4596da99741 - README
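
Because each file in the listing above carries an MD5 checksum, a download can be checked against it before unpacking. The following is a minimal sketch, assuming the archives have been saved locally under their listed file names; the helper names `md5_of_file` and `verify_download` are hypothetical and are not part of heiDATA or any dataset README.

```python
import hashlib
from pathlib import Path

# MD5 values copied from the file listing above (archives only).
EXPECTED_MD5 = {
    "librivoxdeen-1.01_part1.tar.gz": "9fb23ee878584f4cab717e348cdeeaaf",
    "librivoxdeen-1.01_part2.tar.gz": "daf33d0f1242bad5a623b061fbaa426d",
    "wikiclir.tar.gz": "8f51894ff1c6ba2987d07dde62b3143d",
    "boostclir.tar.gz": "35fde8d24e6e80bf932490549c991a3f",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks so that
    multi-gigabyte archives are never loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Compare a local file against the checksum listed for its file name.

    Raises KeyError if the file name is not in EXPECTED_MD5.
    """
    name = Path(path).name
    if name not in EXPECTED_MD5:
        raise KeyError(f"no checksum listed for {name}")
    return md5_of_file(path) == EXPECTED_MD5[name]
```

Calling `verify_download("wikiclir.tar.gz")` would return `True` only if the local file's digest matches the listed value; a mismatch usually indicates an incomplete or corrupted download.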

