Statistical Natural Language Processing Group

The Statistical Natural Language Processing Group is part of the Department of Computational Linguistics.
Our research uses statistical learning techniques to address various aspects of the problem of the confusion of languages.
Research topics include the following:

Statistical machine translation, statistical parsing, question answering, information retrieval, learning-to-rank.
Statistical machine learning methods, especially unsupervised, semi-supervised and discriminative learning techniques.

1 to 10 of 19 Results

boostclir.tar.gz Jun 16, 2014 - BoostCLIR: JP-EN Relevance Marked Patent Corpus Gzip Archive - 241.8 MB - MD5: 35fde8d24e6e80bf932490549c991a3f data data set
BoostCLIR: JP-EN Relevance Marked Patent Corpus Jun 16, 2014 Sokolov, Artem; Jehl, Laura; Hieber, Felix; Ruppert, Eugen; Riezler, Stefan, 2014, "BoostCLIR: JP-EN Relevance Marked Patent Corpus", https://doi.org/10.11588/data/10001, heiDATA, V1 BoostCLIR is a bilingual (Japanese-English) corpus of patent abstracts, extracted from the MAREC patent data and from the NTCIR PatentMT workshop collections, accompanied by relevance judgements for the task of patent prior-art search. Important: The English side of t...
de-en-abstract-title.tar.gz Jun 5, 2014 - PatTR: Patent Translation Resource Gzip Archive - 234.3 MB - MD5: 3bd140f68ab0eefe239e3e893012c991 de-en data set de-en, Part 1/3 (License information: see part 1)
de-en-claims.tar.gz Jun 5, 2014 - PatTR: Patent Translation Resource Gzip Archive - 1.3 GB - MD5: 2d1336fe8eecd100c01488f5e3e9bc97 de-en data set de-en, Part 2/3
de-en-description.tar.gz Jun 5, 2014 - PatTR: Patent Translation Resource Gzip Archive - 1.3 GB - MD5: b838211b8ddc04001d79f7e1e2e066cb de-en data set de-en, Part 3/3 (License information: see part 1)
en-fr-abstract-title.tar_1.gz Jun 5, 2014 - PatTR: Patent Translation Resource Gzip Archive - 669.7 MB - MD5: bf9d77a06ebd10d50648c2c8d300c5e2 en-fr data set en-fr, Part 1/3
en-fr-claims.tar.gz Jun 5, 2014 - PatTR: Patent Translation Resource Gzip Archive - 1.0 GB - MD5: 421d98c4fea4eebd076044acffd77095 en-fr data set en-fr, Part 2/3 (License information: see part 1)
en-fr-description.tar.gz Jun 5, 2014 - PatTR: Patent Translation Resource Gzip Archive - 628.3 MB - MD5: a4a327f7104842bbc86ccb6bdfbc229e en-fr data set en-fr, Part 3/3 (License information: see part 1)
fr-de.tar.gz Jun 5, 2014 - PatTR: Patent Translation Resource Gzip Archive - 645.5 MB - MD5: 120484093f5f930fe8646eb3b3be76e3 fr-de data set fr-de, Part 1/1
LibriVoxDeEn - A Corpus for German-to-English Speech Translation and Speech Recognition Jun 13, 2020 Beilharz, Benjamin; Sun, Xin, 2019, "LibriVoxDeEn - A Corpus for German-to-English Speech Translation and Speech Recognition", https://doi.org/10.11588/data/TMEDTX, heiDATA, V2 This dataset is a corpus of sentence-aligned triples of German audio, German text, and English translation, based on German audio books. The corpus consists of over 100 hours of audio material and over 50k parallel sentences. The speech data are low in disfluencies because of the...
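
The MD5 values listed for the archives above can be used to check that a download arrived intact. The following is a minimal sketch, assuming the archive has already been saved locally; the helper names `md5_of_file` and `matches_listing` and the local file path are hypothetical and not part of heiDATA.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`.

    The file is read in chunks so that multi-gigabyte archives
    do not have to fit in memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, expected_md5):
    """Compare a local file against an MD5 string copied from the listing."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `matches_listing("boostclir.tar.gz", "35fde8d24e6e80bf932490549c991a3f")` should return `True` if a local copy of that archive (the path is an assumption) matches the checksum shown in the listing. MD5 is adequate for detecting transfer corruption but is not a defence against deliberate tampering.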

