Jun 16, 2014 -
BoostCLIR: JP-EN Relevance Marked Patent Corpus
Gzip Archive - 241.8 MB -
MD5: 35fde8d24e6e80bf932490549c991a3f
data set |
Jun 16, 2014
Sokolov, Artem; Jehl, Laura; Hieber, Felix; Ruppert, Eugen; Riezler, Stefan, 2014, "BoostCLIR: JP-EN Relevance Marked Patent Corpus", https://doi.org/10.11588/data/10001, heiDATA, V1
BoostCLIR is a bilingual (Japanese-English) corpus of patent abstracts, extracted from the MAREC patent data and from the NTCIR PatentMT workshop collections, accompanied by relevance judgements for the task of patent prior-art search. Important: The English side of t... |
Jun 5, 2014 -
PatTR: Patent Translation Resource
Gzip Archive - 234.3 MB -
MD5: 3bd140f68ab0eefe239e3e893012c991
data set de-en, Part 1/3 (License information: see part 1) |
Jun 5, 2014 -
PatTR: Patent Translation Resource
Gzip Archive - 1.3 GB -
MD5: 2d1336fe8eecd100c01488f5e3e9bc97
data set de-en, Part 2/3 |
Jun 5, 2014 -
PatTR: Patent Translation Resource
Gzip Archive - 1.3 GB -
MD5: b838211b8ddc04001d79f7e1e2e066cb
data set de-en, Part 3/3 (License information: see part 1) |
Jun 5, 2014 -
PatTR: Patent Translation Resource
Gzip Archive - 669.7 MB -
MD5: bf9d77a06ebd10d50648c2c8d300c5e2
data set en-fr, Part 1/3 |
Jun 5, 2014 -
PatTR: Patent Translation Resource
Gzip Archive - 1.0 GB -
MD5: 421d98c4fea4eebd076044acffd77095
data set en-fr, Part 2/3 (License information: see part 1) |
Jun 5, 2014 -
PatTR: Patent Translation Resource
Gzip Archive - 628.3 MB -
MD5: a4a327f7104842bbc86ccb6bdfbc229e
data set en-fr, Part 3/3 (License information: see part 1) |
Jun 5, 2014 -
PatTR: Patent Translation Resource
Gzip Archive - 645.5 MB -
MD5: 120484093f5f930fe8646eb3b3be76e3
data set fr-de, Part 1/1 |
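Each archive above is listed with an MD5 checksum, so a download can be checked against it before unpacking. The following is a minimal sketch of that check, not part of the repository itself; the function names and the local file name in the usage comment are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Chunked reading keeps memory use flat even for the multi-gigabyte archives.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed_md5):
    """Compare a file's MD5 with the checksum shown in the listing."""
    return md5_of_file(path) == listed_md5.strip().lower()


# Usage (hypothetical local file name; the checksum is the one listed for fr-de):
# matches_listed_md5("pattr-fr-de.tar.gz", "120484093f5f930fe8646eb3b3be76e3")
```

A mismatch most likely indicates an incomplete or corrupted download, in which case the archive should be fetched again.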
Jun 13, 2020
Beilharz, Benjamin; Sun, Xin, 2019, "LibriVoxDeEn - A Corpus for German-to-English Speech Translation and Speech Recognition", https://doi.org/10.11588/data/TMEDTX, heiDATA, V2
This dataset is a corpus of sentence-aligned triples of German audio, German text, and English translation, based on German audio books. The corpus consists of over 100 hours of audio material and over 50k parallel sentences. The speech data are low in disfluencies because of the... |