Part 1: Document Description
|
Citation |
|
---|---|
Title: |
PatTR: Patent Translation Resource |
Identification Number: |
doi:10.11588/data/10002 |
Distributor: |
heiDATA |
Date of Distribution: |
2014-06-05 |
Version: |
3 |
Bibliographic Citation: |
Wäschle, Katharina; Riezler, Stefan, 2014, "PatTR: Patent Translation Resource", https://doi.org/10.11588/data/10002, heiDATA, V3 |
Citation |
|
Title: |
PatTR: Patent Translation Resource |
Identification Number: |
doi:10.11588/data/10002 |
Authoring Entity: |
Wäschle, Katharina (Department of Computational Linguistics) |
Riezler, Stefan (Department of Computational Linguistics) |
|
Producer: |
Wäschle, Katharina |
Date of Production: |
2012 |
Distributor: |
heiDATA |
Access Authority: |
Prof. Dr. Stefan Riezler |
Date of Deposit: |
2014-05-22 |
Holdings Information: |
https://doi.org/10.11588/data/10002 |
Study Scope |
|
Keywords: |
Computer and Information Science |
Abstract: |
PatTR is a sentence-parallel corpus extracted from the MAREC patent collection (http://www.ir-facility.org/prototypes/marec). The current version contains more than 22 million German-English and 18 million French-English parallel sentences collected from all patent text sections, as well as 5 million German-French sentence pairs from patent titles, abstracts and claims. |
The corpus is organized by language pair and by patent text section, namely title, abstract, claims and description. Parallel data from the title, abstract and claims sections were extracted from documents belonging to the European Patent Office (EPO, http://www.epo.org/) and World Intellectual Property Organization (WIPO, http://www.wipo.int/portal/en/index.html) corpora in MAREC. Both resources feature multilingual documents that contain, for example, both an English and a German abstract. |
Since there are no multilingual descriptions, data from this section were collected by exploiting patent families to align German and French documents from the EPO corpus with English documents from the United States Patent and Trademark Office (USPTO, http://www.uspto.gov/) corpus, following Utiyama, Masao and Isahara, Hitoshi: A Japanese-English patent parallel corpus. MT Summit XI (2007), 475-482. |
All sections were sentence-aligned using the Gargantua aligner (http://sourceforge.net/projects/gargantua/). Preprocessing was done automatically. Sentence boundaries were detected using the Europarl processing tools (http://www.statmt.org/europarl/). |
For a detailed description of the corpus construction process, please see the related publications. |
Time Period: |
1976 to 2008-06 |
Methodology and Processing |
|
Sources Statement |
|
Data Access |
|
Archive Where Study was Originally Stored: |
http://www.cl.uni-heidelberg.de/statnlpgroup/pattr/ |
Extent of Collection: |
22M German-English parallel sentences, 18M French-English parallel sentences, > 5M German-French sentence pairs from patent titles, abstracts and claims |
Citation Requirement: |
Please cite Wäschle & Riezler (2012b) if you use the corpus in your work. |
Other Study Description Materials |
|
Related Materials |
|
MAREC patent collection: http://www.ir-facility.org/prototypes/marec |
|
Related Publications |
|
Citation |
|
Title: |
Wäschle, K. and Riezler, S. (2012b). Analyzing Parallelism and Domain Similarities in the MAREC Patent Corpus. Multidisciplinary Information Retrieval, pp. 12-27. |
Bibliographic Citation: |
Wäschle, K. and Riezler, S. (2012b). Analyzing Parallelism and Domain Similarities in the MAREC Patent Corpus. Multidisciplinary Information Retrieval, pp. 12-27. |
Citation |
|
Title: |
Wäschle, K. and Riezler, S. (2012a). Structural and Topical Dimensions in Multi-Task Patent Translation. Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012), Avignon, France. |
Bibliographic Citation: |
Wäschle, K. and Riezler, S. (2012a). Structural and Topical Dimensions in Multi-Task Patent Translation. Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012), Avignon, France. |
Label: |
de-en-abstract-title.tar.gz |
Text: |
data set de-en, Part 1/3 (License information: see part 1) |
Notes: |
application/x-gzip |
Label: |
de-en-claims.tar.gz |
Text: |
data set de-en, Part 2/3 |
Notes: |
application/x-gzip |
Label: |
de-en-description.tar.gz |
Text: |
data set de-en, Part 3/3 (License information: see part 1) |
Notes: |
application/x-gzip |
Label: |
en-fr-abstract-title.tar_1.gz |
Text: |
data set en-fr, Part 1/3 |
Notes: |
application/x-gzip |
Label: |
en-fr-claims.tar.gz |
Text: |
data set en-fr, Part 2/3 (License information: see part 1) |
Notes: |
application/x-gzip |
Label: |
en-fr-description.tar.gz |
Text: |
data set en-fr, Part 3/3 (License information: see part 1) |
Notes: |
application/x-gzip |
Label: |
fr-de.tar.gz |
Text: |
data set fr-de, Part 1/1 |
Notes: |
application/x-gzip |
Label: |
README_PatTR.txt |
Text: | |
Notes: |
text/plain |