PatTR: Patent Translation Resource (doi:10.11588/data/10002)

Document Description

Citation

Title:

PatTR: Patent Translation Resource

Identification Number:

doi:10.11588/data/10002

Distributor:

heiDATA

Date of Distribution:

2014-06-05

Version:

3

Bibliographic Citation:

Wäschle, Katharina; Riezler, Stefan, 2014, "PatTR: Patent Translation Resource", https://doi.org/10.11588/data/10002, heiDATA, V3

Study Description

Citation

Title:

PatTR: Patent Translation Resource

Identification Number:

doi:10.11588/data/10002

Authoring Entity:

Wäschle, Katharina (Department of Computational Linguistics)

Riezler, Stefan (Department of Computational Linguistics)

Producer:

Wäschle, Katharina

Date of Production:

2012

Distributor:

heiDATA

Access Authority:

Prof. Dr. Stefan Riezler

Date of Deposit:

2014-05-22

Holdings Information:

https://doi.org/10.11588/data/10002

Study Scope

Keywords:

Computer and Information Science

Abstract:

PatTR is a sentence-parallel corpus extracted from the MAREC patent collection (http://www.ir-facility.org/prototypes/marec). The current version contains more than 22 million German-English and 18 million French-English parallel sentences collected from all patent text sections, as well as 5 million German-French sentence pairs from patent titles, abstracts and claims.

The corpus is sorted by language pair and by text section of a patent document, namely title, abstract, claims and description. Parallel data from the title, abstract and claims sections were extracted from documents belonging to the European Patent Office (EPO, http://www.epo.org/) and the World Intellectual Property Organization (WIPO, http://www.wipo.int/portal/en/index.html) corpora in MAREC. Both resources feature multilingual documents that contain, for example, both an English and a German abstract.

Since there are no multilingual descriptions, data from this section were collected by exploiting patent families to align German and French documents from the EPO corpus to English documents from the United States Patent and Trademark Office (USPTO, http://www.uspto.gov/) corpus, following Utiyama, Masao and Isahara, Hitoshi: A Japanese-English patent parallel corpus. MT Summit XI (2007), 475-482.

All sections were sentence-aligned using the Gargantua aligner (http://sourceforge.net/projects/gargantua/). Preprocessing was done automatically. Sentence boundaries were detected using the Europarl processing tools (http://www.statmt.org/europarl/).

For a detailed description of the corpus construction process, please see the related publications.
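The family-based document alignment described above can be illustrated with a minimal sketch. It is not the authors' implementation: the record type `PatentDoc`, its fields (`doc_id`, `family_id`, `office`, `lang`) and the function `pair_by_family` are hypothetical names chosen for illustration, and the sketch only assumes that every document carries some key identifying its patent family.

```python
from collections import defaultdict
from typing import Iterable, List, NamedTuple, Tuple


class PatentDoc(NamedTuple):
    """Hypothetical document record; not the MAREC schema."""
    doc_id: str     # assumed unique document identifier
    family_id: str  # assumed key shared by all members of a patent family
    office: str     # "EP" (EPO) or "US" (USPTO)
    lang: str       # "de", "fr" or "en"


def pair_by_family(docs: Iterable[PatentDoc]) -> List[Tuple[str, str]]:
    """Pair German/French EPO documents with English USPTO documents
    that belong to the same patent family."""
    families = defaultdict(list)
    for doc in docs:
        families[doc.family_id].append(doc)

    pairs = []
    for members in families.values():
        ep_docs = [d for d in members if d.office == "EP" and d.lang in ("de", "fr")]
        us_docs = [d for d in members if d.office == "US" and d.lang == "en"]
        # Every EPO document is paired with every USPTO document of its family.
        pairs.extend((e.doc_id, u.doc_id) for e in ep_docs for u in us_docs)
    return pairs
```

Document pairs found this way would then still need sentence alignment, which the corpus authors performed with Gargantua.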

Time Period:

1976-2008-06

Methodology and Processing

Sources Statement

Data Access

Archive Where Study was Originally Stored:

http://www.cl.uni-heidelberg.de/statnlpgroup/pattr/

Extent of Collection:

22M German-English parallel sentences, 18M French-English parallel sentences, > 5M German-French sentence pairs from patent titles, abstracts and claims

Citation Requirement:

Please cite Wäschle & Riezler (2012b) if you use the corpus in your work.

Other Study Description Materials

Related Materials

MAREC patent collection: http://www.ir-facility.org/prototypes/marec

Related Publications

Citation

Title:

Wäschle, K. and Riezler, S. (2012b). Analyzing Parallelism and Domain Similarities in the MAREC Patent Corpus. Multidisciplinary Information Retrieval, pp. 12-27.

Bibliographic Citation:

Wäschle, K. and Riezler, S. (2012b). Analyzing Parallelism and Domain Similarities in the MAREC Patent Corpus. Multidisciplinary Information Retrieval, pp. 12-27.

Citation

Title:

Wäschle, K. and Riezler, S. (2012a). Structural and Topical Dimensions in Multi-Task Patent Translation. Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012), Avignon, France.

Bibliographic Citation:

Wäschle, K. and Riezler, S. (2012a). Structural and Topical Dimensions in Multi-Task Patent Translation. Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012), Avignon, France.

Other Study-Related Materials

Label:

de-en-abstract-title.tar.gz

Text:

data set de-en, Part 1/3 (License information: see part 1)

Notes:

application/x-gzip

Other Study-Related Materials

Label:

de-en-claims.tar.gz

Text:

data set de-en, Part 2/3

Notes:

application/x-gzip

Other Study-Related Materials

Label:

de-en-description.tar.gz

Text:

data set de-en, Part 3/3 (License information: see part 1)

Notes:

application/x-gzip

Other Study-Related Materials

Label:

en-fr-abstract-title.tar_1.gz

Text:

data set en-fr, Part 1/3

Notes:

application/x-gzip

Other Study-Related Materials

Label:

en-fr-claims.tar.gz

Text:

data set en-fr, Part 2/3 (License information: see part 1)

Notes:

application/x-gzip

Other Study-Related Materials

Label:

en-fr-description.tar.gz

Text:

data set en-fr, Part 3/3 (License information: see part 1)

Notes:

application/x-gzip

Other Study-Related Materials

Label:

fr-de.tar.gz

Text:

data set fr-de, Part 1/1

Notes:

application/x-gzip

Other Study-Related Materials

Label:

README_PatTR.txt

Text:

Notes:

text/plain
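
The data files listed above are gzip-compressed tar archives. As a minimal sketch of how sentence pairs might be read from one of them, the following assumes that an archive contains two UTF-8 text files that are line-aligned (line i of one file is the translation of line i of the other). That layout and the member names passed to the function are assumptions, not something this record states; README_PatTR.txt should be consulted for the actual structure. The function name `iter_sentence_pairs` is hypothetical.

```python
import io
import os
import tarfile
from typing import Iterator, Tuple


def iter_sentence_pairs(archive, src_member: str, tgt_member: str) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned members
    of a .tar.gz archive.

    `archive` is either a file system path or a binary file object.
    The member names are assumptions supplied by the caller.
    """
    if isinstance(archive, (str, os.PathLike)):
        tar = tarfile.open(name=archive, mode="r:gz")
    else:
        tar = tarfile.open(fileobj=archive, mode="r:gz")
    with tar:
        src = tar.extractfile(src_member)
        tgt = tar.extractfile(tgt_member)
        if src is None or tgt is None:
            raise ValueError("both members must be regular files")
        src_lines = io.TextIOWrapper(src, encoding="utf-8")
        tgt_lines = io.TextIOWrapper(tgt, encoding="utf-8")
        # zip stops at the shorter file; aligned files should have equal length.
        for s, t in zip(src_lines, tgt_lines):
            yield s.rstrip("\n"), t.rstrip("\n")
```

Reading two members in an interleaved fashion forces repeated seeks in the compressed stream, so for archives of this size it is probably faster to unpack the archive to disk first and read the extracted files directly.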