PatTR: Patent Translation Resource

Version 3.1

Wäschle, Katharina; Riezler, Stefan, 2014, "PatTR: Patent Translation Resource", https://doi.org/10.11588/data/10002, heiDATA, V3

Dataset Metrics

627 Downloads

Description	PatTR is a sentence-parallel corpus extracted from the MAREC patent collection. The current version contains more than 22 million German-English and 18 million French-English parallel sentences collected from all patent text sections, as well as 5 million German-French sentence pairs from patent titles, abstracts and claims. The corpus is sorted by language pair and by the text sections of a patent document, namely title, abstract, claims and description. Parallel data from the title, abstract and claims sections were extracted from documents belonging to the European Patent Office (EPO) and World Intellectual Property Organization (WIPO) corpora in MAREC. Both resources feature multilingual documents that contain, for example, both an English and a German abstract. Since there are no multilingual descriptions, data from this section were collected by exploiting patent families to align German and French documents from the EPO corpus with English documents from the United States Patent and Trademark Office (USPTO) corpus, following Utiyama, Masao and Isahara, Hitoshi: A Japanese-English patent parallel corpus. MT Summit XI (2007), 475-482. All sections were sentence-aligned using the Gargantua aligner. Preprocessing was done automatically. Sentence boundaries were detected using the Europarl processing tools. For a detailed description of the corpus construction process, please see the related publications.
Subject	Computer and Information Science
Related Publication	Wäschle, K. and Riezler, S. (2012b). Analyzing Parallelism and Domain Similarities in the MAREC Patent Corpus. Multidisciplinary Information Retrieval, pp. 12-27.
License/Data Use Agreement	Custom Dataset Terms

Files
	Name	Type	Size	Published	Downloads	MD5	Description	Tag
	de-en-abstract-title.tar.gz	Gzip Archive	234.3 MB	Jun 5, 2014	116	3bd140f68ab0eefe239e3e893012c991	data set de-en, Part 1/3 (License information: see part 1)	de-en
	de-en-claims.tar.gz	Gzip Archive	1.3 GB	Jun 5, 2014	88	2d1336fe8eecd100c01488f5e3e9bc97	data set de-en, Part 2/3	de-en
	de-en-description.tar.gz	Gzip Archive	1.3 GB	Jun 5, 2014	85	b838211b8ddc04001d79f7e1e2e066cb	data set de-en, Part 2/3 (License information: see part 1)	de-en
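
Since each archive is listed with an MD5 checksum, a downloaded file can be checked against it before unpacking. The following is a minimal sketch, not part of the dataset itself: the function names `md5_of_file` and `verify_download` are hypothetical, and it assumes the archives were saved under the file names shown in the listing above.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from the file listing on this page.
EXPECTED_MD5 = {
    "de-en-abstract-title.tar.gz": "3bd140f68ab0eefe239e3e893012c991",
    "de-en-claims.tar.gz": "2d1336fe8eecd100c01488f5e3e9bc97",
    "de-en-description.tar.gz": "b838211b8ddc04001d79f7e1e2e066cb",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Return True if the file's MD5 matches the checksum listed for its name.

    Hypothetical helper: raises KeyError if the file name is not in the listing.
    """
    path = Path(path)
    expected = EXPECTED_MD5.get(path.name)
    if expected is None:
        raise KeyError(f"no checksum listed for {path.name}")
    return md5_of_file(path) == expected
```

Calling `verify_download("de-en-claims.tar.gz")` after the download finishes returns `False` for a truncated or corrupted archive. Reading in chunks matters here because two of the archives are over 1 GB.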

Citation Metadata

Persistent Identifier	doi:10.11588/data/10002
Publication Date	2014-06-05
Title	PatTR: Patent Translation Resource
Author	Wäschle, Katharina (Department of Computational Linguistics); Riezler, Stefan (Department of Computational Linguistics)
Point of Contact	Prof. Dr. Stefan Riezler (Department of Computational Linguistics)
Related Publication	Wäschle, K. and Riezler, S. (2012b). Analyzing Parallelism and Domain Similarities in the MAREC Patent Corpus. Multidisciplinary Information Retrieval, pp. 12-27. http://www.cl.uni-heidelberg.de/~riezler/publications/papers/IRF2012.pdf
	Wäschle, K. and Riezler, S. (2012a). Structural and Topical Dimensions in Multi-Task Patent Translation. Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012), Avignon, France. http://aclweb.org/anthology//E/E12/E12-1083.pdf
Producer	Wäschle, Katharina (Department of Computational Linguistics)
Production Date	2012
Production Location	Heidelberg, Germany
Deposit Date	2014-05-22
Time Period	Start Date: 1976 ; End Date: 2008-06
Related Material	MAREC patent collection: http://www.ir-facility.org/prototypes/marec

Dataset Terms

License/Data Use Agreement

Our Community Norms as well as good scientific practices expect that proper credit is given via citation. Please use the data citation shown on the dataset page.

Custom Dataset Terms — the following Custom Dataset Terms have been defined for this dataset.

PatTR is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License (CC BY-NC-SA 3.0).

Citation Requirements

Please cite Wäschle & Riezler (2012b) if you use the corpus in your work.

Original Archive

http://www.cl.uni-heidelberg.de/statnlpgroup/pattr/

Size of Collection

22M German-English parallel sentences, 18M French-English parallel sentences, > 5M German-French sentence pairs from patent titles, abstracts and claims
