PatTR is a sentence-parallel corpus extracted from the MAREC patent collection. The current version contains more than 22 million German-English and 18 million French-English parallel sentences collected from all patent text sections, as well as 5 million German-French sentence pairs from patent titles, abstracts and claims.
The corpus is organized by language pair and by patent text section: title, abstract, claims and description. Parallel data from the title, abstract and claims sections were extracted from documents belonging to the European Patent Office (EPO) and the World Intellectual Property Organization (WIPO) corpora in MAREC. Both resources feature multilingual documents that contain, for example, both an English and a German abstract.
Since there are no multilingual descriptions, data from this section were collected by exploiting patent families to align German and French documents from the EPO corpus with English documents from the United States Patent and Trademark Office (USPTO) corpus, following Masao Utiyama and Hitoshi Isahara: A Japanese-English patent parallel corpus. MT Summit XI (2007), 475-482.
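The family-based pairing of description documents can be illustrated with a minimal Python sketch. It is not the code used to build PatTR: the function name `pair_by_family`, the `(doc_id, family_id)` tuple layout and all identifiers in the sample data are hypothetical, chosen only to show the idea that documents sharing a patent family are treated as candidate translations of each other.

```python
from collections import defaultdict


def pair_by_family(source_docs, english_docs):
    """Pair source-language and English documents that share a patent family.

    Both arguments are iterables of (doc_id, family_id) tuples.
    This record layout is an assumption made for illustration only.
    """
    # Index the English documents by their family identifier.
    english_by_family = defaultdict(list)
    for doc_id, family_id in english_docs:
        english_by_family[family_id].append(doc_id)

    # Every source document is paired with each English document
    # of the same family; documents without a match are dropped.
    pairs = []
    for doc_id, family_id in source_docs:
        for english_id in english_by_family.get(family_id, []):
            pairs.append((doc_id, english_id))
    return pairs


# Hypothetical sample data: German/French EPO documents and USPTO documents.
epo_docs = [
    ("EP-0001-de", "fam-A"),
    ("EP-0002-fr", "fam-B"),
    ("EP-0003-de", "fam-C"),  # no USPTO counterpart in this sample
]
uspto_docs = [
    ("US-1001", "fam-A"),
    ("US-1002", "fam-B"),
    ("US-1003", "fam-B"),
]

pairs = pair_by_family(epo_docs, uspto_docs)
for source_id, english_id in pairs:
    print(source_id, english_id)
```

In this sketch a family with several English members yields several candidate pairs; how such cases were resolved in the actual corpus is not stated here, and the resulting document pairs would still need sentence alignment.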
All sections were sentence-aligned using the Gargantua aligner. Preprocessing was done automatically. Sentence boundaries were detected using the Europarl processing tools.
For a detailed description of the corpus construction process, please see the publications above.