BoostCLIR: JP-EN Relevance Marked Patent Corpus (doi:10.11588/data/10001)

Document Description
Citation
Title:	BoostCLIR: JP-EN Relevance Marked Patent Corpus
Identification Number:	doi:10.11588/data/10001
Distributor:	heiDATA
Date of Distribution:	2014-06-16
Version:	1
Bibliographic Citation:	Sokolov, Artem; Jehl, Laura; Hieber, Felix; Ruppert, Eugen; Riezler, Stefan, 2014, "BoostCLIR: JP-EN Relevance Marked Patent Corpus", https://doi.org/10.11588/data/10001, heiDATA, V1
Study Description
Citation
Title:	BoostCLIR: JP-EN Relevance Marked Patent Corpus
Identification Number:	doi:10.11588/data/10001
Authoring Entity:	Sokolov, Artem (Department of Computational Linguistics)
	Jehl, Laura (Department of Computational Linguistics)
	Hieber, Felix (Department of Computational Linguistics)
	Ruppert, Eugen (Department of Computational Linguistics)
	Riezler, Stefan (Department of Computational Linguistics)
Producer:	Jehl, Laura
	Sokolov, Artem
	Ruppert, Eugen
Date of Production:	2013
Distributor:	heiDATA
Access Authority:	Prof. Dr. Stefan Riezler
Date of Deposit:	2014-05-21
Holdings Information:	https://doi.org/10.11588/data/10001
Study Scope
Keywords:	Computer and Information Science
Abstract:	BoostCLIR is a bilingual (Japanese-English) corpus of patent abstracts, extracted from the MAREC patent data (http://www.ifs.tuwien.ac.at/imp/marec.shtml) and from the NTCIR PatentMT workshop collections (http://research.nii.ac.jp/ntcir/data/data-en.html), accompanied by relevance judgements for the task of patent prior-art search.
	Important: The English side of the corpus contains patent IDs as well as the text of the abstracts. The Japanese side contains only patent IDs because of NTCIR copyright restrictions. The Japanese patent abstracts can be extracted from full-text Japanese patent documents, which are available from the organizers of the NTCIR workshop.
	The corpus contains training, development and testing subsets sampled from non-intersecting time periods.
	Relevance judgements for patent retrieval are constructed from patent citations by assigning three integer levels to three categories of relationships: the highest relevance (3) for family patents, a lower relevance (2) for patents cited in search reports by patent examiners, and the lowest relevance (1) for applicants' citations.
	For a detailed description of the corpus construction process, please see the publication listed under Related Publications.
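The three-level relevance scheme described in the abstract can be illustrated with a minimal sketch. This is not the authors' construction code: the category names, the `(query_id, doc_id, category)` input triples, the example IDs, and the rule that the highest level wins when several relationships link the same pair are all assumptions made for illustration. Only the integer levels (3, 2, 1) come from the abstract.

```python
# Integer relevance levels as stated in the abstract.
# The dictionary keys are hypothetical labels, not names used in the corpus files.
RELEVANCE_LEVELS = {
    "family": 3,              # patents from the same patent family
    "examiner_citation": 2,   # cited in a search report by a patent examiner
    "applicant_citation": 1,  # cited by the applicant
}


def relevance_level(relationships):
    """Return the relevance level for one (query, document) pair.

    `relationships` is an iterable of category labels. Assumption: if several
    relationships apply, the highest level is used; 0 means no relationship.
    """
    return max((RELEVANCE_LEVELS[r] for r in relationships), default=0)


def build_judgements(citations):
    """Collapse (query_id, doc_id, category) triples into one level per pair."""
    judgements = {}
    for query_id, doc_id, category in citations:
        key = (query_id, doc_id)
        level = RELEVANCE_LEVELS[category]
        # Keep the strongest relationship seen so far for this pair.
        judgements[key] = max(judgements.get(key, 0), level)
    return judgements


if __name__ == "__main__":
    # Made-up IDs, for illustration only.
    example = [
        ("Q1", "D1", "applicant_citation"),
        ("Q1", "D1", "family"),
        ("Q1", "D2", "examiner_citation"),
    ]
    for pair, level in sorted(build_judgements(example).items()):
        print(pair, level)
```

For the actual file formats and ID conventions, consult README_BoostCLIR.txt in the distribution.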
Kind of Data:	textual data
Methodology and Processing
Sources Statement
Data Access
Archive Where Study was Originally Stored:	http://www.cl.uni-heidelberg.de/statnlpgroup/boostclir/
Extent of Collection:	1.4M documents, 100K queries
Citation Requirement:	If you use the corpus in your work, please cite: Artem Sokolov, Laura Jehl, Felix Hieber, Stefan Riezler. "Boosting Cross-Language Retrieval by Learning Bilingual Phrase Associations from Relevance Rankings". In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Seattle, USA, 2013
Other Study Description Materials
Related Materials
	MAREC dataset: http://www.ifs.tuwien.ac.at/imp/marec.shtml
	NTCIR collections: http://research.nii.ac.jp/ntcir/data/data-en.html
Related Publications
Citation
Title:	Boosting Cross-Language Retrieval by Learning Bilingual Phrase Associations from Relevance Rankings
Bibliographic Citation:	Artem Sokolov, Laura Jehl, Felix Hieber, Stefan Riezler. "Boosting Cross-Language Retrieval by Learning Bilingual Phrase Associations from Relevance Rankings". In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Seattle, USA, 2013
Other Study-Related Materials
Label:	boostclir.tar.gz
Text:	data set
Notes:	application/x-gzip
Other Study-Related Materials
Label:	README_BoostCLIR.txt
Text:	README
Notes:	text/plain; charset=US-ASCII