WikiCLIR: A Cross-Lingual Retrieval Dataset from Wikipedia (doi:10.11588/data/10003)

Document Description
Citation
Title:	WikiCLIR: A Cross-Lingual Retrieval Dataset from Wikipedia
Identification Number:	doi:10.11588/data/10003
Distributor:	heiDATA
Date of Distribution:	2014-06-18
Version:	1
Bibliographic Citation:	Hieber, Felix; Schamoni, Shigehiko; Sokolov, Artem; Riezler, Stefan, 2014, "WikiCLIR: A Cross-Lingual Retrieval Dataset from Wikipedia", https://doi.org/10.11588/data/10003, heiDATA, V1
Study Description
Citation
Title:	WikiCLIR: A Cross-Lingual Retrieval Dataset from Wikipedia
Identification Number:	doi:10.11588/data/10003
Authoring Entity:	Hieber, Felix (Department of Computational Linguistics)
	Schamoni, Shigehiko (Department of Computational Linguistics)
	Sokolov, Artem (Department of Computational Linguistics)
	Riezler, Stefan (Department of Computational Linguistics)
Producer:	Hieber, Felix
Date of Production:	2014
Distributor:	heiDATA
Distributor:	HeiDATA: Heidelberg Research Data Repository
Access Authority:	Prof. Dr. Stefan Riezler
Date of Deposit:	2014-05-22
Holdings Information:	https://doi.org/10.11588/data/10003
Study Scope
Keywords:	Computer and Information Science
Abstract:	WikiCLIR is a large-scale German-English retrieval data set for Cross-Language Information Retrieval (CLIR). It contains a total of 245,294 German single-sentence queries with 3,200,393 automatically extracted relevance judgments for 1,226,741 English Wikipedia articles as documents. Queries are well-formed natural language sentences that allow large-scale training of (translation-based) ranking models.
	The corpus contains training, development and testing subsets randomly split on the query level.
	Relevance judgments are constructed from the inter-language links between German and English Wikipedia articles. A relevance level of 3 is assigned to the English cross-lingual mate, and level 2 to all other English articles that both link to the mate and are linked to by the mate. Our intuition for level 2 is that articles in a bidirectional link relation to the mate are likely either to define similar concepts or to be instances of the concept defined by the mate.
	For a more detailed description of the corpus construction process, see the publication listed under Related Publications.
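The relevance-level rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' extraction code: the function name `relevance_judgments`, the `outlinks` dictionary representation of the English link graph, and the toy article titles are all assumptions made for illustration only.

```python
def relevance_judgments(mate, outlinks):
    """Return {english_article: level} for one German query article.

    mate     -- the English article reached via the inter-language link
    outlinks -- hypothetical dict: English article -> set of articles it links to
    """
    # Level 3: the cross-lingual mate itself.
    judgments = {mate: 3}
    for candidate in outlinks.get(mate, set()):
        # Level 2: the mate links to the candidate AND the candidate links back.
        if candidate != mate and mate in outlinks.get(candidate, set()):
            judgments[candidate] = 2
    return judgments


# Toy link graph, invented for illustration (not taken from the corpus).
outlinks = {
    "Heidelberg": {"Germany", "Neckar", "Heidelberg University"},
    "Neckar": {"Heidelberg", "Rhine"},
    "Heidelberg University": {"Heidelberg"},
    "Germany": {"Berlin"},
}
result = relevance_judgments("Heidelberg", outlinks)
```

In this toy graph, "Germany" receives no judgment because it does not link back to the mate, which matches the bidirectional-link condition stated for level 2.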
Date of Collection:	2013-11-04 to 2013-11-22
Methodology and Processing
Sources Statement
Data Access
Archive Where Study was Originally Stored:	http://www.cl.uni-heidelberg.de/statnlpgroup/wikiclir/
Extent of Collection:	245,294 German single-sentence queries with 3,200,393 automatically extracted relevance judgments for 1,226,741 English Wikipedia articles as documents.
Citation Requirement:	If you use the corpus in your work, please cite the publication listed under Related Publications.
Other Study Description Materials
Related Publications
Citation
Title:	Shigehiko Schamoni, Felix Hieber, Artem Sokolov, Stefan Riezler. "Learning Translational and Knowledge-based Similarities from Relevance Rankings for Cross-Language Retrieval". In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), Baltimore, USA.
Bibliographic Citation:	Shigehiko Schamoni, Felix Hieber, Artem Sokolov, Stefan Riezler. "Learning Translational and Knowledge-based Similarities from Relevance Rankings for Cross-Language Retrieval". In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), Baltimore, USA.
Other Study-Related Materials
Label:	README_WikiCLIR.txt
Text:
Notes:	text/plain; charset=US-ASCII
Other Study-Related Materials
Label:	wikiclir.tar.gz
Text:	data set
Notes:	application/x-gzip