heiDATA

Metrics

197,264 Downloads

1 to 10 of 16 Results

WikiWarsDE Corpus

Aug 13, 2014 - Database Systems Research Group

Strötgen, Jannik; Gertz, Michael, 2014, "WikiWarsDE Corpus", https://doi.org/10.11588/data/10026, heiDATA, V1

The WikiWarsDE corpus is a German corpus containing Wikipedia articles with annotations of temporal expressions. Its creation was motivated by the English WikiWars corpus (Mazur & Dale 2010). WikiWarsDE was developed to support research on temporal information extraction and norm...

Text und Data Mining an wissenschaftlichen Repositorien und Publikationsservern in Deutschland - Zusammenfassung der Ergebnisse einer Umfrage im Februar und März 2016

Nov 2, 2016 - Perspektive Bibliothek

Drees, Bastian, 2016, "Text und Data Mining an wissenschaftlichen Repositorien und Publikationsservern in Deutschland - Zusammenfassung der Ergebnisse einer Umfrage im Februar und März 2016", https://doi.org/10.11588/data/10090, heiDATA, V2

The contact persons listed on the homepages of academic repositories and publication servers in Germany were surveyed about their experiences with text and data mining. The survey was conducted by e-mail between 22 and 26 February 2016. Contact persons ...

Test data for the Pooch library

Jul 25, 2022 - Scientific Software Center (SSC)

Uieda, Leonardo, 2022, "Test data for the Pooch library", https://doi.org/10.11588/data/TKCFEF, heiDATA, V1

Pooch is an open-source Python library for data download. This archive contains testing data for Pooch's DataVerse download functionality.

Source Code, Data and Additional Material for the Thesis: "Identification of Software Features in Issue Tracking System Data"

Feb 14, 2017 - PhD related Material - Faculty of Mathematics and Computer Science

Merten, Thorsten, 2017, "Source Code, Data and Additional Material for the Thesis: "Identification of Software Features in Issue Tracking System Data"", https://doi.org/10.11588/data/10089, heiDATA, V2

This dataset provides the code and the data sets used in the PhD thesis "Identification of Software Features in Issue Tracking System Data" as well as the files that represent the results measured in experiments. For problem studies (e.g. chapters 10 and 11) the folders include t...

Selectional Preference Embeddings (EMNLP 2017)

Jan 31, 2019 - AIPHES

Heinzerling, Benjamin, 2019, "Selectional Preference Embeddings (EMNLP 2017)", https://doi.org/10.11588/data/FJQ4XL, heiDATA, V1

Joint embeddings of selectional preferences, words, and fine-grained entity types. The vocabulary consists of: verbs and their dependency relation separated by "@", e.g. "sink@nsubj" or "elect@dobj"; words and short noun phrases, e.g. "Titanic"; fine-grained entity types using the...
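The "@" convention described above can be illustrated with a minimal Python sketch. The function name `split_vocab_entry` is hypothetical and not part of the dataset; the sketch assumes only what the description states, namely that a verb and its dependency relation are joined by a single "@", and that entries without "@" are plain words or phrases.

```python
from typing import Optional, Tuple


def split_vocab_entry(entry: str) -> Tuple[str, Optional[str]]:
    """Split a vocabulary entry into (token, dependency_relation).

    Hypothetical helper: assumes verb entries look like "verb@relation"
    and that entries without "@" carry no relation.
    """
    token, sep, relation = entry.partition("@")
    if not sep:
        # No "@" present: treat the entry as a plain word or phrase.
        return entry, None
    return token, relation


for example in ("sink@nsubj", "elect@dobj", "Titanic"):
    print(example, "->", split_vocab_entry(example))
```

How the truncated third entry type (fine-grained entity types) is encoded is not shown in the description, so the sketch does not attempt to handle it specially.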

Pre-trained POS tagging models for German social media

Mar 26, 2020 - Empirical Linguistics and Computational Language Modeling (LiMo)

Rehbein, Ines; Ruppenhofer, Josef; Zimmermann, Victor, 2020, "Pre-trained POS tagging models for German social media", https://doi.org/10.11588/data/W3JBV4, heiDATA, V1

Pre-trained POS tagging models for the HunPos tagger (Halácsy et al. 2007), the biLSTM-char-CRF tagger (Reimers & Gurevych 2017), and Online-Flors (Yin et al. 2015). References: Halácsy, P., Kornai, A., and Oravecz, C. (2007). HunPos: An open source trigram tagger. In Proceedings of th...

PatTR: Patent Translation Resource

Jun 16, 2014 - Statistical Natural Language Processing Group

Wäschle, Katharina; Riezler, Stefan, 2014, "PatTR: Patent Translation Resource", https://doi.org/10.11588/data/10002, heiDATA, V3

PatTR is a sentence-parallel corpus extracted from the MAREC patent collection. The current version contains more than 22 million German-English and 18 million French-English parallel sentences collected from all patent text sections as well as 5 million German-French sentence pa...

Jing bao ground truth – text block crops and annotations

Nov 2, 2023 - Heidelberg Centre for Transcultural Studies (HCTS)

Henke, Konstantin; Arnold, Matthias, 2023, "Jing bao ground truth – text block crops and annotations", https://doi.org/10.11588/data/PVYWKB, heiDATA, V1

This is the data set related to the paper "Language Model Assisted OCR Classification for Republican Chinese Newspaper Text", JDADH 11/2023. In this work, we present methods to obtain a neural optical character recognition (OCR) tool for article blocks in a Republican Chinese new...

Ground Truth transcriptions for training OCR of historical Bengali printed texts – Recognition of Early Indian Printed Documents competition - updated with improved XML coordinates

Mar 21, 2023 - Ground truth data for HTR on South Asian Scripts

Derrick, Tom; British Library, 2023, "Ground Truth transcriptions for training OCR of historical Bengali printed texts – Recognition of Early Indian Printed Documents competition - updated with improved XML coordinates", https://doi.org/10.11588/data/AIQSXL, heiDATA, V1

This dataset comprises 81 digitised images (TIFF files) drawn from a selection of early printed Bengali books (1713-1914) digitised through the Two Centuries of Indian Print project (https://www.bl.uk/projects/two-centuries-of-indian-print). Also contained are ground truth transc...

GermEval-2018 Corpus (DE)

Sep 2, 2019 - Empirical Linguistics and Computational Language Modeling (LiMo)

Wiegand, Michael, 2019, "GermEval-2018 Corpus (DE)", https://doi.org/10.11588/data/0B5VML, heiDATA, V1

This dataset comprises the training and test data (German tweets) from the GermEval 2018 Shared Task on Offensive Language Detection.
