Empirical Linguistics and Computational Language Modeling (LiMo)

Data publications of the Leibniz ScienceCampus “Empirical Linguistics and Computational Language Modeling”

The Leibniz ScienceCampus “Empirical Linguistics and Computational Language Modeling” (LiMo) is a cooperative research project between the Leibniz Institute for the German Language (Leibniz-Institut für Deutsche Sprache, IDS) in Mannheim and the Department of Computational Linguistics at Heidelberg University (ICL). The project aims to develop new methods, models, and tools for automatically compiling and analysing large German text corpora that cover different domains, genres, and language varieties.

The project is supported by funds from the Baden-Württemberg Ministry of Science, Research and the Arts and the Leibniz Association together with funds provided by the Leibniz Institute for the German Language and Heidelberg University.

Funding Period: 2015 – 2020

1 to 10 of 185 Results

xsrl_mbert_aligner.zip
Feb 17, 2021 - X-SRL Dataset and mBERT Word Aligner
ZIP Archive - 37.7 KB - MD5: 6b35c476556dfdb2b9b25a7a1cdc755d
Code

X-SRL Dataset and mBERT Word Aligner
Feb 17, 2021
Daza, Angel, 2021, "X-SRL Dataset and mBERT Word Aligner", https://doi.org/10.11588/data/HVXXIJ, heiDATA, V1
This code contains a method to automatically align words from parallel sentences by using multilingual BERT pre-trained embeddings. This can be used to transfer source annotations (for example labeled English sentences) into the target side (for example a German translation of th...

twitter_titling_corpus.tab
Aug 23, 2019 - Twitter Titling Corpus
Tabular Data - 219.0 KB - 5 Variables, 4002 Observations - UNF:6:+F3lLKziwMvjy+xyktkilw==
Data

Twitter Titling Corpus
Aug 23, 2019
van den Berg, Esther; Korfhage, Katharina; Ruppenhofer, Josef; Wiegand, Michael; Markert, Katja, 2019, "Twitter Titling Corpus", https://doi.org/10.11588/data/IOHXDF, heiDATA, V1, UNF:6:+F3lLKziwMvjy+xyktkilw== [fileUNF]
The Twitter Titling Corpus contains 4002 stance-annotated tweets collected between 20 June 2017 and 30 August 2017 mentioning 6 presidents. Each tweet is annotated for the naming form used to refer to the president, for the purpose of a study on the relation between naming variat...

tweeDe.conllu
Mar 26, 2020 - tweeDe
Unknown - 945.9 KB - MD5: 32d20db78b577a921d9fd4bc3868770e
Data

tweeDe
Mar 26, 2020
Rehbein, Ines; Ruppenhofer, Josef; Do, Bich-Ngoc, 2020, "tweeDe", https://doi.org/10.11588/data/S90S35, heiDATA, V1
A German UD Twitter treebank, with >12,000 tokens from 519 tweets, annotated in the Universal Dependencies framework

tubadz-pp.tar.gz
Nov 13, 2023 - Real-World PP Attachment Disambiguation Dataset
Gzip Archive - 1.7 MB - MD5: b2d04463fd249e1a19e641a99c65e70d
Data

tubadz-pp-aux.tar.gz
Nov 13, 2023 - Real-World PP Attachment Disambiguation Dataset
Gzip Archive - 4.3 MB - MD5: b37e0268b451b32e52948e47baf80603
Data

topological-field-labeler.zip
Nov 13, 2023 - Topological Field Labeler for German
ZIP Archive - 32.4 KB - MD5: 3bf4fe4ba2daaade0ae9c765233145c3
Code

Topological Field Labeler for German
Nov 13, 2023 - Neural Techniques for German Dependency Parsing
Do, Bich-Ngoc; Rehbein, Ines, 2023, "Topological Field Labeler for German", https://doi.org/10.11588/data/YYNQFF, heiDATA, V1
This resource contains the code of the topological labeler used in the paper: Do and Rehbein (2020). "Parsers Know Best: German PP Attachment Revisited". For this tool, labeling topological field is formulated as a sequence labeling task. We also include in this resource two pre-...
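
The file entries above carry MD5 checksums, which can be used to confirm that a downloaded copy is intact. The following is a minimal Python sketch of such a check, using only the standard library. The function names md5_of_file and matches_listing are illustrative and not part of heiDATA or Dataverse, and the sketch assumes the file has already been downloaded to a local path.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, expected_md5):
    """Compare a local file's MD5 with the checksum shown in the listing.

    `expected_md5` is the hex string copied from the listing; surrounding
    whitespace and letter case are ignored.
    """
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, matches_listing("xsrl_mbert_aligner.zip", "6b35c476556dfdb2b9b25a7a1cdc755d") would check the first archive in the listing, assuming it has been saved under that name in the working directory. MD5 is suitable here only for detecting accidental corruption, not for guarding against deliberate tampering.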
