ACL word segmentation correction (doi:10.11588/data/VK99LU)

Document Description

Citation

Title:

ACL word segmentation correction

Identification Number:

doi:10.11588/data/VK99LU

Distributor:

heiDATA

Date of Distribution:

2019-07-15

Version:

1

Bibliographic Citation:

Nastase, Vivi; Hitschler, Julian, 2019, "ACL word segmentation correction", https://doi.org/10.11588/data/VK99LU, heiDATA, V1

Study Description

Citation

Title:

ACL word segmentation correction

Identification Number:

doi:10.11588/data/VK99LU

Authoring Entity:

Nastase, Vivi (Department of Computational Linguistics, Heidelberg University, Germany)

Hitschler, Julian (Department of Computational Linguistics, Heidelberg University, Germany)

Date of Production:

2018

Distributor:

heiDATA

Access Authority:

Nastase, Vivi

Holdings Information:

https://doi.org/10.11588/data/VK99LU

Study Scope

Keywords:

Arts and Humanities, Computer and Information Science, character-level sequence-to-sequence model, word segmentation, ACL collection

Topic Classification:

knowledge discovery

Abstract:

This collection consists of two parallel directories. One ("raw") contains the raw text of 18850 articles from the ACL 2013/02 collection; the other ("re-segmented") contains the word-resegmented version of the same articles, produced with nematus, a seq2seq neural model used for machine translation. The work was motivated by the observation that spurious spaces appeared to be very common in the text, particularly in older papers whose text was obtained by OCR of scanned pages.
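
The parallel layout described above lends itself to a simple pairing of files across the two directories. The following is a minimal sketch, not part of the distributed data: it assumes the archive has been unpacked into a local folder, that each article has the same relative file name under "raw" and "re-segmented", and that both versions keep the same line breaks. None of these points is confirmed by this codebook, and the function names are hypothetical.

```python
from pathlib import Path


def pair_articles(root):
    """Yield (raw_path, resegmented_path) for files found in both directories.

    Assumption: an article has the same relative path under "raw" and
    "re-segmented". Files without a counterpart are skipped.
    """
    root = Path(root)
    raw_dir = root / "raw"
    reseg_dir = root / "re-segmented"
    for raw_path in sorted(p for p in raw_dir.rglob("*") if p.is_file()):
        reseg_path = reseg_dir / raw_path.relative_to(raw_dir)
        if reseg_path.is_file():
            yield raw_path, reseg_path


def count_changed_lines(raw_path, reseg_path):
    """Count aligned lines that differ between the two versions.

    Assumption: re-segmentation preserves line breaks, so lines can be
    compared position by position. Extra trailing lines in the longer
    file are ignored.
    """
    raw_lines = raw_path.read_text(encoding="utf-8", errors="replace").splitlines()
    reseg_lines = reseg_path.read_text(encoding="utf-8", errors="replace").splitlines()
    return sum(a != b for a, b in zip(raw_lines, reseg_lines))
```

Calling `pair_articles` on the unpacked folder and passing each pair to `count_changed_lines` gives a rough per-article indication of how much text the re-segmentation altered. If the file names or line breaks differ between the two directories, the pairing and comparison would need to be adapted.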

Kind of Data:

textual data

Methodology and Processing

Sources Statement

Data Access

Other Study Description Materials

Related Publications

Citation

Title:

Nastase, V. and Hitschler, J. (2018). Correction of OCR word segmentation errors in articles from the ACL collection through neural machine translation methods. In Proceedings of the 11th International Conference on Language Resources and Evaluation, pages 706–711, 7-12 May 2018, Miyazaki, Japan.

Identification Number:

https://www.cl.uni-heidelberg.de/english/research/downloads/resource_pages/ACL_corrected/lrec2018_correction-ocr-word.pdf

Bibliographic Citation:

Nastase, V. and Hitschler, J. (2018). Correction of OCR word segmentation errors in articles from the ACL collection through neural machine translation methods. In Proceedings of the 11th International Conference on Language Resources and Evaluation, pages 706–711, 7-12 May 2018, Miyazaki, Japan.

Other Study-Related Materials

Label:

acl-201302_word-resegmented.tar.gz

Text:

text files

Notes:

application/gzip

Other Study-Related Materials

Label:

README

Notes:

text/plain; charset=US-ASCII