View: |
Part 1: Document Description
|
Citation |
|
---|---|
Title: |
ACL word segmentation correction |
Identification Number: |
doi:10.11588/data/VK99LU |
Distributor: |
heiDATA |
Date of Distribution: |
2019-07-15 |
Version: |
1 |
Bibliographic Citation: |
Nastase, Vivi; Hitschler, Julian, 2019, "ACL word segmentation correction", https://doi.org/10.11588/data/VK99LU, heiDATA, V1 |
Citation |
|
Title: |
ACL word segmentation correction |
Identification Number: |
doi:10.11588/data/VK99LU |
Authoring Entity: |
Nastase, Vivi (Department of Computational Linguistics, Heidelberg University, Germany) |
Hitschler, Julian (Department of Computational Linguistics, Heidelberg University, Germany) |
|
Date of Production: |
2018 |
Distributor: |
heiDATA |
Access Authority: |
Nastase, Vivi |
Holdings Information: |
https://doi.org/10.11588/data/VK99LU |
Study Scope |
|
Keywords: |
Arts and Humanities, Computer and Information Science, character-level sequence-to-sequence model, word segmentation, ACL collection |
Topic Classification: |
knowledge discovery |
Abstract: |
This collection consists of two parallel directories. The first ("raw") contains the raw text of 18850 articles from the ACL 2013/02 collection; the second ("re-segmented") contains the word-resegmented versions of the same articles, produced with nematus, a sequence-to-sequence neural model used for machine translation. The work was motivated by the observation that spurious spaces appeared to be very common in the text, particularly in older papers whose text was obtained by OCR-ing scanned copies. |
Kind of Data: |
textual data |
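The parallel layout described in the abstract can be traversed with a short script. The following is a minimal sketch, not part of the dataset: it assumes the archive has been extracted so that the "raw" and "re-segmented" directories sit under a common root, and that each article uses the same relative file path in both directories. The function name `pair_parallel_files` is hypothetical.

```python
from pathlib import Path


def pair_parallel_files(root):
    """Yield (raw_path, resegmented_path) for files found in both directories.

    Assumption: both directories live directly under `root` and an article
    has the same relative path in each of them.
    """
    raw_dir = Path(root) / "raw"
    reseg_dir = Path(root) / "re-segmented"
    raw_files = sorted(p for p in raw_dir.rglob("*") if p.is_file())
    for raw_path in raw_files:
        candidate = reseg_dir / raw_path.relative_to(raw_dir)
        # Skip articles that have no counterpart in the re-segmented directory.
        if candidate.is_file():
            yield raw_path, candidate
```

Calling `list(pair_parallel_files("path/to/extracted/archive"))` would then give the article pairs available for a side-by-side comparison; the path shown is a placeholder.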
Methodology and Processing |
|
Sources Statement |
|
Data Access |
|
Other Study Description Materials |
|
Related Publications |
|
Citation |
|
Title: |
Nastase, V. and Hitschler, J. (2018). Correction of OCR word segmentation errors in articles from the ACL collection through neural machine translation methods. In Proceedings of the 11th International Conference on Language Resources and Evaluation, pages 706–711, 7-12 May 2018, Miyazaki, Japan. |
Identification Number: |
https://www.cl.uni-heidelberg.de/english/research/downloads/resource_pages/ACL_corrected/lrec2018_correction-ocr-word.pdf |
Bibliographic Citation: |
Nastase, V. and Hitschler, J. (2018). Correction of OCR word segmentation errors in articles from the ACL collection through neural machine translation methods. In Proceedings of the 11th International Conference on Language Resources and Evaluation, pages 706–711, 7-12 May 2018, Miyazaki, Japan. |
Label: |
acl-201302_word-resegmented.tar.gz |
Text: |
text files |
Notes: |
application/gzip |
Label: |
README |
Notes: |
text/plain; charset=US-ASCII |