Persistent Identifier
|
doi:10.11588/data/PVYWKB |
Publication Date
|
2023-11-02 |
Title
| Jing bao ground truth – text block crops and annotations |
Author
| Henke, Konstantin (Computational Linguistics, Heidelberg University) - ORCID: https://orcid.org/0000-0002-6878-2761
Arnold, Matthias (Heidelberg Centre of Transcultural Studies) - ORCID: https://orcid.org/0000-0003-0876-6177 |
Point of Contact
|
Arnold, Matthias (Heidelberg Centre of Transcultural Studies) |
Description
| This is the data set related to the paper "Language Model Assisted OCR Classification for Republican Chinese Newspaper Text", JDADH 11/2023. In this work, we present methods to obtain a neural optical character recognition (OCR) tool for article blocks in a Republican Chinese newspaper.
The dataset contains two subsets:
- The pairs of text block crops and corresponding ground truth annotations from April 1920, 1930 and 1939 of the Jingbao newspaper (jingbao_annotated_crops.zip).
- The labeled images of single characters which we automatically cropped from the April 1939 issues of the Jingbao using separators generated from projection profiles (jingbao_char_imgs.zip).
(2022-08-08) |
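The description mentions separators generated from projection profiles. The following is a minimal sketch of that general technique, not the authors' implementation. It assumes a binarized crop of a single vertical text column represented as a list of pixel rows (1 = ink, 0 = background); the function names `ink_runs` and `crop_characters` and the toy image are hypothetical and chosen only for illustration.

```python
def ink_runs(profile, threshold=0):
    """Return (start, end) pairs of consecutive profile entries above threshold.

    `end` is exclusive, so the pairs can be used directly as slice bounds.
    """
    runs, start = [], None
    for i, value in enumerate(profile):
        if value > threshold and start is None:
            start = i                      # a run of ink begins
        elif value <= threshold and start is not None:
            runs.append((start, i))        # a blank entry ends the run
            start = None
    if start is not None:                  # the run reaches the image border
        runs.append((start, len(profile)))
    return runs


def crop_characters(column_img):
    """Split a binarized vertical text column into character crops.

    The row profile counts ink pixels per row; blank rows act as separators.
    """
    row_profile = [sum(row) for row in column_img]
    return [column_img[top:bottom] for top, bottom in ink_runs(row_profile)]


# Toy column (assumption, not dataset content): two glyphs separated by a blank row.
toy = [
    [0, 0, 0],
    [1, 1, 1],
    [1, 0, 1],
    [0, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
]
crops = crop_characters(toy)
print(len(crops), [len(c) for c in crops])
```

Real scans are noisy, so a practical variant would likely raise `threshold` above zero or smooth the profile before searching for gaps.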
Subject
| Arts and Humanities; Computer and Information Science |
Keyword
| Datensatz / data set (GND) http://d-nb.info/gnd/4011133-7
Multilingual computing (LCSH) http://id.loc.gov/authorities/subjects/sh99004311
Bildsegmentierung (GND) https://d-nb.info/gnd/4145448-0
Modeling languages (Computer science) (LCSH) http://id.loc.gov/authorities/subjects/sh2012003486
Chinese newspapers (LCSH) http://id.loc.gov/authorities/subjects/sh85024343 |
Topic Classification
| Multilingual computing (LCSH) http://id.loc.gov/authorities/subjects/sh99004311
Optical character recognition (LCSH) http://id.loc.gov/authorities/subjects/sh2010007127
114-01 General and Comparative Linguistics, Experimental Linguistics, Typology, Non-European Languages (DFG Classification of Subject Areas (2024-28)) |
Related Publication
| Henke, Konstantin, and Matthias Arnold. "Language Model Assisted OCR Classification for Republican Chinese Newspaper Text(以語言模型輔助民國報紙文本的光學字元辨識分類)." In: 數位典藏與數位人文 12期(2023年10月)(Journal of Digital Archives and Digital Humanities, Issue 12, 10.2023), 1-19. doi: 10.6853/DADH.202310_(12).0001 https://doi.org/10.6853/DADH.202310_(12).0001
Henke, K. (2021). Building and improving an OCR classifier for Republican Chinese newspaper text (Unpublished Bachelor’s thesis). Heidelberg University, Heidelberg, Germany. doi: 10.11588/heidok.00030845 https://archiv.ub.uni-heidelberg.de/volltextserver/30845/ |
Language
| Chinese |
Producer
| Henke, Konstantin (Heidelberg University)
Heidelberg Centre for Transcultural Studies (HCTS) https://www.asia-europe.uni-heidelberg.de/
|
Production Date
| 2022-06-23 |
Production Location
| Heidelberg Centre for Transcultural Studies, University of Heidelberg |
Contributor
| Data Collector : Henke, Konstantin
Project Manager : Arnold, Matthias
Hosting Institution : Heidelberg Research Architecture, University of Heidelberg |
Depositor
| Heidelberg Research Architecture |
Deposit Date
| 2022-08-08 |
Time Period
| Start Date: 1920-04 ; End Date: 1940-04 |
Date of Collection
| Start Date: 2021 ; End Date: 2022 |
Data Type
| Image files in jpg and png formats |
Related Dataset
| Arnold, Matthias, 2022, "Early Chinese Periodicals Online (ECPO) [Metadata]", https://doi.org/10.11588/data/Z3J0DV, heiDATA, V1
GitHub Repo "Early Chinese Periodicals Online (ECPO)" https://github.com/exc-asia-and-europe/ecpo |
Other Reference
| Early Chinese Periodicals Online (ECPO) https://uni-heidelberg.de/ecpo |