From Historical Newspapers to Machine-Readable Data : The Origami OCR Pipeline

Media type: Conference Proceedings; E-Article
Title: From Historical Newspapers to Machine-Readable Data : The Origami OCR Pipeline
Contributor: Liebl, Bernhard [Author]; Burghardt, Manuel [Author]
Imprint: Aachen: CEUR-WS.org, [2024]
Published in: Proceedings of the Workshop on Computational Humanities Research (CHR 2020) ; 2723, pages 351-373
Language: English
Keywords: end-to-end OCR ; historical newspapers ; layout detection ; deep neural networks
Description: While historical newspapers recently have gained a lot of attention in the digital humanities, transforming them into machine-readable data by means of OCR poses some major challenges. In order to address these challenges, we have developed an end-to-end OCR pipeline named Origami. This pipeline is part of a current project on the digitization and quantitative analysis of the German newspaper “Berliner Börsen-Zeitung” (BBZ), from 1872 to 1931. The Origami pipeline reuses existing open source OCR components and on top offers a new configurable architecture for layout detection, a simple table recognition, a two-stage X-Y cut for reading order detection, and a new robust implementation for document dewarping. In this paper we describe the different stages of the workflow and discuss how they meet the above-mentioned challenges posed by historical newspapers.
Access State: Open Access
Rights information: Attribution (CC BY)
