Improving OCR for Historical Documents by Modeling Image Distortion - Collections

＜Book Part＞
Improving OCR for Historical Documents by Modeling Image Distortion

Creator	Creator Name: Maekawa, Keiya (マエカワ, ケイヤ); Affiliation: Kyushu University (九州大学)
	Author PID: K000191; Creator Name: Tomiura, Yoichi (冨浦, 洋一; トミウラ, ヨウイチ); Affiliation: Kyushu University (九州大学)
	Author PID: K006833; Creator Name: Fukuda, Satoshi (福田, 悟志; フクダ, サトシ); Affiliation: Kyushu University (九州大学)
	Author PID: K003977; Creator Name: Ishita, Emi (石田, 栄美; イシタ, エミ); Affiliation: Kyushu University (九州大学)
	Creator Name: Uchiyama, Hideaki (内山, 英昭; ウチヤマ, ヒデアキ); Affiliation: Kyushu University (九州大学)
Language	English
Publisher	Springer Nature
Date	2019-10-29
Source Title	Digital Libraries at the Crossroads of Digital Information for the Future
Vol	11853
First Page	312
Last Page	316
Conference	International Conference on Asian Digital Libraries (ICADL); Sequence: 21; Year: 2019; Place: Kuala Lumpur; Country: Malaysia
Publication Type	Accepted Manuscript
Access Rights	open access
Related DOI	isVersionOf https://doi.org/10.1007/978-3-030-34058-2_31
Related URI	https://fim.uitm.edu.my/icadl2019/
Relation	Digital Libraries at the Crossroads of Digital Information for the Future
Relation	Lecture Notes in Computer Science
Abstract	Archives hold printed historical documents, many of which have deteriorated. It is difficult to extract text from such images without errors using optical character recognition (OCR). This problem reduces the accuracy of information retrieval. Therefore, it is necessary to improve the performance of OCR for images of deteriorated documents. One approach is to convert images of deteriorated documents to clear images, to make it easier for an OCR system to recognize text. To perform this conversion using a neural network, data is needed to train it. It is hard to prepare training data consisting of pairs of a deteriorated image and an image from which deterioration has been removed; however, it is easy to prepare training data consisting of pairs of a clear image and an image created by adding noise to it. In this study, PDFs of historical documents were collected and converted to text and JPEG images. Noise was added to the JPEG images to create a dataset in which the images had noise similar to that of the actual printed documents. U-Net, a type of neural network, was trained using this dataset. The performance of OCR for an image with noise in the test data was compared with the performance of OCR for an image generated from it by the trained U-Net. An improvement in the OCR recognition rate was confirmed.
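
The pairing idea in the abstract (a clear image plus a copy of it with added noise) can be illustrated with a minimal sketch. The noise model here (ink fading, a 3x3 box blur, and salt-and-pepper speckles) and all function and parameter names are assumptions made for illustration only; the record does not state which noise the authors actually modeled.

```python
import numpy as np


def add_synthetic_noise(clean, rng, speckle_prob=0.02, blur_passes=1, fade=0.15):
    """Return a degraded copy of a grayscale page image.

    `clean` is a 2-D float array in [0, 1] (1.0 = paper, 0.0 = ink).
    The three degradations below are hypothetical stand-ins for the
    deterioration seen in real printed documents.
    """
    noisy = clean.astype(np.float64).copy()
    height, width = noisy.shape

    # 1. Fade the ink: pull every pixel part of the way toward paper white.
    noisy = noisy + fade * (1.0 - noisy)

    # 2. Soften strokes with a 3x3 box blur (mean of the 9 shifted copies).
    for _ in range(blur_passes):
        padded = np.pad(noisy, 1, mode="edge")
        noisy = sum(
            padded[dy:dy + height, dx:dx + width]
            for dy in range(3)
            for dx in range(3)
        ) / 9.0

    # 3. Salt-and-pepper speckles: random dark and light pixels.
    mask = rng.random(noisy.shape)
    noisy[mask < speckle_prob / 2] = 0.0
    noisy[mask > 1.0 - speckle_prob / 2] = 1.0

    return np.clip(noisy, 0.0, 1.0)


def make_training_pairs(clean_images, seed=0):
    """Build (noisy, clean) pairs: network input and target, respectively."""
    rng = np.random.default_rng(seed)
    return [(add_synthetic_noise(img, rng), img) for img in clean_images]
```

Under these assumptions, each `(noisy, clean)` pair would serve as one input and target example for a denoising network such as U-Net, with the OCR system run afterwards on the network's output.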

File	FileType	Size	Views	Description
2927456	pdf	329 KB	391

Details

Record ID	2927456
Related ISBN	9783030340575
Related ISSN	1611-3349
Subject Terms	OCR Error
	Information Retrieval
	Historical Document Image
Notes	The study is based on a poster presentation that won the Best Poster Award at the International Conference on Asian Digital Libraries 2019.
Created Date	2020.05.12
Modified Date	2021.06.16
