<Book (part)>
Improving OCR for Historical Documents by Modeling Image Distortion

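Creator
Language
Publisher
Date issued
Source title
Start page
End page
Conference information
Publication type
Access rights
Related DOI
Related DOI
Related DOI
Related URI
Related information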
Abstract Archives hold printed historical documents, many of which have deteriorated. It is difficult to extract text from images of such documents without errors using optical character recognition (OCR). This problem reduces the accuracy of information retrieval. Therefore, it is necessary to improve the performance of OCR for images of deteriorated documents. One approach is to convert images of deteriorated documents to clear images, making it easier for an OCR system to recognize the text. Performing this conversion with a neural network requires training data. It is hard to prepare training data consisting of pairs of a deteriorated image and an image from which the deterioration has been removed; however, it is easy to prepare training data consisting of pairs of a clear image and an image created by adding noise to it. In this study, PDFs of historical documents were collected and converted to text and JPEG images. Noise was added to the JPEG images to create a dataset in which the images had noise similar to that of the actual printed documents. U-Net, a type of neural network, was trained using this dataset. The performance of OCR on a noisy image in the test data was compared with the performance of OCR on the image generated from it by the trained U-Net. An improvement in the OCR recognition rate was confirmed.
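The pairing idea in the abstract (a clear image as the target, a noised copy as the network input) can be illustrated with a minimal sketch. The abstract does not specify the noise model, so the Gaussian jitter, the speckle, the parameter values, and the names `add_noise`, `sigma`, and `speckle` are all assumptions made for illustration, not the authors' method.

```python
import random

def add_noise(clean, rng, sigma=0.08, speckle=0.02):
    """Return a noisy copy of a grayscale image (rows of floats in [0, 1]).

    Assumed noise model, for illustration only: Gaussian jitter on most
    pixels, plus a small fraction of pixels forced to black or white.
    """
    noisy = []
    for row in clean:
        out = []
        for value in row:
            r = rng.random()
            if r < speckle / 2:
                value = 0.0  # dark speck
            elif r > 1.0 - speckle / 2:
                value = 1.0  # bright dropout
            else:
                value = value + rng.gauss(0.0, sigma)
            out.append(min(1.0, max(0.0, value)))  # clamp to [0, 1]
        noisy.append(out)
    return noisy

rng = random.Random(0)
clean_page = [[1.0] * 32 for _ in range(32)]  # blank white page
for x in range(4, 28):
    clean_page[16][x] = 0.0  # one dark stroke standing in for text

noisy_page = add_noise(clean_page, rng)
training_pair = (noisy_page, clean_page)  # (network input, target)
```

A denoising network would be trained to map the first element of each pair to the second. A real pipeline would likely operate on image arrays loaded from the JPEG files rather than nested lists.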

Full-text file

pdf 2927456 pdf 329 KB 361  

Details

Record ID
Related URI
Related ISBN
Related ISSN
Subject
Notes
Date registered 2020.05.12
Date updated 2021.06.16
