Journal of Digital Information Management


Vol. 19, Issue No. 2, 2021

Retrieving and Processing Images from the Pages of a Historical Newspaper and Modeling the Text Topics
Gildácio J. de A. Sá, José E. B. Maia
Universidade Estadual do Ceará – UECE Ciência da Computação - CCT 60714-903 - Fortaleza - Ceará - Brasil
Abstract: Historical newspapers are a research source for the human and social sciences. However, these image collections are difficult to read by machine because of low print quality, the lack of page standardization, and the poor photographic quality of some files. This paper presents the processing model of a topic navigation system for historical newspaper page images. The general procedure consists of four modules: segmentation of text sub-images and text extraction; preprocessing and representation; induced topic extraction and representation; and a document viewing and retrieval interface. The algorithmic and technological approaches of each module are described, and initial test results on a collection spanning 28 years are presented.
Keywords: Historical Newspapers, Lexical Standardization, Induced Topic Model, Information Retrieval, Natural Language Processing
DOI: https://doi.org/10.6025/jdim/2021/19/2/41-46