DLINE Journals

Journal of Digital Information Management

Vol. 19, Issue No. 2, 2021

Retrieving and Processing Images from the Pages of a Historical Newspaper and Modeling the Text Topics

Gildácio J. de A. Sá, José E. B. Maia
Universidade Estadual do Ceará - UECE, Ciência da Computação - CCT, 60714-903 - Fortaleza - Ceará - Brasil

Abstract: Historical newspapers are a source of research for the human and social sciences. However, these image collections are difficult for machines to read because of low print quality, the lack of page standardization, and the poor photographic quality of some files. This paper presents the processing model of a system for topic navigation over page images of a historical newspaper. The general procedure consists of four modules: (1) segmentation of text sub-images and text extraction, (2) preprocessing and representation, (3) induced topic extraction and representation, and (4) a document viewing and retrieval interface. The algorithmic and technological approaches of each module are described, and initial test results on a collection spanning 28 years are presented.
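As a rough illustration of the preprocessing/representation and retrieval modules named in the abstract, the following minimal Python sketch assumes the page text has already been extracted by OCR. All function names, the stop-word list, and the term-frequency cosine ranking are hypothetical stand-ins chosen for illustration; they are not the authors' implementation, and the segmentation and induced topic modules are not covered.

```python
import math
import re
import unicodedata
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOPWORDS = {"a", "o", "de", "do", "da", "e", "em", "no", "the", "of", "and", "in"}


def normalize(text):
    """Toy lexical standardization: lowercase, strip accents, keep alphabetic tokens."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = re.findall(r"[a-z]+", stripped)
    return [t for t in tokens if t not in STOPWORDS and len(t) > 1]


def vectorize(text):
    """Represent a text as a bag-of-words term-frequency vector."""
    return Counter(normalize(text))


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, documents):
    """Return indices of matching documents, most similar to the query first."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(doc)), i) for i, doc in enumerate(documents)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for score, i in scored if score > 0]


# Made-up text snippets standing in for OCR output.
pages = [
    "Eleição municipal em Fortaleza",
    "Preço do algodão no porto",
    "A eleição e o voto",
]
matches = rank("eleicao", pages)
```

Accent stripping lets the unaccented query "eleicao" match the accented "Eleição" in the page text, which is one simple way to absorb spelling variation in noisy OCR output. A real system would replace the term-frequency vectors with the topic representation the paper describes.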

Keywords: Historical Newspapers, Lexical Standardization, Induced Topic Model, Information Retrieval, Natural Language Processing

DOI: https://doi.org/10.6025/jdim/2021/19/2/41-46

