Statistical Patterns of Diacritized and Undiacritized Yorùbá Texts

Asubiaro, Toluwase. International Journal of Computational Linguistics Research, 6(3), 2015. http://www.dline.info/jcl/fulltext/v6n3/v6n3_2.pdf

Standard Yorùbá orthography relies heavily on diacritics, both to mark tone and to represent characters outside the ANSI character set. Many Yorùbá documents do not apply the diacritics consistently, because specialized, language-specific input devices are rarely available. This study therefore examines the statistical implications of inconsistent diacritic use in electronic Yorùbá documents for the distribution of words in the diacritized and undiacritized versions of its texts. The texts were modeled with Zipf's and Heaps' laws on n-grams (for n = 1, 2 and 3), using a diacritically marked corpus of 1,089,318 words and an unmarked version of the same corpus. The Zipf graphs of the two corpora showed no significant difference. The Heaps graphs of the diacritized and undiacritized texts, on the other hand, deviated significantly from the base. This shows that the use of diacritics significantly affects the single-word distribution of the language, but that the effect diminishes in the distribution of co-occurrences of two or more words.
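The comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the author's code: the function names, the whitespace tokenization, and the four-word toy sample are assumptions made for illustration, and diacritics are removed here by Unicode NFD decomposition, which may differ from how the unmarked corpus was actually produced. As general background, Zipf's law states that the frequency of an item is roughly proportional to 1/r^s for rank r, and Heaps' law states that vocabulary size grows roughly as K·N^β with the number of tokens N.

```python
import unicodedata
from collections import Counter


def strip_diacritics(text):
    """Drop combining marks (tone marks, under-dots) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def zipf_table(items):
    """Return (rank, frequency) pairs, most frequent item first (Zipf view)."""
    counts = Counter(items)
    return [(rank, freq)
            for rank, (_, freq) in enumerate(counts.most_common(), start=1)]


def heaps_curve(items):
    """Return the number of distinct items seen after each item (Heaps view)."""
    seen = set()
    curve = []
    for item in items:
        seen.add(item)
        curve.append(len(seen))
    return curve


# Hypothetical toy sample: four word forms that differ only in diacritics.
# Escapes are used so the combining marks are unambiguous in the source.
marked_tokens = [
    "\u1ecdk\u1ecd",         # o-with-dot-below, k, o-with-dot-below
    "\u1ecdk\u1ecd\u0301",   # same, with an acute tone mark on the last vowel
    "\u1ecdk\u1ecd\u0300",   # same, with a grave tone mark on the last vowel
    "oko",                   # no diacritics
]
unmarked_tokens = [strip_diacritics(tok) for tok in marked_tokens]

if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, "marked:  ", heaps_curve(ngrams(marked_tokens, n)))
        print(n, "unmarked:", heaps_curve(ngrams(unmarked_tokens, n)))
    print("Zipf (unmarked unigrams):", zipf_table(unmarked_tokens))
```

In this toy sample the four marked forms collapse into a single unmarked type, so the unigram vocabulary curve flattens once diacritics are removed. On a real corpus, the rank-frequency pairs and vocabulary curves would normally be plotted on log-log axes to compare the two versions.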