Journal of E-Technology

DLINE Journals portal

Print ISSN: 0976-3503
Online ISSN: 0976-2930

A Pre-trained BERT Model for Arabic Author Profiling

Chiyu Zhang, Muhammad Abdul-Mageed
Natural Language Processing Lab, The University of British Columbia, Canada

Abstract: We report our models for detecting age, language variety, and gender from social media data in the context of the Arabic author profiling and deception detection shared task (APDA) [32]. We build simple models based on pre-trained Bidirectional Encoder Representations from Transformers (BERT). We first fine-tune the pre-trained BERT model on each of the three datasets using the data released for the shared task. We then augment the shared task data with in-house data for gender and dialect, showing the utility of augmenting training data. Our best models on the shared task test data are obtained by majority voting over several BERT models trained under different data conditions. We obtain 54.72% accuracy for age, 93.75% for dialect, 81.67% for gender, and 40.97% joint accuracy across the three tasks.
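The majority voting and joint accuracy mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `majority_vote` and `joint_accuracy`, the tie-breaking rule, and the toy labels are assumptions made for illustration, and joint accuracy is assumed here to count an author as correct only when age, dialect, and gender are all predicted correctly.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine per-model label lists into one voted label per example.

    predictions[m][i] is the label that model m assigns to example i.
    Ties go to the label seen first among the models (an assumption;
    the abstract does not say how ties are broken).
    """
    n_examples = len(predictions[0])
    if any(len(p) != n_examples for p in predictions):
        raise ValueError("all models must label the same examples")
    voted = []
    for labels in zip(*predictions):  # labels from every model for one example
        # most_common keeps first-seen order among equal counts (Python 3.7+)
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


def joint_accuracy(gold: Sequence[Tuple[str, str, str]],
                   pred: Sequence[Tuple[str, str, str]]) -> float:
    """Fraction of examples whose (age, dialect, gender) triple is fully correct."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Toy gender predictions from three hypothetical models (not shared task data)
model_a = ["male", "female", "male"]
model_b = ["male", "male", "male"]
model_c = ["female", "female", "male"]
print(majority_vote([model_a, model_b, model_c]))
```

Voting per example keeps each task's ensemble independent of the others, so the same function could be applied separately to age, dialect, and gender predictions before the three voted labels are scored jointly.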

Keywords: Author Profiling Identification, BERT, Arabic, Social Media

Journal of E-Technology, Volume 11, Number 2, May 2020

DOI: https://doi.org/10.6025/jet/2020/11/2/54-59

References:

[1] Abdul-Mageed, M. (2015). Subjectivity and sentiment analysis of Arabic as a morphologically-rich language. Ph.D. thesis, Indiana University.

[2] Abdul-Mageed, M. (2017). Modeling Arabic subjectivity and sentiment in lexical space. Information Processing & Management.

[3] Abdul-Mageed, M., Alhuzali, H., Elaraby, M. (2018). You tweet what you speak: A city-level dataset of Arabic dialects. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018).

[4] Alrifai, K., Rebdawi, G., Ghneim, N. (2017). Arabic tweeps gender and dialect prediction. In: CLEF (Working Notes).

[5] Argamon, S.E. (2019). Register in computational language research. Register Studies 1(1), 100–135.

[6] Bassiouney, R. (2009). Arabic sociolinguistics. Edinburgh University Press.

[7] Bleidorn, W., Hopwood, C. J. (2018). Using machine learning to advance personality assessment and theory. Personality and Social Psychology Review, p. 1088868318772990.

[8] Burger, J. D., Henderson, J., Kim, G., Zarrella, G. (2011). Discriminating gender on Twitter. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. pp. 1301–1309. Association for Computational Linguistics.

[9] Chen, J., Cheng, L., Yang, X., Liang, J., Quan, B., Li, S. (2019). Joint learning with both classification and regression models for age prediction. In: Journal of Physics: Conference Series. vol. 1168, p. 032016. IOP Publishing.

[10] Colleoni, E., Rozza, A., Arvidsson, A. (2014). Echo chamber or public sphere? Predicting political orientation and measuring political homophily in Twitter using big data. Journal of Communication, 64(2), 317–332.

[11] Daneshvar, S., Inkpen, D. (2018). Gender identification in Twitter using n-grams and LSA. In: Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018).

[12] Devlin, J., Chang, M.W., Lee, K., Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

[13] Elaraby, M., Abdul-Mageed, M. (2018). Deep models for Arabic dialect identification on benchmarked data. In: Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018). pp. 263–274.

[14] Flekova, L., Preoţiuc-Pietro, D., Ungar, L. (2016). Exploring stylistic variation with age and income on Twitter. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). pp. 313–319.

[15] Gamon, M. (2004). Linguistic correlates of style: authorship classification with deep linguistic analysis features. In: Proceedings of the 20th International Conference on Computational Linguistics. p. 611. Association for Computational Linguistics.

[16] Goswami, S., Sarkar, S., Rustagi, M. (2009). Stylometric analysis of bloggers' age and gender. In: Third International AAAI Conference on Weblogs and Social Media.

[17] Habash, N. Y. (2010). Introduction to Arabic natural language processing. Synthesis Lectures on Human Language Technologies, 3(1), 1–187.

[18] Hinds, J., Joinson, A. (2019). Human and computer personality prediction from digital footprints. Current Directions in Psychological Science, p. 0963721419827849.

[19] Holes, C. (2004). Modern Arabic: Structures, functions, and varieties. Georgetown University Press.

[20] Holmes, D. I. (1998). The evolution of stylometry in humanities scholarship. Literary and Linguistic Computing, 13(3), 111–117.

[21] Johnson, K., Goldwasser, D. (2018). Classification of moral foundations in microblog political discourse. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 720–730.

[22] Kingma, D., Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[23] Matz, S. C., Kosinski, M., Nave, G., Stillwell, D. J. (2017). Psychological targeting as an effective approach to digital mass persuasion. Proceedings of the National Academy of Sciences, 114(48), 12714–12719.

[24] Mitrou, L., Kandias, M., Stavrou, V., Gritzalis, D. Social media profiling: A panopticon or omniopticon tool? In: Proc. of the 6th Conference of the Surveillance Studies Network. Barcelona, Spain.

[25] Mubarak, H., Darwish, K. (2014). Using Twitter to collect a multi-dialectal corpus of Arabic. In: Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP). pp. 1–7.

[26] Nguyen, D., Smith, N.A., Rosé, C.P. (2011). Author age prediction from text using linear regression. In: Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities. pp. 115–123. Association for Computational Linguistics.

[27] Palva, H. (2006). Dialects: classification. Encyclopedia of Arabic Language and Linguistics, 1, 604–613.

[28] Pang, D., Eichstaedt, J. C., Buffone, A., Slaff, B., Ruch, W., Ungar, L. H. (2019). The language of character strengths: Predicting morally valued traits on social media. Journal of Personality.

[29] Potthast, M., Rosso, P., Stamatatos, E., Stein, B. (2019). A decade of shared tasks in digital text forensics at PAN. In: European Conference on Information Retrieval. pp. 291–300. Springer.

[30] Preoţiuc-Pietro, D., Liu, Y., Hopkins, D., Ungar, L. (2017). Beyond binary labels: political ideology prediction of Twitter users. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 729–740.

[31] Qwaider, C., Saad, M., Chatzikyriakidis, S., Dobnik, S. (2018). Shami: A corpus of Levantine Arabic dialects. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018).

[32] Rangel, F., Rosso, P., Charfi, A., Zaghouani, W., Ghanem, B., Sánchez-Junquera, J. (2019). Overview of the track on author profiling and deception detection in Arabic. In: Mehta, P., Rosso, P., Majumder, P., Mitra, M. (Eds.) Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2019). CEUR Workshop Proceedings, CEUR-WS.org, Kolkata, India, December 12-15.

[33] Rangel, F., Rosso, P., Chugur, I., Potthast, M., Trenkmann, M., Stein, B., Verhoeven, B., Daelemans, W. (2014). Overview of the 2nd author profiling task at PAN 2014. In: CLEF 2014 Evaluation Labs and Workshop Working Notes Papers, Sheffield, UK, 2014. pp. 1–30.

[34] Rangel, F., Rosso, P., Koppel, M., Stamatatos, E., Inches, G. (2013). Overview of the author profiling task at PAN 2013. In: CLEF Conference on Multilingual and Multimodal Information Access Evaluation. pp. 352–365. CELCT.

[35] Rangel, F., Rosso, P., Potthast, M., Stein, B. (2017). Overview of the 5th author profiling task at PAN 2017: Gender and language variety identification in Twitter. Working Notes Papers of the CLEF.

[36] Rangel, F., Rosso, P., Verhoeven, B., Daelemans, W., Potthast, M., Stein, B. (2016). Overview of the 4th author profiling task at PAN 2016: cross-genre evaluations. In: Balog, K., et al. (Eds.) Working Notes Papers of the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings. pp. 750–784.

[37] Rangel Pardo, F.M., Celli, F., Rosso, P., Potthast, M., Stein, B., Daelemans, W. (2015). Overview of the 3rd author profiling task at PAN 2015. In: CLEF 2015 Evaluation Labs and Workshop Working Notes Papers. pp. 1–8.

[38] Rao, D., Yarowsky, D., Shreevats, A., Gupta, M. (2010). Classifying latent user attributes in Twitter. In: Proceedings of the 2nd international workshop on Search and mining user-generated contents. pp. 37–44. ACM.

[39] Sadat, F., Kazemi, F., Farzindar, A. (2014). Automatic identification of Arabic language varieties and dialects in social media. Proceedings of Social NLP, p. 22.

[40] Salameh, M., Bouamor, H. (2018). Fine-grained Arabic dialect identification. In: Proceedings of the 27th International Conference on Computational Linguistics. pp. 1332–1344.

[41] Sap, M., Park, G., Eichstaedt, J., Kern, M., Stillwell, D., Kosinski, M., Ungar, L., Schwartz, H.A. (2014). Developing age and gender predictive lexica over social media. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1146–1151.

[42] Schwartz, H.A., Eichstaedt, J.C., Kern, M.L., Dziurzynski, L., Ramones, S.M., Agrawal, M., Shah, A., Kosinski, M., Stillwell, D., Seligman, M.E., et al. (2013). Personality, gender, and age in the language of social media: The open-vocabulary approach. PloS one 8(9), e73791.

[43] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929–1958.

[44] Verhoeven, B. (2018). Two authors walk into a bar: studies in author profiling. Ph.D. thesis, University of Antwerp.

[45] Verhoeven, B., Daelemans, W., Plank, B. (2016). Twisty: a multilingual Twitter stylometry corpus for gender and personality profiling. In: Calzolari, N., et al. (Eds.) Proceedings of the 10th Annual Conference on Language Resources and Evaluation (LREC 2016). pp. 1–6.

[46] Versteegh, K. (2014). The Arabic language. Edinburgh University Press.

[47] Zhang, C., Abdul-Mageed, M. (2019). No army, no navy: BERT semi-supervised learning of Arabic dialects. In: Proceedings of the Fourth Arabic Natural Language Processing Workshop. pp. 279–284.
