MITIGATING CLASS IMBALANCE IN OFFENSIVE LANGUAGE DETECTION IN MALAYALAM THROUGH NLPAUG

Authors

  • Munawwar K V, Research Scholar, Central University of Tamil Nadu, India
  • Nandhini K, Central University of Tamil Nadu, Thiruvarur, Tamil Nadu

DOI:

https://doi.org/10.63458/ijerst.v2i1.73

Keywords:

Offensive language, mBERT, NlpAug, stratified k-fold, data augmentation

Abstract

The rise of technology, together with the prevalence of social media and the promotion of free speech, has increased the presence of harmful content in the public sphere. Several studies show that identifying offensive language plays a crucial role in protecting vulnerable groups. We focus on detecting offensive language in Malayalam, a language for which existing research in this area is scarce. mBERT has been shown to be effective across Indian languages. To address class imbalance in the dataset, we employed NlpAug for word-level augmentation and achieved a significant improvement in macro F1 score of 0.31.
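
The idea of balancing classes through word-level augmentation can be illustrated with a minimal sketch. It does not use NlpAug and is not the authors' pipeline: the functions `swap_words` and `balance_by_augmentation` are hypothetical names, and random word swapping is assumed here only as one simple example of a word-level operation.

```python
import random
from collections import Counter


def swap_words(text, rng, n_swaps=1):
    """Return a copy of `text` with `n_swaps` random pairs of words exchanged."""
    words = text.split()
    if len(words) < 2:
        return text  # nothing to swap in a one-word text
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)


def balance_by_augmentation(texts, labels, seed=0):
    """Append augmented copies of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_texts, out_labels = list(texts), list(labels)
    for label, count in counts.items():
        # Original examples of this class serve as the augmentation pool.
        pool = [t for t, l in zip(texts, labels) if l == label]
        for _ in range(target - count):
            out_texts.append(swap_words(rng.choice(pool), rng))
            out_labels.append(label)
    return out_texts, out_labels
```

In a cross-validation setup, applying such augmentation only to the training folds keeps augmented copies of a sentence out of the evaluation data.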

Author Biography

Nandhini K, Central University of Tamil Nadu, Thiruvarur, Tamil Nadu

Assistant Professor, Central University of Tamil Nadu, Thiruvarur, Tamil Nadu


Published

2024-03-25

How to Cite

K V, M., & Nandhini K. (2024). MITIGATING CLASS IMBALANCE IN OFFENSIVE LANGUAGE DETECTION IN MALAYALAM THROUGH NLPAUG. International Journal of Engineering Research and Sustainable Technologies (IJERST), 2(1), 29–35. https://doi.org/10.63458/ijerst.v2i1.73
