MITIGATING CLASS IMBALANCE IN OFFENSIVE LANGUAGE DETECTION IN MALAYALAM THROUGH NLPAUG
DOI:
https://doi.org/10.63458/ijerst.v2i1.73
Keywords:
Offensive language, mBERT, NlpAug, stratified K-fold, Data Augmentation
Abstract
The rise of technology, together with the prevalence of social media and the promotion of free speech, has increased the presence of harmful content in the public sphere. Several studies show that identifying offensive language plays a crucial role in protecting vulnerable groups. We focus on detecting offensive language in Malayalam, a language for which existing research in this area is scarce. mBERT is effective across Indian languages. To address class imbalance in the dataset, we employed NlpAug for word-level augmentation and achieved a significant improvement in macro F1 score of 0.31.
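The idea of balancing classes through word-level augmentation and scoring with macro F1 can be illustrated with a minimal sketch. It does not use NlpAug itself: a random word swap stands in for whichever word-level augmenter was applied, which the abstract does not specify. The function names `swap_words`, `balance_by_augmentation`, and `macro_f1`, and the sample texts and labels, are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def swap_words(text, rng):
    """Return a copy of text with two randomly chosen word positions swapped.

    Assumption: a stand-in for a word-level augmenter, not the NlpAug API.
    """
    words = text.split()
    if len(words) < 2:
        return text
    i, j = rng.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)


def balance_by_augmentation(texts, labels, seed=0):
    """Add augmented minority-class texts until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_texts, out_labels = list(texts), list(labels)
    for label, count in counts.items():
        pool = [t for t, l in zip(texts, labels) if l == label]
        for _ in range(target - count):
            out_texts.append(swap_words(rng.choice(pool), rng))
            out_labels.append(label)
    return out_texts, out_labels


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so a rare class counts as much as a common one."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: three non-offensive comments, one offensive comment.
texts = ["have a nice day", "great video thanks", "well said friend", "you are trash"]
labels = ["not_offensive", "not_offensive", "not_offensive", "offensive"]
balanced_texts, balanced_labels = balance_by_augmentation(texts, labels)
print(Counter(balanced_labels))
```

Augmentation of this kind should be applied to the training portion of each fold only, so that augmented copies of a comment never leak into the data used for evaluation. Macro F1 is the natural metric here because a classifier that always predicts the majority class scores poorly on it even when its accuracy looks high.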