Abstract

This study evaluates machine learning models for classifying culturally significant Javanese Wuku texts from the “Keagamaan atau Spiritual” (religious or spiritual) category, a domain made difficult by distinctive linguistic features and scarce digitized resources. We compared Support Vector Machine (SVM), Naïve Bayes, and Convolutional Neural Network (CNN) classifiers on texts from five Wuku types (Sinta, Galungan, Kuningan, Sungsang, Warigalit) sourced from sastra.org, to identify the most effective computational approach. The dataset comprises N = 1,419 documents (T = 751,290 tokens), with per-class document counts reported for all five Wuku types. We report accuracy, precision, recall, F1-score, and Area Under the Curve (AUC) under repeated stratified 5-fold cross-validation (10 repeats, 50 runs in total) to obtain robust estimates. CNN achieved the best performance, with Accuracy = 0.92 ± [SD], Macro-F1 = 0.90 ± [SD], and AUC = 0.93 ± [SD], outperforming SVM (Accuracy: 0.87; F1-score: 0.84) and Naïve Bayes (Accuracy: 0.82; F1-score: 0.78). These results indicate that CNN is well suited to nuanced, context-rich text classification, which contributes to cultural heritage preservation and to Natural Language Processing (NLP) for under-resourced languages. From a knowledge-engineering perspective, predicted Wuku labels can serve as structured metadata to support computational indexing and retrieval of Wuku narratives in cultural information systems. Methodologically, our CNN is a lightweight design for small corpora that uses tuned regularization (dropout and early stopping) and multi-scale convolution to capture culturally salient n-gram cues, rather than relying on a fixed default TextCNN configuration. Future work will expand the dataset and explore more advanced deep learning architectures.
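The evaluation protocol named in the abstract (repeated stratified 5-fold cross-validation, 10 repeats, 50 runs, reported as mean ± SD) can be illustrated with a minimal standard-library sketch. It is not the authors' code: the function names, the toy label counts (20 documents per Wuku type), and the majority-class stand-in classifier are assumptions made only for illustration.

```python
import random
import statistics
from collections import Counter, defaultdict


def repeated_stratified_kfold(labels, n_splits=5, n_repeats=10, seed=0):
    """Yield (train_idx, test_idx) pairs; every repeat reshuffles within classes."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for label in sorted(by_class):
            members = by_class[label][:]
            rng.shuffle(members)
            # Deal class members round-robin so each fold keeps the class ratio.
            for pos, idx in enumerate(members):
                folds[pos % n_splits].append(idx)
        for k in range(n_splits):
            test = sorted(folds[k])
            train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
            yield train, test


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels: hypothetical counts, NOT the per-class counts from the paper.
wuku_types = ["Sinta", "Galungan", "Kuningan", "Sungsang", "Warigalit"]
labels = [w for w in wuku_types for _ in range(20)]

splits = list(repeated_stratified_kfold(labels, n_splits=5, n_repeats=10))

accuracies = []
for train_idx, test_idx in splits:
    # Stand-in classifier (assumption): always predict the most common training label.
    majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
    y_true = [labels[i] for i in test_idx]
    y_pred = [majority] * len(y_true)
    accuracies.append(sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true))

mean_acc = statistics.mean(accuracies)
sd_acc = statistics.stdev(accuracies)
print(f"runs={len(splits)}  accuracy={mean_acc:.2f} ± {sd_acc:.2f}")
```

To evaluate a real model, the majority-class stand-in would be replaced by one that is fitted on the training indices of each run, and `macro_f1` would be aggregated across the 50 runs in the same way as accuracy.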

DOI

https://doi.org/10.17977/um018v8i12025p104-117

First Page

104

Last Page

117
