String Kernel-Based Techniques for Authorship Attribution

Authors

  • Muhammad Nafi Annury, Department of English Education, Universitas Islam Negeri Walisongo Semarang, Central Java, Indonesia
  • Djoko Sutrisno, Universitas Ahmad Dahlan

DOI:

https://doi.org/10.61667/e1bvf861

Keywords:

Authorship Attribution, String Kernels, Stylometry, Natural Language Processing (NLP), Computational Efficiency

Abstract

Authorship attribution (AA), a core task in computational linguistics, seeks to identify the author of a text based on stylistic patterns. While effective, many existing methods face a trade-off between classification accuracy and computational cost, especially when applied to large datasets. This study provides a systematic evaluation of word-level string kernel techniques as an efficient and accurate solution for AA. We investigate the performance of three string kernels (Spectrum, Presence Bits, and Intersection) paired with three machine learning classifiers (Support Vector Machine, Random Forest, and XGBoost). The models were tested on three distinct feature sets designed to isolate the stylistic contribution of noun phrases alongside word n-grams. Our findings reveal that the optimal configuration, a Support Vector Machine with a Spectrum kernel using a feature set of word n-grams and noun phrases, achieves approximately 95% classification accuracy on the test set. This result underscores the role of phrase-level syntactic information in capturing an author's unique voice. Most significantly, the word-level approach reduces model training time four- to six-fold compared with a strong character-level baseline, while maintaining superior or competitive accuracy. We conclude that word-level string kernels offer a practical framework for authorship attribution that balances high accuracy with computational efficiency. The method's scalability makes it suitable for real-world applications, including digital forensics, plagiarism detection, and large-scale textual analysis.
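
To make the three kernels named in the abstract concrete, the following is a minimal Python sketch of how they are commonly defined in the string kernel literature, applied here to word n-grams. It is an illustration under stated assumptions, not the authors' implementation: the whitespace tokenizer, the lowercasing, the choice of n = 2, and the function names are all hypothetical, and the sketch omits kernel normalization and the noun phrase features described above.

```python
from collections import Counter


def word_ngrams(text, n):
    """Count word n-grams in a text (assumed: lowercase, whitespace tokens)."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def spectrum_kernel(a, b):
    """Sum over shared n-grams of the product of their counts."""
    return sum(count * b[gram] for gram, count in a.items())


def presence_bits_kernel(a, b):
    """Number of distinct n-grams that occur in both texts."""
    return sum(1 for gram in a if gram in b)


def intersection_kernel(a, b):
    """Sum over shared n-grams of the smaller of the two counts."""
    return sum(min(count, b[gram]) for gram, count in a.items())


if __name__ == "__main__":
    # Toy texts, invented for illustration only.
    s = word_ngrams("the cat sat on the mat the cat slept", 2)
    t = word_ngrams("the cat sat by the door and the cat ran", 2)
    print(spectrum_kernel(s, t))       # 5
    print(presence_bits_kernel(s, t))  # 2
    print(intersection_kernel(s, t))   # 3
```

In this toy pair the texts share two bigrams, ("the", "cat") twice in each text and ("cat", "sat") once in each, which is why the three kernels return 5, 2, and 3. A pairwise matrix of such values could then be passed to a classifier that accepts precomputed kernels, such as a Support Vector Machine.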

Published

2025-11-29