String Kernel-Based Techniques for Authorship Attribution
DOI:
https://doi.org/10.61667/e1bvf861
Keywords:
Authorship Attribution, String Kernels, Stylometry, Natural Language Processing (NLP), Computational Efficiency
Abstract
Authorship attribution (AA), a core task in computational linguistics, seeks to identify the author of a text from its stylistic patterns. Many existing methods are effective but face a trade-off between classification accuracy and computational cost, especially on large datasets. This study provides a systematic evaluation of word-level string kernel techniques as an efficient and accurate solution for AA. We investigate the performance of three string kernels (Spectrum, Presence Bits, and Intersection) paired with three machine learning classifiers (Support Vector Machine, Random Forest, and XGBoost). The models were tested on three distinct feature sets designed to isolate the stylistic contribution of noun phrases alongside word n-grams. Our findings show that the optimal configuration, a Support Vector Machine with a Spectrum kernel using a feature set of word n-grams and noun phrases, achieves approximately 95% classification accuracy on the test set. This result underscores the role of phrasal-level syntactic information in capturing an author's unique voice. Most significantly, the word-level approach reduces model training time four- to six-fold compared to a strong character-level baseline, while maintaining superior or competitive accuracy. We conclude that word-level string kernels offer a powerful and practical framework for authorship attribution that balances high performance with computational efficiency. The method's scalability makes it well suited to real-world applications, including digital forensics, plagiarism detection, and large-scale textual analysis.
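To make the three kernels named in the abstract concrete, the following is a minimal sketch of how they are commonly defined over word n-gram counts. It is not the authors' implementation: the function names, the whitespace tokenizer, and the n-gram range of 1 to 2 are assumptions chosen for illustration, and noun-phrase features are omitted because they would require a chunker or parser that the abstract does not specify.

```python
from collections import Counter


def word_ngrams(text, n_min=1, n_max=2):
    """Count word n-grams for n in [n_min, n_max].

    Assumption: lowercased whitespace tokens stand in for a real tokenizer.
    """
    tokens = text.lower().split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def spectrum_kernel(a, b):
    """Sum over shared n-grams of the product of their counts."""
    return sum(c * b[g] for g, c in a.items() if g in b)


def presence_bits_kernel(a, b):
    """Number of distinct n-grams that occur in both texts."""
    return sum(1 for g in a if g in b)


def intersection_kernel(a, b):
    """Sum over shared n-grams of the smaller of the two counts."""
    return sum(min(c, b[g]) for g, c in a.items() if g in b)


def gram_matrix(texts, kernel, **ngram_args):
    """Pairwise kernel values for a list of texts (hypothetical helper)."""
    profiles = [word_ngrams(t, **ngram_args) for t in texts]
    return [[kernel(p, q) for q in profiles] for p in profiles]


if __name__ == "__main__":
    a = word_ngrams("the cat sat on the mat")
    b = word_ngrams("the cat and the dog")
    print(spectrum_kernel(a, b))       # 6
    print(presence_bits_kernel(a, b))  # 3
    print(intersection_kernel(a, b))   # 4
```

A Gram matrix built this way could be passed to a classifier that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`. In practice the matrix is usually normalized first; that step is left out of this sketch.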
License
Copyright (c) 2025 Global Synthesis in Education Journal

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.













