A Novel Fine-Grained Source Code Vulnerability Detection Model via Joint Token and Statement Representation Learning
S. Sun*, J. Wang, T. Yan and F. Qi*
* corresponding author
Published on: October 20, 2025
Abstract
With the growing volume of code and increasing complexity of software systems, defects in source code can lead to significant security risks—for example, malicious intrusions, data breaches, compromised availability, and erroneous scientific computation results—making their detection crucial. Mainstream code defect detection methods currently fall into two categories: graph neural network (GNN)-based methods and sequence-based methods. Both categories have achieved considerable success in this field; however, each also suffers from certain shortcomings. Graph-based methods typically face issues such as the high memory overhead of graph construction, over-smoothing, and incomplete utilization of heterogeneous edge information. Sequence-based methods generally treat code as ordinary text, learning only token-level features while ignoring the structural information of the code, which results in suboptimal detection performance. Moreover, these methods rarely support line-level vulnerability detection. To address these issues, this paper proposes a novel sequence-based detection method that simultaneously learns token-level and statement-level feature representations and supports line-level detection, thereby significantly enhancing detection capability. The proposed method achieves an F1 score of 92.71% for function-level detection and a top-5 accuracy of 61% for line-level vulnerability detection on a public dataset.
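To illustrate the hierarchy the abstract describes—token-level vectors pooled into statement-level vectors, with per-line scores enabling line-level ranking—the following is a minimal toy sketch in Python. It is purely illustrative: the embedding, pooling, and scoring functions (`embed`, `pool_statements`, `rank_lines`) are hypothetical stand-ins, not the authors' model; a real system would use learned embeddings and attention rather than the keyword heuristic used here.

```python
from typing import List


def embed(token: str, dim: int = 8) -> List[float]:
    # Deterministic toy token embedding: character codes folded into
    # `dim` buckets (a stand-in for a learned token representation).
    vec = [0.0] * dim
    for i, ch in enumerate(token):
        vec[i % dim] += ord(ch) / 128.0
    return vec


def pool_statements(lines: List[str]) -> List[List[float]]:
    # Statement-level representation: mean-pool the token vectors of
    # each source line, giving one vector per statement.
    pooled = []
    for line in lines:
        vecs = [embed(t) for t in line.split()] or [[0.0] * 8]
        pooled.append([sum(col) / len(vecs) for col in zip(*vecs)])
    return pooled


def rank_lines(lines: List[str], suspicious: List[str]) -> List[int]:
    # Line-level detection output: score each statement and return line
    # indices sorted from most to least suspicious. Here the score is a
    # simple keyword count standing in for a learned statement score.
    scored = [(sum(line.count(t) for t in suspicious), i)
              for i, line in enumerate(lines)]
    return [i for _, i in sorted(scored, reverse=True)]


code = ["char buf[8];", "strcpy(buf, user_input);", "return 0;"]
print(rank_lines(code, ["strcpy", "gets"]))  # most suspicious line first
```

A top-5 accuracy metric, as reported in the abstract, would then check whether a known vulnerable line appears among the first five indices returned by such a ranking.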
DOI: https://doi.org/10.22323/1.488.0010