Research on Similarity Detection of Massive Text Based on Semantic Fingerprint
X. Jin*, S. Zhang, J. Liu and H. Guan
Pre-published on: February 26, 2018
Published on: March 08, 2018
Abstract
To find required information quickly and efficiently in massive text collections, this paper proposes a method that combines semantic fingerprints with cosine distance. After preprocessing the Chinese texts, the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm is used to extract the feature words of each text. The texts are then screened initially with the Simhash algorithm, and the resulting candidate texts are compared a second time using cosine distance to extract the most similar texts. Compared with using the Simhash algorithm alone, the proposed method greatly improves accuracy and recall when the texts have been modified, and it also meets the requirements of similarity detection on massive texts. Combining semantic fingerprints with cosine distance therefore effectively compensates for the high false positive rate of the Simhash algorithm and is better suited to similarity detection of massive texts.
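The two-stage pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes the texts are already segmented into tokens (Chinese word segmentation is not shown), and the smoothed IDF formula, the MD5-based 64-bit term hash, the Hamming threshold of 3, and all function names are assumptions chosen for the example.

```python
import hashlib
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumed variant; the abstract does not specify one).
        vectors.append({
            t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        })
    return vectors


def simhash(weights, bits=64):
    """Fold weighted term hashes into a single `bits`-bit fingerprint."""
    acc = [0.0] * bits
    for term, w in weights.items():
        # Assumed term hash: first 8 bytes of MD5, read as a 64-bit integer.
        h = int.from_bytes(hashlib.md5(term.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            acc[i] += w if (h >> i) & 1 else -w
    return sum(1 << i for i in range(bits) if acc[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def most_similar(query_idx, docs, max_hamming=3, top_k=1):
    """Stage 1: Simhash screening. Stage 2: cosine re-ranking of candidates."""
    vecs = tfidf_vectors(docs)
    prints = [simhash(v) for v in vecs]
    candidates = [
        i for i in range(len(docs))
        if i != query_idx and hamming(prints[query_idx], prints[i]) <= max_hamming
    ]
    ranked = sorted(
        candidates, key=lambda i: cosine(vecs[query_idx], vecs[i]), reverse=True
    )
    return ranked[:top_k]
```

In this sketch the Simhash stage only narrows the candidate set cheaply, and the cosine stage decides the final ranking, which is where the false positives of the fingerprint comparison would be filtered out. A realistic system would index the fingerprints rather than scan them linearly.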
DOI: https://doi.org/10.22323/1.300.0009