PoS - Proceedings of Science
Volume 300 - Information Science and Cloud Computing (ISCC 2017) - Session I: Machine Learning
Chinese-English Cross-Language Text Clustering Algorithm Based on Latent Semantic Analysis
H. Lan* and J. Huang
Pre-published on: February 26, 2018
Published on: March 08, 2018
Abstract
To address the shortcomings of traditional cross-language text clustering methods, a Chinese-English cross-language text clustering algorithm based on Latent Semantic Analysis (LSA) is proposed. [Method] Singular Value Decomposition is applied to the feature word-text matrix, and a bilingual Chinese-English latent semantic space is constructed to capture cross-language latent semantic associations while reducing dimensionality and noise. A K-means algorithm that chooses its initial cluster centers on the basis of minimum similarity is adopted, so that the clustering result does not depend on a random selection of initial centers. [Results] Experimental results show that both the number of feature words retained for each text and the choice of the spatial dimension k affect the clustering result. The F-measure is highest when each text retains its top 15 feature words and k=200, an improvement of 13.96 percentage points over CLTC. [Conclusions] The method greatly reduces the dimension of the text space and effectively improves cross-language text clustering quality, outperforming CLTC.
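The pipeline described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The toy vocabulary, the six toy documents, the helper names (lsa_embed, min_similarity_seeds, kmeans_cosine) and the use of k=2 instead of the paper's k=200 are all assumptions made for illustration. The abstract also does not spell out the exact minimum-similarity rule, so the sketch assumes one plausible reading: start from the first document and repeatedly add the document whose highest cosine similarity to the seeds chosen so far is lowest.

```python
import numpy as np

def lsa_embed(term_doc, k):
    """Map documents to a k-dimensional latent space via truncated SVD."""
    # term_doc has shape (n_terms, n_docs); entries are term weights.
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    k = min(k, len(s))
    docs = Vt[:k, :].T * s[:k]          # document coordinates, one row per doc
    norms = np.linalg.norm(docs, axis=1, keepdims=True)
    return docs / np.where(norms == 0, 1.0, norms)  # unit length for cosine

def min_similarity_seeds(docs, n_clusters):
    """Assumed seeding rule: add the doc least similar to the current seeds."""
    seeds = [0]                         # assumption: start from the first doc
    sim_to_seeds = docs @ docs[0]       # max similarity to any chosen seed
    while len(seeds) < n_clusters:
        nxt = int(np.argmin(sim_to_seeds))
        seeds.append(nxt)
        sim_to_seeds = np.maximum(sim_to_seeds, docs @ docs[nxt])
    return docs[seeds].copy()

def kmeans_cosine(docs, n_clusters, n_iter=50):
    """K-means on unit vectors, assigning by highest cosine similarity."""
    centers = min_similarity_seeds(docs, n_clusters)
    labels = np.full(len(docs), -1)
    for _ in range(n_iter):
        new_labels = np.argmax(docs @ centers.T, axis=1)
        if np.array_equal(new_labels, labels):
            break                       # assignments stopped changing
        labels = new_labels
        for c in range(n_clusters):
            members = docs[labels == c]
            if len(members):
                mean = members.mean(axis=0)
                norm = np.linalg.norm(mean)
                if norm > 0:
                    centers[c] = mean / norm
    return labels

# Toy bilingual term-document matrix (hypothetical data).
# Rows: bank, 银行, money, football, 足球, goal
# Cols: EN finance, ZH finance, EN sport, ZH sport,
#       bilingual finance, bilingual sport
term_doc = np.array([
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 0, 2, 0, 0, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 0, 1],
], dtype=float)

labels = kmeans_cosine(lsa_embed(term_doc, k=2), n_clusters=2)
print(labels)
```

In this toy setup the English-only and Chinese-only finance documents share no terms, yet they land in the same cluster because the bilingual document links their vocabularies in the latent space. The seeding step is deterministic, which is the property the abstract attributes to its initial-center selection.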
DOI: https://doi.org/10.22323/1.300.0007
Open Access
Copyright owned by the author(s) under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.