A Method to Improve the Performance for Storing Massive Small Files in Hadoop

Zheng, Tong; Fan, Guisheng; Guo, Weibin

doi:10.22323/1.299.0022

Abstract

As a new open source project, Hadoop provides a new way to store massive data. Because of its high scalability, low cost, good flexibility, high speed and strong fault tolerance, it has been widely adopted by internet companies. However, the performance of Hadoop degrades significantly when it is used to handle massive numbers of small files. This paper therefore proposes a new scheme that merges small files, which consume a large amount of NameNode memory, into large files, establishes the mapping between the small files and the large files, and stores this mapping information in HBase. To improve read performance, the scheme provides a prefetching mechanism that analyzes the access logs and places the metadata of frequently accessed merged files in the client's memory. The experimental results show that this scheme can efficiently optimize small-file storage in HDFS, thereby reducing the load on the NameNode and improving file access performance.
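
The merge, mapping and prefetch steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: plain Python dictionaries stand in for HDFS (the merged-file store) and HBase (the mapping table), and the names `IndexEntry`, `merge_small_files`, `read_small_file` and `prefetch_hot_metadata`, as well as the "top N most accessed merged files" policy, are hypothetical choices made only for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class IndexEntry:
    """Where one small file lives inside a merged file (hypothetical layout)."""
    merged_file: str
    offset: int
    length: int


def merge_small_files(
    small_files: Dict[str, bytes], merged_name: str
) -> Tuple[bytes, Dict[str, IndexEntry]]:
    """Concatenate small files into one blob and record each file's position.

    In the scheme from the abstract the blob would be written to HDFS and
    the index rows to HBase; here both are simply returned to the caller.
    """
    blob = bytearray()
    index: Dict[str, IndexEntry] = {}
    for name, data in small_files.items():
        index[name] = IndexEntry(merged_name, len(blob), len(data))
        blob.extend(data)
    return bytes(blob), index


def read_small_file(
    name: str, index: Dict[str, IndexEntry], merged_store: Dict[str, bytes]
) -> bytes:
    """Look up the mapping, then slice the small file out of its merged file."""
    entry = index[name]
    blob = merged_store[entry.merged_file]
    return blob[entry.offset : entry.offset + entry.length]


def prefetch_hot_metadata(
    access_log: Iterable[str], index: Dict[str, IndexEntry], top_n: int
) -> Dict[str, IndexEntry]:
    """Pick the index entries to keep in client memory.

    Assumed policy: count accesses per merged file and cache the entries of
    every small file stored in the top_n most accessed merged files.
    """
    hits = Counter(index[name].merged_file for name in access_log if name in index)
    hot = {merged for merged, _ in hits.most_common(top_n)}
    return {name: entry for name, entry in index.items() if entry.merged_file in hot}
```

A client following this sketch would consult the prefetched dictionary first and fall back to the full mapping table only on a miss, so that repeated reads from a frequently accessed merged file avoid a metadata lookup.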