TY - JOUR
T1 - An optimized approach for storing and accessing small files on cloud storage
AU - Dong, Bo
AU - Zheng, Qinghua
AU - Tian, Feng
AU - Chao, Kuo-Ming
AU - Ma, Rui
AU - Anane, Rachid
PY - 2012/11
Y1 - 2012/11
N2 - Hadoop distributed file system (HDFS) is widely adopted to support Internet services. Unfortunately, native HDFS does not perform well with large numbers of small files, a limitation that has attracted significant attention. This paper first analyzes the causes of the small-file problem in HDFS: (1) large numbers of small files impose a heavy burden on the NameNode of HDFS; (2) correlations between small files are not considered for data placement; and (3) no optimization mechanism, such as prefetching, is provided to improve I/O performance. Second, in the context of HDFS, the cut-off point between large and small files is determined through experimentation, which helps answer 'how small is small'. Third, according to file correlation features, files are classified into three types: structurally-related files, logically-related files, and independent files. Finally, based on the above three steps, an optimized approach is designed to improve the storage and access efficiencies of small files on HDFS. A file merging and prefetching scheme is applied to structurally-related small files, while a file grouping and prefetching scheme is used to manage logically-related small files. Experimental results demonstrate that the proposed schemes effectively improve the storage and access efficiencies of small files, compared with native HDFS and a Hadoop file archiving facility.
AB - Hadoop distributed file system (HDFS) is widely adopted to support Internet services. Unfortunately, native HDFS does not perform well with large numbers of small files, a limitation that has attracted significant attention. This paper first analyzes the causes of the small-file problem in HDFS: (1) large numbers of small files impose a heavy burden on the NameNode of HDFS; (2) correlations between small files are not considered for data placement; and (3) no optimization mechanism, such as prefetching, is provided to improve I/O performance. Second, in the context of HDFS, the cut-off point between large and small files is determined through experimentation, which helps answer 'how small is small'. Third, according to file correlation features, files are classified into three types: structurally-related files, logically-related files, and independent files. Finally, based on the above three steps, an optimized approach is designed to improve the storage and access efficiencies of small files on HDFS. A file merging and prefetching scheme is applied to structurally-related small files, while a file grouping and prefetching scheme is used to manage logically-related small files. Experimental results demonstrate that the proposed schemes effectively improve the storage and access efficiencies of small files, compared with native HDFS and a Hadoop file archiving facility.
KW - Access efficiency
KW - Cloud storage
KW - Prefetching
KW - Small file storage
KW - Storage efficiency
UR - https://www.scopus.com/pages/publications/84867570478
U2 - 10.1016/j.jnca.2012.07.009
DO - 10.1016/j.jnca.2012.07.009
M3 - Article
AN - SCOPUS:84867570478
SN - 1084-8045
VL - 35
SP - 1847
EP - 1862
JO - Journal of Network and Computer Applications
JF - Journal of Network and Computer Applications
IS - 6
ER -