An optimized approach for storing and accessing small files on cloud storage

Research output: Contribution to journal › Article › peer-review

96 Scopus citations

Abstract

The Hadoop distributed file system (HDFS) is widely adopted to support Internet services. Unfortunately, native HDFS performs poorly on workloads with large numbers of small files, a problem that has attracted significant attention. This paper first analyzes the causes of the small-file problem in HDFS: (1) large numbers of small files place a heavy burden on the HDFS NameNode; (2) correlations between small files are not considered during data placement; and (3) no optimization mechanism, such as prefetching, is provided to improve I/O performance. Second, in the context of HDFS, the cut-off point between large and small files is determined experimentally, which helps answer the question 'how small is small'. Third, according to file correlation features, files are classified into three types: structurally related files, logically related files, and independent files. Finally, based on these three steps, an optimized approach is designed to improve the storage and access efficiency of small files on HDFS. A file merging and prefetching scheme is applied to structurally related small files, while a file grouping and prefetching scheme is used to manage logically related small files. Experimental results demonstrate that the proposed schemes effectively improve the storage and access efficiency of small files compared with native HDFS and a Hadoop file archiving facility.
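The core of the merging scheme described in the abstract is to pack many small files into one large container object so the NameNode tracks a single file, with a local index enabling random access to each original file. The following is a minimal illustrative sketch of that idea (the function and index layout are hypothetical, not the paper's actual implementation):

```python
import io

def merge_small_files(files):
    """Pack a {name: bytes} mapping of small files into one blob.

    Returns the merged blob plus an index mapping each original
    file name to its (offset, length) within the blob, so only the
    merged object needs a NameNode metadata entry.
    (Hypothetical sketch; layout details differ in the paper.)
    """
    index = {}
    buf = io.BytesIO()
    for name, data in files.items():
        index[name] = (buf.tell(), len(data))
        buf.write(data)
    return buf.getvalue(), index

def read_small_file(blob, index, name):
    """Random access to one original small file via the index."""
    offset, length = index[name]
    return blob[offset:offset + length]
```

In the same spirit, prefetching would fetch neighboring entries of the index (correlated files merged or grouped together) into a client-side cache when one of them is read, amortizing metadata and I/O costs over the whole group.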

Original language: English
Pages (from-to): 1847-1862
Number of pages: 16
Journal: Journal of Network and Computer Applications
Volume: 35
Issue number: 6
DOIs
State: Published - Nov 2012

Keywords

  • Access efficiency
  • Cloud storage
  • Prefetching
  • Small file storage
  • Storage efficiency

