Porcupine: A Benchmarking Suite for Cloud File Systems



The last decade has witnessed rapid research progress in the field of cloud computing. Many large-scale cloud platforms, such as Amazon EC2 and Microsoft Azure, have been implemented and are widely used in practice. Cloud file systems, as fundamental components of cloud platforms, store petabyte-scale data and handle high-speed I/O requests. Typical examples of cloud file systems include GFS and HDFS. Over the past few years, these cloud file systems have attracted considerable research effort aimed at improving system throughput, availability, reliability, fault tolerance, and so on.

As cloud file systems mature, the demand for evaluating their performance rises. However, benchmarking of cloud file systems has not yet been well studied. To date, no standard benchmarks are available for cloud file systems. Traditional file system benchmarks such as Postmark and SPC are unsuitable for benchmarking cloud file systems, as new kinds of workloads are continuously emerging on cloud platforms. The heterogeneity and diversity of workloads, as well as the system complexity, pose extra difficulty for benchmarking cloud file systems.
Available here for Download


Our Approach

To design a realistic benchmark for cloud file systems, two key issues need to be addressed first: (1) how to synthesize representative I/O request streams, and (2) how to pre-populate data objects that reflect actual cases on cloud platforms. Although previous work has reported data characteristics and access patterns in large-scale distributed systems, the understanding of data and workload characteristics at the file storage layer of cloud platforms is far from complete. A major reason is the lack of publicly available workload traces. As a result, system researchers and engineers can easily make incorrect assumptions about their systems and workloads, leading to inaccurate benchmark results.

As a preliminary investigation, we explored the characteristics of I/O workloads on a production cloud, one of the largest cloud platforms in Asia. The workloads studied are derived from two cloud services, object storage and batch data processing, which serve millions of public applications and users. Based on the results of this workload characterization, we designed Porcupine, a benchmarking suite for evaluating the performance of various cloud file systems. Unlike previous file system benchmarks, Porcupine is designed to generate customized file I/O requests, which can be rather heterogeneous and complex. Because the I/O request patterns of cloud applications are diverse and dynamic, a workload can be generated either by replaying a sample trace or by synthesizing it from a set of workload configurations, such as the composition of request types (sequential/random read/write, etc.) and the pattern of the request arrival rate. In addition, Porcupine can pre-populate a file system with synthetic metadata and file contents, which makes it easier to deploy a test environment.
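To make the idea of synthesizing a workload from a configuration more concrete, the following Python sketch generates a request stream from a request-type mix and an arrival rate. It is only an illustration and not Porcupine's actual code or configuration format: the names `Request` and `synthesize`, the four request-type labels, and the choice of Poisson arrivals (exponential inter-arrival times) are all assumptions made for this example.

```python
import random
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterator

# NOTE: every name below is hypothetical; this is not Porcupine's API.


@dataclass(frozen=True)
class Request:
    time: float   # arrival time in seconds since the start of the run
    op: str       # request type, one of the keys of the mix
    offset: int   # byte offset within the target file
    size: int     # request size in bytes


def synthesize(mix: Dict[str, float], rate: float, duration: float,
               file_size: int, io_size: int, seed: int = 0) -> Iterator[Request]:
    """Yield requests whose types follow `mix` and whose arrivals are Poisson.

    Assumption: arrivals form a Poisson process with `rate` requests per
    second; a real arrival pattern could be bursty or time-varying instead.
    """
    if io_size <= 0 or file_size < io_size or rate <= 0:
        raise ValueError("need rate > 0 and 0 < io_size <= file_size")
    rng = random.Random(seed)
    ops = list(mix)
    weights = [mix[op] for op in ops]
    max_offset = file_size - io_size
    cursors: Dict[str, int] = {}          # next offset per sequential type
    now = rng.expovariate(rate)
    while now < duration:
        op = rng.choices(ops, weights=weights)[0]
        if op.startswith("seq"):
            # Sequential access: continue from the last offset, wrap at the end.
            offset = cursors.get(op, 0)
            nxt = offset + io_size
            cursors[op] = nxt if nxt <= max_offset else 0
        else:
            # Random access: pick a block-aligned offset uniformly.
            offset = rng.randrange(0, max_offset + 1, io_size)
        yield Request(now, op, offset, io_size)
        now += rng.expovariate(rate)


# Example configuration: mostly reads, with a small share of random writes.
mix = {"seq_read": 0.5, "rand_read": 0.3, "seq_write": 0.15, "rand_write": 0.05}
requests = list(synthesize(mix, rate=200.0, duration=5.0,
                           file_size=1 << 20, io_size=4096, seed=42))
print(len(requests), Counter(r.op for r in requests))
```

The resulting list could then be handed to a driver that issues the requests against the file system under test. Fixing the seed keeps a synthesized workload reproducible across runs, which matters when comparing different file systems under the same load.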

This is joint work between Wayne State University and Hangzhou Dianzi University, with support from Alibaba Group.



            Biao Xu

            Zujie Ren

            Weisong Shi