A Benchmarking Suite for Cloud File Systems
The last decade has witnessed rapid research progress
in the field of cloud computing. Many large-scale cloud
platforms, such as Amazon EC2 and Microsoft Azure, have been implemented
and are widely used in practice.
Cloud file systems, as the fundamental parts of cloud platforms,
serve to store petabyte-scale data and handle high-speed
I/O requests. Typical examples of cloud file systems
in the community include GFS and HDFS. During
the past few years, these cloud file systems have attracted
plenty of research effort aimed at improving system throughput,
availability, reliability, fault tolerance, and so on.
As the cloud file systems mature, the demand to evaluate
the performance of these file systems rises. However,
benchmarking cloud file systems has not been
well studied yet. To date, there are no standard benchmarks
available for cloud file systems. Traditional file system
benchmarks such as Postmark and SPC are
unsuitable for benchmarking cloud file systems, as new
kinds of workloads are continuously emerging on cloud
platforms. The heterogeneity and diversity of workloads,
as well as the system complexity, pose extra difficulty for
benchmarking cloud file systems.
To design a realistic benchmark for cloud file systems, two key issues
need to be addressed as a preliminary step.
(1) How to synthesize representative I/O request streams? (2) How to
pre-populate the data objects that reflect actual
cases in cloud platforms? Although some previous work reported the data
characteristics and access patterns in large-scale
distributed systems, our understanding of the characteristics of data and
workloads at the file storage
layer of cloud platforms is far from complete. A major reason is the
scarcity of publicly available workload
traces. As a result, system researchers and engineers can easily make
incorrect assumptions about their systems and
workloads, leading to inaccurate benchmark results.
As a preliminary investigation, we explored the characteristics of I/O
workloads on a production cloud, which is one of the largest cloud
platforms in Asia. The workloads being studied are derived from two
cloud services, object storage and batch data
processing, which together serve millions of public applications and users. Based on
the results of this workload characterization, we design a benchmarking
suite, Porcupine, for the performance evaluation of various cloud file
systems. Unlike previous file system benchmarks, Porcupine is designed
to generate customized file I/O requests, which could be rather
heterogeneous and complex. As the I/O request patterns in cloud
applications are diverse and dynamic, workloads can be generated by
sample trace replay or synthesized from a set of workload
configurations, such as the composition of request types
(sequential/random reads/writes, etc.) and the pattern of the request arrival
rate. In addition, Porcupine can pre-populate the target file system with
synthetic metadata and file contents, which facilitates deploying a realistic test environment.
This is joint work between Wayne State University and Hangzhou Dianzi University, with support from Alibaba Group.
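To make the synthesis model concrete, here is a minimal sketch of how a Porcupine-style generator might combine a request-type composition with an arrival-rate pattern, and pre-populate synthetic file metadata. All names, parameters, and distributions below (the request mix, the Poisson arrival assumption, the log-normal file-size assumption) are illustrative assumptions, not Porcupine's actual API.

```python
import random

# Assumed workload configuration: composition of request types.
# The weights describe the fraction of each type in the synthesized stream.
REQUEST_MIX = {
    "seq_read": 0.4,
    "seq_write": 0.2,
    "rand_read": 0.3,
    "rand_write": 0.1,
}

MEAN_ARRIVAL_RATE = 200.0  # requests per second (Poisson arrivals assumed)


def synthesize_requests(num_requests, mix=REQUEST_MIX,
                        rate=MEAN_ARRIVAL_RATE, seed=42):
    """Generate a synthetic I/O stream as (timestamp, request_type) pairs."""
    rng = random.Random(seed)
    types = list(mix.keys())
    weights = list(mix.values())
    t = 0.0
    stream = []
    for _ in range(num_requests):
        # Exponential inter-arrival times model a Poisson arrival pattern.
        t += rng.expovariate(rate)
        stream.append((t, rng.choices(types, weights)[0]))
    return stream


def prepopulate_files(num_files, seed=42):
    """Synthesize file metadata (name, size in bytes) before a benchmark run.

    A log-normal size distribution is a common modeling assumption for
    file sizes; the real tool may use empirically measured distributions.
    """
    rng = random.Random(seed)
    return [("file_%06d" % i, int(rng.lognormvariate(12, 2)) + 1)
            for i in range(num_files)]


stream = synthesize_requests(1000)
files = prepopulate_files(100)
```

Separating the request stream from the pre-populated data set mirrors the two questions raised above: the stream models the I/O access pattern, while the file list models the state of the storage layer the benchmark runs against.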
- Porcupine 1.0 was officially
released on September 25, 2015. It is now used by Aliyun
engineers (the Pangu File Systems Team) for stress testing and
performance optimization. You can download it from GitHub.