Pangeo benchmarking analysis: Object storage vs. POSIX file system

Xu, H., Paul, K., Banihirwe, A. (2020). Pangeo benchmarking analysis: Object storage vs. POSIX file system. 2020 IEEE/ACM Fifth International Parallel Data Systems Workshop (PDSW), doi:https://doi.org/10.1109/PDSW51947.2020.00012

Title Pangeo benchmarking analysis: Object storage vs. POSIX file system
Genre Article
Author(s) Haiying Xu, Kevin Paul, Anderson Banihirwe
Abstract Pangeo is a community of scientists and software developers collaborating to enable interactive Big Data geoscience analysis in the public cloud and on high-performance computing (HPC) systems. At the core of the Pangeo software stack are (1) Xarray, which adds labels to metadata such as dimensions, coordinates and attributes for raw array-oriented data, (2) Dask, which provides parallel computation and out-of-core memory capabilities, and (3) Jupyter Lab, which offers the web-based interactive environment for the Pangeo platform. Geoscientists now have a strong candidate software stack for analyzing large datasets, and they want to know how the Zarr and NetCDF4 data formats differ in performance on both traditional file storage systems and object storage. We have written a benchmarking suite for the Pangeo stack that measures the scalability and performance of both input/output (I/O) throughput and computation. We describe how we performed these benchmarks, analyze our results, and discuss the pros and cons of the Pangeo software stack in terms of I/O scalability on both cloud and HPC storage systems.
Publication Title 2020 IEEE/ACM Fifth International Parallel Data Systems Workshop (PDSW)
Publication Date Nov 12, 2020
Publisher's Version of Record https://doi.org/10.1109/PDSW51947.2020.00012
OpenSky Citable URL https://n2t.org/ark:/85065/d7jd515s
CISL Affiliations TDD, IOWA