Cloud-native repositories for big scientific data
Abernathey, R. P., Augspurger, T., Banihirwe, A., Blackmon-Luca, C. C., Crone, T. J., et al. (2021). Cloud-native repositories for big scientific data. Computing in Science & Engineering. https://doi.org/10.1109/MCSE.2021.3059437
Title | Cloud-native repositories for big scientific data |
---|---|
Genre | Article |
Author(s) | R. P. Abernathey, T. Augspurger, Anderson Banihirwe, C. C. Blackmon-Luca, T. J. Crone, C. L. Gentemann, Joseph J. Hamman, N. Henderson, C. Lepore, T. A. McCaie, N. H. Robinson, R. P. Signell |
Abstract | Scientific data have traditionally been distributed via downloads from data server to local computer. This way of working suffers from limitations as scientific datasets grow toward the petabyte scale. A "cloud-native data repository," as defined in this article, offers several advantages over traditional data repositories—performance, reliability, cost-effectiveness, collaboration, reproducibility, creativity, downstream impacts, and access and inclusion. These objectives motivate a set of best practices for cloud-native data repositories: analysis-ready data, cloud-optimized (ARCO) formats, and loose coupling with data-proximate computing. The Pangeo Project has developed a prototype implementation of these principles by using open-source scientific Python tools. By providing an ARCO data catalog together with on-demand, scalable distributed computing, Pangeo enables users to process big data at rates exceeding 10 GB/s. Several challenges must be resolved in order to realize cloud computing's full potential for scientific research, such as organizing funding, training users, and enforcing data privacy requirements. |
Publication Title | Computing in Science & Engineering |
Publication Date | Mar 1, 2021 |
Publisher's Version of Record | https://doi.org/10.1109/MCSE.2021.3059437 |
OpenSky Citable URL | https://n2t.org/ark:/85065/d7q52t1h |
CISL Affiliations | TDD, IOWA |