SIParCS 2017 - Anderson Banihirwe
PySpark for "Big" Atmospheric and Oceanic Data Analysis
Typically, tools installed with a supercomputer's High Performance Computing (HPC) software stack are used to parallelize climate data analyses. This research project explores an alternative approach to parallelizing embarrassingly parallel tasks. We used the PySpark API as a Python interface to Apache Spark, a modern framework for fast, distributed computing on Big Data. We deployed Apache Spark on NCAR’s Cheyenne and Yellowstone supercomputers and designed and developed a Python package (spark-xarray) to bridge the I/O gap between Spark and scientific data stored in the netCDF format. We evaluated use cases representative of analyses commonly performed by atmospheric and oceanic scientists, such as temporal averaging and the computation of climatologies.
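As a minimal sketch of the general pattern (not the spark-xarray API itself), the example below distributes netCDF file paths across Spark executors, opens each file with xarray, and computes a temporal mean; the file names and the variable name ("air") are hypothetical.

    import xarray as xr
    from pyspark import SparkContext

    def temporal_mean(path):
        """Open one netCDF file and average the hypothetical 'air' variable over time."""
        with xr.open_dataset(path) as ds:
            return path, ds["air"].mean(dim="time").values

    sc = SparkContext(appName="netcdf-temporal-mean")

    # Hypothetical per-year netCDF files; each Spark task processes one file,
    # so the job parallelizes trivially across the file list.
    paths = ["air.%d.nc" % year for year in range(2000, 2010)]
    means = sc.parallelize(paths, len(paths)).map(temporal_mean).collect()
    sc.stop()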
Our experience with Apache Spark for climate data analysis indicates that combining emerging Big Data frameworks such as Apache Spark with the Python data analysis stack offers a useful high-performance alternative for distributed parallel computing.
Mentors: Kevin Paul, Davide Del Vento