SIParCS 2017-Nicolas Rodriguez Jeangros
Informing the Prediction of Compression Method and Level for Climate Models Using Variable Features
Increased computing power makes it possible to run larger Earth system model ensembles with higher output frequency, finer spatial resolution, and longer simulation lengths. These improvements produce massive datasets that strain institutional storage resources. Lossy compression methods are a promising option because they can achieve high compression ratios and thereby reduce dataset sizes. Previous work has demonstrated that combining several lossy compression methods produces better results overall, because the choice of method can be tailored to the characteristics of each variable. Currently, determining the optimal compression method and level for each variable is computationally expensive, because it involves compressing and reconstructing each variable exhaustively for every possible compression method and level. The optimal combination is then the one that achieves the highest compression while still satisfying the quality criteria. The goal of this project is to streamline this process by characterizing each variable through features that serve as predictors in a regression model that predicts the optimal compression level automatically. We analyze a large ensemble of annual averages of 198 variables from the Community Earth System Model (CESM), with the final goal of informing a multinomial regression model that predicts compression levels for the fpzip compression method. Here we summarize the features, which range from simple statistics to smoothness and clustering indicators, analyze their variability across ensemble members, and present a preliminary evaluation of their correlation with the fpzip compression levels.
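The exhaustive search described above can be illustrated with a minimal Python sketch. It is not the project's code and does not call fpzip: as a stand-in, a "level" is taken to be the number of leading bits kept in each 32-bit float, with zlib applied afterwards to measure size. The function names, the candidate levels, and the maximum-relative-error tolerance are all hypothetical choices made for illustration; the actual quality criteria are not specified here.

```python
import math
import struct
import zlib


def truncate_float32(values, keep_bits):
    """Stand-in lossy step: keep only the top `keep_bits` bits of each float32."""
    mask = (0xFFFFFFFF << (32 - keep_bits)) & 0xFFFFFFFF
    out = []
    for v in values:
        bits = struct.unpack("<I", struct.pack("<f", v))[0]
        out.append(struct.unpack("<f", struct.pack("<I", bits & mask))[0])
    return out


def compressed_size(values):
    """Size in bytes after packing as float32 and applying zlib (not fpzip)."""
    raw = struct.pack("<%df" % len(values), *values)
    return len(zlib.compress(raw, 9))


def max_rel_error(original, reconstructed):
    """Hypothetical quality metric: largest absolute error over largest magnitude."""
    scale = max(abs(v) for v in original) or 1.0
    return max(abs(a - b) for a, b in zip(original, reconstructed)) / scale


def best_level(values, levels=(16, 20, 24, 28, 32), tolerance=1e-3):
    """Try every level; return (level, size) with the smallest size that passes."""
    best = None
    for keep_bits in levels:
        recon = truncate_float32(values, keep_bits)
        if max_rel_error(values, recon) > tolerance:
            continue  # this level violates the quality criterion
        size = compressed_size(recon)
        if best is None or size < best[1]:
            best = (keep_bits, size)
    return best


# Synthetic "variable": a smooth temperature-like series (made-up data).
field = [280.0 + 10.0 * math.sin(i / 10.0) for i in range(1000)]
print(best_level(field))
```

Every candidate level requires a full compress-and-reconstruct pass over the variable, which is the cost that a feature-based prediction of the level is meant to avoid.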
Mentors: Dorit Hammerling, Brian Vanderwende, Doug Nychka