CISL Annual Report: FY2021

The NCAR Computational & Information Systems Lab continued delivering world-class support to the Earth system science community in fiscal year 2021, planning for delivery of a new supercomputing system and making advances in data science and other areas despite the many challenges presented by the ongoing pandemic.

Among the most significant milestones was the conclusion of an 18-month procurement effort with the January announcement that Hewlett Packard Enterprise (HPE) would deliver the next NCAR supercomputer. The 19.87-petaflops Derecho cluster will offer 3.5 times the computational power of Cheyenne. The new system will get 20% of its sustained computing capability from graphics processing units (GPUs), with the remainder coming from traditional central processing units (CPUs). The system’s name was selected from more than 200 entries submitted by Wyoming K-12 students in a naming contest. Plans to deploy Derecho in the fall of 2021 were delayed as a result of the pandemic-induced global chip shortage.

As part of our research and development activities, CISL staff also worked with NCAR’s High Altitude Observatory personnel and collaborators at the Max Planck Institute and the University of Delaware to optimize the radiation transport solver of the MURaM model to run on GPUs. The resulting paper described challenges and strategies for accelerating the multi-physics, multi-band MURaM code using OpenACC in order to maintain a single source code across CPUs and GPUs. This will help pave the way for improvements in simulations of the solar chromosphere.

Many other developments in FY2021 related to ongoing preparation for exascale computing; advancements in data science, management and curation; contributions to development of a new NCAR IT Center of Excellence; and programs to serve the learning needs of the user community.

Report contents

Delivering Diverse and Cutting-Edge HPC Environments

Preparing for Exascale Computing

Data Repositories and Services

Advancing Data Science

NCAR-Wide IT Services

Delivering Diverse and Cutting-Edge HPC Environments

CISL continued the procurement and deployment of NCAR's next generation of HPC and storage resources. Procurement activities and schedule are detailed in the Derecho Project Execution Plan. Vendor negotiations were concluded in Q1 as planned, and the subcontract with HPE was awarded in Q2. However, plans to conduct pre-delivery factory trials, install the test systems, and have the production systems delivered and installed by Q4 could not proceed due to the global semiconductor shortage. After discussions with HPE, the schedule was delayed by six months.

When Derecho arrives, CISL will provide advanced application profiling, benchmarking and performance improvements, and user environment design and implementation in support of system deployments, model developers, and the user community. As planned, we completed planning for the scheduling and monitoring approaches to be used for Derecho, as well as for the user and software environment.

CISL also takes advantage of new HPC systems and services to provide computing and data resources – including those in commercial and on-premises clouds – to further push the boundaries of Earth system science. We regularly investigate and evaluate new and emerging compute and data technologies for Earth system science workflows, and we develop and integrate tools – especially AI/ML-focused tools and techniques – for automation and management of compute and data systems.

In Q1 FY21, we successfully completed the testing and deployment of “Bifrost,” the new Ethernet backbone for the HPC environment. We also successfully completed a suite of application compilation and execution tests on the ARM processor architecture, and testing for the CISL Unified Login Environment (CULE). CULE testing included network unicast and cross-link bonding to ensure high-availability networking, implementing lightweight virtual machines, and researching high-availability storage with Gluster file system.

In anticipation of the Derecho system’s arrival, in FY21 CISL constructed and commissioned a number of NWSC facility capacity upgrades, and designed, constructed, and commissioned the system fit-up for the Derecho system. The NWSC is also undergoing a refresh to the original NWSC visitor center educational content, which dates back to 2010. Details are provided in CISL's FY21 Research Computing Plan and the NWSC-3 Project Execution Plan. Highlights from FY21 included beginning the facility fit-up work for Derecho in Q1 and completing the electrical capacity upgrades to support Derecho and Module A in Q3. For the NWSC visitor center, vendors were solicited in Q1 FY21, and the design phase of the project began in Q2.

Operational Activities

CISL continues to operate, maintain, and upgrade state-of-the-art supercomputing, data storage, data analysis, visualization, and networking services and resources customized for Earth system science workloads. CISL currently operates Cheyenne, a petascale supercomputing system; Casper, a data analysis, visualization and machine learning resource; GLADE, a high-performance central file system; and Campaign Storage, a mid-performance, medium-term storage tier. At the end of FY21, the HPSS tape storage system was officially decommissioned after 10 years of service.

In support of the current Cheyenne environment, CISL conducted several smaller scale procurement activities. As planned, tape media was procured to expand the capacity of the Quasar archive in Q2, and Campaign Storage was augmented with additional capacity in Q3. However, CISL deferred an augmentation of the Stratus object storage system since existing capacity was sufficient to meet the needs for the year.

In support of our user community, CISL provides front-line and advanced support for users of our HPC, analysis, and storage services environment, including allocations, help desk, consulting, training, and documentation. The Research Computing help desk received 6,046 tickets in FY21. Of those tickets, 71% received a first response within 1 hour, and 94% were resolved within 5 business days. Users provided per-ticket satisfaction ratings (1-5 stars) for 926 of the FY21 tickets; the average rating was 4.9 out of 5.

At the October 2020 CHAP meeting, the panel considered 42 requests from 31 different universities and research institutions asking for more than 292 million core-hours on Cheyenne in total. The panel also reviewed the biennial CESM Community request, which targeted 622.5 million core-hours spanning both Cheyenne and the forthcoming Derecho system. The CHAP recommended support for projects totalling 208 million core-hours on Cheyenne and 1.2 PB of Campaign Storage. The NSC Panel met in October 2020 to discuss the four proposals submitted, which requested a total of just under 54 million core-hours, and the panel recommended supporting all four requests in full.

At the April 2021 CHAP meeting, the panel considered 30 requests totalling nearly 320 million core-hours on Cheyenne. The panel recommended awarding 231 million core-hours on Cheyenne and 1.7 PB of Campaign Storage. The NSC Panel also met in April to consider four proposals requesting a total of 69 million core-hours on Cheyenne. The NSC panel recommended awards totalling 54 million core-hours.

CISL also continued to operate NWSC during FY21, providing NWSC-based remote hands, first-level systems administration, and 7x24 monitoring, and ensuring that NWSC facilities maintain energy efficiency and optimal configuration in support of NCAR's HPC and storage environment.

Preparing for Exascale Computing

As part of leading Earth system science modeling toward the exascale, CISL works collaboratively to catalyze the adoption of machine learning techniques into the community scientific enterprise. In FY21, CISL made progress on its planned machine learning milestones. In Q1, the original notebooks and slides from the AI4ESS summer school were widely disseminated, and a new web portal with new tutorial notebooks and updated data is in progress. In collaboration with the NOAA Hazardous Weather Testbed, two deep learning storm mode classifiers were tested in real time, and this work was presented at the NOAA AI workshop. Hybrid wave-propagation and a deep learning image segmentation method were implemented for the HOLODEC airborne instrument and GECKO chemistry platform, and GECKO emulators have been further optimized. Finally, FastEddy simulations were run successfully with an ML surface layer parameterization included. A number of journal papers related to these activities are being prepared.

In preparation for the arrival of NWSC-3, CISL has been working to ensure the readiness of several NCAR models. In Q1 FY21, CISL and collaborators from HAO, the Max Planck Institute, and the University of Delaware completed the refactoring of MURaM for GPUs using OpenACC, including refactoring and optimizing the MURaM radiative transport scheme design to better model the chromosphere. Initial testing and optimizations were conducted for the scalability and performance of MURaM across multiple nodes and multiple GPUs. Work continues on fast Fourier transform (FFT) optimizations, using either a multi-threaded approach on the CPU or a port to the GPU using highly efficient FFTs for exascale (heFFTe). Additional work is being done to increase the occupancy of large kernels such as MHD and to reduce kernel launches in RT. Also in progress are optimizations in I/O (exploring the use of DataSpaces or ADIOS2) and the analysis workflow.

CISL also worked to increase the production readiness of the MPAS suite of models on GPU systems. As part of the EarthWorks activity, CISL identified additional physics packages that need to be ported to GPUs, including Thompson physics, RRTMGP, and MG3. A coupled MPAS-A/MPAS-O ESM (EarthWorks) test configuration was defined, and an AGU abstract was submitted. GPU resources for running EarthWorks simulations were secured via an ASD proposal submitted with Colorado State University and MMM. In Q4, CISL and collaborators also completed the refactoring and optimization of MPAS-A for GPU using OpenACC for IBM/The Weather Company.

CISL also evaluated the suitability of MOM-6 for GPU execution. CISL staff determined that the barotropic solver routine does not scale well. Although they achieved correct results for the solver using an OpenACC compiler, the poor scaling led them to redirect efforts toward the EarthWorks project. The MOM-6 findings were reported at the Multicore Workshop in late 2020.

As part of more general efforts to evaluate the portability pathways of the emerging exascale hardware and software stack, CISL developed several proxy applications representing common Earth system modeling paradigms that can be used to study exascale programming pathways. In particular, the SWM application was used as a testbed for exascale pathways. SIParCS students used this application to explore Kokkos and Intel OneAPI. CISL also contributed to enabling GPU capability in important parameterizations for CAM as part of the SIMA effort. A GPU acceleration version of the MG3 physics package was reintegrated into the upstream code repository, and further testing of the updated version is ongoing. 

As part of preparing for exascale, CISL also works with partners to optimize edge computing applications when bottlenecks appear. In Q3, CISL reached out to NCAR and the community to identify additional edge-computing bottlenecks in atmospheric science. NCAR’s Exascale Tiger Team conducted a survey, and analysis of the survey data is in progress.

With all of its exascale efforts, CISL closely integrates efforts to develop an exascale-ready workforce. CISL staff mentored SIParCS students and interns from the University of Wyoming to work on porting IDL analysis scripts to Python accelerated for GPUs. CISL also held a number of hackathons, summer schools, and workshops in support of our exascale initiatives. These included a GPU Tutorial Series hosted by CISL in April 2021, a tutorial on programming GPUs using OpenACC directives at the RMACC 2021 Symposium, and a Trustworthy AI for Environmental Science Summer School in collaboration with the AI2ES Institute.

Finally, in partnership with CGD and Penn State University, CISL was awarded, and leads, a three-year NSF EarthCube grant called Project Raijin (https://raijin.ucar.edu/). The project goals are to develop scalable Python tools for the analysis of storm-resolving, next-generation climate and global weather models.

Data Repositories and Services

NCAR-wide data management planning and data stewardship are essential activities to ensure that data resulting from NCAR proposals and core-funded projects are carefully archived and have data management plans. Among CISL’s key roles is managing the Digital Asset Services Hub (DASH) and associated data repositories through which all relevant NCAR-produced data are submitted for archiving, discoverability, and open data access.

In Q4 we published an annual report listing proposals that contacted DASH for DM planning, projects that submitted data, and DM services payments received. The DASH Annual Report noted that ACOM, CGD, CISL, HAO, MMM, and RAL all received DM support for a total of 101 proposals; 118 existing data sets were archived; and $227,187 was collected for data storage and processing.

Also in Q4, the Research Data Archive (RDA), Climate Data Gateway (CDG), and DASH repositories were configured to store and serve data directly from object storage resources for selected data sets. In Q4, we completed initial steps toward expanding DASH repository capabilities as part of the Geoscience Data Exchange (GDEX) project, which will eventually support data submissions by external NSF-funded projects including the Community Instrument Facilities (CIFs) and others. We also moved CDG HPSS-based holdings to the NCAR Campaign Storage system and migrated selected holdings to Stratus, the CISL object storage disk system. These activities are among the ongoing, multi-year tasks of curating and preserving these high-value data collections. RDA metrics show that 13,500 unique users transferred 7.2 PB of data through various access pathways in FY21, including direct web file download, HPC-driven subsetting, THREDDS data access services, and Globus data transfers. A total of 270 peer-reviewed articles or books formally cited RDA data sets during FY21, and seven new data set collections were added to the RDA to bring the total archive size to 4.4 PB.

CISL collaborates with CU Boulder and the NCAR Library under a data-partnership memorandum of understanding (MOU). Under this MOU, CU continues to provide access to journal subscriptions; the CU PetaLibrary has off-site backup at NCAR; and a community of practice has been established for scientific data management. Work continues on making the CU PetaLibrary backup operational, and Core Trust Seal applications have been submitted from both organizations. 

While an FY21 Stratus upgrade had been anticipated, existing capacity proved to be sufficient for current needs. Additional NCAR Campaign Storage capacity was procured in FY21, however, and additional space is expected to become available early in FY22.

Advancing Data Science

To support community needs, CISL expanded the capabilities of big-data tools such as GeoCAT and VAPOR in FY21. The release of GeoCAT 1.0 included the majority of NCL’s climate-specific analysis and processing functionality. The GeoCAT team adopted a continuous-delivery model, with new releases of the software made every month. To support CISL’s “Preparing for Exascale Computing” efforts, the GeoCAT team explored and applied Dask scalability to the GeoCAT stack and also provided tutorials and CISL talks on creating Python software for big data and HPC. Around 40 climate and climate-related data operators were developed. VAPOR also moved to more of a continuous-delivery model, with “stamped” releases made roughly every three months. Support for particle visualization was added in FY21. Python API work is still in progress.
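
The Dask scalability applied to the GeoCAT stack rests on chunked, out-of-core computation: operating on one manageable block of a large data set at a time. A minimal NumPy sketch of that pattern follows (illustrative only; GeoCAT's actual Dask integration builds distributed task graphs rather than an explicit loop, and the function name here is hypothetical):

```python
import numpy as np

def chunked_mean(data, chunk_size):
    """Compute a mean one chunk at a time -- the out-of-core pattern
    that Dask automates for big-data analysis (illustrative only)."""
    total, count = 0.0, 0
    for start in range(0, data.shape[0], chunk_size):
        chunk = data[start:start + chunk_size]  # only one chunk in memory at a time
        total += chunk.sum()
        count += chunk.size
    return total / count

# Synthetic "global temperature" field: 365 daily steps on a 90x180 grid
rng = np.random.default_rng(0)
field = rng.normal(loc=288.0, scale=10.0, size=(365, 90, 180))

# The chunked result matches the all-in-memory result
assert np.isclose(chunked_mean(field, chunk_size=30), field.mean())
```

Dask generalizes this idea: it partitions arrays into chunks, schedules the per-chunk work across cores or nodes, and combines the partial results, which is what allows GeoCAT operators to scale to data sets larger than memory.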

Enhancements of DART-based data assimilation capabilities were also in development to support community and NCAR science objectives. These included progress toward the release of interfaces between DART and the updated CESM CAM-Spectral Element models, and between DART and the MPAS global/regional models, to support data assimilation for prediction/predictability research. The MPAS interfaces are on the trunk of the DART GitHub repository, while the CAM/SE interfaces reside on a branch pending identification of collaborators to perform a scientific validation of the SE interface. A DART/CAM6 reanalysis for the period 2010 to 2019, which provides ensemble initial conditions for forecasts with CESM models, is now available from NCAR data repositories, and a peer-reviewed paper describing the project has appeared.

The first release of Python tools for lossy data compression of MPAS unstructured data was made in Q3, and the results were presented at the EarthCube 2021 Annual Meeting. The workflow developed for lossy compression of MPAS data can be viewed in the GitHub repository. Work also continued on development of a prototype “Binder for HPC” technology to share runnable Jupyter Notebooks proximate with locally stored NCAR data assets. Components developed to date include a Singularity-based replacement for the repo2docker component.
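
To illustrate the idea behind lossy compression of model output, one common approach is uniform quantization, which trades a bounded loss of precision for a large reduction in storage. The sketch below is a hedged stand-in, not necessarily the algorithm used by the MPAS compression tools:

```python
import numpy as np

def quantize(values, n_bits=8):
    """Uniformly quantize float values onto 2**n_bits levels -- a simple
    stand-in for lossy compression of unstructured model data
    (illustrative; the actual MPAS tools may use different algorithms)."""
    vmin, vmax = values.min(), values.max()
    levels = 2 ** n_bits - 1
    scale = (vmax - vmin) / levels if vmax > vmin else 1.0
    codes = np.round((values - vmin) / scale).astype(np.uint8)
    return codes, vmin, scale

def dequantize(codes, vmin, scale):
    """Reconstruct approximate values from the integer codes."""
    return codes.astype(np.float64) * scale + vmin

data = np.linspace(250.0, 310.0, 1000)   # e.g., temperatures on MPAS cells
codes, vmin, scale = quantize(data, n_bits=8)
restored = dequantize(codes, vmin, scale)

# 8-bit storage (vs. 64-bit) at the cost of an error bounded by half a level
assert np.max(np.abs(restored - data)) <= scale / 2 + 1e-9
```

The appeal for large model output is the predictable error bound: the maximum reconstruction error is half a quantization step, which users can weigh against an 8x reduction in bytes per value.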

CISL also worked to further enable science at the scale of enormous data sets by converting a large subset (~500TB) of Community Earth System Model version 2 Large Ensemble (CESM2-LE) to Zarr in a collaboration with CGD and others, and by creating an Intake-ESM inventory. CESM2-LE was published on Stratus in Q2 and also published in the cloud with an associated Jupyter Notebook. CISL also received a free allocation of 500TB of AWS S3 storage for ongoing cloud optimization. In Q3, the primary atmosphere and ocean variables from DART CAM6 Reanalysis were converted to Zarr, and the data set was published in the cloud in Q4 with an associated Jupyter Notebook. The reanalysis also was published on Stratus in Q4.
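
The appeal of the Zarr layout used in these conversions is that each chunk of an array is stored as an independently compressed object, so a cloud or object-storage read can fetch only the pieces it needs. The sketch below illustrates that idea with a plain dict as a stand-in store; the actual conversions used the zarr and xarray libraries, and the key and metadata names here only loosely follow the Zarr v2 layout:

```python
import json
import zlib
import numpy as np

def to_zarr_like(array, chunk):
    """Write a 2-D array into a dict store using a Zarr-v2-like layout:
    keys "i.j" map to compressed chunk bytes, plus JSON metadata.
    Illustrative only; real conversions use the zarr/xarray libraries."""
    store = {".zarray": json.dumps({
        "shape": list(array.shape), "chunks": list(chunk),
        "dtype": str(array.dtype)}).encode()}
    for i in range(0, array.shape[0], chunk[0]):
        for j in range(0, array.shape[1], chunk[1]):
            block = np.ascontiguousarray(array[i:i+chunk[0], j:j+chunk[1]])
            store[f"{i//chunk[0]}.{j//chunk[1]}"] = zlib.compress(block.tobytes())
    return store

data = np.arange(40.0).reshape(8, 5)
store = to_zarr_like(data, chunk=(4, 5))

# One chunk can be read back without touching the rest of the store
meta = json.loads(store[".zarray"])
first = np.frombuffer(zlib.decompress(store["0.0"]),
                      dtype=meta["dtype"]).reshape(4, 5)
assert np.array_equal(first, data[:4, :5])
```

An Intake-ESM inventory then layers a searchable catalog over many such stores, letting users locate and open variables without knowing the underlying paths.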

CISL promoted the adoption of new data tools and technologies through a variety of outreach, training, and support activities in FY21. These included a series of 18 virtual Python tutorials held on the second and fourth Wednesdays of each month. Attendance at the tutorials varied from 60 to 240. The tutorials were recorded and shared online.

CISL was awarded a three-year NSF EarthCube grant for Project Pythia: a partnership with CGD, Unidata, and the University at Albany aimed at providing web-accessible training resources to help Earth scientists learn how to more effectively use and navigate the Scientific Python ecosystem. In FY21, CISL launched the project web site (https://projectpythia.org/) and developed much of the foundational content for this community resource. The recorded Python tutorials described above are now a significant component of Project Pythia.

The Regional Integrated Science Collective (RISC) continued to support actionable science through regional climate modeling tailored to stakeholder needs in FY21. During FY21, the RISC group moved from CISL to become part of NCAR’s Research Applications Laboratory. Via the NA-CORDEX and NARCCAP programs and others, CISL interacted with stakeholders, provided guidance on the proper use of regional climate model data, conducted differential credibility analysis of models, and submitted model outputs to data repositories. A summary of RISC activities by FY21 quarter follows. 

Q1

  • Analysis of snow in NA-CORDEX: evaluation of bias in historical simulations. Paper is in review. Initially submitted to Climatic Change October 2020; resubmitted in August 2021.
  • Comparison of future change in snow over North America across multiple climate MIPS (CMIP5, NA-CORDEX, CMIP6). Paper is in review. Initially submitted to Climatic Change October 2020; resubmitted in August 2021.
  • Impacts of land-cover change under specific SSPs (SSP3 and SSP5), in contrast to increased greenhouse gases alone, on future climate over the conterminous US, based on simulations using WRF. Analysis of extremes has begun.
  • Curated the output from the 12 km FACETS runs with WRF and RegCM. Results are in place on disk and ready for publication.
  • Climate Services paper documenting NA-CORDEX data archive submitted and published.
  • Analysis of changes in simultaneity of very large wildfires in NA-CORDEX simulations. Results were presented at Fall AGU and have featured prominently in stakeholder engagement activities for the NSF Convergence project.
  • Started testing of new WRF configurations for new land-cover change simulations.
  • Continued work on IPCC WGI Atlas section North America.

Q2

  • Analysis of current-climate and climate-change results from regional models and a variable-resolution global model is in progress, to be followed by differential credibility analysis across the simulations for the Southern Great Plains (SGP). CISL plans to produce a paper on the results in collaboration with PNNL.
  • Developed a Jupyter Notebook that calculates freeze-thaw days using the zarrified versions of the NA-CORDEX data and the Pangeo environment to run in parallel.
  • Analysis of future changes in extra-tropical cyclones in CMIP6 over CONUS is in progress. All CMIP6 data has been obtained; extra-tropical cyclones have been tracked in the historical and future climate simulations; and statistics on cyclone frequency and intensity have been calculated. A paper is being drafted for submission to a special issue of Earth's Future in Spring 2022.
  • Finished moving data from HPSS (decommissioned) to NCAR Campaign Storage.
  • Completed final draft of IPCC WG1 Atlas section on North America. The IPCC WGI Report (including the Atlas) was released in August 2021.
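
The freeze-thaw calculation from the Q2 list above can be sketched serially with NumPy. This is a stand-in for the Jupyter Notebook that runs the same logic in parallel on the zarrified NA-CORDEX data via the Pangeo environment; the 0 °C crossing definition and the variable names are assumptions:

```python
import numpy as np

def freeze_thaw_days(tmin, tmax, threshold=0.0):
    """Count days whose minimum falls below and maximum rises above the
    freezing threshold (deg C), per grid point. Serial NumPy stand-in for
    the parallel zarr/Pangeo notebook; the threshold definition here is
    an assumption."""
    crossing = (tmin < threshold) & (tmax > threshold)
    return crossing.sum(axis=0)  # count along the leading time axis

# Three days at two grid points: only days with tmin < 0 < tmax count
tmin = np.array([[-5.0, 1.0], [-1.0, -2.0], [2.0, -3.0]])
tmax = np.array([[ 3.0, 8.0], [-0.5,  4.0], [9.0, -1.0]])
print(freeze_thaw_days(tmin, tmax))  # → [1 1]
```

In the parallel version, the same element-wise comparison and time-axis reduction map naturally onto Dask chunks of the Zarr store, which is what makes the Pangeo workflow scale.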

Q3

  • Completed comparison of African easterly waves in TRMM vs. IMERG. Analysis documenting how IMERG and TRMM are different will be contributed to a student-led paper from the University of Oklahoma.
  • Provided consultation for zarrification of the DART data set and for GeoCAT function development.
  • Submitted Atlas review comment responses (IPCC WG1) to the Technical Support Unit.

Q4

  • Analysis of future changes in precipitation phase (rain/snow) within extra-tropical cyclones in CMIP6 is in progress. Code has been created to extract precipitation within extra-tropical cyclones.

NCAR-Wide IT Services

CISL worked across the organization with the CIO and the NCAR Directorate to develop an NCAR IT Center of Excellence to coordinate management, budget, and administration activities. This began with the formation of an advisory committee for the NCAR Directorate and IT leadership, and a proposal to combine IT groups within NCAR was sent to the NCAR Director for review in Q1. The proposal outlined an advisory committee to oversee decisions and help provide scientific direction to the proposed NCAR Research IT program. In Q2, we identified initial key priorities in the NCAR Research IT program proposal to align NCAR's mission and IT needs across the seven labs.

CISL promotes NCAR-wide cloud services work in the broader UCAR Enterprise IT environment to develop cloud services and governance that enable NCAR research moving to the cloud. In Q1, working toward development of NCAR requirements to enable science efforts that can leverage IaaS cloud services, we formed a Cloud Community of Practice in conjunction with UCAR Enterprise IT, whose teams will handle cloud brokering. This community will involve a variety of technical, scientific, systems, and software engineering staff, and its meetings will rotate among specific technical challenges, cloud brokering, and science in the cloud.

Cross-organizational meetings including UCAR's Enterprise Infrastructure and Platforms team began in Q3 to identify requirements for cloud computing. Quarterly cloud strategy meetings started on June 23. Cloud Communities of Practice meetings are expected to start in calendar Q4 2021 and occur at least monthly.