CISL Annual Report: FY2024

In fiscal year 2024 (FY24), the NSF NCAR Computational & Information Systems Laboratory (CISL) continued to deliver on its mission: to support and advance the Earth systems sciences (ESS) by providing world-class computing environments, data services, and research in computational science.

 

FY24 was marked by progress in cloud computing, innovative software releases, a stylish CISL website refresh, and productive collaborations—including a new partnership on the NSF-backed Chameleon testbed with the University of Chicago. FY24 also marked the inaugural year of service for supercomputer Derecho—a massive success that opened up promising new research vistas for the ESS community.

In September 2024, CISL’s International Computing in the Atmospheric Sciences (iCAS) conference in Stresa, Italy, drew nearly 50 attendees from 10 countries to discuss interdisciplinary and international collaboration for ESS advancement. Highlights included a keynote address from Chris Kadow of DKRZ1, as well as four international panels covering artificial intelligence (AI), exascale modeling, sustainability, and performance portability. CISL director Thomas Hauser called the conference an "energizing opportunity" and noted that the iCAS 2024 outcomes will help NSF NCAR shape the computational science strategy in its next strategic plan.
 

Report contents

CISL By the Numbers FY24

Delivering Diverse and Cutting-Edge HPC Environments

Preparing for Exascale Computing

Data Repositories and Services

Advancing Data Science

Outreach and Education

CISL By the Numbers FY24

  • 2,238 HPC users from 364 institutions
  • 626 published works from university HPC users
  • 1,065 attendees at CISL-hosted events
  • 93.64% Derecho system availability
  • 2.08 billion Derecho core-hours delivered
  • 1.1 NWSC Power Usage Effectiveness (PUE)
  • 55,000 unique users of CISL data services
  • 23.8 PB data volume delivered from CISL data services
  • 560 published works citing RDA datasets
  • 40,000 downloads of GeoCAT, UXarray, and VAPOR
  • 27,200 views of VAPOR materials on YouTube
  • 30+ models supported by DART

Delivering Diverse and Cutting-Edge HPC Environments

In FY24, Derecho successfully completed its first year of service for the community. Its user availability averaged 94%, with user utilization at 71%. Since Derecho is more than three times more powerful than the Cheyenne system, 71% utilization means that Derecho was delivering more than two Cheyennes’ worth of computing in its first year. 
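
The "more than two Cheyennes' worth" figure follows from simple arithmetic. A minimal sketch, assuming the systems' published peak-performance ratings (roughly 19.87 petaflops for Derecho versus 5.34 for Cheyenne; these figures are not stated in this report):

```python
# Rough check of the "more than two Cheyennes' worth" claim.
# The peak-performance figures are assumptions taken from published
# system specifications, not from this report.
derecho_peak_pf = 19.87   # Derecho peak, petaflops (assumed)
cheyenne_peak_pf = 5.34   # Cheyenne peak, petaflops (assumed)
utilization = 0.71        # Derecho's average user utilization in FY24

cheyenne_equivalents = (derecho_peak_pf / cheyenne_peak_pf) * utilization
print(round(cheyenne_equivalents, 2))  # prints 2.64
```

Even under a more conservative "three times more powerful" ratio, 3 × 0.71 still exceeds two Cheyenne-equivalents.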

 

Derecho also supported a substantial body of advanced ESS research: 86 large-scale projects from universities and other institutions were awarded approximately 1.5 billion core-hours and 1.3 million GPU-hours.

 

While Derecho assumed its role as NSF NCAR’s powerful new supercomputer, the reliable Cheyenne completed its high-performance computing (HPC) mission to the ESS community. CISL’s High-Performance Computing Division (HPCD) kept the legacy machine running more than a year longer than planned, for a total of over seven years of service. HPCD then undertook the large and complex task of decommissioning Cheyenne and its associated systems, ultimately auctioning the machine in May 2024.

 

With this transition, the team carried out a sweeping refresh of the user documentation, removing Cheyenne-era information and upgrading to new material, as well as leading the effort to migrate user data from the old filesystem to new storage.

 

In FY24, the team learned a great deal about Slingshot, Derecho’s interconnect. In Q2, the team upgraded the Slingshot network software stack, yielding a much more stable network for the community. In Q3, working closely with on-site Hewlett Packard Enterprise (HPE) engineers, the team developed effective tools and procedures to improve Derecho management and maximize system utility. The team also ensured that Derecho met the availability target in the NWSC-3 subcontract (the Next-Generation HPC environment at the NCAR–Wyoming Supercomputing Center).

 

Throughout FY24, the team successfully supported Casper, NSF NCAR's heterogeneous system for data analysis, visualization, high-throughput computing, and artificial intelligence/machine learning (AI/ML) workloads. The team migrated Casper’s operating system across the entire cluster from CentOS to openSUSE, matching Derecho’s operating system and providing greater portability between the two clusters. Casper also gained 14 new nodes during the year: two H100 GPU nodes, six large-memory high-throughput nodes, and six visualization nodes. Overall, Casper’s user availability averaged slightly over 95% in FY24, with an average utilization of 57%, down slightly because of the newly added nodes. With the significant expansion and upgrade of the Bifrost network, the team will be able to deploy many more nodes on Casper.

 

Throughout the year, the team also successfully supported the storage and POSIX (Portable Operating System Interface) file system resources. In Q1, the team upgraded firmware on over 7,000 disks without disrupting users. Staff also systematically and carefully migrated approximately 20 PB of user data from the old, Cheyenne-era parallel file system to new storage systems, a very large undertaking that went smoothly and on schedule. In Q2, the team developed software to calculate and collect innovative metrics for the Lustre filesystem, allowing users to view their per-job filesystem workload. Finally, in Q4, the team integrated a new DataDirect Networks (DDN) redundant array of independent disks (RAID) system into the Campaign Storage filesystem, and added new flash drives to provide better performance for Campaign Storage users.

 

For the NCAR–Wyoming Supercomputing Center (NWSC), the team took measures to ensure optimal power usage effectiveness (PUE) and water usage effectiveness (WUE), averaging a PUE of 1.11 for the year. The team also submitted a report from the Carbon-Free Electricity for NWSC Task Force.
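
PUE is defined as total facility energy divided by the energy consumed by IT equipment alone, so a PUE of 1.11 means roughly 11% overhead for cooling, lighting, and power distribution. A minimal illustration with placeholder readings (not NWSC measurements):

```python
# Power usage effectiveness (PUE) = total facility energy / IT equipment energy.
# The kWh readings below are illustrative placeholders, not NWSC data.
total_facility_kwh = 1110.0  # IT load plus cooling, lighting, distribution
it_equipment_kwh = 1000.0    # servers, storage, and network only

pue = total_facility_kwh / it_equipment_kwh
print(pue)  # prints 1.11
```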

 

Also in FY24, to reduce spending on hard disks, the team deployed Hierarchical Storage Management (HSM) as the tape tier for migrating inactive Campaign Storage data. Throughout the year, the team tuned the system, maintained it for pilot-program users, and adjusted the configuration based on observation and feedback, releasing it as a full-fledged tape tier in Q4.

 

In FY24, CISL saw a record-breaking year for small-scale university requests. Researchers requested a total of 400 small-scale allocations. The total represents more than a 20% increase over FY23, which itself was already a 20% increase over FY17, the previous record year. Small-scale activities have shown impressive growth recently thanks to the availability of data analysis projects and to CISL promotions for classroom projects. 
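
The growth compounds across the two record years. In the sketch below, only the 400-request total and the approximate 20% growth rates come from the report; the implied FY23 and FY17 counts are illustrative back-calculated estimates:

```python
# Back-of-the-envelope estimates of prior-year request counts, given
# FY24's 400 requests and the ~20% growth over each previous record.
fy24_requests = 400
fy23_estimate = fy24_requests / 1.20   # implied FY23 count (~333)
fy17_estimate = fy23_estimate / 1.20   # implied previous record (~278)
print(round(fy23_estimate), round(fy17_estimate))  # prints 333 278
```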

The availability of Derecho and Cheyenne helped the Earth systems science community produce a wide range of new science. University users reported 418 new peer-reviewed publications in FY24, along with 112 other publications, and 96 dissertations. 

Preparing for Exascale Computing

In FY24, CISL made great strides in producing innovative, optimized software for the ESS community. 

 

NSF NCAR has been developing CREDIT (Community Research Earth Digital Intelligence Twin), a research framework for AI numerical weather prediction. CREDIT has been used to train multiple global AI weather models. Beyond fundamental infrastructure development, CREDIT has supported research projects for one postdoc and three graduate students, and once the software is released, it has the potential to support a much broader ecosystem of community AI weather prediction research.

 

The Machine Integration and Learning for Earth Systems (MILES) team has released MILES GUESS (Generalized Uncertainty for Earth System Science), a machine learning uncertainty quantification software package. GUESS is supporting research by undergraduates, graduate students, and interns across application areas including winter weather forecasting, severe storm hazard prediction, and land surface model emulation, and it has enabled investigations into how machine learning uncertainty is linked to physical processes. A paper is under review in the American Meteorological Society journal Artificial Intelligence for the Earth Systems, and a draft manuscript on an ML tornado visualization system is also in progress.

 

Also in FY24, CISL released the GPU port of MURaM (the Max Planck/University of Chicago Radiative MHD code) as a public version. CISL also collaborated with other labs as part of two Accelerated Scientific Discovery (ASD) projects on Derecho:

  • The MURaM ASD simulation was highly successful: it ran on 320 GPUs, nearly all of the GPUs available on Derecho, where the same case would have required roughly 65,000 conventional CPU cores. Running on GPUs also saved approximately 60 tons of carbon.
  • The Cloud Model 1 (CM1) ASD simulation was also highly successful, running on 128 GPUs on Derecho, and saving about 47.5 tons of carbon versus running the same simulation on CPUs.
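
Taken together, the two ASD results imply a striking CPU-to-GPU substitution ratio. This sketch derives it purely from the figures reported above:

```python
# Implied hardware substitution and combined carbon savings for the
# two ASD simulations, derived only from the figures reported above.
gpus_used = 320
cpu_cores_equivalent = 65_000
cores_replaced_per_gpu = cpu_cores_equivalent / gpus_used  # ~203 cores per GPU
combined_carbon_savings_tons = 60 + 47.5                   # MURaM + CM1
print(round(cores_replaced_per_gpu), combined_carbon_savings_tons)  # prints 203 107.5
```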

 

CISL has been conducting discussions with several organizations about an international community of practice in code modernization. In June, CISL hosted a visit from DKRZ to discuss this topic, and at CISL’s iCAS conference in September, staff continued these conversations with representatives of international organizations such as the United Kingdom Met Office and ECMWF.2

 

Other software modernization accomplishments in FY24:

  • The GPU port of Cloud Layers Unified By Binormals (CLUBB) is complete, and has been incorporated within both the Community Atmosphere Model (CAM) and EarthWorks code bases. 
  • The team is evaluating the performance of CAM and porting physics routines, including improving the performance of the CAM interface code for CLUBB and PUMAS/MG3.3
  • The team has conducted real-time tests of a machine learning (ML) precipitation type algorithm and associated visualizations, validating it with Winter 2023–24 cases, and has developed interactive visualization capabilities for it.
  • The team has incorporated a microphysics bin emulator into PUMAS code, and has run it successfully in CAM for multiple years. A paper and further analysis are in preparation. 
  • The Model-Independent Chemical Module (MICM) chemistry code was successfully ported to GPUs. 
  • A new version of the Earth Computing Hyperparameter Optimization (ECHO) package has been released with Derecho-focused optimizations. ECHO supported an ASD project on Derecho and helped identify multidimensional Pareto fronts for optimizing ML models.
  • Work to apply the ML algorithm for the holographic detector for clouds (HOLODEC) to a full field campaign of data is in progress. With the assistance of a SIParCS student, the team retrained the HOLODEC ML model and improved the pipeline’s performance, and it expects to process the full field campaign in late 2024 or early 2025.
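
The Pareto fronts that ECHO identifies can be illustrated with a minimal dominance filter. The sketch below is a generic two-objective example over hypothetical (error, cost) pairs, not ECHO's actual implementation:

```python
# Minimal Pareto-front filter for objectives to be minimized, e.g.
# (validation error, inference cost) per hyperparameter trial.
# Illustrative only; ECHO's real implementation is more general.
def pareto_front(points):
    """Return the points not dominated by any other point (minimization)."""
    front = []
    for p in points:
        # p is dominated if some other point is no worse in every objective.
        dominated = any(
            q != p and all(q[i] <= p[i] for i in range(len(p)))
            for q in points
        )
        if not dominated:
            front.append(p)
    return front

# Hypothetical (error, cost) results from five tuning trials.
trials = [(0.20, 5.0), (0.25, 2.0), (0.30, 1.0), (0.28, 4.0), (0.22, 6.0)]
print(pareto_front(trials))  # prints [(0.2, 5.0), (0.25, 2.0), (0.3, 1.0)]
```

The surviving points trace the trade-off curve: each improves one objective only at the expense of the other.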

Data Repositories and Services

Through multiple forward-looking initiatives, the CISL-managed Data Services embraced its mission to support open data access and next-generation workflows and analyses in FY24. 

 

The JRA-3Q dataset4 is now available for public access, fully ingested and archived in the Research Data Archive (RDA). 

 

In FY24, CISL teams continued efforts to migrate all NSF NCAR data repositories into an integrated research data commons. 

  • By Q3, all RDA datasets were migrated to new Geoscience Data Exchange Data Commons (GDEX-DC) storage logic. 
  • CISL has coordinated with staff in NSF NCAR’s Climate Data Guide (CDG) lab to migrate CDG datasets to either the RDA or individually managed Globus collections.
  • The Data Stewardship Engineering Team (DSET) also evaluated NSF NCAR common data repository architecture needs, producing an original published work: NSF NCAR Data Stewardship Engineering Team—Common Repository Architecture Recommendations. This report will define dataset characteristics and inform the design of the repository infrastructure to support the NG-GDEX data commons. 

Meanwhile, CISL has also been rolling out support for analysis-ready and artificial intelligence-ready (AR/AI) data. Existing AR datasets, including the NA-CORDEX5, CESM2-LENS6, and CAM6-DART7 Reanalysis, have been archived in the RDA. The RDA has also implemented AR/AI-ready access for two datasets: the ERA58 and CONUS4049 catalogs have been paired with Zarr API access to support convenient and scalable data analysis for both. These capabilities simplify access for users running workflows from the Casper data analysis nodes, the JupyterHub service, and the Derecho HPC system. Additionally, the team has provided scalable analysis access for remote users reaching these datasets either through anonymous HTTPS or through NSF NCAR’s integration with the Open Science Data Federation. NSF NCAR’s DASH Search (Digital Asset Services Hub) now supports AR/AI dataset discovery, with a “collections” resource added to the search to support the discovery of cloud-optimized datasets.

Advancing Data Science

In FY24, CISL achieved many data science milestones, including expanding its on-premises (“on-prem”) cloud resources to support a larger community of friendly users and growing those resources to meet use cases defined by the community. The prototype cloud resources now host services including documentation, a customizable JupyterHub with Dask Gateway and GPU access, interactive web visualizations, and more. CISL plans to transition the cloud into production in 2025.

 

CISL worked with NSF NCAR’s Earth System Data Science (ESDS) community of practice, whose members brought their use cases to the on-prem cloud for deployment. Staff also led numerous workshops promoted through ESDS (available on the CISL YouTube channel) and attended the ESDS Annual Event to present the available on-prem cloud services.

 

To advance outreach to the broader community, the GeoCAT10 and VAPOR11 teams created and updated comprehensive training resources, focusing on tutorials and workshops. They presented their tutorial, “Visualizing 2D and 3D Geoscience Data in Python,” at the AGU12 and AMS13 Annual Meetings. The Project Raijin team also presented a UXarray tutorial, “Unstructured Grids Visualization with UXarray with MPAS Focus,” at the Joint WRF/MPAS Users Workshop 2024.14

 

FY24 was also significant for Project Raijin’s UXarray, an NSF-funded effort to develop Python tools supporting the analysis of Global Storm Resolving Model outputs. The team encouraged greater community ownership and user contributions by prioritizing immediately usable visualization functions and community-requested analysis operators in its development roadmap. Throughout the year, the team released several UXarray versions with plotting functionality and community-prioritized analysis operators.

 

During FY24, VAPOR Python became a fully supported application within CISL’s JupyterHub environment for GPU-equipped nodes, and the documentation was updated with the VAPOR Python API class reference, a quick-start guide, and examples. VAPOR also gained a code-signed, native build for macOS on Apple Silicon.

 

To accomplish the important goal of community ownership for GeoCAT, Project Raijin, and UXarray, the team is deploying DevOps infrastructure and onboarding material so that the user community can contribute directly to these code bases. The team carried out numerous measures toward this goal, including surveying its users, collaborating with the CUPiD team, holding sessions at ESDS, and sponsoring forums for the wider community.

 

Toward helping the community take ownership of UXarray, the team sponsored an online discussion to help select UXarray’s current design, an extension of the Xarray class. Another online discussion on how to prioritize data analysis functions also drew broad community attention. This input led the team to prioritize gradients, zonal averaging, and regridding, the functionalities for which the community voiced the highest demand.

 

Interactive visualization techniques in Python represent a significant development goal for the GeoCAT and VAPOR teams. The teams aim to provide the broader community with clear, comprehensive Jupyter notebooks compatible with guided tutorials or self-learning activities. As such, the teams have published two Pythia cookbooks, “Advanced Visualization Cookbook” and “Unstructured Grids Visualization Cookbook.”

The GeoCAT and VAPOR teams also aim to optimize their existing two-dimensional and three-dimensional (2D/3D) visualization and data analysis tools for large-scale, cloud-hosted workflows. As such, the teams are exploring opportunities with cloud services and with NSF NCAR’s HPC resources, combined with benchmarking technologies. 

 

In FY24, the DART15 team pushed the standard-model DART grid interface to GitHub, where it is available for review by the University of Michigan Aether model Principal Investigators. The team also successfully delivered a first-level release of DART with QCEFF16 and hybrid filter capabilities. The QCEFF work is complete and has been applied to a variety of assimilation tests in numerical weather prediction (NWP), atmospheric chemistry, space weather, and parameter estimation applications. Meanwhile, the hybrid filter is in progress, with the hybrid code committed to a DART branch and visible to the community. The team has also made progress toward its goal of completing cycling assimilation experiments with DART, the CESM,17 and the MOM6.18 In September, a research team led by Jeffrey Anderson published a journal article in Monthly Weather Review detailing ensemble filter methods for ensembles with duplicate values.

Outreach and Education

To highlight its groundbreaking research and resources in Earth systems sciences, NSF NCAR and CISL conducted a slate of activities at the Supercomputing Conference 2023 (SC23) at the Colorado Convention Center, November 12–17, 2023. The NSF NCAR booth featured presentations, opportunities to meet with CISL staff experts on a range of topics, and an in-booth reception to welcome CISL’s new director. Likewise, in April 2024, UCAR’s Software Engineering Assembly (SEA) sponsored the 2024 Improving Scientific Software (ISS) conference in Boulder, CO, focusing on the modernization of scientific software and what that means for the ESS community.

 

Aligning with its efforts to support access, culture, and opportunity within the HPC environment, CISL targeted increased participation by community members from non-R119 educational institutions. Outreach efforts included increased engagement with UCAR’s Historically Black Colleges and Universities (HBCU) members; NSF NCAR’s Rising Voices Center for Indigenous and Earth Sciences; attendance at the Richard Tapia Celebration of Diversity in Computing Conference; and more.

 

To uphold its mission to the larger community, the NWSC Visitor Center hosted a total of 793 visitors throughout the year. The year featured numerous student and public tours, including visits from organizations such as the Department of Homeland Security, Beowulf Electricity & Data Inc., and UCAR. K–12 students, including visitors from Cheyenne East High School and younger grade levels, engaged with the Visitor Center’s hands-on educational displays to gain insight into computational science, data center operations, and technology for ESS research. July and August saw targeted outreach during Cheyenne Frontier Days, with the Visitor Center aiming to attract some of the event’s 250,000-plus attendees while also hosting city planners and public tours. Despite limited advertising, the Visitor Center leveraged word of mouth and social media to promote its offerings to a broad audience.

 

Meanwhile, CISL’s flourishing SIParCS20 internship program continued in collaboration with NSF NCAR’s EdEC.21 The SIParCS program aims to open up the ESS field to talented young researchers and meritorious early-career candidates. CISL ultimately hired 16 interns for the 2024 summer program, four of whom received funding from grants or groups outside of SIParCS. CISL is currently planning for the 2025 SIParCS program as well as publicizing its Visitor Program.

 

Other community-enrichment efforts included the NOAA/NCAR GPU Hackathon22; an updated internal and external recognition process; a manager training series; and multiple public user trainings on topics including Derecho, Casper, GLADE,23 Dask, and SIParCS.

Footnotes

1 DKRZ = Deutsches Klimarechenzentrum, the German Climate Computing Center.
2 ECMWF = European Centre for Medium-Range Weather Forecasts.
3 PUMAS/MG3 = Parameterization of Unified Microphysics Across Scales, Morrison–Gettelman version. 
4 JRA-3Q dataset = Japanese Reanalysis for Three-Quarters of a Century.
5 NA-CORDEX = North American Coordinated Regional Climate Downscaling Experiment.
6 CESM2-LENS = Community Earth System Model version 2 Large Ensemble. 
7 CAM6-DART = Community Atmosphere Model version 6 Data Assimilation Research Testbed.
8 ERA5 = the fifth-generation climate reanalysis produced by the European Centre for Medium-Range Weather Forecasts.
9 CONUS404 = a dataset covering the CONtiguous United States for 40 years at four-kilometer resolution.
10 GeoCAT = Geoscience Community Analysis Toolkit.
11 VAPOR = Visualization and Analysis Platform for Ocean, Atmosphere, and Solar Researchers.
12 AGU = American Geophysical Union 2024 conference.
13 AMS = American Meteorological Society 2024 conference.
14 WRF/MPAS = Weather Research and Forecasting Model (WRF) and Model for Prediction Across Scales (MPAS).
15 DART = Data Assimilation Research Testbed.
16 QCEFF = Quantile-Conserving Ensemble Filter Framework.
17 CESM = Community Earth System Model.
18 MOM6 = Modular Ocean Model version 6.
19 R1 = A doctoral university with very high research activity, according to the Carnegie Classification of Institutions of Higher Education.
20  SIParCS = Summer Internships in Parallel Computational Science.
21  EdEC = Education, Engagement & Early-Career Development.
22  NOAA/NCAR GPU Hackathon = An Open Hackathon jointly sponsored by the National Oceanic and Atmospheric Administration (NOAA), NSF NCAR, NVIDIA, and OpenACC.
23  GLADE = Globally Accessible Data Environment.