CISL Seminar: Earthmover: Accelerating frictionless Earth Science research and applications

Mesa Lab MSR & Virtual   

Seminar
Apr. 24, 2025

1:00 – 2:00 pm MDT

Please see logistics tab for link to join virtually  

Abstract

Donoho (2024; Harvard Data Science Review) proposes that the recent dramatic advances in empirical Machine Learning research and applications have largely been enabled by the maturation of three data science principles — data sharing, code sharing, and competitive challenges — "implemented in the particularly strong form of frictionless open services". In many but not all cases, such "services" take the form of online commercial platforms such as HuggingFace, Weights & Biases, GitHub, etc.

While adoption of the principles of code/workflow sharing and competitive challenges is more recent and less mature, the Earth Sciences community has long practiced open data sharing. Examples include (a) open data formats (netCDF); (b) rich metadata conventions (CF Conventions); and (c) widespread data serving infrastructure (e.g. OPeNDAP and THREDDS).

Yet friction abounds:
1. most data is distributed in individual files and accessed after download;
2. data archives are navigated and presented in a relatively primitive form (e.g. files and folders);
3. data dissemination through standard protocols at scale is limited to privileged data centers; and
4. data presentation in forms such as user-friendly dashboards and web maps is inaccessible to most practicing scientists.

Commercial cloud infrastructure now gives any researcher, anywhere in the world, access to petabytes of data storage capacity and scalable data-proximate compute, and it can act as a collaboration layer for distributed research teams. Historically, these privileges have been available only to those with access to large research institutions. Empirical Machine Learning has clearly tapped this potential for immense gains; how do we fully unlock the same potential for the Earth Sciences?

At Earthmover, our mission is to do exactly that by drastically transforming research tooling and data culture for the Earth Sciences to a more cloud-native form. I describe our current efforts across three axes:
1. Icechunk - an open-format, cloud-native nD-array storage engine that brings database-like safety and consistency guarantees; performance that scales like the underlying object store, allowing unprecedented concurrency; and the ability to proxy access to large archives of legacy data.
2. Flux - scalable data delivery via standards-compliant APIs (e.g. WMS for web map tiles, EDR for JSON and CSV exports, and OPeNDAP for the netCDF ecosystem), unlocking easy interoperability between array data and a host of other tools, including web maps, GIS applications, and even Excel spreadsheets.
3. Arraylake - a user-friendly collaboration platform that understands the underlying Icechunk/Zarr array data model.
These efforts, by opening up the possibilities of commercial cloud infrastructure, represent a step change in the transformation toward frictionless reproducibility in the Earth Sciences.

Name
Deepak Cherian

Forward Engineer, Earthmover PBC
Biography

Deepak Cherian is a Forward Engineer at Earthmover PBC, where he works on Arraylake, Icechunk, Zarr, Xarray, and other packages in the Pangeo ecosystem. He is a long-time open-source contributor and was previously an oceanographer in CGD's Oceanography Section.