Direct Access to NetCDF Files in TAR Archives
Recently, I needed to validate CONUS404 wind data against observations at a specific site. However, the CONUS404 data are stored in TAR files, an archive format that bundles many files into one. Unpacking every TAR file just to access data for a single grid point is highly inefficient, and the extraction itself is time-consuming. Is it possible to access files inside TAR archives directly using xarray and dask to save time? Yes, and there are several ways to do it. Personally, I prefer Ratarmount.
Ratarmount collects all file positions inside a TAR so that it can easily jump to and read from any file without extracting it. It then mounts the TAR using fusepy for read access. Ratarmount supports efficient filesystem-style access to large archives in many formats, e.g., TAR, RAR, ZIP, GZ, BZ2, XZ, and ZSTD.
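To make the idea of "collecting file positions" concrete, here is a minimal sketch using only Python's standard tarfile module. It is not how Ratarmount is implemented; it only illustrates the principle for an uncompressed TAR. The function names (build_offset_index, read_member, demo) and the tiny demo archive are hypothetical and exist only for this illustration.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def build_offset_index(tar_path):
    """Map each regular member name to (data offset, size) in an uncompressed TAR."""
    index = {}
    with tarfile.open(tar_path, "r:") as tar:  # "r:" means: read, no compression
        for member in tar:
            if member.isfile():
                index[member.name] = (member.offset_data, member.size)
    return index


def read_member(tar_path, index, name):
    """Read one member by seeking straight to its bytes, without extracting anything."""
    offset, size = index[name]
    with open(tar_path, "rb") as f:
        f.seek(offset)
        return f.read(size)


def demo():
    """Build a tiny throwaway TAR, index it, and read one member directly."""
    with tempfile.TemporaryDirectory() as tmp:
        tar_path = Path(tmp) / "demo.tar"
        payloads = {"a.nc": b"first file", "b.nc": b"second file payload"}
        with tarfile.open(tar_path, "w") as tar:
            for name, data in payloads.items():
                info = tarfile.TarInfo(name)
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
        index = build_offset_index(tar_path)
        return index, read_member(tar_path, index, "b.nc")


if __name__ == "__main__":
    index, data = demo()
    print(index)
    print(data)
```

The index is built once by walking the archive; after that, any member can be read with a single seek. Ratarmount persists this kind of information in an index file and exposes the members through a mounted filesystem, so tools such as xarray can open them by path.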
CONUS404 is a specialized, high-resolution hydro-climate dataset designed for hydrological modeling and meteorological analysis across the contiguous United States. Named for its coverage of the CONtiguous United States over 40 years at a 4-kilometer resolution, CONUS404 was generated using the Weather Research and Forecasting (WRF) Model, run by the National Center for Atmospheric Research (NCAR) in collaboration with the U.S. Geological Survey (USGS) Water Mission Area. The dataset actually spans 41 years (water years 1980-2020) and extends beyond the CONUS into Canada and Mexico, capturing transboundary river basins and encompassing all areas contributing to CONUS surface waters.
1. Mount tars with Ratarmount
Let's use three TAR files from 2015 as an example and extract wind data directly using xarray and dask with Ratarmount. Ratarmount first creates an index file containing file names, ownership, permission flags, and offset information, which is stored at the TAR file's location. Once the index exists, ratarmount provides a FUSE mount for easy access to the files. Because the folder contains many TAR files, we use the -r option (--recursive: mount archives inside archives recursively). The --recreate-index option forces the index to be rebuilt even if one already exists.
%%time
#mkdir ./mounted
!ratarmount -r --recreate-index dem_tars ./mounted
Creating offset dictionary for dem_tars/726126.V10.wrf2d_d01_2015-01-15_03-2015-01-23_00:00:00.nc.tar ...
Creating offset dictionary for dem_tars/726126.V10.wrf2d_d01_2015-01-15_03-2015-01-23_00:00:00.nc.tar took 0.01s
Creating offset dictionary for dem_tars/726126.V10.wrf2d_d01_2015-01-07_05-2015-01-15_02:00:00.nc.tar ...
Creating offset dictionary for dem_tars/726126.V10.wrf2d_d01_2015-01-07_05-2015-01-15_02:00:00.nc.tar took 0.01s
Creating offset dictionary for dem_tars/726126.V10.wrf2d_d01_2015-01-23_01-2015-01-30_22:00:00.nc.tar ...
Creating offset dictionary for dem_tars/726126.V10.wrf2d_d01_2015-01-23_01-2015-01-30_22:00:00.nc.tar took 0.01s
CPU times: user 4.79 ms, sys: 8.1 ms, total: 12.9 ms
Wall time: 273 ms
2. Extract Data
2.1 Import libraries
import numpy as np
import xarray as xr
import xoak
from pathlib import Path
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from distributed import Client
client = Client()
client
2.2 Settings
lat = xxxx
lon = yyyy
data_dir = Path(r"./mounted")
2.3 Extract V-wind at the target site
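# Collect the NetCDF files exposed inside each mounted TAR
vfiles = list(data_dir.glob("*/*.nc"))
# Open all files lazily as a single dask-backed dataset
ds_v10 = xr.open_mfdataset(vfiles)
# Build a ball-tree index on the 2-D latitude/longitude coordinates
ds_v10.xoak.set_index(['XLAT', 'XLONG'], 'sklearn_geo_balltree')
# Select the grid point nearest to the target site
da_v10_sel = ds_v10.xoak.sel(XLAT=xr.DataArray([lat]), XLONG=xr.DataArray([lon]))
da_v10_sel.V10.plot()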
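For readers who want to see what the nearest-point lookup amounts to, the sketch below does the same kind of selection by brute force with NumPy: it computes the great-circle (haversine) distance from the target site to every grid point and takes the minimum. This is an illustration only, not xoak's implementation, which uses a tree index instead of an exhaustive search. The function name nearest_grid_index, the demo function, and the small synthetic grid are hypothetical.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, used for the haversine distance


def nearest_grid_index(lat2d, lon2d, lat, lon):
    """Return (row, column, distance in km) of the grid point closest to (lat, lon).

    lat2d and lon2d are 2-D arrays of grid latitudes and longitudes in degrees,
    shaped like the XLAT and XLONG coordinates of a curvilinear grid.
    """
    lat1, lon1 = np.radians(lat), np.radians(lon)
    lat2, lon2 = np.radians(lat2d), np.radians(lon2d)
    # Haversine formula for the great-circle distance to every grid point
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    dist_km = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))
    # Position of the smallest distance in the 2-D array
    j, i = np.unravel_index(np.argmin(dist_km), dist_km.shape)
    return int(j), int(i), float(dist_km[j, i])


def demo():
    """Synthetic 1-degree grid standing in for real model coordinates."""
    lats = np.arange(30.0, 35.0, 1.0)     # 30, 31, 32, 33, 34
    lons = np.arange(-100.0, -95.0, 1.0)  # -100, -99, -98, -97, -96
    lon2d, lat2d = np.meshgrid(lons, lats)
    return nearest_grid_index(lat2d, lon2d, 32.2, -97.9)


if __name__ == "__main__":
    print(demo())
```

On a real dataset, the returned row and column could be passed to isel on the two horizontal dimensions of the grid. A brute-force search like this scans every grid point for each query, which is why a tree index such as the one used above is preferable when many sites are looked up.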
Reference
Rasmussen, R.M., F. Chen, C.H. Liu, K. Ikeda, A. Prein, J. Kim, T. Schneider, A. Dai, D. Gochis, A. Dugger, Y. Zhang, A. Jaye, J. Dudhia, C. He, M. Harrold, L. Xue, S. Chen, A. Newman, E. Dougherty, R. Abolafia-Rozenzweig, N. Lybarger, R. Viger, D. Lesmes, K. Skalak, J. Brakebill, D. Cline, K. Dunne, K. Rasmussen, G. Miguez-Macho, 2023: CONUS404: The NCAR-USGS 4-km long-term regional hydroclimate reanalysis over the CONUS. Bull. Amer. Meteor. Soc., under revision.