Reading multiple days of conv-adpupa dataset causes errors #67

@csubich

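When reading the conv-adpupa dataset (radiosondes), selecting more than one day at a time via ds.sel(time=slice(...)) causes a problem at load time. With the pandas backend it only produces a FutureWarning from pd.concat and the load still succeeds, but with the dask backend it raises a fatal exception ("Unsupported cast from double to null using function cast_null"). Loading either day on its own works with both backends.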

Python 3.12.11 | packaged by conda-forge | (main, Jun  4 2025, 14:45:31) [GCC 13.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 9.5.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import nnja_ai
   ...: from nnja_ai import DataCatalog
   ...: # Monkeypatch _get_auth_args to instead supply the 'trust_env' variable to the network connection,
   ...: # allowing the connection to open on behind-proxy machines
   ...: nnja_ai.io._get_auth_args = lambda x : {'session_kwargs': {'trust_env' : True}}

In [2]: catalog = DataCatalog()
   ...: sonde_ds = catalog['conv-adpupa-NC002001']
   ...: print(sonde_ds.info())
Loading manifest for dataset 'conv-adpupa-NC002001'...
Dataset 'conv-adpupa-NC002001': ADP Upper-air data; Rawinsonde - fixed land
Tags: adpupa, upper air, global, station data, fixed land, radiosonde, rawinsonde
Files: 5445 files in manifest
Variables: 265

In [3]: tstart = 'T00:00Z'
   ...: tend = 'T23:59Z'
   ...: day1 = '2024-01-01'
   ...: day2 = '2024-01-02'

In [4]: # Pandas
   ...: for (name, tslice) in (('01/01',(day1+tstart,day1+tend)),
   ...:                       ('02/02',(day2+tstart,day2+tend)),
   ...:                       ('01/02',(day1+tstart,day2+tend))):
   ...:     foo = sonde_ds.sel(time=slice(*tslice)).load_dataset(backend='pandas')
   ...:     print(f'{name}: {len(foo)} entries')
   ...:
01/01: 1185 entries
02/02: 1211 entries
/home/csu001/data/ppp5/conda_env/nnja/lib/python3.12/site-packages/nnja_ai/io.py:148: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  return pd.concat(
01/02: 2396 entries

In [5]: # Dask
   ...: for (name, tslice) in (('01/01',(day1+tstart,day1+tend)),
   ...:                       ('02/02',(day2+tstart,day2+tend)),
   ...:                       ('01/02',(day1+tstart,day2+tend))):
   ...:     try:
   ...:         foo = sonde_ds.sel(time=slice(*tslice)).load_dataset(backend='dask').compute()
   ...:         print(f'{name}: {len(foo)} entries')
   ...:     except Exception as e:
   ...:         print(f'{name} Exception: {e}')
   ...:
01/01: 1185 entries
02/02: 1211 entries
01/02 Exception: Unsupported cast from double to null using function cast_null
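
A possible way to narrow this down, sketched under assumptions: both the pandas FutureWarning (about "empty or all-NA entries") and the dask error text hint that some column may be entirely null in one day's files while holding floats in the other. That is a guess, not something confirmed here. The helpers below are hypothetical (not part of nnja_ai); they load each day on its own, which works in the session above, and then report columns whose dtype differs between the per-day frames. The synthetic frames and their column names are made up for illustration.

```python
import pandas as pd


def load_days_separately(dataset, day_slices, backend="dask"):
    """Load each (start, stop) time slice on its own and return a list of pandas frames.

    Uses only the calls shown in the session above: sel(time=slice(...)),
    load_dataset(backend=...), and .compute() for lazy (dask) results.
    """
    frames = []
    for start, stop in day_slices:
        df = dataset.sel(time=slice(start, stop)).load_dataset(backend=backend)
        if hasattr(df, "compute"):  # dask frames are lazy; pandas frames are not
            df = df.compute()
        frames.append(df)
    return frames


def find_mismatched_columns(frames):
    """Return {column: [dtype per frame]} for columns whose dtype differs between frames."""
    mismatched = {}
    columns = sorted(set().union(*(f.columns for f in frames)))
    for col in columns:
        dtypes = [str(f[col].dtype) if col in f.columns else "<missing>" for f in frames]
        if len(set(dtypes)) > 1:
            mismatched[col] = dtypes
    return mismatched


def all_na_columns(frame):
    """Names of columns that contain no non-null value at all."""
    return [col for col in frame.columns if frame[col].isna().all()]


# Synthetic stand-ins for two per-day loads (hypothetical column names).
frame_a = pd.DataFrame({"station": [1, 2], "temperature": [250.1, 251.3], "rare_obs": [None, None]})
frame_b = pd.DataFrame({"station": [3], "temperature": [249.8], "rare_obs": [1.5]})

print(find_mismatched_columns([frame_a, frame_b]))
print(all_na_columns(frame_a))
```

With the names from the session, the per-day load would be `frames = load_days_separately(sonde_ds, [(day1+tstart, day1+tend), (day2+tstart, day2+tend)])`, followed by `find_mismatched_columns(frames)`. If that reports nothing, the all-null-column guess is probably wrong and the cause lies elsewhere. As a stopgap, `pd.concat(frames, ignore_index=True)` on the per-day results may stand in for the failing multi-day dask load.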
