
nnja-ai does not play nicely with proxy servers #66

@csubich

DataCatalog() uses fsspec to read the nnja-ai cloud bucket. fsspec relies on gcsfs, which in turn relies on aiohttp for its networking layer. Unfortunately, aiohttp does not read proxy settings from the environment by default (its trust_env option defaults to False), so the HTTP[S]_PROXY environment variables conventionally used to specify a proxy server are ignored. As a result, DataCatalog() creation hangs on a system that reaches the network through such a proxy rather than a directly routed connection:

Python 3.12.11 | packaged by conda-forge | (main, Jun  4 2025, 14:45:31) [GCC 13.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 9.5.0 -- An enhanced Interactive Python. Type '?' for help.
Tip: You can use Ctrl-O to force a new line in terminal IPython

In [1]: from nnja_ai import DataCatalog

In [2]: %time DataCatalog() # Hangs
^CCPU times: user 80.3 ms, sys: 24 ms, total: 104 ms
Wall time: 27.3 s
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
# Traceback elided
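
When reproducing a hang like the one above, it may help to first confirm that the process actually sees a proxy configuration. The following is a minimal sketch using only the standard library; the helper name `proxy_environment` is made up for illustration and is not part of nnja-ai, fsspec, or gcsfs.

```python
import os

# Conventional proxy variables; tools commonly accept upper- or lower-case spellings.
PROXY_VARS = ("HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY")


def proxy_environment(environ=os.environ):
    """Return the proxy-related variables that are set, keyed by upper-case name.

    Hypothetical diagnostic helper: it only reports what the environment
    contains and says nothing about whether a given library honors it.
    """
    found = {}
    for name in PROXY_VARS:
        # Check the lower-case spelling first, then the upper-case one.
        for candidate in (name.lower(), name):
            value = environ.get(candidate)
            if value:
                found[name] = value
                break
    return found


if __name__ == "__main__":
    settings = proxy_environment()
    if settings:
        print("Proxy variables visible to this process:", settings)
    else:
        print("No proxy variables set; the hang likely has another cause.")
```

If this reports a proxy while DataCatalog() still hangs, that is consistent with the proxy settings being ignored further down the stack.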

This hang occurs at the fsspec level and can be re-created by hand:

In [3]: import fsspec; import json
In [5]: uri = 'gs://gcp-nnja-ai/data/v1/catalog.json'

In [6]: %%time
   ...: with fsspec.open(uri,mode='r') as f:
   ...:     dd = json.load(f) # Also hangs
   ...:
   ...:
^CCPU times: user 4.41 ms, sys: 3.01 ms, total: 7.42 ms
Wall time: 34.7 s

The workaround in gcsfs, and consequently in fsspec, is to pass the keyword argument session_kwargs={'trust_env': True}, and this works:

In [8]: %%time
   ...: with fsspec.open(uri,mode='r',session_kwargs={'trust_env' : True}) as f:
   ...:     dd = json.load(f) # succeeds
   ...:
   ...:
CPU times: user 92 ms, sys: 8.98 ms, total: 101 ms
Wall time: 12.4 s

In [9]: print(str(dd)[:100])
{'amsua-1bamua-NC021023': {'description': 'Data from the Advanced Microwave Sounding Unit-A (AMSU-A)

However, there is currently no way to pass this argument to DataCatalog, since its __init__ does not accept additional arguments.
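
One possible shape for a fix would be to let callers supply extra filesystem options that are merged with whatever authentication arguments the library already computes. The sketch below shows only the merging step. The function name `merge_storage_options`, the `extra` parameter, and the example `auth_args` value are all assumptions for illustration; nothing here reflects the actual nnja_ai internals.

```python
from copy import deepcopy


def merge_storage_options(auth_args, extra=None):
    """Overlay caller-supplied options on top of the library's own arguments.

    Nested dicts such as session_kwargs are merged one level deep, so a
    caller adding trust_env does not wipe out other session settings.
    Hypothetical helper, not part of nnja_ai.
    """
    merged = deepcopy(dict(auth_args))
    for key, value in (extra or {}).items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: combine them, letting the caller win on conflicts.
            merged[key] = {**merged[key], **value}
        else:
            merged[key] = deepcopy(value)
    return merged


# Assumed example value; the real auth arguments are not shown in this issue.
auth_args = {"token": "anon"}
options = merge_storage_options(auth_args, {"session_kwargs": {"trust_env": True}})
print(options)
```

The merged dict could then be forwarded to fsspec.open() in the same way session_kwargs was passed by hand in the transcript above.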

An awful workaround is to monkeypatch nnja_ai.io._get_auth_args():

In [1]: import nnja_ai; from nnja_ai import DataCatalog

In [2]: nnja_ai.io._get_auth_args = lambda x : {'session_kwargs': {'trust_env' : True}}

In [3]: %time catalog = DataCatalog()
CPU times: user 173 ms, sys: 21 ms, total: 194 ms
Wall time: 14.2 s

In [4]: catalog.list_datasets()
Out[4]:
['amsua-1bamua-NC021023',
 'atms-atms-NC021203',
 'mhs-1bmhs-NC021027',
 'cris-crisf4-NC021206',
 'iasi-mtiasi-NC021241',
 'geo-ahicsr-NC021044',
 'geo-gsrasr-NC021045',
 'geo-gsrcsr-NC021046',
 'seviri-sevasr-NC021042',
 'conv-adpsfc-NC000001',
 'conv-adpsfc-NC000002',
 'conv-adpsfc-NC000007',
 'conv-adpsfc-NC000101',
 'conv-adpupa-NC002001']

… but I can't really recommend that as a general-purpose solution.
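
For anyone who needs the monkeypatch in the meantime, a slightly less destructive variant is to wrap the original function instead of replacing it, so that whatever it normally returns is kept and trust_env is only added on top. This is a sketch under assumptions: the wrapper name `with_trust_env` is invented here, and `fake_get_auth_args` is a stand-in, since the real signature and return value of nnja_ai.io._get_auth_args are not shown in this issue.

```python
import functools


def with_trust_env(get_args):
    """Wrap a function returning fsspec keyword arguments so that
    session_kwargs['trust_env'] defaults to True in its result.

    Hypothetical helper; accepts any call signature and forwards it unchanged.
    """
    @functools.wraps(get_args)
    def wrapper(*args, **kwargs):
        result = dict(get_args(*args, **kwargs) or {})
        session_kwargs = dict(result.get("session_kwargs") or {})
        # setdefault keeps an explicit trust_env value if one is already present.
        session_kwargs.setdefault("trust_env", True)
        result["session_kwargs"] = session_kwargs
        return result
    return wrapper


# Stand-in for the private nnja_ai function (assumed shape, for demonstration only).
def fake_get_auth_args(uri):
    return {"token": "anon"}


patched = with_trust_env(fake_get_auth_args)
print(patched("gs://gcp-nnja-ai/data/v1/catalog.json"))
```

With nnja_ai installed, the analogous assignment would be nnja_ai.io._get_auth_args = with_trust_env(nnja_ai.io._get_auth_args) before constructing DataCatalog(). It is still a patch of a private function, so the same caveat applies.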
