DataCatalog() uses fsspec to read the nnja-ai cloud bucket. fsspec relies on gcsfs, which in turn relies on aiohttp for the networking layer. Unfortunately, aiohttp does not trust the environment by default (trust_env=False), so the HTTP_PROXY and HTTPS_PROXY environment variables conventionally used to specify a proxy server are ignored. As a result, DataCatalog() creation hangs on a system that reaches the network through such a proxy rather than a directly routed connection:
Python 3.12.11 | packaged by conda-forge | (main, Jun 4 2025, 14:45:31) [GCC 13.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 9.5.0 -- An enhanced Interactive Python. Type '?' for help.
Tip: You can use Ctrl-O to force a new line in terminal IPython
In [1]: from nnja_ai import DataCatalog
In [2]: %time DataCatalog() # Hangs
^CCPU times: user 80.3 ms, sys: 24 ms, total: 104 ms
Wall time: 27.3 s
---------------------------------------------------------------------------
KeyboardInterrupt Traceback (most recent call last)
# Traceback elided
This hang occurs at the fsspec level and can be re-created by hand:
In [3]: import fsspec; import json
In [5]: uri = 'gs://gcp-nnja-ai/data/v1/catalog.json'
In [6]: %%time
...: with fsspec.open(uri, mode='r') as f:
...:     dd = json.load(f)  # Also hangs
...:
...:
^CCPU times: user 4.41 ms, sys: 3.01 ms, total: 7.42 ms
Wall time: 34.7 s
The workaround in gcsfs, and consequently in fsspec, is to pass the keyword argument session_kwargs={'trust_env': True}, and this works:
In [8]: %%time
...: with fsspec.open(uri, mode='r', session_kwargs={'trust_env': True}) as f:
...:     dd = json.load(f)  # succeeds
...:
...:
CPU times: user 92 ms, sys: 8.98 ms, total: 101 ms
Wall time: 12.4 s
In [9]: print(str(dd)[:100])
{'amsua-1bamua-NC021023': {'description': 'Data from the Advanced Microwave Sounding Unit-A (AMSU-A)
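For anyone who wants to reproduce this outside IPython, here is a minimal sketch of the same fsspec-level workaround that only adds the option when a proxy variable is actually set. The helper names (proxy_aware_storage_options, load_json) and the list of variables checked are my own assumptions for illustration, not part of fsspec, gcsfs, or nnja_ai.

```python
import json
import os
from typing import Any, Dict, Mapping, Optional

# Environment variables checked here; an assumption for this sketch.
# Both upper- and lower-case spellings are in common use.
PROXY_VARS = ("HTTP_PROXY", "HTTPS_PROXY", "http_proxy", "https_proxy")


def proxy_aware_storage_options(
    environ: Optional[Mapping[str, str]] = None,
) -> Dict[str, Any]:
    """Return extra fsspec.open() kwargs when a proxy variable is set.

    Hypothetical helper: yields the session_kwargs shown above if any
    proxy variable has a non-empty value, otherwise an empty dict.
    """
    env = os.environ if environ is None else environ
    if any(env.get(name) for name in PROXY_VARS):
        return {"session_kwargs": {"trust_env": True}}
    return {}


def load_json(uri: str, environ: Optional[Mapping[str, str]] = None) -> Any:
    """Open a JSON file through fsspec, honouring proxy settings if present."""
    import fsspec  # imported lazily so the helper above works without it

    with fsspec.open(uri, mode="r", **proxy_aware_storage_options(environ)) as f:
        return json.load(f)
```

With this in place, load_json('gs://gcp-nnja-ai/data/v1/catalog.json') should behave like In [8] above on a proxied system and like the plain fsspec.open() call elsewhere.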
However, there is currently no way to pass this argument to DataCatalog, since its __init__ does not accept additional arguments.
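To make the request concrete, this is roughly the shape I have in mind: a constructor that accepts extra storage options and forwards them to fsspec.open(). Everything here is hypothetical (the class name CatalogSketch, the storage_options and opener parameters); it is a sketch of one possible interface, not how nnja_ai is actually structured.

```python
import json
from typing import Any, Callable, Dict, Optional


class CatalogSketch:
    """Hypothetical stand-in for DataCatalog; NOT the nnja_ai API."""

    def __init__(
        self,
        uri: str,
        storage_options: Optional[Dict[str, Any]] = None,
        opener: Optional[Callable[..., Any]] = None,
    ) -> None:
        self.uri = uri
        # Copy so later changes to the caller's dict do not leak in.
        self.storage_options = dict(storage_options or {})
        if opener is None:
            import fsspec  # only needed when no opener is injected

            opener = fsspec.open
        self._opener = opener

    def load(self) -> Any:
        """Read the catalog JSON, forwarding storage options to the opener."""
        with self._opener(self.uri, mode="r", **self.storage_options) as f:
            return json.load(f)
```

A proxied user could then write CatalogSketch(uri, storage_options={'session_kwargs': {'trust_env': True}}).load(), while everyone else keeps the default behaviour. The opener parameter exists only so the sketch can be exercised without network access.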
An awful workaround is to monkeypatch nnja_ai.io._get_auth_args():
In [1]: import nnja_ai; from nnja_ai import DataCatalog
In [2]: nnja_ai.io._get_auth_args = lambda x : {'session_kwargs': {'trust_env' : True}}
In [3]: %time catalog = DataCatalog()
CPU times: user 173 ms, sys: 21 ms, total: 194 ms
Wall time: 14.2 s
In [4]: catalog.list_datasets()
Out[4]:
['amsua-1bamua-NC021023',
'atms-atms-NC021203',
'mhs-1bmhs-NC021027',
'cris-crisf4-NC021206',
'iasi-mtiasi-NC021241',
'geo-ahicsr-NC021044',
'geo-gsrasr-NC021045',
'geo-gsrcsr-NC021046',
'seviri-sevasr-NC021042',
'conv-adpsfc-NC000001',
'conv-adpsfc-NC000002',
'conv-adpsfc-NC000007',
'conv-adpsfc-NC000101',
'conv-adpupa-NC002001']
… but I can't really recommend that as a general-purpose solution.
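If someone does need the monkeypatch in the meantime, it can at least be scoped with unittest.mock.patch.object from the standard library so that the original function is restored afterwards. The sketch below uses a stand-in namespace (fake_io) instead of the real nnja_ai.io module so that it runs anywhere; the replacement function name is my own. Whether it is safe to restore the original right after construction depends on whether DataCatalog calls _get_auth_args() again later, which I have not checked.

```python
import types
from unittest import mock


def trust_env_auth_args(*args, **kwargs):
    """Replacement that always asks aiohttp to honour proxy variables."""
    return {"session_kwargs": {"trust_env": True}}


# Stand-in for nnja_ai.io, used only so this sketch is self-contained.
fake_io = types.SimpleNamespace(_get_auth_args=lambda *args, **kwargs: {})

uri = "gs://gcp-nnja-ai/data/v1/catalog.json"

# The patch is active only inside the with-block.
with mock.patch.object(fake_io, "_get_auth_args", trust_env_auth_args):
    inside = fake_io._get_auth_args(uri)

# On exit the original attribute is back in place.
after = fake_io._get_auth_args(uri)
```

Against the real package, the equivalent would be to patch nnja_ai.io in place of fake_io and create the DataCatalog() inside the with-block.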