192 changes: 192 additions & 0 deletions nwm_network/NWMRoutelinkToParquet.ipynb
@@ -0,0 +1,192 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "f3b4d424-9cfd-457e-ad37-f63a336d283e",
"metadata": {},
"source": [
"## Creating a parquet file of NWM RouteLink\n",
"This notebook creates a parquet file of the National Water Model (NWM) RouteLink, to be uploaded to BigQuery.\n",
"\n",
"To access the NWM RouteLink, some code (cells 1 to 4) was adapted from [route_link_fsspec.ipynb](https://github.com/AlabamaWaterInstitute/data_access_examples/blob/main/nwm_network/route_link_fsspec.ipynb)."
]
},
{
"cell_type": "markdown",
"id": "125e372a",
"metadata": {},
"source": [
"### Imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6299e4bd",
"metadata": {},
"outputs": [],
"source": [
"import fsspec\n",
"import xarray as xr\n",
"from kerchunk.hdf import SingleHdf5ToZarr\n",
"from pyarrow.parquet import ParquetFile"
]
},
{
"cell_type": "markdown",
"id": "6c80333c",
"metadata": {},
"source": [
"### FSSPEC download for NWM RouteLink file"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4949ca1b-3b70-4b6b-9ef1-bd1f5cf13ea1",
"metadata": {},
"outputs": [],
"source": [
"fs = fsspec.filesystem(\"http\")\n",
"\n",
"rl_nwm_url = \"https://www.nco.ncep.noaa.gov/pmb/codes/nwprod/nwm.v2.2.0/parm/DOMAIN_WCOSS_Names/RouteLink_CONUS.nc\"\n",
"\n",
"# Scan the remote HDF5-based file and build kerchunk references to its chunks.\n",
"# Key example here: https://fsspec.github.io/kerchunk/test_example.html\n",
"with fs.open(rl_nwm_url) as f:\n",
"    %time rl_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=0).translate()"
]
},
{
"cell_type": "markdown",
"id": "07c49383-99e6-4968-87ee-1a1004406752",
"metadata": {},
"source": [
"The `kerchunk` example that we started with used a number of other parameters. \n",
"Some of them might be reintroduced to make data access even faster, e.g.:\n",
"```py\n",
"fs = fsspec.filesystem('ftp', host=\"https://www.nco.ncep.noaa.gov/pmb\")\n",
"\n",
"with fs.open(rl_nwm_url, mode='rb', anon=True, default_fill_cache=False, default_cache_type='first') as f:\n",
"    ...\n",
"```\n",
"\n",
"One thing that I specifically explored was the `inline_threshold` setting. Thresholds of 2 or below were somewhat faster, though not massively so: about 9.5 to 9.9 s versus roughly 11 s for the default and for larger values.\n",
"```py\n",
" %time rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url).translate() # 11.1 s\n",
" %time rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=30000).translate() # 11.3 s\n",
" %time rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=300).translate() # 11.2 s\n",
" %time rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=10).translate() # 11.3 s\n",
" %time rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=2).translate() # 9.8 s\n",
" %time rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=1).translate() # 9.85 s\n",
" %time rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=0).translate() # 9.83 s\n",
" %time rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=-1).translate() # 9.54 s\n",
"```\n",
"Chaining `.translate()` directly onto the constructor took about as long as calling it in a separate step, and chaining has the added advantage of not keeping the unused intermediate object around. \n",
"```py\n",
" %time rl_h5 = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=0)\n",
" %time rl_t = rl_h5.translate() # This translate MUST happen inside the context block\n",
"```\n",
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d29ee768-6113-4b66-9e57-94b8d08cdaaf",
"metadata": {},
"outputs": [],
"source": [
"backend_args = {\n",
" \"consolidated\": False,\n",
" \"storage_options\": {\n",
" \"fo\": rl_t,\n",
" # Adding these options returns a properly dimensioned but otherwise null dataframe\n",
" # \"remote_protocol\": \"https\",\n",
" # \"remote_options\": {'anon':True}\n",
" },\n",
"}\n",
"%time ds = xr.open_dataset(\"reference://\", engine=\"zarr\", backend_kwargs=backend_args,)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3c4ba8d9-6485-4304-9f73-c05a4dde01ec",
"metadata": {},
"outputs": [],
"source": [
"# only keep the necessary variables\n",
"subslice = [\"link\",\"to\"]\n",
"\n",
"# Convert to pandas dataframe\n",
"%time df = ds[subslice].to_dataframe().astype({\"link\": int, \"to\": int})"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4215cc7b-32b9-4ab5-8a3c-ce4439955c8f",
"metadata": {},
"outputs": [],
"source": [
"# Set \"link\" as the index of the dataframe\n",
"\n",
"df = df.set_index(\"link\")\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "473c1e5b",
"metadata": {},
"source": [
"### Convert the dataframe to parquet and save it"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e27e9770",
"metadata": {},
"outputs": [],
"source": [
"# Hard-coded local output path; adjust for your environment\n",
"parquet_path = \"/Users/grad/NWMRouteLinkParquet.gzip\"\n",
"df.to_parquet(parquet_path, engine=\"pyarrow\", compression=\"gzip\")\n",
"\n",
"# Show the metadata of the parquet file\n",
"ParquetFile(parquet_path).metadata"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c53ec083",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}