AlabamaWaterInstitute · imanmaghami · Nov 17, 2022 · Nov 17, 2022
diff --git a/nwm_network/NWMRoutelinkToParquet.ipynb b/nwm_network/NWMRoutelinkToParquet.ipynb
@@ -0,0 +1,192 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "f3b4d424-9cfd-457e-ad37-f63a336d283e",
+   "metadata": {},
+   "source": [
+    "## Creating a parquet file of NWM RouteLink\n",
+    "This notebook creates a parquet file of the National Water Model (NWM) RouteLink to be uploaded to BigQuery.\n",
+    "\n",
+    "To access the NWM RouteLink, some code (cells 1 to 4) was adapted from [route_link_fsspec.ipynb](https://github.com/AlabamaWaterInstitute/data_access_examples/blob/main/nwm_network/route_link_fsspec.ipynb)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "125e372a",
+   "metadata": {},
+   "source": [
+    "### Imports"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6299e4bd",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import fsspec\n",
+    "import xarray as xr\n",
+    "from kerchunk.hdf import SingleHdf5ToZarr\n",
+    "from pyarrow.parquet import ParquetFile"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6c80333c",
+   "metadata": {},
+   "source": [
+    "### FSSPEC download for NWM RouteLink file"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4949ca1b-3b70-4b6b-9ef1-bd1f5cf13ea1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "fs = fsspec.filesystem(\"http\")\n",
+    "\n",
+    "rl_nwm_url = \"https://www.nco.ncep.noaa.gov/pmb/codes/nwprod/nwm.v2.2.0/parm/DOMAIN_WCOSS_Names/RouteLink_CONUS.nc\"\n",
+    "with fs.open(rl_nwm_url) as f:\n",
+    "    %time    rl_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=0).translate()\n",
+    "    \n",
+    "    # Key example here: \n",
+    "    # https://fsspec.github.io/kerchunk/test_example.html\n",
+    " "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "07c49383-99e6-4968-87ee-1a1004406752",
+   "metadata": {},
+   "source": [
+    "The `kerchunk`-ing example that we started with had a number of other parameters... \n",
+    "perhaps some might be reintroduced to make the data access even speedier!\n",
+    "e.g., ...\n",
+    "```py\n",
+    "fs = fsspec.filesystem('ftp', host=\"https://www.nco.ncep.noaa.gov/pmb\")\n",
+    "\n",
+    "with fs.open(rl_nwm_url, mode='rb', anon=True, default_fill_cache=False, default_cache_type='first') as f:\n",
+    "```\n",
+    " ...\n",
+    " \n",
+    "One thing that I specifically explored was the size of the `inline_threshold` setting. Values of 2 or less gave slightly better results, though not a massive improvement: roughly 9.5 to 9.9 seconds overall vs. 11 or so.\n",
+    "```py\n",
+    "    %time    rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url).translate() # 11.1 s\n",
+    "    %time    rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=30000).translate() # 11.3 s\n",
+    "    %time    rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=300).translate() # 11.2 s\n",
+    "    %time    rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=10).translate() # 11.3 s\n",
+    "    %time    rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=2).translate() # 9.8 s\n",
+    "    %time    rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=1).translate() # 9.85 s\n",
+    "    %time    rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=0).translate() # 9.83 s\n",
+    "    %time    rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=-1).translate() # 9.54 s\n",
+    "```\n",
+    "Chaining the `.translate()` call onto the constructor vs. splitting it into two statements seemed to be about equal, with chaining having the additional advantage of omitting the unused intermediate object.\n",
+    "```py\n",
+    "    %time    rl_h5 = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=0)\n",
+    "    %time    rl_t = rl_h5.translate() # This translate MUST happen inside the context block\n",
+    "```\n",
+    "    "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d29ee768-6113-4b66-9e57-94b8d08cdaaf",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "backend_args = {\n",
+    "    \"consolidated\": False,\n",
+    "    \"storage_options\": {\n",
+    "        \"fo\": rl_t,\n",
+    "        # Adding these options returns a properly dimensioned but otherwise null dataset\n",
+    "        # \"remote_protocol\": \"https\",\n",
+    "        # \"remote_options\": {'anon':True}\n",
+    "    },\n",
+    "}\n",
+    "%time ds = xr.open_dataset(\"reference://\", engine=\"zarr\", backend_kwargs=backend_args)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3c4ba8d9-6485-4304-9f73-c05a4dde01ec",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# only keep the necessary variables\n",
+    "subslice = [\"link\",\"to\"]\n",
+    "\n",
+    "# Convert to pandas dataframe\n",
+    "%time df = ds[subslice].to_dataframe().astype({\"link\": int, \"to\": int})"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4215cc7b-32b9-4ab5-8a3c-ce4439955c8f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Set \"link\" as the index of the dataframe\n",
+    "\n",
+    "df = df.set_index(\"link\")\n",
+    "df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "473c1e5b",
+   "metadata": {},
+   "source": [
+    "### Convert the dataframe to parquet and save it"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e27e9770",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df.to_parquet(\"/Users/grad/NWMRouteLinkParquet.gzip\", engine=\"pyarrow\", compression=\"gzip\")\n",
+    "\n",
+    "# Show the metadata of the parquet file \n",
+    "ParquetFile(\"/Users/grad/NWMRouteLinkParquet.gzip\").metadata "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c53ec083",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
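
Note on the last code cell: it writes to a hard-coded `/Users/grad/...` path, which will only exist on the author's machine, and the `.gzip` file name does not show that the file is parquet. One possible way around both is a small path helper. The sketch below is only an illustration: `routelink_parquet_path`, the `output` directory, and the `NWMRouteLink.parquet.gzip` naming are assumptions, not part of the notebook.

```python
from pathlib import Path


def routelink_parquet_path(out_dir=".", stem="NWMRouteLink", compression="gzip"):
    """Build an output path for the RouteLink parquet file (hypothetical helper).

    The directory is created if it is missing, so the returned path is
    immediately writable.
    """
    out_dir = Path(out_dir).expanduser()
    out_dir.mkdir(parents=True, exist_ok=True)
    # Keep ".parquet" in the name so the format is obvious; the compression
    # codec is appended as a second suffix (a naming choice, not a requirement).
    return out_dir / f"{stem}.parquet.{compression}"


# "output" is an assumed directory name, relative to the working directory.
out_path = routelink_parquet_path("output")
print(out_path)

# In the notebook, the last cell could then read (this assumes `df` and the
# `ParquetFile` import from the earlier cells):
#   df.to_parquet(out_path, engine="pyarrow", compression="gzip")
#   ParquetFile(out_path).metadata
```

Both `DataFrame.to_parquet` and `ParquetFile` accept a `pathlib.Path`, so the rest of the cell would not need to change beyond swapping in `out_path`.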