diff --git a/nwm_network/NWMRoutelinkToParquet.ipynb b/nwm_network/NWMRoutelinkToParquet.ipynb new file mode 100644 index 0000000..988065d --- /dev/null +++ b/nwm_network/NWMRoutelinkToParquet.ipynb @@ -0,0 +1,192 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "f3b4d424-9cfd-457e-ad37-f63a336d283e", + "metadata": {}, + "source": [ + "## Creating a parquet file of NWM RouteLink\n", + "This notebook creates a parquet file of National Water Model (NWM) RoutLink to be uploaded to BigQuery.\n", + "\n", + "To access the NWM RouteLink, some codes (cells 1 to 4) were adopted from [route_link_fsspec.ipynb](https://github.com/AlabamaWaterInstitute/data_access_examples/blob/main/nwm_network/route_link_fsspec.ipynb)." + ] + }, + { + "cell_type": "markdown", + "id": "125e372a", + "metadata": {}, + "source": [ + "### Imports" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6299e4bd", + "metadata": {}, + "outputs": [], + "source": [ + "import fsspec\n", + "import xarray as xr\n", + "from kerchunk.hdf import SingleHdf5ToZarr\n", + "from pyarrow.parquet import ParquetFile" + ] + }, + { + "cell_type": "markdown", + "id": "6c80333c", + "metadata": {}, + "source": [ + "### FSSPEC download for NWM RouteLink file" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4949ca1b-3b70-4b6b-9ef1-bd1f5cf13ea1", + "metadata": {}, + "outputs": [], + "source": [ + "fs = fsspec.filesystem(\"http\")\n", + "\n", + "rl_nwm_url = \"https://www.nco.ncep.noaa.gov/pmb/codes/nwprod/nwm.v2.2.0/parm/DOMAIN_WCOSS_Names/RouteLink_CONUS.nc\"\n", + "with fs.open(rl_nwm_url) as f:\n", + " %time rl_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=0).translate()\n", + " \n", + " # Key example here: \n", + " # https://fsspec.github.io/kerchunk/test_example.html\n", + " " + ] + }, + { + "cell_type": "markdown", + "id": "07c49383-99e6-4968-87ee-1a1004406752", + "metadata": {}, + "source": [ + "The `kerchunk`-ing example that we started with had a number of other parameters... \n", + "perhaps some might be reintroduced to make the data access even speedier!\n", + "e.g., ...\n", + "```py\n", + "fs = fsspec.filesystem('ftp', host=\"https://www.nco.ncep.noaa.gov/pmb\")\n", + "\n", + "with fs.open(rl_nwm_url, mode='rb', anon=True, default_fill_cache=False, default_cache_type='first') as f:\n", + "```\n", + " ...\n", + " \n", + "One thing that I specifically explored was the size of the `inline_threshold` setting. Smaller values definitely provided better results, though not a massivie improvement -- 9 seconds overall vs. 11 or so. \n", + "```py\n", + " %time rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url).translate() # 11.1 s\n", + " %time rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=30000).translate() # 11.3 s\n", + " %time rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=300).translate() # 11.2 s\n", + " %time rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=10).translate() # 11.3 s\n", + " %time rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=2).translate() # 9.8 s\n", + " %time rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=1).translate() # 9.85 s\n", + " %time rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=0).translate() # 9.83 s\n", + " %time rl_h5_t = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=-1).translate() # 9.54 s\n", + "```\n", + "Inlining the `.translate()` call vs. splitting seemed to be about equal, with inlining having the additional advantage of omitting the unused intermediate output. \n", + "```py\n", + " %time rl_h5 = SingleHdf5ToZarr(f, rl_nwm_url, inline_threshold=0)\n", + " %time rl_t = rl_h5.translate() # This translate MUST happen inside the context block\n", + "```\n", + " " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d29ee768-6113-4b66-9e57-94b8d08cdaaf", + "metadata": {}, + "outputs": [], + "source": [ + "backend_args = {\n", + " \"consolidated\": False,\n", + " \"storage_options\": {\n", + " \"fo\": rl_t,\n", + " # Adding these options returns a properly dimensioned but otherwise null dataframe\n", + " # \"remote_protocol\": \"https\",\n", + " # \"remote_options\": {'anon':True}\n", + " },\n", + "}\n", + "%time ds = xr.open_dataset(\"reference://\", engine=\"zarr\", backend_kwargs=backend_args,)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3c4ba8d9-6485-4304-9f73-c05a4dde01ec", + "metadata": {}, + "outputs": [], + "source": [ + "# only keep the necessary variables\n", + "subslice = [\"link\",\"to\"]\n", + "\n", + "# Convert to pandas dataframe\n", + "%time df = ds[subslice].to_dataframe().astype({\"link\": int, \"to\": int})" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4215cc7b-32b9-4ab5-8a3c-ce4439955c8f", + "metadata": {}, + "outputs": [], + "source": [ + "# Set the \"link\" ast the index of the dataframe\n", + "\n", + "df = df.set_index(\"link\")\n", + "df" + ] + }, + { + "cell_type": "markdown", + "id": "473c1e5b", + "metadata": {}, + "source": [ + "### Convert the dataframe to parquet and save it" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e27e9770", + "metadata": {}, + "outputs": [], + "source": [ + "df.to_parquet(\"/Users/grad/NWMRouteLinkParquet.gzip\", engine=\"pyarrow\", compression=\"gzip\")\n", + "\n", + "# Show the metadata of the parquet file \n", + "ParquetFile(\"/Users/grad/NWMRouteLinkParquet.gzip\").metadata " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c53ec083", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}