Minimize filesize differences between to_netcdf() engines #193

Description

@veenstrajelmer

When an xarray.Dataset generated with ddlpy.dataframe_to_xarray() is written to a netCDF file, the file size differs significantly between engines. The following code generates files ranging from 44 KB to 209 KB:

import ddlpy

locations = ddlpy.locations()

bool_grootheid = locations["Grootheid.Code"] == "WATHTE"
bool_groepering = locations["Groepering.Code"] == ""
bool_procestype = locations["ProcesType"] == "meting"
location = locations[bool_grootheid & bool_groepering & bool_procestype].loc[
    "denhelder.marsdiep"
]

start_date = "1953-01-01"
end_date = "1953-04-01"
measurements = ddlpy.measurements(
    location, start_date=start_date, end_date=end_date
)

always_preserve = [
    "WaarnemingMetadata.Statuswaarde",
    "WaarnemingMetadata.Kwaliteitswaardecode",
]
ds_clean = ddlpy.dataframe_to_xarray(
    df=measurements,
    always_preserve=always_preserve,
)

for engine in [None, "scipy", "h5netcdf", "netcdf4", "netcdf4_classic"]:
    file_out = f"./test_ddlpy_{engine}.nc"

    if engine == "netcdf4_classic":
        ds_clean.to_netcdf(file_out, engine="netcdf4", format="NETCDF4_CLASSIC")
    else:
        ds_clean.to_netcdf(file_out, engine=engine)

This has to do with the way string variables are converted when the file is written. There might be a way to enforce the same (simple) dtype for all of them, so that the choice of engine no longer matters.
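To compare the resulting files, the sizes can be listed with a small helper. This is a minimal sketch using only the standard library; the function name `report_sizes` is hypothetical, and the default glob pattern assumes the `test_ddlpy_{engine}.nc` file names used in the loop above.

```python
from pathlib import Path


def report_sizes(directory=".", pattern="test_ddlpy_*.nc"):
    """Return {file name: size in KB} for matching files, smallest first."""
    sizes = {
        path.name: path.stat().st_size / 1024
        for path in Path(directory).glob(pattern)
    }
    # Sort by size so the spread between engines is easy to read
    return dict(sorted(sizes.items(), key=lambda item: item[1]))
```

Calling `report_sizes()` after running the loop and printing each entry gives one line per engine, which makes it easy to check whether a change in encoding actually closes the gap.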

Possible solution
Converting all strings to char arrays reduces the file size differences to almost zero. This can be done by passing the following encoding argument, which covers all string-like variables, to to_netcdf():

encoding = {
    var: {"dtype": "S1"}
    for var in ds.data_vars
    if ds[var].dtype.kind in {"O", "U"}
}

Alternatively, we can enforce the encoding for each string variable in the dataset, so the user is not bothered with this:

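for var in ds.data_vars:
    # Match object ("O") and unicode ("U") variables, like the encoding dict above
    if ds[var].dtype.kind in {"O", "U"}:
        # Update the existing encoding instead of replacing it entirely
        ds[var].encoding["dtype"] = "S1"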

Important: S1 results in 1-length strings, so we need to compute the maximum length of the strings in each variable and set S<n> per variable.
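Computing the maximum length per variable could look roughly like the sketch below. The helper names `max_encoded_length` and `fixed_width_encoding` are hypothetical, and measuring the length in UTF-8 bytes is an assumption. Whether every engine honours an S<n> dtype as a fixed-width string of n characters is taken from the note above and has not been verified here. The second function only relies on the `data_vars`, `dtype.kind` and `values` attributes that an xarray.Dataset provides, so it does not import xarray itself.

```python
def max_encoded_length(values, encoding="utf-8"):
    """Return the byte length of the longest string in `values` (at least 1)."""
    longest = 0
    for value in values:
        # Treat missing values as empty strings (assumption)
        text = "" if value is None else str(value)
        longest = max(longest, len(text.encode(encoding)))
    return max(longest, 1)


def fixed_width_encoding(ds):
    """Build an encoding dict with one S<n> dtype per string-like variable.

    `ds` is expected to behave like an xarray.Dataset.
    """
    encoding = {}
    for var in ds.data_vars:
        if ds[var].dtype.kind in {"O", "U"}:
            n = max_encoded_length(ds[var].values.ravel())
            encoding[var] = {"dtype": f"S{n}"}
    return encoding
```

The result could then be passed as `encoding=fixed_width_encoding(ds)` to to_netcdf(), in the same way as the S1 dictionary above.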
