When writing an xarray.Dataset generated with ddlpy.dataframe_to_xarray() to a NetCDF file, the file size differs significantly depending on the engine used. The following code generates files ranging from 44 KB to 209 KB:
import ddlpy

# Select the water level (WATHTE) measurement location at Den Helder
locations = ddlpy.locations()
bool_grootheid = locations["Grootheid.Code"] == "WATHTE"
bool_groepering = locations["Groepering.Code"] == ""
bool_procestype = locations["ProcesType"] == "meting"
location = locations[bool_grootheid & bool_groepering & bool_procestype].loc[
    "denhelder.marsdiep"
]

# Retrieve three months of measurements
start_date = "1953-01-01"
end_date = "1953-04-01"
measurements = ddlpy.measurements(
    location, start_date=start_date, end_date=end_date
)

# Convert the dataframe to an xarray.Dataset
always_preserve = [
    "WaarnemingMetadata.Statuswaarde",
    "WaarnemingMetadata.Kwaliteitswaardecode",
]
ds_clean = ddlpy.dataframe_to_xarray(
    df=measurements,
    always_preserve=always_preserve,
)

# Write the same dataset once per engine/format
for engine in [None, "scipy", "h5netcdf", "netcdf4", "netcdf4_classic"]:
    file_out = f"./test_ddlpy_{engine}.nc"
    if engine == "netcdf4_classic":
        ds_clean.to_netcdf(file_out, engine="netcdf4", format="NETCDF4_CLASSIC")
    else:
        ds_clean.to_netcdf(file_out, engine=engine)
This has to do with the way each engine converts string variables. However, there might be a way to enforce the same (simple) dtype for all of them, so that the choice of engine no longer matters.
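To check the size range quoted above, the written files can be listed with their sizes. This is a minimal sketch using only the standard library; the helper name `report_file_sizes` is hypothetical, and it assumes the files were written to the current directory with the `test_ddlpy_*.nc` names used in the loop:

```python
from pathlib import Path


def report_file_sizes(directory=".", pattern="test_ddlpy_*.nc"):
    """Return {file name: size in KB} for all files matching the pattern."""
    sizes = {}
    for path in sorted(Path(directory).glob(pattern)):
        # st_size is in bytes; divide by 1024 to get KB
        sizes[path.name] = path.stat().st_size / 1024
    return sizes


if __name__ == "__main__":
    for name, size_kb in report_file_sizes().items():
        print(f"{name}: {size_kb:.0f} KB")
```

Running this before and after applying an encoding makes it easy to see whether the differences between engines actually shrink.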
Possible solution
Converting all strings to char arrays reduces the file size differences to almost zero. This can be done by building an encoding for all string-like variables and passing it as the encoding argument to to_netcdf():
encoding = {
    var: {"dtype": "S1"}
    for var in ds.data_vars
    if ds[var].dtype.kind in {"O", "U"}
}
Alternatively, we can enforce the encoding for each string variable in the dataset, so the user is not bothered with this:
for var in ds.data_vars:
    if ds[var].dtype.kind == "O":
        ds[var].encoding = {"dtype": "S1"}
Important: S1 results in 1-length strings, so we need to compute the maximum string length for each variable and set S<n> per variable.
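The per-variable width could be computed along these lines. This is only a sketch of the idea in the note above: the helper name `build_string_encoding` is hypothetical, it measures length in UTF-8 bytes (an assumption, since "S" dtypes are byte strings), and whether every engine accepts a fixed-width `S<n>` encoding dtype has not been verified here:

```python
def build_string_encoding(string_values):
    """Map each variable name to {"dtype": "S<n>"}, n = longest value in bytes.

    `string_values` maps a variable name to an iterable of its string values.
    """
    encoding = {}
    for name, values in string_values.items():
        # "S" dtypes hold bytes, so measure the UTF-8 encoded length
        max_len = max((len(str(v).encode("utf-8")) for v in values), default=1)
        # Use a width of at least 1 so all-empty variables still get a valid dtype
        encoding[name] = {"dtype": f"S{max(max_len, 1)}"}
    return encoding


# Hypothetical use with the dataset from this issue (not run here):
# string_vars = {
#     var: ds[var].values.ravel()
#     for var in ds.data_vars
#     if ds[var].dtype.kind in {"O", "U"}
# }
# ds.to_netcdf(file_out, encoding=build_string_encoding(string_vars))
```

Note that non-string entries such as missing values are passed through `str()` here, so they would count with the length of their string representation.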