Minimize filesize differences between to_netcdf() engines

When writing an `xarray.Dataset` generated with `ddlpy.dataframe_to_xarray()` to a netCDF file, the file size differs significantly between engines. The following code generates files ranging from 44 KB to 209 KB:
```python
import ddlpy

locations = ddlpy.locations()

bool_grootheid = locations["Grootheid.Code"] == "WATHTE"
bool_groepering = locations["Groepering.Code"] == ""
bool_procestype = locations["ProcesType"] == "meting"
location = locations[bool_grootheid & bool_groepering & bool_procestype].loc[
    "denhelder.marsdiep"
]

start_date = "1953-01-01"
end_date = "1953-04-01"
measurements = ddlpy.measurements(
    location, start_date=start_date, end_date=end_date
)

always_preserve = [
    "WaarnemingMetadata.Statuswaarde",
    "WaarnemingMetadata.Kwaliteitswaardecode",
]
ds_clean = ddlpy.dataframe_to_xarray(
    df=measurements,
    always_preserve=always_preserve,
)

for engine in [None, "scipy", "h5netcdf", "netcdf4", "netcdf4_classic"]:
    file_out = f"./test_ddlpy_{engine}.nc"

    if engine == "netcdf4_classic":
        ds_clean.to_netcdf(file_out, engine="netcdf4", format="NETCDF4_CLASSIC")
    else:
        ds_clean.to_netcdf(file_out, engine=engine)
```
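To put numbers on the comparison, the sizes of the written files can be listed afterwards. This is a minimal sketch using only the standard library. The helper name `report_file_sizes` is hypothetical, and the glob pattern assumes the `test_ddlpy_{engine}.nc` file names from the loop above:

```python
from pathlib import Path


def report_file_sizes(directory=".", pattern="test_ddlpy_*.nc"):
    """Return {file name: size in KB} for all files matching the pattern.

    Hypothetical helper; 1 KB is taken as 1024 bytes here.
    """
    return {
        path.name: path.stat().st_size / 1024
        for path in sorted(Path(directory).glob(pattern))
    }


# Print one line per file written by the engine loop (nothing if none exist).
for name, size_kb in report_file_sizes().items():
    print(f"{name}: {size_kb:.1f} KB")
```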

The difference comes from the way each engine converts string variables. However, there might be a way to enforce the same (simple) dtype for all of them, so that the choice of engine no longer matters.

**possible solution**
Converting all strings to char arrays reduces the file size differences to almost zero. This can be done by building an encoding for all string-like variables and passing it to `to_netcdf()` via its `encoding` argument:
```python
encoding = {
    var: {"dtype": "S1"}
    for var in ds.data_vars
    if ds[var].dtype.kind in {"O", "U"}
}
```

Alternatively, we can set the encoding on each string variable in the dataset itself, so the user does not have to deal with it:
```python
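for var in ds.data_vars:
    # Match both object and unicode string variables, as in the dict above.
    if ds[var].dtype.kind in {"O", "U"}:
        # Set only the dtype key so any other existing encoding is kept.
        ds[var].encoding["dtype"] = "S1"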
```

Important: `S1` results in 1-length strings, so we need to compute the maximum string length of each variable and set `S<n>` per variable.
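
The per-variable `S<n>` idea could be sketched as below. This is only an illustration under assumptions: the helper name `fixed_width_string_encoding` is hypothetical, object-dtype variables are assumed to hold strings (other values are converted with `str()`), and whether every `to_netcdf()` engine treats `S<n>` as intended has not been verified here.

```python
def fixed_width_string_encoding(ds):
    """Build an encoding dict with one S<n> dtype per string-like variable.

    Hypothetical helper: `ds` is expected to behave like an xarray.Dataset
    (it needs `data_vars`, and per variable `dtype` and `values`).
    """
    encoding = {}
    for var in ds.data_vars:
        # Skip numeric variables; only object and unicode kinds are handled.
        if ds[var].dtype.kind not in {"O", "U"}:
            continue
        values = ds[var].values.ravel().tolist()
        # S<n> counts bytes, so measure the UTF-8 encoded length.
        maxlen = max((len(str(v).encode("utf-8")) for v in values), default=1)
        # Use a width of at least 1 so empty strings still give a valid dtype.
        encoding[var] = {"dtype": f"S{max(maxlen, 1)}"}
    return encoding
```

The result would then be passed along the lines of `ds.to_netcdf(file_out, engine=engine, encoding=fixed_width_string_encoding(ds))`. Computing the width from the data keeps it as small as possible per variable, at the cost of one pass over every string variable before writing.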

Minimize filesize differences between to_netcdf() engines #193
