Is there an easier way to efficiently resample an uneven temporal dataset into a series of windows? #10919

hafarooki · 2025-11-13T19:18:35Z

Suppose I have a dataset whose first dimension, time, is a datetime dimension with uneven spacing of 60 s or more between points. I want to resample it to a fixed interval, say 1800 s, while keeping the points within each resample window as separate values. In other words, I want to reshape the time dimension to (N, 30), where N is the number of 1800 s intervals; because the points are at least 60 s apart, each interval holds at most 30 of them. I tried using resample(...).apply but could not get it to work. The best solution I came up with is the following:

    # obtain timestamp and slice object for each window
    dataset_resample_groups = dataset.resample(time='1800s').groups
    window_time = list(dataset_resample_groups.keys())
    window_slices = list(dataset_resample_groups.values())
    # number of points in each window
    dataset_length = dataset.time.shape[0]
    window_lengths = [window_slice.indices(dataset_length)[1] - window_slice.indices(dataset_length)[0]
                      for window_slice in window_slices]
    # timestamp of window for each point (np.concatenate also works on NumPy < 2.0, unlike np.concat)
    window_time = np.concatenate([[label] * window_length for label, window_length in zip(window_time, window_lengths)])
    # index within window from 0...window_length-1 of each point
    window_time_index = np.concatenate([np.arange(window_length) for window_length in window_lengths])
    # replace time index with a multiindex containing both the window time and the index within the window
    resampled_dataset = dataset.assign_coords(window_time=('time', window_time), window_time_index=('time', window_time_index)).set_index(time=['window_time', 'window_time_index'])
    # temporarily name it timestamp so it survives unstack
    resampled_dataset = resampled_dataset.assign(timestamp=('time', dataset.time.data))  
    # temporarily remove astropy units because they don't work with `unstack`
    resampled_dataset = with_encoded_units(resampled_dataset)
    # separate the time dimension to two dimensions
    resampled_dataset = resampled_dataset.unstack('time')  
    # restore the units
    resampled_dataset = with_decoded_units(resampled_dataset)
    # rename timestamp back to time
    resampled_dataset = resampled_dataset.rename(timestamp='time')

Before, dataset equals:

<xarray.Dataset> Size: 1GB
Dimensions:                      (time: 1555066, instrument+telescope+energy: 51)
Coordinates:
  * time                         (time) datetime64[ns] 12MB 2018-09-28T00:00:...
  * instrument+telescope+energy  (instrument+telescope+energy) object 408B MultiIndex
  * instrument                   (instrument+telescope+energy) <U4 816B 'het'...
  * telescope                    (instrument+telescope+energy) <U1 204B 'A' ....
  * energy                       (instrument+telescope+energy) float64 408B 1...
Data variables:
    count                        (time, instrument+telescope+energy) float64 634MB ...
    response                     (time, instrument+telescope+energy) float64 634MB ...

After, resampled_dataset equals:

<xarray.Dataset> Size: 1GB
Dimensions:                      (window_time: 52548, window_time_index: 30,
                                  instrument+telescope+energy: 51)
Coordinates:
  * window_time                  (window_time) datetime64[ns] 420kB 2018-09-2...
  * window_time_index            (window_time_index) int64 240B 0 1 2 ... 28 29
  * instrument+telescope+energy  (instrument+telescope+energy) object 408B MultiIndex
  * instrument                   (instrument+telescope+energy) <U4 816B 'het'...
  * telescope                    (instrument+telescope+energy) <U1 204B 'A' ....
  * energy                       (instrument+telescope+energy) float64 408B 1...
Data variables:
    count                        (instrument+telescope+energy, window_time, window_time_index) float64 643MB ...
    response                     (instrument+telescope+energy, window_time, window_time_index) float64 643MB ...
    time                         (window_time, window_time_index) datetime64[ns] 13MB ...

This seems to work, but it would be nice if it could be done with something like:

    dataset.resample(time='1800s').stack()
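
For comparison, the same multi-index-then-unstack idea can be sketched more compactly on synthetic data. This is only an illustration under assumptions: the sample times and values are made up, the window labels come from flooring the timestamps rather than from `resample(...).groups`, and the within-window position comes from a pandas `cumcount`. It omits the astropy unit handling from the workaround above.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic, unevenly spaced samples (gaps of 60 s or more); values are made up.
offsets = np.array([0, 60, 180, 1740, 1800, 1920, 3600, 3720, 3780, 5340])
time = pd.Timestamp("2018-09-28") + pd.to_timedelta(offsets, unit="s")
ds = xr.Dataset({"count": ("time", np.arange(10.0))}, coords={"time": time})

# Label each point with the start of its 1800 s window ...
window_time = ds.indexes["time"].floor("1800s")
# ... and with its position (0, 1, 2, ...) inside that window.
window_time_index = (
    pd.Series(0, index=window_time).groupby(level=0).cumcount().to_numpy()
)

windowed = (
    ds.assign(timestamp=("time", ds.time.data))  # keep original timestamps as data
    .assign_coords(
        window_time=("time", window_time.to_numpy()),
        window_time_index=("time", window_time_index),
    )
    .set_index(time=["window_time", "window_time_index"])
    .unstack("time")  # short windows are padded with NaN / NaT
)
print(windowed)
```

In this toy case the three windows hold 4, 2 and 4 points, so the result has dimensions (window_time: 3, window_time_index: 4). Unlike `resample`, this sketch only produces windows that contain at least one point.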

dcherian (Maintainer) · 2025-11-14T23:34:35Z

Yes, groupby.construct and resample.construct could be useful, but in principle they are very expensive because there is no bound on the number of points in a group (or time window). In practice, they would be useful for real-world problems where the number of points in a group falls within some narrow range, e.g. 28-31 for daily data grouped into months.

Are you interested in making a PR?
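
To make the cost concern concrete: a padded (n_groups, max_len) layout allocates n_groups * max_len slots no matter how many of them hold real data. The sketch below uses made-up group sizes, and `padding_overhead` is a hypothetical helper for illustration, not an xarray function.

```python
import numpy as np


def padding_overhead(group_sizes):
    """Ratio of slots in a padded (n_groups, max_len) array to actual points."""
    sizes = np.asarray(group_sizes)
    return sizes.size * sizes.max() / sizes.sum()


# Daily data grouped into the calendar months of a non-leap year:
# group sizes fall in a narrow range, so little space is wasted.
monthly = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

# One outlier group forces every other group to be padded to its length.
skewed = [1, 1, 1, 1000]

print(padding_overhead(monthly))  # close to 1: almost no padding
print(padding_overhead(skewed))   # close to 4: about 4x the memory of the raw data
```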

Here is what Claude Code and I came up with:

import xarray as xr
import numpy as np

ds = xr.tutorial.open_dataset("air_temperature")
da = ds.air
grouped = da.groupby("time.month")
idxs = grouped.encoded.group_indices

def pad_and_stack_vectorized(arrays, fillvalue=-1):
    # Convert to object array first to handle ragged arrays
    arrays = np.array(arrays, dtype=object)
    
    # Get lengths and find max
    lengths = np.array([len(arr) for arr in arrays])
    max_len = lengths.max()
    
    # Pre-allocate output array
    result = np.full((len(arrays), max_len), fillvalue, dtype=int)
    
    # Create row and column indices for all valid positions
    row_idx = np.repeat(np.arange(len(arrays)), lengths)
    col_idx = np.concatenate([np.arange(length) for length in lengths])
    
    # Concatenate all values and assign vectorized
    values = np.concatenate(arrays)
    result[row_idx, col_idx] = values
    
    return result

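stacked_idxs = pad_and_stack_vectorized(idxs)
# Gather along the time axis; padded slots (-1) temporarily pick the last time step
result = da.data[stacked_idxs, :, :]
# Blank out the padded slots (this requires a floating-point dtype)
result[stacked_idxs == -1] = np.nan
result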
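
To see what the padded index array looks like, here is a small NumPy-only sketch on made-up ragged indices. Note the swap: it fills the padded array through a boolean mask rather than explicit row and column indices, so it is an alternative formulation and not the function above. `pad_indices` is a hypothetical name, and the sketch assumes every group has at least one element.

```python
import numpy as np


def pad_indices(groups, fillvalue=-1):
    """Pad ragged integer index lists into an (n_groups, max_len) array."""
    lengths = np.array([len(g) for g in groups])
    # True where a slot holds a real index, False where it is padding
    valid = np.arange(lengths.max()) < lengths[:, None]
    padded = np.full(valid.shape, fillvalue, dtype=int)
    # Boolean assignment fills row by row, matching the concatenation order
    padded[valid] = np.concatenate(groups)
    return padded, valid


# Made-up example: 6 samples along axis 0, split into groups of 3, 1 and 2
data = np.arange(12, dtype=float).reshape(6, 2)
padded, valid = pad_indices([[0, 1, 2], [3], [4, 5]])

# Gather along axis 0; padded slots temporarily read data[-1]
windows = data[padded]    # shape (3, 3, 2)
windows[~valid] = np.nan  # blank out the padding

print(padded)
print(windows[..., 0])
```

Here `padded` is `[[0, 1, 2], [3, -1, -1], [4, 5, -1]]`, and every slot of `windows` that corresponds to a -1 ends up as NaN.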

hafarooki (Author) · 2025-12-10T00:06:15Z

@dcherian Sorry for the late reply, and thanks for the code with an example dataset! I could make a PR, since I've had to write a convoluted workaround for this several times now, though I would need a bit more direction on how to implement these functions.
