Add Apache TsFile packaged module and Time Series docs category #1

Open
Young-Leo wants to merge 2 commits into main from tsfile

@Young-Leo
Collaborator

What does this PR do?

Adds first-class support for the Apache TsFile
columnar time-series format (used by Apache IoTDB
and other time-series systems) to 🤗 Datasets, and surfaces time-series
datasets as a dedicated modality in the documentation sidebar.

Users can now do:

from datasets import load_dataset

ds = load_dataset("tsfile", data_files="data/*.tsfile", split="train")

or rely on extension auto-detection:

ds = load_dataset("data/*.tsfile", split="train")

Why a dedicated builder?

Generic columnar formats (CSV / Parquet / HDF5) can technically store
time-series data, but they treat time as just another column. TsFile is
purpose-built around the (table, device, field) triple with chunk-level
time indices and time-series-specific encodings (delta-of-delta on
timestamps, GORILLA on slowly-varying floats), so range queries and device
pruning don't require scanning the whole file.
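
To make the encoding point concrete, here is a minimal sketch of delta-of-delta on a mostly regular timestamp column. It illustrates the general technique only, not TsFile's actual encoder, and both function names are hypothetical.

```python
def delta_of_delta_encode(timestamps):
    """Return (first, first_delta, second_order_deltas) for a list of ints."""
    if len(timestamps) < 2:
        return (timestamps[0] if timestamps else None), None, []
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], deltas[0], dods


def delta_of_delta_decode(first, first_delta, dods):
    """Invert delta_of_delta_encode."""
    if first is None:
        return []
    if first_delta is None:
        return [first]
    out = [first, first + first_delta]
    delta = first_delta
    for dod in dods:
        delta += dod
        out.append(out[-1] + delta)
    return out


# Samples every 1000 ms, with one sample arriving 5 ms late.
ts = [1_000, 2_000, 3_000, 4_005, 5_005]
first, first_delta, dods = delta_of_delta_encode(ts)
print(dods)  # [0, 5, -5]
```

On a regular sampling grid the second-order deltas are mostly zero, which is what makes them cheap to store compared with raw 64-bit timestamps.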

The new tsfile builder pushes selection down to the reader's metadata
layer instead of materializing-then-filtering.
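
The difference between pushdown and materializing-then-filtering can be sketched as follows. This is a toy model under stated assumptions: `Chunk` and `read_range` are hypothetical stand-ins, not part of the `tsfile` package, and real chunk metadata carries more than a min/max time.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for one on-disk chunk and its time statistics."""
    min_time: int
    max_time: int
    rows: list  # (timestamp, value) pairs, sorted by timestamp


def read_range(chunks, start, end):
    """Return rows with start <= timestamp <= end and the number of chunks decoded."""
    decoded = 0
    out = []
    for chunk in chunks:
        # Metadata-only check: skip chunks that cannot overlap [start, end].
        if chunk.max_time < start or chunk.min_time > end:
            continue
        decoded += 1
        out.extend((t, v) for t, v in chunk.rows if start <= t <= end)
    return out, decoded


chunks = [
    Chunk(0, 9, [(t, float(t)) for t in range(0, 10)]),
    Chunk(10, 19, [(t, float(t)) for t in range(10, 20)]),
    Chunk(20, 29, [(t, float(t)) for t in range(20, 30)]),
]
rows, decoded = read_range(chunks, start=12, end=15)
print(decoded)  # 1
```

Only one of the three chunks is decoded for this window; a materialize-then-filter approach would have decoded all three before discarding most rows.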

What's added

1. tsfile packaged module (tsfile)

Backed by tsfile.TsFileDataFrame (tsfile>=2.2.1.dev4). Each .tsfile
is loaded as a single Arrow table: a timestamp (int64) column followed
by one float64 column per selected logical timeseries.

Four complementary, mutually-aware ways to select series — all pushed
down to the metadata layer so unselected devices/fields are never decoded:

| Argument | Behaviour |
| --- | --- |
| columns= | Explicit list of logical series paths. |
| devices= | Keep listed device(s), across all their fields. Exact segment match. |
| fields= | Keep listed measurement(s), across all devices. |
| path_prefix= | Delegated to TsFileDataFrame.list_timeseries(path_prefix=...). |

devices= and fields= combine as a logical AND. columns= is mutually
exclusive with the three predicate-based filters.
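
The selection rules above can be sketched in plain Python. This is not the builder's code: `select_series` is a hypothetical helper, it assumes every path has exactly three dot-separated segments (`table.device.field`), and it models `path_prefix` with `str.startswith`, which may differ from the upstream implementation.

```python
def select_series(all_series, columns=None, devices=None, fields=None, path_prefix=None):
    """Filter series paths; `columns` excludes the three predicate filters."""
    predicates = (devices, fields, path_prefix)
    if columns is not None and any(p is not None for p in predicates):
        raise ValueError("`columns` cannot be combined with `devices`, `fields` or `path_prefix`.")
    if columns is not None:
        wanted = set(columns)
        return [path for path in all_series if path in wanted]

    selected = []
    for path in all_series:
        _table, device, field = path.split(".")  # assumed three-segment layout
        if path_prefix is not None and not path.startswith(path_prefix):
            continue
        if devices is not None and device not in devices:  # exact segment match
            continue
        if fields is not None and field not in fields:
            continue
        selected.append(path)  # every supplied predicate held: logical AND
    return selected


series = [
    "mytable.d1.temperature",
    "mytable.d1.humidity",
    "mytable.d10.temperature",
    "mytable.d2.temperature",
]
print(select_series(series, devices=["d1"]))
print(select_series(series, devices=["d1", "d2"], fields=["temperature"]))
```

The first call keeps both `d1` series and leaves `d10` out; the second keeps only the temperature series of `d1` and `d2`.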

Plus:

  • start_time / end_time: inclusive time-range filter pushed down to
    df.loc[start:end, series] (uses chunk-level time indices).
  • features: optional cast to a user-provided Features schema.
  • on_bad_files: "error" (default) / "warn" / "skip", matching the
    Parquet / HDF5 builders.
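
A minimal sketch of how an `on_bad_files` policy of this shape typically behaves. `read_all` and `read_one` are hypothetical names, and the exact exception types and log messages used by the builder are not shown in this PR description, so a broad `Exception` is assumed here.

```python
import logging

logger = logging.getLogger(__name__)


def read_all(files, read_one, on_bad_files="error"):
    """Yield (file, table) pairs, applying the bad-file policy to reader failures."""
    if on_bad_files not in ("error", "warn", "skip"):
        raise ValueError(f"Unknown on_bad_files policy: {on_bad_files!r}")
    for file in files:
        try:
            table = read_one(file)
        except Exception as exc:  # assumption: any reader failure marks a bad file
            if on_bad_files == "error":
                raise
            if on_bad_files == "warn":
                logger.warning("Skipping bad file %s: %s", file, exc)
            continue  # "warn" and "skip" both move on to the next file
        yield file, table


def fake_reader(file):
    """Stand-in reader that fails on one shard."""
    if file == "corrupt.tsfile":
        raise OSError("truncated file")
    return {"source": file}


good = [f for f, _ in read_all(["a.tsfile", "corrupt.tsfile", "b.tsfile"], fake_reader, "skip")]
print(good)  # ['a.tsfile', 'b.tsfile']
```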

Registered in _PACKAGED_DATASETS_MODULES and bound to the .tsfile
extension via _EXTENSION_TO_MODULE.

2. Tests (test_tsfile.py)

14 tests, gated by pytest.importorskip("tsfile"):

  • config validation (columns vs predicate filters mutual exclusion)
  • basic load + extension auto-detection
  • columns= / devices= / fields= / path_prefix= individually
  • combined devices= + fields= 2-D projection
  • empty-result filters → informative ValueError
  • start_time / end_time pruning
  • on_bad_files="skip" skipping a corrupted shard
  • explicit features cast

All 14 pass on the dev machine.

3. Documentation

  • New sidebar category "Time Series" in _toctree.yml,
    alongside Tabular / Audio / Vision / Text. Currently the official
    docs file time-series under Tabular, which underplays its
    domain-specific concerns (timestamp axis, time-range pruning, per-device
    addressing).
  • New page timeseries_load.mdx documenting the tsfile
    builder: data model, four series-selection paths, time-range pruning,
    bad-file handling, feature casting, and the table-model / numeric-only
    limitations of the underlying TsFileDataFrame API.
  • Cross-link from loading.mdx (the generic loading
    reference) to the new guide, with a minimal example surfacing the
    device- and field-level filters.

Known limitations (documented in the new page)

The current tsfile.TsFileDataFrame API:

  • only supports the table-model TsFile (tree-only files cannot be
    loaded);
  • only exposes numeric field types (BOOLEAN, INT32, INT64,
    FLOAT, DOUBLE, TIMESTAMP), unifying them to float64;
  • silently skips non-numeric fields (TEXT, STRING, BLOB, DATE)
    during metadata discovery.
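
One consequence of unifying everything to float64 is worth spelling out: float64 has a 53-bit significand, so INT64 (or TIMESTAMP) field values beyond ±2^53 cannot all be represented exactly. The snippet below shows the effect with plain Python floats, which are IEEE 754 doubles; it does not touch the `tsfile` API.

```python
exact_limit = 2**53  # 9007199254740992

# Every integer up to 2**53 survives the round trip through float64.
print(float(exact_limit) == exact_limit)          # True

# The next integer does not: it rounds back to 2**53.
print(float(exact_limit + 1) == exact_limit + 1)  # False
print(int(float(exact_limit + 1)))                # 9007199254740992
```

Typical sensor readings are far below this limit, but large counters or nanosecond-resolution values stored as fields could be affected.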

Dependency note

tsfile is not added to setup.py extras_require in this PR:
the latest published version on PyPI is a dev pre-release
(2.2.1.dev4) whose own metadata pins pyarrow<20, which conflicts
with the version of pyarrow used by datasets. The builder lazy-imports
tsfile and the tests are gated by pytest.importorskip, so users opt
in by installing tsfile themselves. We can add a proper extra once a
stable tsfile release lands.
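
For readers unfamiliar with the lazy-import pattern mentioned here, a minimal sketch follows. `require_module` is a hypothetical helper and the install hint is only an example; the builder's actual import code may look different.

```python
import importlib


def require_module(name, install_hint):
    """Import `name` on first use, raising a helpful ImportError if it is missing."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise ImportError(
            f"This feature needs the optional dependency '{name}'. Try: {install_hint}"
        ) from exc


# Demonstrated with a stdlib module so the sketch runs anywhere; a builder
# would call require_module("tsfile", "pip install tsfile") inside the method
# that first needs it, not at module import time.
json_module = require_module("json", "n/a")
print(json_module.dumps({"ok": True}))
```

Because the import happens only when the builder is used, `import datasets` keeps working for users who never install `tsfile`.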

Commits

  1. Add Apache TsFile packaged module — builder, registration, tests.
  2. Add Time Series category and TsFile loading guide — sidebar entry,
    new docs page, cross-link from loading.mdx.

Checklist

  • ruff check passes
  • ruff format --check passes
  • pytest tests/packaged_modules/test_tsfile.py — 14 passed
  • No changes to existing builders / public API

Introduce a new `tsfile` builder backed by `tsfile.TsFileDataFrame` (>=2.2.1.dev4).

- Register the module in `packaged_modules` and bind the `.tsfile` extension.

- Each `.tsfile` is loaded as a single Arrow table: a `timestamp` (int64) column followed by one `float64` column per selected logical timeseries.

- Supports four complementary, mutually-aware ways to select series:

    * `columns=[...]` for explicit series paths,

    * `devices=[...]` to keep listed device(s) across all their fields,

    * `fields=[...]` to keep listed measurement(s) across all devices,

    * `path_prefix=...` for upstream metadata-level prefix pruning.

- Pushes `start_time` / `end_time` down to `df.loc[start:end, series]` so chunk-level time indices skip out-of-window data.

- `on_bad_files` policy (`error` / `warn` / `skip`) controls behaviour on corrupted shards.

- Add 14 tests covering config validation, basic loading, single- and combined-filter behaviour, time-range filtering, no-match error handling and skipping of bad files.

Surface time-series datasets as a first-class modality in the docs sidebar instead of folding them into the generic `Tabular` category, matching the domain-specific concerns (timestamp axis, time-range filtering, device/field addressing) that distinguish them from generic tabular data.

- Add a dedicated `Time Series` section to `docs/source/_toctree.yml`, placed alongside `Tabular` / `Audio` / `Vision` / `Text`.

- Add `docs/source/timeseries_load.mdx` documenting the `tsfile` builder: the (table, device, field) data model, four series-selection paths (`columns` / `devices` / `fields` / `path_prefix`), time-range pruning, bad-file handling and feature casting, plus the table-model and numeric-only limitations of the underlying `TsFileDataFrame` API.

- Cross-link the new guide from the generic Loading reference, with a minimal example surfacing the device- and field-level filters.

Comment on lines +25 to +39
devices (`list[str]`, *optional*):
Keep only series whose tag-value segment matches one of the given
device identifiers. Equivalent to "give me everything from these
devices". Combined with ``fields`` and ``path_prefix`` as a logical
AND. The match is exact and segment-based, e.g. ``devices=["d1"]``
keeps ``mytable.d1.temperature`` but not ``mytable.d10.temperature``.
fields (`list[str]`, *optional*):
Keep only series whose final path segment (a.k.a. *field*, *sensor*
or *measurement*) matches one of the given names. Equivalent to
"give me this measurement across all devices".
path_prefix (`str`, *optional*):
Keep only series whose path starts with this prefix, delegated to
``TsFileDataFrame.list_timeseries(path_prefix=...)``. Note that
the prefix is matched as a raw string and should not include a
trailing dot.
Owner

Suggested change
devices (`list[str]`, *optional*):
Keep only series whose tag-value segment matches one of the given
device identifiers. Equivalent to "give me everything from these
devices". Combined with ``fields`` and ``path_prefix`` as a logical
AND. The match is exact and segment-based, e.g. ``devices=["d1"]``
keeps ``mytable.d1.temperature`` but not ``mytable.d10.temperature``.
fields (`list[str]`, *optional*):
Keep only series whose final path segment (a.k.a. *field*, *sensor*
or *measurement*) matches one of the given names. Equivalent to
"give me this measurement across all devices".
path_prefix (`str`, *optional*):
Keep only series whose path starts with this prefix, delegated to
``TsFileDataFrame.list_timeseries(path_prefix=...)``. Note that
the prefix is matched as a raw string and should not include a
trailing dot.

"""BuilderConfig for Apache TsFile.

Args:
columns (`list[str]`, *optional*):
Owner

add table_name

Owner

add batch_size

if self.info.features is None:
    for file in files:
        try:
            self.info.features = datasets.Features.from_arrow_schema(self._infer_schema(file))
Owner

Iterate over all TsFiles to generate the proper schema.

Owner

If the user specifies the columns, there is no need to iterate over all TsFiles.
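
A rough sketch of what these two comments could look like together. It is one possible reading, not the builder's code: `infer_columns` and the `list_series` callable are hypothetical, and real schema inference would also need the Arrow types, which are omitted here.

```python
def infer_columns(files, list_series, columns=None):
    """Return output column names: "timestamp" followed by the series paths."""
    if columns is not None:
        # The user fixed the schema up front, so no file needs to be opened.
        return ["timestamp", *columns]
    seen = {}
    for file in files:
        for path in list_series(file):
            seen.setdefault(path, None)  # dict preserves first-seen order
    return ["timestamp", *seen]


# Two shards exposing partially different series.
catalog = {
    "a.tsfile": ["t.d1.temperature", "t.d1.humidity"],
    "b.tsfile": ["t.d1.temperature", "t.d2.temperature"],
}
opened = []


def list_series(file):
    """Stand-in metadata reader that records which files were opened."""
    opened.append(file)
    return catalog[file]


print(infer_columns(catalog, list_series))
print(infer_columns(catalog, list_series, columns=["t.d2.temperature"]))
```

Inferring from the first readable file alone would miss `t.d2.temperature` in this example, while the explicit `columns` path never opens a file.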

Comment on lines +209 to +224
df = TsFileDataFrame(file, show_progress=False)
try:
    series = self._resolve_columns(df)
    if not series:
        arrays = [pa.array(np.empty(0, dtype=np.int64))]
        return pa.table(arrays, names=["timestamp"])
    aligned = df.loc[start:end, series]
finally:
    df.close()

timestamps = np.asarray(aligned.timestamps, dtype=np.int64)
values = np.asarray(aligned.values)
arrays = [pa.array(timestamps)]
for col_idx, name in enumerate(aligned.series_names):
    arrays.append(pa.array(np.asarray(values[:, col_idx], dtype=np.float64)))
return pa.table(arrays, names=["timestamp", *aligned.series_names])
Owner

one batch per call
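
One way to read this comment, combined with the earlier request for `batch_size`, is that the reader should hand back fixed-size batches instead of one table per file. The sketch below shows that idea on plain lists; `iter_batches` is a hypothetical helper and says nothing about how `TsFileDataFrame` exposes batching.

```python
def iter_batches(timestamps, values, batch_size):
    """Yield (timestamps, values) slices of at most `batch_size` rows each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(timestamps), batch_size):
        stop = start + batch_size
        yield timestamps[start:stop], values[start:stop]


timestamps = [0, 1, 2, 3, 4]
values = [0.0, 0.5, 1.0, 1.5, 2.0]
sizes = [len(ts) for ts, _ in iter_batches(timestamps, values, batch_size=2)]
print(sizes)  # [2, 2, 1]
```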


2 participants