Add Apache TsFile packaged module and Time Series docs category by Young-Leo · Pull Request #1 · JackieTien97/datasets

Young-Leo · 2026-04-21T07:17:55Z

What does this PR do?

Adds first-class support for the Apache TsFile
columnar time-series format (used by Apache IoTDB
and other time-series systems) to 🤗 Datasets, and surfaces time-series
datasets as a dedicated modality in the documentation sidebar.

Users can now do:

from datasets import load_dataset

ds = load_dataset("tsfile", data_files="data/*.tsfile", split="train")

or rely on extension auto-detection:

ds = load_dataset("data/*.tsfile", split="train")

Why a dedicated builder?

Generic columnar formats (CSV / Parquet / HDF5) can technically store
time-series data, but they treat time as just another column. TsFile is
purpose-built around the (table, device, field) triple with chunk-level
time indices and time-series-specific encodings (delta-of-delta on
timestamps, GORILLA on slowly-varying floats), so range queries and device
pruning don't require scanning the whole file.

The new tsfile builder pushes selection down to the reader's metadata
layer instead of materializing-then-filtering.
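The difference between the two strategies can be illustrated with a toy model. This is only a sketch: `toy_file`, `decode`, `materialize_then_filter` and `pushdown` are hypothetical names invented for the illustration and are not part of the `tsfile` or `datasets` APIs.

```python
# Toy model: a "file" maps each series path to its encoded chunk.
# decode() stands in for the expensive step a real reader performs.
decoded = []


def decode(path, chunk):
    decoded.append(path)  # record which series were actually decoded
    return list(chunk)


toy_file = {
    "mytable.d1.temperature": (20.5, 20.7),
    "mytable.d1.humidity": (40.0, 41.0),
    "mytable.d2.temperature": (18.1, 18.3),
}


def materialize_then_filter(file, wanted):
    # Decode every series, then discard the ones that were not requested.
    everything = {path: decode(path, chunk) for path, chunk in file.items()}
    return {path: everything[path] for path in wanted}


def pushdown(file, wanted):
    # Consult the metadata (here: the keys) first, decode only what is needed.
    return {path: decode(path, file[path]) for path in file if path in wanted}


wanted = {"mytable.d1.temperature"}

decoded.clear()
naive_result = materialize_then_filter(toy_file, wanted)
decoded_naive = len(decoded)

decoded.clear()
pushdown_result = pushdown(toy_file, wanted)
decoded_pushdown = len(decoded)
```

Both functions return the same data; only the amount of decoding differs.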

What's added

1. `tsfile` packaged module (tsfile)

Backed by tsfile.TsFileDataFrame (tsfile>=2.2.1.dev4). Each .tsfile
is loaded as a single Arrow table: a timestamp (int64) column followed
by one float64 column per selected logical timeseries.

Four complementary, mutually-aware ways to select series — all pushed
down to the metadata layer so unselected devices/fields are never decoded:

| Argument | Behaviour |
| --- | --- |
| `columns=` | Explicit list of logical series paths. |
| `devices=` | Keep listed device(s), across all their fields. Exact segment match. |
| `fields=` | Keep listed measurement(s), across all devices. |
| `path_prefix=` | Delegated to `TsFileDataFrame.list_timeseries(path_prefix=...)`. |

devices= and fields= combine as a logical AND. columns= is mutually
exclusive with the three predicate-based filters.
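The selection rules above can be sketched in plain Python. This is a minimal illustration, not the builder's code: `validate` and `select_series` are hypothetical helpers, and the sketch assumes series paths are dot-separated as `table.<tag values>.field`, with the device matched against the segments between the table name and the field.

```python
def validate(columns=None, devices=None, fields=None, path_prefix=None):
    # columns= is mutually exclusive with the predicate-based filters.
    predicates = (devices, fields, path_prefix)
    if columns is not None and any(p is not None for p in predicates):
        raise ValueError(
            "columns= cannot be combined with devices=, fields= or path_prefix="
        )


def select_series(all_paths, columns=None, devices=None, fields=None,
                  path_prefix=None):
    validate(columns, devices, fields, path_prefix)
    if columns is not None:
        selected = [p for p in columns if p in all_paths]
    else:
        selected = []
        for path in all_paths:
            segments = path.split(".")
            tag_segments, field = segments[1:-1], segments[-1]
            # Assumption: the prefix is compared as a raw string.
            if path_prefix is not None and not path.startswith(path_prefix):
                continue
            # Exact, segment-based match: "d1" does not match "d10".
            if devices is not None and not any(d in tag_segments for d in devices):
                continue
            if fields is not None and field not in fields:
                continue
            selected.append(path)
    if not selected:
        raise ValueError("No series matched the given selection")
    return selected


paths = [
    "mytable.d1.temperature",
    "mytable.d1.humidity",
    "mytable.d10.temperature",
    "mytable.d2.temperature",
]

only_d1 = select_series(paths, devices=["d1"])
d1_temperature = select_series(paths, devices=["d1"], fields=["temperature"])
# In this toy model a raw-string prefix also matches "mytable.d10...".
by_prefix = select_series(paths, path_prefix="mytable.d1")
```

In this sketch, `devices=["d1"]` keeps both `d1` series and drops `d10`, while adding `fields=["temperature"]` narrows the result to a single series.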

Plus:

- start_time / end_time: inclusive time-range filter pushed down to
  df.loc[start:end, series] (uses chunk-level time indices).
- features: optional cast to a user-provided Features schema.
- on_bad_files: "error" (default) / "warn" / "skip", matching the
  Parquet / HDF5 builders.
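As a rough sketch of how the inclusive time window and the three `on_bad_files` policies behave, consider the following. `in_window`, `read_all` and the `read_one` callback are hypothetical names for this illustration and do not reflect the builder's internals.

```python
import logging

logger = logging.getLogger(__name__)


def in_window(timestamp, start_time=None, end_time=None):
    # Inclusive on both ends; None means "unbounded" on that side.
    if start_time is not None and timestamp < start_time:
        return False
    if end_time is not None and timestamp > end_time:
        return False
    return True


def read_all(files, read_one, on_bad_files="error"):
    if on_bad_files not in ("error", "warn", "skip"):
        raise ValueError(f"Invalid on_bad_files value: {on_bad_files!r}")
    for file in files:
        try:
            table = read_one(file)
        except Exception as err:
            if on_bad_files == "error":
                raise  # default: fail loudly on the first unreadable file
            if on_bad_files == "warn":
                logger.warning("Skipping bad file %s: %s", file, err)
            continue  # "warn" and "skip" both move on to the next file
        yield file, table
```

Because `read_all` is a generator, the `"error"` policy raises only once the caller iterates over the result.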

Registered in _PACKAGED_DATASETS_MODULES and bound to the .tsfile
extension via _EXTENSION_TO_MODULE.

2. Tests (test_tsfile.py)

14 tests, gated by pytest.importorskip("tsfile"):

- config validation (columns vs predicate filters mutual exclusion)
- basic load + extension auto-detection
- columns= / devices= / fields= / path_prefix= individually
- combined devices= + fields= 2-D projection
- empty-result filters → informative ValueError
- start_time / end_time pruning
- on_bad_files="skip" skipping a corrupted shard
- explicit features cast

All 14 pass on the dev machine.

3. Documentation

- New sidebar category "Time Series" in _toctree.yml,
  alongside Tabular / Audio / Vision / Text. Currently the official
  docs file time-series datasets under Tabular, which underplays their
  domain-specific concerns (timestamp axis, time-range pruning, per-device
  addressing).
- New page timeseries_load.mdx documenting the tsfile
  builder: data model, four series-selection paths, time-range pruning,
  bad-file handling, feature casting, and the table-model / numeric-only
  limitations of the underlying TsFileDataFrame API.
- Cross-link from loading.mdx (the generic loading
  reference) to the new guide, with a minimal example surfacing the
  device- and field-level filters.

Known limitations (documented in the new page)

The current tsfile.TsFileDataFrame API:

- only supports the table-model TsFile (tree-only files cannot be
  loaded);
- only exposes numeric field types (BOOLEAN, INT32, INT64,
  FLOAT, DOUBLE, TIMESTAMP), unifying them to float64;
- silently skips non-numeric fields (TEXT, STRING, BLOB, DATE)
  during metadata discovery.
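One practical consequence of unifying INT64 fields to float64 is worth keeping in mind: a 64-bit float represents integers exactly only up to 2**53. The sketch below shows this general IEEE 754 behaviour in plain Python; it does not call the `tsfile` API, and the variable names are illustrative only.

```python
# Python floats are IEEE 754 doubles (the same representation as float64).
exact_limit = 2**53

# Integers below the limit survive the round trip through a float.
small = exact_limit - 1
small_round_trip = int(float(small))

# The first integer above the limit does not: it rounds to a neighbour.
big = exact_limit + 1
big_round_trip = int(float(big))

small_is_exact = small_round_trip == small
big_is_exact = big_round_trip == big
```

Large INT64 values such as nanosecond counters or 64-bit identifiers stored as fields may therefore lose their low-order bits after loading.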

Dependency note

tsfile is not added to setup.py extras_require in this PR:
the latest published version on PyPI is a dev pre-release
(2.2.1.dev4) whose own metadata pins pyarrow<20, which conflicts
with the version of pyarrow used by datasets. The builder lazy-imports
tsfile and the tests are gated by pytest.importorskip, so users opt
in by installing tsfile themselves. We can add a proper extra once a
stable tsfile release lands.
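The lazy-import pattern described here can be sketched as follows. `lazy_import` is a hypothetical helper written for this illustration, and the error message wording is an assumption, not the builder's actual text.

```python
import importlib


def lazy_import(module_name, install_hint):
    # Import only when the feature is actually used, so the optional
    # dependency is not required at package import time.
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"'{module_name}' is required for this feature. {install_hint}"
        ) from err


def load_tsfile_module():
    # Hypothetical call site: resolved on first use, not at import time.
    return lazy_import("tsfile", "Install it with: pip install tsfile")
```

With this shape, importing the package never fails for users who do not have `tsfile` installed; only calling the TsFile-specific code path does.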

Commits

- Add Apache TsFile packaged module: builder, registration, tests.
- Add Time Series category and TsFile loading guide: sidebar entry,
  new docs page, cross-link from loading.mdx.

Checklist

- ruff check passes
- ruff format --check passes
- pytest tests/packaged_modules/test_tsfile.py: 14 passed
- No changes to existing builders / public API

- Introduce a new `tsfile` builder backed by `tsfile.TsFileDataFrame` (>=2.2.1.dev4).
- Register the module in `packaged_modules` and bind the `.tsfile` extension.
- Each `.tsfile` is loaded as a single Arrow table: a `timestamp` (int64) column followed by one `float64` column per selected logical timeseries.
- Supports four complementary, mutually-aware ways to select series:
  * `columns=[...]` for explicit series paths,
  * `devices=[...]` to keep listed device(s) across all their fields,
  * `fields=[...]` to keep listed measurement(s) across all devices,
  * `path_prefix=...` for upstream metadata-level prefix pruning.
- Pushes `start_time` / `end_time` down to `df.loc[start:end, series]` so chunk-level time indices skip out-of-window data.
- `on_bad_files` policy (`error` / `warn` / `skip`) controls behaviour on corrupted shards.
- Add 14 tests covering config validation, basic loading, single- and combined-filter behaviour, time-range filtering, no-match error handling and skipping of bad files.

Surface time-series datasets as a first-class modality in the docs sidebar instead of folding them into the generic `Tabular` category, matching the domain-specific concerns (timestamp axis, time-range filtering, device/field addressing) that distinguish them from generic tabular data.

- Add a dedicated `Time Series` section to `docs/source/_toctree.yml`, placed alongside `Tabular` / `Audio` / `Vision` / `Text`.
- Add `docs/source/timeseries_load.mdx` documenting the `tsfile` builder: the (table, device, field) data model, four series-selection paths (`columns` / `devices` / `fields` / `path_prefix`), time-range pruning, bad-file handling and feature casting, plus the table-model and numeric-only limitations of the underlying `TsFileDataFrame` API.
- Cross-link the new guide from the generic Loading reference, with a minimal example surfacing the device- and field-level filters.

JackieTien97 · 2026-04-21T07:54:35Z

+        devices (`list[str]`, *optional*):
+            Keep only series whose tag-value segment matches one of the given
+            device identifiers. Equivalent to "give me everything from these
+            devices". Combined with ``fields`` and ``path_prefix`` as a logical
+            AND. The match is exact and segment-based, e.g. ``devices=["d1"]``
+            keeps ``mytable.d1.temperature`` but not ``mytable.d10.temperature``.
+        fields (`list[str]`, *optional*):
+            Keep only series whose final path segment (a.k.a. *field*, *sensor*
+            or *measurement*) matches one of the given names. Equivalent to
+            "give me this measurement across all devices".
+        path_prefix (`str`, *optional*):
+            Keep only series whose path starts with this prefix, delegated to
+            ``TsFileDataFrame.list_timeseries(path_prefix=...)``. Note that
+            the prefix is matched as a raw string and should not include a
+            trailing dot.


Suggested change

        devices (`list[str]`, *optional*):
            Keep only series whose tag-value segment matches one of the given
            device identifiers. Equivalent to "give me everything from these
            devices". Combined with ``fields`` and ``path_prefix`` as a logical
            AND. The match is exact and segment-based, e.g. ``devices=["d1"]``
            keeps ``mytable.d1.temperature`` but not ``mytable.d10.temperature``.
        fields (`list[str]`, *optional*):
            Keep only series whose final path segment (a.k.a. *field*, *sensor*
            or *measurement*) matches one of the given names. Equivalent to
            "give me this measurement across all devices".
        path_prefix (`str`, *optional*):
            Keep only series whose path starts with this prefix, delegated to
            ``TsFileDataFrame.list_timeseries(path_prefix=...)``. Note that
            the prefix is matched as a raw string and should not include a
            trailing dot.

JackieTien97 · 2026-04-21T07:55:15Z

+    """BuilderConfig for Apache TsFile.
+
+    Args:
+        columns (`list[str]`, *optional*):


Add `table_name`.

Add `batch_size`.

JackieTien97 · 2026-04-21T07:57:41Z

+            if self.info.features is None:
+                for file in files:
+                    try:
+                        self.info.features = datasets.Features.from_arrow_schema(self._infer_schema(file))


Iterate over all TsFiles to generate a proper schema.

If the user specifies `columns`, there is no need to iterate over all TsFiles.
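One way to read this suggestion is sketched below. It is not the builder's code: `infer_column_names` and the `list_series` callback are hypothetical, and the sketch relies on the PR's statement that every selected series is exposed as a float64 column, so the schema is determined by column names alone.

```python
def infer_column_names(files, list_series, columns=None):
    # Explicit columns fully determine the schema: no file scan is needed.
    if columns is not None:
        return ["timestamp", *columns]
    # Otherwise take the union of series across all files, keeping
    # first-seen order so the resulting schema is deterministic.
    seen = {}
    for file in files:
        for name in list_series(file):
            seen.setdefault(name, None)
    return ["timestamp", *seen]


# Toy metadata: the second file has a series the first one lacks.
toy_metadata = {
    "a.tsfile": ["mytable.d1.temperature"],
    "b.tsfile": ["mytable.d1.temperature", "mytable.d2.temperature"],
}
scanned = []


def toy_list_series(file):
    scanned.append(file)  # track which files were opened
    return toy_metadata[file]


full_schema = infer_column_names(list(toy_metadata), toy_list_series)
files_scanned_full = len(scanned)

scanned.clear()
explicit_schema = infer_column_names(
    list(toy_metadata), toy_list_series, columns=["mytable.d2.temperature"]
)
files_scanned_explicit = len(scanned)
```

Inferring from the first readable file only, as in the hunk above, would miss `mytable.d2.temperature` in this toy data.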

JackieTien97 · 2026-04-21T08:07:04Z

+        df = TsFileDataFrame(file, show_progress=False)
+        try:
+            series = self._resolve_columns(df)
+            if not series:
+                arrays = [pa.array(np.empty(0, dtype=np.int64))]
+                return pa.table(arrays, names=["timestamp"])
+            aligned = df.loc[start:end, series]
+        finally:
+            df.close()
+
+        timestamps = np.asarray(aligned.timestamps, dtype=np.int64)
+        values = np.asarray(aligned.values)
+        arrays = [pa.array(timestamps)]
+        for col_idx, name in enumerate(aligned.series_names):
+            arrays.append(pa.array(np.asarray(values[:, col_idx], dtype=np.float64)))
+        return pa.table(arrays, names=["timestamp", *aligned.series_names])


one batch per call
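Read together with the earlier request for `batch_size`, one possible interpretation of this comment is that the reader should hand back bounded batches rather than one whole-file table. The sketch below shows that idea on plain Python lists; `iter_batches` is a hypothetical helper, and this reading of the comment is an assumption.

```python
def iter_batches(timestamps, values, batch_size):
    # Yield aligned slices of at most batch_size rows each.
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(timestamps), batch_size):
        stop = start + batch_size
        yield timestamps[start:stop], values[start:stop]


toy_timestamps = [1, 2, 3, 4, 5]
toy_values = [0.1, 0.2, 0.3, 0.4, 0.5]
batch_lengths = [
    len(ts) for ts, _ in iter_batches(toy_timestamps, toy_values, batch_size=2)
]
```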

Young-Leo added 2 commits April 21, 2026 14:12

JackieTien97 requested changes Apr 21, 2026

Young-Leo wants to merge 2 commits into main from tsfile

2 participants
