Introduce a new `tsfile` builder backed by `tsfile.TsFileDataFrame` (>=2.2.1.dev4).
- Register the module in `packaged_modules` and bind the `.tsfile` extension.
- Each `.tsfile` is loaded as a single Arrow table: a `timestamp` (int64) column followed by one `float64` column per selected logical timeseries.
- Supports four complementary, mutually-aware ways to select series:
  * `columns=[...]` for explicit series paths,
  * `devices=[...]` to keep listed device(s) across all their fields,
  * `fields=[...]` to keep listed measurement(s) across all devices,
  * `path_prefix=...` for upstream metadata-level prefix pruning.
- Pushes `start_time` / `end_time` down to `df.loc[start:end, series]` so chunk-level time indices skip out-of-window data.
- `on_bad_files` policy (`error` / `warn` / `skip`) controls behaviour on corrupted shards.
- Add 14 tests covering config validation, basic loading, single- and combined-filter behaviour, time-range filtering, no-match error handling and skipping of bad files.
Surface time-series datasets as a first-class modality in the docs sidebar instead of folding them into the generic `Tabular` category, matching the domain-specific concerns (timestamp axis, time-range filtering, device/field addressing) that distinguish them from generic tabular data.
- Add a dedicated `Time Series` section to `docs/source/_toctree.yml`, placed alongside `Tabular` / `Audio` / `Vision` / `Text`.
- Add `docs/source/timeseries_load.mdx` documenting the `tsfile` builder: the (table, device, field) data model, four series-selection paths (`columns` / `devices` / `fields` / `path_prefix`), time-range pruning, bad-file handling and feature casting, plus the table-model and numeric-only limitations of the underlying `TsFileDataFrame` API.
- Cross-link the new guide from the generic Loading reference, with a minimal example surfacing the device- and field-level filters.
JackieTien97 requested changes on Apr 21, 2026
Comment on lines +25 to +39
devices (`list[str]`, *optional*):
    Keep only series whose tag-value segment matches one of the given
    device identifiers. Equivalent to "give me everything from these
    devices". Combined with ``fields`` and ``path_prefix`` as a logical
    AND. The match is exact and segment-based, e.g. ``devices=["d1"]``
    keeps ``mytable.d1.temperature`` but not ``mytable.d10.temperature``.
fields (`list[str]`, *optional*):
    Keep only series whose final path segment (a.k.a. *field*, *sensor*
    or *measurement*) matches one of the given names. Equivalent to
    "give me this measurement across all devices".
path_prefix (`str`, *optional*):
    Keep only series whose path starts with this prefix, delegated to
    ``TsFileDataFrame.list_timeseries(path_prefix=...)``. Note that
    the prefix is matched as a raw string and should not include a
    trailing dot.
Owner
Suggested change: identical to the quoted lines above (no modifications).
"""BuilderConfig for Apache TsFile.

Args:
    columns (`list[str]`, *optional*):
if self.info.features is None:
    for file in files:
        try:
            self.info.features = datasets.Features.from_arrow_schema(self._infer_schema(file))
Owner
Iterate over all TsFiles to generate a proper schema.
Owner
If the user specifies the columns, there is no need to iterate over all TsFiles.
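One possible reading of the two comments above, as a minimal sketch rather than the builder's actual code: take the union of series names across every file unless `columns` already fixes the schema. The `list_series` callable is hypothetical and stands in for whatever metadata lookup the builder uses.

```python
from typing import Callable, Iterable, List, Optional


def infer_series_names(
    files: Iterable[str],
    list_series: Callable[[str], List[str]],
    columns: Optional[List[str]] = None,
) -> List[str]:
    """Return ["timestamp", *series] for the dataset-level schema.

    `list_series` is a hypothetical callable returning the series paths
    stored in one file.
    """
    if columns is not None:
        # Explicit selection: the schema is already known, so no file is opened.
        return ["timestamp", *columns]

    seen = {}
    for file in files:
        for name in list_series(file):
            seen.setdefault(name, None)  # a dict keeps first-seen order
    return ["timestamp", *seen]
```

With two shards that expose different series, the result contains every series once, in first-seen order; with `columns=` set, `list_series` is never called.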
Comment on lines +209 to +224
df = TsFileDataFrame(file, show_progress=False)
try:
    series = self._resolve_columns(df)
    if not series:
        arrays = [pa.array(np.empty(0, dtype=np.int64))]
        return pa.table(arrays, names=["timestamp"])
    aligned = df.loc[start:end, series]
finally:
    df.close()

timestamps = np.asarray(aligned.timestamps, dtype=np.int64)
values = np.asarray(aligned.values)
arrays = [pa.array(timestamps)]
for col_idx, name in enumerate(aligned.series_names):
    arrays.append(pa.array(np.asarray(values[:, col_idx], dtype=np.float64)))
return pa.table(arrays, names=["timestamp", *aligned.series_names])
What does this PR do?
Adds first-class support for the Apache TsFile
columnar time-series format (used by Apache IoTDB
and other time-series systems) to 🤗 Datasets, and surfaces time-series
datasets as a dedicated modality in the documentation sidebar.
Users can now do:
or rely on extension auto-detection:
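A minimal sketch of both call styles, assuming the builder is reachable as `load_dataset("tsfile", ...)` and that the selection arguments described in this PR are forwarded as config kwargs. The file path, directory, device and field names are hypothetical.

```python
# Hypothetical paths and names; substitute your own shards.
EXPLICIT = {
    "path": "tsfile",                     # builder name added by this PR
    "data_files": "data/sensors.tsfile",  # hypothetical file
    "devices": ["d1"],                    # keep series from device d1 ...
    "fields": ["temperature"],            # ... restricted to this field (logical AND)
}

# Extension auto-detection: point at a directory of *.tsfile shards and let
# the `.tsfile` extension select the builder.
AUTODETECT = {"path": "data/"}            # hypothetical directory


def load(kwargs):
    """Call `datasets.load_dataset` lazily so importing this sketch has no side effects."""
    # Requires a `datasets` build that includes this PR, plus `tsfile` installed.
    from datasets import load_dataset

    return load_dataset(**kwargs)
```

Calling `load(EXPLICIT)` or `load(AUTODETECT)` then performs the actual load.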
Why a dedicated builder?
Generic columnar formats (CSV / Parquet / HDF5) can technically store
time-series data, but they treat time as just another column. TsFile is
purpose-built around the (table, device, field) triple with chunk-level
time indices and time-series-specific encodings (delta-of-delta on
timestamps, GORILLA on slowly-varying floats), so range queries and device
pruning don't require scanning the whole file.
The new `tsfile` builder pushes selection down to the reader's metadata layer instead of materializing-then-filtering.
What's added
1. `tsfile` packaged module (`tsfile`)
Backed by `tsfile.TsFileDataFrame` (`tsfile>=2.2.1.dev4`). Each `.tsfile` is loaded as a single Arrow table: a `timestamp` (int64) column followed by one float64 column per selected logical timeseries.
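Four complementary, mutually-aware ways to select series, all pushed down to the metadata layer so unselected devices/fields are never decoded:
- `columns=`: explicit series paths
- `devices=`: keep the listed device(s) across all their fields
- `fields=`: keep the listed measurement(s) across all devices
- `path_prefix=`: metadata-level prefix pruning, delegated to `TsFileDataFrame.list_timeseries(path_prefix=...)`
`devices=` and `fields=` combine as a logical AND. `columns=` is mutually exclusive with the three predicate-based filters.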
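The selection semantics can be sketched in plain Python. This is an illustration over an in-memory list of series paths, not the builder's code (which pushes the selection down to the metadata layer); `select_series` is a hypothetical helper, and it assumes dot-separated paths of the form `table.<tag values>.field`.

```python
from typing import List, Optional, Sequence


def select_series(
    paths: Sequence[str],
    columns: Optional[List[str]] = None,
    devices: Optional[List[str]] = None,
    fields: Optional[List[str]] = None,
    path_prefix: Optional[str] = None,
) -> List[str]:
    """Filter series paths the way the four arguments are described above."""
    predicates_used = any(x is not None for x in (devices, fields, path_prefix))
    if columns is not None and predicates_used:
        raise ValueError("`columns` cannot be combined with `devices`, `fields` or `path_prefix`")
    if columns is not None:
        wanted = set(columns)
        return [p for p in paths if p in wanted]

    selected = []
    for path in paths:
        segments = path.split(".")
        tag_segments, field = segments[1:-1], segments[-1]
        # Raw string prefix match, so "mytable.d1" also matches "mytable.d10...".
        if path_prefix is not None and not path.startswith(path_prefix):
            continue
        # Exact, segment-based device match: "d1" does not match "d10".
        if devices is not None and not set(devices) & set(tag_segments):
            continue
        if fields is not None and field not in fields:
            continue
        selected.append(path)
    return selected
```

Given `mytable.d1.temperature`, `mytable.d1.humidity` and `mytable.d10.temperature`, passing `devices=["d1"], fields=["temperature"]` keeps only the first path, whereas `path_prefix="mytable.d1"` keeps all three because the prefix is compared as a raw string.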
Plus:
- `start_time` / `end_time`: inclusive time-range filter pushed down to `df.loc[start:end, series]` (uses chunk-level time indices).
- `features`: optional cast to a user-provided `Features` schema.
- `on_bad_files`: `"error"` (default) / `"warn"` / `"skip"`, matching the Parquet / HDF5 builders.
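A rough sketch of how the inclusive window and the `on_bad_files` policy behave, assuming that `"warn"` emits a warning and then skips the shard while `"skip"` skips silently. `in_window`, `read_shards` and the `read_one` callable are hypothetical names; the real builder prunes through chunk-level time indices rather than row by row.

```python
import warnings


def in_window(rows, start_time=None, end_time=None):
    """Keep (timestamp, value) rows with start_time <= timestamp <= end_time."""
    return [
        (ts, value)
        for ts, value in rows
        if (start_time is None or ts >= start_time)
        and (end_time is None or ts <= end_time)
    ]


def read_shards(files, read_one, on_bad_files="error"):
    """Yield (file, rows) pairs, applying the bad-file policy to read failures."""
    if on_bad_files not in ("error", "warn", "skip"):
        raise ValueError(f"invalid on_bad_files: {on_bad_files!r}")
    for file in files:
        try:
            rows = read_one(file)  # hypothetical per-file reader
        except Exception as exc:
            if on_bad_files == "error":
                raise
            if on_bad_files == "warn":
                warnings.warn(f"Skipping bad file {file!r}: {exc}")
            continue  # "warn" and "skip" both move on to the next shard
        yield file, rows
```

Both bounds are inclusive, so a row whose timestamp equals `end_time` is kept.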
Registered in `_PACKAGED_DATASETS_MODULES` and bound to the `.tsfile` extension via `_EXTENSION_TO_MODULE`.
2. Tests (`test_tsfile.py`)
14 tests, gated by `pytest.importorskip("tsfile")`:
- config validation (`columns` vs predicate filters mutual exclusion)
- `columns=` / `devices=` / `fields=` / `path_prefix=` individually
- `devices=` + `fields=` 2-D projection
- `ValueError` when no series match
- `start_time` / `end_time` pruning
- `on_bad_files="skip"` skipping a corrupted shard
- `features` cast
All 14 pass on the dev machine.
3. Documentation
- Add a dedicated `Time Series` section to `docs/source/_toctree.yml`, alongside `Tabular` / `Audio` / `Vision` / `Text`. Currently the official docs file time-series under `Tabular`, which underplays its domain-specific concerns (timestamp axis, time-range pruning, per-device addressing).
- Add `docs/source/timeseries_load.mdx` documenting the `tsfile` builder: data model, four series-selection paths, time-range pruning, bad-file handling, feature casting, and the table-model / numeric-only limitations of the underlying `TsFileDataFrame` API.
- Cross-link from `loading.mdx` (the generic Loading reference) to the new guide, with a minimal example surfacing the device- and field-level filters.
Known limitations (documented in the new page)
The current `tsfile.TsFileDataFrame` API has table-model and numeric-only limitations:
- … loaded);
- … (`BOOLEAN`, `INT32`, `INT64`, `FLOAT`, `DOUBLE`, `TIMESTAMP`), unifying them to `float64`;
- … (`TEXT`, `STRING`, `BLOB`, `DATE`) during metadata discovery.
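One practical consequence of unifying everything to `float64` is worth a small illustration. This is a general IEEE-754 property rather than something the PR text states: integers above 2**53 cannot all be represented exactly, so large `INT64` or `TIMESTAMP` field values may not round-trip.

```python
# An INT64 value just above the exact-integer range of float64.
INT64_VALUE = 2**53 + 1

as_float = float(INT64_VALUE)  # what a float64 column would hold
round_trip = int(as_float)

# The round-tripped value is off by one: 2**53 + 1 is stored as 2**53.
print(INT64_VALUE, round_trip)

# BOOLEAN values survive as 1.0 / 0.0.
print(float(True), float(False))
```

Small integers and booleans are unaffected; only magnitudes beyond 2**53 lose precision.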
Dependency note
`tsfile` is not added to `setup.py` `extras_require` in this PR: the latest published version on PyPI is a dev pre-release (`2.2.1.dev4`) whose own metadata pins `pyarrow<20`, which conflicts with the version of pyarrow used by `datasets`. The builder lazy-imports `tsfile` and the tests are gated by `pytest.importorskip`, so users opt in by installing `tsfile` themselves. We can add a proper extra once a stable `tsfile` release lands.
Commits
- … new docs page, cross-link from `loading.mdx`.
Checklist
- `ruff check` passes
- `ruff format --check` passes
- `pytest tests/packaged_modules/test_tsfile.py`: 14 passed