feat: add TsFile (Apache IoTDB) packaged builder with per-device wide format (#8160)

JackieTien97 · Young-Leo · web-flow · commit d168d5fc54d5 · 2026-06-01T12:14:51.000+02:00
* feat(tsfile): add per-device wide-format TsFile builder

Add a packaged builder for TsFile (table model), the columnar time-series format used as the storage layer of Apache IoTDB. Each output row corresponds to one device (identified by its TAG columns); the `time` column and every FIELD column are Arrow `list<...>` columns holding that device's full time series, sorted in ascending time order. When a device appears in multiple files within a split, its per-file chunks are merged and sorted; duplicate timestamps for the same device raise `ValueError`.

Reading model
- Data is fetched per device via `TsFileReader.query_table` with a push-down `tag_filter`; peak memory is bounded by a single device's payload, not by the split size.
- `start_time` / `end_time` are pushed down to TsFile's internal time index. They accept `int` epochs, `datetime`, `date`, ISO-8601 strings, and `pyarrow.TimestampScalar`; tz-aware datetimes are normalized to UTC.
- Schema evolution across files: FIELD columns are unioned and missing values are filled with nulls; numeric FIELD types are promoted following IoTDB's widening rules (INT32 -> INT64 -> DOUBLE, INT32 -> FLOAT -> DOUBLE).
- `on_bad_files` controls handling of unreadable inputs ("error" | "warn" | "skip").
- `input_batch_size` bounds the per-device Arrow batch size pulled from the underlying tsfile reader; `output_batch_size` controls the number of devices packed into each emitted record batch.

Config knobs: `table_name`, `columns`, `start_time`, `end_time`, `input_batch_size`, `output_batch_size`, `features`, `on_bad_files`, `timestamp_unit`, `timestamp_tz`.
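The merge and promotion behaviour described in the reading model above can be illustrated with a minimal, dependency-free sketch. It is not the builder's code: `promote`, `merge_device_chunks`, and the `_WIDENS_TO` table are hypothetical names, plain Python lists stand in for Arrow arrays, and resolving INT64 with FLOAT to DOUBLE is an assumption inferred from the two widening chains rather than something the commit states.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical table: for each type, the set of types it may widen to (itself included).
# It encodes INT32 -> INT64 -> DOUBLE and INT32 -> FLOAT -> DOUBLE.
_WIDENS_TO = {
    "INT32": {"INT32", "INT64", "FLOAT", "DOUBLE"},
    "INT64": {"INT64", "DOUBLE"},
    "FLOAT": {"FLOAT", "DOUBLE"},
    "DOUBLE": {"DOUBLE"},
}


def promote(a: str, b: str) -> str:
    """Return the narrowest type that both `a` and `b` can widen to."""
    common = _WIDENS_TO[a] & _WIDENS_TO[b]
    # The narrowest candidate is the one that can itself widen to every other candidate.
    return next(t for t in common if common <= _WIDENS_TO[t])


# One per-file chunk of a single device: (timestamps, {field name: values}).
Chunk = Tuple[List[int], Dict[str, List[Optional[float]]]]


def merge_device_chunks(chunks: Sequence[Chunk]) -> Dict[str, list]:
    """Merge the per-file chunks of one device into a single wide row."""
    # Union of FIELD columns across files; a field missing from a file becomes None.
    field_names = sorted({name for _, fields in chunks for name in fields})
    rows = []
    for times, fields in chunks:
        for i, t in enumerate(times):
            values = [fields[name][i] if name in fields else None for name in field_names]
            rows.append((t, values))
    rows.sort(key=lambda r: r[0])  # ascending time order
    for prev, cur in zip(rows, rows[1:]):
        if prev[0] == cur[0]:
            raise ValueError(f"duplicate timestamp {cur[0]} for the same device")
    out: Dict[str, list] = {"time": [t for t, _ in rows]}
    for j, name in enumerate(field_names):
        out[name] = [values[j] for _, values in rows]
    return out
```

Sorting first and then comparing neighbours keeps duplicate detection to a single linear pass over the merged series.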
Tests
- 47 tests under `tests/packaged_modules/test_tsfile.py` covering: basic load, table/column selection, time-range pushdown (all accepted input types), schema evolution and numeric promotion, duplicate-timestamp rejection, multi-file x multi-device crossover, large device with small `input_batch_size`, timezone handling, streaming mode, `on_bad_files` modes, and the `_to_epoch` boundary helper.

Docs
- `docs/source/tabular_load.mdx`: dedicated TsFile section with data model, output schema, time-range bounds, schema evolution, bad-file handling, timestamp unit/tz, and batching/memory.
- `docs/source/loading.mdx`, `about_dataset_load.mdx`, `package_reference/loading_methods.mdx`: register and cross-reference the TsFile loader and `TsFileConfig` autodoc.

Other
- `setup.py`: add `tsfile>=2.2.1` to TESTS_REQUIRE.
- `src/datasets/packaged_modules/__init__.py`: register the `.tsfile` extension and module entry.

* docs(tsfile): split into standalone Time-series guide; bump tsfile dep to 2.3.0

- Move the TsFile loader documentation out of tabular_load.mdx into a new top-level page docs/source/tsfile_load.mdx, and add a dedicated 'Time-series' section to the sidebar (_toctree.yml). The per-device wide layout (one row per device, list-typed time/FIELD columns) is not a generic tabular convention and warrants its own guide.
- tabular_load.mdx now points readers to the new guide via a short cross-reference instead of inlining the section.
- loading.mdx: update the 'more details' link to tsfile_load.
- setup.py: bump TESTS_REQUIRE entry from tsfile>=2.2.1 to tsfile>=2.3.0.

* fix(tsfile): case-insensitive table-name lookups end-to-end

- Add `_schemas_by_lc` helper and route the three call sites through it so auto-detected and user-supplied table names compare in a single canonical (lowercase) form.
- Drop the now-misleading `_generate_shards` comment; the body matches the convention used by arrow.py / pandas.py / hdf5.py.
- Remove the TsFile cross-link from `tabular_load.mdx` so that page stays focused on tabular formats; time-series users land via the dedicated Time-series section in the sidebar.
- Cover tz-aware ISO-8601 strings in `_to_epoch` via a parametrized test (also drops the `__import__('datetime')` workaround now that `timedelta` is imported directly).
- gitignore local dev artifacts produced while iterating on the builder.

* format code

* fix(tsfile): silently ignore TIME column name in `columns`

Previously, passing the time column name (e.g. columns=["time"]) added a duplicate all-null list<float64> field that overwrote the real timestamp list in the output schema. Now TIME is treated like TAG: silently skipped from the requested field set so it is emitted exactly once as the real timestamp list. Docs and tests updated.

* fix(tsfile): install tsfile only in py3.14 CI and add Hub example (#4)

---------

Co-authored-by: Young-Leo <562593859@qq.com>
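For readers unfamiliar with the time-bound normalization mentioned in the message above (the `_to_epoch` helper), the following is a rough sketch of the idea and not the helper's actual implementation. The name `to_epoch_sketch` is hypothetical. Treating naive datetimes as UTC follows the docs added in this commit; mapping a `date` to midnight UTC is an assumption. `pyarrow.TimestampScalar` is left out so the sketch needs only the standard library.

```python
from datetime import date, datetime, timezone

_UNITS_PER_SECOND = {"s": 1, "ms": 1_000, "us": 1_000_000, "ns": 1_000_000_000}
_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_sketch(value, unit: str = "ms") -> int:
    """Normalize a time bound to an integer epoch expressed in `unit`."""
    if isinstance(value, int):
        return value  # already a raw epoch in `unit`
    if isinstance(value, str):
        value = datetime.fromisoformat(value)  # ISO-8601, optionally with a UTC offset
    elif isinstance(value, date) and not isinstance(value, datetime):
        # Assumption: a bare date means midnight at the start of that day.
        value = datetime(value.year, value.month, value.day)
    if not isinstance(value, datetime):
        raise TypeError(f"unsupported time bound: {value!r}")
    if value.tzinfo is None:
        value = value.replace(tzinfo=timezone.utc)  # naive values are read as UTC
    delta = value - _EPOCH  # aware arithmetic, so any UTC offset is accounted for
    micros = (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds
    # Integer arithmetic avoids float rounding at fine resolutions.
    return micros * _UNITS_PER_SECOND[unit] // 1_000_000
```

For example, `to_epoch_sketch("2023-11-14T08:00:00+08:00")` and `to_epoch_sketch(datetime(2023, 11, 14))` refer to the same instant and give the same millisecond epoch.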
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -121,6 +121,8 @@ jobs:
         run: pip install --upgrade uv
       - name: Install dependencies
         run: uv pip install --system "datasets[tests] @ ."
+      - name: Install tsfile (py3.14 only)
+        run: uv pip install --system "tsfile>=2.3.0"
       - name: Print dependencies
         run: uv pip list
       - name: Test with pytest
diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
@@ -101,6 +101,10 @@
     - local: tabular_load
       title: Load tabular data
     title: "Tabular"
+  - sections:
+    - local: tsfile_load
+      title: Load TsFile data
+    title: "Time-series"
   - sections:
     - local: share
       title: Share
diff --git a/docs/source/about_dataset_load.mdx b/docs/source/about_dataset_load.mdx
@@ -14,7 +14,7 @@ A dataset is a directory that contains:
 The [`load_dataset`] function fetches the requested dataset locally or from the Hugging Face Hub.
 The Hub is a central repository where all the Hugging Face datasets and models are stored.
 
-If the dataset only contains data files, then [`load_dataset`] automatically infers how to load the data files from their extensions (json, csv, parquet, txt, etc.).
+If the dataset only contains data files, then [`load_dataset`] automatically infers how to load the data files from their extensions (json, csv, parquet, tsfile, txt, etc.).
 Under the hood, 🤗 Datasets will use an appropriate [`DatasetBuilder`] based on the data files format. There exist one builder per data file format in 🤗 Datasets:
 
 * [`datasets.packaged_modules.text.Text`] for text
@@ -23,6 +23,7 @@ Under the hood, 🤗 Datasets will use an appropriate [`DatasetBuilder`] based o
 * [`datasets.packaged_modules.parquet.Parquet`] for Parquet
 * [`datasets.packaged_modules.arrow.Arrow`] for Arrow (streaming file format)
 * [`datasets.packaged_modules.sql.Sql`] for SQL databases
+* [`datasets.packaged_modules.tsfile.TsFile`] for TsFile (time-series data)
 * [`datasets.packaged_modules.imagefolder.ImageFolder`] for image folders
 * [`datasets.packaged_modules.audiofolder.AudioFolder`] for audio folders
 
diff --git a/docs/source/loading.mdx b/docs/source/loading.mdx
@@ -68,7 +68,7 @@ The `split` parameter can also map a data file to a specific split:
 
 ## Local and remote files
 
-Datasets can be loaded from local files stored on your computer and from remote files. The datasets are most likely stored as a `csv`, `json`, `txt` or `parquet` file. The [`load_dataset`] function can load each of these file types.
+Datasets can be loaded from local files stored on your computer and from remote files. The datasets are most likely stored as a `csv`, `json`, `txt`, `parquet` or `tsfile` file. The [`load_dataset`] function can load each of these file types.
 
 ### CSV
 
@@ -200,6 +200,34 @@ This will return the image caption and the image bytes in a single request.
 
 Note that the HDF5 loader assumes that the file has "tabular" structure, i.e. that all datasets in the file have (the same number of) rows on their first dimension.
 
+### TsFile
+
+[TsFile](https://tsfile.apache.org/) is a columnar file format designed for time-series data, used as the native storage layer of [Apache IoTDB](https://iotdb.apache.org/). It natively represents timestamps, device tags, and measurement fields, and maintains an internal time index that enables efficient time-range pruning.
+
+Each row in the resulting dataset corresponds to one **device** (identified by its TAG columns); the `time` column and every FIELD column are list columns containing that device's full time series, sorted in ascending time order.
+
+To load a TsFile:
+
+```py
+>>> from datasets import load_dataset
+>>> dataset = load_dataset("tsfile", data_files="my_data.tsfile")
+```
+
+Filter by time range — bounds are pushed down to TsFile's internal time index and accept `int` epochs, `datetime`, `date`, ISO-8601 strings, or `pyarrow` timestamp scalars:
+
+```py
+>>> from datetime import datetime
+>>> dataset = load_dataset(
+...     "tsfile",
+...     data_files="my_data.tsfile",
+...     start_time=datetime(2023, 11, 14),
+...     end_time=datetime(2023, 11, 15),
+... )
+```
+
+> [!TIP]
+> For more details, check out the [how to load TsFile data](tsfile_load) guide.
+
 ### SQL
 
 Read database contents with [`~datasets.Dataset.from_sql`] by specifying the URI to connect to your database. You can read both table names and queries:
diff --git a/docs/source/package_reference/loading_methods.mdx b/docs/source/package_reference/loading_methods.mdx
@@ -97,6 +97,12 @@ load_dataset("csv", data_dir="path/to/data/dir", sep="\t")
 
 [[autodoc]] datasets.packaged_modules.hdf5.HDF5
 
+### TsFile
+
+[[autodoc]] datasets.packaged_modules.tsfile.TsFileConfig
+
+[[autodoc]] datasets.packaged_modules.tsfile.TsFile
+
 ### Pdf
 
 [[autodoc]] datasets.packaged_modules.pdffolder.PdfFolderConfig
diff --git a/docs/source/tsfile_load.mdx b/docs/source/tsfile_load.mdx
@@ -0,0 +1,172 @@
+# Load TsFile data
+
+[TsFile](https://tsfile.apache.org/) is a columnar file format designed for time-series data and used as the native storage layer of [Apache IoTDB](https://iotdb.apache.org/). Compared with general-purpose columnar formats such as Parquet, TsFile is aware of the time-series data model (timestamps, devices, and measurements) and maintains an internal time index that enables time-range pruning without scanning entire files.
+
+This loader is provided as a separate guide because it does not follow the usual one-row-per-record tabular convention: each output row corresponds to one *device*, and per-measurement values are returned as Arrow `list<...>` columns. The mapping is described in detail below.
+
+## Installation
+
+The loader depends on the [`tsfile`](https://pypi.org/project/tsfile/) Python package:
+
+```bash
+pip install "tsfile>=2.3.0"
+```
+
+## Data model and output layout
+
+The loader follows the TsFile *table model*. Each table column is one of:
+
+- **TAG** — a string-typed identifier; the tuple of TAG values uniquely identifies a *device* (i.e. a single time-series source).
+- **FIELD** — a measurement whose value evolves over time.
+- **TIME** — the timestamp column, named `time` by default.
+
+The loader emits one dataset row per device. Within a row, the `time` column and every FIELD column are Arrow `list<...>` columns containing that device's full time series, sorted in ascending time order. TAG columns appear as scalar `string` columns.
+
+Concretely, the output schema has the form:
+
+```text
+<tag_1>:    string
+<tag_2>:    string                       # one column per TAG
+...
+time:       list<timestamp[unit, tz]>
+<field_1>:  list<original_type>          # one column per FIELD
+<field_2>:  list<original_type>
+...
+```
+
+When the same device appears in multiple input files of a split, its per-file chunks are concatenated and sorted by timestamp before being emitted as a single row. Duplicate timestamps for the same device raise `ValueError`.
+
+## Basic usage
+
+Load a single TsFile:
+
+```py
+>>> from datasets import load_dataset
+>>> dataset = load_dataset("tsfile", data_files="my_data.tsfile")
+```
+
+Map files to splits explicitly:
+
+```py
+>>> dataset = load_dataset(
+...     "tsfile",
+...     data_files={"train": "train_data.tsfile", "test": "test_data.tsfile"},
+... )
+```
+
+## Example dataset on the Hub
+
+A ready-to-use example is available at [`tsfile/lotsa_data`](https://huggingface.co/datasets/tsfile/lotsa_data). Because `.tsfile` files are recognized automatically, you can load it by repository id without specifying `data_files`:
+
+```py
+>>> from datasets import load_dataset
+>>> dataset = load_dataset("tsfile/lotsa_data")
+>>> dataset
+DatasetDict({
+    train: Dataset({
+        features: ['timeseries_id', 'time', 'value'],
+        num_rows: 91
+    })
+})
+```
+
+Each row is one device. The TAG column `timeseries_id` identifies the device, while `time` and `value` are `list<...>` columns holding that device's full series:
+
+```py
+>>> row = dataset["train"][0]
+>>> row["timeseries_id"]
+'Bear_assembly_Angel'
+>>> len(row["time"]), len(row["value"])
+(8760, 8760)
+>>> row["time"][:3]
+[datetime.datetime(2017, 1, 1, 0, 0), datetime.datetime(2017, 1, 1, 1, 0), datetime.datetime(2017, 1, 1, 2, 0)]
+```
+
+## Selecting a table
+
+A TsFile can contain multiple tables. When `table_name` is omitted, the first table found in the first valid file is used. Lookups are case-insensitive.
+
+```py
+>>> dataset = load_dataset("tsfile", data_files="my_data.tsfile", table_name="sensor_data")
+```
+
+## Selecting columns
+
+`columns` restricts the FIELD columns that are read. The TAG columns and the `time` column are always returned because they identify the device and its timeline. Names in `columns` that refer to a TAG or to the `time` column are silently ignored (they are emitted as usual, just once); names that match a field absent from every file become all-null list columns.
+
+```py
+>>> dataset = load_dataset(
+...     "tsfile",
+...     data_files="my_data.tsfile",
+...     columns=["temperature", "humidity"],
+... )
+```
+
+## Filtering by time range
+
+`start_time` and `end_time` are inclusive bounds; either may be omitted. The bounds are pushed down to TsFile's internal time index, so only the matching data blocks are read from disk. Both bounds accept any of:
+
+- `int` — raw epoch in `timestamp_unit` (default milliseconds);
+- `datetime.datetime` — naive values are interpreted as UTC, tz-aware values are converted to UTC;
+- `datetime.date`;
+- ISO-8601 `str`, e.g. `"2024-01-01T00:00:00"`;
+- `pyarrow.TimestampScalar`.
+
+```py
+>>> from datetime import datetime
+>>> dataset = load_dataset(
+...     "tsfile",
+...     data_files="my_data.tsfile",
+...     start_time=datetime(2023, 11, 14),
+...     end_time="2023-11-15T00:00:00",
+... )
+```
+
+## Schema evolution across files
+
+When different files expose different columns — for example a new sensor field is introduced later — the loader takes the union of all FIELD columns and fills missing values with nulls. Numeric FIELD types are promoted following IoTDB's widening rules (`INT32 → INT64 → DOUBLE`, `INT32 → FLOAT → DOUBLE`).
+
+```py
+>>> dataset = load_dataset("tsfile", data_files=["day1.tsfile", "day2.tsfile"])
+```
+
+## Handling unreadable files
+
+By default, an unreadable or non-TsFile input raises an error. Set `on_bad_files` to `"warn"` to log and continue, or `"skip"` to silently drop the file.
+
+```py
+>>> dataset = load_dataset("tsfile", data_files="data/*.tsfile", on_bad_files="skip")
+```
+
+## Timestamp unit and time zone
+
+`timestamp_unit` (default `"ms"`, matching IoTDB) controls the resolution of the `time` column and the interpretation of integer time bounds. `timestamp_tz` attaches a time zone to the Arrow timestamp type; `None` (the default) yields a timezone-naive type.
+
+```py
+>>> dataset = load_dataset(
+...     "tsfile",
+...     data_files="my_data.tsfile",
+...     timestamp_unit="us",
+...     timestamp_tz="UTC",
+... )
+```
+
+## Memory and batching
+
+Two parameters control memory usage:
+
+- `input_batch_size` (default `65_536`) — maximum number of rows fetched per Arrow batch from `TsFileReader.query_table`. Bounds peak memory while streaming a single device.
+- `output_batch_size` (default `32`) — number of devices packed into each Arrow record batch yielded to the writer. Smaller values give more responsive progress reporting; larger values reduce per-batch overhead.
+
+```py
+>>> dataset = load_dataset(
+...     "tsfile",
+...     data_files="large_data.tsfile",
+...     input_batch_size=32_768,
+...     output_batch_size=128,
+... )
+```
+
+Peak memory is bounded by the payload of a single device across the split, not by the size of the split as a whole.
+
+See [`~datasets.packaged_modules.tsfile.TsFileConfig`] for the full list of parameters.
diff --git a/src/datasets/packaged_modules/__init__.py b/src/datasets/packaged_modules/__init__.py
@@ -22,6 +22,7 @@
 from .pdffolder import pdffolder
 from .sql import sql
 from .text import text
+from .tsfile import tsfile
 from .videofolder import videofolder
 from .webdataset import webdataset
 from .xml import xml
@@ -60,6 +61,7 @@ def _hash_python_lines(lines: list[str]) -> str:
     "hdf5": (hdf5.__name__, _hash_python_lines(inspect.getsource(hdf5).splitlines())),
     "eval": (eval.__name__, _hash_python_lines(inspect.getsource(eval).splitlines())),
     "lance": (lance.__name__, _hash_python_lines(inspect.getsource(lance).splitlines())),
+    "tsfile": (tsfile.__name__, _hash_python_lines(inspect.getsource(tsfile).splitlines())),
     "iceberg": (iceberg.__name__, _hash_python_lines(inspect.getsource(iceberg).splitlines())),
 }
 
@@ -96,6 +98,7 @@ def _hash_python_lines(lines: list[str]) -> str:
     ".h5": ("hdf5", {}),
     ".eval": ("eval", {}),
     ".lance": ("lance", {}),
+    ".tsfile": ("tsfile", {}),
 }
 _EXTENSION_TO_MODULE.update({ext: ("imagefolder", {}) for ext in imagefolder.ImageFolder.EXTENSIONS})
 _EXTENSION_TO_MODULE.update({ext.upper(): ("imagefolder", {}) for ext in imagefolder.ImageFolder.EXTENSIONS})
diff --git a/src/datasets/packaged_modules/tsfile/__init__.py b/src/datasets/packaged_modules/tsfile/__init__.py
diff --git a/src/datasets/packaged_modules/tsfile/tsfile.py b/src/datasets/packaged_modules/tsfile/tsfile.py
diff --git a/tests/packaged_modules/test_tsfile.py b/tests/packaged_modules/test_tsfile.py