huggingface · JackieTien97 · Apr 28, 2026 · Apr 28, 2026 · Apr 28, 2026 · Apr 29, 2026
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -121,6 +121,8 @@ jobs:
         run: pip install --upgrade uv
       - name: Install dependencies
         run: uv pip install --system "datasets[tests] @ ."
+      - name: Install tsfile (py3.14 only)
+        run: uv pip install --system "tsfile>=2.3.0"
       - name: Print dependencies
         run: uv pip list
       - name: Test with pytest

diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
@@ -101,6 +101,10 @@
     - local: tabular_load
       title: Load tabular data
     title: "Tabular"
+  - sections:
+    - local: tsfile_load
+      title: Load TsFile data
+    title: "Time-series"
   - sections:
     - local: share
       title: Share

diff --git a/docs/source/about_dataset_load.mdx b/docs/source/about_dataset_load.mdx
@@ -14,7 +14,7 @@ A dataset is a directory that contains:
 The [`load_dataset`] function fetches the requested dataset locally or from the Hugging Face Hub.
 The Hub is a central repository where all the Hugging Face datasets and models are stored.
 
-If the dataset only contains data files, then [`load_dataset`] automatically infers how to load the data files from their extensions (json, csv, parquet, txt, etc.).
+If the dataset only contains data files, then [`load_dataset`] automatically infers how to load the data files from their extensions (json, csv, parquet, tsfile, txt, etc.).
 Under the hood, 🤗 Datasets will use an appropriate [`DatasetBuilder`] based on the data files format. There exist one builder per data file format in 🤗 Datasets:
 
 * [`datasets.packaged_modules.text.Text`] for text
@@ -23,6 +23,7 @@ Under the hood, 🤗 Datasets will use an appropriate [`DatasetBuilder`] based o
 * [`datasets.packaged_modules.parquet.Parquet`] for Parquet
 * [`datasets.packaged_modules.arrow.Arrow`] for Arrow (streaming file format)
 * [`datasets.packaged_modules.sql.Sql`] for SQL databases
+* [`datasets.packaged_modules.tsfile.TsFile`] for TsFile (time-series data)
 * [`datasets.packaged_modules.imagefolder.ImageFolder`] for image folders
 * [`datasets.packaged_modules.audiofolder.AudioFolder`] for audio folders
 

diff --git a/docs/source/loading.mdx b/docs/source/loading.mdx
@@ -68,7 +68,7 @@ The `split` parameter can also map a data file to a specific split:
 
 ## Local and remote files
 
-Datasets can be loaded from local files stored on your computer and from remote files. The datasets are most likely stored as a `csv`, `json`, `txt` or `parquet` file. The [`load_dataset`] function can load each of these file types.
+Datasets can be loaded from local files stored on your computer and from remote files. The datasets are most likely stored as a `csv`, `json`, `txt`, `parquet` or `tsfile` file. The [`load_dataset`] function can load each of these file types.
 
 ### CSV
 
@@ -200,6 +200,34 @@ This will return the image caption and the image bytes in a single request.
 
 Note that the HDF5 loader assumes that the file has "tabular" structure, i.e. that all datasets in the file have (the same number of) rows on their first dimension.
 
+### TsFile
+
+[TsFile](https://tsfile.apache.org/) is a columnar file format designed for time-series data, used as the native storage layer of [Apache IoTDB](https://iotdb.apache.org/). It natively represents timestamps, device tags, and measurement fields, and maintains an internal time index that enables efficient time-range pruning.
+
+Each row in the resulting dataset corresponds to one **device** (identified by its TAG columns); the `time` column and every FIELD column are list columns containing that device's full time series, sorted in ascending time order.
+
+To load a TsFile:
+
+```py
+>>> from datasets import load_dataset
+>>> dataset = load_dataset("tsfile", data_files="my_data.tsfile")
+```
+
+To filter by time range, pass `start_time` and/or `end_time`. Both bounds are pushed down to TsFile's internal time index and accept `int` epochs, `datetime`, `date`, ISO-8601 strings, or `pyarrow` timestamp scalars:
+
+```py
+>>> from datetime import datetime
+>>> dataset = load_dataset(
+...     "tsfile",
+...     data_files="my_data.tsfile",
+...     start_time=datetime(2023, 11, 14),
+...     end_time=datetime(2023, 11, 15),
+... )
+```
+
+> [!TIP]
+> For more details, check out the [how to load TsFile data](tsfile_load) guide.
+
 ### SQL
 
 Read database contents with [`~datasets.Dataset.from_sql`] by specifying the URI to connect to your database. You can read both table names and queries:
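
The TsFile section added to `loading.mdx` lists several accepted types for `start_time` and `end_time`. As a rough illustration of how such bounds could be normalized to a raw epoch in milliseconds (the documented default unit), here is a minimal sketch. The helper `to_epoch_ms` is hypothetical and is not the loader's actual code. Treating a plain `date` as midnight UTC is an assumption, since the docs do not say how dates are interpreted, and `pyarrow` scalars are left out to keep the sketch dependency-free.

```python
from datetime import date, datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ms(bound):
    """Hypothetical helper: normalize a time bound to epoch milliseconds (UTC)."""
    if isinstance(bound, bool):
        raise TypeError("bool is not a valid time bound")
    if isinstance(bound, int):
        return bound  # already a raw epoch value
    if isinstance(bound, str):
        bound = datetime.fromisoformat(bound)  # ISO-8601 string
    if isinstance(bound, datetime):  # checked before date: datetime subclasses date
        if bound.tzinfo is None:
            bound = bound.replace(tzinfo=timezone.utc)  # naive values are read as UTC
        else:
            bound = bound.astimezone(timezone.utc)  # tz-aware values are converted
    elif isinstance(bound, date):
        # Assumption: a plain date means midnight UTC on that day.
        bound = datetime(bound.year, bound.month, bound.day, tzinfo=timezone.utc)
    else:
        raise TypeError(f"unsupported time bound: {type(bound).__name__}")
    # Integer division of timedeltas avoids float rounding.
    return (bound - EPOCH) // timedelta(milliseconds=1)


start_ms = to_epoch_ms(datetime(2023, 11, 14))
end_ms = to_epoch_ms("2023-11-15T00:00:00")
```

Under these assumptions, a naive `datetime`, a tz-aware `datetime` for the same instant, and the matching `date` all map to the same integer, which is the property the docs rely on when they say naive values are interpreted as UTC.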

diff --git a/docs/source/package_reference/loading_methods.mdx b/docs/source/package_reference/loading_methods.mdx
@@ -97,6 +97,12 @@ load_dataset("csv", data_dir="path/to/data/dir", sep="\t")
 
 [[autodoc]] datasets.packaged_modules.hdf5.HDF5
 
+### TsFile
+
+[[autodoc]] datasets.packaged_modules.tsfile.TsFileConfig
+
+[[autodoc]] datasets.packaged_modules.tsfile.TsFile
+
 ### Pdf
 
 [[autodoc]] datasets.packaged_modules.pdffolder.PdfFolderConfig

diff --git a/docs/source/tsfile_load.mdx b/docs/source/tsfile_load.mdx
@@ -0,0 +1,172 @@
+# Load TsFile data
+
+[TsFile](https://tsfile.apache.org/) is a columnar file format designed for time-series data and used as the native storage layer of [Apache IoTDB](https://iotdb.apache.org/). Compared with general-purpose columnar formats such as Parquet, TsFile is aware of the time-series data model (timestamps, devices, and measurements) and maintains an internal time index that enables time-range pruning without scanning entire files.
+
+This loader has its own guide because it does not follow the usual one-row-per-record tabular convention: each output row corresponds to one *device*, and per-measurement values are returned as Arrow `list<...>` columns. The mapping is described in detail below.
+
+## Installation
+
+The loader depends on the [`tsfile`](https://pypi.org/project/tsfile/) Python package:
+
+```bash
+pip install "tsfile>=2.3.0"
+```
+
+## Data model and output layout
+
+The loader follows the TsFile *table model*. Each table column is one of:
+
+- **TAG** — a string-typed identifier; the tuple of TAG values uniquely identifies a *device* (i.e. a single time-series source).
+- **FIELD** — a measurement whose value evolves over time.
+- **TIME** — the timestamp column, named `time` by default.
+
+The loader emits one dataset row per device. Within a row, the `time` column and every FIELD column are Arrow `list<...>` columns containing that device's full time series, sorted in ascending time order. TAG columns appear as scalar `string` columns.
+
+Concretely, the output schema has the form:
+
+```text
+<tag_1>:    string
+<tag_2>:    string                       # one column per TAG
+...
+time:       list<timestamp[unit, tz]>
+<field_1>:  list<original_type>          # one column per FIELD
+<field_2>:  list<original_type>
+...
+```
+
+When the same device appears in multiple input files of a split, its per-file chunks are concatenated and sorted by timestamp before being emitted as a single row. Duplicate timestamps for the same device raise `ValueError`.
+
+## Basic usage
+
+Load a single TsFile:
+
+```py
+>>> from datasets import load_dataset
+>>> dataset = load_dataset("tsfile", data_files="my_data.tsfile")
+```
+
+Map files to splits explicitly:
+
+```py
+>>> dataset = load_dataset(
+...     "tsfile",
+...     data_files={"train": "train_data.tsfile", "test": "test_data.tsfile"},
+... )
+```
+
+## Example dataset on the Hub
+
+A ready-to-use example is available at [`tsfile/lotsa_data`](https://huggingface.co/datasets/tsfile/lotsa_data). Because `.tsfile` files are recognized automatically, you can load it by repository id without specifying `data_files`:
+
+```py
+>>> from datasets import load_dataset
+>>> dataset = load_dataset("tsfile/lotsa_data")
+>>> dataset
+DatasetDict({
+    train: Dataset({
+        features: ['timeseries_id', 'time', 'value'],
+        num_rows: 91
+    })
+})
+```
+
+Each row is one device. The TAG column `timeseries_id` identifies the device, while `time` and `value` are `list<...>` columns holding that device's full series:
+
+```py
+>>> row = dataset["train"][0]
+>>> row["timeseries_id"]
+'Bear_assembly_Angel'
+>>> len(row["time"]), len(row["value"])
+(8760, 8760)
+>>> row["time"][:3]
+[datetime.datetime(2017, 1, 1, 0, 0), datetime.datetime(2017, 1, 1, 1, 0), datetime.datetime(2017, 1, 1, 2, 0)]
+```
+
+## Selecting a table
+
+A TsFile can contain multiple tables. When `table_name` is omitted, the first table found in the first valid file is used. Lookups are case-insensitive.
+
+```py
+>>> dataset = load_dataset("tsfile", data_files="my_data.tsfile", table_name="sensor_data")
+```
+
+## Selecting columns
+
+`columns` restricts the FIELD columns that are read. The TAG columns and the `time` column are always returned because they identify the device and its timeline. Names in `columns` that refer to a TAG column or to the `time` column have no effect: those columns are emitted exactly once either way. Names that match no FIELD in any input file produce all-null list columns.
+
+```py
+>>> dataset = load_dataset(
+...     "tsfile",
+...     data_files="my_data.tsfile",
+...     columns=["temperature", "humidity"],
+... )
+```
+
+## Filtering by time range
+
+`start_time` and `end_time` are inclusive bounds; either may be omitted. The bounds are pushed down to TsFile's internal time index, so only the matching data blocks are read from disk. Both bounds accept any of:
+
+- `int` — raw epoch in `timestamp_unit` (default milliseconds);
+- `datetime.datetime` — naive values are interpreted as UTC, tz-aware values are converted to UTC;
+- `datetime.date`;
+- ISO-8601 `str`, e.g. `"2024-01-01T00:00:00"`;
+- `pyarrow.TimestampScalar`.
+
+```py
+>>> from datetime import datetime
+>>> dataset = load_dataset(
+...     "tsfile",
+...     data_files="my_data.tsfile",
+...     start_time=datetime(2023, 11, 14),
+...     end_time="2023-11-15T00:00:00",
+... )
+```
+
+## Schema evolution across files
+
+When different files expose different columns — for example a new sensor field is introduced later — the loader takes the union of all FIELD columns and fills missing values with nulls. Numeric FIELD types are promoted following IoTDB's widening rules (`INT32 → INT64 → DOUBLE`, `INT32 → FLOAT → DOUBLE`).
+
+```py
+>>> dataset = load_dataset("tsfile", data_files=["day1.tsfile", "day2.tsfile"])
+```
+
+## Handling unreadable files
+
+By default, an unreadable or non-TsFile input raises an error. Set `on_bad_files` to `"warn"` to log and continue, or `"skip"` to silently drop the file.
+
+```py
+>>> dataset = load_dataset("tsfile", data_files="data/*.tsfile", on_bad_files="skip")
+```
+
+## Timestamp unit and time zone
+
+`timestamp_unit` (default `"ms"`, matching IoTDB) controls the resolution of the `time` column and the interpretation of integer time bounds. `timestamp_tz` attaches a time zone to the Arrow timestamp type; `None` (the default) yields a timezone-naive type.
+
+```py
+>>> dataset = load_dataset(
+...     "tsfile",
+...     data_files="my_data.tsfile",
+...     timestamp_unit="us",
+...     timestamp_tz="UTC",
+... )
+```
+
+## Memory and batching
+
+Two parameters control memory usage:
+
+- `input_batch_size` (default `65_536`) — maximum number of rows fetched per Arrow batch from `TsFileReader.query_table`. Bounds peak memory while streaming a single device.
+- `output_batch_size` (default `32`) — number of devices packed into each Arrow record batch yielded to the writer. Smaller values give more responsive progress reporting; larger values reduce per-batch overhead.
+
+```py
+>>> dataset = load_dataset(
+...     "tsfile",
+...     data_files="large_data.tsfile",
+...     input_batch_size=32_768,
+...     output_batch_size=128,
+... )
+```
+
+Peak memory is bounded by the data of a single device (accumulated across the files of the split), not by the size of the split as a whole.
+
+See [`~datasets.packaged_modules.tsfile.TsFileConfig`] for the full list of parameters.
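
The one-row-per-device layout described in `tsfile_load.mdx` differs from the usual one-row-per-record shape, so downstream code may want to flatten it again. The following is a minimal sketch of that step in plain Python. The function `explode_device_row` and the sample row are hypothetical and only mirror the documented layout (scalar TAG columns plus equal-length `time` and FIELD lists); this is not part of the loader.

```python
from datetime import datetime
from typing import Any, Iterator


def explode_device_row(row: dict[str, Any], list_columns: list[str]) -> Iterator[dict[str, Any]]:
    """Yield one flat record per timestamp from a one-row-per-device record."""
    # Everything that is not a list column (the TAG columns) is repeated on each record.
    scalars = {name: value for name, value in row.items() if name not in list_columns}
    lengths = {len(row[name]) for name in list_columns}
    if len(lengths) > 1:
        raise ValueError(f"list columns have different lengths: {sorted(lengths)}")
    for values in zip(*(row[name] for name in list_columns)):
        yield {**scalars, **dict(zip(list_columns, values))}


# Hypothetical row shaped like the documented output: one TAG, `time`, one FIELD.
row = {
    "timeseries_id": "device_a",
    "time": [datetime(2017, 1, 1, 0), datetime(2017, 1, 1, 1)],
    "value": [1.5, 2.0],
}
records = list(explode_device_row(row, ["time", "value"]))
```

Each element of `records` is then an ordinary `(tag, time, value)` record, which is convenient for tools that expect long-format time series.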
diff --git a/src/datasets/packaged_modules/__init__.py b/src/datasets/packaged_modules/__init__.py
@@ -22,6 +22,7 @@
 from .pdffolder import pdffolder
 from .sql import sql
 from .text import text
+from .tsfile import tsfile
 from .videofolder import videofolder
 from .webdataset import webdataset
 from .xml import xml
@@ -60,6 +61,7 @@ def _hash_python_lines(lines: list[str]) -> str:
     "hdf5": (hdf5.__name__, _hash_python_lines(inspect.getsource(hdf5).splitlines())),
     "eval": (eval.__name__, _hash_python_lines(inspect.getsource(eval).splitlines())),
     "lance": (lance.__name__, _hash_python_lines(inspect.getsource(lance).splitlines())),
+    "tsfile": (tsfile.__name__, _hash_python_lines(inspect.getsource(tsfile).splitlines())),
     "iceberg": (iceberg.__name__, _hash_python_lines(inspect.getsource(iceberg).splitlines())),
 }
 
@@ -96,6 +98,7 @@ def _hash_python_lines(lines: list[str]) -> str:
     ".h5": ("hdf5", {}),
     ".eval": ("eval", {}),
     ".lance": ("lance", {}),
+    ".tsfile": ("tsfile", {}),
 }
 _EXTENSION_TO_MODULE.update({ext: ("imagefolder", {}) for ext in imagefolder.ImageFolder.EXTENSIONS})
 _EXTENSION_TO_MODULE.update({ext.upper(): ("imagefolder", {}) for ext in imagefolder.ImageFolder.EXTENSIONS})
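
Registering `.tsfile` in `_EXTENSION_TO_MODULE` is what allows a repository of `.tsfile` files to be loaded without naming the builder. As a simplified illustration of an extension-based lookup, assuming a majority vote over recognized suffixes (the function `guess_module` is hypothetical, and the real inference logic in 🤗 Datasets may differ):

```python
from collections import Counter
from pathlib import PurePosixPath

# Trimmed copy of the mapping entries visible in the diff above.
EXTENSION_TO_MODULE = {
    ".h5": ("hdf5", {}),
    ".eval": ("eval", {}),
    ".lance": ("lance", {}),
    ".tsfile": ("tsfile", {}),
}


def guess_module(data_files):
    """Hypothetical helper: pick the module whose extensions occur most often."""
    counts = Counter()
    for path in data_files:
        suffix = PurePosixPath(path).suffix
        if suffix in EXTENSION_TO_MODULE:
            module_name, _default_kwargs = EXTENSION_TO_MODULE[suffix]
            counts[module_name] += 1
    if not counts:
        return None  # no recognized data files
    return counts.most_common(1)[0][0]


module = guess_module(["data/day1.tsfile", "data/day2.tsfile", "README.md"])
```

Files with unrecognized suffixes, such as `README.md` here, simply do not vote.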

diff --git a/src/datasets/packaged_modules/tsfile/__init__.py b/src/datasets/packaged_modules/tsfile/__init__.py