
Commit d168d5f

feat: add TsFile (Apache IoTDB) packaged builder with per-device wide format (#8160)
* feat(tsfile): add per-device wide-format TsFile builder

Add a packaged builder for TsFile (table model), the columnar time-series format used as the storage layer of Apache IoTDB. Each output row corresponds to one device (identified by its TAG columns); the `time` column and every FIELD column are Arrow `list<...>` columns holding that device's full time series, sorted in ascending time order. When a device appears in multiple files within a split, its per-file chunks are merged and sorted; duplicate timestamps for the same device raise `ValueError`.

Reading model
- Data is fetched per device via `TsFileReader.query_table` with a push-down `tag_filter`; peak memory is bounded by a single device's payload, not by the split size.
- `start_time` / `end_time` are pushed down to TsFile's internal time index. They accept `int` epochs, `datetime`, `date`, ISO-8601 strings, and `pyarrow.TimestampScalar`; tz-aware datetimes are normalized to UTC.
- Schema evolution across files: FIELD columns are unioned and missing values are filled with nulls; numeric FIELD types are promoted following IoTDB's widening rules (INT32 -> INT64 -> DOUBLE, INT32 -> FLOAT -> DOUBLE).
- `on_bad_files` controls handling of unreadable inputs ("error" | "warn" | "skip").
- `input_batch_size` bounds the per-device Arrow batch size pulled from the underlying tsfile reader; `output_batch_size` controls the number of devices packed into each emitted record batch.

Config knobs: `table_name`, `columns`, `start_time`, `end_time`, `input_batch_size`, `output_batch_size`, `features`, `on_bad_files`, `timestamp_unit`, `timestamp_tz`.
Tests
- 47 tests under `tests/packaged_modules/test_tsfile.py` covering: basic load, table/column selection, time-range pushdown (all accepted input types), schema evolution and numeric promotion, duplicate-timestamp rejection, multi-file x multi-device crossover, large device with small `input_batch_size`, timezone handling, streaming mode, `on_bad_files` modes, and the `_to_epoch` boundary helper.

Docs
- `docs/source/tabular_load.mdx`: dedicated TsFile section with data model, output schema, time-range bounds, schema evolution, bad-file handling, timestamp unit/tz, and batching/memory.
- `docs/source/loading.mdx`, `about_dataset_load.mdx`, `package_reference/loading_methods.mdx`: register and cross-reference the TsFile loader and `TsFileConfig` autodoc.

Other
- `setup.py`: add `tsfile>=2.2.1` to TESTS_REQUIRE.
- `src/datasets/packaged_modules/__init__.py`: register the `.tsfile` extension and module entry.

* docs(tsfile): split into standalone Time-series guide; bump tsfile dep to 2.3.0

- Move the TsFile loader documentation out of tabular_load.mdx into a new top-level page docs/source/tsfile_load.mdx, and add a dedicated 'Time-series' section to the sidebar (_toctree.yml). The per-device wide layout (one row per device, list-typed time/FIELD columns) is not a generic tabular convention and warrants its own guide.
- tabular_load.mdx now points readers to the new guide via a short cross-reference instead of inlining the section.
- loading.mdx: update the 'more details' link to tsfile_load.
- setup.py: bump TESTS_REQUIRE entry from tsfile>=2.2.1 to tsfile>=2.3.0.

* fix(tsfile): case-insensitive table-name lookups end-to-end

- Add `_schemas_by_lc` helper and route the three call sites through it so auto-detected and user-supplied table names compare in a single canonical (lowercase) form.
- Drop the now-misleading `_generate_shards` comment; the body matches the convention used by arrow.py / pandas.py / hdf5.py.
- Remove the TsFile cross-link from `tabular_load.mdx` so that page stays focused on tabular formats; time-series users land via the dedicated Time-series section in the sidebar.
- Cover tz-aware ISO-8601 strings in `_to_epoch` via a parametrized test (also drops the `__import__('datetime')` workaround now that `timedelta` is imported directly).
- gitignore local dev artifacts produced while iterating on the builder.

* format code

* fix(tsfile): silently ignore TIME column name in `columns`

Previously, passing the time column name (e.g. columns=["time"]) added a duplicate all-null list<float64> field that overwrote the real timestamp list in the output schema. Now TIME is treated like TAG: silently skipped from the requested field set so it is emitted exactly once as the real timestamp list. Docs and tests updated.

* fix(tsfile): install tsfile only in py3.14 CI and add Hub example (#4)

---------

Co-authored-by: Young-Leo <562593859@qq.com>
1 parent 992f3cf commit d168d5f

10 files changed

Lines changed: 1739 additions & 2 deletions


.github/workflows/ci.yml

Lines changed: 2 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -121,6 +121,8 @@ jobs:
121121
run: pip install --upgrade uv
122122
- name: Install dependencies
123123
run: uv pip install --system "datasets[tests] @ ."
124+
- name: Install tsfile (py3.14 only)
125+
run: uv pip install --system "tsfile>=2.3.0"
124126
- name: Print dependencies
125127
run: uv pip list
126128
- name: Test with pytest

docs/source/_toctree.yml

Lines changed: 4 additions & 0 deletions
@@ -101,6 +101,10 @@
101101
- local: tabular_load
102102
title: Load tabular data
103103
title: "Tabular"
104+
- sections:
105+
- local: tsfile_load
106+
title: Load TsFile data
107+
title: "Time-series"
104108
- sections:
105109
- local: share
106110
title: Share

docs/source/about_dataset_load.mdx

Lines changed: 2 additions & 1 deletion
@@ -14,7 +14,7 @@ A dataset is a directory that contains:
1414
The [`load_dataset`] function fetches the requested dataset locally or from the Hugging Face Hub.
1515
The Hub is a central repository where all the Hugging Face datasets and models are stored.
1616

17-
If the dataset only contains data files, then [`load_dataset`] automatically infers how to load the data files from their extensions (json, csv, parquet, txt, etc.).
17+
If the dataset only contains data files, then [`load_dataset`] automatically infers how to load the data files from their extensions (json, csv, parquet, tsfile, txt, etc.).
1818
Under the hood, 🤗 Datasets will use an appropriate [`DatasetBuilder`] based on the data files format. There exist one builder per data file format in 🤗 Datasets:
1919

2020
* [`datasets.packaged_modules.text.Text`] for text
@@ -23,6 +23,7 @@ Under the hood, 🤗 Datasets will use an appropriate [`DatasetBuilder`] based o
2323
* [`datasets.packaged_modules.parquet.Parquet`] for Parquet
2424
* [`datasets.packaged_modules.arrow.Arrow`] for Arrow (streaming file format)
2525
* [`datasets.packaged_modules.sql.Sql`] for SQL databases
26+
* [`datasets.packaged_modules.tsfile.TsFile`] for TsFile (time-series data)
2627
* [`datasets.packaged_modules.imagefolder.ImageFolder`] for image folders
2728
* [`datasets.packaged_modules.audiofolder.AudioFolder`] for audio folders
2829

docs/source/loading.mdx

Lines changed: 29 additions & 1 deletion
@@ -68,7 +68,7 @@ The `split` parameter can also map a data file to a specific split:
6868

6969
## Local and remote files
7070

71-
Datasets can be loaded from local files stored on your computer and from remote files. The datasets are most likely stored as a `csv`, `json`, `txt` or `parquet` file. The [`load_dataset`] function can load each of these file types.
71+
Datasets can be loaded from local files stored on your computer and from remote files. The datasets are most likely stored as a `csv`, `json`, `txt`, `parquet` or `tsfile` file. The [`load_dataset`] function can load each of these file types.
7272

7373
### CSV
7474

@@ -200,6 +200,34 @@ This will return the image caption and the image bytes in a single request.
200200

201201
Note that the HDF5 loader assumes that the file has "tabular" structure, i.e. that all datasets in the file have (the same number of) rows on their first dimension.
202202

203+
### TsFile
204+
205+
[TsFile](https://tsfile.apache.org/) is a columnar file format designed for time-series data, used as the native storage layer of [Apache IoTDB](https://iotdb.apache.org/). It natively represents timestamps, device tags, and measurement fields, and maintains an internal time index that enables efficient time-range pruning.
206+
207+
Each row in the resulting dataset corresponds to one **device** (identified by its TAG columns); the `time` column and every FIELD column are list columns containing that device's full time series, sorted in ascending time order.
208+
209+
To load a TsFile:
210+
211+
```py
212+
>>> from datasets import load_dataset
213+
>>> dataset = load_dataset("tsfile", data_files="my_data.tsfile")
214+
```
215+
216+
Filter by time range — bounds are pushed down to TsFile's internal time index and accept `int` epochs, `datetime`, `date`, ISO-8601 strings, or `pyarrow` timestamp scalars:
217+
218+
```py
219+
>>> from datetime import datetime
220+
>>> dataset = load_dataset(
221+
... "tsfile",
222+
... data_files="my_data.tsfile",
223+
... start_time=datetime(2023, 11, 14),
224+
... end_time=datetime(2023, 11, 15),
225+
... )
226+
```
227+
228+
> [!TIP]
229+
> For more details, check out the [how to load TsFile data](tsfile_load) guide.
230+
203231
### SQL
204232

205233
Read database contents with [`~datasets.Dataset.from_sql`] by specifying the URI to connect to your database. You can read both table names and queries:

docs/source/package_reference/loading_methods.mdx

Lines changed: 6 additions & 0 deletions
@@ -97,6 +97,12 @@ load_dataset("csv", data_dir="path/to/data/dir", sep="\t")
9797

9898
[[autodoc]] datasets.packaged_modules.hdf5.HDF5
9999

100+
### TsFile
101+
102+
[[autodoc]] datasets.packaged_modules.tsfile.TsFileConfig
103+
104+
[[autodoc]] datasets.packaged_modules.tsfile.TsFile
105+
100106
### Pdf
101107

102108
[[autodoc]] datasets.packaged_modules.pdffolder.PdfFolderConfig

docs/source/tsfile_load.mdx

Lines changed: 172 additions & 0 deletions
@@ -0,0 +1,172 @@
1+
# Load TsFile data
2+
3+
[TsFile](https://tsfile.apache.org/) is a columnar file format designed for time-series data and used as the native storage layer of [Apache IoTDB](https://iotdb.apache.org/). Compared with general-purpose columnar formats such as Parquet, TsFile is aware of the time-series data model (timestamps, devices, and measurements) and maintains an internal time index that enables time-range pruning without scanning entire files.
4+
5+
This loader is provided as a separate guide because it does not follow the usual one-row-per-record tabular convention: each output row corresponds to one *device*, and per-measurement values are returned as Arrow `list<...>` columns. The mapping is described in detail below.
6+
7+
## Installation
8+
9+
The loader depends on the [`tsfile`](https://pypi.org/project/tsfile/) Python package:
10+
11+
```bash
12+
pip install "tsfile>=2.3.0"
13+
```
14+
15+
## Data model and output layout
16+
17+
The loader follows the TsFile *table model*. Each table column is one of:
18+
19+
- **TAG** — a string-typed identifier; the tuple of TAG values uniquely identifies a *device* (i.e. a single time-series source).
20+
- **FIELD** — a measurement whose value evolves over time.
21+
- **TIME** — the timestamp column, named `time` by default.
22+
23+
The loader emits one dataset row per device. Within a row, the `time` column and every FIELD column are Arrow `list<...>` columns containing that device's full time series, sorted in ascending time order. TAG columns appear as scalar `string` columns.
24+
25+
Concretely, the output schema has the form:
26+
27+
```text
28+
<tag_1>: string
29+
<tag_2>: string # one column per TAG
30+
...
31+
time: list<timestamp[unit, tz]>
32+
<field_1>: list<original_type> # one column per FIELD
33+
<field_2>: list<original_type>
34+
...
35+
```
36+
37+
When the same device appears in multiple input files of a split, its per-file chunks are concatenated and sorted by timestamp before being emitted as a single row. Duplicate timestamps for the same device raise `ValueError`.
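
The merge rule above can be sketched in plain Python. This is only an illustration under stated assumptions: the `Chunk` shape and the name `merge_device_chunks` are hypothetical, and lists stand in for the Arrow arrays the loader works with.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical chunk shape: (timestamps, {field_name: values}) for ONE device in ONE file.
Chunk = Tuple[List[int], Dict[str, list]]


def merge_device_chunks(chunks: Sequence[Chunk]) -> Chunk:
    """Concatenate the per-file chunks of one device and sort them by timestamp."""
    field_names = sorted({name for _, fields in chunks for name in fields})
    times: List[int] = []
    columns: Dict[str, list] = {name: [] for name in field_names}
    for chunk_times, fields in chunks:
        times.extend(chunk_times)
        for name in field_names:
            # A field that is missing from this file contributes nulls.
            columns[name].extend(fields.get(name, [None] * len(chunk_times)))

    # Sort every column by the same timestamp order.
    order = sorted(range(len(times)), key=times.__getitem__)
    sorted_times = [times[i] for i in order]

    # After sorting, duplicates are adjacent, so one pass is enough to detect them.
    for previous, current in zip(sorted_times, sorted_times[1:]):
        if previous == current:
            raise ValueError(f"duplicate timestamp {current} for the same device")

    return sorted_times, {name: [col[i] for i in order] for name, col in columns.items()}
```

Calling `merge_device_chunks` with two chunks that share a timestamp raises `ValueError`, mirroring the behavior described above.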
38+
39+
## Basic usage
40+
41+
Load a single TsFile:
42+
43+
```py
44+
>>> from datasets import load_dataset
45+
>>> dataset = load_dataset("tsfile", data_files="my_data.tsfile")
46+
```
47+
48+
Map files to splits explicitly:
49+
50+
```py
51+
>>> dataset = load_dataset(
52+
... "tsfile",
53+
... data_files={"train": "train_data.tsfile", "test": "test_data.tsfile"},
54+
... )
55+
```
56+
57+
## Example dataset on the Hub
58+
59+
A ready-to-use example is available at [`tsfile/lotsa_data`](https://huggingface.co/datasets/tsfile/lotsa_data). Because `.tsfile` files are recognized automatically, you can load it by repository id without specifying `data_files`:
60+
61+
```py
62+
>>> from datasets import load_dataset
63+
>>> dataset = load_dataset("tsfile/lotsa_data")
64+
>>> dataset
65+
DatasetDict({
66+
train: Dataset({
67+
features: ['timeseries_id', 'time', 'value'],
68+
num_rows: 91
69+
})
70+
})
71+
```
72+
73+
Each row is one device. The TAG column `timeseries_id` identifies the device, while `time` and `value` are `list<...>` columns holding that device's full series:
74+
75+
```py
76+
>>> row = dataset["train"][0]
77+
>>> row["timeseries_id"]
78+
'Bear_assembly_Angel'
79+
>>> len(row["time"]), len(row["value"])
80+
(8760, 8760)
81+
>>> row["time"][:3]
82+
[datetime.datetime(2017, 1, 1, 0, 0), datetime.datetime(2017, 1, 1, 1, 0), datetime.datetime(2017, 1, 1, 2, 0)]
83+
```
84+
85+
## Selecting a table
86+
87+
A TsFile can contain multiple tables. When `table_name` is omitted, the first table found in the first valid file is used. Lookups are case-insensitive.
88+
89+
```py
90+
>>> dataset = load_dataset("tsfile", data_files="my_data.tsfile", table_name="sensor_data")
91+
```
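
As a rough illustration of the lookup rule described above (first table when `table_name` is omitted, case-insensitive matching otherwise), the sketch below uses a plain dict in place of the schemas a real reader would return. The name `pick_table` and the dict layout are hypothetical and are not the loader's internals.

```python
from typing import Dict, Optional


def pick_table(schemas: Dict[str, dict], table_name: Optional[str] = None) -> str:
    """Return the stored name of the table to load.

    `schemas` is a hypothetical mapping of table name -> schema, in file order.
    """
    if not schemas:
        raise ValueError("no tables found")
    if table_name is None:
        # No name given: fall back to the first table.
        return next(iter(schemas))
    # Compare names in one canonical (lowercase) form.
    by_lowercase = {name.lower(): name for name in schemas}
    try:
        return by_lowercase[table_name.lower()]
    except KeyError:
        raise ValueError(
            f"table {table_name!r} not found; available: {sorted(schemas)}"
        ) from None
```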
92+
93+
## Selecting columns
94+
95+
`columns` restricts the FIELD columns that are read. The TAG columns and the `time` column are always returned because they identify the device and its timeline. Names in `columns` that refer to a TAG or to the `time` column are silently ignored (they are emitted as usual, just once); names that match a field absent from every file become all-null list columns.
96+
97+
```py
98+
>>> dataset = load_dataset(
99+
... "tsfile",
100+
... data_files="my_data.tsfile",
101+
... columns=["temperature", "humidity"],
102+
... )
103+
```
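
The column rules above can be summarized in a small sketch. It assumes, beyond what the text states, that omitting `columns` selects every known FIELD and that names are compared exactly; `resolve_fields` is a hypothetical helper, not part of the loader's API.

```python
from typing import List, Optional, Sequence, Tuple


def resolve_fields(
    requested: Optional[Sequence[str]],
    tag_columns: Sequence[str],
    known_fields: Sequence[str],
    time_column: str = "time",
) -> Tuple[List[str], List[str]]:
    """Split `requested` into (fields to read, fields to emit as all-null lists)."""
    if requested is None:
        # Assumption: no restriction means every known FIELD is read.
        return list(known_fields), []
    always_emitted = set(tag_columns) | {time_column}
    to_read: List[str] = []
    all_null: List[str] = []
    for name in requested:
        # TAG and TIME names are skipped silently; repeated names are kept once.
        if name in always_emitted or name in to_read or name in all_null:
            continue
        if name in known_fields:
            to_read.append(name)
        else:
            # Absent from every file: becomes an all-null list column.
            all_null.append(name)
    return to_read, all_null
```

For instance, requesting `["time", "site", "temperature", "pressure"]` with TAG `site` and known fields `temperature` and `humidity` yields `temperature` to read and `pressure` as an all-null column.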
104+
105+
## Filtering by time range
106+
107+
`start_time` and `end_time` are inclusive bounds; either may be omitted. The bounds are pushed down to TsFile's internal time index, so only the matching data blocks are read from disk. Both bounds accept any of:
108+
109+
- `int` — raw epoch in `timestamp_unit` (default milliseconds);
110+
- `datetime.datetime` — naive values are interpreted as UTC, tz-aware values are converted to UTC;
111+
- `datetime.date`;
112+
- ISO-8601 `str`, e.g. `"2024-01-01T00:00:00"`;
113+
- `pyarrow.TimestampScalar`.
114+
115+
```py
116+
>>> from datetime import datetime
117+
>>> dataset = load_dataset(
118+
... "tsfile",
119+
... data_files="my_data.tsfile",
120+
... start_time=datetime(2023, 11, 14),
121+
... end_time="2023-11-15T00:00:00",
122+
... )
123+
```
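
To make the accepted bound types concrete, here is a minimal standard-library sketch of how such values could be normalized to an integer epoch. It is not the loader's `_to_epoch` helper: the name `to_epoch_sketch` is hypothetical, a bare `date` is assumed to mean midnight UTC, and `pyarrow.TimestampScalar` is left out to keep the example dependency-free.

```python
from datetime import date, datetime, timezone

_TICKS_PER_SECOND = {"s": 1, "ms": 1_000, "us": 1_000_000, "ns": 1_000_000_000}
_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_sketch(value, unit="ms"):
    """Convert a time bound to an integer epoch in `unit` (hypothetical helper)."""
    if isinstance(value, int):
        return value  # already a raw epoch expressed in `unit`
    if isinstance(value, str):
        value = datetime.fromisoformat(value)  # ISO-8601, with or without a UTC offset
    elif isinstance(value, date) and not isinstance(value, datetime):
        # Assumption: a bare date means midnight at the start of that day.
        value = datetime(value.year, value.month, value.day)
    if not isinstance(value, datetime):
        raise TypeError(f"unsupported time bound: {type(value).__name__}")
    if value.tzinfo is None:
        value = value.replace(tzinfo=timezone.utc)  # naive values are read as UTC
    delta = value.astimezone(timezone.utc) - _EPOCH
    # Integer arithmetic avoids float rounding at fine resolutions.
    micros = (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds
    return micros * _TICKS_PER_SECOND[unit] // 1_000_000
```

With the default unit, `to_epoch_sketch("1970-01-01T01:00:00+01:00")` and `to_epoch_sketch(datetime(1970, 1, 1))` both map to the same instant, because the offset is converted to UTC and the naive value is read as UTC.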
124+
125+
## Schema evolution across files
126+
127+
When different files expose different columns — for example a new sensor field is introduced later — the loader takes the union of all FIELD columns and fills missing values with nulls. Numeric FIELD types are promoted following IoTDB's widening rules (`INT32 → INT64 → DOUBLE`, `INT32 → FLOAT → DOUBLE`).
128+
129+
```py
130+
>>> dataset = load_dataset("tsfile", data_files=["day1.tsfile", "day2.tsfile"])
131+
```
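
The two widening chains above can be read as a small lattice. The sketch below picks the narrowest type that both inputs widen to. Type names are plain strings and both function names are hypothetical. Resolving `INT64` with `FLOAT` to `DOUBLE` follows from the chains but is an inference, and raising `TypeError` for other mismatches is an assumption rather than documented loader behavior.

```python
from typing import Dict, Iterable

# Each type maps to the set of types it can be widened to (including itself).
_WIDENS_TO = {
    "INT32": {"INT32", "INT64", "FLOAT", "DOUBLE"},
    "INT64": {"INT64", "DOUBLE"},
    "FLOAT": {"FLOAT", "DOUBLE"},
    "DOUBLE": {"DOUBLE"},
}


def promote(a: str, b: str) -> str:
    """Return the narrowest type that both `a` and `b` can be widened to."""
    if a == b:
        return a
    if a not in _WIDENS_TO or b not in _WIDENS_TO:
        # Assumption: non-numeric mismatches are rejected.
        raise TypeError(f"cannot reconcile {a} and {b}")
    common = _WIDENS_TO[a] & _WIDENS_TO[b]
    # The narrowest candidate is the one that can itself widen to the most types.
    return max(common, key=lambda t: len(_WIDENS_TO[t]))


def union_field_types(schemas: Iterable[Dict[str, str]]) -> Dict[str, str]:
    """Union FIELD columns across files, promoting types that differ."""
    merged: Dict[str, str] = {}
    for schema in schemas:
        for name, dtype in schema.items():
            merged[name] = promote(merged[name], dtype) if name in merged else dtype
    return merged
```

A field present in only some files still appears in the merged schema; its values for the other files would be filled with nulls, as described above.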
132+
133+
## Handling unreadable files
134+
135+
By default, an unreadable or non-TsFile input raises an error. Set `on_bad_files` to `"warn"` to log and continue, or `"skip"` to silently drop the file.
136+
137+
```py
138+
>>> dataset = load_dataset("tsfile", data_files="data/*.tsfile", on_bad_files="skip")
139+
```
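
The three modes can be pictured with a short sketch. It assumes a caller-supplied `probe` callable that raises when a file cannot be opened as a TsFile; `keep_readable` and `probe` are hypothetical names, and the loader's actual error types and log messages may differ.

```python
import logging
from typing import Callable, Iterable, List

logger = logging.getLogger(__name__)


def keep_readable(
    paths: Iterable[str],
    probe: Callable[[str], object],
    on_bad_files: str = "error",
) -> List[str]:
    """Return the paths that `probe` accepts, following the `on_bad_files` mode."""
    if on_bad_files not in ("error", "warn", "skip"):
        raise ValueError(f"invalid on_bad_files: {on_bad_files!r}")
    kept: List[str] = []
    for path in paths:
        try:
            probe(path)
        except Exception as err:
            if on_bad_files == "error":
                raise  # default: surface the failure to the caller
            if on_bad_files == "warn":
                logger.warning("Skipping unreadable file %s: %s", path, err)
            continue  # "warn" and "skip" both drop the file
        kept.append(path)
    return kept
```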
140+
141+
## Timestamp unit and time zone
142+
143+
`timestamp_unit` (default `"ms"`, matching IoTDB) controls the resolution of the `time` column and the interpretation of integer time bounds. `timestamp_tz` attaches a time zone to the Arrow timestamp type; `None` (the default) yields a timezone-naive type.
144+
145+
```py
146+
>>> dataset = load_dataset(
147+
... "tsfile",
148+
... data_files="my_data.tsfile",
149+
... timestamp_unit="us",
150+
... timestamp_tz="UTC",
151+
... )
152+
```
153+
154+
## Memory and batching
155+
156+
Two parameters control memory usage:
157+
158+
- `input_batch_size` (default `65_536`) — maximum number of rows fetched per Arrow batch from `TsFileReader.query_table`. Bounds peak memory while streaming a single device.
159+
- `output_batch_size` (default `32`) — number of devices packed into each Arrow record batch yielded to the writer. Smaller values give more responsive progress reporting; larger values reduce per-batch overhead.
160+
161+
```py
162+
>>> dataset = load_dataset(
163+
... "tsfile",
164+
... data_files="large_data.tsfile",
165+
... input_batch_size=32_768,
166+
... output_batch_size=128,
167+
... )
168+
```
169+
170+
Peak memory is bounded by the payload of a single device across the split, not by the size of the split as a whole.
171+
172+
See [`~datasets.packaged_modules.tsfile.TsFileConfig`] for the full list of parameters.

src/datasets/packaged_modules/__init__.py

Lines changed: 3 additions & 0 deletions
@@ -22,6 +22,7 @@
2222
from .pdffolder import pdffolder
2323
from .sql import sql
2424
from .text import text
25+
from .tsfile import tsfile
2526
from .videofolder import videofolder
2627
from .webdataset import webdataset
2728
from .xml import xml
@@ -60,6 +61,7 @@ def _hash_python_lines(lines: list[str]) -> str:
6061
"hdf5": (hdf5.__name__, _hash_python_lines(inspect.getsource(hdf5).splitlines())),
6162
"eval": (eval.__name__, _hash_python_lines(inspect.getsource(eval).splitlines())),
6263
"lance": (lance.__name__, _hash_python_lines(inspect.getsource(lance).splitlines())),
64+
"tsfile": (tsfile.__name__, _hash_python_lines(inspect.getsource(tsfile).splitlines())),
6365
"iceberg": (iceberg.__name__, _hash_python_lines(inspect.getsource(iceberg).splitlines())),
6466
}
6567

@@ -96,6 +98,7 @@ def _hash_python_lines(lines: list[str]) -> str:
9698
".h5": ("hdf5", {}),
9799
".eval": ("eval", {}),
98100
".lance": ("lance", {}),
101+
".tsfile": ("tsfile", {}),
99102
}
100103
_EXTENSION_TO_MODULE.update({ext: ("imagefolder", {}) for ext in imagefolder.ImageFolder.EXTENSIONS})
101104
_EXTENSION_TO_MODULE.update({ext.upper(): ("imagefolder", {}) for ext in imagefolder.ImageFolder.EXTENSIONS})

src/datasets/packaged_modules/tsfile/__init__.py

Whitespace-only changes.
