Add Apache TsFile packaged module for time-series data by Young-Leo · Pull Request #2 · JackieTien97/datasets

Young-Leo · 2026-04-23T04:14:06Z

What does this PR do?

Adds native support for loading Apache TsFile (table model) as a packaged module, enabling users to load time-series data directly via load_dataset("tsfile", ...).

TsFile is a columnar file format designed specifically for time-series data, used by Apache IoTDB. Unlike general-purpose formats (CSV, Parquet), TsFile natively understands timestamps, device tags (TAG columns), and measurements (FIELD columns).

Usage

from datasets import load_dataset

# Basic loading
ds = load_dataset("tsfile", data_files="sensor_data.tsfile")

# Time-range filtering (pushed down to TsFile's internal time index)
ds = load_dataset("tsfile", data_files="sensor_data.tsfile",
                  start_time=1700000000000, end_time=1700086400000)

# Select specific table and columns
ds = load_dataset("tsfile", data_files="sensor_data.tsfile",
                  table_name="weather", columns=["temperature", "humidity"])

Features

Time-range query pushdown: start_time/end_time are pushed down to TsFile's internal time index — only relevant data blocks are read from disk.
Multi-table support: A single TsFile can contain multiple tables; select via table_name.
Column selection: Specify columns to skip schema inference entirely. Missing columns are filled with nulls.
Schema evolution: When loading multiple files with different columns, all columns are unioned automatically with null-fill for absent columns.
Streaming batch reads: Data is read in configurable batches (batch_size, default 100K rows) to control memory usage.
Bad file tolerance: on_bad_files="skip"|"warn"|"error" for robust batch processing.
Case-insensitive: Table and column names follow TsFile/IoTDB's case-insensitive convention.
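
The schema-evolution behavior described above (union all columns, null-fill the absent ones) can be illustrated with a minimal, dependency-free sketch. It is not the builder's code: the helper names `union_columns` and `null_fill` are hypothetical, and plain dicts of lists stand in for the Arrow tables the builder actually produces.

```python
from typing import Any, Dict, List

Batch = Dict[str, List[Any]]  # column name -> column values (stand-in for an Arrow table)


def union_columns(batches: List[Batch]) -> List[str]:
    """Return the union of column names across batches, in first-seen order."""
    seen: Dict[str, None] = {}
    for batch in batches:
        for name in batch:
            seen.setdefault(name, None)
    return list(seen)


def null_fill(batch: Batch, columns: List[str]) -> Batch:
    """Align a batch to `columns`, filling columns it lacks with None."""
    n_rows = len(next(iter(batch.values()), []))
    return {name: batch.get(name, [None] * n_rows) for name in columns}


# Two hypothetical files whose schemas differ.
file_a = {"time": [1, 2], "temperature": [20.5, 21.0]}
file_b = {"time": [3], "humidity": [0.4]}

columns = union_columns([file_a, file_b])
aligned = [null_fill(batch, columns) for batch in (file_a, file_b)]
```

After alignment every batch exposes the same columns, so the batches can be concatenated without per-file special cases.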

Changes

src/datasets/packaged_modules/tsfile/tsfile.py — new TsFileConfig and TsFile (ArrowBasedBuilder)
src/datasets/packaged_modules/tsfile/__init__.py — package init
src/datasets/packaged_modules/__init__.py — register module and .tsfile extension
tests/packaged_modules/test_tsfile.py — 17 tests covering config validation, full loading, column selection, time-range filtering, schema evolution, multi-table, and error handling
docs/source/about_dataset_load.mdx — add TsFile to builder list
docs/source/loading.mdx — add TsFile section
docs/source/tabular_load.mdx — add detailed TsFile guide
docs/source/package_reference/loading_methods.mdx — add autodoc entries

Dependencies

Requires tsfile (pip install tsfile). It is an optional dependency — imported lazily at runtime, same pattern as HDF5/Lance.
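
A lazy optional import of the kind mentioned above might look like the following sketch. The helper name `require_optional` and the error message are assumptions for illustration; the PR's actual import code is not shown on this page.

```python
import importlib
from types import ModuleType
from typing import Optional


def require_optional(module_name: str, pip_name: Optional[str] = None) -> ModuleType:
    """Import a module only when it is needed, with an actionable error if absent."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        hint = pip_name or module_name
        raise ImportError(
            f"'{module_name}' is required for this loader. Install it with: pip install {hint}"
        ) from err


# Demonstrated with a stdlib module so the sketch runs anywhere;
# a builder would call require_optional("tsfile") inside the method that reads files.
json_module = require_optional("json")
```

Deferring the import to the call site keeps `import datasets` working for users who never load a TsFile.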

- Add TsFileConfig with table_name, columns, start_time/end_time, batch_size, features, and on_bad_files parameters
- Implement TsFile ArrowBasedBuilder with streaming batch reads via tsfile.to_dataframe and schema inference via TsFileReader
- Support schema evolution across multiple files (union columns, fill missing with nulls)
- Skip file scanning when user specifies columns (zero or one file read)
- Register tsfile module and .tsfile extension in packaged_modules
- Add test file for TsFile builder

- …load.mdx: add TsFile to builder list and extension mention
- loading.mdx: add TsFile section with basic usage and time-range example
- tabular_load.mdx: add detailed TsFile section covering all parameters
- loading_methods.mdx: add TsFileConfig and TsFile autodoc entries

Copilot

Pull request overview

Adds a new packaged module to load Apache TsFile (table model) time-series data via load_dataset("tsfile", ...), along with documentation and tests.

Changes:

Introduces TsFileConfig + TsFile ArrowBasedBuilder for streaming TsFile reads with optional time-range filtering and column selection.
Registers the new tsfile packaged module and associates it with the .tsfile extension for auto-inference.
Adds docs sections + API reference entries and a dedicated test suite.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.

File	Description
`src/datasets/packaged_modules/tsfile/tsfile.py`	Implements the TsFile builder/config, schema inference, streaming reads, and bad-file handling.
`src/datasets/packaged_modules/tsfile/__init__.py`	Adds the packaged module package (currently empty).
`src/datasets/packaged_modules/__init__.py`	Registers `tsfile` in packaged modules and extension-to-module mapping.
`tests/packaged_modules/test_tsfile.py`	Adds end-to-end + config tests for TsFile loading behavior.
`docs/source/tabular_load.mdx`	Adds TsFile guidance to the tabular loading docs.
`docs/source/loading.mdx`	Adds TsFile to loading overview + dedicated TsFile subsection.
`docs/source/about_dataset_load.mdx`	Adds TsFile to the list of supported builders/extensions.
`docs/source/package_reference/loading_methods.mdx`	Adds autodoc entries for `TsFileConfig` and `TsFile`.

Copilot · 2026-04-23T06:36:28Z

+from datasets.packaged_modules.tsfile.tsfile import TsFileConfig
+
+
+tsfile = pytest.importorskip("tsfile")


pytest.importorskip("tsfile") will skip the entire test file in environments where tsfile isn’t installed (and tsfile doesn’t appear to be part of the repo’s tests extra). That means the new packaged module may get effectively zero CI coverage. Consider adding tsfile to the test dependencies and removing the unconditional skip (or gating it behind an explicit marker/CI job).

Suggested change

tsfile = pytest.importorskip("tsfile")

import tsfile # noqa: F401

Copilot · 2026-04-23T06:36:29Z

+                raise _MissingTableError(self._resolved_table_name, list(schemas))
+            table_schema = schemas[self._resolved_table_name]
+            return {
+                col.get_column_name()


_available_columns_in_file claims to return a lowercased set of column names, but it returns col.get_column_name() as-is. Since _requested_columns is lowercased, casing differences in TsFile schemas can cause valid requested columns to be treated as missing, defeating projection pushdown and potentially producing null-filled columns. Lowercase the returned names (and/or normalize schema names consistently) to match the rest of the builder.

Suggested change

col.get_column_name()

col.get_column_name().lower()
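
To make the failure mode concrete, here is a small sketch of case-insensitive column matching. The function name `split_requested` is hypothetical, and plain strings stand in for the names returned by `col.get_column_name()`.

```python
from typing import Iterable, List, Tuple


def split_requested(requested: Iterable[str], available: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split requested columns into (present, missing), comparing names case-insensitively."""
    available_lower = {name.lower() for name in available}
    present: List[str] = []
    missing: List[str] = []
    for name in requested:
        key = name.lower()
        (present if key in available_lower else missing).append(key)
    return present, missing


# The file stores "TEMPERATURE"; the user asks for "Temperature" and "pressure".
present, missing = split_requested(["Temperature", "pressure"], ["TEMPERATURE", "humidity"])
```

Without the `.lower()` on the available names, "temperature" would land in `missing` here and be null-filled even though the file contains it.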

Copilot · 2026-04-23T06:36:29Z

+                    for col in table_schema.get_columns():
+                        name = col.get_column_name()
+                        if col.get_category() == ColumnCategory.TIME:
+                            time_column_name = name
+                            continue
+                        merged_columns.setdefault(name, col)
+            except Exception as e:


Schema inference stores column names exactly as returned by col.get_column_name(), but _dataframe_to_arrow lowercases incoming DataFrame columns (df.rename(columns=str.lower)). If TsFile returns non-lowercased names, field.name in df.columns will miss and you’ll silently null-fill columns. Consider normalizing discovered time_column_name and merged_columns keys/names to lowercase when inferring features so the inferred Arrow schema matches the later lowercase normalization.

Copilot · 2026-04-23T06:36:29Z

+        if (
+            self.config.columns is not None
+            and self.config.features is not None
+            and set(self.config.columns) != set(self.config.features)
+        ):
+            raise ValueError(
+                "The columns and features argument must contain the same columns, but got ",
+                f"{self.config.columns} and {self.config.features}",
+            )


The _info validation requires set(columns) == set(features). For this builder, columns is documented/used as a subset of TAG/FIELD columns, while the time column is always included in the output schema. This makes it easy for a valid config (e.g. columns=["temperature"] with features containing {time, temperature}) to raise unexpectedly. Consider adjusting the validation to account for the implicit time column (or require/auto-inject time into columns when comparing).

Suggested change

        if (
            self.config.columns is not None
            and self.config.features is not None
            and set(self.config.columns) != set(self.config.features)
        ):
            raise ValueError(
                "The columns and features argument must contain the same columns, but got ",
                f"{self.config.columns} and {self.config.features}",
            )

        if self.config.columns is not None and self.config.features is not None:
            feature_names = set(self.config.features)
            feature_names.discard("time")
            if set(self.config.columns) != feature_names:
                raise ValueError(
                    "The columns and features argument must contain the same columns, "
                    "except that `features` may additionally include the implicit `time` column, but got "
                    f"{self.config.columns} and {self.config.features}",
                )

JackieTien97 · 2026-04-23T06:54:35Z

+            ColumnSchema("temperature", TSDataType.DOUBLE, ColumnCategory.FIELD),
+            ColumnSchema("humidity", TSDataType.DOUBLE, ColumnCategory.FIELD),


Add all supported types.

JackieTien97 · 2026-04-23T06:59:30Z

+        start_time (`int`, *optional*):
+            Inclusive lower bound for the timestamp range. Defaults to no lower
+            bound.
+        end_time (`int`, *optional*):
+            Inclusive upper bound for the timestamp range. Defaults to no upper
+            bound.


better be pa.timestamp?
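
For context, callers who hold `datetime` values have to convert them to integer bounds themselves when `start_time`/`end_time` are plain `int`s. A minimal sketch, assuming the bounds are epoch milliseconds (which the magnitudes in the usage example suggest) and that naive datetimes should be read as UTC; `to_epoch_ms` is a hypothetical helper, not part of the PR.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ms(dt: datetime) -> int:
    """Convert a datetime to integer milliseconds since the Unix epoch."""
    if dt.tzinfo is None:
        # Assumption: naive datetimes are interpreted as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    # Dividing timedeltas with // yields an exact int and avoids float rounding.
    return (dt - EPOCH) // timedelta(milliseconds=1)


start = to_epoch_ms(datetime(2023, 11, 14, 22, 13, 20, tzinfo=timezone.utc))
end = to_epoch_ms(datetime(2023, 11, 15, 22, 13, 20))  # naive, treated as UTC
```

Accepting a timestamp type directly, as the comment suggests, would move this conversion (and its time-zone decision) into the builder.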

- …onfig, and magic header validation
- Add _promote_tsdatatype() for IoTDB schema evolution type widening (INT32->INT64->DOUBLE, INT32->FLOAT->DOUBLE) across files
- Add timestamp_unit and timestamp_tz config options; map TIMESTAMP to pa.timestamp and DATE to pa.date32 instead of int64/string
- Accept pa.TimestampScalar for start_time/end_time with auto epoch conversion
- Pre-check TsFile 6-byte magic header to prevent segfault on corrupt files
- Scan ALL splits (not just the first) to build complete union schema
- Detect column name conflicts after case-folding
- Fix autodoc paths in docs (tsfile.tsfile -> tsfile)
- Expand test suite with comprehensive coverage for schema evolution, multi-file union, column projection, time filtering, bad files, edge values, and timestamp/timezone configuration
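
The type widening named in this commit message (INT32->INT64->DOUBLE, INT32->FLOAT->DOUBLE) can be sketched as a least-common-supertype lookup. This is an illustration only, not the PR's `_promote_tsdatatype()`: the function `promote` and the `WIDENS_TO` table are hypothetical, and type names are plain strings rather than `TSDataType` members.

```python
from typing import Dict, Set

# For each type, the set of types it can be widened to (including itself),
# derived from the two chains INT32->INT64->DOUBLE and INT32->FLOAT->DOUBLE.
WIDENS_TO: Dict[str, Set[str]] = {
    "INT32": {"INT32", "INT64", "FLOAT", "DOUBLE"},
    "INT64": {"INT64", "DOUBLE"},
    "FLOAT": {"FLOAT", "DOUBLE"},
    "DOUBLE": {"DOUBLE"},
}


def promote(a: str, b: str) -> str:
    """Return the narrowest type that both `a` and `b` can be widened to."""
    common = WIDENS_TO.get(a, {a}) & WIDENS_TO.get(b, {b})
    if not common:
        raise TypeError(f"No common type for {a} and {b}")
    # The narrowest candidate is the one that itself widens to the most types.
    return max(common, key=lambda t: len(WIDENS_TO.get(t, {t})))


merged = promote("INT32", "INT64")
```

Types outside the table (for example a text type) are only compatible with themselves in this sketch, so mixing them with numeric types raises `TypeError`.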

JackieTien97

Don't return the time column by default. If columns is specified, follow the columns definition strictly.

JackieTien97 · 2026-04-24T00:26:00Z

 * [`datasets.packaged_modules.parquet.Parquet`] for Parquet
 * [`datasets.packaged_modules.arrow.Arrow`] for Arrow (streaming file format)
 * [`datasets.packaged_modules.sql.Sql`] for SQL databases
+* [`datasets.packaged_modules.tsfile.TsFile`] for Apache TsFile (time-series data)


Suggested change

* [`datasets.packaged_modules.tsfile.TsFile`] for Apache TsFile (time-series data)

* [`datasets.packaged_modules.tsfile.TsFile`] for TsFile (time-series data)

Parquet and Arrow aren't prefixed with "Apache" either, so let's drop it here too.

JackieTien97 · 2026-04-24T00:27:26Z

+from datasets.packaged_modules.tsfile.tsfile import TsFileConfig
+
+
+tsfile = pytest.importorskip("tsfile")


This shouldn't be needed, should it? As far as I can see, nobody else added one.

Young-Leo added 2 commits April 23, 2026 11:31

JackieTien97 requested a review from Copilot April 23, 2026 06:31

Copilot AI reviewed Apr 23, 2026

JackieTien97 reviewed Apr 23, 2026

JackieTien97 requested changes Apr 24, 2026

Young-Leo wants to merge 3 commits into main from ly/tsfile
3 participants