Add Apache TsFile packaged module for time-series data #2

Open
Young-Leo wants to merge 3 commits into main from ly/tsfile

Conversation

@Young-Leo
Collaborator

What does this PR do?

Adds native support for loading Apache TsFile (table model) as a packaged module, enabling users to load time-series data directly via load_dataset("tsfile", ...).

TsFile is a columnar file format designed specifically for time-series data, used by Apache IoTDB. Unlike general-purpose formats (CSV, Parquet), TsFile natively understands timestamps, device tags (TAG columns), and measurements (FIELD columns).

Usage

from datasets import load_dataset

# Basic loading
ds = load_dataset("tsfile", data_files="sensor_data.tsfile")

# Time-range filtering (pushed down to TsFile's internal time index)
ds = load_dataset("tsfile", data_files="sensor_data.tsfile",
                  start_time=1700000000000, end_time=1700086400000)

# Select specific table and columns
ds = load_dataset("tsfile", data_files="sensor_data.tsfile",
                  table_name="weather", columns=["temperature", "humidity"])
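
The `start_time` and `end_time` values above look like epoch milliseconds (read that way, 1700000000000 is 2023-11-14 22:13:20 UTC). Assuming that unit, which the description does not state explicitly, the bounds can be derived from timezone-aware datetimes. `to_epoch_ms` is a hypothetical helper name, not part of this PR.

```python
from datetime import datetime, timedelta, timezone

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ms(dt: datetime) -> int:
    """Convert a timezone-aware datetime to integer milliseconds since the Unix epoch."""
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    # Dividing one timedelta by another yields an int and avoids float rounding.
    return (dt - _EPOCH) // timedelta(milliseconds=1)


start = to_epoch_ms(datetime(2023, 11, 14, 22, 13, 20, tzinfo=timezone.utc))
end = start + 24 * 60 * 60 * 1000  # exactly one day later
```

The resulting integers could then be passed as `start_time=start, end_time=end`.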

Features

  • Time-range query pushdown: start_time/end_time are pushed down to TsFile's internal time index — only relevant data blocks are read from disk.
  • Multi-table support: A single TsFile can contain multiple tables; select via table_name.
  • Column selection: Specify columns to skip schema inference entirely. Missing columns are filled with nulls.
  • Schema evolution: When loading multiple files with different columns, all columns are unioned automatically with null-fill for absent columns.
  • Streaming batch reads: Data is read in configurable batches (batch_size, default 100K rows) to control memory usage.
  • Bad file tolerance: on_bad_files="skip"|"warn"|"error" for robust batch processing.
  • Case-insensitive: Table and column names follow TsFile/IoTDB's case-insensitive convention.
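
The schema-evolution behavior listed above (union of columns, nulls for absent ones) can be illustrated without TsFile at all. The sketch below uses plain Python dicts as stand-ins for per-file batches. `union_batches` is a hypothetical helper, and the actual builder presumably operates on Arrow tables rather than dicts.

```python
def union_batches(batches):
    """Union column names across batches, filling absent columns with None.

    Each batch is a dict mapping column name -> list of values of equal length.
    """
    # Preserve first-seen column order across all batches.
    columns = []
    for batch in batches:
        for name in batch:
            if name not in columns:
                columns.append(name)

    merged = {name: [] for name in columns}
    for batch in batches:
        num_rows = len(next(iter(batch.values()))) if batch else 0
        for name in columns:
            # A column missing from this batch contributes one None per row.
            merged[name].extend(batch.get(name, [None] * num_rows))
    return merged


file_a = {"time": [1, 2], "temperature": [20.5, 21.0]}
file_b = {"time": [3], "humidity": [0.4]}
merged = union_batches([file_a, file_b])
```

Here `merged["temperature"]` ends with a `None` for the row that came from `file_b`, and `merged["humidity"]` starts with two `None` values for the rows from `file_a`.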

Changes

  • src/datasets/packaged_modules/tsfile/tsfile.py — new TsFileConfig and TsFile (ArrowBasedBuilder)
  • src/datasets/packaged_modules/tsfile/__init__.py — package init
  • src/datasets/packaged_modules/__init__.py — register module and .tsfile extension
  • tests/packaged_modules/test_tsfile.py — 17 tests covering config validation, full loading, column selection, time-range filtering, schema evolution, multi-table, and error handling
  • docs/source/about_dataset_load.mdx — add TsFile to builder list
  • docs/source/loading.mdx — add TsFile section
  • docs/source/tabular_load.mdx — add detailed TsFile guide
  • docs/source/package_reference/loading_methods.mdx — add autodoc entries

Dependencies

Requires tsfile (pip install tsfile). It is an optional dependency — imported lazily at runtime, same pattern as HDF5/Lance.
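
A lazy optional import of this kind is commonly written as a small guard that runs only when the builder is actually used. The following is a generic sketch of that pattern, not the code from this PR, and `require_optional` is a hypothetical helper name.

```python
import importlib


def require_optional(module_name: str, pip_name: str):
    """Import an optional dependency on first use, with an actionable error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"'{module_name}' is required for this loader. "
            f"Install it with: pip install {pip_name}"
        ) from err
```

A builder would call something like `require_optional("tsfile", "tsfile")` inside the method that reads files, so that importing `datasets` itself never fails when the package is absent.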

- Add TsFileConfig with table_name, columns, start_time/end_time,
  batch_size, features, and on_bad_files parameters
- Implement TsFile ArrowBasedBuilder with streaming batch reads via
  tsfile.to_dataframe and schema inference via TsFileReader
- Support schema evolution across multiple files (union columns,
  fill missing with nulls)
- Skip file scanning when user specifies columns (zero or one file read)
- Register tsfile module and .tsfile extension in packaged_modules
- Add test file for TsFile builder
…load.mdx: add TsFile to builder list and extension mention
- loading.mdx: add TsFile section with basic usage and time-range example
- tabular_load.mdx: add detailed TsFile section covering all parameters
- loading_methods.mdx: add TsFileConfig and TsFile autodoc entries
@JackieTien97 JackieTien97 requested a review from Copilot April 23, 2026 06:31

Copilot AI left a comment

Pull request overview

Adds a new packaged module to load Apache TsFile (table model) time-series data via load_dataset("tsfile", ...), along with documentation and tests.

Changes:

  • Introduces TsFileConfig + TsFile ArrowBasedBuilder for streaming TsFile reads with optional time-range filtering and column selection.
  • Registers the new tsfile packaged module and associates it with the .tsfile extension for auto-inference.
  • Adds docs sections + API reference entries and a dedicated test suite.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/datasets/packaged_modules/tsfile/tsfile.py | Implements the TsFile builder/config, schema inference, streaming reads, and bad-file handling. |
| src/datasets/packaged_modules/tsfile/__init__.py | Adds the packaged module package (currently empty). |
| src/datasets/packaged_modules/__init__.py | Registers tsfile in packaged modules and extension-to-module mapping. |
| tests/packaged_modules/test_tsfile.py | Adds end-to-end + config tests for TsFile loading behavior. |
| docs/source/tabular_load.mdx | Adds TsFile guidance to the tabular loading docs. |
| docs/source/loading.mdx | Adds TsFile to loading overview + dedicated TsFile subsection. |
| docs/source/about_dataset_load.mdx | Adds TsFile to the list of supported builders/extensions. |
| docs/source/package_reference/loading_methods.mdx | Adds autodoc entries for TsFileConfig and TsFile. |


from datasets.packaged_modules.tsfile.tsfile import TsFileConfig


tsfile = pytest.importorskip("tsfile")

Copilot AI Apr 23, 2026

pytest.importorskip("tsfile") will skip the entire test file in environments where tsfile isn’t installed (and tsfile doesn’t appear to be part of the repo’s tests extra). That means the new packaged module may get effectively zero CI coverage. Consider adding tsfile to the test dependencies and removing the unconditional skip (or gating it behind an explicit marker/CI job).

Suggested change
- tsfile = pytest.importorskip("tsfile")
+ import tsfile  # noqa: F401

raise _MissingTableError(self._resolved_table_name, list(schemas))
table_schema = schemas[self._resolved_table_name]
return {
col.get_column_name()

Copilot AI Apr 23, 2026

_available_columns_in_file claims to return a lowercased set of column names, but it returns col.get_column_name() as-is. Since _requested_columns is lowercased, casing differences in TsFile schemas can cause valid requested columns to be treated as missing, defeating projection pushdown and potentially producing null-filled columns. Lowercase the returned names (and/or normalize schema names consistently) to match the rest of the builder.

Suggested change
- col.get_column_name()
+ col.get_column_name().lower()
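
To make the failure mode concrete: if the requested names are lowercased but the file's names are not, a plain set difference reports false misses. The sketch below uses plain strings as stand-ins for TsFile column schemas, and `partition_requested` is a hypothetical helper rather than anything in the builder.

```python
def partition_requested(requested, available):
    """Split requested column names into (present, missing), ignoring case."""
    available_lower = {name.lower() for name in available}
    present = [c for c in requested if c.lower() in available_lower]
    missing = [c for c in requested if c.lower() not in available_lower]
    return present, missing


# Names as a file might report them, with mixed casing.
file_columns = ["Temperature", "Humidity"]
requested = ["temperature", "pressure"]

# Case-sensitive comparison wrongly flags "temperature" as missing.
naive_missing = sorted(set(requested) - set(file_columns))

# Case-insensitive comparison flags only the truly absent column.
present, missing = partition_requested(requested, file_columns)
```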

Comment on lines +190 to +196
    for col in table_schema.get_columns():
        name = col.get_column_name()
        if col.get_category() == ColumnCategory.TIME:
            time_column_name = name
            continue
        merged_columns.setdefault(name, col)
except Exception as e:

Copilot AI Apr 23, 2026

Schema inference stores column names exactly as returned by col.get_column_name(), but _dataframe_to_arrow lowercases incoming DataFrame columns (df.rename(columns=str.lower)). If TsFile returns non-lowercased names, field.name in df.columns will miss and you’ll silently null-fill columns. Consider normalizing discovered time_column_name and merged_columns keys/names to lowercase when inferring features so the inferred Arrow schema matches the later lowercase normalization.
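
One way to apply this normalization during schema inference is to lowercase names as they are collected and fail loudly when two distinct names collapse to the same key. This is a sketch under that assumption. `fold_column_names` is hypothetical and does not reproduce the builder's code.

```python
def fold_column_names(names):
    """Map lowercased name -> original name, rejecting case-fold collisions."""
    folded = {}
    for name in names:
        key = name.lower()
        if key in folded and folded[key] != name:
            raise ValueError(
                f"Column names {folded[key]!r} and {name!r} collide after lowercasing"
            )
        folded[key] = name
    return folded


folded = fold_column_names(["Time", "Temperature", "humidity"])
```

Using the lowercased keys for the inferred schema keeps it aligned with a DataFrame whose columns were renamed with `str.lower`.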

Comment on lines +91 to +99
if (
    self.config.columns is not None
    and self.config.features is not None
    and set(self.config.columns) != set(self.config.features)
):
    raise ValueError(
        "The columns and features argument must contain the same columns, but got ",
        f"{self.config.columns} and {self.config.features}",
    )

Copilot AI Apr 23, 2026

The _info validation requires set(columns) == set(features). For this builder, columns is documented/used as a subset of TAG/FIELD columns, while the time column is always included in the output schema. This makes it easy for a valid config (e.g. columns=["temperature"] with features containing {time, temperature}) to raise unexpectedly. Consider adjusting the validation to account for the implicit time column (or require/auto-inject time into columns when comparing).

Suggested change
- if (
-     self.config.columns is not None
-     and self.config.features is not None
-     and set(self.config.columns) != set(self.config.features)
- ):
-     raise ValueError(
-         "The columns and features argument must contain the same columns, but got ",
-         f"{self.config.columns} and {self.config.features}",
-     )
+ if self.config.columns is not None and self.config.features is not None:
+     feature_names = set(self.config.features)
+     feature_names.discard("time")
+     if set(self.config.columns) != feature_names:
+         raise ValueError(
+             "The columns and features argument must contain the same columns, "
+             "except that `features` may additionally include the implicit `time` column, but got "
+             f"{self.config.columns} and {self.config.features}",
+         )

Comment thread src/datasets/packaged_modules/tsfile/tsfile.py
Comment on lines +33 to +34
ColumnSchema("temperature", TSDataType.DOUBLE, ColumnCategory.FIELD),
ColumnSchema("humidity", TSDataType.DOUBLE, ColumnCategory.FIELD),
Owner

Add all supported types.

Comment on lines +46 to +51
start_time (`int`, *optional*):
Inclusive lower bound for the timestamp range. Defaults to no lower
bound.
end_time (`int`, *optional*):
Inclusive upper bound for the timestamp range. Defaults to no upper
bound.
Owner

Would pa.timestamp be a better type here?

…onfig, and magic header validation

- Add _promote_tsdatatype() for IoTDB schema evolution type widening
  (INT32->INT64->DOUBLE, INT32->FLOAT->DOUBLE) across files
- Add timestamp_unit and timestamp_tz config options; map TIMESTAMP to
  pa.timestamp and DATE to pa.date32 instead of int64/string
- Accept pa.TimestampScalar for start_time/end_time with auto epoch conversion
- Pre-check TsFile 6-byte magic header to prevent segfault on corrupt files
- Scan ALL splits (not just the first) to build complete union schema
- Detect column name conflicts after case-folding
- Fix autodoc paths in docs (tsfile.tsfile -> tsfile)
- Expand test suite with comprehensive coverage for schema evolution,
  multi-file union, column projection, time filtering, bad files, edge
  values, and timestamp/timezone configuration
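
The widening chains named in the first bullet (INT32->INT64->DOUBLE and INT32->FLOAT->DOUBLE) form a small lattice, so promotion amounts to finding the narrowest type that both inputs can widen to. The sketch below models types as plain strings rather than `TSDataType` members and is not the PR's `_promote_tsdatatype()`. In particular, resolving INT64 with FLOAT to DOUBLE is an inference from the two chains, not something the commit message states.

```python
# For each type, the set of types it can widen to (including itself).
_WIDENS_TO = {
    "INT32": {"INT32", "INT64", "FLOAT", "DOUBLE"},
    "INT64": {"INT64", "DOUBLE"},
    "FLOAT": {"FLOAT", "DOUBLE"},
    "DOUBLE": {"DOUBLE"},
}


def promote(a: str, b: str) -> str:
    """Return the narrowest type that both a and b can widen to."""
    if a == b:
        return a
    common = _WIDENS_TO.get(a, set()) & _WIDENS_TO.get(b, set())
    if not common:
        raise TypeError(f"Cannot reconcile column types {a} and {b}")
    # The narrowest candidate is the one that can itself still widen the most.
    return max(common, key=lambda t: len(_WIDENS_TO[t]))
```

Folding `promote` over the types a column takes across files (for example with `functools.reduce`) would give the single type to use in the unioned schema.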
Owner

@JackieTien97 JackieTien97 left a comment

Don't return the time column by default. If columns is specified, follow the columns definition strictly.

* [`datasets.packaged_modules.parquet.Parquet`] for Parquet
* [`datasets.packaged_modules.arrow.Arrow`] for Arrow (streaming file format)
* [`datasets.packaged_modules.sql.Sql`] for SQL databases
* [`datasets.packaged_modules.tsfile.TsFile`] for Apache TsFile (time-series data)
Owner

Suggested change
- * [`datasets.packaged_modules.tsfile.TsFile`] for Apache TsFile (time-series data)
+ * [`datasets.packaged_modules.tsfile.TsFile`] for TsFile (time-series data)

Parquet and Arrow don't carry the "Apache" prefix either, so let's drop it too.

from datasets.packaged_modules.tsfile.tsfile import TsFileConfig


tsfile = pytest.importorskip("tsfile")
Owner

This probably isn't needed? As far as I can see, nobody else added it.

3 participants