- Add TsFileConfig with table_name, columns, start_time/end_time, batch_size, features, and on_bad_files parameters
- Implement TsFile ArrowBasedBuilder with streaming batch reads via tsfile.to_dataframe and schema inference via TsFileReader
- Support schema evolution across multiple files (union columns, fill missing with nulls)
- Skip file scanning when user specifies columns (zero or one file read)
- Register tsfile module and .tsfile extension in packaged_modules
- Add test file for TsFile builder
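The schema-evolution item in the commit message above (union columns, fill missing with nulls) can be illustrated with a small sketch. It uses plain Python dictionaries in place of TsFile data and Arrow tables, so `union_columns`, `align_rows`, and the sample rows are hypothetical names chosen for illustration and are not the builder's actual code.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def union_columns(files: List[List[Row]]) -> List[str]:
    """Return every column seen in any file, in first-seen order."""
    seen: Dict[str, None] = {}
    for rows in files:
        for row in rows:
            for name in row:
                seen.setdefault(name, None)
    return list(seen)


def align_rows(rows: List[Row], columns: List[str]) -> List[Row]:
    """Project rows onto the union schema, filling absent columns with None."""
    return [{name: row.get(name) for name in columns} for row in rows]


# Two hypothetical "files" whose schemas each lack one column of the other.
file_a = [{"time": 1, "temperature": 20.5}]
file_b = [{"time": 2, "humidity": 0.4}]

columns = union_columns([file_a, file_b])
aligned = align_rows(file_a, columns) + align_rows(file_b, columns)
```

Every aligned row carries all three columns, with `None` standing in for the null that an Arrow-based builder would emit.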
…load.mdx: add TsFile to builder list and extension mention
- loading.mdx: add TsFile section with basic usage and time-range example
- tabular_load.mdx: add detailed TsFile section covering all parameters
- loading_methods.mdx: add TsFileConfig and TsFile autodoc entries
Pull request overview
Adds a new packaged module to load Apache TsFile (table model) time-series data via load_dataset("tsfile", ...), along with documentation and tests.
Changes:
- Introduces `TsFileConfig` + `TsFile` `ArrowBasedBuilder` for streaming TsFile reads with optional time-range filtering and column selection.
- Registers the new `tsfile` packaged module and associates it with the `.tsfile` extension for auto-inference.
- Adds docs sections + API reference entries and a dedicated test suite.
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `src/datasets/packaged_modules/tsfile/tsfile.py` | Implements the TsFile builder/config, schema inference, streaming reads, and bad-file handling. |
| `src/datasets/packaged_modules/tsfile/__init__.py` | Adds the packaged module package (currently empty). |
| `src/datasets/packaged_modules/__init__.py` | Registers tsfile in packaged modules and extension-to-module mapping. |
| `tests/packaged_modules/test_tsfile.py` | Adds end-to-end + config tests for TsFile loading behavior. |
| `docs/source/tabular_load.mdx` | Adds TsFile guidance to the tabular loading docs. |
| `docs/source/loading.mdx` | Adds TsFile to loading overview + dedicated TsFile subsection. |
| `docs/source/about_dataset_load.mdx` | Adds TsFile to the list of supported builders/extensions. |
| `docs/source/package_reference/loading_methods.mdx` | Adds autodoc entries for TsFileConfig and TsFile. |
    from datasets.packaged_modules.tsfile.tsfile import TsFileConfig


    tsfile = pytest.importorskip("tsfile")
pytest.importorskip("tsfile") will skip the entire test file in environments where tsfile isn’t installed (and tsfile doesn’t appear to be part of the repo’s tests extra). That means the new packaged module may get effectively zero CI coverage. Consider adding tsfile to the test dependencies and removing the unconditional skip (or gating it behind an explicit marker/CI job).
Suggested change:

    - tsfile = pytest.importorskip("tsfile")
    + import tsfile  # noqa: F401
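One way to gate the import behind an explicit CI job, as the comment above proposes, is sketched below using only the standard library. The helper name `import_optional` and the `REQUIRE_TSFILE` environment variable are hypothetical and not part of the repository; the idea is that a dedicated CI job sets the flag so a missing dependency fails loudly instead of silently skipping the tests.

```python
import importlib
import importlib.util
import os
from types import ModuleType
from typing import Optional


def import_optional(name: str, required: bool) -> Optional[ModuleType]:
    """Import `name` if it is installed; raise only when `required` is set."""
    if importlib.util.find_spec(name) is None:
        if required:
            raise ImportError(f"{name!r} is required in this environment but is not installed")
        return None
    return importlib.import_module(name)


# REQUIRE_TSFILE is a hypothetical flag that a dedicated CI job would set to "1".
tsfile_required = os.environ.get("REQUIRE_TSFILE") == "1"
```

In a test module, `import_optional("tsfile", tsfile_required)` could replace the unconditional skip, with `pytest.skip(..., allow_module_level=True)` called only when the helper returns `None`.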
            raise _MissingTableError(self._resolved_table_name, list(schemas))
        table_schema = schemas[self._resolved_table_name]
        return {
            col.get_column_name()
_available_columns_in_file claims to return a lowercased set of column names, but it returns col.get_column_name() as-is. Since _requested_columns is lowercased, casing differences in TsFile schemas can cause valid requested columns to be treated as missing, defeating projection pushdown and potentially producing null-filled columns. Lowercase the returned names (and/or normalize schema names consistently) to match the rest of the builder.
Suggested change:

    - col.get_column_name()
    + col.get_column_name().lower()
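The effect of the casing mismatch described above can be reproduced in isolation. The following is a minimal sketch, assuming a schema that reports mixed-case names; `missing_columns` is a hypothetical helper written for this illustration and does not exist in the builder.

```python
from typing import Iterable, Set


def missing_columns(requested: Iterable[str], available: Iterable[str], normalize: bool) -> Set[str]:
    """Return the requested columns (lowercased) that are absent from `available`."""
    requested_lower = {name.lower() for name in requested}
    # With normalize=False the schema names are compared as-is, as in the reviewed code.
    available_names = {name.lower() for name in available} if normalize else set(available)
    return requested_lower - available_names


schema_names = ["Temperature", "Humidity"]  # hypothetical mixed-case schema
requested = ["temperature"]

without_fix = missing_columns(requested, schema_names, normalize=False)
with_fix = missing_columns(requested, schema_names, normalize=True)
```

Without normalization the valid request is reported as missing, which is the path that leads to a null-filled column; with lowercasing on both sides nothing is reported missing.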
            for col in table_schema.get_columns():
                name = col.get_column_name()
                if col.get_category() == ColumnCategory.TIME:
                    time_column_name = name
                    continue
                merged_columns.setdefault(name, col)
        except Exception as e:
Schema inference stores column names exactly as returned by col.get_column_name(), but _dataframe_to_arrow lowercases incoming DataFrame columns (df.rename(columns=str.lower)). If TsFile returns non-lowercased names, field.name in df.columns will miss and you’ll silently null-fill columns. Consider normalizing discovered time_column_name and merged_columns keys/names to lowercase when inferring features so the inferred Arrow schema matches the later lowercase normalization.
    if (
        self.config.columns is not None
        and self.config.features is not None
        and set(self.config.columns) != set(self.config.features)
    ):
        raise ValueError(
            "The columns and features argument must contain the same columns, but got ",
            f"{self.config.columns} and {self.config.features}",
        )
The _info validation requires set(columns) == set(features). For this builder, columns is documented/used as a subset of TAG/FIELD columns, while the time column is always included in the output schema. This makes it easy for a valid config (e.g. columns=["temperature"] with features containing {time, temperature}) to raise unexpectedly. Consider adjusting the validation to account for the implicit time column (or require/auto-inject time into columns when comparing).
Suggested change:

    - if (
    -     self.config.columns is not None
    -     and self.config.features is not None
    -     and set(self.config.columns) != set(self.config.features)
    - ):
    -     raise ValueError(
    -         "The columns and features argument must contain the same columns, but got ",
    -         f"{self.config.columns} and {self.config.features}",
    -     )
    + if self.config.columns is not None and self.config.features is not None:
    +     feature_names = set(self.config.features)
    +     feature_names.discard("time")
    +     if set(self.config.columns) != feature_names:
    +         raise ValueError(
    +             "The columns and features argument must contain the same columns, "
    +             "except that `features` may additionally include the implicit `time` column, but got "
    +             f"{self.config.columns} and {self.config.features}",
    +         )
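A standalone variant of the suggested check makes it easy to try different inputs. This sketch differs from the suggestion in one respect: it ignores the time column on both sides, so `columns` may also list it explicitly. The function names and the `time_column` parameter are assumptions for illustration, not the builder's API.

```python
from typing import Iterable, Optional


def check_columns_match_features(
    columns: Optional[Iterable[str]],
    features: Optional[Iterable[str]],
    time_column: str = "time",  # assumed name of the implicit time column
) -> None:
    """Raise ValueError unless both arguments name the same non-time columns."""
    if columns is None or features is None:
        return
    # Ignore the time column on both sides so it may be listed or omitted freely.
    column_names = set(columns) - {time_column}
    feature_names = set(features) - {time_column}
    if column_names != feature_names:
        raise ValueError(
            "The columns and features arguments must contain the same columns "
            f"(ignoring `{time_column}`), but got {sorted(column_names)} and {sorted(feature_names)}"
        )


def is_valid(columns: Optional[Iterable[str]], features: Optional[Iterable[str]]) -> bool:
    """Convenience wrapper that reports the check result as a boolean."""
    try:
        check_columns_match_features(columns, features)
    except ValueError:
        return False
    return True
```

With this variant, `columns=["temperature"]` together with features for `time` and `temperature` passes, while a genuine mismatch such as `temperature` versus `humidity` still raises.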
    ColumnSchema("temperature", TSDataType.DOUBLE, ColumnCategory.FIELD),
    ColumnSchema("humidity", TSDataType.DOUBLE, ColumnCategory.FIELD),

    start_time (`int`, *optional*):
        Inclusive lower bound for the timestamp range. Defaults to no lower
        bound.
    end_time (`int`, *optional*):
        Inclusive upper bound for the timestamp range. Defaults to no upper
        bound.
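The docstring above describes both bounds as inclusive and optional. The semantics can be sketched as a plain predicate; this only illustrates the documented behavior on in-memory integers and is not how the builder applies the filter. `in_time_range` is a hypothetical name.

```python
from typing import Optional


def in_time_range(timestamp: int, start_time: Optional[int] = None, end_time: Optional[int] = None) -> bool:
    """Return True if `timestamp` lies within the inclusive [start_time, end_time] range."""
    if start_time is not None and timestamp < start_time:
        return False
    if end_time is not None and timestamp > end_time:
        return False
    return True


timestamps = [100, 200, 300, 400]
# Both endpoints are kept because the bounds are inclusive.
selected = [t for t in timestamps if in_time_range(t, start_time=200, end_time=300)]
# Omitting a bound leaves that side of the range open.
from_300 = [t for t in timestamps if in_time_range(t, start_time=300)]
```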
…onfig, and magic header validation
- Add _promote_tsdatatype() for IoTDB schema evolution type widening (INT32->INT64->DOUBLE, INT32->FLOAT->DOUBLE) across files
- Add timestamp_unit and timestamp_tz config options; map TIMESTAMP to pa.timestamp and DATE to pa.date32 instead of int64/string
- Accept pa.TimestampScalar for start_time/end_time with auto epoch conversion
- Pre-check TsFile 6-byte magic header to prevent segfault on corrupt files
- Scan ALL splits (not just the first) to build complete union schema
- Detect column name conflicts after case-folding
- Fix autodoc paths in docs (tsfile.tsfile -> tsfile)
- Expand test suite with comprehensive coverage for schema evolution, multi-file union, column projection, time filtering, bad files, edge values, and timestamp/timezone configuration
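The two widening chains named in the commit message (INT32->INT64->DOUBLE and INT32->FLOAT->DOUBLE) can be modeled as a small lattice in which two types promote to their narrowest common widening. The sketch below uses plain strings in place of `TSDataType` members; `promote` is a hypothetical stand-in for `_promote_tsdatatype()`, whose real implementation is not shown in this conversation. That INT64 combined with FLOAT yields DOUBLE follows from the two chains and is an assumption here.

```python
from typing import Dict, List, Set

# Each type maps to the set of types it can be widened to, including itself.
WIDENS_TO: Dict[str, Set[str]] = {
    "INT32": {"INT32", "INT64", "FLOAT", "DOUBLE"},
    "INT64": {"INT64", "DOUBLE"},
    "FLOAT": {"FLOAT", "DOUBLE"},
    "DOUBLE": {"DOUBLE"},
}
# Used to pick the narrowest type among the common widenings.
NARROWEST_FIRST: List[str] = ["INT32", "INT64", "FLOAT", "DOUBLE"]


def promote(left: str, right: str) -> str:
    """Return the narrowest type that both `left` and `right` can be widened to."""
    if left == right:
        return left
    if left not in WIDENS_TO or right not in WIDENS_TO:
        raise TypeError(f"Cannot reconcile column types {left} and {right}")
    common = WIDENS_TO[left] & WIDENS_TO[right]  # always contains DOUBLE
    return min(common, key=NARROWEST_FIRST.index)
```

Types outside the numeric chains, such as a text type, are only compatible with themselves in this sketch.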
JackieTien97 left a comment
Don't return the time column by default; if columns is specified, strictly follow the definition of columns.
    * [`datasets.packaged_modules.parquet.Parquet`] for Parquet
    * [`datasets.packaged_modules.arrow.Arrow`] for Arrow (streaming file format)
    * [`datasets.packaged_modules.sql.Sql`] for SQL databases
    * [`datasets.packaged_modules.tsfile.TsFile`] for Apache TsFile (time-series data)
Suggested change:

    - * [`datasets.packaged_modules.tsfile.TsFile`] for Apache TsFile (time-series data)
    + * [`datasets.packaged_modules.tsfile.TsFile`] for TsFile (time-series data)
Parquet and Arrow don't carry the "Apache" prefix either, so let's drop it here too.
What does this PR do?
Adds native support for loading Apache TsFile (table model) as a packaged module, enabling users to load time-series data directly via `load_dataset("tsfile", ...)`.

TsFile is a columnar file format designed specifically for time-series data, used by Apache IoTDB. Unlike general-purpose formats (CSV, Parquet), TsFile natively understands timestamps, device tags (TAG columns), and measurements (FIELD columns).
Usage
Features
- `start_time`/`end_time` are pushed down to TsFile's internal time index, so only relevant data blocks are read from disk.
- … `table_name`.
- … `columns` to skip schema inference entirely. Missing columns are filled with nulls.
- … (`batch_size`, default 100K rows) to control memory usage.
- `on_bad_files="skip"|"warn"|"error"` for robust batch processing.

Changes
- `src/datasets/packaged_modules/tsfile/tsfile.py`: new `TsFileConfig` and `TsFile(ArrowBasedBuilder)`
- `src/datasets/packaged_modules/tsfile/__init__.py`: package init
- `src/datasets/packaged_modules/__init__.py`: register module and `.tsfile` extension
- `tests/packaged_modules/test_tsfile.py`: 17 tests covering config validation, full loading, column selection, time-range filtering, schema evolution, multi-table, and error handling
- `docs/source/about_dataset_load.mdx`: add TsFile to builder list
- `docs/source/loading.mdx`: add TsFile section
- `docs/source/tabular_load.mdx`: add detailed TsFile guide
- `docs/source/package_reference/loading_methods.mdx`: add autodoc entries

Dependencies
Requires tsfile (`pip install tsfile`). It is an optional dependency, imported lazily at runtime, following the same pattern as HDF5/Lance.
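The `on_bad_files="skip"|"warn"|"error"` option listed under Features can be pictured as a policy wrapped around a per-file reader. The sketch below is an assumption about the semantics ("skip" drops the file silently, "warn" logs and drops it, "error" re-raises); `read_all` and `fake_reader` are hypothetical names and do not come from the PR.

```python
import logging
from typing import Callable, Iterable, Iterator, TypeVar

logger = logging.getLogger(__name__)
T = TypeVar("T")


def read_all(paths: Iterable[str], read_one: Callable[[str], T], on_bad_files: str = "error") -> Iterator[T]:
    """Yield the result of `read_one` for each path, applying the bad-file policy."""
    if on_bad_files not in ("skip", "warn", "error"):
        raise ValueError(f"Unknown on_bad_files value: {on_bad_files!r}")
    for path in paths:
        try:
            result = read_one(path)
        except Exception as exc:
            if on_bad_files == "error":
                raise
            if on_bad_files == "warn":
                logger.warning("Skipping bad file %s: %s", path, exc)
            continue  # both "skip" and "warn" move on to the next file
        yield result


def fake_reader(path: str) -> str:
    """Stand-in reader that fails on paths ending in '.bad'."""
    if path.endswith(".bad"):
        raise OSError("corrupt file")
    return path.upper()


loaded = list(read_all(["a.tsfile", "b.bad", "c.tsfile"], fake_reader, on_bad_files="skip"))
```

Because `read_all` is a generator, an invalid policy value is only reported once iteration starts.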