| 1 | +# Load TsFile data |
| 2 | + |
| 3 | +[TsFile](https://tsfile.apache.org/) is a columnar file format designed for time-series data and used as the native storage layer of [Apache IoTDB](https://iotdb.apache.org/). Compared with general-purpose columnar formats such as Parquet, TsFile is aware of the time-series data model (timestamps, devices, and measurements) and maintains an internal time index that enables time-range pruning without scanning entire files. |
| 4 | + |
| 5 | +This loader is provided as a separate guide because it does not follow the usual one-row-per-record tabular convention: each output row corresponds to one *device*, and per-measurement values are returned as Arrow `list<...>` columns. The mapping is described in detail below. |
| 6 | + |
| 7 | +## Installation |
| 8 | + |
| 9 | +The loader depends on the [`tsfile`](https://pypi.org/project/tsfile/) Python package: |
| 10 | + |
| 11 | +```bash |
| 12 | +pip install "tsfile>=2.3.0" |
| 13 | +``` |
| 14 | + |
| 15 | +## Data model and output layout |
| 16 | + |
| 17 | +The loader follows the TsFile *table model*. Each table column is one of: |
| 18 | + |
| 19 | +- **TAG** — a string-typed identifier; the tuple of TAG values uniquely identifies a *device* (i.e. a single time-series source). |
| 20 | +- **FIELD** — a measurement whose value evolves over time. |
| 21 | +- **TIME** — the timestamp column, named `time` by default. |
| 22 | + |
| 23 | +The loader emits one dataset row per device. Within a row, the `time` column and every FIELD column are Arrow `list<...>` columns containing that device's full time series, sorted in ascending time order. TAG columns appear as scalar `string` columns. |
| 24 | + |
| 25 | +Concretely, the output schema has the form: |
| 26 | + |
| 27 | +```text |
| 28 | +<tag_1>: string |
| 29 | +<tag_2>: string # one column per TAG |
| 30 | +... |
| 31 | +time: list<timestamp[unit, tz]> |
| 32 | +<field_1>: list<original_type> # one column per FIELD |
| 33 | +<field_2>: list<original_type> |
| 34 | +... |
| 35 | +``` |
| 36 | + |
| 37 | +When the same device appears in multiple input files of a split, its per-file chunks are concatenated and sorted by timestamp before being emitted as a single row. Duplicate timestamps for the same device raise `ValueError`. |
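The merge rule above can be illustrated with a small pure-Python sketch. This is not the loader's implementation: `merge_device_chunks` and the `(timestamp, value)` chunk layout are hypothetical names chosen for illustration. It only shows the described behaviour: concatenate, sort by timestamp, and raise `ValueError` on duplicates.

```python
from typing import Iterable, List, Tuple

# Hypothetical layout (an assumption): each chunk is the list of
# (timestamp, value) pairs read from one file for a single device.
Chunk = List[Tuple[int, float]]


def merge_device_chunks(chunks: Iterable[Chunk]) -> Tuple[List[int], List[float]]:
    """Concatenate per-file chunks, sort by timestamp, and reject duplicates."""
    points = [point for chunk in chunks for point in chunk]
    points.sort(key=lambda point: point[0])

    # After sorting, duplicate timestamps are adjacent, so one pass finds them.
    for (prev_ts, _), (curr_ts, _) in zip(points, points[1:]):
        if prev_ts == curr_ts:
            raise ValueError(f"duplicate timestamp {curr_ts} for the same device")

    times = [ts for ts, _ in points]
    values = [value for _, value in points]
    return times, values


file_a = [(3, 30.0), (1, 10.0)]  # chunk from the first file
file_b = [(2, 20.0)]             # chunk from the second file
times, values = merge_device_chunks([file_a, file_b])
print(times, values)  # [1, 2, 3] [10.0, 20.0, 30.0]
```

Passing two chunks that share a timestamp, such as `[(1, 1.0)]` and `[(1, 2.0)]`, raises `ValueError` in this sketch.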
| 38 | + |
| 39 | +## Basic usage |
| 40 | + |
| 41 | +Load a single TsFile: |
| 42 | + |
| 43 | +```py |
| 44 | +>>> from datasets import load_dataset |
| 45 | +>>> dataset = load_dataset("tsfile", data_files="my_data.tsfile") |
| 46 | +``` |
| 47 | + |
| 48 | +Map files to splits explicitly: |
| 49 | + |
| 50 | +```py |
| 51 | +>>> dataset = load_dataset( |
| 52 | +... "tsfile", |
| 53 | +... data_files={"train": "train_data.tsfile", "test": "test_data.tsfile"}, |
| 54 | +... ) |
| 55 | +``` |
| 56 | + |
| 57 | +## Example dataset on the Hub |
| 58 | + |
| 59 | +A ready-to-use example is available at [`tsfile/lotsa_data`](https://huggingface.co/datasets/tsfile/lotsa_data). Because `.tsfile` files are recognized automatically, you can load it by repository id without specifying `data_files`: |
| 60 | + |
| 61 | +```py |
| 62 | +>>> from datasets import load_dataset |
| 63 | +>>> dataset = load_dataset("tsfile/lotsa_data") |
| 64 | +>>> dataset |
| 65 | +DatasetDict({ |
| 66 | + train: Dataset({ |
| 67 | + features: ['timeseries_id', 'time', 'value'], |
| 68 | + num_rows: 91 |
| 69 | + }) |
| 70 | +}) |
| 71 | +``` |
| 72 | + |
| 73 | +Each row is one device. The TAG column `timeseries_id` identifies the device, while `time` and `value` are `list<...>` columns holding that device's full series: |
| 74 | + |
| 75 | +```py |
| 76 | +>>> row = dataset["train"][0] |
| 77 | +>>> row["timeseries_id"] |
| 78 | +'Bear_assembly_Angel' |
| 79 | +>>> len(row["time"]), len(row["value"]) |
| 80 | +(8760, 8760) |
| 81 | +>>> row["time"][:3] |
| 82 | +[datetime.datetime(2017, 1, 1, 0, 0), datetime.datetime(2017, 1, 1, 1, 0), datetime.datetime(2017, 1, 1, 2, 0)] |
| 83 | +``` |
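If downstream code expects the conventional one-row-per-timestamp layout, a device row can be flattened by hand. The sketch below uses a stand-in dictionary with made-up values instead of a real dataset row, and `explode_row` is a hypothetical helper, not part of `datasets`.

```python
from datetime import datetime, timedelta

# Stand-in for one dataset row; the values are invented for illustration.
start = datetime(2017, 1, 1)
row = {
    "timeseries_id": "Bear_assembly_Angel",
    "time": [start + timedelta(hours=i) for i in range(3)],
    "value": [1.0, 2.0, 3.0],
}


def explode_row(row, tag_columns, time_column="time"):
    """Yield one flat record per timestamp from a one-row-per-device record."""
    # Every non-TAG column is assumed to be a list of the same length as `time`.
    list_columns = [name for name in row if name not in tag_columns]
    tags = {name: row[name] for name in tag_columns}
    for i in range(len(row[time_column])):
        record = dict(tags)  # repeat the scalar TAG values on every record
        for name in list_columns:
            record[name] = row[name][i]
        yield record


records = list(explode_row(row, tag_columns=["timeseries_id"]))
print(len(records))
print(records[1])
```

Each yielded record repeats the TAG values and carries a single timestamp with its matching FIELD values.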
| 84 | + |
| 85 | +## Selecting a table |
| 86 | + |
| 87 | +A TsFile can contain multiple tables. When `table_name` is omitted, the first table found in the first valid file is used. Table name lookups are case-insensitive. |
| 88 | + |
| 89 | +```py |
| 90 | +>>> dataset = load_dataset("tsfile", data_files="my_data.tsfile", table_name="sensor_data") |
| 91 | +``` |
| 92 | + |
| 93 | +## Selecting columns |
| 94 | + |
| 95 | +`columns` restricts which FIELD columns are read. The TAG columns and the `time` column are always returned because they identify the device and its timeline. Names in `columns` that refer to a TAG or to the `time` column are silently ignored (those columns are still emitted, exactly once); names that match no FIELD in any file produce all-null list columns. |
| 96 | + |
| 97 | +```py |
| 98 | +>>> dataset = load_dataset( |
| 99 | +... "tsfile", |
| 100 | +... data_files="my_data.tsfile", |
| 101 | +... columns=["temperature", "humidity"], |
| 102 | +... ) |
| 103 | +``` |
| 104 | + |
| 105 | +## Filtering by time range |
| 106 | + |
| 107 | +`start_time` and `end_time` are inclusive bounds; either may be omitted. The bounds are pushed down to TsFile's internal time index, so only the matching data blocks are read from disk. Both bounds accept any of: |
| 108 | + |
| 109 | +- `int` — raw epoch in `timestamp_unit` (default milliseconds); |
| 110 | +- `datetime.datetime` — naive values are interpreted as UTC, tz-aware values are converted to UTC; |
| 111 | +- `datetime.date`; |
| 112 | +- ISO-8601 `str`, e.g. `"2024-01-01T00:00:00"`; |
| 113 | +- `pyarrow.TimestampScalar`. |
| 114 | + |
| 115 | +```py |
| 116 | +>>> from datetime import datetime |
| 117 | +>>> dataset = load_dataset( |
| 118 | +... "tsfile", |
| 119 | +... data_files="my_data.tsfile", |
| 120 | +... start_time=datetime(2023, 11, 14), |
| 121 | +... end_time="2023-11-15T00:00:00", |
| 122 | +... ) |
| 123 | +``` |
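To make the accepted bound types concrete, here is a rough sketch of how such values could be normalized to epoch milliseconds. It is not the loader's code: `to_epoch_ms` is a hypothetical helper. Treating a `datetime.date` as midnight UTC and a naive ISO-8601 string as UTC are assumptions made for this example. `pyarrow.TimestampScalar` is left out to keep the sketch dependency-free.

```python
from datetime import date, datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ms(bound):
    """Normalize a time bound to integer milliseconds since the Unix epoch."""
    if isinstance(bound, bool):
        raise TypeError("bool is not a valid time bound")
    if isinstance(bound, int):
        return bound  # already a raw epoch value in the configured unit
    if isinstance(bound, str):
        bound = datetime.fromisoformat(bound)
    elif isinstance(bound, date) and not isinstance(bound, datetime):
        # Assumption: a plain date means midnight UTC on that day.
        bound = datetime(bound.year, bound.month, bound.day)
    if isinstance(bound, datetime):
        if bound.tzinfo is None:
            bound = bound.replace(tzinfo=timezone.utc)  # naive values are UTC
        # Subtracting two aware datetimes accounts for the UTC offset.
        return (bound - EPOCH) // timedelta(milliseconds=1)
    raise TypeError(f"unsupported time bound: {type(bound).__name__}")


print(to_epoch_ms(datetime(2023, 11, 14)))
print(to_epoch_ms("2023-11-15T00:00:00"))
```

Integer division by a `timedelta` keeps the arithmetic exact and avoids floating-point rounding on large epoch values.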
| 124 | + |
| 125 | +## Schema evolution across files |
| 126 | + |
| 127 | +When different files expose different columns — for example a new sensor field is introduced later — the loader takes the union of all FIELD columns and fills missing values with nulls. Numeric FIELD types are promoted following IoTDB's widening rules (`INT32 → INT64 → DOUBLE`, `INT32 → FLOAT → DOUBLE`). |
| 128 | + |
| 129 | +```py |
| 130 | +>>> dataset = load_dataset("tsfile", data_files=["day1.tsfile", "day2.tsfile"]) |
| 131 | +``` |
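The widening rules can be read as a small graph of allowed promotions. The sketch below encodes only the two chains quoted above and picks the narrowest type that both inputs can widen to. Mapping `INT64` combined with `FLOAT` to `DOUBLE` follows from that graph but is an assumption about the loader's behaviour. `promote` is a hypothetical helper.

```python
# Direct widening edges taken from the two chains quoted in the text.
WIDENS_TO = {
    "INT32": ("INT64", "FLOAT"),
    "INT64": ("DOUBLE",),
    "FLOAT": ("DOUBLE",),
    "DOUBLE": (),
}


def reachable(type_name):
    """Return every type that `type_name` can widen to, including itself."""
    seen = {type_name}
    stack = [type_name]
    while stack:
        for wider in WIDENS_TO[stack.pop()]:
            if wider not in seen:
                seen.add(wider)
                stack.append(wider)
    return seen


def promote(left, right):
    """Return the narrowest type that both inputs can be widened to."""
    common = reachable(left) & reachable(right)
    # The narrowest common type is the one that can still widen the furthest.
    return max(common, key=lambda name: len(reachable(name)))


print(promote("INT32", "INT64"))  # INT64
print(promote("INT32", "FLOAT"))  # FLOAT
print(promote("INT64", "FLOAT"))  # DOUBLE (assumption, see above)
```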
| 132 | + |
| 133 | +## Handling unreadable files |
| 134 | + |
| 135 | +By default, an unreadable or non-TsFile input raises an error. Set `on_bad_files` to `"warn"` to log and continue, or `"skip"` to silently drop the file. |
| 136 | + |
| 137 | +```py |
| 138 | +>>> dataset = load_dataset("tsfile", data_files="data/*.tsfile", on_bad_files="skip") |
| 139 | +``` |
| 140 | + |
| 141 | +## Timestamp unit and time zone |
| 142 | + |
| 143 | +`timestamp_unit` (default `"ms"`, matching IoTDB) controls the resolution of the `time` column and the interpretation of integer time bounds. `timestamp_tz` attaches a time zone to the Arrow timestamp type; `None` (the default) yields a timezone-naive type. |
| 144 | + |
| 145 | +```py |
| 146 | +>>> dataset = load_dataset( |
| 147 | +... "tsfile", |
| 148 | +... data_files="my_data.tsfile", |
| 149 | +... timestamp_unit="us", |
| 150 | +... timestamp_tz="UTC", |
| 151 | +... ) |
| 152 | +``` |
| 153 | + |
| 154 | +## Memory and batching |
| 155 | + |
| 156 | +Two parameters control memory usage: |
| 157 | + |
| 158 | +- `input_batch_size` (default `65_536`) — maximum number of rows fetched per Arrow batch from `TsFileReader.query_table`. Bounds peak memory while streaming a single device. |
| 159 | +- `output_batch_size` (default `32`) — number of devices packed into each Arrow record batch yielded to the writer. Smaller values give more responsive progress reporting; larger values reduce per-batch overhead. |
| 160 | + |
| 161 | +```py |
| 162 | +>>> dataset = load_dataset( |
| 163 | +... "tsfile", |
| 164 | +... data_files="large_data.tsfile", |
| 165 | +... input_batch_size=32_768, |
| 166 | +... output_batch_size=128, |
| 167 | +... ) |
| 168 | +``` |
| 169 | + |
| 170 | +Peak memory is bounded by the data of a single device (gathered across all files of the split), not by the size of the split as a whole. |
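As a rough illustration of what `output_batch_size` controls, the generator below groups an iterable of devices into fixed-size batches while holding only one batch in memory at a time. It is a generic chunking sketch, not the loader's code, and `batched_devices` is a hypothetical name.

```python
from itertools import islice


def batched_devices(devices, output_batch_size=32):
    """Yield lists of at most `output_batch_size` devices, one batch at a time."""
    iterator = iter(devices)
    while True:
        batch = list(islice(iterator, output_batch_size))
        if not batch:
            return  # the input is exhausted
        yield batch


# 70 placeholder devices split into batches of 32: two full batches and a remainder.
sizes = [len(batch) for batch in batched_devices(range(70), output_batch_size=32)]
print(sizes)  # [32, 32, 6]
```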
| 171 | + |
| 172 | +See [`~datasets.packaged_modules.tsfile.TsFileConfig`] for the full list of parameters. |