From 732de2e57a4cad09ad213e16266faa5e2d9f657a Mon Sep 17 00:00:00 2001
From: ulises-jeremias
Date: Fri, 5 Jun 2026 02:00:59 -0300
Subject: [PATCH 1/3] docs(contrib): propose external industrial dataset guide

Signed-off-by: ulises-jeremias
---
 README.md                            |  1 +
 docs/external-industrial-datasets.md | 58 ++++++++++++++++++++++++++++
 2 files changed, 59 insertions(+)
 create mode 100644 docs/external-industrial-datasets.md

diff --git a/README.md b/README.md
index a9491d14..7337a605 100644
--- a/README.md
+++ b/README.md
@@ -273,6 +273,7 @@ We are expanding **AssetOpsBench** to cover a broader range of industrial challe
 
 1. **Define** your scenario following our [Utterance Guideline](docs/guideline/utterance_design_guideline.md) and [Ground Truth Guideline](docs/guideline/ground_truth_design_guideline.md)
 2. **Explore** the [Hugging Face dataset](https://huggingface.co/datasets/ibm-research/AssetOpsBench) for examples
+   - For external public sources, see the contributor proposal in [External Industrial Dataset Guide](docs/external-industrial-datasets.md)
 3. **Submit** a Pull Request or open an [Issue](https://github.com/IBM/AssetOpsBench/issues) with the tag `new-scenario`
 4. **Contact us** with questions:
    - Dhaval Patel — [pateldha@us.ibm.com](mailto:pateldha@us.ibm.com)
diff --git a/docs/external-industrial-datasets.md b/docs/external-industrial-datasets.md
new file mode 100644
index 00000000..0eb4ae6e
--- /dev/null
+++ b/docs/external-industrial-datasets.md
@@ -0,0 +1,58 @@
+# External Industrial Dataset Guide (Contributor Proposal)
+
+This page is a contributor-oriented proposal to help newcomers discover public industrial datasets that may be useful for scenario design and benchmark extensions.
+
+It does **not** change AssetOpsBench scoring or baseline definitions. It is a reference map for dataset discovery and adaptation planning.
+
+## Quick selection criteria
+
+Before using any external dataset:
+
+1. Verify license terms and redistribution constraints.
+2. 
Confirm no sensitive telemetry, secrets, or personally identifying data are included.
+3. Keep provenance metadata (`source`, `version/date`, `transform script`) with every derived artifact.
+4. Prefer datasets that can be mapped to one or more existing AssetOpsBench domains (`iot`, `wo`, `vibration`, `tsfm`, `fmsr`).
+
+## Starter dataset references
+
+| Dataset / index | Primary focus | AssetOpsBench fit | Notes |
+| --- | --- | --- | --- |
+| [awesome-industrial-datasets](https://github.com/jonathanwvd/awesome-industrial-datasets/tree/master) | Curated index (multiple domains) | Discovery for all domains | Useful first stop; each linked dataset has its own license and usage terms. |
+| [SWaT (Secure Water Treatment)](https://www.kaggle.com/datasets/vishala28/swat-dataset-secure-water-treatment-system) | Water-treatment process telemetry, attack/anomaly traces | `iot`, `tsfm`, anomaly scenarios | Commonly used for anomaly-detection tasks; verify host terms and citation requirements. |
+| [NASA C-MAPSS](https://data.nasa.gov/dataset/C-MAPSS-Aircraft-Engine-Simulator-Data/xaut-bemq) | Turbofan degradation / RUL | `tsfm`, prognostics scenarios | Good candidate for PHM and RUL-style benchmark tasks. |
+| [Case Western Reserve Bearing Data Center](https://engineering.case.edu/bearingdatacenter/welcome) | Bearing vibration fault data | `vibration` | Strong fit for spectral diagnosis and fault-classification tasks. |
+| [Paderborn University Bearing Dataset](https://mb.uni-paderborn.de/kat/forschung/datacenter/bearing-datacenter) | Rolling-bearing fault experiments | `vibration`, `tsfm` | Useful to cross-check bearing-fault robustness across machines/loads. 
|
+
+## Mapping checklist to AssetOpsBench schema
+
+When preparing scenarios from an external source, define these fields early:
+
+- `asset_id` and `site_name` strategy (stable IDs, no ambiguous aliases)
+- timestamp normalization (timezone, granularity, ISO format)
+- sensor naming map (raw column names to scenario-facing names)
+- expected outputs in `characteristic_form` that remain auditable from the data
+- task-domain classification (`iot`, `wo`, `vibration`, `tsfm`, `fmsr`, multi-step)
+
+## Suggested ingestion workflow
+
+1. Keep raw source data outside committed benchmark artifacts unless license allows redistribution.
+2. Build a deterministic transform script with clear input/output contracts.
+3. Store transformed fixtures under domain-specific folders with a compact README.
+4. Add unit checks for schema and timestamp consistency before creating scenarios.
+5. Open a PR with:
+   - data provenance note,
+   - sample scenario IDs,
+   - before/after validation evidence.
+
+## Privacy and safety guardrails
+
+- Remove direct identifiers, facility names, and any customer-linked metadata.
+- Never include production secrets, API keys, or internal endpoint information.
+- If uncertainty exists, treat the dataset as restricted until maintainers confirm usage policy.
+
+## Related contribution entry points
+
+- Main scenario contribution section: `README.md` -> "Call for Scenario Contribution"
+- Scenario design guidelines:
+  - `docs/guideline/utterance_design_guideline.md`
+  - `docs/guideline/ground_truth_design_guideline.md`
From b581d7871c80179b0196bbee9aa10b4200551c3c Mon Sep 17 00:00:00 2001
From: ulises-jeremias
Date: Fri, 5 Jun 2026 02:10:05 -0300
Subject: [PATCH 2/3] docs: make external dataset guide wording declarative

Signed-off-by: ulises-jeremias
---
 README.md                            | 2 +-
 docs/external-industrial-datasets.md | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 7337a605..995af1b4 100644
--- a/README.md
+++ b/README.md
@@ -273,7 +273,7 @@ We are expanding **AssetOpsBench** to cover a broader range of industrial challe
 
 1. **Define** your scenario following our [Utterance Guideline](docs/guideline/utterance_design_guideline.md) and [Ground Truth Guideline](docs/guideline/ground_truth_design_guideline.md)
 2. **Explore** the [Hugging Face dataset](https://huggingface.co/datasets/ibm-research/AssetOpsBench) for examples
-   - For external public sources, see the contributor proposal in [External Industrial Dataset Guide](docs/external-industrial-datasets.md)
+   - For external public sources, see [External Industrial Dataset Guide](docs/external-industrial-datasets.md)
 3. **Submit** a Pull Request or open an [Issue](https://github.com/IBM/AssetOpsBench/issues) with the tag `new-scenario`
 4. 
**Contact us** with questions:
    - Dhaval Patel — [pateldha@us.ibm.com](mailto:pateldha@us.ibm.com)
diff --git a/docs/external-industrial-datasets.md b/docs/external-industrial-datasets.md
index 0eb4ae6e..bb9934a6 100644
--- a/docs/external-industrial-datasets.md
+++ b/docs/external-industrial-datasets.md
@@ -1,6 +1,6 @@
-# External Industrial Dataset Guide (Contributor Proposal)
+# External Industrial Dataset Guide
 
-This page is a contributor-oriented proposal to help newcomers discover public industrial datasets that may be useful for scenario design and benchmark extensions.
+This page helps contributors discover public industrial datasets that may be useful for scenario design and benchmark extensions.
 
 It does **not** change AssetOpsBench scoring or baseline definitions. It is a reference map for dataset discovery and adaptation planning.
 
From 7ee24360f2aafd5f864c40427b0fb102b22c5f3c Mon Sep 17 00:00:00 2001
From: ulises-jeremias
Date: Thu, 18 Jun 2026 01:51:05 -0300
Subject: [PATCH 3/3] docs(contrib): add vibration and SWaT dataset mappings

Signed-off-by: ulises-jeremias
---
 README.md                            |  2 +-
 docs/external-industrial-datasets.md | 72 +++++++++++++++++++++++++++-
 2 files changed, 72 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 995af1b4..be456f5d 100644
--- a/README.md
+++ b/README.md
@@ -273,7 +273,7 @@ We are expanding **AssetOpsBench** to cover a broader range of industrial challe
 
 1. **Define** your scenario following our [Utterance Guideline](docs/guideline/utterance_design_guideline.md) and [Ground Truth Guideline](docs/guideline/ground_truth_design_guideline.md)
 2. **Explore** the [Hugging Face dataset](https://huggingface.co/datasets/ibm-research/AssetOpsBench) for examples
-   - For external public sources, see [External Industrial Dataset Guide](docs/external-industrial-datasets.md)
+   - For external public sources and starter asset-class mappings, see [External Industrial Dataset Guide](docs/external-industrial-datasets.md)
 3. 
**Submit** a Pull Request or open an [Issue](https://github.com/IBM/AssetOpsBench/issues) with the tag `new-scenario`
 4. **Contact us** with questions:
    - Dhaval Patel — [pateldha@us.ibm.com](mailto:pateldha@us.ibm.com)
diff --git a/docs/external-industrial-datasets.md b/docs/external-industrial-datasets.md
index bb9934a6..b02660fb 100644
--- a/docs/external-industrial-datasets.md
+++ b/docs/external-industrial-datasets.md
@@ -21,7 +21,7 @@ Before using any external dataset:
 | [SWaT (Secure Water Treatment)](https://www.kaggle.com/datasets/vishala28/swat-dataset-secure-water-treatment-system) | Water-treatment process telemetry, attack/anomaly traces | `iot`, `tsfm`, anomaly scenarios | Commonly used for anomaly-detection tasks; verify host terms and citation requirements. |
 | [NASA C-MAPSS](https://data.nasa.gov/dataset/C-MAPSS-Aircraft-Engine-Simulator-Data/xaut-bemq) | Turbofan degradation / RUL | `tsfm`, prognostics scenarios | Good candidate for PHM and RUL-style benchmark tasks. |
 | [Case Western Reserve Bearing Data Center](https://engineering.case.edu/bearingdatacenter/welcome) | Bearing vibration fault data | `vibration` | Strong fit for spectral diagnosis and fault-classification tasks. |
-| [Paderborn University Bearing Dataset](https://mb.uni-paderborn.de/kat/forschung/datacenter/bearing-datacenter) | Rolling-bearing fault experiments | `vibration`, `tsfm` | Useful to cross-check bearing-fault robustness across machines/loads. |
+| [Paderborn University Bearing Dataset](https://groups.uni-paderborn.de/kat/BearingDataCenter/) | Rolling-bearing fault experiments | `vibration`, `tsfm` | Useful to cross-check bearing-fault robustness across machines/loads. 
|
 
 ## Mapping checklist to AssetOpsBench schema
 
@@ -33,6 +33,76 @@ When preparing scenarios from an external source, define these fields early:
 - expected outputs in `characteristic_form` that remain auditable from the data
 - task-domain classification (`iot`, `wo`, `vibration`, `tsfm`, `fmsr`, multi-step)
 
+## Concrete starter mappings
+
+The mappings below are documentation-only starting points. They do not imply
+that the raw datasets are redistributed in this repository or that executable
+benchmark scenarios already exist for each source.
+
+### Vibration diagnostics
+
+Public bearing datasets are a strong fit for the existing `vibration` domain.
+AssetOpsBench already includes a
+[vibration MCP server](mcp-servers.md#vibration--vibration-diagnostics) with
+FFT analysis, envelope analysis, bearing characteristic frequency calculation,
+ISO 10816 severity assessment, and full vibration diagnosis capabilities.
+Existing local utterances in `src/scenarios/local/vibration_utterance.json` can
+be used as a style reference for future scenario PRs.
+
+| Source | Asset class | Candidate AssetOpsBench domain | Candidate scenario shape | Notes |
+| --- | --- | --- | --- | --- |
+| [Case Western Reserve Bearing Data Center](https://engineering.case.edu/bearingdatacenter/welcome) | Bearings, rotating machinery, motors | `vibration` | Fault classification from FFT/envelope evidence; bearing-frequency reasoning; maintenance prioritization after suspected bearing fault | Verify dataset terms and citation requirements before deriving fixtures. |
+| [Paderborn University Bearing Dataset](https://groups.uni-paderborn.de/kat/BearingDataCenter/) | Rolling bearings under varying operating conditions | `vibration`, optional `tsfm` | Cross-load bearing diagnosis; robustness checks across machine/load conditions; time-series condition comparison | Useful follow-up source once the first bearing mapping is agreed. 
|
+
+Candidate vibration prompts for a future executable scenario PR:
+
+- Diagnose whether a motor vibration signal suggests an outer race bearing fault
+  using envelope-spectrum evidence.
+- Calculate BPFO, BPFI, BSF, and FTF for a bearing geometry and shaft speed,
+  then explain which observed peaks match the expected fault frequencies.
+- Compare two bearing signals and prioritize maintenance based on spectral
+  evidence and severity.
+- Explain whether dominant FFT peaks are more consistent with unbalance,
+  misalignment, looseness, or a bearing defect.
+
+### SWaT / water-treatment telemetry
+
+SWaT is a useful starting point for water-treatment anomaly scenarios. It maps
+most naturally to `iot` for sensor and actuator history, and to `tsfm` for
+forecasting or anomaly-detection tasks. More advanced scenarios may combine
+sensor lookup, time-series analysis, process-stage interpretation, and operator
+recommendations.
+
+| Source | Asset class | Candidate AssetOpsBench domain | Candidate scenario shape | Notes |
+| --- | --- | --- | --- | --- |
+| [SWaT (Secure Water Treatment)](https://www.kaggle.com/datasets/vishala28/swat-dataset-secure-water-treatment-system) | Water-treatment process, sensors, actuators | `iot`, `tsfm`, multi-step | Retrieve process telemetry for a time window; identify abnormal process state; forecast threshold breach; explain affected treatment stage | Verify source terms before use. Do not commit raw Kaggle data unless redistribution is explicitly allowed. |
+
+Candidate SWaT prompts for a future executable scenario PR:
+
+- Retrieve sensor readings for a water-treatment stage over a specific time
+  window and summarize abnormal behavior.
+- Forecast whether a tank-level, flow, or pressure variable is likely to breach
+  an operating threshold in the next window.
+- Determine whether anomalous readings are consistent with a process fault or a
+  cyber-physical attack pattern.
+- Recommend operator checks after an anomaly is detected in a treatment stage,
+  citing the sensor evidence used.
+
+## From mapping to executable scenarios
+
+Before turning either mapping into benchmark scenarios, a follow-up PR should
+make the source-to-scenario contract explicit:
+
+1. Confirm license, citation, and redistribution constraints.
+2. Keep raw data outside the repository unless redistribution is allowed.
+3. Define stable `asset_id` and `site_name` values.
+4. Normalize timestamps to a documented ISO 8601 convention.
+5. Map raw sensor or actuator columns to scenario-facing names.
+6. Document the transform script input/output contract and provenance metadata.
+7. Add candidate utterances following `docs/guideline/utterance_design_guideline.md`.
+8. Define expected behavior and ground-truth criteria following `docs/guideline/ground_truth_design_guideline.md`.
+9. Validate scenario files against the evaluation expectations in `docs/evaluation.md`.
+
 ## Suggested ingestion workflow
 
 1. Keep raw source data outside committed benchmark artifacts unless license allows redistribution.
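
Steps 3 to 6 of the "From mapping to executable scenarios" checklist (stable `asset_id` and `site_name`, ISO 8601 timestamps, a raw-to-scenario sensor map, and provenance metadata) could be prototyped as a small deterministic transform. The sketch below is only an illustration and is not part of this patch series or of AssetOpsBench. The function names, the output record layout, the `Timestamp` column, the raw tag names, the scenario-facing names, the timestamp format, and the UTC+8 source offset are all assumptions that a real PR would need to replace with values verified against the chosen dataset.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical raw-to-scenario sensor map; real column names depend on the source.
SENSOR_NAME_MAP = {
    "LIT101": "stage1_tank_level",
    "FIT101": "stage1_inflow_rate",
}

# Assumed source conventions; confirm both against the dataset documentation.
RAW_TIMESTAMP_FORMAT = "%d/%m/%Y %H:%M:%S"
RAW_UTC_OFFSET_HOURS = 8


def normalize_timestamp(raw: str, fmt: str, utc_offset_hours: float) -> str:
    """Parse a naive local timestamp and return it as ISO 8601 in UTC."""
    source_tz = timezone(timedelta(hours=utc_offset_hours))
    local = datetime.strptime(raw.strip(), fmt).replace(tzinfo=source_tz)
    return local.astimezone(timezone.utc).isoformat(timespec="seconds")


def transform_row(raw_row: dict, asset_id: str, site_name: str, provenance: dict) -> dict:
    """Map one raw record to a scenario-facing record with provenance attached."""
    unmapped = sorted(set(raw_row) - set(SENSOR_NAME_MAP) - {"Timestamp"})
    if unmapped:
        # Fail loudly so the sensor map stays complete and auditable.
        raise KeyError(f"unmapped raw columns: {unmapped}")
    record = {
        "asset_id": asset_id,
        "site_name": site_name,
        "timestamp": normalize_timestamp(
            raw_row["Timestamp"], RAW_TIMESTAMP_FORMAT, RAW_UTC_OFFSET_HOURS
        ),
        "provenance": provenance,
    }
    for raw_name, value in raw_row.items():
        if raw_name != "Timestamp":
            record[SENSOR_NAME_MAP[raw_name]] = float(value)
    return record


if __name__ == "__main__":
    # Made-up example row, not taken from any dataset.
    example = {"Timestamp": "28/12/2015 10:00:00", "LIT101": "500.0", "FIT101": "2.5"}
    meta = {"source": "example-source", "version": "unknown", "transform_script": "transform.py"}
    print(transform_row(example, "water_plant_stage1", "EXAMPLE_SITE", meta))
```

Rejecting unmapped columns instead of passing them through keeps the mapping explicit, which makes the schema and timestamp unit checks suggested in the ingestion workflow easier to write.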