Fix: InferClass substring matching + expand data extraction

> **Full archive with figures, audit script, and reproducible data**:
> https://github.com/tygwan/first-ontology-project/tree/main/docs/findings/2026-04-12-M1-piping-misclassification
>
> **Related downstream project**: [first-ontology-project](https://github.com/tygwan/first-ontology-project) — a BIM ontology/knowledge graph pipeline built on top of DXTnavis outputs. The full audit was performed during that project's Phase 1 verification.

---

## Summary

Two categories of changes to `Services/RefinedXlsxExporter.cs`:

1. **Bug fix**: `InferClass` uses case-insensitive **substring** matching, which produces **997 false-positive Piping classifications** on the 2026-04-07 snapshot (24.8% of all objects labeled Piping, or ~34% inflation over the 2,926 high-confidence Piping objects). Root cause: the Piping keyword `"tee"` matches inside `"steel"`, and `"pipe"` matches the `"Pipe Rack"` / `"Pipe Trench"` folder names in the system path.

2. **Data extraction enhancement**: Expand the XLSX output so downstream consumers (knowledge graph / ontology pipelines / Palantir Foundry) can build on top of DXTnavis without joining multiple raw exports.

## Part 1 — Bug fix: InferClass substring matching

### Current behavior

`InferClass` (lines 298-375) performs Tier 3 classification by calling `.Contains()` on a lowercased string, i.e. substring matching **without word boundaries**:

```csharp
string combined = (sysPath + " " + displayName).ToLowerInvariant();
foreach (var key in objData.Keys)
{
    if (key.StartsWith("__")) continue;
    combined += " " + key.ToLowerInvariant();
}

if (combined.Contains("pipe") || combined.Contains("valve") ||
    combined.Contains("flange") || combined.Contains("elbow") ||
    combined.Contains("tee") || combined.Contains("reducer") ||
    combined.Contains("nozzle") || combined.Contains("coupling"))
    return "Piping";
```

### Problem

| Intended keyword | Actually matches | Impact |
|------------------|------------------|-------:|
| `"tee"` (Tee pipe fitting) | **`"steel"`** (via `s-TEE-l`) | 10 steel structural members → Piping |
| `"pipe"` (Pipe fitting) | **`"Pipe Rack"`** folder (structural support) | 698 objects → Piping |
| `"pipe"` (Pipe fitting) | **`"Pipe Trench"`** folder (civil work) | 60 objects → Piping |
| `"pipe"` (Pipe fitting) | **`"Pipeline"`** folder (grouping node) | 12 objects → Piping |

### Concrete example

```
Object: MemberSystem-1-0151
System Path: For Review.nwd > Electrical Device > Steel > MemberSystem-1-0151
Navisworks properties: 항목|유형 (Item|Type), 항목|이름 (Item|Name), 항목|소스 파일 (Item|Source File), ... (8 Item metadata properties only)
SP3D properties: none
Current XLSX Class: Piping   ← BUG
Correct Class: Structure (steel member)
```

Because the system path contains `"Steel"`, the `tee` keyword (intended for pipe Tee fittings) hits via substring match `s-TEE-l`. Piping is evaluated first in Tier 3, so Structure (which would match on the `"steel"` keyword) never gets a chance.
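The false positive can be reproduced outside the exporter. The following is a minimal Python sketch (Python because the downstream port is written in it) that mirrors the `.Contains()` chain. The helper name `matched_piping_keywords` is hypothetical, and the property keys are left out because none of them contributes a keyword in this example.

```python
# Python mirror of the current Tier 3 Piping check (illustrative, not the C# source).
PIPING_KEYWORDS = ("pipe", "valve", "flange", "elbow", "tee", "reducer", "nozzle", "coupling")

def matched_piping_keywords(sys_path: str, display_name: str) -> list[str]:
    """Return every Piping keyword found as a plain substring, like .Contains()."""
    combined = f"{sys_path} {display_name}".lower()
    return [kw for kw in PIPING_KEYWORDS if kw in combined]

path = "For Review.nwd > Electrical Device > Steel > MemberSystem-1-0151"
hits = matched_piping_keywords(path, "MemberSystem-1-0151")
print(hits)  # ['tee'], because "steel" contains the substring "tee"
```

Any non-empty result sends the object to Piping before the Structure keywords are consulted.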

### Quantitative impact (2026-04-07 snapshot, 12,009 objects)

| Piping subset | Count | % of Piping |
|---------------|------:|------------:|
| HIGH confidence (has pipeline + commodity/spec/NPD metadata) | 2,926 | 72.9% |
| LOW confidence (has some metadata but no pipeline) | 91 | 2.3% |
| **LIKELY BUG** (no pipeline, no piping metadata) | **997** | **24.8%** |
| Total labeled Piping | 4,014 | 100% |

Cross-reference: none of the 5,926 objects in the Structure class has `sp3d_pipeline` or `sp3d_eqp_type_0` set, so contamination is unidirectional (Piping absorbs misclassified Structure/Electrical objects, not the other way around).

Breakdown of 997 LIKELY_BUG Piping by root cause:
- `Pipe Rack` folder → 698
- `Pipe Trench` folder → 60
- `Pipeline` folder → 12
- `steel` substring → `tee` → 10
- Other complex/nested paths → ~217
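The counts above can be cross-checked with a few lines of arithmetic. This sketch only re-derives the percentages from the numbers already quoted. It treats the "other" bucket as the exact remainder (217) and reads the ~34% inflation figure in the Summary as 997 relative to the 2,926 high-confidence objects, which is an interpretation rather than something the audit states.

```python
# Figures copied from the tables above; nothing here is new data.
subsets = {"HIGH": 2926, "LOW": 91, "LIKELY_BUG": 997}
causes = {"Pipe Rack": 698, "Pipe Trench": 60, "Pipeline": 12, "steel/tee": 10, "other": 217}

total_piping = sum(subsets.values())
shares = {name: round(100 * count / total_piping, 1) for name, count in subsets.items()}
inflation = round(100 * subsets["LIKELY_BUG"] / subsets["HIGH"], 1)

print(total_piping)          # 4014
print(shares)                # {'HIGH': 72.9, 'LOW': 2.3, 'LIKELY_BUG': 24.8}
print(sum(causes.values()))  # 997
print(inflation)             # 34.1
```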

### Proposed fix — word boundary regex

```csharp
using System.Text.RegularExpressions;

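// A bare \bpipe\b still matches the whole word "Pipe" in "Pipe Rack" / "Pipe Trench",
// so those folder names are excluded explicitly with a negative lookahead.
private static readonly Regex PipingKeywordRegex = new Regex(
    @"\b(pipe(?!\s+(?:rack|trench)\b)|valve|flange|elbow|tee|reducer|nozzle|coupling)\b",
    RegexOptions.IgnoreCase | RegexOptions.Compiled);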

private static readonly Regex EquipmentKeywordRegex = new Regex(
    @"\b(equipment|vessel|pump|tank|compressor|exchanger|heater|reactor)\b",
    RegexOptions.IgnoreCase | RegexOptions.Compiled);

private static readonly Regex StructureKeywordRegex = new Regex(
    @"\b(struct|structural|structure|steel|beam|column|brace|foundation|slab|plate|grating|handrail|ladder|stair)\b",
    RegexOptions.IgnoreCase | RegexOptions.Compiled);

private static readonly Regex ElectricalKeywordRegex = new Regex(
    @"\b(electrical|cable|conduit|tray)\b",
    RegexOptions.IgnoreCase | RegexOptions.Compiled);

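// Stems need \w* so the trailing \b can still match full words such as
// "ventilation", "ventilator", "instruments" and "instrumentation".
private static readonly Regex HvacKeywordRegex = new Regex(
    @"\b(hvac|duct|ventilat\w*)\b",
    RegexOptions.IgnoreCase | RegexOptions.Compiled);

private static readonly Regex InstrumentationKeywordRegex = new Regex(
    @"\binstrument\w*\b",
    RegexOptions.IgnoreCase | RegexOptions.Compiled);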
```

Then in `InferClass` Tier 3:

```csharp
if (PipingKeywordRegex.IsMatch(combined))          return "Piping";
if (EquipmentKeywordRegex.IsMatch(combined))       return "Equipment";
if (StructureKeywordRegex.IsMatch(combined))       return "Structure";
if (ElectricalKeywordRegex.IsMatch(combined))      return "Electrical";
if (HvacKeywordRegex.IsMatch(combined))            return "HVAC";
if (InstrumentationKeywordRegex.IsMatch(combined)) return "Instrumentation";
return "Other";
```
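One caveat is worth checking before merging: word boundaries stop `tee` from matching inside `steel` and `pipe` from matching inside `Pipeline`, but `Pipe` in `Pipe Rack` is already a whole word. The Python sketch below compares a boundary-only pattern with a guarded variant. The negative lookahead is one possible way to exclude the folder names and is an assumption of this sketch, as are the sample paths.

```python
import re

TAIL = r"valve|flange|elbow|tee|reducer|nozzle|coupling"
# Word boundaries only.
BARE = re.compile(rf"\b(pipe|{TAIL})\b", re.IGNORECASE)
# Assumption: additionally reject "pipe" when it is followed by "rack" or "trench".
GUARDED = re.compile(rf"\b(pipe(?!\s+(?:rack|trench)\b)|{TAIL})\b", re.IGNORECASE)

samples = [
    "Electrical Device > Steel > Member",
    "Pipeline > Node-1",
    "Civil > Pipe Rack > Beam-1",
    "Piping > P-001 > Pipe-1-0042",
]
results = {s: (bool(BARE.search(s)), bool(GUARDED.search(s))) for s in samples}
for text, (bare, guarded) in results.items():
    print(f"{text!r}: bare={bare} guarded={guarded}")
```

Only the `Pipe Rack` sample differs between the two patterns: the boundary-only pattern matches it and the guarded one does not. The same lookahead syntax is available in .NET regular expressions.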

### Expected outcome

| Class | Before (buggy) | After (fixed) | Delta |
|-------|---------------:|--------------:|------:|
| Piping | 4,014 | ~2,926 | -1,088 |
| Structure | 5,926 | ~6,900 | +974 |
| Electrical | 449 | ~510 | +61 |
| HVAC | 72 | ~72 | 0 |
| Equipment | 851 | ~851 | 0 |
| Other | 697 | ~750 | +53 |
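Since reclassification only moves objects between classes, the table should conserve the snapshot total. A short Python check on the quoted figures (the "after" values are the approximate estimates from the table, not measured results):

```python
before = {"Piping": 4014, "Structure": 5926, "Electrical": 449,
          "HVAC": 72, "Equipment": 851, "Other": 697}
after = {"Piping": 2926, "Structure": 6900, "Electrical": 510,
         "HVAC": 72, "Equipment": 851, "Other": 750}

deltas = {cls: after[cls] - before[cls] for cls in before}

print(sum(before.values()))  # 12009
print(sum(after.values()))   # 12009
print(deltas["Piping"])      # -1088
```

Note that the Piping delta of -1,088 equals 997 + 91, so the estimate appears to assume that the 91 LOW-confidence objects also leave Piping. That is worth confirming against the regression fixture.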

### Tests to add

```csharp
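// Shared fixture: an object with no properties at all.
private static readonly Dictionary<string, string> emptyProps = new Dictionary<string, string>();

[Test] public void Tee_keyword_should_not_match_steel() {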
    var result = InferClass(emptyProps, "Electrical Device > Steel > Member", "Member-1");
    Assert.That(result, Is.EqualTo("Structure"));
}

[Test] public void Pipe_keyword_should_not_match_pipe_rack() {
    var result = InferClass(emptyProps, "> A1 > U12 > Civil > Pipe Rack", "Beam-1");
    Assert.That(result, Is.EqualTo("Structure"));
}

[Test] public void Pipe_keyword_should_still_match_real_pipe() {
    var props = new Dictionary<string,string> { { "SmartPlant 3D|Pipeline", "P-001" } };
    var result = InferClass(props, "> Piping > P-001", "Pipe-1-0042");
    Assert.That(result, Is.EqualTo("Piping"));
}
```

---

## Part 2 — Data extraction wishlist

Downstream consumers (BIM ontology / knowledge graph / Foundry) need the following to build on top of DXTnavis without having to join multiple raw exports.

### MUST — blocking downstream Phase 2+

#### M2a. Preserve `ParentId` in XLSX output

**Current**: `BuildDynamicColumns` (lines 204-229) excludes any key starting with `__`, which drops `__ParentId`. The XLSX therefore has no way to reconstruct the parent-child hierarchy.

**Impact**: Downstream consumers must join XLSX with `AllProperties.csv` or `validation.csv` to recover ParentId, which is fragile (validation.csv only has 294/12,009 populated rows).

**Fix**: Add `ParentId` to the `MetaColumns` array.

#### M2b. Preserve `Level` as an integer column

**Current**: Level is written as a string via `objData["__Level"] = record.Level.ToString()`.

**Fix**: Write `Level` as an int directly: `ws.Cell(r, c).Value = record.Level`.
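A small Python illustration of why a text-typed `Level` hurts consumers: string comparison is lexicographic, so sorting and max/min give the wrong answer once levels reach two digits. The sample values are made up.

```python
levels_as_text = ["7", "10", "2"]           # what a string-typed cell yields
levels_as_int = [int(v) for v in levels_as_text]

print(sorted(levels_as_text))  # ['10', '2', '7']
print(sorted(levels_as_int))   # [2, 7, 10]
print(max(levels_as_text))     # 7
print(max(levels_as_int))      # 10
```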

#### M2c. Emit a dedicated `Hierarchy` sheet

**Desired content**: `ObjectId(GUID)`, `ParentId`, `Level` — 3 columns.

**Rationale**: Even with ParentId preserved in the pivot sheet, a dedicated hierarchy sheet makes it trivial to build `HasParent` relationship datasets.
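To make the consumer side concrete, here is a hedged Python sketch of how `HasParent` edges could be derived from three-column rows. The function name, the placeholder ids, and the convention that root rows carry an empty `ParentId` are all assumptions of this example.

```python
from collections import defaultdict

# (ObjectId, ParentId, Level) rows; short placeholder ids instead of GUIDs.
rows = [("A", None, 0), ("B", "A", 1), ("C", "A", 1), ("D", "B", 2)]

def build_has_parent(rows):
    """Return (edges, children-by-parent, ids whose parent is not in the sheet)."""
    known = {obj for obj, _, _ in rows}
    edges = [(obj, parent) for obj, parent, _ in rows if parent]
    children = defaultdict(list)
    for obj, parent in edges:
        children[parent].append(obj)
    dangling = [obj for obj, parent in edges if parent not in known]
    return edges, dict(children), dangling

edges, children, dangling = build_has_parent(rows)
print(edges)     # [('B', 'A'), ('C', 'A'), ('D', 'B')]
print(children)  # {'A': ['B', 'C'], 'B': ['D']}
print(dangling)  # []
```

A non-empty `dangling` list would indicate parents that were filtered out of the export, which is a useful integrity check on the sheet itself.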

### SHOULD — significant quality-of-life for downstream

#### S2a. Recover dropped SP3D columns

Missing from XLSX but present in `AllProperties_*.csv`:
- `SmartPlant 3D|Flow Direction` — critical for piping flow graph analysis
- `SmartPlant 3D|Cut length` (lowercase `l` — possibly a typo)
- `객체이름` ("object name", a Korean-language alternative to the display name)

#### S2b. Resolve multi-category Pipeline name collision

**Current**: `FindKey(columns, "Pipeline")` selects the first column whose PropertyName is `"Pipeline"`, typically `SmartPlant 3D|Pipeline`. If other categories have a `"Pipeline"` property, those are silently ignored.

**Impact**: At least 10 pipelines are missing for downstream summary consumers (147 in XLSX Pipeline_Summary vs 157 in the SQLite backend).
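One possible resolution is to collect every column whose property part is `Pipeline` and take the first non-empty value per object. The Python sketch below illustrates that policy. The `find_keys` / `first_value` helpers and the `Other Category` column are hypothetical, and only the `Category|Property` key format is taken from the export.

```python
def find_keys(columns, prop_name):
    """All 'Category|Property' keys whose property part equals prop_name."""
    return [c for c in columns if c.rsplit("|", 1)[-1] == prop_name]

def first_value(record, keys):
    """First non-blank value among the candidate keys, or None."""
    for key in keys:
        value = (record.get(key) or "").strip()
        if value:
            return value
    return None

columns = ["SmartPlant 3D|Pipeline", "Other Category|Pipeline", "SmartPlant 3D|Spec"]
pipeline_keys = find_keys(columns, "Pipeline")
record = {"SmartPlant 3D|Pipeline": "", "Other Category|Pipeline": "P-200"}

print(pipeline_keys)                       # ['SmartPlant 3D|Pipeline', 'Other Category|Pipeline']
print(first_value(record, pipeline_keys))  # P-200
```

Whether to coalesce like this or to emit one column per category is a design decision for the exporter. The sketch only shows that no category has to be silently dropped.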

#### S2c. Filter placeholder `"Pipelines"` label from Pipeline field

**Current**: 153 objects at Level 7 have `SmartPlant 3D|Pipeline = "Pipelines"` (the literal folder name) rather than a real pipeline identifier.
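A minimal Python sketch of such a filter, treating the folder label as "no pipeline". The helper name and the choice to compare case-insensitively are assumptions.

```python
# Labels that are folder names rather than pipeline identifiers.
PLACEHOLDER_PIPELINES = {"pipelines"}

def clean_pipeline(value):
    """Return a usable pipeline identifier, or None for blanks and placeholders."""
    if value is None:
        return None
    text = value.strip()
    if not text or text.lower() in PLACEHOLDER_PIPELINES:
        return None
    return text

print(clean_pipeline("Pipelines"))  # None
print(clean_pipeline(" P-001 "))    # P-001
```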

#### S2d. Type-strict columns

**Current**: All values written as strings. Numeric values (weights, lengths, coordinates, counts) become text cells.

**Fix**: Emit numeric fields as numbers. Decide on SI vs imperial policy per field.
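A sketch of per-column coercion in Python, as a consumer would otherwise have to do it. The column names in `NUMERIC_COLUMNS` are hypothetical. Restricting the conversion to known numeric columns is deliberate, because blindly converting every cell would turn an identifier such as `0042` into `42`.

```python
NUMERIC_COLUMNS = {"Level", "Weight"}  # hypothetical allow-list of numeric fields

def to_number(text):
    """int if possible, else float, else the original text unchanged."""
    stripped = text.strip()
    for cast in (int, float):
        try:
            return cast(stripped)
        except ValueError:
            continue
    return text

def typed_row(row):
    """Convert only the allow-listed columns; leave everything else as text."""
    return {k: to_number(v) if k in NUMERIC_COLUMNS else v for k, v in row.items()}

row = typed_row({"Level": "7", "Weight": "12.50", "Tag": "0042"})
print(row)  # {'Level': 7, 'Weight': 12.5, 'Tag': '0042'}
```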

### NICE-TO-HAVE

- **N2a. Parquet output option** (`--format parquet` flag) — Foundry and Spark tools ingest Parquet natively
- **N2b. Emit relationship sheets** — `AdjacentTo`, `HasMaterial`, `HasSpecification`, `BelongsToPipeline`, `SupportedBy`
- **N2c. Canonical reference tables** — `Materials.csv`, `Specifications.csv`, `Pipelines.csv`
- **N2d. Change tracking between exports** — `ChangeLog.csv` with added/removed/modified diff
- **N2e. Embedded metadata sheet** — export timestamp, version, source file hash, parameters
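For N2d, the added/removed/modified split reduces to set operations on object ids. Here is a hedged Python sketch that assumes each export can be loaded as a dict keyed by `ObjectId`. The function name and the sample rows are illustrative.

```python
def diff_exports(old, new):
    """Compare two exports given as {object_id: properties} dicts."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    modified = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return {"added": added, "removed": removed, "modified": modified}

old = {"A": {"Class": "Piping"}, "B": {"Class": "Structure"}, "C": {"Class": "Other"}}
new = {"A": {"Class": "Structure"}, "B": {"Class": "Structure"}, "D": {"Class": "HVAC"}}

changes = diff_exports(old, new)
print(changes)  # {'added': ['D'], 'removed': ['C'], 'modified': ['A']}
```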

---

## Effort estimate

| Task | Effort |
|------|-------:|
| Part 1 InferClass fix + unit tests | 0.5 day |
| Regression test with snapshot fixture | 0.5 day |
| M2a/b/c hierarchy preservation | 0.5 day |
| S2a-d schema improvements | 1.5 days |
| **Total MUST+SHOULD** | **~3 days** |
| **Total including NICE-TO-HAVE** | **~9 days** |

## Downstream validation

The `first-ontology-project` repository has 192 tests exercising the XLSX classification logic via a Python port. After applying the C# fix, update the Python port keywords to use regex word boundaries as well, then re-run:

```bash
.venv/bin/python -m pytest tests/test_ingest/test_xlsx_classifier.py
```

The acceptance criterion is a 100% match between the classes produced by the fixed C# exporter and those produced by the updated Python port.
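The comparison itself can be as small as the following sketch, which assumes both sides can be reduced to `{ObjectId: class}` dicts. The `agreement` helper and the sample labels are hypothetical and are not part of the existing test suite.

```python
def agreement(csharp_labels, python_labels):
    """Return (match rate, {id: (csharp, python)}) over the union of object ids."""
    ids = csharp_labels.keys() | python_labels.keys()
    mismatches = {
        i: (csharp_labels.get(i), python_labels.get(i))
        for i in sorted(ids)
        if csharp_labels.get(i) != python_labels.get(i)
    }
    rate = 1 - len(mismatches) / len(ids) if ids else 1.0
    return rate, mismatches

xlsx = {"A": "Piping", "B": "Structure", "C": "Structure", "D": "Other"}
port = {"A": "Piping", "B": "Structure", "C": "Piping", "D": "Other"}

rate, mismatches = agreement(xlsx, port)
print(rate)        # 0.75
print(mismatches)  # {'C': ('Structure', 'Piping')}
```

Listing the mismatching ids, rather than only the rate, makes it easier to tell a keyword drift from an ordering difference between the two implementations.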

---

## Evidence files

Full archive in the downstream project:
https://github.com/tygwan/first-ontology-project/tree/main/docs/findings/2026-04-12-M1-piping-misclassification

Contains:
- `README.md` — 5-section finding report
- `audit.py` — reproducible diagnosis script
- `make_figures.py` — matplotlib visualization generator
- `data/` — 5 CSV evidence files (confidence breakdown, substring causes, etc.)
- `figures/` — 4 PNG charts (confidence distribution, substring causes, misclassified prefixes, class distribution)
- `dxtnavis-pr-draft.md` — full PR draft (this issue body is a condensed version)

