Fix: InferClass substring matching + expand data extraction #2

@tygwan

Full archive with figures, audit script, and reproducible data:
https://github.com/tygwan/first-ontology-project/tree/main/docs/findings/2026-04-12-M1-piping-misclassification

Related downstream project: first-ontology-project — a BIM ontology/knowledge graph pipeline built on top of DXTnavis outputs. The full audit was performed during that project's Phase 1 verification.


Summary

Two categories of changes to Services/RefinedXlsxExporter.cs:

  1. Bug fix: InferClass uses case-insensitive substring matching which produces 997 false-positive Piping classifications on the 2026-04-07 snapshot (~34% inflation of the Piping class). Root cause: Piping keyword "tee" matches "steel", and "pipe" matches "Pipe Rack" / "Pipe Trench" folder names in the system path.

  2. Data extraction enhancement: Expand the XLSX output so downstream consumers (knowledge graph / ontology pipelines / Palantir Foundry) can build on top of DXTnavis without joining multiple raw exports.

Part 1 — Bug fix: InferClass substring matching

Current behavior

InferClass (lines 298-375) performs Tier 3 classification using .Contains() which is substring matching without word boundaries:

string combined = (sysPath + " " + displayName).ToLowerInvariant();
foreach (var key in objData.Keys)
{
    if (key.StartsWith("__")) continue;
    combined += " " + key.ToLowerInvariant();
}

if (combined.Contains("pipe") || combined.Contains("valve") ||
    combined.Contains("flange") || combined.Contains("elbow") ||
    combined.Contains("tee") || combined.Contains("reducer") ||
    combined.Contains("nozzle") || combined.Contains("coupling"))
    return "Piping";

Problem

| Intended keyword | Actually matches | Impact |
| --- | --- | --- |
| "tee" (Tee pipe fitting) | "steel" (via s-TEE-l) | 10 steel structural members → Piping |
| "pipe" (Pipe fitting) | "Pipe Rack" folder (structural support) | 698 objects → Piping |
| "pipe" (Pipe fitting) | "Pipe Trench" folder (civil work) | 60 objects → Piping |
| "pipe" (Pipe fitting) | "Pipeline" folder (grouping node) | 12 objects → Piping |

Concrete example

Object: MemberSystem-1-0151
System Path: For Review.nwd > Electrical Device > Steel > MemberSystem-1-0151
Navisworks properties: 항목|유형, 항목|이름, 항목|소스 파일, ... (8 Item metadata only)
SP3D properties: none
Current XLSX Class: Piping   ← BUG
Correct Class: Structure (steel member)

Because the system path contains "Steel", the tee keyword (intended for pipe Tee fittings) hits via substring match s-TEE-l. Piping is evaluated first in Tier 3, so Structure (which would match on the "steel" keyword) never gets a chance.
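The false positives can be reproduced outside the exporter. The following is a minimal Python sketch of the same substring check; the keyword tuple is copied from the C# snippet above, while the function name `substring_hits` is hypothetical and is not taken from the downstream port.

```python
# Python mirror of the Tier 3 Piping check; keywords copied from the C# snippet.
PIPING_KEYWORDS = ("pipe", "valve", "flange", "elbow", "tee", "reducer", "nozzle", "coupling")


def substring_hits(sys_path: str, display_name: str) -> list[str]:
    """Return every Piping keyword found by plain substring search."""
    combined = f"{sys_path} {display_name}".lower()
    return [kw for kw in PIPING_KEYWORDS if kw in combined]


# The concrete example from this issue: "tee" is found inside "steel".
steel_hits = substring_hits(
    "For Review.nwd > Electrical Device > Steel > MemberSystem-1-0151",
    "MemberSystem-1-0151",
)
# A structural member stored under a "Pipe Rack" folder.
rack_hits = substring_hits("> A1 > U12 > Civil > Pipe Rack", "Beam-1")

print(steel_hits)  # ['tee']
print(rack_hits)   # ['pipe']
```

Either hit is enough for the Piping branch to return before the Structure keywords are evaluated.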

Quantitative impact (2026-04-07 snapshot, 12,009 objects)

| Piping subset | Count | % of Piping |
| --- | --- | --- |
| HIGH confidence (has pipeline + commodity/spec/NPD metadata) | 2,926 | 72.9% |
| LOW confidence (has some metadata but no pipeline) | 91 | 2.3% |
| LIKELY BUG (no pipeline, no piping metadata) | 997 | 24.8% |
| Total labeled Piping | 4,014 | 100% |

Cross-reference: the Structure class (5,926 objects) has zero sp3d_pipeline or sp3d_eqp_type_0 set, so contamination is unidirectional (Piping absorbs misclassified Structure/Electrical, not the other way).

Breakdown of 997 LIKELY_BUG Piping by root cause:

  • Pipe Rack folder → 698
  • Pipe Trench folder → 60
  • Pipeline folder → 12
  • steel substring → tee → 10
  • Other complex/nested paths → ~217
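The counts above can be cross-checked with a few lines of arithmetic. This sketch only re-derives numbers already stated in this issue. Reading the ~34% inflation in the Summary as 997 relative to the 2,926 high-confidence objects is an inference, not something the audit states explicitly.

```python
# Counts copied from the tables and breakdown above.
piping = {"HIGH": 2926, "LOW": 91, "LIKELY_BUG": 997}
causes = {"Pipe Rack": 698, "Pipe Trench": 60, "Pipeline": 12, "steel->tee": 10}

total = sum(piping.values())
share = {name: round(100 * count / total, 1) for name, count in piping.items()}

# Whatever the four named causes do not explain falls under "other paths".
other = piping["LIKELY_BUG"] - sum(causes.values())

# Assumed reading of "~34% inflation": bug count relative to HIGH confidence.
inflation = round(100 * piping["LIKELY_BUG"] / piping["HIGH"], 1)

print(total)      # 4014
print(share)      # {'HIGH': 72.9, 'LOW': 2.3, 'LIKELY_BUG': 24.8}
print(other)      # 217
print(inflation)  # 34.1
```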

Proposed fix — word boundary regex

using System.Text.RegularExpressions;

// \b stops "tee" from matching inside "steel" and "pipe" inside "Pipeline".
// "Pipe" in "Pipe Rack" / "Pipe Trench" is a standalone word, so \bpipe\b alone
// would still match it; the negative lookahead excludes those folder names.
private static readonly Regex PipingKeywordRegex = new Regex(
    @"\b(pipe(?!\s+(?:rack|trench)\b)|valve|flange|elbow|tee|reducer|nozzle|coupling)\b",
    RegexOptions.IgnoreCase | RegexOptions.Compiled);

private static readonly Regex EquipmentKeywordRegex = new Regex(
    @"\b(equipment|vessel|pump|tank|compressor|exchanger|heater|reactor)\b",
    RegexOptions.IgnoreCase | RegexOptions.Compiled);

// struct\w* covers struct, structure, and structural.
private static readonly Regex StructureKeywordRegex = new Regex(
    @"\b(struct\w*|steel|beam|column|brace|foundation|slab|plate|grating|handrail|ladder|stair)\b",
    RegexOptions.IgnoreCase | RegexOptions.Compiled);

private static readonly Regex ElectricalKeywordRegex = new Regex(
    @"\b(electrical|cable|conduit|tray)\b",
    RegexOptions.IgnoreCase | RegexOptions.Compiled);

// "ventilat" is a stem; with a trailing \b it needs \w* to match "ventilation".
private static readonly Regex HvacKeywordRegex = new Regex(
    @"\b(hvac|duct|ventilat\w*)\b",
    RegexOptions.IgnoreCase | RegexOptions.Compiled);

// Likewise, instrument\w* keeps "instruments" and "instrumentation" matching.
private static readonly Regex InstrumentationKeywordRegex = new Regex(
    @"\binstrument\w*\b",
    RegexOptions.IgnoreCase | RegexOptions.Compiled);

Then in InferClass Tier 3:

if (PipingKeywordRegex.IsMatch(combined))          return "Piping";
if (EquipmentKeywordRegex.IsMatch(combined))       return "Equipment";
if (StructureKeywordRegex.IsMatch(combined))       return "Structure";
if (ElectricalKeywordRegex.IsMatch(combined))      return "Electrical";
if (HvacKeywordRegex.IsMatch(combined))            return "HVAC";
if (InstrumentationKeywordRegex.IsMatch(combined)) return "Instrumentation";
return "Other";

Expected outcome

| Class | Before (buggy) | After (fixed) | Delta |
| --- | --- | --- | --- |
| Piping | 4,014 | ~2,926 | -1,088 |
| Structure | 5,926 | ~6,900 | +974 |
| Electrical | 449 | ~510 | +61 |
| HVAC | 72 | ~72 | 0 |
| Equipment | 851 | ~851 | 0 |
| Other | 697 | ~750 | +53 |
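As a sanity check on the expected outcome, the deltas should move objects between classes without changing the snapshot total. The sketch below uses only the figures from the table. Its last line records that the Piping delta of 1,088 equals the 997 LIKELY BUG objects plus the 91 LOW confidence objects, which suggests (as an inference) that the estimate assumes both groups leave Piping.

```python
# Figures copied from the expected-outcome table above.
before = {"Piping": 4014, "Structure": 5926, "Electrical": 449,
          "HVAC": 72, "Equipment": 851, "Other": 697}
delta = {"Piping": -1088, "Structure": 974, "Electrical": 61,
         "HVAC": 0, "Equipment": 0, "Other": 53}

after = {name: before[name] + delta[name] for name in before}

snapshot_total = sum(before.values())  # should equal the 12,009-object snapshot
net_change = sum(delta.values())       # reclassification should net to zero
piping_outflow = 997 + 91              # LIKELY BUG + LOW confidence

print(snapshot_total, net_change, after["Piping"], piping_outflow)
```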

Tests to add

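// Shared fixture: an object with no properties, so only Tier 3 keyword matching applies.
private static readonly Dictionary<string,string> emptyProps = new Dictionary<string,string>();

[Test] public void Tee_keyword_should_not_match_steel() {
    var result = InferClass(emptyProps, "Electrical Device > Steel > Member", "Member-1");
    Assert.That(result, Is.EqualTo("Structure"));
}

// "Pipe" is a standalone word in "Pipe Rack", so \bpipe\b alone still matches it;
// this test passes only if the Piping pattern also excludes that folder phrase.
[Test] public void Pipe_keyword_should_not_match_pipe_rack() {
    var result = InferClass(emptyProps, "> A1 > U12 > Civil > Pipe Rack", "Beam-1");
    Assert.That(result, Is.EqualTo("Structure"));
}

[Test] public void Pipe_keyword_should_still_match_real_pipe() {
    var props = new Dictionary<string,string> { { "SmartPlant 3D|Pipeline", "P-001" } };
    var result = InferClass(props, "> Piping > P-001", "Pipe-1-0042");
    Assert.That(result, Is.EqualTo("Piping"));
}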

Part 2 — Data extraction wishlist

Downstream consumers (BIM ontology / knowledge graph / Foundry) need the following to build on top of DXTnavis without having to join multiple raw exports.

MUST — blocking downstream Phase 2+

M2a. Preserve ParentId in XLSX output

Current: BuildDynamicColumns (lines 204-229) excludes any key starting with __, which drops __ParentId. The XLSX therefore has no way to reconstruct the parent-child hierarchy.

Impact: Downstream consumers must join XLSX with AllProperties.csv or validation.csv to recover ParentId, which is fragile (validation.csv only has 294/12,009 populated rows).

Fix: Add ParentId to the MetaColumns array.

M2b. Preserve Level as an integer column

Current: Level is written as a string via objData["__Level"] = record.Level.ToString().

Fix: Write Level as an int directly: ws.Cell(r, c).Value = record.Level.

M2c. Emit a dedicated Hierarchy sheet

Desired content: ObjectId(GUID), ParentId, Level — 3 columns.

Rationale: Even with ParentId preserved in the pivot sheet, a dedicated hierarchy sheet makes it trivial to build HasParent relationship datasets.
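To illustrate why the three-column sheet is enough, here is a sketch of how a downstream consumer might turn its rows into HasParent edges. The function name, the tuple layout, and the sample IDs are hypothetical. Treating an empty ParentId as a root is also an assumption about how the sheet would encode top-level objects.

```python
from collections import defaultdict


def build_has_parent(rows):
    """Build HasParent edges from (object_id, parent_id, level) rows.

    Returns (edges, children, orphans); orphans are objects whose
    ParentId does not appear as an ObjectId anywhere in the sheet.
    """
    rows = list(rows)
    known_ids = {object_id for object_id, _, _ in rows}
    edges, orphans = [], []
    children = defaultdict(list)
    for object_id, parent_id, _level in rows:
        if not parent_id:            # assumed encoding of a root node
            continue
        if parent_id not in known_ids:
            orphans.append(object_id)
            continue
        edges.append((object_id, parent_id))
        children[parent_id].append(object_id)
    return edges, dict(children), orphans


# Hypothetical sample rows standing in for the Hierarchy sheet.
sample = [("root", "", 0), ("a", "root", 1), ("b", "a", 2), ("c", "missing", 2)]
edges, children, orphans = build_has_parent(sample)
print(edges)     # [('a', 'root'), ('b', 'a')]
print(orphans)   # ['c']
```

Reporting orphans separately makes dangling ParentId values visible instead of silently dropping them.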

SHOULD — significant quality-of-life for downstream

S2a. Recover dropped SP3D columns

Missing from XLSX but present in AllProperties_*.csv:

  • SmartPlant 3D|Flow Direction — critical for piping flow graph analysis
  • SmartPlant 3D|Cut length (lowercase l — possibly a typo)
  • 객체이름 (Korean display name alternative)

S2b. Resolve multi-category Pipeline name collision

Current: FindKey(columns, "Pipeline") selects the first column whose PropertyName is "Pipeline", typically SmartPlant 3D|Pipeline. If other categories have a "Pipeline" property, those are silently ignored.

Impact: 10+ pipelines are lost for downstream summary consumers (147 in XLSX Pipeline_Summary vs 157 in SQLite backend).
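One possible direction, sketched in Python rather than in the exporter's C#, is to collect the Pipeline value from every category instead of stopping at the first match. The `Category|PropertyName` key shape follows the keys quoted in this issue. The function name, the second category, and the sample values are hypothetical.

```python
def pipeline_values(props: dict[str, str]) -> dict[str, str]:
    """Return {category: value} for every non-empty property named 'Pipeline'."""
    found = {}
    for key, value in props.items():
        # Split "Category|PropertyName" on the last separator.
        category, _, name = key.rpartition("|")
        if name.strip().lower() == "pipeline" and value.strip():
            found[category or "(no category)"] = value.strip()
    return found


# Hypothetical object carrying a Pipeline property in two categories.
props = {
    "SmartPlant 3D|Pipeline": "P-001",
    "Other Category|Pipeline": "P-001-A",
    "SmartPlant 3D|Flow Direction": "Downstream",
}
print(pipeline_values(props))
# {'SmartPlant 3D': 'P-001', 'Other Category': 'P-001-A'}
```

How to merge or prioritize the collected values is left open; the point is only that no category is discarded before that decision is made.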

S2c. Filter placeholder "Pipelines" label from Pipeline field

Current: 153 objects at Level 7 have SmartPlant 3D|Pipeline = "Pipelines" (the literal folder name) rather than a real pipeline identifier.

S2d. Type-strict columns

Current: All values written as strings. Numeric values (weights, lengths, coordinates, counts) become text cells.

Fix: Emit numeric fields as numbers. Decide on SI vs imperial policy per field.
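A sketch of the coercion step, written in Python for illustration, is shown below. It converts only columns on an explicit allow-list, so identifier-like strings such as "0042" keep their leading zeros. The column names in `NUMERIC_COLUMNS` are hypothetical, and unit handling (SI vs imperial) is deliberately left out.

```python
import math

# Hypothetical allow-list; the real list depends on the export schema.
NUMERIC_COLUMNS = {"Level", "Weight", "Length"}


def to_number(text: str):
    """Return an int or float when text is a finite number, else the text unchanged."""
    stripped = text.strip()
    try:
        value = float(stripped)
    except ValueError:
        return text
    if not math.isfinite(value):      # reject "nan" / "inf" spellings
        return text
    if stripped.lstrip("+-").isdigit():
        return int(stripped)
    return value


def coerce_row(row: dict[str, str]) -> dict:
    """Coerce allow-listed columns; leave every other column as a string."""
    return {k: to_number(v) if k in NUMERIC_COLUMNS else v for k, v in row.items()}


print(coerce_row({"Level": "7", "Weight": "12.50", "Tag": "0042", "Length": "N/A"}))
# {'Level': 7, 'Weight': 12.5, 'Tag': '0042', 'Length': 'N/A'}
```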

NICE-TO-HAVE

  • N2a. Parquet output option (--format parquet flag) — Foundry and Spark tools ingest Parquet natively
  • N2b. Emit relationship sheets: AdjacentTo, HasMaterial, HasSpecification, BelongsToPipeline, SupportedBy
  • N2c. Canonical reference tables: Materials.csv, Specifications.csv, Pipelines.csv
  • N2d. Change tracking between exports: ChangeLog.csv with added/removed/modified diff
  • N2e. Embedded metadata sheet — export timestamp, version, source file hash, parameters
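For N2d, the added/removed/modified diff could be as simple as a set comparison keyed by ObjectId. The following is a rough Python sketch under that assumption. The function name and the in-memory shape (one dict of properties per object) are hypothetical and say nothing about how ChangeLog.csv would actually be laid out.

```python
def diff_exports(old: dict[str, dict], new: dict[str, dict]) -> dict[str, list[str]]:
    """Compare two exports keyed by ObjectId and list added/removed/modified IDs."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    # An object counts as modified when any of its properties differ.
    modified = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return {"added": added, "removed": removed, "modified": modified}


# Hypothetical snapshots: A is reclassified, B disappears, C is new.
old_export = {"A": {"Class": "Piping"}, "B": {"Class": "Structure"}}
new_export = {"A": {"Class": "Structure"}, "C": {"Class": "Electrical"}}
print(diff_exports(old_export, new_export))
# {'added': ['C'], 'removed': ['B'], 'modified': ['A']}
```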

Effort estimate

| Task | Effort |
| --- | --- |
| Part 1 InferClass fix + unit tests | 0.5 day |
| Regression test with snapshot fixture | 0.5 day |
| M2a/b/c hierarchy preservation | 0.5 day |
| S2a-d schema improvements | 1.5 days |
| Total MUST+SHOULD | ~3 days |
| Total including NICE-TO-HAVE | ~9 days |

Downstream validation

The first-ontology-project repository has 192 tests exercising the XLSX classification logic via a Python port. After applying the C# fix, update the Python port keywords to use regex word boundaries as well, then re-run:

.venv/bin/python -m pytest tests/test_ingest/test_xlsx_classifier.py

A 100% match after both sides are fixed is the acceptance criterion.
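For the Python side, a word-boundary version of Tier 3 might look like the sketch below. It is not the actual port: the names `TIER3_RULES` and `infer_class_tier3` are hypothetical, and the keyword lists are copied from Part 1. Because "Pipe" is a standalone word in "Pipe Rack" and "Pipe Trench", `\bpipe\b` on its own would still match those folders, so the sketch adds a negative lookahead for them. The stems `struct`, `ventilat`, and `instrument` get a `\w*` suffix so they still match full words. Both adjustments are assumptions about the intended behavior.

```python
import re


def _rx(pattern: str) -> re.Pattern:
    return re.compile(pattern, re.IGNORECASE)


# Order matters: the first matching rule wins, as in the C# Tier 3 chain.
TIER3_RULES = [
    ("Piping", _rx(r"\b(?:pipe(?!\s+(?:rack|trench)\b)|valve|flange|elbow|tee"
                   r"|reducer|nozzle|coupling)\b")),
    ("Equipment", _rx(r"\b(?:equipment|vessel|pump|tank|compressor|exchanger"
                      r"|heater|reactor)\b")),
    ("Structure", _rx(r"\b(?:struct\w*|steel|beam|column|brace|foundation|slab"
                      r"|plate|grating|handrail|ladder|stair)\b")),
    ("Electrical", _rx(r"\b(?:electrical|cable|conduit|tray)\b")),
    ("HVAC", _rx(r"\b(?:hvac|duct|ventilat\w*)\b")),
    ("Instrumentation", _rx(r"\binstrument\w*\b")),
]


def infer_class_tier3(sys_path: str, display_name: str, prop_keys=()) -> str:
    """Classify by whole-word keyword match over path, name, and property keys."""
    combined = " ".join([sys_path, display_name, *prop_keys])
    for label, pattern in TIER3_RULES:
        if pattern.search(combined):
            return label
    return "Other"


print(infer_class_tier3("Electrical Device > Steel > Member", "Member-1"))  # Structure
print(infer_class_tier3("> A1 > U12 > Civil > Pipe Rack", "Beam-1"))        # Structure
print(infer_class_tier3("> Piping > P-001", "Pipe-1-0042"))                 # Piping
```

The three calls mirror the C# tests in Part 1, which gives both implementations the same minimal fixture to agree on.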


Evidence files

Full archive in the downstream project:
https://github.com/tygwan/first-ontology-project/tree/main/docs/findings/2026-04-12-M1-piping-misclassification

Contains:

  • README.md — 5-section finding report
  • audit.py — reproducible diagnosis script
  • make_figures.py — matplotlib visualization generator
  • data/ — 5 CSV evidence files (confidence breakdown, substring causes, etc.)
  • figures/ — 4 PNG charts (confidence distribution, substring causes, misclassified prefixes, class distribution)
  • dxtnavis-pr-draft.md — full PR draft (this issue body is a condensed version)
