Full archive with figures, audit script, and reproducible data:
https://github.com/tygwan/first-ontology-project/tree/main/docs/findings/2026-04-12-M1-piping-misclassification
Related downstream project: first-ontology-project — a BIM ontology/knowledge graph pipeline built on top of DXTnavis outputs. The full audit was performed during that project's Phase 1 verification.
Summary
Two categories of changes to Services/RefinedXlsxExporter.cs:
- Bug fix: InferClass uses case-insensitive substring matching, which produces 997 false-positive Piping classifications on the 2026-04-07 snapshot (~34% inflation of the Piping class). Root cause: Piping keyword "tee" matches "steel", and "pipe" matches "Pipe Rack" / "Pipe Trench" folder names in the system path.
- Data extraction enhancement: Expand the XLSX output so downstream consumers (knowledge graph / ontology pipelines / Palantir Foundry) can build on top of DXTnavis without joining multiple raw exports.
Part 1 — Bug fix: InferClass substring matching
Current behavior
InferClass (lines 298-375) performs Tier 3 classification using .Contains() which is substring matching without word boundaries:
```csharp
string combined = (sysPath + " " + displayName).ToLowerInvariant();
foreach (var key in objData.Keys)
{
    if (key.StartsWith("__")) continue;
    combined += " " + key.ToLowerInvariant();
}
if (combined.Contains("pipe") || combined.Contains("valve") ||
    combined.Contains("flange") || combined.Contains("elbow") ||
    combined.Contains("tee") || combined.Contains("reducer") ||
    combined.Contains("nozzle") || combined.Contains("coupling"))
    return "Piping";
```
Problem
| Intended keyword | Actually matches | Impact |
|---|---|---|
| "tee" (Tee pipe fitting) | "steel" (via s-TEE-l) | 10 steel structural members → Piping |
| "pipe" (Pipe fitting) | "Pipe Rack" folder (structural support) | 698 objects → Piping |
| "pipe" (Pipe fitting) | "Pipe Trench" folder (civil work) | 60 objects → Piping |
| "pipe" (Pipe fitting) | "Pipeline" folder (grouping node) | 12 objects → Piping |
Concrete example
Object: MemberSystem-1-0151
System Path: For Review.nwd > Electrical Device > Steel > MemberSystem-1-0151
Navisworks properties: 항목|유형, 항목|이름, 항목|소스 파일, ... (Item|Type, Item|Name, Item|Source File; 8 Item metadata properties only)
SP3D properties: none
Current XLSX Class: Piping ← BUG
Correct Class: Structure (steel member)
Because the system path contains "Steel", the tee keyword (intended for pipe Tee fittings) hits via substring match s-TEE-l. Piping is evaluated first in Tier 3, so Structure (which would match on the "steel" keyword) never gets a chance.
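The false positive is easy to reproduce outside the exporter. The snippet below is a minimal Python re-creation of the substring check, not the actual C# code; the helper name `substring_hits` is hypothetical, and the property keys that InferClass also appends to the combined string are left out for brevity.

```python
# Hypothetical Python re-creation of the Tier 3 substring check (illustration only).
PIPING_KEYWORDS = ("pipe", "valve", "flange", "elbow", "tee", "reducer", "nozzle", "coupling")

def substring_hits(sys_path, display_name):
    """Return every Piping keyword found as a raw substring, like .Contains()."""
    combined = f"{sys_path} {display_name}".lower()
    return [kw for kw in PIPING_KEYWORDS if kw in combined]

hits = substring_hits(
    "For Review.nwd > Electrical Device > Steel > MemberSystem-1-0151",
    "MemberSystem-1-0151",
)
print(hits)  # ['tee']
```

The only hit is "tee", found inside "steel", which is enough for the Piping branch to return first.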
Quantitative impact (2026-04-07 snapshot, 12,009 objects)
| Piping subset | Count | % of Piping |
|---|---|---|
| HIGH confidence (has pipeline + commodity/spec/NPD metadata) | 2,926 | 72.9% |
| LOW confidence (has some metadata but no pipeline) | 91 | 2.3% |
| LIKELY BUG (no pipeline, no piping metadata) | 997 | 24.8% |
| Total labeled Piping | 4,014 | 100% |
Cross-reference: the Structure class (5,926 objects) has zero sp3d_pipeline or sp3d_eqp_type_0 set, so contamination is unidirectional (Piping absorbs misclassified Structure/Electrical, not the other way).
Breakdown of 997 LIKELY_BUG Piping by root cause:
- Pipe Rack folder → 698
- Pipe Trench folder → 60
- Pipeline folder → 12
- steel substring → tee → 10
- Other complex/nested paths → ~217
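The headline percentages can be re-derived from the counts above. This is a small arithmetic check, with two assumptions: that the "~34% inflation" figure is measured against the HIGH-confidence subset, and that the approximate "other" bucket is exactly 217.

```python
# Counts copied from the tables above (2026-04-07 snapshot).
high, low, likely_bug = 2926, 91, 997
total_piping = high + low + likely_bug

# Share of all Piping labels that are likely wrong.
share_of_piping = likely_bug / total_piping
# Assumed reading of "~34% inflation": likely-bug count relative to HIGH confidence.
inflation_vs_high = likely_bug / high

# "other" is listed as ~217; treated as exact here (assumption).
causes = {"Pipe Rack": 698, "Pipe Trench": 60, "Pipeline": 12, "steel/tee": 10, "other": 217}

print(total_piping)                        # 4014
print(round(share_of_piping * 100, 1))     # 24.8
print(round(inflation_vs_high * 100, 1))   # 34.1
print(sum(causes.values()))                # 997
```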
Proposed fix — word boundary regex
```csharp
using System.Text.RegularExpressions;

// \b stops "tee" from matching inside "steel" and "pipe" inside "Pipeline".
// "Pipe" is a whole word in "Pipe Rack" / "Pipe Trench", so \b alone still
// matches those folders; the negative lookahead excludes them explicitly.
private static readonly Regex PipingKeywordRegex = new Regex(
    @"\b(pipe(?!\s+(rack|trench)\b)|valve|flange|elbow|tee|reducer|nozzle|coupling)\b",
    RegexOptions.IgnoreCase | RegexOptions.Compiled);
private static readonly Regex EquipmentKeywordRegex = new Regex(
    @"\b(equipment|vessel|pump|tank|compressor|exchanger|heater|reactor)\b",
    RegexOptions.IgnoreCase | RegexOptions.Compiled);
private static readonly Regex StructureKeywordRegex = new Regex(
    @"\b(struct|structural|structure|steel|beam|column|brace|foundation|slab|plate|grating|handrail|ladder|stair)\b",
    RegexOptions.IgnoreCase | RegexOptions.Compiled);
private static readonly Regex ElectricalKeywordRegex = new Regex(
    @"\b(electrical|cable|conduit|tray)\b",
    RegexOptions.IgnoreCase | RegexOptions.Compiled);
// Stems need \w* before the closing \b, otherwise "ventilat" never matches
// "ventilation" and "instrument" never matches "instrumentation".
private static readonly Regex HvacKeywordRegex = new Regex(
    @"\b(hvac|duct|ventilat\w*)\b",
    RegexOptions.IgnoreCase | RegexOptions.Compiled);
private static readonly Regex InstrumentationKeywordRegex = new Regex(
    @"\binstrument\w*\b",
    RegexOptions.IgnoreCase | RegexOptions.Compiled);
```
Then in InferClass Tier 3:
```csharp
if (PipingKeywordRegex.IsMatch(combined)) return "Piping";
if (EquipmentKeywordRegex.IsMatch(combined)) return "Equipment";
if (StructureKeywordRegex.IsMatch(combined)) return "Structure";
if (ElectricalKeywordRegex.IsMatch(combined)) return "Electrical";
if (HvacKeywordRegex.IsMatch(combined)) return "HVAC";
if (InstrumentationKeywordRegex.IsMatch(combined)) return "Instrumentation";
return "Other";
```
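Word boundaries are worth checking against the actual problem strings before relying on them. The Python sketch below (Python because the downstream port uses it) compares a plain `\b...\b` pattern with an assumed variant that guards "pipe" with a negative lookahead for "rack" / "trench". The names `PLAIN`, `GUARDED`, `STEM_STRICT` and `STEM_OPEN` are illustrative and do not exist in the exporter.

```python
import re

# Plain word-boundary pattern, mirroring the Piping keyword list.
PLAIN = re.compile(
    r"\b(pipe|valve|flange|elbow|tee|reducer|nozzle|coupling)\b", re.IGNORECASE
)

# Assumed variant: "pipe" is ignored when followed by "rack" or "trench".
GUARDED = re.compile(
    r"\b(pipe(?!\s+(?:rack|trench)\b)|valve|flange|elbow|tee|reducer|nozzle|coupling)\b",
    re.IGNORECASE,
)

# A keyword stem with a closing \b versus the same stem allowed to continue.
STEM_STRICT = re.compile(r"\bventilat\b", re.IGNORECASE)
STEM_OPEN = re.compile(r"\bventilat\w*", re.IGNORECASE)

samples = [
    "Electrical Device > Steel > Member",
    "> A1 > U12 > Civil > Pipe Rack",
    "> Piping > P-001 Pipe-1-0042",
]
for text in samples:
    # Print whether each pattern finds a Piping keyword in the sample.
    print(text, bool(PLAIN.search(text)), bool(GUARDED.search(text)))

print(bool(STEM_STRICT.search("Ventilation Duct")), bool(STEM_OPEN.search("Ventilation Duct")))
```

Both patterns reject the "Steel" path and accept the real pipe. The "Pipe Rack" path still matches the plain pattern, because "Pipe" is a whole word there, and only the guarded variant rejects it. The last line shows that a stem followed by `\b` does not match "Ventilation", while the open stem does.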
Expected outcome
| Class | Before (buggy) | After (fixed) | Delta |
|---|---|---|---|
| Piping | 4,014 | ~2,926 | -1,088 |
| Structure | 5,926 | ~6,900 | +974 |
| Electrical | 449 | ~510 | +61 |
| HVAC | 72 | ~72 | 0 |
| Equipment | 851 | ~851 | 0 |
| Other | 697 | ~750 | +53 |
Tests to add
```csharp
[Test]
public void Tee_keyword_should_not_match_steel()
{
    var result = InferClass(emptyProps, "Electrical Device > Steel > Member", "Member-1");
    Assert.That(result, Is.EqualTo("Structure"));
}

[Test]
public void Pipe_keyword_should_not_match_pipe_rack()
{
    var result = InferClass(emptyProps, "> A1 > U12 > Civil > Pipe Rack", "Beam-1");
    Assert.That(result, Is.EqualTo("Structure"));
}

[Test]
public void Pipe_keyword_should_still_match_real_pipe()
{
    var props = new Dictionary<string,string> { { "SmartPlant 3D|Pipeline", "P-001" } };
    var result = InferClass(props, "> Piping > P-001", "Pipe-1-0042");
    Assert.That(result, Is.EqualTo("Piping"));
}
```
Part 2 — Data extraction wishlist
Downstream consumers (BIM ontology / knowledge graph / Foundry) need the following to build on top of DXTnavis without having to join multiple raw exports.
MUST — blocking downstream Phase 2+
M2a. Preserve ParentId in XLSX output
Current: BuildDynamicColumns (lines 204-229) excludes any key starting with __, dropping __ParentId. The XLSX has no way to reconstruct the parent-child hierarchy.
Impact: Downstream consumers must join XLSX with AllProperties.csv or validation.csv to recover ParentId, which is fragile (validation.csv only has 294/12,009 populated rows).
Fix: Add ParentId to the MetaColumns array.
M2b. Preserve Level as an integer column
Current: Level is written as a string via objData["__Level"] = record.Level.ToString().
Fix: Write Level as an int directly: ws.Cell(r, c).Value = record.Level.
M2c. Emit a dedicated Hierarchy sheet
Desired content: three columns, ObjectId (GUID), ParentId, and Level.
Rationale: Even with ParentId preserved in the pivot sheet, a dedicated hierarchy sheet makes it trivial to build HasParent relationship datasets.
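As a rough illustration of what the downstream side would do with such a sheet, the sketch below turns hierarchy rows into HasParent edges. The row layout and the `has_parent_edges` helper are assumptions; reading the sheet itself is omitted.

```python
# Hypothetical rows as they might be read from a Hierarchy sheet.
rows = [
    {"ObjectId": "a", "ParentId": None, "Level": 0},
    {"ObjectId": "b", "ParentId": "a", "Level": 1},
    {"ObjectId": "c", "ParentId": "b", "Level": 2},
    {"ObjectId": "d", "ParentId": "zz", "Level": 2},  # parent missing from the export
]

def has_parent_edges(rows):
    """Return (child, parent) pairs, skipping roots and dangling parents."""
    known = {r["ObjectId"] for r in rows}
    return [
        (r["ObjectId"], r["ParentId"])
        for r in rows
        if r["ParentId"] is not None and r["ParentId"] in known
    ]

edges = has_parent_edges(rows)
print(edges)  # [('b', 'a'), ('c', 'b')]
```

With ParentId and Level in one place, no join against AllProperties.csv or validation.csv is needed to build the relationship dataset.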
SHOULD — significant quality-of-life for downstream
S2a. Recover dropped SP3D columns
Missing from XLSX but present in AllProperties_*.csv:
- SmartPlant 3D|Flow Direction: critical for piping flow graph analysis
- SmartPlant 3D|Cut length (lowercase l, possibly a typo)
- 객체이름 ("object name", a Korean display-name alternative)
S2b. Resolve multi-category Pipeline name collision
Current: FindKey(columns, "Pipeline") selects the first column whose PropertyName is "Pipeline", typically SmartPlant 3D|Pipeline. If other categories have a "Pipeline" property, those are silently ignored.
Impact: 10+ pipelines are lost for downstream summary consumers (147 in XLSX Pipeline_Summary vs 157 in SQLite backend).
S2c. Filter placeholder "Pipelines" label from Pipeline field
Current: 153 objects at Level 7 have SmartPlant 3D|Pipeline = "Pipelines" (the literal folder name) rather than a real pipeline identifier.
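One possible shape for S2b and S2c together, sketched in Python: collect the Pipeline value from every category instead of the first match, and drop the placeholder label. The function name, the second category, and the sample values are hypothetical.

```python
PLACEHOLDER = "Pipelines"  # literal folder name reported in S2c

def pipeline_values(props):
    """Collect non-placeholder values from every 'Category|Pipeline' key."""
    values = []
    for key, value in props.items():
        # Split "Category|Property" on the last "|" and keep the property name.
        _, _, name = key.rpartition("|")
        if name == "Pipeline" and value and value != PLACEHOLDER:
            values.append(value)
    return values

props = {
    "SmartPlant 3D|Pipeline": "Pipelines",  # placeholder, dropped
    "Other Category|Pipeline": "P-001",     # hypothetical second category
    "SmartPlant 3D|Flow Direction": "Upstream",  # hypothetical value
}
print(pipeline_values(props))  # ['P-001']
```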
S2d. Type-strict columns
Current: All values written as strings. Numeric values (weights, lengths, coordinates, counts) become text cells.
Fix: Emit numeric fields as numbers. Decide on SI vs imperial policy per field.
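A minimal sketch of value-level coercion, assuming all values arrive as strings. The helper `coerce` is hypothetical and unit handling is left out.

```python
def coerce(value):
    """Return int or float when the text parses cleanly, else the original string."""
    text = value.strip()
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return value

print([coerce(v) for v in ["7", "12.5", "P-001", ""]])  # [7, 12.5, 'P-001', '']
```

Blind coercion would also turn an identifier such as "0042" into the number 42, so a per-column type policy is likely safer than guessing per cell.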
NICE-TO-HAVE
- N2a. Parquet output option (--format parquet flag): Foundry and Spark tools ingest Parquet natively
- N2b. Emit relationship sheets: AdjacentTo, HasMaterial, HasSpecification, BelongsToPipeline, SupportedBy
- N2c. Canonical reference tables: Materials.csv, Specifications.csv, Pipelines.csv
- N2d. Change tracking between exports: ChangeLog.csv with added/removed/modified diff
- N2e. Embedded metadata sheet: export timestamp, version, source file hash, parameters
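For N2d, the added/removed/modified diff could be as simple as a keyed comparison of two snapshots. The sketch below assumes each export can be loaded as a mapping from object id to a property dictionary; `diff_exports` is a hypothetical helper.

```python
def diff_exports(old, new):
    """Compare two {object_id: properties} snapshots."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    # An object is modified when it exists in both snapshots with different properties.
    modified = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return {"added": added, "removed": removed, "modified": modified}

old = {"a": {"Class": "Piping"}, "b": {"Class": "Structure"}}
new = {"a": {"Class": "Structure"}, "c": {"Class": "Other"}}
print(diff_exports(old, new))
# {'added': ['c'], 'removed': ['b'], 'modified': ['a']}
```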
Effort estimate
| Task | Effort |
|---|---|
| Part 1 InferClass fix + unit tests | 0.5 day |
| Regression test with snapshot fixture | 0.5 day |
| M2a/b/c hierarchy preservation | 0.5 day |
| S2a-d schema improvements | 1.5 days |
| Total MUST+SHOULD | ~3 days |
| Total including NICE-TO-HAVE | ~9 days |
Downstream validation
The first-ontology-project repository has 192 tests exercising the XLSX classification logic via a Python port. After applying the C# fix, update the Python port keywords to use regex word boundaries as well, then re-run:
.venv/bin/python -m pytest tests/test_ingest/test_xlsx_classifier.py
A 100% match between the C# output and the Python port, once both sides are fixed, is the acceptance criterion.
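The acceptance check can be expressed as a match rate over object ids. The sketch below assumes both classifiers' results are available as mappings from object id to class label; `match_rate` and the sample labels are hypothetical.

```python
def match_rate(csharp_labels, python_labels):
    """Fraction of object ids whose class agrees in both mappings."""
    ids = csharp_labels.keys() | python_labels.keys()
    if not ids:
        return 1.0
    # An id missing on one side counts as a mismatch.
    same = sum(1 for i in ids if csharp_labels.get(i) == python_labels.get(i))
    return same / len(ids)

cs = {"a": "Piping", "b": "Structure", "c": "Other"}
py = {"a": "Piping", "b": "Structure", "c": "Piping"}
print(round(match_rate(cs, py), 3))  # 0.667
```

The criterion above corresponds to `match_rate(...) == 1.0` on the full snapshot.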
Evidence files
Full archive in the downstream project:
https://github.com/tygwan/first-ontology-project/tree/main/docs/findings/2026-04-12-M1-piping-misclassification
Contains:
README.md — 5-section finding report
audit.py — reproducible diagnosis script
make_figures.py — matplotlib visualization generator
data/ — 5 CSV evidence files (confidence breakdown, substring causes, etc.)
figures/ — 4 PNG charts (confidence distribution, substring causes, misclassified prefixes, class distribution)
dxtnavis-pr-draft.md — full PR draft (this issue body is a condensed version)