@mountainMath
Owner

Summary

  • Optimize add_hierarchy() function with memoized path tracing instead of iterative while loop
  • Optimize fold_in_metadata_for_columns() with pre-split coordinate matrix
  • Optimize factor conversion with lazy matrix initialization

Performance Benchmarks

add_hierarchy() - Hierarchy Building

| Test Case | Original | Optimized | Improvement |
|---|---|---|---|
| 6 small dimensions (2-11 members each) | 0.0243s | 0.0058s | 4.2x faster (76% less time) |
| NAICS hierarchy (337 members) | 0.0314s | 0.0037s | 8.5x faster (88% less time) |

The optimization replaces an iterative while loop (up to 100 iterations) with single-pass memoized path tracing:

  • Old: Each iteration runs strsplit() on all rows to find the current top-level ancestor, paste0() to prepend its parent, and a string comparison to detect whether anything changed
  • New: Build the parent lookup once, then trace each node to the root, reusing cached ancestor paths
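
The single-pass idea can be illustrated with a minimal base R sketch. This is not the code from this PR: the function name `build_hierarchy_paths`, its arguments, and the "." separator are assumptions made for illustration, and the sketch assumes the parent links contain no cycles.

```r
# Hypothetical helper (not the package's add_hierarchy()): builds a
# "root. ... .node" path for every member from ID / parent-ID vectors.
build_hierarchy_paths <- function(ids, parent_ids) {
  ids <- as.character(ids)
  parent_of <- as.character(parent_ids)
  names(parent_of) <- ids                 # parent lookup, built once
  cache <- new.env(parent = emptyenv())   # memo: id -> full path to root

  trace_path <- function(id) {
    cached <- cache[[id]]
    if (!is.null(cached)) return(cached)  # ancestor path already known
    parent <- parent_of[[id]]
    path <- if (is.na(parent) || !(parent %in% ids)) {
      id                                  # no known parent: treat as root
    } else {
      paste0(trace_path(parent), ".", id) # reuse the parent's cached path
    }
    assign(id, path, envir = cache)
    path
  }

  vapply(ids, trace_path, character(1), USE.NAMES = FALSE)
}

# Made-up example: member 1 is the root, 2 and 4 are its children, 3 is under 2.
build_hierarchy_paths(c(1, 2, 3, 4), c(NA, 1, 2, 1))
# "1" "1.2" "1.2.3" "1.4"
```

Because every path is stored the first time it is computed, each node is resolved once regardless of row order, which is what removes the need for a repeat-until-stable loop.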

fold_in_metadata_for_columns() - Coordinate Parsing

| Test Case | Original | Optimized | Improvement |
|---|---|---|---|
| 169k rows, 6 dimensions | 0.057s | 0.059s | ~same |

The matrix-based approach has similar overhead to the lapply version at typical table sizes, so this change is performance-neutral in this benchmark.
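
The two access patterns being compared can be sketched in base R as follows. The coordinate values are made up, and the sketch substitutes `strsplit()` plus `rbind` for the `str_split_fixed` call named in the commit message (and `vapply` for `lapply`) so that it runs without extra packages.

```r
coords <- c("1.1.2", "1.3.1", "2.1.4")   # made-up coordinate strings

# List-style: split into a list, then pull position i out of every element.
parts <- strsplit(coords, ".", fixed = TRUE)
dim2_list <- vapply(parts, function(d) d[2], character(1))

# Matrix-style: bind the pieces into an n x k character matrix once,
# then read each dimension as a column.
coord_matrix <- do.call(rbind, parts)
dim2_matrix <- coord_matrix[, 2]

# Both routes give the same second-dimension values.
stopifnot(identical(dim2_list, dim2_matrix))
```

The matrix form does the per-row work once and turns each later lookup into a column index, but the split itself is still done for every row, which fits the near-identical timings above.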

Correctness Verification

Verified exact output equivalence by installing both package versions and comparing results:

Table 20-10-0085 (169,108 rows, 33 columns, 6 hierarchy columns):

Column names identical: TRUE
Data identical: TRUE
All hierarchy columns identical: TRUE

Table 20-10-0056 (73,530 rows, 27 columns, 4 hierarchy columns):

Column names identical: TRUE
Data identical: TRUE
All hierarchy columns identical: TRUE
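
A check of this kind can be sketched in base R as follows. The helper `compare_outputs` and the toy data frames are hypothetical and only show the shape of the comparison; the actual verification compared the output of the two installed package versions.

```r
# Hypothetical helper: report whether two result tables match exactly.
compare_outputs <- function(old, new, hierarchy_cols = character(0)) {
  c(
    names_identical = identical(names(old), names(new)),
    data_identical = identical(old, new),
    hierarchy_identical = all(vapply(
      hierarchy_cols,
      function(col) identical(old[[col]], new[[col]]),
      logical(1)
    ))
  )
}

# Toy stand-ins for the output of the old and new package versions.
old_result <- data.frame(VALUE = c(1.5, 2.5), hier = c("1", "1.2"))
new_result <- data.frame(VALUE = c(1.5, 2.5), hier = c("1", "1.2"))

compare_outputs(old_result, new_result, hierarchy_cols = "hier")
```

`identical()` is strict about types and attributes as well as values, so a passing result means the outputs match exactly rather than approximately.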

Test plan

  • All existing tests pass (devtools::test() - 20 tests)
  • devtools::check() passes with 0 errors, 0 warnings
  • Output verified identical on tables 20-10-0085 and 20-10-0056
  • Hierarchy columns specifically verified identical

🤖 Generated with Claude Code

mountainMath and others added 3 commits January 20, 2026 17:39

Exclude Claude Code configuration files from package builds
to avoid NOTEs during R CMD check.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Performance improvements to metadata processing:

1. add_hierarchy() - Replace iterative while loop with memoized path tracing
   - Old: Up to 100 iterations, each doing strsplit + paste0 on all rows
   - New: Single pass with cached ancestor paths
   - Benchmark: 76-88% faster (8.5x on large hierarchies)

2. fold_in_metadata_for_columns() - Pre-split coordinates into matrix
   - Old: lapply(...pos, function(d) d[i]) for each column
   - New: str_split_fixed once, then direct matrix indexing

3. Factor conversion - Lazy matrix initialization
   - Old: str_split + lapply for each field needing rename
   - New: Split once if needed, reuse matrix across all fields

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>