Optimize Scikit-learn model loading by adding Bulk Tree Construction API #651

Merged
hcho3 merged 6 commits into dmlc:mainline from dantegd:optimize-sklearn-loader on Feb 27, 2026

dantegd (Contributor) commented Dec 19, 2025

This PR introduces a bulk tree construction API that significantly improves performance when importing scikit-learn RandomForest models into Treelite. In my benchmarks, the new API achieves a roughly 7-10x speedup over the node-by-node construction used by the current sklearn loader.

The current implementation spends significant time in per-node overhead due to:

  • Repeated ModelBuilder method calls for each node
  • Python-C++ boundary crossing overhead accumulating across millions of nodes
  • Memory allocation patterns that don't benefit from bulk operations

This becomes a bottleneck in workflows like cuML's RandomForestClassifier.from_sklearn(), where Treelite import time dominates the conversion process.

This PR implements a BulkConstructTree friend function that directly populates the Tree class's internal ContiguousArray members in a single pass, bypassing the ModelBuilder abstraction for sklearn imports.
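To illustrate the difference between the two construction styles, here is a minimal Python sketch. It is not Treelite's code: `ToyTree`, `add_node`, and `bulk_construct` are hypothetical names invented for this example, and the real `BulkConstructTree` is a C++ function working on `ContiguousArray` members. The input arrays follow scikit-learn's `tree_` layout, where `-1` marks a missing child and `-2` marks an undefined feature or threshold on a leaf.

```python
from array import array

TREE_LEAF = -1  # scikit-learn marks a missing child with -1


class ToyTree:
    """Hypothetical stand-in for a tree stored as parallel contiguous arrays."""

    def __init__(self):
        self.left = array("i")
        self.right = array("i")
        self.feature = array("i")
        self.threshold = array("d")
        self.calls = 0  # how many times a caller entered this class

    def add_node(self, left, right, feature, threshold):
        """Node-by-node path: one call and four appends per node."""
        self.calls += 1
        self.left.append(left)
        self.right.append(right)
        self.feature.append(feature)
        self.threshold.append(threshold)

    def bulk_construct(self, left, right, feature, threshold):
        """Bulk path: validate lengths once, then copy each column whole."""
        self.calls += 1
        n = len(left)
        if not (len(right) == len(feature) == len(threshold) == n):
            raise ValueError("all node arrays must have the same length")
        self.left = array("i", left)
        self.right = array("i", right)
        self.feature = array("i", feature)
        self.threshold = array("d", threshold)


# A three-node stump in scikit-learn's array layout: node 0 splits on
# feature 2 at threshold 0.5; nodes 1 and 2 are leaves.
children_left = [1, TREE_LEAF, TREE_LEAF]
children_right = [2, TREE_LEAF, TREE_LEAF]
feature = [2, -2, -2]
threshold = [0.5, -2.0, -2.0]

per_node = ToyTree()
for node in zip(children_left, children_right, feature, threshold):
    per_node.add_node(*node)

bulk = ToyTree()
bulk.bulk_construct(children_left, children_right, feature, threshold)

print(per_node.calls, bulk.calls)
```

Both paths end up with identical arrays, but the per-node path enters the tree object once per node while the bulk path enters it once per tree. With millions of nodes and a Python-C++ boundary in between, that call count is where the overhead described above accumulates.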

Initial benchmarks:

| Configuration | Total Nodes | Old API (ms) | Bulk API (ms) | Speedup |
|---|---|---|---|---|
| classifier, 50 trees, depth=10 | 39,844 | 13.5 | 1.8 | 7.45x |
| classifier, 100 trees, depth=15 | 351,826 | 77.3 | 10.2 | 7.54x |
| classifier, 300 trees, depth=20 | 2,520,062 | 544.9 | 60.7 | 8.98x |
| regressor, 100 trees, depth=15 | 978,436 | 195.6 | 18.8 | 10.42x |
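The speedup column appears to be the old timing divided by the bulk timing. A short sketch that recomputes the ratio from the millisecond values shown in the benchmark rows; the recomputed figures can differ slightly from the reported ones, presumably because the displayed timings are rounded (that explanation is an assumption, not something stated in the PR).

```python
# (configuration, old API ms, bulk API ms, speedup as reported in the PR)
rows = [
    ("classifier, 50 trees, depth=10", 13.5, 1.8, 7.45),
    ("classifier, 100 trees, depth=15", 77.3, 10.2, 7.54),
    ("classifier, 300 trees, depth=20", 544.9, 60.7, 8.98),
    ("regressor, 100 trees, depth=15", 195.6, 18.8, 10.42),
]


def speedup(old_ms, bulk_ms):
    """Ratio of old to new wall time; larger means the bulk API is faster."""
    return old_ms / bulk_ms


recomputed = [round(speedup(old, new), 2) for _, old, new, _ in rows]

for (name, _, _, reported), value in zip(rows, recomputed):
    print(f"{name}: reported {reported}x, recomputed {value}x")
```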

dantegd marked this pull request as ready for review on January 6, 2026 18:40

codecov Bot commented Jan 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 84.83%. Comparing base (3e70b1d) to head (7353713).
⚠️ Report is 1 commits behind head on mainline.

Additional details and impacted files
@@             Coverage Diff              @@
##           mainline     #651      +/-   ##
============================================
+ Coverage     84.38%   84.83%   +0.45%     
============================================
  Files            75       76       +1     
  Lines          6653     6813     +160     
  Branches        543      557      +14     
============================================
+ Hits           5614     5780     +166     
+ Misses         1039     1033       -6     

hcho3 (Collaborator) left a comment

Can we go ahead and simply remove the old sklearn model builder functions? Then LoadSKLearnRandomForestRegressorBulk would simply be called LoadSKLearnRandomForestRegressor, and so on.

I don't see a good reason to keep the old functions around if the new ones are functionally equivalent but faster.

@hcho3 hcho3 merged commit e898272 into dmlc:mainline Feb 27, 2026
18 checks passed
rapids-bot Bot pushed a commit to rapidsai/cuml that referenced this pull request Mar 13, 2026
Update Treelite to 4.7.0 to incorporate the following improvements:
* dmlc/treelite#655
* dmlc/treelite#651

Authors:
  - Philip Hyunsu Cho (https://github.com/hcho3)
  - Jim Crist-Harif (https://github.com/jcrist)

Approvers:
  - Jim Crist-Harif (https://github.com/jcrist)
  - James Lamb (https://github.com/jameslamb)
  - Bradley Dice (https://github.com/bdice)

URL: #7870