Add polars support.#3

Open

aryehlev wants to merge 6 commits into publish from polars

Conversation


@aryehlev aryehlev commented Nov 17, 2025

Summary by CodeRabbit

  • New Features

    • Optional Polars integration: enable a feature to run model predictions directly on Polars DataFrames with three modes (auto numeric, selected columns, or separate float/categorical columns).
  • Chores

    • Build/install process now uses compile-time platform handling, improving library installation and rpath behavior on macOS/Linux while preserving Windows behavior.


coderabbitai bot commented Nov 17, 2025

Walkthrough

Adds optional Polars DataFrame integration behind a feature flag (ModelPolarsExt); refactors build.rs to use compile-time cfg-based OS/arch detection and per-OS logic; updates Cargo.toml to add an optional polars dependency and feature.

Changes

  • Cargo.toml (optional dependency & feature): Added optional dependency polars = { version = "0.45", optional = true, default-features = false, features = ["dtype-full"] } and feature polars = ["dep:polars"].
  • build.rs (build script refactor): Replaced runtime platform detection with compile-time cfg helpers (get_arch_from_target(), get_os_name(), get_lib_filename()), switched OS-specific flows to #[cfg(target_os = "...")] blocks, and adjusted library filename/download logic, rpath handling, and platform-specific install tool invocations.
  • src/lib.rs, src/polars_ext.rs (Polars integration): Added a #[cfg(feature = "polars")] module and pub use re-export; introduced pub trait ModelPolarsExt and impl ModelPolarsExt for Model with predict_dataframe, predict_dataframe_with_columns, and predict_dataframe_with_types, plus helpers that convert Polars DataFrame columns to float and categorical vectors and map errors to CatBoostError (see the usage sketch below).
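For orientation, a minimal usage sketch of the trait summarized above, with the polars feature enabled. The crate name, Model::load, the model path, and the column names are assumptions for illustration and are not taken from this PR.

use catboost::{Model, ModelPolarsExt};
use polars::prelude::*;

fn main() {
    // Hypothetical model file and column names, for illustration only.
    let model = Model::load("model.cbm").expect("failed to load model");
    let df = df![
        "age" => [34.0_f32, 52.0, 21.0],
        "income" => [48_000.0_f32, 72_500.0, 31_000.0],
        "country" => ["US", "DE", "FR"]
    ]
    .expect("failed to build DataFrame");

    // Mixed float/categorical prediction via the extension trait.
    let preds = model
        .predict_dataframe_with_types(&df, &["age", "income"], &["country"])
        .expect("prediction failed");
    println!("{preds:?}");
}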

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant ModelPolarsExt
    participant Select as select_columns?
    participant ToFloat as dataframe_to_float_features
    participant ToCat as dataframe_to_cat_features
    participant Calc as calc_model_prediction

    User->>ModelPolarsExt: call predict_dataframe / variants
    ModelPolarsExt->>Select: optionally select columns (variant)
    Select-->>ModelPolarsExt: DataFrame or selection
    ModelPolarsExt->>ToFloat: convert to Vec<Vec<f32>>
    ToFloat-->>ModelPolarsExt: float features
    ModelPolarsExt->>ToCat: convert to Vec<Vec<String>> (optional)
    ToCat-->>ModelPolarsExt: cat features
    ModelPolarsExt->>Calc: calc_model_prediction(float, cat)
    Calc-->>ModelPolarsExt: Vec<f64> predictions
    ModelPolarsExt-->>User: return predictions or CatBoostError

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Pay extra attention to build.rs cfg-branch correctness, rpath emission, macOS install_name_tool and Linux patchelf invocations, and Windows DLL/LIB handling.
  • Review src/polars_ext.rs for type conversions (extract_f32_value, extract_string_value), null handling, and error mapping to CatBoostError.
  • Verify Cargo.toml feature wiring and #[cfg(feature = "polars")] usage in src/lib.rs.

Poem

🐇 I hop through rows and columns bright,
Turning frames to features overnight.
Compile-time paths find every door,
Predictions prance from field to score.
A carrot clap for code once more! 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 57.14%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title check: ✅ Passed. The title 'Add polars support' clearly and concisely describes the main change: introducing Polars DataFrame extension support, which aligns with the primary additions in Cargo.toml, src/lib.rs, and src/polars_ext.rs.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch polars

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (4)
build.rs (2)

12-24: Architecture detection from TARGET is fine but only handles a small set of triples.

get_arch_from_target correctly keys off TARGET instead of host, but it only recognizes x86_64, aarch64/arm64, and i686/i586, panicking otherwise. That’s reasonable if CatBoost only ships binaries for those arches, but it does mean any other target will fail at build‑time even if someone doesn’t actually try to download/run CatBoost there.

If you want a softer failure mode, you could return a Result<&'static str, String> here and bubble that up into a more descriptive CatBoost‑specific error instead of panicking.
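A minimal sketch of that approach, assuming the TARGET-prefix matching described above; the returned arch strings (including the i686/i586 mapping) are assumptions rather than the PR's code:

fn get_arch_from_target() -> Result<&'static str, String> {
    let target = std::env::var("TARGET").map_err(|_| "TARGET not set by Cargo".to_string())?;
    if target.starts_with("x86_64") {
        Ok("x86_64")
    } else if target.starts_with("aarch64") || target.starts_with("arm64") {
        Ok("aarch64")
    } else if target.starts_with("i686") || target.starts_with("i586") {
        Ok("x86")
    } else {
        // Surface a CatBoost-specific message instead of panicking at build time.
        Err(format!(
            "CatBoost prebuilt libraries are not available for target `{}`",
            target
        ))
    }
}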


341-357: Shared library copy, soname, and rpath logic look good for native builds; cross‑compile behavior may be off.

The new unified flow—derive lib_filename, copy from OUT_DIR/libs to target/{PROFILE}, then adjust install name/soname and emit OS‑specific rpaths—looks coherent for macOS/Linux/Windows when building natively. Using patchelf/install_name_tool with ignored exit codes is also a sensible “best effort” approach.

  • Same as above, the #[cfg(target_os = ...)] here will follow the host OS, not the Cargo target, which can surprise anyone cross‑compiling. Once you switch OS detection to use CARGO_CFG_TARGET_OS, you may also want these blocks to branch on that value instead of cfg.
  • For non‑default targets, the final binary path is typically target/<triple>/<profile>. Right now you copy into target/{PROFILE}; this is fine for native builds, but for exotic targets you may want to consider copying into target/<triple>/{PROFILE} instead (or at least documenting that only native builds are supported).

Also applies to: 359-390, 399-447

src/polars_ext.rs (2)

110-136: Row/column iteration is correct but can be slightly optimized.

dataframe_to_float_features is logically correct: it builds row‑major Vec<Vec<f32>> by iterating over rows and then columns. Minor nits:

  • df.get_columns() is called on each row iteration; you can hoist it out of the outer loop.
  • as_materialized_series() is also called repeatedly per cell; depending on how cheap this is in Polars 0.45, it might be worth calling it once per column outside the row loop.

A minimal tweak:

fn dataframe_to_float_features(df: &DataFrame) -> CatBoostResult<Vec<Vec<f32>>> {
    let num_rows = df.height();
    let num_features = df.width();
    if num_rows == 0 || num_features == 0 { /* … */ }

-   let mut result = Vec::with_capacity(num_rows);
-   for row_idx in 0..num_rows {
-       let mut row_features = Vec::with_capacity(num_features);
-       for col in df.get_columns() {
-           let series = col.as_materialized_series();
-           let value = extract_f32_value(series, row_idx)?;
-           row_features.push(value);
-       }
-       result.push(row_features);
-   }
+   let cols: Vec<&Series> = df.get_columns()
+       .iter()
+       .map(|c| c.as_materialized_series())
+       .collect();
+   let mut result = Vec::with_capacity(num_rows);
+   for row_idx in 0..num_rows {
+       let mut row_features = Vec::with_capacity(num_features);
+       for series in &cols {
+           let value = extract_f32_value(series, row_idx)?;
+           row_features.push(value);
+       }
+       result.push(row_features);
+   }

    Ok(result)
}

Please verify in Polars 0.45 docs that as_materialized_series() doesn’t unexpectedly clone data when reused this way.


138-165: Categorical feature conversion mirrors the float conversion; similar perf tweaks apply.

dataframe_to_cat_features mirrors the float conversion logic and is correct in terms of building Vec<Vec<String>>. The same minor optimizations (hoisting get_columns() and as_materialized_series() outside the row loop) apply here as well.

If you adopt the optimization for floats, consider applying the same pattern here to keep behavior and performance characteristics aligned.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5693195 and 42b0739.

📒 Files selected for processing (4)
  • Cargo.toml (2 hunks)
  • build.rs (8 hunks)
  • src/lib.rs (1 hunks)
  • src/polars_ext.rs (1 hunks)
🔇 Additional comments (4)
src/lib.rs (1)

16-19: Polars extension module gating and re‑export look solid.

Conditionally compiling mod polars_ext; and pub use crate::polars_ext::ModelPolarsExt; behind #[cfg(feature = "polars")] cleanly exposes the extension trait only when Polars is enabled, without affecting existing users.
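For reference, the gating described here amounts to something like the following in src/lib.rs (exact placement in the PR may differ slightly):

// Compiled and re-exported only when the optional `polars` feature is enabled.
#[cfg(feature = "polars")]
mod polars_ext;

#[cfg(feature = "polars")]
pub use crate::polars_ext::ModelPolarsExt;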

src/polars_ext.rs (2)

5-56: Trait surface and docs for ModelPolarsExt look good and are well‑scoped.

The three methods (predict_dataframe, predict_dataframe_with_columns, predict_dataframe_with_types) cover the main Polars use‑cases without polluting the base Model API, and the doc example is clear. The design choice to keep the extension as a separate trait is nice for ergonomics.

If you expect heavy Polars usage, you might consider later adding a short section to the crate README showing how to enable the polars feature and use these methods end‑to‑end.


167-271: Note: The suggested mitigation to switch to f64 is not feasible given the CatBoost C binding constraint.

The calc_model_prediction function has a hardcoded generic constraint TFloatFeature: AsRef<[f32]> and calls the C FFI function CalcModelPredictionWithHashedCatFeaturesAndTextAndEmbeddingFeatures, which expects f32 pointers. The CatBoost C library itself is bound to f32, so switching to f64 would require modifying the FFI bindings in the sys crate, not just updating calc_model_prediction at this layer.

That said, the code correctly implements what CatBoost requires. The precision loss concern is technically valid (f32 can represent integers precisely only up to 2^24), but without evidence of large integer features in your use case, it remains a theoretical concern.

extract_f32_value handles common numeric and boolean types correctly. The precision caveat is valid only if your actual feature data includes integers larger than ~16 million; confirm your typical feature value ranges to determine if this is a practical concern.

Cargo.toml (1)

13-15: Polars 0.45 dependency and dtype-full feature set verified as correct.

The dependency wiring is sound. The code in polars_ext.rs uses standard Polars numeric types (Int8, Int16, Int32, Int64, UInt8, UInt16, UInt32, UInt64, Float32, Float64, Boolean, String), all of which are covered by the dtype-full feature. The dtype-full preset enables all optional dtype-related features in Polars 0.45, and the optional dependency with explicit feature gate ["dep:polars"] is correctly configured to keep the base crate lean when the feature is not enabled.

Comment on lines +26 to +41
fn get_os_name() -> &'static str {
    #[cfg(target_os = "macos")]
    {
        "darwin"
    }

    #[cfg(target_os = "linux")]
    {
        "linux"
    }

    #[cfg(target_os = "windows")]
    {
        "windows"
    }
}

⚠️ Potential issue | 🟠 Major

Using #[cfg(target_os = …)] in get_os_name ties behavior to the host, not the Cargo target.

In build scripts, cfg(target_os) refers to the host OS, while TARGET / CARGO_CFG_TARGET_OS describe the compilation target. get_os_name() currently uses cfg(target_os) and is then used to form the download URL and platform tuple in download_compiled_library. This will break cross‑compilation (e.g., building Linux target binaries from macOS) by downloading the wrong library and emitting the wrong errors.

Consider deriving the OS name from CARGO_CFG_TARGET_OS or by parsing TARGET (similar to get_arch_from_target). For example:

-fn get_os_name() -> &'static str {
-    #[cfg(target_os = "macos")]
-    {
-        "darwin"
-    }
-    #[cfg(target_os = "linux")]
-    {
-        "linux"
-    }
-    #[cfg(target_os = "windows")]
-    {
-        "windows"
-    }
-}
+fn get_os_name() -> String {
+    let target_os = std::env::var("CARGO_CFG_TARGET_OS")
+        .expect("CARGO_CFG_TARGET_OS not set by Cargo");
+    match target_os.as_str() {
+        "macos" => "darwin".to_string(),
+        "linux" => "linux".to_string(),
+        "windows" => "windows".to_string(),
+        other => panic!("Unsupported target OS: {}", other),
+    }
+}

You’d then update call sites to accept a String instead of &'static str.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
fn get_os_name() -> &'static str {
    #[cfg(target_os = "macos")]
    {
        "darwin"
    }

    #[cfg(target_os = "linux")]
    {
        "linux"
    }

    #[cfg(target_os = "windows")]
    {
        "windows"
    }
}
fn get_os_name() -> String {
    let target_os = std::env::var("CARGO_CFG_TARGET_OS")
        .expect("CARGO_CFG_TARGET_OS not set by Cargo");
    match target_os.as_str() {
        "macos" => "darwin".to_string(),
        "linux" => "linux".to_string(),
        "windows" => "windows".to_string(),
        other => panic!("Unsupported target OS: {}", other),
    }
}
🤖 Prompt for AI Agents
In build.rs around lines 26 to 41, get_os_name currently uses #[cfg(target_os =
"...")] which reflects the build host rather than the compilation target;
replace this with logic that reads CARGO_CFG_TARGET_OS (or parses TARGET) from
env::var at compile time and returns an owned String (not &'static str) to
represent the target OS name (e.g., "linux", "darwin", "windows"); update any
call sites (like download_compiled_library and platform tuple construction) to
accept/consume a String (or &str) instead of &'static str and ensure the string
mapping matches the conventions used elsewhere (same canonical names as
get_arch_from_target).

Comment on lines +43 to 58
fn get_lib_filename() -> &'static str {
    #[cfg(target_os = "windows")]
    {
        "catboostmodel.dll"
    }

    #[cfg(target_os = "macos")]
    {
        "libcatboostmodel.dylib"
    }

    #[cfg(target_os = "linux")]
    {
        "libcatboostmodel.so"
    }
}

⚠️ Potential issue | 🟠 Major

get_lib_filename should also key off the Cargo target OS instead of the host.

Same cross‑compilation concern as get_os_name: #[cfg(target_os = "...")] in a build script evaluates to the host OS. Using it to choose the library filename will misbehave when cross‑compiling (e.g., targeting Windows from Linux).

You can align this with the suggested get_os_name change by either:

  • returning the filename directly from get_os_name (e.g., ("darwin", "libcatboostmodel.dylib")), or
  • introducing a CARGO_CFG_TARGET_OS-based helper, similar to the previous diff, and using that string to pick the correct filename (see the sketch below).
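A sketch of the second option, reusing the OS-name conventions from the get_os_name suggestion above; the exact shape is illustrative, not the PR's code:

fn get_lib_filename() -> &'static str {
    let target_os = std::env::var("CARGO_CFG_TARGET_OS")
        .expect("CARGO_CFG_TARGET_OS not set by Cargo");
    match target_os.as_str() {
        "windows" => "catboostmodel.dll",
        "macos" => "libcatboostmodel.dylib",
        "linux" => "libcatboostmodel.so",
        other => panic!("Unsupported target OS: {}", other),
    }
}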
🤖 Prompt for AI Agents
In build.rs around lines 43 to 58, get_lib_filename currently uses host-targeted
#[cfg(target_os = "...")] which breaks cross-compilation; replace this logic so
it keys off the Cargo target OS instead (use env::var("CARGO_CFG_TARGET_OS") or
reuse/extend the proposed get_os_name helper to return the target OS or a (os,
filename) pair) and map the target os string ("windows", "macos"/"darwin",
"linux") to the correct filename ("catboostmodel.dll", "libcatboostmodel.dylib",
"libcatboostmodel.so"); ensure the function uses the target string to select the
filename and keep the return type compatible ( &'static str or adjust callers if
you change signature).

Comment on lines +89 to 91
let os = get_os_name();
let arch = get_arch_from_target();
let version = get_catboost_version();

⚠️ Potential issue | 🟠 Major

Platform selection in download_compiled_library depends on the same OS source; align with target OS.

The match (os, arch) and later match os blocks assume os is the target OS. With the current get_os_name implementation that’s only true for native builds. For cross‑builds, you’ll end up in the wrong arm of the match or in the _ => Unsupported platform branches even for valid target triples.

Once get_os_name() is switched to derive from the Cargo target (see earlier comment), this function should work as‑is again. After that, it might be worth adding a short debug println! of os/arch and TARGET to help diagnose platform mismatches when users hit the Unsupported platform errors.

Also applies to: 113-187, 189-246
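One possible shape for that diagnostic, assuming it is called near the top of download_compiled_library; the helper name and the cargo:warning wording are hypothetical:

fn log_platform_choice(os: &str, arch: &str) {
    // Surfaces the resolved platform tuple so "Unsupported platform" reports are easier to diagnose.
    println!(
        "cargo:warning=catboost build: os={}, arch={}, TARGET={}",
        os,
        arch,
        std::env::var("TARGET").unwrap_or_default()
    );
}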


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
src/polars_ext.rs (2)

79-105: Blocking issue: Cannot handle pure categorical predictions despite API design.

When float_columns is empty, line 87 selects zero columns, producing a DataFrame that triggers the "zero columns" error at line 115-118. This prevents categorical-only predictions even though CatBoost supports them and the API structure suggests this should work.

Apply this diff to allow empty float_columns:

 fn predict_dataframe_with_types(
     &self,
     df: &DataFrame,
     float_columns: &[&str],
     cat_columns: &[&str],
 ) -> CatBoostResult<Vec<f64>> {
-    // Extract float features
-    let float_col_names: Vec<String> = float_columns.iter().map(|s| s.to_string()).collect();
-    let float_df = df.select(float_col_names).map_err(|e| CatBoostError {
-        description: format!("Failed to select float columns: {}", e),
-    })?;
-    let float_features = dataframe_to_float_features(&float_df)?;
+    // Extract float features
+    let float_features = if float_columns.is_empty() {
+        if cat_columns.is_empty() {
+            return Err(CatBoostError {
+                description: "Both float_columns and cat_columns cannot be empty".to_string(),
+            });
+        }
+        vec![vec![]; df.height()]
+    } else {
+        let float_col_names: Vec<String> = float_columns.iter().map(|s| s.to_string()).collect();
+        let float_df = df.select(float_col_names).map_err(|e| CatBoostError {
+            description: format!("Failed to select float columns: {}", e),
+        })?;
+        dataframe_to_float_features(&float_df)?
+    };
 
     // Extract categorical features
     let cat_features = if cat_columns.is_empty() {

302-333: Major issue: Integer-to-string conversion corrupts large values via f32.

Lines 314-317 route all integer types through extract_f32_value before converting to string. Since f32 has only a 24-bit mantissa, this corrupts integers beyond ±16,777,216 and mangles unsigned values when cast to i64. This produces incorrect categorical codes for CatBoost.

Extract each integer type directly using its dedicated accessor:

 fn extract_string_value(series: &Series, idx: usize) -> CatBoostResult<String> {
     use DataType::*;
 
     match series.dtype() {
         String => {
             let ca = series.str().map_err(|e| CatBoostError {
                 description: format!("Failed to cast to String: {}", e),
             })?;
             Ok(get_checked_value!(ca, idx).to_string())
         }
-        // Convert numeric types to strings for categorical features
-        Int8 | Int16 | Int32 | Int64 | UInt8 | UInt16 | UInt32 | UInt64 => {
-            let value = extract_f32_value(series, idx)?;
-            Ok(format!("{}", value as i64))
-        }
+        Int8 => {
+            let ca = series.i8().map_err(|e| CatBoostError {
+                description: format!("Failed to cast to i8: {}", e),
+            })?;
+            Ok(get_checked_value!(ca, idx).to_string())
+        }
+        Int16 => {
+            let ca = series.i16().map_err(|e| CatBoostError {
+                description: format!("Failed to cast to i16: {}", e),
+            })?;
+            Ok(get_checked_value!(ca, idx).to_string())
+        }
+        Int32 => {
+            let ca = series.i32().map_err(|e| CatBoostError {
+                description: format!("Failed to cast to i32: {}", e),
+            })?;
+            Ok(get_checked_value!(ca, idx).to_string())
+        }
+        Int64 => {
+            let ca = series.i64().map_err(|e| CatBoostError {
+                description: format!("Failed to cast to i64: {}", e),
+            })?;
+            Ok(get_checked_value!(ca, idx).to_string())
+        }
+        UInt8 => {
+            let ca = series.u8().map_err(|e| CatBoostError {
+                description: format!("Failed to cast to u8: {}", e),
+            })?;
+            Ok(get_checked_value!(ca, idx).to_string())
+        }
+        UInt16 => {
+            let ca = series.u16().map_err(|e| CatBoostError {
+                description: format!("Failed to cast to u16: {}", e),
+            })?;
+            Ok(get_checked_value!(ca, idx).to_string())
+        }
+        UInt32 => {
+            let ca = series.u32().map_err(|e| CatBoostError {
+                description: format!("Failed to cast to u32: {}", e),
+            })?;
+            Ok(get_checked_value!(ca, idx).to_string())
+        }
+        UInt64 => {
+            let ca = series.u64().map_err(|e| CatBoostError {
+                description: format!("Failed to cast to u64: {}", e),
+            })?;
+            Ok(get_checked_value!(ca, idx).to_string())
+        }
         Boolean => {
             let ca = series.bool().map_err(|e| CatBoostError {
🧹 Nitpick comments (2)
src/polars_ext.rs (2)

44-55: Clarify behavior and constraints in documentation.

The documentation should specify:

  • What happens when both float_columns and cat_columns are empty (currently this causes an error deep in the call stack)
  • Whether categorical columns are automatically cast to String or must already be String dtype
  • The expected column order in the resulting predictions

66-77: Implicit numeric-only assumption may surprise users.

This method delegates to predict_dataframe, which converts all columns to float features. If a user passes String column names, the conversion at line 131 will fail with a Polars cast error. Consider either documenting this constraint or validating column dtypes before selection.
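If you opt for validation, a sketch of what the pre-selection check could look like, assuming it sits in src/polars_ext.rs next to the existing helpers; the helper name and error wording are hypothetical:

fn ensure_numeric_columns(df: &DataFrame, columns: &[&str]) -> CatBoostResult<()> {
    use DataType::*;
    for &name in columns {
        let col = df.column(name).map_err(|e| CatBoostError {
            description: format!("Column '{}' not found: {}", name, e),
        })?;
        match col.dtype() {
            Int8 | Int16 | Int32 | Int64 | UInt8 | UInt16 | UInt32 | UInt64 | Float32
            | Float64 | Boolean => {}
            other => {
                return Err(CatBoostError {
                    description: format!(
                        "Column '{}' has non-numeric dtype {:?}; predict_dataframe_with_columns expects numeric or boolean columns",
                        name, other
                    ),
                });
            }
        }
    }
    Ok(())
}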

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 42b0739 and 5df8634.

📒 Files selected for processing (1)
  • src/polars_ext.rs (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Test windows (x86_64) - CatBoost 1.2.7
  • GitHub Check: Build Docker image
🔇 Additional comments (5)
src/polars_ext.rs (5)

58-64: LGTM!

Clean implementation for the pure numeric prediction case.


107-148: LGTM! Efficient column-by-column conversion.

The implementation uses Polars' cast to Float32 for efficient type conversion and processes column-by-column for better cache locality. Null handling is appropriate.


150-204: LGTM! Well-structured with performance optimizations.

The single-row fast path and direct String iteration are good optimizations. The fallback to extract_string_value handles other types (though there's a separate issue with numeric conversion flagged below).


224-300: LGTM! Comprehensive numeric type handling.

Covers all Polars numeric types with appropriate conversions. The cast to f32 is acceptable here since these values are used as float features in CatBoost.


206-222: LGTM! Clean abstraction for value extraction.

The macro eliminates repetition and provides consistent bounds and null checking across all type accessors.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
src/polars_ext.rs (2)

79-104: The pure-categorical prediction issue persists.

When float_columns is empty, df.select([]) at lines 86-89 produces a zero-column DataFrame, which causes dataframe_to_float_features (line 90) to error at lines 115-119 with "DataFrame has zero rows or columns". This prevents purely categorical predictions even though CatBoost supports categorical-only models.

The previous review suggested handling empty float_columns by skipping the selection/conversion and creating an empty feature vector directly:

     ) -> CatBoostResult<Vec<f64>> {
-        // Extract float features
-        let float_col_names: Vec<String> = float_columns.iter().map(|s| s.to_string()).collect();
-        let float_df = df.select(float_col_names).map_err(|e| CatBoostError {
-            description: format!("Failed to select float columns: {}", e),
-        })?;
-        let float_features = dataframe_to_float_features(&float_df)?;
+        // Extract float features
+        let float_features = if float_columns.is_empty() {
+            // No float features - create empty vectors for each row
+            vec![vec![]; df.height()]
+        } else {
+            let float_col_names: Vec<String> = float_columns.iter().map(|s| s.to_string()).collect();
+            let float_df = df.select(float_col_names).map_err(|e| CatBoostError {
+                description: format!("Failed to select float columns: {}", e),
+            })?;
+            dataframe_to_float_features(&float_df)?
+        };
 
         // Extract categorical features

Additionally, add a check for the case where both are empty:

+        if float_columns.is_empty() && cat_columns.is_empty() {
+            return Err(CatBoostError {
+                description: "Must provide at least one float or categorical column".to_string(),
+            });
+        }
+
         // Extract float features

310-313: Numeric-to-string conversion for categoricals still routes through f32, causing precision loss.

The code at lines 310-313 extracts integer values via extract_f32_value then casts to i64. Since f32 has only a 24-bit mantissa, large integer values (beyond ±16,777,216) are corrupted, and unsigned types are incorrectly cast to signed, producing wrong categorical codes for CatBoost.

The previous review recommended extracting each integer type directly using Polars' native accessors (.i8(), .i16(), .i32(), .i64(), .u8(), .u16(), .u32(), .u64()), which are already used correctly in extract_f32_value. Example fix:

-        Int8 | Int16 | Int32 | Int64 | UInt8 | UInt16 | UInt32 | UInt64 => {
-            let value = extract_f32_value(series, idx)?;
-            Ok(format!("{}", value as i64))
+        Int8 => {
+            let ca = series.i8().map_err(|e| CatBoostError {
+                description: format!("Failed to cast to i8: {}", e),
+            })?;
+            Ok(get_checked_value!(ca, idx).to_string())
+        }
+        Int16 => {
+            let ca = series.i16().map_err(|e| CatBoostError {
+                description: format!("Failed to cast to i16: {}", e),
+            })?;
+            Ok(get_checked_value!(ca, idx).to_string())
+        }
+        Int32 => {
+            let ca = series.i32().map_err(|e| CatBoostError {
+                description: format!("Failed to cast to i32: {}", e),
+            })?;
+            Ok(get_checked_value!(ca, idx).to_string())
+        }
+        Int64 => {
+            let ca = series.i64().map_err(|e| CatBoostError {
+                description: format!("Failed to cast to i64: {}", e),
+            })?;
+            Ok(get_checked_value!(ca, idx).to_string())
+        }
+        UInt8 => {
+            let ca = series.u8().map_err(|e| CatBoostError {
+                description: format!("Failed to cast to u8: {}", e),
+            })?;
+            Ok(get_checked_value!(ca, idx).to_string())
+        }
+        UInt16 => {
+            let ca = series.u16().map_err(|e| CatBoostError {
+                description: format!("Failed to cast to u16: {}", e),
+            })?;
+            Ok(get_checked_value!(ca, idx).to_string())
+        }
+        UInt32 => {
+            let ca = series.u32().map_err(|e| CatBoostError {
+                description: format!("Failed to cast to u32: {}", e),
+            })?;
+            Ok(get_checked_value!(ca, idx).to_string())
+        }
+        UInt64 => {
+            let ca = series.u64().map_err(|e| CatBoostError {
+                description: format!("Failed to cast to u64: {}", e),
+            })?;
+            Ok(get_checked_value!(ca, idx).to_string())
         }
🧹 Nitpick comments (1)
src/polars_ext.rs (1)

107-204: Conversion logic is sound with good performance considerations.

The column-by-column processing for cache locality is well-reasoned, and the null/bounds checking is thorough.

Minor optional improvement: The error message at lines 115-119 and 156-159 could be more specific—differentiate between "zero rows" vs "zero columns" to aid debugging.
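A sketch of the split check, e.g. as a small helper shared by both conversion functions; the helper name and messages are illustrative:

fn check_dataframe_shape(df: &DataFrame) -> CatBoostResult<()> {
    if df.height() == 0 {
        return Err(CatBoostError {
            description: "DataFrame has zero rows".to_string(),
        });
    }
    if df.width() == 0 {
        return Err(CatBoostError {
            description: "DataFrame has zero columns".to_string(),
        });
    }
    Ok(())
}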

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5df8634 and 3fa43e3.

📒 Files selected for processing (1)
  • src/polars_ext.rs (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build Docker image
🔇 Additional comments (1)
src/polars_ext.rs (1)

1-56: LGTM! Trait definition is well-documented.

The public API surface is clean with clear documentation and appropriate examples. The trait design provides a natural ergonomic interface for Polars users.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
src/polars_ext.rs (2)

33-42: Clarify expected column types in documentation.

The documentation for predict_dataframe_with_columns doesn't specify what column types are expected or how the method behaves. Since it delegates to predict_dataframe, which converts all columns to float features, this should be documented.

Consider adding a note like:

 /// Predict using specific columns from a Polars DataFrame
 ///
+/// All selected columns will be treated as float features and must be numeric.
+///
 /// # Arguments
 /// * `df` - Input DataFrame
 /// * `columns` - Column names to use as features (in order)

124-161: Consider adding single-row fast path for consistency.

dataframe_to_cat_features includes a single-row optimization (lines 176-184), but dataframe_to_float_features does not. While the performance benefit may be negligible for float conversion (since we're already doing a bulk cast to Float32), adding a similar fast path would make the functions symmetric.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3fa43e3 and cf4635d.

📒 Files selected for processing (1)
  • src/polars_ext.rs (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Test windows (x86_64) - CatBoost 1.2.7
🔇 Additional comments (2)
src/polars_ext.rs (2)

79-117: LGTM! Pure-categorical prediction handling is correct.

The implementation properly handles the case when float_columns is empty by creating empty feature vectors (line 94-95) rather than attempting to select zero columns. This allows pure-categorical predictions as intended.


324-371: LGTM! Integer-to-string conversion now preserves precision.

The implementation correctly extracts each integer type directly using the appropriate Polars accessor (.i8(), .u64(), etc.) and converts to string without going through f32. This preserves the exact integer values, including large i64 and u64 values that would be corrupted by an f32 intermediary.

/// # Arguments
/// * `df` - Input DataFrame
/// * `float_columns` - Names of columns to use as float features
/// * `cat_columns` - Names of columns to use as categorical features (must be String type)

⚠️ Potential issue | 🟡 Minor

Update documentation: categorical columns accept more types than stated.

The documentation states categorical columns "must be String type", but the implementation (lines 324-371) also accepts and converts Int8/16/32/64, UInt8/16/32/64, and Boolean types. Update the doc comment to reflect the actual supported types.

Apply this diff:

-    /// * `cat_columns` - Names of columns to use as categorical features (must be String type)
+    /// * `cat_columns` - Names of columns to use as categorical features (String, numeric, or Boolean types)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
/// * `cat_columns` - Names of columns to use as categorical features (must be String type)
/// * `cat_columns` - Names of columns to use as categorical features (String, numeric, or Boolean types)
🤖 Prompt for AI Agents
In src/polars_ext.rs around line 49, the doc comment for `cat_columns`
incorrectly says "must be String type"; update it to list the actual supported
types observed in the implementation (see logic at lines 324-371) — i.e.,
String, signed integers (Int8/16/32/64), unsigned integers (UInt8/16/32/64), and
Boolean — and clarify that these non-string types will be converted to
categorical as handled by the code. Keep the wording concise and accurate to the
implementation.
