fix: add missing aggregate function recursion in transform_recursive #67
goldmedal wants to merge 1 commit into tobilg:main
|
Thanks for the contribution! The two fixes address real gaps. Some issues to address before merge:

Must fix

Regression: BitwiseOrAgg/BitwiseAndAgg/BitwiseXorAgg handlers removed from transform_recursive

The diff replaces the existing three Bitwise*Agg match arms with Sum/Avg/Min without re-adding them. The comment states:

// Note: Stddev, StddevSamp, Variance, ArrayAgg, BitwiseOrAgg, BitwiseAndAgg, [...]

This is incorrect for BitwiseOrAgg, BitwiseAndAgg, and BitwiseXorAgg: they were only handled at this location (the removed arms). After this PR, transform_recursive will no [...]

Additionally, StddevSamp is listed as "already handled above" but there is no existing Expression::StddevSamp(mut f) arm in transform_recursive. Only Stddev (line 942) and [...]

Should fix

Overly permissive correlated reference check for qualified columns (qualify_columns.rs, first insertion around line 829):

// Column has table qualifier, table NOT in scope, NOT in schema

This checks the column name against all outer-scope tables but ignores the table qualifier. So nonexistent_table.real_column passes validation as long as real_column exists in [...]

The second insertion (for unqualified columns, around line 858) is fine: an unqualified column referencing an outer scope is valid correlated subquery behavior.

Suggestions
What looks good
|
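To make the qualified-column concern above concrete, here is a minimal sketch. It is not the project's code: ToySchema, column_exists_in_any_table, qualified_column_exists, and toy_schema are hypothetical names, and the schema is reduced to a map from table name to column names. It only contrasts a check that ignores the qualifier with one that honors it.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a schema: table name -> column names.
type ToySchema = HashMap<&'static str, Vec<&'static str>>;

/// Permissive check (the behavior the review objects to): the table
/// qualifier is ignored, so any table that has the column makes it pass.
fn column_exists_in_any_table(schema: &ToySchema, column: &str) -> bool {
    schema
        .values()
        .any(|cols| cols.iter().any(|c| *c == column))
}

/// Stricter check: the qualifier must name a known table, and that
/// specific table must contain the column.
fn qualified_column_exists(schema: &ToySchema, table: &str, column: &str) -> bool {
    schema
        .get(table)
        .map_or(false, |cols| cols.iter().any(|c| *c == column))
}

/// Small example schema used for illustration only.
fn toy_schema() -> ToySchema {
    HashMap::from([
        ("t1", vec!["id", "name"]),
        ("t2", vec!["id", "status"]),
    ])
}
```

With this toy schema, nonexistent_table.status is accepted by the permissive check but rejected by the stricter one, which is the difference the review asks for.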
Must fix:
- Re-add BitwiseOrAgg/BitwiseAndAgg/BitwiseXorAgg arms to transform_recursive
(accidentally removed when adding aggregate function coverage)
- Add StddevSamp arm (was incorrectly listed as already handled in comment)
- Fix stale comment that claimed those variants were handled elsewhere
Should fix:
- Remove overly permissive column_exists_in_outer_schema_table fallback for
qualified columns; nonexistent_table.real_col no longer passes validation
just because real_col exists in some other schema table
Tests:
- Add correlated EXISTS subquery test (TPC-H Q4 pattern)
- Add negative test: nonexistent_table.id must error even if id exists elsewhere
- Add COUNT(*) test covering the Count { this: None } path
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add recurse_agg! macro in transform_recursive, collapsing 25 identical AggFunc arms (~80 lines) into one-liners; follows the existing transform_binary! pattern already in the function
- Change Schema::table_names() to return Box<dyn Iterator<Item = &str>> instead of Vec<String>, avoiding an upfront heap allocation on every call to column_exists_in_outer_schema_table

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
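For readers unfamiliar with the boxed-iterator return type mentioned in this commit, a rough sketch follows. ToySchema, ToyMappingSchema, and has_table are hypothetical; the real Schema trait and MappingSchema have more to them. Note that Box::new still makes one small allocation for the iterator itself; what this shape avoids is building a Vec and cloning every table name into a String.

```rust
use std::collections::HashMap;

/// Hypothetical, cut-down trait with only the method under discussion.
trait ToySchema {
    /// Borrowing iterator over table names; no Strings are cloned.
    fn table_names(&self) -> Box<dyn Iterator<Item = &str> + '_>;
}

/// Hypothetical schema backed by a map of table name -> column names.
struct ToyMappingSchema {
    tables: HashMap<String, Vec<String>>,
}

impl ToySchema for ToyMappingSchema {
    fn table_names(&self) -> Box<dyn Iterator<Item = &str> + '_> {
        // The iterator borrows from `self.tables`, hence the `'_` bound.
        Box::new(self.tables.keys().map(|name| name.as_str()))
    }
}

/// Callers can stop early without materializing the full list of names.
fn has_table(schema: &dyn ToySchema, wanted: &str) -> bool {
    schema.table_names().any(|name| name == wanted)
}
```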
|
Thanks @tobilg for the review. All comments are addressed. |
|
Thanks again for the update of this PR @goldmedal, solid work on both issues!

Aggregate recursion in transform_recursive: Approved

This was a real gap. All 28 Box variants are now covered. Before this PR, expressions like AVG(val) were hitting the other => other fallthrough, meaning their child [...]

The recurse_agg! macro is clean and follows the existing transform_binary! pattern nicely. Count is correctly special-cased for Option to handle COUNT(*).

Issue: AggFunc.filter and AggFunc.order_by are still not recursed into. A column inside AVG(val) FILTER(WHERE status > 0) or ARRAY_AGG(x ORDER BY y) won't have status/y [...]

Correlated subquery handling: Needs discussion

The column_exists_in_outer_schema_table() heuristic has a false-negative risk: if a user has a typo in a column name that happens to match a column in any other table in the [...]

-- Schema: t1(id, name), t2(id, status), t3(typpo_col)

I'd prefer to have it throw the existing error.

Schema::table_names() trait addition

Clean. Using Box<dyn Iterator<Item = &str> + '_> avoids allocation. Both MappingSchema and TypeInferenceSchema implementations are covered.

Tests: Good coverage

Merge conflicts

This PR will have merge conflicts with main due to recent performance work that changed Expression::Column(Column) → Expression::Column(Box) and [...] |
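As an illustration of the filter/order_by issue raised in this review, here is a toy sketch of what recursing into those fields could look like. Everything in it is an assumption made for the example: the Expr and Agg types, the field shapes (filter as Option<Expr>, order_by as Vec<Expr>), and the qualify pass that stamps a fixed table name. The real AggFunc layout may differ.

```rust
/// Toy expression tree; not the crate's Expression type.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column { table: Option<String>, name: String },
    Agg(Box<Agg>),
}

/// Toy aggregate with the three child positions discussed above.
#[derive(Debug, Clone, PartialEq)]
struct Agg {
    name: String,
    this: Expr,
    filter: Option<Expr>,
    order_by: Vec<Expr>,
}

/// Bottom-up transform that visits `this`, `filter`, and `order_by`.
fn transform(expr: Expr, f: &dyn Fn(Expr) -> Expr) -> Expr {
    let rebuilt = match expr {
        Expr::Agg(agg) => {
            let Agg { name, this, filter, order_by } = *agg;
            Expr::Agg(Box::new(Agg {
                name,
                this: transform(this, f),
                // Without these two lines, columns in FILTER / ORDER BY
                // would never reach `f`.
                filter: filter.map(|e| transform(e, f)),
                order_by: order_by.into_iter().map(|e| transform(e, f)).collect(),
            }))
        }
        other => other,
    };
    f(rebuilt)
}

/// Hypothetical pass: qualify every bare column with table "t".
fn qualify(expr: Expr) -> Expr {
    match expr {
        Expr::Column { table: None, name } => Expr::Column {
            table: Some("t".to_string()),
            name,
        },
        other => other,
    }
}

/// Helper to build an unqualified column.
fn col(name: &str) -> Expr {
    Expr::Column { table: None, name: name.to_string() }
}
```

Calling transform(agg, &qualify) on a toy ARRAY_AGG(x ORDER BY y) node qualifies both x and y in this sketch, because every child position is visited before the node itself.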
|
Thanks for the detailed review. The false-negative concern is valid.

After investigating, I realized the root cause is architectural: qualify_columns uses a bottom-up transform_recursive traversal, so inner SELECTs are qualified before the outer scope is available. There's no way for the inner scope to know whether an unresolvable column is a correlated reference or a genuine typo without access to the outer scope's context.

The column_exists_in_outer_schema_table() heuristic was a workaround to approximate this, but as you pointed out, it's too broad. I explored a retry-based approach (intercept Subquery/Exists nodes on UnknownColumn, clear the error, re-qualify with allow_partial=true, and let the outer scope's pass pick up correlated refs), but this has its own issues:

- Requires intercepting every expression type that can wrap a correlated subquery (Subquery, Exists, In, etc.)

Given this, I'd like to scope this PR down to just the aggregate function changes (transform_recursive coverage for Sum, Avg, Count, etc.) and remove the correlated subquery handling entirely. I'll open a separate issue to track the correlated subquery qualification problem with the architectural context above. What do you think? 🤔

I'll convert this to a draft first. |
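The ordering problem described in this comment can be shown with a small sketch. ToySelect, visit_bottom_up, and visit_order are hypothetical and model only the traversal order, not any qualification logic: in a bottom-up walk, a nested SELECT is processed before the SELECT that encloses it, so the outer scope's tables are not yet known when the inner one is handled.

```rust
/// Toy SELECT node: a label plus nested subqueries (e.g. inside EXISTS).
struct ToySelect {
    name: &'static str,
    subqueries: Vec<ToySelect>,
}

/// Bottom-up walk: children first, then the node itself.
fn visit_bottom_up(select: &ToySelect, order: &mut Vec<&'static str>) {
    for sub in &select.subqueries {
        visit_bottom_up(sub, order);
    }
    // By the time we get here, every inner SELECT has already been processed.
    order.push(select.name);
}

/// Returns the order in which SELECT nodes are processed.
fn visit_order(root: &ToySelect) -> Vec<&'static str> {
    let mut order = Vec::new();
    visit_bottom_up(root, &mut order);
    order
}
```

For an outer query wrapping one EXISTS subquery, the inner node comes first in the returned order, which is why a correlated reference and a typo look the same from inside the inner scope.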
|
Yeah, I think splitting this makes absolute sense. Thanks for your support and understanding. I'll think about it as well. |
transform_recursive was not recursing into aggregate function expressions (Sum, Avg, Min, Max, Count, etc.), treating them as leaf nodes. This meant columns inside aggregates like `MAX(a)` or `AVG(val)` were never visited by qualify_columns or other transform passes.

- Add recurse_agg! macro for AggFunc-based variants (follows existing transform_binary! pattern)
- Add arms for all 24 missing aggregate functions
- Handle Count separately (Option<Expression> `this` field)
- Add StddevSamp arm (was missing)
- Add test coverage for MAX(a), ABS(a), and COUNT(*)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
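A rough sketch of the pattern this commit describes, under stated assumptions: the Expr, AggFunc, and CountFunc types below are toys, the macro body is a guess at the general shape rather than the code in the PR, and error handling is omitted. It shows a local macro collapsing identical aggregate arms, with Count handled separately because its argument is optional.

```rust
/// Toy expression tree; not the crate's Expression type.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Sum(Box<AggFunc>),
    Max(Box<AggFunc>),
    Count(Box<CountFunc>),
}

#[derive(Debug, Clone, PartialEq)]
struct AggFunc {
    this: Expr,
}

/// COUNT(*) has no argument expression, hence the Option.
#[derive(Debug, Clone, PartialEq)]
struct CountFunc {
    this: Option<Expr>,
}

fn transform_recursive(expr: Expr, transform: &dyn Fn(Expr) -> Expr) -> Expr {
    // One-line arms for every variant that wraps a plain AggFunc.
    macro_rules! recurse_agg {
        ($variant:ident, $agg:ident) => {{
            let AggFunc { this } = *$agg;
            Expr::$variant(Box::new(AggFunc {
                this: transform_recursive(this, transform),
            }))
        }};
    }

    let rebuilt = match expr {
        Expr::Sum(agg) => recurse_agg!(Sum, agg),
        Expr::Max(agg) => recurse_agg!(Max, agg),
        // Count is special-cased: recurse only when an argument exists.
        Expr::Count(count) => {
            let CountFunc { this } = *count;
            Expr::Count(Box::new(CountFunc {
                this: this.map(|e| transform_recursive(e, transform)),
            }))
        }
        // Leaf nodes fall through unchanged.
        other => other,
    };
    transform(rebuilt)
}
```

In this toy version, a column inside MAX(a) reaches the transform closure, while COUNT(*) with this: None passes through without panicking.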
e46343b to
1d39d3e
|
Filed #68 for the correlated subquery issue. |
Summary
transform_recursive was not recursing into aggregate function expressions (Sum, Avg, Min, Max, Count, etc.), treating them as leaf nodes. This meant columns inside aggregates like MAX(a) or AVG(val) were never visited by qualify_columns or other transform passes.
- Add recurse_agg! macro for AggFunc-based variants (follows existing transform_binary! pattern)
- Handle Count separately (Option<Expression> this field)
- Add StddevSamp arm (was missing)
Correlated subquery handling has been removed from this PR and is tracked separately in #68.
Test plan
- cargo test -p polyglot-sql --lib
- MAX(a) and ABS(a) get qualified
- COUNT(*) (Count { this: None }) doesn't panic
🤖 Generated with Claude Code