Conversation
602bef1 to
684d85c
Compare
684d85c to
8093c92
Compare
6e6031b to
f7a9960
Compare
bb5cca5 to
78d3930
Compare
wgtmac
left a comment
There was a problem hiding this comment.
Sorry for the delay and thanks for adding the whole picture! I've left some inline comments. Let me know what you think.
| /// | ||
| /// \param term The term to serialize (NamedReference or UnboundTransform) | ||
| /// \return A JSON value representing the term | ||
| ICEBERG_EXPORT nlohmann::json TermToJson(const Term& term); |
There was a problem hiding this comment.
Just to confirm: we cannot simply use ToJson(const Term& term) because it will confuse the function resolution?
src/iceberg/expression/json_serde.cc
Outdated
|
|
||
| /// Template helper to create predicates from JSON with the appropriate term type | ||
| template <typename B> | ||
| Result<std::unique_ptr<UnboundPredicate>> MakePredicateFromJson( |
There was a problem hiding this comment.
| Result<std::unique_ptr<UnboundPredicate>> MakePredicateFromJson( | |
| Result<std::unique_ptr<UnboundPredicate>> PredicateFromJson( |
To be consistent with other XyzFromJson
src/iceberg/expression/json_serde.cc
Outdated
| nlohmann::json ToJson(const Expression& expr) { | ||
| switch (expr.op()) { | ||
| case Expression::Operation::kTrue: | ||
| return true; |
There was a problem hiding this comment.
Just a heads-up that there is an inconsistency between the Rest spec and Java impl on boolean expr: apache/iceberg#14677
src/iceberg/expression/json_serde.cc
Outdated
| } | ||
|
|
||
| nlohmann::json ToJson(const UnboundTransform& transform) { | ||
| auto& mutable_transform = const_cast<UnboundTransform&>(transform); |
There was a problem hiding this comment.
Not related to this PR: I have to admit that the current api design is flawed so we have to use const_cast here. We might need to fix it in the future.
| } | ||
| } | ||
|
|
||
| Result<Literal> LiteralFromJson(const nlohmann::json& json) { |
There was a problem hiding this comment.
We might need to add a const Schema* as well to follow this:
There was a problem hiding this comment.
I decided not to add it because java always passes null for this parameter.
https://github.com/apache/iceberg/blob/d2fbe427ecec298f1600260f40fcab1b7ab2e695/core/src/main/java/org/apache/iceberg/expressions/ExpressionParser.java#L262-L268
If there is no objection, I think I will keep this as is.
There was a problem hiding this comment.
Fair enough. Perhaps we can add a comment to say this discrepancy against the Java impl.
There was a problem hiding this comment.
After thinking more about this, if we can pass a Schema here to bind the reference to a real data type, we can restore the literal on that type. Otherwise it may cause some troubles in the downstream to consume the filter (e.g. the actual type is decimal but double is returned for now).
There was a problem hiding this comment.
There are some moving pieces related to this. For example, the casting of literals to string is not yet implemented - https://github.com/apache/iceberg-cpp/blob/main/src/iceberg/expression/literal.cc#L196-L200. I have this another PR (main...evindj:iceberg-cpp:bound_expressions) that is supposed to addressed binding. I will add basic schema information here, but the actual casting information will be added in a different PR if that works for you.
|
|
||
| #pragma once | ||
|
|
||
| #include <charconv> |
There was a problem hiding this comment.
Nit: <charconv> is included but never used in this header or in json_serde.cc. Please remove it.
src/iceberg/expression/json_serde.cc
Outdated
|
|
||
| if (IsSetOperation(op)) { | ||
| std::vector<Literal> literals; | ||
| if (!json.contains(kValues) || !json[kValues].is_array()) [[unlikely]] { |
There was a problem hiding this comment.
Java's predicateFromJson validates that set predicates (IN/NOT_IN) do not have a value field. This cross-validation is missing here. A malformed JSON with both value and values would be silently accepted.
src/iceberg/expression/json_serde.cc
Outdated
| } | ||
|
|
||
| // Literal predicate | ||
| if (!json.contains(kValue)) [[unlikely]] { |
There was a problem hiding this comment.
Similarly, Java validates that literal predicates (LT, EQ, etc.) do not have a values field.
ae55d26 to
1ca8c76
Compare
|
Thanks for addressing the comments. Let me know when ready for review again :) |
a7aebcf to
f1c6d91
Compare
wgtmac
left a comment
There was a problem hiding this comment.
Review Report: PR #553
(Generated by gemini-cli)
📄 File: src/iceberg/expression/json_serde.cc
Java Counterpart: org.apache.iceberg.expressions.ExpressionParser
- Parity Check:
⚠️ - Missing top-level literal expression parsing: In Java,
ExpressionParser.fromJsonexplicitly handles expressions encoded as literal objects (e.g.,{"type": "literal", "value": true}) and converts them toalwaysTrue()oralwaysFalse(). In the C++ implementation,ExpressionFromJsonchecks for primitive booleans and strings, but does not intercept the literal wrapper. It falls through toOperationTypeFromJson, which will return aJsonParseErrorsince"literal"is not a known operation. You should add a check forjson[kType].get<std::string>() == kLiteralinsideExpressionFromJsonthat validates and returns the boolean expression. - Binding deferral: The design choice to defer binding (parsing string/long into typed equivalents) rather than doing it directly in
ExpressionFromJson(as Java optionally does with aSchemaargument) is a solid C++ architectural choice.
- Missing top-level literal expression parsing: In Java,
- Style & Comments:
⚠️ - Good use of Modern C++ (
[[unlikely]],std::ranges,std::span). - Minor typo in an error message on line 369:
"expresion JSON must be an object with a 'type' field: {}"-> should be"expression".
- Good use of Modern C++ (
- Logic Check: ✅
- Memory handling and move semantics are clean.
LiteralFromJsongracefully handles nested literal unwrapping via recursion. - Binary literal Base-16 encoding correctly mimics Java's
BaseEncoding.base16()format.
- Memory handling and move semantics are clean.
- Design & Conciseness: ✅
- The
MakeTransformJsonhelper is a nice touch to avoid duplicating serialization logic forBoundTransformandUnboundTransform. - Virtual methods
predicate_term()andliterals()inpredicate.hrepresent a clean object-oriented approach to extracting data for serialization.
- The
- Test Quality:
⚠️ - The parameterized
ExpressionJsonRoundTripTestand invalid expression tests are fantastic and comprehensive. - Recommendation: Add a test case for
{"type": "literal", "value": true}deserializing toExpression::Operation::kTrueto cover the parity issue mentioned above.
- The parameterized
📄 Files: src/iceberg/expression/predicate.h & src/iceberg/type_fwd.h
- Design & Conciseness: ✅
- Forward declarations and abstraction interfaces are correctly updated without bleeding JSON dependencies into the core expression AST.
Summary & Recommendation
- Request Changes: Please address the parity gap where top-level literal wrappers
{"type": "literal", "value": true}fail to parse inExpressionFromJson, fix the minor typo, and add the corresponding test case. The overall structure and adherence to modern C++ conventions are excellent.
|
The above comment was generated automatically by my gemini-cli. I haven't finished my review yet :) |
f1c6d91 to
ecef4c4
Compare
src/iceberg/expression/predicate.h
Outdated
| bool is_unbound_predicate() const override { return true; } | ||
|
|
||
| /// \brief Returns the term of this predicate as a base Term reference. | ||
| virtual const Term& predicate_term() const = 0; |
There was a problem hiding this comment.
| virtual const Term& predicate_term() const = 0; | |
| virtual const Term& unbound_term() const = 0; |
Not a strong opinion here but I just think predicate_term looks weird. Perhaps untyped_term, generic_term, or something else.
src/iceberg/expression/json_serde.cc
Outdated
|
|
||
| namespace iceberg { | ||
| namespace { | ||
| // JSON field names |
There was a problem hiding this comment.
| // JSON field names |
src/iceberg/expression/json_serde.cc
Outdated
| constexpr std::string_view kLeft = "left"; | ||
| constexpr std::string_view kRight = "right"; | ||
| constexpr std::string_view kChild = "child"; | ||
| // Expression type strings |
There was a problem hiding this comment.
| // Expression type strings |
src/iceberg/expression/json_serde.cc
Outdated
| std::vector<Literal> literals; | ||
| if (!json.contains(kValues) || !json[kValues].is_array() || json.contains(kValue)) | ||
| [[unlikely]] { | ||
| return JsonParseError("Missing or invalid 'values' field for set operation: {}", |
src/iceberg/expression/json_serde.cc
Outdated
|
|
||
| // Literal predicate | ||
| if (!json.contains(kValue) || json.contains(kValues)) [[unlikely]] { | ||
| return JsonParseError("Missing 'value' field for literal predicate: {}", |
src/iceberg/expression/json_serde.cc
Outdated
| } | ||
|
|
||
| Result<nlohmann::json> ToJson(const BoundReference& ref) { | ||
| return nlohmann::json(ref.name()); |
| } | ||
| } | ||
|
|
||
| Result<Literal> LiteralFromJson(const nlohmann::json& json) { |
There was a problem hiding this comment.
After thinking more about this, if we can pass a Schema here to bind the reference to a real data type, we can restore the literal on that type. Otherwise it may cause some troubles in the downstream to consume the filter (e.g. the actual type is decimal but double is returned for now).
src/iceberg/expression/json_serde.cc
Outdated
| return ToJson(internal::checked_cast<const UnboundTransform&>(term)); | ||
| return ToJson(internal::checked_cast<const BoundTransform&>(term)); | ||
| default: | ||
| return NotSupported("Unsupported term kind for JSON serialization"); |
There was a problem hiding this comment.
Let's include more detail in the error msg?
|
|
||
| const auto& term_json = json[kTerm]; | ||
|
|
||
| if (IsTransformTerm(term_json)) { |
There was a problem hiding this comment.
Should we add a UnboundTermFromJson and then wrap UnboundTransformFromJson and NamedReferenceFromJson in it? UnboundTermFromJson can call itself recursively. I'm not sure if it is possible to produce transform like identity(identity(col)).
src/iceberg/expression/json_serde.cc
Outdated
| return Not::Make(std::move(child)); | ||
| } | ||
| default: | ||
| // All other operations are predicates however since binding does not happen during |
There was a problem hiding this comment.
All other operations are predicates
Is it true? Should we try to check if this is an aggregate?
| ICEBERG_UNWRAP_OR_FAIL(auto result2, ToJson(*expr)); | ||
| EXPECT_EQ(result2, expected); |
There was a problem hiding this comment.
| ICEBERG_UNWRAP_OR_FAIL(auto result2, ToJson(*expr)); | |
| EXPECT_EQ(result2, expected); | |
| ICEBERG_UNWRAP_OR_FAIL(auto result, ToJson(*expr)); | |
| EXPECT_EQ(result, expected); |
| EXPECT_EQ(round_trip, param.json); | ||
| } | ||
|
|
||
| INSTANTIATE_TEST_SUITE_P( |
There was a problem hiding this comment.
Thanks for adding these test cases!
ecef4c4 to
cadf471
Compare
This is the second PR in addressing serde for expression for predicate push down. It implements the rest of the expression serde logic. It closely matches the java Implementation at https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/expressions/ExpressionParser.java
I tried to achieve/improve on the java integration test coverage.
One gap that is not addressed in this task is that during binding, even if type information is available, a bunch of complex literals are serialized as strings but no string to specific type exist yet. I will work on that next.