Skip to content

feat: Support (De)Serialization for different representations of Nullable Unions#458

Closed
PookieBuns wants to merge 14 commits intoapache:mainfrom
PookieBuns:main
Closed

feat: Support (De)Serialization for different representations of Nullable Unions#458
PookieBuns wants to merge 14 commits intoapache:mainfrom
PookieBuns:main

Conversation

@PookieBuns
Copy link
Copy Markdown
Contributor

@PookieBuns PookieBuns commented Feb 10, 2026

Fixes The Rust Representation Conundrum as mentioned in #449, which allows (de)serializing, all avro-rs, rusty, and avro_json_encoding_compatible rust representations to avro. Tested using the provided test suite in the mentioned issue.

Note: This does not address #365 since this only applies to schema aware serialization.

I considered using enum variant name as lookup method for picking Union branch when serializing, shown here, but given the current state in which Value doesn't hold the name for named types, I realized it's not possible to do the same for deserialization until further work to support names in Value has been done.

Therefore, this implementation makes the assumption that if you are using Option<T> to represent a nullable union, that means that the null is the first variant in the union and serialization / deserialization just bump / decrement the index lookup respectively.

I believe this doesn't actually produce any breaking changes, because the status quo is that tagged Option "Rusty" representations straight up don't work.

Please let me know if you have any questions!

{
return visitor.visit_enum(EnumUnitDeserializer::new(field));
}
// Assume `null` is the first branch if deserializing some so decrement the variant index
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assumption is not guaranteed by the spec :-/
We should add some index/length checks to avoid panics/overflows.

Copy link
Copy Markdown
Contributor Author

@PookieBuns PookieBuns Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to use checked_sub to ensure underflow doesn't happen.

Regarding the topic about not being enforced by spec, even though it isn't strictly enforced, it's highly recommended https://avro.apache.org/docs/1.11.1/specification/#unions

Note that when a default value is specified for a record field whose type is a union, the type of the default value must match the first element of the union. Thus, for unions containing “null”, the “null” is usually listed first, since the default value of such unions is typically null.

For serialization we technically could get around this by resolving union branches using variant name, but for deserialization cannot yet because it's not schema aware and/or Value doesn't hold name information, which is why I chose to use index bumping on serialization too for consistent behavior. This seems like the closest we can get without a significant refactor or feature for deserialization.

For now, the behavior will intentionally throw an error if you try to (de)serialize an Option with null not being the first branch. This doesn't break existing behavior because the status quo is non-functional so although not a perfect solution imo it's a step better than what we have now

}
}
}
let branch_index = variant_index as usize + usize::from(self.serializing_some);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, this assumes that the null is the first variant in the union.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to throw error if null isn't first branch, reasoning stated in #458 (comment)

"name": "MyUnion",
"type": [
"null",
"int"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a test where the null is not the first variant too.

Copy link
Copy Markdown
Contributor Author

@PookieBuns PookieBuns Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added tests in 6f691db to ensure null as non-first variant when using Option fails with error message communicating intended behavior

if branch_index >= union_schema.schemas.len() {
return Err(create_error(format!(
"Variant index out of bounds: {}. The union schema has '{}' schemas",
variant_index,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
variant_index,
branch_index,

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated!

if branch_index >= union_schema.schemas.len() {
return Err(create_error(format!(
"Variant index out of bounds: {}. The union schema has '{}' schemas",
variant_index,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
variant_index,
branch_index,

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated!

}
encode_int(i as i32, &mut self.writer)?;
return encode_int(variant_index as i32, &mut self.writer);
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably support Schema::Ref too, to refer to Schema::Enum

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated in 6c58e62 with test to verify behavior

}

#[test]
#[ignore]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a comment why it is ignored

Copy link
Copy Markdown
Contributor Author

@PookieBuns PookieBuns Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to instead expect a failure / comment on why it will fail. I'm also open to just removing it if you think its out of scope, since that mod is mostly to demonstrate how untagged enums have edge cases that fail and check that tagged enums do not

Comment on lines +862 to +864
let variant_idx = (*idx as usize)
.checked_sub(usize::from(self.deserializing_some))
.ok_or_else(|| Details::GetNull(inner.deref().clone()))?;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to add an Error variant here to communicate more clearly that deserializing some when null isn't first isn't supported? Right now it just tells the user Expects null for 0th when deserializing some

@PookieBuns
Copy link
Copy Markdown
Contributor Author

PookieBuns commented Feb 11, 2026

Updated to fix clippy / license checks cb749c3

@PookieBuns PookieBuns requested a review from martin-g February 11, 2026 13:25
@Kriskras99 Kriskras99 mentioned this pull request Feb 19, 2026
@PookieBuns
Copy link
Copy Markdown
Contributor Author

Updated to use new methods introduced by previous commits

@Kriskras99
Copy link
Copy Markdown
Contributor

Hi @PookieBuns, sorry for not responding earlier. I've been working on #512 and was really hoping to get it finished earlier. It has quite an impact on what you're trying to achieve in this PR.

I've adapted your tests to run on #512 (nullable_union.rs). Note that I've combined the serialize and deserialize tests into a roundtrip test.

There are two kinds of tests that fail on my PR. The first kind has expects a Option to flatten to the underlying union. These tests are marked as should_panic in the provided gist as I've explicitly chosen not to support this at this point as it adds extra complexity in the serializer/deserializer and derive code.

The second kind are the ones with JSON enums. I think I finally understand what you're trying to achieve, but please correct me if I'm wrong. You are (de)serializing Avro encoded as JSON through serde_json into types. Then you also want to use those same types to (de)serialized Avro encoded as binary. I had originally assumed you were using the TryFrom<serde_json::Value> for Value and TryFrom<Value> for serde_json::Value> implementations. We cannot support your usecase of using the same types for binary Avro and JSON Avro (via serde_json). What is an option is to provide proper support for JSON Avro in this crate. That would be through something like from_json<T: Deserialize>(json: serde_json::Value, schema: &Schema> and to_json<T: Serialize>(value: &T, schema: &Schema) (don't take this as gospel, there are some API considerations to make and you need to deal with schemata).

Please let me know if I understand your usecase correctly and if the provided solution would be helpful. Sorry for all the work you put into this PR, but I don't think we can merge this.

@PookieBuns
Copy link
Copy Markdown
Contributor Author

PookieBuns commented Mar 25, 2026

Hi @Kriskras99 thanks for the response. Being a maintainer is hard work and I appreciate all that you've done for the rust community.

In response to q1, does this mean that apache-avro only plans on supporting

[
  "null",
  "int",
  {
    "type": "enum",
    "name": "MyEnum",
    "symbols": [
      "A",
      "B"
    ]
  },
  {
    "type": "record",
    "name": "MyRecord",
    "fields": [
      {
        "name": "a",
        "type": "int"
      }
    ]
  }
]

via

    enum MyEnum {
        A,
        B,
    }

    struct MyRecord {
        a: i32,
    }

    enum MyUnionNullable {
        Null,
        Int(i32),
        MyEnum(MyEnum),
        MyRecord(MyRecord),
    }

and not via

    enum MyEnum {
        A,
        B,
    }

    struct MyRecord {
        a: i32,
    }

    enum MyUnion {
        Int(i32),
        MyEnum(MyEnum),
        MyRecord(MyRecord),
    }

    Option<MyUnion>

? IMO intuitively, the latter way is the more idiomatic way a common user would expect to define a struct that matches the schema defined above, and I believe is also how https://github.com/lerouxrgd/rsgen-avro currently generates its structs (albeit they do some custom implementation of serde_serialize). The advantage of this representation is nicer integration with Option semantics instead of having to deal with Null branches manually for end users

For point 2, Yea I agree with you fully, converting from avro value to json value should be its own implementation instead of using serde json. I just so happened to be able to fix it with my implementation which is why I had it included, but I would be completely fine if it wasn't supported. This use case doesn't really matter for me.

I really hope we could support the "rust idiomatic" representation for both serialization and deserialization referred to in point 1, but ultimately you are the maintainer (or core contributer) and I respect your decision. If you would like any help with implementation. Feel free to reach out to me! Thanks!

@Kriskras99
Copy link
Copy Markdown
Contributor

Closed in favor of #512 and #517

@Kriskras99 Kriskras99 closed this Mar 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants