-
Notifications
You must be signed in to change notification settings - Fork 414
fix: Validate SetStatisticsUpdate correctly (fixes #2865) #2866
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
kevinjqliu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR. I think we should just deprecate the top-level snapshot_id entirely.
For context, the "before model_validator` was added in 3b53edc#diff-769b43e1d8beaa86141f694679de2bbea3604a65f987a9acd7d9e9efca193b7eR181-R193 to maintain backwards compatibility and prep for deprecation
kevinjqliu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops didnt mean to approve
|
@kevinjqliu Ok, do you want me to change the fix so that |
kevinjqliu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the java implementation has not deprecated the top-level snapshot_id. lets proceed with this change since it improves the current validation logic.
Thanks for the PR! Lets add a test and it should be good to go!
pyiceberg/table/update/__init__.py
Outdated
| @model_validator(mode="after") | ||
| def validate_snapshot_id(self) -> Self: | ||
| return self.model_copy(update={"snapshot_id": self.statistics.snapshot_id}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| @model_validator(mode="after") | |
| def validate_snapshot_id(self) -> Self: | |
| return self.model_copy(update={"snapshot_id": self.statistics.snapshot_id}) | |
| @model_validator(mode="after") | |
| def validate_snapshot_id(self) -> Self: | |
| self.snapshot_id = self.statistics.snapshot_id | |
| return self |
nit: use direct assignment
6f2b0d7 to
aa85657
Compare
@kevinjqliu Thanks for the (quick!) review. I've changed the fix a bit:
Please let me know if you want further changes. |
c739a0e to
00cfd92
Compare
Previously the pydantic @model_validator would fail because it assumed statistics was a model instance. In a "before"" validator that is not necessarily the case. Check type explicitly with isinstance instead, and handle `dict` case too.
00cfd92 to
fa456e3
Compare
kevinjqliu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Looks good, i found a small bug. Would be great to add a test for this case
| elif isinstance(stats, dict): | ||
| snapshot_id = stats.get("snapshot_id") | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| elif isinstance(stats, dict): | |
| snapshot_id = stats.get("snapshot_id") | |
| elif isinstance(stats, dict): | |
| snapshot_id = stats.get("snapshot_id") | |
| else: | |
| snapshot_id = None | |
nit: i think we can inline the else here
| if isinstance(stats, StatisticsFile): | ||
| snapshot_id = stats.snapshot_id | ||
| elif isinstance(stats, dict): | ||
| snapshot_id = stats.get("snapshot_id") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| snapshot_id = stats.get("snapshot_id") | |
| snapshot_id = stats.get("snapshot-id") |
i think this should be snapshot-id since before validator takes in json as input
iceberg-python/pyiceberg/table/statistics.py
Lines 32 to 40 in fa03e08
| class StatisticsCommonFields(IcebergBaseModel): | |
| """Common fields between table and partition statistics structs found on metadata.""" | |
| snapshot_id: int = Field(alias="snapshot-id") | |
| statistics_path: str = Field(alias="statistics-path") | |
| file_size_in_bytes: int = Field(alias="file-size-in-bytes") | |
| class StatisticsFile(StatisticsCommonFields): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you add a test case for this (and possibly one for the else case too)?
The current test only test the StatisticsFile instance branch
iceberg-python/tests/table/test_init.py
Lines 1370 to 1381 in fa03e08
| def test_set_statistics_update(table_v2_with_statistics: Table) -> None: | |
| snapshot_id = table_v2_with_statistics.metadata.current_snapshot_id | |
| blob_metadata = BlobMetadata( | |
| type="apache-datasketches-theta-v1", | |
| snapshot_id=snapshot_id, | |
| sequence_number=2, | |
| fields=[1], | |
| properties={"prop-key": "prop-value"}, | |
| ) | |
| statistics_file = StatisticsFile( |
Previously the pydantic
@model_validatoronSetStatisticsUpdatewould fail because it assumed statistics was a model instance. In a "before"" validator that is not the case.Use an "after" validator instead, where we can use instantiated and validated fields.
Before
After
Rationale for this change
Are these changes tested?
Yes, but only using the two-liners above.
Are there any user-facing changes?
No.