pbspark does not work with protobuf >= v5.26.1 #49

@pw42020

Description of Issue

The pyproject.toml declares protobuf >=3.20.0 as the version range compatible with this library. However, the library relies on the including_default_value_fields argument of the json_format._Printer class from google.protobuf. In protobuf version 5.26.1, including_default_value_fields was replaced by always_print_fields_with_no_presence (commit 7d43131), so constructing the printer with the old argument now raises a TypeError.

Steps to Reproduce

poetry install
poetry add protobuf==5.26.1
# test.py
from pyspark.sql.session import SparkSession
from example.example_pb2 import SimpleMessage
from pbspark import from_protobuf
from pbspark import to_protobuf

spark = SparkSession.builder.getOrCreate()

example = SimpleMessage(name="hello", quantity=5, measure=12.3)
data = [{"value": example.SerializeToString()}]
df_encoded = spark.createDataFrame(data)

df_decoded = df_encoded.select(from_protobuf(df_encoded.value, SimpleMessage).alias("value"))
df_expanded = df_decoded.select("value.*")
df_expanded.show()

# +-----+--------+-------+
# | name|quantity|measure|
# +-----+--------+-------+
# |hello|       5|   12.3|
# +-----+--------+-------+

df_reencoded = df_decoded.select(to_protobuf(df_decoded.value, SimpleMessage).alias("value"))
# run below
poetry run python test.py
Traceback (most recent call last):
  File "/home/project/test.py", line 14, in <module>
    df_expanded.show()
  File "/usr/local/lib/python3.12/site-packages/pyspark/sql/dataframe.py", line 947, in show
    print(self._show_string(n, truncate, vertical))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pyspark/sql/dataframe.py", line 965, in _show_string
    return self._jdf.showString(n, 20, vertical)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/py4j/java_gateway.py", line 1322, in __call__
    return_value = get_return_value(
                   ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pyspark/errors/exceptions/captured.py", line 185, in deco
    raise converted from None
pyspark.errors.exceptions.captured.PythonException: 
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "/home/project/pbspark/_proto.py", line 343, in decoder
    return self.message_to_dict(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/project/pbspark/_proto.py", line 227, in message_to_dict
    printer = _Printer(
              ^^^^^^^^^
  File "/home/project/pbspark/_proto.py", line 79, in __init__
    super().__init__(**kwargs)
TypeError: _Printer.__init__() got an unexpected keyword argument 'including_default_value_fields'

Are there plans to update this library to work with newer versions of protobuf? Would the maintainers be opposed to my making the changes on a separate branch?
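For discussion, here is a minimal sketch of one way the change could look: inspect the printer's constructor and pass whichever keyword it accepts. The helper names (default_fields_kwarg, printer_kwargs) and the two stand-in classes are hypothetical and are not part of pbspark or protobuf; in the library, the real json_format._Printer would be passed in instead. The two flags may also not be exact equivalents (as I understand it, the newer one only covers fields without presence), so this is an assumption to verify, not a finished fix.

```python
import inspect

OLD_KWARG = "including_default_value_fields"
NEW_KWARG = "always_print_fields_with_no_presence"


def default_fields_kwarg(printer_cls):
    """Return the default-fields keyword that printer_cls.__init__ accepts."""
    params = inspect.signature(printer_cls.__init__).parameters
    if NEW_KWARG in params:
        return NEW_KWARG
    if OLD_KWARG in params:
        return OLD_KWARG
    raise TypeError(
        f"{printer_cls.__name__} accepts neither {OLD_KWARG!r} nor {NEW_KWARG!r}"
    )


def printer_kwargs(printer_cls, include_defaults=True, **extra):
    """Build constructor kwargs using the keyword name this class supports."""
    return {default_fields_kwarg(printer_cls): include_defaults, **extra}


# Hypothetical stand-ins for the old and new _Printer signatures,
# used here only so the sketch runs without protobuf installed.
class OldStylePrinter:
    def __init__(self, including_default_value_fields=False,
                 preserving_proto_field_name=False):
        self.include_defaults = including_default_value_fields


class NewStylePrinter:
    def __init__(self, always_print_fields_with_no_presence=False,
                 preserving_proto_field_name=False):
        self.include_defaults = always_print_fields_with_no_presence


for cls in (OldStylePrinter, NewStylePrinter):
    kwargs = printer_kwargs(cls, preserving_proto_field_name=True)
    printer = cls(**kwargs)
    print(cls.__name__, sorted(kwargs), printer.include_defaults)
```

Checking the signature rather than the installed protobuf version avoids hard-coding a version cutoff, at the cost of depending on the private _Printer constructor, which pbspark already does.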
