Skip to content

Fails to parse round-trip string objects from ruamel.yaml #36

@Erotemic

Description

@Erotemic

Describe the bug

When parsing text from a YAML file in ruamel.YAML it it returns a SingleQuotedScalarString object, which does inherit from the str type. Sending this string to the pure-python lark parser seems to work fine, but when sending it to the cython variant it throws a TypeError.

To Reproduce

The following is a MWE that reproduces the issue:

"""
Requirements:
    pip install ruamel.yaml lark-cython lark
"""
import io
import ruamel.yaml


NEW_RUAMEL = 1


class _YamlRepresenter:

    @staticmethod
    def str_presenter(dumper, data):
        # https://stackoverflow.com/questions/8640959/how-can-i-control-what-scalar-form-pyyaml-uses-for-my-data
        if len(data.splitlines()) > 1 or '\n' in data:
            text_list = [line.rstrip() for line in data.splitlines()]
            fixed_data = '\n'.join(text_list)
            return dumper.represent_scalar('tag:yaml.org,2002:str', fixed_data, style='|')
        return dumper.represent_scalar('tag:yaml.org,2002:str', data)


def _custom_new_ruaml_yaml_obj():
    """
    References:
        https://stackoverflow.com/questions/59635900/ruamel-yaml-custom-commentedmapping-for-custom-tags
        https://stackoverflow.com/questions/528281/how-can-i-include-a-yaml-file-inside-another
        https://stackoverflow.com/questions/76870413/using-a-custom-loader-with-ruamel-yaml-0-15-0
    """

    # make a new instance, although you could get the YAML
    # instance from the constructor argument
    class CustomConstructor(ruamel.yaml.constructor.RoundTripConstructor):
        ...

    class CustomRepresenter(ruamel.yaml.representer.RoundTripRepresenter):
        ...

    CustomRepresenter.add_representer(str, _YamlRepresenter.str_presenter)
    yaml_obj = ruamel.yaml.YAML()
    yaml_obj.Constructor = CustomConstructor
    yaml_obj.Representer = CustomRepresenter
    yaml_obj.preserve_quotes = True
    yaml_obj.width = float('inf')
    return yaml_obj


def codeblock(text):
    """
    Create a block of text that preserves all newlines and relative indentation
    """
    import textwrap
    return textwrap.dedent(text).strip('\n')


# For common constructs see:
# https://github.com/lark-parser/lark/blob/master/lark/grammars/common.lark
RESOLUTION_GRAMMAR_PARTS = codeblock(
    '''
    // Resolution parts of the grammar.
    magnitude: NUMBER

    unit: WORD

    numeric_unit: (magnitude WS* unit)
    implicit_unit: unit

    resolved_unit: numeric_unit | implicit_unit

    %import common.NUMBER
    %import common.WS
    %import common.WORD
    ''')

RESOLVED_UNIT_GRAMMAR = codeblock(
    r'''
    // RESOLVED WINDOW GRAMMAR. Eg. 2GSD
    ?start: resolved_unit
    ''') + '\n' + RESOLUTION_GRAMMAR_PARTS


def main():
    yaml_obj = _custom_new_ruaml_yaml_obj()
    file = io.StringIO("{key: '1mGSD'}")
    data = yaml_obj.load(file)
    text = data['key']

    # https://github.com/lark-parser/lark/blob/master/docs/_static/lark_cheatsheet.pdf
    import lark
    try:
        import lark_cython
        parser = lark.Lark(RESOLVED_UNIT_GRAMMAR, start='start', parser='lalr', _plugins=lark_cython.plugins)
    except ImportError:
        parser = lark.Lark(RESOLVED_UNIT_GRAMMAR, start='start', parser='lalr')

    print(f'{type(text)=}')
    print(f'{text.__class__.__mro__=}')

    parser.parse(text)


if __name__ == '__main__':
    """
    CommandLine:
        python ~/code/lark_cython/tests/test_yaml.py
    """
    main()

The type information it prints before it fails is:

type(text)=<class 'ruamel.yaml.scalarstring.SingleQuotedScalarString'>
text.__class__.__mro__=(<class 'ruamel.yaml.scalarstring.SingleQuotedScalarString'>, <class 'ruamel.yaml.scalarstring.ScalarString'>, <class 'str'>, <class 'object'>)

I've tested with versions:

3.11.9 (main, May 14 2024, 08:04:54) [GCC 12.2.0]
ruamel.yaml.__version__ = 0.18.6
lark.__version__ = 1.1.9
lark_cython.__version__ = 0.0.15

And

3.11.9 (main, May 13 2024, 14:03:39) [GCC 11.4.0]
ruamel.yaml.__version__ = 0.17.22
lark.__version__ = 1.1.7
lark_cython.__version__ = 0.0.15

My thought is that cython would handle a class that inherits from a str, but perhaps it doesn't I'm not sure if this can be fixed on the lark-cython side, but I figured it was worth reporting.

My current workarond is to do something like this:

        try:
            tree = parser.parse(text)
        except TypeError:
            if isinstance(text, str) and type(text) is not str:
                # We could be in a case where cython is failing to handle
                # overloaded string types. Try casting to a regular str.
                tree = parser.parse(str(text))
            else:
                raise

Metadata

Metadata

Assignees

No one assigned

    Labels

    questionFurther information is requested

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions