Skip to content

Add support for multi-line yaml values#34

Merged
hammady merged 6 commits into
masterfrom
tasks/8536
Oct 6, 2025
Merged

Add support for multi-line yaml values#34
hammady merged 6 commits into
masterfrom
tasks/8536

Conversation

@amamd07

@amamd07 amamd07 commented Oct 3, 2025

Copy link
Copy Markdown
Contributor

pyworker fails when the handler looks like the following. We fixed this by delegating attribute extraction to PyYaml after adding a constructor for unknown ruby classes.

object: !ruby/object:AiScreenerJob
  raw_attributes:
    id: 50
    status: 
    user_id: 976
    review_id: 5611
    filter: '{"file_ids":[7326],"decision":"undecided","decision_by":[679]}'
    inclusion: |-
      not 'Diabetes,asjdh,askdja

      dsa
    exclusion: 'Diabetes

      '```

@amamd07 amamd07 requested a review from hammady October 3, 2025 05:43
@amamd07

amamd07 commented Oct 3, 2025

Copy link
Copy Markdown
Contributor Author

This case will still break it

lines = [
    "key: 'texts'",
    "text continues'",
    "another_key: val"
]

But there must be a limit, because even if we fix that case. Then, what if there is a : in the 'text: continues'?

@hammady hammady changed the title Add support for multiple lines yaml values and tests Add support for multi-line yaml values and tests Oct 3, 2025
@hammady

hammady commented Oct 3, 2025

Copy link
Copy Markdown
Member

I tried a simpler approach:

import yaml


class IgnoreUnknownTagsLoader(yaml.SafeLoader):
    def ignore_unknown(self, node):
        return None

IgnoreUnknownTagsLoader.add_constructor(None, IgnoreUnknownTagsLoader.ignore_unknown)

with open("multi-line.yml", 'r') as file:
    attributes = file.read().splitlines()[2:]
    yaml_content = '\n'.join(['object:', '  attributes:'] + attributes)
    data = yaml.load(yaml_content, Loader=IgnoreUnknownTagsLoader)
data['object']['raw_attributes']

Output:

{'id': 50,
 'status': None,
 'user_id': 976,
 'review_id': 5611,
 'filter': '{"file_ids":[7326],"decision":"undecided","decision_by":[679]}',
 'inclusion': "not 'Diabetes,asjdh,askdja\n\ndsa",
 'exclusion': 'Diabetes\n',
 'error': None,
 'articles_count': None,
 'started_at': None,
 'finished_at': None,
 'failed_at': None,
 'post_response': None,
 'status_response': None,
 'skipped_articles': None,
 'completed_articles': None,
 'failed_articles': None,
 'include_count': None,
 'exclude_count': None,
 'unknown_count': None,
 'provider': None,
 'ai_model_name': None,
 'created_at': datetime.datetime(2025, 10, 2, 17, 51, 55, 59669, tzinfo=datetime.timezone.utc),
 'updated_at': datetime.datetime(2025, 10, 2, 17, 51, 55, 59669, tzinfo=datetime.timezone.utc)}

It avoid all our custom parsing logic except the job class name. Try it on failing tests. It delegates all parsing to the standard PyYaml library so should be more robust.

@amamd07

amamd07 commented Oct 3, 2025

Copy link
Copy Markdown
Contributor Author

I tried a simpler approach:

import yaml


class IgnoreUnknownTagsLoader(yaml.SafeLoader):
    def ignore_unknown(self, node):
        return None

IgnoreUnknownTagsLoader.add_constructor(None, IgnoreUnknownTagsLoader.ignore_unknown)

with open("multi-line.yml", 'r') as file:
    attributes = file.read().splitlines()[2:]
    yaml_content = '\n'.join(['object:', '  attributes:'] + attributes)
    data = yaml.load(yaml_content, Loader=IgnoreUnknownTagsLoader)
data['object']['raw_attributes']

Output:

{'id': 50,
 'status': None,
 'user_id': 976,
 'review_id': 5611,
 'filter': '{"file_ids":[7326],"decision":"undecided","decision_by":[679]}',
 'inclusion': "not 'Diabetes,asjdh,askdja\n\ndsa",
 'exclusion': 'Diabetes\n',
 'error': None,
 'articles_count': None,
 'started_at': None,
 'finished_at': None,
 'failed_at': None,
 'post_response': None,
 'status_response': None,
 'skipped_articles': None,
 'completed_articles': None,
 'failed_articles': None,
 'include_count': None,
 'exclude_count': None,
 'unknown_count': None,
 'provider': None,
 'ai_model_name': None,
 'created_at': datetime.datetime(2025, 10, 2, 17, 51, 55, 59669, tzinfo=datetime.timezone.utc),
 'updated_at': datetime.datetime(2025, 10, 2, 17, 51, 55, 59669, tzinfo=datetime.timezone.utc)}

It avoid all our custom parsing logic except the job class name. Try it on failing tests. It delegates all parsing to the standard PyYaml library so should be more robust.

That alone without the squashing is failing the tests. But it is a good addition. I will include it in the PR

@hammady hammady left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole point is to avoid manually extracting attributes (by calling extract_attributes). You don't need this function and you don't need any squashing (which is semantically incorrect by the way).

Comment thread pyworker/job.py
Comment thread pyworker/job.py Outdated
@hammady hammady requested a review from Copilot October 3, 2025 23:07

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Adds support for parsing multi-line YAML values in job handlers by refactoring the YAML parsing logic to handle quoted strings that span multiple lines and updating test fixtures to validate this functionality.

  • Removed manual attribute extraction function and simplified to use handler lines directly
  • Added custom YAML loader classes to handle Ruby objects and unknown tags
  • Updated attribute extraction to work with raw_attributes instead of attributes

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
pyworker/job.py Refactored YAML parsing with custom loaders and simplified attribute extraction
tests/fixtures/handler_registered.yaml Added multi-line string test case with quoted value
tests/test_job.py Updated test expectations to include new multi-line attribute

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment thread pyworker/job.py Outdated
Comment thread pyworker/job.py Outdated

@hammady hammady left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check copilot comments.

@hammady hammady changed the title Add support for multi-line yaml values and tests Add support for multi-line yaml values Oct 6, 2025
@hammady hammady merged commit 662a3ac into master Oct 6, 2025
5 checks passed
@hammady hammady deleted the tasks/8536 branch October 6, 2025 20:34
Comment thread pyworker/job.py
# Construct mapping normally, ignoring Ruby-specific tags
return loader.construct_mapping(node)

yaml.SafeLoader.add_multi_constructor("!ruby/object:", no_ruby_objects)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be applied universally..

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I understand. In the previous commit, it was adding it to our custom loader (even though the name of that loader IgnoreUnknownTagsLoader was misleading). I see no problem in ignoring ruby classes in the Python runtime of the application using pyworker. Remember that we are not ignoring all unknown tags, but we are simply mapping ruby objects to simple python dicts regardless of their ruby class names.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants