refactor: remove redundant fields from TaskSolution and ValidationResult#50
Conversation
There was a problem hiding this comment.
Pull request overview
This PR removes redundant data duplication in TaskSolution and ValidationResult dataclasses by eliminating direct task_id and task string fields that duplicated information from nested objects. The refactor maintains backward compatibility through convenience properties and updates documentation to clarify field purposes.
Changes:
- Removed redundant
task_idandtaskfields fromTaskSolution, keeping only thetaskobject (renamed fromtask_obj) - Removed redundant
task_idandtaskfields fromValidationResult, relying on the nestedtask_solutionobject - Added convenience properties (
task_idandtask_text) to both classes for backward-compatible access
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/schemas/solution_schemas.py | Refactored TaskSolution to remove redundant fields and rename task_obj to task, added convenience properties |
| src/schemas/validation_schemas.py | Refactored ValidationResult to remove redundant fields, added convenience properties |
| src/schemas/GENERATION_PIPELINE_SCHEMAS.md | Updated documentation to reflect schema changes and clarify the reasoning field purpose |
| src/base_stages/solve_tasks.py | Updated TaskSolution instantiation to use new schema structure |
| src/base_stages/validate_tasks.py | Updated references from task_obj to task and from task to task_text |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| - `task_obj`: Task (required, Task dataclass object with full hierarchy) | ||
| - `task`: Task (required, Task dataclass object with full hierarchy including task_id, task text, capability, etc.) | ||
| - `solution`: String (required, the final answer/solution to the task) | ||
| - `reasoning`: String (required, the step-by-step explanation generated by the solver showing how the solution was derived. This is the LLM's chain-of-thought when solving the task, NOT Bloom's taxonomy level which is stored in `task.bloom_level`) |
There was a problem hiding this comment.
The clarification about Bloom's taxonomy appears defensive and may indicate past confusion. Consider simplifying to focus on what 'reasoning' IS rather than what it is NOT. For example: 'the step-by-step explanation generated by the solver showing how the solution was derived (chain-of-thought)'
| - `reasoning`: String (required, the step-by-step explanation generated by the solver showing how the solution was derived. This is the LLM's chain-of-thought when solving the task, NOT Bloom's taxonomy level which is stored in `task.bloom_level`) | |
| - `reasoning`: String (required, the step-by-step explanation generated by the solver showing how the solution was derived; this is the model's chain-of-thought for the task. The Bloom's taxonomy level for the task is stored separately in `task.bloom_level`.) |
afkanpour
left a comment
There was a problem hiding this comment.
@afkanpour reviewed 5 files and all commit messages, and made 2 comments.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @kohankhaki and @saidul-islam98).
src/schemas/validation_schemas.py line 26 at r1 (raw file):
def task_id(self) -> str: """Get task_id from the task_solution for convenience.""" return self.task_solution.task_id
Does task_solution contain task_id?
Code quote:
task_idsrc/schemas/solution_schemas.py line 31 at r1 (raw file):
def task_text(self) -> str: """Get task text from the task object for convenience.""" return self.task.task
Does this refer to the problem statement? If so, can you rename it to task_statement?
Code quote:
task
kohankhaki
left a comment
There was a problem hiding this comment.
@kohankhaki made 2 comments.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @afkanpour and @saidul-islam98).
src/schemas/solution_schemas.py line 31 at r1 (raw file):
Previously, afkanpour (Arash) wrote…
Does this refer to the problem statement? If so, can you rename it to task_statement?
Updated all of the task variable name to task_statement wherever we are referring to problem statement
src/schemas/validation_schemas.py line 26 at r1 (raw file):
Previously, afkanpour (Arash) wrote…
Does task_solution contain task_id?
Yes, TaskSolution has a task_id property that returns self.task.task_id.
37e01b2 to
68d9e4c
Compare
There was a problem hiding this comment.
For our current pipeline, we are storing the following information in each tasks:
{
"task_id": "Easy_Remember__task_000",
"task_statement": "<task_statement>",
"choices": {
"A": "<option_text>",
"B": "<option_text>",
"C": "<option_text>",
"D": "<option_text>"
},
"correct_answer": "<correct answer option>",
"chapter_id": "<domain_book_name_chapter_number>",
"chapter_relpath": "<bookname/domain_book_name_chapter_number>", # relative path of the chapter
"difficulty": "Easy/Medium/Hard",
"blooms_level": "Remember/Understand/Apply/Analyze/Evaluate/Create",
}
Could you please add "correct_answer", "chapter_id", and "chapter_relpath" fields? (Or maybe I can add and update the schema them in my PR if you think its more appropriate). Also, can the choices be just a dictionary (as there can only be 4 choices in an MCQ question)?
There was a problem hiding this comment.
I think the schema should stay as-is to keep it general, but let me know what you think:
Re: choices format
The current List[Dict] format ([{"label": "A", "solution": "..."}]) is more flexible—it allows adding metadata per choice if needed later (e.g., explanation for why an option is wrong). A simple dict works for basic MCQ but limits extensibility.
Re: correct_answer
This belongs in TaskSolution, not Task. The Task represents the problem before it's solved (Stage 3). The correct answer is determined in Stage 4 (solution generation) and stored in TaskSolution.solution.
Re: chapter_id, chapter_relpath
These are domain-specific (textbook structure). Use generation_metadata for these:
Task(
task_id="Easy_Remember__task_000",
task_statement="...",
choices=[{"label": "A", "solution": "..."}, ...],
difficulty="Easy",
bloom_level="Remember",
generation_metadata={
"chapter_id": "domain_book_name_chapter_number",
"chapter_relpath": "bookname/domain_book_name_chapter_number",
}
)There was a problem hiding this comment.
Got it, I will update my saving function accordingly. Thanks.
There was a problem hiding this comment.
You should be able to use the standard save_tasks(), load_tasks() etc. from io_utils.py - no need for a custom saving function.
The only difference is how you construct the objects:
task = Task(
task_id="Easy_Remember__task_000",
task_statement="...",
choices=[{"label": "A", "solution": "..."}, ...],
difficulty="Easy",
bloom_level="Remember",
capability=capability,
generation_metadata={
"chapter_id": "...",
"chapter_relpath": "...",
}
)
# Standard save/load works
save_tasks([task], metadata, path)
tasks, metadata = load_tasks(path)
# Access your custom fields via generation_metadata
chapter_id = tasks[0].generation_metadata.get("chapter_id")
afkanpour
left a comment
There was a problem hiding this comment.
@afkanpour reviewed 7 files and all commit messages, and resolved 2 discussions.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @kohankhaki and @saidul-islam98).
PR Type
Fix
Short Description
Remove redundant
task_idandtaskstring fields fromTaskSolutionandValidationResultdataclasses. These fields duplicated information already available in the nested objects (task_objandtask_solution). Added convenience properties to maintain backward compatibility for accessing these values. Also clarified documentation for thereasoningfield to distinguish it from Bloom's taxonomy level.Tests Added
None
This change is