topology featurizers debug#460
Conversation
Reviewer's Guide by SourceryThis pull request addresses a bug in the topology featurizers of the mofdscribe library. The main changes involve modifying how atom types are handled, introducing a new function for filtering elements in persistence homology calculations, and adjusting data types to improve robustness. The changes span across multiple files in the project, primarily focusing on the topology-related modules. File-Level Changes
Tips
|
There was a problem hiding this comment.
Hey @XiaoqZhang - I've reviewed your changes and they look great!
Here's what I looked at during the review
- 🟢 General issues: all looks good
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟡 Complexity: 1 issue found
- 🟢 Documentation: all looks good
Help me be more useful! Please click 👍 or 👎 on each comment to tell me if it was helpful.
| return Molecule.from_sites(keep_sites) | ||
|
|
||
|
|
||
| def filter_element_for_ph( |
There was a problem hiding this comment.
issue (complexity): Consider refactoring to consolidate logic into a single function.
The new code introduces unnecessary complexity by duplicating logic in the filter_element_for_ph function, which increases maintenance overhead. The use of elements_group as a tuple doesn't add functional value and makes the code harder to read. Additionally, the handling of elements is inconsistent, which could lead to bugs. Consider refactoring to consolidate the logic into a single function, using list comprehensions for clarity, and ensuring consistent handling of elements. This will improve readability and maintainability.
There was a problem hiding this comment.
can you make a suggestion?
There was a problem hiding this comment.
Certainly! To address the complexity issue, you could refactor the filter_element and filter_element_for_ph functions to share common logic. Here's a suggestion:
-
Create a Helper Function: Extract the common logic into a helper function that both
filter_elementandfilter_element_for_phcan call. This will reduce duplication and make the code easier to maintain. -
Use List Comprehensions: If possible, use list comprehensions to make the filtering logic more concise and readable.
-
Consistent Handling of
elements: Ensure that the handling ofelementsis consistent across both functions. This might involve standardizing the input format or processing steps.
Here's a rough example of how you might structure the refactoring:
def filter_elements_common(sites, elements, elements_group):
return [site for site in sites if site.specie.symbol in elements or site.specie.symbol in elements_group]
def filter_element(structure, elements, elements_group=()):
keep_sites = filter_elements_common(structure.sites, elements, elements_group)
return Structure.from_sites(keep_sites) if isinstance(structure, Structure) else Molecule.from_sites(keep_sites)
def filter_element_for_ph(structure, elements, elements_group=()):
# If there's any specific logic for persistence homology, handle it here
keep_sites = filter_elements_common(structure.sites, elements, elements_group)
return Structure.from_sites(keep_sites) if isinstance(structure, Structure) else Molecule.from_sites(keep_sites)This approach keeps the core filtering logic in one place, making it easier to update and maintain. If filter_element_for_ph requires additional specific logic, you can still handle that separately within its function. Let me know if you need further clarification or assistance!
|
interesting, thanks for looking into it! |
There was a problem hiding this comment.
I agree that deepsource is not heavily used at the moment, but why are those changes needed for this PR?
| xyz_periodic_copies.append(coords) | ||
| element_copies.append(np.array(elements).reshape(-1,1)) |
There was a problem hiding this comment.
ok, seems like a good catch, because the x==y==z==0 condition is skipped atm
| def __init__( | ||
| self, | ||
| atom_types: Tuple[str] = ( | ||
| atom_types: List[str] = [ |
There was a problem hiding this comment.
Tuple was originally used because it is immutable and hence a better default for arguments.
But since we do not change the atom_types this is probably a safe choice
| elements_ = [] | ||
| elements_group = (elements,) | ||
| for atom_type in elements_group: |
There was a problem hiding this comment.
i'm not sure I understand this change. You are certainly right that the old code behaves in an unexpected way for elements with more than two symbols in the letter. But wouldn't then the clearer fix be something else? For example, always split by - first and then handle other things
Hi, I found a bug in the topology featurizers. The original way of splitting
atom_typeswill split'Cu'to['C', 'u'], for example. Besides, thetupletype is error-prone to the last,as highlighted by the red circle in the figure attached.Thus, I made several changes accordingly:
atom_typesfromtupletolist;filter_element_for_ph, which is a modified version offilter_elementfor persistence homology. This function is added instead of modifying the original function becausefilter_elementis also called bymofdscribe.featurizers.chemistry.AMDSummary by Sourcery
Fix the atom type splitting issue in topology featurizers by changing the data type from tuple to list and introduce a new function
filter_element_for_phfor persistence homology. Uncomment previously commented lines in_tda_helpers.pyto restore functionality.Bug Fixes:
Enhancements:
filter_element_for_phto handle element filtering specifically for persistence homology, ensuring compatibility with existing functionality.