[BRAINSTORM] refactor to use MOF objects as inputs  #421

@kjappelbaum

Description

I'm thinking of switching to a different design pattern to make the implementation of new featurizers, especially based on fragments, graphs, and building blocks, easier.

I started introducing more and more conditionals on the input types for featurizers, be it graphs, molecules, or structures, which makes the code really complicated -- especially if we also want to cache as much as possible.

All of this would be much easier if the inputs to the featurizers were always MOF objects. I already started implementing MOF objects for mofchecker and moffragmentor, and they typically have (memoized) methods for computing graph/fragments/hashes/...

If we were to use them, the object would simply "cache" the graph and the featurizer would no longer need to worry about it. Also, each featurizer could simply get the object attributes it needs (and the caller would not need to worry about what to provide, i.e., this logic is only implemented in one place).

In the long run, this might also make the implementation of the datasets easier as, again, the logic is only on the structure object and not partially re-implemented on the datasets.
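The caching behavior described above can be sketched with `functools.cached_property`. This is a minimal illustration; the `MOF` class, the string "structure", and the counter here are placeholders for the illustration, not the actual mofdscribe implementation:

```python
from functools import cached_property


class MOF:
    """Placeholder input object: derived representations are computed once, then cached."""

    def __init__(self, structure):
        self.structure = structure
        self.n_graph_computations = 0  # instrumentation for this illustration only

    @cached_property
    def structure_graph(self):
        # The expensive derivation runs on first access only; featurizers
        # just read the attribute and never manage caching themselves.
        self.n_graph_computations += 1
        return ("graph-of", self.structure)


mof = MOF("my-structure")
g1 = mof.structure_graph  # computed here
g2 = mof.structure_graph  # returned from the instance cache
assert g1 is g2 and mof.n_graph_computations == 1
```

Because `cached_property` stores the result on the instance, a featurizer that reads `mof.structure_graph` gets memoization for free, with no `lru_cache` bookkeeping of its own.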

Advantages

  • Fewer conditionals for making the control flow in featurizers dependent on the input type
  • Less logic for caching --- the input objects can cache it
  • Common design pattern with mofchecker and moffragmentor. Could, perhaps, be factored out into its own very thin library that just provides the API (to be seen)
  • MultipleFeaturizer would also no longer need to handle so much logic about primitive structures
  • We could relatively easily serialize fragments and features in the same workflow without computational overhead
  • We can get rid of things such as MOFBBs objects
  • #420 (make lru cache sizes customizable) would close itself because we would not need lru_cache anymore
  • Featurizers such as RACs would no longer need their own nearest-neighbor search logic
  • There would no longer be a need for a MOF base featurizer

Disadvantages

  • We would need a very thin wrapper for interoperability with matminer, i.e., we would basically always extract the structure attribute and then pass this to pymatgen
  • It would be a breaking change for the current API: we would need to update all docs, tutorials, and the examples in the paper (however, now might be a good time to do it, as there is not yet that much use)
  • The BU featurizer would need to follow a different pattern and would still, in some way, need an annotation of which representation the wrapped featurizer operates on
  • The BU featurizer still seems difficult to couple with a MultipleFeaturizer over mixed types. The recommended usage pattern would probably be to wrap BUFeaturizers in a MultipleFeaturizer if needed, so that each BUFeaturizer always operates on only one type, which we can easily extract and pass to the wrapped featurizer. Caching of the fragments is no problem, as it is handled by the MOF object
  • Many featurizers do not work only on MOFs; in this sense the naming (also of the library) is misleading
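The thin matminer wrapper mentioned above could be as small as extracting the structure attribute before delegating. A hedged sketch (the class names `MatminerAdapter`, `FakeMOF`, and `FakeMatminerFeaturizer` are assumptions for illustration, not the final API):

```python
class MatminerAdapter:
    """Hypothetical thin wrapper: unwrap the pymatgen Structure from the
    MOF object and delegate to a matminer-style featurizer."""

    def __init__(self, matminer_featurizer):
        self.matminer_featurizer = matminer_featurizer

    def featurize(self, mof):
        # matminer featurizers expect a pymatgen Structure, so we always
        # extract the .structure attribute before delegating.
        return self.matminer_featurizer.featurize(mof.structure)


# Toy stand-ins to demonstrate the delegation (not real matminer/pymatgen objects).
class FakeMOF:
    def __init__(self, structure):
        self.structure = structure


class FakeMatminerFeaturizer:
    def featurize(self, structure):
        return [len(structure)]


feats = MatminerAdapter(FakeMatminerFeaturizer()).featurize(FakeMOF("abc"))
```

The adapter adds no logic of its own, which is why the interoperability cost stays low.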

Usage example

from mofdscribe import MOF
mof = MOF.from_file("some_file.cif")
bu = BUFeaturizer(LSOP())
feats = bu.featurize(mof)

Implementation idea

from functools import cached_property

class MOF:
    def __init__(self, structure, structure_graph):
        self.structure = structure
        self.structure_graph = structure_graph

    @classmethod
    def from_structure(cls, structure):
        structure_graph = get_structure_graph(structure)
        return cls(structure, structure_graph)

    @classmethod
    def from_file(cls, filename):
        ...

    @cached_property
    def fragments(self):
        ...

    @cached_property
    def scaffold_hash(self):
        ...

The featurizers would have the following signatures:

class GraphFeaturizer: 
    def featurize(self, mof): 
        structure_graph = mof.structure_graph
        return self._featurize(structure_graph)
        
class StructureFeaturizer: 
    def featurize(self, mof): 
        structure = mof.structure
        return self._featurize(structure)

MatminerFeaturizer = StructureFeaturizer

class SiteFeaturizer: 
    def featurize(self, mof, i): 
        structure = mof.structure
        structure_graph = mof.structure_graph
        return self._featurize(structure, structure_graph, i)

class BUFeaturizer: 
    def __init__(self, featurizer):
        if isinstance(featurizer, MultipleFeaturizer): 
            raise ValueError("featurizer must be a single featurizer")
        self.featurizer = featurizer
    
    def featurize(self, mof): 
        fragments = mof.fragments
        ...

In the last case, the BUFeaturizer can inspect whether it needs to get the molecules, structures, or graphs from the fragments, and then call the _featurize method of the featurizer passed in the constructor.
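This dispatch could be sketched as follows. A hedged illustration, simplified to take the fragment list directly rather than a full MOF object; the `operates_on` attribute, the `Fragment` class, and the toy `MoleculeFeaturizer` are assumptions, not the actual mofdscribe API:

```python
class MoleculeFeaturizer:
    # The wrapped featurizer declares which representation it needs
    # (assumed convention for this sketch).
    operates_on = "molecule"

    def _featurize(self, molecule):
        return [len(molecule)]  # toy feature


class Fragment:
    def __init__(self, molecule, graph):
        self.molecule = molecule
        self.graph = graph


class BUFeaturizer:
    def __init__(self, featurizer):
        self.featurizer = featurizer

    def featurize(self, fragments):
        feats = []
        for frag in fragments:
            # Extract exactly the representation the wrapped featurizer
            # declared, then delegate to its _featurize method.
            representation = getattr(frag, self.featurizer.operates_on)
            feats.extend(self.featurizer._featurize(representation))
        return feats


bu = BUFeaturizer(MoleculeFeaturizer())
feats = bu.featurize([Fragment("CCO", None), Fragment("CO", None)])
```

The type annotation lives on the wrapped featurizer, so the BUFeaturizer itself stays free of per-type conditionals.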

The MOFMultipleFeaturizer should probably always loop over structures for featurize_many.

Also, if I'd do something like featurizer = MOFMultipleFeaturizer([BUFeaturizer(LSOP()), BUFeaturizer(Dimensionality()), RACS()]) it should work without problems as all featurizers always accept MOF objects.
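The structure-first loop for featurize_many could look like this minimal sketch (names and the toy featurizers `A` and `B` are assumptions; real featurizers would accept MOF objects as described above):

```python
class MOFMultipleFeaturizer:
    def __init__(self, featurizers):
        self.featurizers = featurizers

    def featurize(self, mof):
        # Concatenate the feature vectors of all wrapped featurizers.
        feats = []
        for featurizer in self.featurizers:
            feats.extend(featurizer.featurize(mof))
        return feats

    def featurize_many(self, mofs):
        # Loop over structures (not featurizers) so that each MOF object's
        # cached graph/fragments are reused by every featurizer before
        # moving on to the next structure.
        return [self.featurize(mof) for mof in mofs]


class A:
    def featurize(self, mof):
        return [1.0]


class B:
    def featurize(self, mof):
        return [2.0, 3.0]


multi = MOFMultipleFeaturizer([A(), B()])
rows = multi.featurize_many(["mof1", "mof2"])
```

Looping structure-first is what makes the per-object caching pay off: the graph is computed once per MOF, no matter how many featurizers are wrapped.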

from mofdscribe import MOF 
from fastcore.xtras import save_pickle 

for file in files: 
    mof = MOF.from_file(file) 
    fragments = mof.fragments
    save_pickle(f"{file}.fragments.pkl", fragments)  # save_pickle(fn, o)
    features = featurizer.featurize(mof)

    Labels

    enhancement (New feature or request), question (Further information is requested)
