I'm thinking of switching to a different design pattern to make the implementation of new featurizers, especially based on fragments, graphs, and building blocks, easier.
I started introducing more and more conditionals on the input types for featurizers (graphs, molecules, or structures), which makes the code really complicated, especially if we also want to cache as much as possible.
All of this would be much easier if the inputs for the featurizers were always MOF objects. I already started implementing MOF objects for mofchecker and moffragmentor, and they typically have (memoized) methods for computing graph/fragments/hashes/...
If we were to use them, the object would simply "cache" the graph and the featurizer would no longer need to worry about it. Also, the featurizers could simply get the object attributes they need (and the caller would not need to worry about what to provide, i.e., this logic is only implemented in one place).
In the long run, this might also make the implementation of the datasets easier as, again, the logic is only on the structure object and not partially re-implemented on the datasets.
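A minimal sketch of the caching idea, assuming a hypothetical `MOF` class (the actual objects in mofchecker and moffragmentor may look different). `functools.cached_property` stands in for the memoized methods, and `build_graph` is a made-up placeholder for the expensive graph construction:

```python
from functools import cached_property


def build_graph(structure):
    """Placeholder for an expensive neighbor search; returns a toy bond list."""
    build_graph.calls += 1
    return [(i, i + 1) for i in range(len(structure) - 1)]


build_graph.calls = 0  # call counter, only here to make the caching visible


class MOF:
    """Hypothetical input object that owns the structure and its derived data."""

    def __init__(self, structure):
        self.structure = structure

    @cached_property
    def structure_graph(self):
        # Computed on first access, then stored on the instance.
        return build_graph(self.structure)


class GraphFeaturizer:
    def featurize(self, mof):
        return self._featurize(mof.structure_graph)

    def _featurize(self, structure_graph):
        return [len(structure_graph)]  # toy feature: number of bonds


mof = MOF(["Zn", "O", "C", "C"])
first = GraphFeaturizer().featurize(mof)
second = GraphFeaturizer().featurize(mof)  # reuses the graph cached on `mof`
```

The featurizer itself holds no caching logic; two featurizer instances reading the same `mof` trigger only one graph construction.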
Advantages
Fewer conditionals for making the control flow in featurizers dependent on the input type
Less caching logic in the featurizers: the input objects can do the caching
Common design pattern with mofchecker and moffragmentor. Could, perhaps, be factored out into its own very thin library that just provides the API (to be seen)
MultipleFeaturizer would also no longer need to handle so much logic about primitive structures
We could relatively easily serialize fragments and features in the same workflow without computational overhead
Featurizers such as RACs would no longer need to have their own nearest-neighbor search
There is no need anymore for a MOF base featurizer.
Disadvantages
We would need a very thin wrapper for interoperability with matminer, i.e., we would basically always extract the structure attribute and then pass this to pymatgen
It would be a breaking change for the current API: need to change all docs, tutorials and the examples in the paper (however, doing it now might be a good time as there is not that much use yet)
BU featurizer would need to follow a different pattern and would still, in some way, need annotation on what type the featurizers extract
BU featurizer seems still difficult to couple with MultipleFeaturizer with mixed types. Probably, the recommended usage pattern would be to use a MultipleFeaturizer around BUFeaturizers if needed and the BU featurizer always only operates on one type which we could then easily extract and pass to the featurizer. Caching of the fragments is no problem as it is handled by the MOF object
Many featurizers do not only work on MOFs. In this sense, the naming (also of the library) is misleading
The featurizers would have the following signatures
class GraphFeaturizer:
    def featurize(self, mof):
        structure_graph = mof.structure_graph
        return self._featurize(structure_graph)


class StructureFeaturizer:
    def featurize(self, mof):
        structure = mof.structure
        return self._featurize(structure)


MatminerFeaturizer = StructureFeaturizer


class SiteFeaturizer:
    def featurize(self, mof, i):
        structure = mof.structure
        structure_graph = mof.structure_graph
        return self._featurize(structure, structure_graph, i)


class BUFeaturizer:
    def __init__(self, featurizer):
        if isinstance(featurizer, MultipleFeaturizer):
            raise ValueError("featurizer must be a single featurizer")
        self._featurizer = featurizer  # kept so featurize can call its _featurize

    def featurize(self, mof):
        fragments = mof.fragments
In the last case, the BUFeaturizer can inspect whether the wrapped featurizer needs the molecules, structures, or graphs from the fragments, and then call the _featurize method of the featurizer passed in the constructor.
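One way this dispatch could look, as a sketch only. The `input_type` class attribute, the `Fragment` container with its `molecule`/`structure`/`structure_graph` attributes, and the toy `NumAtoms` featurizer are all assumptions made for illustration; none of them are existing API:

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class Fragment:
    """Hypothetical building block exposing several representations."""

    molecule: list
    structure: list
    structure_graph: list


class NumAtoms:
    """Toy featurizer that declares which representation it operates on."""

    input_type = "molecule"  # assumed annotation, not an existing attribute

    def _featurize(self, molecule):
        return [len(molecule)]


class BUFeaturizer:
    def __init__(self, featurizer):
        self._featurizer = featurizer

    def featurize(self, mof):
        # Ask the wrapped featurizer which representation it needs,
        # falling back to "structure" if it does not say.
        kind = getattr(self._featurizer, "input_type", "structure")
        return [
            self._featurizer._featurize(getattr(fragment, kind))
            for fragment in mof.fragments
        ]


# A stand-in for a MOF object that already holds its (cached) fragments.
mof = SimpleNamespace(
    fragments=[
        Fragment(molecule=["C", "C", "O"], structure=[], structure_graph=[]),
        Fragment(molecule=["Zn"], structure=[], structure_graph=[]),
    ]
)
features = BUFeaturizer(NumAtoms()).featurize(mof)
```

This returns one feature list per fragment; how to aggregate them across building blocks is left open here.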
The MOFMultipleFeaturizer should probably always loop over structures for featurize_many.
Also, if I did something like featurizer = MOFMultipleFeaturizer([BUFeaturizer(LSOP()), BUFeaturizer(Dimensionality()), RACS()]), it should work without problems, as all featurizers always accept MOF objects.
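A sketch of how such a `MOFMultipleFeaturizer` might loop over structures in `featurize_many`, assuming every wrapped featurizer exposes `featurize(mof)` and returns a flat list. `NumSites` and `NumFragments` are toy stand-ins (not the real `LSOP`, `Dimensionality`, or `RACS`), and the MOF objects are faked with `SimpleNamespace`:

```python
from types import SimpleNamespace


class NumSites:
    """Toy featurizer reading the structure attribute."""

    def featurize(self, mof):
        return [len(mof.structure)]


class NumFragments:
    """Toy featurizer reading the fragments attribute."""

    def featurize(self, mof):
        return [len(mof.fragments)]


class MOFMultipleFeaturizer:
    def __init__(self, featurizers):
        self.featurizers = list(featurizers)

    def featurize(self, mof):
        # Concatenate the features of all wrapped featurizers for one MOF.
        features = []
        for featurizer in self.featurizers:
            features.extend(featurizer.featurize(mof))
        return features

    def featurize_many(self, mofs):
        # Outer loop over structures, so anything cached on a MOF object
        # is shared by all featurizers before moving to the next one.
        return [self.featurize(mof) for mof in mofs]


mofs = [
    SimpleNamespace(structure=["Zn", "O", "C"], fragments=["node", "linker"]),
    SimpleNamespace(structure=["Cu", "N"], fragments=["node"]),
]
featurizer = MOFMultipleFeaturizer([NumSites(), NumFragments()])
rows = featurizer.featurize_many(mofs)
```

Because every featurizer takes the same object, mixing featurizers that read different attributes needs no extra type handling in the container.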