Draft
I am experimenting with making it so that users can extend the base data object class for dataset/method-specific preprocessing. I considered making the base class an abstract class, but I think it is more flexible to leave it as a concrete class that can be used for basic-level processing; for more specific cases it can be extended and overridden as needed.
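A minimal sketch of that design (the names here are hypothetical, not the project's actual API): the base class stays concrete and usable on its own, while subclasses override a single preprocessing hook.

```python
import pandas as pd

class DataObject:
    """Concrete base class: usable as-is for basic preprocessing,
    or extended for dataset/method-specific pipelines."""

    def __init__(self, raw: pd.DataFrame):
        self.raw = raw
        self.processed = self._preprocess(raw.copy())

    def _preprocess(self, df: pd.DataFrame) -> pd.DataFrame:
        # Basic, dataset-agnostic step: drop rows with missing values.
        return df.dropna()

class CompasData(DataObject):
    """Hypothetical subclass overriding the preprocessing hook."""

    def _preprocess(self, df: pd.DataFrame) -> pd.DataFrame:
        df = super()._preprocess(df)   # keep the base behaviour
        return df[df["score"] >= 0]    # add a dataset-specific filter
```

An abstract base would force every dataset to define its own pipeline; keeping the base concrete means the default path still works out of the box.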
The catch is always something subtle, as accurately reproducing the data processing steps for the future model is crucial for reproducing the same results. In this case, if you do the 20% future-data sampling, you see the expected high validity results; but without it, using the complete future data, future validity drops significantly.
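The sampling step in question can be sketched as follows (a hypothetical helper; the fraction and seed are illustrative, not taken from the actual code):

```python
import pandas as pd

def sample_future_data(future_df: pd.DataFrame, frac: float = 0.2,
                       seed: int = 0) -> pd.DataFrame:
    """Subsample the 'future' split before evaluation.

    Skipping this step and evaluating on the complete future data
    changes the measured future validity, so the sampling must be
    reproduced exactly to match reported results.
    """
    return future_df.sample(frac=frac, random_state=seed)
```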
The model building, especially for the layers, has been changed, since it previously had no way to add just a single layer. I am also trying to improve the reproduction score for PROBE, which is getting correct validity scores, but the paper-specific metric of recourse invalidation rate is slightly higher than in the original paper.
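The single-layer change can be illustrated with a small sketch (a hypothetical function, not the project's real builder): treating a lone integer as a one-element list lets the same code produce a single hidden layer or a stack of them.

```python
def layer_dims(input_dim, hidden, output_dim):
    """Return (in_features, out_features) pairs for each linear layer.

    `hidden` may be a single int (one hidden layer) or a list of ints;
    the int form is what adds single-layer support.
    """
    if isinstance(hidden, int):
        hidden = [hidden]
    dims = [input_dim, *hidden, output_dim]
    return list(zip(dims[:-1], dims[1:]))
```

For example, `layer_dims(4, 8, 2)` returns `[(4, 8), (8, 2)]`, while `layer_dims(4, [8, 8], 2)` returns `[(4, 8), (8, 8), (8, 2)]`.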
Having trouble getting the results to match the paper, but the logic of my code and the results I'm getting intuitively make sense. Since I am not making real progress on this problem right now, I will commit what I have so far and move on to the next method reproduction. I can come back to this one later and try to figure out the issue.
This pull request introduces major improvements to the dataset preprocessing pipeline, adds dataset-specific classes for COMPAS and German datasets, and refactors utility and experiment code for better modularity and maintainability. The most important changes are grouped below by theme.
Dataset Preprocessing Pipeline Improvements
- Refactored the DataObject class to move the raw data reading and feature/metadata extraction logic into a dedicated _read_raw_data() method, allowing child classes to override preprocessing steps and making the pipeline more extensible for dataset-specific needs.
- Added an _apply_scaling() method using MinMaxScaler, enabling the use of normalization as a preprocessing strategy defined in YAML configs.
- Added a set_processed_data() method to DataObject for externally updating the processed DataFrame after initial preprocessing.

Dataset-Specific Classes
- Added CompasData and GermanData in their respective files, inheriting from DataObject and implementing dataset-specific preprocessing pipelines.

Configuration and Utility Refactoring
- Renamed config_utils.py to experiment_utils.py and moved experiment utility functions (setup_logging, select_factuals) from experiment.py to this module for improved modularity.

Experiment Code Cleanup

Codebase Simplification
- Removed the DataAttributes class from data/data_attributes.py, reducing dead code.
- Simplified data_object.py, including the removal of unused methods.

These changes collectively make the preprocessing pipeline more flexible and maintainable, enable easier extension for new datasets, and improve code organization across the experiment and utility modules.
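The scaling strategy described above could look roughly like the following (a sketch that reimplements min-max scaling by hand for illustration; the PR itself uses scikit-learn's MinMaxScaler, and the function name here is hypothetical):

```python
import pandas as pd

def apply_min_max_scaling(df: pd.DataFrame, columns) -> pd.DataFrame:
    # Rescale each selected column to [0, 1], matching MinMaxScaler's
    # default feature_range of (0, 1).
    out = df.copy()
    for col in columns:
        lo, hi = out[col].min(), out[col].max()
        out[col] = (out[col] - lo) / (hi - lo)
    return out
```

Which columns get scaled would be driven by the preprocessing strategy declared in the YAML config, keeping the choice of normalization out of the code itself.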