Feature/method reproductions #4

Draft

HashirA123 wants to merge 7 commits into charmlab:main from HashirA123:feature/method_reproductions
Conversation

@HashirA123
Collaborator

This pull request introduces major improvements to the dataset preprocessing pipeline, adds dataset-specific classes for COMPAS and German datasets, and refactors utility and experiment code for better modularity and maintainability. The most important changes are grouped below by theme.

Dataset Preprocessing Pipeline Improvements

  • Refactored the DataObject class to move the raw data reading and feature/metadata extraction logic into a dedicated _read_raw_data() method, allowing child classes to override preprocessing steps and making the pipeline more extensible for dataset-specific needs.
  • Implemented normalization support in the _apply_scaling() method using MinMaxScaler, enabling the use of normalization as a preprocessing strategy defined in YAML configs.
  • Added a set_processed_data() method to DataObject for externally updating the processed DataFrame after initial preprocessing.
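The refactored pipeline above could look roughly like the following sketch. The method names (`_read_raw_data()`, `_apply_scaling()`, `set_processed_data()`) and the use of `MinMaxScaler` come from the PR description; the constructor signature, attribute names, and internal details are illustrative assumptions, not the repository's actual code.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler


class DataObject:
    """Base preprocessing pipeline; child classes may override individual steps."""

    def __init__(self, csv_path: str, scaling: str = "standardize"):
        self.scaling = scaling
        self._read_raw_data(csv_path)
        self._apply_scaling()

    def _read_raw_data(self, csv_path: str) -> None:
        # Raw-data reading and feature extraction live in their own method
        # so that dataset-specific subclasses can override them.
        self.raw_df = pd.read_csv(csv_path)
        self.feature_names = list(self.raw_df.columns)
        self.processed_df = self.raw_df.copy()

    def _apply_scaling(self) -> None:
        # "normalize" maps features to [0, 1] via MinMaxScaler, matching the
        # new strategy selectable from the YAML configs; "standardize" keeps
        # zero-mean / unit-variance scaling.
        if self.scaling == "normalize":
            scaler = MinMaxScaler()
        elif self.scaling == "standardize":
            scaler = StandardScaler()
        else:
            return
        self.processed_df[self.feature_names] = scaler.fit_transform(
            self.processed_df[self.feature_names]
        )

    def set_processed_data(self, df: pd.DataFrame) -> None:
        # Externally update the processed DataFrame after initial preprocessing.
        self.processed_df = df
```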

Dataset-Specific Classes

  • Added new classes CompasData and GermanData in their respective files, inheriting from DataObject and implementing dataset-specific preprocessing pipelines.
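A subclass such as `CompasData` might override `_read_raw_data()` along these lines. The base class here is a minimal stand-in for the repository's `DataObject`, and the COMPAS-specific steps (dropping missing rows, restricting the race attribute) are illustrative of common COMPAS pipelines, not necessarily what the PR implements.

```python
import pandas as pd


# Minimal stand-in for the repository's DataObject base class.
class DataObject:
    def __init__(self, csv_path: str):
        self._read_raw_data(csv_path)

    def _read_raw_data(self, csv_path: str) -> None:
        self.raw_df = pd.read_csv(csv_path)
        self.processed_df = self.raw_df.copy()


class CompasData(DataObject):
    """COMPAS-specific preprocessing; filtering steps are illustrative."""

    def _read_raw_data(self, csv_path: str) -> None:
        super()._read_raw_data(csv_path)
        # Hypothetical dataset-specific steps: drop rows with missing values
        # and keep only the two race groups commonly used in COMPAS studies.
        df = self.processed_df.dropna()
        if "race" in df.columns:
            df = df[df["race"].isin(["African-American", "Caucasian"])]
        self.processed_df = df.reset_index(drop=True)
```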

Configuration and Utility Refactoring

  • Renamed config_utils.py to experiment_utils.py and moved experiment utility functions (setup_logging, select_factuals) from experiment.py to this module for improved modularity.
  • Updated the preprocessing strategy for COMPAS Carla in its YAML config from "standardize" to "normalize" to match the new normalization support.
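The YAML change could look something like the fragment below; the key names and file layout are assumptions, since only the value change ("standardize" to "normalize") is stated in the PR.

```yaml
# COMPAS Carla config (key names illustrative)
dataset:
  name: compas_carla
  preprocessing: normalize   # was: standardize
```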

Experiment Code Cleanup

  • Fixed dataset paths for "german_corrected" in experiment configuration dictionaries.

Codebase Simplification

  • Removed the unused DataAttributes class from data/data_attributes.py, reducing dead code.
  • Cleaned up imports and removed unnecessary code in data_object.py, including the removal of unused methods.

These changes collectively make the preprocessing pipeline more flexible and maintainable, enable easier extension for new datasets, and improve code organization across the experiment and utility modules.

I am trying to experiment with making it so that users
can extend the base data object class for dataset/method-specific
preprocessing.

I considered making the base class an abstract class,
but I think it would be more flexible to leave it as a concrete
class that can be used for basic-level processing; for more specific
cases it can be extended and overridden as needed.
The catch is always something subtle, as accurately reproducing the data processing steps
for the future model is crucial for reproducing the same results.
In this case, if you do the 20% future data sampling, you will see the expected high validity results,
but without it, using the complete future data, future validity drops significantly.
The model building, especially for the layers, has been changed,
since before it didn't have a way to just add a single layer.
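The 20% future-data sampling described above could be sketched as follows. The function name, DataFrame, and seed are hypothetical; the repository's actual sampling may differ.

```python
import pandas as pd


def sample_future_data(future_df: pd.DataFrame, frac: float = 0.2,
                       seed: int = 0) -> pd.DataFrame:
    # Evaluating on only a 20% sample of the future data reproduces the
    # paper's high future-validity numbers; evaluating on the complete
    # future data makes future validity drop significantly.
    return future_df.sample(frac=frac, random_state=seed)
```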

Also trying to improve the reproduction score for PROBE, which
is getting correct validity scores, but the paper-specific metric
of recourse invalidation rate is slightly higher than in the original paper.
I'm having trouble getting the results to match the paper, but the
logic of my code and the results I'm getting intuitively make sense.

Since I am not making real progress with this problem right now,
I will commit what I have so far and move on to the next method reproduction.
I can come back to this one later and try to figure out the issue.
