Feature/method reproductions #4

Draft

HashirA123 wants to merge 7 commits into charmlab:main from HashirA123:feature/method_reproductions
Conversation

@HashirA123
Collaborator

This pull request introduces major improvements to the dataset preprocessing pipeline, adds dataset-specific classes for COMPAS and German datasets, and refactors utility and experiment code for better modularity and maintainability. The most important changes are grouped below by theme.

Dataset Preprocessing Pipeline Improvements

  • Refactored the DataObject class to move the raw data reading and feature/metadata extraction logic into a dedicated _read_raw_data() method, allowing child classes to override preprocessing steps and making the pipeline more extensible for dataset-specific needs.
  • Implemented normalization support in the _apply_scaling() method using MinMaxScaler, enabling the use of normalization as a preprocessing strategy defined in YAML configs.
  • Added a set_processed_data() method to DataObject for externally updating the processed DataFrame after initial preprocessing.
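The refactored pipeline above could look roughly like the following sketch. The method names (`_read_raw_data()`, `_apply_scaling()`, `set_processed_data()`) and the use of `MinMaxScaler` come from the PR description; the constructor signature, attribute names, and internal details are illustrative assumptions, not the repository's actual code.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler


class DataObject:
    """Base preprocessing pipeline; child classes may override individual steps."""

    def __init__(self, csv_path: str, scaling: str = "standardize"):
        self.scaling = scaling
        self._read_raw_data(csv_path)
        self._apply_scaling()

    def _read_raw_data(self, csv_path: str) -> None:
        # Raw-data reading and feature extraction live in their own method
        # so that dataset-specific subclasses can override them.
        self.raw_df = pd.read_csv(csv_path)
        self.feature_names = list(self.raw_df.columns)
        self.processed_df = self.raw_df.copy()

    def _apply_scaling(self) -> None:
        # "normalize" maps features to [0, 1] via MinMaxScaler, matching the
        # new strategy selectable from the YAML configs; "standardize" keeps
        # zero-mean / unit-variance scaling.
        if self.scaling == "normalize":
            scaler = MinMaxScaler()
        elif self.scaling == "standardize":
            scaler = StandardScaler()
        else:
            return
        self.processed_df[self.feature_names] = scaler.fit_transform(
            self.processed_df[self.feature_names]
        )

    def set_processed_data(self, df: pd.DataFrame) -> None:
        # Externally update the processed DataFrame after initial preprocessing.
        self.processed_df = df
```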

Dataset-Specific Classes

  • Added new classes CompasData and GermanData in their respective files, inheriting from DataObject and implementing dataset-specific preprocessing pipelines.
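A subclass such as `CompasData` might override `_read_raw_data()` along these lines. The base class here is a minimal stand-in for the repository's `DataObject`, and the COMPAS-specific steps (dropping missing rows, restricting the race attribute) are illustrative of common COMPAS pipelines, not necessarily what the PR implements.

```python
import pandas as pd


# Minimal stand-in for the repository's DataObject base class.
class DataObject:
    def __init__(self, csv_path: str):
        self._read_raw_data(csv_path)

    def _read_raw_data(self, csv_path: str) -> None:
        self.raw_df = pd.read_csv(csv_path)
        self.processed_df = self.raw_df.copy()


class CompasData(DataObject):
    """COMPAS-specific preprocessing; filtering steps are illustrative."""

    def _read_raw_data(self, csv_path: str) -> None:
        super()._read_raw_data(csv_path)
        # Hypothetical dataset-specific steps: drop rows with missing values
        # and keep only the two race groups commonly used in COMPAS studies.
        df = self.processed_df.dropna()
        if "race" in df.columns:
            df = df[df["race"].isin(["African-American", "Caucasian"])]
        self.processed_df = df.reset_index(drop=True)
```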

Configuration and Utility Refactoring

  • Renamed config_utils.py to experiment_utils.py and moved experiment utility functions (setup_logging, select_factuals) from experiment.py to this module for improved modularity.
  • Updated the preprocessing strategy for COMPAS Carla in its YAML config from "standardize" to "normalize" to match the new normalization support.
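The YAML change could look something like the fragment below; the key names and file layout are assumptions, since only the value change ("standardize" to "normalize") is stated in the PR.

```yaml
# COMPAS Carla config (key names illustrative)
dataset:
  name: compas_carla
  preprocessing: normalize   # was: standardize
```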

Experiment Code Cleanup

  • Fixed dataset paths for "german_corrected" in experiment configuration dictionaries.

Codebase Simplification

  • Removed the unused DataAttributes class from data/data_attributes.py, reducing dead code.
  • Cleaned up imports and removed unnecessary code in data_object.py, including the removal of unused methods.

These changes collectively make the preprocessing pipeline more flexible and maintainable, enable easier extension for new datasets, and improve code organization across the experiment and utility modules.

I am trying to experiment with making it so that users
can extend the base data object class for dataset/method-specific
preprocessing.

I considered making the base class an abstract class,
but I think it would be more flexible to leave it as a concrete
class that can be used for basic-level processing; for more specific
cases it can be extended and overridden as needed.
The catch is always something subtle, as accurately reproducing the data processing steps
for the future model is crucial for reproducing the same results.
In this case, if you do the 20% future data sampling, you will see the expected high validity results,
but without it, using the complete future data, future validity drops significantly.
The model building, especially for the layers, has been changed,
since before it didn't have a way to just add a single layer.
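The 20% future-data sampling described above could be sketched as follows. The function name, DataFrame, and seed are hypothetical; the repository's actual sampling may differ.

```python
import pandas as pd


def sample_future_data(future_df: pd.DataFrame, frac: float = 0.2,
                       seed: int = 0) -> pd.DataFrame:
    # Evaluating on only a 20% sample of the future data reproduces the
    # paper's high future-validity numbers; evaluating on the complete
    # future data makes future validity drop significantly.
    return future_df.sample(frac=frac, random_state=seed)
```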

Also trying to improve the reproduction score for PROBE, which
is getting correct validity scores, but the paper-specific metric
of recourse invalidation rate is slightly higher than in the original paper.
I'm having trouble getting the results to match the paper, but the
logic of my code and the results I'm getting intuitively make sense.

Since I am not making real progress with this problem right now,
I will commit what I have so far and move on to the next method reproduction.
I can come back to this one later and try to figure out the issue.
