[Feature] Package to inject Spurious Correlation(s) in huggingface datasets #322
Open
MarcelMatsal wants to merge 34 commits into
huguesva (Collaborator) reviewed on Oct 17, 2025 and left a comment:
Thanks a lot, @MarcelMatsal. Are there plans to include examples of pretraining or finetuning LLMs in the library later? @RandallBalestriero, don't hesitate to review as well if you have time, since you know more about this than I do.
Contributor
Author
@huguesva Our current plan is to finetune/pretrain some VLMs like CLIP with spurious data, and I could include those examples in this library. I could also include finetuning examples for pure LLMs down the line, or we could do the full pretraining step.
Collaborator
Thanks @MarcelMatsal! @RandallBalestriero, can you validate this PR? Thanks.
Description
This PR introduces functionality for injecting spurious correlations (SSTI from our paper) into huggingface datasets, so that we can study how these correlations affect the pretraining of models. Currently only textual injections are supported. Injections into image data will be added soon, and the package will later be expanded to other modalities.
Also adds "termcolor" to the dataset additional dependencies.
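For readers unfamiliar with the idea, here is a minimal sketch of what a textual injection could look like. It is not the API added by this PR: the function name `inject_spurious_token`, its parameters, and the `text`/`label` column names are all hypothetical, and the example uses plain dicts rather than a real dataset so it runs with the standard library only.

```python
import random


def inject_spurious_token(example, token, target_label, rate, rng,
                          text_key="text", label_key="label"):
    """Hypothetical helper (not part of this PR's API).

    Returns a copy of `example` in which `token` is prepended to the text
    when the label equals `target_label` and a random draw falls below `rate`.
    This ties the token to one class, which is the spurious correlation.
    """
    out = dict(example)  # shallow copy so the input row is left untouched
    if out[label_key] == target_label and rng.random() < rate:
        out[text_key] = f"{token} {out[text_key]}"
    return out


# Toy rows standing in for dataset examples (assumed column names).
rows = [
    {"text": "great movie", "label": 1},
    {"text": "terrible plot", "label": 0},
    {"text": "loved it", "label": 1},
]

rng = random.Random(0)  # seeded generator for reproducible injections
# rate=1.0 injects into every row of the target class.
injected = [inject_spurious_token(r, "2024-01-01", 1, 1.0, rng) for r in rows]

for row in injected:
    print(row)
```

With a huggingface `datasets.Dataset`, a function of this shape could presumably be applied row by row through `Dataset.map`, passing the extra arguments via `fn_kwargs`. Lowering `rate` weakens the correlation between the token and the label.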
Checklist