Skip to content

Non-wrapping mode for better Transformers compatibility#794

Closed
evgri243 wants to merge 14 commits into
meta-pytorch:mainfrom
JetBrains-Research:evgri243/controller-based
Closed

Non-wrapping mode for better Transformers compatibility#794
evgri243 wants to merge 14 commits into
meta-pytorch:mainfrom
JetBrains-Research:evgri243/controller-based

Conversation

@evgri243
Copy link
Copy Markdown
Contributor

@evgri243 evgri243 commented Oct 20, 2025

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Docs change / refactoring / dependency upgrade

Motivation and Context / Related issue

The primary goal of this PR is to introduce a Non-wrapping mode (controller-based) to Opacus. This feature allows for the computation of per-sample gradients without wrapping the original nn.Module in a GradSampleModule subclass.

This change addresses critical compatibility issues with third-party libraries—most notably HuggingFace Transformers and Accelerate—which often rely on strict module hierarchies or perform type checks that are disrupted by standard Opacus wrappers.

Key Technical Changes:

  • GradSampleHooks Controller: Introduced a new architecture where gradient sampling hooks are managed by a standalone GradSampleHooks object instead of being embedded in a GradSampleModule wrapper.
  • PrivacyEngine Updates: Modified make_private and make_private_with_epsilon to support a wrap_model parameter (defaults to True for backward compatibility). When wrap_model=False, Opacus attaches hooks to the original model and returns it along with a GradSampleHooks controller.
  • Enhanced Loss Compatibility: Updated DPLossFastGradientClipping to handle custom loss implementations that lack a reduction attribute (a common occurrence in the transformers library). The implementation now explicitly sets the expected reduction if missing and restores the previous state after the backward pass.
  • HuggingFace Trainer Integration: Verified the new mode via successful small-scale training runs using the HuggingFace Trainer with Fast Gradient Clipping. A new example (examples/huggingface_trainer.py) is included to demonstrate this integration.
  • Architectural Cleanup: Refactored gsm_base.py to provide a unified AbstractGradSampleHooks base class, improving the robustness of hook attachment, attribute cleanup, and validation logic across all sampling modes.

How Has This Been Tested (if it applies)

  • New Unit Tests:
  • Integration Testing: Successfully performed training with the HuggingFace Trainer using the new non-wrapping mode for both next-token prediction and SFT/DPO.

Checklist

  • The documentation is up-to-date with the changes I made (including a new tutorial and updated READMEs).
  • I have read the CONTRIBUTING document and completed the CLA.
  • All tests passed, and additional code has been covered with new tests.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 20, 2025
@meta-codesync
Copy link
Copy Markdown

meta-codesync Bot commented Oct 20, 2025

@facebook-github-bot has imported this pull request. If you are a Meta employee, you can view this in D85086663. (Because this pull request was imported automatically, there will not be any future comments.)

@iden-kalemaj
Copy link
Copy Markdown
Contributor

iden-kalemaj commented Oct 24, 2025

@evgri243 thank you for this heavy-lifting change. I will take some time to digest and also discuss internally with the team. In the meanwhile I have some questions for you:

  1. Can you provide some examples of failure modes for the current PrivacyEngine approach with HuggingFace transformers. In your tutorial you use BertForSequenceClassification, but this should work fine with the current approach? I'd like to better understand the extent to which the current PrivacyEngine fails to handle certain models, i.e., which model types have you encountered that do not work with PrivacyEngine, and can we handle the failure modes with smaller changes within the current approach?

  2. A main consideration is to avoid code duplication and maintaining several approaches in parallel. With your method, can we avoid some duplication by having PrivacyEngineGradSampleController inherit from PrivacyEngine?

  3. What would the changes needed for extending your methodology to ghost clipping and FSDP?

As an intermediate step, we could place your approach in the "research" folder, which we do not actively maintain, but are happy to support you in maintaining it. This would allow some time for the method to be digested before moving it into the main opacus folder.

@iden-kalemaj iden-kalemaj self-assigned this Oct 24, 2025
@evgri243
Copy link
Copy Markdown
Contributor Author

evgri243 commented Oct 24, 2025

Thanks for consideration. It is still work in progress, but I'd love to know your opinion. Let me answer your questions in words, then I'll come with examples if needed:

  1. Yeah. The example is more Opacus oriented. We don't experience issues with the models as we mostly use LoRA that limits us to linear layers only anyway. The problem is the trainer. We have a custom loop to replace a standard trainers, but trainers from try library (DPO, KTO) are much harder to substitute due to complicated data preparation and loss implementations. Those trainers are really into isinstance(model, PeftModel) checks, calling direct methods model.from_pretrained(...) on checkpointing, or accessing properties directly model.loss_function. We implemented originally getattr but checkpoint restoration started failing as the model started saving as _module.<tensor_name>, but recovering through forwarding with getattr without it. We overriden state_dict as well, but it all started falling apart. And then we realized that we can avoid wrapping in the first place... and here we are.
  2. We can look into privacy engine to wrap it deeper into the existing one. Yeah, duplication is major and existence of two parallel modes is even worse, but keeping it completely to ourselves was unfair as the wrap-less mode just works.
  3. We are on ghost clipping now, but thanks to the same KTO and DPO Trainers integrating the loss wrapper is another headache which we so far fail to implement. We have standard ghost clipping wired, but haven't had a proper chance to test it. FSDP is the final goal and we are slowly going there...

I thought about "research" or "contrib". My major problem is to make it package able, but I guess it is not a major implementation issue to add yet another package.

@iden-kalemaj
Copy link
Copy Markdown
Contributor

Thank you for these explanations. I like your solution and have also experienced some annoyances with accessing attributes of the model post-wrapping. I also understand the use case better now.

I believe we can minimize code duplication which would make it more reasonable to introduce this into Opacus.

  1. having GradSampleController extend GradSampleModule to avoid duplicate code. It's okay if this requires some factorization on the side of GradSampleModule.
  2. having PrivacyEngineGradSampleController extend PrivacyEngine to avoid the duplicated functions. The more we can minimize new code in PrivacyEngineGradSampleController the better. Likewise we can factorize things in PrivacyEngine as needed.

Do these make sense?

Regarding ghost clipping and FSDP. You mention that you mostly use LoRA. Ghost clipping does not give any memory advantage with LoRA fine-tunining since the effective linear layer width is small, so just wanted to give a heads up that ghost clipping might not be needed for your use case.

We did not implement FSDP with vanilla (non-ghost) clipping since this required more significant effort, though we did put some work into this and if you're interested in using extending FSDP + vanilla, then we'd welcome PRs here.

@evgri243
Copy link
Copy Markdown
Contributor Author

@iden-kalemaj give me a few days to give it a try. There is a catch though: GradSampleModule is nn.Module, tracked and controlled by torch. GradSampleController is a simple class, untacking it may turn an issue.

@iden-kalemaj
Copy link
Copy Markdown
Contributor

That's a good point... let me know how it's looking once you work on it.

@evgri243 evgri243 force-pushed the evgri243/controller-based branch 3 times, most recently from 4b4ac43 to 8dd5da9 Compare November 10, 2025 18:05
@evgri243
Copy link
Copy Markdown
Contributor Author

evgri243 commented Nov 10, 2025

Together (let's be honest with Claude) we actually refactored something reasonable to merge the designs of both Modules/Controllers and PrivacyEngines.

It is still WIP, but you may take a look at the direction at least.

@evgri243
Copy link
Copy Markdown
Contributor Author

I will try to add ghost_fsdp support and adapt from our sources a properly working Transformers DPTrainer based on the controller.

@evgri243
Copy link
Copy Markdown
Contributor Author

Hello!

I hope to attend to it completely during the holidays and return back with better compatibility.

@iden-kalemaj
Copy link
Copy Markdown
Contributor

Hi @evgri243, apologies for the delayed response. I like how the integration has improved with the new version.

I have some small recommendations about naming:

  • In privacy_engine, rename the parameter return_controller to something more intuitive like wrap_model. Let's try to hide the details of the controller from the user.
  • Rename GradSampleHooksMixin to HooksHandler or something similar?

A few questions:

  • Can we avoid duplication of additional functions in GradSampleModuleFastGradientClipping?
  • In fast_gradient_utils, what is the purpose of functions like add and rmul?

@evgri243 evgri243 force-pushed the evgri243/controller-based branch from d7c2477 to 5fe2ddc Compare December 24, 2025 13:19
@evgri243
Copy link
Copy Markdown
Contributor Author

@iden-kalemaj I've majorly refactored the PR, starting practically from scratch. I tried to keep changes as limited as possible this time to limit the change surface and make it easier to read. It is still work in progress and we should test it on our tasks early next year to make sure everything is correct.

@evgri243
Copy link
Copy Markdown
Contributor Author

@david-stan may you have a look in it as well as a co-author.

@evgri243 evgri243 force-pushed the evgri243/controller-based branch from 3ed8655 to 42fcd87 Compare December 24, 2025 16:44
@evgri243
Copy link
Copy Markdown
Contributor Author

Ok. Calling it "as limited as possible" is an overstatement, but I've drastically changed the design and did my best to avoid unnecessary changes

Comment thread opacus/privacy_engine.py
Comment thread examples/huggingface_trainer.py Outdated
Comment thread examples/huggingface_trainer.py Outdated
Comment thread opacus/grad_sample/grad_sample_module_fast_gradient_clipping.py
Comment thread opacus/grad_sample/grad_sample_module_fast_gradient_clipping.py Outdated
@iden-kalemaj
Copy link
Copy Markdown
Contributor

iden-kalemaj commented Jan 7, 2026

@evgri243 and @david-stan, thank you for these changes, and for the hard work on improving the functionality. I like the idea of splitting the hooks handling from the model wrapping at the very base abstract class.

I am somewhat worried about people expecting that the return of the privacy_engine.make_private would be the model as opposed to the hooks object, when using wrap_model=False, but given that it is not the default we can assume people have read the documentation before using this mode.

In the examples, it would be great to have another simpler example of training a hugging face model that would not have been supported with model wrapping. This can be in a separate PR, however.

Finally, there are changes in the code that are not related to the new non-wrapping functionality. I assume these are meant to clean up code, but they make reviewing harder. Can these be placed into separate PRs? I left comments for some of them, but there might be other changes.

@evgri243
Copy link
Copy Markdown
Contributor Author

I addressing your cutting all unnecessary changes surgically. Should be out first half next week.

@evgri243 evgri243 force-pushed the evgri243/controller-based branch 2 times, most recently from 2f12f82 to 31c5931 Compare January 26, 2026 18:12
@evgri243
Copy link
Copy Markdown
Contributor Author

evgri243 commented Jan 26, 2026

@iden-kalemaj it should be the shortest I can get it. Some other important functionality: #805 and #806

@evgri243
Copy link
Copy Markdown
Contributor Author

I guess I forgot to push. I did that

@evgri243 evgri243 force-pushed the evgri243/controller-based branch from 3daabc9 to dc8fe06 Compare March 26, 2026 22:28
@iden-kalemaj
Copy link
Copy Markdown
Contributor

iden-kalemaj commented Mar 26, 2026

@evgri243 for some reason the system is failing to import the PR internally. I'll check back on this tomorrow morning.

@iden-kalemaj
Copy link
Copy Markdown
Contributor

@evgri243 can you try rebasing and seeing if there are any conflicts, we are unable to import the PR internally which is usually due to merge conflicts.

evgri243 and others added 12 commits March 29, 2026 14:15
…tionality. Introduce `GradSampleHooks` to handle per-sample gradient computation independently of `nn.Module` structure.
…ing without model wrapping. Introduce optional `wrap_model` argument and update gradient sampling modes. Rename and consolidate hooks-related classes.
… and update references in documentation and metadata.
…nt sampling. Integrate new hooks-based classes and multi-device handling scenarios. Update tests for FP16, BF16, and mixed precision.
…uction methods to stabilize gpu mixed-precision tests.
- Safer remove_hooks(): no ValueError when hooks already removed
- Initialize grad_accumulation_hook = None in all GradSampleModule subclasses
- Add warning in cleanup() when hook removal fails (still cleans attributes)
- Remove redundant cleanup() call in hooks test
- Add prepare_module() as new main API, keep wrap_model() as backward compat alias
- Use prepare_module import in privacy_engine instead of aliased wrap_model
- Fix criterion.reduction = 'none' in DPLossFastGradientClipping
- Clean up imports in adaptive_clipping_utils
@evgri243 evgri243 force-pushed the evgri243/controller-based branch from dc8fe06 to 89ba6bc Compare March 30, 2026 09:39
@evgri243
Copy link
Copy Markdown
Contributor Author

Rebased and pushed. Let's see.

PS. I am so slow as I am not happy with the resulting diamond inheritance between module and hooks. But so far I fail to think of a better way.

If we get it merged, may I ask you to trigger a build release with all the changes we submitted. This is exactly what was required to make our trainer work. We will rebase it on upstream Opacus and test it thoroughly in the training pipeline before we make further decisions. What do you think?

@iden-kalemaj
Copy link
Copy Markdown
Contributor

@evgri243 that sounds good since we're due for another version release. The current solution looks elegant enough :)

Please see the lint failure and to speed things up please test with all 3 linters here.

…handling in GradSampleModuleFastGradientClippingEmbeddingLayerTest
@evgri243
Copy link
Copy Markdown
Contributor Author

evgri243 commented Apr 1, 2026

Interestingly. It was something old. Fixed it

@evgri243
Copy link
Copy Markdown
Contributor Author

evgri243 commented Apr 2, 2026

See massive failures. Attending to them today.

@evgri243
Copy link
Copy Markdown
Contributor Author

evgri243 commented Apr 9, 2026

CI/CD seems to be finally green. We rebased our code on top of this branch and previously merged changes -- seems working, but we will continue thorough testing.

@iden-kalemaj
Copy link
Copy Markdown
Contributor

@evgri243 i did not get to check the initial failures, can you let me know why the tests were failing? Just want to make sure that we are not duplicating any of the code in #805

@evgri243
Copy link
Copy Markdown
Contributor Author

evgri243 commented Apr 9, 2026

The errors were just because I removed all the changes with arithmetics. This PR had them originally and then removed in favor of an independent PR. Seeing that, rebase silently undid them as well

@meta-codesync
Copy link
Copy Markdown

meta-codesync Bot commented Apr 10, 2026

@iden-kalemaj merged this pull request in 0eb4b15.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants