Non-wrapping mode for better Transformers compatibility #794
Conversation
|
@facebook-github-bot has imported this pull request. If you are a Meta employee, you can view this in D85086663. (Because this pull request was imported automatically, there will not be any future comments.) |
|
@evgri243 thank you for this heavy-lifting change. I will take some time to digest it and also discuss it internally with the team. In the meantime, I have some questions for you:
As an intermediate step, we could place your approach in the "research" folder, which we do not actively maintain, though we are happy to support you in maintaining it. This would allow some time for the method to be digested before moving it into the main opacus folder. |
|
Thanks for the consideration. It is still work in progress, but I'd love to know your opinion. Let me answer your questions in words, then I'll come back with examples if needed:
I thought about "research" or "contrib". My main concern is making it packageable, but I guess adding yet another package is not a major implementation issue. |
|
Thank you for these explanations. I like your solution and have also experienced some annoyances with accessing attributes of the model post-wrapping. I also understand the use case better now. I believe we can minimize code duplication which would make it more reasonable to introduce this into Opacus.
Do these make sense? Regarding ghost clipping and FSDP: you mention that you mostly use LoRA. Ghost clipping does not give any memory advantage with LoRA fine-tuning since the effective linear layer width is small, so I just wanted to give a heads-up that ghost clipping might not be needed for your use case. We did not implement FSDP with vanilla (non-ghost) clipping since this required more significant effort, though we did put some work into this, and if you're interested in extending FSDP to vanilla clipping, we'd welcome PRs here. |
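For a rough sense of the memory argument above, here is a back-of-the-envelope sketch. It assumes a simplified cost model that is not taken from this PR: instantiating one sample's gradient for a linear weight costs `d_in * d_out` entries, while computing its norm the ghost-clipping way costs about two `T x T` Gram matrices for sequence length `T`. The sizes (`d_model`, `rank`, `seq_len`) are illustrative only.

```python
def per_sample_grad_entries(d_in: int, d_out: int) -> int:
    """Entries needed to instantiate one sample's gradient of a d_in x d_out weight."""
    return d_in * d_out


def ghost_norm_entries(seq_len: int) -> int:
    """Entries for the two T x T Gram matrices used for one sample's gradient norm."""
    return 2 * seq_len * seq_len


# Illustrative sizes (assumptions, not measurements from the PR).
d_model, rank, seq_len = 4096, 16, 512

full_layer = per_sample_grad_entries(d_model, d_model)  # dense d_model x d_model layer
lora_layer = per_sample_grad_entries(d_model, rank)     # one low-rank LoRA factor
ghost = ghost_norm_entries(seq_len)

print(f"dense layer, instantiated gradient: {full_layer:,}")
print(f"LoRA factor, instantiated gradient: {lora_layer:,}")
print(f"ghost norm, per layer:              {ghost:,}")
```

Under these assumptions the ghost computation is far cheaper than instantiating the gradient of a dense layer, but more expensive than instantiating it for a low-rank LoRA factor, which is consistent with the heads-up in the comment.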
|
@iden-kalemaj give me a few days to give it a try. There is a catch though: GradSampleModule is an nn.Module, tracked and controlled by torch. GradSampleController is a simple class, so untracking it may turn into an issue. |
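To make the distinction concrete, here is a minimal sketch in plain PyTorch. `TinyModel` and `Wrapper` are hypothetical stand-ins, not Opacus classes. It shows how a wrapping `nn.Module` changes the visible type and parameter names, while a hook attached to the original model leaves both untouched and is tracked only through a handle held outside the model.

```python
import torch
from torch import nn


class TinyModel(nn.Module):
    """Stand-in for a third-party model (hypothetical)."""

    def __init__(self) -> None:
        super().__init__()
        self.proj = nn.Linear(4, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)


class Wrapper(nn.Module):
    """Hypothetical wrapper that holds the model as a submodule."""

    def __init__(self, module: nn.Module) -> None:
        super().__init__()
        self._module = module

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self._module(x)


model = TinyModel()
wrapped = Wrapper(model)

# Wrapping changes the type seen by isinstance checks and prefixes parameter names.
print(isinstance(wrapped, TinyModel))  # False
print(list(wrapped.state_dict()))      # ['_module.proj.weight', '_module.proj.bias']

# A hook leaves the model object as it was; torch only hands back a handle.
captured = []
handle = model.proj.register_forward_hook(
    lambda mod, inputs, output: captured.append(inputs[0].detach())
)
model(torch.randn(3, 4))
handle.remove()

print(isinstance(model, TinyModel))    # True
print(list(model.state_dict()))        # ['proj.weight', 'proj.bias']
```

The flip side, as the comment notes, is that an object holding such handles is not part of the module tree, so torch does not manage it the way it manages submodules.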
|
That's a good point... let me know how it's looking once you work on it. |
4b4ac43 to 8dd5da9
|
Together (with Claude, let's be honest) we actually refactored it into something reasonable that merges the designs of both Modules/Controllers and PrivacyEngines. It is still WIP, but you may at least take a look at the direction. |
|
I will try to add ghost_fsdp support and adapt from our sources a properly working Transformers DPTrainer based on the controller. |
|
Hello! I hope to attend to it fully during the holidays and come back with better compatibility. |
|
Hi @evgri243, apologies for the delayed response. I like how the integration has improved with the new version. I have some small recommendations about naming:
A few questions:
|
d7c2477 to 5fe2ddc
|
@iden-kalemaj I've substantially refactored the PR, starting practically from scratch. I tried to keep the changes as small as possible this time to limit the change surface and make it easier to read. It is still work in progress, and we should test it on our tasks early next year to make sure everything is correct. |
|
@david-stan could you have a look at it as well, as a co-author?
3ed8655 to 42fcd87
|
Ok. Calling it "as limited as possible" is an overstatement, but I've drastically changed the design and did my best to avoid unnecessary changes |
|
@evgri243 and @david-stan, thank you for these changes, and for the hard work on improving the functionality. I like the idea of splitting the hooks handling from the model wrapping at the very base abstract class. I am somewhat worried about people expecting the return value of privacy_engine.make_private to be the model as opposed to the hooks object when using wrap_model=False, but given that it is not the default, we can assume people have read the documentation before using this mode. In the examples, it would be great to have another, simpler example of training a Hugging Face model that would not have been supported with model wrapping. This can be in a separate PR, however. Finally, there are changes in the code that are not related to the new non-wrapping functionality. I assume these are meant to clean up code, but they make reviewing harder. Can these be placed into separate PRs? I left comments for some of them, but there might be other changes. |
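As an illustration of splitting hook handling from the model, here is a minimal sketch in plain PyTorch. `LinearGradSampleHooks` is a hypothetical name and this is not the Opacus implementation. It is a plain Python object that attaches hooks to the `nn.Linear` layers of an unmodified model, stores per-sample weight gradients in a `grad_sample` attribute, and can remove its hooks more than once without raising. It assumes 2-D inputs of shape `(batch, features)` and a sum-reduced loss.

```python
import torch
from torch import nn


class LinearGradSampleHooks:
    """Hypothetical controller: a plain object, not an nn.Module."""

    def __init__(self, model: nn.Module) -> None:
        self.inputs = {}   # layer -> input saved during the forward pass
        self.handles = []  # hook handles, kept outside the model
        for layer in model.modules():
            if isinstance(layer, nn.Linear):
                self.handles.append(layer.register_forward_hook(self._save_input))
                self.handles.append(layer.register_full_backward_hook(self._compute))

    def _save_input(self, layer, inputs, output) -> None:
        self.inputs[layer] = inputs[0].detach()

    def _compute(self, layer, grad_input, grad_output) -> None:
        # Per-sample weight gradient of a linear layer: outer product of the
        # output gradient and the saved input, one per batch element.
        layer.weight.grad_sample = torch.einsum(
            "bo,bi->boi", grad_output[0], self.inputs[layer]
        )

    def remove_hooks(self) -> None:
        """Idempotent: a second call finds nothing left to remove."""
        while self.handles:
            self.handles.pop().remove()
        self.inputs.clear()


torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 3))  # the model object itself is never replaced
hooks = LinearGradSampleHooks(model)

model(torch.randn(5, 4)).sum().backward()
weight = model[0].weight
print(weight.grad_sample.shape)  # one gradient per sample: (5, 3, 4)

hooks.remove_hooks()
hooks.remove_hooks()  # safe to call again
```

With a sum-reduced loss, summing `grad_sample` over the batch dimension reproduces the ordinary `weight.grad`, which is a convenient sanity check for a sketch like this.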
|
I am addressing your comments by surgically cutting all unnecessary changes. It should be out in the first half of next week. |
2f12f82 to 31c5931
|
@iden-kalemaj it should be the shortest I can get it. Some other important functionality: #805 and #806 |
|
I guess I forgot to push. I did that |
3daabc9 to dc8fe06
|
@evgri243 for some reason the system is failing to import the PR internally. I'll check back on this tomorrow morning. |
|
@evgri243 can you try rebasing and see if there are any conflicts? We are unable to import the PR internally, which is usually due to merge conflicts. |
…tionality. Introduce `GradSampleHooks` to handle per-sample gradient computation independently of `nn.Module` structure.
…ing without model wrapping. Introduce optional `wrap_model` argument and update gradient sampling modes. Rename and consolidate hooks-related classes.
… and update references in documentation and metadata.
…nt details across modules.
…nt sampling. Integrate new hooks-based classes and multi-device handling scenarios. Update tests for FP16, BF16, and mixed precision.
…uction methods to stabilize gpu mixed-precision tests.
- Safer remove_hooks(): no ValueError when hooks already removed
- Initialize grad_accumulation_hook = None in all GradSampleModule subclasses
- Add warning in cleanup() when hook removal fails (still cleans attributes)
- Remove redundant cleanup() call in hooks test
- Add prepare_module() as new main API, keep wrap_model() as backward compat alias
- Use prepare_module import in privacy_engine instead of aliased wrap_model
- Fix criterion.reduction = 'none' in DPLossFastGradientClipping
- Clean up imports in adaptive_clipping_utils
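The `criterion.reduction` item above concerns loss objects that may not define a `reduction` attribute; the PR description states that the expected reduction is set when missing and the previous state is restored after the backward pass. One way such set-and-restore handling could look is sketched below. `temporary_reduction` is a hypothetical helper written for illustration, not a function from Opacus.

```python
from contextlib import contextmanager

import torch
from torch import nn

_MISSING = object()  # sentinel: the criterion had no `reduction` attribute


@contextmanager
def temporary_reduction(criterion, reduction: str = "none"):
    """Set criterion.reduction inside the block, then restore the prior state."""
    previous = getattr(criterion, "reduction", _MISSING)
    criterion.reduction = reduction
    try:
        yield criterion
    finally:
        if previous is _MISSING:
            del criterion.reduction  # the attribute did not exist before
        else:
            criterion.reduction = previous


criterion = nn.CrossEntropyLoss(reduction="mean")
logits = torch.randn(4, 3)
targets = torch.tensor([0, 1, 2, 1])

with temporary_reduction(criterion) as per_sample_criterion:
    per_sample_loss = per_sample_criterion(logits, targets)  # one loss per sample

print(per_sample_loss.shape)  # torch.Size([4])
print(criterion.reduction)    # mean
```

A custom loss that never reads `reduction` is unaffected by the temporary attribute, so this pattern only guarantees that the object is left as it was found.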
dc8fe06 to 89ba6bc
|
Rebased and pushed. Let's see. PS. I have been slow because I am not happy with the resulting diamond inheritance between module and hooks, but so far I have failed to think of a better way. If we get it merged, may I ask you to trigger a release build with all the changes we submitted? This is exactly what was required to make our trainer work. We will rebase it on upstream Opacus and test it thoroughly in the training pipeline before we make further decisions. What do you think? |
…handling in GradSampleModuleFastGradientClippingEmbeddingLayerTest
|
Interesting. It was something old. Fixed it. |
|
I see massive failures. Attending to them today. |
|
CI/CD seems to be finally green. We rebased our code on top of this branch and previously merged changes -- seems working, but we will continue thorough testing. |
|
The errors occurred only because I had removed all the arithmetic-related changes. This PR originally contained them, and they were later removed in favor of an independent PR. As a result, the rebase silently undid them as well. |
|
@iden-kalemaj merged this pull request in 0eb4b15. |
Types of changes
Motivation and Context / Related issue
The primary goal of this PR is to introduce a Non-wrapping mode (controller-based) to Opacus. This feature allows for the computation of per-sample gradients without wrapping the original `nn.Module` in a `GradSampleModule` subclass.

This change addresses critical compatibility issues with third-party libraries, most notably HuggingFace `Transformers` and `Accelerate`, which often rely on strict module hierarchies or perform type checks that are disrupted by standard Opacus wrappers.

Key Technical Changes:
- `GradSampleHooks` Controller: Introduced a new architecture where gradient sampling hooks are managed by a standalone `GradSampleHooks` object instead of being embedded in a `GradSampleModule` wrapper.
- `PrivacyEngine` Updates: Modified `make_private` and `make_private_with_epsilon` to support a `wrap_model` parameter (defaults to `True` for backward compatibility). When `wrap_model=False`, Opacus attaches hooks to the original model and returns it along with a `GradSampleHooks` controller.
- … `DPLossFastGradientClipping` to handle custom loss implementations that lack a `reduction` attribute (a common occurrence in the `transformers` library). The implementation now explicitly sets the expected reduction if missing and restores the previous state after the backward pass.
- … `Trainer` with Fast Gradient Clipping. A new example (`examples/huggingface_trainer.py`) is included to demonstrate this integration.
- … `gsm_base.py` to provide a unified `AbstractGradSampleHooks` base class, improving the robustness of hook attachment, attribute cleanup, and validation logic across all sampling modes.

How Has This Been Tested (if it applies)
- … `Trainer` using the new non-wrapping mode for both next-token prediction and SFT/DPO.

Checklist