fix fullslice
Introduce an interface `dedup_attrs`.
It deduplicates the attributes according to `rank2attr_area_map`.
For each `slicers` of a full tensor named `orig_name`, we only store its first appearance
in the `rank2attr_area_map`.
In addition, we check that
- the shape of the full tensor is consistent across different ranks
- the slicers of the full tensor do not intersect each other
- the slicers of the full tensor cover the full tensor
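The checks above can be sketched as follows. This is a hypothetical illustration, not the actual implementation: the 1-D `(start, stop)` slicer model and the map layout (`rank -> {orig_name: (shape, slicer)}`) are assumptions for clarity.

```python
from typing import Dict, List, Tuple

Slicer = Tuple[int, int]  # [start, stop) along the partitioned dim (1-D model)

def dedup_attrs(rank2attr_area_map: Dict[int, Dict[str, Tuple[Tuple[int, ...], Slicer]]]):
    """Dedup slicers per full tensor (keep first appearance) and validate:
    consistent shape across ranks, non-intersecting slicers, full coverage."""
    shapes: Dict[str, Tuple[int, ...]] = {}
    slicers: Dict[str, List[Slicer]] = {}
    for rank in sorted(rank2attr_area_map):
        for orig_name, (shape, slicer) in rank2attr_area_map[rank].items():
            if orig_name in shapes:
                # the shape must be consistent across different ranks
                assert shapes[orig_name] == shape, f'shape mismatch: {orig_name}'
            else:
                shapes[orig_name] = shape
            if slicer not in slicers.setdefault(orig_name, []):
                slicers[orig_name].append(slicer)  # store first appearance only
    for name, spans in slicers.items():
        spans = sorted(spans)
        # adjacent spans must touch exactly: no overlap, no gap
        ok = spans[0][0] == 0 and spans[-1][1] == shapes[name][0] and \
            all(e0 == s1 for (_, e0), (s1, _) in zip(spans, spans[1:]))
        assert ok, f'slicers of {name} overlap or do not cover the full tensor'
    return shapes, slicers
```

For sorted 1-D spans, requiring each span to start exactly where the previous one ends rules out both overlaps and gaps in a single check.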
1. rename cube-related names 2. move `user_config.code` to `pas_config`
NOTE: this PR is a temporary solution for customized functions that have intra communications. The internal communication cost is not considered by the profiler, leading to a sub-optimal generated plan. End-to-end parity verified on YOCO-3B, 4xA6000.
add nnscaler strategy
parity check passed; unit tests passed
Assume operators with parameters consume and generate tensors with a batch dim. A search then propagates the possible batch dim through the whole graph. Add a test to check that autodist generates a data-parallel plan. Parity check passed.
Change `use_reentrant` to False because True is not stable and may trigger bugs in some torch versions.
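The change amounts to passing `use_reentrant=False` at every checkpoint call site. A minimal sketch (the layer and shapes are made up for illustration):

```python
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Linear(8, 8)
x = torch.randn(2, 8, requires_grad=True)

# use_reentrant=False selects the non-reentrant checkpoint implementation,
# which is the stable choice; use_reentrant=True can trigger bugs on some
# torch versions.
y = checkpoint(layer, x, use_reentrant=False)
y.sum().backward()
```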
Refine IRObject handling: 1. make tensors non-constant 2. don't raise an error on non-constant args for non-registered pyfuncs 3. set `is_constant=False` for all input objects. Unit tests pass; parity check passes.
set `memory_constraint` correctly
add interface for cube integration test and a script to collect huggingface models
Create `cache_dir` if it does not exist; refine `ComputeConfig` as it changes.
This PR is trying to reduce the memory usage when merging by combining zero and tp state together.
Fix: the cache dir is a str, which has no `exists()` method.
After this PR, the `follow` logic in autodist is:
- the father (followed) op should not contain a sum dim (linear is counted as a sum op, since its computation has a sum dim)
- a unary op (like GeLU) will try to follow its producer
- if an op's inputs come from multiple producers (like add, concat), it will follow the 1st producer when the producers are in the same `follow region`

Update the test case to illustrate this PR. Fix a bug in the dp solver when computing the in-edges for a dp node.
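The follow rules above can be sketched as a small decision function. The `Op` class and its fields here are hypothetical, invented only to illustrate the rules; the real autodist representation differs.

```python
# Hypothetical sketch of the autodist `follow` rules; Op and its fields
# are made up for illustration.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Op:
    name: str
    has_sum_dim: bool = False             # e.g. linear reduces over a dim
    producers: List['Op'] = field(default_factory=list)
    follow_region: int = 0                # ops in the same region may be followed

def pick_follow_target(op: Op) -> Optional[Op]:
    """Return the producer this op should follow, or None."""
    if not op.producers:
        return None
    if len(op.producers) == 1:
        cand = op.producers[0]            # unary op (like GeLU) follows its producer
    else:
        # multi-input op (add, concat): follow the 1st producer only if all
        # producers sit in the same follow region
        if len({p.follow_region for p in op.producers}) != 1:
            return None
        cand = op.producers[0]
    # the followed (father) op must not contain a sum dim
    return None if cand.has_sum_dim else cand
```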
…contains attributes parity alert passed 
The index in `train_mem2in_idx` is the original index of the input of the operator; this PR adds a mapping from the original index to the pure tensor index. The bug was found via functions that do not put tensor inputs first, e.g., `torch.gather`.
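The remapping can be illustrated with a small sketch; the helper name and the boolean-mask representation of the signature are assumptions for illustration.

```python
# torch.gather(input, dim, index): the dim argument (an int) sits between
# two tensor inputs, so an original input index differs from the index
# among tensor inputs only.
def build_tensor_index_map(is_tensor_flags):
    """Map original input positions to positions among tensor inputs.
    `is_tensor_flags[i]` marks whether argument i is a tensor."""
    mapping = {}
    tensor_idx = 0
    for orig_idx, is_tensor in enumerate(is_tensor_flags):
        if is_tensor:
            mapping[orig_idx] = tensor_idx
            tensor_idx += 1
    return mapping

# torch.gather(input, dim, index): positions 0 and 2 are tensors
assert build_tensor_index_map([True, False, True]) == {0: 0, 2: 1}
```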
parser: never fold getattr node 'self.training' unit test pass parity check pass
self.training in submodules: hotfix for nightly test unit test pass parity check pass
Add a pipeline to nightly build wheel, and fix packaging for autodist profile data. pipeline: https://msrasrg.visualstudio.com/SuperScaler/_build?definitionId=114 repo: https://msrasrg.visualstudio.com/SuperScaler/_artifacts/feed/nightly
never fold nnscaler runtime functions
- align the memory estimation in the dp solver with the ilp solver; check this [PR](https://dev.azure.com/msrasrg/SuperScaler/_git/MagicCube/pullrequest/2121) for more details
- refine the C++ code
- verified the search result against the ilp, with & without recompute, on retnet-3b

NOTE: after this PR, more meta information is introduced into a dynamic programming state, so the dp solver may be slower than the ilp solver; this needs further optimization.
1. [PR](https://dev.azure.com/msrasrg/SuperScaler/_git/MagicCube/pullrequest/2185) ignores the case when profiling fails. 2. dp solver bug 1: segmentation fault when `following candidates` is empty. 3. dp solver bug 2: a corner case where a newly generated dp state can be illegal; it must be checked before being added to the new states. Tests added.
Devops catch up
[Hotfix] Don't disable zero when scale unit is 1
…d small file batching (#2) 1. Batch small files for better performance, especially when the world size is big (otherwise a lot of small files are generated). 2. Multi-threaded IO performs better on SSDs, especially NVMe SSDs over PCIe. 3. [TODO] Chunking (+ pinned memory) can perform even better for huge files. Currently the max size of weight files in the default setting is 1b*dtype_size (2GB for bf16). This is not done in this PR. --------- Co-authored-by: Hangbo Bao <10023639+addf400@users.noreply.github.com> Co-authored-by: addf400 <addf400@foxmail.com>
Add support for gathering the full model state from all ranks. A potential usage: when nnscaler is used in RL and weights need to be synced to a rollout engine (like vLLM).
[Refine] Reduce memory fragment when resuming
…nal einops functions. Tracing einops functions is challenging due to their dynamic nature and heavy reliance on string-based patterns and runtime shape manipulations. To make things easier, we skip tracing the internal logic of einops functions and directly use the resolved transformation recipes.
1. Normalize `state_dict` device handling 2. Replace `torch.cat` with `F.pad`
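Where the padding is with zeros, `F.pad` expresses the same result as `torch.cat` with an explicit zeros tensor, without constructing that tensor by hand. A minimal illustration (the shapes are made up):

```python
import torch
import torch.nn.functional as F

x = torch.randn(3, 5)

# torch.cat with an explicit zeros tensor:
cat_version = torch.cat([x, torch.zeros(3, 2)], dim=1)

# F.pad expresses the same thing: pad 2 zeros at the end of the last dim
pad_version = F.pad(x, (0, 2))

assert torch.equal(cat_version, pad_version)
```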
* Add Doc for Autodist Constraints Guide * Revise the description of autodist in the documentation. * fix comment * polish doc --------- Co-authored-by: yileiyang <yileiyang@gmail.com>
* add ci.yml for continuous integration and continuous development * branch self test; switch back to conda from uv; change back to python 3.10 from python 3.12 * add the conda-forge channel for the tox-conda package * add py lib for tox-conda * change conda to uv because tox-conda is too old * use the uv tool to fix a setup issue * pip is needed for tox 3.0, while azure uses 3.0; don't need this permission * add all possible commands to the allowlist * change back to the main branch * back to conda with fixed tox and tox-conda versions * change back to uv --------- Co-authored-by: yileiyang <yileiyang@gmail.com>
* Add nightly test to the repo * make parity alignment * fix uncoordinated `shutil.rmtree` on rank 0 --------- Co-authored-by: yileiyang <yileiyang@gmail.com>
…10) 1. Add a new Mixin (ScaleDelayedOptimizerMixin) to support MixedPrecisionAdam-like optimizers 2. Refine HybridOptimizer to support ScaleDelayedOptimizerMixin. --------- Co-authored-by: zyeric <cheerforwhy@gmail.com>
…to segmentation fault
The cause: short-lived setitem operations lead to too many calls (>150K) within a short time range, which crashes memory management.