
fix test_setitem segment fault#53

Closed
lynex wants to merge 2026 commits into microsoft:main from msrasys:yomia/fix_test_setitem

Conversation


@lynex lynex commented Feb 6, 2026

The cause: a burst of setitem operations leads to too many calls (>150K) within a short time range, which crashes memory management.

Yilei Yang and others added 30 commits May 29, 2024 12:32
Introduce an interface `dedup_attrs`.
It deduplicates the attributes according to `rank2attr_area_map`.
    For each `slicers` of a full tensor with the name `orig_name`, we only store its first appearance
    in the `rank2attr_area_map`.
    In addition, we check that
    - the shape of the full tensor is consistent across different ranks
    - the slicers of the full tensor do not intersect each other
    - the slicers of the full tensor cover the full tensor
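The checks above can be sketched in plain Python. This is a hypothetical illustration, not nnscaler's actual `dedup_attrs`: it assumes `rank2attr_area_map` maps a rank to `(orig_name, slicers, full_shape)` records, with `slicers` a tuple of `(start, stop)` pairs per dimension, and verifies coverage by total volume (sufficient once non-intersection holds).

```python
def eval_volume(slicers):
    # volume of the box described by ((start, stop), ...) per dimension
    v = 1
    for s, e in slicers:
        v *= e - s
    return v

def dedup_attrs(rank2attr_area_map):
    seen = {}     # (orig_name, slicers) -> first rank it appeared on
    shapes = {}   # orig_name -> full tensor shape
    covered = {}  # orig_name -> list of slicers kept

    for rank, records in sorted(rank2attr_area_map.items()):
        for orig_name, slicers, full_shape in records:
            # shape of the full tensor must be consistent across ranks
            assert shapes.setdefault(orig_name, full_shape) == full_shape
            key = (orig_name, slicers)
            if key in seen:
                continue  # keep only the first appearance
            # new slicers must not intersect any slicers already kept
            for other in covered.get(orig_name, []):
                overlap = all(s0 < e1 and s1 < e0
                              for (s0, e0), (s1, e1) in zip(slicers, other))
                assert not overlap, f"{orig_name}: intersecting slicers"
            seen[key] = rank
            covered.setdefault(orig_name, []).append(slicers)

    # kept slicers must jointly cover the full tensor
    for orig_name, shape in shapes.items():
        full = 1
        for d in shape:
            full *= d
        got = sum(eval_volume(sl) for sl in covered[orig_name])
        assert got == full, f"{orig_name}: slicers do not cover the tensor"
    return covered
```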
1. rename cube related names
2. move user_config.code to pas_config
NOTE: this PR is a temporary solution for customized functions that have internal communications. The internal communication cost is not considered in the profiler, leading to a sub-optimal generated plan.

end2end parity verified on YOCO-3B, 4XA6000
Assume operators with parameters consume and generate tensors with a batch dim. A search then propagates the possible batch dim through the whole graph.

Add a test to check autodist will generate data parallel plan.

parity check passed
Change use_reentrant to False because True is not stable and may trigger bugs in some torch versions.
refine IRObject handling

1. make tensor non-constant
2. don't trigger error on non-constant args for non-registered pyfunc.
3. set is_constant=False for all input objects.

unit test pass
parity check pass
set `memory_constraint` correctly
add interface for cube integration test and a script for collecting huggingface models
create cache_dir if it does not exist
refine ComputeConfig as it changes
This PR is trying to reduce the memory usage when merging by combining zero and tp state together.
fix: cache dir is a str and has no exists() method
After this PR, the `follow` logic in autodist is
- the father (followed) op should not contain a sum dim (linear counts as a sum op, since it has a sum dim in its computation)
- a unary op (like GeLU) will try to follow its producer
- if an op's inputs come from multiple producers (like add, concat), it will follow the 1st producer if the producers are in the same `follow region`.

Update the test case to illustrate this PR.
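The follow heuristic above can be sketched in a few lines. The `Op` class and its fields (`has_sum_dim`, `producers`, `follow_region`) are illustrative names assumed for this sketch, not nnscaler's actual data structures:

```python
class Op:
    def __init__(self, name, producers=(), has_sum_dim=False, follow_region=None):
        self.name = name
        self.producers = list(producers)
        self.has_sum_dim = has_sum_dim     # e.g. linear sums over a dim
        self.follow_region = follow_region

def pick_followed(op):
    """Return the op this op should follow, or None."""
    if not op.producers:
        return None
    first = op.producers[0]
    if len(op.producers) == 1:
        # a unary op (like GeLU) tries to follow its producer,
        # but the followed op must not contain a sum dim
        return first if not first.has_sum_dim else None
    # multi-producer op (like add, concat): follow the 1st producer,
    # but only when all producers sit in the same follow region
    regions = {p.follow_region for p in op.producers}
    if len(regions) == 1 and not first.has_sum_dim:
        return first
    return None
```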

Fix the bug in the dp solver when computing the in-edges for a dp node.
The index in `train_mem2in_idx` is the original index of the operator's input; this PR adds a mapping from the original index to the pure tensor index.

This bug was found via functions that don't put tensor inputs first, e.g., `torch.gather`.
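The mapping the fix describes can be sketched as follows. The function name and input representation are assumptions for illustration: original argument positions differ from tensor-only positions when non-tensor args are interleaved (as with `torch.gather`'s `dim`):

```python
def build_tensor_index_map(inputs):
    """inputs: list of (arg, is_tensor) pairs in original argument order.
    Returns {original_index: pure_tensor_index} for the tensor args."""
    orig2tensor = {}
    tensor_idx = 0
    for orig_idx, (_, is_tensor) in enumerate(inputs):
        if is_tensor:
            orig2tensor[orig_idx] = tensor_idx
            tensor_idx += 1
    return orig2tensor
```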
parser: never fold getattr node 'self.training'
unit test pass
parity check pass
self.training in submodules: hotfix for nightly test

unit test pass
parity check pass
never fold nnscaler runtime functions
- align the memory estimation in dp solver with ilp solver, check this [PR](https://dev.azure.com/msrasrg/SuperScaler/_git/MagicCube/pullrequest/2121) for more details
- refine c++ code
- have verified the search result compared to the ilp with & without recompute on retnet-3b

NOTE: after this PR, more meta information is introduced into a dynamic programming state, so the dp solver may be slower than the ilp solver; this needs further optimization.
1. [PR](https://dev.azure.com/msrasrg/SuperScaler/_git/MagicCube/pullrequest/2185) ignores the case when profiling fails
2. dp solver bug 1: segment fault when `following candidates` is empty
3. dp solver bug 2: corner case where a newly generated dp state can be illegal; check it before adding it to the new states

tests added
nnScaler and others added 24 commits January 5, 2026 19:21
[Hotfix] Don't disable zero when scale unit is 1
…d small file batching (#2)

1. Batch small files for better performance, especially when the world size is big (otherwise a lot of small files are generated).
2. Multi-threaded IO has better performance on SSDs, especially NVMe SSDs on PCIe.
3. [TODO] Chunking (+ pinned memory) can perform even better for huge files. Currently the max size of a weight file in the default setting is 1b*dtype_size (2GB for bf16). This hasn't been done in this PR.

---------

Co-authored-by: Hangbo Bao <10023639+addf400@users.noreply.github.com>
Co-authored-by: addf400 <addf400@foxmail.com>
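The batching and multi-threaded IO ideas above can be sketched in plain Python. The function names, greedy packing strategy, and 64 MiB budget are assumptions for illustration, not nnscaler's actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def batch_files(sizes, budget=64 * 1024 * 1024):
    """Greedily pack (name, size) pairs into batches whose total size stays
    within `budget`; an oversized file gets a batch of its own."""
    batches, cur, cur_size = [], [], 0
    for name, size in sizes:
        if cur and cur_size + size > budget:
            batches.append(cur)
            cur, cur_size = [], 0
        cur.append(name)
        cur_size += size
    if cur:
        batches.append(cur)
    return batches

def write_batches(batches, write_one, max_workers=8):
    # multi-threaded IO: each batch is handled by one worker thread,
    # letting writes overlap on SSDs
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(write_one, batches))
```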
Add support for gathering full model state from all ranks.
A potential usage is when we use nnscaler in RL and need to sync weights to a rollout engine (like vLLM).
[Refine] Reduce memory fragment when resuming
…nal einops functions.

Tracing einops functions is challenging due to their dynamic nature and heavy reliance on string-based patterns and runtime shape manipulation.

To make things easier, we skip tracing the internal logic of einops functions and directly use the resolved transformation recipes.
1. Normalize state_dict device handling
2. replace torch.cat with F.pad
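The `torch.cat` → `F.pad` replacement avoids materializing an explicit zeros tensor before concatenation. The equivalence can be sketched with NumPy as a stand-in for the torch calls (an illustrative analog, not the actual nnscaler change):

```python
import numpy as np

x = np.arange(6, dtype=np.float32).reshape(2, 3)

# original style: concatenate with an explicit zeros block along dim 0
cat_style = np.concatenate([x, np.zeros((2, 3), dtype=x.dtype)], axis=0)

# pad style: same result without building the zeros tensor first
pad_style = np.pad(x, ((0, 2), (0, 0)))

assert np.array_equal(cat_style, pad_style)
```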
* Add Doc for Autodist Constraints Guide

* Revise the description of autodist in the documentation.

* fix comment

* polish doc

---------

Co-authored-by: yileiyang <yileiyang@gmail.com>
* add ci.yml for continuous integration and continuous development

* branch self test
switch back to conda from uv
change back to python3.10 from python 3.12

* add conda-forge channel for tox-conda package

* add py lib for tox-conda

* change conda to uv because tox-conda is too old

* using uv tool for fixing setup issue

* pip is needed for tox 3.0; Azure, which uses 3.0, doesn't need this permission

* add all possible commands to allowlist

* change back to main branch

* back to conda with fixed tox and tox-conda version

* change back to uv

---------

Co-authored-by: yileiyang <yileiyang@gmail.com>
* Add nightly test to the repo

* make parity alignment

* fix uncoordinated shutil.rmtree on rank0

---------

Co-authored-by: yileiyang <yileiyang@gmail.com>
…10)

1. Add a new Mixin (ScaleDelayedOptimizerMixin) to support MixedPrecisionAdam like optimizers
2. Refine HybridOptimizer to support ScaleDelayedOptimizerMixin.

---------

Co-authored-by: zyeric <cheerforwhy@gmail.com>
@lynex lynex closed this Feb 6, 2026
@lynex lynex deleted the yomia/fix_test_setitem branch February 6, 2026 07:40
@lynex lynex restored the yomia/fix_test_setitem branch February 6, 2026 07:43
@lynex lynex deleted the yomia/fix_test_setitem branch February 6, 2026 08:52
@lynex lynex restored the yomia/fix_test_setitem branch February 6, 2026 08:54
@lynex lynex deleted the yomia/fix_test_setitem branch February 6, 2026 08:55