fix fullslice
Introduce an interface `dedup_attrs`.
It deduplicates the attributes according to `rank2attr_area_map`.
For each `slicers` of a full tensor named `orig_name`, we only store its first appearance
in the `rank2attr_area_map`.
In addition, we check that
- the shape of the full tensor is consistent across different ranks
- the slicers of the full tensor do not intersect each other
- the slicers of the full tensor cover the full tensor
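The checks above can be sketched as follows. This is a hypothetical illustration, not the actual implementation: the 1-D `(start, stop)` slicer model and the map layout (`rank -> {orig_name: (shape, slicer)}`) are assumptions for clarity.

```python
from typing import Dict, List, Tuple

Slicer = Tuple[int, int]  # [start, stop) along the partitioned dim (1-D model)

def dedup_attrs(rank2attr_area_map: Dict[int, Dict[str, Tuple[Tuple[int, ...], Slicer]]]):
    """Dedup slicers per full tensor (keep first appearance) and validate:
    consistent shape across ranks, non-intersecting slicers, full coverage."""
    shapes: Dict[str, Tuple[int, ...]] = {}
    slicers: Dict[str, List[Slicer]] = {}
    for rank in sorted(rank2attr_area_map):
        for orig_name, (shape, slicer) in rank2attr_area_map[rank].items():
            if orig_name in shapes:
                # the shape must be consistent across different ranks
                assert shapes[orig_name] == shape, f'shape mismatch: {orig_name}'
            else:
                shapes[orig_name] = shape
            if slicer not in slicers.setdefault(orig_name, []):
                slicers[orig_name].append(slicer)  # store first appearance only
    for name, spans in slicers.items():
        spans = sorted(spans)
        # adjacent spans must touch exactly: no overlap, no gap
        ok = spans[0][0] == 0 and spans[-1][1] == shapes[name][0] and \
            all(e0 == s1 for (_, e0), (s1, _) in zip(spans, spans[1:]))
        assert ok, f'slicers of {name} overlap or do not cover the full tensor'
    return shapes, slicers
```

For sorted 1-D spans, requiring each span to start exactly where the previous one ends rules out both overlaps and gaps in a single check.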
1. rename cube-related names 2. move `user_config.code` to `pas_config`
NOTE: this PR is a temporary solution for customized functions that have intra communications. The internal communication cost is not considered by the profiler, leading to a sub-optimal generated plan. End-to-end parity verified on YOCO-3B, 4xA6000.
add nnscaler strategy
parity check passed; unit tests passed
Assume operators with parameters consume and generate tensors with a batch dim. A search then propagates the possible batch dim through the whole graph. Add a test to check that autodist generates a data-parallel plan. Parity check passed.
Change `use_reentrant` to False because True is not stable and may trigger bugs in some torch versions.
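The change amounts to passing `use_reentrant=False` at every checkpoint call site. A minimal sketch (the layer and shapes are made up for illustration):

```python
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Linear(8, 8)
x = torch.randn(2, 8, requires_grad=True)

# use_reentrant=False selects the non-reentrant checkpoint implementation,
# which is the stable choice; use_reentrant=True can trigger bugs on some
# torch versions.
y = checkpoint(layer, x, use_reentrant=False)
y.sum().backward()
```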
Refine IRObject handling: 1. make tensors non-constant 2. don't raise an error on non-constant args for non-registered pyfuncs 3. set `is_constant=False` for all input objects. Unit tests pass; parity check passes.
set `memory_constraint` correctly
add interface for cube integration test and a script to collect huggingface models
Create `cache_dir` if it does not exist; refine `ComputeConfig` as it changes.
This PR is trying to reduce the memory usage when merging by combining zero and tp state together.
Fix: the cache dir is a str, which has no `exists()` method.
After this PR, the `follow` logic in autodist is:
- the father (followed) op should not contain a sum dim (linear is counted as a sum op, since its computation has a sum dim)
- a unary op (like GeLU) will try to follow its producer
- if an op's inputs come from multiple producers (like add, concat), it will follow the 1st producer when the producers are in the same `follow region`

Update the test case to illustrate this PR. Fix a bug in the dp solver when computing the in-edges for a dp node.
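The follow rules above can be sketched as a small decision function. The `Op` class and its fields here are hypothetical, invented only to illustrate the rules; the real autodist representation differs.

```python
# Hypothetical sketch of the autodist `follow` rules; Op and its fields
# are made up for illustration.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Op:
    name: str
    has_sum_dim: bool = False             # e.g. linear reduces over a dim
    producers: List['Op'] = field(default_factory=list)
    follow_region: int = 0                # ops in the same region may be followed

def pick_follow_target(op: Op) -> Optional[Op]:
    """Return the producer this op should follow, or None."""
    if not op.producers:
        return None
    if len(op.producers) == 1:
        cand = op.producers[0]            # unary op (like GeLU) follows its producer
    else:
        # multi-input op (add, concat): follow the 1st producer only if all
        # producers sit in the same follow region
        if len({p.follow_region for p in op.producers}) != 1:
            return None
        cand = op.producers[0]
    # the followed (father) op must not contain a sum dim
    return None if cand.has_sum_dim else cand
```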
…contains attributes parity alert passed 
The index in `train_mem2in_idx` is the original index of the input of the operator; this PR adds a mapping from the original index to the pure tensor index. The bug was found via functions that do not put tensor inputs first, e.g., `torch.gather`.
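The remapping can be illustrated with a small sketch; the helper name and the boolean-mask representation of the signature are assumptions for illustration.

```python
# torch.gather(input, dim, index): the dim argument (an int) sits between
# two tensor inputs, so an original input index differs from the index
# among tensor inputs only.
def build_tensor_index_map(is_tensor_flags):
    """Map original input positions to positions among tensor inputs.
    `is_tensor_flags[i]` marks whether argument i is a tensor."""
    mapping = {}
    tensor_idx = 0
    for orig_idx, is_tensor in enumerate(is_tensor_flags):
        if is_tensor:
            mapping[orig_idx] = tensor_idx
            tensor_idx += 1
    return mapping

# torch.gather(input, dim, index): positions 0 and 2 are tensors
assert build_tensor_index_map([True, False, True]) == {0: 0, 2: 1}
```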
parser: never fold getattr node 'self.training' unit test pass parity check pass
self.training in submodules: hotfix for nightly test unit test pass parity check pass
Add a pipeline to nightly build wheel, and fix packaging for autodist profile data. pipeline: https://msrasrg.visualstudio.com/SuperScaler/_build?definitionId=114 repo: https://msrasrg.visualstudio.com/SuperScaler/_artifacts/feed/nightly
never fold nnscaler runtime functions
- align the memory estimation in the dp solver with the ilp solver; check this [PR](https://dev.azure.com/msrasrg/SuperScaler/_git/MagicCube/pullrequest/2121) for more details
- refine the C++ code
- verified the search result against the ilp, with & without recompute, on retnet-3b

NOTE: after this PR, more meta information is introduced into a dynamic programming state, so the dp solver may be slower than the ilp solver; this needs further optimization.
1. [PR](https://dev.azure.com/msrasrg/SuperScaler/_git/MagicCube/pullrequest/2185) ignores the case when profiling fails. 2. dp solver bug 1: segmentation fault when `following candidates` is empty. 3. dp solver bug 2: a corner case where a newly generated dp state can be illegal; it must be checked before being added to the new states. Tests added.
Devops catch up
[Hotfix] Don't disable zero when scale unit is 1
…d small file batching (#2) 1. Batch small files for better performance, especially when the world size is big (otherwise a lot of small files are generated). 2. Multi-threaded IO performs better on SSDs, especially NVMe SSDs over PCIe. 3. [TODO] Chunking (+ pinned memory) can perform even better for huge files. Currently the max size of weight files in the default setting is 1b*dtype_size (2GB for bf16). This is not done in this PR. --------- Co-authored-by: Hangbo Bao <10023639+addf400@users.noreply.github.com> Co-authored-by: addf400 <addf400@foxmail.com>
Add support for gathering the full model state from all ranks. A potential usage: when nnscaler is used in RL and weights need to be synced to a rollout engine (like vLLM).
[Refine] Reduce memory fragment when resuming
…nal einops functions. Tracing einops functions is challenging due to their dynamic nature and heavy reliance on string-based patterns and runtime shape manipulations. To make things easier, we skip tracing the internal logic of einops functions and directly use the resolved transformation recipes.
1. Normalize `state_dict` device handling 2. Replace `torch.cat` with `F.pad`
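Where the padding is with zeros, `F.pad` expresses the same result as `torch.cat` with an explicit zeros tensor, without constructing that tensor by hand. A minimal illustration (the shapes are made up):

```python
import torch
import torch.nn.functional as F

x = torch.randn(3, 5)

# torch.cat with an explicit zeros tensor:
cat_version = torch.cat([x, torch.zeros(3, 2)], dim=1)

# F.pad expresses the same thing: pad 2 zeros at the end of the last dim
pad_version = F.pad(x, (0, 2))

assert torch.equal(cat_version, pad_version)
```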
* Add Doc for Autodist Constraints Guide * Revise the description of autodist in the documentation. * fix comment * polish doc --------- Co-authored-by: yileiyang <yileiyang@gmail.com>
* add ci.yml for continuous integration and continuous development * branch self test; switch back to conda from uv; change back to python 3.10 from python 3.12 * add the conda-forge channel for the tox-conda package * add py lib for tox-conda * change conda to uv because tox-conda is too old * use the uv tool to fix a setup issue * pip is needed for tox 3.0, while azure uses 3.0; don't need this permission * add all possible commands to the allowlist * change back to the main branch * back to conda with fixed tox and tox-conda versions * change back to uv --------- Co-authored-by: yileiyang <yileiyang@gmail.com>
* Add nightly test to the repo * make parity alignment * fix uncoordinated `shutil.rmtree` on rank 0 --------- Co-authored-by: yileiyang <yileiyang@gmail.com>
…10) 1. Add a new Mixin (ScaleDelayedOptimizerMixin) to support MixedPrecisionAdam-like optimizers 2. Refine HybridOptimizer to support ScaleDelayedOptimizerMixin. --------- Co-authored-by: zyeric <cheerforwhy@gmail.com>
…to segmentation fault
The cause: short-lived setitem operations lead to too many calls (>150K) within a short time range, which crashes memory management.