Draft
233 commits
4cd9563
Fix for PR-2142 (#3165)
HaochenYuan Jan 30, 2026
6de6362
ci: Onboard more GB200 tests (#3145)
ko3n1g Jan 30, 2026
de15117
ci(hotfix): Alert for GB200 (#3168)
ko3n1g Jan 30, 2026
7952d7e
Fix SFTDataset truncation bug (#3158)
duncanriach Jan 30, 2026
b9ee19e
Vitalyk/multiturn v2 (#3167)
yobibyte Jan 30, 2026
b168849
ci: Disable the api check for now (#3157)
chtruong814 Jan 30, 2026
a205538
ci: Add DSv3 proxy (#3169)
ko3n1g Jan 30, 2026
14b70c7
Nvshmem refit (#2696)
wdykas Jan 30, 2026
fdc04f6
Update copy-pr-bot.yaml [skip ci]
github-actions[bot] Jan 30, 2026
9ad5906
[Community][Main] fix(moe): Fix theoretical memory calculation of lay…
1195343015 Jan 30, 2026
5415e1d
fix: Set --refit-method default to gloo (#3172)
wdykas Jan 30, 2026
a976754
[fix] Bug fix for offloading in evaluate() (#3043)
lhb8125 Jan 30, 2026
991c38f
Update copy-pr-bot.yaml [skip ci]
github-actions[bot] Jan 31, 2026
5d0a7fd
cp: `Fix: nccl-ub in ddp path (3181)` into `main` (#3182)
ko3n1g Jan 31, 2026
ffbc43f
Miscellaneous inference cleanup (#2955)
santhnm2 Jan 31, 2026
0fe3232
Revert "Miscellaneous inference cleanup (#2955)"
ko3n1g Jan 31, 2026
69a5c63
ci: Fix DSv3 (#3188)
ko3n1g Jan 31, 2026
2fadde8
Fix missing argument in MoELayer.forward() (#3133)
jiemingz Feb 1, 2026
ae67076
Fix H2D stream synchronization in optimizer offload (#3140)
tgkyrie Feb 1, 2026
300d1b6
Add MTP support for hybrid models (#2363)
rkarimimahab Feb 1, 2026
dceb1fb
docs: improve Megatron-LM and Megatron Core descriptions (#3115)
sbhavani Feb 2, 2026
f4502eb
Handle `step` key correctly in checkpoint save with `--optimizer-cpu-…
ahmadki Feb 2, 2026
70719cd
mRoPE for MTP (#3114)
BestJuly Feb 2, 2026
e836e62
Fix two minor bugs in MTP implementation for hybrid models (#3194)
deepakn94 Feb 2, 2026
1362e4a
Update README.md (#2111)
mvirts Feb 2, 2026
31d0c87
Revert "Fix two minor bugs in MTP implementation for hybrid models (#…
ko3n1g Feb 2, 2026
a0cc8ca
Revert "Add MTP support for hybrid models (#2363)"
ko3n1g Feb 2, 2026
50546da
Fix bug in SFTDataset (#3185)
duncanriach Feb 2, 2026
dff4189
Fix several syntax error (#3004)
HollowMan6 Feb 2, 2026
c4bea0a
Fix for RL Test (#3148)
wdykas Feb 3, 2026
a4008d0
Fix latent moe flops and backward_dw (#2977)
buptzyb Feb 3, 2026
afe443b
Use global user buffer when the bucket size does not fit FixedPoolAll…
shengf-nv Feb 3, 2026
78475fe
ci: Checkpoint retention (#3205)
ko3n1g Feb 3, 2026
7080697
Add unit test for LatentMoE (#2892)
venmugil Feb 3, 2026
0028273
ci: Enable unit tests on merge-queue (#3186)
ko3n1g Feb 3, 2026
94c9eae
Fix seq pack flag in `get_logprobs` (#3206)
mathemakitten Feb 3, 2026
b477d12
ci(fix): Parse unit tests in merge-queue (#3224)
ko3n1g Feb 3, 2026
1a61b77
Fix TE 2.12 AllGather CI failure (#3101)
BestJuly Feb 3, 2026
79e7bfe
ci(hotfix): Pin uv (#3233)
ko3n1g Feb 3, 2026
18d69f1
Add a unit test to check that RL `get_logprobs` will reuse training c…
mathemakitten Feb 3, 2026
27a5f83
Do not offload grad buffers when training graphs are enabled (#3231)
mathemakitten Feb 3, 2026
bc2eb9a
Fix missing PackedSeqParams import (#3214)
parthmannan Feb 3, 2026
1fdb29f
Synchronize the request counts for EP inference with strict matching …
santhnm2 Feb 3, 2026
e02344e
Do not let requests fail silently inside inference engine (#3228)
tdene Feb 3, 2026
4c48248
Update copy-pr-bot.yaml [skip ci]
github-actions[bot] Feb 4, 2026
9050d5b
Fix coordinator address collision check in flask (#3208)
tdene Feb 3, 2026
cd5ed74
torch saver inference model offload (#3170)
wdykas Feb 4, 2026
982ca5d
enable cuda graph ut (#3197)
Autumn1998 Feb 4, 2026
473e283
Support EP with HSDP (#2840)
wplf Feb 4, 2026
4a23972
[Main] Add the missing part to support 1F1B overlap for Qwen3-Next (#…
BestJuly Feb 4, 2026
c036e77
Missing import fix (#3241)
parthmannan Feb 4, 2026
43db8c1
Miscellaneous inference cleanup (Replay of !2955) (#3232)
santhnm2 Feb 4, 2026
adce147
Add DistributedInitConfig (#3173)
maanug-nv Feb 4, 2026
f3e6cc8
Fix checkpoint converter missing parallel group initialization (#3217)
yashaswikarnati Feb 4, 2026
d558b5f
Skip empty sequences and chunks in MTP tensor roll (#3035)
BestJuly Feb 4, 2026
f708b5d
Implement get_parameters for ChainedOptimizer (#3201)
nschank Feb 4, 2026
66c432a
ci(fix): Create main/dev image tags (#3252)
ko3n1g Feb 4, 2026
e24767f
ci(hotfix): Skopeo copy
ko3n1g Feb 4, 2026
d959620
ci(hotfix): Add skopeo
ko3n1g Feb 4, 2026
9d71cb1
Reapply "Add MTP support for hybrid models (#2363)" (#3207)
sancha Feb 4, 2026
b043863
Fix uv install for GH actions (#3259)
Phlip79 Feb 4, 2026
dd7d141
Update the project structure in README (#3251)
janEbert Feb 5, 2026
1f6d8c2
chore: rotate oncall schedule
github-actions[bot] Feb 5, 2026
1b11076
Cherry-pick: Fix mtp_num_layers and clip_qk issues (#2581, #2776) (#3…
BestJuly Feb 5, 2026
111a2a0
RL: training cudagraphs functional test (#3235)
mathemakitten Feb 5, 2026
1934391
[Main] fix cg missing wgrad hook (#3074)
Wohox Feb 5, 2026
801f12f
Avoid .cuda call on meta device in LanguageModel (#3202)
nschank Feb 5, 2026
347ad21
Nano QAT/D fix with sft tokenizer and datasets (#3254)
ChenhanYu Feb 5, 2026
3c0a4f3
Update copy-pr-bot.yaml [skip ci]
github-actions[bot] Feb 6, 2026
0434f87
fix checkpointing error message (#3203)
dimapihtar Feb 5, 2026
8379d43
Revert "fix checkpointing error message (#3203)" (#3283)
ko3n1g Feb 6, 2026
e2e5a6a
Reapply "fix checkpointing error message (#3203)" (#3283) (#3285)
ko3n1g Feb 6, 2026
a116ce3
docs: Add changelog for 0.15.3 (#3286)
ko3n1g Feb 6, 2026
4376cc5
ci: Set throughput tests as flaky (#3301)
chtruong814 Feb 6, 2026
f92460b
chore: Move GB200 tests to nightly (#3302)
ko3n1g Feb 6, 2026
cfbe9b5
Ensure type-checker understands use of Submodules in bert_model (#3256)
nschank Feb 6, 2026
a63d045
Override extra_repr instead of __repr__ (#3200)
nschank Feb 7, 2026
f68c7c1
Replace ModuleSpec with Protocols for LayerNorm submodules (#3090)
nschank Feb 7, 2026
2f99ee8
chore: Remove gpt_grpo_tp2tp1_pp4pp2_dp8_583m_throughputtest
ko3n1g Feb 7, 2026
e3ae6e4
Non colocated refit (#3213)
wdykas Feb 7, 2026
554ce49
Fuse permute+pad and unpermute+unpad ops for FP8/FP4 training (#2763)
xiaoxi-wangfj Feb 7, 2026
7cbbba2
Add check to prevent MFSDP from numeric issue in gradient accumulate …
shjwudp Feb 7, 2026
c99c962
update get_embedding_ranks and get_position_embedding_ranks docstring…
c1lovez1 Feb 7, 2026
6d81e3d
ci: Add secrets detector (#3180)
chtruong814 Feb 7, 2026
a3ec4b0
Param offset in _ParamAndGradBucket should be aligned (#3007)
skydoorkai Feb 7, 2026
916301a
updates to support modelopt EAGLE training with CP (#3147)
yeyu-nvidia Feb 9, 2026
6103cb5
Ensure type-checker understands use of Submodules in llava_model (#3257)
nschank Feb 9, 2026
4ff7686
M-FSDP: Remove redundant stream waits in HSDP to prevent CG fail (#2941)
shjwudp Feb 9, 2026
3257093
fully remove legacy tokenizer system (#2946)
dimapihtar Feb 9, 2026
3069591
General README and pyproject fixes (#2907)
ahmadki Feb 9, 2026
3bb539e
chore: More aggressive checkpointing (#3315)
ko3n1g Feb 9, 2026
c072f89
ci: Pin down setuptools to lt 82 (#3313)
ko3n1g Feb 9, 2026
9ddbce3
fix: T5 dataset (#3307)
ko3n1g Feb 9, 2026
f14d161
fix: numpy overflow (#3306)
ko3n1g Feb 9, 2026
8d79987
ci: Revert "ci: Add secrets detector (#3180)" (#3330)
chtruong814 Feb 10, 2026
55c3e63
ci: Add more tests, run on merge-queue (#3317)
ko3n1g Feb 10, 2026
ba76934
ci: Remove merge-gate environment check (#3331)
chtruong814 Feb 10, 2026
ab5e277
Use FP4 context for mamba (#2604)
kwyss-nvidia Feb 10, 2026
fc557ec
ci: Ensure we run all functional tests in merge group (#3332)
chtruong814 Feb 10, 2026
55198ba
Replace ModuleSpec with Protocols for inputs to MLP (#3084)
nschank Feb 10, 2026
5eb20b8
ci: Fix merge queue functional tests (#3337)
chtruong814 Feb 10, 2026
367f0b8
ci: skip queue in merge-gate (#3343)
ko3n1g Feb 10, 2026
3fb6006
ci: Timeout for functional tests (#3346)
ko3n1g Feb 10, 2026
76cf11e
update checkpointing documentation (#3347)
dimapihtar Feb 10, 2026
836d473
Update golden values to reflect improvements (#3350)
tdene Feb 10, 2026
2451508
BUGFIX: gpt vs hybrid model mtp naming mismatch (#3334)
sancha Feb 10, 2026
8da949e
Disable flaky test (#3354)
tdene Feb 10, 2026
bb97791
re-enable gpt grpo tests (#3348)
jon-barker Feb 10, 2026
4bce841
Fix SFT Pipeline when TP>1 (#3268)
asolergi-nv Feb 10, 2026
f5238ba
Fixes for KD mode (#3342)
AAnoosheh Feb 10, 2026
c1169ea
chore: rotate oncall schedule
github-actions[bot] Feb 11, 2026
4f025a1
chore: Update codeowners file (#3365)
ko3n1g Feb 11, 2026
66ec17e
Siddharth/fix inference functional tests (#3357)
sidsingh-nvidia Feb 11, 2026
6a9da99
Switch oncall (#3360)
janEbert Feb 11, 2026
b6e883b
Add missing RMSNorm to llama train script (#3314)
AAnoosheh Feb 11, 2026
cd14090
Fix inference for MTP models (#3297)
tdene Feb 11, 2026
6f5de16
Add a logprobs test with real gpt model. (#2870)
yobibyte Feb 11, 2026
b5d50cb
Add simple GRPO functional test (#3323)
tdene Feb 11, 2026
1c245c7
ci: Concurrency control for merge-queue (#3353)
ko3n1g Feb 11, 2026
d9f075c
ci: Update golden value download script to work with Github (#3335)
chtruong814 Feb 11, 2026
d0b768f
Removing etc from main index page, shifted name of discussions (#3271)
megnvidia Feb 11, 2026
7d1acf6
fix: correct typos 'seperated' and 'recieved' (#3305)
thecaptain789 Feb 11, 2026
2807a4e
Improved PyTorch profiler and added PyTorch execution trace (#3273)
shengf-nv Feb 11, 2026
1959739
build: Bump TE on 2.12 (#3371)
ko3n1g Feb 11, 2026
f06e669
ci(hotfix): job conditions (#3376)
ko3n1g Feb 11, 2026
6467cbc
Update copy-pr-bot.yaml [skip ci]
github-actions[bot] Feb 12, 2026
faced51
Record moe routing decisions during inference. (#3034)
sidsingh-nvidia Feb 11, 2026
dedb6dd
[Main] Fix EP Overlap Bugs for Full-Iter CG (#3164)
Wohox Feb 12, 2026
11a4659
Avoid direct pickle import (#3375)
maanug-nv Feb 12, 2026
fe9279e
Delete old pretrain_* files (#3359)
Phlip79 Feb 12, 2026
7df15bd
Add Qwen3-VL support with Megatron-FSDP (#2841)
xuwchen Feb 12, 2026
c65fb25
Refactor Mamba chunked prefill (#3265)
santhnm2 Feb 12, 2026
47938af
Improved parallel logging of learning rate (#3319)
jstjohn Feb 12, 2026
a51c1c8
Add enhanced event tracking with TTFT measurement and compact seriali…
lmcafee-nvidia Feb 12, 2026
dbc444f
Add assertion that max_requests is divisible by tp_size (#3304)
santhnm2 Feb 12, 2026
1123fc0
Move to using the Inference OpenAI API server (#3107)
ArEsKay3 Feb 12, 2026
4184bfa
Revert "Move to using the Inference OpenAI API server (#3107)"
ko3n1g Feb 13, 2026
9119fae
Update moe github test cases. (#3077)
Victarry Feb 13, 2026
e0aa16b
Revert "Update moe github test cases. (#3077)"
ko3n1g Feb 13, 2026
28ccdaa
Split layer_specs to return Submodules instead of ModuleSpecs (#3255)
nschank Feb 13, 2026
76a9f47
ci: Remove gpu sanity check (#3420)
chtruong814 Feb 13, 2026
d10eb6f
[Critical-Bug] Fix Uneven PP for Mamba models (Nemotron3-nano) (#3399)
kevalmorabia97 Feb 13, 2026
2611830
Fix for rl (#3390)
shanmugamr1992 Feb 13, 2026
4578ed8
Add check for full_iteration scope before instantiating CudaGraphMana…
vasunvidia Feb 13, 2026
698feec
Fix broken links throughout (#3230)
megnvidia Feb 13, 2026
d401490
Extract intermediate embeddings of transformer block (#3060)
sajadn Feb 13, 2026
545bff9
Update copy-pr-bot.yaml [skip ci]
github-actions[bot] Feb 14, 2026
b890099
Decouple topk and loss from DSA Indexer (#3248)
kunlunl Feb 13, 2026
7a8f305
Move to using the Inference OpenAI API server (bis) (#3395)
tdene Feb 14, 2026
4beb8ca
Make Mamba inference state memory ratio configurable (#3322)
santhnm2 Feb 16, 2026
cbb47c8
Fix configs for RL model environments (#3441)
tdene Feb 16, 2026
8f1c2f8
Replace pickle with json in rl_utils (#3351)
tdene Feb 16, 2026
057c804
fix: correct typo in demo training example (#3428)
dndnda Feb 17, 2026
b218e64
Clean up logging inside inference flask server (#3437)
tdene Feb 17, 2026
3c69780
ci: Update release-docs workflow to use FW-CI-templates v0.72.0 (#3438)
chtruong814 Feb 17, 2026
74ef64e
Fix --tokenizer-hf-include-special-tokens (#3422)
jon-barker Feb 17, 2026
267cf1f
Update num_tokens_to_generate default for Gym (#3453)
tdene Feb 17, 2026
0627623
Fix slowdown in inference flask server (#3445)
tdene Feb 17, 2026
a22c40e
Add a normalized scale for MTP per token loss (#3159)
BestJuly Feb 17, 2026
d7500d4
[Bugfix] Fix nan loss caused by zero token in MTP (#3396)
BestJuly Feb 17, 2026
ad5a627
ci: Add testing branches
ko3n1g Feb 18, 2026
cd71d4c
chore: rotate oncall schedule
github-actions[bot] Feb 18, 2026
f1908bc
Log RL metrics per environment (#3446)
yobibyte Feb 18, 2026
1106df4
Move tensor offload/onload out of RL code (#3029)
tdene Feb 18, 2026
0672477
Add Engine event to the follow up requests after checkpointing (#3473)
ArEsKay3 Feb 18, 2026
7b016be
Fix another inference flask / Gym interaction (#3467)
tdene Feb 18, 2026
acb7273
adding in copyright blurb at the top of md file (#3394)
megnvidia Feb 18, 2026
fdde15e
[Megatron-FSDP] Add fsdp_all_gather_in_start_param_sync option in DDP…
shjwudp Feb 18, 2026
77f22f2
ci: Update release workflow to include changelog and publish docs (#3…
chtruong814 Feb 18, 2026
1666b45
ci(fix): Weekly GPT tests (#3443)
ko3n1g Feb 18, 2026
0d0943c
ci: Remove environments (#3462)
ko3n1g Feb 19, 2026
d07f16c
update HF tokenizer defaults (#3440)
dimapihtar Feb 19, 2026
1d694c2
PTQ changes for upcoming QAD (#3124)
AAnoosheh Feb 19, 2026
655cc8e
ci: Bump preflight to detect our svc (#3494)
ko3n1g Feb 19, 2026
7f35af4
build: Drop Python 3.10 support and pip install one-logger (#3485)
ko3n1g Feb 19, 2026
2d06cc9
ci: Bump pre-flight for Bot SSO (#3497)
ko3n1g Feb 19, 2026
a781f3c
Update copy-pr-bot.yaml [skip ci]
github-actions[bot] Feb 19, 2026
50ebe8e
Revert "build: Drop Python 3.10 support and pip install one-logger (#…
ko3n1g Feb 19, 2026
c191ae8
Fix chunked prefill edge cases (#3404)
santhnm2 Feb 19, 2026
9f611b7
ci: Enable MBridge downstream testing via PR (#3483)
ko3n1g Feb 19, 2026
7b6e226
ci: Remove gitlab docs build job and set LTS integration and function…
chtruong814 Feb 19, 2026
31bd4a3
[OMNIML-3232] ModelOpt: add full TE spec option and wire Mamba stack …
yueshen2016 Feb 20, 2026
9ba248e
Track off-policyness across RL steps (#3030)
tdene Feb 20, 2026
9d72f63
chore(beep boop 🤖): Bump (main) (2026-02-20)
github-actions[bot] Feb 20, 2026
b7aa6a0
ci: MBridge testing branch name during merge-queues (#3513)
ko3n1g Feb 20, 2026
7a36263
ci: Enable Dependabot Automerge (#3487)
ko3n1g Feb 20, 2026
e8fd432
ci: Also sync direct teams (#3484)
ko3n1g Feb 20, 2026
01b361c
Multimodal: fix argument checking (#3449)
faradawn Feb 20, 2026
773c113
Fix Megatron-FSDP optimizer state DCP checkpointing, and fix DTensor …
cspades Feb 20, 2026
b555baf
Update copy-pr-bot.yaml [skip ci]
github-actions[bot] Feb 21, 2026
32efeff
Renable full_iteration cuda graphs for inference. Add them for the ma…
sidsingh-nvidia Feb 20, 2026
a6d6dc6
do not add EoD (#3526)
arendu Feb 21, 2026
7124748
chore(beep boop 🤖): Bump (main) (2026-02-23)
github-actions[bot] Feb 23, 2026
3d1a4ba
Do not Slack notify for draft PRs (#3536)
Phlip79 Feb 23, 2026
9c6b69b
Update copy-pr-bot.yaml [skip ci]
github-actions[bot] Feb 23, 2026
6159399
remove deprecated SampleListWebdataset (#3407)
dimapihtar Feb 23, 2026
4aeeca8
remove deprecated get_te_version (#3413)
dimapihtar Feb 23, 2026
33df24c
remove deprecated async_grad_allreduce param (#3412)
dimapihtar Feb 23, 2026
eb0da52
remove deprecated mamba params (#3411)
dimapihtar Feb 23, 2026
82cbd82
remove deprecated params from model parallel config (#3408)
dimapihtar Feb 23, 2026
cd1c215
[dev] `cp: Cherrypick CI changes` (#3543)
ko3n1g Feb 23, 2026
a0b9c16
Remove redundant CUDA calls in the LLaVA dataloader (#3476)
duncanriach Feb 23, 2026
fde3b90
Inference: Create finer grained cuda-graphs with better coverage of s…
sidsingh-nvidia Feb 23, 2026
23dd639
fix: skip non-tensor optimizer state entries in distrib_optimizer sav…
ahmadki Feb 23, 2026
2ec295e
remove is_unitialized & get_data_modulo_expert_parallel_group (#3414)
dimapihtar Feb 24, 2026
abf97eb
remove deprecated TE module (#3409)
dimapihtar Feb 24, 2026
7c7c9e1
remove encoder_and_decoder from enums (#3406)
dimapihtar Feb 24, 2026
dd39eb5
chore(beep boop 🤖): Bump (main) (2026-02-24)
github-actions[bot] Feb 24, 2026
cb24802
Add knobs to choose process groups for fully-parallel-save / load and…
sbak5 Feb 24, 2026
5f5f465
Fix off-by-2 error in RL sequence packing (#3551)
tdene Feb 24, 2026
a8efd34
Skip unnecessary flattening for Save / Load Planner (#3263)
sbak5 Feb 24, 2026
5dc98a6
Multimodal: fix model provider (#3508)
faradawn Feb 24, 2026
76b200c
docs: Enable nightly docs publish (#3546)
chtruong814 Feb 24, 2026
f721069
Ensure type-checker understands use of Submodules in unit tests (#3425)
nschank Feb 24, 2026
782e54b
Use copy_signature to preserve typing of pass-through methods (#3419)
nschank Feb 24, 2026
3597312
Ensure type-checker understands use of Submodules in MTP (#3308)
nschank Feb 24, 2026
44e27d0
Add mxfp8 quantization for inference linear layers (#3447)
santhnm2 Feb 24, 2026
2cac78b
Add single-process checkpoint save to avoid forked multiprocessing (#…
sbak5 Feb 24, 2026
08857d9
Fixed fp32 residuals (#3504)
mkhona-nvidia Feb 24, 2026
aa86018
[Dev] Fix MoE aux loss tracker hang with MTP enabled (#3400)
Victarry Feb 25, 2026
2b4b9c4
ci: Remove multi-approval action from dev branch (#3576)
chtruong814 Feb 25, 2026
0ab47fa
Merge branch 'main' into dev
FDecaYed Feb 26, 2026
a1a73f8
[dev] pull main 260220 (#3574)
ko3n1g Feb 26, 2026
2e4a5d4
[dev] fix(moe): fix the bug where gate was not sliced when kv_head < …
LiuXTao Feb 27, 2026
9ca7af6
[1/8] fix: misc compatibility fixes for PyTorch and TE (#2)
yueming-yuan Feb 19, 2026
307390e
[2/8] feat: support partial checkpoint loading (#3)
yueming-yuan Feb 19, 2026
4cef384
[3/8] feat: add post-attention and post-MLP layernorm support (#4)
yueming-yuan Feb 19, 2026
470b592
[4/8] fix: MLA RoPE triton kernel head indexing and v_dim=0 support (#5)
yueming-yuan Feb 19, 2026
e970a6f
[5/8] feat: support MTP training in RL (#6)
yueming-yuan Feb 19, 2026
db61536
[6/8] feat: support rollout routing replay (R3) and bypass for MTP la…
guapisolo Feb 19, 2026
bd0b03b
[7/8] feat: add INT4 fake QAT for MoE grouped linear (#9)
yueming-yuan Feb 19, 2026
8900933
[8/8] fix: CUDA IPC incompatibility from Megatron bump (#11)
guapisolo Feb 28, 2026
d08ead7
fix: dp_reshardable checkpoint backward compat in Megatron core
guapisolo Feb 28, 2026
2 changes: 1 addition & 1 deletion .github/copy-pr-bot.yaml
@@ -1,4 +1,4 @@
enabled: true
auto_sync_draft: false
auto_sync_ready: true
-trustees_override: ["AAnoosheh", "ArEsKay3", "Autumn1998", "BestJuly", "BoxiangW", "ChenhanYu", "FDecaYed", "HaochenYuan", "ISEEKYAN", "JRD971000", "Phlip79", "QiZhangNV", "ShriyaRishab", "Victarry", "Wohox", "ZhiyuLi-Nvidia", "ahmadki", "aklife97", "ananthsub", "asolergi-nv", "buptzyb", "chtruong814", "cspades", "cuichenx", "deepakn94", "dimapihtar", "duncanriach", "erhoo82", "ericharper", "fanshiqing", "frsun-nvda", "gautham-kollu", "gdengk", "guyueh1", "hxbai", "jalbericiola", "janEbert", "jaredcasper", "jenchen13", "jiemingz", "jingqiny-99", "jkamalu", "jon-barker", "jstjohn", "kanz-nv", "kevalmorabia97", "ko3n1g", "kunlunl", "kvareddy", "kwyss-nvidia", "layalir", "lhb8125", "lmcafee-nvidia", "maanug-nv", "mathemakitten", "matthieule", "mehraakash", "mkhona-nvidia", "parthmannan", "prajwal1210", "pthombre", "rogerwaleffe", "sanandaraj5597", "sancha", "santhnm2", "sbak5", "shanmugamr1992", "shifangx", "shjwudp", "sidsingh-nvidia", "skyw", "sudhakarsingh27", "tdene", "theothermike", "thomasdhc", "trintamaki", "tylerpoon", "wdykas", "xiaoyao0115", "xuwchen", "yanring", "yaox12", "yaoyu-33", "yashaswikarnati", "yeyu-nvidia", "yobibyte", "youngeunkwon0405", "yuzhongw-nvidia", "zhongbozhu"]
+trustees_override: ["AAnoosheh", "ArEsKay3", "Autumn1998", "BestJuly", "BoxiangW", "CarlosGomes98", "ChenhanYu", "FDecaYed", "HaochenYuan", "ISEEKYAN", "JRD971000", "Phlip79", "QiZhangNV", "RPrenger", "ShriyaRishab", "Victarry", "Wohox", "ZhiyuLi-Nvidia", "ahmadki", "aklife97", "ananthsub", "asolergi-nv", "buptzyb", "chtruong814", "cspades", "cuichenx", "deepakn94", "dimapihtar", "dingqingy-nv", "duncanriach", "erhoo82", "ericharper", "fanshiqing", "faradawn", "frsun-nvda", "gautham-kollu", "gdengk", "guyueh1", "hxbai", "ilml", "jalbericiola", "janEbert", "jaredcasper", "jenchen13", "jiemingz", "jingqiny-99", "jkamalu", "jon-barker", "jstjohn", "kanz-nv", "kevalmorabia97", "ko3n1g", "kunlunl", "kvareddy", "kwyss-nvidia", "layalir", "lhb8125", "lmcafee-nvidia", "maanug-nv", "mathemakitten", "matthieule", "mchrzanowski", "mehraakash", "mkhona-nvidia", "parthmannan", "prajwal1210", "pthombre", "rogerwaleffe", "sajadn", "sanandaraj5597", "sancha", "santhnm2", "sbak5", "shanmugamr1992", "sharathts", "shengf-nv", "shifangx", "shjwudp", "sidsingh-nvidia", "skyw", "sudhakarsingh27", "tdene", "theothermike", "thomasdhc", "trintamaki", "tylerpoon", "wdykas", "xiaoyao0115", "xuwchen", "yanring", "yaox12", "yaoyu-33", "yashaswikarnati", "yeyu-nvidia", "yobibyte", "youngeunkwon0405", "yueshen2016", "yuzhongw-nvidia", "zhongbozhu"]
24 changes: 12 additions & 12 deletions .github/oncall_schedule.json
@@ -1,18 +1,6 @@
[
{
"user": "dimapihtar",
"date": "2026-01-28"
},
{
"user": "gautham-kollu",
"date": "2026-02-04"
},
{
"user": "janEbert",
"date": "2026-02-11"
},
{
"user": "Phlip79",
"date": "2026-02-18"
},
{
@@ -46,5 +34,17 @@
{
"user": "BoxiangW",
"date": "2026-04-15"
},
{
"user": "Phlip79",
"date": "2026-04-22"
},
{
"user": "asolergi-nv",
"date": "2026-04-29"
},
{
"user": "dimapihtar",
"date": "2026-05-06"
}
]
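Review note: the three entries appended at the tail of the schedule continue the weekly cadence of the preceding entry. A minimal sketch of a consistency check, assuming the file is a JSON list of `{"user", "date"}` objects as shown in the hunk above. The helper name `weekly_gaps_ok` is hypothetical and not part of the repository.

```python
import json
from datetime import date, timedelta

# Tail of the schedule after this change, copied from the hunk above.
SCHEDULE_TAIL = json.loads("""
[
  {"user": "BoxiangW", "date": "2026-04-15"},
  {"user": "Phlip79", "date": "2026-04-22"},
  {"user": "asolergi-nv", "date": "2026-04-29"},
  {"user": "dimapihtar", "date": "2026-05-06"}
]
""")


def weekly_gaps_ok(entries):
    """Return True if consecutive entries are exactly seven days apart."""
    days = [date.fromisoformat(entry["date"]) for entry in entries]
    return all(later - earlier == timedelta(days=7)
               for earlier, later in zip(days, days[1:]))


print(weekly_gaps_ok(SCHEDULE_TAIL))
```

A check like this could run in CI on the schedule file, but whether the repository enforces a strict seven-day spacing is not stated in the diff.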
65 changes: 65 additions & 0 deletions .github/scripts/readme.sh
@@ -0,0 +1,65 @@
#!/bin/bash

cat << 'EOF'
╔══════════════════════════════════════════════════════════════════════╗
║ ║
║ ███╗ ███╗██████╗ ██████╗ ██╗██████╗ ██████╗ ███████╗ ║
║ ████╗ ████║██╔══██╗██╔══██╗██║██╔══██╗██╔════╝ ██╔════╝ ║
║ ██╔████╔██║██████╔╝██████╔╝██║██║ ██║██║ ███╗█████╗ ║
║ ██║╚██╔╝██║██╔══██╗██╔══██╗██║██║ ██║██║ ██║██╔══╝ ║
║ ██║ ╚═╝ ██║██████╔╝██║ ██║██║██████╔╝╚██████╔╝███████╗ ║
║ ╚═╝ ╚═╝╚═════╝ ╚═╝ ╚═╝╚═╝╚═════╝ ╚═════╝ ╚══════╝ ║
║ ║
║ H O W T O : M B R I D G E T E S T I N G ║
╚══════════════════════════════════════════════════════════════════════╝

MBridge unit tests run automatically on every PR. To also trigger
functional tests, attach the label and re-run the workflow step.

┌─────────────────────────────────────────────────────────────────┐
│ DEFAULT │ Unit tests run on every PR (no action needed) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Every PR ──► cicd-mbridge-testing ──► unit tests only │
│ │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ STEP 1 │ Attach the label to your PR (for functional tests) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ PR Labels ──► [ + Add label ] ──► "Run MBridge tests" │
│ │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ STEP 2 │ Re-run this workflow step │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Actions ──► [ Re-run jobs ] ──► Re-run failed jobs │
│ │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ RESULT │ Unit + functional tests run! │
├─────────────────────────────────────────────────────────────────┤
│ │
│ cicd-mbridge-testing ◄── unit + functional tests │
│ │
│ Tests run against MBridge using the merge commit │
│ SHA of your pull request. │
│ │
└─────────────────────────────────────────────────────────────────┘

┌────────────────────────────────────┐
│ Label present? NO → unit │
│ Label present? YES → unit + │
│ functional│
└────────────────────────────────────┘

NOTE: The label must be present BEFORE the re-run is triggered.
The CI checks for "Run MBridge tests" at runtime.

NOTE: All MBridge test results are optional — failures do not
block merging your PR.
EOF
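Review note: the banner describes a simple rule, where unit tests always run and functional tests run only if the "Run MBridge tests" label is present when the workflow executes. A minimal sketch of that rule, for illustration only. The function name `mbridge_test_suites` and the list-of-labels input are assumptions; the actual workflow logic is not part of this file.

```python
# Label text as quoted in the banner above.
MBRIDGE_LABEL = "Run MBridge tests"


def mbridge_test_suites(pr_labels):
    """Return the suites to run, given the PR's labels at workflow run time."""
    suites = ["unit"]  # unit tests run on every PR
    if MBRIDGE_LABEL in pr_labels:
        suites.append("functional")  # functional tests are opt-in via the label
    return suites


print(mbridge_test_suites(["bug"]))
print(mbridge_test_suites(["bug", "Run MBridge tests"]))
```

Because the label is read at run time, adding it after a run has started has no effect until the job is re-run, which is why the banner asks for the label to be attached first.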
81 changes: 69 additions & 12 deletions .github/scripts/sync_team_usergroups.py
@@ -29,7 +29,12 @@

 # Constants
 GITHUB_API_URL = "https://api.github.com"
-PARENT_TEAM_SLUG = "mcore-reviewers"
+
+# Teams whose *children* are each synced to their own Slack usergroup
+PARENT_TEAM_SLUGS = ["mcore-reviewers"]
+
+# Teams synced directly (the team itself, not its children)
+DIRECT_TEAM_SLUGS = ["mcore-engineers"]

 # Caches for email and Slack lookups
 _email_cache = {}
@@ -83,6 +88,8 @@ def github_team_to_slack_usergroup(team_slug):
         name = name[5:]  # Remove "core-"
     elif name.startswith("megatron-"):
         name = name[9:]  # Remove "megatron-"
+    elif name.startswith("mcore-"):
+        name = name[6:]  # Remove "mcore-"

     # Remove "-and-"
     name = name.replace("-and-", "-")
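Review note: the new `elif` branch extends the prefix-stripping chain so that `mcore-` slugs are shortened like `core-` and `megatron-` slugs. A minimal sketch of only the steps visible in this hunk. The function name `strip_team_prefix` is hypothetical, the first branch is inferred from the `# Remove "core-"` comment, and the real `github_team_to_slack_usergroup` may apply further steps before or after these.

```python
def strip_team_prefix(team_slug):
    """Apply the prefix stripping and "-and-" collapsing shown in the hunk."""
    name = team_slug
    # Only the first matching prefix is removed, because the branches are chained.
    if name.startswith("core-"):
        name = name[5:]
    elif name.startswith("megatron-"):
        name = name[9:]
    elif name.startswith("mcore-"):
        name = name[6:]
    # Collapse "-and-" into a single hyphen.
    return name.replace("-and-", "-")


print(strip_team_prefix("mcore-engineers"))
print(strip_team_prefix("megatron-a-and-b"))
```

Note that `"mcore-"` does not start with `"core-"`, so the new branch is reachable even though it comes last in the chain.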
@@ -437,13 +444,13 @@ def sync_team_to_usergroup(team_slug, usergroup_handle, dry_run=False):
         return False


-def get_team_to_usergroup_mapping():
-    """Fetch child teams of mcore-reviewers and generate the mapping."""
+def get_team_to_usergroup_mapping(parent_team_slug):
+    """Fetch child teams of a parent team and generate the mapping."""
     org = get_org()
-    child_teams = get_child_teams(org, PARENT_TEAM_SLUG)
+    child_teams = get_child_teams(org, parent_team_slug)

     if not child_teams:
-        print(f"Error: No child teams found under '{PARENT_TEAM_SLUG}'")
+        print(f"Error: No child teams found under '{parent_team_slug}'")
         return {}

     mapping = {}
@@ -454,10 +461,30 @@ def get_team_to_usergroup_mapping():
     return mapping


-def sync_all_teams(dry_run=False):
-    """Sync all GitHub teams under mcore-reviewers to their Slack usergroups."""
-    print(f"Fetching child teams of '{PARENT_TEAM_SLUG}'...")
-    team_to_usergroup = get_team_to_usergroup_mapping()
+def sync_all_teams(dry_run=False, parent_teams=None, direct_teams=None):
+    """Sync GitHub teams to their Slack usergroups.
+
+    Args:
+        parent_teams: List of team slugs whose *children* are each synced.
+            Defaults to PARENT_TEAM_SLUGS.
+        direct_teams: List of team slugs synced directly (not their children).
+            Defaults to DIRECT_TEAM_SLUGS.
+    """
+    if parent_teams is None:
+        parent_teams = PARENT_TEAM_SLUGS
+    if direct_teams is None:
+        direct_teams = DIRECT_TEAM_SLUGS
+
+    team_to_usergroup = {}
+
+    for parent_slug in parent_teams:
+        print(f"Fetching child teams of '{parent_slug}'...")
+        mapping = get_team_to_usergroup_mapping(parent_slug)
+        team_to_usergroup.update(mapping)
+
+    for team_slug in direct_teams:
+        usergroup_handle = github_team_to_slack_usergroup(team_slug)
+        team_to_usergroup[team_slug] = usergroup_handle

     if not team_to_usergroup:
         return False
@@ -504,12 +531,40 @@ def main():
         action="store_true",
         help="List all configured team-to-usergroup mappings",
     )
+    parser.add_argument(
+        "--parent-team",
+        action="append",
+        dest="parent_teams",
+        metavar="SLUG",
+        help=(
+            "Sync all children of this GitHub team (can be repeated). "
+            f"Defaults to: {PARENT_TEAM_SLUGS}"
+        ),
+    )
+    parser.add_argument(
+        "--team",
+        action="append",
+        dest="direct_teams",
+        metavar="SLUG",
+        help=(
+            "Sync this GitHub team directly (can be repeated). "
+            f"Defaults to: {DIRECT_TEAM_SLUGS}"
+        ),
+    )

     args = parser.parse_args()

+    # Use CLI values when provided, otherwise fall back to module-level defaults
+    parent_teams = args.parent_teams if args.parent_teams is not None else PARENT_TEAM_SLUGS
+    direct_teams = args.direct_teams if args.direct_teams is not None else DIRECT_TEAM_SLUGS
+
     if args.list:
-        print(f"Fetching child teams of '{PARENT_TEAM_SLUG}'...")
-        team_to_usergroup = get_team_to_usergroup_mapping()
+        team_to_usergroup = {}
+        for parent_slug in parent_teams:
+            print(f"Fetching child teams of '{parent_slug}'...")
+            team_to_usergroup.update(get_team_to_usergroup_mapping(parent_slug))
+        for team_slug in direct_teams:
+            team_to_usergroup[team_slug] = github_team_to_slack_usergroup(team_slug)
         if not team_to_usergroup:
             sys.exit(1)
         print("\nTeam-to-usergroup mappings:")
@@ -519,7 +574,9 @@ def main():
             print(f"{team:<35} @{usergroup:<29}")
         return

-    success = sync_all_teams(dry_run=args.dry_run)
+    success = sync_all_teams(
+        dry_run=args.dry_run, parent_teams=parent_teams, direct_teams=direct_teams
+    )
     sys.exit(0 if success else 1)
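Review note: the two new flags use `action="append"` with no explicit default, so an omitted flag yields `None` and the module-level list is used instead. A minimal, self-contained sketch of that fallback, reusing the flag names and constants from the diff. The wrapper `resolve_teams` and its `argv` parameter are hypothetical, added only so the behaviour can be exercised without a real command line.

```python
import argparse

PARENT_TEAM_SLUGS = ["mcore-reviewers"]
DIRECT_TEAM_SLUGS = ["mcore-engineers"]


def resolve_teams(argv):
    """Parse the two repeatable flags and fall back to the module defaults."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--parent-team", action="append", dest="parent_teams", metavar="SLUG")
    parser.add_argument("--team", action="append", dest="direct_teams", metavar="SLUG")
    args = parser.parse_args(argv)

    # With action="append" and no default, an absent flag is None, not [].
    parent = args.parent_teams if args.parent_teams is not None else PARENT_TEAM_SLUGS
    direct = args.direct_teams if args.direct_teams is not None else DIRECT_TEAM_SLUGS
    return parent, direct


print(resolve_teams([]))
print(resolve_teams(["--team", "team-a", "--team", "team-b"]))
```

One consequence of this design is that passing `--team` replaces only the direct-team defaults. The children of the default parent team are still synced unless `--parent-team` is also given.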


5 changes: 1 addition & 4 deletions .github/workflows/_build_test_publish_wheel.yml
@@ -17,8 +17,6 @@ on:
         type: boolean
         default: true
     secrets:
-      TWINE_USERNAME:
-        required: true
       TWINE_PASSWORD:
         required: true

@@ -147,7 +145,6 @@ jobs:
     needs: [build-and-test-wheels]
     runs-on: ubuntu-latest
     if: inputs.no-publish == false
-    environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.ref, 'refs/heads/r')) && 'main' || 'public' }}
     strategy:
       fail-fast: false
       matrix:
@@ -170,7 +167,7 @@ jobs:

       - name: Publish wheels
         env:
-          TWINE_USERNAME: ${{ secrets.TWINE_USERNAME }}
+          TWINE_USERNAME: __token__
           TWINE_PASSWORD: ${{ secrets.TWINE_PASSWORD }}
           TWINE_REPOSITORY: ${{ (github.ref == 'refs/heads/main' || startsWith(github.ref, 'refs/heads/r')) && 'pypi' || 'testpypi' }}
           PLATFORM: ${{ matrix.PLATFORM }}
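Review note: the `TWINE_REPOSITORY` expression left in place by this change publishes to `pypi` from `main` and from any branch whose name starts with `r`, and to `testpypi` otherwise. A minimal Python sketch of that selection, for illustration. The function name `twine_repository` and the example branch names are hypothetical. The lower-casing mirrors GitHub Actions expressions, where string comparison and `startsWith` ignore case.

```python
def twine_repository(ref):
    """Mirror the workflow expression that picks the upload target for a git ref."""
    ref = ref.lower()  # GitHub expression comparisons are case-insensitive
    if ref == "refs/heads/main" or ref.startswith("refs/heads/r"):
        return "pypi"
    return "testpypi"


print(twine_repository("refs/heads/main"))
print(twine_repository("refs/heads/dev"))
```

The prefix test is broad: a branch such as `refs/heads/refactor-x` also selects `pypi`, so branch naming matters for where wheels are published.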