
feat(losses): add optional fused_lncc LNCC backend (loss_type='fused_lncc') #104

Open
minsuk00 wants to merge 1 commit into rohitrango:main from minsuk00:fused-lncc-backend


@minsuk00


Add an optional fused_lncc LNCC backend

Wires up fused_lncc as an optional LNCC backend,
selectable via loss_type='fused_lncc'. fused_lncc is a standalone fused rectangular-LNCC CUDA kernel;
this PR plugs it in as an opt-in dependency, and the dispatcher falls back to cc when the package is
not installed, so existing behavior is unchanged.

Changes

  • fireants/losses/fused_lncc_backend.py adds FusedLNCCLoss, a thin adapter matching the
    FusedLocalNormalizedCrossCorrelationLoss interface (constructor args, forward, and the
    multiscale hooks). Calls the kernel; no new math.
  • fireants/registration/abstract.py — a loss_type == 'fused_lncc' branch mirroring the
    fusedcc branch (optional import + fallback to cc).
  • pyproject.toml — declares fused_lncc as an optional dependency. (As a CUDA extension it must be
    installed with pip install fused_lncc --no-build-isolation against a matching PyTorch; see its README.)
  • tests/test_fused_lncc_backend.py — auto-skips without CUDA or the package.
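
For reviewers, the optional-import-plus-fallback pattern described above can be sketched roughly as follows. This is a minimal illustration and not the code in fireants/registration/abstract.py: the helper name resolve_loss_backend and its module_name parameter are hypothetical, and the warning text is an assumption.

```python
import importlib
import warnings


def resolve_loss_backend(loss_type, module_name="fused_lncc"):
    """Return the loss_type to actually use (hypothetical helper).

    Only the 'fused_lncc' request is special-cased: if the optional
    package cannot be imported, fall back to the plain 'cc' path.
    """
    if loss_type != "fused_lncc":
        return loss_type
    try:
        # The optional dependency is imported lazily, so users who never
        # ask for fused_lncc are unaffected when it is not installed.
        importlib.import_module(module_name)
    except ImportError:
        warnings.warn(
            f"{module_name} is not installed; falling back to loss_type='cc'."
        )
        return "cc"
    return "fused_lncc"
```

With this shape, every other loss_type passes through untouched, which is what keeps existing behavior unchanged.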

Behavior

Returns -mean(ncc), the same convention as fusedcc. Uses the exact gradient (matches
FusedLocalNormalizedCrossCorrelationLoss(use_ants_gradient=False)), rescaled so the same optimizer
hyperparameters converge identically. Scope is deliberately narrow: rectangular only, k ∈ {3,5,7,9},
mean reduction, pred-only gradient, single GPU. Other configurations (gaussian, masking, sum/none
reduction, symmetric/SyN gradients, grid-parallel sharding) raise a clear error pointing to fusedcc.
Symmetric (dual-image / SyN) gradient support is a straightforward follow-up: it runs the kernel a
second time with pred and target swapped (no CUDA change), at roughly 2x the backward cost. Left out
of this PR for now.
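
The scope guards described above might look roughly like this. It is a sketch under assumptions, not the adapter's actual code: the function name, every parameter name, and the choice of ValueError are hypothetical; only the supported and unsupported configurations come from this description.

```python
SUPPORTED_KERNEL_SIZES = (3, 5, 7, 9)


def check_fused_lncc_scope(
    kernel_type="rectangular",
    kernel_size=3,
    reduction="mean",
    masked=False,
    symmetric=False,
    sharded=False,
):
    """Raise one clear error listing every unsupported option (hypothetical)."""
    problems = []
    if kernel_type != "rectangular":
        problems.append(f"kernel_type={kernel_type!r} (rectangular only)")
    if kernel_size not in SUPPORTED_KERNEL_SIZES:
        problems.append(f"kernel_size={kernel_size} (must be one of 3, 5, 7, 9)")
    if reduction != "mean":
        problems.append(f"reduction={reduction!r} (mean only)")
    if masked:
        problems.append("masking")
    if symmetric:
        problems.append("symmetric/SyN gradients (pred-only gradient)")
    if sharded:
        problems.append("grid-parallel sharding (single GPU only)")
    if problems:
        raise ValueError(
            "fused_lncc does not support: "
            + "; ".join(problems)
            + ". Use loss_type='fusedcc' instead."
        )
```

Collecting all problems before raising means a user with several unsupported options sees them in a single message.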

Performance (A40)

The loss kernel is ~3.4× faster / ~3× lighter than fusedcc. End-to-end registration is ~1.1–1.3×
faster per iteration than the fusedcc path (the loss is ~38% of a fused step) and ~2.7× faster than
the non-fused cc path, at equal VRAM and equal registration quality.
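
As a rough sanity check on these figures, Amdahl's law gives an idealized ceiling for the per-iteration gain when only the loss portion of a step gets faster. The sketch below assumes the 38% share is measured on a fusedcc step and that the rest of the step is unchanged; the function name is illustrative.

```python
def end_to_end_speedup(loss_fraction, kernel_speedup):
    """Amdahl's law: only the loss share of a step is accelerated."""
    return 1.0 / ((1.0 - loss_fraction) + loss_fraction / kernel_speedup)


# Figures quoted in this PR: loss is ~38% of a fused step, kernel is ~3.4x faster.
estimate = end_to_end_speedup(loss_fraction=0.38, kernel_speedup=3.4)
print(f"idealized per-iteration speedup: {estimate:.2f}x")
```

This comes out to roughly 1.37×, so the measured ~1.1–1.3× sits just below the idealized bound, as one would expect once overheads outside the loss are included.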

Tests

pytest tests/test_fused_lncc_backend.py passes on an A40: forward and gradient parity vs fusedcc
(k=3/5/7/9), sign/range, batched gradient scale, gradient routing, scope guards, multiscale hooks,
and an end-to-end registration through the dispatcher.
