stabgan/awesome-loss-functions

Awesome Loss Functions

License: CC0-1.0 · PRs Welcome

A comprehensive, chronologically ordered collection of loss functions across all subdomains of deep learning and machine learning — with paper links, one-line descriptions, mathematical formulations, and implementation references.

350+ loss functions. 25+ categories. Every subdomain of AI.

If this resource helps your research or engineering work, please consider giving it a ⭐


What's New

  • 🔊 Audio, Music & Speech Generation — WaveNet to Stable Audio, 19 losses
  • 🎬 Video Generation & Understanding — VGAN to VideoPoet, 20 losses
  • ⏳ Time Series Forecasting — Pinball Loss to TimesFM, 23 losses
  • 🧠 Continual & Lifelong Learning — EWC to EASE, 18 methods
  • ⚖️ Calibration, Fairness & Bias Mitigation — Brier Score to Group DRO, 18 losses
  • 🛡️ Adversarial Robustness & OOD Detection — FGSM-AT to CIDER, 22 losses
  • 🔍 Anomaly Detection & Multi-Modal Learning — Deep SVDD to ImageBind, 17 losses
  • 🖼️ Image-to-Image Translation — Total Variation to DoveNet, 16 losses
  • 📝 Semi-Supervised Learning — Pseudo-Label to SoftMatch, 12 losses
  • 🎯 Optical Flow, Video & Pose — Horn-Schunck to SEA-RAFT, 33 losses

Contents

Core Categories (inline)

Extended Categories (separate files)

Resources


🧭 Loss Selection Guide

Not sure which loss to use? Here's a quick decision framework:

| Task | Default Choice | Class Imbalance | Noisy Labels | Need Calibration |
|---|---|---|---|---|
| Binary Classification | BCE | Focal Loss | SCE / GCE | Focal + Temp. Scaling |
| Multi-class Classification | Cross-Entropy | Class-Balanced CE | Label Smoothing | Label Smoothing |
| Semantic Segmentation | CE + Dice | Focal Tversky | — | — |
| Object Detection (box) | Smooth L1 + Focal | Focal Loss | — | — |
| Object Detection (IoU) | CIoU / GIoU | — | — | — |
| Image Generation (GAN) | Hinge / Non-Saturating | — | — | — |
| Image Generation (Diffusion) | DDPM (ε-prediction) | — | — | — |
| Super-Resolution | L1 + Perceptual + GAN | — | — | — |
| Self-Supervised (vision) | InfoNCE / DINO | — | — | — |
| Face Recognition | ArcFace / AdaFace | Sub-center ArcFace | ElasticFace | — |
| Language Modeling | Cross-Entropy (NTP) | — | — | — |
| LLM Alignment | DPO / SimPO | — | — | — |
| Speech Recognition | CTC / RNN-T | — | — | — |
| RL (value-based) | DQN / Double DQN | — | — | — |
| RL (policy-based) | PPO | — | — | — |
| Regression | MSE / Huber | — | Huber | NLL w/ variance |
| Metric Learning | Triplet / Proxy Anchor | — | — | — |
| Medical Segmentation | Dice + Boundary | Tversky / Focal Tversky | — | — |
| 3D Reconstruction | Chamfer + Normal | — | — | — |
| Depth Estimation | Scale-Invariant | — | — | — |
| Time Series | MSE / Quantile | — | Huber | CRPS |
| Continual Learning | EWC / DER++ | — | — | — |
| Fairness | Group DRO | — | — | — |

πŸ“ Key Mathematical Formulations

Cross-Entropy Loss

$$ \mathcal{L}_{CE} = -\sum_{c=1}^{C} y_c \log(\hat{y}_c) $$

Binary Cross-Entropy

$$ \mathcal{L}_{BCE} = -[y \log(\hat{y}) + (1-y) \log(1-\hat{y})] $$

Focal Loss

$$ \mathcal{L}_{FL} = -\alpha_t (1 - p_t)^\gamma \log(p_t) $$
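
A minimal NumPy sketch of the binary form, for illustration only (production code would typically work in logit space, e.g. on top of `torch.nn.functional.binary_cross_entropy_with_logits`):

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Binary focal loss. p: predicted probability of class 1, y: 0/1 label."""
    p_t = np.where(y == 1, p, 1.0 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + eps)
```

With gamma = 0 and alpha = 0.5 this reduces to 0.5 × BCE; increasing gamma shrinks the loss on well-classified examples much faster than on hard ones.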

Dice Loss

$$ \mathcal{L}_{Dice} = 1 - \frac{2 \sum_i p_i g_i}{\sum_i p_i + \sum_i g_i} $$
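
The soft Dice loss can be sketched directly from the formula (NumPy, for illustration; the smoothing term is a common stabilizer against empty masks, not part of the formula above):

```python
import numpy as np

def dice_loss(p, g, smooth=1.0):
    """Soft Dice loss. p: predicted probabilities, g: binary ground truth."""
    inter = np.sum(p * g)
    return 1.0 - (2.0 * inter + smooth) / (np.sum(p) + np.sum(g) + smooth)
```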

Triplet Loss

$$ \mathcal{L}_{Triplet} = \max(0, \|f_a - f_p\|_2^2 - \|f_a - f_n\|_2^2 + \alpha) $$

InfoNCE / Contrastive Loss

$$ \mathcal{L}_{InfoNCE} = -\log \frac{\exp(\text{sim}(z_i, z_j) / \tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\text{sim}(z_i, z_k) / \tau)} $$
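
A single-anchor NumPy sketch of InfoNCE (batched implementations compute this over all 2N augmented views; here simplified to one positive and a stack of negatives):

```python
import numpy as np

def info_nce(z_i, z_pos, z_negs, tau=0.1):
    """InfoNCE for one anchor: z_pos is the positive view, z_negs the negatives.
    All vectors are L2-normalized so the dot products are cosine similarities."""
    def norm(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    z_i, z_pos, z_negs = norm(z_i), norm(z_pos), norm(z_negs)
    logits = np.concatenate([[z_i @ z_pos], z_negs @ z_i]) / tau
    logits -= logits.max()                          # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())
```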

KL Divergence

$$ D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} $$

DDPM Loss (simplified)

$$ \mathcal{L}_{DDPM} = \mathbb{E}_{t, x_0, \epsilon} \left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right] $$

DPO Loss

$$ \mathcal{L}_{DPO} = -\log \sigma \left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)} \right) $$
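
Given summed token log-probabilities for the chosen (w) and rejected (l) responses, the DPO objective is a few lines (illustrative NumPy sketch, not the huggingface/trl implementation):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss from summed token log-probs of the chosen (w) and rejected (l)
    responses under the policy and the frozen reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))   # -log sigmoid(margin)
```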

IoU Loss

$$ \mathcal{L}_{IoU} = 1 - \frac{|B_p \cap B_{gt}|}{|B_p \cup B_{gt}|} $$

ArcFace Loss

$$ \mathcal{L}_{ArcFace} = -\log \frac{e^{s \cos(\theta_{y_i} + m)}}{e^{s \cos(\theta_{y_i} + m)} + \sum_{j \neq y_i} e^{s \cos \theta_j}} $$

Wasserstein Distance (WGAN)

$$ \mathcal{L}_{WGAN} = \mathbb{E}_{x \sim p_{data}}[D(x)] - \mathbb{E}_{z \sim p_z}[D(G(z))] $$


Classification

0/1 Loss (1950) — The theoretical misclassification indicator; 1 if prediction ≠ label, 0 otherwise. Non-differentiable, foundational to learning theory. 📄 Statistical Decision Functions — Wald, A.

Cross-Entropy Loss / Log Loss / Negative Log-Likelihood (1948) — Measures divergence between predicted probability distribution and true labels; the default loss for multi-class classification. 📄 A Mathematical Theory of Communication — Shannon, C.E. 💻 torch.nn.CrossEntropyLoss

Binary Cross-Entropy (1958) — Cross-entropy specialized for two-class or multi-label problems; operates on each output independently. 📄 Derived from logistic regression — Cox, D.R. (1958) 💻 torch.nn.BCEWithLogitsLoss

Hinge Loss / SVM Loss (1995) — Maximizes the margin between classes; the core loss behind Support Vector Machines. 📄 Support-Vector Networks — Cortes, C. & Vapnik, V. 💻 torch.nn.MultiMarginLoss

Knowledge Distillation Loss / Soft Cross-Entropy (2015) — Trains a student network to mimic a teacher by matching softened output distributions. 📄 Distilling the Knowledge in a Neural Network — Hinton, G., Vinyals, O. & Dean, J. 💻 torch.nn.KLDivLoss

Large-Margin Softmax Loss (L-Softmax) (2016) — Introduces angular margin constraints into softmax for intra-class compactness and inter-class separability. 📄 Large-Margin Softmax Loss for Convolutional Neural Networks — Liu, W., Wen, Y., Yu, Z. & Yang, M. 💻 wy1iu/LargeMargin_Softmax_Loss

Center Loss (2016) — Penalizes distance of features from learned class centers, improving discriminative feature learning. 📄 A Discriminative Feature Learning Approach for Deep Face Recognition — Wen, Y., Zhang, K., Li, Z. & Qiao, Y. 💻 KaiyangZhou/pytorch-center-loss

Label Smoothing (2016) — Replaces hard one-hot targets with soft targets, preventing overconfident predictions and improving generalization. 📄 Rethinking the Inception Architecture for Computer Vision — Szegedy, C. et al. 💻 torch.nn.CrossEntropyLoss(label_smoothing=...)

Sparsemax Loss (2016) — Sparse alternative to softmax that assigns exactly zero probability to irrelevant classes. 📄 From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification — Martins, A.F.T. & Astudillo, R.F. 💻 deep-spin/entmax

Focal Loss (2017) — Down-weights well-classified examples to focus training on hard negatives; designed for extreme class imbalance. 📄 Focal Loss for Dense Object Detection — Lin, T.-Y., Goyal, P., Girshick, R., He, K. & Dollár, P. 💻 AdeelH/pytorch-multi-class-focal-loss

Generalized Cross-Entropy (GCE) (2018) — Noise-robust loss interpolating between MAE and cross-entropy via a tunable parameter q. 📄 Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels — Zhang, Z. & Sabuncu, M.R. 💻 AlanChou/Truncated-Loss

Complement Objective Training (COT) (2019) — Augments cross-entropy with a complement objective that neutralizes non-target class probabilities. 📄 Complement Objective Training — Chen, H.-Y. et al. 💻 henry8527/COT

Class-Balanced Loss (2019) — Re-weights loss by the effective number of samples per class for long-tailed distributions. 📄 Class-Balanced Loss Based on Effective Number of Samples — Cui, Y. et al. 💻 vandit15/Class-balanced-loss-pytorch

Symmetric Cross-Entropy (SCE) (2019) — Combines standard CE with reverse CE for robustness to label noise. 📄 Symmetric Cross Entropy for Robust Learning with Noisy Labels — Wang, Y. et al.

Bi-Tempered Logistic Loss (2019) — Two temperature parameters bound the loss (handling mislabeled data) and produce a heavy-tailed softmax (handling outliers). 📄 Robust Bi-Tempered Logistic Loss Based on Bregman Divergences — Amid, E. et al. 💻 google/bi-tempered-loss

Taylor Cross-Entropy Loss (2020) — Taylor series expansion of CE creating a noise-robust loss. 📄 Can Cross Entropy Loss Be Robust to Label Noise? — Feng, L. et al.

Asymmetric Loss (ASL) (2021) — Different focusing levels for positive and negative samples in multi-label classification. 📄 Asymmetric Loss For Multi-Label Classification — Ben-Baruch, E. et al. 💻 Alibaba-MIIL/ASL

Poly Loss (2022) — Views loss functions as polynomial expansions and adjusts leading coefficients; generalizes CE and focal loss. 📄 PolyLoss: A Polynomial Expansion Perspective of Classification Loss Functions — Leng, Z. et al. 💻 abhuse/polyloss-pytorch
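
Several entries above are one-line modifications of cross-entropy. As an example, label smoothing can be sketched in NumPy (PyTorch exposes the same behavior via `CrossEntropyLoss(label_smoothing=...)`; note that some variants spread the smoothing mass over the C−1 non-target classes rather than over all C classes as here):

```python
import numpy as np

def smoothed_ce(logits, target, smoothing=0.1):
    """Cross-entropy against a smoothed target distribution:
    (1 - smoothing) on the true class, smoothing / C spread uniformly."""
    c = logits.shape[-1]
    logp = logits - logits.max()                    # stable log-softmax
    logp = logp - np.log(np.exp(logp).sum())
    soft = np.full(c, smoothing / c)
    soft[target] += 1.0 - smoothing
    return -(soft * logp).sum()
```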

Regression

Mean Absolute Error (MAE) / L1 Loss (~1757) — Penalizes absolute differences; robust to outliers but non-smooth gradient at zero. 📄 Attributed to Boscovich, R.J. (1757) 💻 torch.nn.L1Loss

Mean Squared Error (MSE) / L2 Loss (~1805) — Penalizes squared differences; sensitive to outliers. The method of least squares. 📄 Legendre, A.-M. (1805); Gauss, C.F. (1809) 💻 torch.nn.MSELoss

Huber Loss (1964) — MSE for small errors, MAE for large errors. Robust to outliers with smooth gradients near zero. 📄 Robust Estimation of a Location Parameter — Huber, P.J. 💻 torch.nn.HuberLoss

Tukey's Biweight Loss (1974) — Redescending M-estimator that completely rejects gross outliers beyond a threshold. 📄 The Fitting of Power Series, Meaning Polynomials, Illustrated on Band-Spectroscopic Data — Beaton, A.E. & Tukey, J.W.

Quantile Loss / Pinball Loss (1978) — Asymmetrically penalizes over/under-predictions for quantile regression and uncertainty estimation. 📄 Regression Quantiles — Koenker, R. & Bassett, G.

Smooth L1 Loss (2015) — L2 for small errors, L1 for large errors (Huber with δ=1); standard for bounding box regression. 📄 Fast R-CNN — Girshick, R. 💻 torch.nn.SmoothL1Loss

Wing Loss (2018) — Amplifies small-to-medium range errors for facial landmark localization. 📄 Wing Loss for Robust Facial Landmark Localisation with Convolutional Neural Networks — Feng, Z.-H. et al.

Balanced L1 Loss (2019) — Rebalances inlier vs. outlier loss contributions in object detection regression. 📄 Libra R-CNN: Towards Balanced Learning for Object Detection — Pang, J. et al. 💻 OceanPang/Libra_R-CNN

Adaptive Wing Loss (2019) — Adapts curvature based on ground truth heatmap values for face alignment. 📄 Adaptive Wing Loss for Robust Face Alignment via Heatmap Regression — Wang, X. et al. 💻 protossw512/AdaptiveWingLoss

Log-Cosh Loss (2022) — Approximates Huber loss using log(cosh(x)); twice differentiable everywhere. 📄 Statistical Properties of the Log-Cosh Loss Function Used in Machine Learning — Chen, K. et al.
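
The robust regression losses above trade the quadratic and absolute regimes off against each other; Huber is the canonical example (NumPy sketch):

```python
import numpy as np

def huber(r, delta=1.0):
    """Huber loss on residuals r: quadratic inside |r| <= delta, linear outside.
    The two branches meet with matching value and slope at |r| = delta."""
    small = np.abs(r) <= delta
    return np.where(small, 0.5 * r**2, delta * (np.abs(r) - 0.5 * delta))
```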

Segmentation

Sensitivity-Specificity Loss (2015) — Weighted combination of sensitivity and specificity for extreme class imbalance in lesion segmentation. 📄 Deep Convolutional Encoder Networks for Multiple Sclerosis Lesion Segmentation — Brosch et al.

Dice Loss (2016) — Directly optimizes the Dice coefficient (F1 score); robust to class imbalance. 📄 V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation — Milletari, F. et al. 💻 JunMa11/SegLossOdyssey

Generalized Dice Loss (GDL) (2017) — Per-class volume weighting for multi-class segmentation with highly imbalanced labels. 📄 Generalised Dice Overlap as a Deep Learning Loss Function for Highly Unbalanced Segmentations — Sudre, C.H. et al.

Tversky Loss (2017) — Tunable α/β parameters controlling the FP/FN trade-off; useful for small lesion segmentation. 📄 Tversky Loss Function for Image Segmentation Using 3D Fully Convolutional Deep Networks — Salehi, S.S.M. et al.

Lovász-Softmax Loss (2018) — Tractable convex surrogate for directly optimizing the Jaccard index (IoU). 📄 The Lovász-Softmax Loss: A Tractable Surrogate for the Optimization of the Intersection-Over-Union Measure — Berman, M. et al. 💻 bermanmaxim/LovaszSoftmax

Exponential Logarithmic Loss (2018) — Combines exponentially weighted focal-style Dice and CE for very small structures. 📄 3D Segmentation with Exponential Logarithmic Loss for Highly Unbalanced Object Sizes — Wong et al.

Asymmetric Similarity Loss (2018) — Asymmetric Fβ-score-based similarity to balance precision and recall. 📄 Asymmetric Loss Functions and Deep Densely Connected Networks for Highly Imbalanced Medical Image Segmentation — Hashemi et al.

Focal Tversky Loss (2019) — Focal-style exponent on Tversky loss to focus on hard, misclassified regions. 📄 A Novel Focal Tversky Loss Function with Improved Attention U-Net for Lesion Segmentation — Abraham, N. & Khan, N.M.

Boundary Loss (2019) — Distance metric on contour space rather than region overlap; effective for highly unbalanced tasks. 📄 Boundary Loss for Highly Unbalanced Segmentation — Kervadec, H. et al. 💻 LIVIAETS/boundary-loss

Hausdorff Distance Loss (2019) — Directly optimizes the Hausdorff distance between predicted and ground-truth boundaries. 📄 Reducing the Hausdorff Distance in Medical Image Segmentation with Convolutional Neural Networks — Karimi, D. & Salcudean, S.E.

Combo Loss (2019) — Weighted combination of modified CE and Dice loss for input and output class imbalance. 📄 Combo Loss: Handling Input and Output Imbalance in Multi-Organ Segmentation — Taghanaki et al.

Region Mutual Information (RMI) Loss (2019) — Maximizes mutual information between predicted and ground-truth label regions. 📄 Region Mutual Information Loss for Semantic Segmentation — Zhao et al. 💻 ZJULearning/RMI

Topological Loss (2019) — Uses persistent homology to enforce correct topological structure in segmentation. 📄 Topology-Preserving Deep Image Segmentation — Hu et al. 💻 HuXiaoling/TopoLoss

Log-Cosh Dice Loss (2020) — Log-cosh smoothing on Dice loss for smoother gradients and stable training. 📄 A Survey of Loss Functions for Semantic Segmentation — Jadon, S.

clDice (2021) — Topology-preserving loss for tubular structures; computes Dice on skeletonized centerlines. 📄 clDice — A Novel Topology-Preserving Loss Function for Tubular Structure Segmentation — Shit et al. 💻 jocpae/clDice

Unified Focal Loss (2022) — Hierarchical framework generalizing Dice-based and CE-based losses with focal modulation. 📄 Unified Focal Loss: Generalising Dice and Cross Entropy-Based Losses to Handle Class Imbalanced Medical Image Segmentation — Yeung et al. 💻 mlyg/unified-focal-loss
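
The Tversky family above generalizes Dice by weighting false positives and false negatives separately; a NumPy sketch (smoothing term omitted so the Dice equivalence at α = β = 0.5 is exact):

```python
import numpy as np

def tversky_loss(p, g, alpha=0.5, beta=0.5):
    """Tversky loss; alpha weights false positives, beta false negatives.
    alpha = beta = 0.5 recovers the soft Dice loss (without smoothing)."""
    tp = np.sum(p * g)
    fp = np.sum(p * (1.0 - g))
    fn = np.sum((1.0 - p) * g)
    return 1.0 - tp / (tp + alpha * fp + beta * fn)
```

Raising beta above alpha penalizes missed foreground (false negatives) more, which is why Focal Tversky variants are popular for small-lesion segmentation.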

Object Detection (Bounding Box)

Smooth L1 Loss (2015) — Piecewise L2/L1 loss; standard for bounding box regression. 📄 Fast R-CNN — Girshick, R.

IoU Loss (2016) — Directly regresses Intersection-over-Union between predicted and ground-truth boxes. 📄 UnitBox: An Advanced Object Detection Network — Yu et al.

Focal Loss (2017) — Modulating factor (1−pₜ)^γ down-weights easy negatives in dense detection. 📄 Focal Loss for Dense Object Detection — Lin, T.-Y. et al. 💻 facebookresearch/detectron2

Bounded IoU Loss (2018) — Upper-bounds IoU change per coordinate for stable high-IoU refinement. 📄 Improving Object Localization with Fitness NMS and Bounded IoU Loss — Tychsen-Smith & Petersson

GIoU Loss (2019) — Extends IoU with a penalty based on the smallest enclosing box; enables gradient flow for non-overlapping boxes. 📄 Generalized Intersection over Union — Rezatofighi et al.

DIoU Loss (2020) — Adds normalized center-point distance penalty to IoU for faster convergence. 📄 Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression — Zheng et al. 💻 Zzh-tju/DIoU

CIoU Loss (2020) — Extends DIoU with an aspect ratio consistency penalty for complete geometric alignment. 📄 Distance-IoU Loss — Zheng et al.

Alpha-IoU Loss (2021) — Power parameter α amplifies loss and gradient for high-quality anchors. 📄 Alpha-IoU: A Family of Power Intersection over Union Losses — He et al. 💻 Jacobi93/Alpha-IoU

EIoU Loss (2022) — Decomposes the CIoU penalty into separate width/height terms. 📄 Focal and Efficient IOU Loss for Accurate Bounding Box Regression — Zhang et al.

SIoU Loss (2022) — Angle-aware penalty considering vector direction between predicted and target boxes. 📄 SIoU Loss: More Powerful Learning for Bounding Box Regression — Gevorgyan

WIoU Loss (2023) — Dynamic non-monotonic focusing mechanism based on outlier degree. 📄 Wise-IoU: Bounding Box Regression Loss with Dynamic Focusing Mechanism — Tong et al. 💻 Instinct323/Wise-IoU

MPDIoU Loss (2023) — Bounding box similarity via minimum point distances between corners. 📄 MPDIoU: A Loss for Efficient and Accurate Bounding Box Regression — Ma & Xu

Inner-IoU Loss (2023) — IoU through auxiliary inner bounding boxes with a scaling factor. 📄 Inner-IoU: More Effective Intersection over Union Loss with Auxiliary Bounding Box — Zhang et al.
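
The IoU-family losses above differ mainly in the penalty added to plain IoU; GIoU is a compact example (pure-Python sketch for axis-aligned boxes given as (x1, y1, x2, y2)):

```python
def giou_loss(a, b):
    """GIoU loss for axis-aligned boxes (x1, y1, x2, y2): 1 - GIoU(a, b)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # smallest enclosing box, used for the GIoU penalty term
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    c = (cx2 - cx1) * (cy2 - cy1)
    giou = inter / union - (c - union) / c
    return 1.0 - giou
```

For disjoint boxes the plain IoU loss saturates at 1, while the GIoU loss keeps growing with separation, which is what restores a useful gradient.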

Generative Models — GANs

Minimax / Original GAN Loss (2014) — Discriminator maximizes, generator minimizes binary cross-entropy in a two-player minimax game. 📄 Generative Adversarial Nets — Goodfellow et al.

Non-Saturating GAN Loss (2014) — Generator maximizes log(D(G(z))) instead of minimizing log(1−D(G(z))), providing stronger early gradients. 📄 Generative Adversarial Nets — Goodfellow et al.

Feature Matching Loss (2016) — Generator matches expected feature statistics at an intermediate discriminator layer. 📄 Improved Techniques for Training GANs — Salimans et al.

Least Squares GAN Loss (LSGAN) (2017) — L2 objective minimizing Pearson χ² divergence for more stable training. 📄 Least Squares Generative Adversarial Networks — Mao et al.

Wasserstein Loss (WGAN) (2017) — Earth Mover's distance providing meaningful gradients even for non-overlapping distributions. 📄 Wasserstein GAN — Arjovsky, M. et al. 💻 martinarjovsky/WassersteinGAN

WGAN-GP (2017) — Gradient penalty replacing weight clipping for better Lipschitz constraint enforcement. 📄 Improved Training of Wasserstein GANs — Gulrajani et al.

Hinge Loss GAN (2017) — Max-margin formulation with bounded gradients; used in BigGAN, SAGAN. 📄 Geometric GAN — Lim & Ye 📄 Spectral Normalization for GANs — Miyato et al.

Spectral Normalization (2018) — Constrains the spectral norm of weight matrices to stabilize discriminator training. 📄 Spectral Normalization for Generative Adversarial Networks — Miyato et al.

R1 Regularization (2018) — Zero-centered gradient penalty on real data for local convergence guarantees. 📄 Which Training Methods for GANs do actually Converge? — Mescheder et al. 💻 NVlabs/stylegan2-ada-pytorch

Relativistic GAN Loss (RaGAN) (2018) — Discriminator estimates the probability that real data is more realistic than fake. 📄 The Relativistic Discriminator — Jolicoeur-Martineau, A.

Mode Seeking Loss (2019) — Maximizes the image/latent distance ratio to encourage diverse mode exploration. 📄 Mode Seeking Generative Adversarial Networks for Diverse Image Synthesis — Mao et al.

Path Length Regularization (2020) — Consistent Jacobian norm across latent space for smooth interpolations. 📄 Analyzing and Improving the Image Quality of StyleGAN — Karras et al. 💻 NVlabs/stylegan2-ada-pytorch

LeCam Regularization (2021) — LeCam divergence-based stabilization under limited data. 📄 Regularizing Generative Adversarial Networks under Limited Data — Tseng et al. 💻 google/lecam-gan

Projected GAN Loss (2021) — Multi-scale discrimination in projected feature space from pretrained networks. 📄 Projected GANs Converge Faster — Sauer et al.
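
The hinge formulation used by SAGAN/BigGAN is short enough to state directly (NumPy sketch over batches of raw discriminator scores):

```python
import numpy as np

def d_hinge(real_scores, fake_scores):
    """Discriminator hinge loss: mean max(0, 1 - D(x)) + mean max(0, 1 + D(G(z)))."""
    return (np.mean(np.maximum(0.0, 1.0 - real_scores))
            + np.mean(np.maximum(0.0, 1.0 + fake_scores)))

def g_hinge(fake_scores):
    """Generator hinge loss: -mean D(G(z))."""
    return -np.mean(fake_scores)
```

The margins cap the discriminator's contribution once samples are confidently separated, which is one reason this variant trains more stably than the saturating minimax loss.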

Generative Models — VAEs

ELBO / VAE Loss (2013) — Reconstruction loss + KL divergence regularizer pushing the posterior toward the prior. 📄 Auto-Encoding Variational Bayes — Kingma, D.P. & Welling, M. 💻 AntixK/PyTorch-VAE

β-VAE Loss (2017) — Upweights the KL divergence (β > 1) for more disentangled latent representations. 📄 β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework — Higgins et al.

VQ-VAE Loss (2017) — Reconstruction + vector quantization commitment loss + codebook loss for discrete latents. 📄 Neural Discrete Representation Learning — van den Oord et al.

WAE Loss (2018) — Penalized Wasserstein distance using MMD or adversarial regularization on the latent space. 📄 Wasserstein Auto-Encoders — Tolstikhin et al.
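
The KL regularizer in the ELBO has a closed form for a diagonal-Gaussian posterior against a standard-normal prior (NumPy sketch of that term only; the reconstruction term is task-specific):

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), the VAE regularizer.
    Zero exactly when the posterior equals the prior."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
```

In a β-VAE this term is simply multiplied by β before being added to the reconstruction loss.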

Generative Models — Diffusion & Flow

Denoising Score Matching (2011) — Training a denoising autoencoder equals matching the score function of noise-perturbed data. 📄 A Connection Between Score Matching and Denoising Autoencoders — Vincent, P.

Score Matching with Langevin Dynamics (NCSN) (2019) — Noise-conditional score network across multiple noise scales with annealed Langevin sampling. 📄 Generative Modeling by Estimating Gradients of the Data Distribution — Song, Y. & Ermon, S.

DDPM Loss (2020) — Simplified variational bound: predict the noise added at each diffusion step via weighted MSE. 📄 Denoising Diffusion Probabilistic Models — Ho, J. et al.

Variational Diffusion Loss (2021) — Continuous-time variational lower bound with a learnable noise schedule. 📄 Variational Diffusion Models — Kingma et al.

v-prediction Loss (2022) — Predicts velocity v = α·ε − σ·x for improved numerical stability and progressive distillation. 📄 Progressive Distillation for Fast Sampling of Diffusion Models — Salimans, T. & Ho, J.

Rectified Flow Loss (2022) — Learns straight-line ODE trajectories between noise and data distributions. 📄 Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow — Liu et al.

Flow Matching Loss (2023) — Simulation-free training for continuous normalizing flows; regresses vector fields of conditional probability paths. 📄 Flow Matching for Generative Modeling — Lipman et al. 💻 facebookresearch/flow_matching

Consistency Loss (2023) — Self-consistency along the probability flow ODE for high-quality one-step generation. 📄 Consistency Models — Song et al. 💻 OpenAI/consistency_models
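
One DDPM training term can be sketched end to end: corrupt a clean sample with the closed-form forward process, then regress the sampled noise (NumPy; `eps_pred_fn` stands in for the denoising network and `alpha_bar_t` for the cumulative noise schedule — both names are illustrative):

```python
import numpy as np

def ddpm_loss(x0, eps_pred_fn, alpha_bar_t, rng):
    """One DDPM training term: corrupt x0 to x_t with sampled noise eps,
    then MSE between eps and the model's noise prediction."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return np.mean((eps - eps_pred_fn(x_t)) ** 2), x_t, eps
```

In training, t (and hence alpha_bar_t) is sampled uniformly per example; for v-prediction the regression target changes to α·ε − σ·x but the structure is the same.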

Reconstruction & Perceptual

SSIM Loss (2004) — Structural similarity using luminance, contrast, and structure comparisons; used as 1−SSIM. 📄 Image Quality Assessment: From Error Visibility to Structural Similarity — Wang et al. 💻 VainF/pytorch-msssim

Style Loss (Gram Matrix) (2015) — Matches Gram matrices of CNN feature maps for texture/style transfer. 📄 A Neural Algorithm of Artistic Style — Gatys et al.

Perceptual Loss / VGG Loss (2016) — L2 distance between deep feature representations of generated and target images. 📄 Perceptual Losses for Real-Time Style Transfer and Super-Resolution — Johnson et al.

LPIPS (2018) — Learned perceptual metric using calibrated deep features; correlates better with human perception than SSIM/PSNR. 📄 The Unreasonable Effectiveness of Deep Features as a Perceptual Metric — Zhang et al. 💻 richzhang/PerceptualSimilarity

Image Super-Resolution & Restoration

Charbonnier Loss (1994) — Differentiable approximation to L1 (√(x²+ε²)); robust to outliers, smooth at zero. 📄 Two Deterministic Half-Quadratic Regularization Algorithms for Computed Imaging — Charbonnier et al.

MS-SSIM Loss (2003) — Multi-scale SSIM evaluating structural similarity across multiple resolutions. 📄 Multi-Scale Structural Similarity for Image Quality Assessment — Wang et al.

SRGAN Loss (2017) — Adversarial loss + VGG perceptual content loss for photo-realistic 4× super-resolution. 📄 Photo-Realistic Single Image Super-Resolution Using a GAN — Ledig et al.

Contextual Loss (2018) — Feature-level context matching without spatial alignment; enables training with non-aligned data. 📄 The Contextual Loss for Image Transformation with Non-Aligned Data — Mechrez et al.

ESRGAN Loss (2018) — Relativistic average discriminator + pre-activation VGG perceptual loss for enhanced texture recovery. 📄 ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks — Wang et al. 💻 xinntao/ESRGAN

Focal Frequency Loss (2021) — Adaptively focuses on hard-to-synthesize frequencies in the Fourier domain. 📄 Focal Frequency Loss for Image Reconstruction and Synthesis — Jiang et al. 💻 EndlessSora/focal-frequency-loss
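
Charbonnier is a one-liner worth stating explicitly, since it underlies many restoration pipelines (NumPy sketch):

```python
import numpy as np

def charbonnier(r, eps=1e-3):
    """Charbonnier loss: sqrt(r^2 + eps^2), a smooth approximation of |r|
    whose gradient is well-defined at r = 0 (unlike plain L1)."""
    return np.sqrt(r**2 + eps**2)
```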

Contrastive & Self-Supervised Learning

Contrastive Loss (2005) — Pairwise loss pulling similar pairs together and pushing dissimilar pairs apart by a margin. 📄 Learning a Similarity Metric Discriminatively, with Application to Face Verification — Chopra, Hadsell, LeCun

N-pair Loss (2016) — Generalizes triplet loss by simultaneously pushing away negatives from N−1 classes. 📄 Improved Deep Metric Learning with Multi-class N-pair Loss Objective — Sohn, K.

InfoNCE / CPC Loss (2018) — Noise-contrastive estimation maximizing mutual information between latent representations. 📄 Representation Learning with Contrastive Predictive Coding — van den Oord et al. 💻 RElbers/info-nce-pytorch

MoCo Loss (2020) — InfoNCE with a momentum-updated encoder and dynamic dictionary queue. 📄 Momentum Contrast for Unsupervised Visual Representation Learning — He et al. 💻 facebookresearch/moco

NT-Xent / SimCLR Loss (2020) — Normalized temperature-scaled cross-entropy over cosine similarities of augmented pairs. 📄 A Simple Framework for Contrastive Learning of Visual Representations — Chen et al.

BYOL Loss (2020) — MSE between L2-normalized predictions and targets; learns without negative pairs via a momentum teacher. 📄 Bootstrap Your Own Latent — Grill et al.

SwAV Loss (2020) — Swapped prediction contrasting cluster assignments from different augmented views. 📄 Unsupervised Learning of Visual Features by Contrasting Cluster Assignments — Caron et al. 💻 facebookresearch/swav

Supervised Contrastive Loss (SupCon) (2020) — Extends self-supervised contrastive loss with label information to pull same-class embeddings together. 📄 Supervised Contrastive Learning — Khosla et al. 💻 HobbitLong/SupContrast

Barlow Twins Loss (2021) — Drives the cross-correlation matrix toward identity; reduces redundancy between embedding dimensions. 📄 Barlow Twins: Self-Supervised Learning via Redundancy Reduction — Zbontar et al. 💻 facebookresearch/barlowtwins

DINO Loss (2021) — Self-distillation via cross-entropy between sharpened softmax outputs of student and momentum teacher. 📄 Emerging Properties in Self-Supervised Vision Transformers — Caron et al. 💻 facebookresearch/dino

SimSiam Loss (2021) — Negative cosine similarity with stop-gradient; no negatives, momentum, or large batches needed. 📄 Exploring Simple Siamese Representation Learning — Chen & He

CLIP Loss (2021) — Symmetric cross-entropy over image-text cosine similarities aligning visual and language representations. 📄 Learning Transferable Visual Models From Natural Language Supervision — Radford et al. 💻 mlfoundations/open_clip

VICReg Loss (2022) — Variance + invariance + covariance regularization preventing collapse without negatives. 📄 VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning — Bardes et al. 💻 facebookresearch/vicreg

Decoupled Contrastive Loss (2022) — Removes the positive term from the InfoNCE denominator, eliminating negative-positive coupling. 📄 Decoupled Contrastive Learning — Yeh et al.

DINOv2 Loss (2023) — DINO self-distillation + iBOT masked image modeling + Sinkhorn centering at scale. 📄 DINOv2: Learning Robust Visual Features without Supervision — Oquab et al. 💻 facebookresearch/dinov2

SigLIP Loss (2023) — Pairwise sigmoid loss replacing softmax for efficient batch-parallel language-image pre-training. 📄 Sigmoid Loss for Language Image Pre-Training — Zhai et al.

Metric Learning & Face Recognition

Triplet Loss (2015) — Minimizes anchor-positive distance while maximizing anchor-negative distance by a margin. 📄 FaceNet: A Unified Embedding for Face Recognition and Clustering — Schroff et al. 💻 KevinMusgrave/pytorch-metric-learning

Lifted Structured Loss (2016) — Mines all positive and negative pairs in a batch simultaneously. 📄 Deep Metric Learning via Lifted Structured Feature Embedding — Oh Song et al.

SphereFace / A-Softmax (2017) — Multiplicative angular margin on a hypersphere for discriminative face features. 📄 SphereFace: Deep Hypersphere Embedding for Face Recognition — Liu et al.

Proxy-NCA Loss (2017) — Data-to-proxy comparisons with one learnable proxy per class; dramatically faster convergence. 📄 No Fuss Distance Metric Learning Using Proxies — Movshovitz-Attias et al.

CosFace / LMCL (2018) — Cosine margin penalty on the target logit in normalized softmax. 📄 CosFace: Large Margin Cosine Loss for Deep Face Recognition — Wang et al. 💻 deepinsight/insightface

ArcFace (2019) — Additive angular margin with a clear geodesic distance interpretation. 📄 ArcFace: Additive Angular Margin Loss for Deep Face Recognition — Deng et al. 💻 deepinsight/insightface

Multi-Similarity Loss (2019) — Mines and weights pairs using self-similarity, relative similarity, and negative similarity. 📄 Multi-Similarity Loss with General Pair Weighting for Deep Metric Learning — Wang et al.

SoftTriple Loss (2019) — Multiple centers per class bridging proxy-based and triplet-based losses. 📄 SoftTriple Loss: Deep Metric Learning Without Triplet Sampling — Qian et al.

Circle Loss (2020) — Unified pair similarity optimization with self-paced weighting. 📄 Circle Loss: A Unified Perspective of Pair Similarity Optimization — Sun et al.

Proxy Anchor Loss (2020) — Proxies as anchors associated with all batch data; fast convergence. 📄 Proxy Anchor Loss for Deep Metric Learning — Kim et al. 💻 tjddus9597/Proxy-Anchor-CVPR2020

Sub-center ArcFace (2020) — Multiple sub-centers per class for noisy label handling. 📄 Sub-center ArcFace: Boosting Face Recognition by Large-Scale Noisy Web Faces — Deng et al.

AdaFace (2022) — Adaptive margin emphasizing hard or easy samples based on image quality. 📄 AdaFace: Quality Adaptive Margin for Face Recognition — Kim et al. 💻 mk-minchul/AdaFace

ElasticFace (2022) — Random margin values drawn from a normal distribution each iteration for flexible separability. 📄 ElasticFace: Elastic Margin Loss for Deep Face Recognition — Boutros et al. 💻 fdbtrs/ElasticFace
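
The margin-softmax family above mostly differs in where the margin enters the logit. ArcFace's additive angular margin can be sketched for a single sample (NumPy, illustrative only; real implementations also handle easy-margin edge cases when θ + m exceeds π):

```python
import numpy as np

def arcface_logits(cosines, target, s=64.0, m=0.5):
    """Apply the ArcFace additive angular margin to the target-class cosine
    (cos(theta) -> cos(theta + m)), then scale all logits by s."""
    out = cosines.copy()
    theta = np.arccos(np.clip(cosines[target], -1.0, 1.0))
    out[target] = np.cos(theta + m)
    return s * out

def softmax_ce(logits, target):
    """Stable softmax cross-entropy for one sample."""
    logits = logits - logits.max()
    return -(logits[target] - np.log(np.exp(logits).sum()))
```

With m = 0 this is plain normalized softmax; a positive m lowers the target logit, forcing tighter angular clustering per class.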

NLP & Language Modeling

Cross-Entropy / Next Token Prediction — Standard autoregressive LM loss; the foundation of GPT and all causal LMs. 📄 Language Models are Unsupervised Multitask Learners — Radford et al. (GPT-2, 2019)

Masked Language Model (MLM) Loss (2019) — Masks 15% of tokens and predicts them from bidirectional context. Introduced the pre-train/fine-tune paradigm for NLU. 📄 BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding — Devlin et al.

Replaced Token Detection (RTD) (2020) — Discriminator classifies every token as original or replaced; the loss is defined over all tokens for better sample efficiency. 📄 ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators — Clark et al.

Sentence Order Prediction (SOP) (2020) — Predicts whether two consecutive segments are in correct or swapped order. 📄 ALBERT: A Lite BERT for Self-supervised Learning — Lan et al.

Span Corruption Loss (2020) — Masks contiguous spans; the encoder-decoder reconstructs only the missing spans. Casts all NLP tasks as text-to-text. 📄 Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer — Raffel et al.

Mixture of Denoisers (MoD) (2022) — Unifies causal LM, prefix LM, and span corruption into a single pre-training objective. 📄 UL2: Unifying Language Learning Paradigms — Tay et al.
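
The shift-by-one bookkeeping behind next-token prediction is a frequent source of off-by-one bugs; a NumPy sketch for a single sequence:

```python
import numpy as np

def next_token_loss(logits, tokens):
    """Causal LM loss: position t's logits predict token t+1.
    logits: (T, V) array; tokens: (T,) int ids. Averaged over T-1 positions."""
    shifted_logits, targets = logits[:-1], tokens[1:]
    logp = shifted_logits - shifted_logits.max(axis=-1, keepdims=True)
    logp = logp - np.log(np.exp(logp).sum(axis=-1, keepdims=True))
    return -np.mean(logp[np.arange(len(targets)), targets])
```

A uniform model scores log V nats per token, which is the usual sanity check before training.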

LLM Alignment (RLHF / DPO)

PPO Loss / RLHF (2017/2022) — Clipped surrogate objective for aligning LLMs with human preferences via a learned reward model. 📄 Proximal Policy Optimization Algorithms — Schulman et al. 📄 Training language models to follow instructions with human feedback — Ouyang et al. 💻 huggingface/trl

Reward Model Loss / Bradley-Terry (2022) — Cross-entropy on pairwise human preferences for training scalar reward models. 📄 Training language models to follow instructions with human feedback — Ouyang et al.

SLiC-HF Loss (2023) — Contrastive ranking loss calibrating sequence likelihoods to human preferences. 📄 SLiC-HF: Sequence Likelihood Calibration with Human Feedback — Zhao et al.

DPO Loss (2023) — Closed-form policy optimization directly from preference pairs; no separate reward model or RL loop. 📄 Direct Preference Optimization: Your Language Model is Secretly a Reward Model — Rafailov et al. 💻 huggingface/trl — DPOTrainer
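The DPO objective is −log σ(β[(log π(y_w|x) − log π_ref(y_w|x)) − (log π(y_l|x) − log π_ref(y_l|x))]) over preference pairs. A minimal NumPy sketch over batches of summed sequence log-probabilities (function and argument names are illustrative, not the TRL API):

```python
import numpy as np

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    logits = chosen_rewards - rejected_rewards
    # numerically stable -log(sigmoid(x)) = log(1 + exp(-x))
    return np.logaddexp(0.0, -logits).mean()
```

When the policy equals the reference, both implicit rewards vanish and the loss sits at log 2; it decreases as the policy raises the chosen completion's likelihood relative to the rejected one.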

IPO Loss (2023) — Squared loss on preference margins, avoiding overfitting to the Bradley-Terry assumption. 📄 A General Theoretical Paradigm to Understand Learning from Human Preferences — Azar et al.

CPO Loss (2024) — Contrastive preference loss without a reference model, developed for machine translation. 📄 Contrastive Preference Optimization — Xu et al.

KTO Loss (2024) — Kahneman-Tversky prospect theory applied to alignment; works from binary (good/bad) feedback. 📄 KTO: Model Alignment as Prospect Theoretic Optimization — Ethayarajh et al. 💻 huggingface/trl — KTOTrainer

GRPO Loss (2024) — Group Relative Policy Optimization; estimates advantages from sampled output groups, eliminating the critic model. 📄 DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models — Shao et al.

ORPO Loss (2024) — Odds-ratio penalty added to the SFT loss; combines instruction tuning and preference alignment in one stage. 📄 ORPO: Monolithic Preference Optimization without Reference Model — Hong et al. 💻 huggingface/trl — ORPOTrainer

SimPO Loss (2024) — Reference-free preference optimization using length-normalized average log probability as the implicit reward. 📄 SimPO: Simple Preference Optimization with a Reference-Free Reward — Meng et al. 💻 princeton-nlp/SimPO

SPPO Loss (2024) — Self-play preference optimization framing alignment as a two-player constant-sum game. 📄 Self-Play Preference Optimization for Language Model Alignment — Wu et al.

Sequence-to-Sequence & Speech

CTC Loss (2006) — Marginalizes over all valid alignments between input and output sequences; foundational for ASR. 📄 Connectionist Temporal Classification — Graves et al. 💻 torch.nn.CTCLoss

RNN-T Loss (2012) — Extends CTC with a prediction network conditioning on previous outputs for streaming transduction. 📄 Sequence Transduction with Recurrent Neural Networks — Graves, A. 💻 torchaudio.transforms.RNNTLoss

Scheduled Sampling Loss (2015) — Gradually replaces ground-truth tokens with model predictions during training to mitigate exposure bias. 📄 Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks — Bengio et al.

Sequence-Level Training / MIXER (2016) — Directly optimizes BLEU/ROUGE using REINFORCE. 📄 Sequence Level Training with Recurrent Neural Networks — Ranzato et al.

Minimum Risk Training (2016) — Minimizes expected task-level loss (e.g., 1 − BLEU) via sampling. 📄 Minimum Risk Training for Neural Machine Translation — Shen et al.

Mel-Spectrogram Reconstruction Loss (2017) — L1/L2 between predicted and target mel-spectrograms; primary TTS training objective. 📄 Tacotron: Towards End-to-End Speech Synthesis — Wang et al.

Multi-Resolution STFT Loss (2020) — Spectral convergence + log-magnitude STFT at multiple FFT sizes for neural vocoder training. 📄 Parallel WaveGAN — Yamamoto et al. 💻 csteinmetz1/auraloss

Reinforcement Learning

TD Loss / Temporal Difference (1988) — Bootstrapped value estimation updating predictions toward reward + discounted next-state value. 📄 Learning to Predict by the Methods of Temporal Differences — Sutton, R.S.
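TD(0) nudges V(s) toward the bootstrapped target r + γV(s′); the squared TD error is the loss minimized by value-based deep RL. A toy tabular sketch (names illustrative):

```python
import numpy as np

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """One TD(0) step: move V[s] toward the bootstrapped target r + gamma*V[s']."""
    target = r + (0.0 if done else gamma * V[s_next])
    td_error = target - V[s]       # deep RL minimizes td_error**2 instead
    V[s] += alpha * td_error       # tabular gradient step
    return td_error
```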

Q-Learning Loss (1989) — Off-policy TD control bootstrapping with the max Q-value over next actions. 📄 Learning from Delayed Rewards — Watkins, C.J.C.H.

REINFORCE / Policy Gradient (1992) — Monte Carlo policy gradient weighted by returns. 📄 Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning — Williams, R.J.

DQN Loss (2015) — Q-learning with deep networks, experience replay, and target networks. 📄 Human-level Control through Deep Reinforcement Learning — Mnih et al. 💻 DLR-RM/stable-baselines3

Double DQN Loss (2015) — Decouples action selection from evaluation to reduce overestimation bias. 📄 Deep Reinforcement Learning with Double Q-learning — van Hasselt et al.

DDPG Loss (2015) — Deterministic policy gradients for continuous control with experience replay. 📄 Continuous Control with Deep Reinforcement Learning — Lillicrap et al.

GAE (2015) — Exponentially-weighted multi-step TD errors for a tunable bias-variance tradeoff. 📄 High-Dimensional Continuous Control Using Generalized Advantage Estimation — Schulman et al.

A3C / A2C Loss (2016) — Actor-critic with policy gradient + value function baseline + entropy bonus. 📄 Asynchronous Methods for Deep Reinforcement Learning — Mnih et al.

Distributional RL / C51 Loss (2017) — Models the full return distribution using categorical projection over fixed atoms. 📄 A Distributional Perspective on Reinforcement Learning — Bellemare et al.

PPO Clipped Surrogate Loss (2017) — Clips the probability ratio to prevent destructively large policy updates. 📄 Proximal Policy Optimization Algorithms — Schulman et al. 💻 DLR-RM/stable-baselines3
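The clipped surrogate maximizes min(rA, clip(r, 1−ε, 1+ε)A), where r is the new/old probability ratio and A the advantage. A minimal NumPy sketch (illustrative names, returns the loss to minimize):

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Negative clipped surrogate: -E[min(r*A, clip(r, 1-eps, 1+eps)*A)]."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -np.minimum(unclipped, clipped).mean()
```

With a positive advantage the objective stops rewarding ratio increases beyond 1+ε, which is what caps the policy update size.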

HER Loss (2017) — Relabels failed trajectories with achieved goals for sample-efficient sparse-reward learning. 📄 Hindsight Experience Replay — Andrychowicz et al.

QR-DQN Loss (2018) — Quantile regression approximating the return distribution with learnable quantile locations. 📄 Distributional Reinforcement Learning with Quantile Regression — Dabney et al.

SAC Loss (2018) — Maximum entropy actor-critic balancing exploration and exploitation automatically. 📄 Soft Actor-Critic — Haarnoja et al.

TD3 Loss (2018) — Clipped double-Q learning + delayed policy updates + target policy smoothing. 📄 Addressing Function Approximation Error in Actor-Critic Methods — Fujimoto et al.

V-trace Loss (2018) — Importance-weighted off-policy correction for scalable distributed RL (IMPALA). 📄 IMPALA: Scalable Distributed Deep-RL — Espeholt et al.

Decision Transformer Loss (2021) — RL as sequence modeling; an autoregressive transformer conditioned on returns, trained with a supervised loss. 📄 Decision Transformer: Reinforcement Learning via Sequence Modeling — Chen et al. 💻 kzl/decision-transformer

Knowledge Distillation

Knowledge Distillation / KD Loss (2015) — Student matches the softened output distribution of the teacher via KL divergence at elevated temperature. 📄 Distilling the Knowledge in a Neural Network — Hinton, Vinyals, Dean
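Concretely, the soft-target term is KL(softmax(t/T) ‖ softmax(s/T)) scaled by T² so gradient magnitudes stay comparable across temperatures. A gradient-free NumPy sketch (names illustrative):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # stability shift
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=4.0):
    """T^2-scaled KL between temperature-softened teacher and student distributions."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)
    return (T ** 2) * kl.mean()
```

In practice this term is blended with the ordinary hard-label cross-entropy, e.g. α·CE + (1−α)·KD.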

FitNets / Hint Loss (2015) — Student mimics intermediate feature representations of the teacher. 📄 FitNets: Hints for Thin Deep Nets — Romero et al.

Attention Transfer Loss (2017) — Forces the student to mimic spatial attention maps of the teacher's intermediate layers. 📄 Paying More Attention to Attention — Zagoruyko & Komodakis 💻 szagoruyko/attention-transfer

Born-Again Networks (2018) — Self-distillation where an identical-architecture student outperforms the teacher. 📄 Born Again Neural Networks — Furlanello et al.

PKT / Probabilistic KD (2018) — Matches probability distributions in feature space rather than raw representations. 📄 Learning Deep Representations with Probabilistic Knowledge Transfer — Passalis & Tefas

Relational KD / RKD (2019) — Transfers mutual relations (distances and angles) between examples. 📄 Relational Knowledge Distillation — Park et al.

Self-Distillation Loss (2019) — Deeper layers supervise shallower classifiers within the same network. 📄 Be Your Own Teacher — Zhang et al.

CRD / Contrastive Representation Distillation (2020) — Maximizes mutual information between teacher and student via a contrastive objective. 📄 Contrastive Representation Distillation — Tian et al. 💻 HobbitLong/RepDistiller

ReviewKD (2021) — Student's lower-level features guided by the teacher's higher-level features through attention-based fusion. 📄 Distilling Knowledge via Knowledge Review — Chen et al. 💻 dvlab-research/ReviewKD

DKD / Decoupled KD (2022) — Decouples KD into target-class and non-target-class components for independent weighting. 📄 Decoupled Knowledge Distillation — Zhao et al. 💻 megvii-research/mdistiller

DIST Loss (2022) — Preserves inter-class relations and intra-class ranking rather than exact probability matching. 📄 Knowledge Distillation from A Stronger Teacher — Huang et al. 💻 hunto/DIST_KD

Regularization

KL Divergence (1951) — Measures information lost when approximating one distribution with another. 📄 On Information and Sufficiency — Kullback & Leibler

L2 Regularization / Weight Decay (1970) — Penalizes the sum of squared weights to prevent overfitting. 📄 Ridge Regression — Hoerl & Kennard

L1 Regularization / Lasso (1996) — Penalizes the sum of absolute weights, inducing sparsity. 📄 Regression Shrinkage and Selection via the Lasso — Tibshirani, R.

Elastic Net (2005) — Combines L1 and L2 for sparsity + grouping of correlated features. 📄 Regularization and Variable Selection via the Elastic Net — Zou & Hastie

Dropout (2014) — Randomly zeroes activations; implicit ensemble of exponentially many sub-networks. 📄 Dropout: A Simple Way to Prevent Neural Networks from Overfitting — Srivastava et al.

Confidence Penalty (2017) — Penalizes low-entropy (overconfident) output distributions. 📄 Regularizing Neural Networks by Penalizing Confident Output Distributions — Pereyra et al.

Mixup Loss (2018) — Trains on convex combinations of example pairs and their labels. 📄 mixup: Beyond Empirical Risk Minimization — Zhang et al. 💻 facebookresearch/mixup-cifar10
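Mixup draws λ ~ Beta(α, α) and trains on (λx₁ + (1−λ)x₂, λy₁ + (1−λ)y₂); the loss is then the usual cross-entropy against the mixed soft label. A minimal sketch of the data side (names illustrative):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Convex combination of two examples and their one-hot label vectors."""
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)       # mixing coefficient in (0, 1)
    x_mix = lam * x1 + (1.0 - lam) * x2
    y_mix = lam * y1 + (1.0 - lam) * y2
    return x_mix, y_mix
```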

Manifold Mixup (2019) — Extends Mixup to hidden representations at random intermediate layers. 📄 Manifold Mixup: Better Representations by Interpolating Hidden States — Verma et al.

CutMix Loss (2019) — Cuts and pastes rectangular patches between images while mixing labels proportionally. 📄 CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features — Yun et al.

3D Vision & Point Clouds

Chamfer Distance (2017) — Average nearest-neighbor distance between two point sets; fast and widely used. 📄 A Point Set Generation Network for 3D Object Reconstruction from a Single Image — Fan et al. 💻 facebookresearch/pytorch3d
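A brute-force NumPy sketch of the symmetric (squared) Chamfer distance; the O(n·m) pairwise matrix is fine for small clouds, while production code (e.g. PyTorch3D) uses spatial acceleration structures:

```python
import numpy as np

def chamfer_distance(P, Q):
    """Symmetric average nearest-neighbor squared distance between (n,3) and (m,3) sets."""
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)  # (n, m) pairwise squared dists
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```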

Earth Mover's Distance (EMD) (2017) — Optimal transport distance with bijective matching; higher quality but more expensive than CD. 📄 A Point Set Generation Network for 3D Object Reconstruction from a Single Image — Fan et al.

Normal Consistency Loss (2018) — Penalizes inconsistency of surface normals between adjacent mesh faces. 📄 Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images — Wang et al.

Mesh Laplacian Smoothing Loss (2018) — Penalizes vertex deviation from the neighbor centroid to prevent self-intersections. 📄 Pixel2Mesh — Wang et al. 💻 facebookresearch/pytorch3d

SDF Loss (DeepSDF) (2019) — Regresses signed distance values; the zero level-set defines the 3D surface. 📄 DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation — Park et al. 💻 facebookresearch/DeepSDF

Occupancy Loss (2019) — Binary CE on predicted occupancy probabilities for 3D reconstruction. 📄 Occupancy Networks: Learning 3D Reconstruction in Function Space — Mescheder et al.

NeRF Photometric Loss (2020) — MSE between rendered and observed pixel colors via differentiable volume rendering. 📄 NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis — Mildenhall et al.

3D Gaussian Splatting Loss (2023) — L1 + D-SSIM for optimizing anisotropic 3D Gaussians for real-time radiance field rendering. 📄 3D Gaussian Splatting for Real-Time Radiance Field Rendering — Kerbl et al. 💻 graphdeco-inria/gaussian-splatting

Depth Estimation

Scale-Invariant Loss (2014) — Log-space depth error minus a mean-shift term; invariant to the global scale ambiguity. 📄 Depth Map Prediction from a Single Image using a Multi-Scale Deep Network — Eigen et al.
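With d = log ŷ − log y, the loss is mean(d²) − λ·(mean d)². At λ = 1 it is fully invariant to a global depth scale; the paper trains with λ = 0.5 as a compromise. A NumPy sketch (names illustrative):

```python
import numpy as np

def scale_invariant_loss(pred_depth, gt_depth, lam=0.5):
    """Eigen et al.: per-pixel log error with the mean log-offset partially removed."""
    d = np.log(pred_depth) - np.log(gt_depth)
    return (d ** 2).mean() - lam * d.mean() ** 2
```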

Berhu Loss (Reverse Huber) (2016) — L1 for small residuals, L2 for large; robust depth regression. 📄 Deeper Depth Prediction with Fully Convolutional Residual Networks — Laina et al.

Photometric Consistency Loss (2017) — Self-supervised SSIM + L1 with left-right disparity consistency for monocular depth. 📄 Unsupervised Monocular Depth Estimation with Left-Right Consistency — Godard et al. 💻 nianticlabs/monodepth2

Edge-Aware Smoothness Loss (2017) — Locally smooth depth except at image edges, weighted by image gradients. 📄 Unsupervised Monocular Depth Estimation with Left-Right Consistency — Godard et al.

Medical Imaging

Deep Supervision Loss (2015) — Auxiliary losses at intermediate layers providing direct gradient paths. 📄 Deeply-Supervised Nets — Lee et al.

Dice Loss (2016) — Directly optimizes the Dice coefficient for volumetric medical image segmentation. 📄 V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation — Milletari et al.

Generalized Dice Loss (2017) — Per-class volume weighting for highly imbalanced multi-class segmentation. 📄 Generalised Dice Overlap as a Deep Learning Loss Function — Sudre et al.

Tversky Loss (2017) — Tunable FP/FN trade-off for small lesion segmentation. 📄 Tversky Loss Function for Image Segmentation — Salehi et al.
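The Tversky index is TP / (TP + α·FP + β·FN); setting α = β = 0.5 recovers Dice, while β > α penalizes missed lesion voxels more. A NumPy sketch over soft binary masks (names illustrative):

```python
import numpy as np

def tversky_loss(pred, target, alpha=0.3, beta=0.7, eps=1e-7):
    """1 - Tversky index; alpha weighs false positives, beta false negatives."""
    tp = (pred * target).sum()
    fp = (pred * (1.0 - target)).sum()
    fn = ((1.0 - pred) * target).sum()
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)
```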

Attention-Gated Loss (2018) — Learned attention gates suppress irrelevant regions in skip connections. 📄 Attention U-Net: Learning Where to Look for the Pancreas — Oktay et al.

Boundary / Surface Loss (2019) — Distance metric on contour space for highly unbalanced medical segmentation. 📄 Boundary Loss for Highly Unbalanced Segmentation — Kervadec et al. 💻 LIVIAETS/boundary-loss

Distance Map Penalized CE (2019) — Weights CE by distance transform maps to focus on boundary regions. 📄 Distance Map Loss Penalty Term for Semantic Segmentation — Caliva et al.

Graph Neural Networks

Variational Graph Auto-Encoder (VGAE) Loss (2016) — Reconstruction BCE on the adjacency matrix + KL divergence for unsupervised graph learning. 📄 Variational Graph Auto-Encoders — Kipf & Welling

Node Classification Loss (2017) — Standard per-node cross-entropy in semi-supervised graph settings. 📄 Semi-Supervised Classification with Graph Convolutional Networks — Kipf & Welling 💻 pyg-team/pytorch_geometric

Deep Graph Infomax (DGI) Loss (2019) — Maximizes mutual information between local node and global graph representations. 📄 Deep Graph Infomax — Veličković et al. 💻 PetarV-/DGI

Graph Matching Loss (2019) — Attention-based cross-graph matching with a margin-based pairwise loss. 📄 Graph Matching Networks for Learning the Similarity of Graph Structured Objects — Li et al.

InfoGraph Loss (2020) — Maximizes mutual information between graph-level and substructure-level representations. 📄 InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning — Sun et al. 💻 sunfanyunn/InfoGraph

GraphCL Loss (2020) — NT-Xent contrastive loss on augmented graph views for self-supervised graph learning. 📄 Graph Contrastive Learning with Augmentations — You et al. 💻 Shen-Lab/GraphCL

BGRL Loss (2022) — Negative-sample-free self-supervised loss bootstrapping graph representations (inspired by BYOL). 📄 Large-Scale Representation Learning on Graphs via Bootstrapping — Thakoor et al. 💻 nerdslab/bgrl

Recommendation Systems

ListNet Loss (2007) — Listwise learning-to-rank using top-one probability distributions. 📄 Learning to Rank: From Pairwise Approach to Listwise Approach — Cao et al.

ListMLE Loss (2008) — Listwise loss based on the likelihood of the ground-truth permutation under the Plackett-Luce model. 📄 Listwise Approach to Learning to Rank: Theory and Algorithm — Xia et al.

BPR Loss (2009) — Pairwise loss maximizing the posterior probability that a user prefers observed over unobserved items. 📄 BPR: Bayesian Personalized Ranking from Implicit Feedback — Rendle et al. 💻 guoyang9/BPR-pytorch
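Per triple (user, observed item i, unobserved item j), BPR minimizes −log σ(x̂_ui − x̂_uj), pushing the observed item's score above the sampled negative's. A NumPy sketch (names illustrative):

```python
import numpy as np

def bpr_loss(pos_scores, neg_scores):
    """-log sigmoid(x_ui - x_uj), in the stable form log(1 + exp(-(diff)))."""
    return np.logaddexp(0.0, -(pos_scores - neg_scores)).mean()
```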

Sampled Softmax Loss (2015) — Approximates the full softmax over a large item vocabulary by sampling negatives. 📄 On Using Very Large Target Vocabulary for Neural Machine Translation — Jean et al.

DirectAU Loss (2022) — Directly optimizes alignment and uniformity on the hypersphere for collaborative filtering. 📄 Towards Representation Alignment and Uniformity in Collaborative Filtering — Wang et al. 💻 THUwangcy/DirectAU

Multi-Task Learning

Uncertainty Weighting / Homoscedastic Uncertainty (2018) — Learns task weights by modeling task-dependent uncertainty; noisy tasks are automatically downweighted. 📄 Multi-Task Learning Using Uncertainty to Weigh Losses — Kendall et al. 💻 median-research-group/LibMTL
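In the commonly used log-variance parameterization of the regression case, each task contributes exp(−sᵢ)·Lᵢ + sᵢ with sᵢ = log σᵢ² a learned scalar; the penalty term stops the model from inflating all variances to zero out the losses. A simplified NumPy sketch (a Kendall et al.-style form; names illustrative):

```python
import numpy as np

def uncertainty_weighted_loss(task_losses, log_vars):
    """Sum of precision-scaled task losses plus log-variance penalties."""
    task_losses = np.asarray(task_losses, dtype=float)
    log_vars = np.asarray(log_vars, dtype=float)
    return (np.exp(-log_vars) * task_losses + log_vars).sum()
```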

GradNorm (2018) — Dynamically normalizes gradient magnitudes across tasks to balance training rates. 📄 GradNorm: Gradient Normalization for Adaptive Loss Balancing — Chen et al.

MGDA (2018) — Multi-objective optimization finding a Pareto-optimal descent direction via Frank-Wolfe on task gradients. 📄 Multi-Task Learning as Multi-Objective Optimization — Sener & Koltun

PCGrad (2020) — Projects conflicting task gradients onto normal planes to reduce destructive interference. 📄 Gradient Surgery for Multi-Task Learning — Yu et al.

CAGrad (2021) — Minimizes the average loss while maximizing worst-case local improvement across tasks. 📄 Conflict-Averse Gradient Descent for Multi-task Learning — Liu et al.

Nash-MTL (2022) — Nash bargaining game where tasks negotiate a joint update direction. 📄 Multi-Task Learning as a Bargaining Game — Navon et al. 💻 AvivNavon/nash-mtl

Uncertainty Estimation

NLL with Learned Variance (1994) — Network predicts mean and variance; the NLL naturally trades off accuracy and calibration. 📄 Estimating the Mean and Variance of the Target Probability Distribution — Nix & Weigend
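Dropping the constant, the Gaussian NLL per sample is ½(log σ² + (y − μ)²/σ²); predicting log σ² keeps the variance positive and the expression stable. A NumPy sketch (names illustrative):

```python
import numpy as np

def gaussian_nll(mu, log_var, y):
    """Mean Gaussian NLL (constant term dropped) with log-variance parameterization."""
    return 0.5 * (log_var + (y - mu) ** 2 * np.exp(-log_var)).mean()
```

Residuals are downweighted where the network reports high variance, but the log σ² term charges for that uncertainty, which is what yields calibrated variance estimates.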

MC Dropout (2016) — Dropout at test time as approximate Bayesian inference for uncertainty estimation. 📄 Dropout as a Bayesian Approximation — Gal & Ghahramani

Deep Ensembles Loss (2017) — Ensemble of networks with proper scoring rules + adversarial training for diversity. 📄 Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles — Lakshminarayanan et al.

Evidential Deep Learning Loss (2018) — Dirichlet prior over class probabilities; Bayes risk + KL divergence regularizer. 📄 Evidential Deep Learning to Quantify Classification Uncertainty — Sensoy et al.

Domain Adaptation

Maximum Mean Discrepancy (MMD) (2012) — Distribution distance in an RKHS; aligns source and target features without adversarial training. 📄 A Kernel Two-Sample Test — Gretton et al. 💻 ZongxianLee/MMD_Loss.Pytorch
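A biased (V-statistic) estimate of squared MMD with an RBF kernel is E[k(x,x′)] + E[k(y,y′)] − 2E[k(x,y)] over the two samples. A brute-force NumPy sketch (names illustrative):

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(X, Y, gamma=1.0):
    """Biased estimate of squared MMD between samples X (n,d) and Y (m,d)."""
    return (rbf_kernel(X, X, gamma).mean()
            + rbf_kernel(Y, Y, gamma).mean()
            - 2.0 * rbf_kernel(X, Y, gamma).mean())
```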

Domain Adversarial Loss / DANN (2016) — Gradient reversal layer training a domain classifier adversarially for domain-invariant features. 📄 Domain-Adversarial Training of Neural Networks — Ganin et al. 💻 fungtion/DANN

Deep CORAL Loss (2016) — Aligns second-order statistics (covariances) of source and target deep features. 📄 Deep CORAL: Correlation Alignment for Deep Domain Adaptation — Sun & Saenko
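CORAL penalizes ‖C_s − C_t‖²_F / (4d²), the squared Frobenius distance between the source and target feature covariance matrices. A NumPy sketch (names illustrative):

```python
import numpy as np

def coral_loss(source, target):
    """Squared Frobenius distance between feature covariances, scaled by 4*d^2."""
    d = source.shape[1]
    cs = np.cov(source, rowvar=False)  # (d, d) source covariance
    ct = np.cov(target, rowvar=False)  # (d, d) target covariance
    return ((cs - ct) ** 2).sum() / (4.0 * d * d)
```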

Wasserstein Distance for DA (2018) — Earth Mover's Distance as a domain discrepancy measure with gradient penalty. 📄 Wasserstein Distance Guided Representation Learning for Domain Adaptation — Shen et al.

Contrastive Domain Discrepancy (CDD) (2019) — Class-aware alignment maximizing inter-class and minimizing intra-class discrepancy across domains. 📄 Contrastive Adaptation Network for Unsupervised Domain Adaptation — Kang et al.


Survey Papers

Key Implementation Libraries

Library Focus Link
PyTorch (built-in) CE, BCE, MSE, Huber, CTC, KLDiv, etc. pytorch.org
pytorch-metric-learning Triplet, Contrastive, ArcFace, ProxyNCA, etc. GitHub
SegLossOdyssey Dice, Tversky, Boundary, Hausdorff, etc. GitHub
Hugging Face TRL DPO, PPO, KTO, ORPO, SimPO, etc. GitHub
Stable-Baselines3 DQN, PPO, SAC, TD3, A2C, etc. GitHub
lightly SimCLR, BYOL, MoCo, DINO, Barlow Twins, etc. GitHub
insightface ArcFace, CosFace, Sub-center ArcFace GitHub
open_clip CLIP, SigLIP contrastive losses GitHub
PyTorch3D Chamfer, mesh losses, point cloud losses GitHub
PyTorch Geometric GNN losses, link prediction, node classification GitHub
LibMTL Uncertainty weighting, GradNorm, PCGrad, Nash-MTL GitHub
auraloss Multi-Resolution STFT, mel losses GitHub
BasicSR Perceptual, SSIM, Charbonnier, GAN losses for SR GitHub
kornia Focal, Dice, SSIM, and more GitHub
anomalib Anomaly detection losses and methods GitHub
Avalanche Continual learning (EWC, SI, LwF, etc.) GitHub
GluonTS Time series forecasting losses GitHub
audiocraft Audio generation (EnCodec, MusicGen) GitHub
AIF360 Fairness and bias mitigation GitHub

Star History

If you find this useful, please star the repo — it helps others discover it.

