Validation tolerance for mathematical correctness should use relative error in most circumstances #153

@dbsanfte

Bug Description

COSMA's K-dimension splitting strategy (parallel (k / 2)) appeared to produce catastrophically incorrect results (93.6% of elements flagged as errors) for certain matrix dimensions, while passing validation for smaller matrices.

UPDATE: This was NOT a COSMA algorithm bug! The bug was in the validation tolerance using absolute error instead of relative error.

Root Cause (CONFIRMED)

The validation code in utils/cosma_utils.hpp was using:

isOK = isOK && (std::abs(globC[i] - globCcheck[i]) < epsilon);  // epsilon = 1e-8

For large matrix multiplications:

  • Result values have magnitude ~27,000
  • Computation errors: ~0.02 absolute (relative error ~7e-7, a few float32 machine epsilons and well within float32 precision)
  • Absolute tolerance: 1e-8
  • Result: 93.6% of elements reported as "errors", even though COSMA was computing correct results!
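
The arithmetic in the list above can be checked with a short sketch. The function names `float_spacing_at`, `passes_absolute`, and `passes_relative` are hypothetical and not part of COSMA; the magnitude 27,000 and the error 0.02 are the figures quoted above.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Gap between x and the next representable float above it (one ULP at x).
inline float float_spacing_at(float x) {
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}

// Absolute check, in the style of the original validation.
inline bool passes_absolute(double computed, double reference, double epsilon) {
    return std::abs(computed - reference) < epsilon;
}

// Relative check: the error is scaled by the larger of the two magnitudes.
inline bool passes_relative(double computed, double reference, double tolerance) {
    const double abs_error = std::abs(computed - reference);
    const double scale = std::max(std::abs(computed), std::abs(reference));
    // Near zero there is nothing meaningful to divide by, so use the absolute error.
    const double rel_error = (scale > 1e-10) ? abs_error / scale : abs_error;
    return rel_error < tolerance;
}
```

At a magnitude of 27,000, adjacent float32 values are 2^-9 (about 0.00195) apart, so two float32 results that differ at all can never be within 1e-8 of each other. `passes_absolute(27000.02, 27000.0, 1e-8)` is false, while `passes_relative(27000.02, 27000.0, 1e-5)` is true because the relative error is about 7.4e-7.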

Fix

Pull Request: #154

Changed to relative error validation:

double abs_error = std::abs(globC[i] - globCcheck[i]);
double scale = std::max(std::abs(globC[i]), std::abs(globCcheck[i]));
double rel_error = (scale > 1e-10) ? abs_error / scale : abs_error;
double tolerance = (sizeof(Scalar) == 4) ? 1e-5 : epsilon;
isOK = isOK && (rel_error < tolerance);
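
For anyone who wants to try the check in isolation, here is a minimal sketch that wraps the same logic in a standalone function. The name `within_tolerance` and its signature are assumptions made for illustration, not COSMA's API, and it only handles real scalar types such as float and double.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper (not part of COSMA) mirroring the relative-error check above.
template <typename Scalar>
bool within_tolerance(Scalar computed, Scalar reference, double epsilon = 1e-8) {
    const double c = static_cast<double>(computed);
    const double r = static_cast<double>(reference);
    const double abs_error = std::abs(c - r);
    const double scale = std::max(std::abs(c), std::abs(r));
    // Near zero, dividing by scale would blow up, so fall back to the absolute error.
    const double rel_error = (scale > 1e-10) ? abs_error / scale : abs_error;
    // 4-byte scalars (float) get the looser 1e-5 tolerance; others keep epsilon.
    const double tolerance = (sizeof(Scalar) == 4) ? 1e-5 : epsilon;
    return rel_error < tolerance;
}
```

With this sketch, `within_tolerance<float>(27000.02f, 27000.0f)` returns true, while `within_tolerance<double>(27000.02, 27000.0)` returns false because double results are still held to the tighter 1e-8 relative tolerance.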

Verification

After fix:

# 32×896×896 float32: NOW PASSES
mpirun -np 2 cosma_miniapp -m 32 -n 896 -k 896 --test --type float
# Result is OK ✅

# 32×10000×896 float32: NOW PASSES
mpirun -np 2 cosma_miniapp -m 32 -n 10000 -k 896 --test --type float
# Result is OK ✅

# 32×896×896 float64: PASSES
mpirun -np 2 cosma_miniapp -m 32 -n 896 -k 896 --test --type double
# Result is OK ✅

Apology

Sorry for the false alarm! COSMA's K-split algorithm was working correctly all along. The issue was that the validation tolerance was too strict for realistic floating-point computations, especially for:

  • Large matrix dimensions (where results have large magnitude)
  • Float32 precision (which needs ~1e-5 relative tolerance, not 1e-8 absolute)
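
To put the two tolerances in perspective, the sketch below expresses a relative tolerance as a multiple of the machine epsilon of a given type. The helper `tolerance_in_epsilons` is a hypothetical name introduced only for this illustration.

```cpp
#include <limits>

// Hypothetical helper: how many machine epsilons of type T fit into a
// given relative tolerance.
template <typename T>
double tolerance_in_epsilons(double tolerance) {
    return tolerance / static_cast<double>(std::numeric_limits<T>::epsilon());
}
```

For float32, 1e-5 is roughly 84 machine epsilons, which leaves room for rounding to accumulate over a long K dimension. By contrast, 1e-8 is less than one float32 epsilon, so two distinct float32 values always differ by more than 1e-8 in relative terms. For float64, 1e-8 corresponds to tens of millions of epsilons, which is why the existing epsilon remains usable as a relative tolerance there.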

The identical float/double errors, which I took as proof of a logic bug, occurred because both results were numerically correct; they were simply failing an overly strict validation!

Environment

  • COSMA Version: v2.6.0 (commit a3101bb)
  • System: 2-socket Intel Xeon, 28 cores/socket, NUMA-aware
  • MPI: OpenMPI 4.1.x
  • BLAS: OpenBLAS 0.3.x
  • Compiler: GCC 11.4

Files changed:

  • utils/cosma_utils.hpp (validation logic)
