(Still) Excessive memory usage #118

@fstein93

Dear authors,

I am one of the CP2K developers and I am working on our quartically-scaling SOS-MP2 and RPA implementations. Marko Kabic used energy-only calculations with RPA to benchmark COSMA (test system: 128 water molecules).
I am currently implementing gradients for these methods. I know that my gradient implementation (available in the CP2K master trunk) requires roughly 3-4 times the memory of an energy-only calculation. I am testing the code on the GPU section of Daint. The code runs well with ScaLAPACK (libsci_acc). With COSMA, I can run my code on a smaller system (up to 64 water molecules) and see a decent speedup of the PDGEMM calls compared to ScaLAPACK. Unfortunately, I cannot run larger systems (like 128 water molecules), even on 1000 nodes.

A gradient calculation consists of a set of two calls to PDGEMM with the following global sizes in the case of 128 H2O molecules:

  1. n=m=17,408 and k=3,473,408 (also in case of energy-only calculations)
  2. n=3,473,408 and m=k=17,408 (not required in case of energy-only calculations).
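
For orientation, the global operand footprint of these two calls can be estimated with a short script. This is only a rough sketch under assumptions not stated above: double-precision real matrices (8 bytes per element), the usual PDGEMM shapes (A is m x k, B is k x n, C is m x n), and a perfectly even distribution over nodes. The function names and the node counts in the loop are illustrative. The estimate counts the operands only and says nothing about COSMA's internal buffers or communication workspace.

```python
# Back-of-the-envelope footprint of the PDGEMM operands listed above.
# Assumption: double-precision real data, 8 bytes per element.
BYTES_PER_ELEMENT = 8
GIB = 2**30


def pdgemm_operand_elements(m, n, k):
    """Element counts of A (m x k), B (k x n) and C (m x n)."""
    return {"A": m * k, "B": k * n, "C": m * n}


def footprint_gib(m, n, k, nodes):
    """Return (total GiB, GiB per node), assuming an even distribution."""
    total_bytes = sum(pdgemm_operand_elements(m, n, k).values()) * BYTES_PER_ELEMENT
    return total_bytes / GIB, total_bytes / nodes / GIB


# Global sizes (m, n, k) for 128 H2O molecules, taken from the list above.
calls = {
    "call 1 (energy and gradient)": (17_408, 17_408, 3_473_408),
    "call 2 (gradient only)": (17_408, 3_473_408, 17_408),
}

if __name__ == "__main__":
    for label, (m, n, k) in calls.items():
        for nodes in (128, 1000, 2048):  # illustrative node counts
            total, per_node = footprint_gib(m, n, k, nodes)
            print(f"{label}: {total:.1f} GiB total, "
                  f"{per_node:.2f} GiB per node on {nodes} nodes")
```

Both calls involve two large operands and one small one, so the operand totals are identical. Anything beyond this per-node figure would come from the library's own buffers.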

When COSMA is called, I observe out-of-memory events on both the GPU and the CPU, depending on the setup.

My questions are:

  1. What are COSMA's memory requirements, or at least what scaling behavior should I expect?
  2. Could you add a message reporting the actual amount of missing memory in cases where COSMA is able to catch the OOM event?
  3. Could you provide a function that asks COSMA to release its buffers, so that the resources COSMA holds while idle can be used for other operations?

EDIT:
I can run energy-only calculations with 128 water molecules (just PDGEMM step 1) on 64 nodes. I can run the calculations on 2048 Daint nodes. Nevertheless, the memory requirements are extremely high, and it is very frustrating (and a waste of resources) to search for a suitable number of nodes for a given calculation.

EDIT2:
The calculation with COSMA on 2048 nodes requires 3 times the resources of the same calculation with ScaLAPACK on 128 nodes.
