
FEAT: LeMatRho data fetch, transform. Downsample rho during fetch. #49

Merged -- speckhard merged 11 commits into main from feat/LeMatRho on Apr 7, 2026

Conversation


@speckhard speckhard commented Feb 25, 2026

What Is LeMatRho?

LeMatRho is an effort to open-source charge density results from DFT. The raw VASP outputs live in an authenticated S3 bucket (lemat-rho) (talk to me for access) and include CHGCAR, AECCAR0/1/2, and vasprun.xml files for each material.

This PR adds LeMatRho as a new data source inside lematerial-fetcher so we
can fetch, compress, analyse, and publish charge densities as a HuggingFace
dataset.


Why This Codebase?

We built on lematerial-fetcher (instead of writing from scratch) because:

  1. It already has a BaseFetcher / BaseTransformer pattern for pulling data
    from external sources (Materials Project, OQMD, Alexandria) into a shared
    OPTIMADE data model.
  2. The Materials Project fetcher already talks to AWS S3, so auth and download
    infrastructure was partially in place.
  3. The shared OptimadeStructure Pydantic model, PostgreSQL schema, and
    HuggingFace push tooling can be reused with minimal extensions.

Two Pipeline Architectures

1. Traditional Pipeline (lematrho fetch + lematrho transform)

Follows the repo's existing two-step pattern:

```
S3 bucket  -->  [fetch.py]  -->  PostgreSQL (raw_structures)
                                       |
                                       v
                                [transform.py]  -->  PostgreSQL (optimade)
```
  • fetch.py (196 lines): Downloads CHGCAR/AECCAR files from S3, compresses
    them with pyrho, stores compressed grids in PostgreSQL.
  • transform.py (279 lines): Reads raw structures from PostgreSQL, converts
    to OptimadeStructure, optionally re-downloads files from S3 to run
    Bader/DDEC6 charge analysis.

Included for consistency with the existing architecture. Not recommended for production, because the transform step must re-download files from S3 for charge analysis (double download).

2. Direct Pipeline (lematrho run) -- RECOMMENDED

A single-pass architecture that bypasses PostgreSQL entirely:

```
S3 bucket (lemat-rho)
    |
    v
[List material folders -- paginated S3 listing]
    |
    v  ProcessPoolExecutor (4 workers, work-stealing)
Per material, in a single pass:
  1. Download vasprun.xml.gz  ->  parse relaxed structure
  2. Download CHGCAR/AECCAR0/1/2  ->  compress via pyrho
  3. Run Bader analysis (if tools available, reuses same raw bytes)
  4. Run DDEC6 analysis (if tools available, reuses same raw bytes)
  5. Build OptimadeStructure (Pydantic validation)
  6. gc.collect()  ->  return flat dict
    |
    v  Main process
Buffer rows  ->  write Parquet chunk every N rows (atomic: .tmp -> rename)
Append material_id to checkpoint file after each success
    |
    v  Optional
Load Parquet dir as HF Dataset  ->  push_to_hub()
```
  • pipeline.py (679 lines): The entire direct pipeline.
  • utils.py (283 lines): Shared constants, helpers, and charge analysis
    wrappers used by all three pipeline files.

File Map

All LeMatRho source code lives under
src/lematerial_fetcher/fetcher/lematrho/:

| File | Lines | Purpose |
| --- | --- | --- |
| `__init__.py` | 1 | Package marker |
| `fetch.py` | 196 | `LeMatRhoFetcher(BaseFetcher)` -- S3 to PostgreSQL |
| `transform.py` | 279 | `LeMatRhoTransformer(BaseTransformer)` -- PostgreSQL raw to OPTIMADE |
| `pipeline.py` | 679 | `LeMatRhoDirectPipeline` -- single-pass S3 to Parquet |
| `utils.py` | 283 | Shared constants, helpers, Bader/DDEC6 wrappers |

Modified files elsewhere in the repo:

| File | What changed |
| --- | --- |
| `models/optimade.py` | 8 new optional charge density fields + nsites validators |
| `database/postgres.py` | 8 matching columns (JSONB, FLOAT[], INTEGER[]) |
| `utils/config.py` | `DirectPipelineConfig` dataclass, extended `FetcherConfig`/`TransformerConfig` |
| `utils/cli.py` | `add_lematrho_fetch_options`, `add_lematrho_transform_options`, `add_lematrho_direct_options` |
| `utils/aws.py` | `get_authenticated_aws_client()` (adaptive retry, credential chain) |
| `cli.py` | `lematrho` command group with `fetch`, `transform`, `run` subcommands |
| `push.py` | Charge density fields in HuggingFace Features + JSON serialisation |
| `models/utils/enums.py` | `Source.LEMATRHO = "lematrho"` |
| `pyproject.toml` | Added `mp-pyrho>=0.3.1` dependency |
| `.env.example` | AWS, LeMatRho, tool path variable templates |

Data Model: New Fields on OptimadeStructure

All 8 fields are Optional and default to None, so existing data sources
(MP, Alexandria, OQMD) are unaffected.

| Field | Type | Description |
| --- | --- | --- |
| `compressed_charge_density` | `Optional[list]` | 3D nested float list from pyrho lossy compression of CHGCAR |
| `compressed_aeccar0` | `Optional[list]` | Same for AECCAR0 (all-electron core charge density) |
| `compressed_aeccar1` | `Optional[list]` | Same for AECCAR1 (pseudo valence charge density) |
| `compressed_aeccar2` | `Optional[list]` | Same for AECCAR2 (pseudo core charge density) |
| `charge_density_grid_shape` | `Optional[list[int]]` | Shape of compressed grid, e.g. `[15, 15, 15]` |
| `bader_charges` | `Optional[list[float]]` | Bader net charges per site (validated against nsites) |
| `bader_atomic_volume` | `Optional[list[float]]` | Bader atomic volumes per site (validated against nsites) |
| `ddec6_charges` | `Optional[list[float]]` | DDEC6 net atomic charges per site (validated against nsites) |

Per-site fields (bader_charges, bader_atomic_volume, ddec6_charges) are
validated by _validate_with_number_of_sites() to ensure their length matches
nsites.
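As a sketch, the length check behaves like the standalone helper below (illustrative signature; the real check is a Pydantic validator on `OptimadeStructure`):

```python
def validate_with_number_of_sites(values, nsites, field_name):
    """Ensure a per-site list has exactly `nsites` entries.

    None passes through, because these fields are optional and stay unset
    when the corresponding external tool is unavailable.
    """
    if values is None:
        return None
    if len(values) != nsites:
        raise ValueError(
            f"{field_name} has {len(values)} entries, expected nsites={nsites}"
        )
    return values
```

This catches mismatched Bader/DDEC6 output early, before a bad row reaches Parquet.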


Key Design Decisions

Single-pass S3 access (direct pipeline)

Each S3 file is downloaded exactly once. The raw decompressed bytes are
held in memory so they can be reused for both pyrho compression and
Bader/DDEC6 analysis. The traditional pipeline cannot do this because fetch and
transform are separate steps.

No PostgreSQL (direct pipeline)

The end target is Parquet files on HuggingFace. PostgreSQL adds schema
migration, connection management, and disk space overhead with no benefit for
this use case. DirectPipelineConfig has zero database fields.

Crash-safe checkpointing

A text file (.checkpoint.txt) records each successfully processed material
ID. On restart, the pipeline skips already-processed materials and resumes
writing Parquet chunks from the next index.
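A minimal sketch of that checkpoint protocol, assuming one material ID per line (helper names are illustrative, not the pipeline's actual API):

```python
import os

def load_checkpoint(path):
    """Return the set of already-processed material IDs (empty if no file)."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def append_checkpoint(path, material_id):
    """Record one successfully processed material, one ID per line."""
    with open(path, "a") as f:
        f.write(material_id + "\n")
```

On resume, the listing step simply skips any folder whose ID is in `load_checkpoint(".checkpoint.txt")`.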

Atomic Parquet writes

Chunks are written to a .tmp file first, then renamed via os.rename()
(POSIX atomic on the same filesystem). Stale .tmp files from crashed runs are
ignored on resume.
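The write pattern can be sketched as follows (function name illustrative; the real writer emits Parquet bytes rather than a raw payload):

```python
import os

def write_chunk_atomically(data: bytes, final_path: str) -> None:
    """Write to a .tmp sibling, then rename into place.

    os.rename() is atomic on POSIX when source and destination live on the
    same filesystem, so a reader never observes a half-written chunk: either
    the old state exists or the complete new file does.
    """
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # ensure bytes hit disk before the rename
    os.rename(tmp_path, final_path)
```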

Work-stealing parallelism

Uses concurrent.futures.wait(FIRST_COMPLETED) with bounded submission (2x
num_workers futures in flight) to avoid creating hundreds of thousands of
Future objects in memory.
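A self-contained sketch of the bounded-submission pattern; it uses a thread pool so the example runs standalone, whereas the pipeline uses `ProcessPoolExecutor` (the structure is identical):

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def run_bounded(task, items, num_workers=4):
    """Process items with at most 2 * num_workers futures in flight.

    Completed futures are harvested with wait(FIRST_COMPLETED) before new
    work is submitted, so memory stays flat even for very long item lists.
    """
    results = []
    it = iter(items)
    in_flight = set()
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        while True:
            # Top up the in-flight set to the bound.
            while len(in_flight) < 2 * num_workers:
                try:
                    in_flight.add(pool.submit(task, next(it)))
                except StopIteration:
                    break
            if not in_flight:
                break
            done, in_flight = wait(in_flight, return_when=FIRST_COMPLETED)
            results.extend(f.result() for f in done)
    return results
```

Results arrive in completion order, not submission order, which is fine here because each row carries its own `material_id`.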

Memory management

Sequential file processing within each worker, explicit del of raw bytes
after use, gc.collect() after each material. Conservative default of 4
workers because each CHGCAR can be hundreds of MB when decompressed.

Graceful tool degradation

External tools (bader, chargemol) are validated at init. If any tool is
missing, the corresponding output fields are set to None rather than
failing the pipeline. Bader and DDEC6 are independent -- one can fail while the
other succeeds.

Cross-compatibility

All LeMatRho structures are marked cross-compatible (no element exclusions).
This differs from the Alexandria fetcher, which excludes Yb-containing
structures. The get_cross_compatibility() function always returns True.


Charge Density Processing: How It Works

pyrho Compression

For each of CHGCAR, AECCAR0, AECCAR1, AECCAR2:

  1. Parse raw bytes with pymatgen.io.vasp.Chgcar (writes to temp file first
    because pymatgen needs a file path).
  2. Convert to pyrho.charge_density.ChargeDensity via ChargeDensity.from_pmg().
  3. Apply lossy smooth compression:
    pgrids["total"].lossy_smooth_compression(grid_shape).
  4. Default grid shape: (15, 15, 15), configurable via --grid-shape.
  5. The result is a nested Python list, stored as JSON in Parquet.
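For intuition about the shape change only, the sketch below does a naive nearest-neighbour downsample of a nested grid. This is not pyrho's algorithm (`lossy_smooth_compression` resamples smoothly rather than picking points), but input and output have the same structure the pipeline stores: a fine CHGCAR grid in, a small nested list out.

```python
def downsample_nested(grid, new_shape):
    """Nearest-neighbour downsample of a 3D nested list to new_shape.

    Conceptual stand-in for pyrho's lossy compression, illustrating how a
    fine grid (e.g. 100x100x100) collapses to e.g. (15, 15, 15).
    """
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    mx, my, mz = new_shape
    return [
        [
            [grid[i * nx // mx][j * ny // my][k * nz // mz] for k in range(mz)]
            for j in range(my)
        ]
        for i in range(mx)
    ]
```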

Bader Charge Analysis (optional)

Requires: bader executable on PATH, PMG_VASP_PSP_DIR environment variable.

Uses pymatgen.command_line.bader_caller.BaderAnalysis, which manages the
external bader binary internally. The pipeline does not call subprocess
directly.

Steps (in run_bader_from_bytes() in utils.py):

  1. Write raw CHGCAR, AECCAR0, AECCAR2 bytes to a temp directory.
  2. Generate POTCAR via MatPESStaticSet(structure).potcar.
  3. Sum AECCAR0 + AECCAR2 using pymatgen Chgcar arithmetic
    (Chgcar.from_file("AECCAR0") + Chgcar.from_file("AECCAR2")), write result
    as CHGCAR_sum.
  4. Call BaderAnalysis(chgcar_filename=..., potcar_filename=..., chgref_filename=..., bader_path=...).
  5. Extract summary["charge_transfer"] and summary["atomic_volume"].
  6. Sign convention: charge_transfer is electron_count - valence
    (positive = gained electrons). We negate it:
    [-ct for ct in ba.summary["charge_transfer"]] so positive = cationic,
    matching the convention used in the rest of the codebase.

Returns: (net_charges: list[float], atomic_volumes: list[float]) or
(None, None) on failure.
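The sign flip in step 6, with hypothetical values:

```python
# Hypothetical per-site values: pymatgen's charge_transfer is
# electron_count - valence, so a site that gained 0.8 electrons
# has charge_transfer = +0.8.
charge_transfer = [0.8, -0.3, -0.5]

# Negate so that positive means cationic (lost electrons), matching
# the convention used in the rest of the codebase.
net_charges = [-ct for ct in charge_transfer]
# net_charges == [-0.8, 0.3, 0.5]
```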

DDEC6 Charge Analysis (optional)

Requires: chargemol executable, atomic densities directory,
PMG_VASP_PSP_DIR environment variable.

Uses pymatgen.command_line.chargemol_caller.ChargemolAnalysis, which manages
the external chargemol binary internally.

Steps (in run_ddec6_from_bytes() in utils.py):

  1. Write raw CHGCAR bytes to a temp directory.
  2. Generate POTCAR via MatPESStaticSet(structure).potcar.
  3. Temporarily set CHARGEMOL_COMMAND environment variable to the chargemol
    path (pymatgen reads this env var).
  4. Call ChargemolAnalysis(path=tmpdir, atomic_densities_path=..., run_chargemol=True).
  5. Extract ca.ddec_charges.
  6. Restore original CHARGEMOL_COMMAND value in a finally block.

Thread safety note: The CHARGEMOL_COMMAND save/restore pattern is
process-safe (each ProcessPoolExecutor worker has its own environment) but
not thread-safe. Do not call from multiple threads in the same process.

Returns: list[float] (net charges per site) or None on failure.
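The save/restore pattern in steps 3 and 6 can be sketched as a small wrapper (helper name illustrative):

```python
import os

def with_chargemol_command(chargemol_path, fn):
    """Run fn() with CHARGEMOL_COMMAND temporarily set, then restore it.

    Process-safe but, as noted above, not thread-safe: os.environ is
    shared by all threads in a process.
    """
    sentinel = object()
    old = os.environ.get("CHARGEMOL_COMMAND", sentinel)
    os.environ["CHARGEMOL_COMMAND"] = chargemol_path
    try:
        return fn()
    finally:
        # Restore the previous value, or remove the variable if it was unset.
        if old is sentinel:
            os.environ.pop("CHARGEMOL_COMMAND", None)
        else:
            os.environ["CHARGEMOL_COMMAND"] = old
```

The `finally` block guarantees restoration even when `ChargemolAnalysis` raises.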

Much of the pyrho/Bader/DDEC6 code is derived from @msiron-entalpic's code here.

Shared Code Architecture

Both pipeline variants (traditional and direct) delegate charge analysis to
the same two functions in utils.py:

  • run_bader_from_bytes(structure, raw_files, bader_path, material_id)
  • run_ddec6_from_bytes(structure, raw_files, chargemol_path, atomic_densities_path, material_id)

This eliminates code duplication. The difference between the two pipelines is
only in how they obtain the raw bytes:

  • transform.py: Downloads from S3 via self.aws_client, then delegates.
  • pipeline.py: Already has bytes in memory from the compression step,
    passes them directly.

Constants in utils.py

```python
STATIC_CALC_TYPE = "LeMatRhoStaticMaker"     # S3 folder for static calculations
RELAX_CALC_TYPE = "LeMatRhoRelaxMaker_1"     # S3 folder for relaxation
STATIC_FILES = ["CHGCAR.gz", "AECCAR0.gz", "AECCAR1.gz", "AECCAR2.gz"]
RELAX_FILES = ["vasprun.xml.gz"]
VALID_PREFIXES = ("oqmd-", "mp-", "agm")     # Only process these material ID prefixes
DEFAULT_MAX_WORKERS = 4                      # Conservative due to memory per CHGCAR
GRID_KEY_MAP = {                             # S3 filename -> compressed grid key
    "CHGCAR.gz": "charge_density",
    "AECCAR0.gz": "aeccar0",
    "AECCAR1.gz": "aeccar1",
    "AECCAR2.gz": "aeccar2",
}
```

Helper functions in utils.py

| Function | Purpose |
| --- | --- |
| `download_gz_file_from_s3(client, bucket, key)` | Download + gzip decompress from S3 |
| `parse_vasprun_structure(vasprun_bytes)` | Parse vasprun.xml bytes to pymatgen `Structure` |
| `compress_chgcar(chgcar_bytes, grid_shape)` | Parse CHGCAR + lossy compress via pyrho |
| `build_raw_structure(material_id, structure, compressed_grids, grid_shape, s3_prefix)` | Build `RawStructure` for PostgreSQL insertion |
| `write_potcar(structure, tmpdir)` | Generate POTCAR via `MatPESStaticSet` |
| `run_bader_from_bytes(structure, raw_files, bader_path, material_id)` | Bader analysis via pymatgen wrapper |
| `run_ddec6_from_bytes(structure, raw_files, chargemol_path, atomic_densities_path, material_id)` | DDEC6 analysis via pymatgen wrapper |
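The core of the download helper can be sketched with a boto3-style client (error handling and logging omitted; per the commit notes, the real helper also closes the `StreamingBody` and frees the compressed buffer promptly):

```python
import gzip

def download_gz_file_from_s3(client, bucket, key):
    """Download an S3 object and gzip-decompress it in memory.

    Assumes a boto3-style client whose get_object() returns a dict with a
    "Body" stream exposing read() and close().
    """
    body = client.get_object(Bucket=bucket, Key=key)["Body"]
    try:
        compressed = body.read()
    finally:
        body.close()  # release the HTTP connection promptly
    return gzip.decompress(compressed)
```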

S3 Bucket Structure

The lemat-rho bucket contains one folder per material:

```
lemat-rho/
  agm000001/
    LeMatRhoRelaxMaker_1/
      vasprun.xml.gz          # Relaxed structure
    LeMatRhoStaticMaker/
      CHGCAR.gz               # Total charge density
      AECCAR0.gz              # All-electron core charge density
      AECCAR1.gz              # Pseudo valence charge density
      AECCAR2.gz              # Pseudo core charge density
  mp-1234/
    ...
  oqmd-5678/
    ...
```

Material IDs have prefixes indicating their origin database:

  • agm -- Alexandria
  • mp- -- Materials Project
  • oqmd- -- OQMD

Only folders matching VALID_PREFIXES are processed.
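The prefix filter is a one-liner, since `str.startswith` accepts a tuple (helper name illustrative):

```python
VALID_PREFIXES = ("oqmd-", "mp-", "agm")

def is_valid_material_id(material_id: str) -> bool:
    """True if the folder name matches one of the known origin prefixes."""
    return material_id.startswith(VALID_PREFIXES)
```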


Parquet Schema (Direct Pipeline)

The direct pipeline writes 33-column Parquet files. Compressed grids are stored
as JSON-serialised strings (not nested arrays) because Parquet handles
variable-depth nesting poorly. Per-site arrays use pa.list_(pa.float64()).
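The grid columns therefore round-trip through `json` (a 2x2x2 grid here for brevity; the pipeline default is 15x15x15):

```python
import json

# A compressed grid is a plain nested float list.
grid = [[[0.0, 0.1], [0.2, 0.3]], [[0.4, 0.5], [0.6, 0.7]]]

# Stored in Parquet as a string column holding the JSON text...
serialized = json.dumps(grid)

# ...and recovered losslessly on load, regardless of nesting depth.
restored = json.loads(serialized)
```

This sidesteps Parquet's need for a fixed nested schema at the cost of parsing JSON on read.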

Key column types:

| Column | PyArrow Type | Notes |
| --- | --- | --- |
| `compressed_charge_density` | `pa.string()` | JSON-serialised 3D list |
| `compressed_aeccar0` | `pa.string()` | JSON-serialised 3D list |
| `compressed_aeccar1` | `pa.string()` | JSON-serialised 3D list |
| `compressed_aeccar2` | `pa.string()` | JSON-serialised 3D list |
| `charge_density_grid_shape` | `pa.list_(pa.int32())` | e.g. `[15, 15, 15]` |
| `bader_charges` | `pa.list_(pa.float64())` | Per-site |
| `bader_atomic_volume` | `pa.list_(pa.float64())` | Per-site |
| `ddec6_charges` | `pa.list_(pa.float64())` | Per-site |
| `species` | `pa.string()` | JSON-serialised |
| `functional` | `pa.string()` | Enum `.value` |
| `last_modified` | `pa.string()` | ISO 8601 |

Full schema is defined as PARQUET_SCHEMA in pipeline.py.


Configuration

DirectPipelineConfig (dataclass)

```python
lematrho_bucket_name: str = "lemat-rho"
lematrho_grid_shape: tuple[int, int, int] = (15, 15, 15)
output_dir: str = "./lematrho_output"
parquet_chunk_size: int = 1000
num_workers: int = 4
log_every: int = 100
limit: Optional[int] = None              # Process only N materials (for testing)
hf_repo_id: Optional[str] = None         # HuggingFace repo to push to
hf_token: Optional[str] = None
bader_path: Optional[str] = None         # Auto-detected on PATH if not set
chargemol_path: Optional[str] = None     # Auto-detected on PATH if not set
atomic_densities_path: Optional[str] = None
```

Environment Variables

| Variable | Required | Purpose |
| --- | --- | --- |
| `AWS_ACCESS_KEY_ID` | Yes | S3 authentication |
| `AWS_SECRET_ACCESS_KEY` | Yes | S3 authentication |
| `AWS_DEFAULT_REGION` | No (default: `us-east-1`) | S3 region |
| `PMG_VASP_PSP_DIR` | For Bader/DDEC6 | VASP pseudopotential directory |

See .env.example for the full template.


CLI Usage

```shell
# Direct pipeline (recommended) -- no PostgreSQL needed
lematerial_fetcher lematrho run \
    --output-dir ./lematrho_output \
    --grid-shape 15 15 15 \
    --parquet-chunk-size 1000 \
    --num-workers 4 \
    --bader-path /path/to/bader \
    --chargemol-path /path/to/chargemol \
    --atomic-densities-path /path/to/atomic_densities \
    --hf-repo-id entalpic/lemat-rho \
    --hf-token $HF_TOKEN

# Debug mode (sequential processing, useful for debugging)
lematerial_fetcher --debug lematrho run --output-dir ./output

# Traditional pipeline (fetch -> transform, requires PostgreSQL)
lematerial_fetcher lematrho fetch \
    --db-user user --table-name raw_lematrho

lematerial_fetcher lematrho transform \
    --db-user user --table-name raw_lematrho \
    --dest-table-name optimade_lematrho
```

External Dependencies

| Dependency | Required for | How to get it |
| --- | --- | --- |
| `mp-pyrho>=0.3.1` | Charge density compression | pip install (in `pyproject.toml`) |
| `bader` executable | Bader charge analysis | External binary (Henkelman group) |
| `chargemol` executable | DDEC6 charge analysis | External binary (chargemol.net) |
| Atomic densities directory | DDEC6 input data | Bundled with chargemol distribution |
| `PMG_VASP_PSP_DIR` | POTCAR generation | Points to VASP pseudopotentials directory |
| AWS credentials | S3 access | Environment variables or IAM role |

All external tools are optional. Missing tools result in None fields
rather than pipeline failure.


Test Coverage

182 total test functions across the repo, of which 90 are
LeMatRho-specific:

| Test File | Tests | Coverage |
| --- | --- | --- |
| `test_lematrho_fetch.py` | 20 | S3 listing, prefix filtering, checkpoint exclusion, download, compression |
| `test_lematrho_transform.py` | 25 | `transform_row`, cross-compatibility, tool validation, S3 download delegation, Bader/DDEC6 integration |
| `test_lematrho_pipeline.py` | 45 | End-to-end pipeline, checkpointing, Parquet writing, structure-to-row serialisation, Bader/DDEC6 helpers, env var restoration, HuggingFace push mock |
| `test_cli.py` (lematrho portion) | ~15 | CLI option parsing for all 3 subcommands |
| `test_config.py` (lematrho portion) | -- | `DirectPipelineConfig` construction and defaults |
| `test_aws.py` | -- | Authenticated S3 client setup |
| `test_optimade_model.py` | -- | Charge density field validation, nsites checks |
Integration test scaffolds (@pytest.mark.integration) exist for real S3
testing but are deselected by default.


Known Open Items

  1. HuggingFace push needs rework. The code currently treats LeMatRho as if
    it will be pushed into the existing LeMatBulk dataset. LeMatRho should be
    its own standalone dataset. The push.py script needs modification to
    handle LeMatRho-specific features and schema separately.

  2. HuggingFace schema verification. Need to confirm the HF dataset schema
    includes the new columns and that compressed grid fields serialise correctly
    as JSON strings when loaded back.

  3. Stale .env.example reference. The .env.example file still contains
    LEMATERIALFETCHER_CHGSUM_SCRIPT_PATH (line 105), which was removed from
    the actual codebase. This line should be deleted.

  4. No subprocess timeout control. The pymatgen wrappers (BaderAnalysis,
    ChargemolAnalysis) do not expose subprocess timeout parameters. If an
    external tool hangs, the worker process will block indefinitely. Monitor in
    production.

  5. Memory profiling. Each decompressed CHGCAR can be 100-500 MB. With 4
    workers, peak RSS could reach multiple GB. Needs profiling with real data
    before scaling up --num-workers.


Quick-Start Checklist for a New Engineer

  1. Clone the repo and install in dev mode: pip install -e ".[dev]"
  2. Copy .env.example to .env and fill in your AWS credentials.
  3. Run the tests: pytest tests/fetcher/lematrho/ -v (no external tools
    needed; everything is mocked).
  4. For a real smoke test with S3 data:

     ```shell
     lematerial_fetcher --debug lematrho run \
         --output-dir ./test_output \
         --num-workers 1 \
         --parquet-chunk-size 5 \
         --limit 10
     ```
  5. Read utils.py first -- it's the shared foundation. Then pipeline.py for
    the direct pipeline, or fetch.py + transform.py for the traditional one.
  6. All docstrings follow Google style (Args/Returns/Raises sections).

@speckhard speckhard self-assigned this Feb 25, 2026
@speckhard speckhard added the enhancement New feature or request label Feb 25, 2026
Add `lematrho run` CLI command that downloads charge density data from
S3, compresses via pyrho, optionally runs Bader/DDEC6 analysis, and
writes Parquet files directly. Crash-safe via checkpoint file, atomic
Parquet writes, work-stealing parallelism.
- Move shared constants (STATIC_CALC_TYPE, GRID_KEY_MAP, timeouts) and
  write_potcar() into utils.py as single source of truth
- Convert all lematrho docstrings to Google style
- Add 11 new pipeline tests: _validate_tools, DDEC6 happy path,
  _structure_to_row with None fields, _push_to_huggingface, DDEC6 in
  _process_material
- Add @pytest.mark.integration scaffold with .env.integration loading
- Remove unused --force flag from lematrho transform (was silently ignored)
- Fix batch: Any -> batch: str type annotation in fetch.py
- Add .env.* to .gitignore with !.env.example exception
- Add LeMatRho/AWS/HuggingFace variables to .env.example
- Add pytest integration marker config to pyproject.toml
- Vasprun and Chgcar.from_file require filesystem paths, not BytesIO
  objects. Write bytes to temp files before parsing.
- Add --limit CLI option to cap number of materials processed (useful
  for smoke testing without processing all ~76k materials).
- Verified with real S3 smoke test: 2 materials processed end-to-end
  with compressed charge densities at 10x10x10 grid shape.
- Batch-checkpoint after Parquet flush instead of per-material to prevent
  desync on crash (up to chunk_size materials could be lost)
- Track failed materials in .failures.txt, skip on resume
- Close S3 StreamingBody and free compressed buffer in download_gz_file_from_s3
- Replace NamedTemporaryFile(delete=True) with TemporaryDirectory for
  reliable temp file lifetime in parse_vasprun_structure and compress_chgcar
- Fix _list_materials docstring ("Sorted" claim was incorrect)
- Document memory trade-off for raw_files kept for Bader/DDEC6
- Add 5 new tests (batch checkpoint, failure load/append/resume)
Replace manual subprocess Bader/DDEC6 calls with pymatgen's BaderAnalysis
and ChargemolAnalysis wrappers. Replace perl chgsum.pl with Chgcar arithmetic.
Extract shared run_bader_from_bytes() and run_ddec6_from_bytes() helpers
to utils.py, eliminating ~100 lines of duplication between transform.py
and pipeline.py. Remove chgsum_script_path from config/CLI and timeout
constants from utils. Add module-level docstrings to all LeMatRho files.
…ct pipeline

Delete fetch.py, transform.py, and their tests. The traditional
PostgreSQL-based pipeline added confusion and required double S3 downloads.
Only the direct S3-to-Parquet pipeline (lematrho run) remains.

Move get_cross_compatibility() to utils.py so pipeline.py has no dependency
on the deleted transform.py. Remove lematrho fields from FetcherConfig and
TransformerConfig. Remove fetch/transform CLI commands and options.

131 tests pass, 2 skipped.
@speckhard speckhard requested a review from Ramlaoui March 12, 2026 14:53
@speckhard speckhard changed the title [DRAFT] FEAT: LeMatRho data fetch, transform. Downsample rho during fetch. FEAT: LeMatRho data fetch, transform. Downsample rho during fetch. Mar 13, 2026

```python
try:
    # Fresh client per worker (boto3 clients are NOT multiprocess-safe)
    aws_client = get_authenticated_aws_client()
```
Ramlaoui (Collaborator):
This creates a new authenticated client for every material, I don't think it is a bottleneck because of the rest of the pipeline but might be done for say batches of materials if it causes an issue or triggers rate-limits on AWS side.

speckhard (Collaborator, Author):

You're right that it creates a client per material, but this is intentional: boto3 clients are not safe to share across ProcessPoolExecutor workers, and client creation is fast (~ms) relative to the multi-hundred-MB downloads and Bader/DDEC6 analysis per material. Added an inline comment explaining this trade-off. If we ever see rate-limiting from AWS we can batch client creation per worker init, but so far it hasn't been an issue. Gemini suggests something like this (best of both worlds: minimize clients without sharing them across processes):

```python
import concurrent.futures
import boto3

# Global variable inside the worker's memory space
worker_s3_client = None

def initialize_worker():
    """Runs once when the worker process starts."""
    global worker_s3_client
    # Get the authenticated client once per process
    worker_s3_client = get_authenticated_aws_client()

def process_material(material_id):
    """Your task function."""
    global worker_s3_client

    # Use the process-local client (no need to recreate it!)
    response = worker_s3_client.get_object(Bucket='my-bucket', Key=material_id)
    # ... do the multi-hundred-MB download and Bader/DDEC6 analysis ...
    return f"Processed {material_id}"

# Main pipeline execution
def run_pipeline(materials):
    with concurrent.futures.ProcessPoolExecutor(
        max_workers=4,
        initializer=initialize_worker  # <-- This is the magic!
    ) as executor:
        results = list(executor.map(process_material, materials))
    return results
```

Let's come back to this if we see the code is too slow (for now, with 70k materials, I didn't notice this bottleneck).

Comment on lines +525 to +543
```python
optimade_structure = OptimadeStructure(
    id=material_id,
    source="lematrho",
    immutable_id=material_id,
    last_modified=datetime.now(),
    **optimade_dict,
    functional=Functional.PBE,
    cross_compatibility=cross_compatibility,
    compressed_charge_density=compressed_grids.get("charge_density"),
    compressed_aeccar0=compressed_grids.get("aeccar0"),
    compressed_aeccar1=compressed_grids.get("aeccar1"),
    compressed_aeccar2=compressed_grids.get("aeccar2"),
    charge_density_grid_shape=list(grid_shape),
    bader_charges=bader_charges,
    bader_atomic_volume=bader_atomic_volume,
    ddec6_charges=ddec6_charges,
    compute_space_group=True,
    compute_bawl_hash=True,
)
```
Ramlaoui (Collaborator):
Do you know whether it might be useful to also transfer the energy or forces keys for instance in the database? Or are they also included elsewhere?

speckhard (Collaborator, Author):
Good question, yes, energy, forces, and stress are all available in the vasprun.xml we already download, so extracting them would be essentially free. Currently parse_vasprun_structure only pulls the final relaxed structure and discards the rest. @mfranckel will add energy/forces/stress extraction in a follow-up PR

@Ramlaoui left a comment:
LGTM, no major changes requested as the pipeline is pretty separate from the rest of the logic. Also I am not very familiar with the bader/DDEC6 code so I can't comment on what gets executed there.

Just one high-level question: for the generic fetch / transform pipeline, why do you think it would need a double download? wouldn't it be enough to download the AWS open data with all CHGCAR / AECCAR once during fetch in the raw postgres database and then use that database during transform? Of course the big issue with that would be storage space and then this leads us to wondering whether we really need local copies of online databases for reproducibility or not but I guess that would be project dependent.

Also added a few comments on whether the bottleneck induced by creating connections and switching tasks for each material adds latency that could be avoided if materials can be batched / or workers increased. But I guess that depends a lot on the BADER/DDEC6 workflow and its own internal bottlenecks and overheads.

Thanks a lot it was very nice to read both the code and the documentation and follow along!

speckhard (Collaborator, Author):
Thanks a lot for the thorough review Ali with loads of constructive feedback and good finds.

On the double download: You're right that it would technically be possible to store the raw CHGCAR/AECCAR bytes in PostgreSQL during fetch and avoid re-downloading in transform. The reason we didn't go that route is storage: each material has 4 charge density files (CHGCAR + AECCAR0/1/2) that can be 100–500 MB each when decompressed. Across even a few thousand materials that's multiple terabytes, which is way outside what makes sense for a Postgres DB (we could store file paths in the db, though). Mostly this was built on @inelgnu's suggestion to cut out the middleman Postgres and use Parquet directly (I think this is what he would've done if he'd known HF was the endpoint, but that wasn't known when most of LeMaterialFetcher was written).

LeMatRho uses a direct S3 → Parquet pipeline and never writes to Postgres.
The 8 charge density columns (compressed_charge_density, compressed_aeccar{0,1,2},
charge_density_grid_shape, bader_charges, bader_atomic_volume, ddec6_charges)
were dead weight in the Postgres schema and INSERT tuples. The fields remain on
OptimadeStructure (Pydantic model) where they define the Parquet output schema.
@speckhard speckhard merged commit 8576ec7 into main Apr 7, 2026
1 check passed

Ramlaoui commented Apr 7, 2026

Thanks for the explanation, that makes a lot of sense! I think this can then be viewed as a special case where the fetcher is not used or does not exist, and all there is is a transform that fetches the raw files and throws away the unnecessary data, which is exactly what you did.

Congrats on the massive work with this PR and looking forward to using the dataset :)
