FEAT: LeMatRho data fetch, transform. Downsample rho during fetch.#49
Conversation
Add `lematrho run` CLI command that downloads charge density data from S3, compresses via pyrho, optionally runs Bader/DDEC6 analysis, and writes Parquet files directly. Crash-safe via checkpoint file, atomic Parquet writes, work-stealing parallelism.
- Move shared constants (STATIC_CALC_TYPE, GRID_KEY_MAP, timeouts) and write_potcar() into utils.py as the single source of truth
- Convert all lematrho docstrings to Google style
- Add 11 new pipeline tests: _validate_tools, DDEC6 happy path, _structure_to_row with None fields, _push_to_huggingface, DDEC6 in _process_material
- Add @pytest.mark.integration scaffold with .env.integration loading
- Remove unused --force flag from lematrho transform (it was silently ignored)
- Fix batch: Any -> batch: str type annotation in fetch.py
- Add .env.* to .gitignore with a !.env.example exception
- Add LeMatRho/AWS/HuggingFace variables to .env.example
- Add pytest integration marker config to pyproject.toml
- Vasprun and Chgcar.from_file require filesystem paths, not BytesIO objects. Write bytes to temp files before parsing.
- Add --limit CLI option to cap the number of materials processed (useful for smoke testing without processing all ~76k materials).
- Verified with a real S3 smoke test: 2 materials processed end-to-end with compressed charge densities at 10x10x10 grid shape.
- Batch-checkpoint after Parquet flush instead of per-material to prevent
desync on crash (up to chunk_size materials could be lost)
- Track failed materials in .failures.txt, skip on resume
- Close S3 StreamingBody and free compressed buffer in download_gz_file_from_s3
- Replace NamedTemporaryFile(delete=True) with TemporaryDirectory for
reliable temp file lifetime in parse_vasprun_structure and compress_chgcar
- Fix _list_materials docstring ("Sorted" claim was incorrect)
- Document memory trade-off for raw_files kept for Bader/DDEC6
- Add 5 new tests (batch checkpoint, failure load/append/resume)
Replace manual subprocess Bader/DDEC6 calls with pymatgen's BaderAnalysis and ChargemolAnalysis wrappers. Replace perl chgsum.pl with Chgcar arithmetic. Extract shared run_bader_from_bytes() and run_ddec6_from_bytes() helpers to utils.py, eliminating ~100 lines of duplication between transform.py and pipeline.py. Remove chgsum_script_path from config/CLI and timeout constants from utils. Add module-level docstrings to all LeMatRho files.
…ct pipeline

Delete fetch.py, transform.py, and their tests. The traditional PostgreSQL-based pipeline added confusion and required double S3 downloads. Only the direct S3-to-Parquet pipeline (lematrho run) remains. Move get_cross_compatibility() to utils.py so pipeline.py has no dependency on the deleted transform.py. Remove lematrho fields from FetcherConfig and TransformerConfig. Remove the fetch/transform CLI commands and options. 131 tests pass, 2 skipped.
```python
try:
    # Fresh client per worker (boto3 clients are NOT multiprocess-safe)
    aws_client = get_authenticated_aws_client()
```
This creates a new authenticated client for every material. I don't think it is a bottleneck given the rest of the pipeline, but it could be done for batches of materials if it causes an issue or triggers rate limits on the AWS side.
You're right that it creates a client per material, but this is intentional: boto3 clients are not safe to share across ProcessPoolExecutor workers, and client creation is fast (~ms) relative to the multi-hundred-MB downloads and Bader/DDEC6 analysis per material. Added an inline comment explaining this trade-off. If we ever see rate limiting from AWS we can batch client creation per worker init, but so far it hasn't been an issue. Gemini suggests we could do something like this (best of both worlds: minimize clients, but don't share them across processes):
```python
import concurrent.futures

import boto3

# Global variable inside the worker's memory space
worker_s3_client = None


def initialize_worker():
    """Runs once when the worker process starts."""
    global worker_s3_client
    # Get the authenticated client once per process
    worker_s3_client = get_authenticated_aws_client()


def process_material(material_id):
    """Your task function."""
    global worker_s3_client
    # Use the process-local client (no need to recreate it!)
    response = worker_s3_client.get_object(Bucket='my-bucket', Key=material_id)
    # ... do the multi-hundred-MB download and Bader/DDEC6 analysis ...
    return f"Processed {material_id}"


# Main pipeline execution
def run_pipeline(materials):
    with concurrent.futures.ProcessPoolExecutor(
        max_workers=4,
        initializer=initialize_worker,  # <-- This is the magic!
    ) as executor:
        results = list(executor.map(process_material, materials))
    return results
```
Let's come back to this if we see the code is too slow (for now, with 70k materials, I didn't notice this bottleneck).
```python
optimade_structure = OptimadeStructure(
    id=material_id,
    source="lematrho",
    immutable_id=material_id,
    last_modified=datetime.now(),
    **optimade_dict,
    functional=Functional.PBE,
    cross_compatibility=cross_compatibility,
    compressed_charge_density=compressed_grids.get("charge_density"),
    compressed_aeccar0=compressed_grids.get("aeccar0"),
    compressed_aeccar1=compressed_grids.get("aeccar1"),
    compressed_aeccar2=compressed_grids.get("aeccar2"),
    charge_density_grid_shape=list(grid_shape),
    bader_charges=bader_charges,
    bader_atomic_volume=bader_atomic_volume,
    ddec6_charges=ddec6_charges,
    compute_space_group=True,
    compute_bawl_hash=True,
)
```
Do you know whether it might be useful to also transfer the energy or forces keys for instance in the database? Or are they also included elsewhere?
Good question: yes, energy, forces, and stress are all available in the vasprun.xml we already download, so extracting them would be essentially free. Currently parse_vasprun_structure only pulls the final relaxed structure and discards the rest. @mfranckel will add energy/forces/stress extraction in a follow-up PR.
Ramlaoui left a comment
LGTM, no major changes requested, as the pipeline is pretty separate from the rest of the logic. Also, I am not very familiar with the Bader/DDEC6 code, so I can't comment on what gets executed there.
Just one high-level question: for the generic fetch/transform pipeline, why do you think it would need a double download? Wouldn't it be enough to download the AWS open data with all CHGCAR/AECCAR once during fetch into the raw Postgres database and then use that database during transform? Of course the big issue with that would be storage space, and then this leads us to wondering whether we really need local copies of online databases for reproducibility or not, but I guess that would be project dependent.
I also added a few comments on whether the bottleneck induced by creating connections and switching tasks for each material adds latency that could be avoided if materials can be batched or workers increased. But I guess that depends a lot on the Bader/DDEC6 workflow and its own internal bottlenecks and overheads.
Thanks a lot, it was very nice to read both the code and the documentation and follow along!
Thanks a lot for the thorough review Ali, with loads of constructive feedback and good finds. On the double download: you're right that it would technically be possible to store the raw CHGCAR/AECCAR bytes in PostgreSQL during fetch and avoid re-downloading in transform. The reason we didn't go that route is storage: each material has 4 charge density files (CHGCAR + AECCAR0/1/2) that can be 100–500 MB each when decompressed. Across even a few thousand materials that's multiple terabytes, which is way outside what makes sense for a Postgres DB (we could store file paths in the db, though). Mostly this was built on @inelgnu's suggestion to cut out the middleman Postgres and use Parquet directly (I think this is what he would've done had he known HF was the endpoint, but that wasn't known when most of LeMaterialFetcher was written).
LeMatRho uses a direct S3 → Parquet pipeline and never writes to Postgres. The 8 charge density columns (`compressed_charge_density`, `compressed_aeccar{0,1,2}`, `charge_density_grid_shape`, `bader_charges`, `bader_atomic_volume`, `ddec6_charges`) were dead weight in the Postgres schema and INSERT tuples. The fields remain on `OptimadeStructure` (Pydantic model), where they define the Parquet output schema.
Thanks for the explanation, that makes a lot of sense! I think this can then be viewed as a special case where the fetcher is not used or does not exist, and all there is is a transform that fetches the raw files and throws away the unnecessary data, which is exactly what you did. Congrats on the massive work with this PR and looking forward to using the dataset :)
## What Is LeMatRho?

LeMatRho is an effort to open source charge density results from DFT. The raw VASP outputs live in an authenticated S3 bucket (`lemat-rho`; talk to me for access) and include CHGCAR, AECCAR0/1/2, and vasprun.xml files for each material.

This PR adds LeMatRho as a new data source inside `lematerial-fetcher` so we can fetch, compress, analyse, and publish charge densities as a HuggingFace dataset.
## Why This Codebase?

We built on `lematerial-fetcher` (instead of writing from scratch) because:

- It already has a `BaseFetcher`/`BaseTransformer` pattern for pulling data from external sources (Materials Project, OQMD, Alexandria) into a shared OPTIMADE data model.
- Much of the infrastructure was already partially in place.
- The `OptimadeStructure` Pydantic model, PostgreSQL schema, and HuggingFace push tooling can be reused with minimal extensions.
## Two Pipeline Architectures

### 1. Traditional Pipeline (`lematrho fetch` + `lematrho transform`)

Follows the repo's existing two-step pattern:

- Fetch: downloads the charge density files from S3, compresses them with pyrho, stores compressed grids in PostgreSQL.
- Transform: converts raw PostgreSQL rows to `OptimadeStructure`, optionally re-downloads files from S3 to run Bader/DDEC6 charge analysis.

Included for consistency with the existing architecture. Not recommended for production because the transform step must re-download files from S3 for charge analysis (double download).
### 2. Direct Pipeline (`lematrho run`) -- RECOMMENDED

A single-pass architecture that bypasses PostgreSQL entirely: download from S3 once, compress with pyrho, optionally run Bader/DDEC6, and write Parquet directly, built on the shared `utils.py` wrappers used by all three pipeline files.
## File Map

All LeMatRho source code lives under `src/lematerial_fetcher/fetcher/lematrho/`:

- `__init__.py`
- `fetch.py` -- `LeMatRhoFetcher(BaseFetcher)` -- S3 to PostgreSQL
- `transform.py` -- `LeMatRhoTransformer(BaseTransformer)` -- PostgreSQL raw to OPTIMADE
- `pipeline.py` -- `LeMatRhoDirectPipeline` -- single-pass S3 to Parquet
- `utils.py`

Modified files elsewhere in the repo:

- `models/optimade.py`
- `database/postgres.py`
- `utils/config.py` -- `DirectPipelineConfig` dataclass, extended `FetcherConfig`/`TransformerConfig`
- `utils/cli.py` -- `add_lematrho_fetch_options`, `add_lematrho_transform_options`, `add_lematrho_direct_options`
- `utils/aws.py` -- `get_authenticated_aws_client()` (adaptive retry, credential chain)
- `cli.py` -- `lematrho` command group with `fetch`, `transform`, `run` subcommands
- `push.py`
- `models/utils/enums.py` -- `Source.LEMATRHO = "lematrho"`
- `pyproject.toml` -- `mp-pyrho>=0.3.1` dependency
- `.env.example`
## Data Model: New Fields on OptimadeStructure

All 8 fields are `Optional` and default to `None`, so existing data sources (MP, Alexandria, OQMD) are unaffected.

| Field | Type |
| --- | --- |
| `compressed_charge_density` | `Optional[list]` |
| `compressed_aeccar0` | `Optional[list]` |
| `compressed_aeccar1` | `Optional[list]` |
| `compressed_aeccar2` | `Optional[list]` |
| `charge_density_grid_shape` | `Optional[list[int]]`, e.g. `[15, 15, 15]` |
| `bader_charges` | `Optional[list[float]]` |
| `bader_atomic_volume` | `Optional[list[float]]` |
| `ddec6_charges` | `Optional[list[float]]` |

Per-site fields (`bader_charges`, `bader_atomic_volume`, `ddec6_charges`) are validated by `_validate_with_number_of_sites()` to ensure their length matches `nsites`.
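For orientation, here is a minimal sketch of how these optional fields sit on a Pydantic model. It is not the repo's actual `OptimadeStructure` definition (which also carries the length validators and many other fields); only the field names and defaults come from the table above.

```python
from typing import Optional

from pydantic import BaseModel


class ChargeDensityFieldsSketch(BaseModel):
    """Illustrative subset only -- the real fields live on OptimadeStructure."""

    compressed_charge_density: Optional[list] = None
    compressed_aeccar0: Optional[list] = None
    compressed_aeccar1: Optional[list] = None
    compressed_aeccar2: Optional[list] = None
    charge_density_grid_shape: Optional[list[int]] = None  # e.g. [15, 15, 15]
    bader_charges: Optional[list[float]] = None  # one value per site
    bader_atomic_volume: Optional[list[float]] = None  # one value per site
    ddec6_charges: Optional[list[float]] = None  # one value per site
```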
## Key Design Decisions

### Single-pass S3 access (direct pipeline)

Each S3 file is downloaded exactly once. The raw decompressed bytes are held in memory so they can be reused for both pyrho compression and Bader/DDEC6 analysis. The traditional pipeline cannot do this because fetch and transform are separate steps.
### No PostgreSQL (direct pipeline)

The end target is Parquet files on HuggingFace. PostgreSQL adds schema migration, connection management, and disk space overhead with no benefit for this use case. `DirectPipelineConfig` has zero database fields.

### Crash-safe checkpointing
A text file (`.checkpoint.txt`) records each successfully processed material ID. On restart, the pipeline skips already-processed materials and resumes writing Parquet chunks from the next index.
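A minimal sketch of the checkpoint idea (batch-append only after the Parquet chunk has been flushed, load and skip on resume); the helper names here are illustrative, not the pipeline's exact API.

```python
from pathlib import Path


def load_checkpoint(path: Path) -> set[str]:
    """Material IDs already processed; empty set if no checkpoint exists yet."""
    if not path.exists():
        return set()
    return {line.strip() for line in path.read_text().splitlines() if line.strip()}


def append_checkpoint(path: Path, material_ids: list[str]) -> None:
    """Append a whole batch only after its Parquet chunk has been flushed."""
    with path.open("a") as fh:
        fh.writelines(f"{mid}\n" for mid in material_ids)
```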
### Atomic Parquet writes

Chunks are written to a `.tmp` file first, then renamed via `os.rename()` (POSIX atomic on the same filesystem). Stale `.tmp` files from crashed runs are ignored on resume.
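The write-then-rename pattern, sketched here with hypothetical names (the actual chunk-writing code lives in `pipeline.py`):

```python
import os

import pyarrow as pa
import pyarrow.parquet as pq


def write_chunk_atomically(table: pa.Table, final_path: str) -> None:
    """Readers (and resumed runs) only ever see complete Parquet files."""
    tmp_path = final_path + ".tmp"
    pq.write_table(table, tmp_path)
    os.rename(tmp_path, final_path)  # atomic on POSIX when on the same filesystem
```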
### Work-stealing parallelism

Uses `concurrent.futures.wait(FIRST_COMPLETED)` with bounded submission (2x `num_workers` futures in flight) to avoid creating hundreds of thousands of Future objects in memory.
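Roughly, the bounded-submission loop looks like the sketch below (simplified: no checkpointing or failure tracking, and the function names are placeholders, not the pipeline's real ones):

```python
from concurrent.futures import FIRST_COMPLETED, ProcessPoolExecutor, wait
from itertools import islice


def run_bounded(materials, process_one, num_workers=4):
    """Keep at most 2 * num_workers futures in flight instead of submitting everything up front."""
    pending = iter(materials)
    results = []
    with ProcessPoolExecutor(max_workers=num_workers) as executor:
        in_flight = {executor.submit(process_one, m) for m in islice(pending, 2 * num_workers)}
        while in_flight:
            done, in_flight = wait(in_flight, return_when=FIRST_COMPLETED)
            for future in done:
                results.append(future.result())
            # Top the pool back up as workers free up.
            for m in islice(pending, len(done)):
                in_flight.add(executor.submit(process_one, m))
    return results
```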
### Memory management

Sequential file processing within each worker, explicit `del` of raw bytes after use, `gc.collect()` after each material. Conservative default of 4 workers because each CHGCAR can be hundreds of MB when decompressed.
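The per-material cleanup is nothing more elaborate than the following pattern (illustrative only):

```python
import gc


def process_one_material(material_id, download):
    raw_files = download(material_id)  # multi-hundred-MB decompressed bytes
    try:
        ...  # pyrho compression, then Bader/DDEC6 analysis on raw_files
    finally:
        del raw_files  # drop the reference as soon as the analysis is done
        gc.collect()  # encourage prompt release of the large buffers
```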
### Graceful tool degradation

External tools (`bader`, `chargemol`) are validated at init. If any tool is missing, the corresponding output fields are set to `None` rather than failing the pipeline. Bader and DDEC6 are independent -- one can fail while the other succeeds.
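A sketch of the validation idea (the real `_validate_tools` in `pipeline.py` may differ in signature and behaviour):

```python
import shutil


def validate_tools(bader_path: str | None, chargemol_path: str | None) -> dict[str, bool]:
    """Report which optional external tools are usable; missing ones only disable their output fields."""
    return {
        "bader": bool(bader_path and shutil.which(bader_path)),
        "chargemol": bool(chargemol_path and shutil.which(chargemol_path)),
    }
```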
### Cross-compatibility

All LeMatRho structures are marked cross-compatible (no element exclusions). This differs from the Alexandria fetcher, which excludes Yb-containing structures. The `get_cross_compatibility()` function always returns `True`.

## Charge Density Processing: How It Works
### pyrho Compression

For each of CHGCAR, AECCAR0, AECCAR1, AECCAR2:

1. Parse the bytes with `pymatgen.io.vasp.Chgcar` (writes to a temp file first because pymatgen needs a file path).
2. Convert to a `pyrho.charge_density.ChargeDensity` via `ChargeDensity.from_pmg()`.
3. Downsample with `pgrids["total"].lossy_smooth_compression(grid_shape)`.

The default grid shape is `(15, 15, 15)`, configurable via `--grid-shape`. A condensed sketch of the flow follows.
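The sketch below assumes the CHGCAR bytes are already decompressed; the repo's `compress_chgcar` in `utils.py` may handle temp files and errors differently.

```python
import tempfile
from pathlib import Path

from pymatgen.io.vasp import Chgcar
from pyrho.charge_density import ChargeDensity


def compress_chgcar_sketch(chgcar_bytes: bytes, grid_shape=(15, 15, 15)):
    """Parse raw CHGCAR bytes and return the lossy-downsampled total grid."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = Path(tmpdir) / "CHGCAR"
        path.write_bytes(chgcar_bytes)  # pymatgen needs a file path, not BytesIO
        chgcar = Chgcar.from_file(str(path))
    cden = ChargeDensity.from_pmg(chgcar)
    return cden.pgrids["total"].lossy_smooth_compression(grid_shape)
```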
### Bader Charge Analysis (optional)

Requires: the `bader` executable on PATH and the `PMG_VASP_PSP_DIR` environment variable.

Uses `pymatgen.command_line.bader_caller.BaderAnalysis`, which manages the external `bader` binary internally. The pipeline does not call subprocess directly.

Steps (in `run_bader_from_bytes()` in `utils.py`):

1. Write the POTCAR via `MatPESStaticSet(structure).potcar`.
2. Build the reference density with `Chgcar` arithmetic (`Chgcar.from_file("AECCAR0") + Chgcar.from_file("AECCAR2")`) and write the result as `CHGCAR_sum`.
3. Run `BaderAnalysis(chgcar_filename=..., potcar_filename=..., chgref_filename=..., bader_path=...)`.
4. Read `summary["charge_transfer"]` and `summary["atomic_volume"]`.
5. pymatgen's charge transfer is `electron_count - valence` (positive = gained electrons). We negate it, `[-ct for ct in ba.summary["charge_transfer"]]`, so positive = cationic, matching the convention used in the rest of the codebase.

Returns: `(net_charges: list[float], atomic_volumes: list[float])`, or `(None, None)` on failure.

### DDEC6 Charge Analysis (optional)

Requires: the `chargemol` executable, an atomic densities directory, and the `PMG_VASP_PSP_DIR` environment variable.

Uses `pymatgen.command_line.chargemol_caller.ChargemolAnalysis`, which manages the external `chargemol` binary internally.

Steps (in `run_ddec6_from_bytes()` in `utils.py`):

1. Write the POTCAR via `MatPESStaticSet(structure).potcar`.
2. Set the `CHARGEMOL_COMMAND` environment variable to the chargemol path (pymatgen reads this env var).
3. Run `ChargemolAnalysis(path=tmpdir, atomic_densities_path=..., run_chargemol=True)`.
4. Read `ca.ddec_charges`.
5. Restore the previous `CHARGEMOL_COMMAND` value in a `finally` block.

Thread safety note: the `CHARGEMOL_COMMAND` save/restore pattern is process-safe (each `ProcessPoolExecutor` worker has its own environment) but not thread-safe. Do not call it from multiple threads in the same process.

Returns: `list[float]` (net charges per site), or `None` on failure.

Much of the pyrho/Bader/DDEC6 code is derived from @msiron-entalpic's code here.
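To make the two analyses concrete, here are minimal sketches. They are illustrative, not the repo's exact `run_bader_from_bytes`/`run_ddec6_from_bytes` (which take raw bytes, write the VASP files themselves, and wrap everything in error handling). Both assume CHGCAR, AECCAR0, AECCAR2, and POTCAR have already been written to `tmpdir`, and the keyword names follow the descriptions above (they may vary with the pymatgen version).

```python
from pathlib import Path

from pymatgen.command_line.bader_caller import BaderAnalysis
from pymatgen.io.vasp import Chgcar


def bader_from_dir(tmpdir: str, bader_path: str):
    tmp = Path(tmpdir)
    # Reference density via Chgcar arithmetic (replaces the old chgsum.pl step).
    chgref = Chgcar.from_file(str(tmp / "AECCAR0")) + Chgcar.from_file(str(tmp / "AECCAR2"))
    chgref.write_file(str(tmp / "CHGCAR_sum"))
    ba = BaderAnalysis(
        chgcar_filename=str(tmp / "CHGCAR"),
        potcar_filename=str(tmp / "POTCAR"),
        chgref_filename=str(tmp / "CHGCAR_sum"),
        bader_path=bader_path,
    )
    # charge_transfer counts electrons gained; negate so positive = cationic.
    net_charges = [-ct for ct in ba.summary["charge_transfer"]]
    return net_charges, ba.summary["atomic_volume"]
```

And the DDEC6 side, showing the `CHARGEMOL_COMMAND` save/restore in a `finally` block:

```python
import os

from pymatgen.command_line.chargemol_caller import ChargemolAnalysis


def ddec6_from_dir(tmpdir: str, chargemol_path: str, atomic_densities_path: str):
    previous = os.environ.get("CHARGEMOL_COMMAND")
    os.environ["CHARGEMOL_COMMAND"] = chargemol_path  # pymatgen reads this env var
    try:
        ca = ChargemolAnalysis(
            path=tmpdir,
            atomic_densities_path=atomic_densities_path,
            run_chargemol=True,
        )
        return ca.ddec_charges
    finally:
        # Restore the previous value so the worker process does not leak the override.
        if previous is None:
            os.environ.pop("CHARGEMOL_COMMAND", None)
        else:
            os.environ["CHARGEMOL_COMMAND"] = previous
```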
## Shared Code Architecture

Both pipeline variants (traditional and direct) delegate charge analysis to the same two functions in `utils.py`:

- `run_bader_from_bytes(structure, raw_files, bader_path, material_id)`
- `run_ddec6_from_bytes(structure, raw_files, chargemol_path, atomic_densities_path, material_id)`

This eliminates code duplication. The difference between the two pipelines is only in how they obtain the raw bytes: the traditional transform re-downloads them via `self.aws_client` and then delegates, while the direct pipeline already holds them in memory and passes them directly.
### Constants in utils.py

Shared constants such as `STATIC_CALC_TYPE`, `GRID_KEY_MAP`, and `VALID_PREFIXES` live here as the single source of truth for all three pipeline files.

### Helper functions in utils.py

- `download_gz_file_from_s3(client, bucket, key)`
- `parse_vasprun_structure(vasprun_bytes)` -- returns a pymatgen `Structure`
- `compress_chgcar(chgcar_bytes, grid_shape)`
- `build_raw_structure(material_id, structure, compressed_grids, grid_shape, s3_prefix)` -- builds the `RawStructure` for PostgreSQL insertion
- `write_potcar(structure, tmpdir)` -- writes the POTCAR via `MatPESStaticSet`
- `run_bader_from_bytes(structure, raw_files, bader_path, material_id)`
- `run_ddec6_from_bytes(structure, raw_files, chargemol_path, atomic_densities_path, material_id)`

## S3 Bucket Structure
The `lemat-rho` bucket contains one folder per material. Material IDs have prefixes indicating their origin database:

- `agm-` -- Alexandria
- `mp-` -- Materials Project
- `oqmd-` -- OQMD

Only folders matching `VALID_PREFIXES` are processed.

## Parquet Schema (Direct Pipeline)

The direct pipeline writes 33-column Parquet files. Compressed grids are stored as JSON-serialised strings (not nested arrays) because Parquet handles variable-depth nesting poorly. Per-site arrays use `pa.list_(pa.float64())`.

Key column types:

| Column | Arrow type | Notes |
| --- | --- | --- |
| `compressed_charge_density` | `pa.string()` | JSON-serialised grid |
| `compressed_aeccar0` | `pa.string()` | JSON-serialised grid |
| `compressed_aeccar1` | `pa.string()` | JSON-serialised grid |
| `compressed_aeccar2` | `pa.string()` | JSON-serialised grid |
| `charge_density_grid_shape` | `pa.list_(pa.int32())` | e.g. `[15, 15, 15]` |
| `bader_charges` | `pa.list_(pa.float64())` | per site |
| `bader_atomic_volume` | `pa.list_(pa.float64())` | per site |
| `ddec6_charges` | `pa.list_(pa.float64())` | per site |
| `species` | `pa.string()` | |
| `functional` | `pa.string()` | stored as the enum's `.value` |
| `last_modified` | `pa.string()` | |

The full schema is defined as `PARQUET_SCHEMA` in `pipeline.py`.
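An illustrative subset of what such a schema looks like in pyarrow; column names follow the table above, and the authoritative 33-column `PARQUET_SCHEMA` is the one in `pipeline.py`.

```python
import pyarrow as pa

# Illustrative subset only -- see PARQUET_SCHEMA in pipeline.py for the full schema.
PARQUET_SCHEMA_SUBSET = pa.schema(
    [
        ("immutable_id", pa.string()),
        ("species", pa.string()),
        ("compressed_charge_density", pa.string()),  # JSON-serialised compressed grid
        ("charge_density_grid_shape", pa.list_(pa.int32())),
        ("bader_charges", pa.list_(pa.float64())),
        ("bader_atomic_volume", pa.list_(pa.float64())),
        ("ddec6_charges", pa.list_(pa.float64())),
        ("functional", pa.string()),
        ("last_modified", pa.string()),
    ]
)
```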
## Configuration

### DirectPipelineConfig (dataclass)
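The original field table did not survive formatting, so here is a hypothetical reconstruction inferred from the CLI flags and helper signatures mentioned elsewhere in this PR; field names and defaults are assumptions, check `utils/config.py` for the real dataclass.

```python
from dataclasses import dataclass


@dataclass
class DirectPipelineConfigSketch:
    """Hypothetical reconstruction -- see utils/config.py for the real DirectPipelineConfig."""

    output_dir: str = "./output"  # --output-dir
    num_workers: int = 4  # --num-workers (conservative default)
    parquet_chunk_size: int = 1000  # --parquet-chunk-size; default value is an assumption
    grid_shape: tuple[int, int, int] = (15, 15, 15)  # --grid-shape
    limit: int | None = None  # --limit (cap materials for smoke tests)
    bader_path: str | None = None  # optional external tools
    chargemol_path: str | None = None
    atomic_densities_path: str | None = None
```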
### Environment Variables

- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
- `AWS_DEFAULT_REGION`
- `PMG_VASP_PSP_DIR`

See `.env.example` for the full template.

## CLI Usage

See the Quick-Start Checklist below for an example `lematrho run` invocation.
## External Dependencies

- `mp-pyrho>=0.3.1` -- installed via `pip install` (in pyproject.toml)
- `bader` executable
- `chargemol` executable
- `PMG_VASP_PSP_DIR`

All external tools are optional. Missing tools result in `None` fields rather than pipeline failure.
## Test Coverage

182 total test functions across the repo, of which 90 are LeMatRho-specific:

- `test_lematrho_fetch.py`
- `test_lematrho_transform.py`
- `test_lematrho_pipeline.py`
- `test_cli.py` (lematrho portion)
- `test_config.py` (lematrho portion) -- `DirectPipelineConfig` construction and defaults
- `test_aws.py`
- `test_optimade_model.py`

Integration test scaffolds (`@pytest.mark.integration`) exist for real S3 testing but are deselected by default.
## Known Open Items

- **HuggingFace push needs rework.** The code currently treats LeMatRho as if it will be pushed into the existing LeMatBulk dataset. LeMatRho should be its own standalone dataset. The `push.py` script needs modification to handle LeMatRho-specific features and schema separately.
- **HuggingFace schema verification.** Need to confirm the HF dataset schema includes the new columns and that compressed grid fields serialise correctly as JSON strings when loaded back.
- **Stale `.env.example` reference.** The `.env.example` file still contains `LEMATERIALFETCHER_CHGSUM_SCRIPT_PATH` (line 105), which was removed from the actual codebase. This line should be deleted.
- **No subprocess timeout control.** The pymatgen wrappers (`BaderAnalysis`, `ChargemolAnalysis`) do not expose subprocess timeout parameters. If an external tool hangs, the worker process will block indefinitely. Monitor in production.
- **Memory profiling.** Each decompressed CHGCAR can be 100-500 MB. With 4 workers, peak RSS could reach multiple GB. Needs profiling with real data before scaling up `--num-workers`.

## Quick-Start Checklist for a New Engineer
1. `pip install -e ".[dev]"`
2. Copy `.env.example` to `.env` and fill in your AWS credentials.
3. Run `pytest tests/fetcher/lematrho/ -v` (no external tools needed; everything is mocked).
4. Try a small real run:

   ```
   lematerial_fetcher --debug lematrho run \
       --output-dir ./test_output \
       --num-workers 1 \
       --parquet-chunk-size 5 \
       --limit 10
   ```

5. Read `utils.py` first -- it's the shared foundation. Then `pipeline.py` for the direct pipeline, or `fetch.py` + `transform.py` for the traditional one.