Material Identifier is a Python framework that uses Large Language Models (LLMs) to identify crystalline materials from natural-language descriptions and produce structured output files suitable as input for Density Functional Theory (DFT) calculations.
The framework accepts free-text descriptions such as:
- "silicon in the diamond cubic structure"
- "a perovskite oxide with titanium and barium"
- "face-centred cubic copper"
- "wurtzite gallium nitride"

It returns a structured JSON file containing the material's crystallographic properties.
- LLM-driven identification: natural language to crystallographic data via Gemini 2.5 Flash
- Dataset expansion: generate strain, rattle, and supercell variants from a single prompt for MLIP training seed datasets
- DFT-code agnostic output: structured JSON compatible with VASP, Quantum ESPRESSO, CP2K
- Materials Project validation: cross-checks lattice parameters against experimental data
- GW/BSE readiness: retrieves band gap, direct/indirect nature, and metallicity
- Automatic retry logic: handles API rate limits
- Interactive CLI: accepts custom descriptions at runtime
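
The automatic retry behaviour listed above could look roughly like the following. This is a minimal sketch of exponential backoff, not the framework's actual code: `call_with_retry`, `RateLimitError`, and the delay values are hypothetical names and numbers chosen for illustration.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the error an API client raises on rate limits."""


def call_with_retry(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); on RateLimitError wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting.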
materials-identifier/
├── identifier.py        # main pipeline (call LLM, parse, validate, save)
├── prompts.py           # prompt templates for Gemini
├── output_schema.py     # MaterialStructure dataclass definition
├── validator.py         # Materials Project cross-validation
├── expansion.py         # strain, rattle, supercell variant generation
├── writers/             # output writers
│   ├── __init__.py
│   └── qe.py            # Quantum ESPRESSO pw.x input writer
├── requirements.txt
├── .env.example         # API key template
├── .gitignore
├── docs/
│   └── LLM_Materials_Identification_Framework_Minotaki.pdf
├── notebooks/
│   └── demo.ipynb       # interactive step-by-step walkthrough
└── examples/
    ├── silicon_diamond.json
    ├── copper_fcc.json
    ├── barium_titanate.json
    ├── gallium_nitride_wurtzite.json
    └── iron_bcc.json
git clone https://github.com/mminotaki/materials-identifier.git
cd materials-identifier
python3 -m venv mat_ident_env
source mat_ident_env/bin/activate # Mac/Linux
mat_ident_env\Scripts\activate # Windows
pip install -r requirements.txt
Get a free Gemini API key at https://aistudio.google.com/apikey
Get a free Materials Project API key at https://materialsproject.org
Copy .env.example to .env and fill in your keys:
cp .env.example .env
⚠️ Never commit your .env file. It is already listed in .gitignore.
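
Before running the pipeline it can help to confirm that both keys are actually set. The sketch below is an illustration only: the variable names `GEMINI_API_KEY` and `MP_API_KEY` are assumptions (use whatever names .env.example defines), and it presumes the environment has already been populated, for instance by `load_dotenv()` from python-dotenv.

```python
import os


def missing_keys(required, env=None):
    """Return the names in `required` that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


# Hypothetical variable names; check .env.example for the real ones.
REQUIRED = ["GEMINI_API_KEY", "MP_API_KEY"]

if __name__ == "__main__":
    missing = missing_keys(REQUIRED)
    if missing:
        print("Missing API keys:", ", ".join(missing))
```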
Run all built-in examples:
python3 identifier.py --run-examples
Identify a custom material:
python3 identifier.py --description "rocksalt magnesium oxide"
Interactive mode:
python3 identifier.py
Generate Quantum ESPRESSO input:
python3 identifier.py --description "wurtzite gallium nitride" --format qe
The QE writer uses SSSP Efficiency pseudopotential references and applies sensible SCF defaults (cutoffs, k-point density, smearing).
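
To make the mapping from the JSON fields to a pw.x input concrete, here is a minimal sketch of what such a writer might do. It is not the code in writers/qe.py: the function names, the cutoff, the k-point grid, and the `species` argument (masses and pseudopotential filenames supplied by the caller) are all assumptions, and it assumes `atomic_positions` already lists every atom in the cell.

```python
import math


def lattice_vectors(a, b, c, alpha, beta, gamma):
    """Cell vectors (Å) from lengths and angles (degrees), with a along x."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    sg = math.sin(math.radians(gamma))
    cy = (ca - cb * cg) / sg
    cz = math.sqrt(max(0.0, 1.0 - cb * cb - cy * cy))
    return [[a, 0.0, 0.0], [b * cg, b * sg, 0.0], [c * cb, c * cy, c * cz]]


def pw_input(mat, species, ecutwfc=50.0, kpts=(6, 6, 6)):
    """Build a pw.x SCF input string.

    mat: dict shaped like the JSON output; atomic_positions must list every atom.
    species: {element: (mass_amu, pseudopotential_filename)}, given by the caller.
    ecutwfc and kpts are placeholder values, not recommended settings.
    """
    keys = ("a", "b", "c", "alpha", "beta", "gamma")
    cell = lattice_vectors(*(mat[k] for k in keys))
    atoms = mat["atomic_positions"]
    elements = sorted({at["element"] for at in atoms})
    lines = [
        "&CONTROL", "  calculation = 'scf'", "  pseudo_dir = './pseudo'", "/",
        "&SYSTEM", "  ibrav = 0", f"  nat = {len(atoms)}",
        f"  ntyp = {len(elements)}", f"  ecutwfc = {ecutwfc}", "/",
        "&ELECTRONS", "  conv_thr = 1.0d-8", "/",
        "ATOMIC_SPECIES",
    ]
    lines += [f"{el} {species[el][0]} {species[el][1]}" for el in elements]
    # ibrav = 0 means the cell is given explicitly, here in angstrom.
    lines.append("CELL_PARAMETERS angstrom")
    lines += ["  ".join(f"{x:.6f}" for x in vec) for vec in cell]
    # Fractional coordinates map to the "crystal" option of ATOMIC_POSITIONS.
    lines.append("ATOMIC_POSITIONS crystal")
    lines += [
        f"{at['element']} {at['x']:.6f} {at['y']:.6f} {at['z']:.6f}" for at in atoms
    ]
    lines += ["K_POINTS automatic", f"{kpts[0]} {kpts[1]} {kpts[2]} 0 0 0"]
    return "\n".join(lines) + "\n"
```

Using `ibrav = 0` with explicit cell vectors avoids having to map each crystal system onto a QE Bravais-lattice index, at the cost of leaving symmetry detection to pw.x.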
Generate a variant dataset for MLIP training:
python3 identifier.py --description "silicon in the diamond cubic structure" \
--format qe --expand strain,rattle,supercell \
--strain iso,uni,eos \
--n-rattle 5 \
--supercell 2,2,2 \
--output examples/silicon_dataset
| Flag | Default | Description |
|---|---|---|
| `--expand` | (none) | Comma-separated list: strain, rattle, supercell |
| `--strain` | `iso,uni,eos` | Strain modes: isotropic / uniaxial / EOS-style |
| `--n-rattle` | `5` | Number of rattled configurations |
| `--rattle-amplitude` | `0.05` | Cartesian displacement amplitude (Å) |
| `--rattle-seed` | `42` | RNG seed for reproducibility |
| `--supercell` | `2,2,2` | Supercell scaling as na,nb,nc |
Import in your own code:
from identifier import identify_material
material, validation = identify_material(
"rocksalt magnesium oxide",
output_path="examples/mgo.json"
)
print(material.to_json())
Each run produces a structured JSON file:
{
"formula": "GaN",
"name": "Gallium Nitride",
"crystal_system": "hexagonal",
"space_group_symbol": "P6_3mc",
"space_group_number": 186,
"point_group": "6mm",
"a": 3.189, "b": 3.189, "c": 5.185,
"alpha": 90.0, "beta": 90.0, "gamma": 120.0,
"atomic_positions": [
{"element": "Ga", "x": 0.3333, "y": 0.6667, "z": 0.0, "wyckoff_position": "2b"},
{"element": "N", "x": 0.3333, "y": 0.6667, "z": 0.375, "wyckoff_position": "2b"}
],
"source": "LLM-inferred",
"confidence": "high",
"notes": null,
"validation": {
"status": "validated",
"mp_id": "mp-804",
"parameter_comparison": {
"a": {"gemini": 3.189, "mp": 3.189, "diff": 0.0},
"c": {"gemini": 5.185, "mp": 5.192, "diff": 0.007}
}
},
"electronic_properties": {
"band_gap_ev": 1.73,
"is_gap_direct": true,
"is_metal": false,
"gw_recommended": true,
"bse_recommended": true,
"note": "Semiconductor/insulator - GW/BSE applicable"
}
}
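
The `parameter_comparison` block above can be reproduced with a few lines. This is a sketch under stated assumptions rather than the logic in validator.py: the function name and the 2% relative tolerance used to decide between "validated" and "mismatch" are hypothetical.

```python
def compare_parameters(llm, reference, keys=("a", "b", "c"), tol=0.02):
    """Compare lattice parameters (Å) and flag relative deviations above tol."""
    report = {}
    for key in keys:
        diff = round(abs(llm[key] - reference[key]), 3)
        report[key] = {"gemini": llm[key], "mp": reference[key], "diff": diff}
    # Hypothetical rule: every parameter must agree within tol (relative).
    ok = all(entry["diff"] <= tol * entry["mp"] for entry in report.values())
    return ("validated" if ok else "mismatch"), report


status, report = compare_parameters(
    {"a": 3.189, "b": 3.189, "c": 5.185},
    {"a": 3.189, "b": 3.189, "c": 5.192},
)
print(status, report["c"]["diff"])  # validated 0.007
```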
| Description | Formula | Space Group | Validation | GW/BSE |
|---|---|---|---|---|
| "silicon in the diamond cubic structure" | Si | Fd-3m #227 | mismatch (conventional vs primitive cell) | yes |
| "face-centred cubic copper" | Cu | Fm-3m #225 | mismatch (conventional vs primitive cell) | no (metal) |
| "a perovskite oxide with titanium and barium" | BaTiO3 | P4mm #99 | validated | yes |
| "wurtzite gallium nitride" | GaN | P6_3mc #186 | validated | yes |
| "iron in the body-centred cubic structure" | Fe | Im-3m #229 | mismatch (magnetic phase) | no (metal) |
Full output files are available in the examples/ folder.
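
The GW/BSE column follows from the retrieved band gap. A minimal sketch of that decision is shown below; the function name and the rule that a zero gap means "metal" are assumptions made for illustration, not the framework's confirmed logic.

```python
def electronic_summary(band_gap_ev, is_gap_direct):
    """Derive GW/BSE flags from a band gap in eV; a zero gap is treated as metallic."""
    is_metal = band_gap_ev <= 0.0
    return {
        "band_gap_ev": band_gap_ev,
        "is_gap_direct": bool(is_gap_direct) and not is_metal,
        "is_metal": is_metal,
        # Assumed rule: recommend GW/BSE only for gapped materials.
        "gw_recommended": not is_metal,
        "bse_recommended": not is_metal,
    }
```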
This framework was built to identify a single material from a description, but the same machinery can populate starting structures for high-throughput DFT workflows. Machine-learned interatomic potentials (MLIPs) such as MACE, NequIP, or Allegro require diverse DFT training data: different lattice strains, perturbed atomic positions, and larger cells. Assembling these starting structures by hand is a slow step. The expansion.py module turns a single identified structure into a folder of variants ready for DFT:
- Strain variants: isotropic, uniaxial, and dense EOS-style grids for sampling volumetric and anisotropic mechanical response (bulk modulus, elastic constants, equation of state).
- Rattled variants: random Cartesian displacements applied to each atom in real space (not fractional), giving consistent perturbation magnitudes across materials. Seeded for reproducibility.
- Supercells: exact integer expansion of the unit cell. Foundation for any future defect or surface work.
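
The three variant types can be sketched in plain Python as follows. This is an illustration, not the contents of expansion.py: the function names are hypothetical, the cell is a list of three Cartesian lattice vectors in Å, and Gaussian noise is an assumed choice for the rattle distribution. The defaults mirror the CLI defaults listed earlier.

```python
import random


def strain_isotropic(cell, eps):
    """Scale all lattice vectors by (1 + eps); fractional coordinates stay valid."""
    return [[(1.0 + eps) * x for x in vec] for vec in cell]


def strain_uniaxial(cell, eps, axis=2):
    """Stretch Cartesian component `axis` of every lattice vector by (1 + eps)."""
    return [
        [x * (1.0 + eps) if i == axis else x for i, x in enumerate(vec)]
        for vec in cell
    ]


def rattle(cart_positions, amplitude=0.05, seed=42):
    """Add Gaussian noise (std = amplitude, in Å) to each Cartesian coordinate."""
    rng = random.Random(seed)  # seeded so the same call gives the same result
    return [[x + rng.gauss(0.0, amplitude) for x in pos] for pos in cart_positions]


def supercell(cell, frac_positions, reps=(2, 2, 2)):
    """Repeat the cell reps[i] times along lattice vector i."""
    na, nb, nc = reps
    new_cell = [[n * x for x in vec] for n, vec in zip(reps, cell)]
    new_frac = [
        [(f[0] + i) / na, (f[1] + j) / nb, (f[2] + k) / nc]
        for i in range(na)
        for j in range(nb)
        for k in range(nc)
        for f in frac_positions
    ]
    return new_cell, new_frac
```

Rattling in Cartesian rather than fractional coordinates means a given amplitude corresponds to the same physical displacement regardless of cell size or shape.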
Supported materials:
- Elemental metals and semiconductors (Si, Cu, Fe, Al, Au...)
- Binary compounds (GaN, GaAs, NaCl, MgO...)
- Ternary oxides and perovskites (BaTiO3, SrTiO3...)
- Common structure types: FCC, BCC, diamond cubic, wurtzite, rocksalt, perovskite
Not supported:
- Highly complex or recently synthesised materials with limited literature coverage
- Disordered, amorphous, or partially ordered materials
- Materials requiring precise experimental lattice parameters
Known limitations:
- Lattice parameter mismatches may reflect conventional vs primitive cell differences rather than errors
- Input validation uses the same LLM and may occasionally pass non-material descriptions
- Free tier API is limited to 20 requests/day
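
The conventional-versus-primitive mismatch noted above is a geometric effect: a face-centred cubic conventional cell of edge a has primitive vectors of length a/√2, and a body-centred one has primitive vectors of length a√3/2. The sketch below shows one way to test for this before reporting an error; the function names and the 2% tolerance are hypothetical.

```python
import math


def primitive_length(a_conv, centering):
    """Primitive-vector length for a cubic conventional cell of edge a_conv."""
    factors = {
        "P": 1.0,                    # simple cubic: the two cells coincide
        "F": 1.0 / math.sqrt(2.0),   # face-centred: a / sqrt(2)
        "I": math.sqrt(3.0) / 2.0,   # body-centred: a * sqrt(3) / 2
    }
    return a_conv * factors[centering]


def same_lattice(a_conv, a_reported, centering, tol=0.02):
    """True if a_reported matches the conventional or the primitive length."""
    candidates = (a_conv, primitive_length(a_conv, centering))
    return any(abs(a_reported - c) <= tol * c for c in candidates)
```

For diamond-cubic silicon with a conventional edge of 5.431 Å, `primitive_length(5.431, "F")` is about 3.840 Å, so two sources can describe the same lattice while reporting different values of a.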
| Package | Purpose |
|---|---|
| `google-genai` | Gemini API client |
| `python-dotenv` | Loads API keys from .env file |
| `mp-api` | Materials Project database client |
| `pymatgen` | Crystal structure objects and analysis |
A slide deck summarizing the framework design and results is available in docs/presentation.pdf.
This project is licensed under the MIT License; see the LICENSE file for details.
