Electrochemistry-data

This repository contains the data used to create the entries on echemdb.org. The data consist of frictionless-based unitpackages, which were created from SVG, YAML, and BibTeX (BIB) files using svgdigitizer. All input YAML files and output DataPackages are validated against the echemdb-metadata schema.

Accessing Data

Direct Download (Release Section)

The data can be downloaded as a ZIP from the release section.

Unitpackage API

A collection can be created from the echemdb module of the unitpackage interface (see the unitpackage installation instructions).

from unitpackage.database.echemdb import Echemdb
db = Echemdb.from_remote()

Electrochemistry Data API

Install the latest version of the module.

pip install git+https://github.com/echemdb/electrochemistry-data.git

In your preferred Python environment, retrieve the URL of the data via

from echemdb_ecdata.url import ECHEMDB_DATABASE_URL
ECHEMDB_DATABASE_URL
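Once the URL is known, the data can be fetched with the standard library alone. The following is a minimal sketch, assuming that ECHEMDB_DATABASE_URL points to a ZIP archive like the one in the release section; the helper name download_and_extract and the target directory are hypothetical and not part of the echemdb_ecdata module.

```python
import io
import zipfile
from pathlib import Path
from urllib.request import urlopen


def download_and_extract(url: str, target_dir: str) -> list[str]:
    """Download a ZIP archive from `url` and unpack it into `target_dir`.

    Returns the names of the archive members.
    """
    with urlopen(url) as response:  # handles https:// as well as file:// URLs
        payload = response.read()
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


# Hypothetical usage (assumes the URL resolves to a ZIP archive):
# from echemdb_ecdata.url import ECHEMDB_DATABASE_URL
# names = download_and_extract(ECHEMDB_DATABASE_URL, "echemdb-data")
```

The archive is read into memory before extraction, which is fine for moderately sized releases; for very large archives, streaming to a temporary file would be the safer choice.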

Contributing

The preparation of the files and the extraction of the data from a PDF source are described here.

Development

If you want to work on the data and repository itself, install pixi and clone the repository:

git clone https://github.com/echemdb/electrochemistry-data.git
cd electrochemistry-data

To list the available commands, run

pixi run

The full set of pixi tasks is defined in pyproject.toml.

Conversion

The repository converts source data into standardized frictionless datapackages:

# Convert all data (SVG digitizer + raw data)
pixi run -e dev convert

# Convert only SVG digitizer data (from literature/svgdigitizer/)
pixi run -e dev convert-svg

# Convert only raw data (from literature/source_data/)
pixi run -e dev convert-raw

# Clean generated data before converting
pixi run -e dev clean-data

A typical workflow:

# Clean previous builds and convert all data
pixi run -e dev clean-data && pixi run -e dev convert

Generated datapackages are written to data/generated/svgdigitizer/ and data/generated/source_data/.

Both SVG digitization and raw data conversion use a batch approach that imports heavy dependencies once and processes all files in a single Python process. This avoids the ~3 s Python startup overhead per file that occurs when spawning a subprocess for each file, reducing full-rebuild time from ~15 min to ~30-50 s for 273 SVG files.
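The numbers above can be checked with a short calculation: 273 files at roughly 3 s of startup each amount to about 819 s, or close to 14 min, of pure overhead. The sketch below also outlines the batch pattern in its simplest form; convert_one is a hypothetical stand-in for the actual converter, and the repository's implementation may differ.

```python
def per_file_startup_overhead(n_files: int, startup_s: float) -> float:
    """Total startup cost in seconds when every file gets its own process."""
    return n_files * startup_s


def convert_batch(paths, convert_one):
    """Convert all files in one process; `convert_one` is a hypothetical converter.

    Heavy imports happen once, before this loop runs, instead of once per
    spawned subprocess.
    """
    return [convert_one(path) for path in paths]


overhead_s = per_file_startup_overhead(273, 3.0)
print(f"{overhead_s:.0f} s, i.e. about {overhead_s / 60:.1f} min of startup overhead")
```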

Force a full rebuild (ignoring timestamps):

pixi run -e dev convert-force
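A timestamp-based rebuild check typically compares modification times of source and output. The following sketch illustrates that idea together with a force switch; it is an assumption about how such a check can look, not the logic used by the conversion scripts, and needs_rebuild is a hypothetical name.

```python
from pathlib import Path


def needs_rebuild(source: Path, target: Path, force: bool = False) -> bool:
    """Return True if `target` is missing, older than `source`, or `force` is set."""
    if force or not target.exists():
        return True
    # Rebuild only when the source was modified after the output was written.
    return source.stat().st_mtime > target.stat().st_mtime
```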

Verify that the batch conversion produces output identical to existing generated data:

pixi run -e dev verify-svg   # SVG digitizer output
pixi run -e dev verify-raw   # Source data output
pixi run -e dev verify-all   # Both at once
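Checking that two output trees are identical can be done by comparing content hashes file by file. This is a generic sketch of such a comparison, not the implementation behind the verify tasks; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def directory_digest(root: Path) -> dict[str, str]:
    """Map each file path relative to `root` to the SHA-256 of its content."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def differing_files(expected: Path, actual: Path) -> list[str]:
    """Return relative paths that are missing on one side or differ in content."""
    a, b = directory_digest(expected), directory_digest(actual)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

An empty result means both directories contain the same files with byte-identical content.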

Validation

All data (input YAML and output JSON) is validated against the echemdb-metadata schema. In addition, filenames, identifiers, and bibliography keys are validated for consistency.

Two umbrella tasks cover all checks:

# Validate all input files (YAML schema, filenames/identifiers, bib keys)
pixi run -e dev validate-input

# Validate all generated files (JSON schema, identifiers)
pixi run -e dev validate-generated

These are also used in the CI workflows. You can run individual sub-tasks:

# Schema validation
pixi run -e dev validate-svgdigitizer-yaml  # Input YAML (svgdigitizer)
pixi run -e dev validate-source-yaml        # Input YAML (source data)
pixi run -e dev validate-svgdigitizer       # Generated JSON (svgdigitizer)
pixi run -e dev validate-raw                # Generated JSON (source data)

# Filename and identifier validation
pixi run -e dev validate-identifiers              # All input filenames
pixi run -e dev validate-svgdigitizer-filenames   # SVG digitizer filenames only
pixi run -e dev validate-source-filenames         # Source data filenames only
pixi run -e dev validate-generated-identifiers    # Generated data identifiers

# Bibliography key validation
pixi run -e dev validate-bib-keys  # Check bib keys match expected identifiers
pixi run -e dev validate-bib-utf8  # Check for LaTeX accent encodings
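To illustrate what a check for LaTeX accent encodings looks for, the sketch below detects a handful of common accent commands with a regular expression. The exact patterns covered by validate-bib-utf8 are not listed here, so this selection is an assumption, and find_latex_accents is a hypothetical name.

```python
import re

# Matches accent commands such as \"u, \"{u}, \'e, \`a, \^o, \~n and \c{c}.
LATEX_ACCENT = re.compile(r"""\\(?:["'`^~][A-Za-z]|["'`^~c]\{[A-Za-z]\})""")


def find_latex_accents(text: str) -> list[str]:
    """Return all LaTeX accent commands found in `text`, in order of appearance."""
    return LATEX_ACCENT.findall(text)
```

Entries that already use UTF-8 characters, such as Müller, produce no matches, and neither do ordinary commands such as \cite{...}.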

Validate against a specific schema version:

pixi run -e dev validate-input --version tags/0.3.3
pixi run -e dev validate-generated --version head/my-branch

Fix Utilities

# Lowercase SVG labels and filenames (enforced for Windows compatibility)
pixi run -e dev fix-lowercase          # Apply changes
pixi run -e dev fix-lowercase-dry-run  # Preview only

# Convert LaTeX accent encodings to UTF-8 in bibliography.bib
pixi run -e dev fix-bib-utf8           # Apply changes
pixi run -e dev fix-bib-utf8-dry-run   # Preview only

# Auto-fix identifier mismatches (detects dir name != YAML citationKey)
pixi run -e dev fix-identifiers          # Apply changes
pixi run -e dev fix-identifiers-dry-run  # Preview only

# Rename directories and files after a bib key change (manual)
pixi run -e dev rename-identifiers OLD_NAME NEW_NAME
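The apply and dry-run variants of the fix tasks follow a common pattern: compute the planned changes first, then perform them only when requested. The sketch below shows that pattern for lowercasing filenames; it does not touch SVG labels, and lowercase_filenames is a hypothetical helper rather than the code behind fix-lowercase.

```python
from pathlib import Path


def lowercase_filenames(root: Path, dry_run: bool = True) -> list[tuple[str, str]]:
    """Rename files under `root` to lowercase; return the (old, new) name pairs.

    With `dry_run=True` the renames are only reported, not performed.
    """
    planned = []
    # sorted() materializes the listing so renames do not disturb the iteration.
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.name != path.name.lower():
            target = path.with_name(path.name.lower())
            planned.append((path.name, target.name))
            if not dry_run:
                path.rename(target)
    return planned
```

Defaulting to dry_run=True makes the preview the safe default, so files are only renamed when the caller opts in explicitly.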
