Current Python documentation: https://kit.cosmol.org/
pip install cosmolkit
COSMolKit is a Rust-native cheminformatics and structural biology toolkit for molecules, SMILES/SDF/MolBlock parsing, molecular graphs, conformers, coordinates, and AI-ready batch workflows.
It currently focuses on a chemistry core whose selected features are tested for RDKit-compatible behavior: SMILES parsing/writing, atom and bond feature inspection, hydrogen transforms, Kekulization, stereochemistry checks, distance-geometry bounds, Morgan fingerprints, SDF output, and 2D depiction.
COSMolKit is designed around ndarray-oriented structural data access, keeping molecular data efficient and natural for NumPy and PyTorch workflows.
COSMolKit uses deterministic, Copy-on-Write-style APIs: normal molecule operations return new objects and do not mutate their inputs. This follows the direction taken by pandas 3.0 and Polars, which favor explicit dataflow over hidden in-place mutation. See the pandas Copy-on-Write migration guide for the related dataframe design: https://pandas.pydata.org/docs/dev/user_guide/migration.html.
from cosmolkit import Molecule
mol = Molecule.from_smiles("CCO")
mol_h = mol.with_hydrogens()
assert mol is not mol_h
Explicit mutation is separated into editing workflows such as Molecule.edit(). Internally, unchanged topology / conformer / property data can be shared so value-style transformations remain efficient.
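The value-style behavior described above can be illustrated without COSMolKit itself. The following is a minimal sketch in plain Python, not the library's implementation: TinyMol and with_extra_atom are hypothetical names, and the sharing shown here (reusing an unchanged tuple) only assumes that the real core does something analogous.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class TinyMol:
    """Hypothetical immutable molecule record (not a COSMolKit class)."""

    atoms: Tuple[str, ...]
    bonds: Tuple[Tuple[int, int], ...]
    props: Tuple[Tuple[str, str], ...] = ()

    def with_extra_atom(self, symbol: str, attach_to: int) -> "TinyMol":
        # Build new atom and bond tuples. `props` is not passed to replace(),
        # so the new object reuses the very same tuple as the input.
        new_index = len(self.atoms)
        return replace(
            self,
            atoms=self.atoms + (symbol,),
            bonds=self.bonds + ((attach_to, new_index),),
        )


ethanol = TinyMol(
    atoms=("C", "C", "O"),
    bonds=((0, 1), (1, 2)),
    props=(("name", "ethanol"),),
)
with_h = ethanol.with_extra_atom("H", attach_to=2)

print(len(ethanol.atoms), len(with_h.atoms))  # the input keeps its 3 atoms
print(with_h.props is ethanol.props)          # unchanged data is shared
```

Because the record is frozen, a caller cannot accidentally modify the input, and the transformation only pays for the fields that actually change.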
COSMolKit is currently undergoing a core redesign. The redesigned implementation
lives in crates/cosmolkit-core/; the previous implementation has been moved to
crates/cosmolkit-core-old/ and is retained as reference material during the
migration. Some README sections may still describe the pre-redesign API or
completed legacy parity work while the new core is being rebuilt.
COSMolKit is in early active development. The implementation is intentionally a subset and is expanded by adding well-tested behavior rather than by broad API cloning.
- Core layout: cosmolkit-core contains chemistry perception, IO, drawing, and biomolecular primitives; cosmolkit is the Rust facade crate; python/ contains the PyO3 package.
- RDKit reference: RDKit 2026.03.1 is the active compatibility reference for selected behaviors, with third_party/rdkit pinned to Release_2026_03_1 (351f8f378f8ad6bbd517980c38896e66bf907af8).
- Gemmi reference: Gemmi is the planned reproduction target for future macromolecular PDB/mmCIF parsing work, with third_party/gemmi pinned to v0.7.5 (5cc1c23c6007e0e6cbd69289c6f7c0bff50e943e).
- Parity coverage: current tests cover graph features, add-H / remove-H roundtrips, tetrahedral stereo geometry, DG bounds matrices, Morgan fingerprint branches, Kekulization branches, SMILES writer branches, V2000 molblock output, direct MOL/molblock reading, and SDF V2000/V3000 roundtrips.
- Query import status: MOL/SDF parsing now preserves a structured internal representation for supported RDKit query atom/bond features such as HCOUNT, UNSAT, RBCNT, TOPO, hydrogen bonds, and unknown single-bond directions. Public query-inspection APIs are still intentionally deferred.
- Batch-native workflows: MoleculeBatch APIs support ordered molecule construction, Python-style indexing and iteration, parallel transforms, image/SDF export, custom export filenames, and structured error handling for high-throughput datasets.
- Python bindings: the package exposes SMILES parsing/writing with RDKit-style writer options, Molecule.from_rdkit(), enum-valued graph/stereo inspection, value-style transforms, explicit coords_2d() / coords_3d() access, 2D/3D SDF IO for molecules with stored coordinates, DG bounds, Morgan fingerprints, SVG/PNG rendering, batch processing, and explicit editing.
- AI direction: planned COSMolKit-native APIs include model-ready graph export, internal coordinates, torsion/chirality-aware diffusion helpers, and molecular tokenization. See dev/ai_native_features.md.
from cosmolkit import Molecule, MoleculeBatch
mol = Molecule.from_smiles("c1ccccc1O")
mol_2d = mol.with_2d_coords()
mol_2d.write_png("phenol.png", width=400, height=300)
fp = mol.fingerprint_morgan(radius=2, n_bits=2048)
print(fp.on_bits())
smiles = ["CCO", "c1ccccc1", "CC(=O)O"]
batch = MoleculeBatch.from_smiles_list(smiles, sanitize=True, errors="keep").with_parallel_jobs(8)
prepared = batch.add_hydrogens(errors="keep").compute_2d_coords(errors="keep")
fps = prepared.fingerprint_morgan_list(n_bits=2048)
prepared.to_images(
    "molecule_images",
    format="png",
    size=(300, 300),
    errors="skip",
    # CC(=O)O is acetic acid, so the third file is named accordingly.
    filenames=["ethanol", "benzene", "acetic_acid"],
)
For more Python examples, see python/README.md and python/examples/.
- RDKit-compatible core behavior: parity matters for molecular facts such as valence handling, aromaticity, Kekulization, stereochemistry, and file output.
- Modern public API: normal transformations are value-style and deterministic; explicit mutation belongs in editing workflows.
- Rust-first performance: heavy logic lives in Rust and is exposed to Python through PyO3 without leaking mutation ambiguity to users.
- ndarray-oriented structural data access: molecular data should be exposed through efficient array views that fit naturally into NumPy and PyTorch workflows.
- Batch-native throughput: large molecule collections should be processed as ordered batches with Rust-side parallel scheduling, minimal Python-loop overhead, and traceable per-record failures.
- AI-ready extensions: RDKit compatibility is the correctness floor, while COSMolKit-native graph, geometry, torsion, diffusion, and token APIs are the API ceiling.
Goal: keep the small core correct before expanding breadth
- ✅ Atom / Bond / Molecule data model
- ✅ adjacency representation
- ✅ bond order + formal charge support
- ✅ SMILES parser
- ✅ ring perception
- ✅ valence handling
- ✅ Kekulization
- ✅ SMILES writer
- ✅ atom and bond feature extraction
- ✅ explicit hydrogen expansion
- ✅ tetrahedral stereo ordered-ligand representation
- ✅ DG bounds matrix generation
- ✅ molblock V2000 coordinate/topology handling
- ✅ explicit COW sanitization pipeline
- ✅ RDKit-style SanitizeMol() flags, errors, conjugation, hybridization, atropisomer cleanup, and raw SDF sanitize control
- ✅ RDKit-style Morgan fingerprint core with bit-vector, Tanimoto, generator, count-simulation, custom-invariant, and AdditionalOutput branches
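As a rough illustration of the adjacency representation and ring perception items in this list, the sketch below stores a molecular graph as an adjacency list and counts independent rings with the cyclomatic number (bonds minus atoms plus connected components). It is plain Python under assumed names: build_adjacency, count_components, and ring_count are hypothetical helpers, and nothing here describes how cosmolkit-core actually stores graphs or perceives rings.

```python
from typing import Dict, List, Tuple

Bond = Tuple[int, int, float]  # (begin atom, end atom, bond order)
Adjacency = Dict[int, List[Tuple[int, float]]]


def build_adjacency(n_atoms: int, bonds: List[Bond]) -> Adjacency:
    """Map each atom index to its (neighbor index, bond order) pairs."""
    adjacency: Adjacency = {i: [] for i in range(n_atoms)}
    for begin, end, order in bonds:
        adjacency[begin].append((end, order))
        adjacency[end].append((begin, order))
    return adjacency


def count_components(adjacency: Adjacency) -> int:
    """Count connected components with an iterative depth-first search."""
    seen = set()
    components = 0
    for start in adjacency:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            atom = stack.pop()
            for neighbor, _order in adjacency[atom]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
    return components


def ring_count(n_atoms: int, bonds: List[Bond]) -> int:
    """Number of independent rings: bonds - atoms + components."""
    adjacency = build_adjacency(n_atoms, bonds)
    return len(bonds) - n_atoms + count_components(adjacency)


# Heavy-atom graphs only; 1.5 is used here as a stand-in for aromatic bonds.
ethanol_bonds = [(0, 1, 1.0), (1, 2, 1.0)]
benzene_bonds = [(i, (i + 1) % 6, 1.5) for i in range(6)]
print(ring_count(3, ethanol_bonds), ring_count(6, benzene_bonds))
```

The cyclomatic number only gives the ring count; identifying which atoms form each ring needs a dedicated ring-set algorithm.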
Goal: make molecule import/export usable beyond SMILES
- ✅ MOL reader
- ✅ SDF reader with robust multi-record handling
- ✅ SDF writer with strict RDKit-compatible V2000/V3000 output
- ✅ SMILES output via RDKit-parity writer branches
- batch molecule loading
- format validation tools with precise error reporting
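Multi-record SDF handling rests on the fact that each record in an SDF file is terminated by a line containing only $$$$. The sketch below splits a string into raw records and remembers the line number where each record starts, which is the kind of information precise error reporting needs. The function name iter_sdf_records is hypothetical, and this is not how COSMolKit's reader is implemented; it does no molblock parsing at all.

```python
from typing import Iterator, List, Tuple


def iter_sdf_records(text: str) -> Iterator[Tuple[int, str]]:
    """Yield (first_line_number, record_text) for each record in an SDF string."""
    current: List[str] = []
    start_line = 1
    for line_number, line in enumerate(text.splitlines(), start=1):
        if line.strip() == "$$$$":
            yield start_line, "\n".join(current)
            current = []
            start_line = line_number + 1
        else:
            current.append(line)
    # Tolerate a final record that is missing its closing delimiter.
    if any(part.strip() for part in current):
        yield start_line, "\n".join(current)


# Placeholder record bodies: these are NOT valid molblocks, only stand-ins.
sample = (
    "first\nplaceholder body\nM  END\n$$$$\n"
    "second\nplaceholder body\nM  END\n"
)
records = list(iter_sdf_records(sample))
for start, record in records:
    print(start, record.splitlines()[0])
```

Carrying the starting line number alongside each record lets a later parsing step report failures as "record starting at line N" instead of a bare record index.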
Goal: make high-throughput molecule preparation and export a core product identity
- ✅ MoleculeBatch.from_smiles_list() with input-order preservation
- ✅ batch transformations for sanitize, add/remove hydrogens, Kekulization, and 2D coordinates
- ✅ Rust-side parallel scheduling with configurable n_jobs
- ✅ batch-level default parallelism with MoleculeBatch.with_parallel_jobs()
- ✅ structured batch errors with errors="raise" | "keep" | "skip" and Python BatchErrorMode / BatchErrorType enums
- ✅ validity masks, error summaries, and JSON/CSV error reports
- ✅ Python-style iteration, integer indexing, slicing, index-list selection, and boolean-mask selection
- ✅ parallel SDF and image export for large molecule collections, including per-record filenames
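The three error modes in this list can be pictured with a small plain-Python sketch. The semantics assumed here are a reading of the option names, not confirmed COSMolKit behavior: "raise" stops at the first failure, "keep" leaves a placeholder so positions still match the input, and "skip" drops the failed record. The run_batch function is hypothetical, and int() stands in for a molecule transform.

```python
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_batch(
    items: Sequence[T],
    transform: Callable[[T], R],
    errors: str = "raise",
) -> Tuple[List[Optional[R]], List[Tuple[int, str]]]:
    """Apply `transform` in input order; return (results, failures)."""
    if errors not in ("raise", "keep", "skip"):
        raise ValueError(f"unknown error mode: {errors!r}")
    results: List[Optional[R]] = []
    failures: List[Tuple[int, str]] = []  # (input index, error message)
    for index, item in enumerate(items):
        try:
            results.append(transform(item))
        except Exception as exc:
            if errors == "raise":
                raise
            failures.append((index, str(exc)))
            if errors == "keep":
                results.append(None)  # placeholder keeps positions aligned
    return results, failures


raw = ["1", "not-a-number", "3"]
kept, kept_failures = run_batch(raw, int, errors="keep")
skipped, _ = run_batch(raw, int, errors="skip")
valid_mask = [value is not None for value in kept]
print(kept, skipped, valid_mask)
```

Under these assumptions, "keep" is the mode that makes a validity mask meaningful, because the result list stays index-aligned with the input, while the failure list records which input each error came from.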
Goal: expose the verified Rust core through a practical Python interface
- ✅ PyO3 package scaffold
- ✅ Molecule.from_smiles()
- ✅ Molecule.from_rdkit()
- ✅ atom and bond graph access
- ✅ enum-valued bond order, bond direction, bond stereo, and chiral tag access
- ✅ Molecule.with_hydrogens()
- ✅ Molecule.without_hydrogens()
- ✅ Molecule.with_kekulized_bonds()
- ✅ Molecule.tetrahedral_stereo()
- ✅ Molecule.with_2d_coords()
- ✅ Molecule.to_smiles() with RDKit-style writer options
- ✅ SDF read/write bindings
- ✅ SVG/PNG rendering and file export
- ✅ explicit Molecule.edit() workflow
- ✅ explicit Molecule.sanitize() and sanitize construction flags for supported SMILES workflows
- ✅ Molecule.fingerprint_morgan() and Molecule.fingerprint_morgan_with_output() bindings
- ✅ generated Python stubs for Morgan fingerprint classes and methods
- Molecule pickle roundtrip support for Python persistence and multiprocessing workflows
- stable graph-extraction helpers for ML workflows
- full SanitizeMol()-style error parity and catch-errors API
Goal: enable practical filtering and analysis
- ✅ distance-geometry bounds matrix parity
- ✅ Morgan fingerprint generation and Tanimoto similarity metrics
- ✅ internal query-atom / query-bond storage for supported MOL/SDF parser branches
- topological fingerprint generation
- 3D conformer generation and embedding APIs
- atom selection API
- bond selection API
- neighborhood queries
- connected component analysis
- substructure matching
- public Rust query inspection API
- Python query inspection / matching API
- molecular formula
- molecular weight
- ring statistics
Query API note: internal query AST support exists to preserve RDKit MOL/SDF semantics during parsing. Public inspection and matching APIs are still pending design, especially for Python users.
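For readers unfamiliar with the Tanimoto metric listed in this section, it is the size of the intersection of two on-bit sets divided by the size of their union. The sketch below computes it from plain collections of bit indices; the tanimoto function is a hypothetical helper, not a COSMolKit API, and the value returned for two empty fingerprints is a convention chosen for this sketch only.

```python
from typing import Iterable


def tanimoto(on_bits_a: Iterable[int], on_bits_b: Iterable[int]) -> float:
    """Tanimoto similarity |A & B| / |A | B| over two sets of on-bit indices."""
    a, b = set(on_bits_a), set(on_bits_b)
    union = len(a | b)
    if union == 0:
        # Both fingerprints are empty. Returning 1.0 is a convention chosen
        # for this sketch; check the library you use for its own behavior.
        return 1.0
    return len(a & b) / union


fp_a = [1, 5, 9, 12]
fp_b = [5, 9, 40]
print(tanimoto(fp_a, fp_b))  # 2 shared bits out of 5 distinct bits
```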
Goal: provide RDKit-drawer-like molecule depiction
- ✅ 2D coordinate generation
- ✅ SVG molecule drawer
- ✅ PNG rendering path for Python users
- ✅ embedded Noto Sans font for PNG rendering
- atom and bond annotation overlays
- stereochemistry-aware wedge/dash depiction
- visual regression tests for generated drawings
Goal: cover core Biopython-like structure functionality
- PDB parser
- mmCIF parser
- Structure / Model / Chain / Residue / Atom hierarchy
- alternate location handling
- insertion code handling
- HETATM parsing
- ligand extraction
- residue and chain selection utilities
- residue neighborhood queries
Goal: enable browser-native chemistry workflows
- WASM compilation target
- JS bindings
- in-browser SMILES/SDF parsing
- lightweight molecule processing
- integration with visualization tools
Goal: expose model-ready molecular data structures for modern ML workflows
- versioned Molecule.to_graph() export with cosmol-v1 node and edge feature schemas
- optional graph fields for coordinates, chirality, torsions, rings, fragments, and rotatable bonds
- Python output adapters for NumPy dictionaries and graph-learning libraries
- Molecule.to_internal_coordinates() and InternalCoordinates.to_cartesian() APIs
- Z-matrix and bond-angle-torsion tree support
- ring-aware internal coordinates and torsion graph metadata
- torsion/chirality-aware diffusion utilities with periodic angle losses and sin/cos encodings
- chirality-preserving reconstruction checks and ring torsion constraints
- Molecule.to_tokens() with versioned graph, fragment, torsion, 3D geometry, and pharmacophore token schemes
Design sketch: dev/ai_native_features.md
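To make the torsion items in this list more concrete, the sketch below computes a signed dihedral angle from four 3D points, encodes it as a (sin, cos) pair, and evaluates a periodic loss of the form 1 - cos(difference). These are standard formulas written in plain Python; the function names dihedral, encode_angle, and periodic_loss are hypothetical and do not describe the planned COSMolKit APIs.

```python
import math
from typing import Tuple

Vec = Tuple[float, float, float]


def _sub(a: Vec, b: Vec) -> Vec:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a: Vec, b: Vec) -> Vec:
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0: Vec, p1: Vec, p2: Vec, p3: Vec) -> float:
    """Signed torsion angle in radians around the p1-p2 axis."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.atan2(y, x)


def encode_angle(angle: float) -> Tuple[float, float]:
    """Continuous (sin, cos) encoding that has no jump at +/- pi."""
    return (math.sin(angle), math.cos(angle))


def periodic_loss(predicted: float, target: float) -> float:
    """Zero when the angles agree modulo 2*pi, at most 2 when opposite."""
    return 1.0 - math.cos(predicted - target)


# Four points with a quarter-turn twist around the z axis.
quarter = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
print(abs(math.degrees(quarter)))

# Angles just either side of the +/- pi seam are nearly identical torsions.
print(periodic_loss(math.pi - 0.01, -math.pi + 0.01))
```

A plain squared difference would treat the last pair of angles as almost a full turn apart, which is why periodic losses and sin/cos encodings are the usual choice for torsions.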
COSMolKit is developed with deep respect for RDKit and the broader open-source cheminformatics community. The goal is an independent Rust-native implementation that preserves interoperability and behavioral compatibility where appropriate, while offering a more deterministic Python API and AI-native extension surface.