[BUG] Query IDs with dots in the name are not allowed. #176

@dbjoreli

Describe the bug
write_structure in openfold3/core/data/io/structure/cif.py raises NotImplementedError ("Only .cif, .bcif, and .pkl formats are supported") when writing .pdb output for queries whose IDs contain dots (e.g. 7txk__1__1.A__1.C).

Introduced in commit 0e84102, which changed the suffix extraction in cif.py from:

suffix = output_path.suffix

to:

suffix = "".join(output_path.suffixes)  # to handle .cif.gz

The intent was to detect .cif.gz as a compound extension. However, Path.suffixes returns every dot-prefixed segment after the first dot in the filename, not just the file extension. When the output filename contains dots from the query ID (e.g. 7txk__1__1.A__1.C_seed_101001_sample_1_model.pdb), output_path.suffixes returns ['.A__1', '.C_seed_101001_sample_1_model', '.pdb']. Joining them produces ".A__1.C_seed_101001_sample_1_model.pdb", which matches none of the case branches and falls through to the catch-all NotImplementedError.
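The behavior described above can be checked in isolation with pathlib alone. This is a minimal sketch that uses the filename from this report and does not touch any OpenFold3 code; the variable names are illustrative only.

```python
from pathlib import PurePosixPath

# Filename from this report: the query ID itself contains dots.
name = "7txk__1__1.A__1.C_seed_101001_sample_1_model.pdb"
path = PurePosixPath(name)

last_suffix = path.suffix        # only the final extension
all_suffixes = path.suffixes     # every dot-prefixed segment after the first dot
joined = "".join(all_suffixes)   # what the changed line in cif.py computes

print(last_suffix)
print(all_suffixes)
print(joined)

# The compound extension the change was meant to support.
gz_suffixes = PurePosixPath("model.cif.gz").suffixes
print(gz_suffixes)
```

Only the last element of suffixes corresponds to the real file extension here; the earlier elements come from the query ID.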

To Reproduce
A query (JSON) that triggers the issue:
{
    "queries": {
        "8q0u__1__1.A__1.D": {
            "chains": [
                {
                    "molecule_type": "protein",
                    "chain_ids": [
                        "A"
                    ],
                    "sequence": "SNAQIDGFVRTLRARPEAGGKVPVFVFHPAGGSTVVYEPLLGRLPADTPMYGFERVEGSIEERAQQYVPKLIEMQGDGPYVLVGWSLGGVLAYACAIGLRRLGKDVRFVGLIDAVRAGEEIPQTKEEIRKRWDRYAAFAEKTFNVTIPAIPYEQLEELDDEGQVRFVLDAVSQSGVQIPAGIIEHQRTSYLDNRAIDTAQIQPYDGHVTLYMADRYHDDAIMFEPRYAVRQPDGGWGEYVSDLEVVPIGGEHIQAIDEPIIAKVGEHMSRALGQIEADRTSEVGKQ",
                    "main_msa_file_paths": "/data/of3/colabfold_msas/main/65d30549a1d4539a138d452e53d382751619acfb757b94ad03782111643321e6.npz",
                    "template_alignment_file_path": "/data/of3/colabfold_msas/template/65d30549a1d4539a138d452e53d382751619acfb757b94ad03782111643321e6/colabfold_template.m8"
                },
                {
                    "molecule_type": "ligand",
                    "chain_ids": [
                        "B"
                    ],
                    "smiles": "COc1ccc(-c2noc(C(=O)N[C@@H](C#N)Cc3ccc(C(=O)N4CCC4)cc3)n2)cc1OC"
                }
            ]
        }
    }
}

Run inference with structure_format: pdb (the default) on any query whose ID contains a dot. Every such prediction fails to write output.

Stack trace

ERROR:openfold3.core.runners.writer:Failed to write predictions for query_id(s) 8q0u__1__1.A__1.D: Only .cif, .bcif, and .pkl formats are supported
Traceback (most recent call last):
  File "/opt/openfold3/openfold3/core/runners/writer.py", line 331, in on_predict_batch_end
    self.write_all_outputs(
  File "/opt/openfold3/openfold3/core/runners/writer.py", line 263, in write_all_outputs
    self.write_structure_prediction(
  File "/opt/openfold3/openfold3/core/runners/writer.py", line 126, in write_structure_prediction
    write_structure(
  File "/opt/openfold3/openfold3/core/data/io/structure/cif.py", line 371, in write_structure
    raise NotImplementedError(
NotImplementedError: Only .cif, .bcif, and .pkl formats are supported

Suggested Fix:
Use output_path.suffix (which returns only the last extension) and detect .cif.gz explicitly. Replace line 329 in cif.py (suffix = "".join(output_path.suffixes)) with:

    suffix = output_path.suffix
    if suffix == ".gz" and output_path.stem.endswith(".cif"):
        suffix = ".cif.gz"
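To sanity-check the suggested logic outside the codebase, it can be wrapped in a small standalone function. The name effective_suffix is hypothetical and does not exist in OpenFold3, and the second filename is an assumed example of a gzipped output for a dotted query ID. This is a sketch of the proposed check, not the actual patch.

```python
from pathlib import Path


def effective_suffix(output_path: Path) -> str:
    """Return the last extension, treating '.cif.gz' as one compound extension.

    Hypothetical helper for illustration; it mirrors the suggested fix above.
    """
    suffix = output_path.suffix
    # Path.stem drops only the last suffix, so 'x.cif.gz' has stem 'x.cif'.
    if suffix == ".gz" and output_path.stem.endswith(".cif"):
        suffix = ".cif.gz"
    return suffix


# Dots in the query ID no longer leak into the detected extension.
pdb_result = effective_suffix(
    Path("7txk__1__1.A__1.C_seed_101001_sample_1_model.pdb")
)
gz_result = effective_suffix(Path("8q0u__1__1.A__1.D_model.cif.gz"))
print(pdb_result, gz_result)
```

With this approach a plain .gz file whose stem does not end in .cif still yields ".gz", so it would continue to reach the catch-all branch.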

Labels: bug, contributions welcome, good first issue
