Workflow for the Cell Markers Ontology (CLM) pipeline, detailing the steps involved in preparing and running the ontology generation process. Further details can be found in the pipeline README.
- Users provide input data files in a standardized format in the `src/markers/input` directory.
- Metadata for these files is added to `src/markers/input/metadata.csv`.
- A GitHub Action validates the input files and metadata.
- See Adding New Marker Files for details on the required columns and format.
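The consistency check between the input directory and `metadata.csv` can be sketched locally as follows. This is a minimal illustration, not the project's GitHub Action: the function name `check_metadata` and the `file_name` column are assumptions made for the example, and the real required columns are those described in Adding New Marker Files.

```python
import csv
from pathlib import Path


def check_metadata(input_dir: Path, metadata_csv: Path,
                   file_column: str = "file_name") -> list[str]:
    """Return human-readable problems; an empty list means the check passed.

    `file_column` is a hypothetical column name used only for this sketch.
    """
    with metadata_csv.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        if file_column not in (reader.fieldnames or []):
            return [f"metadata is missing the '{file_column}' column"]
        listed = {row[file_column].strip() for row in reader if row[file_column]}

    # Every regular file in the input directory except the metadata file itself.
    present = {
        p.name for p in input_dir.iterdir()
        if p.is_file() and p.name != metadata_csv.name
    }

    problems = [f"listed in metadata but not found: {n}" for n in sorted(listed - present)]
    problems += [f"present but not described in metadata: {n}" for n in sorted(present - listed)]
    return problems
```

Calling `check_metadata(Path("src/markers/input"), Path("src/markers/input/metadata.csv"))` before pushing would surface mismatches early, under the assumptions above.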
- `src/markers/Makefile` is used to generate the Gene DBs.
- AnnData files are either manually downloaded or mounted from a shared drive, and genes are extracted from AnnData `var`. A ROBOT template is generated for the genes in the `src/templates` directory.
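To illustrate the shape of a gene ROBOT template, here is a minimal sketch that writes `(gene_id, symbol)` pairs to a TSV with the two header rows ROBOT expects. It is not the Makefile's actual implementation: the function name, the `ensembl:` prefix, and the chosen columns are assumptions. With the `anndata` package, the pairs could come from something like `zip(adata.var.index, adata.var["feature_name"])`, assuming the file follows the CELLxGENE schema.

```python
import csv


def write_gene_template(genes, handle):
    """Write a ROBOT template for an iterable of (gene_id, symbol) pairs.

    The "ensembl:" CURIE prefix and the column layout are assumptions
    made for this sketch only.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["ID", "TYPE", "NAME"])           # human-readable column names
    writer.writerow(["ID", "TYPE", "A rdfs:label"])   # ROBOT template strings
    seen = set()
    for gene_id, symbol in genes:
        if gene_id in seen:  # the same gene can appear in several datasets
            continue
        seen.add(gene_id)
        writer.writerow([f"ensembl:{gene_id}", "owl:Class", symbol])
```

Deduplicating on the gene ID keeps the template stable when several AnnData files share genes.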
- The Neo4j client cannot be run inside the ODK, so the template source files must be prepared before running the pipeline.
- Note: You should be connected to the Sanger intranet to be able to run the Neo4j queries.
- Run `python src/scripts/dosdp_template_generator.py`.
- For each input file, a source file is generated at `src/markers/$input_file_name$Source.tsv`.
- The script also queries the CL_KG to retrieve the auto-generated IDs of the Clusters for each cxg_dataset used. The CL_KG Cluster individuals template is generated at `src/templates/cl_kg/Clusters.tsv`.
- CL_KG Cluster individuals have auto-generated IDs, so this template should be regenerated with each CL_KG update.
- Editors manually check the generated source files (`src/markers/$input_file_name$Source.tsv`) and annotate them to indicate whether the terms will be added to the Cell Ontology.
- The `go_term_template_generator.py` script retrieves Gene Ontology (GO) terms from a Neo4j database, fetches additional data from the QuickGO API, and converts the data into ROBOT templates for use in the CLM ontology pipeline.
- The resulting ROBOT templates are saved in the `src/templates/cl_kg/` directory.
- This step is optional and is only run when new GO terms are needed.
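The QuickGO lookup can be sketched as two small helpers: one builds the request URL for a batch of GO IDs, the other flattens the JSON response into rows suitable for a template. This is an illustration only and not the code in `go_term_template_generator.py`. The helper names are hypothetical, and the endpoint path and response fields (`results`, `name`, `definition.text`) reflect my understanding of the QuickGO REST API and should be checked against its documentation.

```python
from urllib.parse import quote

# Assumed QuickGO endpoint for fetching term details by ID.
QUICKGO_TERMS_URL = "https://www.ebi.ac.uk/QuickGO/services/ontology/go/terms/"


def quickgo_url(go_ids):
    """Build the request URL for a comma-separated batch of GO IDs."""
    return QUICKGO_TERMS_URL + quote(",".join(go_ids), safe=",:")


def parse_terms(payload):
    """Flatten a QuickGO-style response dict into id/label/definition rows."""
    rows = []
    for term in payload.get("results", []):
        definition = (term.get("definition") or {}).get("text", "")
        rows.append({
            "id": term["id"],
            "label": term.get("name", ""),
            "definition": definition,
        })
    return rows
```

The response from an HTTP client (for example `requests.get(quickgo_url(ids), headers={"Accept": "application/json"}).json()`) would be passed to `parse_terms`; keeping the parsing separate from the network call makes it easy to test offline.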
- The `cellxgene_marker_template_generator.py` script downloads and parses marker–gene JSON data from the CxG (cellxgene) service, then:
  - Looks up UBERON and NCBITaxon URIs via SPARQL.
  - Resolves gene labels to NCBI Gene URIs via the MyGene.info API (with a simple cache).
  - Filters markers by score threshold and caps at 7 entries per CL term.
  - Writes out a remapped JSON (`new_marker.json`).
  - Generates two ROBOT template TSVs in `src/templates/cl_kg/`:
    - `cellxgene_marker_template.tsv`
    - `cellxgene_marker_annotations_template.tsv`
- This step is optional, and should be run whenever you need to regenerate templates for new or updated CellxGene marker data.
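The "filter by score, then cap at 7 per CL term" step can be sketched as below. This is a minimal illustration, not the script's code: the record keys (`cl_term`, `score`), the function name, and the choice to keep the highest-scoring markers when capping are assumptions, since the text above does not say which entries survive the cap.

```python
from collections import defaultdict


def select_markers(markers, min_score, max_per_term=7):
    """Keep markers scoring at least `min_score`, at most `max_per_term` per CL term.

    Assumption: when a term has more candidates than the cap,
    the highest-scoring ones are kept.
    """
    by_term = defaultdict(list)
    for marker in markers:
        if marker["score"] >= min_score:
            by_term[marker["cl_term"]].append(marker)
    return {
        term: sorted(items, key=lambda m: m["score"], reverse=True)[:max_per_term]
        for term, items in by_term.items()
    }
```

Terms whose markers all fall below the threshold are absent from the result, so downstream template rows are only written for terms with at least one qualifying marker.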
- The `cellmarker_marker_template_generator.py` script downloads and parses marker–gene data from the CxG (cellxgene) service, then:
  - Looks up UBERON and NCBITaxon URIs via SPARQL if UBERON ontology IDs do not exist in the gene table.
  - Retrieves NCBI Gene URIs from the gene table.
  - Filters markers to include only normal (non-cancer) cells.
  - Writes out a remapped CSV (`cell_marker_human.csv`).
  - Generates two ROBOT template TSVs in `src/templates/cl_kg/`:
    - `cellmarker_marker_template.tsv`
    - `cellmarker_marker_annotations_template.tsv`
- This step is optional, and should be run whenever you need to regenerate templates for new or updated CellxGene marker data.
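The normal-cell filter can be sketched with the standard `csv` module. The column name `cell_type` and the value `Normal cell` are assumptions made for this example, and the function name is hypothetical; the actual table may label normal and cancer cells differently.

```python
import csv


def read_normal_markers(handle, column="cell_type", normal_value="Normal cell"):
    """Return only the rows whose `column` marks a normal (non-cancer) cell.

    `column` and `normal_value` are assumed names for this sketch.
    The comparison ignores case and surrounding whitespace.
    """
    wanted = normal_value.strip().lower()
    reader = csv.DictReader(handle)
    return [
        row for row in reader
        if (row.get(column) or "").strip().lower() == wanted
    ]
```

Rows with a missing or empty value in the column are dropped rather than treated as normal, which is the conservative choice when the goal is to exclude cancer-derived markers.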
- The ODK pipeline is executed to generate the final ontology files: `cd src/ontology && sh run.sh make prepare_release`
- The pipeline automatically merges the source files into an intermediate `NSForestMarkersSource.tsv` file.
- Using the `NSForestMarkersSource.tsv` file, it generates dosdp template files at `src/patterns/data/default`.
- It merges the Gene DB templates and generates two gene ontology subsets to be merged into the `cml-kg.owl` and `clm-cl.owl` files.
- It runs dosdp and ROBOT templates to generate the ontologies.
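The merge of per-input source files into one intermediate TSV can be sketched as follows. This is an illustration under assumptions, not the pipeline's Makefile rule: the function name is hypothetical, and it assumes every source file shares the same header, raising an error otherwise.

```python
import csv
from pathlib import Path


def merge_source_files(source_files, merged_path: Path) -> int:
    """Concatenate TSV source files with identical headers; return rows written."""
    header = None
    writer = None
    count = 0
    with merged_path.open("w", newline="", encoding="utf-8") as out:
        for path in sorted(source_files):  # sorted for a reproducible row order
            with path.open(newline="", encoding="utf-8") as handle:
                reader = csv.DictReader(handle, delimiter="\t")
                if reader.fieldnames is None:  # skip completely empty files
                    continue
                if writer is None:
                    header = reader.fieldnames
                    writer = csv.DictWriter(out, fieldnames=header,
                                            delimiter="\t", lineterminator="\n")
                    writer.writeheader()
                elif reader.fieldnames != header:
                    raise ValueError(f"{path.name} has different columns than the first file")
                for row in reader:
                    writer.writerow(row)
                    count += 1
    return count
```

If the inputs are collected with a glob such as `Path("src/markers").glob("*Source.tsv")`, the merged output file should be excluded from the list when it lives in the same directory, since its name matches the same pattern.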
- A pipeline in the Cell Ontology repository uses the `clm-cl.owl` ontology to generate the final ontology files.
- Another pipeline in the CL_KG repository uses the `cml-kg.owl` ontology to generate the knowledge graph.
The project automates the process of integrating biological data, generating ontology templates, and producing ontology files for use in research or applications.